docs/design/2025-04-15-global-memory-arbitrator.md
This document proposes a practical TiDB global memory control mechanism.
According to the introduction in #58194, there are 3 problems with the original memory control mechanism:
Severity: OOM ≈ Heavy Golang GC > Kill SQL/session
Because TiDB cannot customize physical memory allocation, the core solution to these problems is to prevent heap usage (either legally occupied or not GC-able) from exceeding golang runtime soft memory limit.
Memory is a typical mutually exclusive resource and follows the finite pool model. Therefore, preemptive scheduling can be performed only if its holder, the Memory Resource Pool (aka Memory Pool / mem-pool), is able to release memory.
A centralized global Memory Resource Arbitrator (aka mem-arbitrator or arbitrator) is proposed, which replaces the original optimistic posteriori method with a pessimistic scheduling model (subscription-first-then-allocation). It quantifies the global memory resources of the TiDB instance into dynamic virtual quotas and identifies unsafe / prohibited quota to ensure global memory safety during the arbitration process.
The original config tidb_server_memory_limit (aka server-limit) is still the most important hard constraint. The arbitrator introduces the config soft-limit, which indicates the upper limit of the global memory quota.
When facing an OOM Risk, the arbitrator will use the KILL method to protect memory safety. For other abnormal situations, it only chooses the CANCEL method. A new executor error will be defined to represent such behaviors from mem-arbitrator.
OOM Risk: actual memory usage reaches the security warning line & global stuck risk occurs.
The arbitrator contains a self-adaptive tuning mechanism to quantify & store the runtime memory state and dynamically adjust the soft-limit to resolve the Loop OOM problem. The arbitrator also provides manual tuning interfaces to handle extreme memory leaks: manually set soft-limit; pre-subscribe the mem quota of a specified size.
Loop OOM: TiDB restarts in a loop due to OOM caused by continuous extreme memory leak loads.
There are 2 work modes for the arbitrator when memory is insufficient:
STANDARD mode: CANCEL pending subscription mem pools
PRIORITY mode: CANCEL mem pools with lower priority
mem-priority: reuse the definition PRIORITY of Resource Group
LOW
MEDIUM
HIGH
New system variables:
tidb_mem_arbitrator_mode: global level
DISABLE (default): disable mem-arbitrator
STANDARD
PRIORITY
tidb_mem_arbitrator_soft_limit: global level
0 (default): 95% server-limit
(1, server-limit]
ratio * server-limit): (0, 1]
AUTO
tidb_mem_arbitrator_query_reserved: session level
0 (default)
(1, server-limit]
tidb_mem_arbitrator_wait_averse: session level
0 (default)
1: bind SQL to HIGH mem-priority; CANCEL SQL when memory is insufficient;
nolimit: make SQL execution out of the control of the mem-arbitrator
Resource Group, otherwise MEDIUM
tidb_mem_arbitrator_mode to a value other than DISABLE, which takes effect immediately for all TiDB nodes.
memoryLimitTuner: the expected maximum memory usage of a TiDB instance is similar to the golang runtime soft memory limit and server-limit.
STANDARD mode:
PRIORITY mode:
HIGH > MEDIUM > LOW) then FIFO (execution time constraint max_execution_time)
wait_averse property
LOW to HIGH, quota usage from large to small)
tidb_mem_quota_query, the behavior is the same as the original TiDB, (controlled by sys var tidb_mem_oom_action)
KILL mechanism for OOM Risk:
mem risk: when the memory usage of TiDB instance reaches the safety warning line 95% * server-limit
OOM Risk) if at least one condition is met:
100MB/s
safety threshold (90% * server-limit) within 5s
OOM Risk: KILL pools in order (mem-priority from LOW to HIGH, quota usage from large to small) until the memory usage falls below safety threshold
soft-limit can be set through sys var tidb_mem_arbitrator_soft_limit
AUTO: model the runtime state and dynamically adjust the upper limit of global memory resources:
JSON format):
magnifi: current memory stress (workload characteristic), which is the ratio of mem quota to actual heap usage
pool-medium-cap: median mem quota of root pools as pool init cap suggestion
last-risk: heap & quota state of last mem risk
mem risk occurs, calculate and locally persist the current state so that it can be restored after each restart.
30s check the runtime state, if the memory stress is reduced, gradually reduce the magnifi-ratio to increase the mem quota upper limit
tidb_mem_arbitrator_query_reserved
DISABLE mode:
KILL mechanism of mem-arbitrator won't take effect
General scenarios: users need TiDB system stability to avoid memory security issues
OLAP scenarios that need to ensure SQL execution (multi-concurrency and large OLAP)
Low latency / non-blocking services (OLTP)
Memory leak or Loop OOM scenarios
Deploy a single TiDB node
PRIORITY mode of mem-arbitrator. Bind important SQL (such as OLTP related) to the resource group with HIGH priority.
Deploy multiple TiDB nodes
STANDARD mode of mem-arbitrator, relying on upper-layer to perform SQL retries among multiple nodes. The overall resource utilization of the cluster is high, and the risk of single-point OOM is low.
PRIORITY mode of mem-arbitrator.
HIGH priority, bind OLAP-related SQL with MEDIUM / LOW priority.
wait_averse property and set other settings as needed.
OOM appears after enabling mem-arbitrator
AUTO and let mem-arbitrator adaptively adjust the upper limit of global memory resources and gradually converge the OOM problem (multiple OOMs may occur in extreme cases).
Ensure the successful execution of important SQL
HIGH and ensure that the SQL execution won't be Kill/Cancel
PRIORITY mode of mem-arbitrator, SQL can be bound with wait_averse property first. If the upper-layer retries multiple times and fails (the overall cluster resources are insufficient), disable wait_averse and retry by mem-priority
mem-pool is a thread-safe structure which manages a tree-like set of quota, which consists of several cores:
budget:
reserved-budget
actions interface: notification, out-of-cap, out-of-limit
The allocated quota of mem-pool should not exceed the used quota of its budget + reserved-budget. When the used quota is greater than the capacity, the pool could increase the budget in two ways:
out-of-cap action
mem-arbitrator
Each TiDB instance will have a single, unique global mem-arbitrator. The arbitrator uses the common TiDB property server-limit as the hard limit for physical / logical memory, and sets it as the soft limit of the golang runtime through debug.SetMemoryLimit. The arbitrator also uses 95% * server-limit as the default value of soft-limit.
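The relationship between server-limit, the Go runtime limit, and the default soft-limit can be sketched as follows. This is a minimal illustration and not the proposed implementation: `resolveSoftLimit` and `applyServerLimit` are hypothetical helper names, and clamping a configured value to server-limit is an assumption. Only `debug.SetMemoryLimit` is a real Go runtime call.

```go
package main

import "runtime/debug"

// resolveSoftLimit is a hypothetical helper. It maps a configured soft-limit
// in bytes to an effective value: 0 (or less) selects the default of
// 95% * server-limit, and larger values are clamped to serverLimit (assumption).
func resolveSoftLimit(serverLimit, configured int64) int64 {
	if configured <= 0 {
		// Integer arithmetic avoids floating-point rounding surprises.
		return serverLimit / 100 * 95
	}
	if configured > serverLimit {
		return serverLimit
	}
	return configured
}

// applyServerLimit installs serverLimit as the Go runtime soft memory limit
// and returns the limit that was in effect before the call.
func applyServerLimit(serverLimit int64) int64 {
	return debug.SetMemoryLimit(serverLimit)
}
```

With a server-limit of 1000 bytes, `resolveSoftLimit(1000, 0)` yields 950, leaving the remaining 50 bytes outside the quota that the arbitrator hands out.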
The mem quota is separated into 4 parts: allocated, available, buffer, out-of-control.
allocated: the quota occupied by mem pools (root mem-pool or await-free-pool)
available: the quota available for alloc
PRIORITY mode
soft-limit takes effect by modifying out-of-control
The arbitrator is a logical abstraction, implemented as an asynchronous worker running in a separate goroutine.
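The four-part split of the mem quota (allocated, available, buffer, out-of-control) can be illustrated with a small bookkeeping sketch. The type and method names are hypothetical, and the arithmetic `available = limit - allocated - buffer - out-of-control` is an assumption made for illustration; the document only states that soft-limit takes effect by modifying out-of-control.

```go
package main

// quotaState is a hypothetical snapshot of the global quota, in bytes.
type quotaState struct {
	Limit        int64 // hard limit (server-limit)
	Allocated    int64 // quota occupied by mem pools
	Buffer       int64 // quota held back as buffer
	OutOfControl int64 // quota treated as unusable
}

// available returns the quota left for new allocations.
// Assumption: the four parts add up to Limit, floored at zero.
func (q quotaState) available() int64 {
	left := q.Limit - q.Allocated - q.Buffer - q.OutOfControl
	if left < 0 {
		return 0
	}
	return left
}

// withSoftLimit returns a copy in which out-of-control covers at least
// Limit - softLimit, which is how a soft-limit shrinks the usable quota here.
func (q quotaState) withSoftLimit(softLimit int64) quotaState {
	if floor := q.Limit - softLimit; q.OutOfControl < floor {
		q.OutOfControl = floor
	}
	return q
}
```

For example, with a limit of 1000, 300 allocated, a buffer of 100, and a soft-limit of 950, this sketch reports 550 as available.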
The root pool is required to implement 3 interfaces before attaching to the arbitrator:
Stop(reason)
reason: CANCEL, KILL, etc.
Finish
HeapInuse
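The three interfaces above could take roughly the following shape in Go. This is a sketch under assumptions: the document names only `Stop(reason)`, `Finish`, and `HeapInuse`, so the exact signatures, the `StopReason` type, and the `sessionPool` type with its buffered channel standing in for the sql-killer channel are all hypothetical.

```go
package main

import "sync/atomic"

// StopReason tells a root pool why the arbitrator is stopping it.
type StopReason int

const (
	StopCancel StopReason = iota
	StopKill
)

// RootPoolHooks is a hypothetical Go rendering of the three interfaces.
type RootPoolHooks interface {
	Stop(reason StopReason) // ask the owner to cancel or kill its work
	Finish()                // called when the pool leaves the arbitrator
	HeapInuse() int64       // heap bytes currently attributed to the pool
}

// sessionPool is an illustrative implementation backed by a signal channel.
type sessionPool struct {
	killer chan StopReason
	heap   atomic.Int64
	done   atomic.Bool
}

func newSessionPool() *sessionPool {
	return &sessionPool{killer: make(chan StopReason, 1)}
}

// Stop never blocks: if a signal is already pending, the new one is dropped.
func (p *sessionPool) Stop(reason StopReason) {
	select {
	case p.killer <- reason:
	default:
	}
}

func (p *sessionPool) Finish() { p.done.Store(true) }

func (p *sessionPool) HeapInuse() int64 { return p.heap.Load() }

// Compile-time check that sessionPool satisfies the interface.
var _ RootPoolHooks = (*sessionPool)(nil)
```

Making `Stop` non-blocking keeps the arbitrator's worker goroutine from stalling on a pool whose owner is slow to react.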
The arbitrator contains a unique mem pool called await-free-pool, which is able to allocate quota from available mem quota directly rather than through the Arbitration process.
sql-compile / optimize / recursion, which cannot easily track memory usage
small memory usage scenarios: small DML, information/schema query, temporary objects, etc.
Each session-level memory tracker will try to consume from the await-free-pool when its mem usage is small (less than 1/1000 * server-limit). If the tracker fails to consume or its mem usage has exceeded the threshold, it will be bound to a root mem-pool and managed by the global-arbitrator.
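The routing rule for session-level trackers can be sketched as a small decision function. The names `choosePool`, `smallUsageThreshold`, and the `awaitFreeAvail` parameter are hypothetical; only the 1/1000 * server-limit threshold and the fallback to a root mem-pool come from the text.

```go
package main

// poolKind identifies where a tracker's next request is served from.
type poolKind int

const (
	awaitFreePool poolKind = iota
	rootPool
)

// smallUsageThreshold returns 1/1000 of server-limit, in bytes.
func smallUsageThreshold(serverLimit int64) int64 {
	return serverLimit / 1000
}

// choosePool decides where a request of `request` bytes goes for a tracker
// that already uses `trackerUsage` bytes. awaitFreeAvail is the quota still
// obtainable from the await-free-pool (a hypothetical input).
func choosePool(serverLimit, trackerUsage, request, awaitFreeAvail int64) poolKind {
	small := trackerUsage+request < smallUsageThreshold(serverLimit)
	if small && request <= awaitFreeAvail {
		return awaitFreePool
	}
	// Either usage reached the threshold or the await-free-pool could not
	// serve the request: bind the tracker to a root mem-pool.
	return rootPool
}
```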
The root pool implements Cancel / Kill by sending the relevant signal to the tracker's sql-killer channel. It also implements HeapInuse through the heap usage info reported from the sub-trackers.
After the tracker is detached, the root pool will be reset and its memory usage profile will be reported to the arbitrator for Auto Tune. At the same time, the digests / stats info can be used to predict the quota usage of SQL, thereby reducing the number of allocations and the risk of insufficient resources for the next execution.
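One way to picture the digest-based prediction is a table that remembers the largest usage seen per SQL digest and offers it as the initial quota for the next run. This is a sketch only: `digestProfile`, `Report`, and `Suggest` are hypothetical names, and keeping the plain maximum without any time window is a simplifying assumption.

```go
package main

import "sync"

// digestProfile remembers the peak memory usage observed per SQL digest.
type digestProfile struct {
	mu      sync.Mutex
	maxPeak map[string]int64
}

func newDigestProfile() *digestProfile {
	return &digestProfile{maxPeak: make(map[string]int64)}
}

// Report records the peak usage of a finished execution, keeping the maximum.
func (d *digestProfile) Report(digest string, peak int64) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if peak > d.maxPeak[digest] {
		d.maxPeak[digest] = peak
	}
}

// Suggest returns the quota to pre-subscribe for a digest, or fallback when
// the digest has never been seen.
func (d *digestProfile) Suggest(digest string, fallback int64) int64 {
	d.mu.Lock()
	defer d.mu.Unlock()
	if peak, ok := d.maxPeak[digest]; ok {
		return peak
	}
	return fallback
}
```

Subscribing the suggested amount once up front replaces many small incremental requests, which is the reduction in allocation count that the paragraph describes.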
The golang runtime GC function is a heavy operation which will stop the whole world. The arbitrator has the ability to sense the whole memory usage profile and can invoke such function only when necessary. These are the designed opportunities to execute golang GC:
Runtime Memory Risk during arbitration process
Memory resource contention is a common issue when facing dynamic memory quota allocation. The deadlock problem occurs when all root pools are synchronously requesting memory quota but there is not enough memory resource. To resolve the deadlock problem, there are 2 simple ways:
buffer
privileged budget which is a logical flag that can be bound to only one root pool until that pool releases it. Any root pool which has occupied this budget can allocate memory quota without limit, which means that in extreme resource contention scenarios, all running pools will be degraded to the single concurrency mode.
There are 3 key points to ensure the safety of global memory resources:
out-of-control)
digests)
out-of-control mainly presents these memory contents:
limit - soft-limit
Dynamic calculation:
out-of-control = max(avoid, untracked ... )
buffer only works under PRIORITY mode. For blocking scheduling, a larger buffer can bring positive effects to memory safety but also may cause negative effects on memory resource utilization. The default way is to calculate the size of buffer space dynamically by timed interval.
buffer = max(pool-mem-usage) / Duration
Sufficient and accurate context can greatly help to enhance memory resource isolation. It requires historical memory profile data for the mem-pool before running.
digests = max(mem usage) / Pool / Duration
For the SQL, an appropriate mem quota value for pre-subscription can be found by the following steps:
To predict the possible maximum quota usage for unknown pools whose mem consumption is NOT small, the arbitrator will estimate an appropriate value suggest-pool-init-cap. This value will be persisted and can be determined by:
suggest-pool-init-cap = (median of pool mem usage) / Duration
If the actual memory usage pattern is really hard to predict, a reasonable model is required to quantify the relationship between logical memory and actual memory. From the global perspective, this issue can be reduced to computing the scaling ratio of the actual memory to the logical memory over the timeline intervals.
magnifi-ratio is a posteriori self-adaptive value which will be persisted by the arbitrator.
Runtime Memory Risk, the magnifi-ratio can be calculated as (heap / allocated-quota + delta* )(max-heap / blocked-at-quota) / Duration.
Planner / Compiler / Optimizer / Common
4MB) from await-free-pool in advance for temporary use
Session
Modify the original session dispatch mechanism
allocated mem quota + pending alloc mem quota
Metrics
Log
SQL execution statistics
Perf mem profile
Module mem-arbitrator / mem-pool: Unit tests
Other module adaptation:
Scenario Tests
Benchmark
Compared with the original mechanism, the current mechanism may not use up all the memory resources of the instance
Under PRIORITY mode of the mem-arbitrator, if all the pools are degraded into the single concurrency mode, there will be 2 risks:
Memory Control Mode Summary
The mainstream implementation uses a memory pool with subscription first and allocation afterwards. The difference between the optimistic and pessimistic approaches lies mainly in how a failed subscription is handled. Small SQL suits the optimistic control mode because the cost of a retry is low, while large SQL suits the pessimistic mode.
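The two control modes can be contrasted with a minimal quota pool. This is an illustrative sketch, not the mem-pool design proposed in this document: `quotaPool` and its methods are hypothetical, with `TrySubscribe` standing for the optimistic path (fail fast, caller retries or cancels) and `Subscribe` for the pessimistic path (block until quota is released).

```go
package main

import (
	"errors"
	"sync"
)

var errQuotaExhausted = errors.New("mem quota exhausted")

// quotaPool hands out quota from a fixed capacity before memory is allocated.
type quotaPool struct {
	mu       sync.Mutex
	cond     *sync.Cond
	capacity int64
	used     int64
}

func newQuotaPool(capacity int64) *quotaPool {
	p := &quotaPool{capacity: capacity}
	p.cond = sync.NewCond(&p.mu)
	return p
}

// TrySubscribe is the optimistic path: it fails immediately when the quota
// does not fit, leaving the retry decision to the caller.
func (p *quotaPool) TrySubscribe(n int64) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.used+n > p.capacity {
		return errQuotaExhausted
	}
	p.used += n
	return nil
}

// Subscribe is the pessimistic path: it waits until enough quota is released.
// Requests larger than the whole capacity can never succeed and fail at once.
func (p *quotaPool) Subscribe(n int64) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if n > p.capacity {
		return errQuotaExhausted
	}
	for p.used+n > p.capacity {
		p.cond.Wait()
	}
	p.used += n
	return nil
}

// Release returns quota to the pool and wakes all blocked subscribers.
func (p *quotaPool) Release(n int64) {
	p.mu.Lock()
	p.used -= n
	p.mu.Unlock()
	p.cond.Broadcast()
}
```

A small statement would call `TrySubscribe` and simply retry on failure, whereas a large query would call `Subscribe` and wait, since repeating its work is far more expensive than blocking.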