docs/design/2020-06-01-global-kill.md
This document introduces the design of the global connection ID, and of the global `KILL <connID>` based on it.
Currently, connection IDs are local to each TiDB instance, which means that a `KILL x` must be directed to the correct instance and cannot safely be load balanced across the cluster, as discussed here.
To support "Global Kill", we need:
1. Global connection IDs that are unique across all TiDB instances.
2. `KILL x` to be redirected to the target TiDB instance, on which the connection `x` is running.
3. Support for both 32 bits and 64 bits `connID`. 32 bits `connID` is used on small clusters (fewer than 2048 TiDB instances), to be fully compatible with all clients including legacy 32 bits ones, while 64 bits `connID` is used for big clusters. Bit 0 in `connID` is a markup to distinguish between these two kinds.

The structure of a 32 bits `connID` is:

     31    21 20               1      0
    +--------+------------------+------+
    |serverID|   local connID   |markup|
    | (11b)  |       (20b)      |  =0  |
    +--------+------------------+------+

The structure of a 64 bits `connID` is:

     63 62                 41 40                                   1      0
    +--+---------------------+--------------------------------------+------+
    |  |      serverID       |             local connID             |markup|
    |=0|       (22b)         |                 (40b)                |  =1  |
    +--+---------------------+--------------------------------------+------+
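The two layouts can be expressed as plain bit shifts. The following is a minimal sketch, not the actual TiDB code: the `ConnID` type and the `Encode` and `Decode` names are hypothetical, and the sketch assumes the caller passes field values that already fit their bit widths.

```go
package main

// Field widths taken from the two layouts above.
const (
	serverIDBits32 = 11
	localBits32    = 20
	serverIDBits64 = 22
	localBits64    = 40
)

// ConnID is a hypothetical decoded form of a global connection ID.
type ConnID struct {
	Is64        bool
	ServerID    uint64
	LocalConnID uint64
}

// Encode packs the fields into the numeric connID.
// Bit 0 is the markup: 0 for the 32 bits layout, 1 for the 64 bits layout.
// Assumption: ServerID and LocalConnID already fit their field widths.
func (c ConnID) Encode() uint64 {
	if c.Is64 {
		return c.ServerID<<(localBits64+1) | c.LocalConnID<<1 | 1
	}
	return c.ServerID<<(localBits32+1) | c.LocalConnID<<1
}

// Decode splits a numeric connID, using the markup bit to pick the layout.
func Decode(id uint64) ConnID {
	if id&1 == 1 {
		return ConnID{
			Is64:        true,
			ServerID:    (id >> (localBits64 + 1)) & (1<<serverIDBits64 - 1),
			LocalConnID: (id >> 1) & (1<<localBits64 - 1),
		}
	}
	return ConnID{
		ServerID:    (id >> (localBits32 + 1)) & (1<<serverIDBits32 - 1),
		LocalConnID: (id >> 1) & (1<<localBits32 - 1),
	}
}
```

Because the largest `serverID` in the 64 bits layout occupies bits 41 to 62, an encoded value never sets bit 63 and always fits in a non-negative int64.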
The key factor is `serverID` (see the `serverID` section for details), which depends on the number of TiDB instances in the cluster.

- 32 bits is chosen when the cluster has fewer than 2048 TiDB instances; otherwise 64 bits is chosen.
- A 32 bits `connID` is upgraded to 64 bits when: 1) acquiring a `serverID` fails (because it is occupied) more than 3 times in a row (which will happen when the size of the cluster is increasing rapidly); 2) all `local connID` in 32 bits `connID` are in use (see the `local connID` section for details).
- A TiDB instance keeps its `serverID` until its next restart or until it loses its connection to PD.

Bit 63 is always ZERO, which keeps `connID` in the range of non-negative int64. This is friendlier to existing code, and some languages have no primitive uint64 type.

The `markup` bit is interpreted as follows:

- `markup == 0` indicates that the `connID` is effectively just 32 bits long, and the high 32 bits should be all zeros. Compatible with legacy 32 bits clients.
- `markup == 1` indicates that the `connID` is 64 bits long. Incompatible with legacy 32 bits clients.
- `markup == 1` while the high 32 bits are zeros indicates that a 32 bits truncation has happened. See the Compatibility section.

`serverID` is selected RANDOMLY from the `serverIDs` pool (see next) by each TiDB instance on startup, and the uniqueness is guaranteed by PD (etcd). `serverID` should be larger than or equal to 1, to ensure that the high 32 bits of `connID` are always non-zero, and make it possible to detect truncation.
serverIDs pool is:
- All unused `serverIDs` within [1, 2047], acquired from `CLUSTER_INFO`, when 32 bits `connID` is chosen.
- All `serverIDs` within [2048, 2^22-1] when 64 bits `connID` is chosen.

On failure (e.g. failing to connect to PD, or all `serverID` being unavailable when 64 bits `connID` is chosen), we block any new connection.
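The selection and retry rules can be sketched as follows. This is an illustration under stated assumptions, not the TiDB implementation: `registry` is a hypothetical in-memory stand-in for the uniqueness check that PD (etcd) performs, `picker`, `randomPick` and `acquireServerID` are made-up names, and the `max64Attempts` bound on 64 bits retries is an assumption because the text does not give a retry limit for that case.

```go
package main

import (
	"errors"
	"math/rand"
)

// registry is a hypothetical stand-in for PD (etcd), which guarantees
// serverID uniqueness in the real design.
type registry map[uint64]bool

// tryClaim reports whether id was free and is now claimed.
func (r registry) tryClaim(id uint64) bool {
	if r[id] {
		return false
	}
	r[id] = true
	return true
}

const (
	min64      = 2048      // lower bound of the 64 bits pool
	max64      = 1<<22 - 1 // upper bound of the 64 bits pool
	maxRetries = 3         // upgrade after more than 3 failures
)

// picker returns a value in [0, n); injectable so tests can be deterministic.
type picker func(n int64) int64

var randomPick picker = rand.Int63n

// acquireServerID first tries the 32 bits pool, choosing randomly among the
// IDs believed unused (the list may be stale), then falls back to 64 bits.
// An empty unused list models a cluster that chose 64 bits from the start.
func acquireServerID(r registry, pick picker, unused []uint64, max64Attempts int) (id uint64, is64 bool, err error) {
	for failures := 0; len(unused) > 0 && failures <= maxRetries; failures++ {
		candidate := unused[pick(int64(len(unused)))]
		if r.tryClaim(candidate) {
			return candidate, false, nil
		}
	}
	for i := 0; i < max64Attempts; i++ {
		candidate := min64 + uint64(pick(max64-min64+1))
		if r.tryClaim(candidate) {
			return candidate, true, nil
		}
	}
	return 0, false, errors.New("no serverID available: block new connections")
}
```

Passing the random source as a parameter keeps the sketch testable; a real implementation would also need to handle PD connection failures, which are outside this example.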
`serverID` is kept by PD with a lease (defaults to 12 hours, long enough to avoid brutally killing any long-running SQL). If TiDB is disconnected from PD for longer than half of the lease (defaults to 6 hours), all connections are killed, and no new connection is accepted, to avoid running with a stale or incorrect `serverID`. Once the connection to PD is restored, a new `serverID` is acquired before accepting new connections.
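The half-lease rule reduces to a single comparison. A minimal sketch, in which the `leaseAction` function and the `action` values are hypothetical names introduced only for illustration:

```go
package main

import "time"

// defaultLease is the default serverID lease described above.
const defaultLease = 12 * time.Hour

// action is a hypothetical result type for the periodic PD health check.
type action int

const (
	keepServing      action = iota // lease still considered safe
	killAllAndRefuse               // kill all connections, accept no new ones
)

// leaseAction decides what an instance does after being disconnected from
// PD for the given duration: it gives up once more than half the lease
// has elapsed.
func leaseAction(lease, disconnectedFor time.Duration) action {
	if disconnectedFor > lease/2 {
		return killAllAndRefuse
	}
	return keepServing
}
```

With the defaults, an instance that has been cut off for exactly 6 hours keeps serving, and any longer disconnection triggers the kill.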
On a single TiDB instance without PD, a `serverID` of 1 is assigned.
`local connID` is allocated by each TiDB instance when a connection is established:

- For 32 bits `connID`, `local connID` may overflow and/or be used up, especially when the system is busy and/or runs long-running SQL. So we use a lock-free queue to maintain the available `local connID`: dequeue when a client connects, and enqueue when it disconnects. When `local connID` is exhausted, upgrade to 64 bits.
- For 64 bits `connID`, `local connID` is allocated by auto-increment. It flips to zero on integer overflow, and `Server.clients` is checked to see whether a `local connID` is already in use. This ensures correctness at trivial cost, as a conflict is very unlikely to happen (it takes more than 3 years to use up 2^40 `local connID` on an instance running at 10,000 TPS). Finally, a "Too many connections" error is returned if the pool is exhausted.
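Both allocation strategies can be sketched in a few lines. The sketch below makes several assumptions that are not in the design: a buffered channel stands in for the lock-free queue, a plain map stands in for the `Server.clients` lookup, the type and method names are hypothetical, local IDs start at 0, and the 64 bits allocator is not safe for concurrent use.

```go
package main

import "errors"

// localPool32 hands out local connIDs for 32 bits connID.
// A buffered channel stands in for the lock-free queue named in the design.
type localPool32 struct{ free chan uint64 }

// newLocalPool32 fills the queue with every ID in [0, size).
func newLocalPool32(size uint64) *localPool32 {
	p := &localPool32{free: make(chan uint64, size)}
	for i := uint64(0); i < size; i++ {
		p.free <- i
	}
	return p
}

// get dequeues an ID; ok == false means the pool is exhausted and the
// caller should upgrade to 64 bits.
func (p *localPool32) get() (id uint64, ok bool) {
	select {
	case id = <-p.free:
		return id, true
	default:
		return 0, false
	}
}

// put enqueues an ID released by a disconnecting client.
func (p *localPool32) put(id uint64) { p.free <- id }

// localAlloc64 allocates by auto-increment within [0, size).
type localAlloc64 struct {
	next, size uint64
	inUse      map[uint64]bool // stand-in for the Server.clients lookup
}

// get returns the next free ID, skipping IDs that are still in use.
// Scanning the whole pool is only practical here because the example pool
// is tiny; it is not how a 2^40 pool would be handled.
func (a *localAlloc64) get() (uint64, error) {
	for tried := uint64(0); tried < a.size; tried++ {
		id := a.next
		a.next = (a.next + 1) % a.size // flip to zero on overflow
		if !a.inUse[id] {
			a.inUse[id] = true
			return id, nil
		}
	}
	return 0, errors.New("Too many connections")
}
```

In the design the pool sizes are 2^20 and 2^40; the sketch takes the size as a parameter so that exhaustion can be exercised with small numbers.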
When processing a `KILL x` command, first extract `serverID` from `x`. Then, if `serverID` refers to a remote TiDB instance, get its address from `CLUSTER_INFO`, and redirect the command to it through the "Coprocessor API" provided by the remote TiDB, along with the original user authentication.
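The routing decision can be sketched as below. The `route` type, the `serverIDOf` and `routeKill` functions, and the map used in place of `CLUSTER_INFO` are all hypothetical; the actual forwarding through the Coprocessor API and the handling of user authentication are not shown.

```go
package main

import "fmt"

// route describes where a KILL must be executed.
type route struct {
	Local bool
	Addr  string // address of the remote instance; empty when Local
}

// serverIDOf extracts serverID, using the markup bit to pick the layout:
// bits 41..62 for 64 bits connID, bits 21..31 for 32 bits connID.
func serverIDOf(connID uint64) uint64 {
	if connID&1 == 1 {
		return (connID >> 41) & (1<<22 - 1)
	}
	return (connID >> 21) & (1<<11 - 1)
}

// routeKill decides whether KILL connID runs locally or must be redirected.
// clusterInfo is a stand-in for a CLUSTER_INFO lookup (serverID -> address).
func routeKill(connID, selfServerID uint64, clusterInfo map[uint64]string) (route, error) {
	sid := serverIDOf(connID)
	if sid == selfServerID {
		return route{Local: true}, nil
	}
	addr, ok := clusterInfo[sid]
	if !ok {
		return route{}, fmt.Errorf("no TiDB instance with serverID %d", sid)
	}
	return route{Addr: addr}, nil
}
```

Returning an error for an unknown `serverID` is a choice made for this sketch; the design text does not say how that case is reported.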
| | 32 bits | 64 bits |
|---|---|---|
| ServerID pool size | 2^11 - 1 | 2^22 - 2^11 |
| ServerID allocation | Random among unused `serverIDs` within the pool, acquired from PD. Retry if unavailable. Upgrade to 64 bits after failing more than 3 times | Random among all `serverIDs` within the pool. Retry if unavailable |
| Local connID pool size | 2^20 | 2^40 |
| Local connID allocation | A queue maintains and allocates the available `local connID`. Upgrade to 64 bits when exhausted | Auto-increment within the pool. Flip to zero on overflow. Return "Too many connections" if exhausted |
32 bits `connID` is compatible with well-known clients.
64 bits `connID` is incompatible with legacy 32 bits clients. (According to some quick tests so far, MySQL client v8.0.19 supports `KILL` of a connection with a 64 bits `connID`, while `CTRL-C` does not, as it truncates the `connID` to 32 bits.) A warning is set to indicate that truncation happened, but the user cannot see it, because `CTRL-C` is sent by a new, short-lived connection.
The `KILL TIDB` syntax and the `compatible-kill-query` configuration item are deprecated.
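Truncation can be detected from the markup bit and the high 32 bits alone. A minimal sketch with hypothetical names (`kind`, `classify`); the `kindInvalid` case is an addition made here for completeness and is not part of the design text.

```go
package main

// kind is a hypothetical classification of a received connID.
type kind int

const (
	kind32        kind = iota // markup == 0 and high 32 bits are zero
	kind64                    // markup == 1 and high 32 bits are non-zero
	kindTruncated             // markup == 1 but high 32 bits are zero
	kindInvalid               // markup == 0 but high 32 bits are non-zero
)

// classify inspects bit 0 (the markup) and the high 32 bits of id.
func classify(id uint64) kind {
	high := id >> 32
	switch {
	case id&1 == 0 && high == 0:
		return kind32
	case id&1 == 0:
		return kindInvalid
	case high == 0:
		return kindTruncated
	default:
		return kind64
	}
}
```

This works because `serverID` sits in bits 41 to 62 of a 64 bits `connID` and is never zero, so a genuine 64 bits `connID` always has a non-zero high half.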
Set `small_cluster_size_threshold` and `local_connid_pool_size` to small numbers (e.g. 4) by variable hacking, to switch easily between 32 and 64 bits `connID`.

Test scenarios include:

- 32 bits `connID` with a small cluster.
- `local connID` used up.
- 64 bits `connID` with a big cluster.