docs/RFCS/20220104_cluster_locks.md
cluster_locksNOTE: The design described ahead in this RFC constitutes the initial proposal, and some details (such as the request API) may have changed slightly from the initial design in the implementation process.
This design doc RFC proposes the implementation of a virtual cluster_locks table
to enable observability of a point-in-time view of lock holders within a given
cluster. Such a view would be able to show which transactions are holding
locks on which spans across the ranges of a cluster, as well as which
transactions may be waiting on a given lock. This information, in conjunction
with other virtual tables, would allow a user to identify the individual
transactions and queries causing other transactions to wait on a lock at a
given point in time. The cluster_locks virtual table would provide a
client-level view into the Lock Table of each KV range’s Concurrency Manager. This
also means that the view will only incorporate locks managed by the Lock
Table, and not request-level latches or replicated locks that are not
represented in the in-memory Lock Table.
| Term | Explanation |
|---|---|
| Sequencing | Processing a request through Concurrency Control. |
| Transaction | A potentially long-lived operation containing multiple requests that commits atomically. Transactions can hold locks, and release them on commit or abort. |
| Request | In Concurrency Control, a set of KV operations (i.e. kv.Batch) potentially belonging to a transaction that, upon sequencing, can take latches, locks, or wait in lock wait queues. |
| Latch | A request-level, in-memory mechanism for mutual exclusion over key spans between requests. They are only held over the span of a single request, and are dropped once the request completes (or enters a lock wait queue). |
| Lock | A transaction-level mechanism with varying durability for mutual exclusion on a single key between transactions. Obtained via sequencing a request, and held for the lifetime of a transaction. They may or may not be tracked in the in-memory lock table. |
| Lock Table (“LT”) | An in-memory, btree-based structure tracking a subset of locks, and the associated wait queues for each lock, across a given range. Locks in the LT may or may not be held, but “empty” locks (without holders or wait queues) are GC’d periodically. Replicated locks that do not have wait queues are typically not tracked in the LT. |
| Replicated Lock | A durable lock represented in persistent storage by what is commonly known as an “intent” (TxnMeta). This type of lock may or may not be tracked in the in-memory lock table - as mentioned above, a replicated lock is tracked in the lock table if there are wait queues made up of other transactions waiting to acquire this lock. |
| Unreplicated Lock | A lock that is only represented in the lock table, and will not persist in case of node failure or lease transfer. Used in [Get/Scan]ForUpdate. |
| Replicated Lock Keyspace | The separate keyspace in which replicated locks, commonly known as “separated intents”, are stored; in the Local keyspace of a range. This is entirely separate from the Lock Table. Example intent key in the replicated lock keyspace: /Local/Lock/Intent/Table/56/1/1169/5/3054/0. |
| Lock Strength | CRDB only supports Exclusive Locks for read & read/write transactions (despite enums including Shared and Upgrade types). |
| Lock Wait Queue | A queue of requests (which may or may not be transactional) waiting on a single key lock. This queue forms when a request is sequenced that attempts to obtain a lock that is already held. |
As a developer, database administrator, or other CockroachDB user, it is extremely useful to be able to obtain a point-in-time view of locks currently held, as well as identify which connections and/or transactions are holding these locks. Ideally, a user would be able to visualize which statements or transactions are blocking others, and for how long the waiting transactions have been waiting.
While we provide some mechanisms to visualize what operations are in progress
in a cluster, particularly the crdb_internal.cluster_[queries, transactions, sessions]
virtual tables, these do not allow for a user to investigate and pinpoint what
is currently causing contention. The same goes for the Contention Events
features, which are focused on showing historical contention (for requests that
have already completed) over time, aggregated across queries and sessions,
rather than on current lock contention. A feature to view point-in-time lock
state and contention is therefore currently missing, and has been
requested in the past.
There are a number of use cases for visualizing contention in a running
database or cluster, as elaborated below in Use Cases. These
use cases exist for both engineers and TSEs as well as users and administrators
of CockroachDB, and while cluster_locks is intended for use by any type of
user looking to investigate contention among lockholders in a cluster, it is
one of several tools that can be used to investigate contention, and how it
fits in with other tools is described further below in Alternatives. The
particular use case that this feature targets especially is viewing
point-in-time, live contention at the SQL user level, and as such can provide a
“first line” tool to investigate issues on a running cluster.
A comparable feature that exists in PostgreSQL is known as pg_locks,
and one that exists in Sybase is known as sp_locks.
postgres=# select locktype, relation::regclass, transactionid, mode, pid, granted from pg_locks order by pid;
locktype | relation | transactionid | mode | pid | granted
---------------+-------------------+---------------+------------------+-------+---------
relation | student | | AccessShareLock | 30622 | t
relation | student | | RowExclusiveLock | 30622 | t
virtualxid | | | ExclusiveLock | 30622 | t
relation | student | | SIReadLock | 30622 | t
transactionid | | 502 | ExclusiveLock | 30622 | t
transactionid | | 502 | ShareLock | 30626 | f
relation | student | | SIReadLock | 30626 | t
relation | student | | AccessShareLock | 30626 | t
relation | student | | RowExclusiveLock | 30626 | t
virtualxid | | | ExclusiveLock | 30626 | t
transactionid | | 503 | ExclusiveLock | 30626 | t
...
Currently, CockroachDB’s concurrency control mechanisms, particularly the Lock Table, only track locks which are currently held by transactional requests that also have other operations (transactional and non-transactional) waiting to acquire those locks. This means that if we are relying on the Lock Table as the source of data, as the Technical Design below specifies, we will have a few limitations that should be noted, despite the fact that these limitations may be acceptable given the use case of this feature.
GetForUpdate/ScanForUpdate) can obtain them. This means that transactional
read-only requests and non-transactional requests will only show up as waiters.crdb_internal.cluster_queries, since we
do not keep track of the statement history in an active transaction, we will not
know which statement in the transaction was the one that originally obtained
the lock. For example, imagine an open transaction with 1k+ statements thus
far, with UPDATE item SET i_name = 'Blue Suede Shoes' WHERE i_id = 124; as
the first statement. While other transactions that attempt to obtain the lock
on this key will have waiting queries visible in cluster_queries, we will not
be able to show that this UPDATE statement was the one that caused its
transaction to obtain the lock.The crdb_internal.cluster_locks table will be implemented as a virtual schema table
at the SQL level, and will be populated by making KV requests across the ranges
in the cluster. Each KV request will be evaluated using the corresponding
range’s Concurrency Manager, which will populate the response. These combined
responses will be used as necessary to populate the cluster_locks table.
Schema for crdb_internal.cluster_locks:
CREATE TABLE crdb_internal.cluster_locks (
range_id INT, -- the ID of the range that contains the lock
table_id INT, -- the id of the table to which the range with this lock belongs
database_name STRING, -- the name of the individual database
schema_name STRING, -- the name of the schema
table_name STRING, -- the name of the table
index_name STRING, -- the name of the index
lock_key BYTES, -- the key this lock protects access to
lock_key_pretty STRING, -- the pretty-printed key this lock protects access to
txn_id UUID, -- the unique ID of the transaction holding the lock (NULL if not held)
ts TIMESTAMP, -- the timestamp at which the lock is to be held at
lock_strength STRING, -- the type of lock [SHARED, UPGRADE, EXCLUSIVE] (note that only EXCLUSIVE locks currently supported, NULL if not held)
durability STRING, -- the durability of the lock [REPLICATED, UNREPLICATED] (NULL if not held)
granted BOOL, -- represents if this transaction is holding the lock or waiting on the lock
contended BOOL, -- represents if this lock has active waiters
duration INTERVAL, -- represents how long the lock has been held (or waiter has been waiting)
isolation_level STRING, -- the isolation level [SERIALIZABLE, SNAPSHOT, READ COMMITTED]
);
This table will be populated by issuing a KV API request for all the contended
locks across the cluster. If possible, it would be advantageous for
performance to incorporate filters on range_ids or smaller key spans, in
order to limit the RPCs necessary.
Note that this table will not show latches or latch contention, nor will it show replicated locks that are uncontended, as they are not tracked by the lock table. We will also not display empty locks from the lock table (i.e. those that are not held and do not have any other requests waiting). The virtual table should support the ability to display all locks or to only display contended locks, that is, locks with readers/writers in the wait queue, so that we can avoid displaying every locked key held by a transaction. See the Note on Keys in Concurrency Control for more information.
It is also important to note that we may be unable to show the time at which a transaction waiting on a lock started waiting. See Note on Start Wait Time in Concurrency Control below for more information.
The KV API will be implemented with a new KV request known as QueryLocksRequest.
proto
message QueryLocksRequest {
RequestHeader header = 1 [(gogoproto.nullable) = false, (gogoproto.embed) = true];
SpanScope span_scope = 2; // [GLOBAL, LOCAL]
}
message LockWaiter {
TxnMeta waiter_txn_meta = 1;
bool active_waiter = 2;
SpanAccess access = 3; // [ReadOnly, ReadWrite]
// Potentially added fields: start_wait
// Potentially unnecessary fields: wait_kind, spans, seq_num
}
message LockStateInfo {
int64 range_id = 2 [(gogoproto.customname) = "RangeID",
(gogoproto.casttype) = "github.com/cockroachdb/cockroach/pkg/roachpb.RangeID"];
Key locked_key = 1 [(gogoproto.nullable) = false];
SpanScope span_scope = 2; // [GLOBAL, LOCAL]
Strength lock_strength = 3;
Durability durability = 4;
bool lock_held = 5;
TxnMeta lockholder_txn_meta = 6;
repeated LockWaiter lock_waiters = 7;
}
message QueryLocksResponse {
ResponseHeader header = 1 [(gogoproto.nullable) = false, (gogoproto.embed) = true];
repeated LockStateInfo locks = 2;
}
The QueryLocksRequest will be issued using the KV client for a key span
representing a single tenant (or perhaps database) in the cluster, such that
the DistSender will issue a KV RPC to the leaseholder of each range, and
return the response values to be used to populate the rows of the virtual table.
The DistSender will handle any instances of range unavailability or lease
transfer between replicas entirely within the KV client layer, as with other KV requests.
The design described above means that we will have an all-or-nothing approach
to responding to this query: either we can get information from all ranges, or
if there are any unavailable ranges, the request will fail. These semantics
are on par with those of a large scan, or of querying crdb_internal.ranges,
and are the default fault tolerance of DistSender and the KV client layer as
a whole. If necessary, it could be possible to iterate through [/Meta2, /Meta2Max)
to look up the range descriptors for all ranges, issue discrete KV
QueryLocksRequests for the key span of each range, and then handle each
response (or error) as necessary. This would have two downsides however:
primarily, the distinct semantics of this means that we would need to implement
such error handling (and potentially a custom mechanism to display which ranges
had errors) at the SQL level, which doesn’t exist today. Secondarily, and
perhaps less critically, this would also require an additional scan over all of
Meta2. Thus, the default fault tolerance provided by the KV client layer is
deemed acceptable for an initial implementation.
While we can restrict the requests for an individual query to a single (active) database in the cluster, it will be necessary for the KV API request to be made on every range in the keyspan. While this should not cause contention with other operations (as we will not hold any locks over the entirety of evaluation), a request over the entire keyspan of a single database or cluster may nevertheless be a fairly large operation, requiring pagination/byte limits as necessary.
At a functional level within the KV leaseholder replica of the range, the
QueryLocksRequest will be sequenced as a normal request by the Concurrency
Manager (without requesting any latches or locks), and proceed to command
evaluation. This is preferred over the Concurrency Manager intercepting the
request upon sequencing as we can thus ensure that we have checked that this
replica is the leaseholder for the range. Once at the evaluation stage, we can
access the Concurrency Manager to grab the mutex for the Lock Table btree for
each key scope (Global keys, Local keys), clone a snapshot of the
(copy-on-write) btree, and then release the mutex
(similarly to how this is done for collection of Lock Table metrics). We
can then iterate through all of the locks in the tree, populating the fields
necessary from the lockState object as well as each lock’s lockWaitQueue.
Note on Keys: Unfortunately, since each lock state (i.e. for a single key)
in the Lock Table only maintains minimal information about the transaction the
lock is held by, we are unable to track the span requested by the transaction
and will need to use the individual key. This is because the lockState
object does not track the lockTableGuard, which maintains the full
SpanSet the request is asking for. To
avoid returning all of the keys that are contended, we may consider only
including keys that have readers/writers in the queues. For instance, consider
if txn1 has Put operations on keys [a, d] and txn2 has Put operations
on keys [b, e]; while keys b, c, d will appear in the lock table held by
txn1, the requests in txn2 will only be waiting in at most 1 queue (i.e.
b), and thus we can represent the two transactions as contending (for the
moment) on b. See example here.
Note on Start Wait Time: As the time that a transaction begins waiting on a lock is only tracked within the waiting Goroutine for the purposes of Contention Events, and not maintained within the lock wait queue itself, we will need to modify this if we want to be able to display the time spent waiting in our virtual table view. As this would be a highly useful feature, it is likely worth the time needed to make this change, but it does not exist at the moment. (Follow-up note: This has been implemented as part of #76395).
Note on Lock Aquisition Time: Similar to the above, we do not currently track the time a lockholder acquires a lock. This could be resolved by incorporating the planned work to track this into the scope of this project. (Follow-up note: This has been implemented as part of #76395).
One last point worth noting is that while non-transactional lock holders will
not show up in the lock table, they can show up in lock wait queues (as
“blocked” by other transactions). This also applies to transactional read-only
requests, with the exception of GetForUpdate/ScanForUpdate requests (which
acquire unreplicated locks).
While we can implement a PostgreSQL compatibility layer and populate the virtual pg_locks table in response to queries, this may not be of much use for two reasons:
relation) locks, user locks and several others
do not apply to CockroachDB, while CockroachDB-specific concepts such as
ranges, lock spans, and more do not apply to the paradigm layed out in
pg_locks.pg_stat_activity (used in conjunction with pg_locks),
which CockroachDB does not currently support, the use case for pg_locks
alone, given some of the above-mentioned limitations, may not be strong.For these reasons, a PostgreSQL compatibility layer is deemed out of scope for
an initial implementation, though may be revisited at a later date. If we were
to implement a PostgreSQL-compatible mechanism to populate the (currently
unimplemented) pg_locks table, following their specification,
our implementation could map as follows:
pg_locks Column | Values | CRDB Mapping |
|---|---|---|
locktype text | relation, tuple, transactionid, more | transactionid |
database oid | dbOid(db.GetID()) | |
relation oid | tableOid(descriptor.GetID()) | |
page int4 | null (does not apply) | |
tuple int2 | null (does not apply) | |
virtualxid text | null (does not apply) | |
transactionid xid | txn.UUID() converted to xid (int32) | |
classid oid | null (does not apply) | |
objid oid | null (does not apply) | |
objsubid int2 | null (does not apply) | |
virtualtransaction text | null (does not apply) | |
pid int4 | null (does not apply) | |
mode text | AccessShareLock, ExclusiveLock, RowExclusiveLock, SIReadLock, more | ExclusiveLock, ShareLock (if waiter) |
granted bool | lockState.holder.locked (or false if a member of lockWaitQueue) | |
fastpath bool | false (does not apply) | |
waitstart timestamptz | start time in lock table waiters |
The biggest alternative solutions to observing contention is to utilize the
Active Tracing Spans Registry and the Contention Events framework. The
Active Tracing Spans Registry, for which there is
currently ongoing work to visualize with a UI,
would be useful to show traces for currently active operations, including those
contending on locks, but does not specifically map to the use case a virtual
table like cluster_locks would provide. That said, it will likely be worth it
to coordinate these efforts, as they can work together to better enable
CockroachDB users and developers. The Contention Events framework
(i.e.crdb_internal.cluster_contended_* tables), for which there is
also planned/ongoing work,
is extremely useful
for diagnosing contention in a cluster over time, but as Contention Events are
only emitted in tracing spans
after a transaction has finished waiting on a lock, it does not provide
insight into what is currently blocking a particular transaction.
Given that the Active Tracing Spans Registry is also intended to visualize what
transactions are actively running (and potentially holding locks), albeit in a
much more engineer-focused, in-depth manner, it could be theoretically possible
to implement something like cluster_locks using it as infrastructure. At this
point in time, however, this may not be the best approach, especially as it
would likely require more complexity to narrow down the data in the Active
Tracing Spans into a view like cluster_locks, it would be additional indirection
rather than interfacing with the Lock Table directly, and additionally there
are currently limitations that restrict viewing the Active Tracing Spans to a
single node rather than cluster-wide.
It may be also worth noting that while serializability failures can occur and
are a common class of failures, in CRDB, they are distinct from (albeit
potentially caused by) lock contention. These failures, which are
TransactionRetryErrors with reason RETRY_SERIALIZABLE,
can include some information elaborating on what caused the serializability
failure, but do not include any information on the contention history of the
transaction. Nonetheless, improving these error messages is a goal. Lock
contention, on the other hand, should not result in a user-visible error unless
the client sets a timeout on a given query. If the user does not specify a
timeout, a query waiting on a lock could effectively be stalled indefinitely. When
timeouts are set, however, in the case of timeout errors we also do not surface
information about the contention to the user, and as such this error messaging
cannot be used to investigate lock contention.
The primary use case targeted is for SQL users - particularly CockroachDB users and admins - to be able to visualize live, point-in-time contention across a given cluster or a single database in particular, though there are of course other users that can take take advantage of this feature. CockroachDB engineers, users or developers attempting to understand the concurrency model, TSEs, SREs and others all may find the virtual table to be a useful view into what is happening on a given cluster, or what transactions may be blocking other operations at any given time. This can even be especially useful for someone with knowledge of CRDB’s internals as a first-line tool to obtain a quick glance at what locks are held currently before moving onto some of the other tools mentioned above for a deeper investigation.
| Historical View | Live View | |
|---|---|---|
| For Engineers/TSEs | Jaeger/etc, Splunk (potentially) | Active Tracing Spans |
| For Users/DB Admins | Contention Events (via SQL, Dashboards) | cluster_locks (via SQL) |
To visualize the transactions holding locks with basic information about the client session:
SELECT
l.database_name,
l.table_name,
l.range_id,
l.lock_key_pretty,
l.txn_id,
l.granted,
s.node_id,
s.user_name,
s.client_address
FROM crdb_internal.cluster_locks l
JOIN crdb_internal.cluster_transactions t ON l.txn_id = t.id
JOIN crdb_internal.cluster_sessions s ON t.session_id = s.session_id
WHERE l.granted = true;
database_name | table_name | range_id | lock_key_pretty | txn_id | granted | node_id | user_name | client_address
----------------+------------+----------+-------------------+--------------------------------------+---------+---------+-----------+------------------
tpcc | item | 72 | /Table/62/1/135/0 | ba7c4940-96a5-4064-b650-ed2b7191ab5a | true | 2 | root | 127.0.0.1:59033
To visualize transactions which are blocking other transactions:
SELECT
lh.database_name,
lh.table_name,
lh.range_id,
lh.lock_key_pretty,
lh.txn_id AS lock_holder,
lw.txn_id AS lock_waiter
FROM crdb_internal.cluster_locks lh
JOIN crdb_internal.cluster_locks lw ON lh.lock_key = lw.lock_key
WHERE lh.granted = true AND lh.txn_id IS DISTINCT FROM lw.txn_id;
database_name | table_name | range_id | lock_key_pretty | lock_holder | lock_waiter
----------------+------------+----------+-------------------+--------------------------------------+---------------------------------------
tpcc | item | 72 | /Table/62/1/325/0 | ba7c4940-96a5-4064-b650-ed2b7191ab5a | 55e58b37-e5aa-4e56-a4a6-13ca72dca30b
To include a display of the queries that are waiting on a lock:
SELECT
lh.database_name,
lh.table_name,
lh.range_id,
lh.lock_key_pretty,
q.query as waiting_query
FROM crdb_internal.cluster_locks lh
JOIN crdb_internal.cluster_locks lw ON lh.lock_key = lw.lock_key
JOIN crdb_internal.cluster_queries q ON lw.txn_id = q.txn_id
WHERE lh.granted = true AND lh.txn_id IS DISTINCT FROM lw.txn_id;
database_name | table_name | range_id | lock_key_pretty | waiting_query
----------------+------------+----------+-------------------+--------------------------------------
tpcc | item | 72 | /Table/62/1/325/0 | SELECT * FROM item WHERE i_id = 325
To display the number of waiting transactions on a single lock:
SELECT
l.database_name,
l.table_name,
l.range_id,
l.lock_key_pretty,
COUNT(*) AS waiter_count
FROM crdb_internal.cluster_locks l
WHERE l.granted=false
GROUP BY l.database_name, l.table_name, l.range_id, l.lock_key_pretty;
database_name | table_name | range_id | lock_key_pretty | waiter_count
----------------+------------+----------+-------------------+---------------
tpcc | item | 72 | /Table/62/1/325/0 | 1
SHOW LOCKS