docs/tech-notes/cluster-observability.md
Last update: May 2023
This document provides an architectural overview of the observability layer in CockroachDB.
Original author: j82w
99044 SQL Activity page now displays only persisted stats when selecting to view fingerprints. This means recently executed statements might take up to 10 minutes to appear in the Console.
master 98815 & 22.2 backport 99403 SQL Activity page adds new search criteria, which require a limit and a sort to be specified.
98885 & 100807 Add new system activity tables and update job
The results below were measured on a 9-node test cluster with 100k-115k rows in the statistics tables, using a request that returns results for the past hour.
| Version information | Latency to load SQL Activity page |
|---|---|
| Before any change | 1.7 minutes |
| Changes on 22.1 & 22.2 | 9.9 seconds |
| Changes on 23.1 | ~500ms |
The performance gains are from:
sequenceDiagram
box less than max top * num columns cached
participant sqlActivityUpdater.transferAllStats
end
box greater than max top * num columns cached
participant sqlActivityUpdater.transferTopStats
end
PersistedSQLStats->>sqlActivityUpdater: Statistics flush is done via channel
sqlActivityUpdater->>sqlActivityUpdater.compactActivityTables: Removes rows if limit is hit
sqlActivityUpdater->>sqlActivityUpdater.transferAllStats: if less than 3000 rows
sqlActivityUpdater->>sqlActivityUpdater.transferTopStats: if greater than 3000 rows
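The branch in the diagram above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual Go implementation. `TOP_LIMIT` and `NUM_RANKED_COLUMNS` are assumed values, chosen only so that their product matches the "max top * num columns cached" threshold of 3000 rows. The function names, the row layout, and the handling of a row count exactly equal to the threshold are also assumptions.

```python
# Assumed values for illustration; only their product (3000) appears in the diagram.
TOP_LIMIT = 500
NUM_RANKED_COLUMNS = 6
MAX_CACHED_ROWS = TOP_LIMIT * NUM_RANKED_COLUMNS


def choose_transfer(persisted_row_count):
    """Return the name of the transfer path taken after a statistics flush."""
    if persisted_row_count < MAX_CACHED_ROWS:
        # Few enough rows: copy everything into the activity tables.
        return "transferAllStats"
    # Too many rows: copy only the top rows per ranked column.
    return "transferTopStats"


def top_rows(rows, columns, limit):
    """Keep every row that ranks in the top `limit` for at least one column.

    `rows` is a list of dicts with a hypothetical "fingerprint_id" key plus
    one numeric value per ranked column.
    """
    keep = {}
    for col in columns:
        ranked = sorted(rows, key=lambda r: r[col], reverse=True)[:limit]
        for row in ranked:
            keep[row["fingerprint_id"]] = row
    return list(keep.values())
```

Because a row can rank highly in several columns at once, the top path returns at most `limit * len(columns)` rows, which is why the threshold is expressed as a product.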
sequenceDiagram
title KV layer creates trace
SQL layer->>KV layer: Initial call to KV layer
KV layer->>lock_table_waiter: Transaction hit contention
lock_table_waiter->>contentionEventTracer.notify: Verify it's a new lock
contentionEventTracer.notify->>contentionEventTracer.emit: Adds ContentionEvent to trace span
KV layer->>SQL layer: Return the results of the query
SQL layer->>KV layer: if tracing is enabled then network call to get trace
KV layer->>SQL layer: returns traces
sequenceDiagram
title Transaction id cache
connExecutor.recordTransactionStart->>txnidcache.Record: Add to cache with default fingerprint id
connExecutor.recordTransactionFinish->>txnidcache.Record: Replace the cached entry with the actual fingerprint id
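The two-step recording shown above can be pictured with a small sketch. It is a simplified Python model, not the real `txnidcache` package: the class, the helper functions, and the placeholder value `DEFAULT_FINGERPRINT_ID = 0` are all assumptions made for illustration.

```python
DEFAULT_FINGERPRINT_ID = 0  # assumed placeholder for "fingerprint not known yet"


class TxnIDCache:
    """Maps a transaction id to its transaction fingerprint id."""

    def __init__(self):
        self._entries = {}

    def record(self, txn_id, fingerprint_id):
        # A later record for the same txn id overwrites the earlier one.
        self._entries[txn_id] = fingerprint_id

    def lookup(self, txn_id):
        # Returns None when the txn id was never recorded.
        return self._entries.get(txn_id)


def record_transaction_start(cache, txn_id):
    # The fingerprint is not known until the transaction finishes.
    cache.record(txn_id, DEFAULT_FINGERPRINT_ID)


def record_transaction_finish(cache, txn_id, fingerprint_id):
    cache.record(txn_id, fingerprint_id)
```

Recording at start means a lookup for a still-running transaction finds an entry, even though it only carries the placeholder until the transaction finishes.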
sequenceDiagram
title Contention events insert process
executor_statement_metrics->>contention.registry: AddContentionEvent(ExtendedContentionEvent)
contention.registry->>event_store: addEvent(ExtendedContentionEvent)
event_store->>ConcurrentBufferGuard: buffer with eventBatch with 64 events
ConcurrentBufferGuard->>eventBatchChan: Flush batch to channel when full
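The buffering step above amounts to "collect 64 events, then hand the full batch to a channel". The sketch below shows that idea in Python with a plain mutex and a `queue.Queue` standing in for the Go channel. The real `ConcurrentBufferGuard` is more elaborate. The class and method names here are hypothetical; only the batch size of 64 comes from the diagram.

```python
import queue
import threading

BATCH_SIZE = 64  # batch size taken from the diagram


class EventBuffer:
    """Collects events and hands full batches to a channel-like queue."""

    def __init__(self, batch_chan, batch_size=BATCH_SIZE):
        self._lock = threading.Lock()
        self._batch = []
        self._batch_chan = batch_chan
        self._batch_size = batch_size

    def add_event(self, event):
        with self._lock:
            self._batch.append(event)
            if len(self._batch) == self._batch_size:
                # The batch is full: send it on and start a fresh one.
                self._batch_chan.put(self._batch)
                self._batch = []

    def pending(self):
        """Number of events buffered but not yet flushed."""
        with self._lock:
            return len(self._batch)
```

Batching keeps the per-event cost on the query path low: most calls only append to a list, and the consumer wakes up once per 64 events instead of once per event.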
sequenceDiagram
title Background task 'contention-event-intake'
event_store.eventBatchChan->>resolver.enqueue: Append to unresolvedEvents
event_store.eventBatchChan->>event_store.upsertBatch: Add to unordered cache (blockingTxnId, waitingTxnId, WaitingStmtId)
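A rough model of the intake task: each batch read from the channel is both queued for resolution and written into a store keyed by the (blockingTxnId, waitingTxnId, WaitingStmtId) tuple. The Python below is a sketch under that reading of the diagram. The dict-based event layout and the function names are assumptions, and a plain dict stands in for the unordered cache.

```python
def event_key(event):
    """Key used by the store; mirrors the tuple named in the diagram."""
    return (
        event["blocking_txn_id"],
        event["waiting_txn_id"],
        event["waiting_stmt_id"],
    )


def upsert_batch(store, batch):
    # Insert new events, or overwrite an existing event with the same key.
    for event in batch:
        store[event_key(event)] = event


def intake(batch, store, unresolved_events):
    # Queue the batch for fingerprint resolution...
    unresolved_events.extend(batch)
    # ...and make the (still unresolved) events visible right away.
    upsert_batch(store, batch)
```

Keying on the tuple is what later allows a resolved event to replace its unresolved version in place rather than appearing as a second entry.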
sequenceDiagram
title Background task 'contention-event-resolver' 30s with jitter
event_store.flushAndResolve->>resolver.dequeue: Append to unresolvedEvents
resolver.dequeue->>resolver.resolveLocked: Batch by CoordinatorNodeID
resolver.resolveLocked->>RemoteNode(RPCRequest): Batch blocking txn ids
RemoteNode(RPCRequest)->>txnidcache.Lookup: Lookup the id to get fingerprint
RemoteNode(RPCRequest)->>resolver.resolveLocked: Return blocking txn id & fingerprint results
resolver.resolveLocked->>LocalNode(RPCRequest): Batch waiting txn ids
LocalNode(RPCRequest)->>txnidcache.Lookup: Lookup the id to get fingerprint
LocalNode(RPCRequest)->>resolver.resolveLocked: Return waiting txn id & fingerprint results
resolver.resolveLocked->>resolver.resolveLocked: Move resolved events to resolved queue
resolver.dequeue->>event_store.flushAndResolve: Return all resolved txn fingerprints
event_store.flushAndResolve->>event_store.upsertBatch: Replace existing unresolved with resolved events
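The resolver flow above can be sketched as: group unresolved events by coordinator node, ask that node for the blocking transaction fingerprints, ask the local node for the waiting transaction fingerprints, and keep whatever could not be resolved for a later attempt. The Python below is an illustrative model only. `lookup_on_node` is a hypothetical stand-in for the RPC plus `txnidcache.Lookup`, the dict-based event layout is assumed, and so is the choice to leave an event unresolved when either lookup misses.

```python
from collections import defaultdict


def resolve(unresolved, local_node_id, lookup_on_node):
    """Resolve txn ids to fingerprint ids.

    `lookup_on_node(node_id, txn_ids)` returns a dict mapping each known
    txn id to its fingerprint id. Returns (resolved, remaining).
    """
    # Batch by coordinator node so each remote node gets one request.
    by_node = defaultdict(list)
    for event in unresolved:
        by_node[event["coordinator_node_id"]].append(event)

    resolved, remaining = [], []
    for node_id, events in by_node.items():
        blocking = lookup_on_node(
            node_id, [e["blocking_txn_id"] for e in events])
        waiting = lookup_on_node(
            local_node_id, [e["waiting_txn_id"] for e in events])
        for e in events:
            b = blocking.get(e["blocking_txn_id"])
            w = waiting.get(e["waiting_txn_id"])
            if b is None or w is None:
                remaining.append(e)  # try again on a later pass
            else:
                resolved.append(dict(
                    e,
                    blocking_fingerprint_id=b,
                    waiting_fingerprint_id=w,
                ))
    return resolved, remaining
```

The blocking transaction is looked up remotely because it ran on its coordinator node, so that is the node whose cache holds its fingerprint. The waiting transaction ran on the node that recorded the contention event, so the local cache is enough.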