docs/tech-notes/timeseries.md
Alongside the SQL layer exposed to client apps, CockroachDB also implements another database engine on top of the KV layer: a native, embedded time series database.
This timeseries database exists solely for the purpose of storing metrics efficiently inside CockroachDB and serving graphs on the DB Console, without introducing a dependency on 3rd party technology. Its main features (multiple storage resolutions, age-based rollups and expiration) are described in the sections below.
The tsdb is layered "on top" of CockroachDB KV, that is, it stores its data using the same transaction and replication layer as SQL. Only the data storage layout is different (see section below).
From the tsdb perspective, a data timeseries is a set of measurements of a single named variable at multiple points in time.
The variable under each timeseries is identified by a name (the "metric" in the remainder of the CockroachDB source).
Each data point carries a unix timestamp (expressed in nanoseconds), a 64-bit floating-point value, and a source (a string) identifying where the value was measured.
(For example, in versions of CockroachDB up to and including v22.2, the source is the node ID where the data is coming from. Starting in v23.1 we see more diverse uses of the source field.)
A single timeseries is stored as multiple groups of KV pairs, one per storage resolution.
Storage resolutions represent the granularity at which values are kept. To simplify, "finer resolution" means more data points are kept per unit of time. For example, the "10s" resolution keeps 1 data point per 10 seconds; the "30m" resolution keeps 1 data point per 30 minutes.
Under each resolution, data points are stored in a columnar fashion, with multiple values per key. A group of data points under the same KV key is called a slab. Each source has its own set of slabs.
Each resolution uses a different slab size: the 10s resolution groups its data points into 1-hour slabs, while the 30m resolution groups them into 24-hour slabs.
(There are 4 resolutions currently implemented in the code. 10s and 30m are used for real-world deployments; the 1ns and 50ns resolutions are implemented for testing only.)
For each resolution, there is also a configurable maximum age of slabs: when that maximum is reached, every time a new slab is added, the oldest slab is "rolled up" into the newest slab of a coarser resolution. Rolling up means multiple consecutive data points are averaged/interpolated together and only the result is stored. This is the algorithm that compresses the data over time.
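The averaging performed during a rollup can be sketched as follows. This is a deliberately simplified model: the real implementation stores richer per-sample state (see roachpb.InternalTimeSeriesData below) so that aggregates other than the average also survive the rollup.

```go
package main

import "fmt"

// rollup averages consecutive groups of fine-resolution samples into
// one coarse-resolution sample each. This is a toy model of the
// 10s -> 30m rollup, where each 30-minute window covers 180 samples.
func rollup(samples []float64, groupSize int) []float64 {
	var out []float64
	for i := 0; i < len(samples); i += groupSize {
		end := i + groupSize
		if end > len(samples) {
			end = len(samples)
		}
		sum := 0.0
		for _, v := range samples[i:end] {
			sum += v
		}
		out = append(out, sum/float64(end-i))
	}
	return out
}

func main() {
	// Six samples rolled up in groups of three leave two averaged points.
	fmt.Println(rollup([]float64{1, 2, 3, 10, 20, 30}, 3)) // [2 20]
}
```

Only the rolled-up result is stored; the original fine-resolution points are then deleted, which is what compresses the data over time.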
These configuration knobs are cluster settings:

- `timeseries.storage.resolution_10s.ttl` is the maximum age of values
  stored at the 10s resolution. Data older than this is subject to
  rollup to the 30m resolution. The default is 10 days.
- `timeseries.storage.resolution_30m.ttl` is the maximum age of values
  stored at the 30m resolution. Data older than this is subject to
  deletion (because there is no coarser resolution implemented). The
  default is 90 days.
Each timeseries is stored as a sequence of KV pairs, as defined by the algorithm above:
/System/tsd/<name>/<resolution>/<start_timestamp>/<source>
(Note: this is the low-level encoding. When using cockroach debug keys, the pretty-printed output
inverts the order of the key, placing the <source> component prior to the resolution
in the output. However, this is just a display artifact. In storage, the source is at the end.)
For example, the metric cr.node.sql.new_conns is stored under /System/tsd/cr.node.sql.new_conns/....
If we look at the keyspace with all configurations set to defaults, we will see:
the 10s resolution data, split into 1-hour slabs, up to 240 pairs (1 per hour over 10 days); for example:

```
/System/tsd/cr.node.sql.new_conns/10s/2022-08-11T10:00:00Z/1
/System/tsd/cr.node.sql.new_conns/10s/2022-08-11T10:00:00Z/2
/System/tsd/cr.node.sql.new_conns/10s/2022-08-11T10:00:00Z/3
/System/tsd/cr.node.sql.new_conns/10s/2022-08-11T11:00:00Z/1
/System/tsd/cr.node.sql.new_conns/10s/2022-08-11T11:00:00Z/2
/System/tsd/cr.node.sql.new_conns/10s/2022-08-11T11:00:00Z/3
/System/tsd/cr.node.sql.new_conns/10s/2022-08-11T12:00:00Z/1
/System/tsd/cr.node.sql.new_conns/10s/2022-08-11T12:00:00Z/2
/System/tsd/cr.node.sql.new_conns/10s/2022-08-11T12:00:00Z/3
...
/System/tsd/cr.node.sql.new_conns/10s/2022-08-21T08:00:00Z/1
/System/tsd/cr.node.sql.new_conns/10s/2022-08-21T08:00:00Z/2
/System/tsd/cr.node.sql.new_conns/10s/2022-08-21T08:00:00Z/3
/System/tsd/cr.node.sql.new_conns/10s/2022-08-21T09:00:00Z/1
/System/tsd/cr.node.sql.new_conns/10s/2022-08-21T09:00:00Z/2
/System/tsd/cr.node.sql.new_conns/10s/2022-08-21T09:00:00Z/3
```
Notice: for each timestamp, there's one key per source (node).
Under each of these keys, there are up to 360 data points
(1 per 10 second interval, up to 1 hour). The actual data storage is
a bit more complex than a simple array of floats, see the doc on
roachpb.InternalTimeSeriesData for details.
the 30m resolution data, split into 24-hour slabs, up to 90 pairs (1 per day over 90 days); for example:

```
/System/tsd/cr.node.sql.new_conns/30m/2022-06-11T00:00:00Z/1
/System/tsd/cr.node.sql.new_conns/30m/2022-06-11T00:00:00Z/2
/System/tsd/cr.node.sql.new_conns/30m/2022-06-11T00:00:00Z/3
...
/System/tsd/cr.node.sql.new_conns/30m/2022-08-10T00:00:00Z/1
/System/tsd/cr.node.sql.new_conns/30m/2022-08-10T00:00:00Z/2
/System/tsd/cr.node.sql.new_conns/30m/2022-08-10T00:00:00Z/3
```
Under each of these keys there are up to 48 data points (1 per 30m interval, up to 1 day).
The ts writes are not subject to the MVCC logic otherwise afforded to SQL writes.
Instead, the KV layer offers a specialized operation, Merge, which
appends new data points to the stored slabs with high performance.
These writes are replicated (for fault tolerance), and offer protection against replay (so they are idempotent and can be re-applied during transient faults), but they are not transactional; in particular they cannot be rolled back. This affords the tsdb extremely efficient reads, unhindered by MVCC intents and deletion tombstones.
Data deletion (for rollups) does use the standard KV DeleteRange
operation. At a low level, DeleteRange has a special path
for timeseries KV pairs, which makes it more efficient.
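The replay-protection property of these merges can be modeled as an append keyed by the sample's offset within the slab, so re-applying the same write changes nothing. This is a toy model only: the real Merge is a storage-engine-level operator over encoded roachpb.InternalTimeSeriesData values.

```go
package main

import (
	"fmt"
	"sort"
)

// slab maps a sample offset (e.g. the 10s interval index within the
// slab's hour) to its value. Keying by offset is what makes replayed
// writes idempotent: merging the same (offset, value) twice is a no-op.
type slab map[int32]float64

func (s slab) merge(offset int32, value float64) {
	s[offset] = value
}

// offsets returns the populated sample offsets in increasing order.
func (s slab) offsets() []int32 {
	out := make([]int32, 0, len(s))
	for o := range s {
		out = append(out, o)
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

func main() {
	s := slab{}
	s.merge(0, 1.5)
	s.merge(1, 2.5)
	s.merge(1, 2.5) // replayed write during a transient fault: no duplicate
	fmt.Println(len(s), s.offsets()) // 2 [0 1]
}
```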
The tsdb is implemented mostly independently from the rest of CockroachDB.
It operates over an abstract DataSource, defining a method
GetTimeSeriesData() returning tsdb.TimeSeriesData (a struct
containing series name, source and data points).
The client of the tsdb library is then intended to call
PollSource(). This method starts an asynchronous task that
periodically (at a 10s interval) calls StoreData(). This method, in
turn, scrapes the data points from the DataSource and stores them in
KV according to the rules spelled out above.
How this is used in CockroachDB is explained in a separate section below.
The tsdb maintains the following metrics about its own functioning:
- timeseries.write.samples: Total number of metric samples written to disk.
- timeseries.write.bytes: Total size in bytes of metric samples written to disk.
- timeseries.write.errors: Total errors encountered while attempting to write metrics to KV.

The tsdb query engine (Query() method in code) is the component which,
given a timeseries query, computes a result array of tsdb.TimeSeriesDatapoint
suitable for e.g. plotting on a graph using the currently stored KV data.
The parameters of a ts query include, at minimum, the name of the metric to query (e.g. cr.node.sql.new_conns).
The query engine uses state-of-the-art formulas internally to avoid data rounding artifacts, discontinuities and anomalies at the rollup boundaries. All in all, it may not be as advanced as Prometheus or Grafana, but it is not simplistic either.
To read data from the KV layer, it uses regular Scan operations.
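For intuition, one common ingredient of such query engines is linear interpolation of a missing sample between its known neighbors, so downsampled series stay smooth across gaps. This is a generic sketch, not the actual formulas used by the tsdb query engine.

```go
package main

import "fmt"

// lerp linearly interpolates the value at time t between two known
// samples (t0, v0) and (t1, v1). Query engines use this kind of
// interpolation to estimate missing datapoints between real ones.
func lerp(t0 int64, v0 float64, t1 int64, v1 float64, t int64) float64 {
	frac := float64(t-t0) / float64(t1-t0)
	return v0 + frac*(v1-v0)
}

func main() {
	// A sample is missing at t=15, between (10, 100) and (20, 200).
	fmt.Println(lerp(10, 100, 20, 200, 15)) // 150
}
```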
The tsdb Go package also offers a Server component, which
offers two gRPC methods: (*Server).Query and (*Server).Dump.
This ts.Server component translates calls from client to
calls to the low-level (*ts.DB).Query method:
(*ts.Server).Query takes requests as tspb.TimeSeriesQueryRequest,
and each request can contain zero or more low-level tspb.Query objects.
(*ts.Server).Query is responsible for parallelizing calls to the
low-level (*ts.DB).Query, one call per tspb.Query in the
tspb.TimeSeriesQueryRequest.
This endpoint is meant for displaying data as graphs.
(*ts.Server).Dump takes requests as tspb.DumpRequest, and
collects all the timeseries data within the specified interval,
for the given names and resolutions.
This endpoint is meant for dumping the low-level tsdb data.
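The per-query fan-out performed by (*ts.Server).Query can be sketched with a standard Go parallelization pattern. The function names below (other than the concepts they stand for) are invented for illustration; runOne stands in for a low-level (*ts.DB).Query call.

```go
package main

import (
	"fmt"
	"sync"
)

// queryAll runs one goroutine per low-level query, mirroring how the
// server issues one (*ts.DB).Query call per tspb.Query in the request.
// Results are written to distinct slots, preserving request order.
func queryAll(queries []string, runOne func(string) int) []int {
	results := make([]int, len(queries))
	var wg sync.WaitGroup
	for i, q := range queries {
		wg.Add(1)
		go func(i int, q string) {
			defer wg.Done()
			results[i] = runOne(q)
		}(i, q)
	}
	wg.Wait()
	return results
}

func main() {
	out := queryAll([]string{"a", "bb", "ccc"}, func(q string) int {
		return len(q) // pretend result: number of datapoints returned
	})
	fmt.Println(out) // [1 2 3]
}
```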
The tsdb main components (ts.DB and ts.Server) are mostly
stateless: they contain merely a few static configuration parameters,
an interface to the underlying KV database (kv.DB, like the one used
by the SQL layer), and an interface to the cluster settings for
dynamic tuning knobs.
As of this writing, one ts.DB and one ts.Server are instantiated
per KV node, in the (*server.Server).PreStart() method.
Separately from the tsdb engine, CockroachDB implements a component
called "metrics recorder" (server/status/recorder.go). This component
is mainly responsible for periodically scanning all the
in-memory metric objects (via metric.Registry) and keeping
a copy of their values for clients of the Prometheus
status endpoint (/_status/vars).
In addition to this responsibility, the recorder also implements
the ts.DataSource interface: it can serve scraping requests
by a ts.DB's PollSource.
Upon startup, a KV node calls PollSource on its ts.DB and
points it to the metric recorder.
This ensures that all the metrics in the metric.Registry get
periodically saved into the tsdb.
The high-level query engine in ts.Server is exposed
to CockroachDB's external gRPC and HTTP endpoints.
The only HTTP endpoint is /ts/db, pointing to (*ts.Server).Query.
This is used by DB Console to display metric graphs, together with the
(independently defined) /_admin/v1/metricmetadata and
/_admin/v1/chartcatalog.
The gRPC endpoints are Dump, DumpRaw and Query, pointing
to the respective methods of (*ts.Server):
Dump and DumpRaw are used by the cockroach debug tsdump command.

None of these APIs are currently documented as available to end-users.