docs/RFCS/20191113_vectorized_external_storage.md
Add the ability for vectorized operators to store intermediate results to disk when these are too large to fit in main memory by adding an on-disk queue abstraction. Without this ability, queries that use these vectorized operators will encounter an out of memory error and will not be able to complete. Examples of these operators are sorts and joins (among others). Adding support for vectorized external storage will allow the full vectorized execution engine to be turned on by default, unlocking performance gains for analytical queries that use these buffering operators.
Note that CockroachDB already supports external storage in its row-oriented execution engine (covered here), this RFC covers external storage for the vectorized execution engine.
The vectorized execution engine offers some significant performance gains for analytical queries (https://www.cockroachlabs.com/blog/how-we-built-a-vectorized-execution-engine/). However, at the time of writing (19.2.0), only queries that use a constant amount of memory are run through the vectorized execution engine by default. This excludes a large family of queries that use operators such as sorts or joins because their use could lead to an out of memory error. Adding external storage would allow these vectorized operators to be turned on by default, which will result in performance improvements for analytical workloads.
The full list of components that need external storage are:
| Name | Reason | Solution |
|---|---|---|
| Hash router | Buffering data for blocked output | On-disk queue |
| Merge joiner | Buffering equality groups | On-disk queue |
| Hash joiner | Buffering the build table | GRACE hash join |
| Hash aggregator | Too many groups to fit in memory | GRACE hash join |
| Sorter | Buffering data to sort | External merge sort |
| Unordered distinct | Seen table too big to fit in memory | GRACE hash join |
| Window functions | Buffering window function results | Mix of external merge sort and GRACE hash join |
All of these components will use one or more of the following:
Similarly to the row execution engine RFC on external storage, the algorithms to implement are limited to an external sort and external hash join. However, this proposal differs from the original external storage RFC in that instead of using a sorted KV storage engine as a map and delegating on-disk sorting to it, the vectorized engine will use disk only to store data and use external-storage aware operators to perform external merge sorts and GRACE hash joins. This ends up being a lot more performant (refer to the Alternatives section).
The proposal in this RFC is to introduce an on-disk queue data structure backed by flat files where batches of columnar data are serialized using the Arrow IPC format. This queue can be used directly by most operators that need to buffer unlimited data in FIFO order. There will also be an additional abstraction that will give the caller the option to use separate queues as distinct partitions. This last abstraction will be used to flush and re-read partitions in the case of a GRACE hash join or external merge sort.
The impact on the end user will be improved performance for large analytical queries, as queries that would previously not run through the vectorized execution engine will do so after the implementation of external storage. There will be no other user-facing changes, the change proposed in this RFC is purely focused on existing feature improvement.
Given that external merge sorts and GRACE hash joins are well established algorithms (external sorts more so, refer to this issue for an explanation of a GRACE hash join), this RFC will focus on the backend infrastructure that will allow these algorithms to spill to disk by focusing on describing the design of a single queue.
https://github.com/asubiotto/cockroach/tree/prototype is a prototype of the proposed on-disk queue backed by flat files. A Pebble-backed alternative was also considered in that prototype. The choice of storage method is examined in more detail in the Rationale and Alternatives section.
The requirements for the storage method are that:
Using the same filesystem vfs.FS abstraction as Pebble does will
allow us to write an on-disk queue backed by flat files with the
first two of these requirements. The vfs.FS will be the same as the
temporary storage engine, so we will piggy-back off of the existing
temporary file management code. The interface is as simple as:
type Queue interface {
Enqueue(coldata.Batch) error
Dequeue() (coldata.Batch, error)
Close() error
}
The flat file Queue implementation will have knobs to configure
the in-memory buffer size to amortize write overhead, and the maximum
size of a file, after which it will roll over to a new file. Different
configurations will be benchmarked to find a good default value.
Note that the maximum in-use disk space will be limited by a disk monitor
similarly to what is currently done with external storage in the row execution
engine. As bytes are written to disk, these are accounted for by a
mon.BytesMonitor, which additionally provides observability in to how much
space is being used at a given time as well as how much disk space was used
in a given query through EXPLAIN ANALYZE.
The implementation will create unique files using the following naming scheme:
<FlowID>_<QueueID>_<FileID>
A Queue will be created with a unique integer ID and keep a counter
for the number of files created, which it will use as a suffix for
the file names to maintain uniqueness within the Queue. Prefixing
the filenames with the FlowID UUID (which uniquely describes a
sub-plan of a distributed query running on a given node), will allow
the Flow to perform any file cleanup once the query completes (in
case of orphaned files when recovering from possible panics).
Thankfully, serialization of coldata.Batches is already
implemented in the colserde package using the Arrow IPC
format. These serialized bytes will
be buffered until a flush is required, at which point they are written to a
file with an accompanying file footer. The start offset of these written bytes
as well as the number of bytes written will be stored for when the batches
should be dequeued. A current limitation is that bytes cannot be deserialized
at a finer granularity than they are written at, although we are planning
on fixing this (https://github.com/cockroachdb/cockroach/issues/42045).
Snappy compression is used to minimize file size.
The aforementioned abstraction of partitions as distinct queues can be trivially implemented by instantiating as many queues as partitions necessary. This could mean many small files, as a queue must have at least one file. The problem with this design is that there might be many open file descriptors at once if the user of the queue needs to read a little data from each partition, in the case of an external merge sort, for example. The number of open file descriptors is limited on some file systems, so this could lead to crashes. However, this problem can be mitigated by having a maximum number of open file descriptors scoped to a temporary storage engine as well as per queue. The implementation of partitions would keep as many file descriptors open as allowed, closing the least recently used file when a new partition must be read from.
A queue backed by Pebble was also considered as an alternative. The configuration options for this queue are the write buffer size and the maximum value size. Batches are enqueued up to the maximum value size and written to a buffered writer with an ordinal as the key.
The benchmark used to test the performance of both queue implementations is
a simple write-everything then read-everything benchmark run on a linux gceworker.
This simulates the general case of having a single goroutine flush data
which it then rereads and operates on. The flat file Queue implementation
syncs files every 512KiB, similarly to Pebble. Note that both implementations
use snappy compression.
Before benchmarking each implementation with a varying buffer size,
the data size to store was fixed at 512MiB and each implementation was
independently benchmarked to find the optimal maximum file size in the
case of flat files, and value size in the case of pebble, while keeping
the write buffer size constant at 0. The options used for Pebble are
the same ones the Pebble temporary storage instance uses (see
https://github.com/cockroachdb/cockroach/blob/09b391536f72349c3951ce6b75c5231b58933b07/pkg/storage/engine/temp_engine.go#L117).
Given these graphs, a fixed file size of 32MiB was chosen for the flat queue
implementation and a 128KiB maximum value size for the Pebble implementation.
Overall, we see a ~50% improvement by using flat files instead of Pebble as an on-disk queue implementation in the best case.
Initially, the downside of choosing flat files over Pebble was that
there wasn’t a clear understanding of how the requirements outlined
in Detailed Design would be satisfied when using
flat files. However, since the vfs.FS implementation that the
Pebble temporary engine uses can be reused, there is no upside to
using pebble barring perhaps implementation simplicity since details
like compression and file descriptor management are taken care of.