docs/proposals/parquet-storage.md
Since the introduction of Block Storage in Cortex, the TSDB format and the Store Gateway have been the de-facto way to query long-term data on object storage. However, they present several significant challenges:
The TSDB format, while efficient for write-heavy workloads on local SSDs, is not designed for object storage.
The Store Gateway was originally introduced in Thanos, and the Cortex and Thanos communities have collaborated on many optimizations to it. However, it has its own design-related problems:

- Resource intensive
- State management and scaling difficulties
- Query inefficiencies
Apache Parquet is a columnar storage format designed specifically for efficient data storage and retrieval from object storage systems. It offers several key advantages that directly address the problems we face with TSDB and Store Gateway:
There are other benefits of the Parquet format, but they are not directly related to this proposal.
This design introduces two new Cortex components/modules.
Parquet Converter is a new component that converts TSDB blocks on object store to Parquet file format.
It is similar to the compactor; however, it only converts a single block. The converted Parquet files are stored in the same TSDB block folder so that the lifecycle of the Parquet files is managed together with the block.
Conversion can be restricted to certain blocks based on block duration; for example, only blocks with a duration >= 12h are converted.
Similar to the existing distributorQueryable and blockStorageQueryable, the Parquet queryable is a queryable implementation that allows Cortex to query parquet files, and it can be used in both the Cortex Querier and Ruler.
If the Parquet queryable is enabled, the block storage queryable is disabled and the Cortex querier no longer queries the Store Gateway. distributorQueryable remains unchanged, so it still queries Ingesters.
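The queryable selection rule above can be sketched as a simple switch. This is an illustrative model of the fan-out decision, not Cortex's real wiring code; the function and the string names mirror the queryables mentioned in this proposal.

```go
package main

import "fmt"

// queryables returns the names of the queryables the querier fans out to.
// Per the proposal, enabling the Parquet queryable replaces the block
// storage queryable (the Store Gateway path), while distributorQueryable
// is always kept so ingesters are still queried. Illustrative sketch only.
func queryables(parquetEnabled bool) []string {
	qs := []string{"distributorQueryable"}
	if parquetEnabled {
		qs = append(qs, "parquetQueryable")
	} else {
		qs = append(qs, "blockStorageQueryable")
	}
	return qs
}

func main() {
	fmt.Println(queryables(true))
	fmt.Println(queryables(false))
}
```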
The Parquet queryable uses a bucket index to discover parquet files in object storage. The bucket index has the same format as the existing TSDB bucket index file but uses a different name, `bucket-index-parquet.json.gz`. It is updated periodically by the Cortex Compactor/Parquet Converter when parquet storage is enabled.
Cortex querier remains a stateless component when Parquet queryable is enabled.
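The bucket index naming scheme can be sketched as below. The per-tenant path layout and function name are assumptions for illustration; only the two index file names come from this proposal.

```go
package main

import (
	"fmt"
	"path"
)

// bucketIndexPath returns the object-store path of the bucket index for a
// tenant. With parquet storage enabled, the querier reads
// bucket-index-parquet.json.gz instead of the TSDB bucket-index.json.gz.
// The tenant-prefixed layout here is an illustrative assumption.
func bucketIndexPath(tenantID string, parquetEnabled bool) string {
	name := "bucket-index.json.gz"
	if parquetEnabled {
		name = "bucket-index-parquet.json.gz"
	}
	return path.Join(tenantID, name)
}

func main() {
	fmt.Println(bucketIndexPath("tenant-1", false))
	fmt.Println(bucketIndexPath("tenant-1", true))
}
```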
```
┌──────────┐    ┌─────────────┐    ┌──────────────┐
│ Ingester │───>│    TSDB     │───>│   Parquet    │
└──────────┘    │   Blocks    │    │  Converter   │
                └─────────────┘    └──────────────┘
                                          │
                                          v
┌──────────┐    ┌─────────────┐    ┌──────────────┐
│  Query   │───>│   Parquet   │───>│   Parquet    │
│ Frontend │    │   Querier   │    │    Files     │
└──────────┘    └─────────────┘    └──────────────┘
```
A Parquet file is converted from a TSDB block, so it follows the same time range constraint.
If the largest block spans 1 day, a parquet file can span up to 1 day. The max block range is configurable in Cortex, with a default of 24h, so the following schema uses 24h as an example.
The Parquet format consists of two types of files:
Labels Parquet File

Contains the label columns (`__name__`, label1, ..., labelN). Rows are sorted by `__name__` alphabetically in ascending order.

Chunks Parquet File
| Column Name | Description | Type | Encoding/Compression/skipPageBounds | Required |
|---|---|---|---|---|
| s_hash | Hash of all labels | INT64 | None/Zstd/Yes | No |
| s_col_indexes | Bitmap indicating which columns store the label set for this row (series) | ByteArray (bitmap) | DeltaByteArray/Zstd/Yes | Yes |
| s_lbl_{labelName} | Values for a given label name. Rows are sorted by metric name | ByteArray (string) | RLE_DICTIONARY/Zstd/No | Yes |
| s_data_{n} | Chunk columns (0 to data_cols_count). Each column contains data from [n*duration, (n+1)*duration] where duration is 24h/data_cols_count | ByteArray (encoded chunks) | DeltaByteArray/Zstd/Yes | Yes |
`data_cols_count` is stored as Parquet file metadata. Its value defaults to 3, but it is configurable to adjust for different use cases.
We'd like to give huge credit to the people from the Thanos community who started this initiative.