Back to Questdb

Parquet partition metadata file

docs/parquet-metadata.md

9.4.236.7 KB
Original Source

Parquet partition metadata file

Goals

  • Per-partition metadata file stored alongside data.parquet in the partition directory.
  • Give optional access to statistics about each row group so that they can be pruned by the query planner.
  • Give access (mandatory) to the parquet partition file's byte ranges, encodings, and compression for each column chunk of each row group to be able to decode data pulled from cold storage without the parquet footer.
  • Store the QuestDB-specific metadata currently written as JSON in the parquet footer, avoiding the need to parse the parquet footer for this information.
  • Usable with external parquet files.

Background: current QuestDB parquet metadata

Currently stored under the "questdb" key in the parquet file's key-value metadata as JSON. This will be replaced by the binary format below.

json
{
  "version": 1,
  "schema": [
    {
      "column_type": 12,
      "column_top": 0,
      "format": 1
    },
    {
      "column_type": 5,
      "column_top": 256
    },
    {
      "column_type": 26,
      "column_top": 0,
      "ascii": true
    }
  ],
  "unused_bytes": 4096
}

File format

Binary file encoded in little-endian. One file per partition, stored in the partition directory alongsidedata.parquet.

The file has a header with column descriptors, row group blocks in the middle, and a footer at the end. The committed _pm file size is stored in the header's first 8 bytes (PARQUET_META_FILE_SIZE) and is patched last by the writer — readers treat it as the MVCC commit signal. The footer ends with a 4-byte trailer that stores the footer length, so the latest footer is at PARQUET_META_FILE_SIZE - 4 - footer_length. The parquet file size (separate concept) is stored in _txn field 3 for each partition. Row group blocks are referenced by offset from the footer. On update, only new/changed row group blocks are appended; the footer reuses offsets to unchanged blocks. The file is small (typically tens of KB), memory-mapped and cached in TableReader. Bloom filter bitsets are stored in the out-of-line region of each row group block (inlined mode) or referenced from the parquet file (external mode), with offsets in the footer feature section.

Callers never use the filesystem's reported file size to bound an _pm read or mapping. The on-disk length may include bytes from an in-progress, unpublished append and is not a valid commit boundary; only PARQUET_META_FILE_SIZE in the header is. A reader mmaps the 32-byte header prefix, reads PARQUET_META_FILE_SIZE, then remaps to that size.

Overview

                  _pm metadata file                                 data.parquet
                 +================================+                  +==========================+
                 | HEADER                         |                  |                          |
                 |  parquet_meta_file_size -------+--> end of file   |  ...column chunks...     |
                 |  feature_flags                 |                  |                          |
                 |  designated_timestamp          |         +------->|  dict page  | data pages |
                 |  sorting_column_count          |         |        |                          |
                 |  column_count                  |         |        +==========================+
                 |                                |         |
                 | COLUMN DESCRIPTORS             |         |
                 |  col 0: name, type, ..         |         |
                 |  col 1: name, type, ..         |         |
                 |  ...                           |         |
                 |                                |         |
                 | SORTING COLUMNS                |         |
                 |  col indices                   |         |
                 |                                |         |
                 | NAME STRINGS                   |         |
                 |                                |         |
                 | HEADER FEATURE SECTIONS        |         |
                 |  (if any flags set)            |         |
                 +--------------------------------+         |
                 | ROW GROUP BLOCK 0              |         |
                 |  num_rows                      |         |
                 |  chunk col 0:                  |         |
                 |    codec, encodings            |         |
                 |    byte_range_start  ----------+---------+
                 |    total_compressed            |
                 |    null_count                  |
                 |    _reserved (0)               |
                 |    min_stat, max_stat          |
                 |  chunk col 1: ...              |
                 |  ...                           |
                 | (out-of-line stats)            |
                 | (bloom filter bitsets)         |
                 +--------------------------------+
                 | ROW GROUP BLOCK 1              |
                 |  ...                           |
                 +--------------------------------+
                 |  ...                           |
                 +--------------------------------+
                 | FOOTER                         |
                 |  parquet_footer_offset         |
                 |  parquet_footer_length         |
                 |  row_group_count               |
                 |  unused_bytes                  |
                 |  prev_parquet_meta_file_size   |  (0 if first version; trailer at prev - 4)
                 |  footer_feature_flags          |  (per-footer flags)
                 |  entry 0: offset --------------+--> ROW GROUP BLOCK 0
                 |  entry 1: offset --------------+--> ROW GROUP BLOCK 1
                 |  ...                           |
                 | FOOTER FEATURE SECTIONS        |
                 |  bloom filter offsets          |
                 |  (if BLOOM_FILTERS set)        |
                 |  CRC32                         |
                 |  FOOTER_LENGTH (4B)            |  <-- trailer at parquet_meta_file_size - 4
  _txn field 3:  +================================+
  parquet file size

The latest footer is located by footer_offset = parquet_meta_file_size - 4 - FOOTER_LENGTH, where FOOTER_LENGTH is read from the 4-byte trailer at parquet_meta_file_size - 4.

Update mode - only changed blocks are appended; unchanged blocks are reused:

                 +================================+
                 | HEADER                         |
                 |  parquet_meta_file_size -------+--> end of new file (patched last)
                 +--------------------------------+
                 | ROW GROUP BLOCK 0              |  <-- unchanged, kept in place
                 +--------------------------------+
                 | ROW GROUP BLOCK 1              |  <-- was merged, old data now dead
                 +--------------------------------+
                 | (old footer)                   |  <-- prev version, still readable via chain
                 | (old trailer)                  |  <-- still valid for older parquet_meta_file_size
                 +--------------------------------+
                 | ROW GROUP BLOCK 1'             |  <-- new version of block 1
                 +--------------------------------+
                 | ROW GROUP BLOCK 2              |  <-- newly appended
                 +--------------------------------+
                 | FOOTER (new)                   |
                 |  prev_parquet_meta_file_size --+-->|old trailer| (trailer at prev - 4 gives old footer)
                 |  entry 0: offset --------------+--> BLOCK 0  (old, reused)
                 |  entry 1: offset --------------+--> BLOCK 1' (new)
                 |  entry 2: offset --------------+--> BLOCK 2  (new)
                 |  CRC32                         |
                 |  FOOTER_LENGTH (4B)            |  <-- new trailer at end
  _txn field 3:  +================================+
  parquet file size (new)

Readers pinned to the previous snapshot's parquet_meta_file_size still see the old trailer at their snapshot's end-of-file and walk the prev_parquet_meta_file_size chain to the footer matching their _txn parquet-size token.

Feature flags

_pm has two independent u64 feature-flag fields:

  • FEATURE_FLAGS in the header applies file-wide. It covers every footer in the MVCC chain and is the right place for capabilities that are the same for every snapshot of the file.
  • FOOTER_FEATURE_FLAGS in each footer applies only to that footer. Two footers reachable via PREV_PARQUET_META_FILE_SIZE can carry different sets of footer flags, enabling per-snapshot feature sections.

Both fields share the same bit policy:

  • Bits 0-31: optional - unknown bits may be ignored.
  • Bits 32-63: required - unknown bits must cause the reader to reject the file (or, for footer flags, the specific footer the reader is about to use).

Feature sections appear in bit order. Header-gated sections live at the end of the header (after name strings); footer-gated sections live at the end of the footer (after row group entries, before the CRC). The footer-trailer's FOOTER_LENGTH bounds all footer sections so readers can locate the CRC without recognizing every bit.

Defined feature flags

No footer flag bits are defined yet. Header flag bits:

bitnamedependencyheader sectionfooter section
0BLOOM_FILTERSnone4 + bloom_col_count * 4 bytes: [u32 bloom_col_count][u32; bloom_col_count] column indices (sorted ascending, unique)row_group_count * bloom_col_count * 4 bytes: inlined offsets (>>3) into _pm; 0 = absent
1BLOOM_FILTERS_EXTERNALbit 0none (shares bit 0 header section)entry width grows from 4 to 16 bytes: [(u64 offset, u64 length); row_group_count * bloom_col_count] into the parquet file; (0, 0) = absent
2SORTING_IS_DTS_ASCnonenonenone

Bit 0 is only set when at least one column has a bloom filter. Bit 1 cannot be set without bit 0; the reader rejects the file otherwise. Feature sections are ordered by bit position.

Bit 2 indicates that the partition's sorting order is implicitly the designated timestamp column in ascending order. The on-disk SORTING_COLUMN_COUNT is 0 and the SORTING_COLUMNS section is absent, but readers treat the partition as sorted by [DESIGNATED_TIMESTAMP] ascending. The designated timestamp column's DESCENDING flag must not be set. This flag is only valid when DESIGNATED_TIMESTAMP >= 0; writers must not set it when DESIGNATED_TIMESTAMP is -1.

QuestDB-managed parquet snapshots represented by _pm always normalize column_top to 0. Null prefixes are materialized directly into parquet chunks/pages, and _pm pruning relies on per-chunk NULL_COUNT rather than file-level column_top metadata.

File header

offsetsizefieldtypedescription
08PARQUET_META_FILE_SIZEu64total committed _pm file size; patched last by the writer and acts as the MVCC commit signal (not covered by CRC)
88FEATURE_FLAGSu64reserved for future format extensions; currently always 0
164DESIGNATED_TIMESTAMPi32index of the designated timestamp in descriptors (or -1)
204SORTING_COLUMN_COUNTu32
244COLUMN_COUNTu32
284RESERVEDu32must be 0 (alignment padding)
32..COLUMN_DESCRIPTORSCOLUMN_COUNT * Column descriptor (32B each)
....SORTING_COLUMNSSORTING_COLUMN_COUNT * Sorting column (4B each)
....NAME_STRINGSColumn names, each [utf8 bytes]; length from descriptor's NAME_LENGTH
....HEADER FEATURE SECTIONSFeature-flag-gated sections, in bit order. See "Bloom filters" below.

The latest footer lives at PARQUET_META_FILE_SIZE - 4 - FOOTER_LENGTH, where FOOTER_LENGTH is read from the 4-byte trailer at PARQUET_META_FILE_SIZE - 4. Readers do not consult ff.length() / stat() — the filesystem size is not a commit boundary.

For a column to be the designated timestamp it must comply to these rules:

  • It must be the first column in sorting columns, sorted in ascending order
  • The column type must be timestamp
  • The column repetition must be required (no nulls allowed)

Column descriptor (32 bytes)

Per-column metadata. Written once in the header, applies across all row groups.

offsetsizefieldtypedescription
08NAME_OFFSETu64offset from the file start to column name (utf-8 encoded, not null-terminated)
84IDi32index of the column related to QuestDB schema (or -1)
124TYPEi32QuestDB column type code
164FLAGSi32Column flags
204FIXED_BYTE_LENi32For FIXED_LEN_BYTE_ARRAY physical type: the fixed length in bytes (matches parquet type_length). 0 otherwise.
244NAME_LENGTHu32length of the column name in bytes
281PHYSICAL_TYPEu8Parquet physical type: 0=BOOLEAN, 1=INT32, 2=INT64, 3=INT96, 4=FLOAT, 5=DOUBLE, 6=BYTE_ARRAY, 7=FIXED_LEN_BA
291MAX_REP_LEVELu8Maximum repetition level (0 for non-nested columns)
301MAX_DEF_LEVELu8Maximum definition level (0 for required, 1 for optional)
311RESERVEDu8Reserved, must be 0

Column flags

bit offsetbit sizefieldtypedescription
01LOCAL_KEY_IS_GLOBALi1Symbol
11IS_ASCIIi1Varchar
22FIELD_REPETITIONu20 = Required, 1 = Optional, 2 = Repeated
41DESCENDINGi1For sorted column, 1 = Descending
527RESERVEDReserved, must be 0

Sorting column (4 bytes)

Alignment: 4 bytes.

offsetsizefieldtypedescription
04INDEXu32Ordinal position of the column in column descriptors

Row group blocks

Written sequentially after the header. Each block holds the column chunk metadata for one row group. Column chunks are fixed-size (64 bytes), so block size = 8 + COLUMN_COUNT * 64. On update, new/changed blocks are appended after the old footer; unchanged blocks are left in place and referenced by the new footer.

Blocks must be aligned to 8 bytes so that the offset in the footer can be stored as a u32 (actual offset = value << 3).

For types > 8 bytes (LONG128, UUID, LONG256), min/max stat values are stored out-of-line immediately after the column chunks. When inlined bloom filters are present (feature flag bit 0 set, bit 1 clear), bloom filter bitsets follow the out-of-line stats, each padded to 8-byte alignment. All out-of-line data is part of the row group block and written together with it. References to the bitsets are in the footer feature section, not in the column chunk struct.

Row group block

offsetsizefieldtypedescription
08NUM_ROWSu64
8..COLUMN_CHUNKSCOLUMN_COUNT * Column chunk (64B each)

Column chunk (64 bytes)

Per-column-chunk metadata needed to locate and decode data from the parquet file.

offsetsizefieldtypedescription
01CODECu8parquet CompressionCodec enum: 0=UNCOMPRESSED, 1=SNAPPY, 2=GZIP, 3=LZO, 4=BROTLI, 5=LZ4, 6=ZSTD, 7=LZ4_RAW
11ENCODINGSu8bitmask: bit 0=PLAIN, 1=RLE_DICTIONARY, 2=DELTA_BINARY_PACKED, 3=DELTA_LENGTH_BYTE_ARRAY, 4=DELTA_BYTE_ARRAY, 5=BYTE_STREAM_SPLIT
21STAT_FLAGSu8
31STAT_SIZESu8low nibble = MIN_STAT byte size (inline only), high nibble = MAX_STAT byte size (inline only)
44RESERVEDu32must be 0. Previously held bloom filter offsets, now moved to footer feature sections. Exists for layout preservation.
88NUM_VALUESu64total values (may differ from row count for arrays)
168BYTE_RANGE_STARTu64byte offset in parquet file to chunk start (dictionary page offset if present, else data page offset)
248TOTAL_COMPRESSEDu64total compressed bytes of all pages
328NULL_COUNTu64number of nulls; mandatory for QuestDB-managed _pm files
408DISTINCT_COUNTu64number of distinct values
488MIN_STATu64inline min value (parquet physical-width LE bytes, low bytes of the u64) or OOL reference (see Stat encoding)
568MAX_STATu64inline max value (parquet physical-width LE bytes, low bytes of the u64) or OOL reference (see Stat encoding)

STAT_FLAGS interpretation

bit offsetbit sizefieldtypedescription
01MIN_STAT_PRESENTi1Indicates if MIN_STAT is present
11MIN_STAT_INLINEDi1Indicates if MIN_STAT is inlined
21MIN_STAT_VALUE_EXACTi1Indicates if MIN_STAT value is exact
31MAX_STAT_PRESENTi1Indicates if MAX_STAT is present
41MAX_STAT_INLINEDi1Indicates if MAX_STAT is inlined
51MAX_STAT_VALUE_EXACTi1Indicates if MAX_STAT value is exact
61DISTINCT_COUNT_PRESENTi1Indicates if DISTINCT_COUNT is present
71NULL_COUNT_PRESENTi1Indicates if NULL_COUNT is present

Stat encoding

MIN_STAT and MAX_STAT hold parquet stat bytes verbatim — at parquet physical type width and in parquet-native units. No QuestDB-side conversion (no narrowing, no Date days-to-millis, no Timestamp millis-to-micros, no INT96-to-nanos) happens between the parquet file and the _pm slot. Readers that need a QuestDB-native interpretation must apply it themselves at read time, using the column's physical type and logical type from the parquet schema.

Readers consume inline stats at parquet physical type width:

Inline placement is gated purely by stat byte width: stats whose min_value/max_value payloads are 1..=8 bytes inline into the u64 slot, longer payloads spill out-of-line. The QuestDB column type does not constrain placement, so a short variable-length stat (e.g., a 4-byte VARCHAR min) can occupy the slot the same way a primitive stat does.

parquet physical typeinline widthplacement in u64 slot
BOOLEAN1 bytelow byte holds 0 or 1; remaining 7 bytes are zero
INT32 / FLOAT4 byteslow 4 bytes hold the LE value; high 4 bytes are zero
INT64 / DOUBLE8 bytesthe full u64 holds the LE value
FIXED_LEN_BYTE_ARRAYfixed_byte_len bytes when <= 8low fixed_byte_len bytes; rest zero
BYTE_ARRAYup to 8 bytes when the stat payload fitslow STAT_SIZES.min/max_size bytes; rest zero
INT96always out-of-line (12 bytes)u64 slot holds the OOL reference

When stats are stored out-of-line (MIN_STAT_INLINED / MAX_STAT_INLINED clear), the u64 slot encodes a reference into the row group block's out-of-line region as (offset << 16) | length. The OOL bytes are the same parquet stat bytes, verbatim, with no width or unit conversion.

Stats with payload width 1..=8 bytes inline into the slot; stats with payload width > 8 bytes (e.g., LONG128, UUID, LONG256, INT96, or long VARCHAR/STRING values) MUST be stored out-of-line immediately after the row group blocks that reference them. In practice this means primitive stats for fixed-size types of <= 8 bytes (BOOLEAN, BYTE, SHORT, CHAR, INT/FLOAT/IPv4, LONG/DOUBLE/DATE/TIMESTAMP) always inline, and short SYMBOL/VARCHAR/STRING stats can also inline when the payload fits.

Inline stats for narrow signed types (BYTE, SHORT, GeoByte, GeoShort) backed by parquet INT32 occupy 4 bytes (the parquet physical width), not the QuestDB-native 1- or 2-byte width: skip-pruning code reads the slot at parquet physical width, so a sub-physical encoding would round-trip negatives incorrectly.

QuestDB-managed _pm files always set NULL_COUNT_PRESENT. Readers use NULL_COUNT == NUM_VALUES as the all-null fast path when deciding whether a parquet column chunk needs to be fetched and decoded.

Bloom filters

Bloom filter metadata is gated by feature flag bits 0 and 1 in the header.

Header section (bit 0, after name strings): declares which columns have bloom filters.

offsetsizefieldtypedescription
04BLOOM_COL_COUNTu32number of columns with bloom filters (> 0)
4BLOOM_COL_COUNT * 4BLOOM_FILTER_COLUMNSu32[]column indices, sorted ascending, unique

All indices must satisfy index < COLUMN_COUNT. The reader rejects the file otherwise.

Footer section (bit 0, after row group entries, before CRC): dense ROW_GROUP_COUNT * BLOOM_COL_COUNT matrix, row-major. Entry [rg_idx * BLOOM_COL_COUNT + pos] where pos is the column's position in BLOOM_FILTER_COLUMNS.

  • Inlined (bit 1 = 0): each entry is a u32 — the absolute _pm offset right-shifted by 3. 0 = absent for that (row_group, bloom_col) pair.
  • External (bit 1 = 1): each entry is (u64 offset, u64 length) into the parquet file. (0, 0) = absent.

Bitset storage:

  • Inlined: bitsets live in the row group block's out-of-line region, padded to 8-byte alignment. At the offset: [i32 LENGTH][bitset bytes].
  • External: bitsets live in the parquet file. The footer section entry gives offset and length.

Invariants:

  • Bit 0 is only set when at least one column has a bloom filter. If no columns have bloom filters, neither bit is set and no sections are written.
  • Bit 1 requires bit 0; setting bit 1 alone is invalid.
  • BLOOM_FILTER_COLUMNS is fixed at file creation time. Update mode does not rewrite the header; compaction (full rewrite) changes which columns have bloom filters.
  • A column can be in BLOOM_FILTER_COLUMNS but absent in some row groups (sentinel value 0 or (0, 0)).

Bloom filter bitset (inlined)

offsetsizefieldtypedescription
04LENGTHi32length of the bloom filter bitset in bytes (not bits)
4..BITSETbloom filter bitset

The parquet file size is stored in _txn field 3. The reader locates the latest footer via the trailer at PARQUET_META_FILE_SIZE - 4: footer_offset = PARQUET_META_FILE_SIZE - 4 - FOOTER_LENGTH. For MVCC, the reader walks the chain via PREV_PARQUET_META_FILE_SIZE on each footer — the same size-then-trailer indirection that the header uses for the latest footer, so each walk-back step re-validates the previous footer's location through its own trailer.

The CRC covers all bytes after PARQUET_META_FILE_SIZE: [8, CRC_field). This protects feature flags, column descriptors, row group blocks, and footer content, while excluding the mutable PARQUET_META_FILE_SIZE field at offset 0. It is located via FOOTER_LENGTH: CRC offset = footer_start + FOOTER_LENGTH - 4.

offsetsizefieldtypedescription
08PARQUET_FOOTER_OFFSETu64byte offset in the parquet file where the parquet footer starts
84PARQUET_FOOTER_LENGTHu32length of the parquet footer in bytes
124ROW_GROUP_COUNTu32
168UNUSED_BYTESu64accumulated dead bytes in the parquet file (old footers + replaced row group data)
248PREV_PARQUET_META_FILE_SIZEu64committed _pm file size at the previous snapshot (0 if first); walk back via trailer at prev - 4
328FOOTER_FEATURE_FLAGSu64per-footer feature flags; independent of the header's FEATURE_FLAGS
40..ROW_GROUP_ENTRIESROW_GROUP_COUNT * Row group entry (4B each)
....FOOTER_FEATURE_SECTIONSFeature-flag-gated sections, in bit order (may be empty)
..4CHECKSUMu32CRC32 over bytes [8, this field) — all content after PARQUET_META_FILE_SIZE
..4FOOTER_LENGTHu32total bytes from footer start through CHECKSUM (inclusive); NOT covered by CHECKSUM

The parquet file size is derived from the footer metadata: parquet_file_size = PARQUET_FOOTER_OFFSET + PARQUET_FOOTER_LENGTH + 8 (4B parquet footer length field + 4B PAR1 magic). This eliminates the need to store the parquet file size separately.

Row group entry (4 bytes)

offsetsizefieldtypedescription
04BLOCK_OFFSETu32byte offset from file start >> 3 (actual = value << 3)

Concurrent Writing/Reading

Atomicity is provided by the _txn file. The parquet file size is stored in _txn field 3 for each partition, serving as the MVCC version token.

Writer flow:

  1. Write/update data.parquet.
  2. Write the _pm metadata file. On update, append new/changed row group blocks after the old trailer, then write a new footer (with CRC) and a new 4-byte FOOTER_LENGTH trailer at the end. Patch PARQUET_META_FILE_SIZE in the header as the last write — this is the MVCC commit signal. Readers see either the old committed size (with the old trailer at old_parquet_meta_file_size - 4) or the new one.
  3. Commit _txn (A/B buffered), updating the partition name txn and parquet file size.

Reader flow:

  1. Read _txn via safeReadTxn() (spin-lock with version check). Obtain parquet file size from field 3.
  2. Memory-map the first 32 bytes of the _pm file. Read PARQUET_META_FILE_SIZE from offset 0.
  3. Remap (or open-and-map) parquet_meta_file_size bytes of the _pm file. Do not consult stat() / ff.length() — the filesystem size may include unpublished bytes.
  4. Read the 4-byte trailer at PARQUET_META_FILE_SIZE - 4 to get FOOTER_LENGTH; derive footer_offset = PARQUET_META_FILE_SIZE - 4 - FOOTER_LENGTH.
  5. Walk the MVCC chain: derive parquet file size from the current footer (PARQUET_FOOTER_OFFSET + PARQUET_FOOTER_LENGTH + 8). If it matches the _txn snapshot, stop. Otherwise, read PREV_PARQUET_META_FILE_SIZE from the current footer and repeat from step 4 with the new size. Each step re-validates the previous footer location via its own trailer.
  6. Read the footer: PARQUET_FOOTER_OFFSET, PARQUET_FOOTER_LENGTH, ROW_GROUP_COUNT, UNUSED_BYTES, and row group entries.
  7. For metadata-only operations (timestamp stats), read directly from the _pm file.
  8. For data operations, memory-map data.parquet using the parquet file size from _txn.

Rewrite mode (new partition directory with new name txn): new metadata file created. No concurrent access until _txn flips.

Update mode (same partition directory): new row group blocks appended after old trailer, new footer and trailer written at the end. Unchanged row groups keep their old offsets. The header's PARQUET_META_FILE_SIZE is patched last for atomicity. Readers pinned to an older _txn snapshot read the committed-at-that-time PARQUET_META_FILE_SIZE (from their own earlier mapping) and walk the PREV_PARQUET_META_FILE_SIZE chain from the matching trailer to find their footer.

Access patterns

Decoding a specific column chunk from cold storage

  • Look up the column chunk from cached metadata.
  • If NULL_COUNT == NUM_VALUES, treat the chunk as all-null and skip the parquet fetch.
  • Fetch [BYTE_RANGE_START, BYTE_RANGE_START + TOTAL_COMPRESSED) from cold storage.
  • Use CODEC to decompress; page headers carry per-page encoding.

Finding a specific row group by timestamp

  • Read each cached row group's timestamp column chunk MIN/MAX STAT.
  • Binary search by timestamp range.

Pruning row groups with bloom filter

  • Check if the BLOOM_FILTERS feature flag (bit 0) is set in the header.
  • Look up the column's position in BLOOM_FILTER_COLUMNS via binary search.
  • Read the bloom filter offset from the footer feature section at [rg_idx * bloom_col_count + pos].
  • For inlined mode: read LENGTH (i32) at the offset in the _pm file, then read LENGTH bytes of bitset.
  • For external mode: read the bitset from the parquet file at (offset, length).
  • Check if the value is in the bitset; skip row group if not present.

Retrieving byte ranges for specific columns

  • For each row group and column, look up the cached column chunk.
  • Byte range: [BYTE_RANGE_START, BYTE_RANGE_START + TOTAL_COMPRESSED).

Migration strategy

We rely on QuestDB's existing migration system to run the migration passes.

Initial setup

We pass over every partitions for each table and rely on the existing parquet files to generate the new metadata files at their last version. As cold-storage depends on this feature, no object-storage access is required for this migration.

Evolution

Feature flags are reserved for future extensions. Required features (bits 32-63) allow the reader to reject files that it cannot correctly interpret.

Migration from an older to a newer version mustn't require having access to the parquet file (in order to avoid cold-storage access). This is easily feasible as we control the partitions files, thus we can safely fill in the new metadata fields with default values that indicate the absence of the new feature (e.g. no encryption, already used bitmap filter algorithm).

External parquet files needs to see their metadata file invalidated.

Integration in a Multi-version Concurrency Controlled database

In QuestDB, the _txn file is responsible to tell the reader which partitions exists and where they are stored. Concurrent access between the TxWriter and TxReader relies on A/B double-buffering to remove needs for locks.

When adding a new row-group to a parquet partition, instead of rewriting the whole file, the row-group is appended to the file (after the existing footer) and a new footer is written afterwards. The header's FOOTER_OFFSET is patched last to point to the new footer. The new parquet file size is written to _txn field 3 so that readers can identify which footer matches their snapshot.

Existing readers continue to see their committed data by walking the PREV_PARQUET_META_FILE_SIZE chain: each footer links to the previous one, and the reader selects the footer whose derived parquet file size matches the parquet file size from its _txn snapshot.

As this file's purpose is to reflect the underlying parquet file, the same behavior is used to update the file. Whenever a new row-group is added to the original parquet file, it's also added to this file. A new footer (with CRC + trailer)is appended, the header's FOOTER_OFFSET is patched, and the parquet file size is updated in _txn.

Compaction

As new row-groups and footers are written into parquet files, more and more space is wasted. When a certain threshold is exceeded, the file is compacted in a new partition directory and the _txn file is updated to point to this new file. Following this strategy, the metadata file is also written again from scratch when this compaction occurs, keeping it as small as possible.