docs/parquet-metadata.md
data.parquet in the partition directory.

Currently stored under the "questdb" key in the parquet file's key-value metadata as JSON. This will be replaced by
the binary format below.
{
  "version": 1,
  "schema": [
    {
      "column_type": 12,
      "column_top": 0,
      "format": 1
    },
    {
      "column_type": 5,
      "column_top": 256
    },
    {
      "column_type": 26,
      "column_top": 0,
      "ascii": true
    }
  ],
  "unused_bytes": 4096
}
Binary file encoded in little-endian. One file per partition, stored in the partition directory alongside data.parquet.
The file has a header with column descriptors, row group blocks in the middle, and a footer at the end. The committed
_pm file size is stored in the header's first 8 bytes (PARQUET_META_FILE_SIZE) and is patched last by the writer — readers
treat it as the MVCC commit signal. The footer ends with a 4-byte trailer that stores the footer length, so the latest
footer is at PARQUET_META_FILE_SIZE - 4 - footer_length. The parquet file size (separate concept) is stored in _txn field 3
for each partition. Row group blocks are referenced by offset from the footer. On update, only new/changed row group
blocks are appended; the footer reuses offsets to unchanged blocks. The file is small (typically tens of KB),
memory-mapped and cached in TableReader. Bloom filter bitsets are stored in the out-of-line region of each row group
block (inlined mode) or referenced from the parquet file (external mode), with offsets in the footer feature section.
Callers never use the filesystem's reported file size to bound an _pm read or mapping. The on-disk length may
include bytes from an in-progress, unpublished append and is not a valid commit boundary; only PARQUET_META_FILE_SIZE in the
header is. A reader mmaps the 32-byte header prefix, reads PARQUET_META_FILE_SIZE, then remaps to that size.
_pm metadata file data.parquet
+================================+ +==========================+
| HEADER | | |
| parquet_meta_file_size -------+--> end of file | ...column chunks... |
| feature_flags | | |
| designated_timestamp | +------->| dict page | data pages |
| sorting_column_count | | | |
| column_count | | +==========================+
| | |
| COLUMN DESCRIPTORS | |
| col 0: name, type, .. | |
| col 1: name, type, .. | |
| ... | |
| | |
| SORTING COLUMNS | |
| col indices | |
| | |
| NAME STRINGS | |
| | |
| HEADER FEATURE SECTIONS | |
| (if any flags set) | |
+--------------------------------+ |
| ROW GROUP BLOCK 0 | |
| num_rows | |
| chunk col 0: | |
| codec, encodings | |
| byte_range_start ----------+---------+
| total_compressed |
| null_count |
| _reserved (0) |
| min_stat, max_stat |
| chunk col 1: ... |
| ... |
| (out-of-line stats) |
| (bloom filter bitsets) |
+--------------------------------+
| ROW GROUP BLOCK 1 |
| ... |
+--------------------------------+
| ... |
+--------------------------------+
| FOOTER |
| parquet_footer_offset |
| parquet_footer_length |
| row_group_count |
| unused_bytes |
| prev_parquet_meta_file_size | (0 if first version; trailer at prev - 4)
| footer_feature_flags | (per-footer flags)
| entry 0: offset --------------+--> ROW GROUP BLOCK 0
| entry 1: offset --------------+--> ROW GROUP BLOCK 1
| ... |
| FOOTER FEATURE SECTIONS |
| bloom filter offsets |
| (if BLOOM_FILTERS set) |
| CRC32 |
| FOOTER_LENGTH (4B) | <-- trailer at parquet_meta_file_size - 4
_txn field 3: +================================+
parquet file size
The latest footer is located by footer_offset = parquet_meta_file_size - 4 - FOOTER_LENGTH, where
FOOTER_LENGTH is read from the 4-byte trailer at parquet_meta_file_size - 4.
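
The lookup above can be sketched in Python. This is a minimal illustration under stated assumptions, not QuestDB's reader: the function name locate_latest_footer and the bounds checks are hypothetical, and the sketch takes an in-memory buffer instead of remapping a file. Only the field offsets and the formula come from the text.

```python
import struct

HEADER_PREFIX_SIZE = 32  # fixed header prefix that precedes the column descriptors
TRAILER_SIZE = 4         # u32 FOOTER_LENGTH trailer


def locate_latest_footer(buf):
    """Return (footer_offset, footer_length) of the latest committed footer.

    The buffer may be longer than the committed size (unpublished appends);
    only PARQUET_META_FILE_SIZE at offset 0 bounds the read.
    """
    if len(buf) < HEADER_PREFIX_SIZE:
        raise ValueError("buffer shorter than the header prefix")
    # PARQUET_META_FILE_SIZE: little-endian u64 at offset 0.
    (committed_size,) = struct.unpack_from("<Q", buf, 0)
    # Assumed sanity checks (not spelled out in the text).
    if committed_size > len(buf) or committed_size < HEADER_PREFIX_SIZE + TRAILER_SIZE:
        raise ValueError("committed size out of range")
    # FOOTER_LENGTH: little-endian u32 in the last 4 committed bytes.
    (footer_length,) = struct.unpack_from("<I", buf, committed_size - TRAILER_SIZE)
    footer_offset = committed_size - TRAILER_SIZE - footer_length
    if footer_offset < HEADER_PREFIX_SIZE:
        raise ValueError("footer would overlap the header")
    return footer_offset, footer_length
```

Bytes past the committed size are never inspected, which mirrors the rule that the filesystem length is not a commit boundary.
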
Update mode - only changed blocks are appended; unchanged blocks are reused:
+================================+
| HEADER |
| parquet_meta_file_size -------+--> end of new file (patched last)
+--------------------------------+
| ROW GROUP BLOCK 0 | <-- unchanged, kept in place
+--------------------------------+
| ROW GROUP BLOCK 1 | <-- was merged, old data now dead
+--------------------------------+
| (old footer) | <-- prev version, still readable via chain
| (old trailer) | <-- still valid for older parquet_meta_file_size
+--------------------------------+
| ROW GROUP BLOCK 1' | <-- new version of block 1
+--------------------------------+
| ROW GROUP BLOCK 2 | <-- newly appended
+--------------------------------+
| FOOTER (new) |
| prev_parquet_meta_file_size --+-->|old trailer| (trailer at prev - 4 gives old footer)
| entry 0: offset --------------+--> BLOCK 0 (old, reused)
| entry 1: offset --------------+--> BLOCK 1' (new)
| entry 2: offset --------------+--> BLOCK 2 (new)
| CRC32 |
| FOOTER_LENGTH (4B) | <-- new trailer at end
_txn field 3: +================================+
parquet file size (new)
Readers pinned to the previous snapshot's parquet_meta_file_size still see the old trailer at their
snapshot's end-of-file and walk the prev_parquet_meta_file_size chain to the footer matching their
_txn parquet-size token.
_pm has two independent u64 feature-flag fields:
- FEATURE_FLAGS in the header applies file-wide. It covers every footer in the MVCC chain and is the right place for capabilities that are the same for every snapshot of the file.
- FOOTER_FEATURE_FLAGS in each footer applies only to that footer. Two footers reachable via PREV_PARQUET_META_FILE_SIZE can carry different sets of footer flags, enabling per-snapshot feature sections.

Both fields share the same bit policy:
Feature sections appear in bit order. Header-gated sections live at the end of
the header (after name strings); footer-gated sections live at the end of the
footer (after row group entries, before the CRC). The footer-trailer's
FOOTER_LENGTH bounds all footer sections so readers can locate the CRC
without recognizing every bit.
No footer flag bits are defined yet. Header flag bits:
| bit | name | dependency | header section | footer section |
|---|---|---|---|---|
| 0 | BLOOM_FILTERS | none | 4 + bloom_col_count * 4 bytes: [u32 bloom_col_count][u32; bloom_col_count] column indices (sorted ascending, unique) | row_group_count * bloom_col_count * 4 bytes: inlined offsets (>>3) into _pm; 0 = absent |
| 1 | BLOOM_FILTERS_EXTERNAL | bit 0 | none (shares bit 0 header section) | entry width grows from 4 to 16 bytes: [(u64 offset, u64 length); row_group_count * bloom_col_count] into the parquet file; (0, 0) = absent |
| 2 | SORTING_IS_DTS_ASC | none | none | none |
Bit 0 is only set when at least one column has a bloom filter. Bit 1 cannot be set without bit 0; the reader rejects the file otherwise. Feature sections are ordered by bit position.
Bit 2 indicates that the partition's sorting order is implicitly the designated timestamp column in ascending order. The
on-disk SORTING_COLUMN_COUNT is 0 and the SORTING_COLUMNS section is absent, but readers treat the partition as sorted
by [DESIGNATED_TIMESTAMP] ascending. The designated timestamp column's DESCENDING flag must not be set. This flag is
only valid when DESIGNATED_TIMESTAMP >= 0; writers must not set it when DESIGNATED_TIMESTAMP is -1.
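
The dependency rules for the header flag bits can be checked mechanically. The sketch below is illustrative only: the function name validate_header_flags and the choice to return a list of messages are assumptions, and treating a non-zero SORTING_COLUMN_COUNT as a violation when bit 2 is set is an interpretation of the text, not a documented reader behavior.

```python
BLOOM_FILTERS = 1 << 0
BLOOM_FILTERS_EXTERNAL = 1 << 1
SORTING_IS_DTS_ASC = 1 << 2


def validate_header_flags(flags, designated_timestamp, sorting_column_count):
    """Return a list of rule violations; an empty list means the flags are consistent."""
    errors = []
    # Bit 1 cannot be set without bit 0.
    if (flags & BLOOM_FILTERS_EXTERNAL) and not (flags & BLOOM_FILTERS):
        errors.append("BLOOM_FILTERS_EXTERNAL requires BLOOM_FILTERS")
    if flags & SORTING_IS_DTS_ASC:
        # Bit 2 is only valid when a designated timestamp exists.
        if designated_timestamp < 0:
            errors.append("SORTING_IS_DTS_ASC requires DESIGNATED_TIMESTAMP >= 0")
        # Assumption: the implicit sort order excludes an explicit SORTING_COLUMNS section.
        if sorting_column_count != 0:
            errors.append("SORTING_IS_DTS_ASC expects SORTING_COLUMN_COUNT == 0")
    return errors
```
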
QuestDB-managed parquet snapshots represented by _pm always normalize column_top to 0. Null prefixes are
materialized directly into parquet chunks/pages, and _pm pruning relies on per-chunk NULL_COUNT rather than
file-level column_top metadata.
| offset | size | field | type | description |
|---|---|---|---|---|
| 0 | 8 | PARQUET_META_FILE_SIZE | u64 | total committed _pm file size; patched last by the writer and acts as the MVCC commit signal (not covered by CRC) |
| 8 | 8 | FEATURE_FLAGS | u64 | file-wide feature flags; see the header flag bits table (bits 0-2 are defined, all other bits are currently 0) |
| 16 | 4 | DESIGNATED_TIMESTAMP | i32 | index of the designated timestamp in descriptors (or -1) |
| 20 | 4 | SORTING_COLUMN_COUNT | u32 | |
| 24 | 4 | COLUMN_COUNT | u32 | |
| 28 | 4 | RESERVED | u32 | must be 0 (alignment padding) |
| 32 | .. | COLUMN_DESCRIPTORS | COLUMN_COUNT * Column descriptor (32B each) | |
| .. | .. | SORTING_COLUMNS | SORTING_COLUMN_COUNT * Sorting column (4B each) | |
| .. | .. | NAME_STRINGS | Column names, each [utf8 bytes]; length from descriptor's NAME_LENGTH | |
| .. | .. | HEADER FEATURE SECTIONS | Feature-flag-gated sections, in bit order. See "Bloom filters" below. | |
The latest footer lives at PARQUET_META_FILE_SIZE - 4 - FOOTER_LENGTH, where FOOTER_LENGTH is read from the 4-byte trailer at
PARQUET_META_FILE_SIZE - 4. Readers do not consult ff.length() / stat() — the filesystem size is not a commit boundary.
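
As a rough illustration of the header layout, the 32-byte prefix can be unpacked with Python's struct module. The names HeaderPrefix, parse_header_prefix, and section_offsets are hypothetical, and the check that DESIGNATED_TIMESTAMP is below COLUMN_COUNT is an assumption; the field order, widths, and section sizes are taken from the table.

```python
import struct
from typing import NamedTuple

# u64, u64, i32, u32, u32, u32 in little-endian with no padding: 32 bytes.
HEADER_PREFIX = struct.Struct("<QQiIII")


class HeaderPrefix(NamedTuple):
    parquet_meta_file_size: int
    feature_flags: int
    designated_timestamp: int
    sorting_column_count: int
    column_count: int
    reserved: int


def parse_header_prefix(buf):
    header = HeaderPrefix(*HEADER_PREFIX.unpack_from(buf, 0))
    if header.reserved != 0:
        raise ValueError("RESERVED must be 0")
    # Assumption: a designated timestamp index must refer to an existing descriptor.
    if header.designated_timestamp >= header.column_count:
        raise ValueError("DESIGNATED_TIMESTAMP out of range")
    return header


def section_offsets(header):
    """Return the offsets of COLUMN_DESCRIPTORS, SORTING_COLUMNS and NAME_STRINGS."""
    descriptors = HEADER_PREFIX.size
    sorting = descriptors + header.column_count * 32        # 32B per descriptor
    names = sorting + header.sorting_column_count * 4       # 4B per sorting column
    return descriptors, sorting, names
```
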
For a column to be the designated timestamp, it must comply with these rules:
- ascending order
- timestamp
- required (no nulls allowed)

Per-column metadata. Written once in the header, applies across all row groups.
| offset | size | field | type | description |
|---|---|---|---|---|
| 0 | 8 | NAME_OFFSET | u64 | offset from the file start to column name (utf-8 encoded, not null-terminated) |
| 8 | 4 | ID | i32 | index of the column related to QuestDB schema (or -1) |
| 12 | 4 | TYPE | i32 | QuestDB column type code |
| 16 | 4 | FLAGS | i32 | Column flags |
| 20 | 4 | FIXED_BYTE_LEN | i32 | For FIXED_LEN_BYTE_ARRAY physical type: the fixed length in bytes (matches parquet type_length). 0 otherwise. |
| 24 | 4 | NAME_LENGTH | u32 | length of the column name in bytes |
| 28 | 1 | PHYSICAL_TYPE | u8 | Parquet physical type: 0=BOOLEAN, 1=INT32, 2=INT64, 3=INT96, 4=FLOAT, 5=DOUBLE, 6=BYTE_ARRAY, 7=FIXED_LEN_BA |
| 29 | 1 | MAX_REP_LEVEL | u8 | Maximum repetition level (0 for non-nested columns) |
| 30 | 1 | MAX_DEF_LEVEL | u8 | Maximum definition level (0 for required, 1 for optional) |
| 31 | 1 | RESERVED | u8 | Reserved, must be 0 |
| bit offset | bit size | field | type | description |
|---|---|---|---|---|
| 0 | 1 | LOCAL_KEY_IS_GLOBAL | i1 | Symbol |
| 1 | 1 | IS_ASCII | i1 | Varchar |
| 2 | 2 | FIELD_REPETITION | u2 | 0 = Required, 1 = Optional, 2 = Repeated |
| 4 | 1 | DESCENDING | i1 | For sorted column, 1 = Descending |
| 5 | 27 | RESERVED | | Reserved, must be 0 |
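
The descriptor and its flag bits can be decoded as sketched below. The function names and the dictionary shape are hypothetical, and rejecting non-zero reserved bits is an assumption about reader strictness; offsets and bit positions follow the two tables.

```python
import struct

# u64, i32 x4, u32, u8 x4 in little-endian with no padding: 32 bytes.
COLUMN_DESCRIPTOR = struct.Struct("<QiiiiIBBBB")


def decode_column_flags(flags):
    flags &= 0xFFFFFFFF  # FLAGS is stored as i32; treat it as a raw bit pattern
    if flags >> 5:
        raise ValueError("reserved flag bits must be 0")
    return {
        "local_key_is_global": bool(flags & 1),
        "is_ascii": bool((flags >> 1) & 1),
        "field_repetition": (flags >> 2) & 0b11,  # 0 = Required, 1 = Optional, 2 = Repeated
        "descending": bool((flags >> 4) & 1),
    }


def parse_column_descriptor(buf, offset):
    (name_offset, column_id, column_type, flags, fixed_byte_len, name_length,
     physical_type, max_rep_level, max_def_level, reserved) = \
        COLUMN_DESCRIPTOR.unpack_from(buf, offset)
    if reserved != 0:
        raise ValueError("RESERVED must be 0")
    # NAME_OFFSET is measured from the file start; the name is not null-terminated.
    name = bytes(buf[name_offset:name_offset + name_length]).decode("utf-8")
    return {
        "name": name,
        "id": column_id,
        "type": column_type,
        "flags": decode_column_flags(flags),
        "fixed_byte_len": fixed_byte_len,
        "physical_type": physical_type,
        "max_rep_level": max_rep_level,
        "max_def_level": max_def_level,
    }
```
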
Alignment: 4 bytes.
| offset | size | field | type | description |
|---|---|---|---|---|
| 0 | 4 | INDEX | u32 | Ordinal position of the column in column descriptors |
Written sequentially after the header. Each block holds the column chunk metadata for one row group. Column chunks are fixed-size (64 bytes), so block size = 8 + COLUMN_COUNT * 64. On update, new/changed blocks are appended after the old footer; unchanged blocks are left in place and referenced by the new footer.
Blocks must be aligned to 8 bytes so that the offset in the footer can be stored as a u32 (actual offset = value << 3).
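
A small sketch of the offset packing, with hypothetical helper names. Because the low three bits of an 8-byte-aligned offset are always zero, dropping them lets a u32 address just under 32 GiB.

```python
def align8(n):
    """Round n up to the next multiple of 8."""
    return (n + 7) & ~7


def encode_block_offset(offset):
    """Pack an 8-byte-aligned file offset into the u32 stored in the footer."""
    if offset % 8 != 0:
        raise ValueError("row group blocks must be 8-byte aligned")
    value = offset >> 3
    if value > 0xFFFFFFFF:
        raise ValueError("offset does not fit in a u32 after shifting")
    return value


def decode_block_offset(value):
    """Recover the byte offset from the stored u32."""
    return value << 3
```
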
For types > 8 bytes (LONG128, UUID, LONG256), min/max stat values are stored out-of-line immediately after the column chunks. When inlined bloom filters are present (feature flag bit 0 set, bit 1 clear), bloom filter bitsets follow the out-of-line stats, each padded to 8-byte alignment. All out-of-line data is part of the row group block and written together with it. References to the bitsets are in the footer feature section, not in the column chunk struct.
| offset | size | field | type | description |
|---|---|---|---|---|
| 0 | 8 | NUM_ROWS | u64 | |
| 8 | .. | COLUMN_CHUNKS | COLUMN_COUNT * Column chunk (64B each) | |
Per-column-chunk metadata needed to locate and decode data from the parquet file.
| offset | size | field | type | description |
|---|---|---|---|---|
| 0 | 1 | CODEC | u8 | parquet CompressionCodec enum: 0=UNCOMPRESSED, 1=SNAPPY, 2=GZIP, 3=LZO, 4=BROTLI, 5=LZ4, 6=ZSTD, 7=LZ4_RAW |
| 1 | 1 | ENCODINGS | u8 | bitmask: bit 0=PLAIN, 1=RLE_DICTIONARY, 2=DELTA_BINARY_PACKED, 3=DELTA_LENGTH_BYTE_ARRAY, 4=DELTA_BYTE_ARRAY, 5=BYTE_STREAM_SPLIT |
| 2 | 1 | STAT_FLAGS | u8 | |
| 3 | 1 | STAT_SIZES | u8 | low nibble = MIN_STAT byte size (inline only), high nibble = MAX_STAT byte size (inline only) |
| 4 | 4 | RESERVED | u32 | must be 0. Previously held bloom filter offsets, now moved to footer feature sections. Exists for layout preservation. |
| 8 | 8 | NUM_VALUES | u64 | total values (may differ from row count for arrays) |
| 16 | 8 | BYTE_RANGE_START | u64 | byte offset in parquet file to chunk start (dictionary page offset if present, else data page offset) |
| 24 | 8 | TOTAL_COMPRESSED | u64 | total compressed bytes of all pages |
| 32 | 8 | NULL_COUNT | u64 | number of nulls; mandatory for QuestDB-managed _pm files |
| 40 | 8 | DISTINCT_COUNT | u64 | number of distinct values |
| 48 | 8 | MIN_STAT | u64 | inline min value (parquet physical-width LE bytes, low bytes of the u64) or OOL reference (see Stat encoding) |
| 56 | 8 | MAX_STAT | u64 | inline max value (parquet physical-width LE bytes, low bytes of the u64) or OOL reference (see Stat encoding) |
| bit offset | bit size | field | type | description |
|---|---|---|---|---|
| 0 | 1 | MIN_STAT_PRESENT | i1 | Indicates if MIN_STAT is present |
| 1 | 1 | MIN_STAT_INLINED | i1 | Indicates if MIN_STAT is inlined |
| 2 | 1 | MIN_STAT_VALUE_EXACT | i1 | Indicates if MIN_STAT value is exact |
| 3 | 1 | MAX_STAT_PRESENT | i1 | Indicates if MAX_STAT is present |
| 4 | 1 | MAX_STAT_INLINED | i1 | Indicates if MAX_STAT is inlined |
| 5 | 1 | MAX_STAT_VALUE_EXACT | i1 | Indicates if MAX_STAT value is exact |
| 6 | 1 | DISTINCT_COUNT_PRESENT | i1 | Indicates if DISTINCT_COUNT is present |
| 7 | 1 | NULL_COUNT_PRESENT | i1 | Indicates if NULL_COUNT is present |
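
The 64-byte column chunk and its STAT_FLAGS byte can be unpacked as in the sketch below. The function name, the dictionary keys, and the strict RESERVED check are assumptions made for illustration; the layout follows the two tables above.

```python
import struct

# u8 x4, u32, u64 x7 in little-endian with no padding: 64 bytes.
COLUMN_CHUNK = struct.Struct("<BBBBIQQQQQQQ")

ENCODING_NAMES = ["PLAIN", "RLE_DICTIONARY", "DELTA_BINARY_PACKED",
                  "DELTA_LENGTH_BYTE_ARRAY", "DELTA_BYTE_ARRAY", "BYTE_STREAM_SPLIT"]
STAT_FLAG_NAMES = ["min_stat_present", "min_stat_inlined", "min_stat_value_exact",
                   "max_stat_present", "max_stat_inlined", "max_stat_value_exact",
                   "distinct_count_present", "null_count_present"]


def parse_column_chunk(buf, offset):
    (codec, encodings, stat_flags, stat_sizes, reserved, num_values,
     byte_range_start, total_compressed, null_count, distinct_count,
     min_stat, max_stat) = COLUMN_CHUNK.unpack_from(buf, offset)
    if reserved != 0:
        raise ValueError("RESERVED must be 0")
    chunk = {
        "codec": codec,
        # ENCODINGS is a bitmask; list the names of the bits that are set.
        "encodings": [name for bit, name in enumerate(ENCODING_NAMES)
                      if (encodings >> bit) & 1],
        "min_stat_size": stat_sizes & 0x0F,  # low nibble
        "max_stat_size": stat_sizes >> 4,    # high nibble
        "num_values": num_values,
        "byte_range_start": byte_range_start,
        "total_compressed": total_compressed,
        "null_count": null_count,
        "distinct_count": distinct_count,
        "min_stat": min_stat,
        "max_stat": max_stat,
    }
    for bit, name in enumerate(STAT_FLAG_NAMES):
        chunk[name] = bool((stat_flags >> bit) & 1)
    return chunk
```
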
MIN_STAT and MAX_STAT hold parquet stat bytes verbatim — at parquet physical type width and in parquet-native units.
No QuestDB-side conversion (no narrowing, no Date days-to-millis, no Timestamp millis-to-micros, no INT96-to-nanos)
happens between the parquet file and the _pm slot. Readers that need a QuestDB-native interpretation must apply it
themselves at read time, using the column's physical type and logical type from the parquet schema.
Readers consume inline stats at parquet physical type width:
Inline placement is gated purely by stat byte width: stats whose min_value/max_value payloads are 1..=8 bytes
inline into the u64 slot, longer payloads spill out-of-line. The QuestDB column type does not constrain placement, so a
short variable-length stat (e.g., a 4-byte VARCHAR min) can occupy the slot the same way a primitive stat does.
| parquet physical type | inline width | placement in u64 slot |
|---|---|---|
| BOOLEAN | 1 byte | low byte holds 0 or 1; remaining 7 bytes are zero |
| INT32 / FLOAT | 4 bytes | low 4 bytes hold the LE value; high 4 bytes are zero |
| INT64 / DOUBLE | 8 bytes | the full u64 holds the LE value |
| FIXED_LEN_BYTE_ARRAY | fixed_byte_len bytes when <= 8 | low fixed_byte_len bytes; rest zero |
| BYTE_ARRAY | up to 8 bytes when the stat payload fits | low STAT_SIZES.min/max_size bytes; rest zero |
| INT96 | always out-of-line (12 bytes) | u64 slot holds the OOL reference |
When stats are stored out-of-line (MIN_STAT_INLINED / MAX_STAT_INLINED clear), the u64 slot encodes a reference into
the row group block's out-of-line region as (offset << 16) | length. The OOL bytes are the same parquet stat bytes,
verbatim, with no width or unit conversion.
Stats with payload width 1..=8 bytes inline into the slot; stats with payload width > 8 bytes (e.g., LONG128, UUID, LONG256, INT96, or long VARCHAR/STRING values) MUST be stored out-of-line in the row group block that references them, immediately after its column chunks. In practice this means primitive stats for fixed-size types of <= 8 bytes (BOOLEAN, BYTE, SHORT, CHAR, INT/FLOAT/IPv4, LONG/DOUBLE/DATE/TIMESTAMP) always inline, and short SYMBOL/VARCHAR/STRING stats can also inline when the payload fits.
Inline stats for narrow signed types (BYTE, SHORT, GeoByte, GeoShort) backed by parquet INT32 occupy 4 bytes (the parquet physical width), not the QuestDB-native 1- or 2-byte width: skip-pruning code reads the slot at parquet physical width, so a sub-physical encoding would round-trip negatives incorrectly.
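
The round-trip problem can be shown with a negative BYTE value. In this sketch, pack_inline_stat and read_int32_stat are hypothetical helpers: the first places a stat payload in the low bytes of the u64 slot, the second reads the slot at INT32 physical width the way a pruning reader would.

```python
import struct


def pack_inline_stat(payload):
    """Place 1..8 stat bytes in the low bytes of the u64 slot; the rest is zero."""
    if not 1 <= len(payload) <= 8:
        raise ValueError("inline stats must be 1..8 bytes")
    return int.from_bytes(payload.ljust(8, b"\x00"), "little")


def read_int32_stat(slot):
    """Read the slot at parquet INT32 width (low 4 bytes, signed little-endian)."""
    return struct.unpack("<i", slot.to_bytes(8, "little")[:4])[0]


# Stored at parquet physical width (4 bytes): the sign survives.
physical = pack_inline_stat(struct.pack("<i", -5))
# Stored at QuestDB-native BYTE width (1 byte): the sign is lost on read.
narrow = pack_inline_stat(struct.pack("<b", -5))
```

Reading physical at INT32 width returns -5, while narrow decodes to a positive number because the upper three bytes are zero rather than sign-extended.
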
QuestDB-managed _pm files always set NULL_COUNT_PRESENT. Readers use NULL_COUNT == NUM_VALUES as the all-null fast
path when deciding whether a parquet column chunk needs to be fetched and decoded.
Bloom filter metadata is gated by feature flag bits 0 and 1 in the header.
Header section (bit 0, after name strings): declares which columns have bloom filters.
| offset | size | field | type | description |
|---|---|---|---|---|
| 0 | 4 | BLOOM_COL_COUNT | u32 | number of columns with bloom filters (> 0) |
| 4 | BLOOM_COL_COUNT * 4 | BLOOM_FILTER_COLUMNS | u32[] | column indices, sorted ascending, unique |
All indices must satisfy index < COLUMN_COUNT. The reader rejects the file otherwise.
Footer section (bit 0, after row group entries, before CRC): dense ROW_GROUP_COUNT * BLOOM_COL_COUNT matrix,
row-major. Entry [rg_idx * BLOOM_COL_COUNT + pos] where pos is the column's position in BLOOM_FILTER_COLUMNS.
- Inlined mode (bit 1 clear): each entry is a u32, the absolute _pm offset right-shifted by 3. 0 = absent for that (row_group, bloom_col) pair.
- External mode (bit 1 set): each entry is a (u64 offset, u64 length) pair into the parquet file. (0, 0) = absent.

Bitset storage:
- [i32 LENGTH][bitset bytes].

Invariants:
- BLOOM_FILTER_COLUMNS is fixed at file creation time. Update mode does not rewrite the header; compaction (full rewrite) changes which columns have bloom filters.
- A column may be listed in BLOOM_FILTER_COLUMNS but absent in some row groups (sentinel value 0 or (0, 0)).

| offset | size | field | type | description |
|---|---|---|---|---|
| 0 | 4 | LENGTH | i32 | length of the bloom filter bitset in bytes (not bits) |
| 4 | .. | BITSET | | bloom filter bitset |
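
An inlined-mode lookup can be sketched as follows. Both function names are hypothetical, and the bounds checks are assumptions; the matrix indexing, the >> 3 offset encoding, the 0 sentinel, and the [i32 LENGTH][bitset bytes] layout come from the text.

```python
import bisect
import struct


def bloom_entry_index(bloom_columns, rg_idx, column_index):
    """Index into the dense row-major footer matrix, or None if the column has no filter.

    bloom_columns is BLOOM_FILTER_COLUMNS (sorted ascending, unique).
    """
    pos = bisect.bisect_left(bloom_columns, column_index)
    if pos == len(bloom_columns) or bloom_columns[pos] != column_index:
        return None
    return rg_idx * len(bloom_columns) + pos


def read_inlined_bitset(buf, entry):
    """Return the bitset bytes for a u32 footer entry, or None when the entry is 0."""
    if entry == 0:
        return None  # sentinel: no filter for this (row_group, bloom_col) pair
    offset = entry << 3  # entries store the absolute _pm offset >> 3
    (length,) = struct.unpack_from("<i", buf, offset)
    if length < 0 or offset + 4 + length > len(buf):
        raise ValueError("bitset length out of range")
    return bytes(buf[offset + 4:offset + 4 + length])
```
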
The parquet file size is stored in _txn field 3. The reader locates the latest footer via the trailer at
PARQUET_META_FILE_SIZE - 4: footer_offset = PARQUET_META_FILE_SIZE - 4 - FOOTER_LENGTH. For MVCC, the reader walks
the chain via PREV_PARQUET_META_FILE_SIZE on each footer — the same size-then-trailer indirection that the header
uses for the latest footer, so each walk-back step re-validates the previous footer's location through its own
trailer.
The CRC covers all bytes after PARQUET_META_FILE_SIZE: [8, CRC_field). This protects feature flags, column
descriptors, row group blocks, and footer content, while excluding the mutable PARQUET_META_FILE_SIZE field at
offset 0. It is located via FOOTER_LENGTH: CRC offset = footer_start + FOOTER_LENGTH - 4.
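
A verification sketch follows. The text only says "CRC32", so using zlib.crc32 (the common CRC-32 variant) is an assumption, as is the function name footer_crc_ok.

```python
import struct
import zlib


def footer_crc_ok(buf, footer_start, footer_length):
    """Check the footer CRC, which covers bytes [8, CRC field)."""
    crc_offset = footer_start + footer_length - 4
    (stored,) = struct.unpack_from("<I", buf, crc_offset)
    # Bytes 0..8 (PARQUET_META_FILE_SIZE) are excluded because they are patched last.
    computed = zlib.crc32(buf[8:crc_offset]) & 0xFFFFFFFF
    return stored == computed
```

Because the first 8 bytes are outside the covered range, patching PARQUET_META_FILE_SIZE after the footer is written does not invalidate the checksum.
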
| offset | size | field | type | description |
|---|---|---|---|---|
| 0 | 8 | PARQUET_FOOTER_OFFSET | u64 | byte offset in the parquet file where the parquet footer starts |
| 8 | 4 | PARQUET_FOOTER_LENGTH | u32 | length of the parquet footer in bytes |
| 12 | 4 | ROW_GROUP_COUNT | u32 | |
| 16 | 8 | UNUSED_BYTES | u64 | accumulated dead bytes in the parquet file (old footers + replaced row group data) |
| 24 | 8 | PREV_PARQUET_META_FILE_SIZE | u64 | committed _pm file size at the previous snapshot (0 if first); walk back via trailer at prev - 4 |
| 32 | 8 | FOOTER_FEATURE_FLAGS | u64 | per-footer feature flags; independent of the header's FEATURE_FLAGS |
| 40 | .. | ROW_GROUP_ENTRIES | ROW_GROUP_COUNT * Row group entry (4B each) | |
| .. | .. | FOOTER_FEATURE_SECTIONS | Feature-flag-gated sections, in bit order (may be empty) | |
| .. | 4 | CHECKSUM | u32 | CRC32 over bytes [8, this field) — all content after PARQUET_META_FILE_SIZE |
| .. | 4 | FOOTER_LENGTH | u32 | total bytes from footer start through CHECKSUM (inclusive); NOT covered by CHECKSUM |
The parquet file size is derived from the footer metadata:
parquet_file_size = PARQUET_FOOTER_OFFSET + PARQUET_FOOTER_LENGTH + 8 (4B parquet footer length field + 4B PAR1
magic). This eliminates the need to store the parquet file size separately.
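
The fixed part of the footer and the derived parquet file size can be sketched like this. parse_footer and parquet_file_size are hypothetical names, and the sketch skips the feature sections and CRC validation.

```python
import struct

# u64, u32, u32, u64, u64, u64 in little-endian with no padding: 40 bytes.
FOOTER_FIXED = struct.Struct("<QIIQQQ")


def parse_footer(buf, footer_offset):
    (parquet_footer_offset, parquet_footer_length, row_group_count, unused_bytes,
     prev_parquet_meta_file_size, footer_feature_flags) = \
        FOOTER_FIXED.unpack_from(buf, footer_offset)
    # ROW_GROUP_ENTRIES: one u32 per row group, each storing offset >> 3.
    entries = struct.unpack_from(f"<{row_group_count}I", buf,
                                 footer_offset + FOOTER_FIXED.size)
    return {
        "parquet_footer_offset": parquet_footer_offset,
        "parquet_footer_length": parquet_footer_length,
        "unused_bytes": unused_bytes,
        "prev_parquet_meta_file_size": prev_parquet_meta_file_size,
        "footer_feature_flags": footer_feature_flags,
        "block_offsets": [entry << 3 for entry in entries],
    }


def parquet_file_size(footer):
    # +8 = 4-byte parquet footer length field + 4-byte PAR1 magic.
    return footer["parquet_footer_offset"] + footer["parquet_footer_length"] + 8
```
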
| offset | size | field | type | description |
|---|---|---|---|---|
| 0 | 4 | BLOCK_OFFSET | u32 | byte offset from file start >> 3 (actual = value << 3) |
Atomicity is provided by the _txn file. The parquet file size is stored in _txn field 3 for each partition, serving
as the MVCC version token.
Writer flow:
1. Write data.parquet.
2. Write the _pm metadata file. On update, append new/changed row group blocks after the old trailer, then write a new footer (with CRC) and a new 4-byte FOOTER_LENGTH trailer at the end. Patch PARQUET_META_FILE_SIZE in the header as the last write; this is the MVCC commit signal. Readers see either the old committed size (with the old trailer at old_parquet_meta_file_size - 4) or the new one.
3. Commit _txn (A/B buffered), updating the partition name txn and parquet file size.

Reader flow:
1. Read _txn via safeReadTxn() (spin-lock with version check). Obtain parquet file size from field 3.
2. Open the _pm file. Read PARQUET_META_FILE_SIZE from offset 0.
3. Map parquet_meta_file_size bytes of the _pm file. Do not consult stat() / ff.length(); the filesystem size may include unpublished bytes.
4. Read the 4-byte trailer at PARQUET_META_FILE_SIZE - 4 to get FOOTER_LENGTH; derive footer_offset = PARQUET_META_FILE_SIZE - 4 - FOOTER_LENGTH.
5. Derive the parquet file size from the footer (PARQUET_FOOTER_OFFSET + PARQUET_FOOTER_LENGTH + 8). If it matches the _txn snapshot, stop. Otherwise, read PREV_PARQUET_META_FILE_SIZE from the current footer and repeat from step 4 with the new size. Each step re-validates the previous footer location via its own trailer.
6. Read the metadata from the _pm file.
7. Read data.parquet using the parquet file size from _txn.

Rewrite mode (new partition directory with new name txn): new metadata file created. No concurrent access until
_txn flips.
Update mode (same partition directory): new row group blocks appended after old trailer, new footer and trailer
written at the end. Unchanged row groups keep their old offsets. The header's PARQUET_META_FILE_SIZE is patched last
for atomicity. Readers pinned to an older _txn snapshot read the committed-at-that-time PARQUET_META_FILE_SIZE
(from their own earlier mapping) and walk the PREV_PARQUET_META_FILE_SIZE chain from the matching trailer to find
their footer.
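
The chain walk can be sketched as a loop over committed sizes. The function name find_footer_for_snapshot is hypothetical, CRC checks are omitted, and the requirement that each previous size be strictly smaller than the current one is an assumption added to guarantee termination on corrupt input.

```python
import struct


def find_footer_for_snapshot(buf, txn_parquet_size):
    """Return the offset of the footer whose derived parquet file size matches
    the _txn snapshot, or None if no footer in the chain matches."""
    (size,) = struct.unpack_from("<Q", buf, 0)  # PARQUET_META_FILE_SIZE
    while size != 0:
        if size > len(buf) or size < 32 + 4:
            raise ValueError("committed size out of range")
        # Trailer: u32 FOOTER_LENGTH in the last 4 bytes of this snapshot.
        (footer_length,) = struct.unpack_from("<I", buf, size - 4)
        footer_offset = size - 4 - footer_length
        # PARQUET_FOOTER_OFFSET (u64) and PARQUET_FOOTER_LENGTH (u32) lead the footer.
        pf_offset, pf_length = struct.unpack_from("<QI", buf, footer_offset)
        if pf_offset + pf_length + 8 == txn_parquet_size:
            return footer_offset
        # PREV_PARQUET_META_FILE_SIZE sits at footer offset 24; 0 ends the chain.
        (prev,) = struct.unpack_from("<Q", buf, footer_offset + 24)
        if prev >= size:
            raise ValueError("previous size must be smaller than the current size")
        size = prev
    return None
```
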
- If NULL_COUNT == NUM_VALUES, treat the chunk as all-null and skip the parquet fetch.
- Otherwise, fetch [BYTE_RANGE_START, BYTE_RANGE_START + TOTAL_COMPRESSED) from cold storage.

1. Check that the BLOOM_FILTERS feature flag (bit 0) is set in the header.
2. Find the column's position in BLOOM_FILTER_COLUMNS via binary search.
3. Read the footer entry at [rg_idx * bloom_col_count + pos].
4. Inlined mode: read LENGTH (i32) at the offset in the _pm file, then read LENGTH bytes of bitset.
5. External mode: read the bitset from the parquet file at (offset, length).
6. If the filter does not exclude the value, fetch [BYTE_RANGE_START, BYTE_RANGE_START + TOTAL_COMPRESSED).

We rely on QuestDB's existing migration system to run the migration passes.
We pass over every partition of each table and rely on the existing parquet files to generate the new metadata files at their latest version. As cold storage depends on this feature, no object-storage access is required for this migration.
Feature flags are reserved for future extensions. Required features (bits 32-63) allow the reader to reject files that it cannot correctly interpret.
Migration from an older to a newer version must not require access to the parquet file (to avoid cold-storage access). This is feasible because we control the partition files, so we can safely fill in the new metadata fields with default values that indicate the absence of the new feature (e.g. no encryption, already used bitmap filter algorithm).
External parquet files need to have their metadata file invalidated.
In QuestDB, the _txn file is responsible for telling the reader which partitions exist and where they are stored.
Concurrent access between the TxWriter and TxReader relies on A/B double-buffering to remove the need for locks.
When adding a new row-group to a parquet partition, instead of rewriting the whole file, the row-group is appended to
the file (after the existing footer) and a new footer is written afterwards. The header's FOOTER_OFFSET is patched
last to point to the new footer. The new parquet file size is written to _txn field 3 so that readers can identify
which footer matches their snapshot.
Existing readers continue to see their committed data by walking the PREV_PARQUET_META_FILE_SIZE chain: each footer links to
the previous one, and the reader selects the footer whose derived parquet file size matches the parquet file size from
its _txn snapshot.
As this file's purpose is to reflect the underlying parquet file, the same behavior is used to update the file. Whenever
a new row-group is added to the original parquet file, it's also added to this file. A new footer (with CRC + trailer) is
appended, the header's PARQUET_META_FILE_SIZE is patched, and the parquet file size is updated in _txn.
As new row-groups and footers are written into parquet files, more and more space is wasted. When a certain threshold is
exceeded, the file is compacted in a new partition directory and the _txn file is updated to point to this new file.
Following this strategy, the metadata file is also written again from scratch when this compaction occurs, keeping it as
small as possible.