# Parquet Mode
Parquet mode is an experimental Cortex feature that converts TSDB blocks to Apache Parquet format for improved query performance and storage efficiency on older data. It is particularly beneficial for long-term storage scenarios where data is accessed less frequently but still needs to be queried efficiently.
Parquet mode consists of two main components:

- **Parquet converter**: a service that converts eligible TSDB blocks in object storage to Parquet format.
- **Parquet queryable**: querier support for reading Parquet files, with optional fallback to TSDB blocks.
The traditional TSDB format and Store Gateway architecture face significant challenges with long-term data on object storage: serving queries requires downloading and keeping index-headers resident, and reads translate into many small object-storage requests. Apache Parquet addresses these challenges with a columnar layout that compresses well and lets queriers fetch only the columns and row groups a query needs, without stateful index-header management.
For more details on the design rationale, see the Parquet Storage Proposal.
At a high level, the parquet system works as follows:

1. The parquet converter periodically scans object storage for eligible TSDB blocks and converts them to Parquet files.
2. The converted files are uploaded back to object storage.
3. When parquet queryable is enabled, queriers serve reads from the Parquet files, optionally falling back to TSDB blocks that have not been converted yet.
To enable the parquet converter service, add it to your target list:

```yaml
target: parquet-converter
```

Or include it in a multi-target deployment:

```yaml
target: all,parquet-converter
```
Configure the parquet converter in your Cortex configuration:

```yaml
parquet_converter:
  # Data directory for caching blocks during conversion
  data_dir: "./data"

  # Frequency of conversion job execution
  conversion_interval: 1m

  # Maximum rows per parquet row group
  max_rows_per_row_group: 1000000

  # Number of concurrent meta file sync operations
  meta_sync_concurrency: 20

  # Enable file buffering to reduce memory usage
  file_buffer_enabled: true

  # Ring configuration for distributed conversion
  ring:
    kvstore:
      store: consul
      consul:
        host: localhost:8500
    heartbeat_period: 5s
    heartbeat_timeout: 1m
    instance_addr: 127.0.0.1
    instance_port: 9095
```
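If you already run a gossip-based KV store for other Cortex rings, the converter ring can use it instead of Consul. A minimal sketch, assuming `memberlist` is configured globally as it is for the other rings:

```yaml
parquet_converter:
  ring:
    kvstore:
      # Reuse the gossip-based KV store instead of Consul
      store: memberlist
```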
Enable parquet conversion per tenant using limits:

```yaml
limits:
  # Enable parquet converter for all tenants
  parquet_converter_enabled: true

  # Shard size for shuffle sharding (0 = disabled; values between 0 and 1
  # are treated as a percentage of available converter instances)
  parquet_converter_tenant_shard_size: 0.8

  # Sort columns applied during Parquet file generation
  parquet_converter_sort_columns: ["label1", "label2"]
```
You can also configure per-tenant settings using runtime configuration:

```yaml
overrides:
  tenant-1:
    parquet_converter_enabled: true
    parquet_converter_tenant_shard_size: 2
    parquet_converter_sort_columns: ["cluster", "namespace"]
  tenant-2:
    parquet_converter_enabled: false
```
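These overrides only take effect if the runtime configuration file is loaded. A minimal sketch, assuming the overrides above live at the illustrative path `/etc/cortex/runtime.yaml`:

```yaml
runtime_config:
  file: /etc/cortex/runtime.yaml
  # How often the file is re-read
  period: 10s
```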
To enable querying of Parquet files, configure the querier:

```yaml
querier:
  # Enable parquet queryable with fallback (experimental)
  enable_parquet_queryable: true

  # Cache size for parquet shards
  parquet_queryable_shard_cache_size: 512

  # Default block store: "tsdb" or "parquet"
  parquet_queryable_default_block_store: "parquet"

  # Disable fallback to TSDB blocks when parquet files are not available
  parquet_queryable_fallback_disabled: false
```
Configure query limits specific to parquet operations:

```yaml
limits:
  # Maximum number of rows that can be fetched per query
  parquet_max_fetched_row_count: 1000000

  # Maximum chunk bytes fetched per query
  parquet_max_fetched_chunk_bytes: 100000000 # 100MB

  # Maximum data bytes fetched per query
  parquet_max_fetched_data_bytes: 1000000000 # 1GB
```
Parquet mode supports dedicated caching for both chunks and labels to improve query performance. Configure caching in the blocks storage section:
```yaml
blocks_storage:
  bucket_store:
    # Chunks cache configuration for parquet data
    chunks_cache:
      backend: "memcached"      # Options: "", "inmemory", "memcached", "redis"
      subrange_size: 16000      # Size of each subrange for better caching
      max_get_range_requests: 3 # Max sub-GetRange requests per GetRange call
      attributes_ttl: 168h      # TTL for caching object attributes
      subrange_ttl: 24h         # TTL for caching individual chunk subranges

      # Memcached configuration (if using memcached backend)
      memcached:
        addresses: "memcached:11211"
        timeout: 500ms
        max_idle_connections: 16
        max_async_concurrency: 10
        max_async_buffer_size: 10000
        max_get_multi_concurrency: 100
        max_get_multi_batch_size: 0

    # Parquet labels cache configuration (experimental)
    parquet_labels_cache:
      backend: "memcached"      # Options: "", "inmemory", "memcached", "redis"
      subrange_size: 16000      # Size of each subrange for better caching
      max_get_range_requests: 3 # Max sub-GetRange requests per GetRange call
      attributes_ttl: 168h      # TTL for caching object attributes
      subrange_ttl: 24h         # TTL for caching individual label subranges

      # Memcached configuration (if using memcached backend)
      memcached:
        addresses: "memcached:11211"
        timeout: 500ms
        max_idle_connections: 16
```
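For development or small single-binary deployments, the `inmemory` backend avoids running memcached at the cost of per-process memory. A minimal sketch; the size limit shown is an illustrative value, not a recommendation:

```yaml
blocks_storage:
  bucket_store:
    parquet_labels_cache:
      backend: "inmemory"
      inmemory:
        # Illustrative cap; tune to available memory
        max_size_bytes: 268435456 # 256MiB
```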
The parquet converter determines which blocks to convert based on per-tenant settings (`parquet_converter_enabled`), block eligibility, and ring-based shard assignment across converter instances.

The conversion process:

1. Discover TSDB blocks in object storage that have not yet been converted.
2. Download each eligible block to the local `data_dir`.
3. Convert it to Parquet, honoring `max_rows_per_row_group` and any configured sort columns.
4. Upload the resulting Parquet files back to object storage.

When parquet queryable is enabled, queriers serve blocks from the store selected by `parquet_queryable_default_block_store` and fall back to TSDB blocks when Parquet files are not yet available. If `parquet_queryable_fallback_disabled` is set to `true`, queries fail with a consistency check error if any required block is not available as a parquet file, ensuring strict parquet-only querying.

Monitor parquet converter operations:
```
# Blocks converted
cortex_parquet_converter_blocks_converted_total

# Conversion failures
cortex_parquet_converter_block_convert_failures_total

# Delay, in minutes, between a TSDB block being uploaded to object
# storage and its Parquet conversion completing
cortex_parquet_converter_convert_block_delay_minutes
```
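These counters can feed standard Prometheus alerting. For example, a hypothetical rule (the threshold and duration are illustrative, not recommendations) that fires on sustained conversion failures:

```yaml
groups:
  - name: parquet-converter
    rules:
      - alert: ParquetConverterFailures
        # Any sustained conversion failures over the last 15 minutes
        expr: sum(rate(cortex_parquet_converter_block_convert_failures_total[15m])) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Parquet converter is failing to convert blocks"
```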
Monitor parquet query performance:

```
# Blocks queried by type
cortex_parquet_queryable_blocks_queried_total

# Query operations
cortex_parquet_queryable_operations_total

# Cache metrics
cortex_parquet_queryable_cache_hits_total
cortex_parquet_queryable_cache_misses_total
```
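During a rollout it is useful to track how much traffic is still served from TSDB blocks. A hypothetical recording rule over the blocks-queried counter; the `type` label name is an assumption, so check the metric's actual labels in your deployment:

```yaml
groups:
  - name: parquet-queryable
    rules:
      - record: cortex:parquet_blocks_queried:ratio
        # Fraction of queried blocks served from parquet (label name assumed)
        expr: |
          sum(rate(cortex_parquet_queryable_blocks_queried_total{type="parquet"}[5m]))
            /
          sum(rate(cortex_parquet_queryable_blocks_queried_total[5m]))
```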
When tuning parquet mode:

- Provision sufficient disk space in `data_dir` for block processing.
- Tune `max_rows_per_row_group` based on your query patterns.
- Size `parquet_queryable_shard_cache_size` based on available memory.
- Adjust `meta_sync_concurrency` based on object storage performance.
- Set `parquet_converter_sort_columns` based on your most common query filters to improve query performance.
- Keep `parquet_queryable_fallback_disabled: false` (the default) during initial deployment so queries succeed even while parquet conversion is incomplete.
- Set `parquet_queryable_fallback_disabled: true` only after ensuring all required blocks have been converted to parquet format.

When enabling parquet mode: