wiki/fifo_compaction.md
This document describes the FIFO compaction style in RocksDB, covering the file dropping strategies and both the old and new intra-L0 compaction picking strategies.
FIFO compaction is designed for time-series and log-like workloads where data has a natural expiration. All data lives at L0. When total data exceeds a configured size limit, the oldest SST files are dropped — no merge, no rewrite, just deletion. This gives near-zero write amplification for the compaction layer.
L0 (all data lives here):
Newest Oldest
| |
v v
[SST_N] [SST_N-1] ... [SST_3] [SST_2] [SST_1]
^ ^
| |
new flushes added here oldest files dropped here
(when over size limit)
Without intra-L0 compaction, every memtable flush creates a new small SST file. Over time, the number of L0 files grows, increasing read amplification (each point lookup must check every L0 file). Intra-L0 compaction addresses this by merging small files into fewer larger files.
When compaction is triggered (score >= 1.0), the picker tries these strategies in order, returning the first non-null result:
PickCompaction():
|
|-- 1. PickTTLCompaction() [File Dropping]
| Drop files older than TTL.
|
|-- 2. PickSizeCompaction() [File Dropping]
| Drop oldest files when over size limit.
|
|-- 3. PickIntraL0Compaction() [Intra-L0]
| Dispatcher: merges small L0 files to reduce file count.
| Requires allow_compaction=true. Dispatches to:
| - PickRatioBasedIntraL0Compaction (use_kv_ratio_compaction=true)
| - PickCostBasedIntraL0Compaction (use_kv_ratio_compaction=false)
|
|-- 4. PickTemperatureChangeCompaction() [Temperature Migration]
Rewrite one file to change its temperature tier.
Lowest priority — runs only if nothing else needs to be done.
Steps 1 and 2 are file dropping — they delete old files to enforce size
or TTL limits. Step 3 is intra-L0 compaction — it merges small files into
fewer larger ones. PickIntraL0Compaction is the dispatcher that selects
between the two strategies based on use_kv_ratio_compaction.
Step 4 is temperature migration — it rewrites a single file to change its storage temperature (e.g., moving cold data to cheaper storage). It picks one file at a time, checking if the file's age exceeds a configured threshold but its current temperature doesn't match the target. It runs last because it's the lowest priority: disk space management (dropping) and read amplification (intra-L0) are more important than storage tiering. Since FIFO only allows one compaction at a time, running temperature change last ensures it never blocks more critical operations.
Note: Intra-L0 compaction runs after size-based dropping. If PickSizeCompaction
dropped files (returned non-null), PickIntraL0Compaction is skipped. This
means intra-L0 only runs when the DB is under the size limit or when
size-based compaction is already in progress.
The compaction score determines when compaction should be triggered. For FIFO:
score = effective_total_size / effective_max_size
Where:
effective_total_size = total SST size (or SST + blob when
max_data_files_size > 0)effective_max_size = max_table_files_size (or max_data_files_size
when set)Additional score contributions:
allow_compaction is true (enables intra-L0 compaction):
score = max(score, num_sorted_runs / level0_file_num_compaction_trigger)ttl > 0: score is boosted by expired file countThese strategies delete old files to enforce data size or TTL limits. No data is rewritten — files are simply removed.
PickTTLCompaction)When: ttl > 0
Drops L0 files whose data is older than the TTL threshold. Iterates from
oldest to newest, checking newest_key_time or creation_time from table
properties against current_time - ttl.
Before TTL compaction (ttl = 3600s, files older than 1 hour):
L0: [F6:10m] [F5:20m] [F4:40m] [F3:50m] [F2:70m] [F1:80m]
^^^^ ^^^^
older than TTL --> DROP
After:
L0: [F6:10m] [F5:20m] [F4:40m] [F3:50m]
Returns nullptr if deleting expired files would still leave the total size
above the size limit — in that case, size-based dropping handles it instead.
Config: MutableCFOptions::ttl (in seconds)
PickSizeCompaction)When: Total size exceeds the configured limit.
Compares sum of SST file sizes against max_table_files_size:
Before (total 1.2GB > max_table_files_size 1GB):
L0: [F8:200MB] [F7:200MB] [F6:200MB] [F5:200MB] [F4:200MB] [F3:200MB]
total = 1.2GB
Drop oldest files until under limit:
Drop F3 (200MB) --> remaining = 1.0GB <= 1GB limit --> STOP
After:
L0: [F8:200MB] [F7:200MB] [F6:200MB] [F5:200MB] [F4:200MB]
total = 1.0GB
max_data_files_size > 0)When BlobDB is enabled, SST files are small (keys + blob references) and blob
files hold the actual values. The total disk usage is dominated by blob files,
so max_table_files_size (SST-only) cannot control total disk usage.
max_data_files_size accounts for both SST and blob files:
effective_size = total_sst + total_blob
Example: total_sst = 10MB, total_blob = 9.99GB
max_table_files_size = 1GB --> sees 10MB, no dropping (WRONG!)
max_data_files_size = 10GB --> sees 10GB, drops when exceeded (CORRECT)
When dropping files, proportional estimation is used to account for blob data freed per SST file:
data_per_file = effective_size / num_files
Blob files are automatically cleaned up when their linked SSTs are deleted
(via BlobFileMetaData::GetLinkedSsts() reference counting).
Config:
CompactionOptionsFIFO::max_table_files_size (default: 1GB)CompactionOptionsFIFO::max_data_files_size (default: 0, disabled)PickTemperatureChangeCompaction)When: file_temperature_age_thresholds is non-empty
This is NOT file dropping — it rewrites a single SST file to assign it a new storage temperature (e.g., kWarm, kCold). This allows tiered storage systems to move aging data to cheaper/slower media. The file content is unchanged; only the temperature metadata is updated.
Picks one file at a time, scanning from oldest to newest. For each file,
checks if its age exceeds a configured threshold AND its current temperature
doesn't match the target. Only one file is migrated per compaction to minimize
impact on other operations. Only works with single-level FIFO
(num_levels == 1).
This runs as the lowest priority in the picking chain (step 4) because storage tiering is less urgent than disk space management (dropping) or read amplification (intra-L0 compaction). Since FIFO allows only one compaction at a time, this ensures temperature migration never blocks critical operations.
Config: file_temperature_age_thresholds = [{kWarm, 3600}, {kCold, 86400}]
[F6:5m,kUnk] [F5:30m,kUnk] [F4:2h,kUnk] [F3:5h,kUnk] [F2:2d,kUnk]
^^^^^^^^
age > 86400s
--> compact to kCold
After:
[F6:5m,kUnk] [F5:30m,kUnk] [F4:2h,kUnk] [F3:5h,kUnk] [F2:2d,kCold]
Config: CompactionOptionsFIFO::file_temperature_age_thresholds
Intra-L0 compaction merges multiple small L0 files into fewer larger files to reduce file count and read amplification. Unlike file dropping, this rewrites data — but only SST data (blob files are never rewritten).
allow_compaction = true is the master switch for intra-L0 compaction.
When enabled, use_kv_ratio_compaction selects which picking strategy to use:
allow_compaction = true (master switch for intra-L0)
|
+-- use_kv_ratio_compaction = false (default)
| Old Strategy: PickCostBasedIntraL0Compaction
| Guard: 1.1 * write_buffer_size
| Works when SST ~= write_buffer_size (non-BlobDB)
|
+-- use_kv_ratio_compaction = true
New Strategy: PickRatioBasedIntraL0Compaction
Guard: capacity-derived target from SST/blob ratio
Works when SST << write_buffer_size (BlobDB)
Requires: max_data_files_size > 0
PickCostBasedIntraL0CompactionWhen: allow_compaction = true AND use_kv_ratio_compaction = false.
Called from PickIntraL0Compaction (which only runs when PickSizeCompaction
returned nullptr, meaning the DB is under the size limit).
This is the original intra-L0 compaction, implemented in
PickCostBasedIntraL0Compaction(). It uses a greedy algorithm to pick files,
with a write_buffer_size-based guard to prevent re-compacting large files.
1. Start from the newest L0 file (index 0)
2. Greedily add older files while compact_bytes_per_del_file decreases
3. Stop when:
- A file is being_compacted
- compact_bytes_per_del_file starts increasing (diminishing returns)
- Total exceeds max_compaction_bytes
4. Check: enough files (>= trigger) AND per_del < 1.1 * write_buffer_size
5. Output: always a single file
compact_bytes_per_del_filecompact_bytes_per_del_file measures the cost per file eliminated. When
we compact N files into 1 output, we eliminate (N-1) files but must read and
rewrite all N files' data. The metric is:
compact_bytes_per_del_file = total_input_bytes / (num_files - 1)
The algorithm greedily adds files as long as this ratio keeps decreasing (meaning each additional file is "cheap" to include). When adding a file causes the ratio to increase, we stop — it signals diminishing returns.
Example: scanning files from newest (left) to oldest (right)
Files: [F5:32KB] [F4:64KB] [F3:48KB] [F2:96KB] [F1:128KB]
Step 1: Start with F5 (32KB). compact_bytes = 32KB.
Step 2: Add F4. compact_bytes = 96KB. per_del = 96/1 = 96KB.
Step 3: Add F3. compact_bytes = 144KB. per_del = 144/2 = 72KB. (72 < 96, improving)
Step 4: Add F2. compact_bytes = 240KB. per_del = 240/3 = 80KB. (80 > 72, WORSE!)
--> STOP. Adding F2 makes the ratio increase.
Result: pick [F5, F4, F3] (3 files), per_del = 72KB.
The ratio increases when a file is significantly larger than the average of files already selected. This naturally prevents including already-compacted files (which are larger than flush files) — IF the size gap is significant.
Before (4 flush files of 64KB each, trigger=4):
L0: [F4:64KB] [F3:64KB] [F2:64KB] [F1:64KB]
newest oldest
PickCostBasedIntraL0Compaction:
Add F4: compact_bytes = 64KB
Add F3: compact_bytes = 128KB, per_del = 128/1 = 128KB
Add F2: compact_bytes = 192KB, per_del = 192/2 = 96KB (96 < 128, better)
Add F1: compact_bytes = 256KB, per_del = 256/3 = 85KB (85 < 96, better)
No more files. Check: 4 >= trigger(4) and 85KB < 70MB. OK.
After:
L0: [C1:256KB] (single compacted output)
L0: [F8:64KB] [F7:64KB] [F6:64KB] [F5:64KB] [C1:256KB]
newest oldest (compacted)
PickCostBasedIntraL0Compaction:
Add F8: compact_bytes = 64KB
Add F7: compact_bytes = 128KB, per_del = 128/1 = 128KB
Add F6: compact_bytes = 192KB, per_del = 192/2 = 96KB (improving)
Add F5: compact_bytes = 256KB, per_del = 256/3 = 85KB (improving)
Add C1: compact_bytes = 512KB, per_del = 512/4 = 128KB (128 > 85, WORSE!)
--> STOP before C1.
Result: pick [F8, F7, F6, F5] — compacted file C1 is excluded.
This works because C1 (256KB) is 4x larger than flush files (64KB).
The guard compact_bytes_per_del_file < 1.1 * write_buffer_size prevents
picking files that are already near memtable size. The idea: compacted files
should be ~write_buffer_size, so they'd push per_del above the guard.
Guard works when SST ~= write_buffer_size:
Files: [64MB, 64MB, 64MB, 64MB] (SST ~= WBS = 64MB)
per_del = 256MB/3 = 85MB > 70MB --> guard rejects --> no re-compaction
With BlobDB, SST files are ~1000x smaller than write_buffer_size. The guard
threshold (e.g., 70MB) is never reached by any L0 file. ALL files pass the
guard, including previously compacted files:
Guard FAILS when SST << write_buffer_size (BlobDB):
write_buffer_size = 64MB, SST files ~64KB (1000x smaller)
Guard threshold: 1.1 * 64MB = 70.4MB
10 compacted files of 256KB each:
per_del = 2560KB/9 = 284KB << 70.4MB --> guard passes!
ALL 10 files re-compacted into 1 file of 2.56MB
Result: cascading re-compaction creates "monster files"
Round 1: [64KB, 64KB, 64KB, 64KB] --> compact --> [256KB]
Round 2: [64KB, 64KB, 64KB, 256KB] --> compact ALL --> [448KB]
Round 3: [64KB, 64KB, 64KB, 448KB] --> compact ALL --> [640KB]
... files grow unboundedly
Use the KV-ratio strategy instead for BlobDB workloads.
CompactionOptionsFIFO::allow_compaction (default: false)1.1 * write_buffer_sizelevel0_file_num_compaction_triggeruse_kv_ratio_compaction (PickRatioBasedIntraL0Compaction)When: allow_compaction = true AND use_kv_ratio_compaction = true
AND max_data_files_size > 0
This strategy replaces the write_buffer_size-based guard with a
capacity-derived target and uses tiered size-based merging to achieve
logarithmic write amplification. It observes the actual SST/blob size ratio,
computes a target graduated file size, and merges files incrementally through
size tiers rather than directly to target.
Without BlobDB: SST ~= write_buffer_size --> old guard works
With BlobDB: SST ~= write_buffer_size/1000 --> old guard is useless
The new strategy derives the target from the data capacity and
observed key/value ratio, not from write_buffer_size.
Step 1: Target Computation
The target graduated file size can be determined in two ways:
If max_compaction_bytes > 0 (explicitly set by user):
target = max_compaction_bytes // user override
If max_compaction_bytes == 0 (default, auto-calculate):
sst_ratio = total_l0_sst / (total_l0_sst + total_blob)
total_sst_at_cap = max_data_files_size * sst_ratio
target = total_sst_at_cap / level0_file_num_compaction_trigger
Example (auto-calculated):
max_data_files_size = 10GB, sst_ratio = 0.001 (64KB SST / 64MB total)
total_sst_at_cap = 10GB * 0.001 = 10MB
trigger = 10
target = 10MB / 10 = 1MB
The sst_ratio is recomputed on every PickCompaction call. The
computation is trivial (sum file sizes + arithmetic) and PickCompaction
is only called once per flush or compaction completion, so no caching is
needed. This also means the ratio naturally adapts when SetOptions()
changes configuration.
Step 2: Tier Boundaries
Tier boundaries form a geometric sequence descending from the target,
using trigger as the growth factor:
..., target/trigger^2, target/trigger, target
Example with target=1MB, trigger=10: boundaries = [10KB, 100KB, 1MB]
Boundaries below 10KB are not generated (SST files of most workloads are larger than this). If target itself is below 10KB, it is used as the sole boundary.
Files >= target are "graduated" and never compacted again. They sit in L0 until FIFO drops them.
Step 3: Tiered File Selection
For each tier boundary (smallest first), scan L0 from oldest to newest:
For each boundary B (from smallest to largest):
1. Skip files >= B (they belong to higher tiers) and being_compacted files
2. Collect contiguous files < B
3. Stop when accumulated >= B (cap at 2*B to prevent tier-skipping)
4. If >= 2 files and accumulated >= B: merge them
5. Output (~B bytes) lands at the next tier
Processing boundaries smallest-first ensures bottom-up build: flush outputs are merged first, and higher-tier merges happen naturally as lower-tier outputs accumulate.
Example (target=1MB, trigger=10, flush~10KB):
Tier boundaries: [10KB, 100KB, 1MB]
L0: [1MB_grad] [1MB_grad] [100KB] [100KB] [10KB] [10KB] [F] [F] [F] [F]
Scan at boundary=10KB:
F,F,F,F (all < 10KB) --> accumulated >= 10KB? If yes, merge → ~10KB output
Scan at boundary=100KB:
10KB,10KB,... (all < 100KB) --> accumulated >= 100KB? merge → ~100KB
Scan at boundary=1MB:
100KB,100KB,... (all < 1MB) --> accumulated >= 1MB? merge → graduated!
The tiered approach trades higher L0 file count for logarithmic write amp:
Write amp per byte:
k + 1 = ceil(log(target/flush) / log(trigger)) + 1
Each byte is rewritten once per tier crossing.
L0 file count at steady state:
trigger + k * (trigger - 1)
More than the original trigger target, but bounded logarithmically.
Example (target=1MB, flush=1KB, trigger=10):
k = 3 tiers, write amp = 4, file count ≈ 37
vs flat merging: write amp ≈ 57
The guard is implicit in the tier boundaries:
Graduated files (>= target) are skipped at EVERY tier boundary.
1MB >= 1MB --> skipped at 1MB boundary
1MB >= 100KB --> skipped at 100KB boundary
1MB >= 10KB --> skipped at 10KB boundary
Intermediate tier files are only merged at HIGHER tier boundaries.
A 100KB file (output of tier-0 merge) is:
>= 100KB --> skipped at 100KB boundary (won't be re-merged at same tier)
< 1MB --> eligible at 1MB boundary (merges into graduated file)
Compare with the old strategy's guard:
Old: guard = 1.1 * write_buffer_size (breaks when SST << WBS)
New: graduated files >= target always excluded; intermediate files
progress through tiers without cascading re-compaction
Steady state L0 (target=64MB, trigger=4, flush~1MB):
[64MB_grad, 64MB_grad, 64MB_grad, 64MB_grad,
16MB, 16MB, 16MB,
4MB, 4MB,
1MB, 1MB, 1MB]
- 4 graduated files at target size (frozen until FIFO drops them)
- Intermediate files at tier sizes (accumulating for next merge)
- Flush outputs (accumulating for first tier merge)
When FIFO drops the oldest graduated file, it removes exactly
1/trigger of the total SST data (predictable).
With BlobDB (SST ~1KB, blob ~1MB per flush, target=1MB, trigger=10):
- k = 3 tiers (1KB → 10KB → 100KB → 1MB)
- SST write amp: k+1 = 4x (flush + 3 tier crossings)
- Blob write amp: ~1x (never rewritten)
- Total write amp: ~1 + 1KB*4/(1KB+1MB) ≈ 1.004x
Without BlobDB (SST ~64MB per flush):
- target = large, ratio = 1, k = 1 typically
- SST write amp: ~2x
At steady state, all graduated files are close to the target size. Output is in [boundary, 2*boundary) at each tier. Variable flush sizes are handled naturally — the size-based merge rule produces consistent output regardless of individual file sizes.
CompactionOptionsFIFO::allow_compaction (required: true)CompactionOptionsFIFO::use_kv_ratio_compaction (default: false)CompactionOptionsFIFO::max_data_files_size (required, > 0)level0_file_num_compaction_trigger (target max L0 file count)max_compaction_bytes (default: 0 = auto-calculate target from capacity;
when > 0, overrides auto-calculated target with this value)Both strategies require allow_compaction = true. The choice of strategy
depends on whether BlobDB is used:
Decision tree:
Want intra-L0 compaction?
|
+-- NO: allow_compaction = false (default)
| No file merging, only dropping.
|
+-- YES: allow_compaction = true
|
+-- Using BlobDB (SST << write_buffer_size)?
| |
| +-- YES: use_kv_ratio_compaction = true
| | (also requires max_data_files_size > 0)
| |
| +-- NO: use_kv_ratio_compaction = false (default)
| Old strategy works fine.
| Criteria | Old (default) | New (use_kv_ratio_compaction) |
|---|---|---|
| Guard mechanism | 1.1 * write_buffer_size | capacity-derived target |
| Works with BlobDB? | No (guard broken) | Yes (designed for it) |
| File uniformity | Poor with BlobDB | Good (+/-25%) |
| Re-compaction risk | High with BlobDB | None (tiered boundaries prevent it) |
| Write amp (BlobDB) | Unpredictable | Logarithmic: (k+1)x SST, ~1x total |
| Requires | allow_compaction=true | allow_compaction=true + use_kv_ratio_compaction=true + max_data_files_size>0 |
options.compaction_style = kCompactionStyleFIFO;
options.compaction_options_fifo.max_table_files_size = 1ULL * 1024 * 1024 * 1024; // 1GB
options.compaction_style = kCompactionStyleFIFO;
options.compaction_options_fifo.max_table_files_size = 1ULL * 1024 * 1024 * 1024;
options.compaction_options_fifo.allow_compaction = true;
options.level0_file_num_compaction_trigger = 4;
options.compaction_style = kCompactionStyleFIFO;
options.compaction_options_fifo.max_data_files_size = 10ULL * 1024 * 1024 * 1024; // 10GB
options.compaction_options_fifo.allow_compaction = true; // master switch
options.compaction_options_fifo.use_kv_ratio_compaction = true; // select new strategy
options.level0_file_num_compaction_trigger = 10;
options.enable_blob_files = true;
options.min_blob_size = 1024;
options.compaction_style = kCompactionStyleFIFO;
options.compaction_options_fifo.max_data_files_size = 10ULL * 1024 * 1024 * 1024;
options.compaction_options_fifo.allow_compaction = true;
options.compaction_options_fifo.use_kv_ratio_compaction = true;
options.level0_file_num_compaction_trigger = 10;
options.ttl = 86400; // 24 hours
options.enable_blob_files = true;
options.min_blob_size = 1024;
| Option | Type | Default | Description |
|---|---|---|---|
max_table_files_size | uint64_t | 1GB | SST-only size limit for FIFO dropping |
max_data_files_size | uint64_t | 0 | Combined SST+blob size limit (0=disabled) |
allow_compaction | bool | false | Master switch for intra-L0 compaction (required for both old and new strategies) |
use_kv_ratio_compaction | bool | false | Select capacity-derived intra-L0 strategy (requires allow_compaction=true AND max_data_files_size>0) |
age_for_warm | uint64_t | 0 | DEPRECATED |
file_temperature_age_thresholds | vector | empty | Age-based temperature migration |
allow_trivial_copy_when_change_temperature | bool | false | Allow trivial copy for temp change |
trivial_copy_buffer_size | uint64_t | 4096 | Buffer size for trivial copy |
| Option | Relevance to FIFO |
|---|---|
level0_file_num_compaction_trigger | Target max L0 file count for KV-ratio; min files for old intra-L0 |
ttl | TTL-based file expiration (seconds) |
write_buffer_size | Guard threshold for old-style intra-L0 (1.1x) |
max_compaction_bytes | For KV-ratio: 0 = auto-calculate target from capacity; > 0 = use as target directly. For old intra-L0: cap on total input size. Default sanitized to target_file_size_base * 25 (except when use_kv_ratio_compaction=true, where 0 is preserved) |