# turbo-persistence
This crate provides a way to persist key value pairs into a folder and restore them later.
The API only allows a single write transaction at a time, but multiple threads can fill the transaction with (non-conflicting) data concurrently.
Data pushed into the WriteBatch is persisted to disk immediately, but it only becomes visible after the transaction is committed. On startup, left-over uncommitted files on disk are automatically cleaned up.
The architecture is optimized for pushing a lot of data to disk in a single transaction, while still allowing fast random reads.
It supports multiple key families, which are stored in separate files, but a write batch can contain keys from multiple families. Each key family defines a separate key space. Entries in different key families don't influence each other (not even performance-wise).
There is a single CURRENT file which stores the latest committed sequence number.
All other files have a sequence number as file name, e.g. `0000123.sst`. A file is immutable once its sequence number is <= the committed sequence number, but it might be deleted when it is superseded by other committed files.
There are four different file types:
- SST files (`*.sst`): These files contain key-value pairs.
- Blob files (`*.blob`): These files contain large values.
- Delete lists (`*.del`): These files contain a list of sequence numbers of files that should be considered deleted.
- Meta files (`*.meta`): These files contain metadata about the SST files: the hash range and an AMQF (approximate membership query filter) for quick filtering.

Therefore there are these value types:
- Inline: small values stored directly in the key blocks of `*.sst` files.
- Small: values stored in shared value blocks of `*.sst` files.
- Medium: values stored in dedicated value blocks of `*.sst` files.
- Blob: large values stored in separate `*.blob` files.

| | Inline | Small | Medium | Blob |
|---|---|---|---|---|
| Size | ≤ 8 B | 9 B .. 4 kB | 4 kB .. 64 MB | > 64 MB |
| Compression unit | key block | shared value block (≥ 8 kB) | dedicated value block | separate file |
| Compression unit size | ≤ 16 kB | 8 kB .. 12 kB | 4 kB .. 64 MB | > 64 MB |
| Access cost | no extra overhead | decompress shared block (~8 kB) | decompress value size | open separate file, decompress value size |
| Storage overhead | 0 | 8 B in key block + 8 B per ~8 kB in block table | 2 B in key block + 8 B in block table | 4 B in key block + 8 B in blob header |
| Compaction | re-compressed | re-compressed | copied compressed | pointer copied |
Small value blocks are emitted once they accumulate at least MIN_SMALL_VALUE_BLOCK_SIZE (8 kB) of data. This means actual block sizes range from 8 kB up to 8 kB + MAX_SMALL_VALUE_SIZE (4 kB) = 12 kB. This provides a good balance between compression efficiency (blocks ≥ 4 kB compress well with LZ4) and access cost (only ~8–12 kB needs to be decompressed per lookup).
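The accumulation rule above can be sketched as follows. This is an illustrative model, not the crate's actual code; the type and field names are invented for the example:

```rust
// Values accumulate in a buffer; once it holds at least
// MIN_SMALL_VALUE_BLOCK_SIZE bytes, the buffer is emitted as one shared
// block (which would then be compressed and written to the SST file).
const MIN_SMALL_VALUE_BLOCK_SIZE: usize = 8 * 1024;

#[derive(Default)]
struct SmallValueBuffer {
    buf: Vec<u8>,
    blocks: Vec<Vec<u8>>, // each entry would become one shared value block
}

impl SmallValueBuffer {
    fn push(&mut self, value: &[u8]) {
        self.buf.extend_from_slice(value);
        // With small values capped at 4 kB, flushing at >= 8 kB bounds the
        // emitted block sizes to the 8-12 kB range described in the text.
        if self.buf.len() >= MIN_SMALL_VALUE_BLOCK_SIZE {
            self.blocks.push(std::mem::take(&mut self.buf));
        }
    }
}

fn main() {
    let mut b = SmallValueBuffer::default();
    for _ in 0..3 {
        b.push(&[0u8; 3000]);
    }
    // 9000 bytes >= 8192, so exactly one 9000-byte block was emitted.
    println!("blocks: {}, size: {}", b.blocks.len(), b.blocks[0].len());
}
```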
A meta file can contain metadata about multiple SST files. The metadata is stored in a single file to avoid having too many small files.
The SST file contains only data without any header.
Blocks can be stored LZ4-compressed or uncompressed; a 4-byte header distinguishes the two variants.
Each block stores a 4-byte CRC32 checksum (big-endian) computed on the on-disk block data (i.e. after compression). On read, the checksum is verified before decompression so that on-disk damage is caught before passing data to LZ4. A checksum mismatch returns an error indicating that the cached data is damaged.
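The verify-before-decompress step can be sketched like this. The CRC-32 variant is an assumption here (the common IEEE polynomial); only the big-endian placement and the check-before-decompress ordering come from the text:

```rust
// Bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320). Illustrative;
// the actual crate may use a table-driven or hardware implementation.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFF_u32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 {
                (crc >> 1) ^ 0xEDB8_8320
            } else {
                crc >> 1
            };
        }
    }
    !crc
}

// Verify the stored big-endian checksum against the on-disk (still
// compressed) bytes before handing anything to the decompressor.
fn verify_block(stored: [u8; 4], on_disk: &[u8]) -> Result<(), &'static str> {
    if u32::from_be_bytes(stored) == crc32(on_disk) {
        Ok(()) // safe to decompress
    } else {
        Err("checksum mismatch: cached data is damaged")
    }
}

fn main() {
    let block = b"compressed bytes";
    let stored = crc32(block).to_be_bytes();
    println!("{:?}", verify_block(stored, block));
}
```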
An index block contains n 8-byte hashes, which specify n - 1 hash ranges (a hash equal to a boundary falls into the previous range, except for the first hash). Between these n hashes there are n - 1 2-byte block indices that point to the blocks containing the respective hash ranges.
The hashes are sorted.
n is `(block size + 1) / 10`
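The range rule above can be sketched with plain slices (illustrative in-memory types, not the on-disk layout):

```rust
// `hashes` holds the n sorted 8-byte hashes, `blocks` the n - 1 2-byte
// block indices between them.
fn block_for_hash(hashes: &[u64], blocks: &[u16], hash: u64) -> Option<u16> {
    assert_eq!(hashes.len(), blocks.len() + 1);
    if hash < hashes[0] || hash > *hashes.last().unwrap() {
        return None; // outside the indexed range
    }
    // Range i covers (hashes[i], hashes[i + 1]]; a hash equal to a boundary
    // falls into the previous range, except the very first hash, which
    // belongs to range 0.
    let i = hashes.partition_point(|&h| h < hash);
    Some(blocks[i.saturating_sub(1)])
}

fn main() {
    let hashes = [10, 20, 30];
    let blocks = [0u16, 1];
    // 20 sits on a boundary, so it falls into the previous range (block 0).
    println!("{:?}", block_for_hash(&hashes, &blocks, 20));
}
```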
A key block contains n keys, which specify n key-value pairs.
The block type determines whether the key hash is stored per entry.
During lookup, if block type is 2, the full hash is recomputed from the key data.
Depending on the type field, each entry has a different format.
The entries are sorted by key hash and key.
Used when all entries in a block have the same key size and value type. This eliminates the per-entry offset table, enabling direct arithmetic indexing during binary search.
The entry position for index i is computed as `header_size + i * stride` with no indirection. The writer automatically selects the fixed-size format when all entries in a block qualify; otherwise it falls back to the variable-size format above.
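A binary search over such a block can be sketched as below. The entry layout is an assumption for illustration (each entry starting with its 8-byte big-endian key hash); the real entries carry more fields:

```rust
// Binary search in a fixed-size key block: entry positions come from
// arithmetic alone, with no per-entry offset table.
fn find_entry(
    block: &[u8],
    header_size: usize,
    stride: usize,
    n: usize,
    hash: u64,
) -> Option<usize> {
    let (mut lo, mut hi) = (0, n);
    while lo < hi {
        let mid = (lo + hi) / 2;
        // Direct arithmetic indexing: header_size + i * stride.
        let off = header_size + mid * stride;
        let h = u64::from_be_bytes(block[off..off + 8].try_into().unwrap());
        match h.cmp(&hash) {
            std::cmp::Ordering::Less => lo = mid + 1,
            std::cmp::Ordering::Greater => hi = mid,
            std::cmp::Ordering::Equal => return Some(mid),
        }
    }
    None
}

fn main() {
    // 4-byte header, then three 8-byte entries with hashes 1, 5, 9.
    let mut block = vec![0u8; 4];
    for h in [1u64, 5, 9] {
        block.extend_from_slice(&h.to_be_bytes());
    }
    println!("{:?}", find_entry(&block, 4, 8, 3, 5)); // Some(1)
}
```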
The blob file contains the plain value, compressed with dynamic compression. Each blob file has an 8-byte header.
The checksum is verified on the compressed data before decompression when the blob is read.
Reading starts from the current sequence number and goes downwards.
Writing starts by creating a new WriteBatch. It maintains an atomic counter of the next free sequence number.
The WriteBatch has a thread local buffer that accumulates operations until a certain threshold is reached. Then the buffer is sorted and written to a new SST file (and maybe some blob files).
When the WriteBatch is committed all thread local buffers are merged into a single global buffer and written into new SST files (potentially multiple when threshold is reached).
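The flush-on-threshold behavior can be sketched as follows (illustrative and single-threaded; the real WriteBatch keeps one such buffer per thread and writes actual SST and blob files):

```rust
const FLUSH_THRESHOLD: usize = 4; // tiny value, just for the example

#[derive(Default)]
struct WriteBuffer {
    ops: Vec<(u64, Vec<u8>)>,       // accumulated (key hash, value) pairs
    ssts: Vec<Vec<(u64, Vec<u8>)>>, // each flush would become one SST file
}

impl WriteBuffer {
    fn put(&mut self, key_hash: u64, value: Vec<u8>) {
        self.ops.push((key_hash, value));
        if self.ops.len() >= FLUSH_THRESHOLD {
            self.flush();
        }
    }

    // Sort the accumulated operations and emit them as one "SST file".
    fn flush(&mut self) {
        if self.ops.is_empty() {
            return;
        }
        let mut batch = std::mem::take(&mut self.ops);
        batch.sort_by_key(|&(hash, _)| hash);
        self.ssts.push(batch);
    }
}

fn main() {
    let mut wb = WriteBuffer::default();
    for hash in [5u64, 3, 4, 1, 2] {
        wb.put(hash, vec![]);
    }
    wb.flush(); // commit: flush whatever is left over
    println!("{} SST files", wb.ssts.len()); // 2
}
```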
All new files are flushed to disk (fsync) before the new sequence number is written to the CURRENT file.
After that, optimization (compaction) might take place.
For compaction we compute the "coverage" of the SST files. The coverage is the average number of SST files that need to be touched to figure out that a key is missing. The coverage can be computed by looking at the min_hash and max_hash of the SST files only.
For a single SST file we can compute (max_hash - min_hash) / u64::MAX as the coverage of the SST file. We sum up all these coverages to get the total coverage.
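The computation above can be sketched in a few lines, assuming each SST file exposes its `(min_hash, max_hash)` pair (names are illustrative):

```rust
// Sum of per-file coverages: (max_hash - min_hash) / u64::MAX for each file.
fn total_coverage(ranges: &[(u64, u64)]) -> f64 {
    ranges
        .iter()
        .map(|&(min, max)| (max - min) as f64 / u64::MAX as f64)
        .sum()
}

fn main() {
    // Two files each spanning half the hash space: total coverage is ~1.0,
    // i.e. a lookup for a missing key touches one file on average.
    let ranges = [(0, u64::MAX / 2), (u64::MAX / 2, u64::MAX)];
    println!("{}", total_coverage(&ranges));
}
```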
Compaction chooses a few SST files and runs the merge step of merge sort on them to create a few new SST files with sorted ranges.
Example:

```text
key hash range:  | 0 ... u64::MAX |
SST 1:           |----------------|
SST 2:           |----------------|
SST 3:              |-----|
```

can be compacted into:

```text
key hash range:  | 0 ... u64::MAX |
SST 1':          |-------|
SST 2':                  |------|
SST 3':                         |-----|
```
The merge operation decreases the total coverage since the new SST files will have a coverage of < 1.
But we need to be careful to insert the new SST files in the correct location again, since items in these SST files might be overridden in later SST files and we don't want to change that ordering.
Since SST files with a sequence number smaller than the current sequence number are immutable, we can neither change these files nor insert new files at those sequence numbers. Instead we insert the new SST files after the current sequence number and copy all SST files that come after the original SST files behind them. (Actually we only need to copy SST files with overlapping key hash ranges, and we can hardlink them instead of copying.) Later we write the new current sequence number and delete the original and all copied SST files.
We can run multiple merge operations concurrently when their key hash ranges don't overlap or they are from different key families. The copy operation needs to happen strictly after all merge operations.
There must not be another SST file with overlapping key hash range between files of a merge operation.
During the merge operation we eliminate duplicate keys. When blob references are eliminated, we delete the corresponding blob files after the current sequence number has been updated.
Since the process might exit unexpectedly, we keep track of pending deletions in a `*.del` file to avoid "forgetting" to delete SST files. This file contains the sequence numbers of the SST and blob files that should be deleted. We write it before the current sequence number is updated. On restart, we execute the deletions again.
We limit the number of SST files that are merged at once to avoid long compactions.
Full example:
```text
key hash range:  | 0 ... u64::MAX |  Family
SST 1:           |-|                    1
SST 2:           |----------------|     1
SST 3:           |----------------|     1
SST 4:           |-----|                2
SST 5:                  |-----|         2
SST 6:           |-------|              1
SST 7:                  |-------|       1
SST 8:           |--------|             2
SST 9:                 |--------|       2
CURRENT: 9
```
Compaction could select SST 2, 3, 6 and SST 4, 5, 8 for merging (assuming a limit of 3 SST files per merge operation). This also selects SST 7 and 9 for copying. The current sequence number is 9.
We merge SST 2, 3, 6 into the new SST files 10, 12, 14 and SST 4, 5, 8 into the new SST files 11, 13. Both operations run concurrently, so they might pick free sequence numbers in arbitrary order. A merge might produce fewer SST files when duplicate keys are eliminated.
After that we copy SST files 7, 9 to new SST files 15, 16.
We write a "del" file at sequence number 17.
After that we write the new current sequence number 17.
Then we delete SST files 2, 3, 6 and 4, 5, 8 as well as 7 and 9. SST file 1 stays unchanged.
```text
key hash range:  | 0 ... u64::MAX |  Family
SST 1:           |-|                    1
SST 10:          |-----|                1
SST 12:                |-----|          1
SST 11:          |------|               2
SST 14:                 |-------|       1
SST 13:                |-----|          2
SST 15:                 |-------|       1
SST 16:               |--------|        2
DEL 17: (2, 3, 4, 5, 6, 7, 8, 9)
CURRENT: 17
```
Compaction behavior is controlled by configuration options, e.g. the maximum number of SST files merged at once.
On startup:

- Read the `CURRENT` file to get the committed sequence number. If there is no `CURRENT` file, start with an empty database.
- Clean up left-over files with a sequence number larger than the one in the `CURRENT` file.
- Read all `*.del` files and delete the files that are listed in there.
- Read all `*.sst` files and memory map them.