docs/dev/sstable-compression-dicts.md
Scylla now supports dictionary-based compression for SSTables, which improves compression ratios by sharing a trained compression dictionary across compression chunks.
Traditional SSTable compression in Scylla works on a chunk-by-chunk basis. Each chunk is compressed independently, which means patterns that occur across chunks cannot be effectively leveraged for better compression.
Dictionary-based compression addresses this limitation by training a dictionary on representative data samples and using it across all compression chunks, providing the compression algorithm with additional context for referencing.
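To see why a shared dictionary helps, here is a small, self-contained illustration using zlib's preset-dictionary support from the Python standard library. This is only a stand-in for the LZ4/Zstd dictionaries Scylla actually uses, and the sample data is invented:

```python
import zlib

# Chunks with cross-chunk redundancy: each compresses poorly in isolation,
# because the shared patterns live in *other* chunks.
chunks = [
    b'{"user": "alice", "event": "login", "status": "ok"}',
    b'{"user": "bob", "event": "logout", "status": "ok"}',
    b'{"user": "carol", "event": "login", "status": "denied"}',
]

# A "trained" dictionary; here just representative sample bytes.
dictionary = b'{"user": "", "event": "login", "status": "ok"}'

def compressed_size(chunk, zdict=None):
    c = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    return len(c.compress(chunk) + c.flush())

plain = sum(compressed_size(c) for c in chunks)
with_dict = sum(compressed_size(c, dictionary) for c in chunks)
assert with_dict < plain  # the dictionary recovers the cross-chunk redundancy
```

The dictionary gives the compressor prior context to reference, so each independently compressed chunk can encode its shared substrings as back-references instead of literals.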
- **Dictionary training:** Scylla samples data chunks from across the cluster to build an optimized compression dictionary for a specific table.
- **Dictionary distribution:** Dictionaries are stored in the `system.dicts` table (managed by group0). Each table has its own (possibly absent) row there.
- **Shared compression:** When opening an SSTable for writing, if the table has compression dictionaries enabled, the current recommended dictionary for the table (i.e. the one in `system.dicts`) is used to compress the data, and is written into the header of `CompressionInfo.db`.
- **Decompression:** When opening an SSTable for reading, the dictionary blob is loaded from `CompressionInfo.db` and used to decompress the data.
There are two new persistent data structures involved:

- `CompressionInfo.db` gains two new compressor IDs (LZ4 with dicts, Zstd with dicts) and new "compressor options", which store the dictionary blob used by this SSTable.
- `system.dicts`, which (in addition to the RPC compression dict) now also stores the current recommended SSTable compression dict for each table.

## SSTable format

The structure of the format isn't affected. Instead, we add two new compressor identifiers (`LZ4WithDictsCompressor` and `ZstdWithDictsCompressor`), which use the "compressor options" map in `CompressionInfo.db` to store the dict.
Since the structure isn't affected, we don't increment the SSTable version for this. Naturally, the dict-compressed SSTables won't be readable by older versions of Scylla (or by Cassandra), but they should complain about an unknown compressor rather than consider the SSTable malformed.
If a downgrade is necessary, it can be done by disabling dictionaries (through schema, or by setting `sstable_compression_dictionaries_enable_writing` to false on all nodes) and rewriting the SSTables (with `nodetool upgradesstables -a` or similar).
The extension is hidden behind the `SSTABLE_COMPRESSION_DICTS` cluster feature.
## CompressionInfo.db extension

We store the dictionary blob in the "options" map in the header of `CompressionInfo.db`, under the keys `.dictionary.00000000`, `.dictionary.00000001`, ...
(It's split into several parts because the "options" values have 16-bit lengths, and dictionaries are usually bigger than that.)
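The split-and-reassemble step is mechanical; a sketch (helper names invented, and decimal part numbering assumed, though the numbering scheme doesn't change the idea):

```python
MAX_PART = 0xFFFF  # "options" values carry 16-bit lengths

def split_dict(blob):
    # One entry per <= 65535-byte slice, keyed .dictionary.00000000, ...
    return {f".dictionary.{i:08d}": blob[off:off + MAX_PART]
            for i, off in enumerate(range(0, len(blob), MAX_PART))}

def join_dict(options):
    # Zero-padded keys sort lexicographically in numeric order.
    parts = sorted(k for k in options if k.startswith(".dictionary."))
    return b"".join(options[k] for k in parts)

blob = bytes(range(256)) * 800            # a ~200 kB dictionary blob
opts = split_dict(blob)
assert len(opts) == 4                     # 204800 bytes -> 4 parts
assert join_dict(opts) == blob            # reassembly is lossless
```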
## system.dicts extension

If a `system.dicts` partition with key `sstables/{table_uuid}` exists, it provides the current recommended dict for this table, which is used to compress new SSTables.

If a table doesn't have a matching row in `system.dicts`, then there's no current dictionary for this table, and new SSTables fall back to dictionaryless compression.
## Compressor factory

With "traditional" compression, a compressor was just a function in the code, not involving any data, so creating compressors was cheap and easy. With dictionaries involved, each unique compressor has its own RAM and cache footprint, so we want to deduplicate compressors as much as possible.

To this end, new compressors are created through a central "compressor factory", which contacts other shards and ensures that there are no redundant copies of dictionaries in memory.
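The deduplication idea boils down to keying compressors by their dictionary contents. A single-process sketch (names invented; the real factory additionally coordinates across shards):

```python
import hashlib

class CompressorFactory:
    def __init__(self):
        self._cache = {}

    def get_compressor(self, dict_blob):
        # Two requests with byte-identical dictionaries share one instance.
        key = hashlib.sha256(dict_blob).digest()
        if key not in self._cache:
            # Stand-in for constructing a real dict-aware compressor.
            self._cache[key] = {"dict": dict_blob}
        return self._cache[key]

factory = CompressorFactory()
a = factory.get_compressor(b"dict-v1")
b = factory.get_compressor(b"dict-v1")
c = factory.get_compressor(b"dict-v2")
assert a is b       # same dict -> one shared in-memory instance
assert a is not c   # different dict -> separate instance
```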
## Autotrainer

To create a dictionary, some training data is needed. This means that a dictionary can't be created immediately for a new table; some data must accumulate in it first. Also, the dataset can change over time, and a dictionary might become outdated, in which case it's worth retraining it.

It would be impractical to manually pick the right moments to train new dicts, so there's `sstable_dict_autotrainer`, which periodically trains a new dict whenever a dict-aware table seems to deserve one. Refer to the implementation for up-to-date details.
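The decision rule can be pictured as follows. This is a toy sketch with invented names and thresholds, not the autotrainer's actual heuristics:

```python
MIN_TRAINING_BYTES = 1 << 30  # assumed threshold: don't train tiny tables

def should_retrain(table_bytes, trained_on_bytes):
    if table_bytes < MIN_TRAINING_BYTES:
        return False              # not enough data to sample yet
    if trained_on_bytes is None:
        return True               # dict-aware table with no dict yet
    # Dataset grew well past what the dict was trained on: likely outdated.
    return table_bytes > 2 * trained_on_bytes

assert not should_retrain(10 << 20, None)    # new small table: wait
assert should_retrain(2 << 30, None)         # big table, no dict: train
assert not should_retrain(2 << 30, 2 << 30)  # dict still fresh
assert should_retrain(5 << 30, 2 << 30)      # data grew a lot: retrain
```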
Dictionary compression is enabled for a table by setting the `sstable_compression` entry in the schema to one of the new compressor IDs. (The autotrainer will eventually train a dict for it.)

## REST API

`storage_service/retrain_dict` can be used to trigger a dictionary training for a table manually, without waiting for the automatic training.

`storage_service/estimate_compression_ratios` can be used to generate a report with estimations of compression ratios (on the given table) for various compression configs (algorithm, level, chunk size), to guide the choice of configuration.

## RPC verbs

`SAMPLE_SSTABLES` is used by a dictionary-training node to gather SSTable
samples from other nodes.

`ESTIMATE_SSTABLE_VOLUME` is a helper RPC used by a dictionary-training node to find out how much data other nodes have, so that it can later request the right (i.e. proportional) amount of samples from each node. It's also used by the autotrainer to find out if the table is big enough for dictionary training.

## Config

There are several new config knobs related to this feature, all named like
`sstable_compression_dictionaries_*`.
Refer to `config.hh` for up-to-date details.
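As a closing illustration, the proportional sampling that `ESTIMATE_SSTABLE_VOLUME` enables can be sketched as below. The function name and the rounding scheme are invented for this example; the point is only that bigger nodes contribute proportionally more training samples:

```python
def allocate_samples(volumes, total_samples):
    # volumes: per-node data sizes in bytes, as the volume RPC would report.
    total = sum(volumes.values())
    if total == 0:
        return {node: 0 for node in volumes}
    exact = {n: total_samples * v / total for n, v in volumes.items()}
    alloc = {n: int(x) for n, x in exact.items()}
    # Largest-remainder rounding so counts sum exactly to total_samples.
    leftovers = sorted(exact, key=lambda n: exact[n] - alloc[n], reverse=True)
    for n in leftovers[: total_samples - sum(alloc.values())]:
        alloc[n] += 1
    return alloc

alloc = allocate_samples({"n1": 100 << 30, "n2": 300 << 30, "n3": 100 << 30}, 1000)
assert sum(alloc.values()) == 1000
assert alloc["n2"] == 600   # n2 holds 3/5 of the data, so it gives 3/5 of samples
```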