architecture/design/advanced-delta-encoding.md
Assume we have a table created with CREATE TABLE t (h int, r int, c1 int, c2 int, PRIMARY KEY (h, r));
And we've inserted a single row using INSERT INTO t(h, r, c1, c2) VALUES (1, 2, 12, 24);
According to https://docs.yugabyte.com/latest/architecture/docdb/persistence this is represented in docdb as the following 3 separate records:
SubDocKey(DocKey(0x1210, [1], [2]), [SystemColumnId(0); HT{ physical: 1634209096289349 }]) -> null
SubDocKey(DocKey(0x1210, [1], [2]), [ColumnId(12); HT{ physical: 1634209096289349 w: 1 }]) -> 12
SubDocKey(DocKey(0x1210, [1], [2]), [ColumnId(13); HT{ physical: 1634209096289349 w: 2 }]) -> 24
Which then encoded in binary format and stored in underlying rocksdb as:
471210488000000121488000000121 4A80 23800185F0027D73BA804A 0114000000000004 -> null
471210488000000121488000000121 4B8C 23800185F0027D73BA803FAB 0115000000000004 -> 12
471210488000000121488000000121 4B8D 23800185F0027D73BA803F8B 0116000000000004 -> 24
The last 8 bytes of each binary record representation above is rocksdb internal suffix which contains value type (kTypeValue = 0x01) and sequence number (increments for each new record).
Standard rocksdb key delta encoding saves some space by extracting common shared prefix and encoding those records using the following scheme: <shared_prefix_size><non_shared_size><value_size><non_shared><value>. For our example this will be encoded as:
471210488000000121488000000121 4A80 23800185F0027D73BA804A 0114000000000004 -> null => <0><24><0><0x4712104880000001214880000001214A8023800185F0027D73BA804A0114000000000004><>
471210488000000121488000000121 4B8C 23800185F0027D73BA803FAB 0115000000000004 -> 12 => <15><32><4><0x4B8C23800185F0027D73BA803FAB0115000000000004><12>
471210488000000121488000000121 4B8D 23800185F0027D73BA803F8B 0116000000000004 -> 24 => <16><31><4><0x8D23800185F0027D73BA803FAB0115000000000004><24>
But the only difference between keys of each of these docdb records and the next one is only ColumnId and w (write_id) component of hybrid time. In the binary form, we also have the last internal rocksdb component which is usually just gets incremented for the next record. So if we can only store this difference, but not the whole key part after shared_prefix - we can save more space.
Difference between 1st and 2nd encoded docdb keys:
471210488000000121488000000121 4A80 23800185F0027D73BA80 4A 0114000000000004
471210488000000121488000000121 4B8C 23800185F0027D73BA80 3FAB 0115000000000004
^^^^ ^^^^ ^^
So, we can just store 2+2=4 bytes that are not reused from the previous key (plus additional info about key component sizes and flag that the last internal component is reused and incremented) instead of 32 bytes for the whole key part after the prefix.
Difference between 2nd and 3rd encoded docdb keys:
4712104880000001214880000001214B 8C 23800185F0027D73BA803F AB 0115000000000004
4712104880000001214880000001214B 8D 23800185F0027D73BA803F 8B 0116000000000004
^^ ^^ ^^
Here we can store 1+1=2 bytes that are not reused from the previous key (plus additional info as above) instead of 31 bytes for the whole key part after the prefix.
More columns per row we have - more space we can save using this approach.
In order to implement this at rocksdb level we try to match previous and current encoded keys to the following keys pattern and maximize <shared_prefix> and <shared_middle> size:
Previous key: <shared_prefix>[<prev_key_non_shared_1>[<shared_middle>[<prev_key_non_shared_2>]]][<last_internal_component_to_reuse>]
Current key: <shared_prefix>[<non_shared_1>[<shared_middle>[<non_shared_2>]]][<last_internal_component_to_reuse> (optionally incremented)]
<last_internal_component_to_reuse> always has size of 8 bytes (if internal rocksdb component is the same in previous and current keys or just its embedded sequence number is incremented) or 0 bytes (in other cases).
Then we encode information about these component sizes and whether last_internal_component is reused and if it is incremented or reused as is (this can happen in snapshot SST files where the sequence number is reset to zero).
After this information we just need to store <non_shared_1><non_shared_2><value> instead of <non_shared><value> (where <non_shared> is everything after <shared_prefix>).
As a results of the approach described above we have following components sizes/flag that fully determine difference between previous key and current key:
shared_prefix_sizeprev_key_non_shared_1_sizenon_shared_1_sizeshared_middle_sizeprev_key_non_shared_2_sizenon_shared_2_sizelast_internal_component_reuse_size (0 or 8)is_last_internal_component_inc (whether last internal component reused from previous key is incremented)Note that previous key size is always shared_prefix_size + prev_key_non_shared_1_size + shared_middle_size + prev_key_non_shared_2_size + last_internal_component_reuse_size, so we can compute shared_middle_size based on shared_prefix_size, prev_key_non_shared_1_size, prev_key_non_shared_2_size, last_internal_component_reuse_size and previous key size.
And current key size is always shared_prefix_size + non_shared_1_size + shared_middle_size + non_shared_2_size + last_internal_component_reuse_size.
We will store non_shared_i_size_delta = non_shared_i_size - prev_key_non_shared_i_size (i = 1, 2) for efficiency, because for DocDB these deltas are 0 in most cases and we can encode this more efficiently.
So, to be able to decode key-value pair it is enough to know whole previous key and store the following:
shared_prefix_sizenon_shared_1_sizenon_shared_1_size_deltanon_shared_2_sizenon_shared_2_size_deltabool is_last_internal_component_reusedbool is_last_internal_component_incvalue_sizenon_shared_1 bytesnon_shared_2 bytesvalue bytesThe general encoding format is therefore: <sizes & flags encoded><non_shared_1 bytes>[<non_shared_2 bytes>]<value bytes>.
In general, sizes & flags encoding format is:
<encoded_1><...>, where encoded_1 is an encoded varint64 containing the following information:
is_last_internal_component_incvalue_sizeWe encode these two flags together with the value_size, because if value_size is less than 32 bytes, then encoded_1 will still be encoded as 1 byte.
Otherwise one more byte is negligible comparing to size of the value that is not delta compressed.
We have several cases that occur most frequently and we encode them in a special way for space-efficiency, depending on the case <...> is encoded differently:
is_last_internal_component_reused == truenon_shared_1_size == 1non_shared_2_size == 1non_shared_1_size_delta == 0non_shared_2_size_delta == 0In this case we know all sizes and flags except shared_prefix_size, so we just need to store: <encoded_1><shared_prefix_size>
<encoded_1><encoded_2><...>, where encoded_2 is one byte that determines which subcase we are dealing with. In some cases encoded_2 also contains some more useful information.2.1. Something is reused from the previous key (meaning shared_prefix_size + shared_middle_size + last_internal_component_reuse_size > 0):
2.1.1. Optimized for the following case:
is_last_internal_component_reused == truenon_shared_1_size_delta == 0non_shared_2_size_delta == 0non_shared_1_size < 8non_shared_2_size < 4In this case encoded_2 is:
10non_shared_2_size_delta == 1non_shared_1_sizenon_shared_2_sizeWe store: <encoded_1><encoded_2><shared_prefix_size> and the rest of sizes and flags is computed based on this.
2.1.2: Rest of the cases when we reuse bytes from the previous key.
In this case encoded_2 is:
11is_last_internal_component_reusedWe store: <encoded_1><encoded_2><non_shared_1_size>[<non_shared_1_size_delta>][<non_shared_2_size>][<non_shared_2_size_delta>]<shared_prefix_size>, sizes that we know based on encoded_2 bits are zeros are not stored to save space.
2.2. Nothing is reused from the previous key (mostly restart keys). In this case we just need to store key size and value size.
2.2.1. 0 < key_size < 128.
In this case encoded_2 is:
0key_size (>0)And we just need to store: <encoded_1><encoded_2> to be able to decode key size and value size.
2.2.2. key_size == 0 || key_size >= 128
In this case encoded_2 is just 0:
00We store: <encoded_1><encoded_2=0><key_size> that is enough to decode key size and value size.