docs/architecture/storage.md
toyDB uses an embedded key/value store for data
storage, located in the storage
module. This stores arbitrary keys and values as binary byte strings. The storage engine doesn't
know or care what the keys and values contain -- we'll see later how the SQL data model, with tables
and rows, is mapped onto this key/value structure.
The storage engine supports simple set/get/delete operations on individual keys. It does not itself support transactions -- this is built on top, and we'll get back to it shortly.
Keys are stored in sorted order. This allows range scans, where we can iterate over all key/value pairs between two specific keys, or with a specific key prefix. This will be needed by other components in the system, e.g. to scan all rows in a specific SQL table, to scan all versions of an MVCC key, to scan the tail of the Raft log, etc.
The storage engine is pluggable: there are multiple implementations, and the user can choose which
one to use in the config file. These implement the storage::Engine trait:
Let's look at the existing storage engine implementations.
Memory Storage EngineThe simplest storage engine is the storage::Memory engine. This is a trivial implementation which
stores data in memory using the Rust standard library's
BTreeMap, without persisting
it to disk. It is primarily used for testing.
Since this is just a wrapper around the BTreeMap we can include it in its entirety here:
BitCask Storage EngineThe main storage engine is storage::BitCask. This is a very simple variant of
BitCask, used in the Riak
database. It is kind of like the LSM-tree's
baby cousin.
toyDB's BitCask implementation uses a single append-only log file for storage. To write a key/value pair, we simply append it to the file. To delete a key, we append a special tombstone value. When reading a key, the last entry for that key in the file is used.
The file format for a key/value pair is simply:
u32 (4 bytes).i32 (4 bytes). -1 if tombstone.For example, the key/value pair foo=bar would be written as follows (in hexadecimal):
keylen valuelen key value
00000003 00000003 666f6f 626172
Because the data file is a simple log, we don't need a separate write-ahead log for crash recovery -- the data file is the write-ahead log.
To quickly look up key/value pairs when reading, we maintain an in-memory KeyDir index which maps
a key to the latest value's position in the file. All keys must therefore fit in memory.
We initially generate this index by scanning through the entire file when it is opened:
To write a key, we append it to the file and update the KeyDir:
To delete a key, we append a tombstone value instead:
To read a value for a key, we look up the key's file location in the KeyDir index (if the key
exists), and then read it from the file:
The KeyDir uses an inner stdlib BTreeMap to keep track of keys. This allows range scans, where
we iterate over a sorted set of keys between the range bounds, loading each key from the file:
As keys are updated and deleted, we'll keep accumulating old versions in the log file. To remove these, the log file is compacted on startup. This writes out the latest value of every live key/value pair to a new file, and replaces the old file. The keys are written in sorted order, to make later scans faster.