docs/architecture/mvcc.md
Transactions are groups of reads and writes (e.g. to different keys) that are submitted together as a single unit. For example, a bank transaction that transfers $100 from account A to account B might consist of this group of reads and writes:
a = get(A)
b = get(B)
if a < 100:
error("insufficient balance")
set(A, a - 100)
set(B, b + 100)
toyDB provides ACID transactions, a set of very strong guarantees:
Atomicity: all of the writes take effect as an single, atomic unit, at the same instant, when they are committed. Other users will never see some of the writes without the others.
Consistency: database constraints are never violated (e.g. referential integrity or uniqueness contraints). We'll see how this is implemented later in the SQL execution layer.
Isolation: users should appear to have the entire database to themselves, unaffected by other simultaneous users. Two transactions may conflict, in which case one has to retry, but if a transaction succeeds then the user knows with certainty that the operations were executed without interference by anyone else. This eliminates the risk of race conditions.
Durability: committed writes are never lost (even if the system crashes).
To illustrate how transactions work, here's an example MVCC test script where two concurrent users modify a set of bank accounts (there's many other test scripts there too):
To provide these guarantees, toyDB uses a common technique called
Multi-Version Concurrency Control
(MVCC). It is implemented at the key/value storage level, in the storage::mvcc
module. It uses a storage::Engine for actual data storage.
MVCC provides an isolation level called snapshot isolation: a transaction sees a snapshot of the database as it was when the transaction began. Any later changes are invisible to it.
It does this by storing historical versions of key/value pairs. The version number is simply a number that's incremented for every new transaction:
Each transaction has its own unique version number. When it writes a key/value pair it appends its
version number to the key as Key::Version(&[u8], Version) (using the Keycode encoding we've seen
previously). If an old version of the key already exists, it will have a different version number
suffix and therefore be stored as a separate key in the storage engine. Deleted keys are versions
with a special tombstone value.
Here's a simple diagram of what a history of versions 1 to 5 of keys a to d might look like:
Additionally, we need to keep track of the currently ongoing (uncommitted) transaction versions, known as the "active set".
With versioning and the active set, we can summarize the MVCC protocol with a few simple rules:
When a new transaction begins, it:
When the transaction reads a key, it:
When the transaction writes a key, it:
When the transaction commits, it:
The magic happens when the transaction removes itself from the active set. This is a single, atomic operation, and when it completes all of its writes immediately become visible to new transactions. However, ongoing transactions still won't see these writes, because the version is still in their active set snapshot or at a later version (hence they are isolated from this transaction).
Furthermore, the transaction could see its own uncommitted writes even though noone else could, and if any writes conflicted with another transaction it would error out and have to retry.
Not only that, this also allows us to do time-travel queries, where we can query the database as it was at any time in the past: we simply pick a version number to read at.
There are a few more details that we've left out here: transaction rollbacks need to keep track of the writes and undo them, and read-only queries can avoid allocating new version numbers. We also don't garbage collect old version, for simplicity. See the module documentation for more details:
Let's walk through a simple example with code pointers to get a feel for how this is implemented. Notice how we don't have to deal with any version numbers when we're using the MVCC API -- this is an internal MVCC implementation detail.
// Open a BitCask database in the file "toy.db" with MVCC support.
let path = PathBuf::from("toy.db");
let db = MVCC::new(BitCask::new(path)?);
// Begin a new transaction.
let txn = db.begin()?;
// Read the key "foo", and decode the binary value as a u64 with bincode.
let bytes = txn.get(b"foo")?.expect("foo not found");
let mut value: u64 = bincode::deserialize(&bytes)?;
// Delete "foo".
txn.delete(b"foo")?;
// Add 1 to the value, and write it back to the key "bar".
value += 1;
let bytes = bincode::serialize(&value);
txn.set(b"bar", bytes)?;
// Commit the transaction.
txn.commit()?;
First, we begin a new transaction with MVCC::begin(), which calls through to
Transaction::begin(). This obtains a version number stored in Key::NextVersion and increments
it, then takes a snapshot of the active set in Key::ActiveSet and adds itself to it:
This returns a Transaction object which provides the main key/value API, with get/set/delete
methods. It keeps track of the main state of the transaction: it's version number and active set.
Next, we call Transaction::get(b"foo") to read the value of the key foo. This finds the latest
version that's visible to us (ignoring future versions and the active set). Recall that we store
multiple version of each key as Key::Version(key, version). The Keycode encoding ensures that all
versions are stored in sorted order, so we can do a reverse range scan from Key::Version(b"foo", self.version) to Key::Version(b"foo", 0) and return the latest version that's visible to us:
We then call Transaction::delete(b"foo") and Transaction::set(b"bar", value). Both of these just
call through to the same Transaction::write_version() method, but use Some(value) for a regular
key/value pair and None as a deletion tombstone:
To write a new version of a key, we first have to check for conflicts by seeing if there's a
version of the key that's invisible to us -- if it is, we conflicted with a concurrent transaction.
We use a range scan for this, like we did in Transaction::get().
If there are no conflicts, we go on to write Key::Version(b"foo", self.version) and encode the
value as an Option<value> to accomodate the None tombstone marker. We also write a
Key::TxnWrite(version, key) to keep track of the keys we've written in case we have to roll back.
Finally, Transaction::commit() will make our transaction take effect and become visible. It does
this simply by removing itself from the active set in Key::ActiveSet, and also cleaning up its
Key::TxnWrite write tracking. As the comment says, we don't actually have to flush to durable
storage here, because the Raft log will provide durability for us -- we'll get back to this later.