docs/architecture/sql-raft.md
toyDB uses Raft to replicate SQL storage across a cluster of nodes (see the Raft section for details). Every node stores a full copy of the SQL database. The Raft leader replicates writes to the other nodes and executes reads.
Recall the Raft state machine interface raft::State:
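As a rough reminder of its shape, here is a minimal sketch of what such a state machine interface might look like. The method names, the `Entry` struct, the `String` error type, and the toy `Appender` implementation are all assumptions made for illustration; they are not copied from toyDB's source.

```rust
/// Hypothetical log entry: a log index plus an opaque binary command.
pub struct Entry {
    pub index: u64,
    pub command: Vec<u8>,
}

/// Sketch of a Raft state machine interface. Method names and the
/// String error type are assumptions made for this example.
pub trait State {
    /// Returns the index of the last applied entry.
    fn applied_index(&self) -> u64;
    /// Applies a committed write command and returns a binary result.
    fn apply(&mut self, entry: Entry) -> Result<Vec<u8>, String>;
    /// Executes a read command against the current state.
    fn read(&self, command: Vec<u8>) -> Result<Vec<u8>, String>;
}

/// Toy state machine: writes append bytes, reads return all bytes.
#[derive(Default)]
pub struct Appender {
    applied: u64,
    data: Vec<u8>,
}

impl State for Appender {
    fn applied_index(&self) -> u64 {
        self.applied
    }

    fn apply(&mut self, entry: Entry) -> Result<Vec<u8>, String> {
        self.data.extend_from_slice(&entry.command);
        self.applied = entry.index;
        // Return the new total length as an 8-byte big-endian result.
        Ok((self.data.len() as u64).to_be_bytes().to_vec())
    }

    fn read(&self, _command: Vec<u8>) -> Result<Vec<u8>, String> {
        Ok(self.data.clone())
    }
}
```

The key property is that both commands and results are opaque bytes, so Raft never needs to understand what the state machine stores.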
In toyDB, the state machine is just a sql::engine::Local storage engine with a thin wrapper:
Raft will submit read and write commands to this state machine as binary Vec<u8> data, so we have
to represent the methods of sql::engine::Engine as binary Raft commands. We do this as two
enums, sql::engine::raft::Read and sql::engine::raft::Write, which we'll Bincode-encode:
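A minimal sketch of the idea follows. To keep it free of external crates, it uses a small hand-rolled byte encoding as a stand-in for Bincode, and it only shows a few `Write` variants (a `Read` enum would follow the same pattern). The variant set, the `TransactionState` fields, and the wire layout are assumptions for illustration, not toyDB's actual definitions.

```rust
/// Simplified stand-in for mvcc::TransactionState (fields are assumptions).
#[derive(Clone, Debug, PartialEq)]
pub struct TransactionState {
    pub version: u64,
    pub read_only: bool,
}

/// A few write commands, each mirroring an engine or transaction method.
#[derive(Clone, Debug, PartialEq)]
pub enum Write {
    Begin,
    Commit(TransactionState),
    Insert { txn: TransactionState, table: String, row: Vec<u8> },
}

impl TransactionState {
    fn encode(&self, out: &mut Vec<u8>) {
        out.extend_from_slice(&self.version.to_be_bytes());
        out.push(self.read_only as u8);
    }

    fn decode(buf: &[u8]) -> Result<(TransactionState, &[u8]), String> {
        if buf.len() < 9 {
            return Err("short transaction state".to_string());
        }
        let mut v = [0u8; 8];
        v.copy_from_slice(&buf[..8]);
        let state = TransactionState { version: u64::from_be_bytes(v), read_only: buf[8] != 0 };
        Ok((state, &buf[9..]))
    }
}

/// Writes a 4-byte big-endian length prefix followed by the bytes.
fn put_bytes(out: &mut Vec<u8>, b: &[u8]) {
    out.extend_from_slice(&(b.len() as u32).to_be_bytes());
    out.extend_from_slice(b);
}

/// Reads a length-prefixed byte string, returning it and the remainder.
fn take_bytes(buf: &[u8]) -> Result<(&[u8], &[u8]), String> {
    if buf.len() < 4 {
        return Err("short length prefix".to_string());
    }
    let len = u32::from_be_bytes([buf[0], buf[1], buf[2], buf[3]]) as usize;
    let rest = &buf[4..];
    if rest.len() < len {
        return Err("short payload".to_string());
    }
    Ok(rest.split_at(len))
}

impl Write {
    /// Encodes the command as a tag byte followed by its fields.
    pub fn encode(&self) -> Vec<u8> {
        let mut out = Vec::new();
        match self {
            Write::Begin => out.push(0),
            Write::Commit(txn) => {
                out.push(1);
                txn.encode(&mut out);
            }
            Write::Insert { txn, table, row } => {
                out.push(2);
                txn.encode(&mut out);
                put_bytes(&mut out, table.as_bytes());
                put_bytes(&mut out, row);
            }
        }
        out
    }

    /// Decodes a command previously produced by encode().
    pub fn decode(buf: &[u8]) -> Result<Write, String> {
        let (&tag, rest) = buf.split_first().ok_or("empty command")?;
        match tag {
            0 => Ok(Write::Begin),
            1 => Ok(Write::Commit(TransactionState::decode(rest)?.0)),
            2 => {
                let (txn, rest) = TransactionState::decode(rest)?;
                let (table, rest) = take_bytes(rest)?;
                let (row, _) = take_bytes(rest)?;
                let table = String::from_utf8(table.to_vec()).map_err(|e| e.to_string())?;
                Ok(Write::Insert { txn, table, row: row.to_vec() })
            }
            t => Err(format!("unknown write tag {}", t)),
        }
    }
}
```

Splitting commands into separate read and write enums lets the caller decide up front whether a request has to go through the replicated log or can be served as a read.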
Notice that almost all requests include a mvcc::TransactionState. Most of the useful methods of
sql::engine::Engine are on the sql::engine::Transaction, but unlike the Local engine, below
Raft we can't hold on to a Transaction object in memory between each command -- nodes may restart
and leadership may move, and we want client transactions to keep working despite this. Instead, we
will use the client-supplied mvcc::TransactionState to reconstruct a Transaction for every
command via mvcc::Transaction::resume() and call methods on it.
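The pattern of rebuilding a transaction from client-supplied state can be sketched as follows. This is a simplified illustration: the `TransactionState` fields, the visibility rule, and the helper `handle_is_visible` are assumptions, and the real mvcc::Transaction::resume() may validate and carry more than is shown here.

```rust
use std::collections::BTreeSet;

/// Simplified stand-in for mvcc::TransactionState (fields are assumptions).
#[derive(Clone, Debug, PartialEq)]
pub struct TransactionState {
    /// The version this transaction runs at.
    pub version: u64,
    /// Whether the transaction is read-only (carried along, unused below).
    pub read_only: bool,
    /// Versions of transactions that were active when this one began.
    pub active: BTreeSet<u64>,
}

/// A transaction handle that holds nothing but its state.
pub struct Transaction {
    state: TransactionState,
}

impl Transaction {
    /// Rebuilds a transaction handle from previously exported state.
    pub fn resume(state: TransactionState) -> Transaction {
        Transaction { state }
    }

    /// Exports the state so a client can send it with the next command.
    pub fn state(&self) -> &TransactionState {
        &self.state
    }

    /// Simplified visibility rule (an assumption for this sketch): a
    /// version is visible if it is not newer than ours and did not belong
    /// to a transaction that was still active when we began.
    pub fn is_visible(&self, version: u64) -> bool {
        version <= self.state.version && !self.state.active.contains(&version)
    }
}

/// Hypothetical command handler: nothing is kept in memory between
/// calls, so any node can serve any command of the transaction.
pub fn handle_is_visible(state: TransactionState, version: u64) -> bool {
    Transaction::resume(state).is_visible(version)
}
```

Because the handle is fully determined by its state, two nodes that resume the same state behave identically, which is what lets a transaction survive restarts and leader changes.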
When the state machine receives a write command, it decodes it as a Write and calls the
appropriate Local method. The result is Bincode-encoded and returned to the caller, who knows what
return type to expect for a given command. The state machine also keeps track of the Raft applied
index of each command as a separate key in the key/value store.
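The write path can be sketched roughly as below. This is not toyDB's code: an in-memory `BTreeMap` stands in for the Local engine, a tiny two-variant command format stands in for the Bincode-encoded Write enum, and the key name `APPLIED_INDEX_KEY` is a hypothetical choice. It only illustrates decoding a command, dispatching it, and recording the applied index in the same store.

```rust
use std::collections::BTreeMap;

/// Hypothetical reserved key under which the applied index is stored.
const APPLIED_INDEX_KEY: &[u8] = b"\x00applied_index";

/// Toy write commands standing in for the real Write enum.
pub enum Write {
    Set { key: Vec<u8>, value: Vec<u8> },
    Delete { key: Vec<u8> },
}

/// Decodes a command. Assumed layout: [0, key_len, key.., value..] for
/// Set and [1, key..] for Delete.
fn decode(cmd: &[u8]) -> Result<Write, String> {
    match cmd {
        [0, klen, rest @ ..] if rest.len() >= *klen as usize => {
            let (key, value) = rest.split_at(*klen as usize);
            Ok(Write::Set { key: key.to_vec(), value: value.to_vec() })
        }
        [1, key @ ..] => Ok(Write::Delete { key: key.to_vec() }),
        _ => Err("invalid write command".to_string()),
    }
}

/// State machine wrapping a key/value store (stand-in for Local).
pub struct State {
    store: BTreeMap<Vec<u8>, Vec<u8>>,
}

impl State {
    pub fn new() -> Self {
        State { store: BTreeMap::new() }
    }

    /// Reads the applied index back from the store, or 0 if unset.
    pub fn applied_index(&self) -> u64 {
        match self.store.get(APPLIED_INDEX_KEY) {
            Some(v) if v.len() == 8 => {
                let mut b = [0u8; 8];
                b.copy_from_slice(v);
                u64::from_be_bytes(b)
            }
            _ => 0,
        }
    }

    /// Looks up a key directly (used here only to inspect results).
    pub fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        self.store.get(key).cloned()
    }

    /// Decodes and executes a write, then records the entry's index.
    pub fn apply(&mut self, index: u64, command: &[u8]) -> Result<Vec<u8>, String> {
        let result = decode(command).map(|write| {
            match write {
                Write::Set { key, value } => self.store.insert(key, value),
                Write::Delete { key } => self.store.remove(&key),
            };
            Vec::new() // Encoded result; empty stands for Ok(()) here.
        });
        // In this sketch the index is recorded even if the command failed,
        // since the entry has still been consumed from the log.
        self.store.insert(APPLIED_INDEX_KEY.to_vec(), index.to_be_bytes().to_vec());
        result
    }
}
```

Keeping the applied index in the same store as the data means a restarted node can tell which log entries it has already applied.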
Similarly, read commands are decoded as a Read and the appropriate Local method is called:
That's the state machine running below Raft. But how do we actually send these commands to Raft and
receive results? That's handled by the sql::engine::Raft implementation, which uses a channel to
send requests to the local Raft node (we'll see how this plumbing works in the server section):
The channel takes a raft::Request containing binary Raft client requests and a return channel
where the Raft node can send back a raft::Response. The Raft engine has a few convenience methods
to send requests and receive responses, for both read and write requests:
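The request/response plumbing can be sketched with standard library channels. This is a minimal illustration under stated assumptions: `Request`, `Response`, the `Raft` struct, and `spawn_fake_node` are simplified stand-ins defined here, `std::sync::mpsc` is used for the channels, and a background thread plays the role of the local Raft node without doing any replication.

```rust
use std::sync::mpsc;
use std::thread;

/// Simplified client request carrying an opaque binary command.
pub enum Request {
    Read(Vec<u8>),
    Write(Vec<u8>),
}

/// Simplified response: a binary result or an error message.
pub type Response = Result<Vec<u8>, String>;

/// Client-side handle: a channel to the local node.
pub struct Raft {
    tx: mpsc::Sender<(Request, mpsc::Sender<Response>)>,
}

impl Raft {
    /// Sends a request along with a fresh return channel, then blocks
    /// until the node answers on that channel.
    fn execute(&self, request: Request) -> Response {
        let (response_tx, response_rx) = mpsc::channel();
        self.tx.send((request, response_tx)).map_err(|e| e.to_string())?;
        response_rx.recv().map_err(|e| e.to_string())?
    }

    /// Convenience wrapper for write commands.
    pub fn write(&self, command: Vec<u8>) -> Response {
        self.execute(Request::Write(command))
    }

    /// Convenience wrapper for read commands.
    pub fn read(&self, command: Vec<u8>) -> Response {
        self.execute(Request::Read(command))
    }
}

/// Spawns a thread standing in for the Raft node. Writes are appended to
/// a log and answered with the new log length; reads return all bytes.
pub fn spawn_fake_node() -> Raft {
    let (tx, rx) = mpsc::channel::<(Request, mpsc::Sender<Response>)>();
    thread::spawn(move || {
        let mut log: Vec<Vec<u8>> = Vec::new();
        for (request, response_tx) in rx {
            let response = match request {
                Request::Write(command) => {
                    log.push(command);
                    Ok((log.len() as u64).to_be_bytes().to_vec())
                }
                Request::Read(_) => Ok(log.concat()),
            };
            // The caller may have gone away; ignore send failures.
            let _ = response_tx.send(response);
        }
    });
    Raft { tx }
}
```

Passing a per-request return channel keeps each caller's response separate without the node needing to track who asked for what.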
And the implementation of the sql::engine::Engine and sql::engine::Transaction traits simply
send these requests via Raft:
One thing to note here is that we don't support streaming data via Raft, so e.g. the
Transaction::scan method will buffer the entire result in a Vec. With a full table scan, this
will load the entire table into memory -- that's unfortunate, but we keep it simple.
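The difference between a streaming and a buffered scan can be illustrated with a small sketch. The `Row` alias, the `BTreeMap` table, and both function names are hypothetical; the point is only that a response sent through Raft has to be a single complete value, so the rows are collected before being returned.

```rust
use std::collections::BTreeMap;

/// Hypothetical row type for this example.
pub type Row = Vec<String>;

/// Local-style scan: rows are yielded lazily, one at a time.
pub fn scan_streaming<'a>(table: &'a BTreeMap<u64, Row>) -> impl Iterator<Item = &'a Row> + 'a {
    table.values()
}

/// Raft-style scan: the whole result is materialized as one Vec, so
/// memory use grows with the number of matching rows.
pub fn scan_buffered(table: &BTreeMap<u64, Row>) -> Vec<Row> {
    table.values().cloned().collect()
}
```

A caller of the streaming version can stop after the first row, while the buffered version has already copied every row by the time it returns.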
To summarize, this is what happens when Transaction::insert() is called to insert a row via Raft:
1. sql::engine::raft::Transaction::insert(): called to insert a row.
2. sql::engine::raft::Write::Insert: enum representation of the insert command.
3. raft::Request::Write: Raft request containing the Bincode-encoded Write::Insert command.
4. sql::engine::raft::Engine::tx: sends the Request::Write and response channel to Raft.
5. raft::Node::step(): the Request::Write is given to Raft in a Message::ClientRequest.
6. raft::State::apply(): the Bincode-encoded Write::Insert is passed to the state machine.
7. sql::engine::raft::State::apply(): decodes the command to a Write::Insert.
8. sql::engine::raft::State::local: contains the Local engine on each node.
9. sql::engine::local::Engine::resume(): called to obtain the SQL/MVCC transaction.
10. sql::engine::local::Transaction::insert(): the row is inserted into the local engine.
11. raft::RawNode::tx: the Ok(()) result is sent as a Bincode-encoded Message::ClientResponse.
12. sql::engine::raft::Transaction::insert(): receives the result and returns it to the caller.

The plumbing here will be covered in more detail in the server section.