docs/tech-notes/change-replicas.md
This tech note briefly explains the end-to-end process of up-replicating a range, starting from the replicateQueue to
sending Learner snapshots. This is meant to serve as a high level overview of the code paths taken, and more detailed
descriptions can be found in tech notes specific to each area.
Raft snapshots are necessary when a follower replica is unable to catch up using the existing Raft logs of the leader.
Such a scenario occurs when an existing replica falls too far behind the rest of the Raft group such that we've
truncated the Raft log above it. On the other hand, Learner snapshots are necessary when we add a new replica (due to a
rebalancing operation) and the new replica needs a snapshot to up-replicate. We should make a small distinction
that Learner snapshots are sent during rebalancing operations by the replicateQueue and Raft snapshots
are sent by the raftSnapshotQueue when a replica falls behind. However, there is no difference in the way the
snapshots are sent, generated or applied. For example, in the case where the replication factor is increased, the Raft
group would need to create a new replica from a clean slate. To up-replicate this replica, we would send it a Learner
snapshot of the full range data.
In this note, we will focus on the scenario where a Learner snapshot is needed for the up-replication and rebalancing of replicas.
As a brief overview, the replicateQueue manages a queue of replicas and is responsible for processing replicas for
rebalancing. For each replica that holds the lease, it invokes the allocator to compute decisions on whether any placement changes are needed
and the replicateQueue executes these changes. Then the replicateQueue calls changeReplicas to act on these change
requests and repeats until all changes are complete.
The AdminChangeReplicas API exposed by KV is mainly responsible for atomically changing replicas of a range. These
changes include non-voter promotions, voter/non-voter swaps, additions and removals. The change is performed in a
distributed transaction and takes effect when the transaction is committed. There are many details involved in these
changes, but for the purposes of this note, we will focus on up-replication and the process of sending Learner
snapshots.
During up-replication, ChangeReplicas runs a transaction to add a new learner replica to the range. Learner
replicas are new replicas that receive Raft traffic but do not participate in quorum; allowing the range to remain
highly available during the replication process.
Once the learners have been added to the range, they are synchronously sent a Learner snapshot from the leaseholder (which is typically, but not necessarily, the Raft leader) to up-replicate. There is some nuance here as Raft snapshots are typically automatically sent by the existing Raft snapshot queue. However, we are synchronously sending a snapshot here in the replicateQueue to quickly catch up learners. To prevent a race with the raftSnapshotQueue, we lock snapshots to learners and non-voters on the current leaseholder store while processing the change. In addition, we also place a lock on log truncation to ensure we don't truncate the Raft log while a snapshot is inflight, preventing wasted snapshots. However, both of these locks are the best effort as opposed to guaranteed as the lease can be transferred and the new leaseholder could still truncate the log.
The snapshot process itself is broken into three parts: generating, transmitting and applying the snapshot. This process
is the same whether it is invoked by the replicateQueue or raftSnapshotQueue.
A snapshot is a bulk transfer of all replicated data in a range and everything the replica needs to be a member of a
Raft group. It consists of a consistent view of the state of some replica of a range as of an applied index. The
GetSnapshot method gets a storage engine snapshot which reflects the replicated state as of the applied index the
snapshot is generated at, and creates an iterator from the storage snapshot. A storage engine snapshot is important to
ensure that multiple iterators created at different points in time see a consistent replicated state that does not
change while the snapshot is streamed. In our current code, the engine snapshot is not necessary for correctness as only
one iterator is constructed.
The snapshot transfer is sent through a bi-directional stream of snapshot requests and snapshot responses. However, before the actual range data is sent, the streaming rpc first sends a header message to the recipient store and blocks until the store responds to accept or reject the snapshot data.
The recipient checks the following conditions before accepting the snapshot. We currently allow one concurrent snapshot application on a receiver store at a time. Consequently, if there are snapshot applications already in-flight on the receiver, incoming snapshots are throttled or rejected if they wait too long. The receiver then checks whether its store has a compatible replica present to have the snapshot applied. In the case of up-replication, we expect the learner replica to be present on the receiver store. However, if the snapshot overlaps an existing replica or replica placeholder on the receiver, the snapshot will be rejected as well. Once these conditions have been verified, a response message will be sent back to the sender.
Once the recipient has accepted, the sender proceeds to use the iterator to read chunks of
size kv.snapshot_sender.batch_size from storage into memory. The batch size is enforced as to not hold on to a
significant chunk of data in memory. Then, in-memory batches are created to send the snapshot data in streaming grpc
message chunks to the recipient. Note that snapshots are rate limited depending on cluster settings as to not overload
the network. Finally, once all the data is sent, the sender sends a final message to the receiver and waits for a
response from the recipient indicating whether the snapshot was a success. On the receiver side, the key-value pairs are
used to construct multiple SSTs for direct ingestion, which prevents the receiver from holding the entire snapshot in
memory.
Once the snapshot is received on the receiver, defensive checks are applied to ensure the correct snapshot is received. It ensures there is an initialized learner replica or creates a placeholder to accept the snapshot. Finally, the Raft snapshot message is handed to the replica’s Raft node, which will apply the snapshot. The placeholder or learner is then converted to a fully initialized replica.