docs/RFCS/20150729_replica_tombstone.md
Add a Replica ID to each replica in a range descriptor, and pass this replica ID in all raft messages. When a replica is removed from a node, record a tombstone with the replica ID, and reject any later messages referring to that replica.
Replica removal is necessarily an asynchronous process -- the node holding the removed replica may be down at the time of its removal, and so any recovering node may have some replicas that should have been removed but have not yet been garbage-collected. These nodes may try to send raft messages that could disrupt the legitimate members of the group. Even worse, if there has been enough turnover in the membership of a group, a quorum of removed replicas may manage to elect a lease holder among themselves.
We have an additional complication compared to vanilla raft because we allow node IDs to be reused (this is necessary for coalesced heartbeats. We may remove a replica from a node and then re-add it later with the same node ID, so we must be able to distinguish messages from an earlier epoch of the range.
Here is a scenario that can lead to split-brain in the current system:
ChangeReplicas transactions to remove
node C. Several more ChangeReplicas transactions follow, adding
nodes D, E, and F and removing A and B.DistSenders) whose range descriptor cache is out of date may
talk to the ABC group instead of the correct DEF group.The problems caused by removing replicas are also discussed in #768.
A Replica ID is generated with every ChangeReplicas call and
stored in the Replica message of the range descriptor. Replica IDs
are monotonically increasing within a range and never reused. They are
generated using a next_replica_id field which will be added to the
RangeDescriptor (alternative generation strategies include using the
raft log position or database timestamp of the EndTransaction call
that commits the membership change, but this information is less
accessible at the point where it would be needed).
The ReplicaID replaces the current RaftNodeID (which is
constructed from the NodeID and StoreID). Raft only knows about
replica IDs, which are never reused, so we don't have to worry about
certain problems that can aries when node IDs are reused (such as vote
changes or log regression. MultiRaft is responsible for mapping
ReplicaIDs to node and store IDs (which it must do to coalesce
heartbeats). This is done with a new method on the
WriteableGroupStorage interface, along with an in-memory cache. Note
the node and store IDs for a given replica never change once that
replica is created, so we don't need to worry about synchronizing or
invalidating entries in this cache.
The raft transport will send the node, store, and replica ID of both
the sender and receiver with every RaftMessageRequest. This is how
the Store will learn of the replica ID for new ranges (the sender's
IDs must be inserted in the replica ID cache, so we can route
responses that do not yet appear in any range descriptor we have
stored). The Store will drop incoming messages with a replica ID
that is less than the last known one for the range (in order to
minimize disruption from out-of-date servers).
The DistSender will also send the replica ID in the header of every
request, and the receiver will reject requests with incorrect replica
IDs (This will be rare since a range will normally be absent from a
node for a time before being re-replicated).
When a replica is garbage-collected, we write a tombstone containing the replica ID, so that the store can continue to drop out-of-date raft messages even after the GC.
In the scenario above, with this change node C would send node A and B's replica IDs in its vote requests. They would see that they have a tombstone for that replica and not recreate it.
Tombstones must be long-lived, since a node may come back online after a lengthy delay. We cannot garbage-collect tombstones unless we also guarantee that no node will come back after a period of time longer than the tombstone GC time.
If we had an explicit replica-creation RPC instead of creating them automatically on the first sighting of a raft message, this problem may go away. However, doing so would be tricky: this RPC would need to be retried in certain circumstances, and it is difficult to distinguish cases where a retry is necessary from cases that lead to the problem discussed here.