docs/divergence.md
In 2013 HashiCorp created its own Raft implementation based on the just-released Raft paper by Diego Ongaro and John Ousterhout. This was before Diego's subsequent Raft dissertation in 2014, and long before third-party analyses such as Heidi Howard and Ittai Abraham's Raft does not Guarantee Liveness in the face of Network Faults in 2020 [1].
Adoption of HashiCorp's Raft library grew rapidly through Consul and Nomad, and later Vault, in parallel with rapidly expanding use of etcd's and other implementations.
This explosion of activity in both live systems and research led to wide divergence, not only among implementations but also between implementations and the original paper and dissertation.
This document attempts to explain where HashiCorp Raft either meaningfully diverges from the original Raft paper, or makes an implementation choice not explicitly outlined in the paper.
This is not expected to be a comprehensive list. Additions and edits are welcome!
The Raft paper defines heartbeats as empty AppendEntries RPCs which are sent by the leader to each server after elections and during idle periods to prevent election timeouts.
HashiCorp Raft performs heartbeating concurrently with other AppendEntries RPCs so that the election timeout does not have to be set high enough to cover the longest acceptable disk operation. This allows the heartbeat timeout to detect network partitions much more quickly, without the risk of triggering an election during periodic but ephemeral spikes in disk I/O latency.
The Raft does not Guarantee Liveness paper describes how certain partitions can prevent Raft clusters from making progress by causing continual elections.
HashiCorp Raft implements the second of the suggested fixes from Howard's paper: rejecting vote request RPCs when there is already an established leader. The paper defines this more precisely as:
...ignore RequestVote RPCs if they have received an AppendEntries RPC from the leader within the election timeout.
This approach is actually mentioned in the Cluster membership changes section of the original Raft paper, but the paper explicitly excludes its use during "normal" elections:
To prevent this problem, servers disregard RequestVote RPCs when they believe a current leader exists. Specifically, if a server receives a RequestVote RPC within the minimum election timeout of hearing from a current leader, it does not update its term or grant its vote. This does not affect normal elections...
So HashiCorp Raft follows the later paper's suggestion and ignores the original paper's exclusion of this logic during normal operation.
HashiCorp Raft implements the Pre-Vote extension defined in the Raft dissertation (§9.6). Pre-Vote is an optimization where a candidate discovers whether its log is sufficiently up to date to win an election before it increments its term and triggers an election.
The Pre-Vote extension is enabled by default but may be disabled using the Config.PreVoteDisabled flag.
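The idea behind Pre-Vote can be sketched with a few small functions. This is a simplified model rather than the library's implementation: the `logPosition` type and the function names are hypothetical, and it assumes every peer responds to the pre-vote request.

```go
package main

import "fmt"

// logPosition identifies the last entry in a server's log (hypothetical type).
type logPosition struct {
	Term  uint64
	Index uint64
}

// atLeastAsUpToDate is the standard Raft log comparison: the log with the
// later last term wins, and ties are broken by the higher last index.
func atLeastAsUpToDate(candidate, voter logPosition) bool {
	if candidate.Term != voter.Term {
		return candidate.Term > voter.Term
	}
	return candidate.Index >= voter.Index
}

// wouldWinElection is the pre-vote check: it counts the peers that would
// grant a vote without changing anyone's term.
func wouldWinElection(candidate logPosition, peers []logPosition) bool {
	granted := 1 // the candidate votes for itself
	for _, p := range peers {
		if atLeastAsUpToDate(candidate, p) {
			granted++
		}
	}
	clusterSize := len(peers) + 1
	return granted > clusterSize/2
}

// nextTerm returns the term to use after the election timer fires. The term
// is only incremented when the pre-vote round shows the election is winnable.
func nextTerm(currentTerm uint64, candidate logPosition, peers []logPosition) uint64 {
	if wouldWinElection(candidate, peers) {
		return currentTerm + 1
	}
	return currentTerm
}

func main() {
	stale := logPosition{Term: 2, Index: 5}
	fresh := logPosition{Term: 3, Index: 10}

	// A stale server cannot win, so its term stays unchanged.
	fmt.Println(nextTerm(3, stale, []logPosition{fresh, fresh}))
	// An up-to-date server can win, so it moves to the next term.
	fmt.Println(nextTerm(3, fresh, []logPosition{fresh, stale}))
}
```

Because the stale server never increments its term, it cannot force the rest of the cluster into a needless election when it rejoins.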
HashiCorp Raft implements the Leadership Transfer extension as defined in the Raft dissertation (§3.10). Leadership transfer is an optimization that allows the current leader to hand off leadership to a follower to avoid waiting for the election timeout during regular operations such as restarts and upgrades.
While leadership transfer is defined in the Raft dissertation, HashiCorp Raft extends the specification slightly because of another divergence in HashiCorp Raft: rejecting votes when there's already a leader. Since other followers would reject the intended new leader's request for a vote, HashiCorp Raft adds an extra LeadershipTransfer flag to override that behavior in the case of leadership transfers.
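The interaction between the two behaviors can be sketched as follows. This is a hedged illustration, not the library's code: the `voteRequest` type and the `rejectBecauseLeaderKnown` function are hypothetical, and only the LeadershipTransfer flag name is taken from the text above.

```go
package main

import "fmt"

// voteRequest is a pared-down, hypothetical RequestVote message.
type voteRequest struct {
	Candidate          string
	LeadershipTransfer bool
}

// rejectBecauseLeaderKnown reports whether a follower rejects the request
// solely because it already knows a leader. An empty knownLeader means the
// follower does not currently know of one.
func rejectBecauseLeaderKnown(knownLeader string, req voteRequest) bool {
	if knownLeader == "" {
		return false
	}
	if req.LeadershipTransfer {
		// The current leader asked this candidate to take over, so the
		// "leader already exists" rejection is skipped.
		return false
	}
	return true
}

func main() {
	normal := voteRequest{Candidate: "node-b"}
	transfer := voteRequest{Candidate: "node-b", LeadershipTransfer: true}

	fmt.Println(rejectBecauseLeaderKnown("node-a", normal))   // rejected
	fmt.Println(rejectBecauseLeaderKnown("node-a", transfer)) // allowed to proceed
}
```

A request that gets past this check is still subject to the usual term and log comparisons before a vote is granted.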
All Raft members should support leadership transfers before a transfer is attempted. The feature is not enabled by default and must be triggered explicitly at the application level. Consul was the first to implement this, via mechanisms in its API/CLI and graceful agent shutdown.
[1] See https://raft.github.io/ for a comprehensive list of papers and resources.