engine/packages/epoxy/README.md
Epoxy is a geo-distributed, strongly-consistent KV store based on the EPaxos protocol.
Paper: https://www.cs.cmu.edu/~dga/papers/epaxos-sosp2013.pdf

Reference implementations:

- TODO: Correctness issue: https://github.com/otrack/on-epaxos-correctness

Talks:

Other resources:
Interference is at the core of EPaxos: it determines whether a command can be committed on the fast path or must run the full slow-path Paxos protocol. Two commands interfere when their execution order affects the result; for a KV store, this means they touch the same key and at least one of them is a write.
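As a minimal sketch of what an interference check can look like for a KV workload (the types and names here are illustrative, not Epoxy's actual API):

```rust
/// Illustrative command type for a KV workload; not Epoxy's actual types.
enum Command {
    /// Write `value` only if the current value matches `expect`.
    CheckAndSet {
        key: Vec<u8>,
        expect: Option<Vec<u8>>,
        value: Vec<u8>,
    },
    /// Optimistic read of a key.
    Read { key: Vec<u8> },
}

impl Command {
    fn key(&self) -> &[u8] {
        match self {
            Command::CheckAndSet { key, .. } | Command::Read { key } => key,
        }
    }

    fn is_write(&self) -> bool {
        matches!(self, Command::CheckAndSet { .. })
    }
}

/// Two commands interfere when their execution order matters: they touch
/// the same key and at least one of them writes. A command with no
/// interfering peers can commit on the fast path.
fn interferes(a: &Command, b: &Command) -> bool {
    a.key() == b.key() && (a.is_write() || b.is_write())
}
```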
Some notes on Rivet's workload requirements as they relate to interference:

- `pegboard::workflows::actor::actor_keys::Propose`: Uses a check-or-set operation for the actor key's reservation ID
- `pegboard::ops::get_reservation_for_key`: Uses an optimistic read to find the reservation ID in order to resolve the actual actor ID

The leader datacenter holds a coordinator workflow. This workflow detects config changes and handles propagating them to all replicas.

See spec/RECONFIGURE.md for more information.
Each datacenter has its own replica workflow that is responsible for that datacenter's side of the protocol (e.g. receiving config changes from the coordinator and participating in proposals).
All peer-to-peer communication is done via the `POST /v{version}/epoxy/message` endpoint using the versioned BARE epoxy protocol.
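A hedged sketch of what a peer-to-peer send might look like, assuming `reqwest` for HTTP and `serde_bare` for BARE encoding; the endpoint path comes from this README (with `v1` as an example version), while the message type and helper are hypothetical:

```rust
use serde::Serialize;

/// Hypothetical stand-in for the versioned BARE epoxy message type.
#[derive(Serialize)]
struct Message {
    kind: u8,
    payload: Vec<u8>,
}

/// Send a BARE-encoded message to a peer replica over the versioned
/// HTTP endpoint described above (using v1 as an example version).
async fn send_to_peer(
    client: &reqwest::Client,
    peer_url: &str,
    msg: &Message,
) -> anyhow::Result<()> {
    // Encode the message with BARE before posting it to the peer.
    let body = serde_bare::to_vec(msg)?;
    client
        .post(format!("{peer_url}/v1/epoxy/message"))
        .body(body)
        .send()
        .await?
        .error_for_status()?;
    Ok(())
}
```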
Proposals are the mechanism for establishing consensus when making a change to the KV store.

Proposals are not implemented as a workflow; see spec/PROPOSAL.md for the rationale and more information.
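As a rough illustration of the EPaxos attributes a proposal has to carry (the actual layout lives in spec/PROPOSAL.md; these types are assumptions):

```rust
type ReplicaId = u64;

/// Identifies an EPaxos instance: the replica that leads it plus its slot.
type InstanceId = (ReplicaId, u64);

/// Hypothetical proposal shape. In EPaxos, consensus is reached on the
/// command *together with* its `seq` and `deps` attributes, which later
/// drive the execution order.
struct Proposal {
    instance: InstanceId,
    /// The KV command being agreed on (e.g. a check-or-set).
    command: Vec<u8>,
    /// One plus the max sequence number among interfering commands.
    seq: u64,
    /// Instances whose commands interfere with this one.
    deps: Vec<InstanceId>,
}
```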
Do not use `rivet_config::config::Topology` (i.e. `ctx.config().topology()`) from this service. Instead, read the cluster config propagated from the coordinator with `epoxy::ops::read_cluster_config`.
The leader datacenter runs a coordinator workflow in charge of coordinating config changes across all peers.
This ensures that peers receive config changes at the exact same time.
When a config change is detected, the coordinator will attempt to propagate the new config to all replicas.

If the config is invalid (e.g. a replica cannot be reached), the workflow will keep retrying indefinitely.

An in-flight reconfiguration is aborted if another config change is detected. In this case, all pending changes are abandoned and the coordinator attempts to reconfigure with the new config.
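A minimal sketch of that retry/abort behavior, assuming a `tokio` watch channel delivers config changes (all names here are illustrative, not Epoxy's actual types):

```rust
use std::time::Duration;
use tokio::sync::watch;

#[derive(Clone)]
struct ClusterConfig {
    replica_urls: Vec<String>,
}

/// Keep pushing `config` to every replica until all of them acknowledge
/// it, retrying unreachable replicas indefinitely. If a newer config
/// arrives mid-flight, abandon this attempt so the caller can restart
/// propagation with the new config.
async fn propagate(
    config: &ClusterConfig,
    changes: &mut watch::Receiver<ClusterConfig>,
) {
    loop {
        tokio::select! {
            // Another config change was detected: abort and let the
            // caller reconfigure with the new config.
            _ = changes.changed() => return,
            all_acked = push_to_all(config) => {
                if all_acked {
                    return;
                }
                // Some replica could not be reached; retry after a delay.
                tokio::time::sleep(Duration::from_secs(1)).await;
            }
        }
    }
}

async fn push_to_all(config: &ClusterConfig) -> bool {
    // Placeholder: a real implementation would send the config to each
    // replica_url and return true only once all of them acknowledge it.
    let _ = &config.replica_urls;
    true
}
```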
The paper's explanation of adding new replicas is incorrect. It describes a join process in which a new replica is added to the voting membership before it has downloaded all existing instances.

The issue with this is that the quorum count changes immediately, but the new replica cannot vote yet. This means that if you add too many replicas at the same time, you will cause a complete outage until the new replicas have downloaded all instances and begun voting.
Instead, we opt to let new replicas download all existing instances before they are counted toward quorums.
This enables clusters to add many new replicas at once without causing an outage.
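A small sketch of the membership rule this implies, with illustrative types: replicas that are still catching up are tracked but excluded from quorum math, so adding them does not change what a quorum requires.

```rust
/// Illustrative membership model; not Epoxy's actual types.
enum ReplicaStatus {
    /// Fully caught up and counted toward quorums.
    Voter,
    /// Still downloading existing instances; does not count toward
    /// quorums yet.
    Learner,
}

struct Replica {
    id: u64,
    status: ReplicaStatus,
}

/// Simple-majority quorum computed over voters only. Adding any number
/// of learners leaves this unchanged, so a mass join cannot cause an
/// outage; each learner is promoted to voter once it has caught up.
fn majority_quorum(replicas: &[Replica]) -> usize {
    let voters = replicas
        .iter()
        .filter(|r| matches!(r.status, ReplicaStatus::Voter))
        .count();
    voters / 2 + 1
}
```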
Explicit Prepare is implemented but not integrated anywhere.
The paper recommends prioritizing earlier commands. We don't have functionality for this right now, but the current design has almost no contention, so this is not a concern.
The paper recommends batching requests for a "9x" improvement in throughput. We don't need to do this since we (a) execute requests in parallel, (b) assume low contention, and (c) built the system to be stateless and to scale horizontally with UDB.
The paper recommends an optimized recovery mechanism that requires fewer replicas for consensus.