docs/dev/strong_consistency.md
This document describes the implementation details and design choices for the strongly-consistent tables feature
The feature is heavily based on the existing implementation of Raft in scylla, which is described in docs/dev/raft-in-scylla.md.
The Raft groups for strongly consistent tables differ from Raft group0 particularly in the extend of where their Raft group members can be located. For group0, all group members (Raft servers) are on shard 0. For groups for strongly consistent tablets, the group members may be located on any shard. In the future, they will even be able to move alongside their corresponding tablets.
That's why, when adding the Raft metadata persistence layer for strongly consistent tables, we can't reuse the existing approach for group 0. Group0's persistence stores all Raft state on shard 0. This approach can't be used for strongly consistent tables, because raft groups for strongly consistent tables can occupy many different shards and their metadata may be updated often. Storing all data on a single shard would at the same time make this shard a bottleneck and it would require performing cross-shard operations for most strongly consistent writes, which would also diminish their performance on its own.
Instead, we want to store the metadata for a Raft group on the same shard where this group's server is located, avoiding any cross-shard operations and evenly distributing the work related to writing metadata to all shards.
We introduce a separate set of Raft system tables for strongly consistent tablets:
system.raft_groupssystem.raft_groups_snapshotssystem.raft_groups_snapshot_configThese tables mirror the logical contents of the existing system.raft, system.raft_snapshots,
system.raft_snapshot_config tables, but their partition key is a composite (shard, group_id)
rather than just group_id.
To make “(shard, group_id) belongs to shard X” true at the storage layer, we use:
service::strong_consistency::raft_groups_partitioner)
which encodes the shard into the token, andservice::strong_consistency::raft_groups_sharder) which extracts
that shard from the token.As a result, reads and writes for a given group’s persistence are routed to the same shard where the Raft server instance runs.
The partitioner encodes the destination shard in the token’s high bits:
[shard: 16 bits][group_id_hash: 48 bits]smallint column used in the schema.
it also needs to be non-negative, so it's effectively limited to range [0, 32767]group_id (timeuuid)The key property is that shard extraction is a pure bit operation and does not depend on the cluster’s shard count.
raft_groups_sharder::shard_for_writes() returns up to one shard - it does not support
migrations using double writes. Instead, for a given Raft group, when a tablet is migrated,
the Raft metadata needs to be erased from the former location and added in the new location.