docs/RFCS/20181204_copysets.md
Copysets reduce the probability of data loss in the presence of multi node failures in large clusters.
This RFC will present a design for integrating copysets in cockroach and discuss its tradeoffs. Copysets have earlier been discussed in RFC #6484.
More details on copysets can be seen in the academic literature.
In large clusters simultaneous loss of multiple nodes have a very high probability of data loss. For example, consider a cluster of 100 nodes using a replication factor of 3 having ~10k ranges. The simultaneous loss of 2 or more nodes has a very high probability of data loss since there could be a range out of the 10k ranges which has 2 out of its 3 replicas on the 2 lost nodes. This probability can be reduced by adding locality to nodes since cockroach supports failures of all nodes in a locality, but the loss of two nodes in different localities again has a high probability of data loss.
Copysets significantly reduces the probability of data loss in the presence of multi node failures.
Copysets divides the cluster into disjoint sets of nodes. The size of each set will be based on the used replication factors. Separate copysets are created for each replication factor. A range should prefer to allocate its replicas within a copyset rather than spread its replicas across copysets.
So there are two major components
Copysets will initially be an opt-in feature (based on a cluster setting) and
implemented for a scatter width of replication_factor - 1 (eventually it will
be extended to support higher scatter width).
For simplicity, we will explain the design without considering scatter width
in copysets.
The cluster will be divided into copysets. For each replication factor in the cluster, separate copysets will be generated.
The requirements for copysets of a replication factor are
Copysets are generated for each replication factor used in the system. Better failure tolerance can be provided if copysets for different replication factors are aligned, but this is not the case in the presented strategies.
Two possible strategies for copyset allocation is presented below.
Optimal allocation (for locality diversity) of copysets for a particular replication factor can be done as follows:
1. Compute num_copysets = floor(num_stores/replication_factor)
2. Sort stores based on increasing order of locality.
3. Assign copysets to stores in a round robin fashion.
For example, consider the case where we have stores as follows:
Locality1: S1 S2 S3
Locality2: S4 S5 S6
Locality3: S7 S8 S9 S10
Copysets for RF 3 would be created as
num_copysets = 10/3 = 3
CS1: S1 S4 S7 S10
CS2: S2 S5 S8
CS3: S3 S6 S9
In this strategy the goal is to minimize data movement when copysets are regenerated with a different store list (some stores added, some stores removed).
This allocation tries to create a copyset-store mapping (with incremental changes over previously used copysets) which is diverse in locality. It tries to minimize the number of changes to previously used copysets and ensure that each store in a copyset belongs to a different locality when possible. The allocation
Swaps are made between a source copyset and a target copyset which guarantee that the diversity of the source copyset increases while the diversity of the target copyset does decrease (or if it decreases it still is > replication factor).
Store swaps are made between a source copyset and a target copyset based on the localities present in the source and target copyset. The conditions required for a swap are:
By diversity above we mean the number of localities in a copyset.
Point (3) above ensures that diversity of the target copyset does not decrease (or if it decreases it does not fall below replication factor).
A single iteration doing swaps considers all (n choose 2) copyset combinations
where n is the number of copysets. These iterations continue till sum of
diversity of all copysets cannot be improved further (no swap are candidates
found for a whole iteration).
For example, consider the case where we have stores as follows:
Locality1: S1 S2 S3
Locality2: S4 S5 S6
Locality3: S7 S8 S9
Locality4: S10 S11 S12 S13
And initial copyset allocation as
CS1: S1 S5 S9
CS2: S2 S6 S10
CS3: S3 S7 S11
CS4: S4 S8 S12 S13
Say store S6 is removed.
After step 2 (assign stores to same copyset ID till size reaches rf), we have
CS1: S1 S5 S9
CS2: S2 S10
CS3: S3 S7 S11
CS4: S4 S8 S12
After filling empty spots by adding remaining stores (S13 in this case)
CS1: S1 S5 S9
CS2: S2 S10 S13
CS3: S3 S7 S11
CS4: S4 S8 S12
After swaps (between CS1 and CS2 since CS2 has 2 stores from Locality4)
CS1: S1 S5 S13
CS2: S2 S10 S9
CS3: S3 S7 S11
CS4: S4 S8 S12
This strategy may not achieve optimal possible diversity but tries to ensure that each locality within a copyset is different.
The store list considered for copyset allocation would be the current live stores. The way live stores are computed will be the same as the way allocator detects live stores (but throttled stores will not be excluded.) Copysets will be re-generated if the store list has been stable and not changed for 3 ticks (each tick has a 10s interval).
Copyset allocation can be persisted as a proto in the distributed KV layer. The copysets strategy which minimizes data movement requires copysets to be persisted (it requires the previous state to be global and survive restarts). The lowest live node ID in the cluster would be managing (persisting) copysets. Other nodes will be periodically (every 10s) cache the persisted copysets and using it for re-balancing.
Copysets will only be re-generated (and persisted) if the store list changes. In steady state all nodes will be periodically reading the persisted copysets and there will be no need to re-generate and persist new copysets.
The cluster can tolerate failure of one node within each copyset for RF=3. For example a 100 node cluster can tolerate the simultaneous failure of 33 nodes in the best case (for RF=3) without suffering any data loss.
Ranges need to be rebalanced to be contained within a copyset.
There are two range re-balancers currently being used in cockroach:
This RFC will explain the implementation for copyset rebalancing for the replicate queue which processes one replica at a time. Replica rebalancing by the store rebalancer will be disabled if copysets is enabled (at least for the initial version). The store rebalancer can still perform lease holder rebalancing.
The allocator uses a scoring function to
The scoring function considers the following (given in order of priority)
For rebalancing ranges into copysets, a new "copyset score" will be added to the allocator. Priority wise it will be between (2) and (3) above. Zone constraints and disk fullness take a higher priority over copyset score.
Since copyset allocation considers diversity, it's priority can be placed above diversity score. If copysets are disabled in the cluster, this score will have no impact in rebalancing.
Copyset score (higher score is better) of a range is high if:
x we should move the range
completely to a copyset y if the nodes in copyset y have a
significantly lower load (for example nodes in y have a lot more free
disk space).So the following replica transition for a range of RF 3 should be allowed in
case (2):
x x x -> x x y -> x y y -> y y y
where x x x means that the 3 replicas of the range are in copyset x.
Let's say r is the replication factor of a range. Each of its replicas belongs
to a node with a particular copyset id. We can formally define the scores as:
Number of pairwise same copyset id / (r choose 2)Copyset score can be defined as (k * homogeneity_score + idle_score) / (k + 1).
It is normalized and lies between 0 and 1.
Let's say we want to migrate a range from a copyset x to a copyset y if the
idle score of y differs by more than d (configurable).
If d is too small, it could lead to thrashing of replicas, so we can use a
value like 15%.
Though the below calculations may seem a bit complex, to the end user we can
just expose d, which is easy to understand - the max difference between idle
scores of two copysets in the cluster.
For example, if idle score of x is a and y is a + d, we require:
copyset_score(x x x) < copyset_score(x x y)
k * homogeneityScore(x x x) + idleScore(x x x) < k * homogeneityScore(x x y) + idleScore(x x y)
# Generalizing for replication factor r where r = 3 below
homogeneityScore(x x x) = 1
idleScore(x x x) = ra/r = a
homogeneityScore(x x y) = (r-1 choose 2) / (r choose 2) # since 1 copyset is different.
idleScore(x x y) = ((r-1) * a + a + d)/r = (ra + d) / r
# So we get
k * 1 + a <= k * (r-1 choose 2) / (r choose 2) + (ra + d) / r
=> k <= d / 2
For example, for r = 3, d = 0.15, and idle score of x being 0.2 and idle
score of y being 0.36
totalScore(x x x) = 0.075 * 1 + 0.2 = 0.275
totalScore(x x y) = 0.075 * 0.33 + (0.2 + 0.2 + 0.36)/3 = 0.278
So a range will migrate from
(x x x) -> (x x y) -> (x y y) -> (y y y)
The above migration will not happen if y has an idle score of 0.34 (since
d = 0.15).
The first step (x x x) -> (x x y) is the hardest as homogeneity
is broken. The proof for this is given above.
For (x x y) -> (x y y) step, the homogeneity score remains the
same, and idle score improves (since y has a better idle score).
For (x y y) -> (y y y) step, both the homogeneity score and
idle score improve.
When a range actually migrates from (x x x) to (x x y), it goes
through an intermediate step (x x x y) after which one x is
removed, but similar math applies.
This scoring function will allow ranges to organically move into copysets
and try to maintain approximately equal load among copysets. Thrashing will
be avoided by choosing an appropriate value of d.
Due to the above drawbacks, copysets will be disabled by default and there will be a cluster setting where users can enable copysets if they are ok with the above drawbacks.
There can be multiple approaches for both copyset allocation and the scoring function. This design in this RFC is something simple and the respective algorithms can be tweaked independently later.
Chainsets is one way to make incremental changes to copysets, but again potentially at the cost of reduced locality diversity. The length of the chain used in chainsets could be considered equivalent to replication factor in cockroach.
Apart from unit tests, roachtests can be added which verify copyset based rebalancing in the presence of