Feature Name: Copysets
Status: draft
Start Date: 2018-12-04
Authors: Vijay Karthik, Mohammed Hassan
RFC PR: (PR # after acceptance of initial draft)
Cockroach Issue: #25194

Summary

Copysets reduce the probability of data loss in the presence of multi node failures in large clusters.

This RFC will present a design for integrating copysets in cockroach and discuss its tradeoffs. Copysets have earlier been discussed in RFC #6484.

More details on copysets can be seen in the academic literature.

Motivation

In large clusters simultaneous loss of multiple nodes have a very high probability of data loss. For example, consider a cluster of 100 nodes using a replication factor of 3 having ~10k ranges. The simultaneous loss of 2 or more nodes has a very high probability of data loss since there could be a range out of the 10k ranges which has 2 out of its 3 replicas on the 2 lost nodes. This probability can be reduced by adding locality to nodes since cockroach supports failures of all nodes in a locality, but the loss of two nodes in different localities again has a high probability of data loss.

Copysets significantly reduces the probability of data loss in the presence of multi node failures.

Guide-level explanation

Copysets divides the cluster into disjoint sets of nodes. The size of each set will be based on the used replication factors. Separate copysets are created for each replication factor. A range should prefer to allocate its replicas within a copyset rather than spread its replicas across copysets.

So there are two major components

Managing copysets (which node belongs to which copyset)
Copyset assignments should take into account locality of nodes so that locality fault tolerance is not lost. Addition / Removal / Crashed nodes should be taken into account when assigning nodes to copysets.
Rebalancing all replicas of a range to reside within a single copyset on a best effort basis.
Rebalancing replicas into copysets is important, but some properties like constraints set by a user should take priority over copysets.

Copysets will initially be an opt-in feature (based on a cluster setting) and implemented for a scatter width of replication_factor - 1 (eventually it will be extended to support higher scatter width).
For simplicity, we will explain the design without considering scatter width in copysets.

Design

Managing copysets

The cluster will be divided into copysets. For each replication factor in the cluster, separate copysets will be generated.

The requirements for copysets of a replication factor are

There should be no overlap of nodes between copysets for scatter width of rf -1 and minimize overlapping nodes for scatter width >= rf (where rf is the replication factor).
Copysets should be locality fault tolerant (each node in a copyset should preferably be from a different locality)
Copysets should rebalance on node additions / removal / failures.

Copysets are generated for each replication factor used in the system. Better failure tolerance can be provided if copysets for different replication factors are aligned, but this is not the case in the presented strategies.

Two possible strategies for copyset allocation is presented below.

Optimal diversity copyset allocation

Optimal allocation (for locality diversity) of copysets for a particular replication factor can be done as follows:

1. Compute num_copysets = floor(num_stores/replication_factor)
2. Sort stores based on increasing order of locality.
3. Assign copysets to stores in a round robin fashion.

For example, consider the case where we have stores as follows:

Locality1: S1  S2  S3
Locality2: S4  S5  S6 
Locality3: S7  S8  S9 S10

Copysets for RF 3 would be created as

num_copysets = 10/3 = 3
CS1: S1 S4 S7 S10
CS2: S2 S5 S8
CS3: S3 S6 S9

Minimize data movement copyset allocation

In this strategy the goal is to minimize data movement when copysets are regenerated with a different store list (some stores added, some stores removed).

This allocation tries to create a copyset-store mapping (with incremental changes over previously used copysets) which is diverse in locality. It tries to minimize the number of changes to previously used copysets and ensure that each store in a copyset belongs to a different locality when possible. The allocation

Computes the required number of copysets for the new store list.
Assign previously existing stores to the same copyset id they belonged to (if copyset id exists based on 1) if copyset size < replication factor
Adds the newly added stores (not present in previous copyset allocation) and remaining stores from (2) to empty spots in each copyset (if the copyset has < replication factor stores or if it is the last copyset). after assigning previously existing stores which have carried over).
Swaps stores between copysets to avoid duplicate localities in a single copyset till it converges (diversity cannot be improved further).

Swaps

Swaps are made between a source copyset and a target copyset which guarantee that the diversity of the source copyset increases while the diversity of the target copyset does decrease (or if it decreases it still is > replication factor).

Store swaps are made between a source copyset and a target copyset based on the localities present in the source and target copyset. The conditions required for a swap are:

The source copyset has diversity < replication factor. This means that the source copyset has two stores in a particular locality. One of these stores will be a source swap candidate.
The target copyset has a locality not present in the source copyset (let's call this target locality). A store from this locality will be a target swap candidate.
One of the following is true
1. Locality of the source swap candidate is not present in the target copyset.
2. Target copyset either
  1. Has two stores in the target locality.
  2. Has diversity > replication factor.

By diversity above we mean the number of localities in a copyset.

Point (3) above ensures that diversity of the target copyset does not decrease (or if it decreases it does not fall below replication factor).

A single iteration doing swaps considers all (n choose 2) copyset combinations where n is the number of copysets. These iterations continue till sum of diversity of all copysets cannot be improved further (no swap are candidates found for a whole iteration).

For example, consider the case where we have stores as follows:

Locality1: S1  S2  S3
Locality2: S4  S5  S6
Locality3: S7  S8  S9
Locality4: S10 S11 S12 S13

And initial copyset allocation as

CS1: S1 S5 S9
CS2: S2 S6 S10
CS3: S3 S7 S11
CS4: S4 S8 S12 S13

Say store S6 is removed.

After step 2 (assign stores to same copyset ID till size reaches rf), we have

CS1: S1 S5 S9
CS2: S2 S10
CS3: S3 S7 S11
CS4: S4 S8 S12

After filling empty spots by adding remaining stores (S13 in this case)

CS1: S1 S5 S9
CS2: S2 S10 S13
CS3: S3 S7 S11
CS4: S4 S8 S12

After swaps (between CS1 and CS2 since CS2 has 2 stores from Locality4)

CS1: S1 S5 S13
CS2: S2 S10 S9
CS3: S3 S7 S11
CS4: S4 S8 S12

This strategy may not achieve optimal possible diversity but tries to ensure that each locality within a copyset is different.

Copyset re-generation

The store list considered for copyset allocation would be the current live stores. The way live stores are computed will be the same as the way allocator detects live stores (but throttled stores will not be excluded.) Copysets will be re-generated if the store list has been stable and not changed for 3 ticks (each tick has a 10s interval).

Copyset allocation can be persisted as a proto in the distributed KV layer. The copysets strategy which minimizes data movement requires copysets to be persisted (it requires the previous state to be global and survive restarts). The lowest live node ID in the cluster would be managing (persisting) copysets. Other nodes will be periodically (every 10s) cache the persisted copysets and using it for re-balancing.

Copysets will only be re-generated (and persisted) if the store list changes. In steady state all nodes will be periodically reading the persisted copysets and there will be no need to re-generate and persist new copysets.

The cluster can tolerate failure of one node within each copyset for RF=3. For example a 100 node cluster can tolerate the simultaneous failure of 33 nodes in the best case (for RF=3) without suffering any data loss.

Rebalancing ranges

Ranges need to be rebalanced to be contained within a copyset.
There are two range re-balancers currently being used in cockroach:

Replicate queue
Store rebalancer

This RFC will explain the implementation for copyset rebalancing for the replicate queue which processes one replica at a time. Replica rebalancing by the store rebalancer will be disabled if copysets is enabled (at least for the initial version). The store rebalancer can still perform lease holder rebalancing.

The allocator uses a scoring function to

Decide which store to use for a new replica for a range
Which replica to remove when a range has more than required replicas
Whether a replica has to move from one store to another where the resultant score for the range will he higher.

The scoring function considers the following (given in order of priority)

Zone constraints (which are constraints on having certain tables in certain zones)
Disk fullness: checks whether the source or target is too full.
Diversity score difference: Diversity score is proportional to the number of different localities the range has a replica in. It looks at nC2 diversity score based on their localities where n is the number of replicas.
Convergence score difference: Convergence score is used to avoid moving ranges whose movement will cause the stats (range count) of a range to move away from the global mean.
Balance score difference: Balance score is the normalized utilization of a node. It currently considers number of ranges. Nodes with a low balance score are preferred.
Range count difference: Stores with a low range count are preferred.

Copyset score

For rebalancing ranges into copysets, a new "copyset score" will be added to the allocator. Priority wise it will be between (2) and (3) above. Zone constraints and disk fullness take a higher priority over copyset score.

Since copyset allocation considers diversity, it's priority can be placed above diversity score. If copysets are disabled in the cluster, this score will have no impact in rebalancing.

Copyset score (higher score is better) of a range is high if:

A range is completely contained within a copyset.
The copysets the range is in are under-utilized. We want each copyset to be equally loaded. If a range is completely contained in a copyset x we should move the range completely to a copyset y if the nodes in copyset y have a significantly lower load (for example nodes in y have a lot more free disk space).

So the following replica transition for a range of RF 3 should be allowed in case (2): x x x -> x x y -> x y y -> y y y
where x x x means that the 3 replicas of the range are in copyset x.

Let's say r is the replication factor of a range. Each of its replicas belongs to a node with a particular copyset id. We can formally define the scores as:

Homogeneity score: Number of pairwise same copyset id / (r choose 2)
Idle score: This score is proportional to how "idle" a store is. For starters we can consider this to be % disk free (Available Capacity / Total Capacity of the store). We want ranges to migrate to copysets with significantly lower load.
1. The idle score of a store is proportional to the idleness of a store, like % disk free on the store.
2. The idle score of a copyset is the lowest idle score of the stores in the copyset.
3. The idle score of a range is the weighted average idle score of the copysets of the stores a range is present in. A range can be a part of multiple copysets when it is in flux (examples given below).

Copyset score can be defined as (k * homogeneity_score + idle_score) / (k + 1). It is normalized and lies between 0 and 1.

Computation of k

Let's say we want to migrate a range from a copyset x to a copyset y if the idle score of y differs by more than d (configurable). If d is too small, it could lead to thrashing of replicas, so we can use a value like 15%.
Though the below calculations may seem a bit complex, to the end user we can just expose d, which is easy to understand - the max difference between idle scores of two copysets in the cluster.

For example, if idle score of x is a and y is a + d, we require:

copyset_score(x x x) < copyset_score(x x y)
k * homogeneityScore(x x x) + idleScore(x x x) < k * homogeneityScore(x x y) + idleScore(x x y)
# Generalizing for replication factor r where r = 3 below
    homogeneityScore(x x x) = 1
    idleScore(x x x) = ra/r = a
    homogeneityScore(x x y) = (r-1 choose 2) / (r choose 2) # since 1 copyset is different.
    idleScore(x x y) = ((r-1) * a + a + d)/r = (ra + d) / r
# So we get
k * 1 + a <= k * (r-1 choose 2) / (r choose 2) + (ra + d) / r
=> k <= d / 2

For example, for r = 3, d = 0.15, and idle score of x being 0.2 and idle score of y being 0.36

totalScore(x x x) = 0.075 * 1 + 0.2 = 0.275
totalScore(x x y) = 0.075 * 0.33 + (0.2 + 0.2 + 0.36)/3 = 0.278

So a range will migrate from

(x x x) -> (x x y) -> (x y y) -> (y y y)

The above migration will not happen if y has an idle score of 0.34 (since d = 0.15).
The first step (x x x) -> (x x y) is the hardest as homogeneity is broken. The proof for this is given above. For (x x y) -> (x y y) step, the homogeneity score remains the same, and idle score improves (since y has a better idle score). For (x y y) -> (y y y) step, both the homogeneity score and idle score improve. When a range actually migrates from (x x x) to (x x y), it goes through an intermediate step (x x x y) after which one x is removed, but similar math applies.

This scoring function will allow ranges to organically move into copysets and try to maintain approximately equal load among copysets. Thrashing will be avoided by choosing an appropriate value of d.

Drawbacks

Copysets increase recovery time since only nodes within the copyset of a crashed node can up-replicate data. This can be mitigated by choosing a higher scatter width (description of scatter width is given in the academic literature).
Zone constraints will not be supported in the initial version of copysets. Copyset allocation can later be tweaked to respect zone constraints.
Heterogeneous clusters. Copysets will work in heterogeneous clusters but each copyset will be limited by the weakest node in the copyset (since idle score of a copyset is the lowest node idle score). This may be something we can live with.
Doesn't play well with the store rebalancer. For the first cut store based replica rebalancing will be disabled with copysets enabled. A similar logic can be incorporated into the store rebalancer at a later point.

Due to the above drawbacks, copysets will be disabled by default and there will be a cluster setting where users can enable copysets if they are ok with the above drawbacks.

Rationale and Alternatives

There can be multiple approaches for both copyset allocation and the scoring function. This design in this RFC is something simple and the respective algorithms can be tweaked independently later.

Chainsets

Chainsets is one way to make incremental changes to copysets, but again potentially at the cost of reduced locality diversity. The length of the chain used in chainsets could be considered equivalent to replication factor in cockroach.

Testing scenarios

Apart from unit tests, roachtests can be added which verify copyset based rebalancing in the presence of

Node addition / removal
Node crashes (up to 1/3rd of the cluster)
Change of replication factors
Locality fault tolerance
Changes of constraints

Table of Contents