docs/tech-notes/rebalancing.md
Last update: January 2019
Original author: Alex Robinson
This document serves as an end-to-end description of the current state of range and lease rebalancing in Cockroach as of v2.1. The target reader is anyone who is interested in better understanding how and why replica and lease placement decisions get made. Little detailed knowledge of core should be needed.
The most complete documentation, of course, is in the code, tests, and the surrounding comments, but those are necessarily split across several files and are much tougher to piece together into a logical understanding. That scattered knowledge is centralized here, without excessive detail that is likely to become stale.
Cockroach maintains multiple replicas of each range of data for fault tolerance. It also maintains a single leaseholder for each range to optimize the performance of reads and help maintain correctness invariants. The locations of these replicas and leaseholders are hugely important to the fault tolerance and performance of the system as a whole, and so Cockroach contains a bunch of logic that proactively tries to ensure a reasonably optimal distribution of replicas and leases throughout the cluster.
There are a number of factors that need to be considered when making placement decisions. For replicas, this includes:

* the constraints specified in the applicable ZoneConfig
* the diversity of localities that the replicas are spread across
* how full each store's disk is
* the number of ranges on each store
* the amount of load (QPS) on each store
* the number of writes per second on each store
* the amount of data on each store
We currently don't directly use the last two factors, instead hoping that
balancing the overall load and number of ranges are good enough proxies for the
number of writes and the amount of data, respectively. We previously tried to
integrate these factors into decisions, but allocation decisions became quite
complex and the approach ran into a number of issues (#17979), so it has since
been removed in favor of the StoreRebalancer's pure QPS-based rebalancing.
For lease rebalancing, the considerations include:

* the number of leases on each store
* the locality of the nodes that requests are coming from, relative to the localities of the replicas
* whether the current leaseholder's node is being drained
Note that there is a built-in conflict here -- moving leases closer to where requests are coming from may require imbalancing the lease count, or causing the nodes in those localities to have more load than other nodes. Getting this right can be a tough balancing act, and it's hard to ever be fully confident that you've gotten it right because there are almost certainly workloads out there that won't be handled optimally by whatever decision-making logic you implement.
Historically, all rebalancing has been handled by the ReplicateQueue. As of
v2.1, there's also a separate component called the StoreRebalancer which
focuses specifically on the problem of balancing the QPS on each store. QPS is
used here essentially as a proxy for the (CPU/network/disk) load on each node.
It's not a perfect proxy in general, but it seems to work well in benchmarks and
tests.
The ReplicateQueue is one of our handful of replica queues which periodically
iterate over all the replicas in each store. Replicas are queued by the
replicaScanner on each store, which simply scans over all replicas at a
configurable pace and runs them through each of the replica queues. Replicas
are also sometimes manually queued in the ReplicateQueue upon certain
triggers, as will be explained later.
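To make that flow concrete, here's a minimal sketch of what the scanning loop amounts to. All of the names and types below are illustrative stand-ins, not the actual replicaScanner code:

```go
package rebalance

import "time"

// replica and replicaQueue are simplified stand-ins for the real types.
type replica struct{ rangeID int }

type replicaQueue interface {
	// MaybeAdd lets the queue decide whether this replica needs any work.
	MaybeAdd(r replica)
}

// scanLoop mimics the replicaScanner's role: walk every local replica at a
// configurable pace and offer each one to every registered replica queue.
func scanLoop(replicas []replica, queues []replicaQueue, pace time.Duration) {
	ticker := time.NewTicker(pace)
	defer ticker.Stop()
	for _, r := range replicas {
		<-ticker.C // spread the scan out over time rather than bursting
		for _, q := range queues {
			q.MaybeAdd(r)
		}
	}
}
```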
Upon being asked to operate on a replica, the ReplicateQueue must:

1. Decide whether any action needs to be taken on the replica's range (up-replicate, down-replicate, rebalance a replica, or transfer the lease).
2. Carry out the chosen action.
The main interesting bit here, of course, is how the decisions are made, which is described in detail in the subsections below. The only other points of note are:
* The ReplicateQueue only acts on ranges for which the local store is the
  current leaseholder.
* How quickly data can be moved around is limited by settings such as
  kv.snapshot_rebalance.max_rate and ZoneConfig.RangeMaxBytes, but
  modifications to one or more of them can put a cluster in danger.

First, we must decide what action to take -- up-replicate, down-replicate, or
consider a rebalance. This decision is quite simple and can be easily
understood from the code. Essentially we just have to compare the number of
desired replicas from the applicable ZoneConfig to the number of non-dead,
non-decommissioning replicas from the range. There's a bit of extra logic needed
to dynamically adjust the desired number of replicas when it's greater than the
number of nodes in the cluster (#27349, #30441, #32949, #34126), but
that's about it.
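As a rough sketch of that comparison (the names and the exact clamping rule here are simplifications, not the real allocator code):

```go
package rebalance

// action is the coarse decision the ReplicateQueue makes for a range.
type action int

const (
	considerRebalance action = iota
	addReplica
	removeReplica
)

// computeAction compares the zone config's desired replica count against the
// replicas that are still usable (not dead, not decommissioning). The desired
// count is clamped so we never ask for more replicas than there are nodes,
// and kept odd so quorum math stays sensible.
func computeAction(zoneReplicas, liveNodes, usableReplicas int) action {
	need := zoneReplicas
	if need > liveNodes {
		need = liveNodes
		if need%2 == 0 {
			need-- // prefer an odd replication factor
		}
	}
	switch {
	case usableReplicas < need:
		return addReplica
	case usableReplicas > need:
		return removeReplica
	default:
		return considerRebalance
	}
}
```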
Picking an up-replication target is relatively straightforward. We can just
iterate over all live stores in the cluster, evaluating them on each of the
considerations in order, choosing one of the best results. We
will never, ever choose a store that doesn't meet the ZoneConfig constraints,
has an overfull disk, or is on the same node as another store that already
contains a replica for the range. After that, we will first prefer maximizing
diversity before considering factors such as the range count on each store. We
notably do not consider the QPS on each store here -- it's only taken into
account by the StoreRebalancer, never by the ReplicateQueue.
Rather than always choosing the best result, if there are two similarly good options we will choose randomly between them. See https://brooker.co.za/blog/2012/01/17/two-random.html or https://www.eecs.harvard.edu/~michaelm/postscripts/mythesis.pdf for details on why this behavior is preferable.
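A heavily simplified sketch of that selection, with illustrative types and scoring (not the actual allocator data structures), might look like:

```go
package rebalance

import (
	"math/rand"
	"sort"
)

// candidateStore is an illustrative stand-in for the allocator's candidate type.
type candidateStore struct {
	id             int
	valid          bool    // meets zone constraints, disk not overfull, no replica already on the node
	diversityScore float64 // higher means it improves locality diversity more
	rangeCount     int     // lower means less loaded by range count
}

// pickAddTarget filters out invalid stores, keeps only those tied for the best
// diversity score, and then picks randomly between the two with the fewest
// ranges (the "best of two random choices" technique).
func pickAddTarget(stores []candidateStore, rng *rand.Rand) (candidateStore, bool) {
	var best []candidateStore
	bestDiversity := -1.0
	for _, s := range stores {
		if !s.valid {
			continue // hard requirements are never traded away
		}
		if s.diversityScore > bestDiversity {
			bestDiversity, best = s.diversityScore, nil
		}
		if s.diversityScore == bestDiversity {
			best = append(best, s)
		}
	}
	if len(best) == 0 {
		return candidateStore{}, false
	}
	sort.Slice(best, func(i, j int) bool { return best[i].rangeCount < best[j].rangeCount })
	if len(best) > 2 {
		best = best[:2]
	}
	return best[rng.Intn(len(best))], true
}
```

Picking randomly between the top two candidates rather than always taking the single best one avoids herding, where many stores all make the same "best" choice at once.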
Picking a replica to remove from an over-replicated range is also quite straightforward. We just iterate over each replica's store, grading it on the same considerations as always, choosing one of the two worst-scoring stores. The only real exception is if one of the replicas is dead; in such cases, we'll always remove the dead store(s) first. Note that as of #28875 we don't remove replicas from dead stores until we have allocated a replacement replica on a different store. This makes certain data loss scenarios less likely (see #25392 for details).
If the algorithm chooses to remove the local replica, the replica must first transfer the lease away before it can be removed. Note that while the new leaseholder's replicate queue will examine the range shortly after acquiring the lease, it's possible for the new leaseholder to make a different decision. This isn't a real problem, but it does mean that removing oneself involves more work and less certainty than removing any of the other replicas.
Deciding when to rebalance is when things start to get legitimately tricky, and is what much of the allocator code is devoted to. This makes intuitive sense if you consider that when adding or removing a replica you both know that some action has to be taken and only have to pick the single best (or worst) candidate out of a linear list of stores.
However, when rebalancing, you have to decide whether taking action is actually desirable. And in practice, you want a bias against action, since there's a real cost to moving data around, and we don't want to do so unless there's a correspondingly real benefit. Also, the problem isn't linear any more - it's roughly O(m*n) when there are m replicas in the range and n live stores in the cluster, because we have to choose both the replica to be removed and the replica to add. This is particularly an issue for diversity score calculations and per-replica / expressive zone config constraints. For example, if you have the following stores:
| StoreID | Locality | Range Count |
|---|---|---|
| s1 | region=west,zone=a | 10 |
| s2 | region=west,zone=b | 10 |
| s3 | region=central,zone=a | 100 |
| s4 | region=east,zone=a | 100 |
And a range has replicas on s1, s3, and s4, then going purely by range count it would be great to take a replica off of store s3 or s4, which are both relatively overfull. It would also be great to add a replica to s2, which is relatively underfull. However, replacing s3 or s4 with s2 would hurt the range's diversity, which we never choose to do without the user telling us to.
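To make the diversity argument concrete, here's one illustrative way to quantify it; the real scoring lives in the allocator code and may differ in detail, but the idea of averaging a pairwise locality distance is the same:

```go
package rebalance

// localityDistance is an illustrative metric: the fraction of locality tiers,
// starting from the most significant one, on which two stores differ. The two
// localities are assumed to have the same tiers in the same order, e.g.
// {"region=west", "zone=a"} vs {"region=west", "zone=b"} -> 0.5,
// {"region=west", "zone=a"} vs {"region=east", "zone=a"} -> 1.0.
func localityDistance(a, b []string) float64 {
	for i := range a {
		if a[i] != b[i] {
			return float64(len(a)-i) / float64(len(a))
		}
	}
	return 0
}

// rangeDiversity averages the pairwise distances between all replicas' localities.
func rangeDiversity(localities [][]string) float64 {
	var sum float64
	var pairs int
	for i := range localities {
		for j := i + 1; j < len(localities); j++ {
			sum += localityDistance(localities[i], localities[j])
			pairs++
		}
	}
	if pairs == 0 {
		return 0
	}
	return sum / float64(pairs)
}
```

With this kind of metric, the current set {s1, s3, s4} has an average pairwise distance of 1.0 (every pair differs on region), while swapping s3 for s2 drops it to roughly 0.83, since s1 and s2 differ only on zone. The swap helps range counts but hurts diversity, so we don't make it.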
You can probably imagine that as the number of stores grows, doing all the
pairwise comparisons could become quite a bit of work. To optimize these
calculations, we group stores that share the same locality and the same
node/store attributes (a mostly-forgotten feature, but one that still needs to
be accounted for when considering ZoneConfig constraints). We can do all
constraint and diversity-scoring calculations just once for each group, and also
pair each group up against the only existing replicas that it could legally
replace without hurting diversity or violating constraints. We then only have to
do range count comparisons within these "comparable" classes of stores.
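A sketch of that grouping step, with hypothetical types (the real code also has to thread constraint checking and replaceability through this), could look like:

```go
package rebalance

import (
	"sort"
	"strings"
)

type store struct {
	id         int
	locality   string // e.g. "region=west,zone=a"
	attrs      []string
	rangeCount int
}

// groupKey collapses the properties that affect constraint and diversity
// scoring, so stores sharing a key are interchangeable for those calculations.
func groupKey(s store) string {
	attrs := append([]string(nil), s.attrs...)
	sort.Strings(attrs)
	return s.locality + "|" + strings.Join(attrs, ",")
}

// groupStores buckets the candidate stores so constraint checking and
// diversity scoring only have to be done once per bucket; range-count
// comparisons then happen within each bucket.
func groupStores(stores []store) map[string][]store {
	groups := make(map[string][]store)
	for _, s := range stores {
		k := groupKey(s)
		groups[k] = append(groups[k], s)
	}
	return groups
}
```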
At the end, we can determine which (added replica, removed replica) pairs are the largest improvement and choose from amongst them. As one last precautionary step, we then simulate the down-replication logic on the set of replicas that will result from adding the new replica. If the simulation finds that we would remove the replica that was just added, we choose not to make that change. This avoids thrashing, and is needed because we can't atomically add a member to the raft group at the same time that we remove one. It's possible that this isn't necessary right now, since the rebalancing code has been significantly improved since it was added, but at the very least it's a nice fail-safe against future mistakes.
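In sketch form (hypothetical function names), the safeguard boils down to:

```go
package rebalance

// wouldThrash simulates the removal decision on the replica set that would
// exist after adding newStore. If the removal logic would immediately pick the
// replica we just added, the rebalance is abandoned rather than risk an
// add-then-remove loop.
func wouldThrash(existing []int, newStore int, pickRemovalTarget func(replicas []int) int) bool {
	hypothetical := append(append([]int(nil), existing...), newStore)
	return pickRemovalTarget(hypothetical) == newStore
}
```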
We support two high-level types of constraints -- those which apply to all replicas in a range, and those which are scoped to only apply to a particular number of the replicas in a range (publicly referred to as per-replica constraints). The latter option adds a good deal of subtlety to all allocator decisions -- up-replication, down-replication, and especially rebalancing.
In order to satisfy the requirements, we had to split up constraint checking into separate functions that work differently for adding, removing, and rebalancing. We also had to add an internal concept of whether a replica is "necessary" for meeting the required constraints, in addition to the existing concept of whether or not the replica is valid. A replica is "necessary" if the per-replica constraints wouldn't be satisfied if the replica weren't part of a range.
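A rough sketch of the "necessary" check, using simplified types rather than the real constraint machinery:

```go
package rebalance

// perReplicaConstraint asks for some minimum number of replicas matching a
// predicate, e.g. "at least 1 replica in region=west".
type perReplicaConstraint struct {
	matches  func(storeID int) bool
	required int
}

// isNecessary reports whether removing the given replica would leave some
// per-replica constraint unsatisfied by the remaining replicas.
func isNecessary(replica int, allReplicas []int, constraints []perReplicaConstraint) bool {
	for _, c := range constraints {
		satisfiedWithout := 0
		for _, r := range allReplicas {
			if r != replica && c.matches(r) {
				satisfiedWithout++
			}
		}
		if c.matches(replica) && satisfiedWithout < c.required {
			return true
		}
	}
	return false
}
```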
For more details on the design of the feature, see the discussion on #19985. For the implementation, see #22819.
For the most part, deciding whether to transfer a lease is a fairly straightforward decision based on whether the current leaseholder node is in a draining state and on the lease counts on all the stores holding replicas for a range. The more complex logic is related to the follow-the-workload functionality that kicks in if-and-only-if the various nodes holding replicas are in different localities. The logic involved here is better explained in the original RFC than I could do in less space here. The logic has not meaningfully changed since the original design/implementation.
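As a very rough sketch of the ordering of those checks (all of the inputs and the threshold below are illustrative assumptions, not the real signals or values):

```go
package rebalance

// shouldTransferLease sketches the ordering of the checks described above:
// draining takes priority, then lease-count balance, and follow-the-workload
// is only considered when the replicas are spread across localities.
func shouldTransferLease(
	localDraining bool,
	localLeaseCount, meanLeaseCount float64,
	replicasSpanLocalities bool,
	requestLocalityFavorsOtherReplica bool,
) bool {
	if localDraining {
		return true // a draining node always sheds its leases
	}
	if localLeaseCount > meanLeaseCount*1.05 {
		return true // rough lease-count balancing; the real threshold differs
	}
	if replicasSpanLocalities && requestLocalityFavorsOtherReplica {
		return true // follow-the-workload only kicks in across localities
	}
	return false
}
```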
As of v2.1, Cockroach also includes a separate control loop on each store called
the StoreRebalancer. The StoreRebalancer exists because we found in #26059
that an uneven balance of load on each node was causing serious performance
problems when attempting to run TPC-C at large scale without using partitioning.
Ensuring that each node had a more even balance of work to do was experimentally
found to allow significantly higher and smoother performance.
The StoreRebalancer takes a somewhat different approach to rebalancing,
though. While the ReplicateQueue iterates over each replica one at a time,
deciding whether the replica would be better off somewhere else, the
StoreRebalancer looks at the overall amount of load (BatchRequest QPS
specifically, although it could in theory consider other factors) on each store
and attempts to take action if the local store is overloaded relative to the
other stores in the cluster. This difference is important -- our previous
attempt to rebalance based on load was integrated into the replicate queue, and
it didn't work very well for at least three different reasons:
* It was based on the number of keys written per second, and we found that the
  number of BatchRequests being handled by a store per second was much more
  strongly correlated with a load imbalance than keys written per second.

The StoreRebalancer solves all these problems. It only focuses on QPS, and by
focusing on the store-level imbalance first and picking ranges to rebalance
later, it can choose ranges that are specifically high in QPS in order to have
the biggest influence on store-level balance with the smallest disruption on
range count (which the ReplicateQueue is still responsible for attempting to
even out). Ranges to rebalance are efficiently chosen because we have started
tracking a priority queue of the hottest ranges by QPS on each store. This queue
gets repopulated once a minute, when the existing loop that iterates over all
replicas to compute store-level stats does its thing. This list of hot ranges
can have other uses as well, such as powering debug endpoints for the admin UI
(#33336).
Interpreting the exact details of how things work from the code should be pretty straightforward; we attempt to move leases to resolve imbalances first, and only resort to moving replicas around if moving leases was insufficient to resolve the imbalance. There are some controls in place to avoid rebalancing when QPS is too low to matter, to avoid messing with a range that's so hot that it constitutes the majority of a node's QPS, to not bother moving ranges with too little QPS to make a real difference, and a few other such things.
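Putting the above together, here is a condensed sketch of the control loop. All names, thresholds, and structures are illustrative assumptions rather than the actual StoreRebalancer code:

```go
package rebalance

import "sort"

type hotRange struct {
	rangeID      int
	qps          float64
	holdsLease   bool
	otherStoreOK bool // some other replica's store could absorb the load
}

// rebalanceStore sheds QPS from an overloaded local store: first by moving
// leases of hot ranges, and only if that isn't enough by moving replicas.
func rebalanceStore(localQPS, clusterMeanQPS float64, hot []hotRange,
	transferLease, relocateReplica func(rangeID int)) {

	const overfullRatio = 1.1 // act only when meaningfully above the mean
	target := clusterMeanQPS * overfullRatio
	if localQPS <= target {
		return
	}
	// Consider the hottest ranges first so each move has the biggest effect.
	sort.Slice(hot, func(i, j int) bool { return hot[i].qps > hot[j].qps })

	// Pass 1: lease transfers, which are much cheaper than moving data.
	for _, r := range hot {
		if localQPS <= target {
			return
		}
		if r.holdsLease && r.otherStoreOK {
			transferLease(r.rangeID)
			localQPS -= r.qps
		}
	}
	// Pass 2: replica moves, only if lease transfers weren't enough.
	for _, r := range hot {
		if localQPS <= target {
			return
		}
		if r.otherStoreOK {
			relocateReplica(r.rangeID)
			localQPS -= r.qps
		}
	}
}
```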
The StoreRebalancer can be controlled by a cluster setting that either fully
turns it off, enables just lease rebalancing, or enables both lease and replica
rebalancing, which is the default.
For more details, see the original prototype (#26608) or the final implementation (#28340, #28852).
Before removing a replica or transferring a lease, we need to take the raft status of the various existing replicas into account. This is important to avoid temporary unavailability.
For example, if you transfer the lease for a range to a replica that is way behind in processing its raft log, it will take some time before that replica gets around to processing the command which transferred the lease to it, and it won't be able to serve any requests until it does so.
Or when considering which replica to remove from a range, we must take care not to remove a replica that is critical for the range's quorum. If only 3 replicas out of 5 are caught up with the raft leader's state, we can't remove any of those 3, but can safely remove either of the other 2.
Note that it's possible that the raft state of the underlying replicas changes between when we do this check and when the actual transfer/removal takes place, so it isn't a foolproof protection, but the window of risk is very small and we haven't noticed it being a problem in practice.
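A simplified version of the quorum check (illustrative signature; the real code works off the raft leader's progress information):

```go
package rebalance

// safeToRemove reports whether removing one replica can be done without
// losing quorum, given how many replicas are caught up with the raft leader.
// With 5 replicas and only 3 caught up, the post-removal quorum is 3, so only
// a behind replica may be removed: 3 caught-up replicas must remain of the 4.
func safeToRemove(totalReplicas, caughtUp int, candidateIsCaughtUp bool) bool {
	newTotal := totalReplicas - 1
	newQuorum := newTotal/2 + 1
	remainingCaughtUp := caughtUp
	if candidateIsCaughtUp {
		remainingCaughtUp--
	}
	return remainingCaughtUp >= newQuorum
}
```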
Because a replica can't be added and removed in a single atomic step, if a
range has replicas in localities a, b, and c, and wants to rebalance to a
different store in a, there will be a short period of time in which 2 of the
range's 4 replicas are in a. If a goes down before one of them is removed, the
range will be without a quorum until a comes back up.

RelocateRange is flaky in v2.2-alpha versions because we now immediately put a
range through the replicate queue when a new lease is acquired on it (#31287).
It may fail to complete its desired changes successfully due to racing with
changes proposed by the new leaseholder.

RelocateRange (and consequently the StoreRebalancer as a whole) doesn't
populate any useful information into the system.rangelog table, which has
traditionally been the best way to debug rebalancing decisions after the
fact (#34130).