Back to Cockroach

Allocator

pkg/kv/kvserver/allocator/doc.md

26.1.310.1 KB
Original Source

Allocator

Replica Scanner And Base Queue

  • Every store has a replica scanner that periodically scans and iterates over every replica of stores. It calls every base queues’s MaybeAddAsync with every replica. Every base queue’s MaybeAdd then calls into base queue’s replicaCanBeProcessed, shouldQueue. If shouldQueue returns true, it adds the replica to the queue with the priority given by shouldQueue. Skips if shouldQueue returns false.
  • Every base queue implements bq.processLoop which pops items added by the replica scanner and calls into bq.processOneAsyncAndReleaseSem. bq.processOneAsyncAndReleaseSem then calls bq.processReplica async with a time out. bq.processReplica then calls bq.process. If replicas fail to be processed, they will be sent to the purgatory queue.
  • Base queue includes: consistency queue, lease queue, merge queue, mvcc gc queue, raft log queue, raft snapshot queue, replica gc queue, replicate queue, split queue, timeseries maintenance queue.

Replicate Queue

High Level

  1. ReplicateQueue.shouldQueue calls into ReplicaPlanner.ShouldPlanChange which decides whether the replica scanner adds the replica to the queue.
  2. Replica scanner adds replicas that need to be queued to the base queue.
  3. Inside replicateQueue.process with one replica, a) grab the range token b) calls into rq.processOneChange.
  4. processOneChange calls into ReplicaPlanner.PlanOneChange which plans a replicate change or an error. PlanOneChange calls into allocator.ComputeAction which returns an enum action. This enum action tells replicate queue to consider add / remove / replace replicas. ComputeAction gives an action that should be done. The separation is here because planner.ShouldPlanChange calls into ComputeAction as well, and we want a cheap check for the queue. Note that actions returned from ComputeAction in ShouldPlanChange v.s. PlanOneChange may be different.
  5. Replica planner then switches and computes what replicate changes need to be done based on the action.

shouldQueue: whether queue replicas to replicate queue

  1. shouldQueue takes a replica and calls into ReplicaPlanner.ShouldPlanChange which then calls into the allocator via rp.allocator.ComputeAction.
  2. ComputeAction computes an action that needs to be done on this replica.
  • It first checks if a repair action is needed: under-replicated, quorum, dead/decommissioning replicas, over-replicated. Note that this does not include constraint repair.
  • If no repair needs to be done on this range, it would fall back and return AllocatorConsiderRebalance by default.
  1. After ComputeAction gives ReplicaPlanner.ShouldPlanChange the action, it would return true if a repair action is needed. Otherwise, it checks if a rebalance target can be found via Allocator.RebalanceTarget. We will go into the details of RebalanceTarget below.

processOneChange: plan and process a change on the replica

  1. replicateQueue.processOneChange happens when the base queue pops the replica off the queue and actually start processing the replica.
  2. replicateQueue.processOneChange calls into ReplicaPlanner.PlanOneChange which then calls into ComputeAction as well. Based on the action, it would call into ReplicaPlanner's helper functions (such as addOrReplaceVoters) to compute the exact allocation operation out (such as a lease transfer, change of replicas).
  3. For now, lets focus on how considerRebalance. As a reminder, this action is returned by default for every replica when no other repair action applies. Note that this action does help repair zone constraints.

considerRebalance: find one rebalancing target for any of the existing replica

  1. ReplicaPlanner first picks which scorer options to be used for the allocator decision. It uses the RangeCountScorerOptions by default and ScatterScorerOptions if this action originates from a scatter operation. ScorerOptions define the heuristics used to evaluate candidate replicas, helping the allocator guide decisions toward specific goals such as range count convergence.
  2. ReplicaPlanner then calls into the allocator to find the best rebalancing target via Allocator.RebalanceTarget.
  3. Allocator.RebalanceTarget first builds constraint checkers, which are passed to rankedCandidateListForRebalancing to determine which existing replicas / stores are valid or required based on the range’s constraints.
  4. Allocator.RebalanceTarget then calls into rankedCandidateListForRebalancing to find a list of rebalance options. Each rebalance option consists of an existing replica and a set of candidate stores that can replace it.
  • rankedCandidateListForRebalancing first builds a map of existing store ids to their corresponding candidate entries - annotated with attributes such as valid, necessary, disk fullness, I/O overload, and diversity score.

  • For each existing replica store, it then constructs an equivalence class, containing the store itself along with a list of comparable candidates.

    • This is how each equivalence class is constraucted: for every existing replica, all other stores are considered as potential candidates. During this iteration, some filtering is applied based on the state of other existing replicas. If a candidate store is not worse than the existing store, it is added to the list of comparable candidates. From all considered candidates, the best set is selected, and an equivalance class is constructed using existing replica and these best candidates. Note that at this stage, only attributes such as valid, necessary (constraint conformance), diversity score, and disk fullness are considered. At this point, range count convergence has not been considered yet.

    • (TODO(wenyihu6 during review): why ioOverloaded is always false here there is a comment https://github.com/cockroachdb/cockroach/blob/bc9fdcd029eae1f30a5ea82e44b60a629d452f6c/pkg/kv/kvserver/allocator/allocatorimpl/allocator_scorer.go#L1550-L1553 but idu).

    • (TODO(wenyihu6 during review) looks like at this stage we do not exclude other existing replicas stores - we filter it out below when iterating over comparable candidates but why)

  • At this point, we have an equivalence class for every existing replica. We then examine candidates’ range count statistics within each class to help break ties.

    • For each equivalence class, ScorerOptions populate the convergence score, balance score, and range count for existing store and potential candidates. The candidate set is then further filtered to include only those better than the existing store.
  1. Using the results from rankedCandidateListForRebalancing, Allocator.RebalanceTarget iterates over all rebalanceOptions to identify the option that offers the largest improvement. As a reminder, each rebalanceOption consists of an existing replica and a list of candidate stores. For each option, it selects the best candidate set, randomly chooses one candidate, and then determines the overall best rebalance option by comparing all rebalanceOptions.