cluster-autoscaler/proposals/parallel_drain.md
Author: x13n
Scale down of non-empty nodes is slow: we only drain one node at a time. This is particularly problematic in large clusters. While empty nodes can be removed in parallel, non-empty nodes are removed sequentially, which can take hours in thousand-node clusters. We want to speed this up by allowing parallel scale down of multiple non-empty nodes. Ideally, we'd like to drain as many nodes as possible simultaneously, without leaving workloads hanging.
With the old algorithm, scale down was disabled as long as the deletion of a non-empty node from a previous iteration had not yet completed. The new algorithm relaxes this restriction.
The algorithm will internally keep track of the state of nodes in the cluster, updated on every loop iteration. The state will allow answering the following questions:
The candidate_set will be built greedily by iterating over all nodes. To verify whether a given node can be put in the candidate_set, we will use scheduler simulation to bin-pack its pods onto the nodes at the other end of the list (details below). Since the simulation can be time-consuming, it will be time-bounded. This may lead to a slower scale down, but will prevent scale up from being starved.
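As a hedged sketch of this greedy, time-bounded construction (the `simulate` parameter stands in for the scheduler-based bin-packing; all names are illustrative, not the actual CA API):

```go
// buildCandidateSet greedily selects scale-down candidates within a time
// budget. Assumes imports: time, apiv1 "k8s.io/api/core/v1".
func buildCandidateSet(nodes []*apiv1.Node, budget time.Duration,
	simulate func(candidate *apiv1.Node, rest []*apiv1.Node) bool) []*apiv1.Node {
	deadline := time.Now().Add(budget)
	var candidates []*apiv1.Node
	for i, node := range nodes {
		if time.Now().After(deadline) {
			// Time-bounded: give up early rather than starve scale up.
			break
		}
		// Bin-pack this node's pods onto the remaining nodes, starting
		// from the other end of the list.
		if simulate(node, nodes[i+1:]) {
			candidates = append(candidates, node)
		}
	}
	return candidates
}
```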
In each iteration the contents of the candidate_set will be updated: nodes can be added to, removed from, or stay in the candidate_set. For each node in the candidate_set we will track how long it has been there. If a node is removed and later re-added to the candidate_set, the timer is reset.
To trigger node deletion, we will use the already present ScaleDownUnneededTime and ScaleDownUnreadyTime parameters. If in a given CA loop iteration there are nodes which have been in the candidate_set for more than ScaleDownUnneededTime (or are not ready and have been in the candidate_set for more than ScaleDownUnreadyTime), the actuation of scale down is triggered. The number of nodes being actively scaled down will be limited by the MaxDrainParallelism and MaxScaleDownParallelism configuration parameters. We will configure separate limits for scaling down empty and non-empty nodes to allow quick scale down of empty nodes even if lengthy drains of non-empty nodes are in progress.
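For illustration, the per-node trigger condition might look like this (the struct and parameter names are assumptions based on the description above):

```go
// candidate tracks a node's membership in the candidate_set. inSetSince is
// reset whenever the node leaves and re-enters the set.
type candidate struct {
	node       *apiv1.Node
	ready      bool
	inSetSince time.Time
}

// shouldTriggerDeletion reports whether the node has been unneeded long
// enough to be deleted. Illustrative sketch, not the actual CA code.
func shouldTriggerDeletion(c candidate, now time.Time,
	scaleDownUnneededTime, scaleDownUnreadyTime time.Duration) bool {
	if !c.ready {
		return now.Sub(c.inSetSince) > scaleDownUnreadyTime
	}
	return now.Sub(c.inSetSince) > scaleDownUnneededTime
}
```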
The existing --max-empty-bulk-delete flag will be deprecated and eventually
removed in favor of --max-scale-down-parallelism and --max-drain-parallelism.
Actuation will be done in a similar fashion as in the old algorithm: for a given set of nodes to be scaled down, we will synchronously taint the nodes, and then run separate goroutines to perform draining and node deletion.
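A rough sketch of that actuation flow; taintNode, drainNode and deleteNode are placeholders for the existing utilities, not actual CA functions:

```go
// startDeletion taints nodes synchronously, then drains and deletes them
// in parallel.
func startDeletion(nodes []*apiv1.Node) error {
	// Taint synchronously so the scheduler stops placing new pods on
	// these nodes before any evictions start.
	for _, node := range nodes {
		if err := taintNode(node); err != nil {
			return err
		}
	}
	// Drain and delete each node in its own goroutine.
	for _, node := range nodes {
		go func(n *apiv1.Node) {
			if err := drainNode(n); err != nil {
				return // Node stays tainted; a later iteration can retry.
			}
			deleteNode(n)
		}(node)
	}
	return nil
}
```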
The current scale-down algorithm uses soft tainting to limit the chance that new pods will be scheduled onto scale-down candidates. The new algorithm will use the same mechanism for nodes in the candidate_set.
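Soft tainting here means marking candidates with a PreferNoSchedule taint; a simplified reconciliation sketch (the taint key shown is assumed, and addTaint/removeTaint are placeholders for the existing tainting utilities):

```go
// reconcileSoftTaints soft-taints nodes in the candidate_set and cleans up
// the taint on nodes that have left the set.
func reconcileSoftTaints(nodes []*apiv1.Node, candidateSet map[string]bool) {
	const softTaintKey = "DeletionCandidateOfClusterAutoscaler" // assumed key
	for _, node := range nodes {
		if candidateSet[node.Name] {
			addTaint(node, apiv1.Taint{
				Key:    softTaintKey,
				Effect: apiv1.TaintEffectPreferNoSchedule,
			})
		} else {
			removeTaint(node, softTaintKey)
		}
	}
}
```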
The state in the scale-down algorithm will be updated incrementally based on the changes to the set of nodes and pods in the cluster snapshot.
The existing scale-down code lives mostly in the ScaleDown object, spanning the scale_down.go file with ~1.5k lines of code. As part of this effort, the ScaleDown object will be refactored, extracting utils to a separate file. ScaleDown itself will become an interface with two implementations (both relying on common utils): the existing version and the new algorithm described below. This will allow easy switching between algorithms with a flag: different flag values will pick different ScaleDown interface implementations.
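A plausible shape for the resulting interface, using method names that appear in this proposal (the snapshot parameter and exact signatures are assumptions):

```go
// ScaleDown abstracts scale-down decision making so that the legacy and
// the new parallel-drain implementations can be swapped via a flag.
type ScaleDown interface {
	// UpdateUnneededNodes refreshes internal state from the most recent
	// cluster state snapshot.
	UpdateUnneededNodes(snapshot *ClusterSnapshot) error
	// NeededNodes and UnneededNodes feed the (un)tainting logic.
	NeededNodes() []*apiv1.Node
	UnneededNodes() []*apiv1.Node
	// StartDeletion triggers actuation for eligible candidates.
	StartDeletion(now time.Time) error
}
```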
As a part of the refactoring, we will combine FastGetPodsToMove and
DetailedGetPodsForMove into a single PodsToMove function. The
checkReferences flag will be dropped and we will always do a detailed check.
We are now using listers so doing a detailed check does not add extra API calls
and should not add too much to execution time.
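After the merge, the function could look roughly like this; the signature is simplified and assumed, only the function names come from this proposal:

```go
// PodsToMove returns the pods that have to be rescheduled elsewhere before
// the given node can be drained. The former checkReferences toggle is
// gone: the detailed, lister-backed check is always performed.
func PodsToMove(nodeInfo *schedulerframework.NodeInfo,
	listers kube_util.ListerRegistry) ([]*apiv1.Pod, error) {
	// Listers make controller-reference resolution cheap (no extra API
	// calls), so the fast path without reference checks is dropped.
	return DetailedGetPodsForMove(nodeInfo, listers)
}
```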
Due to this change, performing the simulation pass twice will no longer make sense. The new implementation of the algorithm will perform it only once, as described in the section below.
Actuation logic will be separated from ScaleDown decisions, since only the decision-making process is going to change. In particular, the soft tainting logic will be extracted from ScaleDown. Instead, the NeededNodes()/UnneededNodes() methods will be used to fetch nodes for (un)tainting.
The algorithm is stateful. On each call to UpdateUnneededNodes (which happens on every CA loop iteration) the state will be updated to match the most recent cluster state snapshot. The most important fields held in the algorithm state are:
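Assembled from the sets referenced throughout this proposal, the state might be sketched as follows (the concrete types are assumptions):

```go
// evictedPod pairs a pod evicted by CA with its eviction timestamp.
type evictedPod struct {
	pod       *apiv1.Pod
	evictedAt time.Time
}

// scaleDownState collects the sets referenced throughout this proposal.
type scaleDownState struct {
	candidateSet    map[string]time.Time // node name -> unneeded since
	nonCandidateSet map[string]bool      // nodes that currently can't be removed
	deletedSet      map[string]bool      // nodes with deletion in progress
	recentEvictions []evictedPod         // safety buffer for simulations
}
```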
UpdateUnneededNodes will be called from RunOnce on every CA loop, as it is today. On every call the internal state is updated to match the most recent snapshot of the cluster state.
A single loop iteration of the algorithm will update the state to match the latest snapshot: pods from recent_evictions whose controllers are still missing replicas are injected back (target_replicas - current_replicas pods per controller), the candidate_set and non_candidate_set are rebuilt, and NeededNodes()/UnneededNodes() return node lists built from the non_candidate_set and candidate_set, respectively.
The responsibility of StartDeletion is to trigger the actual scale down of nodes in the candidate_set (or a subset of those). The method will keep track of empty and non-empty nodes being scaled down (separately). The number of nodes being scaled down will be bounded by the MaxDrainParallelism and MaxScaleDownParallelism options passed in as CA flags.
Method steps in pseudocode:
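The pseudocode below is a reconstruction from the description above; helper and field names are assumptions:

```go
// StartDeletion sketch: empty and non-empty deletions are tracked and
// limited separately.
func (s *scaleDown) StartDeletion(now time.Time) {
	empty, nonEmpty := splitByEmptiness(s.eligibleCandidates(now))

	// Empty nodes: no drain needed.
	budget := s.maxScaleDownParallelism - len(s.emptyDeletionsInProgress)
	for _, node := range limit(empty, budget) {
		s.emptyDeletionsInProgress[node.Name] = true
		taintNode(node)       // synchronous, as in the old algorithm
		go s.deleteNode(node) // async deletion
	}

	// Non-empty nodes: drain first.
	budget = s.maxDrainParallelism - len(s.drainsInProgress)
	for _, node := range limit(nonEmpty, budget) {
		s.drainsInProgress[node.Name] = true
		taintNode(node)
		go s.drainAndDeleteNode(node) // async drain + deletion
	}
}

// limit truncates nodes to at most n entries (n may be negative).
func limit(nodes []*apiv1.Node, n int) []*apiv1.Node {
	if n < 0 {
		n = 0
	}
	if n > len(nodes) {
		n = len(nodes)
	}
	return nodes[:n]
}
```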
The algorithm described above may be suboptimal in its approach to scheduling pods. In particular, it can lead to pods jumping between nodes as nodes are added to the candidate_set, which increases the number of PreFilter/Filter calls into the scheduler framework, which can be costly.
To optimize this, we may use a cyclic pointer as the starting point for scheduling simulations. Such an approach may limit the number of calls to the scheduler framework.
Implementation-wise, we could reuse the existing SchedulerBasedPredicateChecker. In order to do this, the algorithm for picking nodes for scheduling would be passed to SchedulerBasedPredicateChecker as a strategy. By default, it would be the existing round robin across all nodes, but the scale-down simulation would inject a more sophisticated one, relying on a pointer managed by the scale-down logic.
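One way to express such a strategy, as a hedged sketch (the strategy interface is an assumption, not the current SchedulerBasedPredicateChecker API):

```go
// NodeOrderingStrategy decides in which order nodes are tried when
// simulating pod scheduling.
type NodeOrderingStrategy interface {
	Order(nodes []*apiv1.Node) []*apiv1.Node
}

// cyclicPointerStrategy resumes scheduling attempts from where the
// previous simulation left off, reducing repeated PreFilter/Filter calls.
type cyclicPointerStrategy struct {
	next int // index of the node to try first
}

func (s *cyclicPointerStrategy) Order(nodes []*apiv1.Node) []*apiv1.Node {
	if len(nodes) == 0 {
		return nodes
	}
	start := s.next % len(nodes)
	// Rotate the slice so simulation starts at the pointer.
	ordered := append(append([]*apiv1.Node{}, nodes[start:]...), nodes[:start]...)
	s.next = start + 1 // advance the pointer for the next simulation
	return ordered
}
```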
Throughout the loop we will keep the remaining disruption quotas for PDBs in a pdbs_remaining_disruptions structure. The structure will be computed at the beginning of UpdateUnneededNodes; for each PDB it will hold how many pods matching that PDB can still be disrupted. We then decrease the counters as we go over the nodes and simulate pod rescheduling. Both the initial computation and the drain simulation take the state of each pod into account: if a pod is not healthy, it is subtracted from the remaining quota during the initial computation and not subtracted again during drain simulation.
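A sketch of this quota bookkeeping; seeding from the PDB status is one option (the proposal's own computation additionally subtracts unhealthy pods up front), and PDB matching is left out for brevity:

```go
// Assumes imports: apiv1 "k8s.io/api/core/v1", policyv1 "k8s.io/api/policy/v1".

// computeRemainingDisruptions seeds per-PDB quotas.
func computeRemainingDisruptions(pdbs []*policyv1.PodDisruptionBudget) map[string]int32 {
	remaining := map[string]int32{}
	for _, pdb := range pdbs {
		remaining[pdb.Namespace+"/"+pdb.Name] = pdb.Status.DisruptionsAllowed
	}
	return remaining
}

// canDisrupt consumes one unit of quota from every PDB matching the pod.
// Unhealthy pods were already subtracted during the initial computation,
// so their eviction must not consume quota a second time.
func canDisrupt(pod *apiv1.Pod, healthy bool,
	matching []*policyv1.PodDisruptionBudget, remaining map[string]int32) bool {
	if !healthy {
		return true // already accounted for up front
	}
	// First verify all matching PDBs still have quota...
	for _, pdb := range matching {
		if remaining[pdb.Namespace+"/"+pdb.Name] <= 0 {
			return false
		}
	}
	// ...then consume one disruption from each.
	for _, pdb := range matching {
		remaining[pdb.Namespace+"/"+pdb.Name]--
	}
	return true
}
```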
We need to repeat the checks from PreFilteringScaleDownNodeProcessor in UpdateUnneededNodes anyway, since the latter simulates multi-node deletion, so it seems we can drop PreFilteringScaleDownNodeProcessor when the new scale-down algorithm is enabled. This is not crucial, as the checks are not costly.
ScaleDownStatus does not play very well with the concept of multiple nodes being scaled down. The existing ScaleDownInProgress value in the ScaleDownResult enum represents a state in which CA decides to skip scale-down logic due to the ongoing deletion of a single non-empty node. In the new algorithm, this status will no longer be emitted. Instead, a new ScaleDownThrottled status will be emitted when the max parallelism for both empty and non-empty nodes has been reached and hence no new deletion can happen.
Cluster state holds a map listing unneeded candidates for each node group. We will keep the map, but the semantics of unneeded nodes will change. With the old algorithm there was no guarantee that all the "unneeded" nodes could be removed together, since the drain simulation was done independently for each candidate node. The parallel scale-down algorithm validates that all unneeded nodes can be removed together (modulo differences in behavior between the scheduler simulation and the actual scheduler).
Whenever Cluster Autoscaler evicts a pod, it is expected that some external controller will create a similar pod elsewhere. However, it takes a non-zero amount of time for controllers to react, so we will keep track of all evicted pods on a dedicated list (recent_evictions). After each eviction, the pod object, along with the eviction timestamp, will be added to the list and kept for a preconfigured amount of time. The pods from that list will be injected back into the cluster before the scale-down simulation as a safety buffer, except when CA is certain that the replacement pods were either already scheduled or don't need replacing. This can be verified by examining the parent object of the pods (e.g. a ReplicaSet).
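Building on the evictedPod type sketched earlier, the injection filter could look like this (the retention parameter and the replacementScheduled check against the parent object are hypothetical):

```go
// podsToInject returns recently evicted pods that should be re-injected
// into the snapshot before simulation, dropping expired entries and pods
// whose replacements are already accounted for.
func podsToInject(recentEvictions []evictedPod, retention time.Duration,
	now time.Time) []*apiv1.Pod {
	var inject []*apiv1.Pod
	for _, e := range recentEvictions {
		if now.Sub(e.evictedAt) > retention {
			continue // Controller had enough time to react.
		}
		if replacementScheduled(e.pod) {
			continue // Verified via the parent object, e.g. a ReplicaSet.
		}
		inject = append(inject, e.pod)
	}
	return inject
}
```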
In the initial implementation, for the sake of simplicity, parallel drain will not be triggered as long as there are any nodes already in the deleted_set.
The existing set of metrics will suffice for monitoring the performance of parallel scale down (i.e. scaled_down_nodes_total, scaled_down_gpu_nodes_total, function_duration_seconds). We may extend the set of function label values for function_duration_seconds to get better visibility into empty vs. non-empty scale-down duration.
Flag controlled: when --max-drain-parallelism != 1, the new logic is enabled.
The old logic will stay for a while to make rollback possible. The flag will be
introduced in version 1.25, defaulted to a value >1 in 1.26, and the old logic
eventually dropped in 1.27 or later.
The existing --max-empty-bulk-delete flag will be deprecated and eventually
removed: new flags will no longer refer to empty nodes. During the deprecation
period, --max-scale-down-parallelism will default to the value of
--max-empty-bulk-delete.
The reasons for a node not to be eligible for scale down include: