cluster-autoscaler/proposals/balance_similar.md
We have multiple requests from people who want to use node groups with the same instance type, but located in multiple zones for redundancy / HA. Currently Cluster Autoscaler adds and deletes nodes in those node groups at random, which results in an uneven node distribution across zones. The goal of this proposal is to introduce a mechanism to balance the number of nodes in similar node groups.
We want this feature to work reasonably well with any currently supported CA configuration. In particular we want to support both homogeneous and heterogeneous clusters, allowing the user to easily choose or implement a strategy for defining what kind of node should be added (e.g. one large instance vs several small instances).
Those goals imply a few more specific constraints that we want to keep:
The general idea behind this proposal is to introduce the concept of a "Node Group Set", consisting of one or more "similar" node groups (the definition of "similar" is provided in a separate section). When scaling up we would split the new nodes between node groups in the same set to make their sizes as similar as possible. For example, assume a node group set made of node groups A (currently 1 node), B (3 nodes), and C (6 nodes). If we needed to add one new node to the cluster, it would go to group A. If we needed to add 4 nodes, 3 of them would go to group A and 1 to group B, leaving A and B at 4 nodes each.
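The example above can be reproduced with a minimal sketch that hands out new nodes one at a time to the currently smallest group. The function name `balanceScaleUp` and the map-based representation are hypothetical and chosen only for illustration; they are not part of Cluster Autoscaler. The sketch also assumes no per-group maximum size, which a real implementation would have to respect.

```go
package main

import (
	"fmt"
	"sort"
)

// balanceScaleUp is a hypothetical helper (not a Cluster Autoscaler API).
// Given the current size of each node group in a set and the number of
// nodes to add, it returns how many new nodes each group should receive.
// It ignores per-group size limits, which is a simplifying assumption.
func balanceScaleUp(sizes map[string]int, newNodes int) map[string]int {
	// Sort group names so that ties are broken deterministically.
	names := make([]string, 0, len(sizes))
	for name := range sizes {
		names = append(names, name)
	}
	sort.Strings(names)

	current := make(map[string]int, len(sizes))
	added := make(map[string]int, len(sizes))
	for _, name := range names {
		current[name] = sizes[name]
		added[name] = 0
	}
	if len(names) == 0 {
		return added
	}

	// Assign nodes one by one to whichever group is smallest right now.
	for i := 0; i < newNodes; i++ {
		smallest := names[0]
		for _, name := range names[1:] {
			if current[name] < current[smallest] {
				smallest = name
			}
		}
		current[smallest]++
		added[smallest]++
	}
	return added
}

func main() {
	sizes := map[string]int{"A": 1, "B": 3, "C": 6}
	fmt.Println(balanceScaleUp(sizes, 1)) // map[A:1 B:0 C:0]
	fmt.Println(balanceScaleUp(sizes, 4)) // map[A:3 B:1 C:0]
}
```

Assigning one node at a time keeps the logic easy to follow; for large scale-ups the same result could be computed more efficiently by sorting the groups once and filling them level by level.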
Note that this does not guarantee that node groups will always have the same size. Cluster Autoscaler will add exactly as many nodes as are required for pending pods, and that number may not be divisible by the number of node groups in the node group set. Additionally, we scale down underutilized nodes, which may happen to be in the same node group. Including the relative sizes of similar node groups in the scale-down logic will be covered by a different proposal later on.
There will be no change to how expansion options are generated in the ScaleUp function. Instead, the balancing will be executed after an expansion option is chosen by expansion.Strategy and before the node group is resized. The high-level algorithm will be as follows:
If the user sets the corresponding flag to 'false' we skip step 3, resulting in a single element in NGS (this makes step 4 a no-op and step 6 trivial).
We will balance the size of similar node groups. We want similar groups to consist of machines with the same instance type and with the same set of custom labels. In particular we define "similar" node groups as having:
There are other ways to implement the general idea than the proposed solution. This section lists other options that were considered and discusses pros and cons of each one. Feel free to skip it.
This is the solution described in the "Implementation proposal" section.
Pros:
Cons:
This idea is somewhat similar to [S1], but the new method would be called on the set of expansion options before expansion.Strategy chooses one. The new method could modify each option to contain a set of scale-ups on similar node groups.
Pros:
Cons:
This solution would work by implementing a NodeGroupSet wrapper implementing cloudprovider.NodeGroup interface. It would consist of one or more NodeGroups and internally load balance their sizes.
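To make the wrapper idea concrete, here is a rough sketch under stated assumptions. The `nodeGroup` interface below is a deliberately reduced, hypothetical stand-in for cloudprovider.NodeGroup (which has many more methods), and `fakeGroup` and `nodeGroupSet` are illustrative names, not existing types.

```go
package main

import "fmt"

// nodeGroup is a simplified, hypothetical stand-in for the much larger
// cloudprovider.NodeGroup interface; only three methods are modeled here.
type nodeGroup interface {
	Id() string
	TargetSize() (int, error)
	IncreaseSize(delta int) error
}

// fakeGroup is an in-memory node group used only for this illustration.
type fakeGroup struct {
	id   string
	size int
}

func (g *fakeGroup) Id() string               { return g.id }
func (g *fakeGroup) TargetSize() (int, error) { return g.size, nil }
func (g *fakeGroup) IncreaseSize(delta int) error {
	if delta <= 0 {
		return fmt.Errorf("size increase must be positive, got %d", delta)
	}
	g.size += delta
	return nil
}

// nodeGroupSet wraps several similar node groups and presents them to
// callers as a single node group.
type nodeGroupSet struct {
	id     string
	groups []nodeGroup
}

// Compile-time check that the wrapper satisfies the same interface.
var _ nodeGroup = (*nodeGroupSet)(nil)

func (s *nodeGroupSet) Id() string { return s.id }

// TargetSize reports the combined target size of all wrapped groups.
func (s *nodeGroupSet) TargetSize() (int, error) {
	total := 0
	for _, g := range s.groups {
		size, err := g.TargetSize()
		if err != nil {
			return 0, err
		}
		total += size
	}
	return total, nil
}

// IncreaseSize spreads delta across the wrapped groups, always growing
// the currently smallest group by one node.
func (s *nodeGroupSet) IncreaseSize(delta int) error {
	if delta <= 0 || len(s.groups) == 0 {
		return fmt.Errorf("cannot increase set %q by %d", s.id, delta)
	}
	for i := 0; i < delta; i++ {
		var smallest nodeGroup
		smallestSize := 0
		for _, g := range s.groups {
			size, err := g.TargetSize()
			if err != nil {
				return err
			}
			if smallest == nil || size < smallestSize {
				smallest, smallestSize = g, size
			}
		}
		if err := smallest.IncreaseSize(1); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	a := &fakeGroup{id: "A", size: 1}
	b := &fakeGroup{id: "B", size: 3}
	c := &fakeGroup{id: "C", size: 6}
	set := &nodeGroupSet{id: "set", groups: []nodeGroup{a, b, c}}
	if err := set.IncreaseSize(4); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(a.size, b.size, c.size) // 4 4 6
}
```

The sketch resizes one node at a time for clarity; a real wrapper would more likely compute the per-group deltas first and issue a single resize call per underlying group.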
Pros:
Cons:
This solution would change how expansion options are generated in core/scale_up.go. The main ScaleUp function could be largely rewritten to take balancing node groups into account.
Pros:
Cons:
A lot of the difficulty of this problem comes from the fact that we can have pods that can only be scheduled on some of the node groups in a given node group set. Such pods require specific configuration by the user (a zone-based labelSelector or antiaffinity) and are likely not very common in most clusters. Additionally, one can argue that having a majority of pods explicitly specify the zone they want to run in defeats the purpose of automatically balancing the size of node groups between zones in the first place.
If we treat those pods as an edge case, options [S3] and [S4] don't seem very attractive. Their main benefit is that they allow us to deal with such edge cases, at the cost of significantly increased complexity.
That leaves options [S1] and [S2]. Once again this is a decision between better handling of difficult cases and lower complexity. This time the tradeoff applies mostly to the expansion.Strategy interface. So far there are no implementations of this interface that make zone-based decisions, and making expansion options more complex (by having each one consist of a set of NodeGroups) would make all existing strategies more complex as well, for no benefit. So it seems that [S1] is the best available option by virtue of its relative simplicity.