# Capacity Buffers
When a cluster is autoscaled, its size is adjusted to match the pods that are currently running. This means that the scheduling time of a new pod often includes the time needed to create a new node and attach it to the cluster. Many users care deeply about pod scheduling latency and would therefore prefer to keep spare capacity in the cluster in order to speed up the start time of newly created pods.

To allow users to express the need for spare capacity in the cluster, a new Kubernetes object called a CapacityBuffer will be introduced, which will define spare capacity per workload or set of workloads. The configuration would be translated to pod specs that the autoscaler can inject in memory to drive scaling decisions for the cluster.
While some use cases of buffers can already be accomplished using balloon pods/deployments (see the overprovisioning node capacity documentation), there are reasons to introduce Buffers as a separate API concept:

- There was already a similar proposal that was not implemented in the end: Pod headroom.
- The feature has been requested multiple times by the community (#749, #987, #3240, #3384, #4409).
A CapacityBuffer CRD (autoscaling.x-k8s.io) is added to represent spare capacity requested by the user, together with a set of libraries that make it easy to integrate any autoscaler, and a reference implementation in the Cluster Autoscaler repository.
In order to support buffers, the cluster will need to run a buffer controller (run as part of the Cluster Autoscaler in the reference implementation) and a node autoscaler compatible with buffers.

Note that providing spare capacity depends heavily on the autoscaling capabilities of the cluster: support for buffers will depend on the compatibility of the autoscaler used within the cluster.
The scale subresource offers a selector to find pods created by a given resource. In the initial implementation we will assume that all pods are homogeneous (with the exception of admission-hook changes), and we will therefore take the most recent pod matching this selector. This means that the buffer will not be initialized until at least one pod of the given target exists at the same time as the buffer.
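As an illustration, the lookup could work roughly as in the sketch below. The function name and wiring are assumptions made for this document, not part of the proposed API; the scale and pod-listing calls are the standard client-go ones.

```go
// Illustrative sketch: resolve the buffer's pod shape from the target's
// scale subresource. Helper and parameter names are hypothetical.
import (
	"context"
	"sort"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/scale"
)

func resolvePodShape(ctx context.Context, scales scale.ScalesGetter, client kubernetes.Interface,
	namespace, name string, resource schema.GroupResource) (*corev1.Pod, error) {
	// The scale subresource exposes the target's pod selector in status.selector.
	sc, err := scales.Scales(namespace).Get(ctx, resource, name, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	selector, err := labels.Parse(sc.Status.Selector)
	if err != nil {
		return nil, err
	}
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector.String()})
	if err != nil {
		return nil, err
	}
	if len(pods.Items) == 0 {
		// No pods yet: the buffer stays uninitialized until the target has one.
		return nil, nil
	}
	// Pods are assumed homogeneous, so the most recently created one is used
	// as the shape of a single buffer chunk.
	sort.Slice(pods.Items, func(i, j int) bool {
		return pods.Items[i].CreationTimestamp.After(pods.Items[j].CreationTimestamp.Time)
	})
	return &pods.Items[0], nil
}
```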
This can be improved in future iterations.
In many feature requests users wanted to “just specify an additional number of nodes” to provision as spare capacity. This feature is not included in this design because of the ambiguity of how such a configuration would work.

Example 1: the user specified a buffer of 3 nodes. 3 nodes are created and the scheduler puts a pod on one of them. Should we create a new node now? With balanced scheduling this would happen a lot, making the buffer not work as expected and growing the cluster with poorly utilized nodes.

Example 2: the user specified a buffer of 3 nodes, but the node class (like CCC or Karpenter NodePool) defines different possible node shapes as fallbacks. What should the size of the buffer be? Is it 3 nodes of the highest priority that are currently available? Or should we somehow translate 3x the size of the top-priority shape onto the lower-priority nodes?

Instead, there is an option to specify the full buffered capacity as a limit that will be provisioned in chunks as specified by the PodTemplate.

That being said, if there is a use case for such a configuration that has a clear meaning to users and needs only a limited number of options to specify, it should be relatively easy to add a new capacity definition.
Buffers are not a required concept in every cluster, and what they offer will depend on the autoscaling capabilities of the cluster. Adding them as a CRD in the autoscaling group makes this clear and allows releasing them independently of core Kubernetes.

The initial iteration will cover basic use cases: it will simplify balloon pod/deployment management and reduce the scheduler-related overhead of preemption. Once the initial integration launches, there are potential next steps.
The user has a deployment of pods that serve user traffic. To better absorb traffic spikes, they want to keep a buffer sized at 10% of the deployment so that when the HPA scales the deployment, the new pods start faster. Additionally, this speeds up the metrics feedback loop, allowing the HPA to make its next scaling decisions sooner.

Note: the size of the buffer will be calculated based on existing replicas, to avoid a situation where the buffer grows before all the pods are there to be considered for a scale-up.
Example buffer:
```yaml
apiVersion: autoscaling.x-k8s.io/v1alpha1
kind: CapacityBuffer
metadata:
  name: my-deployment-buffer
  namespace: my-namespace
spec:
  scalableRef:
    apiGroup: apps
    kind: Deployment
    name: my-deployment
  percentage: 10
  replicas: 1
```
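For illustration: if my-deployment currently has 35 replicas, the percentage translates to ⌈35 × 10%⌉ = 4 chunks, and since both percentage and replicas are set, the buffer resolves to max(4, 1) = 4 spare pods shaped like the deployment's most recent pod (per the semantics in the API definition below).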
The user is part of an admin team owning the CI/CD pipeline. The cluster is used for running one-off test tasks that should complete promptly. However, the cluster keeps scaling up and down, making the tasks wait for a new node to be added. The admin team wants to keep spare space in the cluster so that the tasks schedule quickly.
Example buffer:
```yaml
apiVersion: v1
kind: PodTemplate
metadata:
  name: my-custom-template
  namespace: my-namespace
template:
  spec:
    containers:
    - name: cpu-buffer-container
      resources:
        requests:
          cpu: "8"
        limits:
          cpu: "8"
    nodeSelector:
      cloud.google.com/compute-class: my-custom-class
---
apiVersion: autoscaling.x-k8s.io/v1alpha1
kind: CapacityBuffer
metadata:
  name: testing-capacity
  namespace: my-namespace
spec:
  limits:
    memory: 5120Mi
    cpu: 40
  podTemplateRef:
    name: my-custom-template
```
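For illustration, the number of chunks that “fit” into such limits could be computed as below. The function name and the treatment of resources the chunk does not request are assumptions made for this sketch, not part of the proposal.

```go
// Illustrative sketch: how many copies of the chunk's requests fit into the
// buffer limits, i.e. the minimum over all limited resources of
// floor(limit / request). A real implementation would likely use MilliValue
// for cpu to handle fractional requests.
import corev1 "k8s.io/api/core/v1"

func chunksFittingLimits(chunkRequests, limits corev1.ResourceList) int {
	fit := -1 // -1 means "not constrained yet"
	for name, limit := range limits {
		request, ok := chunkRequests[name]
		if !ok || request.IsZero() {
			// Assumed: a limit on a resource the chunk does not request
			// is not constraining.
			continue
		}
		n := int(limit.Value() / request.Value())
		if fit == -1 || n < fit {
			fit = n
		}
	}
	return fit
}
```

For the buffer above this yields floor(40 / 8) = 5 chunks from the cpu limit, while the memory limit does not constrain anything because the template requests no memory.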
```go
type CapacityBuffer struct {
	metav1.TypeMeta
	metav1.ObjectMeta

	Spec   CapacityBufferSpec
	Status CapacityBufferStatus
}

type LocalObjectRef struct {
	Name string
}

type ScalableRef struct {
	// Can be an empty string for the core API group.
	ApiGroup string
	// The first version advertised in API discovery for the specified apiGroup
	// serving the specified kind with a scale subresource is used.
	Kind string
	Name string
}

type CapacityBufferSpec struct {
	// Reference to a PodTemplate resource in the same namespace that declares
	// the shape of a single chunk of the buffer.
	// +optional
	// oneof=PodShapeSource
	PodTemplateRef *LocalObjectRef

	// Reference to an object of a kind that has a scale subresource and sets
	// its label selector field.
	// +optional
	// oneof=PodShapeSource
	ScalableRef *ScalableRef

	// If neither replicas nor percentage is set, as many chunks as fit in
	// limits will be created; if both are set, the maximum of the two is used.
	// +optional
	Replicas *int

	// Applicable only if ScalableRef is set.
	// The absolute number is calculated from the percentage by rounding up,
	// with a minimum of 1.
	// +optional
	Percentage *int

	// If empty, additional nodes will be created to provide capacity in the
	// cluster. Cloud providers can offer their own buffering strategies.
	// +optional
	ProvisioningStrategy *string

	// If specified, it will limit the number of chunks created for this buffer.
	// If there are no other limitations on the number of chunks, it will be
	// used to create as many chunks as fit into these limits.
	// +optional
	Limits *ResourceList
}

type CapacityBufferStatus struct {
	// If podTemplateRef, replicas and podTemplateGeneration are not set,
	// conditions will provide details about the error state.
	// +optional
	PodTemplateRef *LocalObjectRef

	// Number of replicas calculated by the buffer controller that the
	// autoscaler should act on.
	// +optional
	Replicas *int

	// Number of replicas from this buffer that have provisioned capacity in
	// the cluster that is ready to be used.
	// +optional
	ReadyReplicas *int

	// +optional
	PodTemplateGeneration *int

	Conditions []metav1.Condition
}
```
The controller would own the logic of translating the buffer configuration into a pod spec that defines what spare capacity needs to be provisioned in the cluster to fulfill the buffer's requirements. The autoscaler component would use that pod spec to provision the capacity.

In the case of the Cluster Autoscaler we would deploy the controller as an optional subprocess. This design has two main advantages compared to embedding all the logic in the library used in the autoscaler code.

Based on the full Buffer spec, the controller writes the status, including conditions (to mark the buffer as ready for provisioning) and PodBufferCapacity. In the future, the number of chunks will take into account the k8s Resource Quotas.

The autoscaler will use the buffer status field and buffer type to determine how much spare capacity to deploy and of what shape, as sketched below. The autoscaler can additionally set conditions on the buffer to communicate error states and unforeseen circumstances.
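A sketch of the consumption side, assuming a hypothetical bufferPods helper: a ready buffer's resolved pod template and replica count are expanded into in-memory pods that the autoscaler treats like unschedulable pods, driving a scale-up. This is illustrative, not the actual reference implementation.

```go
// Illustrative sketch: expand a ready buffer into in-memory pods. These pods
// are never written to the API server; they only exist in the autoscaler's
// memory to drive scale-up simulations.
import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func bufferPods(bufferName string, template *corev1.PodTemplate, replicas int) []*corev1.Pod {
	pods := make([]*corev1.Pod, 0, replicas)
	for i := 0; i < replicas; i++ {
		pod := &corev1.Pod{
			ObjectMeta: *template.Template.ObjectMeta.DeepCopy(),
			Spec:       *template.Template.Spec.DeepCopy(),
		}
		// Each fake pod gets a unique, recognizable name.
		pod.Name = fmt.Sprintf("%s-buffer-%d", bufferName, i)
		pods = append(pods, pod)
	}
	return pods
}
```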
Currently, the only option to define cluster capacity is to define a PodSpec and instantly consume it via scheduling of a pod. However, there is ongoing work within the sig-scheduling group to define scheduling of workloads and reservations (which also define cluster capacity). This design work is not yet mature (no KEP proposed), so we will stick with the PodTemplate definition for now and possibly expand to the new objects when they are available.
Today there is no easy way of just making a deployment 10% bigger, though users can already set arbitrary scaling targets for the HPA.
Today it is possible to set a resource quota per namespace that prevents a user with access to the namespace from creating too many objects of a given type, or pods that use too many resources.

Since buffers are not pods, they will not be accounted for by the quota system, and so any user who is able to create buffers will be able to create a buffer of any size.

For the initial implementation of buffers, users should use other tools to limit the size of the cluster (max size on the GKE node pool/CCC or Karpenter pool, max total size). Once we launch, we will reassess the need for other mechanisms for buffers.
Two options that should be considered are:

- accounting for buffers in the k8s quota system, and
- a separate quota-like object (e.g. BufferPolicy) to manage what buffers can be created. This may be needed if we would like to offer different quotas depending on the Buffer type.

Other considered alternatives: Buffers and k8s quotas.
- Yes, a compatible node autoscaler and a buffer controller (the latter run as part of the Cluster Autoscaler in the reference implementation).
- Yes, the buffer controller will list cluster objects. No external calls (outside the cluster).
- Yes (CapacityBuffer).
- No.
- No.
- No.
- No.
- The same way as scheduling a higher number of pods.
Since there is already a well-known path of creating balloon pods/deployments, we could decide to do nothing and point users to these mechanisms.

Managing balloon pods proves to be problematic: you need to keep them in sync with the workload they need to scale for, or in sync with the VM shape (to avoid creating too many of them).

We gathered feedback from customers that maintaining balloon pods and sizing them is something that should be simplified.

Balloon pods/deployments allow only for active capacity. Some cloud providers (AWS and Azure) already offer warm/standby pools consisting of stopped VMs. Having a single API surface can standardize configuration of these and simplify testing different buffering variants: a user who wants to optimize cost and startup latency can try different buffer options to decide which one offers the right cost/efficiency tradeoff, with a simple way of switching between similar configurations to test what works best.
Alternatively, we could avoid adding a new object and instead introduce relevant fields on HPA, Job, CCC etc. This solution would end up with options added in multiple places.

From the user's point of view this would likely target each use case better, but users would end up with multiple inconsistent configuration options (each place would have some options that do or do not apply there, and some features would be implemented only for some of the places).

From the implementation point of view this would require modifying multiple object definitions, including core Kubernetes objects and their behavior. Extensibility would also be much harder: adding a feature to buffers (like scheduled buffers or a new buffer type) would again require touching every single object.
We could decide to offer only an open source library that translates buffers to pods, which could be used by any autoscaler, and get rid of the controller and the translation layer.

Note that while some of these arguments are not very strong, the cost of running a separate controller as a subprocess in the autoscaler seems low, and therefore my proposal includes the controller.