cluster-autoscaler/proposals/provisioning-request.md
author: kisieland
Currently CA does not provide any way to express that a group of pods would like to have capacity available. This is because each CA loop picks a group of unschedulable pods and works on provisioning capacity for them, so the grouping is effectively random (it depends on the interaction between kube-scheduler and the CA loop). This is especially problematic in a couple of cases:
0->200->400->600 rather than one 0->600. This
significantly increases the e2e latency as there is non-negligible time tax
on each scale-up operation.
Provisioning Request (abbr. ProvReq) is a new namespaced Custom Resource that aims to allow users to ask CA for capacity for groups of pods. It allows users to express the fact that a group of pods is connected and should be treated as one entity. This AEP proposes an API that can have multiple provisioning classes and can be extended by cloud-provider-specific ones. This object is meant as a one-shot request to CA, so if CA fails to provision the capacity, it is up to users to retry (such retry functionality can be added later on).
The following code snippets assume kubebuilder is used to generate the CRD:
// ProvisioningRequest is a way to express additional capacity
// that we would like to provision in the cluster. Cluster Autoscaler
// can use this information in its calculations and signal if the capacity
// is available in the cluster or actively add capacity if needed.
type ProvisioningRequest struct {
metav1.TypeMeta `json:",inline"`
// Standard object metadata. More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#metadata
//
// +optional
metav1.ObjectMeta `json:"metadata,omitempty"`
// Spec contains specification of the ProvisioningRequest object.
// More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#spec-and-status.
//
// +kubebuilder:validation:Required
Spec ProvisioningRequestSpec `json:"spec"`
// Status of the ProvisioningRequest. CA constantly reconciles this field.
//
// +optional
Status ProvisioningRequestStatus `json:"status,omitempty"`
}
// ProvisioningRequestList is an object for a list of ProvisioningRequest.
type ProvisioningRequestList struct {
metav1.TypeMeta `json:",inline"`
// Standard list metadata.
//
// +optional
metav1.ListMeta `json:"metadata"`
// Items, list of ProvisioningRequest returned from API.
//
// +optional
Items []ProvisioningRequest `json:"items"`
}
// ProvisioningRequestSpec is a specification of additional pods for which we
// would like to provision additional resources in the cluster.
type ProvisioningRequestSpec struct {
// PodSets lists groups of pods for which we would like to provision
// resources.
//
// +kubebuilder:validation:Required
// +kubebuilder:validation:MinItems=1
// +kubebuilder:validation:MaxItems=32
// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="Value is immutable"
PodSets []PodSet `json:"podSets"`
// ProvisioningClass describes the different modes of provisioning the resources.
// Supported values:
// * check-capacity.kubernetes.io - check if current cluster state can fulfill this request,
// do not reserve the capacity.
// * atomic-scale-up.kubernetes.io - provision the resources in an atomic manner
// * ... - potential other classes that are specific to the cloud providers
//
// +kubebuilder:validation:Required
// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="Value is immutable"
ProvisioningClass string `json:"provisioningClass"`
// Parameters contains all other parameters custom classes may require.
//
// +optional
// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="Value is immutable"
Parameters map[string]string `json:"parameters"`
}
type PodSet struct {
// PodTemplateRef is a reference to a PodTemplate object that is representing pods
// that will consume this reservation (must be within the same namespace).
// Users need to make sure that the fields relevant to scheduler (e.g. node selector, tolerations)
// are consistent between this template and actual pods consuming the Provisioning Request.
//
// +kubebuilder:validation:Required
PodTemplateRef Reference `json:"podTemplateRef"`
// Count contains the number of pods that will be created with a given
// template.
//
// +kubebuilder:validation:Minimum=1
// +kubebuilder:validation:Maximum=16384
Count int32 `json:"count"`
}
type Reference struct {
// Name of the referenced object.
// More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names#names
//
// +kubebuilder:validation:Required
Name string `json:"name,omitempty"`
}
// ProvisioningRequestStatus represents the status of the resource reservation.
type ProvisioningRequestStatus struct {
// Conditions represent the observations of a Provisioning Request's
// current state. Those will contain information whether the capacity
// was found/created or if there were any issues. The condition types
// may differ between different provisioning classes.
//
// +listType=map
// +listMapKey=type
// +patchStrategy=merge
// +patchMergeKey=type
// +optional
Conditions []metav1.Condition `json:"conditions"`
// Statuses contains all other status values custom provisioning classes may require.
//
// +optional
// +kubebuilder:validation:MaxItems=64
Statuses map[string]string `json:"statuses"`
}
The check-capacity.kubernetes.io class is a one-off check to verify that there is
enough capacity in the cluster to provision a given set of pods.
Note: If two such objects are created around the same time, CA will consider them independently and place no guards on the capacity. Also, the capacity is not reserved in any manner, so it may be scaled down.
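The consequence of checking without reserving can be shown with a small sketch. It is not the CA implementation: `fits` is a hypothetical helper that considers only CPU (in millicores) and identical pods, and places them first-fit on a copy of the free capacity. Because nothing is reserved, two requests that each fit on their own both pass, even though together they would not.

```go
package main

import "fmt"

// fits reports whether count pods, each requesting podCPU millicores, can be
// placed on nodes with the given free CPU. It works on a copy of free, so the
// caller's view of the cluster is not changed and nothing is reserved.
func fits(free []int64, podCPU int64, count int) bool {
	remaining := append([]int64(nil), free...)
	placed := 0
	for i := range remaining {
		for placed < count && remaining[i] >= podCPU {
			remaining[i] -= podCPU
			placed++
		}
	}
	return placed >= count
}

func main() {
	free := []int64{4000, 4000} // two nodes with 4 free cores each

	first := fits(free, 1000, 6)     // request A: 6 pods of 1 core
	second := fits(free, 1000, 6)    // request B: checked independently
	combined := fits(free, 1000, 12) // A and B together

	fmt.Println(first, second, combined)
}
```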
The atomic-scale-up.kubernetes.io aims to provision the resources required for the
specified pods in an atomic way. The proposed logic is to:
Parameters field, using the ValidUntilSeconds key, and would contain a string
denoting the duration for which we should retry (measured since creation of the CR).
Note: the VMs created in this mode are subject to the scale-down logic.
So the duration during which users need to create the Pods is equal to the
value of the --scale-down-unneeded-time flag.
To avoid generating double scale-ups and to exclude pods that are meant to consume given capacity, CA should be able to differentiate those pods from all others. To achieve this, users need to specify the following pod annotations (those are not required in the ProvReq's template, though they can be specified there):
annotations:
  "autoscaling.x-k8s.io/provisioning-class-name": "provreq-class-name"
  "autoscaling.x-k8s.io/consume-provisioning-request": "provreq-name"
The previous proposal included annotations with the prefix cluster-autoscaler.kubernetes.io,
but those were deprecated as part of the API review.
If those are provided for pods that consume a ProvReq with the check-capacity.kubernetes.io class,
the CA will not provision the capacity, even if it is needed (as some other pods might have been
scheduled on it), and will instead emit visibility events on the ProvReq and the pods.
If those are not passed, the CA will behave normally and just provision the capacity if it is needed.
Both annotations are required, and CA will not work correctly if only one of them is passed.
Note: CA will match all pods with these annotations to the corresponding ProvReq and ignore them when executing a scale-up loop (so it is up to users to make sure that the ProvReq count matches the number of created pods). If the ProvReq is missing, all of the pods that consume it will be unschedulable indefinitely.
Conditions field.
Note: Users can create a ProvReq and the pods consuming it at the same time (in a "fire and forget" manner), but this may result in the pods being unschedulable and triggering user-configured alerts.
To cancel a pending Provisioning Request with the atomic class, all that users need to do is delete the Provisioning Request object. After that, the CA will no longer guard the nodes from deletion and will proceed with the standard scale-down logic.
The following Condition states should encode the states of the ProvReq:
CapacityAvailable=true will denote that the cluster contains enough capacity to schedule pods.
CapacityAvailable=false will denote that the cluster does not contain enough capacity to schedule pods.
The Reasons and Messages will contain more details about why the specific condition was triggered.
Providers of the custom classes should reuse the conditions where available or create their own ones if items from the above list cannot be used to denote a specific situation.
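Since the Conditions list is keyed by type, recording a new observation means replacing the existing entry of that type rather than appending a duplicate. The sketch below illustrates this with a reduced stand-in for metav1.Condition; `Condition` and `setCondition` are local, hypothetical definitions, and the Reason and Message values are made up for the example.

```go
package main

import "fmt"

// Condition is a reduced stand-in for metav1.Condition (assumption: only
// the fields used in this example are kept).
type Condition struct {
	Type    string
	Status  string // "True", "False" or "Unknown"
	Reason  string
	Message string
}

// setCondition replaces the entry with the same Type or appends a new one,
// so the slice never holds two conditions of the same type.
func setCondition(conds []Condition, c Condition) []Condition {
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

func main() {
	var conds []Condition

	// First observation: not enough capacity.
	conds = setCondition(conds, Condition{
		Type:    "CapacityAvailable",
		Status:  "False",
		Reason:  "ExampleNotEnoughCapacity", // hypothetical reason
		Message: "cluster cannot fit all requested pods",
	})

	// Later observation: capacity is now available.
	conds = setCondition(conds, Condition{
		Type:    "CapacityAvailable",
		Status:  "True",
		Reason:  "ExampleCapacityFound", // hypothetical reason
		Message: "cluster can fit all requested pods",
	})

	for _, c := range conds {
		fmt.Printf("%s=%s (%s)\n", c.Type, c.Status, c.Reason)
	}
}
```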
The proposed implementation is to handle each ProvReq in a separate scale-up loop. This will require changes in multiple parts of CA:
The following e2e test scenarios will be created to check whether ProvReq handling works as expected:
check-capacity.kubernetes.io provisioning class is created, CA
checks if there is enough capacity in cluster to provision specified pods.
atomic-scale-up.kubernetes.io provisioning class is created, CA
picks an appropriate node group and scales it up atomically.
The current Cluster Autoscaler implementation does not take Resource Quotas into account.
The current proposal is to not include handling of the Resource Quotas, but it could be added later on.
One possible expansion of this approach is to introduce the ProvisioningClass CRD,
which follows the same approach as
the StorageClass object.
Such an approach would allow administrators of the cluster to introduce a list of allowed
ProvisioningClasses. Such a CRD can also contain a preset configuration, e.g.
administrators may set that atomic-scale-up.kubernetes.io would retry for up to 2h.
Possible CRD definition:
// ProvisioningClass is a way to express provisioning classes available in the cluster.
type ProvisioningClass struct {
// Name denotes the name of the object, which is to be used in the ProvisioningClass
// field in Provisioning Request CRD.
//
// +kubebuilder:validation:Required
Name string `json:"name"`
// Parameters contains all other parameters custom classes may require.
//
// +optional
Parameters map[string]string `json:"parameters"`
}
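To illustrate how such a class list could be used, the sketch below checks a request's class against the allowed ProvisioningClasses and merges the preset parameters with the request's own. The precedence (request values override class presets) is an assumption of this sketch and is not specified by the proposal; `resolveParameters` is a hypothetical helper, and the struct is a local copy without JSON tags.

```go
package main

import "fmt"

// ProvisioningClass is a local copy of the proposed type.
type ProvisioningClass struct {
	Name       string
	Parameters map[string]string
}

// resolveParameters rejects classes that are not in the allowed list and
// otherwise returns the class presets merged with the request parameters.
// Assumption: on conflict, the request's value wins.
func resolveParameters(allowed []ProvisioningClass, className string, reqParams map[string]string) (map[string]string, error) {
	for _, pc := range allowed {
		if pc.Name != className {
			continue
		}
		merged := make(map[string]string, len(pc.Parameters)+len(reqParams))
		for k, v := range pc.Parameters {
			merged[k] = v
		}
		for k, v := range reqParams {
			merged[k] = v
		}
		return merged, nil
	}
	return nil, fmt.Errorf("provisioning class %q is not allowed in this cluster", className)
}

func main() {
	allowed := []ProvisioningClass{
		// Preset: retry for up to 2h (7200 seconds).
		{Name: "atomic-scale-up.kubernetes.io", Parameters: map[string]string{"ValidUntilSeconds": "7200"}},
	}

	params, err := resolveParameters(allowed, "atomic-scale-up.kubernetes.io", nil)
	fmt.Println(params, err)

	_, err = resolveParameters(allowed, "unknown.example.com", nil)
	fmt.Println(err)
}
```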