AEP - Autoscaler Enhancement Proposal
Items marked with (R) are required prior to targeting to a milestone / release.
Currently, the Horizontal Pod Autoscaler (HPA) and the Vertical Pod Autoscaler (VPA) act as independent controllers, each determining part of the resource allocation for a containerized application. Because the two controllers are independent, configuring them to optimize the same target (e.g., CPU usage) can lead to an awkward situation: HPA tries to spin up more pods because CPU usage is above the threshold, while VPA tries to shrink each pod because per-pod CPU usage is lower after HPA has scaled out. The final outcome is a large number of small pods for the workload. Synchronizing HPA and VPA usually requires manually fine-tuning the timing and prioritization of vertical and horizontal scaling.
We propose a Multi-dimensional Pod Autoscaling (MPA) framework that combines vertical and horizontal autoscaling in a single action but completely separates actuation from the controlling algorithms. It consists of three controllers (a recommender, an updater, and an admission controller) and an MPA API (a CRD object, or CR) that connects the autoscaling recommendations to actuation. The multidimensional scaling algorithm is implemented in the recommender, and the scaling decisions it derives are stored in the MPA object. The updater and the admission controller retrieve those decisions from the MPA object and actuate the vertical and horizontal actions. Because recommendation is separated from actuation, developers can replace the default recommender with a custom one that implements advanced algorithms controlling both scaling actions across different resource dimensions.
To scale application Deployments, Kubernetes supports both horizontal and vertical scaling with a Horizontal Pod Autoscaler (HPA) and a Vertical Pod Autoscaler (VPA), respectively. Currently, HPA and VPA work separately as independent controllers to determine the resource allocation of a containerized application.
HPA derives the desired replica count from the ratio of the current metric value to the desired metric value:

`desired_replicas = ceil(current_replicas * (current_metric_value / desired_metric_value))`

When HPA and VPA are used together to both reduce resource usage and guarantee application performance, VPA resizes pods based on their measured resource usage and HPA scales in or out based on the custom application performance metric, with each controller's logic entirely ignorant of the other. Because the two controllers are independent, they can reach an awkward situation in which VPA tries to squeeze the pods into smaller sizes based on their measured utilization while HPA tries to scale out the application to improve the custom performance metric. Using HPA together with VPA on CPU or memory metrics is also not recommended. There is therefore a need to combine the two controllers so that horizontal and vertical scaling decisions are made jointly for an application, achieving both resource efficiency and the application's service-level objectives (SLOs) or performance goals. The existing VPA and HPA designs cannot accommodate such requirements: synchronizing them usually requires manually fine-tuning the timing, frequency, and prioritization of vertical and horizontal scaling.
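The HPA replica formula quoted above can be sketched in a few lines. This is a minimal illustration rather than the HPA controller's code: the function name is ours, the result is rounded up as in the Kubernetes HPA documentation, and the `tolerance` parameter mirrors HPA's documented default of skipping a scaling action when the ratio is within 10% of 1.0.

```python
import math


def desired_replicas(current_replicas, current_metric_value,
                     desired_metric_value, tolerance=0.1):
    """Return the replica count suggested by the HPA ratio formula.

    `tolerance` is an assumption modelled on HPA's default of 0.1: when the
    metric ratio is that close to 1.0, the replica count is left unchanged.
    """
    if desired_metric_value <= 0:
        raise ValueError("desired_metric_value must be positive")
    ratio = current_metric_value / desired_metric_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    # Round up so the target is met rather than narrowly missed.
    return math.ceil(current_replicas * ratio)
```

For example, four replicas running at 200m CPU each against a 100m target yield eight replicas, while a reading of 105m leaves the count at four.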
Research shows that combined horizontal and vertical scaling, driven by advanced algorithms such as reinforcement learning, can guarantee application performance with better resource efficiency [1, 2]. These algorithms cannot be used with the existing HPA and VPA frameworks. A new framework (MPA) is needed to combine horizontal and vertical scaling actions and to separate the actuation of scaling actions from the autoscaling algorithms. The new MPA framework will work for all workloads on Kubernetes.
[1] Haoran Qiu, Subho S. Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer (2020). FIRM: An Intelligent Fine-Grained Resource Management Framework for SLO-Oriented Microservices. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2020).
[2] Haoran Qiu, Weichao Mao, Archit Patke, Chen Wang, Hubertus Franke, Zbigniew T. Kalbarczyk, Tamer Başar, Ravishankar K. Iyer (2022). SIMPPO: A Scalable and Incremental Online Learning Framework for Serverless Resource Management. In Proceedings of the 13th ACM Symposium on Cloud Computing (SoCC 2022).
For certain workloads that must keep a custom metric (e.g., throughput or request-serving latency) within a target, horizontal scaling is typically effective at controlling CPU resources, while vertical scaling is typically effective at increasing or decreasing the memory allocated to each pod. There is thus a need to control different types of resources at the same time using different scaling actions. The existing VPA and HPA can control these dimensions separately, but because they act independently on different resource types they cannot pursue a shared objective, such as keeping a custom metric within an SLO target. For example, HPA may try to spin up more pods because CPU usage is above the threshold while VPA tries to shrink each pod because memory usage is lower after HPA has scaled out. In the end, a large number of small pods is created for the workload.
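The "large number of small pods" outcome can be reproduced with a toy model. The sketch below is illustrative only and rests on simplifying assumptions that are ours: a single resource (CPU) is tracked for both controllers, the total load is constant and spread evenly over the pods, HPA applies its ratio formula every round, and VPA then sets each pod's request to its measured usage plus a fixed margin (`vpa_margin`, a hypothetical parameter).

```python
import math


def simulate_conflict(total_cpu_cores, replicas, request_cores,
                      target_utilization=0.5, vpa_margin=1.15,
                      max_replicas=100, rounds=5):
    """Alternate independent HPA and VPA decisions; return (replicas, request) per round."""
    history = []
    for _ in range(rounds):
        # HPA: utilization is per-pod usage relative to the per-pod request.
        utilization = (total_cpu_cores / replicas) / request_cores
        replicas = min(max_replicas,
                       math.ceil(replicas * utilization / target_utilization))
        # VPA: shrink the request to the (now lower) per-pod usage plus a margin.
        request_cores = (total_cpu_cores / replicas) * vpa_margin
        history.append((replicas, request_cores))
    return history


if __name__ == "__main__":
    for replicas, request in simulate_conflict(4.0, 4, 1.0):
        print(f"replicas={replicas:3d} request={request:.3f} cores")
```

Under these assumptions VPA always leaves utilization at `1 / vpa_margin`, which stays above the HPA target, so HPA keeps scaling out and VPA keeps shrinking the pods until `max_replicas` is reached.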
Our proposed MPA framework consists of three controllers (i.e., a recommender, an updater, and an admission controller) and an MPA API (i.e., a CRD object or CR) that connects the autoscaling recommendations to actuation. The figure below describes the architectural overview of the proposed MPA framework.
MPA API. Application owners specify the autoscaling configurations which include:
MPA API is also responsible for connecting the autoscaling actions generated from the MPA Recommender to MPA Admission Controller and Updater which actually execute the scaling actions. MPA API is created based on the multidimensional Pod scaling service (not open-sourced) provided by Google. MPA API is a Custom Resource Definition (CRD) in Kubernetes and each MPA instance is a CR. MPA CR keeps track of recommendations on target requests and target replica numbers.
Metrics APIs. The Metrics APIs serve both default metrics and custom metrics associated with any Kubernetes object. Custom metrics could be the application latency, throughput, or any other application-specific metric. HPA already consumes metrics from a variety of such metric APIs (e.g., the metrics.k8s.io API for resource metrics provided by metrics-server, the custom.metrics.k8s.io API for custom metrics provided by "adapter" API servers from metrics solution vendors, and the external.metrics.k8s.io API for external metrics, which is also provided by custom metrics adapters). A popular choice for the metrics collector is Prometheus. The metrics are then used by the MPA Recommender for making autoscaling decisions.
MPA Recommender. MPA Recommender retrieves the time-indexed measurement data from the Metrics APIs and generates the vertical and horizontal scaling actions. The actions from the MPA Recommender are then updated in the MPA API object. The autoscaling behavior is based on user-defined configurations. Users can implement their own recommenders as well.
MPA Updater. MPA Updater will update the number of replicas in the deployment and evict the eligible pods for vertical scaling.
MPA Admission-Controller. If users intend to directly execute the autoscaling recommendations generated from the MPA Recommender, the MPA Admission-Controller will update the deployment configuration (i.e., the size of each replica) and configure the rolling update to the Application Deployment.
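The eviction half of the MPA Updater's job can be sketched as follows. The text above does not define which pods are "eligible", so this sketch assumes one plausible rule: a pod is eligible when any of its container CPU requests falls outside the recommended `[lowerBound, upperBound]` range stored in the MPA object. The `Recommendation` type, the function name, and the use of millicores are illustrative choices, not part of the proposal.

```python
from dataclasses import dataclass


@dataclass
class Recommendation:
    """Per-container recommendation, mirroring the status fields of the MPA object."""
    container_name: str
    lower_bound_millicores: int
    target_millicores: int
    upper_bound_millicores: int


def pods_to_evict(pod_requests, recommendations):
    """Return names of pods with a container request outside the recommended range.

    pod_requests maps pod name -> {container name: CPU request in millicores}.
    Containers without a recommendation are ignored (an assumption).
    """
    by_container = {r.container_name: r for r in recommendations}
    evict = []
    for pod_name, containers in pod_requests.items():
        for container_name, request in containers.items():
            rec = by_container.get(container_name)
            if rec is None:
                continue
            if not rec.lower_bound_millicores <= request <= rec.upper_bound_millicores:
                evict.append(pod_name)
                break  # one out-of-range container is enough
    return evict
```

Evicted pods are recreated by the Deployment, at which point the admission controller can apply the recommended size.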
To actuate the decisions without losing availability, we plan to:
We use a webhook-based admission controller to manage vertical scaling because, if the actuator updated the vertical scaling configuration directly through the Deployment, it could overload etcd (vertical scaling might be quite frequent). The MPA Admission Controller intercepts Pod creation requests and rewrites each request by applying the recommended resources to the Pod spec. We do not use the admission webhook to manage horizontal scaling, as it could slow down the pod creation process. Once in-place vertical resizing is available, we can offer it as an option while keeping the webhook-based admission controller as the option for eviction-based vertical resizing.
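The request rewriting described above can be sketched as building a JSON Patch for the Pod and wrapping it in an `AdmissionReview` response, which is the standard contract for Kubernetes mutating webhooks. This is a minimal sketch and not the MPA implementation: the function names and the shape of the `recommended` argument are assumptions, and only resource requests are patched.

```python
import base64
import json


def build_resource_patch(pod, recommended):
    """Build JSON Patch operations that set recommended requests on a Pod.

    pod is a Pod manifest as a dict; recommended maps container name to
    a dict such as {"cpu": "250m", "memory": "256Mi"} (hypothetical shape).
    """
    patch = []
    for index, container in enumerate(pod["spec"]["containers"]):
        target = recommended.get(container["name"])
        if not target:
            continue
        base = f"/spec/containers/{index}/resources"
        resources = container.get("resources")
        if resources is None:
            patch.append({"op": "add", "path": base,
                          "value": {"requests": dict(target)}})
        elif "requests" not in resources:
            patch.append({"op": "add", "path": base + "/requests",
                          "value": dict(target)})
        else:
            # "add" on an existing object member replaces its value (RFC 6902).
            for name, quantity in target.items():
                patch.append({"op": "add", "path": f"{base}/requests/{name}",
                              "value": quantity})
    return patch


def admission_response(uid, patch):
    """Wrap a JSON Patch in an admission.k8s.io/v1 AdmissionReview response."""
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": uid,
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    }
```

Because the patch is applied at Pod creation, the Deployment object itself is never modified, which is what keeps frequent vertical changes from turning into frequent Deployment writes.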
Pros:
Cons:
To generate the vertical scaling recommendation, we reuse the VPA libraries as much as possible and integrate their scaling algorithm with the newly generated MPA API code. To do that, we update the code that reads and updates VPA objects so that it interacts with MPA objects instead. To generate the horizontal scaling recommendation, we reuse the HPA libraries and integrate them with the MPA API code so that they read and update MPA objects. We integrate vertical and horizontal scaling in a single feedback cycle. As an initial solution, vertical scaling and horizontal scaling are performed alternately (vertical scaling first). Vertical scaling adjusts the CPU and memory allocations based on historical usage, and horizontal scaling adjusts the number of replicas based on either CPU utilization or a custom metric. In the future, we can consider more complex ways of prioritization and conflict resolution. The separation of recommendation and actuation allows a customized recommender to replace the default recommender. For example, users can plug in their own RL-based controller in place of the MPA recommender; it receives measurements from the Metrics Server and modifies the MPA objects directly to give recommendations.
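The alternating feedback cycle described above can be sketched as a small state machine. This is a rough illustration under our own assumptions, not the default recommender: the vertical step takes a high percentile of historical CPU usage plus a safety margin (`percentile` and `margin_percent` are hypothetical parameters), the horizontal step applies the HPA ratio formula, and the two steps alternate across cycles starting with the vertical one.

```python
import math


def recommend_vertical(usage_history_millicores, percentile=0.9, margin_percent=15):
    """Pick a CPU request from historical usage: a percentile plus a margin."""
    ordered = sorted(usage_history_millicores)
    index = min(len(ordered) - 1, max(0, math.ceil(percentile * len(ordered)) - 1))
    return math.ceil(ordered[index] * (100 + margin_percent) / 100)


def recommend_horizontal(current_replicas, current_metric, target_metric,
                         min_replicas, max_replicas):
    """Apply the HPA ratio formula and clamp to the allowed replica range."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))


def reconcile(state, usage_history_millicores, current_metric, target_metric,
              min_replicas=1, max_replicas=10):
    """Run one feedback cycle, alternating between vertical and horizontal steps."""
    new_state = dict(state)
    if state["next_action"] == "vertical":
        new_state["cpu_request_millicores"] = recommend_vertical(usage_history_millicores)
        new_state["next_action"] = "horizontal"
    else:
        new_state["replicas"] = recommend_horizontal(
            state["replicas"], current_metric, target_metric,
            min_replicas, max_replicas)
        new_state["next_action"] = "vertical"
    return new_state
```

A custom recommender would replace the two `recommend_*` functions and write its result into the MPA object in the same way, leaving actuation untouched.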
The implementation of the MPA framework (the backend) is based on the existing HPA and VPA codebases so that it requires only minimal code maintenance. Reused Codebase References:
We reuse the CR definitions from the MultidimPodAutoscaler object developed by Google.
MultidimPodAutoscaler is the configuration for multi-dimensional Pod autoscaling, which automatically manages Pod resources and their count based on historical and real-time resource utilization.
MultidimPodAutoscaler has two main fields: spec and status.
```yaml
apiVersion: autoscaling.gke.io/v1beta1
kind: MultidimPodAutoscaler
metadata:
  name: my-autoscaler
# MultidimPodAutoscalerSpec
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-target
  policy:
    updateMode: Auto
  goals:
    metrics:
    - type: Resource
      resource:
        # Define the target CPU utilization request here
        name: cpu
        target:
          type: Utilization
          averageUtilization: target-cpu-util
  constraints:
    global:
      minReplicas: min-num-replicas
      maxReplicas: max-num-replicas
    containerControlledResources: [ memory, cpu ]  # Added cpu here as well
    container:
    - name: '*'  # either a literal name, or "*" to match all containers
                 # this is not a general wildcard match
      # Define boundaries for the memory request here
      requests:
        minAllowed:
          memory: min-allowed-memory
        maxAllowed:
          memory: max-allowed-memory
  # Define the recommender to use here
  recommenders:
  - name: my-recommender
# MultidimPodAutoscalerStatus
status:
  lastScaleTime: timestamp
  currentReplicas: number-of-replicas
  desiredReplicas: number-of-recommended-replicas
  recommendation:
    containerRecommendations:
    - containerName: name
      lowerBound: lower-bound
      target: target-value
      upperBound: upper-bound
  conditions:
  - lastTransitionTime: timestamp
    message: message
    reason: reason
    status: status
    type: condition-type
  currentMetrics:
  - type: metric-type
    value: metric-value
```
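To illustrate how the `constraints` section of the object above might bound a recommendation before it is written to `status`, here is a minimal sketch. It is not taken from the MPA code: the function names, the use of plain integers (bytes, millicores) instead of Kubernetes quantity strings, and the rule that a literal container name takes precedence over `'*'` are all assumptions made for the example.

```python
def matching_policy(container_name, container_policies):
    """Return the policy for a container: literal name first, then '*'."""
    for policy in container_policies:
        if policy["name"] == container_name:
            return policy
    for policy in container_policies:
        if policy["name"] == "*":  # matches all containers, not a general wildcard
            return policy
    return None


def apply_constraints(desired_replicas, container_targets, constraints):
    """Clamp a raw recommendation to spec.constraints.

    container_targets maps container name -> {resource name: integer amount}.
    Resources without a minAllowed/maxAllowed entry are left unchanged.
    """
    limits = constraints["global"]
    replicas = max(limits["minReplicas"],
                   min(limits["maxReplicas"], desired_replicas))
    bounded = {}
    for container_name, target in container_targets.items():
        policy = matching_policy(container_name, constraints.get("container", []))
        requests = policy.get("requests", {}) if policy else {}
        min_allowed = requests.get("minAllowed", {})
        max_allowed = requests.get("maxAllowed", {})
        bounded[container_name] = {}
        for resource, value in target.items():
            value = max(value, min_allowed.get(resource, value))
            value = min(value, max_allowed.get(resource, value))
            bounded[container_name][resource] = value
    return replicas, bounded
```

With `minReplicas: 2`, `maxReplicas: 10`, and a memory range of 128Mi to 1Gi, a raw recommendation of 14 replicas and 2Gi of memory would be stored as 10 replicas and 1Gi.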
[ ] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Unit tests are located in each controller package.
Integration tests are to be added in the beta version.
End-to-end tests are to be added in the beta version.
MPA can be enabled by checking the prerequisite and executing ./deploy/mpa-up.sh.
No.
MPA can be disabled by executing ./deploy/mpa-down.sh.
There is no impact, because every time MPA is enabled it performs a full reset and restart.
End-to-end test of MPA will be included in the beta version.
MPA relies on the cluster-level metrics.k8s.io API (for example, from metrics-server).
For the evict-and-replace mechanism, the API server needs to support the MutatingAdmissionWebhook API.
No. Replacing HPA/VPA with MPA only changes how recommendations are generated (separating recommendation from actuation). MPA reuses the API calls originally made by HPA/VPA and introduces no new ones.
Yes, MPA introduces a new Custom Resource MultidimPodAutoscaler, similar to VerticalPodAutoscaler.
No.
No. It will not affect any existing API objects.
No. To the best of our knowledge, it will not increase the time taken by any operation covered by existing SLIs/SLOs.
No.
No.
An alternative option is to implement MPA only as a recommender. For VPA, which already supports customized recommenders, MPA can be implemented as a recommender that writes to a VPA object; the VPA updater and admission controller then actuate the recommendation. For HPA, additional support for alternative recommenders is needed so that MPA can write scaling recommendations to the HPA object as well.
Another alternative is the (non-open-sourced) approach of Google's multidimensional Pod autoscaling service. In this approach, a MultidimPodAutoscaler object modifies memory and/or CPU requests and adds replicas so that the average utilization of each replica matches the target utilization.
The MPA object is translated into VPA and HPA objects, so in the end two independent controllers still manage the vertical and horizontal scaling of the application deployment.
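To make the translation idea concrete, the sketch below splits an MPA object into an HPA (`autoscaling/v2`) and a VPA (`autoscaling.k8s.io/v1`) manifest. The actual translation is not public, so the field mapping, the function name, and the `-hpa`/`-vpa` name suffixes are purely hypothetical; only the target HPA and VPA field names come from their public APIs.

```python
def translate_mpa(mpa):
    """Split an MPA manifest (dict) into hypothetical HPA and VPA manifests."""
    name = mpa["metadata"]["name"]
    spec = mpa["spec"]
    constraints = spec["constraints"]

    hpa = {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name + "-hpa"},  # hypothetical naming scheme
        "spec": {
            "scaleTargetRef": spec["scaleTargetRef"],
            "minReplicas": constraints["global"]["minReplicas"],
            "maxReplicas": constraints["global"]["maxReplicas"],
            "metrics": spec["goals"]["metrics"],
        },
    }

    container_policies = []
    for policy in constraints.get("container", []):
        requests = policy.get("requests", {})
        container_policies.append({
            "containerName": policy["name"],
            "controlledResources": constraints.get("containerControlledResources", []),
            "minAllowed": requests.get("minAllowed", {}),
            "maxAllowed": requests.get("maxAllowed", {}),
        })

    vpa = {
        "apiVersion": "autoscaling.k8s.io/v1",
        "kind": "VerticalPodAutoscaler",
        "metadata": {"name": name + "-vpa"},  # hypothetical naming scheme
        "spec": {
            "targetRef": spec["scaleTargetRef"],
            "updatePolicy": {"updateMode": spec["policy"]["updateMode"]},
            "resourcePolicy": {"containerPolicies": container_policies},
        },
    }
    return hpa, vpa
```

The result makes the drawback visible: after translation the two objects are reconciled by separate controllers that share no state, so the coordination problem described earlier remains.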