# Circumventing the ASG Tag Limit on AWS

`cluster-autoscaler/proposals/circumvent-tag-limit-aws.md`
Currently, an EC2 Auto Scaling group (ASG) can have at most 50 tags. Many tags are already consumed by standard components such as the AWS cloudprovider for Kubernetes, and by customers for billing and cost-allocation purposes. Encoding labels and taints as ASG tags therefore quickly runs into this 50-tag limit. The primary focus of this proposal is to get around the 50-tag limit for customers scaling to/from 0 nodes with Cluster Autoscaler on AWS EKS ManagedNodegroups, in a way that does not constrain the ManagedNodegroups service.
AWS provides the EKS ManagedNodegroups service which manages the lifecycle of EC2 worker nodes that can join an EKS Kubernetes cluster. Each EKS ManagedNodegroup has an underlying ASG. ASGs and Cluster Autoscaler support scaling to/from 0.
In the AWS cloud provider, if there are no nodes in the nodegroup, Cluster Autoscaler builds a nodeTemplate from ASG tags and some default allocatable resources. (See the code) The tags Cluster Autoscaler reads include resources, labels, and taints. The nodeTemplate is used to determine the resources a node would provide before any actual EC2 instance exists. We propose that, when a scaled-to-0 EKS ManagedNodegroup is in use, the Cluster Autoscaler cloud provider for AWS take advantage of the EKS DescribeNodegroup API, which returns the latest lists of labels and taints. Unlike the ASG tags, the API does not limit the number of labels and taints that Cluster Autoscaler can discover to 50.
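For context, these are the tag key conventions the AWS cloudprovider already reads from the ASG when building the nodeTemplate (the tag key prefixes follow the AWS cloudprovider documentation; the specific label, taint, and resource values shown are hypothetical examples):

```
k8s.io/cluster-autoscaler/node-template/label/nodetype              = gpu
k8s.io/cluster-autoscaler/node-template/taint/dedicated             = gpu:NoSchedule
k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage = 100G
```

Each label, taint, and extended resource consumes one of the ASG's 50 tag slots, which is why nodegroups with many labels or taints hit the limit.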
The ServiceAccount running the Cluster Autoscaler Pod (if using IAM Roles for Service Accounts (IRSA)) or the role associated with the instance profile (if not using IRSA) will need one additional permission: eks:DescribeNodegroup. If Cluster Autoscaler doesn't have permission to call the API, it will fall back to using just the ASG tags.
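For reference, a minimal IAM policy statement granting this action might look like the following (the `Resource` scoping shown is the permissive default; operators may restrict it to specific nodegroup ARNs):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "eks:DescribeNodegroup",
      "Resource": "*"
    }
  ]
}
```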
We propose that the AWS-specific driver of Cluster Autoscaler check for the AWS EKS ManagedNodegroups tags on the ASG that mark the ASG as a ManagedNodegroup. If the tags are present, Cluster Autoscaler pulls information from a cache struct containing DescribeNodegroup API response data. The AWS EKS ManagedNodegroups tags look like eks:cluster-name : <CLUSTER_NAME> and eks:nodegroup-name : <NODEGROUP_NAME>. We'll key off these tags because they are automatically added to every managed nodegroup. The DescribeNodegroup API is already in the AWS SDK that Cluster Autoscaler uses. (Current EKS interface)
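A minimal sketch of the detection step, assuming the ASG's tags have already been flattened into a map (the helper name is hypothetical, not the actual implementation):

```go
package main

import "fmt"

// Tag keys EKS automatically applies to the ASG behind every managed nodegroup.
const (
	clusterNameTag   = "eks:cluster-name"
	nodegroupNameTag = "eks:nodegroup-name"
)

// managedNodegroupFromTags reports whether the ASG's tags mark it as an EKS
// managed nodegroup and, if so, returns the cluster and nodegroup names
// needed for a DescribeNodegroup call.
func managedNodegroupFromTags(tags map[string]string) (cluster, nodegroup string, ok bool) {
	cluster, hasCluster := tags[clusterNameTag]
	nodegroup, hasNodegroup := tags[nodegroupNameTag]
	return cluster, nodegroup, hasCluster && hasNodegroup
}

func main() {
	tags := map[string]string{
		"eks:cluster-name":   "prod-cluster",
		"eks:nodegroup-name": "gpu-nodes",
	}
	if cluster, ng, ok := managedNodegroupFromTags(tags); ok {
		fmt.Printf("managed nodegroup %s in cluster %s\n", ng, cluster)
	}
}
```

An ASG without both tags is treated as a plain (unmanaged) ASG and keeps today's tag-only behavior.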
When Cluster Autoscaler finds the ManagedNodegroups tags, it will call functions on a ManagedNodegroupCache struct to get the labels, taints, and a few other values. The cache struct holds the response data from the EKS DescribeNodegroup API. When the cache is accessed (i.e., the cache struct's methods are called), it first checks whether data is cached; if so, it then checks whether the TTL (1 minute) has expired. If there is no cached data or the TTL has expired, the DescribeNodegroup API is called. Cluster Autoscaler will include both the ASG tag values and the values from the EKS API in its decisions. If an ASG tag and the EKS API define the same resource (e.g., the same label key appears in both), Cluster Autoscaler will choose the value from the ASG tag, so customers can override any value they want.
This is what a DescribeNodegroup API response looks like (also see here):
```
HTTP/1.1 200
Content-type: application/json

{
   "nodegroup": {
      "amiType": "*string*",
      "capacityType": "*string*",
      "clusterName": "*string*",
      "createdAt": *number*,
      "diskSize": *number*,
      "health": {
         "issues": [
            {
               "code": "*string*",
               "message": "*string*",
               "resourceIds": [ "*string*" ]
            }
         ]
      },
      "instanceTypes": [ "*string*" ],
      "labels": {
         "*string*": "*string*"
      },
      "launchTemplate": {
         "id": "*string*",
         "name": "*string*",
         "version": "*string*"
      },
      "modifiedAt": *number*,
      "nodegroupArn": "*string*",
      "nodegroupName": "*string*",
      "nodeRole": "*string*",
      "releaseVersion": "*string*",
      "remoteAccess": {
         "ec2SshKey": "*string*",
         "sourceSecurityGroups": [ "*string*" ]
      },
      "resources": {
         "autoScalingGroups": [
            {
               "name": "*string*"
            }
         ],
         "remoteAccessSecurityGroup": "*string*"
      },
      "scalingConfig": {
         "desiredSize": *number*,
         "maxSize": *number*,
         "minSize": *number*
      },
      "status": "*string*",
      "subnets": [ "*string*" ],
      "tags": {
         "*string*": "*string*"
      },
      "version": "*string*"
   }
}
```
## Latency and throttling
By default, Cluster Autoscaler runs its loop every 10 seconds. Our best-practices documentation notes that this short interval can cause throttling, because Cluster Autoscaler already makes AWS API calls during each loop, and it recommends that customers increase the interval; adding this API call therefore shouldn't cause latency problems for customers. (EKS Best Practices) Moreover, the API call will only happen for EKS ManagedNodegroups. If the added latency is too high for even one run of the loop, we will look into moving the API calls into parallel goroutines.
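The parallel-goroutines fallback mentioned above could be sketched roughly like this (a hypothetical fan-out helper; `describeFn` stands in for the real DescribeNodegroup call):

```go
package main

import (
	"fmt"
	"sync"
)

// describeAll issues one describe call per managed nodegroup concurrently and
// collects the results in input order, so a slow call for one nodegroup does
// not serialize behind the others.
func describeAll(nodegroups []string, describeFn func(string) string) []string {
	results := make([]string, len(nodegroups))
	var wg sync.WaitGroup
	for i, ng := range nodegroups {
		wg.Add(1)
		go func(i int, ng string) {
			defer wg.Done()
			results[i] = describeFn(ng) // each goroutine writes only its own slot
		}(i, ng)
	}
	wg.Wait()
	return results
}

func main() {
	out := describeAll([]string{"gpu-nodes", "cpu-nodes"}, func(ng string) string {
		return "info-" + ng // stand-in for a DescribeNodegroup response
	})
	fmt.Println(out)
}
```

In practice this would also need per-call error handling and would still be subject to the EKS API rate limits, so the cache remains the primary mitigation.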
EKS also throttles describe API calls by default. To mitigate this, Cluster Autoscaler will keep a cache of DescribeNodegroup responses: ManagedNodegroupCache. Each cache entry will have a TTL, and when the TTL expires, DescribeNodegroup will be called again. If the call to DescribeNodegroup fails, Cluster Autoscaler will move on and fall back to the ASG tags and the existing default allocatable resource values.
As an alternative, we considered having the AWS EKS ManagedNodegroups service create a ConfigMap in the cluster that stores all of the node information for ManagedNodegroups.