cluster-autoscaler/cloudprovider/aws/README.md
On AWS, Cluster Autoscaler utilizes Amazon EC2 Auto Scaling Groups to manage node
groups. Cluster Autoscaler typically runs as a Deployment in your cluster.
Cluster Autoscaler requires Kubernetes v1.3.0 or greater.
Cluster Autoscaler requires the ability to examine and modify EC2 Auto Scaling Groups. We recommend using IAM roles for Service Accounts to associate the Service Account that the Cluster Autoscaler Deployment runs as with an IAM role that is able to perform these functions. If you are unable to use IAM Roles for Service Accounts, you may associate an IAM service role with the EC2 instance on which the Cluster Autoscaler pod runs.
There are a number of ways to run the autoscaler in AWS, which can significantly impact the range of IAM permissions required for the Cluster Autoscaler to function properly. Two options are provided below, one which will allow use of all of the features of the Cluster Autoscaler, the second with a more limited range of IAM actions enabled, which enforces using certain configuration options in the Cluster Autoscaler binary.
It is strongly recommended to restrict the target resources for the autoscaling actions
by either specifying Auto Scaling Group ARNs in the Resource list of the policy or
using tag based conditionals. The minimal policy
includes an example of restricting by ASG ARN.
Permissions required when using ASG Autodiscovery and Dynamic EC2 List Generation (the default behaviour). In this example, only the second block of actions should be updated to restrict the resources/add conditionals:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"autoscaling:DescribeAutoScalingGroups",
"autoscaling:DescribeAutoScalingInstances",
"autoscaling:DescribeLaunchConfigurations",
"autoscaling:DescribeScalingActivities",
"ec2:DescribeImages",
"ec2:DescribeInstanceTypes",
"ec2:DescribeLaunchTemplateVersions",
"ec2:GetInstanceTypesFromInstanceRequirements",
"eks:DescribeNodegroup"
],
"Resource": ["*"]
},
{
"Effect": "Allow",
"Action": [
"autoscaling:SetDesiredCapacity",
"autoscaling:TerminateInstanceInAutoScalingGroup"
],
"Resource": ["*"]
}
]
}
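As a hedged illustration of the tag-based conditionals mentioned above, the second (write) statement can be scoped to ASGs that carry a cluster-specific tag. The sketch below expresses that statement as a CloudFormation AWS::IAM::ManagedPolicy in YAML. The logical ID, the cluster name my-cluster, and the tag value owned are assumptions for illustration, not values taken from this README.

```yaml
# Sketch only: logical ID, cluster name and tag value are hypothetical.
Resources:
  ClusterAutoscalerWritePolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Action:
              - autoscaling:SetDesiredCapacity
              - autoscaling:TerminateInstanceInAutoScalingGroup
            Resource: "*"
            Condition:
              StringEquals:
                # Only ASGs carrying this tag (with this value) may be modified.
                "aws:ResourceTag/k8s.io/cluster-autoscaler/my-cluster": owned
```

The read-only Describe actions in the first statement stay on Resource "*"; only the statement that changes ASG state is narrowed.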
NOTE: The below policies/arguments to the Cluster Autoscaler need to be modified as appropriate for the names of your ASGs, as well as account ID and AWS region before being used.
The following policy provides the minimum privileges necessary for Cluster Autoscaler to run. When using this policy, you cannot use autodiscovery of ASGs. In addition, it restricts the IAM permissions to the node groups the Cluster Autoscaler is configured to scale.
This in turn means that you must pass the following arguments to the Cluster Autoscaler binary, replacing min and max node counts and the ASG:
--aws-use-static-instance-list=false
--nodes=1:100:exampleASG1
--nodes=1:100:exampleASG2
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"autoscaling:DescribeAutoScalingGroups",
"autoscaling:DescribeAutoScalingInstances",
"autoscaling:DescribeLaunchConfigurations",
"autoscaling:DescribeScalingActivities",
"autoscaling:SetDesiredCapacity",
"autoscaling:TerminateInstanceInAutoScalingGroup",
"eks:DescribeNodegroup"
],
"Resource": ["arn:aws:autoscaling:${YOUR_CLUSTER_AWS_REGION}:${YOUR_AWS_ACCOUNT_ID}:autoScalingGroup:*:autoScalingGroupName/${YOUR_ASG_NAME}"]
}
]
}
The "eks:DescribeNodegroup" permission allows Cluster Autoscaler to pull labels and taints from the EKS DescribeNodegroup API for EKS managed nodegroups. (Note: When an EKS DescribeNodegroup API label and a tag on the underlying autoscaling group have the same key, the EKS DescribeNodegroup API label value will be saved by the Cluster Autoscaler over the autoscaling group tag value.) Currently the Cluster Autoscaler will only call the EKS DescribeNodegroup API when a managed nodegroup is created with 0 nodes and has never had any nodes added to it. Once nodes are added, even if the managed nodegroup is scaled back to 0 nodes, this functionality will not be called anymore. In the case of a Cluster Autoscaler restart, the Cluster Autoscaler will need to repopulate caches so it will call this functionality again if the managed nodegroup is at 0 nodes. Enabling this functionality any time there are 0 nodes in a managed nodegroup (even after a scale-up then scale-down) would require changes to the general shared Cluster Autoscaler code which could happen in the future.
NOTE: For private clusters, in order for the EKS DescribeNodegroup API to work, you need to create an interface endpoint for Amazon EKS (AWS PrivateLink), as described at the AWS Documentation.
OIDC federated authentication allows your service to assume an IAM role and interact with AWS services without having to store credentials as environment variables. For an example of how to use AWS IAM OIDC with the Cluster Autoscaler please see here.
NOTE: The following is not recommended for Kubernetes clusters running on AWS. If you are using Amazon EKS, consider using IAM roles for Service Accounts instead.
For on-premise clusters, you may create an IAM user subject to the above policy and provide the IAM credentials as environment variables in the Cluster Autoscaler deployment manifest. Cluster Autoscaler will use these credentials to authenticate and authorize itself.
apiVersion: v1
kind: Secret
metadata:
name: aws-secret
type: Opaque
data:
aws_access_key_id: BASE64_OF_YOUR_AWS_ACCESS_KEY_ID
aws_secret_access_key: BASE64_OF_YOUR_AWS_SECRET_ACCESS_KEY
Please refer to the relevant Kubernetes documentation for creating a secret manually.
env:
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
name: aws-secret
key: aws_access_key_id
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
name: aws-secret
key: aws_secret_access_key
- name: AWS_REGION
value: YOUR_AWS_REGION
Auto-Discovery Setup is the preferred method to configure Cluster Autoscaler.
To enable this, provide the --node-group-auto-discovery flag as an argument
whose value is a list of tag keys that should be looked for. For example,
--node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<cluster-name>
will find the ASGs that have at least all of the given tags. Without the tags, the Cluster Autoscaler cannot add new instances
to an ASG, because it has not discovered that ASG. In this example no values are given for the tags, so tag values are ignored
and only the tag keys matter. Optionally, you can also require specific tag values and add custom tags. For example,
--node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled=foo,k8s.io/cluster-autoscaler/<cluster-name>=bar,my-custom-tag=custom-value
Now the ASG tags must have the correct values, and the custom tag must be present, for the ASG to be discovered by the Cluster Autoscaler.
Example deployment:
kubectl apply -f examples/cluster-autoscaler-autodiscover.yaml
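For orientation, the auto-discovery flag ends up in the container command of the Cluster Autoscaler Deployment. The fragment below is a minimal sketch of that part of the pod spec, not a copy of the example manifest; the image tag and cluster name are placeholders you would need to fill in.

```yaml
# Fragment of the Deployment's pod spec; <version> and <cluster-name> are placeholders.
spec:
  containers:
    - name: cluster-autoscaler
      image: registry.k8s.io/autoscaling/cluster-autoscaler:<version>
      command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        # Discover every ASG that carries both tag keys; tag values are ignored here.
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<cluster-name>
```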
Cluster Autoscaler will respect the minimum and maximum values of each Auto Scaling Group. It will only adjust the desired value.
Each Auto Scaling Group should be composed of instance types that provide approximately equal capacity. For example, ASG "xlarge" could be composed of m5a.xlarge, m4.xlarge, m5.xlarge, and m5d.xlarge instance types, because each of those provide 4 vCPUs and 16GiB RAM. Separately, ASG "2xlarge" could be composed of m5a.2xlarge, m4.2xlarge, m5.2xlarge, and m5d.2xlarge instance types, because each of those provide 8 vCPUs and 32GiB RAM.
Cluster Autoscaler will attempt to determine the CPU, memory, and GPU resources provided by an Auto Scaling Group based on the instance type specified in its Launch Configuration or Launch Template. It will also examine any overrides provided in an ASG's Mixed Instances Policy. If any such overrides are found, only the first instance type found will be used. See Using Mixed Instances Policies and Spot Instances for details.
When scaling up from 0 nodes, the Cluster Autoscaler reads ASG tags to derive information about the specifications of the nodes in that ASG, i.e. their labels and taints. Note that it does not actually apply these labels or taints; this is done by an AWS-generated user data script. The tags tell the Cluster Autoscaler whether pending pods could be scheduled if a new node were spun up for a particular ASG, on the assumption that the ASG tags accurately reflect the labels and taints actually applied.
The following is only required if scaling up from 0 nodes. If a deployment has a nodeSelector, the Cluster Autoscaler requires the matching label tag
on the ASG; otherwise no scaling will occur, because the Cluster Autoscaler does not know that
the ASG's nodes carry that label. The tag is of the format
k8s.io/cluster-autoscaler/node-template/label/<label-name>.
<label-name> is the name of the label and the value of each tag specifies the label value.
Example tags:
k8s.io/cluster-autoscaler/node-template/label/foo: bar
The following is only required if scaling up from 0 nodes. The Cluster Autoscaler requires the taint tag
on the ASG; otherwise tainted nodes may be spun up that the pending pods cannot actually run on. The tag is of the format
k8s.io/cluster-autoscaler/node-template/taint/<taint-name>.
<taint-name> is the name of the taint and the value of each tag specifies the taint value and effect with the format <taint-value>:<taint-effect>.
Example tags:
k8s.io/cluster-autoscaler/node-template/taint/dedicated: true:NoSchedule
From version 1.14, Cluster Autoscaler can also determine the resources provided
by each Auto Scaling Group via tags. The tag is of the format
k8s.io/cluster-autoscaler/node-template/resources/<resource-name>.
<resource-name> is the name of the resource, such as ephemeral-storage. The
value of each tag specifies the amount of resource provided. The units are
identical to the units used in the resources field of a Pod specification.
Example tags:
k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage: 100G
ASG labels can specify autoscaling options, overriding the global cluster-autoscaler settings for the labeled ASGs. These labels take values in the same format as the cluster-autoscaler command-line flags they override (a float, a duration, or a boolean, encoded as a string). Currently supported autoscaling options (and example values) are:
- k8s.io/cluster-autoscaler/node-template/autoscaling-options/scaledownutilizationthreshold: 0.5
  (overrides --scale-down-utilization-threshold value for that specific ASG)
- k8s.io/cluster-autoscaler/node-template/autoscaling-options/scaledowngpuutilizationthreshold: 0.5
  (overrides --scale-down-gpu-utilization-threshold value for that specific ASG)
- k8s.io/cluster-autoscaler/node-template/autoscaling-options/scaledownunneededtime: 10m0s
  (overrides --scale-down-unneeded-time value for that specific ASG)
- k8s.io/cluster-autoscaler/node-template/autoscaling-options/scaledownunreadytime: 20m0s
  (overrides --scale-down-unready-time value for that specific ASG)
- k8s.io/cluster-autoscaler/node-template/autoscaling-options/ignoredaemonsetsutilization: true
  (overrides --ignore-daemonsets-utilization value for that specific ASG)
NOTE: It is your responsibility to ensure such labels and/or taints are applied via the node's kubelet configuration at startup. Cluster Autoscaler will not set the node taints for you.
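If your ASGs are managed with CloudFormation, the node-template tags described above could be declared roughly as follows. This is a sketch under stated assumptions: the logical ID WorkerAsg, the sizes, and the tag values are hypothetical, and the properties an ASG also needs (launch template, subnets) are omitted.

```yaml
# Sketch only: logical ID, sizes and tag values are hypothetical.
Resources:
  WorkerAsg:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "0"
      MaxSize: "10"
      # LaunchTemplate, VPCZoneIdentifier, etc. omitted for brevity.
      Tags:
        # Label hint, used when scaling up from 0 nodes.
        - Key: k8s.io/cluster-autoscaler/node-template/label/foo
          Value: bar
          PropagateAtLaunch: false
        # Taint hint in the <taint-value>:<taint-effect> format.
        - Key: k8s.io/cluster-autoscaler/node-template/taint/dedicated
          Value: "true:NoSchedule"
          PropagateAtLaunch: false
        # Resource hint, same units as a Pod's resources field.
        - Key: k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage
          Value: 100G
          PropagateAtLaunch: false
        # Per-ASG override of --scale-down-unneeded-time.
        - Key: k8s.io/cluster-autoscaler/node-template/autoscaling-options/scaledownunneededtime
          Value: 10m0s
          PropagateAtLaunch: false
```

PropagateAtLaunch is set to false here because Cluster Autoscaler reads these tags from the ASG itself; whether you also want them copied onto instances is a separate choice.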
Recommendations:
- Use a second tag such as k8s.io/cluster-autoscaler/<cluster-name> when
  k8s.io/cluster-autoscaler/enabled is used across many clusters, to prevent
  ASGs from different clusters from conflicting.
  An ASG must contain at least all the tags specified, so secondary tags can differentiate between different
  clusters' ASGs.
- Do not provide a --nodes argument if
  --node-group-auto-discovery is specified.
- Add autoscaling:DescribeLaunchConfigurations or
  ec2:DescribeLaunchTemplateVersions to the Action list of the IAM Policy
  used by Cluster Autoscaler, depending on whether your ASG utilizes Launch
  Configurations or Launch Templates.
The device plugin on nodes that provides GPU resources can take some time to advertise the GPU resource to the cluster. This may cause Cluster Autoscaler to unnecessarily scale out multiple times.
To avoid this, you can configure kubelet on your GPU nodes to label the node
before it joins the cluster by passing it the --node-labels flag. The label
format is as follows:
- cloud.google.com/gke-accelerator=<gpu-type>
- k8s.amazonaws.com/accelerator=<gpu-type>
<gpu-type> varies by instance type. On P2 instances, for example, the
value is nvidia-tesla-k80.
Cluster Autoscaler can also be configured manually if you wish by passing the
--nodes argument at startup. The format of the argument is
--nodes=<min>:<max>:<asg-name>, where <min> is the minimum number of nodes,
<max> is the maximum number of nodes, and <asg-name> is the Auto Scaling
Group name.
You can pass multiple --nodes arguments if you have multiple Auto Scaling Groups
you want Cluster Autoscaler to use.
NOTES:
- <min> and <max> must be within the range of the minimum and maximum
  instance counts specified by the Auto Scaling group.
Examples:
kubectl apply -f examples/cluster-autoscaler-one-asg.yaml
kubectl apply -f examples/cluster-autoscaler-multi-asg.yaml
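To make the --nodes format concrete, here is a minimal sketch of how two statically configured ASGs might appear in the Deployment's container command. The ASG names and the min/max values are hypothetical and must be replaced with your own; this is not a copy of the example manifests.

```yaml
# Fragment of the Deployment's pod spec; ASG names and bounds are hypothetical.
spec:
  containers:
    - name: cluster-autoscaler
      command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        # Format: --nodes=<min>:<max>:<asg-name>, one flag per ASG.
        - --nodes=1:10:k8s-worker-asg-1
        - --nodes=1:3:k8s-worker-asg-2
```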
NOTE: This setup is not compatible with Amazon EKS.
To run a CA pod on a control plane node the CA deployment should tolerate the master
taint and nodeSelector should be used to schedule the pods on a control plane node.
Please replace {{ node_asg_min }}, {{ node_asg_max }} and {{ name }} with
your ASG setting in the yaml file.
kubectl apply -f examples/cluster-autoscaler-run-on-control-plane.yaml
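The toleration and nodeSelector described above might look like the following fragment of the Deployment. Treat it as a sketch: the taint key and the nodeSelector label are assumptions that depend on how your control plane nodes are tainted and labeled (for example, newer Kubernetes versions use node-role.kubernetes.io/control-plane rather than the master key), so check your nodes before using it.

```yaml
# Fragment of the Deployment; taint key and node label are assumptions.
spec:
  template:
    spec:
      tolerations:
        # Allow scheduling onto nodes that carry the master taint.
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        # Hypothetical label; use whatever label marks your control plane nodes.
        kubernetes.io/role: master
```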
NOTE: The minimum version of cluster autoscaler to support MixedInstancePolicy is v1.14.x.
If your workloads can tolerate interruption, consider taking advantage of Spot Instances for a lower price point. To enable diversity among On-Demand and Spot Instances, and to specify multiple EC2 instance types in order to tap into multiple Spot capacity pools, use a mixed instances policy on your ASG. Note that the instance types should have the same amount of RAM and number of CPU cores, since this is fundamental to CA's scaling calculations. Using mismatched instance types can produce unintended results. See an example below.
Additionally, there are other factors which affect scaling, such as node labels.
If you are currently using nodeSelector with the
beta.kubernetes.io/instance-type
label, you will need to apply a common propagating label to the ASG and use that
instead, since the instance-type label can no longer be relied upon. One may
also use auto-generated tags such as aws:cloudformation:stack-name for this
purpose. Node affinity and
anti-affinity
are not affected in the same way, since these selectors natively accept multiple
values; one must add all the configured instance types to the list of values,
for example:
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: beta.kubernetes.io/instance-type
operator: In
values:
- r5.2xlarge
- r5d.2xlarge
- r5a.2xlarge
- r5ad.2xlarge
- r5n.2xlarge
- r5dn.2xlarge
- r4.2xlarge
- i3.2xlarge
Similarly, if using the balancing-label flag, you should only choose labels which have the same value for all nodes in
the node group. Otherwise you may get unexpected results, as the flag values will vary based on the nodes created by
the ASG.
See CloudFormation example here.
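As a rough sketch of what a mixed instances policy looks like in CloudFormation, the fragment below combines On-Demand and Spot capacity across instance types of equal size (8 vCPUs and 64 GiB RAM each). The logical IDs, the capacity split, and the referenced WorkerLaunchTemplate are hypothetical; the launch template is assumed to be defined elsewhere in the same template.

```yaml
# Sketch only: logical IDs, capacity split and launch template are hypothetical.
Resources:
  SpotWorkerAsg:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "0"
      MaxSize: "10"
      # VPCZoneIdentifier and tags omitted for brevity.
      MixedInstancesPolicy:
        InstancesDistribution:
          OnDemandBaseCapacity: 1
          OnDemandPercentageAboveBaseCapacity: 25
          SpotAllocationStrategy: capacity-optimized
        LaunchTemplate:
          LaunchTemplateSpecification:
            LaunchTemplateId: !Ref WorkerLaunchTemplate
            Version: !GetAtt WorkerLaunchTemplate.LatestVersionNumber
          # All overrides provide the same vCPU and memory capacity.
          Overrides:
            - InstanceType: r5.2xlarge
            - InstanceType: r5d.2xlarge
            - InstanceType: r5a.2xlarge
```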
The set of the latest supported EC2 instance types will be fetched by the CA at
run time. You can find all the available instance types in the CA logs. If your
network access is restricted such that fetching this set is infeasible, you can
specify the command-line flag --aws-use-static-instance-list=true to switch
the CA back to its original use of a statically defined set.
To refresh the static list, run go run ec2_instance_types/gen.go under
cluster-autoscaler/cloudprovider/aws/.
If you want to use a custom AWS cloud config, e.g. custom endpoint URLs, create a ConfigMap that contains the cloud config file:
kubectl apply -f examples/configmap-cloudconfig-example.yaml
values.yaml:
cloudConfigPath: config/cloud.conf
extraVolumes:
- name: cloud-config
configMap:
name: cloud-config
extraVolumeMounts:
- name: cloud-config
mountPath: config
Please note: it is also possible to mount the cloud config file from host:
extraVolumes:
- name: cloud-config
hostPath:
path: /path/to/file/on/host
extraVolumeMounts:
- name: cloud-config
mountPath: config/cloud.conf
readOnly: true
- /etc/ssl/certs/ca-bundle.crt should exist by default on EC2 instances in
  your EKS cluster. If you use other cluster provisioning tools like
  kops with operating systems
  other than Amazon Linux 2, use /etc/ssl/certs/ca-certificates.crt or the
  correct path on your host instead for the volume hostPath in your cluster
  autoscaler manifest.
- To allow Cluster Autoscaler to remove nodes that run kube-system pods, pass the
  --skip-nodes-with-system-pods=false flag.
- Scale-down delays can be adjusted using the --scale-down-delay-after-add,
  --scale-down-delay-after-delete, and --scale-down-delay-after-failure
  flags. E.g. --scale-down-delay-after-add=5m to decrease the scale down delay
  to 5 minutes after a node has been added.
- The --expander flag supports five options:
  random, most-pods, least-waste, priority, and grpc. random will
  expand a random ASG on scale up. most-pods will scale up the ASG that will
  schedule the most pods. least-waste will expand the ASG that will
  waste the least amount of CPU/MEM resources. In the event of a tie, cluster
  autoscaler will fall back to random. The priority expander lets you define
  a custom priority ranking in a ConfigMap for selecting ASGs, and the grpc
  expander allows you to write your own expansion logic.
- If you manage your own kubelets, start them with the --provider-id flag. The provider id has the format
  aws:///<availability-zone>/<instance-id>, e.g.
  aws:///us-east-1a/i-01234abcdef.
- To use regional STS endpoints, the environment variable AWS_STS_REGIONAL_ENDPOINTS=regional should be set.
- On instances where IMDSv1 is disabled, make sure the EC2 launch configuration has Metadata response hop limit set to 2.
  Otherwise, the /latest/api/token call will time out and result in an error. See AWS docs here for further information.
- If you do not use EKS managed nodegroups, do not add the eks:nodegroup-name tag to the ASG, as this will lead to extra EKS API calls that could slow down scaling when there are 0 nodes in the nodegroup.
- Set AWS_MAX_ATTEMPTS to configure max retries.
- If the CA cannot fetch the supported EC2 instance types at startup, for example in a private cluster, add --aws-use-static-instance-list=true to the CA startup command. For more information on private cluster requirements, see AWS docs here.
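Several of the flags and environment variables named in these notes can be set together in the Deployment's container spec. The fragment below is a sketch rather than a recommended configuration: which flags you need depends on your cluster, and the AWS_MAX_ATTEMPTS value shown is an arbitrary illustration.

```yaml
# Fragment of the Deployment's pod spec; values are illustrative, not recommendations.
spec:
  containers:
    - name: cluster-autoscaler
      command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --expander=least-waste
        # Permit removal of nodes that run kube-system pods.
        - --skip-nodes-with-system-pods=false
        - --scale-down-delay-after-add=5m
        # Use the built-in instance type list instead of fetching it at run time.
        - --aws-use-static-instance-list=true
      env:
        - name: AWS_STS_REGIONAL_ENDPOINTS
          value: regional
        - name: AWS_MAX_ATTEMPTS
          value: "5"  # hypothetical retry count
```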