doc/source/cluster/kubernetes/user-guides/aws-eks-gpu-cluster.md
(kuberay-eks-gpu-cluster-setup)=
This guide walks you through the steps to create an Amazon EKS cluster with GPU nodes specifically for KubeRay. The configuration outlined here can be applied to most KubeRay examples found in the documentation.
Follow the first two steps in this AWS documentation to: (1) create your Amazon EKS cluster and (2) configure your computer to communicate with your cluster.
Follow "Step 3: Create nodes" in this AWS documentation to create node groups. The following section provides more detailed information.
Typically, avoid running GPU workloads on the Ray head. Create a CPU node group for all Pods except Ray GPU workers, such as the KubeRay operator, Ray head, and CoreDNS Pods.
Here's a common configuration that works for most KubeRay examples in the docs (the values match the `eksctl get nodegroup` output at the end of this guide):
* Instance type: m5.xlarge
* AMI type: AL2_x86_64
* Desired size: 1, Min size: 0, Max size: 1
Create a GPU node group for Ray GPU workers.
Here's a common configuration that works for most KubeRay examples in the docs:
* AMI type: BOTTLEROCKET_x86_64_NVIDIA
* Instance type: g5.12xlarge
* Desired size: 1, Min size: 0, Max size: 1
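If you prefer the CLI over the AWS console, `eksctl` can create node groups with this shape. The following is a sketch, not a definitive recipe: it assumes `${YOUR_EKS_NAME}` is your cluster name, and the sizes and instance types mirror the `eksctl get nodegroup` output shown later in this guide.

```sh
# CPU node group for the KubeRay operator, Ray head, and CoreDNS Pods.
eksctl create nodegroup --cluster ${YOUR_EKS_NAME} --name cpu-node-group \
  --node-type m5.xlarge --nodes 1 --nodes-min 0 --nodes-max 1

# GPU node group for Ray GPU workers. eksctl is expected to select the
# Bottlerocket NVIDIA variant for GPU instance types; verify the AMI after creation.
eksctl create nodegroup --cluster ${YOUR_EKS_NAME} --name gpu-node-group \
  --node-type g5.12xlarge --nodes 1 --nodes-min 0 --nodes-max 1 \
  --node-ami-family Bottlerocket
```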
Install the NVIDIA device plugin. Note: You can skip this step if you used the `BOTTLEROCKET_x86_64_NVIDIA` AMI in the step above.
If you add a taint to the GPU nodes, also add the corresponding tolerations to `nvidia-device-plugin.yml` to enable the DaemonSet to schedule Pods on the GPU nodes. Note: If you encounter permission issues with `kubectl`, follow "Step 2: Configure your computer to communicate with your cluster" in the AWS documentation.
```sh
# Install the DaemonSet
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.9.0/nvidia-device-plugin.yml

# Verify that your nodes have allocatable GPUs. If a GPU node fails to report GPUs,
# verify that the DaemonSet schedules a Pod on that node.
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

# Example output:
# NAME                                GPU
# ip-....us-west-2.compute.internal   4
# ip-....us-west-2.compute.internal   <none>
```
Add a Kubernetes taint to prevent scheduling CPU Pods on this GPU node group. For KubeRay examples, add the following taint to the GPU nodes: `Key: ray.io/node-type`, `Value: worker`, `Effect: NoSchedule`, and include the corresponding tolerations for GPU Ray worker Pods.
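If you created the node group without the taint, you can apply it afterwards with `kubectl`. The node name below is a placeholder; list your actual GPU nodes with `kubectl get nodes`.

```sh
# Placeholder node name: substitute a GPU node from `kubectl get nodes`.
kubectl taint nodes ip-....us-west-2.compute.internal ray.io/node-type=worker:NoSchedule

# Remove the taint later if needed (note the trailing dash).
kubectl taint nodes ip-....us-west-2.compute.internal ray.io/node-type=worker:NoSchedule-
```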
Warning: GPU nodes are extremely expensive. Remember to delete the cluster if you no longer need it.
Note: If you encounter permission issues with `eksctl`, navigate to your AWS account's web console and copy the credential environment variables, including `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN`, from the "Command line or programmatic access" page.
```sh
eksctl get nodegroup --cluster ${YOUR_EKS_NAME}
# CLUSTER           NODEGROUP       STATUS  CREATED               MIN SIZE  MAX SIZE  DESIRED CAPACITY  INSTANCE TYPE  IMAGE ID                    ASG NAME                TYPE
# ${YOUR_EKS_NAME}  cpu-node-group  ACTIVE  2023-06-05T21:31:49Z  0         1         1                 m5.xlarge      AL2_x86_64                  eks-cpu-node-group-...  managed
# ${YOUR_EKS_NAME}  gpu-node-group  ACTIVE  2023-06-05T22:01:44Z  0         1         1                 g5.12xlarge    BOTTLEROCKET_x86_64_NVIDIA  eks-gpu-node-group-...  managed
```
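As the warning above notes, delete the cluster when you finish with it. Assuming `eksctl` created the cluster:

```sh
# Deletes the EKS cluster along with its node groups.
eksctl delete cluster --name ${YOUR_EKS_NAME}
```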