doc/source/cluster/kubernetes/k8s-ecosystem/kai-scheduler.md
(kuberay-kai-scheduler)=
This guide demonstrates how to use KAI Scheduler for setting up hierarchical queues with quotas, gang scheduling, and GPU sharing using RayClusters.
KAI Scheduler is a high-performance, scalable Kubernetes scheduler built for AI/ML workloads. Designed to orchestrate GPU clusters at massive scale, KAI optimizes GPU allocation and supports the full AI lifecycle - from interactive development to large distributed training and inference. Some of the key features are:
How PodGrouper works:
RayCluster "distributed-training":
├── Head Pod: 1 GPU
└── Worker Group: 4 × 0.5 GPU = 2 GPUs
Total Group Requirement: 3 GPUs
PodGrouper schedules all 5 pods (1 head + 4 workers) together or none at all.
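The all-or-nothing behavior in the diagram above can be illustrated with a few lines of Python. This is only a toy model of the arithmetic, not KAI Scheduler's implementation: the `gang_fits` helper and the idea of a single pool of free GPUs are assumptions made for illustration, and a real scheduler also has to consider per-node placement.

```python
from fractions import Fraction

# GPU requests for the pods in the "distributed-training" RayCluster above.
head_pods = [Fraction(1)]
worker_pods = [Fraction(1, 2)] * 4


def gang_fits(pod_requests, free_gpus):
    """Return True only if the whole group fits at once (hypothetical helper)."""
    return sum(pod_requests) <= free_gpus


group = head_pods + worker_pods
total = sum(group)  # 1 + 4 * 0.5 = 3 GPUs
print(f"pods={len(group)} total_gpus={total}")

# With 3 free GPUs the whole group is admitted; with 2, no pod is scheduled.
print(gang_fits(group, free_gpus=3))
print(gang_fits(group, free_gpus=2))
```

The point of the sketch is that the decision is made once for the group's total requirement, rather than pod by pod.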
You can arrange queues hierarchically to mirror an organization's structure, for example, departments that each contain multiple teams.
Replace <KAI_SCHEDULER_VERSION> in the following command with the version you want to install. Choose v0.10.0 or later.
# Install KAI Scheduler
helm upgrade -i kai-scheduler oci://ghcr.io/nvidia/kai-scheduler/kai-scheduler -n kai-scheduler --create-namespace --version <KAI_SCHEDULER_VERSION> --set "global.gpuSharing=true"
Follow the official KubeRay operator installation documentation and add the following configuration to enable KAI Scheduler integration:
--set batchScheduler.name=kai-scheduler
Create a basic queue structure for department-1 and its child team-a. To keep the demo simple, this example doesn't enforce any quota, overQuotaWeight, or limit. You can configure these parameters depending on your needs:
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: department-1
spec:
  # priority: 100 (optional)
  resources:
    cpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    gpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1
---
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: team-a
spec:
  # priority: 200 (optional)
  parentQueue: department-1
  resources:
    cpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    gpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1
Note: To make this demo easier to follow, the sample file combines these queue definitions with the RayCluster example in the next step. You can use the single combined YAML file to apply both the queues and the workload at once.
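If you need more than a couple of queues, generating the manifests can be less error-prone than copying them by hand. The following is a minimal sketch that reproduces the two queues above; the `build_queue` helper is hypothetical, and applying one set of values to cpu, gpu, and memory is a simplification made for this example.

```python
import json


def build_queue(name, parent=None, quota=-1, limit=-1, over_quota_weight=1):
    """Build a Queue manifest as a dict (hypothetical helper).

    The same quota, limit, and overQuotaWeight apply to cpu, gpu, and memory.
    """
    resource = {"quota": quota, "limit": limit, "overQuotaWeight": over_quota_weight}
    spec = {"resources": {kind: dict(resource) for kind in ("cpu", "gpu", "memory")}}
    if parent is not None:
        spec["parentQueue"] = parent
    return {
        "apiVersion": "scheduling.run.ai/v2",
        "kind": "Queue",
        "metadata": {"name": name},
        "spec": spec,
    }


queues = [
    build_queue("department-1"),
    build_queue("team-a", parent="department-1"),
]

# kubectl accepts JSON as well as YAML, so the output can be piped to
# `kubectl apply -f -`.
print(json.dumps({"apiVersion": "v1", "kind": "List", "items": queues}, indent=2))
```

Only the child queue carries `parentQueue`, which matches the hand-written definitions above.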
The key pattern is to add the queue label to your RayCluster. Here's a basic example from the KubeRay repository:
metadata:
  name: raycluster-sample
  labels:
    kai.scheduler/queue: team-a # This is the essential configuration.
Apply this RayCluster with queues:
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-cluster.kai-scheduler.yaml
kubectl apply -f ray-cluster.kai-scheduler.yaml
# Verify queues are created
kubectl get queues
# NAME           PRIORITY   PARENT         CHILDREN     DISPLAYNAME
# department-1                             ["team-a"]
# team-a                    department-1
# Watch the pods get scheduled
kubectl get pods -w
# NAME READY STATUS RESTARTS AGE
# kuberay-operator-7d86f4f46b-dq22x 1/1 Running 0 50s
# raycluster-sample-head-rvrkz 0/1 ContainerCreating 0 13s
# raycluster-sample-worker-worker-mlvtz 0/1 Init:0/1 0 13s
# raycluster-sample-worker-worker-rcb54 0/1 Init:0/1 0 13s
# raycluster-sample-worker-worker-mlvtz 0/1 Init:0/1 0 40s
# raycluster-sample-worker-worker-rcb54 0/1 Init:0/1 0 41s
# raycluster-sample-head-rvrkz 0/1 Running 0 42s
# raycluster-sample-head-rvrkz 1/1 Running 0 54s
# raycluster-sample-worker-worker-rcb54 0/1 PodInitializing 0 59s
# raycluster-sample-worker-worker-mlvtz 0/1 PodInitializing 0 59s
# raycluster-sample-worker-worker-rcb54 0/1 Running 0 60s
# raycluster-sample-worker-worker-mlvtz 0/1 Running 0 60s
# raycluster-sample-worker-worker-rcb54 1/1 Running 0 71s
# raycluster-sample-worker-worker-mlvtz 1/1 Running 0 71s
In Kubernetes, assigning different priorities to workloads ensures efficient resource management, minimizes service disruption, and supports better scaling. KAI Scheduler schedules jobs according to their assigned priority. When sufficient resources aren't available for a workload, the scheduler can preempt lower-priority workloads to free up resources for higher-priority ones. This approach ensures that the scheduler always prioritizes mission-critical services in resource allocation.
The KAI Scheduler deployment comes with several predefined priority classes:
You can submit the preceding workload with a specific priority. For example, to turn it into a build class workload, modify the labels:
labels:
  kai.scheduler/queue: team-a # This is the essential configuration.
  priorityClassName: build # Here you can specify the priority class in metadata.labels (optional)
See the documentation for more information.
This example creates two workers within a RayCluster that share a single GPU through time-slicing, each requesting 0.5 GPU (see the YAML file):
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-cluster.kai-gpu-sharing.yaml
kubectl apply -f ray-cluster.kai-gpu-sharing.yaml
# Watch the pods get scheduled
kubectl get pods -w
# NAME READY STATUS RESTARTS AGE
# kuberay-operator-7d86f4f46b-dq22x 1/1 Running 0 4m9s
# raycluster-half-gpu-head-9rtxf 0/1 Running 0 4s
# raycluster-half-gpu-shared-gpu-worker-5l7cn 0/1 Pending 0 4s
# raycluster-half-gpu-shared-gpu-worker-98tzh 0/1 Pending 0 4s
# ... (skip for brevity)
# raycluster-half-gpu-shared-gpu-worker-5l7cn 0/1 Init:0/1 0 6s
# raycluster-half-gpu-shared-gpu-worker-5l7cn 0/1 Init:0/1 0 7s
# raycluster-half-gpu-shared-gpu-worker-98tzh 0/1 Init:0/1 0 8s
# raycluster-half-gpu-head-9rtxf 1/1 Running 0 19s
# raycluster-half-gpu-shared-gpu-worker-5l7cn 0/1 PodInitializing 0 19s
# raycluster-half-gpu-shared-gpu-worker-98tzh 0/1 PodInitializing 0 19s
# raycluster-half-gpu-shared-gpu-worker-5l7cn 0/1 Running 0 20s
# raycluster-half-gpu-shared-gpu-worker-98tzh 0/1 Running 0 20s
# raycluster-half-gpu-shared-gpu-worker-5l7cn 1/1 Running 0 31s
# raycluster-half-gpu-shared-gpu-worker-98tzh 1/1 Running 0 31s
Note: GPU sharing with time slicing in this example occurs only at the Kubernetes layer, allowing multiple pods to share a single GPU device. The scheduler doesn't enforce memory isolation, so applications must manage their own usage to prevent interference. For other GPU sharing approaches, for example, MPS, see the KAI documentation.
To confirm that GPU sharing is working correctly, use these commands:
# 1. Check GPU fraction annotations and shared GPU groups
kubectl get pods -l ray.io/cluster=raycluster-half-gpu -o custom-columns="NAME:.metadata.name,NODE:.spec.nodeName,GPU-FRACTION:.metadata.annotations.gpu-fraction,GPU-GROUP:.metadata.labels.runai-gpu-group"
You should see both worker pods on the same node with GPU-FRACTION: 0.5 and the same GPU-GROUP ID:
NAME                                          NODE               GPU-FRACTION   GPU-GROUP
raycluster-half-gpu-head                      ip-xxx-xx-xx-xxx   <none>         <none>
raycluster-half-gpu-shared-gpu-worker-67tvw   ip-xxx-xx-xx-xxx   0.5            3e456911-a6ea-4b1a-8f55-e90fba89ad76
raycluster-half-gpu-shared-gpu-worker-v5tpp   ip-xxx-xx-xx-xxx   0.5            3e456911-a6ea-4b1a-8f55-e90fba89ad76
This shows that both workers have the same GPU-GROUP ID (same physical GPU) and GPU-FRACTION: 0.5.
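The same check can be scripted against the JSON form of the pod list. The sketch below is one possible approach, not part of KAI Scheduler or KubeRay: it assumes the `gpu-fraction` annotation and `runai-gpu-group` label used in the command above, the `gpu_groups` helper is hypothetical, and the sample data is made up to mirror the output shown.

```python
from collections import defaultdict


def gpu_groups(pod_list):
    """Group fractional-GPU pods by (node, GPU group ID). Hypothetical helper."""
    groups = defaultdict(list)
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        fraction = (meta.get("annotations") or {}).get("gpu-fraction")
        group = (meta.get("labels") or {}).get("runai-gpu-group")
        if fraction is None or group is None:
            continue  # pods without a GPU fraction, such as the head pod above
        node = pod.get("spec", {}).get("nodeName")
        groups[(node, group)].append((meta["name"], float(fraction)))
    return dict(groups)


# Made-up sample in the shape of `kubectl get pods -o json` output.
sample = {
    "items": [
        {"metadata": {"name": "head"}, "spec": {"nodeName": "node-a"}},
        {
            "metadata": {
                "name": "worker-1",
                "annotations": {"gpu-fraction": "0.5"},
                "labels": {"runai-gpu-group": "group-1"},
            },
            "spec": {"nodeName": "node-a"},
        },
        {
            "metadata": {
                "name": "worker-2",
                "annotations": {"gpu-fraction": "0.5"},
                "labels": {"runai-gpu-group": "group-1"},
            },
            "spec": {"nodeName": "node-a"},
        },
    ]
}

shared = gpu_groups(sample)
for (node, group), pods in shared.items():
    total = sum(fraction for _, fraction in pods)
    print(f"node={node} group={group} pods={len(pods)} total_fraction={total}")
```

Against a live cluster, the input could instead be loaded with `json.load(sys.stdin)` from `kubectl get pods -l ray.io/cluster=raycluster-half-gpu -o json`. Two workers appearing under one (node, group) key corresponds to the shared-GPU result described above.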
If pods remain in Pending state, the most common issue is missing queue labels.
Check operator logs for KAI Scheduler errors and look for error messages like:
"Queue label missing from RayCluster; pods will remain pending"
Solution: Ensure your RayCluster has a queue label that names a queue that exists in the cluster:
metadata:
  labels:
    kai.scheduler/queue: default # Add this label
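A quick pre-flight check before applying a manifest can catch this problem early. The following sketch assumes the RayCluster manifest is already loaded as a dict and that you have the list of queue names (for example, from `kubectl get queues`); the `check_queue_label` helper is hypothetical and isn't part of KubeRay or KAI Scheduler.

```python
QUEUE_LABEL = "kai.scheduler/queue"


def check_queue_label(raycluster, existing_queues):
    """Return a problem description, or None if the queue label looks valid."""
    labels = raycluster.get("metadata", {}).get("labels") or {}
    queue = labels.get(QUEUE_LABEL)
    if queue is None:
        return f"missing label {QUEUE_LABEL}"
    if queue not in existing_queues:
        return f"queue {queue!r} not found in cluster"
    return None


existing = {"department-1", "team-a"}

unlabeled = {"metadata": {"name": "raycluster-sample"}}
wrong_queue = {"metadata": {"labels": {QUEUE_LABEL: "team-b"}}}
valid = {"metadata": {"labels": {QUEUE_LABEL: "team-a"}}}

for manifest in (unlabeled, wrong_queue, valid):
    print(check_queue_label(manifest, existing) or "ok")
```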