(deploying-on-argocd-example)=
# Deploying Ray Clusters on Kubernetes with ArgoCD

This guide provides a step-by-step approach for deploying Ray clusters on Kubernetes using ArgoCD. ArgoCD is a declarative GitOps tool that lets you manage Ray cluster configurations in Git repositories with automated synchronization, version control, and rollback capabilities. This approach is particularly valuable when managing multiple Ray clusters across different environments, implementing audit trails and approval workflows, or maintaining infrastructure-as-code practices. For simpler use cases, such as single-cluster development or quick experimentation, direct kubectl or Helm deployments may be sufficient. You can read more about the benefits of ArgoCD in the [ArgoCD documentation](https://argo-cd.readthedocs.io/).
This example demonstrates how to deploy the KubeRay operator and a RayCluster with three different worker groups, leveraging ArgoCD's GitOps capabilities for automated cluster management.
Before proceeding with this guide, ensure you have the following:

- A Kubernetes cluster with ArgoCD installed.
- kubectl configured to access your Kubernetes cluster.

## Step 1: Deploy the KubeRay CRDs

First, deploy the Custom Resource Definitions (CRDs) required by the KubeRay operator. Create a file named `ray-operator-crds.yaml` with the following content:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ray-operator-crds
  namespace: argocd
spec:
  project: default
  destination:
    server: https://kubernetes.default.svc
    namespace: ray-cluster
  source:
    repoURL: https://github.com/ray-project/kuberay
    targetRevision: v1.5.1 # update this as necessary
    path: helm-chart/kuberay-operator/crds
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - Replace=true
```
Apply the ArgoCD Application:

```shell
kubectl apply -f ray-operator-crds.yaml
```

Wait for the CRDs Application to sync and become healthy. You can check the status using:

```shell
kubectl get application ray-operator-crds -n argocd
```

Which should eventually give something like:

```
NAME                SYNC STATUS   HEALTH STATUS
ray-operator-crds   Synced        Healthy
```

Alternatively, if you have the ArgoCD CLI installed, you can wait for the application:

```shell
argocd app wait ray-operator-crds
```
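If you script the rollout instead of watching it manually, the readiness condition that `argocd app wait` checks can be approximated by inspecting the Application's `status` subresource (for example, from `kubectl get application ray-operator-crds -n argocd -o json`). This is a minimal sketch; the helper name is hypothetical, and the field paths follow the ArgoCD Application CRD:

```python
import json

def app_is_ready(app: dict) -> bool:
    """Return True when an ArgoCD Application reports Synced and Healthy."""
    status = app.get("status", {})
    synced = status.get("sync", {}).get("status") == "Synced"
    healthy = status.get("health", {}).get("status") == "Healthy"
    return synced and healthy

# Example payload shaped like `kubectl get application ... -o json` output:
app = json.loads('{"status": {"sync": {"status": "Synced"},'
                 ' "health": {"status": "Healthy"}}}')
print(app_is_ready(app))  # True
```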
## Step 2: Deploy the KubeRay Operator

After the CRDs are installed, deploy the KubeRay operator itself. Create a file named `ray-operator.yaml` with the following content:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ray-operator
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/ray-project/kuberay
    targetRevision: v1.5.1 # update this as necessary
    path: helm-chart/kuberay-operator
    helm:
      skipCrds: true # CRDs are already installed in Step 1
  destination:
    server: https://kubernetes.default.svc
    namespace: ray-cluster
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```
Note the `skipCrds: true` setting in the Helm configuration. This is required because the CRDs were installed separately in Step 1.

Apply the ArgoCD Application:

```shell
kubectl apply -f ray-operator.yaml
```

Wait for the operator Application to sync and become healthy. You can check the status using:

```shell
kubectl get application ray-operator -n argocd
```

Which should eventually give the following output:

```
NAME           SYNC STATUS   HEALTH STATUS
ray-operator   Synced        Healthy
```

Alternatively, if you have the ArgoCD CLI installed:

```shell
argocd app wait ray-operator
```

Verify that the KubeRay operator pod is running:

```shell
kubectl get pods -n ray-cluster -l app.kubernetes.io/name=kuberay-operator
```
## Step 3: Deploy a RayCluster

Now deploy a RayCluster with autoscaling enabled and three different worker groups. Create a file named `raycluster.yaml` with the following content:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: raycluster
  namespace: argocd
spec:
  project: default
  destination:
    server: https://kubernetes.default.svc
    namespace: ray-cluster
  ignoreDifferences:
    - group: ray.io
      kind: RayCluster
      name: raycluster-kuberay
      namespace: ray-cluster
      jqPathExpressions:
        - .spec.workerGroupSpecs[].replicas
  source:
    repoURL: https://ray-project.github.io/kuberay-helm/
    chart: ray-cluster
    targetRevision: "1.5.1"
    helm:
      releaseName: raycluster
      valuesObject:
        image:
          repository: docker.io/rayproject/ray
          tag: latest
          pullPolicy: IfNotPresent
        head:
          rayStartParams:
            num-cpus: "0"
          enableInTreeAutoscaling: true
          autoscalerOptions:
            version: v2
            upscalingMode: Default
            idleTimeoutSeconds: 600 # 10 minutes
            env:
              - name: AUTOSCALER_MAX_CONCURRENT_LAUNCHES
                value: "100"
        worker:
          groupName: standard-worker
          replicas: 1
          minReplicas: 1
          maxReplicas: 200
          rayStartParams:
            resources: '"{\"standard-worker\": 1}"'
          resources:
            requests:
              cpu: "1"
              memory: "1G"
        additionalWorkerGroups:
          additional-worker-group1:
            image:
              repository: docker.io/rayproject/ray
              tag: latest
              pullPolicy: IfNotPresent
            disabled: false
            replicas: 1
            minReplicas: 1
            maxReplicas: 30
            rayStartParams:
              resources: '"{\"additional-worker-group1\": 1}"'
            resources:
              requests:
                cpu: "1"
                memory: "1G"
          additional-worker-group2:
            image:
              repository: docker.io/rayproject/ray
              tag: latest
              pullPolicy: IfNotPresent
            disabled: false
            replicas: 1
            minReplicas: 1
            maxReplicas: 200
            rayStartParams:
              resources: '"{\"additional-worker-group2\": 1}"'
            resources:
              requests:
                cpu: "1"
                memory: "1G"
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```
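The double-escaped `resources` string in `rayStartParams` can look opaque. As I read the quoting layers: YAML parsing strips the outer single quotes, the remaining double quotes and backslashes are shell quoting consumed when the container command runs, and what finally reaches `ray start --resources` is a plain JSON object mapping a custom resource name to a quantity. A quick sanity check of that final layer (the variable names are illustrative):

```python
import json

# Value as it appears after YAML parsing strips the single quotes:
helm_value = '"{\\"standard-worker\\": 1}"'
# After the shell consumes the double quotes and backslashes,
# ray start --resources sees plain JSON:
cli_value = '{"standard-worker": 1}'
print(json.loads(cli_value))  # {'standard-worker': 1}
```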
Apply the ArgoCD Application:

```shell
kubectl apply -f raycluster.yaml
```

Wait for the RayCluster Application to sync and become healthy. You can check the status using:

```shell
kubectl get application raycluster -n argocd
```

Alternatively, if you have the ArgoCD CLI installed:

```shell
argocd app wait raycluster
```

Verify that the RayCluster is running:

```shell
kubectl get raycluster -n ray-cluster
```

Which will give something like:

```
NAME                 DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
raycluster-kuberay   3                 3                   ...    ...      ...    ready    ...
```

You should see the head pod and worker pods:

```shell
kubectl get pods -n ray-cluster
```

Which gives something like:

```
NAME                                                READY   STATUS    RESTARTS   AGE
kuberay-operator-6c485bc876-28dnl                   1/1     Running   0          11d
raycluster-kuberay-additional-worker-group1-n45rc   1/1     Running   0          5d21h
raycluster-kuberay-additional-worker-group2-b2455   1/1     Running   0          2d18h
raycluster-kuberay-head                             2/2     Running   0          5d21h
raycluster-kuberay-standard-worker-worker-bs8t8     1/1     Running   0          5d21h
```
## Understanding the `ignoreDifferences` Configuration

The `ignoreDifferences` section in the RayCluster Application configuration is critical for proper autoscaling. To determine which fields need to be ignored, inspect the RayCluster resource and identify fields that change dynamically at runtime.

First, describe the RayCluster resource to see its full specification:

```shell
kubectl describe raycluster raycluster-kuberay -n ray-cluster
```

Or, get the resource in YAML format to see the exact field paths:

```shell
kubectl get raycluster raycluster-kuberay -n ray-cluster -o yaml
```
Look for fields that controllers or autoscalers modify. In Ray's case, the autoscaler modifies the `replicas` field under each worker group spec. You'll see output similar to:

```yaml
spec:
  workerGroupSpecs:
    - replicas: 5 # This value changes dynamically
      minReplicas: 1
      maxReplicas: 200
      groupName: standard-worker
      # ...
```

When ArgoCD detects differences between the desired state (in Git) and the actual state (in the cluster), it shows them in the UI or via the CLI:

```shell
argocd app diff raycluster
```

If you see repeated differences in fields that controllers (such as autoscalers) should manage, those fields are candidates for `ignoreDifferences`.
The `ignoreDifferences` section in the RayCluster Application configuration tells ArgoCD which fields to ignore. Without this setting, ArgoCD and the Ray Autoscaler may conflict, resulting in unexpected behavior when requesting workers dynamically (for example, using `ray.autoscaler.sdk.request_resources`). Specifically, when requesting N workers, the Autoscaler might not spin up the expected number because ArgoCD could revert the replica count to the original value defined in the Application manifest.
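For example, a driver can ask the Autoscaler for additional workers of a specific group through `ray.autoscaler.sdk.request_resources`, which accepts a list of resource bundles. A hedged sketch (the helper names are hypothetical; it targets the custom `standard-worker` resource defined in `rayStartParams` above, and the actual call requires a running Ray cluster):

```python
def desired_bundles(group: str, count: int) -> list:
    """One {custom-resource: 1} bundle per requested worker of `group`."""
    return [{group: 1} for _ in range(count)]

def request_workers(group: str, count: int) -> None:
    # Deferred import: requires `ray` installed and a running cluster.
    from ray.autoscaler.sdk import request_resources
    request_resources(bundles=desired_bundles(group, count))

print(desired_bundles("standard-worker", 3))
# Without ignoreDifferences, ArgoCD's selfHeal could revert the replica
# count that the Autoscaler sets in response to such a request.
```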
The recommended approach is to use `jqPathExpressions`, which automatically handles any number of worker groups:
```yaml
ignoreDifferences:
  - group: ray.io
    kind: RayCluster
    name: raycluster-kuberay
    namespace: ray-cluster
    jqPathExpressions:
      - .spec.workerGroupSpecs[].replicas
```
This configuration tells ArgoCD to ignore differences in the `replicas` field for all worker groups. The `jqPathExpressions` field uses jq syntax with the array wildcard (`[]`), so you don't need to update the configuration when adding or removing worker groups.
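To see why the wildcard covers every group, here is a small Python analogue of the jq expression `.spec.workerGroupSpecs[].replicas` applied to a sample spec (the sample values are illustrative):

```python
raycluster = {
    "spec": {
        "workerGroupSpecs": [
            {"groupName": "standard-worker", "replicas": 5},
            {"groupName": "additional-worker-group1", "replicas": 2},
            {"groupName": "additional-worker-group2", "replicas": 7},
        ]
    }
}

# .spec.workerGroupSpecs[].replicas matches the replicas field of every
# element in the array, so ArgoCD ignores all of them -- including groups
# added later, with no configuration change.
ignored = [g["replicas"] for g in raycluster["spec"]["workerGroupSpecs"]]
print(ignored)  # [5, 2, 7]
```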
Note: The `name` and `namespace` must match your RayCluster resource's name and namespace. Verify these values if you've customized them.
### Alternative: Using `jsonPointers`

If you prefer explicit configuration, you can use `jsonPointers` instead:
```yaml
ignoreDifferences:
  - group: ray.io
    kind: RayCluster
    name: raycluster-kuberay
    namespace: ray-cluster
    jsonPointers:
      - /spec/workerGroupSpecs/0/replicas
      - /spec/workerGroupSpecs/1/replicas
      - /spec/workerGroupSpecs/2/replicas
```
With `jsonPointers`, you must explicitly list each worker group by index:

- `/spec/workerGroupSpecs/0/replicas` - First worker group (the default worker group)
- `/spec/workerGroupSpecs/1/replicas` - Second worker group (`additional-worker-group1`)
- `/spec/workerGroupSpecs/2/replicas` - Third worker group (`additional-worker-group2`)

If you add or remove worker groups, you must update this list accordingly. The index corresponds to the order of worker groups in the RayCluster spec, with the default worker group at index 0 and `additionalWorkerGroups` following in the order they are defined.
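By contrast with the jq wildcard, each JSON Pointer names exactly one array index. A minimal resolver in the spirit of RFC 6901 (without escape handling; the sample spec is illustrative) makes the index-based addressing concrete:

```python
def resolve_pointer(doc, pointer: str):
    """Walk a JSON Pointer like /spec/workerGroupSpecs/0/replicas."""
    node = doc
    for part in pointer.strip("/").split("/"):
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node

raycluster = {"spec": {"workerGroupSpecs": [
    {"groupName": "standard-worker", "replicas": 5},
    {"groupName": "additional-worker-group1", "replicas": 2},
]}}

print(resolve_pointer(raycluster, "/spec/workerGroupSpecs/0/replicas"))  # 5
# Adding a third group requires adding /spec/workerGroupSpecs/2/replicas
# to the jsonPointers list by hand.
```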
See the ArgoCD diff customization documentation for more details on both approaches.
By ignoring these differences, ArgoCD allows the Ray Autoscaler to dynamically manage worker replicas without interference.
## Access the Ray Dashboard

To access the Ray Dashboard, port-forward the head service:

```shell
kubectl port-forward -n ray-cluster svc/raycluster-kuberay-head-svc 8265:8265
```

Navigate to `http://localhost:8265` in your browser to view the Ray Dashboard.
## Customize the RayCluster

You can customize the RayCluster configuration by modifying the `valuesObject` section in the `raycluster.yaml` file:

- Change the image `repository` and `tag` to use different Ray versions.
- Add or remove worker groups in the `additionalWorkerGroups` section.
- Adjust `minReplicas`, `maxReplicas`, and `idleTimeoutSeconds` to control autoscaling behavior.
- Use `rayStartParams` to allocate custom resources to worker groups.

After making changes, commit them to your Git repository. ArgoCD automatically syncs the changes to your cluster if automated sync is enabled.
## Deploy Everything at Once

If you prefer to deploy all components at once, you can combine all three ArgoCD Applications into a single file. Create a file named `ray-argocd-all.yaml` with the following content:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ray-operator-crds
  namespace: argocd
spec:
  project: default
  destination:
    server: https://kubernetes.default.svc
    namespace: ray-cluster
  source:
    repoURL: https://github.com/ray-project/kuberay
    targetRevision: v1.5.1 # update this as necessary
    path: helm-chart/kuberay-operator/crds
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - Replace=true
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ray-operator
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/ray-project/kuberay
    targetRevision: v1.5.1 # update this as necessary
    path: helm-chart/kuberay-operator
    helm:
      skipCrds: true # CRDs are installed in the first Application
  destination:
    server: https://kubernetes.default.svc
    namespace: ray-cluster
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: raycluster
  namespace: argocd
spec:
  project: default
  destination:
    server: https://kubernetes.default.svc
    namespace: ray-cluster
  ignoreDifferences:
    - group: ray.io
      kind: RayCluster
      name: raycluster-kuberay # ensure this is aligned with the release name
      namespace: ray-cluster # ensure this is aligned with the namespace
      jqPathExpressions:
        - .spec.workerGroupSpecs[].replicas
  source:
    repoURL: https://ray-project.github.io/kuberay-helm/
    chart: ray-cluster
    targetRevision: "1.5.1"
    helm:
      releaseName: raycluster # this affects the ignoreDifferences field
      valuesObject:
        image:
          repository: docker.io/rayproject/ray
          tag: latest
          pullPolicy: IfNotPresent
        head:
          rayStartParams:
            num-cpus: "0"
          enableInTreeAutoscaling: true
          autoscalerOptions:
            version: v2
            upscalingMode: Default
            idleTimeoutSeconds: 600 # 10 minutes
            env:
              - name: AUTOSCALER_MAX_CONCURRENT_LAUNCHES
                value: "100"
        worker:
          groupName: standard-worker
          replicas: 1
          minReplicas: 1
          maxReplicas: 200
          rayStartParams:
            resources: '"{\"standard-worker\": 1}"'
          resources:
            requests:
              cpu: "1"
              memory: "1G"
        additionalWorkerGroups:
          additional-worker-group1:
            image:
              repository: docker.io/rayproject/ray
              tag: latest
              pullPolicy: IfNotPresent
            disabled: false
            replicas: 1
            minReplicas: 1
            maxReplicas: 30
            rayStartParams:
              resources: '"{\"additional-worker-group1\": 1}"'
            resources:
              requests:
                cpu: "1"
                memory: "1G"
          additional-worker-group2:
            image:
              repository: docker.io/rayproject/ray
              tag: latest
              pullPolicy: IfNotPresent
            disabled: false
            replicas: 1
            minReplicas: 1
            maxReplicas: 200
            rayStartParams:
              resources: '"{\"additional-worker-group2\": 1}"'
            resources:
              requests:
                cpu: "1"
                memory: "1G"
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```
Note that this example uses the `jqPathExpressions` approach.

Apply all three Applications at once:

```shell
kubectl apply -f ray-argocd-all.yaml
```

This single-file approach is convenient for quick deployments, but the step-by-step approach in the earlier sections provides better visibility into the deployment process and makes troubleshooting easier.