(kuberay-metrics-references)=
## controller-runtime metrics

KubeRay exposes metrics provided by kubernetes-sigs/controller-runtime, including information about reconciliation, work queues, and more, to help users operate the KubeRay operator in production environments.
For more details about the default metrics provided by kubernetes-sigs/controller-runtime, see Default Exported Metrics References.
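As a rough illustration of reading the reconciliation metrics, the sketch below lists controllers whose reconcile error counter is above zero. It assumes the standard controller-runtime counter `controller_runtime_reconcile_errors_total` with a `controller` label. The helper name `controllers_with_errors`, the controller names, and the sample values are hypothetical.

```shell
# Hypothetical helper: read Prometheus text-format metrics on stdin and print
# "<controller> <error count>" for every controller with a non-zero
# controller_runtime_reconcile_errors_total counter.
controllers_with_errors() {
  awk '/^controller_runtime_reconcile_errors_total\{/ && $NF + 0 > 0 {
    if (match($0, /controller="[^"]*"/))
      print substr($0, RSTART + 12, RLENGTH - 13), $NF
  }'
}

# Illustrative sample only; controller names and values are made up.
sample='controller_runtime_reconcile_errors_total{controller="raycluster"} 3
controller_runtime_reconcile_errors_total{controller="rayjob"} 0
controller_runtime_reconcile_total{controller="raycluster",result="success"} 57'

printf '%s\n' "$sample" | controllers_with_errors
```

Against a live operator, the same function could read from `curl -s localhost:8080/metrics` once the port-forward is running.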
Starting with KubeRay 1.4.0, KubeRay provides metrics for its custom resources to help users better understand Ray clusters and Ray applications.
You can view these metrics by following the instructions below:
```sh
# Forward a local port to the KubeRay operator service.
kubectl port-forward service/kuberay-operator 8080

# View the metrics.
curl localhost:8080/metrics

# You should see metrics like the following if a RayCluster already exists:
# kuberay_cluster_info{name="raycluster-kuberay",namespace="default",owner_kind="None"} 1
```
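
The `/metrics` endpoint also returns series from other sources, so it can help to narrow the output to the KubeRay custom resource metrics. The following is a minimal sketch, assuming those metrics all share the `kuberay_` prefix, as the ones in this reference do. The helper name `kuberay_metrics` and the sample text are illustrative, not part of KubeRay.

```shell
# Hypothetical helper: keep only series whose name starts with "kuberay_",
# dropping "# HELP" / "# TYPE" comment lines and metrics from other sources.
kuberay_metrics() {
  grep '^kuberay_'
}

# Illustrative sample of /metrics output; real output is much longer.
sample='# TYPE kuberay_cluster_info gauge
kuberay_cluster_info{name="raycluster-kuberay",namespace="default",owner_kind="None"} 1
workqueue_depth{name="raycluster"} 0'

printf '%s\n' "$sample" | kuberay_metrics
```

With the port-forward active, the equivalent live command would be `curl -s localhost:8080/metrics | kuberay_metrics`.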

| Metric name | Type | Description | Labels |
|---|---|---|---|
| `kuberay_cluster_info` | Gauge | Metadata information about RayCluster custom resources. | `namespace`: `<RayCluster-namespace>`<br/>`name`: `<RayCluster-name>`<br/>`owner_kind`: `<RayJob\|RayService\|None>`<br/>`uid`: `<RayCluster-uid>` |
| `kuberay_cluster_condition_provisioned` | Gauge | Indicates whether the RayCluster is provisioned. See `RayClusterProvisioned` for more information. | `namespace`: `<RayCluster-namespace>`<br/>`name`: `<RayCluster-name>`<br/>`condition`: `<true\|false>`<br/>`uid`: `<RayCluster-uid>` |
| `kuberay_cluster_provisioned_duration_seconds` | Gauge | The time, in seconds, that a RayCluster's `RayClusterProvisioned` condition takes to transition from false (or unset) to true. | `namespace`: `<RayCluster-namespace>`<br/>`name`: `<RayCluster-name>`<br/>`uid`: `<RayCluster-uid>` |

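
To show how the labels in this table can be used when reading raw output, the sketch below prints each RayCluster's provisioning time as `<namespace>/<name> <seconds>`. It assumes the Prometheus text exposition format with no trailing timestamp. The helper name `provisioned_durations`, the `uid` value, and the duration are made up for illustration.

```shell
# Hypothetical helper: read Prometheus text-format metrics on stdin and print
# "<namespace>/<name> <seconds>" for each provisioned-duration series.
provisioned_durations() {
  awk '/^kuberay_cluster_provisioned_duration_seconds\{/ {
    ns = ""; name = ""
    # Extract the namespace label value.
    if (match($0, /namespace="[^"]*"/)) ns = substr($0, RSTART + 11, RLENGTH - 12)
    # Require "{" or "," before "name" so other labels ending in "name" do not match.
    if (match($0, /[{,]name="[^"]*"/)) name = substr($0, RSTART + 7, RLENGTH - 8)
    print ns "/" name, $NF
  }'
}

# Illustrative sample only; the uid and the value are placeholders.
sample='kuberay_cluster_provisioned_duration_seconds{name="raycluster-kuberay",namespace="default",uid="example-uid"} 42.5'

printf '%s\n' "$sample" | provisioned_durations
```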

| Metric name | Type | Description | Labels |
|---|---|---|---|
| `kuberay_service_info` | Gauge | Metadata information about RayService custom resources. | `namespace`: `<RayService-namespace>`<br/>`name`: `<RayService-name>`<br/>`uid`: `<RayService-uid>` |
| `kuberay_service_condition_ready` | Gauge | Describes whether the RayService is ready. Ready means users can send requests to the underlying cluster and the number of Serve endpoints is greater than 0. See `RayServiceReady` for more information. | `namespace`: `<RayService-namespace>`<br/>`name`: `<RayService-name>`<br/>`uid`: `<RayService-uid>` |
| `kuberay_service_condition_upgrade_in_progress` | Gauge | Describes whether the RayService is performing a zero-downtime upgrade. See `UpgradeInProgress` for more information. | `namespace`: `<RayService-namespace>`<br/>`name`: `<RayService-name>`<br/>`uid`: `<RayService-uid>` |


| Metric name | Type | Description | Labels |
|---|---|---|---|
| `kuberay_job_info` | Gauge | Metadata information about RayJob custom resources. | `namespace`: `<RayJob-namespace>`<br/>`name`: `<RayJob-name>`<br/>`uid`: `<RayJob-uid>` |
| `kuberay_job_deployment_status` | Gauge | The RayJob's current deployment status. | `namespace`: `<RayJob-namespace>`<br/>`name`: `<RayJob-name>`<br/>`deployment_status`: `<New\|Initializing\|Running\|Complete\|Failed\|Suspending\|Suspended\|Retrying\|Waiting>`<br/>`uid`: `<RayJob-uid>` |
| `kuberay_job_execution_duration_seconds` | Gauge | Duration of the RayJob CR's `JobDeploymentStatus` transition from `Initializing` to either the `Retrying` state or a terminal state, such as `Complete` or `Failed`. The `Retrying` state indicates that the CR previously failed and that `spec.backoffLimit` is enabled. | `namespace`: `<RayJob-namespace>`<br/>`name`: `<RayJob-name>`<br/>`job_deployment_status`: `<Complete\|Failed>`<br/>`retry_count`: `<count>`<br/>`uid`: `<RayJob-uid>` |
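
As one possible way to use the `job_deployment_status` and `retry_count` labels, the sketch below lists the execution duration of RayJobs that ended in `Failed`. It assumes the Prometheus text exposition format with no trailing timestamp. The helper name `failed_job_durations`, the job names, the `uid` values, and the durations are hypothetical.

```shell
# Hypothetical helper: read Prometheus text-format metrics on stdin and print
# "<namespace>/<name> retry=<retry_count> <seconds>s" for every
# kuberay_job_execution_duration_seconds series labeled as Failed.
failed_job_durations() {
  awk '
    # Return the value of label "key" from one text-format metric line.
    function label(line, key,    re) {
      re = "[{,]" key "=\"[^\"]*\""
      if (!match(line, re)) return ""
      return substr(line, RSTART + length(key) + 3, RLENGTH - length(key) - 4)
    }
    /^kuberay_job_execution_duration_seconds\{/ && label($0, "job_deployment_status") == "Failed" {
      print label($0, "namespace") "/" label($0, "name"), "retry=" label($0, "retry_count"), $NF "s"
    }'
}

# Illustrative sample only; names, uids, and values are placeholders.
sample='kuberay_job_execution_duration_seconds{job_deployment_status="Complete",name="job-a",namespace="default",retry_count="0",uid="uid-a"} 120
kuberay_job_execution_duration_seconds{job_deployment_status="Failed",name="job-b",namespace="default",retry_count="2",uid="uid-b"} 35.5'

printf '%s\n' "$sample" | failed_job_durations
```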