(kuberay-metrics-references)=

KubeRay metrics references

controller-runtime metrics

KubeRay exposes metrics provided by kubernetes-sigs/controller-runtime, including information about reconciliation, work queues, and more, to help users operate the KubeRay operator in production environments.

For more details about the default metrics provided by kubernetes-sigs/controller-runtime, see Default Exported Metrics References.
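As a rough illustration of how the controller-runtime metrics can be used, the sketch below sums `controller_runtime_reconcile_errors_total` across all controllers from metrics text read on standard input. The function name `reconcile_error_total` is hypothetical, and the sketch assumes each sample line ends with its value (no trailing timestamp).

```sh
# Sum reconcile errors across all controllers from Prometheus text on stdin.
# Assumes the sample value is the last field on each line (no timestamp).
reconcile_error_total() {
  awk '
    # Match only sample lines for this metric; "# HELP"/"# TYPE" lines start with "#".
    /^controller_runtime_reconcile_errors_total[{ ]/ { sum += $NF }
    END { print sum + 0 }
  '
}

# Usage, assuming the operator metrics endpoint is reachable on localhost:8080:
#   curl -s localhost:8080/metrics | reconcile_error_total
```

A steadily growing total suggests that one or more controllers are failing to reconcile; the `controller` label on each series shows which one.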

KubeRay custom metrics

Starting with KubeRay 1.4.0, KubeRay provides metrics for its custom resources to help users better understand Ray clusters and Ray applications.

You can view these metrics by following the instructions below:

```sh
# Forward a local port to the KubeRay operator service.
kubectl port-forward service/kuberay-operator 8080

# View the metrics.
curl localhost:8080/metrics

# You should see metrics like the following if a RayCluster already exists:
# kuberay_cluster_info{name="raycluster-kuberay",namespace="default",owner_kind="None"} 1
```
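The endpoint also returns the controller-runtime and Go runtime metrics, so it can help to keep only the KubeRay custom ones. The following is a minimal sketch; the helper name `kuberay_custom_metrics` is hypothetical, and it relies only on the `kuberay_` prefix shared by the metrics documented on this page.

```sh
# Keep only KubeRay custom metric samples from Prometheus text read on stdin.
# Comment lines ("# HELP", "# TYPE") start with "#" and are dropped as well.
kuberay_custom_metrics() {
  # grep exits non-zero when nothing matches; "|| true" keeps pipelines alive.
  grep '^kuberay_' || true
}

# Usage, assuming the port-forward is running in another terminal:
#   curl -s localhost:8080/metrics | kuberay_custom_metrics
```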

RayCluster metrics

| Metric name | Type | Description | Labels |
|---|---|---|---|
| `kuberay_cluster_info` | Gauge | Metadata information about RayCluster custom resources. | `namespace`: &lt;RayCluster-namespace&gt;<br/>`name`: &lt;RayCluster-name&gt;<br/>`owner_kind`: &lt;RayJob\|RayService\|None&gt;<br/>`uid`: &lt;RayCluster-uid&gt; |
| `kuberay_cluster_condition_provisioned` | Gauge | Indicates whether the RayCluster is provisioned. See `RayClusterProvisioned` for more information. | `namespace`: &lt;RayCluster-namespace&gt;<br/>`name`: &lt;RayCluster-name&gt;<br/>`condition`: &lt;true\|false&gt;<br/>`uid`: &lt;RayCluster-uid&gt; |
| `kuberay_cluster_provisioned_duration_seconds` | Gauge | The time, in seconds, that a RayCluster's `RayClusterProvisioned` condition takes to transition from false (or unset) to true. | `namespace`: &lt;RayCluster-namespace&gt;<br/>`name`: &lt;RayCluster-name&gt;<br/>`uid`: &lt;RayCluster-uid&gt; |
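As an example of reading these labels, the sketch below lists RayClusters whose `kuberay_cluster_condition_provisioned` series carries `condition="false"`. The function name `unprovisioned_clusters` is hypothetical. The sketch looks only at the `condition` label and ignores the sample value, which is an assumption about how the gauge encodes its state.

```sh
# Print "<namespace>/<name>" for RayClusters reported as not provisioned.
# Reads Prometheus text from stdin; label order on the line doesn't matter.
unprovisioned_clusters() {
  awk '
    # Return the value of label `key` on the current line, or "" if absent.
    function label(key,    pat) {
      pat = "[{,]" key "=\"[^\"]*\""
      if (!match($0, pat)) return ""
      # Skip the delimiter, the key, and the opening quote; drop the closing quote.
      return substr($0, RSTART + length(key) + 3, RLENGTH - length(key) - 4)
    }
    /^kuberay_cluster_condition_provisioned[{]/ && /condition="false"/ {
      print label("namespace") "/" label("name")
    }
  '
}

# Usage, assuming the operator metrics endpoint is reachable on localhost:8080:
#   curl -s localhost:8080/metrics | unprovisioned_clusters
```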

RayService metrics

| Metric name | Type | Description | Labels |
|---|---|---|---|
| `kuberay_service_info` | Gauge | Metadata information about RayService custom resources. | `namespace`: &lt;RayService-namespace&gt;<br/>`name`: &lt;RayService-name&gt;<br/>`uid`: &lt;RayService-uid&gt; |
| `kuberay_service_condition_ready` | Gauge | Describes whether the RayService is ready. Ready means users can send requests to the underlying cluster and the number of serve endpoints is greater than 0. See `RayServiceReady` for more information. | `namespace`: &lt;RayService-namespace&gt;<br/>`name`: &lt;RayService-name&gt;<br/>`uid`: &lt;RayService-uid&gt; |
| `kuberay_service_condition_upgrade_in_progress` | Gauge | Describes whether the RayService is performing a zero-downtime upgrade. See `UpgradeInProgress` for more information. | `namespace`: &lt;RayService-namespace&gt;<br/>`name`: &lt;RayService-name&gt;<br/>`uid`: &lt;RayService-uid&gt; |

RayJob metrics

| Metric name | Type | Description | Labels |
|---|---|---|---|
| `kuberay_job_info` | Gauge | Metadata information about RayJob custom resources. | `namespace`: &lt;RayJob-namespace&gt;<br/>`name`: &lt;RayJob-name&gt;<br/>`uid`: &lt;RayJob-uid&gt; |
| `kuberay_job_deployment_status` | Gauge | The RayJob's current deployment status. | `namespace`: &lt;RayJob-namespace&gt;<br/>`name`: &lt;RayJob-name&gt;<br/>`deployment_status`: &lt;New\|Initializing\|Running\|Complete\|Failed\|Suspending\|Suspended\|Retrying\|Waiting&gt;<br/>`uid`: &lt;RayJob-uid&gt; |
| `kuberay_job_execution_duration_seconds` | Gauge | Duration of the RayJob CR’s `JobDeploymentStatus` transition from `Initializing` to either the `Retrying` state or a terminal state, such as `Complete` or `Failed`. The `Retrying` state indicates that the CR previously failed and that `spec.backoffLimit` is enabled. | `namespace`: &lt;RayJob-namespace&gt;<br/>`name`: &lt;RayJob-name&gt;<br/>`job_deployment_status`: &lt;Complete\|Failed&gt;<br/>`retry_count`: &lt;count&gt;<br/>`uid`: &lt;RayJob-uid&gt; |
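To get a quick overview of RayJobs, the `deployment_status` label can be tallied. The sketch below counts `kuberay_job_deployment_status` series per status. The function name `rayjobs_by_status` is hypothetical, and the count is of series rather than sample values, so it assumes each RayJob exposes one series for its current status.

```sh
# Count kuberay_job_deployment_status series per deployment_status label.
# Reads Prometheus text from stdin and prints "<status> <count>", sorted by status.
rayjobs_by_status() {
  awk '
    /^kuberay_job_deployment_status[{]/ {
      # Require "{" or "," before the key so only the exact label name matches.
      if (match($0, /[{,]deployment_status="[^"]*"/)) {
        # 20 = delimiter + length("deployment_status") + length("=\""); 21 adds the closing quote.
        status = substr($0, RSTART + 20, RLENGTH - 21)
        count[status]++
      }
    }
    END { for (s in count) print s, count[s] }
  ' | sort
}

# Usage, assuming the operator metrics endpoint is reachable on localhost:8080:
#   curl -s localhost:8080/metrics | rayjobs_by_status
```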