.. _system-metrics:

System Metrics
==============

Ray exports a number of system metrics, which provide introspection into the state of Ray workloads, as well as hardware utilization statistics. The following table describes the officially supported metrics:
.. note::

   Certain labels are common across all metrics, such as ``SessionName`` (uniquely identifies a Ray cluster instance), ``instance`` (a per-node label applied by Prometheus), and ``JobId`` (the Ray job ID, where applicable).

   Starting with Ray 2.53, the ``WorkerId`` label is no longer exported by default due to its high cardinality.
   The Ray team doesn't expect this to be a breaking change, because none of Ray's built-in components rely on this label.
   However, if you have custom tooling that depends on the ``WorkerId`` label, take note of this change.

   You can restore or adjust label behavior with the ``RAY_metric_cardinality_level`` environment variable:

   - ``legacy``: Preserve all labels. (This was the default behavior before Ray 2.53.)
   - ``recommended``: Drop high-cardinality labels. Ray internally determines the specific labels; currently this includes only ``WorkerId``. (This is the default behavior since Ray 2.53.)
   - ``low``: Same as ``recommended``, but also drops the ``Name`` label for tasks and actors.
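As a brief sketch of the setting above, you can select a cardinality level by exporting the environment variable before the Ray processes start on a node (the ``ray`` CLI invocation is a standard one; the level must be set on every node whose metrics you want to change):

```shell
# Restore the pre-2.53 label behavior, including the WorkerId label.
# The variable must be set before Ray starts; it has no effect on a running node.
export RAY_metric_cardinality_level=legacy
ray start --head
```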
.. list-table:: Ray System Metrics
   :header-rows: 1

   * - Prometheus Metric
     - Labels
     - Description
   * - ``ray_tasks``
     - ``Name``, ``State``, ``IsRetry``
     - See `rpc::TaskState <https://github.com/ray-project/ray/blob/e85355b9b593742b4f5cb72cab92051980fa73d3/src/ray/protobuf/common.proto#L583>`_ for more information. The function/method name is available as the ``Name`` label. If the task was retried due to failure or reconstruction, the ``IsRetry`` label is set to ``"1"``, otherwise ``"0"``.
   * - ``ray_actors``
     - ``Name``, ``State``
     - See `rpc::ActorTableData::ActorState <https://github.com/ray-project/ray/blob/b3799a53dcabd8d1a4d20f22faa98e781b0059c7/src/ray/protobuf/gcs.proto#L79>`_. ``ALIVE`` has two sub-states: ``ALIVE_IDLE`` and ``ALIVE_RUNNING_TASKS``. An actor is considered ``ALIVE_IDLE`` if it is not running any tasks.
   * - ``ray_resources``
     - ``Name``, ``State``, ``instance``
     -
   * - ``ray_object_store_memory``
     - ``Location``, ``ObjectState``, ``instance``
     -
   * - ``ray_placement_groups``
     - ``State``
     - See `rpc::PlacementGroupTable <https://github.com/ray-project/ray/blob/e85355b9b593742b4f5cb72cab92051980fa73d3/src/ray/protobuf/gcs.proto#L517>`_ for more information.
   * - ``ray_memory_manager_worker_eviction_total``
     - ``Type``, ``Name``
     -
   * - ``ray_node_cpu_utilization``
     - ``instance``
     -
   * - ``ray_node_cpu_count``
     - ``instance``
     -
   * - ``ray_node_gpus_utilization``
     - ``instance``, ``GpuDeviceName``, ``GpuIndex``
     - ``GpuDeviceName`` is the name of a GPU device (e.g., NVIDIA A10G) and ``GpuIndex`` is the index of the GPU.
   * - ``ray_node_disk_usage``
     - ``instance``
     -
   * - ``ray_node_disk_free``
     - ``instance``
     -
   * - ``ray_node_disk_write_iops``
     - ``instance``, ``node_type``
     -
   * - ``ray_node_disk_io_write_speed``
     - ``instance``
     -
   * - ``ray_node_disk_read_iops``
     - ``instance``, ``node_type``
     -
   * - ``ray_node_disk_io_read_speed``
     - ``instance``
     -
   * - ``ray_node_mem_available``
     - ``instance``, ``node_type``
     -
   * - ``ray_node_mem_shared_bytes``
     - ``instance``, ``node_type``
     -
   * - ``ray_node_mem_used``
     - ``instance``
     -
   * - ``ray_node_mem_total``
     - ``instance``
     -
   * - ``ray_component_uss_mb``
     - ``Component``, ``instance``
     -
   * - ``ray_component_cpu_percentage``
     - ``Component``, ``instance``
     -
   * - ``ray_node_gram_available``
     - ``instance``, ``node_type``, ``GpuIndex``, ``GpuDeviceName``
     -
   * - ``ray_node_gram_used``
     - ``instance``, ``GpuDeviceName``, ``GpuIndex``
     -
   * - ``ray_node_network_received``
     - ``instance``, ``node_type``
     -
   * - ``ray_node_network_sent``
     - ``instance``, ``node_type``
     -
   * - ``ray_node_network_receive_speed``
     - ``instance``
     -
   * - ``ray_node_network_send_speed``
     - ``instance``
     -
   * - ``ray_cluster_active_nodes``
     - ``node_type``
     -
   * - ``ray_cluster_failed_nodes``
     - ``node_type``
     -
   * - ``ray_cluster_pending_nodes``
     - ``node_type``
     -

Metrics Semantics and Consistency
---------------------------------
Ray guarantees all its internal state metrics are *eventually* consistent, even in the presence of failures---should any worker fail, the correct state is eventually reflected in the Prometheus time-series output. However, any particular metrics query is not guaranteed to reflect an exact snapshot of the cluster state.
For the ``ray_tasks`` and ``ray_actors`` metrics, you should use sum queries to plot their outputs (e.g., ``sum(ray_tasks) by (Name, State)``). The reason for this is that Ray's task metrics are emitted from multiple distributed components. Hence, there are multiple metric points, including negative metric points, emitted from different processes that must be summed to produce the correct logical view of the distributed system. For example, for a single task submitted and executed, Ray may emit ``(submitter) SUBMITTED_TO_WORKER: 1, (executor) SUBMITTED_TO_WORKER: -1, (executor) RUNNING: 1``, which reduces to ``SUBMITTED_TO_WORKER: 0, RUNNING: 1`` after summation.
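The reduction in the example above can be illustrated with a small sketch (the metric points are the hypothetical ones from the example, not real Ray output): summing the signed points emitted by the submitter and executor processes, as the PromQL ``sum(...) by (State)`` aggregation does, yields the correct logical task counts.

```python
from collections import Counter

# Hypothetical metric points for one submitted-and-executed task:
# the submitter increments SUBMITTED_TO_WORKER, then the executor
# decrements it and increments RUNNING when the task starts.
points = [
    ("SUBMITTED_TO_WORKER", 1),   # emitted by the submitter process
    ("SUBMITTED_TO_WORKER", -1),  # emitted by the executor process
    ("RUNNING", 1),               # emitted by the executor process
]

# Sum the points by state, mirroring `sum(ray_tasks) by (State)`.
totals = Counter()
for state, value in points:
    totals[state] += value

print(totals["SUBMITTED_TO_WORKER"])  # 0
print(totals["RUNNING"])              # 1
```

Looking at any single process's points in isolation would show a misleading view (e.g., a negative ``SUBMITTED_TO_WORKER`` count on the executor), which is why the sum query is required.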