.. _metric-exporter:
This document is based on upstream/master at commit 05e7efd5 (2025-12-17).
Ray's metric exporting infrastructure collects metrics from C++ components (raylet, GCS, workers) and Python components, aggregates them, and exports them to Prometheus. This document explains how metrics flow through the system from registration to final export.
Ray's metric system uses a multi-stage pipeline. The following diagram shows the high-level flow:
.. code-block:: text

    C++ Components (raylet, GCS, workers)
        ↓ (Record metrics via Metric::Record)
    OpenTelemetryMetricRecorder (C++)
        ↓ (OTLP gRPC export)
    Metrics Agent (Python - ReporterAgent)
        ↓ (Aggregate & process)
    OpenTelemetryMetricRecorder (Python)
        ↓ (Prometheus format)
    Prometheus Server

Ray's C++ components register and record metrics through the `OpenTelemetryMetricRecorder <https://github.com/ray-project/ray/blob/05e7efd5ef71dca7a396e6b5f15c8ff16960c5db/src/ray/observability/open_telemetry_metric_recorder.h>`__ singleton. The recorder supports four metric types: Gauge, Counter, Sum, and Histogram.
Metric Types
~~~~~~~~~~~~
- **Gauge**: Represents a current value that can go up or down (e.g., number of running tasks)
- **Counter**: A cumulative metric that only increases (e.g., total tasks submitted)
- **Sum (UpDownCounter)**: A cumulative metric that can increase or decrease (e.g., number of objects in object store)
- **Histogram**: Tracks the distribution of values over time (e.g., task execution time)
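
The differences between the four types can be illustrated with a minimal Python sketch. The classes below are hypothetical toy models written for this page; they are not Ray's or OpenTelemetry's API and only mirror the behavior described in the list above.

.. code-block:: python

    import bisect


    class Gauge:
        """Holds the most recent value; it may go up or down."""

        def __init__(self):
            self.value = 0.0

        def set(self, value):
            self.value = value


    class Counter:
        """Cumulative total that only increases."""

        def __init__(self):
            self.total = 0.0

        def add(self, amount):
            if amount < 0:
                raise ValueError("a counter cannot decrease")
            self.total += amount


    class UpDownCounter:
        """Cumulative total that accepts positive and negative changes."""

        def __init__(self):
            self.total = 0.0

        def add(self, amount):
            self.total += amount


    class Histogram:
        """Counts observations per bucket, given explicit upper boundaries."""

        def __init__(self, boundaries):
            self.boundaries = sorted(boundaries)
            # One bucket per boundary plus a final overflow bucket.
            self.counts = [0] * (len(self.boundaries) + 1)

        def record(self, value):
            # A value equal to a boundary lands in that boundary's bucket.
            self.counts[bisect.bisect_left(self.boundaries, value)] += 1


    running_tasks = Gauge()
    running_tasks.set(5)
    running_tasks.set(3)  # a gauge keeps only the latest value

    submitted = Counter()
    submitted.add(2)
    submitted.add(1)

    stored_objects = UpDownCounter()
    stored_objects.add(4)
    stored_objects.add(-1)  # an up-down counter may decrease

    latency = Histogram([1, 10])
    for seconds in (0.5, 5, 50):
        latency.record(seconds)
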
Registration Process
~~~~~~~~~~~~~~~~~~~~
Metrics are registered lazily on first use. The `OpenTelemetryMetricRecorder` uses a singleton pattern accessible via `GetInstance() <https://github.com/ray-project/ray/blob/05e7efd5ef71dca7a396e6b5f15c8ff16960c5db/src/ray/observability/open_telemetry_metric_recorder.cc#L78>`__. When a metric is first recorded, it's automatically registered if it hasn't been registered already.
Registration methods (defined in `open_telemetry_metric_recorder.cc <https://github.com/ray-project/ray/blob/05e7efd5ef71dca7a396e6b5f15c8ff16960c5db/src/ray/observability/open_telemetry_metric_recorder.cc>`__):

- `RegisterGaugeMetric() <https://github.com/ray-project/ray/blob/05e7efd5ef71dca7a396e6b5f15c8ff16960c5db/src/ray/observability/open_telemetry_metric_recorder.cc#L164>`__: Registers an observable gauge with a callback
- `RegisterCounterMetric() <https://github.com/ray-project/ray/blob/05e7efd5ef71dca7a396e6b5f15c8ff16960c5db/src/ray/observability/open_telemetry_metric_recorder.cc#L203>`__: Registers a synchronous counter
- `RegisterSumMetric() <https://github.com/ray-project/ray/blob/05e7efd5ef71dca7a396e6b5f15c8ff16960c5db/src/ray/observability/open_telemetry_metric_recorder.cc#L216>`__: Registers a synchronous up-down counter
- `RegisterHistogramMetric() <https://github.com/ray-project/ray/blob/05e7efd5ef71dca7a396e6b5f15c8ff16960c5db/src/ray/observability/open_telemetry_metric_recorder.cc#L229>`__: Registers a histogram with explicit bucket boundaries

Recording Mechanisms
~~~~~~~~~~~~~~~~~~~~
Ray uses two different recording mechanisms depending on the metric type:
**Observable Metrics (Gauges)**
Observable gauges store values in an intermediate map (`observations_by_name_`) until collection time. When you call `SetMetricValue() <https://github.com/ray-project/ray/blob/05e7efd5ef71dca7a396e6b5f15c8ff16960c5db/src/ray/observability/open_telemetry_metric_recorder.cc#L269>`__ for a gauge, the value is stored with its tags. During export, a callback function (`DoubleGaugeCallback <https://github.com/ray-project/ray/blob/52ed7e3/src/ray/observability/open_telemetry_metric_recorder.cc#L42>`__) is invoked by the OpenTelemetry SDK, which collects all stored values and clears the map to prevent stale data. The callback implementation is in `CollectGaugeMetricValues() <https://github.com/ray-project/ray/blob/05e7efd5ef71dca7a396e6b5f15c8ff16960c5db/src/ray/observability/open_telemetry_metric_recorder.cc#L150>`__.
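
The store-then-collect pattern can be sketched in Python as follows. This is an illustrative model rather than the C++ implementation: the class name `GaugeObservations` and its methods are hypothetical, and clearing one metric at a time is an assumption of the sketch.

.. code-block:: python

    import threading


    class GaugeObservations:
        """Toy model of an observations map that is drained at collection time."""

        def __init__(self):
            self._lock = threading.Lock()
            # metric name -> {sorted tag tuple -> latest value}
            self._observations_by_name = {}

        def set_value(self, name, tags, value):
            # Tags are normalized to a sorted tuple so they can serve as a dict key.
            key = tuple(sorted(tags.items()))
            with self._lock:
                self._observations_by_name.setdefault(name, {})[key] = value

        def collect(self, name):
            """Return and remove every stored value for one metric."""
            with self._lock:
                return self._observations_by_name.pop(name, {})


    store = GaugeObservations()
    store.set_value("running_tasks", {"node": "a"}, 5)
    store.set_value("running_tasks", {"node": "a"}, 3)  # overwrites the earlier value
    store.set_value("running_tasks", {"node": "b"}, 7)

    first_export = store.collect("running_tasks")
    second_export = store.collect("running_tasks")  # empty: nothing stale remains

Because the map is emptied on every collection, a gauge that is no longer being set stops appearing in later exports instead of repeating its last value.
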
**Synchronous Metrics (Counters, Sums, Histograms)**
Synchronous metrics record values directly to their instruments without intermediate storage. When you call `SetMetricValue() <https://github.com/ray-project/ray/blob/05e7efd5ef71dca7a396e6b5f15c8ff16960c5db/src/ray/observability/open_telemetry_metric_recorder.cc#L269>`__ for these types, the value is immediately added to the counter or recorded in the histogram via `SetSynchronousMetricValue() <https://github.com/ray-project/ray/blob/05e7efd5ef71dca7a396e6b5f15c8ff16960c5db/src/ray/observability/open_telemetry_metric_recorder.cc#L292>`__.
Key Implementation Details
~~~~~~~~~~~~~~~~~~~~~~~~~~
- The recorder uses a mutex (`mutex_`) to protect the observations map and registered instruments
- See `RegisterGaugeMetric() <https://github.com/ray-project/ray/blob/05e7efd5ef71dca7a396e6b5f15c8ff16960c5db/src/ray/observability/open_telemetry_metric_recorder.cc#L183-L195>`__ for details of gauge registration

C++ components record metrics through the `Metric::Record() <https://github.com/ray-project/ray/blob/05e7efd5ef71dca7a396e6b5f15c8ff16960c5db/src/ray/stats/metric.cc#L111>`__ method, which forwards to `OpenTelemetryMetricRecorder::SetMetricValue() <https://github.com/ray-project/ray/blob/05e7efd5ef71dca7a396e6b5f15c8ff16960c5db/src/ray/stats/metric.cc#L135>`__.
C++ components export metrics to the metrics agent using the OpenTelemetry Protocol (OTLP) over gRPC. The export process is configured when the recorder is started.
OpenTelemetry SDK Integration
The `OpenTelemetryMetricRecorder` initializes the OpenTelemetry SDK in its `constructor <https://github.com/ray-project/ray/blob/05e7efd5ef71dca7a396e6b5f15c8ff16960c5db/src/ray/observability/open_telemetry_metric_recorder.cc#L129>`__ and `Start() <https://github.com/ray-project/ray/blob/05e7efd5ef71dca7a396e6b5f15c8ff16960c5db/src/ray/observability/open_telemetry_metric_recorder.cc#L87>`__ method with:
- **MeterProvider**: Manages meter instances and metric readers
- **PeriodicExportingMetricReader**: Collects metrics at regular intervals and exports them
- **OTLP gRPC Exporter**: Sends metrics to the metrics agent endpoint
Export Configuration
~~~~~~~~~~~~~~~~~~~~~
When `Start() <https://github.com/ray-project/ray/blob/05e7efd5ef71dca7a396e6b5f15c8ff16960c5db/src/ray/observability/open_telemetry_metric_recorder.cc#L87>`__ is called, the recorder configures:
- **Endpoint**: The metrics agent's gRPC address (typically `127.0.0.1:port`)
- **Export Interval**: How often metrics are collected and exported (configurable)
- **Export Timeout**: Maximum time to wait for export completion
- **Aggregation Temporality**: Set to delta mode to prevent double-counting (see `exporter_options.aggregation_temporality <https://github.com/ray-project/ray/blob/05e7efd5ef71dca7a396e6b5f15c8ff16960c5db/src/ray/observability/open_telemetry_metric_recorder.cc#L97>`__)
Delta Aggregation Temporality
Ray uses delta aggregation temporality, which means only the changes since the last export are sent. This is important because the metrics agent accumulates metrics, and re-accumulating them during export would lead to double-counting.
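
A small arithmetic sketch shows why this matters. The `accumulate` function below is a hypothetical stand-in for any receiver that adds each exported value to a running total; it is not the metrics agent's actual code.

.. code-block:: python

    def accumulate(exports):
        """Model a receiver that adds every exported value to a running total."""
        total = 0
        for value in exports:
            total += value
        return total


    # A counter is incremented by 3, then 2, then 5 across three export intervals.
    increments = [3, 2, 5]

    # Delta temporality: each export carries only the change since the last export.
    delta_exports = list(increments)

    # Cumulative temporality: each export carries the running total so far.
    cumulative_exports = []
    running = 0
    for increment in increments:
        running += increment
        cumulative_exports.append(running)

    delta_total = accumulate(delta_exports)
    cumulative_total = accumulate(cumulative_exports)

With delta exports the receiver arrives at the true total of 10. With cumulative exports (3, 5, 10) the same receiver would reach 18, because earlier increments are counted again in every later export.
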
Export Process
~~~~~~~~~~~~~~
During each export interval:
1. **Observable Gauges**: The OpenTelemetry SDK invokes registered callbacks, which collect values from `observations_by_name_` and clear the map
2. **Synchronous Metrics**: Values are read directly from the instruments
3. **OTLP Format**: Metrics are converted to OTLP format
4. **gRPC Export**: Metrics are sent to the metrics agent via gRPC
Metric Reception and Processing (Python Side)
----------------------------------------------
The metrics agent (ReporterAgent) receives metrics from C++ components via a gRPC service that implements the OpenTelemetry Metrics Service interface.
gRPC Service Implementation
~~~~~~~~~~~~~~~~~~~~~~~~~~~
The `ReporterAgent <https://github.com/ray-project/ray/blob/05e7efd5ef71dca7a396e6b5f15c8ff16960c5db/python/ray/dashboard/modules/reporter/reporter_agent.py>`__ class implements `MetricsServiceServicer`, which provides the `Export() <https://github.com/ray-project/ray/blob/05e7efd5ef71dca7a396e6b5f15c8ff16960c5db/python/ray/dashboard/modules/reporter/reporter_agent.py#L662>`__ method. This method receives `ExportMetricsServiceRequest` messages containing OTLP-formatted metrics from C++ components.
Metric Processing
~~~~~~~~~~~~~~~~~
When metrics are received, the `Export()` method processes them in the following structure:
- **Resource Metrics**: Top-level container for metrics from a specific resource (e.g., a raylet process)
- **Scope Metrics**: Groups metrics by instrumentation scope
- **Metrics**: Individual metric data points
The method routes metrics to appropriate handlers based on their type:
- **Histogram Metrics**: Processed by `_export_histogram_data() <https://github.com/ray-project/ray/blob/05e7efd5ef71dca7a396e6b5f15c8ff16960c5db/python/ray/dashboard/modules/reporter/reporter_agent.py#L577>`__
- **Number Metrics** (Gauge, Counter, Sum): Processed by `_export_number_data() <https://github.com/ray-project/ray/blob/05e7efd5ef71dca7a396e6b5f15c8ff16960c5db/python/ray/dashboard/modules/reporter/reporter_agent.py#L628>`__
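
The nested traversal and type-based routing can be sketched as below. The dataclasses are simplified stand-ins for the OTLP messages, the `kind` field is a hypothetical replacement for the protobuf data field, and the two handler arguments are placeholders for `_export_histogram_data()` and `_export_number_data()`.

.. code-block:: python

    from dataclasses import dataclass, field


    @dataclass
    class Metric:
        name: str
        kind: str  # "gauge", "sum", or "histogram" in this sketch
        data_points: list = field(default_factory=list)


    @dataclass
    class ScopeMetrics:
        metrics: list


    @dataclass
    class ResourceMetrics:
        scope_metrics: list


    def export(resource_metrics_list, handle_histogram, handle_number):
        """Walk resource -> scope -> metric and route each metric by its type."""
        for resource_metrics in resource_metrics_list:
            for scope_metrics in resource_metrics.scope_metrics:
                for metric in scope_metrics.metrics:
                    if metric.kind == "histogram":
                        handle_histogram(metric)
                    else:
                        handle_number(metric)


    histograms, numbers = [], []
    request = [
        ResourceMetrics(
            scope_metrics=[
                ScopeMetrics(
                    metrics=[
                        Metric("task_latency", "histogram"),
                        Metric("running_tasks", "gauge"),
                        Metric("tasks_submitted", "sum"),
                    ]
                )
            ]
        )
    ]
    export(
        request,
        handle_histogram=lambda m: histograms.append(m.name),
        handle_number=lambda m: numbers.append(m.name),
    )
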
For histogram metrics, the metrics agent receives pre-aggregated OTLP bucket counts. The system reconstructs observations from bucket midpoints and records them with a single batch call to reduce lock contention.
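
The midpoint reconstruction can be sketched as follows. The function names are hypothetical, and the handling of the two edge buckets is an assumption of this sketch: the first bucket is treated as starting at 0, and the overflow bucket is represented by the last boundary. Ray's actual choices for those buckets may differ.

.. code-block:: python

    def bucket_midpoints(boundaries):
        """Return one representative value per bucket, including the overflow bucket."""
        midpoints = []
        lower = 0.0  # assumption: the first bucket starts at 0
        for upper in boundaries:
            midpoints.append((lower + upper) / 2)
            lower = upper
        # Assumption: the unbounded overflow bucket is represented by the last boundary.
        midpoints.append(float(boundaries[-1]))
        return midpoints


    def reconstruct_observations(boundaries, bucket_counts):
        """Expand per-bucket counts into a flat list of representative observations."""
        midpoints = bucket_midpoints(boundaries)
        if len(bucket_counts) != len(midpoints):
            raise ValueError("expected one count per bucket, including overflow")
        observations = []
        for midpoint, count in zip(midpoints, bucket_counts):
            observations.extend([midpoint] * count)
        return observations


    midpoints = bucket_midpoints([1, 10, 100])
    observations = reconstruct_observations([1, 10, 100], [2, 1, 0, 1])

The reconstructed list can then be passed to a histogram in a single batch, which is the point of the batching mentioned above: one lock acquisition instead of one per observation.
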
Conversion to internal format
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The metrics agent converts OTLP format to Ray's internal metric representation and forwards them to the Python `OpenTelemetryMetricRecorder <https://github.com/ray-project/ray/blob/05e7efd5ef71dca7a396e6b5f15c8ff16960c5db/python/ray/_private/telemetry/open_telemetry_metric_recorder.py>`__ for further processing and aggregation.
The Python `OpenTelemetryMetricRecorder <https://github.com/ray-project/ray/blob/05e7efd5ef71dca7a396e6b5f15c8ff16960c5db/python/ray/_private/telemetry/open_telemetry_metric_recorder.py>`__ handles final aggregation and cardinality reduction before exporting to Prometheus. This step is crucial for managing metric cardinality and preventing metric explosion.
OpenTelemetryMetricRecorder (Python)
The Python recorder (defined in `open_telemetry_metric_recorder.py <https://github.com/ray-project/ray/blob/05e7efd5ef71dca7a396e6b5f15c8ff16960c5db/python/ray/_private/telemetry/open_telemetry_metric_recorder.py>`__) has a similar structure to the C++ version but uses the Prometheus exporter instead of OTLP. It maintains:
- **Registered Instruments**: Maps metric names to OpenTelemetry instruments
- **Observations Maps**: Stores gauge, counter, and sum observations (with their tag sets) until collection.
- **Histogram Bucket Midpoints**: Pre-calculated midpoints for histogram bucket conversion when reconstructing observations from OTLP bucket counts.
For gauges, counters, and sums, the recorder uses observable (asynchronous) instruments. Calls to `set_metric_value()` store values internally, and OpenTelemetry invokes callbacks at collection time to export aggregated observations. For histograms, OpenTelemetry doesn't support an observable histogram, so the recorder calls `record()` synchronously.
High-cardinality labels can cause metric explosion, making metrics systems unusable. Ray implements cardinality reduction through label filtering and value aggregation.
**Label Filtering**
The system identifies high-cardinality labels based on the `RAY_metric_cardinality_level` environment variable. The logic is implemented in `MetricCardinality.get_high_cardinality_labels_to_drop() <https://github.com/ray-project/ray/blob/05e7efd5ef71dca7a396e6b5f15c8ff16960c5db/python/ray/_private/telemetry/metric_cardinality.py#L80>`__:
- **`legacy`**: All labels are preserved (default behavior before Ray 2.53)
- **`recommended`**: The `WorkerId` label is dropped (default since Ray 2.53)
- **`low`**: Both `WorkerId` and `Name` labels are dropped for tasks and actors
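
The three levels can be modeled with a short sketch. It is not the body of `MetricCardinality.get_high_cardinality_labels_to_drop()`: the function names and the `TASK_AND_ACTOR_METRICS` set are hypothetical, and the sketch assumes that `low` drops `WorkerId` for every metric and `Name` only for task and actor metrics.

.. code-block:: python

    # Hypothetical metric names standing in for Ray's task and actor metrics.
    TASK_AND_ACTOR_METRICS = {"tasks", "actors"}


    def labels_to_drop(metric_name, level):
        """Return the set of label names to remove at a given cardinality level."""
        if level == "legacy":
            return set()
        if level == "recommended":
            return {"WorkerId"}
        if level == "low":
            dropped = {"WorkerId"}
            if metric_name in TASK_AND_ACTOR_METRICS:
                dropped.add("Name")
            return dropped
        raise ValueError(f"unknown cardinality level: {level}")


    def filter_tags(tags, dropped):
        """Return a copy of the tag dict without the dropped labels."""
        return {key: value for key, value in tags.items() if key not in dropped}


    tags = {"WorkerId": "w1", "Name": "train_step", "State": "RUNNING"}
    legacy_tags = filter_tags(tags, labels_to_drop("tasks", "legacy"))
    recommended_tags = filter_tags(tags, labels_to_drop("tasks", "recommended"))
    low_tags = filter_tags(tags, labels_to_drop("tasks", "low"))
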
**Aggregation process**
For observable gauges, counters, and sums, aggregation happens in the callback registered with the OpenTelemetry SDK in the Python recorder:
- **Collection**: The callback collects all observations for a metric from an internal observations map.
- **Label Filtering**: The callback drops high-cardinality labels from tag sets based on `MetricCardinality.get_high_cardinality_labels_to_drop()`.
- **Grouping**: The callback groups observations that share the same filtered tag set.
- **Aggregation**: The callback aggregates each group with `MetricCardinality.get_aggregation_function()`:

  - For counters and sums, the system always aggregates by summing values.
  - For gauges, the system aggregates task and actor metrics by summing values, and uses the first value for other metrics.

- **Export**: The callback returns aggregated observations to OpenTelemetry for Prometheus export.
This process ensures that metrics remain manageable even when there are thousands of workers or unique task names.
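
The filter, group, and aggregate steps can be sketched as one function. The names `aggregate_observations` and `first_value` are hypothetical; in Ray the aggregation function comes from `MetricCardinality.get_aggregation_function()`, which this sketch replaces with a plain callable argument.

.. code-block:: python

    from collections import defaultdict


    def first_value(values):
        """Aggregation that keeps the first observation in a group."""
        return values[0]


    def aggregate_observations(observations, dropped_labels, aggregation):
        """Drop labels, group by the remaining tag set, and aggregate each group.

        `observations` is a list of (tags dict, value) pairs.
        """
        groups = defaultdict(list)
        for tags, value in observations:
            key = tuple(
                sorted((k, v) for k, v in tags.items() if k not in dropped_labels)
            )
            groups[key].append(value)
        return {key: aggregation(values) for key, values in groups.items()}


    observations = [
        ({"WorkerId": "w1", "State": "RUNNING"}, 3),
        ({"WorkerId": "w2", "State": "RUNNING"}, 4),
        ({"WorkerId": "w1", "State": "FINISHED"}, 10),
    ]

    summed = aggregate_observations(observations, {"WorkerId"}, sum)
    first = aggregate_observations(observations, {"WorkerId"}, first_value)

After `WorkerId` is dropped, the two `RUNNING` observations share a tag set and collapse into a single series, so the number of exported series depends on the remaining labels rather than on the number of workers.
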