RDI reports metrics about its operation using Prometheus exporter endpoints. You can connect to the endpoints with Prometheus to query the metrics and plot simple graphs or with Grafana to produce more complex visualizations and dashboards.
RDI exposes three metrics endpoints: one for the collector, one for the stream processor, and one for the RDI operator.
The sections below explain these sets of metrics in more detail. See the [architecture overview]({{< relref "/integrate/redis-data-integration/architecture#overview" >}}) for an introduction to these concepts.
{{< note >}}If you don't use Prometheus or Grafana, you can still see
RDI metrics with the RDI monitoring screen in Redis Insight or with the
[redis-di status]({{< relref "/integrate/redis-data-integration/reference/cli/redis-di-status" >}})
command from the CLI.{{< /note >}}
The way you access the metrics endpoints depends on whether you are using a VM installation or a Helm installation for RDI. The sections below describe the correct approach for each installation type.
For VM installations, the metrics are available by default on the following endpoints:
- `https://<RDI_HOST>/collector-source/metrics`
- `https://<RDI_HOST>/metrics`
- `https://<RDI_HOST>/operator/metrics`

Note that for RDI versions earlier than 1.16.0, the collector metrics are not accessible.
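As a sketch, you could scrape these VM endpoints with a static Prometheus configuration like the one below. The job names are illustrative, and depending on your TLS and authentication setup you may also need `tls_config` or credentials entries:

```yaml
# Hypothetical scrape configuration for a VM-based RDI installation.
# Replace <RDI_HOST> with your actual RDI host name.
scrape_configs:
  - job_name: rdi-collector
    scheme: https
    metrics_path: /collector-source/metrics
    static_configs:
      - targets: ["<RDI_HOST>"]
  - job_name: rdi-stream-processor
    scheme: https
    metrics_path: /metrics
    static_configs:
      - targets: ["<RDI_HOST>"]
  - job_name: rdi-operator
    scheme: https
    metrics_path: /operator/metrics
    static_configs:
      - targets: ["<RDI_HOST>"]
```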
For Helm installations, the metrics are available via autodiscovery in the K8s cluster. Follow the steps below to use them:
Make sure you have the Prometheus Operator installed in your K8s cluster (see the Prometheus Operator installation guide for more information about this).
Update your values.yaml file to enable metrics for the operator, collector and stream processor components.
For the collector, update the `collector` section, under the `dataPlane` section:

```yaml
dataPlane:
  collector:
    # Enable service monitor
    serviceMonitor:
      enabled: true
      # Make sure to label the ServiceMonitor so that Prometheus can discover it
      labels:
        release: prometheus
```
For the stream processor, update the `rdiMetricsExporter` section:

```yaml
rdiMetricsExporter:
  # Enable service monitor
  serviceMonitor:
    enabled: true
    # Make sure to label the ServiceMonitor so that Prometheus can discover it
    labels:
      release: prometheus
```
For the operator, update the `operator` section:

```yaml
operator:
  prometheus:
    enabled: true
    labels:
      release: prometheus
  metrics:
    enabled: true
```
{{< note >}}The Prometheus service discovery loop runs at regular intervals. This means that after deploying or updating RDI with the above configuration, it may take a few minutes for Prometheus to discover the new ServiceMonitors and start scraping metrics from the RDI components. {{< /note >}}
These metrics are divided into three groups: collector metrics, stream processor metrics, and operator metrics.
The following table lists all collector metrics and their descriptions:
| Metric | Type | Description | Alerting Recommendations |
|---|---|---|---|
| **Schema History Metrics** | | | |
| `ChangesApplied` | Counter | Total number of schema changes applied during recovery and runtime | Informational - monitor for trends |
| `ChangesRecovered` | Counter | Number of changes that were read during the recovery phase | Informational - monitor for trends |
| `MilliSecondsSinceLastAppliedChange` | Gauge | Number of milliseconds since the last change was applied | Informational - monitor for trends |
| `MilliSecondsSinceLastRecoveredChange` | Gauge | Number of milliseconds since the last change was recovered from the history store | Informational - monitor for trends |
| `RecoveryStartTime` | Gauge | Time in epoch milliseconds when recovery started (-1 if not applicable) | Informational - monitor for trends |
| **Connection and State Metrics** | | | |
| `Connected` | Gauge | Whether the collector is currently connected to the database (1=connected, 0=disconnected) | Critical Alert: Alert if value = 0 (disconnected) |
| **Queue Metrics** | | | |
| `CurrentQueueSizeInBytes` | Gauge | Current size of the collector's internal queue in bytes | Informational - monitor for trends |
| `MaxQueueSizeInBytes` | Gauge | Maximum configured size of the collector's internal queue in bytes | Informational - use for capacity planning |
| `QueueRemainingCapacity` | Gauge | Remaining capacity of the collector's internal queue | Informational - monitor for trends |
| `QueueTotalCapacity` | Gauge | Total capacity of the collector's internal queue | Informational - use for capacity planning |
| **Streaming Performance Metrics** | | | |
| `MilliSecondsBehindSource` | Gauge | Number of milliseconds the collector is behind the source database (-1 if not applicable) | Informational - monitor for trends and business SLA requirements |
| `MilliSecondsSinceLastEvent` | Gauge | Number of milliseconds since the collector processed the most recent event (-1 if not applicable) | Informational - monitor for trends in active systems |
| `NumberOfCommittedTransactions` | Counter | Number of committed transactions processed by the collector | Informational - monitor for trends |
| `NumberOfEventsFiltered` | Counter | Number of events filtered by include/exclude list rules | Informational - monitor for trends |
| **Event Counters** | | | |
| `TotalNumberOfCreateEventsSeen` | Counter | Total number of CREATE (INSERT) events seen by the collector | Informational - monitor for trends |
| `TotalNumberOfDeleteEventsSeen` | Counter | Total number of DELETE events seen by the collector | Informational - monitor for trends |
| `TotalNumberOfEventsSeen` | Counter | Total number of events seen by the collector | Informational - monitor for trends |
| `TotalNumberOfUpdateEventsSeen` | Counter | Total number of UPDATE events seen by the collector | Informational - monitor for trends |
| `NumberOfErroneousEvents` | Counter | Number of events that caused errors during processing | Critical Alert: Alert if > 0 (indicates processing failures) |
| **Snapshot Metrics** | | | |
| `RemainingTableCount` | Gauge | Number of tables remaining to be processed during snapshot | Informational - monitor snapshot progress |
| `RowsScanned` | Counter | Number of rows scanned per table during snapshot (reported per table) | Informational - monitor snapshot progress |
| `SnapshotAborted` | Gauge | Whether the snapshot was aborted (1=aborted, 0=not aborted) | Critical Alert: Alert if value = 1 (snapshot failed) |
| `SnapshotCompleted` | Gauge | Whether the snapshot completed successfully (1=completed, 0=not completed) | Informational - monitor snapshot completion |
| `SnapshotDurationInSeconds` | Gauge | Total duration of the snapshot process in seconds | Informational - monitor for performance trends |
| `SnapshotPaused` | Gauge | Whether the snapshot is currently paused (1=paused, 0=not paused) | Informational - monitor snapshot state |
| `SnapshotPausedDurationInSeconds` | Gauge | Total time the snapshot was paused in seconds | Informational - monitor snapshot state |
| `SnapshotRunning` | Gauge | Whether a snapshot is currently running (1=running, 0=not running) | Informational - monitor snapshot state |
| `TotalTableCount` | Gauge | Total number of tables included in the snapshot | Informational - use for progress calculation |
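For example, you can combine `TotalTableCount` and `RemainingTableCount` to estimate snapshot progress. The sketch below parses Prometheus text exposition output and computes the completed fraction; the helper names are hypothetical, and the exact metric names and labels exposed by your deployment should be checked against the endpoint output:

```python
# Sketch: estimate snapshot progress from collector metrics output.

def parse_metrics(text):
    """Parse Prometheus text exposition into a {name: value} dict.
    Ignores labels and comment lines; keeps the last sample per name."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]  # strip any {label="..."} part
        samples[name] = float(value)
    return samples

def snapshot_progress(samples):
    """Fraction of tables already snapshotted, or None if not applicable."""
    total = samples.get("TotalTableCount", 0)
    remaining = samples.get("RemainingTableCount", 0)
    if total <= 0:
        return None
    return (total - remaining) / total

example = """\
# HELP TotalTableCount Total number of tables included in the snapshot
TotalTableCount 8
RemainingTableCount 2
SnapshotRunning 1
"""
print(snapshot_progress(parse_metrics(example)))  # prints 0.75
```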
{{< note >}}
Many metrics include context labels that specify the phase (snapshot or streaming), database name, and other contextual information. Metrics with a value of -1 typically indicate that the measurement is not applicable in the current state.
{{< /note >}}
RDI reports metrics during the two main phases of the ingest pipeline: the snapshot phase and the change data capture (CDC) phase. (See the [pipeline lifecycle]({{< relref "/integrate/redis-data-integration/data-pipelines" >}}) docs for more information.) The table below shows the full set of metrics that RDI reports, with their descriptions.
| Metric Name | Metric Type | Metric Description | Alerting Recommendations |
|---|---|---|---|
| `incoming_records_total` | Counter | Total number of incoming records processed by the system | Informational - monitor for trends |
| `incoming_records_created` | Gauge | Timestamp when the incoming records counter was created | Informational - no alerting needed |
| `processed_records_total` | Counter | Total number of records that have been successfully processed | Informational - monitor for trends |
| `rejected_records_total` | Counter | Total number of records that were rejected during processing | Critical Alert: Alert if > 0 (indicates processing failures) |
| `filtered_records_total` | Counter | Total number of records that were filtered out during processing | Informational - monitor for trends |
| `rdi_engine_state` | Gauge | Current state of the RDI engine with labels for state (e.g., STARTED, RUNNING) and sync_mode (e.g., SNAPSHOT, STREAMING) | Critical Alert: Alert if state indicates failure or error condition |
| `rdi_version_info` | Gauge | Version information for RDI components with labels for cli and engine versions | Informational - use for version tracking |
| `monitor_time_elapsed_total` | Counter | Total time elapsed (in seconds) since monitoring started | Informational - use for uptime tracking |
| `monitor_time_elapsed_created` | Gauge | Timestamp when the monitor time elapsed counter was created | Informational - no alerting needed |
| `rdi_incoming_entries` | Gauge | Count of incoming events by data_source and operation type (pending, inserted, updated, deleted, filtered, rejected) | Informational - monitor for trends, alert only on "rejected" > 0 |
| `rdi_stream_event_latency_ms` | Gauge | Latency in milliseconds of the oldest event in each data stream, labeled by data_source | Informational - monitor based on business SLA requirements |
| **Processor Performance Total Metrics** | | | |
| `rdi_processed_batches_total` | Counter | Total number of processed batches | Informational - use for data ingestion and load tracking |
| `rdi_processor_batch_size_total` | Counter | Total batch size across all processed batches | Informational - use for throughput analysis |
| `rdi_processor_read_time_ms_total` | Counter | Total read time in milliseconds across all batches | Informational - use for performance analysis |
| `rdi_processor_transform_time_ms_total` | Counter | Total transform time in milliseconds across all batches | Informational - use for performance analysis |
| `rdi_processor_write_time_ms_total` | Counter | Total write time in milliseconds across all batches | Informational - use for performance analysis |
| `rdi_processor_process_time_ms_total` | Counter | Total process time in milliseconds across all batches | Informational - use for performance analysis |
| `rdi_processor_ack_time_ms_total` | Counter | Total acknowledgment time in milliseconds across all batches | Informational - use for performance analysis |
| `rdi_processor_total_time_ms_total` | Counter | Sum of the total read_time, process_time and ack_time values in milliseconds across all batches | Informational - use for performance analysis |
| `rdi_processor_rec_per_sec_total` | Gauge | Total records per second across all batches | Informational - use for throughput analysis |
| **Processor Performance Last Batch Metrics** | | | |
| `rdi_processor_batch_size_last` | Gauge | Last batch size processed | Informational - use for real-time monitoring |
| `rdi_processor_read_time_ms_last` | Gauge | Last batch read time in milliseconds | Informational - use for real-time performance monitoring |
| `rdi_processor_transform_time_ms_last` | Gauge | Last batch transform time in milliseconds | Informational - use for real-time performance monitoring |
| `rdi_processor_write_time_ms_last` | Gauge | Last batch write time in milliseconds | Informational - use for real-time performance monitoring |
| `rdi_processor_process_time_ms_last` | Gauge | Last batch process time in milliseconds | Informational - use for real-time performance monitoring |
| `rdi_processor_ack_time_ms_last` | Gauge | Last batch acknowledgment time in milliseconds | Informational - use for real-time performance monitoring |
| `rdi_processor_total_time_ms_last` | Gauge | Last batch total time in milliseconds | Informational - use for real-time performance monitoring |
| `rdi_processor_rec_per_sec_last` | Gauge | Last batch records per second | Informational - use for real-time throughput monitoring |
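The cumulative `_total` counters are most useful as rates. The sketch below derives the average processing time per batch from two scrapes, which is a manual version of a PromQL expression such as `rate(rdi_processor_total_time_ms_total[5m]) / rate(rdi_processed_batches_total[5m])`; the sample values are made up for illustration:

```python
# Sketch: average total time per batch between two counter samples.

def avg_batch_time_ms(prev, curr):
    """Average processing time per batch (ms) between two scrapes.
    Returns None if no new batches were processed in the interval."""
    batches = curr["rdi_processed_batches_total"] - prev["rdi_processed_batches_total"]
    time_ms = curr["rdi_processor_total_time_ms_total"] - prev["rdi_processor_total_time_ms_total"]
    if batches <= 0:
        return None
    return time_ms / batches

# Illustrative samples taken one scrape interval apart
prev = {"rdi_processed_batches_total": 100, "rdi_processor_total_time_ms_total": 4000}
curr = {"rdi_processed_batches_total": 150, "rdi_processor_total_time_ms_total": 6500}
print(avg_batch_time_ms(prev, curr))  # prints 50.0
```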
{{< note >}} Additional information about stream processor metrics:

- Metric names begin with the `rdi_` prefix. This will be replaced by the Kubernetes namespace name if you supplied a custom name during installation. The prefix is always `rdi_` for VM installations.
- Metrics with the `_created` suffix are automatically generated by Prometheus for counters and gauges to track when they were first created.
- The `rdi_incoming_entries` metric provides a detailed breakdown for each data source by operation type.
- The `rdi_stream_event_latency_ms` metric helps monitor data freshness and processing delays.
{{< /note >}}

The RDI operator exposes Prometheus metrics at the `/metrics` endpoint to monitor the health and state of the operator itself and the Pipeline resources it manages.
The endpoint for operator metrics is `https://<RDI_HOST>/operator/metrics` (or the operator service endpoint in Kubernetes environments).
Most of the metrics exposed by the RDI operator are standard controller-runtime metrics. The metrics that are relevant for RDI operations are listed in the table below:
| Metric Name | Metric Type | Metric Description | Alerting Recommendations |
|---|---|---|---|
| `rdi_operator_pipeline_phase` | Gauge | Current phase of each Pipeline resource with labels for namespace, name, and phase (Active, Inactive, Pending, Resetting, Error) | Critical Alert: Alert if the phase is "Error" for periods longer than 2 minutes |
| `rdi_operator_is_leader` | Gauge | Leadership status of the operator instance (1 = leader, 0 = not leader) with label for instance_id | Informational - monitor to ensure that the correct RDI instance is the leader in HA or DR deployments |
**Pipeline phase tracking:** The `rdi_operator_pipeline_phase` metric helps you monitor the lifecycle state of each RDI Pipeline resource. Each pipeline reports its current phase (Active, Inactive, Pending, Resetting, or Error) as a gauge value of 1, while all other phases for that pipeline are set to 0. This allows you to track phase transitions and identify pipelines that are stuck in error states.

**Leader election:** In high availability (HA) or disaster recovery (DR) deployments with multiple RDI instances, the `rdi_operator_is_leader` metric indicates which RDI instance is actively managing Pipeline resources. Only one RDI instance should have a value of 1 at any time, while all other instances should report 0. This metric is useful for troubleshooting leader election issues in HA or DR deployments.
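As a sketch, the leadership expectations above could be checked with Prometheus alerting rules like the following. The rule names and the two-minute threshold are illustrative, not part of RDI:

```yaml
groups:
  - name: rdi-operator-leadership
    rules:
      - alert: RdiNoLeader
        # No operator instance is leading, so the pipeline is not active
        expr: sum(rdi_operator_is_leader) == 0
        for: 2m
        labels:
          severity: critical
      - alert: RdiSplitBrain
        # More than one instance claims leadership ("split brain")
        expr: sum(rdi_operator_is_leader) > 1
        for: 2m
        labels:
          severity: critical
```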
In Kubernetes deployments, you can configure Prometheus to scrape operator metrics by enabling the Prometheus ServiceMonitor in your Helm values:
```yaml
operator:
  prometheus:
    enabled: true
    labels:
      release: prometheus
```
{{< note >}}The ServiceMonitor resources must be labelled correctly for Prometheus to auto-scrape the metrics. The expected label is configured in Prometheus; by default, it is `release: prometheus`.{{< /note >}}
You can also expose the metrics endpoint externally using an Ingress:
```yaml
operator:
  ingress:
    enabled: true
    hosts:
      - operator.example.com
    pathPrefix: ""
```
Then access the metrics at `https://operator.example.com/operator/metrics`.
The alerting strategy described in the sections below focuses on system failures and data integrity issues that require immediate attention. Most other metrics are informational, so you should monitor them for trends rather than trigger alerts.
These are the only alerts that require immediate action:
Collector alerts:
- `Connected` = 0: Database connectivity has been lost. RDI cannot function without a database connection.
- `NumberOfErroneousEvents` > 0: Errors are occurring during data processing. This indicates data corruption or processing failures.
- `SnapshotAborted` = 1: The snapshot process has failed, so the initial sync is incomplete.

Processor alerts:
- `rejected_records_total` > 0: Records are being rejected. This indicates data quality issues or processing failures.
- `rdi_engine_state`: Alert only if the state indicates a clear failure condition (not just "not running").

Operator alerts:
- `rdi_operator_pipeline_phase` with phase="Error" for more than 2 minutes: A Pipeline resource has entered an error state and requires investigation.
- `rdi_operator_is_leader` = 0 on all instances for more than 2 minutes: The RDI pipeline is not active.
- `rdi_operator_is_leader` = 1 on more than one instance: RDI is in a "split brain" state.

All other metrics are informational: you should monitor them on dashboards and review them regularly, but they don't require automated alerts.
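The critical alerts above could be expressed as Prometheus alerting rules along the lines of the following sketch. The rule names and thresholds are illustrative, and you should verify the exact metric names and labels against your own metrics endpoints before deploying:

```yaml
groups:
  - name: rdi-critical
    rules:
      - alert: RdiCollectorDisconnected
        # Collector has lost its source database connection
        expr: Connected == 0
        for: 1m
        labels:
          severity: critical
      - alert: RdiRejectedRecords
        # Records were rejected in the last 5 minutes
        expr: increase(rejected_records_total[5m]) > 0
        labels:
          severity: critical
      - alert: RdiPipelineError
        # A Pipeline resource has been in the Error phase for 2 minutes
        expr: rdi_operator_pipeline_phase{phase="Error"} == 1
        for: 2m
        labels:
          severity: critical
```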
RDI uses fluentd and
logrotate to ship and rotate logs
for its Kubernetes (K8s) components.
So whenever a containerized component is removed by the RDI operator process or by K8s,
the logs are available for you to inspect.
By default, RDI stores logs in the host VM file system at `/opt/rdi/logs`.
The logs are recorded at the `INFO` level or above and are rotated when they reach a size of 100MB.
RDI retains the last five rotated log files by default.
Logs are in a straightforward text format, which lets you analyze them with several different observability tools.
You can change the default log settings using the
[redis-di configure-rdi]({{< relref "/integrate/redis-data-integration/reference/cli/redis-di-configure-rdi" >}})
command.
If you ever need to send a comprehensive set of forensics data to Redis support then you should
run the
[redis-di dump-support-package]({{< relref "/integrate/redis-data-integration/reference/cli/redis-di-dump-support-package" >}})
command from the CLI. See
[Troubleshooting]({{< relref "/integrate/redis-data-integration/troubleshooting#dump-support-package" >}})
for more information.