Observability

RDI reports metrics about its operation using Prometheus exporter endpoints. You can connect to the endpoints with Prometheus to query the metrics and plot simple graphs or with Grafana to produce more complex visualizations and dashboards.

RDI exposes three endpoints:

  • Collector metrics: CDC collector performance and connectivity
  • Stream processor metrics: Data processing performance and throughput
  • Operator metrics: Kubernetes operator health and Pipeline resource states

The sections below explain these sets of metrics in more detail. See the [architecture overview]({{< relref "/integrate/redis-data-integration/architecture#overview" >}}) for an introduction to these concepts.

{{< note >}}If you don't use Prometheus or Grafana, you can still see RDI metrics with the RDI monitoring screen in Redis Insight or with the [redis-di status]({{< relref "/integrate/redis-data-integration/reference/cli/redis-di-status" >}}) command from the CLI.{{< /note >}}

Accessing the metrics

The way you access the metrics endpoints depends on whether you are using a VM installation or a Helm installation for RDI. The sections below describe the correct approach for each installation type.

VM Installation

For VM installations, the metrics are available by default on the following endpoints:

  • Collector metrics: https://<RDI_HOST>/collector-source/metrics
  • Stream processor metrics: https://<RDI_HOST>/metrics
  • Operator metrics: https://<RDI_HOST>/operator/metrics

Note that the collector metrics endpoint is not available in RDI versions earlier than 1.16.0.
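With a VM installation, you can scrape these endpoints directly. The snippet below is a minimal sketch of a Prometheus scrape configuration for the three endpoints. Here, `rdi.example.com` is a placeholder for your `<RDI_HOST>`, and you may need to add `tls_config` or authentication settings to match your deployment:

```yaml
# Illustrative Prometheus scrape configuration for an RDI VM installation.
# "rdi.example.com" stands in for your <RDI_HOST>.
scrape_configs:
  - job_name: rdi-collector
    scheme: https
    metrics_path: /collector-source/metrics
    static_configs:
      - targets: ["rdi.example.com"]
  - job_name: rdi-stream-processor
    scheme: https
    metrics_path: /metrics
    static_configs:
      - targets: ["rdi.example.com"]
  - job_name: rdi-operator
    scheme: https
    metrics_path: /operator/metrics
    static_configs:
      - targets: ["rdi.example.com"]
```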

Helm installation

For Helm installations, the metrics are available via autodiscovery in the K8s cluster. Follow the steps below to use them:

  1. Make sure you have the Prometheus Operator installed in your K8s cluster (see the Prometheus Operator installation guide for more information about this).

  2. Update your values.yaml file to enable metrics for the operator, collector and stream processor components.

    • For the collector, update the collector section, under the dataPlane section:

      ```yaml
      dataPlane:
        collector:
          # Enable the service monitor
          serviceMonitor:
            enabled: true

            # Label the ServiceMonitor so that Prometheus can discover it
            labels:
              release: prometheus
      ```
      
    • For the stream processor, update the rdiMetricsExporter section:

      ```yaml
      rdiMetricsExporter:
        # Enable the service monitor
        serviceMonitor:
          enabled: true

          # Label the ServiceMonitor so that Prometheus can discover it
          labels:
            release: prometheus
      ```
      
    • For the operator, update the operator section:

      ```yaml
      operator:
        prometheus:
          enabled: true
          labels:
            release: prometheus
        metrics:
          enabled: true
      ```
      

{{< note >}}The Prometheus service discovery loop runs at regular intervals. This means that after deploying or updating RDI with the above configuration, it may take a few minutes for Prometheus to discover the new ServiceMonitors and start scraping metrics from the RDI components. {{< /note >}}

Collector metrics

These metrics are divided into three groups:

  • Pipeline state: metrics about the pipeline mode and connectivity
  • Data flow counters: counters for data breakdown per source table
  • Processing performance: processing speed of RDI micro batches

The following table lists all collector metrics and their descriptions:

| Metric | Type | Description | Alerting Recommendations |
| --- | --- | --- | --- |
| **Schema History Metrics** | | | |
| ChangesApplied | Counter | Total number of schema changes applied during recovery and runtime | Informational - monitor for trends |
| ChangesRecovered | Counter | Number of changes that were read during the recovery phase | Informational - monitor for trends |
| MilliSecondsSinceLastAppliedChange | Gauge | Number of milliseconds since the last change was applied | Informational - monitor for trends |
| MilliSecondsSinceLastRecoveredChange | Gauge | Number of milliseconds since the last change was recovered from the history store | Informational - monitor for trends |
| RecoveryStartTime | Gauge | Time in epoch milliseconds when recovery started (-1 if not applicable) | Informational - monitor for trends |
| **Connection and State Metrics** | | | |
| Connected | Gauge | Whether the collector is currently connected to the database (1=connected, 0=disconnected) | Critical Alert: Alert if value = 0 (disconnected) |
| **Queue Metrics** | | | |
| CurrentQueueSizeInBytes | Gauge | Current size of the collector's internal queue in bytes | Informational - monitor for trends |
| MaxQueueSizeInBytes | Gauge | Maximum configured size of the collector's internal queue in bytes | Informational - use for capacity planning |
| QueueRemainingCapacity | Gauge | Remaining capacity of the collector's internal queue | Informational - monitor for trends |
| QueueTotalCapacity | Gauge | Total capacity of the collector's internal queue | Informational - use for capacity planning |
| **Streaming Performance Metrics** | | | |
| MilliSecondsBehindSource | Gauge | Number of milliseconds the collector is behind the source database (-1 if not applicable) | Informational - monitor for trends and business SLA requirements |
| MilliSecondsSinceLastEvent | Gauge | Number of milliseconds since the collector processed the most recent event (-1 if not applicable) | Informational - monitor for trends in active systems |
| NumberOfCommittedTransactions | Counter | Number of committed transactions processed by the collector | Informational - monitor for trends |
| NumberOfEventsFiltered | Counter | Number of events filtered by include/exclude list rules | Informational - monitor for trends |
| **Event Counters** | | | |
| TotalNumberOfCreateEventsSeen | Counter | Total number of CREATE (INSERT) events seen by the collector | Informational - monitor for trends |
| TotalNumberOfDeleteEventsSeen | Counter | Total number of DELETE events seen by the collector | Informational - monitor for trends |
| TotalNumberOfEventsSeen | Counter | Total number of events seen by the collector | Informational - monitor for trends |
| TotalNumberOfUpdateEventsSeen | Counter | Total number of UPDATE events seen by the collector | Informational - monitor for trends |
| NumberOfErroneousEvents | Counter | Number of events that caused errors during processing | Critical Alert: Alert if > 0 (indicates processing failures) |
| **Snapshot Metrics** | | | |
| RemainingTableCount | Gauge | Number of tables remaining to be processed during snapshot | Informational - monitor snapshot progress |
| RowsScanned | Counter | Number of rows scanned per table during snapshot (reported per table) | Informational - monitor snapshot progress |
| SnapshotAborted | Gauge | Whether the snapshot was aborted (1=aborted, 0=not aborted) | Critical Alert: Alert if value = 1 (snapshot failed) |
| SnapshotCompleted | Gauge | Whether the snapshot completed successfully (1=completed, 0=not completed) | Informational - monitor snapshot completion |
| SnapshotDurationInSeconds | Gauge | Total duration of the snapshot process in seconds | Informational - monitor for performance trends |
| SnapshotPaused | Gauge | Whether the snapshot is currently paused (1=paused, 0=not paused) | Informational - monitor snapshot state |
| SnapshotPausedDurationInSeconds | Gauge | Total time the snapshot was paused in seconds | Informational - monitor snapshot state |
| SnapshotRunning | Gauge | Whether a snapshot is currently running (1=running, 0=not running) | Informational - monitor snapshot state |
| TotalTableCount | Gauge | Total number of tables included in the snapshot | Informational - use for progress calculation |

{{< note >}} Many metrics include context labels that specify the phase (snapshot or streaming), database name, and other contextual information. Metrics with a value of -1 typically indicate that the measurement is not applicable in the current state. {{< /note >}}
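As an example of putting the snapshot metrics to use, a Prometheus recording rule can derive overall snapshot progress from TotalTableCount and RemainingTableCount. This is a sketch only: the metric names are shown exactly as listed above, but they may carry an exporter-specific prefix in your scrape output, so adjust the expression to match what Prometheus actually ingests:

```yaml
groups:
  - name: rdi-collector-derived
    rules:
      # Fraction of snapshot tables already processed (0.0 to 1.0).
      # Adjust metric names if your exporter adds a prefix.
      - record: rdi:snapshot_progress_ratio
        expr: (TotalTableCount - RemainingTableCount) / TotalTableCount
```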

Stream processor metrics

RDI reports metrics during the two main phases of the ingest pipeline: the snapshot phase and the change data capture (CDC) phase. (See the [pipeline lifecycle]({{< relref "/integrate/redis-data-integration/data-pipelines" >}}) docs for more information.) The table below shows the full set of metrics that RDI reports, with their descriptions.

| Metric Name | Metric Type | Metric Description | Alerting Recommendations |
| --- | --- | --- | --- |
| incoming_records_total | Counter | Total number of incoming records processed by the system | Informational - monitor for trends |
| incoming_records_created | Gauge | Timestamp when the incoming records counter was created | Informational - no alerting needed |
| processed_records_total | Counter | Total number of records that have been successfully processed | Informational - monitor for trends |
| rejected_records_total | Counter | Total number of records that were rejected during processing | Critical Alert: Alert if > 0 (indicates processing failures) |
| filtered_records_total | Counter | Total number of records that were filtered out during processing | Informational - monitor for trends |
| rdi_engine_state | Gauge | Current state of the RDI engine with labels for state (e.g., STARTED, RUNNING) and sync_mode (e.g., SNAPSHOT, STREAMING) | Critical Alert: Alert if state indicates failure or error condition |
| rdi_version_info | Gauge | Version information for RDI components with labels for cli and engine versions | Informational - use for version tracking |
| monitor_time_elapsed_total | Counter | Total time elapsed (in seconds) since monitoring started | Informational - use for uptime tracking |
| monitor_time_elapsed_created | Gauge | Timestamp when the monitor time elapsed counter was created | Informational - no alerting needed |
| rdi_incoming_entries | Gauge | Count of incoming events by data_source and operation type (pending, inserted, updated, deleted, filtered, rejected) | Informational - monitor for trends, alert only on "rejected" > 0 |
| rdi_stream_event_latency_ms | Gauge | Latency in milliseconds of the oldest event in each data stream, labeled by data_source | Informational - monitor based on business SLA requirements |
| **Processor Performance Total Metrics** | | | |
| rdi_processed_batches_total | Counter | Total number of processed batches | Informational - use for data ingestion and load tracking |
| rdi_processor_batch_size_total | Counter | Total batch size across all processed batches | Informational - use for throughput analysis |
| rdi_processor_read_time_ms_total | Counter | Total read time in milliseconds across all batches | Informational - use for performance analysis |
| rdi_processor_transform_time_ms_total | Counter | Total transform time in milliseconds across all batches | Informational - use for performance analysis |
| rdi_processor_write_time_ms_total | Counter | Total write time in milliseconds across all batches | Informational - use for performance analysis |
| rdi_processor_process_time_ms_total | Counter | Total process time in milliseconds across all batches | Informational - use for performance analysis |
| rdi_processor_ack_time_ms_total | Counter | Total acknowledgment time in milliseconds across all batches | Informational - use for performance analysis |
| rdi_processor_total_time_ms_total | Counter | Sum of the total read_time, process_time and ack_time values in milliseconds across all batches | Informational - use for performance analysis |
| rdi_processor_rec_per_sec_total | Gauge | Total records per second across all batches | Informational - use for throughput analysis |
| **Processor Performance Last Batch Metrics** | | | |
| rdi_processor_batch_size_last | Gauge | Last batch size processed | Informational - use for real-time monitoring |
| rdi_processor_read_time_ms_last | Gauge | Last batch read time in milliseconds | Informational - use for real-time performance monitoring |
| rdi_processor_transform_time_ms_last | Gauge | Last batch transform time in milliseconds | Informational - use for real-time performance monitoring |
| rdi_processor_write_time_ms_last | Gauge | Last batch write time in milliseconds | Informational - use for real-time performance monitoring |
| rdi_processor_process_time_ms_last | Gauge | Last batch process time in milliseconds | Informational - use for real-time performance monitoring |
| rdi_processor_ack_time_ms_last | Gauge | Last batch acknowledgment time in milliseconds | Informational - use for real-time performance monitoring |
| rdi_processor_total_time_ms_last | Gauge | Last batch total time in milliseconds | Informational - use for real-time performance monitoring |
| rdi_processor_rec_per_sec_last | Gauge | Last batch records per second | Informational - use for real-time throughput monitoring |

{{< note >}} Additional information about stream processor metrics:

  • Where a metric name has the rdi_ prefix, the prefix is replaced by the Kubernetes namespace name if you supplied a custom name during installation. The prefix is always rdi_ for VM installations.
  • Metrics with the _created suffix are automatically generated by the Prometheus client library for counters and gauges to track when they were first created.
  • The rdi_incoming_entries metric provides a detailed breakdown for each data source by operation type.
  • The rdi_stream_event_latency_ms metric helps monitor data freshness and processing delays.
  • The processor performance metrics are divided into two categories:
    • Total metrics: Accumulate values across all processed batches for historical analysis
    • Last batch metrics: Show real-time performance data for the most recently processed batch {{< /note >}}
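For example, because the total metrics are cumulative counters, you can combine them with `rate()` to derive average per-batch timings. The recording rule below is a sketch that computes the average total time per batch over five-minute windows; remember that the `rdi_` prefix will differ if you supplied a custom namespace name during installation:

```yaml
groups:
  - name: rdi-processor-derived
    rules:
      # Average end-to-end time per batch (ms) over the last 5 minutes,
      # derived from the cumulative counters listed above.
      - record: rdi:processor_avg_batch_time_ms:rate5m
        expr: >-
          rate(rdi_processor_total_time_ms_total[5m])
          / rate(rdi_processed_batches_total[5m])
```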

Operator metrics

The RDI operator exposes Prometheus metrics at the /metrics endpoint to monitor the health and state of the operator itself and the Pipeline resources it manages.

The endpoint for operator metrics is https://<RDI_HOST>/operator/metrics (or the operator service endpoint in Kubernetes environments).

Operator metric types

Most of the metrics exposed by the RDI operator are standard controller-runtime metrics. The metrics that are relevant for RDI operations are listed in the table below:

| Metric Name | Metric Type | Metric Description | Alerting Recommendations |
| --- | --- | --- | --- |
| rdi_operator_pipeline_phase | Gauge | Current phase of each Pipeline resource with labels for namespace, name, and phase (Active, Inactive, Pending, Resetting, Error) | Critical Alert: Alert if the phase is "Error" for periods longer than 2 minutes |
| rdi_operator_is_leader | Gauge | Leadership status of the operator instance (1 = leader, 0 = not leader) with label for instance_id | Informational - monitor to ensure that the correct RDI instance is the leader in HA or DR deployments |

Understanding operator metrics

Pipeline phase tracking: The rdi_operator_pipeline_phase metric helps you monitor the lifecycle state of each RDI Pipeline resource. Each pipeline reports its current phase (Active, Inactive, Pending, Resetting, or Error) as a gauge value of 1, while all other phases for that pipeline are set to 0. This allows you to track phase transitions and identify pipelines that are stuck in error states.

Leader election: In high availability (HA) or disaster recovery (DR) deployments with multiple RDI instances, the rdi_operator_is_leader metric indicates which RDI instance is actively managing Pipeline resources. Only one RDI instance should have a value of 1 at any time, while all other instances should report 0. This metric is useful for troubleshooting leader election issues in HA or DR deployments.
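Since exactly one instance should report a value of 1, both failure modes (no leader and multiple leaders) can be caught by summing the metric across instances. The rules below are a sketch of how this might look as Prometheus alerting rules for an HA or DR deployment:

```yaml
groups:
  - name: rdi-operator-leadership
    rules:
      # No instance holds leadership: the pipeline is not being managed.
      - alert: RDINoLeader
        expr: sum(rdi_operator_is_leader) == 0
        for: 2m
        labels:
          severity: critical
      # More than one leader indicates a "split brain" state.
      - alert: RDIMultipleLeaders
        expr: sum(rdi_operator_is_leader) > 1
        labels:
          severity: critical
```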

Accessing operator metrics

In Kubernetes deployments, you can configure Prometheus to scrape operator metrics by enabling the Prometheus ServiceMonitor in your Helm values:

```yaml
operator:
  prometheus:
    enabled: true
    labels:
      release: prometheus
```

{{< note >}}The ServiceMonitor resources must be labelled correctly for Prometheus to auto-scrape the metrics. The required label is configured in Prometheus; by default it is release: prometheus.{{< /note >}}

You can also expose the metrics endpoint externally using an Ingress:

```yaml
operator:
  ingress:
    enabled: true
    hosts:
      - operator.example.com
    pathPrefix: ""
```

Then access metrics at https://operator.example.com/operator/metrics.

Recommended alerting strategy

The alerting strategy described in the sections below focuses on system failures and data integrity issues that require immediate attention. Most other metrics are informational, so you should monitor them for trends rather than triggering alerts on them.

Critical alerts (immediate response required)

These are the only alerts that require immediate action:

Collector alerts:

  • Connected = 0: Database connectivity has been lost. RDI cannot function without a database connection.
  • NumberOfErroneousEvents > 0: Errors are occurring during data processing. This indicates data corruption or processing failures.
  • SnapshotAborted = 1: The snapshot process has failed, so the initial sync is incomplete.

Processor alerts:

  • rejected_records_total > 0: Records are being rejected. This indicates data quality issues or processing failures.
  • rdi_engine_state: Alert only if the state indicates a clear failure condition (not just "not running").

Operator alerts:

  • rdi_operator_pipeline_phase with phase="Error" for more than 2 minutes: A Pipeline resource has entered an error state and requires investigation.
  • No leader in HA or DR setups: If both RDI instances report rdi_operator_is_leader = 0 for more than 2 minutes, the RDI pipeline is not active.
  • Multiple leaders in HA or DR setups: If both RDI instances report rdi_operator_is_leader = 1, RDI is in a "split brain" state.
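As a sketch, the collector and stream processor alerts above could be expressed as Prometheus alerting rules along the following lines. The collector metric names are shown as listed in the tables above and may carry an exporter-specific prefix in your scrape output, so verify the names against your Prometheus targets before using these:

```yaml
groups:
  - name: rdi-critical
    rules:
      # Collector has lost its connection to the source database.
      - alert: RDICollectorDisconnected
        expr: Connected == 0
        for: 1m
        labels:
          severity: critical
      # Events caused errors during collector processing.
      - alert: RDICollectorErroneousEvents
        expr: increase(NumberOfErroneousEvents[5m]) > 0
        labels:
          severity: critical
      # The initial snapshot was aborted.
      - alert: RDISnapshotAborted
        expr: SnapshotAborted == 1
        labels:
          severity: critical
      # The stream processor is rejecting records.
      - alert: RDIRejectedRecords
        expr: increase(rejected_records_total[5m]) > 0
        labels:
          severity: critical
```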

Important monitoring (but not alerts)

You should monitor these metrics on dashboards and review them regularly, but they don't require automated alerts:

  • Queue metrics: Queue utilization can vary widely and hitting 0% or 100% capacity may be normal during certain operations.
  • Latency metrics: Lag and processing times depend heavily on business requirements and normal operational patterns.
  • Event counters: Event rates naturally vary based on application usage patterns.
  • Snapshot progress: Snapshot duration and progress depend on data size, so you should typically monitor them manually.
  • Schema changes: Schema change frequency is highly application-dependent.

Key principles for RDI alerting

  • Alert on failures, not performance: Focus alerts on system failures rather than performance degradation.
  • Business context matters: Latency and throughput requirements vary significantly between organizations.
  • Establish baselines first: Monitor metrics for weeks before you set any threshold-based alerts.
  • Avoid alert fatigue: If you see too many non-critical alerts, you are less likely to take truly critical issues seriously.
  • Use dashboards for trends: Most metrics are better suited to dashboard monitoring than to alerting.

Monitoring best practices

  • Dashboard-first approach: Use Grafana dashboards to visualize trends and patterns.
  • Baseline establishment: Monitor your specific workload for 2-4 weeks before you consider adding more alerts.
  • Business SLA alignment: Only create alerts for metrics that directly impact your business SLA requirements.
  • Manual review: Don't use automated alerts to review metric trends. Instead, schedule regular business reviews to check them manually.

RDI logs

RDI uses fluentd and logrotate to ship and rotate logs for its Kubernetes (K8s) components, so even after a containerized component is removed by the RDI operator process or by K8s, its logs remain available for you to inspect. By default, RDI stores logs in the host VM file system at /opt/rdi/logs. The logs are recorded at a minimum level of INFO and are rotated when they reach 100MB in size; RDI retains the last five rotated log files by default. Logs are written in a straightforward text format, which lets you analyze them with a range of observability tools. You can change the default log settings using the [redis-di configure-rdi]({{< relref "/integrate/redis-data-integration/reference/cli/redis-di-configure-rdi" >}}) command.

Dump support package

If you ever need to send a comprehensive set of forensics data to Redis support then you should run the [redis-di dump-support-package]({{< relref "/integrate/redis-data-integration/reference/cli/redis-di-dump-support-package" >}}) command from the CLI. See [Troubleshooting]({{< relref "/integrate/redis-data-integration/troubleshooting#dump-support-package" >}}) for more information.