**Note on pg_cron:** If pg_cron performs the update in the background, the Rust application never observes the transition, so the metrics are not updated. For the timeout alerts to work, either the timeout logic must be moved into Rust, or timeouts must be accepted as under-reported in these specific metrics.

For now, `timed_out` alerts must not be implemented until this feature is moved out of cron and into the application logic.
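A minimal sketch of the first option, moving the timeout into Rust. It assumes a periodic task that scans non-final requests and a single timeout measured from request creation; both are assumptions, as are the names `Status`, `Request` and `sweep_timeouts`. The `HashMap` is a hypothetical in-memory stand-in for the Postgres table. The point is only that the transition happens in application code, which can then update the status metrics for every request it times out.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical, simplified status enum (only what the sweep needs).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Queued,
    Processing,
    ReceiptReceived,
    Completed,
    TimedOut,
}

impl Status {
    /// Final states are never swept.
    fn is_final(self) -> bool {
        matches!(self, Status::Completed | Status::TimedOut)
    }
}

/// Hypothetical in-memory stand-in for a row of the requests table.
struct Request {
    status: Status,
    created_at: Instant,
}

/// Moves every non-final request older than `timeout` to `TimedOut`.
/// Returns `(id, previous_status)` pairs sorted by id, so the caller can
/// update the status metrics for each transition it just made.
fn sweep_timeouts(
    requests: &mut HashMap<u64, Request>,
    timeout: Duration,
    now: Instant,
) -> Vec<(u64, Status)> {
    let mut expired = Vec::new();
    for (id, req) in requests.iter_mut() {
        if !req.status.is_final() && now.duration_since(req.created_at) >= timeout {
            expired.push((*id, req.status));
            req.status = Status::TimedOut;
        }
    }
    expired.sort_unstable_by_key(|&(id, _)| id);
    expired
}
```

Returning the previous status lets the caller record the duration under `previous_status` exactly as for any other transition.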
Detailed monitoring of the internal state machine for Input Proofs, User Decryption, and Public Decryption requests.
This document outlines the metrics used to track the lifecycle of requests within the Relayer. The metrics are generated in real time by application hooks that run on database status updates.
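One way such a hook could maintain both metrics, sketched in std-only Rust. It assumes, without confirmation from this document, that `relayer_request_count` is kept as an up/down count per status: leaving a status decrements it and entering one increments it. Under that assumption the final states only ever increase, which is consistent with the `rate`/`increase` queries used on `completed`, `failure` and `timed_out`. `StatusMetrics` and `on_status_update` are hypothetical names, and a real implementation would use a metrics library instead of `HashMap`s.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Label key: (req_type, status) or (req_type, previous_status).
type Labels = (String, String);

/// Hypothetical in-memory stand-in for the two metrics.
#[derive(Default)]
struct StatusMetrics {
    /// Mirrors relayer_request_count: requests counted per (req_type, status).
    count: HashMap<Labels, i64>,
    /// Mirrors relayer_request_status_duration_seconds: raw observations
    /// in seconds per (req_type, previous_status).
    durations: HashMap<Labels, Vec<f64>>,
}

impl StatusMetrics {
    /// Called after a status update has been written to the database.
    /// `from` is `None` when the request is first inserted.
    fn on_status_update(&mut self, req_type: &str, from: Option<&str>, to: &str, time_in_from: Duration) {
        if let Some(prev) = from {
            let key = (req_type.to_string(), prev.to_string());
            // The request no longer counts in the status it left...
            *self.count.entry(key.clone()).or_insert(0) -= 1;
            // ...and the time spent there is observed under previous_status.
            // Final states are never left, so they never appear here.
            self.durations.entry(key).or_default().push(time_in_from.as_secs_f64());
        }
        *self.count.entry((req_type.to_string(), to.to_string())).or_insert(0) += 1;
    }
}
```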
Understanding the flow is essential for interpreting the metrics.
| Status | Business Context | Expected Duration |
|---|---|---|
| `queued` | **Ingestion:** Payload received; waiting for ACL/readiness checks. **Critical:** If high, the Relayer is overwhelmed or readiness checks are hanging. | < 10s max (internal); ~ms (readiness) |
| `processing` | **Internally queued before transaction broadcasting:** ACL passed. The request has been crafted and internally queued, waiting to be picked up by the transaction engine. **Warning:** If high, the transaction engine is backlogged or the internal queue is congested. | Queue-dependent (typically < 1s, but can be longer under load) |
| `tx_in_flight` | **Transaction broadcasting:** Request picked up for sending to the blockchain; the transaction is being broadcast. **Critical:** If high, the transaction engine is slow or the RPC is congested. | Block time (e.g., 50ms to 1s) |
| `receipt_received` | **Tx confirmed:** Waiting for the Gateway event (consensus or decryption). **Critical:** If long, the KMS is stalling or the Relayer listener is broken. | Network-dependent (~5s to 30s) |
| `completed` | **Success:** Final happy path. Request served to the user or proof accepted. | N/A (end state) |
| `timed_out` | **Error:** | N/A (end state) |
| `failure` | **Error:** Transaction engine failure (gas, RPC error, invalid payload). | N/A (end state) |

The metrics and their labels:

- **`relayer_request_count`**
  - `req_type`: `user_decrypt`, `public_decrypt`, `input_proof`
  - `status`: `queued`, `processing`, `receipt_received`, `completed`, `timed_out`, `failure`
- **`relayer_request_status_duration_seconds`**
  - Buckets: 0.1s to 3600s (1h).
  - `req_type`: `user_decrypt`, ...
  - `previous_status`: the status the request just left (e.g., on a `queued` → `processing` transition, the label is `queued`). `completed`, `failure` and `timed_out` are not taken into account since they are final states.

TODO: refine
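The lifecycle in the status table can be encoded as a small state machine. This sketch assumes the happy path follows the table order and that any non-final status may end in `failure` or `timed_out`; the table does not spell out which statuses can fail, so treat the transition rules as illustrative. `RequestStatus` and its methods are hypothetical names.

```rust
/// Request lifecycle statuses, in the order of the status table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RequestStatus {
    Queued,
    Processing,
    TxInFlight,
    ReceiptReceived,
    Completed,
    TimedOut,
    Failure,
}

impl RequestStatus {
    /// Label value as written in the status table.
    fn label(self) -> &'static str {
        match self {
            Self::Queued => "queued",
            Self::Processing => "processing",
            Self::TxInFlight => "tx_in_flight",
            Self::ReceiptReceived => "receipt_received",
            Self::Completed => "completed",
            Self::TimedOut => "timed_out",
            Self::Failure => "failure",
        }
    }

    /// End states: never left, so never used as `previous_status`.
    fn is_final(self) -> bool {
        matches!(self, Self::Completed | Self::TimedOut | Self::Failure)
    }

    /// Next status on the happy path (assumed to follow the table order).
    fn happy_next(self) -> Option<Self> {
        match self {
            Self::Queued => Some(Self::Processing),
            Self::Processing => Some(Self::TxInFlight),
            Self::TxInFlight => Some(Self::ReceiptReceived),
            Self::ReceiptReceived => Some(Self::Completed),
            _ => None,
        }
    }

    /// Assumption: any non-final status may also end in Failure or TimedOut.
    fn can_transition_to(self, next: Self) -> bool {
        !self.is_final()
            && (self.happy_next() == Some(next) || matches!(next, Self::Failure | Self::TimedOut))
    }
}
```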
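Number of requests currently in the `queued`, `processing` and `receipt_received` statuses, per request type:

```promql
sum by (req_type) (relayer_request_count{status="queued"})
sum by (req_type) (relayer_request_count{status="processing"})
sum by (req_type) (relayer_request_count{status="receipt_received"})
```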
Time spent in the `queued` status. Measures how long ACL checks take:

```promql
sum by (le) (rate(relayer_request_status_duration_seconds_bucket{previous_status="queued"}[5m]))
```

Time spent in the `processing` status. Measures blockchain block times:

```promql
sum by (le) (rate(relayer_request_status_duration_seconds_bucket{previous_status="processing"}[5m]))
```

Time spent in the `receipt_received` status. Measures KMS network speed and the Relayer listener:

```promql
sum by (le) (rate(relayer_request_status_duration_seconds_bucket{previous_status="receipt_received"}[5m]))
```
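These per-`le` bucket rates are the input that Grafana heatmaps, or PromQL's `histogram_quantile`, turn into latency figures, e.g. `histogram_quantile(0.95, sum by (le) (rate(relayer_request_status_duration_seconds_bucket{previous_status="queued"}[5m])))`. The Rust sketch below shows the linear interpolation `histogram_quantile` applies inside a bucket. It is simplified (no handling of NaN, non-monotonic buckets, or non-positive bounds), and `estimate_quantile` is a hypothetical helper, not part of the Relayer.

```rust
/// Estimates quantile `q` (0 < q <= 1) from cumulative histogram buckets
/// `(upper_bound, cumulative_count)`, sorted by bound, with a final `+Inf`
/// bucket. Simplified version of the interpolation in PromQL's
/// `histogram_quantile`: assumes positive bounds and monotonic counts.
fn estimate_quantile(q: f64, buckets: &[(f64, f64)]) -> Option<f64> {
    let &(last_bound, total) = buckets.last()?;
    if buckets.len() < 2 || last_bound != f64::INFINITY || total <= 0.0 || !(q > 0.0 && q <= 1.0) {
        return None;
    }
    // Target rank among all observations.
    let rank = q * total;
    // First bucket whose cumulative count reaches the rank.
    let idx = buckets.iter().position(|&(_, count)| count >= rank)?;
    if idx == buckets.len() - 1 {
        // Rank lies in the +Inf bucket: report the highest finite bound.
        return Some(buckets[idx - 1].0);
    }
    let (upper, count) = buckets[idx];
    let (lower, prev_count) = if idx == 0 { (0.0, 0.0) } else { buckets[idx - 1] };
    // Linear interpolation inside the bucket, assuming uniformly spread values.
    Some(lower + (upper - lower) * (rank - prev_count) / (count - prev_count))
}
```

Because of that uniform-spread assumption, quantile precision depends on how finely the 0.1s to 3600s range is bucketed.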
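Completed requests per second, per request type (averaged over 5 minutes):

```promql
sum by (req_type) (rate(relayer_request_count{status="completed"}[5m]))
```

Failures over the last hour:

```promql
increase(relayer_request_count{status="failure"}[1h])
```

Timeouts over the last hour. While timeouts are applied by pg_cron, this value is under-reported and must not be used for alerting:

```promql
increase(relayer_request_count{status="timed_out"}[1h])
```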
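`increase()` treats the series as a counter: a drop in value, such as a reset after a process restart, is read as a counter reset rather than a negative change. The sketch shows only that reset handling on raw samples; Prometheus additionally extrapolates to the window boundaries, so real `increase()` results can be non-integer. `counter_increase` is a hypothetical helper.

```rust
/// Total increase across consecutive counter samples, treating any drop as a
/// counter reset (the new value is counted as growth from zero). This is the
/// reset handling behind PromQL's `increase`, without its extrapolation.
fn counter_increase(samples: &[f64]) -> f64 {
    samples
        .windows(2)
        .map(|w| if w[1] >= w[0] { w[1] - w[0] } else { w[1] })
        .fold(0.0, |acc, delta| acc + delta)
}
```

For example, samples `3, 5, 9, 2, 4` give `2 + 4 + 2 + 2 = 10`: the drop from 9 to 2 counts as 2 new events, not as -7.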