docs/monitoring.md
This document covers monitoring and observability for Lago's infrastructure components.
Lago exposes Sidekiq metrics through Prometheus, enabling comprehensive monitoring of background job processing. There are two layers of metrics available depending on your Sidekiq license.
```
┌─────────────────────────────────────────────────────────────────────────┐
│                           Metrics Collection                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌──────────────┐      ┌─────────────────────┐      ┌────────────────┐  │
│  │   Sidekiq    │      │   Sidekiq Web UI    │      │   Prometheus   │  │
│  │   Workers    │─────▶│   + Prometheus      │─────▶│                │  │
│  │              │      │     Exporter        │      │                │  │
│  └──────────────┘      └─────────────────────┘      └────────────────┘  │
│                        :3000/prometheus/metrics                         │
│                                                                         │
│  ┌──────────────┐      ┌─────────────────────┐      ┌────────────────┐  │
│  │   Sidekiq    │      │   StatsD Exporter   │      │   Prometheus   │  │
│  │     Pro      │─────▶│     (DogStatsD)     │─────▶│                │  │
│  │  Middleware  │      │                     │      │                │  │
│  └──────────────┘      └─────────────────────┘      └────────────────┘  │
│    (optional)              :12345/metrics                               │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```
Components:

- Sidekiq Web UI (`lago-sidekiqs` service)
- `sidekiq/prometheus/exporter` gem, mounted at `/prometheus/metrics`

The exporter is wired up in `lago-sidekiqs/config.ru`:

```ruby
# lago-sidekiqs/config.ru
require './app'
require 'sidekiq'
require 'sidekiq/web'
require 'sidekiq/prometheus/exporter'
require 'sidekiq/throttled'
require 'sidekiq/throttled/web'

Sidekiq.configure_client do |config|
  config.redis = { url: ENV['REDIS_URL'] }
end

Sidekiq::Web.use(Rack::Session::Cookie, secret: ENV['SESSION_SECRET'])

run Rack::URLMap.new('/' => Sidekiq::Web, '/prometheus/metrics' => Sidekiq::Prometheus::Exporter)
```
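With the exporter mounted, Prometheus can scrape the endpoint directly. A minimal scrape configuration sketch (the job name and the `lago-sidekiqs:3000` target are assumptions; adapt them to your deployment):

```yaml
scrape_configs:
  # Target host/port are deployment-specific; the path matches the
  # Rack::URLMap mount point above.
  - job_name: "lago-sidekiq"
    metrics_path: /prometheus/metrics
    static_configs:
      - targets: ["lago-sidekiqs:3000"]
```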
When running Sidekiq Pro, per-job metrics are reported over DogStatsD. Reporting is enabled by the `LAGO_SIDEKIQ_STATSD_ENDPOINT` environment variable, and metrics are emitted under the `lago_api` namespace. The setup lives in `lago-api/config/initializers/sidekiq.rb`:
```ruby
# lago-api/config/initializers/sidekiq.rb
def configure_sidekiq_pro_metrics(config)
  statsd_endpoint = ENV.fetch("LAGO_SIDEKIQ_STATSD_ENDPOINT", nil)
  if statsd_endpoint.nil?
    Rails.logger.warn "LAGO_SIDEKIQ_STATSD_ENDPOINT not set, Sidekiq Pro metrics will not be reported"
    return
  end

  statsd_host, statsd_port = statsd_endpoint.split(":")
  if statsd_host.empty? || statsd_port.nil? || statsd_port.empty?
    Rails.logger.error "LAGO_SIDEKIQ_STATSD_ENDPOINT invalid format, expected host:port"
    return
  end

  require "datadog/statsd"
  config.dogstatsd = -> {
    Datadog::Statsd.new(statsd_host, statsd_port.to_i,
      tags: ["env:#{config[:environment]}", "service:sidekiq"],
      namespace: Rails.application.name)
  }

  config.server_middleware do |chain|
    require "sidekiq/middleware/server/statsd"
    chain.add Sidekiq::Middleware::Server::Statsd
  end
end
```
These metrics are served by the Sidekiq web service at `/prometheus/metrics` and work with both Sidekiq OSS and Pro.
Global metrics:

| Metric | Type | Description |
|---|---|---|
| `sidekiq_processed_jobs_total` | Counter | Total number of processed jobs (all-time) |
| `sidekiq_failed_jobs_total` | Counter | Total number of failed jobs (all-time) |
| `sidekiq_workers` | Gauge | Total number of worker threads across all processes |
| `sidekiq_processes` | Gauge | Number of Sidekiq processes running |
| `sidekiq_busy_workers` | Gauge | Number of workers currently executing jobs |
| `sidekiq_enqueued_jobs` | Gauge | Total number of jobs waiting in all queues |
| `sidekiq_scheduled_jobs` | Gauge | Number of jobs scheduled for future execution |
| `sidekiq_retry_jobs` | Gauge | Number of jobs waiting to be retried |
| `sidekiq_dead_jobs` | Gauge | Number of jobs in the dead queue |
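As an illustration, the two all-time counters combine into a cluster-wide failure ratio. A recording-rule sketch (the rule name and the 5m window are arbitrary choices):

```yaml
groups:
  - name: sidekiq-overview
    rules:
      # Fraction of processed jobs that failed over the last 5 minutes
      - record: sidekiq:failure_ratio:rate5m
        expr: |
          rate(sidekiq_failed_jobs_total[5m])
            / rate(sidekiq_processed_jobs_total[5m])
```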
Per-host metrics:

| Metric | Type | Labels | Description |
|---|---|---|---|
| `sidekiq_host_processes` | Gauge | `host`, `quiet` | Number of processes per host; `quiet=true` indicates graceful shutdown |
Per-queue metrics:

| Metric | Type | Labels | Description |
|---|---|---|---|
| `sidekiq_queue_latency_seconds` | Gauge | `name` | Time since the oldest job was enqueued (queue delay) |
| `sidekiq_queue_enqueued_jobs` | Gauge | `name` | Number of jobs waiting in the queue |
| `sidekiq_queue_max_processing_time_seconds` | Gauge | `name` | Longest running job execution time |
| `sidekiq_queue_workers` | Gauge | `name` | Number of worker threads serving this queue |
| `sidekiq_queue_processes` | Gauge | `name` | Number of processes serving this queue |
| `sidekiq_queue_busy_workers` | Gauge | `name` | Number of workers currently processing jobs from this queue |
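Similarly, the per-queue worker gauges combine into a utilization ratio that is useful for capacity planning; a sketch, continuing the recording-rule group above:

```yaml
      # Fraction of a queue's worker threads that are currently busy
      - record: sidekiq:queue_utilization
        expr: sidekiq_queue_busy_workers / sidekiq_queue_workers
```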
Available Queues:

| Queue | Worker Type |
|---|---|
| `ai_agent` | AI Agent Worker |
| `analytics` | Analytics Worker |
| `billing` | Billing Worker |
| `clock` | Default Worker (clock jobs) |
| `clock_worker` | Dedicated Clock Worker |
| `default` | Default Worker |
| `events` | Events Worker |
| `high_priority` | Default Worker |
| `integrations` | Default Worker |
| `invoices` | Default Worker |
| `long_running` | Default Worker |
| `low_priority` | Default Worker |
| `mailers` | Default Worker |
| `pdfs` | PDF Worker |
| `providers` | Default Worker |
| `wallets` | Default Worker (deprecated) |
| `webhook` | Default Worker (webhook jobs) |
| `webhook_worker` | Dedicated Webhook Worker |
When using Sidekiq Pro with `LAGO_SIDEKIQ_STATSD_ENDPOINT` configured, additional per-job metrics are available. These metrics are emitted over the DogStatsD protocol and can be converted to Prometheus format using a StatsD exporter.
Set the following environment variable to enable Sidekiq Pro metrics:
```
LAGO_SIDEKIQ_STATSD_ENDPOINT=statsd-exporter:9125
```
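The `statsd-exporter:9125` target points at a statsd_exporter instance ingesting DogStatsD datagrams (9125 is the exporter's usual StatsD listen port). The exporter translates dotted StatsD names to Prometheus underscores by default; for explicit control over the job-duration summary, a mapping sketch follows. The dotted source name `lago_api.jobs.perform` is an assumption; verify it against what the middleware actually emits:

```yaml
mappings:
  # Assumed on-the-wire DogStatsD name for the job timing metric
  - match: "lago_api.jobs.perform"
    observer_type: summary
    name: "lago_api_jobs_perform"
    summary_options:
      quantiles:
        - quantile: 0.5
          error: 0.05
        - quantile: 0.9
          error: 0.01
        - quantile: 0.99
          error: 0.001
```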
The metrics are tagged with:

- `env`: Environment name (e.g., `production`)
- `service`: Always `sidekiq`
- `queue`: Queue name
- `worker`: Job class name

All metrics use the `lago_api_` prefix (application namespace).
| Metric | Type | Labels | Description |
|---|---|---|---|
| `lago_api_jobs_count` | Counter | `queue`, `worker` | Total number of jobs executed |
| `lago_api_jobs_success` | Counter | `queue`, `worker` | Number of successfully completed jobs |
| `lago_api_jobs_failure` | Counter | `queue`, `worker`, `error_type` | Number of failed jobs by error type |
| `lago_api_jobs_perform` | Summary | `queue`, `worker` | Job execution duration (seconds) with p50, p90, p99 quantiles |
| `lago_api_jobs_recovered_fetch` | Counter | `queue` | Number of jobs recovered from interrupted fetches |
Example PromQL queries:

Job failure rate by worker:

```promql
rate(lago_api_jobs_failure[5m]) / rate(lago_api_jobs_count[5m])
```

P99 execution time for billing jobs:

```promql
lago_api_jobs_perform{queue="billing", quantile="0.99"}
```

Top 10 slowest jobs (by median execution time):

```promql
topk(10, lago_api_jobs_perform{quantile="0.5"})
```

Jobs per second by queue:

```promql
sum by (queue) (rate(lago_api_jobs_count[5m]))
```
Recommended Prometheus alerting rules:

```yaml
# Queue latency too high (jobs waiting too long)
- alert: SidekiqQueueLatencyHigh
  expr: sidekiq_queue_latency_seconds > 300
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Sidekiq queue {{ $labels.name }} has high latency"
    description: "Queue {{ $labels.name }} has jobs waiting for {{ $value | humanizeDuration }}"

# Dead jobs accumulating
- alert: SidekiqDeadJobsIncreasing
  expr: increase(sidekiq_dead_jobs[1h]) > 100
  labels:
    severity: critical
  annotations:
    summary: "Sidekiq dead jobs increasing rapidly"
    description: "{{ $value }} jobs moved to dead queue in the last hour"

# No workers available
- alert: SidekiqNoWorkers
  expr: sidekiq_workers == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "No Sidekiq workers available"
    description: "All Sidekiq workers are down"

# Queue backlog building up
- alert: SidekiqQueueBacklog
  expr: sidekiq_queue_enqueued_jobs > 1000
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Sidekiq queue {{ $labels.name }} has backlog"
    description: "Queue {{ $labels.name }} has {{ $value }} jobs waiting"

# High failure rate (requires Sidekiq Pro metrics)
- alert: SidekiqHighFailureRate
  expr: |
    rate(lago_api_jobs_failure[5m])
      / rate(lago_api_jobs_count[5m]) > 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High job failure rate for {{ $labels.worker }}"
    description: "{{ $labels.worker }} has {{ $value | humanizePercentage }} failure rate"

# Worker process in quiet mode (shutting down)
- alert: SidekiqWorkerQuiet
  expr: sidekiq_host_processes{quiet="true"} > 0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Sidekiq worker {{ $labels.host }} in quiet mode"
    description: "Worker has been shutting down for over 10 minutes"

# Slow job execution (requires Sidekiq Pro metrics)
- alert: SidekiqSlowJobs
  expr: lago_api_jobs_perform{quantile="0.99"} > 30
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Slow job execution for {{ $labels.worker }}"
    description: "P99 execution time is {{ $value }}s"

# Retry queue has jobs
- alert: SidekiqRetryQueueNotEmpty
  expr: sidekiq_retry_jobs > 50
  for: 15m
  labels:
    severity: info
  annotations:
    summary: "Sidekiq retry queue has pending jobs"
    description: "{{ $value }} jobs waiting to be retried"
```
Key panels to include in your Sidekiq monitoring dashboard:

- Overview Row
- Queue Health Row
- Worker Health Row
- Job Performance Row (requires Sidekiq Pro)
- Capacity Planning Row
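As a starting point, each row can be driven by the metrics documented above. The grouping below is simply a list of candidate PromQL expressions per row, not a Grafana schema; `sidekiq:queue_utilization` assumes the recording rule sketched earlier:

```yaml
overview:
  processed_rate: 'rate(sidekiq_processed_jobs_total[5m])'
  failed_rate: 'rate(sidekiq_failed_jobs_total[5m])'
  enqueued_jobs: 'sidekiq_enqueued_jobs'
queue_health:
  latency: 'sidekiq_queue_latency_seconds'
  backlog: 'sidekiq_queue_enqueued_jobs'
worker_health:
  busy_workers: 'sidekiq_busy_workers'
  processes: 'sidekiq_processes'
  quiet_processes: 'sidekiq_host_processes{quiet="true"}'
job_performance: # requires Sidekiq Pro
  p99_duration: 'lago_api_jobs_perform{quantile="0.99"}'
capacity_planning:
  queue_utilization: 'sidekiq:queue_utilization'
```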