docs/batch-queue-metrics.md
This document provides a comprehensive breakdown of all metrics emitted by the Batch Queue and Fair Queue systems, including what they mean and how to identify degraded system states.
The batch queue system consists of two layers:
- Batch Queue (`batch_queue.*`) - High-level batch processing metrics
- Fair Queue (`batch-queue.*`) - Low-level message queue metrics (with `name: "batch-queue"`)

Both layers emit metrics that together provide full observability into batch processing.
These metrics track batch-level operations.

| Metric | Description | Labels |
|---|---|---|
| `batch_queue.batches_enqueued` | Number of batches initialized for processing | `envId`, `itemCount`, `streaming` |
| `batch_queue.items_enqueued` | Number of individual batch items enqueued | `envId` |
| `batch_queue.items_processed` | Number of batch items successfully processed (turned into runs) | `envId` |
| `batch_queue.items_failed` | Number of batch items that failed processing | `envId`, `errorCode` |
| `batch_queue.batches_completed` | Number of batches that completed (all items processed) | `envId`, `hasFailures` |

| Metric | Description | Unit | Labels |
|---|---|---|---|
| `batch_queue.batch_processing_duration` | Time from batch creation to completion | ms | `envId`, `itemCount` |
| `batch_queue.item_queue_time` | Time from item enqueue to processing start | ms | `envId` |

These metrics track the underlying message queue operations. With the batch queue configuration, they are prefixed with `batch-queue.`.

| Metric | Description |
|---|---|
| `batch-queue.messages.enqueued` | Number of messages (batch items) added to the queue |
| `batch-queue.messages.completed` | Number of messages successfully processed |
| `batch-queue.messages.failed` | Number of messages that failed processing |
| `batch-queue.messages.retried` | Number of message retry attempts |
| `batch-queue.messages.dlq` | Number of messages sent to dead letter queue |

| Metric | Description | Unit |
|---|---|---|
| `batch-queue.message.processing_time` | Time to process a single message | ms |
| `batch-queue.message.queue_time` | Time a message spent waiting in queue | ms |

| Metric | Description | Labels |
|---|---|---|
| `batch-queue.queue.length` | Current number of messages in a queue | `fairqueue.queue_id` |
| `batch-queue.master_queue.length` | Number of active queues in the master queue shard | `fairqueue.shard_id` |
| `batch-queue.inflight.count` | Number of messages currently being processed | `fairqueue.shard_id` |
| `batch-queue.dlq.length` | Number of messages in the dead letter queue | `fairqueue.tenant_id` |

Understanding how metrics relate helps diagnose issues:

- `batches_enqueued × avg_items_per_batch ≈ items_enqueued`
- `items_enqueued = items_processed + items_failed + items_pending`
- `batches_completed ≤ batches_enqueued` (lag indicates processing backlog)

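The relationships above can be turned into derived values. The following is a minimal sketch, assuming the cumulative counter values have already been read from the metrics backend; `BatchCounters` and the helper functions are hypothetical names used only for illustration, and `items_pending` is a derived quantity rather than an emitted metric.

```python
from dataclasses import dataclass


@dataclass
class BatchCounters:
    """Cumulative counter values read at one point in time (hypothetical container)."""
    batches_enqueued: int
    batches_completed: int
    items_enqueued: int
    items_processed: int
    items_failed: int


def items_pending(c: BatchCounters) -> int:
    # Rearranged from: items_enqueued = items_processed + items_failed + items_pending
    return c.items_enqueued - c.items_processed - c.items_failed


def batch_lag(c: BatchCounters) -> int:
    # batches_completed <= batches_enqueued, so the gap is the batch backlog
    return c.batches_enqueued - c.batches_completed


def avg_items_per_batch(c: BatchCounters) -> float:
    # From: batches_enqueued x avg_items_per_batch ~= items_enqueued
    return c.items_enqueued / c.batches_enqueued if c.batches_enqueued else 0.0


snapshot = BatchCounters(
    batches_enqueued=40,
    batches_completed=35,
    items_enqueued=2000,
    items_processed=1700,
    items_failed=50,
)
print(items_pending(snapshot), batch_lag(snapshot), avg_items_per_batch(snapshot))
# prints: 250 5 50.0
```

A steadily growing `items_pending` or `batch_lag` across successive snapshots is the backlog signal the third relationship describes.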
Symptoms:
- `batch_queue.items_processed` rate drops to 0
- `batch-queue.inflight.count` is 0
- `batch-queue.master_queue.length` is growing

Likely Causes:

Actions:

Symptoms:
- `batch_queue.item_queue_time` p99 > 60 seconds
- `batch-queue.queue.length` growing continuously
- `batch-queue.inflight.count` at max capacity

Likely Causes:

Actions:
- `BATCH_QUEUE_CONSUMER_COUNT`
- `BATCH_QUEUE_GLOBAL_RATE_LIMIT` setting

Symptoms:
- `batch_queue.items_failed` rate > 5% of `items_processed`
- `batch-queue.messages.dlq` increasing

Likely Causes:

Actions:
- `errorCode` label distribution on `items_failed`

Symptoms:
- `batch_queue.batches_enqueued - batch_queue.batches_completed` is increasing over time
- `batch-queue.master_queue.length` trending upward

Likely Causes:

Actions:

Symptoms:
- `envId` labels show much higher `item_queue_time` than others

Likely Causes:

Actions:
- `BATCH_CONCURRENCY_*` environment settings

Symptoms:
- `batch_queue.item_queue_time` has periodic spikes

Likely Causes:
- `BATCH_QUEUE_GLOBAL_RATE_LIMIT` is set too low

Actions:

```promql
# Throughput
rate(batch_queue_items_processed_total[5m])
rate(batch_queue_items_failed_total[5m])

# Success Rate
# sum() drops labels first: items_failed carries errorCode, items_processed does not,
# so the unaggregated series would not match in the binary operation.
sum(rate(batch_queue_items_processed_total[5m])) /
(sum(rate(batch_queue_items_processed_total[5m])) + sum(rate(batch_queue_items_failed_total[5m])))

# Batch Completion Rate
# sum() is needed here too: the two counters carry different label sets.
sum(rate(batch_queue_batches_completed_total[5m])) / sum(rate(batch_queue_batches_enqueued_total[5m]))

# Item Queue Time (p50, p95, p99)
histogram_quantile(0.50, rate(batch_queue_item_queue_time_bucket[5m]))
histogram_quantile(0.95, rate(batch_queue_item_queue_time_bucket[5m]))
histogram_quantile(0.99, rate(batch_queue_item_queue_time_bucket[5m]))

# Batch Processing Duration
histogram_quantile(0.95, rate(batch_queue_batch_processing_duration_bucket[5m]))

# Current backlog
batch_queue_master_queue_length
batch_queue_inflight_count

# DLQ (should be 0)
batch_queue_dlq_length
```

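The query names above differ from the dotted metric names in the tables. The following sketch shows the mapping they appear to follow, assuming an exporter that replaces characters Prometheus does not allow with underscores and appends `_total` to counters and `_bucket` to histogram bucket series. `prometheus_name` is a hypothetical helper written for illustration, and the exact exporter behavior (for example unit suffixes) should be confirmed against the real deployment.

```python
import re


def prometheus_name(metric: str, kind: str) -> str:
    """Map a dotted metric name to the name used in the queries above.

    kind is one of "counter", "histogram", or "gauge" (assumed categories).
    """
    # Dots and hyphens are not valid in Prometheus metric names
    base = re.sub(r"[^a-zA-Z0-9_:]", "_", metric)
    if kind == "counter":
        return base + "_total"
    if kind == "histogram":
        return base + "_bucket"
    return base


print(prometheus_name("batch_queue.items_processed", "counter"))
print(prometheus_name("batch_queue.item_queue_time", "histogram"))
print(prometheus_name("batch-queue.master_queue.length", "gauge"))
```

Under this mapping both `batch_queue.*` and `batch-queue.*` collapse to the same `batch_queue_` prefix, which is why `batch_queue_dlq_length` in the queries refers to the Fair Queue metric `batch-queue.dlq.length`.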
| Condition | Severity | Threshold |
|---|---|---|
| Processing stopped | Critical | items_processed rate = 0 for 5min |
| High failure rate | Warning | items_failed / items_processed > 0.05 |
| Queue time p99 | Warning | > 30 seconds |
| Queue time p99 | Critical | > 120 seconds |
| DLQ length | Warning | > 0 |
| Batch completion lag | Warning | batches_enqueued - batches_completed > 100 |
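
The thresholds in the table can be checked mechanically. The following is a minimal sketch, assuming the rates are already averaged over a 5 minute window and the p99 queue time is reported in milliseconds; `Snapshot` and `evaluate_alerts` are hypothetical names, and a real alerting rule would normally live in the monitoring system rather than in application code.

```python
from typing import List, NamedTuple, Tuple


class Snapshot(NamedTuple):
    items_processed_rate: float  # items/sec, averaged over the last 5 minutes
    items_failed_rate: float     # items/sec, averaged over the last 5 minutes
    queue_time_p99_ms: float     # p99 of batch_queue.item_queue_time
    dlq_length: int
    batches_enqueued: int
    batches_completed: int


def evaluate_alerts(s: Snapshot) -> List[Tuple[str, str]]:
    """Return (condition, severity) pairs for every threshold that is exceeded."""
    alerts = []
    if s.items_processed_rate == 0:
        alerts.append(("Processing stopped", "Critical"))
    elif s.items_failed_rate / s.items_processed_rate > 0.05:
        # Ratio is only evaluated when something was processed (avoids division by zero)
        alerts.append(("High failure rate", "Warning"))
    if s.queue_time_p99_ms > 120_000:
        alerts.append(("Queue time p99", "Critical"))
    elif s.queue_time_p99_ms > 30_000:
        alerts.append(("Queue time p99", "Warning"))
    if s.dlq_length > 0:
        alerts.append(("DLQ length", "Warning"))
    if s.batches_enqueued - s.batches_completed > 100:
        alerts.append(("Batch completion lag", "Warning"))
    return alerts


for condition, severity in evaluate_alerts(Snapshot(12.0, 1.2, 45_000, 3, 500, 350)):
    print(f"{severity}: {condition}")
```

The critical queue-time check runs before the warning check so that a single snapshot reports only the more severe of the two.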

| Variable | Impact |
|---|---|
| `BATCH_QUEUE_CONSUMER_COUNT` | More consumers = higher throughput, lower queue time |
| `BATCH_QUEUE_CONSUMER_INTERVAL_MS` | Lower = more frequent polling, higher throughput |
| `BATCH_QUEUE_GLOBAL_RATE_LIMIT` | Caps max items/sec, increases queue time if too low |
| `BATCH_CONCURRENCY_FREE/PAID/ENTERPRISE` | Per-tenant concurrency limits |
| `BATCH_QUEUE_DRR_QUANTUM` | Credits per tenant per round (fairness tuning) |
| `BATCH_QUEUE_MAX_DEFICIT` | Max accumulated credits (prevents starvation) |

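
The `BATCH_QUEUE_DRR_QUANTUM` and `BATCH_QUEUE_MAX_DEFICIT` settings suggest deficit round robin (DRR) scheduling across tenants. The following is a generic, textbook-style DRR sketch to illustrate how a quantum and a deficit cap interact. It is not the Fair Queue implementation: `drr_schedule`, the unit cost per item, and the reset-on-empty rule are all assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Dict, List


def drr_schedule(
    queues: Dict[str, Deque[str]],
    quantum: int,
    max_deficit: int,
    rounds: int,
) -> List[str]:
    """Serve items from per-tenant queues using deficit round robin.

    Each item is assumed to cost one credit.
    """
    deficits = {tenant: 0 for tenant in queues}
    served: List[str] = []
    for _ in range(rounds):
        for tenant, queue in queues.items():
            if not queue:
                deficits[tenant] = 0  # idle tenants do not bank credits
                continue
            # Grant this round's credits, capped so credits cannot grow without bound
            deficits[tenant] = min(deficits[tenant] + quantum, max_deficit)
            while queue and deficits[tenant] >= 1:
                served.append(queue.popleft())
                deficits[tenant] -= 1
            if not queue:
                deficits[tenant] = 0
    return served


queues = {
    "tenant-a": deque(["a1", "a2", "a3", "a4", "a5"]),
    "tenant-b": deque(["b1", "b2"]),
}
print(drr_schedule(queues, quantum=2, max_deficit=4, rounds=3))
# prints: ['a1', 'a2', 'b1', 'b2', 'a3', 'a4', 'a5']
```

In this simplified model a larger quantum lets each tenant drain more items per turn, which raises throughput for busy tenants at the cost of longer waits for the others. With unit-cost items the cap only takes effect when the quantum exceeds it; it matters more when item costs vary.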
When investigating batch queue issues:
- `engine:batch-queue:batch:drr:deficit` hash in Redis
- `BatchTaskRun` table for stuck `PROCESSING` batches