Batch Queue & Fair Queue Metrics Guide


This document provides a comprehensive breakdown of all metrics emitted by the Batch Queue and Fair Queue systems, including what they mean and how to identify degraded system states.

Overview

The batch queue system consists of two layers:

  1. BatchQueue (batch_queue.*) - High-level batch processing metrics
  2. FairQueue (batch-queue.*) - Low-level message queue metrics, emitted by the FairQueue instance configured with name: "batch-queue"

Both layers emit metrics that together provide full observability into batch processing.
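
The tables in this guide use dotted names (batch_queue.items_processed, batch-queue.queue.length), while the PromQL queries later in the guide use underscore names such as batch_queue_items_processed_total. The sketch below shows one plausible mapping between the two. It assumes the exporter follows the common OpenTelemetry-to-Prometheus convention of replacing disallowed characters with underscores and appending _total to counters; the helper name to_prometheus_name is hypothetical, and a real exporter may also append unit suffixes.

```python
import re


def to_prometheus_name(metric_name: str, is_counter: bool = False) -> str:
    """Approximate the Prometheus series name for a dotted metric name.

    Assumption: every character outside [a-zA-Z0-9_:] becomes "_" and
    counters gain a "_total" suffix. Check your exporter's actual output.
    """
    name = re.sub(r"[^a-zA-Z0-9_:]", "_", metric_name)
    if is_counter and not name.endswith("_total"):
        name += "_total"
    return name


print(to_prometheus_name("batch_queue.items_processed", is_counter=True))
print(to_prometheus_name("batch-queue.master_queue.length"))
```

Under this assumption both layers end up with the same batch_queue_ prefix in Prometheus, which matches the query names used in this guide.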


BatchQueue Metrics

These metrics track batch-level operations.

Counters

| Metric | Description | Labels |
| --- | --- | --- |
| batch_queue.batches_enqueued | Number of batches initialized for processing | envId, itemCount, streaming |
| batch_queue.items_enqueued | Number of individual batch items enqueued | envId |
| batch_queue.items_processed | Number of batch items successfully processed (turned into runs) | envId |
| batch_queue.items_failed | Number of batch items that failed processing | envId, errorCode |
| batch_queue.batches_completed | Number of batches that completed (all items processed) | envId, hasFailures |

Histograms

| Metric | Description | Unit | Labels |
| --- | --- | --- | --- |
| batch_queue.batch_processing_duration | Time from batch creation to completion | ms | envId, itemCount |
| batch_queue.item_queue_time | Time from item enqueue to processing start | ms | envId |

FairQueue Metrics (batch-queue namespace)

These metrics track the underlying message queue operations. With the batch queue configuration, their names start with the prefix "batch-queue." (with a hyphen, unlike the "batch_queue." prefix of the BatchQueue layer).

Counters

| Metric | Description |
| --- | --- |
| batch-queue.messages.enqueued | Number of messages (batch items) added to the queue |
| batch-queue.messages.completed | Number of messages successfully processed |
| batch-queue.messages.failed | Number of messages that failed processing |
| batch-queue.messages.retried | Number of message retry attempts |
| batch-queue.messages.dlq | Number of messages sent to dead letter queue |

Histograms

| Metric | Description | Unit |
| --- | --- | --- |
| batch-queue.message.processing_time | Time to process a single message | ms |
| batch-queue.message.queue_time | Time a message spent waiting in queue | ms |

Observable Gauges

| Metric | Description | Labels |
| --- | --- | --- |
| batch-queue.queue.length | Current number of messages in a queue | fairqueue.queue_id |
| batch-queue.master_queue.length | Number of active queues in the master queue shard | fairqueue.shard_id |
| batch-queue.inflight.count | Number of messages currently being processed | fairqueue.shard_id |
| batch-queue.dlq.length | Number of messages in the dead letter queue | fairqueue.tenant_id |

Key Relationships

Understanding how metrics relate helps diagnose issues:

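batches_enqueued × avg_items_per_batch ≈ items_enqueued
items_enqueued = items_processed + items_failed + items_pending
batches_completed ≤ batches_enqueued (lag indicates processing backlog)

Here avg_items_per_batch and items_pending are derived quantities, not emitted metrics: items_pending is whatever has been enqueued but neither processed nor failed yet.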
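
The relationships above can be turned into a few derived values. This is a minimal sketch that assumes you have already read the cumulative counter values for one environment; the BatchCounters type and the function names are illustrative and not part of the codebase.

```python
from dataclasses import dataclass


@dataclass
class BatchCounters:
    """Cumulative counter values for one envId (hypothetical container)."""
    batches_enqueued: int
    batches_completed: int
    items_enqueued: int
    items_processed: int
    items_failed: int


def items_pending(c: BatchCounters) -> int:
    # items_enqueued = items_processed + items_failed + items_pending
    return max(0, c.items_enqueued - c.items_processed - c.items_failed)


def avg_items_per_batch(c: BatchCounters) -> float:
    # batches_enqueued x avg_items_per_batch ~= items_enqueued
    return c.items_enqueued / c.batches_enqueued if c.batches_enqueued else 0.0


def batch_backlog(c: BatchCounters) -> int:
    # batches_completed <= batches_enqueued; the gap is the backlog
    return max(0, c.batches_enqueued - c.batches_completed)


counters = BatchCounters(40, 35, 1000, 900, 20)
print(items_pending(counters), avg_items_per_batch(counters), batch_backlog(counters))
```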

Degraded State Indicators

🔴 Critical Issues

1. Processing Stopped

Symptoms:

  • batch_queue.items_processed rate drops to 0
  • batch-queue.inflight.count is 0
  • batch-queue.master_queue.length is growing

Likely Causes:

  • Consumer loops crashed
  • Redis connection issues
  • All consumers blocked by concurrency limits

Actions:

  • Check the webapp logs for the "BatchQueue consumers started" message
  • Verify Redis connectivity
  • Check for "Unknown concurrency group" errors

2. Items Stuck in Queue

Symptoms:

  • batch_queue.item_queue_time p99 > 60 seconds
  • batch-queue.queue.length growing continuously
  • batch-queue.inflight.count at max capacity

Likely Causes:

  • Processing is slower than ingestion
  • Concurrency limits too restrictive
  • Global rate limiter bottleneck

Actions:

  • Increase BATCH_QUEUE_CONSUMER_COUNT
  • Review concurrency limits per environment
  • Check BATCH_QUEUE_GLOBAL_RATE_LIMIT setting
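
When processing is slower than ingestion, a rough estimate of the impact can help decide how far to scale. The sketch below applies Little's law style arithmetic (wait ≈ queue length / throughput) and assumes steady rates, which real traffic rarely has; the function names are illustrative.

```python
import math


def estimated_wait_seconds(queue_length: int, processed_per_second: float) -> float:
    """Approximate wait for an item joining the back of the queue."""
    if processed_per_second <= 0:
        return math.inf  # nothing is being processed
    return queue_length / processed_per_second


def seconds_to_drain(queue_length: int, enqueued_per_second: float,
                     processed_per_second: float) -> float:
    """Time until the backlog is gone, given current in and out rates."""
    net_rate = processed_per_second - enqueued_per_second
    if net_rate <= 0:
        return math.inf  # the backlog is flat or still growing
    return queue_length / net_rate


print(estimated_wait_seconds(6000, 50.0))   # 120.0
print(seconds_to_drain(6000, 30.0, 50.0))   # 300.0
```

An infinite result from seconds_to_drain means the current consumer count or rate limit cannot catch up without a change.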

3. High Failure Rate

Symptoms:

  • batch_queue.items_failed rate exceeds 5% of the batch_queue.items_processed rate
  • batch-queue.messages.dlq increasing

Likely Causes:

  • TriggerTaskService errors
  • Invalid task identifiers
  • Downstream service issues

Actions:

  • Check errorCode label distribution on items_failed
  • Review batch error records in database
  • Check TriggerTaskService logs
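
The failure-rate check and the errorCode breakdown can be sketched as follows. The ratio is items_failed / items_processed, as used in this guide. The error code strings in the example are made up for illustration and are not real codes from the system.

```python
from collections import Counter


def failure_ratio(failed_rate: float, processed_rate: float) -> float:
    """items_failed / items_processed over the same time window."""
    if processed_rate == 0:
        return float("inf") if failed_rate > 0 else 0.0
    return failed_rate / processed_rate


def top_error_codes(failed_by_code: dict[str, int], n: int = 3) -> list[tuple[str, int]]:
    """Return the n most frequent errorCode label values."""
    return Counter(failed_by_code).most_common(n)


# Hypothetical counts per errorCode label value.
failed = {"EXAMPLE_CODE_A": 45, "EXAMPLE_CODE_B": 12, "EXAMPLE_CODE_C": 3}

ratio = failure_ratio(sum(failed.values()), 1000)
print(ratio > 0.05, top_error_codes(failed, n=2))
```

If one error code dominates, the cause is usually narrower than a general downstream outage.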

🟡 Warning Signs

4. Growing Backlog

Symptoms:

  • batch_queue.batches_enqueued - batch_queue.batches_completed is increasing over time
  • batch-queue.master_queue.length trending upward

Likely Causes:

  • Sustained high load
  • Processing capacity insufficient
  • Specific tenants monopolizing resources

Actions:

  • Monitor the DRR (deficit round robin) deficit distribution across tenants
  • Consider scaling consumers
  • Review per-tenant concurrency settings
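
To tell whether the backlog is increasing over time rather than just fluctuating, one option is to fit a line through periodic samples of batches_enqueued - batches_completed. The sketch below uses a plain least-squares slope; the sample format and function name are assumptions made for illustration.

```python
def backlog_slope(samples: list[tuple[float, int, int]]) -> float:
    """Least-squares slope of the backlog, in batches per second.

    Each sample is (timestamp_seconds, batches_enqueued, batches_completed).
    A positive slope means the backlog is growing.
    """
    if len(samples) < 2:
        return 0.0
    times = [t for t, _, _ in samples]
    backlog = [enq - done for _, enq, done in samples]
    mean_t = sum(times) / len(times)
    mean_b = sum(backlog) / len(backlog)
    denom = sum((t - mean_t) ** 2 for t in times)
    if denom == 0:
        return 0.0  # all samples share one timestamp
    numer = sum((t - mean_t) * (b - mean_b) for t, b in zip(times, backlog))
    return numer / denom


# Backlog of 10, 30, 50 batches at one-minute intervals.
samples = [(0, 100, 90), (60, 160, 130), (120, 220, 170)]
print(backlog_slope(samples) * 60)  # growth in batches per minute
```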

5. Uneven Tenant Processing

Symptoms:

  • Some envId labels show much higher item_queue_time than others
  • DRR logs show "tenants blocked by concurrency" frequently

Likely Causes:

  • Concurrency limits too low for high-volume tenants
  • DRR quantum/maxDeficit misconfigured

Actions:

  • Review BATCH_CONCURRENCY_* environment settings
  • Adjust DRR parameters if needed

6. Rate Limit Impact

Symptoms:

  • batch_queue.item_queue_time has periodic spikes
  • Logs show "Global rate limit reached, waiting"

Likely Causes:

  • BATCH_QUEUE_GLOBAL_RATE_LIMIT is set too low

Actions:

  • Increase the global rate limit if the system can handle more throughput
  • Otherwise, accept the spikes as intentional throttling

Processing Health

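```promql
# Throughput (per label set)
rate(batch_queue_items_processed_total[5m])
rate(batch_queue_items_failed_total[5m])

# Success Rate
# sum() aggregates away the labels so both sides match
# (items_failed carries an extra errorCode label)
sum(rate(batch_queue_items_processed_total[5m])) /
  (sum(rate(batch_queue_items_processed_total[5m])) + sum(rate(batch_queue_items_failed_total[5m])))

# Batch Completion Rate
# The two counters carry different labels, so aggregate before dividing
sum(rate(batch_queue_batches_completed_total[5m])) / sum(rate(batch_queue_batches_enqueued_total[5m]))
```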

Latency

```promql
# Item Queue Time (p50, p95, p99), aggregated across all environments
histogram_quantile(0.50, sum by (le) (rate(batch_queue_item_queue_time_bucket[5m])))
histogram_quantile(0.95, sum by (le) (rate(batch_queue_item_queue_time_bucket[5m])))
histogram_quantile(0.99, sum by (le) (rate(batch_queue_item_queue_time_bucket[5m])))

# Batch Processing Duration
histogram_quantile(0.95, sum by (le) (rate(batch_queue_batch_processing_duration_bucket[5m])))
```

Queue Depth

```promql
# Current backlog
batch_queue_master_queue_length
batch_queue_inflight_count

# DLQ (should be 0)
batch_queue_dlq_length
```
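
These queries can also be run outside a dashboard through the Prometheus HTTP API instant-query endpoint (/api/v1/query). The sketch below only builds the request URL and parses a response body, so it runs without a server. The base URL and the label name in the sample payload are assumptions.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL for the given expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload: dict) -> dict[str, float]:
    """Map each series' labels (as a sorted "k=v" string) to its value."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    values = {}
    for series in payload["data"]["result"]:
        labels = ",".join(f"{k}={v}" for k, v in sorted(series["metric"].items()))
        values[labels] = float(series["value"][1])  # value is [timestamp, "string"]
    return values


# Assumed local Prometheus address; adjust for your deployment.
print(instant_query_url("http://localhost:9090", "batch_queue_dlq_length"))

# Hand-written payload in the documented response shape (not real data).
sample = {
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"fairqueue_tenant_id": "t1"}, "value": [1700000000, "3"]},
    ]},
}
print(parse_instant_vector(sample))
```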

Alert Thresholds (Suggested)

| Condition | Severity | Threshold |
| --- | --- | --- |
| Processing stopped | Critical | items_processed rate = 0 for 5 min |
| High failure rate | Warning | items_failed / items_processed > 0.05 |
| Queue time p99 | Warning | > 30 seconds |
| Queue time p99 | Critical | > 120 seconds |
| DLQ length | Warning | > 0 |
| Batch completion lag | Warning | batches_enqueued - batches_completed > 100 |
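
The suggested thresholds can be expressed as a single check over a metrics snapshot. This is a sketch only: the Snapshot type is hypothetical, the rates are assumed to be 5-minute averages, and a real deployment would more likely encode these as alerting rules in its monitoring system.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical point-in-time view of the metrics in this guide."""
    items_processed_rate: float    # items/sec, averaged over 5 minutes
    items_failed_rate: float       # items/sec, averaged over 5 minutes
    queue_time_p99_seconds: float
    dlq_length: int
    batches_enqueued: int
    batches_completed: int


def evaluate_alerts(s: Snapshot) -> list[tuple[str, str]]:
    """Return (severity, condition) pairs for every threshold exceeded."""
    alerts = []
    if s.items_processed_rate == 0:
        alerts.append(("critical", "processing stopped"))
    elif s.items_failed_rate / s.items_processed_rate > 0.05:
        alerts.append(("warning", "high failure rate"))
    if s.queue_time_p99_seconds > 120:
        alerts.append(("critical", "queue time p99"))
    elif s.queue_time_p99_seconds > 30:
        alerts.append(("warning", "queue time p99"))
    if s.dlq_length > 0:
        alerts.append(("warning", "dlq length"))
    if s.batches_enqueued - s.batches_completed > 100:
        alerts.append(("warning", "batch completion lag"))
    return alerts


print(evaluate_alerts(Snapshot(10.0, 1.0, 45.0, 2, 500, 350)))
```

A zero processing rate also occurs when nothing is enqueued, so in practice the "processing stopped" condition is best combined with a non-empty queue check.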

Environment Variables Affecting Metrics

| Variable | Impact |
| --- | --- |
| BATCH_QUEUE_CONSUMER_COUNT | More consumers = higher throughput, lower queue time |
| BATCH_QUEUE_CONSUMER_INTERVAL_MS | Lower = more frequent polling, higher throughput |
| BATCH_QUEUE_GLOBAL_RATE_LIMIT | Caps max items/sec, increases queue time if too low |
| BATCH_CONCURRENCY_FREE/PAID/ENTERPRISE | Per-tenant concurrency limits |
| BATCH_QUEUE_DRR_QUANTUM | Credits per tenant per round (fairness tuning) |
| BATCH_QUEUE_MAX_DEFICIT | Max accumulated credits (prevents starvation) |
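
To make the quantum and max-deficit settings more concrete, here is a textbook deficit round robin (DRR) loop. It is a generic illustration, not the FairQueue implementation: every item costs one credit, tenant names are made up, and concurrency limits are not modelled.

```python
from collections import deque


def drr_schedule(queues: dict[str, deque], quantum: int,
                 max_deficit: int, rounds: int) -> list[str]:
    """Return the processing order produced by a simple DRR scheduler."""
    deficit = {tenant: 0 for tenant in queues}
    order = []
    for _ in range(rounds):
        for tenant, queue in queues.items():
            if not queue:
                deficit[tenant] = 0  # idle tenants do not bank credits
                continue
            # Earn credits for this round, capped at max_deficit.
            deficit[tenant] = min(deficit[tenant] + quantum, max_deficit)
            # Spend one credit per item while credits and work remain.
            while queue and deficit[tenant] >= 1:
                order.append(f"{tenant}:{queue.popleft()}")
                deficit[tenant] -= 1
    return order


queues = {"A": deque(range(6)), "B": deque(range(2))}
print(drr_schedule(queues, quantum=2, max_deficit=4, rounds=3))
```

With a quantum of 2, tenant B's two items are served in the first round instead of waiting behind all six of tenant A's items. In this simplified model the cap only matters when a tenant cannot spend its credits in a round, which is not simulated here.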

Debugging Checklist

When investigating batch queue issues:

  1. Check consumer status: Look for "BatchQueue consumers started" in logs
  2. Check Redis: Verify connection and inspect keys with prefix engine:batch-queue:
  3. Check concurrency: Look for "tenants blocked by concurrency" debug logs
  4. Check rate limits: Look for "Global rate limit reached" debug logs
  5. Check DRR state: Query batch:drr:deficit hash in Redis
  6. Check batch status: Query BatchTaskRun table for stuck PROCESSING batches
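
Steps 2 and 5 of the checklist can be scripted. The sketch below accepts any client object that offers scan_iter and hgetall, as a redis-py client does. The key prefix and hash name are the ones quoted in the checklist. Whether the deficit hash is stored under that exact key or carries an additional prefix is an assumption to verify against your deployment.

```python
def inspect_batch_queue(client, key_prefix: str = "engine:batch-queue:",
                        deficit_key: str = "batch:drr:deficit",
                        sample: int = 20) -> dict:
    """Collect a sample of batch queue keys and the DRR deficit hash.

    `client` is expected to behave like a redis-py client. SCAN is used
    instead of KEYS so a large keyspace is not blocked.
    """
    keys = []
    for key in client.scan_iter(match=f"{key_prefix}*", count=100):
        keys.append(key)
        if len(keys) >= sample:
            break
    # Assumed key name; the real key may include a prefix.
    deficits = client.hgetall(deficit_key)
    return {"sample_keys": keys, "drr_deficits": deficits}
```

With redis-py installed, a client created as redis.Redis(host="localhost", port=6379, decode_responses=True) could be passed in; the host and port here are placeholders.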