docs/handbook/engineering/postmortems/2026-03-redis-queue-events-overload.mdx
On Friday, March 20, 2026, BullMQ's QueueEvents caused every worker to broadcast job lifecycle events to all connected app instances. As traffic grew, Redis output buffers filled faster than clients could drain them, eventually exhausting Redis memory. At that point the runsMetadata queue stopped consuming, workers crashed, and flow execution logs were delayed by up to 8 hours before appearing in the UI.
The incident lasted the entire day. Mitigation involved repeated server restarts and manual cleanup to minimize customer impact while the root cause was identified. The fix was to revert the QueueEvents change.
runsMetadata jobs were eventually resumed and indexed.

All times are in UTC.
QueueEvents broadcast volume overwhelms output buffers. The runsMetadata queue stops consuming and workers start crashing.

Root cause identified as QueueEvents broadcasting. Change reverted. Redis memory recovers, workers resume, and all backed-up runsMetadata jobs are processed and indexed.

BullMQ's QueueEvents feature subscribes each worker instance to a Redis stream of all job lifecycle events (started, completed, failed, etc.) for the queues it listens to. In a multi-instance deployment, this means every app server receives every event from every other server.
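The fan-out described above can be put into rough numbers. The sketch below is a back-of-the-envelope model, not a measurement: the function names and every figure in it are hypothetical and do not come from the incident.

```typescript
// Hypothetical model inputs. None of these figures come from the incident.
interface FanOutInput {
  instances: number;       // app instances, each with its own QueueEvents listener
  eventsPerSecond: number; // job lifecycle events emitted across all queues
  bytesPerEvent: number;   // average serialized size of one event
}

// Bytes per second Redis has to deliver when every instance
// receives every event.
function fanOutBytesPerSecond(input: FanOutInput): number {
  return input.instances * input.eventsPerSecond * input.bytesPerEvent;
}

// Bytes per second that pile up in Redis output buffers when each
// instance can read at most `drainBytesPerSecond` from its connection.
function bufferGrowthBytesPerSecond(
  input: FanOutInput,
  drainBytesPerSecond: number,
): number {
  const inflowPerInstance = input.eventsPerSecond * input.bytesPerEvent;
  const backlogPerInstance = Math.max(0, inflowPerInstance - drainBytesPerSecond);
  return input.instances * backlogPerInstance;
}

// Illustrative values only.
const hypothetical: FanOutInput = {
  instances: 10,
  eventsPerSecond: 2000,
  bytesPerEvent: 500,
};
console.log(fanOutBytesPerSecond(hypothetical));
console.log(bufferGrowthBytesPerSecond(hypothetical, 800_000));
```

The point of the model is that delivery cost scales with the number of instances multiplied by the event rate, so adding app servers to handle more traffic also multiplies the load each event places on Redis.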
As traffic grew, the volume of events exceeded the rate at which clients could read them. Redis buffers these unread events in per-client output buffers. When the cumulative buffer size exceeded available Redis memory, Redis could no longer accept writes. The runsMetadata queue — which records execution logs for the UI — was the first visible casualty, but all queue operations were degraded.
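Per-client output buffer usage is visible in the `omem` field of the Redis `CLIENT LIST` command. As a minimal sketch, assuming the raw `CLIENT LIST` text has already been fetched by whatever Redis client is in use, the helpers below (hypothetical names, not part of BullMQ or any Redis library) rank connections by output buffer size:

```typescript
// One parsed row of `CLIENT LIST` output, reduced to the fields used here.
interface ClientBufferUsage {
  id: string;
  addr: string;
  name: string;
  omem: number; // output buffer memory in bytes, as reported by Redis
}

// Parses the space-separated key=value lines returned by `CLIENT LIST`.
function parseClientList(raw: string): ClientBufferUsage[] {
  return raw
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => {
      const fields = new Map<string, string>();
      for (const pair of line.split(" ")) {
        const eq = pair.indexOf("=");
        if (eq > 0) fields.set(pair.slice(0, eq), pair.slice(eq + 1));
      }
      return {
        id: fields.get("id") ?? "",
        addr: fields.get("addr") ?? "",
        name: fields.get("name") ?? "",
        omem: Number(fields.get("omem") ?? "0"),
      };
    });
}

// Sum of output buffer memory across all connected clients.
function totalOutputBufferBytes(raw: string): number {
  return parseClientList(raw).reduce((sum, client) => sum + client.omem, 0);
}

// The `limit` clients holding the most unread output, largest first.
function topOutputBuffers(raw: string, limit: number): ClientBufferUsage[] {
  return parseClientList(raw)
    .sort((a, b) => b.omem - a.omem)
    .slice(0, limit);
}
```

A steadily rising total, concentrated in the connections held by event listeners, would match the failure mode described above.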
runsMetadata queue stalling or worker crash loops.

| Action Item | Status |
|---|---|
| Revert QueueEvents adoption to stop the event broadcast storm | Done |
| Add alerting on Redis memory usage and output buffer growth | To do |
| Add monitoring for runsMetadata queue consumption lag | To do |
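The two open action items could start from checks like the ones below. This is a sketch under stated assumptions rather than the team's planned implementation: the thresholds, type names, and function names are hypothetical, the memory check expects the raw text of Redis `INFO memory`, and the lag check expects the creation time of the oldest waiting job to be supplied by the caller.

```typescript
type MemoryAlert = "ok" | "warn" | "critical" | "no-limit";

// Parses `INFO` output ("key:value" lines, "#" section headers) into a map.
function parseInfo(raw: string): Map<string, string> {
  const fields = new Map<string, string>();
  for (const line of raw.split(/\r?\n/)) {
    if (line.length === 0 || line.startsWith("#")) continue;
    const colon = line.indexOf(":");
    if (colon > 0) fields.set(line.slice(0, colon), line.slice(colon + 1));
  }
  return fields;
}

// Compares used_memory against maxmemory. Thresholds are hypothetical defaults.
function evaluateMemory(
  rawInfo: string,
  warnRatio = 0.75,
  criticalRatio = 0.9,
): MemoryAlert {
  const info = parseInfo(rawInfo);
  const used = Number(info.get("used_memory") ?? "0");
  const max = Number(info.get("maxmemory") ?? "0");
  // maxmemory of 0 means no configured limit, so no ratio can be computed.
  if (max <= 0) return "no-limit";
  const ratio = used / max;
  if (ratio >= criticalRatio) return "critical";
  if (ratio >= warnRatio) return "warn";
  return "ok";
}

// True when the oldest waiting job has been queued longer than maxLagMs.
// An undefined timestamp is treated as an empty queue.
function isConsumptionLagging(
  oldestWaitingJobCreatedAtMs: number | undefined,
  nowMs: number,
  maxLagMs: number,
): boolean {
  if (oldestWaitingJobCreatedAtMs === undefined) return false;
  return nowMs - oldestWaitingJobCreatedAtMs > maxLagMs;
}
```

Measuring lag by the age of the oldest waiting job, rather than by queue depth alone, flags a queue that has stopped consuming even when its backlog is still small.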
Removed the QueueEvents listener so workers no longer broadcast lifecycle events to all instances, eliminating the Redis output buffer growth.