docs/handbook/engineering/postmortems/2026-03-19-redis-and-delay-overload.mdx
On March 19, 2026, a bug in the Delay step caused flows to restart from the beginning instead of resuming after the delay, creating an infinite loop that flooded Redis with jobs. Affected flows never completed, and the growing job backlog degraded queue processing for all users.
All times are in UTC.
When a flow hits a Delay step, the system puts the job on hold via BullMQ's moveToDelayed(). The bug was that the job still carried executionType: BEGIN instead of RESUME. When the delay expired, the worker re-ran the entire flow from the first step, hit the Delay again, paused again, and looped forever — flooding Redis with new jobs on every iteration.
Trigger -> Step 1 -> Delay(20s) -> PAUSE
| (20s later, job still says "BEGIN")
Trigger -> Step 1 -> Delay(20s) -> PAUSE
| (20s later, job still says "BEGIN")
... forever
The platform does enforce per-execution time limits, but because the job was marked as BEGIN instead of RESUME, each loop iteration was treated as a brand-new execution rather than a continuation. Each fresh execution only ran from the trigger to the Delay step — well within the time limit — before spawning another delayed job and repeating.
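The loop can be reproduced in a small model. The sketch below is illustrative only: `JobData`, `resumeStepIndex`, `runOnce`, and `simulate` are hypothetical names, not the platform's code, and the queue is reduced to a plain loop. It shows that a job still marked BEGIN re-runs the same short prefix on every wake-up, so no single execution ever comes near a time limit.

```typescript
// Illustrative model only: every name here is an assumption, not the platform's code.
type ExecutionType = "BEGIN" | "RESUME";

interface JobData {
  executionType: ExecutionType;
  resumeStepIndex: number; // hypothetical: first step to run on RESUME
}

const steps: string[] = ["Trigger", "Step 1", "Delay(20s)", "Step 2"];

interface WakeUpResult {
  ran: string[];        // steps executed during this wake-up
  next: JobData | null; // delayed job to re-queue, or null when the flow is done
}

// Runs the flow until it finishes or reaches the Delay step.
function runOnce(job: JobData, markResumeBeforeDelay: boolean): WakeUpResult {
  const start = job.executionType === "RESUME" ? job.resumeStepIndex : 0;
  const ran: string[] = [];
  for (let i = start; i < steps.length; i++) {
    ran.push(steps[i]);
    if (steps[i].startsWith("Delay")) {
      const next: JobData = markResumeBeforeDelay
        ? { executionType: "RESUME", resumeStepIndex: i + 1 } // fixed behavior
        : { ...job };                                         // bug: still BEGIN
      return { ran, next };
    }
  }
  return { ran, next: null };
}

// Replays wake-ups until the flow completes or the cap is reached.
function simulate(markResumeBeforeDelay: boolean, maxWakeUps: number) {
  let job: JobData | null = { executionType: "BEGIN", resumeStepIndex: 0 };
  const stepsPerWakeUp: number[] = [];
  while (job !== null && stepsPerWakeUp.length < maxWakeUps) {
    const result = runOnce(job, markResumeBeforeDelay);
    stepsPerWakeUp.push(result.ran.length);
    job = result.next;
  }
  return { wakeUps: stepsPerWakeUp.length, completed: job === null, stepsPerWakeUp };
}
```

In this model, `simulate(false, 5)` uses all five wake-ups without completing and runs three steps each time, while `simulate(true, 5)` completes on the second wake-up.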
| Action Item | Status |
|---|---|
| Update job data to executionType: RESUME before calling moveToDelayed() so the worker continues from the correct step | Done |
| Add test coverage for Delay step resume behavior to catch regressions where a delayed job restarts instead of resuming | Done |
| Prevent a flow from entering an infinite state by detecting and halting repeated re-executions of the same run (ENG-320) | Done |
| Add alerting on abnormal queue depth growth to detect runaway job creation before customers are impacted | To do |
| Add monitoring for repeated execution patterns on a single flow (e.g., same flow re-triggered N times within a short window) | To do |
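One way the repeated-execution items in the table could be approached is a sliding-window counter keyed by run. The sketch below is an assumption-laden illustration, not the ENG-320 implementation: `RepeatExecutionDetector`, its thresholds, and the in-memory `Map` are hypothetical, and a production version would likely keep this state in shared storage.

```typescript
// Hypothetical sliding-window detector; names and thresholds are illustrative.
class RepeatExecutionDetector {
  private readonly maxStarts: number;
  private readonly windowMs: number;
  private readonly starts = new Map<string, number[]>();

  constructor(maxStarts: number, windowMs: number) {
    this.maxStarts = maxStarts;
    this.windowMs = windowMs;
  }

  // Records one execution start for a run and returns true when the run
  // has started more than maxStarts times within the last windowMs.
  recordStart(runId: string, nowMs: number): boolean {
    const previous = this.starts.get(runId) ?? [];
    // Keep only the timestamps that still fall inside the window.
    const recent = previous.filter((t) => nowMs - t < this.windowMs);
    recent.push(nowMs);
    this.starts.set(runId, recent);
    return recent.length > this.maxStarts;
  }
}
```

With a limit of three starts per two minutes, a run that restarts every 20 seconds is flagged on its fourth start, at which point the caller could halt the run or raise an alert.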
The fix updates the job data to executionType: RESUME before calling moveToDelayed(), so the worker continues from where the flow left off instead of restarting.
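The ordering of the fix can be sketched as follows. This is a minimal illustration under stated assumptions: `DelayableJob` is a hand-written stand-in that mirrors only the `updateData` and `moveToDelayed` methods of a BullMQ job, and `FlowJobData`, `pauseForDelay`, and `InMemoryJob` are hypothetical names that do not come from the codebase.

```typescript
type ExecutionType = "BEGIN" | "RESUME";

// Hypothetical job payload; the real payload likely carries more fields.
interface FlowJobData {
  runId: string;
  executionType: ExecutionType;
}

// Stand-in for the two BullMQ Job methods this sketch relies on.
interface DelayableJob {
  data: FlowJobData;
  updateData(data: FlowJobData): Promise<void>;
  moveToDelayed(timestamp: number, token?: string): Promise<void>;
}

// Marks the job as RESUME first, then parks it until the delay expires.
async function pauseForDelay(
  job: DelayableJob,
  delayMs: number,
  token: string,
  nowMs: number = Date.now(),
): Promise<void> {
  await job.updateData({ ...job.data, executionType: "RESUME" });
  await job.moveToDelayed(nowMs + delayMs, token);
}

// In-memory fake used to observe what the job looked like when it was delayed.
class InMemoryJob implements DelayableJob {
  data: FlowJobData;
  delayedUntil: number | null = null;
  dataAtDelay: FlowJobData | null = null;

  constructor(data: FlowJobData) {
    this.data = data;
  }

  async updateData(data: FlowJobData): Promise<void> {
    this.data = data;
  }

  async moveToDelayed(timestamp: number, _token?: string): Promise<void> {
    this.delayedUntil = timestamp;
    this.dataAtDelay = { ...this.data };
  }
}
```

The ordering is the point of the sketch: the payload is rewritten before the job is parked, so whatever the worker picks up after the delay already says RESUME. In a real BullMQ processor, the documented pattern is also to throw `DelayedError` after `moveToDelayed()` so the worker does not mark the job as completed; that step is omitted here to keep the sketch dependency-free.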