Conductor is a durable execution engine for distributed workflows and durable agents. Every workflow execution is persisted at every step and survives infrastructure failures, and every task is delivered at least once. As a result, your workflows and agents do not lose progress. This page defines exactly what those guarantees mean.
When a workflow executes, Conductor persists:
All state is written to the configured persistence store (Redis, PostgreSQL, MySQL, or Cassandra) before the next step proceeds. If the server restarts, execution resumes from the last persisted state.
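The persist-before-proceed rule can be illustrated with a small sketch. This is not Conductor's implementation: `Store`, `run`, and the `next_step`/`outputs` fields are hypothetical names, and an in-memory dictionary stands in for the real persistence store.

```python
import json


class Store:
    """In-memory stand-in for a persistence store (hypothetical)."""

    def __init__(self):
        self.rows = {}

    def save(self, workflow_id, state):
        # Serialize the state to mimic a durable write.
        self.rows[workflow_id] = json.dumps(state)

    def load(self, workflow_id):
        return json.loads(self.rows[workflow_id])


def run(store, workflow_id, steps, crash_before=None):
    """Run steps in order, persisting after each one; resume if state exists."""
    if workflow_id in store.rows:
        state = store.load(workflow_id)  # resume from the last persisted state
    else:
        state = {"next_step": 0, "outputs": []}
        store.save(workflow_id, state)
    while state["next_step"] < len(steps):
        i = state["next_step"]
        if crash_before == i:
            raise RuntimeError("simulated server crash")
        state["outputs"].append(steps[i]())
        state["next_step"] = i + 1
        store.save(workflow_id, state)  # persist before the next step proceeds
    return state["outputs"]


store = Store()
steps = [lambda: "a", lambda: "b", lambda: "c"]
try:
    run(store, "wf-1", steps, crash_before=2)  # crash before the third step
except RuntimeError:
    pass
after_crash = store.load("wf-1")
resumed = run(store, "wf-1", steps)  # a fresh "server" resumes at step 2
```

After the simulated crash, the stored state still holds the outputs of the first two steps, so the second call only executes the remaining step.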
Conductor provides at-least-once delivery for all tasks:
- When a worker polls a task, the task moves to IN_PROGRESS.
- The worker reports COMPLETED and Conductor advances the workflow.

A task is never silently lost. If a worker polls a task but never responds, the response timeout triggers redelivery.
Here is exactly what happens in each failure scenario:
| Scenario | What Conductor does | Outcome |
|---|---|---|
| Worker crashes after poll, before any work | Response timeout fires. Task returns to SCHEDULED. New worker picks it up. | Task is retried automatically. No data loss. |
| Worker crashes after side effect, before completion update | Response timeout fires. Task is redelivered to another worker. | Task executes again. Workers must be idempotent for side effects, or use the task's updateTime to detect redelivery. |
| Worker reports FAILED | Conductor creates a new task execution based on retry configuration (retryCount, retryDelaySeconds, retryLogic). | Retried up to the configured limit. After exhaustion, task moves to FAILED and the workflow's failure handling kicks in. |
| Worker reports FAILED_WITH_TERMINAL_ERROR | No retry. Task is terminal. | Workflow fails or executes the configured failureWorkflow. |
| Server restarts during workflow execution | On restart, the sweeper service picks up in-progress workflows from persistent storage and re-evaluates them. | Execution resumes from the last persisted state. No manual intervention needed. |
| Long wait across deploys | WAIT and HUMAN tasks remain IN_PROGRESS in persistent storage. The timer or signal resolution is durable. | When the duration elapses or signal arrives (even days later, after multiple deploys), the task completes and the workflow advances. |
| Signal/webhook arrives for a paused workflow | The Task Update API or event handler sets the WAIT/HUMAN task to COMPLETED with the provided output. | Workflow resumes immediately with the signal payload available as task output. |
| Workflow definition updated while executions are running | Running executions continue using the snapshot of the definition taken at start time. New executions use the updated definition. | No running execution is affected by definition changes. Zero-downtime upgrades. |
| Workflow version deleted while executions are running | Running executions are decoupled from the metadata store. They continue using their embedded definition snapshot. | Existing executions complete normally. Only new starts are affected. |
| Network partition between worker and server | Worker's updates don't reach the server. Response timeout fires, task is requeued. | After partition heals, a new worker (or the same one) picks up the task. |
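Several rows above end with the task executing again, so side effects need to tolerate redelivery. One common way to do that is an idempotency key, sketched below. `PaymentLedger`, `handle_task`, and the task fields `workflow_id`, `task_ref`, and `amount` are hypothetical names for illustration, not part of Conductor's API.

```python
class PaymentLedger:
    """Hypothetical downstream system that records side effects by key."""

    def __init__(self):
        self.charges = {}

    def charge_once(self, idempotency_key, amount):
        # A repeated key returns the original result instead of charging again.
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = amount
        return self.charges[idempotency_key]


def handle_task(task, ledger):
    # Build the key from identifiers that stay the same across redeliveries
    # (field names here are assumptions).
    key = f'{task["workflow_id"]}:{task["task_ref"]}'
    charged = ledger.charge_once(key, task["amount"])
    return {"status": "COMPLETED", "output": {"charged": charged}}


ledger = PaymentLedger()
task = {"workflow_id": "wf-1", "task_ref": "charge_card", "amount": 25}
first = handle_task(task, ledger)
second = handle_task(task, ledger)  # simulated redelivery after a crash
```

Both calls return the same result and the ledger holds a single charge, which is the property an at-least-once delivery model asks of workers.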
Every task follows this state machine:

    SCHEDULED ──→ IN_PROGRESS ──→ COMPLETED
        │              │
        │              ├──→ FAILED ──→ SCHEDULED (retry)
        │              │
        │              ├──→ FAILED_WITH_TERMINAL_ERROR
        │              │
        │              └──→ TIMED_OUT ──→ SCHEDULED (retry)
        │
        └──→ CANCELED (workflow terminated)

Terminal states: COMPLETED, FAILED (after retries exhausted), FAILED_WITH_TERMINAL_ERROR, CANCELED, COMPLETED_WITH_ERRORS (optional tasks).
Each transition is persisted before any subsequent action is taken.
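For readers who prefer a table to a diagram, the transitions can be written as a lookup. This is a sketch of one reading of the diagram above, not the engine's internal representation; `TRANSITIONS` and `can_transition` are hypothetical names, and only the edges drawn in the diagram are encoded.

```python
# Allowed next states for each task state, as read from the diagram.
TRANSITIONS = {
    "SCHEDULED": {"IN_PROGRESS", "CANCELED"},
    "IN_PROGRESS": {
        "COMPLETED",
        "FAILED",
        "FAILED_WITH_TERMINAL_ERROR",
        "TIMED_OUT",
    },
    "FAILED": {"SCHEDULED"},      # retry, while attempts remain
    "TIMED_OUT": {"SCHEDULED"},   # retry, while attempts remain
    "COMPLETED": set(),
    "FAILED_WITH_TERMINAL_ERROR": set(),
    "CANCELED": set(),
}


def can_transition(current, target):
    """Return True if the diagram draws an edge from current to target."""
    return target in TRANSITIONS.get(current, set())
```

States with an empty set have no outgoing edges, which matches the terminal states listed above.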
Durability is configurable per task via the task definition:
| Parameter | What it controls |
|---|---|
| timeoutSeconds | Maximum wall-clock time for the task to reach a terminal state. |
| responseTimeoutSeconds | Maximum time to wait for a worker status update before requeuing. |
| pollTimeoutSeconds | Maximum time a scheduled task waits to be polled before timeout. |
| retryCount | Number of retry attempts on failure or timeout. |
| retryLogic | FIXED, EXPONENTIAL_BACKOFF, or LINEAR_BACKOFF. |
| retryDelaySeconds | Base delay between retries. |
| timeoutPolicy | RETRY, TIME_OUT_WF, or ALERT_ONLY. |
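As an illustration, the parameters above might be combined as follows. The task name and all values are made up, and the delay formulas in `retry_delay` are an assumption about how the three retryLogic options are commonly interpreted; the engine's exact calculation may differ (for example, by applying a scale factor).

```python
# Illustrative task definition using the parameters from the table.
task_def = {
    "name": "charge_card",  # hypothetical task name
    "retryCount": 3,
    "retryLogic": "EXPONENTIAL_BACKOFF",
    "retryDelaySeconds": 5,
    "timeoutSeconds": 300,
    "responseTimeoutSeconds": 60,
    "pollTimeoutSeconds": 120,
    "timeoutPolicy": "RETRY",
}


def retry_delay(task_def, attempt):
    """Delay in seconds before retry number `attempt` (1-based).

    The formulas are assumptions for illustration only.
    """
    base = task_def["retryDelaySeconds"]
    logic = task_def["retryLogic"]
    if logic == "FIXED":
        return base
    if logic == "LINEAR_BACKOFF":
        return base * attempt
    if logic == "EXPONENTIAL_BACKOFF":
        return base * 2 ** (attempt - 1)
    raise ValueError(f"unknown retryLogic: {logic}")


delays = [retry_delay(task_def, n) for n in range(1, task_def["retryCount"] + 1)]
```

Under these assumptions, three exponential retries with a 5-second base wait 5, 10, and 20 seconds.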
Beyond individual tasks, Conductor provides workflow-level durability:
- A failureWorkflow that runs automatically when the main workflow fails, with full context (reason, failed task ID, workflow execution data).

Every workflow execution is fully replayable. Conductor preserves the complete execution graph (inputs, outputs, and state for every task), so you can re-execute workflows at any time.
| Operation | What it does | When to use |
|---|---|---|
| Restart | Re-executes the entire workflow from the beginning | Definition changed, need a clean run |
| Rerun | Re-executes from a specific task, reusing outputs of prior tasks | Fix a task in the middle without re-running everything |
| Retry | Retries the last failed task and continues from that point | Transient failure, external dependency was down |
All three operations work on workflows in any terminal state (COMPLETED, FAILED, TIMED_OUT, TERMINATED) and are available indefinitely — Conductor preserves the full execution graph. Restart can optionally use the latest workflow definition, so you can fix a bug in the definition and replay immediately.
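The difference between the three operations can be sketched for a simple linear workflow. This is a simplified model, not Conductor's logic: `tasks_to_execute`, the task dictionaries, and the `NOT_STARTED` status are hypothetical, and real workflows with forks or loops would need more than a list slice.

```python
def tasks_to_execute(operation, tasks, rerun_from=None):
    """Return the names of tasks that run again in a linear workflow."""
    names = [t["name"] for t in tasks]
    if operation == "restart":
        return names  # everything, from the beginning
    if operation == "rerun":
        # From the chosen task onward; earlier outputs are reused.
        return names[names.index(rerun_from):]
    if operation == "retry":
        failed = [i for i, t in enumerate(tasks) if t["status"] == "FAILED"]
        # From the last failed task onward.
        return names[failed[-1]:] if failed else []
    raise ValueError(f"unknown operation: {operation}")


history = [
    {"name": "fetch", "status": "COMPLETED"},
    {"name": "transform", "status": "FAILED"},
    {"name": "load", "status": "NOT_STARTED"},  # hypothetical placeholder status
]
restart = tasks_to_execute("restart", history)
rerun = tasks_to_execute("rerun", history, rerun_from="transform")
retry = tasks_to_execute("retry", history)
```

In this example rerun and retry happen to select the same tasks because the chosen rerun point is also the failed task; choosing `fetch` as the rerun point would include all three.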
In multi-node deployments, Conductor ensures consistency through:
- Distributed locking: only one decide evaluation runs per workflow at a time across the cluster (pluggable: Zookeeper, Redis).

See the deployment guide for distributed lock configuration.
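The idea of one evaluation per workflow at a time can be sketched with in-process locks. This is only an analogy: `WorkflowLocks` and `decide` below are hypothetical, and a real deployment would use a distributed lock provider rather than `threading.Lock`.

```python
import threading
from collections import defaultdict


class WorkflowLocks:
    """In-process stand-in for a distributed lock provider (hypothetical)."""

    def __init__(self):
        self._guard = threading.Lock()
        self._locks = defaultdict(threading.Lock)

    def try_acquire(self, workflow_id):
        with self._guard:  # protect creation of the per-workflow lock
            lock = self._locks[workflow_id]
        return lock.acquire(blocking=False)

    def release(self, workflow_id):
        self._locks[workflow_id].release()


def decide(locks, workflow_id, evaluate):
    """Evaluate a workflow only if no other evaluation holds its lock."""
    if not locks.try_acquire(workflow_id):
        return None  # another evaluation is in flight; skip this one
    try:
        return evaluate(workflow_id)
    finally:
        locks.release(workflow_id)


locks = WorkflowLocks()
locks.try_acquire("wf-1")  # simulate another node mid-evaluation
skipped = decide(locks, "wf-1", lambda wf: f"evaluated {wf}")
other = decide(locks, "wf-2", lambda wf: f"evaluated {wf}")
locks.release("wf-1")
after = decide(locks, "wf-1", lambda wf: f"evaluated {wf}")
```

Locks are keyed by workflow, so a held lock on one workflow blocks only evaluations of that workflow and leaves others free to proceed.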