# Failure Semantics
This page defines exactly what happens when things go wrong in an agent workflow. Not "Conductor is durable" — but the precise behavior under every failure scenario an agent can encounter.
**Scenario:** The `LLM_CHAT_COMPLETE` task calls an LLM provider and the call fails (rate limit, timeout, provider outage, malformed response).
**What happens:**

- The task attempt moves to `FAILED`.
- Conductor retries according to the task definition's retry policy (`retryCount`, `retryLogic`, `retryDelaySeconds`).
- If all retries are exhausted, the task reaches the `FAILED` terminal state.
- The `failureWorkflow` runs if configured, or the workflow moves to `FAILED`.

**What is preserved:** The prompt, the error response, the retry count, and the timing of each attempt. You can inspect every failed attempt in the UI.

**What is NOT re-executed:** Nothing upstream. Only the failed LLM call retries. All previously completed tasks retain their outputs.
**Configuration:**

```json
{
  "name": "plan_action",
  "retryCount": 3,
  "retryLogic": "EXPONENTIAL_BACKOFF",
  "retryDelaySeconds": 5,
  "responseTimeoutSeconds": 60
}
```
This retries the LLM call up to 3 times with exponential backoff (5s, 10s, 20s). If the LLM doesn't respond within 60 seconds, the task times out and retries.
**Scenario:** The LLM responds, but the output is not valid JSON or doesn't match the expected schema (e.g., a missing `action` field).
**What happens:**

The `LLM_CHAT_COMPLETE` task completes successfully — the LLM did respond. The malformed output propagates to the next task. What happens next depends on the downstream task: if it references `${plan.output.result.action}` and `action` doesn't exist, it fails with an input resolution error.

**How to handle it:** Add a `SWITCH` or `INLINE` task after the LLM call to validate the output before acting on it:
```json
{
  "name": "validate_plan",
  "taskReferenceName": "validate",
  "type": "INLINE",
  "inputParameters": {
    "plan": "${plan.output.result}",
    "evaluatorType": "graaljs",
    "expression": "(function() { var p = $.plan; if (!p || !p.action) { return {valid: false, error: 'Missing action field'}; } return {valid: true, plan: p}; })()"
  }
}
```
If validation fails, use a `SWITCH` to re-run the LLM with a corrective prompt, or fail the workflow.
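One way to wire that branch, sketched under the assumption that the validator above is registered as `validate`; the `route_on_validation` name and the TERMINATE-on-failure branch are illustrative, and a corrective re-prompt would slot into the same `decisionCases`. Since `decisionCases` keys are strings, the validator's boolean `valid` is matched by its string form:

```json
{
  "name": "route_on_validation",
  "taskReferenceName": "route_on_validation",
  "type": "SWITCH",
  "evaluatorType": "value-param",
  "expression": "valid",
  "inputParameters": {
    "valid": "${validate.output.result.valid}"
  },
  "decisionCases": {
    "false": [
      {
        "name": "fail_on_bad_plan",
        "taskReferenceName": "fail_on_bad_plan",
        "type": "TERMINATE",
        "inputParameters": {
          "terminationStatus": "FAILED",
          "terminationReason": "${validate.output.result.error}"
        }
      }
    ]
  },
  "defaultCase": []
}
```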
**Scenario:** A `CALL_MCP_TOOL` or `HTTP` task calls an external tool and the tool doesn't respond within the configured timeout.
**What happens:**

- No result arrives before `responseTimeoutSeconds` fires. The task moves to `TIMED_OUT`.
- The task is retried according to its retry policy, which calls the tool again.

**Critical implication:** The tool call may execute more than once. Tool workers and MCP tools should be idempotent. Use a key that is stable across retries, such as the workflow ID plus the task reference name or a caller-supplied correlation ID, as an idempotency key (each retry attempt gets a fresh taskId).

**What is preserved:** The timed-out attempt is recorded with its input, the timeout event, and the timing. Every retry attempt is separately recorded.
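For `HTTP` tool calls, one way to sketch such a key, assuming the downstream service honors an `Idempotency-Key` header (the endpoint, header name, and body fields here are placeholders). The key combines the workflow ID with the task reference name, both of which are stable across retry attempts:

```json
{
  "name": "send_invoice",
  "taskReferenceName": "send_invoice",
  "type": "HTTP",
  "inputParameters": {
    "http_request": {
      "uri": "https://billing.example.com/invoices",
      "method": "POST",
      "headers": {
        "Idempotency-Key": "${workflow.workflowId}-send_invoice"
      },
      "body": {
        "customerId": "${plan.output.result.customerId}"
      }
    }
  }
}
```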
**Scenario:** A tool call sends an email, then the worker crashes before reporting completion. The task is retried, and the email is sent again.
**What happens:**

- The worker sends the email, then crashes before calling `POST /api/tasks` to report completion.
- The server never hears back, so `responseTimeoutSeconds` fires. The task moves to `TIMED_OUT`, then `SCHEDULED` (retry).

This is at-least-once delivery. Conductor guarantees the task will execute at least once, but it may execute more than once if the worker fails after performing side effects.

**How to handle it:**

- Derive idempotency keys from identifiers that are stable across retries, not from the taskId (the taskId is unique per attempt).
- Check the task's `updateTime` to detect redelivery — if the task was already processed, skip the side effect.
- For operations that cannot be made idempotent, configure a `failureWorkflow` with compensation tasks.

**Scenario:** A `HUMAN` task is waiting for approval, and nobody responds. Hours pass. Days pass.
**What happens:**

The `HUMAN` task remains `IN_PROGRESS` in durable storage indefinitely. It does not time out unless you explicitly configure `timeoutSeconds` on the task definition.

**If you want a timeout:** Set `timeoutSeconds` and `timeoutPolicy` on the task definition:
```json
{
  "name": "human_approval",
  "timeoutSeconds": 86400,
  "timeoutPolicy": "TIME_OUT_WF"
}
```
This times out after 24 hours and fails the workflow. Alternatively, use `timeoutPolicy: "ALERT_ONLY"` to log a timeout without failing.
**If you want escalation:** Use a parallel `WAIT` + `HUMAN` pattern:

```json
{
  "name": "approval_fork",
  "taskReferenceName": "approval_fork",
  "type": "FORK_JOIN",
  "forkTasks": [
    [{"name": "human_approval", "taskReferenceName": "approval", "type": "HUMAN"}],
    [
      {"name": "escalation_wait", "taskReferenceName": "escalation_wait", "type": "WAIT", "inputParameters": {"duration": "4 hours"}},
      {"name": "escalation_notify", "taskReferenceName": "escalation_notify", "type": "LLM_CHAT_COMPLETE"}
    ]
  ]
}
```
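The fork is typically followed by a `JOIN` that waits only on the approval branch, so the workflow proceeds as soon as the human responds instead of waiting out the escalation timer. A sketch, with task references matching the fork above:

```json
{
  "name": "approval_join",
  "taskReferenceName": "approval_join",
  "type": "JOIN",
  "joinOn": ["approval"]
}
```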
**Scenario:** An external system calls the Task Update API to complete a `HUMAN` task, but the network is flaky and the call is retried. Conductor receives the completion signal twice.
**What happens:**

The first call moves the task from `IN_PROGRESS` to `COMPLETED` and advances the workflow. The second call arrives for a task that is already in a terminal state. The duplicate update is rejected, and the task remains `COMPLETED`.

This is safe by default. Conductor's task state machine enforces that a task can only transition to a terminal state once. Duplicate callbacks are harmless.
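For reference, the completion signal is a task update whose body identifies the exact task attempt; a minimal sketch of the `POST /api/tasks` payload (IDs are placeholders). Sending it twice is harmless because the second update targets a task already in a terminal state:

```json
{
  "workflowInstanceId": "wf-3f8a12",
  "taskId": "task-9c41d7",
  "status": "COMPLETED",
  "outputData": {
    "approved": true
  }
}
```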
**Scenario:** A `FORK`/`JOIN` runs three parallel branches. Branch 1 completes. Branch 2 fails. Branch 3 is still running.
**What happens:**

- The failed task in branch 2 moves to `FAILED` and retries according to its retry policy.
- The `JOIN` task waits for all branches to reach a terminal state.
- If branch 2 exhausts its retries and ends `FAILED`, the `JOIN` task fails.

**What is preserved:** Each branch's completed tasks retain their outputs. If you retry the workflow from the failed task, only the failed branch re-executes. Successful branches are not re-run.
**Scenario:** You update the workflow definition (add a task, change a parameter) while executions are running.
**What happens:**

Running executions are not affected. Each execution uses an immutable snapshot of the definition taken at start time. The snapshot is embedded in the execution record.

**If you want to apply the new definition:** Use restart with latest definitions (in the Conductor REST API, `POST /workflow/{workflowId}/restart?useLatestDefinitions=true`). This re-executes the workflow from the beginning using the updated definition.
**Scenario:** You deploy a new version of your worker code. Old worker instances are shut down, new instances start up. Tasks are in-flight.
**What happens:**

- In-flight tasks claimed by the old workers are left `IN_PROGRESS` when those workers shut down.
- `responseTimeoutSeconds` fires for abandoned tasks. Tasks move to `TIMED_OUT`, then `SCHEDULED` (retry), and the new workers pick them up.

**Window of vulnerability:** The time between old worker shutdown and `responseTimeoutSeconds` firing. During this window, the task appears `IN_PROGRESS` but no worker is processing it.

**How to minimize impact:**

- Keep `responseTimeoutSeconds` short (10-60 seconds for most tasks).
- Use graceful shutdown: stop polling for new tasks, finish in-flight work, then exit.

**What is never lost:** Completed task outputs. The workflow state. The execution history. Only the in-progress task is affected, and it is automatically retried.
**Scenario:** A `DYNAMIC` task resolves to a task type based on LLM output. The LLM returns a task name that doesn't exist (not registered, was deleted, or is misspelled).
**What happens:**

The `DYNAMIC` task fails with a resolution error — the specified task type cannot be found. The task moves to `FAILED` and retries according to its retry policy.

**How to handle it:** Validate the LLM output before the `DYNAMIC` task. Use an `INLINE` or `SWITCH` task to check that the resolved task name is in a known allowlist.
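A sketch of that allowlist check, assuming the planning task is referenced as `plan` and using placeholder task names; route on `check_allowlist.output.result.allowed` with a `SWITCH` before the `DYNAMIC` task:

```json
{
  "name": "check_allowlist",
  "taskReferenceName": "check_allowlist",
  "type": "INLINE",
  "inputParameters": {
    "candidate": "${plan.output.result.action}",
    "evaluatorType": "graaljs",
    "expression": "(function() { var allowed = ['search_docs', 'create_ticket', 'send_summary']; return {allowed: allowed.indexOf($.candidate) >= 0}; })()"
  }
}
```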
**Scenario:** A worker is executing a task (e.g., an LLM call). A network partition occurs. The worker completes the task but cannot report the result to the Conductor server.
**What happens:**

- The worker attempts to report `COMPLETED` to the server. The request fails due to the network partition.
- After `responseTimeoutSeconds`, the server marks the task as `TIMED_OUT` and requeues it.
- A worker picks up the retry and re-executes the task, including the LLM call.

Tokens are consumed twice in this scenario. The original LLM call succeeded, but the result was lost. This is the cost of at-least-once delivery. For long-running or expensive LLM calls, consider implementing client-side caching in your worker to avoid re-execution.
**Scenario:** An autonomous agent loop runs for hours or days, with `WAIT` pauses, `HUMAN` approvals, and periodic LLM calls.
**What happens:**

This is a normal operating mode for Conductor. The workflow stays `RUNNING` with individual tasks in `IN_PROGRESS` (for active work) or `COMPLETED` (for finished steps).

- `WAIT` tasks consume no resources. The durable timer fires when the duration elapses, even across deploys.
- `HUMAN` tasks consume no resources. They persist until the signal arrives.
- The `DO_WHILE` loop counter and all intermediate state survive indefinitely.

**Practical limits:**

- The workflow-level `timeoutSeconds` applies to the total execution. Set it high enough for your expected duration, or omit it for unlimited execution time.
- Intermediate payloads accumulate in the execution record, so offload large payloads to external storage.

| Failure | What Conductor does | What you should do |
|---|---|---|
| LLM call fails | Retries with configured backoff | Set retry policy on the task definition |
| LLM returns bad output | Downstream task fails on input resolution | Add a validation step after LLM calls |
| Tool call times out | Retries after `responseTimeoutSeconds` | Make tools idempotent |
| Tool call has side effects, then crashes | Retries — side effect may execute twice | Use idempotency keys |
| Human never responds | Task stays `IN_PROGRESS` forever | Set `timeoutSeconds` or build escalation |
| Duplicate callback | Second call rejected, no duplicate execution | Safe by default |
| `FORK` branch fails | `JOIN` waits for all branches; workflow fails if a branch exhausts retries | Configure retry policies per branch |
| Definition changes while running | Running executions unaffected (snapshot) | Use restart to apply new definitions |
| Worker deploy | In-flight tasks requeued after response timeout | Keep response timeouts short; use graceful shutdown |
| `DYNAMIC` task doesn't exist | Task fails, retries | Validate LLM output before `DYNAMIC` resolution |
| Network partition | Task requeued after timeout, may re-execute | Make workers idempotent; consider client-side caching |
| Multi-day execution | Normal operation, fully durable | Offload large payloads; set appropriate timeouts |