docs/devguide/bestpractices.md
This guide covers production best practices for running Conductor as a durable execution engine at scale. Every recommendation here comes from real-world operational experience.
Conductor guarantees at-least-once task delivery. Network partitions, worker restarts, and response timeouts can all cause a task to be delivered more than once. Your workers must be idempotent — executing the same task twice should produce the same result without side effects.
Patterns for idempotency:
| Pattern | When to use |
|---|---|
| Idempotency key | Pass a unique key (e.g., workflowId + taskId) to downstream services. The service deduplicates on this key. |
| Upsert instead of insert | Use INSERT ... ON CONFLICT UPDATE or equivalent so repeated writes converge to the same state. |
| Check-then-act | Query current state before performing the action. Skip if already completed. |
| Idempotent HTTP methods | Prefer PUT over POST when the downstream API supports it. |
from conductor.client.worker.worker_task import worker_task
@worker_task(task_definition_name="charge_payment")
def charge_payment(workflow_id: str, task_id: str, amount: float, currency: str) -> dict:
idempotency_key = f"{workflow_id}-{task_id}"
# Check if this charge was already processed
existing = payment_gateway.get_charge(idempotency_key)
if existing:
return {"chargeId": existing.id, "status": "already_processed"}
charge = payment_gateway.create_charge(
amount=amount, currency=currency, idempotency_key=idempotency_key
)
return {"chargeId": charge.id, "status": "charged"}
The workflowId and taskId combination is unique per task execution attempt, making it an ideal idempotency key.
Every task definition should have explicit timeouts. A task without timeouts can block a workflow indefinitely.
The rule: responseTimeoutSeconds < timeoutSeconds. The response timeout detects unresponsive workers; the overall timeout enforces the SLA.
| Task pattern | responseTimeoutSeconds | timeoutSeconds | timeoutPolicy | retryCount |
|---|---|---|---|---|
| API call (< 5s expected) | 10 | 30 | RETRY | 3 |
| ML inference | 120 | 300 | RETRY | 1 |
| Human approval | 0 (disabled) | 86400 | ALERT_ONLY | 0 |
| Batch processing | 600 | 3600 | TIME_OUT_WF | 0 |
| Quick data transform | 5 | 15 | RETRY | 3 |
| Policy | Behavior | Use when |
|---|---|---|
RETRY | Retries the task up to retryCount times. | Transient failures are expected (network calls, external APIs). |
TIME_OUT_WF | Fails the entire workflow immediately. | The task is critical and retrying won't help (e.g., expired batch window). |
ALERT_ONLY | Marks the task as timed out but keeps the workflow running. | Human-in-the-loop tasks or tasks with external completion signals. |
!!! warning
Setting responseTimeoutSeconds to 0 disables the response timeout. Only do this for tasks that are completed externally (e.g., WAIT or Human tasks).
See Task Definitions for the full parameter reference.
Conductor stores task inputs and outputs in its database. Large payloads degrade performance and increase storage costs.
| Payload | Recommended limit | Hard limit (configurable) |
|---|---|---|
| Task input | < 64 KB | 1 MB |
| Task output | < 64 KB | 1 MB |
| Workflow input | < 64 KB | 1 MB |
For payloads exceeding 64 KB, use external payload storage. Conductor supports S3 out of the box:
{
"conductor.external-payload-storage.type": "s3",
"conductor.external-payload-storage.s3.bucket-name": "my-conductor-payloads",
"conductor.external-payload-storage.s3.region": "us-east-1",
"conductor.external-payload-storage.s3.signed-url-expiration-seconds": 300
}
| Do | Don't |
|---|---|
| Return only data that downstream tasks need. | Dump entire API responses into task output. |
| Store large files in S3/GCS and pass the URI. | Pass file contents as base64 in payloads. |
Use inputTemplate to set default values on the task definition. | Duplicate static config in every workflow definition. |
| Keep payload keys flat and descriptive. | Nest payloads 5 levels deep with ambiguous keys. |
Break work into small tasks that each do one thing. This gives you:
| Approach | When to use |
|---|---|
| Sub-workflow | Reusable logic shared across multiple parent workflows. Independently versioned and testable. |
| Inline tasks in a single workflow | Logic specific to one workflow. Fewer indirections to debug. |
Use sub-workflows when a group of tasks represents a bounded business capability (e.g., "process payment", "send notification bundle"). Don't create sub-workflows for a single task — the overhead isn't worth it.
| Pattern | When to use |
|---|---|
| DYNAMIC_FORK | Process N items in parallel. Use when items are independent and parallelism improves throughput. |
| DO_WHILE | Process items sequentially when ordering matters or a shared resource requires serialization. |
!!! tip Keep DYNAMIC_FORK fan-out under 500 concurrent tasks per workflow. Beyond that, consider batching items into chunks and forking over the chunks.
Workers are stateless and scale horizontally. Tune these parameters to match your workload.
The polling interval controls how frequently workers check for new tasks. Shorter intervals reduce latency; longer intervals reduce server load.
| Workload | Recommended polling interval |
|---|---|
| Low-latency (< 1s SLA) | 100-250 ms |
| Standard processing | 500 ms - 1s |
| Background / batch | 5-10s |
Each worker instance runs a configurable number of polling threads. Start with:
threads = (target_throughput * avg_task_duration_seconds) / num_worker_instances
For example, 100 tasks/sec with 2s average execution across 5 instances: (100 * 2) / 5 = 40 threads per instance.
Use task definition settings to protect downstream services:
{
"name": "call_external_api",
"rateLimitPerFrequency": 50,
"rateLimitFrequencyInSeconds": 1,
"concurrentExecLimit": 20
}
This limits the task to 50 executions per second globally, with at most 20 running concurrently.
Use task domains to route tasks to specific worker pools. Common use cases:
See Scaling Workers for more detail.
By default, a failed task is retried according to retryCount and retryLogic (FIXED, EXPONENTIAL_BACKOFF, or LINEAR_BACKOFF). For errors that should not be retried, set the task status to FAILED_WITH_TERMINAL_ERROR:
from conductor.client.http.models import TaskResult, TaskResultStatus
@worker_task(task_definition_name="validate_order")
def validate_order(order_id: str, items: list) -> TaskResult:
if not items:
result = TaskResult()
result.status = TaskResultStatus.FAILED_WITH_TERMINAL_ERROR
result.reason_for_incompletion = "Order has no items — not retryable"
return result
# ... validation logic
return {"valid": True}
| Error type | Strategy |
|---|---|
| Transient (network timeout, 503) | Let Conductor retry with backoff. |
| Client error (400, validation failure) | Return FAILED_WITH_TERMINAL_ERROR. |
| Partial failure in batch | Return partial results as output; use workflow logic to handle remainder. |
For workflows that span multiple services, design compensation tasks to undo completed steps when a later step fails.
Forward compensation — Fix the problem and continue. Use a SWITCH after the failed task to route to a recovery path.
Backward compensation — Undo completed work in reverse order. Model this as a separate workflow triggered by the failure workflow mechanism:
failureWorkflow.!!! tip Store compensation metadata (transaction IDs, resource handles) in each task's output so the failure workflow has everything it needs to roll back.
Conductor supports workflow versioning natively. Use this for safe deployments.
Running workflows continue on the version they were started with. You cannot migrate a running execution to a new version. Plan for this:
If version N+1 has issues:
Because workers are decoupled from workflow definitions, you can roll back the workflow version independently of worker deployments.
Track these metrics to maintain healthy Conductor operations:
| Metric | What it tells you | Alert threshold |
|---|---|---|
| Task queue depth | Backlog of unprocessed tasks. | Growing consistently over 5 minutes. |
| Task poll count (per task type) | Whether workers are actively polling. | Drops to zero. |
| Workflow failure rate | Percentage of workflows ending in FAILED state. | > 5% over a 15-minute window. |
| Task response time (p99) | How close workers are to the response timeout. | > 80% of responseTimeoutSeconds. |
| Worker thread utilization | Whether workers are saturated. | > 90% sustained for 10 minutes. |
| External payload storage errors | S3/GCS write failures blocking tasks. | Any non-zero count. |
See Monitoring and Scaling Workers for built-in monitoring tools.