Durable Workflows


June 17, 2026 • Asim Aslam

A workflow that calls real services is rarely instant and rarely side-effect-free. It reserves inventory at step one, charges a card at step two, sends a confirmation at step three. Each of those changes the world. So when the process dies between step two and step three (a deploy, an OOM, a node going away), you can't just run it again from the top: that reserves twice and charges twice. And if the workflow was triggered by an event with no human watching, nobody notices that it died at all.

This is one of the oldest problems in distributed systems, and it has an established answer: durable execution. Checkpoint progress as you go, and on restart resume from where you stopped instead of from the beginning. Go Micro flows now do this.

What a flow was, and what it is now

A flow used to run one augmented-LLM turn per event. Useful, but a single step — there was no notion of a task with stages, and nothing survived a crash.

A flow can now be an ordered list of steps — a task made of stages — and each step is checkpointed before and after. If the process dies mid-run, the run resumes at the step it stopped on, and the steps that already completed do not run again.

```go
f := micro.NewFlow("checkout",
    micro.FlowTrigger("events.order.placed"),
    micro.FlowRetry(2),
    micro.FlowSteps(
        micro.FlowStep{Name: "reserve", Run: micro.FlowCall("inventory", "Inventory.Reserve")},
        micro.FlowStep{Name: "charge",  Run: micro.FlowCall("payment", "Payment.Charge")},
        micro.FlowStep{Name: "confirm", Run: micro.FlowCall("orders", "Orders.Confirm")},
    ),
)
```

A single-step flow keeps working exactly as before; steps are additive.

How it resumes

State carries a typed payload plus a Stage marker — the name of the step the run is at. That marker is the single source of truth for "where it is," and it's the resume point. Before each step, the run is saved; after each step completes, the stage advances and the run is saved again. On restart, the engine loads the run and starts at Stage, so completed steps — and their side effects — are skipped.

Here is a run whose payment dependency is down on the first attempt:

```text
first run:
  reserve  → inventory reserved
  charge   → payment dependency unavailable (crash)
  run failed: payment gateway timeout

checkpoint: run 70643f61 is at step "charge" (status failed)

resume:
  charge   → payment captured
  confirm  → order confirmed

reserve ran 1 time(s) total — completed steps are not repeated on resume
```

f.Pending(ctx) lists incomplete runs after a restart; f.Resume(ctx, runID) continues one. The full example is examples/flow-durable — it needs no API key, because durability is the only thing on display.

The honest part

Exactly-once is impossible if a crash lands inside a step: after the restart, the engine can't know whether the charge went through. What durable execution actually gives you is at-least-once execution plus a stable idempotency key per step (runID + step name), so a replayed step is recognized and de-duplicated by the service receiving it. Side-effecting steps have to honor that key. A framework can make this consistent; it can't repeal the underlying reality, and claiming otherwise would be dishonest.
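To make "honor that key" concrete, here is a minimal sketch of the receiving side. Everything in it is an assumption for illustration: the `Payments` type, the `IdempotencyKey` helper, and the `:` separator are hypothetical, and the post does not specify how Go Micro formats or transmits the key.

```go
package main

import (
	"fmt"
	"sync"
)

// IdempotencyKey joins the run ID and step name.
// The ":" separator is an assumption, not a documented format.
func IdempotencyKey(runID, step string) string { return runID + ":" + step }

// Payments is a hypothetical receiving service that remembers results by key.
type Payments struct {
	mu      sync.Mutex
	results map[string]string // idempotency key -> charge ID
	charges int               // number of real charges performed
}

func NewPayments() *Payments { return &Payments{results: map[string]string{}} }

// Charge performs the side effect at most once per key.
// A replayed call with the same key gets the stored result back.
func (p *Payments) Charge(key string, cents int) string {
	p.mu.Lock()
	defer p.mu.Unlock()
	if id, seen := p.results[key]; seen {
		return id // replay: no second charge
	}
	p.charges++
	id := fmt.Sprintf("ch_%d", p.charges)
	p.results[key] = id
	return id
}

func main() {
	p := NewPayments()
	key := IdempotencyKey("70643f61", "charge")

	first := p.Charge(key, 4999)
	replay := p.Charge(key, 4999) // the step runs again after a crash

	fmt.Println(first == replay, p.charges) // prints: true 1
}
```

In a real service the key and the side effect would have to be recorded atomically, for example in the same database transaction. Otherwise a crash between the two reopens the same gap one level down.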

Where agents come in

Go Micro draws the line from Anthropic's taxonomy: workflows follow a predefined path; agents direct themselves. A flow is the workflow — you author the steps. An agent is the self-directed one — the model authors the steps at runtime. They are two kinds of control flow, and durability is orthogonal to both.

So a workflow step can hand off to an agent:

```go
micro.FlowStep{Name: "resolve", Run: micro.FlowDispatch("support-agent")}
```

The deterministic part stays a durable flow; the open-ended part is an agent. The same Checkpoint that persists a flow run is the mechanism the agent's own loop will use to become durable too. That's the next step, and it's a bigger one, because it means the agent owns its loop rather than having the provider drive it. What ships today is durable workflows that can call services and dispatch to agents.

No separate engine

The pluggability is the usual Go Micro shape. The built-in Checkpoint is store-backed — point the default store at Postgres or NATS KV and a run survives a real restart, no extra moving parts. Need more, or already run Temporal or Restate? Implement the Checkpoint interface and delegate to it; the explicit step model is what makes a flow mappable onto an external engine. Most teams need neither — the default is durable.

```go
type Checkpoint interface {
    Save(ctx context.Context, run Run) error
    Load(ctx context.Context, runID string) (Run, bool, error)
    Delete(ctx context.Context, runID string) error
    List(ctx context.Context) ([]Run, error)
}
```

That's the through-line. Durable execution isn't a workflow engine you adopt alongside your services; it's a store and an interface, and the workflow is still just an ordered list of steps you can read. Same as everything else in Go Micro — the abstraction is the service, and this is one more thing the substrate underneath it now handles.

See the Agents and Workflows guide for the full reference.