This benchmark answers one question for the recommended production shape: at the 1:10 app-to-worker ratio, how does throughput scale as you grow the fleet from 40 to 160 workers? It runs the worker-is-the-sandbox model on a real GKE cluster, against a same-region object store with signed URLs and official piece tarballs served from the CDN.
The shape under test: app tier, Redis job queue, Postgres, S3, and a one-flow-per-worker execution tier.
The workload is a 4-step synchronous webhook flow that holds the HTTP connection open until the flow returns:
<Steps>
  <Step title="Webhook trigger">
    Catches the request on a `/sync` URL and holds the connection until the flow finishes.
  </Step>
  <Step title="Math Helper">
    Adds `2 + 3`.
  </Step>
  <Step title="Code step">
    Runs `return inputs.sum + 1` inside an `isolated-vm` context.
  </Step>
  <Step title="Webhook response">
    Returns the result, closing the held connection.
  </Step>
</Steps>

The compute is sub-millisecond by design, so everything measured below is orchestration (queueing, callbacks, sandbox boot), which is what actually shapes production latency.
Each fleet size is held at the recommended 1:10 ratio (1 app per 10 workers) and run warm (`AP_REUSE_SANDBOX=true`), so the engine process is reused between jobs.
Each worker is one sandbox at concurrency 1, hard-capped at 0.5 vCPU / 1 GB. Apps are 1 vCPU / 1 GB. Load concurrency is matched to the worker count so requests don't queue behind the concurrency-1 workers.
| Apps · Workers | Ratio | Warm req/s | Warm req/s per worker |
|---|---|---|---|
| 4 app · 40 workers | 1:10 | 185.3 | 4.6 |
| 8 app · 80 workers | 1:10 | 409.5 | 5.1 |
| 12 app · 120 workers | 1:10 | 553.0 | 4.6 |
| 16 app · 160 workers | 1:10 | 686.3 | 4.3 |
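
The per-worker column is the warm rate divided by the worker count. As a minimal sketch, the snippet below recomputes it from the rows above and compares each fleet against perfect linear scaling from the 40-worker row. The names `RESULTS`, `per_worker_rate`, and `scaling_efficiency` are illustrative and not part of the benchmark tooling.

```python
# Warm throughput per fleet size, copied from the results table above.
RESULTS = {40: 185.3, 80: 409.5, 120: 553.0, 160: 686.3}  # workers -> req/s


def per_worker_rate(workers: int, req_per_s: float) -> float:
    """Requests per second handled by each worker."""
    return req_per_s / workers


def scaling_efficiency(workers: int, req_per_s: float, base_workers: int = 40) -> float:
    """Measured throughput relative to linear scaling from the baseline row."""
    ideal = RESULTS[base_workers] * workers / base_workers
    return req_per_s / ideal


for workers, rate in RESULTS.items():
    print(
        f"{workers:>3} workers: {per_worker_rate(workers, rate):.1f} req/s per worker, "
        f"{scaling_efficiency(workers, rate):.0%} of linear"
    )
```

Efficiency here is relative to a single baseline run, so a value above 100% (as in the 80-worker row) reflects run-to-run variance in the baseline rather than super-linear scaling.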
Only the app and worker counts scale (1:10). Postgres and Redis are a single fixed-size pod each — the same for every row below. CPU is the average across three warm load tests; the singletons' figures are the whole pod, app/worker are per pod.
| Apps · Workers | Warm req/s | Postgres used / cap | Redis used / cap | App used / cap (per pod) | Worker used / cap (per pod) |
|---|---|---|---|---|---|
| 4 · 40 | 185 | 522m / 3000m | 134m / 2000m | 782m / 1000m | 102m / 500m |
| 8 · 80 | 410 | 640m / 3000m | 150m / 2000m | 518m / 1000m | 72m / 500m |
| 12 · 120 | 553 | 546m / 3000m | 132m / 2000m | 311m / 1000m | 50m / 500m |
| 16 · 160 | 686 | 396m / 3000m | 169m / 2000m | 205m / 1000m | 37m / 500m |
Postgres never crosses ~0.65 of a core and Redis never crosses ~0.17. Both stay far below their caps and show no upward trend as the fleet quadruples, so they are not absorbing a growing share of the work. Workers use at most ~0.1 core of their 0.5-core cap. The app pods are the busiest tier, peaking at 782m of 1000m in the 4 · 40 row and falling as the fleet grows. No tier saturates, which is why each added worker keeps adding throughput. (The singletons are sized this large on purpose, see Test environment, so they provably stay off the critical path; the default Postgres `max_connections=100` would cap the fleet at ~10 apps, which is the artifact behind the earlier "120 cliff".)
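
To make the headroom concrete, the sketch below turns the millicore figures from the resource table into peak used/cap fractions per tier. The dictionary layout and the `peak_utilization` helper are illustrative assumptions; only the numbers come from the table.

```python
# CPU usage in millicores, copied from the resource table above.
CAPS = {"postgres": 3000, "redis": 2000, "app": 1000, "worker": 500}
USAGE = {
    "4 · 40": {"postgres": 522, "redis": 134, "app": 782, "worker": 102},
    "8 · 80": {"postgres": 640, "redis": 150, "app": 518, "worker": 72},
    "12 · 120": {"postgres": 546, "redis": 132, "app": 311, "worker": 50},
    "16 · 160": {"postgres": 396, "redis": 169, "app": 205, "worker": 37},
}


def peak_utilization(tier: str) -> float:
    """Highest used/cap fraction a tier reaches across all fleet sizes."""
    return max(row[tier] for row in USAGE.values()) / CAPS[tier]


for tier in CAPS:
    print(f"{tier:<8} peaks at {peak_utilization(tier):.0%} of its cap")
```

Postgres and Redis figures are for the whole pod, while app and worker figures are per pod, so the fractions are comparable against each tier's own cap but not against each other in absolute cores.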
At concurrency 1, throughput is workers ÷ per-flow-time, which is linear in the fleet. (The synchronous response reaches the client sooner than that: it is sent at the response step, before the worker wraps up the log write, so client-perceived latency is lower than the worker-busy time that sets throughput.) Per-flow time carries run-to-run variance (the object-store log-write tail), which is why a single run's curve looks bumpy; the per-worker rate staying roughly constant, between 4.3 and 5.1 req/s across the four fleet sizes, is what shows the scaling is linear.
<Note>
Why Production Setup recommends 1:10. Apps at 1 vCPU are cheap relative to the worker fleet, and 1:10 is the warm-headroom margin that keeps the app tier from becoming the wall during bursts. See Production Setup. </Note>
Where the worker's milliseconds go, warm at peak (16 apps · 160 workers):
| Layer | Warm |
|---|---|
| Provision (flow bundle + piece + engine, mostly disk-cache hits) | ~10 ms |
| Sandbox boot (engine process reused) | ~5 ms |
| Flow run (4 steps: engine→app callbacks + end-of-run log persist) | ~203 ms |
| Worker-busy avg per job | ~218 ms |
This is the time the worker is occupied per job — and at concurrency 1 it is what sets throughput (workers ÷ worker-busy-time). The synchronous client sees less: the response is published at the flow's response step, before the worker finishes persisting the run log, so client-perceived latency runs below the worker-busy figure.
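
The workers ÷ worker-busy-time relationship can be checked against the peak row. This is a rough sketch, not the benchmark's own analysis: it treats the approximate layer timings from the breakdown as exact and assumes every worker is busy the whole time, so it gives a ceiling rather than a prediction. The function name is illustrative.

```python
def predicted_throughput(workers: int, busy_ms: float) -> float:
    """Upper bound in req/s when every concurrency-1 worker is always busy."""
    return workers / (busy_ms / 1000.0)


WORKERS = 160
BUSY_MS = 10 + 5 + 203  # provision + sandbox boot + flow run = 218 ms
MEASURED = 686.3  # warm req/s for the 16 · 160 row

ceiling = predicted_throughput(WORKERS, BUSY_MS)
print(
    f"ceiling {ceiling:.0f} req/s, measured {MEASURED:.0f} req/s, "
    f"{MEASURED / ceiling:.0%} of ceiling"
)
```

With these inputs the ceiling comes out at roughly 734 req/s, so the measured 686.3 req/s sits within about 7% of what the simple model allows. The model does not say where the remaining gap comes from.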
- Cluster: `n2-standard-16` × 10 nodes, `europe-west1-b`.
- Sandbox: `SANDBOX_CODE_ONLY` (Node fork + `isolated-vm`).
- Object store: same region (`europe-west1`) over the S3-interop endpoint, path-style SigV4 presigned URLs (`AP_S3_USE_SIGNED_URLS=true`).
- Pieces: official piece tarballs served from the CDN (`AP_USE_CDN_FOR_BUNDLES=true`).
- Singletons: Postgres with `max_connections=2000` (the default 100 would starve the app pools past ~10 apps), durability off, and its data dir on tmpfs; Redis at 2 vCPU / 2 GB with io-threads. Under load both stay near-idle (Postgres under ~0.65 core, Redis under 0.2), confirming the worker tier, not the singletons, is the ceiling.
- Load generator: `hey`, with concurrency matched to worker count (40/80/120/160) so requests don't queue behind the concurrency-1 workers; latency reflects real service time, not backlog.

The benchmark is launched with `benchmark/run-gke.sh [total_requests] [concurrency]`.
The script mints a worker token, deploys `benchmark/k8s-sandbox.yaml` to the cluster, runs the load test against the app LoadBalancer, and reports warm throughput and the per-run breakdown from worker-pod logs. Set `APP_REPLICAS` and `WORKER_REPLICAS` (keeping the 1:10 ratio) to reproduce any row in the results table.
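
A small helper can derive the settings for a given row. This is a sketch under assumptions: it only builds the environment overrides and argument list without running anything, it assumes the script reads `APP_REPLICAS` and `WORKER_REPLICAS` from the environment, and the `total_requests` value of 20000 is a placeholder, not the figure used for the published runs. `run_config` is a hypothetical name.

```python
def run_config(workers: int, total_requests: int) -> tuple[dict[str, str], list[str]]:
    """Environment overrides and argv for one results-table row at the 1:10 ratio."""
    if workers <= 0 or workers % 10 != 0:
        raise ValueError("worker count must be a positive multiple of 10 to keep the 1:10 ratio")
    env = {"APP_REPLICAS": str(workers // 10), "WORKER_REPLICAS": str(workers)}
    # Load concurrency is matched to the worker count, as in the benchmark.
    argv = ["benchmark/run-gke.sh", str(total_requests), str(workers)]
    return env, argv


env, argv = run_config(workers=80, total_requests=20000)  # 20000 is a placeholder
print(" ".join(f"{key}={value}" for key, value in env.items()), " ".join(argv))
```

To execute the command rather than print it, the pair could be passed to `subprocess.run(argv, env={**os.environ, **env}, check=True)` from the repository root.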