Benchmark Experiments

benchmark/EXPERIMENTS.md

A log of the load-test experiments run against the worker-is-the-sandbox execution model (ADR 0003 / ADR 0004), so the methodology and findings can be reused without re-deriving them.

Each experiment records: the question, the rig, how to reproduce it, and the measured output.


Experiment 1 — GKE app : worker ratio (1:10 vs 1:20)

Question. At a fixed app size (1 vCPU / 1 GB), does doubling the app:worker ratio from 1:20 to 1:10 (twice as many app pods per worker) buy proportionally more throughput, for both warm and cold traffic?

Rig

| Component | Configuration |
| --- | --- |
| Cluster | GKE, e2-standard-4 × 14 nodes, europe-west1-b |
| Worker | One sandbox per worker, concurrency 1, in-process engine fork (SANDBOX_CODE_ONLY: Node child + isolated-vm). Hard cap 0.5 vCPU / 1 GB |
| App | 1 vCPU / 1 GB per pod |
| Object store | Real same-region GCS bucket (europe-west1) over the S3-interop endpoint (storage.googleapis.com, path-style + SigV4 presigned URLs). Engine pulls flow bundle + piece archives via signed links |
| Postgres / Redis | In-cluster |
| Load tool | hey, -c = worker count (40 or 80) so requests don't queue behind the concurrency-1 workers; latency reflects real service time, not backlog |

Pairs tested (ratio is the only variable):

  • 1:20: 2 app / 40 worker and 4 app / 80 worker
  • 1:10: 4 app / 40 worker and 8 app / 80 worker

Warm vs cold:

  • warm = AP_REUSE_SANDBOX=true — engine process reused between jobs.
  • cold = AP_REUSE_SANDBOX=false — fresh engine fork + boot every job (the realistic isolation guarantee).

The flow under test

A 4-node synchronous webhook flow:

  1. Webhook trigger (catch_webhook, /sync — holds the HTTP connection until the flow returns)
  2. Math Helper (addition_math, 2 + 3)
  3. Code step in isolated-vm (return { result: Number(inputs.sum) + 1 })
  4. Webhook response (sendFlowResponse)

The actual compute is sub-millisecond; everything measured below is orchestration overhead.

How to reproduce

```bash
# Deploys benchmark/k8s-sandbox.yaml to the cluster, runs the load test against the app
# LoadBalancer, and reports cold-boot latency, warm throughput, and the per-run breakdown.
WORKER_REPLICAS=80 APP_REPLICAS=8 REUSE_SANDBOX=true benchmark/run-gke.sh 1000 80
```

Vary WORKER_REPLICAS / APP_REPLICAS for each pair and REUSE_SANDBOX for warm vs cold. The cluster + GCS bucket were torn down after the run (teardown commands are printed at the end).
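
The full matrix (four app/worker pairs, each run warm and cold) can be enumerated with a small loop. The sketch below only prints the commands instead of running them, since each real invocation deploys to a cluster. It assumes the two positional arguments of run-gke.sh are the request count and the hey concurrency, which the example above implies but does not state. The function name run_matrix is illustrative.

```bash
# Dry-run sketch: print one run-gke.sh invocation per (pair, mode).
# Assumption: positional args are <request count> <hey concurrency>,
# with concurrency equal to the worker count as described in the rig.
REQUESTS=1000

run_matrix() {
  for pair in "2 40" "4 80" "4 40" "8 80"; do
    set -- $pair          # split "apps workers" into $1 and $2
    apps=$1
    workers=$2
    for reuse in true false; do   # true = warm, false = cold
      echo "WORKER_REPLICAS=$workers APP_REPLICAS=$apps REUSE_SANDBOX=$reuse benchmark/run-gke.sh $REQUESTS $workers"
    done
  done
}

run_matrix
```

Piping the output through a shell would execute the runs one after another. Reviewing the printed commands first is the safer default.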

Results

Headline — throughput by config (app limit 1000m, worker limit 500m; the bottleneck is whoever saturates its cap first):

| Config | Warm req/s | Cold req/s | App CPU/pod (cold) | Worker CPU/pod (cold) |
| --- | --- | --- | --- | --- |
| 2 app · 40 w | 59.1 | 19.8 | 415m (42%) | 346m (69%) |
| 4 app · 40 w | 93.5 | 19.6 | 257m (26%) | 339m (68%) |
| 4 app · 80 w | 110.5 | 33.3 | 0m (0%)¹ | 0m (0%)¹ |
| 8 app · 80 w | 148.9 | 33.5 | 235m (24%) | 336m (67%) |

¹ kubectl top sampling missed this cold run; CPU not captured.
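
The percentages in the CPU columns are the sampled usage divided by the pod's limit (1000m for apps, 500m for workers). A minimal sketch of that conversion follows; the helper name cpu_pct is illustrative and not part of the benchmark scripts.

```bash
# Convert a millicore reading into a percentage of the pod's CPU limit.
cpu_pct() {
  # $1 = usage in millicores, $2 = limit in millicores
  awk -v used="$1" -v limit="$2" 'BEGIN { printf "%.0f\n", used / limit * 100 }'
}

echo "app, 4 app / 40 w (cold):    $(cpu_pct 257 1000)%"   # 26%
echo "worker, 2 app / 40 w (cold): $(cpu_pct 346 500)%"    # 69%
```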

Latency anatomy — where the milliseconds go (warm 8a/80w vs cold 2a/40w):

| Layer | Warm | Cold | What it is |
| --- | --- | --- | --- |
| app ingress + Redis + worker poll | ~91 ms | ~39 ms | webhook→app→Redis enqueue→worker dequeue, + response delivery back |
| provision | 24 ms | 16 ms | flow-bundle + piece + engine install, all disk-cache hits |
| sandbox boot | 18 ms | 1167 ms | warm = process reused; cold = fresh fork + Node start + bundle parse + isolated-vm init + socket connect |
| flow run (4 steps) | 372 ms | 762 ms | per-step engine→app callbacks + isolated-vm code + response handshake |
| end-to-end avg | 505 ms | 1984 ms | p50 446/1957 · p95 648/2183 · p99 3817/2986 ms (warm/cold) |

  • The cold "sandbox boot" tax (1167 ms). A fresh engine fork pays Node startup (incl. the ~80 ms --no-node-snapshot penalty forced by isolated-vm), 694 KB engine-bundle parse/compile (the bulk), and socket.io connect (~90 ms). In isolated profiling this is ~570 ms; under sustained cold load it inflates to ~1167 ms because ~40 workers fork at once, each capped at 0.5 CPU, and contend — boot is CPU-bound. Warm reuses the process and pays just 18 ms.
  • "flow run" (372 ms warm / 762 ms cold) is orchestration, not compute: after each step the engine reports progress / persists output via an HTTP callback to the app (3 runnable steps ≈ 3 round-trips + flow load + final sendFlowResponse), plus the isolated-vm code call. Direct evidence it's app-callback-bound: adding apps cut warm flow-run from 477 ms (1:20) → 372 ms (1:10) with identical steps — pure compute wouldn't move. Cold flow-run is ~2× warm because the just-forked engine runs on a cold V8 (no JIT warmup) while contending for CPU.
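
Two quick consistency checks on the latency table can be sketched in shell. First, the four layers should sum to the end-to-end average. Second, because hey keeps a fixed number of requests in flight, throughput should be roughly concurrency divided by average latency (Little's law). The helper names sum_layers and est_rps are illustrative.

```bash
# Sum per-layer latencies (ms) passed as arguments.
sum_layers() {
  awk 'BEGIN { total = 0; for (i = 1; i < ARGC; i++) total += ARGV[i]; print total }' "$@"
}

# Closed-loop throughput estimate: concurrency / average latency.
est_rps() {
  # $1 = concurrent clients (hey -c), $2 = average end-to-end latency in ms
  awk -v c="$1" -v ms="$2" 'BEGIN { printf "%.1f\n", c / (ms / 1000) }'
}

echo "warm layers sum: $(sum_layers 91 24 18 372) ms"     # 505
echo "cold layers sum: $(sum_layers 39 16 1167 762) ms"   # 1984
echo "warm 8a/80w estimate: $(est_rps 80 505) req/s"      # 158.4
echo "cold 2a/40w estimate: $(est_rps 40 1984) req/s"     # 20.2
```

The estimates land close to the measured 148.9 and 19.8 req/s, which supports reading the layer numbers as service time and not queueing delay.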

The ratio finding

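| Workers | Mode | 1:20 (req/s) | 1:10 (req/s) | Δ |
| --- | --- | --- | --- | --- |
| 40 w | warm | 59.1 (2a) | 93.5 (4a) | +58% |
| 40 w | cold | 19.8 | 19.6 | −1% |
| 80 w | warm | 110.5 (4a) | 148.9 (8a) | +35% |
| 80 w | cold | 33.3 | 33.5 | +1% |
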
  • Warm: 1:10 does add throughput: at 80 workers, 111 → 149 req/s (+35%). The app is the warm bottleneck (workers idle ~11–14%), so more apps = more callback capacity = lower flow-run latency = higher throughput. The 40-worker pair shows a larger relative gain (+58%), but from a lower base (59.1 → 93.5 req/s).
  • Cold: 1:10 makes essentially no difference (−1% / +1%). Cold is worker-bound (each job pays the ~1.1–1.3 s fork+boot; workers at 67–69% of their CPU cap in the captured runs), so extra apps sit idle.
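
The Δ column is the relative change in throughput when moving from 1:20 to 1:10 at the same worker count. A minimal sketch of that calculation, using the req/s figures from the headline table (the helper name pct_delta is illustrative):

```bash
# Relative throughput change from the 1:20 config to the 1:10 config.
pct_delta() {
  # $1 = req/s at 1:20, $2 = req/s at 1:10
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%+.0f%%\n", (b - a) / a * 100 }'
}

echo "40 w warm: $(pct_delta 59.1 93.5)"     # +58%
echo "40 w cold: $(pct_delta 19.8 19.6)"     # -1%
echo "80 w warm: $(pct_delta 110.5 148.9)"   # +35%
echo "80 w cold: $(pct_delta 33.3 33.5)"     # +1%
```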

Verdict. 1:10 helps only warm/burst traffic and only where workers can saturate the apps; for cold (the realistic isolation path) it's wasted apps. Since apps at 1 vCPU are cheap relative to the worker fleet, 1:10 is a reasonable safety margin for warm-heavy workloads, but 1:20 is the efficient default — the extra apps in 1:10 buy headroom, not a proportional throughput gain.

Notes on caching

Provisioning is cheap because pieces are cached. A worker is its own sandbox and fills its piece cache lazily on first use (the old AP_PRE_WARM_CACHE up-front install step no longer exists). After first use the piece + flow bundle live on the worker's local disk, so warm runs do zero install work: here flow-bundle download ≈ 2 ms and piece install ≈ 3–13 ms. On a cold/first install the archive is pulled from the same-region object store (the GCS bucket, reached through its S3-interop endpoint) via a signed link, which is a fast in-region fetch and not a slow npm round-trip. Cache warmth comes from running long-lived worker replicas, not a warm-up flag.
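
The lazy-fill behaviour can be illustrated with a toy cache: fetch on first use, then reuse the local copy. This is a hypothetical sketch, not Activepieces code. CACHE_DIR, ensure_cached, and fetch_archive are made-up names, and the fetch is a stand-in that writes a marker file where a real worker would download from a signed URL.

```bash
# Hypothetical lazy cache: fetch an archive only on first use, then reuse it.
CACHE_DIR="${CACHE_DIR:-$(mktemp -d)}"
FETCH_COUNT=0

fetch_archive() {
  # Stand-in for a signed-URL download; only writes a marker file.
  # $1 = archive name, $2 = destination path
  FETCH_COUNT=$((FETCH_COUNT + 1))
  printf 'archive for %s\n' "$1" > "$2"
}

ensure_cached() {
  # $1 = archive name; sets CACHED_PATH to the local copy.
  path="$CACHE_DIR/$1.tgz"
  if [ ! -f "$path" ]; then
    fetch_archive "$1" "$path"   # cold path: first use on this worker
  fi
  CACHED_PATH="$path"            # warm path: nothing to install
}

ensure_cached math-helper   # first use: fetches
ensure_cached math-helper   # second use: cache hit
echo "fetches: $FETCH_COUNT"   # fetches: 1
```

The cache here lives only as long as the directory does, which mirrors the point above: warmth depends on the worker replica staying alive.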

Measurement caveat: layer numbers are from hey + engine timing logs. Per-step splits weren't captured (the engine logged flow-run as one aggregate), so the within-step attribution is structural, not timed.