@elizaos/voice-bench

Voice-loop benchmark harness for the Eliza-1 voice pipeline.

A deterministic, replayable harness that drives the real voice pipeline with synthetic audio inputs and measures latency, barge-in behavior, and rollback waste. Per AGENTS.md "evidence-or-it-didn't-happen" rule, every optimization PR that touches the voice loop ships this harness's JSON output as proof.

What it measures

The harness records timestamps for every observable transition in the mic → ASR → drafter ∥ verifier → chunker → TTS pipeline (see BenchEventName in src/types.ts) and derives:

Metric	Definition
TTFA (★ primary)	`t_tts_first_audio − t_speech_start`
Perceived response latency	`t_tts_first_audio − t_speech_end`
Barge-in response	`t_barge_in_hard_stop − t_barge_in_trigger`
Rollback waste	drafter tokens rejected / drafter tokens proposed
DFlash acceptance	when DFlash is wired
Peak RSS / CPU / GPU	best-effort process sampling at 100 ms

Running

bash

# The mock-only CLI path is disabled. Use the real VoiceBench runner:
packages/benchmarks/voicebench/run.sh --profile=groq \
  --dataset=packages/benchmarks/voicebench/fixtures/manifest-groq.json

# Compare to a recorded baseline; exit 1 on regression
packages/benchmarks/voicebench/run.sh --profile=elevenlabs \
  --dataset=packages/benchmarks/voicebench/fixtures/manifest-elevenlabs.json

Running on GPU (single-GPU tier)

For Linux + NVIDIA hosts, the harness ships per-GPU autotune profiles under packages/inference/configs/gpu/ (3090, 4090, 5090, H200). The inference engine for this tier is llama.cpp / llama-server — not vLLM or SGLang.

Detect the host card and print the resolved autotune plan:

bash

bun run --cwd packages/inference/voice-bench bench gpu
# Or narrowed to a specific bundle:
bun run --cwd packages/inference/voice-bench bench gpu --bundle eliza-1-9b

The subcommand calls nvidia-smi --query-gpu=name,memory.total and loads the matching JSON config file. On a CPU-only host (e.g. CI without a GPU runner) it prints { "nvidiaPresent": false } and exits 0.

Once a real PipelineDriver is wired for --backend cuda, the GPU matrix in configs/gpu/matrix.json enumerates the (GPU, bundle, ctx_size) tuples we benchmark. Each row maps to one autotune config.

Per-GPU expected metrics live in the config JSON files and are flagged "_provenance": "extrapolated" until a real run replaces them. The override mechanism + per-GPU known limits are documented in packages/inference/configs/gpu/SPECS.md and docs/inference/gpu-tier.md.

Unit tests:

bash

bun run --cwd packages/inference/voice-bench test
bun run --cwd packages/inference/voice-bench typecheck

Regenerate fixture WAVs into fixtures/:

bash

bun run --cwd packages/inference/voice-bench generate-fixtures

The fixtures/ directory is gitignored — the harness uses in-memory fixtures by default and only writes WAVs when you ask it to.

Scenario catalog

ID	Shape	What it exercises
`short-turn`	1.5 s utterance	Baseline TTFA on a healthy pipeline
`long-turn`	8 s utterance	Verifier coverage; no token drop
`false-end-of-speech`	utterance with 400 ms mid-clause pause	Voice state machine `PAUSE_TENTATIVE → LISTENING` rollback (C1 discard)
`barge-in`	utterance + overlay at t=3 s	Hard-stop within 200 ms
`barge-in-mid-response`	utterance + overlay at t=5 s	Voice state machine `SPEAKING → LISTENING` rollback (C1 restore)
`cold-start`	first turn on a fresh process	Load-side latency
`warm-start`	second turn after prewarm	Steady-state TTFA

Rollback scenarios report two extra fields on top of the per-fixture BenchMetrics:

rollbackCount — number of rollback-drop events the pipeline emitted (one per C1 discard or C1 restore).
rollbackWasteTokens — drafter tokens thrown away because the state machine rolled back. The driver may supply this directly; otherwise the harness sums data.tokens from each rollback-drop event.

Eval gates

Defined in src/gates.ts. Defaults:

Metric	Warn	Fail
TTFA p50 regression vs baseline	+20 %	+50 %
TTFA p95 regression vs baseline	+30 %	+50 %
Barge-in p95	—	250 ms absolute ceiling
False-barge-in rate	—	0.05 / turn ceiling
Rollback waste	—	0.30 ceiling

evaluateGates() returns a GateReport with a markdown table. The CLI emits this to stdout and exits 1 on a fail row.

Updating baselines

When a real optimization legitimately improves a metric, record a new baseline:

bash

bun run --cwd packages/inference/voice-bench bench \
  --bundle eliza-1-2b --backend metal --runs 5 \
  --output packages/inference/voice-bench/baselines/M4Max-metal.json

Commit the JSON. Future PRs compare against it.

Wiring the real pipeline (follow-up)

The runnable mock-only CLI is disabled and runBench() rejects mock/fake/stub drivers. The MockPipelineDriver remains test-only scaffolding; the real pipeline driver is a follow-up — the contract is the PipelineDriver interface in src/types.ts. To wire it:

Construct a VoicePipeline (packages/app-core/.../voice/pipeline.ts) with real StreamingTranscriber, DraftProposer, and TargetVerifier implementations. The bench package intentionally does not depend on @elizaos/app-core — wire from a thin host package that owns both.
Inside the driver's run(args), feed args.audio.pcm to the VoiceScheduler via its MicSource adapter while replaying frames through SyntheticAudioSource at wall-clock rate.
Attach a VoiceBenchProbe to each pipeline event. The events you need to fire (see BenchEventName):
- speech-start / speech-pause / speech-end — from the VAD
- asr-partial / asr-final — from StreamingTranscriber
- draft-start / draft-first-token / draft-complete — from DraftProposer
- verifier-start / verifier-first-token / verifier-complete — from TargetVerifier
- phrase-emit — from the phrase chunker
- tts-first-pcm — from the streaming TTS backend
- audio-out-first-frame — from the ring buffer's first dequeue
- barge-in-trigger / barge-in-hard-stop — from BargeInController
Optionally implement dispose() to tear down GPU resources.
Register the driver under a backend name (metal, cuda, vulkan, cpu) and add a case in bin/voice-bench.

The real driver should emit the same event sequence as the unit-test driver, but benchmark artifacts produced by test drivers are not release evidence.

Known limitations

Synthetic audio is not real speech. Per docs/eliza-1-pipeline/06-test-matrix.md, release-blocking latency gates still require a real-recorded WAV corpus.
GPU utilization is not yet sampled. The Metal/Vulkan counter hooks are TBD; the field is optional in BenchMetrics.
DFlash stats are driver-supplied. The real driver must hook into dflash-server; mock values are not accepted for release evidence.
Single-process only. The harness runs the driver in-process. For cold-start measurement that includes shell startup, the runner needs a subprocess wrapper — a follow-up.

Architecture

SyntheticAudioSource ─┐
                      │
                      ▼
                 PipelineDriver.run({ audio, injection, probe })
                      │
                      ▼ (BenchEventName timestamps)
                 MetricsCollector ──► BenchMetrics
                      │
                      ▼
                 aggregate() ──► BenchAggregates
                      │
                      ▼
                 evaluateGates(current, baseline) ──► GateReport (md)

Everything in src/ is pure TypeScript with strict + no any. No runtime dependency on @elizaos/* packages — the harness is intentionally isolated so a bun test in CI doesn't drag the inference stack along.