plugins/plugin-local-inference/native/voice-bench/README.md
Voice-loop benchmark harness for the Eliza-1 voice pipeline.
A deterministic, replayable harness that drives the real voice pipeline with synthetic audio inputs and measures latency, barge-in behavior, and rollback waste. Per the AGENTS.md "evidence-or-it-didn't-happen" rule, every optimization PR that touches the voice loop ships this harness's JSON output as proof.
The harness records timestamps for every observable transition in the
mic → ASR → drafter ∥ verifier → chunker → TTS pipeline (see
BenchEventName in src/types.ts) and derives:
| Metric | Definition |
|---|---|
| TTFA (★ primary) | t_tts_first_audio − t_speech_start |
| Perceived response latency | t_tts_first_audio − t_speech_end |
| Barge-in response | t_barge_in_hard_stop − t_barge_in_trigger |
| Rollback waste | drafter tokens rejected / drafter tokens proposed |
| DFlash acceptance | when DFlash is wired |
| Peak RSS / CPU / GPU | best-effort process sampling at 100 ms |
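The timestamp arithmetic in the table can be sketched as follows. This is a minimal illustration, not the harness's MetricsCollector: the TurnTimestamps shape and its field names are assumptions chosen to mirror the table, and the real event names live in BenchEventName in src/types.ts.

```typescript
// Hypothetical per-turn record; field names mirror the metric table,
// not the real BenchEventName values in src/types.ts.
interface TurnTimestamps {
  speechStartMs: number;
  speechEndMs: number;
  ttsFirstAudioMs: number;
  bargeInTriggerMs?: number;
  bargeInHardStopMs?: number;
  drafterTokensProposed: number;
  drafterTokensRejected: number;
}

interface DerivedMetrics {
  ttfaMs: number;
  perceivedLatencyMs: number;
  bargeInResponseMs: number | null;
  rollbackWaste: number;
}

function deriveMetrics(t: TurnTimestamps): DerivedMetrics {
  // Barge-in response only exists for turns where both events fired.
  const bargeIn =
    t.bargeInTriggerMs !== undefined && t.bargeInHardStopMs !== undefined
      ? t.bargeInHardStopMs - t.bargeInTriggerMs
      : null;
  return {
    ttfaMs: t.ttsFirstAudioMs - t.speechStartMs,
    perceivedLatencyMs: t.ttsFirstAudioMs - t.speechEndMs,
    bargeInResponseMs: bargeIn,
    // Guard against division by zero when the drafter proposed nothing.
    rollbackWaste:
      t.drafterTokensProposed === 0
        ? 0
        : t.drafterTokensRejected / t.drafterTokensProposed,
  };
}

const example = deriveMetrics({
  speechStartMs: 1000,
  speechEndMs: 2500,
  ttsFirstAudioMs: 2900,
  bargeInTriggerMs: 6000,
  bargeInHardStopMs: 6120,
  drafterTokensProposed: 40,
  drafterTokensRejected: 10,
});
console.log(example);
```

TTFA is measured from the start of speech, so it always includes the utterance duration; perceived response latency subtracts that out by anchoring on speech end.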
# The mock-only CLI path is disabled. Use the real VoiceBench runner:
packages/benchmarks/voicebench/run.sh --profile=groq \
--dataset=packages/benchmarks/voicebench/fixtures/manifest-groq.json
# Compare to a recorded baseline; exit 1 on regression
packages/benchmarks/voicebench/run.sh --profile=elevenlabs \
--dataset=packages/benchmarks/voicebench/fixtures/manifest-elevenlabs.json
For Linux + NVIDIA hosts, the harness ships per-GPU autotune profiles
under packages/inference/configs/gpu/ (3090, 4090, 5090, H200). The
inference engine for this tier is llama.cpp / llama-server — not
vLLM or SGLang.
Detect the host card and print the resolved autotune plan:
bun run --cwd packages/inference/voice-bench bench gpu
# Or narrowed to a specific bundle:
bun run --cwd packages/inference/voice-bench bench gpu --bundle eliza-1-9b
The subcommand calls nvidia-smi --query-gpu=name,memory.total and
loads the matching JSON config file. On a CPU-only host (e.g. CI without
a GPU runner) it prints { "nvidiaPresent": false } and exits 0.
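A rough sketch of the detection step, assuming nvidia-smi output arrives as CSV rows such as "NVIDIA GeForce RTX 4090, 24564 MiB". The resolveGpuPlan function, the GpuPlan shape, and the substring matching against profile keys are all hypothetical; only the nvidiaPresent field and the four card names come from this README.

```typescript
// Hypothetical profile keys; the README lists 3090, 4090, 5090, H200.
const PROFILE_KEYS = ["3090", "4090", "5090", "H200"] as const;
type ProfileKey = (typeof PROFILE_KEYS)[number];

interface GpuPlan {
  nvidiaPresent: boolean;
  name?: string | undefined;
  memoryMiB?: number | undefined;
  profile?: ProfileKey | undefined;
}

// Takes the captured stdout of nvidia-smi (or null when the binary is
// missing) and resolves the first data row to a profile key.
function resolveGpuPlan(smiOutput: string | null): GpuPlan {
  const row = (smiOutput ?? "")
    .split("\n")
    .map((line) => line.trim())
    // Skip blank lines and an optional CSV header row ("name, memory.total").
    .find((line) => line.length > 0 && !line.toLowerCase().startsWith("name"));
  if (!row) return { nvidiaPresent: false };

  const [name = "", mem = ""] = row.split(",").map((cell) => cell.trim());
  const memoryMiB = Number.parseInt(mem, 10); // "24564 MiB" parses to 24564
  const profile = PROFILE_KEYS.find((key) => name.toUpperCase().includes(key));
  return {
    nvidiaPresent: true,
    name,
    memoryMiB: Number.isNaN(memoryMiB) ? undefined : memoryMiB,
    profile,
  };
}

console.log(JSON.stringify(resolveGpuPlan(null)));
console.log(resolveGpuPlan("NVIDIA GeForce RTX 4090, 24564 MiB\n"));
```

Keeping the parser a pure function of the captured output means the CPU-only path can be exercised in CI without a GPU runner.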
Once a real PipelineDriver is wired for --backend cuda, the GPU
matrix in configs/gpu/matrix.json enumerates the (GPU, bundle,
ctx_size) tuples we benchmark. Each row maps to one autotune config.
Per-GPU expected metrics live in the config JSON files and are flagged
"_provenance": "extrapolated" until a real run replaces them. The
override mechanism + per-GPU known limits are documented in
packages/inference/configs/gpu/SPECS.md and
docs/inference/gpu-tier.md.
Unit tests:
bun run --cwd packages/inference/voice-bench test
bun run --cwd packages/inference/voice-bench typecheck
Regenerate fixture WAVs into fixtures/:
bun run --cwd packages/inference/voice-bench generate-fixtures
The fixtures/ directory is gitignored — the harness uses in-memory
fixtures by default and only writes WAVs when you ask it to.
| ID | Shape | What it exercises |
|---|---|---|
| short-turn | 1.5 s utterance | Baseline TTFA on a healthy pipeline |
| long-turn | 8 s utterance | Verifier coverage; no token drop |
| false-end-of-speech | utterance with 400 ms mid-clause pause | Voice state machine PAUSE_TENTATIVE → LISTENING rollback (C1 discard) |
| barge-in | utterance + overlay at t=3 s | Hard-stop within 200 ms |
| barge-in-mid-response | utterance + overlay at t=5 s | Voice state machine SPEAKING → LISTENING rollback (C1 restore) |
| cold-start | first turn on a fresh process | Load-side latency |
| warm-start | second turn after prewarm | Steady-state TTFA |
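As an illustration of how an in-memory fixture with the false-end-of-speech shape could be built, here is a minimal sketch. The 16 kHz sample rate, the 1.2 s segment lengths, and the sine burst standing in for speech are all assumptions; only the 400 ms pause comes from the table, and the real generator behind generate-fixtures may work differently.

```typescript
// Assumed sample rate; the real fixtures may use a different one.
const SAMPLE_RATE_HZ = 16_000;

// A sine burst stands in for speech energy; it is not intelligible audio.
function tone(durationMs: number, freqHz = 220, amplitude = 0.5): Float32Array {
  const n = Math.round((durationMs / 1000) * SAMPLE_RATE_HZ);
  const out = new Float32Array(n);
  for (let i = 0; i < n; i++) {
    out[i] = amplitude * Math.sin((2 * Math.PI * freqHz * i) / SAMPLE_RATE_HZ);
  }
  return out;
}

// Float32Array is zero-initialised, so a fresh buffer is digital silence.
function silence(durationMs: number): Float32Array {
  return new Float32Array(Math.round((durationMs / 1000) * SAMPLE_RATE_HZ));
}

function concat(parts: Float32Array[]): Float32Array {
  const total = parts.reduce((sum, part) => sum + part.length, 0);
  const out = new Float32Array(total);
  let offset = 0;
  for (const part of parts) {
    out.set(part, offset);
    offset += part.length;
  }
  return out;
}

// false-end-of-speech shape: speech, 400 ms mid-clause pause, speech.
const falseEndOfSpeech = concat([tone(1200), silence(400), tone(1200)]);
console.log(falseEndOfSpeech.length / SAMPLE_RATE_HZ, "seconds");
```

Because the buffer is generated from fixed parameters, every run sees identical samples, which is what makes the scenario replayable.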
Rollback scenarios report two extra fields on top of the per-fixture
BenchMetrics:
- rollbackCount: number of rollback-drop events the pipeline emitted (one per C1 discard or C1 restore).
- rollbackWasteTokens: drafter tokens thrown away because the state machine rolled back. The driver may supply this directly; otherwise the harness sums data.tokens from each rollback-drop event.

Gate thresholds are defined in src/gates.ts. Defaults:
| Metric | Warn | Fail |
|---|---|---|
| TTFA p50 regression vs baseline | +20 % | +50 % |
| TTFA p95 regression vs baseline | +30 % | +50 % |
| Barge-in p95 | — | 250 ms absolute ceiling |
| False-barge-in rate | — | 0.05 / turn ceiling |
| Rollback waste | — | 0.30 ceiling |
evaluateGates() returns a GateReport with a markdown table. The CLI
emits this to stdout and exits 1 on a fail row.
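The default thresholds can be read as two kinds of check: a relative regression against the baseline, and an absolute ceiling. The sketch below illustrates that logic under stated assumptions. It is not the code in src/gates.ts: the Aggregates and GateRow shapes, the function names, and the choice of strict greater-than comparisons are all hypothetical.

```typescript
type Status = "pass" | "warn" | "fail";

// Hypothetical aggregate shape; the real BenchAggregates may differ.
interface Aggregates {
  ttfaP50Ms: number;
  ttfaP95Ms: number;
  bargeInP95Ms: number;
  falseBargeInRate: number;
  rollbackWaste: number;
}

interface GateRow {
  metric: string;
  value: string;
  status: Status;
}

// Relative gate: compares the fractional regression against warn/fail limits.
function relativeGate(
  metric: string, current: number, baseline: number, warn: number, fail: number,
): GateRow {
  const delta = (current - baseline) / baseline;
  const status: Status = delta > fail ? "fail" : delta > warn ? "warn" : "pass";
  const sign = delta >= 0 ? "+" : "";
  return { metric, value: `${sign}${(delta * 100).toFixed(1)} %`, status };
}

// Absolute gate: fails as soon as the value exceeds the ceiling.
function ceilingGate(metric: string, current: number, ceiling: number): GateRow {
  return { metric, value: String(current), status: current > ceiling ? "fail" : "pass" };
}

function evaluateGatesSketch(current: Aggregates, baseline: Aggregates) {
  const rows: GateRow[] = [
    relativeGate("TTFA p50", current.ttfaP50Ms, baseline.ttfaP50Ms, 0.2, 0.5),
    relativeGate("TTFA p95", current.ttfaP95Ms, baseline.ttfaP95Ms, 0.3, 0.5),
    ceilingGate("Barge-in p95 (ms)", current.bargeInP95Ms, 250),
    ceilingGate("False-barge-in rate", current.falseBargeInRate, 0.05),
    ceilingGate("Rollback waste", current.rollbackWaste, 0.3),
  ];
  const markdown = [
    "| Metric | Value | Status |",
    "|---|---|---|",
    ...rows.map((r) => `| ${r.metric} | ${r.value} | ${r.status} |`),
  ].join("\n");
  return { rows, markdown, failed: rows.some((r) => r.status === "fail") };
}

const baselineRun: Aggregates = { ttfaP50Ms: 400, ttfaP95Ms: 600, bargeInP95Ms: 180, falseBargeInRate: 0.01, rollbackWaste: 0.2 };
const currentRun: Aggregates = { ttfaP50Ms: 500, ttfaP95Ms: 650, bargeInP95Ms: 260, falseBargeInRate: 0.02, rollbackWaste: 0.25 };
const report = evaluateGatesSketch(currentRun, baselineRun);
console.log(report.markdown);
```

A caller would map the failed flag to the process exit code, so a single fail row is enough to block the PR while warn rows only show up in the report.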
When a real optimization legitimately improves a metric, record a new baseline:
bun run --cwd packages/inference/voice-bench bench \
--bundle eliza-1-2b --backend metal --runs 5 \
--output packages/inference/voice-bench/baselines/M4Max-metal.json
Commit the JSON. Future PRs compare against it.
The runnable mock-only CLI is disabled and runBench() rejects mock/fake/stub drivers. The MockPipelineDriver remains test-only scaffolding; the real pipeline driver is a follow-up. Its contract is the PipelineDriver interface in src/types.ts. To wire it:

1. Construct a VoicePipeline (packages/app-core/.../voice/pipeline.ts) with real StreamingTranscriber, DraftProposer, and TargetVerifier implementations. The bench package intentionally does not depend on @elizaos/app-core; wire from a thin host package that owns both.
2. In run(args), feed args.audio.pcm to the VoiceScheduler via its MicSource adapter while replaying frames through SyntheticAudioSource at wall-clock rate.
3. Bridge VoiceBenchProbe to each pipeline event. The events you need to fire (see BenchEventName):
   - speech-start / speech-pause / speech-end: from the VAD
   - asr-partial / asr-final: from StreamingTranscriber
   - draft-start / draft-first-token / draft-complete: from DraftProposer
   - verifier-start / verifier-first-token / verifier-complete: from TargetVerifier
   - phrase-emit: from the phrase chunker
   - tts-first-pcm: from the streaming TTS backend
   - audio-out-first-frame: from the ring buffer's first dequeue
   - barge-in-trigger / barge-in-hard-stop: from BargeInController
4. Implement dispose() to tear down GPU resources.
5. Register the driver under a backend name (metal, cuda, vulkan, cpu) and add a case in bin/voice-bench.

The real driver should emit the same event sequence as the unit-test driver, but benchmark artifacts produced by test drivers are not release evidence.
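Replaying frames at wall-clock rate can be sketched as slicing the PCM buffer into fixed-size frames, each with a due time, and sleeping until each one is due. This is an illustrative sketch only: scheduleFrames, replayAtWallClock, and the push callback (standing in for whatever the MicSource adapter accepts) are hypothetical names, and the 20 ms frame size is an assumption.

```typescript
interface ScheduledFrame {
  dueMs: number; // offset from replay start at which the frame is due
  samples: Float32Array;
}

// Slice PCM into frames and stamp each with its wall-clock due time.
function scheduleFrames(
  pcm: Float32Array, sampleRateHz: number, frameMs: number,
): ScheduledFrame[] {
  const frameLen = Math.round((sampleRateHz * frameMs) / 1000);
  if (frameLen <= 0) throw new Error("frame length must be positive");
  const frames: ScheduledFrame[] = [];
  for (let start = 0; start < pcm.length; start += frameLen) {
    frames.push({
      dueMs: (start * 1000) / sampleRateHz,
      // subarray shares memory with pcm; the final frame may be short.
      samples: pcm.subarray(start, Math.min(start + frameLen, pcm.length)),
    });
  }
  return frames;
}

// Push each frame no earlier than its due time. The clock and sleep
// functions are injectable so a test can substitute a fake clock.
async function replayAtWallClock(
  frames: ScheduledFrame[],
  push: (samples: Float32Array) => void,
  now: () => number = () => Date.now(),
  sleep: (ms: number) => Promise<void> =
    (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<void> {
  const startedAt = now();
  for (const frame of frames) {
    const waitMs = frame.dueMs - (now() - startedAt);
    if (waitMs > 0) await sleep(waitMs);
    push(frame.samples);
  }
}

const oneSecond = new Float32Array(16_000);
const frames = scheduleFrames(oneSecond, 16_000, 20);
console.log(frames.length, "frames, last due at", frames[frames.length - 1].dueMs, "ms");
```

Scheduling against the elapsed time since start, rather than sleeping a fixed interval per frame, keeps timer jitter from accumulating over a long utterance.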
- Per docs/eliza-1-pipeline/06-test-matrix.md, release-blocking latency gates still require a real-recorded WAV corpus.
- BenchMetrics.dflash-server; mock values are not accepted for release evidence.

    SyntheticAudioSource ─┐
                          │
                          ▼
            PipelineDriver.run({ audio, injection, probe })
                          │
                          ▼  (BenchEventName timestamps)
            MetricsCollector ──► BenchMetrics
                          │
                          ▼
            aggregate() ──► BenchAggregates
                          │
                          ▼
            evaluateGates(current, baseline) ──► GateReport (md)
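The aggregation stage reduces per-run metrics to the p50 and p95 values that the gates compare. A minimal sketch using the nearest-rank percentile method is shown here; the choice of nearest-rank and the names percentile and aggregateTtfa are assumptions, and the real aggregate() may interpolate or use a different shape.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p %
// of all samples are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  if (p < 0 || p > 100) throw new Error("p must be within 0..100");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1] as number;
}

// Hypothetical aggregate over per-run TTFA values in milliseconds.
function aggregateTtfa(ttfaMsPerRun: number[]): { p50: number; p95: number } {
  return {
    p50: percentile(ttfaMsPerRun, 50),
    p95: percentile(ttfaMsPerRun, 95),
  };
}

const ttfa = aggregateTtfa([500, 100, 400, 200, 300]);
console.log(ttfa);
```

With only a handful of runs, the p95 under nearest-rank is simply the slowest run, which is worth keeping in mind when choosing a value for --runs.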
Everything in src/ is pure TypeScript with strict + no any. No
runtime dependency on @elizaos/* packages — the harness is intentionally
isolated so a bun test in CI doesn't drag the inference stack along.