@elizaos/voice-bench

plugins/plugin-local-inference/native/voice-bench/README.md


Voice-loop benchmark harness for the Eliza-1 voice pipeline.

A deterministic, replayable harness that drives the real voice pipeline with synthetic audio inputs and measures latency, barge-in behavior, and rollback waste. Under the AGENTS.md "evidence-or-it-didn't-happen" rule, every optimization PR that touches the voice loop ships this harness's JSON output as proof.

What it measures

The harness records timestamps for every observable transition in the mic → ASR → drafter ∥ verifier → chunker → TTS pipeline (see BenchEventName in src/types.ts) and derives:

| Metric | Definition |
| --- | --- |
| TTFA (★ primary) | t_tts_first_audio − t_speech_start |
| Perceived response latency | t_tts_first_audio − t_speech_end |
| Barge-in response | t_barge_in_hard_stop − t_barge_in_trigger |
| Rollback waste | drafter tokens rejected / drafter tokens proposed |
| DFlash acceptance | when DFlash is wired |
| Peak RSS / CPU / GPU | best-effort process sampling at 100 ms |
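
The definitions above can be sketched as a small derivation function. This is only an illustration: the `Timestamps`, `DrafterCounts`, and `deriveMetrics` names are hypothetical (the real types live in src/types.ts and may differ), and all timestamps are assumed to be milliseconds read from one monotonic clock.

```typescript
// Hypothetical shapes for illustration; the real types are in src/types.ts.
interface Timestamps {
  speechStart: number;      // t_speech_start (ms)
  speechEnd: number;        // t_speech_end (ms)
  ttsFirstAudio: number;    // t_tts_first_audio (ms)
  bargeInTrigger?: number;  // t_barge_in_trigger (ms), barge-in scenarios only
  bargeInHardStop?: number; // t_barge_in_hard_stop (ms), barge-in scenarios only
}

interface DrafterCounts {
  proposed: number;
  rejected: number;
}

interface DerivedMetrics {
  ttfaMs: number;
  perceivedLatencyMs: number;
  bargeInResponseMs?: number;
  rollbackWaste: number;
}

function deriveMetrics(t: Timestamps, d: DrafterCounts): DerivedMetrics {
  const m: DerivedMetrics = {
    ttfaMs: t.ttsFirstAudio - t.speechStart,
    perceivedLatencyMs: t.ttsFirstAudio - t.speechEnd,
    // Assumption: report zero waste when the drafter proposed nothing.
    rollbackWaste: d.proposed > 0 ? d.rejected / d.proposed : 0,
  };
  // Barge-in response is only defined when both barge-in events were observed.
  if (t.bargeInTrigger !== undefined && t.bargeInHardStop !== undefined) {
    m.bargeInResponseMs = t.bargeInHardStop - t.bargeInTrigger;
  }
  return m;
}
```

For a turn where speech starts at 100 ms, ends at 1600 ms, and the first TTS audio arrives at 1900 ms, this yields a TTFA of 1800 ms and a perceived response latency of 300 ms.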

Running

```bash
# The mock-only CLI path is disabled. Use the real VoiceBench runner:
packages/benchmarks/voicebench/run.sh --profile=groq \
  --dataset=packages/benchmarks/voicebench/fixtures/manifest-groq.json

# Compare to a recorded baseline; exit 1 on regression
packages/benchmarks/voicebench/run.sh --profile=elevenlabs \
  --dataset=packages/benchmarks/voicebench/fixtures/manifest-elevenlabs.json
```

Running on GPU (single-GPU tier)

For Linux + NVIDIA hosts, the harness ships per-GPU autotune profiles under packages/inference/configs/gpu/ (3090, 4090, 5090, H200). The inference engine for this tier is llama.cpp / llama-server — not vLLM or SGLang.

Detect the host card and print the resolved autotune plan:

```bash
bun run --cwd packages/inference/voice-bench bench gpu
# Or narrowed to a specific bundle:
bun run --cwd packages/inference/voice-bench bench gpu --bundle eliza-1-9b
```

The subcommand calls nvidia-smi --query-gpu=name,memory.total and loads the matching JSON config file. On a CPU-only host (e.g. CI without a GPU runner) it prints { "nvidiaPresent": false } and exits 0.
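
The detection step can be pictured roughly as follows. This sketch only covers parsing and profile matching, not spawning nvidia-smi. It assumes the query is run with `--format=csv,noheader,nounits` (the source names only the `--query-gpu` fields), and the function names and the substring-matching rule are hypothetical, not the harness's actual logic.

```typescript
// Profile keys taken from the list of shipped configs; helper names are hypothetical.
const PROFILES = ["3090", "4090", "5090", "H200"] as const;
type Profile = (typeof PROFILES)[number];

interface GpuInfo {
  name: string;
  memoryMiB: number;
}

type Plan =
  | { nvidiaPresent: false }
  | { nvidiaPresent: true; gpu: GpuInfo; profile: Profile | undefined };

// Parses lines such as "NVIDIA GeForce RTX 4090, 24564".
// parseInt also tolerates a trailing unit ("24564 MiB") if nounits was omitted.
function parseGpuQuery(csv: string): GpuInfo[] {
  return csv
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.includes(","))
    .map((line) => {
      const idx = line.lastIndexOf(",");
      return {
        name: line.slice(0, idx).trim(),
        memoryMiB: Number.parseInt(line.slice(idx + 1).trim(), 10),
      };
    });
}

// Assumption: a profile matches when its key appears in the reported card name.
function matchProfile(gpuName: string): Profile | undefined {
  return PROFILES.find((p) => gpuName.toUpperCase().includes(p));
}

// Empty output (no NVIDIA card) resolves to the CPU-only result.
function resolvePlan(csv: string): Plan {
  const first = parseGpuQuery(csv)[0];
  if (first === undefined) return { nvidiaPresent: false };
  return { nvidiaPresent: true, gpu: first, profile: matchProfile(first.name) };
}
```

Keeping the parsing separate from the process spawn means the CPU-only path can be exercised in CI without a GPU runner.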

The GPU matrix in configs/gpu/matrix.json enumerates the (GPU, bundle, ctx_size) tuples to benchmark once a real PipelineDriver is wired for --backend cuda. Each row maps to one autotune config.

Per-GPU expected metrics live in the config JSON files and are flagged "_provenance": "extrapolated" until a real run replaces them. The override mechanism and the known per-GPU limits are documented in packages/inference/configs/gpu/SPECS.md and docs/inference/gpu-tier.md.

Unit tests:

```bash
bun run --cwd packages/inference/voice-bench test
bun run --cwd packages/inference/voice-bench typecheck
```

Regenerate fixture WAVs into fixtures/:

```bash
bun run --cwd packages/inference/voice-bench generate-fixtures
```

The fixtures/ directory is gitignored — the harness uses in-memory fixtures by default and only writes WAVs when you ask it to.

Scenario catalog

| ID | Shape | What it exercises |
| --- | --- | --- |
| short-turn | 1.5 s utterance | Baseline TTFA on a healthy pipeline |
| long-turn | 8 s utterance | Verifier coverage; no token drop |
| false-end-of-speech | utterance with 400 ms mid-clause pause | Voice state machine PAUSE_TENTATIVE → LISTENING rollback (C1 discard) |
| barge-in | utterance + overlay at t=3 s | Hard-stop within 200 ms |
| barge-in-mid-response | utterance + overlay at t=5 s | Voice state machine SPEAKING → LISTENING rollback (C1 restore) |
| cold-start | first turn on a fresh process | Load-side latency |
| warm-start | second turn after prewarm | Steady-state TTFA |
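
To make the "Shape" column concrete, here is a minimal sketch of how an in-memory fixture such as false-end-of-speech could be synthesized. Everything here is an assumption for illustration: the `Segment` type, the `synthPcm` name, the 16 kHz sample rate, the use of a sine tone as a stand-in for speech, and the position of the pause within the utterance. The harness's own fixture generator may work differently.

```typescript
// Hypothetical fixture description: alternating tone and silence segments.
type Segment = { kind: "tone" | "silence"; ms: number };

// Renders segments into 16-bit mono PCM. A sine tone stands in for speech.
function synthPcm(segments: Segment[], sampleRate = 16000, freqHz = 220): Int16Array {
  const lengths = segments.map((s) => Math.round((s.ms / 1000) * sampleRate));
  const pcm = new Int16Array(lengths.reduce((a, b) => a + b, 0));
  let offset = 0;
  segments.forEach((s, k) => {
    const len = lengths[k] ?? 0;
    if (s.kind === "tone") {
      for (let i = 0; i < len; i++) {
        // Half-scale amplitude leaves headroom and avoids clipping.
        pcm[offset + i] = Math.round(0.5 * 32767 * Math.sin((2 * Math.PI * freqHz * i) / sampleRate));
      }
    }
    // Silence segments stay at zero, the Int16Array default.
    offset += len;
  });
  return pcm;
}

// A 1.5 s clip with a 400 ms mid-clause pause (pause position is an assumption).
const falseEndOfSpeech = synthPcm([
  { kind: "tone", ms: 600 },
  { kind: "silence", ms: 400 },
  { kind: "tone", ms: 500 },
]);
```

Because the output is fully determined by the segment list, repeated runs replay identical audio, which is what makes the scenarios comparable across runs.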

Rollback scenarios report two extra fields on top of the per-fixture BenchMetrics:

  • rollbackCount — number of rollback-drop events the pipeline emitted (one per C1 discard or C1 restore).
  • rollbackWasteTokens — drafter tokens thrown away because the state machine rolled back. The driver may supply this directly; otherwise the harness sums data.tokens from each rollback-drop event.
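
The fallback described above, where the driver's value wins and the harness otherwise sums data.tokens, might look like this. The `BenchEvent` shape and the `rollbackStats` name are hypothetical; only the rollback-drop event name and the data.tokens field come from this README.

```typescript
// Hypothetical minimal event shape; only name and data.tokens matter here.
interface BenchEvent {
  name: string;
  data?: { tokens?: number };
}

function rollbackStats(
  events: BenchEvent[],
  driverSuppliedTokens?: number,
): { rollbackCount: number; rollbackWasteTokens: number } {
  const drops = events.filter((e) => e.name === "rollback-drop");
  // Events without a token count contribute zero to the sum.
  const summed = drops.reduce((acc, e) => acc + (e.data?.tokens ?? 0), 0);
  return {
    rollbackCount: drops.length,
    // Prefer the driver's own figure; fall back to the summed event data.
    rollbackWasteTokens: driverSuppliedTokens ?? summed,
  };
}
```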

Eval gates

Defined in src/gates.ts. Defaults:

| Metric | Threshold |
| --- | --- |
| TTFA p50 regression vs baseline | warn at +20 %, fail at +50 % |
| TTFA p95 regression vs baseline | warn at +30 %, fail at +50 % |
| Barge-in p95 | 250 ms absolute ceiling |
| False-barge-in rate | 0.05 / turn ceiling |
| Rollback waste | 0.30 ceiling |

evaluateGates() returns a GateReport with a markdown table. The CLI emits this to stdout and exits 1 on a fail row.
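
A rough sketch of how the default gates could be evaluated is shown below. It is not the code in src/gates.ts: the `Aggregates` shape, the helper names, and the report layout are hypothetical, and two behaviors are assumptions, namely that the absolute ceilings are fail-only (no warn level) and that a value must strictly exceed a threshold to trip it.

```typescript
type Verdict = "pass" | "warn" | "fail";

// Hypothetical aggregate shape for illustration.
interface Aggregates {
  ttfaP50Ms: number;
  ttfaP95Ms: number;
  bargeInP95Ms: number;
  falseBargeInRate: number; // per turn
  rollbackWaste: number;    // rejected / proposed
}

interface GateRow {
  metric: string;
  verdict: Verdict;
}

// Relative regression against the baseline, e.g. 0.2 means +20 %.
function regression(current: number, baseline: number, warn: number, fail: number): Verdict {
  const delta = (current - baseline) / baseline;
  if (delta > fail) return "fail";
  if (delta > warn) return "warn";
  return "pass";
}

// Assumption: absolute ceilings have no warn level.
function ceiling(value: number, max: number): Verdict {
  return value > max ? "fail" : "pass";
}

function evaluateGatesSketch(current: Aggregates, baseline: Aggregates) {
  const rows: GateRow[] = [
    { metric: "TTFA p50", verdict: regression(current.ttfaP50Ms, baseline.ttfaP50Ms, 0.2, 0.5) },
    { metric: "TTFA p95", verdict: regression(current.ttfaP95Ms, baseline.ttfaP95Ms, 0.3, 0.5) },
    { metric: "Barge-in p95", verdict: ceiling(current.bargeInP95Ms, 250) },
    { metric: "False-barge-in rate", verdict: ceiling(current.falseBargeInRate, 0.05) },
    { metric: "Rollback waste", verdict: ceiling(current.rollbackWaste, 0.3) },
  ];
  const markdown = [
    "| Metric | Verdict |",
    "| --- | --- |",
    ...rows.map((r) => `| ${r.metric} | ${r.verdict} |`),
  ].join("\n");
  // Mirrors the CLI behavior: any fail row means a non-zero exit code.
  const exitCode = rows.some((r) => r.verdict === "fail") ? 1 : 0;
  return { rows, markdown, exitCode };
}
```

Warn rows do not change the exit code in this sketch, which matches the statement that the CLI exits 1 only on a fail row.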

Updating baselines

When a real optimization legitimately improves a metric, record a new baseline:

```bash
bun run --cwd packages/inference/voice-bench bench \
  --bundle eliza-1-2b --backend metal --runs 5 \
  --output packages/inference/voice-bench/baselines/M4Max-metal.json
```

Commit the JSON. Future PRs compare against it.

Wiring the real pipeline (follow-up)

The runnable mock-only CLI is disabled and runBench() rejects mock/fake/stub drivers. The MockPipelineDriver remains test-only scaffolding; the real pipeline driver is a follow-up — the contract is the PipelineDriver interface in src/types.ts. To wire it:

  1. Construct a VoicePipeline (packages/app-core/.../voice/pipeline.ts) with real StreamingTranscriber, DraftProposer, and TargetVerifier implementations. The bench package intentionally does not depend on @elizaos/app-core — wire from a thin host package that owns both.
  2. Inside the driver's run(args), feed args.audio.pcm to the VoiceScheduler via its MicSource adapter while replaying frames through SyntheticAudioSource at wall-clock rate.
  3. Attach a VoiceBenchProbe to each pipeline event. The events you need to fire (see BenchEventName):
    • speech-start / speech-pause / speech-end — from the VAD
    • asr-partial / asr-final — from StreamingTranscriber
    • draft-start / draft-first-token / draft-complete — from DraftProposer
    • verifier-start / verifier-first-token / verifier-complete — from TargetVerifier
    • phrase-emit — from the phrase chunker
    • tts-first-pcm — from the streaming TTS backend
    • audio-out-first-frame — from the ring buffer's first dequeue
    • barge-in-trigger / barge-in-hard-stop — from BargeInController
  4. Optionally implement dispose() to tear down GPU resources.
  5. Register the driver under a backend name (metal, cuda, vulkan, cpu) and add a case in bin/voice-bench.

The real driver should emit the same event sequence as the unit-test driver, but benchmark artifacts produced by test drivers are not release evidence.
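
Step 3 above can be sketched as a single wiring function that subscribes a probe to every event. The event names are the ones listed in step 3; everything else (`Probe`, `PipelineHooks`, `attachProbe`, and the two in-memory classes) is hypothetical scaffolding for this example and does not reflect the actual VoiceBenchProbe or PipelineDriver signatures in src/types.ts.

```typescript
// Event names from step 3; the authoritative union is BenchEventName in src/types.ts.
const BENCH_EVENTS = [
  "speech-start", "speech-pause", "speech-end",
  "asr-partial", "asr-final",
  "draft-start", "draft-first-token", "draft-complete",
  "verifier-start", "verifier-first-token", "verifier-complete",
  "phrase-emit", "tts-first-pcm", "audio-out-first-frame",
  "barge-in-trigger", "barge-in-hard-stop",
] as const;
type EventName = (typeof BENCH_EVENTS)[number];

// Hypothetical probe shape: records one timestamp per event occurrence.
interface Probe {
  mark(name: EventName, atMs: number): void;
}

// Hypothetical stand-in for whatever callback surface the real pipeline exposes.
interface PipelineHooks {
  on(name: EventName, handler: () => void): void;
}

// Subscribes the probe to every event so that no transition goes unrecorded.
function attachProbe(hooks: PipelineHooks, probe: Probe, now: () => number): void {
  for (const name of BENCH_EVENTS) {
    hooks.on(name, () => probe.mark(name, now()));
  }
}

// In-memory fake, used only to exercise attachProbe in this example.
class FakeHooks implements PipelineHooks {
  private handlers = new Map<EventName, Array<() => void>>();
  on(name: EventName, handler: () => void): void {
    const list = this.handlers.get(name) ?? [];
    list.push(handler);
    this.handlers.set(name, list);
  }
  fire(name: EventName): void {
    for (const h of this.handlers.get(name) ?? []) h();
  }
}

class RecordingProbe implements Probe {
  readonly marks: Array<{ name: EventName; atMs: number }> = [];
  mark(name: EventName, atMs: number): void {
    this.marks.push({ name, atMs });
  }
}
```

In a real driver, `now` would be a monotonic clock such as `performance.now()`, and the hooks would come from the VAD, transcriber, drafter, verifier, chunker, TTS backend, and barge-in controller rather than from a fake.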

Known limitations

  • Synthetic audio is not real speech. Per docs/eliza-1-pipeline/06-test-matrix.md, release-blocking latency gates still require a real-recorded WAV corpus.
  • GPU utilization is not yet sampled. The Metal/Vulkan counter hooks are TBD; the field is optional in BenchMetrics.
  • DFlash stats are driver-supplied. The real driver must hook into dflash-server; mock values are not accepted for release evidence.
  • Single-process only. The harness runs the driver in-process. For cold-start measurement that includes shell startup, the runner needs a subprocess wrapper — a follow-up.

Architecture

SyntheticAudioSource ─┐
                      │
                      ▼
                 PipelineDriver.run({ audio, injection, probe })
                      │
                      ▼ (BenchEventName timestamps)
                 MetricsCollector ──► BenchMetrics
                      │
                      ▼
                 aggregate() ──► BenchAggregates
                      │
                      ▼
                 evaluateGates(current, baseline) ──► GateReport (md)

Everything in src/ is pure TypeScript, compiled in strict mode with no any. It has no runtime dependency on @elizaos/* packages: the harness is intentionally isolated so that a bun test in CI doesn't drag the inference stack along.
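
For the aggregate() stage in the diagram, a minimal percentile helper gives a sense of how per-run metrics could be reduced to the p50 and p95 figures used by the gates. This is an assumption-laden sketch: it uses the nearest-rank method, and the real aggregate() may use a different percentile definition or a different output shape.

```typescript
// Nearest-rank percentile over a non-empty sample set (p in 0..100).
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  const idx = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[idx] as number;
}

// Hypothetical reduction of per-run TTFA samples to the two gated figures.
function aggregateTtfa(ttfaMs: number[]): { p50: number; p95: number } {
  return { p50: percentile(ttfaMs, 50), p95: percentile(ttfaMs, 95) };
}
```

With only a handful of runs, p95 under nearest-rank is simply the slowest sample, which is one reason to pass a larger --runs value when recording baselines.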