Back to Eliza

Turn-detector fine-tune pipeline

packages/training/scripts/turn_detector/README.md

2.0.17.0 KB
Original Source

Turn-detector fine-tune pipeline

Eliza-1 ships a bundled semantic end-of-turn (EOT) detector — one of three Tier-3 EOU classifiers the runtime resolves at voice-session start (plugins/plugin-local-inference/src/services/voice/eot-classifier.ts). Per device tier:

TierBundle revisionBackboneOn-disk (Q8 ONNX)Languages
0_8b, 2bv1.2.2-enSmolLM2-135M distilled~66 MBEN only
4b+v0.4.1-intlPruned Qwen2.5-0.5B~396 MB14 langs
--turn-license=apache (override)n/aSmolLM2-135M binary head (latishab/turnsense)~176 MBEN only

This directory hosts the fine-tune + eval pipeline for those models:

  • finetune_turn_detector.py — LoRA / APOLLO finetune entrypoint. Reads a YAML config (configs/turn_detector_<tier>.yaml), supports either the text-only path (Option A in R1) or the future joint-with-text-LM path (Option B).
  • eval_turn_detector.py — Computes F1 + mean detection latency on the configured held-out set. Gates a publish at:
    • F1 ≥ 0.85 (TURN_DETECTOR_F1_THRESHOLD in the manifest schema)
    • meanLatencyMs ≤ 30 (TURN_DETECTOR_MEAN_LATENCY_MS_LIMIT)
  • test_turn_detector_pipeline.py — Smoke test for the eval threshold logic and the resolver/config IO so the scaffold stays runnable as the real finetune code lands.

Data sources

  • TURNS-2K (Apache-2.0, EN). Bundled with latishab/turnsense. 2 000 samples covering backchannels, self-corrections, code-switching, STT formatting variants. Primary EN intrinsic eval.
  • Easy Turn (arXiv 2509.23938). Four-state labels (complete / incomplete / backchannel / wait); paired audio + transcript. Useful for the future joint-with-text-LM path.
  • Our own trajectories (privacy-filtered). Every voice turn the runtime records (when ELIZA_DISABLE_TRAJECTORY_LOGGING != 1) yields a (transcript-so-far, did-the-user-continue-within-1s) pair from VAD events. Becomes the dominant signal once we have several hundred hours.
  • For multilingual coverage: NoXi (subject to per-corpus licensing) and the YODAS Thai EOU subset for Thai support (v2 only).

All data goes through the workspace privacy filter (eliza/plugins/plugin-training/src/core/privacy-filter.ts) before it lands on disk. No raw user transcript or audio escapes that boundary.

Quick start (smoke)

bash
uv run --extra train pytest \
  packages/training/scripts/turn_detector/test_turn_detector_pipeline.py

Real run (LoRA + eval, English tier)

bash
# 1) Stage TURNS-2K (Apache-2.0 mirror) + the optional Easy Turn split.
uv run --extra train python -m scripts.turn_detector.finetune_turn_detector \
    --config packages/training/scripts/turn_detector/configs/turn_detector_en.yaml \
    --out artifacts/turn-detector-en/ \
    --epochs 3

# 2) Eval the resulting ONNX export.
uv run --extra train python -m scripts.turn_detector.eval_turn_detector \
    --model artifacts/turn-detector-en/onnx/model_q8.onnx \
    --tokenizer artifacts/turn-detector-en/tokenizer.json \
    --testset packages/training/data/turn/TURNS-2K/test.jsonl \
    --report artifacts/turn-detector-en/eval.json

eval.json carries { "f1": <0..1>, "meanLatencyMs": <ms>, "passed": <bool> }; the publish orchestrator copies it into the manifest evals.turnDetector slot the runtime validator enforces.

Eliza-1 drafter target (alternative — runs on the live drafter)

configs/turn_detector_eliza1_drafter.yaml trains a LoRA adapter on top of the eliza-1 drafter (the small model DFlash already keeps warm for speculative decoding) instead of a standalone ONNX. The runtime layers the adapter onto a dedicated EOT context at voice-session start and reads P(<|im_end|>) directly off the live model — see plugins/plugin-local-inference/src/services/voice/eliza1-eot-scorer.ts.

bash
uv run --extra train python -m scripts.turn_detector.finetune_turn_detector \
    --config packages/training/scripts/turn_detector/configs/turn_detector_eliza1_drafter.yaml \
    --out artifacts/turn-detector-eliza1-drafter/ \
    --epochs 3

# Convert the saved torch LoRA to GGUF for the runtime to consume.
# `convert_lora_to_gguf.py` ships with the llama.cpp checkout — see
# `EXPORT-NEXT-STEP.txt` written under the run dir.
python llama.cpp/convert_lora_to_gguf.py \
    --base elizaos/eliza-1 \
    --revision bundles/2b/drafter \
    artifacts/turn-detector-eliza1-drafter/checkpoints/best.pt \
    --outfile artifacts/turn-detector-eliza1-drafter/eot-lora.gguf

The resulting .gguf adapter ships under the manifest slot files.eotLoraAdapter (see schema.ts) and the runtime loads it via startVoiceSession({ useEliza1Eot: true, eliza1EotLoraPath }) — operators can also force this path by setting ELIZA_VOICE_EOT_BACKEND=eliza-1 in the env.

Trade-offs vs the LiveKit/Turnsense ONNX path:

AspectLiveKit ONNXEliza-1 drafter LoRA
On-disk weight cost66–396 MB (separate ONNX)~few MB adapter (no base model)
Cold startLoads ONNX runtime + ONNXReuses the drafter context
Calibration baselineDistilled SmolLM2 / Qwen2.5Vanilla drafter + LoRA
Multilingual coverage14 langs (intl revision)Inherits eliza-1 vocab coverage
Backend requirementWorks with any text backendnode-llama-cpp in-process

The eliza-1-drafter path is preferred when the in-process backend is active. The runtime falls back to LiveKit transparently when the drafter is not loaded (e.g. llama-server subprocess builds).

Cancellation contract (handshake with R11)

Turn detection emits a VoiceTurnSignal (data); it never aborts a turn directly. The controller above it (VoiceTurnController) consumes the signal and decides whether to suppress speculative generation via BargeInCancelToken.signal with reason "turn-suppressed". See .swarm/research/R11-cancellation.md.

See also