packages/training/scripts/voice/samantha_lora/RUNBOOK.md
End-to-end recipe for producing a publishable Kokoro voice adapter trained on Samantha audio. Time- and budget-stamped so you know what you're signing up for before you start.
| Step | Wall time (24 GB GPU) | Notes |
|---|---|---|
| collect_audio.md | Hours / days | Operator-driven; not a script. |
| validate_voice_corpus | <1 min | Pure Python; no GPU. |
| prep_corpus | 2–10 min | CPU-bound (resample + phonemize); no GPU. |
| train_lora | 30–90 min | RTX 4090: 30 min @ 2000 steps. |
| export_adapter | 5–10 min | GPU; runs the merge via mel-fit. |
| eval_voice | 5–15 min | GPU; loads Whisper + ECAPA + UTMOS. |
| publish_samantha (dry) | <1 min | Network only. |
Disk: ~2 GB per training run (checkpoints + processed audio + TensorBoard logs). The merged voice.bin itself is 510 KB.
VRAM: 24 GB at default settings. Drop --batch-size to 2 for 16 GB
cards (RTX 4080 / 5080 / 4060 Ti 16 GB); training will be slower but
still complete.
# Python deps (one-time):
pip install librosa soundfile 'misaki[en]' apollo-torch peft transformers torch
# Plus kokoro for inference:
pip install 'kokoro>=0.9.4'
# Optional but recommended for eval:
pip install utmos # UTMOS package; falls back to SQUIM_SUBJECTIVE if absent
# 0. Pick a workspace.
RUN=~/eliza-training/samantha-lora-$(date +%Y%m%d-%H%M%S)
CORPUS=~/samantha-corpus # populated per collect_audio.md
# 1. Validate the corpus.
python3 packages/training/scripts/voice/samantha_lora/validate_voice_corpus.py \
--corpus "$CORPUS"
# 2. Prep (resample + phonemize + privacy filter + split).
python3 packages/training/scripts/voice/samantha_lora/prep_corpus.py \
--corpus "$CORPUS" \
--run-dir "$RUN"
# 3. LoRA training (24 GB default).
python3 packages/training/scripts/voice/samantha_lora/train_lora.py \
--run-dir "$RUN" \
--max-steps 2000 \
--rank 16 --alpha 32
# 4. Export the merged voice.bin.
python3 packages/training/scripts/voice/samantha_lora/export_adapter.py \
--run-dir "$RUN" \
--out "$RUN/out" \
--mode merged \
--voice-name af_same
# 5. Evaluate vs publish gates.
python3 packages/training/scripts/voice/samantha_lora/eval_voice.py \
--out "$RUN/out" \
--val-clips-dir "$RUN/processed/wavs_norm"
# 6. Package the publishable HF voice release.
python3 packages/training/scripts/kokoro/package_voice_for_release.py \
--run-dir "$RUN/out" \
--release-dir "$RUN/release" \
--voice-name af_same \
--voice-display-name Samantha
# 7. Dry-run publish (sanity-checks the bundle without uploading).
HF_TOKEN=hf_xxx packages/training/scripts/voice/samantha_lora/publish_samantha.sh \
--release-dir "$RUN/release/af_same" \
--hf-repo elizaos/eliza-1 \
--dry-run
# 8. Real push (only when step 5 was green).
HF_TOKEN=hf_xxx packages/training/scripts/voice/samantha_lora/publish_samantha.sh \
--release-dir "$RUN/release/af_same" \
--hf-repo elizaos/eliza-1 \
--push --private --update-catalog
Each script has a hard gate before the next one runs — if a step fails, the next will refuse to start (and tell you why). The chain:
| Producer | Consumer | Gate |
|---|---|---|
| validate_voice_corpus | prep_corpus | exit 0; corpus is mono / 24 kHz / ≥10 min |
| prep_corpus | train_lora | prep_manifest.privacy_filter=applied, |
| phonemizer=misaki | ||
| train_lora | export_adapter | checkpoints/best (or step_*) present |
| export_adapter | eval_voice | manifest.json + voice.bin |
| eval_voice | publish_samantha | gate_report.passed=true |
| package_voice_for_release | publish_samantha | voice.bin + voice-preset.json + manifest-fragment.json |
validate_voice_corpus.py says "below floor"Add audio. The 10-min minimum is not negotiable in the script.
prep_corpus.py errors [prep_corpus] missing audio dependencyRun pip install librosa soundfile. The prep step has no usable
fallback — Kokoro requires 24 kHz and we need a real resampler.
prep_corpus.py errors privacy filter exited NThe privacy filter found unredactable content. Inspect
<run-dir>/_privacy_tmp/filtered.jsonl (kept on filter failure). Edit
the offending transcripts in transcripts.csv and re-run prep.
train_lora.py errors APOLLO not installedRun pip install apollo-torch. Per AGENTS.md APOLLO is required —
there is no AdamW fallback path.
train_lora.py OOMs--batch-size first (8 → 4 → 2).--rank (16 → 8) and --alpha (32 → 16) — same ratio.--max-steps to 1000; the loss curve flattens by
then for small corpora.eval_voice.py says wer > 0.10The synth output is not transcribable. Causes (in order of likelihood):
--rank and re-train.eval_voice.py says speaker_similarity < 0.55Need more audio. ECAPA cosine on tiny corpora is noisy; the gate is already relaxed. If you've already added audio and it still fails, check the source audio quality (the voice you trained may simply not sound like Samantha).
publish_samantha.sh says gate_report.passed=FalseDon't override. Fix the failing metric per the table above.
publish_samantha.sh says HF_TOKEN is not setExport it: export HF_TOKEN=hf_xxx. The script intentionally refuses
to push without a real token — no half-pushed states.
huggingface-cli download elizaos/eliza-1 voice/kokoro/voices/af_same.binpackages/shared/src/local-inference/voice-models.ts)
if --update-catalog was not passed: refresh sha256 + sizeBytes
from the new release. (The runtime auto-update checker reads this.)models/voice/CHANGELOG.md per the
convention there.cache/voice-preset-default.bin if you bundled one), runs
ensureSamanthaPresetReady, and from then on af_same is the
default voice with real bytes behind it.