packages/training/scripts/voice/eot/DATASETS.md
End-of-turn classifiers learn from (transcript_so_far, label_eot ∈ {0,1})
pairs. Positive examples are complete user turns; negative examples are
partial / mid-turn fragments. Quality of the eval set matters more than
volume — production accuracy is bounded by how well your eval corpus
matches your deployment distribution.
This audit covers (1) datasets already in this repo, (2) recommended public corpora to download, and (3) HuggingFace dataset candidates enumerated live via the HF API on 2026-05-15.
Search paths run by this audit:
- packages/training/scripts/: corpus tooling + already-prepared corpora
- packages/benchmarks/: eval scenarios with turn-aligned structure
- eliza/packages/skills/skills/*/scenarios/: scenario YAML/JSON dialogs
- *dataset*, *corpus*, *conversations*, *scenarios* dirs

Run this to refresh the local inventory on your workstation:
```bash
find packages eliza/packages -type d \
  \( -name "*dataset*" -o -name "*corpus*" -o -name "*scenarios*" \
     -o -name "*conversations*" -o -name "*dialog*" \) \
  -not -path "*/node_modules/*" -not -path "*/.git/*" \
  | head -40
```
Notable local sources:
- packages/training/scripts/: pack_dataset.py,
  synthesize_targets.py, extract_eliza_prompts.py. Most prepared
  corpora target chat/instruction fine-tuning rather than turn-aligned
  EOT. Use as supplementary signal: the natural end-of-message
  boundaries in the SFT corpus are positive EOT examples, and any
  truncation point is a negative.
- packages/benchmarks/voice-speaker-validation/:
  multi-speaker scenario fixtures with explicit turn boundaries (used
  for diarization eval). Small (5 fixtures) but on-distribution.
- eliza/packages/skills/skills/*/scenarios/: scenario files
  with user/assistant turn structure. Volume varies per skill.
- ~/.milady/trajectories/: real Eliza voice
  sessions. Privacy filter mandatory before any write path touches
  these (per packages/training/AGENTS.md).

Estimated local volume: low five-figures of turn pairs after dedup. Sufficient for a quick eval slice; insufficient as the sole training corpus for a high-accuracy LoRA. Combine with public + HF data below.
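The volume estimate above is stated after dedup. As a rough illustration of what that step might look like, the sketch below collapses turns that differ only in case, Unicode form, or whitespace. The function names and the normalization rules are assumptions for illustration, not the repo's actual dedup logic.

```python
import hashlib
import re
import unicodedata


def normalize_turn(text: str) -> str:
    """Lowercase, NFKC-normalize, and collapse whitespace so trivial variants collide."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def dedup_turns(turns):
    """Yield each turn whose normalized form has not been seen yet, preserving order."""
    seen = set()
    for turn in turns:
        # Hash the normalized text so the seen-set stays small on large corpora.
        key = hashlib.sha1(normalize_turn(turn).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield turn
```

Counting `list(dedup_turns(all_turns))` against the raw turn count gives a quick check on how much of the local inventory is duplicated across scenario files.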
prep_eot_corpus.py.

(sentence_id, lang, text) plus optional
user-translation links.

Top candidates by download count + topical relevance to EOT, surfaced via the HF Search API on 2026-05-15. Re-run the enumeration before each training cycle: the catalog moves.
| Dataset | Downloads | Notes |
|---|---|---|
| Krisp-AI/turn-taking-test-v1 | 38 | Krisp's purpose-built turn-taking eval. Small but on-distribution. Use as held-out eval, not training. |
| fixie-ai/turntaking-contextual-tts | 29 | TTS-side annotations of turn-taking decisions. Useful for cross-validation. |
| anyreach-ai/semantic-turn-taking-benchmark | 152 | Benchmark-style; use as eval split. |
| anyreach-ai/dualturn-otospeech-turn-taking | 693 | Speech-side turn-taking, mostly useful for ASR-aligned timing. |
| hiraki/candor-turntaking-annotations | 39 | CANDOR conversational corpus annotations: gold-standard academic turn-taking. |
| acengnew/turn-taking-cues-json | 3 | Small JSON; quick start. |

| Dataset | Downloads | Notes |
|---|---|---|
| li2017dailydialog/daily_dialog | 9,687 | The standard daily-dialog corpus. Multi-turn, topic-labelled, turn-aligned. Excellent training source. |
| apptek-com/apptek_callcenter_dialogues | 4,713 | Call-center transcripts: closest to voice-assistant deployment distribution. |
| cornell-movie-dialog/cornell_movie_dialog | 422 | Movie dialog. Use as supplementary; less on-distribution than call-center. |
| google/air_dialogue | 334 | Task-oriented dialog (flight booking). Domain-specific. |
| pixelsandpointers/daily_dialog_w_turn_templates | 346 | Daily-dialog with explicit turn templates pre-applied. Saves prep work. |
| roskoN/dailydialog | 1,841 | Alternative daily-dialog mirror. |
| HuggingFaceTB/everyday-conversations-llama3.1-2k | 1,435 | Llama-style chat format: distribution-matched to chat models. |

| Dataset | Downloads | Notes |
|---|---|---|
| gpt-omni/VoiceAssistant-400K | 2,018 | 400K voice-assistant interactions. Large; sample for training. |
| worstchan/VoiceAssistant-400K-SLAM-Omni | 616 | SLAM/Omni variant of the same set. |
| VocalNet/VoiceAssistant-430K-vocalnet | 888 | 430K with vocalnet annotations. |
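Since the download counts above go stale, the enumeration could be refreshed with something like the following sketch. It queries the public Hub endpoint `https://huggingface.co/api/datasets` using only the standard library and ranks results locally. The search terms, the `limit` value, and the assumption that each returned record carries `id` and `downloads` fields are mine and should be checked against the current Hub API before relying on them.

```python
import json
import urllib.parse
import urllib.request

HF_DATASETS_API = "https://huggingface.co/api/datasets"  # public Hub search endpoint


def search_datasets(query: str, limit: int = 20, timeout: float = 30.0) -> list:
    """Return raw dataset records for one search term (performs a network call)."""
    params = urllib.parse.urlencode({"search": query, "limit": limit})
    with urllib.request.urlopen(f"{HF_DATASETS_API}?{params}", timeout=timeout) as resp:
        return json.load(resp)


def rank_candidates(records: list) -> list:
    """Deduplicate by repo id and sort by download count, highest first."""
    best = {}
    for rec in records:
        repo_id = rec.get("id")
        if repo_id is None:
            continue
        # Assumed field name; records without it are treated as zero downloads.
        downloads = rec.get("downloads") or 0
        best[repo_id] = max(downloads, best.get(repo_id, 0))
    # Ties are broken alphabetically so the output is stable between runs.
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))


def enumerate_candidates(queries: list, limit: int = 20) -> list:
    """Run every search term and merge the results into one ranked list."""
    records = []
    for query in queries:
        records.extend(search_datasets(query, limit=limit))
    return rank_candidates(records)
```

A call such as `enumerate_candidates(["turn-taking", "dialog", "voice assistant"])` would return `(repo_id, downloads)` tuples that can be pasted back into the tables; the three query strings are illustrative guesses at how the original audit was run.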
For training a robust EOT LoRA on eliza-1-{0_8b, 2b, 4b}:
| Source | Mix | Why |
|---|---|---|
| daily_dialog | 35% | High-quality turn-aligned dialog at scale |
| OpenSubtitles | 25% | Scripted dialog volume, broad domain coverage |
| apptek_callcenter_dialogues | 15% | On-distribution (voice-assistant style) |
| VoiceAssistant-400K (sample) | 15% | Direct match to deployment use case |
| Local trajectory logs | 5% | On-distribution real data (privacy-filtered) |
| Tatoeba | 5% | Clean single-sentence positives |
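To turn the mix percentages above into concrete per-source sample counts, a small allocation helper is enough. The sketch below uses integer arithmetic with largest-remainder rounding so the counts always add up to the requested total. The total of 200,000 in the usage note is an arbitrary illustration, not a recommendation from this audit.

```python
# Mix percentages copied from the table above.
MIX_PERCENT = {
    "daily_dialog": 35,
    "OpenSubtitles": 25,
    "apptek_callcenter_dialogues": 15,
    "VoiceAssistant-400K (sample)": 15,
    "Local trajectory logs": 5,
    "Tatoeba": 5,
}


def allocate(total: int, mix_percent: dict) -> dict:
    """Split `total` examples across sources using largest-remainder rounding."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("mix percentages must sum to 100")
    # Floor each share first, then hand out what is left over.
    counts = {name: total * pct // 100 for name, pct in mix_percent.items()}
    leftover = total - sum(counts.values())
    by_remainder = sorted(
        mix_percent,
        key=lambda name: total * mix_percent[name] % 100,
        reverse=True,
    )
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts
```

For example, `allocate(200_000, MIX_PERCENT)` assigns 70,000 examples to daily_dialog and 10,000 each to the local trajectory logs and Tatoeba.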
Held-out eval (10-15% of total, drawn primarily from the turn-taking-benchmark + CANDOR + local trajectories):
- Krisp-AI/turn-taking-test-v1
- anyreach-ai/semantic-turn-taking-benchmark
- hiraki/candor-turntaking-annotations

Negatives are synthesized by prep_eot_corpus.py via mid-turn token
chops; do NOT rely on natural negatives in the corpus.
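A minimal sketch of the mid-turn chop idea follows. It is not the code in prep_eot_corpus.py: the function names are hypothetical, and it splits on whitespace, whereas the real script may chop on model tokenizer tokens. Each complete turn becomes one positive, and a few strict prefixes of it become negatives.

```python
import random


def chop_negatives(turn: str, n_chops: int = 2, min_tokens: int = 1, rng=None) -> list:
    """Return up to `n_chops` strict prefixes of `turn`, cut at whitespace boundaries."""
    rng = rng or random.Random()
    tokens = turn.split()
    # Valid cut points keep at least `min_tokens` tokens and drop at least one.
    cut_points = list(range(min_tokens, len(tokens)))
    if not cut_points:
        return []
    chosen = rng.sample(cut_points, k=min(n_chops, len(cut_points)))
    return [" ".join(tokens[:cut]) for cut in sorted(chosen)]


def build_pairs(turns, n_chops: int = 2, seed: int = 0) -> list:
    """Emit (transcript_so_far, label_eot) pairs: 1 for full turns, 0 for prefixes."""
    rng = random.Random(seed)  # seeded so a corpus build is reproducible
    pairs = []
    for turn in turns:
        pairs.append((turn, 1))
        pairs.extend((prefix, 0) for prefix in chop_negatives(turn, n_chops, rng=rng))
    return pairs
```

Single-token turns such as "yes" yield no negatives under this scheme. A chopped prefix can also read as a complete utterance on its own, so some synthesized negatives will be noisy; spot-checking a sample is worthwhile.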
Every dataset write path in prep_eot_corpus.py MUST run the privacy
filter from packages/training/scripts/validate_corpus.py (per
packages/training/AGENTS.md §3). Public corpora are filtered too —
they contain real-world PII drift that should not enter a model
artifact you intend to publish.
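One way to make the rule above hard to bypass is to have the write helper require a filter argument, so no code path can emit rows without one. The sketch below shows that shape only. `write_pairs`, the JSONL field names, and `placeholder_filter` are hypothetical, and the regex-based placeholder is a stand-in for illustration, not the filter in packages/training/scripts/validate_corpus.py, whose interface is not described in this document.

```python
import json
import re
from typing import Callable, Iterable, Optional, TextIO, Tuple

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def placeholder_filter(text: str) -> Optional[str]:
    """Stand-in only: redact emails and phone-like digit runs. NOT the repo's filter."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def write_pairs(
    pairs: Iterable[Tuple[str, int]],
    out: TextIO,
    privacy_filter: Callable[[str], Optional[str]],
) -> int:
    """Write JSONL rows, passing every transcript through `privacy_filter` first.

    The filter returns cleaned text, or None to drop the row entirely.
    Returns the number of rows actually written.
    """
    written = 0
    for text, label in pairs:
        cleaned = privacy_filter(text)
        if cleaned is None:
            continue
        row = {"text": cleaned, "label_eot": label}
        out.write(json.dumps(row, ensure_ascii=False) + "\n")
        written += 1
    return written
```

Because `privacy_filter` has no default, calling `write_pairs(pairs, out)` without it fails immediately, which keeps local trajectories and public corpora on the same filtered path.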