EOT Training Dataset Audit

End-of-turn (EOT) classifiers learn from (transcript_so_far, label_eot ∈ {0,1}) pairs. Positive examples are complete user turns; negative examples are partial, mid-turn fragments. Quality of the eval set matters more than volume: eval accuracy predicts production accuracy only as far as the eval corpus matches the deployment distribution.

This audit covers (1) datasets already in this repo, (2) recommended public corpora to download, and (3) HuggingFace dataset candidates enumerated live via the HF API on 2026-05-15.

1. Local repo data sources

Search paths run by this audit:

packages/training/scripts/ — corpus tooling + already-prepared corpora
packages/benchmarks/ — eval scenarios with turn-aligned structure
eliza/packages/skills/skills/*/scenarios/ — scenario YAML/JSON dialogs
Any *dataset*, *corpus*, *conversations*, *scenarios*, *dialog* dirs

Run this to refresh the local inventory on your workstation:

```bash
find packages eliza/packages -type d \
    \( -name "*dataset*" -o -name "*corpus*" -o -name "*scenarios*" \
       -o -name "*conversations*" -o -name "*dialog*" \) \
    -not -path "*/node_modules/*" -not -path "*/.git/*" \
    | head -40
```

Notable local sources:

packages/training/scripts/ — pack_dataset.py, synthesize_targets.py, extract_eliza_prompts.py. Most prepared corpora target chat/instruction fine-tuning rather than turn-aligned EOT. Use them as supplementary signal: each natural end-of-message boundary in the SFT corpus is a positive EOT example, and any prefix truncated before that boundary is a negative.
packages/benchmarks/voice-speaker-validation/ — multi-speaker scenario fixtures with explicit turn boundaries (used for diarization eval). Small (5 fixtures) but on-distribution.
eliza/packages/skills/skills/*/scenarios/ — scenario files with user/assistant turn structure. Volume varies per skill.
Trajectory logs at ~/.milady/trajectories/ — real Eliza voice sessions. Privacy filter mandatory before any write path touches these (per packages/training/AGENTS.md).

Estimated local volume: low five figures of turn pairs after dedup. That is enough for a quick eval slice, but not enough to serve as the sole training corpus for a high-accuracy LoRA. Combine with the public and HF data below.
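
The "after dedup" figure depends on what counts as a duplicate. A minimal sketch, assuming duplicates are matched on a normalized transcript plus its label; the normalization rule and the function names are illustrative and are not taken from the repo's tooling:

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Apply NFKC, lowercase, and collapse whitespace so trivial variants collide."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def dedup_pairs(pairs):
    """Keep the first occurrence of each (normalized transcript, label) pair.

    `pairs` is an iterable of (transcript_so_far, label_eot) tuples.
    """
    seen = set()
    unique = []
    for transcript, label in pairs:
        key_text = f"{label}\t{normalize(transcript)}"
        key = hashlib.sha1(key_text.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append((transcript, label))
    return unique
```

The label is part of the key on purpose: the same text can legitimately appear once as a complete turn and once as a fragment of a longer one.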

2. Recommended public corpora

OpenSubtitles 2018 (LREC 2016)

URL: http://opus.nlpl.eu/OpenSubtitles2018.php
License: the corpus is built from movie subtitles; redistribution follows the OpenSubtitles ToS — verify licensing for your jurisdiction before publication. Training-only use is broadly accepted in academic literature.
Format: XML or sentence-aligned plain text per language pair.
Turn signal: each subtitle line is a natural turn boundary in scripted dialog. Imperfect (subtitles compress dialog, omit back-channels) but volume is enormous (~3.4 GB monolingual English).
EOT relevance: good positive-example source. Negatives are generated by mid-sentence chops in prep_eot_corpus.py.

Tatoeba (CC-BY 2.0 FR)

URL: https://tatoeba.org/en/downloads
License: CC-BY 2.0 FR (sentences) + CC0 (links table).
Format: TSV of (sentence_id, lang, text) plus optional user-translation links.
Turn signal: every Tatoeba sentence is a complete utterance → positive EOT label. Cleaner than OpenSubtitles (curated, no subtitle abbreviations) but ALL positives — you must synthesize negatives.
EOT relevance: high-quality positive examples. Pair with OpenSubtitles for dialog-style coverage; negatives for both sources are synthesized.
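
Turning the Tatoeba TSV into positive examples can be sketched as follows. This assumes the three-column (sentence_id, lang, text) layout described above and that English rows carry the language code `eng`; the function name is hypothetical:

```python
import csv
import io


def tatoeba_positives(tsv_file, lang="eng"):
    """Yield (text, 1) for every sentence in the requested language.

    `tsv_file` is any iterable of lines with tab-separated
    (sentence_id, lang, text) columns.
    """
    # QUOTE_NONE: sentences may contain literal quote characters.
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3:
            continue  # skip malformed lines
        row_lang, text = row[1], row[2].strip()
        if row_lang == lang and text:
            yield (text, 1)


# Tiny in-memory stand-in for the downloaded sentences file.
sample = io.StringIO(
    '1\teng\tWhat time is it?\n'
    '2\tfra\tQuelle heure est-il ?\n'
    '3\teng\tShe said "no".\n'
)
positives = list(tatoeba_positives(sample))
```

Every yielded pair carries label 1, so this output still needs synthesized negatives before it can be used for training.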

3. HuggingFace dataset candidates (enumerated live)

Top candidates by download count + topical relevance to EOT, surfaced via the HF Search API on 2026-05-15. Re-run the enumeration before each training cycle — the catalog moves.
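
One way to re-run the enumeration is through the `huggingface_hub` client. The sketch below is an assumption about how such a refresh could look, not the script that produced the tables in this section; the search queries and function names are illustrative:

```python
def rank_candidates(infos, top_n=10):
    """Sort dataset records by download count, highest first; keep (id, downloads)."""
    rows = [(info.id, getattr(info, "downloads", None) or 0) for info in infos]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:top_n]


def enumerate_hf(queries=("turn-taking", "dialog", "voice assistant"), per_query=50):
    """Search the Hub once per query and merge results by dataset id (needs network)."""
    # Imported lazily so the ranking helper works without the package installed.
    from huggingface_hub import HfApi

    api = HfApi()
    merged = {}
    for query in queries:
        for info in api.list_datasets(search=query, limit=per_query):
            merged[info.id] = info  # the same dataset can match several queries
    return rank_candidates(merged.values(), top_n=len(merged))
```

Download counts only rank the candidates. Topical relevance to EOT still needs a manual pass over each dataset card.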

Direct turn-taking corpora (use first)

Dataset	Downloads	Notes
`Krisp-AI/turn-taking-test-v1`	38	Krisp's purpose-built turn-taking eval. Small but on-distribution. Use as held-out eval, not training.
`fixie-ai/turntaking-contextual-tts`	29	TTS-side annotations of turn-taking decisions. Useful for cross-validation.
`anyreach-ai/semantic-turn-taking-benchmark`	152	Benchmark-style; use as eval split.
`anyreach-ai/dualturn-otospeech-turn-taking`	693	Speech-side turn-taking, mostly useful for ASR-aligned timing.
`hiraki/candor-turntaking-annotations`	39	CANDOR conversational corpus annotations — gold-standard academic turn-taking.
`acengnew/turn-taking-cues-json`	3	Small JSON; quick start.

Dialog corpora (training + eval mix)

Dataset	Downloads	Notes
`li2017dailydialog/daily_dialog`	9,687	The standard daily-dialog corpus. Multi-turn, topic-labelled, turn-aligned. Excellent training source.
`apptek-com/apptek_callcenter_dialogues`	4,713	Call-center transcripts — closest to voice-assistant deployment distribution.
`cornell-movie-dialog/cornell_movie_dialog`	422	Movie dialog. Use as supplementary; less on-distribution than call-center.
`google/air_dialogue`	334	Task-oriented dialog (flight booking). Domain-specific.
`pixelsandpointers/daily_dialog_w_turn_templates`	346	Daily-dialog with explicit turn templates pre-applied. Saves prep work.
`roskoN/dailydialog`	1,841	Alternative daily-dialog mirror.
`HuggingFaceTB/everyday-conversations-llama3.1-2k`	1,435	Llama-style chat format — distribution-matched to chat models.

Voice-assistant / instruction corpora

Dataset	Downloads	Notes
`gpt-omni/VoiceAssistant-400K`	2,018	400K voice-assistant interactions. Large; sample for training.
`worstchan/VoiceAssistant-400K-SLAM-Omni`	616	SLAM/Omni variant of the same set.
`VocalNet/VoiceAssistant-430K-vocalnet`	888	430K with vocalnet annotations.

Recommended composition

For training a robust EOT LoRA on eliza-1-{0_8b, 2b, 4b}:

Source	Mix	Why
daily_dialog	35%	High-quality turn-aligned dialog at scale
OpenSubtitles	25%	Scripted dialog volume, broad domain coverage
apptek_callcenter_dialogues	15%	On-distribution (voice-assistant style)
VoiceAssistant-400K (sample)	15%	Direct match to deployment use case
Local trajectory logs	5%	On-distribution real data (privacy-filtered)
Tatoeba	5%	Clean single-sentence positives
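
The mix percentages translate into per-source example counts once a total budget is chosen. A minimal sketch using largest-remainder rounding so that the counts always sum to the budget; the dictionary keys and the `allocate` helper are illustrative names:

```python
MIX_PERCENT = {
    "daily_dialog": 35,
    "OpenSubtitles": 25,
    "apptek_callcenter_dialogues": 15,
    "VoiceAssistant-400K": 15,
    "local_trajectories": 5,
    "Tatoeba": 5,
}


def allocate(total, mix=MIX_PERCENT):
    """Split `total` examples across sources according to integer percentages."""
    if sum(mix.values()) != 100:
        raise ValueError("mix percentages must sum to 100")
    # Floor each share first, using integer arithmetic to avoid float drift.
    counts = {name: total * pct // 100 for name, pct in mix.items()}
    leftover = total - sum(counts.values())
    # Give the remaining examples to the sources with the largest remainders.
    by_remainder = sorted(mix, key=lambda name: (total * mix[name]) % 100, reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts
```

For a budget of 1000 examples this yields 350, 250, 150, 150, 50, and 50 in table order. If a source cannot supply its share, as is likely for the local logs, the budget has to shrink or the mix has to change.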

Held-out eval (10-15% of total, drawn primarily from the turn-taking benchmarks, CANDOR, and local trajectories):

Krisp-AI/turn-taking-test-v1
anyreach-ai/semantic-turn-taking-benchmark
hiraki/candor-turntaking-annotations
Local trajectory holdout (last 7 days, privacy-filtered)
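
The "last 7 days" trajectory holdout can be sketched as a time-based split. This assumes each session record carries a timezone-aware start time under a `started_at` key, which is a hypothetical field name, and it splits by whole session so that turns from one conversation never land on both sides:

```python
from datetime import datetime, timedelta, timezone


def split_by_recency(sessions, now, holdout_days=7):
    """Send sessions from the last `holdout_days` to eval and older ones to train.

    `sessions` is an iterable of dicts with a timezone-aware `started_at` datetime.
    """
    cutoff = now - timedelta(days=holdout_days)
    train, holdout = [], []
    for session in sessions:
        if session["started_at"] >= cutoff:
            holdout.append(session)
        else:
            train.append(session)
    return train, holdout


now = datetime(2026, 5, 15, tzinfo=timezone.utc)
sessions = [
    {"id": "a", "started_at": datetime(2026, 5, 1, tzinfo=timezone.utc)},
    {"id": "b", "started_at": datetime(2026, 5, 8, tzinfo=timezone.utc)},
    {"id": "c", "started_at": datetime(2026, 5, 14, tzinfo=timezone.utc)},
]
train, holdout = split_by_recency(sessions, now)
```

Both halves still go through the privacy filter before anything is written.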

Negatives are synthesized by prep_eot_corpus.py via mid-turn token chops; do NOT rely on natural negatives in the corpus.
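
A mid-turn chop can be sketched as follows. This is an illustration under stated assumptions, not the contents of prep_eot_corpus.py: tokens are whitespace-separated words, and the function names and parameters are hypothetical:

```python
import random


def synthesize_negatives(turn, rng, max_negatives=2, min_tokens=1):
    """Return (prefix, 0) pairs cut strictly inside a complete turn."""
    tokens = turn.split()
    # Valid cuts keep at least `min_tokens` and drop at least one token.
    cut_points = list(range(min_tokens, len(tokens)))
    if not cut_points:
        return []  # a one-token turn has no strict prefix to use
    chosen = rng.sample(cut_points, k=min(max_negatives, len(cut_points)))
    return [(" ".join(tokens[:cut]), 0) for cut in sorted(chosen)]


def build_examples(turn, rng):
    """One positive for the full turn plus its synthesized negatives."""
    return [(turn, 1)] + synthesize_negatives(turn, rng)


rng = random.Random(0)  # seeded so corpus builds are reproducible
turn = "can you set a timer for ten minutes"
examples = build_examples(turn, rng)
```

Random chops sometimes produce a prefix that is itself a plausible complete turn ("can you set a timer"), which adds label noise. Whether and how to filter those cases is a design choice the sketch leaves open.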

Privacy filter contract

Every dataset write path in prep_eot_corpus.py MUST run the privacy filter from packages/training/scripts/validate_corpus.py (per packages/training/AGENTS.md §3). Public corpora are filtered too: they contain real-world PII that should not enter a model artifact you intend to publish.
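
One way to make the contract hard to bypass is to route every write through a single helper that refuses to run without a filter. The sketch below assumes the filter is a callable that returns a scrubbed record or None; the real interface in validate_corpus.py may differ, and `write_corpus` is a hypothetical name:

```python
import json
from pathlib import Path
from typing import Callable, Iterable, Optional


def write_corpus(records: Iterable[dict], out_path: Path,
                 privacy_filter: Callable[[dict], Optional[dict]]) -> int:
    """Write records as JSONL, passing each one through `privacy_filter` first.

    `privacy_filter` is a stand-in for the real filter: it returns a
    scrubbed record, or None to drop the record entirely.
    """
    if privacy_filter is None:
        raise ValueError("refusing to write a corpus without a privacy filter")
    written = 0
    with out_path.open("w", encoding="utf-8") as handle:
        for record in records:
            cleaned = privacy_filter(record)
            if cleaned is None:
                continue  # record rejected by the filter
            handle.write(json.dumps(cleaned, ensure_ascii=False) + "\n")
            written += 1
    return written
```

Making the filter a required argument means a new write path cannot skip it by omission. It can only be skipped by passing a no-op on purpose, which is easy to spot in review.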