EOT Training Dataset Audit

packages/training/scripts/voice/eot/DATASETS.md

End-of-turn classifiers learn from (transcript_so_far, label_eot ∈ {0,1}) pairs. Positive examples are complete user turns; negative examples are partial / mid-turn fragments. Quality of the eval set matters more than volume — production accuracy is bounded by how well your eval corpus matches your deployment distribution.
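As a concrete picture of the pair format described above, here is a minimal sketch of one example record and its JSONL serialization. The `EotExample` class, the `to_jsonl_line` helper, and the sample sentences are illustrative assumptions, not names taken from the repo's tooling.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EotExample:
    """One training pair: the transcript so far and whether the turn is over."""

    transcript_so_far: str
    label_eot: int  # 1 = complete user turn, 0 = partial / mid-turn fragment

    def __post_init__(self) -> None:
        if self.label_eot not in (0, 1):
            raise ValueError("label_eot must be 0 or 1")


def to_jsonl_line(example: EotExample) -> str:
    """Serialize one example as a single JSON line."""
    return json.dumps(asdict(example), ensure_ascii=False)


# Hypothetical sample pair: a complete turn and a mid-turn fragment of it.
examples = [
    EotExample("can you turn off the kitchen lights", 1),
    EotExample("can you turn off the", 0),
]
lines = [to_jsonl_line(e) for e in examples]
```

Validating the label at construction time keeps malformed rows out of both the training and eval files.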

This audit covers (1) datasets already in this repo, (2) recommended public corpora to download, and (3) HuggingFace dataset candidates enumerated live via the HF API on 2026-05-15.

1. Local repo data sources

Search paths run by this audit:

  • packages/training/scripts/ — corpus tooling + already-prepared corpora
  • packages/benchmarks/ — eval scenarios with turn-aligned structure
  • eliza/packages/skills/skills/*/scenarios/ — scenario YAML/JSON dialogs
  • Any *dataset*, *corpus*, *conversations*, *scenarios* dirs

Run this to refresh the local inventory on your workstation:

```bash
# List up to 40 candidate data directories, skipping vendored and VCS paths.
find packages eliza/packages -type d \
    \( -name "*dataset*" -o -name "*corpus*" -o -name "*scenarios*" \
       -o -name "*conversations*" -o -name "*dialog*" \) \
    -not -path "*/node_modules/*" -not -path "*/.git/*" \
    | head -40
```

Notable local sources:

  • packages/training/scripts/pack_dataset.py, synthesize_targets.py, extract_eliza_prompts.py: corpus tooling. Most prepared corpora target chat/instruction fine-tuning rather than turn-aligned EOT. Use them as supplementary signal: the natural end-of-message boundaries in the SFT corpus are positive EOT examples, and any truncation point inside a message is a negative.
  • packages/benchmarks/voice-speaker-validation/ — multi-speaker scenario fixtures with explicit turn boundaries (used for diarization eval). Small (5 fixtures) but on-distribution.
  • eliza/packages/skills/skills/*/scenarios/ — scenario files with user/assistant turn structure. Volume varies per skill.
  • Trajectory logs at ~/.milady/trajectories/ — real Eliza voice sessions. Privacy filter mandatory before any write path touches these (per packages/training/AGENTS.md).

Estimated local volume: low five figures of turn pairs after dedup. That is enough for a quick eval slice but not enough to serve as the sole training corpus for a high-accuracy LoRA. Combine it with the public and HF data below.
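The volume estimate depends on how duplicates are counted. One possible approach, sketched here as an assumption rather than as the repo's actual dedup logic, is to hash a normalized form of each turn together with its label and keep the first occurrence. The `normalize` and `dedup_pairs` helpers are hypothetical names.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Apply NFKC, case-fold, and collapse whitespace so near-identical turns collide."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return re.sub(r"\s+", " ", text).strip()


def dedup_pairs(pairs):
    """Keep the first occurrence of each (normalized text, label) pair, preserving order."""
    seen = set()
    unique = []
    for text, label in pairs:
        key = hashlib.sha256(f"{label}\x00{normalize(text)}".encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append((text, label))
    return unique


# Hypothetical input: the second row differs from the first only in case and spacing.
pairs = [
    ("What time is it?", 1),
    ("what  time is it?", 1),
    ("what time is", 0),
]
unique = dedup_pairs(pairs)
```

Including the label in the key keeps a fragment and a complete turn with the same text as separate rows, which may or may not be desirable depending on how label conflicts should be handled.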

2. Recommended public corpora

OpenSubtitles 2018 (LREC 2016)

  • URL: http://opus.nlpl.eu/OpenSubtitles2018.php
  • License: the corpus is built from movie subtitles; redistribution follows the OpenSubtitles ToS — verify licensing for your jurisdiction before publication. Training-only use is broadly accepted in academic literature.
  • Format: XML or sentence-aligned plain text per language pair.
  • Turn signal: each subtitle line is a natural turn boundary in scripted dialog. Imperfect (subtitles compress dialog, omit back-channels) but volume is enormous (~3.4 GB monolingual English).
  • EOT relevance: good positive-example source. Negatives are generated by mid-sentence chops in prep_eot_corpus.py.

Tatoeba (CC-BY 2.0)

  • URL: https://tatoeba.org/en/downloads
  • License: CC-BY 2.0 (sentences) + CC0 (links table).
  • Format: TSV of (sentence_id, lang, text) plus optional user-translation links.
  • Turn signal: every Tatoeba sentence is a complete utterance → positive EOT label. Cleaner than OpenSubtitles (curated, no subtitle abbreviations) but ALL positives — you must synthesize negatives.
  • EOT relevance: high-quality positive examples; pair with OpenSubtitles for negative coverage.
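A minimal sketch of turning the Tatoeba TSV layout described above into positive examples. It assumes the three-column (sentence_id, lang, text) layout and three-letter language codes such as `eng`; the `tatoeba_positives` helper and the sample rows are made up for illustration.

```python
import csv
import io


def tatoeba_positives(tsv_file, lang="eng"):
    """Yield (text, 1) for every non-empty sentence in the requested language."""
    # QUOTE_NONE treats quote characters inside sentences as ordinary text.
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # skip malformed rows rather than guess at their layout
        _sentence_id, row_lang, text = row
        text = text.strip()
        if row_lang == lang and text:
            yield text, 1


# Made-up sample rows in the (sentence_id, lang, text) layout.
sample = io.StringIO(
    "1\teng\tWhere is the station?\n"
    "2\tfra\tOù est la gare ?\n"
    "3\teng\tI'll call you \"tomorrow\".\n"
)
positives = list(tatoeba_positives(sample))
```

With a real download, the `io.StringIO` object would be replaced by a file opened with `encoding="utf-8"` and `newline=""`.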

3. HuggingFace dataset candidates (enumerated live)

Top candidates by download count + topical relevance to EOT, surfaced via the HF Search API on 2026-05-15. Re-run the enumeration before each training cycle — the catalog moves.
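One way the enumeration could be re-run is sketched below with the `huggingface_hub` client. The search terms, the per-query limit, and the helper names are assumptions; the original audit does not record which queries it used. Ranking is done locally on the returned `downloads` field, so the sketch does not depend on server-side sort options.

```python
from types import SimpleNamespace


def rank_candidates(infos, min_downloads=0):
    """Sort dataset records by download count, highest first."""
    rows = [(info.id, info.downloads or 0) for info in infos]
    rows = [row for row in rows if row[1] >= min_downloads]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def enumerate_hub(queries=("turn-taking", "dialog", "voice assistant"), per_query=20):
    """Query the Hub once per search term and merge the results by dataset id."""
    from huggingface_hub import HfApi  # imported lazily; needs network access

    api = HfApi()
    merged = {}
    for query in queries:
        for info in api.list_datasets(search=query, limit=per_query):
            merged[info.id] = info
    return rank_candidates(merged.values())


# Offline stand-ins that mimic the two attributes used above (hypothetical ids).
demo = [
    SimpleNamespace(id="example-org/dialog-a", downloads=120),
    SimpleNamespace(id="example-org/dialog-b", downloads=None),
    SimpleNamespace(id="example-org/dialog-c", downloads=4500),
]
ranked = rank_candidates(demo)
```

Calling `enumerate_hub()` performs the live queries; download counts only indicate popularity, so topical relevance still needs a manual pass.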

Direct turn-taking corpora (use first)

| Dataset | Downloads | Notes |
| --- | --- | --- |
| Krisp-AI/turn-taking-test-v1 | 38 | Krisp's purpose-built turn-taking eval. Small but on-distribution. Use as held-out eval, not training. |
| fixie-ai/turntaking-contextual-tts | 29 | TTS-side annotations of turn-taking decisions. Useful for cross-validation. |
| anyreach-ai/semantic-turn-taking-benchmark | 152 | Benchmark-style; use as eval split. |
| anyreach-ai/dualturn-otospeech-turn-taking | 693 | Speech-side turn-taking, mostly useful for ASR-aligned timing. |
| hiraki/candor-turntaking-annotations | 39 | CANDOR conversational corpus annotations: gold-standard academic turn-taking. |
| acengnew/turn-taking-cues-json | 3 | Small JSON; quick start. |

Dialog corpora (training + eval mix)

| Dataset | Downloads | Notes |
| --- | --- | --- |
| li2017dailydialog/daily_dialog | 9,687 | The standard daily-dialog corpus. Multi-turn, topic-labelled, turn-aligned. Excellent training source. |
| apptek-com/apptek_callcenter_dialogues | 4,713 | Call-center transcripts; closest to voice-assistant deployment distribution. |
| cornell-movie-dialog/cornell_movie_dialog | 422 | Movie dialog. Use as supplementary; less on-distribution than call-center. |
| google/air_dialogue | 334 | Task-oriented dialog (flight booking). Domain-specific. |
| pixelsandpointers/daily_dialog_w_turn_templates | 346 | Daily-dialog with explicit turn templates pre-applied. Saves prep work. |
| roskoN/dailydialog | 1,841 | Alternative daily-dialog mirror. |
| HuggingFaceTB/everyday-conversations-llama3.1-2k | 1,435 | Llama-style chat format; distribution-matched to chat models. |

Voice-assistant / instruction corpora

| Dataset | Downloads | Notes |
| --- | --- | --- |
| gpt-omni/VoiceAssistant-400K | 2,018 | 400K voice-assistant interactions. Large; sample for training. |
| worstchan/VoiceAssistant-400K-SLAM-Omni | 616 | SLAM/Omni variant of the same set. |
| VocalNet/VoiceAssistant-430K-vocalnet | 888 | 430K with vocalnet annotations. |

For training a robust EOT LoRA on eliza-1-{0_8b, 2b, 4b}:

| Source | Mix | Why |
| --- | --- | --- |
| daily_dialog | 35% | High-quality turn-aligned dialog at scale |
| OpenSubtitles | 25% | Scripted dialog volume, broad domain coverage |
| apptek_callcenter_dialogues | 15% | On-distribution (voice-assistant style) |
| VoiceAssistant-400K (sample) | 15% | Direct match to deployment use case |
| Local trajectory logs | 5% | On-distribution real data (privacy-filtered) |
| Tatoeba | 5% | Clean single-sentence positives |
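To turn the percentages into per-source sample counts, a short sketch using largest-remainder rounding is shown here so the counts always add up to the requested total. The 200,000-example budget and the `allocate` helper are illustrative assumptions; only the percentages come from the mix above.

```python
MIX = {
    "daily_dialog": 35,
    "OpenSubtitles": 25,
    "apptek_callcenter_dialogues": 15,
    "VoiceAssistant-400K (sample)": 15,
    "Local trajectory logs": 5,
    "Tatoeba": 5,
}


def allocate(total: int, mix: dict[str, int]) -> dict[str, int]:
    """Split `total` examples across sources using largest-remainder rounding."""
    if sum(mix.values()) != 100:
        raise ValueError("mix percentages must sum to 100")
    # Integer arithmetic: (whole examples, remainder out of 100) per source.
    quotas = {name: divmod(total * pct, 100) for name, pct in mix.items()}
    counts = {name: whole for name, (whole, _rem) in quotas.items()}
    leftover = total - sum(counts.values())
    # Hand the remaining examples to the sources with the largest remainders.
    by_remainder = sorted(mix, key=lambda name: quotas[name][1], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


counts = allocate(200_000, MIX)  # hypothetical training budget
```

Small sources such as the local trajectory logs may not be able to supply their share at larger budgets, in which case the shortfall has to be redistributed by hand.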

Held-out eval (10-15% of total, drawn primarily from the turn-taking benchmarks, CANDOR, and local trajectories):

  • Krisp-AI/turn-taking-test-v1
  • anyreach-ai/semantic-turn-taking-benchmark
  • hiraki/candor-turntaking-annotations
  • Local trajectory holdout (last 7 days, privacy-filtered)
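The time-based trajectory holdout can be sketched as a simple cutoff split. The record layout (a dict with a timezone-aware `timestamp`) and the `split_by_recency` helper are assumptions; the actual trajectory log schema is not described in this audit.

```python
from datetime import datetime, timedelta, timezone


def split_by_recency(records, now, holdout_days=7):
    """Return (train, holdout); holdout is everything at or after the cutoff."""
    cutoff = now - timedelta(days=holdout_days)
    train, holdout = [], []
    for record in records:
        (holdout if record["timestamp"] >= cutoff else train).append(record)
    return train, holdout


# Hypothetical records; a fixed `now` keeps the split reproducible.
now = datetime(2026, 5, 15, tzinfo=timezone.utc)
records = [
    {"session": "a", "timestamp": now - timedelta(days=30)},
    {"session": "b", "timestamp": now - timedelta(days=7)},
    {"session": "c", "timestamp": now - timedelta(days=1)},
]
train, holdout = split_by_recency(records, now)
```

Splitting on time rather than at random keeps turns from the same recent session out of the training set, which avoids leakage between train and eval.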

Negatives are synthesized by prep_eot_corpus.py via mid-turn token chops; do NOT rely on natural negatives in the corpus.
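The mid-turn chop can be illustrated as follows. This is a sketch of the general idea, not the contents of prep_eot_corpus.py: it splits on whitespace as a stand-in for real tokenization, and the `chop_negatives` helper, seed, and sample turn are assumptions.

```python
import random


def chop_negatives(turn: str, rng: random.Random, max_negatives: int = 3):
    """Return up to `max_negatives` strict prefixes of `turn`, each labeled 0."""
    tokens = turn.split()
    if len(tokens) < 2:
        return []  # a one-token turn has no non-empty strict prefix
    cut_points = list(range(1, len(tokens)))  # keep between 1 and n-1 tokens
    chosen = rng.sample(cut_points, k=min(max_negatives, len(cut_points)))
    return [(" ".join(tokens[:cut]), 0) for cut in sorted(chosen)]


rng = random.Random(0)  # fixed seed so the corpus build is reproducible
turn = "set a timer for ten minutes"
pairs = [(turn, 1)] + chop_negatives(turn, rng)
```

A chopped prefix can itself read as a complete utterance ("set a timer"), so some label noise is inherent in this approach; filtering prefixes that end in terminal punctuation is one possible mitigation.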

Privacy filter contract

Every dataset write path in prep_eot_corpus.py MUST run the privacy filter from packages/training/scripts/validate_corpus.py (per packages/training/AGENTS.md §3). Public corpora are filtered too — they contain real-world PII drift that should not enter a model artifact you intend to publish.
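One way to make the contract hard to bypass is to have the writer require a filter callable, so no write path exists without one. The sketch below assumes a text-in, text-out filter signature; the real interface in validate_corpus.py is not shown in this audit, and `placeholder_filter` is a stand-in that only masks email addresses.

```python
import io
import json
import re
from typing import Callable, Iterable, TextIO

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def placeholder_filter(text: str) -> str:
    """Stand-in only: masks email addresses. Not the repo's privacy filter."""
    return EMAIL.sub("<EMAIL>", text)


def write_examples(
    examples: Iterable[tuple[str, int]],
    out: TextIO,
    privacy_filter: Callable[[str], str],
) -> int:
    """Write (text, label) pairs as JSONL, filtering every text before it is written."""
    written = 0
    for text, label in examples:
        record = {"transcript_so_far": privacy_filter(text), "label_eot": label}
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
        written += 1
    return written


# Hypothetical example row written to an in-memory buffer.
buffer = io.StringIO()
write_examples([("email me at jo@example.com", 1)], buffer, placeholder_filter)
```

Because `privacy_filter` has no default value, a caller that forgets it gets a `TypeError` instead of silently writing unfiltered text.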