Same voice corpus

packages/training/data/voice/sam/README.md

A 58-clip, ~3.5 min voice corpus used to train / clone the same voice for Kokoro (and as the freeze target for OmniVoice). Lands here from lalalune/ai_voices upstream.

License: research / personal use only. No upstream LICENSE file exists in lalalune/ai_voices and the same voice is a derivative of Her (2013, Warner Bros). Do NOT redistribute the raw audio. Publish only fine-tune deltas / voice embeddings as derivative works with explicit attribution. See source.json and §License below.

Provenance

  • Upstream: lalalune/ai_voices, sam/ subset (renamed locally to same).
  • Format on disk (upstream): flat directory of samantha_NNN.wav (44.1 kHz mono 16-bit PCM) + samantha_NNN.txt (Whisper-base transcripts). The build script re-keys these to same_NNN.{wav,txt} when landing them locally.
  • Commit pinned in source.json — written by build_same_manifest.py at fetch time.
  • R12 inventory: .swarm/research/R12-ai_voices.md.

Layout

packages/training/data/voice/same/
  README.md          # this file (tracked)
  source.json        # upstream URL + commit sha + counts + license (tracked)
  manifest.jsonl     # one JSON record per clip (tracked)
  .gitignore         # local ignore: audio + raw stay out of git (tracked)
  audio/             # 24 kHz mono PCM16, LUFS-normalized (gitignored)
    same_001.wav .. same_058.wav
  raw/               # untouched upstream 44.1 kHz mono PCM16 (gitignored)
    same_001.wav .. same_058.wav
    same_001.txt .. same_058.txt
  ljspeech/
    metadata.csv     # LJSpeech format `id|raw|normalized` (tracked)
    wavs/            # symlinks into ../../audio/ (gitignored)

Tracked in git: README.md, source.json, manifest.jsonl, ljspeech/metadata.csv, .gitignore. Everything under audio/, raw/, and ljspeech/wavs/ is gitignored and regenerated by the manifest builder.
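
The `ljspeech/wavs/` symlinks in the layout above could be regenerated along these lines. This is a minimal sketch, not the manifest builder's actual code: the function name `link_ljspeech_wavs` is hypothetical, and it assumes the corpus directory already contains a populated `audio/` folder.

```python
import os
from pathlib import Path


def link_ljspeech_wavs(corpus_dir):
    """Create ljspeech/wavs/<name>.wav symlinks pointing at ../../audio/<name>.wav.

    Hypothetical helper; assumes <corpus_dir>/audio/ already holds the wav files.
    """
    corpus = Path(corpus_dir)
    audio_dir = corpus / "audio"
    wavs_dir = corpus / "ljspeech" / "wavs"
    wavs_dir.mkdir(parents=True, exist_ok=True)

    created = []
    for wav in sorted(audio_dir.glob("*.wav")):
        link = wavs_dir / wav.name
        # Replace stale links so the function is safe to re-run.
        if link.is_symlink() or link.exists():
            link.unlink()
        # Relative target, resolved from ljspeech/wavs/ at access time.
        os.symlink(Path("..") / ".." / "audio" / wav.name, link)
        created.append(link)
    return created
```

Relative targets keep the links valid when the whole corpus directory is moved or checked out at a different path.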

Counts (R12-measured)

| Property | Value |
| --- | --- |
| Clip count | 58 |
| Total duration | ~210.3 s (3.51 min) |
| Source sample rate | 44.1 kHz |
| Channels | mono |
| Bit depth | 16-bit PCM |
| Min / median / max clip duration | 0.67 / 1.89 / 13.13 s |

The duration distribution is skewed short — 35 of 58 clips ≤ 3 s — so the LoRA fine-tune path is impractical at this size (community minimum is 1–3 h). Use extract_voice_embedding.py (voice-clone) as the primary path; LoRA is an experimental comparison only.
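
To re-check the duration figures locally, something like the following could be run against manifest.jsonl. It is a rough sketch that assumes only the `duration_s` field from the manifest schema; the function name `duration_summary` and the 3 s threshold parameter are illustrative, not part of the build script.

```python
import json
import statistics
from pathlib import Path


def duration_summary(manifest_path, short_threshold_s=3.0):
    """Summarize clip durations from a manifest.jsonl file (hypothetical helper)."""
    durations = []
    with Path(manifest_path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            durations.append(float(json.loads(line)["duration_s"]))
    if not durations:
        raise ValueError(f"no records found in {manifest_path}")
    return {
        "clips": len(durations),
        "total_s": sum(durations),
        "min_s": min(durations),
        "median_s": statistics.median(durations),
        "max_s": max(durations),
        # Clips at or below the threshold, i.e. the "short" bucket.
        "short_clips": sum(1 for d in durations if d <= short_threshold_s),
    }
```

Calling `duration_summary("packages/training/data/voice/same/manifest.jsonl")` after a build should give numbers comparable to the table above, though small differences from resampling are possible.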

How to fetch

The corpus is regenerable from upstream. The build script sparse-clones only the sam slice (not the full 258 MB repo).

```bash
python3 packages/training/scripts/voice/build_same_manifest.py \
    --sparse-clone /tmp/ai_voices
```

This will:

  1. git clone --filter=blob:none --sparse lalalune/ai_voices into /tmp/ai_voices, with sparse-checkout set sam utils README.md (so only ~19 MB of audio is fetched, not 258 MB).
  2. Validate 58 wav/txt pairs, uniform 44.1 kHz mono 16-bit PCM, total duration in [180, 240] s.
  3. Copy raw audio into raw/ (renamed to same_NNN).
  4. Normalize to 24 kHz mono PCM16 at -23 LUFS via ffmpeg (skip with --no-normalize) into audio/.
  5. Re-transcribe every clip with whisper-large-v3 if the openai-whisper package is installed — replaces the upstream samantha_002.txt='641.' hallucination (and any other Whisper-base errors). Skip with --no-retranscribe.
  6. Write manifest.jsonl, source.json, and ljspeech/metadata.csv.
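
The validation in step 2 could look roughly like the sketch below, which uses only the standard-library `wave` module. It is an illustration under assumptions, not the build script's code: `validate_source_dir` and its parameters are hypothetical names, and it only checks wav files that have a sibling `.txt` (it does not look for orphaned transcripts).

```python
import wave
from pathlib import Path

EXPECTED_RATE = 44100        # Hz
EXPECTED_CHANNELS = 1        # mono
EXPECTED_SAMPLE_WIDTH = 2    # bytes per sample, i.e. 16-bit PCM


def validate_source_dir(src_dir, expected_clips=58, duration_range_s=(180.0, 240.0)):
    """Return a list of problems found in an upstream clip directory.

    An empty list means every check passed. Hypothetical helper.
    """
    src = Path(src_dir)
    problems = []
    wavs = sorted(src.glob("*.wav"))
    if len(wavs) != expected_clips:
        problems.append(f"expected {expected_clips} wav files, found {len(wavs)}")

    total_s = 0.0
    for wav_path in wavs:
        if not wav_path.with_suffix(".txt").is_file():
            problems.append(f"{wav_path.name}: missing transcript")
        with wave.open(str(wav_path), "rb") as wf:
            fmt = (wf.getframerate(), wf.getnchannels(), wf.getsampwidth())
            if fmt != (EXPECTED_RATE, EXPECTED_CHANNELS, EXPECTED_SAMPLE_WIDTH):
                problems.append(f"{wav_path.name}: unexpected format {fmt}")
            total_s += wf.getnframes() / wf.getframerate()

    lo, hi = duration_range_s
    if not lo <= total_s <= hi:
        problems.append(f"total duration {total_s:.1f} s outside [{lo}, {hi}]")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report every broken clip in a single pass.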

Two-step (when a clone already exists)

```bash
python3 packages/training/scripts/voice/build_same_manifest.py \
    --src /tmp/ai_voices/sam \
    --dst packages/training/data/voice/same
```

Pre-flight audit

```bash
bash packages/training/scripts/voice/audit_same.sh /tmp/ai_voices/sam
```

I7 and I11 must run this before invoking the kokoro pipeline.

Manifest schema

manifest.jsonl — one JSON record per line:

```json
{
  "id": "same_001",
  "audio_path": "audio/same_001.wav",
  "raw_audio_path": "raw/same_001.wav",
  "transcript": "Yeah, I've been trying to figure out how to talk to you about this.",
  "transcript_source": "whisper-large-v3",
  "duration_s": 2.123,
  "sample_rate": 24000,
  "source_sample_rate": 44100,
  "channels": 1,
  "bit_depth": 16,
  "excluded": false,
  "source": "github.com/lalalune/ai_voices@<sha>",
  "subset": "same"
}
```

  • transcript_source ∈ {whisper-large-v3, upstream-whisper-base}.
  • excluded=true only for clips that still hold an upstream Whisper-base hallucination after the build. Today that is at most same_002 when the script ran with --no-retranscribe.
  • source pins the upstream git commit SHA at fetch time.
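
A consumer might load the manifest and drop excluded clips as sketched below. The field names come from the schema above; the function name `load_usable_clips` and the choice of which fields to treat as required are assumptions made for illustration.

```python
import json
from pathlib import Path

# Subset of schema fields this sketch treats as mandatory (an assumption).
REQUIRED_FIELDS = {"id", "audio_path", "transcript", "transcript_source",
                   "duration_s", "sample_rate", "excluded"}
TRANSCRIPT_SOURCES = {"whisper-large-v3", "upstream-whisper-base"}


def load_usable_clips(manifest_path):
    """Parse manifest.jsonl and return only records with excluded == false."""
    usable = []
    with Path(manifest_path).open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue
            rec = json.loads(line)
            missing = REQUIRED_FIELDS - rec.keys()
            if missing:
                raise ValueError(f"line {lineno}: missing fields {sorted(missing)}")
            if rec["transcript_source"] not in TRANSCRIPT_SOURCES:
                raise ValueError(
                    f"line {lineno}: unknown transcript_source {rec['transcript_source']!r}"
                )
            if not rec["excluded"]:
                usable.append(rec)
    return usable
```

Failing loudly on an unknown `transcript_source` is a deliberate choice here, so that a schema change in the builder surfaces early rather than silently passing through.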

Downstream consumers

  • Kokoro voice-clone (primary): packages/training/scripts/kokoro/extract_voice_embedding.py --clips-dir packages/training/data/voice/same/audio --base-model hexgrad/Kokoro-82M --out voice.bin.
  • Kokoro LoRA fine-tune (experimental): packages/training/scripts/kokoro/prep_ljspeech.py --data-dir packages/training/data/voice/same/ljspeech --sample-rate 24000 …, then finetune_kokoro.py.
  • OmniVoice freeze (R6 / I6): preset-based, consumes audio/ directly.

License

The upstream repo (lalalune/ai_voices) ships no LICENSE file. Its README.md only says "For fun and research only, obviously." The same voice is a derivative of Her (2013, Warner Bros).

Treat this corpus as a non-commercial research dataset:

  • Do not redistribute raw audio. Audio stays gitignored in this repo and is re-fetched from upstream by the build script.
  • Published derivatives (Kokoro voice-clone voice.bin, LoRA adapter, OmniVoice preset, fine-tuned ONNX/GGUF) must:
    • credit lalalune/ai_voices upstream,
    • declare the upstream commit SHA pinned in source.json,
    • be marked research-only (private HF dataset / private=true on first push; public release requires explicit owner sign-off).

See .swarm/collab.md for the C0 decision log on license handling (2026-05-13).

Known issues

  • same_002.txt = "641." — Whisper-base hallucination on a 1.37 s clip. build_same_manifest.py fixes this when run with the default --retranscribe (loads whisper-large-v3 and rewrites every transcript). When invoked with --no-retranscribe (CI / smoke) the clip is marked excluded=true in manifest.jsonl and skipped in ljspeech/metadata.csv.
  • Corpus skews short. 35 of 58 clips ≤ 3 s. LoRA prosody re-target benefits from longer clips; voice-clone is fine.
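
The effect of `excluded=true` on ljspeech/metadata.csv could be sketched as follows. This is an illustration only: `ljspeech_lines` is a hypothetical name, and reusing the transcript for both the raw and normalized columns is an assumption, since the builder's text normalization is not described here.

```python
def ljspeech_lines(records):
    """Build `id|raw|normalized` lines from manifest records, skipping excluded clips.

    Assumption: the normalized column simply repeats the whitespace-collapsed
    transcript; the real builder may apply further text normalization.
    """
    lines = []
    for rec in records:
        if rec.get("excluded", False):
            continue  # e.g. same_002 when built with --no-retranscribe
        # A literal "|" would break the pipe-delimited format, so replace it.
        text = " ".join(rec["transcript"].replace("|", " ").split())
        lines.append(f"{rec['id']}|{text}|{text}")
    return lines
```

For example, passing a record for same_002 with `"excluded": true` and transcript `"641."` yields no line for that clip, so the hallucinated text never reaches the LoRA data prep.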