Same voice corpus

A 58-clip, ~3.5 min voice corpus used to train / clone the same voice for Kokoro (and as the freeze target for OmniVoice). Lands here from lalalune/ai_voices upstream.

License: research / personal use only. No upstream LICENSE file exists in lalalune/ai_voices and the same voice is a derivative of Her (2013, Warner Bros). Do NOT redistribute the raw audio. Publish only fine-tune deltas / voice embeddings as derivative works with explicit attribution. See source.json and §License below.

Provenance

Upstream: lalalune/ai_voices — sam/ subset (renamed locally to same).
Format on disk (upstream): flat directory of samantha_NNN.wav (44.1 kHz mono 16-bit PCM) + samantha_NNN.txt (Whisper-base transcripts). The build script re-keys these to same_NNN.{wav,txt} when landing them locally.
Commit pinned in source.json — written by build_same_manifest.py at fetch time.
R12 inventory: .swarm/research/R12-ai_voices.md.
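
The re-keying from samantha_NNN to same_NNN described above can be sketched as a pure filename mapping. This is an illustration only, not the logic inside build_same_manifest.py: the function name `rekey` is hypothetical, and the zero-padded three-digit index is an assumption inferred from the file names in this README.

```python
import re

# Assumed upstream naming: samantha_NNN.wav / samantha_NNN.txt with a numeric index.
_UPSTREAM_NAME = re.compile(r"^samantha_(\d+)\.(wav|txt)$")


def rekey(upstream_name: str) -> str:
    """Map an upstream file name to its local same_NNN equivalent (hypothetical helper)."""
    match = _UPSTREAM_NAME.match(upstream_name)
    if match is None:
        raise ValueError(f"unexpected upstream file name: {upstream_name!r}")
    index, ext = match.groups()
    # Re-pad the index to three digits so local names sort lexicographically.
    return f"same_{int(index):03d}.{ext}"
```

Rejecting unknown names rather than passing them through keeps stray upstream files (for example the upstream README.md) out of raw/.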

Layout

packages/training/data/voice/same/
  README.md          # this file (tracked)
  source.json        # upstream URL + commit sha + counts + license (tracked)
  manifest.jsonl     # one JSON record per clip (tracked)
  .gitignore         # local ignore: audio + raw stay out of git (tracked)
  audio/             # 24 kHz mono PCM16, LUFS-normalized (gitignored)
    same_001.wav .. same_058.wav
  raw/               # untouched upstream 44.1 kHz mono PCM16 (gitignored)
    same_001.wav .. same_058.wav
    same_001.txt .. same_058.txt
  ljspeech/
    metadata.csv     # LJSpeech format `id|raw|normalized` (tracked)
    wavs/            # symlinks into ../../audio/ (gitignored)

Tracked in git: README.md, source.json, manifest.jsonl, ljspeech/metadata.csv, .gitignore. Everything under audio/, raw/, and ljspeech/wavs/ is gitignored and regenerated by the manifest builder.
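
Because ljspeech/wavs/ holds only symlinks into audio/, it can be rebuilt at any time. The sketch below shows one way to do that with relative links, assuming a POSIX filesystem; the function name `link_ljspeech_wavs` is hypothetical and the manifest builder may do this differently.

```python
import os
from pathlib import Path


def link_ljspeech_wavs(corpus_dir: Path) -> int:
    """Recreate ljspeech/wavs/*.wav as relative symlinks into audio/ (hypothetical helper)."""
    audio_dir = corpus_dir / "audio"
    wavs_dir = corpus_dir / "ljspeech" / "wavs"
    wavs_dir.mkdir(parents=True, exist_ok=True)

    created = 0
    for wav in sorted(audio_dir.glob("same_*.wav")):
        link = wavs_dir / wav.name
        # Replace any stale link or file so the function is safe to re-run.
        if link.is_symlink() or link.exists():
            link.unlink()
        # Relative target: ljspeech/wavs/ -> ../../audio/, so the tree can be moved as a unit.
        os.symlink(Path("..") / ".." / "audio" / wav.name, link)
        created += 1
    return created
```

Relative targets keep the links valid when the whole same/ directory is copied or checked out at a different path.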

Counts (R12-measured)

| Property | Value |
| --- | --- |
| Clip count | 58 |
| Total duration | ~210.3 s (3.51 min) |
| Source sample rate | 44.1 kHz |
| Channels | mono |
| Bit depth | 16-bit PCM |
| Min / median / max clip duration | 0.67 / 1.89 / 13.13 s |

The corpus is both small (~3.5 min total) and skewed short: 35 of 58 clips are ≤ 3 s. That is far below the 1–3 h community minimum for a LoRA fine-tune, so that path is impractical at this size. Use extract_voice_embedding.py (voice-clone) as the primary path; LoRA is an experimental comparison only.
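
The figures in the table can be re-measured from a directory of WAV files with the Python standard library alone. The following is a minimal sketch, not the R12 measurement script; the function names and the 3 s threshold parameter are illustrative.

```python
import statistics
import wave
from pathlib import Path


def clip_duration_s(path: Path) -> float:
    """Duration of one PCM WAV file in seconds (frames / frame rate)."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def duration_summary(wav_dir: Path, short_threshold_s: float = 3.0) -> dict:
    """Clip count, total, min / median / max, and number of clips at or under the threshold."""
    durations = [clip_duration_s(p) for p in sorted(wav_dir.glob("*.wav"))]
    if not durations:
        raise ValueError(f"no .wav files found in {wav_dir}")
    return {
        "clips": len(durations),
        "total_s": sum(durations),
        "min_s": min(durations),
        "median_s": statistics.median(durations),
        "max_s": max(durations),
        "short_clips": sum(d <= short_threshold_s for d in durations),
    }
```

Pointing `duration_summary` at raw/ or audio/ should give a quick check against the table, since resampling does not change clip length beyond rounding.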

How to fetch

The corpus is regenerable from upstream. The build script sparse-clones only the sam slice (not the full 258 MB repo).

End-to-end (recommended)

```bash
python3 packages/training/scripts/voice/build_same_manifest.py \
    --sparse-clone /tmp/ai_voices
```

This will:

Clone lalalune/ai_voices into /tmp/ai_voices with git clone --filter=blob:none --sparse, then sparse-checkout set sam utils README.md (so only ~19 MB of audio is fetched, not 258 MB).
Validate 58 wav/txt pairs, uniform 44.1 kHz mono 16-bit PCM, total duration in [180, 240] s.
Copy raw audio into raw/ (renamed to same_NNN).
Normalize to 24 kHz mono PCM16 at -23 LUFS via ffmpeg (skip with --no-normalize) into audio/.
Re-transcribe every clip with whisper-large-v3 if the openai-whisper package is installed — replaces the upstream samantha_002.txt='641.' hallucination (and any other Whisper-base errors). Skip with --no-retranscribe.
Write manifest.jsonl, source.json, and ljspeech/metadata.csv.
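
The validation step in the list above (wav/txt pairing, uniform format, total duration bounds) can be sketched with the standard library. This is an assumption-laden illustration rather than the checks build_same_manifest.py actually performs: the function name `validate_upstream` and its return shape are hypothetical, while the default thresholds are the ones quoted in this README.

```python
import wave
from pathlib import Path
from typing import List, Tuple

EXPECTED_CLIPS = 58
EXPECTED_FORMAT = (44100, 1, 2)      # (sample rate, channels, bytes per sample)
DURATION_RANGE_S = (180.0, 240.0)    # accepted total duration, in seconds


def validate_upstream(
    src_dir: Path,
    expected_clips: int = EXPECTED_CLIPS,
    expected_format: Tuple[int, int, int] = EXPECTED_FORMAT,
    duration_range_s: Tuple[float, float] = DURATION_RANGE_S,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems: List[str] = []
    wavs = sorted(src_dir.glob("*.wav"))
    if len(wavs) != expected_clips:
        problems.append(f"expected {expected_clips} wav files, found {len(wavs)}")

    total_s = 0.0
    for wav in wavs:
        # Every clip needs a transcript with the same stem.
        if not wav.with_suffix(".txt").is_file():
            problems.append(f"{wav.name}: missing transcript")
        with wave.open(str(wav), "rb") as wf:
            fmt = (wf.getframerate(), wf.getnchannels(), wf.getsampwidth())
            total_s += wf.getnframes() / wf.getframerate()
        if fmt != expected_format:
            problems.append(f"{wav.name}: format {fmt} != {expected_format}")

    low, high = duration_range_s
    if not low <= total_s <= high:
        problems.append(f"total duration {total_s:.1f} s outside [{low}, {high}] s")
    return problems
```

Collecting every problem before failing, instead of raising on the first one, makes a single run enough to see everything wrong with a fresh clone.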

Two-step (when a clone already exists)

```bash
python3 packages/training/scripts/voice/build_same_manifest.py \
    --src /tmp/ai_voices/sam \
    --dst packages/training/data/voice/same
```

Pre-flight audit

```bash
bash packages/training/scripts/voice/audit_same.sh /tmp/ai_voices/sam
```

I7 and I11 must run this audit before invoking the Kokoro pipeline.

Manifest schema

manifest.jsonl — one JSON record per line:

```json
{
  "id": "same_001",
  "audio_path": "audio/same_001.wav",
  "raw_audio_path": "raw/same_001.wav",
  "transcript": "Yeah, I've been trying to figure out how to talk to you about this.",
  "transcript_source": "whisper-large-v3",
  "duration_s": 2.123,
  "sample_rate": 24000,
  "source_sample_rate": 44100,
  "channels": 1,
  "bit_depth": 16,
  "excluded": false,
  "source": "github.com/lalalune/ai_voices@<sha>",
  "subset": "same"
}
```

transcript_source ∈ {whisper-large-v3, upstream-whisper-base}.
excluded=true only for clips that still hold an upstream Whisper-base hallucination after the build. Today that is at most same_002, and only when the script is run with --no-retranscribe.
source pins the upstream git commit SHA at fetch time.
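
A consumer can read the manifest with a few lines of Python. The sketch below checks each record against the fields shown in the example record and drops excluded clips; the function names `load_manifest` and `usable_clips` are hypothetical, and treating every field as required is an assumption.

```python
import json
from pathlib import Path
from typing import Dict, List

# Field names taken from the example record in this README.
REQUIRED_KEYS = {
    "id", "audio_path", "raw_audio_path", "transcript", "transcript_source",
    "duration_s", "sample_rate", "source_sample_rate", "channels", "bit_depth",
    "excluded", "source", "subset",
}
TRANSCRIPT_SOURCES = {"whisper-large-v3", "upstream-whisper-base"}


def load_manifest(path: Path) -> List[Dict]:
    """Parse manifest.jsonl, failing loudly on malformed records."""
    records: List[Dict] = []
    with path.open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
            if record["transcript_source"] not in TRANSCRIPT_SOURCES:
                raise ValueError(f"line {lineno}: unknown transcript_source")
            records.append(record)
    return records


def usable_clips(records: List[Dict]) -> List[Dict]:
    """Keep only clips that are not flagged excluded."""
    return [r for r in records if not r["excluded"]]
```

Summing `duration_s` over `usable_clips(...)` gives the amount of audio actually available to a downstream trainer.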

Downstream consumers

Kokoro voice-clone (primary) — packages/training/scripts/kokoro/extract_voice_embedding.py --clips-dir packages/training/data/voice/same/audio --base-model hexgrad/Kokoro-82M --out voice.bin.
Kokoro LoRA fine-tune (experimental) — packages/training/scripts/kokoro/prep_ljspeech.py --data-dir packages/training/data/voice/same/ljspeech --sample-rate 24000 …, then finetune_kokoro.py.
OmniVoice freeze (R6 / I6) — preset-based, consumes audio/ directly.

License

The upstream repo (lalalune/ai_voices) ships no LICENSE file. Its README.md only says "For fun and research only, obviously." The same voice is a derivative of Her (2013, Warner Bros).

Treat this corpus as a non-commercial research dataset:

Do not redistribute raw audio. Audio stays gitignored in this repo and is re-fetched from upstream by the build script.
Published derivatives (Kokoro voice-clone voice.bin, LoRA adapter, OmniVoice preset, fine-tuned ONNX/GGUF) must:
- credit lalalune/ai_voices upstream,
- declare the upstream commit SHA pinned in source.json,
- be marked research-only (private HF dataset / private=true on first push; public release requires explicit owner sign-off).

See .swarm/collab.md for the C0 decision log on license handling (2026-05-13).

Known issues

same_002.txt = "641." — Whisper-base hallucination on a 1.37 s clip. build_same_manifest.py fixes this when run with the default --retranscribe (loads whisper-large-v3 and rewrites every transcript). When invoked with --no-retranscribe (CI / smoke) the clip is marked excluded=true in manifest.jsonl and skipped in ljspeech/metadata.csv.
Corpus skews short. 35 of 58 clips ≤ 3 s. LoRA prosody re-target benefits from longer clips; voice-clone is fine.
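
How excluded clips stay out of ljspeech/metadata.csv can be sketched as follows. This is a hedged illustration, not the builder's code: the function name `write_ljspeech_metadata` is hypothetical, and reusing the transcript unchanged for the `normalized` column is an assumption, since this README does not say what text normalization (if any) is applied.

```python
from pathlib import Path
from typing import Dict, Iterable


def write_ljspeech_metadata(records: Iterable[Dict], out_path: Path) -> int:
    """Write `id|raw|normalized` rows for every non-excluded record; return the row count."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with out_path.open("w", encoding="utf-8", newline="\n") as fh:
        for record in records:
            if record.get("excluded"):
                continue  # e.g. same_002 when built with --no-retranscribe
            # Collapse whitespace so each record stays on one line.
            text = " ".join(record["transcript"].split())
            if "|" in text:
                raise ValueError(f"{record['id']}: transcript contains the '|' delimiter")
            # Assumption: no separate normalization pass, so raw == normalized.
            fh.write(f"{record['id']}|{text}|{text}\n")
            written += 1
    return written
```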