packages/training/data/voice/sam/README.md
A 58-clip, ~3.5 min voice corpus used to train / clone the `same` voice
for Kokoro (and as the freeze target for OmniVoice). It lands here from
the upstream `lalalune/ai_voices` repo.

License: research / personal use only. No upstream `LICENSE` file exists in `lalalune/ai_voices`, and the `same` voice is a derivative of *Her* (2013, Warner Bros). Do NOT redistribute the raw audio. Publish only fine-tune deltas / voice embeddings as derivative works with explicit attribution. See `source.json` and §License below.

- `lalalune/ai_voices`, `sam/` subset (renamed locally to `same`).
- `samantha_NNN.wav` (44.1 kHz mono 16-bit PCM) + `samantha_NNN.txt` (Whisper-base transcripts). The build script re-keys these to `same_NNN.{wav,txt}` when landing them locally.
- `source.json`, written by `build_same_manifest.py` at fetch time.
- `.swarm/research/R12-ai_voices.md`.

Layout of `packages/training/data/voice/same/`:

    packages/training/data/voice/same/
      README.md           # this file (tracked)
      source.json         # upstream URL + commit sha + counts + license (tracked)
      manifest.jsonl      # one JSON record per clip (tracked)
      .gitignore          # local ignore: audio + raw stay out of git (tracked)
      audio/              # 24 kHz mono PCM16, LUFS-normalized (gitignored)
        same_001.wav .. same_058.wav
      raw/                # untouched upstream 44.1 kHz mono PCM16 (gitignored)
        same_001.wav .. same_058.wav
        same_001.txt .. same_058.txt
      ljspeech/
        metadata.csv      # LJSpeech format `id|raw|normalized` (tracked)
        wavs/             # symlinks into ../../audio/ (gitignored)

Tracked in git: README.md, source.json, manifest.jsonl,
ljspeech/metadata.csv, .gitignore. Everything under audio/, raw/,
and ljspeech/wavs/ is gitignored and regenerated by the manifest builder.
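
Because `audio/` and `raw/` are regenerated rather than tracked, it can be useful to confirm that a rebuilt tree matches the formats listed above. The following is a minimal sketch using only the standard-library `wave` module; the helper names `wav_format` and `nonconforming_wavs` are hypothetical and are not part of the build scripts.

```python
import wave
from pathlib import Path


def wav_format(path):
    """Return (sample_rate, channels, bit_depth, duration_s) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        bit_depth = wf.getsampwidth() * 8  # sample width is reported in bytes
        duration_s = wf.getnframes() / rate
    return rate, channels, bit_depth, duration_s


def nonconforming_wavs(directory, expected_rate):
    """Names of *.wav files in `directory` that are not mono 16-bit at `expected_rate`."""
    bad = []
    for path in sorted(Path(directory).glob("*.wav")):
        rate, channels, bit_depth, _ = wav_format(path)
        if (rate, channels, bit_depth) != (expected_rate, 1, 16):
            bad.append(path.name)
    return bad


# Hypothetical usage, assuming the layout described in this README:
#   nonconforming_wavs("packages/training/data/voice/same/audio", 24000)
#   nonconforming_wavs("packages/training/data/voice/same/raw", 44100)
```

An empty list from both calls would suggest the rebuilt audio matches the documented sample rates. This check does not verify loudness normalization.
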
| Property | Value |
|---|---|
| Clip count | 58 |
| Total duration | ~210.3 s (3.51 min) |
| Source sample rate | 44.1 kHz |
| Channels | mono |
| Bit depth | 16-bit PCM |
| Min / median / max clip duration | 0.67 / 1.89 / 13.13 s |
The duration distribution is skewed short — 35 of 58 clips ≤ 3 s — so the
LoRA fine-tune path is impractical at this size (community minimum is
1–3 h). Use extract_voice_embedding.py (voice-clone) as the primary path;
LoRA is an experimental comparison only.
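
The duration figures above could be recomputed from `manifest.jsonl` after a rebuild. This is a sketch only: it assumes each record carries the `duration_s` field shown in the manifest schema, and the function name `duration_stats` is hypothetical.

```python
import json
import statistics


def duration_stats(manifest_lines, short_threshold_s=3.0):
    """Summarize clip durations from an iterable of JSONL lines."""
    durations = [
        json.loads(line)["duration_s"]
        for line in manifest_lines
        if line.strip()  # tolerate blank lines
    ]
    return {
        "clips": len(durations),
        "total_s": round(sum(durations), 2),
        "min_s": min(durations),
        "median_s": statistics.median(durations),
        "max_s": max(durations),
        # Clips at or below the threshold (3 s by default)
        "short_clips": sum(d <= short_threshold_s for d in durations),
    }


# Hypothetical usage against the tracked manifest:
#   with open("packages/training/data/voice/same/manifest.jsonl") as fh:
#       print(duration_stats(fh))
```

If the output drifts from the table (for example after clips are added upstream), the table and the LoRA guidance may need revisiting.
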
The corpus is regenerable from upstream. The build script sparse-clones only the sam slice (not the full 258 MB repo).

    python3 packages/training/scripts/voice/build_same_manifest.py \
      --sparse-clone /tmp/ai_voices

This will:

1. Run `git clone --filter=blob:none --sparse` on `lalalune/ai_voices` into
   `/tmp/ai_voices`, with `sparse-checkout set sam utils README.md`
   (so only ~19 MB of audio is fetched, not 258 MB).
2. Copy the upstream clips into `raw/` (renamed to `same_NNN`).
3. Resample to 24 kHz and LUFS-normalize (skip with `--no-normalize`) into `audio/`.
4. Re-transcribe with `whisper-large-v3` if the `openai-whisper` package is
   installed. This replaces the upstream `samantha_002.txt='641.'`
   hallucination (and any other Whisper-base errors). Skip with `--no-retranscribe`.
5. Write `manifest.jsonl`, `source.json`, and `ljspeech/metadata.csv`.

To build from an existing checkout instead:

    python3 packages/training/scripts/voice/build_same_manifest.py \
      --src /tmp/ai_voices/sam \
      --dst packages/training/data/voice/same

To audit the upstream clips:

    bash packages/training/scripts/voice/audit_same.sh /tmp/ai_voices/sam

I7 and I11 must run this audit before invoking the kokoro pipeline.
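
For readers who want to see what the sparse-clone step amounts to, here is a rough sketch of the two Git invocations described above. It is not the code in `build_same_manifest.py`. The URL is assumed from the `source` field of the manifest (`github.com/lalalune/ai_voices`), and the function names are hypothetical. Depending on the Git version and whether cone mode is active, a file path such as `README.md` may need `--no-cone` or `--skip-checks`; how the real script handles that is not stated here.

```python
import subprocess

# Assumption: upstream URL inferred from the manifest's `source` field.
UPSTREAM_URL = "https://github.com/lalalune/ai_voices"


def sparse_clone_commands(dest, paths=("sam", "utils", "README.md"), url=UPSTREAM_URL):
    """Build the git commands for a blob-less sparse clone limited to `paths`."""
    return [
        # Clone without blobs and with sparse checkout enabled.
        ["git", "clone", "--filter=blob:none", "--sparse", url, dest],
        # Restrict the working tree to the requested paths.
        ["git", "-C", dest, "sparse-checkout", "set", *paths],
    ]


def sparse_clone(dest):
    """Run the commands in order, raising CalledProcessError on the first failure."""
    for cmd in sparse_clone_commands(dest):
        subprocess.run(cmd, check=True)
```

Separating command construction from execution keeps the sketch inspectable without network access: `sparse_clone_commands("/tmp/ai_voices")` can be printed or logged before anything is fetched.
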

`manifest.jsonl`: one JSON record per line (shown here pretty-printed for readability):

    {
      "id": "same_001",
      "audio_path": "audio/same_001.wav",
      "raw_audio_path": "raw/same_001.wav",
      "transcript": "Yeah, I've been trying to figure out how to talk to you about this.",
      "transcript_source": "whisper-large-v3",
      "duration_s": 2.123,
      "sample_rate": 24000,
      "source_sample_rate": 44100,
      "channels": 1,
      "bit_depth": 16,
      "excluded": false,
      "source": "github.com/lalalune/ai_voices@<sha>",
      "subset": "same"
    }
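
A consumer of these records might parse and sanity-check them along the following lines. This is a sketch under assumptions: the required-field set is taken from the example record above, the allowed `transcript_source` values are the two this README names, and `parse_record` / `usable_records` are hypothetical helpers rather than functions from the training scripts.

```python
import json

# Field names taken from the example record in this README.
REQUIRED_FIELDS = {
    "id", "audio_path", "raw_audio_path", "transcript", "transcript_source",
    "duration_s", "sample_rate", "source_sample_rate", "channels",
    "bit_depth", "excluded", "source", "subset",
}
TRANSCRIPT_SOURCES = {"whisper-large-v3", "upstream-whisper-base"}


def parse_record(line):
    """Parse one JSONL line and check it against the documented schema."""
    record = json.loads(line)
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"{record.get('id', '?')}: missing fields {sorted(missing)}")
    if record["transcript_source"] not in TRANSCRIPT_SOURCES:
        raise ValueError(
            f"{record['id']}: unexpected transcript_source {record['transcript_source']!r}"
        )
    return record


def usable_records(lines):
    """Return validated records, dropping those flagged `excluded`."""
    records = [parse_record(line) for line in lines if line.strip()]
    return [r for r in records if not r["excluded"]]
```

Filtering on `excluded` at load time keeps any clip with a known-bad transcript out of downstream training without editing the manifest by hand.
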

- `transcript_source` ∈ {`whisper-large-v3`, `upstream-whisper-base`}.
- `excluded=true` only for clips that still hold an upstream Whisper-base
  hallucination after the build. Today that is at most `same_002`,
  when the script ran with `--no-retranscribe`.
- `source` pins the upstream git commit SHA at fetch time.

Consumers of this corpus:

- Voice clone (primary): `packages/training/scripts/kokoro/extract_voice_embedding.py --clips-dir packages/training/data/voice/same/audio --base-model hexgrad/Kokoro-82M --out voice.bin`.
- LoRA fine-tune (experimental): `packages/training/scripts/kokoro/prep_ljspeech.py --data-dir packages/training/data/voice/same/ljspeech --sample-rate 24000 …`, then `finetune_kokoro.py`.
- OmniVoice freeze target: reads `audio/` directly.

The upstream repo (`lalalune/ai_voices`) ships no `LICENSE` file. Its
`README.md` only says "For fun and research only, obviously." The `same`
voice is a derivative of *Her* (2013, Warner Bros).

Treat this corpus as a non-commercial research dataset:

- Derivative works (`voice.bin`, LoRA adapter,
  OmniVoice preset, fine-tuned ONNX/GGUF) must:
  - attribute `lalalune/ai_voices` upstream,
  - reference the provenance recorded in `source.json`,
  - stay private by default (`private=true` on
    first push; public release requires explicit owner sign-off).

See `.swarm/collab.md` for the C0 decision log on license handling
(2026-05-13).

`same_002.txt` = "641." is a Whisper-base hallucination on a 1.37 s
clip. `build_same_manifest.py` fixes this when run with the
default `--retranscribe` (loads `whisper-large-v3` and rewrites every
transcript). When invoked with `--no-retranscribe` (CI / smoke), the
clip is marked `excluded=true` in `manifest.jsonl` and skipped in
`ljspeech/metadata.csv`.
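
The skip behaviour described above can be illustrated with a small sketch that derives `id|raw|normalized` lines from manifest records. It is not the builder's actual code. In particular, it assumes the normalized column simply repeats the raw transcript with whitespace collapsed, which may differ from what `build_same_manifest.py` does, and the function names are hypothetical.

```python
def ljspeech_line(record):
    """Format one manifest record as an LJSpeech `id|raw|normalized` line."""
    # Collapse runs of whitespace (including newlines) to single spaces.
    text = " ".join(record["transcript"].split())
    if "|" in text:
        # The pipe is the column separator, so it cannot appear in a transcript.
        raise ValueError(f"{record['id']}: transcript contains '|'")
    # Assumption: normalized text == raw text in this sketch.
    return f"{record['id']}|{text}|{text}"


def ljspeech_metadata(records):
    """Return metadata.csv content, skipping records flagged `excluded`."""
    return "".join(
        ljspeech_line(r) + "\n" for r in records if not r.get("excluded")
    )
```

With this shape, a clip left `excluded=true` by a `--no-retranscribe` run never reaches `metadata.csv`, so the LJSpeech consumers do not see the hallucinated transcript.
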