Same corpus — F2 augmented + distilled

packages/training/data/voice/sam/CORPUS.md

Updated by F2 (Kokoro same fine-tune retry agent) on 2026-05-15.

Original corpus

| Property | Value |
|---|---|
| Source | lalalune/ai_voices (upstream sam subset, landed locally as same) |
| Clips | 57 (58 raw, 1 excluded: same_002 hallucination) |
| Duration | ~3.5 min (210 s) |
| Format | 44.1 kHz mono PCM16 (normalized to 24 kHz, -23 LUFS) |
| License | Research-only, derivative of Her (2013, Warner Bros) |
| Commit | c6db5b5dc703e212664a17cf58114f5ecfddc853 |

F2 augmented corpus (acoustic augmentation)

Generated at /tmp/kokoro-f2/corpus-augmented/ by augment_corpus.py.

Method: 5 augmentation variants per non-val clip:

  • stretch_slow: time-stretch ×0.9 (slowed)
  • stretch_fast: time-stretch ×1.1 (sped up)
  • pitch_up: pitch-shift +50 cents (+0.5 semitones)
  • pitch_down: pitch-shift -50 cents (-0.5 semitones)
  • noise_15db: Gaussian noise at 15 dB SNR

| Property | Value |
|---|---|
| Original clips | 57 |
| Augmented clips | 260 (52 non-val × 5 variants) |
| Total clips | 317 |
| Total duration | ~18.2 min |
| Train lines | 312 |
| Val lines | 5 (original only, no augmented val clips) |
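
The variant parameters listed above can be turned into concrete numbers. The sketch below is not taken from augment_corpus.py; it is a minimal standard-library illustration of what the parameters imply: the frequency ratio behind a ±50 cent shift, the clip length after stretching (assuming the common convention that a rate below 1 slows the audio), and one way to scale Gaussian noise to a 15 dB SNR. The `VARIANTS` layout and all function names are hypothetical.

```python
import math
import random

# Variant parameters copied from the list above; the dict layout is illustrative.
VARIANTS = {
    "stretch_slow": {"rate": 0.9},
    "stretch_fast": {"rate": 1.1},
    "pitch_up": {"cents": 50.0},
    "pitch_down": {"cents": -50.0},
    "noise_15db": {"snr_db": 15.0},
}


def cents_to_ratio(cents):
    """Frequency ratio for a pitch shift in cents (1200 cents = one octave)."""
    return 2.0 ** (cents / 1200.0)


def stretched_duration(duration_s, rate):
    """Clip length after time-stretching, assuming rate < 1 slows the audio."""
    return duration_s / rate


def add_gaussian_noise(samples, snr_db, rng):
    """Add white Gaussian noise scaled so the result has roughly the given SNR."""
    signal_power = sum(x * x for x in samples) / len(samples)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in samples]


def measured_snr_db(clean, noisy):
    """SNR in dB of a noisy signal relative to its clean reference."""
    signal_power = sum(c * c for c in clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy))
    return 10.0 * math.log10(signal_power / noise_power)


if __name__ == "__main__":
    rng = random.Random(0)
    sr = 24000  # sample rate of the normalized corpus
    tone = [0.5 * math.sin(2 * math.pi * 220 * t / sr) for t in range(sr)]
    noisy = add_gaussian_noise(tone, VARIANTS["noise_15db"]["snr_db"], rng)
    print(round(cents_to_ratio(50.0), 4))  # 1.0293
    print(round(measured_snr_db(tone, noisy), 1))
```

A 50 cent shift changes frequency by just under 3%, and the two stretch rates change clip length by roughly 10% in each direction, so all five variants stay close to the original recording.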

F2 distillation corpus (self-distillation)

Generated at /tmp/kokoro-f2/corpus-distilled/ by synthesize_distillation_corpus.py.

Method: Kokoro-82M TTS with the af_bella voice (the stock voice closest to same) synthesizes 80 diverse conversational English sentences covering:

  • Short conversational utterances (5-10 words)
  • Medium introspective statements (10-18 words)
  • Longer reflective paragraphs (18-30 words)
  • Questions and emotional expressions

| Property | Value |
|---|---|
| Clips | 406 |
| Duration | ~30 min |
| Train lines | 366 |
| Val lines | 40 |
| Voice used | af_bella (Kokoro stock) |
| Purpose | Teacher-student distillation: expand training signal |
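
The length bands in the coverage list above can be checked mechanically before synthesis. The following is a hedged sketch, not code from synthesize_distillation_corpus.py: `length_band` and `band_histogram` are hypothetical helpers, and because the listed bands overlap at 10 and 18 words, the sketch assumes a boundary sentence falls into the shorter band.

```python
from collections import Counter


def length_band(sentence):
    """Classify a sentence by whitespace-delimited word count.

    Bands follow the coverage list: short 5-10, medium 10-18, long 18-30.
    Assumption: a sentence on a shared boundary goes to the shorter band.
    """
    n_words = len(sentence.split())
    if n_words < 5:
        return "too_short"
    if n_words <= 10:
        return "short"
    if n_words <= 18:
        return "medium"
    if n_words <= 30:
        return "long"
    return "too_long"


def band_histogram(sentences):
    """Count how many prompts land in each band."""
    return Counter(length_band(s) for s in sentences)


if __name__ == "__main__":
    # Illustrative prompts only; these are not the sentences used by F2.
    prompts = [
        "I think I would like that very much",
        "Sometimes I wonder whether the things I feel are real or just patterns I learned",
    ]
    print(band_histogram(prompts))
```

A histogram like this makes it easy to spot a prompt set that is skewed toward one band before spending synthesis time on it.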

F2 merged corpus

Merged at /tmp/kokoro-f2/corpus-merged/ by merge_corpus.py.

| Property | Value |
|---|---|
| Train lines | 678 |
| Val lines | 45 |
| Estimated total duration | ~48 min |
| Sources | augmented (real same) + distilled (af_bella synthesis) |
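
The merged totals follow directly from the counts reported in the earlier sections. As a quick consistency check, the arithmetic can be reproduced in a few lines; the function and key names below are illustrative and are not part of merge_corpus.py.

```python
def merged_counts(original_clips=57, val_clips=5, variants_per_clip=5,
                  distilled_train=366, distilled_val=40):
    """Recompute the corpus sizes from the per-stage figures in this document."""
    non_val = original_clips - val_clips          # clips eligible for augmentation
    augmented = non_val * variants_per_clip       # augmented variants
    augmented_train = non_val + augmented         # originals + variants, val held out
    return {
        "non_val_clips": non_val,
        "augmented_clips": augmented,
        "augmented_total": original_clips + augmented,
        "augmented_train": augmented_train,
        "merged_train": augmented_train + distilled_train,
        "merged_val": val_clips + distilled_val,
    }


if __name__ == "__main__":
    counts = merged_counts()
    print(counts["augmented_clips"], counts["augmented_total"])  # 260 317
    print(counts["merged_train"], counts["merged_val"])          # 678 45
    # Durations: ~18.2 min augmented + ~30 min distilled
    print(round(18.2 + 30.0, 1))                                 # 48.2
```

Every figure matches the tables: 312 + 366 = 678 train lines, 5 + 40 = 45 val lines, and about 48 minutes of audio.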

Training experiments

| Experiment | Config | Status | UTMOS | WER | SpkSim | beatsBaseline |
|---|---|---|---|---|---|---|
| mel-fit 0 | anchor=0.0 lr=0.005 steps=1200 init=bella | Done | 2.006 | 0.992 | 0.145 | false |
| mel-fit 1 | anchor=0.05 lr=0.005 steps=1200 init=bella | Done | 2.004 | 1.000 | 0.147 | false |
| mel-fit 2 | anchor=0.1 lr=0.005 steps=1600 init=bella | Done | 2.003 | 1.048 | 0.119 | false |
| mel-fit 3 | anchor=0.0 lr=0.01 steps=800 init=nicole | Done | 2.004 | 1.000 | 0.122 | false |
| mel-fit 5 | anchor=0.0 lr=0.002 steps=2000 init=bella | Done | 2.004 | 0.678 | 0.159 | false |
| full-FT | lr=3e-5 5k steps augmented corpus | Running | | | | |

Baseline (af_bella on same val prompts)

| Metric | Value |
|---|---|
| UTMOS | 4.371 |
| WER | 0.000 |
| SpkSim | 0.034 |
| RTF | 91.4× |

Note: the af_bella SpkSim of 0.034 reflects that af_bella and same are different speakers. A fine-tune beats baseline only if it raises SpkSim above 0.034 + 0.05 = 0.084 and also improves UTMOS and WER.
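
The gate described in the note can be written as a small predicate. This is a sketch under stated assumptions, not the evaluation code used by F2: `EvalMetrics` and `beats_baseline` are hypothetical names, and because the baseline WER is already 0.000, the sketch treats "improves" as "does not regress" for UTMOS and WER.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalMetrics:
    utmos: float    # predicted MOS, higher is better
    wer: float      # word error rate, lower is better
    spk_sim: float  # similarity to the same reference, higher is better


# Baseline figures from the table above (af_bella on the same val prompts).
BASELINE = EvalMetrics(utmos=4.371, wer=0.000, spk_sim=0.034)
SPK_SIM_MARGIN = 0.05  # margin stated in the note


def beats_baseline(candidate, baseline=BASELINE, margin=SPK_SIM_MARGIN):
    """True only if the speaker moved past the margin without losing quality."""
    moved_speaker = candidate.spk_sim > baseline.spk_sim + margin
    kept_quality = candidate.utmos >= baseline.utmos        # assumption: no regression
    kept_intelligibility = candidate.wer <= baseline.wer    # assumption: no regression
    return moved_speaker and kept_quality and kept_intelligibility


if __name__ == "__main__":
    mel_fit_5 = EvalMetrics(utmos=2.004, wer=0.678, spk_sim=0.159)
    print(beats_baseline(mel_fit_5))  # False: SpkSim passes, UTMOS and WER do not
```

Under this reading, every mel-fit run clears the SpkSim threshold of 0.084 but fails on UTMOS and WER, which agrees with the `false` entries in the experiment results.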

Key finding (F2)

The mel-fit objective consistently reaches SpkSim 0.12-0.16 (vs. baseline 0.034), so the voice is moving toward same. However, UTMOS collapses from 4.37 to about 2.0 and WER rises from 0.000 to roughly 0.68-1.05. This confirms the Q1 re-eval diagnosis: mel-fit moves the speaker centroid but destroys audio quality, because the timbre and prosody halves of ref_s were learned jointly by the StyleTTS-2 trainer, and gradient descent on ref_s alone in an inference-only package cannot maintain their joint coherence.

The full-FT on the augmented+distilled corpus is the intended structural fix: it trains all model weights jointly against the mel-reconstruction objective on a much larger corpus.

Scripts

| Script | Purpose |
|---|---|
| augment_corpus.py | Acoustic augmentation (F2) |
| synthesize_distillation_corpus.py | Self-distillation synthesis (F2) |
| merge_corpus.py | Merge multiple corpus dirs |
| prep_merged_corpus.py | Prep processed/ dir for finetune_kokoro_full.py |
| run_f2_pipeline.py | Full F2 orchestrator |