Universal_TTS_Finetune

Universal Coqui & Rhasspy Piper TTS fine-tuning workflow with:

a Gradio web GUI
a headless CLI
LJSpeech-style dataset generation from your own audio
optional automatic transcription with Whisper when transcripts are not provided
quick post-training inference for the model you just trained

Supported models

The current workflow targets the bundled recipes/ljspeech training recipes for the Coqui models below, plus Rhasspy's Piper TTS:

Align TTS
DelightfulTTS
FastPitch
FastSpeech
FastSpeech 2
Glow-TTS
NeuralHMM-TTS
Overflow
SpeedySpeech
Tacotron2 Capacitron
Tacotron2 DCA
Tacotron2 DDC
VITS
XTTS v1
XTTS v2
Piper TTS (Rhasspy)

When Coqui publishes a matching pretrained checkpoint, the trainer can auto-download it and continue from it. Otherwise the workflow still prepares the recipe workspace and can train from a user-supplied checkpoint or recipe defaults.

What it does

1. Prepare a dataset

Point the app at audio files or a folder of audio.

If you provide a transcript map (csv, tsv, txt, or json), it uses that text.
VTT Audio Slicing: If you provide a .vtt file along with the matching audiobook file (e.g. generated by ebook2audiobook), the tool parses the timestamps and slices the audio directly into training clips. No transcription model runs, so this step introduces no transcription errors and spends no GPU/CPU time on transcription.
If you do not provide text or a VTT file, it transcribes with Whisper and chunks longer recordings into sentence-sized clips.
Speaker Diarization: Optionally enable speaker diarization to separate multiple speakers into distinct datasets. This uses a high-performance PyAnnote ResNet-34 VoxCeleb speaker model (pyannote/wespeaker-voxceleb-resnet34-LM) to extract embeddings and group clips by voice. You can configure:
- Expected Speakers: Force the clustering into exactly N speaker folders.
- Distance Threshold: Fine-tune the sensitivity of auto-detecting speakers when expected speakers is set to 0.
Re-diarization: Once a dataset has been prepared, the original mixed audio clips are preserved. You can re-diarize the dataset with new speaker counts or thresholds via the web GUI without re-running the slow Whisper transcription step.
It writes an LJSpeech-style dataset under:

text

<output_root>/dataset/LJSpeech-1.1/

including:

wavs/
metadata.csv
metadata_shuf.csv
metadata_train.csv
metadata_val.csv
dataset_info.json
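
The file layout above can be illustrated with a minimal sketch. It assumes the standard LJSpeech row format (`id|text|normalized text`) and treats metadata_shuf.csv as a shuffled copy that is then cut into train and validation splits. The function name, the 10% validation fraction, and the seed are illustrative assumptions, not values taken from this tool.

```python
import random
from pathlib import Path


def write_ljspeech_metadata(rows, dataset_dir, val_fraction=0.1, seed=0):
    """Write metadata.csv plus shuffled train/val splits (hypothetical helper).

    rows: list of (clip_id, text) pairs, where clip_id is the WAV stem in wavs/.
    val_fraction and seed are assumptions made for this sketch.
    """
    dataset_dir = Path(dataset_dir)
    (dataset_dir / "wavs").mkdir(parents=True, exist_ok=True)

    def dump(name, subset):
        # LJSpeech rows are pipe-delimited: id|text|normalized text.
        # Text containing "|" would need escaping; this sketch assumes it does not.
        lines = [f"{clip_id}|{text}|{text}" for clip_id, text in subset]
        (dataset_dir / name).write_text("\n".join(lines) + "\n", encoding="utf-8")

    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Hold out at least one clip for validation when there is more than one clip.
    n_val = max(1, int(len(shuffled) * val_fraction)) if len(shuffled) > 1 else 0

    dump("metadata.csv", rows)
    dump("metadata_shuf.csv", shuffled)
    dump("metadata_val.csv", shuffled[:n_val])
    dump("metadata_train.csv", shuffled[n_val:])
    return len(shuffled) - n_val, n_val
```

The return value is the pair (train clip count, validation clip count), which is handy for a quick sanity check before training.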

2. Train or fine-tune a model

Pick one of the supported Coqui recipes, then train from the GUI or CLI.

Training artifacts are written under:

text

<output_root>/training_runs/<model>/<timestamp>/ready/

with an artifacts.json file that the GUI and CLI can load later.
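
Given that layout, a script can locate the most recent run for a model without knowing the timestamp. The sketch below is a hypothetical helper, not part of the tool: it assumes the `<model>` directory name matches the CLI model id (for example xtts_v2) and picks the newest file by modification time, since the timestamp format is not documented here.

```python
from pathlib import Path


def newest_artifacts(output_root, model):
    """Return the most recently modified artifacts.json for `model`, or None.

    Assumes the directory layout
    <output_root>/training_runs/<model>/<timestamp>/ready/artifacts.json.
    """
    pattern = f"training_runs/{model}/*/ready/artifacts.json"
    candidates = list(Path(output_root).glob(pattern))
    if not candidates:
        return None
    # Compare by modification time rather than parsing the timestamp folder name.
    return max(candidates, key=lambda path: path.stat().st_mtime)
```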

3. Test the trained model

After training, load the generated artifacts.json (or the training folder) and synthesize test audio.

XTTS models use a speaker reference WAV.
Single-speaker recipe models synthesize directly.

Install

Install ebook2audiobook first, then install the additional dependencies with pip:

bash

git clone https://github.com/DrewThomasson/ebook2audiobook.git
cd ebook2audiobook
./ebook2audiobook.command  # macOS/Linux (on Windows, run ebook2audiobook.cmd instead); installs ebook2audiobook locally
conda activate ./python_env  # Activate the Python env created for E2A
cd tools/Universal_TTS_Finetune  # Go into the Universal_TTS_Finetune dir
pip install -r requirements.txt  # Install additional requirements for E2A SML

Run the web GUI

Run the application directly with Python:

bash

python web_gui.py --port 7862 --out_path /absolute/path/to/output

Run with Docker

To run the application with Docker, use docker-compose, which installs all system dependencies and sets up GPU support automatically:

bash

docker-compose up --build

The application will be available at http://localhost:7862.

Headless CLI

Note: By default, the training commands (train and workflow) will stream live training logs to your console so you can see progress in real time. If you prefer to suppress this output (e.g., when running in a background job), you can pass the --no-stream-logs flag.

List models:

bash

python headless_cli.py list-models

Prepare a dataset from a folder of audio and auto-transcribe with Whisper:

bash

python headless_cli.py prepare-dataset \
  --output-root /absolute/path/to/output \
  --audio-dir /absolute/path/to/audio \
  --language en \
  --whisper-model small \
  --diarize-speakers

Note: The --diarize-speakers flag is optional. If provided, the pipeline will extract speaker embeddings using a pre-trained PyAnnote ResNet-34 speaker model and cluster them by distinct speakers. You can optionally specify --expected-speakers <count> to cluster into exactly that many speakers, or adjust --diarize-threshold <float> to control auto-detection sensitivity. It will output separate datasets (e.g., dataset/LJSpeech-1.1_Speaker_1/) and default to returning the speaker with the most training data.
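
To make the interaction between --expected-speakers and --diarize-threshold concrete, here is a small pure-Python sketch of average-linkage agglomerative clustering over cosine distance. This is an assumption about the general technique, not the tool's actual clustering code. The function names and the default threshold of 0.5 are hypothetical, and real speaker embeddings would come from the speaker model rather than hand-written vectors.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def cluster_speakers(embeddings, expected_speakers=0, threshold=0.5):
    """Return one cluster label per clip embedding.

    expected_speakers > 0: keep merging until exactly that many clusters remain.
    expected_speakers == 0: merge until the closest pair is farther than `threshold`.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        dists = [cosine_distance(embeddings[i], embeddings[j]) for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        if expected_speakers and len(clusters) <= expected_speakers:
            break
        # Closest pair of clusters as (distance, index_a, index_b) with a < b.
        dist, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if not expected_speakers and dist > threshold:
            break
        clusters[a].extend(clusters.pop(b))

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels
```

In this sketch a lower threshold yields more, smaller speaker groups, while a fixed speaker count overrides the threshold entirely.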

Prepare a dataset using an existing transcript file:

bash

python headless_cli.py prepare-dataset \
  --output-root /absolute/path/to/output \
  --audio-dir /absolute/path/to/audio \
  --transcript-file /absolute/path/to/metadata.csv

Prepare a dataset from an audiobook and .vtt alignment file (e.g. from ebook2audiobook outputs):

bash

python headless_cli.py prepare-dataset \
  --output-root /absolute/path/to/output \
  --audio-file /absolute/path/to/audiobook.mp3 \
  --transcript-file /absolute/path/to/alignment.vtt
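
As a rough illustration of the VTT path, the sketch below parses WebVTT cues into (start, end, text) tuples using only the standard library. It is a simplified assumption of what cue parsing involves, not the tool's parser: cue settings, NOTE blocks, and inline tags are ignored, and the audio cutting itself (done with an audio library) is not shown.

```python
import re

# Matches "hh:mm:ss.ttt --> hh:mm:ss.ttt"; the hours part is optional in WebVTT.
CUE_RE = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)


def _seconds(hours, minutes, seconds, millis):
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_vtt(text):
    """Return a list of (start_sec, end_sec, cue_text) tuples."""
    cues = []
    # Cues are separated by blank lines; the WEBVTT header block has no timing line.
    for block in re.split(r"\n\s*\n", text.replace("\r\n", "\n").strip()):
        lines = block.split("\n")
        for idx, line in enumerate(lines):
            match = CUE_RE.search(line)
            if match:
                groups = match.groups()
                # Join multi-line cue text into a single sentence string.
                payload = " ".join(part.strip() for part in lines[idx + 1:] if part.strip())
                if payload:
                    cues.append((_seconds(*groups[:4]), _seconds(*groups[4:]), payload))
                break
    return cues
```

Each tuple gives the slice boundaries in seconds plus the text that becomes the clip's transcript.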

Prepare a dataset from an audiobook and ePUB/text using Forced Alignment (bypasses Whisper transcription entirely):

Convert your ePUB chapter/book to a plain text .txt file (e.g. using Calibre).
Run the aligner by passing exactly 1 audio file (e.g. a chapter) and the .txt file:

bash

python headless_cli.py prepare-dataset \
  --output-root /absolute/path/to/output \
  --audio-file /absolute/path/to/chapter1.mp3 \
  --transcript-file /absolute/path/to/chapter1_transcript.txt \
  --language en

Automatic Sentence Splitting: By default, the pipeline automatically splits paragraphs/blocks of text into individual sentences (using multilingual quote-aware regular expressions) to ensure optimal audio slice lengths (1-12s) for TTS training.
Custom Sentence Line Formatting: If you have manually formatted your text file to have one sentence per line and want to bypass the automatic splitting, pass the --no-auto-split-sentences CLI flag (or uncheck Auto-split sentences for forced alignment in the Web GUI):

bash

python headless_cli.py prepare-dataset \
  --output-root /absolute/path/to/output \
  --audio-file /absolute/path/to/chapter1.mp3 \
  --transcript-file /absolute/path/to/chapter1_transcript.txt \
  --language en \
  --no-auto-split-sentences
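
For intuition about what automatic splitting does, here is a deliberately simplified regex sketch. It is an assumption for illustration only and is far less thorough than a multilingual splitter: it breaks after ., !, ? or … (plus any closing quote or bracket) when whitespace or the end of the text follows, and it will wrongly split after abbreviations such as "Dr.".

```python
import re

# A sentence is the shortest run ending in terminal punctuation, optionally
# followed by closing quotes/brackets, with whitespace or end-of-text after it.
# A trailing fragment without punctuation is kept as its own sentence.
_SENTENCE = re.compile(r'.+?(?:[.!?…]+["\'”’»)\]]*(?=\s|$)|$)')


def split_sentences(text):
    """Split a paragraph into sentences (simplified, quote-aware sketch)."""
    text = " ".join(text.split())  # collapse newlines and repeated spaces
    return [m.group().strip() for m in _SENTENCE.finditer(text) if m.group().strip()]
```

Because a period must be followed by whitespace to end a sentence, decimals such as 3.14 stay intact.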

Prepare a dataset by automatically matching local transcript files in a folder:

If you point the tool to a folder containing multiple audio files (using --audio-dir) and do not supply a global --transcript-file, the system will automatically scan the folder. For each audio file:

If a matching .vtt file exists (e.g., chapter1.mp3 and chapter1.vtt), it slices using those timestamps.
If a matching .txt file exists (e.g., chapter2.mp3 and chapter2.txt), it automatically runs Forced Alignment on it first and then slices it.
If no matching transcript is found (e.g., chapter3.mp3), it alerts the user and automatically falls back to transcribing using Whisper.

All generated slices from all chapters are merged into the final dataset automatically.

bash

python headless_cli.py prepare-dataset \
  --output-root /absolute/path/to/output \
  --audio-dir /absolute/path/to/mixed_audio_folder \
  --language de
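
The per-file decision described above can be sketched as a small lookup. This is a hypothetical helper, not the tool's code: the set of audio extensions is an assumption, and so is the choice to prefer a .vtt over a .txt when both exist (it simply follows the order in which the rules are listed).

```python
from pathlib import Path

# Assumed set of audio extensions; the tool's actual list may differ.
AUDIO_EXTS = {".mp3", ".wav", ".flac", ".m4a", ".ogg"}


def plan_transcription(audio_dir):
    """Map each audio file name to 'vtt', 'forced_alignment', or 'whisper'."""
    plan = {}
    for audio in sorted(Path(audio_dir).iterdir()):
        if audio.suffix.lower() not in AUDIO_EXTS:
            continue
        if audio.with_suffix(".vtt").exists():
            plan[audio.name] = "vtt"  # slice using existing timestamps
        elif audio.with_suffix(".txt").exists():
            plan[audio.name] = "forced_alignment"  # align text first, then slice
        else:
            plan[audio.name] = "whisper"  # no transcript found: transcribe
    return plan
```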

Dry-run a training workspace:

bash

python headless_cli.py train \
  --model xtts_v2 \
  --output-root /absolute/path/to/output \
  --dry-run

Train a model:

bash

python headless_cli.py train \
  --model glow_tts \
  --output-root /absolute/path/to/output \
  --epochs 50 \
  --batch-size 16

Run the whole workflow in one command:

bash

python headless_cli.py workflow \
  --model xtts_v2 \
  --output-root /absolute/path/to/output \
  --audio-dir /absolute/path/to/audio \
  --language en \
  --test-text "This is a quick validation sample."

Test all supported models sequentially on a dataset, saving sample audio and discarding the checkpoints to save space:

bash

python headless_cli.py batch-test \
  --output-root /absolute/path/to/output \
  --audio-dir /absolute/path/to/audio \
  --language en \
  --discard-models \
  --auto-calculate-epochs \
  --diarize-speakers

Note: The --auto-calculate-epochs flag ignores the --epochs argument and instead computes the number of epochs for each model family from a target step count (e.g., 1,500 steps for XTTS and 15,000 steps for Tacotron2) and the size of your dataset.
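
The arithmetic behind a step-targeted epoch count can be sketched as follows. The two step targets come from the note above; the formula itself (steps per epoch = clips divided by batch size, rounded up), the function name, and the family keys are assumptions made for illustration.

```python
import math

# Step targets quoted in the note above; other families are omitted here.
TARGET_STEPS = {"xtts": 1500, "tacotron2": 15000}


def auto_epochs(num_clips, batch_size, family):
    """Epochs needed to reach the family's target step count (assumed formula)."""
    steps_per_epoch = math.ceil(num_clips / batch_size)
    return max(1, math.ceil(TARGET_STEPS[family] / steps_per_epoch))
```

For example, 400 clips at batch size 16 give 25 steps per epoch, so auto_epochs(400, 16, "xtts") returns 60 and auto_epochs(400, 16, "tacotron2") returns 600.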

Generate speech from the newest trained model:

bash

python headless_cli.py synthesize \
  --artifacts /absolute/path/to/output \
  --model xtts_v2 \
  --text "Testing the fine-tuned voice." \
  --language en

Early Stopping and Exporting Checkpoints (Ctrl+C Support)

During long training runs (especially on CPU-only machines), you can stop the training early once you are satisfied with the generated voice quality.

Listen to samples: By running training with --sample-epoch-interval <N> (e.g. 1 or 5), the trainer will automatically write periodic audio samples inside <run_dir>/epoch_samples/.
Stop early: Press Ctrl+C to terminate the training process. PyTorch Lightning automatically saves intermediate model checkpoints (.ckpt files) at the end of every epoch.
Export and package: Since the training run was interrupted, it will not have automatically packaged the final model. You can run the included helper script export_checkpoint.py to manually package your latest checkpoint (converts to ONNX for Piper, or optimizes and copies .pth files for Coqui/XTTS):

bash

python export_checkpoint.py /path/to/your/training_run/<timestamp>

This will create a ready/ directory inside your training run containing the packaged model (model.onnx for Piper, or the optimized .pth files for Coqui/XTTS) and artifacts.json, making it immediately loadable inside the web GUI or CLI.

Transcript file formats

Accepted transcript formats:

vtt WebVTT alignment file (e.g. from ebook2audiobook)
json dictionary or list of objects
csv
tsv
pipe-delimited text

The audio key can be an absolute path, file name, or stem. The text field can be named text, transcript, sentence, or utterance.
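
A minimal sketch of that lookup for the JSON case is shown below. It is illustrative only: the field name "audio" for the audio key in a list of objects is a hypothetical choice (the name used by the tool is not stated here), and the lookup order of absolute path, then file name, then stem is an assumption.

```python
import json
from pathlib import Path

TEXT_FIELDS = ("text", "transcript", "sentence", "utterance")


def load_json_transcripts(raw):
    """Accept {"<audio key>": "<text>"} or a list of objects.

    In the list form, "audio" is a hypothetical field name used for this sketch.
    """
    data = json.loads(raw)
    if isinstance(data, dict):
        return dict(data)
    mapping = {}
    for entry in data:
        # Use the first recognised text field present in the entry.
        text = next((entry[name] for name in TEXT_FIELDS if name in entry), None)
        if text is not None:
            mapping[entry["audio"]] = text
    return mapping


def find_transcript(audio_path, mapping):
    """Look up text by absolute path, then file name, then stem."""
    audio_path = Path(audio_path)
    for key in (str(audio_path), audio_path.name, audio_path.stem):
        if key in mapping:
            return mapping[key]
    return None
```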

Notes

The workflow automatically uses CUDA when available and falls back to CPU otherwise.
XTTS models are the best option when you need multilingual fine-tuning or speaker-conditioned inference.
Some upstream Coqui recipes still depend on recipe-specific assumptions. If you need deeper tuning, use the extra_overrides_json field/flag to override recipe values before launch.