tools/Universal_TTS_Finetune/README.md
Universal Coqui & Rhasspy Piper TTS fine-tuning workflow with:
The current workflow targets the bundled recipes/ljspeech training recipes for these Coqui models:
When Coqui publishes a matching pretrained checkpoint, the trainer can auto-download it and continue from it. Otherwise the workflow still prepares the recipe workspace and can train from a user-supplied checkpoint or recipe defaults.
Point the app at audio files or a folder of audio.
- If you supply a transcript file (csv, tsv, txt, or json), it uses that text.
- If you supply a .vtt file (along with the matching audiobook file, e.g. generated by ebook2audiobook), the tool parses the timestamps and slices the audio directly into training data. No transcription step runs, so there are no transcription errors and no GPU/CPU transcription overhead.
- Optionally, it uses a pre-trained speaker model (pyannote/wespeaker-voxceleb-resnet34-LM) to extract embeddings and group clips by voice. You can configure:
<output_root>/dataset/LJSpeech-1.1/
including:
- wavs/
- metadata.csv
- metadata_shuf.csv
- metadata_train.csv
- metadata_val.csv
- dataset_info.json

Pick one of the supported Coqui recipes, then train from the GUI or CLI.
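Before starting a long training run, it can help to confirm that the prepared dataset folder contains the entries listed above. The following is a minimal sketch; `missing_dataset_entries` is a hypothetical helper written for illustration and is not part of the tool.

```python
from pathlib import Path

# Entries this README lists for a prepared dataset folder.
EXPECTED_ENTRIES = [
    "wavs",
    "metadata.csv",
    "metadata_shuf.csv",
    "metadata_train.csv",
    "metadata_val.csv",
    "dataset_info.json",
]


def missing_dataset_entries(dataset_dir):
    """Return the expected entries that are absent from dataset_dir.

    Hypothetical helper for illustration; it only checks for existence,
    not for valid contents.
    """
    root = Path(dataset_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]
```

An empty list from `missing_dataset_entries("/absolute/path/to/output/dataset/LJSpeech-1.1")` means every listed entry is present.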
Training artifacts are written under:
<output_root>/training_runs/<model>/<timestamp>/ready/
with an artifacts.json file that the GUI and CLI can load later.
After training, load the generated artifacts.json (or the training folder) and synthesize test audio.
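If you script around the training output, the folder layout above is enough to locate the most recent artifacts.json for a model. This is a sketch under one loud assumption: that the timestamp folder names sort chronologically as plain strings, which this README does not state. `find_latest_artifacts` is a hypothetical helper, not something the tool ships.

```python
from pathlib import Path


def find_latest_artifacts(output_root, model):
    """Return the newest ready/artifacts.json for a model, or None.

    Hypothetical helper. Assumes timestamp folder names sort
    chronologically as plain strings (an assumption, not documented).
    """
    runs_dir = Path(output_root) / "training_runs" / model
    if not runs_dir.is_dir():
        return None
    # Walk run folders from newest to oldest by name and return the
    # first one that actually finished exporting a ready/ folder.
    for run in sorted(runs_dir.iterdir(), reverse=True):
        candidate = run / "ready" / "artifacts.json"
        if candidate.is_file():
            return candidate
    return None
```

Runs that stopped before a ready/ folder was written are skipped, so the result always points at a loadable run if one exists.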
Install the required dependencies using pip:
git clone https://github.com/DrewThomasson/ebook2audiobook.git
cd ebook2audiobook
./ebook2audiobook.command # Mac/Linux (on Windows run ebook2audiobook.cmd); installs ebook2audiobook locally first
conda activate ./python_env # Activate the created python env for E2A
cd tools/Universal_TTS_Finetune # Go into Universal_TTS_Finetune dir
pip install -r requirements.txt # Install additional requirements for E2A SML
Run the application directly with Python:
python web_gui.py --port 7862 --out_path /absolute/path/to/output
To run the application using Docker, simply use docker-compose. This handles installing all system dependencies and setting up GPU support automatically:
docker-compose up --build
The application will be available at http://localhost:7862.
Note: By default, the training commands (train and workflow) will stream live training logs to your console so you can see progress in real time. If you prefer to suppress this output (e.g., when running in a background job), you can pass the --no-stream-logs flag.
List models:
python headless_cli.py list-models
Prepare a dataset from a folder of audio and auto-transcribe with Whisper:
python headless_cli.py prepare-dataset \
--output-root /absolute/path/to/output \
--audio-dir /absolute/path/to/audio \
--language en \
--whisper-model small \
--diarize-speakers
Note: The --diarize-speakers flag is optional. If provided, the pipeline will extract speaker embeddings using a pre-trained PyAnnote ResNet-34 speaker model and cluster them by distinct speakers. You can optionally specify --expected-speakers <count> to cluster into exactly that many speakers, or adjust --diarize-threshold <float> to control auto-detection sensitivity. It will output separate datasets (e.g., dataset/LJSpeech-1.1_Speaker_1/) and default to returning the speaker with the most training data.
Prepare a dataset using an existing transcript file:
python headless_cli.py prepare-dataset \
--output-root /absolute/path/to/output \
--audio-dir /absolute/path/to/audio \
--transcript-file /absolute/path/to/metadata.csv
Prepare a dataset from an audiobook and .vtt alignment file (e.g. from ebook2audiobook outputs):
python headless_cli.py prepare-dataset \
--output-root /absolute/path/to/output \
--audio-file /absolute/path/to/audiobook.mp3 \
--transcript-file /absolute/path/to/alignment.vtt
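To make the timestamp-based slicing more concrete, here is a rough sketch of how WebVTT cues can be read into (start, end, text) tuples using only the standard library. It illustrates the general idea and is not the tool's actual parser; `parse_vtt_cues` and `to_seconds` are hypothetical names, and the audio cutting itself is left out.

```python
import re

# Matches "HH:MM:SS.mmm --> HH:MM:SS.mmm"; the hour part is optional in WebVTT.
CUE_TIMING = re.compile(
    r"((?:\d+:)?\d{2}:\d{2}\.\d{3})\s+-->\s+((?:\d+:)?\d{2}:\d{2}\.\d{3})"
)


def to_seconds(stamp):
    """Convert a WebVTT timestamp such as 00:01:02.500 to seconds."""
    parts = stamp.split(":")
    seconds = float(parts[-1])
    minutes = int(parts[-2])
    hours = int(parts[-3]) if len(parts) == 3 else 0
    return hours * 3600 + minutes * 60 + seconds


def parse_vtt_cues(vtt_text):
    """Return a list of (start_seconds, end_seconds, text) tuples."""
    cues = []
    # Cue blocks are separated by blank lines; the WEBVTT header block
    # has no timing line and is skipped automatically.
    for block in re.split(r"\n\s*\n", vtt_text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = CUE_TIMING.search(line)
            if match:
                # Everything after the timing line is the cue text.
                text = " ".join(l.strip() for l in lines[i + 1:] if l.strip())
                if text:
                    start = to_seconds(match.group(1))
                    end = to_seconds(match.group(2))
                    cues.append((start, end, text))
                break
    return cues
```

Each tuple gives the offsets at which one clip would be cut from the audiobook, with the cue text as its transcript.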
Prepare a dataset from an audiobook and ePUB/text using Forced Alignment (bypasses Whisper transcription entirely):
1. Convert the ePUB to a .txt file (e.g. using Calibre).
2. Run the command with the audio file and the .txt file:
python headless_cli.py prepare-dataset \
--output-root /absolute/path/to/output \
--audio-file /absolute/path/to/chapter1.mp3 \
--transcript-file /absolute/path/to/chapter1_transcript.txt \
--language en
To disable automatic sentence splitting, pass the --no-auto-split-sentences CLI flag (or uncheck "Auto-split sentences for forced alignment" in the Web GUI):
python headless_cli.py prepare-dataset \
--output-root /absolute/path/to/output \
--audio-file /absolute/path/to/chapter1.mp3 \
--transcript-file /absolute/path/to/chapter1_transcript.txt \
--language en \
--no-auto-split-sentences
Prepare a dataset by automatically matching local transcript files in a folder:
If you point the tool to a folder containing multiple audio files (using --audio-dir) and do not supply a global --transcript-file, the system will automatically scan the folder. For each audio file:
- If a matching .vtt file exists (e.g., chapter1.mp3 and chapter1.vtt), it slices using those timestamps.
- If a matching .txt file exists (e.g., chapter2.mp3 and chapter2.txt), it runs Forced Alignment on it first and then slices it.
- If no matching transcript exists (e.g., only chapter3.mp3), it alerts the user and falls back to transcribing with Whisper.

All generated slices from all chapters are merged into the final dataset automatically.
python headless_cli.py prepare-dataset \
--output-root /absolute/path/to/output \
--audio-dir /absolute/path/to/mixed_audio_folder \
--language de
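The per-file matching described above can be previewed before running the command. The sketch below pairs each audio file with a same-stem .vtt or .txt file and reports which path would apply. It is an illustration only: `plan_transcription` is a hypothetical helper, the set of audio extensions is assumed, and preferring .vtt when both transcripts exist is a guess based on the order in which the cases are listed.

```python
from pathlib import Path

# Assumed set of audio extensions, for illustration only.
AUDIO_SUFFIXES = {".mp3", ".wav", ".flac", ".m4a"}


def plan_transcription(audio_dir):
    """Map each audio file name to 'vtt', 'forced_alignment', or 'whisper'."""
    plan = {}
    for audio in sorted(Path(audio_dir).iterdir()):
        if audio.suffix.lower() not in AUDIO_SUFFIXES:
            continue
        if audio.with_suffix(".vtt").is_file():
            # Timestamps are already available: slice directly.
            plan[audio.name] = "vtt"
        elif audio.with_suffix(".txt").is_file():
            # Plain text only: align first, then slice.
            plan[audio.name] = "forced_alignment"
        else:
            # No transcript found: fall back to Whisper.
            plan[audio.name] = "whisper"
    return plan
```

Running `plan_transcription("/absolute/path/to/mixed_audio_folder")` shows in advance which chapters would fall back to Whisper.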
Dry-run a training workspace:
python headless_cli.py train \
--model xtts_v2 \
--output-root /absolute/path/to/output \
--dry-run
Train a model:
python headless_cli.py train \
--model glow_tts \
--output-root /absolute/path/to/output \
--epochs 50 \
--batch-size 16
Run the whole workflow in one command:
python headless_cli.py workflow \
--model xtts_v2 \
--output-root /absolute/path/to/output \
--audio-dir /absolute/path/to/audio \
--language en \
--test-text "This is a quick validation sample."
Test all supported models sequentially on a dataset, saving sample audio and discarding the checkpoints to save space:
python headless_cli.py batch-test \
--output-root /absolute/path/to/output \
--audio-dir /absolute/path/to/audio \
--language en \
--discard-models \
--auto-calculate-epochs \
--diarize-speakers
Note: The --auto-calculate-epochs flag ignores the --epochs argument and dynamically computes the optimal number of epochs for each model family (e.g., targeting 1,500 steps for XTTS and 15,000 steps for Tacotron2) based on the exact size of your provided dataset.
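The arithmetic behind a step target is straightforward to reproduce. The sketch below shows one plausible way to turn a target step count into an epoch count; it assumes one optimizer step per batch and no gradient accumulation, and the tool's actual formula may differ. `epochs_for_target` is a hypothetical helper; the two step targets are the ones quoted in the note above.

```python
import math

# Step targets quoted in this README; other model families are not listed.
TARGET_STEPS = {"xtts": 1500, "tacotron2": 15000}


def epochs_for_target(num_clips, batch_size, target_steps):
    """Epochs needed so that total training steps reach target_steps.

    Assumes one step per batch and no gradient accumulation
    (an assumption, not confirmed by the tool).
    """
    if num_clips <= 0 or batch_size <= 0:
        raise ValueError("num_clips and batch_size must be positive")
    steps_per_epoch = math.ceil(num_clips / batch_size)
    return max(1, math.ceil(target_steps / steps_per_epoch))
```

With 800 clips and a batch size of 16 there are 50 steps per epoch, so under these assumptions a 1,500-step target works out to 30 epochs and a 15,000-step target to 300.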
Generate speech from the newest trained model:
python headless_cli.py synthesize \
--artifacts /absolute/path/to/output \
--model xtts_v2 \
--text "Testing the fine-tuned voice." \
--language en
During long training runs (especially on CPU-only machines), you can stop the training early once you are satisfied with the generated voice quality.
1. If you pass --sample-epoch-interval <N> (e.g. 1 or 5), the trainer writes periodic audio samples inside <run_dir>/epoch_samples/.
2. Press Ctrl+C to terminate the training process. PyTorch Lightning automatically saves intermediate model checkpoints (.ckpt files) at the end of every epoch.
3. Run export_checkpoint.py to manually package your latest checkpoint (converts to ONNX for Piper, or optimizes and copies .pth files for Coqui/XTTS):
python export_checkpoint.py /path/to/your/training_run/<timestamp>
This will create a ready/ directory inside your training run with model.onnx and artifacts.json, making it immediately loadable inside the web GUI or CLI.
Accepted transcript formats:
- vtt: WebVTT alignment file (e.g. from ebook2audiobook)
- json: dictionary or list of objects
- csv
- tsv

The audio key can be an absolute path, file name, or stem. The text field can be named text, transcript, sentence, or utterance.
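As an illustration of these rules, the sketch below resolves an audio key against a folder and picks the text from a row using the accepted field names. Both functions are hypothetical and written for this example; the order in which field names are tried and the handling of ambiguous stems are assumptions, not documented behavior.

```python
from pathlib import Path

# Accepted text field names, in an assumed order of preference.
TEXT_FIELDS = ("text", "transcript", "sentence", "utterance")


def pick_text(row):
    """Return the first non-empty text field in a transcript row, or None."""
    for field in TEXT_FIELDS:
        value = row.get(field)
        if value:
            return value.strip()
    return None


def resolve_audio(key, audio_dir):
    """Resolve an audio key (absolute path, file name, or stem) to a Path."""
    candidate = Path(key)
    if candidate.is_absolute():
        return candidate if candidate.is_file() else None
    by_name = Path(audio_dir) / key
    if by_name.is_file():
        return by_name
    # Fall back to matching on the stem, e.g. "clip_001" -> "clip_001.wav".
    matches = sorted(
        p for p in Path(audio_dir).iterdir() if p.is_file() and p.stem == key
    )
    return matches[0] if matches else None
```

A row such as `{"audio": "clip_001", "sentence": "Hello."}` would then resolve to clip_001.wav in the audio folder with the text "Hello."; the "audio" column name in that row is itself only an example.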
Use the extra_overrides_json field/flag to override recipe values before launch.