docs/source/faq.md
# FAQ

We tried to collect common issues and questions we receive about 🐸TTS. It is worth checking this page before going deeper.
## What makes a good TTS dataset?
- {ref}`See this page <what_makes_a_good_dataset>`.

## How can I train my own `tts` model?
1. Check your dataset with the notebooks in the `dataset_analysis` folder. Use them to find the right audio processing parameters; a better set of parameters results in better audio synthesis.
2. Write your own dataset `formatter` in `datasets/formatters.py`, or format your dataset like one of the supported datasets, such as LJSpeech.
   A `formatter` parses the metadata file and returns a list of training samples.
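For illustration, a minimal custom formatter could look like the sketch below. The `file_id|transcription` metadata layout and the sample fields are hypothetical; check the existing formatters in `datasets/formatters.py` for the exact sample structure your version expects.

```python
import os

# Hypothetical formatter for a "file_id|transcription" metadata layout.
# The sample dict fields below are illustrative; see datasets/formatters.py
# for the fields expected by your 🐸TTS version.
def my_formatter(root_path, meta_file, **kwargs):
    """Parse the metadata file and return a list of training samples."""
    items = []
    with open(os.path.join(root_path, meta_file), encoding="utf-8") as f:
        for line in f:
            file_id, text = line.strip().split("|")
            wav_path = os.path.join(root_path, "wavs", file_id + ".wav")
            items.append({
                "text": text,
                "audio_file": wav_path,
                "speaker_name": "my_speaker",  # single-speaker placeholder
            })
    return items
```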
3. If you have a dataset in an alphabet other than English, you need to set your own character list in the `config.json`.
   Use `TTS/bin/find_unique_chars.py` to list the characters used in your dataset.
4. Write your own text cleaner in `utils.text.cleaners`. This is not always necessary, but it is when you have a different alphabet or language-specific requirements.
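As an illustration, a minimal cleaner could look like the sketch below. The abbreviation table is made up for the example; the real cleaners live in `utils.text.cleaners`.

```python
import re

# Hypothetical abbreviation table, for illustration only;
# see the real cleaners in utils.text.cleaners.
_ABBREVIATIONS = {"dr.": "doctor", "mr.": "mister", "no.": "number"}

def my_cleaner(text):
    """Lowercase, expand abbreviations, and collapse whitespace,
    moving written text toward its spoken form."""
    text = text.lower()
    for abbreviation, expansion in _ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    return re.sub(r"\s+", " ", text).strip()
```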
   - A `cleaner` performs number and abbreviation expansion and text normalization. Basically, it converts the written text to its spoken format.
   - For a minimal start, you can try `basic_cleaners`.
5. Fill in a `config.json`. Go over each parameter one by one and consider it with respect to its documented explanation.
   - Check the `Coqpit` class created for your target model. The Coqpit classes for `tts` models are under `TTS/tts/configs/`.
   - You only need to define the fields you want to change in your `config.json`; for the rest, the default values are used.
   - Here is a sample `config.json` for training a `GlowTTS` network:

   ```json
   {
       "model": "glow_tts",
       "batch_size": 32,
       "eval_batch_size": 16,
       "num_loader_workers": 4,
       "num_eval_loader_workers": 4,
       "run_eval": true,
       "test_delay_epochs": -1,
       "epochs": 1000,
       "text_cleaner": "english_cleaners",
       "use_phonemes": false,
       "phoneme_language": "en-us",
       "phoneme_cache_path": "phoneme_cache",
       "print_step": 25,
       "print_eval": true,
       "mixed_precision": false,
       "output_path": "recipes/ljspeech/glow_tts/",
       "test_sentences": ["Test this sentence.", "This test sentence.", "Sentence this test."],
       "datasets": [{"formatter": "ljspeech", "meta_file_train": "metadata.csv", "path": "recipes/ljspeech/LJSpeech-1.1/"}]
   }
   ```
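To illustrate how the defaults work: the config classes are dataclass-based, so any field you leave out of your `config.json` keeps its default value. Below is a toy stand-in, not the real `GlowTTSConfig` class.

```python
from dataclasses import dataclass

# Toy stand-in for a Coqpit config class, for illustration only.
@dataclass
class ToyGlowTTSConfig:
    model: str = "glow_tts"
    batch_size: int = 32
    epochs: int = 1000
    mixed_precision: bool = False

# Override only what you need; every other field keeps its default.
cfg = ToyGlowTTSConfig(batch_size=16, mixed_precision=True)
```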
6. Train your model.
   - Single-GPU training: `CUDA_VISIBLE_DEVICES="0" python train_tts.py --config_path config.json`
   - Multi-GPU training: `python3 -m trainer.distribute --gpus "0,1" --script TTS/bin/train_tts.py --config_path config.json`

**Note:** You can also train your model using pure 🐍 Python. Check {ref}`tutorial_for_nervous_beginners`.
## How can I check model performance?
You can inspect model training and performance using `tensorboard`. It shows you the loss curves, attention alignments, and model outputs. You can also check the attention behaviour of a trained checkpoint with the `TestAttention` notebook.

## How do I know when to stop training?
There is no single objective metric to decide the end of a training run, since voice quality is a subjective matter. In our model trainings, we pick the checkpoints whose test-time outputs and attention maps look the best. Keep in mind that this approach only validates model robustness. It is hard to estimate the voice quality without asking actual people. The best approach is to pick a set of promising models and run a Mean Opinion Score study, asking real listeners to score them.
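A Mean Opinion Score is simply the average of listener ratings (typically on a 1-5 scale) per model. A toy aggregation, with made-up checkpoint names and ratings:

```python
# Toy MOS aggregation over made-up listener ratings (1-5 scale).
def mean_opinion_score(ratings):
    return sum(ratings) / len(ratings)

ratings = {
    "checkpoint_a": [4, 5, 4, 3],
    "checkpoint_b": [3, 3, 4, 2],
}
mos = {name: mean_opinion_score(r) for name, r in ratings.items()}
best = max(mos, key=mos.get)  # the checkpoint with the highest MOS
```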
## How can I test a trained model?
- The best way is to use the `tts` or `tts-server` commands. For details, check {ref}`here <synthesizing_speech>`.
- If you need to code your own inference pipeline, use the `TTS.utils.synthesizer.Synthesizer` class.

## My model does not stop generating speech. What is wrong?
- This behaviour usually relates to the `stopnet`, the part of the model that tells the decoder when to stop.
- A poor `stopnet` often points to something else that is broken in your model or dataset, especially the attention module.
- One common cause is silence at the beginning and end of the audio clips. Check the `trim_db` value in the config; you can find a better value for your dataset using the `CheckSpectrogram` notebook. If the value is too small, too much of the audio is trimmed; if it is too big, too much silence remains. Both hurt the `stopnet` performance.
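To build intuition for what `trim_db` controls, here is a toy trimmer (not the actual 🐸TTS implementation): frames whose level is more than `trim_db` decibels below the clip's peak are treated as silence and cut from both ends. A small `trim_db` cuts aggressively; a large one keeps more silence.

```python
import math

def trim_silence(samples, trim_db=45.0, frame_size=4):
    """Toy silence trimmer: drop leading/trailing frames whose RMS level
    is more than `trim_db` dB below the clip's peak amplitude."""
    peak = max(abs(s) for s in samples) or 1.0

    def frame_db(chunk):
        rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
        return 20 * math.log10(max(rms / peak, 1e-10))

    frames = [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]
    loud = [frame_db(c) > -trim_db for c in frames]
    if not any(loud):
        return []  # the whole clip is below the threshold
    start = loud.index(True)
    end = len(loud) - loud[::-1].index(True)
    return [s for c in frames[start:end] for s in c]
```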