Common Voice

Common Voice is a public domain speech corpus with 11.2K hours of read speech in 76 languages as of version 7.0, the latest release at the time of writing. We provide examples for building Transformer models on this dataset.

Data preparation

Download and unpack Common Voice v4 to a path ${DATA_ROOT}/${LANG_ID}. Create splits and generate audio manifests with

```bash
python -m examples.speech_synthesis.preprocessing.get_common_voice_audio_manifest \
  --data-root ${DATA_ROOT} \
  --lang ${LANG_ID} \
  --output-manifest-root ${AUDIO_MANIFEST_ROOT} --convert-to-wav
```
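
As a quick sanity check after this step, the sketch below counts the rows in each split manifest. It assumes the manifests are tab-separated files with a header row, named `${SPLIT}.audio.tsv` as in the denoising command further down. The helper name `summarize_manifests` is hypothetical and not part of the toolkit, and no particular column names are assumed.

```python
import csv
from pathlib import Path

# Split names taken from the denoising loop in this document.
SPLITS = ("train", "dev", "test")


def summarize_manifests(manifest_root):
    """Return {split: (column_names, row_count)} for each manifest found.

    Hypothetical helper: assumes <manifest_root>/<split>.audio.tsv is a
    tab-separated file whose first line is a header.
    """
    summary = {}
    for split in SPLITS:
        path = Path(manifest_root) / f"{split}.audio.tsv"
        if not path.is_file():
            continue  # split missing, or a different naming scheme is in use
        with path.open(newline="", encoding="utf-8") as f:
            # QUOTE_NONE: transcripts may contain quote characters.
            reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
            header = next(reader, [])
            n_rows = sum(1 for _ in reader)
        summary[split] = (header, n_rows)
    return summary
```

Calling `summarize_manifests` on the directory passed as `--output-manifest-root` should list every split that was written. An empty result suggests the manifests were saved elsewhere or under different names.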

Then, extract log-Mel spectrograms, generate the feature manifest, and create the data configuration YAML with

```bash
python -m examples.speech_synthesis.preprocessing.get_feature_manifest \
  --audio-manifest-root ${AUDIO_MANIFEST_ROOT} \
  --output-root ${FEATURE_MANIFEST_ROOT} \
  --ipa-vocab --lang ${LANG_ID}
```

where we use phoneme inputs (--ipa-vocab) as an example.
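
For readers unfamiliar with log-Mel features, the following is a simplified, pure-Python illustration of how one frame of audio is turned into log-Mel energies: window, power spectrum, triangular mel filters, then a logarithm. It is not the preprocessing script's implementation. The mel formula (the common `2595 * log10(1 + f / 700)` variant), the Hann window, and all function names and parameter values are assumptions chosen for illustration.

```python
import cmath
import math


def hz_to_mel(hz):
    # One common mel-scale formula; other variants exist.
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    mel_max = hz_to_mel(sample_rate / 2)
    # n_mels + 2 edge frequencies in Hz, from 0 Hz up to Nyquist.
    edges = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (mid - lo)
            falling = (hi - f) / (hi - mid)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def log_mel_frame(frame, bank, eps=1e-10):
    """Log-Mel energies of one frame (naive DFT, for illustration only)."""
    n_fft = len(frame)
    windowed = [
        x * (0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft))  # Hann window
        for n, x in enumerate(frame)
    ]
    power = []
    for k in range(n_fft // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * n / n_fft)
            for n, x in enumerate(windowed)
        )
        power.append(abs(acc) ** 2)
    # eps keeps the logarithm finite for silent frames.
    return [math.log(sum(w * p for w, p in zip(row, power)) + eps) for row in bank]
```

A real pipeline would use an FFT library and slide the window across the whole utterance. The naive DFT here only keeps the sketch dependency-free.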

To denoise audio and trim leading/trailing silence using signal-processing-based voice activity detection (VAD), run

```bash
for SPLIT in dev test train; do
    python -m examples.speech_synthesis.preprocessing.denoise_and_vad_audio \
      --audio-manifest ${AUDIO_MANIFEST_ROOT}/${SPLIT}.audio.tsv \
      --output-dir ${PROCESSED_DATA_ROOT} \
      --denoise --vad --vad-agg-level 2
done
```
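
To give an idea of what trimming leading and trailing silence involves, here is a minimal energy-threshold sketch. It is a stand-in technique for illustration, not the VAD used by the script above. The function name `trim_silence`, the frame length, and the threshold are all assumptions, and samples are assumed to be floats in the range [-1, 1].

```python
import math


def trim_silence(samples, frame_len=160, threshold_db=-40.0):
    """Drop leading/trailing frames whose RMS level is below threshold_db.

    Illustrative energy-based trimming only. Samples past the last full
    frame are discarded.
    """
    n_frames = len(samples) // frame_len

    def is_active(i):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        # Clamp to avoid log10(0) on digital silence.
        return 20.0 * math.log10(max(rms, 1e-10)) > threshold_db

    active = [i for i in range(n_frames) if is_active(i)]
    if not active:
        return []  # nothing above the threshold
    start = active[0] * frame_len
    end = (active[-1] + 1) * frame_len
    return samples[start:end]
```

Pauses inside the utterance are kept, since only the first and last active frames determine the cut points.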

Training

(Please refer to the LJSpeech example.)

Inference

(Please refer to the LJSpeech example.)

Automatic Evaluation

(Please refer to the LJSpeech example.)

Results

| Language | Speakers | --arch | Params | Test MCD | Model |
|---|---|---|---|---|---|
| English | 200 | tts_transformer | 54M | 3.8 | Download |
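
As background for the Test MCD column (mel-cepstral distortion, in dB, lower is better), the sketch below computes the standard per-frame MCD formula and averages it over frames. It assumes the reference and synthesized frames are already time-aligned and that the caller has dropped the 0th (energy) coefficient if desired. The function name is hypothetical, and the actual evaluation script may differ in alignment and feature extraction.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB over time-aligned frames of mel-cepstral coefficients.

    Per frame: (10 / ln 10) * sqrt(2 * sum_d (ref_d - syn_d)^2).
    """
    if len(ref_frames) != len(syn_frames) or not ref_frames:
        raise ValueError("expected two non-empty, equally long frame sequences")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        squared_diff = sum((r - s) ** 2 for r, s in zip(ref, syn))
        total += scale * math.sqrt(2.0 * squared_diff)
    return total / len(ref_frames)
```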
