Multilingual LibriSpeech (MLS)

The Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and covers 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, and Polish. It is available at OpenSLR.

This directory contains pretrained monolingual models and the steps to reproduce the results. All models were trained on 32GB Nvidia V100 GPUs. We used a total of 64 GPUs to train the English, German, Dutch, Spanish, and French models, and 16 GPUs to train the Italian, Portuguese, and Polish models.

Dependencies

Tokens and Lexicons

| Language | Token Set | Train Lexicon | Joint Lexicon (Train + GB) |
| --- | --- | --- | --- |
| English | tokens.txt | train_lexicon.txt | joint_lexicon.txt |
| German | tokens.txt | train_lexicon.txt | joint_lexicon.txt |
| Dutch | tokens.txt | train_lexicon.txt | joint_lexicon.txt |
| French | tokens.txt | train_lexicon.txt | joint_lexicon.txt |
| Spanish | tokens.txt | train_lexicon.txt | joint_lexicon.txt |
| Italian | tokens.txt | train_lexicon.txt | joint_lexicon.txt |
| Portuguese | tokens.txt | train_lexicon.txt | joint_lexicon.txt |
| Polish | tokens.txt | train_lexicon.txt | joint_lexicon.txt |

Pre-trained acoustic models

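| Language | Architecture | Acoustic Model |
| --- | --- | --- |
| English | arch.txt | am.bin |
| German | arch.txt | am.bin |
| Dutch | arch.txt | am.bin |
| French | arch.txt | am.bin |
| Spanish | arch.txt | am.bin |
| Italian | arch.txt | am.bin |
| Portuguese | arch.txt | am.bin |
| Polish | arch.txt | am.bin |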

Pre-trained language models

The 5-gram_lm.arpa from each tarball should be used to decode the corresponding acoustic model. For faster loading, you can convert the ARPA files into binary format by following the steps here.

| Language | Language Model |
| --- | --- |
| English | mls_lm_english.tar.gz |
| German | mls_lm_german.tar.gz |
| Dutch | mls_lm_dutch.tar.gz |
| French | mls_lm_french.tar.gz |
| Spanish | mls_lm_spanish.tar.gz |
| Italian | mls_lm_italian.tar.gz |
| Portuguese | mls_lm_portuguese.tar.gz |
| Polish | mls_lm_polish.tar.gz |
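
The ARPA-to-binary conversion mentioned in this section can be sketched as follows. This is a minimal sketch, assuming KenLM has been built and its `build_binary` tool is on `PATH`. The function name `convert_lm`, its output directory argument, and the idea of locating `5-gram_lm.arpa` by searching the unpacked archive are illustrative assumptions, not part of the recipe.

```shell
# Sketch: unpack an MLS language-model tarball and convert its ARPA file
# to KenLM binary format. Assumes KenLM's build_binary is on PATH.
# convert_lm and its arguments are hypothetical names for illustration.
convert_lm() {
  tarball="$1"   # e.g. mls_lm_english.tar.gz
  outdir="$2"    # any writable directory (assumption)
  mkdir -p "$outdir" || return 1
  tar -xzf "$tarball" -C "$outdir" || return 1
  # The archive layout is an assumption, so search for the file by name.
  arpa="$(find "$outdir" -name '5-gram_lm.arpa' | head -n 1)"
  if [ -z "$arpa" ]; then
    echo "5-gram_lm.arpa not found in $tarball" >&2
    return 1
  fi
  # KenLM usage: build_binary <input.arpa> <output.bin>
  build_binary "$arpa" "${arpa%.arpa}.bin" || return 1
  # Print the path of the binary model so callers can capture it.
  echo "${arpa%.arpa}.bin"
}
```

For example, `convert_lm mls_lm_english.tar.gz lm/english` would unpack the English tarball under `lm/english` and print the path of the resulting binary file.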

Usage

Preparing the dataset

Follow the steps here to download and prepare the dataset for a given language.

Training

[...]/flashlight/build/bin/asr/fl_asr_train train --flagsfile=train/[lang].cfg --minloglevel=0 --logtostderr=1
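
To make the `[lang]` placeholder concrete, the training command above can be wrapped in a small helper. This is a sketch under stated assumptions: `FL_BIN` is a hypothetical variable standing in for the elided `[...]/flashlight/build/bin/asr` directory, `train_cmd` is an illustrative name, and the lowercase config file names (for example `train/german.cfg`) are assumed rather than confirmed here. The helper only prints the command so it can be inspected before running.

```shell
# Sketch: print the training command for one MLS language.
# FL_BIN is a hypothetical variable for the directory holding fl_asr_train;
# lowercase config file names under train/ are an assumption.
train_cmd() {
  lang="$1"
  # Accept only the eight languages listed in this README.
  case "$lang" in
    english|german|dutch|french|spanish|italian|portuguese|polish) ;;
    *) echo "unsupported language: $lang" >&2; return 1 ;;
  esac
  echo "${FL_BIN:?set FL_BIN first}/fl_asr_train train" \
       "--flagsfile=train/${lang}.cfg --minloglevel=0 --logtostderr=1"
}
```

With `FL_BIN` set, `train_cmd german` prints the full command line for the German model, which can then be run from the recipe directory.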

Decoding

Viterbi

[...]/flashlight/build/bin/asr/fl_asr_test --am=[...]/am.bin --lexicon=[...]/train_lexicon.txt --datadir=[...] --test=test.lst --tokens=[...]/tokens.txt --emission_dir='' --nouselexicon --show
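
The Viterbi command has several elided paths, so it may help to see how they could map onto downloaded files. The sketch below assumes that `am.bin`, `tokens.txt`, and `train_lexicon.txt` for one language sit in a single directory; that layout, the `FL_BIN` variable, and the `viterbi_cmd` name are all hypothetical. As written it only prints the command.

```shell
# Sketch: print the Viterbi decoding command for one language.
# FL_BIN and both arguments are hypothetical stand-ins for the elided
# [...] paths; keeping the three model files together is an assumption.
viterbi_cmd() {
  model_dir="$1"  # holds am.bin, tokens.txt, train_lexicon.txt (assumed)
  data_dir="$2"   # directory containing test.lst
  echo "${FL_BIN:?set FL_BIN first}/fl_asr_test" \
       "--am=${model_dir}/am.bin" \
       "--lexicon=${model_dir}/train_lexicon.txt" \
       "--datadir=${data_dir} --test=test.lst" \
       "--tokens=${model_dir}/tokens.txt" \
       "--emission_dir='' --nouselexicon --show"
}
```

The flags are copied unchanged from the command above; only the paths are parameterized.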

Beam search with language model

[...]/flashlight/build/bin/asr/fl_asr_decode --flagsfile=decode/[lang].cfg

Citation

@article{Pratap2020MLSAL,
  title={MLS: A Large-Scale Multilingual Dataset for Speech Research},
  author={Vineel Pratap and Qiantong Xu and Anuroop Sriram and Gabriel Synnaeve and Ronan Collobert},
  journal={ArXiv},
  year={2020},
  volume={abs/2012.03411}
}

NOTE: We made a few updates to the MLS dataset after our INTERSPEECH paper was submitted, adding more hours of audio and improving the quality of the transcripts. To avoid the confusion of having multiple versions, we are making ONLY one release, with all the improvements included. For accurate dataset statistics and baselines, please refer to the arXiv paper above.