
# RASR release

This repository shares pre-trained acoustic models and language models for our paper [Rethinking Evaluation in ASR: Are Our Models Robust Enough?](https://arxiv.org/abs/2010.11745).

## Dependencies

## Models

### Acoustic Model

All acoustic models are retrained using Flashlight, into which wav2letter++ has been consolidated. TED-LIUM is not used as training data here due to licensing issues. All training data uses the more standard 16kHz sample rate rather than the 8kHz used in the paper.

Here, we are releasing models with different architectures and sizes. Note that the models may not fully reproduce the results in the paper because of both data and toolkit implementation discrepancies.

| Architecture | # Params | Arch File | Path |
|---|---|---|---|
| Transformer | 300 mil | am_transformer_ctc_stride3_letters_300Mparams.arch | am_transformer_ctc_stride3_letters_300Mparams.bin |
| Transformer | 70 mil | am_transformer_ctc_stride3_letters_70Mparams.arch | am_transformer_ctc_stride3_letters_70Mparams.bin |
| Conformer | 300 mil | am_conformer_ctc_stride3_letters_300Mparams.arch | am_conformer_ctc_stride3_letters_300Mparams.bin |
| Conformer | 87 mil | am_conformer_ctc_stride3_letters_87Mparams.arch | am_conformer_ctc_stride3_letters_87Mparams.bin |
| Conformer | 28 mil | am_conformer_ctc_stride3_letters_25Mparams.arch | am_conformer_ctc_stride3_letters_25Mparams.bin |
| Conformer (distillation) | 28 mil | am_conformer_ctc_stride3_letters_25Mparams_distill.arch | am_conformer_ctc_stride3_letters_25Mparams_distill.bin |

### Language Model

The language models are trained on the Common Crawl corpus, as mentioned in the paper. We provide 4-gram LMs with different pruning settings, each restricted to the top 200k words. All LMs are trained with the KenLM toolkit.

| Pruning Param | Size (GB) | Path | Arpa Path |
|---|---|---|---|
| 0 0 5 5 | 8.4 | large | - |
| 0 6 15 15 | 2.5 | small | small |
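
For context, this is roughly how such an LM can be built with KenLM. This is a sketch only, not the exact commands used for this release; `common_crawl.txt` and `top200k_words.txt` are placeholder file names:

```bash
# Sketch: build a pruned 4-gram LM with KenLM (placeholder file names).
# --prune takes one threshold per n-gram order (cf. the table above);
# --limit_vocab_file restricts counting to a top-200k word list.
lmplz -o 4 --prune 0 6 15 15 --limit_vocab_file top200k_words.txt \
    < common_crawl.txt > lm_common_crawl_small.arpa

# Convert the ARPA file to KenLM's binary format for faster loading.
build_binary lm_common_crawl_small.arpa lm_common_crawl_small.bin
```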

The perplexities of the LMs on different development sets are listed below.

| LM | nov93dev | TL-dev | CV-dev | LS-dev-clean | LS-dev-other | RT03 |
|---|---|---|---|---|---|---|
| Large | 313 | 158 | 243 | 303 | 304 | 227 |
| Small | 331 | 178 | 262 | 330 | 325 | 226 |

## WER

Here we summarize the decoding WER for all released models. All numbers in the table are in the format Viterbi WER → beam-search WER (small beam / large beam).

| Architecture | # Params | nov92 | TL-test | CV-test | LS-test-clean | LS-test-other | Hub05-SWB | Hub05-CH |
|---|---|---|---|---|---|---|---|---|
| Transformer | 300 mil | 3.4 → 2.9/2.9 | 7.6 → 5.5/5.4 | 15.5 → 11.6/11.2 | 3.0 → 3.2/3.2 | 7.2 → 6.4/6.4 | 6.8 → 6.2/6.2 | 11.6 → 10.8/10.7 |
| Transformer | 70 mil | 4.5 → 3.7/3.5 | 9.4 → 6.2/6.1 | 19.8 → 13.8/13.0 | 4.0 → 3.6/3.6 | 9.7 → 7.7/7.5 | 7.5 → 6.6/6.5 | 13.0 → 11.8/11.7 |
| Conformer | 300 mil | 3.5 → 3.3/3.3 | 8.4 → 6.2/6.0 | 17.0 → 12.7/12.0 | 3.2 → 3.4/3.4 | 8.0 → 7.0/6.8 | 7.0 → 6.4/6.5 | 11.9 → 10.7/10.5 |
| Conformer | 87 mil | 4.3 → 3.3/3.3 | 8.7 → 6.1/5.9 | 18.2 → 13.1/12.4 | 3.7 → 3.5/3.5 | 8.6 → 7.4/7.2 | 7.3 → 6.7/6.7 | 12.2 → 11.7/11.5 |
| Conformer | 28 mil | 5.0 → 3.9/3.8 | 10.5 → 6.9/6.6 | 22.2 → 15.4/14.4 | 4.7 → 4.0/3.9 | 11.1 → 8.9/8.6 | 8.8 → 7.8/7.7 | 13.7 → 12.4/12.2 |
| Conformer (distillation) | 28 mil | 4.7 → 3.9/3.8 | 9.4 → 6.5/6.4 | 19.6 → 14.6/13.8 | 4.1 → 3.8/3.8 | 9.9 → 8.4/8.2 | 7.6 → 6.9/6.8 | 13.0 → 12.2/12.0 |

Decoding is done with a lexicon-based beam-search decoder using the 200k Common Crawl lexicon and the small Common Crawl LM, with the following parameters:

| Architecture | # Params | LM Weight | Word Score | Beam Size (small/large) |
|---|---|---|---|---|
| Transformer | 300 mil | 1.5 | 0 | 50/500 |
| Transformer | 70 mil | 1.7 | 0 | 50/500 |
| Conformer | 300 mil | 1.8 | 2 | 50/500 |
| Conformer | 87 mil | 2 | 0 | 50/500 |
| Conformer | 28 mil | 2 | 0 | 50/500 |
| Conformer (distillation) | 28 mil | 1.4 | 0.4 | 50/500 |
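
These settings plug into Flashlight's beam-search decoder. A hypothetical invocation is shown below; the flag names assume Flashlight's `fl_asr_decode` tool, all paths are placeholders, and the exact interface may differ across versions:

```bash
# Sketch: lexicon-based beam-search decoding with Flashlight (assumed flags,
# placeholder paths). LM weight, word score, and beam size come from the
# table above (Transformer 300 mil row, small beam).
fl_asr_decode \
    --am=am_transformer_ctc_stride3_letters_300Mparams.bin \
    --lexicon=lexicon_common_crawl_200k.txt \
    --lm=lm_common_crawl_small.bin \
    --lmtype=kenlm \
    --decodertype=wrd \
    --lmweight=1.5 \
    --wordscore=0 \
    --beamsize=50 \
    --test=test.lst
```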

## Tutorial

To load the serialized models and interact with them, please refer to the Flashlight ASR app tutorials; a rough example is sketched below.
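
As an illustration, running interactive inference with one of the released models might look like the following. This is a sketch assuming the tutorial's `fl_asr_tutorial_inference_ctc` binary; the token and lexicon file names are placeholders:

```bash
# Sketch following the Flashlight ASR tutorial (assumed binary and flags,
# placeholder file names). The released models expect 16kHz audio.
fl_asr_tutorial_inference_ctc \
    --am_path=am_transformer_ctc_stride3_letters_70Mparams.bin \
    --tokens_path=tokens.txt \
    --lexicon_path=lexicon.txt \
    --lm_path=lm_common_crawl_small.bin \
    --sample_rate=16000 \
    --beam_size=50 \
    --lm_weight=1.7 \
    --word_score=0
```

See the tutorial itself for the exact flags and for how audio is fed to the binary.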

## Citation

```
@article{likhomanenko2020rethinking,
  title={Rethinking Evaluation in ASR: Are Our Models Robust Enough?},
  author={Likhomanenko, Tatiana and Xu, Qiantong and Pratap, Vineel and Tomasello, Paden and Kahn, Jacob and Avidov, Gilad and Collobert, Ronan and Synnaeve, Gabriel},
  journal={arXiv preprint arXiv:2010.11745},
  year={2020}
}
```