# recipes/sota/2019
In the paper we consider the following steps:
Prepare the run data and auxiliary files (lexicon, token set, etc.). Set the necessary paths instead of `[...]`: `--data_dst` — the path where the data will be stored, `--model_dst` — the path where the auxiliary files will be stored.

```
pip install sentencepiece==0.1.82
python3 ../../utilities/prepare_librispeech_wp_and_official_lexicon.py --data_dst [...] --model_dst [...] --nbest 10 --wp 10000
```
Besides the data itself, the following auxiliary files for acoustic and language model training/evaluation will be generated:
```
cd $MODEL_DST
tree -L 2
.
├── am
│   ├── librispeech-train-all-unigram-10000.model
│   ├── librispeech-train-all-unigram-10000.tokens
│   ├── librispeech-train-all-unigram-10000.vocab
│   ├── librispeech-train+dev-unigram-10000-nbest10.lexicon
│   ├── librispeech-train-unigram-10000-nbest10.lexicon
│   └── train.txt
└── decoder
    ├── 4-gram.arpa
    ├── 4-gram.arpa.lower
    └── decoder-unigram-10000-nbest10.lexicon
```
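Each lexicon file maps a word to one of its word-piece segmentations; with `--nbest 10`, a word can appear on up to ten lines, one candidate segmentation per line. A minimal loader sketch, assuming the standard wav2letter lexicon layout (word followed by whitespace-separated tokens); the sample entries below are made up, and real files use the sentencepiece `▁` word-boundary marker rather than `_`:

```python
from collections import defaultdict

def load_lexicon(lines):
    """Parse wav2letter-style lexicon lines: a word, then its word-piece tokens.

    A word may occur on several lines (n-best segmentations), so each word
    maps to a list of token sequences, kept in file order.
    """
    lexicon = defaultdict(list)
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word, tokens = parts[0], parts[1:]
        lexicon[word].append(tokens)
    return dict(lexicon)

# Hypothetical sample entries (real files use the 10k-unigram pieces).
sample = [
    "hello _he llo",
    "hello _hell o",
    "world _world",
]
print(load_lexicon(sample)["hello"])  # → [['_he', 'llo'], ['_hell', 'o']]
```

The n-best segmentations are what allow the beam-search decoder to consider several word-piece spellings of the same word.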
The following directories are also prepared:

- `librispeech` directory
- `lm_corpus_and_PL_generation` (the raw LM corpus, which has no intersection with the LibriVox data, is prepared in `raw_lm_corpus`)
- `librivox` directory
- `lm` directory
- `rescoring` directory
- `lm_analysis` directory

| Lexicon | Tokens | Beam-search lexicon | WP tokenizer model |
|---|---|---|---|
| Lexicon | Tokens | Beam-search lexicon | WP tokenizer model |
The tokens and lexicon files generated in `$MODEL_DST/am/` and `$MODEL_DST/decoder/` are identical to the files in the table above.
Below is information about the pre-trained acoustic models, which can be used, for example, to reproduce the decoding step.
| Dataset | Acoustic model dev-clean | Acoustic model dev-other |
|---|---|---|
| LibriSpeech | Resnet CTC clean | Resnet CTC other |
| LibriSpeech + LibriVox | Resnet CTC clean | Resnet CTC other |
| LibriSpeech | TDS CTC clean | TDS CTC other |
| LibriSpeech + LibriVox | TDS CTC clean | TDS CTC other |
| LibriSpeech | Transformer CTC clean | Transformer CTC other |
| LibriSpeech + LibriVox | Transformer CTC clean | Transformer CTC other |
| LibriSpeech | Resnet S2S clean | Resnet S2S other |
| LibriSpeech + LibriVox | Resnet S2S clean | Resnet S2S other |
| LibriSpeech | TDS Seq2Seq clean | TDS Seq2Seq other |
| LibriSpeech + LibriVox | TDS Seq2Seq clean | TDS Seq2Seq other |
| LibriSpeech | Transformer Seq2Seq clean | Transformer Seq2Seq other |
| LibriSpeech + LibriVox | Transformer Seq2Seq clean | Transformer Seq2Seq other |
| LM type | Language model | Vocabulary | Architecture | Fairseq LM | Fairseq dict |
|---|---|---|---|---|---|
| ngram | word 4-gram | - | - | - | - |
| ngram | wp 6-gram | - | - | - | - |
| GCNN | word GCNN | vocabulary | Archfile | fairseq | fairseq dict |
| GCNN | wp GCNN | vocabulary | Archfile | fairseq | fairseq dict |
| Transformer | - | - | - | fairseq | fairseq dict |
To reproduce the decoding step from the paper, download these models into `$MODEL_DST/am/` and `$MODEL_DST/decoder/`, respectively.
One can use the prepared corpus to train an LM with which to generate pseudo-labels (PL) on the LibriVox data: the raw corpus, the normalized corpus, and a 4-gram LM with a 200k vocabulary.
We have also open-sourced the generated pseudo-labels on which we trained our models: pl and pl with overlap. **Make sure to fix the prefixes of the file names in the lists; right now they are set to `/root/librivox`.**
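Fixing the prefixes can be done line by line. A minimal sketch, assuming the standard wav2letter list format (`id path duration transcription`, whitespace-separated); the replacement prefix `/data/librivox` and the sample line are hypothetical:

```python
def fix_prefix(line, old="/root/librivox", new="/data/librivox"):
    """Replace the audio-path prefix (second column) of one list-file line."""
    parts = line.rstrip("\n").split(None, 3)  # id, path, duration, transcription
    if len(parts) >= 2 and parts[1].startswith(old):
        parts[1] = new + parts[1][len(old):]
    return " ".join(parts)

# Hypothetical line in wav2letter list format.
line = "utt1 /root/librivox/audio/utt1.flac 12.3 the transcription text"
print(fix_prefix(line))
# → utt1 /data/librivox/audio/utt1.flac 12.3 the transcription text
```

Apply this to every line of each downloaded list before pointing training at it, substituting your own storage path for the `new` prefix.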
```
@article{synnaeve2019end,
  title={End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures},
  author={Synnaeve, Gabriel and Xu, Qiantong and Kahn, Jacob and Grave, Edouard and Likhomanenko, Tatiana and Pratap, Vineel and Sriram, Anuroop and Liptchinsky, Vitaliy and Collobert, Ronan},
  journal={arXiv preprint arXiv:1911.08460},
  year={2019}
}
```