# recipes/sota/2019
In the paper we consider the following steps:
Prepare the run data and auxiliary files (lexicon, token set, etc.). Set the necessary paths instead of `[...]`: `--data_dst` — the path where the data will be stored, `--model_dst` — the path where the auxiliary files will be stored.

```
pip install sentencepiece==0.1.82
python3 ../../utilities/prepare_librispeech_wp_and_official_lexicon.py --data_dst [...] --model_dst [...] --nbest 10 --wp 10000
```
Besides the data itself, the following auxiliary files for acoustic and language model training/evaluation will be generated:
```
cd $MODEL_DST
tree -L 2
.
├── am
│   ├── librispeech-train-all-unigram-10000.model
│   ├── librispeech-train-all-unigram-10000.tokens
│   ├── librispeech-train-all-unigram-10000.vocab
│   ├── librispeech-train+dev-unigram-10000-nbest10.lexicon
│   ├── librispeech-train-unigram-10000-nbest10.lexicon
│   └── train.txt
└── decoder
    ├── 4-gram.arpa
    ├── 4-gram.arpa.lower
    └── decoder-unigram-10000-nbest10.lexicon
```
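Each lexicon file maps a word to one of its word-piece segmentations; with `--nbest 10`, a word can appear on up to ten lines, one candidate segmentation per line. A minimal loader sketch, assuming the standard wav2letter lexicon layout (word followed by whitespace-separated tokens); the sample entries below are made up, and real files use the sentencepiece `▁` word-boundary marker rather than `_`:

```python
from collections import defaultdict

def load_lexicon(lines):
    """Parse wav2letter-style lexicon lines: a word, then its word-piece tokens.

    A word may occur on several lines (n-best segmentations), so each word
    maps to a list of token sequences, kept in file order.
    """
    lexicon = defaultdict(list)
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word, tokens = parts[0], parts[1:]
        lexicon[word].append(tokens)
    return dict(lexicon)

# Hypothetical sample entries (real files use the 10k-unigram pieces).
sample = [
    "hello _he llo",
    "hello _hell o",
    "world _world",
]
print(load_lexicon(sample)["hello"])  # → [['_he', 'llo'], ['_hell', 'o']]
```

The n-best segmentations are what allow the beam-search decoder to consider several word-piece spellings of the same word.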
The following directories are also prepared:

- `librispeech` directory
- `lm_corpus_and_PL_generation` (the raw LM corpus, which has no intersection with the LibriVox data, is prepared in `raw_lm_corpus`)
- `librivox` directory
- `lm` directory
- `rescoring` directory
- `lm_analysis` directory

| Lexicon | Tokens | Beam-search lexicon | WP tokenizer model |
|---|---|---|---|
| Lexicon | Tokens | Beam-search lexicon | WP tokenizer model |
The tokens and lexicon files generated in `$MODEL_DST/am/` and `$MODEL_DST/decoder/` are identical to the files in the table above.
Below is information about the pre-trained acoustic models, which can be used, for example, to reproduce the decoding step.
| Dataset | Acoustic model dev-clean | Acoustic model dev-other |
|---|---|---|
| LibriSpeech | Resnet CTC clean | Resnet CTC other |
| LibriSpeech + LibriVox | Resnet CTC clean | Resnet CTC other |
| LibriSpeech | TDS CTC clean | TDS CTC other |
| LibriSpeech + LibriVox | TDS CTC clean | TDS CTC other |
| LibriSpeech | Transformer CTC clean | Transformer CTC other |
| LibriSpeech + LibriVox | Transformer CTC clean | Transformer CTC other |
| LibriSpeech | Resnet S2S clean | Resnet S2S other |
| LibriSpeech + LibriVox | Resnet S2S clean | Resnet S2S other |
| LibriSpeech | TDS Seq2Seq clean | TDS Seq2Seq other |
| LibriSpeech + LibriVox | TDS Seq2Seq clean | TDS Seq2Seq other |
| LibriSpeech | Transformer Seq2Seq clean | Transformer Seq2Seq other |
| LibriSpeech + LibriVox | Transformer Seq2Seq clean | Transformer Seq2Seq other |
| LM type | Language model | Vocabulary | Architecture | Fairseq LM | Fairseq dict |
|---|---|---|---|---|---|
| ngram | word 4-gram | - | - | - | - |
| ngram | wp 6-gram | - | - | - | - |
| GCNN | word GCNN | vocabulary | Archfile | fairseq | fairseq dict |
| GCNN | wp GCNN | vocabulary | Archfile | fairseq | fairseq dict |
| Transformer | - | - | - | fairseq | fairseq dict |
To reproduce the decoding step from the paper, download these models into `$MODEL_DST/am/` and `$MODEL_DST/decoder/`, respectively.
One can use the prepared corpus to train an LM with which to generate pseudo-labels (PL) on the LibriVox data: the raw corpus, the normalized corpus, and a 4-gram LM with a 200k vocabulary.
We have also open-sourced the generated pseudo-labels on which we trained our models: pl and pl with overlap. **Make sure to fix the prefixes of the file names in the lists; right now they are set to `/root/librivox`.**
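Fixing the prefixes can be done line by line. A minimal sketch, assuming the standard wav2letter list format (`id path duration transcription`, whitespace-separated); the replacement prefix `/data/librivox` and the sample line are hypothetical:

```python
def fix_prefix(line, old="/root/librivox", new="/data/librivox"):
    """Replace the audio-path prefix (second column) of one list-file line."""
    parts = line.rstrip("\n").split(None, 3)  # id, path, duration, transcription
    if len(parts) >= 2 and parts[1].startswith(old):
        parts[1] = new + parts[1][len(old):]
    return " ".join(parts)

# Hypothetical line in wav2letter list format.
line = "utt1 /root/librivox/audio/utt1.flac 12.3 the transcription text"
print(fix_prefix(line))
# → utt1 /data/librivox/audio/utt1.flac 12.3 the transcription text
```

Apply this to every line of each downloaded list before pointing training at it, substituting your own storage path for the `new` prefix.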
```
@article{synnaeve2019end,
  title={End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures},
  author={Synnaeve, Gabriel and Xu, Qiantong and Kahn, Jacob and Grave, Edouard and Likhomanenko, Tatiana and Pratap, Vineel and Sriram, Anuroop and Liptchinsky, Vitaliy and Collobert, Ronan},
  journal={arXiv preprint arXiv:1911.08460},
  year={2019}
}
```