# Iterative Pseudo-Labeling (IPL)
Pseudo-labeling has recently shown promise in end-to-end automatic speech recognition (ASR). We study Iterative Pseudo-Labeling (IPL), a semi-supervised algorithm which efficiently performs multiple iterations of pseudo-labeling on unlabeled data as the acoustic model evolves. In particular, IPL fine-tunes an existing model at each iteration using both labeled data and a subset of unlabeled data. We study the main components of IPL: decoding with a language model and data augmentation. We then demonstrate the effectiveness of IPL by achieving state-of-the-art word error rate on the LibriSpeech test sets in both standard and low-resource settings. We also study the effect of language models trained on different corpora to show that IPL can effectively utilize additional text. Finally, we release a new large in-domain text corpus which does not overlap with the LibriSpeech training transcriptions, to foster research in low-resource, semi-supervised ASR.
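The core IPL loop described above can be sketched as follows. This is a minimal illustration, not the actual wav2letter implementation: `decode_with_lm` and `fine_tune` are hypothetical callables standing in for LM-based beam-search decoding and augmented fine-tuning.

```python
import random

def iterative_pseudo_labeling(model, labeled, unlabeled,
                              decode_with_lm, fine_tune,
                              num_iterations=3, subset_fraction=0.5):
    """Repeatedly fine-tune the acoustic model on the labeled data plus a
    pseudo-labeled subset of the unlabeled data, re-generating the
    pseudo-labels as the model improves."""
    for _ in range(num_iterations):
        # Pseudo-label a random subset of the unlabeled audio by decoding
        # with the current acoustic model and an external language model.
        k = max(1, int(subset_fraction * len(unlabeled)))
        subset = random.sample(unlabeled, k)
        pseudo = [(audio, decode_with_lm(model, audio)) for audio in subset]
        # Fine-tune on labeled + pseudo-labeled data; in the paper, data
        # augmentation is applied during this step.
        model = fine_tune(model, labeled + pseudo)
    return model
```

Because the unlabeled subset is re-decoded every iteration, pseudo-label quality improves as the acoustic model does.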
We release a new LM training corpus built from abundant books from Project Gutenberg. The corpus is designed for low-resource ASR study with the LibriSpeech (LS) and LibriLight (LV) datasets: potential transcriptions belonging to the training/dev/test data of LibriSpeech and LibriLight are carefully filtered out. In the tables below, `A \ B` denotes corpus A with the transcriptions of B removed.
| LM | Description | Corpus | Vocabulary | Model |
|---|---|---|---|---|
| LS \ LV | Librispeech LM corpus without LV transcriptions | corpus | 200K vocab | lm |
| GB \ LS \ LV | Gutenberg books without LS transcriptions, LV transcriptions | raw, normalized | 200K vocab | lm |
| GB \ LV | Gutenberg books without LV transcriptions | raw, normalized | 200K vocab | lm |
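Conceptually, the overlap filtering drops any corpus line whose normalized text matches a LibriSpeech/LibriLight transcription. A simplified sketch of the idea (the paper's actual filtering is more careful; `normalize` and `filter_overlap` are illustrative names, not the released tooling):

```python
import re

def normalize(line):
    """Lowercase and strip punctuation so near-identical lines match."""
    return re.sub(r"[^a-z' ]", "", line.lower()).strip()

def filter_overlap(corpus_lines, transcriptions):
    """Drop corpus lines whose normalized text appears among the
    normalized LibriSpeech/LibriLight transcriptions."""
    banned = {normalize(t) for t in transcriptions}
    return [line for line in corpus_lines if normalize(line) not in banned]
```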
We release our pretrained models from the paper. The results in the paper can be reproduced from the models with the following project commits:

- e62eb7ea4c9381411508c08226598ba11cbf9511
- d02f08749ce3cf0eeefa4406f61ad9dddb4a19b2

The architecture of the models can be found here; it is the best Transformer CTC architecture we developed in End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures.
| Labeled Set | Lexicon | Tokens |
|---|---|---|
| LibriLight-train-10h | lexicon | tokens |
| LibriSpeech-train-clean-100 | lexicon | tokens |
| LibriSpeech-train-960h | lexicon | tokens |
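Lexicon files in this style typically map each word to a space-separated token spelling, one entry per line. A small parser sketch, assuming a tab-separated `word<TAB>tokens` layout (an assumption; check the released files for the exact format):

```python
def load_lexicon(lines):
    """Parse lexicon lines of the form 'word<TAB>t o k e n s' into a dict
    mapping each word to its list of token spellings (a word may have
    several alternative entries)."""
    lexicon = {}
    for line in lines:
        word, tokens = line.rstrip("\n").split("\t", 1)
        lexicon.setdefault(word, []).append(tokens.split())
    return lexicon
```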
| Labeled Data | Unlabeled Data | AM: dev-clean | AM: dev-other | LM |
|---|---|---|---|---|
| LL-10 | LS-960 | dev-clean | dev-other | LS \ LV |
| LL-10 | LS-960 | dev-clean | dev-other | GB \ LS \ LV |
| LL-10 | LS-960 + LV | dev-clean | dev-other | LS \ LV |
| LL-10 | LS-960 + LV | dev-clean | dev-other | GB \ LS \ LV |
| LS-100 | LS-860 | dev-clean | dev-other | LS \ LV |
| LS-100 | LS-860 | dev-clean | dev-other | GB \ LS \ LV |
| LS-100 | LS-860 + LV | dev-clean | dev-other | LS \ LV |
| LS-100 | LS-860 + LV | dev-clean | dev-other | GB \ LS \ LV |
| LS-960 | LV | dev-clean | dev-other | LS \ LV |
| LS-960 | LV | dev-clean | dev-other | GB \ LV |
The LMs listed in the table above are the ones used during IPL training.
```
@article{xu2020iterative,
  title={Iterative Pseudo-Labeling for Speech Recognition},
  author={Xu, Qiantong and Likhomanenko, Tatiana and Kahn, Jacob and Hannun, Awni and Synnaeve, Gabriel and Collobert, Ronan},
  journal={arXiv preprint arXiv:2005.09267},
  year={2020}
}
```