
# Iterative Pseudo-Labeling for Speech Recognition

## Abstract

Pseudo-labeling has recently shown promise in end-to-end automatic speech recognition (ASR). We study Iterative Pseudo-Labeling (IPL), a semi-supervised algorithm which efficiently performs multiple iterations of pseudo-labeling on unlabeled data as the acoustic model evolves. In particular, IPL fine-tunes an existing model at each iteration using both labeled data and a subset of unlabeled data. We study the main components of IPL: decoding with a language model and data augmentation. We then demonstrate the effectiveness of IPL by achieving state-of-the-art word-error rate on the Librispeech test sets in both standard and low-resource setting. We also study the effect of language models trained on different corpora to show IPL can effectively utilize additional text. Finally, we release a new large in-domain text corpus which does not overlap with the Librispeech training transcriptions to foster research in low-resource, semi-supervised ASR.
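The training loop sketched in the abstract (repeatedly pseudo-labeling a subset of the unlabeled data with the evolving model, then fine-tuning on labeled plus pseudo-labeled data) can be summarized in a few lines. This is a hedged, toy sketch of the IPL control flow only; `train_step`, `decode_with_lm`, and the data objects are hypothetical stand-ins, not the wav2letter API.

```python
# Toy sketch of the IPL loop; the "model" is just an update counter so the
# control flow is runnable. Real training fine-tunes an acoustic model and
# decodes with a beam-search decoder plus an external language model.
import random

def train_step(model, batch):
    # Stand-in for one fine-tuning step on a (utterance, transcript) pair.
    return model + 1

def decode_with_lm(model, utterance):
    # Stand-in for LM-fused beam-search decoding, which produces the
    # pseudo-label for an unlabeled utterance.
    return f"pseudo@{model}:{utterance}"

def ipl(model, labeled, unlabeled, iterations=3, subset_frac=0.5):
    """Iterative Pseudo-Labeling: at each iteration, re-label a random
    subset of the unlabeled data with the current model, then fine-tune
    on labeled + pseudo-labeled data."""
    for _ in range(iterations):
        subset = random.sample(unlabeled, int(len(unlabeled) * subset_frac))
        pseudo = [(u, decode_with_lm(model, u)) for u in subset]
        for batch in labeled + pseudo:
            model = train_step(model, batch)
    return model
```

Because pseudo-labels are regenerated each iteration, they improve as the acoustic model improves, which is the key difference from single-round pseudo-labeling.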

## Gutenberg Language Model

We release a new LM training corpus built from abundant books in the Gutenberg Project. The corpus is designed for low-resource ASR studies with the LibriSpeech (LS) and LibriLight (LV) datasets: we carefully filter out text that could overlap with the transcriptions of the LibriSpeech and LibriLight training/dev/test sets.
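The overlap filtering described above can be sketched as exact-match removal after text normalization. This is an illustrative assumption, not the paper's actual pipeline (which may filter at the book level and use a more elaborate matching procedure); `normalize` is a hypothetical helper.

```python
# Hedged sketch of transcript-overlap filtering: drop corpus lines whose
# normalized text appears in the held-out transcription set.
import re

def normalize(line: str) -> str:
    # Uppercase and strip punctuation, roughly matching LibriSpeech-style text.
    return re.sub(r"[^A-Z' ]", "", line.upper()).strip()

def filter_corpus(corpus_lines, held_out_transcriptions):
    held_out = {normalize(t) for t in held_out_transcriptions}
    return [line for line in corpus_lines if normalize(line) not in held_out]
```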

| LM | Description | Corpus | Vocabulary | Model |
|---|---|---|---|---|
| LS \ LV | LibriSpeech LM corpus without LV transcriptions | corpus | 200K vocab | lm |
| GB \ LS \ LV | Gutenberg books without LS and LV transcriptions | raw, normalized | 200K vocab | lm |
| GB \ LV | Gutenberg books without LV transcriptions | raw, normalized | 200K vocab | lm |

## Acoustic Models

We release our pretrained models from the paper. The results in the paper can be reproduced from the models with the following project commits:

The architecture of the models can be found here; it is the best Transformer CTC architecture we developed in End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures.

## Tokens and Lexicons

| Labeled Set | Lexicon | Tokens |
|---|---|---|
| LibriLight-train-10h | lexicon | tokens |
| LibriSpeech-train-clean-100 | lexicon | tokens |
| LibriSpeech-train-960h | lexicon | tokens |
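A wav2letter-style lexicon maps each word to its token spelling, one entry per line (the word followed by space-separated tokens, e.g. `hello h e l l o |`). A small hedged parser under that assumed plain-text format:

```python
# Hedged parser for a wav2letter-style lexicon file; the exact format
# (word followed by its space-separated token spelling) is an assumption
# based on common wav2letter recipes.
def load_lexicon(lines):
    lexicon = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        word, tokens = parts[0], parts[1:]
        # A word may have multiple spellings; keep every variant.
        lexicon.setdefault(word, []).append(tokens)
    return lexicon
```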

## Pre-trained Models

| Labeled Data | Unlabeled Data | AM: dev-clean | AM: dev-other | LM |
|---|---|---|---|---|
| LL-10 | LS-960 | dev-clean | dev-other | LS \ LV |
| LL-10 | LS-960 | dev-clean | dev-other | GB \ LS \ LV |
| LL-10 | LS-960 + LV | dev-clean | dev-other | LS \ LV |
| LL-10 | LS-960 + LV | dev-clean | dev-other | GB \ LS \ LV |
| LS-100 | LS-860 | dev-clean | dev-other | LS \ LV |
| LS-100 | LS-860 | dev-clean | dev-other | GB \ LS \ LV |
| LS-100 | LS-860 + LV | dev-clean | dev-other | LS \ LV |
| LS-100 | LS-860 + LV | dev-clean | dev-other | GB \ LS \ LV |
| LS-960 | LV | dev-clean | dev-other | LS \ LV |
| LS-960 | LV | dev-clean | dev-other | GB \ LV |

The LM column in the table above indicates the language model used during IPL training for each acoustic model.

## Citation

```bibtex
@article{xu2020iterative,
  title={Iterative Pseudo-Labeling for Speech Recognition},
  author={Xu, Qiantong and Likhomanenko, Tatiana and Kahn, Jacob and Hannun, Awni and Synnaeve, Gabriel and Collobert, Ronan},
  journal={arXiv preprint arXiv:2005.09267},
  year={2020}
}
```