# Iterative Pseudo-Labeling (IPL)
Pseudo-labeling has recently shown promise in end-to-end automatic speech recognition (ASR). We study Iterative Pseudo-Labeling (IPL), a semi-supervised algorithm which efficiently performs multiple iterations of pseudo-labeling on unlabeled data as the acoustic model evolves. In particular, IPL fine-tunes an existing model at each iteration using both labeled data and a subset of unlabeled data. We study the main components of IPL: decoding with a language model and data augmentation. We then demonstrate the effectiveness of IPL by achieving state-of-the-art word error rate on the LibriSpeech test sets in both standard and low-resource settings. We also study the effect of language models trained on different corpora to show that IPL can effectively utilize additional text. Finally, we release a new large in-domain text corpus which does not overlap with the LibriSpeech training transcriptions, to foster research in low-resource, semi-supervised ASR.
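The core IPL loop described above can be sketched as follows. This is a minimal illustration, not the actual wav2letter implementation: `decode_with_lm` and `fine_tune` are hypothetical callables standing in for LM-based beam-search decoding and augmented fine-tuning.

```python
import random

def iterative_pseudo_labeling(model, labeled, unlabeled,
                              decode_with_lm, fine_tune,
                              num_iterations=3, subset_fraction=0.5):
    """Repeatedly fine-tune the acoustic model on the labeled data plus a
    pseudo-labeled subset of the unlabeled data, re-generating the
    pseudo-labels as the model improves."""
    for _ in range(num_iterations):
        # Pseudo-label a random subset of the unlabeled audio by decoding
        # with the current acoustic model and an external language model.
        k = max(1, int(subset_fraction * len(unlabeled)))
        subset = random.sample(unlabeled, k)
        pseudo = [(audio, decode_with_lm(model, audio)) for audio in subset]
        # Fine-tune on labeled + pseudo-labeled data; in the paper, data
        # augmentation is applied during this step.
        model = fine_tune(model, labeled + pseudo)
    return model
```

Because the unlabeled subset is re-decoded every iteration, pseudo-label quality improves as the acoustic model does.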
We release a new LM training corpus built from abundant books from Project Gutenberg. The corpus is designed for low-resource ASR study with the LibriSpeech (LS) and LibriLight (LV) datasets: potential transcriptions belonging to the training/dev/test data of LibriSpeech and LibriLight are carefully filtered out. In the tables below, `A \ B` denotes corpus A with the transcriptions of B removed.
| LM | Description | Corpus | Vocabulary | Model |
|---|---|---|---|---|
| LS \ LV | Librispeech LM corpus without LV transcriptions | corpus | 200K vocab | lm |
| GB \ LS \ LV | Gutenberg books without LS transcriptions, LV transcriptions | raw, normalized | 200K vocab | lm |
| GB \ LV | Gutenberg books without LV transcriptions | raw, normalized | 200K vocab | lm |
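Conceptually, the overlap filtering drops any corpus line whose normalized text matches a LibriSpeech/LibriLight transcription. A simplified sketch of the idea (the paper's actual filtering is more careful; `normalize` and `filter_overlap` are illustrative names, not the released tooling):

```python
import re

def normalize(line):
    """Lowercase and strip punctuation so near-identical lines match."""
    return re.sub(r"[^a-z' ]", "", line.lower()).strip()

def filter_overlap(corpus_lines, transcriptions):
    """Drop corpus lines whose normalized text appears among the
    normalized LibriSpeech/LibriLight transcriptions."""
    banned = {normalize(t) for t in transcriptions}
    return [line for line in corpus_lines if normalize(line) not in banned]
```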
We release our pretrained models from the paper. The results in the paper can be reproduced from the models with the following project commits:

- e62eb7ea4c9381411508c08226598ba11cbf9511
- d02f08749ce3cf0eeefa4406f61ad9dddb4a19b2

The architecture of the models can be found here; it is the best Transformer CTC architecture we developed in End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures.
| Labeled Set | Lexicon | Tokens |
|---|---|---|
| LibriLight-train-10h | lexicon | tokens |
| LibriSpeech-train-clean-100 | lexicon | tokens |
| LibriSpeech-train-960h | lexicon | tokens |
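Lexicon files in this style typically map each word to a space-separated token spelling, one entry per line. A small parser sketch, assuming a tab-separated `word<TAB>tokens` layout (an assumption; check the released files for the exact format):

```python
def load_lexicon(lines):
    """Parse lexicon lines of the form 'word<TAB>t o k e n s' into a dict
    mapping each word to its list of token spellings (a word may have
    several alternative entries)."""
    lexicon = {}
    for line in lines:
        word, tokens = line.rstrip("\n").split("\t", 1)
        lexicon.setdefault(word, []).append(tokens.split())
    return lexicon
```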
| Labeled Data | Unlabeled Data | AM: dev-clean | AM: dev-other | LM |
|---|---|---|---|---|
| LL-10 | LS-960 | dev-clean | dev-other | LS \ LV |
| LL-10 | LS-960 | dev-clean | dev-other | GB \ LS \ LV |
| LL-10 | LS-960 + LV | dev-clean | dev-other | LS \ LV |
| LL-10 | LS-960 + LV | dev-clean | dev-other | GB \ LS \ LV |
| LS-100 | LS-860 | dev-clean | dev-other | LS \ LV |
| LS-100 | LS-860 | dev-clean | dev-other | GB \ LS \ LV |
| LS-100 | LS-860 + LV | dev-clean | dev-other | LS \ LV |
| LS-100 | LS-860 + LV | dev-clean | dev-other | GB \ LS \ LV |
| LS-960 | LV | dev-clean | dev-other | LS \ LV |
| LS-960 | LV | dev-clean | dev-other | GB \ LV |
The LMs listed in the table above are the ones used during IPL training.
```
@article{xu2020iterative,
  title={Iterative Pseudo-Labeling for Speech Recognition},
  author={Xu, Qiantong and Likhomanenko, Tatiana and Kahn, Jacob and Hannun, Awni and Synnaeve, Gabriel and Collobert, Ronan},
  journal={arXiv preprint arXiv:2005.09267},
  year={2020}
}
```