Self-Training for End-to-End Speech Recognition

Abstract

We revisit self-training in the context of end-to-end speech recognition. We demonstrate that training with pseudo-labels can substantially improve the accuracy of a baseline model. Key to our approach are a strong baseline acoustic and language model used to generate the pseudo-labels, filtering mechanisms tailored to common errors from sequence-to-sequence models, and a novel ensemble approach to increase pseudo-label diversity. Experiments on the LibriSpeech corpus show that with an ensemble of four models and label filtering, self-training yields a 33.9% relative improvement in WER compared with a baseline trained on 100 hours of labelled data in the noisy speech setting. In the clean speech setting, self-training recovers 59.3% of the gap between the baseline and an oracle model, which is at least 93.8% relatively higher than what previous approaches can achieve.
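The two headline metrics in the abstract can be made concrete with a small sketch. It shows one common way to compute a relative WER improvement and the share of the baseline-to-oracle gap that is recovered. The function names and the WER values in the usage lines are illustrative assumptions, not numbers from the paper.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction, as a percentage of the baseline WER."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


def gap_recovered(baseline_wer: float, new_wer: float, oracle_wer: float) -> float:
    """Share of the baseline-to-oracle WER gap that was closed, in percent."""
    return 100.0 * (baseline_wer - new_wer) / (baseline_wer - oracle_wer)


# Hypothetical WERs chosen only to illustrate the arithmetic.
baseline, self_trained, oracle = 10.0, 7.0, 5.0
print(relative_improvement(baseline, self_trained))  # 30.0
print(gap_recovered(baseline, self_trained, oracle))  # 60.0
```

Both quantities share the same numerator. They differ only in whether the improvement is normalised by the baseline WER or by the distance to the oracle.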

Reproduction

Configuration files for acoustic model training and decoding are provided for each dataset to reproduce the results from the paper.

Pretrained convolutional language models used in the paper are also included, along with steps to generate the language model training corpus and to reproduce acoustic model training.

Training and decoding broadly follow the existing TDS seq2seq recipes.

Dependencies

All results from the paper can be reproduced exactly with the following project commits:

flashlight - commit 77ad2f79249c6833875f57865712de4666617d00
wav2letter - commit 57b4904c8c4a808d393f047a9352c2d5be57ae8f

Each commit contains versioned documentation for building and installing requisite dependencies.
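As a minimal sketch of pinning both projects to these commits, the helper below builds the `git checkout` command for each one. The commit hashes are the ones listed above. The local directory names `flashlight` and `wav2letter` are assumptions about where the repositories were cloned, and the helper name is hypothetical.

```python
from typing import Dict, List

# Commit hashes from the list above, keyed by an assumed local clone directory.
PINNED_COMMITS: Dict[str, str] = {
    "flashlight": "77ad2f79249c6833875f57865712de4666617d00",
    "wav2letter": "57b4904c8c4a808d393f047a9352c2d5be57ae8f",
}


def checkout_command(repo_dir: str, commit: str) -> List[str]:
    """Build a `git checkout` invocation for an existing local clone."""
    return ["git", "-C", repo_dir, "checkout", commit]


for repo_dir, commit in PINNED_COMMITS.items():
    # Print rather than execute, so the sketch has no side effects.
    print(" ".join(checkout_command(repo_dir, commit)))
```

To run the commands instead of printing them, each list can be passed to `subprocess.run(cmd, check=True)` once the repositories have been cloned.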

Tokens and Lexicon Sets

Dataset	Unlabeled Set	Lexicon	Tokens
LibriSpeech	train-clean-100 Baseline	Lexicon	Tokens
LibriSpeech	train-clean-100 + train-clean-360 Oracle	Lexicon	Tokens
LibriSpeech	train-clean-100 + train-other-500 Oracle	Lexicon	Tokens

Following the LibriSpeech recipe generates tokens and lexicon files in the $MODEL_DST/am/ and $MODEL_DST/decoder/ directories; these are the same as the files in the table.
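One way to check that locally generated files match the downloaded ones is to compare their contents by hash. The sketch below assumes both files are already on disk. The function names are hypothetical, and the paths in the closing comment are placeholders rather than file names defined by the recipe.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_contents(generated: Path, reference: Path) -> bool:
    """True when both files contain identical bytes."""
    return sha256_of(generated) == sha256_of(reference)


# Example call with placeholder paths (adjust to the actual file names):
# same_contents(Path("<MODEL_DST>/am/<tokens file>"), Path("<downloaded tokens file>"))
```

A byte-level comparison is strict: differences in line endings or in entry order will be reported as a mismatch even if the token or word sets are equal.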

Pre-Trained Models

Acoustic Models

Below are acoustic models trained only on labelled LibriSpeech training sets (the baseline and the two oracle configurations).

Dataset	Unlabeled Set	Acoustic Model: dev-clean	Acoustic Model: dev-other
LibriSpeech	train-clean-100 Baseline	dev-clean	dev-other
LibriSpeech	train-clean-100 + train-clean-360 Oracle	dev-clean	dev-other
LibriSpeech	train-clean-100 + train-other-500 Oracle	dev-clean	dev-other

Below are models trained on pseudo-labels. All train sets include the base train-clean-100 set in addition to generated pseudo-labels. Steps for generating pseudo-labels can be found here:

Dataset	Pseudo-Labeled Set	AM: dev-clean	AM: dev-other	Synthetic Lexicon
LibriSpeech	train-clean-100 + train-clean-360 PLs (single)	dev-clean	dev-other	Synthetic Lexicon
LibriSpeech	train-clean-100 + train-other-500 PLs (single)	dev-clean	dev-other	Synthetic Lexicon
LibriSpeech	train-clean-100 + train-clean-360 (ensemble: 2+3+5+7+8)	dev-clean	dev-other
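The abstract mentions filtering mechanisms tailored to common sequence-to-sequence errors. As an illustration only, the sketch below rejects pseudo-labels in which some n-gram repeats too often, a simple heuristic for looping output. The function names and the default values of `n` and `max_repeats` are assumptions made for this example; this README does not confirm them as the filter or settings used in the paper.

```python
from collections import Counter
from typing import List


def max_ngram_repeats(words: List[str], n: int) -> int:
    """Highest count of any single n-gram in the transcript (0 if too short)."""
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return max(Counter(ngrams).values(), default=0)


def keep_pseudo_label(transcript: str, n: int = 4, max_repeats: int = 2) -> bool:
    """Reject empty transcripts and ones where some n-gram repeats too often."""
    words = transcript.split()
    if not words:
        return False
    return max_ngram_repeats(words, n) <= max_repeats


# Hypothetical transcripts: the second one loops on the same four words.
print(keep_pseudo_label("the cat sat on the mat"))   # True
print(keep_pseudo_label("a b c d a b c d a b c d"))  # False
```

A filter like this trades recall for precision: stricter thresholds discard more usable pseudo-labels but admit fewer degenerate ones.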

Language Models

The LibriSpeech recipe instructions contain steps to reproduce the language model training corpus. Below are the components of the GCNN language model used for decoding:

LM type	Language model	Vocabulary	Architecture	LM fairseq	Dict fairseq
GCNN	word-piece GCNN	4k WP	Archfile	fairseq LM	fairseq Dict

Citation

@article{kahn2019selftraining,
    title={Self-Training for End-to-End Speech Recognition},
    author={Jacob Kahn and Ann Lee and Awni Hannun},
    year={2019},
    eprint={1909.09116},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}