Self-Training for End-to-End Speech Recognition

Abstract

We revisit self-training in the context of end-to-end speech recognition. We demonstrate that training with pseudo-labels can substantially improve the accuracy of a baseline model. Key to our approach are a strong baseline acoustic and language model used to generate the pseudo-labels, filtering mechanisms tailored to common errors from sequence-to-sequence models, and a novel ensemble approach to increase pseudo-label diversity. Experiments on the LibriSpeech corpus show that with an ensemble of four models and label filtering, self-training yields a 33.9% relative improvement in WER compared with a baseline trained on 100 hours of labelled data in the noisy speech setting. In the clean speech setting, self-training recovers 59.3% of the gap between the baseline and an oracle model, which is at least 93.8% relatively higher than what previous approaches can achieve.
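The two headline figures in the abstract are ratios of WERs. The sketch below shows one common way to compute them, assuming that "relative improvement" means the WER reduction divided by the baseline WER, and that "gap recovered" means the same reduction divided by the baseline-to-oracle gap. The function names and the numeric values are illustrative assumptions and are not taken from the paper.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """WER reduction relative to the baseline WER, in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


def gap_recovered(baseline_wer: float, new_wer: float, oracle_wer: float) -> float:
    """Share of the baseline-to-oracle WER gap closed by new_wer, in percent."""
    return 100.0 * (baseline_wer - new_wer) / (baseline_wer - oracle_wer)


# Hypothetical WERs for illustration only (NOT results from the paper).
baseline, self_trained, oracle = 10.0, 8.0, 5.0
print(relative_improvement(baseline, self_trained))  # 20.0
print(gap_recovered(baseline, self_trained, oracle))  # 40.0
```

With these made-up numbers, a model that moves from 10.0 to 8.0 WER improves by 20% relative to the baseline and closes 40% of the distance to an oracle at 5.0 WER.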

Reproduction

Acoustic model configuration files are provided for each dataset to reproduce results from the paper (training and decoding steps).

Pretrained convolutional language models used in the paper are also included, along with steps to generate the language model training corpus and to reproduce acoustic model training.

Training and decoding broadly follow the existing TDS seq2seq recipes.

Dependencies

All results from the paper can be reproduced exactly with the following project commits:

Each commit contains versioned documentation for building and installing requisite dependencies.

Tokens and Lexicon Sets

| Dataset | Unlabeled Set | Lexicon | Tokens |
| --- | --- | --- | --- |
| LibriSpeech | train-clean-100 Baseline | Lexicon | Tokens |
| LibriSpeech | train-clean-100 + train-clean-360 Oracle | Lexicon | Tokens |
| LibriSpeech | train-clean-100 + train-other-500 Oracle | Lexicon | Tokens |

Following the LibriSpeech recipe generates tokens and lexicon files in the $MODEL_DST/am/ and $MODEL_DST/decoder/ directories; these are the same as the files listed in the table above.
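One way to confirm that locally generated files match the ones in the table is a byte-level comparison. This is a minimal sketch using only the Python standard library; the helper name `files_identical` and the file names in the usage comment are hypothetical, since the recipe's actual output file names are not stated here.

```python
import filecmp
from pathlib import Path


def files_identical(generated: Path, reference: Path) -> bool:
    """Return True when both files exist and have byte-identical contents."""
    if not (generated.is_file() and reference.is_file()):
        return False
    # shallow=False forces a content comparison rather than trusting os.stat().
    return filecmp.cmp(generated, reference, shallow=False)


# Example usage (hypothetical file names; adjust to the files you actually have):
#   files_identical(Path("/path/to/MODEL_DST/am/tokens.txt"),
#                   Path("downloads/tokens.txt"))
```

A `False` result means either that a file is missing or that the contents differ, so it is worth checking the paths before concluding that the recipe output has diverged.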

Pre-Trained Models

Acoustic Models

The baseline and oracle acoustic models, trained only on labelled LibriSpeech training sets, are listed below.

| Dataset | Unlabeled Set | Acoustic Model: dev-clean | Acoustic Model: dev-other |
| --- | --- | --- | --- |
| LibriSpeech | train-clean-100 Baseline | dev-clean | dev-other |
| LibriSpeech | train-clean-100 + train-clean-360 Oracle | dev-clean | dev-other |
| LibriSpeech | train-clean-100 + train-other-500 Oracle | dev-clean | dev-other |

Below are models trained on pseudo-labels. All train sets include the base train-clean-100 set in addition to generated pseudo-labels. Steps for generating pseudo-labels can be found here:

| Dataset | Pseudo-Labeled Set | AM: dev-clean | AM: dev-other | Synthetic Lexicon |
| --- | --- | --- | --- | --- |
| LibriSpeech | train-clean-100 + train-clean-360 PLs (single) | dev-clean | dev-other | Synthetic Lexicon |
| LibriSpeech | train-clean-100 + train-other-500 PLs (single) | dev-clean | dev-other | Synthetic Lexicon |
| LibriSpeech | train-clean-100 + train-clean-360 (ensemble: 2+3+5+7+8) | dev-clean | dev-other | |
<!-- | LibriSpeech | train-clean-100 + train-other-500 (ensemble) | [dev-clean]() | [dev-other]() | Synthetic Lexicon | -->

Language Models

The instructions in LibriSpeech contain steps to reproduce the language model training corpus. Below are components of the GCNN language model used for decoding:

| LM type | Language model | Vocabulary | Architecture | LM fairseq | Dict fairseq |
| --- | --- | --- | --- | --- | --- |
| GCNN | word-piece GCNN | 4k WP | Archfile | fairseq LM | fairseq Dict |

Citation

@article{kahn2019selftraining,
    title={Self-Training for End-to-End Speech Recognition},
    author={Jacob Kahn and Ann Lee and Awni Hannun},
    year={2019},
    eprint={1909.09116},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}