https://www.aclweb.org/anthology/2020.aacl-demo.6
Speech recognition (ASR) and speech-to-text translation (ST) with fairseq.
S2T modeling data consists of source speech features, target text, and other optional information (source text, speaker ID, etc.). Fairseq S2T stores this information in TSV manifest files, one per dataset split. Each data field is represented by a column in the TSV file.
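To make the manifest layout concrete, here is a minimal sketch that writes and reads a tab-separated manifest with one column per data field. The column names (id, audio, n_frames, tgt_text, speaker), the file paths, and the row values are assumptions chosen for illustration; check the data preparation scripts of your fairseq version for the columns it actually expects.

```python
import csv
import io

# Assumed column names, for illustration only.
COLUMNS = ["id", "audio", "n_frames", "tgt_text", "speaker"]

# Hypothetical rows: one utterance per row, one data field per column.
EXAMPLE_ROWS = [
    {"id": "utt1", "audio": "feats/utt1.npy", "n_frames": 512,
     "tgt_text": "hello world", "speaker": "spk1"},
    {"id": "utt2", "audio": "feats/utt2.npy", "n_frames": 731,
     "tgt_text": "good morning", "speaker": "spk2"},
]


def write_manifest(rows, fileobj):
    """Write rows as a TSV manifest with a header line."""
    writer = csv.DictWriter(
        fileobj, fieldnames=COLUMNS, delimiter="\t",
        quoting=csv.QUOTE_NONE, lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)


def read_manifest(fileobj):
    """Read a TSV manifest back into a list of dicts (all values are strings)."""
    return list(csv.DictReader(fileobj, delimiter="\t", quoting=csv.QUOTE_NONE))


def roundtrip(rows):
    """Write rows to an in-memory TSV and parse them back."""
    buf = io.StringIO()
    write_manifest(rows, buf)
    buf.seek(0)
    return read_manifest(buf)
```

In a real setup there would be one such file per split (for example a training manifest and a validation manifest), written to disk rather than to an in-memory buffer.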
Unlike text token embeddings, speech features (e.g. log mel-scale filter banks) are usually fixed during model training and can be pre-computed. The manifest file contains the path to either the feature file in NumPy format or the WAV/FLAC audio file. For the latter, fairseq S2T extracts features on the fly. Optionally, feature/audio files can be packed into uncompressed ZIP files (then accessed via byte offset and length) to improve I/O performance.
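The byte offset and length access works because members of an uncompressed (stored) ZIP archive are kept verbatim in the file. The sketch below illustrates the idea with the Python standard library only. It is not fairseq's own implementation: the helper names and the member names are hypothetical, and the payloads are placeholder bytes standing in for serialized feature arrays or audio files.

```python
import os
import struct
import tempfile
import zipfile


def pack_stored(zip_path, payloads):
    """Write each payload as an uncompressed (stored) ZIP member."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for name, data in payloads.items():
            zf.writestr(name, data)


def index_members(zip_path):
    """Map member name -> (byte offset of its data, length in bytes)."""
    index = {}
    with zipfile.ZipFile(zip_path) as zf, open(zip_path, "rb") as raw:
        for info in zf.infolist():
            # The local file header is 30 fixed bytes followed by the file
            # name and an optional extra field; the data starts after that.
            raw.seek(info.header_offset)
            header = raw.read(30)
            name_len, extra_len = struct.unpack("<HH", header[26:30])
            offset = info.header_offset + 30 + name_len + extra_len
            index[info.filename] = (offset, info.file_size)
    return index


def read_member(zip_path, offset, length):
    """Read a member directly by seeking, without parsing the archive."""
    with open(zip_path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def demo():
    """Pack two placeholder payloads and read them back by offset."""
    payloads = {"utt1.npy": b"\x01\x02\x03\x04", "utt2.npy": b"feature-bytes"}
    with tempfile.TemporaryDirectory() as tmp:
        zip_path = os.path.join(tmp, "feats.zip")
        pack_stored(zip_path, payloads)
        index = index_members(zip_path)
        return {name: read_member(zip_path, *index[name]) for name in payloads}
```

Reading by offset needs only a single seek and read per utterance on one large file, which is typically friendlier to the file system than opening many small files.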
Fairseq S2T also uses a YAML file for data-related configuration: tokenizer type and dictionary path for the target text, feature transforms such as CMVN (cepstral mean and variance normalization) and SpecAugment, temperature-based resampling, etc.
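As an illustration of temperature-based resampling, the following sketch computes per-dataset sampling probabilities under one common formulation, where each probability is proportional to the dataset's share of the data raised to the power 1/T. This formulation, the function name, and the dataset names are assumptions for illustration, not a statement of how fairseq S2T implements it.

```python
def sampling_probs(sizes, temperature):
    """Return sampling probabilities p_i proportional to (n_i / N) ** (1 / T).

    sizes: dict mapping a dataset name to its number of examples.
    temperature: T >= 1; T = 1 keeps the original proportions, and larger
    values move the distribution toward uniform.
    """
    total = sum(sizes.values())
    scaled = {k: (n / total) ** (1.0 / temperature) for k, n in sizes.items()}
    norm = sum(scaled.values())
    return {k: v / norm for k, v in scaled.items()}


# Hypothetical dataset sizes: one large and one small language pair.
SIZES = {"en-de": 900, "en-fr": 100}

# T = 1 reproduces the raw proportions (0.9 and 0.1).
PROBS_T1 = sampling_probs(SIZES, temperature=1.0)
# T = 2 upsamples the smaller dataset (0.75 and 0.25).
PROBS_T2 = sampling_probs(SIZES, temperature=2.0)
```

With T = 2 the smaller dataset is drawn 25% of the time instead of 10%, which shows how the temperature trades fidelity to the raw proportions for better coverage of low-resource data.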
Fairseq S2T uses the unified fairseq-train interface for model training. It requires arguments --task speech_to_text,
--arch <model architecture in fairseq.models.speech_to_text.*> and --config-yaml <config YAML filename>.
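The required arguments can be assembled as in the sketch below, which builds a fairseq-train command line without running it. The helper name, the data directory, the architecture name s2t_transformer_s, and the YAML filename are placeholders assumed for illustration; a real training run normally needs further options (optimizer, batch size, and so on) that are not covered here.

```python
import shlex


def build_train_command(data_root, arch, config_yaml, extra_args=()):
    """Assemble a fairseq-train argument list for the speech_to_text task.

    data_root is the positional data directory; extra_args holds any
    further options the caller wants to append unchanged.
    """
    return [
        "fairseq-train", data_root,
        "--task", "speech_to_text",
        "--arch", arch,
        "--config-yaml", config_yaml,
        *extra_args,
    ]


# Placeholder values, assumed for illustration only.
CMD = build_train_command("/path/to/data", "s2t_transformer_s", "config.yaml")
CMD_LINE = shlex.join(CMD)  # shell-quoted string form of the same command
```

Keeping the command as a list makes it straightforward to hand to subprocess.run once the placeholder values are replaced with real ones.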
Fairseq S2T uses the unified fairseq-generate/fairseq-interactive interface for inference and evaluation. It
requires arguments --task speech_to_text and --config-yaml <config YAML filename>. The interactive console takes
audio paths (one per line) as inputs.
Interactive decoding (fairseq-interactive) is supported. Examples: ASR (LibriSpeech) and ST (CoVoST 2).
Please cite as:
@inproceedings{wang2020fairseqs2t,
title = {fairseq S2T: Fast Speech-to-Text Modeling with fairseq},
author = {Changhan Wang and Yun Tang and Xutai Ma and Anne Wu and Dmytro Okhonko and Juan Pino},
booktitle = {Proceedings of the 2020 Conference of the Asian Chapter of the Association for Computational Linguistics (AACL): System Demonstrations},
year = {2020},
}
@inproceedings{ott2019fairseq,
title = {fairseq: A Fast, Extensible Toolkit for Sequence Modeling},
author = {Myle Ott and Sergey Edunov and Alexei Baevski and Angela Fan and Sam Gross and Nathan Ng and David Grangier and Michael Auli},
booktitle = {Proceedings of NAACL-HLT 2019: Demonstrations},
year = {2019},
}