examples/mms/lid_rerank/README.md
This project provides N-best re-ranking, a simple inference procedure for improving multilingual speech recognition (ASR) "in the wild", where models are expected to first predict language identity (LID) before transcribing. Our method considers the N-best LID predictions for each utterance, runs ASR in each of the N corresponding languages, and then uses external features over the candidate transcriptions to re-rank them.
The workflow is as follows: 1) run LID+ASR inference (MMS and Whisper are supported), 2) compute external re-ranking features, 3) tune feature coefficients on a dev set, and 4) apply the tuned coefficients to the test set.
For more information about our method, please refer to the paper: "Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking".
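As a rough illustration of the re-ranking step, the sketch below scores each candidate in an N-best list with a weighted sum of its features and keeps the highest-scoring one. It assumes a linear combination of log-domain scores; the feature names mirror the score files used later in this README (slid, asr, lm), while the numbers, the weights, and the `rerank` helper are hypothetical and not taken from the scripts in this directory.

```python
from typing import Dict, List


def rerank(candidates: List[Dict[str, float]], weights: Dict[str, float]) -> int:
    """Return the index of the candidate with the highest weighted feature sum."""

    def total(feats: Dict[str, float]) -> float:
        return sum(weights[name] * feats[name] for name in weights)

    return max(range(len(candidates)), key=lambda i: total(candidates[i]))


# Hypothetical log-domain scores for a 3-best list (one dict per candidate language).
nbest = [
    {"slid": -0.2, "asr": -5.0, "lm": -9.0},  # top LID guess, poor transcription
    {"slid": -1.5, "asr": -1.0, "lm": -2.0},  # second LID guess, good transcription
    {"slid": -3.0, "asr": -4.0, "lm": -6.0},
]
weights = {"slid": 1.0, "asr": 0.5, "lm": 0.5}  # illustrative values only

best = rerank(nbest, weights)
print(best)
```

With these made-up numbers the second candidate wins even though it was not the top LID prediction, which is the situation re-ranking is meant to recover from.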
Prepare a text file with one wav file path on each line:
#/path/to/wav/list
/path/to/audio1.wav
/path/to/audio2.wav
/path/to/audio3.wav
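If the audio sits in a single directory tree, a list like the one above can be generated with a few lines of Python. This is a minimal sketch: `write_wav_list` is a hypothetical helper, and it assumes the leading `#` line in the example is only a label for the file, so it writes paths only.

```python
from pathlib import Path


def write_wav_list(audio_dir: str, list_path: str) -> int:
    """Write one absolute wav path per line and return the number of files found."""
    # rglob searches subdirectories too; the pattern is case-sensitive on Linux.
    wavs = sorted(str(p.resolve()) for p in Path(audio_dir).rglob("*.wav"))
    Path(list_path).write_text("".join(w + "\n" for w in wavs))
    return len(wavs)


# Example (paths are placeholders):
# write_wav_list("/path/to/audio", "/path/to/wav/list")
```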
The following workflow also assumes that LID and ASR references are available (at least for the dev set). We use 3-letter ISO codes for both Whisper and MMS.
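If your LID references use two-letter codes, they need converting to three-letter codes first. The sketch below shows one way to do that, assuming a hand-maintained lookup table; `ISO1_TO_ISO3` and `to_iso3` are hypothetical names, and the table covers only a handful of languages for illustration.

```python
# Partial two-letter to three-letter mapping; extend it for your own label set.
ISO1_TO_ISO3 = {
    "en": "eng",
    "fr": "fra",
    "de": "deu",
    "es": "spa",
    "ja": "jpn",
}


def to_iso3(code: str) -> str:
    """Normalize a language code to a lowercase three-letter code."""
    code = code.strip().lower()
    if len(code) == 3:
        return code  # already three letters, pass through unchanged
    try:
        return ISO1_TO_ISO3[code]
    except KeyError:
        raise ValueError(f"no three-letter mapping for {code!r}") from None
```

Raising on unknown codes, rather than passing them through, keeps a mislabeled reference from silently counting as an LID error.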
Next, run either Whisper-based or MMS-based LID+ASR.
For Whisper, refer to the Whisper documentation for installation instructions.
First run LID:
python whisper/infer_lid.py --wavs "path/to/wav/list" --dst "path/to/lid/results" --model large-v2 --n 10
Here, --n 10 sets the size of the N-best list to 10.
Then run ASR, using the top-N LID predictions:
python whisper/infer_asr.py --wavs "path/to/wav/list" --lids "path/to/lid/results"/nbest_lid --dst "path/to/asr/results" --model large-v2
For MMS, refer to the Fairseq documentation for installation instructions.
Prepare data and models following the instructions from the MMS repository. Note that the MMS backend expects a slightly different wav list format, which can be obtained via:
python mms/format_wav_list.py --src "/path/to/wav/list" --dst "/path/to/wav/manifest.tsv"
Note that MMS also expects LID references in a file named "/path/to/wav/manifest.lang".
Then run LID:
cd "path/to/fairseq/dir"
PYTHONPATH='.' python3 examples/mms/lid/infer.py "path/to/dict/dir" --path "path/to/model" --task audio_classification --infer-manifest "path/to/wav/manifest.tsv" --output-path "path/to/lid/results" --top-k 10
Here, --top-k 10 sets the size of the N-best list to 10.
Then run ASR, using the top-N LID predictions. Since MMS uses language-specific parameters, we've parallelized inference across languages:
#Split data by language
python mms/split_by_lang.py --wavs_tsv "/path/to/wav/manifest.tsv" --lid_preds "path/to/lid/results"/predictions.txt --dst "path/to/data/split"
#Write language-specific ASR python commands to an executable file
python mms/make_parallel_single_runs.py --dump "path/to/data/split" --model "path/to/model" --dst "path/to/asr/results" --fairseq_dir "path/to/fairseq/dir" > run.sh
#Running each language sequentially (you can also parallelize this)
. ./run.sh
#Merge language-specific results back to original order
python mms/merge_by_run.py --dump "path/to/data/split" --exp "path/to/asr/results"
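To parallelize the per-language runs instead of sourcing run.sh sequentially, something like the sketch below could be used before the merge step. It assumes each non-empty, non-comment line of run.sh is an independent shell command and that the commands are launched from the same working directory used for run.sh; `load_commands` and `run_parallel` are hypothetical helpers, not part of this repository.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import List


def load_commands(script_path: str) -> List[str]:
    """Read one shell command per line, skipping blank lines and comments."""
    with open(script_path) as f:
        lines = (ln.strip() for ln in f)
        return [ln for ln in lines if ln and not ln.startswith("#")]


def run_parallel(commands: List[str], max_workers: int = 4) -> List[int]:
    """Run the commands concurrently and return their exit codes in input order."""

    def run(cmd: str) -> int:
        return subprocess.run(cmd, shell=True).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, commands))


# Example (run before merging results):
# codes = run_parallel(load_commands("run.sh"), max_workers=2)
# failed = [i for i, c in enumerate(codes) if c != 0]
```

Threads are enough here because each worker only waits on a subprocess. Keep max_workers at or below the number of jobs your hardware can hold at once, since each command loads its own model.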
Compute language model (LM) scores on the N-best ASR hypotheses with MaLA:
python mala/infer.py --txt "path/to/asr/results"/nbest_asr_hyp --dst "path/to/lm/results"
Download the NLLB model from the official source.
python nllb/infer.py --txt "path/to/asr/results"/nbest_asr_hyp --dst "path/to/wlid/results" --model "path/to/nllb/model"
Download the MMS-Zeroshot model from the official source.
First run u-romanization on the N-best ASR hypotheses:
python mms-zs/uromanize.py --txt "path/to/asr/results"/nbest_asr_hyp --lid "path/to/lid/results"/nbest_lid --dst "path/to/uasr/results" --model "path/to/mms-zeroshot"
Then compute the forced alignment score using the MMS-Zeroshot model:
python mms-zs/falign.py --uroman_txt "path/to/uasr/results"/nbest_asr_hyp_uroman --wav "path/to/wav/list" --dst "path/to/uasr/results" --model "path/to/mms-zeroshot"
Tune the feature coefficients on the dev set:
python rerank/tune_coefficients.py --slid "path/to/lid/results"/slid_score --asr "path/to/asr/results"/asr_score --wlid "path/to/wlid/results"/wlid_score --lm "path/to/lm/results"/lm_score --uasr "path/to/uasr/results"/uasr_score --dst "path/to/rerank/results" --ref_lid "ground-truth/lid" --nbest_lid "path/to/lid/results"/nbest_lid --ref_asr "ground-truth/asr" --nbest_asr "path/to/asr/results"/nbest_asr_hyp
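For intuition about what tuning coefficients on a dev set involves, the sketch below runs a plain random search over feature weights and keeps the setting with the best dev accuracy. This is an assumption-laden illustration, not a description of rerank/tune_coefficients.py: the search strategy, the weight range, the accuracy objective, and all function names are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

Features = Dict[str, float]


def pick(cands: Sequence[Features], w: Dict[str, float]) -> int:
    """Index of the candidate with the highest weighted feature sum."""
    return max(range(len(cands)), key=lambda i: sum(w[k] * cands[i][k] for k in w))


def accuracy(dev: List[Sequence[Features]], gold: List[int], w: Dict[str, float]) -> float:
    """Fraction of utterances whose selected candidate matches the gold index."""
    return sum(pick(c, w) == g for c, g in zip(dev, gold)) / len(dev)


def random_search(
    dev: List[Sequence[Features]],
    gold: List[int],
    names: Sequence[str],
    trials: int = 200,
    seed: int = 0,
) -> Tuple[Dict[str, float], float]:
    """Sample weights uniformly from [0, 1] and keep the best-scoring setting."""
    rng = random.Random(seed)
    best_w = {k: 1.0 for k in names}  # start from equal weights as a baseline
    best_acc = accuracy(dev, gold, best_w)
    for _ in range(trials):
        w = {k: rng.uniform(0.0, 1.0) for k in names}
        acc = accuracy(dev, gold, w)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc
```

Here `dev` holds one N-best list of feature dicts per utterance and `gold` holds the index of the reference language within each list, which is a simplification of the reference files passed to the command above.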
Apply the tuned coefficients to the test set:
python rerank/rerank.py --slid "path/to/lid/results"/slid_score --asr "path/to/asr/results"/asr_score --wlid "path/to/wlid/results"/wlid_score --lm "path/to/lm/results"/lm_score --uasr "path/to/uasr/results"/uasr_score --dst "path/to/rerank/results" --ref_lid "ground-truth/lid" --nbest_lid "path/to/lid/results"/nbest_lid --ref_asr "ground-truth/asr" --nbest_asr "path/to/asr/results"/nbest_asr_hyp --w "path/to/rerank/results"/best_coefficients
The re-ranked LID predictions and ASR hypotheses are written to "path/to/rerank/results"/reranked_1best_lid and "path/to/rerank/results"/reranked_1best_asr_hyp, respectively.
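To check the re-ranked LID output against references, a simple accuracy computation such as the one below may be enough. It assumes both files hold exactly one language code per line, in the same utterance order; that layout is an assumption about the output format, and `lid_accuracy` is a hypothetical helper.

```python
from typing import Iterable


def lid_accuracy(hyp: Iterable[str], ref: Iterable[str]) -> float:
    """Fraction of utterances whose predicted language matches the reference."""
    hyp_codes = [h.strip() for h in hyp]
    ref_codes = [r.strip() for r in ref]
    if not ref_codes:
        raise ValueError("reference list is empty")
    if len(hyp_codes) != len(ref_codes):
        raise ValueError(f"length mismatch: {len(hyp_codes)} vs {len(ref_codes)}")
    return sum(h == r for h, r in zip(hyp_codes, ref_codes)) / len(ref_codes)


# Example (paths are placeholders; assumes one code per line in each file):
# with open("path/to/rerank/results/reranked_1best_lid") as h, open("ground-truth/lid") as r:
#     print(lid_accuracy(h, r))
```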
@article{yan2024wild,
title={Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking},
author={Brian Yan and Vineel Pratap and Shinji Watanabe and Michael Auli},
journal={arXiv},
year={2024}
}