examples/mms/asr/tutorial/MMS_ASR_Inference_Colab.ipynb
In this notebook, we give an example of how to run simple ASR inference using an MMS ASR model.
Credit to epk2112 (GitHub)
!mkdir "temp_dir"
!git clone https://github.com/pytorch/fairseq
# Change current working directory
!pwd
%cd "/content/fairseq"
!pip install --editable ./
!pip install tensorboardX
Uncomment the line for your preferred model to download it. In this example, we use MMS-FL102 for demo purposes. For better model quality and wider language coverage, you can use the MMS-1B-ALL model instead (it requires more RAM, so please use Colab Pro rather than Colab Free).
# MMS-1B:FL102 model - 102 Languages - FLEURS Dataset
!wget -P ./models_new 'https://dl.fbaipublicfiles.com/mms/asr/mms1b_fl102.pt'
# # MMS-1B:L1107 - 1107 Languages - MMS-lab Dataset
# !wget -P ./models_new 'https://dl.fbaipublicfiles.com/mms/asr/mms1b_l1107.pt'
# # MMS-1B-all - 1162 Languages - MMS-lab + FLEURS + CV + VP + MLS
# !wget -P ./models_new 'https://dl.fbaipublicfiles.com/mms/asr/mms1b_all.pt'
Create a folder at '/content/audio_samples/' and upload the .wav audio files you want to transcribe, e.g. '/content/audio_samples/audio.wav'.
Note: Make sure the audio you are using has a sample rate of 16 kHz. You can easily do this with FFmpeg, as in the example below, which converts an .mp3 file to .wav and fixes the sample rate.
Here, we use a FLEURS English MP3 audio sample for the example.
!wget -P ./audio_samples/ 'https://datasets-server.huggingface.co/assets/google/fleurs/--/en_us/train/0/audio/audio.mp3'
!ffmpeg -y -i ./audio_samples/audio.mp3 -ar 16000 ./audio_samples/audio.wav
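If you want to double-check the sample rate from Python before running inference, the standard-library `wave` module is enough. The snippet below is a self-contained sketch: it writes a short synthetic 16 kHz test tone (`test_tone.wav` is a made-up filename; in the notebook you would open `./audio_samples/audio.wav` instead) and then reads the header back.

```python
import wave
import math
import struct

# Write a short 16 kHz mono test tone so this check is self-contained.
# In the notebook, point wave.open() at ./audio_samples/audio.wav instead.
path = "test_tone.wav"
rate = 16000
with wave.open(path, "wb") as w:
    w.setnchannels(1)     # mono
    w.setsampwidth(2)     # 16-bit PCM
    w.setframerate(rate)  # 16 kHz, as the model expects
    samples = [int(0.3 * 32767 * math.sin(2 * math.pi * 440 * t / rate))
               for t in range(rate)]  # one second of a 440 Hz tone
    w.writeframes(struct.pack("<" + "h" * len(samples), *samples))

# Verify the sample rate and channel count before running inference.
with wave.open(path, "rb") as w:
    print(w.getframerate(), w.getnchannels())  # 16000 1
```

If the rate printed is not 16000, re-run the FFmpeg command above with `-ar 16000`.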
In the example below, we transcribe a sentence in English.
To transcribe other languages, replace --lang "eng" with the new ISO code.
To improve transcription quality, you can use language-model (LM) decoding by following the ASR LM decoding instructions.
import os
os.environ["TMPDIR"] = '/content/temp_dir'
os.environ["PYTHONPATH"] = "."
os.environ["PREFIX"] = "INFER"
os.environ["HYDRA_FULL_ERROR"] = "1"
os.environ["USER"] = "micro"
!python examples/mms/asr/infer/mms_infer.py --model "/content/fairseq/models_new/mms1b_fl102.pt" --lang "eng" --audio "/content/fairseq/audio_samples/audio.wav"
Since MMS is a CTC model, we can further improve the accuracy by running beam search decoding using a language model.
While we have not open-sourced the language models used in MMS (yet!), we have provided details of the data and the commands used to train the LMs in the Appendix section of our paper.
For this tutorial, we will use an alternate English language model trained on Common Crawl data, which has been made publicly available through the efforts of Likhomanenko, Tatiana, et al., "Rethinking Evaluation in ASR: Are Our Models Robust Enough?". The language model can be accessed from the GitHub repository here.
! mkdir -p /content/lmdecode
!wget -P /content/lmdecode https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/lm_common_crawl_small_4gram_prun0-6-15_200kvocab.bin # smaller LM
!wget -P /content/lmdecode https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/lexicon.txt
Install the decoder bindings from flashlight.
# Taken from https://github.com/flashlight/flashlight/blob/main/scripts/colab/colab_install_deps.sh
# Install dependencies from apt
! sudo apt-get install -y libfftw3-dev libsndfile1-dev libgoogle-glog-dev libopenmpi-dev libboost-all-dev
# Install Kenlm
! cd /tmp && git clone https://github.com/kpu/kenlm && cd kenlm && mkdir build && cd build && cmake .. -DCMAKE_BUILD_TYPE=Release && make install -j$(nproc)
# Install Intel MKL 2020
! cd /tmp && wget https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB && \
apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB
! sh -c 'echo deb https://apt.repos.intel.com/mkl all main > /etc/apt/sources.list.d/intel-mkl.list' && \
apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends intel-mkl-64bit-2020.0-088
# Remove existing MKL libs to avoid double linkage
! rm -rf /usr/local/lib/libmkl*
! rm -rf flashlight
! git clone --recursive https://github.com/flashlight/flashlight.git
%cd flashlight
! git checkout 035ead6efefb82b47c8c2e643603e87d38850076
%cd bindings/python
! python3 setup.py install
%cd /content/fairseq
Next, we download an audio file from the People's Speech dataset. We will use a sample from their 'dirty' subset, which is more challenging for the ASR model.
!wget -O ./audio_samples/tmp.wav 'https://datasets-server.huggingface.co/assets/MLCommons/peoples_speech/--/dirty/train/0/audio/audio.wav'
!ffmpeg -y -i ./audio_samples/tmp.wav -ar 16000 ./audio_samples/audio_noisy.wav
Let's listen to the audio file
import IPython
IPython.display.display(IPython.display.Audio("./audio_samples/audio_noisy.wav"))
print("Transcript: limiting emotions that we experience mainly in our childhood which stop us from living our life just open freedom i mean trust and")
Run inference with both greedy decoding and LM decoding.
import os
os.environ["TMPDIR"] = '/content/temp_dir'
os.environ["PYTHONPATH"] = "."
os.environ["PREFIX"] = "INFER"
os.environ["HYDRA_FULL_ERROR"] = "1"
os.environ["USER"] = "micro"
print("======= WITHOUT LM DECODING=======")
!python examples/mms/asr/infer/mms_infer.py --model "/content/fairseq/models_new/mms1b_fl102.pt" --lang "eng" --audio "/content/fairseq/audio_samples/audio.wav" "/content/fairseq/audio_samples/audio_noisy.wav"
print("\n\n\n======= WITH LM DECODING=======")
# Note that lmweight and wordscore need to be tuned for each LM;
# reusing the same values may not be optimal
decoding_cmds = """
decoding.type=kenlm
decoding.beam=500
decoding.beamsizetoken=50
decoding.lmweight=2.69
decoding.wordscore=2.8
decoding.lmpath=/content/lmdecode/lm_common_crawl_small_4gram_prun0-6-15_200kvocab.bin
decoding.lexicon=/content/lmdecode/lexicon.txt
""".replace("\n", " ")
!python examples/mms/asr/infer/mms_infer.py --model "/content/fairseq/models_new/mms1b_fl102.pt" --lang "eng" --audio "/content/fairseq/audio_samples/audio.wav" "/content/fairseq/audio_samples/audio_noisy.wav" \
--extra-infer-args '{decoding_cmds}'
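To build intuition for the lmweight and wordscore parameters above, here is a minimal sketch of how lexicon-based beam search roughly combines scores for a hypothesis (the real scoring inside flashlight's decoder has additional terms; the numbers below are made up for illustration):

```python
# Rough scoring intuition for LM beam search (illustrative only):
#   total = acoustic_score + lmweight * lm_score + wordscore * num_words
# lmweight scales the LM log-probability; wordscore offsets the LM's
# per-word penalty so longer hypotheses are not unfairly punished.
lmweight, wordscore = 2.69, 2.8

def hypothesis_score(acoustic_score, lm_score, num_words):
    return acoustic_score + lmweight * lm_score + wordscore * num_words

# Two hypothetical competing hypotheses for the same audio:
h1 = hypothesis_score(acoustic_score=-12.0, lm_score=-3.0, num_words=4)  # fluent wording
h2 = hypothesis_score(acoustic_score=-11.5, lm_score=-6.0, num_words=4)  # better acoustics, worse LM
print(h1 > h2)  # True: the LM term lets h1 win despite a lower acoustic score
```

This is why the tutorial notes that lmweight and wordscore must be re-tuned per LM: they control the trade-off between the acoustic model and the language model.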