Running MMS-ASR inference in Colab

In this notebook, we will give an example on how to run simple ASR inference using MMS ASR model.

Credit to epk2112 (github)

Step 1: Clone fairseq-py and install latest version

python

!mkdir "temp_dir"
!git clone https://github.com/pytorch/fairseq

# Change current working directory
!pwd
%cd "/content/fairseq"
!pip install --editable ./ 
!pip install tensorboardX

2. Download MMS model

Un-comment to download your preferred model. In this example, we use MMS-FL102 for demo purposes. For better model quality and language coverage, user can use MMS-1B-ALL model instead (but it would require more RAM, so please use Colab-Pro instead of Colab-Free).

python

# MMS-1B:FL102 model - 102 Languages - FLEURS Dataset
!wget -P ./models_new 'https://dl.fbaipublicfiles.com/mms/asr/mms1b_fl102.pt'

# # MMS-1B:L1107 - 1107 Languages - MMS-lab Dataset
# !wget -P ./models_new 'https://dl.fbaipublicfiles.com/mms/asr/mms1b_l1107.pt'

# # MMS-1B-all - 1162 Languages - MMS-lab + FLEURS + CV + VP + MLS
# !wget -P ./models_new 'https://dl.fbaipublicfiles.com/mms/asr/mms1b_all.pt'

3. Prepare audio file

Create a folder on path '/content/audio_samples/' and upload your .wav audio files that you need to transcribe e.g. '/content/audio_samples/audio.wav'

Note: You need to make sure that the audio data you are using has a sample rate of 16kHz You can easily do this with FFMPEG like the example below that converts .mp3 file to .wav and fixing the audio sample rate

Here, we use a FLEURS english MP3 audio for the example.

python

!wget -P ./audio_samples/ 'https://datasets-server.huggingface.co/assets/google/fleurs/--/en_us/train/0/audio/audio.mp3'
!ffmpeg -y -i ./audio_samples/audio.mp3 -ar 16000 ./audio_samples/audio.wav

4: Run Inference and transcribe your audio(s)

In the below example, we will transcribe a sentence in English.

To transcribe other languages:

Go to MMS README ASR section
Open Supported languages link
Find your target languages based on Language Name column
Copy the corresponding Iso Code
Replace --lang "eng" with new Iso Code

To improve the transcription quality, user can use language-model (LM) decoding by following this instruction ASR LM decoding

python

import os

os.environ["TMPDIR"] = '/content/temp_dir'
os.environ["PYTHONPATH"] = "."
os.environ["PREFIX"] = "INFER"
os.environ["HYDRA_FULL_ERROR"] = "1"
os.environ["USER"] = "micro"

!python examples/mms/asr/infer/mms_infer.py --model "/content/fairseq/models_new/mms1b_fl102.pt" --lang "eng" --audio "/content/fairseq/audio_samples/audio.wav"

5: Beam search decoding using a Language Model and transcribe audio file(s)

Since MMS is a CTC model, we can further improve the accuracy by running beam search decoding using a language model.

While we have not open sourced the language models used in MMS (yet!), we have provided the details of the data and commands to used to train the LMs in the Appendix section of our paper.

For this tutorial, we will use a alternate English language model based on Common Crawl data which has been made publicly available through the efforts of Likhomanenko, Tatiana, et al. "Rethinking evaluation in asr: Are our models robust enough?.". The language model can be accessed from the GitHub repository here.

python

! mkdir -p /content/lmdecode 

!wget -P /content/lmdecode  https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/lm_common_crawl_small_4gram_prun0-6-15_200kvocab.bin # smaller LM 
!wget -P /content/lmdecode  https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/lexicon.txt

Install decoder bindings from flashlight

python

# Taken from https://github.com/flashlight/flashlight/blob/main/scripts/colab/colab_install_deps.sh 
# Install dependencies from apt
! sudo apt-get install -y libfftw3-dev libsndfile1-dev libgoogle-glog-dev libopenmpi-dev libboost-all-dev
# Install Kenlm
! cd /tmp && git clone https://github.com/kpu/kenlm && cd kenlm && mkdir build && cd build && cmake .. -DCMAKE_BUILD_TYPE=Release && make install -j$(nproc)

# Install Intel MKL 2020
! cd /tmp && wget https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB && \
    apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB
! sh -c 'echo deb https://apt.repos.intel.com/mkl all main > /etc/apt/sources.list.d/intel-mkl.list' && \
    apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends intel-mkl-64bit-2020.0-088
# Remove existing MKL libs to avoid double linkeage
! rm -rf /usr/local/lib/libmkl*

python

! rm -rf flashlight
! git clone --recursive https://github.com/flashlight/flashlight.git
%cd flashlight
! git checkout 035ead6efefb82b47c8c2e643603e87d38850076 
%cd bindings/python 
! python3 setup.py install

%cd /content/fairseq

Next, we download an audio file from People's speech data. We will the audio sample from their 'dirty' subset which will be more challenging for the ASR model.

python

!wget -O ./audio_samples/tmp.wav 'https://datasets-server.huggingface.co/assets/MLCommons/peoples_speech/--/dirty/train/0/audio/audio.wav'
!ffmpeg -y -i ./audio_samples/tmp.wav -ar 16000 ./audio_samples/audio_noisy.wav

Let's listen to the audio file

python

import IPython
IPython.display.display(IPython.display.Audio("./audio_samples/audio_noisy.wav"))
print("Trancript: limiting emotions that we experience mainly in our childhood which stop us from living our life just open freedom i mean trust and")

Run inference with both greedy decoding and LM decoding

python

import os

os.environ["TMPDIR"] = '/content/temp_dir'
os.environ["PYTHONPATH"] = "."
os.environ["PREFIX"] = "INFER"
os.environ["HYDRA_FULL_ERROR"] = "1"
os.environ["USER"] = "micro"

print("======= WITHOUT LM DECODING=======")

!python examples/mms/asr/infer/mms_infer.py --model "/content/fairseq/models_new/mms1b_fl102.pt" --lang "eng" --audio "/content/fairseq/audio_samples/audio.wav" "/content/fairseq/audio_samples/audio_noisy.wav"

print("\n\n\n======= WITH LM DECODING=======")

# Note that the lmweight, wordscore needs to tuned for each LM 
# Using the same values may not be optimal
decoding_cmds = """
decoding.type=kenlm 
decoding.beam=500 
decoding.beamsizetoken=50 
decoding.lmweight=2.69
decoding.wordscore=2.8
decoding.lmpath=/content/lmdecode/lm_common_crawl_small_4gram_prun0-6-15_200kvocab.bin
decoding.lexicon=/content/lmdecode/lexicon.txt
""".replace("\n", " ")
!python examples/mms/asr/infer/mms_infer.py --model "/content/fairseq/models_new/mms1b_fl102.pt" --lang "eng" --audio "/content/fairseq/audio_samples/audio.wav" "/content/fairseq/audio_samples/audio_noisy.wav" \
    --extra-infer-args '{decoding_cmds}'