Speech Recognition Pre-Training

Wav2Vec2 Speech Pre-Training

The script run_speech_wav2vec2_pretraining_no_trainer.py can be used to pre-train a Wav2Vec2 model from scratch.

In the script run_speech_wav2vec2_pretraining_no_trainer, a Wav2Vec2 model is pre-trained on audio data alone using Wav2Vec2's contrastive loss objective.

The following examples show how to fine-tune a "base"-sized Wav2Vec2 model as well as a "large"-sized Wav2Vec2 model using accelerate.

NOTE 1

Wav2Vec2's pre-training is known to be quite unstable. It is advised to do a couple of test runs with a smaller dataset, i.e. --dataset_config_names clean clean, --dataset_split_names validation test to find good hyper-parameters for learning_rate, batch_size, num_warmup_steps, and the optimizer. A good metric to observe during training is the gradient norm which should ideally be between 0.5 and 2.

NOTE 2

When training a model on large datasets it is recommended to run the data preprocessing in a first run in a non-distributed mode via --preprocessing_only so that when running the model in distributed mode in a second step the preprocessed data can easily be loaded on each distributed device.

Demo

In this demo run we pre-train a "base-sized" Wav2Vec2 model simply only on the validation and test data of librispeech_asr.

The demo is run on two Titan RTX (24 GB RAM each). In case you have less RAM available per device, consider reducing --batch_size and/or the --max_duration_in_seconds.

bash

accelerate launch run_wav2vec2_pretraining_no_trainer.py \
	--dataset_name="librispeech_asr" \
	--dataset_config_names clean clean \
	--dataset_split_names validation test \
	--model_name_or_path="patrickvonplaten/wav2vec2-base-v2" \
	--output_dir="./wav2vec2-pretrained-demo" \
	--max_train_steps="20000" \
	--num_warmup_steps="32000" \
	--gradient_accumulation_steps="8" \
	--learning_rate="0.005" \
	--weight_decay="0.01" \
	--max_duration_in_seconds="20.0" \
	--min_duration_in_seconds="2.0" \
	--logging_steps="1" \
	--saving_steps="10000" \
	--per_device_train_batch_size="8" \
	--per_device_eval_batch_size="8" \
	--adam_beta1="0.9" \
	--adam_beta2="0.98" \
	--adam_epsilon="1e-06" \
	--gradient_checkpointing \
	--mask_time_prob="0.65" \
	--mask_time_length="10"

The results of this run can be seen here.

Base

To pre-train "base-sized" Wav2Vec2 model, e.g. facebook/wav2vec2-base on librispeech_asr, the following command can be run:

bash

accelerate launch run_wav2vec2_pretraining_no_trainer.py \
	--dataset_name=librispeech_asr \
	--dataset_config_names clean clean other \
	--dataset_split_names train.100 train.360 train.500 \
	--model_name_or_path="patrickvonplaten/wav2vec2-base-v2" \
	--output_dir="./wav2vec2-pretrained-demo" \
	--max_train_steps="200000" \
	--num_warmup_steps="32000" \
	--gradient_accumulation_steps="4" \
	--learning_rate="0.001" \
	--weight_decay="0.01" \
	--max_duration_in_seconds="20.0" \
	--min_duration_in_seconds="2.0" \
	--logging_steps="1" \
	--saving_steps="10000" \
	--per_device_train_batch_size="8" \
	--per_device_eval_batch_size="8" \
	--adam_beta1="0.9" \
	--adam_beta2="0.98" \
	--adam_epsilon="1e-06" \
	--gradient_checkpointing \
	--mask_time_prob="0.65" \
	--mask_time_length="10"

The experiment was run on 8 GPU V100 (16 GB RAM each) for 4 days. In case you have more than 8 GPUs available for a higher effective batch_size, it is recommended to increase the learning_rate to 0.005 for faster convergence.

The results of this run can be seen here and the checkpoint pretrained for 85,000 steps can be accessed here

Large

To pre-train "large-sized" Wav2Vec2 model, e.g. facebook/wav2vec2-large-lv60, on librispeech_asr, the following command can be run:

bash

accelerate launch run_wav2vec2_pretraining_no_trainer.py \
	--dataset_name=librispeech_asr \
	--dataset_config_names clean clean other \
	--dataset_split_names train.100 train.360 train.500 \
	--output_dir=./test \
	--max_train_steps=200000 \
	--num_warmup_steps=32000 \
	--gradient_accumulation_steps=8 \
	--learning_rate=0.001 \
	--weight_decay=0.01 \
	--max_duration_in_seconds=20.0 \
	--min_duration_in_seconds=2.0 \
	--model_name_or_path=./ \
	--logging_steps=1 \
	--saving_steps=10000 \
	--per_device_train_batch_size=2 \
	--per_device_eval_batch_size=4 \
	--adam_beta1=0.9 \
	--adam_beta2=0.98 \
	--adam_epsilon=1e-06 \
	--gradient_checkpointing \
	--mask_time_prob=0.65 \
	--mask_time_length=10

The experiment was run on 8 GPU V100 (16 GB RAM each) for 7 days. In case you have more than 8 GPUs available for a higher effective batch_size, it is recommended to increase the learning_rate to 0.005 for faster convergence.

The results of this run can be seen here and the checkpoint pretrained for 120,000 steps can be accessed here