# Direct speech-to-speech translation with discrete units
We provide the implementation for speech-to-unit translation (S2UT) proposed in "Direct speech-to-speech translation with discrete units (Lee et al. 2021)" and also the transformer-based implementation of the speech-to-spectrogram translation (S2SPECT, or transformer-based Translatotron) baseline in the paper.
## Pretrained Models

### Unit-based HiFi-GAN vocoder

| Unit config | Unit size | Vocoder dataset | Model |
|---|---|---|---|
| HuBERT Base, Librispeech, layer 6 | 100 | LJSpeech | ckpt, config |
## Data preparation

### Target speech

1. Set up two folders, `$SRC_AUDIO` and `$TGT_AUDIO`, with `${SPLIT}/${SAMPLE_ID}.wav` for source and target speech under each folder, respectively. Note that for S2UT experiments the target audio sampling rate should be 16,000 Hz, and for S2SPECT experiments the target audio sampling rate is recommended to be 22,050 Hz.
2. Prepare target discrete units for S2UT model training by quantizing the target speech, and set the output unit files (`--out_quantized_file_path`) as `${TGT_AUDIO}/${SPLIT}.txt`. In Lee et al. 2021, we use 100 units from the sixth layer (`--layer 6`) of the HuBERT Base model.
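The per-split folder layout and the 16 kHz target-rate requirement for S2UT can be sanity-checked before preprocessing. A minimal sketch using only the standard library (the `check_pairing` helper is hypothetical, not part of fairseq):

```python
import wave
from pathlib import Path

def check_pairing(src_root, tgt_root, split, expected_tgt_rate=16000):
    # Hypothetical helper: confirm every source clip has a same-named
    # target clip recorded at the expected sampling rate
    # (16,000 Hz for S2UT targets, per the note above).
    problems = []
    for src_wav in sorted(Path(src_root, split).glob("*.wav")):
        tgt_wav = Path(tgt_root, split, src_wav.name)
        if not tgt_wav.exists():
            problems.append(f"missing target for {src_wav.name}")
            continue
        with wave.open(str(tgt_wav)) as w:
            rate = w.getframerate()
        if rate != expected_tgt_rate:
            problems.append(f"{tgt_wav.name}: {rate} Hz != {expected_tgt_rate} Hz")
    return problems
```

Run it once per split and fix any reported clips before invoking the preprocessing scripts below.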
### Speech-to-speech data

**S2UT**

* Set `--reduce-unit` for training an S2UT reduced model.
* A pre-trained vocoder and its config (`$VOCODER_CKPT`, `$VOCODER_CFG`) can be downloaded from the Pretrained Models section. They are not required if `--eval-inference` is not going to be set during model training.

```
# $SPLIT1, $SPLIT2, etc. are split names such as train, dev, test, etc.
python examples/speech_to_speech/preprocessing/prep_s2ut_data.py \
  --source-dir $SRC_AUDIO --target-dir $TGT_AUDIO --data-split $SPLIT1 $SPLIT2 \
  --output-root $DATA_ROOT --reduce-unit \
  --vocoder-checkpoint $VOCODER_CKPT --vocoder-cfg $VOCODER_CFG
```
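The "reduced" target format selected by `--reduce-unit` collapses consecutive repeated units into a single token (Lee et al. 2021). A minimal sketch of that transformation (the `reduce_units` helper is illustrative, not fairseq's implementation):

```python
from itertools import groupby

def reduce_units(units):
    # Collapse runs of identical discrete units, e.g.
    # [5, 5, 9, 9, 9, 2] -> [5, 9, 2]; the unit-based HiFi-GAN vocoder
    # later restores durations via duration prediction (--dur-prediction).
    return [u for u, _ in groupby(units)]
```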
**S2SPECT**

```
# $SPLIT1, $SPLIT2, etc. are split names such as train, dev, test, etc.
python examples/speech_to_speech/preprocessing/prep_s2spect_data.py \
  --source-dir $SRC_AUDIO --target-dir $TGT_AUDIO --data-split $SPLIT1 $SPLIT2 \
  --output-root $DATA_ROOT
```
### Multitask data

* For each multitask `$TASK_NAME`, prepare `${DATA_ROOT}/${TASK_NAME}/${SPLIT}.tsv` files for each split following the format below (two tab-separated columns; the sample ids should match the sample ids for the speech-to-speech data in `${DATA_ROOT}/${SPLIT}.tsv`):

```
id tgt_text
sample_id_0 token1 token2 token3 ...
sample_id_1 token1 token2 token3 ...
...
```

* For each multitask `$TASK_NAME`, prepare `${DATA_ROOT}/${TASK_NAME}/dict.txt`, a dictionary in fairseq format with all tokens for the targets for `$TASK_NAME`.
* Create `config_multitask.yaml`. Below is an example of the config used for S2UT reduced with Fisher experiments, including two encoder multitasks (`source_letter`, `target_letter`) and one decoder CTC task (`decoder_target_ctc`):

```yaml
source_letter:  # $TASK_NAME
  decoder_type: transformer
  dict: ${DATA_ROOT}/source_letter/dict.txt
  data: ${DATA_ROOT}/source_letter
  encoder_layer: 6
  loss_weight: 8.0
target_letter:
  decoder_type: transformer
  dict: ${DATA_ROOT}/target_letter/dict.txt
  data: ${DATA_ROOT}/target_letter
  encoder_layer: 8
  loss_weight: 8.0
decoder_target_ctc:
  decoder_type: ctc
  dict: ${DATA_ROOT}/decoder_target_ctc/dict.txt
  data: ${DATA_ROOT}/decoder_target_ctc
  decoder_layer: 3
  loss_weight: 1.6
```
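The sample-id consistency requirement above can be checked with a short script. A sketch that assumes both manifests carry an `id` column in their header row (the `multitask_id_mismatches` helper is illustrative, not part of fairseq):

```python
import csv
from pathlib import Path

def multitask_id_mismatches(data_root, task_name, split):
    # Return multitask sample ids that have no counterpart in the
    # speech-to-speech manifest ${DATA_ROOT}/${SPLIT}.tsv.
    def read_ids(tsv_path):
        with open(tsv_path, newline="") as f:
            return {row["id"] for row in csv.DictReader(f, delimiter="\t")}
    main_ids = read_ids(Path(data_root) / f"{split}.tsv")
    task_ids = read_ids(Path(data_root) / task_name / f"{split}.tsv")
    return sorted(task_ids - main_ids)
```

An empty result for every task and split means the multitask targets line up with the speech-to-speech data.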
## Training

### Speech-to-unit translation (S2UT)
Here's an example for training Fisher S2UT models with 100 discrete units as target:
```
fairseq-train $DATA_ROOT \
  --config-yaml config.yaml --multitask-config-yaml config_multitask.yaml \
  --task speech_to_speech --target-is-code --target-code-size 100 --vocoder code_hifigan \
  --criterion speech_to_unit --label-smoothing 0.2 \
  --arch s2ut_transformer_fisher --share-decoder-input-output-embed \
  --dropout 0.1 --attention-dropout 0.1 --relu-dropout 0.1 \
  --train-subset train --valid-subset dev \
  --save-dir ${MODEL_DIR} \
  --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-init-lr 1e-7 --warmup-updates 10000 \
  --optimizer adam --adam-betas "(0.9,0.98)" --clip-norm 10.0 \
  --max-update 400000 --max-tokens 20000 --max-target-positions 3000 --update-freq 4 \
  --seed 1 --fp16 --num-workers 8
```
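The `--lr 0.0005 --lr-scheduler inverse_sqrt --warmup-init-lr 1e-7 --warmup-updates 10000` flags give a linear warmup followed by inverse-square-root decay. A sketch of the resulting schedule (my reading of fairseq's `inverse_sqrt` scheduler; verify against your fairseq version):

```python
import math

def inverse_sqrt_lr(num_updates, lr=5e-4, warmup_init_lr=1e-7, warmup_updates=10000):
    # Linear warmup from warmup_init_lr to lr over warmup_updates steps,
    # then decay proportional to 1/sqrt(num_updates), matching the flags above.
    if num_updates < warmup_updates:
        return warmup_init_lr + (lr - warmup_init_lr) * num_updates / warmup_updates
    return lr * math.sqrt(warmup_updates) / math.sqrt(num_updates)
```

With these defaults, the learning rate peaks at 5e-4 after 10k updates and halves every time the update count quadruples.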
* Adjust `--update-freq` accordingly for different numbers of GPUs. In the above we set `--update-freq 4` to simulate training with 4 GPUs.
* Use `--n-frames-per-step 5` to train an S2UT stacked system with reduction ratio r=5. (Use `$DATA_ROOT` prepared without `--reduce-unit`.)
* (Optional) Track MCD loss for checkpoint selection during training by setting `--eval-inference --eval-args '{"beam": 1, "max_len_a": 1}' --best-checkpoint-metric mcd_loss`. It is recommended to sample a smaller subset as the validation set, as MCD loss computation is time-consuming.

### Speech-to-spectrogram translation (S2SPECT)
Here's an example for training Fisher S2SPECT models with reduction ratio r=5:
```
fairseq-train $DATA_ROOT \
  --config-yaml config.yaml --multitask-config-yaml config_multitask.yaml \
  --task speech_to_speech --n-frames-per-step 5 \
  --criterion speech_to_spectrogram \
  --arch s2spect_transformer_fisher --decoder-normalize-before \
  --dropout 0.1 --attention-dropout 0.1 --relu-dropout 0.1 \
  --train-subset train --valid-subset dev \
  --save-dir ${MODEL_DIR} \
  --eval-inference --best-checkpoint-metric mcd_loss \
  --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-init-lr 1e-7 --warmup-updates 10000 \
  --optimizer adam --adam-betas "(0.9,0.98)" --clip-norm 10.0 --weight-decay 1e-6 \
  --max-update 400000 --max-tokens 80000 --max-tokens-valid 30000 --required-batch-size-multiple 1 \
  --max-target-positions 3000 --update-freq 16 \
  --seed 1 --fp16 --num-workers 8
```
* Adjust `--update-freq` accordingly for different numbers of GPUs. In the above we set `--update-freq 16` to simulate training with 16 GPUs.

### Unit-based HiFi-GAN vocoder
The vocoder is trained with the speech-resynthesis repo. See here for instructions on how to train the unit-based HiFi-GAN vocoder with duration prediction. The same vocoder can support waveform generation for both reduced unit sequences (with --dur-prediction set during inference) and original unit sequences.
## Inference

### Speech-to-unit translation (S2UT)

1. Generate unit sequences with `fairseq-generate`; the decoding results are written to `${RESULTS_PATH}/generate-${GEN_SUBSET}.txt`:

```
fairseq-generate $DATA_ROOT \
  --config-yaml config.yaml --multitask-config-yaml config_multitask.yaml \
  --task speech_to_speech --target-is-code --target-code-size 100 --vocoder code_hifigan \
  --path $MODEL_DIR/checkpoint_best.pt --gen-subset $GEN_SUBSET \
  --max-tokens 50000 \
  --beam 10 --max-len-a 1 \
  --results-path ${RESULTS_PATH}
```
* Set `--beam 1 --n-frames-per-step $r` for decoding with S2UT stacked models.

2. Convert the unit sequences to waveform:

```
grep "^D\-" ${RESULTS_PATH}/generate-${GEN_SUBSET}.txt | \
  sed 's/^D-//ig' | sort -nk1 | cut -f3 \
  > ${RESULTS_PATH}/generate-${GEN_SUBSET}.unit

python examples/speech_to_speech/generate_waveform_from_code.py \
  --in-code-file ${RESULTS_PATH}/generate-${GEN_SUBSET}.unit \
  --vocoder $VOCODER_CKPT --vocoder-cfg $VOCODER_CFG \
  --results-path ${RESULTS_PATH} --dur-prediction
```
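The grep/sed/sort/cut pipeline above keeps the hypothesis (`D-`) lines from the fairseq-generate output, orders them by sample id, and keeps only the unit field. The same step as a short Python sketch (the `extract_units` helper is illustrative):

```python
def extract_units(generate_output_lines):
    # fairseq-generate writes hypotheses as "D-<id>\t<score>\t<units>";
    # keep those lines, sort numerically by sample id, return the units.
    hyps = []
    for line in generate_output_lines:
        if line.startswith("D-"):
            sample_id, _score, units = line[2:].rstrip("\n").split("\t")
            hyps.append((int(sample_id), units))
    return [units for _, units in sorted(hyps)]
```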
* Set `--dur-prediction` for generating audio for S2UT reduced models.

### Speech-to-spectrogram translation (S2SPECT)
Follow the same inference process as in fairseq-S^2 to generate waveform.
```
# assume using a default Griffin-Lim vocoder
python examples/speech_synthesis/generate_waveform.py $DATA_ROOT \
  --config-yaml config.yaml --multitask-config-yaml config_multitask.yaml \
  --task speech_to_speech --n-frames-per-step 5 \
  --path $MODEL_DIR/checkpoint_best.pt --gen-subset $GEN_SUBSET \
  --max-tokens 50000 \
  --results-path ${RESULTS_PATH} --dump-waveforms --output-sample-rate 16000
```
In addition to using the default Griffin-Lim vocoder, one can also finetune a HiFi-GAN vocoder for the S2SPECT model by following the instructions in the HiFi-GAN repo.
### Multitask decoding
Coming soon.
## Evaluation

To evaluate speech translation output, we first apply ASR on the speech output and then compute the BLEU score between the ASR decoded text and the references using sacreBLEU.
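As a reference point for what the metric measures, here is a minimal corpus-level BLEU sketch (uniform n-gram weights, single reference, no tokenization or smoothing). It is illustrative only; use sacreBLEU itself for any reported numbers:

```python
import math
from collections import Counter

def corpus_bleu(hyps, refs, max_n=4):
    # Clipped n-gram precisions (n = 1..max_n) combined geometrically,
    # times a brevity penalty; hyps/refs are whitespace-tokenized strings.
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    match, total = [0] * max_n, [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngrams(h, n), ngrams(r, n)
            match[n - 1] += sum(min(c, rc[g]) for g, c in hc.items())
            total[n - 1] += max(len(h) - n + 1, 0)
    if min(match) == 0:
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * bp * math.exp(log_prec)
```

Note that sacreBLEU additionally standardizes tokenization and smoothing, which is exactly why it is the tool used for the reported scores.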
En