# Textless Speech-to-Speech Translation (S2ST) on Real Data

We provide instructions and pre-trained models for the work "Textless Speech-to-Speech Translation on Real Data (Lee et al. 2021)".

## Pre-trained Models

### HuBERT

| Model | Pretraining Data | Model | Quantizer |
|---|---|---|---|
| mHuBERT Base | VoxPopuli En, Es, Fr speech from the 100k subset | download | L11 km1000 |

### Unit-based HiFi-GAN vocoder

| Unit config | Unit size | Vocoder language | Dataset | Model |
|---|---|---|---|---|
| mHuBERT, layer 11 | 1000 | En | LJSpeech | ckpt, config |
| mHuBERT, layer 11 | 1000 | Es | CSS10 | ckpt, config |
| mHuBERT, layer 11 | 1000 | Fr | CSS10 | ckpt, config |

### Speech normalizer

| Language | Training data | Target unit config | Model |
|---|---|---|---|
| En | 10 mins | mHuBERT, layer 11, km1000 | download |
| En | 1 hr | mHuBERT, layer 11, km1000 | download |
| En | 10 hrs | mHuBERT, layer 11, km1000 | download |
| Es | 10 mins | mHuBERT, layer 11, km1000 | download |
| Es | 1 hr | mHuBERT, layer 11, km1000 | download |
| Es | 10 hrs | mHuBERT, layer 11, km1000 | download |
| Fr | 10 mins | mHuBERT, layer 11, km1000 | download |
| Fr | 1 hr | mHuBERT, layer 11, km1000 | download |
| Fr | 10 hrs | mHuBERT, layer 11, km1000 | download |
* Refer to the paper for the details of the training data.

## Inference with Pre-trained Models

### Speech normalizer

  1. Download the pre-trained models, including the dictionary, to DATA_DIR.
  2. Format the audio data.

```bash
# AUDIO_EXT: audio extension, e.g. wav, flac, etc.
# Assume all audio files are at ${AUDIO_DIR}/*.${AUDIO_EXT}

python examples/speech_to_speech/preprocessing/prep_sn_data.py \
  --audio-dir ${AUDIO_DIR} --ext ${AUDIO_EXT} \
  --data-name ${GEN_SUBSET} --output-dir ${DATA_DIR} \
  --for-inference
```
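`prep_sn_data.py` produces the `${GEN_SUBSET}.tsv` manifest consumed in the next step. As a rough sketch only (the exact column layout is an assumption here, modeled on fairseq's wav2vec/HuBERT-style audio manifests: a root directory line followed by one `<relative path>\t<num frames>` line per file), such a manifest can be written like this:

```python
from pathlib import Path

def write_manifest(audio_root, frames, out_tsv):
    # First line: the audio root directory.
    # Each following line: "<relative path>\t<number of frames>".
    # `frames` maps relative audio paths to frame counts.
    lines = [str(audio_root)] + [f"{p}\t{n}" for p, n in sorted(frames.items())]
    Path(out_tsv).write_text("\n".join(lines) + "\n")

# Toy example with made-up frame counts
write_manifest("/data/audio", {"a.wav": 16000, "b.wav": 32000}, "/tmp/test.tsv")
print(Path("/tmp/test.tsv").read_text())
```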
  3. Run the speech normalizer and post-process the output.

```bash
mkdir -p ${RESULTS_PATH}

python examples/speech_recognition/new/infer.py \
    --config-dir examples/hubert/config/decode/ \
    --config-name infer_viterbi \
    task.data=${DATA_DIR} \
    task.normalize=false \
    common_eval.results_path=${RESULTS_PATH}/log \
    common_eval.path=${DATA_DIR}/checkpoint_best.pt \
    dataset.gen_subset=${GEN_SUBSET} \
    '+task.labels=["unit"]' \
    +decoding.results_path=${RESULTS_PATH} \
    common_eval.post_process=none \
    +dataset.batch_size=1 \
    common_eval.quiet=True

# Post-process and generate output at ${RESULTS_PATH}/${GEN_SUBSET}.txt
python examples/speech_to_speech/preprocessing/prep_sn_output_data.py \
  --in-unit ${RESULTS_PATH}/hypo.units \
  --in-audio ${DATA_DIR}/${GEN_SUBSET}.tsv \
  --output-root ${RESULTS_PATH}
```
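Assuming the post-processed output contains one unit sequence per line with space-separated integer unit IDs (the same convention the vocoder input uses below), loading it is a few lines:

```python
def load_unit_sequences(path):
    # Each non-empty line holds space-separated unit IDs, e.g. "12 12 7 7 99".
    seqs = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                seqs.append([int(tok) for tok in line.split()])
    return seqs

# Toy file standing in for ${RESULTS_PATH}/${GEN_SUBSET}.txt
with open("/tmp/units.txt", "w") as f:
    f.write("12 12 7 7 99\n3 3 3\n")
print(load_unit_sequences("/tmp/units.txt"))  # [[12, 12, 7, 7, 99], [3, 3, 3]]
```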

### Unit-to-waveform conversion with unit vocoder

The pre-trained vocoders support generating audio from both full unit sequences and reduced unit sequences (i.e. with duplicate consecutive units removed). Set --dur-prediction when generating audio from reduced unit sequences.

```bash
# IN_CODE_FILE contains one unit sequence per line. Units are separated by space.

python examples/speech_to_speech/generate_waveform_from_code.py \
  --in-code-file ${IN_CODE_FILE} \
  --vocoder ${VOCODER_CKPT} --vocoder-cfg ${VOCODER_CFG} \
  --results-path ${RESULTS_PATH} --dur-prediction
```
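A reduced unit sequence is obtained by collapsing each run of identical consecutive units into a single unit; a minimal sketch of that reduction:

```python
from itertools import groupby

def reduce_units(units):
    # Collapse runs of identical consecutive units into one unit,
    # e.g. [5, 5, 5, 8, 8, 5] -> [5, 8, 5].
    return [u for u, _ in groupby(units)]

print(reduce_units([5, 5, 5, 8, 8, 5]))  # [5, 8, 5]
```

With --dur-prediction set, the vocoder's duration predictor restores per-unit durations at synthesis time, which is why the duplicates can be dropped from the input.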

## Training new models

To be updated.