# Textless Speech-to-Speech Translation (S2ST) on Real Data

We provide instructions and pre-trained models for the work "Textless Speech-to-Speech Translation on Real Data (Lee et al. 2021)".

## Pre-trained Models

### HuBERT

| Model | Pretraining Data | Model | Quantizer |
|---|---|---|---|
| mHuBERT Base | VoxPopuli En, Es, Fr speech from the 100k subset | download | L11 km1000 |

### Unit-based HiFi-GAN vocoder

| Unit config | Unit size | Vocoder language | Dataset | Model |
|---|---|---|---|---|
| mHuBERT, layer 11 | 1000 | En | LJSpeech | ckpt, config |
| mHuBERT, layer 11 | 1000 | Es | CSS10 | ckpt, config |
| mHuBERT, layer 11 | 1000 | Fr | CSS10 | ckpt, config |

### Speech normalizer

| Language | Training data | Target unit config | Model |
|---|---|---|---|
| En | 10 mins | mHuBERT, layer 11, km1000 | download |
| En | 1 hr | mHuBERT, layer 11, km1000 | download |
| En | 10 hrs | mHuBERT, layer 11, km1000 | download |
| Es | 10 mins | mHuBERT, layer 11, km1000 | download |
| Es | 1 hr | mHuBERT, layer 11, km1000 | download |
| Es | 10 hrs | mHuBERT, layer 11, km1000 | download |
| Fr | 10 mins | mHuBERT, layer 11, km1000 | download |
| Fr | 1 hr | mHuBERT, layer 11, km1000 | download |
| Fr | 10 hrs | mHuBERT, layer 11, km1000 | download |
* Refer to the paper for the details of the training data.

## Inference with Pre-trained Models

### Speech normalizer

  1. Download the pre-trained models, including the dictionary, to DATA_DIR.
  2. Format the audio data.

```bash
# AUDIO_EXT: audio extension, e.g. wav, flac, etc.
# Assume all audio files are at ${AUDIO_DIR}/*.${AUDIO_EXT}

python examples/speech_to_speech/preprocessing/prep_sn_data.py \
  --audio-dir ${AUDIO_DIR} --ext ${AUDIO_EXT} \
  --data-name ${GEN_SUBSET} --output-dir ${DATA_DIR} \
  --for-inference
```
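`prep_sn_data.py` produces the `${GEN_SUBSET}.tsv` manifest consumed in the next step. As a rough sketch only (the exact column layout is an assumption here, modeled on fairseq's wav2vec/HuBERT-style audio manifests: a root directory line followed by one `<relative path>\t<num frames>` line per file), such a manifest can be written like this:

```python
from pathlib import Path

def write_manifest(audio_root, frames, out_tsv):
    # First line: the audio root directory.
    # Each following line: "<relative path>\t<number of frames>".
    # `frames` maps relative audio paths to frame counts.
    lines = [str(audio_root)] + [f"{p}\t{n}" for p, n in sorted(frames.items())]
    Path(out_tsv).write_text("\n".join(lines) + "\n")

# Toy example with made-up frame counts
write_manifest("/data/audio", {"a.wav": 16000, "b.wav": 32000}, "/tmp/test.tsv")
print(Path("/tmp/test.tsv").read_text())
```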
  3. Run the speech normalizer and post-process the output.

```bash
mkdir -p ${RESULTS_PATH}

python examples/speech_recognition/new/infer.py \
    --config-dir examples/hubert/config/decode/ \
    --config-name infer_viterbi \
    task.data=${DATA_DIR} \
    task.normalize=false \
    common_eval.results_path=${RESULTS_PATH}/log \
    common_eval.path=${DATA_DIR}/checkpoint_best.pt \
    dataset.gen_subset=${GEN_SUBSET} \
    '+task.labels=["unit"]' \
    +decoding.results_path=${RESULTS_PATH} \
    common_eval.post_process=none \
    +dataset.batch_size=1 \
    common_eval.quiet=True

# Post-process and generate output at ${RESULTS_PATH}/${GEN_SUBSET}.txt
python examples/speech_to_speech/preprocessing/prep_sn_output_data.py \
  --in-unit ${RESULTS_PATH}/hypo.units \
  --in-audio ${DATA_DIR}/${GEN_SUBSET}.tsv \
  --output-root ${RESULTS_PATH}
```
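Assuming the post-processed output contains one unit sequence per line with space-separated integer unit IDs (the same convention the vocoder input uses below), loading it is a few lines:

```python
def load_unit_sequences(path):
    # Each non-empty line holds space-separated unit IDs, e.g. "12 12 7 7 99".
    seqs = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                seqs.append([int(tok) for tok in line.split()])
    return seqs

# Toy file standing in for ${RESULTS_PATH}/${GEN_SUBSET}.txt
with open("/tmp/units.txt", "w") as f:
    f.write("12 12 7 7 99\n3 3 3\n")
print(load_unit_sequences("/tmp/units.txt"))  # [[12, 12, 7, 7, 99], [3, 3, 3]]
```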

### Unit-to-waveform conversion with unit vocoder

The pre-trained vocoders support generating audio from both full unit sequences and reduced unit sequences (i.e. with duplicate consecutive units removed). Set --dur-prediction when generating audio from reduced unit sequences.

```bash
# IN_CODE_FILE contains one unit sequence per line. Units are separated by space.

python examples/speech_to_speech/generate_waveform_from_code.py \
  --in-code-file ${IN_CODE_FILE} \
  --vocoder ${VOCODER_CKPT} --vocoder-cfg ${VOCODER_CFG} \
  --results-path ${RESULTS_PATH} --dur-prediction
```
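A reduced unit sequence is obtained by collapsing each run of identical consecutive units into a single unit; a minimal sketch of that reduction:

```python
from itertools import groupby

def reduce_units(units):
    # Collapse runs of identical consecutive units into one unit,
    # e.g. [5, 5, 5, 8, 8, 5] -> [5, 8, 5].
    return [u for u, _ in groupby(units)]

print(reduce_units([5, 5, 5, 8, 8, 5]))  # [5, 8, 5]
```

With --dur-prediction set, the vocoder's duration predictor restores per-unit durations at synthesis time, which is why the duplicates can be dropped from the input.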

## Training new models

To be updated.