examples/textless_nlp/dgslm/README.md
This repo contains the code and pre-trained models for the paper Generative Spoken Dialogue Language Modeling.
<details><summary>Paper abstract</summary>

We introduce dGSLM, the first "textless" model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech, laughter and other paralinguistic signals in the two channels simultaneously and reproduces more naturalistic and fluid turn taking compared to a text-based cascaded model.

</details>
The hubert_fisher repository contains the pre-trained models and recipes to produce discrete units for the dGSLM model.
The vocoder_hifigan repository contains the vocoder and recipes to synthesize the waveform from the discrete units.
We share the pre-trained model checkpoint for the best configuration in the paper (the DLM-5 model with the Edge Unit Prediction and Delayed Duration Prediction objectives), dubbed SpeechDLM and trained on the 2000 hours of the Fisher dataset:
| Pre-trained SpeechDLM model trained on Fisher dataset |
|---|
| model checkpoint - dictionary 1 - dictionary 2 |
| The two dictionary files correspond to the two channels and have identical content. |
You can sample from a trained SpeechDLM model interactively:
from fairseq.models.speech_dlm import SpeechDLM
# Load SpeechDLM model
speech_dlm = SpeechDLM.from_pretrained(
    model_name_or_path='/path/to/model/dir',
    checkpoint_file='speech_dlm_base.pt',
    data_name_or_path='/path/to/data/dir'
)
# Disable dropout
speech_dlm.eval()
# Move model to GPU
speech_dlm.cuda()
# Define the input sequences
input_sequences = [{
    'unitA': '7 376 376 133 178 486 486 486 486 486 486 486 486 2 486',
    'unitB': '7 499 415 177 7 7 7 7 7 7 136 136 289 289 408'
}]
# Sample from the SpeechDLM model
generated_units = speech_dlm.sample(
    input_sequences,
    max_len_a=0,
    max_len_b=500,
    sampling=True,
    beam=5,
)
# >> {'unitA': '7 376 376 133 178 486 486 486 486 486 486 486 486 2 486 486 178 486 486 2 2 376 376 486 486 486 376 376 387 387 ...',
# >> 'unitB': '7 499 415 177 7 7 7 7 7 7 136 136 289 289 408 32 428 95 356 141 331 439 350 350 192 331 445 202 104 104 ...'}
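The example output above suggests that each returned string repeats the prompt units and then appends the newly sampled ones. Under that assumption, a small helper can separate the continuation from the prompt. The name `split_continuation` is illustrative and not part of fairseq.

```python
from __future__ import annotations


def split_continuation(prompt: str, generated: str) -> list[int]:
    """Return the units in `generated` that follow the `prompt` prefix.

    Assumes (from the example output above) that the generated string
    starts with the prompt units.
    """
    prompt_units = [int(u) for u in prompt.split()]
    generated_units = [int(u) for u in generated.split()]
    if generated_units[: len(prompt_units)] != prompt_units:
        raise ValueError("generated sequence does not start with the prompt")
    return generated_units[len(prompt_units):]


# unitB prompt from the example above, with the first five generated units.
prompt_b = "7 499 415 177 7 7 7 7 7 7 136 136 289 289 408"
generated_b = "7 499 415 177 7 7 7 7 7 7 136 136 289 289 408 32 428 95 356 141"
new_units_b = split_continuation(prompt_b, generated_b)
print(new_units_b)  # [32, 428, 95, 356, 141]
```

Raising an error when the prefix does not match makes a wrong assumption about the output format visible immediately instead of silently returning misaligned units.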
Alternatively, use the sample_speech_dlm.py script:
python sample_speech_dlm.py \
--in-file $INPUT_CODE_FILE --out-file $OUTPUT_FILE \
--ckpt $CHECKPOINT_PATH --data $DATA_DIR
where each line of INPUT_CODE_FILE is a dictionary with the keys 'audio', 'unitA', and 'unitB', as follows:
{'audio': 'file_1', 'unitA': '8 8 ... 352 352', 'unitB': '217 8 ... 8 8'}
{'audio': 'file_2', 'unitA': '5 5 ... 65 65', 'unitB': '6 35 ... 8 9'}
...
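The example lines use single quotes, so they read as Python dictionary literals rather than JSON. Here is a minimal sketch of writing and reading such lines, assuming that format holds for the whole file. The helper names are illustrative and are not taken from fairseq or from sample_speech_dlm.py.

```python
import ast


def parse_code_line(line: str) -> dict:
    """Parse one line of the input code file into a dict of strings."""
    entry = ast.literal_eval(line.strip())
    missing = {"audio", "unitA", "unitB"} - entry.keys()
    if missing:
        raise KeyError(f"missing keys: {sorted(missing)}")
    return entry


def format_code_line(audio: str, unit_a: str, unit_b: str) -> str:
    """Format one entry in the same single-quoted style as the examples."""
    return repr({"audio": audio, "unitA": unit_a, "unitB": unit_b})


# Round trip on a made-up entry (the unit values are placeholders).
line = format_code_line("file_1", "8 8 352 352", "217 8 8 8")
entry = parse_code_line(line)
print(line)
print(len(entry["unitA"].split()), len(entry["unitB"].split()))
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for files that may come from elsewhere.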
This code file can be created with the create_input_code.py script from the outputs of quantize_with_kmeans.py:
python examples/textless_nlp/dgslm/vocoder_hifigan/create_input_code.py \
$CHANNEL1_UNITS $CHANNEL2_UNITS $OUTPUT_CODE_FILE
First, prepare the raw dataset. For each split (train, valid), you need two files, one per channel (named unitA and unitB, for example), each containing the units of that channel. Make sure the two files have the same number of lines and that corresponding lines have the same number of units.
Here is an example of a .unitA file:
7 376 376 133 178
486 486 486
486 376
and the corresponding .unitB file:
7 499 415 177 7
7 7 136
331 445
These two files can be obtained with the example command from hubert_fisher, with the --hide-fname option added.
The raw dataset directory should contain the following files:
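Because the two channel files must match line by line, it can help to verify them before binarizing. The following is a minimal sketch of such a check. The function name `check_parallel_units` is hypothetical, and the demo writes the example files shown above to a temporary directory.

```python
from itertools import zip_longest
from pathlib import Path
import tempfile


def check_parallel_units(path_a, path_b) -> int:
    """Check two channel files line by line; return the number of lines."""
    n_lines = 0
    with open(path_a) as file_a, open(path_b) as file_b:
        # zip_longest yields None for the shorter file, exposing a length mismatch.
        for n_lines, (a, b) in enumerate(zip_longest(file_a, file_b), start=1):
            if a is None or b is None:
                raise ValueError(f"files differ in length at line {n_lines}")
            if len(a.split()) != len(b.split()):
                raise ValueError(
                    f"line {n_lines}: {len(a.split())} vs {len(b.split())} units"
                )
    return n_lines


# Demo on the example .unitA and .unitB contents shown above.
with tempfile.TemporaryDirectory() as tmp:
    unit_a = Path(tmp, "train.unitA")
    unit_b = Path(tmp, "train.unitB")
    unit_a.write_text("7 376 376 133 178\n486 486 486\n486 376\n")
    unit_b.write_text("7 499 415 177 7\n7 7 136\n331 445\n")
    n_lines = check_parallel_units(unit_a, unit_b)
print(n_lines)  # 3
```

The files are read as streams, so the check does not need to hold a full split in memory.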
train.unitA valid.unitA
train.unitB valid.unitB
Next, preprocess (binarize) the data with fairseq-preprocess. Preprocess each channel separately and rename the preprocessed files to the format ${split}.${channel}.{bin,idx}. Each channel also needs its own dictionary file, named dict.${channel}.txt.
Here is an example preprocessing script:
# Preprocess the first channel (unitA)
fairseq-preprocess --source-lang unitA \
--only-source \
--trainpref $RAW_DATA_DIR/train \
--validpref $RAW_DATA_DIR/valid \
--destdir $BIN_DATA_DIR \
--workers 20
# Preprocess the second channel (unitB) and reuse the dictionary from the first channel
fairseq-preprocess --source-lang unitB \
--srcdict $BIN_DATA_DIR/dict.unitA.txt \
--only-source \
--trainpref $RAW_DATA_DIR/train \
--validpref $RAW_DATA_DIR/valid \
--destdir $BIN_DATA_DIR \
--workers 20
# Rename the bin & index files
for channel in unitA unitB; do
    for split in train valid; do
        mv "$BIN_DATA_DIR/${split}.${channel}-None.${channel}.bin" "$BIN_DATA_DIR/${split}.${channel}.bin"
        mv "$BIN_DATA_DIR/${split}.${channel}-None.${channel}.idx" "$BIN_DATA_DIR/${split}.${channel}.idx"
    done
done
Finally, the preprocessed (bin) dataset directory should contain the following files:
dict.unitA.txt train.unitA.idx train.unitA.bin valid.unitA.idx valid.unitA.bin
dict.unitB.txt train.unitB.idx train.unitB.bin valid.unitB.idx valid.unitB.bin
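A quick way to confirm that the renaming step produced this layout is to list any expected file that is absent. The sketch below assumes the channel and split names used in this README. The helper names are hypothetical.

```python
from pathlib import Path
import tempfile


def expected_files(channels=("unitA", "unitB"), splits=("train", "valid")):
    """List the file names the binarized directory should contain."""
    names = [f"dict.{channel}.txt" for channel in channels]
    for channel in channels:
        for split in splits:
            names.append(f"{split}.{channel}.idx")
            names.append(f"{split}.{channel}.bin")
    return names


def missing_files(bin_data_dir, **kwargs):
    """Return the expected file names that are absent from bin_data_dir."""
    root = Path(bin_data_dir)
    return [name for name in expected_files(**kwargs) if not (root / name).is_file()]


# Demo: create every expected file except one, then report what is missing.
with tempfile.TemporaryDirectory() as tmp:
    for name in expected_files():
        if name != "valid.unitB.bin":
            Path(tmp, name).touch()
    missing = missing_files(tmp)
print(missing)  # ['valid.unitB.bin']
```

In practice, `missing_files` would be called on the directory passed as `$BIN_DATA_DIR`, and an empty list means the layout matches.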
To train SpeechDLM (with the same configuration as the pre-trained model) on 2 GPUs:
fairseq-train $BIN_DATA_DIR \
--save-dir $CHECKPOINT_DIR \
--tensorboard-logdir $CHECKPOINT_DIR \
--task speech_dlm_task --channels unitA,unitB \
--next-unit-prediction "False" --edge-unit-prediction "True" \
--duration-prediction "True" --delayed-duration-target "True" \
--criterion speech_dlm_criterion \
--arch speech_dlm --decoder-cross-layers 4 \
--share-decoder-input-output-embed \
--dropout 0.1 --attention-dropout 0.1 \
--optimizer adam --adam-betas "(0.9, 0.98)" --clip-norm 1.0 \
--lr 0.0005 --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 \
--max-tokens 18432 --tokens-per-sample 6144 --sample-break-mode none \
--update-freq 16 --num-workers 4 --skip-invalid-size-inputs-valid-test \
--max-update 250000 --warmup-updates 20000 \
--save-interval-updates 10000 --keep-last-epochs 1 --no-epoch-checkpoints \
--log-interval 50 --seed 100501 \
--fp16 --checkpoint-activations
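To make the optimization settings above more concrete, the sketch below reproduces the learning-rate curve implied by `--lr-scheduler inverse_sqrt` with the values from the command. It assumes the usual behavior of that scheduler: a linear warmup from `--warmup-init-lr` to `--lr` over `--warmup-updates`, followed by decay proportional to the inverse square root of the update number. It also computes an upper bound on the tokens per optimizer step, assuming `--max-tokens` applies per GPU and gradients are accumulated over `--update-freq` batches. This is an illustration, not fairseq's implementation.

```python
import math


def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_init_lr=1e-7, warmup_updates=20000):
    """Learning rate at a given update under the assumed schedule."""
    if step < warmup_updates:
        # Linear warmup from warmup_init_lr to peak_lr.
        return warmup_init_lr + step * (peak_lr - warmup_init_lr) / warmup_updates
    # Inverse square root decay, continuous with the warmup at warmup_updates.
    return peak_lr * math.sqrt(warmup_updates / step)


for step in (0, 10000, 20000, 80000, 250000):
    print(step, f"{inverse_sqrt_lr(step):.3e}")

# Upper bound on tokens per optimizer step: max-tokens x update-freq x GPUs.
max_tokens_per_update = 18432 * 16 * 2
print(max_tokens_per_update)  # 589824
```

Under these assumptions, the learning rate peaks at update 20000 and has halved by update 80000, since the decay factor there is sqrt(20000 / 80000) = 0.5.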
The model can be validated with the fairseq-validate command:
fairseq-validate $BIN_DATA_DIR \
--task speech_dlm_task \
--path $CHECKPOINT_PATH \
--max-tokens 6144
If you find our work useful in your research, please consider citing our paper:
@article{nguyen2022dgslm,
  title = {Generative Spoken Dialogue Language Modeling},
  author = {Nguyen, Tu Anh and Kharitonov, Eugene and Copet, Jade and Adi, Yossi and Hsu, Wei-Ning and Elkahky, Ali and Tomasello, Paden and Algayres, Robin and Sagot, Benoit and Mohamed, Abdelrahman and Dupoux, Emmanuel},
  eprint = {2203.16502},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL},
  year = {2022}
}