# DeltaLM

**Encoder-Decoder Pre-training for Language Generation and Translation**
- **DeltaLM**: Encoder-Decoder Pre-training for Language Generation and Translation by Augmenting Pretrained Multilingual Encoders. Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Alexandre Muzio, Saksham Singhal, Hany Hassan Awadalla, Xia Song, Furu Wei. CoRR abs/2106.13736.
- **mT6**: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs. Zewen Chi, Li Dong, Shuming Ma, Shaohan Huang, Xian-Ling Mao, Heyan Huang, Furu Wei. In EMNLP 2021.
## Results

We evaluate DeltaLM on a cross-lingual abstractive summarization benchmark and report ROUGE scores averaged across languages.
| Model | #Params | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|---|
| mBART | 610M | 34.5 | 12.9 | 28.7 |
| mT5 | 300M | 27.5 | 8.8 | 22.8 |
| mT5 | 580M | 31.8 | 11.5 | 26.0 |
| DeltaLM | 360M | 35.3 | 13.4 | 28.7 |
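The reported numbers are macro-averages over the per-language scores. A minimal sketch of that averaging; the per-language scores below are illustrative placeholders, not the actual benchmark results:

```python
# Sketch of how per-language ROUGE scores are macro-averaged into one number
# per metric (equal weight per language). Scores are placeholders only.
per_language = {
    "es": {"rouge1": 36.1, "rouge2": 14.0, "rougeL": 29.5},
    "fr": {"rouge1": 34.8, "rouge2": 12.9, "rougeL": 28.1},
    "tr": {"rouge1": 35.0, "rouge2": 13.3, "rougeL": 28.5},
}

def macro_average(scores):
    """Average each metric over all languages, rounded to one decimal."""
    metrics = next(iter(scores.values())).keys()
    return {
        m: round(sum(lang[m] for lang in scores.values()) / len(scores), 1)
        for m in metrics
    }

print(macro_average(per_language))  # one averaged value per ROUGE metric
```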
## Setup

```shell
git submodule update --init deltalm/fairseq
cd deltalm/
pip install --editable fairseq/
```
## Fine-tuning

### Preprocess the data

Organize the raw parallel data as:

```
.
+-- /path/to/data/
|   +-- train.src
|   +-- train.tgt
|   +-- valid.src
|   +-- valid.tgt
|   +-- test.src
|   +-- test.tgt
```
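Before tokenizing, it is worth checking that the source and target sides of each split are line-aligned. A small sketch; the directory and toy file contents here are placeholders:

```python
from pathlib import Path
import tempfile

def check_parallel(data_dir, splits=("train", "valid")):
    """Raise if any split's .src and .tgt files differ in line count."""
    data_dir = Path(data_dir)
    for split in splits:
        src = (data_dir / f"{split}.src").read_text().splitlines()
        tgt = (data_dir / f"{split}.tgt").read_text().splitlines()
        if len(src) != len(tgt):
            raise ValueError(f"{split}: {len(src)} source vs {len(tgt)} target lines")
    return True

# Demonstrate on a throwaway directory with toy two-line files.
with tempfile.TemporaryDirectory() as d:
    for split in ("train", "valid"):
        (Path(d) / f"{split}.src").write_text("ein satz\nnoch einer\n")
        (Path(d) / f"{split}.tgt").write_text("a sentence\nanother one\n")
    print(check_parallel(d))  # True
```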
Example (IWSLT14 German to English):

```shell
bash examples/prepare_iwslt14.sh /tmp/iwslt14
```

Tokenize the data with the SentencePiece model shipped with the pretrained checkpoint:

```shell
spm_encode --model=/path/to/checkpoint/spm.model --output_format=piece < train.src > train.spm.src
spm_encode --model=/path/to/checkpoint/spm.model --output_format=piece < train.tgt > train.spm.tgt
spm_encode --model=/path/to/checkpoint/spm.model --output_format=piece < valid.src > valid.spm.src
spm_encode --model=/path/to/checkpoint/spm.model --output_format=piece < valid.tgt > valid.spm.tgt
spm_encode --model=/path/to/checkpoint/spm.model --output_format=piece < test.src > test.spm.src
spm_encode --model=/path/to/checkpoint/spm.model --output_format=piece < test.tgt > test.spm.tgt
```
Example (IWSLT14 German to English):

```shell
bash examples/binary_iwslt14.sh \
    /tmp/iwslt14/iwslt14.tokenized.de-en \
    /tmp/iwslt14/iwslt14.spm \
    /path/to/checkpoint/spm.model
```
Binarize the tokenized data with the dictionary from the pretrained checkpoint:

```shell
data_bin=/path/to/data-bin/
python preprocess.py \
    --trainpref train.spm \
    --validpref valid.spm \
    --testpref test.spm \
    --source-lang src --target-lang tgt \
    --destdir $data_bin \
    --srcdict /path/to/checkpoint/dict.txt \
    --tgtdict /path/to/checkpoint/dict.txt \
    --workers 40
```
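Since `--srcdict` and `--tgtdict` point at the same `dict.txt`, source and target share one vocabulary. A fairseq dictionary file is plain text with one `<token> <count>` pair per line, ordered by descending frequency; a minimal sketch of reading it (the file contents below are illustrative, not the real DeltaLM vocabulary):

```python
# Minimal sketch of the fairseq dict.txt format: one "<token> <count>" pair
# per line. Special symbols (<s>, <pad>, </s>, <unk>) are prepended by
# fairseq itself and do not appear in the file.
def load_dict(lines):
    """Map each token to its index, in file order."""
    vocab = {}
    for line in lines:
        token, _count = line.rsplit(" ", 1)
        vocab[token] = len(vocab)
    return vocab

# Illustrative contents only.
example = ["▁the 1000", "▁of 800", "▁, 750"]
print(load_dict(example))  # {'▁the': 0, '▁of': 1, '▁,': 2}
```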
Example (IWSLT14 German to English):

```shell
bash examples/binary_iwslt14.sh \
    /tmp/iwslt14/iwslt14.spm \
    /tmp/iwslt14/iwslt14.bin \
    /path/to/checkpoint/dict.txt
```
### Train

```shell
PRETRAINED_MODEL=/path/to/checkpoint/model.pt

python train.py $data_bin \
    --save-dir $save_dir \
    --arch deltalm_base \
    --pretrained-deltalm-checkpoint $PRETRAINED_MODEL \
    --share-all-embeddings \
    --max-source-positions 512 --max-target-positions 512 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr-scheduler inverse_sqrt \
    --lr $lr \
    --warmup-init-lr 1e-07 \
    --stop-min-lr 1e-09 \
    --warmup-updates 4000 \
    --max-update 400000 \
    --max-epoch 100 \
    --max-tokens $batch_size \
    --update-freq 1 \
    --seed 1 \
    --log-format simple \
    --skip-invalid-size-inputs-valid-test
```
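The `inverse_sqrt` scheduler warms the learning rate up linearly from `--warmup-init-lr` to `--lr` over `--warmup-updates` steps, then decays it proportionally to the inverse square root of the update number. A sketch of the schedule; the peak learning rate value here is a placeholder, not a recommended setting:

```python
import math

def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_init_lr=1e-7, warmup_updates=4000):
    """Learning rate at a given update, mirroring the inverse_sqrt schedule."""
    if step < warmup_updates:
        # Linear warmup from warmup_init_lr to peak_lr.
        return warmup_init_lr + (peak_lr - warmup_init_lr) * step / warmup_updates
    # Decay proportional to 1/sqrt(step), continuous at the warmup boundary.
    return peak_lr * math.sqrt(warmup_updates / step)

print(inverse_sqrt_lr(0))      # warmup_init_lr
print(inverse_sqrt_lr(4000))   # peak_lr
print(inverse_sqrt_lr(16000))  # peak_lr / 2
```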
**Note**:
- Use `--arch deltalm_large` for the large-size model.
- Adjust `--max-tokens` and `--update-freq` to suit different experimental environments; the recommended total batch size is 4096 * 128 tokens per step.
- Add `--fp16` for more efficient training on devices that have Tensor Cores.

Example (IWSLT14 German to English):
```shell
bash examples/train_iwslt14.sh \
    /tmp/iwslt14/iwslt14.bin \
    /tmp/iwslt14/checkpoints \
    /path/to/checkpoint/model.pt
```
### Evaluate

```shell
python generate.py $data_bin \
    --path $save_dir/checkpoint_best.pt \
    --batch-size 128 --beam 5 --remove-bpe=sentencepiece
```
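`--remove-bpe=sentencepiece` stitches the generated SentencePiece pieces back into plain text. The convention is that `▁` (U+2581) marks the start of a new word, so concatenating the pieces and turning `▁` into spaces restores the sentence. A minimal sketch:

```python
def remove_sentencepiece_bpe(pieces):
    """Join SentencePiece pieces into detokenized text.

    The ▁ (U+2581) marker prefixes pieces that begin a new word, so
    concatenating everything and mapping ▁ to spaces restores the text.
    """
    return "".join(pieces).replace("\u2581", " ").strip()

pieces = ["▁Hel", "lo", "▁world", "▁!"]
print(remove_sentencepiece_bpe(pieces))  # Hello world !
```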
Example (IWSLT14 German to English):

```shell
bash examples/evaluate_iwslt14.sh \
    /tmp/iwslt14/iwslt14.bin \
    /tmp/iwslt14/checkpoints
```
## Citation

If you find this repository useful, please consider citing our work:

```bibtex
@article{deltalm,
    title={{DeltaLM}: Encoder-Decoder Pre-training for Language Generation and Translation by Augmenting Pretrained Multilingual Encoders},
    author={Shuming Ma and Li Dong and Shaohan Huang and Dongdong Zhang and Alexandre Muzio and Saksham Singhal and Hany Hassan Awadalla and Xia Song and Furu Wei},
    year={2021},
    eprint={2106.13736},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
## Acknowledgement

This repository is built using the [Fairseq](https://github.com/pytorch/fairseq) repository.

## License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.

[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)

## Contact

For help or issues using DeltaLM models, please submit a GitHub issue.

For other communications related to DeltaLM, please contact Shuming Ma (`[email protected]`) or Furu Wei (`[email protected]`).