# EdgeFormer

**EdgeFormer: A Parameter-Efficient Transformer for On-Device Seq2seq Generation**, by Tao Ge and Furu Wei

- March 2022: released code and pretrained checkpoints.

## Pretrained Models

## Downstream seq2seq tasks

We evaluate EdgeFormer on the benchmarks of three popular seq2seq tasks: CoNLL-14 for GEC, XSUM for Abstractive Summarization, and SQuAD-NQG for Question Generation.

### CoNLL-14

| Model | #Params | #FLOPs | F0.5 |
|---|---|---|---|
| Transformer-base | 44M | 1.8G | 50.1 |
| Pretrained 12+2 Universal Transformer | 7.4M | 1.4G | 51.3 |
| Pretrained 12+2 Universal Transformer (wide) | 9.4M | 1.9G | 51.7 |
| Pretrained EdgeFormer | 9.4M | 1.3G | 52.7 |

### XSUM

| Model | #Params | #FLOPs | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|---|---|
| Transformer-base | 44M | 1.8G | 31.2 | 10.7 | 24.9 |
| Pretrained 12+2 Universal Transformer | 7.4M | 1.4G | 34.4 | 13.4 | 27.9 |
| Pretrained 12+2 Universal Transformer (wide) | 9.4M | 1.9G | 35.1 | 14.0 | 28.6 |
| Pretrained EdgeFormer | 9.4M | 1.3G | 36.3 | 14.8 | 29.5 |

### SQuAD-NQG

| Model | #Params | #FLOPs | BLEU-4 | METEOR | ROUGE-L |
|---|---|---|---|---|---|
| Transformer-base | 44M | 1.8G | 2.6 | 9.0 | 26.0 |
| Pretrained 12+2 Universal Transformer | 7.4M | 1.4G | 18.3 | 21.0 | 45.9 |
| Pretrained 12+2 Universal Transformer (wide) | 9.4M | 1.9G | 18.7 | 21.3 | 46.1 |
| Pretrained EdgeFormer | 9.4M | 1.3G | 19.0 | 21.7 | 46.3 |
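Across all three benchmarks, pretrained EdgeFormer matches or outperforms Transformer-base at a fraction of the cost. A quick back-of-the-envelope check of the ratios, using only the numbers from the tables above:

```shell
# Compare Transformer-base (44M params, 1.8G FLOPs) against
# pretrained EdgeFormer (9.4M params, 1.3G FLOPs)
awk 'BEGIN {
  printf "%.1fx fewer parameters\n", 44 / 9.4
  printf "%.2fx fewer FLOPs\n", 1.8 / 1.3
}'
```

So EdgeFormer uses roughly 4.7x fewer parameters and about 1.4x fewer FLOPs while scoring higher on every benchmark.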

## Setup

```bash
pip install --editable ./
```


## Fine-tuning
```bash
PRETRAINED_MODEL=/path/to/checkpoint/model.pt
fairseq-train /path/to/binarized/data \
        --restore-file $PRETRAINED_MODEL  --reset-lr-scheduler --reset-optimizer --reset-dataloader \
        --task translation \
        --criterion label_smoothed_cross_entropy \
        --arch transformer_edge --encoder-layers 12 --decoder-ffn-embed-dim 128 --lora-r 32 --lora-r-shape 0 \
        --share-all-embeddings \
        --required-batch-size-multiple 8 \
        --optimizer adam \
        --adam-betas '(0.9,0.98)' \
        --adam-eps 1e-6 \
        --clip-norm 1.0 \
        --lr-scheduler polynomial_decay \
        --lr 0.00015 \
        --warmup-updates 8000 \
        --total-num-update 100000 \
        --max-update 100000 --max-epoch 1000 \
        --max-tokens 20000 \
        --update-freq 1 \
        --log-format simple \
        --log-interval 1000 \
        --save-interval-updates 5000 \
        --fp16 \
        --fp16-init-scale 4 \
        --fp16-scale-window 256 \
        --min-loss-scale 0.0001 \
        --seed 1 \
        --save-dir /path/to/save/checkpoints \
        --ddp-backend legacy_ddp
```

**Note**:

- Please adjust hyperparameters such as `lr` and `warmup-updates` based on the dataset and task.
- Please adjust `max-tokens` and `update-freq` to suit different experimental environments.
- Use `--fp16` for more efficient training on devices that have Tensor Cores.
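When tuning `max-tokens` and `update-freq`, keep in mind that fairseq's effective batch size is `max-tokens × update-freq × number of GPUs`. A minimal sketch with the values from the command above (the single-GPU count is an assumption):

```shell
# Effective tokens per optimizer step = max-tokens * update-freq * num_gpus
MAX_TOKENS=20000   # --max-tokens from the command above
UPDATE_FREQ=1      # --update-freq from the command above
NUM_GPUS=1         # assumption: single-GPU training
EFFECTIVE_TOKENS=$(( MAX_TOKENS * UPDATE_FREQ * NUM_GPUS ))
echo "$EFFECTIVE_TOKENS"
```

To keep results comparable on a smaller GPU, halve `max-tokens` and double `update-freq` so the product stays the same.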
## Evaluation

```bash
fairseq-generate $data_bin \
    --path $save_dir/checkpoint_best.pt \
    --batch-size 64 --beam 5 --remove-bpe=sentencepiece
```
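The output of `fairseq-generate` interleaves source (`S-`), target (`T-`), and hypothesis (`H-`) lines, where each `H-<id>` line carries a tab-separated score and text; to score the generations you typically extract the hypotheses in sample order. A minimal sketch on simulated output (the two `H-` lines and the file names are assumptions for illustration, not real model output):

```shell
# Simulate two hypothesis lines in fairseq-generate's H-<id>\t<score>\t<text> format
printf 'H-1\t-0.50\tthe second hypothesis\nH-0\t-0.30\tthe first hypothesis\n' > generate.out

# Keep only H- lines, sort numerically by sample id, keep the text column
grep ^H generate.out | sort -t- -k2 -n | cut -f3 > hypotheses.txt
cat hypotheses.txt
```

The resulting `hypotheses.txt` lines up with the original input order and can be fed to the task's scorer (e.g. the M2 scorer for CoNLL-14 or ROUGE for XSUM).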

## Citation

If you find this repository useful, please consider citing our work:

```bibtex
@article{ge2022edgeformer,
  title={EdgeFormer: A Parameter-Efficient Transformer for On-Device Seq2seq Generation},
  author={Ge, Tao and Wei, Furu},
  journal={arXiv preprint arXiv:2202.07959},
  year={2022}
}
```

## Acknowledgement

This repository is built on top of the Fairseq repository.

## License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.

This project has adopted the Microsoft Open Source Code of Conduct.

## Contact Information

For help or issues using EdgeFormer models, please submit a GitHub issue.

For other communications related to EdgeFormer, please contact Tao Ge ([email protected]), Furu Wei ([email protected]).