kosmos-2/fairseq/examples/normformer/README.md
This is the code for the paper ["NormFormer: Improved Transformer Pretraining with Extra Normalization"](https://arxiv.org/abs/2110.09456).
If you have any issues or questions, please open a GitHub issue and tag @sshleifer.
The replication commands below expect `$DATA` to be the path to the binarized data directory. The training commands use FSDP, which requires `pip install fairscale>=0.4.0`.

To modify an existing `fairseq-train` command to use NormFormer, simply add the following flags:
```bash
fairseq-train ... \
    --scale-attn --scale-fc --scale-heads
```
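For intuition, here is a rough, self-contained sketch of what these flags correspond to in a Pre-LN transformer block, based on the paper. This is not the fairseq implementation; the class and attribute names are made up for illustration. `--scale-heads` adds a learned per-head gain on the attention outputs, `--scale-attn` adds a LayerNorm on the attention output before the residual connection, `--scale-fc` adds a LayerNorm after the first feed-forward projection, and the optional `--scale-resids` (used in some commands below) adds learned weights on the residual connection.

```python
import torch
import torch.nn as nn


class NormFormerBlockSketch(nn.Module):
    """Illustrative Pre-LN transformer block with the NormFormer additions (not fairseq's code)."""

    def __init__(self, dim: int, n_heads: int, ffn_dim: int):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.attn_ln_in = nn.LayerNorm(dim)                   # standard Pre-LN norm
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out_proj = nn.Linear(dim, dim)
        self.head_scale = nn.Parameter(torch.ones(n_heads))   # --scale-heads
        self.attn_ln_out = nn.LayerNorm(dim)                  # --scale-attn
        self.ffn_ln_in = nn.LayerNorm(dim)                    # standard Pre-LN norm
        self.fc1 = nn.Linear(dim, ffn_dim)
        self.fc_ln = nn.LayerNorm(ffn_dim)                    # --scale-fc
        self.fc2 = nn.Linear(ffn_dim, dim)
        self.resid_scale = nn.Parameter(torch.ones(dim))      # --scale-resids

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        # --- attention sub-block (no causal mask, for brevity) ---
        h = self.attn_ln_in(x)
        qkv = self.qkv(h).view(b, t, 3, self.n_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                  # each (b, heads, t, head_dim)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        h = attn @ v                                          # (b, heads, t, head_dim)
        h = h * self.head_scale.view(1, -1, 1, 1)             # --scale-heads: per-head gain
        h = h.transpose(1, 2).reshape(b, t, d)
        h = self.attn_ln_out(self.out_proj(h))                # --scale-attn: extra LayerNorm
        x = x + h
        # --- feed-forward sub-block ---
        h = torch.relu(self.fc1(self.ffn_ln_in(x))).pow(2)    # relu^2, cf. --activation-fn relu_squared
        h = self.fc2(self.fc_ln(h))                           # --scale-fc: LayerNorm after fc1
        return self.resid_scale * x + h                       # --scale-resids: learned residual gain
```

For example, `NormFormerBlockSketch(dim=768, n_heads=12, ffn_dim=3072)(torch.randn(2, 16, 768))` returns a `(2, 16, 768)` tensor.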
`--scale-resids` can be added on top of the three flags above; the commands below include configurations with and without it.

The exact training commands used in the paper are defined as shell functions in `examples/normformer/train_lm.sh`; load them with `source examples/normformer/train_lm.sh`. The commands below use `--distributed-world-size 8`. You should adjust `--update-freq` and `--batch-size` such that the effective batch size is (1024x1024x0.5) tokens for 125M and 355M, and (1024x1024) tokens for 1.3B parameters and above. For small models, use `--update-freq=256/global_bs`; for large models, `--update-freq=512/global_bs`, where `global_bs = --batch-size * --distributed-world-size` (a small worked example follows the command list below).

```bash
train_125M --lr 6e-4 # GPT-3 Replicated
train_125M --lr 1e-3 # stronger high-lr baseline
train_125M --lr 3e-3 --scale-attn --scale-fc --scale-heads # No scale-resids
train_125M --lr 3e-3 --scale-attn --scale-fc --scale-heads --scale-resids # Best command
train_355M --lr 6e-4 # GPT-3 Replicated
train_355M --lr 1e-3 # stronger high-lr baseline
train_355M --lr 1e-3 --scale-attn --scale-fc --scale-heads # No scale-resids
train_355M --lr 1e-3 --scale-attn --scale-fc --scale-heads --scale-resids # Slightly better
train_1.3B --lr 2e-4 # GPT-3 Replicated
train_1.3B --lr 6e-4 # stronger high-lr baseline
train_1.3B --lr 6e-4 --scale-attn --scale-fc --scale-heads # NormFormer
train_2.7B --lr 1.6e-4 # GPT-3 Replicated
train_2.7B --lr 1.6e-4 --activation-fn relu_squared # stronger Relu^2 baseline
train_2.7B --lr 6e-4 --activation-fn relu_squared --scale-attn --scale-fc --scale-heads # NormFormer 2.7B
```
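To sanity-check the batch-size arithmetic: assuming a 2048-token context (adjust if your `--tokens-per-sample` differs), 256 sequences per update is 256 x 2048 = 524,288 = (1024x1024x0.5) tokens, and 512 sequences per update is (1024x1024) tokens. Below is a tiny sketch of how `--update-freq` falls out of that; the helper name is hypothetical and not part of fairseq.

```python
def update_freq(batch_size: int, world_size: int, model_size: str) -> int:
    """Pick --update-freq so the effective batch matches the sizes above.

    Hypothetical helper: targets 256 sequences/update for 125M/355M and 512 for
    1.3B+ (i.e. ~0.5M and ~1M tokens at 2048 tokens per sequence).
    """
    global_bs = batch_size * world_size  # --batch-size * --distributed-world-size
    target = 256 if model_size in ("125M", "355M") else 512
    assert target % global_bs == 0, "choose --batch-size so the target divides evenly"
    return target // global_bs


# e.g. 8 GPUs with --batch-size 4 for the 1.3B model:
#   global_bs = 32, update_freq = 512 // 32 = 16
#   effective tokens per update = 512 * 2048 = 1,048,576 = 1024 x 1024
print(update_freq(batch_size=4, world_size=8, model_size="1.3B"))  # -> 16
```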
```bibtex
@misc{shleifer2021normformer,
    title={NormFormer: Improved Transformer Pretraining with Extra Normalization},
    author={Sam Shleifer and Jason Weston and Myle Ott},
    year={2021},
    eprint={2110.09456},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```