# Pretraining BEiT v2 on ImageNet-1k
We follow the settings proposed in BEiT v1.
The BEiT v2 base model can be pretrained on ImageNet-1k using a DGX-2 box (16 V100-32GB):
```bash
python -m torch.distributed.launch --nproc_per_node=16 run_beitv2_pretraining.py \
        --data_set image_folder \
        --data_path /path/to/imagenet-1k/train \
        --output_dir /path/to/save/your_model \
        --log_dir /path/to/save/your_model \
        --model beit_base_patch16_224_8k_vocab_cls_pt \
        --shared_lm_head True \
        --early_layers 9 \
        --head_layers 2 \
        --num_mask_patches 75 \
        --second_input_size 224 \
        --second_interpolation bicubic \
        --min_crop_scale 0.2 \
        --tokenizer_model vqkd_encoder_base_decoder_3x768x12_clip \
        --tokenizer_weight https://github.com/addf400/files/releases/download/BEiT-v2/vqkd_encoder_base_decoder_3x768x12_clip-d5036aa7.pth \
        --batch_size 128 \
        --lr 1.5e-3 \
        --warmup_epochs 10 \
        --clip_grad 3.0 \
        --drop_path 0.1 \
        --layer_scale_init_value 0.1 \
        --imagenet_default_mean_and_std \
        --opt_betas 0.9 0.999 \
        --opt_eps 1e-8 \
        --epochs 1600 \
        --save_ckpt_freq 20
```
- `--model`: BEiT v2 model to pretrain. `beit_base_patch16_224_8k_vocab_cls_pt` is the base model with [CLS]-token pretraining; `beit_base_patch16_224_8k_vocab` is the base model without [CLS]-token pretraining, i.e., the BEiT v1 base model.
- `--batch_size`: batch size per GPU. The effective batch size is the number of GPUs times `--batch_size`, so in the example above it is 16 * 128 = 2048.
- `--tokenizer_model`: we recommend `vqkd_encoder_base_decoder_3x768x12_clip`.
- `--tokenizer_weight`: weight path of the tokenizer model.
- `--epochs`: we use 300 for the short schedule and 1600 for the long schedule.
- `--opt_betas`: `0.9 0.98` for 300 epochs and `0.9 0.999` for 1600 epochs.
- `--drop_path`: `0.` for 300 epochs and `0.1` for 1600 epochs.
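As a side note on `--num_mask_patches`: with 224x224 inputs and 16x16 patches the encoder sees 14 x 14 = 196 patch tokens, so masking 75 of them corrupts roughly 38% of the image. Below is a minimal sketch of that bookkeeping using plain uniform sampling; the actual BEiT dataloader uses blockwise masking, so this is illustration only.

```python
import torch

# With 224x224 inputs and 16x16 patches there are 14 * 14 = 196 patch tokens.
num_patches = (224 // 16) ** 2      # 196
num_mask_patches = 75               # matches --num_mask_patches above

# Uniformly sample 75 positions to mask. NOTE: simplification for
# illustration only; the real dataloader masks *blockwise* regions.
perm = torch.randperm(num_patches)
bool_masked_pos = torch.zeros(num_patches, dtype=torch.bool)
bool_masked_pos[perm[:num_mask_patches]] = True

print(bool_masked_pos.sum().item())              # 75
print(round(num_mask_patches / num_patches, 2))  # ~0.38 of patches masked
```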
The BEiT v2 large model can be pretrained on ImageNet-1k using 4 DGX-2 boxes (4x16 V100-32GB). Run the command on every node, setting `--node_rank` to 0, 1, 2, and 3 respectively:

```bash
python -m torch.distributed.launch --nnodes 4 --node_rank {0, 1, 2, 3} --nproc_per_node=16 run_beitv2_pretraining.py \
        --data_set image_folder \
        --data_path /path/to/imagenet-1k/train \
        --output_dir /path/to/save/your_model \
        --log_dir /path/to/save/your_model \
        --model beit_large_patch16_224_8k_vocab_cls_pt \
        --shared_lm_head True \
        --early_layers 21 \
        --head_layers 2 \
        --num_mask_patches 75 \
        --second_input_size 224 \
        --second_interpolation bicubic \
        --min_crop_scale 0.2 \
        --tokenizer_model vqkd_encoder_base_decoder_3x768x12_clip \
        --tokenizer_weight https://github.com/addf400/files/releases/download/BEiT-v2/vqkd_encoder_base_decoder_3x768x12_clip-d5036aa7.pth \
        --batch_size 32 \
        --lr 1.5e-3 \
        --warmup_epochs 10 \
        --clip_grad 3.0 \
        --drop_path 0.1 \
        --layer_scale_init_value 1e-5 \
        --imagenet_default_mean_and_std \
        --opt_betas 0.9 0.999 \
        --opt_eps 1e-8 \
        --epochs 1600 \
        --save_ckpt_freq 20
```
- `--model`: BEiT v2 model to pretrain. `beit_large_patch16_224_8k_vocab_cls_pt` is the large model with [CLS]-token pretraining; `beit_large_patch16_224_8k_vocab` is the large model without [CLS]-token pretraining, i.e., the BEiT v1 large model.
- `--batch_size`: batch size per GPU. The effective batch size is the number of GPUs times `--batch_size`, so in the example above it is 4 * 16 * 32 = 2048.
- `--tokenizer_model`: we recommend `vqkd_encoder_base_decoder_3x768x12_clip`.
- `--tokenizer_weight`: weight path of the tokenizer model.
- `--epochs`: we use 300 for the short schedule and 1600 for the long schedule.
- `--opt_betas`: `0.9 0.98` for 300 epochs and `0.9 0.999` for 1600 epochs.
- `--drop_path`: `0.` for 300 epochs and `0.1` for 1600 epochs.
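The per-GPU batch shrinks from 128 to 32 as the GPU count grows fourfold, so both recipes land on the same effective batch size of 2048. A trivial sanity check of the arithmetic:

```python
# Effective batch size = nodes * GPUs-per-node * per-GPU batch size.
base_recipe = 1 * 16 * 128   # one DGX-2 box, --batch_size 128
large_recipe = 4 * 16 * 32   # four DGX-2 boxes, --batch_size 32
assert base_recipe == large_recipe == 2048
```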
We provide some pretrained models here.

| model name | pretraining epochs | weight |
|---|---|---|
| beit_base_patch16_224_8k_vocab_cls_pt | 300 | link |
| beit_base_patch16_224_8k_vocab_cls_pt | 1600 | link |
| beit_large_patch16_224_8k_vocab_cls_pt | 300 | link |
| beit_large_patch16_224_8k_vocab_cls_pt | 1600 | link |
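To inspect a downloaded checkpoint, something like the sketch below should work. The filename is a placeholder, and the assumption that the weights sit under a `model` (or `module`) key follows the usual convention of the BEiT codebase; adjust if the file is organized differently.

```python
import torch

ckpt_path = "beitv2_base_patch16_224_pt1k.pth"  # placeholder filename

checkpoint = torch.load(ckpt_path, map_location="cpu")
# BEiT-style checkpoints usually nest the weights under "model" (sometimes
# "module"); fall back to the raw dict if neither key is present.
state_dict = checkpoint.get("model", checkpoint.get("module", checkpoint))
print(len(state_dict), "tensors; first key:", next(iter(state_dict)))
```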