BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers

Official PyTorch implementation and pretrained models of BEiT v2.

The code and pretrained models of BEiT can be found here.

The code and pretrained models of BEiT-3 can be found here.

Pretrained Models

We provide four BEiT v2 checkpoints pretrained on ImageNet-1k at 224x224 resolution, covering two model sizes:

  • BEiT-base: #layer=12; hidden=768; FFN factor=4x; #head=12; patch=16x16 (#parameters: 86M)
  • BEiT-large: #layer=24; hidden=1024; FFN factor=4x; #head=16; patch=16x16 (#parameters: 304M)
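
As a rough sanity check on the quoted sizes, the parameter counts can be estimated from these configs alone. The sketch below is an illustrative back-of-the-envelope calculation, not the repo's code; it omits layer norms, position embeddings, and task heads:

```python
# Estimate ViT parameter counts from the configs listed above.
def vit_params(layers, hidden, ffn_factor=4, patch=16, in_chans=3):
    patch_embed = patch * patch * in_chans * hidden + hidden            # patch projection + bias
    attn = 4 * hidden * hidden + 4 * hidden                             # qkv + output proj (with biases)
    ffn = 2 * ffn_factor * hidden * hidden + (ffn_factor + 1) * hidden  # two linear layers
    return patch_embed + layers * (attn + ffn)

print(f"base:  ~{vit_params(12, 768) / 1e6:.0f}M")   # ~86M, matching the quoted 86M
print(f"large: ~{vit_params(24, 1024) / 1e6:.0f}M")  # ~303M, close to the quoted 304M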

Download checkpoints that are self-supervised pretrained on ImageNet-1k and then intermediate fine-tuned on ImageNet-21k (recommended):

Download checkpoints that are self-supervised pretrained on ImageNet-1k:

Setup

```bash
# start a PyTorch 1.7.1 + CUDA 11.0 container with your home directory mounted
alias=`whoami | cut -d'.' -f2`; docker run -it --rm --runtime=nvidia --ipc=host --privileged -v /home/${alias}:/home/${alias} pytorch/pytorch:1.7.1-cuda11.0-cudnn8-devel bash
```

First, clone the repo and install required packages:

```bash
git clone https://github.com/microsoft/unilm.git
cd unilm/beit2
pip install -r requirements.txt
```

The required packages include PyTorch 1.7.1, torchvision 0.8.2, and timm 0.4.12, among others.
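
A quick way to confirm the installed environment matches these versions:

```python
# Sanity-check the environment after installing the requirements.
import torch
import torchvision
import timm

print(torch.__version__)        # expected: 1.7.1
print(torchvision.__version__)  # expected: 0.8.2
print(timm.__version__)         # expected: 0.4.12
assert torch.cuda.is_available(), "CUDA is required for training"
```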

For mixed-precision training, please install NVIDIA apex:

```bash
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```
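
For reference, this is roughly how apex mixed precision is wired into a training step. It is a minimal sketch with a stand-in model; the repo's own training scripts enable mixed precision through their command-line flags:

```python
# Minimal apex mixed-precision training step (illustrative only).
import torch
from apex import amp

model = torch.nn.Linear(768, 1000).cuda()   # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# opt_level "O1" patches common ops to run in float16 where it is safe
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

x = torch.randn(8, 768).cuda()
loss = model(x).mean()
with amp.scale_loss(loss, optimizer) as scaled_loss:  # loss scaling for fp16
    scaled_loss.backward()
optimizer.step()
```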

Fine-tuning on ImageNet-1k (Image Classification)

We summarize the validation results as follows. We also provide the fine-tuned weights. The detailed instructions to reproduce the results can be found in get_started_for_image_classification.md.

| name | initialized checkpoint | resolution | acc@1 | acc@5 | #params | weight |
|------|------------------------|------------|-------|-------|---------|--------|
| BEiTv2-base | beitv2_base_patch16_224_pt1k | 224x224 | 85.5 | 97.5 | 86.5M | link |
| BEiTv2-base | beitv2_base_patch16_224_pt1k_ft21k | 224x224 | 86.5 | 98.0 | 86.5M | link |
| BEiTv2-large | beitv2_large_patch16_224_pt1k | 224x224 | 87.3 | 98.2 | 304M | link |
| BEiTv2-large | beitv2_large_patch16_224_pt1k_ft21k | 224x224 | 88.4 | 98.6 | 304M | link |
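
As a hedged sketch of loading one of these checkpoints for inference: the checkpoint filename and the `model`/`module` state-dict keys below are assumptions about the released files, and the script assumes it is run from the beit2 directory so that `modeling_finetune` is importable:

```python
# Load a fine-tuned BEiT v2 checkpoint into the repo's model definition.
import torch
from timm.models import create_model

import modeling_finetune  # noqa: F401 -- registers the beit models with timm

model = create_model("beit_base_patch16_224", pretrained=False, num_classes=1000)
ckpt = torch.load("beitv2_base_patch16_224_pt1k_ft21k.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt.get("module", ckpt))  # unwrap common wrappers
model.load_state_dict(state_dict, strict=False)
model.eval()
```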

Fine-tuning on ADE20K (Semantic Segmentation)

We summarize the validation results as follows. We also provide the fine-tuned weights. The detailed instructions to reproduce the results can be found in semantic_segmentation/README.md.

| name | initialized checkpoint | method | crop size | iterations | mIoU | #params | weight |
|------|------------------------|--------|-----------|------------|------|---------|--------|
| BEiTv2-base | beitv2_base_patch16_224_pt1k | UPerNet | 512x512 | 160k | 53.1 | 163M | link |
| BEiTv2-base | beitv2_base_patch16_224_pt1k_ft21k | UPerNet | 512x512 | 160k | 53.5 | 163M | link |
| BEiTv2-large | beitv2_large_patch16_224_pt1k | UPerNet | 512x512 | 160k | 56.7 | 441M | link |
| BEiTv2-large | beitv2_large_patch16_224_pt1k_ft21k | UPerNet | 512x512 | 160k | 57.5 | 441M | link |

Fine-tuning on MSCOCO2017 (Object Detection)

Under preparation.

Pre-training on ImageNet-1k

See PRETRAINING.md for detailed instructions.

Visual Tokenizer (VQ-KD) Trained on ImageNet-1k

We provide the VQ-KD tokenizer trained on ImageNet-1k.

See TOKENIZER.md for more details.
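
For intuition, the quantizer maps each patch embedding to its nearest codebook entry, and the resulting indices serve as the discrete visual tokens for masked image modeling. The sketch below illustrates that lookup using the paper's default codebook shape (8192 codes of dimension 32); the tensors and variable names are illustrative, not the repo's implementation:

```python
# Nearest-neighbour codebook lookup at the heart of VQ-KD.
import torch
import torch.nn.functional as F

codebook = F.normalize(torch.randn(8192, 32), dim=-1)  # 8k visual tokens
patches = F.normalize(torch.randn(196, 32), dim=-1)    # encoder outputs, 14x14 patches

# nearest neighbour under cosine distance = argmax of dot product after l2-norm
tokens = (patches @ codebook.t()).argmax(dim=-1)       # (196,) discrete token ids
quantized = codebook[tokens]                           # (196, 32) inputs to the decoder
```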

Code for Analysis of Self-Attention Map

The example below uses BEiT_base_patch16_224 pre-trained on ImageNet-1k for 800 epochs, with the config `--disable_rel_pos_bias --abs_pos_emb --layer_scale_init_value 0`:

```bash
python visualize_attention.py \
  --model beit_base_patch16_224_8k_vocab \
  --disable_rel_pos_bias \
  --abs_pos_emb \
  --layer_scale_init_value 0 \
  --input_size 480 \
  --pretrained_weights /folder/to/download/beit_base_patch16_224_pt1k_800ep.pth \
  --image_path ../visualization/input2.png \
  --selected_row 11 \
  --selected_col 13
```

`--selected_row 11` and `--selected_col 13` choose the image patch that serves as the attention query.
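
To make the row/column indexing concrete, here is a small sketch of how these flags map to a query token. The [CLS]-first, row-major layout is an assumption about the script's convention, following standard ViT tokenization:

```python
# With --input_size 480 and 16x16 patches, the image is a 30x30 patch grid.
input_size, patch_size = 480, 16
grid = input_size // patch_size       # 30 patches per side

row, col = 11, 13                     # --selected_row / --selected_col
query_idx = 1 + row * grid + col      # +1 skips the [CLS] token -> 344
print(query_idx)

# Given attn of shape (num_heads, seq_len, seq_len) from one transformer block,
# attn[:, query_idx, 1:].reshape(num_heads, grid, grid) is the per-head map of
# how the selected patch attends to every other patch.
```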

Citation

If you find this repository useful, please consider citing our work:

```
@inproceedings{beit,
  title={{BEiT}: {BERT} Pre-Training of Image Transformers},
  author={Hangbo Bao and Li Dong and Songhao Piao and Furu Wei},
  booktitle={International Conference on Learning Representations},
  year={2022},
  url={https://openreview.net/forum?id=p-BhZSz59o4}
}

@article{beitv2,
  title={{BEiT v2}: Masked Image Modeling with Vector-Quantized Visual Tokenizers},
  author={Zhiliang Peng and Li Dong and Hangbo Bao and Qixiang Ye and Furu Wei},
  year={2022},
  eprint={2208.06366},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```

Acknowledgement

This repository is built using the BEiT, CLIP, DeiT, and DINO repositories and the timm library.

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.

Microsoft Open Source Code of Conduct

Contact Information

For help or issues using BEiT v2 models, please submit a GitHub issue.

For other communications, please contact Li Dong ([email protected]), Furu Wei ([email protected]).