Official PyTorch implementation and pretrained models of BEiT-3.
The code and pretrained models of BEiT can be found here.
The code and pretrained models of BEiT v2 can be found here.
We provide BEiT-3 weights pretrained on monomodal and multimodal data. Our large-size model outperforms previous large-size models across various vision-language and vision downstream tasks. The models were pretrained with 224x224 resolution.
- BEiT3-base and BEiT3-large.
- BEiT3-base-itc and BEiT3-large-itc usually achieve better performance.

1. Models pretrained on ImageNet-21k images, 160 GB text documents, and web-scale image-text pairs (collected from LAION-400M, English LAION-2B, COYO-700M, and CC15M).
   - BEiT3-base: #layer=12; hidden=768; FFN factor=4x; #head=12; patch=16x16; #parameters: 276M
   - BEiT3-large: #layer=24; hidden=1024; FFN factor=4x; #head=16; patch=16x16; #parameters: 746M
2. Perform image-text contrastive intermediate tuning on BEiT3-base and BEiT3-large.
   - BEiT3-base-itc: #layer=12; hidden=768; FFN factor=4x; #head=12; patch=16x16; #parameters: 222M
   - BEiT3-large-itc: #layer=24; hidden=1024; FFN factor=4x; #head=16; patch=16x16; #parameters: 674M
3. Add indomain image-text pairs (COCO and VG) to continue training BEiT3-base and BEiT3-large using masked data modeling. The indomain models achieve better performance on VQAv2 and NLVR2 tasks.
   - BEiT3-base-indomain: #layer=12; hidden=768; FFN factor=4x; #head=12; patch=16x16; #parameters: 276M
   - BEiT3-large-indomain: #layer=24; hidden=1024; FFN factor=4x; #head=16; patch=16x16; #parameters: 746M

beit3.spm is the sentencepiece model used for tokenizing texts.
from transformers import XLMRobertaTokenizer
tokenizer = XLMRobertaTokenizer("/your_beit3_model_path/beit3.spm")
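As a minimal sketch of how the tokenizer above might be used to prepare a text input, the helper below wraps the token ids with begin/end markers and pads them to a fixed length. The function name `encode_text`, the `max_len` default, and the padding-mask convention (1 marks padding) are assumptions made for illustration and are not taken from this README. It relies only on the standard `tokenize`, `convert_tokens_to_ids`, `bos_token_id`, `eos_token_id`, and `pad_token_id` members of a Hugging Face tokenizer.

```python
def encode_text(tokenizer, text, max_len=64):
    """Return (token_ids, padding_mask), both of length max_len.

    Hypothetical helper: `tokenizer` is any Hugging Face-style tokenizer,
    for example the XLMRobertaTokenizer built from beit3.spm.
    """
    tokens = tokenizer.tokenize(text)
    ids = tokenizer.convert_tokens_to_ids(tokens)
    # Reserve two positions for the begin/end-of-sentence markers.
    ids = ids[: max_len - 2]
    ids = [tokenizer.bos_token_id] + ids + [tokenizer.eos_token_id]
    num_real = len(ids)
    num_pad = max_len - num_real
    # Assumed convention: 0 = real token, 1 = padding.
    padding_mask = [0] * num_real + [1] * num_pad
    ids = ids + [tokenizer.pad_token_id] * num_pad
    return ids, padding_mask
```

With the tokenizer created above, a call could look like `ids, mask = encode_text(tokenizer, "a photo of a cat")`.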
We use Magneto with decoupled Multiway Transformer as the backbone architecture. Magneto offers better training stability and better performance across modalities (such as vision and language). The implementation is based on the torchscale package.
alias=`whoami | cut -d'.' -f2`; docker run -it --rm --runtime=nvidia --ipc=host --privileged -v /home/${alias}:/home/${alias} pytorch/pytorch:1.8.1-cuda11.1-cudnn8-devel bash
Clone the repo and install required packages:
git clone https://github.com/microsoft/unilm.git
cd unilm/beit3
pip install -r requirements.txt
Detailed instructions can be found in get_started_for_image_classification.md. Only the vision-related parameters are used for image classification fine-tuning.
| initialized checkpoint | resolution | acc@1 | acc@5 | #params | weight |
|---|---|---|---|---|---|
| beit3_base_patch16_224 | 224x224 | 85.4 | 97.6 | 87M | link |
| beit3_base_indomain_patch16_224 | 224x224 | 85.4 | 97.6 | 87M | link |
| beit3_large_patch16_224 | 224x224 | 87.6 | 98.3 | 305M | link |
| beit3_large_indomain_patch16_224 | 224x224 | 87.5 | 98.3 | 305M | link |
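The #params column above is far smaller than the full checkpoint sizes (276M and 746M) because only the vision parameters are kept. As a rough sanity check, a standard Transformer block with hidden size h and a 4x FFN holds about 12·h² weights: 4·h² for the attention projections and 8·h² for the FFN. The sketch below applies that rule of thumb. The function name is hypothetical, and the estimate ignores biases, layer norms, patch and position embeddings, and the classification head, so it lands slightly below the reported numbers.

```python
def estimate_block_params(num_layers, hidden, ffn_factor=4):
    """Approximate weight count of a plain Transformer encoder stack."""
    attention = 4 * hidden * hidden          # Q, K, V, and output projections
    ffn = 2 * ffn_factor * hidden * hidden   # up- and down-projection
    return num_layers * (attention + ffn)

base = estimate_block_params(num_layers=12, hidden=768)
large = estimate_block_params(num_layers=24, hidden=1024)
print(f"base:  {base / 1e6:.1f}M")   # base:  84.9M
print(f"large: {large / 1e6:.1f}M")  # large: 302.0M
```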
Detailed instructions for VQAv2 fine-tuning can be found in get_started_for_vqav2.md.
| initialized checkpoint | resolution | augmented data | test-dev | test-std | #params | weight |
|---|---|---|---|---|---|---|
| beit3_base_patch16_224 | 480x480 | - | 77.65 | - | 228M | link |
| beit3_base_indomain_patch16_224 | 480x480 | - | 78.46 | - | 228M | link |
| beit3_large_patch16_224 | 480x480 | - | 81.85 | - | 683M | link |
| beit3_large_indomain_patch16_224 | 480x480 | - | 82.53 | - | 683M | link |
| beit3_large_indomain_patch16_224 | 768x768 | VGQA | 82.97 | 83.03 | 684M | link |
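The checkpoints were pretrained at 224x224 but are fine-tuned here at 480x480 or 768x768, which changes the number of image tokens the model processes. Assuming non-overlapping 16x16 patches, as the `patch16` checkpoint names suggest, the per-image patch count for the resolutions used in this README can be sketched as follows. The helper name is hypothetical, and special tokens are not counted.

```python
def num_patches(resolution, patch_size=16):
    """Number of non-overlapping square patches in a square image."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    per_side = resolution // patch_size
    return per_side * per_side

for res in (224, 384, 480, 768):
    print(res, num_patches(res))
# 224 196
# 384 576
# 480 900
# 768 2304
```

Going from 224x224 to 768x768 multiplies the image sequence length by roughly 12, which is worth keeping in mind when budgeting GPU memory for fine-tuning.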
Detailed instructions for NLVR2 fine-tuning can be found in get_started_for_nlvr2.md.
| initialized checkpoint | resolution | dev | test-P | #params | weight |
|---|---|---|---|---|---|
| beit3_base_patch16_224 | 224x224 | 83.6 | 84.4 | 226M | link |
| beit3_base_indomain_patch16_224 | 224x224 | 84.6 | 85.3 | 226M | link |
| beit3_large_patch16_224 | 224x224 | 88.5 | 89.4 | 681M | link |
| beit3_large_indomain_patch16_224 | 224x224 | 89.2 | 90.0 | 681M | link |
Detailed instructions for image captioning fine-tuning can be found in get_started_for_image_captioning.md.
| initialized checkpoint | resolution | test CIDEr | #params | weight |
|---|---|---|---|---|
| beit3_base_patch16_224 | 480x480 | 133.6 | 271M | link |
| beit3_base_indomain_patch16_224 | 480x480 | 135.0 | 271M | link |
| beit3_large_patch16_224 | 480x480 | 143.2 | 739M | link |
| initialized checkpoint | resolution | val CIDEr | #params | weight |
|---|---|---|---|---|
| beit3_base_patch16_224 | 480x480 | 104.4 | 271M | link |
| beit3_base_indomain_patch16_224 | 480x480 | 105.6 | 271M | link |
| beit3_large_patch16_224 | 480x480 | 120.2 | 739M | link |
Detailed instructions for image-text retrieval fine-tuning can be found in get_started_for_retrieval.md.
| initialized checkpoint | resolution | IR@1 | TR@1 | #params | weight |
|---|---|---|---|---|---|
| beit3_base_itc_patch16_224 | 384x384 | 61.4 | 79.1 | 222M | link |
| beit3_large_itc_patch16_224 | 384x384 | 63.4 | 82.1 | 675M | link |
| initialized checkpoint | resolution | IR@1 | TR@1 | #params | weight |
|---|---|---|---|---|---|
| beit3_base_itc_patch16_224 | 384x384 | 86.2 | 96.3 | 222M | link |
| beit3_large_itc_patch16_224 | 384x384 | 88.1 | 97.2 | 675M | link |
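IR@1 and TR@1 in the tables above denote top-1 recall for image retrieval (text to image) and text retrieval (image to text), reported as percentages. The following is a minimal sketch of the metric, assuming a precomputed similarity matrix in which row i is an image, column j is a text, and the matching pair sits on the diagonal. This is a simplification, since datasets such as COCO pair each image with several captions. The function names and the toy matrix are hypothetical, and this is not the repository's evaluation code.

```python
def recall_at_k(sim, k=1):
    """Fraction of queries (rows) whose matching item (same index) ranks in the top k."""
    hits = 0
    for i, row in enumerate(sim):
        # Indices of the k highest-scoring candidates for query i.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if i in top_k:
            hits += 1
    return hits / len(sim)

def transpose(sim):
    """Swap the roles of queries and candidates."""
    return [list(col) for col in zip(*sim)]

# Toy image-by-text similarity matrix (rows: images, columns: texts).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.9],
]
tr_at_1 = recall_at_k(sim, k=1)             # each image retrieves a text
ir_at_1 = recall_at_k(transpose(sim), k=1)  # each text retrieves an image
print(tr_at_1, ir_at_1)
```

In the toy matrix, the second image scores highest against the third text, so text retrieval misses one of three queries while image retrieval finds every match.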
If you find this repository useful, please consider citing our work:
@inproceedings{beit3,
title={Image as a foreign language: {BEiT} pretraining for vision and vision-language tasks},
author={Wenhui Wang and Hangbo Bao and Li Dong and Johan Bjorck and Zhiliang Peng and Qiang Liu and Kriti Aggarwal and Owais Khan Mohammed and Saksham Singhal and Subhojit Som and Furu Wei},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2023}
}
@article{beitv2,
title={{BEiT v2}: Masked Image Modeling with Vector-Quantized Visual Tokenizers},
author={Zhiliang Peng and Li Dong and Hangbo Bao and Qixiang Ye and Furu Wei},
year={2022},
eprint={2208.06366},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@inproceedings{beit,
title={{BEiT}: {BERT} Pre-Training of Image Transformers},
author={Hangbo Bao and Li Dong and Songhao Piao and Furu Wei},
booktitle={International Conference on Learning Representations},
year={2022},
url={https://openreview.net/forum?id=p-BhZSz59o4}
}
This repository is built using the BEiT, BEiTv2, CLIP, open_clip, Oscar, DeiT, and Dino repositories and the timm library.
This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
Microsoft Open Source Code of Conduct
For help or issues using BEiT-3 models, please submit a GitHub issue.