VLMo - General-purpose Multimodal Pre-training

Paper: VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts.

Official PyTorch implementation and pre-trained models of VLMo.

  • Dec 2022: Code and pre-trained models released.
  • Sep 2022: VLMo was accepted by NeurIPS 2022.
  • May 30th, 2022: New version of the VLMo paper released on arXiv.
  • November 24th, 2021: VLMo-Large (single model) became the new SOTA on the VQA Challenge.
  • Nov 2021: Preprint released on arXiv.

Pre-trained Models

We provide weights for three VLMo models pre-trained on COCO, VG, SBU, and GCC at 224x224 image resolution.

  • VLMo-base: #layer=12; hidden=768; FFN factor=4x; #head=12; patch=16x16; #VL_FFN=2 (#parameters: 175M)
  • VLMo-base_plus: #layer=24; hidden=544; FFN factor=4x; #head=16; patch=16x16; #VL_FFN=3 (#parameters: 167M)
  • VLMo-large: #layer=24; hidden=1024; FFN factor=4x; #head=16; patch=16x16; #VL_FFN=3 (#parameters: 562M)
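The listed parameter counts can be roughly reproduced from these hyperparameters. The sketch below is a back-of-the-envelope estimate, not the repo's accounting: it counts shared self-attention, the per-modality FFN experts of the MoME design, and word embeddings, while ignoring layer norms, patch/position embeddings, and task heads; the ~30K vocabulary size is an assumption.

```python
def vlmo_param_estimate(layers, hidden, vl_ffn_layers, ffn_factor=4, vocab=30522):
    """Rough parameter count for a MoME transformer: shared self-attention,
    a vision FFN and a language FFN in every block, plus a vision-language
    FFN in the top `vl_ffn_layers` blocks. Norms/embeddings mostly ignored."""
    attn = layers * (4 * hidden * hidden + 4 * hidden)                   # QKV + output proj, with bias
    ffn = 2 * ffn_factor * hidden * hidden + (ffn_factor + 1) * hidden   # one FFN expert, with bias
    experts = (2 * layers + vl_ffn_layers) * ffn                         # V-FFN, L-FFN, VL-FFN
    emb = vocab * hidden                                                 # word embeddings
    return attn + experts + emb

for name, cfg in [("base", (12, 768, 2)), ("base_plus", (24, 544, 3)), ("large", (24, 1024, 3))]:
    # prints roughly 175M / 166M / 560M, close to the reported 175M / 167M / 562M
    print(name, round(vlmo_param_estimate(*cfg) / 1e6), "M")
```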

Setup

```bash
alias=`whoami | cut -d'.' -f2`; docker run -it --rm --runtime=nvidia --ipc=host --privileged -v /home/${alias}:/home/${alias} pytorch/pytorch:1.8.0-cuda11.1-cudnn8-devel bash
```

First, clone the repo and install required packages:

```bash
git clone https://github.com/microsoft/unilm.git
cd unilm/vlmo

pip install -r requirements.txt
```

Dataset Preparation

We process the pre-training and fine-tuning data into the same Arrow-based format as ViLT.

Pre-training

Replace <ARROW_ROOT> with your data directory in the following commands.

Step 1: Vision Pre-Training

Download the pre-trained vision model weights from the BEiT repo.

Step 2: Language Pre-Training (VLMo-Base)

```bash
# download from https://github.com/addf400/files/releases/download/v1.0/beit_base_patch16_224_pt22k_ft22kto1k.pth
export INIT_CKPT=/path/to/save/beit_base_checkpoint

python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> task_textmlm_base whole_word_masking=True step200k per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path=$INIT_CKPT log_dir=<YOUR_OUTPUT_PATH>
```
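The `whole_word_masking=True` flag masks all WordPiece sub-tokens of a word together rather than masking sub-tokens independently. A minimal pure-Python sketch of the idea (names are illustrative; the repo implements this inside its masked-LM data pipeline):

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, seed=0):
    """Group a word-start token with its '##' continuations, then mask
    each selected word in full, never a lone sub-token."""
    rng = random.Random(seed)
    words, cur = [], []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and cur:
            cur.append(i)          # continuation joins the current word
        else:
            if cur:
                words.append(cur)
            cur = [i]              # a new word starts here
    if cur:
        words.append(cur)
    out = list(tokens)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:
                out[i] = "[MASK]"
    return out

# at mask_prob=1.0 every word is masked in full, sub-tokens included
print(whole_word_mask(["play", "##ing", "foot", "##ball", "is", "fun"], mask_prob=1.0))
```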

Alternatively, you can download our pre-trained checkpoints for this stage.

Step 3: Vision-Language Pre-Training (VLMo-Base)

```bash
export INIT_CKPT=/path/to/save/last_stage_ckpt

python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> task_mlm_itm_itc_base whole_word_masking=True step200k per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path=$INIT_CKPT log_dir=<YOUR_OUTPUT_PATH>
```
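The `task_mlm_itm_itc_base` config combines masked language modeling, image-text matching, and image-text contrastive (ITC) objectives. A pure-Python sketch of the ITC term, assuming in-batch negatives and a fixed temperature (illustrative only; the actual implementation differs, e.g. in how negatives are gathered across GPUs):

```python
import math

def itc_loss(img_emb, txt_emb, temperature=0.07):
    """Image-text contrastive loss with in-batch negatives:
    matched image/text pairs share the same index in the batch."""
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    img = [normalize(v) for v in img_emb]
    txt = [normalize(v) for v in txt_emb]
    b = len(img)
    # cosine-similarity logits, scaled by the temperature
    logits = [[sum(a * c for a, c in zip(img[i], txt[j])) / temperature
               for j in range(b)] for i in range(b)]

    def xent(row, target):
        # numerically stable cross-entropy of a softmax over `row`
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        return log_z - row[target]

    i2t = sum(xent(logits[i], i) for i in range(b)) / b  # image -> text
    t2i = sum(xent([logits[i][j] for i in range(b)], j) for j in range(b)) / b  # text -> image
    return (i2t + t2i) / 2

# perfectly aligned pairs give a near-zero loss
print(itc_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]))
```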

Fine-Tuning on Downstream Tasks

Commands

```bash
python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> "<CONFIG_NAME>" per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path="<VLMo_WEIGHT>" log_dir=<YOUR_OUTPUT_PATH>
```

To reduce GPU memory cost, use DeepSpeed and activation checkpointing.

Configs

You can find the <CONFIG_NAME> for each task below:

VQAv2

| <CONFIG_NAME> | initialized checkpoint | finetuned weight | test-dev |
|---|---|---|---|
| task_finetune_vqa_base_image480 | VLMo-base | weight | 76.6 |
| task_finetune_vqa_base_plus_image480 | VLMo-base_plus | weight | 78.5 |
| task_finetune_vqa_large_image480 | VLMo-large | weight | 79.9 |

NLVR2

| <CONFIG_NAME> | initialized checkpoint | finetuned weight | test-P |
|---|---|---|---|
| task_finetune_nlvr2_base_image384 | VLMo-base | weight | 83.3 |
| task_finetune_nlvr2_base_plus_image384 | VLMo-base_plus | weight | 85.1 |
| task_finetune_nlvr2_large_image384 | VLMo-large | weight | 86.9 |

COCO

| <CONFIG_NAME> | initialized checkpoint | finetuned weight | TR@1 | IR@1 |
|---|---|---|---|---|
| task_finetune_irtr_coco_base_image384 | VLMo-base | weight | 74.8 | 57.2 |
| task_finetune_irtr_coco_base_plus_image384 | VLMo-base_plus | weight | 76.3 | 58.6 |
| task_finetune_irtr_coco_large_image384 | VLMo-large | weight | 78.2 | 60.6 |

F30K

| <CONFIG_NAME> | initialized checkpoint | finetuned weight | TR@1 | IR@1 |
|---|---|---|---|---|
| task_finetune_irtr_f30k_base_image384 | VLMo-base_coco_finetuned | weight | 92.3 | 79.3 |
| task_finetune_irtr_f30k_base_plus_image384 | VLMo-base_plus | weight | 93.2 | 81.8 |
| task_finetune_irtr_f30k_large_image384 | VLMo-large_coco_finetuned | weight | 95.3 | 84.5 |

Evaluation

To evaluate a fine-tuned model, append test_only=True and set load_path= to the fine-tuned VLMo weights, as follows:

```bash
python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=1 "<CONFIG_NAME>" per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path="<Finetuned_VLMo_WEIGHT>" test_only=True
```
  • For retrieval tasks, also set get_recall_metric=True in the command.

Acknowledgement

This repository is built using the ViLT repository, the BEiT repository, ALBEF, and the timm library.

Citation

If you find this repository useful, please consider citing our work:

```
@inproceedings{vlmo,
  title={{VLMo}: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts},
  author={Hangbo Bao and Wenhui Wang and Li Dong and Qiang Liu and Owais Khan Mohammed and Kriti Aggarwal and Subhojit Som and Songhao Piao and Furu Wei},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022},
  url={https://openreview.net/forum?id=bydKs84JEyw}
}
```

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.

Microsoft Open Source Code of Conduct

Contact Information

For help or issues using VLMo models, please submit a GitHub issue.