VLMo - General-purpose Multimodal Pre-training

Paper: VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts.

Official PyTorch implementation and pre-trained models of VLMo.

  • Dec 2022: Code and pre-trained models released.
  • Sep 2022: VLMo was accepted by NeurIPS 2022.
  • May 30th, 2022: New version of the VLMo paper released on arXiv.
  • November 24th, 2021: VLMo-Large (single model) became the new SOTA on the VQA Challenge.
  • Nov 2021: Preprint released on arXiv.

Pre-trained Models

We provide weights for three VLMo models pre-trained on COCO, VG, SBU, and GCC at 224x224 image resolution.

  • VLMo-base: #layer=12; hidden=768; FFN factor=4x; #head=12; patch=16x16; #VL_FFN=2 (#parameters: 175M)
  • VLMo-base_plus: #layer=24; hidden=544; FFN factor=4x; #head=16; patch=16x16; #VL_FFN=3 (#parameters: 167M)
  • VLMo-large: #layer=24; hidden=1024; FFN factor=4x; #head=16; patch=16x16; #VL_FFN=3 (#parameters: 562M)
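The listed parameter counts can be roughly reproduced from these hyperparameters. The sketch below is a back-of-the-envelope estimate, not the repo's accounting: it counts shared self-attention, the per-modality FFN experts of the MoME design, and word embeddings, while ignoring layer norms, patch/position embeddings, and task heads; the ~30K vocabulary size is an assumption.

```python
def vlmo_param_estimate(layers, hidden, vl_ffn_layers, ffn_factor=4, vocab=30522):
    """Rough parameter count for a MoME transformer: shared self-attention,
    a vision FFN and a language FFN in every block, plus a vision-language
    FFN in the top `vl_ffn_layers` blocks. Norms/embeddings mostly ignored."""
    attn = layers * (4 * hidden * hidden + 4 * hidden)                   # QKV + output proj, with bias
    ffn = 2 * ffn_factor * hidden * hidden + (ffn_factor + 1) * hidden   # one FFN expert, with bias
    experts = (2 * layers + vl_ffn_layers) * ffn                         # V-FFN, L-FFN, VL-FFN
    emb = vocab * hidden                                                 # word embeddings
    return attn + experts + emb

for name, cfg in [("base", (12, 768, 2)), ("base_plus", (24, 544, 3)), ("large", (24, 1024, 3))]:
    # prints roughly 175M / 166M / 560M, close to the reported 175M / 167M / 562M
    print(name, round(vlmo_param_estimate(*cfg) / 1e6), "M")
```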

Setup

```bash
alias=`whoami | cut -d'.' -f2`; docker run -it --rm --runtime=nvidia --ipc=host --privileged -v /home/${alias}:/home/${alias} pytorch/pytorch:1.8.0-cuda11.1-cudnn8-devel bash
```

First, clone the repo and install required packages:

```bash
git clone https://github.com/microsoft/unilm.git
cd unilm/vlmo

pip install -r requirements.txt
```

Dataset Preparation

We process the pre-training and fine-tuning data into the same Arrow-based format as ViLT.

Pre-training

Replace <ARROW_ROOT> with your data directory in the following commands.

Step 1: Vision Pre-Training

Download the pre-trained vision model weights from the BEiT repo.

Step 2: Language Pre-Training (VLMo-Base)

```bash
# download from https://github.com/addf400/files/releases/download/v1.0/beit_base_patch16_224_pt22k_ft22kto1k.pth
export INIT_CKPT=/path/to/save/beit_base_checkpoint

python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> task_textmlm_base whole_word_masking=True step200k per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path=$INIT_CKPT log_dir=<YOUR_OUTPUT_PATH>
```
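The `whole_word_masking=True` flag masks all WordPiece sub-tokens of a word together rather than masking sub-tokens independently. A minimal pure-Python sketch of the idea (names are illustrative; the repo implements this inside its masked-LM data pipeline):

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, seed=0):
    """Group a word-start token with its '##' continuations, then mask
    each selected word in full, never a lone sub-token."""
    rng = random.Random(seed)
    words, cur = [], []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and cur:
            cur.append(i)          # continuation joins the current word
        else:
            if cur:
                words.append(cur)
            cur = [i]              # a new word starts here
    if cur:
        words.append(cur)
    out = list(tokens)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:
                out[i] = "[MASK]"
    return out

# at mask_prob=1.0 every word is masked in full, sub-tokens included
print(whole_word_mask(["play", "##ing", "foot", "##ball", "is", "fun"], mask_prob=1.0))
```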

Alternatively, you can download our pre-trained checkpoints for this stage.

Step 3: Vision-Language Pre-Training (VLMo-Base)

```bash
export INIT_CKPT=/path/to/save/last_stage_ckpt

python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> task_mlm_itm_itc_base whole_word_masking=True step200k per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path=$INIT_CKPT log_dir=<YOUR_OUTPUT_PATH>
```
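The `task_mlm_itm_itc_base` config combines masked language modeling, image-text matching, and image-text contrastive (ITC) objectives. A pure-Python sketch of the ITC term, assuming in-batch negatives and a fixed temperature (illustrative only; the actual implementation differs, e.g. in how negatives are gathered across GPUs):

```python
import math

def itc_loss(img_emb, txt_emb, temperature=0.07):
    """Image-text contrastive loss with in-batch negatives:
    matched image/text pairs share the same index in the batch."""
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    img = [normalize(v) for v in img_emb]
    txt = [normalize(v) for v in txt_emb]
    b = len(img)
    # cosine-similarity logits, scaled by the temperature
    logits = [[sum(a * c for a, c in zip(img[i], txt[j])) / temperature
               for j in range(b)] for i in range(b)]

    def xent(row, target):
        # numerically stable cross-entropy of a softmax over `row`
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        return log_z - row[target]

    i2t = sum(xent(logits[i], i) for i in range(b)) / b  # image -> text
    t2i = sum(xent([logits[i][j] for i in range(b)], j) for j in range(b)) / b  # text -> image
    return (i2t + t2i) / 2

# perfectly aligned pairs give a near-zero loss
print(itc_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]))
```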

Fine-Tuning on Downstream Tasks

Commands

```bash
python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> "<CONFIG_NAME>" per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path="<VLMo_WEIGHT>" log_dir=<YOUR_OUTPUT_PATH>
```

To reduce GPU memory cost, use DeepSpeed and activation checkpointing.

Configs

You can find the <CONFIG_NAME> for each task below:

VQAv2

| <CONFIG_NAME> | initialized checkpoint | finetuned weight | test-dev |
|---|---|---|---|
| task_finetune_vqa_base_image480 | VLMo-base | weight | 76.6 |
| task_finetune_vqa_base_plus_image480 | VLMo-base_plus | weight | 78.5 |
| task_finetune_vqa_large_image480 | VLMo-large | weight | 79.9 |

NLVR2

| <CONFIG_NAME> | initialized checkpoint | finetuned weight | test-P |
|---|---|---|---|
| task_finetune_nlvr2_base_image384 | VLMo-base | weight | 83.3 |
| task_finetune_nlvr2_base_plus_image384 | VLMo-base_plus | weight | 85.1 |
| task_finetune_nlvr2_large_image384 | VLMo-large | weight | 86.9 |

COCO

| <CONFIG_NAME> | initialized checkpoint | finetuned weight | TR@1 | IR@1 |
|---|---|---|---|---|
| task_finetune_irtr_coco_base_image384 | VLMo-base | weight | 74.8 | 57.2 |
| task_finetune_irtr_coco_base_plus_image384 | VLMo-base_plus | weight | 76.3 | 58.6 |
| task_finetune_irtr_coco_large_image384 | VLMo-large | weight | 78.2 | 60.6 |

F30K

| <CONFIG_NAME> | initialized checkpoint | finetuned weight | TR@1 | IR@1 |
|---|---|---|---|---|
| task_finetune_irtr_f30k_base_image384 | VLMo-base_coco_finetuned | weight | 92.3 | 79.3 |
| task_finetune_irtr_f30k_base_plus_image384 | VLMo-base_plus | weight | 93.2 | 81.8 |
| task_finetune_irtr_f30k_large_image384 | VLMo-large_coco_finetuned | weight | 95.3 | 84.5 |

Evaluation

To evaluate a fine-tuned model, append test_only=True and set load_path= to the fine-tuned VLMo weights, as follows:

```bash
python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=1 "<CONFIG_NAME>" per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path="<Finetuned_VLMo_WEIGHT>" test_only=True
```
  • For retrieval tasks, also set get_recall_metric=True in the command.

Acknowledgement

This repository is built using the ViLT repository, the BEiT repository, ALBEF, and the timm library.

Citation

If you find this repository useful, please consider citing our work:

```
@inproceedings{vlmo,
  title={{VLMo}: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts},
  author={Hangbo Bao and Wenhui Wang and Li Dong and Qiang Liu and Owais Khan Mohammed and Kriti Aggarwal and Subhojit Som and Songhao Piao and Furu Wei},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022},
  url={https://openreview.net/forum?id=bydKs84JEyw}
}
```

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.

Microsoft Open Source Code of Conduct

Contact Information

For help or issues using VLMo models, please submit a GitHub issue.