ViTDet

Description

This is an implementation of ViTDet based on MMDetection, MMCV, and MMEngine.

Usage

Training commands

Following the original setting, this project is trained with a total batch size of 64 (16 GPUs with 4 images per GPU).

In MMDetection's root directory, run the following command to train the model:

bash

GPUS=${GPUS} ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR}

Below is an example of using 16 GPUs to train ViTDet on a Slurm partition named dev, with the work directory set to a shared file system.

shell

GPUS=16 ./tools/slurm_train.sh dev vitdet_mask_b projects/ViTDet/configs/vitdet_mask-rcnn_vit-b-mae_lsj-100e.py /nfs/xxxx/vitdet_mask-rcnn_vit-b-mae_lsj-100e
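
If fewer than 16 GPUs are available, one way to keep the total batch size of 64 is to raise the number of images per GPU. The sketch below only does that arithmetic; `samples_per_gpu` and `TOTAL_BATCH` are hypothetical names introduced for illustration and are not part of MMDetection.

```shell
# Total batch size used by this project (16 GPUs x 4 images per GPU).
TOTAL_BATCH=64

# Hypothetical helper: print the per-GPU batch size that keeps the
# total batch size at TOTAL_BATCH for a given number of GPUs ($1).
samples_per_gpu() {
  if [ "$1" -le 0 ] || [ $(( TOTAL_BATCH % $1 )) -ne 0 ]; then
    echo "GPU count must be a positive divisor of ${TOTAL_BATCH}" >&2
    return 1
  fi
  echo $(( TOTAL_BATCH / $1 ))
}

samples_per_gpu 16   # prints 4, the setting described above
samples_per_gpu 8    # prints 8
```

The result could then be appended to the training command as an override such as `--cfg-options train_dataloader.batch_size=8`, assuming the config exposes `train_dataloader.batch_size` as MMEngine-style configs usually do. A larger per-GPU batch also needs more GPU memory, so this may not fit on every device.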

Testing commands

In MMDetection's root directory, run the following command to test the model:

bash

python tools/test.py projects/ViTDet/configs/vitdet_mask-rcnn_vit-b-mae_lsj-100e.py ${CHECKPOINT_PATH}
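
As a convenience, the test command can be wrapped so that a missing checkpoint is reported before the model is built. This is a minimal sketch; `run_vitdet_test` is a hypothetical helper name, and it assumes the command is run from MMDetection's root directory.

```shell
# Config used throughout this README.
CONFIG=projects/ViTDet/configs/vitdet_mask-rcnn_vit-b-mae_lsj-100e.py

# Hypothetical wrapper: check that the checkpoint ($1) exists,
# then run the test command shown above.
run_vitdet_test() {
  if [ ! -f "$1" ]; then
    echo "checkpoint not found: $1" >&2
    return 1
  fi
  python tools/test.py "$CONFIG" "$1"
}
```

It would be called as `run_vitdet_test ${CHECKPOINT_PATH}`.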

Results

Based on MMDetection, this project nearly matches the training and test accuracy of the original ViTDet.

Method	Backbone	Pretrained Model	Training set	Test set	Epoch	Val Box AP	Val Mask AP	Download
ViTDet	ViT-B	MAE	COCO2017 Train	COCO2017 Val	100	51.6	45.7	model / log

Note:

The mask AP is slightly lower than that of the official repo.
Code and weights for other model versions will be released in the future.

Citation

latex

@article{li2022exploring,
  title={Exploring plain vision transformer backbones for object detection},
  author={Li, Yanghao and Mao, Hanzi and Girshick, Ross and He, Kaiming},
  journal={arXiv preprint arXiv:2203.16527},
  year={2022}
}

Checklist