MaxViT: Multi-Axis Vision Transformer (ECCV 2022)

⚠️ DISCLAIMER: This implementation is still under development.

[TOC]

MaxViT is a family of hybrid (CNN + ViT) vision backbone models that achieves better performance across the board, in both parameter and FLOPs efficiency, than state-of-the-art ConvNets and Transformers (Blog). MaxViT also scales well to large datasets such as ImageNet-21K. Notably, thanks to the linear complexity of the grid attention used, MaxViT scales well to tasks requiring large image sizes, such as object detection and segmentation.
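As a rough back-of-the-envelope view of that claim: for an $H \times W$ feature map with a fixed window size $P$ and grid size $G$,

$$
O\big((HW)^2\big)\ \text{(full self-attention)}
\quad\longrightarrow\quad
O\big(HW \cdot P^2\big)\ \text{(block)} \;+\; O\big(HW \cdot G^2\big)\ \text{(grid)},
$$

so with $P$ and $G$ held fixed (e.g. 7x7 for 224x224 classification inputs), the attention cost grows linearly with the number of tokens $HW$ rather than quadratically.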

Figure: MaxViT meta-architecture, a homogeneously stacked backbone wherein each MaxViT block contains MBConv, block attention (window-based local attention), and grid attention (dilated global attention).
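To make the block/grid distinction concrete, below is a minimal NumPy sketch of the two token partitionings; the function names, standalone layout, and shapes are illustrative assumptions for this sketch, not the implementation in this project.

```python
import numpy as np

def block_partition(x, window_size):
    """Split an (H, W, C) feature map into non-overlapping P x P windows.

    Block attention then runs self-attention within each window (local mixing).
    Assumes H and W are divisible by `window_size`.
    """
    h, w, c = x.shape
    p = window_size
    x = x.reshape(h // p, p, w // p, p, c)
    x = x.transpose(0, 2, 1, 3, 4)   # (H/P, W/P, P, P, C)
    return x.reshape(-1, p * p, c)   # (num_windows, P*P, C)

def grid_partition(x, grid_size):
    """Split an (H, W, C) feature map into a fixed G x G grid of tokens.

    Each group takes one token from each of the G x G cells tiling the image
    (a strided, dilated sampling), so grid attention mixes tokens globally.
    Assumes H and W are divisible by `grid_size`.
    """
    h, w, c = x.shape
    g = grid_size
    x = x.reshape(g, h // g, g, w // g, c)
    x = x.transpose(1, 3, 0, 2, 4)   # (H/G, W/G, G, G, C)
    return x.reshape(-1, g * g, c)   # (num_groups, G*G, C)

# Example: a 56x56 feature map with P = G = 7 gives 64 windows of 49 local
# tokens (block attention) and 64 groups of 49 image-spanning tokens (grid
# attention).
feat = np.zeros((56, 56, 64), dtype=np.float32)
print(block_partition(feat, 7).shape)  # (64, 49, 64)
print(grid_partition(feat, 7).shape)   # (64, 49, 64)
```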

Figure: results on ImageNet-1k standard train and test.

Figure: results with ImageNet-21k and JFT pre-trained models.

Model Performance

Note: the DeiT ImageNet pretrain experimental settings differ from the paper. These experiments follow the pre-training hyperparameters in the paper and only run pre-training, for a similar number of steps; the paper additionally suggests a short fine-tuning stage with different hyper-parameters and EMA.

<section class="tabs">

DeiT ImageNet pretrain {.new-tab}

| Model | Eval Size | Top-1 Acc | Acc on Paper | #Params | #FLOPs | Config |
| ----- | --------- | --------- | ------------ | ------- | ------ | ------ |
| MaxViT-Tiny | 224x224 | 83.1 (-0.5) | 83.6 | 31M | 5.6G | config |
| MaxViT-Small | 224x224 | 84.1 (-0.3) | 84.4 | 69M | 11.7G | config |
| MaxViT-Base | 224x224 | 84.2 (-0.7) | 84.9 | 120M | 23.4G | config |
| MaxViT-Large | 224x224 | 84.6 (-0.6) | 85.2 | 212M | 43.9G | config |
| MaxViT-XLarge | 224x224 | 84.8 | - | 475M | 97.9G | config |

Cascade RCNN models {.new-tab}

| Model | Image Size | Window Size | Epochs | box AP | box AP on Paper | mask AP | Config |
| ----- | ---------- | ----------- | ------ | ------ | --------------- | ------- | ------ |
| MaxViT-Tiny | 640x640 | 20x20 | 200 | 49.97 | - | 42.69 | config |
| MaxViT-Tiny | 896x896 | 28x28 | 200 | 52.35 (+0.25) | 52.1 | 44.69 | - |
| MaxViT-Small | 640x640 | 20x20 | 200 | 50.79 | - | 43.36 | - |
| MaxViT-Small | 896x896 | 28x28 | 200 | 53.54 (+0.44) | 53.1 | 45.79 | config |
| MaxViT-Base | 640x640 | 20x20 | 200 | 51.59 | - | 44.07 | config |
| MaxViT-Base | 896x896 | 28x28 | 200 | 53.47 (+0.07) | 53.4 | 45.96 | config |
</section>

<section class="tabs">

JFT-300M supervised pretrain {.new-tab}

| Model | Pretrain Size | #Params | #FLOPs | Global PR-AUC |
| ----- | ------------- | ------- | ------ | ------------- |
| MaxViT-Base | 224x224 | 120M | 23.4G | 52.75% |
| MaxViT-Large | 224x224 | 212M | 43.9G | 53.77% |
| MaxViT-XLarge | 224x224 | 475M | - | 54.71% |

ImageNet Finetuning {.new-tab}

| Model | Image Size | Top-1 Acc | Acc on Paper | #Params | #FLOPs | Config |
| ----- | ---------- | --------- | ------------ | ------- | ------ | ------ |
| MaxViT-Base | 384x384 | 88.37% (-0.32%) | 88.69% | 120M | 74.2G | config |
| MaxViT-Base | 512x512 | 88.63% (-0.19%) | 88.82% | 120M | 138.3G | config |
| MaxViT-Large | 384x384 | 88.86% (-0.26%) | 89.12% | 212M | 128.7G | config |
| MaxViT-Large | 512x512 | 89.02% (-0.39%) | 89.41% | 212M | 245.2G | config |
| MaxViT-XLarge | 384x384 | 89.21% (-0.15%) | 89.36% | 475M | 293.7G | config |
| MaxViT-XLarge | 512x512 | 89.31% (-0.22%) | 89.53% | 475M | 535.2G | config |

Cascade RCNN models {.new-tab}

| Model | Image Size | Window Size | Epochs | box AP | box AP on Paper | mask AP | Config |
| ----- | ---------- | ----------- | ------ | ------ | --------------- | ------- | ------ |
| MaxViT-Base | 896x896 | 28x28 | 200 | 54.31 (+0.91) | 53.4 | 46.31 | config |
| MaxViT-Large | 896x896 | 28x28 | 200 | 54.69 | - | 46.59 | config |
</section>

Citation

Should you find this repository useful, please consider citing:

@article{tu2022maxvit,
  title={MaxViT: Multi-Axis Vision Transformer},
  author={Tu, Zhengzhong and Talebi, Hossein and Zhang, Han and Yang, Feng and Milanfar, Peyman and Bovik, Alan and Li, Yinxiao},
  journal={ECCV},
  year={2022},
}