MaxViT: Multi-Axis Vision Transformer (ECCV 2022)

⚠️ DISCLAIMER: This implementation is still under development.

[TOC]

MaxViT is a family of hybrid (CNN + ViT) vision backbone models that achieves better performance across the board, in both parameter and FLOPs efficiency, than state-of-the-art ConvNets and Transformers (Blog). MaxViT also scales well to large datasets such as ImageNet-21K. Notably, thanks to the linear complexity of the grid attention used, MaxViT scales well to tasks requiring large image sizes, such as object detection and segmentation.
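As a rough back-of-the-envelope view of that claim: for an $H \times W$ feature map with a fixed window size $P$ and grid size $G$,

$$
O\big((HW)^2\big)\ \text{(full self-attention)}
\quad\longrightarrow\quad
O\big(HW \cdot P^2\big)\ \text{(block)} \;+\; O\big(HW \cdot G^2\big)\ \text{(grid)},
$$

so with $P$ and $G$ held fixed (e.g. 7x7 for 224x224 classification inputs), the attention cost grows linearly with the number of tokens $HW$ rather than quadratically.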

Figure: MaxViT meta-architecture, a homogeneously stacked backbone wherein each MaxViT block contains MBConv, block attention (window-based local attention), and grid attention (dilated global attention).
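To make the block/grid distinction concrete, below is a minimal NumPy sketch of the two token partitionings; the function names, standalone layout, and shapes are illustrative assumptions for this sketch, not the implementation in this project.

```python
import numpy as np

def block_partition(x, window_size):
    """Split an (H, W, C) feature map into non-overlapping P x P windows.

    Block attention then runs self-attention within each window (local mixing).
    Assumes H and W are divisible by `window_size`.
    """
    h, w, c = x.shape
    p = window_size
    x = x.reshape(h // p, p, w // p, p, c)
    x = x.transpose(0, 2, 1, 3, 4)   # (H/P, W/P, P, P, C)
    return x.reshape(-1, p * p, c)   # (num_windows, P*P, C)

def grid_partition(x, grid_size):
    """Split an (H, W, C) feature map into a fixed G x G grid of tokens.

    Each group takes one token from each of the G x G cells tiling the image
    (a strided, dilated sampling), so grid attention mixes tokens globally.
    Assumes H and W are divisible by `grid_size`.
    """
    h, w, c = x.shape
    g = grid_size
    x = x.reshape(g, h // g, g, w // g, c)
    x = x.transpose(1, 3, 0, 2, 4)   # (H/G, W/G, G, G, C)
    return x.reshape(-1, g * g, c)   # (num_groups, G*G, C)

# Example: a 56x56 feature map with P = G = 7 gives 64 windows of 49 local
# tokens (block attention) and 64 groups of 49 image-spanning tokens (grid
# attention).
feat = np.zeros((56, 56, 64), dtype=np.float32)
print(block_partition(feat, 7).shape)  # (64, 49, 64)
print(grid_partition(feat, 7).shape)   # (64, 49, 64)
```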

Figure: results on ImageNet-1k standard train and test.

Figure: results with ImageNet-21k and JFT pre-trained models.

Model Performance

Note: the DeiT ImageNet pretrain experimental settings differ from the paper. These experiments follow the pre-training hyperparameters in the paper and only run pre-training, for a similar number of steps; the paper additionally suggests a short fine-tuning stage with different hyper-parameters and EMA.

<section class="tabs">

DeiT ImageNet pretrain {.new-tab}

| Model | Eval Size | Top-1 Acc | Acc on Paper | #Params | #FLOPs | Config |
| ----- | --------- | --------- | ------------ | ------- | ------ | ------ |
| MaxViT-Tiny | 224x224 | 83.1 (-0.5) | 83.6 | 31M | 5.6G | config |
| MaxViT-Small | 224x224 | 84.1 (-0.3) | 84.4 | 69M | 11.7G | config |
| MaxViT-Base | 224x224 | 84.2 (-0.7) | 84.9 | 120M | 23.4G | config |
| MaxViT-Large | 224x224 | 84.6 (-0.6) | 85.2 | 212M | 43.9G | config |
| MaxViT-XLarge | 224x224 | 84.8 | - | 475M | 97.9G | config |

Cascade RCNN models {.new-tab}

| Model | Image Size | Window Size | Epochs | box AP | box AP on Paper | mask AP | Config |
| ----- | ---------- | ----------- | ------ | ------ | --------------- | ------- | ------ |
| MaxViT-Tiny | 640x640 | 20x20 | 200 | 49.97 | - | 42.69 | config |
| MaxViT-Tiny | 896x896 | 28x28 | 200 | 52.35 (+0.25) | 52.1 | 44.69 | - |
| MaxViT-Small | 640x640 | 20x20 | 200 | 50.79 | - | 43.36 | - |
| MaxViT-Small | 896x896 | 28x28 | 200 | 53.54 (+0.44) | 53.1 | 45.79 | config |
| MaxViT-Base | 640x640 | 20x20 | 200 | 51.59 | - | 44.07 | config |
| MaxViT-Base | 896x896 | 28x28 | 200 | 53.47 (+0.07) | 53.4 | 45.96 | config |
</section>

<section class="tabs">

JFT-300M supervised pretrain {.new-tab}

| Model | Pretrain Size | #Params | #FLOPs | Global PR-AUC |
| ----- | ------------- | ------- | ------ | ------------- |
| MaxViT-Base | 224x224 | 120M | 23.4G | 52.75% |
| MaxViT-Large | 224x224 | 212M | 43.9G | 53.77% |
| MaxViT-XLarge | 224x224 | 475M | - | 54.71% |

ImageNet Finetuning {.new-tab}

| Model | Image Size | Top-1 Acc | Acc on Paper | #Params | #FLOPs | Config |
| ----- | ---------- | --------- | ------------ | ------- | ------ | ------ |
| MaxViT-Base | 384x384 | 88.37% (-0.32%) | 88.69% | 120M | 74.2G | config |
| MaxViT-Base | 512x512 | 88.63% (-0.19%) | 88.82% | 120M | 138.3G | config |
| MaxViT-Large | 384x384 | 88.86% (-0.26%) | 89.12% | 212M | 128.7G | config |
| MaxViT-Large | 512x512 | 89.02% (-0.39%) | 89.41% | 212M | 245.2G | config |
| MaxViT-XLarge | 384x384 | 89.21% (-0.15%) | 89.36% | 475M | 293.7G | config |
| MaxViT-XLarge | 512x512 | 89.31% (-0.22%) | 89.53% | 475M | 535.2G | config |

Cascade RCNN models {.new-tab}

| Model | Image Size | Window Size | Epochs | box AP | box AP on Paper | mask AP | Config |
| ----- | ---------- | ----------- | ------ | ------ | --------------- | ------- | ------ |
| MaxViT-Base | 896x896 | 28x28 | 200 | 54.31 (+0.91) | 53.4 | 46.31 | config |
| MaxViT-Large | 896x896 | 28x28 | 200 | 54.69 | - | 46.59 | config |
</section>

Citation

Should you find this repository useful, please consider citing:

@article{tu2022maxvit,
  title={MaxViT: Multi-Axis Vision Transformer},
  author={Tu, Zhengzhong and Talebi, Hossein and Zhang, Han and Yang, Feng and Milanfar, Peyman and Bovik, Alan and Li, Yinxiao},
  journal={ECCV},
  year={2022},
}