MaskFormer

Per-Pixel Classification is Not All You Need for Semantic Segmentation

<!-- [ALGORITHM] -->

Abstract

Modern approaches typically formulate semantic segmentation as a per-pixel classification task, while instance-level segmentation is handled with an alternative mask classification. Our key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks in a unified manner using the exact same model, loss, and training procedure. Following this observation, we propose MaskFormer, a simple mask classification model which predicts a set of binary masks, each associated with a single global class label prediction. Overall, the proposed mask classification-based method simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results. In particular, we observe that MaskFormer outperforms per-pixel classification baselines when the number of classes is large. Our mask classification-based method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.
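The abstract describes a model that predicts a set of binary masks, each paired with one global class prediction. As a rough illustration of how such outputs can be turned into a semantic segmentation map, the sketch below weights each mask by its class probabilities and takes a per-pixel argmax. The function name, the array shapes, and the convention that the last class column means "no object" are assumptions made for this example; this is not the MMDetection implementation.

```python
import numpy as np


def semantic_inference(class_logits, mask_logits):
    """Combine per-query class and mask predictions into a semantic map.

    class_logits: (N, K + 1) array; the last column is assumed to be a
                  "no object" class (an assumption of this sketch).
    mask_logits:  (N, H, W) array of per-query binary mask logits.
    Returns an (H, W) array of class indices in [0, K).
    """
    # Softmax over classes, then drop the assumed "no object" column.
    shifted = class_logits - class_logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    probs = probs[:, :-1]                        # (N, K)

    # Sigmoid turns mask logits into per-pixel mask probabilities.
    masks = 1.0 / (1.0 + np.exp(-mask_logits))   # (N, H, W)

    # Per-pixel class score: sum over queries of p_i(c) * m_i[h, w].
    scores = np.einsum("nk,nhw->khw", probs, masks)
    return scores.argmax(axis=0)


# Toy example: 2 queries, 2 classes (+ "no object"), 2x2 image.
class_logits = np.array([[5.0, 0.0, 0.0],    # query 0 favours class 0
                         [0.0, 5.0, 0.0]])   # query 1 favours class 1
mask_logits = np.array([[[9.0, 9.0], [-9.0, -9.0]],   # query 0: top row
                        [[-9.0, -9.0], [9.0, 9.0]]])  # query 1: bottom row
seg = semantic_inference(class_logits, mask_logits)
print(seg)
```

In the toy example the top row is assigned to class 0 and the bottom row to class 1, because each query contributes its class probabilities only where its mask is active.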

Introduction

MaskFormer requires the COCO and COCO-panoptic datasets for training and evaluation. Download and extract them into the COCO dataset path. The directory should look like this:

mmdetection
├── mmdet
├── tools
├── configs
├── data
│   ├── coco
│   │   ├── annotations
│   │   │   ├── panoptic_train2017.json
│   │   │   ├── panoptic_train2017
│   │   │   ├── panoptic_val2017.json
│   │   │   ├── panoptic_val2017
│   │   ├── train2017
│   │   ├── val2017
│   │   ├── test2017
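Before launching a long training run, it can help to confirm that the layout above is in place. The following is a minimal sketch that checks only for the entries shown in the tree; the helper name `missing_coco_entries` and the default root `data/coco` (relative to the `mmdetection` directory) are assumptions made for illustration and are not part of MMDetection.

```python
from pathlib import Path

# Paths copied from the directory tree above, relative to the COCO root.
EXPECTED_COCO_ENTRIES = [
    "annotations/panoptic_train2017.json",
    "annotations/panoptic_train2017",
    "annotations/panoptic_val2017.json",
    "annotations/panoptic_val2017",
    "train2017",
    "val2017",
    "test2017",
]


def missing_coco_entries(coco_root="data/coco"):
    """Return the expected entries that do not exist under coco_root."""
    root = Path(coco_root)
    return [entry for entry in EXPECTED_COCO_ENTRIES
            if not (root / entry).exists()]


if __name__ == "__main__":
    missing = missing_coco_entries()
    if missing:
        print("Missing under data/coco:", ", ".join(missing))
    else:
        print("COCO panoptic layout looks complete.")
```

Run it from the `mmdetection` directory so that the relative path resolves, or pass another root explicitly. The check only tests for existence and does not validate the contents of the annotation files.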

Results and Models

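| Backbone | style | Lr schd | Mem (GB) | Inf time (fps) | PQ | SQ | RQ | PQ_th | SQ_th | RQ_th | PQ_st | SQ_st | RQ_st | Config | Download |
| :------: | :-----: | :-----: | :------: | :------------: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----------: |
| R-50 | pytorch | 75e | 16.2 | - | 46.757 | 80.297 | 57.176 | 50.829 | 81.125 | 61.798 | 40.610 | 79.048 | 50.199 | config | model \| log |
| Swin-L | pytorch | 300e | 27.2 | - | 53.249 | 81.704 | 64.231 | 58.798 | 82.923 | 70.282 | 44.874 | 79.863 | 55.097 | config | model \| log |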
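For readers unfamiliar with the metric columns: PQ, SQ, and RQ are panoptic quality, segmentation quality, and recognition quality, and the `_th` and `_st` suffixes refer to "thing" and "stuff" classes. The sketch below follows the standard per-class definition of these metrics and assumes that predicted and ground-truth segments have already been matched (conventionally at IoU above 0.5). The function name and inputs are illustrative assumptions; this is not the evaluation code used to produce the table.

```python
def panoptic_quality(tp_ious, num_fp, num_fn):
    """Return (PQ, SQ, RQ) for a single class.

    tp_ious: IoU values of matched (true positive) segment pairs.
    num_fp:  number of unmatched predicted segments.
    num_fn:  number of unmatched ground-truth segments.
    """
    num_tp = len(tp_ious)
    denom = num_tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        # No segments of this class at all; return zeros by convention here.
        return 0.0, 0.0, 0.0
    sq = sum(tp_ious) / num_tp if num_tp else 0.0  # mean IoU of matches
    rq = num_tp / denom                            # F1-style detection score
    return sq * rq, sq, rq


# Two matched segments, one false positive, one false negative.
pq, sq, rq = panoptic_quality(tp_ious=[0.9, 0.7], num_fp=1, num_fn=1)
print(f"PQ={pq:.3f} SQ={sq:.3f} RQ={rq:.3f}")
```

For a single class, PQ equals SQ times RQ. The table values are presumably averages over classes, which would explain why the reported PQ is not exactly the product of the reported SQ and RQ.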

Note

  1. The R-50 version is mentioned in Table XI of the paper Masked-attention Mask Transformer for Universal Image Segmentation.
  2. The models were trained with mmdet 2.x and have been converted for mmdet 3.x.

Citation

@inproceedings{cheng2021maskformer,
  title={Per-Pixel Classification is Not All You Need for Semantic Segmentation},
  author={Bowen Cheng and Alexander G. Schwing and Alexander Kirillov},
  booktitle={NeurIPS},
  year={2021}
}