MaskFormer

Per-Pixel Classification is Not All You Need for Semantic Segmentation

Abstract

Modern approaches typically formulate semantic segmentation as a per-pixel classification task, while instance-level segmentation is handled with an alternative mask classification. Our key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks in a unified manner using the exact same model, loss, and training procedure. Following this observation, we propose MaskFormer, a simple mask classification model which predicts a set of binary masks, each associated with a single global class label prediction. Overall, the proposed mask classification-based method simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results. In particular, we observe that MaskFormer outperforms per-pixel classification baselines when the number of classes is large. Our mask classification-based method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.
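To illustrate how a set of binary masks with one class prediction each can yield a semantic segmentation map, here is a minimal pure-Python sketch. It assumes the marginalization rule described in the MaskFormer paper: each pixel receives the class that maximizes the sum, over all predicted masks, of class probability times mask probability. The function name `semantic_inference` and the nested-list input layout are illustrative assumptions, not part of the MMDetection API.

```python
def semantic_inference(class_probs, mask_probs):
    """Combine per-mask class scores and mask scores into a label map.

    class_probs: N rows of K+1 probabilities; the last column is assumed
                 to be the "no object" class and is ignored here.
    mask_probs:  N masks, each an H x W grid of probabilities in [0, 1].
    Returns an H x W grid of class indices.
    """
    num_classes = len(class_probs[0]) - 1  # drop the "no object" column
    height = len(mask_probs[0])
    width = len(mask_probs[0][0])

    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            # Score of class c at this pixel: sum over masks of p_i(c) * m_i[y][x].
            scores = [
                sum(p[c] * m[y][x] for p, m in zip(class_probs, mask_probs))
                for c in range(num_classes)
            ]
            row.append(max(range(num_classes), key=scores.__getitem__))
        labels.append(row)
    return labels
```

With two predicted masks that each cover one pixel of a 1x2 image and favor different classes, the function assigns each pixel the class of the mask that covers it.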

Introduction

MaskFormer requires the COCO and COCO-panoptic datasets for training and evaluation. Download and extract them into the COCO dataset path. The directory should be laid out as follows:


mmdetection
├── mmdet
├── tools
├── configs
├── data
│   ├── coco
│   │   ├── annotations
│   │   │   ├── panoptic_train2017.json
│   │   │   ├── panoptic_train2017
│   │   │   ├── panoptic_val2017.json
│   │   │   ├── panoptic_val2017
│   │   ├── train2017
│   │   ├── val2017
│   │   ├── test2017
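
Before launching training, it can help to confirm that the layout above is in place. The following is a small sketch using only the Python standard library; the helper name `missing_coco_paths` and the `EXPECTED` list are hypothetical and simply mirror the tree shown above, they are not part of MMDetection.

```python
from pathlib import Path

# Paths relative to data/coco, copied from the directory tree above.
EXPECTED = [
    "annotations/panoptic_train2017.json",
    "annotations/panoptic_train2017",
    "annotations/panoptic_val2017.json",
    "annotations/panoptic_val2017",
    "train2017",
    "val2017",
    "test2017",
]


def missing_coco_paths(coco_root):
    """Return the expected entries that do not exist under coco_root."""
    root = Path(coco_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

Calling `missing_coco_paths("data/coco")` from the `mmdetection` directory should return an empty list once everything has been extracted; any returned entries are the ones still missing.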

Results and Models

Backbone	Style	Lr schd	Mem (GB)	Inf time (fps)	PQ	SQ	RQ	PQ_th	SQ_th	RQ_th	PQ_st	SQ_st	RQ_st	Config	Download
R-50	pytorch	75e	16.2	-	46.757	80.297	57.176	50.829	81.125	61.798	40.610	79.048	50.199	config	model | log
Swin-L	pytorch	300e	27.2	-	53.249	81.704	64.231	58.798	82.923	70.282	44.874	79.863	55.097	config	model | log
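
For readers unfamiliar with the metric columns, the sketch below shows how PQ, SQ, and RQ relate for a single class, assuming the standard panoptic quality definition (segments are matched when their IoU exceeds 0.5). The `_th` and `_st` suffixes refer to "thing" and "stuff" classes. The function name `panoptic_quality` is illustrative and is not the evaluation code used to produce the table. Because the reported numbers are averaged over classes, the PQ column is generally not the product of the SQ and RQ columns.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Compute (PQ, SQ, RQ) for one class.

    matched_ious: IoU of each matched (true positive) segment pair.
    num_fp:       number of unmatched predicted segments.
    num_fn:       number of unmatched ground-truth segments.
    """
    num_tp = len(matched_ious)
    denom = num_tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        # No segments of this class in prediction or ground truth.
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / num_tp if num_tp else 0.0  # segmentation quality
    rq = num_tp / denom                                 # recognition quality
    return sq * rq, sq, rq
```

For example, two matches with IoUs 0.8 and 0.6 plus one false positive and one false negative give SQ = 0.7 and RQ = 2/3, so PQ is roughly 0.467.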

Note

The R-50 version is reported in Table XI of the paper Masked-attention Mask Transformer for Universal Image Segmentation.
The models were trained with mmdet 2.x and have been converted for mmdet 3.x.

Citation


@inproceedings{cheng2021maskformer,
  title={Per-Pixel Classification is Not All You Need for Semantic Segmentation},
  author={Bowen Cheng and Alexander G. Schwing and Alexander Kirillov},
  booktitle={NeurIPS},
  year={2021}
}