docs/source/en/model_doc/maskformer.md
This model was released on 2021-07-13 and added to Hugging Face Transformers on 2022-03-02.
This is a recently introduced model so the API hasn't been tested extensively. There may be some bugs or slight breaking changes to fix it in the future. If you see something strange, file a Github Issue.
</Tip>The MaskFormer model was proposed in Per-Pixel Classification is Not All You Need for Semantic Segmentation by Bowen Cheng, Alexander G. Schwing, Alexander Kirillov. MaskFormer addresses semantic segmentation with a mask classification paradigm instead of performing classic pixel-level classification.
The abstract from the paper is the following:
Modern approaches typically formulate semantic segmentation as a per-pixel classification task, while instance-level segmentation is handled with an alternative mask classification. Our key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks in a unified manner using the exact same model, loss, and training procedure. Following this observation, we propose MaskFormer, a simple mask classification model which predicts a set of binary masks, each associated with a single global class label prediction. Overall, the proposed mask classification-based method simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results. In particular, we observe that MaskFormer outperforms per-pixel classification baselines when the number of classes is large. Our mask classification-based method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.
The figure below illustrates the architecture of MaskFormer. Taken from the original paper.
This model was contributed by francesco. The original code can be found here.
use_auxiliary_loss of [MaskFormerConfig] to True, then prediction feedforward neural networks and Hungarian losses are added after each decoder layer (with the FFNs sharing parameters).get_num_masks function inside in the MaskFormerLoss class of modeling_maskformer.py. When training on multiple nodes, this should be
set to the average number of target masks across all nodes, as can be seen in the original implementation here.MaskFormerImageProcessor] to prepare images for the model and optional targets for the model.~MaskFormerImageProcessor.post_process_semantic_segmentation] or [~MaskFormerImageProcessor.post_process_panoptic_segmentation]. Both tasks can be solved using [MaskFormerForInstanceSegmentation] output, panoptic segmentation accepts an optional label_ids_to_fuse argument to fuse instances of the target object/s (e.g. sky) together.MaskFormer] with [Trainer] or Accelerate can be found here.[[autodoc]] models.maskformer.modeling_maskformer.MaskFormerModelOutput
[[autodoc]] models.maskformer.modeling_maskformer.MaskFormerForInstanceSegmentationOutput
[[autodoc]] MaskFormerDetrConfig
[[autodoc]] MaskFormerConfig
[[autodoc]] MaskFormerImageProcessor - preprocess - post_process_semantic_segmentation - post_process_instance_segmentation - post_process_panoptic_segmentation
[[autodoc]] MaskFormerImageProcessorPil - preprocess - post_process_semantic_segmentation - post_process_instance_segmentation - post_process_panoptic_segmentation
[[autodoc]] MaskFormerModel - forward
[[autodoc]] MaskFormerForInstanceSegmentation - forward