MM Grounding DINO

An Open and Comprehensive Pipeline for Unified Object Grounding and Detection

<!-- [ALGORITHM] -->

Abstract

Grounding-DINO is a state-of-the-art open-set detection model that tackles multiple vision tasks including Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC). Its effectiveness has led to its widespread adoption as a mainstream architecture for various downstream applications. However, despite its significance, the original Grounding-DINO model lacks comprehensive public technical details due to the unavailability of its training code. To bridge this gap, we present MM-Grounding-DINO, an open-source, comprehensive, and user-friendly baseline, which is built with the MMDetection toolbox. It adopts abundant vision datasets for pre-training and various detection and grounding datasets for fine-tuning. We give a comprehensive analysis of each reported result and detailed settings for reproduction. The extensive experiments on the benchmarks mentioned demonstrate that our MM-Grounding-DINO-Tiny outperforms the Grounding-DINO-Tiny baseline. We release all our models to the research community.


Dataset Preparation

Please refer to dataset_prepare.md or the Chinese version of the dataset preparation guide.

✨ What's New

💎 We have released the pre-trained weights for Swin-B and Swin-L. You are welcome to try them and give feedback.

Usage

Please refer to usage.md or the Chinese version of the usage guide.

Zero-Shot COCO Results and Models

| Model | Backbone | Style | COCO mAP | Pre-Train Data | Config | Download |
| --- | --- | --- | --- | --- | --- | --- |
| GDINO-T | Swin-T | Zero-shot | 46.7 | O365 | | |
| GDINO-T | Swin-T | Zero-shot | 48.1 | O365,GoldG | | |
| GDINO-T | Swin-T | Zero-shot | 48.4 | O365,GoldG,Cap4M | config | model |
| MM-GDINO-T | Swin-T | Zero-shot | 48.5(+1.8) | O365 | config | |
| MM-GDINO-T | Swin-T | Zero-shot | 50.4(+2.3) | O365,GoldG | config | model \| log |
| MM-GDINO-T | Swin-T | Zero-shot | 50.5(+2.1) | O365,GoldG,GRIT | config | model \| log |
| MM-GDINO-T | Swin-T | Zero-shot | 50.6(+2.2) | O365,GoldG,V3Det | config | model \| log |
| MM-GDINO-T | Swin-T | Zero-shot | 50.4(+2.0) | O365,GoldG,GRIT,V3Det | config | model \| log |
| MM-GDINO-B | Swin-B | Zero-shot | 52.5 | O365,GoldG,V3Det | config | model \| log |
| MM-GDINO-B* | Swin-B | - | 59.5 | O365,ALL | config | model \| log |
| MM-GDINO-L | Swin-L | Zero-shot | 53.0 | O365V2,OpenImageV6,GoldG | config | model \| log |
| MM-GDINO-L* | Swin-L | - | 60.3 | O365V2,OpenImageV6,ALL | config | model \| log |

  • The * marker indicates that the model has not been fully trained yet. We will release the final weights in the future.
  • ALL: GoldG,V3Det,COCO2017,LVISV1,COCO2014,GRIT,RefCOCO,RefCOCO+,RefCOCOg,gRefCOCO.

Zero-Shot LVIS Results

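| Model | MiniVal APr | MiniVal APc | MiniVal APf | MiniVal AP | Val1.0 APr | Val1.0 APc | Val1.0 APf | Val1.0 AP | Pre-Train Data |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GDINO-T | 18.8 | 24.2 | 34.7 | 28.8 | 10.1 | 15.3 | 29.9 | 20.1 | O365,GoldG,Cap4M |
| MM-GDINO-T | 28.1 | 30.2 | 42.0 | 35.7(+6.9) | 17.1 | 22.4 | 36.5 | 27.0(+6.9) | O365,GoldG |
| MM-GDINO-T | 26.6 | 32.4 | 41.8 | 36.5(+7.7) | 17.3 | 22.6 | 36.4 | 27.1(+7.0) | O365,GoldG,GRIT |
| MM-GDINO-T | 33.0 | 36.0 | 45.9 | 40.5(+11.7) | 21.5 | 25.5 | 40.2 | 30.6(+10.5) | O365,GoldG,V3Det |
| MM-GDINO-T | 34.2 | 37.4 | 46.2 | 41.4(+12.6) | 23.6 | 27.6 | 40.5 | 31.9(+11.8) | O365,GoldG,GRIT,V3Det |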

Zero-Shot ODinW (Object Detection in the Wild) Results

Results and models of ODinW13

| Method | GDINO-T (O365,GoldG,Cap4M) | MM-GDINO-T (O365,GoldG) | MM-GDINO-T (O365,GoldG,GRIT) | MM-GDINO-T (O365,GoldG,V3Det) | MM-GDINO-T (O365,GoldG,GRIT,V3Det) |
| --- | --- | --- | --- | --- | --- |
| AerialMaritimeDrone | 0.173 | 0.133 | 0.155 | 0.177 | 0.151 |
| Aquarium | 0.195 | 0.252 | 0.261 | 0.266 | 0.283 |
| CottontailRabbits | 0.799 | 0.771 | 0.810 | 0.778 | 0.786 |
| EgoHands | 0.608 | 0.499 | 0.537 | 0.506 | 0.519 |
| NorthAmericaMushrooms | 0.507 | 0.331 | 0.462 | 0.669 | 0.767 |
| Packages | 0.687 | 0.707 | 0.687 | 0.710 | 0.706 |
| PascalVOC | 0.563 | 0.565 | 0.580 | 0.556 | 0.566 |
| pistols | 0.726 | 0.585 | 0.709 | 0.671 | 0.729 |
| pothole | 0.215 | 0.136 | 0.285 | 0.199 | 0.243 |
| Raccoon | 0.549 | 0.469 | 0.511 | 0.553 | 0.535 |
| ShellfishOpenImages | 0.393 | 0.321 | 0.437 | 0.519 | 0.488 |
| thermalDogsAndPeople | 0.657 | 0.556 | 0.603 | 0.493 | 0.542 |
| VehiclesOpenImages | 0.613 | 0.566 | 0.603 | 0.614 | 0.615 |
| Average | 0.514 | 0.453 | 0.511 | 0.516 | 0.533 |

  • The MM-GDINO-T config file is odinw13

Results and models of ODinW35

| Method | GDINO-T (O365,GoldG,Cap4M) | MM-GDINO-T (O365,GoldG) | MM-GDINO-T (O365,GoldG,GRIT) | MM-GDINO-T (O365,GoldG,V3Det) | MM-GDINO-T (O365,GoldG,GRIT,V3Det) |
| --- | --- | --- | --- | --- | --- |
| AerialMaritimeDrone_large | 0.173 | 0.133 | 0.155 | 0.177 | 0.151 |
| AerialMaritimeDrone_tiled | 0.206 | 0.170 | 0.225 | 0.184 | 0.206 |
| AmericanSignLanguageLetters | 0.002 | 0.016 | 0.020 | 0.011 | 0.007 |
| Aquarium | 0.195 | 0.252 | 0.261 | 0.266 | 0.283 |
| BCCD | 0.161 | 0.069 | 0.118 | 0.083 | 0.077 |
| boggleBoards | 0.000 | 0.002 | 0.001 | 0.001 | 0.002 |
| brackishUnderwater | 0.021 | 0.033 | 0.021 | 0.025 | 0.025 |
| ChessPieces | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| CottontailRabbits | 0.806 | 0.771 | 0.810 | 0.778 | 0.786 |
| dice | 0.004 | 0.002 | 0.005 | 0.001 | 0.001 |
| DroneControl | 0.042 | 0.047 | 0.097 | 0.088 | 0.074 |
| EgoHands_generic | 0.608 | 0.527 | 0.537 | 0.506 | 0.519 |
| EgoHands_specific | 0.002 | 0.001 | 0.005 | 0.007 | 0.003 |
| HardHatWorkers | 0.046 | 0.048 | 0.070 | 0.070 | 0.108 |
| MaskWearing | 0.004 | 0.009 | 0.004 | 0.011 | 0.009 |
| MountainDewCommercial | 0.430 | 0.453 | 0.465 | 0.194 | 0.430 |
| NorthAmericaMushrooms | 0.471 | 0.331 | 0.462 | 0.669 | 0.767 |
| openPoetryVision | 0.000 | 0.001 | 0.000 | 0.000 | 0.000 |
| OxfordPets_by_breed | 0.003 | 0.002 | 0.004 | 0.006 | 0.004 |
| OxfordPets_by_species | 0.011 | 0.019 | 0.016 | 0.020 | 0.015 |
| PKLot | 0.001 | 0.004 | 0.002 | 0.008 | 0.007 |
| Packages | 0.695 | 0.707 | 0.687 | 0.710 | 0.706 |
| PascalVOC | 0.563 | 0.565 | 0.580 | 0.566 | 0.566 |
| pistols | 0.726 | 0.585 | 0.709 | 0.671 | 0.729 |
| plantdoc | 0.005 | 0.005 | 0.007 | 0.008 | 0.011 |
| pothole | 0.215 | 0.136 | 0.219 | 0.077 | 0.168 |
| Raccoons | 0.549 | 0.469 | 0.511 | 0.553 | 0.535 |
| selfdrivingCar | 0.089 | 0.091 | 0.076 | 0.094 | 0.083 |
| ShellfishOpenImages | 0.393 | 0.321 | 0.437 | 0.519 | 0.488 |
| ThermalCheetah | 0.087 | 0.063 | 0.081 | 0.030 | 0.045 |
| thermalDogsAndPeople | 0.657 | 0.556 | 0.603 | 0.493 | 0.543 |
| UnoCards | 0.006 | 0.012 | 0.010 | 0.009 | 0.005 |
| VehiclesOpenImages | 0.613 | 0.566 | 0.603 | 0.614 | 0.615 |
| WildfireSmoke | 0.134 | 0.106 | 0.154 | 0.042 | 0.127 |
| websiteScreenshots | 0.012 | 0.02 | 0.016 | 0.016 | 0.016 |
| Average | 0.227 | 0.202 | 0.228 | 0.214 | 0.284 |

  • The MM-GDINO-T config file is odinw35

Zero-Shot Referring Expression Comprehension Results

| Method | GDINO-T (O365,GoldG,Cap4M) | MM-GDINO-T (O365,GoldG) | MM-GDINO-T (O365,GoldG,GRIT) | MM-GDINO-T (O365,GoldG,V3Det) | MM-GDINO-T (O365,GoldG,GRIT,V3Det) |
| --- | --- | --- | --- | --- | --- |
| RefCOCO val @1,5,10 | 50.8/89.5/94.9 | 53.1/89.9/94.7 | 53.4/90.3/95.5 | 52.1/89.8/95.0 | 53.1/89.7/95.1 |
| RefCOCO testA @1,5,10 | 57.4/91.3/95.6 | 59.7/91.5/95.9 | 58.8/91.70/96.2 | 58.4/86.8/95.6 | 59.1/91.0/95.5 |
| RefCOCO testB @1,5,10 | 45.0/86.5/92.9 | 46.4/86.9/92.2 | 46.8/87.7/93.3 | 45.4/86.2/92.6 | 46.8/87.8/93.6 |
| RefCOCO+ val @1,5,10 | 51.6/86.4/92.6 | 53.1/87.0/92.8 | 53.5/88.0/93.7 | 52.5/86.8/93.2 | 52.7/87.7/93.5 |
| RefCOCO+ testA @1,5,10 | 57.3/86.7/92.7 | 58.9/87.3/92.9 | 59.0/88.1/93.7 | 58.1/86.7/93.5 | 58.7/87.2/93.1 |
| RefCOCO+ testB @1,5,10 | 46.4/84.1/90.7 | 47.9/84.3/91.0 | 47.9/85.5/92.7 | 46.9/83.7/91.5 | 48.4/85.8/92.1 |
| RefCOCOg val @1,5,10 | 60.4/92.1/96.2 | 61.2/92.6/96.1 | 62.7/93.3/97.0 | 61.7/92.9/96.6 | 62.9/93.3/97.2 |
| RefCOCOg test @1,5,10 | 59.7/92.1/96.3 | 61.1/93.3/96.7 | 62.6/94.9/97.1 | 61.0/93.1/96.8 | 62.9/93.9/97.4 |

| Method | thresh_score | GDINO-T (O365,GoldG,Cap4M) | MM-GDINO-T (O365,GoldG) | MM-GDINO-T (O365,GoldG,GRIT) | MM-GDINO-T (O365,GoldG,V3Det) | MM-GDINO-T (O365,GoldG,GRIT,V3Det) |
| --- | --- | --- | --- | --- | --- | --- |
| gRefCOCO val Pr@(F1=1, IoU≥0.5),N-acc | 0.5 | 39.3/70.4 | | | | 39.4/67.5 |
| gRefCOCO val Pr@(F1=1, IoU≥0.5),N-acc | 0.6 | 40.5/83.8 | | | | 40.6/83.1 |
| gRefCOCO val Pr@(F1=1, IoU≥0.5),N-acc | 0.7 | 41.3/91.8 | 39.8/84.7 | 40.7/89.7 | 40.3/88.8 | 41.0/91.3 |
| gRefCOCO val Pr@(F1=1, IoU≥0.5),N-acc | 0.8 | 41.5/96.8 | | | | 41.1/96.4 |
| gRefCOCO testA Pr@(F1=1, IoU≥0.5),N-acc | 0.5 | 31.9/70.4 | | | | 33.1/69.5 |
| gRefCOCO testA Pr@(F1=1, IoU≥0.5),N-acc | 0.6 | 29.3/82.9 | | | | 29.2/84.3 |
| gRefCOCO testA Pr@(F1=1, IoU≥0.5),N-acc | 0.7 | 27.2/90.2 | 26.3/89.0 | 26.0/91.9 | 25.4/91.8 | 26.1/93.0 |
| gRefCOCO testA Pr@(F1=1, IoU≥0.5),N-acc | 0.8 | 25.1/96.3 | | | | 23.8/97.2 |
| gRefCOCO testB Pr@(F1=1, IoU≥0.5),N-acc | 0.5 | 30.9/72.5 | | | | 33.0/69.6 |
| gRefCOCO testB Pr@(F1=1, IoU≥0.5),N-acc | 0.6 | 30.0/86.1 | | | | 31.6/96.7 |
| gRefCOCO testB Pr@(F1=1, IoU≥0.5),N-acc | 0.7 | 29.7/93.5 | 31.3/84.8 | 30.6/90.2 | 30.7/89.9 | 30.4/92.3 |
| gRefCOCO testB Pr@(F1=1, IoU≥0.5),N-acc | 0.8 | 29.1/97.4 | | | | 29.5/84.2 |

  • The MM-GDINO-T config file is here
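
The thresh_score column in the gRefCOCO table can be read as a cut-off on box confidence. The sketch below is a minimal illustration, not the MMDetection evaluator: it assumes that boxes scoring below thresh_score are discarded and that N-acc is the percentage of no-target expressions for which no box survives. The function names, the `>=` comparison, and the toy scores are hypothetical.

```python
from typing import List


def keep_above_threshold(scores: List[float], thresh_score: float) -> List[float]:
    """Keep only the confidence scores at or above the threshold (assumed rule)."""
    return [s for s in scores if s >= thresh_score]


def n_acc(no_target_samples: List[List[float]], thresh_score: float) -> float:
    """Percentage of no-target samples for which no box survives the threshold."""
    correct = sum(
        1
        for scores in no_target_samples
        if not keep_above_threshold(scores, thresh_score)
    )
    return 100.0 * correct / len(no_target_samples)


# Toy confidence scores for four expressions that refer to nothing in the image.
no_target_samples = [[0.55, 0.30], [0.65, 0.20], [0.75], [0.10]]

for thresh_score in (0.5, 0.6, 0.7, 0.8):
    print(thresh_score, n_acc(no_target_samples, thresh_score))
```

Under these assumptions a higher threshold leaves fewer boxes, so more no-target expressions end up with an empty prediction. This matches the trend in the GDINO-T column, where N-acc rises as thresh_score goes from 0.5 to 0.8.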

Zero-Shot Description Detection Dataset (DOD)

```shell
pip install ddd-dataset
```

| Method | mode | GDINO-T (O365,GoldG,Cap4M) | MM-GDINO-T (O365,GoldG) | MM-GDINO-T (O365,GoldG,GRIT) | MM-GDINO-T (O365,GoldG,V3Det) | MM-GDINO-T (O365,GoldG,GRIT,V3Det) |
| --- | --- | --- | --- | --- | --- | --- |
| FULL/short/middle/long/very long | concat | 17.2/18.0/18.7/14.8/16.3 | 15.6/17.3/16.7/14.3/13.1 | 17.0/17.7/18.0/15.7/15.7 | 16.2/17.4/16.8/14.9/15.4 | 17.5/23.4/18.3/14.7/13.8 |
| FULL/short/middle/long/very long | parallel | 22.3/28.2/24.8/19.1/13.9 | 21.7/24.7/24.0/20.2/13.7 | 22.5/25.6/25.1/20.5/14.9 | 22.3/25.6/24.5/20.6/14.7 | 22.9/28.1/25.4/20.4/14.4 |
| PRES/short/middle/long/very long | concat | 17.8/18.3/19.2/15.2/17.3 | 16.4/18.4/17.3/14.5/14.2 | 17.9/19.0/18.3/16.5/17.5 | 16.6/18.8/17.1/15.1/15.0 | 18.0/23.7/18.6/15.4/13.3 |
| PRES/short/middle/long/very long | parallel | 21.0/27.0/22.8/17.5/12.5 | 21.3/25.5/22.8/19.2/12.9 | 21.5/25.2/23.0/19.0/15.0 | 21.6/25.7/23.0/19.5/14.8 | 21.9/27.4/23.2/19.1/14.2 |
| ABS/short/middle/long/very long | concat | 15.4/17.1/16.4/13.6/14.9 | 13.4/13.4/14.5/13.5/11.9 | 14.5/13.1/16.7/13.6/13.3 | 14.8/12.5/15.6/14.3/15.8 | 15.9/22.2/17.1/12.5/14.4 |
| ABS/short/middle/long/very long | parallel | 26.0/32.0/33.0/23.6/15.5 | 22.8/22.2/28.7/22.9/14.7 | 25.6/26.8/33.9/24.5/14.7 | 24.1/24.9/30.7/23.8/14.7 | 26.0/30.3/34.1/23.9/14.6 |

Note:

  1. Inter-scenario evaluation is temporarily not supported because it takes a very long time and yields low performance. The metrics above are for Intra-scenario.
  2. "concat" is the default inference mode for Grounding DINO: it concatenates multiple sub-sentences with "." to form a single sentence for inference. "parallel", by contrast, performs inference on each sub-sentence in a for-loop.
  3. The MM-GDINO-T config files are concat_dod and parallel_dod
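
The difference between the two modes in note 2 comes down to how the text prompt is built. The following is a minimal sketch of that step only, with no model call. The helper names and the exact " . " spacing are assumptions made for illustration and are not taken from the MMDetection code.

```python
from typing import List


def build_concat_prompt(descriptions: List[str]) -> str:
    """concat mode: join all sub-sentences into one prompt separated by '.'."""
    cleaned = [d.strip().rstrip(".").strip() for d in descriptions]
    return " . ".join(cleaned) + " ."


def build_parallel_prompts(descriptions: List[str]) -> List[str]:
    """parallel mode: one prompt per sub-sentence, each run separately."""
    return [d.strip().rstrip(".").strip() + " ." for d in descriptions]


descriptions = ["a dog lying on the grass", "a person riding a bike"]

# concat: a single prompt, hence a single forward pass over the image.
concat_prompt = build_concat_prompt(descriptions)
print(concat_prompt)

# parallel: a for-loop over prompts, hence one forward pass per sub-sentence.
parallel_prompts = build_parallel_prompts(descriptions)
for prompt in parallel_prompts:
    print(prompt)
```

In this sketch, concat mode costs one forward pass per image regardless of how many descriptions there are, while parallel mode costs one pass per description.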

Pretrain Flickr30k Results

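| Model | Pre-Train Data | Val R@1 | Val R@5 | Val R@10 | Test R@1 | Test R@5 | Test R@10 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GLIP-T | O365,GoldG | 84.9 | 94.9 | 96.3 | 85.6 | 95.4 | 96.7 |
| GLIP-T | O365,GoldG,CC3M,SBU | 85.3 | 95.5 | 96.9 | 86.0 | 95.9 | 97.2 |
| GDINO-T | O365,GoldG,Cap4M | 87.8 | 96.6 | 98.0 | 88.1 | 96.9 | 98.2 |
| MM-GDINO-T | O365,GoldG | 85.5 | 95.6 | 97.2 | 86.2 | 95.7 | 97.4 |
| MM-GDINO-T | O365,GoldG,GRIT | 86.7 | 95.8 | 97.6 | 87.0 | 96.2 | 97.7 |
| MM-GDINO-T | O365,GoldG,V3Det | 85.9 | 95.7 | 97.4 | 86.3 | 95.7 | 97.4 |
| MM-GDINO-T | O365,GoldG,GRIT,V3Det | 86.7 | 96.0 | 97.6 | 87.2 | 96.2 | 97.7 |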

Note:

  1. @1,5,10 refers to precision at the top 1, 5, and 10 positions in a predicted ranked list.
  2. The MM-GDINO-T config file is here
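
As a rough illustration of the @1, @5, @10 metrics in note 1, the sketch below counts a phrase as correct at k when any of its top-k ranked boxes overlaps the ground-truth box with IoU of at least 0.5. This matching rule, the function names, and the toy boxes are assumptions for illustration. The actual evaluator in MMDetection may differ in details.

```python
from typing import List, Sequence, Tuple

Box = Sequence[float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def hit_at_k(ranked_boxes: List[Box], gt_box: Box, k: int, iou_thr: float = 0.5) -> bool:
    """True if any of the k highest-ranked boxes matches the ground truth."""
    return any(iou(box, gt_box) >= iou_thr for box in ranked_boxes[:k])


def score_at_k(samples: List[Tuple[List[Box], Box]], k: int) -> float:
    """Percentage of samples with a hit among the top-k predictions."""
    hits = [hit_at_k(boxes, gt, k) for boxes, gt in samples]
    return 100.0 * sum(hits) / len(hits)


# Two toy samples: (boxes sorted by descending score, ground-truth box).
samples = [
    ([(0, 0, 10, 10), (50, 50, 60, 60)], (0, 0, 10, 10)),    # correct at rank 1
    ([(0, 0, 10, 10), (20, 20, 40, 40)], (20, 20, 40, 40)),  # correct at rank 2
]

print(score_at_k(samples, 1))
print(score_at_k(samples, 5))
```

Because the top-k set only grows with k, the score can never decrease from @1 to @5 to @10, which is the pattern seen in every row of the table.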

Validating the generalization of a pre-trained model through fine-tuning

RTTS

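| Architecture | Backbone | Lr schd | box AP |
| --- | --- | --- | --- |
| Faster R-CNN | R-50 | 1x | 48.1 |
| Cascade R-CNN | R-50 | 1x | 50.8 |
| ATSS | R-50 | 1x | 48.2 |
| TOOD | R-50 | 1x | 50.8 |
| MM-GDINO(zero-shot) | Swin-T | | 49.8 |
| MM-GDINO | Swin-T | 1x | 69.1 |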

RUOD

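| Architecture | Backbone | Lr schd | box AP |
| --- | --- | --- | --- |
| Faster R-CNN | R-50 | 1x | 52.4 |
| Cascade R-CNN | R-50 | 1x | 55.3 |
| ATSS | R-50 | 1x | 55.7 |
| TOOD | R-50 | 1x | 57.4 |
| MM-GDINO(zero-shot) | Swin-T | | 29.8 |
| MM-GDINO | Swin-T | 1x | 65.5 |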

Brain Tumor

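| Architecture | Backbone | Lr schd | box AP |
| --- | --- | --- | --- |
| Faster R-CNN | R-50 | 50e | 43.5 |
| Cascade R-CNN | R-50 | 50e | 46.2 |
| DINO | R-50 | 50e | 46.4 |
| Cascade-DINO | R-50 | 50e | 48.6 |
| MM-GDINO | Swin-T | 50e | 47.5 |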

Cityscapes

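| Architecture | Backbone | Lr schd | box AP |
| --- | --- | --- | --- |
| Faster R-CNN | R-50 | 50e | 30.1 |
| Cascade R-CNN | R-50 | 50e | 31.8 |
| DINO | R-50 | 50e | 34.5 |
| Cascade-DINO | R-50 | 50e | 34.8 |
| MM-GDINO(zero-shot) | Swin-T | | 34.2 |
| MM-GDINO | Swin-T | 50e | 51.5 |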

People in Painting

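| Architecture | Backbone | Lr schd | box AP |
| --- | --- | --- | --- |
| Faster R-CNN | R-50 | 50e | 17.0 |
| Cascade R-CNN | R-50 | 50e | 18.0 |
| DINO | R-50 | 50e | 12.0 |
| Cascade-DINO | R-50 | 50e | 13.4 |
| MM-GDINO(zero-shot) | Swin-T | | 23.1 |
| MM-GDINO | Swin-T | 50e | 38.9 |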

COCO

(1) Closed-set performance

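| Architecture | Backbone | Lr schd | box AP |
| --- | --- | --- | --- |
| Faster R-CNN | R-50 | 1x | 37.4 |
| Cascade R-CNN | R-50 | 1x | 40.3 |
| ATSS | R-50 | 1x | 39.4 |
| TOOD | R-50 | 1x | 42.4 |
| DINO | R-50 | 1x | 50.1 |
| GLIP(zero-shot) | Swin-T | | 46.6 |
| GDINO(zero-shot) | Swin-T | | 48.5 |
| MM-GDINO(zero-shot) | Swin-T | | 50.4 |
| GLIP | Swin-T | 1x | 55.4 |
| GDINO | Swin-T | 1x | 58.1 |
| MM-GDINO | Swin-T | 1x | 58.2 |

  • The MM-GDINO-T config file is here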

(2) Open-set continuing pretraining performance

| Architecture | Backbone | Lr schd | box AP |
| --- | --- | --- | --- |
| GLIP(zero-shot) | Swin-T | | 46.7 |
| GDINO(zero-shot) | Swin-T | | 48.5 |
| MM-GDINO(zero-shot) | Swin-T | | 50.4 |
| MM-GDINO | Swin-T | 1x | 54.7 |

  • The MM-GDINO-T config file is here
  • Because the COCO dataset is small, continued pretraining on COCO alone can easily lead to overfitting. The results shown above are from the third epoch. We do not recommend training with this approach.

(3) Open vocabulary performance

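| Architecture | Backbone | Lr schd | box AP | Base box AP | Novel box AP | box AP@50 | Base box AP@50 | Novel box AP@50 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MM-GDINO(zero-shot) | Swin-T | | 51.1 | 48.4 | 58.9 | 66.7 | 64.0 | 74.2 |
| MM-GDINO | Swin-T | 1x | 57.2 | 56.1 | 60.4 | 73.6 | 73.0 | 75.3 |

  • The MM-GDINO-T config file is here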

LVIS 1.0

(1) Open-set continuing pretraining performance

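| Architecture | Backbone | Lr schd | MiniVal APr | MiniVal APc | MiniVal APf | MiniVal AP | Val1.0 APr | Val1.0 APc | Val1.0 APf | Val1.0 AP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GLIP(zero-shot) | Swin-T | | 18.1 | 21.2 | 33.1 | 26.7 | 10.8 | 14.7 | 29.0 | 19.6 |
| GDINO(zero-shot) | Swin-T | | 18.8 | 24.2 | 34.7 | 28.8 | 10.1 | 15.3 | 29.9 | 20.1 |
| MM-GDINO(zero-shot) | Swin-T | | 34.2 | 37.4 | 46.2 | 41.4 | 23.6 | 27.6 | 40.5 | 31.9 |
| MM-GDINO | Swin-T | 1x | 50.7 | 58.8 | 60.1 | 58.7 | 45.2 | 50.2 | 56.1 | 51.7 |

  • The MM-GDINO-T config file is here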

(2) Open vocabulary performance

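| Architecture | Backbone | Lr schd | MiniVal APr | MiniVal APc | MiniVal APf | MiniVal AP |
| --- | --- | --- | --- | --- | --- | --- |
| MM-GDINO(zero-shot) | Swin-T | | 34.2 | 37.4 | 46.2 | 41.4 |
| MM-GDINO | Swin-T | 1x | 43.2 | 57.4 | 59.3 | 57.1 |

  • The MM-GDINO-T config file is here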

RefEXP

RefCOCO

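| Architecture | Backbone | Lr schd | val @1 | val @5 | val @10 | testA @1 | testA @5 | testA @10 | testB @1 | testB @5 | testB @10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GDINO(zero-shot) | Swin-T | | 50.8 | 89.5 | 94.9 | 57.5 | 91.3 | 95.6 | 45.0 | 86.5 | 92.9 |
| MM-GDINO(zero-shot) | Swin-T | | 53.1 | 89.7 | 95.1 | 59.1 | 91.0 | 95.5 | 46.8 | 87.8 | 93.6 |
| GDINO | Swin-T | UNK | 89.2 | | | 91.9 | | | 86.0 | | |
| MM-GDINO | Swin-T | 5e | 89.5 | 98.6 | 99.4 | 91.4 | 99.2 | 99.8 | 86.6 | 97.9 | 99.1 |

  • The MM-GDINO-T config file is here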

RefCOCO+

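| Architecture | Backbone | Lr schd | val @1 | val @5 | val @10 | testA @1 | testA @5 | testA @10 | testB @1 | testB @5 | testB @10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GDINO(zero-shot) | Swin-T | | 51.6 | 86.4 | 92.6 | 57.3 | 86.7 | 92.7 | 46.4 | 84.1 | 90.7 |
| MM-GDINO(zero-shot) | Swin-T | | 52.7 | 87.7 | 93.5 | 58.7 | 87.2 | 93.1 | 48.4 | 85.8 | 92.1 |
| GDINO | Swin-T | UNK | 81.1 | | | 87.4 | | | 74.7 | | |
| MM-GDINO | Swin-T | 5e | 82.1 | 97.8 | 99.2 | 87.5 | 99.2 | 99.7 | 74.0 | 96.3 | 96.4 |

  • The MM-GDINO-T config file is here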

RefCOCOg

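| Architecture | Backbone | Lr schd | val @1 | val @5 | val @10 | test @1 | test @5 | test @10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GDINO(zero-shot) | Swin-T | | 60.4 | 92.1 | 96.2 | 59.7 | 92.1 | 96.3 |
| MM-GDINO(zero-shot) | Swin-T | | 62.9 | 93.3 | 97.2 | 62.9 | 93.9 | 97.4 |
| GDINO | Swin-T | UNK | 84.2 | | | 84.9 | | |
| MM-GDINO | Swin-T | 5e | 85.5 | 98.4 | 99.4 | 85.8 | 98.6 | 99.4 |

  • The MM-GDINO-T config file is here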

gRefCOCO

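| Architecture | Backbone | Lr schd | val Pr@(F1=1, IoU≥0.5) | val N-acc | testA Pr@(F1=1, IoU≥0.5) | testA N-acc | testB Pr@(F1=1, IoU≥0.5) | testB N-acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GDINO(zero-shot) | Swin-T | | 41.3 | 91.8 | 27.2 | 90.2 | 29.7 | 93.5 |
| MM-GDINO(zero-shot) | Swin-T | | 41.0 | 91.3 | 26.1 | 93.0 | 30.4 | 92.3 |
| MM-GDINO | Swin-T | 5e | 45.1 | 64.7 | 42.5 | 65.5 | 40.3 | 63.2 |

  • The MM-GDINO-T config file is here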

Citation

If you find this project useful in your research, please consider citing:

```latex
@article{zhao2024open,
  title={An Open and Comprehensive Pipeline for Unified Object Grounding and Detection},
  author={Zhao, Xiangyu and Chen, Yicheng and Xu, Shilin and Li, Xiangtai and Wang, Xinjiang and Li, Yining and Huang, Haian},
  journal={arXiv preprint arXiv:2401.02361},
  year={2024}
}
```