# GLIP: Grounded Language-Image Pre-training

<!-- [ALGORITHM] -->

## Abstract

This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representations semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks. 1) When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines. 2) After fine-tuning on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing prior SoTA. 3) When transferred to 13 downstream object detection tasks, a 1-shot GLIP rivals a fully-supervised Dynamic Head.


## Installation

```shell
cd $MMDETROOT

# source installation
pip install -r requirements/multimodal.txt

# or mim installation
mim install mmdet[multimodal]
```
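Either way, a quick import check (just a sanity sketch, not part of the official steps) confirms that the multimodal dependencies are available:

```python
# Verify that MMDetection and the transformers dependency pulled in by the
# multimodal extra can both be imported.
import mmdet
import transformers

print('mmdet:', mmdet.__version__)
print('transformers:', transformers.__version__)
```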
Download the converted GLIP-T (A) weights and run the image demo:

```shell
cd $MMDETROOT

wget https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_a_mmdet-b3654169.pth

python demo/image_demo.py demo/demo.jpg \
    configs/glip/glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365.py \
    --weights glip_tiny_a_mmdet-b3654169.pth \
    --texts 'bench. car'
```
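The same demo can also be driven from Python. This is a minimal sketch, assuming a recent MMDetection 3.x release in which `DetInferencer` (used internally by `demo/image_demo.py`) accepts a `texts` prompt; the paths mirror the shell example above:

```python
from mmdet.apis import DetInferencer

# Build the inferencer from the GLIP-T (A) config and the converted weights
# downloaded above.
inferencer = DetInferencer(
    model='configs/glip/glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365.py',
    weights='glip_tiny_a_mmdet-b3654169.pth')

# The text prompt plays the same role as `--texts` on the command line.
results = inferencer('demo/demo.jpg', texts='bench. car', out_dir='outputs')
print(results['predictions'][0]['labels'][:5])
```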

## NOTE

GLIP utilizes BERT as the language model, which requires access to https://huggingface.co/. If you encounter connection errors because that site is unreachable from your environment, you can download the required files on a machine with internet access and save them locally. Then set the `lang_model_name` field in the config to the local path. Please refer to the following code:

```python
from transformers import BertConfig, BertModel
from transformers import AutoTokenizer

config = BertConfig.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", add_pooling_layer=False, config=config)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

config.save_pretrained("your path/bert-base-uncased")
model.save_pretrained("your path/bert-base-uncased")
tokenizer.save_pretrained("your path/bert-base-uncased")
```
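With the files saved locally, point the config at that directory. A minimal sketch, assuming you edit the pre-training config in place (the variable corresponds to the `lang_model_name` field mentioned above):

```python
# In the GLIP config you run (e.g. glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365.py),
# replace the Hugging Face model identifier with the local directory saved above.
lang_model_name = 'your path/bert-base-uncased'
```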

## COCO Results and Models

| Model      | Zero-shot or Finetune | COCO mAP | Official COCO mAP | Pre-Train Data             | Config | Download     |
| :--------- | :-------------------- | :------: | :---------------: | :------------------------- | :----- | :----------- |
| GLIP-T (A) | Zero-shot             |   43.0   |       42.9        | O365                       | config | model        |
| GLIP-T (A) | Finetune              |   53.3   |       52.9        | O365                       | config | model \| log |
| GLIP-T (B) | Zero-shot             |   44.9   |       44.9        | O365                       | config | model        |
| GLIP-T (B) | Finetune              |   54.1   |       53.8        | O365                       | config | model \| log |
| GLIP-T (C) | Zero-shot             |   46.7   |       46.7        | O365,GoldG                 | config | model        |
| GLIP-T (C) | Finetune              |   55.2   |       55.1        | O365,GoldG                 | config | model \| log |
| GLIP-T     | Zero-shot             |   46.6   |       46.6        | O365,GoldG,CC3M,SBU        | config | model        |
| GLIP-T     | Finetune              |   55.4   |       55.2        | O365,GoldG,CC3M,SBU        | config | model \| log |
| GLIP-L     | Zero-shot             |   51.3   |       51.4        | FourODs,GoldG,CC3M+12M,SBU | config | model        |
| GLIP-L     | Finetune              |   59.4   |                   | FourODs,GoldG,CC3M+12M,SBU | config | model \| log |

Note:

1. The weights of the zero-shot models are adopted from the official weights and converted with the conversion script; we have not retrained these models for the time being.
2. Finetune refers to fine-tuning on the COCO 2017 dataset. The L model is trained on 16 A100 GPUs, while the remaining models are trained on 16 NVIDIA GeForce RTX 3090 GPUs.
3. Taking the GLIP-T (A) model as an example, we trained it twice with the official code and obtained fine-tuned mAPs of 52.5 and 52.6. The mAP in our reproduction is therefore higher than the official result, mainly because we modified the `weight_decay` parameter.
4. Our experiments revealed that training for 24 epochs leads to overfitting, so we report the best-performing checkpoint. If you train on a custom dataset, it is advisable to shorten the number of epochs and save the best-performing model (see the config sketch after these notes).
5. Because the official fine-tuning hyperparameters for the GLIP-L model have not been released, we have not yet reproduced the official accuracy. We found that this model can also overfit, so custom modifications to the data augmentation and the model may be necessary. Given the high cost of training, we have not investigated this further for now.
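For the shorter schedule and best-checkpoint saving suggested in note 4, the relevant settings are standard MMEngine config fields. A hedged sketch of an override (the concrete values, including the `weight_decay` mentioned in note 3, are placeholders to adapt to your dataset):

```python
# Train for fewer epochs than the default schedule and keep the best checkpoint.
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=12, val_interval=1)

default_hooks = dict(
    checkpoint=dict(type='CheckpointHook', interval=1, save_best='auto'))

# Placeholder value: note 3 only states that weight_decay was modified.
optim_wrapper = dict(optimizer=dict(weight_decay=0.05))
```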

## LVIS Results

| Model      | Official | MiniVal APr | MiniVal APc | MiniVal APf | MiniVal AP | Val1.0 APr | Val1.0 APc | Val1.0 APf | Val1.0 AP | Pre-Train Data             | Config | Download |
| :--------- | :------: | :---------: | :---------: | :---------: | :--------: | :--------: | :--------: | :--------: | :-------: | :------------------------- | :----- | :------- |
| GLIP-T (A) |    ✔     |             |             |             |            |            |            |            |           | O365                       | config | model    |
| GLIP-T (A) |          |    12.1     |    15.5     |    25.8     |    20.2    |    6.2     |    10.9    |    22.8    |   14.7    | O365                       | config | model    |
| GLIP-T (B) |    ✔     |             |             |             |            |            |            |            |           | O365                       | config | model    |
| GLIP-T (B) |          |     8.6     |    13.9     |    26.0     |    19.3    |    4.6     |    9.8     |    22.6    |   13.9    | O365                       | config | model    |
| GLIP-T (C) |    ✔     |    14.3     |    19.4     |    31.1     |    24.6    |            |            |            |           | O365,GoldG                 | config | model    |
| GLIP-T (C) |          |    14.4     |    19.8     |    31.9     |    25.2    |    8.3     |    13.2    |    28.1    |   18.2    | O365,GoldG                 | config | model    |
| GLIP-T     |    ✔     |             |             |             |            |            |            |            |           | O365,GoldG,CC3M,SBU        | config | model    |
| GLIP-T     |          |    18.1     |    21.2     |    33.1     |    26.7    |    10.8    |    14.7    |    29.0    |   19.6    | O365,GoldG,CC3M,SBU        | config | model    |
| GLIP-L     |    ✔     |    29.2     |    34.9     |    42.1     |    37.9    |            |            |            |           | FourODs,GoldG,CC3M+12M,SBU | config | model    |
| GLIP-L     |          |    27.9     |    33.7     |    39.7     |    36.1    |    20.2    |    25.8    |    35.3    |   28.5    | FourODs,GoldG,CC3M+12M,SBU | config | model    |

Note:

1. The above are all zero-shot evaluation results.
2. The evaluation metric is LVIS fixed AP. For details, please refer to Evaluating Large-Vocabulary Object Detectors: The Devil is in the Details.
3. We found that the performance of the small models is better than the official results, while the large model is lower. This is mainly because our post-processing is not fully aligned with the official GLIP implementation.

## ODinW (Object Detection in the Wild) Results

Learning visual representations from natural language supervision has recently shown great promise in a number of pioneering works. In general, these language-augmented visual models demonstrate strong transferability to a variety of datasets and tasks. However, it remains challenging to evaluate the transferability of these models due to the lack of easy-to-use evaluation toolkits and public benchmarks. To tackle this, we build ELEVATER, the first benchmark and toolkit for evaluating (pre-trained) language-augmented visual models. ELEVATER is composed of three components. (i) Datasets. As downstream evaluation suites, it consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge. (ii) Toolkit. An automatic hyper-parameter tuning toolkit is developed to facilitate model evaluation on downstream tasks. (iii) Metrics. A variety of evaluation metrics are used to measure sample-efficiency (zero-shot and few-shot) and parameter-efficiency (linear probing and full model fine-tuning). ELEVATER is a platform for Computer Vision in the Wild (CVinW) and is publicly released at https://computer-vision-in-the-wild.github.io/ELEVATER/

### Results and models of ODinW13

| Method               | GLIP-T(A) | Official | GLIP-T(B) | Official | GLIP-T(C) | Official | GroundingDINO-T | GroundingDINO-B |
| :------------------- | :-------: | :------: | :-------: | :------: | :-------: | :------: | :-------------: | :-------------: |
| AerialMaritimeDrone  |   0.123   |  0.122   |   0.110   |  0.110   |   0.130   |  0.130   |      0.173      |      0.281      |
| Aquarium             |   0.175   |  0.174   |   0.173   |  0.169   |   0.191   |  0.190   |      0.195      |      0.445      |
| CottontailRabbits    |   0.686   |  0.686   |   0.688   |  0.688   |   0.744   |  0.744   |      0.799      |      0.808      |
| EgoHands             |   0.013   |  0.013   |   0.003   |  0.004   |   0.314   |  0.315   |      0.608      |      0.764      |
| NorthAmericaMushrooms |  0.502   |  0.502   |   0.367   |  0.367   |   0.297   |  0.296   |      0.507      |      0.675      |
| Packages             |   0.589   |  0.589   |   0.083   |  0.083   |   0.699   |  0.699   |      0.687      |      0.670      |
| PascalVOC            |   0.512   |  0.512   |   0.541   |  0.540   |   0.565   |  0.565   |      0.563      |      0.711      |
| pistols              |   0.339   |  0.339   |   0.502   |  0.501   |   0.503   |  0.504   |      0.726      |      0.771      |
| pothole              |   0.007   |  0.007   |   0.030   |  0.030   |   0.058   |  0.058   |      0.215      |      0.478      |
| Raccoon              |   0.075   |  0.074   |   0.285   |  0.288   |   0.241   |  0.244   |      0.549      |      0.541      |
| ShellfishOpenImages  |   0.253   |  0.253   |   0.337   |  0.338   |   0.300   |  0.302   |      0.393      |      0.650      |
| thermalDogsAndPeople |   0.372   |  0.372   |   0.475   |  0.475   |   0.510   |  0.510   |      0.657      |      0.633      |
| VehiclesOpenImages   |   0.574   |  0.566   |   0.562   |  0.547   |   0.549   |  0.534   |      0.613      |      0.647      |
| Average              |   0.325   |  0.324   |   0.320   |  0.318   |   0.392   |  0.392   |      0.514      |      0.621      |

### Results and models of ODinW35

| Method                      | GLIP-T(A) | Official | GLIP-T(B) | Official | GLIP-T(C) | Official | GroundingDINO-T | GroundingDINO-B |
| :-------------------------- | :-------: | :------: | :-------: | :------: | :-------: | :------: | :-------------: | :-------------: |
| AerialMaritimeDrone_large   |   0.123   |  0.122   |   0.110   |  0.110   |   0.130   |  0.130   |      0.173      |      0.281      |
| AerialMaritimeDrone_tiled   |   0.174   |  0.174   |   0.172   |  0.172   |   0.172   |  0.172   |      0.206      |      0.364      |
| AmericanSignLanguageLetters |   0.001   |  0.001   |   0.003   |  0.003   |   0.009   |  0.009   |      0.002      |      0.096      |
| Aquarium                    |   0.175   |  0.175   |   0.173   |  0.171   |   0.192   |  0.182   |      0.195      |      0.445      |
| BCCD                        |   0.016   |  0.016   |   0.001   |  0.001   |   0.000   |  0.000   |      0.161      |      0.584      |
| boggleBoards                |   0.000   |  0.000   |   0.000   |  0.000   |   0.000   |  0.000   |      0.000      |      0.134      |
| brackishUnderwater          |   0.016   |  0.013   |   0.021   |  0.027   |   0.020   |  0.022   |      0.021      |      0.454      |
| ChessPieces                 |   0.001   |  0.001   |   0.000   |  0.000   |   0.001   |  0.001   |      0.000      |      0.000      |
| CottontailRabbits           |   0.710   |  0.709   |   0.683   |  0.683   |   0.752   |  0.752   |      0.806      |      0.797      |
| dice                        |   0.005   |  0.005   |   0.004   |  0.004   |   0.004   |  0.004   |      0.004      |      0.082      |
| DroneControl                |   0.016   |  0.017   |   0.006   |  0.008   |   0.005   |  0.007   |      0.042      |      0.638      |
| EgoHands_generic            |   0.009   |  0.010   |   0.005   |  0.006   |   0.510   |  0.508   |      0.608      |      0.764      |
| EgoHands_specific           |   0.001   |  0.001   |   0.004   |  0.006   |   0.003   |  0.004   |      0.002      |      0.687      |
| HardHatWorkers              |   0.029   |  0.029   |   0.023   |  0.023   |   0.033   |  0.033   |      0.046      |      0.439      |
| MaskWearing                 |   0.007   |  0.007   |   0.003   |  0.002   |   0.005   |  0.005   |      0.004      |      0.406      |
| MountainDewCommercial       |   0.218   |  0.227   |   0.199   |  0.197   |   0.478   |  0.463   |      0.430      |      0.580      |
| NorthAmericaMushrooms       |   0.502   |  0.502   |   0.450   |  0.450   |   0.497   |  0.497   |      0.471      |      0.501      |
| openPoetryVision            |   0.000   |  0.000   |   0.000   |  0.000   |   0.000   |  0.000   |      0.000      |      0.051      |
| OxfordPets_by_breed         |   0.001   |  0.002   |   0.002   |  0.004   |   0.001   |  0.002   |      0.003      |      0.799      |
| OxfordPets_by_species       |   0.016   |  0.011   |   0.012   |  0.009   |   0.013   |  0.009   |      0.011      |      0.872      |
| PKLot                       |   0.002   |  0.002   |   0.000   |  0.000   |   0.000   |  0.000   |      0.001      |      0.774      |
| Packages                    |   0.569   |  0.569   |   0.279   |  0.279   |   0.712   |  0.712   |      0.695      |      0.728      |
| PascalVOC                   |   0.512   |  0.512   |   0.541   |  0.540   |   0.565   |  0.565   |      0.563      |      0.711      |
| pistols                     |   0.339   |  0.339   |   0.502   |  0.501   |   0.503   |  0.504   |      0.726      |      0.771      |
| plantdoc                    |   0.002   |  0.002   |   0.007   |  0.007   |   0.009   |  0.009   |      0.005      |      0.376      |
| pothole                     |   0.007   |  0.010   |   0.024   |  0.025   |   0.085   |  0.101   |      0.215      |      0.478      |
| Raccoons                    |   0.075   |  0.074   |   0.285   |  0.288   |   0.241   |  0.244   |      0.549      |      0.541      |
| selfdrivingCar              |   0.071   |  0.072   |   0.074   |  0.074   |   0.081   |  0.080   |      0.089      |      0.318      |
| ShellfishOpenImages         |   0.253   |  0.253   |   0.337   |  0.338   |   0.300   |  0.302   |      0.393      |      0.650      |
| ThermalCheetah              |   0.028   |  0.028   |   0.000   |  0.000   |   0.028   |  0.028   |      0.087      |      0.290      |
| thermalDogsAndPeople        |   0.372   |  0.372   |   0.475   |  0.475   |   0.510   |  0.510   |      0.657      |      0.633      |
| UnoCards                    |   0.000   |  0.000   |   0.000   |  0.001   |   0.002   |  0.003   |      0.006      |      0.754      |
| VehiclesOpenImages          |   0.574   |  0.566   |   0.562   |  0.547   |   0.549   |  0.534   |      0.613      |      0.647      |
| WildfireSmoke               |   0.000   |  0.000   |   0.000   |  0.000   |   0.017   |  0.017   |      0.134      |      0.410      |
| websiteScreenshots          |   0.003   |  0.004   |   0.003   |  0.005   |   0.005   |  0.006   |      0.012      |      0.175      |
| Average                     |   0.134   |  0.134   |   0.138   |  0.138   |   0.179   |  0.178   |      0.227      |      0.492      |

## Results on Flickr30k

| Model     | Official | Pre-Train Data      | Val R@1 | Val R@5 | Val R@10 | Test R@1 | Test R@5 | Test R@10 |
| :-------- | :------: | :------------------ | :-----: | :-----: | :------: | :------: | :------: | :-------: |
| GLIP-T(C) |    ✔     | O365, GoldG         |  84.8   |  94.9   |   96.3   |   85.5   |   95.4   |   96.6    |
| GLIP-T(C) |          | O365, GoldG         |  84.9   |  94.9   |   96.3   |   85.6   |   95.4   |   96.7    |
| GLIP-T    |          | O365,GoldG,CC3M,SBU |  85.3   |  95.5   |   96.9   |   86.0   |   95.9   |   97.2    |