Detecting Twenty-thousand Classes using Image-level Supervision

Description

Detic: A Detector with image classes that can use image-level labels to easily train detectors.

Detecting Twenty-thousand Classes using Image-level Supervision, Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra, ECCV 2022 (arXiv 2201.02605)

Usage

Installation

Detic requires CLIP to be installed.

shell
pip install git+https://github.com/openai/CLIP.git

Prepare Datasets

It is recommended to download and extract the dataset somewhere outside the project directory and symlink the dataset root to $MMDETECTION/data as below. If your folder structure is different, you may need to change the corresponding paths in config files.

LVIS

The LVIS dataset is used as box-labeled data. It is available from the official website or a mirror. For open-vocabulary LVIS, generate lvis_v1_train_norare.json by following the official dataset preparation instructions; this file removes the annotations of the 337 rare classes from the training set. You can also download lvis_v1_train_norare.json from our backup. The directory should look like this:

shell
mmdetection
├── data
│   ├── lvis
│   │   ├── annotations
│   │   │   ├── lvis_v1_train.json
│   │   │   ├── lvis_v1_val.json
│   │   │   ├── lvis_v1_train_norare.json
│   │   ├── train2017
│   │   ├── val2017
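
The filtering behind lvis_v1_train_norare.json can be sketched as follows. This is a minimal illustration, not the official preparation script: the function name `remove_rare_annotations` and the toy data are made up here, and the only thing assumed about the real file is that each LVIS v1 category carries a `frequency` field with the value `'r'`, `'c'` or `'f'`.

```python
import copy


def remove_rare_annotations(lvis):
    """Return a copy of an LVIS-style dict without annotations of rare classes.

    Hypothetical helper for illustration; images and categories are kept,
    only the box/mask annotations of rare categories are dropped.
    """
    # LVIS v1 marks each category as rare ('r'), common ('c') or frequent ('f').
    rare_ids = {c['id'] for c in lvis['categories'] if c['frequency'] == 'r'}
    out = copy.copy(lvis)  # shallow copy so the input dict is not modified
    out['annotations'] = [
        a for a in lvis['annotations'] if a['category_id'] not in rare_ids
    ]
    return out


# Toy in-memory data standing in for lvis_v1_train.json (not real LVIS content).
toy = {
    'images': [{'id': 1}, {'id': 2}],
    'categories': [
        {'id': 1, 'name': 'common_thing', 'frequency': 'c'},
        {'id': 2, 'name': 'rare_thing', 'frequency': 'r'},
    ],
    'annotations': [
        {'id': 10, 'image_id': 1, 'category_id': 1},
        {'id': 11, 'image_id': 1, 'category_id': 2},
        {'id': 12, 'image_id': 2, 'category_id': 2},
    ],
}

filtered = remove_rare_annotations(toy)
print(len(filtered['annotations']))  # only the common-class annotation remains
```

For the real dataset, the same function could be applied to the dict loaded with `json.load` from lvis_v1_train.json and the result written back with `json.dump`.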

ImageNet-LVIS

ImageNet-LVIS is used as image-labeled data. You can download the ImageNet-21K dataset from the official website. Then extract the classes that overlap with LVIS and convert them to the LVIS annotation format, following the official dataset preparation instructions. The directory should look like this:

shell
mmdetection
├── data
│   ├── imagenet
│   │   ├── annotations
│   │   │   ├── imagenet_lvis_image_info.json
│   │   ├── ImageNet-21K
│   │   │   ├── n00007846
│   │   │   ├── n01318894
│   │   │   ├── ...

Metadata

data/metadata/ holds the preprocessed metadata (included in the repo). Please follow the official instructions to pre-process the LVIS dataset. This generates lvis_v1_train_cat_info.json, which is used by the federated loss and contains the frequency of each category in the LVIS training set. In addition, lvis_v1_clip_a+cname.npy contains the pre-computed CLIP embeddings for each LVIS category. You can also download lvis_v1_train_cat_info.json and lvis_v1_clip_a+cname.npy directly from our backup. The directory should look like this:

shell
mmdetection
├── data
│   ├── metadata
│   │   ├── lvis_v1_train_cat_info.json
│   │   ├── lvis_v1_clip_a+cname.npy
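
As a rough illustration of what "frequency of each category" means, the sketch below counts, for every category, the number of distinct training images that contain it. It is not the official pre-processing script: the function name `category_image_counts`, the output key `image_count` and the toy data are assumptions made for this example, and the real lvis_v1_train_cat_info.json may store additional fields.

```python
from collections import defaultdict


def category_image_counts(lvis):
    """Count, per category, how many distinct images contain that category.

    Hypothetical helper; the output layout is an assumption for illustration.
    """
    images_per_cat = defaultdict(set)
    for ann in lvis['annotations']:
        # A set ensures an image with several instances is counted once.
        images_per_cat[ann['category_id']].add(ann['image_id'])
    return [
        {
            'id': cat['id'],
            'name': cat['name'],
            'image_count': len(images_per_cat[cat['id']]),
        }
        for cat in sorted(lvis['categories'], key=lambda c: c['id'])
    ]


# Toy annotations (not real LVIS content): category 1 appears twice in image 1.
toy = {
    'categories': [{'id': 1, 'name': 'cat_a'}, {'id': 2, 'name': 'cat_b'}],
    'annotations': [
        {'image_id': 1, 'category_id': 1},
        {'image_id': 1, 'category_id': 1},
        {'image_id': 2, 'category_id': 1},
        {'image_id': 2, 'category_id': 2},
    ],
}

cat_info = category_image_counts(toy)
print(cat_info)
```

Counting images rather than instances matters here because a single image with many instances of one class should not make that class look frequent.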

Demo

Here we provide a Detic model for the open-vocabulary demo. It is trained on the combination of LVIS, COCO and ImageNet-21K for better demo results. LVIS-only models do not detect persons well because of the federated annotation protocol of LVIS; LVIS+COCO models give better visual results.

| Backbone | Training data | Config | Download |
| --- | --- | --- | --- |
| Swin-B | LVIS & COCO & ImageNet-21K | config | model |

You can also download other models from the official model zoo and convert them to the MMDetection format by running:

shell
python tools/model_converters/detic_to_mmdet.py --src /path/to/detic_weight.pth --dst /path/to/mmdet_weight.pth

Inference with existing dataset vocabulary

You can detect the classes of an existing dataset with the --texts argument:

shell
python demo/image_demo.py \
  ${IMAGE_PATH} \
  ${CONFIG_PATH} \
  ${MODEL_PATH} \
  --texts lvis \
  --pred-score-thr 0.5 \
  --palette 'random'

Inference with custom vocabularies

Detic can detect any class given its name by using CLIP. You can detect custom classes with the --texts argument:

shell
python demo/image_demo.py \
  ${IMAGE_PATH} \
  ${CONFIG_PATH} \
  ${MODEL_PATH} \
  --texts 'headphone . webcam . paper . coffe.' \
  --pred-score-thr 0.3 \
  --palette 'random'

Note that headphone, paper and coffe (typo intended) are not LVIS classes. Despite the misspelled class name, Detic can produce a reasonable detection for coffe.
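
The idea behind a text-defined vocabulary can be sketched in a few lines. This is a conceptual illustration only, not the code used by demo/image_demo.py or by the Detic head: `parse_class_names` and `classify` are hypothetical helpers, the ' . ' separator is taken from the command above, and the 3-dimensional vectors are made-up stand-ins for real CLIP text embeddings and region features.

```python
import math


def parse_class_names(texts):
    """Split a prompt such as 'a . b . c.' into ['a', 'b', 'c'] (hypothetical)."""
    names = [part.strip().rstrip('.').strip() for part in texts.split(' . ')]
    return [name for name in names if name]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(region_feature, text_embeddings):
    """Return the class whose text embedding is most similar to the region."""
    return max(
        text_embeddings,
        key=lambda name: cosine(region_feature, text_embeddings[name]),
    )


names = parse_class_names('headphone . webcam . paper . coffe.')

# Made-up vectors standing in for CLIP text embeddings of the four names.
toy_embeddings = dict(zip(names, [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]))

# Made-up feature of one detected region.
best = classify([0.9, 0.8, 0.1], toy_embeddings)
print(names, best)
```

Because the classes enter only through their text embeddings, changing the vocabulary requires new embeddings but no retraining, which is why an unseen or even misspelled name can still be matched.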

Models and Results

Training

There are two stages in the whole training process. The first stage is to train a model using images with box labels as the baseline. The second stage is to finetune from the baseline model and leverage image-labeled data.

First stage

To train the box-supervised baseline, run:

shell
bash ./tools/dist_train.sh projects/Detic_new/detic_centernet2_r50_fpn_4x_lvis_boxsup.py 8

| Model (Config) | mask mAP | mask mAP (official) | mask mAP_rare | mask mAP_rare (official) |
| --- | --- | --- | --- | --- |
| detic_centernet2_r50_fpn_4x_lvis_boxsup | 31.6 | 31.5 | 26.6 | 25.6 |

Second stage

The second stage uses both object detection and image classification datasets.

Multi-Datasets Config

We provide an improved dataset wrapper, ConcatDataset, to concatenate multiple datasets; the datasets may have different annotation types and different pipelines (e.g., image_size). You can obtain the dataset source index of each sample through get_dataset_source. We provide the sampler MultiDataSampler to customize the sampling ratios of the datasets. In addition, we provide the batch sampler MultiDataAspectRatioBatchSampler so that different datasets can use different batch sizes. The config for multiple datasets is as follows:

python
dataset_det = dict(
    type='ClassBalancedDataset',
    oversample_thr=1e-3,
    dataset=dict(
        type='LVISV1Dataset',
        data_root='data/lvis/',
        ann_file='annotations/lvis_v1_train.json',
        data_prefix=dict(img=''),
        filter_cfg=dict(filter_empty_gt=True, min_size=32),
        pipeline=train_pipeline_det,
        backend_args=backend_args))

dataset_cls = dict(
    type='ImageNetLVISV1Dataset',
    data_root='data/imagenet',
    ann_file='annotations/imagenet_lvis_image_info.json',
    data_prefix=dict(img='ImageNet-LVIS/'),
    pipeline=train_pipeline_cls,
    backend_args=backend_args)

train_dataloader = dict(
    batch_size=[8, 32],
    num_workers=2,
    persistent_workers=True,
    sampler=dict(
        type='MultiDataSampler',
        dataset_ratio=[1, 4]),
    batch_sampler=dict(
        type='MultiDataAspectRatioBatchSampler',
        num_datasets=2),
    dataset=dict(
        type='ConcatDataset',
        datasets=[dataset_det, dataset_cls]))

Note:
  • If one of the datasets is itself a ConcatDataset, it still counts as a single dataset for num_datasets in MultiDataAspectRatioBatchSampler.
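
To make the roles of the dataset source index and the per-dataset batch sizes more concrete, here is a simplified sketch. It is not the MMDetection implementation: `dataset_source` and `batches_by_source` are hypothetical helpers, aspect-ratio grouping is ignored, and incomplete batches left at the end are simply dropped.

```python
import bisect
from itertools import accumulate


def dataset_source(index, dataset_sizes):
    """Map a global sample index to the index of the dataset it belongs to."""
    cumulative = list(accumulate(dataset_sizes))  # e.g. [4, 6] -> [4, 10]
    return bisect.bisect_right(cumulative, index)


def batches_by_source(indices, dataset_sizes, batch_sizes):
    """Yield (source, batch) pairs where each batch holds one dataset only.

    Each dataset has its own bucket and its own batch size; a batch is
    emitted as soon as its bucket is full.
    """
    buckets = [[] for _ in dataset_sizes]
    for idx in indices:
        src = dataset_source(idx, dataset_sizes)
        buckets[src].append(idx)
        if len(buckets[src]) == batch_sizes[src]:
            yield src, buckets[src]
            buckets[src] = []  # start a fresh bucket for this dataset


# Toy setup: dataset 0 has 4 samples (indices 0-3), dataset 1 has 6 (4-9).
sizes = [4, 6]
sampled = [0, 4, 5, 1, 6, 2, 7, 8, 9, 3]  # order produced by some sampler
batches = list(batches_by_source(sampled, sizes, batch_sizes=[2, 3]))
for src, batch in batches:
    print(src, batch)
```

Keeping each batch within one dataset is what allows the detection and classification data to go through different pipelines and losses, and it is why batch_size is given as a list with one entry per dataset.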

To finetune the baseline model with image-labeled data, run:

shell
bash ./tools/dist_train.sh projects/Detic_new/detic_centernet2_r50_fpn_4x_lvis_in21k-lvis.py 8

| Model (Config) | mask mAP | mask mAP (official) | mask mAP_rare | mask mAP_rare (official) |
| --- | --- | --- | --- | --- |
| detic_centernet2_r50_fpn_4x_lvis_in21k-lvis | 32.9 | 33.2 | 30.9 | 29.7 |

Standard LVIS Results

| Model (Config) | mask mAP | mask mAP (official) | mask mAP_rare | mask mAP_rare (official) | Download |
| --- | --- | --- | --- | --- | --- |
| detic_centernet2_r50_fpn_4x_lvis_boxsup | 31.6 | 31.5 | 26.6 | 25.6 | model \| log |
| detic_centernet2_r50_fpn_4x_lvis_in21k-lvis | 32.9 | 33.2 | 30.9 | 29.7 | model \| log |
| detic_centernet2_swin-b_fpn_4x_lvis_boxsup | 40.7 | 40.7 | 38.0 | 35.9 | model \| log |
| detic_centernet2_swin-b_fpn_4x_lvis_in21k-lvis | 41.7 | 41.7 | 41.7 | 41.7 | model \| log |

Open-vocabulary LVIS Results

| Model (Config) | mask mAP | mask mAP (official) | mask mAP_rare | mask mAP_rare (official) | Download |
| --- | --- | --- | --- | --- | --- |
| detic_centernet2_r50_fpn_4x_lvis-base_boxsup | 30.4 | 30.2 | 16.2 | 16.4 | model \| log |
| detic_centernet2_r50_fpn_4x_lvis-base_in21k-lvis | 32.6 | 32.4 | 27.4 | 24.9 | model \| log |

Testing

Test Command

To evaluate a trained model, run:

shell
python ./tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE}

Open-vocabulary LVIS Results

The models are converted from the official model zoo.

| Model (Config) | mask mAP | mask mAP_novel | Download |
| --- | --- | --- | --- |
| detic_centernet2_swin-b_fpn_4x_lvis-base_boxsup | 38.4 | 21.9 | model |
| detic_centernet2_swin-b_fpn_4x_lvis-base_in21k-lvis | 40.7 | 34.0 | model |

Note:
  • The open-vocabulary LVIS setup is LVIS without rare-class annotations in training, termed lvis-base. Rare classes are evaluated as novel classes at test time.
  • in21k-lvis denotes that the model uses the classes shared by ImageNet-21K and LVIS as image-labeled data.

Citation

If you find Detic useful in your research or applications, please consider giving a star 🌟 to the official repository and citing Detic with the following BibTeX entry.

BibTeX
@inproceedings{zhou2022detecting,
  title={Detecting Twenty-thousand Classes using Image-level Supervision},
  author={Zhou, Xingyi and Girdhar, Rohit and Joulin, Armand and Kr{\"a}henb{\"u}hl, Philipp and Misra, Ishan},
  booktitle={ECCV},
  year={2022}
}