# TF-Vision Model Garden

official/vision/README.md


⚠️ Disclaimer: Checkpoints are based on training with publicly available datasets. Some datasets contain limitations, including non-commercial use limitations. Please review the terms and conditions made available by third parties before using the datasets provided. Checkpoints are licensed under Apache 2.0.

⚠️ Disclaimer: Datasets hyperlinked from this page are not owned or distributed by Google. Such datasets are made available by third parties. Please review the terms and conditions made available by the third parties before using the data.

## Introduction

The TF-Vision modeling library for computer vision provides a collection of baselines and checkpoints for image classification, object detection, and segmentation.

## Backbones

  • DilatedResNet
  • EfficientNet
  • MobileDet
  • MobileNet
  • ResNet
  • ResNet3D
  • RevNet
  • SpineNet
  • SpineNetMobile
  • VisionTransformer

## Decoders

  • ASPP
  • FPN
  • NASFPN

## Heads

  • DetectionHead
  • MaskHead
  • MaskScoring
  • RPNHead
  • RetinaNetHead
  • SegmentationHead

## Image Classification

### ResNet models trained with vanilla settings

<details>

  • Models are trained from scratch with batch size 4096 and an initial learning rate of 1.6.
  • Linear warmup is applied for the first 5 epochs.
  • Models are trained with l2 weight regularization and ReLU activation.

| Model      | Resolution | Epochs | Top-1 | Top-5 | Download       |
| ---------- | ---------- | ------ | ----- | ----- | -------------- |
| ResNet-50  | 224x224    | 90     | 76.1  | 92.9  | config         |
| ResNet-50  | 224x224    | 200    | 77.1  | 93.5  | config \| ckpt |
| ResNet-101 | 224x224    | 200    | 78.3  | 94.2  | config \| ckpt |
| ResNet-152 | 224x224    | 200    | 78.7  | 94.3  | config \| ckpt |
</details>
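The 1.6 initial learning rate above is consistent with the common linear scaling rule (0.1 × 4096 / 256 = 1.6). A minimal sketch of the 5-epoch linear warmup, assuming roughly 312 steps per ImageNet epoch at batch size 4096 (the function name and step counts are illustrative, not taken from the TF-Vision code):

```python
def warmup_lr(step, base_lr=1.6, warmup_epochs=5, steps_per_epoch=312):
    """Ramp the learning rate linearly from ~0 to base_lr, then hold it.

    steps_per_epoch ~ 1,281,167 ImageNet images / 4096 per batch ~ 312.
    """
    warmup_steps = warmup_epochs * steps_per_epoch
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

After warmup the actual schedule decays the rate further; only the warmup ramp is shown here.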

### ResNet-RS models trained with various settings

<details>

We support state-of-the-art ResNet-RS image classification models with the following features:

  • ResNet-RS architectural changes and Swish activation. (Note that ResNet-RS adopts ReLU activation in the paper.)
  • Regularization methods including RandAugment, 4e-5 weight decay, stochastic depth, label smoothing and dropout.
  • New training methods including a 350-epoch schedule, a cosine learning rate schedule and EMA.
  • Configs are in this directory.

| Model         | Resolution | Params (M) | Top-1 | Top-5 | Download       |
| ------------- | ---------- | ---------- | ----- | ----- | -------------- |
| ResNet-RS-50  | 160x160    | 35.7       | 79.1  | 94.5  | config \| ckpt |
| ResNet-RS-101 | 160x160    | 63.7       | 80.2  | 94.9  | config \| ckpt |
| ResNet-RS-101 | 192x192    | 63.7       | 81.3  | 95.6  | config \| ckpt |
| ResNet-RS-152 | 192x192    | 86.8       | 81.9  | 95.8  | config \| ckpt |
| ResNet-RS-152 | 224x224    | 86.8       | 82.5  | 96.1  | config \| ckpt |
| ResNet-RS-152 | 256x256    | 86.8       | 83.1  | 96.3  | config \| ckpt |
| ResNet-RS-200 | 256x256    | 93.4       | 83.5  | 96.6  | config \| ckpt |
| ResNet-RS-270 | 256x256    | 130.1      | 83.6  | 96.6  | config \| ckpt |
| ResNet-RS-350 | 256x256    | 164.3      | 83.7  | 96.7  | config \| ckpt |
| ResNet-RS-350 | 320x320    | 164.3      | 84.2  | 96.9  | config \| ckpt |
</details>
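The cosine learning rate schedule and EMA mentioned above can be sketched in a few lines (a hedged illustration; the decay constant is a typical default, not the exact TF-Vision value):

```python
import math

def cosine_lr(step, total_steps, base_lr):
    """Cosine decay from base_lr at step 0 to 0 at total_steps."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

def ema_update(ema_weight, new_weight, decay=0.9999):
    """Exponential moving average of a model weight; the EMA copy is
    typically the one evaluated at test time."""
    return decay * ema_weight + (1.0 - decay) * new_weight
```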

### Vision Transformer (ViT)

<details>

We support ViT and DeiT implementations. The following ViT models are trained under the DeiT settings:

| Model    | Resolution | Top-1 | Top-5 | Download |
| -------- | ---------- | ----- | ----- | -------- |
| ViT-ti16 | 224x224    | 73.4  | 91.9  | ckpt     |
| ViT-s16  | 224x224    | 79.4  | 94.7  | ckpt     |
| ViT-b16  | 224x224    | 81.8  | 95.8  | ckpt     |
| ViT-l16  | 224x224    | 82.2  | 95.8  | ckpt     |
</details>

## Object Detection and Instance Segmentation

### Common Settings and Notes

<details>

  • We provide models adopting ResNet-FPN and SpineNet backbones based on several detection frameworks.
  • Models are all trained on COCO train2017 and evaluated on COCO val2017.
  • Training details:
    • Models fine-tuned from ImageNet pretrained checkpoints adopt the 12- or 36-epoch schedule. Models trained from scratch adopt the 350-epoch schedule.
    • The default training data augmentation implements horizontal flipping and scale jittering with a random scale in [0.5, 2.0].
    • Unless noted, all models are trained with l2 weight regularization and ReLU activation.
    • We use a batch size of 256 and a stepwise learning rate that decays at the last 30 and 10 epochs.
    • We use a square image as input: the long side of the image is resized to the target size, then the short side is padded with zeros.
</details>
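The square-input preprocessing and scale jittering described above can be sketched on image dimensions alone (a simplified sketch; function names are ours, not from the TF-Vision code):

```python
import random

def scale_jitter(scale_min=0.5, scale_max=2.0, rng=random.random):
    """Draw a random resize scale in [scale_min, scale_max]."""
    return scale_min + (scale_max - scale_min) * rng()

def resize_and_pad_dims(height, width, target):
    """Resize so the long side equals `target`, then pad the short side
    with zeros to obtain a square target x target input.

    Returns ((new_height, new_width), (pad_height, pad_width)).
    """
    scale = target / max(height, width)
    new_h, new_w = int(round(height * scale)), int(round(width * scale))
    return (new_h, new_w), (target - new_h, target - new_w)
```

For example, a 1000x500 image resized to a 640x640 input is scaled by 0.64 to 640x320 and padded with 320 zero columns.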

## COCO Object Detection Baselines

### RetinaNet (ImageNet pretrained)

<details>

| Backbone | Resolution | Epochs | FLOPs (B) | Params (M) | Box AP | Download       |
| -------- | ---------- | ------ | --------- | ---------- | ------ | -------------- |
| R50-FPN  | 640x640    | 12     | 97.0      | 34.0       | 34.3   | config         |
| R50-FPN  | 640x640    | 72     | 97.0      | 34.0       | 36.8   | config \| ckpt |
</details>

### RetinaNet (Trained from scratch)

<details>

Training features include:

  • Stochastic depth with drop rate 0.2.
  • Swish activation.

| Backbone     | Resolution | Epochs | FLOPs (B) | Params (M) | Box AP | Download       |
| ------------ | ---------- | ------ | --------- | ---------- | ------ | -------------- |
| SpineNet-49  | 640x640    | 500    | 85.4      | 28.5       | 44.2   | config \| ckpt |
| SpineNet-96  | 1024x1024  | 500    | 265.4     | 43.0       | 48.5   | config \| ckpt |
| SpineNet-143 | 1280x1280  | 500    | 524.0     | 67.0       | 50.0   | config \| ckpt |
</details>
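Stochastic depth with drop rate 0.2, listed above, randomly skips a residual branch during training and rescales it at inference. A minimal sketch of one common formulation (the TF-Vision implementation may instead rescale the surviving branch during training):

```python
import random

def stochastic_depth(residual, drop_rate=0.2, training=True, rng=random.random):
    """Drop a residual branch with probability drop_rate during training;
    at inference, scale the branch by its survival probability."""
    survival_prob = 1.0 - drop_rate
    if not training:
        return survival_prob * residual
    if rng() < drop_rate:
        return 0.0  # the branch is skipped for this sample
    return residual
```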

### Mobile-size RetinaNet (Trained from scratch)

<details>

| Backbone           | Resolution | Epochs | FLOPs (B) | Params (M) | Box AP | Download       |
| ------------------ | ---------- | ------ | --------- | ---------- | ------ | -------------- |
| MobileNetv2        | 256x256    | 600    | -         | 2.27       | 23.5   | config         |
| Mobile SpineNet-49 | 384x384    | 600    | 1.0       | 2.32       | 28.1   | config \| ckpt |
</details>

### YOLOv7 (Trained from scratch)

<details>

| Variant | Resolution | Epochs | FLOPs (B) | Params (M) | Box AP | Download       |
| ------- | ---------- | ------ | --------- | ---------- | ------ | -------------- |
| YOLOv7  | 640x640    | 300    | 53.16     | 44.57      | 50.5   | config \| ckpt |
</details>

## Instance Segmentation Baselines

### Mask R-CNN (Trained from scratch)

<details>

| Backbone     | Resolution | Epochs | FLOPs (B) | Params (M) | Box AP | Mask AP | Download |
| ------------ | ---------- | ------ | --------- | ---------- | ------ | ------- | -------- |
| ResNet50-FPN | 640x640    | 350    | 227.7     | 46.3       | 42.3   | 37.6    | config   |
| SpineNet-49  | 640x640    | 350    | 215.7     | 40.8       | 42.6   | 37.9    | config   |
| SpineNet-96  | 1024x1024  | 500    | 315.0     | 55.2       | 48.1   | 42.4    | config   |
| SpineNet-143 | 1280x1280  | 500    | 498.8     | 79.2       | 49.3   | 43.4    | config   |
</details>

### Cascade RCNN-RS (Trained from scratch)

<details>

| Backbone     | Resolution | Epochs | Params (M) | Box AP | Mask AP | Download |
| ------------ | ---------- | ------ | ---------- | ------ | ------- | -------- |
| SpineNet-49  | 640x640    | 500    | 56.4       | 46.4   | 40.0    | config   |
| SpineNet-96  | 1024x1024  | 500    | 70.8       | 50.9   | 43.8    | config   |
| SpineNet-143 | 1280x1280  | 500    | 94.9       | 51.9   | 45.0    | config   |
</details>

## Semantic Segmentation

  • We support DeepLabV3 and DeepLabV3+ architectures, with Dilated ResNet backbones.
  • Backbones are pre-trained on ImageNet.

### PASCAL-VOC

<details>

| Model      | Backbone           | Resolution | Steps | mIoU | Download |
| ---------- | ------------------ | ---------- | ----- | ---- | -------- |
| DeepLabV3  | Dilated Resnet-101 | 512x512    | 30k   | 78.7 |          |
| DeepLabV3+ | Dilated Resnet-101 | 512x512    | 30k   | 79.2 | ckpt     |
</details>
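The mIoU metric in these tables is the per-class intersection-over-union averaged over classes. A minimal sketch on flattened label lists (illustrative only; production code uses confusion matrices over whole images):

```python
def mean_iou(y_true, y_pred, num_classes):
    """Mean IoU over classes that appear in the labels or predictions."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        union = sum(1 for t, p in zip(y_true, y_pred) if t == c or p == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```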

### CITYSCAPES

<details>

| Model      | Backbone           | Resolution | Steps | mIoU  | Download |
| ---------- | ------------------ | ---------- | ----- | ----- | -------- |
| DeepLabV3+ | Dilated Resnet-101 | 1024x2048  | 90k   | 78.79 |          |
</details>

## Video Classification

### Common Settings and Notes


### Kinetics-400 Action Recognition Baselines

<details>

| Model           | Input (frame x stride) | Top-1 | Top-5 | Download |
| --------------- | ---------------------- | ----- | ----- | -------- |
| SlowOnly        | 8 x 8                  | 74.1  | 91.4  | config   |
| SlowOnly        | 16 x 4                 | 75.6  | 92.1  | config   |
| R3D-50          | 32 x 2                 | 77.0  | 93.0  | config   |
| R3D-RS-50       | 32 x 2                 | 78.2  | 93.7  | config   |
| R3D-RS-101      | 32 x 2                 | 79.5  | 94.2  | -        |
| R3D-RS-152      | 32 x 2                 | 79.9  | 94.3  | -        |
| R3D-RS-200      | 32 x 2                 | 80.4  | 94.4  | -        |
| R3D-RS-200      | 48 x 2                 | 81.0  | -     | -        |
| MoViNet-A0-Base | 50 x 5                 | 69.40 | 89.18 | -        |
| MoViNet-A1-Base | 50 x 5                 | 74.57 | 92.03 | -        |
| MoViNet-A2-Base | 50 x 5                 | 75.91 | 92.63 | -        |
| MoViNet-A3-Base | 120 x 2                | 79.34 | 94.52 | -        |
| MoViNet-A4-Base | 80 x 3                 | 80.64 | 94.93 | -        |
| MoViNet-A5-Base | 120 x 2                | 81.39 | 95.06 | -        |
</details>
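The "Input (frame x stride)" column gives the number of frames per clip and the temporal stride between sampled frames, so a "32 x 2" input spans 64 consecutive frames of the source video. A sketch of the sampled indices (an illustrative helper, not from the TF-Vision code):

```python
def clip_frame_indices(num_frames, stride, start=0):
    """Frame indices for a clip of num_frames frames sampled every
    `stride` frames, starting at `start`."""
    return [start + i * stride for i in range(num_frames)]

# An "8 x 8" input samples frames 0, 8, 16, ..., 56.
```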

### Kinetics-600 Action Recognition Baselines

<details>

| Model           | Input (frame x stride) | Top-1 | Top-5 | Download |
| --------------- | ---------------------- | ----- | ----- | -------- |
| SlowOnly        | 8 x 8                  | 77.3  | 93.6  | config   |
| R3D-50          | 32 x 2                 | 79.5  | 94.8  | config   |
| R3D-RS-200      | 32 x 2                 | 83.1  | -     | -        |
| R3D-RS-200      | 48 x 2                 | 83.8  | -     | -        |
| MoViNet-A0-Base | 50 x 5                 | 72.05 | 90.92 | config   |
| MoViNet-A1-Base | 50 x 5                 | 76.69 | 93.40 | config   |
| MoViNet-A2-Base | 50 x 5                 | 78.62 | 94.17 | config   |
| MoViNet-A3-Base | 120 x 2                | 81.79 | 95.67 | config   |
| MoViNet-A4-Base | 80 x 3                 | 83.48 | 96.16 | config   |
| MoViNet-A5-Base | 120 x 2                | 84.27 | 96.39 | config   |
</details>

## More Documentation

Please read through the references in the examples/starter directory.