docs/source/en/model_doc/imagegpt.md
This model was released on 2020-06-17 and added to Hugging Face Transformers on 2021-11-18.
The ImageGPT model was proposed in Generative Pretraining from Pixels by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever. ImageGPT (iGPT) is a GPT-2-like model trained to predict the next pixel value, allowing for both unconditional and conditional image generation.
The abstract from the paper is the following:
Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full fine-tuning, matching the top supervised pre-trained models. We are also competitive with self-supervised benchmarks on ImageNet when substituting pixels for a VQVAE encoding, achieving 69.0% top-1 accuracy on a linear probe of our features.
<small> Summary of the approach. Taken from the original paper. </small>
This model was contributed by nielsr, based on this issue. The original code can be found here.
ImageGPTImageProcessor] to prepare
images for the model.output_hidden_states=True, and
then average-pool the hidden states at whatever layer you like.ImageGPTForImageClassification].| Model variant | Depths | Hidden sizes | Decoder hidden size | Params (M) | ImageNet-1k Top 1 |
|---|---|---|---|---|---|
| MiT-b0 | [2, 2, 2, 2] | [32, 64, 160, 256] | 256 | 3.7 | 70.5 |
| MiT-b1 | [2, 2, 2, 2] | [64, 128, 320, 512] | 256 | 14.0 | 78.7 |
| MiT-b2 | [3, 4, 6, 3] | [64, 128, 320, 512] | 768 | 25.4 | 81.6 |
| MiT-b3 | [3, 4, 18, 3] | [64, 128, 320, 512] | 768 | 45.2 | 83.1 |
| MiT-b4 | [3, 8, 27, 3] | [64, 128, 320, 512] | 768 | 62.6 | 83.6 |
| MiT-b5 | [3, 6, 40, 3] | [64, 128, 320, 512] | 768 | 82.0 | 83.8 |
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ImageGPT.
<PipelineTag pipeline="image-classification"/>ImageGPTForImageClassification] is supported by this example script and notebook.If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
[[autodoc]] ImageGPTConfig
[[autodoc]] ImageGPTImageProcessor - preprocess
[[autodoc]] ImageGPTImageProcessorPil - preprocess
[[autodoc]] ImageGPTModel - forward
[[autodoc]] ImageGPTForCausalImageModeling - forward
[[autodoc]] ImageGPTForImageClassification - forward