<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

This model was released on 2023-03-27 and added to Hugging Face Transformers on 2024-01-08.


# SigLIP

SigLIP is a multimodal image-text model similar to CLIP. It uses separate image and text encoders to generate representations for both modalities.

Unlike CLIP, SigLIP employs a pairwise sigmoid loss on image-text pairs during training. This training loss eliminates the need for a global view of all pairwise similarities between images and texts within a batch. Consequently, it enables more efficient scaling to larger batch sizes while also delivering superior performance with smaller batch sizes.
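
For intuition, here is a minimal sketch of the pairwise sigmoid loss (following the paper's formulation; `t` and `b` are the learned scale and bias, the embeddings are assumed L2-normalized, and this is an illustration rather than the Transformers training code):

```python
import torch
import torch.nn.functional as F

def sigmoid_loss(image_embeds, text_embeds, t, b):
    """Every (image, text) pair is an independent binary classification,
    so no batch-wide softmax over all pairwise similarities is needed."""
    logits = image_embeds @ text_embeds.t() * t + b       # (n, n) pairwise logits
    n = logits.size(0)
    labels = 2 * torch.eye(n, device=logits.device) - 1   # +1 on matched pairs, -1 elsewhere
    return -F.logsigmoid(labels * logits).sum() / n
```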

You can find all the original SigLIP checkpoints under the SigLIP collection.

> [!TIP]
> Click on the SigLIP models in the right sidebar for more examples of how to apply SigLIP to different image and text tasks.

The example below demonstrates how to generate similarity scores between texts and image(s) with [Pipeline] or the [AutoModel] class.

<hfoptions id="usage">
<hfoption id="Pipeline">

```python
from transformers import pipeline


image = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
candidate_labels = ["a Pallas cat", "a lion", "a Siberian tiger"]

pipeline = pipeline(task="zero-shot-image-classification", model="google/siglip-base-patch16-224", device=0)
pipeline(image, candidate_labels=candidate_labels)
```

</hfoption>
<hfoption id="AutoModel">

```python
import requests
import torch
from PIL import Image

from transformers import AutoModel, AutoProcessor


model = AutoModel.from_pretrained("google/siglip-base-patch16-224", device_map="auto", attn_implementation="sdpa")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
candidate_labels = ["a Pallas cat", "a lion", "a Siberian tiger"]
texts = [f'This is a photo of {label}.' for label in candidate_labels]
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)
print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
```

</hfoption>
</hfoptions>
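
Because the image and text encoders are separate, you can also extract embeddings on their own with [get_image_features()] and [get_text_features()] and compare them yourself. Here is a minimal sketch using cosine similarity (during training the model additionally applies a learned scale and bias on top of this):

```python
import requests
import torch
from PIL import Image

from transformers import AutoModel, AutoProcessor


model = AutoModel.from_pretrained("google/siglip-base-patch16-224")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

with torch.no_grad():
    image_embeds = model.get_image_features(**processor(images=image, return_tensors="pt"))
    text_embeds = model.get_text_features(**processor(text=["This is a photo of a Pallas cat."], padding="max_length", return_tensors="pt"))

# cosine similarity between the L2-normalized embeddings
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
print((image_embeds @ text_embeds.t()).item())
```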

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the Quantization overview for more available quantization backends.

The example below uses bitsandbytes to quantize only the weights to int4.

```python
import requests
import torch
from PIL import Image

from transformers import AutoModel, AutoProcessor, BitsAndBytesConfig


bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModel.from_pretrained("google/siglip-base-patch16-224", quantization_config=bnb_config, device_map="auto", attn_implementation="sdpa")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
candidate_labels = ["a Pallas cat", "a lion", "a Siberian tiger"]
texts = [f'This is a photo of {label}.' for label in candidate_labels]
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)
print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")

## Notes

- Training is supported for DDP and FSDP on single-node multi-GPU setups. However, it does not use torch.distributed utilities, which may limit how far the batch size can scale.

- When using the standalone [SiglipTokenizer] or [SiglipProcessor], make sure to pass `padding="max_length"` because that is how the model was trained (see the sketch after this list).

- To get the same results as the [Pipeline], a prompt template of "This is a photo of {label}." should be passed to the processor.

- Toggle the `attn_implementation` parameter to either `"sdpa"` or `"flash_attention_2"` to use a more memory-efficient attention implementation.

    ```py
    # pip install -U flash-attn --no-build-isolation
    import torch
    from transformers import SiglipModel

    model = SiglipModel.from_pretrained(
        "google/siglip-so400m-patch14-384",
        torch_dtype=torch.float16,  # FlashAttention-2 only runs in fp16/bf16
        attn_implementation="flash_attention_2",
        device_map="auto",
    )
    ```

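Here is a quick sketch of the `padding="max_length"` point using the standalone tokenizer (the shape in the comment assumes this checkpoint's maximum text length of 64 tokens):

```python
from transformers import SiglipTokenizer

tokenizer = SiglipTokenizer.from_pretrained("google/siglip-base-patch16-224")

# the model was trained on text padded to the full context length,
# so other padding strategies can degrade zero-shot accuracy
inputs = tokenizer(["This is a photo of a Pallas cat."], padding="max_length", return_tensors="pt")
print(inputs["input_ids"].shape)  # torch.Size([1, 64])
```
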
## SiglipConfig

[[autodoc]] SiglipConfig

## SiglipTextConfig

[[autodoc]] SiglipTextConfig

## SiglipVisionConfig

[[autodoc]] SiglipVisionConfig

## SiglipTokenizer

[[autodoc]] SiglipTokenizer
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary

## SiglipImageProcessor

[[autodoc]] SiglipImageProcessor
    - preprocess

## SiglipImageProcessorFast

[[autodoc]] SiglipImageProcessorFast
    - preprocess

## SiglipProcessor

[[autodoc]] SiglipProcessor
    - __call__

## SiglipModel

[[autodoc]] SiglipModel
    - forward
    - get_text_features
    - get_image_features

## SiglipTextModel

[[autodoc]] SiglipTextModel
    - forward

## SiglipVisionModel

[[autodoc]] SiglipVisionModel
    - forward

## SiglipForImageClassification

[[autodoc]] SiglipForImageClassification
    - forward