examples/sentence_transformer/applications/image-search/Image_Classification.ipynb
This example shows how SentenceTransformers can be used to map images and texts into the same vector space.
We can use this to perform zero-shot image classification simply by providing the label names as text.
As our model, we use OpenAI's CLIP model, which was trained on a large set of images and their alt texts.
The images in this example are from Unsplash.
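If the library is not installed yet, it can be installed directly from within the notebook. This is a minimal setup sketch; version pins are omitted and Pillow is listed explicitly only because we open the images with PIL below.
!pip install sentence-transformers pillow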
import torch
from IPython.display import Image as IPImage
from IPython.display import display
from PIL import Image
from sentence_transformers import SentenceTransformer, util
# We use the original CLIP model for computing image embeddings and English text embeddings
en_model = SentenceTransformer("clip-ViT-B-32")
# We download some images from our repository which we want to classify
img_names = ["eiffel-tower-day.jpg", "eiffel-tower-night.jpg", "two_dogs_in_snow.jpg", "cat.jpg"]
url = "https://github.com/huggingface/sentence-transformers/raw/main/examples/sentence_transformer/applications/image-search/"
for img_name in img_names:
    util.http_get(url + img_name, img_name)
# And compute the embeddings for these images by passing the opened PIL images to the model
img_emb = en_model.encode([Image.open(img_name) for img_name in img_names], convert_to_tensor=True)
# Then, we define our labels as text. Here, we use 4 labels
labels = ["dog", "cat", "Paris at night", "Paris"]
# And compute the text embeddings for these labels
en_emb = en_model.encode(labels, convert_to_tensor=True)
# Now, we compute the cosine similarity between the images and the labels
cos_scores = en_model.similarity(img_emb, en_emb)
# Then we check which label has the highest cosine similarity for each image
pred_labels = torch.argmax(cos_scores, dim=1)
# Finally, we output the images together with their predicted labels
for img_name, pred_label in zip(img_names, pred_labels):
    display(IPImage(img_name, width=200))
    print("Predicted label:", labels[pred_label])
    print("\n\n")
The original CLIP model only works for English text. Hence, Multilingual Knowledge Distillation was used to extend it to 50+ languages.
For this, we load the clip-ViT-B-32-multilingual-v1 model to encode our labels. We can define our labels in any of these 50+ languages and can also mix languages within one list (see the short sketch at the end of this example).
multi_model = SentenceTransformer("clip-ViT-B-32-multilingual-v1")
# This time, we define the same four labels, but in different languages
labels = [
"Hund", # German: dog
"gato", # Spanish: cat
"巴黎晚上", # Chinese: Paris at night
"Париж", # Russian: Paris
]
# And compute the text embeddings for these labels
txt_emb = multi_model.encode(labels, convert_to_tensor=True)
# Now, we compute the cosine similarity between the images and the labels
cos_scores = multi_model.similarity(img_emb, txt_emb)
# Then we check which label has the highest cosine similarity for each image
pred_labels = torch.argmax(cos_scores, dim=1)
# Finally, we output the images together with their predicted labels
for img_name, pred_label in zip(img_names, pred_labels):
    display(IPImage(img_name, width=200))
    print("Predicted label:", labels[pred_label])
    print("\n\n")