examples/sentence_transformer/applications/image-search/Image_Classification.ipynb
This example shows how SentenceTransformers can be used to map images and texts into the same vector space.
We can use this to perform zero-shot image classification simply by providing the label names as text.
As our model, we use OpenAI's CLIP model, which was trained on a large set of images and their alt texts.
The images in this example are from Unsplash.
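If the library is not installed yet, it can be installed directly from within the notebook. This is a minimal setup sketch; version pins are omitted and Pillow is listed explicitly only because we open the images with PIL below.
!pip install sentence-transformers pillow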
import torch
from IPython.display import Image as IPImage
from IPython.display import display
from PIL import Image
from sentence_transformers import SentenceTransformer, util
# We use the original CLIP model for computing image embeddings and English text embeddings
en_model = SentenceTransformer("clip-ViT-B-32")
# We download some images from our repository which we want to classify
img_names = ["eiffel-tower-day.jpg", "eiffel-tower-night.jpg", "two_dogs_in_snow.jpg", "cat.jpg"]
url = "https://github.com/huggingface/sentence-transformers/raw/main/examples/sentence_transformer/applications/image-search/"
for img_name in img_names:
    util.http_get(url + img_name, img_name)
# And compute the embeddings for these images by passing the opened PIL images to the model
img_emb = en_model.encode([Image.open(img_name) for img_name in img_names], convert_to_tensor=True)
# Then, we define our labels as text. Here, we use 4 labels
labels = ["dog", "cat", "Paris at night", "Paris"]
# And compute the text embeddings for these labels
en_emb = en_model.encode(labels, convert_to_tensor=True)
# Now, we compute the cosine similarity between the images and the labels
cos_scores = en_model.similarity(img_emb, en_emb)
# Then we check which label has the highest cosine similarity for each image
pred_labels = torch.argmax(cos_scores, dim=1)
# Finally, we output the images together with their predicted labels
for img_name, pred_label in zip(img_names, pred_labels):
    display(IPImage(img_name, width=200))
    print("Predicted label:", labels[pred_label])
    print("\n\n")
The original CLIP model only works for English text. Hence, Multilingual Knowledge Distillation was used to extend it to 50+ languages.
For this, we load the clip-ViT-B-32-multilingual-v1 model to encode our labels. We can define our labels in any of these 50+ languages and can also mix languages within one list (see the short sketch at the end of this example).
multi_model = SentenceTransformer("clip-ViT-B-32-multilingual-v1")
# This time, we define the same four labels, but in different languages
labels = [
"Hund", # German: dog
"gato", # Spanish: cat
"巴黎晚上", # Chinese: Paris at night
"Париж", # Russian: Paris
]
# And compute the text embeddings for these labels
txt_emb = multi_model.encode(labels, convert_to_tensor=True)
# Now, we compute the cosine similarity between the images and the labels
cos_scores = multi_model.similarity(img_emb, txt_emb)
# Then we check which label has the highest cosine similarity for each image
pred_labels = torch.argmax(cos_scores, dim=1)
# Finally, we output the images together with their predicted labels
for img_name, pred_label in zip(img_names, pred_labels):
    display(IPImage(img_name, width=200))
    print("Predicted label:", labels[pred_label])
    print("\n\n")