Image Duplicates & Near Duplicates

This example shows how SentenceTransformers can be used to find image duplicates and near duplicates.

As the model, we use the OpenAI CLIP model, which was trained on a large set of images and image alt texts.
Note that CLIP dates from 2021 and more recent models exist. We use it for this illustration because it is small and can therefore run on a Google Colab GPU.
As an alternative, you can check the models with the zero-shot-image-classification pipeline tag on the 🤗 Hub.
You can also try trimmed models, which are smaller than the models they are derived from while keeping the same performance. To find them, look at the visual embedding models listed in this Space.

As a source of photos, we use the Unsplash Dataset Lite, which contains about 25k images. See the license for the Unsplash images.

We encode all images into vector space and then look for pairs of images with a very high cosine similarity, i.e. (near-)identical photos.
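To make the similarity criterion concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are hypothetical stand-ins for image embeddings (real CLIP embeddings have far more dimensions), and the helper name `cosine_similarity` is only illustrative; it is not part of the library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", invented for illustration only.
photo = [0.20, 0.90, 0.40]
photo_reupload = [0.20, 0.90, 0.40]  # exact duplicate of `photo`
photo_variant = [0.25, 0.85, 0.45]   # slightly different version
unrelated = [0.90, -0.10, 0.05]      # a different subject

sim_duplicate = cosine_similarity(photo, photo_reupload)
sim_near = cosine_similarity(photo, photo_variant)
sim_unrelated = cosine_similarity(photo, unrelated)

print(f"duplicate: {sim_duplicate:.3f}")
print(f"near duplicate: {sim_near:.3f}")
print(f"unrelated: {sim_unrelated:.3f}")
```

Identical vectors score 1, slightly different vectors score just below 1, and unrelated vectors score much lower, which is why a very high score is a reasonable signal for (near-)identical photos.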

python

from datasets import load_dataset
from IPython.display import display

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import paraphrase_mining_embeddings

python

# First, we load the respective CLIP model
model = SentenceTransformer("sentence-transformers/clip-ViT-B-32")

python

# Next, we load the Unsplash Dataset Lite
unsplash = load_dataset("sentence-transformers/unsplash-lite")

# We can see that the dataset contains a column containing the images
# but also a column containing keywords that we will not use here.
unsplash

python

# Now, we need to compute the embeddings.
# The function below is needed by the next cell
# when precomputed embeddings are not used.
embeddings_name = "embeddings_clip-ViT-B-32"


def embed(batch):
    """
    Add a column of image embeddings, computed with the model, to the dataset.
    """
    image = batch["image"]
    return {embeddings_name: model.encode(image, convert_to_tensor=True)}

python

# To compute the embeddings, you have two choices:
# 1) `use_precomputed_embeddings = True`: use the embeddings that we have
# already precomputed, which speeds up the execution of the notebook.
# 2) `use_precomputed_embeddings = False`: compute the embeddings on the fly.
# This takes about 9 minutes on a Google Colab T4.

use_precomputed_embeddings = True

if use_precomputed_embeddings:
    embeddings_ds = load_dataset("sentence-transformers/unsplash-lite", name=embeddings_name, split="train")
    unsplash["train"] = unsplash["train"].add_column(embeddings_name, embeddings_ds[embeddings_name])

else:
    unsplash = unsplash.map(embed, batched=True, batch_size=16)

    # Uncomment the rest of the else condition if you want to save the embeddings
    # on the Hub to use `use_precomputed_embeddings = True` in the future

    # # We delete 'image' and 'keywords' so as not to save them as duplicates unnecessarily
    # embeddings_ds = unsplash['train'].remove_columns(['image', 'keywords'])
    # embeddings_ds.push_to_hub(
    #     "your_username/unsplash-lite", # your username
    #     config_name=embeddings_name,
    #     split="train",
    #     token="hf_xx" # your HF token
    # )

# We now have a new column containing our embeddings
unsplash

python

# Now we mine for the most similar image pairs across the whole collection.
# paraphrase_mining_embeddings compares the images and returns the pairs with the
# highest cosine similarity, sorted from most to least similar.

img_emb = unsplash["train"][embeddings_name]
duplicates = paraphrase_mining_embeddings(img_emb)

# duplicates is a list of triplets (score, image_id1, image_id2), sorted by decreasing score
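
As a rough illustration of what pair mining produces, the following sketch compares every pair of vectors by brute force and returns `(score, i, j)` triplets sorted from most to least similar. It is not the library's implementation: `mine_pairs_bruteforce` and the toy vectors are hypothetical, and the quadratic loop is only practical for small collections.

```python
import math
from itertools import combinations


def mine_pairs_bruteforce(embeddings, top_k=None):
    """Return (score, i, j) triplets for all index pairs i < j, most similar first."""
    # Normalize once so that cosine similarity reduces to a dot product.
    normalized = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        normalized.append([x / norm for x in vec])

    pairs = []
    for i, j in combinations(range(len(normalized)), 2):
        score = sum(x * y for x, y in zip(normalized[i], normalized[j]))
        pairs.append((score, i, j))

    # Sort by score only, highest first.
    pairs.sort(key=lambda p: p[0], reverse=True)
    return pairs if top_k is None else pairs[:top_k]


# Hypothetical 2-dimensional toy vectors, invented for illustration only.
toy_embeddings = [
    [1.0, 0.0],  # image 0
    [0.0, 1.0],  # image 1
    [1.0, 0.0],  # image 2: identical to image 0
    [0.9, 0.1],  # image 3: close to images 0 and 2
]
toy_pairs = mine_pairs_bruteforce(toy_embeddings)

for score, i, j in toy_pairs:
    print(f"{score:.3f}  image {i} <-> image {j}")
```

In this toy run, the identical pair (0, 2) comes first, followed by the two pairs involving image 3, which mirrors how exact duplicates end up at the top of `duplicates`.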

Duplicates

In the next cell, we output the 10 most similar image pairs. These are identical images, i.e. the same photo was uploaded twice to Unsplash.

python

for score, idx1, idx2 in duplicates[0:10]:
    print(f"\nScore: {score:.3f}")
    print("Image ", idx1)
    # # original size
    # display(unsplash["train"][idx1]["image"])
    # # width=200px
    display(
        unsplash["train"][idx1]["image"].resize(
            (200, int(200 * unsplash["train"][idx1]["image"].height / unsplash["train"][idx1]["image"].width))
        )
    )

    print("Image ", idx2)
    # # original size
    # display(unsplash["train"][idx2]["image"])
    # # width=200px
    display(
        unsplash["train"][idx2]["image"].resize(
            (200, int(200 * unsplash["train"][idx2]["image"].height / unsplash["train"][idx2]["image"].width))
        )
    )
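
The display code above computes the thumbnail height inline so that each image is shown 200 pixels wide with its aspect ratio preserved. A small helper can make that arithmetic easier to read; `thumbnail_size` is a hypothetical name introduced here, not a function from the notebook or from any library.

```python
def thumbnail_size(width, height, target_width=200):
    """Return (target_width, scaled_height), preserving the aspect ratio."""
    return target_width, int(target_width * height / width)


# Landscape and portrait examples with made-up dimensions.
landscape = thumbnail_size(4000, 3000)
portrait = thumbnail_size(1000, 1500)
print(landscape, portrait)
```

Assuming `image` is a PIL image, as the `.resize`, `.width` and `.height` calls above suggest, it could then be shown with `display(image.resize(thumbnail_size(image.width, image.height)))`.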

Near Duplicates

We can also skip the duplicate images and find near duplicates. To achieve this, we only look at image pairs that have a cosine similarity below a certain threshold. In our example, we look at image pairs with a cosine similarity lower than 0.99.

python

threshold = 0.99
near_duplicates = [entry for entry in duplicates if entry[0] < threshold]

for score, idx1, idx2 in near_duplicates[0:10]:
    print(f"\nScore: {score:.3f}")
    print("Image ", idx1)
    # # original size
    # display(unsplash["train"][idx1]["image"])
    # # width=200px
    display(
        unsplash["train"][idx1]["image"].resize(
            (200, int(200 * unsplash["train"][idx1]["image"].height / unsplash["train"][idx1]["image"].width))
        )
    )

    print("Image ", idx2)
    # # original size
    # display(unsplash["train"][idx2]["image"])
    # # width=200px
    display(
        unsplash["train"][idx2]["image"].resize(
            (200, int(200 * unsplash["train"][idx2]["image"].height / unsplash["train"][idx2]["image"].width))
        )
    )
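
Because the list of pairs is sorted by decreasing score, filtering by a threshold keeps that order, so the first remaining entries are the closest near duplicates. The sketch below shows this on made-up triplets. The `lower` bound is an assumption added for illustration (it is not used in the cells above) and cuts off pairs that are too dissimilar to count as near duplicates; `filter_by_score` is a hypothetical helper.

```python
def filter_by_score(pairs, upper, lower=0.0):
    """Keep (score, i, j) triplets with lower <= score < upper, preserving order."""
    return [pair for pair in pairs if lower <= pair[0] < upper]


# Made-up triplets in the same (score, idx1, idx2) layout, sorted by decreasing score.
toy_pairs = [
    (1.000, 12, 87),
    (0.998, 3, 45),
    (0.981, 7, 19),
    (0.954, 21, 30),
    (0.712, 5, 64),
]

near = filter_by_score(toy_pairs, upper=0.99, lower=0.90)
print(near)
```

With only an upper bound, `filter_by_score(toy_pairs, upper=0.99)` behaves like the list comprehension used for `near_duplicates`, as long as all scores are non-negative.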