examples/ai/semantic_text_deduplication.ipynb
In this example we'll use PostgreSQL + pgvector similarity search, via the vecs library, to identify near-duplicate snippets of text.
Our task is to improve IMDB movie reviews by making sure each review on the site is substantive and original. To achieve that, we'll identify and remove any reviews that are near duplicates of others.
!pip install -qU vecs datasets sentence_transformers flupy tqdm
First we load the IMDB dataset using the datasets library. It contains the text of 25,000 movie reviews.
from datasets import load_dataset
data = load_dataset("imdb", split="train")
data
# Look at an example review
data["text"][5]
Next, we can use the sentence-transformers/all-MiniLM-L6-v2 model to create a 384-dimensional text embedding that represents the
semantic meaning of each review. These embeddings are what we'll use for near-duplicate detection.
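"Near duplicate" here means the embeddings of two reviews point in almost the same direction, which is measured with cosine distance (vecs' default measure). A minimal numpy sketch of that metric, independent of the model and database:

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    # 1 minus cosine similarity: 0.0 for identical direction, 1.0 for
    # orthogonal vectors, 2.0 for exactly opposite vectors.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0, 1.0])
print(cosine_distance(a, a))                        # identical -> 0.0
print(cosine_distance(a, np.array([0.0, 1.0, 0.0])))  # orthogonal -> 1.0
```

Later in this notebook, pairs of reviews whose embeddings are within a small cosine distance of each other are flagged as near duplicates.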
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
The vecs library wraps a pythonic interface around PostgreSQL and pgvector.
A collection in vecs maps 1:1 with a PostgreSQL table.
First you will need to establish a connection to your database. You can find the Postgres connection string in the Database Settings of your Supabase project.
Note: SQLAlchemy requires the connection string to start with
postgresql:// (instead of postgres://). Don't forget to rename this after copying the string from the dashboard.
Note: You must use the "connection pooling" string (domain ending in
*.pooler.supabase.com) with Google Colab since Colab does not support IPv6.
This will also work with any other Postgres provider that supports pgvector.
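If you'd rather not rename the scheme by hand, a small (hypothetical) helper can normalize the string copied from the dashboard; the connection string below is a made-up placeholder:

```python
def normalize_conn_string(conn: str) -> str:
    # SQLAlchemy expects the "postgresql://" scheme; the dashboard may
    # hand you "postgres://". Rewrite the prefix if needed.
    if conn.startswith("postgres://"):
        return "postgresql://" + conn[len("postgres://"):]
    return conn

print(normalize_conn_string("postgres://postgres:password@localhost:5432/db"))
# postgresql://postgres:password@localhost:5432/db
```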
import vecs
# Substitute your connection string here
DB_CONNECTION = "postgresql://postgres:password@localhost:5432/db"
# create vector store client
vx = vecs.create_client(DB_CONNECTION)
# create a PostgreSQL/pgvector table named "reviews" to contain the review embeddings
reviews = vx.get_or_create_collection(name="reviews", dimension=384)
Now we can iterate over the dataset, producing an embedding for each review.
from typing import List, Dict, Tuple
from flupy import flu
import numpy as np
from tqdm import tqdm
batch_size = 50
records: List[Tuple[str, np.ndarray, Dict]] = []
# Iterate over the dataset in chunks
# Iterate over the dataset in chunks
for chunk_ix, chunk in tqdm(flu(data['text']).chunk(batch_size).enumerate()):
    # Create embeddings for the current chunk of reviews
    embedding_chunk = model.encode(chunk)
    # Enumerate the embeddings and build a record to insert into the database
    for row_ix, (text, embedding) in enumerate(zip(chunk, embedding_chunk)):
        record_id = chunk_ix * batch_size + row_ix
        records.append((f"{record_id}", embedding, {"text": text}))

# Insert the records into the "reviews" collection
reviews.upsert(records)
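The `chunk_ix * batch_size + row_ix` arithmetic assigns each review a sequential id that is stable across chunks. A plain-Python sketch of the same bookkeeping (using a stdlib stand-in for flupy's `chunk`, and toy strings in place of the dataset):

```python
from itertools import islice

def chunk(iterable, size):
    # Plain-Python equivalent of flu(iterable).chunk(size)
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch

batch_size = 50
texts = [f"review {i}" for i in range(120)]  # stand-in for data["text"]

ids = [
    chunk_ix * batch_size + row_ix
    for chunk_ix, ch in enumerate(chunk(texts, batch_size))
    for row_ix, _ in enumerate(ch)
]
print(ids[0], ids[-1], len(ids))  # ids run 0..119 with no gaps
```

Because the ids are deterministic, re-running the notebook upserts the same rows rather than duplicating them.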
Indexing the collection creates an index on the vector column in Postgres, which significantly improves the performance of similarity queries.
reviews.create_index()
Finally, we can iterate over the reviews, search for the most similar reviews to each one, and display any results that are near duplicates. We could then prune the near-duplicate reviews so that readers see a new and original opinion in each review they choose to read.
for ix, text in tqdm(enumerate(data['text'])):
    # Load the next row from the collection
    query_results = reviews.fetch(ids=[f"{ix}"])
    (query_id, query_embedding, query_meta) = query_results[0]
    # Retrieve the original text from the row's metadata
    query_text = query_meta["text"]

    # To keep the output easy to read quickly, we'll restrict reviews to < 500 characters.
    # In the real world you would not include this restriction.
    if len(query_text) < 500:
        # Query the review embeddings for the 5 most similar reviews
        top_5 = reviews.query(
            query_vector=query_embedding,
            limit=5,
            include_metadata=True,
            include_value=True,
        )
        # For each result
        for result_id, result_distance, result_meta in top_5[1:]:
            result_text = result_meta["text"]
            if (
                # Since our query embedding is in the collection, the nearest
                # result is always itself with a distance of 0. We exclude that
                # record and review any others with a distance < 0.17. The
                # query_id < result_id check reports each pair only once.
                0.01 < abs(result_distance) < 0.17
                and len(result_text) < 500
                and query_id < result_id
            ):
                print(
                    "query_id:", query_id,
                    "\t", "result_id:", result_id,
                    "\t", "distance", round(result_distance, 4),
                    "\n\n", "Query Text",
                    "\n\n", query_meta["text"],
                    "\n\n", "Result Text",
                    "\n\n", result_meta["text"],
                    "\n", "-" * 80,
                )
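The filter above can be isolated from the database entirely. A toy sketch of the reporting rule, with made-up (query_id, result_id, distance) triples: the lower bound drops self-matches, the upper bound keeps only near duplicates, and `query_id < result_id` (a string comparison, since ids are stored as strings) reports each pair from one side only:

```python
# Hypothetical query results: (query_id, result_id, cosine distance)
matches = [
    ("0", "0", 0.0),   # self-match: excluded by the 0.01 lower bound
    ("0", "7", 0.12),  # near duplicate: reported
    ("7", "0", 0.12),  # same pair seen from the other side: skipped
    ("0", "3", 0.42),  # too dissimilar: skipped
]

reported = [
    (query_id, result_id)
    for query_id, result_id, distance in matches
    if 0.01 < abs(distance) < 0.17 and query_id < result_id
]
print(reported)  # [('0', '7')]
```

Note that the 0.17 threshold is specific to this model and dataset; in practice you would tune it by inspecting borderline pairs.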