examples/ai/semantic_text_deduplication.ipynb
In this example we'll use PostgreSQL + pgvector similarity search, via the vecs library, to identify near-duplicate snippets of text.
Our task is to improve IMDB movie reviews by making sure each review on the site is substantive and original. To achieve that, we'll identify and remove any reviews that are near duplicates of others.
!pip install -qU vecs datasets sentence_transformers flupy tqdm
First we load the IMDB dataset using the datasets library. It contains the text of 25,000 movie reviews.
from datasets import load_dataset
data = load_dataset("imdb", split="train")
data
# Look at an example review
data["text"][5]
Next, we can use the sentence-transformers/all-MiniLM-L6-v2 model to create a 384-dimensional text embedding that represents the
semantic meaning of each review. These embeddings are what we'll use for near-duplicate detection.
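"Near duplicate" here means the embeddings of two reviews point in almost the same direction, which is measured with cosine distance (vecs' default measure). A minimal numpy sketch of that metric, independent of the model and database:

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    # 1 minus cosine similarity: 0.0 for identical direction, 1.0 for
    # orthogonal vectors, 2.0 for exactly opposite vectors.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0, 1.0])
print(cosine_distance(a, a))                        # identical -> 0.0
print(cosine_distance(a, np.array([0.0, 1.0, 0.0])))  # orthogonal -> 1.0
```

Later in this notebook, pairs of reviews whose embeddings are within a small cosine distance of each other are flagged as near duplicates.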
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
The vecs library wraps a pythonic interface around PostgreSQL and pgvector.
A collection in vecs maps 1:1 with a PostgreSQL table.
First you will need to establish a connection to your database. You can find the Postgres connection string in the Database Settings of your Supabase project.
Note: SQLAlchemy requires the connection string to start with
postgresql:// (instead of postgres://). Don't forget to rename this after copying the string from the dashboard.
Note: You must use the "connection pooling" string (domain ending in
*.pooler.supabase.com) with Google Colab since Colab does not support IPv6.
This will also work with any other Postgres provider that supports pgvector.
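If you'd rather not rename the scheme by hand, a small (hypothetical) helper can normalize the string copied from the dashboard; the connection string below is a made-up placeholder:

```python
def normalize_conn_string(conn: str) -> str:
    # SQLAlchemy expects the "postgresql://" scheme; the dashboard may
    # hand you "postgres://". Rewrite the prefix if needed.
    if conn.startswith("postgres://"):
        return "postgresql://" + conn[len("postgres://"):]
    return conn

print(normalize_conn_string("postgres://postgres:password@localhost:5432/db"))
# postgresql://postgres:password@localhost:5432/db
```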
import vecs
# Substitute your connection string here
DB_CONNECTION = "postgresql://postgres:password@localhost:5432/db"
# create vector store client
vx = vecs.create_client(DB_CONNECTION)
# create a PostgreSQL/pgvector table named "reviews" to contain the review embeddings
reviews = vx.get_or_create_collection(name="reviews", dimension=384)
Now we can iterate over the dataset, producing an embedding for each review.
from typing import List, Dict, Tuple
from flupy import flu
import numpy as np
from tqdm import tqdm
batch_size = 50
records: List[Tuple[str, np.ndarray, Dict]] = []
# Iterate over the dataset in chunks
# Iterate over the dataset in chunks
for chunk_ix, chunk in tqdm(flu(data['text']).chunk(batch_size).enumerate()):
    # Create embeddings for the current chunk of reviews
    embedding_chunk = model.encode(chunk)
    # Enumerate the embeddings and build a record to insert into the database
    for row_ix, (text, embedding) in enumerate(zip(chunk, embedding_chunk)):
        record_id = chunk_ix * batch_size + row_ix
        records.append((f"{record_id}", embedding, {"text": text}))

# Insert the records into the "reviews" collection
reviews.upsert(records)
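The `chunk_ix * batch_size + row_ix` arithmetic assigns each review a sequential id that is stable across chunks. A plain-Python sketch of the same bookkeeping (using a stdlib stand-in for flupy's `chunk`, and toy strings in place of the dataset):

```python
from itertools import islice

def chunk(iterable, size):
    # Plain-Python equivalent of flu(iterable).chunk(size)
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch

batch_size = 50
texts = [f"review {i}" for i in range(120)]  # stand-in for data["text"]

ids = [
    chunk_ix * batch_size + row_ix
    for chunk_ix, ch in enumerate(chunk(texts, batch_size))
    for row_ix, _ in enumerate(ch)
]
print(ids[0], ids[-1], len(ids))  # ids run 0..119 with no gaps
```

Because the ids are deterministic, re-running the notebook upserts the same rows rather than duplicating them.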
Indexing the collection creates an index on the vector column in Postgres, which significantly improves the performance of similarity queries.
reviews.create_index()
Finally, we can iterate over the reviews, search for the most similar reviews to each one, and display any results that are near duplicates. We could then prune the near-duplicate reviews so that readers see a new and original opinion in each review they choose to read.
for ix, text in tqdm(enumerate(data['text'])):
    # Load the next row from the collection
    query_results = reviews.fetch(ids=[f"{ix}"])
    (query_id, query_embedding, query_meta) = query_results[0]
    # Retrieve the original text from the row's metadata
    query_text = query_meta["text"]

    # To keep the output easy to read quickly, we'll restrict reviews to < 500 characters.
    # In the real world you would not include this restriction.
    if len(query_text) < 500:
        # Query the review embeddings for the 5 most similar reviews
        top_5 = reviews.query(
            query_vector=query_embedding,
            limit=5,
            include_metadata=True,
            include_value=True,
        )
        # For each result
        for result_id, result_distance, result_meta in top_5[1:]:
            result_text = result_meta["text"]
            if (
                # Since our query embedding is in the collection, the nearest
                # result is always itself with a distance of 0. We exclude that
                # record and review any others with a distance < 0.17. The
                # query_id < result_id check reports each pair only once.
                0.01 < abs(result_distance) < 0.17
                and len(result_text) < 500
                and query_id < result_id
            ):
                print(
                    "query_id:", query_id,
                    "\t", "result_id:", result_id,
                    "\t", "distance", round(result_distance, 4),
                    "\n\n", "Query Text",
                    "\n\n", query_meta["text"],
                    "\n\n", "Result Text",
                    "\n\n", result_meta["text"],
                    "\n", "-" * 80,
                )
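The filter above can be isolated from the database entirely. A toy sketch of the reporting rule, with made-up (query_id, result_id, distance) triples: the lower bound drops self-matches, the upper bound keeps only near duplicates, and `query_id < result_id` (a string comparison, since ids are stored as strings) reports each pair from one side only:

```python
# Hypothetical query results: (query_id, result_id, cosine distance)
matches = [
    ("0", "0", 0.0),   # self-match: excluded by the 0.01 lower bound
    ("0", "7", 0.12),  # near duplicate: reported
    ("7", "0", 0.12),  # same pair seen from the other side: skipped
    ("0", "3", 0.42),  # too dissimilar: skipped
]

reported = [
    (query_id, result_id)
    for query_id, result_id, distance in matches
    if 0.01 < abs(distance) < 0.17 and query_id < result_id
]
print(reported)  # [('0', '7')]
```

Note that the 0.17 threshold is specific to this model and dataset; in practice you would tune it by inspecting borderline pairs.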