Semantic Text Deduplication

This guide walks you through a semantic text deduplication example using Colab and Supabase Vecs. You'll learn how to find similar movie reviews using embeddings and remove any that look like duplicates. You will:

  1. Launch a Postgres database that uses pgvector to store embeddings
  2. Launch a notebook that connects to your database
  3. Load the IMDB dataset
  4. Use the sentence-transformers/all-MiniLM-L6-v2 model to create an embedding representing the semantic meaning of each review
  5. Search for all duplicates
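
The idea behind steps 4 and 5 can be sketched without a model or a database. The snippet below is a minimal illustration, not the notebook's code: the three-dimensional vectors are made-up stand-ins for the 384-dimensional embeddings that all-MiniLM-L6-v2 produces, and the function names and the 0.95 similarity threshold are assumptions chosen for the example.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_duplicate_pairs(embeddings, threshold=0.95):
    """Return (id_a, id_b) pairs whose similarity meets the threshold."""
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        if cosine_similarity(vec_a, vec_b) >= threshold:
            pairs.append((id_a, id_b))
    return pairs


# Toy vectors standing in for real review embeddings (hypothetical values).
toy_embeddings = {
    "review_1": [0.90, 0.10, 0.00],
    "review_2": [0.89, 0.12, 0.01],  # nearly the same direction as review_1
    "review_3": [0.00, 0.20, 0.95],  # points in a very different direction
}

duplicate_pairs = find_duplicate_pairs(toy_embeddings)
print(duplicate_pairs)
```

Comparing every pair this way grows quadratically with the number of reviews, which is one reason to store the embeddings in Postgres and let pgvector run the similarity search instead.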

<$Partial path="database_setup.mdx" />

Launching a notebook

Launch our semantic_text_deduplication notebook in Colab:

<a className="w-64" href="https://colab.research.google.com/github/supabase/supabase/blob/master/examples/ai/semantic_text_deduplication.ipynb">
  Open in Colab
</a>

At the top of the notebook, you'll see a Copy to Drive button. Click it to copy the notebook to your Google Drive.

Connecting to your database

In the notebook, find the cell that defines DB_CONNECTION. It contains code like this:

```python
import vecs

DB_CONNECTION = "postgresql://<user>:<password>@<host>:<port>/<db_name>"

# create vector store client
vx = vecs.create_client(DB_CONNECTION)
```

Replace the DB_CONNECTION with your own connection string. You can find the connection string on your project dashboard by clicking Connect.

<Admonition type='note'>

SQLAlchemy requires the connection string to start with postgresql:// (instead of postgres://). After copying the string from the dashboard, change the prefix if needed.

</Admonition>

<Admonition type='note'>

You must use the "connection pooling" string (domain ending in *.pooler.supabase.com) with Google Colab, because Colab does not support IPv6.

</Admonition>
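
Both notes can be checked in code before the client is created. The following is a minimal sketch using only the Python standard library; the helper names normalize_connection_string and uses_pooler are hypothetical, and the credentials, host, and port in the sample string are placeholders rather than real values.

```python
from urllib.parse import urlsplit


def normalize_connection_string(raw):
    """Rewrite a leading postgres:// scheme to postgresql://."""
    prefix = "postgres://"
    if raw.startswith(prefix):
        return "postgresql://" + raw[len(prefix):]
    return raw


def uses_pooler(connection_string):
    """True if the host looks like a Supabase connection pooler domain."""
    host = urlsplit(connection_string).hostname or ""
    return host.endswith(".pooler.supabase.com")


# Hypothetical credentials and host, for illustration only.
copied = "postgres://user:secret@example.pooler.supabase.com:6543/postgres"
DB_CONNECTION = normalize_connection_string(copied)
print(DB_CONNECTION)
print(uses_pooler(DB_CONNECTION))
```

If uses_pooler returns False for your string, you have probably copied the direct connection string rather than the pooled one.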

Stepping through the notebook

Now all that's left is to step through the notebook. Run each code cell by clicking the "execute" button at its top left, or by pressing ctrl+enter. The notebook guides you through creating a collection, adding data to it, and querying it.

You can view the inserted items in the Table Editor by selecting the vecs schema from the schema dropdown.
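
The querying step can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's code: it assumes a collection whose query method accepts the data, limit, measure, and include_value arguments and returns (id, distance) tuples, which is how a vecs collection is expected to behave. The near_duplicate_ids helper, the 0.1 distance cutoff, and the FakeCollection stand-in are all hypothetical, and the stand-in exists only so the example runs without a database.

```python
def near_duplicate_ids(collection, review_id, embedding, max_distance=0.1, limit=5):
    """Ids of stored reviews within max_distance of the given embedding.

    `collection` is expected to behave like a vecs collection: query() with
    include_value=True yields (id, distance) tuples.
    """
    matches = collection.query(
        data=embedding,
        limit=limit,
        measure="cosine_distance",
        include_value=True,
    )
    return [
        other_id
        for other_id, distance in matches
        if other_id != review_id and distance <= max_distance
    ]


class FakeCollection:
    """Stand-in for a real collection so the sketch runs offline."""

    def query(self, data, limit, measure, include_value):
        # Canned results: (id, cosine distance), closest first.
        return [("r1", 0.0), ("r2", 0.03), ("r3", 0.42)][:limit]


duplicates = near_duplicate_ids(FakeCollection(), "r1", [0.1, 0.2, 0.3])
print(duplicates)
```

The review's own id is filtered out because a stored review always matches itself at distance zero. A suitable cutoff depends on the data, so treat 0.1 as a starting point to tune rather than a recommended value.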

<$Partial path="ai/quickstart_hf_deployment.mdx" />

Next steps

You can now start building your own applications with Vecs. Check our examples for ideas.