apps/docs/content/guides/ai/quickstarts/text-deduplication.mdx
This guide will walk you through a "Semantic Text Deduplication" example using Colab and Supabase Vecs. You'll learn how to find similar movie reviews using embeddings, and remove any that seem like duplicates. You will:
Use the sentence-transformers/all-MiniLM-L6-v2 model to create an embedding representing the semantic meaning of each review.

<$Partial path="database_setup.mdx" />
Launch our semantic_text_deduplication notebook in Colab:
<a className="w-64" href="https://colab.research.google.com/github/supabase/supabase/blob/master/examples/ai/semantic_text_deduplication.ipynb">
  Open in Colab
</a>
At the top of the notebook, you'll see a Copy to Drive button. Click it to copy the notebook to your Google Drive.
Inside the Notebook, find the cell which specifies the DB_CONNECTION. It will contain some code like this:
import vecs
DB_CONNECTION = "postgresql://<user>:<password>@<host>:<port>/<db_name>"
# create vector store client
vx = vecs.create_client(DB_CONNECTION)
Replace the DB_CONNECTION with your own connection string. You can find the connection string on your project dashboard by clicking Connect.
SQLAlchemy requires the connection string to start with postgresql:// (instead of postgres://). Don't forget to change this prefix after copying the string from the dashboard.
You must use the "connection pooling" string (domain ending in *.pooler.supabase.com) with Google Colab since Colab does not support IPv6.
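The two connection-string requirements above can be checked before the string is passed to `vecs.create_client`. The following is a minimal sketch using only the Python standard library. The helper names `normalize_connection_string` and `uses_pooler_host` are hypothetical (they are not part of Vecs or the notebook), and the example string is a made-up placeholder.

```python
from urllib.parse import urlsplit

# Domain suffix of the "connection pooling" host mentioned in this guide.
POOLER_SUFFIX = ".pooler.supabase.com"


def normalize_connection_string(conn: str) -> str:
    """Rewrite a leading postgres:// scheme to postgresql:// (hypothetical helper)."""
    prefix = "postgres://"
    if conn.startswith(prefix):
        return "postgresql://" + conn[len(prefix):]
    return conn


def uses_pooler_host(conn: str) -> bool:
    """Return True when the host ends in .pooler.supabase.com (hypothetical helper)."""
    host = urlsplit(conn).hostname or ""
    return host.endswith(POOLER_SUFFIX)


# Placeholder value for illustration only; substitute the string from your dashboard.
raw = "postgres://postgres.abcd:secret@aws-0-us-east-1.pooler.supabase.com:6543/postgres"
DB_CONNECTION = normalize_connection_string(raw)

if not uses_pooler_host(DB_CONNECTION):
    print("Warning: this does not look like a connection pooling string.")
```

The check compares only the parsed host name, so it should not be affected by the user name, password, or port in the string.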
Now all that's left is to step through the notebook. Run each code cell by clicking the "execute" button (Ctrl+Enter) at its top left. The notebook guides you through the process of creating a collection, adding data to it, and querying it.
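To give a feel for the idea behind semantic deduplication, here is a small, self-contained sketch in plain Python. It is not the notebook's code: in the notebook, embeddings come from the sentence-transformers model and similarity search runs in the database through Vecs. Here the vectors are tiny hand-written stand-ins (real all-MiniLM-L6-v2 embeddings have 384 dimensions), and the `deduplicate` function, the record IDs, and the `0.9` threshold are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(records, threshold=0.9):
    """Greedy pass: keep a record only if it is below the similarity
    threshold against every record already kept. Returns the kept IDs."""
    kept = []
    for rec_id, vec in records:
        if all(cosine_similarity(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((rec_id, vec))
    return [rec_id for rec_id, _ in kept]


# Toy "embeddings": review_b points almost the same way as review_a.
toy_reviews = [
    ("review_a", [1.0, 0.0, 0.0]),
    ("review_b", [0.99, 0.05, 0.0]),
    ("review_c", [0.0, 1.0, 0.0]),
]

unique_ids = deduplicate(toy_reviews, threshold=0.9)
print(unique_ids)  # ['review_a', 'review_c']
```

This greedy pass compares every record against every kept record, which is fine for a toy list but slow for a large dataset. Delegating the nearest-neighbor search to an indexed vector store is the usual way to avoid that cost.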
You can view the inserted items in the Table Editor by selecting the vecs schema from the schema dropdown.
<$Partial path="ai/quickstart_hf_deployment.mdx" />
You can now start building your own applications with Vecs. Check our examples for ideas.