examples/providers/azurecosmosdb/semantic-search.ipynb
import os
import requests
In this walkthrough we will use the Retrieval API with an Azure Cosmos DB for MongoDB vCore datastore to perform semantic search.
Before running the notebook, initialize the Retrieval API and have it running locally. Please follow the instructions for starting the Retrieval API provided here.
Azure Cosmos DB
Azure Cosmos DB is a fully managed NoSQL and relational database for modern app development. Using Azure Cosmos DB for MongoDB vCore, you can store vector embeddings in your documents and perform vector similarity search on a fully managed MongoDB-compatible database service.
Learn more about Azure Cosmos DB for MongoDB vCore here. If you don't have an Azure account, you can start setting one up here.
First we will create a list of documents. From the perspective of the retrieval plugin, a document consists of an "id", a "text", an optional "embedding" and a collection of "metadata". The "metadata" has "source", "source_id", "created_at", "url" and "author" fields. Query metadata does not expose the "url" field.
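To make the document shape described above concrete, here is a minimal sketch of a document that fills in every metadata field. All values are invented for illustration, and the choice of "file" as the "source" value is an assumption about what the plugin accepts. The optional "embedding" is left out.

```python
from datetime import datetime, timezone

# Hypothetical document with all metadata fields populated.
# Every value here is made up; only the field names come from the text above.
document_with_metadata = {
    "id": "Siberian Husky",
    "text": "Siberian Huskies are energetic Arctic breed dogs.",
    "metadata": {
        "source": "file",  # assumed to be an accepted source value
        "source_id": "dog-breeds.txt",
        "created_at": datetime(2023, 1, 1, tzinfo=timezone.utc).isoformat(),
        "url": "https://example.com/dog-breeds/siberian-husky",
        "author": "Jane Doe",
    },
}
```

The dog breed documents used in the rest of this notebook omit "metadata" entirely and carry only "id" and "text".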
For this example we have taken some data about a few dog breeds.
document_1 = {
    "id": "Siberian Husky",
    "text": "Siberian Huskies are strikingly beautiful and energetic Arctic breed dogs known for their captivating blue eyes and remarkable endurance in cold climates."
}
document_2 = {
    "id": "Alaskan Malamute",
    "text": "The Alaskan Malamute is a powerful and friendly Arctic sled dog breed known for its strength, endurance, and affectionate nature."
}
document_3 = {
    "id": "Samoyed",
    "text": "The Samoyed is a cheerful and fluffy Arctic breed, renowned for its smile and gentle disposition, originally used for herding reindeer and pulling sleds in Siberia."
}
On the first insert, the datastore creates the collection and, if necessary, an index on the "embedding" field. Hybrid search is not yet supported.
To make these requests to the retrieval app API, we will need to provide authorization in the form of the BEARER_TOKEN we set earlier. We do this below:
BEARER_TOKEN_HERE = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkFheXVzaCBLYXRhcmlhIiwiaWF0IjoxNTE2MjM5MDIyfQ.VHEVK_IdThXZJr8aQsfjVQ-_n4raepdpqsC5gYDsubE"
endpoint_url = 'http://0.0.0.0:8000'
headers = {
    "Authorization": f"Bearer {BEARER_TOKEN_HERE}"
}
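As an alternative to pasting the token into the notebook, the sketch below reads it from the environment. It assumes the token was exported under the name BEARER_TOKEN before the notebook was started; `build_headers` is a hypothetical helper, not part of the Retrieval API.

```python
import os


def build_headers(token=None):
    """Build the Authorization header for the retrieval app.

    Assumption: when no token is passed in, it is available in the
    BEARER_TOKEN environment variable.
    """
    token = token or os.environ.get("BEARER_TOKEN")
    if not token:
        raise ValueError("No bearer token given and BEARER_TOKEN is not set")
    return {"Authorization": f"Bearer {token}"}
```

Calling `build_headers()` with no argument would then produce the same kind of dictionary as `headers` above, without the token appearing in the notebook source.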
response = requests.post(
    f"{endpoint_url}/upsert",
    headers=headers,
    json={"documents": [document_1, document_2, document_3]}
)
response.json()
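Before moving on, it can help to confirm that the upsert succeeded. The following is a minimal sketch that assumes a successful call returns HTTP 200 and a JSON body with an "ids" list; that response shape is an assumption, so compare it with what `response.json()` printed above. `extract_upserted_ids` is a hypothetical helper and the sample payload is invented.

```python
def extract_upserted_ids(status_code, payload):
    """Return the ids reported by an upsert call, or raise if it failed.

    Assumption: a successful upsert answers with HTTP 200 and a JSON
    body of the form {"ids": [...]}.
    """
    if status_code != 200:
        raise RuntimeError(f"Upsert failed with status {status_code}: {payload}")
    ids = payload.get("ids")
    if not ids:
        raise RuntimeError(f"Upsert response contained no ids: {payload}")
    return ids


# Made-up payload standing in for response.json().
stored_ids = extract_upserted_ids(
    200, {"ids": ["Siberian Husky", "Alaskan Malamute", "Samoyed"]}
)
```

With a live response, the call would be `extract_upserted_ids(response.status_code, response.json())`.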
Let's query the data store for dogs based on the place of their origin.
queries = [
    {
        "query": "I want dog breeds from Siberia.",
        "top_k": 2
    },
    {
        "query": "I want dog breed from Alaska.",
        "top_k": 1
    }
]
response = requests.post(
    f"{endpoint_url}/query",
    headers=headers,
    json={"queries": queries}
)
response.json()
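The raw JSON can be hard to scan, so here is a small sketch that flattens it into readable lines. It assumes the response has a top-level "results" list with one entry per query, each holding the "query" text and its own "results" list of matches with "text" and "score" fields. That layout is an assumption, `summarize_results` is a hypothetical helper, and the sample payload and its score are invented.

```python
def summarize_results(payload):
    """Turn a query response into one line per query and one per match.

    Assumption: payload looks like
    {"results": [{"query": str, "results": [{"text": str, "score": float}]}]}.
    """
    lines = []
    for query_result in payload.get("results", []):
        lines.append(f"Query: {query_result['query']}")
        for match in query_result.get("results", []):
            # Show the score and the first 60 characters of the matched text.
            lines.append(f"  {match['score']:.3f}  {match['text'][:60]}")
    return lines


# Made-up payload standing in for response.json(); the score is not real.
sample_payload = {
    "results": [
        {
            "query": "I want dog breeds from Siberia.",
            "results": [{"text": "The Samoyed is a cheerful and fluffy Arctic breed.", "score": 0.5}],
        }
    ]
}
summary = summarize_results(sample_payload)
```

With a live response, `print("\n".join(summarize_results(response.json())))` would list each query followed by its matches.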
You can either provide a list of document IDs to delete or delete all the data.
response = requests.delete(
    f"{endpoint_url}/delete",
    headers=headers,
    json={"ids": ["doc:SiberianHusky:chunk:SiberianHusky_0"]}
)
response.json()
response = requests.delete(
    f"{endpoint_url}/delete",
    headers=headers,
    json={"delete_all": True}
)
response.json()
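Since the two delete modes are mutually exclusive in this walkthrough, a small guard can prevent sending both by accident. The sketch below builds the request body for either mode; `build_delete_payload` is a hypothetical helper, and it only covers the "ids" and "delete_all" keys used above.

```python
def build_delete_payload(ids=None, delete_all=False):
    """Build the JSON body for a delete request.

    Exactly one mode must be chosen: a non-empty list of ids,
    or delete_all=True.
    """
    if delete_all and ids:
        raise ValueError("Pass either ids or delete_all=True, not both")
    if delete_all:
        return {"delete_all": True}
    if ids:
        return {"ids": list(ids)}
    raise ValueError("Nothing to delete: pass ids or delete_all=True")
```

The result can be passed as the `json` argument of `requests.delete`, for example `json=build_delete_payload(delete_all=True)`.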