Document retrieval: upsert and basic query usage

examples/providers/azurecosmosdb/semantic-search.ipynb

python
import os
import requests

In this walkthrough we will go over the Retrieval API with an Azure Cosmos DB for MongoDB vCore datastore for semantic search.

Before running the notebook, please initialize the Retrieval API and have it running locally, following the setup instructions provided here.

Azure Cosmos DB

Azure Cosmos DB is a fully managed NoSQL and relational database for modern app development. Using Azure Cosmos DB for MongoDB vCore, you can store vector embeddings in your documents and perform vector similarity search on a fully managed MongoDB-compatible database service.

Learn more about Azure Cosmos DB for MongoDB vCore here. If you don't have an Azure account, you can start setting one up here.

Document

First we will create a list of documents. From the perspective of the retrieval plugin, a document consists of an "id", a "text" field, an optional "embedding", and a collection of "metadata". The "metadata" has "source", "source_id", "created_at", "url", and "author" fields. Query metadata does not expose the "url" field.
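As a sketch, a document with all of the optional metadata fields populated might look like the following. The metadata values here are illustrative, not taken from the notebook:

```python
# Illustrative document with the optional "metadata" fields populated.
# All metadata values below are made up for demonstration purposes.
document_full = {
    "id": "Siberian Husky",
    "text": "Siberian Huskies are strikingly beautiful and energetic Arctic breed dogs.",
    "metadata": {
        "source": "file",                      # where the document came from
        "source_id": "dog-breeds.txt",         # identifier within that source
        "created_at": "2023-08-01T00:00:00Z",  # ISO 8601 timestamp
        "url": "https://example.com/dog-breeds",
        "author": "Jane Doe",
    },
}
```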

For this example we have taken some data about a few dog breeds.

python
document_1 = {
    "id": "Siberian Husky",
    "text": "Siberian Huskies are strikingly beautiful and energetic Arctic breed dogs known for their captivating blue eyes and remarkable endurance in cold climates."
}

document_2 = {
    "id": "Alaskan Malamute",
    "text": "The Alaskan Malamute is a powerful and friendly Arctic sled dog breed known for its strength, endurance, and affectionate nature."
}

document_3 = {
    "id": "Samoyed",
    "text": "The Samoyed is a cheerful and fluffy Arctic breed, renowned for its smile and gentle disposition, originally used for herding reindeer and pulling sleds in Siberia."
}

Indexing the Docs

On the first insert, the datastore creates the collection (if needed) and an index on the embedding field. Hybrid search is not yet supported.

To make these requests to the retrieval app API, we will need to provide authorization in the form of the BEARER_TOKEN we set earlier. We do this below:

python
BEARER_TOKEN_HERE = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkFheXVzaCBLYXRhcmlhIiwiaWF0IjoxNTE2MjM5MDIyfQ.VHEVK_IdThXZJr8aQsfjVQ-_n4raepdpqsC5gYDsubE"
endpoint_url = 'http://0.0.0.0:8000'
headers = {
    "Authorization": f"Bearer {BEARER_TOKEN_HERE}"
}
python
response = requests.post(
    f"{endpoint_url}/upsert",
    headers=headers,
    json={"documents": [document_1, document_2, document_3]
    }
)

response.json()
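Before querying, it is worth confirming that the upsert succeeded. Assuming the API responds with a JSON body containing an "ids" list of the upserted documents (an assumption about the response shape, not shown in the notebook), a small helper might look like this:

```python
def check_upsert(response_json: dict) -> list:
    """Return the list of upserted document ids.

    Assumes the retrieval API's upsert response has the shape
    {"ids": [...]}; raises if the body looks different.
    """
    ids = response_json.get("ids")
    if ids is None:
        raise ValueError(f"Unexpected upsert response: {response_json}")
    return ids

# Illustrative use with a mocked response body:
print(check_upsert({"ids": ["Siberian Husky", "Alaskan Malamute", "Samoyed"]}))
```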

Querying the datastore

Let's query the data store for dogs based on the place of their origin.

python
queries = [
    {
        "query":"I want dog breeds from Siberia.",
        "top_k":2
    },
    {
        "query":"I want dog breed from Alaska.",
        "top_k":1
    }
]

response = requests.post(
    f"{endpoint_url}/query",
    headers=headers,
    json={"queries":queries}
)

response.json()
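The raw JSON can be verbose, since each match carries its metadata and embedding. As a sketch, assuming the query response has the shape `{"results": [{"query": ..., "results": [{"text": ..., "score": ...}, ...]}]}`, the matched texts and scores can be pulled out like this:

```python
def top_matches(response_json: dict) -> dict:
    """Map each query string to its list of (text, score) matches.

    Assumes the retrieval API's query response shape
    {"results": [{"query": ..., "results": [{"text": ..., "score": ...}]}]}.
    """
    return {
        item["query"]: [(r["text"], r["score"]) for r in item["results"]]
        for item in response_json.get("results", [])
    }

# Illustrative use with a mocked response body:
mock = {
    "results": [
        {
            "query": "I want dog breeds from Siberia.",
            "results": [{"text": "The Samoyed is a cheerful and fluffy Arctic breed.", "score": 0.91}],
        }
    ]
}
print(top_matches(mock))
```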

Deleting the data from the datastore

You can either delete all the data, or provide a list of document ids to delete.

python
response = requests.delete(
    f"{endpoint_url}/delete",
    headers=headers,
    json={"ids":["doc:SiberianHusky:chunk:SiberianHusky_0"]}
)

response.json()
python
response = requests.delete(
    f"{endpoint_url}/delete",
    headers=headers,
    json={"delete_all":True}
)

response.json()