
Semantic Search of one's own data with OpenAI Embedding Model and MongoDB Atlas Vector Search

It is often a valuable exercise, when developing and documenting, to consider User Stories. We have a number of different personas interested in the ChatGPT Retrieval Plugin.

  1. The End User, who wishes to extract information from her organization's or personal data.
  2. The Data Scientist, who curates the data.
  3. The Application Engineer, who sets up and maintains the application.

Application Setup

The Application Engineer has a number of tasks to complete in order to provide service to her two users.

  1. Set up the DataStore.

    • Create a MongoDB Atlas cluster.
    • Add a Vector Search Index to it.

    Begin by following the detailed steps in setup.md. Once completed, you will have a running Cluster, with a Database, a Collection, and a Vector Search Index attached to it. (If you prefer to create the index programmatically, see the sketch that follows this list.)

    You will also have a number of required environment variables. These need to be available to run this example. We will check for them below, and suggest how to set them up with a .env file if that is your preference.

  2. Create and Serve the ChatGPT Retrieval Plugin.

    • Provide an API for the Data Scientist to insert, update, and delete data.
    • Provide an API for the End User to query the data using natural language.

    Start the service in another terminal as described in the repo's QuickStart.

    IMPORTANT: Make sure the environment variables are set in the terminal before running poetry run start.
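
If you prefer to create the Vector Search Index programmatically rather than through the Atlas UI, recent versions of PyMongo (4.5 and later) expose create_search_index. The cell below is a minimal sketch, not the canonical setup: it assumes the plugin's default schema, where document vectors are stored under the embedding key with EMBEDDING_DIMENSION entries and cosine similarity. Check the definition against setup.md before relying on it.

python
# Minimal sketch (assumes PyMongo >= 4.5 and the plugin's default schema,
# where document vectors live under the "embedding" key).
import os
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

client = MongoClient(os.environ["MONGODB_URI"])
collection = client[os.environ["MONGODB_DATABASE"]][os.environ["MONGODB_COLLECTION"]]

index_definition = {
    "mappings": {
        "dynamic": True,
        "fields": {
            "embedding": {
                "type": "knnVector",
                "dimensions": int(os.environ.get("EMBEDDING_DIMENSION", 1536)),
                "similarity": "cosine",  # assumed metric; confirm against setup.md
            }
        }
    }
}
collection.create_search_index(
    SearchIndexModel(definition=index_definition, name=os.environ["MONGODB_INDEX"])
)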

Application Usage

This notebook tells a story of a Data Scientist and an End User as they interact with the service.

We begin by collecting and filtering an example dataset, the Stanford Question Answering Dataset (SQuAD, https://huggingface.co/datasets/squad). We upsert the data into a MongoDB Collection via the upsert endpoint of the Plugin API. Upon doing this, Atlas begins to automatically index the data in preparation for Semantic Search.

We close by asking a question of the data, searching not for a particular text string, but using common language.

1) Application Engineering

Of course, we cannot begin until we test that our environment is set up.

Check environment variables

python
!pwd
python
!which python
python
import os
required_vars = {'BEARER_TOKEN', 'OPENAI_API_KEY', 'DATASTORE', 'EMBEDDING_DIMENSION', 'EMBEDDING_MODEL',
                 'MONGODB_COLLECTION', 'MONGODB_DATABASE', 'MONGODB_INDEX', 'MONGODB_URI'}
missing = required_vars - set(os.environ)
if missing:
    print(f"These required environment variables are not yet set: {missing}")
else:
    assert os.environ["DATASTORE"] == 'mongodb', "this example requires DATASTORE='mongodb'"
python
# If you keep the environment variables in a .env file, like the provided .env.example, do this:
if missing:
    from dotenv import dotenv_values
    from pathlib import Path
    import os
    config = dotenv_values(Path('../.env'))
    os.environ.update(config)

Check MongoDB Atlas Datastore connection

python
from pymongo import MongoClient
client = MongoClient(os.environ["MONGODB_URI"])
# Send a ping to confirm a successful connection
try:
    client.admin.command('ping')
    print("Pinged your deployment. You successfully connected to MongoDB!")
except Exception as e:
    print(e)
python
db = client[os.environ["MONGODB_DATABASE"]]
clxn = db[os.environ["MONGODB_COLLECTION"]]
clxn.name

Check OpenAI Connection

These tests require the environment variables: OPENAI_API_KEY, EMBEDDING_MODEL

We set the api_key, then query the API for its available models. We then loop over this list to find which models can provide text embeddings, along with their natural, full, default dimensions.

python
import openai
openai.api_key = os.environ["OPENAI_API_KEY"]
models = openai.Model.list()
model_names = [model["id"] for model in models['data']]
model_dimensions = {}
for model_name in model_names:
    try:
        response = openai.Embedding.create(input=["Some input text"], model=model_name)
        model_dimensions[model_name] = len(response['data'][0]['embedding'])
    except Exception:
        # Skip models that do not support the embeddings endpoint.
        pass
f"{model_dimensions=}"

2) Data Engineering

Prepare personal or organizational dataset

The ChatGPT Retrieval Plugin provides semantic search of your own data using OpenAI's Embedding Models and MongoDB Atlas Vector Search.

In this example, we will use the Stanford Question Answering Dataset (SQuAD 2.0), which we download from Hugging Face Datasets.

python
import pandas as pd
from datasets import load_dataset
data = load_dataset("squad_v2", split="train")
data = data.to_pandas().drop_duplicates(subset=["context"])
print(f'{len(data)=}')
data.head()

To speed up our example, let's focus specifically on questions about Beyoncé.

python
data = data.loc[data['title']=='Beyoncé']
python
documents = [
    {
        'id': r['id'],
        'text': r['context'],
        'metadata': {
            'title': r['title']
        }
    } for r in data.to_dict(orient='records')
]
documents[0]

Upsert and Index data via the Plugin API

Posting an upsert request to the ChatGPT Retrieval Plugin API performs two tasks on the backend. First, it inserts your data into (or updates it in) the MONGODB_COLLECTION of the MongoDB Cluster that you set up. Second, Atlas asynchronously begins populating a Vector Search Index on the embedding key.

If you have already created the Collection and a Vector Search Index through the Atlas UI while Setting up MongoDB Atlas Cluster in setup.md, then indexing will begin immediately.

If you haven't set up the Atlas Vector Search yet, no problem. upsert will insert the data. To start indexing, simply go back to the Atlas UI and add a Search Index. This will trigger indexing. Once complete, we can begin semantic queries!

The front end of the Plugin is a FastAPI web server whose API is accessed through simple HTTP requests. We will need to provide authorization in the form of the BEARER_TOKEN we set earlier. We do this below:

python
endpoint_url = 'http://0.0.0.0:8000'
headers = {"Authorization": f"Bearer {os.environ['BEARER_TOKEN']}"}

Although our sample data is not large, and the service and datastore are responsive, we follow best practice and execute bulk upserts in batches with retries.

python
from tqdm.auto import tqdm
import requests
from requests.adapters import HTTPAdapter, Retry

# Setup request parameters to batch requests and retry on 5xx errors
batch_size = 100
retries = Retry(total=5, backoff_factor=0.1, status_forcelist=[500, 502, 503, 504])
session = requests.Session()
session.mount('http://', HTTPAdapter(max_retries=retries))
n_docs = len(documents)
for i in tqdm(range(0, n_docs, batch_size)):
    i_end = min(n_docs, i+batch_size)
    print(f'{(i,i_end) =}') 
    # make post request that allows up to 5 retries
    res = session.post(
        f"{endpoint_url}/upsert",
        headers=headers,
        json={"documents": documents[i:i_end]}
    )
python
if res.status_code != 200:
    print(res.reason, res.text)
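
Before moving on, we can reuse the pymongo handle from earlier to confirm that the documents landed in the Collection. This is a quick sketch; the embedding field name is assumed from the plugin's default MongoDB schema.

python
# Quick verification (sketch): count the upserted chunks and peek at one
# document's keys. The "embedding" field is assumed from the plugin's schema.
print(f"{clxn.count_documents({})=}")
sample = clxn.find_one()
print(sample.keys() if sample else "collection is empty")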

3) Answering Questions

Now would be a good time to go back to the Atlas UI and navigate to your collection's Search Index. Once all our SQuAD records have been successfully indexed, we can proceed with the querying phase. By posting one or more queries to the /query endpoint, we can easily search the datastore. For this task, we will use a few questions from SQuAD2.
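
If you would rather check from the notebook than the UI, PyMongo 4.5 and later can list a collection's search indexes. This is a sketch; the exact status fields returned may vary by Atlas version.

python
# Sketch (assumes PyMongo >= 4.5): inspect search-index status without the UI.
for idx in clxn.list_search_indexes():
    print(idx.get("name"), idx.get("status"), idx.get("queryable"))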

python
def format_results(results):
    for query_result in results.json()['results']:
        query = query_result['query']
        answers = []
        scores = []
        for result in query_result['results']:
            answers.append(result['text'])
            scores.append(round(result['score'], 2))
        print("-"*70+"\n"+query+"\n\n"+"\n".join([f"{s}: {a}" for a, s in zip(answers, scores)])+"\n"+"-"*70+"\n\n")    

def ask(question: str):
    res = requests.post(
        f"{endpoint_url}/query",
        headers=headers,
        json={'queries': [{"query": question}]}
    )
    format_results(res)
python
ask("Who is Beyonce?")
python
ask("Who is Beyonce married to?")

4) Clean up

python
response = requests.delete(
    f"{endpoint_url}/delete",
    headers=headers,
    json={"delete_all":True}
)

response.json()
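
As a final sanity check, we can reuse the pymongo handle to confirm the Collection is now empty. This is a small sketch outside the plugin API itself.

python
# The delete_all request above should have emptied the collection.
print(f"{clxn.count_documents({})=}")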