examples/providers/mongodb/semantic-search.ipynb
It is often a valuable exercise, when developing and documenting, to consider User Stories. We have a number of different personas interested in the ChatGPT Retrieval Plugin.
The Application Engineer has a number of tasks to complete in order to provide service to her two users.
Set up the DataStore.
Begin by following the detailed steps in setup.md. Once completed, you will have a running Cluster, with a Database, a Collection, and a Vector Search Index attached to it.
You will also have a number of required environment variables. These need to be available to run this example.
We will check for them below, and suggest how to set them up with an .env file if that is your preference.
Create and Serve the ChatGPT Retrieval Plugin.
Start the service in another terminal as described in the QuickStart section of the repo's README.
IMPORTANT: Make sure the environment variables are set in the terminal before running `poetry run start`.
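For example, a terminal session might look like the following. All values here are placeholders; substitute the actual values you collected in setup.md (the database, collection, and index names shown are purely illustrative):

```shell
# Placeholder values -- replace each with your own from setup.md
export DATASTORE=mongodb
export BEARER_TOKEN="your-bearer-token"
export OPENAI_API_KEY="your-openai-api-key"
export EMBEDDING_DIMENSION=1536
export EMBEDDING_MODEL=text-embedding-ada-002
export MONGODB_URI="mongodb+srv://user:password@cluster.mongodb.net"
export MONGODB_DATABASE=squad
export MONGODB_COLLECTION=squad_collection
export MONGODB_INDEX=vector_index
```

Then, in that same terminal, run `poetry run start`.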
This notebook tells a story of a Data Scientist and an End User as they interact with the service.
We begin by collecting and filtering an example dataset, the [Stanford Question Answering Dataset (SQuAD)](https://huggingface.co/datasets/squad).
We upsert the data into a MongoDB Collection via the `/upsert` endpoint of the Plugin API.
Upon doing this, Atlas begins to automatically index the data in preparation for Semantic Search.
We close by asking a question of the data, searching not for a particular text string, but using common language.
Of course, we cannot begin until we test that our environment is set up.
!pwd
!which python
import os
required_vars = {'BEARER_TOKEN', 'OPENAI_API_KEY', 'DATASTORE', 'EMBEDDING_DIMENSION', 'EMBEDDING_MODEL',
'MONGODB_COLLECTION', 'MONGODB_DATABASE', 'MONGODB_INDEX', 'MONGODB_URI'}
assert os.environ.get("DATASTORE") == 'mongodb', "DATASTORE must be set to 'mongodb'"
missing = required_vars - set(os.environ)
if missing:
    print(f"It is strongly recommended to set these additional environment variables: {missing}")
# If you keep the environment variables in a .env file, like the provided .env.example, do this:
if missing:
    from dotenv import dotenv_values
    from pathlib import Path
    import os
    config = dotenv_values(Path('../.env'))
    os.environ.update(config)
from pymongo import MongoClient
client = MongoClient(os.environ["MONGODB_URI"])
# Send a ping to confirm a successful connection
try:
    client.admin.command('ping')
    print("Pinged your deployment. You successfully connected to MongoDB!")
except Exception as e:
    print(e)
db = client[os.environ["MONGODB_DATABASE"]]
clxn = db[os.environ["MONGODB_COLLECTION"]]
clxn.name
These tests require the environment variables: OPENAI_API_KEY, EMBEDDING_MODEL
We set the api_key, then query the API for its available models. We then loop over the list to find which models can provide text embeddings, and record their default embedding dimensions.
import openai
openai.api_key = os.environ["OPENAI_API_KEY"]
models = openai.Model.list()
model_names = [model["id"] for model in models['data']]
model_dimensions = {}
for model_name in model_names:
    try:
        response = openai.Embedding.create(input=["Some input text"], model=model_name)
        model_dimensions[model_name] = len(response['data'][0]['embedding'])
    except Exception:
        # Model does not provide embeddings; skip it
        pass
f"{model_dimensions=}"
The ChatGPT Retrieval Plugin provides semantic search of your own data using OpenAI's embedding models and MongoDB Atlas Vector Search as the datastore.
In this example, we will use the Stanford Question Answering Dataset (SQuAD), which we download from Hugging Face Datasets.
import pandas as pd
from datasets import load_dataset
data = load_dataset("squad_v2", split="train")
data = data.to_pandas().drop_duplicates(subset=["context"])
print(f'{len(data)=}')
data.head()
To speed up our example, let's focus specifically on questions about Beyoncé.
data = data.loc[data['title']=='Beyoncé']
documents = [
    {
        'id': r['id'],
        'text': r['context'],
        'metadata': {
            'title': r['title']
        }
    } for r in data.to_dict(orient='records')
]
documents[0]
Posting an upsert request to the ChatGPT Retrieval Plugin API performs two tasks on the backend. First, it inserts (or updates) your data in the MONGODB_COLLECTION in the MongoDB Cluster that you set up. Second, Atlas asynchronously begins populating a Vector Search Index on the embedding key.
If you have already created the Collection and a Vector Search Index through the Atlas UI while Setting up MongoDB Atlas Cluster in setup.md, then indexing will begin immediately.
If you haven't set up the Atlas Vector Search yet, no problem. upsert will insert the data. To start indexing, simply go back to the Atlas UI and add a Search Index. This will trigger indexing. Once complete, we can begin semantic queries!
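For reference, a minimal index definition entered in the Atlas Search JSON editor might look like the sketch below. This assumes the default ada-002 embedding dimension of 1536; the field name `embedding` is where the plugin stores vectors, and the similarity function is a configuration choice (cosine is a common default):

```json
{
  "mappings": {
    "dynamic": true,
    "fields": {
      "embedding": {
        "type": "knnVector",
        "dimensions": 1536,
        "similarity": "cosine"
      }
    }
  }
}
```

If you chose a different EMBEDDING_MODEL, the `dimensions` value must match that model's embedding dimension.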
The front end of the Plugin is a FastAPI web server. Its API is accessed via simple HTTP requests. We will need to provide authorization in the form of the BEARER_TOKEN we set earlier. We do this below:
endpoint_url = 'http://0.0.0.0:8000'
headers = {"Authorization": f"Bearer {os.environ['BEARER_TOKEN']}"}
Although our sample data is not large, and the service and datastore are responsive, we follow best practice and execute bulk upserts in batches with retries.
from tqdm.auto import tqdm
import requests
from requests.adapters import HTTPAdapter, Retry
# Setup request parameters to batch requests and retry on 5xx errors
batch_size = 100
retries = Retry(total=5, backoff_factor=0.1, status_forcelist=[500, 502, 503, 504])
session = requests.Session()
session.mount('http://', HTTPAdapter(max_retries=retries))
n_docs = len(documents)
for i in tqdm(range(0, n_docs, batch_size)):
    i_end = min(n_docs, i + batch_size)
    print(f'{(i, i_end)=}')
    # make a post request that allows up to 5 retries
    res = session.post(
        f"{endpoint_url}/upsert",
        headers=headers,
        json={"documents": documents[i:i_end]}
    )
    if res.status_code != 200:
        print(res.text, res.reason)
Now would be a good time to go back to the Atlas UI and check the status of your collection's Search Index. Once all our SQuAD records have been successfully indexed, we can proceed to querying. By passing one or more queries to the `/query` endpoint, we can search the datastore. For this task, we will use a few questions from SQuAD 2.0.
def format_results(results):
    for query_result in results.json()['results']:
        query = query_result['query']
        answers = []
        scores = []
        for result in query_result['results']:
            answers.append(result['text'])
            scores.append(round(result['score'], 2))
        print("-" * 70 + "\n" + query + "\n\n" +
              "\n".join([f"{s}: {a}" for a, s in zip(answers, scores)]) +
              "\n" + "-" * 70 + "\n\n")
def ask(question: str):
    res = requests.post(
        f"{endpoint_url}/query",
        headers=headers,
        json={'queries': [{"query": question}]}
    )
    format_results(res)
ask("Who is Beyonce?")
ask("Who is Beyonce married to?")
response = requests.delete(
    f"{endpoint_url}/delete",
    headers=headers,
    json={"delete_all": True}
)
response.json()