examples/providers/pinecone/semantic-search.ipynb
In this walkthrough we will see how to use the retrieval API with a Pinecone datastore for semantic search and question answering.
Before running this notebook you should have already initialized the retrieval API and have it running locally or elsewhere. The full instructions for doing this are found in the project README.
We will summarize the instructions (specific to the Pinecone datastore) before moving on to the walkthrough.
Install Python 3.10 if not already installed.
Clone the retrieval-app repository:
git clone git@github.com:openai/retrieval-app.git
Navigate to the cloned repository:
cd /path/to/retrieval-app
Install poetry:
pip install poetry
Create a new virtual environment that uses Python 3.10:
poetry env use python3.10
Install the retrieval-app dependencies:
poetry install
The app requires several environment variables to be set before running:
BEARER_TOKEN: Secret token used by the app to authorize incoming requests. We will later include this in the request headers. The token can be generated however you prefer, for example using jwt.io.
OPENAI_API_KEY: The OpenAI API key used for generating embeddings with the OpenAI embeddings model. An API key can be created in your OpenAI account.
DATASTORE: Set to pinecone.
PINECONE_API_KEY: Set to your Pinecone API key. This requires a free Pinecone account, and the key can be found in the Pinecone console.
PINECONE_ENVIRONMENT: Set to your Pinecone environment. It looks like us-east1-gcp or us-west1-aws and can be found next to your API key in the Pinecone console.
PINECONE_INDEX: Set to your chosen index name. The name is up to you; we just recommend something descriptive like "openai-retrieval-app". Note that index names are restricted to alphanumeric characters and "-", and can contain a maximum of 45 characters.
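For reference, one way to set these is to export them in the shell before starting the app. The values below are placeholders; substitute your own:
export BEARER_TOKEN="<your_bearer_token>"
export OPENAI_API_KEY="<your_openai_api_key>"
export DATASTORE="pinecone"
export PINECONE_API_KEY="<your_pinecone_api_key>"
export PINECONE_ENVIRONMENT="us-east1-gcp"
export PINECONE_INDEX="openai-retrieval-app"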
Once the environment variables are set, run the app:
poetry run start
If running the app locally you should see something like:
INFO: Uvicorn running on http://0.0.0.0:8000
INFO: Application startup complete.
Here the app is automatically connected to our index (specified by PINECONE_INDEX); if no index with that name existed beforehand, the app creates one for us.
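Before moving on, we can quickly confirm the API is reachable from Python. This is just a sanity check of our own (assuming the app runs locally on port 8000; as a FastAPI app it should serve its interactive docs at /docs):
import requests
# quick check that the retrieval app is up and responding
res = requests.get("http://localhost:8000/docs")
print(res.status_code)  # expect 200 if the app is running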
Now we're ready to move on to populating our index with some data.
There are a few Python libraries we must pip install for this notebook to run:
!pip install -qU datasets pandas tqdm
In this example, we will use the Stanford Question Answering Dataset (SQuAD), which we download from Hugging Face Datasets.
from datasets import load_dataset
data = load_dataset("squad", split="train")
data
Convert to a Pandas DataFrame for easier preprocessing.
data = data.to_pandas()
data.head()
The dataset contains a lot of duplicate context paragraphs because each context can have many relevant questions. We don't want these duplicates, so we remove them like so:
data = data.drop_duplicates(subset=["context"])
print(len(data))
data.head()
The format required by the app's /upsert endpoint is a list of documents like:
[
{
"id": "abc",
"text": "some important document text",
"metadata": {
"field1": "optional metadata goes here",
"field2": 54
}
},
{
"id": "123",
"text": "some other important text",
"metadata": {
"field1": "another metadata",
"field2": 71,
"field3": "not all metadatas need the same structure"
}
}
...
]
Every document must have a "text" field. The "id" and "metadata" fields are optional.
To create this format for our SQuAD data we do:
documents = [
{
'id': r['id'],
'text': r['context'],
'metadata': {
'title': r['title']
}
} for r in data.to_dict(orient='records')
]
documents[:3]
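As an optional sanity check (our own addition, not something the app requires), we can confirm that every document carries a non-empty text field before upserting:
# every document must have a non-empty "text" field;
# "id" and "metadata" are optional
assert all(
    isinstance(doc.get('text'), str) and doc['text']
    for doc in documents
)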
We're now ready to begin indexing (or upserting) our documents. To make these requests to the retrieval app API, we will need to provide authorization in the form of the BEARER_TOKEN we set earlier. We do this below:
import os
BEARER_TOKEN = os.environ.get("BEARER_TOKEN") or "BEARER_TOKEN_HERE"
Use the BEARER_TOKEN to create our authorization headers:
headers = {
"Authorization": f"Bearer {BEARER_TOKEN}"
}
We'll perform the upsert in batches of batch_size. Make sure that the endpoint_url variable is set to the correct location for your running retrieval-app API.
from tqdm.auto import tqdm
import requests
from requests.adapters import HTTPAdapter, Retry
batch_size = 100
endpoint_url = "http://localhost:8000"
s = requests.Session()
# we set up a retry strategy to retry on 5xx errors
retries = Retry(
total=5, # number of retries before raising error
backoff_factor=0.1,
status_forcelist=[500, 502, 503, 504]
)
s.mount('http://', HTTPAdapter(max_retries=retries))
for i in tqdm(range(0, len(documents), batch_size)):
    # find end of batch
    i_end = min(len(documents), i+batch_size)
    # make post request that allows up to 5 retries
    res = s.post(
        f"{endpoint_url}/upsert",
        headers=headers,
        json={
            "documents": documents[i:i_end]
        }
    )
    # stop early if a batch still fails after the retries
    res.raise_for_status()
With that, our SQuAD records have all been indexed and we can move on to querying.
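If we'd like to double-check that the records reached Pinecone, we can read the index stats directly with the pinecone-client library. This is a minimal, optional sketch (it assumes pinecone-client is installed and reuses the credentials we set earlier):
import os
import pinecone
# initialize the connection using the same credentials as the app
pinecone.init(
    api_key=os.environ['PINECONE_API_KEY'],
    environment=os.environ['PINECONE_ENVIRONMENT']
)
index = pinecone.Index(os.environ['PINECONE_INDEX'])
# the vector count should be at least the number of documents,
# as the app may split longer documents into several chunks
print(index.describe_index_stats())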
To query the datastore all we need to do is pass one or more queries to the /query endpoint. We can take a few questions from SQuAD:
queries = data['question'].tolist()
# format into the structure needed by the /query endpoint
queries = [{'query': q} for q in queries]
len(queries)
We will use just the first three questions:
queries[:3]
res = requests.post(
    f"{endpoint_url}/query",
    headers=headers,
    json={
        'queries': queries[:3]
    }
)
res
Now we can loop through the responses and see the results returned for each query:
for query_result in res.json()['results']:
    query = query_result['query']
    print("-" * 70)
    print(query + "\n")
    for result in query_result['results']:
        # each result carries the document text and a similarity score
        score = round(result['score'], 2)
        print(f"{score}: {result['text']}")
    print("-" * 70 + "\n")
The top results are all relevant, just as we hoped. With that, we're finished. The retrieval app API can now be shut down, and to save resources the Pinecone index can be deleted from the Pinecone console.