examples/providers/pinecone/semantic-search.ipynb
In this walkthrough we will see how to use the retrieval API with a Pinecone datastore for semantic search and question answering.
Before running this notebook you should have already initialized the retrieval API and have it running locally or elsewhere. The full instructions for doing this are found in the project README.
We will summarize the instructions (specific to the Pinecone datastore) before moving on to the walkthrough.
Install Python 3.10 if not already installed.
Clone the retrieval-app repository:
git clone git@github.com:openai/retrieval-app.git
Navigate to the cloned repository:
cd /path/to/retrieval-app
Install poetry:
pip install poetry
Create a new virtual environment that uses Python 3.10:
poetry env use python3.10
Install the retrieval-app dependencies:
poetry install
The app requires several environment variables to be set before running:
BEARER_TOKEN: Secret token used by the app to authorize incoming requests. We will later include this in the request headers. The token can be generated however you prefer, for example using jwt.io.
OPENAI_API_KEY: The OpenAI API key used for generating embeddings with the OpenAI embeddings model. An API key can be created in your OpenAI account.
DATASTORE: Set to pinecone.
PINECONE_API_KEY: Set to your Pinecone API key. This requires a free Pinecone account, and the key can be found in the Pinecone console.
PINECONE_ENVIRONMENT: Set to your Pinecone environment. It looks like us-east1-gcp or us-west1-aws and can be found next to your API key in the Pinecone console.
PINECONE_INDEX: Set to your chosen index name. The name is up to you; we just recommend something descriptive like "openai-retrieval-app". Note that index names are restricted to alphanumeric characters and "-", and can contain a maximum of 45 characters.
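For reference, one way to set these is to export them in the shell before starting the app. The values below are placeholders; substitute your own:
export BEARER_TOKEN="<your_bearer_token>"
export OPENAI_API_KEY="<your_openai_api_key>"
export DATASTORE="pinecone"
export PINECONE_API_KEY="<your_pinecone_api_key>"
export PINECONE_ENVIRONMENT="us-east1-gcp"
export PINECONE_INDEX="openai-retrieval-app"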
Once the environment variables are set, run the app:
poetry run start
If running the app locally you should see something like:
INFO: Uvicorn running on http://0.0.0.0:8000
INFO: Application startup complete.
Here the app is automatically connected to our index (specified by PINECONE_INDEX); if no index with that name existed beforehand, the app creates one for us.
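Before moving on, we can quickly confirm the API is reachable from Python. This is just a sanity check of our own (assuming the app runs locally on port 8000; as a FastAPI app it should serve its interactive docs at /docs):
import requests
# quick check that the retrieval app is up and responding
res = requests.get("http://localhost:8000/docs")
print(res.status_code)  # expect 200 if the app is running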
Now we're ready to move on to populating our index with some data.
There are a few Python libraries we must pip install for this notebook to run:
!pip install -qU datasets pandas tqdm
In this example, we will use the Stanford Question Answering Dataset (SQuAD), which we download from Hugging Face Datasets.
from datasets import load_dataset
data = load_dataset("squad", split="train")
data
Convert to a Pandas DataFrame for easier preprocessing.
data = data.to_pandas()
data.head()
The dataset contains a lot of duplicate context paragraphs because each context can have many relevant questions. We don't want these duplicates, so we remove them like so:
data = data.drop_duplicates(subset=["context"])
print(len(data))
data.head()
The format required by the app's /upsert endpoint is a list of documents like:
[
{
"id": "abc",
"text": "some important document text",
"metadata": {
"field1": "optional metadata goes here",
"field2": 54
}
},
{
"id": "123",
"text": "some other important text",
"metadata": {
"field1": "another metadata",
"field2": 71,
"field3": "not all metadatas need the same structure"
}
}
...
]
Every document must have a "text" field. The "id" and "metadata" fields are optional.
To create this format for our SQuAD data we do:
documents = [
{
'id': r['id'],
'text': r['context'],
'metadata': {
'title': r['title']
}
} for r in data.to_dict(orient='records')
]
documents[:3]
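As an optional sanity check (our own addition, not something the app requires), we can confirm that every document carries a non-empty text field before upserting:
# every document must have a non-empty "text" field;
# "id" and "metadata" are optional
assert all(
    isinstance(doc.get('text'), str) and doc['text']
    for doc in documents
)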
We're now ready to begin indexing (or upserting) our documents. To make these requests to the retrieval app API, we will need to provide authorization in the form of the BEARER_TOKEN we set earlier. We do this below:
import os
BEARER_TOKEN = os.environ.get("BEARER_TOKEN") or "BEARER_TOKEN_HERE"
Use the BEARER_TOKEN to create our authorization headers:
headers = {
"Authorization": f"Bearer {BEARER_TOKEN}"
}
We'll perform the upsert in batches of batch_size. Make sure that the endpoint_url variable is set to the correct location for your running retrieval-app API.
from tqdm.auto import tqdm
import requests
from requests.adapters import HTTPAdapter, Retry
batch_size = 100
endpoint_url = "http://localhost:8000"
s = requests.Session()
# we set up a retry strategy to retry on 5xx errors
retries = Retry(
total=5, # number of retries before raising error
backoff_factor=0.1,
status_forcelist=[500, 502, 503, 504]
)
s.mount('http://', HTTPAdapter(max_retries=retries))
for i in tqdm(range(0, len(documents), batch_size)):
    # find end of batch
    i_end = min(len(documents), i+batch_size)
    # make post request that allows up to 5 retries
    res = s.post(
        f"{endpoint_url}/upsert",
        headers=headers,
        json={
            "documents": documents[i:i_end]
        }
    )
    # stop early if a batch still fails after the retries
    res.raise_for_status()
With that, our SQuAD records have all been indexed and we can move on to querying.
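If we'd like to double-check that the records reached Pinecone, we can read the index stats directly with the pinecone-client library. This is a minimal, optional sketch (it assumes pinecone-client is installed and reuses the credentials we set earlier):
import os
import pinecone
# initialize the connection using the same credentials as the app
pinecone.init(
    api_key=os.environ['PINECONE_API_KEY'],
    environment=os.environ['PINECONE_ENVIRONMENT']
)
index = pinecone.Index(os.environ['PINECONE_INDEX'])
# the vector count should be at least the number of documents,
# as the app may split longer documents into several chunks
print(index.describe_index_stats())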
To query the datastore all we need to do is pass one or more queries to the /query endpoint. We can take a few questions from SQuAD:
queries = data['question'].tolist()
# format into the structure needed by the /query endpoint
queries = [{'query': q} for q in queries]
len(queries)
We will use just the first three questions:
queries[:3]
res = requests.post(
    f"{endpoint_url}/query",
    headers=headers,
    json={
        'queries': queries[:3]
    }
)
res
Now we can loop through the responses and see the results returned for each query:
for query_result in res.json()['results']:
    query = query_result['query']
    print("-" * 70)
    print(query + "\n")
    for result in query_result['results']:
        # each result carries the document text and a similarity score
        score = round(result['score'], 2)
        print(f"{score}: {result['text']}")
    print("-" * 70 + "\n")
The top results are all relevant, just as we hoped. With that, we're finished. The retrieval app API can now be shut down, and to save resources the Pinecone index can be deleted from the Pinecone console.