examples/providers/elasticsearch/search.ipynb
In this walkthrough we will see how to use the retrieval API with an Elasticsearch datastore for search / question-answering.
Before running this notebook you should have already initialized the retrieval API and have it running locally or elsewhere. See the README for instructions on how to do this.
Install Python 3.10 if not already installed.
Clone the retrieval-app repository:
git clone [email protected]:openai/retrieval-app.git
cd /path/to/retrieval-app
Install poetry:
pip install poetry
Tell poetry to use Python 3.10:
poetry env use python3.10
Install the retrieval-app dependencies:
poetry install
BEARER_TOKEN: Secret token used by the app to authorize incoming requests. We will later include this in the request headers. The token can be generated however you prefer, such as using jwt.io.
OPENAI_API_KEY: The OpenAI API key used for generating embeddings with the OpenAI embeddings model. You can create an API key in your OpenAI account.
DATASTORE: Set to elasticsearch.
ELASTICSEARCH_CLOUD_ID or ELASTICSEARCH_URL (one of the two):
ELASTICSEARCH_CLOUD_ID: Set to your deployment's cloud ID. You can find this in the Elasticsearch console.
ELASTICSEARCH_URL: Set to your Elasticsearch URL, looks like https://<username>:<password>@<host>:<port>. You can find this in the Elasticsearch console.
ELASTICSEARCH_USERNAME and ELASTICSEARCH_PASSWORD, or ELASTICSEARCH_API_KEY (one of the two options):
ELASTICSEARCH_USERNAME: Set to your Elasticsearch username. You can find this in the Elasticsearch console. Typically this is set to elastic.
ELASTICSEARCH_PASSWORD: Set to your Elasticsearch password. You can find this in the Elasticsearch console in security.
ELASTICSEARCH_API_KEY: Set to your Elasticsearch API key. You can create one on the Kibana Stack Management page.
ELASTICSEARCH_INDEX: Set to the name of the Elasticsearch index you want to use.
Once the environment variables are set, start the app with:
poetry run start
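As a sketch, the environment variables above might be set in the shell before starting the app. All values here are placeholders, and this assumes the URL/username/password option rather than cloud ID or API key:

```shell
# placeholder values -- replace with your own before running
export BEARER_TOKEN="your-secret-token"      # token the app checks on incoming requests
export OPENAI_API_KEY="your-openai-api-key"  # used to generate embeddings
export DATASTORE="elasticsearch"
export ELASTICSEARCH_URL="https://localhost:9200"
export ELASTICSEARCH_USERNAME="elastic"
export ELASTICSEARCH_PASSWORD="your-password"
export ELASTICSEARCH_INDEX="squad-demo"      # hypothetical index name

poetry run start
```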
If running the app locally you should see something like:
INFO: Uvicorn running on http://0.0.0.0:8000
INFO: Application startup complete.
The app automatically connects to the index specified by ELASTICSEARCH_INDEX; if no index with that name existed beforehand, the app creates one for us.
Now we're ready to move on to populating our index with some data.
There are a few Python libraries we must pip install for this notebook to run; those are:
!pip install -qU datasets pandas tqdm
In this example, we will use the Stanford Question Answering Dataset (SQuAD2), which we download from Hugging Face Datasets.
from datasets import load_dataset
data = load_dataset("squad_v2", split="train")
data
Transform the data into a Pandas dataframe for simpler preprocessing.
data = data.to_pandas()
data.head()
The dataset contains many duplicate context paragraphs, because each context can have several relevant questions. We don't need these duplicates, so we remove them like so:
data = data.drop_duplicates(subset=["context"])
print(len(data))
data.head()
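drop_duplicates keeps the first row for each distinct context. The same idea can be illustrated with a stdlib-only sketch on toy records (hypothetical values, field names matching SQuAD's):

```python
# toy records: two questions share the same context paragraph
records = [
    {"id": "q1", "context": "Paris is the capital of France.", "question": "What is the capital of France?"},
    {"id": "q2", "context": "Paris is the capital of France.", "question": "Which country is Paris in?"},
    {"id": "q3", "context": "Berlin is the capital of Germany.", "question": "What is the capital of Germany?"},
]

# keep the first record for each distinct context, preserving order
seen = set()
deduped = []
for r in records:
    if r["context"] not in seen:
        seen.add(r["context"])
        deduped.append(r)

print([r["id"] for r in deduped])  # → ['q1', 'q3']
```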
The format required by the app's /upsert endpoint is a list of documents like:
[
    {
        "id": "abc",
        "text": "some important document text",
        "metadata": {
            "field1": "optional metadata goes here",
            "field2": 54
        }
    },
    {
        "id": "123",
        "text": "some other important text",
        "metadata": {
            "field1": "another metadata",
            "field2": 71,
            "field3": "not all metadatas need the same structure"
        }
    }
    ...
]
Every document must have a "text" field. The "id" and "metadata" fields are optional.
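Before upserting, a quick sanity check can confirm that every document carries the required "text" field. This is a hypothetical stdlib-only helper, not part of the app:

```python
def validate_documents(documents):
    """Return the indices of documents missing the required 'text' field."""
    return [i for i, doc in enumerate(documents) if "text" not in doc]

docs = [
    {"id": "abc", "text": "some important document text"},
    {"id": "123", "metadata": {"field1": "another metadata"}},  # missing "text"
]
print(validate_documents(docs))  # → [1]
```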
To create this format for our SQuAD data we do:
documents = [
    {
        'id': r['id'],
        'text': r['context'],
        'metadata': {
            'title': r['title']
        }
    } for r in data.to_dict(orient='records')
]
documents[:3]
Now, it's time to initiate the indexing process, also known as upserting, for our documents. To perform these requests to the retrieval app API, we must provide authorization using the BEARER_TOKEN we defined earlier. Below is how we accomplish this:
import os
BEARER_TOKEN = os.environ.get("BEARER_TOKEN") or "BEARER_TOKEN_HERE"
headers = {
    "Authorization": f"Bearer {BEARER_TOKEN}"
}
Now we will execute bulk inserts in batches of size batch_size. Once all our SQuAD2 records have been indexed, we can move on to the querying phase.
from tqdm.auto import tqdm
import requests
from requests.adapters import HTTPAdapter, Retry
batch_size = 100
endpoint_url = "http://localhost:8000"
s = requests.Session()
# we set up a retry strategy to retry on 5xx errors
retries = Retry(
    total=5,  # number of retries before raising error
    backoff_factor=0.1,
    status_forcelist=[500, 502, 503, 504]
)
s.mount('http://', HTTPAdapter(max_retries=retries))
for i in tqdm(range(0, len(documents), batch_size)):
    i_end = min(len(documents), i + batch_size)
    # make post request that allows up to 5 retries
    res = s.post(
        f"{endpoint_url}/upsert",
        headers=headers,
        json={
            "documents": documents[i:i_end]
        }
    )
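The slicing logic in the loop above can be checked independently of the server. With a stdlib-only sketch and toy sizes, we can confirm the batches cover every document exactly once:

```python
# toy documents: 250 records, so we expect batches of 100, 100, and 50
documents = [{"id": str(n), "text": f"doc {n}"} for n in range(250)]
batch_size = 100

batches = []
for i in range(0, len(documents), batch_size):
    i_end = min(len(documents), i + batch_size)
    batches.append(documents[i:i_end])

print([len(b) for b in batches])  # → [100, 100, 50]
```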
By passing one or more queries to the /query endpoint, we can easily query the datastore. For this task, we can use a few questions from SQuAD2.
queries = data['question'].tolist()
# format into the structure needed by the /query endpoint
queries = [{'query': q} for q in queries]
len(queries)
res = requests.post(
    f"{endpoint_url}/query",
    headers=headers,
    json={
        'queries': queries[:3]
    }
)
res
At this point, we can iterate through the responses and inspect the results returned for each query:
for query_result in res.json()['results']:
    query = query_result['query']
    answers = []
    scores = []
    for result in query_result['results']:
        answers.append(result['text'])
        scores.append(round(result['score'], 2))
    print("-" * 70)
    print(query + "\n")
    for a, s in zip(answers, scores):
        print(f"{s}: {a}")
    print("-" * 70 + "\n")
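The loop above assumes the response shape returned by /query. With a mocked response (the text and score values here are fabricated purely to illustrate the structure), the same parsing can be exercised offline:

```python
# a minimal mocked /query response, matching the shape parsed above
mock = {
    "results": [
        {
            "query": "When were the Normans in Normandy?",
            "results": [
                {"text": "The Normans were in Normandy in the 10th and 11th centuries.", "score": 0.91},
                {"text": "Some unrelated passage.", "score": 0.42},
            ],
        }
    ]
}

lines = []
for query_result in mock["results"]:
    for result in query_result["results"]:
        lines.append(f"{round(result['score'], 2)}: {result['text']}")

print(lines[0])  # → 0.91: The Normans were in Normandy in the 10th and 11th centuries.
```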
The top results are all relevant, as we hoped. The score measures how relevant a document is to the query: the higher the score, the more relevant the document.