examples/providers/elasticsearch/search.ipynb
In this walkthrough we will see how to use the retrieval API with an Elasticsearch datastore for search / question-answering.
Before running this notebook you should have already initialized the retrieval API and have it running locally or elsewhere. See the README for instructions on how to do this.
Install Python 3.10 if not already installed.
Clone the retrieval-app repository:
git clone [email protected]:openai/retrieval-app.git
cd /path/to/retrieval-app
Install poetry:
pip install poetry
Tell poetry to use Python 3.10:
poetry env use python3.10
Install the retrieval-app dependencies:
poetry install
BEARER_TOKEN: Secret token used by the app to authorize incoming requests. We will later include this in the request headers. The token can be generated however you prefer, such as using jwt.io.
OPENAI_API_KEY: The OpenAI API key used for generating embeddings with the OpenAI embeddings model. You can create an API key in your OpenAI account.
DATASTORE: Set to elasticsearch.
ELASTICSEARCH_CLOUD_ID or ELASTICSEARCH_URL (one of the two):
ELASTICSEARCH_CLOUD_ID: Set to your deployment's cloud ID. You can find this in the Elasticsearch console.
ELASTICSEARCH_URL: Set to your Elasticsearch URL, looks like https://<username>:<password>@<host>:<port>. You can find this in the Elasticsearch console.
ELASTICSEARCH_USERNAME and ELASTICSEARCH_PASSWORD, or ELASTICSEARCH_API_KEY (one of the two options):
ELASTICSEARCH_USERNAME: Set to your Elasticsearch username. You can find this in the Elasticsearch console. Typically this is set to elastic.
ELASTICSEARCH_PASSWORD: Set to your Elasticsearch password. You can find this in the Elasticsearch console in security.
ELASTICSEARCH_API_KEY: Set to your Elasticsearch API key. You can create one on the Kibana Stack Management page.
ELASTICSEARCH_INDEX: Set to the name of the Elasticsearch index you want to use.
Once the environment variables are set, start the app with:
poetry run start
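As a sketch, the environment variables above might be set in the shell before starting the app. All values here are placeholders, and this assumes the URL/username/password option rather than cloud ID or API key:

```shell
# placeholder values -- replace with your own before running
export BEARER_TOKEN="your-secret-token"      # token the app checks on incoming requests
export OPENAI_API_KEY="your-openai-api-key"  # used to generate embeddings
export DATASTORE="elasticsearch"
export ELASTICSEARCH_URL="https://localhost:9200"
export ELASTICSEARCH_USERNAME="elastic"
export ELASTICSEARCH_PASSWORD="your-password"
export ELASTICSEARCH_INDEX="squad-demo"      # hypothetical index name

poetry run start
```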
If running the app locally you should see something like:
INFO: Uvicorn running on http://0.0.0.0:8000
INFO: Application startup complete.
The app automatically connects to the index specified by ELASTICSEARCH_INDEX; if no index with that name existed beforehand, the app creates one for us.
Now we're ready to move on to populating our index with some data.
There are a few Python libraries we must pip install for this notebook to run; those are:
!pip install -qU datasets pandas tqdm
In this example, we will use the Stanford Question Answering Dataset (SQuAD2), which we download from Hugging Face Datasets.
from datasets import load_dataset
data = load_dataset("squad_v2", split="train")
data
Transform the data into a Pandas dataframe for simpler preprocessing.
data = data.to_pandas()
data.head()
The dataset contains many duplicate context paragraphs, because each context can have several relevant questions. We don't need these duplicates, so we remove them like so:
data = data.drop_duplicates(subset=["context"])
print(len(data))
data.head()
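drop_duplicates keeps the first row for each distinct context. The same idea can be illustrated with a stdlib-only sketch on toy records (hypothetical values, field names matching SQuAD's):

```python
# toy records: two questions share the same context paragraph
records = [
    {"id": "q1", "context": "Paris is the capital of France.", "question": "What is the capital of France?"},
    {"id": "q2", "context": "Paris is the capital of France.", "question": "Which country is Paris in?"},
    {"id": "q3", "context": "Berlin is the capital of Germany.", "question": "What is the capital of Germany?"},
]

# keep the first record for each distinct context, preserving order
seen = set()
deduped = []
for r in records:
    if r["context"] not in seen:
        seen.add(r["context"])
        deduped.append(r)

print([r["id"] for r in deduped])  # → ['q1', 'q3']
```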
The format required by the app's /upsert endpoint is a list of documents like:
[
    {
        "id": "abc",
        "text": "some important document text",
        "metadata": {
            "field1": "optional metadata goes here",
            "field2": 54
        }
    },
    {
        "id": "123",
        "text": "some other important text",
        "metadata": {
            "field1": "another metadata",
            "field2": 71,
            "field3": "not all metadatas need the same structure"
        }
    }
    ...
]
Every document must have a "text" field. The "id" and "metadata" fields are optional.
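Before upserting, a quick sanity check can confirm that every document carries the required "text" field. This is a hypothetical stdlib-only helper, not part of the app:

```python
def validate_documents(documents):
    """Return the indices of documents missing the required 'text' field."""
    return [i for i, doc in enumerate(documents) if "text" not in doc]

docs = [
    {"id": "abc", "text": "some important document text"},
    {"id": "123", "metadata": {"field1": "another metadata"}},  # missing "text"
]
print(validate_documents(docs))  # → [1]
```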
To create this format for our SQuAD data we do:
documents = [
    {
        'id': r['id'],
        'text': r['context'],
        'metadata': {
            'title': r['title']
        }
    } for r in data.to_dict(orient='records')
]
documents[:3]
Now, it's time to initiate the indexing process, also known as upserting, for our documents. To perform these requests to the retrieval app API, we must provide authorization using the BEARER_TOKEN we defined earlier. Below is how we accomplish this:
import os
BEARER_TOKEN = os.environ.get("BEARER_TOKEN") or "BEARER_TOKEN_HERE"
headers = {
    "Authorization": f"Bearer {BEARER_TOKEN}"
}
Now we will execute bulk inserts in batches of size batch_size. Once all our SQuAD2 records have been indexed, we can move on to the querying phase.
from tqdm.auto import tqdm
import requests
from requests.adapters import HTTPAdapter, Retry
batch_size = 100
endpoint_url = "http://localhost:8000"
s = requests.Session()
# we set up a retry strategy to retry on 5xx errors
retries = Retry(
    total=5,  # number of retries before raising error
    backoff_factor=0.1,
    status_forcelist=[500, 502, 503, 504]
)
s.mount('http://', HTTPAdapter(max_retries=retries))
for i in tqdm(range(0, len(documents), batch_size)):
    i_end = min(len(documents), i + batch_size)
    # make post request that allows up to 5 retries
    res = s.post(
        f"{endpoint_url}/upsert",
        headers=headers,
        json={
            "documents": documents[i:i_end]
        }
    )
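The slicing logic in the loop above can be checked independently of the server. With a stdlib-only sketch and toy sizes, we can confirm the batches cover every document exactly once:

```python
# toy documents: 250 records, so we expect batches of 100, 100, and 50
documents = [{"id": str(n), "text": f"doc {n}"} for n in range(250)]
batch_size = 100

batches = []
for i in range(0, len(documents), batch_size):
    i_end = min(len(documents), i + batch_size)
    batches.append(documents[i:i_end])

print([len(b) for b in batches])  # → [100, 100, 50]
```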
By passing one or more queries to the /query endpoint, we can easily query the datastore. For this task, we can use a few questions from SQuAD2.
queries = data['question'].tolist()
# format into the structure needed by the /query endpoint
queries = [{'query': q} for q in queries]
len(queries)
res = requests.post(
    f"{endpoint_url}/query",
    headers=headers,
    json={
        'queries': queries[:3]
    }
)
res
At this point, we can iterate through the responses and inspect the results returned for each query:
for query_result in res.json()['results']:
    query = query_result['query']
    answers = []
    scores = []
    for result in query_result['results']:
        answers.append(result['text'])
        scores.append(round(result['score'], 2))
    print("-" * 70)
    print(query + "\n")
    for a, s in zip(answers, scores):
        print(f"{s}: {a}")
    print("-" * 70 + "\n")
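The loop above assumes the response shape returned by /query. With a mocked response (the text and score values here are fabricated purely to illustrate the structure), the same parsing can be exercised offline:

```python
# a minimal mocked /query response, matching the shape parsed above
mock = {
    "results": [
        {
            "query": "When were the Normans in Normandy?",
            "results": [
                {"text": "The Normans were in Normandy in the 10th and 11th centuries.", "score": 0.91},
                {"text": "Some unrelated passage.", "score": 0.42},
            ],
        }
    ]
}

lines = []
for query_result in mock["results"]:
    for result in query_result["results"]:
        lines.append(f"{round(result['score'], 2)}: {result['text']}")

print(lines[0])  # → 0.91: The Normans were in Normandy in the 10th and 11th centuries.
```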
The top results are all relevant, as we hoped. The score measures how relevant a document is to the query: the higher the score, the more relevant the document.