docs/examples/evaluation/answer_and_context_relevancy.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/evaluation/answer_and_context_relevancy.ipynb" target="_parent">Open In Colab</a>
In this notebook, we demonstrate how to use the AnswerRelevancyEvaluator and ContextRelevancyEvaluator classes to measure the relevancy of a generated answer and of retrieved contexts, respectively, to a given user query. Both of these evaluators return a score between 0 and 1 as well as generated feedback explaining the score. Note that a higher score means higher relevancy. In particular, we prompt the judge LLM to take a step-by-step approach in providing a relevancy score, asking it to answer two questions about a generated answer to a query for answer relevancy (for context relevancy these questions are slightly adjusted). Each question is worth 1 point, so a perfect evaluation yields a score of 2/2.
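As a rough illustration of this scoring scheme (the evaluator classes perform this computation internally; the helper below is hypothetical, not part of llama_index), the judge's per-question points map to the 0-to-1 score like so:

```python
def rubric_score(points_awarded, max_points=2):
    """Map a judge's point tally to a 0-1 relevancy score.

    Illustration only: the evaluators derive this internally from
    the judge LLM's answers to the rubric questions.
    """
    return points_awarded / max_points
```

So a response earning both points scores 1.0, one point scores 0.5, and none scores 0.0.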
%pip install llama-index-llms-openai
import nest_asyncio
from tqdm.asyncio import tqdm_asyncio
nest_asyncio.apply()
def displayify_df(df):
    """For pretty displaying DataFrame in a notebook."""
    display_df = df.style.set_properties(
        **{
            "inline-size": "300px",
            "overflow-wrap": "break-word",
        }
    )
    display(display_df)
For this demonstration, we will use a llama-dataset provided through our llama-hub.
from llama_index.core.llama_dataset import download_llama_dataset
from llama_index.core.llama_pack import download_llama_pack
from llama_index.core import VectorStoreIndex
# download and install dependencies for benchmark dataset
rag_dataset, documents = download_llama_dataset(
    "EvaluatingLlmSurveyPaperDataset", "./data"
)
rag_dataset.to_pandas()[:5]
Next, we build a RAG over the same source documents used to create the rag_dataset.
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()
With our RAG (i.e., query_engine) defined, we can make predictions (i.e., generate responses to the query) with it over the rag_dataset.
prediction_dataset = await rag_dataset.amake_predictions_with(
    predictor=query_engine, batch_size=100, show_progress=True
)
We first need to define our evaluators (i.e., AnswerRelevancyEvaluator and ContextRelevancyEvaluator):
# instantiate the judges
from llama_index.llms.openai import OpenAI
from llama_index.core.evaluation import (
    AnswerRelevancyEvaluator,
    ContextRelevancyEvaluator,
)
judges = {}
judges["answer_relevancy"] = AnswerRelevancyEvaluator(
    llm=OpenAI(temperature=0, model="gpt-3.5-turbo"),
)
judges["context_relevancy"] = ContextRelevancyEvaluator(
    llm=OpenAI(temperature=0, model="gpt-4"),
)
Now, we can use our evaluators to make evaluations by looping through all of the <example, prediction> pairs.
eval_tasks = []
for example, prediction in zip(
    rag_dataset.examples, prediction_dataset.predictions
):
    eval_tasks.append(
        judges["answer_relevancy"].aevaluate(
            query=example.query,
            response=prediction.response,
            sleep_time_in_seconds=1.0,
        )
    )
    eval_tasks.append(
        judges["context_relevancy"].aevaluate(
            query=example.query,
            contexts=prediction.contexts,
            sleep_time_in_seconds=1.0,
        )
    )
eval_results1 = await tqdm_asyncio.gather(*eval_tasks[:250])
eval_results2 = await tqdm_asyncio.gather(*eval_tasks[250:])
eval_results = eval_results1 + eval_results2
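Gathering the tasks in two chunks rather than all at once helps stay under API rate limits. The same pattern, sketched with a hypothetical helper and stand-in coroutines (neither is part of llama_index):

```python
import asyncio


async def gather_in_batches(coros, batch_size):
    """Await coroutines a fixed-size batch at a time.

    Hypothetical helper: batching like this is one way to avoid
    hitting API rate limits with a long task list.
    """
    results = []
    for i in range(0, len(coros), batch_size):
        results.extend(await asyncio.gather(*coros[i : i + batch_size]))
    return results


async def _demo():
    # Stand-in coroutine in place of a real evaluation task
    async def double(x):
        return 2 * x

    return await gather_in_batches([double(i) for i in range(5)], batch_size=2)


batched_results = asyncio.run(_demo())  # order is preserved across batches
```

Since asyncio.gather preserves input order, concatenating the batch results reproduces the order of the original task list.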
evals = {
"answer_relevancy": eval_results[::2],
"context_relevancy": eval_results[1::2],
}
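Because the loop appended tasks in alternating order (answer relevancy first, context relevancy second for each example), slicing the combined results with a stride of 2 recovers each metric's results. A minimal stand-in:

```python
# Stand-in list: tasks were appended answer-first, context-second per example
results = ["ans_0", "ctx_0", "ans_1", "ctx_1", "ans_2", "ctx_2"]

answer_results = results[::2]    # even positions -> answer_relevancy
context_results = results[1::2]  # odd positions  -> context_relevancy
```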
Here we use a utility function to convert the list of EvaluationResult objects into something more notebook-friendly. This utility provides two DataFrames: a deep one containing all of the evaluation results, and another that aggregates by taking the mean of all the scores, per evaluation method.
from llama_index.core.evaluation.notebook_utils import get_eval_results_df
import pandas as pd
deep_dfs = {}
mean_dfs = {}
for metric in evals.keys():
    deep_df, mean_df = get_eval_results_df(
        names=["baseline"] * len(evals[metric]),
        results_arr=evals[metric],
        metric=metric,
    )
    deep_dfs[metric] = deep_df
    mean_dfs[metric] = mean_df
mean_scores_df = pd.concat(
    [mdf.reset_index() for _, mdf in mean_dfs.items()],
    axis=0,
    ignore_index=True,
)
mean_scores_df = mean_scores_df.set_index("index")
mean_scores_df.index = mean_scores_df.index.set_names(["metrics"])
mean_scores_df
The above utility also provides the mean score across all of the evaluations in mean_df.
We can look at the raw distribution of the scores by invoking value_counts() on the deep_df.
deep_dfs["answer_relevancy"]["scores"].value_counts()
deep_dfs["context_relevancy"]["scores"].value_counts()
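value_counts() simply tallies how often each score appears. The same idea with stand-in scores and the standard library (the score values below are hypothetical, not from this run):

```python
from collections import Counter

# Hypothetical scores standing in for deep_dfs[metric]["scores"]
scores = [1.0, 1.0, 0.5, 1.0, 0.0, 0.5]

distribution = Counter(scores)          # per-score tally, like value_counts()
mean_score = sum(scores) / len(scores)  # the aggregate reported in mean_df
```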
It looks like, for the most part, the default RAG does fairly well at generating answers that are relevant to the query. We can get a closer look by viewing the records of any of the deep_df's.
displayify_df(deep_dfs["context_relevancy"].head(2))
And, of course, you can apply any filters you like. For example, suppose you want to look at the examples that yielded less-than-perfect results:
cond = deep_dfs["context_relevancy"]["scores"] < 1
displayify_df(deep_dfs["context_relevancy"][cond].head(5))