docs/examples/evaluation/answer_and_context_relevancy.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/evaluation/answer_and_context_relevancy.ipynb" target="_parent">Open In Colab</a>
In this notebook, we demonstrate how to use the AnswerRelevancyEvaluator and ContextRelevancyEvaluator classes to measure the relevancy of a generated answer and of retrieved contexts, respectively, to a given user query. Both of these evaluators return a score between 0 and 1 as well as generated feedback explaining the score. Note that a higher score means higher relevancy. In particular, we prompt the judge LLM to take a step-by-step approach in providing a relevancy score, asking it to answer two questions about a generated answer to a query for answer relevancy (for context relevancy these questions are slightly adjusted). Each question is worth 1 point, so a perfect evaluation yields a score of 2/2.
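As a rough illustration of this scoring scheme (the evaluator classes perform this computation internally; the helper below is hypothetical, not part of llama_index), the judge's per-question points map to the 0-to-1 score like so:

```python
def rubric_score(points_awarded, max_points=2):
    """Map a judge's point tally to a 0-1 relevancy score.

    Illustration only: the evaluators derive this internally from
    the judge LLM's answers to the rubric questions.
    """
    return points_awarded / max_points
```

So a response earning both points scores 1.0, one point scores 0.5, and none scores 0.0.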
%pip install llama-index-llms-openai
import nest_asyncio
from tqdm.asyncio import tqdm_asyncio
nest_asyncio.apply()
def displayify_df(df):
    """For pretty displaying DataFrame in a notebook."""
    display_df = df.style.set_properties(
        **{
            "inline-size": "300px",
            "overflow-wrap": "break-word",
        }
    )
    display(display_df)
For this demonstration, we will use a llama-dataset provided through our llama-hub.
from llama_index.core.llama_dataset import download_llama_dataset
from llama_index.core.llama_pack import download_llama_pack
from llama_index.core import VectorStoreIndex
# download and install dependencies for benchmark dataset
rag_dataset, documents = download_llama_dataset(
    "EvaluatingLlmSurveyPaperDataset", "./data"
)
rag_dataset.to_pandas()[:5]
Next, we build a RAG over the same source documents used to create the rag_dataset.
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()
With our RAG (i.e., query_engine) defined, we can make predictions (i.e., generate responses to the query) with it over the rag_dataset.
prediction_dataset = await rag_dataset.amake_predictions_with(
    predictor=query_engine, batch_size=100, show_progress=True
)
We first need to define our evaluators (i.e., AnswerRelevancyEvaluator and ContextRelevancyEvaluator):
# instantiate the judges
from llama_index.llms.openai import OpenAI
from llama_index.core.evaluation import (
    AnswerRelevancyEvaluator,
    ContextRelevancyEvaluator,
)
judges = {}
judges["answer_relevancy"] = AnswerRelevancyEvaluator(
    llm=OpenAI(temperature=0, model="gpt-3.5-turbo"),
)
judges["context_relevancy"] = ContextRelevancyEvaluator(
    llm=OpenAI(temperature=0, model="gpt-4"),
)
Now, we can use our evaluators to make evaluations by looping through all of the <example, prediction> pairs.
eval_tasks = []
for example, prediction in zip(
    rag_dataset.examples, prediction_dataset.predictions
):
    eval_tasks.append(
        judges["answer_relevancy"].aevaluate(
            query=example.query,
            response=prediction.response,
            sleep_time_in_seconds=1.0,
        )
    )
    eval_tasks.append(
        judges["context_relevancy"].aevaluate(
            query=example.query,
            contexts=prediction.contexts,
            sleep_time_in_seconds=1.0,
        )
    )
eval_results1 = await tqdm_asyncio.gather(*eval_tasks[:250])
eval_results2 = await tqdm_asyncio.gather(*eval_tasks[250:])
eval_results = eval_results1 + eval_results2
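Gathering the tasks in two chunks rather than all at once helps stay under API rate limits. The same pattern, sketched with a hypothetical helper and stand-in coroutines (neither is part of llama_index):

```python
import asyncio


async def gather_in_batches(coros, batch_size):
    """Await coroutines a fixed-size batch at a time.

    Hypothetical helper: batching like this is one way to avoid
    hitting API rate limits with a long task list.
    """
    results = []
    for i in range(0, len(coros), batch_size):
        results.extend(await asyncio.gather(*coros[i : i + batch_size]))
    return results


async def _demo():
    # Stand-in coroutine in place of a real evaluation task
    async def double(x):
        return 2 * x

    return await gather_in_batches([double(i) for i in range(5)], batch_size=2)


batched_results = asyncio.run(_demo())  # order is preserved across batches
```

Since asyncio.gather preserves input order, concatenating the batch results reproduces the order of the original task list.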
evals = {
"answer_relevancy": eval_results[::2],
"context_relevancy": eval_results[1::2],
}
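Because the loop appended tasks in alternating order (answer relevancy first, context relevancy second for each example), slicing the combined results with a stride of 2 recovers each metric's results. A minimal stand-in:

```python
# Stand-in list: tasks were appended answer-first, context-second per example
results = ["ans_0", "ctx_0", "ans_1", "ctx_1", "ans_2", "ctx_2"]

answer_results = results[::2]    # even positions -> answer_relevancy
context_results = results[1::2]  # odd positions  -> context_relevancy
```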
Here we use a utility function to convert the list of EvaluationResult objects into something more notebook-friendly. This utility provides two DataFrames: a deep one containing all of the evaluation results, and another that aggregates by taking the mean of all the scores, per evaluation method.
from llama_index.core.evaluation.notebook_utils import get_eval_results_df
import pandas as pd
deep_dfs = {}
mean_dfs = {}
for metric in evals.keys():
    deep_df, mean_df = get_eval_results_df(
        names=["baseline"] * len(evals[metric]),
        results_arr=evals[metric],
        metric=metric,
    )
    deep_dfs[metric] = deep_df
    mean_dfs[metric] = mean_df
mean_scores_df = pd.concat(
    [mdf.reset_index() for _, mdf in mean_dfs.items()],
    axis=0,
    ignore_index=True,
)
mean_scores_df = mean_scores_df.set_index("index")
mean_scores_df.index = mean_scores_df.index.set_names(["metrics"])
mean_scores_df
The above utility also provides the mean score across all of the evaluations in mean_df.
We can look at the raw distribution of the scores by invoking value_counts() on the deep_df.
deep_dfs["answer_relevancy"]["scores"].value_counts()
deep_dfs["context_relevancy"]["scores"].value_counts()
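value_counts() simply tallies how often each score appears. The same idea with stand-in scores and the standard library (the score values below are hypothetical, not from this run):

```python
from collections import Counter

# Hypothetical scores standing in for deep_dfs[metric]["scores"]
scores = [1.0, 1.0, 0.5, 1.0, 0.0, 0.5]

distribution = Counter(scores)          # per-score tally, like value_counts()
mean_score = sum(scores) / len(scores)  # the aggregate reported in mean_df
```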
It looks like, for the most part, the default RAG does fairly well at generating answers that are relevant to the query. We can get a closer look by viewing the records of any of the deep_df's.
displayify_df(deep_dfs["context_relevancy"].head(2))
And, of course, you can apply any filters you like. For example, suppose you want to look at the examples that yielded less-than-perfect results:
cond = deep_dfs["context_relevancy"]["scores"] < 1
displayify_df(deep_dfs["context_relevancy"][cond].head(5))