
Pairwise Evaluator

docs/examples/evaluation/pairwise_eval.ipynb



This notebook uses the PairwiseComparisonEvaluator module to check whether an evaluation LLM would prefer the response of one query engine over that of another.

python
%pip install llama-index-llms-openai
python
# attach to the notebook's running event loop so async evaluator calls work
import nest_asyncio

nest_asyncio.apply()
python
# configuring logger to INFO level
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Response
from llama_index.llms.openai import OpenAI
from llama_index.core.evaluation import PairwiseComparisonEvaluator
from llama_index.core.node_parser import SentenceSplitter
import pandas as pd

pd.set_option("display.max_colwidth", 0)

Using GPT-4 here for evaluation

python
# gpt-4
gpt4 = OpenAI(temperature=0, model="gpt-4")

evaluator_gpt4 = PairwiseComparisonEvaluator(llm=gpt4)
python
documents = SimpleDirectoryReader("./test_wiki_data/").load_data()
python
# create vector index
splitter_512 = SentenceSplitter(chunk_size=512)
vector_index1 = VectorStoreIndex.from_documents(
    documents, transformations=[splitter_512]
)

splitter_128 = SentenceSplitter(chunk_size=128)
vector_index2 = VectorStoreIndex.from_documents(
    documents, transformations=[splitter_128]
)
python
query_engine1 = vector_index1.as_query_engine(similarity_top_k=2)
query_engine2 = vector_index2.as_query_engine(similarity_top_k=8)
python
# define jupyter display function; response1 is the candidate ("current")
# response and response2 is the reference
def display_eval_df(query, response1, response2, eval_result) -> None:
    eval_df = pd.DataFrame(
        {
            "Query": query,
            "Reference Response (Answer 1)": response2,
            "Current Response (Answer 2)": response1,
            "Score": eval_result.score,
            "Reason": eval_result.feedback,
        },
        index=[0],
    )
    eval_df = eval_df.style.set_properties(
        **{
            "inline-size": "300px",
            "overflow-wrap": "break-word",
        },
        subset=[
            "Current Response (Answer 2)",
            "Reference Response (Answer 1)",
        ],
    )
    display(eval_df)

To run an evaluation, call .evaluate_response() on the Response object returned from the query, or (as below) pass the query and response strings to .aevaluate(). Let's evaluate the outputs of the two vector indexes.

python
# query_str = "How did New York City get its name?"
query_str = "What was the role of NYC during the American Revolution?"
# query_str = "Tell me about the arts and culture of NYC"
response1 = str(query_engine1.query(query_str))
response2 = str(query_engine2.query(query_str))

By default, we enforce "consistency" in the pairwise comparison.

We feed in the (candidate, reference) pair, then swap the order of the two answers and evaluate again, keeping the verdict only if the two runs agree (and returning a TIE if they don't).
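Conceptually, the order-flip check can be sketched with a toy judge function. (This is an illustrative sketch, not the library's internals: `judge`, `consistent_preference`, and the 1.0/0.0/0.5 score convention here are assumptions for the example.)

```python
# Toy sketch of the order-flip consistency check. `judge` is any function
# that returns 1.0 when it prefers the first answer it is shown, else 0.0.
def consistent_preference(judge, query, candidate, reference):
    first = judge(query, candidate, reference)    # candidate shown first
    flipped = judge(query, reference, candidate)  # order swapped
    if first == 1.0 and flipped == 0.0:
        return 1.0  # candidate preferred in both orderings
    if first == 0.0 and flipped == 1.0:
        return 0.0  # reference preferred in both orderings
    return 0.5      # verdicts disagree -> position bias -> tie


# A biased judge that always prefers whichever answer appears first
# gets caught by the flip and forced to a tie:
always_first = lambda q, a, b: 1.0
print(consistent_preference(always_first, "q", "A", "B"))  # 0.5
```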

python
eval_result = await evaluator_gpt4.aevaluate(
    query_str, response=response1, reference=response2
)
python
display_eval_df(query_str, response1, response2, eval_result)
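The returned EvaluationResult carries a numeric score and free-text feedback. Assuming the usual pairwise convention (1.0 means the candidate response wins, 0.0 means the reference wins, 0.5 is a tie), a small hypothetical helper (not part of llama-index) can make results easier to scan:

```python
# Hypothetical helper: map a pairwise score to a human-readable verdict.
# Score convention assumed: 1.0 = candidate preferred,
# 0.0 = reference preferred, 0.5 = tie.
def verdict(score: float) -> str:
    if score == 1.0:
        return "candidate preferred"
    if score == 0.0:
        return "reference preferred"
    return "tie"


print(verdict(0.5))  # tie
```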

NOTE: By default, we enforce consensus by flipping the order of response/reference and making sure the two verdicts agree (i.e., the flipped run prefers the same underlying answer).

We can disable this, which can lead to more inconsistencies!

python
evaluator_gpt4_nc = PairwiseComparisonEvaluator(
    llm=gpt4, enforce_consensus=False
)
python
eval_result = await evaluator_gpt4_nc.aevaluate(
    query_str, response=response1, reference=response2
)
python
display_eval_df(query_str, response1, response2, eval_result)
python
eval_result = await evaluator_gpt4_nc.aevaluate(
    query_str, response=response2, reference=response1
)
python
display_eval_df(query_str, response2, response1, eval_result)

Running on Some More Queries

python
query_str = "Tell me about the arts and culture of NYC"
response1 = str(query_engine1.query(query_str))
response2 = str(query_engine2.query(query_str))
python
eval_result = await evaluator_gpt4.aevaluate(
    query_str, response=response1, reference=response2
)
python
display_eval_df(query_str, response1, response2, eval_result)
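To compare the two engines over a larger query set, one could collect the scores from each evaluation and compute an aggregate win rate. This is a sketch with a hypothetical helper; under the score convention assumed above (1.0 = engine 1 wins, 0.5 = tie, 0.0 = engine 2 wins), the mean score is engine 1's win rate with ties counted as half a win.

```python
# Hypothetical aggregation: average pairwise scores over many queries.
# With 1.0 = engine 1 wins, 0.5 = tie, 0.0 = engine 2 wins, the mean
# is engine 1's win rate (ties counted as half a win).
def win_rate(scores: list[float]) -> float:
    return sum(scores) / len(scores)


print(win_rate([1.0, 0.5, 0.0, 1.0]))  # 0.625
```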