docs/examples/evaluation/mt_bench_human_judgement.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/evaluation/mt_bench_human_judgement.ipynb" target="_parent"></a>
LabelledPairwiseEvaluatorDatasetIn this notebook guide, we benchmark Gemini and GPT models as LLM evaluators using a slightly adapted version of the MT-Bench Human Judgements dataset. For this dataset, human evaluators compare two llm model responses to a given query and rank them according to their own preference. In the original version, there can be more than one human evaluator for a given example (query, two model responses). In the adapted version that we consider however, we aggregate these 'repeated' entries and convert the 'winner' column of the original schema to instead represent the proportion of times 'model_a' wins across all of the human evaluators. To adapt this to a llama-dataset, and to better consider ties (albeit with small samples) we set an uncertainty threshold for this proportion in that if it is between [0.4, 0.6] then we consider there to be no winner between the two models. We download this dataset from llama-hub. Finally, the LLMs that we benchmark are listed below:
%pip install llama-index-llms-openai
%pip install llama-index-llms-cohere
%pip install llama-index-llms-gemini
!pip install "google-generativeai" -q
import nest_asyncio
nest_asyncio.apply()
Let's load in the llama-dataset from llama-hub.
from llama_index.core.llama_dataset import download_llama_dataset
# download dataset
pairwise_evaluator_dataset, _ = download_llama_dataset(
"MtBenchHumanJudgementDataset", "./mt_bench_data"
)
pairwise_evaluator_dataset.to_pandas()[:5]
from llama_index.core.evaluation import PairwiseComparisonEvaluator
from llama_index.llms.openai import OpenAI
from llama_index.llms.gemini import Gemini
from llama_index.llms.cohere import Cohere
llm_gpt4 = OpenAI(temperature=0, model="gpt-4")
llm_gpt35 = OpenAI(temperature=0, model="gpt-3.5-turbo")
llm_gemini = Gemini(model="models/gemini-pro", temperature=0)
evaluators = {
"gpt-4": PairwiseComparisonEvaluator(llm=llm_gpt4),
"gpt-3.5": PairwiseComparisonEvaluator(llm=llm_gpt35),
"gemini-pro": PairwiseComparisonEvaluator(llm=llm_gemini),
}
EvaluatorBenchmarkerPack (llama-pack)To compare our four evaluators we will benchmark them against MTBenchHumanJudgementDataset, wherein references are provided by human evaluators. The benchmarks will return the following quantites:
number_examples: The number of examples the dataset consists of.invalid_predictions: The number of evaluations that could not yield a final evaluation (e.g., due to inability to parse the evaluation output, or an exception thrown by the LLM evaluator)inconclusives: Since this is a pairwise comparison, to mitigate the risk for "position bias" we conduct two evaluations — one with original order of presenting the two model answers, and another with the order in which these answers are presented to the evaluator LLM is flipped. A result is inconclusive if the LLM evaluator in the second ordering flips its vote in relation to the first vote.ties: A PairwiseComparisonEvaluator can also return a "tie" result. This is the number of examples for which it gave a tie result.agreement_rate_with_ties: The rate at which the LLM evaluator agrees with the reference (in this case human) evaluator, when also including ties. The denominator used to compute this metric is given by: number_examples - invalid_predictions - inconclusives.agreement_rate_without_ties: The rate at which the LLM evaluator agress with the reference (in this case human) evaluator, when excluding and ties. The denominator used to compute this metric is given by: number_examples - invalid_predictions - inconclusives - ties.To compute these metrics, we'll make use of the EvaluatorBenchmarkerPack.
from llama_index.core.llama_pack import download_llama_pack
EvaluatorBenchmarkerPack = download_llama_pack(
"EvaluatorBenchmarkerPack", "./pack"
)
evaluator_benchmarker = EvaluatorBenchmarkerPack(
evaluator=evaluators["gpt-3.5"],
eval_dataset=pairwise_evaluator_dataset,
show_progress=True,
)
gpt_3p5_benchmark_df = await evaluator_benchmarker.arun(
batch_size=100, sleep_time_in_seconds=0
)
gpt_3p5_benchmark_df.index = ["gpt-3.5"]
gpt_3p5_benchmark_df
evaluator_benchmarker = EvaluatorBenchmarkerPack(
evaluator=evaluators["gpt-4"],
eval_dataset=pairwise_evaluator_dataset,
show_progress=True,
)
gpt_4_benchmark_df = await evaluator_benchmarker.arun(
batch_size=100, sleep_time_in_seconds=0
)
gpt_4_benchmark_df.index = ["gpt-4"]
gpt_4_benchmark_df
NOTE: The rate limit for Gemini models is still very constraining, which is understandable given that they've just been released at the time of writing this notebook. So, we use a very small batch_size and moderately high sleep_time_in_seconds to reduce risk of getting rate-limited.
evaluator_benchmarker = EvaluatorBenchmarkerPack(
evaluator=evaluators["gemini-pro"],
eval_dataset=pairwise_evaluator_dataset,
show_progress=True,
)
gemini_pro_benchmark_df = await evaluator_benchmarker.arun(
batch_size=5, sleep_time_in_seconds=0.5
)
gemini_pro_benchmark_df.index = ["gemini-pro"]
gemini_pro_benchmark_df
evaluator_benchmarker.prediction_dataset.save_json("gemini_predictions.json")
For convenience, let's put all the results in a single DataFrame.
import pandas as pd
final_benchmark = pd.concat(
[
gpt_3p5_benchmark_df,
gpt_4_benchmark_df,
gemini_pro_benchmark_df,
],
axis=0,
)
final_benchmark
From the results above, we make the following observations: