Original source: [TonicValidateEvaluators.ipynb on Colab](https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/evaluation/TonicValidateEvaluators.ipynb)

Tonic Validate Evaluators

This notebook has some basic usage examples of Tonic Validate's RAG metrics with LlamaIndex. To use these evaluators, you need to have `tonic_validate` installed, which you can install via `pip install tonic-validate`.

```python
%pip install llama-index-evaluation-tonic-validate
```
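
The `tonic_validate` package itself should come in as a dependency of the integration package; if it does not in your environment, the direct install mentioned above works the same way in a notebook:

```python
%pip install tonic-validate
```
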
```python
import json

import pandas as pd
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.evaluation.tonic_validate import (
    AnswerConsistencyEvaluator,
    AnswerSimilarityEvaluator,
    AugmentationAccuracyEvaluator,
    AugmentationPrecisionEvaluator,
    RetrievalPrecisionEvaluator,
    TonicValidateEvaluator,
)
```
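
Tonic Validate's metrics are LLM-assisted, so the evaluators need model credentials at runtime. A minimal sketch, assuming the default OpenAI-backed scoring:

```python
import os

# Assumption: Tonic Validate's default scoring model is an OpenAI model,
# which reads the standard OPENAI_API_KEY environment variable.
os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder; use your own key
```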

One Question Usage Example

For this example, we have a question with a reference correct answer that does not match the LLM's response. There are two retrieved context chunks, one of which contains the correct answer.

```python
question = "What makes Sam Altman a good founder?"
reference_answer = "He is smart and has a great force of will."
llm_answer = "He is a good founder because he is smart."
retrieved_context_list = [
    "Sam Altman is a good founder. He is very smart.",
    "What makes Sam Altman such a good founder is his great force of will.",
]
```
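
In practice these inputs typically come from a query engine rather than being written by hand. A minimal sketch of wiring that up, assuming a local `data/` directory of documents (the variable names are illustrative):

```python
# Build an index over local documents and answer the question with it.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

response = query_engine.query(question)
llm_answer = response.response  # the LLM's answer string
# Pull the retrieved chunks out of the response's source nodes.
retrieved_context_list = [
    node_with_score.node.get_content()
    for node_with_score in response.source_nodes
]
```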

The answer similarity score is a score between 0 and 5 that measures how well the LLM answer matches the reference answer. In this case, the two answers do not match perfectly, so the answer similarity score is not a perfect 5.

```python
answer_similarity_evaluator = AnswerSimilarityEvaluator()
score = await answer_similarity_evaluator.aevaluate(
    question,
    llm_answer,
    retrieved_context_list,
    reference_response=reference_answer,
)
score
```

The answer consistency score is between 0.0 and 1.0 and measures whether the answer contains information that does not appear in the retrieved context. In this case, all of the information in the answer appears in the retrieved context, so the score is 1.0.

```python
answer_consistency_evaluator = AnswerConsistencyEvaluator()

score = await answer_consistency_evaluator.aevaluate(
    question, llm_answer, retrieved_context_list
)
score
```
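
The examples in this notebook use top-level await, which works in Jupyter. In a plain Python script you would need to drive the coroutine yourself; a minimal sketch:

```python
import asyncio

# Scripts have no running event loop, so execute the coroutine
# with asyncio.run instead of top-level await.
score = asyncio.run(
    answer_consistency_evaluator.aevaluate(
        question, llm_answer, retrieved_context_list
    )
)
print(score)
```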

Augmentation accuracy measures the percentage of the retrieved context that is in the answer. In this case, one of the two retrieved contexts is in the answer, so this score is 0.5.

```python
augmentation_accuracy_evaluator = AugmentationAccuracyEvaluator()

score = await augmentation_accuracy_evaluator.aevaluate(
    question, llm_answer, retrieved_context_list
)
score
```

Augmentation precision measures whether the relevant retrieved context makes it into the answer. Both of the retrieved contexts are relevant, but only one makes it into the answer. For that reason, this score is 0.5.

```python
augmentation_precision_evaluator = AugmentationPrecisionEvaluator()

score = await augmentation_precision_evaluator.aevaluate(
    question, llm_answer, retrieved_context_list
)
score
```

Retrieval precision measures the percentage of the retrieved context that is relevant to answering the question. In this case, both of the retrieved contexts are relevant, so the score is 1.0.

```python
retrieval_precision_evaluator = RetrievalPrecisionEvaluator()

score = await retrieval_precision_evaluator.aevaluate(
    question, llm_answer, retrieved_context_list
)
score
```

The TonicValidateEvaluator can calculate all of Tonic Validate's metrics at once.

```python
tonic_validate_evaluator = TonicValidateEvaluator()

scores = await tonic_validate_evaluator.aevaluate(
    question,
    llm_answer,
    retrieved_context_list,
    reference_response=reference_answer,
)
```

```python
scores.score_dict
```
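
`score_dict` maps metric names to their scores. A quick way to inspect it (the exact key names are an assumption based on the five metrics above; they depend on your `tonic_validate` version):

```python
# Print each metric and its score; with the metrics above you would
# expect keys like "answer_similarity" and "retrieval_precision"
# (key names illustrative, not confirmed from the source).
for metric, value in scores.score_dict.items():
    print(f"{metric}: {value}")
```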

You can also evaluate more than one query and response at once using TonicValidateEvaluator, getting back a tonic_validate Run object that can be logged to the Tonic Validate UI (validate.tonic.ai).

To do this, put the questions, LLM answers, retrieved context lists, and reference answers into lists and call aevaluate_run.

```python
tonic_validate_evaluator = TonicValidateEvaluator()

scores = await tonic_validate_evaluator.aevaluate_run(
    [question], [llm_answer], [retrieved_context_list], [reference_answer]
)
scores.run_data[0].scores
```
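
To log the Run to the Tonic Validate UI, you can upload it with the tonic_validate client. A sketch, assuming tonic_validate's ValidateApi and a project created at validate.tonic.ai (the API key and project ID are placeholders):

```python
from tonic_validate import ValidateApi

# Assumption: ValidateApi.upload_run takes a project ID and a Run object;
# both the API key and the project ID come from the Tonic Validate UI.
validate_api = ValidateApi("your-tonic-validate-api-key")
validate_api.upload_run("your-project-id", scores)
```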