RAGChecker is an evaluation framework for Retrieval-Augmented Generation (RAG) systems. Its metrics assess the retrieval and generation components separately, as well as the quality of the final response, so that weaknesses can be traced to the component that causes them.
Key features of RAGChecker include:
For more information, visit the RAGChecker GitHub repository.
RAGChecker provides a comprehensive set of metrics to evaluate different aspects of RAG systems:
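Overall Metrics: precision, recall, and F1.
Retriever Metrics: claim recall and context precision.
Generator Metrics: context utilization, noise sensitivity (in relevant and in irrelevant context), hallucination, self-knowledge, and faithfulness.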
These metrics provide a nuanced evaluation of both the retrieval and generation components, allowing for targeted improvements in RAG systems.
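As a rough illustration of how the three overall metrics relate to one another, the sketch below computes precision, recall, and F1 from claim-level judgments. It assumes claim-level definitions (precision as the share of response claims supported by the ground-truth answer, recall as the share of ground-truth claims covered by the response). The function name and the boolean inputs are hypothetical, and this is not RAGChecker's own implementation.

```python
def overall_scores(response_claims_correct, gt_claims_covered):
    """Return (precision, recall, f1) as percentages rounded to one decimal.

    response_claims_correct: one bool per claim in the response,
        True if the ground-truth answer supports the claim (assumed input).
    gt_claims_covered: one bool per claim in the ground-truth answer,
        True if the response covers the claim (assumed input).
    """
    precision = (
        sum(response_claims_correct) / len(response_claims_correct)
        if response_claims_correct
        else 0.0
    )
    recall = (
        sum(gt_claims_covered) / len(gt_claims_covered)
        if gt_claims_covered
        else 0.0
    )
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return tuple(round(100 * value, 1) for value in (precision, recall, f1))


# Illustrative counts: 2 of 3 response claims correct, 3 of 11 ground-truth claims covered.
scores = overall_scores([True, True, False], [True] * 3 + [False] * 8)
print(scores)  # (66.7, 27.3, 38.7)
```

The counts are chosen so that the result matches the overall metrics in the sample output later in this notebook; the actual claim counts behind that output are not known.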
%pip install -qU ragchecker llama-index
First, let's import the necessary libraries:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from ragchecker.integrations.llama_index import response_to_rag_results
from ragchecker import RAGResults, RAGChecker
from ragchecker.metrics import all_metrics
Now, let's create a simple LlamaIndex query engine using a sample dataset:
# Load documents (replace the placeholder path with a directory of your own files)
documents = SimpleDirectoryReader("path/to/your/documents").load_data()
# Create index
index = VectorStoreIndex.from_documents(documents)
# Create query engine
rag_application = index.as_query_engine()
Now we'll demonstrate how to use the response_to_rag_results function to convert LlamaIndex output to the RAGChecker format:
# User query and ground truth answer
user_query = "What is RAGChecker?"
gt_answer = "RAGChecker is an advanced automatic evaluation framework designed to assess and diagnose Retrieval-Augmented Generation (RAG) systems. It provides a comprehensive suite of metrics and tools for in-depth analysis of RAG performance."
# Get response from LlamaIndex
response_object = rag_application.query(user_query)
# Convert to RAGChecker format
rag_result = response_to_rag_results(
    query=user_query,
    gt_answer=gt_answer,
    response_object=response_object,
)
# Create RAGResults object
rag_results = RAGResults.from_dict({"results": [rag_result]})
print(rag_results)
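The example above converts a single query. To evaluate a set of queries, the same two calls can be wrapped in a loop. The sketch below is one way to do that; `collect_rag_results` is a hypothetical helper, and the query and conversion functions are passed in as arguments so the code runs here with stand-ins instead of a real index and LLM.

```python
def collect_rag_results(eval_set, query_fn, convert_fn):
    """Query the RAG application for each example and convert every response.

    eval_set:   iterable of (query, gt_answer) pairs
    query_fn:   callable taking a query string, e.g. rag_application.query
    convert_fn: callable accepting the keyword arguments query, gt_answer and
                response_object, e.g. response_to_rag_results
    """
    results = []
    for query, gt_answer in eval_set:
        response_object = query_fn(query)
        results.append(
            convert_fn(
                query=query,
                gt_answer=gt_answer,
                response_object=response_object,
            )
        )
    # Same top-level shape that is passed to RAGResults.from_dict above.
    return {"results": results}


# Stand-ins so the sketch runs on its own; their return values are placeholders,
# not the real output of a query engine or of response_to_rag_results.
def fake_query(query):
    return f"answer to: {query}"


def fake_convert(query, gt_answer, response_object):
    return {"query": query, "gt_answer": gt_answer, "response": response_object}


eval_set = [
    ("What is RAGChecker?", "An evaluation framework for RAG systems."),
    ("What does it evaluate?", "The retrieval and generation components."),
]
payload = collect_rag_results(eval_set, fake_query, fake_convert)
print(len(payload["results"]))  # 2
```

With the real objects from this notebook, the call would be `RAGResults.from_dict(collect_rag_results(eval_set, rag_application.query, response_to_rag_results))`.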
Now that we have our results in the correct format, let's evaluate them using RAGChecker:
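# Initialize RAGChecker.
# The extractor and checker are given as Amazon Bedrock model identifiers,
# so this assumes Bedrock access is already configured in your environment.
evaluator = RAGChecker(
    extractor_name="bedrock/meta.llama3-70b-instruct-v1:0",
    checker_name="bedrock/meta.llama3-70b-instruct-v1:0",
    batch_size_extractor=32,
    batch_size_checker=32,
)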
# Evaluate using RAGChecker
evaluator.evaluate(rag_results, all_metrics)
# Print detailed results
print(rag_results)
The output will look something like this:
RAGResults(
  1 RAG results,
  Metrics:
  {
    "overall_metrics": {
      "precision": 66.7,
      "recall": 27.3,
      "f1": 38.7
    },
    "retriever_metrics": {
      "claim_recall": 54.5,
      "context_precision": 100.0
    },
    "generator_metrics": {
      "context_utilization": 16.7,
      "noise_sensitivity_in_relevant": 0.0,
      "noise_sensitivity_in_irrelevant": 0.0,
      "hallucination": 33.3,
      "self_knowledge": 0.0,
      "faithfulness": 66.7
    }
  }
)
This output provides a comprehensive view of the RAG system's performance, including overall metrics, retriever metrics, and generator metrics as described in the earlier section.
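For reporting or logging, the nested metrics can be flattened into one row per metric. The sketch below works on a plain dictionary copied by hand from the sample output above; `flatten_metrics` is a hypothetical helper, and how the dictionary is obtained from the `RAGResults` object is not covered here.

```python
# Metrics copied by hand from the sample output above.
metrics = {
    "overall_metrics": {"precision": 66.7, "recall": 27.3, "f1": 38.7},
    "retriever_metrics": {"claim_recall": 54.5, "context_precision": 100.0},
    "generator_metrics": {
        "context_utilization": 16.7,
        "noise_sensitivity_in_relevant": 0.0,
        "noise_sensitivity_in_irrelevant": 0.0,
        "hallucination": 33.3,
        "self_knowledge": 0.0,
        "faithfulness": 66.7,
    },
}


def flatten_metrics(nested):
    """Turn {group: {metric: score}} into a list of (group, metric, score) rows."""
    rows = []
    for group, values in nested.items():
        for name, score in values.items():
            rows.append((group, name, score))
    return rows


rows = flatten_metrics(metrics)
for group, name, score in rows:
    # Left-align the names and right-align the score for a readable table.
    print(f"{group:<18} {name:<32} {score:>6.1f}")
```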
Instead of evaluating every metric with all_metrics, you can import a specific metric group and pass it to evaluator.evaluate in its place:
from ragchecker.metrics import (
    overall_metrics,
    retriever_metrics,
    generator_metrics,
)
For even more granular control, you can choose specific individual metrics for your needs:
from ragchecker.metrics import (
    precision,
    recall,
    f1,
    claim_recall,
    context_precision,
    context_utilization,
    noise_sensitivity_in_relevant,
    noise_sensitivity_in_irrelevant,
    hallucination,
    self_knowledge,
    faithfulness,
)
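To make the relationship between groups and individual metrics concrete, the sketch below resolves a mixed selection into individual metric names. The mapping mirrors the grouping in the sample output earlier in this notebook. `METRIC_GROUPS` and `expand_selection` are hypothetical names using plain strings, not the objects imported from `ragchecker.metrics`, and this does not reflect how RAGChecker resolves metrics internally.

```python
# Hypothetical mapping, mirroring the groups shown in the sample output.
METRIC_GROUPS = {
    "overall_metrics": ["precision", "recall", "f1"],
    "retriever_metrics": ["claim_recall", "context_precision"],
    "generator_metrics": [
        "context_utilization",
        "noise_sensitivity_in_relevant",
        "noise_sensitivity_in_irrelevant",
        "hallucination",
        "self_knowledge",
        "faithfulness",
    ],
}


def expand_selection(selection):
    """Resolve group names to metric names, keeping order and dropping duplicates."""
    expanded = []
    for item in selection:
        # A name that is not a group is treated as an individual metric.
        for name in METRIC_GROUPS.get(item, [item]):
            if name not in expanded:
                expanded.append(name)
    return expanded


selected = expand_selection(["retriever_metrics", "faithfulness"])
print(selected)  # ['claim_recall', 'context_precision', 'faithfulness']
```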
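This notebook has demonstrated how to integrate RAGChecker with LlamaIndex to evaluate the performance of RAG systems. We've covered building a LlamaIndex query engine, converting its responses to the RAGChecker format, running the evaluation, and selecting metric groups or individual metrics.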
RAGChecker's metrics show where a RAG system falls short, whether in retrieval or in generation, so you can direct improvements at the component that needs them.