embedchain/docs/components/evaluation.mdx
We provide out-of-the-box evaluation metrics for your RAG application. You can use them to evaluate your RAG application and compare results across different settings of your production RAG application.
Currently, we provide support for the following evaluation metrics:
<CardGroup cols={3}>
  <Card title="Context Relevancy" href="#context_relevancy"></Card>
  <Card title="Answer Relevancy" href="#answer_relevancy"></Card>
  <Card title="Groundedness" href="#groundedness"></Card>
  <Card title="Custom Metric" href="#custom_metric"></Card>
</CardGroup>

Here is a basic example of running evaluation:
from embedchain import App
app = App()
# Add data sources
app.add("https://www.forbes.com/profile/elon-musk")
# Run evaluation
app.evaluate(["What is the net worth of Elon Musk?", "How many companies Elon Musk owns?"])
# {'answer_relevancy': 0.9987286412340826, 'groundedness': 1.0, 'context_relevancy': 0.3571428571428571}
Under the hood, Embedchain does the following for each question:

1. Runs semantic search in the vector database and fetches the context
2. Calls the LLM with the question and the fetched context to generate the answer
3. Runs evaluation on the context relevancy, groundedness, and answer relevancy metrics and returns the result

We use OpenAI's gpt-4 model as the default LLM for automatic evaluation. Hence, you must set OPENAI_API_KEY as an environment variable.
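If you prefer to set the key from within Python (exporting it in your shell works just as well), a minimal setup looks like this, with the placeholder replaced by your own key:

```python
import os

# The default judge model (gpt-4) is called via the OpenAI API,
# so the key must be available before running app.evaluate().
os.environ["OPENAI_API_KEY"] = "sk-xxx"
```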
In order to evaluate your RAG application, you have to set up a dataset. Each data point in the dataset consists of a question, a list of contexts, and an answer. Here is an example of how to create a dataset for evaluation:
from embedchain.utils.eval import EvalData
data = [
    {
        "question": "What is the net worth of Elon Musk?",
        "contexts": [
            "Elon Musk PROFILEElon MuskCEO, ...",
            "a Twitter poll on whether the journalists' ...",
            "2016 and run by Jared Birchall.[335]...",
        ],
        "answer": "As of the information provided, Elon Musk's net worth is $241.6 billion.",
    },
    {
        "question": "which companies does Elon Musk own?",
        "contexts": [
            "of December 2023[update], ...",
            "ThielCofounderView ProfileTeslaHolds ...",
            "Elon Musk PROFILEElon MuskCEO, ...",
        ],
        "answer": "Elon Musk owns several companies, including Tesla, SpaceX, Neuralink, and The Boring Company.",
    },
]

dataset = []
for d in data:
    eval_data = EvalData(question=d["question"], contexts=d["contexts"], answer=d["answer"])
    dataset.append(eval_data)
Once you have created your dataset, you can run evaluation on it by picking the metric you want to use.
For example, you can run evaluation on context relevancy metric using the following code:
from embedchain.evaluation.metrics import ContextRelevance
metric = ContextRelevance()
score = metric.evaluate(dataset)
print(score)
You can choose a different metric, or write your own, to run evaluation on. The sections below cover each supported metric as well as how to define a custom one.
Context relevancy is a metric to determine "how relevant the context is to the question". We use OpenAI's gpt-4 model to determine the relevancy of the context. We achieve this by prompting the model with the question and the context and asking it to return relevant sentences from the context. We then use the following formula to determine the score:
context_relevance_score = num_relevant_sentences_in_context / num_of_sentences_in_context
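To make the formula concrete, here is an illustrative sketch of the scoring step (the function name is hypothetical; this is not the library's internal code), assuming the judge model has already returned the list of relevant sentences:

```python
def context_relevance_score(relevant_sentences: list[str], context_sentences: list[str]) -> float:
    # Fraction of context sentences that the judge model marked as relevant to the question.
    return len(relevant_sentences) / len(context_sentences)

# e.g. 5 relevant sentences out of 14 context sentences -> ~0.357
print(context_relevance_score(["..."] * 5, ["..."] * 14))
```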
You can run the context relevancy evaluation with the following simple code:
from embedchain.evaluation.metrics import ContextRelevance
metric = ContextRelevance()
score = metric.evaluate(dataset)  # 'dataset' is defined in the create dataset section
print(score)
# 0.27975528364849833
In the above example, we used sensible defaults for the evaluation. However, you can also configure the evaluation metric as per your needs using the ContextRelevanceConfig class.
Here is a more advanced example of how to pass a custom evaluation config for evaluating on context relevance metric:
from embedchain.config.evaluation.base import ContextRelevanceConfig
from embedchain.evaluation.metrics import ContextRelevance
eval_config = ContextRelevanceConfig(model="gpt-4", api_key="sk-xxx", language="en")
metric = ContextRelevance(config=eval_config)
metric.evaluate(dataset)
Answer relevancy is a metric to determine how relevant the answer is to the question. We prompt the model with the answer and ask it to generate questions from the answer. We then use the cosine similarity between the generated questions and the original question to determine the score.
answer_relevancy_score = mean(cosine_similarity(generated_questions, original_question))
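As an illustrative sketch of that aggregation (the helper name is hypothetical; the metric itself handles question generation and embedding via the configured models), the score is just the mean cosine similarity between the embedded generated questions and the embedded original question:

```python
import numpy as np

def answer_relevancy_score(generated_question_embeddings: np.ndarray,
                           original_question_embedding: np.ndarray) -> float:
    # Normalize each embedding, then average the cosine similarity between
    # every generated question and the original question.
    gen = generated_question_embeddings / np.linalg.norm(generated_question_embeddings, axis=1, keepdims=True)
    orig = original_question_embedding / np.linalg.norm(original_question_embedding)
    return float(np.mean(gen @ orig))
```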
You can run the answer relevancy evaluation with the following simple code:
from embedchain.evaluation.metrics import AnswerRelevance
metric = AnswerRelevance()
score = metric.evaluate(dataset)
print(score)
# 0.9505334177461916
In the above example, we used sensible defaults for the evaluation. However, you can also configure the evaluation metric as per your needs using the AnswerRelevanceConfig class. Here is a more advanced example where you can provide your own evaluation config:
from embedchain.config.evaluation.base import AnswerRelevanceConfig
from embedchain.evaluation.metrics import AnswerRelevance
eval_config = AnswerRelevanceConfig(
    model='gpt-4',
    embedder="text-embedding-ada-002",
    api_key="sk-xxx",
    num_gen_questions=2
)
metric = AnswerRelevance(config=eval_config)
score = metric.evaluate(dataset)
Groundedness is a metric to determine how grounded the answer is to the context. We use OpenAI's gpt-4 model to determine the groundedness of the answer. We achieve this by prompting the model with the answer and asking it to generate claims from the answer. We then prompt the model again with the context and the generated claims to determine the verdict on each claim. We then use the following formula to determine the score:
groundedness_score = (sum of all verdicts) / (total # of claims)
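Concretely, the aggregation averages the per-claim verdicts. Here is an illustrative sketch (hypothetical helper, not the library's internal code), where each verdict is 1 if the claim is supported by the context and 0 otherwise:

```python
def groundedness_score(claim_verdicts: list[int]) -> float:
    # Each verdict is 1 if the corresponding claim is supported by the context, else 0.
    return sum(claim_verdicts) / len(claim_verdicts)

# e.g. all 3 claims extracted from the answer are supported by the context -> 1.0
print(groundedness_score([1, 1, 1]))
```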
You can run the groundedness evaluation with the following simple code:
from embedchain.evaluation.metrics import Groundedness
metric = Groundedness()
score = metric.evaluate(dataset) # dataset from above
print(score)
# 1.0
In the above example, we used sensible defaults for the evaluation. However, you can also configure the evaluation metric as per your needs using the GroundednessConfig class. Here is a more advanced example where you can configure the evaluation config:
from embedchain.config.evaluation.base import GroundednessConfig
from embedchain.evaluation.metrics import Groundedness
eval_config = GroundednessConfig(model='gpt-4', api_key="sk-xxx")
metric = Groundedness(config=eval_config)
score = metric.evaluate(dataset)
You can also create your own evaluation metric by extending the BaseMetric class. You can find the source code for the existing metrics under the embedchain.evaluation.metrics path.
from typing import Optional
from embedchain.config.base_config import BaseConfig
from embedchain.evaluation.metrics import BaseMetric
from embedchain.utils.eval import EvalData
class MyCustomMetric(BaseMetric):
    def __init__(self, config: Optional[BaseConfig] = None):
        super().__init__(name="my_custom_metric")

    def evaluate(self, dataset: list[EvalData]):
        score = 0.0
        # write your evaluation logic here
        return score
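Once defined, your custom metric can be run on the dataset built earlier in the same way as the built-in metrics:

```python
metric = MyCustomMetric()
score = metric.evaluate(dataset)  # 'dataset' from the create dataset section above
print(score)
```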