AIMon's LlamaIndex Extension for LLM Response Evaluation

This notebook introduces AIMon's evaluators for the LlamaIndex framework, which are designed to assess the quality and accuracy of responses generated by large language models (LLMs) integrated into LlamaIndex. Below is an overview of all available evaluators:

  • Hallucination Evaluator: Detects when a model generates information not supported by the provided context (hallucinations).
  • Guideline Evaluator: Ensures model responses follow predefined instructions and guidelines.
  • Completeness Evaluator: Checks whether the response fully addresses all aspects of the query or task.
  • Conciseness Evaluator: Evaluates if the response is brief yet complete, avoiding unnecessary verbosity.
  • Toxicity Evaluator: Flags harmful, offensive, or inappropriate language in the response.
  • Context Relevance Evaluator: Assesses the relevance and accuracy of the provided context in supporting the model's response.

In this notebook, we will focus on utilizing the Hallucination Evaluator, Guideline Evaluator, and Context Relevance Evaluator to evaluate your RAG (Retrieval-Augmented Generation) applications.

To learn more about AIMon, check out these resources: Website and Documentation

Prerequisites

Let's get started by installing the dependencies and setting up the API keys.

python
%%capture
!pip install requests datasets aimon-llamaindex llama-index-embeddings-openai llama-index-llms-openai

Configure your OPENAI_API_KEY and AIMON_API_KEY in Google Colab secrets and grant the notebook access to them. We will use OpenAI for the LLM and the embedding model, and AIMon for continuous monitoring of quality issues.

An AIMon API key can be obtained here.

python
import os
import json

# Import Colab Secrets userdata module.
from google.colab import userdata

os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")

Dataset for evaluation

In this example, we will use transcripts from the MeetingBank dataset [1] as our contextual information.

python
%%capture
from datasets import load_dataset

meetingbank = load_dataset("huuuyeah/meetingbank")
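
Before building documents, it helps to peek at the raw data. A quick optional sanity check, using the train split and transcript field that the extraction code below relies on:

python
# Inspect the dataset splits and preview one transcript.
print(meetingbank)
print(meetingbank["train"][0]["transcript"][:200])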

The following function extracts the transcripts and converts them into a list of llama_index.core.Document objects.

python
from llama_index.core import Document


def extract_and_create_documents(transcripts):
    documents = []

    for transcript in transcripts:
        try:
            doc = Document(text=transcript)
            documents.append(doc)

        except Exception as e:
            print(f"Failed to create document")

    return documents


transcripts = [meeting["transcript"] for meeting in meetingbank["train"]]
documents = extract_and_create_documents(
    transcripts[:5]
)  ## Using only 5 transcripts to keep this example fast and concise.

Set up an embedding model. We will be using the text-embedding-3-small model here.

python
from llama_index.embeddings.openai import OpenAIEmbedding

embedding_model = OpenAIEmbedding(
    model="text-embedding-3-small", embed_batch_size=100, max_retries=3
)
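
As an optional sanity check, you can embed a short string and confirm the vector dimensionality (text-embedding-3-small returns 1536-dimensional vectors by default):

python
# Optional: verify the embedding model is reachable and returns vectors.
sample_vector = embedding_model.get_text_embedding("city council meeting")
print(len(sample_vector))  # expected: 1536 for text-embedding-3-small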

Split the documents into nodes and generate their embeddings.

python
from aimon_llamaindex import generate_embeddings_for_docs

nodes = generate_embeddings_for_docs(documents, embedding_model)
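
Optionally, inspect the output. This sketch assumes generate_embeddings_for_docs returns standard LlamaIndex nodes with their embedding field populated:

python
# Optional: confirm nodes were created and embeddings were attached.
print(f"Generated {len(nodes)} nodes")
print(f"Embedding dimension: {len(nodes[0].embedding)}")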

Insert the nodes with embeddings into an in-memory Vector Store Index.

python
from aimon_llamaindex import build_index

index = build_index(nodes)

Instantiate a Vector Index Retriever.

python
from aimon_llamaindex import build_retriever

retriever = build_retriever(index, similarity_top_k=5)
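
You can exercise the retriever on its own before wiring it to the LLM. Assuming build_retriever returns a standard LlamaIndex retriever, the usual retrieve API applies:

python
# Optional: fetch the top-k nodes for a sample query and inspect the scores.
sample_hits = retriever.retrieve("Which council bills were amended?")
for hit in sample_hits:
    print(f"score={hit.score}  text={hit.node.get_text()[:80]!r}")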

Building the LLM Application

Configure the Large Language Model. Here we choose OpenAI's gpt-4o-mini model with a temperature setting of 0.4.

python
## OpenAI's LLM
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="gpt-4o-mini",
    temperature=0.4,
    system_prompt="""
                    Please be professional and polite.
                    Answer the user's question in a single line.
                    Even if the context lacks information to answer the question, make
                    sure that you answer the user's question based on your own knowledge.
                    """,
)
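
As an optional smoke test (not part of the RAG pipeline), you can call the LLM directly using LlamaIndex's standard completion API:

python
# Optional: verify the LLM is configured and reachable.
print(llm.complete("What does a city council do?"))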

Define your query and instructions

python
user_query = "Which council bills were amended for zoning regulations?"
user_instructions = [
    "Keep the response concise, preferably under the 100 word limit."
]

Dynamically extend the LLM's system prompt with the user's instructions defined above.

python
llm.system_prompt += (
    f"Please comply with the following instructions: {user_instructions}."
)
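
Since the instructions are appended as plain text, it is worth printing the combined prompt to verify the result:

python
# Inspect the final system prompt sent to the model.
print(llm.system_prompt)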

Retrieve context and generate a response for the query.

python
from aimon_llamaindex import get_response

llm_response = get_response(user_query, retriever, llm)
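
To see what the model produced, print the response (assuming the object returned by get_response has a readable string representation, as LlamaIndex responses do):

python
# Print the generated answer.
print(llm_response)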

Running Evaluations using AIMon

Configure the AIMon client.

python
from aimon import Client

aimon_client = Client(
    auth_header="Bearer {}".format(userdata.get("AIMON_API_KEY"))
)

Using AIMon’s Instruction Adherence Model (a.k.a. Guideline Evaluator)

This model evaluates if generated text adheres to given instructions, ensuring that LLMs follow the user’s guidelines and intent across various tasks for more accurate and relevant outputs.

python
from aimon_llamaindex.evaluators import GuidelineEvaluator

guideline_evaluator = GuidelineEvaluator(aimon_client)
evaluation_result = guideline_evaluator.evaluate(
    user_query, llm_response, user_instructions
)
python
print(json.dumps(evaluation_result, indent=4))

Using AIMon’s Hallucination Detection Evaluator Model (HDM-2)

AIMon’s HDM-2 detects hallucinated content in LLM outputs. It provides a “hallucination score” (0.0–1.0) quantifying the likelihood of factual inaccuracies or fabricated information, ensuring more reliable and accurate responses.

python
from aimon_llamaindex.evaluators import HallucinationEvaluator

hallucination_evaluator = HallucinationEvaluator(aimon_client)
evaluation_result = hallucination_evaluator.evaluate(user_query, llm_response)
python
## Printing the hallucination evaluation result
print(json.dumps(evaluation_result, indent=4))
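
In practice, you may want to gate responses on this score. A minimal sketch, assuming the result dictionary exposes a top-level hallucination score under a key named "score" (check the payload printed above; the exact schema is defined by AIMon):

python
# Hypothetical gating logic; the "score" key is an assumption, not a documented field.
HALLUCINATION_THRESHOLD = 0.5  # tune for your application
score = evaluation_result.get("score")
if score is not None and score > HALLUCINATION_THRESHOLD:
    print(f"Warning: response may contain hallucinations (score={score})")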

Using AIMon's Context Relevance Evaluator

This evaluator assesses the relevance of the context data used by the LLM to generate the response.

python
from aimon_llamaindex.evaluators import ContextRelevanceEvaluator

evaluator = ContextRelevanceEvaluator(aimon_client)
task_definition = (
    "Find the relevance of the context data used to generate this response."
)
evaluation_result = evaluator.evaluate(
    user_query, llm_response, task_definition
)
python
print(json.dumps(evaluation_result, indent=4))

Conclusion

In this notebook, we built a simple RAG application using the LlamaIndex framework. After retrieving a response to a query, we assessed it with AIMon’s evaluators.

References

[1]. Y. Hu, T. Ganter, H. Deilamsalehy, F. Dernoncourt, H. Foroosh, and F. Liu, "MeetingBank: A Benchmark Dataset for Meeting Summarization," arXiv, May 2023. [Online]. Available: https://arxiv.org/abs/2305.17529. Accessed: Jan. 16, 2025.