Benchmarking RAG Pipelines With A `LabelledRagDataset`

The LabelledRagDataset is meant to be used for evaluating any given RAG pipeline, for which there could be several configurations (e.g. the choice of LLM, or the values of similarity_top_k, chunk_size, and others). We've likened this abstraction to traditional machine learning datasets, where X features are meant to predict a ground-truth label y. In this case, we use the query as well as the retrieved contexts as the "features" and the answer to the query, called reference_answer, as the ground-truth label.
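To make the analogy concrete, here is a minimal standard-library sketch of how one example splits into features and a label. The `RagExampleSketch` class and its `features`/`label` methods are hypothetical stand-ins for illustration only; they are not part of llama-index.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RagExampleSketch:
    """Hypothetical stand-in for one labelled example (not the llama-index class)."""

    query: str
    reference_contexts: List[str]
    reference_answer: str

    def features(self) -> Tuple[str, List[str]]:
        # "X": the query together with the contexts retrieved for it
        return self.query, self.reference_contexts

    def label(self) -> str:
        # "y": the ground-truth answer to the query
        return self.reference_answer


example = RagExampleSketch(
    query="This is a test query, is it not?",
    reference_contexts=["This is a sample context"],
    reference_answer="Yes it is.",
)

X, y = example.features(), example.label()
print(X)
print(y)
```

A RAG pipeline under evaluation plays the role of the model: it receives the query, retrieves its own contexts, and produces an answer that can be compared against `y`.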

As with any dataset, a LabelledRagDataset is composed of observations, or examples. Here, each example is a LabelledRagDataExample.

In this notebook, we will show how one can construct a LabelledRagDataset from scratch. Please note that the alternative would be to simply download a community-supplied LabelledRagDataset from llama-hub and evaluate/benchmark your own RAG pipeline on it.

The `LabelledRagDataExample` Class

python

%pip install llama-index-llms-openai
%pip install llama-index-readers-wikipedia

python

from llama_index.core.llama_dataset import (
    LabelledRagDataExample,
    CreatedByType,
    CreatedBy,
)

# constructing a LabelledRagDataExample
query = "This is a test query, is it not?"
query_by = CreatedBy(type=CreatedByType.AI, model_name="gpt-4")
reference_answer = "Yes it is."
reference_answer_by = CreatedBy(type=CreatedByType.HUMAN)
reference_contexts = ["This is a sample context"]

rag_example = LabelledRagDataExample(
    query=query,
    query_by=query_by,
    reference_contexts=reference_contexts,
    reference_answer=reference_answer,
    reference_answer_by=reference_answer_by,
)

The LabelledRagDataExample is a Pydantic model, so converting it to and from JSON or a dict is possible.

python

print(rag_example.json())

python

LabelledRagDataExample.parse_raw(rag_example.json())

python

rag_example.dict()

python

LabelledRagDataExample.parse_obj(rag_example.dict())

Let's create a second example, so we can have a (slightly) more interesting LabelledRagDataset.

python

query = "This is a test query, is it so?"
reference_answer = "I think yes, it is."
reference_contexts = ["This is a second sample context"]

rag_example_2 = LabelledRagDataExample(
    query=query,
    query_by=query_by,
    reference_contexts=reference_contexts,
    reference_answer=reference_answer,
    reference_answer_by=reference_answer_by,
)

The `LabelledRagDataset` Class

python

from llama_index.core.llama_dataset import LabelledRagDataset

rag_dataset = LabelledRagDataset(examples=[rag_example, rag_example_2])

There is a convenience method to view the dataset as a pandas.DataFrame.

python

rag_dataset.to_pandas()

Serialization

To persist and load the dataset to and from disk, there are the save_json and from_json methods.

python

rag_dataset.save_json("rag_dataset.json")

python

reload_rag_dataset = LabelledRagDataset.from_json("rag_dataset.json")

python

reload_rag_dataset.to_pandas()
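The save/reload round trip can be pictured with a standard-library sketch. Everything here is an assumption made for illustration: `ExampleSketch`, `save_examples`, `load_examples`, and the `{"examples": [...]}` file layout are hypothetical and are not the llama-index implementation.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class ExampleSketch:
    """Hypothetical stand-in for LabelledRagDataExample."""

    query: str
    reference_contexts: List[str]
    reference_answer: str


def save_examples(examples: List[ExampleSketch], path: str) -> None:
    # Assumed layout: a single top-level "examples" list of plain dicts.
    with open(path, "w") as f:
        json.dump({"examples": [asdict(e) for e in examples]}, f, indent=4)


def load_examples(path: str) -> List[ExampleSketch]:
    with open(path) as f:
        data = json.load(f)
    return [ExampleSketch(**e) for e in data["examples"]]


examples = [
    ExampleSketch(
        "This is a test query, is it not?",
        ["This is a sample context"],
        "Yes it is.",
    ),
    ExampleSketch(
        "This is a test query, is it so?",
        ["This is a second sample context"],
        "I think yes, it is.",
    ),
]

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rag_dataset_sketch.json")
    save_examples(examples, path)
    reloaded = load_examples(path)

# Dataclasses compare field by field, so equality means nothing was lost.
round_trip_ok = reloaded == examples
print(round_trip_ok)
```

The same kind of check is a reasonable sanity test for a real dataset: compare the reloaded dataset against the original before relying on the file.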

Building a synthetic `LabelledRagDataset` over Wikipedia

For this section, we'll create a LabelledRagDataset using a synthetic generator. An OpenAI LLM (gpt-3.5-turbo in the code below) produces both the query and the reference_answer for each synthetic LabelledRagDataExample.

NOTE: if one has queries, reference answers, and contexts over a text corpus, then it is not necessary to use data synthesis to be able to predict and subsequently evaluate said predictions.

python

import nest_asyncio

nest_asyncio.apply()

python

!pip install wikipedia -q

python

# wikipedia pages
from llama_index.readers.wikipedia import WikipediaReader
from llama_index.core import VectorStoreIndex

cities = [
    "San Francisco",
]

documents = WikipediaReader().load_data(
    pages=[f"History of {x}" for x in cities]
)
index = VectorStoreIndex.from_documents(documents)

The RagDatasetGenerator can be built over a set of documents to generate LabelledRagDataExample objects.

python

# generate questions against chunks
from llama_index.core.llama_dataset.generator import RagDatasetGenerator
from llama_index.llms.openai import OpenAI

# configure the LLM used to generate questions and reference answers
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.3)

# instantiate a RagDatasetGenerator
dataset_generator = RagDatasetGenerator.from_documents(
    documents,
    llm=llm,
    num_questions_per_chunk=2,  # number of questions to generate per node
    show_progress=True,
)

python

len(dataset_generator.nodes)

python

# since there are 13 nodes, there should be a total of 26 questions
rag_dataset = dataset_generator.generate_dataset_from_nodes()
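The expected size quoted in the comment above is simple arithmetic: nodes times questions per node. A small sketch follows; the helper name `expected_example_count` is hypothetical. An LLM is not guaranteed to return exactly the requested number of questions for every node, so treat the result as a target rather than a guarantee.

```python
def expected_example_count(num_nodes: int, num_questions_per_chunk: int) -> int:
    """Target number of generated examples: one batch of questions per node."""
    return num_nodes * num_questions_per_chunk


# Values taken from the text above: 13 nodes, 2 questions per node.
expected = expected_example_count(num_nodes=13, num_questions_per_chunk=2)
print(expected)  # 26
```

Comparing this number with the row count of `rag_dataset.to_pandas()` is a quick way to spot nodes for which generation came up short.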

python

rag_dataset.to_pandas()

python

rag_dataset.save_json("rag_dataset.json")
