
Retriever Evaluation Tutorial

examples/llms/RAG/retriever-evaluation-tutorial.ipynb


In MLflow 2.8.0, we introduced a new model type, "retriever", to the mlflow.evaluate() API. It helps you evaluate the retriever in a RAG application and comes with two built-in metrics, precision_at_k and recall_at_k. MLflow 2.9.0 adds ndcg_at_k.

This notebook illustrates how to use mlflow.evaluate() to evaluate the retriever in a RAG application. It has the following steps:

  • Step 1: Install and Load Packages
  • Step 2: Evaluation Dataset Preparation
  • Step 3: Calling mlflow.evaluate()
  • Step 4: Result Analysis and Visualization

Step 1: Install and Load Packages

python
%pip install mlflow==2.9.0 langchain==0.0.339 openai faiss-cpu gensim nltk pyLDAvis tiktoken
python
import ast
import os
import pprint

import pandas as pd
from langchain.docstore.document import Document
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS

import mlflow

os.environ["OPENAI_API_KEY"] = "<redacted>"

CHUNK_SIZE = 1000

# Assume running from https://github.com/mlflow/mlflow/blob/master/examples/llms/rag
OUTPUT_DF_PATH = "question_answer_source.csv"
SCRAPPED_DOCS_PATH = "mlflow_docs_scraped.csv"
EVALUATION_DATASET_PATH = "static_evaluation_dataset.csv"
DB_PERSIST_DIR = "faiss_index"

Step 2: Evaluation Dataset Preparation

The evaluation dataset should contain three columns: questions, ground-truth doc IDs, and retrieved relevant doc IDs. A "doc ID" is a unique string identifier for a document in your RAG application. For example, it could be the URL of a documentation web page, or the file path of a PDF document.

If you already have a list of questions you would like to evaluate, see Manual Preparation below. If you do not have a question list yet, see Generate the Evaluation Dataset.

Manual Preparation

When evaluating a retriever, it's recommended to save the retrieved document IDs into a static dataset represented by a Pandas Dataframe or an MLflow Pandas Dataset containing the input queries, retrieved relevant document IDs, and the ground-truth document IDs for the evaluation.

Concepts

A "document ID" is a string that identifies a document.

A list of "retrieved relevant document IDs" are the output of the retriever for a specific input query and a k value.

A list of "ground-truth document IDs" are the labeled relevant documents for a specific input query.

Expected Data Format

For each row, the retrieved relevant document IDs and the ground-truth relevant document IDs should be provided as a tuple (or list) of document ID strings.

The column name of the retrieved relevant document IDs should be specified by the predictions parameter, and the column name of the ground-truth relevant document IDs should be specified by the targets parameter.

Here is a simple example dataset that illustrates the expected data format. The doc IDs are the paths of the documentation pages.

python
data = pd.DataFrame({
    "questions": [
        "What is MLflow?",
        "What is Databricks?",
        "How to serve a model on Databricks?",
        "How to enable MLflow Autologging for my workspace by default?",
    ],
    "retrieved_context": [
        [
            "mlflow/index.html",
            "mlflow/quick-start.html",
        ],
        [
            "introduction/index.html",
            "getting-started/overview.html",
        ],
        [
            "machine-learning/model-serving/index.html",
            "machine-learning/model-serving/model-serving-intro.html",
        ],
        [],
    ],
    "ground_truth_context": [
        ["mlflow/index.html"],
        ["introduction/index.html"],
        [
            "machine-learning/model-serving/index.html",
            "machine-learning/model-serving/llm-optimized-model-serving.html",
        ],
        ["mlflow/databricks-autologging.html"],
    ],
})

Generate the Evaluation Dataset

There are two steps to generating the evaluation dataset: generating questions with ground-truth doc IDs, and collecting the retrieved relevant doc IDs.

Generate Questions with Ground Truth Doc IDs

If you don't have a list of questions to evaluate, you can generate them with LLMs. The Question Generation Notebook shows one way to do this. Below, we load the result of running that notebook.

python
generated_df = pd.read_csv(OUTPUT_DF_PATH)
python
generated_df.head(3)
python
# Prepare dataframe `data` with the required format
data = pd.DataFrame({})
data["question"] = generated_df["question"].copy(deep=True)
data["source"] = generated_df["source"].apply(lambda x: [x])
data.head(3)

Retrieve Relevant Doc IDs

Once we have a list of questions with ground-truth doc IDs from the previous step, we can collect the retrieved relevant doc IDs. In this tutorial, we use a LangChain retriever. You can plug in your own retriever as needed.

First, we build a FAISS retriever from the docs saved at https://github.com/mlflow/mlflow/blob/master/examples/llms/question_generation/mlflow_docs_scraped.csv. See the Question Generation Notebook for how to create this CSV file.

python
embeddings = OpenAIEmbeddings()
python
scrapped_df = pd.read_csv(SCRAPPED_DOCS_PATH)
list_of_documents = [
    Document(page_content=row["text"], metadata={"source": row["source"]})
    for _, row in scrapped_df.iterrows()
]
text_splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=0)
docs = text_splitter.split_documents(list_of_documents)
db = FAISS.from_documents(docs, embeddings)

# Save the db to local disk
db.save_local(DB_PERSIST_DIR)
python
# Load the db from local disk
db = FAISS.load_local(DB_PERSIST_DIR, embeddings)
retriever = db.as_retriever()
python
# Test the retriever with a query
retrieved_docs = retriever.get_relevant_documents(
    "What is the purpose of the MLflow Model Registry?"
)
len(retrieved_docs)

After building a retriever, we define a function that takes a question string as input and returns a list of relevant doc ID strings.

python
# Define a function to return a list of retrieved doc ids
def retrieve_doc_ids(question: str) -> list[str]:
    docs = retriever.get_relevant_documents(question)
    return [doc.metadata["source"] for doc in docs]

We can store the retrieved doc IDs in the dataframe as a column "retrieved_doc_ids".

python
data["retrieved_doc_ids"] = data["question"].apply(retrieve_doc_ids)
data.head(3)
python
# Persist the static evaluation dataset to disk
data.to_csv(EVALUATION_DATASET_PATH, index=False)
python
# Load the static evaluation dataset from disk and deserialize the source and retrieved doc ids
data = pd.read_csv(EVALUATION_DATASET_PATH)
data["source"] = data["source"].apply(ast.literal_eval)
data["retrieved_doc_ids"] = data["retrieved_doc_ids"].apply(ast.literal_eval)
data.head(3)

Step 3: Calling mlflow.evaluate()

Metrics Definition

Three built-in metrics are provided for the "retriever" model type. Click a metric name below to see its definition.

  1. mlflow.metrics.precision_at_k(k)
  2. mlflow.metrics.recall_at_k(k)
  3. mlflow.metrics.ndcg_at_k(k)

All metrics compute a score between 0 and 1 for each row representing the corresponding metric of the retriever model at the given k value.

The k parameter should be a positive integer representing the number of retrieved documents to evaluate for each row. k defaults to 3.

When the model type is "retriever", these metrics will be calculated automatically with the default k value of 3.
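To make the definitions concrete, here is a plain-Python sketch of binary-relevance precision and recall at k. This is a simplified approximation for intuition only, not the mlflow implementation: the function names are mine, and the corner cases described later in this tutorial (empty lists) are handled only minimally.

```python
def precision_at_k(retrieved, ground_truth, k):
    # Fraction of the top-k retrieved doc IDs that appear in the ground truth.
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    relevant = set(ground_truth)
    return sum(doc in relevant for doc in top_k) / len(top_k)


def recall_at_k(retrieved, ground_truth, k):
    # Fraction of the ground-truth doc IDs found among the top-k retrieved.
    if not ground_truth:
        return 0.0
    hits = set(retrieved[:k]) & set(ground_truth)
    return len(hits) / len(set(ground_truth))


# Two of the three docs retrieved at k=3 are relevant, and two of the
# three ground-truth docs were found:
p = precision_at_k(["a", "b", "c"], ["a", "c", "d"], k=3)  # 2/3
r = recall_at_k(["a", "b", "c"], ["a", "c", "d"], k=3)  # 2/3
```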

Basic usage

There are two supported ways to specify the retriever's output:

  • Case 1: Save the retriever's output to a static evaluation dataset
  • Case 2: Wrap the retriever in a function
python
# Case 1: Evaluating a static evaluation dataset
with mlflow.start_run() as run:
    evaluate_results = mlflow.evaluate(
        data=data,
        model_type="retriever",
        targets="source",
        predictions="retrieved_doc_ids",
        evaluators="default",
    )
python
question_source_df = data[["question", "source"]]
question_source_df.head(3)
python
# Case 2: Evaluating a function
def retriever_model_function(question_df: pd.DataFrame) -> pd.Series:
    return question_df["question"].apply(retrieve_doc_ids)


with mlflow.start_run() as run:
    evaluate_results = mlflow.evaluate(
        model=retriever_model_function,
        data=question_source_df,
        model_type="retriever",
        targets="source",
        evaluators="default",
    )
python
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(evaluate_results.metrics)

Try different k values

To use another k value, use the evaluator_config parameter in the mlflow.evaluate() API as follows: evaluator_config={"retriever_k": <k_value>}.

python
# Case 1: Specifying the model type
evaluate_results = mlflow.evaluate(
    data=data,
    model_type="retriever",
    targets="ground_truth_context",
    predictions="retrieved_context",
    evaluators="default",
    evaluator_config={"retriever_k": 5}
  )

Alternatively, you can directly specify the desired metrics in the extra_metrics parameter of the mlflow.evaluate() API without specifying a model type. In this case, the k value specified in the evaluator_config parameter will be ignored.

python
# Case 2: Specifying the extra_metrics
evaluate_results = mlflow.evaluate(
    data=data,
    targets="ground_truth_context",
    predictions="retrieved_context",
    extra_metrics=[
        mlflow.metrics.precision_at_k(4),
        mlflow.metrics.precision_at_k(5),
    ],
)
python
with mlflow.start_run() as run:
    evaluate_results = mlflow.evaluate(
        data=data,
        targets="source",
        predictions="retrieved_doc_ids",
        evaluators="default",
        extra_metrics=[
            mlflow.metrics.precision_at_k(1),
            mlflow.metrics.precision_at_k(2),
            mlflow.metrics.precision_at_k(3),
            mlflow.metrics.recall_at_k(1),
            mlflow.metrics.recall_at_k(2),
            mlflow.metrics.recall_at_k(3),
            mlflow.metrics.ndcg_at_k(1),
            mlflow.metrics.ndcg_at_k(2),
            mlflow.metrics.ndcg_at_k(3),
        ],
    )
python
import matplotlib.pyplot as plt

# Plotting each metric
for metric_name in ["precision", "recall", "ndcg"]:
    y = [evaluate_results.metrics[f"{metric_name}_at_{k}/mean"] for k in range(1, 4)]
    plt.plot([1, 2, 3], y, label=f"{metric_name}@k")

# Adding labels and title
plt.xlabel("k")
plt.ylabel("Metric Value")
plt.title("Metrics Comparison at Different Ks")
# Setting x-axis ticks
plt.xticks([1, 2, 3])
plt.legend()

# Display the plot
plt.show()

Corner case handling

A few corner cases are handled specially for each built-in metric.

Empty retrieved document IDs

When no relevant docs are retrieved:

  • mlflow.metrics.precision_at_k(k) is defined as:

    • 0 if the list of ground-truth doc IDs is non-empty
    • 1 if the list of ground-truth doc IDs is also empty
  • mlflow.metrics.ndcg_at_k(k) is defined as:

    • 0 if the list of ground-truth doc IDs is non-empty
    • 1 if the list of ground-truth doc IDs is also empty

Empty ground-truth document IDs

When no ground-truth document IDs are provided:

  • mlflow.metrics.recall_at_k(k) is defined as:

    • 0 if the list of retrieved doc IDs is non-empty
    • 1 if the list of retrieved doc IDs is also empty
  • mlflow.metrics.ndcg_at_k(k) is defined as:

    • 0 if the list of retrieved doc IDs is non-empty
    • 1 if the list of retrieved doc IDs is also empty

Duplicate retrieved document IDs

It is common for the retriever in a RAG system to retrieve multiple chunks from the same document for a given query. In this case, mlflow.metrics.ndcg_at_k(k) is calculated as follows:

If the duplicate doc IDs are in the ground truth, each duplicate is treated as a distinct relevant doc. For example, if the ground-truth doc IDs are [1, 2] and the retrieved doc IDs are [1, 1, 1, 3], the score is equivalent to that of ground-truth doc IDs [10, 11, 12, 2] and retrieved doc IDs [10, 11, 12, 3].

If the duplicate doc IDs are not in the ground truth, the NDCG score is calculated as usual.
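To see how the duplicate handling plays out, here is a hand-rolled sketch of a binary-relevance NDCG computation for the example above. It assumes standard log2 position discounting and the remapping described in the text; it is an illustration of the idea, not the mlflow implementation.

```python
import math


def dcg(relevances):
    # Discounted cumulative gain: position i is discounted by log2(i + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


# Example from the text: ground truth [1, 2], retrieved [1, 1, 1, 3].
# Each duplicate of a ground-truth doc counts as a distinct hit, so the
# relevance vector of the retrieved list is [1, 1, 1, 0].
gains = [1, 1, 1, 0]

# After the remapping there are four relevant docs ([10, 11, 12, 2]) but
# only four scored positions, so the ideal relevance vector is all ones.
ideal_gains = [1, 1, 1, 1]

ndcg = dcg(gains) / dcg(ideal_gains)  # ~0.832 under this reading
```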

Step 4: Result Analysis and Visualization

You can view the per-row scores in the logged "eval_results_table.json" artifact, either by loading it into a pandas DataFrame (shown below) or by visiting the MLflow run comparison UI.

python
eval_results_table = evaluate_results.tables["eval_results_table"]
eval_results_table.head(5)

With the evaluation results table, you can further analyze which questions were answered well and which were answered poorly, using topic-analysis techniques.

python
import nltk
import pyLDAvis.gensim_models as gensimvis
from gensim import corpora, models
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Initialize NLTK resources
nltk.download("punkt")
nltk.download("stopwords")


def topical_analysis(questions: list[str]):
    stop_words = set(stopwords.words("english"))

    # Tokenize and remove stop words
    tokenized_data = []
    for question in questions:
        tokens = word_tokenize(question.lower())
        filtered_tokens = [word for word in tokens if word not in stop_words and word.isalpha()]
        tokenized_data.append(filtered_tokens)

    # Create a dictionary and corpus
    dictionary = corpora.Dictionary(tokenized_data)
    corpus = [dictionary.doc2bow(text) for text in tokenized_data]

    # Apply LDA model
    lda_model = models.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=15)

    # Get topic distribution for each question
    topic_distribution = []
    for i, ques in enumerate(questions):
        bow = dictionary.doc2bow(tokenized_data[i])
        topics = lda_model.get_document_topics(bow)
        topic_distribution.append(topics)
        print(f"Question: {ques}\nTopic: {topics}")

    # Print all topics
    print("\nTopics found are:")
    for idx, topic in lda_model.print_topics(-1):
        print(f"Topic: {idx} \nWords: {topic}\n")
    return lda_model, corpus, dictionary
python
filtered_df = eval_results_table[eval_results_table["precision_at_1/score"] == 1]
hit_questions = filtered_df["question"].tolist()
filtered_df = eval_results_table[eval_results_table["precision_at_1/score"] == 0]
miss_questions = filtered_df["question"].tolist()
python
lda_model, corpus, dictionary = topical_analysis(hit_questions)
vis_data = gensimvis.prepare(lda_model, corpus, dictionary)
python
# Uncomment the following line to render the interactive widget
# pyLDAvis.display(vis_data)
python
lda_model, corpus, dictionary = topical_analysis(miss_questions)
vis_data = gensimvis.prepare(lda_model, corpus, dictionary)
python
# Uncomment the following line to render the interactive widget
# pyLDAvis.display(vis_data)