
LLM RAG Evaluation with MLflow Example Notebook

In this notebook, we will demonstrate how to evaluate a RAG system with MLflow.

We need to set our OpenAI API key, since we will be using GPT-4 for our LLM-judged metrics.

To set your private key safely, either export the key from a command-line terminal for the current session, or, to make it available in all future sessions, add the following entry to your preferred shell configuration file (e.g., .bashrc, .zshrc):

OPENAI_API_KEY=<your openai API key>
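As a quick sanity check (a minimal sketch, not part of the original notebook), you can assert from Python that the key is visible before running the rest of the cells:

python
import os

# Fail fast if the key is missing, so later OpenAI calls don't fail midway through.
assert "OPENAI_API_KEY" in os.environ, "Please set the OPENAI_API_KEY environment variable."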

python
import pandas as pd

import mlflow

Create a RAG system

Use LangChain and Chroma to create a RAG system that answers questions based on the MLflow documentation.

python
from langchain.chains import RetrievalQA
from langchain.document_loaders import WebBaseLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
python
loader = WebBaseLoader("https://mlflow.org/docs/latest/index.html")

documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
docsearch = Chroma.from_documents(texts, embeddings)

qa = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
    return_source_documents=True,
)
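
Before wiring the chain into evaluation, it can help to sanity-check what a single call returns; the sketch below assumes the chain is invoked directly with a question string, just as the evaluation function does later:

python
# With return_source_documents=True, the chain returns a dict containing the
# generated answer under "result" and the retrieved chunks under "source_documents".
sample = qa("What is MLflow?")
print(sample["result"])
print(len(sample["source_documents"]), "source documents retrieved")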

Evaluate the RAG system using mlflow.evaluate()

Create a simple function that runs each input through the RAG chain.

python
def model(input_df):
    answer = []
    for index, row in input_df.iterrows():
        # Each call returns a dict with "result" and "source_documents" keys.
        answer.append(qa(row["questions"]))

    return answer

Create an eval dataset

python
eval_df = pd.DataFrame({
    "questions": [
        "What is MLflow?",
        "How to run mlflow.evaluate()?",
        "How to log_table()?",
        "How to load_table()?",
    ],
})

Create a faithfulness metric

python
from mlflow.metrics.genai.metric_definitions import faithfulness

faithfulness_metric = faithfulness(model="openai:/gpt-4")
python
results = mlflow.evaluate(
    model,
    eval_df,
    model_type="question-answering",
    evaluators="default",
    predictions="result",
    extra_metrics=[faithfulness_metric, mlflow.metrics.latency()],
    evaluator_config={
        "col_mapping": {
            "inputs": "questions",
            "context": "source_documents",
        }
    },
)
print(results.metrics)
python
results.tables["eval_results_table"]
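
To inspect just the judge's per-row output rather than the full table, you can select the faithfulness columns (a sketch; column names such as faithfulness/v1/score follow MLflow's metric/version/field naming convention and may vary across MLflow versions):

python
eval_table = results.tables["eval_results_table"]

# Assumes MLflow's "<metric>/<version>/<field>" column naming for LLM-judged metrics.
print(eval_table[["questions", "faithfulness/v1/score", "faithfulness/v1/justification"]])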