examples/evaluation/rag-evaluation.ipynb
In this notebook, we will demonstrate how to evaluate a RAG system with MLflow.
We need to set our OpenAI API key, since we will be using GPT-4 for our LLM-judged metrics.
To set your API key safely, either export it in your terminal for the current session, or, to make it available in every session, add the following entry to your shell configuration file (e.g., .bashrc, .zshrc):
OPENAI_API_KEY=<your openai API key>
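If you prefer to provide the key inside the notebook session instead (for example during local experimentation), one option is to assign it to os.environ before building the chain. The getpass-based cell below is only an illustrative sketch; it assumes the variable has not already been exported in your shell.
import os
from getpass import getpass

# Prompt for the key so it never appears in the notebook source or output.
# Skip this cell if OPENAI_API_KEY is already set in your environment.
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")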
import pandas as pd
import mlflow
Use LangChain and Chroma to create a RAG system that answers questions based on the MLflow documentation.
from langchain.chains import RetrievalQA
from langchain.document_loaders import WebBaseLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
loader = WebBaseLoader("https://mlflow.org/docs/latest/index.html")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
docsearch = Chroma.from_documents(texts, embeddings)
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
    return_source_documents=True,
)
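Before evaluating, it can help to run a quick sanity check on the chain. Because return_source_documents=True, each call returns a dict that (in this LangChain version) contains "result" and "source_documents" keys; these are the same keys that mlflow.evaluate() is mapped to below. The cell below is just an illustrative check with a sample question.
# Sanity check: the chain output is a dict; "result" holds the answer and
# "source_documents" holds the retrieved chunks used as context.
sample = qa("What is MLflow?")
print(sample["result"])
print(f"{len(sample['source_documents'])} source documents retrieved")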
Evaluate the RAG system with mlflow.evaluate()
Create a simple function that runs each input through the RAG chain.
def model(input_df):
    answer = []
    for index, row in input_df.iterrows():
        # Each call returns a dict; mlflow.evaluate() reads the answer from
        # its "result" key and the retrieved context from "source_documents".
        answer.append(qa(row["questions"]))
    return answer
Create an eval dataset
eval_df = pd.DataFrame({
    "questions": [
        "What is MLflow?",
        "How to run mlflow.evaluate()?",
        "How to log_table()?",
        "How to load_table()?",
    ],
})
Create a faithfulness metric
from mlflow.metrics.genai.metric_definitions import faithfulness
faithfulness_metric = faithfulness(model="openai:/gpt-4")
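You can print the returned metric object to confirm how it was configured; depending on your MLflow version, its printed form includes the metric name and the grading instructions sent to GPT-4.
# Inspect the metric definition before running the evaluation.
print(faithfulness_metric)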
results = mlflow.evaluate(
    model,
    eval_df,
    model_type="question-answering",
    evaluators="default",
    predictions="result",
    extra_metrics=[faithfulness_metric, mlflow.metrics.latency()],
    evaluator_config={
        "col_mapping": {
            "inputs": "questions",
            "context": "source_documents",
        }
    },
)
print(results.metrics)
results.tables["eval_results_table"]
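To inspect individual rows (for example, to read the judge's justification behind a low faithfulness score), work with the returned DataFrame directly. The exact per-metric column names (such as "faithfulness/v1/score") vary by MLflow version, so listing the columns first is a safe way to discover them.
eval_table = results.tables["eval_results_table"]
# List the available columns before selecting specific score or
# justification columns, since their names depend on the MLflow version.
print(eval_table.columns.tolist())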