examples/evaluation/huggingface-evaluation.ipynb
This guide shows how to load a pre-trained Hugging Face pipeline, log it to MLflow, and use mlflow.evaluate() to evaluate the model with builtin metrics as well as custom LLM-judged metrics.
For detailed information, please read the documentation on using MLflow evaluate.
To view the evaluation results in the MLflow UI, you can run mlflow server within the same directory that your notebook is in.
%pip install -q mlflow transformers torch torchvision evaluate datasets openai tiktoken fastapi rouge_score textstat
# Necessary imports
import warnings
import pandas as pd
from datasets import load_dataset
from transformers import pipeline
import mlflow
from mlflow.metrics.genai import EvaluationExample, answer_correctness, make_genai_metric
# Disable FutureWarnings
warnings.filterwarnings("ignore", category=FutureWarning)
Here we are loading a text generation pipeline, but you can also use either a text summarization or question answering pipeline.
mpt_pipeline = pipeline("text-generation", model="mosaicml/mpt-7b-chat")
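For example, if you wanted to evaluate a summarization model instead, you could load a summarization pipeline in the same way (the model name below is only an illustrative choice, not one used elsewhere in this guide):
# Hypothetical alternative: a summarization pipeline; the model choice is illustrative
summarization_pipeline = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")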
We log our pipeline as an MLflow Model, which follows a standard format that lets you save a model in different "flavors" that can be understood by different downstream tools. In this case, the model is of the transformers "flavor".
mlflow.set_experiment("Evaluate Hugging Face Text Pipeline")
# Define the signature
signature = mlflow.models.infer_signature(
    model_input="What are the three primary colors?",
    model_output="The three primary colors are red, yellow, and blue.",
)
# Log the model using mlflow
with mlflow.start_run():
    model_info = mlflow.transformers.log_model(
        transformers_model=mpt_pipeline,
        name="mpt-7b",
        signature=signature,
        registered_model_name="mpt-7b-chat",
    )
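As an optional sanity check (not part of the evaluation workflow itself), you can load the logged model back as a generic pyfunc model and run a quick prediction before evaluating it:
# Optional: reload the logged model and try a single prediction
loaded_model = mlflow.pyfunc.load_model(model_info.model_uri)
print(loaded_model.predict(["What are the three primary colors?"]))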
Load a dataset from the Hugging Face Hub to use for evaluation.
dataset = load_dataset("tatsu-lab/alpaca")
eval_df = pd.DataFrame(dataset["train"])
eval_df.head(10)
Since we are evaluating how well our model can provide an answer to a given instruction, we may want to choose some metrics to help measure this on top of any builtin metrics that mlflow.evaluate() gives us.
Let's measure how well our model is doing on the following two metrics:
- Answer correctness: we can use the predefined metric answer_correctness here.
- Answer quality: we will create a custom LLM-judged metric named answer_quality to measure this.
We will need to pass both of these into the extra_metrics argument for mlflow.evaluate().
Let's load our predefined metric - in this case we are using answer_correctness with GPT-4.
answer_correctness_metric = answer_correctness(model="openai:/gpt-4")
Now we want to create a custom LLM-judged metric named answer_quality using make_genai_metric(). We need to define a metric definition and grading rubric, as well as some examples for the LLM judge to use.
# The definition explains what "answer quality" entails
answer_quality_definition = """Please evaluate answer quality for the provided output on the following criteria:
fluency, clarity, and conciseness. Each of the criteria is defined as follows:
- Fluency measures how naturally and smoothly the output reads.
- Clarity measures how understandable the output is.
- Conciseness measures the brevity and efficiency of the output without compromising meaning.
The more fluent, clear, and concise a text, the higher the score it deserves.
"""
# The grading prompt explains what each possible score means
answer_quality_grading_prompt = """Answer quality: Below are the details for different scores:
- Score 1: The output is entirely incomprehensible and cannot be read.
- Score 2: The output conveys some meaning, but needs significant improvement in fluency, clarity, and conciseness.
- Score 3: The output is understandable but still needs improvement.
- Score 4: The output performs well on two of fluency, clarity, and conciseness, but could be improved on one of these criteria.
- Score 5: The output reads smoothly, is easy to understand, and is clear. There is no clear way to improve the output on these criteria.
"""
# We provide an example of a "bad" output
example1 = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform. For managing machine learning workflows, it "
    "including experiment tracking model packaging versioning and deployment as well as a platform "
    "simplifying for on the ML lifecycle.",
    score=2,
    justification="The output is difficult to understand and demonstrates extremely low clarity. "
    "However, it still conveys some meaning so this output deserves a score of 2.",
)
# We also provide an example of a "good" output
example2 = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform for managing machine learning workflows, including "
    "experiment tracking, model packaging, versioning, and deployment.",
    score=5,
    justification="The output is easily understandable, clear, and concise. It deserves a score of 5.",
)
answer_quality_metric = make_genai_metric(
    name="answer_quality",
    definition=answer_quality_definition,
    grading_prompt=answer_quality_grading_prompt,
    version="v1",
    examples=[example1, example2],
    model="openai:/gpt-4",
    greater_is_better=True,
)
We need to set our OpenAI API key, since we are using GPT-4 for our LLM-judged metrics.
To set your key safely, either export it from the command line for your current terminal session, or, to make it available in all future sessions, add the following entry to your shell configuration file (e.g., .bashrc, .zshrc):
OPENAI_API_KEY=<your openai API key>
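Alternatively, for a notebook-only session you can set the key from Python before calling mlflow.evaluate() (keep real keys out of version control; the placeholder below is just that):
import os
os.environ["OPENAI_API_KEY"] = "<your openai API key>"  # placeholder - applies to this session only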
Now, we can call mlflow.evaluate(). Just to test it out, let's use the first 10 rows of the data. Using the "text" model type, toxicity and readability metrics are calculated as builtin metrics. We also pass in the two metrics we defined above into the extra_metrics parameter to be evaluated.
with mlflow.start_run():
    results = mlflow.evaluate(
        model_info.model_uri,
        eval_df.head(10),
        evaluators="default",
        model_type="text",
        targets="output",
        extra_metrics=[answer_correctness_metric, answer_quality_metric],
        evaluator_config={"col_mapping": {"inputs": "instruction"}},
    )
results.metrics is a dictionary with the aggregate values for all the metrics calculated. Refer here for details on the builtin metrics for each model type.
results.metrics
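For instance, you can pull out a single aggregate value by indexing into the dictionary. The exact key names (such as the answer_correctness/v1/mean key used below) may vary with your MLflow version, so inspect results.metrics first:
# The key name below is an assumption; check results.metrics for the exact keys in your version
print(results.metrics.get("answer_correctness/v1/mean"))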
We can also view the eval_results_table, which shows us the metrics for each row of data.
results.tables["eval_results_table"]
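If you want to find the rows that scored lowest on the custom metric, you can sort the table by its score column. The column name used below, answer_quality/v1/score, is an assumption - list the table's columns to confirm the exact name:
eval_table = results.tables["eval_results_table"]
# Column name is an assumption; print(eval_table.columns) to confirm
print(eval_table.sort_values("answer_quality/v1/score").head())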
Finally, we can view our evaluation results in the MLflow UI. We can select our experiment on the left sidebar, which will bring us to the following page. We can see that one run logged our model "mpt-7b-chat", and the other run has the dataset we evaluated.
We click on the Evaluation tab and hide any irrelevant runs.
We can now choose what columns we want to group by, as well as which column we want to compare. In the following example, we are looking at the score for answer correctness for each input-output pair, but we could choose any other metric to compare.
Finally, we get to the following view, where we can see the justification and score for answer correctness for each row.