Evaluate a 🤗 Hugging Face LLM with mlflow.evaluate()

This guide shows how to load a pre-trained Hugging Face pipeline, log it to MLflow, and use mlflow.evaluate() to evaluate the model with builtin metrics as well as custom LLM-judged metrics.

For detailed information, please read the documentation on using MLflow evaluate.

Start MLflow Server

You can either:

  • Start a local tracking server by running mlflow server from the same directory as your notebook.
  • Use a tracking server, as described in this overview.
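
If you are using a tracking server other than the default local one, point the MLflow client at it before logging or evaluating. The URI below is illustrative; substitute your own.

python
import mlflow

# Illustrative URI for a locally started tracking server; replace with your own.
mlflow.set_tracking_uri("http://127.0.0.1:5000")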

Install necessary dependencies

python
%pip install -q mlflow transformers torch torchvision evaluate datasets openai tiktoken fastapi rouge_score textstat
python
# Necessary imports
import warnings

import pandas as pd
from datasets import load_dataset
from transformers import pipeline

import mlflow
from mlflow.metrics.genai import EvaluationExample, answer_correctness, make_genai_metric
python
# Disable FutureWarnings
warnings.filterwarnings("ignore", category=FutureWarning)

Load a pretrained Hugging Face pipeline

Here we are loading a text generation pipeline, but you can also use either a text summarization or question answering pipeline.

python
mpt_pipeline = pipeline("text-generation", model="mosaicml/mpt-7b-chat")
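
This notebook sticks with the text-generation pipeline above, but if you would rather evaluate a summarization or question-answering model, you could construct a pipeline such as one of the following (the model names are common examples, not requirements):

python
# Illustrative alternatives - any compatible model from the Hugging Face Hub works.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
qa_pipeline = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")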

Log the model using MLflow

We log our pipeline as an MLflow Model, which follows a standard format that lets you save a model in different "flavors" that can be understood by different downstream tools. In this case, the model is of the transformers "flavor".

python
mlflow.set_experiment("Evaluate Hugging Face Text Pipeline")

# Define the signature
signature = mlflow.models.infer_signature(
    model_input="What are the three primary colors?",
    model_output="The three primary colors are red, yellow, and blue.",
)

# Log the model using mlflow
with mlflow.start_run():
    model_info = mlflow.transformers.log_model(
        transformers_model=mpt_pipeline,
        name="mpt-7b",
        signature=signature,
        registered_model_name="mpt-7b-chat",
    )
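
As an optional sanity check, the logged model can be loaded back in its generic pyfunc form and queried with a sample prompt (the prompt below is arbitrary):

python
# Load the logged model as a pyfunc and run a quick test prediction.
loaded_model = mlflow.pyfunc.load_model(model_info.model_uri)
print(loaded_model.predict("What are the three primary colors?"))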

Load Evaluation Data

Load in a dataset from Hugging Face Hub to use for evaluation.

python
dataset = load_dataset("tatsu-lab/alpaca")
eval_df = pd.DataFrame(dataset["train"])
eval_df.head(10)
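
The Alpaca train split contains instruction, input, output, and text columns. Later we map the instruction column to the model inputs and use output as the evaluation target, so it is worth confirming both are present:

python
# Confirm the columns we rely on later ("instruction" and "output") are present.
print(eval_df.columns.tolist())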

Define Metrics

Since we are evaluating how well our model can provide an answer to a given instruction, we may want to choose some metrics to help measure this on top of any builtin metrics that mlflow.evaluate() gives us.

Let's measure how well our model is doing on the following two metrics:

  • Is the answer correct? Let's use the predefined metric answer_correctness here.
  • Is the answer fluent, clear, and concise? We will define a custom metric answer_quality to measure this.

We will need to pass these into the extra_metrics argument of mlflow.evaluate().

What is an evaluation metric?

An evaluation metric encapsulates any quantitative or qualitative measure you want to calculate for your model. For each model type, mlflow.evaluate() automatically calculates a set of builtin metrics; refer to https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate for which builtin metrics are calculated for each model type. You can also pass in any other metrics you want to calculate as extra metrics. MLflow provides a set of predefined metrics (https://mlflow.org/docs/latest/python_api/mlflow.metrics.html), or you can define your own custom metrics. In this example, we will use the predefined metric mlflow.metrics.genai.answer_correctness.

Custom Metrics

To create a custom metric using make_metric, you will need to define an evaluation function that:

  • Takes the parameters predictions, targets, and metrics (which holds builtin metric values).
  • Returns a MetricValue, which has three attributes:
      ◦ scores: a list that contains per-row metric values.
      ◦ justifications: a list that contains per-row justifications for the values in scores. This is optional, and is usually used with GenAI metrics.
      ◦ aggregate_results: a dictionary that maps aggregation method names to the corresponding aggregated values. This is intended to be used to aggregate scores.

Given such an evaluation function, for example one named my_metric_eval_fn, all we need to do to create our custom metric is call make_metric with greater_is_better set appropriately:

custom_metric = make_metric(eval_fn=my_metric_eval_fn, greater_is_better=False)

Refer to https://mlflow.org/docs/latest/python_api/mlflow.metrics.html#mlflow.metrics.make_metric for more optional parameters of make_metric.

If we want to create a custom GenAI metric, we can use make_genai_metric instead.

Let's explore how we would create a GenAI metric for our use case, where we want to measure the quality of the answer based on its fluency, clarity, and conciseness. There are four important things to define for our metric: the name, definition, grading prompt, and examples. make_genai_metric has other optional parameters, which you can explore at https://mlflow.org/docs/latest/llms/llm-evaluate/index.html#creating-custom-llm-evaluation-metrics, but we will start with the essential ones (examples are optional but highly recommended).

  • Name: The name of the metric.
  • Definition: What the metric fundamentally means. In our case, since we want answer quality to be a combination of fluency, clarity, and conciseness, we list out these aspects and define what each of them means within our specific context.
  • Grading Prompt: The grading prompt explains the scoring criteria. What is a score? Does it range from 1-5, or do we want our judge LLM to give a true/false response? What does a score of 1 mean? A score of 4?
  • Examples: Examples are optional but highly recommended. An example includes the input, output, and any other necessary columns, and shows the judge LLM what kind of score and justification we expect.

All we have to do is call make_genai_metric with the above parameters, and we'll have a custom GenAI metric we can use. Note that our predefined GenAI metrics (such as answer_correctness, which we use below) use the same prompt as custom GenAI metrics created with make_genai_metric; the only difference is that the name, definition, and grading prompt have already been defined. When using a predefined GenAI metric, any other parameters (such as examples) can still be customized.
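
For concreteness, here is a minimal sketch of a heuristic (non-GenAI) custom metric built with make_metric. The word-count metric below is purely illustrative and is not used elsewhere in this notebook:

python
import numpy as np

from mlflow.metrics import MetricValue, make_metric


def word_count_eval_fn(predictions, targets, metrics):
    # Per-row scores: number of words in each prediction.
    scores = [len(str(prediction).split()) for prediction in predictions]
    return MetricValue(
        scores=scores,
        aggregate_results={"mean": float(np.mean(scores))},
    )


# greater_is_better=False here is only for the sake of the example.
word_count_metric = make_metric(
    eval_fn=word_count_eval_fn, greater_is_better=False, name="word_count"
)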

Let's load our predefined metric - in this case, we are using answer_correctness with GPT-4 as the judge.

python
answer_correctness_metric = answer_correctness(model="openai:/gpt-4")

Now we want to create a custom LLM-judged metric named answer_quality using make_genai_metric(). We need to define a metric definition and grading rubric, as well as some examples for the LLM judge to use.

python
# The definition explains what "answer quality" entails
answer_quality_definition = """Please evaluate answer quality for the provided output on the following criteria:
fluency, clarity, and conciseness. Each of the criteria is defined as follows:
  - Fluency measures how naturally and smoothly the output reads.
  - Clarity measures how understandable the output is.
  - Conciseness measures the brevity and efficiency of the output without compromising meaning.
The more fluent, clear, and concise a text, the higher the score it deserves.
"""

# The grading prompt explains what each possible score means
answer_quality_grading_prompt = """Answer quality: Below are the details for different scores:
  - Score 1: The output is entirely incomprehensible and cannot be read.
  - Score 2: The output conveys some meaning, but needs substantial improvement in fluency, clarity, and conciseness.
  - Score 3: The output is understandable but still needs improvement.
  - Score 4: The output performs well on two of fluency, clarity, and conciseness, but could be improved on one of these criteria.
  - Score 5: The output reads smoothly, and is clear and easy to understand. There is no clear way to improve the output on these criteria.
"""

# We provide an example of a "bad" output
example1 = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform. For managing machine learning workflows, it "
    "including experiment tracking model packaging versioning and deployment as well as a platform "
    "simplifying for on the ML lifecycle.",
    score=2,
    justification="The output is difficult to understand and demonstrates extremely low clarity. "
    "However, it still conveys some meaning so this output deserves a score of 2.",
)

# We also provide an example of a "good" output
example2 = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform for managing machine learning workflows, including "
    "experiment tracking, model packaging, versioning, and deployment.",
    score=5,
    justification="The output is easily understandable, clear, and concise. It deserves a score of 5.",
)

answer_quality_metric = make_genai_metric(
    name="answer_quality",
    definition=answer_quality_definition,
    grading_prompt=answer_quality_grading_prompt,
    version="v1",
    examples=[example1, example2],
    model="openai:/gpt-4",
    greater_is_better=True,
)

Evaluate

We need to set our OpenAI API key, since we are using GPT-4 for our LLM-judged metrics.

To set your key safely, either export it in your command-line terminal for the current session, or, to make it available in all future sessions, add the following entry to your preferred shell configuration file (e.g., .bashrc, .zshrc):

OPENAI_API_KEY=<your openai API key>
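
Alternatively, for a quick experiment you can set the variable from within the notebook session itself (not recommended for shared environments, and never commit a real key):

python
import os

# Sets the key for this process only; replace the placeholder with your actual key.
os.environ["OPENAI_API_KEY"] = "<your openai API key>"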

Now, we can call mlflow.evaluate(). Just to test it out, let's use the first 10 rows of the data. With the "text" model type, toxicity and readability metrics are calculated as builtin metrics. We also pass the two metrics we defined above into the extra_metrics parameter so they are evaluated as well.

python
with mlflow.start_run():
    results = mlflow.evaluate(
        model_info.model_uri,
        eval_df.head(10),
        evaluators="default",
        model_type="text",
        targets="output",
        extra_metrics=[answer_correctness_metric, answer_quality_metric],
        evaluator_config={"col_mapping": {"inputs": "instruction"}},
    )

View results

results.metrics is a dictionary with the aggregate values for all the metrics calculated. Refer here for details on the builtin metrics for each model type.

python
results.metrics

We can also view the eval_results_table, which shows us the metrics for each row of data.

python
results.tables["eval_results_table"]
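
Since this table is a regular pandas DataFrame, it can also be saved for offline inspection (the file name below is arbitrary):

python
# Persist the per-row evaluation results to a CSV file.
results.tables["eval_results_table"].to_csv("eval_results.csv", index=False)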

View results in UI

Finally, we can view our evaluation results in the MLflow UI. Selecting our experiment in the left sidebar brings up its runs: one run logged our model "mpt-7b-chat", and the other run contains the dataset we evaluated.

We click on the Evaluation tab and hide any irrelevant runs.

We can now choose which columns to group by, as well as which column to compare. In this example, we look at the answer correctness score for each input-output pair, but we could choose any other metric to compare.

This gives us a view where we can see the justification and score for answer correctness for each row.