mlflow.metrics
==============

The ``mlflow.metrics`` module helps you quantitatively and qualitatively measure your models.

.. autoclass:: mlflow.metrics.EvaluationMetric

These :py:class:`EvaluationMetric <mlflow.metrics.EvaluationMetric>`\s are used by the :py:func:`mlflow.evaluate` API: they are either computed automatically, depending on the ``model_type``, or specified explicitly via the ``extra_metrics`` parameter.

The following code demonstrates how to use :py:func:`mlflow.evaluate` with an :py:class:`EvaluationMetric <mlflow.metrics.EvaluationMetric>`:

.. code-block:: python

    import mlflow
    import pandas as pd

    from mlflow.metrics.genai import EvaluationExample, answer_similarity

    eval_df = pd.DataFrame(
        {
            "inputs": [
                "What is MLflow?",
            ],
            "ground_truth": [
                "MLflow is the largest open source AI engineering platform for agents, LLM applications, and ML models. It was developed by Databricks, a company that specializes in data and AI solutions. MLflow is designed to address the challenges that data scientists and AI engineers face when developing, evaluating, and deploying AI applications.",
            ],
        }
    )

    example = EvaluationExample(
        input="What is MLflow?",
        output="MLflow is the largest open source AI engineering platform "
        "for agents, LLM applications, and ML models, including tracing, "
        "evaluation, prompt management, experiment tracking, and deployment.",
        score=4,
        justification="The definition effectively explains what MLflow is, "
        "its purpose, and its developer. It could be more concise for a 5-score.",
        grading_context={
            "ground_truth": "MLflow is the largest open source AI engineering "
            "platform for agents, LLM applications, and ML models. It was "
            "developed by Databricks, a company that specializes in data and "
            "AI solutions. MLflow is designed to address the challenges that "
            "data scientists and AI engineers face when developing, evaluating, "
            "and deploying AI applications."
        },
    )

    answer_similarity_metric = answer_similarity(examples=[example])

    # logged_model refers to a model that was previously logged to MLflow
    results = mlflow.evaluate(
        logged_model.model_uri,
        eval_df,
        targets="ground_truth",
        model_type="question-answering",
        extra_metrics=[answer_similarity_metric],
    )

Information about how an :py:class:`EvaluationMetric <mlflow.metrics.EvaluationMetric>` is calculated, such as the grading prompt used, is available via the ``metric_details`` property.

.. code-block:: python

    import mlflow
    from mlflow.metrics.genai import relevance

    my_relevance_metric = relevance()
    print(my_relevance_metric.metric_details)

Evaluation results are stored as :py:class:`MetricValue <mlflow.metrics.MetricValue>`. Aggregate results are logged to the MLflow run as metrics, while per-example results are logged to the MLflow run as artifacts in the form of an evaluation table.

.. autoclass:: mlflow.metrics.MetricValue

We provide the following builtin factory functions to create :py:class:`EvaluationMetric <mlflow.metrics.EvaluationMetric>`\s for evaluating models. These metrics are computed automatically depending on the ``model_type``. For more information on the ``model_type`` parameter, see the :py:func:`mlflow.evaluate` API documentation.

.. autofunction:: mlflow.metrics.mae
.. autofunction:: mlflow.metrics.mape
.. autofunction:: mlflow.metrics.max_error
.. autofunction:: mlflow.metrics.mse
.. autofunction:: mlflow.metrics.rmse
.. autofunction:: mlflow.metrics.r2_score
.. autofunction:: mlflow.metrics.precision_score
.. autofunction:: mlflow.metrics.recall_score
.. autofunction:: mlflow.metrics.f1_score
.. autofunction:: mlflow.metrics.ari_grade_level
.. autofunction:: mlflow.metrics.flesch_kincaid_grade_level
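To give a sense of what a readability metric computes, the automated readability index (ARI) can be sketched in plain Python. This is a simplified illustration of the classic ARI formula, not MLflow's implementation (the built-in metric relies on a text-statistics library and handles edge cases such as abbreviations and empty input):

```python
def ari_grade_level(text: str) -> float:
    """Rough ARI sketch: 4.71 * (chars/words) + 0.5 * (words/sentences) - 21.43."""
    words = text.split()
    if not words:
        return 0.0
    # Count characters in the words themselves (whitespace excluded).
    chars = sum(len(word) for word in words)
    # Naive sentence count from terminal punctuation; at least 1.
    sentences = max(1, text.count(".") + text.count("!") + text.count("?"))
    return 4.71 * (chars / len(words)) + 0.5 * (len(words) / sentences) - 21.43
```

Higher values correspond to text requiring a higher U.S. grade level to read.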

The question-answering model type includes all of the above text metrics as well as the following:

.. autofunction:: mlflow.metrics.exact_match
.. autofunction:: mlflow.metrics.rouge1
.. autofunction:: mlflow.metrics.rouge2
.. autofunction:: mlflow.metrics.rougeL
.. autofunction:: mlflow.metrics.rougeLsum
.. autofunction:: mlflow.metrics.toxicity
.. autofunction:: mlflow.metrics.token_count
.. autofunction:: mlflow.metrics.latency
.. autofunction:: mlflow.metrics.bleu

The following metrics are built-in metrics for the ``'retriever'`` model type, meaning they will be automatically calculated with a default ``retriever_k`` value of 3.

To evaluate document retrieval models, it is recommended to use a dataset with the following columns:

- input queries
- retrieved relevant doc IDs
- ground-truth relevant doc IDs

Alternatively, you can also provide a function through the model parameter to represent
your retrieval model. The function should take a Pandas DataFrame containing input queries and
ground-truth relevant doc IDs, and return a DataFrame with a column of retrieved relevant doc IDs.
A "doc ID" is a string or integer that uniquely identifies a document. Each row of the retrieved and ground-truth doc ID columns should consist of a list or numpy array of doc IDs.
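As an illustration of this contract, here is a toy retriever function. Everything in it is hypothetical for the sketch: the keyword index stands in for a real retrieval backend, and the input column name ``"questions"`` is an assumption that must match your dataset:

```python
import pandas as pd

# Hypothetical keyword index mapping terms to doc IDs; a real retriever
# would query a search index or vector store instead.
CORPUS_INDEX = {
    "mlflow": ["doc1", "doc3"],
    "spark": ["doc2"],
}


def retriever_function(df: pd.DataFrame) -> pd.DataFrame:
    """Take a DataFrame of input queries; return a DataFrame with one column
    whose rows are lists of retrieved doc IDs."""

    def retrieve(query: str) -> list:
        hits = []
        for keyword, doc_ids in CORPUS_INDEX.items():
            if keyword in query.lower():
                hits.extend(doc_ids)
        return hits

    # "questions" is the assumed input column name for this sketch.
    return pd.DataFrame({"retrieved": df["questions"].apply(retrieve)})
```

The returned column of doc ID lists is what the retriever metrics compare against the ground-truth column.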
Parameters:

- ``targets``: A string specifying the column name of the ground-truth relevant doc IDs.
- ``predictions``: A string specifying the column name of the retrieved relevant doc IDs in either the static dataset or the DataFrame returned by the model function.
- ``retriever_k``: A positive integer specifying the number of retrieved doc IDs to consider for each input query. ``retriever_k`` defaults to 3.

You can change ``retriever_k`` in one of two ways:

1. Using the ``evaluator_config`` parameter of the :py:func:`mlflow.evaluate` API:

   .. code-block:: python

       mlflow.evaluate(
           model=retriever_function,
           data=data,
           targets="ground_truth",
           model_type="retriever",
           evaluators="default",
           evaluator_config={"retriever_k": 5},
       )

2. Using the ``extra_metrics`` parameter:

   .. code-block:: python

       mlflow.evaluate(
           data=data,
           predictions="predictions_param",
           targets="targets_param",
           model_type="retriever",
           extra_metrics=[
               mlflow.metrics.precision_at_k(5),
               mlflow.metrics.precision_at_k(6),
               mlflow.metrics.recall_at_k(5),
               mlflow.metrics.ndcg_at_k(5),
           ],
       )

NOTE: In the second method, it is recommended to omit the ``model_type`` as well; otherwise precision@3 and recall@3 will be calculated in addition to precision@5, precision@6, recall@5, and ndcg@5.

.. autofunction:: mlflow.metrics.precision_at_k
.. autofunction:: mlflow.metrics.recall_at_k
.. autofunction:: mlflow.metrics.ndcg_at_k
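Conceptually, these metrics compare the top-``k`` retrieved doc IDs against the ground-truth relevant set. A plain-Python sketch of precision@k and recall@k (an illustration of the idea, not MLflow's implementation, which may handle edge cases differently):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved doc IDs that are in the relevant set."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    relevant_set = set(relevant)
    return sum(1 for doc in top_k if doc in relevant_set) / len(top_k)


def recall_at_k(retrieved, relevant, k):
    """Fraction of all ground-truth relevant doc IDs found in the top-k retrieved."""
    if not relevant:
        return 1.0  # nothing to find; treated here as perfect recall
    relevant_set = set(relevant)
    found = set(retrieved[:k]) & relevant_set
    return len(found) / len(relevant_set)
```

For example, with ``retrieved=["a", "b", "c"]`` and ``relevant=["b", "c", "d"]`` at ``k=3``, both precision and recall are 2/3.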

Users can create their own :py:class:`EvaluationMetric <mlflow.metrics.EvaluationMetric>`\s using the :py:func:`make_metric <mlflow.metrics.make_metric>` factory function:

.. autofunction:: mlflow.metrics.make_metric
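As a sketch of the ``eval_fn`` contract behind ``make_metric``: the function receives the model outputs as a pandas Series, scores each row, and returns per-row scores plus aggregates. The function name and the metric itself are hypothetical, and the real ``eval_fn`` must return an ``mlflow.metrics.MetricValue`` (a plain dict with the same fields stands in here so the sketch runs standalone; the exact ``eval_fn`` signature also varies across MLflow versions):

```python
import pandas as pd


def word_count_eval(predictions: pd.Series) -> dict:
    """Hypothetical metric: score each prediction by its word count."""
    scores = [len(str(text).split()) for text in predictions]
    # In real use, return mlflow.metrics.MetricValue(scores=scores,
    # aggregate_results={"mean": ...}) instead of this dict.
    return {
        "scores": scores,
        "aggregate_results": {"mean": sum(scores) / len(scores)},
    }


result = word_count_eval(pd.Series(["hello world", "one two three"]))
```

Such a function would then be wrapped with ``make_metric`` (passing ``eval_fn``, ``greater_is_better``, and a ``name``) and supplied to ``mlflow.evaluate`` via ``extra_metrics``.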

.. automodule:: mlflow.metrics
    :members:
    :undoc-members:
    :show-inheritance:
    :exclude-members: MetricValue, EvaluationMetric, make_metric, EvaluationExample, ari_grade_level, flesch_kincaid_grade_level, exact_match, rouge1, rouge2, rougeL, rougeLsum, toxicity, answer_similarity, answer_correctness, faithfulness, answer_relevance, mae, mape, max_error, mse, rmse, r2_score, precision_score, recall_score, f1_score, token_count, latency, precision_at_k, recall_at_k, ndcg_at_k, bleu

We also provide generative AI ("genai") :py:class:`EvaluationMetric <mlflow.metrics.EvaluationMetric>`\s for evaluating text models. These metrics use an LLM to evaluate the quality of a model's output text. Note that your use of a third-party LLM service (e.g., OpenAI) for evaluation may be subject to and governed by the LLM service's terms of use. The following factory functions help you customize these genai metrics for your use case.

.. automodule:: mlflow.metrics.genai
    :members:
    :undoc-members:
    :show-inheritance:
    :exclude-members: EvaluationExample, make_genai_metric

You can also create your own generative AI :py:class:`EvaluationMetric <mlflow.metrics.EvaluationMetric>`\s using the :py:func:`make_genai_metric <mlflow.metrics.genai.make_genai_metric>` factory function:

.. autofunction:: mlflow.metrics.genai.make_genai_metric

When using generative AI :py:class:`EvaluationMetric <mlflow.metrics.EvaluationMetric>`\s, it is important to pass in an :py:class:`EvaluationExample <mlflow.metrics.genai.EvaluationExample>`:

.. autoclass:: mlflow.metrics.genai.EvaluationExample

Users must set the appropriate environment variables for the LLM service they are using for evaluation. For example, if you are using OpenAI's API, you must set the ``OPENAI_API_KEY`` environment variable. If using Azure OpenAI, you must also set the ``OPENAI_API_TYPE``, ``OPENAI_API_VERSION``, ``OPENAI_API_BASE``, and ``OPENAI_DEPLOYMENT_NAME`` environment variables. See the `Azure OpenAI documentation <https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/switching-endpoints>`_ for details. Users do not need to set these environment variables if they are using a gateway route.
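For example, the environment variables above can be set in a shell before running evaluation (placeholder values shown; substitute your own credentials, endpoint, and deployment name):

```shell
# OpenAI
export OPENAI_API_KEY="<your-openai-api-key>"

# Azure OpenAI additionally requires:
export OPENAI_API_TYPE="azure"
export OPENAI_API_VERSION="<api-version>"
export OPENAI_API_BASE="<your-azure-endpoint>"
export OPENAI_DEPLOYMENT_NAME="<your-deployment-name>"
```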