docs/docs/genai/eval-monitor/automatic-evaluations/index.mdx
import Tabs from "@theme/Tabs";
import TabItem from "@theme/TabItem";
import TabsWrapper from "@site/src/components/TabsWrapper";
import TilesGrid from "@site/src/components/TilesGrid";
import TileCard from "@site/src/components/TileCard";
import { Scale, MessageSquare } from "lucide-react";
import { APILink } from "@site/src/components/APILink";
import useBaseUrl from "@docusaurus/useBaseUrl";
Automatically evaluate traces and multi-turn conversations as they're logged - no code required
Automatic evaluation runs your LLM judges on traces and multi-turn conversations as they're logged to MLflow, so you don't have to run evaluation code manually. This enables two key use cases:
<video src={useBaseUrl("/images/llms/tracing/automatic-evaluation-ui-setup.mp4")} controls loop autoPlay muted aria-label="Automatic Evaluation Setup" />
| | Automatic Evaluation | Offline Evaluation |
|---|---|---|
| When it runs | Automatically, as traces and conversations are logged | Manually, when you call `mlflow.genai.evaluate()` |
| Use case | Production quality tracking, continuous monitoring, internal QA, interactive testing | Regression testing, bug fix verification, pre-deployment testing, comparing agent versions |
| Data source | Live traces and conversations from your application | Curated datasets or historical traces |
Before setting up automatic evaluation, ensure that:
These examples show how to set up LLM judges that automatically evaluate traces and multi-turn conversations as they're logged to an MLflow Experiment, and how to update or disable existing judges. For more details on creating LLM judges, see LLM-as-a-Judge.
:::note
Code-based scorers (those defined with the `@scorer` decorator or the `Scorer` class) are not supported. Use built-in judges or create custom judges with `make_judge()`.
:::

1. Navigate to your experiment and select the Judges tab
2. Click + New LLM judge
3. Select scope:
4. Configure the judge:
5. Evaluation settings:
6. Click Save
To edit or disable an existing judge, select it in the Judges tab.
For more details about the APIs used in this example, see <APILink fn="mlflow.genai.scorers.Scorer.start" />, <APILink fn="mlflow.genai.scorers.Scorer.update" />, and <APILink fn="mlflow.genai.scorers.Scorer.stop" />.
1. Specify the experiment for automatic evaluation
```python
import mlflow

mlflow.set_experiment("my-experiment")
```
2. Start automatic evaluation for a trace-level judge
```python
from mlflow.genai.scorers import ToolCallCorrectness, ScorerSamplingConfig

tool_judge = ToolCallCorrectness(model="gateway:/my-llm-endpoint")
registered_tool_judge = tool_judge.register(name="tool_call_correctness")
registered_tool_judge.start(
    sampling_config=ScorerSamplingConfig(sample_rate=0.5),  # Evaluate 50% of traces
)
```
3. Start automatic evaluation for a multi-turn (session-level) judge
```python
from mlflow.genai.scorers import ConversationalGuidelines, ScorerSamplingConfig

frustration_judge = ConversationalGuidelines(
    name="user_frustration",
    guidelines="The user should not express frustration, confusion, or dissatisfaction during the conversation.",
    model="gateway:/my-llm-endpoint",
)
registered_frustration_judge = frustration_judge.register(name="user_frustration")
registered_frustration_judge.start(
    sampling_config=ScorerSamplingConfig(sample_rate=1.0),  # Evaluate all conversations
)
```
4. Update or disable automatic evaluation for an existing judge
```python
from mlflow.genai.scorers import get_scorer, ScorerSamplingConfig

judge = get_scorer(name="tool_call_correctness")
judge.update(sampling_config=ScorerSamplingConfig(sample_rate=0.3))  # Change sample rate
judge.stop()  # Or, disable the judge
```
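When choosing a `sample_rate` for the calls above, it can help to estimate how many judge calls a given setting implies. The following is a minimal sketch in plain arithmetic, not an MLflow API: `estimate_daily_judge_calls` and its parameters are hypothetical names, and it assumes each sampled trace triggers one LLM call per judge.

```python
def estimate_daily_judge_calls(
    traces_per_day: int, sample_rate: float, num_judges: int = 1
) -> int:
    """Rough estimate of daily LLM judge calls (hypothetical helper, not part of MLflow)."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    if traces_per_day < 0 or num_judges < 0:
        raise ValueError("traces_per_day and num_judges must be non-negative")
    # Assumption: one LLM call per sampled trace per judge.
    return round(traces_per_day * sample_rate * num_judges)


print(estimate_daily_judge_calls(10_000, 0.5))                 # 5000
print(estimate_daily_judge_calls(10_000, 0.3, num_judges=2))   # 6000
```

Actual cost also depends on trace size and the model behind the endpoint, which this estimate ignores.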
Assessments from automatic evaluation appear directly in the MLflow UI. For traces, assessments typically appear within a minute or two of logging. By default, multi-turn sessions are evaluated after 5 minutes of inactivity (no new traces added to the session); this buffer is <APILink fn="mlflow.environment_variables.MLFLOW_ONLINE_SCORING_DEFAULT_SESSION_COMPLETION_BUFFER_SECONDS">configurable</APILink>.
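The inactivity rule for sessions can be pictured with a small sketch. This is only an illustration of the idea, not MLflow's server logic: `session_is_idle` is a hypothetical function, the 300-second fallback mirrors the 5-minute default described above, and reading the environment variable this way is an assumption.

```python
import os
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: 300 seconds mirrors the 5-minute default described above.
DEFAULT_BUFFER_SECONDS = 300
ENV_VAR = "MLFLOW_ONLINE_SCORING_DEFAULT_SESSION_COMPLETION_BUFFER_SECONDS"


def session_is_idle(
    last_trace_time: datetime, now: datetime, buffer_seconds: Optional[int] = None
) -> bool:
    """Return True once no trace has arrived for `buffer_seconds` (illustrative only)."""
    if buffer_seconds is None:
        buffer_seconds = int(os.environ.get(ENV_VAR, DEFAULT_BUFFER_SECONDS))
    return now - last_trace_time >= timedelta(seconds=buffer_seconds)


now = datetime(2025, 1, 1, 12, 10, tzinfo=timezone.utc)
print(session_is_idle(now - timedelta(minutes=3), now, buffer_seconds=300))  # False
print(session_is_idle(now - timedelta(minutes=6), now, buffer_seconds=300))  # True
```

Under this reading, each new trace in a session resets the clock, so a long-running conversation is judged only after it goes quiet.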
Navigate to your experiment in the MLflow UI to see results.
<div style={{ marginBottom: '16px' }}>
  <div style={{ width: '80%', textAlign: 'left', marginTop: '8px', color: '#666', fontSize: '0.9em' }}>
    Charts in the Overview tab display quality and performance trends over time
  </div>
</div>

<div style={{ display: 'flex', flexDirection: 'column', alignItems: 'flex-end' }}>
  <div style={{ width: '80%', textAlign: 'right', marginTop: '8px', color: '#666', fontSize: '0.9em' }}>
    Assessments from automatic evaluation appear as columns in the Traces tab
  </div>
</div>

Control what percentage of traces are evaluated (0-100%). Balance cost and coverage based on your needs:
Use trace search syntax to target specific traces. Examples:
```python
# Only evaluate successful traces
filter_string = "trace.status = 'OK'"

# Only evaluate traces from production environment
filter_string = "metadata.environment = 'production'"
```
:::note
For session-level evaluation, filters apply to the first trace in the session.
:::
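To see how a filter and a sampling rate combine to narrow the set of evaluated traces, consider the toy simulation below. It is a sketch under stated assumptions rather than MLflow's implementation: traces are plain dictionaries, the filter is a Python predicate standing in for a filter string, `select_traces` is a hypothetical name, and applying the filter before sampling is an assumption the text above does not confirm.

```python
import random


def select_traces(traces, sample_rate, predicate=lambda trace: True, seed=0):
    """Toy selection: keep traces that match the predicate, then sample them."""
    rng = random.Random(seed)  # Fixed seed so the example is reproducible
    return [t for t in traces if predicate(t) and rng.random() < sample_rate]


# Hypothetical traces; real traces carry far more information.
traces = [
    {"id": 1, "status": "OK", "environment": "production"},
    {"id": 2, "status": "ERROR", "environment": "production"},
    {"id": 3, "status": "OK", "environment": "staging"},
    {"id": 4, "status": "OK", "environment": "production"},
]


def ok_in_production(trace):
    return trace["status"] == "OK" and trace["environment"] == "production"


selected = select_traces(traces, sample_rate=1.0, predicate=ok_in_production)
print([t["id"] for t in selected])  # [1, 4]
```

With a sample rate below 1.0, only a random share of the matching traces would be selected, so a matching trace without an assessment is not necessarily a sign of a problem.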
Automatic evaluation can assess entire multi-turn conversations (sessions), in addition to individual traces.
For more information about session evaluation, see Evaluate Conversations.
LLM judges are periodically executed securely within the MLflow server as new traces and multi-turn conversations are received. Evaluation happens asynchronously and does not block trace logging, so your application's performance is unaffected.
The MLflow server uses AI Gateway endpoints to access LLMs for judge execution, ensuring secure and managed model access. Only the relevant trace or session data required by the judge (such as inputs, outputs, and context) is sent to the LLM.
| Issue | Solution |
|---|---|
| Missing assessments | Verify that the judge is active, the filter matches your traces, the sampling rate is greater than zero, and the traces are less than one hour old |
| Unexpected or unsatisfactory judge results | Edit the judge's instructions or use the `align()` method to optimize them automatically |
| Evaluation errors | Check trace/session assessments in the UI or SDK, or server logs, for details. Failed evaluations are not retried automatically |
For further debugging, enable debug logging on the MLflow server by setting the <APILink fn="mlflow.environment_variables.MLFLOW_LOGGING_LEVEL"><code>MLFLOW_LOGGING_LEVEL=DEBUG</code></APILink> environment variable and checking the MLflow server logs.
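The checks listed for missing assessments in the table above can be written down as a simple checklist. The sketch below is illustrative only: `JudgeState` and `missing_assessment_reasons` are hypothetical names, not MLflow classes or functions, and the values would have to be gathered by hand from the Judges tab and the trace in question.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class JudgeState:
    """Hypothetical snapshot of a judge's configuration (not an MLflow class)."""

    active: bool
    sample_rate: float
    filter_matches_trace: bool


def missing_assessment_reasons(judge: JudgeState, trace_age: timedelta) -> list:
    """Return the troubleshooting checks that fail for a trace with no assessment."""
    reasons = []
    if not judge.active:
        reasons.append("judge is not active")
    if not judge.filter_matches_trace:
        reasons.append("filter does not match the trace")
    if judge.sample_rate <= 0:
        reasons.append("sampling rate is zero")
    if trace_age >= timedelta(hours=1):
        reasons.append("trace is not less than one hour old")
    return reasons


judge = JudgeState(active=True, sample_rate=0.0, filter_matches_trace=True)
print(missing_assessment_reasons(judge, timedelta(minutes=10)))
# ['sampling rate is zero']
```

An empty list does not guarantee an assessment: with a sampling rate between 0 and 1, a given trace may simply not have been sampled.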