docs/docs/genai/concepts/evaluation-datasets.mdx
import { APILink } from "@site/src/components/APILink";
import ConceptOverview from "@site/src/components/ConceptOverview";
import TilesGrid from "@site/src/components/TilesGrid";
import TileCard from "@site/src/components/TileCard";
import { Database, ChartBar, FileText, Target, Activity, Code } from "lucide-react";
import useBaseUrl from "@docusaurus/useBaseUrl";
:::warning[SQL Backend Required]

Evaluation Datasets require an MLflow Tracking Server with a SQL backend (PostgreSQL, MySQL, SQLite, or MSSQL). This feature is not available in FileStore (local mode) because managing dataset records, associations, and schema evolution requires relational storage.

:::
Evaluation Datasets in MLflow provide a structured way to organize and manage test data for LLM applications and AI agents. They serve as centralized repositories for test inputs, optional test outputs, expected outputs (expectations), and evaluation results, enabling systematic quality assessment across your AI development lifecycle.
Unlike static test files, evaluation datasets are living validation collections designed to grow and evolve with your application. Records can be continuously added from production traces, manual curation, or programmatic generation.
They can be viewed directly within the MLflow UI.
<video src={useBaseUrl("/images/eval-datasets.mp4")} controls loop autoPlay muted aria-label="Evaluation Datasets Video" />
Evaluation datasets are composed of several key elements that work together to provide comprehensive test management:
<ConceptOverview concepts={[ { icon: Database, title: "Dataset Records", description: "Individual test cases containing inputs (what goes into your model), expectations (what should come out), optional outputs (what your application returned), and metadata about the source and tags for organization." }, { icon: ChartBar, title: "Schema & Profile", description: "Automatically computed structure and statistics of your dataset. Schema tracks field names and types across records, while profile provides statistical summaries." }, { icon: Target, title: "Expectations", description: "Ground truth values and quality criteria that define correct behavior. These are the standards against which your model outputs are evaluated." }, { icon: Activity, title: "Experiment Association", description: "Links to MLflow experiments enable tracking which datasets were used for which model evaluations, providing full lineage and organizational control." } ]} />
The <APILink fn="mlflow.genai.datasets.EvaluationDataset" text="EvaluationDataset" /> object returned by <APILink fn="mlflow.genai.datasets.create_dataset" text="create_dataset()" /> and <APILink fn="mlflow.genai.datasets.get_dataset" text="get_dataset()" /> exposes the following fields:
| Field | Type | Description |
|---|---|---|
| `dataset_id` | `str` | Unique identifier for the dataset (format: `d-{32 hex chars}`) |
| `name` | `str` | Human-readable name for the dataset |
| `digest` | `str` | Content hash for data integrity verification |
| `schema` | `Optional[str]` | JSON string describing the structure of records (automatically computed) |
| `profile` | `Optional[str]` | JSON string containing statistical information about the dataset |
| `tags` | `dict[str, str]` | Key-value pairs for organizing and categorizing datasets |
| `experiment_ids` | `list[str]` | List of MLflow experiment IDs this dataset is associated with |
| `created_time` | `int` | Timestamp when the dataset was created (milliseconds) |
| `last_update_time` | `int` | Timestamp of the last modification (milliseconds) |
| `created_by` | `Optional[str]` | User who created the dataset (auto-detected from tags) |
| `last_updated_by` | `Optional[str]` | User who last modified the dataset |
Records are fetched lazily. Call <APILink fn="mlflow.genai.datasets.EvaluationDataset.to_df" text="to_df()" /> to load them into a pandas DataFrame.
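As a small illustration of the field formats in the table above, the following sketch checks a `dataset_id` against the documented shape (`d-` followed by 32 hex characters) and converts a millisecond timestamp such as `created_time` into a timezone-aware `datetime`. The helper names and sample values are hypothetical and are not part of MLflow. The pattern also assumes lowercase hex digits, which the table does not specify.

```python
import re
from datetime import datetime, timezone

# Assumption: IDs use lowercase hex digits; the table only says "32 hex chars".
DATASET_ID_PATTERN = re.compile(r"d-[0-9a-f]{32}")


def looks_like_dataset_id(value: str) -> bool:
    """Return True if value has the documented 'd-' + 32 hex characters shape."""
    return DATASET_ID_PATTERN.fullmatch(value) is not None


def millis_to_datetime(millis: int) -> datetime:
    """Convert a millisecond epoch timestamp (e.g. created_time) to UTC."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


# Hypothetical sample values, not taken from a real tracking server.
sample_id = "d-" + "0123456789abcdef" * 2
sample_created_time = 1_723_000_000_000

print(looks_like_dataset_id(sample_id))
print(millis_to_datetime(sample_created_time).isoformat())
```

Dividing by 1000 before calling `datetime.fromtimestamp` matters because the timestamp fields are in milliseconds, while Python's epoch helpers expect seconds.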
Each record in an evaluation dataset represents a single test case with the following structure:
{
    "inputs": {
        "question": "What is the capital of France?",
        "context": "France is a country in Western Europe",
        "temperature": 0.7
    },
    "outputs": {
        "answer": "The capital of France is Paris."
    },
    "expectations": {
        "name": "expected_answer",
        "value": "Paris"
    },
    "source": {
        "source_type": "HUMAN",
        "source_data": {
            "annotator": "[email protected]",
            "annotation_date": "2024-08-07"
        }
    },
    "tags": {
        "category": "geography",
        "difficulty": "easy",
        "validated": "true"
    }
}
Records are uniquely identified by a hash of their inputs. When you merge records with <APILink fn="mlflow.genai.datasets.EvaluationDataset.merge_records" text="merge_records()" /> and a record with identical inputs already exists, the incoming expectations and tags are merged into that record instead of creating a duplicate. This enables iterative refinement of test cases without data duplication.
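The deduplication behavior described above can be sketched in plain Python. This is a conceptual model only, not MLflow's implementation: the hashing scheme (SHA-256 over canonical JSON), the in-memory `store` dictionary, the helper names `inputs_key` and `merge_into`, and the "later values win" rule for conflicting keys are all assumptions made for illustration.

```python
import hashlib
import json


def inputs_key(inputs: dict) -> str:
    """Hash the inputs via canonical JSON so that key order does not matter."""
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def merge_into(store: dict, records: list) -> dict:
    """Upsert records into store, keyed by the hash of their inputs."""
    for record in records:
        key = inputs_key(record["inputs"])
        if key in store:
            existing = store[key]
            # Same inputs: merge expectations and tags instead of duplicating.
            # Assumption: values from the newer record overwrite older ones.
            existing["expectations"].update(record.get("expectations", {}))
            existing["tags"].update(record.get("tags", {}))
        else:
            store[key] = {
                "inputs": record["inputs"],
                "expectations": dict(record.get("expectations", {})),
                "tags": dict(record.get("tags", {})),
            }
    return store


# Two records with identical inputs collapse into a single entry.
store = merge_into({}, [
    {
        "inputs": {"question": "What is the capital of France?"},
        "expectations": {"expected_answer": "Paris"},
    },
    {
        "inputs": {"question": "What is the capital of France?"},
        "tags": {"category": "geography"},
    },
])

print(len(store))  # 1
```

Keying on the inputs alone means a test case can be added once with its inputs, then enriched later with expectations or tags, without the caller tracking record IDs.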
Dataset schemas automatically evolve as you add records with new fields. The `schema` property tracks all field names and types encountered across records, while `profile` maintains statistical summaries. This automatic adaptation means you can start with simple test cases and progressively add complexity without manual schema migrations.
When new fields are introduced in subsequent records, they're automatically incorporated into the schema. Existing records without those fields are handled gracefully during evaluation and analysis.
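To make the idea of an evolving schema concrete, here is a minimal sketch that collects field names and observed types across records, then reads older records with missing fields filled in as `None`. The function name `infer_schema`, the nested-dictionary layout, and the `None` fill are illustrative assumptions. The JSON format MLflow actually stores in `schema` is not described here and may differ.

```python
def infer_schema(records: list) -> dict:
    """Collect every field name and the type names seen for it, per section."""
    schema: dict = {}
    for record in records:
        for section in ("inputs", "expectations"):
            for field, value in record.get(section, {}).items():
                seen_types = schema.setdefault(section, {}).setdefault(field, set())
                seen_types.add(type(value).__name__)
    return schema


records = [
    # An early, simple test case.
    {"inputs": {"question": "What is the capital of France?"}},
    # A later record that introduces new fields.
    {
        "inputs": {"question": "What is 2 + 2?", "temperature": 0.7},
        "expectations": {"expected_answer": "4"},
    },
]

schema = infer_schema(records)

# Older records lack the newer fields; read them as None instead of failing.
rows = [
    {field: record["inputs"].get(field) for field in sorted(schema["inputs"])}
    for record in records
]

print(schema["inputs"])
print(rows[0])
```

The first record was written before `temperature` existed, yet it still produces a complete row, which is the kind of tolerance the paragraph above describes.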