import { APILink } from "@site/src/components/APILink";
import ConceptOverview from "@site/src/components/ConceptOverview";
import TilesGrid from "@site/src/components/TilesGrid";
import TileCard from "@site/src/components/TileCard";
import { Database, ChartBar, FileText, Target, Activity, Code } from "lucide-react";
import useBaseUrl from '@docusaurus/useBaseUrl';

Evaluation Dataset Concepts

:::warning[SQL Backend Required]

Evaluation Datasets require an MLflow Tracking Server with a SQL backend (PostgreSQL, MySQL, SQLite, or MSSQL). This feature is not available in FileStore (local mode) due to the relational data requirements for managing dataset records, associations, and schema evolution.

:::

What are Evaluation Datasets?

Evaluation Datasets in MLflow provide a structured way to organize and manage test data for LLM applications and AI agents. They serve as centralized repositories for test inputs, optional test outputs, expected outputs (expectations), and evaluation results, enabling systematic quality assessment across your AI development lifecycle.

Unlike static test files, evaluation datasets are living validation collections designed to grow and evolve with your application. Records can be continuously added from production traces, manual curation, or programmatic generation.

They can be viewed directly within the MLflow UI.

Core Components

Evaluation datasets are composed of several key elements that work together to provide comprehensive test management:

Dataset Object Schema

The <APILink fn="mlflow.genai.datasets.EvaluationDataset" text="EvaluationDataset" /> object returned by <APILink fn="mlflow.genai.datasets.create_dataset" text="create_dataset()" /> and <APILink fn="mlflow.genai.datasets.get_dataset" text="get_dataset()" /> exposes the following fields:

| Field | Type | Description |
| --- | --- | --- |
| `dataset_id` | `str` | Unique identifier for the dataset (format: `d-{32 hex chars}`) |
| `name` | `str` | Human-readable name for the dataset |
| `digest` | `str` | Content hash for data integrity verification |
| `schema` | `Optional[str]` | JSON string describing the structure of records (automatically computed) |
| `profile` | `Optional[str]` | JSON string containing statistical information about the dataset |
| `tags` | `dict[str, str]` | Key-value pairs for organizing and categorizing datasets |
| `experiment_ids` | `list[str]` | List of MLflow experiment IDs this dataset is associated with |
| `created_time` | `int` | Timestamp when the dataset was created (milliseconds) |
| `last_update_time` | `int` | Timestamp of the last modification (milliseconds) |
| `created_by` | `Optional[str]` | User who created the dataset (auto-detected from tags) |
| `last_updated_by` | `Optional[str]` | User who last modified the dataset |

Records are fetched lazily: they are not loaded until you call <APILink fn="mlflow.genai.datasets.EvaluationDataset.to_df" text="to_df()" />, which returns them as a pandas DataFrame.
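
Two of the fields above have a documented shape that is easy to check in client code: `dataset_id` follows the `d-{32 hex chars}` format, and the two timestamps are in milliseconds. The following is a minimal sketch using only the standard library. The helper names `is_valid_dataset_id` and `millis_to_datetime` are hypothetical and not part of MLflow, and accepting both uppercase and lowercase hex digits is an assumption.

```python
import re
from datetime import datetime, timezone

# Documented ID format: "d-" followed by 32 hex characters.
# Assumption: both lowercase and uppercase hex digits are accepted here.
DATASET_ID_PATTERN = re.compile(r"d-[0-9a-fA-F]{32}")


def is_valid_dataset_id(dataset_id: str) -> bool:
    """Return True if the string matches the d-{32 hex chars} format."""
    return DATASET_ID_PATTERN.fullmatch(dataset_id) is not None


def millis_to_datetime(timestamp_ms: int) -> datetime:
    """Convert a millisecond epoch timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
```

Applied to a dataset object, `millis_to_datetime(dataset.created_time)` would give a readable creation time without loading any records.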

Record Structure

Each record in an evaluation dataset represents a single test case with the following structure:

```json
{
    "inputs": {
        "question": "What is the capital of France?",
        "context": "France is a country in Western Europe",
        "temperature": 0.7
    },
    "outputs": {
        "answer": "The capital of France is Paris."
    },
    "expectations": {
        "name": "expected_answer",
        "value": "Paris"
    },
    "source": {
        "source_type": "HUMAN",
        "source_data": {
            "annotator": "[email protected]",
            "annotation_date": "2024-08-07"
        }
    },
    "tags": {
        "category": "geography",
        "difficulty": "easy",
        "validated": "true"
    }
}
```

Record Fields

- **inputs** (required): The test input data that will be passed to your model or application
- **outputs** (optional): The actual outputs generated by your model (typically used for post-hoc evaluation)
- **expectations** (optional): The expected outputs or quality criteria that define correct behavior
- **source** (optional): Provenance information about how this record was created (automatically inferred if not provided)
- **tags** (optional): Metadata specific to this individual record for organization and filtering
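
The required/optional split above can be expressed as a small pre-flight check that runs before records are sent to MLflow. This is a sketch under stated assumptions: `validate_record` is a hypothetical helper, not an MLflow function, and flagging unknown top-level keys is a choice made for this example rather than documented MLflow behavior.

```python
from typing import Any

# Field names taken from the record structure described above.
REQUIRED_FIELDS = {"inputs"}
OPTIONAL_FIELDS = {"outputs", "expectations", "source", "tags"}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the record looks well-formed."""
    keys = set(record)
    problems = []
    # "inputs" is the only field a record must carry.
    for field in sorted(REQUIRED_FIELDS - keys):
        problems.append(f"missing required field: {field}")
    # Assumption for this sketch: any other top-level key is reported as unknown.
    for field in sorted(keys - REQUIRED_FIELDS - OPTIONAL_FIELDS):
        problems.append(f"unknown field: {field}")
    return problems
```

A record that contains only `inputs` passes, which matches the idea that outputs, expectations, source, and tags can be filled in later.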

Record Identity and Deduplication

Records are uniquely identified by a hash of their inputs. When merging records with <APILink fn="mlflow.genai.datasets.EvaluationDataset.merge_records" text="merge_records()" />, if a record with identical inputs already exists, its expectations and tags are merged rather than creating a duplicate. This enables iterative refinement of test cases without data duplication.
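
To make the merge behavior concrete, here is a minimal in-memory sketch of deduplication keyed on an input hash. It illustrates the idea only: the hashing scheme (SHA-256 over canonical JSON) and the helper names `input_hash` and `merge_by_inputs` are assumptions made for this example, and MLflow's internal implementation may differ.

```python
import hashlib
import json
from typing import Any


def input_hash(inputs: dict[str, Any]) -> str:
    """Hash inputs via canonical JSON so that key order does not matter."""
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def merge_by_inputs(store: dict[str, dict], new_records: list[dict]) -> dict[str, dict]:
    """Merge new records into store, keyed by the hash of their inputs."""
    for record in new_records:
        key = input_hash(record["inputs"])
        if key in store:
            # Same inputs already present: merge expectations and tags in place.
            current = store[key]
            current.setdefault("expectations", {}).update(record.get("expectations", {}))
            current.setdefault("tags", {}).update(record.get("tags", {}))
        else:
            # New inputs: add a fresh record.
            store[key] = {
                "inputs": record["inputs"],
                "expectations": dict(record.get("expectations", {})),
                "tags": dict(record.get("tags", {})),
            }
    return store
```

Merging two records with the same inputs but different expectations leaves one record in the store that carries both expectations.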

Schema Evolution

Dataset schemas automatically evolve as you add records with new fields. The `schema` property tracks all field names and types encountered across records, while `profile` maintains statistical summaries. This automatic adaptation means you can start with simple test cases and progressively add complexity without manual schema migrations.

When new fields are introduced in subsequent records, they're automatically incorporated into the schema. Existing records without those fields are handled gracefully during evaluation and analysis.
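
The idea of a schema that grows with the data can be sketched as a union of field names and value types over all records seen so far. The function below is illustrative only: `infer_schema` is a hypothetical helper, and the output layout is an assumption that does not reflect the actual JSON format of the `schema` property.

```python
from typing import Any


def infer_schema(records: list[dict[str, Any]]) -> dict[str, dict[str, list[str]]]:
    """Collect every field name and Python type name seen in each record section."""
    seen: dict[str, dict[str, set[str]]] = {}
    for record in records:
        for section in ("inputs", "outputs", "expectations"):
            # Sections that are absent from a record are simply skipped.
            values = record.get(section) or {}
            for name, value in values.items():
                seen.setdefault(section, {}).setdefault(name, set()).add(type(value).__name__)
    # Sort the type names so the result is deterministic.
    return {
        section: {name: sorted(types) for name, types in fields.items()}
        for section, fields in seen.items()
    }
```

Running it first on a single simple record and then on a longer list shows the schema widening as new fields appear, while the earlier record needs no changes.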

Next Steps