Datasets can be used to track test cases you would like to evaluate your LLM on. Each dataset is made up of dictionaries
with arbitrary key-value pairs. When getting started, we recommend having an `input` field and an optional
`expected_output` field. Datasets can be created directly in the Opik UI, programmatically with the SDK, or by converting production traces.
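For example, a minimal dataset item is just a dictionary. The field names below follow the `input` / `expected_output` recommendation above, but any keys work:

```python
# Illustrative dataset items: plain dictionaries with arbitrary key-value pairs.
# The "input" / "expected_output" naming follows the recommendation above.
items = [
    {"input": "What is the capital of France?", "expected_output": "Paris"},
    {"input": "What is 2 + 2?", "expected_output": "4"},
]

# Every item is a dict; keys can even vary per item if needed.
for item in items:
    assert isinstance(item, dict)
```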
Once a dataset has been created, you can run Experiments on it. Each Experiment will evaluate an LLM application based on the test cases in the dataset using an evaluation metric and report the results back to the dataset.
The simplest and fastest way to create a dataset is directly in the Opik UI. This is ideal for quickly bootstrapping datasets from CSV files without needing to write any code.
To create a dataset in the UI, open the Datasets page, create a new dataset, and upload your CSV file.
If you need to create a dataset with more than 1,000 rows, you can use the SDK.
<Tip>
The UI dataset creation has some limitations:

* File size is limited to 1,000 rows via the UI.
* No support for nested JSON structures in the CSV itself.

For datasets requiring rich metadata, complex schemas, or programmatic control, use the SDK instead (see the next section).
</Tip>

<Note>
When you create a dataset with a CSV file, this creates the first version (v1) of your dataset. All subsequent modifications will create new versions automatically.
</Note>

Dataset versioning in Opik creates immutable snapshots of your data. Every time you modify a dataset, whether adding, editing, or deleting items, a new version is automatically created. This ensures complete reproducibility, provides an audit trail of all changes, and allows easy rollback to any previous state.
Each dataset version contains:

* A version name (e.g., `v1`, `v2`)
* A change summary (items added, modified, and deleted)
* An optional version note
* Optional tags (e.g., `production`, `baseline`)

Once a version is created, its data cannot be changed; any modification creates a new version instead. Restoring a previous version also creates a new version with the same data, preserving your complete version timeline.
<Note>
The special `latest` tag always points to the most recent version. When running experiments without specifying a version, `latest` is used by default.
</Note>

When making changes to a dataset in the Opik UI, all modifications go into a draft state first. This gives you a staging area to review changes before committing them as a new version. The draft is visible only to you, and AI-generated samples from "Expand with AI" also go to draft for review.
When a dataset has unsaved draft changes, an orange "Draft" tag appears next to the dataset name, and Save changes / Discard changes buttons appear in the toolbar. Items show colored borders: green for newly added items, amber for modified items.
To commit your draft as a new version, click Save changes in the toolbar.
To abandon your draft, click Discard changes and confirm. If you try to navigate away with unsaved changes, Opik displays a warning to prevent accidental loss of work.
<Tip>
Use draft mode to batch related changes into a single, well-documented version.
</Tip>

To view the complete timeline of dataset changes, navigate to your dataset and click the Version history tab. The table shows each version's name, change summary (items added/modified/deleted), version note, tags, item count, and creation timestamp.
From this view you can inspect each version's details, manage version tags, and restore the dataset to a previous version (which creates a new version with the same data).
One of the most powerful ways to build evaluation datasets is by converting production traces into dataset items. This allows you to leverage real-world interactions from your LLM application to create test cases for evaluation.
To add traces to a dataset from the Opik UI, select the relevant traces in the Traces view and use the Add to dataset action.
When you add a trace to a dataset, the dataset item is created from the trace's data: the trace input becomes the item's input, and the trace output is stored as the `expected_output` (which you can edit for evaluation purposes). This rich structure allows you to turn real production interactions into evaluation test cases while keeping each item linked back to its source trace.
You can create a dataset and log items to it using the `get_or_create_dataset` method:
<CodeBlocks>
```typescript title="TypeScript" language="typescript"
import { Opik } from "opik";

// Create a dataset
const client = new Opik();
const dataset = await client.getOrCreateDataset("My dataset", "Evaluation dataset", "my-project");
```

```python title="Python SDK" language="python"
from opik import Opik

# Create a dataset
client = Opik()
dataset = client.get_or_create_dataset(name="My dataset", project_name="my-project")
```
</CodeBlocks>
If a dataset with the given name already exists, the existing dataset will be returned.
You can insert items to a dataset using the `insert` method:
<CodeBlocks>
```typescript title="TypeScript" language="typescript"
await dataset.insert([
  { user_question: "Hello, world!", expected_output: { assistant_answer: "Hello, world!" } },
  { user_question: "What is the capital of France?", expected_output: { assistant_answer: "Paris" } },
]);
```

```python title="Python" language="python"
import opik

# Get or create a dataset
client = opik.Opik()
dataset = client.get_or_create_dataset(name="My dataset", project_name="my-project")

# Add dataset items to it
dataset.insert([
    {"user_question": "Hello, world!", "expected_output": {"assistant_answer": "Hello, world!"}},
    {"user_question": "What is the capital of France?", "expected_output": {"assistant_answer": "Paris"}},
])
```
</CodeBlocks>
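If you are inserting a very large number of items, you may want to batch the calls rather than send one giant payload. A sketch of a simple chunking helper; the batch size of 100 is an arbitrary choice for illustration, not an SDK requirement:

```python
from typing import Any, Dict, Iterator, List

def chunked(items: List[Dict[str, Any]], size: int = 100) -> Iterator[List[Dict[str, Any]]]:
    """Yield successive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start : start + size]

# Usage sketch, using the insert method shown above:
# for batch in chunked(all_items, size=100):
#     dataset.insert(batch)
```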
Once the items have been inserted, you can view them in the Opik UI.

#### Inserting items from a JSONL file

You can also insert items from a JSONL file:
<CodeBlocks>
```python title="Python" language="python"
import opik

client = opik.Opik()
dataset = client.get_or_create_dataset(name="My dataset", project_name="my-project")

dataset.read_jsonl_from_file("path/to/file.jsonl")
```
</CodeBlocks>
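A JSONL file contains one JSON object per line, mirroring the dictionary structure of dataset items. A sketch of producing such a file with the standard library; the field names are simply the ones used in the examples above:

```python
import json

items = [
    {"user_question": "Hello, world!", "expected_output": {"assistant_answer": "Hello, world!"}},
    {"user_question": "What is the capital of France?", "expected_output": {"assistant_answer": "Paris"}},
]

# Write one JSON object per line (the JSONL convention).
with open("dataset_items.jsonl", "w", encoding="utf-8") as f:
    for item in items:
        f.write(json.dumps(item) + "\n")
```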
#### Inserting items from a Pandas DataFrame
You can also insert items from a Pandas DataFrame:
<CodeBlocks>
```python title="Python" language="python"
import opik
client = opik.Opik()
dataset = client.get_or_create_dataset(name="My dataset", project_name="my-project")
dataset.insert_from_pandas(dataframe=df)
# You can also specify an optional keys_mapping parameter
dataset.insert_from_pandas(dataframe=df, keys_mapping={"Expected output": "expected_output"})
```
</CodeBlocks>
You can delete items in a dataset by using the `delete` method:
<CodeBlocks>
```typescript title="TypeScript" language="typescript"
// Get a dataset
const client = new Opik();
const dataset = await client.getDataset("My dataset");

await dataset.delete(["123", "456"]);

// Or to delete all items
await dataset.clear();
```

```python title="Python" language="python"
from opik import Opik

# Get a dataset
client = Opik()
dataset = client.get_dataset(name="My dataset")

dataset.delete(items_ids=["123", "456"])

# Or to delete all items
dataset.clear()
```
</CodeBlocks>
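`delete` takes item ids, which you would typically collect from `get_items()`. A plain-Python sketch of selecting ids by a predicate; the item shape here (dicts with an `id` key) is an assumption for illustration:

```python
# Hypothetical items, as if returned from the dataset: dicts carrying an "id".
items = [
    {"id": "123", "tags": ["failed"]},
    {"id": "456", "tags": ["failed", "production"]},
    {"id": "789", "tags": ["validated"]},
]

# Collect the ids of items matching a predicate...
ids_to_delete = [item["id"] for item in items if "failed" in item["tags"]]

# ...then pass them to the delete call shown above:
# dataset.delete(items_ids=ids_to_delete)
```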
You can download a dataset from Opik using the `get_dataset` method:
<CodeBlocks>
```typescript title="TypeScript" language="typescript"
const client = new Opik();
const dataset = await client.getDataset("My dataset");

const items = await dataset.getItems();
console.log(items);
```

```python title="Python" language="python"
from opik import Opik

client = Opik()
dataset = client.get_dataset(name="My dataset")

# Get items as list of DatasetItem objects
items = dataset.get_items()

# Convert to a Pandas DataFrame
dataset.to_pandas()

# Convert to a JSON array
dataset.to_json()
```
</CodeBlocks>
You can filter dataset items using the `filter_string` parameter on the `get_items()` method or when
running evaluations with `evaluate_prompt()`. This allows you to work with specific subsets of your data.
<CodeBlocks>
```python title="Python" language="python"
from opik import Opik

client = Opik()
dataset = client.get_dataset(name="my_dataset")

failed_items = dataset.get_items(filter_string='tags contains "failed"')
```
</CodeBlocks>
### Filter syntax
The filter string uses Opik Query Language (OQL) syntax. Supported columns include:
| Column | Type | Description |
|--------|------|-------------|
| `id` | String | Unique identifier for the dataset item |
| `source` | String | Source of the dataset item |
| `trace_id` | String | Associated trace ID |
| `span_id` | String | Associated span ID |
| `data` | Dictionary | Use dot notation for nested fields (e.g., `data.category`) |
| `tags` | List | Use "contains" operator (e.g., `tags contains "test"`) |
| `created_at` | DateTime | ISO 8601 format (e.g., `created_at >= "2024-01-01T00:00:00Z"`) |
| `last_updated_at` | DateTime | ISO 8601 format |
| `created_by` | String | User who created the item |
| `last_updated_by` | String | User who last updated the item |
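Filter strings are ordinary strings, so you can also assemble them programmatically. A hypothetical helper that joins individual conditions with `AND`; the helper name and the quoting conventions here are inferred from the examples in this section, not an official API:

```python
from typing import List

def build_filter(conditions: List[str]) -> str:
    """Join individual OQL conditions with AND."""
    return " AND ".join(conditions)

filter_string = build_filter([
    'tags contains "production"',
    'data.difficulty = "hard"',
])
# filter_string == 'tags contains "production" AND data.difficulty = "hard"'
```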
### Filter examples
<CodeBlocks>
```python title="Python" language="python"
from opik import Opik
client = Opik()
dataset = client.get_dataset(name="my_dataset")
# Filter by tag
failed_items = dataset.get_items(filter_string='tags contains "failed"')
# Filter by data field
finance_items = dataset.get_items(filter_string='data.category = "finance"')
# Filter by date
recent_items = dataset.get_items(
filter_string='created_at >= "2024-06-01T00:00:00Z"'
)
# Multiple conditions
filtered_items = dataset.get_items(
filter_string='tags contains "production" AND data.difficulty = "hard"'
)
```
</CodeBlocks>
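If you need logic beyond what a single `filter_string` expresses, you can always post-process the returned items in Python. A sketch that unions the results of two separate queries by id; the item shape (dicts with an `id` key) is an assumption for illustration:

```python
# Hypothetical item dicts, as if returned by two separate get_items() calls.
hard_items = [{"id": "a1"}, {"id": "a2"}]
failed_items = [{"id": "a2"}, {"id": "b7"}]

# Union by id so overlapping items appear only once.
by_id = {item["id"]: item for item in hard_items + failed_items}
combined = list(by_id.values())
```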
When you run an experiment, Opik automatically links it to the specific dataset version that was used. This ensures complete reproducibility: you can always know exactly which data was used for any experiment.

Every experiment records which dataset version it used; if you don't specify one, the `latest` version is used. This association is permanent. Even if you later modify the dataset, your experiment results remain linked to the original version used.
When running experiments from the Playground, you can choose which dataset version to run against; select `latest` for the most recent version.

When running experiments programmatically, you can specify which dataset version to use by passing a `DatasetVersion` object to `evaluate()`:
```python title="Python" language="python"
from opik import Opik
from opik.evaluation import evaluate

client = Opik()
dataset = client.get_dataset(name="My dataset")

# Run experiment on the latest version (default)
result = evaluate(
    experiment_name="baseline-experiment",
    dataset=dataset,
    task=my_task_function,
    scoring_metrics=[my_metric],
    project_name="my-project",
)

# Run experiment on a specific version
v1_view = dataset.get_version_view("v1")
result = evaluate(
    experiment_name="v1-experiment",
    dataset=v1_view,  # Pass the DatasetVersion object
    task=my_task_function,
    scoring_metrics=[my_metric],
    project_name="my-project",
)
```
```typescript title="TypeScript" language="typescript"
import { Opik, evaluate } from "opik";
const client = new Opik();
const dataset = await client.getDataset("My dataset");
// Run experiment on the latest version (default)
const result = await evaluate({
experimentName: "baseline-experiment",
dataset: dataset,
task: myTaskFunction,
scoringMetrics: [myMetric],
projectName: "my-project",
});
// Run experiment on a specific version
const v2 = await dataset.getVersionView("v2");
const pinnedResult = await evaluate({
experimentName: "pinned-experiment",
dataset: v2,
task: myTaskFunction,
scoringMetrics: [myMetric],
projectName: "my-project",
});
```
The SDK provides methods for inspecting and working with dataset versions:
<CodeBlocks>
```python title="Python" language="python"
from opik import Opik

client = Opik()
dataset = client.get_dataset(name="My dataset")

# Get the current (latest) version name
current_version = dataset.get_current_version_name()
print(f"Current version: {current_version}")  # e.g., "v3"

# Get detailed version info
version_info = dataset.get_version_info()
print(f"Version ID: {version_info.id}")
print(f"Version name: {version_info.version_name}")
print(f"Items total: {version_info.items_total}")
print(f"Created at: {version_info.created_at}")

# Get a read-only view of a specific version
v1_view = dataset.get_version_view("v1")

# Access version metadata
print(f"Version: {v1_view.version_name}")
print(f"Items in v1: {v1_view.items_total}")
print(f"Items added: {v1_view.items_added}")
print(f"Items modified: {v1_view.items_modified}")
print(f"Items deleted: {v1_view.items_deleted}")

# Get items from a specific version
v1_items = v1_view.get_items()

# Export version data
v1_df = v1_view.to_pandas()
v1_json = v1_view.to_json()
```
```typescript title="TypeScript" language="typescript"
import { Opik } from "opik";
const client = new Opik();
const dataset = await client.getDataset("My dataset");
// Get the current (latest) version name
const currentVersion = await dataset.getCurrentVersionName();
console.log(`Current version: ${currentVersion}`); // e.g., "v3"
// Get detailed version info (returns DatasetVersionPublic)
const versionInfo = await dataset.getVersionInfo();
console.log(`Version ID: ${versionInfo?.id}`);
console.log(`Version name: ${versionInfo?.versionName}`);
console.log(`Items total: ${versionInfo?.itemsTotal}`);
console.log(`Created at: ${versionInfo?.createdAt}`);
// Get a read-only view of a specific version
const v1View = await dataset.getVersionView("v1");
// Access version metadata
console.log(`Version: ${v1View.versionName}`);
console.log(`Items in v1: ${v1View.itemsTotal}`);
console.log(`Items added: ${v1View.itemsAdded}`);
console.log(`Items modified: ${v1View.itemsModified}`);
console.log(`Items deleted: ${v1View.itemsDeleted}`);
// Get items from a specific version
const v1Items = await v1View.getItems();
// Export version data as JSON
const v1Json = await v1View.toJson();
```
</CodeBlocks>
Dataset expansion allows you to use AI to generate additional synthetic samples based on your existing dataset. This is particularly useful when you have a small dataset and want to create more diverse test cases to improve your evaluation coverage.
The AI analyzes the patterns in your existing data and generates new samples that follow similar structures while introducing variations. This helps you broaden evaluation coverage without manually authoring every test case.
To expand a dataset with AI, use the Expand with AI action on your dataset and configure the following options:
* **Sample Count**: Start with a smaller number (10-20) to review the quality before generating larger batches.
* **Preserve Fields**: Use this to maintain consistency in certain fields while allowing variation in others. For example, preserve the `category` field while varying the `input` and `expected_output`.
* **Variation Instructions**: Provide specific guidance on how the generated samples should differ from your existing data.
Tags are a powerful way to organize, categorize, and filter your dataset items. You can use tags to:

* Categorize items by type (e.g., `edge-case`, `production`, `multilingual`)
* Track data provenance (e.g., `user-feedback`, `synthetic`, `real-world`)
* Mark review status (e.g., `needs-review`, `validated`, `archived`)

Each dataset item can have multiple tags.
To add tags to a single dataset item, open the item's details panel and add the tags there.
You can remove tags by clicking the "×" icon next to any tag in the details panel.
To add the same tag to multiple dataset items at once, select the items in the table and apply the tag from the toolbar.
This is particularly useful when you want to categorize a group of related test cases or mark items from the same data source.
<Tip>
Tags are case-sensitive and support alphanumeric characters, hyphens, and underscores. Choose consistent naming conventions for your tags to make filtering easier.
</Tip>

Once you've tagged your dataset items, you can filter them to work with specific subsets.
The dataset items table will update to show only items matching your filter criteria.
The filter is saved in the URL, so you can bookmark or share specific filtered views of your dataset.
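Based on the constraints in the tip above, you can sanity-check tag names client-side before applying them. A hypothetical validator; the exact server-side rules may differ:

```python
import re

# Alphanumerics, hyphens, and underscores, per the tip above (assumed pattern).
TAG_RE = re.compile(r"^[A-Za-z0-9_-]+$")

def is_valid_tag(tag: str) -> bool:
    """Return True if the tag only uses the characters described above."""
    return bool(TAG_RE.match(tag))
```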
Opik supports bulk operations for efficiently managing large datasets. These operations help you work with many items at once without tedious individual selections.
When working with datasets that span multiple pages, use Select all to select every item across all pages rather than just the current page. This works with filtered views too: if you have a filter applied, "Select all" only selects items matching that filter.
Once you have items selected, the toolbar shows the available operations, such as tagging or deleting the selected items.
For large bulk operations, consider using the SDK instead, which avoids the UI's row limits and gives you full programmatic control.