Opik provides two complementary approaches to evaluating your LLM application. Understanding when to use each will help you build a robust evaluation strategy.
Test Suites let you define expected behaviors as natural-language assertions. An LLM judge checks each assertion against your agent's output and reports pass/fail results.
Best for:
- Behavioral testing: checking that specific expected behaviors hold
- Regression checks: catching breakage when you change prompts or models
A Test Suite has three main components:
- **Test cases**: the inputs your agent is run against
- **Assertions**: natural-language statements describing expected behavior
- **`pass_threshold`**: the bar the results must clear for the suite to pass

Assertions can be defined at two levels: for the suite as a whole (applied to every test case) or for an individual test case.
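To make this concrete, here is a minimal conceptual sketch of the judge loop in plain Python. This is not Opik's Test Suite API: `call_agent`, `llm_judge`, and the suite structure are hypothetical stand-ins that show how suite-level and case-level assertions combine with a `pass_threshold`.

```python
def call_agent(prompt: str) -> str:
    """Stand-in for the agent under test."""
    return "Returns are accepted within 30 days of purchase."


def llm_judge(assertion: str, output: str) -> bool:
    """Stand-in for an LLM judge. A real judge prompts a model to
    answer pass/fail; a trivial check keeps this sketch runnable."""
    return "30 days" in output


suite = {
    "pass_threshold": 0.8,  # fraction of assertion checks that must pass
    # Suite-level assertions apply to every test case.
    "assertions": ["Does not invent policy details"],
    "cases": [
        {
            "input": "Can I return a laptop after 40 days?",
            # Case-level assertions apply only to this case.
            "assertions": ["Mentions the 30-day return window"],
        },
    ],
}


def run_suite(suite: dict) -> bool:
    verdicts = []
    for case in suite["cases"]:
        output = call_agent(case["input"])
        for assertion in suite["assertions"] + case.get("assertions", []):
            verdicts.append(llm_judge(assertion, output))
    pass_rate = sum(verdicts) / len(verdicts)
    return pass_rate >= suite["pass_threshold"]


print(run_suite(suite))  # True when the pass rate meets the threshold
```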
Dataset-based evaluation scores your agent's outputs using quantitative metrics. You define a dataset of test cases, run your agent against them, and score the results using pre-built or custom metrics.
Best for:
- Quality measurement: scoring outputs with numeric metrics
- Benchmarking: comparing prompts, models, or agent versions over time
A dataset-based evaluation has three main components:
- **Dataset**: the test cases to evaluate against
- **Task**: the function that runs your agent on each dataset item
- **Metrics**: scoring functions for the outputs (e.g., `Hallucination`, `AnswerRelevance`, custom metrics)

Each evaluation run creates an Experiment: a record of every dataset item, your agent's output, and the metric scores. Experiments are stored in Opik so you can compare them side by side.
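As a rough sketch with the Python SDK, the flow looks like this. `my_agent` is a placeholder for your application, the dataset and experiment names are illustrative, and the LLM-judge metrics assume a model provider is configured:

```python
from opik import Opik
from opik.evaluation import evaluate
from opik.evaluation.metrics import AnswerRelevance, Hallucination


def my_agent(question: str) -> str:
    """Placeholder for your LLM application."""
    return "Opik supports Test Suites and dataset-based evaluations."


# 1. A dataset of test cases.
client = Opik()
dataset = client.get_or_create_dataset(name="eval-concepts-demo")
dataset.insert([
    {
        "input": "What evaluation approaches does Opik offer?",
        "context": ["Opik offers Test Suites and dataset-based evaluations."],
    },
])


# 2. A task that runs your agent on each item and returns the fields
#    the metrics score against.
def task(item: dict) -> dict:
    return {
        "input": item["input"],
        "output": my_agent(item["input"]),
        "context": item["context"],
    }


# 3. Metrics score every output; the run is stored as an Experiment.
evaluate(
    dataset=dataset,
    task=task,
    scoring_metrics=[Hallucination(), AnswerRelevance()],
    experiment_name="concepts-demo",
)
```

Each call to `evaluate` creates a new Experiment, so re-running after a change lets you compare scores side by side in the Opik UI.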
| | Test Suites | Datasets & Metrics |
|---|---|---|
| Output | Pass/fail per assertion | Numeric scores per metric |
| Evaluation method | LLM judge checks natural-language assertions | Scoring functions (LLM-based or heuristic) |
| Best for | Behavioral testing, regression checks | Quality measurement, benchmarking |
| Iteration style | Update assertions, re-run suite | Update dataset or metrics, re-run experiment |
You can use both approaches together. For example, use Test Suites during development to validate specific behaviors, and Datasets & Metrics in CI to track quality scores over time.