When working with LLM applications, the evaluation process is often the bottleneck to iterating quickly. While it is possible to review your LLM application's output manually, doing so is slow and does not scale. Opik instead allows you to automate the evaluation of your LLM application.
In order to run evaluations in Opik, it helps to first become familiar with a few core concepts. In this section, we will walk through each of the concepts associated with Opik's evaluation platform: datasets, experiments, experiment configurations, and experiment items.
The first step in automating the evaluation of your LLM application is to create a dataset: a collection of samples that your LLM application will be evaluated on. Each dataset is made up of Dataset Items, which store the input, expected output, and other metadata for a single sample.
Given the importance of datasets in the evaluation process, teams often spend a significant amount of time curating and preparing them. There are three main ways to create a dataset:

1. **Manually curating examples**: As a first step, you can manually curate a set of examples based on your knowledge of the application you are building. You can also leverage subject matter experts to help create the dataset.
2. **Using synthetic data**: If you don't have enough data to create a diverse set of examples, you can turn to synthetic data generation tools. The LangChain cookbook has a great example of how to use synthetic data generation tools to create a dataset.
3. **Leveraging production data**: If your application is in production, you can use the data it generates to augment your dataset. While this is often not the first step in creating a dataset, it can be a great way to enrich it with real-world data.
If you are using Opik for production monitoring, you can easily add traces to your dataset by selecting them in the UI and choosing **Add to dataset** in the **Actions** dropdown.
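To make the dataset structure above concrete, here is a minimal sketch of dataset items as plain Python dictionaries, with the corresponding Opik SDK calls shown in comments. The dataset name and field values are illustrative; check the Opik SDK reference for the exact method names and accepted item fields.

```python
# Each dataset item stores the input, the expected output, and optional
# metadata for a single sample, as described above.
dataset_items = [
    {
        "input": "What is the capital of France?",
        "expected_output": "Paris",
        "metadata": {"category": "geography"},
    },
    {
        "input": "Which planet is known as the Red Planet?",
        "expected_output": "Mars",
        "metadata": {"category": "astronomy"},
    },
]

# With the Opik Python SDK, inserting these items would look roughly like:
#   from opik import Opik
#   client = Opik()
#   dataset = client.get_or_create_dataset(name="my-eval-dataset")  # name is illustrative
#   dataset.insert(dataset_items)

# Every item carries at least an input and an expected output.
for item in dataset_items:
    assert {"input", "expected_output"} <= item.keys()
```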
Experiments are the core building block of the Opik evaluation platform. Each time you run a new evaluation, a new experiment is created. Each experiment is made up of two main components:

1. **Experiment Configuration**: the metadata associated with the experiment, such as the prompt template, model, and parameters used.
2. **Experiment Items**: the input, expected output, actual output, and feedback scores for each dataset sample processed during the experiment.

In addition, for each experiment you will be able to see the average scores for each metric.
<Tip> You can update an experiment's name and configuration at any time from the Opik UI or through the SDKs. Learn more in the [Update Existing Experiment](/v1/evaluation/update_existing_experiment) guide. </Tip>

One of the main advantages of an automated evaluation platform is the ability to iterate quickly. The main drawback is that it can become difficult to track what has changed between two iterations of an experiment.
The experiment configuration object allows you to store metadata associated with a given experiment. This is useful for tracking things like the prompt template, the model, and the temperature used for that experiment.

You can then compare the configurations of two different experiments from the Opik UI to see what has changed.
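As a sketch of what an experiment configuration might capture, the dictionaries below record a prompt template, model, and temperature, and a small helper diffs two of them to show what changed between runs. The Opik SDK's `evaluate()` function accepts a similar dict via its `experiment_config` parameter (check the SDK reference for the exact signature); the diff helper is our own illustration, not part of the SDK.

```python
# Two experiment configurations that differ only in temperature.
config_v1 = {
    "prompt_template": "Answer concisely: {question}",
    "model": "gpt-4o",
    "temperature": 0.7,
}
config_v2 = {
    "prompt_template": "Answer concisely: {question}",
    "model": "gpt-4o",
    "temperature": 0.2,
}

def diff_configs(a: dict, b: dict) -> dict:
    """Return {key: (old, new)} for every key whose value differs."""
    keys = a.keys() | b.keys()
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

print(diff_configs(config_v1, config_v2))  # {'temperature': (0.7, 0.2)}
```

This mirrors what the Opik UI does when you compare two experiments side by side: only the keys whose values differ need attention.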
<Frame> </Frame>

Experiment items store the input, expected output, actual output, and feedback scores for each dataset sample that was processed during an experiment. In addition, a trace is associated with each item so you can easily understand why a given item scored the way it did.
<Frame> </Frame>

In addition to per-item metrics, you can compute experiment-level aggregate metrics that are calculated across all test results. These experiment scores are displayed in the Opik UI alongside feedback scores and can be used for sorting and filtering experiments.
Experiment scores are computed after all test results are collected using custom functions that take a list of test results and return aggregate metrics. Common use cases include computing maximum, minimum, or mean values across all test cases, or calculating custom statistics specific to your evaluation needs.
<Tip> Learn more about how to compute experiment-level metrics in the [Evaluate your LLM application](/v1/evaluation/evaluate_your_llm#computing-experiment-level-metrics) guide. </Tip>

The experiment interface supports multiple evaluators scoring the same experiment items. Each item can receive independent scores from different team members, with all ratings preserved and aggregated in the interface.
This approach helps identify experiment items where evaluators disagree significantly, with access to individual reasoning to understand why disagreements occurred and improve evaluation consistency.
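One simple way to surface disagreement, sketched below as our own illustration (not an Opik API): compute the spread between the highest and lowest score each item received, and flag items whose spread exceeds a threshold so their individual ratings can be reviewed.

```python
def flag_disagreements(
    ratings: dict[str, list[float]], threshold: float = 0.3
) -> list[str]:
    """Return ids of items whose max-min score spread exceeds the threshold."""
    flagged = []
    for item_id, scores in ratings.items():
        if len(scores) >= 2 and max(scores) - min(scores) > threshold:
            flagged.append(item_id)
    return flagged

# Scores from multiple evaluators per experiment item (illustrative values).
ratings = {
    "item-1": [0.9, 0.8, 0.85],  # evaluators broadly agree
    "item-2": [0.2, 0.9],        # strong disagreement, worth reviewing
}
print(flag_disagreements(ratings))  # ['item-2']
```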
<Frame> </Frame>

We have provided some guides to help you get started with Opik's evaluation platform: