apps/opik-documentation/documentation/fern/docs/production/rules.mdx
When working with LLMs in production, the sheer number of traces means that it isn't possible to manually review each trace. Opik allows you to define LLM as a Judge metrics that will automatically score the LLM calls logged to the platform.
By defining LLM as a Judge metrics that run on all your production traces, you can automate the monitoring of your LLM calls for hallucinations, answer relevance, or any other task-specific metric.
Scoring rules can be defined through both the UI and the REST API.
To create a new scoring metric in the UI, first navigate to the project you would like to monitor. Once you have navigated to the rules tab, you will be able to create a new rule.
When creating a new rule, you will be presented with the following options:
100%, all traces will be scored.
`{{variable_name}}` format.

Opik comes pre-configured with 3 different LLM as a Judge metrics:
Opik's built-in LLM as a Judge metrics are very easy to use and are great for getting started. However, as you start working on more complex tasks, you may need to write your own LLM as a Judge metrics.
We typically recommend that you experiment with LLM as a Judge metrics during development using Opik's evaluation platform. Once you have a metric that works well for your use case, you can then use it in production.
When writing your own LLM as a Judge metric, you will need to specify the prompt variables using the mustache syntax, i.e. `{{ variable_name }}`. You can then map these variables to your trace data using the `variable_mapping` parameter. When the rule is executed, Opik will replace the variables with the values from the trace data.

You can control the format of the output using the Scoring definition parameter. This is where you define the scores you want the LLM as a Judge metric to return. Under the hood, Opik uses this definition together with the structured outputs functionality to ensure that the LLM as a Judge metric always returns trace scores.
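The variable substitution described above can be pictured with a small sketch. This is not Opik's implementation: `resolve_path` and `render_prompt` are hypothetical helpers, and the dotted-path form of the mapping values (for example `input.question`) and the shape of the `trace` dictionary are assumptions made for illustration.

```python
import re
from typing import Any, Dict


def resolve_path(data: Dict[str, Any], path: str) -> Any:
    """Walk a dotted path such as 'output.answer' through nested dicts."""
    current: Any = data
    for key in path.split("."):
        current = current[key]
    return current


def render_prompt(
    template: str, variable_mapping: Dict[str, str], trace: Dict[str, Any]
) -> str:
    """Replace each {{ name }} placeholder with the mapped value from the trace."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        return str(resolve_path(trace, variable_mapping[name]))

    # Accept both "{{name}}" and "{{ name }}" spellings of a placeholder.
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


# Hypothetical trace data and mapping, for illustration only.
trace = {
    "input": {"question": "What is Opik?"},
    "output": {"answer": "An LLM observability platform."},
}
mapping = {"question": "input.question", "answer": "output.answer"}
template = "Question: {{ question }}\nAnswer: {{answer}}"

rendered = render_prompt(template, mapping, trace)
```

In this sketch, a placeholder with no entry in the mapping raises a `KeyError`, which surfaces a misconfigured rule early instead of sending an incomplete prompt to the judge model.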
LLM as a Judge metrics can evaluate traces that contain images when using vision-capable models. This is useful for:
To reference image data from traces in your evaluation prompts:
Example rule configuration:
Prompt:
Evaluate the quality of this generated image.
Rate the image on the following criteria:
1. Visual clarity and resolution
2. Relevance to the prompt
3. Technical quality
Provide a score between 0 and 1.
Variable Mapping:
output_image → output.image_data (path in trace structure)
Model: Vision-capable model required
Supported image formats:
The scores returned by the online evaluation rules will be stored as feedback scores for each trace. This will allow you to review these scores in the traces sidebar and track their changes over time in the Opik dashboard.
You can also view the average feedback scores for all the traces in your project from the traces table.
It is also possible to define LLM as a Judge and Custom Python metrics that run on threads. This is useful for scoring entire conversations rather than just individual traces.
We have built-in templates for the LLM as a Judge metrics that you can use to score the entire conversation:
For the LLM as a Judge metrics, keep in mind that the only variable available is `{{context}}`, which contains the entire conversation as a list of message dictionaries:
[
{
"role": "user",
"content": "Hello, how are you?"
},
{
"role": "assistant",
"content": "I'm good, thank you!"
}
]
Similarly, for the Python metrics, you have the `Conversation` object available to you. This object is a `List[Dict]` where each dict represents a message in the conversation.
[
{
"role": "user",
"content": "Hello, how are you?"
},
{
"role": "assistant",
"content": "I'm good, thank you!"
}
]
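As a rough sketch of the kind of logic a thread-level Python metric might contain, the function below scores a conversation in the `List[Dict]` shape shown above. It is a plain function, not wrapped in Opik's metric classes, and both the `Conversation` alias and the `assistant_reply_ratio` name are hypothetical; how it is registered as a rule is left out.

```python
from typing import Dict, List

# Assumed shape: each message is a dict with "role" and "content" keys.
Conversation = List[Dict[str, str]]


def assistant_reply_ratio(conversation: Conversation) -> float:
    """Fraction of user messages that are immediately followed by an assistant message."""
    user_turns = 0
    answered = 0
    for index, message in enumerate(conversation):
        if message["role"] != "user":
            continue
        user_turns += 1
        has_next = index + 1 < len(conversation)
        if has_next and conversation[index + 1]["role"] == "assistant":
            answered += 1
    # An empty conversation (or one without user messages) scores 0.0.
    return answered / user_turns if user_turns else 0.0


conversation = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm good, thank you!"},
]
score = assistant_reply_ratio(conversation)
```

Because the metric receives the whole thread, it can look at relationships between messages, which a trace-level metric cannot do.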
For online scoring rules on threads, Opik waits for a "cooldown period" after the last activity in a thread before running the evaluation. This ensures the scoring is done on the full context of the conversation.
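The cooldown check can be thought of as a simple timestamp comparison. The sketch below is illustrative only: `is_ready_for_scoring` is a hypothetical name, the 15-minute value mirrors the documented default, and whether Opik treats the boundary as inclusive is an assumption.

```python
from datetime import datetime, timedelta, timezone

# Mirrors the documented default cooldown; configurable in practice.
DEFAULT_COOLDOWN = timedelta(minutes=15)


def is_ready_for_scoring(
    last_activity: datetime,
    now: datetime,
    cooldown: timedelta = DEFAULT_COOLDOWN,
) -> bool:
    """Return True once the thread has been inactive for at least the cooldown."""
    # Assumption: the boundary is inclusive (exactly 15 minutes counts as inactive).
    return now - last_activity >= cooldown


last_activity = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
ready_early = is_ready_for_scoring(last_activity, last_activity + timedelta(minutes=5))
ready_late = is_ready_for_scoring(last_activity, last_activity + timedelta(minutes=15))
```

Any new message in the thread moves `last_activity` forward, so the evaluation is postponed until the conversation has actually gone quiet.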
<Note> The default cooldown period is 15 minutes but can be configured at the workspace level under "Thread online scoring rule cooldown period". For self-hosted installations, set the `OPIK_TRACE_THREAD_TIMEOUT_TO_MARK_AS_INACTIVE` environment variable. </Note>

By default, a newly created online evaluation rule will only run on traces or threads logged after the rule was defined. To run a rule against historical data, you can trigger it manually from the UI: