When you are running multi-turn conversations using frameworks that support LLM agents, the Opik integration will automatically group related traces into conversation threads using parameters suitable for each framework.
This guide shows how to evaluate conversation threads in Opik using the evaluate_threads function in the Python SDK.
The function takes a filter string that selects the threads to evaluate and a list of metrics to apply to each thread, and it returns a ThreadsEvaluationResult object containing the evaluation results and feedback scores.
Most importantly, this function automatically uploads the feedback scores to your traces in Opik! So, once evaluation is completed, you can also see the results in the UI.
To run the threads evaluation, you can use the following code:
from opik.evaluation import evaluate_threads
from opik.evaluation.metrics import ConversationalCoherenceMetric, UserFrustrationMetric

# Initialize the evaluation metrics
conversation_coherence_metric = ConversationalCoherenceMetric()
user_frustration_metric = UserFrustrationMetric()

# Run the threads evaluation
results = evaluate_threads(
    project_name="ai_team",
    filter_string='id = "0197ad2a"',
    eval_project_name="ai_team_evaluation",
    metrics=[
        conversation_coherence_metric,
        user_frustration_metric,
    ],
    trace_input_transform=lambda x: x["input"],
    trace_output_transform=lambda x: x["output"],
)
Threads consist of multiple traces, and each trace has an input and an output. In practice, these typically contain user messages and agent responses. However, trace inputs and outputs are rarely simple strings; they are usually complex data structures whose exact format depends on your agent framework.
To handle this complexity, you need to provide trace_input_transform and trace_output_transform functions. These are critical parameters that tell Opik how to extract the actual message content from your framework-specific trace structure.
Different agent frameworks structure their trace data differently:
- `{"messages": [{"content": "..."}]}`
- `{"task": {"description": "..."}}`

Without transform functions, Opik wouldn't know where to find the actual user questions and agent responses within your trace data.
Using these functions, the Opik evaluation engine will convert your threads chosen for evaluation into the standardized format expected by all Opik thread evaluation metrics:
[
    {
        "role": "user",
        "content": "input string from trace 1"
    },
    {
        "role": "assistant",
        "content": "output string from trace 1"
    },
    {
        "role": "user",
        "content": "input string from trace 2"
    },
    {
        "role": "assistant",
        "content": "output string from trace 2"
    }
]
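
To make the conversion concrete, here is a minimal sketch of how a list of traces could be flattened into that message format. The `build_conversation` helper and the sample trace dictionaries are hypothetical and exist only to illustrate the idea; this is not Opik's internal implementation.

```python
from typing import Any, Callable, Dict, List

JsonDict = Dict[str, Any]


def build_conversation(
    traces: List[JsonDict],
    input_transform: Callable[[JsonDict], str],
    output_transform: Callable[[JsonDict], str],
) -> List[Dict[str, str]]:
    """Flatten traces into alternating user/assistant messages.

    Hypothetical helper for illustration only. Each trace is assumed to be
    a dict with "input" and "output" keys holding framework-specific data.
    """
    conversation: List[Dict[str, str]] = []
    for trace in traces:
        # The transforms receive the raw trace input/output and return plain text.
        conversation.append({"role": "user", "content": input_transform(trace["input"])})
        conversation.append({"role": "assistant", "content": output_transform(trace["output"])})
    return conversation


# Made-up traces, ordered from oldest to newest.
sample_traces = [
    {"input": {"input": "Hi"}, "output": {"output": "Hello! How can I help?"}},
    {"input": {"input": "What are your hours?"}, "output": {"output": "We are open 9 to 5."}},
]

messages = build_conversation(
    sample_traces,
    input_transform=lambda x: x["input"],
    output_transform=lambda x: x["output"],
)
print(messages)
```

Each trace contributes exactly one user message and one assistant message, so a thread with N traces yields 2N messages.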
Example:
If your trace input has the following structure:
{
    "content": {
        "user_question": "Tell me about your service?"
    },
    "metadata": {...}
}
Then your trace_input_transform should be:
lambda x: x["content"]["user_question"]
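
For nested or optional fields, a named function can be easier to read and debug than a lambda. The sketch below shows one way such transforms might be written for the structure above and for a `{"messages": [...]}` shape. The function names and the empty-string fallback are illustrative assumptions, not something Opik requires.

```python
from typing import Any, Dict


def extract_user_question(trace_input: Dict[str, Any]) -> str:
    """Return the user question from a trace input shaped like the example above."""
    # Falling back to an empty string when keys are missing is an
    # illustrative choice; you may prefer to raise an error instead.
    content = trace_input.get("content") or {}
    return str(content.get("user_question", ""))


def extract_last_message(trace_input: Dict[str, Any]) -> str:
    """Return the content of the last message for a {"messages": [...]} shape."""
    messages = trace_input.get("messages") or []
    if not messages:
        return ""
    return str(messages[-1].get("content", ""))


example_input = {
    "content": {"user_question": "Tell me about your service?"},
    "metadata": {},
}
print(extract_user_question(example_input))
print(extract_last_message({"messages": [{"content": "first"}, {"content": "latest"}]}))
```

A function like `extract_user_question` can then be passed as trace_input_transform in place of the lambda.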
The evaluate_threads function takes a filter string as an argument. This string is used to select the threads that
should be evaluated. For example, if you want to evaluate only threads that have a specific ID, you can use the
following filter string:
filter_string='id = "0197ad2a"'
You can combine multiple filter strings using the AND operator. For example, if you want to evaluate only threads
that have a specific ID and were created after a certain date, you can use the following filter string:
filter_string='id = "0197ad2a" AND start_time > "2024-01-01T00:00:00Z"'
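
If you build filter strings programmatically, a small helper can keep the quoting and timestamp format consistent. The following is a minimal sketch; `combine_filters` and `started_after` are hypothetical helper names, not part of the Opik SDK.

```python
from datetime import datetime, timezone
from typing import List


def combine_filters(conditions: List[str]) -> str:
    """Join individual conditions with AND."""
    return " AND ".join(conditions)


def started_after(moment: datetime) -> str:
    """Build a start_time condition with a UTC timestamp such as 2024-01-01T00:00:00Z."""
    stamp = moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f'start_time > "{stamp}"'


filter_string = combine_filters([
    'id = "0197ad2a"',
    started_after(datetime(2024, 1, 1, tzinfo=timezone.utc)),
])
print(filter_string)
```

The resulting string can be passed as the filter_string argument of evaluate_threads.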
Supported filter fields and operators
The evaluate_threads function supports the following filter fields in the filter_string using Opik Query Language (OQL).
All fields and operators are the same as those supported by search_traces and search_spans:
| Field | Type | Operators |
|---|---|---|
| `id` | String | `=`, `!=`, `contains`, `not_contains`, `starts_with`, `ends_with`, `>`, `<` |
| `name` | String | `=`, `!=`, `contains`, `not_contains`, `starts_with`, `ends_with`, `>`, `<` |
| `created_by` | String | `=`, `!=`, `contains`, `not_contains`, `starts_with`, `ends_with`, `>`, `<` |
| `thread_id` | String | `=`, `!=`, `contains`, `not_contains`, `starts_with`, `ends_with`, `>`, `<` |
| `type` | String | `=`, `!=`, `contains`, `not_contains`, `starts_with`, `ends_with`, `>`, `<` |
| `model` | String | `=`, `!=`, `contains`, `not_contains`, `starts_with`, `ends_with`, `>`, `<` |
| `provider` | String | `=`, `!=`, `contains`, `not_contains`, `starts_with`, `ends_with`, `>`, `<` |
| `status` | String | `=`, `contains`, `not_contains` |
| `start_time` | DateTime | `=`, `>`, `<`, `>=`, `<=` |
| `end_time` | DateTime | `=`, `>`, `<`, `>=`, `<=` |
| `input` | String | `=`, `contains`, `not_contains` |
| `output` | String | `=`, `contains`, `not_contains` |
| `metadata` | Dictionary | `=`, `contains`, `>`, `<` |
| `feedback_scores` | Numeric | `=`, `>`, `<`, `>=`, `<=`, `is_empty`, `is_not_empty` |
| `tags` | List | `contains` |
| `usage.total_tokens` | Numeric | `=`, `!=`, `>`, `<`, `>=`, `<=` |
| `usage.prompt_tokens` | Numeric | `=`, `!=`, `>`, `<`, `>=`, `<=` |
| `usage.completion_tokens` | Numeric | `=`, `!=`, `>`, `<`, `>=`, `<=` |
| `duration` | Numeric | `=`, `!=`, `>`, `<`, `>=`, `<=` |
| `number_of_messages` | Numeric | `=`, `!=`, `>`, `<`, `>=`, `<=` |
| `total_estimated_cost` | Numeric | `=`, `!=`, `>`, `<`, `>=`, `<=` |
Rules:
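- Nested keys are addressed with dot notation, for example `metadata.model`, `feedback_scores.accuracy`
- Multiple conditions can be combined with AND (OR is not supported)

The feedback_scores field is a dictionary where the keys are the metric names and the values are the metric values.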
You can use it to filter threads based on their feedback scores. For example, if you want to evaluate only threads
that have a specific user frustration score, you can use the following filter string:
filter_string='feedback_scores.user_frustration_score >= 0.5'
Where user_frustration_score is the name of the user frustration metric and 0.5 is the threshold value to filter by.
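
Because each field accepts only certain operators, it can help to check a condition locally before running an evaluation. The sketch below hard-codes a small subset of the table of supported fields; `SUPPORTED_OPERATORS` and `check_condition` are hypothetical names used for illustration, and Opik performs its own validation of filter strings.

```python
from typing import Dict, Set

# Subset of the fields and operators listed in the table of supported fields.
SUPPORTED_OPERATORS: Dict[str, Set[str]] = {
    "status": {"=", "contains", "not_contains"},
    "start_time": {"=", ">", "<", ">=", "<="},
    "tags": {"contains"},
    "feedback_scores": {"=", ">", "<", ">=", "<=", "is_empty", "is_not_empty"},
    "number_of_messages": {"=", "!=", ">", "<", ">=", "<="},
}


def check_condition(field: str, operator: str) -> None:
    """Raise ValueError if the operator is not listed for the field."""
    # Dot notation such as feedback_scores.user_frustration_score is
    # validated against the top-level field name.
    base_field = field.split(".", 1)[0]
    if base_field not in SUPPORTED_OPERATORS:
        raise ValueError(f"Unknown or unlisted field: {field}")
    if operator not in SUPPORTED_OPERATORS[base_field]:
        allowed = ", ".join(sorted(SUPPORTED_OPERATORS[base_field]))
        raise ValueError(f"{field} does not support '{operator}'; allowed: {allowed}")


# A supported combination passes silently.
check_condition("feedback_scores.user_frustration_score", ">=")

# An unsupported combination raises ValueError.
try:
    check_condition("tags", "=")
except ValueError as error:
    print(error)
```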
Once the evaluation is complete, you can access the results in the Opik UI. You will see not only the score values but also the LLM-judge reasoning behind them.
<Frame> </Frame>

<Note>
**SDK Evaluation vs. Manual Feedback:**

- When using the SDK's `evaluate_threads` function, only threads marked as "inactive" (after the cooldown period) are evaluated. This ensures you're scoring complete conversations.
- You can manually add feedback scores to any thread at any time through the UI or API, regardless of its status.
- For thread-level online evaluation rules (automatic scoring), Opik waits for a configurable "cooldown period" after the last activity before running the rules.
</Note>

Team-based thread evaluation enables multiple evaluators to score conversation threads independently, providing more reliable assessment of multi-turn dialogue quality.
Key benefits for thread evaluation:
This collaborative approach is especially valuable for conversational threads where dialogue quality, context maintenance, and user experience assessment often require multiple expert perspectives.
For more details on what metrics can be used to score conversational threads, refer to the conversational metrics page.
You can also define custom metrics to evaluate conversational threads, including LLM-as-a-Judge (LLM-J) reasoning metrics, as described in the Custom Conversation Metrics guide.