apps/opik-documentation/documentation/fern/docs/evaluation/metrics/custom_metric.mdx
Opik allows you to define your own custom metrics. This is especially relevant when the metrics you need are not already available out of the box.

If you want to write an LLM-as-a-judge metric, you can either use the G-Eval metric or create your own from scratch.
To define a custom metric, you need to subclass the `BaseMetric` class and implement the `score` method and an optional `ascore` method:
```python
from typing import Any, Optional

from opik.evaluation.metrics import base_metric, score_result
from opik.message_processing.emulation.models import SpanModel


class MyCustomMetric(base_metric.BaseMetric):
    def __init__(self, name: str):
        super().__init__(name)

    def score(self, input: str, output: str, task_span: Optional[SpanModel] = None, **ignored_kwargs: Any) -> score_result.ScoreResult:
        # Add your logic here
        return score_result.ScoreResult(
            value=0,
            name=self.name,
            reason="Optional reason for the score"
        )
```
The `score` method has access to the following parameters:

- If the dataset item contains `{"input": "...", "expected_output": "..."}`, the `score` method will receive `input` and `expected_output` parameters.
- If the evaluation task returns `{"output": "..."}`, the `score` method will receive an `output` parameter.
- If you define a `task_span` parameter, we will pass the full evaluation task trace to your `score` method. If you don't need access to the trajectory data, we recommend not defining the `task_span` parameter.

The `score` method should return a `ScoreResult` object. The `ascore` method is optional and can be used to compute the score asynchronously if needed.
Now you can use the custom metric to score LLM outputs:
```python
metric = MyCustomMetric(name="My custom metric")
metric.score(input="What is the capital of France?", output="Paris")
```
Also, this metric can now be used in the `evaluate` function as explained here: Evaluating LLMs.
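To make the parameter mapping concrete, here is a minimal sketch of running this metric through `evaluate`. The dataset name, dataset items, and task logic below are placeholder assumptions; see the Evaluating LLMs guide for the full API:

```python
from opik import Opik
from opik.evaluation import evaluate

client = Opik()

# Hypothetical dataset: each item contains "input" and "expected_output",
# so the metric's score method can receive them as parameters by name.
dataset = client.get_or_create_dataset(name="my-test-dataset")
dataset.insert([
    {"input": "What is the capital of France?", "expected_output": "Paris"},
])


def evaluation_task(dataset_item):
    # Call your LLM application here; the "output" key returned by the task
    # becomes the `output` parameter of the metric's score method.
    return {"output": "Paris"}


evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[MyCustomMetric(name="My custom metric")],
)
```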
You can access the trajectory data in a custom metric by using the `task_span` parameter:
```python
from typing import Any

from opik.evaluation.metrics import base_metric, score_result
from opik.message_processing.emulation.models import SpanModel


class MyCustomMetric(base_metric.BaseMetric):
    def __init__(self, name: str):
        super().__init__(name)

    def score(self, task_span: SpanModel, **ignored_kwargs: Any) -> score_result.ScoreResult:
        # Add your logic here
        return score_result.ScoreResult(
            value=0,
            name=self.name,
            reason="Optional reason for the score"
        )
```
You can also implement an LLM-as-a-judge metric by subclassing the `BaseMetric` class and calling an LLM from the `score` method. In the example below, the judge uses the OpenAI Python client to check a claim for factual accuracy:
```python
import json
from typing import Any

from openai import OpenAI
from opik.evaluation.metrics import base_metric, score_result


class LLMJudgeMetric(base_metric.BaseMetric):
    def __init__(self, name: str = "Factuality check", model_name: str = "gpt-4o"):
        super().__init__(name)
        self.llm_client = OpenAI()
        self.model_name = model_name
        self.prompt_template = """
        You are an impartial judge evaluating the following claim for factual accuracy.
        Analyze it carefully and provide a binary score: true if the claim is accurate,
        false if it is inaccurate or contains errors.

        The format of your response should be a JSON object with no additional text or backticks that follows the format:
        {{
            "score": <true or false>
        }}

        Claim to evaluate: {output}

        Response:
        """

    def score(self, output: str, **ignored_kwargs: Any) -> score_result.ScoreResult:
        """
        Score the output of an LLM.

        Args:
            output: The output of an LLM to score.
            **ignored_kwargs: Any additional keyword arguments. This is important so that the metric can be used in the `evaluate` function.
        """
        # Construct the prompt based on the output of the LLM
        prompt = self.prompt_template.format(output=output)

        # Generate and parse the response from the LLM
        response = self.llm_client.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": prompt}]
        )
        response_dict = json.loads(response.choices[0].message.content)

        # Normalize the score to a boolean before logging it
        response_score = (
            response_dict["score"]
            if isinstance(response_dict["score"], bool)
            else str(response_dict["score"]).strip().lower() == "true"
        )

        return score_result.ScoreResult(
            name=self.name,
            value=response_score
        )
```
You can then use this metric to score your LLM outputs:
```python
metric = LLMJudgeMetric()
metric.score(output="Paris is the capital of France")
```
In this example, we used the OpenAI Python client to call the LLM, but you don't have to: you can update the code example above to use any LLM client you have access to.
In order to support a wide range of LLM providers, we recommend using the litellm library to call your LLM. This allows you to support hundreds of models without having to maintain a custom LLM client.
Opik provides a `LiteLLMChatModel` class that wraps the litellm library and can be used in your custom metric:
```python
import json
from typing import Any

from opik.evaluation.metrics import base_metric, score_result
from opik.evaluation import models


class LLMJudgeMetric(base_metric.BaseMetric):
    def __init__(self, name: str = "Factuality check", model_name: str = "gpt-4o"):
        super().__init__(name)
        self.llm_client = models.LiteLLMChatModel(model_name=model_name)
        self.prompt_template = """
        You are an impartial judge evaluating the following claim for factual accuracy. Analyze it carefully
        and respond with a number between 0 and 1: 1 if completely accurate, 0.5 if mixed accuracy, or 0 if inaccurate.
        Then provide one brief sentence explaining your ruling.

        The format of your response should be a JSON object with no additional text or backticks that follows the format:
        {{
            "score": <score between 0 and 1>,
            "reason": "<reason for the score>"
        }}

        Claim to evaluate: {output}

        Response:
        """

    def score(self, output: str, **ignored_kwargs: Any) -> score_result.ScoreResult:
        """
        Score the output of an LLM.

        Args:
            output: The output of an LLM to score.
            **ignored_kwargs: Any additional keyword arguments. This is important so that the metric can be used in the `evaluate` function.
        """
        # Construct the prompt based on the output of the LLM
        prompt = self.prompt_template.format(output=output)

        # Generate and parse the response from the LLM
        response = self.llm_client.generate_string(input=prompt)
        response_dict = json.loads(response)

        return score_result.ScoreResult(
            name=self.name,
            value=response_dict["score"],
            reason=response_dict["reason"]
        )
```
You can then use this metric to score your LLM outputs:
```python
metric = LLMJudgeMetric()
metric.score(output="Paris is the capital of France")
```
You can implement a metric that returns multiple scores, which will display as separate columns in the UI when using it in an evaluation.
To do so, set up your `score` method to return a list of `ScoreResult` objects:
```python
from typing import Any, List

from opik.evaluation.metrics import base_metric, score_result


class MultiScoreCustomMetric(base_metric.BaseMetric):
    def __init__(self, name: str):
        super().__init__(name)

    def score(self, input: str, output: str, **ignored_kwargs: Any) -> List[score_result.ScoreResult]:
        # Add your logic here
        return [
            score_result.ScoreResult(
                value=0,
                name=self.name,
                reason="Optional reason for the score"
            ),
            score_result.ScoreResult(
                value=1,
                name=f"{self.name}-2",
                reason="Optional reason for the score"
            ),
        ]
```
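You can call a multi-score metric directly in the same way as a single-score metric; it simply returns a list. A minimal usage sketch (the metric name below is an arbitrary placeholder):

```python
metric = MultiScoreCustomMetric(name="my_multi_score_metric")

results = metric.score(
    input="What is the capital of France?",
    output="Paris",
)

# Each ScoreResult is logged as its own column in the UI during an evaluation
for result in results:
    print(result.name, result.value, result.reason)
```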
In the examples above, we ask the LLM to respond with a JSON object. However, as this is not enforced, it is possible that the LLM returns a non-structured response. To avoid this, you can use the litellm library to enforce a structured output, which makes the custom metric more robust and less prone to failure.

For this, we define the format of the response we expect from the LLM in the `LLMJudgeBinaryResult` class and pass it to the LiteLLM client:
```python
import json
from typing import Any

from pydantic import BaseModel
from opik.evaluation.metrics import base_metric, score_result
from opik.evaluation import models


class LLMJudgeBinaryResult(BaseModel):
    score: bool
    reason: str


class LLMJudgeMetric(base_metric.BaseMetric):
    def __init__(self, name: str = "Factuality check", model_name: str = "gpt-4o"):
        super().__init__(name)
        self.llm_client = models.LiteLLMChatModel(model_name=model_name)
        self.prompt_template = """
        You are an impartial judge evaluating the following claim for factual accuracy. Analyze it carefully and provide a binary score: true if the claim is accurate, false if it is inaccurate or contains errors. Then provide one brief sentence explaining your ruling.

        The format of your response should be a JSON object with no additional text or backticks that follows the format:
        {{
            "score": <true or false>,
            "reason": "<reason for the score>"
        }}

        Claim to evaluate: {output}

        Response:
        """

    def score(self, output: str, **ignored_kwargs: Any) -> score_result.ScoreResult:
        """
        Score the output of an LLM.

        Args:
            output: The output of an LLM to score.
            **ignored_kwargs: Any additional keyword arguments. This is important so that the metric can be used in the `evaluate` function.
        """
        # Construct the prompt based on the output of the LLM
        prompt = self.prompt_template.format(output=output)

        # Generate a structured response from the LLM and parse it
        response = self.llm_client.generate_string(input=prompt, response_format=LLMJudgeBinaryResult)
        response_dict = json.loads(response)

        return score_result.ScoreResult(
            name=self.name,
            value=response_dict["score"],
            reason=response_dict["reason"]
        )
```
Similarly to the previous example, you can then use this metric to score your LLM outputs:
```python
metric = LLMJudgeMetric()
metric.score(output="Paris is the capital of France")
```
G-Eval allows you to specify a set of criteria for your metric; it uses a Chain of Thought prompting technique to create evaluation steps and return a score. You can read more about this advanced metric here.
To use G-Eval, you will need to specify a task introduction and evaluation criteria:
```python
from opik.evaluation.metrics import GEval

metric = GEval(
    task_introduction="You are an expert judge tasked with evaluating the faithfulness of an AI-generated answer to the given context.",
    evaluation_criteria="""
        The OUTPUT must not introduce new information beyond what's provided in the CONTEXT.
        The OUTPUT must not contradict any information given in the CONTEXT.

        Return only a score between 0 and 1.
    """,
)
```
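You can then score an output in the same way as the other metrics. One option is to serialize the fields referenced by your evaluation criteria (here `OUTPUT` and `CONTEXT`) into the `output` string passed to the metric; treat this shape as an illustrative assumption and adapt it to your own data:

```python
import json

metric.score(
    output=json.dumps({
        "OUTPUT": "Paris is the capital of France.",
        "CONTEXT": ["France is a country in Western Europe. Its capital is Paris."],
    })
)
```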
For evaluating multi-turn conversations and dialogue systems, you'll need specialized conversation metrics. These metrics evaluate entire conversation threads rather than single input-output pairs.
Learn how to create custom conversation metrics in the Custom Conversation Metrics guide.
Creating custom metrics is just the beginning of building a comprehensive evaluation system for your LLM applications. In this guide, you've learned how to create custom metrics using different approaches, from simple metrics to sophisticated LLM-as-a-judge implementations, including specialized conversation thread metrics for multi-turn dialogue evaluation.
From here, you might want to explore the other built-in metrics, learn how to run full evaluations in the Evaluating LLMs guide, or build conversation-level metrics with the Custom Conversation Metrics guide.