Opik lets you evaluate multimodal prompts that combine text and images. You can run these experiments straight from the UI, or by using the SDKs. This page covers both flows, clarifies which models support image inputs, and explains how to customise model detection.
LLM-as-a-Judge experiments in the Opik UI accept image attachments on both the dataset rows and the prompt messages. When you configure an evaluation, select a vision-capable judge model (for example, `gpt-4o` or `claude-3-5-sonnet`). All multimodal traces appear in the evaluation results, so you can inspect exactly what the judge model received.
Both the Python and TypeScript SDKs accept OpenAI-style message payloads. Each message can contain either a string or a list of content blocks. Image blocks use the `image_url` type and can point to an `https://` URL or a `data:image/...;base64,` payload.
```python
from opik import Opik
from opik.evaluation import evaluate_prompt, metrics

client = Opik()
dataset = client.get_or_create_dataset("vision_captions")

dataset.insert(
    [
        {
            "input": {
                "image_source": "https://example.com/cat.jpg",
            },
            "reference": "A grey cat sitting on a sofa",
        },
        {
            "input": {
                "image_source": "data:image/png;base64,iVBORw0KGgo...",  # base64 works too
            },
            "reference": "An orange cat playing with a toy",
        },
    ]
)
```
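If your images live on disk rather than behind a URL, you can build the `data:` payload yourself before inserting it into the dataset. A minimal sketch using only the standard library; the `to_data_url` helper and the file path are illustrative, not part of the SDK:

```python
import base64
import mimetypes


def to_data_url(path: str) -> str:
    """Encode a local image file as a data URL usable in an image_source field."""
    mime_type, _ = mimetypes.guess_type(path)
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime_type or 'image/png'};base64,{encoded}"


dataset.insert(
    [{"input": {"image_source": to_data_url("./cat.png")}, "reference": "A grey cat"}]
)
```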
With the dataset in place, define the prompt messages and run the evaluation:

```python
MESSAGES = [
    {
        "role": "system",
        "content": "You are an assistant that analyses the attached image.",
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the following picture."},
            {
                "type": "image_url",
                "image_url": {"url": "{{image_source}}", "detail": "high"},
            },
        ],
    },
]

evaluate_prompt(
    dataset=dataset,
    messages=MESSAGES,
    model="gpt-4o-mini",
    scoring_metrics=[metrics.Equals()],  # compares output against the dataset `reference`
    project_name="my-project",
)
```
The evaluator uses LiteLLM-style model identifiers. Opik recognises popular multimodal families (OpenAI GPT-4o, Anthropic Claude 3+, Google Gemini 1.5, Meta Llama 3.2 Vision, Mistral Pixtral, etc.) and treats any model whose name ends with `-vision` or `-vl` as vision-capable. Provider prefixes such as `anthropic/` are stripped automatically. When a model is not recognised as vision-capable, Opik logs a warning and replaces image blocks with placeholders before making the API call.
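As a mental model, these detection rules amount to something like the sketch below. This is illustrative only, not Opik's actual implementation, and the family list is abbreviated:

```python
# Abbreviated list of known multimodal families (assumption, for illustration).
KNOWN_VISION_PREFIXES = ("gpt-4o", "claude-3", "gemini-1.5", "llama-3.2", "pixtral")


def looks_vision_capable(model: str) -> bool:
    # Strip a provider prefix such as "anthropic/" or "openai/".
    name = model.split("/", 1)[-1].lower()
    # Generic suffix rules: "-vision" and "-vl" mark vision-capable models.
    if name.endswith("-vision") or name.endswith("-vl"):
        return True
    # Otherwise fall back to the known multimodal families.
    return name.startswith(KNOWN_VISION_PREFIXES)


looks_vision_capable("anthropic/claude-3-opus")  # True
looks_vision_capable("gpt-3.5-turbo")            # False
```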
TypeScript support for multimodal evaluations is in progress. The TypeScript SDK will expose the same message structure and detection rules; we’ll update this section with a full example once the implementation lands.
If you are experimenting with a new provider, you can extend the registry at runtime:
```python
from opik.evaluation.models import ModelCapabilities

ModelCapabilities.add_vision_model("my-provider/sparrow-vision-beta")
```
Any subsequent evaluations in that process will treat the custom model as vision-capable.
To check how a given model will be treated, query the registry directly:

```python
from opik.evaluation.models import ModelCapabilities

ModelCapabilities.supports_vision("anthropic/claude-3-opus")  # True
```
If the call returns `False`, Opik logs a warning and flattens image blocks: the image data is inlined as text and truncated to the first 500 characters to keep prompts manageable.
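Conceptually, the flattening step behaves roughly like the sketch below. This is illustrative only; the placeholder format is an assumption, and only the 500-character cap comes from the behaviour described above:

```python
MAX_IMAGE_TEXT_LENGTH = 500  # cap described above


def flatten_image_blocks(content: list[dict]) -> str:
    """Replace image blocks with truncated text placeholders (illustrative sketch)."""
    parts = []
    for block in content:
        if block["type"] == "text":
            parts.append(block["text"])
        elif block["type"] == "image_url":
            url = block["image_url"]["url"]
            # Placeholder format is an assumption; only the truncation is documented.
            parts.append(f"[image: {url[:MAX_IMAGE_TEXT_LENGTH]}]")
    return "\n".join(parts)
```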
Image sources behave as follows:

- `https://` URLs must be publicly accessible.
- Inline `data:image/png;base64,iVBORw0...` payloads are supported.
- `detail` fields (`"low"`, `"high"`) are preserved and forwarded when present.

Opik also works with LangChain: it forwards the same OpenAI-style content blocks that LangChain expects, so structured messages with `image_url` dictionaries continue to work. A simple validation script is shown below:
```python
from langchain_core.messages import HumanMessage, SystemMessage, messages_to_dict
from langchain_core.prompts import ChatPromptTemplate

from opik.evaluation.models.langchain.message_converters import convert_to_langchain_messages

# Directly using LangChain message objects
plain_messages = [
    SystemMessage(content="You are an assistant."),
    HumanMessage(content="Describe the weather in Paris."),
]
convert_to_langchain_messages(messages_to_dict(plain_messages))

# Using a ChatPromptTemplate with multimodal content
chat_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful assistant."),
        (
            "user",
            [
                {"type": "text", "text": "Describe the following image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://python.langchain.com/img/phone_handoff.jpeg",
                        "detail": "high",
                    },
                },
            ],
        ),
    ]
)

rendered = chat_prompt.invoke({})
messages = messages_to_dict(rendered.messages)
convert_to_langchain_messages(messages)  # round-trip validation
```