Opik lets you evaluate multimodal prompts that combine text and images. You can run these experiments straight from the UI, or by using the SDKs. This page covers both flows, clarifies which models support image inputs, and explains how to customise model detection.
LLM-as-a-Judge experiments in the Opik UI accept image attachments on both the dataset rows and the prompt messages. When you configure an evaluation, select a vision-capable judge model (for example, `gpt-4o` or `claude-3-5-sonnet`). All multimodal traces appear in the evaluation results, so you can inspect exactly what the judge model received.
Both the Python and TypeScript SDKs accept OpenAI-style message payloads. Each message can contain either a string or a list of content blocks. Image blocks use the `image_url` type and can point to an `https://` URL or a `data:image/...;base64,` payload.
```python
from opik import Opik
from opik.evaluation import evaluate_prompt, metrics

client = Opik()
dataset = client.get_or_create_dataset("vision_captions")

dataset.insert(
    [
        {
            "input": {
                "image_source": "https://example.com/cat.jpg",
            },
            "reference": "A grey cat sitting on a sofa",
        },
        {
            "input": {
                "image_source": "data:image/png;base64,iVBORw0KGgo...",  # base64 works too
            },
            "reference": "An orange cat playing with a toy",
        },
    ]
)
```
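If your images live on disk rather than behind a URL, you can build the `data:` payload yourself before inserting it into the dataset. A minimal sketch using only the standard library; the `to_data_url` helper and the file path are illustrative, not part of the SDK:

```python
import base64
import mimetypes


def to_data_url(path: str) -> str:
    """Encode a local image file as a data URL usable in an image_source field."""
    mime_type, _ = mimetypes.guess_type(path)
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime_type or 'image/png'};base64,{encoded}"


dataset.insert(
    [{"input": {"image_source": to_data_url("./cat.png")}, "reference": "A grey cat"}]
)
```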
With the dataset in place, define the prompt messages and run the evaluation:

```python
MESSAGES = [
    {
        "role": "system",
        "content": "You are an assistant that analyses the attached image.",
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the following picture."},
            {
                "type": "image_url",
                "image_url": {"url": "{{image_source}}", "detail": "high"},
            },
        ],
    },
]

evaluate_prompt(
    dataset=dataset,
    messages=MESSAGES,
    model="gpt-4o-mini",
    scoring_metrics=[metrics.Equals()],  # compares output against the dataset `reference`
    project_name="my-project",
)
```
The evaluator uses LiteLLM-style model identifiers. Opik recognises popular multimodal families (OpenAI GPT-4o, Anthropic Claude 3+, Google Gemini 1.5, Meta Llama 3.2 Vision, Mistral Pixtral, etc.) and treats any model whose name ends with `-vision` or `-vl` as vision-capable. Provider prefixes such as `anthropic/` are stripped automatically. When a model is not recognised as vision-capable, Opik logs a warning and replaces image blocks with placeholders before making the API call.
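As a mental model, these detection rules amount to something like the sketch below. This is illustrative only, not Opik's actual implementation, and the family list is abbreviated:

```python
# Abbreviated list of known multimodal families (assumption, for illustration).
KNOWN_VISION_PREFIXES = ("gpt-4o", "claude-3", "gemini-1.5", "llama-3.2", "pixtral")


def looks_vision_capable(model: str) -> bool:
    # Strip a provider prefix such as "anthropic/" or "openai/".
    name = model.split("/", 1)[-1].lower()
    # Generic suffix rules: "-vision" and "-vl" mark vision-capable models.
    if name.endswith("-vision") or name.endswith("-vl"):
        return True
    # Otherwise fall back to the known multimodal families.
    return name.startswith(KNOWN_VISION_PREFIXES)


looks_vision_capable("anthropic/claude-3-opus")  # True
looks_vision_capable("gpt-3.5-turbo")            # False
```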
TypeScript support for multimodal evaluations is in progress. The TypeScript SDK will expose the same message structure and detection rules; we’ll update this section with a full example once the implementation lands.
If you are experimenting with a new provider, you can extend the registry at runtime:
```python
from opik.evaluation.models import ModelCapabilities

ModelCapabilities.add_vision_model("my-provider/sparrow-vision-beta")
```
Any subsequent evaluations in that process will treat the custom model as vision-capable.
To check how a given model will be treated, query the registry directly:

```python
from opik.evaluation.models import ModelCapabilities

ModelCapabilities.supports_vision("anthropic/claude-3-opus")  # True
```
If the call returns `False`, Opik logs a warning and flattens image blocks: the image data is inlined as text and truncated to the first 500 characters to keep prompts manageable.
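Conceptually, the flattening step behaves roughly like the sketch below. This is illustrative only; the placeholder format is an assumption, and only the 500-character cap comes from the behaviour described above:

```python
MAX_IMAGE_TEXT_LENGTH = 500  # cap described above


def flatten_image_blocks(content: list[dict]) -> str:
    """Replace image blocks with truncated text placeholders (illustrative sketch)."""
    parts = []
    for block in content:
        if block["type"] == "text":
            parts.append(block["text"])
        elif block["type"] == "image_url":
            url = block["image_url"]["url"]
            # Placeholder format is an assumption; only the truncation is documented.
            parts.append(f"[image: {url[:MAX_IMAGE_TEXT_LENGTH]}]")
    return "\n".join(parts)
```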
Image sources behave as follows:

- `https://` URLs must be publicly accessible.
- Inline `data:image/png;base64,iVBORw0...` payloads are supported.
- `detail` fields (`"low"`, `"high"`) are preserved and forwarded when present.

Opik also works with LangChain: it forwards the same OpenAI-style content blocks that LangChain expects, so structured messages with `image_url` dictionaries continue to work. A simple validation script is shown below:
```python
from langchain_core.messages import HumanMessage, SystemMessage, messages_to_dict
from langchain_core.prompts import ChatPromptTemplate

from opik.evaluation.models.langchain.message_converters import convert_to_langchain_messages

# Directly using LangChain message objects
plain_messages = [
    SystemMessage(content="You are an assistant."),
    HumanMessage(content="Describe the weather in Paris."),
]
convert_to_langchain_messages(messages_to_dict(plain_messages))

# Using a ChatPromptTemplate with multimodal content
chat_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful assistant."),
        (
            "user",
            [
                {"type": "text", "text": "Describe the following image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://python.langchain.com/img/phone_handoff.jpeg",
                        "detail": "high",
                    },
                },
            ],
        ),
    ]
)

rendered = chat_prompt.invoke({})
messages = messages_to_dict(rendered.messages)
convert_to_langchain_messages(messages)  # round-trip validation
```