docs/examples/mm_structured_outputs.md
An end-to-end example of Multimodal Structured Outputs with Daft and Qwen3-VL-8B
We'll evaluate Qwen3-VL's image understanding using a multiple choice subset of HuggingFace's The Cauldron dataset, a massive collection of 50 vision-language datasets.
Our pipeline will:

- Load a multiple-choice subset of The Cauldron and decode its images
- Prompt Qwen3-VL with a Pydantic schema to get structured answers
- Re-run the same prompts without images as an ablation and classify each example into four quadrants
- Use a VLM-as-a-Judge to diagnose failure cases
Check out the blog post where we evaluate Qwen3-VL-4B on 20k rows across 3 datasets.
This tutorial demonstrates the core evaluation pipeline on a small sample (50 rows) so you can inspect examples and understand the methodology. For an end-to-end implementation that scales to millions of rows, see eval_image_understanding.py in the daft-examples repo.
First, install the required dependencies:
pip install daft[openai] python-dotenv
Next, create a .env file in your project directory and add your HuggingFace token:
# .env
HF_TOKEN=your_huggingface_token_here
You can get a HuggingFace token from https://huggingface.co/settings/tokens.
Then, set up your environment variables and configuration:
import os
from dotenv import load_dotenv
load_dotenv()
# Configuration
MODEL_ID = "Qwen/Qwen3-VL-8B-Instruct"
LIMIT = 50 # Keep low for interactive demo
# HuggingFace Inference Provider (hosted Qwen3-VL endpoints)
OPENAI_API_KEY = os.getenv("HF_TOKEN")
OPENAI_BASE_URL = "https://router.huggingface.co/v1"
Configure Daft to use the OpenAI-compatible provider:
import daft
# Set the OpenAI-compatible provider
daft.set_provider("openai", api_key=OPENAI_API_KEY, base_url=OPENAI_BASE_URL)
The Cauldron is a massive collection of 50 vision-language datasets spanning tasks such as general visual question answering, OCR and document understanding, chart and figure understanding, and textbook/academic questions.
We'll start with the AI2D subset—science diagrams with multiple-choice questions.
df_raw = daft.read_huggingface("HuggingFaceM4/the_cauldron/ai2d").limit(LIMIT).collect()
df_raw.show(3)
The dataset contains nested structures with columns:
The dataset contains nested structures with columns:

- `images`: a list of image bytes
- `texts`: a list of conversation turns with `user` (question) and `assistant` (answer) fields

Each row represents a multiple-choice question with an accompanying science diagram.
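For intuition, here is an illustrative sketch of a single raw row. The nesting mirrors the columns above, but the values are made up:

```python
# Hypothetical example row, for illustration only -- the values are invented,
# but the nesting matches the `images` and `texts` columns described above.
example_row = {
    "images": [{"bytes": b"<raw PNG bytes of a science diagram>"}],
    "texts": [
        {
            "user": (
                "Question: Which layer of the Earth is labeled B?\n"
                "Choices: A. crust B. mantle C. core\n"
                "Answer with the letter."
            ),
            "assistant": "Answer: B",
        }
    ],
}
```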
We need to:

- Explode the `images` list and decode the raw bytes into images
- Explode the `texts` list and unnest the `user`/`assistant` fields
- Strip the "Answer: " prefix and surrounding whitespace to get a clean answer letter
from daft import col
from daft.functions import unnest
df_img = df_raw.explode(col("images"))
df_img = df_img.with_column("image", col("images")["bytes"].decode_image())
df_text = df_img.explode(col("texts")).select(unnest(col("texts")), "image")
df_prep = df_text.with_column(
"answer",
col("assistant").regexp_replace("Answer: ", "").lstrip().rstrip()
).collect()
df_prep.show(3)
Daft's `prompt` function scales OpenAI-compatible calls across dataframes. We'll use a Pydantic model to enforce structured output.
For more info: API docs | User Guide
from daft.functions import prompt
from pydantic import BaseModel, Field
import time
PARAMS = {"temperature": 0.0, "max_tokens": 2}
class ChoiceResponse(BaseModel):
    """Structured output for multiple choice answers."""
    choice: str = Field(..., description="The letter of the correct choice (e.g., A, B, C, D)")
start = time.time()
df_results = df_prep.with_column(
"result",
prompt(
messages=[col("image"), col("user")],
model=MODEL_ID,
use_chat_completions=True,
return_format=ChoiceResponse,
**PARAMS,
)
).limit(LIMIT).collect()
elapsed = time.time() - start
print(f"Processed {df_results.count_rows()} rows in {elapsed:.1f} seconds")
df_eval = df_results.with_column(
"is_correct",
col("result")["choice"].lstrip().rstrip() == col("answer").lstrip().rstrip()
)
accuracy = df_eval.where(col("is_correct")).count_rows() / df_eval.count_rows()
print(f"Accuracy (with image): {accuracy:.1%}")
df_eval.select("user", "image", "answer", col("result")["choice"].alias("predicted"), "is_correct").show(5)
A simple accuracy score tells us how often the model is correct, but not why. Our full evaluation found that ~70% of correct answers on image understanding benchmarks don't actually require the image. To understand the true contribution of image understanding, we conduct an ablation study—running the same prompts without images.
This lets us classify each example into four quadrants:
| Quadrant | With Image | Without Image | Interpretation |
|---|---|---|---|
| Both Correct | ✓ | ✓ | Question may be solvable from text alone |
| Image Helped | ✓ | ✗ | True image understanding |
| Image Hurt | ✗ | ✓ | Visual confusion |
| Both Incorrect | ✗ | ✗ | Hard question or model limitation |
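For reference, here is the table above expressed as a plain Python function. This is just a sketch of the classification logic; the actual pipeline implements it with Daft's `when()` expression later in the tutorial.

```python
# Reference implementation of the quadrant logic from the table above.
def quadrant(correct_with_image: bool, correct_without_image: bool) -> str:
    if correct_with_image and correct_without_image:
        return "Both Correct"
    if correct_with_image:
        return "Image Helped"
    if correct_without_image:
        return "Image Hurt"
    return "Both Incorrect"
```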
SYSTEM_PROMPT_NO_IMAGE = "Respond to the multiple choice question with just the letter corresponding to the correct answer."
start = time.time()
df_ablation = df_eval.with_column(
"result_no_image",
prompt(
messages=col("user"),
system_message=SYSTEM_PROMPT_NO_IMAGE,
model=MODEL_ID,
use_chat_completions=True,
return_format=ChoiceResponse,
**PARAMS,
)
).with_column(
"is_correct_no_image",
col("result_no_image")["choice"].lstrip().rstrip() == col("answer").lstrip().rstrip()
).collect()
elapsed = time.time() - start
print(f"Processed {df_ablation.count_rows()} rows in {elapsed:.1f} seconds")
accuracy_no_image = df_ablation.where(col("is_correct_no_image")).count_rows() / df_ablation.count_rows()
print(f"Accuracy with image: {accuracy:.1%}")
print(f"Accuracy without image: {accuracy_no_image:.1%}")
print(f"Delta: {accuracy - accuracy_no_image:+.1%}")
from daft.functions import when, monotonically_increasing_id
df_classified = df_ablation.with_column(
"id", monotonically_increasing_id()
).with_column(
"quadrant",
when((col("is_correct") == True) & (col("is_correct_no_image") == True), "Both Correct")
.when((col("is_correct") == True) & (col("is_correct_no_image") == False), "Image Helped")
.when((col("is_correct") == False) & (col("is_correct_no_image") == True), "Image Hurt")
.otherwise("Both Incorrect")
)
df_classified.groupby("quadrant").count().select("quadrant", col("id").alias("count")).show()
Inspect cases where the image helped:
df_classified.where(col("quadrant") == "Image Helped").select(
"user", "image", "answer",
col("result")["choice"].alias("with_image"),
col("result_no_image")["choice"].alias("without_image")
).show(3)
df_classified.where(col("quadrant") == "Image Hurt").select(
"user", "image", "answer",
col("result")["choice"].alias("with_image"),
col("result_no_image")["choice"].alias("without_image")
).show(3)
total_count = df_classified.count_rows()
df_results = df_classified.groupby("quadrant").count().select(
"quadrant",
col("id").alias("count")
).with_column(
"percentage",
(col("count") / daft.lit(total_count) * 100)
).collect()
df_results.show()
We can go beyond pass/fail metrics by using VLM-as-a-Judge to explain why the model failed, focusing on the most informative failure subsets: the Image Hurt and Both Incorrect quadrants.
We'll use a structured output schema so the judge reliably returns fields we can analyze.
from daft.functions import format
JUDGE_SYSTEM_PROMPT = """
You are an impartial judge reviewing the results of a textbook academic questions multiple choice benchmark.
Inspect the attached image and provide high-signal feedback on why the model chose its answer.
First, reason about the model's answer with the image and the model's answer without the image.
Second, develop a hypothesis for why the model made the choice it did.
Third, attribute the failure to a 'question' issue or an 'image' understanding issue.
Finally, assign whether the model's answer with the image is correct and whether the model's answer without the image is correct.
"""
class JudgeResponse(BaseModel):
    """Structured diagnostic feedback from the VLM judge."""
    reasoning: str = Field(..., description="Why did the model choose the answer it did?")
    hypothesis: str = Field(..., description="What caused the divergence from the correct answer?")
    attribution: str = Field(
        ...,
        description="Was this a 'question' issue or an 'image' understanding issue or 'other'?",
    )
judge_template = format(
"""Given the image attached and the multiple choice question of <question>{}</question>,
The model chose the following prediction <model_answer>{}</model_answer> and without the image, the model chose the following prediction <no_image_model_answer>{}</no_image_model_answer>, but the correct answer is <correct_answer>{}</correct_answer>.
Provide diagnostic feedback.
""",
col("user"),
col("result")["choice"],
col("result_no_image")["choice"],
col("answer"),
)
df_failures = df_classified.where(
(col("quadrant") == "Image Hurt") | (col("quadrant") == "Both Incorrect")
)
JUDGE_PARAMS = {"temperature": 0.0, "max_tokens": 512}
df_judged = df_failures.with_column(
"judge_response",
prompt(
messages=[col("image"), judge_template],
system_message=JUDGE_SYSTEM_PROMPT,
model=MODEL_ID,
use_chat_completions=True,
return_format=JudgeResponse,
**JUDGE_PARAMS,
),
).collect()
print(f"Judged {df_judged.count_rows()} failure rows")
The judge's attribution field helps separate question issues (ambiguous prompts) from image understanding issues (missed labels, visual ambiguity).
df_judged.select(
"quadrant",
"user",
"image",
"answer",
col("result")["choice"].alias("with_image"),
col("result_no_image")["choice"].alias("without_image"),
unnest(col("judge_response")),
).show(3)
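As a quick follow-up (not part of the original script), you can tally the judge's attribution labels using the same groupby pattern we used for quadrants:

```python
# Count how many failures the judge attributes to 'question', 'image', or 'other'.
df_judged.with_column(
    "attribution", col("judge_response")["attribution"]
).groupby("attribution").count().select(
    "attribution", col("id").alias("count")
).show()
```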
Verify the full pipeline ran:
print(f"Accuracy (with image): {accuracy:.1%}")
print(f"Accuracy (without image): {accuracy_no_image:.1%}")
print(f"Delta: {accuracy - accuracy_no_image:+.1%}")
df_classified.groupby("quadrant").count().show()
print(f"Judge rows: {df_judged.count_rows()}")
This tutorial runs locally on 50 rows. The Cauldron contains millions of rows across 50 subsets. To run this evaluation at scale, use Daft Cloud.
The production-ready script eval_image_understanding.py extends this pipeline to millions of rows across multiple Cauldron subsets.
👉 Sign up for early access | Book a demo
In this tutorial, we built a small pipeline to evaluate Qwen3-VL's image understanding:

- Loaded the AI2D subset of The Cauldron and decoded its images
- Prompted Qwen3-VL with structured (Pydantic) outputs to answer multiple-choice questions
- Ran the same prompts without images as an ablation and classified each example into quadrants
- Used a VLM-as-a-Judge to diagnose the failure cases

From here, a few directions to explore:
- Multi-Dataset Evaluation: Try the full pipeline from the daft-examples repository, which supports evaluating across all 50 Cauldron subsets.
- Experiment Tracking: Wire judge feedback into MLflow or W&B to track improvements over time (see the sketch after this list).
- RLVR Training: Use the is_correct signal and judge attributions for reinforcement learning with verifiable rewards.
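For the experiment-tracking idea above, here is a minimal sketch assuming MLflow is installed and a tracking URI is configured; the run name and metric names are just examples:

```python
# Minimal experiment-tracking sketch (assumes `pip install mlflow`; not part of
# the tutorial pipeline above). Logs the headline numbers from this run.
import mlflow

with mlflow.start_run(run_name="qwen3-vl-ai2d-eval"):
    mlflow.log_param("model_id", MODEL_ID)
    mlflow.log_param("rows", LIMIT)
    mlflow.log_metric("accuracy_with_image", accuracy)
    mlflow.log_metric("accuracy_without_image", accuracy_no_image)
```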