SWE-agent supports multimodal AI models that can process both text and images. This enables the agent to work with visual context from GitHub issues, such as screenshots, diagrams, and UI mockups.
The multimodal implementation handles image processing automatically. Currently, SWE-agent only processes images from the `problem_statement` category, i.e., images attached to the issue itself.
!!! note "Design Choice"

    Only `problem_statement` images are processed, providing essential visual context for understanding the task while preserving agent autonomy in determining solution approaches. Images from the `patch` and `test_patch` categories may contain solution hints and are not processed.
Use the pre-configured multimodal setup:

```bash
sweagent run-batch \
    --config config/default_mm_with_images.yaml \
    --instances.type swe_bench \
    --instances.subset multimodal \
    --instances.split dev
```
You can disable image processing globally:

```yaml
# config/your_config.yaml
agent:
  templates:
    disable_image_processing: true
```
Or for specific instances:

```python
from sweagent.agent.problem_statement import SWEBenchMultimodalProblemStatement

problem_statement = SWEBenchMultimodalProblemStatement(
    text="Fix the rendering issue",
    issue_images=["https://example.com/screenshot.png"],
    disable_image_processing=True,  # Skip image processing
)
```
Multimodal support works with any vision-capable model.
Example model configuration:

```yaml
# model_configs/claude-sonnet-4-20250514_mm.yaml
model:
  name: claude-sonnet-4-20250514
  # Vision capabilities are detected automatically
```
When loading SWE-bench instances, multimodal support is automatic:
```json
{
  "instance_id": "example__repo-123",
  "problem_statement": "Fix the chart rendering bug...",
  "image_assets": {
    "problem_statement": ["http://example.com/chart.png"]
  }
}
```
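Given this instance format, a loader only needs to read the `problem_statement` key of `image_assets`. A minimal sketch, where the `instance` dict mirrors the JSON above:

```python
instance = {
    "instance_id": "example__repo-123",
    "problem_statement": "Fix the chart rendering bug...",
    "image_assets": {"problem_statement": ["http://example.com/chart.png"]},
}

# Only the problem_statement category is used;
# patch/test_patch assets are ignored by design.
issue_images = instance.get("image_assets", {}).get("problem_statement", [])
print(issue_images)  # ['http://example.com/chart.png']
```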
```python
from sweagent.agent.problem_statement import SWEBenchMultimodalProblemStatement

problem_statement = SWEBenchMultimodalProblemStatement(
    text="Fix the rendering issue shown in the screenshots",
    issue_images=[
        "https://example.com/before.png",
        "https://example.com/after.png",
    ],
)

# This downloads the images and converts them to base64 markdown
processed_text = problem_statement.get_problem_statement()
```
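Here, "converts to base64 markdown" means embedding each downloaded image as a data URI inside a markdown image tag. A minimal sketch of that encoding step (the `image_to_markdown` helper and its MIME handling are illustrative, not SWE-agent's actual internals):

```python
import base64

def image_to_markdown(image_bytes: bytes, mime: str = "image/png") -> str:
    # Encode raw image bytes as a base64 data URI wrapped in markdown image syntax
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"![issue image](data:{mime};base64,{encoded})"
```

A vision-capable model then receives the image inline with the issue text, with no separate file handling required.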
To enable multimodal processing, update the following configuration options.
Enable image parsing in your configuration:
```yaml
agent:
  history_processors:
    - type: image_parsing  # Parse base64-encoded images in observations
```
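Conceptually, an image-parsing history processor has to separate base64 data URIs from the surrounding text before a message is forwarded to a vision model. A rough sketch of that split (the regex and function names are illustrative, not the actual `image_parsing` implementation):

```python
import re

# Matches markdown images whose target is a base64 data URI
IMAGE_PATTERN = re.compile(
    r"!\[[^\]]*\]\((data:image/[a-z]+;base64,[A-Za-z0-9+/=]+)\)"
)

def split_text_and_images(observation: str) -> tuple[str, list[str]]:
    # Collect the embedded image URIs and leave a text placeholder behind
    images = IMAGE_PATTERN.findall(observation)
    text = IMAGE_PATTERN.sub("[image]", observation)
    return text, images
```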
Include image and browser tools for visual tasks:
```yaml
agent:
  tools:
    bundles:
      - path: tools/image_tools  # includes the open_image tool so models can open image files
      - path: tools/web_browser  # includes 17 browser automation tools (click_mouse, open_site, etc.)
```
The `web_browser` bundle provides tools for:

- navigation (`open_site`)
- screenshots (`screenshot_site`)
- page interaction (`click_mouse`, `type_text`, `scroll_on_page`)
- script execution (`execute_script_on_page`)

Multimodal processing is enabled automatically when `--instances.type swe_bench --instances.subset multimodal` are set. To disable this behavior, set `--agent.templates.disable_image_processing=true`.