docs/en/tools/vision.mdx
Analyze local images or image URLs using a vision API. Supports content description, text extraction (OCR), object recognition, and more.
The vision tool uses a multi-level auto-selection strategy with automatic fallback — no manual configuration required:
1. Main model, when it supports multimodal input directly
2. `open_ai_api_key`, to call gpt-4.1-mini
3. `linkai_api_key`, to call the LinkAI vision service

When `use_linkai=true`, LinkAI is promoted to the highest priority. If the current provider fails, the tool automatically tries the next one until one succeeds or all fail.
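As a rough illustration, the relevant keys in config.json might look like the sketch below (the key names come from the priority list above; treat the values as placeholders and check your version's config reference for exact placement):

```json
{
  "open_ai_api_key": "your-openai-key",
  "linkai_api_key": "your-linkai-key",
  "use_linkai": true
}
```

With `use_linkai` set to `true` here, the LinkAI vision service would be tried before the other providers. The table below lists each supported vendor and the vision model the tool uses with it.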
| Vendor | Vision Model | Notes |
|---|---|---|
| OpenAI / Compatible | Main model | All OpenAI-compatible multimodal models |
| Baidu Qianfan | Main model | Multimodal main models (e.g. ernie-5.0) handle images directly; falls back to ernie-4.5-turbo-vl for text-only main models |
| Qwen (DashScope) | Main model | Via MultiModalConversation API |
| Claude | Main model | Anthropic native image format |
| Gemini | Main model | inlineData format |
| Doubao | Main model | doubao-seed-2-0 series natively supported |
| Kimi (Moonshot) | Main model | kimi-k2.6, kimi-k2.5 natively supported |
| ZhipuAI | glm-5v-turbo | Always uses a dedicated vision model |
| MiniMax | MiniMax-Text-01 | Always uses a dedicated vision model |
The tool accepts the following parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| `image` | string | Yes | Local file path or HTTP(S) image URL |
| `question` | string | Yes | Question to ask about the image |
Supported image formats: jpg, jpeg, png, gif, webp
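For example, a call to the vision tool might pass arguments like the following (the file path and question are illustrative):

```json
{
  "image": "/path/to/photo.jpg",
  "question": "What text appears in this image?"
}
```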
To specify a particular model for the vision tool, add the following to config.json:

```json
{
  "tool": {
    "vision": {
      "model": "ernie-4.5-turbo-vl"
    }
  }
}
```
In most cases no configuration is needed. The tool works automatically as long as the main model supports multimodal input or any vision-capable API key is configured.