vision - Image Understanding

Analyzes local images or image URLs through a vision API. Supports content description, text extraction (OCR), object recognition, and more.

Model Selection

The vision tool picks a model automatically and falls back to alternatives when needed, so no manual configuration is required. Candidates are tried in this order:

  1. Main model — uses the currently configured main model for image recognition (must be a multimodal model)
  2. Other configured models — auto-discovers other multimodal models with configured API keys as alternatives

If the current provider fails, the tool automatically tries the next one until it succeeds or all fail.

Supported Models

| Provider | Vision Model | Notes |
| --- | --- | --- |
| OpenAI / Compatible | Main model | All OpenAI-protocol-compatible multimodal models |
| Qwen (DashScope) | Main model | e.g. qwen3.6-plus |
| Claude | Main model | Anthropic native image format |
| Gemini | Main model | inlineData format |
| Doubao | Main model | doubao-seed-2-0 series natively supported |
| Kimi (Moonshot) | Main model | kimi-k2.6, kimi-k2.5 natively supported |
| Baidu Qianfan | Main model | Defaults to the multimodal main model (e.g. ernie-5.1); falls back to ernie-4.5-turbo-vl when the main model is not multimodal |
| ZhipuAI | glm-5v-turbo | Always uses the dedicated vision model |
| MiniMax | MiniMax-Text-01 | Always uses the dedicated vision model |

<Note> ZhipuAI and MiniMax text models do not support image understanding, so their dedicated vision models are always used automatically. </Note>

When use_linkai=true, LinkAI's multimodal model is used by default.

Custom Configuration

To specify the model used by Vision, configure it in config.json, for example:

```json
{
    "tools": {
        "vision": {
            "model": "gpt-4.1"
        }
    }
}
```

The specified model is used first, and the tool automatically routes to the corresponding provider based on the model name; on failure, it falls back to other configured providers.
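
One way to picture this routing is a lookup from model-name prefix to provider, followed by reordering the candidates so the preferred provider comes first. The sketch below is an assumption made for illustration: the prefix table, the provider identifiers, the `"openai"` default for unknown names, and both function names are hypothetical, and the tool's real routing rules are not documented here.

```python
# Hypothetical prefix table; the real routing rules may differ.
PREFIX_TO_PROVIDER = {
    "gpt-": "openai",
    "qwen": "dashscope",
    "claude": "claude",
    "gemini": "gemini",
    "doubao": "doubao",
    "kimi": "moonshot",
    "ernie": "qianfan",
    "glm": "zhipuai",
}


def route_provider(model: str, default: str = "openai") -> str:
    """Map a model name to a provider id by prefix (assumed convention)."""
    name = model.lower()
    for prefix, provider in PREFIX_TO_PROVIDER.items():
        if name.startswith(prefix):
            return provider
    # Assumption: unknown names are treated as OpenAI-compatible.
    return default


def candidate_order(config: dict, configured_providers: list) -> list:
    """Put the provider of tools.vision.model first, then the remaining ones."""
    model = config.get("tools", {}).get("vision", {}).get("model")
    if not model:
        return list(configured_providers)
    preferred = route_provider(model)
    rest = [p for p in configured_providers if p != preferred]
    return [preferred] + rest


order = candidate_order(
    {"tools": {"vision": {"model": "gpt-4.1"}}},
    ["dashscope", "openai", "claude"],
)
```

With the configuration shown earlier, `order` starts with `"openai"`, and the other configured providers remain available as fallbacks.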

In most cases no configuration is needed — the tool works automatically as long as the main model supports multimodal input or any vision-capable API key is configured.

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| image | string | Yes | Local file path or HTTP(S) image URL |
| question | string | Yes | Question to ask about the image |

Supported image formats: jpg, jpeg, png, gif, webp
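
As an illustration of the parameter rules above, the following sketch checks a tool call before it is sent. The function names and the returned `source` field are hypothetical. Letting URLs without a file extension through is an assumption, since the page does not say how such URLs are handled.

```python
import os
from urllib.parse import urlparse

# Extensions listed under "Supported image formats".
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}


def is_url(image: str) -> bool:
    """True when the image reference is an HTTP(S) URL rather than a local path."""
    return urlparse(image).scheme in ("http", "https")


def validate_args(args: dict) -> dict:
    """Check the two required string parameters and the image extension."""
    image = args.get("image")
    question = args.get("question")
    if not isinstance(image, str) or not image:
        raise ValueError("'image' is required and must be a non-empty string")
    if not isinstance(question, str) or not question:
        raise ValueError("'question' is required and must be a non-empty string")

    url = is_url(image)
    # For URLs, look at the path only so query strings do not hide the extension.
    path = urlparse(image).path if url else image
    ext = os.path.splitext(path)[1].lower()
    # Assumption: a URL with no extension is allowed through unchecked.
    if ext not in SUPPORTED_EXTENSIONS and not (url and ext == ""):
        raise ValueError(f"unsupported image format: {ext or '(none)'}")
    return {"image": image, "question": question, "source": "url" if url else "file"}


checked = validate_args(
    {"image": "https://example.com/a/cat.PNG?x=1", "question": "What animal is this?"}
)
```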

Use Cases

  • Describe image content
  • Extract text from images (OCR)
  • Identify objects, colors, scenes
  • Analyze screenshots and scanned documents
<Note> Images larger than 1 MB are automatically compressed before upload. All images, including those given as remote URLs, are converted to base64 before transmission so that they work with every model backend. </Note>
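
The base64 step in the note can be sketched with the standard library alone. This is a hedged illustration: the function names are hypothetical, treating 1 MB as 1024 * 1024 bytes is an assumption, and so is packaging the result as a `data:` URL. The compression itself is left out because it would need an imaging library.

```python
import base64
import os

# Assumption: "1 MB" is interpreted as 1024 * 1024 bytes.
COMPRESS_THRESHOLD = 1024 * 1024

# MIME types for the supported image formats.
MIME_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def needs_compression(data: bytes) -> bool:
    """True when the raw image is larger than the threshold."""
    return len(data) > COMPRESS_THRESHOLD


def to_data_url(data: bytes, filename: str) -> str:
    """Encode image bytes as a base64 data URL (assumed transport format)."""
    ext = os.path.splitext(filename)[1].lower()
    mime = MIME_TYPES.get(ext, "application/octet-stream")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


data_url = to_data_url(b"abc", "sample.png")
```

Base64 encoding enlarges the payload by about one third, which is one plausible reason to compress large images before encoding them.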