# vision - Image Analysis


Analyze local images or image URLs with a vision-capable model. Supports content description, text extraction (OCR), object recognition, and more.

## Model Selection

The vision tool uses a multi-level auto-selection strategy with automatic fallback — no manual configuration required:

  1. Main model — uses the currently configured main model for image recognition (zero extra cost)
  2. Other configured models — auto-discovers other models whose API keys are configured and uses them as alternatives
  3. OpenAI — uses `open_ai_api_key` to call `gpt-4.1-mini`
  4. LinkAI — uses `linkai_api_key` to call the LinkAI vision service

When `use_linkai=true`, LinkAI is promoted to the highest priority.

If the current provider fails, the tool automatically tries the next one until a call succeeds or all providers fail, as sketched below.
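
In code, this strategy amounts to a priority list walked in order. Below is a minimal Python sketch, assuming hypothetical names (`analyze_with`, `VisionError`) rather than the tool's actual internals:

```python
# Minimal sketch of the fallback chain; `analyze_with` and `VisionError`
# are hypothetical helpers, not the tool's real internals.

class VisionError(Exception):
    """One provider failed to analyze the image."""

def analyze_with(provider: str, image_b64: str, question: str) -> str:
    """Call a single provider's vision API (stubbed for illustration)."""
    raise VisionError(f"{provider} unavailable")

def analyze(image_b64: str, question: str, use_linkai: bool = False) -> str:
    # Priority order: main model, other configured models, OpenAI, LinkAI.
    providers = ["main_model", "other_configured", "openai", "linkai"]
    if use_linkai:
        # use_linkai=true promotes LinkAI to the highest priority.
        providers.remove("linkai")
        providers.insert(0, "linkai")
    errors = []
    for provider in providers:
        try:
            return analyze_with(provider, image_b64, question)
        except VisionError as exc:
            errors.append(f"{provider}: {exc}")  # fall through to the next
    raise RuntimeError("all vision providers failed: " + "; ".join(errors))
```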

## Supported Models

| Vendor | Vision Model | Notes |
| --- | --- | --- |
| OpenAI / Compatible | Main model | All OpenAI-compatible multimodal models |
| Baidu Qianfan | Main model | Multimodal main models (e.g. `ernie-5.0`) handle images directly; falls back to `ernie-4.5-turbo-vl` for text-only main models |
| Qwen (DashScope) | Main model | Via the MultiModalConversation API |
| Claude | Main model | Anthropic native image format |
| Gemini | Main model | `inlineData` format |
| Doubao | Main model | `doubao-seed-2-0` series natively supported |
| Kimi (Moonshot) | Main model | `kimi-k2.6`, `kimi-k2.5` natively supported |
| ZhipuAI | `glm-5v-turbo` | Always uses a dedicated vision model |
| MiniMax | `MiniMax-Text-01` | Always uses a dedicated vision model |
<Note> ZhipuAI and MiniMax text models do not support image understanding, so their dedicated vision models are always used automatically. </Note>
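
Most of the per-vendor differences in the table reduce to how the image bytes are wrapped in the request. The sketch below illustrates the three wire formats mentioned above (the OpenAI-compatible `image_url` part, Anthropic's native image block, and Gemini's `inlineData` part); the field names follow each vendor's public API, but `image_message` itself is a hypothetical helper, not part of this tool:

```python
def image_message(vendor: str, image_b64: str, question: str,
                  mime: str = "image/jpeg") -> dict:
    """Build one user message carrying an image, per vendor convention.
    Illustrative helper; the tool's real adapters are internal."""
    if vendor == "openai":
        # OpenAI-compatible: data URL inside an image_url content part.
        return {"role": "user", "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{image_b64}"}},
        ]}
    if vendor == "claude":
        # Anthropic native image format: base64 source block.
        return {"role": "user", "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": mime,
                        "data": image_b64}},
            {"type": "text", "text": question},
        ]}
    if vendor == "gemini":
        # Gemini: inlineData part alongside the text part.
        return {"role": "user", "parts": [
            {"text": question},
            {"inlineData": {"mimeType": mime, "data": image_b64}},
        ]}
    raise ValueError(f"unknown vendor: {vendor}")
```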

## Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `image` | string | Yes | Local file path or HTTP(S) image URL |
| `question` | string | Yes | Question to ask about the image |

Supported image formats: `jpg`, `jpeg`, `png`, `gif`, `webp`
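
As a quick illustration of the interface, assume a stand-in `vision` function (in practice the agent invokes the tool for you; this function and its pre-flight format check are hypothetical):

```python
SUPPORTED_FORMATS = {"jpg", "jpeg", "png", "gif", "webp"}

def vision(image: str, question: str) -> str:
    """Hypothetical stand-in for the tool; both parameters are required,
    and `image` may be a local path or an HTTP(S) URL."""
    ext = image.rsplit(".", 1)[-1].lower()
    if ext not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported image format: .{ext}")
    ...  # resolve image to base64, pick a provider, ask `question`

vision("/tmp/receipt.png", "Extract all text from this receipt")
vision("https://example.com/photo.jpg", "What objects are in this photo?")
```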

## Custom Configuration

To pin the vision tool to a specific model, add the following to `config.json`:

```json
{
    "tool": {
        "vision": {
            "model": "ernie-4.5-turbo-vl"
        }
    }
}
```

In most cases no configuration is needed. The tool works automatically as long as the main model supports multimodal input or any vision-capable API key is configured.

## Use Cases

  - Describe image content
  - Extract text from images (OCR)
  - Identify objects, colors, scenes
  - Analyze screenshots and scanned documents

<Note> Images larger than 1MB are automatically compressed (max edge 1536px). All images (including remote URLs) are converted to base64 for transmission to ensure compatibility with all model backends. </Note>
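
Preprocessing along those lines might look roughly like the following Pillow sketch; the 1MB threshold and the 1536px max edge come from the note above, while `to_base64` itself is illustrative, not the tool's source:

```python
import base64
import io

from PIL import Image  # Pillow

MAX_BYTES = 1 * 1024 * 1024  # 1MB compression threshold from the note above
MAX_EDGE = 1536              # max edge in pixels after compression

def to_base64(raw: bytes) -> str:
    """Compress oversized images, then base64-encode for transmission.
    Illustrative sketch of the behavior described in the note."""
    if len(raw) > MAX_BYTES:
        img = Image.open(io.BytesIO(raw))
        # Shrink in place so the longest edge is at most MAX_EDGE pixels,
        # preserving the aspect ratio.
        img.thumbnail((MAX_EDGE, MAX_EDGE))
        buf = io.BytesIO()
        img.convert("RGB").save(buf, format="JPEG", quality=85)
        raw = buf.getvalue()
    return base64.b64encode(raw).decode("ascii")
```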