# vision - Image Analysis


Analyze local images or image URLs with a vision-capable model. Supports content description, text extraction (OCR), object recognition, and more.

## Model Selection

The vision tool uses a multi-level auto-selection strategy with automatic fallback — no manual configuration required:

  1. Main model — uses the currently configured main model for image recognition (zero extra cost)
  2. Other configured models — auto-discovers other models whose API keys are configured and uses them as alternatives
  3. OpenAI — uses `open_ai_api_key` to call `gpt-4.1-mini`
  4. LinkAI — uses `linkai_api_key` to call the LinkAI vision service

When `use_linkai=true`, LinkAI is promoted to the highest priority.

If the current provider fails, the tool automatically tries the next one until a call succeeds or all providers fail, as sketched below.
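
In code, this strategy amounts to a priority list walked in order. Below is a minimal Python sketch, assuming hypothetical names (`analyze_with`, `VisionError`) rather than the tool's actual internals:

```python
# Minimal sketch of the fallback chain; `analyze_with` and `VisionError`
# are hypothetical helpers, not the tool's real internals.

class VisionError(Exception):
    """One provider failed to analyze the image."""

def analyze_with(provider: str, image_b64: str, question: str) -> str:
    """Call a single provider's vision API (stubbed for illustration)."""
    raise VisionError(f"{provider} unavailable")

def analyze(image_b64: str, question: str, use_linkai: bool = False) -> str:
    # Priority order: main model, other configured models, OpenAI, LinkAI.
    providers = ["main_model", "other_configured", "openai", "linkai"]
    if use_linkai:
        # use_linkai=true promotes LinkAI to the highest priority.
        providers.remove("linkai")
        providers.insert(0, "linkai")
    errors = []
    for provider in providers:
        try:
            return analyze_with(provider, image_b64, question)
        except VisionError as exc:
            errors.append(f"{provider}: {exc}")  # fall through to the next
    raise RuntimeError("all vision providers failed: " + "; ".join(errors))
```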

## Supported Models

| Vendor | Vision Model | Notes |
| --- | --- | --- |
| OpenAI / Compatible | Main model | All OpenAI-compatible multimodal models |
| Baidu Qianfan | Main model | Multimodal main models (e.g. `ernie-5.0`) handle images directly; falls back to `ernie-4.5-turbo-vl` for text-only main models |
| Qwen (DashScope) | Main model | Via the MultiModalConversation API |
| Claude | Main model | Anthropic native image format |
| Gemini | Main model | `inlineData` format |
| Doubao | Main model | `doubao-seed-2-0` series natively supported |
| Kimi (Moonshot) | Main model | `kimi-k2.6`, `kimi-k2.5` natively supported |
| ZhipuAI | `glm-5v-turbo` | Always uses a dedicated vision model |
| MiniMax | `MiniMax-Text-01` | Always uses a dedicated vision model |
<Note> ZhipuAI and MiniMax text models do not support image understanding, so their dedicated vision models are always used automatically. </Note>
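
Most of the per-vendor differences in the table reduce to how the image bytes are wrapped in the request. The sketch below illustrates the three wire formats mentioned above (the OpenAI-compatible `image_url` part, Anthropic's native image block, and Gemini's `inlineData` part); the field names follow each vendor's public API, but `image_message` itself is a hypothetical helper, not part of this tool:

```python
def image_message(vendor: str, image_b64: str, question: str,
                  mime: str = "image/jpeg") -> dict:
    """Build one user message carrying an image, per vendor convention.
    Illustrative helper; the tool's real adapters are internal."""
    if vendor == "openai":
        # OpenAI-compatible: data URL inside an image_url content part.
        return {"role": "user", "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{image_b64}"}},
        ]}
    if vendor == "claude":
        # Anthropic native image format: base64 source block.
        return {"role": "user", "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": mime,
                        "data": image_b64}},
            {"type": "text", "text": question},
        ]}
    if vendor == "gemini":
        # Gemini: inlineData part alongside the text part.
        return {"role": "user", "parts": [
            {"text": question},
            {"inlineData": {"mimeType": mime, "data": image_b64}},
        ]}
    raise ValueError(f"unknown vendor: {vendor}")
```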

## Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `image` | string | Yes | Local file path or HTTP(S) image URL |
| `question` | string | Yes | Question to ask about the image |

Supported image formats: `jpg`, `jpeg`, `png`, `gif`, `webp`
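
As a quick illustration of the interface, assume a stand-in `vision` function (in practice the agent invokes the tool for you; this function and its pre-flight format check are hypothetical):

```python
SUPPORTED_FORMATS = {"jpg", "jpeg", "png", "gif", "webp"}

def vision(image: str, question: str) -> str:
    """Hypothetical stand-in for the tool; both parameters are required,
    and `image` may be a local path or an HTTP(S) URL."""
    ext = image.rsplit(".", 1)[-1].lower()
    if ext not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported image format: .{ext}")
    ...  # resolve image to base64, pick a provider, ask `question`

vision("/tmp/receipt.png", "Extract all text from this receipt")
vision("https://example.com/photo.jpg", "What objects are in this photo?")
```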

## Custom Configuration

To pin the vision tool to a specific model, add the following to `config.json`:

```json
{
    "tool": {
        "vision": {
            "model": "ernie-4.5-turbo-vl"
        }
    }
}
```

In most cases no configuration is needed. The tool works automatically as long as the main model supports multimodal input or any vision-capable API key is configured.

## Use Cases

  - Describe image content
  - Extract text from images (OCR)
  - Identify objects, colors, scenes
  - Analyze screenshots and scanned documents

<Note> Images larger than 1MB are automatically compressed (max edge 1536px). All images (including remote URLs) are converted to base64 for transmission to ensure compatibility with all model backends. </Note>
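
Preprocessing along those lines might look roughly like the following Pillow sketch; the 1MB threshold and the 1536px max edge come from the note above, while `to_base64` itself is illustrative, not the tool's source:

```python
import base64
import io

from PIL import Image  # Pillow

MAX_BYTES = 1 * 1024 * 1024  # 1MB compression threshold from the note above
MAX_EDGE = 1536              # max edge in pixels after compression

def to_base64(raw: bytes) -> str:
    """Compress oversized images, then base64-encode for transmission.
    Illustrative sketch of the behavior described in the note."""
    if len(raw) > MAX_BYTES:
        img = Image.open(io.BytesIO(raw))
        # Shrink in place so the longest edge is at most MAX_EDGE pixels,
        # preserving the aspect ratio.
        img.thumbnail((MAX_EDGE, MAX_EDGE))
        buf = io.BytesIO()
        img.convert("RGB").save(buf, format="JPEG", quality=85)
        raw = buf.getvalue()
    return base64.b64encode(raw).decode("ascii")
```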