docs/usage/getting-started/vision.mdx
LobeHub supports vision capabilities — Agents can see and understand images you share. This multimodal feature enables conversations that go beyond text to include rich visual context.
Vision-enabled models can:

- Describe images and answer questions about their contents
- Extract text from screenshots, photos, and documents (OCR)
- Compare and analyze multiple images in a single conversation
- Interpret charts, designs, infographics, and other visual content
Supported formats: JPEG/JPG, PNG, WebP, GIF (static frames only), BMP
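If you are pre-filtering uploads in your own tooling, a minimal client-side check against this format list might look like the sketch below. The helper and constant names are illustrative, not part of LobeHub's API:

```ts
// A minimal sketch of a client-side format check, assuming you are
// pre-filtering uploads yourself. Names here are illustrative only.
const SUPPORTED_IMAGE_TYPES = new Set([
  'image/jpeg', // covers both .jpeg and .jpg
  'image/png',
  'image/webp',
  'image/gif', // only the static first frame is analyzed
  'image/bmp',
]);

function isSupportedImage(file: File): boolean {
  return SUPPORTED_IMAGE_TYPES.has(file.type);
}
```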
<Callout type={'info'}> The image upload button only appears when you are using a vision-capable model. If you don't see it, switch to a model that supports vision (see supported models below). </Callout>
<Callout type={'warning'}> Vision features consume more tokens than text-only conversations, which may affect API costs for self-hosted or API-key deployments. </Callout>
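LobeHub handles image uploads for you in the UI, but if you are budgeting tokens for a self-hosted or API-key deployment, it helps to see what a vision request looks like on the wire. The sketch below uses the OpenAI Node SDK against an OpenAI-compatible API; the model name, prompt, and image URL are placeholders:

```ts
import OpenAI from 'openai';

// Sketch of a single-image vision request using the OpenAI Node SDK.
// The model name, prompt, and image URL are placeholders.
const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const response = await client.chat.completions.create({
  model: 'gpt-4o', // any vision-capable model
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: "What's in this image?" },
        { type: 'image_url', image_url: { url: 'https://example.com/photo.jpg' } },
      ],
    },
  ],
});

console.log(response.choices[0].message.content);
// The image is billed as part of the prompt, so usage reflects the extra cost.
console.log('Total tokens:', response.usage?.total_tokens);
```

The `usage` field in the response reports the combined token count, which is where the extra cost of the image shows up.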
Ask general questions about an image:
"What's in this image?"
"Describe what you see in detail"
"What are the main elements of this photo?"
Extract text from images, screenshots, and documents:
"What does the text say?"
"Transcribe all text from this image"
"Read the error message in this screenshot"
This works with screenshots, photos of signs, printed documents, and code captured in images. Handwriting recognition is also supported, though accuracy varies.
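If you want to script this kind of text extraction yourself, a local screenshot can be embedded as a base64 data URL in the same request format. A sketch against an OpenAI-compatible API; the file path and prompt are placeholders:

```ts
import { readFile } from 'node:fs/promises';
import OpenAI from 'openai';

// Sketch: extract text from a local screenshot by embedding it as a
// base64 data URL. The file path and prompt are placeholders.
const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const image = await readFile('screenshot.png');
const dataUrl = `data:image/png;base64,${image.toString('base64')}`;

const response = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Transcribe all text from this image.' },
        { type: 'image_url', image_url: { url: dataUrl } },
      ],
    },
  ],
});

console.log(response.choices[0].message.content);
```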
Upload several images at once and ask for comparison or combined analysis:
"Compare these three design variations and suggest which is most effective"
"What are the differences between these before/after photos?"
"Analyze the trends shown in these charts"
The more specific your question, the better the analysis:
<Tabs>
  <Tab title="Object Identification">
    - "What type of plant is this?"
    - "What brand of laptop is shown?"
    - "Identify the components in this circuit board"
  </Tab>
  <Tab title="Scene Understanding">
    - "Where was this photo likely taken?"
    - "What time of day does this appear to be?"
    - "Describe the setting and atmosphere"
  </Tab>
  <Tab title="Technical Analysis">
    - "What colors are used in this design?"
    - "Evaluate the layout and spacing"
    - "What font family is being used?"
  </Tab>
  <Tab title="Content Analysis">
    - "What's the main message of this infographic?"
    - "Summarize the data shown in this chart"
    - "What arguments does this slide present?"
  </Tab>
</Tabs>

<Callout type={'warning'}> Vision models have limitations. Always verify critical information independently. </Callout>
Vision features are only available with vision-capable models. Look for the vision indicator next to a model's name in the model selector:
| Provider | Vision Models |
|---|---|
| OpenAI | GPT-4V, GPT-4o, GPT-4o mini |
| Anthropic | Claude 3 Haiku, Claude 3 Sonnet, Claude 3 Opus, Claude 3.5 Sonnet and newer |
| Google | Gemini 1.5 Flash, Gemini 1.5 Pro, Gemini Pro Vision |
Other providers may also offer vision models — check the model's capability tags in the selector.
<Cards>
  <Card href={'/docs/usage/getting-started/resource'} title={'Resource Library'} />
  <Card href={'/docs/usage/getting-started/image-generation'} title={'Image Generation'} />
  <Card href={'/docs/usage/providers'} title={'AI Providers'} />
</Cards>