docs/changelog/2023-11-14-gpt4-vision.mdx
Conversations in LobeHub are no longer limited to text. We now support several large language models with visual recognition capabilities, including OpenAI's gpt-4-vision, Google Gemini Pro Vision, and Zhipu's GLM-4 Vision.
Upload an image or drag it directly into the chat window, and your Agent can understand the visual content and continue the discussion in context. This works for screenshots, photos, diagrams, or any visual reference you need to share.
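Under the hood, vision-capable models accept chat messages whose content is a list of text and image parts rather than a plain string. A minimal sketch of building such a message (the helper name is ours; the content-parts shape follows OpenAI's chat format, with the image inlined as a base64 data URL):

```python
import base64

def build_vision_message(prompt: str, image_bytes: bytes, mime: str = "image/png") -> dict:
    """Build a user message that mixes text and an inline image,
    in the content-parts format vision models accept."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            # Image is embedded as a data URL; a hosted https URL also works.
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }

msg = build_vision_message("What does this diagram show?", b"\x89PNG...")
print(msg["content"][0]["type"], msg["content"][1]["type"])
```

In LobeHub the drag-and-drop upload handles this packaging for you; the sketch only illustrates what the model ultimately receives.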
This brings a more natural multimodal experience to both everyday and professional scenarios:
The assistant doesn't just see the image—it understands it within the ongoing conversation. Ask follow-up questions about specific details, compare multiple images, or use visuals as reference material for complex discussions.
For specialized fields, this means clearer context and more practical responses. Medical imaging discussions, architectural reviews, or technical diagram analysis all become more natural when both parties can see the same visual reference.
To better serve users across regions and preferences, we've also added high-quality voice options from OpenAI Audio and Microsoft Edge Speech. Choose a voice that fits your style or scenario for more personalized interactions.