docs/changelog/2023-11-14-gpt4-vision.mdx
Conversations in LobeHub are no longer limited to text. We now support several large language models with visual recognition capabilities, including OpenAI's gpt-4-vision, Google Gemini Pro Vision, and Zhipu's GLM-4 Vision.
Upload an image or drag it directly into the chat window, and your Agent can understand the visual content and continue the discussion in context. This works for screenshots, photos, diagrams, or any visual reference you need to share.
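Under the hood, vision-capable models accept chat messages whose content is a list of text and image parts rather than a plain string. A minimal sketch of building such a message (the helper name is ours; the content-parts shape follows OpenAI's chat format, with the image inlined as a base64 data URL):

```python
import base64

def build_vision_message(prompt: str, image_bytes: bytes, mime: str = "image/png") -> dict:
    """Build a user message that mixes text and an inline image,
    in the content-parts format vision models accept."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            # Image is embedded as a data URL; a hosted https URL also works.
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }

msg = build_vision_message("What does this diagram show?", b"\x89PNG...")
print(msg["content"][0]["type"], msg["content"][1]["type"])
```

In LobeHub the drag-and-drop upload handles this packaging for you; the sketch only illustrates what the model ultimately receives.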
This brings a more natural multimodal experience to both everyday and professional scenarios:
The assistant doesn't just see the image—it understands it within the ongoing conversation. Ask follow-up questions about specific details, compare multiple images, or use visuals as reference material for complex discussions.
For specialized fields, this means clearer context and more practical responses. Medical imaging discussions, architectural reviews, or technical diagram analysis all become more natural when both parties can see the same visual reference.
To better serve users across regions and preferences, we've also added high-quality voice options from OpenAI Audio and Microsoft Edge Speech. Choose a voice that fits your style or scenario for more personalized interactions.