docs/Multimodal Tutorial.md
GGUF models with vision capabilities are uploaded to Hugging Face alongside an mmproj file. For instance, unsloth/gemma-3-4b-it-GGUF includes one:
As an example, download a GGUF file from that repository to your textgen/user_data/models folder.

Then download

https://huggingface.co/unsloth/gemma-3-4b-it-GGUF/resolve/main/mmproj-F16.gguf?download=true

to your textgen/user_data/mmproj folder, and name it mmproj-gemma-3-4b-it-F16.gguf so it is easy to recognize.
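The mmproj download above can also be scripted. A minimal sketch, assuming the web UI lives in a local `textgen` folder (adjust the path to your installation); the download is wrapped in a try/except so the script degrades gracefully without a network connection:

```python
import urllib.request
from pathlib import Path

# Assumed local layout: change "textgen" to wherever the web UI is installed.
mmproj_dir = Path("textgen/user_data/mmproj")
mmproj_dir.mkdir(parents=True, exist_ok=True)

url = (
    "https://huggingface.co/unsloth/gemma-3-4b-it-GGUF"
    "/resolve/main/mmproj-F16.gguf?download=true"
)
# Save under a recognizable name so it is easy to match with the model.
dest = mmproj_dir / "mmproj-gemma-3-4b-it-F16.gguf"

if not dest.exists():
    try:
        urllib.request.urlretrieve(url, dest)
    except OSError as exc:
        print(f"Download failed ({exc}); fetch the file manually.")
```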
1. Launch the web UI.
2. Navigate to the Model tab.
3. Select the GGUF model in the Model dropdown.
4. Select the mmproj file in the "Multimodal (vision)" menu.
5. Click "Load".
6. Select your image by clicking the 📎 icon and send your message.
The model will reply with a solid understanding of the image contents.
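Images can also be sent programmatically. This is a sketch assuming the web UI was launched with the `--api` flag, which exposes an OpenAI-compatible server on http://127.0.0.1:5000 by default; the message shape follows the OpenAI chat format with a base64 data-URL image part, so verify the endpoint and port against your setup:

```python
import base64
import json
import urllib.request


def build_image_message(image_b64: str, prompt: str) -> dict:
    # OpenAI-style multimodal message: a text part plus a base64 data-URL image.
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{image_b64}"},
            },
        ],
    }


def describe_image(path: str, prompt: str = "What is in this image?") -> str:
    # Assumes the OpenAI-compatible API is reachable on the default port.
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    payload = {"messages": [build_image_message(image_b64, prompt)]}
    req = urllib.request.Request(
        "http://127.0.0.1:5000/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

With the server running and a model loaded, `describe_image("photo.png")` returns the model's description of the image.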
Multimodal also works with the ExLlamaV3 loader (the non-HF one). No additional files are necessary: just load a multimodal EXL3 model and send an image.
You can find some ready-to-use example models on the page below: