These models accept multi-modal inputs (e.g., images and text) and generate text output. They augment language models with multimodal encoders.
For example, to launch a server for one of these models:

```bash
# --model-path accepts a Hugging Face model ID or a local path.
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
  --host 0.0.0.0 \
  --port 30000
```
See the OpenAI APIs section for how to send multimodal requests.
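For instance, here is a minimal chat-completions request that attaches an image to the prompt. This is a sketch that assumes the server launched above is listening on port 30000; the image URL is a placeholder to replace with your own:

```python
import requests

url = "http://localhost:30000/v1/chat/completions"
data = {
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                # Placeholder URL: point this at any publicly reachable image.
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
            ],
        }
    ],
    "max_tokens": 64,
}

response = requests.post(url, json=data)
print(response.json()["choices"][0]["message"]["content"])
```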
The supported models are summarized in the table below.
If you are unsure whether a specific architecture is implemented, you can search for it on GitHub. For example, to search for `Qwen2_5_VLForConditionalGeneration`, enter the expression
`repo:sgl-project/sglang path:/^python\/sglang\/srt\/models\// Qwen2_5_VLForConditionalGeneration`
in the GitHub search bar.
<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}> <colgroup> <col style={{width: "22%"}} /> <col style={{width: "26%"}} /> <col style={{width: "40%"}} /> <col style={{width: "12%"}} /> </colgroup> <thead> <tr style={{borderBottom: "2px solid #d55816"}}> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Model Family (Variants)</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Example HuggingFace Identifier</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, backgroundColor: "rgba(255,255,255,0.02)"}}>Description</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Notes</th> </tr> </thead> <tbody> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>Qwen-VL</strong></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>Qwen/Qwen3-VL-235B-A22B-Instruct</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Alibaba's vision-language extension of Qwen; for example, Qwen2.5-VL (7B and larger variants) can analyze and converse about image content.</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>DeepSeek-VL2</strong></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>deepseek-ai/deepseek-vl2</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Vision-language variant of DeepSeek (with a dedicated image processor), enabling advanced multimodal reasoning on image and text inputs.</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>DeepSeek-OCR / OCR-2</strong></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>deepseek-ai/DeepSeek-OCR-2</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>OCR-focused DeepSeek models for document understanding and text extraction.</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Use <code>--trust-remote-code</code>.</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>Janus-Pro</strong> (1B, 7B)</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>deepseek-ai/Janus-Pro-7B</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>DeepSeek's open-source multimodal model capable of both image understanding and generation. 
Janus-Pro employs a decoupled architecture for separate visual encoding paths, enhancing performance in both tasks.</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>MiniCPM-V / MiniCPM-o</strong></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>openbmb/MiniCPM-V-2_6</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>MiniCPM-V (2.6, ~8B) supports image inputs, and MiniCPM-o adds audio/video; these multimodal LLMs are optimized for end-side deployment on mobile/edge devices.</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>Llama 3.2 Vision</strong> (11B)</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>meta-llama/Llama-3.2-11B-Vision-Instruct</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Vision-enabled variant of Llama 3 (11B) that accepts image inputs for visual question answering and other multimodal tasks.</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>LLaVA</strong> (v1.5 & v1.6)</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><em>e.g.</em> <code>liuhaotian/llava-v1.5-13b</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Open vision-chat models that add an image encoder to LLaMA/Vicuna (e.g. 
LLaMA2 13B) for following multimodal instruction prompts.</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>LLaVA-NeXT</strong> (8B, 72B)</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>lmms-lab/llava-next-72b</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Improved LLaVA models (with an 8B Llama3 version and a 72B version) offering enhanced visual instruction-following and accuracy on multimodal benchmarks.</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>LLaVA-OneVision</strong></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>lmms-lab/llava-onevision-qwen2-7b-ov</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Enhanced LLaVA variant integrating Qwen as the backbone; supports multiple images (and even video frames) as inputs via an OpenAI Vision API-compatible format.</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>Gemma 3 (Multimodal)</strong></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>google/gemma-3-4b-it</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Gemma 3's larger models (4B, 12B, 27B) accept images (each image encoded as 256 tokens) alongside text in a combined 128K-token context.</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>Kimi-VL</strong> (A3B)</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>moonshotai/Kimi-VL-A3B-Instruct</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Kimi-VL is a multimodal model that can understand and generate text from images.</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>Mistral-Small-3.1-24B</strong></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>mistralai/Mistral-Small-3.1-24B-Instruct-2503</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Mistral 3.1 is a multimodal model that can generate text from text or images input. It also supports tool calling and structured output.</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>Phi-4-multimodal-instruct</strong></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>microsoft/Phi-4-multimodal-instruct</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Phi-4-multimodal-instruct is the multimodal variant of the Phi-4-mini model, enhanced with LoRA for improved multimodal capabilities. 
It supports text, vision and audio modalities in SGLang.</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>MiMo-VL</strong> (7B)</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>XiaomiMiMo/MiMo-VL-7B-RL</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Xiaomi's compact yet powerful vision-language model featuring a native resolution ViT encoder for fine-grained visual details, an MLP projector for cross-modal alignment, and the MiMo-7B language model optimized for complex reasoning tasks.</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>GLM-4.5V</strong> (106B) / <strong>GLM-4.1V</strong>(9B)</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>zai-org/GLM-4.5V</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Use <code>--chat-template glm-4v</code></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>GLM-OCR</strong></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>zai-org/GLM-OCR</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>GLM-OCR: A fast and accurate general OCR model</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>DotsVLM</strong> (General/OCR)</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>rednote-hilab/dots.vlm1.inst</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>RedNote's vision-language model built on a 1.2B vision encoder and DeepSeek V3 LLM, featuring NaViT vision encoder trained from scratch with dynamic resolution support and enhanced OCR capabilities through structured image data training.</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>DotsVLM-OCR</strong></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>rednote-hilab/dots.ocr</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Specialized OCR variant of DotsVLM optimized for optical character recognition tasks with enhanced text extraction and document understanding capabilities.</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Don't use <code>--trust-remote-code</code></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>NVILA</strong> (8B, 15B, Lite-2B, Lite-8B, Lite-15B)</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>Efficient-Large-Model/NVILA-8B</code></td> <td style={{padding: "9px 12px", backgroundColor: 
"rgba(255,255,255,0.02)"}}><code>chatml</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>NVILA explores the full stack efficiency of multi-modal design, achieving cheaper training, faster deployment and better performance.</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>NVIDIA Nemotron Nano 2.0 VL</strong></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>NVIDIA Nemotron Nano v2 VL enables multi-image reasoning and video understanding, along with strong document intelligence, visual Q&A and summarization capabilities. It builds on Nemotron Nano V2, a hybrid Mamba-Transformer LLM, in order to achieve higher inference throughput in long document and video scenarios.</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Use <code>--trust-remote-code</code>. You may need to adjust <code>--max-mamba-cache-size</code> [default is 512] to fit memory constraints.</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>Ernie4.5-VL</strong></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>baidu/ERNIE-4.5-VL-28B-A3B-PT</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Baidu's vision-language models(28B,424B). Support image and video comprehension, and also support thinking.</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>JetVLM</strong></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>JetVLM is an vision-language model designed for high-performance multimodal understanding and generation tasks built upon Jet-Nemotron.</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Coming soon</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>Step3-VL</strong> (10B)</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>stepfun-ai/Step3-VL-10B</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>StepFun's lightweight open-source 10B parameter VLM for multimodal intelligence, excelling in visual perception, complex reasoning, and human alignment.</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>Qwen3-ASR</strong> (0.6B, 1.7B)</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>Qwen/Qwen3-ASR-1.7B</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Alibaba's automatic speech recognition models supporting 52 languages. 
Served via the <code>/v1/audio/transcriptions</code> endpoint.</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>Qwen3-Omni</strong></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>Qwen/Qwen3-Omni-30B-A3B-Instruct</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Alibaba's omni-modal MoE model. Currently supports the <strong>Thinker</strong> component (multimodal understanding for text, images, audio, and video), while the <strong>Talker</strong> component (audio generation) is not yet supported.</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>LFM2-VL</strong></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>LiquidAI/LFM2.5-VL-1.6B</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Liquid AI's vision-language model combining a SigLip2 vision encoder (NaFlex variable-resolution) with the LFM2 hybrid attention + short convolution language model. Supports multi-image inputs.</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}></td> </tr> </tbody> </table>

SGLang supports audio-only ASR models via the OpenAI-compatible `/v1/audio/transcriptions` endpoint. Upload an audio file and receive a transcription. For example, launch the server:
```bash
sglang serve \
  --model-path Qwen/Qwen3-ASR-1.7B \
  --served-model-name qwen3-asr \
  --trust-remote-code \
  --host 0.0.0.0 --port 30000
```
Then send an audio file:

```bash
# Replace audio.wav with the path to your audio file.
curl http://localhost:30000/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F model=qwen3-asr \
  -F response_format=verbose_json
```
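The same request can be made from Python. A minimal sketch using `requests` (the multipart field names follow the OpenAI transcription API; the file path is a placeholder):

```python
import requests

url = "http://localhost:30000/v1/audio/transcriptions"

# Multipart form upload: the audio file plus the served model name.
with open("audio.wav", "rb") as f:  # replace with your audio file
    files = {"file": f}
    data = {"model": "qwen3-asr", "response_format": "verbose_json"}
    response = requests.post(url, files=files, data=data)

print(response.json())
```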
SGLang supports video input for Vision-Language Models (VLMs), enabling temporal reasoning tasks such as video question answering, captioning, and holistic scene understanding. Video clips are decoded, key frames are sampled, and the resulting tensors are batched together with the text prompt, allowing multimodal inference to integrate visual and linguistic context.
<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}> <colgroup> <col style={{width: "24%"}} /> <col style={{width: "38%"}} /> <col style={{width: "38%"}} /> </colgroup> <thead> <tr style={{borderBottom: "2px solid #d55816"}}> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Model Family</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Example Identifier</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Video notes</th> </tr> </thead> <tbody> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>Qwen-VL</strong> (Qwen2-VL, Qwen2.5-VL, Qwen3-VL, Qwen3-Omni)</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>Qwen/Qwen3-VL-235B-A22B-Instruct</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>The processor gathers <code>video_data</code>, runs Qwen's frame sampler, and merges the resulting features with text tokens before inference.</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>GLM-4v</strong> (4.5V, 4.1V, MOE)</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>zai-org/GLM-4.5V</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Video clips are read with Decord, converted to tensors, and passed to the model alongside metadata for rotary-position handling.</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>NVILA</strong> (Full & Lite)</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>Efficient-Large-Model/NVILA-8B</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>The runtime samples eight frames per clip and attaches them to the multimodal request when <code>video_data</code> is present.</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>LLaVA video variants</strong> (LLaVA-NeXT-Video, LLaVA-OneVision)</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>lmms-lab/LLaVA-NeXT-Video-7B</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>The processor routes video prompts to the LlavaVid video-enabled architecture, and the provided example shows how to query it with <code>sgl.video(...)</code> clips.</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>NVIDIA Nemotron Nano 2.0 VL</strong></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}><code>nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>The processor samples at 2 FPS, at a max of 128 frames, as per model training. The model uses <a href="https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/multimodal/evs/README.md">EVS</a>, a pruning method that removes redundant tokens from video embeddings. By default <code>video_pruning_rate=0.7</code>. 
Change this by passing, for example, <code>--json-model-override-args '{"video_pruning_rate": 0.0}'</code> to disable EVS.</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>JetVLM</strong></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>The runtime samples eight frames per clip and attaches them to the multimodal request when <code>video_data</code> is present.</td> </tr> </tbody> </table>

Use `sgl.video(path, num_frames)` when building prompts to attach clips from your SGLang programs.
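For illustration, a minimal frontend-language sketch (assuming a server is already running at `http://localhost:30000` and `clip.mp4` is a local video file):

```python
import sglang as sgl

# Point the frontend language at a running SGLang server.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def video_qa(s, video_path, question):
    # sgl.video attaches sampled frames from the clip to the user turn.
    s += sgl.user(sgl.video(video_path, num_frames=8) + question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=256))

state = video_qa.run(video_path="clip.mp4", question="What happens in this video?")
print(state["answer"])
```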
Example OpenAI-compatible request that sends a video clip:
```python
import requests

url = "http://localhost:30000/v1/chat/completions"
data = {
    "model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s happening in this video?"},
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://github.com/sgl-project/sgl-test-files/raw/refs/heads/main/videos/jobs_presenting_ipod.mp4"
                    },
                },
            ],
        }
    ],
    "max_tokens": 300,
}

response = requests.post(url, json=data)
print(response.text)
```
For multimodal models, the following server flags let you trade GPU memory against latency and input size:

- `--keep-mm-feature-on-device`: multimodal feature tensors stay on the GPU, reducing device-to-host copy overhead and improving latency at the cost of additional GPU memory. Use this flag when you have sufficient GPU memory and want to minimize latency for multimodal inference.
- `--mm-process-config '{"image":{"max_pixels":1048576},"video":{"fps":3,"max_pixels":602112,"max_frames":60}}'`: sets image, video, and audio input limits. This can reduce GPU memory usage, improve inference speed, and help avoid OOM errors, but it may also affect model quality, so choose values appropriate for your use case. The config entries are passed as `images_kwargs`, `videos_kwargs`, and `audio_kwargs` to the HuggingFace processor, so each modality's settings are kept separate and do not collide. Refer to the HuggingFace documentation for your model's processor to understand the available parameters.
Note for serving the Gemma-3 multimodal model:
As mentioned in *Welcome Gemma 3: Google's all new multimodal, multilingual, long context open LLM*, Gemma-3 employs bidirectional attention between image tokens during the prefill phase. Currently, SGLang supports bidirectional attention only when using the Triton attention backend. Note, however, that SGLang's current bidirectional attention implementation is incompatible with both CUDA Graph and Chunked Prefill.
To enable bidirectional attention, you can use the TritonAttnBackend while disabling CUDA Graph and Chunked Prefill. Example launch command:
```bash
# Use the Triton attention backend, disable CUDA Graph, and disable Chunked Prefill
# so that bidirectional attention is used for image tokens.
python -m sglang.launch_server \
  --model-path google/gemma-3-4b-it \
  --host 0.0.0.0 --port 30000 \
  --enable-multimodal \
  --dtype bfloat16 --triton-attention-reduce-in-fp32 \
  --attention-backend triton \
  --disable-cuda-graph \
  --chunked-prefill-size -1
```
If higher serving performance is required and some accuracy loss is acceptable, you may use other attention backends and re-enable features such as CUDA Graph and Chunked Prefill. Note, however, that the model will then fall back to causal attention instead of bidirectional attention.