Llama 4 is Meta's latest generation of open-source LLMs with industry-leading performance.
SGLang has supported Llama 4 Scout (109B) and Llama 4 Maverick (400B) since v0.4.5.
Ongoing optimizations are tracked in the Roadmap.
To serve Llama 4 models on 8xH100/H200 GPUs:

```bash
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tp 8 \
  --context-length 1000000
```
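Once the server is ready, you can sanity-check it through the OpenAI-compatible endpoint. A minimal sketch, assuming the server listens on the default port 30000:

```bash
# Send a test request to the OpenAI-compatible chat completions endpoint
# (default port 30000 assumed; adjust if you passed --port)
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64
  }'
```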
OOM Mitigation: Adjust --context-length to avoid GPU out-of-memory issues. For the Scout model, we recommend setting this value up to 1M on 8xH100 and up to 2.5M on 8xH200. For the Maverick model, you don't need to set a context length on 8xH200. When the hybrid KV cache is enabled, --context-length can be set up to 5M on 8xH100 and up to 10M on 8xH200 for the Scout model.
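For example, a minimal sketch of a Scout launch on 8xH200 using the recommended upper bound from above:

```bash
# Scout on 8xH200: the recommendation above allows --context-length up to 2.5M
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tp 8 \
  --context-length 2500000
```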
Attention Backend Auto-Selection: SGLang automatically selects the optimal attention backend for Llama 4 based on your hardware. You typically don't need to specify --attention-backend manually:
- trtllm_mha
- fa3
- aiter
- intel_xpu
- triton (fallback)

To override the auto-selection, explicitly specify --attention-backend with one of the supported backends: fa3, aiter, triton, trtllm_mha, or intel_xpu.
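For example, a sketch that forces the FlashAttention 3 backend instead of the auto-selected one; pick the backend that matches your hardware:

```bash
# Override backend auto-selection with an explicit choice
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tp 8 \
  --attention-backend fa3
```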
Chat Template: Add --chat-template llama-4 for chat completion tasks.
Enable Multi-Modal: Add --enable-multimodal for multi-modal capabilities.
Enable Hybrid-KVCache: Set --swa-full-tokens-ratio to adjust the ratio of SWA-layer KV tokens (for Llama 4, these are the local attention layers) to full-layer KV tokens (default: 0.8, range: 0-1). A combined example of these options follows below.
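A minimal sketch combining the three options above in one launch; the ratio 0.5 is only an illustrative value, not a tuned recommendation:

```bash
# Chat template + multimodal + hybrid KV cache in one launch
# (--swa-full-tokens-ratio 0.5 is an arbitrary illustrative value)
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tp 8 \
  --chat-template llama-4 \
  --enable-multimodal \
  --swa-full-tokens-ratio 0.5
```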
Description: SGLang supports EAGLE speculative decoding for Llama 4 Maverick (400B).
Usage:
Add arguments --speculative-draft-model-path, --speculative-algorithm, --speculative-num-steps, --speculative-eagle-topk and --speculative-num-draft-tokens to enable this feature. For example:
```bash
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path nvidia/Llama-4-Maverick-17B-128E-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --trust-remote-code \
  --tp 8 \
  --context-length 1000000
```
Measured with lm_eval, the accuracy of SGLang for both Llama 4 Scout and Llama 4 Maverick matches the official benchmark numbers.
Benchmark results on the MMLU Pro dataset with 8xH100:
| | Llama-4-Scout-17B-16E-Instruct | Llama-4-Maverick-17B-128E-Instruct |
| --- | --- | --- |
| Official Benchmark | 74.3 | 80.5 |
| SGLang | 75.2 | 80.7 |

Commands:
```bash
# Llama-4-Scout-17B-16E-Instruct model
python -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --port 30000 \
  --tp 8 \
  --mem-fraction-static 0.8 \
  --context-length 65536

lm_eval --model local-chat-completions \
  --model_args model=meta-llama/Llama-4-Scout-17B-16E-Instruct,base_url=http://localhost:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 \
  --tasks mmlu_pro \
  --batch_size 128 \
  --apply_chat_template \
  --num_fewshot 0
```
```bash
# Llama-4-Maverick-17B-128E-Instruct
python -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct \
  --port 30000 \
  --tp 8 \
  --mem-fraction-static 0.8 \
  --context-length 65536

lm_eval --model local-chat-completions \
  --model_args model=meta-llama/Llama-4-Maverick-17B-128E-Instruct,base_url=http://localhost:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 \
  --tasks mmlu_pro \
  --batch_size 128 \
  --apply_chat_template \
  --num_fewshot 0
```
Details can be seen in this PR.