docs_new/docs/sglang-diffusion/quantization.mdx
SGLang-Diffusion supports quantized transformer checkpoints. In most cases, keep the base model and the quantized transformer override separate.
Use these paths:
- --model-path: the base or original model
- --transformer-path: a quantized transformers-style transformer component directory that already contains its own config.json
- --transformer-weights-path: quantized transformer weights provided as a single safetensors file, a sharded safetensors directory, a local path, or a Hugging Face repo ID
- --quantization: apply online quantization to unquantized models at load time (activations are quantized dynamically)
- --quantization-ignored-layers: layer name patterns to keep unquantized (e.g. attention.to_)
Recommended example for pre-quantized checkpoints:
sglang generate \
--model-path black-forest-labs/FLUX.2-dev \
--transformer-weights-path black-forest-labs/FLUX.2-dev-NVFP4 \
--prompt "a curious pikachu"
For quantized transformers-style transformer component folders:
sglang generate \
--model-path /path/to/base-model \
--transformer-path /path/to/quantized-transformer \
--prompt "A Logo With Bold Large Text: SGL Diffusion"
NOTE: Some model-specific integrations also accept a quantized repo or local
directory directly as --model-path, but that is a compatibility path. If a
repo contains multiple candidate checkpoints, pass
--transformer-weights-path explicitly.
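The choice between the two override flags can be sketched in Python for local paths. This is a minimal illustration, not SGLang's loader logic: `pick_override_flag` and `build_generate_command` are hypothetical helpers, and the only rule applied is the one stated above (a directory that carries its own config.json is a transformer component; anything else is treated as weights). Hugging Face repo IDs are not inspected, so this sketch always treats them as weights.

```python
import tempfile
from pathlib import Path


def pick_override_flag(local_path: str) -> str:
    """Hypothetical helper: choose the override flag for a local path.

    A directory containing config.json is treated as a transformer
    component; every other path is treated as raw weights.
    """
    path = Path(local_path)
    if path.is_dir() and (path / "config.json").is_file():
        return "--transformer-path"
    return "--transformer-weights-path"


def build_generate_command(model_path: str, override: str, prompt: str) -> list[str]:
    """Assemble an argument list for `sglang generate` (not executed here)."""
    return [
        "sglang", "generate",
        "--model-path", model_path,
        pick_override_flag(override), override,
        "--prompt", prompt,
    ]


# Demo: a throwaway directory stands in for a quantized component folder.
with tempfile.TemporaryDirectory() as tmp:
    component_dir = Path(tmp) / "quantized-transformer"
    component_dir.mkdir()
    (component_dir / "config.json").write_text("{}")
    component_cmd = build_generate_command(
        "/path/to/base-model", str(component_dir), "a curious pikachu"
    )

weights_cmd = build_generate_command(
    "/path/to/base-model", "/path/to/weights.safetensors", "a curious pikachu"
)
print(component_cmd[4], weights_cmd[4])
```

The resulting list can be handed to `subprocess.run` on a machine where the `sglang` CLI is installed.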
Here, quant_family means a checkpoint and loading family with shared CLI
usage and loader behavior. It is not just the numeric precision or a kernel
backend.
Online quantization quantizes an unquantized model at load time. This is useful when pre-quantized checkpoints are not available.
Apply FP8 quantization to any unquantized model:
sglang generate \
--model-path Tongyi-MAI/Z-Image-Turbo \
--quantization fp8 \
--prompt "a beautiful sunset" \
--save-output
MXFP4 provides aggressive 4-bit compression with online quantization. Note: Requires ROCm and an MI350+ (gfx95x) GPU.
sglang generate \
--model-path Tongyi-MAI/Z-Image-Turbo \
--quantization mxfp4 \
--prompt "a beautiful sunset" \
--save-output
Note: Requires the aiter package with MXFP4 kernel support.
By default, online quantization quantizes every linear layer in the transformer. Use --quantization-ignored-layers to keep specific layers in their original precision:
sglang generate \
--model-path Tongyi-MAI/Z-Image-Turbo \
--quantization fp8 \
--quantization-ignored-layers attention.to_ \
--prompt "a beautiful sunset" \
--save-output
sglang generate \
--model-path Tongyi-MAI/Z-Image-Turbo \
--quantization mxfp4 \
--quantization-ignored-layers attention.to_ \
--prompt "a beautiful sunset" \
--save-output
Each pattern is matched against the full layer prefix (e.g. layers.0.attention.to_q). A layer is skipped and left unquantized if its prefix contains any of the given patterns.
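The matching rule can be sketched as a plain substring test. This is a minimal illustration of the rule as described, not SGLang's implementation; `is_ignored` is a hypothetical helper, and every layer name except layers.0.attention.to_q is made up for the example.

```python
def is_ignored(layer_prefix: str, patterns: list[str]) -> bool:
    """True if any pattern occurs as a substring of the full layer prefix."""
    return any(pattern in layer_prefix for pattern in patterns)


patterns = ["attention.to_"]
layers = [
    "layers.0.attention.to_q",
    "layers.0.attention.to_out.0",  # hypothetical layer name
    "layers.0.feed_forward.w1",     # hypothetical layer name
]

kept_unquantized = [name for name in layers if is_ignored(name, patterns)]
print(kept_unquantized)
```

Because the test is a substring match rather than a glob or regular expression, a short pattern such as attention.to_ covers every projection whose prefix contains it.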
This section is the canonical support matrix for the nine diffusion ModelOpt checkpoints currently wired up in SGLang docs and validation coverage.
Published checkpoints keep the serialized quantization config as
quant_method=modelopt; the FP8 vs NVFP4 split below is a documentation label
derived from quant_algo.
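How the documentation label follows from the serialized fields can be sketched as below. This assumes a flat dictionary with quant_method and quant_algo keys; real checkpoint configs may nest these fields differently, and `support_family` is a hypothetical helper rather than an SGLang function.

```python
def support_family(quant_config: dict) -> str:
    """Map serialized ModelOpt fields to the documentation family label."""
    if quant_config.get("quant_method") != "modelopt":
        raise ValueError("not a ModelOpt quantization config")
    labels = {"FP8": "modelopt-fp8", "NVFP4": "modelopt-nvfp4"}
    algo = str(quant_config.get("quant_algo", "")).upper()
    if algo not in labels:
        raise ValueError(f"unrecognized quant_algo: {algo!r}")
    return labels[algo]


# Assumed flat layout; only the two field names come from the text above.
print(support_family({"quant_method": "modelopt", "quant_algo": "FP8"}))
print(support_family({"quant_method": "modelopt", "quant_algo": "NVFP4"}))
```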
Six of the nine repos live under lmsys/*. The Wan2.2 entries use NVIDIA's
official full Diffusers repos, and the FLUX.2 NVFP4 entry keeps the official
black-forest-labs/FLUX.2-dev-NVFP4 repo.
The FP8 rows run in the regular H100 1-GPU diffusion CI shard; the NVFP4 rows
run in the B200 diffusion CI shard (multimodal-gen-test-1-b200).
Converted ModelOpt FP8 transformer repos should be loaded as transformer
component overrides. If the repo or local directory already contains
config.json, use --transformer-path. Full Diffusers repos such as the
NVIDIA Wan2.2 FP8 checkpoint can be passed directly with --model-path.
sglang generate \
--model-path black-forest-labs/FLUX.2-dev \
--transformer-path lmsys/flux2-dev-modelopt-fp8-sglang-transformer \
--prompt "A Logo With Bold Large Text: SGL Diffusion" \
--save-output
sglang generate \
--model-path nvidia/Wan2.2-T2V-A14B-Diffusers-FP8 \
--prompt "a fox walking through neon rain" \
--save-output
sglang generate \
--model-path hunyuanvideo-community/HunyuanVideo \
--transformer-path lmsys/hunyuanvideo-modelopt-fp8-sglang-transformer \
--height 544 --width 960 --num-frames 17 \
--prompt "A cinematic shot of a red sports car driving through rain at night" \
--save-output
sglang generate \
--model-path Qwen/Qwen-Image \
--transformer-path lmsys/qwen-image-modelopt-fp8-sglang-transformer \
--prompt "A tiny astronaut reading a book under a glass greenhouse" \
--save-output
sglang generate \
--model-path Qwen/Qwen-Image-Edit-2511 \
--transformer-path lmsys/qwen-image-edit-modelopt-fp8-sglang-transformer \
--image-path /path/to/input.png \
--prompt "Turn the scene into a warm watercolor illustration" \
--save-output
- --transformer-path is the canonical flag for converted ModelOpt FP8 transformer component repos or directories that already carry config.json.
- When the override carries its own config.json, SGLang reads the quantization config from that override instead of relying on the base model config.
- --transformer-weights-path still works when you intentionally point at raw weight files or a directory that should be metadata-probed as weights first.
- dit_layerwise_offload is supported for ModelOpt FP8 checkpoints.
- dit_cpu_offload stays disabled for ModelOpt FP8 checkpoints.
- The serialized config is quant_method=modelopt with quant_algo=FP8; the modelopt-fp8 label in this document is a support family name, not a serialized config key.
- To build a ModelOpt FP8 transformer component, use python -m sglang.multimodal_gen.tools.build_modelopt_fp8_transformer.
For mixed ModelOpt NVFP4 transformer overrides that already contain config.json, keep the base model and quantized transformer separate and use --transformer-path:
sglang generate \
--model-path black-forest-labs/FLUX.1-dev \
--transformer-path lmsys/flux1-dev-modelopt-nvfp4-sglang-transformer \
--prompt "A Logo With Bold Large Text: SGL Diffusion" \
--save-output
For raw NVFP4 exports such as the official FLUX.2 release, use
--transformer-weights-path:
sglang generate \
--model-path black-forest-labs/FLUX.2-dev \
--transformer-weights-path black-forest-labs/FLUX.2-dev-NVFP4 \
--prompt "A Logo With Bold Large Text: SGL Diffusion" \
--save-output
SGLang also supports passing the NVFP4 repo or local directory directly as
--model-path:
sglang generate \
--model-path black-forest-labs/FLUX.2-dev-NVFP4 \
--prompt "A Logo With Bold Large Text: SGL Diffusion" \
--save-output
For Wan2.2 NVFP4:
sglang generate \
--model-path nvidia/Wan2.2-T2V-A14B-Diffusers-NVFP4 \
--prompt "a fox walking through neon rain" \
--save-output
- Use --transformer-path for mixed ModelOpt NVFP4 transformer repos or local directories that already include config.json.
- Use --transformer-weights-path for raw NVFP4 exports, individual safetensors files, or repo layouts that should be treated as weights first.
- For Wan2.2-T2V-A14B-Diffusers, the primary --transformer-path override targets only transformer. Use a per-component override such as --transformer-2-path only when you intentionally want a non-default transformer_2.
- [...] flashinfer_trtllm).
- --model-path loading is a compatibility path for FLUX.2 NVFP4-style repos or local directories.
- If --transformer-weights-path is provided explicitly, it takes precedence over the compatibility --model-path flow.
- For local directories, the loader first looks for *-mixed.safetensors, then falls back to loading from the directory.
- The FlashInfer FP4 GEMM backend is selected with SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKEND. Supported values include flashinfer_cudnn, flashinfer_cutlass, and flashinfer_trtllm.
- The serialized config is quant_method=modelopt with quant_algo=NVFP4; the modelopt-nvfp4 label here is again a documentation family name rather than a serialized config key.
Install the runtime dependency first:
pip install nunchaku
For platform-specific installation methods and troubleshooting, see the Nunchaku installation guide.
For Nunchaku checkpoints, --model-path should still point to the original
base model, while --transformer-weights-path points to the quantized
transformer weights.
If the basename of --transformer-weights-path contains the pattern
svdq-(int4|fp4)_r{rank}, SGLang will automatically:
- infer --quantization-precision
- infer --quantization-rank
Examples:
<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}> <colgroup> <col style={{width: "25%"}} /> <col style={{width: "25%"}} /> <col style={{width: "25%"}} /> <col style={{width: "25%"}} /> </colgroup> <thead> <tr> <th>checkpoint name fragment</th> <th>inferred precision</th> <th>inferred rank</th> <th>notes</th> </tr> </thead> <tbody> <tr> <td><code>svdq-int4_r32</code></td> <td><code>int4</code></td> <td><code>32</code></td> <td>Standard INT4 checkpoint</td> </tr> <tr> <td><code>svdq-int4_r128</code></td> <td><code>int4</code></td> <td><code>128</code></td> <td>Higher-quality INT4 checkpoint</td> </tr> <tr> <td><code>svdq-fp4_r32</code></td> <td><code>nvfp4</code></td> <td><code>32</code></td> <td><code>fp4</code> in the filename maps to CLI value <code>nvfp4</code></td> </tr> <tr> <td><code>svdq-fp4_r128</code></td> <td><code>nvfp4</code></td> <td><code>128</code></td> <td>Higher-quality NVFP4 checkpoint</td> </tr> </tbody> </table>
Common filenames:
<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}> <colgroup> <col style={{width: "25%"}} /> <col style={{width: "25%"}} /> <col style={{width: "25%"}} /> <col style={{width: "25%"}} /> </colgroup> <thead> <tr> <th>filename</th> <th>precision</th> <th>rank</th> <th>typical use</th> </tr> </thead> <tbody> <tr> <td><code>svdq-int4_r32-qwen-image.safetensors</code></td> <td><code>int4</code></td> <td><code>32</code></td> <td>Balanced default</td> </tr> <tr> <td><code>svdq-int4_r128-qwen-image.safetensors</code></td> <td><code>int4</code></td> <td><code>128</code></td> <td>Quality-focused</td> </tr> <tr> <td><code>svdq-fp4_r32-qwen-image.safetensors</code></td> <td><code>nvfp4</code></td> <td><code>32</code></td> <td>RTX 50-series / NVFP4 path</td> </tr> <tr> <td><code>svdq-fp4_r128-qwen-image.safetensors</code></td> <td><code>nvfp4</code></td> <td><code>128</code></td> <td>Quality-focused NVFP4</td> </tr> <tr> <td><code>svdq-int4_r32-qwen-image-lightningv1.0-4steps.safetensors</code></td> <td><code>int4</code></td> <td><code>32</code></td> <td>Lightning 4-step</td> </tr> <tr> <td><code>svdq-int4_r128-qwen-image-lightningv1.1-8steps.safetensors</code></td> <td><code>int4</code></td> <td><code>128</code></td> <td>Lightning 8-step</td> </tr> </tbody> </table>
If your checkpoint name does not follow this convention, pass --enable-svdquant, --quantization-precision, and --quantization-rank explicitly.
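The filename convention can be checked ahead of time with a short Python sketch. This mirrors the documented pattern svdq-(int4|fp4)_r{rank} and the fp4 to nvfp4 mapping from the tables; `infer_svdquant_settings` is a hypothetical helper, and SGLang's own detection code may differ in detail.

```python
import re
from pathlib import PurePath
from typing import Optional, Tuple

# Pattern from the text above: svdq-(int4|fp4)_r{rank}
_SVDQ_PATTERN = re.compile(r"svdq-(int4|fp4)_r(\d+)")
# Filename fragment -> CLI precision value (fp4 is written as nvfp4 on the CLI)
_PRECISION_CLI_VALUE = {"int4": "int4", "fp4": "nvfp4"}


def infer_svdquant_settings(weights_path: str) -> Optional[Tuple[str, int]]:
    """Return (precision, rank) parsed from the basename, or None if absent."""
    match = _SVDQ_PATTERN.search(PurePath(weights_path).name)
    if match is None:
        return None
    return _PRECISION_CLI_VALUE[match.group(1)], int(match.group(2))


for path in (
    "/path/to/svdq-int4_r32-qwen-image.safetensors",
    "/path/to/svdq-fp4_r128-qwen-image.safetensors",
    "/path/to/custom_nunchaku_checkpoint.safetensors",
):
    print(path, "->", infer_svdquant_settings(path))
```

A None result corresponds to the case where the quantization flags have to be passed explicitly.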
Recommended auto-detected flow:
sglang generate \
--model-path Qwen/Qwen-Image \
--transformer-weights-path /path/to/svdq-int4_r32-qwen-image.safetensors \
--prompt "a beautiful sunset" \
--save-output
Manual override when the filename does not encode the quant settings:
sglang generate \
--model-path Qwen/Qwen-Image \
--transformer-weights-path /path/to/custom_nunchaku_checkpoint.safetensors \
--enable-svdquant \
--quantization-precision int4 \
--quantization-rank 128 \
--prompt "a beautiful sunset" \
--save-output
- --transformer-weights-path is the canonical flag for Nunchaku checkpoints. Older config names such as quantized_model_path are treated as compatibility aliases.
- Auto-detection relies on the filename pattern svdq-(int4|fp4)_r{rank}.
- The precision values are int4 and nvfp4. In filenames, the NVFP4 variant is written as fp4.
- Lightning checkpoints typically need a matching --num-inference-steps, such as 4 or 8.
MindStudio-ModelSlim (msModelSlim) is an offline model quantization and compression tool launched by MindStudio and optimized for Ascend hardware.
Installation
# Clone repo and install msmodelslim:
git clone https://gitcode.com/Ascend/msmodelslim.git
cd msmodelslim
bash install.sh
Multimodal_sd quantization
Download the original floating-point weights of the model. Taking Wan2.2-T2V-A14B as an example, obtain the original model weights from Wan2.2-T2V-A14B. Then install the model-specific dependencies (refer to the modelscope model card).
Note: You can find pre-quantized validated models on modelscope/Eco-Tech.
Run quantization using one-click quantization (recommended):
msmodelslim quant \
--model_path /path/to/wan2_2_float_weights \
--save_path /path/to/wan2_2_quantized_weights \
--device npu \
--model_type Wan2_2 \
--quant_type w8a8 \
--trust_remote_code True
For more detailed quantization examples and information about model support, see the examples section in the msModelSlim repo.
Note: SGLang does not support quantized embeddings; disable this option when quantizing with msmodelslim.
Auto-Detection and different formats
For msmodelslim checkpoints, specifying --model-path is enough: quantization is detected automatically for each layer by parsing the quant_model_description.json config.
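A per-layer lookup of this kind can be sketched as follows. The file layout used here (weight names as keys, scheme strings as values, FLOAT for unquantized entries) is an assumption for illustration, the layer names are made up, and `scheme_for_layer` is a hypothetical helper rather than SGLang's detection code.

```python
import json

# Hypothetical excerpt of quant_model_description.json. The key layout
# (weight name -> scheme string, "FLOAT" for unquantized) is an assumption.
raw = """
{
  "blocks.0.attn.to_q.weight": "W8A8",
  "blocks.0.attn.to_k.weight": "W8A8_DYNAMIC",
  "blocks.0.norm1.weight": "FLOAT"
}
"""
description = json.loads(raw)


def scheme_for_layer(description: dict, layer_prefix: str, default: str = "FLOAT") -> str:
    """Look up the scheme recorded for a layer's weight, else the default."""
    return description.get(f"{layer_prefix}.weight", default)


for prefix in ("blocks.0.attn.to_q", "blocks.0.attn.to_k", "blocks.0.ffn.proj"):
    print(prefix, "->", scheme_for_layer(description, prefix))
```

Because the scheme is resolved layer by layer, a single checkpoint can mix quantized and unquantized layers without any extra command-line flags.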
For Wan2.2, only the Diffusers weight storage format is supported, whereas msmodelslim saves the quantized model in the original Wan2.2 format.
For conversion, use the one-step wan_repack.py script:
python wan_repack.py \
--model-type Wan2.2-TI2V-5B \
--original-model-path {path_to_original_diffusers_model} \
--quant-path {path_to_quantized_model} \
--output-path {path_to_converted_model}
Supported --model-type values: Wan2.2-TI2V-5B (single-transformer), Wan2.2-T2V-A14B and Wan2.2-I2V-A14B (Cascade dual-transformer).
The script automatically handles: copying the base model, converting quantized weights to Diffusers format, and restoring config.json.
Usage Example
With auto-detected flow:
sglang generate \
--model-path Eco-Tech/Wan2.2-T2V-A14B-Diffusers-w8a8 \
--prompt "a beautiful sunset" \
--save-output
Available Quantization Methods:
- W4A4_DYNAMIC: linear with online quantization of activations
- W8A8: linear with offline quantization of activations
- W8A8_DYNAMIC: linear with online quantization of activations
- W8A8_MXFP8: linear with offline quantization (msmodelslim pre-quantized weights)
- mxfp8: linear with online quantization (--quantization mxfp8)
- W4A4_MXFP4 / W4A4_MXFP4_DUALSCALE: linear with offline quantization (msmodelslim pre-quantized weights)
- mxfp4_npu: linear with online quantization (--quantization mxfp4_npu)
For online MXFP8 quantization, load the original FP16/BF16 model and add --quantization mxfp8.
Weights are quantized at load time via npu_dynamic_mx_quant, and activations are quantized per-token
during inference with npu_quant_matmul (block_size=32).
sglang generate \
--model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
--quantization mxfp8 \
--prompt "a fox walking through neon rain" \
--save-output
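The block_size=32 mentioned for online MXFP8 means that 32 consecutive values share one scale. The CPU-only sketch below illustrates that idea with the power-of-two shared-scale convention of the OCP Microscaling formats. It does not reproduce npu_dynamic_mx_quant, and whether the Ascend kernels use exactly this rounding is an assumption. The cast of the scaled values to FP8, including saturation, is left out.

```python
import math

BLOCK_SIZE = 32          # values that share one scale (from the text above)
E4M3_MAX_EXPONENT = 8    # FP8 E4M3 max normal is 448 = 1.75 * 2**8


def block_scales(values, block_size=BLOCK_SIZE):
    """Return one power-of-two scale per block of consecutive values."""
    scales = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        if amax == 0.0:
            scales.append(1.0)  # arbitrary choice for an all-zero block
            continue
        exponent = math.floor(math.log2(amax)) - E4M3_MAX_EXPONENT
        scales.append(2.0 ** exponent)
    return scales


# Two blocks of 32 values with different magnitudes.
row = [1.0] * 32 + [448.0] * 32
print(block_scales(row))
```

Each value would then be divided by its block's scale before being stored in the 8-bit element format, so small-magnitude blocks keep their resolution independently of large-magnitude ones.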
Hardware requirement: Ascend A5 series or newer.
npu_dynamic_mx_quant is not available on A2/A3.
Pre-quantized MXFP8 weights exported by msmodelslim are auto-detected via quant_model_description.json
(W8A8_MXFP8 scheme). Use wan_repack.py to convert the quantized weights to Diffusers format,
then load the converted model with --model-path:
sglang generate \
--model-path Eco-Tech/Wan2.2-T2V-A14B-Diffusers-mxfp8 \
--prompt "a beautiful sunset" \
--save-output
For online MXFP4 quantization on Ascend NPU, load the original FP16/BF16 model and add
--quantization mxfp4_npu. The mxfp4_npu key is used for Ascend because mxfp4
is reserved for the ROCm/aiter backend.
Weights are quantized at load time via npu_dynamic_dual_level_mx_quant, and activations
are quantized per-token during inference before npu_dual_level_quant_matmul. MXFP4 uses
dual-level block scales with an L1 block size of 32 and an L0 block size of 512.
sglang generate \
--model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
--quantization mxfp4_npu \
--prompt "a fox walking through neon rain" \
--save-output
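The two block sizes quoted for Ascend MXFP4 determine how many scale entries accompany a weight row. The arithmetic is sketched below, assuming one scale per block at each level and a padded final block when the length is not a multiple of the block size. Both are assumptions, and the row length 5120 is only an illustrative value.

```python
L1_BLOCK = 32    # fine-grained block size (from the text above)
L0_BLOCK = 512   # coarse block size (from the text above)


def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return -(-a // b)


def scale_counts(k: int) -> dict:
    """Scale entries along a reduction dimension of length k (assumed layout)."""
    return {
        "l1_scales": ceil_div(k, L1_BLOCK),
        "l0_scales": ceil_div(k, L0_BLOCK),
        "l1_blocks_per_l0_block": L0_BLOCK // L1_BLOCK,
    }


print(scale_counts(5120))
```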
Hardware requirement: Ascend A5 series or newer.
npu_dynamic_dual_level_mx_quant and npu_dual_level_quant_matmul are not available on A2/A3.
Note: Online MXFP4 weight quantization is experimental. The offline msmodelslim flow uses pre-quantized weights and may produce different numerical results.
Pre-quantized MXFP4 weights exported by msmodelslim are auto-detected via
quant_model_description.json (W4A4_MXFP4 / W4A4_MXFP4_DUALSCALE scheme).
Use wan_repack.py to convert the quantized weights to Diffusers format, then load
the converted model with --model-path:
sglang generate \
--model-path {path_to_converted_mxfp4_model} \
--prompt "a beautiful sunset" \
--save-output
The offline MXFP4 checkpoint stores weights in an FP8 container and includes dual-level
scales (weight_scale, weight_dual_scale). If exported with smooth quantization,
mul_scale is loaded and applied before activation quantization to keep activations
aligned with the calibrated weights.
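Why mul_scale has to be applied to the activations can be seen from the usual smooth-quantization identity. The sketch below assumes the offline export has already folded the inverse factors into the weights, as in standard SmoothQuant; that is an assumption about the export, not something stated above. The numbers are illustrative powers of two so the arithmetic is exact, and quantization itself is omitted.

```python
def matvec(weight, x):
    """Plain matrix-vector product over Python lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


x = [4.0, 0.5, 8.0]                       # activations, one per input channel
weight = [[0.25, 2.0, 0.5],
          [1.0, -4.0, 0.125]]             # original (unsmoothed) weights
mul_scale = [0.5, 2.0, 0.25]              # illustrative per-channel factors

# Activations are multiplied by mul_scale at inference time...
smoothed_x = [v * s for v, s in zip(x, mul_scale)]
# ...while the calibrated weights are assumed to carry the inverse factors.
compensated_weight = [[w / s for w, s in zip(row, mul_scale)] for row in weight]

reference = matvec(weight, x)
smoothed = matvec(compensated_weight, smoothed_x)
print(reference, smoothed)
```

Skipping the mul_scale step would pair unsmoothed activations with compensated weights, and the two products would no longer agree.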