docs_new/docs/hardware-platforms/ascend-npus/ascend_npu_quantization.mdx
To load an already quantized model, simply load its weights and config. Again, if the model has been quantized offline, there is no need to pass the `--quantization` argument when starting the engine. The quantization method is parsed automatically from the downloaded `quant_model_description.json` or `config.json`.
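As a rough illustration of the auto-detection described above, the following sketch checks a local checkpoint directory for `quant_model_description.json` and falls back to a `quantization_config` entry in `config.json`. The helper name `find_quant_source` and the lookup order are assumptions made for illustration; this is not SGLang's actual loader logic.

```python
import json
from pathlib import Path
from typing import Optional


def find_quant_source(checkpoint_dir: str) -> Optional[str]:
    """Return the file that describes the quantization, or None if none does.

    Hypothetical helper: the lookup order is an assumption, not SGLang's loader.
    """
    root = Path(checkpoint_dir)
    # ModelSlim-style checkpoints ship a dedicated description file.
    if (root / "quant_model_description.json").is_file():
        return "quant_model_description.json"
    # Hugging Face style checkpoints embed the settings in config.json.
    config_path = root / "config.json"
    if config_path.is_file():
        config = json.loads(config_path.read_text(encoding="utf-8"))
        if "quantization_config" in config:
            return "config.json"
    return None
```

A result of `None` would suggest an unquantized checkpoint, in which case any quantization has to be requested explicitly.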
SGLang supports mixed-bit quantization: each layer is defined and loaded independently, according to the quantization type specified for it in `quant_model_description.json`. Advanced mixed-bit support for MoE is in progress and will add independent quantization selection for the w13 (up/gate) and w2 (down) layers.
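To make the per-layer idea concrete, here is a minimal sketch that groups weights by their declared quantization type. It assumes the description file is a flat JSON object mapping weight names to type strings such as `W8A8_DYNAMIC` or `FLOAT`; that layout, the sample entries, and the helper name `group_layers_by_quant_type` are illustrative assumptions rather than a specification of the format.

```python
import json
from collections import defaultdict
from typing import Dict, List


def group_layers_by_quant_type(description: Dict[str, str]) -> Dict[str, List[str]]:
    """Group weight names by the quantization type declared for each one."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for weight_name, quant_type in description.items():
        groups[quant_type].append(weight_name)
    return dict(groups)


# Hypothetical excerpt of a description file (names and values are made up).
sample = json.loads("""
{
  "model.layers.0.self_attn.q_proj.weight": "W8A8_DYNAMIC",
  "model.layers.0.mlp.down_proj.weight": "W4A4_DYNAMIC",
  "model.layers.0.input_layernorm.weight": "FLOAT"
}
""")

for quant_type, names in group_layers_by_quant_type(sample).items():
    print(quant_type, len(names))
```

A summary like this is a quick way to check which layers of a mixed-bit checkpoint ended up in which scheme before launching the engine.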
GPTQ on Ascend support:
<table> <thead> <tr> <th>Quantization scheme</th> <th>Layer type</th> <th>A2 Supported</th> <th>A3 Supported</th> <th>A5 Supported</th> </tr> </thead> <tbody> <tr> <td><a href="https://github.com/sgl-project/sglang/pull/15203">W4A16</a></td> <td>Linear</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> </tr> <tr> <td><a href="https://github.com/sgl-project/sglang/pull/15203">W8A16</a></td> <td>Linear</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> </tr> <tr> <td><a href="https://github.com/sgl-project/sglang/pull/16364">W4A16 MOE</a></td> <td>MoE</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> </tr> <tr> <td><a href="https://github.com/sgl-project/sglang/pull/16364">W8A16 MOE</a></td> <td>MoE</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> </tr> </tbody> </table> <table> <thead> <tr> <th>Quantization scheme</th> <th>Layer type</th> <th>A2 Supported</th> <th>A3 Supported</th> <th>A5 Supported</th> </tr> </thead> <tbody> <tr> <td>W4A16</td> <td>Linear</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> </tr> <tr> <td>W8A16</td> <td>Linear</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> </tr> <tr> <td>W4A16</td> <td>MoE</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 
'orange'}}>TBD</strong></td> </tr> <tr> <td>W8A16</td> <td>MoE</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> </tr> </tbody> </table>

Compressed-tensors (LLM Compressor) on Ascend support:
<table> <thead> <tr> <th>Quantization scheme</th> <th>Layer type</th> <th>A2 Supported</th> <th>A3 Supported</th> <th>A5 Supported</th> </tr> </thead> <tbody> <tr> <td><a href="https://github.com/sgl-project/sglang/pull/14504">W8A8 dynamic</a></td> <td>Linear</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> </tr> <tr> <td><a href="https://github.com/sgl-project/sglang/pull/14736">W4A8 dynamic with/without activation clip</a></td> <td>MoE</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> </tr> <tr> <td><a href="https://github.com/sgl-project/sglang/pull/12759">W4A16 MOE</a></td> <td>MoE</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> </tr> <tr> <td><a href="https://github.com/sgl-project/sglang/pull/14504">W8A8 dynamic</a></td> <td>MoE</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> </tr> </tbody> </table>

GGUF on Ascend support:

<table> <thead> <tr> <th>Quantization type</th> <th>Layer type</th> <th>A2 Supported</th> <th>A3 Supported</th> <th>A5 Supported</th> </tr> </thead> <tbody> <tr> <td>All GGUF types (standard, K-quant)</td> <td>Linear</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> </tr> <tr> <td>All GGUF types (standard, K-quant)</td> <td>MoE</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> </tr> </tbody> </table>

Usage Examples:
# Dense GGUF model
python3 -m sglang.launch_server \
--model-path Qwen3-14B-Q4_K_M.gguf \
--device npu --attention-backend ascend \
--host 0.0.0.0 --port 30000 \
--mem-fraction-static 0.7 --tp-size 2

# MoE GGUF model
python3 -m sglang.launch_server \
--model-path Qwen3-30B-A3B-Q4_K_M.gguf \
--device npu --attention-backend ascend \
--host 0.0.0.0 --port 30000 \
--mem-fraction-static 0.8 --tp-size 2
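Once a server from either command above is running, it can be queried over HTTP. The sketch below builds a request for SGLang's native `/generate` endpoint using only the standard library; the helper names, the prompt, and the sampling values are placeholders chosen for illustration, and the port matches the `--port 30000` used above.

```python
import json
import urllib.request


def build_generate_request(
    prompt: str, host: str = "127.0.0.1", port: int = 30000
) -> urllib.request.Request:
    """Build a POST request for the server's native /generate endpoint."""
    payload = {
        "text": prompt,
        # Placeholder sampling settings; adjust to the use case.
        "sampling_params": {"max_new_tokens": 32, "temperature": 0},
    }
    return urllib.request.Request(
        url=f"http://{host}:{port}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def query_server(prompt: str) -> str:
    """Send the prompt to a running server and return the generated text."""
    request = build_generate_request(prompt)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())["text"]
```

Calling `query_server("The capital of France is")` after the server has finished loading should return the model's continuation; it raises a connection error if the server is not yet up.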
Implementation Notes:
- GGUF weights are pre-dequantized to FP16/BF16 during model loading on CPU, then transferred to NPU for inference. This trades higher memory usage for faster runtime performance (no per-forward-pass dequantization overhead).
- MoE layers use `npu_grouped_matmul` and `npu_moe_init_routing`/`npu_moe_finalize_routing` for high-performance expert computation.
- TP (tensor parallelism) sharding is supported for both dense and MoE GGUF models.
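The memory trade-off in the first note can be estimated with simple arithmetic. The sketch below compares the approximate size of quantized weights on disk with the FP16 footprint per device after pre-dequantization. The parameter count (a nominal 14e9) and the figure of roughly 4.5 bits per weight are illustrative assumptions, sharding is assumed to be perfectly even, and KV cache and activations are ignored.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float, tp_size: int = 1) -> float:
    """Approximate per-device weight memory in GiB, assuming even TP sharding."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / tp_size / 2**30


# Illustrative numbers only: 14e9 parameters, about 4.5 bits per weight on disk.
on_disk = weight_memory_gib(14e9, 4.5)
# After pre-dequantization each weight occupies 16 bits (FP16/BF16).
on_device = weight_memory_gib(14e9, 16, tp_size=2)
print(f"GGUF file: ~{on_disk:.1f} GiB, FP16 weights per device at TP=2: ~{on_device:.1f} GiB")
```

Under these assumptions the dequantized weights take noticeably more device memory than the GGUF file occupies on disk, which is worth keeping in mind when choosing `--mem-fraction-static` and `--tp-size`.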
SGLang-Diffusion supports MXFP8 online and offline quantization for diffusion models (such as Wan2.2) on Ascend NPUs. MXFP8 requires A5; the ModelSlim W8A8/W4A4 schemes work on A2/A3.
Requirements for MXFP8: CANN ≥ 8.0.RC3, Ascend A5
<table> <thead> <tr> <th>Quantization method</th> <th><code>quant_type</code> in JSON</th> <th>Scheme class</th> <th>Mode</th> <th>A2/A3 Supported</th> <th>A5 Supported</th> <th>Trigger</th> </tr> </thead> <tbody> <tr> <td>MXFP8 (W8A8)</td> <td>—</td> <td><code>MXFP8Config</code></td> <td>Online</td> <td><strong style={{color: 'red'}}>x</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><code>--quantization mxfp8</code></td> </tr> <tr> <td>MXFP8 (W8A8)</td> <td><code>W8A8_MXFP8</code></td> <td><code>ModelSlimMXFP8Scheme</code></td> <td>Offline</td> <td><strong style={{color: 'red'}}>x</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td>auto-detected from <code>quant_model_description.json</code></td> </tr> <tr> <td>W8A8 static</td> <td><code>W8A8</code></td> <td><code>ModelSlimW8A8Int8</code></td> <td>Offline</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> <td>auto-detected from <code>quant_model_description.json</code></td> </tr> <tr> <td>W8A8 dynamic</td> <td><code>W8A8_DYNAMIC</code></td> <td><code>ModelSlimW8A8Int8</code></td> <td>Offline</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> <td>auto-detected from <code>quant_model_description.json</code></td> </tr> <tr> <td>W4A4 dynamic</td> <td><code>W4A4_DYNAMIC</code></td> <td><code>ModelSlimW4A4Int4</code></td> <td>Offline</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> <td>auto-detected from <code>quant_model_description.json</code></td> </tr> </tbody> </table>

Online quantization dynamically quantizes FP16/BF16 weights to MXFP8 at load time using the `npu_dynamic_mx_quant` + `npu_quant_matmul` CANN kernels. Pass `--quantization mxfp8` to override auto-detection.
# Start the diffusion server with online MXFP8 quantization
sglang serve \
--model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
--quantization mxfp8 \
--num-gpus 4
# One-shot generation
sglang generate \
--model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
--quantization mxfp8 \
--prompt "a beautiful sunset over the mountains" \
--save-output
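For readers unfamiliar with MXFP8, the following pure-Python sketch illustrates the block-scaling idea behind the format, following the OCP Microscaling convention: blocks of 32 values share one power-of-two scale, and each element is stored as FP8 E4M3. It is a conceptual model only. It does not reproduce the `npu_dynamic_mx_quant` kernel, whose exact rounding and layout are not described here, and the function names are hypothetical.

```python
import math
from typing import List, Tuple

BLOCK_SIZE = 32      # elements sharing one scale in the MX formats
E4M3_MAX = 448.0     # largest finite FP8 E4M3 value (1.75 * 2**8)
E4M3_EMAX = 8        # exponent of that largest value
E4M3_MIN_EXP = -6    # smallest normal exponent
MANTISSA_BITS = 3


def floor_log2(x: float) -> int:
    """Exact floor(log2(x)) for x > 0, via the float's binary exponent."""
    return math.frexp(x)[1] - 1


def round_to_e4m3(value: float) -> float:
    """Round to the nearest FP8 E4M3 value, saturating at +/-448."""
    if value == 0.0:
        return 0.0
    magnitude = min(abs(value), E4M3_MAX)
    exponent = max(floor_log2(magnitude), E4M3_MIN_EXP)
    step = 2.0 ** (exponent - MANTISSA_BITS)  # spacing of representable values
    rounded = min(round(magnitude / step) * step, E4M3_MAX)
    return math.copysign(rounded, value)


def quantize_block(block: List[float]) -> Tuple[int, List[float]]:
    """Return (shared_exponent, fp8_elements) for one block of up to 32 values."""
    assert len(block) <= BLOCK_SIZE
    largest = max(abs(v) for v in block)
    if largest == 0.0:
        return 0, [0.0] * len(block)
    # Choose the scale so the largest value lands in the top E4M3 binade.
    shared_exponent = floor_log2(largest) - E4M3_EMAX
    scale = 2.0 ** shared_exponent
    return shared_exponent, [round_to_e4m3(v / scale) for v in block]


def dequantize_block(shared_exponent: int, elements: List[float]) -> List[float]:
    """Recover approximate values by re-applying the shared scale."""
    return [e * 2.0 ** shared_exponent for e in elements]
```

Because the scale is shared per block rather than per tensor, a single outlier only reduces precision for the 32 values in its own block, which is the main appeal of the MX formats for weights with uneven magnitude.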
For offline quantization, pre-quantize the model with msModelSlim and load the resulting checkpoint. The quantization scheme is auto-detected from quant_model_description.json, so no extra --quantization flag is needed.
Step 1: Quantize with msModelSlim
msmodelslim quant \
--model_path /path/to/wan2_2_float_weights \
--save_path /path/to/wan2_2_mxfp8_weights \
--device npu \
--model_type Wan2_2 \
--quant_type mxfp8 \
--trust_remote_code True
Note: SGLang does not support quantized embeddings; disable embedding quantization when using msmodelslim.
Step 2: Convert to Diffusers format
msModelSlim saves quantized Wan2.2 weights in the original Wan format. Convert to Diffusers format using the provided repack script:
python python/sglang/multimodal_gen/tools/wan_repack.py \
--input-path /path/to/wan2_2_mxfp8_weights \
--output-path /path/to/wan2_2_mxfp8_diffusers
Then copy all files from the original Diffusers checkpoint (except the transformer/transformer_2 folders) into the output directory.
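The copy step above can be scripted. The following is a minimal sketch using only the standard library; the function name and the directory arguments are placeholders, and it assumes the quantized `transformer` and `transformer_2` folders already exist in the output directory from the repack step.

```python
import shutil
from pathlib import Path

# Folders whose contents come from the repack step and must not be overwritten.
SKIPPED_DIRS = {"transformer", "transformer_2"}


def copy_non_transformer_files(original_dir: str, output_dir: str) -> None:
    """Copy everything except the transformer folders into the repacked checkpoint."""
    source = Path(original_dir)
    target = Path(output_dir)
    target.mkdir(parents=True, exist_ok=True)
    for entry in source.iterdir():
        if entry.name in SKIPPED_DIRS:
            continue  # keep the quantized weights already in the output
        destination = target / entry.name
        if entry.is_dir():
            shutil.copytree(entry, destination, dirs_exist_ok=True)
        else:
            shutil.copy2(entry, destination)
```

For example, `copy_non_transformer_files("/path/to/original_diffusers", "/path/to/wan2_2_mxfp8_diffusers")` would bring over the remaining top-level files and component folders while leaving the quantized transformer weights untouched; the first path is a placeholder for wherever the original Diffusers checkpoint lives.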
Step 3: Run inference
sglang generate \
--model-path /path/to/wan2_2_mxfp8_diffusers \
--prompt "a beautiful sunset over the mountains" \
--save-output
For pre-quantized checkpoints available on ModelScope, see modelscope/Eco-Tech.