Quantization on Ascend

To load an already-quantized model, simply load the model weights and config. If the model was quantized offline, there is no need to pass the --quantization argument when starting the engine: the quantization method is parsed automatically from the downloaded quant_model_description.json or config.json.
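To check by hand which of the two files a local checkpoint would be recognized from, a minimal sketch follows. It is not SGLang's actual detection code. It assumes the Hugging Face convention of a `quantization_config.quant_method` entry in config.json, and the `"modelslim"` label it returns is an illustrative name chosen here.

```python
import json
from pathlib import Path


def detect_quant_source(model_dir):
    """Return (file_name, method) describing where quantization info was found.

    Returns (None, None) when the checkpoint looks unquantized.
    """
    model_dir = Path(model_dir)

    # ModelSlim checkpoints ship a per-tensor description file.
    description = model_dir / "quant_model_description.json"
    if description.is_file():
        # "modelslim" is an illustrative label, not a value read from the file.
        return description.name, "modelslim"

    # Assumption: Hugging Face style checkpoints record the method in
    # config.json under quantization_config.quant_method.
    config = model_dir / "config.json"
    if config.is_file():
        cfg = json.loads(config.read_text())
        method = (cfg.get("quantization_config") or {}).get("quant_method")
        if method:
            return config.name, method

    return None, None
```

Calling `detect_quant_source("/path/to/checkpoint")` on a downloaded model directory returns the file that carries the quantization metadata, or `(None, None)` if neither file declares one.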

SGLang supports mixed-bit quantization: each layer is defined and loaded independently, according to the quantization type specified for it in quant_model_description.json. Advanced mixed-bit support for MoE is in progress and will add independent quantization selection for the w13 (up-gate) and w2 (down) layers.
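To see which quantization types a mixed-bit checkpoint actually uses, the description file can be summarized per module. The sketch below is only an illustration. It assumes the file maps tensor names to quant-type strings, with `FLOAT` marking unquantized tensors. The tensor names in the sample are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical excerpt of quant_model_description.json
# (assumed layout: tensor name -> quant type).
description = {
    "model.layers.0.self_attn.q_proj.weight": "W8A8",
    "model.layers.0.self_attn.q_proj.deq_scale": "W8A8",
    "model.layers.0.mlp.down_proj.weight": "W8A8",
    "model.layers.1.self_attn.q_proj.weight": "W4A4_DYNAMIC",
    "model.layers.1.mlp.down_proj.weight": "W4A4_DYNAMIC",
    "lm_head.weight": "FLOAT",
}


def quant_types_by_module(description):
    """Group tensors by module prefix and collect each module's quant types."""
    per_module = defaultdict(set)
    for tensor_name, quant_type in description.items():
        module_name, dot, _param = tensor_name.rpartition(".")
        if not dot:
            # Keys without a dot are treated as global metadata and skipped.
            continue
        per_module[module_name].add(quant_type)
    return dict(per_module)


def count_modules_per_type(description):
    """Count how many modules use each quant type."""
    per_module = quant_types_by_module(description)
    return Counter(t for types in per_module.values() for t in types)


if __name__ == "__main__":
    for quant_type, n in sorted(count_modules_per_type(description).items()):
        print(f"{quant_type}: {n} module(s)")
```

More than one non-`FLOAT` key in the resulting counter indicates a mixed-bit checkpoint.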

ModelSlim on Ascend support

| Quantization scheme | quant_type in JSON | Scheme class | Layer type | A2 Supported | A3 Supported | A5 Supported | Diffusion models |
|---|---|---|---|---|---|---|---|
| W4A4 dynamic | W4A4_DYNAMIC | ModelSlimW4A4Int4 | Linear | <span style="color: green;">✓</span> | <span style="color: green;">✓</span> | <span style="color: yellow;">TBD</span> | <span style="color: green;">✓</span> |
| W8A8 static | W8A8 | ModelSlimW8A8Int8 | Linear | <span style="color: green;">✓</span> | <span style="color: green;">✓</span> | <span style="color: yellow;">TBD</span> | <span style="color: green;">✓</span> |
| W8A8 dynamic | W8A8_DYNAMIC | ModelSlimW8A8Int8 | Linear | <span style="color: green;">✓</span> | <span style="color: green;">✓</span> | <span style="color: yellow;">TBD</span> | <span style="color: green;">✓</span> |
| MXFP8 | W8A8_MXFP8 | ModelSlimMXFP8Scheme | Linear | <span style="color: red;">x</span> | <span style="color: red;">x</span> | <span style="color: blue;">WIP</span> | <span style="color: green;">✓</span> (A5) |
| W4A4 dynamic | W4A4_DYNAMIC | ModelSlimW4A4Int4 | MoE | <span style="color: green;">✓</span> | <span style="color: green;">✓</span> | <span style="color: yellow;">TBD</span> | <span style="color: red;">x</span> |
| W4A8 dynamic | W4A8_DYNAMIC | ModelSlimW4A8Int8MoE | MoE | <span style="color: green;">✓</span> | <span style="color: green;">✓</span> | <span style="color: yellow;">TBD</span> | <span style="color: red;">x</span> |
| W8A8 dynamic | W8A8_DYNAMIC | ModelSlimW8A8Int8 | MoE | <span style="color: green;">✓</span> | <span style="color: green;">✓</span> | <span style="color: yellow;">TBD</span> | <span style="color: red;">x</span> |
| MXFP8 | W8A8_MXFP8 | ModelSlimMXFP8Scheme | MoE | <span style="color: red;">x</span> | <span style="color: red;">x</span> | <span style="color: blue;">WIP</span> | <span style="color: red;">x</span> |

AWQ on Ascend support

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
| W4A16 | Linear | <span style="color: green;">✓</span> | <span style="color: green;">✓</span> | <span style="color: yellow;">TBD</span> |
| W8A16 | Linear | <span style="color: green;">✓</span> | <span style="color: green;">✓</span> | <span style="color: yellow;">TBD</span> |
| W4A16 | MoE | <span style="color: green;">✓</span> | <span style="color: green;">✓</span> | <span style="color: yellow;">TBD</span> |

GPTQ on Ascend support

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
| W4A16 | Linear | <span style="color: green;">✓</span> | <span style="color: green;">✓</span> | <span style="color: yellow;">TBD</span> |
| W8A16 | Linear | <span style="color: green;">✓</span> | <span style="color: green;">✓</span> | <span style="color: yellow;">TBD</span> |
| W4A16 MoE | MoE | <span style="color: green;">✓</span> | <span style="color: green;">✓</span> | <span style="color: yellow;">TBD</span> |
| W8A16 MoE | MoE | <span style="color: green;">✓</span> | <span style="color: green;">✓</span> | <span style="color: yellow;">TBD</span> |

Auto-round on Ascend support

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
| W4A16 | Linear | <span style="color: green;">✓</span> | <span style="color: green;">✓</span> | <span style="color: yellow;">TBD</span> |
| W8A16 | Linear | <span style="color: green;">✓</span> | <span style="color: green;">✓</span> | <span style="color: yellow;">TBD</span> |
| W4A16 | MoE | <span style="color: green;">✓</span> | <span style="color: green;">✓</span> | <span style="color: yellow;">TBD</span> |
| W8A16 | MoE | <span style="color: green;">✓</span> | <span style="color: green;">✓</span> | <span style="color: yellow;">TBD</span> |

Compressed-tensors (LLM Compressor) on Ascend support

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
| W8A8 dynamic | Linear | <span style="color: green;">✓</span> | <span style="color: green;">✓</span> | <span style="color: yellow;">TBD</span> |
| W4A8 dynamic with/without activation clip | MoE | <span style="color: green;">✓</span> | <span style="color: green;">✓</span> | <span style="color: yellow;">TBD</span> |
| W4A16 MoE | MoE | <span style="color: green;">✓</span> | <span style="color: green;">✓</span> | <span style="color: yellow;">TBD</span> |
| W8A8 dynamic | MoE | <span style="color: green;">✓</span> | <span style="color: green;">✓</span> | <span style="color: yellow;">TBD</span> |

GGUF on Ascend support

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
| GGUF (all types) | Linear | <span style="color: green;">✓</span> | <span style="color: green;">✓</span> | <span style="color: yellow;">TBD</span> |
| GGUF (all types) | MoE | <span style="color: green;">✓</span> | <span style="color: green;">✓</span> | <span style="color: yellow;">TBD</span> |

Note: On Ascend, GGUF weights are pre-dequantized to FP16/BF16 during model loading to ensure optimal inference performance. This enables support for all GGUF quantization types (Q2_K, Q4_K_M, IQ4_XS, etc.) while maintaining high inference speed.
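As an illustration of what pre-dequantization means, the sketch below decodes Q8_0, one of the simplest GGUF block formats: each block stores one FP16 scale followed by 32 int8 values, and each weight is recovered as scale × value. This is not SGLang's loader. It returns plain Python floats rather than FP16/BF16 tensors, and the function name is chosen here for illustration.

```python
import struct

BLOCK_VALUES = 32               # int8 values per Q8_0 block
BLOCK_BYTES = 2 + BLOCK_VALUES  # one little-endian fp16 scale + 32 int8 values


def dequantize_q8_0(raw):
    """Decode a buffer of packed Q8_0 blocks into a flat list of floats."""
    if len(raw) % BLOCK_BYTES:
        raise ValueError("buffer length is not a multiple of the Q8_0 block size")
    values = []
    # "<e32b": little-endian, one half-precision float, then 32 signed bytes.
    for scale, *quants in struct.iter_unpack("<e32b", raw):
        values.extend(scale * q for q in quants)
    return values


if __name__ == "__main__":
    # Build one synthetic block: scale 0.5 and quantized values -16..15.
    block = struct.pack("<e32b", 0.5, *range(-16, 16))
    print(dequantize_q8_0(block)[:4])
```

Once the weights are expanded this way at load time, inference runs on ordinary floating-point matmuls, which is why any GGUF type can be supported without a dedicated kernel per type.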

in progress

Diffusion Model Quantization on Ascend NPU

SGLang-Diffusion supports MXFP8 online and offline quantization for diffusion models (such as Wan2.2) on Ascend NPUs. MXFP8 requires A5; the ModelSlim W8A8/W4A4 schemes work on A2/A3.

Requirements for MXFP8: CANN ≥ 8.0.RC3, Ascend A5

| Quantization method | quant_type in JSON | Scheme class | Mode | A2/A3 Supported | A5 Supported | Trigger |
|---|---|---|---|---|---|---|
| MXFP8 (W8A8) | | MXFP8Config | Online | <span style="color: red;">x</span> | <span style="color: green;">✓</span> | --quantization mxfp8 |
| MXFP8 (W8A8) | W8A8_MXFP8 | ModelSlimMXFP8Scheme | Offline | <span style="color: red;">x</span> | <span style="color: green;">✓</span> | auto-detected from quant_model_description.json |
| W8A8 static | W8A8 | ModelSlimW8A8Int8 | Offline | <span style="color: green;">✓</span> | <span style="color: yellow;">TBD</span> | auto-detected from quant_model_description.json |
| W8A8 dynamic | W8A8_DYNAMIC | ModelSlimW8A8Int8 | Offline | <span style="color: green;">✓</span> | <span style="color: yellow;">TBD</span> | auto-detected from quant_model_description.json |
| W4A4 dynamic | W4A4_DYNAMIC | ModelSlimW4A4Int4 | Offline | <span style="color: green;">✓</span> | <span style="color: yellow;">TBD</span> | auto-detected from quant_model_description.json |

Online MXFP8 Quantization

Online quantization dynamically quantizes FP16/BF16 weights to MXFP8 at load time using the npu_dynamic_mx_quant and npu_quant_matmul CANN kernels. Pass --quantization mxfp8 to enable it; the flag overrides auto-detection.

```bash
# Start the diffusion server with online MXFP8 quantization
sglang serve \
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
  --quantization mxfp8 \
  --num-gpus 4
```

```bash
# One-shot generation
sglang generate \
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
  --quantization mxfp8 \
  --prompt "a beautiful sunset over the mountains" \
  --save-output
```
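For background on what MXFP8 quantization computes, here is a minimal sketch of the block scaling used by microscaling (MX) formats. It follows the OCP Microscaling definition as understood here: 32 elements share one power-of-two scale, and elements are stored as FP8 E4M3. It is not the behavior of the CANN kernels, which may differ in details. Rounding to E4M3 is omitted; the sketch only derives the shared exponent and clamps the scaled values to the E4M3 range.

```python
import math

BLOCK_SIZE = 32    # elements that share one scale in MX formats
E4M3_EMAX = 8      # largest E4M3 exponent: max normal = 1.75 * 2**8 = 448
E4M3_MAX = 448.0
E8M0_MIN, E8M0_MAX = -127, 127  # exponent range of the shared scale


def mx_shared_exponent(block):
    """Shared exponent: floor(log2(max|x|)) minus the element format's emax."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return E8M0_MIN  # an all-zero block is exact under any scale
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1.
    _, e = math.frexp(amax)
    return max(E8M0_MIN, min(E8M0_MAX, (e - 1) - E4M3_EMAX))


def scale_block(block):
    """Return (shared_exponent, scaled values clamped to the E4M3 range)."""
    shared_exp = mx_shared_exponent(block)
    scale = 2.0 ** shared_exp
    scaled = [max(-E4M3_MAX, min(E4M3_MAX, x / scale)) for x in block]
    return shared_exp, scaled


if __name__ == "__main__":
    weights = [0.02 * i - 0.3 for i in range(BLOCK_SIZE)]
    exp, scaled = scale_block(weights)
    print(exp, max(abs(v) for v in scaled))
```

Because the scale is a power of two, dividing by it is exact in floating point, and the only precision loss comes from the FP8 rounding step that this sketch leaves out.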

Offline MXFP8 Quantization (ModelSlim)

For offline quantization, pre-quantize the model with msModelSlim and load the resulting checkpoint. The quantization scheme is auto-detected from quant_model_description.json, so no extra --quantization flag is needed.

Step 1: Quantize with msModelSlim

```bash
msmodelslim quant \
  --model_path /path/to/wan2_2_float_weights \
  --save_path /path/to/wan2_2_mxfp8_weights \
  --device npu \
  --model_type Wan2_2 \
  --quant_type mxfp8 \
  --trust_remote_code True
```

Note: SGLang does not support quantized embeddings; disable embedding quantization when using msmodelslim.

Step 2: Convert to Diffusers format

msModelSlim saves quantized Wan2.2 weights in the original Wan format. Convert to Diffusers format using the provided repack script:

```bash
python python/sglang/multimodal_gen/tools/wan_repack.py \
  --input-path /path/to/wan2_2_mxfp8_weights \
  --output-path /path/to/wan2_2_mxfp8_diffusers
```

Then copy all files from the original Diffusers checkpoint (except the transformer/transformer_2 folders) into the output directory.
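The copy step can be scripted. The following is a minimal sketch using only the Python standard library (Python 3.8+ for `dirs_exist_ok`). The function name and the example paths are placeholders chosen here, not part of SGLang.

```python
import shutil
from pathlib import Path

# Top-level folders that the repack script already produced.
SKIP_DIRS = {"transformer", "transformer_2"}


def copy_non_transformer_files(original_ckpt, repacked_ckpt):
    """Copy everything except the top-level transformer folders.

    Existing files in repacked_ckpt with the same relative path are overwritten.
    """
    src_root = Path(original_ckpt).resolve()

    def ignore(directory, names):
        # Skip the transformer folders only at the checkpoint root, so that
        # same-named folders deeper in the tree are still copied.
        if Path(directory).resolve() == src_root:
            return [n for n in names if n in SKIP_DIRS]
        return []

    shutil.copytree(src_root, repacked_ckpt, ignore=ignore, dirs_exist_ok=True)


# Example call (placeholder paths):
# copy_non_transformer_files("/path/to/original_diffusers_checkpoint",
#                            "/path/to/wan2_2_mxfp8_diffusers")
```

After it runs, the output directory contains the quantized transformer folders from the repack step plus the remaining files and folders from the original Diffusers checkpoint.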

Step 3: Run inference

```bash
sglang generate \
  --model-path /path/to/wan2_2_mxfp8_diffusers \
  --prompt "a beautiful sunset over the mountains" \
  --save-output
```

For pre-quantized checkpoints available on ModelScope, see modelscope/Eco-Tech.