# Quantization on Ascend

To load an already quantized model, simply load the model weights and config. If the model has been quantized offline, there is no need to pass the `--quantization` argument when starting the engine; the quantization method is parsed automatically from the downloaded `quant_model_description.json` or `config.json`.

SGLang supports mixed-bits quantization: each layer is defined and loaded independently, according to the quantization type specified for it in `quant_model_description.json`. Advanced mixed-bits support for MoE is in progress and will add independent quantization selection for the w13 (up-gate) and w2 (down) layers.
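A per-layer description might look like the following. This is a hypothetical sketch of the layout (weight names and type strings are illustrative, not taken from a real checkpoint):

```json
{
  "model_quant_type": "W8A8_DYNAMIC",
  "model.layers.0.self_attn.qkv_proj.weight": "W8A8_DYNAMIC",
  "model.layers.0.mlp.gate_up_proj.weight": "W4A8_DYNAMIC",
  "lm_head.weight": "FLOAT"
}
```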

## ModelSlim on Ascend support

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported | Diffusion models |
| --- | --- | --- | --- | --- | --- |
| W4A4 dynamic | Linear | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> | <span style="color: green;">✔</span> |
| W8A8 static | Linear | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> | <span style="color: green;">✔</span> |
| W8A8 dynamic | Linear | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> | <span style="color: green;">✔</span> |
| MXFP8 | Linear | <span style="color: red;">x</span> | <span style="color: red;">x</span> | <span style="color: blue;">WIP</span> | <span style="color: blue;">WIP</span> |
| W4A4 dynamic | MoE | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> | <span style="color: red;">x</span> |
| W4A8 dynamic | MoE | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> | <span style="color: red;">x</span> |
| W8A8 dynamic | MoE | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> | <span style="color: red;">x</span> |
| MXFP8 | MoE | <span style="color: red;">x</span> | <span style="color: red;">x</span> | <span style="color: blue;">WIP</span> | <span style="color: red;">x</span> |

## AWQ on Ascend support

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
| --- | --- | --- | --- | --- |
| W4A16 | Linear | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |
| W8A16 | Linear | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |
| W4A16 | MoE | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |

## GPTQ on Ascend support

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
| --- | --- | --- | --- | --- |
| W4A16 | Linear | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |
| W8A16 | Linear | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |
| W4A16 MOE | MoE | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |
| W8A16 MOE | MoE | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |

## Auto-round on Ascend support

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
| --- | --- | --- | --- | --- |
| W4A16 | Linear | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |
| W8A16 | Linear | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |
| W4A16 | MoE | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |
| W8A16 | MoE | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |

## Compressed-tensors (LLM Compressor) on Ascend support

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
| --- | --- | --- | --- | --- |
| W8A8 dynamic | Linear | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |
| W4A8 dynamic with/without activation clip | MoE | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |
| W4A16 MOE | MoE | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |
| W8A8 dynamic | MoE | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |

## GGUF on Ascend support

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
| --- | --- | --- | --- | --- |
| GGUF (all types) | Linear | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |
| GGUF (all types) | MoE | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |

Note: On Ascend, GGUF weights are pre-dequantized to FP16/BF16 during model loading to ensure optimal inference performance. This enables support for all GGUF quantization types (Q2_K, Q4_K_M, IQ4_XS, etc.) while maintaining high inference speed.
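To illustrate what "pre-dequantized during loading" means, here is a standalone sketch of GGUF's simplest block format, Q4_0: each block packs 32 weights as one fp16 scale `d` plus 16 bytes of 4-bit quants, reconstructed as `d * (q - 8)`. This is illustrative only, not SGLang's actual loader code:

```python
import struct


def dequantize_q4_0(block: bytes) -> list[float]:
    """Dequantize one GGUF Q4_0 block into 32 float weights.

    Block layout: 2-byte fp16 scale, then 16 bytes packing 32 4-bit quants
    (low nibbles hold quants 0..15, high nibbles hold quants 16..31).
    """
    assert len(block) == 18, "Q4_0 blocks are 18 bytes"
    (d,) = struct.unpack("<e", block[:2])      # fp16 scale
    lo = [(b & 0x0F) - 8 for b in block[2:]]   # first 16 weights
    hi = [(b >> 4) - 8 for b in block[2:]]     # last 16 weights
    return [d * q for q in lo + hi]
```

On Ascend the loader applies this kind of transformation once, at load time, so inference kernels only ever see FP16/BF16 tensors.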