docs/platforms/ascend/ascend_npu_quantization.md
To load an already quantized model, simply load its weights and config. If the model has been quantized offline, there is no need to pass the `--quantization` argument when starting the engine; the quantization method is parsed automatically from the downloaded `quant_model_description.json` or `config.json`.
SGLang supports mixed-bit quantization: each layer is defined and loaded independently, according to the quantization type specified for it in `quant_model_description.json`. Advanced mixed-bit support for MoE is in progress and will add independent quantization selection for the w13 (up/gate) and w2 (down) layers.
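Conceptually, mixed-bit loading boils down to reading a per-layer scheme map from the description file and dispatching each layer to the matching quantized kernel. A minimal sketch of the parsing step, assuming a flat name-to-scheme layout (the real file format may nest this information differently):

```python
import json

def layer_quant_schemes(desc_path):
    """Return the per-weight quantization scheme parsed from the offline
    quantization description file.

    The flat {weight name: scheme string} layout below is an assumption
    for illustration; real description files may differ.
    """
    with open(desc_path) as f:
        desc = json.load(f)
    # e.g. {"model.layers.0.mlp.up_proj.weight": "W8A8_DYNAMIC", ...}
    return {name: scheme for name, scheme in desc.items()
            if isinstance(scheme, str)}
```

A loader built on this map can then pick a different quantized implementation per layer, which is what allows, for example, W8A8 linear layers to coexist with W4A8 MoE layers in one model.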
| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported | Diffusion models |
|---|---|---|---|---|---|
| W4A4 dynamic | Linear | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> | <span style="color: green;">√</span> |
| W8A8 static | Linear | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> | <span style="color: green;">√</span> |
| W8A8 dynamic | Linear | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> | <span style="color: green;">√</span> |
| MXFP8 | Linear | <span style="color: red;">x</span> | <span style="color: red;">x</span> | <span style="color: blue;">WIP</span> | <span style="color: blue;">WIP</span> |
| W4A4 dynamic | MoE | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> | <span style="color: red;">x</span> |
| W4A8 dynamic | MoE | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> | <span style="color: red;">x</span> |
| W8A8 dynamic | MoE | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> | <span style="color: red;">x</span> |
| MXFP8 | MoE | <span style="color: red;">x</span> | <span style="color: red;">x</span> | <span style="color: blue;">WIP</span> | <span style="color: red;">x</span> |
| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
| W4A16 | Linear | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |
| W8A16 | Linear | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |
| W4A16 | MoE | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |
GPTQ on Ascend support:
| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
| W4A16 | Linear | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |
| W8A16 | Linear | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |
| W4A16 | MoE | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |
| W8A16 | MoE | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |
| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
| W4A16 | Linear | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |
| W8A16 | Linear | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |
| W4A16 | MoE | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |
| W8A16 | MoE | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |
Compressed-tensors (LLM Compressor) on Ascend support:
| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
| W8A8 dynamic | Linear | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |
| W4A8 dynamic with/without activation clip | MoE | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |
| W4A16 | MoE | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |
| W8A8 dynamic | MoE | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |
GGUF on Ascend support:
| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
|---|---|---|---|---|
| GGUF (all types) | Linear | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |
| GGUF (all types) | MoE | <span style="color: green;">√</span> | <span style="color: green;">√</span> | <span style="color: yellow;">TBD</span> |
Note: On Ascend, GGUF weights are pre-dequantized to FP16/BF16 during model loading to ensure optimal inference performance. This enables support for all GGUF quantization types (Q2_K, Q4_K_M, IQ4_XS, etc.) while maintaining high inference speed.
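As an illustrative sketch of what pre-dequantization means (not the actual Ascend loader code): Q8_0, one of the simplest GGUF block formats, stores groups of 32 signed int8 values together with a float16 scale, and dequantizing a block just multiplies each quant by that scale.

```python
import struct

def dequant_q8_0_block(block: bytes) -> list:
    """Dequantize one GGUF Q8_0 block.

    Q8_0 layout: a 2-byte little-endian float16 scale followed by
    32 signed int8 quantized values (34 bytes per block).
    """
    assert len(block) == 34, "Q8_0 blocks are 34 bytes"
    (scale,) = struct.unpack('<e', block[:2])   # 'e' = half precision
    quants = struct.unpack('<32b', block[2:])   # 32 signed bytes
    return [scale * q for q in quants]
```

The K-quant and IQ formats in the table above use more elaborate block layouts (sub-block scales, minimums, lookup grids), but the loading-time principle is the same: expand every block to FP16/BF16 once, so inference runs on plain dense weights.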