# Quantization on Ascend

To load an already quantized model, simply load the model weights and config. If the model has been quantized offline, there is no need to pass the `--quantization` argument when starting the engine; the quantization method is parsed automatically from the downloaded `quant_model_description.json` or `config.json`.

SGLang supports mixed-bits quantization: each layer is defined and loaded independently, according to the quantization type specified for it in `quant_model_description.json`. Advanced mixed-bits support for MoE is in progress and will add independent quantization selection for the w13 (up-gate) and w2 (down) layers.
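A per-layer description might look like the following. This is a hypothetical sketch of the layout (weight names and type strings are illustrative, not taken from a real checkpoint):

```json
{
  "model_quant_type": "W8A8_DYNAMIC",
  "model.layers.0.self_attn.qkv_proj.weight": "W8A8_DYNAMIC",
  "model.layers.0.mlp.gate_up_proj.weight": "W4A8_DYNAMIC",
  "lm_head.weight": "FLOAT"
}
```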

## ModelSlim on Ascend support

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported | Diffusion models |
| --- | --- | --- | --- | --- | --- |
| W4A4 dynamic | Linear | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> | <span style="color: green;">✔</span> |
| W8A8 static | Linear | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> | <span style="color: green;">✔</span> |
| W8A8 dynamic | Linear | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> | <span style="color: green;">✔</span> |
| MXFP8 | Linear | <span style="color: red;">x</span> | <span style="color: red;">x</span> | <span style="color: blue;">WIP</span> | <span style="color: blue;">WIP</span> |
| W4A4 dynamic | MoE | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> | <span style="color: red;">x</span> |
| W4A8 dynamic | MoE | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> | <span style="color: red;">x</span> |
| W8A8 dynamic | MoE | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> | <span style="color: red;">x</span> |
| MXFP8 | MoE | <span style="color: red;">x</span> | <span style="color: red;">x</span> | <span style="color: blue;">WIP</span> | <span style="color: red;">x</span> |

## AWQ on Ascend support

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
| --- | --- | --- | --- | --- |
| W4A16 | Linear | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |
| W8A16 | Linear | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |
| W4A16 | MoE | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |

## GPTQ on Ascend support

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
| --- | --- | --- | --- | --- |
| W4A16 | Linear | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |
| W8A16 | Linear | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |
| W4A16 MOE | MoE | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |
| W8A16 MOE | MoE | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |

## Auto-round on Ascend support

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
| --- | --- | --- | --- | --- |
| W4A16 | Linear | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |
| W8A16 | Linear | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |
| W4A16 | MoE | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |
| W8A16 | MoE | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |

## Compressed-tensors (LLM Compressor) on Ascend support

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
| --- | --- | --- | --- | --- |
| W8A8 dynamic | Linear | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |
| W4A8 dynamic with/without activation clip | MoE | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |
| W4A16 MOE | MoE | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |
| W8A8 dynamic | MoE | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |

## GGUF on Ascend support

| Quantization scheme | Layer type | A2 Supported | A3 Supported | A5 Supported |
| --- | --- | --- | --- | --- |
| GGUF (all types) | Linear | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |
| GGUF (all types) | MoE | <span style="color: green;">✔</span> | <span style="color: green;">✔</span> | <span style="color: yellow;">TBD</span> |

Note: On Ascend, GGUF weights are pre-dequantized to FP16/BF16 during model loading to ensure optimal inference performance. This enables support for all GGUF quantization types (Q2_K, Q4_K_M, IQ4_XS, etc.) while maintaining high inference speed.
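To illustrate what "pre-dequantized during loading" means, here is a standalone sketch of GGUF's simplest block format, Q4_0: each block packs 32 weights as one fp16 scale `d` plus 16 bytes of 4-bit quants, reconstructed as `d * (q - 8)`. This is illustrative only, not SGLang's actual loader code:

```python
import struct


def dequantize_q4_0(block: bytes) -> list[float]:
    """Dequantize one GGUF Q4_0 block into 32 float weights.

    Block layout: 2-byte fp16 scale, then 16 bytes packing 32 4-bit quants
    (low nibbles hold quants 0..15, high nibbles hold quants 16..31).
    """
    assert len(block) == 18, "Q4_0 blocks are 18 bytes"
    (d,) = struct.unpack("<e", block[:2])      # fp16 scale
    lo = [(b & 0x0F) - 8 for b in block[2:]]   # first 16 weights
    hi = [(b >> 4) - 8 for b in block[2:]]     # last 16 weights
    return [d * q for q in lo + hi]
```

On Ascend the loader applies this kind of transformation once, at load time, so inference kernels only ever see FP16/BF16 tensors.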