Quantization on Ascend - Sglang

To load already quantized models, simply load the model weights and config. Again, if the model has been quantized offline, there's no need to add --quantization argument when starting the engine. The quantization method will be automatically parsed from the downloaded quant_model_description.json or config.json config.

SGLang support mix-bits quantization (independently defines and loads each layer depending on the type of quantification specified in the quant_model_description'.json). Advanced mix-bits for MoE in progress, will add independent quantization determination for the w13 (up-gate) and w2 (down) layers.

ModelSlim on Ascend support

<table> <thead> <tr> <th>Quantization scheme</th> <th>Layer type</th> <th>A2 Supported</th> <th>A3 Supported</th> <th>A5 Supported</th> <th>Diffusion models</th> </tr> </thead> <tbody> <tr> <td>W4A4 dynamic</td> <td>Linear</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> </tr> <tr> <td>W8A8 static</td> <td>Linear</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> </tr> <tr> <td>W8A8 dynamic</td> <td>Linear</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> </tr> <tr> <td><a href="https://github.com/sgl-project/sglang/pull/20922">MXFP8</a></td> <td>Linear</td> <td><strong style={{color: 'red'}}>x</strong></td> <td><strong style={{color: 'red'}}>x</strong></td> <td><strong style={{color: 'blue'}}>WIP</strong></td> <td><strong style={{color: 'blue'}}>WIP</strong></td> </tr> <tr> <td>W4A4 dynamic</td> <td>MoE</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> <td><strong style={{color: 'red'}}>x</strong></td> </tr> <tr> <td>W4A8 dynamic</td> <td>MoE</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> <td><strong style={{color: 'red'}}>x</strong></td> </tr> <tr> <td>W8A8 dynamic</td> <td>MoE</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> <td><strong style={{color: 'red'}}>x</strong></td> </tr> <tr> <td><a href="https://github.com/sgl-project/sglang/pull/20922">MXFP8</a></td> <td>MoE</td> <td><strong style={{color: 'red'}}>x</strong></td> <td><strong style={{color: 'red'}}>x</strong></td> <td><strong style={{color: 'blue'}}>WIP</strong></td> <td><strong style={{color: 'red'}}>x</strong></td> </tr> </tbody> </table>

AWQ on Ascend support:

<table> <thead> <tr> <th>Quantization scheme</th> <th>Layer type</th> <th>A2 Supported</th> <th>A3 Supported</th> <th>A5 Supported</th> </tr> </thead> <tbody> <tr> <td>W4A16</td> <td>Linear</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> </tr> <tr> <td>W8A16</td> <td>Linear</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> </tr> <tr> <td>W4A16</td> <td>MoE</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> </tr> </tbody> </table>

GPTQ on Ascend support

<table> <thead> <tr> <th>Quantization scheme</th> <th>Layer type</th> <th>A2 Supported</th> <th>A3 Supported</th> <th>A5 Supported</th> </tr> </thead> <tbody> <tr> <td><a href="https://github.com/sgl-project/sglang/pull/15203">W4A16</a></td> <td>Linear</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> </tr> <tr> <td><a href="https://github.com/sgl-project/sglang/pull/15203">W8A16</a></td> <td>Linear</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> </tr> <tr> <td><a href="https://github.com/sgl-project/sglang/pull/16364">W4A16 MOE</a></td> <td>MoE</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> </tr> <tr> <td><a href="https://github.com/sgl-project/sglang/pull/16364">W8A16 MOE</a></td> <td>MoE</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> </tr> </tbody> </table>

Auto-round on Ascend support

Compressed-tensors (LLM Compressor) on Ascend support:

<table> <thead> <tr> <th>Quantization scheme</th> <th>Layer type</th> <th>A2 Supported</th> <th>A3 Supported</th> <th>A5 Supported</th> </tr> </thead> <tbody> <tr> <td><a href="https://github.com/sgl-project/sglang/pull/14504">W8A8 dynamic</a></td> <td>Linear</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> </tr> <tr> <td><a href="https://github.com/sgl-project/sglang/pull/14736">W4A8 dynamic with/without activation clip</a></td> <td>MoE</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> </tr> <tr> <td><a href="https://github.com/sgl-project/sglang/pull/12759">W4A16 MOE</a></td> <td>MoE</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> </tr> <tr> <td><a href="https://github.com/sgl-project/sglang/pull/14504">W8A8 dynamic</a></td> <td>MoE</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> </tr> </tbody> </table>

GGUF on Ascend support

<table> <thead> <tr> <th>Quantization type</th> <th>Layer type</th> <th>A2 Supported</th> <th>A3 Supported</th> <th>A5 Supported</th> </tr> </thead> <tbody> <tr> <td>All GGUF types (standard, K-quant)</td> <td>Linear</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> </tr> <tr> <td>All GGUF types (standard, K-quant)</td> <td>MoE</td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'green'}}>√</strong></td> <td><strong style={{color: 'orange'}}>TBD</strong></td> </tr> </tbody> </table>

Usage Examples:

Dense model (e.g. Qwen3-14B-Q4_K_M.gguf):

bash

python3 -m sglang.launch_server \
    --model-path Qwen3-14B-Q4_K_M.gguf \
    --device npu --attention-backend ascend \
    --host 0.0.0.0 --port 30000 \
    --mem-fraction-static 0.7 --tp-size 2

MoE model (e.g. Qwen3-30B-A3B-Q4_K_M.gguf):

bash

python3 -m sglang.launch_server \
    --model-path Qwen3-30B-A3B-Q4_K_M.gguf \
    --device npu --attention-backend ascend \
    --host 0.0.0.0 --port 30000 \
    --mem-fraction-static 0.8 --tp-size 2

Implementation Notes:

GGUF weights are pre-dequantized to FP16/BF16 during model loading on CPU, then transferred to NPU for inference. This trades higher memory usage for faster runtime performance (no per-forward-pass dequantization overhead).

MoE layers use npu_grouped_matmul and npu_moe_init_routing / npu_moe_finalize_routing for high-performance expert computation.

TP (tensor parallelism) sharding is supported for both dense and MoE GGUF models.