Qwen3-8B is a compact dense model in the Qwen3 series developed by Alibaba, featuring 8B parameters with Grouped-Query Attention (GQA) and up to 128K context length. It delivers significant improvements in instruction following, logical reasoning, text comprehension, mathematics, science, coding, and tool usage. The model supports EAGLE3 speculative decoding for accelerated inference and is available in both standard and thinking/reasoning-enhanced editions.
This document demonstrates the deployment of Qwen3-8B on Ascend NPUs using SGLang, including single-node PD mixed mode, feature configuration, and performance optimization.
This document was written and validated against SGLang v0.5.13, which fully supports Qwen3-8B. To use the latest features (e.g., speculative decoding), use v0.5.13 or a later version.
| Feature | Example usage |
|---|---|
| Tensor Parallelism | --tp-size 2 |
| Quantization | --quantization modelslim |
| Chunked Prefill | auto based on device memory, or set explicit value;<br/>disable with --chunked-prefill-size -1; e.g. --chunked-prefill-size 8192 |
| NPU Graph | enabled by default; disable with --disable-cuda-graph;<br/>control range via --cuda-graph-bs or --cuda-graph-max-bs; e.g. --cuda-graph-bs 1 2 4 6 9 10 15 16 |
| Speculative Decoding | --speculative-algorithm EAGLE3 \\<br/>--speculative-draft-model-path /path/to/draft-model-weights \\<br/>--speculative-num-steps 3 \\<br/>--speculative-eagle-topk 1 \\<br/>--speculative-num-draft-tokens 4 \\<br/>--speculative-draft-model-quantization unquant |
| Overlap Schedule | export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1 |
For information on which features are compatible and which conflict, see Feature Compatibility.
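As a rough illustration of how several of these options combine, the sketch below assembles a launch command from the example values in the table. `MODEL_PATH`, `HOST`, `PORT`, `RUN`, and the `launch_cmd` helper are hypothetical names introduced here; `--model-path`, `--host`, and `--port` are general SGLang server options that the table does not list. Ascend-specific options from the best practice page are deliberately left out, so treat this as a starting point rather than a validated configuration.

```bash
#!/bin/sh
# Hypothetical placeholders; adjust for your environment.
MODEL_PATH="${MODEL_PATH:-/path/to/Qwen3-8B}"
HOST="${HOST:-0.0.0.0}"
PORT="${PORT:-6688}"

launch_cmd() {
  # Hold the argv in positional parameters so values with spaces stay intact.
  set -- python3 -m sglang.launch_server \
    --model-path "$MODEL_PATH" \
    --host "$HOST" \
    --port "$PORT" \
    --tp-size 2 \
    --chunked-prefill-size 8192 \
    --cuda-graph-bs 1 2 4 6 9 10 15 16

  # Always print the command for review.
  printf '%s\n' "$*"

  # Start the server only when RUN=1 is set explicitly.
  if [ "${RUN:-0}" = "1" ]; then
    export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1  # Overlap Schedule row of the table
    exec "$@"
  fi
}

launch_cmd
```

Running the script as-is only prints the command; setting `RUN=1` executes it. Printing first makes it easier to compare the flags against the compatibility page before starting a long-running server.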
Ensure the available device memory exceeds the model weight size before deployment. For optimal throughput and latency, refer to the best practice configurations which may require additional cards.
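A back-of-the-envelope estimate of the weight size can help with this check. The sketch below uses the nominal 8B parameter count and assumes 2 bytes per parameter (BF16 or FP16 weights); quantized weights would be smaller, and the KV cache and activations need memory on top of this figure. The variable names are illustrative only.

```bash
#!/bin/sh
# Nominal parameter count from the model description.
PARAMS=8000000000
# Assumption: unquantized BF16/FP16 weights, 2 bytes per parameter.
BYTES_PER_PARAM=2
# Example tensor-parallel size from the feature table.
TP_SIZE=2

weight_bytes=$(( PARAMS * BYTES_PER_PARAM ))
weight_mib=$(( weight_bytes / 1048576 ))
# Tensor parallelism shards the weights roughly evenly across devices.
per_device_mib=$(( weight_mib / TP_SIZE ))

echo "Approximate total weight size: ${weight_mib} MiB"
echo "Approximate weights per device at TP=${TP_SIZE}: ${per_device_mib} MiB"
```

Compare the result with the free memory reported for each device (for example, by `npu-smi info` on the host) before launching.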
We recommend downloading the model weights to a directory that is shared by all nodes.
The dependencies required for the NPU runtime environment are bundled in a Docker image published on quay.io, which you can pull directly.
Both stable releases and daily builds are available. The following command is based on the stable release tag. For details, see Docker image versions.
<Tabs>
<Tab title="Atlas 800I A3">

```bash
docker pull quay.io/ascend/sglang:v0.5.13.post1-cann9.0.0-a3
docker run -itd --shm-size=16g --name ${NAME} \
    --privileged=true --net=host \
    -v /var/queue_schedule:/var/queue_schedule \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /usr/local/sbin:/usr/local/sbin \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
    --device=/dev/davinci0:/dev/davinci0 \
    --device=/dev/davinci1:/dev/davinci1 \
    --device=/dev/davinci2:/dev/davinci2 \
    --device=/dev/davinci3:/dev/davinci3 \
    --device=/dev/davinci4:/dev/davinci4 \
    --device=/dev/davinci5:/dev/davinci5 \
    --device=/dev/davinci6:/dev/davinci6 \
    --device=/dev/davinci7:/dev/davinci7 \
    --device=/dev/davinci8:/dev/davinci8 \
    --device=/dev/davinci9:/dev/davinci9 \
    --device=/dev/davinci10:/dev/davinci10 \
    --device=/dev/davinci11:/dev/davinci11 \
    --device=/dev/davinci12:/dev/davinci12 \
    --device=/dev/davinci13:/dev/davinci13 \
    --device=/dev/davinci14:/dev/davinci14 \
    --device=/dev/davinci15:/dev/davinci15 \
    --device=/dev/davinci_manager:/dev/davinci_manager \
    --device=/dev/hisi_hdc:/dev/hisi_hdc \
    --entrypoint=bash \
    quay.io/ascend/sglang:v0.5.13.post1-cann9.0.0-a3
```

</Tab>
<Tab title="Atlas 800I A2">

```bash
docker pull quay.io/ascend/sglang:v0.5.13.post1-cann9.0.0-910b
docker run -itd --shm-size=16g --name ${NAME} \
    --privileged=true --net=host \
    -v /var/queue_schedule:/var/queue_schedule \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /usr/local/sbin:/usr/local/sbin \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
    --device=/dev/davinci0:/dev/davinci0 \
    --device=/dev/davinci1:/dev/davinci1 \
    --device=/dev/davinci2:/dev/davinci2 \
    --device=/dev/davinci3:/dev/davinci3 \
    --device=/dev/davinci4:/dev/davinci4 \
    --device=/dev/davinci5:/dev/davinci5 \
    --device=/dev/davinci6:/dev/davinci6 \
    --device=/dev/davinci7:/dev/davinci7 \
    --device=/dev/davinci_manager:/dev/davinci_manager \
    --device=/dev/hisi_hdc:/dev/hisi_hdc \
    --entrypoint=bash \
    quay.io/ascend/sglang:v0.5.13.post1-cann9.0.0-910b
```

</Tab>
</Tabs>
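The two commands differ mainly in how many `/dev/davinciN` devices they map (16 versus 8). If you prefer not to maintain the list by hand, a small helper can generate the flags. This is only a convenience sketch; `device_flags` is a hypothetical function name, and it assumes the devices are numbered contiguously from 0.

```bash
#!/bin/sh
# Print one --device flag per NPU, numbered 0 .. count-1.
device_flags() {
  local count="$1" i=0
  while [ "$i" -lt "$count" ]; do
    printf '%s\n' "--device=/dev/davinci${i}:/dev/davinci${i}"
    i=$(( i + 1 ))
  done
}

# Example: list the flags for an 8-device machine.
device_flags 8
```

In a `docker run` command, the output can be spliced in unquoted, for example `docker run ... $(device_flags 16) --device=/dev/davinci_manager:/dev/davinci_manager ...`, so that the shell splits it into separate arguments.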
Single-node deployment completes both prefill and decode within the same node (PD mixed mode), suitable for scenarios with limited hardware resources. This scenario is already covered in the best practice. For the complete, optimized deployment commands and benchmark data, see Qwen3-8B Best Practice — PD Mixed On A3.
After the service is started, you can invoke the model by sending a prompt:
```bash
# ============================================================
# Before running, update the following variables:
#   HOST: the server host address (e.g., localhost)
#   PORT: the server port number (e.g., 6688)
# ============================================================
curl http://${HOST}:${PORT}/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "What is the capital of France?",
        "sampling_params": {
            "max_new_tokens": 64,
            "temperature": 0
        }
    }'
```
Expected result: an HTTP 200 response with the generated text containing "Paris".
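The expected result can also be checked automatically. The following sketch saves the response body to a temporary file, captures the HTTP status with curl's `-w '%{http_code}'` option, and then tests both conditions. `check_generate` and `smoke_test` are hypothetical helper names, and `HOST` and `PORT` are the same variables used in the request.

```bash
#!/bin/sh
# Succeeds only if the status is 200 and the body mentions "Paris".
check_generate() {
  local status="$1" body_file="$2"
  [ "$status" = "200" ] && grep -q "Paris" "$body_file"
}

# Sends the request and applies the check.
smoke_test() {
  local body_file status
  body_file="$(mktemp)"
  status="$(curl -s -o "$body_file" -w '%{http_code}' \
    "http://${HOST}:${PORT}/generate" \
    -H "Content-Type: application/json" \
    -d '{"text": "What is the capital of France?", "sampling_params": {"max_new_tokens": 64, "temperature": 0}}')"
  if check_generate "$status" "$body_file"; then
    echo "smoke test passed"
  else
    echo "smoke test failed (HTTP ${status})" >&2
  fi
  rm -f "$body_file"
}
```

With `HOST` and `PORT` exported, call `smoke_test` once the server is running. A substring match is a coarse check that is only meant to catch an obviously broken deployment.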
Once the server prints "The server is fired up and ready to roll!" in the logs, it is ready to accept requests. For more testing examples (Health Check, Generate, Chat Completions, and port usage guidance), see Testing the Service.
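In scripts, polling an HTTP endpoint is usually more convenient than watching the logs. The sketch below retries a URL until it answers successfully or the attempts run out. It assumes the server exposes a `/health` endpoint as covered under Health Check in Testing the Service; `wait_for_server` and its default retry values are hypothetical.

```bash
#!/bin/sh
# Poll a URL until it returns a successful HTTP status.
# Arguments: URL, number of attempts (default 60), delay in seconds (default 5).
wait_for_server() {
  local url="$1" retries="${2:-60}" delay="${3:-5}" attempt=0
  while [ "$attempt" -lt "$retries" ]; do
    # -f makes curl return non-zero on HTTP errors; --max-time bounds each probe.
    if curl -sf --max-time 5 -o /dev/null "$url"; then
      return 0
    fi
    attempt=$(( attempt + 1 ))
    sleep "$delay"
  done
  return 1
}
```

For example, `wait_for_server "http://${HOST}:${PORT}/health" && echo "server is ready"` blocks for up to about five minutes with the defaults, which may need to be raised if weight loading is slow.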
For accuracy evaluation methods and datasets, see Accuracy Evaluation on Ascend NPU.
For performance data and benchmark commands, see Performance Testing on Ascend NPU.
For complete optimal configurations with deployment scripts and benchmark commands, see the Qwen3-8B Best Practice page.
For the full list of supported features, see Supported features. For detailed optimization guidance, see Optimization on Ascend NPU.
For common environment, installation, and general parameter issues, please refer to the Ascend NPU FAQ. This section only covers model-specific issues.