This guide lists best-practice configurations for Qwen3-8B on Ascend NPUs, together with the TPOT (time per output token) reported for each configuration.
| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration |
|---|---|---|---|---|---|---|---|
| Qwen3-8B | Atlas 800I A3 | 1 | PD Mixed | 3.5K+1.5K | 5ms | W8A8 INT8 | Optimal Configuration |
| Qwen3-8B | Atlas 800I A3 | 1 | PD Mixed | 6K+1.5K | 11.79ms | W8A8 INT8 | Optimal Configuration |
| Model | Hardware | Cards | Deploy Mode | Dataset | TPOT | Quantization | Configuration |
|---|---|---|---|---|---|---|---|
| Qwen3-8B | Atlas 800I A3 | 1 | PD Mixed | 3.5K+1.5K | 37ms | W8A8 INT8 | Optimal Configuration |
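The TPOT column reports the average time between consecutive output tokens after the first one. Below is a minimal sketch of that calculation, assuming the common definition: end-to-end latency minus time to first token, divided by the number of output tokens minus one. The helper name `tpot_ms` and the sample numbers are illustrative and are not measurements from this guide.

```bash
# tpot_ms E2E_MS TTFT_MS OUTPUT_TOKENS
# Hypothetical helper: prints time per output token in milliseconds,
# assuming TPOT = (E2E latency - TTFT) / (output tokens - 1).
tpot_ms() {
  awk -v e2e="$1" -v ttft="$2" -v n="$3" 'BEGIN {
    if (n < 2) { print "need at least 2 output tokens" > "/dev/stderr"; exit 1 }
    printf "%.2f\n", (e2e - ttft) / (n - 1)
  }'
}

# Illustrative input only: 10 s end to end, 0.5 s to first token, 96 output tokens.
tpot_ms 10000 500 96   # prints 100.00
```

The first token is excluded because its latency is dominated by prefill, which is reported separately as TTFT.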
<a id="single-node-pd-mixed" title="Referenced by external docs. Verify before removing."></a>
Model: Qwen3-8B
Hardware: Atlas 800I A3
Cards: 1
Deploy Mode: PD Mixed
Quantization: W8A8 INT8
Dataset: 3.5K+1.5K
TPOT: 37ms
# ============================================================
# Before running, update the following variables:
# MODEL_PATH: path to the model weights directory
# DRAFT_MODEL_PATH: path to the draft model weights directory
# HCCL_SOCKET_IFNAME: network interface name for HCCL
# GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================
MODEL_PATH=/path/to/model-weights
DRAFT_MODEL_PATH=/path/to/draft-model-weights
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=<network-interface>
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=50
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
python3 -m sglang.launch_server \
--model-path "$MODEL_PATH" \
--host 127.0.0.1 --port 6688 \
--trust-remote-code \
--nnodes 1 \
--node-rank 0 \
--attention-backend ascend \
--device npu \
--quantization modelslim \
--max-running-requests 70 \
--max-prefill-tokens 16384 \
--disable-radix-cache \
--chunked-prefill-size 16384 \
--tp-size 1 \
--mem-fraction-static 0.85 \
--cuda-graph-bs 8 12 24 36 48 51 55 60 63 64 66 68 70 \
--dtype bfloat16 \
--speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path "$DRAFT_MODEL_PATH" \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4
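Every EAGLE3 configuration in this guide uses `--speculative-eagle-topk 1` and sets `--speculative-num-draft-tokens` to `--speculative-num-steps` plus one. Below is a minimal sketch of a pre-launch check for that pattern. The helper `check_spec_args` is hypothetical, and the plus-one rule is inferred from the configurations in this guide for top-k 1 only, so it makes no claim about larger top-k values.

```bash
# check_spec_args NUM_STEPS EAGLE_TOPK NUM_DRAFT_TOKENS
# Hypothetical helper: returns 0 when the values follow the pattern used
# in this guide, 1 on a mismatch, 2 when topk is not 1 (not covered).
check_spec_args() {
  local steps="$1" topk="$2" draft="$3"
  if [ "$topk" -ne 1 ]; then
    echo "topk=$topk: not covered by this check" >&2
    return 2
  fi
  if [ "$draft" -ne $((steps + 1)) ]; then
    echo "expected num-draft-tokens=$((steps + 1)), got $draft" >&2
    return 1
  fi
  return 0
}

# Values from the launch command above: 3 steps, top-k 1, 4 draft tokens.
check_spec_args 3 1 4 && echo "spec args consistent"
```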
We benchmarked this configuration with the `random` dataset of `sglang.bench_serving`:
python3 -m sglang.bench_serving \
--dataset-name random \
--backend sglang \
--host 127.0.0.1 \
--port 6688 \
--max-concurrency 64 \
--random-input-len 3500 \
--random-output-len 1500 \
--num-prompts 256 \
--random-range-ratio 1
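The benchmark should start only once the server is ready to accept requests. Below is a minimal sketch of a retry loop for that purpose. `wait_until_ready` is a hypothetical helper that takes the probe command as its arguments. The commented `curl` probe against `/health` on the host and port used in this guide is an assumption to verify against your SGLang version.

```bash
# wait_until_ready ATTEMPTS DELAY_SECONDS PROBE_COMMAND [ARGS...]
# Hypothetical helper: runs the probe until it succeeds or the attempts
# run out. Returns 0 on success, 1 if the probe never succeeded.
wait_until_ready() {
  local attempts="$1" delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Assumed usage (not executed here): poll the health endpoint for up to
# 10 minutes, then start the benchmark.
# wait_until_ready 120 5 curl -sf http://127.0.0.1:6688/health
```

Passing the probe as arguments keeps the loop independent of the endpoint, so the same helper can wrap any readiness check.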
Model: Qwen3-8B
Hardware: Atlas 800I A3
Cards: 1
Deploy Mode: PD Mixed
Quantization: W8A8 INT8
Dataset: 3.5K+1.5K
TPOT: 5ms
# ============================================================
# Before running, update the following variables:
# MODEL_PATH: path to the model weights directory
# DRAFT_MODEL_PATH: path to the draft model weights directory
# HCCL_SOCKET_IFNAME: network interface name for HCCL
# GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================
MODEL_PATH=/path/to/model-weights
DRAFT_MODEL_PATH=/path/to/draft-model-weights
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=<network-interface>
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python3 -m sglang.launch_server \
--model-path "$MODEL_PATH" \
--host 127.0.0.1 --port 6688 \
--trust-remote-code \
--nnodes 1 \
--node-rank 0 \
--attention-backend ascend \
--device npu \
--quantization modelslim \
--max-running-requests 1 \
--max-prefill-tokens 16384 \
--disable-radix-cache \
--chunked-prefill-size -1 \
--tp-size 2 \
--mem-fraction-static 0.894 \
--cuda-graph-bs 1 \
--dtype bfloat16 \
--speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path "$DRAFT_MODEL_PATH" \
--speculative-num-steps 4 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 5
We benchmarked this configuration with the `random` dataset of `sglang.bench_serving`:
python3 -m sglang.bench_serving \
--dataset-name random \
--backend sglang \
--host 127.0.0.1 \
--port 6688 \
--max-concurrency 1 \
--random-input-len 3500 \
--random-output-len 1500 \
--num-prompts 4 \
--random-range-ratio 1
Model: Qwen3-8B
Hardware: Atlas 800I A3
Cards: 1
Deploy Mode: PD Mixed
Quantization: W8A8 INT8
Dataset: 6K+1.5K
TPOT: 11.79ms
# ============================================================
# Before running, update the following variables:
# MODEL_PATH: path to the model weights directory
# DRAFT_MODEL_PATH: path to the draft model weights directory
# HCCL_SOCKET_IFNAME: network interface name for HCCL
# GLOO_SOCKET_IFNAME: network interface name for Gloo
# ============================================================
MODEL_PATH=/path/to/model-weights
DRAFT_MODEL_PATH=/path/to/draft-model-weights
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export GLOO_SOCKET_IFNAME=<network-interface>
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=<network-interface>
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
python3 -m sglang.launch_server \
--model-path "$MODEL_PATH" \
--host 127.0.0.1 --port 6688 \
--trust-remote-code \
--nnodes 1 \
--node-rank 0 \
--attention-backend ascend \
--device npu \
--quantization modelslim \
--max-running-requests 16 \
--max-prefill-tokens 16384 \
--disable-radix-cache \
--chunked-prefill-size -1 \
--tp-size 2 \
--mem-fraction-static 0.894 \
--cuda-graph-bs 1 5 15 16 \
--dtype bfloat16 \
--speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path "$DRAFT_MODEL_PATH" \
--speculative-num-steps 4 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 5
We benchmarked this configuration with the `random` dataset of `sglang.bench_serving`:
python3 -m sglang.bench_serving \
--dataset-name random \
--backend sglang \
--host 127.0.0.1 \
--port 6688 \
--max-concurrency 16 \
--random-input-len 6144 \
--random-output-len 1500 \
--num-prompts 16 \
--random-range-ratio 1
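The three benchmark runs in this guide differ mainly in concurrency and sequence length. As a rough comparison of the load each one places on the server, the sketch below multiplies `--max-concurrency` by the sum of `--random-input-len` and `--random-output-len`. This assumes that `--random-range-ratio 1` keeps every request at the stated lengths, and it ignores extra draft tokens from speculative decoding. The helper `inflight_tokens` is hypothetical.

```bash
# inflight_tokens MAX_CONCURRENCY INPUT_LEN OUTPUT_LEN
# Hypothetical helper: rough upper estimate of tokens held by concurrently
# running requests, assuming fixed-length requests.
inflight_tokens() {
  echo $(( $1 * ($2 + $3) ))
}

inflight_tokens 64 3500 1500   # 3.5K+1.5K, concurrency 64: prints 320000
inflight_tokens 1 3500 1500    # 3.5K+1.5K, concurrency 1:  prints 5000
inflight_tokens 16 6144 1500   # 6K+1.5K, concurrency 16:   prints 122304
```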