# Support Features on Ascend NPU

This section describes the basic functions and features supported on the Ascend NPU. If you encounter issues or have any questions, please open an issue.

If you want to know the meaning and usage of each parameter, see Server Arguments.

## Model and tokenizer

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --model-path, --model | None | Type: str | A2, A3 |
| --tokenizer-path | None | Type: str | A2, A3 |
| --tokenizer-mode | auto | auto, slow | A2, A3 |
| --tokenizer-worker-num | 1 | Type: int | A2, A3 |
| --skip-tokenizer-init | False | bool flag (set to enable) | A2, A3 |
| --load-format | auto | auto, safetensors | A2, A3 |
| --model-loader-extra-config | {} | Type: str | A2, A3 |
| --trust-remote-code | False | bool flag (set to enable) | A2, A3 |
| --context-length | None | Type: int | A2, A3 |
| --is-embedding | False | bool flag (set to enable) | A2, A3 |
| --enable-multimodal | None | bool flag (set to enable) | A2, A3 |
| --revision | None | Type: str | A2, A3 |
| --model-impl | auto | auto, sglang, transformers | A2, A3 |

## HTTP server

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --host | 127.0.0.1 | Type: str | A2, A3 |
| --port | 30000 | Type: int | A2, A3 |
| --skip-server-warmup | False | bool flag (set to enable) | A2, A3 |
| --warmups | None | Type: str | A2, A3 |
| --nccl-port | None | Type: int | A2, A3 |
| --fastapi-root-path | None | Type: str | A2, A3 |
| --grpc-mode | False | False | Planned |
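
For context, a minimal Ascend launch combining the model and HTTP arguments above might look like the sketch below; the checkpoint path is hypothetical and should be replaced with your own.

```bash
# Minimal launch sketch: local checkpoint, listen on all interfaces.
python -m sglang.launch_server \
  --model-path /models/Qwen2.5-7B-Instruct \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 30000
```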

## Quantization and data type

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --dtype | auto | auto, float16, bfloat16 | A2, A3 |
| --quantization | None | modelslim | A2, A3 |
| --quantization-param-path | None | Type: str | Special for GPU |
| --kv-cache-dtype | auto | auto | A2, A3 |
| --enable-fp32-lm-head | False | bool flag (set to enable) | A2, A3 |
| --modelopt-quant | None | Type: str | Special for GPU |
| --modelopt-checkpoint-restore-path | None | Type: str | Special for GPU |
| --modelopt-checkpoint-save-path | None | Type: str | Special for GPU |
| --modelopt-export-path | None | Type: str | Special for GPU |
| --quantize-and-serve | False | bool flag (set to enable) | Special for GPU |
| --rl-quant-profile | None | Type: str | Special for GPU |
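
As a hedged illustration of the data-type and quantization arguments, the sketches below run a model in bfloat16 and serve a modelslim-quantized checkpoint; both checkpoint paths are hypothetical.

```bash
# Force bfloat16 weights/activations (hypothetical path).
python -m sglang.launch_server \
  --model-path /models/Qwen2.5-7B-Instruct \
  --dtype bfloat16

# Serve a checkpoint quantized with modelslim (hypothetical path).
python -m sglang.launch_server \
  --model-path /models/Qwen2.5-7B-Instruct-W8A8 \
  --quantization modelslim
```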

## Memory and scheduling

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --mem-fraction-static | None | Type: float | A2, A3 |
| --max-running-requests | None | Type: int | A2, A3 |
| --prefill-max-requests | None | Type: int | A2, A3 |
| --max-queued-requests | None | Type: int | A2, A3 |
| --max-total-tokens | None | Type: int | A2, A3 |
| --chunked-prefill-size | None | Type: int | A2, A3 |
| --max-prefill-tokens | 16384 | Type: int | A2, A3 |
| --schedule-policy | fcfs | lpm, fcfs | A2, A3 |
| --enable-priority-scheduling | False | bool flag (set to enable) | A2, A3 |
| --schedule-low-priority-values-first | False | bool flag (set to enable) | A2, A3 |
| --priority-scheduling-preemption-threshold | 10 | Type: int | A2, A3 |
| --schedule-conservativeness | 1.0 | Type: float | A2, A3 |
| --page-size | 128 | Type: int | A2, A3 |
| --swa-full-tokens-ratio | 0.8 | Type: float | Planned |
| --disable-hybrid-swa-memory | False | bool flag (set to enable) | Planned |
| --radix-eviction-policy | lru | lru, lfu | A2, A3 |
| --enable-prefill-delayer | False | bool flag (set to enable) | A2, A3 |
| --prefill-delayer-max-delay-passes | 30 | Type: int | A2, A3 |
| --prefill-delayer-token-usage-low-watermark | None | Type: float | A2, A3 |
| --prefill-delayer-forward-passes-buckets | None | List[float] | A2, A3 |
| --prefill-delayer-wait-seconds-buckets | None | List[float] | A2, A3 |
| --abort-on-priority-when-disabled | False | bool flag (set to enable) | A2, A3 |
| --enable-dynamic-chunking | False | bool flag (set to enable) | Experimental |
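
The memory and scheduling knobs are typically tuned together; the sketch below uses illustrative values rather than recommendations, and the model path is hypothetical.

```bash
# Cap static KV-cache memory and bound concurrency (values are illustrative).
python -m sglang.launch_server \
  --model-path /models/Qwen2.5-7B-Instruct \
  --mem-fraction-static 0.85 \
  --max-running-requests 128 \
  --chunked-prefill-size 8192 \
  --page-size 128
```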

## Runtime options

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --device | None | Type: str | A2, A3 |
| --tensor-parallel-size, --tp-size | 1 | Type: int | A2, A3 |
| --pipeline-parallel-size, --pp-size | 1 | Type: int; currently 2 is not supported | Experimental |
| --attention-context-parallel-size, --attn-cp-size | 1 | Type: int; must be equal to --tp-size | A2, A3 |
| --moe-data-parallel-size, --moe-dp-size | 1 | Type: int | Planned |
| --pp-max-micro-batch-size | None | Type: int | Experimental |
| --pp-async-batch-depth | None | Type: int | Experimental |
| --stream-interval | 1 | Type: int | A2, A3 |
| --incremental-streaming-output | False | bool flag (set to enable) | A2, A3 |
| --random-seed | None | Type: int | A2, A3 |
| --constrained-json-whitespace-pattern | None | Type: str | A2, A3 |
| --constrained-json-disable-any-whitespace | False | bool flag (set to enable) | A2, A3 |
| --watchdog-timeout | 300 | Type: float | A2, A3 |
| --soft-watchdog-timeout | 300 | Type: float | A2, A3 |
| --dist-timeout | None | Type: int | A2, A3 |
| --download-dir | None | Type: str | A2, A3 |
| --model-checksum | None | Type: str | Planned |
| --base-gpu-id | 0 | Type: int | A2, A3 |
| --gpu-id-step | 1 | Type: int | A2, A3 |
| --sleep-on-idle | False | bool flag (set to enable) | A2, A3 |
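
For example, sharding a larger model across four NPUs with tensor parallelism might look like this sketch (the model path is hypothetical):

```bash
# Tensor parallelism over 4 NPUs.
python -m sglang.launch_server \
  --model-path /models/Qwen2.5-72B-Instruct \
  --tp-size 4
```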

## Logging

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --log-level | info | Type: str | A2, A3 |
| --log-level-http | None | Type: str | A2, A3 |
| --log-requests | False | bool flag (set to enable) | A2, A3 |
| --log-requests-level | 2 | 0, 1, 2, 3 | A2, A3 |
| --log-requests-format | text | text, json | A2, A3 |
| --crash-dump-folder | None | Type: str | A2, A3 |
| --enable-metrics | False | bool flag (set to enable) | A2, A3 |
| --enable-metrics-for-all-schedulers | False | bool flag (set to enable) | A2, A3 |
| --tokenizer-metrics-custom-labels-header | x-custom-labels | Type: str | A2, A3 |
| --tokenizer-metrics-allowed-custom-labels | None | List[str] | A2, A3 |
| --bucket-time-to-first-token | None | List[float] | A2, A3 |
| --bucket-inter-token-latency | None | List[float] | A2, A3 |
| --bucket-e2e-request-latency | None | List[float] | A2, A3 |
| --collect-tokens-histogram | False | bool flag (set to enable) | A2, A3 |
| --prompt-tokens-buckets | None | List[str] | A2, A3 |
| --generation-tokens-buckets | None | List[str] | A2, A3 |
| --gc-warning-threshold-secs | 0.0 | Type: float | A2, A3 |
| --decode-log-interval | 40 | Type: int | A2, A3 |
| --enable-request-time-stats-logging | False | bool flag (set to enable) | A2, A3 |
| --kv-events-config | None | Type: str | Special for GPU |
| --enable-trace | False | bool flag (set to enable) | A2, A3 |
| --oltp-traces-endpoint | localhost:4317 | Type: str | A2, A3 |
| --log-requests-target | None | Type: str | A2, A3 |
| --uvicorn-access-log-exclude-prefixes | [] | List[str] | A2, A3 |
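
A hedged example of turning on the observability features above (model path hypothetical):

```bash
# Expose Prometheus metrics and log requests in JSON format.
python -m sglang.launch_server \
  --model-path /models/Qwen2.5-7B-Instruct \
  --enable-metrics \
  --log-requests \
  --log-requests-format json
```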

## RequestMetricsExporter configuration

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --export-metrics-to-file | False | bool flag (set to enable) | A2, A3 |
| --export-metrics-to-file-dir | None | Type: str | A2, A3 |

## API configuration

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --api-key | None | Type: str | A2, A3 |
| --admin-api-key | None | Type: str | A2, A3 |
| --served-model-name | None | Type: str | A2, A3 |
| --weight-version | default | Type: str | A2, A3 |
| --chat-template | None | Type: str | A2, A3 |
| --hf-chat-template-name | None | Type: str | A2, A3 |
| --completion-template | None | Type: str | A2, A3 |
| --enable-cache-report | False | bool flag (set to enable) | A2, A3 |
| --reasoning-parser | None | deepseek-r1, deepseek-v3, glm45, gpt-oss, kimi, qwen3, qwen3-thinking, step3 | A2, A3 |
| --tool-call-parser | None | llama3, pythonic, qwen, qwen3_coder | A2, A3 |
| --sampling-defaults | model | openai, model | A2, A3 |

## Data parallelism

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --data-parallel-size, --dp-size | 1 | Type: int | A2, A3 |
| --load-balance-method | auto | auto, round_robin, follow_bootstrap_room, total_requests, total_tokens | A2, A3 |
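
For instance, two data-parallel replicas with round-robin load balancing could be launched as in the sketch below (model path hypothetical):

```bash
# Two DP replicas, round-robin dispatch.
python -m sglang.launch_server \
  --model-path /models/Qwen2.5-7B-Instruct \
  --dp-size 2 \
  --load-balance-method round_robin
```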

## Multi-node distributed serving

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --dist-init-addr, --nccl-init-addr | None | Type: str | A2, A3 |
| --nnodes | 1 | Type: int | A2, A3 |
| --node-rank | 0 | Type: int | A2, A3 |
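
A two-node sketch is shown below; the rendezvous address, model path, and tensor-parallel size are hypothetical and must match your cluster.

```bash
# Node 0 (rendezvous address 10.0.0.1:20000 is hypothetical).
python -m sglang.launch_server \
  --model-path /models/DeepSeek-V3 \
  --tp-size 16 --nnodes 2 --node-rank 0 \
  --dist-init-addr 10.0.0.1:20000

# Node 1 (same command, different rank).
python -m sglang.launch_server \
  --model-path /models/DeepSeek-V3 \
  --tp-size 16 --nnodes 2 --node-rank 1 \
  --dist-init-addr 10.0.0.1:20000
```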

## Model override args

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --json-model-override-args | {} | Type: str | A2, A3 |
| --preferred-sampling-params | None | Type: str | A2, A3 |

## LoRA

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --enable-lora | False | bool flag (set to enable) | A2, A3 |
| --enable-lora-overlap-loading | False | bool flag (set to enable) | A2, A3 |
| --max-lora-rank | None | Type: int | A2, A3 |
| --lora-target-modules | None | all | A2, A3 |
| --lora-paths | None | Type: List[str] / JSON objects | A2, A3 |
| --max-loras-per-batch | 8 | Type: int | A2, A3 |
| --max-loaded-loras | None | Type: int | A2, A3 |
| --lora-eviction-policy | lru | lru, fifo | A2, A3 |
| --lora-backend | csgmv | triton, csgmv, ascend, torch_native | A2, A3 |
| --max-lora-chunk-size | 16 | 16, 32, 64, 128 | Special for GPU |
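
A hedged LoRA sketch using the Ascend backend; the adapter name/path pair is hypothetical and the name=path form is only one of the accepted ways to pass --lora-paths.

```bash
# Serve a base model with one LoRA adapter on the ascend LoRA backend.
python -m sglang.launch_server \
  --model-path /models/Qwen2.5-7B-Instruct \
  --enable-lora \
  --lora-paths my_adapter=/adapters/my_adapter \
  --max-loras-per-batch 8 \
  --lora-backend ascend
```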

## Kernel Backends (Attention, Sampling, Grammar, GEMM)

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --attention-backend | None | ascend | A2, A3 |
| --prefill-attention-backend | None | ascend | A2, A3 |
| --decode-attention-backend | None | ascend | A2, A3 |
| --sampling-backend | None | pytorch, ascend | A2, A3 |
| --grammar-backend | None | xgrammar | A2, A3 |
| --mm-attention-backend | None | ascend_attn | A2, A3 |
| --nsa-prefill-backend | flashmla_sparse | flashmla_sparse, flashmla_decode, fa3, tilelang, aiter | Special for GPU |
| --nsa-decode-backend | fa3 | flashmla_prefill, flashmla_kv, fa3, tilelang, aiter | Special for GPU |
| --fp8-gemm-backend | auto | auto, deep_gemm, flashinfer_trtllm, flashinfer_cutlass, flashinfer_deepgemm, cutlass, triton, aiter | Special for GPU |
| --disable-flashinfer-autotune | False | bool flag (set to enable) | Special for GPU |
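
Backends are usually selected automatically, but they can be pinned explicitly; a sketch using the Ascend options from the table above (model path hypothetical):

```bash
# Pin the NPU kernel backends explicitly.
python -m sglang.launch_server \
  --model-path /models/Qwen2.5-7B-Instruct \
  --attention-backend ascend \
  --sampling-backend ascend \
  --grammar-backend xgrammar
```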

## Speculative decoding

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --speculative-algorithm | None | EAGLE3, NEXTN | A2, A3 |
| --speculative-draft-model-path, --speculative-draft-model | None | Type: str | A2, A3 |
| --speculative-draft-model-revision | None | Type: str; branch name, tag name, or commit id | A2, A3 |
| --speculative-draft-load-format | auto | auto, dummy | A2, A3 |
| --speculative-num-steps | None | Type: int | A2, A3 |
| --speculative-eagle-topk | None | Type: int | A2, A3 |
| --speculative-num-draft-tokens | None | Type: int | A2, A3 |
| --speculative-accept-threshold-single | 1.0 | Type: float | Special for GPU |
| --speculative-accept-threshold-acc | 1.0 | Type: float | Special for GPU |
| --speculative-token-map | None | Type: str | A2, A3 |
| --speculative-attention-mode | prefill | prefill, decode | A2, A3 |
| --speculative-moe-runner-backend | None | auto | A2, A3 |
| --speculative-moe-a2a-backend | None | ascend_fuseep | A2, A3 |
| --speculative-draft-attention-backend | None | ascend | A2, A3 |
| --speculative-draft-model-quantization | None | unquant | A2, A3 |
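
A hedged EAGLE3 sketch; the draft-model path and the step/top-k/draft-token values are illustrative, not tuned recommendations.

```bash
# EAGLE3 speculative decoding with a separate draft model (hypothetical paths).
python -m sglang.launch_server \
  --model-path /models/Llama-3.1-8B-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path /models/llama3.1-8b-eagle3-draft \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 4 \
  --speculative-num-draft-tokens 8
```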

## Ngram speculative decoding

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --speculative-ngram-min-match-window-size | 1 | Type: int | Experimental |
| --speculative-ngram-max-match-window-size | 12 | Type: int | Experimental |
| --speculative-ngram-min-bfs-breadth | 1 | Type: int | Experimental |
| --speculative-ngram-max-bfs-breadth | 10 | Type: int | Experimental |
| --speculative-ngram-match-type | BFS | BFS, PROB | Experimental. BFS uses recency-based expansion; PROB uses frequency-based expansion. |
| --speculative-ngram-max-trie-depth | 18 | Type: int | Experimental |
| --speculative-ngram-capacity | 10000000 | Type: int | Experimental |

## Expert parallelism

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --expert-parallel-size, --ep-size, --ep | 1 | Type: int | A2, A3 |
| --moe-a2a-backend | none | none, deepep, ascend_fuseep (incompatible with EPLB) | A2, A3 |
| --moe-runner-backend | auto | auto, triton | A2, A3 |
| --flashinfer-mxfp4-moe-precision | default | default, bf16 | Special for GPU |
| --enable-flashinfer-allreduce-fusion | False | bool flag (set to enable) | Special for GPU |
| --deepep-mode | auto | normal, low_latency, auto | A2, A3 |
| --deepep-config | None | Type: str | Special for GPU |
| --ep-num-redundant-experts | 0 | Type: int | A2, A3 |
| --ep-dispatch-algorithm | None | static, dynamic, fake | A2, A3 |
| --init-expert-location | trivial | trivial, <path.pt>, <path.json>, <json_string> | A2, A3 |
| --enable-eplb | False | bool flag (set to enable) | A2, A3 |
| --eplb-algorithm | deepseek | auto, deepseek | A2, A3 |
| --eplb-rebalance-num-iterations | 1000 | Type: int | A2, A3 |
| --eplb-rebalance-layers-per-chunk | None | Type: int | A2, A3 |
| --eplb-min-rebalancing-utilization-threshold | 1.0 | Type: float | A2, A3 |
| --expert-distribution-recorder-mode | None | stat, stat_approx, per_pass, per_token | A2, A3 |
| --expert-distribution-recorder-buffer-size | None | Type: int | A2, A3 |
| --enable-expert-distribution-metrics | False | bool flag (set to enable) | A2, A3 |
| --moe-dense-tp-size | None | 1 | A2, A3 |
| --elastic-ep-backend | None | none, mooncake | Special for GPU |
| --mooncake-ib-device | None | Type: str | Special for GPU |
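
A hedged MoE sketch combining tensor and expert parallelism with the DeepEP all-to-all backend; the model path and world size are hypothetical.

```bash
# Expert parallelism over 16 NPUs with DeepEP.
python -m sglang.launch_server \
  --model-path /models/DeepSeek-V3 \
  --tp-size 16 \
  --ep-size 16 \
  --moe-a2a-backend deepep \
  --deepep-mode auto
```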

## Mamba Cache

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --max-mamba-cache-size | None | Type: int | A2, A3 |
| --mamba-ssm-dtype | float32 | float32, bfloat16, float16 | A2, A3 |
| --mamba-full-memory-ratio | 0.9 | Type: float | A2, A3 |
| --mamba-scheduler-strategy | auto | auto, no_buffer, extra_buffer | A2, A3 |
| --mamba-track-interval | 256 | Type: int | A2, A3 |

## Hierarchical cache

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --enable-hierarchical-cache | False | bool flag (set to enable). Currently, mamba cache is not supported. | A2, A3 |
| --hicache-ratio | 2.0 | Type: float | A2, A3 |
| --hicache-size | 0 | Type: int | A2, A3 |
| --hicache-write-policy | write_through | Currently only write_back is supported | A2, A3 |
| --hicache-io-backend | kernel | kernel_ascend, direct | A2, A3 |
| --hicache-mem-layout | layer_first | page_first_direct, page_first_kv_split | A2, A3 |
| --hicache-storage-backend | None | file | A2, A3 |
| --hicache-storage-prefetch-policy | best_effort | best_effort, wait_complete, timeout | Special for GPU |
| --hicache-storage-backend-extra-config | None | Type: str | Special for GPU |
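
A hedged hierarchical-cache sketch using the Ascend IO backend; the model path and ratio are illustrative.

```bash
# Spill KV cache to host memory at 2x the device pool size.
python -m sglang.launch_server \
  --model-path /models/Qwen2.5-7B-Instruct \
  --enable-hierarchical-cache \
  --hicache-ratio 2.0 \
  --hicache-io-backend kernel_ascend
```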

## LMCache

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --enable-lmcache | False | bool flag (set to enable) | Special for GPU |

## Offloading (must be used with --disable-cuda-graph)

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --cpu-offload-gb | 0 | Type: int | A2, A3 |
| --offload-group-size | -1 | Type: int (DeepSeek only) | A2, A3 |
| --offload-num-in-group | 1 | Type: int (DeepSeek only) | A2, A3 |
| --offload-prefetch-step | 1 | Type: int (DeepSeek only) | A2, A3 |
| --offload-mode | cpu | cpu (DeepSeek only), meta (DeepSeek only), sharded_gpu (DeepSeek only; only supports tp=1, dp>1) | A2, A3 |
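
As the section title notes, offloading must be combined with disabling graph capture; a sketch with an illustrative offload budget and a hypothetical model path:

```bash
# Offload 16 GB of weights to CPU; graph capture must be disabled.
python -m sglang.launch_server \
  --model-path /models/DeepSeek-V3 \
  --cpu-offload-gb 16 \
  --disable-cuda-graph
```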

## Args for multi-item scoring

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --multi-item-scoring-delimiter | None | Type: int | A2, A3 |

## Optimization/debug options

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --disable-radix-cache | False | bool flag (set to enable) | A2, A3 |
| --cuda-graph-max-bs | None | Type: int | A2, A3 |
| --cuda-graph-bs | None | List[int] | A2, A3 |
| --disable-cuda-graph | False | bool flag (set to enable) | A2, A3 |
| --disable-cuda-graph-padding | False | bool flag (set to enable) | A2, A3 |
| --enable-profile-cuda-graph | False | bool flag (set to enable) | A2, A3 |
| --enable-cudagraph-gc | False | bool flag (set to enable) | A2, A3 |
| --enable-nccl-nvls | False | bool flag (set to enable) | Special for GPU |
| --enable-symm-mem | False | bool flag (set to enable) | Special for GPU |
| --disable-flashinfer-cutlass-moe-fp4-allgather | False | bool flag (set to enable) | Special for GPU |
| --enable-tokenizer-batch-encode | False | bool flag (set to enable) | A2, A3 |
| --disable-tokenizer-batch-decode | False | bool flag (set to enable) | A2, A3 |
| --disable-custom-all-reduce | False | bool flag (set to enable) | Special for GPU |
| --enable-mscclpp | False | bool flag (set to enable) | Special for GPU |
| --enable-torch-symm-mem | False | bool flag (set to enable) | Special for GPU |
| --disable-overlap-schedule | False | bool flag (set to enable) | A2, A3 |
| --enable-mixed-chunk | False | bool flag (set to enable) | A2, A3 |
| --enable-dp-attention | False | bool flag (set to enable) | A2, A3 |
| --enable-dp-lm-head | False | bool flag (set to enable) | A2, A3 |
| --enable-two-batch-overlap | False | bool flag (set to enable) | Planned |
| --enable-single-batch-overlap | False | bool flag (set to enable) | A2, A3 |
| --tbo-token-distribution-threshold | 0.48 | Type: float | Planned |
| --enable-torch-compile | False | bool flag (set to enable) | A2, A3 |
| --enable-torch-compile-debug-mode | False | bool flag (set to enable) | A2, A3 |
| --enforce-piecewise-cuda-graph | False | bool flag (set to enable); currently, Llama-3.1-8B-Instruct and Qwen2.5-7B-Instruct models are supported | A2, A3 |
| --piecewise-cuda-graph-tokens | None | Type: JSON list | A2, A3 |
| --piecewise-cuda-graph-compiler | eager | eager | A2, A3 |
| --torch-compile-max-bs | 32 | Type: int | A2, A3 |
| --piecewise-cuda-graph-max-tokens | None | Type: int | A2, A3 |
| --torchao-config | "" | Type: str | Special for GPU |
| --enable-nan-detection | False | bool flag (set to enable) | A2, A3 |
| --enable-p2p-check | False | bool flag (set to enable) | Special for GPU |
| --triton-attention-reduce-in-fp32 | False | bool flag (set to enable) | Special for GPU |
| --triton-attention-num-kv-splits | 8 | Type: int | Special for GPU |
| --triton-attention-split-tile-size | None | Type: int | Special for GPU |
| --delete-ckpt-after-loading | False | bool flag (set to enable) | A2, A3 |
| --enable-memory-saver | False | bool flag (set to enable) | A2, A3 |
| --enable-weights-cpu-backup | False | bool flag (set to enable) | A2, A3 |
| --enable-draft-weights-cpu-backup | False | bool flag (set to enable) | A2, A3 |
| --allow-auto-truncate | False | bool flag (set to enable) | A2, A3 |
| --enable-custom-logit-processor | False | bool flag (set to enable) | A2, A3 |
| --flashinfer-mla-disable-ragged | False | bool flag (set to enable) | Special for GPU |
| --disable-shared-experts-fusion | True | bool flag (set to enable) | A2, A3 |
| --disable-chunked-prefix-cache | True | bool flag (set to enable) | A2, A3 |
| --disable-fast-image-processor | False | bool flag (set to enable) | A2, A3 |
| --keep-mm-feature-on-device | False | bool flag (set to enable) | A2, A3 |
| --enable-return-hidden-states | False | bool flag (set to enable) | A2, A3 |
| --enable-return-routed-experts | False | bool flag (set to enable) | A2, A3 |
| --scheduler-recv-interval | 1 | Type: int | A2, A3 |
| --numa-node | None | List[int] | A2, A3 |
| --enable-deterministic-inference | False | bool flag (set to enable) | Planned |
| --rl-on-policy-target | None | fsdp | Planned |
| --enable-layerwise-nvtx-marker | False | bool flag (set to enable) | Special for GPU |
| --enable-attn-tp-input-scattered | False | bool flag (set to enable) | Experimental |
| --enable-nsa-prefill-context-parallel | False | bool flag (set to enable) | A2, A3 |
| --enable-fused-qk-norm-rope | False | bool flag (set to enable) | Special for GPU |
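
A hedged optimization sketch combining torch.compile with mixed chunked prefill; the batch-size cap is illustrative and the model path hypothetical.

```bash
# Enable torch.compile and mixed chunked prefill.
python -m sglang.launch_server \
  --model-path /models/Qwen2.5-7B-Instruct \
  --enable-torch-compile \
  --torch-compile-max-bs 32 \
  --enable-mixed-chunk
```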

## Dynamic batch tokenizer

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --enable-dynamic-batch-tokenizer | False | bool flag (set to enable) | A2, A3 |
| --dynamic-batch-tokenizer-batch-size | 32 | Type: int | A2, A3 |
| --dynamic-batch-tokenizer-batch-timeout | 0.002 | Type: float | A2, A3 |

## Debug tensor dumps

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --debug-tensor-dump-output-folder | None | Type: str | A2, A3 |
| --debug-tensor-dump-layers | None | List[int] | A2, A3 |
| --debug-tensor-dump-input-file | None | Type: str | A2, A3 |

## PD disaggregation

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --disaggregation-mode | null | null, prefill, decode | A2, A3 |
| --disaggregation-transfer-backend | mooncake | ascend | A2, A3 |
| --disaggregation-bootstrap-port | 8998 | Type: int | A2, A3 |
| --disaggregation-ib-device | None | Type: str | Special for GPU |
| --disaggregation-decode-enable-offload-kvcache | False | False | A2, A3 |
| --num-reserved-decode-tokens | 512 | Type: int | A2, A3 |
| --disaggregation-decode-polling-interval | 1 | Type: int | A2, A3 |
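
A hedged prefill/decode disaggregation sketch using the Ascend transfer backend; hosts, ports, and the model path are hypothetical, the two instances must agree on the bootstrap port, and any additional routing/load-balancer setup is omitted here.

```bash
# Prefill instance.
python -m sglang.launch_server \
  --model-path /models/DeepSeek-V3 \
  --disaggregation-mode prefill \
  --disaggregation-transfer-backend ascend \
  --disaggregation-bootstrap-port 8998

# Decode instance.
python -m sglang.launch_server \
  --model-path /models/DeepSeek-V3 \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend ascend \
  --disaggregation-bootstrap-port 8998
```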

## Encode prefill disaggregation

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --enable-adaptive-dispatch-to-encoder | False | bool flag (set to enable adaptive dispatch) | A2, A3 |
| --encoder-only | False | bool flag (set to launch an encoder-only server) | A2, A3 |
| --language-only | False | bool flag (set to load weights for the language model only) | A2, A3 |
| --encoder-transfer-backend | zmq_to_scheduler | zmq_to_scheduler, zmq_to_tokenizer, mooncake | A2, A3 |
| --encoder-urls | [] | List[str] (list of encoder server URLs) | A2, A3 |
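
A hedged encoder/language split sketch; the model path, port, and encoder URL are hypothetical.

```bash
# Encoder-only server.
python -m sglang.launch_server \
  --model-path /models/Qwen2.5-VL-7B-Instruct \
  --encoder-only \
  --port 30001

# Language-only server dispatching vision encoding to the encoder above.
python -m sglang.launch_server \
  --model-path /models/Qwen2.5-VL-7B-Instruct \
  --language-only \
  --encoder-urls http://127.0.0.1:30001 \
  --encoder-transfer-backend zmq_to_scheduler
```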

## Custom weight loader

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --custom-weight-loader | None | List[str] | A2, A3 |
| --weight-loader-disable-mmap | False | bool flag (set to enable) | A2, A3 |
| --remote-instance-weight-loader-seed-instance-ip | None | Type: str | A2, A3 |
| --remote-instance-weight-loader-seed-instance-service-port | None | Type: int | A2, A3 |
| --remote-instance-weight-loader-send-weights-group-ports | None | Type: JSON list | A2, A3 |
| --remote-instance-weight-loader-backend | nccl | transfer_engine, nccl | A2, A3 |
| --remote-instance-weight-loader-start-seed-via-transfer-engine | False | bool flag (set to enable) | Special for GPU |

## For PD-Multiplexing

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --enable-pdmux | False | bool flag (set to enable) | Special for GPU |
| --pdmux-config-path | None | Type: str | Special for GPU |
| --sm-group-num | 8 | Type: int | Special for GPU |

## For Multi-Modal

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --enable-broadcast-mm-inputs-process | False | bool flag (set to enable) | A2, A3 |
| --mm-process-config | None | Type: JSON / Dict | A2, A3 |
| --mm-enable-dp-encoder | False | bool flag (set to enable) | A2, A3 |
| --limit-mm-data-per-request | None | Type: JSON / Dict | A2, A3 |

## For checkpoint decryption

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --decrypted-config-file | None | Type: str | A2, A3 |
| --decrypted-draft-config-file | None | Type: str | A2, A3 |
| --enable-prefix-mm-cache | False | bool flag (set to enable) | A2, A3 |

## Forward hooks

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --forward-hooks | None | Type: JSON list | A2, A3 |

## Configuration file support

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --config | None | Type: str | A2, A3 |

## Other Params

The following parameters are not supported because the third-party components they depend on (for example, KTransformers and checkpoint-engine) are not compatible with the NPU.

| Argument | Defaults | Options |
| --- | --- | --- |
| --checkpoint-engine-wait-weights-before-ready | False | bool flag (set to enable) |
| --kt-weight-path | None | Type: str |
| --kt-method | AMXINT4 | Type: str |
| --kt-cpuinfer | None | Type: int |
| --kt-threadpool-count | 2 | Type: int |
| --kt-num-gpu-experts | None | Type: int |
| --kt-max-deferred-experts-per-token | None | Type: int |

The following parameters have known functional deficiencies in the upstream community implementation.

| Argument | Defaults | Options |
| --- | --- | --- |
| --tool-server | None | Type: str |