docs/platforms/ascend/ascend_npu_support_features.md
This section describes the basic functions and features supported by the Ascend NPU. If you encounter issues or have any questions, please open an issue.
To learn the meaning and usage of each parameter, see Server Arguments.
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --model-path / --model | None | Type: str | A2, A3 |
| --tokenizer-path | None | Type: str | A2, A3 |
| --tokenizer-mode | auto | auto, slow | A2, A3 |
| --tokenizer-worker-num | 1 | Type: int | A2, A3 |
| --skip-tokenizer-init | False | bool flag (set to enable) | A2, A3 |
| --load-format | auto | auto, safetensors | A2, A3 |
| --model-loader-extra-config | {} | Type: str | A2, A3 |
| --trust-remote-code | False | bool flag (set to enable) | A2, A3 |
| --context-length | None | Type: int | A2, A3 |
| --is-embedding | False | bool flag (set to enable) | A2, A3 |
| --enable-multimodal | None | bool flag (set to enable) | A2, A3 |
| --revision | None | Type: str | A2, A3 |
| --model-impl | auto | auto, sglang, transformers | A2, A3 |

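As a quick illustration, a minimal Ascend NPU launch using the model arguments above might look like the following sketch (the model path is a placeholder for a local checkpoint):

```shell
# Minimal launch sketch; substitute a real local or Hugging Face model path.
python -m sglang.launch_server \
  --model-path /path/to/Qwen2.5-7B-Instruct \
  --trust-remote-code \
  --context-length 8192
```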
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --host | 127.0.0.1 | Type: str | A2, A3 |
| --port | 30000 | Type: int | A2, A3 |
| --skip-server-warmup | False | bool flag (set to enable) | A2, A3 |
| --warmups | None | Type: str | A2, A3 |
| --nccl-port | None | Type: int | A2, A3 |
| --fastapi-root-path | None | Type: str | A2, A3 |
| --grpc-mode | False | bool flag (set to enable) | Planned |

| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --dtype | auto | auto, float16, bfloat16 | A2, A3 |
| --quantization | None | modelslim | A2, A3 |
| --quantization-param-path | None | Type: str | Special for GPU |
| --kv-cache-dtype | auto | auto | A2, A3 |
| --enable-fp32-lm-head | False | bool flag (set to enable) | A2, A3 |
| --modelopt-quant | None | Type: str | Special for GPU |
| --modelopt-checkpoint-restore-path | None | Type: str | Special for GPU |
| --modelopt-checkpoint-save-path | None | Type: str | Special for GPU |
| --modelopt-export-path | None | Type: str | Special for GPU |
| --quantize-and-serve | False | bool flag (set to enable) | Special for GPU |
| --rl-quant-profile | None | Type: str | Special for GPU |

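Assuming a checkpoint quantized with the modelslim method is already available locally, serving it could be sketched as follows (the path is hypothetical):

```shell
# Sketch: serve a modelslim-quantized checkpoint; the path is a placeholder.
python -m sglang.launch_server \
  --model-path /path/to/quantized-model \
  --quantization modelslim \
  --kv-cache-dtype auto
```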
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --mem-fraction-static | None | Type: float | A2, A3 |
| --max-running-requests | None | Type: int | A2, A3 |
| --prefill-max-requests | None | Type: int | A2, A3 |
| --max-queued-requests | None | Type: int | A2, A3 |
| --max-total-tokens | None | Type: int | A2, A3 |
| --chunked-prefill-size | None | Type: int | A2, A3 |
| --max-prefill-tokens | 16384 | Type: int | A2, A3 |
| --schedule-policy | fcfs | lpm, fcfs | A2, A3 |
| --enable-priority-scheduling | False | bool flag (set to enable) | A2, A3 |
| --schedule-low-priority-values-first | False | bool flag (set to enable) | A2, A3 |
| --priority-scheduling-preemption-threshold | 10 | Type: int | A2, A3 |
| --schedule-conservativeness | 1.0 | Type: float | A2, A3 |
| --page-size | 128 | Type: int | A2, A3 |
| --swa-full-tokens-ratio | 0.8 | Type: float | Planned |
| --disable-hybrid-swa-memory | False | bool flag (set to enable) | Planned |
| --radix-eviction-policy | lru | lru, lfu | A2, A3 |
| --enable-prefill-delayer | False | bool flag (set to enable) | A2, A3 |
| --prefill-delayer-max-delay-passes | 30 | Type: int | A2, A3 |
| --prefill-delayer-token-usage-low-watermark | None | Type: float | A2, A3 |
| --prefill-delayer-forward-passes-buckets | None | List[float] | A2, A3 |
| --prefill-delayer-wait-seconds-buckets | None | List[float] | A2, A3 |
| --abort-on-priority-when-disabled | False | bool flag (set to enable) | A2, A3 |
| --enable-dynamic-chunking | False | bool flag (set to enable) | Experimental |

| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --device | None | Type: str | A2, A3 |
| --tensor-parallel-size / --tp-size | 1 | Type: int | A2, A3 |
| --pipeline-parallel-size / --pp-size | 1 | Type: int; currently 2 is not supported | Experimental |
| --attention-context-parallel-size / --attn-cp-size | 1 | Type: int; must be equal to --tp-size | A2, A3 |
| --moe-data-parallel-size / --moe-dp-size | 1 | Type: int | Planned |
| --pp-max-micro-batch-size | None | Type: int | Experimental |
| --pp-async-batch-depth | None | Type: int | Experimental |
| --stream-interval | 1 | Type: int | A2, A3 |
| --incremental-streaming-output | False | bool flag (set to enable) | A2, A3 |
| --random-seed | None | Type: int | A2, A3 |
| --constrained-json-whitespace-pattern | None | Type: str | A2, A3 |
| --constrained-json-disable-any-whitespace | False | bool flag (set to enable) | A2, A3 |
| --watchdog-timeout | 300 | Type: float | A2, A3 |
| --soft-watchdog-timeout | 300 | Type: float | A2, A3 |
| --dist-timeout | None | Type: int | A2, A3 |
| --download-dir | None | Type: str | A2, A3 |
| --model-checksum | None | Type: str | Planned |
| --base-gpu-id | 0 | Type: int | A2, A3 |
| --gpu-id-step | 1 | Type: int | A2, A3 |
| --sleep-on-idle | False | bool flag (set to enable) | A2, A3 |

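For example, the tensor-parallel arguments above can be combined into a 4-way TP launch on a single node (the model path is a placeholder; adjust --tp-size to your card count):

```shell
# Sketch: 4-way tensor parallelism across NPUs on one node.
python -m sglang.launch_server \
  --model-path /path/to/model \
  --tp-size 4 \
  --mem-fraction-static 0.8
```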
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --log-level | info | Type: str | A2, A3 |
| --log-level-http | None | Type: str | A2, A3 |
| --log-requests | False | bool flag (set to enable) | A2, A3 |
| --log-requests-level | 2 | 0, 1, 2, 3 | A2, A3 |
| --log-requests-format | text | text, json | A2, A3 |
| --crash-dump-folder | None | Type: str | A2, A3 |
| --enable-metrics | False | bool flag (set to enable) | A2, A3 |
| --enable-metrics-for-all-schedulers | False | bool flag (set to enable) | A2, A3 |
| --tokenizer-metrics-custom-labels-header | x-custom-labels | Type: str | A2, A3 |
| --tokenizer-metrics-allowed-custom-labels | None | List[str] | A2, A3 |
| --bucket-time-to-first-token | None | List[float] | A2, A3 |
| --bucket-inter-token-latency | None | List[float] | A2, A3 |
| --bucket-e2e-request-latency | None | List[float] | A2, A3 |
| --collect-tokens-histogram | False | bool flag (set to enable) | A2, A3 |
| --prompt-tokens-buckets | None | List[str] | A2, A3 |
| --generation-tokens-buckets | None | List[str] | A2, A3 |
| --gc-warning-threshold-secs | 0.0 | Type: float | A2, A3 |
| --decode-log-interval | 40 | Type: int | A2, A3 |
| --enable-request-time-stats-logging | False | bool flag (set to enable) | A2, A3 |
| --kv-events-config | None | Type: str | Special for GPU |
| --enable-trace | False | bool flag (set to enable) | A2, A3 |
| --oltp-traces-endpoint | localhost:4317 | Type: str | A2, A3 |
| --log-requests-target | None | Type: str | A2, A3 |
| --uvicorn-access-log-exclude-prefixes | [] | List[str] | A2, A3 |

| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --export-metrics-to-file | False | bool flag (set to enable) | A2, A3 |
| --export-metrics-to-file-dir | None | Type: str | A2, A3 |

| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --api-key | None | Type: str | A2, A3 |
| --admin-api-key | None | Type: str | A2, A3 |
| --served-model-name | None | Type: str | A2, A3 |
| --weight-version | default | Type: str | A2, A3 |
| --chat-template | None | Type: str | A2, A3 |
| --hf-chat-template-name | None | Type: str | A2, A3 |
| --completion-template | None | Type: str | A2, A3 |
| --enable-cache-report | False | bool flag (set to enable) | A2, A3 |
| --reasoning-parser | None | deepseek-r1, deepseek-v3, glm45, gpt-oss, kimi, qwen3, qwen3-thinking, step3 | A2, A3 |
| --tool-call-parser | None | llama3, pythonic, qwen, qwen3_coder | A2, A3 |
| --sampling-defaults | model | openai, model | A2, A3 |

| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --data-parallel-size / --dp-size | 1 | Type: int | A2, A3 |
| --load-balance-method | auto | auto, round_robin, follow_bootstrap_room, total_requests, total_tokens | A2, A3 |

| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --dist-init-addr / --nccl-init-addr | None | Type: str | A2, A3 |
| --nnodes | 1 | Type: int | A2, A3 |
| --node-rank | 0 | Type: int | A2, A3 |

| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --json-model-override-args | {} | Type: str | A2, A3 |
| --preferred-sampling-params | None | Type: str | A2, A3 |

| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --enable-lora | False | bool flag (set to enable) | A2, A3 |
| --enable-lora-overlap-loading | False | bool flag (set to enable) | A2, A3 |
| --max-lora-rank | None | Type: int | A2, A3 |
| --lora-target-modules | None | all | A2, A3 |
| --lora-paths | None | Type: List[str] / JSON objects | A2, A3 |
| --max-loras-per-batch | 8 | Type: int | A2, A3 |
| --max-loaded-loras | None | Type: int | A2, A3 |
| --lora-eviction-policy | lru | lru, fifo | A2, A3 |
| --lora-backend | csgmv | triton, csgmv, ascend, torch_native | A2, A3 |
| --max-lora-chunk-size | 16 | 16, 32, 64, 128 | Special for GPU |

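Using the flags above, serving with a LoRA adapter could be sketched as follows (the adapter name and paths are hypothetical):

```shell
# Sketch: serve a base model with one LoRA adapter loaded at startup.
python -m sglang.launch_server \
  --model-path /path/to/base-model \
  --enable-lora \
  --lora-paths my_adapter=/path/to/lora_adapter \
  --max-loras-per-batch 4
```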
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --attention-backend | None | ascend | A2, A3 |
| --prefill-attention-backend | None | ascend | A2, A3 |
| --decode-attention-backend | None | ascend | A2, A3 |
| --sampling-backend | None | pytorch, ascend | A2, A3 |
| --grammar-backend | None | xgrammar | A2, A3 |
| --mm-attention-backend | None | ascend_attn | A2, A3 |
| --nsa-prefill-backend | flashmla_sparse | flashmla_sparse, flashmla_decode, fa3, tilelang, aiter | Special for GPU |
| --nsa-decode-backend | fa3 | flashmla_prefill, flashmla_kv, fa3, tilelang, aiter | Special for GPU |
| --fp8-gemm-backend | auto | auto, deep_gemm, flashinfer_trtllm, flashinfer_cutlass, flashinfer_deepgemm, cutlass, triton, aiter | Special for GPU |
| --disable-flashinfer-autotune | False | bool flag (set to enable) | Special for GPU |

| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --speculative-algorithm | None | EAGLE3, NEXTN | A2, A3 |
| --speculative-draft-model-path / --speculative-draft-model | None | Type: str | A2, A3 |
| --speculative-draft-model-revision | None | Type: str; branch name, tag name, or commit id | A2, A3 |
| --speculative-draft-load-format | auto | auto, dummy | A2, A3 |
| --speculative-num-steps | None | Type: int | A2, A3 |
| --speculative-eagle-topk | None | Type: int | A2, A3 |
| --speculative-num-draft-tokens | None | Type: int | A2, A3 |
| --speculative-accept-threshold-single | 1.0 | Type: float | Special for GPU |
| --speculative-accept-threshold-acc | 1.0 | Type: float | Special for GPU |
| --speculative-token-map | None | Type: str | A2, A3 |
| --speculative-attention-mode | prefill | prefill, decode | A2, A3 |
| --speculative-moe-runner-backend | None | auto | A2, A3 |
| --speculative-moe-a2a-backend | None | ascend_fuseep | A2, A3 |
| --speculative-draft-attention-backend | None | ascend | A2, A3 |
| --speculative-draft-model-quantization | None | unquant | A2, A3 |

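As one possible combination of the speculative-decoding flags above, an EAGLE3 launch might be sketched as follows (model and draft paths are placeholders, and the tuning values are illustrative rather than recommended defaults):

```shell
# Sketch: EAGLE3 speculative decoding with an external draft model.
python -m sglang.launch_server \
  --model-path /path/to/target-model \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path /path/to/eagle3-draft \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4
```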
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --speculative-ngram-min-match-window-size | 1 | Type: int | Experimental |
| --speculative-ngram-max-match-window-size | 12 | Type: int | Experimental |
| --speculative-ngram-min-bfs-breadth | 1 | Type: int | Experimental |
| --speculative-ngram-max-bfs-breadth | 10 | Type: int | Experimental |
| --speculative-ngram-match-type | BFS | BFS, PROB | Experimental. BFS uses recency-based expansion; PROB uses frequency-based expansion. |
| --speculative-ngram-max-trie-depth | 18 | Type: int | Experimental |
| --speculative-ngram-capacity | 10000000 | Type: int | Experimental |

| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --expert-parallel-size / --ep-size / --ep | 1 | Type: int | A2, A3 |
| --moe-a2a-backend | none | none, deepep, ascend_fuseep (incompatible with EPLB) | A2, A3 |
| --moe-runner-backend | auto | auto, triton | A2, A3 |
| --flashinfer-mxfp4-moe-precision | default | default, bf16 | Special for GPU |
| --enable-flashinfer-allreduce-fusion | False | bool flag (set to enable) | Special for GPU |
| --deepep-mode | auto | normal, low_latency, auto | A2, A3 |
| --deepep-config | None | Type: str | Special for GPU |
| --ep-num-redundant-experts | 0 | Type: int | A2, A3 |
| --ep-dispatch-algorithm | None | static, dynamic, fake | A2, A3 |
| --init-expert-location | trivial | trivial, <path.pt>, <path.json>, <json_string> | A2, A3 |
| --enable-eplb | False | bool flag (set to enable) | A2, A3 |
| --eplb-algorithm | deepseek | auto, deepseek | A2, A3 |
| --eplb-rebalance-num-iterations | 1000 | Type: int | A2, A3 |
| --eplb-rebalance-layers-per-chunk | None | Type: int | A2, A3 |
| --eplb-min-rebalancing-utilization-threshold | 1.0 | Type: float | A2, A3 |
| --expert-distribution-recorder-mode | None | stat, stat_approx, per_pass, per_token | A2, A3 |
| --expert-distribution-recorder-buffer-size | None | Type: int | A2, A3 |
| --enable-expert-distribution-metrics | False | bool flag (set to enable) | A2, A3 |
| --moe-dense-tp-size | None | 1 | A2, A3 |
| --elastic-ep-backend | None | none, mooncake | Special for GPU |
| --mooncake-ib-device | None | Type: str | Special for GPU |

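To illustrate the expert-parallel flags above, an MoE launch could be sketched as follows (the model path and sizes are illustrative; deepep is chosen here because ascend_fuseep is incompatible with EPLB):

```shell
# Sketch: expert parallelism with the deepep all-to-all backend and EPLB enabled.
python -m sglang.launch_server \
  --model-path /path/to/moe-model \
  --tp-size 8 \
  --ep-size 8 \
  --moe-a2a-backend deepep \
  --enable-eplb
```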
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --max-mamba-cache-size | None | Type: int | A2, A3 |
| --mamba-ssm-dtype | float32 | float32, bfloat16, float16 | A2, A3 |
| --mamba-full-memory-ratio | 0.9 | Type: float | A2, A3 |
| --mamba-scheduler-strategy | auto | auto, no_buffer, extra_buffer | A2, A3 |
| --mamba-track-interval | 256 | Type: int | A2, A3 |

| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --enable-hierarchical-cache | False | bool flag (set to enable). Currently, mamba cache is not supported. | A2, A3 |
| --hicache-ratio | 2.0 | Type: float | A2, A3 |
| --hicache-size | 0 | Type: int | A2, A3 |
| --hicache-write-policy | write_through | Currently only write_back is supported | A2, A3 |
| --hicache-io-backend | kernel | kernel_ascend, direct | A2, A3 |
| --hicache-mem-layout | layer_first | page_first_direct, page_first_kv_split | A2, A3 |
| --hicache-storage-backend | None | file | A2, A3 |
| --hicache-storage-prefetch-policy | best_effort | best_effort, wait_complete, timeout | Special for GPU |
| --hicache-storage-backend-extra-config | None | Type: str | Special for GPU |

| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --enable-lmcache | False | bool flag (set to enable) | Special for GPU |

| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --cpu-offload-gb | 0 | Type: int | A2, A3 |
| --offload-group-size | -1 | Type: int (DeepSeek only) | A2, A3 |
| --offload-num-in-group | 1 | Type: int (DeepSeek only) | A2, A3 |
| --offload-prefetch-step | 1 | Type: int (DeepSeek only) | A2, A3 |
| --offload-mode | cpu | cpu (DeepSeek only), meta (DeepSeek only), sharded_gpu (DeepSeek only; only supports tp=1, dp>1) | A2, A3 |

| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --multi-item-scoring-delimiter | None | Type: int | A2, A3 |

| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --disable-radix-cache | False | bool flag (set to enable) | A2, A3 |
| --cuda-graph-max-bs | None | Type: int | A2, A3 |
| --cuda-graph-bs | None | List[int] | A2, A3 |
| --disable-cuda-graph | False | bool flag (set to enable) | A2, A3 |
| --disable-cuda-graph-padding | False | bool flag (set to enable) | A2, A3 |
| --enable-profile-cuda-graph | False | bool flag (set to enable) | A2, A3 |
| --enable-cudagraph-gc | False | bool flag (set to enable) | A2, A3 |
| --enable-nccl-nvls | False | bool flag (set to enable) | Special for GPU |
| --enable-symm-mem | False | bool flag (set to enable) | Special for GPU |
| --disable-flashinfer-cutlass-moe-fp4-allgather | False | bool flag (set to enable) | Special for GPU |
| --enable-tokenizer-batch-encode | False | bool flag (set to enable) | A2, A3 |
| --disable-tokenizer-batch-decode | False | bool flag (set to enable) | A2, A3 |
| --disable-custom-all-reduce | False | bool flag (set to enable) | Special for GPU |
| --enable-mscclpp | False | bool flag (set to enable) | Special for GPU |
| --enable-torch-symm-mem | False | bool flag (set to enable) | Special for GPU |
| --disable-overlap-schedule | False | bool flag (set to enable) | A2, A3 |
| --enable-mixed-chunk | False | bool flag (set to enable) | A2, A3 |
| --enable-dp-attention | False | bool flag (set to enable) | A2, A3 |
| --enable-dp-lm-head | False | bool flag (set to enable) | A2, A3 |
| --enable-two-batch-overlap | False | bool flag (set to enable) | Planned |
| --enable-single-batch-overlap | False | bool flag (set to enable) | A2, A3 |
| --tbo-token-distribution-threshold | 0.48 | Type: float | Planned |
| --enable-torch-compile | False | bool flag (set to enable) | A2, A3 |
| --enable-torch-compile-debug-mode | False | bool flag (set to enable) | A2, A3 |
| --enforce-piecewise-cuda-graph | False | bool flag (set to enable). Currently, only the Llama-3.1-8B-Instruct and Qwen2.5-7B-Instruct models are supported. | A2, A3 |
| --piecewise-cuda-graph-tokens | None | Type: JSON list | A2, A3 |
| --piecewise-cuda-graph-compiler | eager | eager | A2, A3 |
| --torch-compile-max-bs | 32 | Type: int | A2, A3 |
| --piecewise-cuda-graph-max-tokens | None | Type: int | A2, A3 |
| --torchao-config | `` | Type: str | Special for GPU |
| --enable-nan-detection | False | bool flag (set to enable) | A2, A3 |
| --enable-p2p-check | False | bool flag (set to enable) | Special for GPU |
| --triton-attention-reduce-in-fp32 | False | bool flag (set to enable) | Special for GPU |
| --triton-attention-num-kv-splits | 8 | Type: int | Special for GPU |
| --triton-attention-split-tile-size | None | Type: int | Special for GPU |
| --delete-ckpt-after-loading | False | bool flag (set to enable) | A2, A3 |
| --enable-memory-saver | False | bool flag (set to enable) | A2, A3 |
| --enable-weights-cpu-backup | False | bool flag (set to enable) | A2, A3 |
| --enable-draft-weights-cpu-backup | False | bool flag (set to enable) | A2, A3 |
| --allow-auto-truncate | False | bool flag (set to enable) | A2, A3 |
| --enable-custom-logit-processor | False | bool flag (set to enable) | A2, A3 |
| --flashinfer-mla-disable-ragged | False | bool flag (set to enable) | Special for GPU |
| --disable-shared-experts-fusion | True | bool flag (set to enable) | A2, A3 |
| --disable-chunked-prefix-cache | True | bool flag (set to enable) | A2, A3 |
| --disable-fast-image-processor | False | bool flag (set to enable) | A2, A3 |
| --keep-mm-feature-on-device | False | bool flag (set to enable) | A2, A3 |
| --enable-return-hidden-states | False | bool flag (set to enable) | A2, A3 |
| --enable-return-routed-experts | False | bool flag (set to enable) | A2, A3 |
| --scheduler-recv-interval | 1 | Type: int | A2, A3 |
| --numa-node | None | List[int] | A2, A3 |
| --enable-deterministic-inference | False | bool flag (set to enable) | Planned |
| --rl-on-policy-target | None | fsdp | Planned |
| --enable-layerwise-nvtx-marker | False | bool flag (set to enable) | Special for GPU |
| --enable-attn-tp-input-scattered | False | bool flag (set to enable) | Experimental |
| --enable-nsa-prefill-context-parallel | False | bool flag (set to enable) | A2, A3 |
| --enable-fused-qk-norm-rope | False | bool flag (set to enable) | Special for GPU |

| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --enable-dynamic-batch-tokenizer | False | bool flag (set to enable) | A2, A3 |
| --dynamic-batch-tokenizer-batch-size | 32 | Type: int | A2, A3 |
| --dynamic-batch-tokenizer-batch-timeout | 0.002 | Type: float | A2, A3 |

| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --debug-tensor-dump-output-folder | None | Type: str | A2, A3 |
| --debug-tensor-dump-layers | None | List[int] | A2, A3 |
| --debug-tensor-dump-input-file | None | Type: str | A2, A3 |

| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --disaggregation-mode | null | null, prefill, decode | A2, A3 |
| --disaggregation-transfer-backend | mooncake | ascend | A2, A3 |
| --disaggregation-bootstrap-port | 8998 | Type: int | A2, A3 |
| --disaggregation-ib-device | None | Type: str | Special for GPU |
| --disaggregation-decode-enable-offload-kvcache | False | bool flag (set to enable) | A2, A3 |
| --num-reserved-decode-tokens | 512 | Type: int | A2, A3 |
| --disaggregation-decode-polling-interval | 1 | Type: int | A2, A3 |

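A prefill/decode disaggregation deployment using the flags above could be sketched as two launches, one per role (model paths and ports are placeholders):

```shell
# Sketch: one prefill instance and one decode instance,
# both using the ascend transfer backend.

# Prefill node
python -m sglang.launch_server \
  --model-path /path/to/model \
  --disaggregation-mode prefill \
  --disaggregation-transfer-backend ascend \
  --disaggregation-bootstrap-port 8998

# Decode node
python -m sglang.launch_server \
  --model-path /path/to/model \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend ascend
```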
| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --enable-adaptive-dispatch-to-encoder | False | bool flag (set to enable adaptive dispatch) | A2, A3 |
| --encoder-only | False | bool flag (set to launch an encoder-only server) | A2, A3 |
| --language-only | False | bool flag (set to load weights for the language model only) | A2, A3 |
| --encoder-transfer-backend | zmq_to_scheduler | zmq_to_scheduler, zmq_to_tokenizer, mooncake | A2, A3 |
| --encoder-urls | [] | List[str] (list of encoder server URLs) | A2, A3 |

| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --custom-weight-loader | None | List[str] | A2, A3 |
| --weight-loader-disable-mmap | False | bool flag (set to enable) | A2, A3 |
| --remote-instance-weight-loader-seed-instance-ip | None | Type: str | A2, A3 |
| --remote-instance-weight-loader-seed-instance-service-port | None | Type: int | A2, A3 |
| --remote-instance-weight-loader-send-weights-group-ports | None | Type: JSON list | A2, A3 |
| --remote-instance-weight-loader-backend | nccl | transfer_engine, nccl | A2, A3 |
| --remote-instance-weight-loader-start-seed-via-transfer-engine | False | bool flag (set to enable) | Special for GPU |

| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --enable-pdmux | False | bool flag (set to enable) | Special for GPU |
| --pdmux-config-path | None | Type: str | Special for GPU |
| --sm-group-num | 8 | Type: int | Special for GPU |

| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --enable-broadcast-mm-inputs-process | False | bool flag (set to enable) | A2, A3 |
| --mm-process-config | None | Type: JSON / Dict | A2, A3 |
| --mm-enable-dp-encoder | False | bool flag (set to enable) | A2, A3 |
| --limit-mm-data-per-request | None | Type: JSON / Dict | A2, A3 |

| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --decrypted-config-file | None | Type: str | A2, A3 |
| --decrypted-draft-config-file | None | Type: str | A2, A3 |
| --enable-prefix-mm-cache | False | bool flag (set to enable) | A2, A3 |

| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --forward-hooks | None | Type: JSON list | A2, A3 |

| Argument | Defaults | Options | Server supported |
|---|---|---|---|
| --config | None | Type: str | A2, A3 |

The following parameters are not supported because the third-party components they depend on (e.g., KTransformers, checkpoint-engine) are not compatible with the NPU.

| Argument | Defaults | Options |
|---|---|---|
| --checkpoint-engine-wait-weights-before-ready | False | bool flag (set to enable) |
| --kt-weight-path | None | Type: str |
| --kt-method | AMXINT4 | Type: str |
| --kt-cpuinfer | None | Type: int |
| --kt-threadpool-count | 2 | Type: int |
| --kt-num-gpu-experts | None | Type: int |
| --kt-max-deferred-experts-per-token | None | Type: int |

The following parameters have known functional deficiencies in the community implementation.

| Argument | Defaults | Options |
|---|---|---|
| --tool-server | None | Type: str |