# Support Features on Ascend NPU

This section describes the basic functions and features supported on the Ascend NPU. If you encounter issues or have any questions, please open an issue.

If you want to know the meaning and usage of each parameter, see Server Arguments.

## Model and tokenizer

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --model-path, --model | None | Type: str | A2, A3 |
| --tokenizer-path | None | Type: str | A2, A3 |
| --tokenizer-mode | auto | auto, slow | A2, A3 |
| --tokenizer-worker-num | 1 | Type: int | A2, A3 |
| --skip-tokenizer-init | False | bool flag (set to enable) | A2, A3 |
| --load-format | auto | auto, safetensors | A2, A3 |
| --model-loader-extra-config | {} | Type: str | A2, A3 |
| --trust-remote-code | False | bool flag (set to enable) | A2, A3 |
| --context-length | None | Type: int | A2, A3 |
| --is-embedding | False | bool flag (set to enable) | A2, A3 |
| --enable-multimodal | None | bool flag (set to enable) | A2, A3 |
| --revision | None | Type: str | A2, A3 |
| --model-impl | auto | auto, sglang, transformers | A2, A3 |

## HTTP server

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --host | 127.0.0.1 | Type: str | A2, A3 |
| --port | 30000 | Type: int | A2, A3 |
| --skip-server-warmup | False | bool flag (set to enable) | A2, A3 |
| --warmups | None | Type: str | A2, A3 |
| --nccl-port | None | Type: int | A2, A3 |
| --fastapi-root-path | None | Type: str | A2, A3 |
| --grpc-mode | False | False | Planned |
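
For context, a minimal Ascend launch combining the model and HTTP arguments above might look like the sketch below; the checkpoint path is hypothetical and should be replaced with your own.

```bash
# Minimal launch sketch: local checkpoint, listen on all interfaces.
python -m sglang.launch_server \
  --model-path /models/Qwen2.5-7B-Instruct \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 30000
```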

## Quantization and data type

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --dtype | auto | auto, float16, bfloat16 | A2, A3 |
| --quantization | None | modelslim | A2, A3 |
| --quantization-param-path | None | Type: str | Special for GPU |
| --kv-cache-dtype | auto | auto | A2, A3 |
| --enable-fp32-lm-head | False | bool flag (set to enable) | A2, A3 |
| --modelopt-quant | None | Type: str | Special for GPU |
| --modelopt-checkpoint-restore-path | None | Type: str | Special for GPU |
| --modelopt-checkpoint-save-path | None | Type: str | Special for GPU |
| --modelopt-export-path | None | Type: str | Special for GPU |
| --quantize-and-serve | False | bool flag (set to enable) | Special for GPU |
| --rl-quant-profile | None | Type: str | Special for GPU |
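
As a hedged illustration of the data-type and quantization arguments, the sketches below run a model in bfloat16 and serve a modelslim-quantized checkpoint; both checkpoint paths are hypothetical.

```bash
# Force bfloat16 weights/activations (hypothetical path).
python -m sglang.launch_server \
  --model-path /models/Qwen2.5-7B-Instruct \
  --dtype bfloat16

# Serve a checkpoint quantized with modelslim (hypothetical path).
python -m sglang.launch_server \
  --model-path /models/Qwen2.5-7B-Instruct-W8A8 \
  --quantization modelslim
```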

## Memory and scheduling

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --mem-fraction-static | None | Type: float | A2, A3 |
| --max-running-requests | None | Type: int | A2, A3 |
| --prefill-max-requests | None | Type: int | A2, A3 |
| --max-queued-requests | None | Type: int | A2, A3 |
| --max-total-tokens | None | Type: int | A2, A3 |
| --chunked-prefill-size | None | Type: int | A2, A3 |
| --max-prefill-tokens | 16384 | Type: int | A2, A3 |
| --schedule-policy | fcfs | lpm, fcfs | A2, A3 |
| --enable-priority-scheduling | False | bool flag (set to enable) | A2, A3 |
| --schedule-low-priority-values-first | False | bool flag (set to enable) | A2, A3 |
| --priority-scheduling-preemption-threshold | 10 | Type: int | A2, A3 |
| --schedule-conservativeness | 1.0 | Type: float | A2, A3 |
| --page-size | 128 | Type: int | A2, A3 |
| --swa-full-tokens-ratio | 0.8 | Type: float | Planned |
| --disable-hybrid-swa-memory | False | bool flag (set to enable) | Planned |
| --radix-eviction-policy | lru | lru, lfu | A2, A3 |
| --enable-prefill-delayer | False | bool flag (set to enable) | A2, A3 |
| --prefill-delayer-max-delay-passes | 30 | Type: int | A2, A3 |
| --prefill-delayer-token-usage-low-watermark | None | Type: float | A2, A3 |
| --prefill-delayer-forward-passes-buckets | None | List[float] | A2, A3 |
| --prefill-delayer-wait-seconds-buckets | None | List[float] | A2, A3 |
| --abort-on-priority-when-disabled | False | bool flag (set to enable) | A2, A3 |
| --enable-dynamic-chunking | False | bool flag (set to enable) | Experimental |
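
The memory and scheduling knobs are typically tuned together; the sketch below uses illustrative values rather than recommendations, and the model path is hypothetical.

```bash
# Cap static KV-cache memory and bound concurrency (values are illustrative).
python -m sglang.launch_server \
  --model-path /models/Qwen2.5-7B-Instruct \
  --mem-fraction-static 0.85 \
  --max-running-requests 128 \
  --chunked-prefill-size 8192 \
  --page-size 128
```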

## Runtime options

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --device | None | Type: str | A2, A3 |
| --tensor-parallel-size, --tp-size | 1 | Type: int | A2, A3 |
| --pipeline-parallel-size, --pp-size | 1 | Type: int; currently 2 is not supported | Experimental |
| --attention-context-parallel-size, --attn-cp-size | 1 | Type: int; must be equal to --tp-size | A2, A3 |
| --moe-data-parallel-size, --moe-dp-size | 1 | Type: int | Planned |
| --pp-max-micro-batch-size | None | Type: int | Experimental |
| --pp-async-batch-depth | None | Type: int | Experimental |
| --stream-interval | 1 | Type: int | A2, A3 |
| --incremental-streaming-output | False | bool flag (set to enable) | A2, A3 |
| --random-seed | None | Type: int | A2, A3 |
| --constrained-json-whitespace-pattern | None | Type: str | A2, A3 |
| --constrained-json-disable-any-whitespace | False | bool flag (set to enable) | A2, A3 |
| --watchdog-timeout | 300 | Type: float | A2, A3 |
| --soft-watchdog-timeout | 300 | Type: float | A2, A3 |
| --dist-timeout | None | Type: int | A2, A3 |
| --download-dir | None | Type: str | A2, A3 |
| --model-checksum | None | Type: str | Planned |
| --base-gpu-id | 0 | Type: int | A2, A3 |
| --gpu-id-step | 1 | Type: int | A2, A3 |
| --sleep-on-idle | False | bool flag (set to enable) | A2, A3 |
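
For example, sharding a larger model across four NPUs with tensor parallelism might look like this sketch (the model path is hypothetical):

```bash
# Tensor parallelism over 4 NPUs.
python -m sglang.launch_server \
  --model-path /models/Qwen2.5-72B-Instruct \
  --tp-size 4
```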

## Logging

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --log-level | info | Type: str | A2, A3 |
| --log-level-http | None | Type: str | A2, A3 |
| --log-requests | False | bool flag (set to enable) | A2, A3 |
| --log-requests-level | 2 | 0, 1, 2, 3 | A2, A3 |
| --log-requests-format | text | text, json | A2, A3 |
| --crash-dump-folder | None | Type: str | A2, A3 |
| --enable-metrics | False | bool flag (set to enable) | A2, A3 |
| --enable-metrics-for-all-schedulers | False | bool flag (set to enable) | A2, A3 |
| --tokenizer-metrics-custom-labels-header | x-custom-labels | Type: str | A2, A3 |
| --tokenizer-metrics-allowed-custom-labels | None | List[str] | A2, A3 |
| --bucket-time-to-first-token | None | List[float] | A2, A3 |
| --bucket-inter-token-latency | None | List[float] | A2, A3 |
| --bucket-e2e-request-latency | None | List[float] | A2, A3 |
| --collect-tokens-histogram | False | bool flag (set to enable) | A2, A3 |
| --prompt-tokens-buckets | None | List[str] | A2, A3 |
| --generation-tokens-buckets | None | List[str] | A2, A3 |
| --gc-warning-threshold-secs | 0.0 | Type: float | A2, A3 |
| --decode-log-interval | 40 | Type: int | A2, A3 |
| --enable-request-time-stats-logging | False | bool flag (set to enable) | A2, A3 |
| --kv-events-config | None | Type: str | Special for GPU |
| --enable-trace | False | bool flag (set to enable) | A2, A3 |
| --oltp-traces-endpoint | localhost:4317 | Type: str | A2, A3 |
| --log-requests-target | None | Type: str | A2, A3 |
| --uvicorn-access-log-exclude-prefixes | [] | List[str] | A2, A3 |
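
A hedged example of turning on the observability features above (model path hypothetical):

```bash
# Expose Prometheus metrics and log requests in JSON format.
python -m sglang.launch_server \
  --model-path /models/Qwen2.5-7B-Instruct \
  --enable-metrics \
  --log-requests \
  --log-requests-format json
```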

## RequestMetricsExporter configuration

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --export-metrics-to-file | False | bool flag (set to enable) | A2, A3 |
| --export-metrics-to-file-dir | None | Type: str | A2, A3 |

## API configuration

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --api-key | None | Type: str | A2, A3 |
| --admin-api-key | None | Type: str | A2, A3 |
| --served-model-name | None | Type: str | A2, A3 |
| --weight-version | default | Type: str | A2, A3 |
| --chat-template | None | Type: str | A2, A3 |
| --hf-chat-template-name | None | Type: str | A2, A3 |
| --completion-template | None | Type: str | A2, A3 |
| --enable-cache-report | False | bool flag (set to enable) | A2, A3 |
| --reasoning-parser | None | deepseek-r1, deepseek-v3, glm45, gpt-oss, kimi, qwen3, qwen3-thinking, step3 | A2, A3 |
| --tool-call-parser | None | llama3, pythonic, qwen, qwen3_coder | A2, A3 |
| --sampling-defaults | model | openai, model | A2, A3 |

## Data parallelism

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --data-parallel-size, --dp-size | 1 | Type: int | A2, A3 |
| --load-balance-method | auto | auto, round_robin, follow_bootstrap_room, total_requests, total_tokens | A2, A3 |
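
For instance, two data-parallel replicas with round-robin load balancing could be launched as in the sketch below (model path hypothetical):

```bash
# Two DP replicas, round-robin dispatch.
python -m sglang.launch_server \
  --model-path /models/Qwen2.5-7B-Instruct \
  --dp-size 2 \
  --load-balance-method round_robin
```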

## Multi-node distributed serving

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --dist-init-addr, --nccl-init-addr | None | Type: str | A2, A3 |
| --nnodes | 1 | Type: int | A2, A3 |
| --node-rank | 0 | Type: int | A2, A3 |
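
A two-node sketch is shown below; the rendezvous address, model path, and tensor-parallel size are hypothetical and must match your cluster.

```bash
# Node 0 (rendezvous address 10.0.0.1:20000 is hypothetical).
python -m sglang.launch_server \
  --model-path /models/DeepSeek-V3 \
  --tp-size 16 --nnodes 2 --node-rank 0 \
  --dist-init-addr 10.0.0.1:20000

# Node 1 (same command, different rank).
python -m sglang.launch_server \
  --model-path /models/DeepSeek-V3 \
  --tp-size 16 --nnodes 2 --node-rank 1 \
  --dist-init-addr 10.0.0.1:20000
```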

## Model override args

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --json-model-override-args | {} | Type: str | A2, A3 |
| --preferred-sampling-params | None | Type: str | A2, A3 |

## LoRA

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --enable-lora | False | bool flag (set to enable) | A2, A3 |
| --enable-lora-overlap-loading | False | bool flag (set to enable) | A2, A3 |
| --max-lora-rank | None | Type: int | A2, A3 |
| --lora-target-modules | None | all | A2, A3 |
| --lora-paths | None | Type: List[str] / JSON objects | A2, A3 |
| --max-loras-per-batch | 8 | Type: int | A2, A3 |
| --max-loaded-loras | None | Type: int | A2, A3 |
| --lora-eviction-policy | lru | lru, fifo | A2, A3 |
| --lora-backend | csgmv | triton, csgmv, ascend, torch_native | A2, A3 |
| --max-lora-chunk-size | 16 | 16, 32, 64, 128 | Special for GPU |
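
A hedged LoRA sketch using the Ascend backend; the adapter name/path pair is hypothetical and the name=path form is only one of the accepted ways to pass --lora-paths.

```bash
# Serve a base model with one LoRA adapter on the ascend LoRA backend.
python -m sglang.launch_server \
  --model-path /models/Qwen2.5-7B-Instruct \
  --enable-lora \
  --lora-paths my_adapter=/adapters/my_adapter \
  --max-loras-per-batch 8 \
  --lora-backend ascend
```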

## Kernel Backends (Attention, Sampling, Grammar, GEMM)

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --attention-backend | None | ascend | A2, A3 |
| --prefill-attention-backend | None | ascend | A2, A3 |
| --decode-attention-backend | None | ascend | A2, A3 |
| --sampling-backend | None | pytorch, ascend | A2, A3 |
| --grammar-backend | None | xgrammar | A2, A3 |
| --mm-attention-backend | None | ascend_attn | A2, A3 |
| --nsa-prefill-backend | flashmla_sparse | flashmla_sparse, flashmla_decode, fa3, tilelang, aiter | Special for GPU |
| --nsa-decode-backend | fa3 | flashmla_prefill, flashmla_kv, fa3, tilelang, aiter | Special for GPU |
| --fp8-gemm-backend | auto | auto, deep_gemm, flashinfer_trtllm, flashinfer_cutlass, flashinfer_deepgemm, cutlass, triton, aiter | Special for GPU |
| --disable-flashinfer-autotune | False | bool flag (set to enable) | Special for GPU |
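
Backends are usually selected automatically, but they can be pinned explicitly; a sketch using the Ascend options from the table above (model path hypothetical):

```bash
# Pin the NPU kernel backends explicitly.
python -m sglang.launch_server \
  --model-path /models/Qwen2.5-7B-Instruct \
  --attention-backend ascend \
  --sampling-backend ascend \
  --grammar-backend xgrammar
```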

## Speculative decoding

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --speculative-algorithm | None | EAGLE3, NEXTN | A2, A3 |
| --speculative-draft-model-path, --speculative-draft-model | None | Type: str | A2, A3 |
| --speculative-draft-model-revision | None | Type: str; branch name, tag name, or commit id | A2, A3 |
| --speculative-draft-load-format | auto | auto, dummy | A2, A3 |
| --speculative-num-steps | None | Type: int | A2, A3 |
| --speculative-eagle-topk | None | Type: int | A2, A3 |
| --speculative-num-draft-tokens | None | Type: int | A2, A3 |
| --speculative-accept-threshold-single | 1.0 | Type: float | Special for GPU |
| --speculative-accept-threshold-acc | 1.0 | Type: float | Special for GPU |
| --speculative-token-map | None | Type: str | A2, A3 |
| --speculative-attention-mode | prefill | prefill, decode | A2, A3 |
| --speculative-moe-runner-backend | None | auto | A2, A3 |
| --speculative-moe-a2a-backend | None | ascend_fuseep | A2, A3 |
| --speculative-draft-attention-backend | None | ascend | A2, A3 |
| --speculative-draft-model-quantization | None | unquant | A2, A3 |
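
A hedged EAGLE3 sketch; the draft-model path and the step/top-k/draft-token values are illustrative, not tuned recommendations.

```bash
# EAGLE3 speculative decoding with a separate draft model (hypothetical paths).
python -m sglang.launch_server \
  --model-path /models/Llama-3.1-8B-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path /models/llama3.1-8b-eagle3-draft \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 4 \
  --speculative-num-draft-tokens 8
```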

## Ngram speculative decoding

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --speculative-ngram-min-match-window-size | 1 | Type: int | Experimental |
| --speculative-ngram-max-match-window-size | 12 | Type: int | Experimental |
| --speculative-ngram-min-bfs-breadth | 1 | Type: int | Experimental |
| --speculative-ngram-max-bfs-breadth | 10 | Type: int | Experimental |
| --speculative-ngram-match-type | BFS | BFS, PROB | Experimental. BFS uses recency-based expansion; PROB uses frequency-based expansion. |
| --speculative-ngram-max-trie-depth | 18 | Type: int | Experimental |
| --speculative-ngram-capacity | 10000000 | Type: int | Experimental |

## Expert parallelism

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --expert-parallel-size, --ep-size, --ep | 1 | Type: int | A2, A3 |
| --moe-a2a-backend | none | none, deepep, ascend_fuseep (incompatible with EPLB) | A2, A3 |
| --moe-runner-backend | auto | auto, triton | A2, A3 |
| --flashinfer-mxfp4-moe-precision | default | default, bf16 | Special for GPU |
| --enable-flashinfer-allreduce-fusion | False | bool flag (set to enable) | Special for GPU |
| --deepep-mode | auto | normal, low_latency, auto | A2, A3 |
| --deepep-config | None | Type: str | Special for GPU |
| --ep-num-redundant-experts | 0 | Type: int | A2, A3 |
| --ep-dispatch-algorithm | None | static, dynamic, fake | A2, A3 |
| --init-expert-location | trivial | trivial, <path.pt>, <path.json>, <json_string> | A2, A3 |
| --enable-eplb | False | bool flag (set to enable) | A2, A3 |
| --eplb-algorithm | deepseek | auto, deepseek | A2, A3 |
| --eplb-rebalance-num-iterations | 1000 | Type: int | A2, A3 |
| --eplb-rebalance-layers-per-chunk | None | Type: int | A2, A3 |
| --eplb-min-rebalancing-utilization-threshold | 1.0 | Type: float | A2, A3 |
| --expert-distribution-recorder-mode | None | stat, stat_approx, per_pass, per_token | A2, A3 |
| --expert-distribution-recorder-buffer-size | None | Type: int | A2, A3 |
| --enable-expert-distribution-metrics | False | bool flag (set to enable) | A2, A3 |
| --moe-dense-tp-size | None | 1 | A2, A3 |
| --elastic-ep-backend | None | none, mooncake | Special for GPU |
| --mooncake-ib-device | None | Type: str | Special for GPU |
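
A hedged MoE sketch combining tensor and expert parallelism with the DeepEP all-to-all backend; the model path and world size are hypothetical.

```bash
# Expert parallelism over 16 NPUs with DeepEP.
python -m sglang.launch_server \
  --model-path /models/DeepSeek-V3 \
  --tp-size 16 \
  --ep-size 16 \
  --moe-a2a-backend deepep \
  --deepep-mode auto
```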

## Mamba Cache

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --max-mamba-cache-size | None | Type: int | A2, A3 |
| --mamba-ssm-dtype | float32 | float32, bfloat16, float16 | A2, A3 |
| --mamba-full-memory-ratio | 0.9 | Type: float | A2, A3 |
| --mamba-scheduler-strategy | auto | auto, no_buffer, extra_buffer | A2, A3 |
| --mamba-track-interval | 256 | Type: int | A2, A3 |

## Hierarchical cache

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --enable-hierarchical-cache | False | bool flag (set to enable). Currently, mamba cache is not supported. | A2, A3 |
| --hicache-ratio | 2.0 | Type: float | A2, A3 |
| --hicache-size | 0 | Type: int | A2, A3 |
| --hicache-write-policy | write_through | Currently only write_back is supported | A2, A3 |
| --hicache-io-backend | kernel | kernel_ascend, direct | A2, A3 |
| --hicache-mem-layout | layer_first | page_first_direct, page_first_kv_split | A2, A3 |
| --hicache-storage-backend | None | file | A2, A3 |
| --hicache-storage-prefetch-policy | best_effort | best_effort, wait_complete, timeout | Special for GPU |
| --hicache-storage-backend-extra-config | None | Type: str | Special for GPU |
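
A hedged hierarchical-cache sketch using the Ascend IO backend; the model path and ratio are illustrative.

```bash
# Spill KV cache to host memory at 2x the device pool size.
python -m sglang.launch_server \
  --model-path /models/Qwen2.5-7B-Instruct \
  --enable-hierarchical-cache \
  --hicache-ratio 2.0 \
  --hicache-io-backend kernel_ascend
```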

## LMCache

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --enable-lmcache | False | bool flag (set to enable) | Special for GPU |

## Offloading (must be used with --disable-cuda-graph)

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --cpu-offload-gb | 0 | Type: int | A2, A3 |
| --offload-group-size | -1 | Type: int (DeepSeek only) | A2, A3 |
| --offload-num-in-group | 1 | Type: int (DeepSeek only) | A2, A3 |
| --offload-prefetch-step | 1 | Type: int (DeepSeek only) | A2, A3 |
| --offload-mode | cpu | cpu (DeepSeek only), meta (DeepSeek only), sharded_gpu (DeepSeek only; only supports tp=1, dp>1) | A2, A3 |
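
As the section title notes, offloading must be combined with disabling graph capture; a sketch with an illustrative offload budget and a hypothetical model path:

```bash
# Offload 16 GB of weights to CPU; graph capture must be disabled.
python -m sglang.launch_server \
  --model-path /models/DeepSeek-V3 \
  --cpu-offload-gb 16 \
  --disable-cuda-graph
```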

## Args for multi-item scoring

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --multi-item-scoring-delimiter | None | Type: int | A2, A3 |

## Optimization/debug options

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --disable-radix-cache | False | bool flag (set to enable) | A2, A3 |
| --cuda-graph-max-bs | None | Type: int | A2, A3 |
| --cuda-graph-bs | None | List[int] | A2, A3 |
| --disable-cuda-graph | False | bool flag (set to enable) | A2, A3 |
| --disable-cuda-graph-padding | False | bool flag (set to enable) | A2, A3 |
| --enable-profile-cuda-graph | False | bool flag (set to enable) | A2, A3 |
| --enable-cudagraph-gc | False | bool flag (set to enable) | A2, A3 |
| --enable-nccl-nvls | False | bool flag (set to enable) | Special for GPU |
| --enable-symm-mem | False | bool flag (set to enable) | Special for GPU |
| --disable-flashinfer-cutlass-moe-fp4-allgather | False | bool flag (set to enable) | Special for GPU |
| --enable-tokenizer-batch-encode | False | bool flag (set to enable) | A2, A3 |
| --disable-tokenizer-batch-decode | False | bool flag (set to enable) | A2, A3 |
| --disable-custom-all-reduce | False | bool flag (set to enable) | Special for GPU |
| --enable-mscclpp | False | bool flag (set to enable) | Special for GPU |
| --enable-torch-symm-mem | False | bool flag (set to enable) | Special for GPU |
| --disable-overlap-schedule | False | bool flag (set to enable) | A2, A3 |
| --enable-mixed-chunk | False | bool flag (set to enable) | A2, A3 |
| --enable-dp-attention | False | bool flag (set to enable) | A2, A3 |
| --enable-dp-lm-head | False | bool flag (set to enable) | A2, A3 |
| --enable-two-batch-overlap | False | bool flag (set to enable) | Planned |
| --enable-single-batch-overlap | False | bool flag (set to enable) | A2, A3 |
| --tbo-token-distribution-threshold | 0.48 | Type: float | Planned |
| --enable-torch-compile | False | bool flag (set to enable) | A2, A3 |
| --enable-torch-compile-debug-mode | False | bool flag (set to enable) | A2, A3 |
| --enforce-piecewise-cuda-graph | False | bool flag (set to enable); currently, Llama-3.1-8B-Instruct and Qwen2.5-7B-Instruct models are supported | A2, A3 |
| --piecewise-cuda-graph-tokens | None | Type: JSON list | A2, A3 |
| --piecewise-cuda-graph-compiler | eager | eager | A2, A3 |
| --torch-compile-max-bs | 32 | Type: int | A2, A3 |
| --piecewise-cuda-graph-max-tokens | None | Type: int | A2, A3 |
| --torchao-config | "" | Type: str | Special for GPU |
| --enable-nan-detection | False | bool flag (set to enable) | A2, A3 |
| --enable-p2p-check | False | bool flag (set to enable) | Special for GPU |
| --triton-attention-reduce-in-fp32 | False | bool flag (set to enable) | Special for GPU |
| --triton-attention-num-kv-splits | 8 | Type: int | Special for GPU |
| --triton-attention-split-tile-size | None | Type: int | Special for GPU |
| --delete-ckpt-after-loading | False | bool flag (set to enable) | A2, A3 |
| --enable-memory-saver | False | bool flag (set to enable) | A2, A3 |
| --enable-weights-cpu-backup | False | bool flag (set to enable) | A2, A3 |
| --enable-draft-weights-cpu-backup | False | bool flag (set to enable) | A2, A3 |
| --allow-auto-truncate | False | bool flag (set to enable) | A2, A3 |
| --enable-custom-logit-processor | False | bool flag (set to enable) | A2, A3 |
| --flashinfer-mla-disable-ragged | False | bool flag (set to enable) | Special for GPU |
| --disable-shared-experts-fusion | True | bool flag (set to enable) | A2, A3 |
| --disable-chunked-prefix-cache | True | bool flag (set to enable) | A2, A3 |
| --disable-fast-image-processor | False | bool flag (set to enable) | A2, A3 |
| --keep-mm-feature-on-device | False | bool flag (set to enable) | A2, A3 |
| --enable-return-hidden-states | False | bool flag (set to enable) | A2, A3 |
| --enable-return-routed-experts | False | bool flag (set to enable) | A2, A3 |
| --scheduler-recv-interval | 1 | Type: int | A2, A3 |
| --numa-node | None | List[int] | A2, A3 |
| --enable-deterministic-inference | False | bool flag (set to enable) | Planned |
| --rl-on-policy-target | None | fsdp | Planned |
| --enable-layerwise-nvtx-marker | False | bool flag (set to enable) | Special for GPU |
| --enable-attn-tp-input-scattered | False | bool flag (set to enable) | Experimental |
| --enable-nsa-prefill-context-parallel | False | bool flag (set to enable) | A2, A3 |
| --enable-fused-qk-norm-rope | False | bool flag (set to enable) | Special for GPU |
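
A hedged optimization sketch combining torch.compile with mixed chunked prefill; the batch-size cap is illustrative and the model path hypothetical.

```bash
# Enable torch.compile and mixed chunked prefill.
python -m sglang.launch_server \
  --model-path /models/Qwen2.5-7B-Instruct \
  --enable-torch-compile \
  --torch-compile-max-bs 32 \
  --enable-mixed-chunk
```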

## Dynamic batch tokenizer

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --enable-dynamic-batch-tokenizer | False | bool flag (set to enable) | A2, A3 |
| --dynamic-batch-tokenizer-batch-size | 32 | Type: int | A2, A3 |
| --dynamic-batch-tokenizer-batch-timeout | 0.002 | Type: float | A2, A3 |

## Debug tensor dumps

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --debug-tensor-dump-output-folder | None | Type: str | A2, A3 |
| --debug-tensor-dump-layers | None | List[int] | A2, A3 |
| --debug-tensor-dump-input-file | None | Type: str | A2, A3 |

## PD disaggregation

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --disaggregation-mode | null | null, prefill, decode | A2, A3 |
| --disaggregation-transfer-backend | mooncake | ascend | A2, A3 |
| --disaggregation-bootstrap-port | 8998 | Type: int | A2, A3 |
| --disaggregation-ib-device | None | Type: str | Special for GPU |
| --disaggregation-decode-enable-offload-kvcache | False | False | A2, A3 |
| --num-reserved-decode-tokens | 512 | Type: int | A2, A3 |
| --disaggregation-decode-polling-interval | 1 | Type: int | A2, A3 |
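
A hedged prefill/decode disaggregation sketch using the Ascend transfer backend; hosts, ports, and the model path are hypothetical, the two instances must agree on the bootstrap port, and any additional routing/load-balancer setup is omitted here.

```bash
# Prefill instance.
python -m sglang.launch_server \
  --model-path /models/DeepSeek-V3 \
  --disaggregation-mode prefill \
  --disaggregation-transfer-backend ascend \
  --disaggregation-bootstrap-port 8998

# Decode instance.
python -m sglang.launch_server \
  --model-path /models/DeepSeek-V3 \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend ascend \
  --disaggregation-bootstrap-port 8998
```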

## Encode prefill disaggregation

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --enable-adaptive-dispatch-to-encoder | False | bool flag (set to enable adaptive dispatch) | A2, A3 |
| --encoder-only | False | bool flag (set to launch an encoder-only server) | A2, A3 |
| --language-only | False | bool flag (set to load weights for the language model only) | A2, A3 |
| --encoder-transfer-backend | zmq_to_scheduler | zmq_to_scheduler, zmq_to_tokenizer, mooncake | A2, A3 |
| --encoder-urls | [] | List[str] (list of encoder server URLs) | A2, A3 |
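
A hedged encoder/language split sketch; the model path, port, and encoder URL are hypothetical.

```bash
# Encoder-only server.
python -m sglang.launch_server \
  --model-path /models/Qwen2.5-VL-7B-Instruct \
  --encoder-only \
  --port 30001

# Language-only server dispatching vision encoding to the encoder above.
python -m sglang.launch_server \
  --model-path /models/Qwen2.5-VL-7B-Instruct \
  --language-only \
  --encoder-urls http://127.0.0.1:30001 \
  --encoder-transfer-backend zmq_to_scheduler
```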

## Custom weight loader

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --custom-weight-loader | None | List[str] | A2, A3 |
| --weight-loader-disable-mmap | False | bool flag (set to enable) | A2, A3 |
| --remote-instance-weight-loader-seed-instance-ip | None | Type: str | A2, A3 |
| --remote-instance-weight-loader-seed-instance-service-port | None | Type: int | A2, A3 |
| --remote-instance-weight-loader-send-weights-group-ports | None | Type: JSON list | A2, A3 |
| --remote-instance-weight-loader-backend | nccl | transfer_engine, nccl | A2, A3 |
| --remote-instance-weight-loader-start-seed-via-transfer-engine | False | bool flag (set to enable) | Special for GPU |

## For PD-Multiplexing

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --enable-pdmux | False | bool flag (set to enable) | Special for GPU |
| --pdmux-config-path | None | Type: str | Special for GPU |
| --sm-group-num | 8 | Type: int | Special for GPU |

## For Multi-Modal

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --enable-broadcast-mm-inputs-process | False | bool flag (set to enable) | A2, A3 |
| --mm-process-config | None | Type: JSON / Dict | A2, A3 |
| --mm-enable-dp-encoder | False | bool flag (set to enable) | A2, A3 |
| --limit-mm-data-per-request | None | Type: JSON / Dict | A2, A3 |

## For checkpoint decryption

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --decrypted-config-file | None | Type: str | A2, A3 |
| --decrypted-draft-config-file | None | Type: str | A2, A3 |
| --enable-prefix-mm-cache | False | bool flag (set to enable) | A2, A3 |

## Forward hooks

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --forward-hooks | None | Type: JSON list | A2, A3 |

## Configuration file support

| Argument | Defaults | Options | Server supported |
| --- | --- | --- | --- |
| --config | None | Type: str | A2, A3 |

## Other Params

The following parameters are not supported because the third-party components they depend on (for example, KTransformers and checkpoint-engine) are not compatible with the NPU.

| Argument | Defaults | Options |
| --- | --- | --- |
| --checkpoint-engine-wait-weights-before-ready | False | bool flag (set to enable) |
| --kt-weight-path | None | Type: str |
| --kt-method | AMXINT4 | Type: str |
| --kt-cpuinfer | None | Type: int |
| --kt-threadpool-count | 2 | Type: int |
| --kt-num-gpu-experts | None | Type: int |
| --kt-max-deferred-experts-per-token | None | Type: int |

The following parameters have known functional deficiencies in the upstream community implementation.

| Argument | Defaults | Options |
| --- | --- | --- |
| --tool-server | None | Type: str |