docs_new/docs/advanced_features/server_arguments.mdx
This page lists the server arguments used on the command line to configure the behavior
and performance of the language model server during deployment. These arguments let you
customize key aspects of the server, including model selection, parallelism policies,
memory management, and optimization techniques.
You can list all arguments with python3 -m sglang.launch_server --help.
To use a configuration file, create a YAML file with your server arguments and specify it with --config. CLI arguments will override config file values.
```bash
# Create config.yaml
cat > config.yaml << EOF
model-path: meta-llama/Meta-Llama-3-8B-Instruct
host: 0.0.0.0
port: 30000
tensor-parallel-size: 2
enable-metrics: true
log-requests: true
EOF

# Launch server with config file
python -m sglang.launch_server --config config.yaml
```
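As a quick sketch of the precedence rule, a flag given on the command line overrides the same key in the config file; the port below is just an example value.

```bash
# --port on the CLI overrides "port: 30000" from config.yaml (example value)
python -m sglang.launch_server --config config.yaml --port 30010
```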
To enable multi-GPU tensor parallelism, add --tp 2. If it reports the error "peer access is not supported between these two devices", add --enable-p2p-check to the server launch command.
```bash
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2
```
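If you hit the peer-access error mentioned above, the same launch with the extra check enabled looks like this:

```bash
# Add --enable-p2p-check when peer access between GPUs is not supported
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2 --enable-p2p-check
```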
To enable multi-GPU data parallelism, add --dp 2. Data parallelism gives better throughput if there is enough memory, and it can be combined with tensor parallelism. The following command uses 4 GPUs in total. We recommend the SGLang Model Gateway (formerly SGLang Router) for data parallelism.
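The gateway ships separately from the core server; the package name below is an assumption, so check the Model Gateway documentation if the sglang_router module is not found.

```bash
# Install the gateway/router package (assumed PyPI name) that provides sglang_router
pip install sglang-router
```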
```bash
python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --dp 2 --tp 2
```
If you see out-of-memory errors during serving, try to reduce the memory usage of the KV cache pool by setting a smaller value of --mem-fraction-static. The default value is 0.9.
```bash
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --mem-fraction-static 0.7
```
See the hyperparameter tuning documentation for guidance on tuning these values for better performance.
For Docker and Kubernetes runs, you need to set up shared memory, which is used for communication between processes. Use --shm-size for Docker and increase the /dev/shm size in your Kubernetes manifests.
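A minimal Docker sketch, assuming the official lmsysorg/sglang image; adjust the shared-memory size and mounts to your setup:

```bash
# --shm-size provides the shared memory used for inter-process communication
docker run --gpus all --shm-size 32g -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --host 0.0.0.0 --port 30000
```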
If you see out-of-memory errors during prefill for long prompts, try to set a smaller chunked prefill size.
```bash
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
```
To enable fp8 weight quantization, add --quantization fp8 on an fp16 checkpoint, or directly load an fp8 checkpoint without specifying any extra arguments.
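For example, to quantize an fp16 checkpoint to fp8 at load time:

```bash
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --quantization fp8
```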
To enable fp8 kv cache quantization, add --kv-cache-dtype fp8_e4m3 or --kv-cache-dtype fp8_e5m2.
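For example, with the e4m3 format:

```bash
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --kv-cache-dtype fp8_e4m3
```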
To enable deterministic inference and batch-invariant operations, add --enable-deterministic-inference. More details can be found in the deterministic inference document.
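For example, using the same model as above:

```bash
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --enable-deterministic-inference
```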
If the model does not have a chat template in the Hugging Face tokenizer, you can specify a custom chat template. If the tokenizer has multiple named templates (e.g., 'default', 'tool_use'), you can select one using --hf-chat-template-name tool_use.
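For example, to select a named template from the tokenizer (the model here is a placeholder; pick one whose tokenizer actually defines a tool_use template):

```bash
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --hf-chat-template-name tool_use
```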
To run tensor parallelism on multiple nodes, add --nnodes 2. If you have two nodes with two GPUs each and want to run TP=4, let sgl-dev-0 be the hostname of the first node and 50000 be an available port; then you can use the following commands. If you encounter a deadlock, try adding --disable-cuda-graph.

```bash
# Node 0
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --tp 4 \
  --dist-init-addr sgl-dev-0:50000 \
  --nnodes 2 \
  --node-rank 0

# Node 1
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --tp 4 \
  --dist-init-addr sgl-dev-0:50000 \
  --nnodes 2 \
  --node-rank 1
```

To enable torch.compile acceleration, add --enable-torch-compile. (Note: this feature is out of maintenance and might cause errors.) It accelerates small models on small batch sizes. By default, the cache path is located at /tmp/torchinductor_root; you can customize it with the environment variable TORCHINDUCTOR_CACHE_DIR. For more details, please refer to the PyTorch official documentation and Enabling cache for torch.compile.
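For example, to launch with torch.compile enabled and an alternative inductor cache directory (the path is an arbitrary example):

```bash
# Redirect the torch.compile / inductor cache and enable compilation
TORCHINDUCTOR_CACHE_DIR=/path/to/inductor-cache \
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --enable-torch-compile
```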
Please consult the documentation below and server_args.py to learn more about the arguments you may provide when launching a server.