docs_new/docs/advanced_features/server_arguments.mdx
This page lists the server arguments used on the command line to configure the behavior
and performance of the language model server during deployment. These arguments let you
customize key aspects of the server, including model selection, parallelism policies,
memory management, and optimization techniques.
You can list all arguments with python3 -m sglang.launch_server --help.
To use a configuration file, create a YAML file with your server arguments and specify it with --config. CLI arguments will override config file values.
```bash
# Create config.yaml
cat > config.yaml << EOF
model-path: meta-llama/Meta-Llama-3-8B-Instruct
host: 0.0.0.0
port: 30000
tensor-parallel-size: 2
enable-metrics: true
log-requests: true
EOF

# Launch server with config file
python -m sglang.launch_server --config config.yaml
```
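As a quick sketch of the precedence rule, a flag given on the command line overrides the same key in the config file; the port below is just an example value.

```bash
# --port on the CLI overrides "port: 30000" from config.yaml (example value)
python -m sglang.launch_server --config config.yaml --port 30010
```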
To enable multi-GPU tensor parallelism, add --tp 2. If it reports the error "peer access is not supported between these two devices", add --enable-p2p-check to the server launch command.
```bash
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2
```
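If you hit the peer-access error mentioned above, the same launch with the extra check enabled looks like this:

```bash
# Add --enable-p2p-check when peer access between GPUs is not supported
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2 --enable-p2p-check
```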
To enable multi-GPU data parallelism, add --dp 2. Data parallelism gives better throughput if there is enough memory, and it can be combined with tensor parallelism. The following command uses 4 GPUs in total. We recommend the SGLang Model Gateway (formerly SGLang Router) for data parallelism.
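The gateway ships separately from the core server; the package name below is an assumption, so check the Model Gateway documentation if the sglang_router module is not found.

```bash
# Install the gateway/router package (assumed PyPI name) that provides sglang_router
pip install sglang-router
```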
```bash
python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --dp 2 --tp 2
```
If you see out-of-memory errors during serving, try to reduce the memory usage of the KV cache pool by setting a smaller value of --mem-fraction-static. The default value is 0.9.
```bash
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --mem-fraction-static 0.7
```
See the hyperparameter tuning documentation for guidance on tuning these values for better performance.
For Docker and Kubernetes runs, you need to set up shared memory, which is used for communication between processes. Use --shm-size for Docker and increase the /dev/shm size in your Kubernetes manifests.
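A minimal Docker sketch, assuming the official lmsysorg/sglang image; adjust the shared-memory size and mounts to your setup:

```bash
# --shm-size provides the shared memory used for inter-process communication
docker run --gpus all --shm-size 32g -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --host 0.0.0.0 --port 30000
```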
If you see out-of-memory errors during prefill for long prompts, try to set a smaller chunked prefill size.
```bash
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
```
To enable fp8 weight quantization, add --quantization fp8 on an fp16 checkpoint, or directly load an fp8 checkpoint without specifying any extra arguments.
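For example, to quantize an fp16 checkpoint to fp8 at load time:

```bash
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --quantization fp8
```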
To enable fp8 kv cache quantization, add --kv-cache-dtype fp8_e4m3 or --kv-cache-dtype fp8_e5m2.
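For example, with the e4m3 format:

```bash
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --kv-cache-dtype fp8_e4m3
```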
To enable deterministic inference and batch-invariant operations, add --enable-deterministic-inference. More details can be found in the deterministic inference document.
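For example, using the same model as above:

```bash
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --enable-deterministic-inference
```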
If the model does not have a chat template in the Hugging Face tokenizer, you can specify a custom chat template. If the tokenizer has multiple named templates (e.g., 'default', 'tool_use'), you can select one using --hf-chat-template-name tool_use.
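For example, to select a named template from the tokenizer (the model here is a placeholder; pick one whose tokenizer actually defines a tool_use template):

```bash
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --hf-chat-template-name tool_use
```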
To run tensor parallelism on multiple nodes, add --nnodes 2. If you have two nodes with two GPUs each and want to run TP=4, let sgl-dev-0 be the hostname of the first node and 50000 be an available port; then you can use the following commands. If you encounter a deadlock, try adding --disable-cuda-graph.

```bash
# Node 0
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --tp 4 \
  --dist-init-addr sgl-dev-0:50000 \
  --nnodes 2 \
  --node-rank 0

# Node 1
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --tp 4 \
  --dist-init-addr sgl-dev-0:50000 \
  --nnodes 2 \
  --node-rank 1
```

To enable torch.compile acceleration, add --enable-torch-compile. (Note: this feature is out of maintenance and might cause errors.) It accelerates small models on small batch sizes. By default, the cache path is located at /tmp/torchinductor_root; you can customize it with the environment variable TORCHINDUCTOR_CACHE_DIR. For more details, please refer to the PyTorch official documentation and Enabling cache for torch.compile.
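For example, to launch with torch.compile enabled and an alternative inductor cache directory (the path is an arbitrary example):

```bash
# Redirect the torch.compile / inductor cache and enable compilation
TORCHINDUCTOR_CACHE_DIR=/path/to/inductor-cache \
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --enable-torch-compile
```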
Please consult the documentation below and server_args.py to learn more about the arguments you may provide when launching a server.