docs_new/docs/advanced_features/pd_disaggregation.mdx
Large Language Model (LLM) inference comprises two distinct phases: Prefill and Decode. The Prefill phase is computation-intensive, processing the entire input sequence, while the Decode phase is memory-intensive, managing the Key-Value (KV) cache for token generation. Traditionally, these phases are handled within a unified engine, where combined scheduling of prefill and decode batches introduces inefficiencies. To address these challenges, we introduce Prefill and Decoding (PD) Disaggregation in SGLang.
The conventional unified engine, which processes prefill and decode batches together, results in two significant problems:
PD Disaggregation resolves these by separating the two stages, enabling tailored optimizations for each.
For the design details, please refer to link.
Currently, we support Mooncake and NIXL as the transfer engine.
When you need to profile prefill or decode workers in PD disaggregation mode, please refer to the Profile In PD Disaggregation Mode section in the Benchmark and Profiling guide. Due to torch profiler limitations, prefill and decode workers must be profiled separately using dedicated command-line options.
For deploying PD disaggregation at scale with load balancing and fault tolerance, SGLang provides a router. The router distributes requests between prefill and decode instances using configurable routing policies. For details on setting up routing with PD disaggregation, including configuration options and deployment patterns, see the SGLang Model Gateway (formerly Router).
uv pip install mooncake-transfer-engine
--disaggregation-ib-device supports the following formats when using the Mooncake backend:
Shared device list: mlx5_0 or mlx5_0,mlx5_1
JSON mapping: {"0": "mlx5_0,mlx5_1", "1": "mlx5_2,mlx5_3"}. Each JSON value uses the same comma-separated device list format as the shared configuration.
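As a minimal sketch of the JSON form, the Python snippet below builds a mapping string that could be passed to --disaggregation-ib-device. The helper name build_ib_device_map is hypothetical, and reading the keys "0" and "1" as GPU indices is an assumption drawn from the example above, not a statement about how SGLang parses the value.

```python
import json
import shlex


def build_ib_device_map(devices_per_gpu):
    """Return a JSON string mapping each key to a comma-separated device list.

    devices_per_gpu: dict of int -> list of device names (hypothetical input shape).
    """
    mapping = {
        str(gpu): ",".join(devs) for gpu, devs in sorted(devices_per_gpu.items())
    }
    return json.dumps(mapping)


ib_map = build_ib_device_map({0: ["mlx5_0", "mlx5_1"], 1: ["mlx5_2", "mlx5_3"]})
# shlex.quote keeps the JSON intact when it is pasted into a shell command line.
print("--disaggregation-ib-device", shlex.quote(ib_map))
```

Quoting matters here because the JSON contains double quotes and spaces that a shell would otherwise split or strip.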
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--disaggregation-mode prefill \
--port 30000 \
--disaggregation-ib-device mlx5_roce0
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--disaggregation-mode decode \
--port 30001 \
--base-gpu-id 1 \
--disaggregation-ib-device mlx5_roce0
python -m sglang_router.launch_router --pd-disaggregation --prefill http://127.0.0.1:30000 --decode http://127.0.0.1:30001 --host 0.0.0.0 --port 8000
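Once the router is running, clients send requests to it rather than to the prefill or decode servers directly. The sketch below builds a request for SGLang's native /generate endpoint at the router address from the command above. The prompt, the sampling parameters, and the helper names are illustrative assumptions, and nothing is sent over the network unless you call send().

```python
import json
import urllib.request

ROUTER_URL = "http://127.0.0.1:8000"  # host/port from the launch_router command above


def build_generate_request(prompt, max_new_tokens=32):
    """Build (but do not send) a POST request for the /generate endpoint."""
    payload = {
        "text": prompt,
        "sampling_params": {"temperature": 0, "max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        ROUTER_URL + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON response; needs the servers running."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


req = build_generate_request("The capital of France is")
print(req.get_method(), req.full_url)
```

With the prefill server, decode server, and router all up, send(req) would return the decoded JSON response.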
# prefill 0
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3-0324 \
--disaggregation-ib-device ${device_name} \
--disaggregation-mode prefill \
--host ${local_ip} \
--port 30000 \
--trust-remote-code \
--dist-init-addr ${prefill_master_ip}:5000 \
--nnodes 2 \
--node-rank 0 \
--tp-size 16 \
--dp-size 8 \
--enable-dp-attention \
--moe-a2a-backend deepep \
--mem-fraction-static 0.8
# prefill 1
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3-0324 \
--disaggregation-ib-device ${device_name} \
--disaggregation-mode prefill \
--host ${local_ip} \
--port 30000 \
--trust-remote-code \
--dist-init-addr ${prefill_master_ip}:5000 \
--nnodes 2 \
--node-rank 1 \
--tp-size 16 \
--dp-size 8 \
--enable-dp-attention \
--moe-a2a-backend deepep \
--mem-fraction-static 0.8
# decode 0
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3-0324 \
--disaggregation-ib-device ${device_name} \
--disaggregation-mode decode \
--host ${local_ip} \
--port 30001 \
--trust-remote-code \
--dist-init-addr ${decode_master_ip}:5000 \
--nnodes 2 \
--node-rank 0 \
--tp-size 16 \
--dp-size 8 \
--enable-dp-attention \
--moe-a2a-backend deepep \
--mem-fraction-static 0.8 \
--max-running-requests 128
# decode 1
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3-0324 \
--disaggregation-ib-device ${device_name} \
--disaggregation-mode decode \
--host ${local_ip} \
--port 30001 \
--trust-remote-code \
--dist-init-addr ${decode_master_ip}:5000 \
--nnodes 2 \
--node-rank 1 \
--tp-size 16 \
--dp-size 8 \
--enable-dp-attention \
--moe-a2a-backend deepep \
--mem-fraction-static 0.8 \
--max-running-requests 128
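The four commands above differ only in the disaggregation mode, the port, the master address, the node rank, and the extra --max-running-requests flag on decode nodes. The following sketch makes that explicit by generating the argument lists from one template. The function name launch_args, the IP addresses, and the device name mlx5_0 are placeholders chosen for illustration; every flag comes from the commands above.

```python
import shlex

# Flags shared by all four nodes in the example above.
COMMON = [
    "--model-path", "deepseek-ai/DeepSeek-V3-0324",
    "--trust-remote-code",
    "--nnodes", "2",
    "--tp-size", "16",
    "--dp-size", "8",
    "--enable-dp-attention",
    "--moe-a2a-backend", "deepep",
    "--mem-fraction-static", "0.8",
]


def launch_args(mode, node_rank, master_ip, local_ip, device_name):
    """Assemble the launch_server argument list for one node."""
    port = "30000" if mode == "prefill" else "30001"
    args = [
        "python", "-m", "sglang.launch_server", *COMMON,
        "--disaggregation-ib-device", device_name,
        "--disaggregation-mode", mode,
        "--host", local_ip,
        "--port", port,
        "--dist-init-addr", f"{master_ip}:5000",
        "--node-rank", str(node_rank),
    ]
    if mode == "decode":
        args += ["--max-running-requests", "128"]
    return args


# Hypothetical node addresses; the first entry of each list acts as the master.
NODES = {"prefill": ["10.0.0.1", "10.0.0.2"], "decode": ["10.0.1.1", "10.0.1.2"]}
for mode, ips in NODES.items():
    for rank, ip in enumerate(ips):
        print(shlex.join(launch_args(mode, rank, ips[0], ip, "mlx5_0")))
```

Each printed line is meant to be run on the node whose address appears after --host.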
PD Disaggregation with Mooncake supports the following environment variables for fine-grained control over system behavior.
To enable NVLink transport for KV cache transfers with the Mooncake backend (recommended for NVL72 deployments), set the following environment variables. Note that auxiliary data transfer still uses TCP as a temporary workaround.
export SGLANG_MOONCAKE_CUSTOM_MEM_POOL=NVLINK
export MC_FORCE_MNNVL=True
To utilize Intra-Node NVLink for KV cache transfers with the Mooncake backend (recommended for A100, H20, H100, etc.), set the following environment variables. Please note that auxiliary data still needs to be transferred via TCP.
export SGLANG_MOONCAKE_CUSTOM_MEM_POOL=INTRA_NODE_NVLINK
export MC_INTRANODE_NVLINK=true
The SGLANG_MOONCAKE_CUSTOM_MEM_POOL environment variable enables the custom memory pool. Supported values are NVLINK (or True), BAREX, and INTRA_NODE_NVLINK.
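The pairing between the memory pool mode and its companion variable is easy to get wrong when launching servers from a script. Below is a small sketch that keeps the pairs listed above in one table. The helper mooncake_mem_pool_env is hypothetical, the True alias for NVLINK is left out for brevity, and BAREX is given no companion variable only because this section documents none.

```python
import os

# Companion variables as listed above; BAREX has none documented in this section.
MEM_POOL_ENV = {
    "NVLINK": {"MC_FORCE_MNNVL": "True"},
    "INTRA_NODE_NVLINK": {"MC_INTRANODE_NVLINK": "true"},
    "BAREX": {},
}


def mooncake_mem_pool_env(mode):
    """Return the environment variables to export for the chosen memory pool mode."""
    if mode not in MEM_POOL_ENV:
        raise ValueError(
            f"unsupported mode {mode!r}; expected one of {sorted(MEM_POOL_ENV)}"
        )
    env = {"SGLANG_MOONCAKE_CUSTOM_MEM_POOL": mode}
    env.update(MEM_POOL_ENV[mode])
    return env


# Merge into a copy of the current environment, e.g. for subprocess.Popen(env=...).
selected = mooncake_mem_pool_env("INTRA_NODE_NVLINK")
child_env = {**os.environ, **selected}
print({name: child_env[name] for name in selected})
```

The resulting child_env can be passed to subprocess.Popen when starting sglang.launch_server, so the shell environment itself stays untouched.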
If a greater mean TTFT is acceptable, you can export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 (10 minutes) to relax the timeout condition.
Please be aware that this setting will cause prefill instances to take a longer time to clean up the affected memory resources when a running decode node loses connection.
If a greater mean TTFT is acceptable, you can export SGLANG_DISAGGREGATION_WAITING_TIMEOUT=600 (10 minutes) to relax the timeout condition.
When prefill and decode use different tensor parallelism (TP) sizes (e.g., prefill TP=4, decode DP attention with TP=1), the KV cache memory layout differs between the two sides. The GPU staging buffer solves this by gathering KV head slices into a contiguous buffer on the prefill side, performing a bulk RDMA transfer, then scattering the data into the correct KV cache pages on the decode side. This provides a 2–5x throughput improvement over the default per-token slice approach at high concurrency and matches homogeneous TP baselines within ~5%.
Enable the staging buffer when prefill and decode use different TP sizes with the Mooncake transfer backend. When both sides use the same TP size, staging is automatically bypassed even if enabled.
Note: The staging buffer is designed for non-MLA models (e.g. GQA, MHA). MLA models (e.g. DeepSeek-V2/V3) should not enable this flag.
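The gather, bulk transfer, and scatter sequence described above can be illustrated with a toy model in plain Python. This is only a sketch under simplifying assumptions: lists stand in for GPU memory, a list copy stands in for the RDMA transfer, and the slot numbers, sizes, and function names are invented for the example. It does not reflect SGLang's actual memory layout or the Mooncake API.

```python
def gather_to_staging(prefill_cache, prefill_slots):
    """Prefill side: pack the head slices of the given slots into one contiguous buffer."""
    staging = []
    for slot in prefill_slots:
        staging.extend(prefill_cache[slot])
    return staging


def scatter_from_staging(staging, decode_cache, decode_slots, rank, heads_per_rank, head_dim):
    """Decode side: write each token's head slice into its place in the full-head layout."""
    chunk = heads_per_rank * head_dim
    offset = rank * chunk
    for i, slot in enumerate(decode_slots):
        decode_cache[slot][offset:offset + chunk] = staging[i * chunk:(i + 1) * chunk]


NUM_HEADS, HEAD_DIM, PREFILL_TP = 4, 2, 4
heads_per_rank = NUM_HEADS // PREFILL_TP
prefill_slots = [5, 2, 7]  # where the request's tokens sit in each prefill rank's cache
decode_slots = [1, 4, 0]   # where the same tokens should land on the decode side

# Decode cache (TP=1): 8 slots, each holding all heads for one token.
decode_cache = [[None] * (NUM_HEADS * HEAD_DIM) for _ in range(8)]

bulk_transfers = 0
for rank in range(PREFILL_TP):
    # Toy per-rank cache: each value encodes (rank, slot, element) so results are checkable.
    prefill_cache = {
        slot: [(rank, slot, e) for e in range(heads_per_rank * HEAD_DIM)]
        for slot in prefill_slots
    }
    staging = gather_to_staging(prefill_cache, prefill_slots)
    received = list(staging)  # stands in for one bulk RDMA transfer per rank
    bulk_transfers += 1
    scatter_from_staging(received, decode_cache, decode_slots, rank, heads_per_rank, HEAD_DIM)

print("bulk transfers:", bulk_transfers,
      "vs per-token slices:", PREFILL_TP * len(prefill_slots))
```

In this model each prefill rank issues one transfer per request instead of one per token, and the saving grows with sequence length.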
# Set staging buffer environment variables on BOTH prefill and decode
export SGLANG_DISAGG_STAGING_BUFFER=1
export SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB=64
export SGLANG_DISAGG_STAGING_POOL_SIZE_MB=4096
# Prefill with TP=4
python -m sglang.launch_server \
--model-path $MODEL_PATH \
--disaggregation-mode prefill \
--port 30000 \
--tp 4 \
--trust-remote-code \
--disaggregation-ib-device mlx5_1,mlx5_2
# Decode with TP=1 (or DP attention with effective attention TP=1)
python -m sglang.launch_server \
--model-path $MODEL_PATH \
--disaggregation-mode decode \
--port 30001 \
--tp 4 \
--dp 4 \
--enable-dp-attention \
--trust-remote-code \
--disaggregation-ib-device mlx5_3,mlx5_4
# Router
python -m sglang_router.launch_router \
--pd-disaggregation \
--prefill http://127.0.0.1:30000 \
--decode http://127.0.0.1:30001 \
--host 0.0.0.0 --port 8000
Install via pip.
pip install nixl
Or build from source, which may be required if you already have UCX installed.
git clone https://github.com/ai-dynamo/nixl.git
cd nixl
pip install . --config-settings=setup-args="-Ducx_path=/path/to/ucx"
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--disaggregation-mode prefill \
--port 30000 \
--disaggregation-transfer-backend nixl
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--disaggregation-mode decode \
--port 30001 \
--base-gpu-id 1 \
--disaggregation-transfer-backend nixl
python -m sglang_router.launch_router --pd-disaggregation --prefill http://127.0.0.1:30000 --decode http://127.0.0.1:30001 --host 0.0.0.0 --port 8000
# prefill 0
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3-0324 \
--disaggregation-transfer-backend nixl \
--disaggregation-mode prefill \
--host ${local_ip} \
--port 30000 \
--trust-remote-code \
--dist-init-addr ${prefill_master_ip}:5000 \
--nnodes 2 \
--node-rank 0 \
--tp-size 16 \
--dp-size 8 \
--enable-dp-attention \
--moe-a2a-backend deepep \
--mem-fraction-static 0.8
# prefill 1
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3-0324 \
--disaggregation-transfer-backend nixl \
--disaggregation-mode prefill \
--host ${local_ip} \
--port 30000 \
--trust-remote-code \
--dist-init-addr ${prefill_master_ip}:5000 \
--nnodes 2 \
--node-rank 1 \
--tp-size 16 \
--dp-size 8 \
--enable-dp-attention \
--moe-a2a-backend deepep \
--mem-fraction-static 0.8
# decode 0
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3-0324 \
--disaggregation-transfer-backend nixl \
--disaggregation-mode decode \
--host ${local_ip} \
--port 30001 \
--trust-remote-code \
--dist-init-addr ${decode_master_ip}:5000 \
--nnodes 2 \
--node-rank 0 \
--tp-size 16 \
--dp-size 8 \
--enable-dp-attention \
--moe-a2a-backend deepep \
--mem-fraction-static 0.8 \
--max-running-requests 128
# decode 1
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3-0324 \
--disaggregation-transfer-backend nixl \
--disaggregation-mode decode \
--host ${local_ip} \
--port 30001 \
--trust-remote-code \
--dist-init-addr ${decode_master_ip}:5000 \
--nnodes 2 \
--node-rank 1 \
--tp-size 16 \
--dp-size 8 \
--enable-dp-attention \
--moe-a2a-backend deepep \
--mem-fraction-static 0.8 \
--max-running-requests 128
By default, NIXL uses the UCX backend for KV cache transfers. You can select a different NIXL plugin backend depending on your infrastructure using the environment variable SGLANG_DISAGGREGATION_NIXL_BACKEND.
Example: export SGLANG_DISAGGREGATION_NIXL_BACKEND=LIBFABRIC
Available backends: UCX (default), LIBFABRIC, or any installed NIXL plugin.
Example usage:
export SGLANG_DISAGGREGATION_NIXL_BACKEND=LIBFABRIC
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--disaggregation-mode prefill \
--disaggregation-transfer-backend nixl \
--port 30000
To use the ascend backend, install memfabric_hybrid and set ASCEND_MF_STORE_URL.
pip install memfabric-hybrid==1.0.0
export ASCEND_MF_STORE_URL="tcp://xxx.xx.xxx.xxx:xxxx"
To use the Mooncake backend instead, set the following variable. More details can be found in the Mooncake section.
export ENABLE_ASCEND_TRANSFER_WITH_MOONCAKE=true
ASCEND_NPU_PHY_ID must be set in the container environment.
export ASCEND_NPU_PHY_ID=xxx
# high performance cpu
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind cpu
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
# cann
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export STREAMS_PER_DEVICE=32
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=32
export HCCL_BUFFSIZE=1600
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export SGLANG_NPU_PROFILING=0
export SGLANG_NPU_PROFILING_STAGE="prefill"
export DEEPEP_NORMAL_LONG_SEQ_ROUND=32
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=3584
export ASCEND_MF_STORE_URL="tcp://127.0.0.1:24669"
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=3600
export SGLANG_DISAGGREGATION_WAITING_TIMEOUT=3600
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=0
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export SGLANG_DEEPEP_BF16_DISPATCH=0
export ASCEND_USE_FIA=1
python3 -m sglang.launch_server \
--model-path /path/to/MiMo-V2-Flash-w8a8-all-0512 \
--attention-backend ascend \
--device npu \
--tp-size 8 --nnodes 1 --node-rank 0 \
--chunked-prefill-size -1 \
--trust-remote-code --port 10000 \
--host 127.0.0.1 --max-running-requests 16 \
--mem-fraction-static 0.8 \
--disaggregation-mode prefill --disaggregation-transfer-backend ascend \
--disaggregation-bootstrap-port 8996 \
--base-gpu-id 0 \
--disable-radix-cache \
--disable-cuda-graph \
--moe-a2a-backend deepep --deepep-mode normal \
# 2>&1 | tee $SGLANG_LOG_PATH
# high performance cpu
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind cpu
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
# export ASCEND_LAUNCH_BLOCKING=1
# cann
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export STREAMS_PER_DEVICE=32
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256
export HCCL_BUFFSIZE=1600
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export SGLANG_NPU_PROFILING=0
export SGLANG_NPU_PROFILING_STAGE="prefill"
export DEEPEP_NORMAL_LONG_SEQ_ROUND=32
export DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=3584
export ASCEND_MF_STORE_URL="tcp://127.0.0.1:24669"
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=3600
export SGLANG_DISAGGREGATION_WAITING_TIMEOUT=3600
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=0
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export SGLANG_DEEPEP_BF16_DISPATCH=0
export ASCEND_USE_FIA=1
export SGLANG_NPU_FUSED_MOE_MODE=2
python3 -m sglang.launch_server \
--model-path /path/to/MiMo-V2-Flash-w8a8-all-0512 \
--attention-backend ascend \
--device npu \
--tp-size 8 --nnodes 1 --node-rank 0 \
--trust-remote-code --port 10001 \
--host 127.0.0.1 --max-running-requests 16 \
--mem-fraction-static 0.8 \
--disaggregation-mode decode --disaggregation-transfer-backend ascend \
--disaggregation-bootstrap-port 8996 \
--base-gpu-id 8 \
--disable-radix-cache \
--cuda-graph-bs 1 2 4 8 10 12 14 16 \
--quantization modelslim \
--speculative-draft-model-quantization unquant \
--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--enable-multi-layer-eagle \
--moe-a2a-backend deepep --deepep-mode low_latency \
# 2>&1 | tee $SGLANG_LOG_PATH
python -m sglang_router.launch_router \
--pd-disaggregation \
--prefill http://127.0.0.1:10000 \
--decode http://127.0.0.1:10001 \
--host 127.0.0.1 \
--port 9903 \
--health-check-interval-secs 3600 \
--mini-lb
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/op_api/lib/:${LD_LIBRARY_PATH}
export PATH=/usr/local/Ascend/8.5.0/compiler/bishengir/bin:$PATH
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
# Set to the IP of the first prefill (P) node
export ASCEND_MF_STORE_URL="tcp://XXXXXX:24670"
# Prefill (P) node IPs
P_IP=('XXXXX')
# Decode (D) node IPs
D_IP=('XXXXX')
# enable mlapo
export SGLANG_NPU_USE_MLAPO=1
export SGLANG_USE_FIA_NZ=1
export ENABLE_MOE_NZ=1
#export SGLANG_NPU_USE_MULTI_STREAM=1
LOCAL_HOST1=$(hostname -I | awk '{print $1}')
LOCAL_HOST2=$(hostname -I | awk '{print $2}')
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
MODEL_PATH=/path/to/deepseekr1_w4a8_pertoken
for i in "${!P_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
then
echo "${P_IP[$i]}"
export HCCL_BUFFSIZE=2600
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode prefill --host ${P_IP[$i]} \
--port 8000 --disaggregation-bootstrap-port $((8998+$i)) --trust-remote-code --nnodes 1 --node-rank 0 \
--tp-size 16 --mem-fraction-static 0.7 --attention-backend ascend --device npu --quantization modelslim \
--disaggregation-transfer-backend ascend --max-running-requests 32 --context-length 8192 --disable-radix-cache \
--chunked-prefill-size -1 --max-prefill-tokens 10240 --moe-a2a-backend deepep --deepep-mode normal \
--speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
--dp-size 8 --enable-dp-attention --disable-shared-experts-fusion --dtype bfloat16
NODE_RANK=$i
break
fi
done
MODEL_PATH=/path/to/deepseekr1_w4a8_pertoken
for i in "${!D_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${D_IP[$i]}" || "$LOCAL_HOST2" == "${D_IP[$i]}" ]];
then
echo "${D_IP[$i]}"
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export HCCL_BUFFSIZE=900
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=112
export TASK_QUEUE_ENABLE=1
export HCCL_SOCKET_IFNAME=data0.3001
export GLOO_SOCKET_IFNAME=data0.3001
python -m sglang.launch_server --model-path ${MODEL_PATH} --disaggregation-mode decode --host ${D_IP[$i]} \
--port 8001 --trust-remote-code --nnodes 1 --node-rank 0 --tp-size 16 --dp-size 16 \
--mem-fraction-static 0.8 --max-running-requests 448 --attention-backend ascend --device npu --quantization modelslim \
--moe-a2a-backend deepep --enable-dp-attention --deepep-mode low_latency --enable-dp-lm-head \
--cuda-graph-bs 2 4 6 8 10 12 14 16 18 20 22 24 26 28 --disaggregation-transfer-backend ascend --watchdog-timeout 9000 --context-length 8192 \
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--prefill-round-robin-balance --disable-shared-experts-fusion --dtype bfloat16 --tokenizer-worker-num 4 \
--load-balance-method round_robin
NODE_RANK=$i
break
fi
done
python -m sglang_router.launch_router --prefill ${P_IP}:8000 \
--decode ${D_IP}:8001 \
--host ${D_IP} --port 6688 \
--pd-disaggregation \
--health-check-interval-secs 3600