# Checkpoint Engine
The SGLang checkpoint engine integration provides an efficient way to load model weights using a distributed checkpoint loading system. This feature significantly reduces model loading time, especially for large models and multi-node setups, by parallelizing the weight loading process across multiple processes and nodes.
The checkpoint engine integration allows SGLang to:

- Start with placeholder (dummy) weights and receive the real weights from a separate checkpoint engine process
- Load weights in parallel across multiple processes and nodes, increasing effective disk bandwidth
- Update weights at runtime via broadcast or P2P transfer
## Installation

First, install the checkpoint engine package:

```bash
pip install 'checkpoint-engine[p2p]'
```
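To confirm the installation succeeded, you can try importing the package (the module name `checkpoint_engine` is an assumption based on the package name):

```bash
# Quick sanity check: the import should succeed without errors.
python -c "import checkpoint_engine" && echo "checkpoint-engine installed"
```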
## Architecture

The system consists of two main components:

1. **SGLang server**: launched with `--load-format dummy` and the `--wait-for-initial-weights` flag, so it waits for the checkpoint engine to deliver weights before becoming ready
2. **Checkpoint engine**: a separate group of processes that read the checkpoint from disk in parallel and transfer the weights into the server

The checkpoint engine uses a parameter server architecture with support for:

- **Broadcast**: distributes weights synchronously to a known set of inference instances; this is the fast path used in the examples below
- **P2P**: sends weights to instances that join dynamically, without disturbing instances that are already serving
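Because the server holds back readiness until the initial weights arrive, a launcher script can simply poll it before sending traffic. A minimal sketch, assuming the server's default port 30000 and SGLang's `/health` endpoint:

```bash
# Poll until the server reports healthy; with --wait-for-initial-weights
# this only succeeds after the checkpoint engine has delivered the weights.
until curl -sf http://localhost:30000/health > /dev/null; do
    echo "waiting for initial weights..."
    sleep 5
done
echo "server is ready"
```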
## Usage

### Single Node

**Terminal 1 - Launch the SGLang server:**
```bash
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --tp 8 \
    --load-format dummy \
    --wait-for-initial-weights
```
**Terminal 2 - Run the checkpoint engine:**

Using the sglang entrypoint (recommended):
```bash
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```
Using torchrun directly:
```bash
torchrun --nproc-per-node 8 \
    examples/checkpoint_engine/update.py \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```
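Once the weights are in place, the server behaves like any other SGLang deployment. As a quick check, you can send a request to the native `/generate` endpoint (assuming the default port 30000):

```bash
# A short completion request; a coherent response confirms the real
# weights (not the dummy placeholders) are being served.
curl -s http://localhost:30000/generate \
    -H "Content-Type: application/json" \
    -d '{"text": "The capital of France is", "sampling_params": {"max_new_tokens": 16}}'
```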
### Multi-Node: One Server per Node

In this setup each node runs its own SGLang server (TP=8), while a single checkpoint engine launched across both nodes feeds them in parallel.

**Node 0:**

Launch the SGLang server:
```bash
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --tp 8 \
    --load-format dummy \
    --wait-for-initial-weights \
    --host [IP]
```
Run the checkpoint engine:

Using the sglang entrypoint (recommended):
```bash
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```
Using torchrun directly:
```bash
torchrun --nproc-per-node 8 \
    --nnodes 2 \
    --node-rank 0 \
    --master-addr [IP] \
    --master-port 29500 \
    examples/checkpoint_engine/update.py \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```
**Node 1:**

Launch the SGLang server:
```bash
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --tp 8 \
    --load-format dummy \
    --wait-for-initial-weights \
    --host [IP]
```
Run the checkpoint engine:

Using the sglang entrypoint (recommended):
```bash
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```
Using torchrun directly:
```bash
torchrun --nproc-per-node 8 \
    --nnodes 2 \
    --node-rank 1 \
    --master-addr [IP] \
    --master-port 29500 \
    examples/checkpoint_engine/update.py \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```
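The broadcast examples above assume all participating servers are known up front. The P2P method is aimed at servers that join later. A plausible flow, sketched under the assumption that the metadata written by `--save-metas-file` can be passed unchanged to `--load-metas-file` on the joining node:

```bash
# On the original fleet: update as usual, but also save checkpoint
# metadata so late joiners can locate the weights.
python -m sglang.srt.checkpoint_engine.update \
    --update-method all \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8 \
    --save-metas-file /tmp/qwen3-8b.metas

# On a newly joined node: reuse the saved metadata to fetch weights
# via P2P instead of re-broadcasting to the whole fleet.
python -m sglang.srt.checkpoint_engine.update \
    --update-method p2p \
    --load-metas-file /tmp/qwen3-8b.metas \
    --inference-parallel-size 8
```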
### Multi-Node: One Server Across Two Nodes

In this setup a single SGLang server spans both nodes via cross-node tensor parallelism (note `--dist-init-addr`, `--nnodes`, and `--node-rank`), so the checkpoint engine runs with `--inference-parallel-size 16`.

**Node 0:**

Launch the SGLang server:
```bash
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --tp 8 \
    --load-format dummy \
    --wait-for-initial-weights \
    --host [IP] \
    --dist-init-addr [IP]:9120 \
    --nnodes 2 \
    --node-rank 0
```
Run the checkpoint engine:

Using the sglang entrypoint (recommended):
```bash
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 16
```
Using torchrun directly:
```bash
torchrun --nproc-per-node 8 \
    --nnodes 2 \
    --node-rank 0 \
    --master-addr [IP] \
    --master-port 29500 \
    examples/checkpoint_engine/update.py \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 16
```
**Node 1:**

Launch the SGLang server:
```bash
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --tp 8 \
    --load-format dummy \
    --wait-for-initial-weights \
    --host [IP] \
    --dist-init-addr [IP]:9120 \
    --nnodes 2 \
    --node-rank 1
```
Run the checkpoint engine:

Using the sglang entrypoint (recommended):
```bash
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 16
```
Using torchrun directly:
```bash
torchrun --nproc-per-node 8 \
    --nnodes 2 \
    --node-rank 1 \
    --master-addr [IP] \
    --master-port 29500 \
    examples/checkpoint_engine/update.py \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 16
```
## Configuration Options

**SGLang server flags:**

- `--load-format dummy`: use dummy weights for the initial load, which allows loading to overlap with other startup tasks
- `--wait-for-initial-weights`: wait for the checkpoint engine to provide weights before becoming ready
- `--host`: host address for multi-node setups
- `--dist-init-addr`: distributed initialization address for tensor parallelism

**Checkpoint engine flags:**

- `--update-method`: weight update method (`broadcast`, `p2p`, or `all`)
- `--checkpoint-path`: path to the model checkpoint directory
- `--inference-parallel-size`: number of inference parallel processes
- `--endpoint`: SGLang server endpoint (default: `http://localhost:19730`)
- `--checkpoint-name`: name for the checkpoint (default: `my-checkpoint-iter-0`)
- `--save-metas-file`: file to save checkpoint metadata to
- `--load-metas-file`: file to load checkpoint metadata from
- `--uds`: Unix domain socket path for communication
- `--weight-version`: version identifier for the weights
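For example, a broadcast update that targets a non-default server endpoint and tags the pushed weights with an explicit version might look like this (the endpoint URL, checkpoint name, and version string are placeholders):

```bash
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8 \
    --endpoint http://10.0.0.5:19730 \
    --checkpoint-name qwen3-8b-iter-100 \
    --weight-version v100
```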
## Performance

The checkpoint engine provides significant time savings in two main aspects:

1. **Multi-node loading**: each node loads only a portion of the weights from disk, effectively increasing the aggregate disk bandwidth; more participating nodes provide greater acceleration. Preliminary tests show a 20-second speedup when loading DeepSeek-R1 on H20-3e with two nodes.
2. **Single-process optimization**: the dummy load format allows the disk-to-CPU transfer to overlap with CUDA graph capture and other initialization tasks, providing additional time savings.
## Troubleshooting

- **Missing P2P support**: make sure the package was installed with the P2P extras:

  ```bash
  pip install 'checkpoint-engine[p2p]'
  ```

- **Debugging**: use the `--sleep-time` parameter to add delays if needed for debugging
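As an illustration, a slowed-down debugging run might look like the following (the value is a placeholder; check the entrypoint's `--help` for the exact semantics of `--sleep-time`):

```bash
# Hypothetical debugging invocation: --sleep-time inserts delays so the
# individual update stages are easier to observe in the logs.
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8 \
    --sleep-time 10
```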