benchmark/kernels/all_reduce/README.md
MSCCL++ is a GPU-driven communication library that can replace NCCL for all-reduce operations. It supports CUDA graph capture and is optimized for small-to-medium message sizes commonly seen in tensor-parallel inference.
Currently supported configurations: TP=8 (single-node) and TP=16 (two-node).
docker/Dockerfile, MSCCL++ is already installed by default.git clone https://github.com/microsoft/mscclpp.git
cd mscclpp && mkdir build && cd build
cmake .. && make -j && pip install ..
mscclpp is importable in your Python environment before running the benchmark or using MSCCL++ for inference.The benchmark compares all-reduce latency across torch/NCCL (eager), MSCCL++ (eager and graph), and PyNccl (graph) for power-of-two message sizes.
torchrun --nproc_per_node 8 \
--nnodes 1 \
--node_rank 0 \
benchmark/kernels/all_reduce/benchmark_mscclpp.py
For multi-node (TP=16):
export WORLD_SIZE=2
export MASTER_ADDR=<master-ip>
export MASTER_PORT=12345
# Run on each node with the appropriate RANK (0 or 1):
torchrun --nproc_per_node 8 \
--nnodes $WORLD_SIZE \
--node_rank $RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT \
benchmark/kernels/all_reduce/benchmark_mscclpp.py
Use the --enable-mscclpp flag to select MSCCL++ as the all-reduce backend during CUDA-graph-captured inference:
python -m sglang.launch_server \
--model-path Qwen/Qwen3-8B \
--tp-size 8 \
--enable-mscclpp
Note: MSCCL++ performs auto-tuning on first initialization, which may add a few seconds to startup time. The tuned configurations are cached for the lifetime of the process.