docs_new/docs/sglang-diffusion/ring_sp_performance.mdx
Sequence parallelism splits long image or video latent sequences across GPUs. In SGLang Diffusion, the public controls are:
- `--sp-degree`: total sequence parallel degree
- `--ulysses-degree`: Ulysses parallel degree
- `--ring-degree`: ring parallel degree

The degrees must satisfy:
sp_degree = ulysses_degree * ring_degree
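The constraint can be checked before launching a server. The following is a minimal sketch, not part of SGLang: `check_sp_degrees` and `valid_splits` are hypothetical helper names, and the check covers only the product rule stated above. A given model may impose further limits on which splits it accepts.

```python
def check_sp_degrees(sp_degree: int, ulysses_degree: int, ring_degree: int) -> None:
    """Raise ValueError if the degrees violate sp = ulysses * ring."""
    for name, value in (
        ("sp_degree", sp_degree),
        ("ulysses_degree", ulysses_degree),
        ("ring_degree", ring_degree),
    ):
        if value < 1:
            raise ValueError(f"{name} must be a positive integer, got {value}")
    if sp_degree != ulysses_degree * ring_degree:
        raise ValueError(
            f"sp_degree={sp_degree} does not equal "
            f"ulysses_degree * ring_degree = {ulysses_degree * ring_degree}"
        )


def valid_splits(sp_degree: int) -> list[tuple[int, int]]:
    """List every (ulysses_degree, ring_degree) pair whose product is sp_degree."""
    return [
        (u, sp_degree // u)
        for u in range(1, sp_degree + 1)
        if sp_degree % u == 0
    ]


# The two-GPU ring configuration used on this page passes the check.
check_sp_degrees(sp_degree=2, ulysses_degree=1, ring_degree=2)
print(valid_splits(4))
```

For `sp_degree=4` the candidates are a pure ring split, a mixed 2x2 split, and a pure Ulysses split; which one is fastest has to be measured on the target hardware.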
Use SP when sequence length or video shape makes the DiT forward pass the bottleneck and the model supports sequence sharding. For latency-oriented multi-GPU Qwen/Wan deployments, also compare against CFG parallelism and FSDP; SP is not automatically the best multi-GPU setting for every model.
This example uses two GPUs with sp=2, ulysses=1, and ring=2.
sglang serve \
--model-path Wan-AI/Wan2.2-TI2V-5B-Diffusers \
--num-gpus 2 \
--sp-degree 2 \
--ulysses-degree 1 \
--ring-degree 2 \
--port 8898
Use an explicit single-GPU baseline before attributing a gain to sequence parallelism.
sglang serve \
--model-path Wan-AI/Wan2.2-TI2V-5B-Diffusers \
--num-gpus 1 \
--sp-degree 1 \
--ulysses-degree 1 \
--ring-degree 1 \
--port 8898
When benchmarking SP, compare the same model, precision, resolution, frame count, step count, scheduler settings, prompt type, and output path. Report both stage latency and peak GPU memory; SP can reduce per-GPU memory while adding communication overhead.
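One way to keep the comparison honest is to record the controlled variables for each run and refuse to compare runs that differ in any of them. This is a sketch under assumptions: `BenchSettings` and `mismatched_fields` are hypothetical names, and every field value below is a placeholder, not a recommended or measured setting.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class BenchSettings:
    """Variables that should be identical across the runs being compared."""
    model: str
    precision: str
    resolution: str
    num_frames: int
    num_steps: int
    scheduler: str
    prompt_type: str
    output_path: str


def mismatched_fields(a: BenchSettings, b: BenchSettings) -> list[str]:
    """Return the names of the settings that differ between two runs."""
    da, db = asdict(a), asdict(b)
    return sorted(key for key in da if da[key] != db[key])


# Placeholder values for illustration only.
baseline = BenchSettings(
    model="Wan-AI/Wan2.2-TI2V-5B-Diffusers",
    precision="bf16",
    resolution="placeholder-resolution",
    num_frames=16,
    num_steps=20,
    scheduler="placeholder-scheduler",
    prompt_type="text-to-video",
    output_path="outputs/bench",
)
sp_run = replace(baseline, num_steps=30)  # deliberately inconsistent run

diff = mismatched_fields(baseline, sp_run)
if diff:
    print("Runs are not comparable; differing settings:", diff)
```

The parallel degrees are deliberately left out of `BenchSettings`, since they are the variable under test.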
Useful metrics:
The following numbers are a reference measurement for one setup. They are not a general promise for all Wan2.2 deployments.
- Model: Wan-AI/Wan2.2-TI2V-5B-Diffusers
- SP configuration: sp=2, ulysses=1, ring=2 (u1r2)
- Baseline configuration: sp=1, ulysses=1, ring=1 (u1r1)

In this setup, end-to-end latency improved from 90.63s to 63.74s (1.42x) and peak GPU memory dropped by 7.33GB. The overhead ratio increased, so future tuning should still check communication and runtime overhead on the target hardware.
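The speedup figure follows directly from the two end-to-end latencies. The sketch below reproduces that arithmetic and adds a scaling efficiency (speedup divided by GPU count), which is a derived quantity not reported in the original measurement. The function names are hypothetical.

```python
def speedup(baseline_s: float, candidate_s: float) -> float:
    """Ratio of baseline latency to candidate latency (higher is better)."""
    if baseline_s <= 0 or candidate_s <= 0:
        raise ValueError("latencies must be positive")
    return baseline_s / candidate_s


def scaling_efficiency(baseline_s: float, candidate_s: float, num_gpus: int) -> float:
    """Speedup divided by GPU count; 1.0 would be perfect linear scaling."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return speedup(baseline_s, candidate_s) / num_gpus


baseline_s = 90.63  # u1r1, one GPU, from the reference measurement
sp_s = 63.74        # u1r2, two GPUs, from the reference measurement

print(f"speedup: {speedup(baseline_s, sp_s):.2f}x")  # 1.42x, matching the text
print(f"latency reduction: {(1 - sp_s / baseline_s) * 100:.1f}%")
print(f"scaling efficiency on 2 GPUs: {scaling_efficiency(baseline_s, sp_s, 2):.2f}")
```

An efficiency well below 1.0 on two GPUs is consistent with the increased overhead ratio noted above.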