docs/diffusion/performance/ring_sp_performance.md
This page reports Ring-SP performance for Wan2.2-TI2V-5B-Diffusers using:

- sp=2, ulysses=1, ring=2 (short: u1r2)
- sp=1, ulysses=1, ring=1 (short: u1r1), used as the baseline

Benchmark setup:

- Model: Wan2.2-TI2V-5B-Diffusers
- Hardware: 48G RTX40 series * 2

Server command for Ring-SP (u1r2):

```bash
sglang serve \
  --model-type diffusion \
  --model-path /model/HuggingFace/Wan-AI/Wan2.2-TI2V-5B-Diffusers \
  --num-gpus 2 --sp-degree 2 --ulysses-degree 1 --ring-degree 2 \
  --port 8898
```

Server command for the baseline (u1r1):

```bash
sglang serve \
  --model-type diffusion \
  --model-path /model/HuggingFace/Wan-AI/Wan2.2-TI2V-5B-Diffusers \
  --num-gpus 1 --sp-degree 1 --ulysses-degree 1 --ring-degree 1 \
  --port 8898
```

These benchmarks are provided for reference under one specific setup and command configuration. Actual performance may vary with model settings, runtime environment, and request patterns.

| Stage / Metric | u1r2 (s) | u1r1 baseline (s) | Speedup |
|---|---|---|---|
| InputValidation | 0.1060 | 0.1029 | 0.97x |
| TextEncoding | 1.3965 | 2.2261 | 1.59x |
| LatentPreparation | 0.0002 | 0.0002 | 1.00x |
| TimestepPreparation | 0.0003 | 0.0004 | 1.33x |
| Denoising | 52.6358 | 71.6785 | 1.36x |
| Decoding | 7.6708 | 13.4314 | 1.75x |
| Total | 63.74 | 90.63 | 1.42x |
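
The Speedup column appears to be the baseline time divided by the Ring-SP time. The sketch below recomputes it from the table values and also prints the absolute time saved per stage. The names `timings` and `speedup` are illustrative only and are not part of SGLang.

```python
# Stage timings in seconds, copied from the table above: (u1r2, u1r1 baseline).
timings = {
    "InputValidation": (0.1060, 0.1029),
    "TextEncoding": (1.3965, 2.2261),
    "LatentPreparation": (0.0002, 0.0002),
    "TimestepPreparation": (0.0003, 0.0004),
    "Denoising": (52.6358, 71.6785),
    "Decoding": (7.6708, 13.4314),
    "Total": (63.74, 90.63),
}


def speedup(ring_s: float, baseline_s: float) -> float:
    """Baseline time divided by Ring-SP time; values above 1 mean Ring-SP is faster."""
    return baseline_s / ring_s


for stage, (ring_s, baseline_s) in timings.items():
    saved_s = baseline_s - ring_s  # absolute seconds saved by Ring-SP
    print(f"{stage:<20} {speedup(ring_s, baseline_s):.2f}x  saved {saved_s:.4f}s")
```

Looking at absolute seconds alongside the ratio is useful here: TextEncoding has a higher ratio than Denoising, but Denoising accounts for far more of the time saved.
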

| Memory Metric | u1r2 (GB) | u1r1 baseline (GB) | Delta |
|---|---|---|---|
| Peak GPU Memory | 20.07 | 27.40 | -7.33 |
| Peak Allocated | 13.35 | 20.40 | -7.05 |
| Memory Overhead | 6.72 | 7.00 | -0.28 |
| Overhead Ratio | 33.5% | 25.6% | +7.9pp |
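
The derived memory rows can be checked with a few lines of arithmetic. This sketch assumes that Memory Overhead is Peak GPU Memory minus Peak Allocated, and that Overhead Ratio is that overhead divided by Peak GPU Memory. These definitions are inferred from the numbers in the table, not stated by the source, and `memory_overhead` is a hypothetical helper name.

```python
def memory_overhead(peak_gb: float, allocated_gb: float) -> tuple[float, float]:
    """Return (overhead in GB, overhead as a fraction of peak GPU memory).

    Assumed definitions: overhead = peak - allocated, ratio = overhead / peak.
    """
    overhead_gb = peak_gb - allocated_gb
    return overhead_gb, overhead_gb / peak_gb


# (Peak GPU Memory, Peak Allocated) in GB, copied from the table above.
configs = {"u1r2": (20.07, 13.35), "u1r1": (27.40, 20.40)}

for name, (peak_gb, allocated_gb) in configs.items():
    overhead_gb, ratio = memory_overhead(peak_gb, allocated_gb)
    print(f"{name}: overhead {overhead_gb:.2f} GB, ratio {ratio:.1%}")
```

Under these assumptions the u1r2 figures match the table. The recomputed baseline ratio can differ from the table in the last digit, presumably because the table was derived from unrounded measurements.
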
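
- End-to-end latency drops from 90.63s to 63.74s (1.42x).
- Most of the time saved comes from Denoising (1.36x) and Decoding (1.75x).
- Peak memory per GPU is lower with Ring-SP (Peak GPU Memory -7.33GB, Peak Allocated -7.05GB).
- The overhead ratio is higher with Ring-SP (+7.9pp), so future tuning can focus on reducing communication/runtime overhead while preserving the latency gain.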