docs/platforms/ascend_npu_ring_sp_performance.md
This page reports Ring-SP performance on Ascend NPU with torch_npu==2.10.0.
ulysses=1, ring=1 (short: u1r1)ulysses=1, ring=2 (short: u1r2)Wan2.1-T2V-1.3B-Diffusers"a cat is playing piano"sglang generatetorch_npu==2.10.0u1r1)sglang generate --model-path /nas/disk1/Wan2.1-T2V-1.3B-Diffusers \
--prompt "a cat is playing piano" --num-gpus 1 --ring-degree 1 \
--save-output
u1r2)sglang generate --model-path /nas/disk1/Wan2.1-T2V-1.3B-Diffusers \
--prompt "a cat is playing piano" --num-gpus 2 --ring-degree 2 \
--save-output
Benchmark Disclaimer
These numbers are from one fixed setup and one prompt case. Actual performance may vary by model settings, environment, and workload.
| Stage / Metric | u1r2 (s) | u1r1 baseline (s) | Speedup |
|---|---|---|---|
| InputValidation | 0.0003 | 0.0002 | 0.67x |
| TextEncoding | 3.5936 | 3.5820 | 1.00x |
| LatentPreparation | 0.0007 | 0.0055 | 7.86x |
| TimestepPreparation | 0.0008 | 0.0007 | 0.88x |
| Denoising | 121.2788 | 239.2580 | 1.97x |
| Decoding | 13.8685 | 16.4969 | 1.19x |
| Total (Pixel data generated) | 141.86 | 266.50 | 1.88x |
torch_npu==2.10.0, Ring-SP (u1r2) runs successfully on NPU for this case.266.50s to 141.86s (1.88x).DenoisingStage (1.97x), while decoding also improves (1.19x).