Back to Sglang

Performance Optimization

docs_new/docs/sglang-diffusion/performance-optimization.mdx

0.5.146.6 KB
Original Source

Use this page as the starting point for SGLang Diffusion performance work. It separates performance levers into two decision classes:

  • Output-preserving / lossless-style: system settings that should preserve model behavior while changing residency, parallelism, kernels, or scheduling.
  • Quality-tradeoff / lossy or approximate: techniques that can change the denoising path, numerical representation, or generated output.

The docs use "output-preserving" instead of promising bit-exact "lossless" because different kernels, GPU types, or precision paths can still introduce small numerical differences. The decision boundary is whether the optimization intentionally trades quality or output equivalence for speed.

Start Here

  1. Pick a serving or generation mode from Deployment and Performance Modes. --performance-mode auto is the default; use speed when the model fits in GPU memory and latency matters most, memory when GPU memory is the bottleneck, and manual when every performance flag should be explicit.
  2. Choose the right attention backend from Attention Backends.
  3. Use Sequence Parallelism only when the model and video shape benefit from sequence splitting.
  4. Use Inference Batching for concurrent compatible requests during serving.
  5. Use Profiling before changing several levers at once.

Output-Preserving / Lossless-Style Levers

These settings should preserve model behavior while changing residency, parallelism, kernels, or scheduling. They are the first choices for production tuning.

<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}> <colgroup> <col style={{width: "24%"}} /> <col style={{width: "30%"}} /> <col style={{width: "46%"}} /> </colgroup> <thead> <tr style={{borderBottom: "2px solid #d55816"}}> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700}}>Lever</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700}}>Use when</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700}}>Docs</th> </tr> </thead> <tbody> <tr> <td style={{padding: "9px 12px", fontWeight: 500}}><code>--performance-mode</code></td> <td style={{padding: "9px 12px"}}>You want a safe preset for speed or memory without overriding explicit flags.</td> <td style={{padding: "9px 12px"}}><a href="./deployment_cookbook">Deployment and Performance Modes</a></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500}}>Offload, FSDP, CFG parallelism</td> <td style={{padding: "9px 12px"}}>GPU memory, multi-GPU residency, or CFG branch splitting is the main bottleneck.</td> <td style={{padding: "9px 12px"}}><a href="./deployment_cookbook">Deployment and Performance Modes</a></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500}}>Sequence parallelism</td> <td style={{padding: "9px 12px"}}>Long image/video sequences need sequence-level parallelism.</td> <td style={{padding: "9px 12px"}}><a href="./ring_sp_performance">Sequence Parallelism</a></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500}}>Attention backend</td> <td style={{padding: "9px 12px"}}>Kernel choice dominates DiT latency or memory.</td> <td style={{padding: "9px 12px"}}><a href="./attention_backends">Attention Backends</a></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500}}>Dynamic batching</td> <td style={{padding: "9px 12px"}}>Serving many compatible requests concurrently.</td> <td style={{padding: "9px 12px"}}><a href="./dynamic_batching">Inference Batching</a></td> </tr> </tbody> </table>

Quality-Tradeoff / Lossy Or Approximate Levers

These techniques can change the denoising path, numerical representation, or generated output. They are useful after you have a baseline and an acceptance criterion for quality.

<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}> <colgroup> <col style={{width: "24%"}} /> <col style={{width: "30%"}} /> <col style={{width: "46%"}} /> </colgroup> <thead> <tr style={{borderBottom: "2px solid #d55816"}}> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700}}>Lever</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700}}>Tradeoff</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700}}>Docs</th> </tr> </thead> <tbody> <tr> <td style={{padding: "9px 12px", fontWeight: 500}}>Cache-DiT</td> <td style={{padding: "9px 12px"}}>Skips selected DiT block or step computation based on cache decisions.</td> <td style={{padding: "9px 12px"}}><a href="./cache_dit">Cache-DiT</a></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500}}>TeaCache</td> <td style={{padding: "9px 12px"}}>Reuses residuals when consecutive denoising steps are similar enough.</td> <td style={{padding: "9px 12px"}}><a href="./teacache">TeaCache</a></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500}}>Progressive resolution</td> <td style={{padding: "9px 12px"}}>Runs early denoising at lower latent resolution for supported pipelines.</td> <td style={{padding: "9px 12px"}}><a href="./progressive_resolution">Progressive Resolution Generation</a></td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500}}>Quantization</td> <td style={{padding: "9px 12px"}}>Uses lower-precision transformer weights or activations.</td> <td style={{padding: "9px 12px"}}><a href="./quantization">Quantization</a></td> </tr> </tbody> </table>

Practical Order

  1. Establish a baseline with the target model, resolution, frame count, step count, and GPU type.
  2. Select --performance-mode and explicit residency or parallelism flags.
  3. Tune attention backend and batching for the deployment pattern.
  4. Profile if the bottleneck is unclear.
  5. Add caching, progressive resolution, or quantization only after comparing output quality against your acceptance target.

Diagnostics

Profiling is not an optimization technique by itself. It belongs in the performance workflow because it tells you which stage, kernel, or denoising step is worth optimizing before you change multiple levers.

References