docs_new/docs/sglang-diffusion/profiling.mdx
This guide covers profiling techniques for multimodal generation pipelines in SGLang.
PyTorch Profiler provides detailed kernel execution time, call stack, and GPU utilization metrics.
Profile the denoising stage with sampled timesteps (default: 5 steps after 1 warmup step):
sglang generate \
--model-path Qwen/Qwen-Image \
--prompt "A Logo With Bold Large Text: SGL Diffusion" \
--seed 0 \
--profile
Parameters:
- --profile: Enable profiling for the denoising stage.
- --num-profiled-timesteps N: Number of timesteps to profile after warmup (default: 5). For example, --num-profiled-timesteps 10 profiles 10 steps after 1 warmup step.

Profile all pipeline stages (text encoding, denoising, VAE decoding, etc.):
sglang generate \
--model-path Qwen/Qwen-Image \
--prompt "A Logo With Bold Large Text: SGL Diffusion" \
--seed 0 \
--profile \
--profile-all-stages
Parameters:
- --profile-all-stages: Used with --profile; profiles all pipeline stages instead of only denoising.

By default, trace files are saved in the ./logs/ directory.
The exact output file path will be shown in the console output, for example:
[mm-dd hh:mm:ss] Saved profiler traces to: /sgl-workspace/sglang/logs/mocked_fake_id_for_offline_generate-5_steps-global-rank0.trace.json.gz
Load and visualize trace files at:
For large trace files, reduce --num-profiled-timesteps or avoid using --profile-all-stages.
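
For a quick look at a trace without opening a viewer, the file can also be summarized from a script. The sketch below assumes the saved .trace.json.gz file is in the Chrome trace event format that PyTorch Profiler exports (a `traceEvents` list whose complete events have `"ph": "X"`, a `name`, and a `dur` in microseconds). The function names are illustrative and not part of SGLang, and the `cat` values available for filtering (for example `kernel`) may vary between PyTorch versions.

```python
import gzip
import json
from collections import defaultdict


def load_trace_events(path):
    """Load events from a Chrome-trace JSON file, gzip-compressed or plain."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        data = json.load(f)
    # A Chrome trace is either {"traceEvents": [...]} or a bare list of events.
    return data["traceEvents"] if isinstance(data, dict) else data


def top_events_by_total_time(events, n=10, category=None):
    """Sum durations (microseconds) of complete events per name, largest first.

    If `category` is given, only events whose "cat" field matches are counted
    (assumption: category names depend on the profiler version).
    """
    totals = defaultdict(float)
    for event in events:
        if event.get("ph") != "X" or "dur" not in event:
            continue  # skip metadata, flow, and instant events
        if category is not None and event.get("cat") != category:
            continue
        totals[event.get("name", "<unnamed>")] += event["dur"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


def print_summary(path, n=10, category=None):
    """Print the n event names with the largest total duration, in ms."""
    events = load_trace_events(path)
    for name, total_us in top_events_by_total_time(events, n, category):
        print(f"{total_us / 1000:10.3f} ms  {name}")
```

Calling `print_summary(path)` with the path printed in the console output lists the most expensive event names. CPU-side operators and GPU kernels overlap in time, so totals from different categories should be compared rather than added together.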
--perf-dump-path (Stage/Step Timing Dump)

Besides profiler traces, you can also dump a lightweight JSON report of stage-level and per-denoising-step timings.
This is useful to quickly identify which stage dominates end-to-end latency, and whether denoising steps have uniform runtimes (and if not, which step has an abnormal spike).
The dumped JSON contains a denoise_steps_ms field formatted as an array of objects, each with a step key (the step index) and a duration_ms key.
Example:
sglang generate \
--model-path <MODEL_PATH_OR_ID> \
--prompt "<PROMPT>" \
--perf-dump-path perf.json
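
To check the dumped report for non-uniform denoising steps, a short script is often enough. The sketch below relies only on the denoise_steps_ms layout described above (an array of objects with step and duration_ms keys). The helper names and the 1.5x-median threshold are illustrative assumptions, not part of SGLang.

```python
import json
import statistics


def load_report(path):
    """Read the JSON report written via --perf-dump-path."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def find_slow_steps(report, threshold=1.5):
    """Return (median_ms, outliers) for the denoise_steps_ms entries.

    A step is reported as an outlier when its duration exceeds
    threshold * median. The threshold is an arbitrary illustrative choice.
    """
    steps = report.get("denoise_steps_ms", [])
    if not steps:
        return None, []
    median_ms = statistics.median(s["duration_ms"] for s in steps)
    outliers = [s for s in steps if s["duration_ms"] > threshold * median_ms]
    return median_ms, outliers


def print_slow_steps(path, threshold=1.5):
    """Print the median step time and any steps that exceed the threshold."""
    median_ms, outliers = find_slow_steps(load_report(path), threshold)
    if median_ms is None:
        print("no denoise_steps_ms entries found")
        return
    print(f"median step time: {median_ms:.2f} ms")
    for s in outliers:
        print(f"step {s['step']}: {s['duration_ms']:.2f} ms")
```

For example, `print_slow_steps("perf.json")` reports the median step time and any spikes. The median is used rather than the mean so that a single slow step, such as a first step that includes one-time setup cost, does not shift the baseline.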
Nsight Systems provides low-level CUDA profiling with kernel details, register usage, and memory access patterns.
See the SGLang profiling guide for installation instructions.
Profile the entire pipeline execution:
nsys profile \
--trace-fork-before-exec=true \
--cuda-graph-trace=node \
--force-overwrite=true \
-o QwenImage \
sglang generate \
--model-path Qwen/Qwen-Image \
--prompt "A Logo With Bold Large Text: SGL Diffusion" \
--seed 0
Use --delay and --duration to capture specific stages and reduce file size:
nsys profile \
--trace-fork-before-exec=true \
--cuda-graph-trace=node \
--force-overwrite=true \
--delay 10 \
--duration 30 \
-o QwenImage_denoising \
sglang generate \
--model-path Qwen/Qwen-Image \
--prompt "A Logo With Bold Large Text: SGL Diffusion" \
--seed 0
Parameters:
- --delay N: Wait N seconds before starting capture (skips initialization overhead).
- --duration N: Capture for N seconds (focuses on specific stages).
- --force-overwrite: Overwrite existing output files.

To reduce trace size, use --num-profiled-timesteps with smaller values, or --delay/--duration with Nsight Systems.
Use --profile alone for the denoising stage; add --profile-all-stages for the full pipeline.
If you profile sglang generate with Nsight Systems and find that the generated profile did not capture any CUDA kernels, you can resolve this by increasing the model's inference steps to extend the execution time.