docs_new/docs/sglang-diffusion/progressive_resolution.mdx
Progressive resolution growing is an experimental feature for selected SGLang Diffusion pipelines. It runs early denoising steps at a coarser latent resolution and spectrally upsamples the latent before the full-resolution steps. On the benchmark setup below, this reduces the quadratic attention cost of the DiT transformer and yields up to 1.63× speedup on FLUX.1, 1.93× speedup on FLUX.2, 2.33× speedup on Z-Image, 2.78× speedup on Wan 2.1 T2V, and 1.69× speedup on Qwen-Image.
This page is intentionally not linked from the main documentation navigation while the feature is still experimental.
Based on Spectral Progressive Diffusion (arXiv 2605.18736).
DiT attention is O(n²) in sequence length. Halving each spatial dimension quarters the token count, so running the first N denoising steps at half the spatial resolution cuts the attention cost for those steps to about 1/16 (~6%).
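The ~6% figure follows from simple arithmetic. The sketch below assumes attention cost scales purely with the square of the token count and uses the FLUX.1 token counts from the table on this page; `attention_cost_ratio` is an illustrative helper, not part of SGLang.

```python
def attention_cost_ratio(full_tokens: int, coarse_tokens: int) -> float:
    """Relative attention cost of a coarse step, assuming cost ~ tokens**2."""
    return (coarse_tokens / full_tokens) ** 2


full_tokens = 4096               # FLUX.1 at 1024x1024
coarse_tokens = full_tokens // 4  # half resolution per side -> 4x fewer tokens

ratio = attention_cost_ratio(full_tokens, coarse_tokens)
print(f"{ratio:.4f}")  # 0.0625, i.e. about 6% of the full-resolution attention cost
```

Only the attention term shrinks this much; layers whose cost is linear in the token count shrink by 4× rather than 16×.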
The transition point (how many steps to run at each resolution) is computed from the Bayes-optimal frequency-activation criterion: frequencies that cannot be resolved at the coarse scale are not denoised there. This makes the speedup lossless by construction.
| Model | Full-res tokens | Half-res tokens | Token-step ratio |
|---|---|---|---|
| FLUX.1 1024×1024 | 4,096 | 1,024 | 4.0× |
| FLUX.2 1024×1024 | 4,096 | 1,024 | 4.0× |
| Z-Image 1024×1024 | 4,096 | 1,024 | 4.0× |
| Wan 2.1 T2V 480×832 (81 frames) | 6,240 | 1,560 | 4.0× |
| Parameter | CLI flag | Default | Description |
|---|---|---|---|
| progressive_mode | --progressive-mode | "fullres" | "fullres" disables (identical to standard generation). "dct_rewind" enables spectral upsample with scheduler rewind (recommended). "dct" enables upsample without rewind. |
| progressive_levels | --progressive-levels | 1 | Number of resolution halvings. 1 = one coarse stage (64×64 latent → 128×128). 2 = two coarse stages (32×32 → 64×64 → 128×128). |
| progressive_delta | --progressive-delta | 0.01 | Noise-dominated tolerance δ. Controls how many steps run at coarse resolution. Higher δ = more coarse steps = more speedup. |
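To give some intuition for what "spectral upsample" means in the "dct" and "dct_rewind" modes, here is a minimal 1-D sketch of upsampling by zero-padding DCT coefficients. It is not the SGLang implementation: the function names, the orthonormal DCT-II convention, and the amplitude-preserving gain are all assumptions made for illustration, and the real stage works on 2-D latents and may normalize differently.

```python
import math


def dct_ii(x):
    """Orthonormal DCT-II of a 1-D sequence (direct O(n^2) evaluation)."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(
            x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)
        ))
    return out


def idct_ii(coeffs):
    """Inverse of dct_ii (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
            total += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(total)
    return out


def dct_upsample(x, factor=2):
    """Upsample by zero-padding the DCT spectrum.

    The coarse signal's coefficients become the low-frequency band of the
    fine signal; the new high-frequency band starts at zero. The gain
    sqrt(m / n) keeps sample amplitudes unchanged (an assumed convention).
    """
    n = len(x)
    m = n * factor
    gain = math.sqrt(m / n)
    padded = [gain * c for c in dct_ii(x)] + [0.0] * (m - n)
    return idct_ii(padded)


coarse = [1.0, 2.0, 4.0, 3.0]
fine = dct_upsample(coarse)
print(len(fine))  # 8
```

Because the padded band is empty, the upsampled signal contains exactly the frequencies that were representable at the coarse scale, which is the property the transition criterion relies on.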
Tip: Add --dit-cpu-offload false to keep the transformer GPU-resident. With CPU offload, each step pays a fixed PCIe transfer cost regardless of sequence length, which dilutes the speedup.
sglang generate \
--model-path black-forest-labs/FLUX.1-dev \
--prompt "A serene mountain lake at golden hour, photorealistic" \
--num-inference-steps 50 \
--dit-cpu-offload false \
--progressive-mode dct_rewind \
--progressive-levels 1 \
--progressive-delta 0.05
| δ | Coarse steps (50 total) | Denoising speedup |
|---|---|---|
| 0.01 | 18 @ 64² + 32 @ 128² | 1.32× |
| 0.05 | 28 @ 64² + 22 @ 128² | 1.63× |
For most prompts, δ = 0.05 is recommended: of the two settings above, it gives the larger speedup with no visible degradation.
Hardware: RTX A6000 48 GB, --dit-cpu-offload false. Timing = denoising loop only.
| Config | Stage split | Denoise | Speedup |
|---|---|---|---|
| Fullres (baseline) | 50 @ 128² latent | 36.65 s | 1.00× |
| dct_rewind L1 δ=0.01 | 18@64² + 32@128² | 27.67 s | 1.32× |
| dct_rewind L1 δ=0.05 | 28@64² + 22@128² | 22.58 s | 1.62× |
| dct_rewind L2 δ=0.01 | 10@32² + 8@64² + 32@128² | 26.48 s | 1.38× |
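As a rough sanity check on these numbers, the stage split can be turned into an idealized speedup estimate. The sketch below is a back-of-the-envelope model, not how the benchmark was produced: it assumes per-step cost is proportional to the token count raised to some exponent (1 for linear layers, 2 for attention) and ignores all fixed overheads. `estimated_speedup` is a hypothetical helper.

```python
def estimated_speedup(stages, exponent):
    """Idealized speedup over running every step at full resolution.

    stages: list of (num_steps, relative_token_count), where 1.0 means
    full resolution and 0.25 means half resolution per side.
    """
    total_steps = sum(steps for steps, _ in stages)
    cost = sum(steps * rel_tokens ** exponent for steps, rel_tokens in stages)
    return total_steps / cost


# dct_rewind L1, delta = 0.05: 28 steps at 64^2 latent, 22 steps at 128^2 latent
flux1_delta_005 = [(28, 0.25), (22, 1.0)]

linear = estimated_speedup(flux1_delta_005, exponent=1)
quadratic = estimated_speedup(flux1_delta_005, exponent=2)
print(f"linear: {linear:.2f}x, quadratic: {quadratic:.2f}x")
# linear: 1.72x, quadratic: 2.11x
```

The measured 1.62× sits below both estimates, which would be consistent with per-step costs that do not shrink with sequence length (kernel launch overhead, text-token processing, the upsample itself), although this model cannot confirm the cause.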
from sglang.multimodal_gen import DiffGenerator
gen = DiffGenerator.from_pretrained(
    model_path="black-forest-labs/FLUX.1-dev",
    dit_cpu_offload=False,
)
result = gen.generate(sampling_params_kwargs={
    "prompt": "A serene mountain lake at golden hour, photorealistic",
    "num_inference_steps": 50,
    "height": 1024,
    "width": 1024,
    "progressive_mode": "dct_rewind",
    "progressive_levels": 1,
    "progressive_delta": 0.05,
})
Supports FLUX.2-dev, FLUX.2-klein-4B, and FLUX.2-klein-9B.
sglang generate \
--model-path black-forest-labs/FLUX.2-klein-4B \
--prompt "A serene mountain lake at golden hour, photorealistic" \
--num-inference-steps 30 \
--dit-cpu-offload false \
--progressive-mode dct_rewind \
--progressive-levels 1 \
--progressive-delta 0.10
Hardware: RTX A6000 48 GB, --dit-cpu-offload false. Model: FLUX.2-klein-4B, 30 steps, 1024×1024.
Timing = denoising loop only, averaged across 10 diverse prompts.
| Config | Stage split | Denoise | Speedup |
|---|---|---|---|
| Fullres (baseline) | 30 @ 64² latent | 9.72 s | 1.00× |
| dct_rewind L1 δ=0.05 | 18@32² + 12@64² | 5.50 s | 1.77× |
| dct_rewind L1 δ=0.10 | 20@32² + 10@64² | 5.03 s | 1.93× |
from sglang.multimodal_gen import DiffGenerator
gen = DiffGenerator.from_pretrained(
    model_path="black-forest-labs/FLUX.2-klein-4B",
    dit_cpu_offload=False,
)
result = gen.generate(sampling_params_kwargs={
    "prompt": "A serene mountain lake at golden hour, photorealistic",
    "num_inference_steps": 30,
    "progressive_mode": "dct_rewind",
    "progressive_levels": 1,
    "progressive_delta": 0.10,
})
Supports Wan-AI/Wan2.1-T2V-1.3B-Diffusers and Wan-AI/Wan2.1-T2V-14B-Diffusers.
Note: Progressive generation grows only the spatial H×W dimensions. The temporal dimension T (number of latent frames) is kept fixed across all stages.
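The note above can be illustrated by listing the latent shape at each stage. This is a sketch under stated assumptions: the helper `stage_shapes` is hypothetical, the 8× spatial compression is inferred from the 480×832 to 60×104 mapping used on this page, and the 4× temporal compression with `(frames - 1) // 4 + 1` latent frames is an assumption about the Wan 2.1 VAE.

```python
def stage_shapes(t_lat, h_lat, w_lat, levels):
    """Return (T, H, W) latent shapes from the coarsest stage to full resolution.

    Only H and W are halved per level; T is the same at every stage.
    """
    shapes = []
    for level in range(levels, -1, -1):
        div = 2 ** level
        if h_lat % div or w_lat % div:
            raise ValueError(f"latent {h_lat}x{w_lat} is not divisible by {div}")
        shapes.append((t_lat, h_lat // div, w_lat // div))
    return shapes


num_frames, height, width = 81, 480, 832
t_lat = (num_frames - 1) // 4 + 1          # assumed 4x temporal compression
h_lat, w_lat = height // 8, width // 8     # assumed 8x spatial compression

shapes = stage_shapes(t_lat, h_lat, w_lat, levels=1)
print(shapes)  # [(21, 30, 52), (21, 60, 104)]

# Spatial positions per latent frame at each stage
print([h * w for _, h, w in shapes])  # [1560, 6240]
```

Because T stays fixed, the total sequence length still shrinks by 4× per level, the same ratio as for the image models.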
sglang generate \
--model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
--prompt "A cheetah sprinting across the Serengeti at sunset, slow motion, photorealistic" \
--num-inference-steps 50 \
--num-frames 81 \
--height 480 \
--width 832 \
--guidance-scale 5.0 \
--flow-shift 5.0 \
--dit-cpu-offload false \
--progressive-mode dct_rewind \
--progressive-levels 1 \
--progressive-delta 0.05
| δ | Coarse steps (50 total) | Denoising speedup |
|---|---|---|
| 0.01 | 23 @ 30×52 + 27 @ 60×104 | 1.65× |
| 0.02 | 27 @ 30×52 + 23 @ 60×104 | 1.86× |
| 0.05 | 33 @ 30×52 + 17 @ 60×104 | 2.32× |
| 0.10 | 37 @ 30×52 + 13 @ 60×104 | 2.78× |
For most prompts, δ = 0.05 is recommended. δ = 0.10 gives the largest measured speedup but should be validated on motion-heavy scenes.
from sglang.multimodal_gen import DiffGenerator
gen = DiffGenerator.from_pretrained(
    model_path="Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    dit_cpu_offload=False,
    flow_shift=5.0,
)
result = gen.generate(sampling_params_kwargs={
    "prompt": "A cheetah sprinting across the Serengeti at sunset, slow motion, photorealistic",
    "num_inference_steps": 50,
    "num_frames": 81,
    "height": 480,
    "width": 832,
    "guidance_scale": 5.0,
    "progressive_mode": "dct_rewind",
    "progressive_levels": 1,
    "progressive_delta": 0.05,
})
Supports Tongyi-MAI/Z-Image. Z-Image uses the same VAE as FLUX.1 (FluxVAEConfig), so the power-law spectrum constants are identical. The progressive stage handles Z-Image's 5-D latent format [B, C, 1, H, W] with squeeze/unsqueeze hooks and recomputes caption+image RoPE positional embeddings on each stage transition.
Note: Always specify --height 1024 --width 1024 (or another resolution where H_lat and W_lat are both divisible by 2). Z-Image's default resolution (360×640) produces a 45×80 latent where H=45 is not divisible by the patch size.
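A quick pre-flight check for a requested resolution might look like the sketch below. The helper is hypothetical and not part of SGLang; it assumes an 8× VAE downsampling factor (consistent with 360×640 giving a 45×80 latent) and applies only the divisible-by-2 condition stated in the note.

```python
def supports_progressive(height, width, vae_factor=8, divisor=2):
    """Return True if both latent dimensions are divisible by `divisor`.

    vae_factor=8 is an assumption inferred from the 360x640 -> 45x80 example.
    """
    if height % vae_factor or width % vae_factor:
        return False
    h_lat, w_lat = height // vae_factor, width // vae_factor
    return h_lat % divisor == 0 and w_lat % divisor == 0


print(supports_progressive(1024, 1024))  # True  (128x128 latent)
print(supports_progressive(360, 640))    # False (45x80 latent, 45 is odd)
```

Running a check like this before launching a long batch avoids discovering an incompatible resolution only when the stage fails.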
# Standard fullres — unchanged behavior
sglang generate --model-path Tongyi-MAI/Z-Image \
--prompt "A serene mountain lake at golden hour, photorealistic" \
--height 1024 --width 1024
# Progressive dct_rewind L1 δ=0.10 → 2.33× denoising speedup
sglang generate --model-path Tongyi-MAI/Z-Image \
--prompt "A serene mountain lake at golden hour, photorealistic" \
--height 1024 --width 1024 \
--num-inference-steps 50 \
--dit-cpu-offload false \
--progressive-mode dct_rewind \
--progressive-levels 1 \
--progressive-delta 0.10
| δ | Coarse steps (50 total) | Denoising speedup |
|---|---|---|
| 0.01 | 26 @ 64² + 24 @ 128² | 1.53× |
| 0.05 | 35 @ 64² + 15 @ 128² | 2.03× |
| 0.10 | 42 @ 64² + 8 @ 128² | 2.33× |
Z-Image achieves higher progressive speedups than FLUX.1 at the same δ because it uses dual CFG (two forward passes per step), which doubles the absolute attention savings at coarse resolution. δ = 0.10 is the recommended tradeoff.
from sglang.multimodal_gen import DiffGenerator
gen = DiffGenerator.from_pretrained(
    model_path="Tongyi-MAI/Z-Image",
    dit_cpu_offload=False,
)
result = gen.generate(sampling_params_kwargs={
    "prompt": "A serene mountain lake at golden hour, photorealistic",
    "num_inference_steps": 50,
    "height": 1024,
    "width": 1024,
    "progressive_mode": "dct_rewind",
    "progressive_levels": 1,
    "progressive_delta": 0.10,
})
Qwen-Image uses the same 2×2 patchify convention as FLUX.1 (in_channels=64, C=16), so the same progressive stage wires in with model-specific hooks for RoPE (freqs_cis) and spatial metadata (img_shapes).
# Standard fullres — unchanged behavior
sglang generate --model-path Qwen/Qwen-Image \
--prompt "A serene mountain lake at golden hour"
# Progressive dct_rewind L1 δ=0.20 → 1.69× denoising speedup
sglang generate --model-path Qwen/Qwen-Image \
--prompt "A serene mountain lake at golden hour" \
--progressive-mode dct_rewind --progressive-levels 1 --progressive-delta 0.20 \
--num-inference-steps 30 --dit-cpu-offload false
Hardware: RTX A6000 48 GB, --dit-cpu-offload false. Timing = denoising loop only.
| Config | Stage split | Denoise | Speedup |
|---|---|---|---|
| Fullres (baseline) | 30 @ 128² | 43.00 s | 1.00× |
| dct_rewind L1 δ=0.05 | 13@64² + 17@128² | 33.25 s | 1.29× |
| dct_rewind L1 δ=0.10 | 16@64² + 14@128² | 33.86 s | 1.27× |
| dct_rewind L1 δ=0.20 | 19@64² + 11@128² | 25.40 s | 1.69× |
Sequence parallelism (--ulysses-degree or --ring-degree) is not supported: the stage raises a RuntimeError if SP is enabled.
--enable-torch-compile.