Progressive resolution growing is an experimental feature for selected SGLang Diffusion pipelines. It runs the early denoising steps at a coarser latent resolution, then spectrally upsamples the latent before the remaining full-resolution steps. This reduces the quadratic attention cost of the DiT transformer; on the benchmark setups below, denoising speedups reach up to 1.63× on FLUX.1, 1.93× on FLUX.2, 2.33× on Z-Image, 2.78× on Wan 2.1 T2V, 1.69× on Qwen-Image, and 1.56× on Ideogram 4.
Based on Spectral Progressive Diffusion (arXiv 2605.18736).
DiT attention is O(n²) in sequence length. Running the first N denoising steps at half the spatial resolution quarters the token count, which cuts the attention cost for those steps to 1/16 (about 6%) of the full-resolution cost.
The transition point — how many steps to run at each resolution — is computed from the Bayes-optimal frequency-activation criterion: frequencies that cannot be resolved at the coarse scale are not denoised there. The method is designed to preserve quality under this criterion, but generated outputs can still differ from the full-resolution baseline.
| Model | Full-res tokens | Half-res tokens | Token-step ratio |
|---|---|---|---|
| FLUX.1 1024×1024 | 4,096 | 1,024 | 4.0× |
| FLUX.2 1024×1024 | 4,096 | 1,024 | 4.0× |
| Z-Image 1024×1024 | 4,096 | 1,024 | 4.0× |
| Wan 2.1 T2V 480×832 (81 frames) | 6,240 | 1,560 | 4.0× |
| Ideogram 4 1024×1024 | 4,096 | 1,024 | 4.0× |
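The token counts and the ~6% attention figure can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming an 8× VAE downsampling factor and 2×2 patchify (the FLUX.1 case); `token_count` and `coarse_cost_fractions` are illustrative helpers, not part of SGLang.

```python
def token_count(height_px, width_px, vae_factor=8, patch=2):
    """Image tokens per frame, ASSUMING an 8x VAE and 2x2 patchify."""
    h_lat, w_lat = height_px // vae_factor, width_px // vae_factor
    return (h_lat // patch) * (w_lat // patch)


def coarse_cost_fractions(levels=1):
    """Per-step cost at the coarsest stage relative to a full-resolution step."""
    token_fraction = 0.25 ** levels           # each halving keeps 1/4 of the tokens
    attention_fraction = token_fraction ** 2  # self-attention scales as n^2
    return token_fraction, attention_fraction


full = token_count(1024, 1024)   # 128x128 latent -> 64x64 tokens
half = token_count(512, 512)     # same image at half the spatial resolution
tok_frac, attn_frac = coarse_cost_fractions(levels=1)
print(full, half, tok_frac, attn_frac)  # 4096 1024 0.25 0.0625
```

Only the attention term shrinks by 16×; layers that are linear in sequence length shrink by 4×, so the end-to-end speedup per coarse step lies somewhere between the two.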
| Parameter | CLI flag | Default | Description |
|---|---|---|---|
| progressive_mode | --progressive-mode | "fullres" | "fullres" disables (identical to standard generation). "dct_rewind" enables spectral upsample with scheduler rewind (recommended). "dct" enables upsample without rewind. |
| progressive_levels | --progressive-levels | 1 | Number of resolution halvings. 1 = one coarse stage (64×64 latent → 128×128). 2 = two coarse stages (32×32 → 64×64 → 128×128). |
| progressive_delta | --progressive-delta | 0.01 | Noise-dominated tolerance δ. Controls how many steps run at coarse resolution. Higher δ = more coarse steps = more speedup. |
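To give a feel for what "spectral upsample" can mean, here is a minimal pure-Python sketch of DCT zero-padding: transform to DCT coefficients, append zeros for the new high frequencies, and invert at the larger size. This is an assumption about the general technique suggested by the mode name; the actual SGLang stage may differ (it operates on latent tensors and also handles the scheduler rewind, which is not shown here). All function names are hypothetical.

```python
import math


def dct_ortho(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        acc = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        out.append(math.sqrt((1.0 if k == 0 else 2.0) / n) * acc)
    return out


def idct_ortho(coeffs):
    """Inverse of dct_ortho (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        acc = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
            acc += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(acc)
    return out


def spectral_upsample_1d(x, factor=2):
    """Zero-pad the DCT spectrum; the sqrt(factor) gain keeps amplitudes unchanged."""
    n = len(x)
    gain = math.sqrt(factor)
    padded = [gain * c for c in dct_ortho(x)] + [0.0] * (n * (factor - 1))
    return idct_ortho(padded)


def spectral_upsample_2d(grid, factor=2):
    """Apply the 1-D upsample along rows, then along columns."""
    rows = [spectral_upsample_1d(row, factor) for row in grid]
    cols = [spectral_upsample_1d(list(col), factor) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


flat = [[3.0] * 4 for _ in range(4)]   # constant 4x4 "latent"
up = spectral_upsample_2d(flat)        # 8x8, still constant
print(len(up), len(up[0]), round(up[0][0], 6))  # 8 8 3.0
```

Because the padded coefficients are zero, the upsampled grid carries no energy above the coarse stage's frequency band; those frequencies are left for the full-resolution steps to fill in.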
Tip: Add --dit-cpu-offload false to keep the transformer GPU-resident. With CPU offload, each step pays a fixed PCIe transfer cost regardless of sequence length, which dilutes the speedup.
sglang generate \
--model-path black-forest-labs/FLUX.1-dev \
--prompt "A serene mountain lake at golden hour, photorealistic" \
--num-inference-steps 50 \
--dit-cpu-offload false \
--progressive-mode dct_rewind \
--progressive-levels 1 \
--progressive-delta 0.05
| δ | Coarse steps (50 total) | Denoising speedup |
|---|---|---|
| 0.01 | 18 @ 64² + 32 @ 128² | 1.32× |
| 0.05 | 28 @ 64² + 22 @ 128² | 1.63× |
For most prompts, δ = 0.05 is recommended: it gives the larger of the two speedups above with no visible degradation.
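When comparing δ values on your own prompts, it can help to generate the commands programmatically. The sketch below only builds argument lists from the flags documented on this page and prints them; `progressive_command` is a hypothetical helper, and nothing here invokes SGLang.

```python
import shlex


def progressive_command(model_path, prompt, steps, delta,
                        mode="dct_rewind", levels=1):
    """Build an `sglang generate` argument list using the flags shown above."""
    return [
        "sglang", "generate",
        "--model-path", model_path,
        "--prompt", prompt,
        "--num-inference-steps", str(steps),
        "--dit-cpu-offload", "false",
        "--progressive-mode", mode,
        "--progressive-levels", str(levels),
        "--progressive-delta", str(delta),
    ]


commands = [
    progressive_command(
        "black-forest-labs/FLUX.1-dev",
        "A serene mountain lake at golden hour, photorealistic",
        steps=50,
        delta=delta,
    )
    for delta in (0.01, 0.05)
]
for cmd in commands:
    print(shlex.join(cmd))  # shell-quoted, ready to paste or pass to subprocess.run
```

Passing the list form to `subprocess.run(cmd, check=True)` avoids shell-quoting problems with prompts that contain commas or quotes.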
Hardware: RTX A6000 48 GB, --dit-cpu-offload false. Timing = denoising loop only.
| Config | Stage split | Denoise | Speedup |
|---|---|---|---|
| Fullres (baseline) | 50 @ 128² latent | 36.65 s | 1.00× |
| dct_rewind L1 δ=0.01 | 18@64² + 32@128² | 27.67 s | 1.32× |
| dct_rewind L1 δ=0.05 | 28@64² + 22@128² | 22.58 s | 1.62× |
| dct_rewind L2 δ=0.01 | 10@32² + 8@64² + 32@128² | 26.48 s | 1.38× |
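As a rough sanity check on the FLUX.1 numbers above, a stage split can be turned into an estimated speedup under a simple cost model. This is a back-of-envelope sketch, not a model from the source: it assumes the per-step cost scales as the token fraction raised to some exponent (1 for layers linear in sequence length, 2 for attention) and ignores any fixed per-step overhead.

```python
def estimated_speedup(stages, exponent):
    """stages: list of (steps, token_fraction) pairs.

    ASSUMPTION: cost per step is proportional to token_fraction ** exponent.
    """
    total_steps = sum(steps for steps, _ in stages)
    cost = sum(steps * fraction ** exponent for steps, fraction in stages)
    return total_steps / cost


# dct_rewind L1, delta=0.05: 28 steps at 1/4 of the tokens, 22 at full resolution
split = [(28, 0.25), (22, 1.0)]
linear = estimated_speedup(split, exponent=1)
quadratic = estimated_speedup(split, exponent=2)
print(f"{linear:.2f} {quadratic:.2f}")  # 1.72 2.11
```

The measured 1.62× sits a little below even the linear estimate, which would be consistent with per-step costs that do not shrink with token count; treat the model as an upper-bound intuition rather than a prediction.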
from sglang.multimodal_gen import DiffGenerator

gen = DiffGenerator.from_pretrained(
    model_path="black-forest-labs/FLUX.1-dev",
    dit_cpu_offload=False,
)
result = gen.generate(sampling_params_kwargs={
    "prompt": "A serene mountain lake at golden hour, photorealistic",
    "num_inference_steps": 50,
    "height": 1024,
    "width": 1024,
    "progressive_mode": "dct_rewind",
    "progressive_levels": 1,
    "progressive_delta": 0.05,
})
Supports FLUX.2-dev, FLUX.2-klein-4B, and FLUX.2-klein-9B.
sglang generate \
--model-path black-forest-labs/FLUX.2-klein-4B \
--prompt "A serene mountain lake at golden hour, photorealistic" \
--num-inference-steps 30 \
--dit-cpu-offload false \
--progressive-mode dct_rewind \
--progressive-levels 1 \
--progressive-delta 0.10
Hardware: RTX A6000 48 GB, --dit-cpu-offload false. Model: FLUX.2-klein-4B, 30 steps, 1024×1024.
Timing = denoising loop only, averaged across 10 diverse prompts.
| Config | Stage split | Denoise | Speedup |
|---|---|---|---|
| Fullres (baseline) | 30 @ 64² latent | 9.72 s | 1.00× |
| dct_rewind L1 δ=0.05 | 18@32² + 12@64² | 5.50 s | 1.77× |
| dct_rewind L1 δ=0.10 | 20@32² + 10@64² | 5.03 s | 1.93× |
from sglang.multimodal_gen import DiffGenerator

gen = DiffGenerator.from_pretrained(
    model_path="black-forest-labs/FLUX.2-klein-4B",
    dit_cpu_offload=False,
)
result = gen.generate(sampling_params_kwargs={
    "prompt": "A serene mountain lake at golden hour, photorealistic",
    "num_inference_steps": 30,
    "progressive_mode": "dct_rewind",
    "progressive_levels": 1,
    "progressive_delta": 0.10,
})
Supports Wan-AI/Wan2.1-T2V-1.3B-Diffusers and Wan-AI/Wan2.1-T2V-14B-Diffusers.
Note: Progressive generation grows only the spatial H×W dimensions. The temporal dimension T (number of latent frames) is kept fixed across all stages.
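The per-stage latent shapes implied by this note can be sketched as follows. The helper assumes 4× temporal and 8× spatial VAE compression for Wan 2.1 (the 8× factor matches the 60×104 latent quoted on this page; the temporal factor is an assumption here), and `wan_stage_shapes` is a hypothetical name, not an SGLang API.

```python
def wan_stage_shapes(num_frames, height, width, levels=1,
                     temporal_factor=4, spatial_factor=8):
    """Latent (T, H, W) for each stage, coarsest first; T stays fixed."""
    t_lat = (num_frames - 1) // temporal_factor + 1
    h_lat, w_lat = height // spatial_factor, width // spatial_factor
    shapes = []
    for level in range(levels, -1, -1):
        scale = 2 ** level
        if h_lat % scale or w_lat % scale:
            raise ValueError(
                f"latent {h_lat}x{w_lat} is not divisible by {scale}"
            )
        # Only the spatial dimensions are divided; the frame count is untouched.
        shapes.append((t_lat, h_lat // scale, w_lat // scale))
    return shapes


print(wan_stage_shapes(81, 480, 832))
# [(21, 30, 52), (21, 60, 104)]
```

With 81 frames at 480×832, the coarse stage works on a 30×52 latent per frame and the final stage on 60×104, both with the same 21 latent frames.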
sglang generate \
--model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
--prompt "A cheetah sprinting across the Serengeti at sunset, slow motion, photorealistic" \
--num-inference-steps 50 \
--num-frames 81 \
--height 480 \
--width 832 \
--guidance-scale 5.0 \
--flow-shift 5.0 \
--dit-cpu-offload false \
--progressive-mode dct_rewind \
--progressive-levels 1 \
--progressive-delta 0.05
| δ | Coarse steps (50 total) | Denoising speedup |
|---|---|---|
| 0.01 | 23 @ 30×52 + 27 @ 60×104 | 1.65× |
| 0.02 | 27 @ 30×52 + 23 @ 60×104 | 1.86× |
| 0.05 | 33 @ 30×52 + 17 @ 60×104 | 2.32× |
| 0.10 | 37 @ 30×52 + 13 @ 60×104 | 2.78× |
For most prompts, δ = 0.05 is recommended. δ = 0.10 provides the maximum speedup but should be validated on motion-heavy scenes.
from sglang.multimodal_gen import DiffGenerator

gen = DiffGenerator.from_pretrained(
    model_path="Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    dit_cpu_offload=False,
    flow_shift=5.0,
)
result = gen.generate(sampling_params_kwargs={
    "prompt": "A cheetah sprinting across the Serengeti at sunset, slow motion, photorealistic",
    "num_inference_steps": 50,
    "num_frames": 81,
    "height": 480,
    "width": 832,
    "guidance_scale": 5.0,
    "progressive_mode": "dct_rewind",
    "progressive_levels": 1,
    "progressive_delta": 0.05,
})
Supports Tongyi-MAI/Z-Image. Z-Image uses the same VAE as FLUX.1 (FluxVAEConfig), so the power-law spectrum constants are identical. The progressive stage handles Z-Image's 5-D latent format [B, C, 1, H, W] with squeeze/unsqueeze hooks and recomputes caption+image RoPE positional embeddings on each stage transition.
Note: Always specify --height 1024 --width 1024 (or another resolution where H_lat and W_lat are both divisible by 2). Z-Image's default resolution (360×640) produces a 45×80 latent where H=45 is not divisible by the patch size.
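A quick pre-flight check for this constraint might look like the sketch below. It assumes an 8× VAE downsampling factor, which is what the 360×640 → 45×80 example implies; both helpers are illustrative and not part of SGLang.

```python
def latent_size(height, width, vae_factor=8):
    """Latent (H_lat, W_lat), ASSUMING an 8x VAE downsampling factor."""
    return height // vae_factor, width // vae_factor


def supports_progressive(height, width, vae_factor=8):
    """True when both latent dimensions are divisible by 2, as the note requires."""
    h_lat, w_lat = latent_size(height, width, vae_factor)
    return h_lat % 2 == 0 and w_lat % 2 == 0


print(latent_size(360, 640), supports_progressive(360, 640))      # (45, 80) False
print(latent_size(1024, 1024), supports_progressive(1024, 1024))  # (128, 128) True
```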
# Standard fullres — unchanged behavior
sglang generate --model-path Tongyi-MAI/Z-Image \
--prompt "A serene mountain lake at golden hour, photorealistic" \
--height 1024 --width 1024
# Progressive dct_rewind L1 δ=0.10 → 2.33× denoising speedup
sglang generate --model-path Tongyi-MAI/Z-Image \
--prompt "A serene mountain lake at golden hour, photorealistic" \
--height 1024 --width 1024 \
--num-inference-steps 50 \
--dit-cpu-offload false \
--progressive-mode dct_rewind \
--progressive-levels 1 \
--progressive-delta 0.10
| δ | Coarse steps (50 total) | Denoising speedup |
|---|---|---|
| 0.01 | 26 @ 64² + 24 @ 128² | 1.53× |
| 0.05 | 35 @ 64² + 15 @ 128² | 2.03× |
| 0.10 | 42 @ 64² + 8 @ 128² | 2.33× |
Z-Image achieves higher progressive speedups than FLUX.1 at the same δ because it uses dual CFG (two forward passes per step), doubling the absolute attention savings at coarse resolution. δ = 0.10 is the recommended tradeoff.
from sglang.multimodal_gen import DiffGenerator

gen = DiffGenerator.from_pretrained(
    model_path="Tongyi-MAI/Z-Image",
    dit_cpu_offload=False,
)
result = gen.generate(sampling_params_kwargs={
    "prompt": "A serene mountain lake at golden hour, photorealistic",
    "num_inference_steps": 50,
    "height": 1024,
    "width": 1024,
    "progressive_mode": "dct_rewind",
    "progressive_levels": 1,
    "progressive_delta": 0.10,
})
Qwen-Image uses the same 2×2 patchify convention as FLUX.1 (in_channels=64, C=16), so the same progressive stage wires in with model-specific hooks for RoPE (freqs_cis) and spatial metadata (img_shapes).
# Standard fullres — unchanged behavior
sglang generate --model-path Qwen/Qwen-Image \
--prompt "A serene mountain lake at golden hour"
# Progressive dct_rewind L1 δ=0.20 → 1.69× denoising speedup
sglang generate --model-path Qwen/Qwen-Image \
--prompt "A serene mountain lake at golden hour" \
--progressive-mode dct_rewind --progressive-levels 1 --progressive-delta 0.20 \
--num-inference-steps 30 --dit-cpu-offload false
Hardware: RTX A6000 48 GB, --dit-cpu-offload false. Timing = denoising loop only.
| Config | Stage split | Denoise | Speedup |
|---|---|---|---|
| Fullres (baseline) | 30 @ 128² | 43.00 s | 1.00× |
| dct_rewind L1 δ=0.05 | 13@64² + 17@128² | 33.25 s | 1.29× |
| dct_rewind L1 δ=0.10 | 16@64² + 14@128² | 33.86 s | 1.27× |
| dct_rewind L1 δ=0.20 | 19@64² + 11@128² | 25.40 s | 1.69× |
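The Speedup column is the baseline denoise time divided by each configuration's denoise time. A small sketch recomputing it from the Qwen-Image timings above (the dictionary keys are just labels chosen for this example):

```python
baseline_s = 43.00  # full-resolution denoise time from the table above
denoise_s = {"delta=0.05": 33.25, "delta=0.10": 33.86, "delta=0.20": 25.40}

# Speedup = baseline time / progressive time, rounded as in the table.
speedups = {name: round(baseline_s / t, 2) for name, t in denoise_s.items()}
print(speedups)
# {'delta=0.05': 1.29, 'delta=0.10': 1.27, 'delta=0.20': 1.69}
```

In these measurements δ = 0.10 comes out marginally slower than δ = 0.05 despite running more coarse steps, so it is worth timing a few δ values on your own hardware rather than assuming the speedup grows monotonically.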
Supports ideogram-ai/ideogram-4. Ideogram 4 uses a dual-transformer architecture: a conditional transformer (text + image tokens) and a separately-weighted unconditional transformer (image tokens only, zero LLM features). Both transformers shrink at coarse resolution, providing the same token-ratio benefit as single-transformer models.
Note: Ideogram 4's logit-normal noise schedule (std=1.75, mu=0) concentrates steps near the mid-sigma range. Fewer steps fall in the high-sigma coarse-eligible region compared to FLUX, which limits the achievable speedup at a given δ.
20-step (V4_DEFAULT_20 preset)
sglang generate \
--model-path ideogram-ai/ideogram-4 \
--prompt "A serene mountain lake at golden hour, photorealistic" \
--height 1024 --width 1024 \
--num-inference-steps 20 \
--dit-cpu-offload false \
--progressive-mode dct_rewind \
--progressive-levels 1 \
--progressive-delta 0.05
48-step (V4_QUALITY_48 preset)
sglang generate \
--model-path ideogram-ai/ideogram-4 \
--prompt "A serene mountain lake at golden hour, photorealistic" \
--height 1024 --width 1024 \
--num-inference-steps 48 \
--dit-cpu-offload false \
--progressive-mode dct_rewind \
--progressive-levels 1 \
--progressive-delta 0.05
Hardware: RTX A6000 48 GB, torch_sdpa, --dit-cpu-offload false. Timing = denoising loop only.
20-step (V4_DEFAULT_20)
| Config | Stage split | Denoise | Speedup |
|---|---|---|---|
| Fullres (baseline) | 20 @ 64² | 53.99 s | 1.00× |
| dct_rewind L1 δ=0.01 | 6 @ 32² + 14 @ 64² | 43.47 s | 1.24× |
| dct_rewind L1 δ=0.05 | 9 @ 32² + 11 @ 64² | 38.14 s | 1.42× |
| dct_rewind L1 δ=0.10 | 11 @ 32² + 9 @ 64² | 34.60 s | 1.56× |
48-step (V4_QUALITY_48)
| Config | Stage split | Denoise | Speedup |
|---|---|---|---|
| Fullres (baseline) | 48 @ 64² | 130.92 s | 1.00× |
| dct_rewind L1 δ=0.01 | 12 @ 32² + 36 @ 64² | 109.79 s | 1.19× |
| dct_rewind L1 δ=0.05 | 21 @ 32² + 27 @ 64² | 93.83 s | 1.40× |
| dct_rewind L1 δ=0.10 | 26 @ 32² + 22 @ 64² | 84.94 s | 1.54× |
from sglang.multimodal_gen import DiffGenerator

gen = DiffGenerator.from_pretrained(
    model_path="ideogram-ai/ideogram-4",
    dit_cpu_offload=False,
)
result = gen.generate(sampling_params_kwargs={
    "prompt": "A serene mountain lake at golden hour, photorealistic",
    "num_inference_steps": 48,
    "height": 1024,
    "width": 1024,
    "progressive_mode": "dct_rewind",
    "progressive_levels": 1,
    "progressive_delta": 0.05,
})
Sequence parallelism (SP) is not supported: do not combine progressive generation with --ulysses-degree or --ring-degree. The stage raises a RuntimeError if SP is enabled.
--enable-torch-compile.