Progressive Resolution Generation

Progressive resolution growing is an experimental feature for selected SGLang Diffusion pipelines. It runs the early denoising steps at a coarser latent resolution and spectrally upsamples the latent before the full-resolution steps. Because the coarse steps process fewer tokens, they avoid much of the quadratic attention cost of the DiT transformer. On the benchmark setups below, denoising speedups reach up to 1.62× on FLUX.1, 1.93× on FLUX.2, 2.33× on Z-Image, 2.78× on Wan 2.1 T2V, and 1.69× on Qwen-Image.

This page is intentionally not linked from the main documentation navigation while the feature is still experimental.

Based on Spectral Progressive Diffusion (arXiv 2605.18736).

Overview

DiT attention is O(n²) in the sequence length n. Running the first N denoising steps at half the spatial resolution quarters the token count, which cuts the attention cost for those steps to 1/16 (about 6%) of the full-resolution cost.
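The 6% figure follows from simple arithmetic. The sketch below assumes a 128×128 latent with 2×2 patchification (4,096 tokens, matching the FLUX.1 row of the table in this section); the function names are illustrative and are not part of SGLang.

```python
def num_tokens(latent_h: int, latent_w: int, patch: int = 2) -> int:
    """Sequence length after patchifying an H x W latent with patch x patch patches."""
    return (latent_h // patch) * (latent_w // patch)


def attention_cost_fraction(full_tokens: int, coarse_tokens: int) -> float:
    """Relative self-attention cost of one coarse step, assuming cost grows as n**2."""
    return (coarse_tokens / full_tokens) ** 2


full = num_tokens(128, 128)    # full-resolution latent
coarse = num_tokens(64, 64)    # half the height and half the width
fraction = attention_cost_fraction(full, coarse)
print(full, coarse, fraction)  # 4096 1024 0.0625
```

Only the attention term shrinks this much; layers whose cost is linear in the token count shrink to 1/4 rather than 1/16.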

The transition point (how many steps to run at each resolution) is computed from the Bayes-optimal frequency-activation criterion: frequencies that cannot be resolved at the coarse scale are not denoised there. This is what keeps the speedup close to lossless by construction: the coarse stage skips only frequencies that are still noise-dominated, up to the tolerance δ.

Model	Full-res tokens	Half-res tokens	Token-step ratio
FLUX.1 1024×1024	4,096	1,024	4.0×
FLUX.2 1024×1024	4,096	1,024	4.0×
Z-Image 1024×1024	4,096	1,024	4.0×
Wan 2.1 T2V 480×832 (81 frames)	6,240	1,560	4.0×

Parameters

Parameter	CLI flag	Default	Description
`progressive_mode`	`--progressive-mode`	`"fullres"`	`"fullres"` disables (identical to standard generation). `"dct_rewind"` enables spectral upsample with scheduler rewind (recommended). `"dct"` enables upsample without rewind.
`progressive_levels`	`--progressive-levels`	`1`	Number of resolution halvings. `1` = one coarse stage (64×64 latent → 128×128). `2` = two coarse stages (32×32 → 64×64 → 128×128).
`progressive_delta`	`--progressive-delta`	`0.01`	Noise-dominated tolerance δ. Controls how many steps run at coarse resolution. Higher δ = more coarse steps = more speedup.
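The `"dct"` and `"dct_rewind"` modes rely on a spectral upsample. One common way to do this is to zero-pad the DCT coefficients of the coarse signal, which keeps every low-frequency component unchanged and leaves the new high-frequency bands empty. The pure-Python sketch below illustrates that idea on a tiny grid. It is an assumption about the general technique, not SGLang's implementation, and all function names are hypothetical.

```python
import math


def dct2_ortho(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)))
    return out


def idct2_ortho(coeffs):
    """Inverse of dct2_ortho (an orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = 0.0
        for k in range(n):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * coeffs[k] * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(total)
    return out


def spectral_upsample_1d(x, factor=2):
    """Upsample by zero-padding DCT coefficients; low frequencies are kept as-is."""
    n = len(x)
    m = n * factor
    gain = math.sqrt(m / n)  # compensates for the longer orthonormal basis
    padded = [gain * c for c in dct2_ortho(x)] + [0.0] * (m - n)
    return idct2_ortho(padded)


def spectral_upsample_2d(grid, factor=2):
    """Apply the 1-D upsample along rows, then along columns."""
    rows = [spectral_upsample_1d(row, factor) for row in grid]
    cols = [spectral_upsample_1d(list(col), factor) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


coarse = [[1.0, 1.0], [1.0, 1.0]]      # a constant 2x2 stand-in for a latent
fine = spectral_upsample_2d(coarse)    # 4x4 grid with the same constant value
print(len(fine), len(fine[0]))
```

A real pipeline would use a batched tensor DCT rather than Python lists, and how the empty high-frequency bands are treated after the upsample is not covered by this sketch.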

Tip: Add `--dit-cpu-offload false` to keep the transformer GPU-resident. With CPU offload, each step pays a fixed PCIe transfer cost regardless of sequence length, which dilutes the speedup.
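When launching runs from a script, the flags documented in this section can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not an SGLang API; it only uses the CLI flags listed on this page and rejects unknown mode names.

```python
import shlex


def build_generate_command(model_path, prompt, mode="dct_rewind", levels=1, delta=0.01, steps=50):
    """Assemble an `sglang generate` command line from the documented flags."""
    if mode not in ("fullres", "dct", "dct_rewind"):
        raise ValueError(f"unknown progressive mode: {mode!r}")
    args = [
        "sglang", "generate",
        "--model-path", model_path,
        "--prompt", prompt,
        "--num-inference-steps", str(steps),
        "--dit-cpu-offload", "false",  # keep the transformer GPU-resident
        "--progressive-mode", mode,
        "--progressive-levels", str(levels),
        "--progressive-delta", str(delta),
    ]
    return shlex.join(args)  # quotes the prompt safely for a POSIX shell


command = build_generate_command(
    "black-forest-labs/FLUX.1-dev", "A serene mountain lake", delta=0.05
)
print(command)
```

The resulting string can be pasted into a shell or passed to a job scheduler.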

FLUX.1

Usage

```bash
sglang generate \
    --model-path black-forest-labs/FLUX.1-dev \
    --prompt "A serene mountain lake at golden hour, photorealistic" \
    --num-inference-steps 50 \
    --dit-cpu-offload false \
    --progressive-mode dct_rewind \
    --progressive-levels 1 \
    --progressive-delta 0.05
```

Choosing delta

δ	Coarse steps (50 total)	Denoising speedup
`0.01`	18 @ 64² + 32 @ 128²	1.32×
`0.05`	28 @ 64² + 22 @ 128²	1.62×

For most prompts, δ = 0.05 is recommended: it gives the larger speedup of the two settings with no visible degradation.
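To get a feel for how a stage split translates into savings, the splits from the table above can be fed into an idealized cost model. This is a rough sketch rather than SGLang's own estimate: it assumes per-step cost scales as a pure power of the token count (exponent 1 for token-wise layers, 2 for attention) and ignores every fixed overhead. The function name is illustrative.

```python
def idealized_speedup(stages, full_tokens, exponent):
    """Speedup if per-step cost scales as tokens**exponent and nothing else costs time.

    stages: list of (num_steps, tokens_per_step) pairs.
    """
    total_steps = sum(steps for steps, _ in stages)
    baseline = total_steps * full_tokens ** exponent
    progressive = sum(steps * tokens ** exponent for steps, tokens in stages)
    return baseline / progressive


# FLUX.1 at 1024x1024: 64^2 latent -> 1,024 tokens, 128^2 latent -> 4,096 tokens.
split_001 = [(18, 1024), (32, 4096)]
split_005 = [(28, 1024), (22, 4096)]

for name, split in (("0.01", split_001), ("0.05", split_005)):
    linear = idealized_speedup(split, 4096, exponent=1)     # token-wise layers only
    quadratic = idealized_speedup(split, 4096, exponent=2)  # attention only
    print(f"delta={name}: linear {linear:.2f}x, quadratic {quadratic:.2f}x")
```

Under these assumptions the δ = 0.05 split comes out at about 1.72× (linear) and 2.11× (quadratic). The measured FLUX.1 speedups are lower than both, which is consistent with the model ignoring fixed per-step overheads.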

Benchmark

Hardware: RTX A6000 48 GB, `--dit-cpu-offload false`. Timing = denoising loop only.

Config	Stage split	Denoise	Speedup
Fullres (baseline)	50 @ 128² latent	36.65 s	1.00×
dct_rewind L1 δ=0.01	18@64² + 32@128²	27.67 s	1.32×
dct_rewind L1 δ=0.05	28@64² + 22@128²	22.58 s	1.62×
dct_rewind L2 δ=0.01	10@32² + 8@64² + 32@128²	26.48 s	1.38×
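The Speedup column is the baseline denoising time divided by the time of each configuration. The short check below recomputes it from the timings in the table; the variable names are illustrative.

```python
def speedup(baseline_seconds: float, seconds: float) -> float:
    """Denoising speedup relative to the full-resolution baseline."""
    return baseline_seconds / seconds


baseline = 36.65  # Fullres, 50 steps at 128^2 latent
timings = {
    "dct_rewind L1 delta=0.01": 27.67,
    "dct_rewind L1 delta=0.05": 22.58,
    "dct_rewind L2 delta=0.01": 26.48,
}
speedups = {name: round(speedup(baseline, t), 2) for name, t in timings.items()}
for name, value in speedups.items():
    print(f"{name}: {value:.2f}x")
```

Rounded to two decimals, this gives 1.32×, 1.62×, and 1.38×.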

Python API

```python
from sglang.multimodal_gen import DiffGenerator

# Keep the transformer on the GPU so offload transfers do not dilute the speedup.
gen = DiffGenerator.from_pretrained(
    model_path="black-forest-labs/FLUX.1-dev",
    dit_cpu_offload=False,
)
# Progressive settings are passed alongside the usual sampling parameters.
result = gen.generate(sampling_params_kwargs={
    "prompt": "A serene mountain lake at golden hour, photorealistic",
    "num_inference_steps": 50,
    "height": 1024,
    "width": 1024,
    "progressive_mode": "dct_rewind",
    "progressive_levels": 1,
    "progressive_delta": 0.05,
})
```

FLUX.2

Supports FLUX.2-dev, FLUX.2-klein-4B, and FLUX.2-klein-9B.

Usage

```bash
sglang generate \
    --model-path black-forest-labs/FLUX.2-klein-4B \
    --prompt "A serene mountain lake at golden hour, photorealistic" \
    --num-inference-steps 30 \
    --dit-cpu-offload false \
    --progressive-mode dct_rewind \
    --progressive-levels 1 \
    --progressive-delta 0.10
```

Benchmark

Hardware: RTX A6000 48 GB, `--dit-cpu-offload false`. Model: FLUX.2-klein-4B, 30 steps, 1024×1024. Timing = denoising loop only, averaged across 10 diverse prompts.

Config	Stage split	Denoise	Speedup
Fullres (baseline)	30 @ 64² latent	9.72 s	1.00×
dct_rewind L1 δ=0.05	18@32² + 12@64²	5.50 s	1.77×
dct_rewind L1 δ=0.10	20@32² + 10@64²	5.03 s	1.93×

Python API

```python
from sglang.multimodal_gen import DiffGenerator

# Keep the transformer on the GPU so offload transfers do not dilute the speedup.
gen = DiffGenerator.from_pretrained(
    model_path="black-forest-labs/FLUX.2-klein-4B",
    dit_cpu_offload=False,
)
# Progressive settings are passed alongside the usual sampling parameters.
result = gen.generate(sampling_params_kwargs={
    "prompt": "A serene mountain lake at golden hour, photorealistic",
    "num_inference_steps": 30,
    "progressive_mode": "dct_rewind",
    "progressive_levels": 1,
    "progressive_delta": 0.10,
})
```

Wan 2.1 T2V

Supports Wan-AI/Wan2.1-T2V-1.3B-Diffusers and Wan-AI/Wan2.1-T2V-14B-Diffusers.

Note: Progressive generation grows only the spatial H×W dimensions. The temporal dimension T (number of latent frames) is kept fixed across all stages.
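The sketch below lists the latent shape at each stage for a video request, with T held fixed and only H and W halved. It assumes a video VAE with 8× spatial and 4× temporal compression (so 81 frames map to 21 latent frames); the temporal factor is an assumption not stated on this page, and the function name is hypothetical.

```python
def stage_latent_shapes(frames, height, width, levels=1, spatial_factor=8, temporal_factor=4):
    """Return (T, H, W) latent shapes from the coarsest stage to full resolution.

    Assumes 8x spatial and 4x temporal VAE compression (an assumption of this sketch).
    """
    t = (frames - 1) // temporal_factor + 1  # temporal size, identical at every stage
    h, w = height // spatial_factor, width // spatial_factor
    shapes = []
    for level in range(levels, -1, -1):
        divisor = 2 ** level
        if h % divisor or w % divisor:
            raise ValueError(f"latent {h}x{w} cannot be halved {level} time(s)")
        shapes.append((t, h // divisor, w // divisor))
    return shapes


shapes = stage_latent_shapes(frames=81, height=480, width=832, levels=1)
print(shapes)  # [(21, 30, 52), (21, 60, 104)]
```

The spatial sizes 30×52 and 60×104 match the stage splits listed for Wan 2.1 T2V on this page.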

Usage

```bash
sglang generate \
    --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
    --prompt "A cheetah sprinting across the Serengeti at sunset, slow motion, photorealistic" \
    --num-inference-steps 50 \
    --num-frames 81 \
    --height 480 \
    --width 832 \
    --guidance-scale 5.0 \
    --flow-shift 5.0 \
    --dit-cpu-offload false \
    --progressive-mode dct_rewind \
    --progressive-levels 1 \
    --progressive-delta 0.05
```

Choosing delta

δ	Coarse steps (50 total)	Denoising speedup
`0.01`	23 @ 30×52 + 27 @ 60×104	1.65×
`0.02`	27 @ 30×52 + 23 @ 60×104	1.86×
`0.05`	33 @ 30×52 + 17 @ 60×104	2.32×
`0.10`	37 @ 30×52 + 13 @ 60×104	2.78×

For most prompts, δ = 0.05 is recommended. δ = 0.10 gives the largest speedup but should be validated on motion-heavy scenes.

Python API

```python
from sglang.multimodal_gen import DiffGenerator

# flow_shift is set when the generator is created; it matches --flow-shift 5.0 in the CLI example.
gen = DiffGenerator.from_pretrained(
    model_path="Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    dit_cpu_offload=False,
    flow_shift=5.0,
)
# Only height and width grow across stages; num_frames stays fixed.
result = gen.generate(sampling_params_kwargs={
    "prompt": "A cheetah sprinting across the Serengeti at sunset, slow motion, photorealistic",
    "num_inference_steps": 50,
    "num_frames": 81,
    "height": 480,
    "width": 832,
    "guidance_scale": 5.0,
    "progressive_mode": "dct_rewind",
    "progressive_levels": 1,
    "progressive_delta": 0.05,
})
```

Z-Image

Supports Tongyi-MAI/Z-Image. Z-Image uses the same VAE as FLUX.1 (`FluxVAEConfig`), so the power-law spectrum constants are identical. The progressive stage handles Z-Image's 5-D latent format `[B, C, 1, H, W]` with squeeze/unsqueeze hooks and recomputes the caption and image RoPE positional embeddings on each stage transition.

Note: Always specify `--height 1024 --width 1024` (or another resolution where the latent height H_lat and latent width W_lat are both divisible by 2). Z-Image's default resolution (360×640) produces a 45×80 latent, and H_lat = 45 is not divisible by the patch size.
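A request can be checked against this rule before it is submitted. The sketch below assumes the 8× VAE downsampling implied by the 360×640 → 45×80 example and applies the divisible-by-2 rule exactly as stated in the note; the helper names are hypothetical, and stricter requirements for more than one level are not covered.

```python
def latent_size(height: int, width: int, vae_factor: int = 8):
    """Latent (H_lat, W_lat) for a pixel resolution, assuming 8x VAE downsampling."""
    return height // vae_factor, width // vae_factor


def supports_progressive(height: int, width: int, vae_factor: int = 8) -> bool:
    """True if both latent dimensions are divisible by 2, as the note requires."""
    h_lat, w_lat = latent_size(height, width, vae_factor)
    return h_lat % 2 == 0 and w_lat % 2 == 0


default_ok = supports_progressive(360, 640)      # 45x80 latent: 45 is odd
square_ok = supports_progressive(1024, 1024)     # 128x128 latent
print(latent_size(360, 640), default_ok)         # (45, 80) False
print(latent_size(1024, 1024), square_ok)        # (128, 128) True
```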

Usage

```bash
# Standard fullres: unchanged behavior
sglang generate --model-path Tongyi-MAI/Z-Image \
    --prompt "A serene mountain lake at golden hour, photorealistic" \
    --height 1024 --width 1024

# Progressive dct_rewind L1 δ=0.10 → 2.33× denoising speedup
sglang generate --model-path Tongyi-MAI/Z-Image \
    --prompt "A serene mountain lake at golden hour, photorealistic" \
    --height 1024 --width 1024 \
    --num-inference-steps 50 \
    --dit-cpu-offload false \
    --progressive-mode dct_rewind \
    --progressive-levels 1 \
    --progressive-delta 0.10
```

Choosing delta

δ	Coarse steps (50 total)	Denoising speedup
`0.01`	26 @ 64² + 24 @ 128²	1.53×
`0.05`	35 @ 64² + 15 @ 128²	2.03×
`0.10`	42 @ 64² + 8 @ 128²	2.33×

Z-Image achieves higher progressive speedups than FLUX.1 at the same δ. At δ = 0.05 it runs 35 of 50 steps at coarse resolution, versus 28 for FLUX.1, and it uses dual CFG (two forward passes per step), which doubles the absolute attention savings at coarse resolution. δ = 0.10 is the recommended tradeoff.

Python API

```python
from sglang.multimodal_gen import DiffGenerator

# Keep the transformer on the GPU so offload transfers do not dilute the speedup.
gen = DiffGenerator.from_pretrained(
    model_path="Tongyi-MAI/Z-Image",
    dit_cpu_offload=False,
)
# Set height and width explicitly: the default 360×640 gives an odd latent height.
result = gen.generate(sampling_params_kwargs={
    "prompt": "A serene mountain lake at golden hour, photorealistic",
    "num_inference_steps": 50,
    "height": 1024,
    "width": 1024,
    "progressive_mode": "dct_rewind",
    "progressive_levels": 1,
    "progressive_delta": 0.10,
})
```

Qwen-Image

Qwen-Image uses the same 2×2 patchify convention as FLUX.1 (`in_channels=64`, C=16), so the same progressive stage wires in with model-specific hooks for RoPE (`freqs_cis`) and spatial metadata (`img_shapes`).

```bash
# Standard fullres: unchanged behavior
sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A serene mountain lake at golden hour"

# Progressive dct_rewind L1 δ=0.20 → 1.69× denoising speedup
sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A serene mountain lake at golden hour" \
    --progressive-mode dct_rewind --progressive-levels 1 --progressive-delta 0.20 \
    --num-inference-steps 30 --dit-cpu-offload false
```

Hardware: RTX A6000 48 GB, `--dit-cpu-offload false`. Timing = denoising loop only.

Config	Stage split	Denoise	Speedup
Fullres (baseline)	30 @ 128²	43.00 s	1.00×
dct_rewind L1 δ=0.05	13@64² + 17@128²	33.25 s	1.29×
dct_rewind L1 δ=0.10	16@64² + 14@128²	33.86 s	1.27×
dct_rewind L1 δ=0.20	19@64² + 11@128²	25.40 s	1.69×

Limitations

Sequence parallelism is incompatible. Progressive mode cannot be combined with `--ulysses-degree` or `--ring-degree`; the stage raises a `RuntimeError` if SP is enabled.
`torch.compile` is incompatible. Compiled kernels have a fixed sequence length, so the resolution transition causes a recompile or an error. Run progressive mode without `--enable-torch-compile`.
Cache-DiT interaction is experimental. The stage refreshes the Cache-DiT context at resolution transitions, but quality and speedup should be benchmarked before relying on this combination.
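These constraints can be checked in a launcher script before a job is submitted. The function below is a hypothetical pre-flight helper, not part of SGLang; it assumes that a parallelism degree of 1 means sequence parallelism is disabled, and its argument names simply mirror the CLI flags mentioned above.

```python
def check_progressive_compat(progressive_mode, ulysses_degree=1, ring_degree=1,
                             enable_torch_compile=False):
    """Return a list of problems; an empty list means no known conflict.

    Assumes a degree of 1 means sequence parallelism is off (assumption of this sketch).
    """
    problems = []
    if progressive_mode == "fullres":
        return problems  # progressive generation is disabled, nothing to check
    if ulysses_degree > 1 or ring_degree > 1:
        problems.append("sequence parallelism cannot be combined with progressive mode")
    if enable_torch_compile:
        problems.append("torch.compile cannot be combined with progressive mode")
    return problems


issues = check_progressive_compat("dct_rewind", ulysses_degree=2, enable_torch_compile=True)
for issue in issues:
    print("incompatible:", issue)
```

Cache-DiT is deliberately not flagged here, since that combination is experimental rather than unsupported.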
