Progressive Resolution Generation


Progressive resolution growing is an experimental feature for selected SGLang Diffusion pipelines. It runs the early denoising steps at a coarser latent resolution, then spectrally upsamples the latent before the remaining full-resolution steps. This reduces the quadratic attention cost of the DiT transformer. On the benchmark setups below, denoising speedups reach up to 1.62× on FLUX.1, 1.93× on FLUX.2, 2.33× on Z-Image, 2.78× on Wan 2.1 T2V, and 1.69× on Qwen-Image.

This page is intentionally not linked from the main documentation navigation while the feature is still experimental.

Based on Spectral Progressive Diffusion (arXiv 2605.18736).

Overview

DiT attention is O(n²) in sequence length. Halving the spatial resolution quarters the token count, so running the first N denoising steps at half resolution cuts the attention cost of those steps to 1/16 (~6%).
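The ~6% figure follows from simple token arithmetic. The sketch below assumes an 8× VAE downsampling factor and a 2×2 patchify step, which matches the 4,096 tokens listed for FLUX.1 at 1024×1024. `token_count` and `attention_cost_ratio` are illustrative helpers, not part of SGLang.

```python
def token_count(height: int, width: int, vae_scale: int = 8, patch: int = 2) -> int:
    """Number of image tokens for a given pixel resolution.

    vae_scale and patch are assumptions (8x VAE downsampling, 2x2 patchify).
    """
    return (height // vae_scale // patch) * (width // vae_scale // patch)


def attention_cost_ratio(levels: int = 1) -> float:
    """Relative self-attention cost after `levels` spatial halvings.

    Each halving of H and W divides the token count by 4, and attention
    cost grows with the square of the token count.
    """
    return (1 / 4**levels) ** 2


full = token_count(1024, 1024)  # tokens at full resolution
half = token_count(512, 512)    # tokens after one spatial halving
print(full, half, full / half)
print(f"attention cost at half resolution: {attention_cost_ratio(1):.2%}")
```

Only the attention term shrinks by 16×. Layers whose cost is linear in the token count shrink by 4×, so the end-to-end per-step saving lies between the two.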

The transition point (how many steps to run at each resolution) is computed from the Bayes-optimal frequency-activation criterion: frequencies that cannot be resolved at the coarse scale are not denoised there. Under this criterion the speedup is designed to be lossless by construction; the tolerance δ controls how aggressively steps are moved to the coarse stage.
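The spectral upsampling between stages can be illustrated with generic DCT zero-padding: transform the coarse signal, append zero high-frequency coefficients, and invert at the larger size. The sketch below is a one-dimensional, pure-Python illustration of that textbook technique. It is not SGLang's implementation, which operates on 2-D latents and may treat noise and the scheduler rewind differently. All function names are hypothetical.

```python
import math


def dct_ortho(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct_ortho(c):
    """Inverse of dct_ortho (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out


def dct_upsample_1d(x, factor=2):
    """Upsample by zero-padding the DCT spectrum.

    Coefficients are rescaled by sqrt(factor) so that amplitudes are
    preserved under the orthonormal normalisation.
    """
    n = len(x)
    coeffs = [c * math.sqrt(factor) for c in dct_ortho(x)]
    padded = coeffs + [0.0] * (n * (factor - 1))
    return idct_ortho(padded)


coarse = [0.0, 1.0, 2.0, 3.0]
print([round(v, 3) for v in dct_upsample_1d(coarse)])
```

Because the added coefficients are zero, the upsampled signal contains no frequencies beyond those the coarse stage could represent. The remaining full-resolution steps are what fill in the higher frequencies.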

| Model | Full-res tokens | Half-res tokens | Token-step ratio |
| --- | --- | --- | --- |
| FLUX.1 1024×1024 | 4,096 | 1,024 | 4.0× |
| FLUX.2 1024×1024 | 4,096 | 1,024 | 4.0× |
| Z-Image 1024×1024 | 4,096 | 1,024 | 4.0× |
| Wan 2.1 T2V 480×832 (81 frames) | 6,240 | 1,560 | 4.0× |

Parameters

| Parameter | CLI flag | Default | Description |
| --- | --- | --- | --- |
| progressive_mode | --progressive-mode | "fullres" | "fullres" disables (identical to standard generation). "dct_rewind" enables spectral upsample with scheduler rewind (recommended). "dct" enables upsample without rewind. |
| progressive_levels | --progressive-levels | 1 | Number of resolution halvings. 1 = one coarse stage (64×64 latent → 128×128). 2 = two coarse stages (32×32 → 64×64 → 128×128). |
| progressive_delta | --progressive-delta | 0.01 | Noise-dominated tolerance δ. Controls how many steps run at coarse resolution. Higher δ = more coarse steps = more speedup. |

Tip: Add --dit-cpu-offload false to keep the transformer GPU-resident. With CPU offload each step pays a fixed PCIe transfer cost regardless of sequence length, which dilutes the speedup.
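To see why a fixed per-step cost dilutes the speedup, consider a toy cost model in which every step pays the same overhead regardless of resolution. The per-step costs below are hypothetical placeholders chosen for illustration, not measurements, and `progressive_speedup` is not an SGLang function.

```python
def progressive_speedup(coarse_steps, full_steps, coarse_cost, full_cost, fixed_overhead=0.0):
    """Speedup of a two-stage schedule over running every step at full resolution.

    fixed_overhead is a per-step cost that does not depend on sequence
    length (for example, a weight transfer that happens on every step).
    """
    total_steps = coarse_steps + full_steps
    baseline = total_steps * (full_cost + fixed_overhead)
    progressive = (
        coarse_steps * (coarse_cost + fixed_overhead)
        + full_steps * (full_cost + fixed_overhead)
    )
    return baseline / progressive


# Hypothetical unit costs: a coarse step costs 1/4 of a full-resolution step.
no_overhead = progressive_speedup(28, 22, coarse_cost=0.25, full_cost=1.0)
with_overhead = progressive_speedup(28, 22, coarse_cost=0.25, full_cost=1.0, fixed_overhead=1.0)
print(f"without fixed overhead: {no_overhead:.2f}x")
print(f"with fixed overhead:    {with_overhead:.2f}x")
```

The overhead is added to both the numerator and the denominator, so the ratio moves toward 1 as the overhead grows.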


FLUX.1

Usage

bash
sglang generate \
    --model-path black-forest-labs/FLUX.1-dev \
    --prompt "A serene mountain lake at golden hour, photorealistic" \
    --num-inference-steps 50 \
    --dit-cpu-offload false \
    --progressive-mode dct_rewind \
    --progressive-levels 1 \
    --progressive-delta 0.05

Choosing delta

| δ | Stage split (50 steps total) | Denoising speedup |
| --- | --- | --- |
| 0.01 | 18 @ 64² + 32 @ 128² | 1.32× |
| 0.05 | 28 @ 64² + 22 @ 128² | 1.62× |

For most prompts, δ = 0.05 is recommended: of the two settings above, it gives the larger speedup with no visible degradation.
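When comparing δ values on your own prompts, it can help to generate the command lines programmatically. The sketch below only builds and prints commands using the flags shown in the usage example above. `build_command` is a hypothetical helper, and the commented `subprocess.run` call is one possible way to execute each command.

```python
import shlex


def build_command(delta: float, levels: int = 1) -> list[str]:
    """Build an `sglang generate` invocation for one progressive-delta value."""
    return [
        "sglang", "generate",
        "--model-path", "black-forest-labs/FLUX.1-dev",
        "--prompt", "A serene mountain lake at golden hour, photorealistic",
        "--num-inference-steps", "50",
        "--dit-cpu-offload", "false",
        "--progressive-mode", "dct_rewind",
        "--progressive-levels", str(levels),
        "--progressive-delta", str(delta),
    ]


for delta in (0.01, 0.05):
    print(shlex.join(build_command(delta)))
    # To run instead of print:
    #   import subprocess; subprocess.run(build_command(delta), check=True)
```

Keeping the prompt and step count fixed across the sweep makes the outputs directly comparable.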

Benchmark

Hardware: RTX A6000 48 GB, --dit-cpu-offload false. Timing = denoising loop only.

| Config | Stage split | Denoise time | Speedup |
| --- | --- | --- | --- |
| Fullres (baseline) | 50 @ 128² latent | 36.65 s | 1.00× |
| dct_rewind L1 δ=0.01 | 18@64² + 32@128² | 27.67 s | 1.32× |
| dct_rewind L1 δ=0.05 | 28@64² + 22@128² | 22.58 s | 1.62× |
| dct_rewind L2 δ=0.01 | 10@32² + 8@64² + 32@128² | 26.48 s | 1.38× |

Python API

python
from sglang.multimodal_gen import DiffGenerator

gen = DiffGenerator.from_pretrained(
    model_path="black-forest-labs/FLUX.1-dev",
    dit_cpu_offload=False,
)
result = gen.generate(sampling_params_kwargs={
    "prompt": "A serene mountain lake at golden hour, photorealistic",
    "num_inference_steps": 50,
    "height": 1024,
    "width": 1024,
    "progressive_mode": "dct_rewind",
    "progressive_levels": 1,
    "progressive_delta": 0.05,
})

FLUX.2

Supports FLUX.2-dev, FLUX.2-klein-4B, and FLUX.2-klein-9B.

Usage

bash
sglang generate \
    --model-path black-forest-labs/FLUX.2-klein-4B \
    --prompt "A serene mountain lake at golden hour, photorealistic" \
    --num-inference-steps 30 \
    --dit-cpu-offload false \
    --progressive-mode dct_rewind \
    --progressive-levels 1 \
    --progressive-delta 0.10

Benchmark

Hardware: RTX A6000 48 GB, --dit-cpu-offload false. Model: FLUX.2-klein-4B, 30 steps, 1024×1024. Timing = denoising loop only, averaged across 10 diverse prompts.

| Config | Stage split | Denoise time | Speedup |
| --- | --- | --- | --- |
| Fullres (baseline) | 30 @ 64² latent | 9.72 s | 1.00× |
| dct_rewind L1 δ=0.05 | 18@32² + 12@64² | 5.50 s | 1.77× |
| dct_rewind L1 δ=0.10 | 20@32² + 10@64² | 5.03 s | 1.93× |

Python API

python
from sglang.multimodal_gen import DiffGenerator

gen = DiffGenerator.from_pretrained(
    model_path="black-forest-labs/FLUX.2-klein-4B",
    dit_cpu_offload=False,
)
result = gen.generate(sampling_params_kwargs={
    "prompt": "A serene mountain lake at golden hour, photorealistic",
    "num_inference_steps": 30,
    "progressive_mode": "dct_rewind",
    "progressive_levels": 1,
    "progressive_delta": 0.10,
})

Wan 2.1 T2V

Supports Wan-AI/Wan2.1-T2V-1.3B-Diffusers and Wan-AI/Wan2.1-T2V-14B-Diffusers.

Note: Progressive generation grows only the spatial H×W dimensions. The temporal dimension T (number of latent frames) is kept fixed across all stages.
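The stage shapes for a video latent can be sketched as follows. `stage_latent_shapes` is a hypothetical helper. The 60×104 spatial latent for 480×832 matches the stage sizes listed under "Choosing delta" below, while the figure of 21 latent frames for 81 input frames is an assumption (4× temporal compression in the VAE).

```python
def stage_latent_shapes(frames_lat, height_lat, width_lat, levels=1):
    """Return the latent (T, H, W) for each stage, coarsest first.

    Only H and W are halved per level; T is the same in every stage.
    """
    shapes = []
    for level in range(levels, -1, -1):
        factor = 2**level
        if height_lat % factor or width_lat % factor:
            raise ValueError(
                f"latent {height_lat}x{width_lat} is not divisible by {factor}"
            )
        shapes.append((frames_lat, height_lat // factor, width_lat // factor))
    return shapes


# 21 latent frames is an assumption; 60x104 is the full-resolution latent.
for shape in stage_latent_shapes(21, 60, 104, levels=1):
    print(shape)
```

Since T is unchanged, the token count still drops by 4× per level, the same ratio as for the image models.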

Usage

bash
sglang generate \
    --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
    --prompt "A cheetah sprinting across the Serengeti at sunset, slow motion, photorealistic" \
    --num-inference-steps 50 \
    --num-frames 81 \
    --height 480 \
    --width 832 \
    --guidance-scale 5.0 \
    --flow-shift 5.0 \
    --dit-cpu-offload false \
    --progressive-mode dct_rewind \
    --progressive-levels 1 \
    --progressive-delta 0.05

Choosing delta

| δ | Stage split (50 steps total) | Denoising speedup |
| --- | --- | --- |
| 0.01 | 23 @ 30×52 + 27 @ 60×104 | 1.65× |
| 0.02 | 27 @ 30×52 + 23 @ 60×104 | 1.86× |
| 0.05 | 33 @ 30×52 + 17 @ 60×104 | 2.32× |
| 0.10 | 37 @ 30×52 + 13 @ 60×104 | 2.78× |

For most prompts, δ = 0.05 is recommended. δ = 0.10 gives the largest speedup measured here but should be validated on motion-heavy scenes.
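A rough way to sanity-check these numbers is to bracket each speedup between two idealised cost models: per-step cost linear in the token count (exponent 1) and quadratic in the token count (exponent 2, attention only). This is a simplification that ignores fixed overheads and the upsampling step itself, and `estimated_speedup` is a hypothetical helper. The stage splits and measured speedups are copied from the table above.

```python
def estimated_speedup(coarse_steps, full_steps, exponent):
    """Idealised speedup when per-step cost scales as tokens**exponent.

    The coarse stage is assumed to have 1/4 of the full-resolution tokens.
    """
    coarse_cost = (1 / 4) ** exponent
    total_steps = coarse_steps + full_steps
    return total_steps / (coarse_steps * coarse_cost + full_steps)


# (delta, coarse steps, full-resolution steps, measured speedup)
wan_rows = [
    (0.01, 23, 27, 1.65),
    (0.02, 27, 23, 1.86),
    (0.05, 33, 17, 2.32),
    (0.10, 37, 13, 2.78),
]

for delta, coarse, full, measured in wan_rows:
    low = estimated_speedup(coarse, full, exponent=1)
    high = estimated_speedup(coarse, full, exponent=2)
    print(f"delta={delta}: linear {low:.2f}x, measured {measured:.2f}x, quadratic {high:.2f}x")
```

For every row of the Wan 2.1 table the measured speedup falls between the linear and quadratic estimates, which is what one would expect from a transformer whose cost mixes attention with token-linear layers.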

Python API

python
from sglang.multimodal_gen import DiffGenerator

gen = DiffGenerator.from_pretrained(
    model_path="Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    dit_cpu_offload=False,
    flow_shift=5.0,
)
result = gen.generate(sampling_params_kwargs={
    "prompt": "A cheetah sprinting across the Serengeti at sunset, slow motion, photorealistic",
    "num_inference_steps": 50,
    "num_frames": 81,
    "height": 480,
    "width": 832,
    "guidance_scale": 5.0,
    "progressive_mode": "dct_rewind",
    "progressive_levels": 1,
    "progressive_delta": 0.05,
})

Z-Image

Supports Tongyi-MAI/Z-Image. Z-Image uses the same VAE as FLUX.1 (FluxVAEConfig), so the power-law spectrum constants are identical. The progressive stage handles Z-Image's 5-D latent format [B, C, 1, H, W] with squeeze/unsqueeze hooks and recomputes caption+image RoPE positional embeddings on each stage transition.

Note: Always specify --height 1024 --width 1024 (or another resolution whose latent dimensions H_lat and W_lat are both divisible by 2). Z-Image's default resolution (360×640) produces a 45×80 latent, and H_lat = 45 is not divisible by the patch size.
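The divisibility rule in the note can be checked before launching a job. The sketch below assumes an 8× VAE downsampling factor, which is what the 360×640 → 45×80 example implies. `latent_size` and `supports_progressive` are hypothetical helpers, not SGLang functions.

```python
def latent_size(height: int, width: int, vae_scale: int = 8) -> tuple[int, int]:
    """Latent (H_lat, W_lat) for a pixel resolution, assuming 8x VAE downsampling."""
    return height // vae_scale, width // vae_scale


def supports_progressive(height: int, width: int, vae_scale: int = 8) -> bool:
    """True if both latent dimensions are divisible by 2, as the note requires."""
    h_lat, w_lat = latent_size(height, width, vae_scale)
    return h_lat % 2 == 0 and w_lat % 2 == 0


for height, width in [(360, 640), (1024, 1024)]:
    print((height, width), latent_size(height, width), supports_progressive(height, width))
```

This only encodes the rule stated in the note. Using more than one progressive level may impose stricter divisibility requirements.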

Usage

bash
# Standard fullres — unchanged behavior
sglang generate --model-path Tongyi-MAI/Z-Image \
    --prompt "A serene mountain lake at golden hour, photorealistic" \
    --height 1024 --width 1024

# Progressive dct_rewind L1 δ=0.10 → 2.33× denoising speedup
sglang generate --model-path Tongyi-MAI/Z-Image \
    --prompt "A serene mountain lake at golden hour, photorealistic" \
    --height 1024 --width 1024 \
    --num-inference-steps 50 \
    --dit-cpu-offload false \
    --progressive-mode dct_rewind \
    --progressive-levels 1 \
    --progressive-delta 0.10

Choosing delta

| δ | Stage split (50 steps total) | Denoising speedup |
| --- | --- | --- |
| 0.01 | 26 @ 64² + 24 @ 128² | 1.53× |
| 0.05 | 35 @ 64² + 15 @ 128² | 2.03× |
| 0.10 | 42 @ 64² + 8 @ 128² | 2.33× |

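Z-Image achieves higher progressive speedups than FLUX.1 at the same δ. It uses dual CFG (two forward passes per step), which doubles the absolute attention savings at coarse resolution, and its schedule places more steps in the coarse stage (for example, 26 of 50 steps at δ = 0.01, versus 18 for FLUX.1). δ = 0.10 is the recommended tradeoff.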

Python API

python
from sglang.multimodal_gen import DiffGenerator

gen = DiffGenerator.from_pretrained(
    model_path="Tongyi-MAI/Z-Image",
    dit_cpu_offload=False,
)
result = gen.generate(sampling_params_kwargs={
    "prompt": "A serene mountain lake at golden hour, photorealistic",
    "num_inference_steps": 50,
    "height": 1024,
    "width": 1024,
    "progressive_mode": "dct_rewind",
    "progressive_levels": 1,
    "progressive_delta": 0.10,
})

Qwen-Image

Qwen-Image uses the same 2×2 patchify convention as FLUX.1 (in_channels=64, C=16), so the same progressive stage wires in with model-specific hooks for RoPE (freqs_cis) and spatial metadata (img_shapes).

bash
# Standard fullres — unchanged behavior
sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A serene mountain lake at golden hour"

# Progressive dct_rewind L1 δ=0.20 → 1.69× denoising speedup
sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A serene mountain lake at golden hour" \
    --progressive-mode dct_rewind --progressive-levels 1 --progressive-delta 0.20 \
    --num-inference-steps 30 --dit-cpu-offload false

Hardware: RTX A6000 48 GB, --dit-cpu-offload false. Timing = denoising loop only.

| Config | Stage split | Denoise time | Speedup |
| --- | --- | --- | --- |
| Fullres (baseline) | 30 @ 128² | 43.00 s | 1.00× |
| dct_rewind L1 δ=0.05 | 13@64² + 17@128² | 33.25 s | 1.29× |
| dct_rewind L1 δ=0.10 | 16@64² + 14@128² | 33.86 s | 1.27× |
| dct_rewind L1 δ=0.20 | 19@64² + 11@128² | 25.40 s | 1.69× |

Limitations

  • Sequence parallelism is incompatible. Progressive modes cannot be combined with --ulysses-degree or --ring-degree. The stage raises a RuntimeError if SP is enabled.
  • torch.compile is incompatible. Compiled kernels have a fixed sequence length, so the resolution transition causes a recompile or an error. Run progressive modes without --enable-torch-compile.
  • Cache-DiT interaction is experimental. The stage refreshes Cache-DiT context at resolution transitions, but quality and speedup should be benchmarked before relying on this combination.
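The first two limitations can be caught early with a pre-flight check in a launch script. The sketch below is a hypothetical helper, not SGLang code: the argument names mirror the CLI flags above, and treating a degree greater than 1 as "SP enabled" is an assumption.

```python
def check_progressive_config(
    progressive_mode: str,
    ulysses_degree: int = 1,
    ring_degree: int = 1,
    enable_torch_compile: bool = False,
) -> None:
    """Raise early if progressive generation is combined with an unsupported option."""
    if progressive_mode == "fullres":
        return  # progressive generation is disabled; nothing to check
    if ulysses_degree > 1 or ring_degree > 1:
        # Assumption: a degree above 1 means sequence parallelism is enabled.
        raise RuntimeError("progressive modes cannot be combined with sequence parallelism")
    if enable_torch_compile:
        # This helper chooses to reject the combination outright.
        raise ValueError("progressive modes should be run without torch.compile")


check_progressive_config("dct_rewind")  # passes silently
try:
    check_progressive_config("dct_rewind", ulysses_degree=2)
except RuntimeError as err:
    print(f"rejected: {err}")
```

The Cache-DiT combination is not rejected here because it is described as experimental rather than unsupported.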

References