SGLang integrates Cache-DiT, a caching acceleration engine for Diffusion Transformers (DiT), to achieve up to 1.69x inference speedup with minimal quality loss.
Cache-DiT uses intelligent caching strategies to skip redundant computation in the denoising loop:
Enable Cache-DiT by setting the `SGLANG_CACHE_DIT_ENABLED` environment variable when running `sglang generate` or `sglang serve`:
SGLANG_CACHE_DIT_ENABLED=true \
sglang generate --model-path Qwen/Qwen-Image \
--prompt "A beautiful sunset over the mountains"
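If you launch SGLang from a Python script rather than a shell, the same invocation can be sketched as below. This is a minimal sketch, not an SGLang API: `build_cache_dit_launch` is a hypothetical helper, and it uses only the environment variable and CLI flags shown above.

```python
import os
import shlex


def build_cache_dit_launch(model_path, prompt, extra_env=None):
    """Return (env, argv) for running `sglang generate` with Cache-DiT enabled.

    Hypothetical helper: it only assembles the environment and arguments
    shown in the shell example; it does not call into SGLang itself.
    """
    env = dict(os.environ)
    env["SGLANG_CACHE_DIT_ENABLED"] = "true"
    env.update(extra_env or {})
    argv = ["sglang", "generate", "--model-path", model_path, "--prompt", prompt]
    return env, argv


env, argv = build_cache_dit_launch(
    "Qwen/Qwen-Image", "A beautiful sunset over the mountains"
)
print(shlex.join(argv))
# To actually run it: subprocess.run(argv, env=env, check=True)
```

Building the argument list instead of a shell string avoids quoting problems when the prompt contains spaces or quotes.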
Cache-DiT can load acceleration configs from a custom YAML or JSON file. For
diffusers pipelines (`--backend diffusers`), pass the file path via `--cache-dit-config`. This
flow requires cache-dit >= 1.2.0, which provides `cache_dit.load_configs`.
Define a cache.yaml file that contains:
cache_config:
  max_warmup_steps: 8
  warmup_interval: 2
  max_cached_steps: -1
  max_continuous_cached_steps: 2
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1
Then apply the config with:
sglang generate \
--backend diffusers \
--model-path Qwen/Qwen-Image \
--cache-dit-config cache.yaml \
--prompt "A beautiful sunset over the mountains"
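Since `--cache-dit-config` is described as accepting a YAML/JSON path, the same settings could also be generated from Python as a JSON file. The sketch below assumes the JSON file mirrors the YAML key layout exactly; that layout, the `write_config` helper, and the `cache.json` file name are assumptions for illustration, not documented behavior.

```python
import json
import tempfile
from pathlib import Path

# Same keys and values as the cache.yaml example above.
# Assumption: the JSON form uses the identical nesting.
cache_config = {
    "cache_config": {
        "max_warmup_steps": 8,
        "warmup_interval": 2,
        "max_cached_steps": -1,
        "max_continuous_cached_steps": 2,
        "Fn_compute_blocks": 1,
        "Bn_compute_blocks": 0,
        "residual_diff_threshold": 0.12,
        "enable_taylorseer": True,
        "taylorseer_order": 1,
    }
}


def write_config(config, path):
    """Hypothetical helper: serialize a config dict to a JSON file."""
    path = Path(path)
    path.write_text(json.dumps(config, indent=2))
    return path


config_path = write_config(cache_config, Path(tempfile.gettempdir()) / "cache.json")
loaded = json.loads(config_path.read_text())
print(config_path.name, sorted(loaded["cache_config"]))
```

The resulting path would then be passed to `--cache-dit-config` in place of `cache.yaml`.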
cache_config:
  max_warmup_steps: 8
  warmup_interval: 2
  max_cached_steps: -1
  max_continuous_cached_steps: 2
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1
  # Must set the num_inference_steps for SCM. The SCM will automatically
  # generate the steps computation mask based on the num_inference_steps.
  # Reference: https://cache-dit.readthedocs.io/en/latest/user_guide/CACHE_API/#scm-steps-computation-masking
  num_inference_steps: 28
  steps_computation_mask: fast
cache_config:
  max_warmup_steps: 8
  warmup_interval: 2
  max_cached_steps: -1
  max_continuous_cached_steps: 2
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1
  num_inference_steps: 28
  steps_computation_mask: fast
  enable_separate_cfg: true # e.g., Qwen-Image, Wan, Chroma, Ovis-Image, etc.
Define a parallelism-only config file, `parallel.yaml`, that contains:
parallelism_config:
  ulysses_size: auto
  attention_backend: native
`ulysses_size: auto` tells cache-dit to use the detected `world_size` as the Ulysses size. To override it, set an explicit integer such as 4.
Then apply the distributed config, adding `--num-gpus N` to set the number of GPUs for distributed inference:
sglang generate \
--backend diffusers \
--num-gpus 4 \
--model-path Qwen/Qwen-Image \
--cache-dit-config parallel.yaml \
--prompt "A futuristic cityscape at sunset"
You can also define a 2D parallelism config file, `parallel_2d.yaml`, that contains:
parallelism_config:
  ulysses_size: auto
  tp_size: 2
  attention_backend: native
Here `tp_size: 2` enables tensor parallelism with size 2, and `ulysses_size: auto` makes cache-dit use `world_size // tp_size` as the Ulysses size. Apply it with `--cache-dit-config parallel_2d.yaml`.
You can also define a 3D parallelism config file, `parallel_3d.yaml`, that contains:
parallelism_config:
  ulysses_size: 2
  ring_size: 2
  tp_size: 2
  attention_backend: native
Here `ulysses_size: 2`, `ring_size: 2`, and `tp_size: 2` combine Ulysses parallelism, ring parallelism, and tensor parallelism, each with size 2. Apply it with `--cache-dit-config parallel_3d.yaml`.
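The size arithmetic in the parallelism configs above can be sketched in Python. This is an illustrative sketch under stated assumptions, not cache-dit's implementation: `resolve_parallel_sizes` is a hypothetical helper, the rule `auto = world_size // tp_size` comes from the 2D example, and two things are my extrapolation: dividing by `ring_size` as well, and requiring the product of the three sizes to equal the number of GPUs.

```python
def resolve_parallel_sizes(world_size, ulysses_size="auto", ring_size=1, tp_size=1):
    """Resolve ulysses_size='auto' and check that the sizes cover world_size.

    Hypothetical helper. Assumption: every GPU belongs to exactly one
    (ulysses, ring, tp) position, so the product must equal world_size.
    """
    fixed = ring_size * tp_size
    if ulysses_size == "auto":
        if world_size % fixed != 0:
            raise ValueError(
                f"world_size={world_size} is not divisible by "
                f"ring_size * tp_size = {fixed}"
            )
        ulysses_size = world_size // fixed
    if ulysses_size * fixed != world_size:
        raise ValueError(
            f"ulysses_size * ring_size * tp_size = {ulysses_size * fixed}, "
            f"but world_size = {world_size}"
        )
    return {"ulysses_size": ulysses_size, "ring_size": ring_size, "tp_size": tp_size}


# parallel.yaml with --num-gpus 4: auto resolves to the full world size.
print(resolve_parallel_sizes(4))
# parallel_2d.yaml with --num-gpus 4: auto resolves to 4 // 2.
print(resolve_parallel_sizes(4, tp_size=2))
# parallel_3d.yaml: 2 * 2 * 2 ranks, i.e. 8 GPUs under the assumption above.
print(resolve_parallel_sizes(8, ulysses_size=2, ring_size=2, tp_size=2))
```

Under these assumptions the 3D example would need `--num-gpus 8`.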
To enable Ulysses Anything Attention, define a parallelism config file, `parallel_uaa.yaml`, that contains:
parallelism_config:
  ulysses_size: auto
  attention_backend: native
  ulysses_anything: true
For devices without NVLink support, you can enable Ulysses FP8 communication to further reduce communication overhead. Define a parallelism config file, `parallel_fp8.yaml`, that contains:
parallelism_config:
  ulysses_size: auto
  attention_backend: native
  ulysses_float8: true
You can also enable async Ulysses CP to overlap communication with computation. Define a parallelism config file, `parallel_async.yaml`, that contains:
parallelism_config:
  ulysses_size: auto
  attention_backend: native
  ulysses_async: true # Currently supported only for FLUX.1, Qwen-Image, Ovis-Image and Z-Image.
Here `ulysses_async: true` enables async Ulysses CP. Apply the config with `--cache-dit-config parallel_async.yaml`.
You can also specify extra parallel modules in the YAML config. For example, define a parallelism config file, `parallel_extra.yaml`, that contains:
parallelism_config:
  ulysses_size: auto
  attention_backend: native
  extra_parallel_modules: ["text_encoder", "vae"]
Define a hybrid cache and parallelism config file, `hybrid.yaml`, that contains:
cache_config:
  max_warmup_steps: 8
  warmup_interval: 2
  max_cached_steps: -1
  max_continuous_cached_steps: 2
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1
parallelism_config:
  ulysses_size: auto
  attention_backend: native
  extra_parallel_modules: ["text_encoder", "vae"]
Then apply the hybrid cache and parallelism config:
sglang generate \
--backend diffusers \
--num-gpus 4 \
--model-path Qwen/Qwen-Image \
--cache-dit-config hybrid.yaml \
--prompt "A beautiful sunset over the mountains"
To set only the attention backend, without any other optimization configs, define an `attention.yaml` file that contains only:
attention_backend: "flash" # '_flash_3' for Hopper
You can also specify a quantization config in the YAML file; this requires torchao >= 0.16.0. For example, define a `quantize.yaml` file that contains:
quantize_config: # quantization configuration for transformer modules
  # float8 (DQ), float8_weight_only, float8_blockwise, int8 (DQ), int8_weight_only, etc.
  quant_type: "float8"
  # Layers to exclude from quantization (transformer). Any layer whose name contains one of
  # the keywords in the exclude_layers list is excluded from quantization. This is useful
  # for sensitive layers that are not robust to quantization, e.g., embedding layers.
  exclude_layers:
    - "embedder"
    - "embed"
  verbose: false # whether to print verbose logs during quantization
Then apply the quantization config. If you use quantization, also enable torch.compile for better performance. For example:
sglang generate \
--backend diffusers \
--model-path Qwen/Qwen-Image \
--warmup \
--cache-dit-config quantize.yaml \
--enable-torch-compile \
--dit-cpu-offload false \
--text-encoder-cpu-offload false \
--prompt "A beautiful sunset over the mountains"
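The `exclude_layers` rule from `quantize.yaml` above (a layer is skipped if its name contains any listed keyword) can be sketched as a plain substring match. This is a sketch of the described behavior, not cache-dit's code: `is_excluded` is a hypothetical helper and the layer names are made up for illustration.

```python
def is_excluded(layer_name, exclude_layers):
    """Return True if the layer name contains any exclusion keyword.

    Hypothetical helper mirroring the rule described in the config comments.
    """
    return any(keyword in layer_name for keyword in exclude_layers)


exclude_layers = ["embedder", "embed"]

# Made-up layer names, for illustration only.
layer_names = [
    "pos_embed.proj",
    "time_text_embedder.linear_1",
    "transformer_blocks.0.attn.to_q",
    "transformer_blocks.0.ff.net.2",
]

to_quantize = [name for name in layer_names if not is_excluded(name, exclude_layers)]
print(to_quantize)
```

With plain substring matching, `"embed"` already matches every name that `"embedder"` matches, so listing both mainly documents intent.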
You can also combine all of the above in a single YAML file, `combined.yaml`, that contains:
cache_config:
  max_warmup_steps: 8
  warmup_interval: 2
  max_cached_steps: -1
  max_continuous_cached_steps: 2
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1
parallelism_config:
  ulysses_size: auto
  attention_backend: native
  extra_parallel_modules: ["text_encoder", "vae"]
quantize_config:
  quant_type: "float8"
  exclude_layers:
    - "embedder"
    - "embed"
  verbose: false
Then apply the combined cache, parallelism, and quantization config with `--cache-dit-config combined.yaml`. If you use quantization, also enable torch.compile for better performance.
DBCache controls block-level caching behavior:
<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}> <colgroup> <col style={{width: "12%"}} /> <col style={{width: "34%"}} /> <col style={{width: "14%"}} /> <col style={{width: "40%"}} /> </colgroup> <thead> <tr style={{borderBottom: "2px solid #d55816"}}> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Parameter</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Env Variable</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Default</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Description</th> </tr> </thead> <tbody> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>Fn</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>`SGLANG_CACHE_DIT_FN`</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>1</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Number of first blocks to always compute</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>Bn</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>`SGLANG_CACHE_DIT_BN`</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>0</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Number of last blocks to always compute</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>W</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>`SGLANG_CACHE_DIT_WARMUP`</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>4</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Warmup steps before caching starts</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>R</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>`SGLANG_CACHE_DIT_RDT`</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>0.24</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Residual difference threshold</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>MC</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>`SGLANG_CACHE_DIT_MC`</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>3</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Maximum continuous cached steps</td> </tr> </tbody> </table>
TaylorSeer improves caching accuracy using Taylor expansion:
<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}> <colgroup> <col style={{width: "12%"}} /> <col style={{width: "36%"}} /> <col style={{width: "14%"}} /> <col style={{width: "38%"}} /> </colgroup> <thead> <tr style={{borderBottom: "2px solid #d55816"}}> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Parameter</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Env Variable</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Default</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Description</th> </tr> </thead> <tbody> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>Enable</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>`SGLANG_CACHE_DIT_TAYLORSEER`</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>false</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Enable TaylorSeer calibrator</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>Order</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>`SGLANG_CACHE_DIT_TS_ORDER`</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>1</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Taylor expansion order (1 or 2)</td> </tr> </tbody> </table>
DBCache and TaylorSeer are complementary strategies that work together; you can configure both sets of parameters at once:
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_FN=2 \
SGLANG_CACHE_DIT_BN=1 \
SGLANG_CACHE_DIT_WARMUP=4 \
SGLANG_CACHE_DIT_RDT=0.4 \
SGLANG_CACHE_DIT_MC=4 \
SGLANG_CACHE_DIT_TAYLORSEER=true \
SGLANG_CACHE_DIT_TS_ORDER=2 \
sglang generate --model-path black-forest-labs/FLUX.1-dev \
--prompt "A curious raccoon in a forest"
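To make the DBCache parameters concrete, here is a minimal Python sketch of a per-step decision rule driven by the warmup count (W), the residual difference threshold (R), and the maximum continuous cached steps (MC). It is a simplified illustration under stated assumptions, not Cache-DiT's implementation: `plan_steps` is a hypothetical helper, the `residual_diffs` values are made up, and details such as whether the threshold comparison is strict are assumptions.

```python
def plan_steps(residual_diffs, warmup=4, rdt=0.24, max_continuous=3):
    """Return a 'compute' or 'cache' decision for each denoising step.

    Hypothetical sketch. Each entry of residual_diffs stands for the relative
    change in the residual measured at that step; the defaults mirror the
    DBCache table (W=4, R=0.24, MC=3).
    """
    decisions = []
    cached_streak = 0  # consecutive cached steps so far
    for step, diff in enumerate(residual_diffs):
        can_cache = (
            step >= warmup                      # never cache during warmup
            and diff < rdt                      # residual changed little
            and cached_streak < max_continuous  # cap on consecutive cached steps
        )
        if can_cache:
            decisions.append("cache")
            cached_streak += 1
        else:
            decisions.append("compute")
            cached_streak = 0
    return decisions


# Made-up residual differences for a 10-step run.
diffs = [0.9, 0.7, 0.5, 0.4, 0.1, 0.1, 0.1, 0.1, 0.3, 0.2]
for step, decision in enumerate(plan_steps(diffs)):
    print(step, decision)
```

In this sketch, raising the threshold (as with `SGLANG_CACHE_DIT_RDT=0.4` above) lets more steps qualify for caching, while a larger MC allows longer cached runs before a full compute step is forced.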
SCM provides step-level caching control for additional speedup. It decides which denoising steps to compute fully and which to use cached results.
SCM Presets
SCM is configured with presets:
<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}> <colgroup> <col style={{width: "18%"}} /> <col style={{width: "22%"}} /> <col style={{width: "22%"}} /> <col style={{width: "38%"}} /> </colgroup> <thead> <tr style={{borderBottom: "2px solid #d55816"}}> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Preset</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Compute Ratio</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Speed</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Quality</th> </tr> </thead> <tbody> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>`none`</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>100%</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Baseline</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Best</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>`slow`</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>~75%</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>~1.3x</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>High</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>`medium`</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>~50%</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>~2x</td> <td style={{padding: "9px 12px", 
backgroundColor: "rgba(255,255,255,0.05)"}}>Good</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>`fast`</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>~35%</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>~3x</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Acceptable</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>`ultra`</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>~25%</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>~4x</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Lower</td> </tr> </tbody> </table>
Usage
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_SCM_PRESET=medium \
sglang generate --model-path Qwen/Qwen-Image \
--prompt "A futuristic cityscape at sunset"
Custom SCM Bins
For fine-grained control over which steps to compute vs cache:
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_SCM_COMPUTE_BINS="8,3,3,2,2" \
SGLANG_CACHE_DIT_SCM_CACHE_BINS="1,2,2,2,3" \
sglang generate --model-path Qwen/Qwen-Image \
--prompt "A futuristic cityscape at sunset"
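As a worked example of the bin values above, the Python sketch below expands the two bin lists into a per-step mask and computes the resulting compute ratio. It assumes the bins alternate, starting with a compute bin and followed by the matching cache bin; that ordering and the `build_steps_mask` helper are assumptions for illustration, so check the Cache-DiT SCM documentation for the exact semantics.

```python
def build_steps_mask(compute_bins, cache_bins):
    """Expand alternating compute/cache bins into a per-step mask.

    Hypothetical helper: 1 marks a fully computed step, 0 a cached step.
    Assumption: bins alternate, beginning with a compute bin.
    """
    mask = []
    for n_compute, n_cache in zip(compute_bins, cache_bins):
        mask.extend([1] * n_compute)
        mask.extend([0] * n_cache)
    return mask


# Parse the same comma-separated strings used in the environment variables.
compute_bins = [int(x) for x in "8,3,3,2,2".split(",")]
cache_bins = [int(x) for x in "1,2,2,2,3".split(",")]

mask = build_steps_mask(compute_bins, cache_bins)
compute_ratio = sum(mask) / len(mask)

print(len(mask), sum(mask), round(compute_ratio, 3))  # 28 18 0.643
```

The bins sum to 18 computed and 10 cached steps, 28 in total, which matches a 28-step schedule and gives a compute ratio of about 64%, between the `slow` (~75%) and `medium` (~50%) presets.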
SCM Policy
<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}> <colgroup> <col style={{width: "16%"}} /> <col style={{width: "42%"}} /> <col style={{width: "42%"}} /> </colgroup> <thead> <tr style={{borderBottom: "2px solid #d55816"}}> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Policy</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Env Variable</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Description</th> </tr> </thead> <tbody> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>`dynamic`</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>`SGLANG_CACHE_DIT_SCM_POLICY=dynamic`</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Adaptive caching based on content (default)</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>`static`</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>`SGLANG_CACHE_DIT_SCM_POLICY=static`</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Fixed caching pattern</td> </tr> </tbody> </table>
All Cache-DiT parameters can be configured via environment variables. See Environment Variables for the complete list.
SGLang Diffusion x Cache-DiT supports almost all models originally supported in SGLang Diffusion:
<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}> <colgroup> <col style={{width: "30%"}} /> <col style={{width: "70%"}} /> </colgroup> <thead> <tr style={{borderBottom: "2px solid #d55816"}}> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Model Family</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Example Models</th> </tr> </thead> <tbody> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>Wan</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Wan2.1, Wan2.2</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>Flux</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>FLUX.1-dev, FLUX.2-dev</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>Z-Image</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Z-Image-Turbo</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>Qwen</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Qwen-Image, Qwen-Image-Edit</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>Hunyuan</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>HunyuanVideo</td> </tr> </tbody> </table>
medium preset for good speed/quality balance
world_size > 1.
For models with < 8 inference steps (e.g., DMD distilled models), SCM is automatically disabled. DBCache acceleration still works.
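The step-count rule for SCM can be expressed as a small check, which may help when choosing settings for distilled models. This is a sketch of the rule as stated in this document, not SGLang's code: `scm_active` and `MIN_STEPS_FOR_SCM` are hypothetical names, and treating the `none` preset as "SCM off" is an assumption based on its 100% compute ratio.

```python
MIN_STEPS_FOR_SCM = 8  # from the "< 8 inference steps" rule stated above


def scm_active(num_inference_steps, preset="medium"):
    """Return True if SCM would apply for this step count and preset.

    Hypothetical helper. Assumption: the `none` preset computes every
    step, so SCM is treated as inactive for it.
    """
    if preset == "none":
        return False
    return num_inference_steps >= MIN_STEPS_FOR_SCM


for steps in (4, 8, 28):
    print(steps, scm_active(steps))
```

For a 4-step distilled model, this check reports SCM as inactive, so any speedup would come from DBCache alone.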