docs/diffusion/performance/attention_backends.md
This document describes the attention backends available in sglang diffusion (sglang.multimodal_gen) and how to select them.
Attention backends are defined by AttentionBackendEnum (sglang.multimodal_gen.runtime.platforms.interface.AttentionBackendEnum) and selected via the CLI flag --attention-backend.
Backend selection is performed by the shared attention layers (e.g. LocalAttention / USPAttention / UlyssesAttention in sglang.multimodal_gen.runtime.layers.attention.layer) and therefore applies to any model component using these layers (e.g. diffusion transformer / DiT and encoders).
When using the diffusers backend, --attention-backend is passed through to diffusers' set_attention_backend (e.g. flash, _flash_3_hub, sage, xformers, native).
For SGLang-native pipelines, the CLI accepts the lowercase names of AttentionBackendEnum. The table below lists the backends implemented by the built-in platforms. fa3/fa4 are accepted as aliases for fa.
| CLI value | Enum value | Notes |
|---|---|---|
| fa / fa3 / fa4 | FA | FlashAttention. fa3/fa4 are normalized to fa during argument parsing (ServerArgs.__post_init__). |
| torch_sdpa | TORCH_SDPA | PyTorch scaled_dot_product_attention. |
| sliding_tile_attn | SLIDING_TILE_ATTN | Sliding Tile Attention (STA). Requires st_attn. Configure via --attention-backend-config. |
| sage_attn | SAGE_ATTN | Requires sageattention. Upstream SageAttention CUDA extensions target SM80/SM86/SM89/SM90/SM120 (compute capability 8.0/8.6/8.9/9.0/12.0); see upstream setup.py: https://github.com/thu-ml/SageAttention/blob/main/setup.py. |
| sage_attn_3 | SAGE_ATTN_3 | Requires SageAttention3 installed per upstream instructions. |
| video_sparse_attn | VIDEO_SPARSE_ATTN | Requires vsa. Configure sparsity via --attention-backend-config. |
| vmoba_attn | VMOBA_ATTN | Requires kernel.attn.vmoba_attn.vmoba. Configure via --attention-backend-config. |
| aiter | AITER | Requires aiter. |
| aiter_sage | AITER_SAGE | Requires aiter. |
| sla_attn | SLA_ATTN | Sparse Linear Attention. Requires SpargeAttn. Install with pip install git+https://github.com/thu-ml/SpargeAttn.git --no-build-isolation. |
| sage_sla_attn | SAGE_SLA_ATTN | SageAttention + Sparse Linear Attention. Requires SpargeAttn (same install as SLA). |
| sparse_video_gen_2_attn | SPARSE_VIDEO_GEN_2_ATTN | Requires svg. See installation instructions at https://github.com/svg-project/Sparse-VideoGen. |
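The fa3/fa4 alias handling described above can be sketched roughly as follows (a minimal illustration with a hypothetical helper name; the actual normalization happens inside ServerArgs.__post_init__):

```python
# Hypothetical sketch of CLI backend-name normalization.
# fa3/fa4 are accepted for convenience but collapse to "fa",
# which maps to AttentionBackendEnum.FA.
ALIASES = {"fa3": "fa", "fa4": "fa"}

def normalize_backend(cli_value: str) -> str:
    """Lowercase the CLI value and collapse known aliases."""
    value = cli_value.lower()
    return ALIASES.get(value, value)
```

Any other lowercase enum name (e.g. torch_sdpa, sage_attn) passes through unchanged.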
The selection order in runtime/layers/attention/selector.py is:
1. global_force_attn_backend(...) / global_force_attn_backend_context_manager(...)
2. --attention-backend (ServerArgs.attention_backend)

Some backends require additional configuration. You can pass these parameters via --attention-backend-config. This argument accepts either:
- a JSON object string (e.g. '{"sparsity": 0.5}').
- comma-separated key=value pairs (e.g. "sparsity=0.5,enable_x=true").

Sliding Tile Attention (sliding_tile_attn)
| Parameter | Type | Description | Default |
|---|---|---|---|
| mask_strategy_file_path | str | Required. Path to the mask strategy JSON file. | - |
| sta_mode | str | Mode of STA. | STA_inference |
| skip_time_steps | int | Number of steps to use full attention before switching to sparse attention. | 15 |
Video Sparse Attention (video_sparse_attn)
| Parameter | Type | Description | Default |
|---|---|---|---|
| sparsity | float | Validation sparsity (0.0 - 1.0). | 0.0 |
V-MoBA (vmoba_attn)
| Parameter | Type | Description | Default |
|---|---|---|---|
| temporal_chunk_size | int | Chunk size for temporal dimension. | - |
| temporal_topk | int | Top-K tokens to select in temporal dimension. | - |
| spatial_chunk_size | list[int] | Chunk size for spatial dimension (H, W). | - |
| spatial_topk | int | Top-K tokens to select in spatial dimension. | - |
| st_chunk_size | list[int] | Chunk size for spatiotemporal dimension (T, H, W). | - |
| st_topk | int | Top-K tokens to select in spatiotemporal dimension. | - |
| moba_select_mode | str | Selection mode (e.g., threshold). | threshold |
| moba_threshold | float | Threshold value for selection. | 0.25 |
| moba_threshold_type | str | Type of thresholding (e.g., query_head). | query_head |
| first_full_step | int | Number of initial steps to use full attention. | 12 |
| first_full_layer | int | Number of initial layers to use full attention. | 0 |
| temporal_layer | int | Number of temporal layers. | 1 |
| spatial_layer | int | Number of spatial layers. | 1 |
| st_layer | int | Number of spatiotemporal layers. | 1 |
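The parameters in the tables above can be supplied either as a JSON object string or as comma-separated key=value pairs. A minimal sketch of such a parser (a hypothetical helper for illustration, not the actual sglang implementation):

```python
import json

def parse_backend_config(raw: str) -> dict:
    """Parse an --attention-backend-config style value.

    Accepts a JSON object string ('{"sparsity": 0.5}') or
    comma-separated key=value pairs ("sparsity=0.5,enable_x=true").
    Sketch only; the real parser may handle more edge cases.
    """
    raw = raw.strip()
    if raw.startswith("{"):
        return json.loads(raw)
    config = {}
    for pair in raw.split(","):
        key, _, value = pair.partition("=")
        try:
            # Interpret numbers/booleans via JSON; anything else stays a string.
            config[key] = json.loads(value)
        except json.JSONDecodeError:
            config[key] = value
    return config
```

Note that in the key=value form, values like file paths are kept as plain strings, while numeric and boolean values are converted.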
| Backend | CUDA | ROCm | XPU | MUSA | MPS | NPU | Notes |
|---|---|---|---|---|---|---|---|
| fa | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | CUDA requires SM80+ and fp16/bf16. XPU uses its own flash attention backend. FlashAttention is only used when the required runtime is installed; otherwise it falls back to torch_sdpa. No extra installation is required for NPU. |
| torch_sdpa | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Most compatible option across platforms. |
| sliding_tile_attn | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | CUDA-only. Requires st_attn. Configure via --attention-backend-config. |
| sage_attn | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | CUDA-only (optional dependency). |
| sage_attn_3 | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | CUDA-only (optional dependency). |
| video_sparse_attn | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | CUDA-only. Requires vsa. Configure sparsity via --attention-backend-config. |
| sla_attn | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | CUDA-only. Requires SpargeAttn. |
| sage_sla_attn | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | CUDA-only. Requires SpargeAttn. |
| vmoba_attn | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | CUDA-only. Requires kernel.attn.vmoba_attn.vmoba. Configure via --attention-backend-config. |
| aiter | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | Requires aiter. |
| aiter_sage | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | Requires aiter. |
| sparse_video_gen_2_attn | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | CUDA-only. Requires svg. |
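The fa fallback behavior noted in the table can be sketched as follows (a hypothetical helper; the actual selection logic lives in runtime/layers/attention/selector.py):

```python
import importlib.util

def resolve_backend(requested: str) -> str:
    """Return the backend that would actually be used.

    Sketch of the documented behavior: FlashAttention is only used
    when its runtime (the flash_attn package here, as an assumption)
    is importable; otherwise the request falls back to torch_sdpa.
    """
    if requested == "fa" and importlib.util.find_spec("flash_attn") is None:
        return "torch_sdpa"  # FlashAttention runtime not installed
    return requested
```

This is why requesting fa on a machine without the FlashAttention runtime still produces working output rather than an import error.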
```shell
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend fa
```
```shell
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend torch_sdpa
```
```shell
# Pass the mask strategy file path via config
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend sliding_tile_attn \
  --attention-backend-config "mask_strategy_file_path=/abs/path/to/mask_strategy.json"
```
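Backends that take --attention-backend-config also accept the JSON form; for example, a video_sparse_attn run (the sparsity value here is purely illustrative, not a recommendation):

```shell
# Illustrative only: adjust sparsity for your model and quality target
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend video_sparse_attn \
  --attention-backend-config '{"sparsity": 0.9}'
```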
If a backend fails or is unsupported on your platform, retry with --attention-backend torch_sdpa or fa, depending on what is available in your environment. The most portable fallback is torch_sdpa.