docs_new/cookbook/diffusion/Wan/Wan2.2.mdx
import { Wan22Deployment } from '/src/snippets/diffusion/wan22-deployment.jsx';
Wan2.2 is a popular series of open, advanced large-scale video generative models. This generation delivers comprehensive upgrades across the board.
For more details, please refer to the official Wan2.2 GitHub Repository.
SGLang-diffusion offers multiple installation methods; choose the one that best suits your hardware platform and requirements.
Please refer to the official SGLang-diffusion installation guide for instructions.
This section provides deployment configurations optimized for different hardware platforms and use cases.
The Wan2.2 series offers models in various sizes, architectures, and input types, optimized for different hardware platforms. The recommended launch configurations vary by hardware and model size.
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform and model size. SGLang supports serving Wan2.2 on NVIDIA B200 and H200 GPUs, and on AMD MI300X, MI325X, and MI355X GPUs.
<Wan22Deployment />

All currently supported optimization flags are listed below.
- --vae-path: Path to a custom VAE model or HuggingFace model ID (e.g., fal/FLUX.2-Tiny-AutoEncoder). If not specified, the VAE will be loaded from the main model path.
- --num-gpus {NUM_GPUS}: Number of GPUs to use.
- --tp-size {TP_SIZE}: Tensor parallelism size (only for the text encoder; should not be larger than 1 if text encoder offload is enabled, as layer-wise offload plus prefetch is faster).
- --sp-degree {SP_SIZE}: Sequence parallelism size (typically should match the number of GPUs); see the sketch after this list.
- --ulysses-degree {ULYSSES_DEGREE}: The degree of DeepSpeed-Ulysses-style SP in USP.
- --ring-degree {RING_DEGREE}: The degree of ring attention-style SP in USP.

For more API usage and request examples, please refer to: SGLang Diffusion OpenAI API
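As a sketch of how these flags compose (the values are illustrative, not a tuned configuration; in typical USP setups the Ulysses and ring degrees multiply to the sequence-parallel degree, which is an assumption here rather than something this guide states):

sglang serve --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
  --num-gpus 4 \
  --sp-degree 4 \
  --ulysses-degree 2 \
  --ring-degree 2   # assumed: 2 (Ulysses) x 2 (ring) = sp-degree 4, matching --num-gpus

To try the OpenAI-compatible API, start a server and send a request: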
sglang serve --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers --port 3000
curl http://127.0.0.1:3000/v1/images/generations \
-o >(jq -r '.data[0].b64_json' | base64 --decode > example.mp4) \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-d '{
"model": "black-forest-labs/FLUX.1-dev",
"prompt": "A cute baby sea otter",
"n": 1,
"size": "1024x1024",
"response_format": "b64_json"
}'
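The -o >(...) form relies on bash process substitution; a portable variant (same request, assuming the same response shape) saves the JSON first and decodes it afterwards:

curl http://127.0.0.1:3000/v1/images/generations \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{"model": "Wan-AI/Wan2.2-T2V-A14B-Diffusers", "prompt": "A cute baby sea otter", "n": 1, "size": "1024x1024", "response_format": "b64_json"}' \
  > response.json
# decode the base64 payload into a playable file
jq -r '.data[0].b64_json' response.json | base64 --decode > example.mp4

For multi-GPU offline generation, sglang generate combines server arguments and sampling arguments in one invocation: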
SERVER_ARGS=(
--model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers
--text-encoder-cpu-offload
--pin-cpu-memory
--num-gpus 4
--ulysses-degree 2
--enable-cfg-parallel
)
SAMPLING_ARGS=(
--prompt "A curious raccoon"
--save-output
--output-path outputs
--output-file-name "A curious raccoon.mp4"
)
sglang generate "${SERVER_ARGS[@]}" "${SAMPLING_ARGS[@]}"
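Splitting the flags into SERVER_ARGS and SAMPLING_ARGS is purely a shell convenience: the two arrays keep engine configuration separate from per-request sampling parameters, while still expanding into a single sglang generate call.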
SGLang integrates Cache-DiT, a caching acceleration engine for Diffusion Transformers (DiT), to achieve up to 7.4x inference speedup with minimal quality loss. Set SGLANG_CACHE_DIT_ENABLED=true to enable it. For more details, please refer to the SGLang Cache-DiT documentation.
Basic Usage
SGLANG_CACHE_DIT_ENABLED=true sglang serve --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers
Advanced Usage
Combined Configuration Example:
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_FN=2 \
SGLANG_CACHE_DIT_BN=1 \
SGLANG_CACHE_DIT_WARMUP=4 \
SGLANG_CACHE_DIT_RDT=0.4 \
SGLANG_CACHE_DIT_MC=4 \
SGLANG_CACHE_DIT_TAYLORSEER=true \
SGLANG_CACHE_DIT_TS_ORDER=2 \
sglang serve --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers
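The variable names are not documented in this section; judging by Cache-DiT's published knobs, a plausible reading (an inference on our part, so verify against the SGLang Cache-DiT documentation) is that FN and BN set how many leading and trailing transformer blocks are always computed, WARMUP the number of uncached warmup steps, RDT the residual-difference threshold for reusing cached activations, MC the maximum number of consecutive cached steps, and TAYLORSEER/TS_ORDER whether TaylorSeer extrapolation is enabled and at what order.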
- --dit-cpu-offload: Use CPU offload for DiT inference. Enable this if you run out of memory with FSDP.
- --text-encoder-cpu-offload: Use CPU offload for text encoder inference. Enable this if you run out of memory with FSDP.
- --image-encoder-cpu-offload: Use CPU offload for image encoder inference. Enable this if you run out of memory with FSDP.
- --vae-cpu-offload: Use CPU offload for the VAE. Enable this if you run out of memory.
- --pin-cpu-memory: Pin host memory for CPU offload. Add this only as a temporary workaround if you hit "CUDA error: invalid argument".

A combined sketch follows this list.
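A minimal sketch combining the offload flags above (which ones you actually need depends on where memory runs out; enabling all of them trades speed for headroom):

sglang serve --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
  --dit-cpu-offload \
  --text-encoder-cpu-offload \
  --vae-cpu-offload \
  --pin-cpu-memory   # only if you hit "CUDA error: invalid argument"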
Test Environment:

Server Command:
sglang serve --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers
Benchmark Command:
python3 -m sglang.multimodal_gen.benchmarks.bench_serving \
--backend sglang-video --dataset vbench --task t2v --num-prompts 1 --max-concurrency 1
Result:
================= Serving Benchmark Result =================
Backend: sglang-video
Model: Wan-AI/Wan2.2-T2V-A14B-Diffusers
Dataset: vbench
Task: t2v
--------------------------------------------------
Benchmark duration (s): 630.43
Request rate: inf
Max request concurrency: 1
Successful requests: 1/1
--------------------------------------------------
Request throughput (req/s): 0.00
Latency Mean (s): 630.4277
Latency Median (s): 630.4277
Latency P99 (s): 630.4277
--------------------------------------------------
Peak Memory Max (MB): 62627.41
Peak Memory Mean (MB): 62627.41
Peak Memory Median (MB): 62627.41
============================================================
Server Command:
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_FN=2 \
SGLANG_CACHE_DIT_BN=1 \
SGLANG_CACHE_DIT_WARMUP=4 \
SGLANG_CACHE_DIT_RDT=0.4 \
SGLANG_CACHE_DIT_MC=4 \
SGLANG_CACHE_DIT_TAYLORSEER=true \
SGLANG_CACHE_DIT_TS_ORDER=2 \
sglang serve --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers
Benchmark Command:
python3 -m sglang.multimodal_gen.benchmarks.bench_serving \
--backend sglang-video --dataset vbench --task t2v --num-prompts 20 --max-concurrency 20
Result:
================= Serving Benchmark Result =================
Backend: sglang-video
Model: Wan-AI/Wan2.2-T2V-A14B-Diffusers
Dataset: vbench
Task: t2v
--------------------------------------------------
Benchmark duration (s): 5163.21
Request rate: inf
Max request concurrency: 20
Successful requests: 20/20
--------------------------------------------------
Request throughput (req/s): 0.00
Latency Mean (s): 2739.7695
Latency Median (s): 2742.0673
Latency P99 (s): 5121.6331
--------------------------------------------------
Peak Memory Max (MB): 72523.56
Peak Memory Mean (MB): 70253.34
Peak Memory Median (MB): 70824.46
============================================================
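Note that the two runs differ in request concurrency (1 vs. 20) as well as in caching, so the per-request latency figures are not directly comparable; what the numbers do show is that amortized wall-clock time per request drops from about 630 s in the baseline to roughly 258 s (5163.21 s / 20 requests) with Cache-DiT enabled at concurrency 20.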