SANA-WM is an efficient open-source world model from NVLabs, trained natively for one-minute video generation. It is a 2.6B-parameter text+image-to-video (TI2V) diffusion transformer that synthesizes 720p, minute-scale videos with precise 6-DoF camera control, paired with an LTX-2 refiner for high-fidelity decoding. It builds on the SANA family — efficient high-resolution synthesis with a linear diffusion transformer.
SANA-WM ships in two checkpoints: a bidirectional checkpoint (dense, one-shot) and a streaming checkpoint (chunk-causal and autoregressive: it generates chunk-by-chunk and reuses causal DiT state across chunks, so memory stays bounded and clips can be long, even endless). Every mode takes a single first frame, a text prompt, and a camera trajectory. This cookbook covers all three serving modes SGLang exposes:

- (A) Dense bidirectional: the `SANA-WM_bidirectional` checkpoint, generated in one shot (no chunking) via `SanaWMTwoStagePipeline` over the standard `/v1/videos` HTTP API. Highest single-clip quality (full bidirectional attention + dense LTX-2 refiner); matches the NVlabs dense reference.
- (B) Batch streaming: the `SANA-WM_streaming` checkpoint, generated chunk-by-chunk in one request via the same `SanaWMTwoStagePipeline` + `--streaming` over `/v1/videos`. This is SGLang's offline chunk-causal streaming path: the whole clip is produced chunk-by-chunk internally, then returned.
- (C) Live realtime: `SanaWMRealtimePipeline` over a WebSocket API at `/v1/realtime_video/generate`, so a browser/client streams camera-action events frame-by-frame and receives video chunks back in real time. Realtime uses the same streaming checkpoint, but the incremental session path is not bit-identical to offline batch streaming.

All three modes share the camera action DSL (§8) and the configuration knobs (§9). Modes (B) and (C) share the streaming checkpoint and the chunk-causal pipeline.
Key features (per the official model):
In the streaming / realtime configuration this becomes a low-latency, interactive pipeline:
/v1/videos paths, or pushed over the WebSocket at init / as live per-chunk events on the realtime path (see §7).

Architecture & components
| Component | Value |
|---|---|
| Stage-1 DiT | 2.6B; 20 layers, hidden 2240, 20 heads (head_dim 112); ~10 GB |
| Attention | frame-wise Gated DeltaNet + softmax every 4th block (hybrid linear) |
| Camera | dual-branch, UCPE + PRoPE (raymap + Plücker), 6-DoF |
| VAE | LTX-2 causal, strides (T, H, W) = (8, 32, 32); ~2 GB |
| Refiner | LTX-2 Stage-2 distilled; ~41 GB |
| Output | up to 720p (704×1280) @ 16 fps, minute-scale |
For more details, see the SANA-WM paper (arXiv), the SANA project page, the NVlabs/Sana GitHub, and the SANA-WM_bidirectional model card (Apache-2.0).
SGLang-diffusion offers multiple installation methods depending on your hardware platform. Please refer to the official SGLang-diffusion installation guide.
SANA-WM adds the SanaWMTransformer3DModel + GDN kernels, the SanaWMTwoStagePipeline (dense bidirectional + chunk-causal streaming), and the SanaWMRealtimePipeline with the /v1/realtime_video WebSocket router. The diffusion server CLI is invoked as python -m sglang.multimodal_gen.runtime.entrypoints.cli.main.
Both SANA-WM checkpoints are public (Apache-2.0, no gating, no token) and load directly — there is no manual assembly step. Pass the HuggingFace repo id to --model-path and SGLang downloads, materializes, validates, and loads it:
| Mode | --model-path |
|---|---|
| Dense bidirectional (§4) | Efficient-Large-Model/SANA-WM_bidirectional |
| Batch streaming (§5) / realtime (§6) | Efficient-Large-Model/SANA-WM_streaming |
Both repo ids are registered in SGLang's built-in model-overlay registry, so on first load the overlay transparently materializes the official release into a runnable Diffusers directory — for the streaming checkpoint this converts the DMD self-forcing checkpoint (sana_dit/model.pt) into a Diffusers transformer/ and wires the LTX-2 causal VAE, the LTX-2 refiner, and the Gemma encoders. No environment variable or build_model_dir.sh step is needed. (You may also pass a local, already-materialized Diffusers directory.)
The materialized checkpoint is a Diffusers directory whose model_index.json declares the loadable components:
| Component (`model_index.json`) | Class |
|---|---|
| `transformer` (Stage-1 DiT) | `diffusers.SanaWMTransformer3DModel` |
| `vae` | `diffusers.AutoencoderKLCausalLTX2Video` |
| `text_encoder` | `transformers.Gemma2Model` |
| `tokenizer` | `transformers.GemmaTokenizer` |
| `scheduler` | `diffusers.FlowMatchEulerDiscreteScheduler` |
How loading works:
- SGLang calls `maybe_download_model(model_path, force_diffusers_model=True)` and verifies that the result contains a `model_index.json` plus the required component subdirectories (`transformer/`, `vae/`).
- If `text_encoder` / `tokenizer` are not provided as component paths, the pipeline falls back to the default Stage-1 text encoder `Efficient-Large-Model/gemma-2-2b-it` (`DEFAULT_SANA_WM_TEXT_ENCODER`).
- `--pipeline-class-name`. The checkpoint's `model_index.json` `_class_name` selects the default pipeline (`SanaWMTwoStagePipeline`). Pin it explicitly to choose: `--pipeline-class-name SanaWMTwoStagePipeline` for the `/v1/videos` paths (§4–5) or `--pipeline-class-name SanaWMRealtimePipeline` for live realtime (§6). Pinning is also required if you point `--model-path` at a bare safetensors file instead of a Diffusers directory.
- The refiner components live under `refiner/` in the checkpoint: `refiner/transformer` (`transformer_2`), `refiner/connectors` (`connectors`), and `refiner/text_encoder` (the Gemma-3 encoder for `text_encoder_2`, whose tokenizer also serves as `tokenizer_2`). The refiner is optional: it is skipped (Stage-1-only output) when the env flag `SGLANG_SANA_WM_SKIP_REFINER` (or a `skip_refiner` request extra) is set, or when no `refiner/` is present (`transformer_2` unloaded). On the batch path it runs chunk-wise with `--refiner-chunked` (the official streaming path, default on) or whole-clip with `--refiner-chunked false`; on the realtime path the pipeline builds a `SanaWMChunkedRefinerChainStage` only when a refiner is available, and otherwise streams Stage-1 frames.

Dense bidirectional (/v1/videos)

The bidirectional checkpoint generates the whole clip in one shot (full bidirectional attention, not chunked) followed by a dense LTX-2 refiner. This gives the highest single-clip quality, matching the NVlabs dense reference.
Launch with the two-stage pipeline and no --streaming flag (dense is the default — streaming defaults to False):
```bash
python -m sglang.multimodal_gen.runtime.entrypoints.cli.main serve \
  --model-path Efficient-Large-Model/SANA-WM_bidirectional \
  --pipeline-class-name SanaWMTwoStagePipeline \
  --host 127.0.0.1 --port 30000
```
Then POST to `/v1/videos` exactly as in §5, but pass the NVlabs dense sampling defaults for closest parity, because the dense path runs a full CFG sampling schedule rather than the distilled few-step streaming one:
```bash
curl -s http://127.0.0.1:30000/v1/videos \
  -H 'content-type: application/json' -d '{
    "prompt": "a camera moving forward and turning left",
    "input_reference": "/path/to/first_frame.png",
    "num_frames": 321,
    "seed": 42,
    "fps": 16,
    "num_inference_steps": 60,
    "guidance_scale": 5.0,
    "diffusers_kwargs": {
      "action": "w-80,wl-80,l-80,wj-80",
      "intrinsics": "/path/to/intrinsics.npy"
    }
  }'
```
- `num_inference_steps` / `guidance_scale`: the dense path uses CFG; NVlabs' reference defaults to 60 steps, guidance 5.0 (the `SanaWMSamplingParams` defaults are the lighter 20 / 4.5, so pass 60 / 5.0 explicitly for dense parity).
- A `num_frames=321` request yields 320 output frames.

Batch streaming (/v1/videos)

The streaming checkpoint generates a full camera-controlled clip in one request, with no WebSocket. This is SGLang's offline streaming path: the whole clip is generated chunk-by-chunk internally, refined, decoded, and returned as one video.
Launch with the two-stage pipeline + the streaming flags:
```bash
python -m sglang.multimodal_gen.runtime.entrypoints.cli.main serve \
  --model-path Efficient-Large-Model/SANA-WM_streaming \
  --pipeline-class-name SanaWMTwoStagePipeline \
  --streaming --refiner-chunked \
  --host 127.0.0.1 --port 30000
```
- `--streaming`: chunk-causal `forward_long` Stage-1 (vs the dense one-shot path of §4).
- `--refiner-chunked`: chunk-wise streaming LTX-2 refiner (on by default). To use the whole-clip dense refiner instead (also valid, higher peak memory), pass `--refiner-chunked false`; simply omitting the flag keeps the default chunked refiner.
- `--num-frame-per-block N`: latent frames per chunk (default 3).

Then POST to `/v1/videos` (JSON body shown below; `multipart/form-data` with an uploaded `input_reference` file also works). Camera control goes in `diffusers_kwargs`: the action-DSL string (§8) and the intrinsics:
```bash
curl -s http://127.0.0.1:30000/v1/videos \
  -H 'content-type: application/json' -d '{
    "prompt": "a camera moving forward and turning left",
    "input_reference": "/path/to/first_frame.png",
    "num_frames": 321,
    "seed": 42,
    "fps": 16,
    "diffusers_kwargs": {
      "action": "w-80,wl-80,l-80,wj-80",
      "intrinsics": "/path/to/intrinsics.npy"
    }
  }'
```
| Field | Notes |
|---|---|
| `prompt` | text prompt |
| `input_reference` | first-frame image: a server-side path, or (multipart) an uploaded file. For an `http(s)://` URL in a JSON body, use the separate `reference_url` field (the server downloads it and assigns it to `input_reference`) |
| `num_frames` | total pixel frames (e.g. 321 → 41 latent frames, 13 chunks; output 704×1280) |
| `seed` | RNG seed (default 42) |
| `fps` | output frame rate; pass 16 (SANA-WM's native rate). The generic `/v1/videos` default is 24, which would encode the same frames at 24 fps and make the clip play ~33% shorter (16/24 of the duration) |
| `diffusers_kwargs.action` | camera action-DSL string (§8) |
| `diffusers_kwargs.intrinsics` | path to a camera-intrinsics `.npy` (per-frame `(T,3,3)`) or an inline 3×3 / `(T,3,3)` list |
The response is a `VideoResponse`; fetch the rendered MP4 via the returned reference or `GET /v1/videos/{id}/content`. The streaming hyperparameters (`num_frame_per_block`, `denoising_step_list`, `sink_size`, `num_cached_blocks`, `streaming_cfg_scale`) are pipeline-config defaults on `SanaWMPipelineConfig`, not request fields (see §9).
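For clients that prefer Python over curl, the same request can be sketched with the standard library alone. This is a minimal sketch rather than an SGLang client: `build_video_request` and `post_video` are hypothetical helper names, the response is assumed to be a JSON document, and the frame-count check simply mirrors the `(num_frames - 1) % 8 == 0` constraint listed in §9.

```python
import json
import urllib.request


def build_video_request(prompt, first_frame, action, intrinsics,
                        num_frames=321, seed=42, fps=16):
    """Assemble the /v1/videos JSON body used in the curl example (hypothetical helper)."""
    if (num_frames - 1) % 8 != 0:
        raise ValueError("num_frames must satisfy (num_frames - 1) % 8 == 0")
    return {
        "prompt": prompt,
        "input_reference": first_frame,   # server-side path to the first frame
        "num_frames": num_frames,
        "seed": seed,
        "fps": fps,                       # 16 is SANA-WM's native rate
        "diffusers_kwargs": {"action": action, "intrinsics": intrinsics},
    }


def post_video(body, base_url="http://127.0.0.1:30000"):
    """POST the body to /v1/videos and return the decoded JSON response."""
    req = urllib.request.Request(
        f"{base_url}/v1/videos",
        data=json.dumps(body).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


body = build_video_request(
    prompt="a camera moving forward and turning left",
    first_frame="/path/to/first_frame.png",
    action="w-80,wl-80,l-80,wj-80",
    intrinsics="/path/to/intrinsics.npy",
)
# response = post_video(body)  # requires a running server
```

Building the body separately from sending it keeps the request easy to inspect or log before it reaches the server.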
Launch with the realtime pipeline pinned — the checkpoint defaults to SanaWMTwoStagePipeline, so realtime must be selected explicitly (see §3). The /v1/realtime_video router is always mounted and becomes functional once the realtime config is active, because SanaWMRealtimeConfig has a registered realtime adapter (SanaWMRealtimeAdapter).
```bash
python -m sglang.multimodal_gen.runtime.entrypoints.cli.main serve \
  --model-path Efficient-Large-Model/SANA-WM_streaming \
  --pipeline-class-name SanaWMRealtimePipeline \
  --host 127.0.0.1 --port 30000
```
Common launch variants:
```bash
# recommended multi-GPU realtime profile
python -m sglang.multimodal_gen.runtime.entrypoints.cli.main serve \
  --model-path Efficient-Large-Model/SANA-WM_streaming \
  --pipeline-class-name SanaWMRealtimePipeline \
  --num-gpus 8 --sp-degree 8 \
  --host 127.0.0.1 --port 30000

# single GPU
python -m sglang.multimodal_gen.runtime.entrypoints.cli.main serve \
  --model-path Efficient-Large-Model/SANA-WM_streaming \
  --pipeline-class-name SanaWMRealtimePipeline \
  --num-gpus 1 --host 127.0.0.1 --port 30000

# offload DiT + text encoder to CPU (tight VRAM)
python -m sglang.multimodal_gen.runtime.entrypoints.cli.main serve \
  --model-path Efficient-Large-Model/SANA-WM_streaming \
  --pipeline-class-name SanaWMRealtimePipeline \
  --host 127.0.0.1 --port 30000 \
  --dit-cpu-offload --text-encoder-cpu-offload
```
Notes on launch behavior:
- The server listens on `127.0.0.1:30000` by default (`--host` / `--port` override).
- `--dit-cpu-offload`, `--text-encoder-cpu-offload`, and `--image-encoder-cpu-offload` are available; defaults are auto-adjusted from GPU memory (GPUs under 30 GB get more aggressive offloading).
- For multi-GPU realtime, use sequence parallelism (set `--sp-degree` equal to the number of GPUs for a single session). Do not enable CFG parallel for the realtime profile: the default realtime request uses `guidance_scale=1.0`, while CFG parallel requires active cond/uncond branches.
- Use `--use-fsdp-inference` only when you specifically need weight sharding for memory. For the low-latency realtime profile, prefer keeping components resident and using SP first.
- No `--warmup` flag is needed.

Once up, the realtime WebSocket endpoint lives at `ws://127.0.0.1:30000/v1/realtime_video/generate` (use the Python client in §7 to connect; plain curl does not speak the `ws://` upgrade).
The realtime API is a single WebSocket at /v1/realtime_video/generate. All messages — client → server and server → client — are msgpack (msgspec.msgpack.encode / decode), not JSON.
The lifecycle is:
<Steps>
  <Step title="Connect & send INIT">
    The client opens the WebSocket and sends exactly one **init** message (`type: "init"`), carrying the prompt, the required `first_frame`, output/sampling options, and optional camera conditions in `condition_inputs`.
  </Step>
  <Step title="Stream live EVENTs (optional)">
    While generation runs, the client may push **event** messages (`type: "event"`) to steer the camera: either `kind: "camera_actions"` (frame-by-frame lists or state transitions) or `kind: "action"` (an action-DSL string).
  </Step>
  <Step title="Receive frame batches">
    The server streams **frame batches** back. Each chunk arrives as one or more `frame_batch` messages (header fields + payload bytes); `is_final_frame_batch: true` marks the end of a chunk. The server also emits `chunk_stats` timing messages.
  </Step>
</Steps>

`RealtimeVideoGenerationsRequest` (`type` is the literal `"init"`). Key fields:
| Field | Type | Notes |
|---|---|---|
| `type` | `"init"` | Required literal |
| `prompt` | `str` | Text prompt |
| `first_frame` | `bytes \| str` | Required by the SANA-WM adapter (`on_init` raises if absent), though the generic request schema defines it as optional. Raw image bytes, a server-side path, or an `http(s)://` URL (downloaded & cached) |
| `condition_inputs` | `dict` | Camera/conditioning inputs (see below) |
| `num_frames` | `int` | Total frames to generate. Omit it for an open-ended, continuous session: the adapter leaves `num_frames` unset and flags an open-ended run (`condition_inputs["sana_wm_open_ended"] = True`), generating uniform chunks indefinitely (until `max_chunks` or the client disconnects). Provide an integer for a fixed-length clip |
| `seed` | `int` | RNG seed (default 42) |
| `size` | `str` | `"WIDTHxHEIGHT"`; realtime requests default to `"832x480"` for latency. Pass `"1280x704"` for the native landscape resolution |
| `max_chunks` | `int` | Optional cap on total chunks generated |
| `num_inference_steps` | `int` | Default 4 for SANA-WM (realtime adapter) |
| `guidance_scale` | `float` | Default 1.0 |
| `realtime_output_format` | `"raw" \| "webp" \| "jpeg"` | Frame encoding for output (see below) |
| `realtime_causal_sink_size` | `int` | Optional override |
| `realtime_causal_kv_cache_num_frames` | `int` | Optional override |
condition_inputs accepts (all optional; pass only one of action / camera_actions):
| Key | Type | Meaning |
|---|---|---|
| `camera_actions` | `list[list[str]]` or `{mode: "state", transitions: [...]}` | Frame-by-frame camera actions, or state-based transitions |
| `action` | `str` | Action-DSL string, e.g. `"w-10,none-5,a-8"` (see §8) |
| `intrinsics_path` | `str` | Server-side path to a camera-intrinsics `.npy` file (loaded via `np.load`; shapes `(4,)`, `(3,3)`, or `(F,3,3)`) |
| `intrinsics` | `list` | Inline intrinsics with shape `(4,)`, `(3,3)`, `(F,4)`, or `(F,3,3)` |
If you omit both intrinsics_path and intrinsics, SGLang uses a centered heuristic intrinsic matrix derived from the first-frame size. Pass explicit intrinsics when you need closer camera parity with a prepared trajectory.
```json
{
  "type": "init",
  "prompt": "beautiful landscape video",
  "first_frame": "<bytes or url>",
  "size": "832x480",
  "seed": 42,
  "max_chunks": 10,
  "realtime_output_format": "raw",
  "num_inference_steps": 4,
  "guidance_scale": 1.0,
  "condition_inputs": {
    "camera_actions": [["w"], [], ["a", "s"]],
    "intrinsics_path": "/path/to/intrinsics.npy"
  }
}
```
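When explicit intrinsics are wanted but no calibration file exists, an inline `(3,3)` list can be built by hand. The sketch below is only an illustration of a centered pinhole matrix: the exact heuristic SGLang applies when intrinsics are omitted is not described here, and both the helper name `centered_intrinsics` and the 90° horizontal field of view are assumptions chosen for the example.

```python
import math


def centered_intrinsics(width, height, hfov_deg=90.0):
    """Return a 3x3 pinhole matrix (list of lists) with the principal point at the image center.

    hfov_deg is a hypothetical default for illustration, not SGLang's heuristic.
    """
    fx = (width / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)
    fy = fx                       # assume square pixels
    cx, cy = width / 2.0, height / 2.0
    return [
        [fx, 0.0, cx],
        [0.0, fy, cy],
        [0.0, 0.0, 1.0],
    ]


K = centered_intrinsics(832, 480)
# Pass it inline instead of intrinsics_path:
condition_inputs = {"action": "w-10,none-5,a-8", "intrinsics": K}
```

A plain nested list is used because the `intrinsics` key accepts inline lists, so no `.npy` file has to exist on the server.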
RealtimeEvent (type: "event"). Use kind + payload (optional event_id correlates the response back to this event).
```json
{
  "type": "event",
  "kind": "camera_actions",
  "event_id": 1,
  "payload": [["w"], ["w"], ["a"], []]
}
```

```json
{
  "type": "event",
  "kind": "camera_actions",
  "event_id": 2,
  "payload": {
    "mode": "state",
    "transitions": [
      {"actions": ["w"], "client_ts_ms": 1000},
      {"actions": ["a", "w"], "client_ts_ms": 1500}
    ]
  }
}
```

```json
{
  "type": "event",
  "kind": "action",
  "event_id": 3,
  "payload": "w-10,none-5,a-8,d-10"
}
```
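The three event shapes above can be produced by small builder functions so that client code does not hand-assemble dictionaries. This is a sketch with hypothetical helper names; it assumes the key set `wasdijkl` from the action DSL (§8) also applies to `camera_actions` lists, which the examples suggest but do not state.

```python
ALLOWED_KEYS = set("wasdijkl")  # assumption: same key set as the action DSL


def _check(keys):
    bad = [k for k in keys if k not in ALLOWED_KEYS]
    if bad:
        raise ValueError(f"unknown camera keys: {bad}")
    return list(keys)


def camera_actions_event(event_id, frames):
    """Frame-by-frame form: one list of keys per frame ([] = no motion)."""
    return {"type": "event", "kind": "camera_actions", "event_id": event_id,
            "payload": [_check(keys) for keys in frames]}


def state_event(event_id, transitions):
    """State form: transitions is a list of (keys, client_ts_ms) pairs."""
    return {"type": "event", "kind": "camera_actions", "event_id": event_id,
            "payload": {"mode": "state",
                        "transitions": [{"actions": _check(keys), "client_ts_ms": ts}
                                        for keys, ts in transitions]}}


def action_event(event_id, dsl):
    """Action-DSL form: the payload is the DSL string itself."""
    return {"type": "event", "kind": "action", "event_id": event_id, "payload": dsl}


evt = camera_actions_event(1, [["w"], ["w"], ["a"], []])
```

Each returned dictionary still has to be msgpack-encoded (for example with `msgspec.msgpack.encode`) before it is sent over the WebSocket.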
The server streams frame batches. Every batch arrives as a single msgpack message with type: "frame_batch" — the header fields below plus an inline payload bytes field (the wire type is always "frame_batch"; there is no separate header-then-bytes message).
Header fields:
| Field | Meaning |
|---|---|
| `type` | `"frame_batch"` (always) |
| `request_id` | Generation id |
| `chunk_index` | Chunk index |
| `content_type` | `application/x-raw-rgb`, `application/x-raw-rgb-delta-gzip`, `image/webp`, or `image/jpeg` |
| `num_frames` | Frames in this batch |
| `total_size` | Payload size in bytes (`len(payload)`; the compressed size for delta-gzip) |
| `width`, `height`, `channels` | Frame geometry (`channels`: 3) |
| `bytes_per_frame` | Bytes per uncompressed frame (`width*height*3`) |
| `format` | `rgb24` for raw |
| `encoding` | `raw`, `delta-gzip`, `webp`, or `jpeg` |
| `delta_reference` | `previous-frame` (present for delta-gzip) |
| `event_id` | Echoes the steering event id; omitted from the header for INIT-only chunks |
| `frame_batch_index`, `num_frame_batches` | Sequence multiple batches within a chunk |
| `is_final_frame_batch` | `true` ends the chunk |
```json
{
  "type": "frame_batch",
  "request_id": "uuid-string",
  "chunk_index": 0,
  "content_type": "application/x-raw-rgb-delta-gzip",
  "num_frames": 3,
  "total_size": 1048576,
  "width": 1280,
  "height": 704,
  "channels": 3,
  "bytes_per_frame": 2703360,
  "format": "rgb24",
  "encoding": "delta-gzip",
  "delta_reference": "previous-frame",
  "event_id": 1,
  "frame_batch_index": 0,
  "num_frame_batches": 1,
  "is_final_frame_batch": true,
  "payload": "<gzip-compressed bytes>"
}
```
Encodings. `application/x-raw-rgb` is uncompressed RGB24 (3 × uint8, `bytes_per_frame = width*height*3`). `application/x-raw-rgb-delta-gzip` is the zlib-compressed per-frame XOR delta against the preceding frame (each frame in the batch is XOR'd against the previous one; sent by default). `realtime_output_format: "raw"` forces uncompressed RGB; `"webp"` / `"jpeg"` send preview-encoded frames.
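To make the delta encoding concrete, here is a minimal round-trip sketch of an XOR-delta-plus-zlib scheme. It is not SGLang's encoder: the helper names are hypothetical, and it assumes that the first frame of a batch is XOR'd against a caller-supplied previous frame (all zeros when none is given) and that the decompressed payload is the deltas concatenated in frame order. Check the server implementation before relying on these details.

```python
import zlib


def _xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length buffers."""
    return (int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).to_bytes(len(a), "big")


def encode_delta(frames, prev=None):
    """Illustrative encoder: XOR each frame with its predecessor, then compress."""
    ref = prev if prev is not None else bytes(len(frames[0]))
    deltas = bytearray()
    for frame in frames:
        deltas += _xor(frame, ref)
        ref = frame
    return zlib.compress(bytes(deltas))


def decode_delta(payload, num_frames, bytes_per_frame, prev=None):
    """Undo encode_delta; returns a list of raw RGB24 frame buffers."""
    # wbits = MAX_WBITS | 32 lets zlib auto-detect a zlib or gzip header.
    raw = zlib.decompress(payload, zlib.MAX_WBITS | 32)
    if len(raw) != num_frames * bytes_per_frame:
        raise ValueError("payload size does not match num_frames * bytes_per_frame")
    ref = prev if prev is not None else bytes(bytes_per_frame)
    frames = []
    for i in range(num_frames):
        delta = raw[i * bytes_per_frame:(i + 1) * bytes_per_frame]
        ref = _xor(delta, ref)      # frame = delta XOR previous frame
        frames.append(ref)
    return frames


original = [bytes([1, 2, 3, 4, 5, 6]), bytes([1, 2, 9, 4, 5, 6]), bytes(6)]
decoded = decode_delta(encode_delta(original), num_frames=3, bytes_per_frame=6)
```

Consecutive video frames differ in few bytes, so the XOR deltas are mostly zeros and compress far better than the raw frames do.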
```python
import asyncio

import msgspec
import numpy as np
import websockets  # pip install websockets

WS_URL = "ws://127.0.0.1:30000/v1/realtime_video/generate"


async def run():
    async with websockets.connect(WS_URL, max_size=None) as ws:
        # 1) INIT: omit num_frames for an open-ended session; "raw" = uncompressed RGB24
        with open("first_frame.png", "rb") as f:
            first_frame = f.read()
        await ws.send(msgspec.msgpack.encode({
            "type": "init",
            "prompt": "a camera moving forward and turning right",
            "first_frame": first_frame,
            "size": "832x480",
            "seed": 42,
            "max_chunks": 10,
            "realtime_output_format": "raw",
            "num_inference_steps": 4,
            "guidance_scale": 1.0,
            "condition_inputs": {
                "action": "w-100,wd-50,d-30",
                "intrinsics_path": "/path/to/intrinsics.npy",  # optional; centered heuristic if omitted
            },
        }))

        # 2) optional: steer mid-stream
        await ws.send(msgspec.msgpack.encode({
            "type": "event",
            "kind": "camera_actions",
            "event_id": 1,
            "payload": [["w"], ["w"], ["a"], []],
        }))

        # 3) receive frame batches (raw RGB24)
        async for message in ws:
            msg = msgspec.msgpack.decode(message)
            if msg.get("type") != "frame_batch":
                continue  # skip chunk_stats etc.
            n, h, w, c = msg["num_frames"], msg["height"], msg["width"], msg["channels"]
            frames = np.frombuffer(msg["payload"], dtype=np.uint8).reshape(n, h, w, c)
            # ... display/save frames ...
            # stop after the last batch of the 10th chunk (max_chunks=10)
            if msg.get("is_final_frame_batch") and msg.get("chunk_index", 0) >= 9:
                break


if __name__ == "__main__":
    asyncio.run(run())
```
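Because a chunk may be split across several `frame_batch` messages, a client usually wants to group batches per chunk before displaying them. The following is a minimal sketch of that bookkeeping; `ChunkAssembler` is a hypothetical helper, and it assumes that the batch flagged `is_final_frame_batch` is the last one delivered for its chunk.

```python
from collections import defaultdict


class ChunkAssembler:
    """Collect frame_batch payloads per chunk until the final batch arrives."""

    def __init__(self):
        self._parts = defaultdict(list)  # chunk_index -> [(frame_batch_index, payload)]

    def add(self, msg):
        """Feed one decoded message.

        Returns (chunk_index, [payload, ...]) once a chunk is complete, else None.
        Non-frame messages such as chunk_stats are ignored.
        """
        if msg.get("type") != "frame_batch":
            return None
        idx = msg["chunk_index"]
        self._parts[idx].append((msg.get("frame_batch_index", 0), msg["payload"]))
        if not msg.get("is_final_frame_batch"):
            return None
        batches = sorted(self._parts.pop(idx), key=lambda part: part[0])
        return idx, [payload for _, payload in batches]


assembler = ChunkAssembler()
first = assembler.add({"type": "frame_batch", "chunk_index": 0, "frame_batch_index": 0,
                       "is_final_frame_batch": False, "payload": b"aa"})
done = assembler.add({"type": "frame_batch", "chunk_index": 0, "frame_batch_index": 1,
                      "is_final_frame_batch": True, "payload": b"bb"})
```

Inside the receive loop of a client, each decoded message would be passed to `assembler.add(msg)` and the frames rendered only when it returns a completed chunk.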
Camera trajectories are described by a compact string of comma-separated `<keys>-<frames>` segments, e.g. `"w-100,wd-50,d-30,none-10"`. This is the format accepted by `condition_inputs.action` at init and by `kind: "action"` events.
Parsing rules (parse_action_string):
- Each segment is `<keys>-<frames>`; `<frames>` must be a positive integer.
- `none` means no motion for that span: `none-10` = 10 static frames.
- Keys can be combined within a segment (e.g. `wd` = forward + right strafe). Allowed keys are exactly `wasdijkl`.

| Key | Motion |
|---|---|
| `w` / `s` | move forward / backward |
| `a` / `d` | strafe left / right |
| `i` / `k` | look (pitch) up / down |
| `j` / `l` | look (yaw) left / right |
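The parsing rules can be illustrated with a small stand-alone expander that turns a DSL string into one key list per frame. This is a sketch of the rules as stated above, not the library's `parse_action_string`; the function name `expand_actions` and the per-frame list output are choices made for the example.

```python
ALLOWED_KEYS = "wasdijkl"


def expand_actions(action):
    """Expand "w-2,wd-1,none-1" into [["w"], ["w"], ["w", "d"], []]."""
    frames = []
    for segment in action.split(","):
        keys, sep, count = segment.strip().rpartition("-")
        if not sep or not keys or not count.isdigit() or int(count) <= 0:
            raise ValueError(f"bad segment {segment!r}: expected <keys>-<frames>")
        if keys == "none":
            pressed = []                      # static frames
        else:
            bad = [k for k in keys if k not in ALLOWED_KEYS]
            if bad:
                raise ValueError(f"unknown keys {bad} in segment {segment!r}")
            pressed = list(keys)
        # one independent copy of the key list per frame
        frames.extend(list(pressed) for _ in range(int(count)))
    return frames


frames = expand_actions("w-100,wd-50,d-30,none-10")
print(len(frames))  # 190
```

The segment lengths add up to the trajectory length, which makes it easy to check a DSL string against the number of frames being requested.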
Pose generation (action_string_to_c2w):
- Translation (`w`/`s`/`a`/`d`) moves at `translation_speed` (default 0.04 world-units/frame).
- Rotation (`i`/`k` pitch, `j`/`l` yaw) turns at `rotation_speed_deg` (default 1.2°/frame); pitch is clamped to ±85°.
- Strafe-to-yaw coupling (default 0.4): a `d` (right) strafe also nudges yaw right and `a` (left) nudges yaw left, so `wd` traces a curving arc rather than a pure sidestep.
- The output is `(F+1, 4, 4)` camera-to-world matrices; the realtime stage pads the trajectory to the requested frame count.

Example: `"w-100,wd-50,d-30,none-10"` = 100 frames forward → 50 frames forward + sweep right → 30 frames right strafe → 10 frames static.
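The rotation part of pose generation can be sketched as a per-frame accumulation with a pitch clamp. This is a deliberately simplified illustration, not `action_string_to_c2w`: it tracks only yaw and pitch angles, ignores translation and the strafe-to-yaw coupling, and assumes positive yaw means turning right and positive pitch means looking up.

```python
def integrate_rotation(frames, rotation_speed_deg=1.2, pitch_limit_deg=85.0):
    """Accumulate (yaw, pitch) in degrees from per-frame key lists.

    Sign conventions here are assumptions made for the example.
    """
    yaw = 0.0
    pitch = 0.0
    for keys in frames:
        if "l" in keys:
            yaw += rotation_speed_deg      # look right
        if "j" in keys:
            yaw -= rotation_speed_deg      # look left
        if "i" in keys:
            pitch += rotation_speed_deg    # look up
        if "k" in keys:
            pitch -= rotation_speed_deg    # look down
        # clamp after every frame so pitch never exceeds the limit
        pitch = max(-pitch_limit_deg, min(pitch_limit_deg, pitch))
    return yaw, pitch


yaw, pitch = integrate_rotation([["l"]] * 80)   # 80 frames of right yaw, about 96 degrees
_, clamped = integrate_rotation([["i"]] * 100)  # 120 degrees requested, clamped to 85
```

At the default 1.2°/frame, a quarter turn takes 75 frames, which is a convenient rule of thumb when writing trajectories by hand.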
SANA-WM's defaults live in three places: request-time sampling params, the pipeline config (streaming/refiner knobs), and the realtime adapter (init-time overrides).
SanaWMSamplingParams (configs/sample/sana_wm.py)

| Field | Default | Purpose |
|---|---|---|
| `height` | 704 | Output height |
| `width` | 1280 | Output width |
| `num_frames` | 49 | Total pixel frames (must satisfy `(num_frames - 1) % 8 == 0`) |
| `fps` | 16 | Output frame rate (overrides the base default of 24) |
| `num_inference_steps` | 20 | Stage-1 step count |
| `guidance_scale` | 4.5 | Dense-path CFG scale |
| `negative_prompt` | `""` | Negative prompt |
| `camera_to_world` | `None` | In-memory `(T,4,4)` c2w extrinsics (mutually exclusive with `action`) |
| `intrinsics` | `None` | In-memory `(T,3,3)` pinhole intrinsics |
| `action` | `None` | Action-DSL string (see §8) |
| `translation_speed` | 0.04 | World-units/frame for W/S/A/D |
| `rotation_speed_deg` | 1.2 | Degrees/frame for I/K/J/L |
| `pitch_limit_deg` | 85.0 | Pitch clamp |
generator_device is inherited from the base SamplingParams (default None = use the pipeline/model default). On the /v1/videos HTTP API the camera fields are passed inside diffusers_kwargs (action / intrinsics, as in §4–5).
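The frame-count constraint follows from the VAE's temporal stride of 8. The sketch below works through that arithmetic; it assumes the causal-VAE mapping `latent_frames = (num_frames - 1) / 8 + 1`, which matches the 321 → 41 example in §5, and that latent height and width are simply the pixel size divided by the spatial stride of 32. The helper names are hypothetical.

```python
VAE_STRIDES = (8, 32, 32)  # (T, H, W) strides of the LTX-2 causal VAE


def latent_shape(num_frames, height=704, width=1280):
    """Return (latent_frames, latent_height, latent_width) for a pixel-space request."""
    st, sh, sw = VAE_STRIDES
    if (num_frames - 1) % st != 0:
        raise ValueError(f"(num_frames - 1) must be a multiple of {st}")
    if height % sh or width % sw:
        raise ValueError(f"height and width must be multiples of {sh} and {sw}")
    return (num_frames - 1) // st + 1, height // sh, width // sw


def clip_seconds(output_frames, fps=16):
    """Playback duration of a clip at the given frame rate."""
    return output_frames / fps


print(latent_shape(321))   # (41, 22, 40)
print(latent_shape(49))    # (7, 22, 40)
print(clip_seconds(320))   # 20.0
```

The duration helper also shows why `fps` matters: the same 320 frames last 20 seconds at 16 fps but only about 13.3 seconds at 24 fps.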
SanaWMPipelineConfig (configs/pipeline_configs/sana_wm.py)

These are server-launch knobs (set via the `--streaming` / `--refiner-chunked` / `--num-frame-per-block` CLI flags or a pipeline-config override), not request fields:

| Field | Default | Purpose |
|---|---|---|
| `streaming` | `False` | Chunk-causal `forward_long` (§5) vs dense one-shot (§4) |
| `refiner_chunked` | `True` | Chunk-wise streaming refiner vs whole-clip dense refiner |
| `num_frame_per_block` | 3 | Latent frames per Stage-1 / refiner chunk |
| `num_cached_blocks` | 2 | Rolling KV-cache history window |
| `denoising_step_list` | `(1000, 960, 889, 727, 0)` | 4-step streaming self-forcing timesteps (must end in 0) |
| `streaming_cfg_scale` | 1.0 | CFG scale for the distilled streaming path (1.0 = off) |
| `sink_size` | 1 | Sink (unrefined context) frames |
| `refiner_block_size` | 3 | Refiner block size |
| `refiner_kv_max_frames` | 11 | Refiner sliding KV window |
SanaWMRealtimeAdapter

At WebSocket init the realtime adapter fills SANA-WM defaults that differ from the request/sampling defaults above:

| Field | Realtime default | Note |
|---|---|---|
| `size` | `832x480` | Realtime request default; pass `1280x704` for native landscape output |
| `num_frames` | (unset) | Omitting → open-ended continuous session (§7) |
| `num_inference_steps` | 4 | Distilled few-step |
| `guidance_scale` | 1.0 | CFG off |
| `fps` | 16 | Native rate |