SANA-WM

1. Model Introduction

SANA-WM is an efficient open-source world model from NVLabs, trained natively for one-minute video generation. It is a 2.6B-parameter text+image-to-video (TI2V) diffusion transformer that synthesizes 720p, minute-scale videos with precise 6-DoF camera control, paired with an LTX-2 refiner for high-fidelity decoding. It builds on the SANA family — efficient high-resolution synthesis with a linear diffusion transformer.

SANA-WM ships in two checkpoints: a bidirectional checkpoint (dense, one-shot) and a streaming checkpoint (chunk-causal and autoregressive: the clip is generated chunk by chunk, reusing causal DiT state across chunks, so memory stays bounded and clips can be long or even open-ended). Every mode starts from a single first frame, a text prompt, and a camera trajectory. This cookbook covers all three serving modes SGLang exposes:

(A) Dense bidirectional (§4) — the SANA-WM_bidirectional checkpoint generated in one shot (no chunking) via SanaWMTwoStagePipeline over the standard /v1/videos HTTP API. Highest single-clip quality (full bidirectional attention + dense LTX-2 refiner); matches the NVlabs dense reference.
(B) Batch streaming (§5) — the SANA-WM_streaming checkpoint generated chunk-by-chunk in one request via the same SanaWMTwoStagePipeline + --streaming over /v1/videos. This is SGLang's offline chunk-causal streaming path: the whole clip is produced chunk-by-chunk internally, then returned.
(C) Live realtime (§6–7) — the streaming pipeline exposed as SanaWMRealtimePipeline over a WebSocket API at /v1/realtime_video/generate, so a browser/client streams camera-action events frame-by-frame and receives video chunks back in real time. Realtime uses the same streaming checkpoint, but the incremental session path is not bit-identical to offline batch streaming.

All three modes share the camera action DSL (§8) and the configuration knobs (§9). Modes (B) and (C) share the streaming checkpoint and the chunk-causal pipeline.

Key features (per the official model):

Hybrid Linear Attention — frame-wise Gated DeltaNet (GDN) recurrent blocks combined with softmax attention (every 4th layer, block indices {3,7,11,15,19}) for memory-efficient long-context modeling.
Dual-Branch Camera Control — independent main and camera branches (UCPE + PRoPE) for precise per-frame 6-DoF trajectory adherence.
Two-Stage Pipeline — an LTX-2 long-video refiner on top of Stage-1 latents for quality and temporal consistency.
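
As a small illustration of the hybrid layout, the sketch below lists which of the 20 Stage-1 blocks would use softmax attention. It assumes that "every 4th layer" means block index `i` with `i % 4 == 3`, which reproduces the indices quoted above; the function and constant names are illustrative and not taken from the SANA-WM code.

```python
NUM_LAYERS = 20     # Stage-1 DiT depth quoted in this cookbook
SOFTMAX_EVERY = 4   # softmax attention on every 4th block (assumed rule)


def block_kinds(num_layers: int = NUM_LAYERS, every: int = SOFTMAX_EVERY) -> list:
    """Return 'softmax' or 'gdn' for each block index (illustrative only)."""
    return ["softmax" if i % every == every - 1 else "gdn" for i in range(num_layers)]


kinds = block_kinds()
softmax_blocks = [i for i, kind in enumerate(kinds) if kind == "softmax"]
print(softmax_blocks)  # [3, 7, 11, 15, 19]
```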

In the streaming / realtime configuration this becomes a low-latency, interactive pipeline:

Stage-1 chunk-causal DiT — the streaming path carries a per-block KV cache (recurrent GDN state + a softmax K/V window) across chunks; bounded memory means it scales to long / endless sequences. Stage-1 is intentionally coarse.
LTX-2 streaming refiner — refines each Stage-1 latent chunk block-by-block with a sink + sliding-history KV cache (required for sharp output).
Causal LTX-2 VAE — decodes latents chunk-by-chunk with a carried conv-cache for seam-free frames.
Camera control — drive the camera with a compact WASD/IJKL action DSL (move with WASD, look with IJKL; see §8) — supplied at request time on the /v1/videos paths, or pushed over the WebSocket at init / as live per-chunk events on the realtime path (see §7).

Architecture & components

Component	Value
Stage-1 DiT	2.6B; 20 layers, hidden 2240, 20 heads (head_dim 112); ~10 GB
Attention	frame-wise Gated DeltaNet + softmax every 4th block (hybrid linear)
Camera	dual-branch, UCPE + PRoPE (raymap + Plücker), 6-DoF
VAE	LTX-2 causal, strides (T, H, W) = (8, 32, 32); ~2 GB
Refiner	LTX-2 Stage-2 distilled; ~41 GB
Output	up to 720p (704×1280) @ 16 fps, minute-scale

For more details, see the SANA-WM paper (arXiv), the SANA project page, the NVlabs/Sana GitHub, and the SANA-WM_bidirectional model card (Apache-2.0).

2. Installation

SGLang-diffusion offers multiple installation methods depending on your hardware platform. Please refer to the official SGLang-diffusion installation guide.

SANA-WM adds the SanaWMTransformer3DModel + GDN kernels, the SanaWMTwoStagePipeline (dense bidirectional + chunk-causal streaming), and the SanaWMRealtimePipeline with the /v1/realtime_video WebSocket router. The diffusion server CLI is invoked as python -m sglang.multimodal_gen.runtime.entrypoints.cli.main.

3. Model Setup

Both SANA-WM checkpoints are public (Apache-2.0, no gating, no token) and load directly — there is no manual assembly step. Pass the HuggingFace repo id to --model-path and SGLang downloads, materializes, validates, and loads it:

Mode	`--model-path`
Dense bidirectional (§4)	`Efficient-Large-Model/SANA-WM_bidirectional`
Batch streaming (§5) / realtime (§6)	`Efficient-Large-Model/SANA-WM_streaming`

Both repo ids are registered in SGLang's built-in model-overlay registry, so on first load the overlay transparently materializes the official release into a runnable Diffusers directory — for the streaming checkpoint this converts the DMD self-forcing checkpoint (sana_dit/model.pt) into a Diffusers transformer/ and wires the LTX-2 causal VAE, the LTX-2 refiner, and the Gemma encoders. No environment variable or build_model_dir.sh step is needed. (You may also pass a local, already-materialized Diffusers directory.)

The materialized checkpoint is a Diffusers directory whose model_index.json declares the loadable components:

Component (`model_index.json`)	Class
`transformer` (Stage-1 DiT)	`diffusers.SanaWMTransformer3DModel`
`vae`	`diffusers.AutoencoderKLCausalLTX2Video`
`text_encoder`	`transformers.Gemma2Model`
`tokenizer`	`transformers.GemmaTokenizer`
`scheduler`	`diffusers.FlowMatchEulerDiscreteScheduler`

How loading works:

The server resolves the checkpoint via maybe_download_model(model_path, force_diffusers_model=True) and verifies it contains a model_index.json plus the required component subdirectories (transformer/, vae/).
If text_encoder / tokenizer are not provided as component paths, the pipeline falls back to the default Stage-1 text encoder Efficient-Large-Model/gemma-2-2b-it (DEFAULT_SANA_WM_TEXT_ENCODER).
Pick the path with --pipeline-class-name. The checkpoint's model_index.json _class_name selects the default pipeline (SanaWMTwoStagePipeline). Pin it explicitly to choose: --pipeline-class-name SanaWMTwoStagePipeline for the /v1/videos paths (§4–5) or --pipeline-class-name SanaWMRealtimePipeline for live realtime (§6). Pinning is also required if you point --model-path at a bare safetensors file instead of a Diffusers directory.
The Stage-2 LTX-2 refiner lives under refiner/ in the checkpoint: refiner/transformer (transformer_2), refiner/connectors (connectors), and refiner/text_encoder (the Gemma-3 encoder for text_encoder_2, whose tokenizer also serves as tokenizer_2). The refiner is optional: it is skipped (Stage-1-only output) when the env flag SGLANG_SANA_WM_SKIP_REFINER (or a skip_refiner request extra) is set, or when no refiner/ is present (transformer_2 unloaded). On the batch path it runs chunk-wise with --refiner-chunked (the official streaming path, default on) or whole-clip without it; on the realtime path the pipeline builds a SanaWMChunkedRefinerChainStage only when a refiner is available, and otherwise streams Stage-1 frames.

<Note> Throughout this cookbook, `<checkpoint>` stands for the appropriate SANA-WM repo id from the table above (or a local materialized Diffusers directory). </Note>

4. Dense bidirectional (offline `/v1/videos`)

The bidirectional checkpoint generates the whole clip in one shot (full bidirectional attention, not chunked) followed by a dense LTX-2 refiner — the highest single-clip quality, matching the NVlabs dense reference.

Launch with the two-stage pipeline and no --streaming flag (dense is the default — streaming defaults to False):

bash

python -m sglang.multimodal_gen.runtime.entrypoints.cli.main serve \
  --model-path Efficient-Large-Model/SANA-WM_bidirectional \
  --pipeline-class-name SanaWMTwoStagePipeline \
  --host 127.0.0.1 --port 30000

Then POST to /v1/videos exactly as in §5, but pass the NVlabs dense sampling defaults for closest parity. The dense path runs a full multi-step schedule with CFG, unlike the distilled few-step schedule of the streaming path:

bash

curl -s http://127.0.0.1:30000/v1/videos \
  -H 'content-type: application/json' -d '{
    "prompt": "a camera moving forward and turning left",
    "input_reference": "/path/to/first_frame.png",
    "num_frames": 321,
    "seed": 42,
    "fps": 16,
    "num_inference_steps": 60,
    "guidance_scale": 5.0,
    "diffusers_kwargs": {
      "action": "w-80,wl-80,l-80,wj-80",
      "intrinsics": "/path/to/intrinsics.npy"
    }
  }'

num_inference_steps / guidance_scale — the dense path uses CFG; NVlabs' reference defaults to 60 steps, guidance 5.0 (the SanaWMSamplingParams defaults are the lighter 20 / 4.5 — pass 60 / 5.0 explicitly for dense parity).
The dense refiner drops the leading sink frame, so a num_frames=321 request yields 320 output frames.

5. Batch streaming (offline `/v1/videos`)

The streaming checkpoint generates a full camera-controlled clip in one request, with no WebSocket involved. This is SGLang's offline streaming path: the whole clip is generated chunk-by-chunk internally, refined, decoded, and returned as one video.

Launch with the two-stage pipeline + the streaming flags:

bash

python -m sglang.multimodal_gen.runtime.entrypoints.cli.main serve \
  --model-path Efficient-Large-Model/SANA-WM_streaming \
  --pipeline-class-name SanaWMTwoStagePipeline \
  --streaming --refiner-chunked \
  --host 127.0.0.1 --port 30000

--streaming — chunk-causal forward_long Stage-1 (vs the dense one-shot path of §4).
--refiner-chunked — chunk-wise streaming LTX-2 refiner (on by default). To use the whole-clip dense refiner instead (also valid, higher peak memory), pass --refiner-chunked false — simply omitting the flag keeps the default chunked refiner.
--num-frame-per-block N — latent frames per chunk (default 3).

Then POST to /v1/videos (JSON body shown below; multipart/form-data with an uploaded input_reference file also works). Camera control goes in diffusers_kwargs — the action-DSL string (§8) and the intrinsics:

bash

curl -s http://127.0.0.1:30000/v1/videos \
  -H 'content-type: application/json' -d '{
    "prompt": "a camera moving forward and turning left",
    "input_reference": "/path/to/first_frame.png",
    "num_frames": 321,
    "seed": 42,
    "fps": 16,
    "diffusers_kwargs": {
      "action": "w-80,wl-80,l-80,wj-80",
      "intrinsics": "/path/to/intrinsics.npy"
    }
  }'

Field	Notes
`prompt`	text prompt
`input_reference`	first-frame image — a server-side path, or (multipart) an uploaded file. For an `http(s)://` URL in a JSON body, use the separate `reference_url` field (the server downloads it and assigns it to `input_reference`)
`num_frames`	total pixel frames (e.g. `321` → 41 latent frames, 13 chunks; output 704×1280)
`seed`	RNG seed (default `42`)
`fps`	output frame rate — pass `16` (SANA-WM's native rate). The generic `/v1/videos` default is `24`, which would encode the same frames at 24 fps and make the clip play ~33% shorter (16/24 of the duration)
`diffusers_kwargs.action`	camera action-DSL string (§8)
`diffusers_kwargs.intrinsics`	path to a camera-intrinsics `.npy` (per-frame `(T,3,3)`) or an inline 3×3 / `(T,3,3)` list
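
The frame arithmetic behind `num_frames` and `fps` can be checked with a few lines of Python. This is a sketch, assuming the usual causal-VAE relation `latent = 1 + (num_frames - 1) / 8` (it reproduces the `321` → 41 figure above and the `(num_frames - 1) % 8 == 0` constraint from §9); the helper names are hypothetical.

```python
TEMPORAL_STRIDE = 8  # temporal stride T of the LTX-2 causal VAE


def latent_frames(num_frames: int, stride: int = TEMPORAL_STRIDE) -> int:
    """Pixel frames -> latent frames, assuming latent = 1 + (N - 1) / stride."""
    if num_frames < 1 or (num_frames - 1) % stride != 0:
        raise ValueError(f"num_frames must satisfy (num_frames - 1) % {stride} == 0")
    return 1 + (num_frames - 1) // stride


def duration_seconds(num_frames: int, fps: int) -> float:
    """Playback duration when num_frames frames are encoded at fps."""
    return num_frames / fps


print(latent_frames(321))         # 41
print(duration_seconds(321, 16))  # 20.0625
print(duration_seconds(321, 24))  # 13.375
```

The last two lines show why `fps` matters: the same 321 frames play for about 20 s at 16 fps but only about 13.4 s at the generic default of 24 fps.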

The response is a VideoResponse; fetch the rendered MP4 via the returned reference or GET /v1/videos/{id}/content. The streaming hyperparameters (num_frame_per_block, denoising_step_list, sink_size, num_cached_blocks, streaming_cfg_scale) are pipeline-config defaults on SanaWMPipelineConfig, not request fields — see §9.
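
For scripted use, the same request can be built from Python with only the standard library. The sketch below constructs the POST without sending it; `build_video_request` and `content_url` are hypothetical helper names, and the exact fields of the returned VideoResponse are not shown in this cookbook, so reading the job id out of the response is left to the caller.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:30000"  # default host/port from the launch command


def build_video_request(prompt, first_frame_path, action, intrinsics_path,
                        num_frames=321, seed=42, fps=16):
    """Build (but do not send) the POST /v1/videos request shown above."""
    payload = {
        "prompt": prompt,
        "input_reference": first_frame_path,
        "num_frames": num_frames,
        "seed": seed,
        "fps": fps,  # pass 16 explicitly; the generic default is 24
        "diffusers_kwargs": {"action": action, "intrinsics": intrinsics_path},
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/videos",
        data=json.dumps(payload).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )


def content_url(video_id: str) -> str:
    """URL of the rendered MP4 for a given video id."""
    return f"{BASE_URL}/v1/videos/{video_id}/content"


request = build_video_request(
    "a camera moving forward and turning left",
    "/path/to/first_frame.png",
    "w-80,wl-80,l-80,wj-80",
    "/path/to/intrinsics.npy",
)
# To send it against a running server:
# with urllib.request.urlopen(request) as resp:
#     video = json.load(resp)
```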

6. Launch the Realtime Server

Launch with the realtime pipeline pinned — the checkpoint defaults to SanaWMTwoStagePipeline, so realtime must be selected explicitly (see §3). The /v1/realtime_video router is always mounted and becomes functional once the realtime config is active, because SanaWMRealtimeConfig has a registered realtime adapter (SanaWMRealtimeAdapter).

bash

python -m sglang.multimodal_gen.runtime.entrypoints.cli.main serve \
  --model-path Efficient-Large-Model/SANA-WM_streaming \
  --pipeline-class-name SanaWMRealtimePipeline \
  --host 127.0.0.1 --port 30000

Common launch variants:

bash

# recommended multi-GPU realtime profile
python -m sglang.multimodal_gen.runtime.entrypoints.cli.main serve \
  --model-path Efficient-Large-Model/SANA-WM_streaming \
  --pipeline-class-name SanaWMRealtimePipeline \
  --num-gpus 8 --sp-degree 8 \
  --host 127.0.0.1 --port 30000

# single GPU
python -m sglang.multimodal_gen.runtime.entrypoints.cli.main serve \
  --model-path Efficient-Large-Model/SANA-WM_streaming \
  --pipeline-class-name SanaWMRealtimePipeline \
  --num-gpus 1 --host 127.0.0.1 --port 30000

# offload DiT + text encoder to CPU (tight VRAM)
python -m sglang.multimodal_gen.runtime.entrypoints.cli.main serve \
  --model-path Efficient-Large-Model/SANA-WM_streaming \
  --pipeline-class-name SanaWMRealtimePipeline \
  --host 127.0.0.1 --port 30000 \
  --dit-cpu-offload --text-encoder-cpu-offload

Notes on launch behavior:

Default endpoint is 127.0.0.1:30000 (--host / --port override).
CPU offload flags are optional. --dit-cpu-offload, --text-encoder-cpu-offload, and --image-encoder-cpu-offload are available; defaults are auto-adjusted from GPU memory (GPUs under 30 GB get more aggressive offloading).
Multi-GPU realtime. Prefer explicit sequence parallelism (--sp-degree equal to the number of GPUs for a single session). Do not enable CFG parallel for the realtime profile: the default realtime request uses guidance_scale=1.0, while CFG parallel requires active cond/uncond branches.
FSDP. Use --use-fsdp-inference only when you specifically need weight sharding for memory. For the low-latency realtime profile, prefer keeping components resident and using SP first.
Warmup. Server warmup is automatically skipped for the realtime pipeline — a synthetic warmup request has no WebSocket session, so the server detects the registered realtime adapter and skips it. No --warmup flag is needed.

Once up, the realtime WebSocket endpoint lives at ws://127.0.0.1:30000/v1/realtime_video/generate (use the Python client in §7 to connect — plain curl does not speak the ws:// upgrade).

7. Realtime WebSocket API

The realtime API is a single WebSocket at /v1/realtime_video/generate. All messages — client → server and server → client — are msgpack (msgspec.msgpack.encode / decode), not JSON.

The lifecycle is:

<Steps> <Step title="Connect & send INIT"> The client opens the WebSocket and sends exactly one **init** message (`type: "init"`), carrying the prompt, the required `first_frame`, output/sampling options, and optional camera conditions in `condition_inputs`. </Step> <Step title="Stream live EVENTs (optional)"> While generation runs, the client may push **event** messages (`type: "event"`) to steer the camera — either `kind: "camera_actions"` (frame-by-frame lists or state transitions) or `kind: "action"` (an action-DSL string). </Step> <Step title="Receive frame batches"> The server streams **frame batches** back. Each chunk arrives as one or more `frame_batch` messages (header fields + payload bytes); `is_final_frame_batch: true` marks the end of a chunk. The server also emits `chunk_stats` timing messages. </Step> </Steps>

INIT message

RealtimeVideoGenerationsRequest (type is the literal "init"). Key fields:

Field	Type	Notes
`type`	`"init"`	Required literal
`prompt`	str	Text prompt
`first_frame`	bytes | str	Required by the SANA-WM adapter (`on_init` raises if absent), though the generic request schema defines it as optional. Raw image bytes, a server-side path, or an `http(s)://` URL (downloaded & cached)
`condition_inputs`	dict	Camera/conditioning inputs (see below)
`num_frames`	int	Total frames to generate. Omit it for an open-ended, continuous session — the adapter leaves `num_frames` unset and flags an open-ended run (`condition_inputs["sana_wm_open_ended"] = True`), generating uniform chunks indefinitely (until `max_chunks` or the client disconnects). Provide an integer for a fixed-length clip
`seed`	int	RNG seed (default `42`)
`size`	str	`"WIDTHxHEIGHT"`; realtime requests default to `"832x480"` for latency. Pass `"1280x704"` for the native landscape resolution
`max_chunks`	int	Optional cap on total chunks generated
`num_inference_steps`	int	Default `4` for SANA-WM (realtime adapter)
`guidance_scale`	float	Default `1.0`
`realtime_output_format`	`"raw"` | `"webp"` | `"jpeg"`	Frame encoding for output (see below)
`realtime_causal_sink_size`	int	Optional override
`realtime_causal_kv_cache_num_frames`	int	Optional override

condition_inputs accepts (all optional; pass only one of action / camera_actions):

Key	Type	Meaning
`camera_actions`	`list[list[str]]` or `{mode: "state", transitions: [...]}`	Frame-by-frame camera actions, or state-based transitions
`action`	str	Action-DSL string, e.g. `"w-10,none-5,a-8"` (see §8)
`intrinsics_path`	str	Server-side path to a camera-intrinsics `.npy` file (loaded via `np.load`; shapes `(4,)`, `(3,3)`, or `(F,3,3)`)
`intrinsics`	list	Inline intrinsics with shape `(4,)`, `(3,3)`, `(F,4)`, or `(F,3,3)`

If you omit both intrinsics_path and intrinsics, SGLang uses a centered heuristic intrinsic matrix derived from the first-frame size. Pass explicit intrinsics when you need closer camera parity with a prepared trajectory.
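
If no intrinsics file is at hand, an inline `(3,3)` matrix can be built by hand. The sketch below uses the standard pinhole relation `f = (width / 2) / tan(fov / 2)` with a centered principal point and square pixels. The 90° field of view is purely an illustrative assumption; the focal length SGLang's own heuristic picks is not stated here, so this does not claim to reproduce it.

```python
import math


def pinhole_intrinsics(width: int, height: int, horizontal_fov_deg: float) -> list:
    """3x3 pinhole matrix K with the principal point at the image center."""
    focal = (width / 2.0) / math.tan(math.radians(horizontal_fov_deg) / 2.0)
    return [
        [focal, 0.0, width / 2.0],
        [0.0, focal, height / 2.0],
        [0.0, 0.0, 1.0],
    ]


# Assumed FOV of 90 degrees for an 832x480 request (illustrative value only).
K = pinhole_intrinsics(832, 480, horizontal_fov_deg=90.0)
condition_inputs = {"action": "w-10,none-5,a-8", "intrinsics": K}
```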

json

{
  "type": "init",
  "prompt": "beautiful landscape video",
  "first_frame": "<bytes or url>",
  "size": "832x480",
  "seed": 42,
  "max_chunks": 10,
  "realtime_output_format": "raw",
  "num_inference_steps": 4,
  "guidance_scale": 1.0,
  "condition_inputs": {
    "camera_actions": [["w"], [], ["a", "s"]],
    "intrinsics_path": "/path/to/intrinsics.npy"
  }
}

Live EVENT messages

RealtimeEvent (type: "event"). Use kind + payload (optional event_id correlates the response back to this event).

json

{
  "type": "event",
  "kind": "camera_actions",
  "event_id": 1,
  "payload": [["w"], ["w"], ["a"], []]
}

json

{
  "type": "event",
  "kind": "camera_actions",
  "event_id": 2,
  "payload": {
    "mode": "state",
    "transitions": [
      {"actions": ["w"], "client_ts_ms": 1000},
      {"actions": ["a", "w"], "client_ts_ms": 1500}
    ]
  }
}

json

{
  "type": "event",
  "kind": "action",
  "event_id": 3,
  "payload": "w-10,none-5,a-8,d-10"
}

Server frame output

The server streams frame batches. Every batch arrives as a single msgpack message with type: "frame_batch" — the header fields below plus an inline payload bytes field (the wire type is always "frame_batch"; there is no separate header-then-bytes message).

Header fields:

Field	Meaning
`type`	`"frame_batch"` (always)
`request_id`	Generation id
`chunk_index`	Chunk index
`content_type`	`application/x-raw-rgb`, `application/x-raw-rgb-delta-gzip`, `image/webp`, or `image/jpeg`
`num_frames`	Frames in this batch
`total_size`	Payload size in bytes (`len(payload)` — the compressed size for delta-gzip)
`width`, `height`, `channels`	Frame geometry (`channels: 3`)
`bytes_per_frame`	Bytes per uncompressed frame (`width*height*3`)
`format`	`rgb24` for raw
`encoding`	`raw`, `delta-gzip`, `webp`, or `jpeg`
`delta_reference`	`previous-frame` (present for delta-gzip)
`event_id`	Echoes the steering event id; omitted from the header for INIT-only chunks
`frame_batch_index`, `num_frame_batches`	Sequence multiple batches within a chunk
`is_final_frame_batch`	`true` ends the chunk

json

{
  "type": "frame_batch",
  "request_id": "uuid-string",
  "chunk_index": 0,
  "content_type": "application/x-raw-rgb-delta-gzip",
  "num_frames": 3,
  "total_size": 1048576,
  "width": 1280,
  "height": 704,
  "channels": 3,
  "bytes_per_frame": 2703360,
  "format": "rgb24",
  "encoding": "delta-gzip",
  "delta_reference": "previous-frame",
  "event_id": 1,
  "frame_batch_index": 0,
  "num_frame_batches": 1,
  "is_final_frame_batch": true,
  "payload": "<gzip-compressed bytes>"
}
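
Because a chunk may span several `frame_batch` messages, a client usually buffers batches until `is_final_frame_batch` arrives. The following is a minimal sketch of that bookkeeping using only the header fields listed above; `ChunkAssembler` is a hypothetical helper, not part of SGLang, and it assumes all batches of a chunk arrive no later than the final one.

```python
from collections import defaultdict


class ChunkAssembler:
    """Group frame_batch messages by chunk until the final batch arrives."""

    def __init__(self):
        # chunk_index -> {frame_batch_index: payload bytes}
        self._pending = defaultdict(dict)

    def add(self, msg: dict):
        """Return (chunk_index, payloads in batch order) once complete, else None."""
        if msg.get("type") != "frame_batch":
            return None  # e.g. chunk_stats timing messages
        chunk = msg["chunk_index"]
        self._pending[chunk][msg["frame_batch_index"]] = msg["payload"]
        if not msg["is_final_frame_batch"]:
            return None
        batches = self._pending.pop(chunk)
        if len(batches) != msg["num_frame_batches"]:
            raise ValueError(f"chunk {chunk}: missing frame batches")
        return chunk, [batches[i] for i in sorted(batches)]


assembler = ChunkAssembler()
header = {"type": "frame_batch", "chunk_index": 0, "num_frame_batches": 2}
first = assembler.add({**header, "frame_batch_index": 0,
                       "is_final_frame_batch": False, "payload": b"aa"})
done = assembler.add({**header, "frame_batch_index": 1,
                      "is_final_frame_batch": True, "payload": b"bb"})
print(first, done)  # None (0, [b'aa', b'bb'])
```

The payloads are returned as a list rather than concatenated, so each batch can still be decoded according to its own `encoding`.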

Encodings. application/x-raw-rgb is uncompressed RGB24 (3 × uint8, bytes_per_frame = width*height*3). application/x-raw-rgb-delta-gzip is the zlib-compressed per-frame XOR delta against the preceding frame (each frame in the batch is XOR'd against the previous one; sent by default). realtime_output_format: "raw" forces uncompressed RGB; "webp" / "jpeg" send preview-encoded frames.

<Note> delta-gzip must be restored **frame-by-frame**: decompress the payload, then for each frame XOR it against the already-restored previous frame (the first frame of a batch references the last frame of the previous batch). See `restore_delta_gzip_raw_rgb_payload` in `runtime/utils/realtime_video.py`. The `"raw"` format below avoids this. </Note>
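
The restore procedure described in the note can be sketched in pure Python. This is not the SGLang implementation (that is `restore_delta_gzip_raw_rgb_payload`); `restore_delta_frames` and `encode_delta_frames` are hypothetical names, the payload is assumed to be one compressed stream of concatenated per-frame deltas, and the very first frame of a session is assumed to be a delta against an all-zero frame. The encoder exists only to demonstrate a round trip.

```python
import zlib
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    return (int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).to_bytes(len(a), "big")


def restore_delta_frames(payload: bytes, bytes_per_frame: int,
                         previous: Optional[bytes] = None) -> List[bytes]:
    """Undo a per-frame XOR delta; `previous` is the last frame of the prior batch."""
    # wbits = MAX_WBITS | 32 lets zlib accept either a zlib or a gzip header.
    raw = zlib.decompress(payload, wbits=zlib.MAX_WBITS | 32)
    if len(raw) % bytes_per_frame:
        raise ValueError("payload is not a whole number of frames")
    # Assumption: with no previous frame, the reference is all zeros.
    reference = previous if previous is not None else bytes(bytes_per_frame)
    frames = []
    for start in range(0, len(raw), bytes_per_frame):
        reference = xor_bytes(raw[start:start + bytes_per_frame], reference)
        frames.append(reference)
    return frames


def encode_delta_frames(frames: List[bytes], previous: Optional[bytes] = None) -> bytes:
    """Hypothetical inverse of restore_delta_frames, used here only for testing."""
    reference = previous if previous is not None else bytes(len(frames[0]))
    deltas = []
    for frame in frames:
        deltas.append(xor_bytes(frame, reference))
        reference = frame
    return zlib.compress(b"".join(deltas))


batch_1 = [bytes([5] * 12), bytes([9] * 12)]
batch_2 = [bytes([200] * 12)]
restored_1 = restore_delta_frames(encode_delta_frames(batch_1), 12)
restored_2 = restore_delta_frames(
    encode_delta_frames(batch_2, previous=batch_1[-1]), 12, previous=restored_1[-1]
)
```

For real frame sizes a vectorized XOR (for example `numpy.bitwise_xor` on `uint8` arrays) would be the practical choice; the integer trick above just keeps the sketch dependency-free.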

Minimal client example

python

import asyncio  # needed for the asyncio.run(run()) call at the bottom

import msgspec
import numpy as np
import websockets  # pip install websockets

WS_URL = "ws://127.0.0.1:30000/v1/realtime_video/generate"

async def run():
    async with websockets.connect(WS_URL, max_size=None) as ws:
        # 1) INIT — omit num_frames for an open-ended session; "raw" = uncompressed RGB24
        with open("first_frame.png", "rb") as f:
            first_frame = f.read()
        await ws.send(msgspec.msgpack.encode({
            "type": "init",
            "prompt": "a camera moving forward and turning right",
            "first_frame": first_frame,
            "size": "832x480",
            "seed": 42,
            "max_chunks": 10,
            "realtime_output_format": "raw",
            "num_inference_steps": 4,
            "guidance_scale": 1.0,
            "condition_inputs": {
                "action": "w-100,wd-50,d-30",
                "intrinsics_path": "/path/to/intrinsics.npy",  # optional; centered heuristic if omitted
            },
        }))

        # 2) optional: steer mid-stream
        await ws.send(msgspec.msgpack.encode({
            "type": "event",
            "kind": "camera_actions",
            "event_id": 1,
            "payload": [["w"], ["w"], ["a"], []],
        }))

        # 3) receive frame batches (raw RGB24)
        async for message in ws:
            msg = msgspec.msgpack.decode(message)
            if msg.get("type") != "frame_batch":
                continue  # skip chunk_stats etc.
            n, h, w, c = msg["num_frames"], msg["height"], msg["width"], msg["channels"]
            frames = np.frombuffer(msg["payload"], dtype=np.uint8).reshape(n, h, w, c)
            # ... display/save frames ...
            if msg.get("is_final_frame_batch") and msg.get("chunk_index", 0) >= 9:
                break

# asyncio.run(run())

8. Camera Action DSL

Camera trajectories are described by a compact string of comma-separated <keys>-<frames> segments, e.g. "w-100,wd-50,d-30,none-10". This is the format accepted by condition_inputs.action at init and by kind: "action" events.

Parsing rules (parse_action_string):

Each segment is <keys>-<frames>; <frames> must be a positive integer.
none means no motion for that span: none-10 = 10 static frames.
Keys are case-insensitive; combined keys apply simultaneously (wd = forward + right strafe). Allowed keys are exactly wasdijkl.

Key	Motion
`w` / `s`	move forward / backward
`a` / `d`	strafe left / right
`i` / `k`	look (pitch) up / down
`j` / `l`	look (yaw) left / right
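
The parsing rules above can be sketched as follows. This is an illustrative re-implementation, not SGLang's `parse_action_string`; the function names are hypothetical. `expand_to_frames` produces one key list per frame, which has the same shape as a frame-by-frame `camera_actions` list, although whether the server treats the two forms identically is not stated here.

```python
ALLOWED_KEYS = set("wasdijkl")


def parse_action_sketch(action: str) -> list:
    """Parse 'w-100,wd-50,none-10' into [(keys, frames), ...]; '' means no motion."""
    segments = []
    for raw in action.split(","):
        keys, sep, count = raw.strip().lower().rpartition("-")
        if not sep or not keys or not count.isdecimal() or int(count) <= 0:
            raise ValueError(f"bad segment {raw!r}: expected <keys>-<frames>")
        if keys == "none":
            keys = ""
        elif not set(keys) <= ALLOWED_KEYS:
            raise ValueError(f"bad keys in {raw!r}: allowed keys are wasdijkl")
        segments.append((keys, int(count)))
    return segments


def expand_to_frames(segments: list) -> list:
    """Expand segments into one sorted key list per frame."""
    return [sorted(keys) for keys, frames in segments for _ in range(frames)]


segments = parse_action_sketch("w-100,wd-50,d-30,none-10")
print(segments)                              # [('w', 100), ('wd', 50), ('d', 30), ('', 10)]
print(sum(frames for _, frames in segments))  # 190
print(expand_to_frames(parse_action_sketch("W-1,none-1,aS-1")))  # [['w'], [], ['a', 's']]
```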

Pose generation (action_string_to_c2w):

Translation (w/s/a/d) moves at translation_speed (default 0.04 world-units/frame).
Rotation (i/k pitch, j/l yaw) turns at rotation_speed_deg (default 1.2°/frame); pitch is clamped to ±85°.
Strafe-yaw coupling (coefficient 0.4): a d (right) strafe also nudges yaw right and a (left) nudges yaw left, so wd traces a curving arc rather than a pure sidestep.
Produces (F+1, 4, 4) camera-to-world matrices; the realtime stage pads the trajectory to the requested frame count.
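
To make the rotation rules concrete, the sketch below integrates only yaw and pitch over a parsed trajectory and returns F+1 states, mirroring the F+1 poses mentioned above. It is not `action_string_to_c2w`: translation and the strafe-yaw coupling are omitted because their exact formulas are not given here, and the sign conventions (`l` / `i` increase yaw / pitch) are assumptions.

```python
def integrate_angles(segments, rotation_speed_deg=1.2, pitch_limit_deg=85.0):
    """Return F+1 (yaw, pitch) pairs in degrees, starting from (0, 0).

    `segments` is a list of (keys, frames) pairs, e.g. [("i", 100), ("l", 10)].
    Assumed signs: 'l' / 'i' increase yaw / pitch, 'j' / 'k' decrease them.
    """
    yaw, pitch = 0.0, 0.0
    states = [(yaw, pitch)]
    for keys, frames in segments:
        for _ in range(frames):
            yaw += rotation_speed_deg * (("l" in keys) - ("j" in keys))
            pitch += rotation_speed_deg * (("i" in keys) - ("k" in keys))
            # Clamp pitch to the configured limit on every frame.
            pitch = max(-pitch_limit_deg, min(pitch_limit_deg, pitch))
            states.append((yaw, pitch))
    return states


states = integrate_angles([("i", 100), ("l", 10)])
print(len(states))    # 111
print(states[-1][1])  # 85.0
```

Holding `i` for 100 frames would reach 120° unclamped, so the pitch saturates at the 85° limit after 71 frames and stays there.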

Example: "w-100,wd-50,d-30,none-10" = 100 frames forward → 50 frames forward + sweep right → 30 frames right strafe → 10 frames static.

9. Configuration Reference

SANA-WM's defaults live in three places: request-time sampling params, the pipeline config (streaming/refiner knobs), and the realtime adapter (init-time overrides).

Request-time — `SanaWMSamplingParams` (`configs/sample/sana_wm.py`)

Field	Default	Purpose
`height`	`704`	Output height
`width`	`1280`	Output width
`num_frames`	`49`	Total pixel frames (must satisfy `(num_frames - 1) % 8 == 0`)
`fps`	`16`	Output frame rate (overrides the base default of 24)
`num_inference_steps`	`20`	Stage-1 step count
`guidance_scale`	`4.5`	Dense-path CFG scale
`negative_prompt`	`""`	Negative prompt
`camera_to_world`	`None`	In-memory `(T,4,4)` c2w extrinsics (mutually exclusive with `action`)
`intrinsics`	`None`	In-memory `(T,3,3)` pinhole intrinsics
`action`	`None`	Action-DSL string (see §8)
`translation_speed`	`0.04`	World-units/frame for W/S/A/D
`rotation_speed_deg`	`1.2`	Degrees/frame for I/K/J/L
`pitch_limit_deg`	`85.0`	Pitch clamp

generator_device is inherited from the base SamplingParams (default None = use the pipeline/model default). On the /v1/videos HTTP API the camera fields are passed inside diffusers_kwargs (action / intrinsics, as in §4–5).

Pipeline config — `SanaWMPipelineConfig` (`configs/pipeline_configs/sana_wm.py`)

These are server-launch knobs (set via the --streaming / --refiner-chunked / --num-frame-per-block CLI flags or a pipeline-config override), not request fields:

Field	Default	Purpose
`streaming`	`False`	Chunk-causal `forward_long` (§5) vs dense one-shot (§4)
`refiner_chunked`	`True`	Chunk-wise streaming refiner vs whole-clip dense refiner
`num_frame_per_block`	`3`	Latent frames per Stage-1 / refiner chunk
`num_cached_blocks`	`2`	Rolling KV-cache history window
`denoising_step_list`	`(1000, 960, 889, 727, 0)`	4-step streaming self-forcing timesteps (must end in 0)
`streaming_cfg_scale`	`1.0`	CFG scale for the distilled streaming path (1.0 = off)
`sink_size`	`1`	Sink (unrefined context) frames
`refiner_block_size`	`3`	Refiner block size
`refiner_kv_max_frames`	`11`	Refiner sliding KV window

Realtime adapter init overrides — `SanaWMRealtimeAdapter`

At WebSocket init the realtime adapter fills SANA-WM defaults that differ from the request/sampling defaults above:

Field	Realtime default	Note
`size`	`832x480`	Realtime request default; pass `1280x704` for native landscape output
`num_frames`	(unset)	Omitting → open-ended continuous session (§7)
`num_inference_steps`	`4`	Distilled few-step
`guidance_scale`	`1.0`	CFG off
`fps`	`16`	Native rate

<Note> `guidance_scale` applies to the dense path (§4) only; the distilled streaming path uses `streaming_cfg_scale` (default `1.0`, i.e. no CFG) so a `guidance_scale` override never accidentally enables CFG on the streaming stage. `denoising_step_list = (1000, 960, 889, 727, 0)` is the official 4-step streaming schedule (it must end in 0). </Note>
