SANA-WM

1. Model Introduction

SANA-WM is an efficient open-source world model from NVlabs, trained natively for one-minute video generation. It is a 2.6B-parameter text+image-to-video (TI2V) diffusion transformer that synthesizes 720p, minute-scale videos with precise 6-DoF camera control, paired with an LTX-2 refiner for high-fidelity decoding. It builds on the SANA family of efficient high-resolution synthesis models built around a linear diffusion transformer.

SANA-WM ships as two checkpoints: a bidirectional checkpoint (dense, one-shot) and a streaming checkpoint (chunk-causal and autoregressive: the clip is generated chunk by chunk, reusing causal DiT state across chunks so that memory stays bounded and clips can be long or even open-ended). Given a single first frame, a text prompt, and a camera trajectory, SGLang exposes three serving modes, all covered in this cookbook:

  • (A) Dense bidirectional (§4) — the SANA-WM_bidirectional checkpoint generated in one shot (no chunking) via SanaWMTwoStagePipeline over the standard /v1/videos HTTP API. Highest single-clip quality (full bidirectional attention + dense LTX-2 refiner); matches the NVlabs dense reference.
  • (B) Batch streaming (§5) — the SANA-WM_streaming checkpoint generated chunk-by-chunk in one request via the same SanaWMTwoStagePipeline + --streaming over /v1/videos. This is SGLang's offline chunk-causal streaming path: the whole clip is produced chunk-by-chunk internally, then returned.
  • (C) Live realtime (§6–7) — the streaming pipeline exposed as SanaWMRealtimePipeline over a WebSocket API at /v1/realtime_video/generate, so a browser/client streams camera-action events frame-by-frame and receives video chunks back in real time. Realtime uses the same streaming checkpoint, but the incremental session path is not bit-identical to offline batch streaming.

All three modes share the camera action DSL (§8) and the configuration knobs (§9). Modes (B) and (C) share the streaming checkpoint and the chunk-causal pipeline.

Key features (per the official model):

  • Hybrid Linear Attention — frame-wise Gated DeltaNet (GDN) recurrent blocks combined with softmax attention (every 4th layer, block indices {3,7,11,15,19}) for memory-efficient long-context modeling.
  • Dual-Branch Camera Control — independent main and camera branches (UCPE + PRoPE) for precise per-frame 6-DoF trajectory adherence.
  • Two-Stage Pipeline — an LTX-2 long-video refiner on top of Stage-1 latents for quality and temporal consistency.

In the streaming / realtime configuration this becomes a low-latency, interactive pipeline:

  • Stage-1 chunk-causal DiT — the streaming path carries a per-block KV cache (recurrent GDN state + a softmax K/V window) across chunks; bounded memory means it scales to long / endless sequences. Stage-1 is intentionally coarse.
  • LTX-2 streaming refiner — refines each Stage-1 latent chunk block-by-block with a sink + sliding-history KV cache (required for sharp output).
  • Causal LTX-2 VAE — decodes latents chunk-by-chunk with a carried conv-cache for seam-free frames.
  • Camera control — drive the camera with a compact WASD/IJKL action DSL (move with WASD, look with IJKL; see §8) — supplied at request time on the /v1/videos paths, or pushed over the WebSocket at init / as live per-chunk events on the realtime path (see §7).

Architecture & components

| Component | Value |
| --- | --- |
| Stage-1 DiT | 2.6B; 20 layers, hidden 2240, 20 heads (head_dim 112); ~10 GB |
| Attention | frame-wise Gated DeltaNet + softmax every 4th block (hybrid linear) |
| Camera | dual-branch, UCPE + PRoPE (raymap + Plücker), 6-DoF |
| VAE | LTX-2 causal, strides (T, H, W) = (8, 32, 32); ~2 GB |
| Refiner | LTX-2 Stage-2 distilled; ~41 GB |
| Output | up to 720p (704×1280) @ 16 fps, minute-scale |
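
The figures in the table are mutually consistent, which a few lines of Python can confirm. This is only a sanity-check sketch derived from the numbers above; the variable and function names are illustrative and are not taken from the SANA-WM code, and the spatial latent grid assumes the latent size is simply the pixel size divided by the VAE stride.

```python
# Sanity-check the architecture figures quoted in the table above.
NUM_LAYERS = 20
HIDDEN = 2240
NUM_HEADS = 20
VAE_SPATIAL_STRIDE = (32, 32)  # (H, W) strides of the LTX-2 causal VAE

head_dim = HIDDEN // NUM_HEADS

# "Softmax every 4th block" in 0-based indexing: blocks 3, 7, 11, ...
softmax_blocks = [i for i in range(NUM_LAYERS) if i % 4 == 3]

def latent_grid(height: int, width: int) -> tuple[int, int]:
    """Spatial latent size, assuming plain division by the VAE strides."""
    sh, sw = VAE_SPATIAL_STRIDE
    return (height // sh, width // sw)

print(head_dim)                 # 112
print(softmax_blocks)           # [3, 7, 11, 15, 19]
print(latent_grid(704, 1280))   # (22, 40)
```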

For more details, see the SANA-WM paper (arXiv), the SANA project page, the NVlabs/Sana GitHub, and the SANA-WM_bidirectional model card (Apache-2.0).

2. Installation

SGLang-diffusion offers multiple installation methods depending on your hardware platform. Please refer to the official SGLang-diffusion installation guide.

SANA-WM adds the SanaWMTransformer3DModel + GDN kernels, the SanaWMTwoStagePipeline (dense bidirectional + chunk-causal streaming), and the SanaWMRealtimePipeline with the /v1/realtime_video WebSocket router. The diffusion server CLI is invoked as python -m sglang.multimodal_gen.runtime.entrypoints.cli.main.

3. Model Setup

Both SANA-WM checkpoints are public (Apache-2.0, no gating, no token) and load directly — there is no manual assembly step. Pass the HuggingFace repo id to --model-path and SGLang downloads, materializes, validates, and loads it:

| Mode | --model-path |
| --- | --- |
| Dense bidirectional (§4) | Efficient-Large-Model/SANA-WM_bidirectional |
| Batch streaming (§5) / realtime (§6) | Efficient-Large-Model/SANA-WM_streaming |

Both repo ids are registered in SGLang's built-in model-overlay registry, so on first load the overlay transparently materializes the official release into a runnable Diffusers directory — for the streaming checkpoint this converts the DMD self-forcing checkpoint (sana_dit/model.pt) into a Diffusers transformer/ and wires the LTX-2 causal VAE, the LTX-2 refiner, and the Gemma encoders. No environment variable or build_model_dir.sh step is needed. (You may also pass a local, already-materialized Diffusers directory.)

The materialized checkpoint is a Diffusers directory whose model_index.json declares the loadable components:

| Component (model_index.json) | Class |
| --- | --- |
| transformer (Stage-1 DiT) | diffusers.SanaWMTransformer3DModel |
| vae | diffusers.AutoencoderKLCausalLTX2Video |
| text_encoder | transformers.Gemma2Model |
| tokenizer | transformers.GemmaTokenizer |
| scheduler | diffusers.FlowMatchEulerDiscreteScheduler |

How loading works:

  • The server resolves the checkpoint via maybe_download_model(model_path, force_diffusers_model=True) and verifies it contains a model_index.json plus the required component subdirectories (transformer/, vae/).
  • If text_encoder / tokenizer are not provided as component paths, the pipeline falls back to the default Stage-1 text encoder Efficient-Large-Model/gemma-2-2b-it (DEFAULT_SANA_WM_TEXT_ENCODER).
  • Pick the path with --pipeline-class-name. The checkpoint's model_index.json _class_name selects the default pipeline (SanaWMTwoStagePipeline). Pin it explicitly to choose: --pipeline-class-name SanaWMTwoStagePipeline for the /v1/videos paths (§4–5) or --pipeline-class-name SanaWMRealtimePipeline for live realtime (§6). Pinning is also required if you point --model-path at a bare safetensors file instead of a Diffusers directory.
  • The Stage-2 LTX-2 refiner lives under refiner/ in the checkpoint: refiner/transformer (transformer_2), refiner/connectors (connectors), and refiner/text_encoder (the Gemma-3 encoder for text_encoder_2, whose tokenizer also serves as tokenizer_2). The refiner is optional: it is skipped (Stage-1-only output) when the env flag SGLANG_SANA_WM_SKIP_REFINER (or a skip_refiner request extra) is set, or when no refiner/ is present (transformer_2 unloaded). On the batch path it runs chunk-wise with --refiner-chunked (the official streaming path, default on) or whole-clip without it; on the realtime path the pipeline builds a SanaWMChunkedRefinerChainStage only when a refiner is available, and otherwise streams Stage-1 frames.
<Note> Throughout this cookbook, `<checkpoint>` stands for the appropriate SANA-WM repo id from the table above (or a local materialized Diffusers directory). </Note>
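
If you pass a local, already-materialized directory, it can help to check its layout before launching the server. The sketch below mirrors the checks described in the loading notes (a model_index.json plus transformer/ and vae/). It is not SGLang's own validation code; `check_materialized_dir` and `has_refiner` are hypothetical helper names used only for illustration.

```python
import json
from pathlib import Path

REQUIRED_SUBDIRS = ("transformer", "vae")  # subdirectories named in the loading notes

def check_materialized_dir(path: str) -> list[str]:
    """Return a list of layout problems; an empty list means nothing obvious is missing."""
    root = Path(path)
    problems = []
    index = root / "model_index.json"
    if not index.is_file():
        problems.append("missing model_index.json")
    else:
        try:
            json.loads(index.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            problems.append("model_index.json is not valid JSON")
    for sub in REQUIRED_SUBDIRS:
        if not (root / sub).is_dir():
            problems.append(f"missing {sub}/")
    return problems

def has_refiner(path: str) -> bool:
    """Without refiner/, the notes above say output falls back to Stage-1 only."""
    return (Path(path) / "refiner").is_dir()
```

A non-empty result from `check_materialized_dir` only flags a missing file or directory; it says nothing about whether the weights inside are valid.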

4. Dense bidirectional (offline /v1/videos)

The bidirectional checkpoint generates the whole clip in one shot (full bidirectional attention, not chunked) followed by a dense LTX-2 refiner — the highest single-clip quality, matching the NVlabs dense reference.

Launch with the two-stage pipeline and no --streaming flag (dense is the default — streaming defaults to False):

bash
python -m sglang.multimodal_gen.runtime.entrypoints.cli.main serve \
  --model-path Efficient-Large-Model/SANA-WM_bidirectional \
  --pipeline-class-name SanaWMTwoStagePipeline \
  --host 127.0.0.1 --port 30000

Then POST to /v1/videos exactly as in §5, but pass the NVlabs dense sampling defaults for closest parity. The dense path runs a full multi-step schedule with CFG, whereas the streaming path uses a distilled few-step schedule:

bash
curl -s http://127.0.0.1:30000/v1/videos \
  -H 'content-type: application/json' -d '{
    "prompt": "a camera moving forward and turning left",
    "input_reference": "/path/to/first_frame.png",
    "num_frames": 321,
    "seed": 42,
    "fps": 16,
    "num_inference_steps": 60,
    "guidance_scale": 5.0,
    "diffusers_kwargs": {
      "action": "w-80,wl-80,l-80,wj-80",
      "intrinsics": "/path/to/intrinsics.npy"
    }
  }'
  • num_inference_steps / guidance_scale — the dense path uses CFG; NVlabs' reference defaults to 60 steps, guidance 5.0 (the SanaWMSamplingParams defaults are the lighter 20 / 4.5 — pass 60 / 5.0 explicitly for dense parity).
  • The dense refiner drops the leading sink frame, so a num_frames=321 request yields 320 output frames.
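
The same request can be issued from Python. The following is a minimal sketch using only the standard library; `build_dense_request` and `post_video` are hypothetical helper names, the field values are the ones from the curl example above, and the response is assumed to be a JSON body.

```python
import json
import urllib.request

def build_dense_request(prompt, first_frame, action, intrinsics, num_frames=321):
    """Build the /v1/videos JSON body with the dense-parity settings (60 steps, CFG 5.0)."""
    if (num_frames - 1) % 8 != 0:
        raise ValueError("num_frames must satisfy (num_frames - 1) % 8 == 0")
    return {
        "prompt": prompt,
        "input_reference": first_frame,
        "num_frames": num_frames,
        "seed": 42,
        "fps": 16,
        "num_inference_steps": 60,
        "guidance_scale": 5.0,
        "diffusers_kwargs": {"action": action, "intrinsics": intrinsics},
    }

def expected_output_frames(num_frames):
    """The dense refiner drops the leading sink frame, so one frame fewer comes back."""
    return num_frames - 1

def post_video(body, base_url="http://127.0.0.1:30000"):
    """POST the body to /v1/videos and return the decoded response (assumed JSON)."""
    req = urllib.request.Request(
        f"{base_url}/v1/videos",
        data=json.dumps(body).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

body = build_dense_request(
    "a camera moving forward and turning left",
    "/path/to/first_frame.png",
    "w-80,wl-80,l-80,wj-80",
    "/path/to/intrinsics.npy",
)
print(expected_output_frames(body["num_frames"]))  # 320
```

Call `post_video(body)` once the server from this section is running.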

5. Batch streaming (offline /v1/videos)

The streaming checkpoint generates a full camera-controlled clip in one request — no websocket. This is SGLang's offline streaming path: the whole clip is generated chunk-by-chunk internally, refined, decoded, and returned as one video.

Launch with the two-stage pipeline + the streaming flags:

bash
python -m sglang.multimodal_gen.runtime.entrypoints.cli.main serve \
  --model-path Efficient-Large-Model/SANA-WM_streaming \
  --pipeline-class-name SanaWMTwoStagePipeline \
  --streaming --refiner-chunked \
  --host 127.0.0.1 --port 30000
  • --streaming — chunk-causal forward_long Stage-1 (vs the dense one-shot path of §4).
  • --refiner-chunked — chunk-wise streaming LTX-2 refiner (on by default). To use the whole-clip dense refiner instead (also valid, higher peak memory), pass --refiner-chunked false — simply omitting the flag keeps the default chunked refiner.
  • --num-frame-per-block N — latent frames per chunk (default 3).

Then POST to /v1/videos (JSON body shown below; multipart/form-data with an uploaded input_reference file also works). Camera control goes in diffusers_kwargs — the action-DSL string (§8) and the intrinsics:

bash
curl -s http://127.0.0.1:30000/v1/videos \
  -H 'content-type: application/json' -d '{
    "prompt": "a camera moving forward and turning left",
    "input_reference": "/path/to/first_frame.png",
    "num_frames": 321,
    "seed": 42,
    "fps": 16,
    "diffusers_kwargs": {
      "action": "w-80,wl-80,l-80,wj-80",
      "intrinsics": "/path/to/intrinsics.npy"
    }
  }'

| Field | Notes |
| --- | --- |
| prompt | text prompt |
| input_reference | first-frame image: a server-side path, or (multipart) an uploaded file. For an http(s):// URL in a JSON body, use the separate reference_url field (the server downloads it and assigns it to input_reference) |
| num_frames | total pixel frames (e.g. 321 → 41 latent frames, 13 chunks; output 704×1280) |
| seed | RNG seed (default 42) |
| fps | output frame rate; pass 16 (SANA-WM's native rate). The generic /v1/videos default is 24, which would encode the same frames at 24 fps, so the clip would play in 16/24 of the intended duration (~33% shorter) |
| diffusers_kwargs.action | camera action-DSL string (§8) |
| diffusers_kwargs.intrinsics | path to a camera-intrinsics .npy (per-frame (T,3,3)) or an inline 3×3 / (T,3,3) list |
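
The frame arithmetic behind the num_frames and fps notes can be checked directly. This sketch assumes the temporal VAE stride of 8 and the `(num_frames - 1) % 8 == 0` constraint stated elsewhere in this cookbook; the helper names are illustrative.

```python
def latent_frames(num_frames: int, temporal_stride: int = 8) -> int:
    """Pixel frames -> latent frames for a causal VAE (first frame kept on its own)."""
    if num_frames < 1 or (num_frames - 1) % temporal_stride != 0:
        raise ValueError("num_frames must satisfy (num_frames - 1) % 8 == 0")
    return (num_frames - 1) // temporal_stride + 1

def playback_seconds(num_frames: int, fps: float) -> float:
    """Clip duration when the same frames are encoded at a given frame rate."""
    return num_frames / fps

print(latent_frames(321))          # 41
print(playback_seconds(321, 16))   # 20.0625
print(playback_seconds(321, 24))   # 13.375
```

Encoding at 24 fps instead of 16 leaves the frames untouched but shrinks playback time to two-thirds, which is why the request should set fps explicitly.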

The response is a VideoResponse; fetch the rendered MP4 via the returned reference or GET /v1/videos/{id}/content. The streaming hyperparameters (num_frame_per_block, denoising_step_list, sink_size, num_cached_blocks, streaming_cfg_scale) are pipeline-config defaults on SanaWMPipelineConfig, not request fields — see §9.

6. Launch the Realtime Server

Launch with the realtime pipeline pinned — the checkpoint defaults to SanaWMTwoStagePipeline, so realtime must be selected explicitly (see §3). The /v1/realtime_video router is always mounted and becomes functional once the realtime config is active, because SanaWMRealtimeConfig has a registered realtime adapter (SanaWMRealtimeAdapter).

bash
python -m sglang.multimodal_gen.runtime.entrypoints.cli.main serve \
  --model-path Efficient-Large-Model/SANA-WM_streaming \
  --pipeline-class-name SanaWMRealtimePipeline \
  --host 127.0.0.1 --port 30000

Common launch variants:

bash
# recommended multi-GPU realtime profile
python -m sglang.multimodal_gen.runtime.entrypoints.cli.main serve \
  --model-path Efficient-Large-Model/SANA-WM_streaming \
  --pipeline-class-name SanaWMRealtimePipeline \
  --num-gpus 8 --sp-degree 8 \
  --host 127.0.0.1 --port 30000

# single GPU
python -m sglang.multimodal_gen.runtime.entrypoints.cli.main serve \
  --model-path Efficient-Large-Model/SANA-WM_streaming \
  --pipeline-class-name SanaWMRealtimePipeline \
  --num-gpus 1 --host 127.0.0.1 --port 30000

# offload DiT + text encoder to CPU (tight VRAM)
python -m sglang.multimodal_gen.runtime.entrypoints.cli.main serve \
  --model-path Efficient-Large-Model/SANA-WM_streaming \
  --pipeline-class-name SanaWMRealtimePipeline \
  --host 127.0.0.1 --port 30000 \
  --dit-cpu-offload --text-encoder-cpu-offload

Notes on launch behavior:

  • Default endpoint is 127.0.0.1:30000 (--host / --port override).
  • CPU offload flags are optional. --dit-cpu-offload, --text-encoder-cpu-offload, and --image-encoder-cpu-offload are available; defaults are auto-adjusted from GPU memory (GPUs under 30 GB get more aggressive offloading).
  • Multi-GPU realtime. Prefer explicit sequence parallelism (--sp-degree equal to the number of GPUs for a single session). Do not enable CFG parallel for the realtime profile: the default realtime request uses guidance_scale=1.0, while CFG parallel requires active cond/uncond branches.
  • FSDP. Use --use-fsdp-inference only when you specifically need weight sharding for memory. For the low-latency realtime profile, prefer keeping components resident and using SP first.
  • Warmup. Server warmup is automatically skipped for the realtime pipeline — a synthetic warmup request has no WebSocket session, so the server detects the registered realtime adapter and skips it. No --warmup flag is needed.

Once up, the realtime WebSocket endpoint lives at ws://127.0.0.1:30000/v1/realtime_video/generate (use the Python client in §7 to connect — plain curl does not speak the ws:// upgrade).

7. Realtime WebSocket API

The realtime API is a single WebSocket at /v1/realtime_video/generate. All messages — client → server and server → client — are msgpack (msgspec.msgpack.encode / decode), not JSON.

The lifecycle is:

<Steps>
  <Step title="Connect & send INIT">
    The client opens the WebSocket and sends exactly one **init** message (`type: "init"`), carrying the prompt, the required `first_frame`, output/sampling options, and optional camera conditions in `condition_inputs`.
  </Step>
  <Step title="Stream live EVENTs (optional)">
    While generation runs, the client may push **event** messages (`type: "event"`) to steer the camera: either `kind: "camera_actions"` (frame-by-frame lists or state transitions) or `kind: "action"` (an action-DSL string).
  </Step>
  <Step title="Receive frame batches">
    The server streams **frame batches** back. Each chunk arrives as one or more `frame_batch` messages (header fields + payload bytes); `is_final_frame_batch: true` marks the end of a chunk. The server also emits `chunk_stats` timing messages.
  </Step>
</Steps>

INIT message

RealtimeVideoGenerationsRequest (type is the literal "init"). Key fields:

| Field | Type | Notes |
| --- | --- | --- |
| type | "init" | Required literal |
| prompt | str | Text prompt |
| first_frame | bytes \| str | Required by the SANA-WM adapter (on_init raises if absent), though the generic request schema defines it as optional. Raw image bytes, a server-side path, or an http(s):// URL (downloaded & cached) |
| condition_inputs | dict | Camera/conditioning inputs (see below) |
| num_frames | int | Total frames to generate. Omit it for an open-ended, continuous session: the adapter leaves num_frames unset and flags an open-ended run (condition_inputs["sana_wm_open_ended"] = True), generating uniform chunks indefinitely (until max_chunks or the client disconnects). Provide an integer for a fixed-length clip |
| seed | int | RNG seed (default 42) |
| size | str | "WIDTHxHEIGHT"; realtime requests default to "832x480" for latency. Pass "1280x704" for the native landscape resolution |
| max_chunks | int | Optional cap on total chunks generated |
| num_inference_steps | int | Default 4 for SANA-WM (realtime adapter) |
| guidance_scale | float | Default 1.0 |
| realtime_output_format | "raw" \| "webp" \| "jpeg" | Frame encoding for output (see below) |
| realtime_causal_sink_size | int | Optional override |
| realtime_causal_kv_cache_num_frames | int | Optional override |

condition_inputs accepts (all optional; pass only one of action / camera_actions):

| Key | Type | Meaning |
| --- | --- | --- |
| camera_actions | list[list[str]] or {mode: "state", transitions: [...]} | Frame-by-frame camera actions, or state-based transitions |
| action | str | Action-DSL string, e.g. "w-10,none-5,a-8" (see §8) |
| intrinsics_path | str | Server-side path to a camera-intrinsics .npy file (loaded via np.load; shapes (4,), (3,3), or (F,3,3)) |
| intrinsics | list | Inline intrinsics with shape (4,), (3,3), (F,4), or (F,3,3) |

If you omit both intrinsics_path and intrinsics, SGLang uses a centered heuristic intrinsic matrix derived from the first-frame size. Pass explicit intrinsics when you need closer camera parity with a prepared trajectory.
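
To pass explicit inline intrinsics, you need a 3×3 (or per-frame (F,3,3)) nested list. The sketch below builds a centered pinhole matrix in pixel units. It is not SGLang's built-in heuristic, whose exact focal length is not documented here: the horizontal field of view (`hfov_deg`), the pixel-unit convention, and both helper names are assumptions for illustration.

```python
import math

def centered_intrinsics(width, height, hfov_deg=90.0):
    """3x3 pinhole matrix with the principal point at the image center.

    The focal length is derived from an ASSUMED horizontal field of view.
    """
    f = (width / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)
    return [
        [f, 0.0, width / 2.0],
        [0.0, f, height / 2.0],
        [0.0, 0.0, 1.0],
    ]

def repeat_per_frame(k, num_frames):
    """Expand one 3x3 matrix into an (F, 3, 3) nested list with independent rows."""
    return [[row[:] for row in k] for _ in range(num_frames)]

k = centered_intrinsics(832, 480)
per_frame = repeat_per_frame(k, 4)   # shape (4, 3, 3)
```

Either `k` or `per_frame` could then be placed under `condition_inputs["intrinsics"]`; use calibrated values instead whenever they are available.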

json
{
  "type": "init",
  "prompt": "beautiful landscape video",
  "first_frame": "<bytes or url>",
  "size": "832x480",
  "seed": 42,
  "max_chunks": 10,
  "realtime_output_format": "raw",
  "num_inference_steps": 4,
  "guidance_scale": 1.0,
  "condition_inputs": {
    "camera_actions": [["w"], [], ["a", "s"]],
    "intrinsics_path": "/path/to/intrinsics.npy"
  }
}
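
The constraints listed above (required `first_frame`, only one of `action` / `camera_actions`, `"WIDTHxHEIGHT"` size strings, optional `num_frames`) can be enforced on the client before anything is sent. This is a minimal sketch; `build_init_message` is a hypothetical helper, not part of SGLang, and the resulting dict would still need to be msgpack-encoded before sending.

```python
def build_init_message(prompt, first_frame, *, size="832x480", num_frames=None,
                       action=None, camera_actions=None, intrinsics=None, **options):
    """Build an init dict, rejecting combinations the field tables rule out."""
    if first_frame is None:
        raise ValueError("first_frame is required by the SANA-WM adapter")
    if action is not None and camera_actions is not None:
        raise ValueError("pass only one of action / camera_actions")
    parts = size.lower().split("x")
    if len(parts) != 2:
        raise ValueError('size must look like "WIDTHxHEIGHT"')
    width, height = int(parts[0]), int(parts[1])  # int() raises ValueError if malformed
    if width <= 0 or height <= 0:
        raise ValueError("size components must be positive")

    condition_inputs = {}
    if action is not None:
        condition_inputs["action"] = action
    if camera_actions is not None:
        condition_inputs["camera_actions"] = camera_actions
    if intrinsics is not None:
        condition_inputs["intrinsics"] = intrinsics

    msg = {"type": "init", "prompt": prompt, "first_frame": first_frame,
           "size": f"{width}x{height}"}
    if num_frames is not None:      # leave unset for an open-ended session
        msg["num_frames"] = num_frames
    if condition_inputs:
        msg["condition_inputs"] = condition_inputs
    msg.update(options)             # e.g. seed, max_chunks, realtime_output_format
    return msg

init = build_init_message("beautiful landscape video", b"<png bytes>",
                          action="w-10,none-5,a-8", seed=42, max_chunks=10)
```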

Live EVENT messages

RealtimeEvent (type: "event"). Use kind + payload (optional event_id correlates the response back to this event).

json
{
  "type": "event",
  "kind": "camera_actions",
  "event_id": 1,
  "payload": [["w"], ["w"], ["a"], []]
}
json
{
  "type": "event",
  "kind": "camera_actions",
  "event_id": 2,
  "payload": {
    "mode": "state",
    "transitions": [
      {"actions": ["w"], "client_ts_ms": 1000},
      {"actions": ["a", "w"], "client_ts_ms": 1500}
    ]
  }
}
json
{
  "type": "event",
  "kind": "action",
  "event_id": 3,
  "payload": "w-10,none-5,a-8,d-10"
}
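
A small client-side helper can build these event messages and assign increasing `event_id` values. This is a sketch only: the helper names are hypothetical, and validating frame lists against the `wasdijkl` key set assumes that `camera_actions` uses the same keys as the action DSL in §8.

```python
import itertools

ALLOWED_KEYS = set("wasdijkl")   # assumed to match the action-DSL key set (§8)
_event_ids = itertools.count(1)  # simple per-process id counter

def camera_actions_event(frames):
    """Build a frame-by-frame camera_actions event from a list of key lists."""
    for i, keys in enumerate(frames):
        bad = [k for k in keys if k.lower() not in ALLOWED_KEYS]
        if bad:
            raise ValueError(f"frame {i}: unknown keys {bad}")
    return {
        "type": "event",
        "kind": "camera_actions",
        "event_id": next(_event_ids),
        "payload": [list(keys) for keys in frames],
    }

def action_event(dsl):
    """Build an action event carrying an action-DSL string."""
    return {"type": "event", "kind": "action",
            "event_id": next(_event_ids), "payload": dsl}
```

Because the server echoes `event_id` in the frame-batch header, distinct ids let the client tell which steering event produced a given chunk.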

Server frame output

The server streams frame batches. Every batch arrives as a single msgpack message with type: "frame_batch" — the header fields below plus an inline payload bytes field (the wire type is always "frame_batch"; there is no separate header-then-bytes message).

Header fields:

| Field | Meaning |
| --- | --- |
| type | "frame_batch" (always) |
| request_id | Generation id |
| chunk_index | Chunk index |
| content_type | application/x-raw-rgb, application/x-raw-rgb-delta-gzip, image/webp, or image/jpeg |
| num_frames | Frames in this batch |
| total_size | Payload size in bytes (len(payload); the compressed size for delta-gzip) |
| width, height, channels | Frame geometry (channels: 3) |
| bytes_per_frame | Bytes per uncompressed frame (width*height*3) |
| format | rgb24 for raw |
| encoding | raw, delta-gzip, webp, or jpeg |
| delta_reference | previous-frame (present for delta-gzip) |
| event_id | Echoes the steering event id; omitted from the header for INIT-only chunks |
| frame_batch_index, num_frame_batches | Sequence multiple batches within a chunk |
| is_final_frame_batch | true ends the chunk |

json
{
  "type": "frame_batch",
  "request_id": "uuid-string",
  "chunk_index": 0,
  "content_type": "application/x-raw-rgb-delta-gzip",
  "num_frames": 3,
  "total_size": 1048576,
  "width": 1280,
  "height": 704,
  "channels": 3,
  "bytes_per_frame": 2703360,
  "format": "rgb24",
  "encoding": "delta-gzip",
  "delta_reference": "previous-frame",
  "event_id": 1,
  "frame_batch_index": 0,
  "num_frame_batches": 1,
  "is_final_frame_batch": true,
  "payload": "<gzip-compressed bytes>"
}
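
On the client, the header fields are enough to validate a batch and to group batches into chunks. The sketch below works on already-decoded messages (plain dicts with a `bytes` payload); `check_batch` and `ChunkAssembler` are hypothetical names, and the code relies on WebSocket messages arriving in order.

```python
def check_batch(msg: dict) -> None:
    """Raise ValueError if the header disagrees with the payload it describes."""
    payload = msg["payload"]
    if len(payload) != msg["total_size"]:
        raise ValueError("total_size does not match len(payload)")
    if msg["encoding"] == "raw":
        expected = msg["num_frames"] * msg["width"] * msg["height"] * msg["channels"]
        if len(payload) != expected:
            raise ValueError("raw payload is not num_frames * width * height * channels bytes")

class ChunkAssembler:
    """Collect frame_batch messages until is_final_frame_batch closes the chunk."""

    def __init__(self):
        self._parts = {}

    def add(self, msg: dict):
        """Return the chunk's batches in order once complete, otherwise None."""
        check_batch(msg)
        self._parts.setdefault(msg["chunk_index"], {})[msg["frame_batch_index"]] = msg
        if not msg["is_final_frame_batch"]:
            return None
        parts = self._parts.pop(msg["chunk_index"])
        if len(parts) != msg["num_frame_batches"]:
            raise ValueError("chunk ended with missing frame batches")
        return [parts[i] for i in sorted(parts)]
```

The assembler returns whole messages rather than concatenated bytes because delta-gzip payloads are compressed per batch and cannot simply be joined.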

Encodings. application/x-raw-rgb is uncompressed RGB24 (3 × uint8, bytes_per_frame = width*height*3). application/x-raw-rgb-delta-gzip is the zlib-compressed per-frame XOR delta against the preceding frame (each frame in the batch is XOR'd against the previous one; sent by default). realtime_output_format: "raw" forces uncompressed RGB; "webp" / "jpeg" send preview-encoded frames.

<Note> delta-gzip must be restored **frame-by-frame**: decompress the payload, then for each frame XOR it against the already-restored previous frame (the first frame of a batch references the last frame of the previous batch). See `restore_delta_gzip_raw_rgb_payload` in `runtime/utils/realtime_video.py`. The `"raw"` format below avoids this. </Note>
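
The restore procedure described in the note can be sketched with the standard library alone. This is not the `restore_delta_gzip_raw_rgb_payload` implementation; it is a minimal illustration that assumes the decompressed payload is `num_frames` consecutive XOR deltas of `bytes_per_frame` bytes each, and that the very first frame of a session (no previous frame) is stored as-is. `wbits=zlib.MAX_WBITS | 32` lets `zlib.decompress` accept either a zlib or a gzip header.

```python
import zlib

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length buffers."""
    if len(a) != len(b):
        raise ValueError("buffers must have equal length")
    return (int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).to_bytes(len(a), "big")

def restore_delta_batch(payload, num_frames, bytes_per_frame, prev_frame=None):
    """Decompress one batch and undo the per-frame XOR delta.

    prev_frame is the last restored frame of the previous batch (None for the
    first batch, which this sketch ASSUMES starts with an unmodified frame).
    """
    raw = zlib.decompress(payload, wbits=zlib.MAX_WBITS | 32)
    if len(raw) != num_frames * bytes_per_frame:
        raise ValueError("decompressed size does not match the header")
    frames = []
    for i in range(num_frames):
        delta = raw[i * bytes_per_frame:(i + 1) * bytes_per_frame]
        frame = delta if prev_frame is None else xor_bytes(delta, prev_frame)
        frames.append(frame)
        prev_frame = frame   # the next delta references the frame just restored
    return frames
```

Keep the last returned frame and pass it as `prev_frame` for the next batch, so the chain of references stays unbroken across batches.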

Minimal client example

python
import msgspec
import numpy as np
import websockets  # pip install websockets

WS_URL = "ws://127.0.0.1:30000/v1/realtime_video/generate"

async def run():
    async with websockets.connect(WS_URL, max_size=None) as ws:
        # 1) INIT — omit num_frames for an open-ended session; "raw" = uncompressed RGB24
        with open("first_frame.png", "rb") as f:
            first_frame = f.read()
        await ws.send(msgspec.msgpack.encode({
            "type": "init",
            "prompt": "a camera moving forward and turning right",
            "first_frame": first_frame,
            "size": "832x480",
            "seed": 42,
            "max_chunks": 10,
            "realtime_output_format": "raw",
            "num_inference_steps": 4,
            "guidance_scale": 1.0,
            "condition_inputs": {
                "action": "w-100,wd-50,d-30",
                "intrinsics_path": "/path/to/intrinsics.npy",  # optional; centered heuristic if omitted
            },
        }))

        # 2) optional: steer mid-stream
        await ws.send(msgspec.msgpack.encode({
            "type": "event",
            "kind": "camera_actions",
            "event_id": 1,
            "payload": [["w"], ["w"], ["a"], []],
        }))

        # 3) receive frame batches (raw RGB24)
        async for message in ws:
            msg = msgspec.msgpack.decode(message)
            if msg.get("type") != "frame_batch":
                continue  # skip chunk_stats etc.
            n, h, w, c = msg["num_frames"], msg["height"], msg["width"], msg["channels"]
            frames = np.frombuffer(msg["payload"], dtype=np.uint8).reshape(n, h, w, c)
            # ... display/save frames ...
            if msg.get("is_final_frame_batch") and msg.get("chunk_index", 0) >= 9:
                break

if __name__ == "__main__":
    import asyncio

    asyncio.run(run())

8. Camera Action DSL

Camera trajectories are described by a compact string of comma-separated <keys>-<frames> segments, e.g. "w-100,wd-50,d-30,none-10". This is the format accepted by condition_inputs.action at init and by kind: "action" events.

Parsing rules (parse_action_string):

  • Each segment is <keys>-<frames>; <frames> must be a positive integer.
  • none means no motion for that span: none-10 = 10 static frames.
  • Keys are case-insensitive; combined keys apply simultaneously (wd = forward + right strafe). Allowed keys are exactly wasdijkl.

| Key | Motion |
| --- | --- |
| w / s | move forward / backward |
| a / d | strafe left / right |
| i / k | look (pitch) up / down |
| j / l | look (yaw) left / right |
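
The parsing rules above are simple enough to reproduce on the client, for example to validate a string before sending it. The following is a sketch written from those rules, not SGLang's `parse_action_string`; the function names and the per-frame expansion format are illustrative assumptions.

```python
ALLOWED_KEYS = set("wasdijkl")

def parse_action(dsl: str) -> list[tuple[str, int]]:
    """Parse "w-100,wd-50,none-10" into [(keys, frames), ...]; "none" becomes ""."""
    segments = []
    for raw in dsl.split(","):
        seg = raw.strip().lower()                 # keys are case-insensitive
        keys, sep, frames = seg.rpartition("-")
        if not sep or not keys:
            raise ValueError(f"segment {raw!r} is not <keys>-<frames>")
        if not (frames.isascii() and frames.isdigit()) or int(frames) <= 0:
            raise ValueError(f"segment {raw!r}: <frames> must be a positive integer")
        if keys != "none" and not set(keys) <= ALLOWED_KEYS:
            raise ValueError(f"segment {raw!r}: keys must come from wasdijkl")
        segments.append(("" if keys == "none" else keys, int(frames)))
    return segments

def expand_to_frames(segments):
    """Expand segments into one sorted key list per frame (empty list = static)."""
    frames = []
    for keys, count in segments:
        frames.extend(sorted(set(keys)) for _ in range(count))
    return frames

segments = parse_action("w-100,wd-50,d-30,none-10")
frames = expand_to_frames(segments)
print(segments)      # [('w', 100), ('wd', 50), ('d', 30), ('', 10)]
print(len(frames))   # 190
```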

Pose generation (action_string_to_c2w):

  • Translation (w/s/a/d) moves at translation_speed (default 0.04 world-units/frame).
  • Rotation (i/k pitch, j/l yaw) turns at rotation_speed_deg (default 1.2°/frame); pitch is clamped to ±85°.
  • Strafe-yaw coupling (coefficient 0.4): a d (right) strafe also nudges yaw right and a (left) nudges yaw left, so wd traces a curving arc rather than a pure sidestep.
  • Produces (F+1, 4, 4) camera-to-world matrices; the realtime stage pads the trajectory to the requested frame count.

Example: "w-100,wd-50,d-30,none-10" = 100 frames forward → 50 frames forward + sweep right → 30 frames right strafe → 10 frames static.
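
To make the rotation rules concrete, the sketch below integrates only yaw and pitch angles from per-frame key lists. It is not `action_string_to_c2w`: the sign conventions (yaw right and pitch up positive) and the way the 0.4 coupling coefficient scales `rotation_speed_deg` are assumptions, and translation and matrix construction are left out.

```python
def integrate_angles(frames, rotation_speed_deg=1.2, pitch_limit_deg=85.0,
                     strafe_yaw_coeff=0.4):
    """Return F+1 (yaw, pitch) pairs in degrees for F per-frame key lists."""
    yaw = pitch = 0.0
    poses = [(yaw, pitch)]                       # initial pose before any motion
    for keys in frames:
        k = set(keys)
        yaw_input = ("l" in k) - ("j" in k)      # look right minus look left
        strafe = ("d" in k) - ("a" in k)         # strafe right minus strafe left
        pitch_input = ("i" in k) - ("k" in k)    # look up minus look down
        # ASSUMPTION: strafing adds a fraction of the per-frame rotation speed to yaw.
        yaw += rotation_speed_deg * (yaw_input + strafe_yaw_coeff * strafe)
        pitch += rotation_speed_deg * pitch_input
        pitch = max(-pitch_limit_deg, min(pitch_limit_deg, pitch))
        poses.append((yaw, pitch))
    return poses

look_up = integrate_angles([["i"]] * 100)
print(len(look_up))     # 101
print(look_up[-1][1])   # 85.0 (100 * 1.2 = 120 degrees, clamped)
```

Under these assumptions a pure `d` strafe still accumulates yaw, which is the curving-arc behavior described for `wd`.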

9. Configuration Reference

SANA-WM's defaults live in three places: request-time sampling params, the pipeline config (streaming/refiner knobs), and the realtime adapter (init-time overrides).

Request-time — SanaWMSamplingParams (configs/sample/sana_wm.py)

| Field | Default | Purpose |
| --- | --- | --- |
| height | 704 | Output height |
| width | 1280 | Output width |
| num_frames | 49 | Total pixel frames (must satisfy (num_frames - 1) % 8 == 0) |
| fps | 16 | Output frame rate (overrides the base default of 24) |
| num_inference_steps | 20 | Stage-1 step count |
| guidance_scale | 4.5 | Dense-path CFG scale |
| negative_prompt | "" | Negative prompt |
| camera_to_world | None | In-memory (T,4,4) c2w extrinsics (mutually exclusive with action) |
| intrinsics | None | In-memory (T,3,3) pinhole intrinsics |
| action | None | Action-DSL string (see §8) |
| translation_speed | 0.04 | World-units/frame for W/S/A/D |
| rotation_speed_deg | 1.2 | Degrees/frame for I/K/J/L |
| pitch_limit_deg | 85.0 | Pitch clamp |

generator_device is inherited from the base SamplingParams (default None = use the pipeline/model default). On the /v1/videos HTTP API the camera fields are passed inside diffusers_kwargs (action / intrinsics, as in §4–5).

Pipeline config — SanaWMPipelineConfig (configs/pipeline_configs/sana_wm.py)

These are server-launch knobs (set via the --streaming / --refiner-chunked / --num-frame-per-block CLI flags or a pipeline-config override), not request fields:

| Field | Default | Purpose |
| --- | --- | --- |
| streaming | False | Chunk-causal forward_long (§5) vs dense one-shot (§4) |
| refiner_chunked | True | Chunk-wise streaming refiner vs whole-clip dense refiner |
| num_frame_per_block | 3 | Latent frames per Stage-1 / refiner chunk |
| num_cached_blocks | 2 | Rolling KV-cache history window |
| denoising_step_list | (1000, 960, 889, 727, 0) | 4-step streaming self-forcing timesteps (must end in 0) |
| streaming_cfg_scale | 1.0 | CFG scale for the distilled streaming path (1.0 = off) |
| sink_size | 1 | Sink (unrefined context) frames |
| refiner_block_size | 3 | Refiner block size |
| refiner_kv_max_frames | 11 | Refiner sliding KV window |

Realtime adapter init overrides — SanaWMRealtimeAdapter

At WebSocket init the realtime adapter fills SANA-WM defaults that differ from the request/sampling defaults above:

| Field | Realtime default | Note |
| --- | --- | --- |
| size | 832x480 | Realtime request default; pass 1280x704 for native landscape output |
| num_frames | (unset) | Omit for an open-ended continuous session (§7) |
| num_inference_steps | 4 | Distilled few-step |
| guidance_scale | 1.0 | CFG off |
| fps | 16 | Native rate |

<Note> `guidance_scale` applies to the dense path (§4) only; the distilled streaming path uses `streaming_cfg_scale` (default `1.0`, i.e. no CFG) so a `guidance_scale` override never accidentally enables CFG on the streaming stage. `denoising_step_list = (1000, 960, 889, 727, 0)` is the official 4-step streaming schedule (it must end in 0). </Note>
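
If you override `denoising_step_list` in a pipeline config, a quick check catches malformed schedules before launch. This is a sketch: the "must end in 0" rule comes from the note above, while the strictly-decreasing requirement and the function name are assumptions about what a sensible schedule looks like.

```python
def count_denoising_steps(step_list):
    """Validate a streaming timestep schedule and return its number of steps."""
    steps = tuple(step_list)
    if len(steps) < 2:
        raise ValueError("need at least one timestep followed by the terminal 0")
    if steps[-1] != 0:
        raise ValueError("denoising_step_list must end in 0")
    # ASSUMPTION: timesteps should decrease monotonically toward 0.
    if any(a <= b for a, b in zip(steps, steps[1:])):
        raise ValueError("timesteps must be strictly decreasing")
    return len(steps) - 1   # the trailing 0 marks the end, not an extra step

print(count_denoising_steps((1000, 960, 889, 727, 0)))  # 4
```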