CLI reference

Use the CLI for one-off generation with sglang generate or to start a persistent HTTP server with sglang serve.

Overlay repos for non-diffusers models

If --model-path points to a supported non-diffusers source repo, SGLang can resolve it through a self-hosted overlay repo.

SGLang first checks a built-in overlay registry. Concrete built-in mappings can be added over time without changing the CLI surface.

Override example:

```bash
export SGLANG_DIFFUSION_MODEL_OVERLAY_REGISTRY='{
  "Wan-AI/Wan2.2-S2V-14B": {
    "overlay_repo_id": "your-org/Wan2.2-S2V-14B-overlay",
    "overlay_revision": "main"
  }
}'

sglang generate \
  --model-path Wan-AI/Wan2.2-S2V-14B \
  --config configs/wan_s2v.yaml
```

The overlay repo should be a complete diffusers-style (componentized) repo.

You can also pass the overlay repo itself as --model-path if it contains _overlay/overlay_manifest.json.

Notes:

  1. SGLANG_DIFFUSION_MODEL_OVERLAY_REGISTRY is only an optional override for development and debugging. It accepts either a JSON object or a path to a JSON file, and can extend or replace built-in entries for the current process.
  2. On the first load, SGLang will:
    • download overlay metadata from the overlay repo
    • download the required files from the original source repo
    • materialize a local standard component repo under ~/.cache/sgl_diffusion/materialized_models/
  3. Later loads reuse the materialized local repo. The materialized repo is what the runtime loads as a normal componentized model directory.
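The override accepts either inline JSON or a path to a JSON file. A minimal sketch of that resolution logic (illustrative only; the function name is hypothetical, not SGLang's internal API):

```python
import json
import os

def load_overlay_registry_override(env_var="SGLANG_DIFFUSION_MODEL_OVERLAY_REGISTRY"):
    """Parse the override env var as inline JSON, or as a path to a JSON file."""
    raw = os.environ.get(env_var)
    if not raw:
        return {}
    # If the value names an existing file, read the registry from it.
    if os.path.isfile(raw):
        with open(raw) as f:
            return json.load(f)
    # Otherwise treat the value itself as a JSON object.
    return json.loads(raw)
```

Entries returned this way extend or replace built-in mappings for the current process only.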

Quick Start

Generate

```bash
sglang generate \
  --model-path Qwen/Qwen-Image \
  --prompt "A beautiful sunset over the mountains" \
  --save-output
```

Serve

```bash
sglang serve \
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
  --num-gpus 4 \
  --ulysses-degree 2 \
  --ring-degree 2 \
  --port 30010
```

For request and response examples, see OpenAI-Compatible API.

<Tip> Use `sglang generate --help` and `sglang serve --help` for the full argument list. The CLI help output is the source of truth for exhaustive flags. </Tip>

Common Options

Model and runtime

  • --model-path &#123;MODEL&#125;: model path or Hugging Face model ID
  • --lora-path &#123;PATH&#125; and --lora-nickname &#123;NAME&#125;: load a LoRA adapter
  • --num-gpus &#123;N&#125;: number of GPUs to use
  • --tp-size &#123;N&#125;: tensor parallelism size, mainly for encoders
  • --sp-degree &#123;N&#125;: sequence parallelism size
  • --ulysses-degree &#123;N&#125; and --ring-degree &#123;N&#125;: USP parallelism controls
  • --attention-backend &#123;BACKEND&#125;: attention backend for native SGLang pipelines
  • --attention-backend-config &#123;CONFIG&#125;: attention backend configuration

Sampling and output

  • --prompt &#123;PROMPT&#125; and --negative-prompt &#123;PROMPT&#125;
  • --image-path &#123;PATH&#125; [&#123;PATH&#125; ...]: input image(s) for image-to-video or image-to-image generation
  • --num-inference-steps &#123;STEPS&#125; and --seed &#123;SEED&#125;
  • --height &#123;HEIGHT&#125;, --width &#123;WIDTH&#125;, --num-frames &#123;N&#125;, --fps &#123;FPS&#125;
  • --output-path &#123;PATH&#125;, --output-file-name &#123;NAME&#125;, --save-output, --return-frames

For frame interpolation and upscaling, see Post-Processing.

Quantized transformers

For quantized transformer checkpoints, prefer:

  • --model-path for the base pipeline
  • --transformer-path for a quantized transformer component folder
  • --transformer-weights-path for a quantized safetensors file, directory, or repo

See Quantization for supported quantization families and examples.

Configuration Files

Use --config to load JSON or YAML configuration. Command-line flags override values from the config file.

```bash
sglang generate --config config.yaml
```

Example:

```yaml
model_path: FastVideo/FastHunyuan-diffusers
prompt: A beautiful woman in a red dress walking down a street
output_path: outputs/
num_gpus: 2
sp_size: 2
tp_size: 1
num_frames: 45
height: 720
width: 1280
num_inference_steps: 6
seed: 1024
fps: 24
precision: bf16
vae_precision: fp16
vae_tiling: true
vae_sp: true
enable_torch_compile: false
```
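The precedence rule can be sketched as follows (illustrative only, not SGLang's actual merge code): values from the config file form the base, and any flag explicitly set on the command line wins.

```python
def merge_config(file_values: dict, cli_values: dict) -> dict:
    """Config-file values form the base; explicitly set CLI flags override them."""
    merged = dict(file_values)
    # Flags left unset on the command line (None) do not override the file.
    merged.update({k: v for k, v in cli_values.items() if v is not None})
    return merged

merged = merge_config(
    {"num_inference_steps": 6, "seed": 1024},   # from config.yaml
    {"num_inference_steps": 30, "seed": None},  # from the command line
)
# merged == {"num_inference_steps": 30, "seed": 1024}
```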

Generate

sglang generate runs a single generation job and exits when the job finishes.

```bash
sglang generate \
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
  --text-encoder-cpu-offload \
  --pin-cpu-memory \
  --num-gpus 4 \
  --ulysses-degree 2 \
  --ring-degree 2 \
  --prompt "A curious raccoon" \
  --save-output \
  --output-path outputs \
  --output-file-name "a-curious-raccoon.mp4"
```

<Note> HTTP server-only arguments are ignored by `sglang generate`. </Note>

For diffusers pipelines, Cache-DiT can be enabled with SGLANG_CACHE_DIT_ENABLED=true or --cache-dit-config. See Cache-DiT.

Serve

sglang serve starts the HTTP server and keeps the model loaded for repeated requests.

```bash
sglang serve \
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
  --text-encoder-cpu-offload \
  --pin-cpu-memory \
  --num-gpus 4 \
  --ulysses-degree 2 \
  --ring-degree 2 \
  --port 30010
```

Cloud Storage

SGLang Diffusion can upload generated images and videos to S3-compatible object storage after generation.

```bash
export SGLANG_CLOUD_STORAGE_TYPE=s3
export SGLANG_S3_BUCKET_NAME=my-bucket
export SGLANG_S3_ACCESS_KEY_ID=your-access-key
export SGLANG_S3_SECRET_ACCESS_KEY=your-secret-key
export SGLANG_S3_ENDPOINT_URL=https://minio.example.com
```

See Environment Variables for the full set of storage options.

Component Path Overrides

Override individual pipeline components such as vae, transformer, or text_encoder with --<component>-path.

```bash
sglang serve \
  --model-path black-forest-labs/FLUX.2-dev \
  --vae-path fal/FLUX.2-Tiny-AutoEncoder
```

The component key must match the key in the model's model_index.json, and the path must be either a Hugging Face repo ID or a complete component directory.
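For reference, a diffusers-style model_index.json maps component keys to their library and class, roughly like this (illustrative excerpt; the exact class names vary per model):

```json
{
  "_class_name": "ExamplePipeline",
  "vae": ["diffusers", "AutoencoderKL"],
  "transformer": ["diffusers", "ExampleTransformer2DModel"],
  "text_encoder": ["transformers", "CLIPTextModel"]
}
```

The top-level keys (vae, transformer, text_encoder, ...) are the names accepted by --&lt;component&gt;-path.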

Diffusers Backend

Use --backend diffusers to force vanilla diffusers pipelines when no native SGLang implementation exists or when a model requires a custom pipeline class.

Key Options

<table> <thead> <tr> <th>Argument</th> <th>Values</th> <th>Description</th> </tr> </thead> <tbody> <tr> <td><code>--backend</code></td> <td><code>auto</code>, <code>sglang</code>, <code>diffusers</code></td> <td>Auto-select the backend, force native SGLang, or force diffusers</td> </tr> <tr> <td><code>--diffusers-attention-backend</code></td> <td><code>flash</code>, <code>_flash_3_hub</code>, <code>sage</code>, <code>xformers</code>, <code>native</code></td> <td>Attention backend for diffusers pipelines</td> </tr> <tr> <td><code>--trust-remote-code</code></td> <td>flag</td> <td>Required for models with custom pipeline classes</td> </tr> <tr> <td><code>--vae-tiling</code> and <code>--vae-slicing</code></td> <td>flag</td> <td>Lower memory usage for VAE decode</td> </tr> <tr> <td><code>--dit-precision</code> and <code>--vae-precision</code></td> <td><code>fp16</code>, <code>bf16</code>, <code>fp32</code></td> <td>Precision controls</td> </tr> <tr> <td><code>--enable-torch-compile</code></td> <td>flag</td> <td>Enable <code>torch.compile</code></td> </tr> <tr> <td><code>--cache-dit-config</code></td> <td><code>&#123;PATH&#125;</code></td> <td>Cache-DiT config for diffusers pipelines</td> </tr> </tbody> </table>

Example

```bash
sglang generate \
  --model-path AIDC-AI/Ovis-Image-7B \
  --backend diffusers \
  --trust-remote-code \
  --diffusers-attention-backend flash \
  --prompt "A serene Japanese garden with cherry blossoms" \
  --height 1024 \
  --width 1024 \
  --num-inference-steps 30 \
  --save-output \
  --output-path outputs \
  --output-file-name ovis_garden.png
```

For pipeline-specific arguments not exposed in the CLI, pass diffusers_kwargs in a config file.
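For example, a config file might forward extra pipeline arguments like this (a sketch; the keys under diffusers_kwargs are hypothetical and must match the pipeline's own call signature):

```yaml
model_path: AIDC-AI/Ovis-Image-7B
backend: diffusers
trust_remote_code: true
diffusers_kwargs:
  guidance_scale: 4.5
  max_sequence_length: 256
```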