SGLang Diffusion OpenAI API

The SGLang diffusion HTTP server implements an OpenAI-compatible API for image and video generation, as well as LoRA adapter management.

Prerequisites

Python 3.11+ if you plan to use the OpenAI Python SDK.

Serve

Launch the server using the sglang serve command.

Start the server

bash

SERVER_ARGS=(
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers
  --text-encoder-cpu-offload
  --pin-cpu-memory
  --num-gpus 4
  --ulysses-degree=2
  --ring-degree=2
  --port 30010
)

sglang serve "${SERVER_ARGS[@]}"

--model-path: Path to the model or model ID.
--port: HTTP port to listen on (default: 30000).
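Loading a diffusion model can take a while, so a client may want to wait until the server answers before sending work. The sketch below assumes that an HTTP 200 from GET /models (described below) means the server is ready. The helper names models_endpoint_ok and wait_until_ready are hypothetical and not part of SGLang.

```python
import time
import urllib.request
from typing import Callable


def models_endpoint_ok(base_url: str = "http://localhost:30010") -> bool:
    """Return True if GET /models answers with HTTP 200 (assumed readiness signal)."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses:
        # treat any of them as "not ready yet".
        return False


def wait_until_ready(
    probe: Callable[[], bool] = models_endpoint_ok,
    timeout: float = 600.0,
    interval: float = 5.0,
) -> bool:
    """Call probe() until it returns True or timeout seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Usage: run wait_until_ready() after starting sglang serve, and raise an error if it returns False. The probe is injectable so you can swap in a different readiness check.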

Get Model Information

Endpoint: GET /models

Returns information about the model served by this server, including model path, task type, pipeline configuration, and precision settings.

Curl Example:

bash

curl -sS -X GET "http://localhost:30010/models"

Response Example:

json

{
  "model_path": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
  "task_type": "T2V",
  "pipeline_name": "wan_pipeline",
  "pipeline_class": "WanPipeline",
  "num_gpus": 4,
  "dit_precision": "bf16",
  "vae_precision": "fp16"
}
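You can fetch the same information without the OpenAI SDK. Below is a minimal standard-library sketch. The helper names are hypothetical, and the field names are copied from the response example above, so they may differ between SGLang versions.

```python
import json
import urllib.request


def get_model_info(base_url: str = "http://localhost:30010") -> dict:
    """Fetch GET /models and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=10) as resp:
        return json.load(resp)


def describe_model(info: dict) -> str:
    """Build a one-line summary from the fields shown in the response example."""
    return (
        f"{info.get('model_path')} [{info.get('task_type')}] "
        f"on {info.get('num_gpus')} GPU(s), "
        f"DiT={info.get('dit_precision')}, VAE={info.get('vae_precision')}"
    )
```

Usage: print(describe_model(get_model_info())).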

Endpoints

Image Generation

The server implements an OpenAI-compatible Images API under the /v1/images namespace.

Create an image

Endpoint: POST /v1/images/generations

Python Example (b64_json response):

python

import base64
from openai import OpenAI

client = OpenAI(api_key="sk-proj-1234567890", base_url="http://localhost:30010/v1")

img = client.images.generate(
    prompt="A calico cat playing a piano on stage",
    size="1024x1024",
    n=1,
    response_format="b64_json",
)

image_bytes = base64.b64decode(img.data[0].b64_json)
with open("output.png", "wb") as f:
    f.write(image_bytes)

Curl Example:

bash

curl -sS -X POST "http://localhost:30010/v1/images/generations" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-proj-1234567890" \
  -d '{
        "prompt": "A calico cat playing a piano on stage",
        "size": "1024x1024",
        "n": 1,
        "response_format": "b64_json"
      }'

Note: If response_format=url is used and cloud storage is not configured, the API returns a relative URL such as /v1/images/<IMAGE_ID>/content.
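In the OpenAI Images response format, data holds one entry per generated image, so requests with n greater than 1 return several b64_json payloads. The standard-library sketch below sends the same JSON body as the curl example and writes every returned image to disk. The helper names and the <stem>_<i>.png naming scheme are illustrative assumptions. The .png extension follows the SDK example above rather than any documented output format.

```python
import base64
import json
import urllib.request
from pathlib import Path


def generate_images(prompt: str, size: str = "1024x1024", n: int = 1,
                    base_url: str = "http://localhost:30010",
                    api_key: str = "sk-proj-1234567890") -> dict:
    """POST /v1/images/generations with response_format=b64_json; return the JSON body."""
    body = json.dumps({"prompt": prompt, "size": size, "n": n,
                       "response_format": "b64_json"}).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/images/generations", data=body, method="POST",
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.load(resp)


def save_b64_images(response: dict, out_dir: str = ".", stem: str = "output") -> list:
    """Decode each data[i].b64_json entry and write it to <out_dir>/<stem>_<i>.png."""
    paths = []
    for i, item in enumerate(response.get("data", [])):
        path = Path(out_dir) / f"{stem}_{i}.png"
        path.write_bytes(base64.b64decode(item["b64_json"]))
        paths.append(path)
    return paths
```

Usage: save_b64_images(generate_images("A calico cat playing a piano on stage", n=2)).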

Edit an image

Endpoint: POST /v1/images/edits

This endpoint accepts a multipart form upload with input images and a text prompt. The server can return either a base64-encoded image or a URL to download the image.

Curl Example (b64_json response):

bash

curl -sS -X POST "http://localhost:30010/v1/images/edits" \
  -H "Authorization: Bearer sk-proj-1234567890" \
  -F "image=@local_input_image.png" \
  -F "url=image_url.jpg" \
  -F "prompt=A calico cat playing a piano on stage" \
  -F "size=1024x1024" \
  -F "response_format=b64_json"

Curl Example (URL response):

bash

curl -sS -X POST "http://localhost:30010/v1/images/edits" \
  -H "Authorization: Bearer sk-proj-1234567890" \
  -F "image=@local_input_image.png" \
  -F "url=image_url.jpg" \
  -F "prompt=A calico cat playing a piano on stage" \
  -F "size=1024x1024" \
  -F "response_format=url"
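curl -F sends a multipart/form-data body. For Python clients that should avoid extra dependencies, the sketch below builds the same body with the standard library. The field names (image, prompt, size, response_format) come from the curl examples above. The helper names encode_multipart and edit_image are hypothetical. Each file's content type is guessed from its extension.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def encode_multipart(fields: dict, files: dict) -> tuple:
    """Encode text fields and {field_name: file_path} uploads as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode()
        )
    for name, path in files.items():
        path = Path(path)
        ctype = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"; filename="{path.name}"\r\n'
            f"Content-Type: {ctype}\r\n\r\n"
        ).encode()
        parts.append(header + path.read_bytes() + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())  # closing boundary
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def edit_image(image_path: str, prompt: str, size: str = "1024x1024",
               base_url: str = "http://localhost:30010",
               api_key: str = "sk-proj-1234567890") -> dict:
    """POST /v1/images/edits with one local image; return the decoded JSON body."""
    body, ctype = encode_multipart(
        {"prompt": prompt, "size": size, "response_format": "b64_json"},
        {"image": image_path},
    )
    req = urllib.request.Request(
        f"{base_url}/v1/images/edits", data=body, method="POST",
        headers={"Content-Type": ctype, "Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.load(resp)
```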

Download image content

When response_format=url is used with POST /v1/images/generations or POST /v1/images/edits, the API returns a relative URL like /v1/images/<IMAGE_ID>/content.

Endpoint: GET /v1/images/{image_id}/content

Curl Example:

bash

curl -sS -L "http://localhost:30010/v1/images/<IMAGE_ID>/content" \
  -H "Authorization: Bearer sk-proj-1234567890" \
  -o output.png
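The returned URL may be relative, so a client has to resolve it against the server address before downloading. The sketch below uses urllib.parse.urljoin, which leaves an absolute URL (for example, one pointing at configured cloud storage) unchanged. The helper names are hypothetical. One design choice: the API key is sent only when the resolved URL points at the SGLang server itself.

```python
import urllib.request
from urllib.parse import urljoin, urlsplit


def resolve_content_url(base_url: str, url: str) -> str:
    """Resolve a relative URL such as /v1/images/<id>/content against base_url."""
    return urljoin(base_url, url)


def download_image(base_url: str, url: str, out_path: str,
                   api_key: str = "sk-proj-1234567890") -> None:
    """Fetch the content URL returned by the API and write it to out_path."""
    full_url = resolve_content_url(base_url, url)
    headers = {}
    # Only send the API key to the SGLang server, not to external storage hosts.
    if urlsplit(full_url).netloc == urlsplit(base_url).netloc:
        headers["Authorization"] = f"Bearer {api_key}"
    req = urllib.request.Request(full_url, headers=headers)
    # urlopen follows redirects by default, similar to curl -L.
    with urllib.request.urlopen(req, timeout=120) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
```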

Video Generation

The server implements a subset of the OpenAI Videos API under the /v1/videos namespace.

Create a video (text-to-video)

Endpoint: POST /v1/videos

Python Example:

python

from openai import OpenAI

client = OpenAI(api_key="sk-proj-1234567890", base_url="http://localhost:30010/v1")

video = client.videos.create(
    prompt="A calico cat playing a piano on stage",
    size="1280x720"
)
print(f"Video ID: {video.id}, Status: {video.status}")

Curl Example:

bash

curl -sS -X POST "http://localhost:30010/v1/videos" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-proj-1234567890" \
  -d '{
        "prompt": "A calico cat playing a piano on stage",
        "size": "1280x720"
      }'
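The create call returns before the video is ready, and the download example further down polls until the status is completed. Below is a standard-library sketch of the same JSON request. It assumes the response is a JSON object with an id field, as video.id in the Python example suggests. build_create_video_request and create_video are hypothetical helper names.

```python
import json
import urllib.request


def build_create_video_request(prompt: str, size: str = "1280x720",
                               base_url: str = "http://localhost:30010",
                               api_key: str = "sk-proj-1234567890") -> urllib.request.Request:
    """Build the JSON POST /v1/videos request used in the curl example."""
    return urllib.request.Request(
        f"{base_url}/v1/videos",
        data=json.dumps({"prompt": prompt, "size": size}).encode(),
        method="POST",
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
    )


def create_video(prompt: str, size: str = "1280x720", **kwargs) -> str:
    """Submit a text-to-video job and return its id (assumed response field)."""
    req = build_create_video_request(prompt, size, **kwargs)
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["id"]
```

Keeping request construction separate from sending makes it easy to inspect or log the exact payload before it reaches the server.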

Create a video (image-to-video)

For I2V or TI2V models (e.g., Wan2.1 I2V, LTX-2.3 two-stage), pass an input image via multipart form upload or a reference URL.

Curl Example (multipart form upload):

bash

curl -sS -X POST "http://localhost:30010/v1/videos" \
  -H "Authorization: Bearer sk-proj-1234567890" \
  -F "prompt=A cat playing a piano" \
  -F "input_reference=@input_image.png" \
  -F "size=1280x720"

Curl Example (reference URL):

bash

curl -sS -X POST "http://localhost:30010/v1/videos" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-proj-1234567890" \
  -d '{
        "prompt": "A cat playing a piano",
        "reference_url": "https://example.com/input_image.png",
        "size": "1280x720"
      }'

List videos

Endpoint: GET /v1/videos

Python Example:

python

# client is the OpenAI client created in the text-to-video example above
videos = client.videos.list()
for item in videos.data:
    print(item.id, item.status)

Curl Example:

bash

curl -sS -X GET "http://localhost:30010/v1/videos" \
  -H "Authorization: Bearer sk-proj-1234567890"

Download video content

Endpoint: GET /v1/videos/{video_id}/content

Python Example:

python

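import time
from openai import OpenAI

client = OpenAI(api_key="sk-proj-1234567890", base_url="http://localhost:30010/v1")
video_id = "<VIDEO_ID>"  # e.g. video.id from the create example

# Poll the video list until the job completes (or fails)
while True:
    page = client.videos.list()
    item = next((v for v in page.data if v.id == video_id), None)
    if item and item.status == "completed":
        break
    if item and item.status == "failed":
        raise RuntimeError(f"Video generation failed: {video_id}")
    time.sleep(5)

# Download content
resp = client.videos.download_content(video_id=video_id)
with open("output.mp4", "wb") as f:
    f.write(resp.read())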

Curl Example:

bash

curl -sS -L "http://localhost:30010/v1/videos/<VIDEO_ID>/content" \
  -H "Authorization: Bearer sk-proj-1234567890" \
  -o output.mp4
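A polling loop without a deadline can wait indefinitely if a job never reaches a terminal state. The sketch below adds a timeout. It assumes GET /v1/videos returns a JSON object whose data list holds items with id and status fields, and that completed and failed are terminal statuses, as in the OpenAI Videos API. The list_videos argument is a hypothetical injectable function, so the logic can be exercised without a server.

```python
import json
import time
import urllib.request
from typing import Callable, Optional


def list_videos_http(base_url: str = "http://localhost:30010",
                     api_key: str = "sk-proj-1234567890") -> list:
    """GET /v1/videos and return its data list."""
    req = urllib.request.Request(f"{base_url}/v1/videos",
                                 headers={"Authorization": f"Bearer {api_key}"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["data"]


def wait_for_video(video_id: str,
                   list_videos: Callable[[], list] = list_videos_http,
                   timeout: float = 1800.0, interval: float = 5.0) -> dict:
    """Poll until the video is completed; raise on failure or timeout."""
    deadline = time.monotonic() + timeout
    while True:
        item: Optional[dict] = next(
            (v for v in list_videos() if v.get("id") == video_id), None)
        status = item.get("status") if item else None
        if status == "completed":
            return item
        if status == "failed":
            raise RuntimeError(f"video {video_id} failed: {item}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"video {video_id} not ready after {timeout}s")
        time.sleep(interval)
```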

LoRA Management

The server supports dynamic loading, merging, and unmerging of LoRA adapters.

Important Notes:

Mutual Exclusion: Only one LoRA configuration can be active per target at a time.
Switching: To switch LoRAs, deactivate the current LoRA with unmerge_lora_weights, then set the new one.
Caching: The server caches loaded LoRA weights in memory, so switching back to a previously loaded LoRA (same path) is cheap.

Set LoRA Adapter

Loads one or more LoRA adapters and applies them to the model. By default, regular weights are statically merged, while FSDP-sharded weights use dynamic LoRA to avoid full-gather memory peaks.

Endpoint: POST /v1/set_lora

Parameters:

lora_nickname (string or list of strings, required): A unique identifier for the LoRA adapter(s). Can be a single string or a list of strings for multiple LoRAs
lora_path (string or list of strings/None, optional): Path to the .safetensors file(s) or Hugging Face repo ID(s). Required for the first load; optional if re-activating a cached nickname. If a list, must match the length of lora_nickname
target (string or list of strings, optional): Which transformer(s) to apply the LoRA to. If a list, must match the length of lora_nickname. Valid values:
- "all" (default): Apply to all transformers
- "transformer": Apply only to the primary transformer (high noise for Wan2.2)
- "transformer_2": Apply only to transformer_2 (low noise for Wan2.2)
- "critic": Apply only to the critic model
strength (float or list of floats, optional): LoRA strength for merge, default 1.0. If a list, must match the length of lora_nickname. Values < 1.0 reduce the effect, values > 1.0 amplify the effect
merge_mode (string, optional): "auto" (default server policy), "merge" (force static merge), or "dynamic" (apply LoRA at forward time)

Single LoRA Example:

bash

curl -X POST http://localhost:30010/v1/set_lora \
  -H "Content-Type: application/json" \
  -d '{
        "lora_nickname": "lora_name",
        "lora_path": "/path/to/lora.safetensors",
        "target": "all",
        "strength": 0.8
      }'

Multiple LoRA Example:

bash

curl -X POST http://localhost:30010/v1/set_lora \
  -H "Content-Type: application/json" \
  -d '{
        "lora_nickname": ["lora_1", "lora_2"],
        "lora_path": ["/path/to/lora1.safetensors", "/path/to/lora2.safetensors"],
        "target": ["transformer", "transformer_2"],
        "strength": [0.8, 1.0]
      }'

Multiple LoRA with Same Target:

bash

curl -X POST http://localhost:30010/v1/set_lora \
  -H "Content-Type: application/json" \
  -d '{
        "lora_nickname": ["style_lora", "character_lora"],
        "lora_path": ["/path/to/style.safetensors", "/path/to/character.safetensors"],
        "target": "all",
        "strength": [0.7, 0.9]
      }'

[!NOTE] When using multiple LoRAs:

All list parameters (lora_nickname, lora_path, target, strength) must have the same length

If target or strength is a single value, it will be applied to all LoRAs

Multiple LoRAs applied to the same target are applied in order
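A client can check the list-length and broadcasting rules above before calling /v1/set_lora. The sketch below shows one way to do that normalization. It assumes that a single string lora_path is valid only with a single nickname, because the docs do not say whether a scalar path is broadcast. normalize_set_lora is a hypothetical helper, not part of the server. The server still performs its own validation, and this check only surfaces mistakes earlier.

```python
from typing import Optional, Sequence, Union

StrOrList = Union[str, Sequence[str]]
VALID_TARGETS = {"all", "transformer", "transformer_2", "critic"}


def _broadcast(name: str, value, n: int, default) -> list:
    """Repeat a scalar n times; require a list to already have length n."""
    if value is None:
        return [default] * n
    if isinstance(value, (list, tuple)):
        if len(value) != n:
            raise ValueError(f"{name} has {len(value)} entries, expected {n}")
        return list(value)
    return [value] * n


def normalize_set_lora(lora_nickname: StrOrList,
                       lora_path: Optional[StrOrList] = None,
                       target: Optional[StrOrList] = None,
                       strength=None) -> list:
    """Expand /v1/set_lora arguments into one dict per adapter."""
    nicknames = [lora_nickname] if isinstance(lora_nickname, str) else list(lora_nickname)
    n = len(nicknames)
    if isinstance(lora_path, str) and n > 1:
        # Assumption: one path is not shared across several nicknames.
        raise ValueError("lora_path must be a list when lora_nickname is a list")
    paths = _broadcast("lora_path", lora_path, n, None)
    targets = _broadcast("target", target, n, "all")
    strengths = _broadcast("strength", strength, n, 1.0)
    for t in targets:
        if t not in VALID_TARGETS:
            raise ValueError(f"invalid target {t!r}")
    return [
        {"lora_nickname": nn, "lora_path": p, "target": t, "strength": float(s)}
        for nn, p, t, s in zip(nicknames, paths, targets, strengths)
    ]
```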

Merge LoRA Weights

Manually merges the currently set LoRA weights into the base model.

[!NOTE] With FSDP-sharded weights, manual merge may require a full-gather and can OOM. Use set_lora with merge_mode="auto" or "dynamic" for the lower-peak path.

Endpoint: POST /v1/merge_lora_weights

Parameters:

target (string, optional): Which transformer(s) to merge. One of "all" (default), "transformer", "transformer_2", "critic"
strength (float, optional): LoRA strength for merge, default 1.0. Values < 1.0 reduce the effect, values > 1.0 amplify the effect

Curl Example:

bash

curl -X POST http://localhost:30010/v1/merge_lora_weights \
  -H "Content-Type: application/json" \
  -d '{"strength": 0.8}'

Unmerge LoRA Weights

Unmerges the currently active LoRA weights from the base model, restoring it to its original state. This must be called before setting a different LoRA.

Endpoint: POST /v1/unmerge_lora_weights

Curl Example:

bash

curl -X POST http://localhost:30010/v1/unmerge_lora_weights \
  -H "Content-Type: application/json"

List LoRA Adapters

Returns loaded LoRA adapters and current application status per module.

Endpoint: GET /v1/list_loras

Curl Example:

bash

curl -sS -X GET "http://localhost:30010/v1/list_loras"

Response Example:

json

{
  "loaded_adapters": [
    { "nickname": "lora_a", "path": "/weights/lora_a.safetensors" },
    { "nickname": "lora_b", "path": "/weights/lora_b.safetensors" }
  ],
  "active": {
    "transformer": [
      {
        "nickname": "lora_a",
        "path": "/weights/lora_a.safetensors",
        "merged": true,
        "mode": "merged",
        "strength": 1.0
      }
    ]
  }
}

Notes:

If LoRA is not enabled for the current pipeline, the server will return an error.
num_lora_layers_with_weights (not shown in the example above) counts only the layers that have LoRA weights applied for the active adapter.
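To see which adapters are currently applied, a client can read the active map. The sketch below flattens it into (target, nickname, strength) tuples and assumes the response shape shown in the example above, whose field names may differ between versions. fetch_lora_status, active_loras, and loaded_nicknames are hypothetical helper names. If LoRA is not enabled for the pipeline, the fetch raises an HTTP error.

```python
import json
import urllib.request


def fetch_lora_status(base_url: str = "http://localhost:30010") -> dict:
    """GET /v1/list_loras and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/v1/list_loras", timeout=10) as resp:
        return json.load(resp)


def active_loras(status: dict) -> list:
    """Return (target, nickname, strength) for every active adapter, sorted by target."""
    rows = []
    for target, adapters in sorted(status.get("active", {}).items()):
        for adapter in adapters:
            rows.append((target, adapter.get("nickname"), adapter.get("strength")))
    return rows


def loaded_nicknames(status: dict) -> set:
    """Nicknames whose weights are cached on the server."""
    return {a["nickname"] for a in status.get("loaded_adapters", [])}
```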

Example: Switching LoRAs

Set LoRA A:

bash

curl -X POST http://localhost:30010/v1/set_lora \
  -H "Content-Type: application/json" \
  -d '{"lora_nickname": "lora_a", "lora_path": "path/to/A"}'

Generate with LoRA A...

Unmerge LoRA A:

bash

curl -X POST http://localhost:30010/v1/unmerge_lora_weights

Set LoRA B:

bash

curl -X POST http://localhost:30010/v1/set_lora \
  -H "Content-Type: application/json" \
  -d '{"lora_nickname": "lora_b", "lora_path": "path/to/B"}'

Generate with LoRA B...
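The switching sequence above (unmerge, then set the next adapter) can be wrapped in a small client. This sketch assumes that set_lora accepts a JSON body as in the earlier examples and that unmerge_lora_weights needs no body. It also assumes that unmerging is harmless when nothing is currently merged, which the docs do not state. The post function is injectable so the call order can be checked without a server. All names here are hypothetical.

```python
import json
import urllib.request
from typing import Callable, Optional

PostFn = Callable[[str, Optional[dict]], dict]


def http_post_json(base_url: str = "http://localhost:30010") -> PostFn:
    """Return a function that POSTs (optionally JSON) to base_url + path."""
    def post(path: str, payload: Optional[dict]) -> dict:
        data = None if payload is None else json.dumps(payload).encode()
        headers = {} if payload is None else {"Content-Type": "application/json"}
        req = urllib.request.Request(base_url + path, data=data,
                                     headers=headers, method="POST")
        with urllib.request.urlopen(req, timeout=600) as resp:
            raw = resp.read()
        return json.loads(raw) if raw else {}
    return post


def switch_lora(post: PostFn, nickname: str, path: Optional[str] = None,
                strength: float = 1.0) -> None:
    """Deactivate the current LoRA, then activate nickname.

    path may be omitted when re-activating a cached nickname.
    """
    post("/v1/unmerge_lora_weights", None)
    payload = {"lora_nickname": nickname, "strength": strength}
    if path is not None:
        payload["lora_path"] = path
    post("/v1/set_lora", payload)
```

Usage: switch_lora(http_post_json(), "lora_b", "path/to/B").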

Adjust Output Quality

The server supports adjusting output quality and compression levels for both image and video generation through the output-quality and output-compression parameters.

Parameters

output-quality (string, optional): Preset quality level that automatically sets compression. Default is "default". Valid values:
- "maximum": Highest quality (100)
- "high": High quality (90)
- "medium": Medium quality (55)
- "low": Lower quality (35)
- "default": Auto-adjust based on media type (50 for video, 75 for image)
output-compression (integer, optional): Direct compression level override (0-100). Default is None. When provided (not None), takes precedence over output-quality.
- 0: Lowest quality, smallest file size
- 100: Highest quality, largest file size

Notes

Precedence: When both output-quality and output-compression are provided, output-compression takes precedence
Format Support: Quality settings apply to JPEG and video formats. PNG uses lossless compression and ignores these settings.
File Size vs Quality: Lower compression values (or "low" quality preset) produce smaller files but may show visible artifacts
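The precedence rules above can be summarized as a small function. This is a client-side sketch of the documented behavior, with preset values and per-media defaults copied from the parameter list. It is not the server's code. effective_compression and its image_format argument are hypothetical.

```python
from typing import Optional

# Preset values from the output-quality list above.
PRESETS = {"maximum": 100, "high": 90, "medium": 55, "low": 35}
# "default" depends on the media type.
DEFAULTS = {"video": 50, "image": 75}


def effective_compression(media_type: str,
                          output_quality: str = "default",
                          output_compression: Optional[int] = None,
                          image_format: str = "jpeg") -> Optional[int]:
    """Return the compression level that applies, or None when it is ignored (PNG)."""
    if media_type == "image" and image_format.lower() == "png":
        return None  # PNG is lossless and ignores these settings.
    if output_compression is not None:  # An explicit override wins.
        if not 0 <= output_compression <= 100:
            raise ValueError("output-compression must be between 0 and 100")
        return output_compression
    if output_quality == "default":
        return DEFAULTS[media_type]
    return PRESETS[output_quality]
```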