Loading Models from Object Storage


SGLang supports direct loading of models from object storage (S3 and Google Cloud Storage) without requiring a full local download. This feature uses the runai_streamer load format to stream model weights directly from cloud storage, significantly reducing startup time and local storage requirements.

Overview

When loading models from object storage, SGLang uses a two-phase approach:

  1. Metadata Download (once, before process launch): Configuration files and tokenizer files are downloaded to a local cache
  2. Weight Streaming (lazy, during model loading): Model weights are streamed directly from object storage as needed
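A minimal way to picture these two phases is the sketch below, which uses a local dictionary as a stand-in for the bucket. All names here (`download_metadata`, `stream_weights`, `remote_bucket`) are illustrative only, not SGLang's internal API:

```python
from pathlib import Path
import tempfile

def download_metadata(remote: dict, cache_dir: Path) -> list:
    """Phase 1: eagerly copy small config/tokenizer files to a local cache."""
    cached = []
    for name in ("config.json", "tokenizer.json"):
        if name in remote:
            (cache_dir / name).write_text(remote[name])
            cached.append(name)
    return cached

def stream_weights(remote: dict):
    """Phase 2: lazily yield weight shards only when the loader asks."""
    for name, blob in remote.items():
        if name.endswith(".safetensors"):
            yield name, blob  # streamed on demand, never written to disk

# Stand-in for an object-storage bucket.
remote_bucket = {
    "config.json": "{}",
    "tokenizer.json": "{}",
    "model-00001.safetensors": b"...",
}

with tempfile.TemporaryDirectory() as d:
    cached = download_metadata(remote_bucket, Path(d))
    shards = [name for name, _ in stream_weights(remote_bucket)]
```

The key property is that only the small metadata files ever touch local disk; the weight shards are consumed lazily.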

Supported Storage Backends

  1. Amazon S3: s3://bucket-name/path/to/model/
  2. Google Cloud Storage: gs://bucket-name/path/to/model/
  3. Azure Blob Storage: az://container-name/path/to/model/
  4. S3-compatible stores (e.g. MinIO): s3://bucket-name/path/to/model/, with the custom endpoint configured through the storage client's environment variables
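As a rough illustration, a loader can distinguish these object-storage URIs from local paths or Hugging Face repo IDs by inspecting the URI scheme. This is a hedged sketch, not SGLang's actual detection code:

```python
from urllib.parse import urlparse

# Schemes from the list above; illustrative only.
OBJECT_STORE_SCHEMES = {"s3", "gs", "az"}

def is_object_storage_uri(model_path: str) -> bool:
    """Return True if the model path points at a supported object store."""
    return urlparse(model_path).scheme in OBJECT_STORE_SCHEMES

print(is_object_storage_uri("s3://my-bucket/models/llama-3-8b/"))  # True
print(is_object_storage_uri("/local/models/llama-3-8b"))           # False
```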

Quick Start

Basic Usage

Simply provide an object storage URI as the model path:

bash
# S3
python -m sglang.launch_server \
  --model-path s3://my-bucket/models/llama-3-8b/ \
  --load-format runai_streamer

# Google Cloud Storage
python -m sglang.launch_server \
  --model-path gs://my-bucket/models/llama-3-8b/ \
  --load-format runai_streamer

Note: The runai_streamer load format is detected automatically for object storage URIs, so you can omit the --load-format flag:

bash
python -m sglang.launch_server \
  --model-path s3://my-bucket/models/llama-3-8b/

With Tensor Parallelism

bash
python -m sglang.launch_server \
  --model-path gs://my-bucket/models/llama-70b/ \
  --tp 4 \
  --model-loader-extra-config '{"distributed": true}'

Configuration

Load Format

The runai_streamer load format is designed for object storage, SSDs, and shared file systems:

bash
python -m sglang.launch_server \
  --model-path s3://bucket/model/ \
  --load-format runai_streamer

Extended Configuration Parameters

Use --model-loader-extra-config to pass additional configuration as a JSON string:

bash
python -m sglang.launch_server \
  --model-path s3://bucket/model/ \
  --model-loader-extra-config '{
    "distributed": true,
    "concurrency": 8,
    "memory_limit": 2147483648
  }'
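When scripting server launches, it can be convenient to build this JSON flag value programmatically instead of hand-writing it. The sketch below mirrors the example above (2147483648 bytes is 2 GiB, i.e. 2 * 1024**3):

```python
import json

# Build the value for --model-loader-extra-config.
extra_config = {
    "distributed": True,
    "concurrency": 8,
    "memory_limit": 2 * 1024**3,  # 2 GiB, in bytes
}
flag_value = json.dumps(extra_config)
print(flag_value)
```

The resulting string can then be passed to the launcher, e.g. `--model-loader-extra-config "$FLAG_VALUE"` in a shell script.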

Available Parameters

<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}> <colgroup> <col style={{width: "22%"}} /> <col style={{width: "16%"}} /> <col style={{width: "44%"}} /> <col style={{width: "18%"}} /> </colgroup> <thead> <tr style={{borderBottom: "2px solid #d55816"}}> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Parameter</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Type</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Description</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Default</th> </tr> </thead> <tbody> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><code>distributed</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>bool</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Enable distributed streaming for multi-GPU setups. Automatically set to <code>true</code> for object storage paths on CUDA-compatible devices.</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Auto-detected</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><code>concurrency</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>int</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Number of concurrent download streams. Higher values can improve throughput for large models.</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>4</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><code>memory_limit</code></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>int</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Memory limit (in bytes) for the streaming buffer.</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>System-dependent</td> </tr> </tbody> </table>

Performance Considerations

Distributed Streaming

For multi-GPU setups, enable distributed streaming to parallelize weight loading across the tensor-parallel processes:

bash
python -m sglang.launch_server \
  --model-path s3://bucket/model/ \
  --tp 8 \
  --model-loader-extra-config '{"distributed": true}'

Limitations

  • Supported Formats: Only the .safetensors weight format is currently supported (it is also the recommended format)
  • Supported Devices: Distributed streaming is supported only on CUDA-compatible devices; on other devices, SGLang falls back to non-distributed streaming

See Also