Offline Engine API


SGLang provides a direct inference engine that does not require an HTTP server. This suits use cases where an additional HTTP server adds unnecessary complexity or overhead. There are two general use cases:

  • Offline Batch Inference
  • Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

  • Non-streaming synchronous generation
  • Streaming synchronous generation
  • Non-streaming asynchronous generation
  • Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in custom_server.

Nest Asyncio

If you use the offline engine in IPython or in any other code that already runs an event loop, add the following code first:

python
import nest_asyncio

nest_asyncio.apply()
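
For context, the restriction that nest_asyncio works around comes from asyncio itself: asyncio.run() refuses to start while another event loop is already running in the same thread, which is the situation inside a notebook cell. The following minimal sketch uses only the standard library to reproduce that error; the names inner and outer are illustrative and not part of SGLang.

```python
import asyncio


async def inner():
    return 42


async def outer():
    # outer() itself runs inside an event loop, like code in a notebook cell.
    coro = inner()
    try:
        # A loop is already running, so a second asyncio.run() is refused.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling nest_asyncio.apply() patches asyncio so that such nested calls are permitted.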

Advanced Usage

The engine also supports VLM (vision-language model) inference and hidden-state extraction.

Please see the examples for further use cases.

Ray Integration

When running in a Ray cluster, you can use RayEngine with a custom placement group for fine-grained GPU placement control.

Custom Placement Groups

Pass a placement_group whose bundles each request exactly 1 GPU to control which GPUs are used. One GPU per bundle keeps the mapping from bundles to GPUs deterministic.

python
import ray
from ray.util.placement_group import placement_group
from sglang.srt.ray.engine import RayEngine

ray.init()

# Create placement group with specific GPU bundles
pg = placement_group(
    [{"GPU": 1} for _ in range(4)],  # 4 bundles, each with 1 GPU
    strategy="STRICT_PACK",
)
ray.get(pg.ready())

# Launch RayEngine on custom placement group
engine = RayEngine(
    model_path="meta-llama/Meta-Llama-3-8B-Instruct",
    tp_size=4,
    use_ray=True,
    placement_group=pg,
)

# Optional: specify exact bundle indices via environment variable
# export SGLANG_RAY_BUNDLE_INDICES="0,1,2,3"

Bundle Index Control

Set the SGLANG_RAY_BUNDLE_INDICES environment variable to specify which placement group bundle each worker rank uses. This enables:

  • Skipping unhealthy GPUs
  • Topology-aware placement (e.g., NVLink-connected GPUs)
  • Non-sequential bundle assignment

bash
# Use bundles 0,1,2,7 (skip bundles 3-6) for tp_size=4
export SGLANG_RAY_BUNDLE_INDICES="0,1,2,7"

# Place workers on NVLink-connected GPUs
export SGLANG_RAY_BUNDLE_INDICES="0,1,2,3"

The number of indices must match world_size (tp_size * pp_size * dp_size, or tp_size * pp_size when enable_dp_attention=True).
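
The counting rule above can be checked before launching the engine. The sketch below is a hypothetical pre-flight helper, not SGLang's own validation: the function names expected_world_size and parse_bundle_indices are invented for illustration, and only the world_size formula is taken from the text above.

```python
def expected_world_size(tp_size, pp_size=1, dp_size=1, enable_dp_attention=False):
    """Apply the world_size rule stated above (hypothetical helper)."""
    if enable_dp_attention:
        return tp_size * pp_size
    return tp_size * pp_size * dp_size


def parse_bundle_indices(raw, world_size):
    """Parse a value such as "0,1,2,7" and check that its length matches world_size."""
    indices = [int(part) for part in raw.split(",") if part.strip()]
    if len(indices) != world_size:
        raise ValueError(
            f"expected {world_size} bundle indices, got {len(indices)}: {raw!r}"
        )
    return indices


world_size = expected_world_size(tp_size=4)
print(parse_bundle_indices("0,1,2,7", world_size))
```

Running a check like this before exporting the variable turns a miscounted index list into an immediate, readable error.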

Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch  # noqa: F401
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Non-streaming Synchronous Generation

python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Streaming Synchronous Generation

python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
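
To illustrate what "overlap removal" means here, the sketch below merges streamed chunks by dropping any prefix of a new chunk that repeats the end of the text collected so far. It is a standalone illustration under that assumption; drop_overlap and merge_chunks are hypothetical names, and stream_and_merge may be implemented differently.

```python
def drop_overlap(existing_text, new_chunk):
    """Return new_chunk without the longest prefix that existing_text already ends with."""
    max_overlap = min(len(existing_text), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


merged = merge_chunks(["The capital", "capital of France", " of France is Paris."])
print(merged)
```

A suffix/prefix match like this is a heuristic: it cannot tell a chunk that overlaps the previous one from a model that genuinely repeats a word at a chunk boundary.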

Non-streaming Asynchronous Generation

python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())

Streaming Asynchronous Generation

python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())

python
llm.shutdown()