docs_new/docs/basic_usage/offline_engine_api.ipynb
SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. The two general use cases are offline batch inference and building a custom server on top of the engine.
This document focuses on offline batch inference and demonstrates four inference modes: non-streaming synchronous generation, streaming synchronous generation, non-streaming asynchronous generation, and streaming asynchronous generation.
Additionally, you can easily build a custom server on top of the SGLang offline engine. A detailed example that runs as a Python script can be found in custom_server.
Note that if you want to use the offline engine in IPython, or in any other environment where an event loop is already running, you need to add the following code:
import nest_asyncio
nest_asyncio.apply()
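To see why this patch is needed, the following sketch uses only the standard library to reproduce the underlying problem: calling asyncio.run from code that is already running inside an event loop (as a notebook cell is) raises a RuntimeError. The function names inner and outer are made up for illustration and are not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in a notebook cell.
    coro = inner()
    try:
        # Starting a second loop from within a running one is rejected.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

With nest_asyncio applied, such nested calls are allowed, which is what lets the examples in this document run inside a notebook.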
The engine also supports VLM (vision-language model) inference and hidden-state extraction.
Please see the examples for further use cases.
When running in a Ray cluster, you can use RayEngine with a custom placement group for fine-grained GPU placement control.
Pass a placement_group whose bundles each contain exactly 1 GPU. This gives a deterministic mapping between bundles and GPUs, so you control exactly which GPUs are used.
import ray
from ray.util.placement_group import placement_group
from sglang.srt.ray.engine import RayEngine
ray.init()
# Create a placement group with specific GPU bundles
pg = placement_group(
    [{"GPU": 1} for _ in range(4)],  # 4 bundles, each with 1 GPU
    strategy="STRICT_PACK",
)
ray.get(pg.ready())
# Launch RayEngine on the custom placement group
engine = RayEngine(
    model_path="meta-llama/Meta-Llama-3-8B-Instruct",
    tp_size=4,
    use_ray=True,
    placement_group=pg,
)
# Optional: specify exact bundle indices via environment variable
# export SGLANG_RAY_BUNDLE_INDICES="0,1,2,3"
Use the SGLANG_RAY_BUNDLE_INDICES environment variable to specify which placement group bundle each worker rank uses. For example, you can skip certain bundles or place workers on NVLink-connected GPUs:
# Use bundles 0,1,2,7 (skip bundles 3-6) for tp_size=4
export SGLANG_RAY_BUNDLE_INDICES="0,1,2,7"
# Place workers on NVLink-connected GPUs
export SGLANG_RAY_BUNDLE_INDICES="0,1,2,3"
The number of indices must match world_size (tp_size * pp_size * dp_size, or tp_size * pp_size when enable_dp_attention=True).
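The rule above can be checked before launching the engine. The following is a minimal sketch, not part of SGLang: expected_world_size and parse_bundle_indices are hypothetical helpers, and mapping rank i to the i-th listed index is an assumption based on the description above.

```python
def expected_world_size(tp_size, pp_size=1, dp_size=1, enable_dp_attention=False):
    """Hypothetical helper: world size as described in the text above."""
    if enable_dp_attention:
        return tp_size * pp_size
    return tp_size * pp_size * dp_size


def parse_bundle_indices(value, world_size):
    """Hypothetical helper: parse a value like "0,1,2,7" and check its length."""
    indices = [int(item) for item in value.split(",") if item.strip()]
    if len(indices) != world_size:
        raise ValueError(
            f"expected {world_size} bundle indices, got {len(indices)}: {value!r}"
        )
    return indices


world_size = expected_world_size(tp_size=4)
indices = parse_bundle_indices("0,1,2,7", world_size)

# Assumption: worker rank i is placed on the i-th listed bundle.
for rank, bundle in enumerate(indices):
    print(f"rank {rank} -> bundle {bundle}")
```

Running a check like this ahead of time surfaces a mismatched index count as a clear error message instead of a failure during engine startup.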
SGLang offline engine supports batch inference with efficient scheduling.
# launch the offline engine
import asyncio
import sglang as sgl
import sglang.test.doc_patch # noqa: F401
from sglang.utils import async_stream_and_merge, stream_and_merge
llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}
# Generate completions for all prompts in a single batch
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]
sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}
print("\n=== Testing synchronous streaming generation with overlap removal ===\n")
for prompt in prompts:
    print(f"Prompt: {prompt}")
    # Stream the response and merge the chunks into a single string
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
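As a rough illustration of what "overlap removal" means here, the sketch below merges streamed chunks while dropping any text at the start of a chunk that repeats the end of what has already been collected. It is a standalone approximation: trim_overlap and merge_chunks are written for this example, and the actual helpers in sglang.utils may behave differently.

```python
def trim_overlap(existing, new_chunk):
    """Return new_chunk without the prefix that already ends `existing`."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text repeated across chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Illustrative chunks; the second one repeats "capital" from the first.
chunks = ["The capital", "capital of France", " is Paris."]
print(merge_chunks(chunks))
```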
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}
print("\n=== Testing asynchronous batch generation ===")


async def main():
    # Await the whole batch, then print each result
    outputs = await llm.async_generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}
print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)
        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)
        print()  # New line after each prompt


asyncio.run(main())
llm.shutdown()