This guide demonstrates how to serve gpt-oss-20b, OpenAI's open-weights reasoning model, with SGLang on a Daytona GPU sandbox and query it from anywhere through a token-authenticated preview URL. The server speaks the OpenAI-compatible API, so any OpenAI client works against it unchanged.
serve_sglang.py creates the sandbox, starts sglang.launch_server, streams the startup logs, and prints the endpoint once the server is healthy. Four query examples are included: raw curl (query.sh); the OpenAI SDK with chat, streaming, structured output, reasoning, tool calling, and a prefix-cache demo (query_openai.py); LiteLLM (query_litellm.py); and a concurrent classification workload over classic-book passages (classify_passages.py).
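
The "wait until the server is healthy" step can be pictured as a polling loop. The sketch below is an assumption about its general shape, not the code in serve_sglang.py: the function names, the 5-second interval, and the injected probe are hypothetical, and it leaves out log streaming and detection of a crashed server process. It assumes the health route is /health_generate and that requests carry the x-daytona-preview-token header.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, token: str) -> int:
    """Return the HTTP status for GET url, or 0 if the server is unreachable."""
    req = urllib.request.Request(url, headers={"x-daytona-preview-token": token})
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The proxy or server answered, but not with a success status.
        return err.code
    except OSError:
        # Connection refused, DNS failure, timeout, and similar.
        return 0


def wait_until_healthy(
    probe: Callable[[], int],
    timeout_s: float,
    interval_s: float = 5.0,  # hypothetical polling interval
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it returns 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe() == 200:
            return True
        sleep(interval_s)
    return False
```

With a running sandbox, something like `wait_until_healthy(lambda: http_status(endpoint + "/health_generate", token), timeout_s=900)` would block until the first successful forward pass or give up after 15 minutes. Passing the probe in as a callable keeps the loop testable without a network.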

- The lmsysorg/sglang image runs as-is
- gpu_type requests an H100 first, falling back to an RTX PRO 6000
- Works with curl, the OpenAI SDK, LiteLLM, or anything else that speaks the OpenAI API
- Requests authenticate with the x-daytona-preview-token header
- reasoning_effort adjusts the amount of reasoning per request, and the parsed trace comes back in reasoning_content
- response_format with a JSON schema constrains decoding, so replies are guaranteed to parse
- The --enable-cache-report flag exposes per-request cache hits in the usage stats
- classify_passages.py classifies 273 passages from thirteen classic books by author in one concurrent batch (~825k tokens), scores the result against ground truth, then asks a second question (indoors vs outdoors) over the same passages that reuses the prefix cache

> [!TIP]
> No local GPU is needed; the model runs entirely inside the sandbox.

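
As a rough illustration of the structured-output feature, the sketch below builds a chat-completions request body whose response_format carries a JSON schema. The schema, the sentiment_verdict name, and the prompt are invented for illustration and are not taken from query_openai.py; the response_format shape follows the OpenAI chat-completions API that the server is compatible with.

```python
import json

# Hypothetical schema for illustration; not taken from query_openai.py.
VERDICT_SCHEMA = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number"},
    },
    "required": ["sentiment", "confidence"],
    "additionalProperties": False,
}


def build_structured_request(prompt: str, model: str = "gpt-oss-20b") -> dict:
    """Return a chat-completions body that asks for schema-constrained JSON."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # The budget covers reasoning plus answer, so keep it generous.
        "max_tokens": 4096,
        "reasoning_effort": "low",
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "sentiment_verdict", "schema": VERDICT_SCHEMA},
        },
    }


if __name__ == "__main__":
    body = build_structured_request("The weights loaded on the first try.")
    print(json.dumps(body, indent=2))
```

The resulting dict can be sent as the JSON body of a POST to /v1/chat/completions, or its fields passed as keyword arguments to an OpenAI-compatible client.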

- DAYTONA_API_KEY: Required for Daytona sandbox access. Get it from the Daytona Dashboard.
- HF_TOKEN: Optional; gpt-oss is not gated, so this only matters for gated models you swap in (Hugging Face recommends a token for faster, less throttled downloads in general).

```bash
python3.10 -m venv venv
source venv/bin/activate
pip install -e .
cp .env.example .env
# edit .env with your API key
python serve_sglang.py

export ENDPOINT=https://8000-{sandboxId}.{daytonaProxyDomain}
export TOKEN={previewToken}
./query.sh                   # raw curl
python query_openai.py       # OpenAI SDK: chat, streaming, structured output, reasoning, tools, prefix cache
python query_litellm.py      # LiteLLM
python classify_passages.py  # concurrent author + setting classification over Gutenberg passages
```

The endpoint authenticates via the token header. Two alternatives: sb.create_signed_preview_url(PORT, expires_in_seconds=3600) returns a URL with the token embedded (for clients that can't set headers), and public=True at sandbox creation drops proxy auth entirely. Independently, SGLang's --api-key flag adds the server's own key check; combined with a public preview, the endpoint takes the standard OpenAI shape of base URL plus api_key.
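
To make the header-based option concrete, here is a minimal standard-library sketch that builds (but does not send) a chat-completions request carrying the preview token. The build_chat_request helper and the example hostname are hypothetical; the header name and the model name come from this guide.

```python
import json
import urllib.request


def build_chat_request(endpoint: str, token: str, prompt: str) -> urllib.request.Request:
    """Build a POST to /v1/chat/completions authenticated by the preview token header."""
    body = {
        "model": "gpt-oss-20b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 4096,
    }
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The Daytona proxy checks this header before forwarding to SGLang.
            "x-daytona-preview-token": token,
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values; substitute the ENDPOINT and TOKEN printed by serve_sglang.py.
    req = build_chat_request("https://sandbox.example.invalid", "preview-token", "Say hello.")
    print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen sends it. With the OpenAI Python SDK, the same header can be supplied through the client's default_headers argument.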
gpt-oss reasons before it answers, and max_tokens covers reasoning plus answer combined. If thinking exhausts the budget, the response has finish_reason: "length" and content: null, which looks like the model returned nothing. Thinking length varies a lot between identical runs, so the examples use generous budgets and turn reasoning_effort down to "low" for simple or high-volume tasks.
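
One way to tell this case apart from a genuinely empty reply is to check finish_reason and content together. The sketch below is illustrative only: it works on plain dicts shaped like a chat-completions choice, and both helper names and the 16384-token ceiling are hypothetical choices, not values from the example scripts.

```python
from typing import Optional


def reasoning_exhausted_budget(choice: dict) -> bool:
    """True when the model hit max_tokens before producing any answer text."""
    message = choice.get("message", {})
    return choice.get("finish_reason") == "length" and not message.get("content")


def next_budget(current: int, ceiling: int = 16384) -> Optional[int]:
    """Double the token budget for a retry, or return None once the ceiling is reached."""
    if current >= ceiling:
        return None
    return min(current * 2, ceiling)


if __name__ == "__main__":
    truncated = {"finish_reason": "length", "message": {"content": None, "reasoning_content": "..."}}
    complete = {"finish_reason": "stop", "message": {"content": "4"}}
    print(reasoning_exhausted_budget(truncated))  # True
    print(reasoning_exhausted_budget(complete))   # False
    print(next_budget(4096))                      # 8192
```

A caller could retry with the larger budget when the first helper returns True, or lower reasoning_effort instead if the task is simple.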
Code running inside the sandbox can skip the preview URL and token and talk to http://localhost:8000 directly. The SGLang image ships the openai package, so the SDK works there as-is:

```python
from daytona import Daytona, DaytonaConfig

# Attach to the running sandbox; replace SANDBOX_ID with the ID printed by serve_sglang.py.
sb = Daytona(DaytonaConfig(target="us-east-1")).get("SANDBOX_ID")

# The snippet below executes inside the sandbox, so localhost is the SGLang server itself.
print(sb.process.code_run("""
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="gpt-oss-20b",
    messages=[{"role": "user", "content": "Write a haiku about code that never leaves its sandbox."}],
    max_tokens=4096,
)
print(resp.choices[0].message.content)
""").result)
```

Useful for colocated workloads, like batch inference over data uploaded into the sandbox.
The sandbox stays up after serve_sglang.py exits, so the endpoint keeps working on success and the downloaded weights aren't lost on failure. Delete it when you're done:

```bash
python -c "from daytona import Daytona; Daytona().get('SANDBOX_ID').delete()"
```

The sandbox ID is printed by serve_sglang.py.
Constants at the top of serve_sglang.py:

- MODEL: Hugging Face model ID to serve (default: openai/gpt-oss-20b)
- SERVED_AS: model name exposed by the API, what clients pass as model (default: gpt-oss-20b)
- SGLANG_IMAGE: SGLang Docker image (default: lmsysorg/sglang:v0.5.12.post1-cu130)
- PORT: port the server listens on (default: 8000)
- TARGET: Daytona region; us-east-1 is currently the region for GPU sandboxes
- BOOT_TIMEOUT: seconds to wait for the server to become healthy (default: 900)

When changing MODEL, also update the --tool-call-parser and --reasoning-parser flags: parser names must match the model family and your SGLang version, or tool calls and reasoning come back unparsed in content. Both flags also accept auto to detect the parser from the model's chat template.
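
To show how these constants and the parser flags might fit together, the sketch below assembles an sglang.launch_server command line. It is an assumption about the general shape, not a copy of what serve_sglang.py runs: the build_launch_command helper is hypothetical, the exact flag set in the script may differ, and the auto parser values simply follow the note above.

```python
import shlex

# Values mirror the defaults listed above.
MODEL = "openai/gpt-oss-20b"
SERVED_AS = "gpt-oss-20b"
PORT = 8000


def build_launch_command(
    model: str,
    served_as: str,
    port: int,
    tool_call_parser: str = "auto",
    reasoning_parser: str = "auto",
) -> str:
    """Assemble a launch command; the real script's flags may differ."""
    args = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model,
        "--served-model-name", served_as,
        "--host", "0.0.0.0",
        "--port", str(port),
        # Parser names must match the model family and SGLang version.
        "--tool-call-parser", tool_call_parser,
        "--reasoning-parser", reasoning_parser,
        "--enable-cache-report",
    ]
    return shlex.join(args)


if __name__ == "__main__":
    print(build_launch_command(MODEL, SERVED_AS, PORT))
```

Keeping the parser names as parameters next to MODEL makes it harder to swap the model and forget the matching parsers.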
GPU sandboxes are currently capped at 1 GPU each. The larger gpt-oss-120b also fits on a single H100 with extra memory flags; see the guide's "Scaling up" section for the flags and the capacity trade-off.

1. Create a GPU sandbox in us-east-1 from the official SGLang image
2. Start sglang.launch_server as a background session command; the model downloads from Hugging Face and loads onto the GPU
3. Poll /health_generate (a real forward pass, not just a liveness check) through the preview URL while streaming server logs; if the server process exits, save the log locally and fail fast
4. Print the export ENDPOINT=... TOKEN=... lines for the query scripts
5. Query the endpoint from any client using the x-daytona-preview-token header

See the main project LICENSE file for details.