docs_new/docs/get-started/quickstart.mdx
This guide walks you through the entire flow of getting started with SGLang: installing it, launching a server, and sending requests with curl, the OpenAI Python client, `requests`, or the native SGLang API. By the end, you'll have a working SGLang server responding to your prompts.
Install SGLang with `uv`:

```bash
pip install --upgrade pip
pip install uv
uv pip install sglang
```
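To sanity-check the install, you can import the package from Python. A minimal check, assuming the package exposes `__version__` as recent releases do:

```python
# Quick check that SGLang imports cleanly and report its version.
import sglang

print(sglang.__version__)
```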
Alternatively, run SGLang with Docker. Replace `<secret>` with your [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens):
```bash
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
```
For production deployments, use the smaller **runtime** variant (~40% size reduction):
```bash
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest-runtime \
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
```
Start the SGLang server with a model. Here we use `qwen/qwen2.5-0.5b-instruct` as a lightweight example:

```bash
python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --port 30000
```

Wait until you see `The server is fired up and ready to roll!` in the terminal output.
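If you'd rather poll for readiness than watch the logs, you can query the server's health endpoint. A minimal sketch, assuming the default `/health` route on port 30000:

```python
import requests

# Returns HTTP 200 once the server is up and able to serve requests.
resp = requests.get("http://localhost:30000/health")
print(resp.status_code)
```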
SGLang is fully OpenAI API-compatible, so you can use the same tools and libraries you already know.
```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen/qwen2.5-0.5b-instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```
Install the OpenAI Python library if you haven't:
```bash
pip install openai
```
Then send a request:
```python
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response.choices[0].message.content)
```
To stream the response as it is generated, set `stream=True`:

```python
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
You can also call the OpenAI-compatible endpoint directly with `requests`:

```python
import requests

url = "http://localhost:30000/v1/chat/completions"
data = {
    "model": "qwen/qwen2.5-0.5b-instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}

response = requests.post(url, json=data)
print(response.json())
```
SGLang also provides a native `/generate` endpoint for more flexibility:
```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print(response.json())
```
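The same endpoint can also take a batch of prompts in one request. A minimal sketch, assuming `text` accepts a list of strings (as in recent SGLang releases), in which case the response is a list of results:

```python
import requests

# Batched request: pass a list of prompts in "text".
# (Assumes /generate accepts a list here; check your SGLang version if it errors.)
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": ["The capital of France is", "The capital of Japan is"],
        "sampling_params": {"temperature": 0, "max_new_tokens": 16},
    },
)
for item in response.json():
    print(item["text"])
```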
The `/generate` endpoint also supports streaming:

```python
import requests
import json

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
        "stream": True,
    },
    stream=True,
)

prev = 0
for chunk in response.iter_lines(decode_unicode=False):
    chunk = chunk.decode("utf-8")
    if chunk and chunk.startswith("data:"):
        if chunk == "data: [DONE]":
            break
        data = json.loads(chunk[5:].strip("\n"))
        output = data["text"]
        print(output[prev:], end="", flush=True)
        prev = len(output)
```
SGLang also supports offline batch inference using the `Engine` class directly, with no HTTP server required:
```python
import sglang as sgl

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}\n")

llm.shutdown()
```
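The engine can also be driven from async code. A minimal sketch, assuming the `Engine.async_generate` coroutine available in recent SGLang releases:

```python
import asyncio

import sglang as sgl

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


async def main():
    # async_generate mirrors generate() but can be awaited inside an event loop.
    # (Method name assumed from recent SGLang releases.)
    output = await llm.async_generate(
        "The capital of France is",
        {"temperature": 0, "max_new_tokens": 16},
    )
    print(output["text"])


asyncio.run(main())
llm.shutdown()
```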
{/* WIP, TBD linked later