GPT-OSS is an open-weight large language model family released by OpenAI, designed for powerful reasoning, agentic tasks, and versatile developer use cases. It is available in two model sizes: gpt-oss-120b and gpt-oss-20b.
GPT-OSS introduces several notable innovations, including a mixture-of-experts architecture and native MXFP4 quantization of the MoE weights.
SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements.
Please refer to the official SGLang installation guide for installation instructions.
This section provides deployment configurations optimized for different hardware platforms and use cases.
The GPT-OSS series comes in two sizes. Recommended starting configurations vary depending on hardware.
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model size, quantization method, and thinking capabilities.
import { GPTOSSDeployment } from "/src/snippets/autoregressive/gpt-oss-deployment.jsx";
<GPTOSSDeployment />

For more detailed configuration tips, please refer to GPT-OSS Usage.
For basic API usage and request examples, please refer to the SGLang documentation.
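As a quick orientation, the server exposes an OpenAI-compatible `/v1/chat/completions` endpoint. A minimal sketch of a request body, assuming the local deployment from the section above (the `max_tokens` value is illustrative):

```python
import json

# Minimal request body for SGLang's OpenAI-compatible
# /v1/chat/completions endpoint (server assumed at http://localhost:8000)
payload = {
    "model": "openai/gpt-oss-120b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128,
}

# Any HTTP client can POST this, e.g.
# requests.post("http://localhost:8000/v1/chat/completions", json=payload)
body = json.dumps(payload)
print(body)
```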
GPT-OSS supports reasoning mode. Enable the reasoning parser during deployment to separate the thinking and content sections:
```shell
python -m sglang.launch_server \
  --model openai/gpt-oss-120b \
  --reasoning-parser gpt-oss \
  --tp 8
```
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Enable streaming to see the thinking process in real time
response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
    ],
    temperature=0.7,
    max_tokens=2048,
    stream=True
)

# Process the stream
has_thinking = False
has_answer = False
thinking_started = False

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print the thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
                has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Print the answer content
        if delta.content:
            # Close the thinking section and add a content header
            if has_thinking and not has_answer:
                print("\n=============== Content =================", flush=True)
                has_answer = True
            print(delta.content, end="", flush=True)

print()
```
Output Example:
=============== Thinking =================
The user asks: "Solve this problem step by step: What is 15% of 240?" So we need to provide step-by-step solution. Compute 15% of 240: 0.15 * 240 = 36. Provide steps: convert percent to decimal, multiply, maybe use fraction. Provide answer.
=============== Content =================
**Step‑by‑step solution**
1. **Understand what “percent” means**
“15 %” means 15 out of every 100 parts, i.e. the fraction \(\displaystyle \frac{15}{100}\).
2. **Convert the percent to a decimal (or fraction)**
\[
\frac{15}{100}=0.15
\]
3. **Set up the multiplication**
To find 15 % of 240 we multiply 240 by the decimal 0.15:
\[
240 \times 0.15
\]
4. **Do the multiplication**
One convenient way is to break it into two easier parts:
\[
240 \times 0.15 = 240 \times \left(\frac{15}{100}\right)
= \frac{240 \times 15}{100}
\]
- First compute \(240 \times 15\):
\[
240 \times 15 = 240 \times (10 + 5) = 2400 + 1200 = 3600
\]
- Then divide by 100:
\[
\frac{3600}{100} = 36
\]
5. **Write the result**
\[
15\% \text{ of } 240 = 36
\]
---
**Answer:** \(36\)
GPT-OSS supports tool calling. Enable the tool call parser when launching the server.

Python Example (without Thinking Process):

Start the SGLang server:
```shell
python -m sglang.launch_server \
  --model openai/gpt-oss-120b \
  --tool-call-parser gpt-oss \
  --tp 8
```
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Define available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city name"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

# Make the request with streaming enabled
response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        {"role": "user", "content": "What's the weather in Beijing?"}
    ],
    tools=tools,
    temperature=0.7,
    stream=True
)

# Process the streaming response
thinking_started = False
has_thinking = False

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print the thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
                has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Print tool calls
        if hasattr(delta, 'tool_calls') and delta.tool_calls:
            # Close the thinking section if needed
            if has_thinking and thinking_started:
                print("\n=============== Content =================", flush=True)
                thinking_started = False
            for tool_call in delta.tool_calls:
                if tool_call.function:
                    print(f"🔧 Tool Call: {tool_call.function.name}")
                    print(f"   Arguments: {tool_call.function.arguments}")

        # Print content
        if delta.content:
            print(delta.content, end="", flush=True)

print()
```
Output Example:
🔧 Tool Call: get_weather
Arguments: {"location": "Beijing", "unit": "celsius"}
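Note that in a streaming response the function arguments may arrive split across several chunks. A minimal sketch of the usual handling, with hypothetical fragments standing in for streamed deltas: accumulate the pieces, then decode the complete JSON once the stream ends.

```python
import json

# Hypothetical argument fragments as they might arrive across stream chunks
fragments = ['{"location": ', '"Beijing", ', '"unit": "celsius"}']

# Join the pieces, then decode the complete JSON string
arguments = json.loads("".join(fragments))
print(arguments["location"])  # Beijing
```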
Python Example (with Thinking Process):
Start the SGLang server:
```shell
python -m sglang.launch_server \
  --model openai/gpt-oss-120b \
  --reasoning-parser gpt-oss \
  --tool-call-parser gpt-oss \
  --tp 8
```
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Define available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city name"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

# Make the request with streaming enabled to see the thinking process
response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        {"role": "user", "content": "What's the weather in Beijing?"}
    ],
    tools=tools,
    temperature=0.7,
    stream=True
)

# Process the streaming response
thinking_started = False
has_thinking = False

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print the thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
                has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Print tool calls
        if hasattr(delta, 'tool_calls') and delta.tool_calls:
            # Close the thinking section if needed
            if has_thinking and thinking_started:
                print("\n=============== Content =================", flush=True)
                thinking_started = False
            for tool_call in delta.tool_calls:
                if tool_call.function:
                    print(f"🔧 Tool Call: {tool_call.function.name}")
                    print(f"   Arguments: {tool_call.function.arguments}")

        # Print content
        if delta.content:
            print(delta.content, end="", flush=True)

print()
```
Output Example:
=============== Thinking =================
User asks: "What's the weather in Beijing?" We need to get current weather. Use function get_weather with location "Beijing". No unit specified; default? Probably use default (maybe Celsius). We can specify unit as "celsius". We'll call function.
=============== Content =================
🔧 Tool Call: get_weather
Arguments: {"location": "Beijing", "unit": "celsius"}
Handling Tool Call Results: after receiving a tool call, execute the corresponding function and send its result back to the model as a `tool` message:
```python
# After getting the tool call, execute the function
def get_weather(location, unit="celsius"):
    # Your actual weather API call goes here
    return f"The weather in {location} is 22°{unit[0].upper()} and sunny."

# Send the tool result back to the model
messages = [
    {"role": "user", "content": "What's the weather in Beijing?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_123",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": '{"location": "Beijing", "unit": "celsius"}'
            }
        }]
    },
    {
        "role": "tool",
        "tool_call_id": "call_123",
        "content": get_weather("Beijing", "celsius")
    }
]

final_response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=messages,
    temperature=0.7
)
print(final_response.choices[0].message.content)
# Output: "The current weather in Beijing is 22 °C and sunny. Let me know if you’d like a forecast for the next few days or any other details!"
```
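When the model can call more than one tool, a small dispatch table keeps the execution step generic. A minimal sketch, where the registry and the `execute_tool_call` helper are illustrative rather than part of any SGLang or OpenAI API:

```python
import json

def get_weather(location, unit="celsius"):
    # Stub standing in for a real weather API call
    return f"The weather in {location} is 22°{unit[0].upper()} and sunny."

# Illustrative registry mapping tool names to local Python functions
TOOL_REGISTRY = {"get_weather": get_weather}

def execute_tool_call(name, arguments_json):
    """Decode the model-supplied JSON arguments and run the matching function."""
    func = TOOL_REGISTRY[name]
    return func(**json.loads(arguments_json))

result = execute_tool_call("get_weather", '{"location": "Beijing", "unit": "celsius"}')
print(result)  # The weather in Beijing is 22°C and sunny.
```

The string returned here is what would go into the `content` field of the `tool` message shown above.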
We use SGLang's built-in benchmarking tool to evaluate serving performance on the ShareGPT_Vicuna_unfiltered dataset. This dataset contains real conversation data, so it better reflects performance in real-world usage.
```shell
python -m sglang.launch_server \
  --model openai/gpt-oss-120b \
  --tp 8
```

```shell
python3 -m sglang.bench_serving \
  --backend sglang \
  --num-prompts 100 \
  --max-concurrency 1
```

Test Results:
```
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     100
Benchmark duration (s):                  52.35
Total input tokens:                      33178
Total input text tokens:                 33178
Total input vision tokens:               0
Total generated tokens:                  21251
Total generated tokens (retokenized):    20868
Request throughput (req/s):              1.91
Input token throughput (tok/s):          633.76
Output token throughput (tok/s):         405.93
Peak output token throughput (tok/s):    433.00
Peak concurrent requests:                8
Total token throughput (tok/s):          1039.69
Concurrency:                             1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   523.30
Median E2E Latency (ms):                 389.91
---------------Time to First Token----------------
Mean TTFT (ms):                          33.71
Median TTFT (ms):                        31.79
P99 TTFT (ms):                           108.98
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.31
Median TPOT (ms):                        2.31
P99 TPOT (ms):                           2.39
---------------Inter-Token Latency----------------
Mean ITL (ms):                           2.31
Median ITL (ms):                         2.31
P95 ITL (ms):                            2.35
P99 ITL (ms):                            2.38
Max ITL (ms):                            3.54
==================================================
```
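As a sanity check, the headline throughput figures follow directly from the raw counts reported above (small differences arise because the tool uses the unrounded duration):

```python
# Raw counts from the benchmark result above
duration_s = 52.35
num_requests = 100
input_tokens = 33178
generated_tokens = 21251

request_throughput = num_requests / duration_s                     # ~1.91 req/s
output_throughput = generated_tokens / duration_s                  # ~405.9 tok/s
total_throughput = (input_tokens + generated_tokens) / duration_s  # ~1039.7 tok/s
print(round(request_throughput, 2), round(total_throughput, 1))
```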
```shell
python -m sglang.launch_server \
  --model openai/gpt-oss-120b \
  --tp 8
```

```shell
python3 -m sglang.bench_serving \
  --backend sglang \
  --num-prompts 1000 \
  --max-concurrency 100
```
Test Results:
```
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 100
Successful requests:                     1000
Benchmark duration (s):                  24.76
Total input tokens:                      297156
Total input text tokens:                 297156
Total input vision tokens:               0
Total generated tokens:                  192432
Total generated tokens (retokenized):    187145
Request throughput (req/s):              40.39
Input token throughput (tok/s):          12003.57
Output token throughput (tok/s):         7773.26
Peak output token throughput (tok/s):    13780.00
Peak concurrent requests:                156
Total token throughput (tok/s):          19776.83
Concurrency:                             89.23
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2208.97
Median E2E Latency (ms):                 1591.11
---------------Time to First Token----------------
Mean TTFT (ms):                          102.94
Median TTFT (ms):                        31.53
P99 TTFT (ms):                           674.32
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.31
Median TPOT (ms):                        11.00
P99 TPOT (ms):                           91.28
---------------Inter-Token Latency----------------
Mean ITL (ms):                           11.00
Median ITL (ms):                         5.75
P95 ITL (ms):                            25.35
P99 ITL (ms):                            43.18
Max ITL (ms):                            621.42
==================================================
```
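The reported `Concurrency` figure is consistent with Little's law: average concurrency equals request throughput multiplied by mean time in the system. A quick check using the numbers above:

```python
# Little's law: L = λ · W (concurrency = arrival rate × mean time in system)
request_throughput = 40.39              # req/s, from the benchmark above
mean_e2e_latency_s = 2208.97 / 1000.0   # mean E2E latency, converted to seconds

concurrency = request_throughput * mean_e2e_latency_s
print(round(concurrency, 2))  # ~89.22, matching the reported 89.23
```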
To check model accuracy, run SGLang's built-in few-shot GSM8K test:

```shell
python3 -m sglang.test.few_shot_gsm8k --num-questions 200 --port 8000
```
Results:

| Model        | Accuracy | Invalid | Latency (s) | Output throughput (token/s) |
| ------------ | -------- | ------- | ----------- | --------------------------- |
| GPT-OSS-120b | 0.880    | 0.005   | 5.262       | 12143.675                   |
| GPT-OSS-20b  | 0.535    | 0.165   | 4.157       | 19589.165                   |