# Hy3-preview
Hy3-preview is a large-scale language model (295B total parameters, 21B active) from the Tencent Hunyuan team. SGLang supports serving Hy3-preview; this guide describes how to run it with native BF16. You can either pull the prebuilt Docker image or install SGLang from source:
```bash
docker pull lmsysorg/sglang:hy3-preview
```
```bash
# Install SGLang from source
git clone https://github.com/sgl-project/sglang
cd sglang
pip3 install --upgrade pip
pip3 install "transformers>=5.6.0"
pip3 install -e "python"
```
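To confirm the source install worked, a quick import check is enough (the printed version depends on the checkout you installed):

```python
# Quick sanity check that SGLang imports cleanly after installation.
import sglang

print(sglang.__version__)
```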
Launch the server to serve Hy3-preview across 8 GPUs with tensor parallelism. Note that on 8x H20 GPUs (96 GB each), the BF16 weights barely fit, so only small batch sizes or short requests are feasible; use larger-memory GPUs such as the H20-3e when possible.
```bash
python3 -m sglang.launch_server \
  --model tencent/Hy3-preview \
  --tp 8 \
  --tool-call-parser hunyuan \
  --reasoning-parser hunyuan \
  --served-model-name hy3-preview
```
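Loading 295B parameters in BF16 can take several minutes, so it is worth waiting for the server to become healthy before sending traffic. A minimal sketch, assuming the default port 30000 and SGLang's `/health` endpoint:

```python
import time

import requests  # assumes the `requests` package is installed

# Poll the server's /health endpoint until it responds, or give up
# after ~20 minutes (120 attempts x 10 s).
url = "http://localhost:30000/health"
for _ in range(120):
    try:
        if requests.get(url, timeout=5).status_code == 200:
            print("Server is ready.")
            break
    except requests.exceptions.RequestException:
        pass
    time.sleep(10)
else:
    raise RuntimeError("Server did not become healthy in time.")
```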
SGLang also supports EAGLE speculative decoding for Hy3-preview. Add `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk`, and `--speculative-num-draft-tokens` to the launch command to enable it. For example:
```bash
python3 -m sglang.launch_server \
  --model tencent/Hy3-preview \
  --tp 8 \
  --tool-call-parser hunyuan \
  --reasoning-parser hunyuan \
  --speculative-num-steps 1 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 2 \
  --speculative-algorithm EAGLE \
  --served-model-name hy3-preview
```
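To check whether speculative decoding helps for your workload, you can time a long generation against each server configuration and compare decode throughput. A rough sketch using the OpenAI client (the prompt here is a made-up example; the port and model name follow the launch command above):

```python
import time

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:30000/v1")

# Time a single long generation and report decode throughput.
start = time.perf_counter()
resp = client.chat.completions.create(
    model="hy3-preview",
    messages=[{"role": "user", "content": "Write a short story about a robot."}],
    max_tokens=1024,
    temperature=1,
)
elapsed = time.perf_counter() - start
tokens = resp.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")
```

Run this once with and once without the speculative flags to see the speedup on your hardware.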
First, install the OpenAI Python client:
```bash
uv pip install -U openai
```
The following example sends a request with thinking mode disabled (the default) and then one with thinking mode enabled, so you can verify thinking-mode responses:
```python
from openai import OpenAI

# If running SGLang locally with its default OpenAI-compatible port:
# http://localhost:30000/v1
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:30000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello."},
]

# Thinking mode is disabled by default (no need to pass chat_template_kwargs).
resp = client.chat.completions.create(
    model="hy3-preview",
    messages=messages,
    temperature=1,
    max_tokens=4096,
)
print(resp.choices[0].message.content)

# Thinking mode is enabled only if 'reasoning_effort' and 'interleaved_thinking'
# are set in 'chat_template_kwargs'.
# 'reasoning_effort' supports: 'high', 'low', 'no_think'.
resp_think = client.chat.completions.create(
    model="hy3-preview",
    messages=messages,
    temperature=1,
    max_tokens=4096,
    extra_body={
        "chat_template_kwargs": {
            "reasoning_effort": "high",
            "interleaved_thinking": True,
        },
    },
)
output_msg = resp_think.choices[0].message
# Thinking content, extracted by the reasoning parser
print(output_msg.reasoning_content)
# Response content
print(output_msg.content)
```
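Since the server is launched with `--tool-call-parser hunyuan`, the same API can exercise function calling. A minimal sketch reusing the `client` from the example above; the `get_weather` tool is a hypothetical example for illustration, not part of the model or SGLang:

```python
# Hypothetical tool definition for illustration only.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

resp_tool = client.chat.completions.create(
    model="hy3-preview",
    messages=[{"role": "user", "content": "What is the weather in Shenzhen?"}],
    tools=tools,
    temperature=1,
    max_tokens=1024,
)
# If the model decides to call the tool, the parsed call appears here.
print(resp_tool.choices[0].message.tool_calls)
```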
You can also query the server directly with `curl`:

```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hy3-preview",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello."}
    ],
    "temperature": 1,
    "max_tokens": 4096
  }'
```
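Streaming works through the same endpoint. A sketch reusing the Python `client` and `messages` from above; whether thinking tokens arrive in a separate `reasoning_content` delta field may depend on the SGLang version, so the sketch treats that field as optional:

```python
# Reuses `client` and `messages` from the Python example above.
stream = client.chat.completions.create(
    model="hy3-preview",
    messages=messages,
    temperature=1,
    max_tokens=4096,
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # reasoning_content is only present when the reasoning parser emits it.
    reasoning = getattr(delta, "reasoning_content", None)
    if reasoning:
        print(reasoning, end="", flush=True)
    if delta.content:
        print(delta.content, end="", flush=True)
print()
```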
For benchmarking, disable prefix caching by adding `--disable-radix-cache` to the server command. The following example runs the benchmark on 8 H20 GPUs with 96 GB of memory each.
```bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --flush-cache \
  --dataset-name random \
  --random-range-ratio 1.0 \
  --random-input-len 4096 \
  --random-output-len 4096 \
  --num-prompts 5 \
  --max-concurrency 1 \
  --output-file hy3_preview_h20.jsonl \
  --model tencent/Hy3-preview \
  --served-model-name hy3-preview
```
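The `--output-file` flag appends one JSON record per run, which makes it easy to compare configurations across runs. A minimal sketch that reads the file back; the exact field names are produced by `sglang.bench_serving` and may vary between versions, so the sketch only prints keys it finds:

```python
import json

# Each line in the output file is one benchmark run.
with open("hy3_preview_h20.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # Print a few common fields if present; keys vary by version.
        for key in (
            "request_throughput",
            "output_throughput",
            "mean_ttft_ms",
            "mean_tpot_ms",
        ):
            if key in record:
                print(f"{key}: {record[key]}")
```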
If successful, you will see output similar to the following:
```
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 5
Benchmark duration (s): 176.41
Total input tokens: 20480
Total input text tokens: 20480
Total generated tokens: 20480
Total generated tokens (retokenized): 20480
Request throughput (req/s): 0.03
Input token throughput (tok/s): 116.09
Output token throughput (tok/s): 116.09
Peak output token throughput (tok/s): 118.00
Peak concurrent requests: 2
Total token throughput (tok/s): 232.19
Concurrency: 1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 35279.06
Median E2E Latency (ms): 35275.60
P90 E2E Latency (ms): 35294.13
P99 E2E Latency (ms): 35294.41
---------------Time to First Token----------------
Mean TTFT (ms): 355.93
Median TTFT (ms): 309.28
P99 TTFT (ms): 518.36
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 8.53
Median TPOT (ms): 8.54
P99 TPOT (ms): 8.54
---------------Inter-Token Latency----------------
Mean ITL (ms): 8.53
Median ITL (ms): 8.54
P95 ITL (ms): 8.62
P99 ITL (ms): 8.74
Max ITL (ms): 31.70
==================================================
```
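As a sanity check on these numbers: at a concurrency of 1, output token throughput should be roughly the inverse of the TPOT, and indeed 1000 ms / 8.53 ms ≈ 117 tok/s, consistent with the reported 116.09 tok/s once TTFT is accounted for.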