docs/advanced_features/deterministic_inference.md
Deterministic inference ensures consistent LLM outputs across runs, which is critical for reproducible experiments, debugging, and reinforcement learning workflows such as GRPO.
Even with temperature=0, standard LLM inference can produce different outputs due to dynamic batching and varying reduction orders in GPU kernels.
The main source of nondeterminism is varying batch sizes. Different batch sizes cause GPU kernels to split reduction operations differently, changing the order in which values are added. Because floating-point addition is non-associative ((a + b) + c ≠ a + (b + c)), this produces different results even for identical inputs.
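The non-associativity is easy to reproduce in plain Python; in this sketch, the grouping of three additions changes the result, just as different reduction orders do inside GPU kernels:

```python
# Floating-point addition is non-associative: the grouping of
# operations changes the result for the same inputs.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # the large terms cancel first, then 1.0 is added
right = a + (b + c)  # 1.0 is rounded away when absorbed into -1e16

print(left)           # 1.0
print(right)          # 0.0
print(left == right)  # False
```

This is why a kernel that splits a reduction differently for different batch sizes can return different sums for identical inputs, and why batch-invariant operators are needed for determinism.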
Building on Thinking Machines Lab's batch-invariant operators, SGLang achieves fully deterministic inference while maintaining compatibility with chunked prefill, CUDA graphs, radix cache, and non-greedy sampling. The development roadmap for deterministic inference features can be found in this issue.
Deterministic inference is only supported with the following three attention backends: FlashInfer, FlashAttention 3 (FA3), and Triton.
The following table shows feature compatibility for deterministic inference across different attention backends:
| Attention Backend | CUDA Graph | Chunked Prefill | Radix Cache | Non-greedy Sampling (Temp > 0) |
|---|---|---|---|---|
| FlashInfer | ✅ Yes | ✅ Yes | ❌ No | ✅ Yes |
| FlashAttention 3 (FA3) | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
| Triton | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
Enable deterministic inference by adding the --enable-deterministic-inference flag:
```bash
python3 -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --attention-backend fa3 \
    --enable-deterministic-inference
```
| Argument | Type/Default | Description |
|---|---|---|
| `--enable-deterministic-inference` | flag; default: disabled | Enable deterministic inference with batch-invariant operations |
| `--attention-backend` | string; default: fa3 | Choose attention backend (`flashinfer`, `fa3`, or `triton`) |
```bash
python3 -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --attention-backend flashinfer \
    --enable-deterministic-inference
```

```bash
python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --attention-backend fa3 \
    --enable-deterministic-inference
```

```bash
python3 -m sglang.launch_server \
    --model-path Qwen/Qwen3-30B-A3B \
    --attention-backend fa3 \
    --enable-deterministic-inference
```
SGLang supports deterministic inference even with non-greedy sampling by using sampling seeds. This is particularly useful for reinforcement learning scenarios like GRPO (Group Relative Policy Optimization) where you need multiple diverse but reproducible responses.
By default, SGLang uses a sampling seed of 42 for reproducible sampling:
```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Tell me a joke",
        "sampling_params": {
            "temperature": 0.8,  # Non-greedy sampling
            "max_new_tokens": 128,
        },
    },
)
print(response.json())
# This will always produce the same response across runs
```
To sample different responses from the same prompt while maintaining reproducibility (e.g., for GRPO training), provide different sampling seeds in your requests:
```python
import requests

# Prepare a list of sampling seeds for different responses
sampling_seeds = [42, 43, 44, 45, 46]

responses = []
for seed in sampling_seeds:
    response = requests.post(
        "http://localhost:30000/generate",
        json={
            "text": "Tell me a joke",
            "sampling_params": {
                "temperature": 0.8,
                "max_new_tokens": 128,
                "sampling_seed": seed,  # Specify sampling seed
            },
        },
    )
    responses.append(response.json())

# Each seed will produce a different but reproducible response
# Using the same seed will always produce the same response
```
This approach ensures that each seed yields a distinct completion for the same prompt, while re-running a request with the same seed always reproduces the same output.
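For a GRPO-style training loop, the per-seed requests can be built programmatically before being sent to the server. The sketch below only constructs the request payloads; the helper name and seed range are illustrative, not part of the SGLang API:

```python
def build_grpo_requests(prompt, base_seed=42, num_samples=5,
                        temperature=0.8, max_new_tokens=128):
    """Build one /generate payload per sampling seed.

    Under deterministic inference, identical payloads (same prompt
    and seed) always yield the same completion, while different
    seeds yield diverse completions from the same prompt.
    """
    return [
        {
            "text": prompt,
            "sampling_params": {
                "temperature": temperature,
                "max_new_tokens": max_new_tokens,
                "sampling_seed": base_seed + i,  # one distinct seed per sample
            },
        }
        for i in range(num_samples)
    ]

payloads = build_grpo_requests("Tell me a joke")
print(len(payloads))                                    # 5
print(payloads[0]["sampling_params"]["sampling_seed"])  # 42
```

Each payload can then be POSTed to `/generate` as in the example above.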
Run deterministic tests to verify consistent outputs:
```bash
# Single test: same prompt, varying batch sizes
python3 -m sglang.test.test_deterministic --test-mode single --n-trials 50

# Prefix test: prompts with different prefix lengths
python3 -m sglang.test.test_deterministic --test-mode prefix --n-trials 50

# Radix cache consistency test: radix cache determinism (cached vs. uncached prefill)
python3 -m sglang.test.test_deterministic --test-mode radix_cache
```

Expected result: every test should report `Unique samples: 1` (perfectly deterministic).
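The same "unique samples" check can also be done by hand against a running server. This sketch counts distinct completions in a list of response texts (the sample outputs here are hypothetical stand-ins for real server responses):

```python
def count_unique_samples(responses):
    """Return the number of distinct completions in a list of texts.

    With deterministic inference enabled, repeating an identical
    request should produce exactly one unique sample.
    """
    return len(set(responses))

# Hypothetical outputs from three identical requests to a deterministic server
outputs = ["Why did the GPU cross the road?"] * 3
print(count_unique_samples(outputs))  # 1 if the server is deterministic
```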