docs/advanced_features/deterministic_inference.md
Deterministic inference ensures consistent LLM outputs across runs, which is critical for reproducible experiments, debugging, and reinforcement learning workflows such as GRPO.
Even with temperature=0, standard LLM inference can produce different outputs due to dynamic batching and varying reduction orders in GPU kernels.
The main source of nondeterminism is varying batch sizes. Different batch sizes cause GPU kernels to split reduction operations differently, changing the order in which values are added. Because floating-point addition is non-associative ((a + b) + c ≠ a + (b + c)), this produces different results even for identical inputs.
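The non-associativity is easy to reproduce in plain Python; in this sketch, the grouping of three additions changes the result, just as different reduction orders do inside GPU kernels:

```python
# Floating-point addition is non-associative: the grouping of
# operations changes the result for the same inputs.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # the large terms cancel first, then 1.0 is added
right = a + (b + c)  # 1.0 is rounded away when absorbed into -1e16

print(left)           # 1.0
print(right)          # 0.0
print(left == right)  # False
```

This is why a kernel that splits a reduction differently for different batch sizes can return different sums for identical inputs, and why batch-invariant operators are needed for determinism.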
Building on Thinking Machines Lab's batch-invariant operators, SGLang achieves fully deterministic inference while maintaining compatibility with chunked prefill, CUDA graphs, radix cache, and non-greedy sampling. The development roadmap for deterministic inference features can be found in this issue.
Deterministic inference is only supported with the following three attention backends: FlashInfer, FlashAttention 3 (FA3), and Triton.
The following table shows feature compatibility for deterministic inference across different attention backends:
| Attention Backend | CUDA Graph | Chunked Prefill | Radix Cache | Non-greedy Sampling (Temp > 0) |
|---|---|---|---|---|
| FlashInfer | ✅ Yes | ✅ Yes | ❌ No | ✅ Yes |
| FlashAttention 3 (FA3) | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
| Triton | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
Enable deterministic inference by adding the --enable-deterministic-inference flag:
```bash
python3 -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --attention-backend fa3 \
    --enable-deterministic-inference
```
| Argument | Type/Default | Description |
|---|---|---|
| `--enable-deterministic-inference` | flag; default: disabled | Enable deterministic inference with batch-invariant operations |
| `--attention-backend` | string; default: fa3 | Choose attention backend (`flashinfer`, `fa3`, or `triton`) |
```bash
python3 -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --attention-backend flashinfer \
    --enable-deterministic-inference
```

```bash
python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --attention-backend fa3 \
    --enable-deterministic-inference
```

```bash
python3 -m sglang.launch_server \
    --model-path Qwen/Qwen3-30B-A3B \
    --attention-backend fa3 \
    --enable-deterministic-inference
```
SGLang supports deterministic inference even with non-greedy sampling by using sampling seeds. This is particularly useful for reinforcement learning scenarios like GRPO (Group Relative Policy Optimization) where you need multiple diverse but reproducible responses.
By default, SGLang uses a sampling seed of 42 for reproducible sampling:
```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Tell me a joke",
        "sampling_params": {
            "temperature": 0.8,  # Non-greedy sampling
            "max_new_tokens": 128,
        },
    },
)
print(response.json())
# This will always produce the same response across runs
```
To sample different responses from the same prompt while maintaining reproducibility (e.g., for GRPO training), provide different sampling seeds in your requests:
```python
import requests

# Prepare a list of sampling seeds for different responses
sampling_seeds = [42, 43, 44, 45, 46]

responses = []
for seed in sampling_seeds:
    response = requests.post(
        "http://localhost:30000/generate",
        json={
            "text": "Tell me a joke",
            "sampling_params": {
                "temperature": 0.8,
                "max_new_tokens": 128,
                "sampling_seed": seed,  # Specify sampling seed
            },
        },
    )
    responses.append(response.json())

# Each seed will produce a different but reproducible response
# Using the same seed will always produce the same response
```
This approach ensures that each seed yields a distinct completion for the same prompt, while re-running a request with the same seed always reproduces the same output.
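For a GRPO-style training loop, the per-seed requests can be built programmatically before being sent to the server. The sketch below only constructs the request payloads; the helper name and seed range are illustrative, not part of the SGLang API:

```python
def build_grpo_requests(prompt, base_seed=42, num_samples=5,
                        temperature=0.8, max_new_tokens=128):
    """Build one /generate payload per sampling seed.

    Under deterministic inference, identical payloads (same prompt
    and seed) always yield the same completion, while different
    seeds yield diverse completions from the same prompt.
    """
    return [
        {
            "text": prompt,
            "sampling_params": {
                "temperature": temperature,
                "max_new_tokens": max_new_tokens,
                "sampling_seed": base_seed + i,  # one distinct seed per sample
            },
        }
        for i in range(num_samples)
    ]

payloads = build_grpo_requests("Tell me a joke")
print(len(payloads))                                    # 5
print(payloads[0]["sampling_params"]["sampling_seed"])  # 42
```

Each payload can then be POSTed to `/generate` as in the example above.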
Run deterministic tests to verify consistent outputs:
```bash
# Single test: same prompt, varying batch sizes
python3 -m sglang.test.test_deterministic --test-mode single --n-trials 50

# Prefix test: prompts with different prefix lengths
python3 -m sglang.test.test_deterministic --test-mode prefix --n-trials 50

# Radix cache consistency test: radix cache determinism (cached vs. uncached prefill)
python3 -m sglang.test.test_deterministic --test-mode radix_cache
```

Expected result: every test should report `Unique samples: 1` (perfectly deterministic).
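The same "unique samples" check can also be done by hand against a running server. This sketch counts distinct completions in a list of response texts (the sample outputs here are hypothetical stand-ins for real server responses):

```python
def count_unique_samples(responses):
    """Return the number of distinct completions in a list of texts.

    With deterministic inference enabled, repeating an identical
    request should produce exactly one unique sample.
    """
    return len(set(responses))

# Hypothetical outputs from three identical requests to a deterministic server
outputs = ["Why did the GPU cross the road?"] * 3
print(count_unique_samples(outputs))  # 1 if the server is deterministic
```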