# Batch Invariance
!!! note
    Batch invariance is currently in beta. Some features are still under active development. Track progress and planned improvements at https://github.com/vllm-project/vllm/issues/27433.
This document shows how to enable batch invariance in vLLM. Batch invariance ensures that a model's output is deterministic and independent of the batch size or the order of requests in a batch. Without it, floating-point reductions inside kernels can be scheduled differently depending on how requests are batched, so the same prompt can yield slightly different logits, and therefore different sampled tokens, from one run to the next.
Batch invariance is crucial for several use cases, including reproducible research, debugging and regression testing, and on-policy reinforcement learning, where the tokens produced at inference time must exactly match the probabilities the trainer computes.
Batch invariance currently requires NVIDIA GPUs with compute capability 9.0 or higher (for example, Hopper-generation H100/H200 or newer).
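If you are unsure whether your GPU qualifies, a quick check with PyTorch (a sketch, not part of vLLM's API) is:

```python
import torch

# Batch invariance requires compute capability >= 9.0 (Hopper or newer).
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
if (major, minor) < (9, 0):
    print("This GPU does not meet the batch-invariance requirement.")
```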
Batch invariance can be enabled by setting the `VLLM_BATCH_INVARIANT` environment variable to `1`:
```bash
export VLLM_BATCH_INVARIANT=1
```
To start a vLLM server with batch invariance enabled:
```bash
VLLM_BATCH_INVARIANT=1 vllm serve meta-llama/Llama-3.1-8B-Instruct
```
Then use the OpenAI-compatible client:
```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

# These requests will produce deterministic outputs
# regardless of batch size or order.
response = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt="The future of AI is",
    max_tokens=100,
    temperature=0.7,
    seed=42,
)
print(response.choices[0].text)
```
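To see the invariance property in action, one illustrative check (assuming the server above is running; the filler prompts and helper function are our own, not part of the docs) is to issue the same request alone and then concurrently with other traffic, so the server batches it with different neighbors, and compare the results:

```python
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

def complete(prompt: str) -> str:
    return client.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        prompt=prompt,
        max_tokens=50,
        temperature=0.7,
        seed=42,
    ).choices[0].text

# First, run the target prompt with no other traffic.
alone = complete("The future of AI is")

# Then run it again while filler requests are in flight, so the
# server batches it together with different neighbors.
with ThreadPoolExecutor(max_workers=4) as pool:
    fillers = [pool.submit(complete, f"Filler prompt {i}") for i in range(3)]
    batched = pool.submit(complete, "The future of AI is").result()
    for f in fillers:
        f.result()

assert alone == batched, "Outputs differ across batch compositions"
```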
For offline batch inference with batch invariance:
```python
import os

# Must be set before vllm is imported.
os.environ["VLLM_BATCH_INVARIANT"] = "1"

from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
    "Machine learning enables",
    "Deep learning models can",
]

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=100,
    seed=42,
)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
)

# Outputs will be deterministic regardless of batch size.
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}")
    print(f"Generated: {generated_text!r}\n")
```
Batch invariance has been tested and verified on the following models:
- `deepseek-ai/DeepSeek-V3`, `deepseek-ai/DeepSeek-V3-0324`, `deepseek-ai/DeepSeek-R1`, `deepseek-ai/DeepSeek-V3.1`
- `Qwen/Qwen3-1.7B`, `Qwen/Qwen3-8B`, `Qwen/Qwen3-4B-AWQ`, `Qwen/Qwen3-8B-AWQ`
- `Qwen/Qwen3-30B-A3B`, `Qwen/Qwen3-Next-80B-A3B-Instruct`
- `Qwen/Qwen2.5-0.5B-Instruct`, `Qwen/Qwen2.5-1.5B-Instruct`, `Qwen/Qwen2.5-3B-Instruct`, `Qwen/Qwen2.5-7B-Instruct`, `Qwen/Qwen2.5-14B-Instruct`, `Qwen/Qwen2.5-32B-Instruct`
- `meta-llama/Llama-3.1-8B-Instruct`, `meta-llama/Llama-3.2-1B-Instruct`
- `openai/gpt-oss-20b`, `openai/gpt-oss-120b`
- `mistralai/Mistral-7B-v0.3`

Other models may also work, but these have been explicitly validated. If you encounter issues with a specific model, please report them on the GitHub issue tracker.
When batch invariance is enabled, vLLM selects batch-invariant kernel implementations and disables optimizations whose floating-point reduction order depends on batch composition, so that each request's computation is identical regardless of what it is batched with.
!!! note
    Enabling batch invariance may reduce performance compared to the default non-deterministic mode. This trade-off is intentional to guarantee reproducibility.
The batch invariance feature is under active development. For the latest status, the list of planned improvements, and to contribute ideas, see the tracking issue at https://github.com/vllm-project/vllm/issues/27433.