# Batch Invariance
!!! note
    Batch invariance is currently in beta. Some features are still under active development. Track progress and planned improvements at https://github.com/vllm-project/vllm/issues/27433.
This document shows how to enable batch invariance in vLLM. Batch invariance ensures that a model's output is deterministic and independent of the batch size or the order of requests in a batch. Without it, floating-point reductions inside kernels can be scheduled differently depending on how requests are batched, so the same prompt can yield slightly different logits, and therefore different sampled tokens, from one run to the next.
Batch invariance is crucial for several use cases, including reproducible research, debugging and regression testing, and on-policy reinforcement learning, where the tokens produced at inference time must exactly match the probabilities the trainer computes.
Batch invariance currently requires NVIDIA GPUs with compute capability 9.0 or higher (for example, Hopper-generation H100/H200 or newer).
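If you are unsure whether your GPU qualifies, a quick check with PyTorch (a sketch, not part of vLLM's API) is:

```python
import torch

# Batch invariance requires compute capability >= 9.0 (Hopper or newer).
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
if (major, minor) < (9, 0):
    print("This GPU does not meet the batch-invariance requirement.")
```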
Batch invariance can be enabled by setting the `VLLM_BATCH_INVARIANT` environment variable to `1`:
```bash
export VLLM_BATCH_INVARIANT=1
```
To start a vLLM server with batch invariance enabled:
```bash
VLLM_BATCH_INVARIANT=1 vllm serve meta-llama/Llama-3.1-8B-Instruct
```
Then use the OpenAI-compatible client:
```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

# These requests will produce deterministic outputs
# regardless of batch size or order.
response = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt="The future of AI is",
    max_tokens=100,
    temperature=0.7,
    seed=42,
)
print(response.choices[0].text)
```
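To see the invariance property in action, one illustrative check (assuming the server above is running; the filler prompts and helper function are our own, not part of the docs) is to issue the same request alone and then concurrently with other traffic, so the server batches it with different neighbors, and compare the results:

```python
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

def complete(prompt: str) -> str:
    return client.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        prompt=prompt,
        max_tokens=50,
        temperature=0.7,
        seed=42,
    ).choices[0].text

# First, run the target prompt with no other traffic.
alone = complete("The future of AI is")

# Then run it again while filler requests are in flight, so the
# server batches it together with different neighbors.
with ThreadPoolExecutor(max_workers=4) as pool:
    fillers = [pool.submit(complete, f"Filler prompt {i}") for i in range(3)]
    batched = pool.submit(complete, "The future of AI is").result()
    for f in fillers:
        f.result()

assert alone == batched, "Outputs differ across batch compositions"
```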
For offline batch inference with batch invariance:
```python
import os

# Must be set before vllm is imported.
os.environ["VLLM_BATCH_INVARIANT"] = "1"

from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
    "Machine learning enables",
    "Deep learning models can",
]

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=100,
    seed=42,
)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
)

# Outputs will be deterministic regardless of batch size.
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}")
    print(f"Generated: {generated_text!r}\n")
```
Batch invariance has been tested and verified on the following models:
- `deepseek-ai/DeepSeek-V3`, `deepseek-ai/DeepSeek-V3-0324`, `deepseek-ai/DeepSeek-R1`, `deepseek-ai/DeepSeek-V3.1`
- `Qwen/Qwen3-1.7B`, `Qwen/Qwen3-8B`, `Qwen/Qwen3-4B-AWQ`, `Qwen/Qwen3-8B-AWQ`
- `Qwen/Qwen3-30B-A3B`, `Qwen/Qwen3-Next-80B-A3B-Instruct`
- `Qwen/Qwen2.5-0.5B-Instruct`, `Qwen/Qwen2.5-1.5B-Instruct`, `Qwen/Qwen2.5-3B-Instruct`, `Qwen/Qwen2.5-7B-Instruct`, `Qwen/Qwen2.5-14B-Instruct`, `Qwen/Qwen2.5-32B-Instruct`
- `meta-llama/Llama-3.1-8B-Instruct`, `meta-llama/Llama-3.2-1B-Instruct`
- `openai/gpt-oss-20b`, `openai/gpt-oss-120b`
- `mistralai/Mistral-7B-v0.3`

Other models may also work, but these have been explicitly validated. If you encounter issues with a specific model, please report them on the GitHub issue tracker.
When batch invariance is enabled, vLLM selects batch-invariant kernel implementations and disables optimizations whose floating-point reduction order depends on batch composition, so that each request's computation is identical regardless of what it is batched with.
!!! note
    Enabling batch invariance may reduce performance compared to the default non-deterministic mode. This trade-off is intentional to guarantee reproducibility.
The batch invariance feature is under active development. For the latest status, the list of planned improvements, and to contribute ideas, see the tracking issue at https://github.com/vllm-project/vllm/issues/27433.