docs/source/en/community_integrations/vllm.md
vLLM is a high-throughput inference engine for serving LLMs at scale. It continuously batches requests and keeps KV cache memory compact with PagedAttention.
Set `model_impl="transformers"` to load a model with the Transformers modeling backend.
```python
from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.2-1B", model_impl="transformers")
print(llm.generate(["The capital of France is"]))
```
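To control decoding and read back the generated text, you can pass a `SamplingParams` object and iterate over the returned outputs. This is a minimal sketch; the sampling values are illustrative.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-1B", model_impl="transformers")
# Illustrative sampling settings; tune these for your workload.
sampling = SamplingParams(temperature=0.8, max_tokens=32)

outputs = llm.generate(["The capital of France is"], sampling)
for output in outputs:
    # Each RequestOutput holds one or more completions; print the first one.
    print(output.outputs[0].text)
```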
Pass `--model-impl transformers` to the `vllm serve` command for online serving.
```bash
vllm serve meta-llama/Llama-3.2-1B \
    --task generate \
    --model-impl transformers
```
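The server exposes an OpenAI-compatible API. As a sketch, assuming the `openai` Python client is installed and the server is running on the default port 8000, you can query it like this.

```python
from openai import OpenAI

# The local vLLM server does not validate API keys; any placeholder string works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Llama-3.2-1B",
    prompt="The capital of France is",
    max_tokens=32,
)
print(completion.choices[0].text)
```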
When vLLM loads a model, the following happens under the hood.

1. [`~AutoConfig.from_pretrained`] loads the model's `config.json` from the Hub or your Hugging Face cache. vLLM checks the `architectures` field against its internal model registry to determine which vLLM model class to use.
2. Setting `model_impl="transformers"` bypasses the vLLM model registry, and [`~AutoModel.from_config`] loads the Transformers model implementation instead. vLLM replaces most model modules (MoE, attention, linear layers) with its own optimized versions while keeping the Transformers model structure.
3. [`~AutoTokenizer.from_pretrained`] loads the tokenizer files. vLLM caches some tokenizer internals to reduce overhead during inference.
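The sketch below walks through the same steps with the Transformers Auto classes directly. It is not vLLM's actual loading code, and the model name is only illustrative.

```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

# Step 1: fetch config.json; vLLM checks config.architectures against its registry.
config = AutoConfig.from_pretrained("meta-llama/Llama-3.2-1B")
print(config.architectures)  # e.g. ["LlamaForCausalLM"]

# Step 2: with model_impl="transformers", the Transformers implementation is
# instantiated from the config instead of a vLLM-native model class.
model = AutoModel.from_config(config)

# Step 3: load the tokenizer files.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
```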