.agents/vllm-backend.md
The vLLM backend lives at backend/python/vllm/backend.py (async gRPC) and the multimodal variant at backend/python/vllm-omni/backend.py (sync gRPC). Both wrap vLLM's AsyncLLMEngine / Omni, translate the LocalAI gRPC PredictOptions into vLLM SamplingParams, and translate vLLM outputs into Reply.chat_deltas.
This file captures the non-obvious bits — most of the bring-up was a single PR (feat/vllm-parity) and the things below are easy to get wrong.
Do not write regex-based tool-call extractors for vLLM. vLLM ships:
- vllm.tool_parsers.ToolParserManager: 50+ registered parsers (hermes, llama3_json, llama4_pythonic, mistral, qwen3_xml, deepseek_v3, granite4, openai, kimi_k2, glm45, …)
- vllm.reasoning.ReasoningParserManager: 25+ registered parsers (deepseek_r1, qwen3, mistral, gemma4, …)

Both can be used standalone: instantiate with a tokenizer, call extract_tool_calls(text, request=None) / extract_reasoning(text, request=None). The backend stores the parser classes on self.tool_parser_cls / self.reasoning_parser_cls at LoadModel time and instantiates them per request.
Selection: vLLM does not auto-detect parsers from model name — neither does the LocalAI backend. The user (or core/config/hooks_vllm.go) must pick one and pass it via Options[]:
options:
- tool_parser:hermes
- reasoning_parser:qwen3
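Each entry above is a single key:value string by the time it reaches the Python backend. A minimal sketch of splitting such entries follows; the helper name parse_options is hypothetical and the backend's actual parsing code may differ.

```python
def parse_options(options):
    """Split "key:value" strings into a dict; later entries win.

    Hypothetical helper for illustration, not the backend's actual code.
    """
    parsed = {}
    for opt in options:
        # Split on the first colon only, so values may contain colons.
        key, sep, value = opt.partition(":")
        if not sep:
            continue  # skip entries that are not "key:value" shaped
        parsed[key.strip()] = value.strip()
    return parsed


opts = parse_options(["tool_parser:hermes", "reasoning_parser:qwen3"])
print(opts.get("tool_parser"), opts.get("reasoning_parser"))
```

The resulting names would then be looked up in the corresponding parser manager at LoadModel time.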
Auto-defaults for known model families live in core/config/parser_defaults.json and are applied:
- core/gallery/importers/vllm.go
- vllm / vllm-omni backend hook in core/config/hooks_vllm.go

User-supplied tool_parser:/reasoning_parser: in the config wins over defaults: the hook checks for existing entries before appending.
When to update parser_defaults.json: any time vLLM ships a new tool or reasoning parser, or you onboard a new model family that LocalAI users will pull from HuggingFace. The file is keyed by family pattern matched against normalizeModelID(cfg.Model) (lowercase, org-prefix stripped, _→-). Patterns are checked longest-first — keep qwen3.5 before qwen3, llama-3.3 before llama-3, etc., or the wrong family wins. Add a covering test in core/config/hooks_test.go.
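The ordering rule is easier to see in code. The sketch below models the lookup in Python purely for illustration (the real logic is Go, in core/config). It assumes that patterns match as substrings of the normalized ID and that the org prefix is everything up to the last "/"; both are assumptions, not confirmed behaviour.

```python
def normalize_model_id(model):
    """Lowercase, strip the org prefix, and map '_' to '-'."""
    name = model.lower()
    name = name.rsplit("/", 1)[-1]  # assumed: org prefix ends at the last "/"
    return name.replace("_", "-")


def match_family(model, patterns):
    """Return the first pattern found in the normalized model ID.

    Patterns are tried in the order given, so the caller must list
    longer (more specific) patterns first. Substring matching is an
    assumption made for this sketch.
    """
    model_id = normalize_model_id(model)
    for pattern in patterns:
        if pattern in model_id:
            return pattern
    return None


model = "Qwen/Qwen3.5-7B-Instruct"
print(match_family(model, ["qwen3.5", "qwen3"]))  # specific pattern listed first
print(match_family(model, ["qwen3", "qwen3.5"]))  # wrong order: shorter pattern shadows it
```

With the shorter pattern listed first, qwen3 matches every qwen3.5 model too, which is exactly the "wrong family wins" failure described above.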
Sister file — core/config/inference_defaults.json: same pattern but for sampling parameters (temperature, top_p, top_k, min_p, repeat_penalty, presence_penalty). Loaded by core/config/inference_defaults.go and applied by ApplyInferenceDefaults(). The schema is map[string]float64 only — strings don't fit, which is why parser defaults needed their own JSON file. The inference file is auto-generated from unsloth via go generate ./core/config/ (see core/config/gen_inference_defaults/) — don't hand-edit it; instead update the upstream source or regenerate. Both files share normalizeModelID() and the longest-first pattern ordering.
Constructor compatibility gotcha: the abstract ToolParser.__init__ accepts tools=, but several concrete parsers (Hermes2ProToolParser, etc.) override __init__ and only accept tokenizer. Always:
try:
    tp = self.tool_parser_cls(self.tokenizer, tools=tools)
except TypeError:
    tp = self.tool_parser_cls(self.tokenizer)
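The fallback can be exercised without vLLM installed. In the sketch below, AcceptsTools and TokenizerOnly are stand-in classes that only mimic the two constructor shapes, and build_tool_parser is a hypothetical helper name.

```python
class FakeTokenizer:
    """Stand-in tokenizer; exists only to make the sketch runnable."""


class AcceptsTools:
    """Mimics a parser whose __init__ accepts tools=."""

    def __init__(self, tokenizer, tools=None):
        self.tokenizer = tokenizer
        self.tools = tools


class TokenizerOnly:
    """Mimics a parser that overrides __init__ and takes only a tokenizer."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer


def build_tool_parser(parser_cls, tokenizer, tools):
    """Instantiate a parser per request, retrying without tools= on TypeError."""
    try:
        return parser_cls(tokenizer, tools=tools)
    except TypeError:
        return parser_cls(tokenizer)


tools = [{"type": "function", "function": {"name": "get_weather"}}]
for cls in (AcceptsTools, TokenizerOnly):
    parser = build_tool_parser(cls, FakeTokenizer(), tools)
    print(type(parser).__name__, hasattr(parser, "tools"))
```

One caveat of this pattern: a TypeError raised inside a constructor for an unrelated reason also triggers the retry, so the second call may surface the real error instead.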
The Go side (core/backend/llm.go, pkg/functions/chat_deltas.go) consumes Reply.chat_deltas to assemble the OpenAI response. For tool calls to surface in chat/completions, the Python backend must populate Reply.chat_deltas[].tool_calls with ToolCallDelta{index, id, name, arguments}. Returning the raw <tool_call>...</tool_call> text in Reply.message is not enough — the Go regex fallback exists for llama.cpp, not for vllm.
Same story for reasoning_content — emit it on ChatDelta.reasoning_content, not as part of content.
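As a rough illustration of the shape the Go side expects, the sketch below uses dataclasses as stand-ins for the generated proto messages. The field names follow the text above; the input tuple shape and the JSON-encoded arguments string are assumptions.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ToolCallDelta:  # stand-in for the generated proto message
    index: int
    id: str
    name: str
    arguments: str  # assumed to be a JSON-encoded string


@dataclass
class ChatDelta:  # stand-in for the generated proto message
    content: str = ""
    reasoning_content: str = ""
    tool_calls: list = field(default_factory=list)


def build_chat_delta(content, reasoning, calls):
    """Build one delta from parser output.

    `calls` is a list of (id, name, arguments_dict) tuples; that shape is
    an assumption made for this sketch.
    """
    delta = ChatDelta(content=content or "", reasoning_content=reasoning or "")
    for index, (call_id, name, arguments) in enumerate(calls):
        delta.tool_calls.append(
            ToolCallDelta(index=index, id=call_id, name=name,
                          arguments=json.dumps(arguments))
        )
    return delta


delta = build_chat_delta("", "User wants the weather.",
                         [("call_0", "get_weather", {"city": "Paris"})])
print(delta.tool_calls[0].name, delta.tool_calls[0].arguments)
```

Reasoning text and tool calls stay in their own fields here, and content is left empty rather than carrying the raw tagged text.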
tokenizer.apply_chat_template() expects a list of dicts, not proto Messages. The shared helper in backend/python/common/vllm_utils.py (messages_to_dicts) handles the mapping including:
- tool_call_id and name for role="tool" messages
- tool_calls JSON-string field → parsed Python list for role="assistant"
- reasoning_content for thinking models

Pass tools=json.loads(request.Tools) and (when request.Metadata.get("enable_thinking") == "true") enable_thinking=True to apply_chat_template. Wrap in try/except TypeError because not every tokenizer template accepts those kwargs.
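A simplified sketch of that flow is shown below. It is not the code in vllm_utils.py: the message attribute names, the render_prompt helper, and the StrictTokenizer stand-in are assumptions chosen to keep the example runnable without transformers installed.

```python
import json
from types import SimpleNamespace


def messages_to_dicts(messages):
    """Map proto-like messages to the dicts apply_chat_template expects.

    Sketch only: attribute names are assumptions, not the generated proto API.
    """
    out = []
    for m in messages:
        d = {"role": m.role, "content": m.content}
        if m.role == "tool":
            if getattr(m, "tool_call_id", ""):
                d["tool_call_id"] = m.tool_call_id
            if getattr(m, "name", ""):
                d["name"] = m.name
        if m.role == "assistant" and getattr(m, "tool_calls", ""):
            d["tool_calls"] = json.loads(m.tool_calls)  # JSON string -> list
        if getattr(m, "reasoning_content", ""):
            d["reasoning_content"] = m.reasoning_content
        out.append(d)
    return out


def render_prompt(tokenizer, messages, tools_json="", enable_thinking=False):
    """Call apply_chat_template, retrying without optional kwargs on TypeError."""
    kwargs = {}
    if tools_json:
        kwargs["tools"] = json.loads(tools_json)
    if enable_thinking:
        kwargs["enable_thinking"] = True
    dicts = messages_to_dicts(messages)
    try:
        return tokenizer.apply_chat_template(
            dicts, tokenize=False, add_generation_prompt=True, **kwargs)
    except TypeError:
        return tokenizer.apply_chat_template(
            dicts, tokenize=False, add_generation_prompt=True)


class StrictTokenizer:
    """Stand-in tokenizer whose template rejects tools= and enable_thinking=."""

    def apply_chat_template(self, conversation, tokenize=True,
                            add_generation_prompt=False):
        return "|".join(m["role"] for m in conversation)


msgs = [SimpleNamespace(role="user", content="hi")]
print(render_prompt(StrictTokenizer(), msgs,
                    tools_json='[{"type": "function"}]', enable_thinking=True))
```

The retry drops both optional kwargs at once, which keeps the sketch short; a finer-grained fallback that drops one kwarg at a time is possible but not shown here.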
vLLM publishes prebuilt CPU wheels at https://github.com/vllm-project/vllm/releases/.... The pin lives in backend/python/vllm/requirements-cpu-after.txt.
Version compatibility (important): newer vllm CPU wheels (≥ 0.15) declare torch==2.10.0+cpu as a hard dep, but torch==2.10.0 only exists on the PyTorch test channel and pulls in an incompatible torchvision. Stay on vllm 0.14.1+cpu + torch 2.9.1+cpu until both upstreams catch up. Bumping requires verifying that torchvision/torchaudio match the new torch version.
requirements-cpu.txt uses --extra-index-url https://download.pytorch.org/whl/cpu. install.sh adds --index-strategy=unsafe-best-match for the cpu profile so uv resolves transformers/vllm from PyPI while pulling torch from the PyTorch index.
SIMD baseline: the prebuilt CPU wheel is compiled with AVX-512 VNNI/BF16. On a CPU without those instructions, importing vllm.model_executor.models.registry SIGILLs at _run_in_subprocess time during model inspection. There is no runtime flag to disable it. Workarounds:
FROM_SOURCE=true env var. Plumbing exists end-to-end:
- install.sh hides requirements-cpu-after.txt, runs installRequirements for the base deps, then clones vllm and runs VLLM_TARGET_DEVICE=cpu uv pip install --no-deps .
- backend/Dockerfile.python declares ARG FROM_SOURCE + ENV FROM_SOURCE
- Makefile docker-build-backend macro forwards --build-arg FROM_SOURCE=$(FROM_SOURCE) when set

Runtime shared libraries: vLLM's vllm._C extension dlopens libnuma.so.1 at import time. If missing, the C extension silently fails and torch.ops._C_utils.init_cpu_threads_env is never registered → EngineCore crashes on init_device with:
AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env'
backend/python/vllm/package.sh bundles libnuma.so.1 and libgomp.so.1 into ${BACKEND}/lib/, which libbackend.sh adds to LD_LIBRARY_PATH at run time. The builder stage in backend/Dockerfile.python installs libnuma1/libgomp1 so package.sh has something to copy. Do not assume the production host has these — backend images are FROM scratch.
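Both CPU failure modes above (SIGILL on import, missing shared libraries) can be checked before importing vllm. The sketch below is a hypothetical preflight, not part of the backend. The flag names follow Linux /proc/cpuinfo spelling on x86, and treating both avx512_vnni and avx512_bf16 as required is an assumption based on the description of the wheel.

```python
import ctypes


def missing_cpu_flags(cpuinfo_text, required=("avx512_vnni", "avx512_bf16")):
    """Return the required flags absent from a /proc/cpuinfo dump (x86 Linux)."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            flags.update(line.split(":", 1)[1].split())
    return [f for f in required if f not in flags]


def missing_shared_libs(names=("libnuma.so.1", "libgomp.so.1")):
    """Return the libraries that the dynamic loader cannot open."""
    missing = []
    for name in names:
        try:
            ctypes.CDLL(name)
        except OSError:
            missing.append(name)
    return missing
```

To use it, read /proc/cpuinfo into a string and pass it to missing_cpu_flags, then call missing_shared_libs() with LD_LIBRARY_PATH set the way libbackend.sh sets it. A non-empty result from either function points at the corresponding failure described above.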
Backend hooks (core/config/backend_hooks.go)
Per-backend defaults that used to be hardcoded in ModelConfig.Prepare() now live in core/config/hooks_*.go files and self-register via init():
- hooks_llamacpp.go → GGUF metadata parsing, context size, GPU layers, jinja template
- hooks_vllm.go → tool/reasoning parser auto-selection from parser_defaults.json

Hook keys:
- "llama-cpp", "vllm", "vllm-omni", …: backend-specific
- "": runs only when cfg.Backend is empty (auto-detect case)
- "*": global catch-all, runs for every backend before specific hooks

Multiple hooks per key are supported and run in registration order. Adding a new backend default:
// core/config/hooks_<backend>.go
func init() {
    RegisterBackendHook("<backend>", myDefaults)
}

func myDefaults(cfg *ModelConfig, modelPath string) {
    // only fill in fields the user didn't set
}
Messages.ToProto() fields you need to set
core/schema/message.go:ToProto() must serialize:
- ToolCallID → proto.Message.ToolCallId (for role="tool" messages; links result back to the call)
- Reasoning → proto.Message.ReasoningContent
- ToolCalls → proto.Message.ToolCalls (JSON-encoded string)

These were originally not serialized and tool-calling conversations broke silently: the C++ llama.cpp backend reads them but always got empty strings. Any new field added to schema.Message and proto.Message needs a matching line in ToProto().