content/manuals/ai/model-runner/inference-engines.md
Docker Model Runner supports three inference engines: llama.cpp, vLLM, and Diffusers. Each engine has different strengths, supported platforms, and model format requirements. This guide helps you choose the right engine and configure it for your use case.
| Feature | llama.cpp | vLLM | Diffusers |
|---|---|---|---|
| Model formats | GGUF | Safetensors (Hugging Face) | DDUF |
| Platforms | All (macOS, Windows, Linux) | Linux x86_64 only | Linux (x86_64, ARM64) |
| GPU support | NVIDIA, AMD, Apple Silicon, Vulkan | NVIDIA CUDA only | NVIDIA CUDA only |
| CPU inference | Yes | No | No |
| Quantization | Built-in (Q4, Q5, Q8, etc.) | Limited | Limited |
| Memory efficiency | High (with quantization) | Moderate | Moderate |
| Throughput | Good | High (with batching) | Good |
| Best for | Local development, resource-constrained environments | Production, high throughput | Image generation |
| Use case | Text generation (LLMs) | Text generation (LLMs) | Image generation (Stable Diffusion) |
llama.cpp is the default inference engine in Docker Model Runner. It's designed for efficient local inference and supports a wide range of hardware configurations.
| Platform | GPU support | Notes |
|---|---|---|
| macOS (Apple Silicon) | Metal | Automatic GPU acceleration |
| Windows (x64) | NVIDIA CUDA | Requires NVIDIA drivers 576.57+ |
| Windows (ARM64) | Adreno OpenCL | Qualcomm 6xx series and later |
| Linux (x64) | NVIDIA, AMD, Vulkan | Multiple backend options |
| Linux | CPU only | Works on any x64/ARM64 system |
llama.cpp uses the GGUF format, which supports efficient quantization for reduced memory usage without significant quality loss.
| Quantization | Bits per weight | Memory usage | Quality |
|---|---|---|---|
| Q2_K | ~2.5 | Lowest | Reduced |
| Q3_K_M | ~3.5 | Very low | Acceptable |
| Q4_K_M | ~4.5 | Low | Good |
| Q5_K_M | ~5.5 | Moderate | Excellent |
| Q6_K | ~6.5 | High | Excellent |
| Q8_0 | ~8.5 | Very high | Near-original |
| F16 | 16 | Highest | Original |
Recommended: Q4_K_M offers the best balance of quality and memory usage for most use cases.
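As a rough sizing check, a 3B-parameter model at Q4_K_M (~4.5 bits per weight) needs about 3 × 10⁹ × 4.5 / 8 ≈ 1.7 GB for the weights alone, compared to roughly 6 GB at F16; actual memory usage is higher once the context buffer is allocated.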
Models on Docker Hub often include quantization in the tag:
```console
$ docker model pull ai/llama3.2:3B-Q4_K_M
```
llama.cpp is the default engine. No special configuration is required:
```console
$ docker model run ai/smollm2
```
To explicitly specify llama.cpp when running models:
```console
$ docker model run ai/smollm2 --backend llama.cpp
```
When using llama.cpp, API calls use the llama.cpp engine path:
```text
POST /engines/llama.cpp/v1/chat/completions
```
Or without the engine prefix:
```text
POST /engines/v1/chat/completions
```
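For example, a minimal chat completion request against the llama.cpp engine might look like the following (a sketch assuming Model Runner's TCP endpoint on port 12434, as used in the Diffusers example later in this guide; the same pattern applies with the `/engines/vllm/` prefix):

```console
$ curl -s -X POST http://localhost:12434/engines/llama.cpp/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2",
    "messages": [
      {"role": "user", "content": "Say hello in one sentence."}
    ]
  }' | jq -r '.choices[0].message.content'
```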
vLLM is a high-performance inference engine optimized for production workloads with high throughput requirements.
| Platform | GPU | Support status |
|---|---|---|
| Linux x86_64 | NVIDIA CUDA | Supported |
| Windows with WSL2 | NVIDIA CUDA | Supported (Docker Desktop 4.54+) |
| macOS | - | Not supported |
| Linux ARM64 | - | Not supported |
| AMD GPUs | - | Not supported |
> [!IMPORTANT]
>
> vLLM requires an NVIDIA GPU with CUDA support. It does not support CPU-only inference.
vLLM works with models in Safetensors format, which is the standard format for HuggingFace models. These models typically use more memory than quantized GGUF models but may offer better quality and faster inference on powerful hardware.
Install the Model Runner with the vLLM backend:
```console
$ docker model install-runner --backend vllm --gpu cuda
```
Verify the installation:
```console
$ docker model status
Docker Model Runner is running
Status:
llama.cpp: running llama.cpp version: c22473b
vllm: running vllm version: 0.11.0
```
On Windows with WSL2, ensure you have:

- Docker Desktop 4.54 or later
- An NVIDIA GPU with CUDA support and up-to-date drivers

Then install the vLLM backend:

```console
$ docker model install-runner --backend vllm --gpu cuda
```
vLLM models are typically tagged with a `-vllm` suffix:

```console
$ docker model run ai/smollm2-vllm
```
To specify the vLLM backend explicitly:
```console
$ docker model run ai/model --backend vllm
```
When using vLLM, specify the engine in the API path:
```text
POST /engines/vllm/v1/chat/completions
```
Use `--hf_overrides` to pass model configuration overrides:

```console
$ docker model configure --hf_overrides '{"max_model_len": 8192}' ai/model-vllm
```
| Setting | Description | Example |
|---|---|---|
| `max_model_len` | Maximum context length (tokens) | `8192` |
| `gpu_memory_utilization` | Fraction of GPU memory to use | `0.9` |
| `tensor_parallel_size` | Number of GPUs for tensor parallelism | `2` |
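For example, to apply several settings in one call (a sketch assuming `--hf_overrides` accepts a single JSON object combining these keys, following the `max_model_len` example above):

```console
$ docker model configure --hf_overrides '{"max_model_len": 8192, "gpu_memory_utilization": 0.9, "tensor_parallel_size": 2}' ai/model-vllm
```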
| Scenario | Recommended engine |
|---|---|
| Single user, local development | llama.cpp |
| Multiple concurrent requests | vLLM |
| Limited GPU memory | llama.cpp (with quantization) |
| Maximum throughput | vLLM |
| CPU-only system | llama.cpp |
| Apple Silicon Mac | llama.cpp |
| Production deployment | vLLM (if hardware supports it) |
Diffusers is an inference engine for image generation models, including Stable Diffusion. Unlike llama.cpp and vLLM, which focus on text generation with LLMs, Diffusers enables you to generate images from text prompts.
| Platform | GPU | Support status |
|---|---|---|
| Linux x86_64 | NVIDIA CUDA | Supported |
| Linux ARM64 | NVIDIA CUDA | Supported |
| Windows | - | Not supported |
| macOS | - | Not supported |
> [!IMPORTANT]
>
> Diffusers requires an NVIDIA GPU with CUDA support. It does not support CPU-only inference.
Install the Model Runner with the Diffusers backend:
```console
$ docker model reinstall-runner --backend diffusers --gpu cuda
```
Verify the installation:
```console
$ docker model status
Docker Model Runner is running
Status:
llama.cpp: running llama.cpp version: 34ce48d
mlx: not installed
sglang: sglang package not installed
vllm: vLLM binary not found
diffusers: running diffusers version: 0.36.0
```
Pull a Stable Diffusion model:
```console
$ docker model pull stable-diffusion:Q4
```
Diffusers uses an image generation API endpoint. To generate an image:
```console
$ curl -s -X POST http://localhost:12434/engines/diffusers/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "stable-diffusion:Q4",
    "prompt": "A picture of a nice cat",
    "size": "512x512"
  }' | jq -r '.data[0].b64_json' | base64 -d > image.png
```
This command:

- Sends a `POST` request to the Diffusers image generation endpoint
- Extracts the base64-encoded image from the JSON response with `jq`
- Decodes it and saves the result as `image.png`

When using Diffusers, specify the engine in the API path:
```text
POST /engines/diffusers/v1/images/generations
```
| Parameter | Type | Description |
|---|---|---|
| `model` | string | Required. The model identifier (for example, `stable-diffusion:Q4`). |
| `prompt` | string | Required. The text description of the image to generate. |
| `size` | string | Image dimensions in `WIDTHxHEIGHT` format (for example, `512x512`). |
You can run llama.cpp, vLLM, and Diffusers simultaneously. Docker Model Runner routes requests to the appropriate engine based on the model or explicit engine selection.
Check which engines are running:
```console
$ docker model status
Docker Model Runner is running
Status:
llama.cpp: running llama.cpp version: 34ce48d
mlx: not installed
sglang: sglang package not installed
vllm: running vllm version: 0.11.0
diffusers: running diffusers version: 0.36.0
```
| Engine | API path | Use case |
|---|---|---|
| llama.cpp | `/engines/llama.cpp/v1/chat/completions` | Text generation |
| vLLM | `/engines/vllm/v1/chat/completions` | Text generation |
| Diffusers | `/engines/diffusers/v1/images/generations` | Image generation |
| Auto-select | `/engines/v1/chat/completions` | Text generation (auto-selects engine) |
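For example, two identical requests to the auto-select path can be routed to different engines depending on the model they name (a sketch assuming the `ai/smollm2` and `ai/smollm2-vllm` models from the examples above are available):

```console
# Served by llama.cpp (GGUF model)
$ curl -s -X POST http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "ai/smollm2", "messages": [{"role": "user", "content": "Hi"}]}'

# Served by vLLM (Safetensors model)
$ curl -s -X POST http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "ai/smollm2-vllm", "messages": [{"role": "user", "content": "Hi"}]}'
```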
Install an inference engine:

```console
$ docker model install-runner --backend <engine> [--gpu <type>]
```
Options:
- `--backend`: `llama.cpp`, `vllm`, or `diffusers`
- `--gpu`: `cuda`, `rocm`, `vulkan`, or `metal` (depends on platform)

To reinstall an engine:

```console
$ docker model reinstall-runner --backend <engine>
```
Check engine status:

```console
$ docker model status
```

View engine logs:

```console
$ docker model logs
```
Package a GGUF model for llama.cpp:

```console
$ docker model package --gguf ./model.gguf --push myorg/mymodel:Q4_K_M
```

Package a Safetensors model for vLLM:

```console
$ docker model package --safetensors ./model/ --push myorg/mymodel-vllm
```
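Once pushed, the packaged model can be pulled and run by name like any other model (using the hypothetical `myorg/mymodel` repository from the commands above):

```console
$ docker model run myorg/mymodel:Q4_K_M
```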
Verify that the NVIDIA GPU is available:

```console
$ nvidia-smi
```
Check that Docker has GPU access:

```console
$ docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi
```
Verify you're on a supported platform (Linux x86_64 or Windows WSL2).
Ensure GPU acceleration is working (check logs for Metal/CUDA messages).
Try a more aggressive quantization:
```console
$ docker model pull ai/model:Q4_K_M
```
Reduce context size:
```console
$ docker model configure --context-size 2048 ai/model
```
For vLLM, lower `gpu_memory_utilization`:

```console
$ docker model configure --hf_overrides '{"gpu_memory_utilization": 0.8}' ai/model
```