Docker Model Runner provides several configuration options to tune model behavior, memory usage, and inference performance. This guide covers the key settings and how to apply them.
The context size determines the maximum number of tokens a model can process in a single request, including both the input prompt and generated output. This is one of the most important settings affecting memory usage and model capabilities.
By default, Docker Model Runner uses a context size that balances capability with resource efficiency:
| Engine | Default behavior |
|---|---|
| llama.cpp | 4,096 tokens |
| vLLM | Uses the model's maximum trained context size |
> [!NOTE]
> The actual default varies by model. Most models support between 2,048 and 8,192 tokens by default. Some newer models support 32K, 128K, or even larger contexts.
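To see what the context size means in practice, consider a request to Model Runner's OpenAI-compatible API: the tokens in the prompt plus the requested completion (`max_tokens`) must fit within the configured context size, otherwise the request may be truncated or rejected, depending on the engine. A minimal sketch, assuming host-side TCP access is enabled on the default port 12434 and the model is already pulled:

```console
$ curl http://localhost:12434/engines/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "ai/qwen2.5-coder",
          "messages": [
            {"role": "user", "content": "Summarize the trade-offs of a larger context window."}
          ],
          "max_tokens": 1024
        }'
```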
You can adjust the context size per model using the `docker model configure` command:
```console
$ docker model configure --context-size 8192 ai/qwen2.5-coder
```
Or in a Compose file:

```yaml
models:
  llm:
    model: ai/qwen2.5-coder
    context_size: 8192
```
Choose a context size based on your workload:

| Context size | Typical use case | Memory impact |
|---|---|---|
| 2,048 | Simple queries, short code snippets | Low |
| 4,096 | Standard conversations, medium code files | Moderate |
| 8,192 | Long conversations, larger code files | Higher |
| 16,384+ | Extended documents, multi-file context | High |
> [!IMPORTANT]
> Larger context sizes require more memory (RAM/VRAM). If you experience out-of-memory errors, reduce the context size. As a rough guide, each additional 1,000 tokens requires approximately 100-500 MB of additional memory, depending on the model size.
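For example, by that estimate, increasing the context size from 4,096 to 16,384 tokens adds roughly 12,000 tokens, or on the order of 1.2 to 6 GB of additional memory on top of the model weights.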
To inspect a model and view its context-related metadata:

```console
$ docker model inspect ai/qwen2.5-coder
```
> [!NOTE]
> The `docker model inspect` command shows the model's maximum supported context length (for example, `gemma3.context_length`), not the configured context size. The configured context size is what you set with `docker model configure --context-size`; it is the actual limit used during inference and should be less than or equal to the model's maximum supported context length.
Runtime flags let you pass parameters directly to the underlying inference engine. This provides fine-grained control over model behavior.
Runtime flags can be provided through multiple mechanisms:
In a Compose file:

```yaml
models:
  llm:
    model: ai/qwen2.5-coder
    context_size: 4096
    runtime_flags:
      - "--temp"
      - "0.7"
      - "--top-p"
      - "0.9"
```
With the `docker model configure` command:

```console
$ docker model configure --runtime-flag "--temp" --runtime-flag "0.7" --runtime-flag "--top-p" --runtime-flag "0.9" ai/qwen2.5-coder
```
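Runtime flags set engine-level defaults. Clients that call the OpenAI-compatible API can usually still override sampling settings per request with the standard `temperature` and `top_p` fields; a hedged sketch, again assuming host-side TCP access on the default port 12434:

```console
$ curl http://localhost:12434/engines/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "ai/qwen2.5-coder", "messages": [{"role": "user", "content": "Hello"}], "temperature": 0.2, "top_p": 0.9}'
```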
The following tables cover the most commonly used llama.cpp parameters, so for typical use cases you don't need to look up the llama.cpp documentation. The sampling parameters control how the model chooses the next token:
| Flag | Description | Default | Range |
|---|---|---|---|
| `--temp` | Temperature for sampling. Lower = more deterministic, higher = more creative | 0.8 | 0.0-2.0 |
| `--top-k` | Limit sampling to the top K tokens. Lower = more focused | 40 | 1-100 |
| `--top-p` | Nucleus sampling threshold. Lower = more focused | 0.9 | 0.0-1.0 |
| `--min-p` | Minimum probability threshold | 0.05 | 0.0-1.0 |
| `--repeat-penalty` | Penalty for repeating tokens | 1.1 | 1.0-2.0 |
Example: Deterministic output (for code generation)

```yaml
runtime_flags:
  - "--temp"
  - "0"
  - "--top-k"
  - "1"
```
Example: Creative output (for storytelling)

```yaml
runtime_flags:
  - "--temp"
  - "1.2"
  - "--top-p"
  - "0.95"
```
Performance flags control how the engine uses CPU and memory:

| Flag | Description | Default | Notes |
|---|---|---|---|
| `--threads` | CPU threads for generation | Auto | Set to the number of performance cores |
| `--threads-batch` | CPU threads for batch processing | Auto | Usually the same as `--threads` |
| `--batch-size` | Batch size for prompt processing | 512 | Higher = faster prompt processing |
| `--mlock` | Lock the model in memory | Off | Prevents swapping; requires sufficient RAM |
| `--no-mmap` | Disable memory mapping | Off | May improve performance on some systems |
Example: Optimized for multi-core CPU

```yaml
runtime_flags:
  - "--threads"
  - "8"
  - "--batch-size"
  - "1024"
```
GPU flags control offloading when a supported GPU is available:

| Flag | Description | Default | Notes |
|---|---|---|---|
| `--n-gpu-layers` | Layers to offload to the GPU | All (if a GPU is available) | Reduce if running out of VRAM |
| `--main-gpu` | GPU to use for computation | 0 | For multi-GPU systems |
| `--split-mode` | How to split across GPUs | layer | Options: `none`, `layer`, `row` |
Example: Partial GPU offload (limited VRAM)

```yaml
runtime_flags:
  - "--n-gpu-layers"
  - "20"
```
Advanced flags cover RoPE context scaling and chat behavior:

| Flag | Description | Default |
|---|---|---|
| `--rope-scaling` | RoPE scaling method | Auto |
| `--rope-freq-base` | RoPE base frequency | Model default |
| `--rope-freq-scale` | RoPE frequency scale | Model default |
| `--no-prefill-assistant` | Disable assistant pre-fill | Off |
| `--reasoning-budget` | Token budget for reasoning models | 0 (disabled) |
When using the vLLM backend, different parameters are available.
Use `--hf_overrides` to pass Hugging Face model config overrides as JSON:

```console
$ docker model configure --hf_overrides '{"rope_scaling": {"type": "dynamic", "factor": 2.0}}' ai/model-vllm
```
Here are complete configuration examples for common use cases.
Example: Code generation

```yaml
models:
  coder:
    model: ai/qwen2.5-coder
    context_size: 4096
    runtime_flags:
      - "--temp"
      - "0.1"
      - "--top-k"
      - "1"
      - "--batch-size"
      - "1024"
```
Example: General-purpose chat assistant

```yaml
models:
  assistant:
    model: ai/llama3.2
    context_size: 8192
    runtime_flags:
      - "--temp"
      - "0.7"
      - "--top-p"
      - "0.9"
      - "--repeat-penalty"
      - "1.1"
```
Example: Creative writing

```yaml
models:
  writer:
    model: ai/llama3.2
    context_size: 8192
    runtime_flags:
      - "--temp"
      - "1.2"
      - "--top-p"
      - "0.95"
      - "--repeat-penalty"
      - "1.0"
```
Example: Long-context analysis

```yaml
models:
  analyzer:
    model: ai/qwen2.5-coder:14B
    context_size: 32768
    runtime_flags:
      - "--mlock"
      - "--batch-size"
      - "2048"
```
Example: Resource-constrained environments

```yaml
models:
  efficient:
    model: ai/smollm2:360M-Q4_K_M
    context_size: 2048
    runtime_flags:
      - "--threads"
      - "4"
```
When a Compose service uses a model, connection details are provided to the container through environment variables:
| Variable | Description |
|---|---|
| `LLM_URL` | Auto-injected URL of the model endpoint |
| `LLM_MODEL` | Auto-injected model identifier |
See Models and Compose for details on how these are populated.
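As a rough sketch of how these variables appear (the exact binding options are covered in Models and Compose, and `my-app` is a hypothetical image), a service that references a model by its key receives the corresponding variables:

```yaml
services:
  app:
    image: my-app   # hypothetical application image
    models:
      - llm         # injects LLM_URL and LLM_MODEL into this service

models:
  llm:
    model: ai/llama3.2
    context_size: 8192
```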
Configuration set via `docker model configure` persists until the model is removed.
To reset the configuration:

```console
$ docker model configure --context-size -1 ai/qwen2.5-coder
```

Using `-1` resets the value to its default.