docs/content/advanced/model-configuration.md
+++ disableToc = false title = "Model Configuration" weight = 23 url = '/advanced/model-configuration' +++
LocalAI uses YAML configuration files to define model parameters, templates, and behavior. This page provides a complete reference for all available configuration options.
Model configuration files allow you to:
You can create model configuration files in several ways:
models/gpt-3.5-turbo.yaml)--models-config-file or LOCALAI_MODELS_CONFIG_FILEname: gpt-3.5-turbo
parameters:
model: luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin
temperature: 0.3
context_size: 512
threads: 10
backend: llama-stable
template:
completion: completion
chat: chat
When using --models-config-file, you can define multiple models as a list:
- name: model1
parameters:
model: model1.bin
context_size: 512
backend: llama-stable
- name: model2
parameters:
model: model2.bin
context_size: 1024
backend: llama-stable
| Field | Type | Description | Example |
|---|---|---|---|
name | string | Model name, used to identify the model in API calls | gpt-3.5-turbo |
backend | string | Backend to use (e.g. llama-cpp, vllm, diffusers, whisper) | llama-cpp |
description | string | Human-readable description of the model | A conversational AI model |
usage | string | Usage instructions or notes | Best for general conversation |
| Field | Type | Description |
|---|---|---|
parameters.model | string | Path to the model file (relative to models directory) or URL |
download_files | array | List of files to download. Each entry has filename, uri, and optional sha256 |
Example:
parameters:
model: my-model.gguf
download_files:
- filename: my-model.gguf
uri: https://example.com/model.gguf
sha256: abc123...
The parameters section contains all OpenAI-compatible request parameters and model-specific options.
These settings will be used as defaults for all the API calls to the model.
| Field | Type | Default | Description |
|---|---|---|---|
temperature | float | 0.9 | Sampling temperature (0.0-2.0). Higher values make output more random |
top_p | float | 0.95 | Nucleus sampling: consider tokens with top_p probability mass |
top_k | int | 40 | Consider only the top K most likely tokens |
max_tokens | int | 0 | Maximum number of tokens to generate (0 = unlimited) |
frequency_penalty | float | 0.0 | Penalty for token frequency (-2.0 to 2.0) |
presence_penalty | float | 0.0 | Penalty for token presence (-2.0 to 2.0) |
repeat_penalty | float | 1.1 | Penalty for repeating tokens |
repeat_last_n | int | 64 | Number of previous tokens to consider for repeat penalty |
seed | int | -1 | Random seed (omit for random) |
echo | bool | false | Echo back the prompt in the response |
n | int | 1 | Number of completions to generate |
logprobs | bool/int | false | Return log probabilities of tokens |
top_logprobs | int | 0 | Number of top logprobs to return per token (0-20) |
logit_bias | map | {} | Map of token IDs to bias values (-100 to 100) |
typical_p | float | 1.0 | Typical sampling parameter |
tfz | float | 1.0 | Tail free z parameter |
keep | int | 0 | Number of tokens to keep from the prompt |
| Field | Type | Description |
|---|---|---|
language | string | Language code for transcription/translation |
translate | bool | Whether to translate audio transcription |
| Field | Type | Description |
|---|---|---|
batch | int | Batch size for processing |
ignore_eos | bool | Ignore end-of-sequence tokens |
negative_prompt | string | Negative prompt for image generation |
rope_freq_base | float32 | RoPE frequency base |
rope_freq_scale | float32 | RoPE frequency scale |
negative_prompt_scale | float32 | Scale for negative prompt |
tokenizer | string | Tokenizer to use (RWKV) |
These settings apply to most LLM backends (llama.cpp, vLLM, etc.):
| Field | Type | Default | Description |
|---|---|---|---|
threads | int | processor count | Number of threads for parallel computation |
context_size | int | 512 | Maximum context size (number of tokens) |
f16 | bool | false | Enable 16-bit floating point precision (GPU acceleration) |
gpu_layers | int | 0 | Number of layers to offload to GPU (0 = CPU only) |
| Field | Type | Default | Description |
|---|---|---|---|
mmap | bool | true | Use memory mapping for model loading (faster, less RAM) |
mmlock | bool | false | Lock model in memory (prevents swapping) |
low_vram | bool | false | Use minimal VRAM mode |
no_kv_offloading | bool | false | Disable KV cache offloading |
| Field | Type | Description |
|---|---|---|
tensor_split | string | Comma-separated GPU memory allocation (e.g., "0.8,0.2" for 80%/20%) |
main_gpu | string | Main GPU identifier for multi-GPU setups |
cuda | bool | Explicitly enable/disable CUDA |
| Field | Type | Default | Description |
|---|---|---|---|
mirostat | int | 0 | Mirostat sampling mode (0=disabled, 1=Mirostat, 2=Mirostat 2.0) |
mirostat_tau | float | 5.0 | Mirostat target entropy |
mirostat_eta | float | 0.1 | Mirostat learning rate |
| Field | Type | Description |
|---|---|---|
lora_adapter | string | Path to LoRA adapter file |
lora_base | string | Base model for LoRA |
lora_scale | float32 | LoRA scale factor |
lora_adapters | array | Multiple LoRA adapters |
lora_scales | array | Scales for multiple LoRA adapters |
| Field | Type | Description |
|---|---|---|
no_mulmatq | bool | Disable matrix multiplication queuing |
draft_model | string | Draft model GGUF file for speculative decoding (see Speculative Decoding) |
n_draft | int32 | Maximum number of draft tokens per speculative step (default: 16) |
quantization | string | Quantization format |
load_format | string | Model load format |
numa | bool | Enable NUMA (Non-Uniform Memory Access) |
rms_norm_eps | float32 | RMS normalization epsilon |
ngqa | int32 | Natural question generation parameter |
rope_scaling | string | RoPE scaling configuration |
type | string | Model type/architecture |
grammar | string | Grammar file path for constrained generation |
YARN (Yet Another RoPE extensioN) settings for context extension:
| Field | Type | Description |
|---|---|---|
yarn_ext_factor | float32 | YARN extension factor |
yarn_attn_factor | float32 | YARN attention factor |
yarn_beta_fast | float32 | YARN beta fast parameter |
yarn_beta_slow | float32 | YARN beta slow parameter |
Speculative decoding speeds up text generation by predicting multiple tokens ahead and verifying them in a single forward pass. The output is identical to normal decoding — only faster. This feature is only available with the llama-cpp backend.
There are two approaches:
Uses a smaller, faster model from the same model family to draft candidate tokens, which the main model then verifies. Requires a separate GGUF file for the draft model.
name: my-model
backend: llama-cpp
parameters:
model: large-model.gguf
draft_model: small-draft-model.gguf
n_draft: 8
options:
- spec_p_min:0.8
- draft_gpu_layers:99
Uses patterns from the token history to predict future tokens — no extra model required. Works well for repetitive or structured output (code, JSON, lists).
name: my-model
backend: llama-cpp
parameters:
model: my-model.gguf
options:
- spec_type:ngram_simple
- spec_n_max:16
These are set via the options: array in the model configuration (format: key:value):
Common options
| Option | Type | Default | Description |
|---|---|---|---|
spec_type / speculative_type | string | none | Speculative decoding type, or comma-separated list to chain multiple (see table below) |
spec_n_max / draft_max | int | 16 | Maximum number of tokens to draft per step |
spec_n_min / draft_min | int | 0 | Minimum draft tokens required to use speculation |
spec_p_min / draft_p_min | float | 0.75 | Minimum probability threshold for greedy acceptance |
spec_p_split | float | 0.1 | Split probability for tree-based branching |
Draft-model options (apply when spec_type=draft, i.e. a draft_model is configured)
| Option | Type | Default | Description |
|---|---|---|---|
draft_gpu_layers | int | -1 | GPU layers for the draft model (-1 = use default) |
draft_threads / spec_draft_threads | int | same as main | Threads used by the draft model (<= 0 = hardware concurrency) |
draft_threads_batch / spec_draft_threads_batch | int | same as draft_threads | Threads used by the draft model during batch / prompt processing |
draft_cache_type_k / spec_draft_cache_type_k | string | f16 | KV cache K data type for the draft model (same values as cache_type_k) |
draft_cache_type_v / spec_draft_cache_type_v | string | f16 | KV cache V data type for the draft model |
draft_cpu_moe / spec_draft_cpu_moe | bool | false | Keep all MoE expert weights of the draft model on CPU |
draft_n_cpu_moe / spec_draft_n_cpu_moe | int | 0 | Keep MoE expert weights of the first N draft-model layers on CPU |
draft_override_tensor / spec_draft_override_tensor | string | "" | Comma-separated <tensor regex>=<buffer type> overrides for the draft model |
draft_ctx_size | int | (ignored) | Deprecated upstream: the draft now shares the target context size. Accepted for backward compatibility but has no effect. |
ngram_simple options (used when spec_type includes ngram_simple)
| Option | Type | Default | Description |
|---|---|---|---|
spec_ngram_size_n / ngram_size_n | int | 12 | N-gram lookup size |
spec_ngram_size_m / ngram_size_m | int | 48 | M-gram proposal size |
spec_ngram_min_hits / ngram_min_hits | int | 1 | Minimum hits for accepting n-gram proposals |
ngram_mod options (used when spec_type includes ngram_mod)
| Option | Type | Default | Description |
|---|---|---|---|
spec_ngram_mod_n_min | int | 48 | Minimum number of ngram tokens to use |
spec_ngram_mod_n_max | int | 64 | Maximum number of ngram tokens to use |
spec_ngram_mod_n_match | int | 24 | Ngram lookup length |
ngram_map_k options (used when spec_type includes ngram_map_k)
| Option | Type | Default | Description |
|---|---|---|---|
spec_ngram_map_k_size_n | int | 12 | N-gram lookup size |
spec_ngram_map_k_size_m | int | 48 | M-gram proposal size |
spec_ngram_map_k_min_hits | int | 1 | Minimum hits for accepting proposals |
ngram_map_k4v options (used when spec_type includes ngram_map_k4v)
| Option | Type | Default | Description |
|---|---|---|---|
spec_ngram_map_k4v_size_n | int | 12 | N-gram lookup size |
spec_ngram_map_k4v_size_m | int | 48 | M-gram proposal size |
spec_ngram_map_k4v_min_hits | int | 1 | Minimum hits for accepting proposals |
ngram_cache lookup files
| Option | Type | Default | Description |
|---|---|---|---|
spec_lookup_cache_static / lookup_cache_static | string | "" | Path to a static ngram lookup cache file |
spec_lookup_cache_dynamic / lookup_cache_dynamic | string | "" | Path to a dynamic ngram lookup cache file (updated by generation) |
The canonical names match upstream llama.cpp (dash-separated). For backward compatibility LocalAI also accepts the underscore-separated forms and the bare draft / eagle3 aliases.
| Type | Aliases accepted | Description |
|---|---|---|
none | No speculative decoding (default) | |
draft-simple | draft, draft_simple | Draft model-based speculation (auto-set when draft_model is configured) |
draft-eagle3 | eagle3, draft_eagle3 | EAGLE3 draft model architecture |
draft-mtp | draft_mtp | Multi-Token Prediction. Reuses the target model's embedded MTP head; no separate draft GGUF required (draft_model can be omitted). |
ngram-simple | ngram_simple | Simple self-speculative using token history |
ngram-map-k | ngram_map_k | N-gram with key-only map |
ngram-map-k4v | ngram_map_k4v | N-gram with keys and 4 m-gram values |
ngram-mod | ngram_mod | Modified n-gram speculation |
ngram-cache | ngram_cache | 3-level n-gram cache |
Multiple types can be chained by passing a comma-separated list to spec_type (e.g. spec_type:ngram-simple,ngram-mod). The runtime tries them in order and accepts the first proposal that meets the acceptance criteria.
{{% notice note %}}
Speculative decoding is automatically disabled when multimodal models (with mmproj) are active. The n_draft parameter can also be overridden per-request.
{{% /notice %}}
draft-mtp enables Multi-Token Prediction (ggml-org/llama.cpp#22673). MTP uses a small prediction head trained into the target model: the head runs alongside the main forward pass and proposes the next few tokens, which the target then verifies in a single batched step. Upstream reports ~1.85x-2.1x token throughput at ~72-82% draft acceptance on Qwen3.6 27B / 35B A3B.
Auto-detection (default). When a GGUF declares an MTP head (the upstream <arch>.nextn_predict_layers metadata key, set by convert_hf_to_gguf.py for Qwen3.5/3.6 family models and similar), LocalAI auto-enables MTP with the following defaults:
options:
- spec_type:draft-mtp
- spec_n_max:6
- spec_p_min:0.75
Detection runs both at import time (the /import-model UI / POST /models/import-uri flow range-fetches the GGUF header and writes the options into the generated YAML before you save it) and at load time (every llama-cpp model start re-checks the local header and appends the options if spec_type isn't already set). To opt out, set an explicit spec_type: / speculative_type: in your YAML - auto-detection always preserves the user value, including spec_type:none.
Two ways to load the MTP head:
spec_type includes draft-mtp and draft_model is empty, the backend builds the MTP draft context directly from the target model's weights. The GGUF must have been converted with the MTP tensors included.mtp-*.gguf sibling file. If you point draft_model at the separate MTP-head GGUF that ships next to the main weights on HuggingFace, the backend will load it as a draft model. Note: upstream's -hf auto-discovery of mtp-*.gguf siblings is not wired into LocalAI's gRPC layer - you need to download the sibling file and configure draft_model explicitly.Manual override knobs (overlap with the auto-detect defaults above):
| Option | Recommended | Notes |
|---|---|---|
spec_type | draft-mtp | Activates MTP. Can be chained with other types (see below). |
spec_n_max / draft_max | 2-6 | Number of draft tokens per step. Upstream's PR suggests 2-3 for the tightest acceptance window; LocalAI's auto-default is 6 to favour throughput on models with high acceptance. |
spec_p_min | 0.75 | Pinned because upstream marks the current default with a "change to 0.0f" TODO; locking it here keeps acceptance thresholds stable across future llama.cpp bumps. |
mmproj_use_gpu | false (or unset mmproj) | MTP has a prompt-processing overhead; if the model is non-vision, drop the mmproj entirely to save VRAM. |
Minimal config (override-only, since auto-detection already covers this for MTP-capable GGUFs):
name: qwen3-mtp
backend: llama-cpp
parameters:
model: qwen3-27b-with-mtp.gguf
options:
- spec_type:draft-mtp
- spec_n_max:3
With a separate MTP head file:
name: qwen3-mtp
backend: llama-cpp
parameters:
model: qwen3-27b.gguf
draft_model: qwen3-27b-mtp-head.gguf
options:
- spec_type:draft-mtp
- spec_n_max:3
Chaining MTP with n-gram fallback (experimental, from the PR's usage notes - useful when MTP acceptance drops on highly repetitive output):
options:
- spec_type:draft-mtp,ngram-mod
- spec_n_max:3
- spec_ngram_mod_n_match:24
Pre-converted GGUFs with MTP heads are published on the ggml-org HuggingFace org (initially Qwen3.6 27B and Qwen3.6 35B A3B).
These load-time options control how the backend parses <think> reasoning blocks and how much budget the model is allowed for thinking. They are set per model via the options: array.
| Option | Type | Default | Description |
|---|---|---|---|
reasoning_format | string | deepseek | Parser for reasoning/thinking blocks. One of none, auto, deepseek, deepseek-legacy (alias deepseek_legacy). |
enable_reasoning / reasoning_budget | int | -1 | Reasoning budget in tokens: -1 unlimited, 0 disabled, >0 token cap for the thinking section. |
prefill_assistant | bool | true | When false, the trailing assistant message is not pre-filled by the chat template. |
{{% notice note %}}
This is the load-time reasoning configuration. The orthogonal per-request enable_thinking chat-template kwarg toggles thinking on/off per call without restarting the model. It can be driven either by the YAML reasoning.disable field (model default) or per request via the OpenAI reasoning_effort field on /v1/chat/completions:
reasoning_effort: "none" disables thinking for that request (enable_thinking=false) - useful to run a single reasoning model like Qwen3 for low-latency tasks while still enabling reasoning on other requests.reasoning_effort: "minimal" | "low" | "medium" | "high" enables thinking, unless the model config explicitly set reasoning.disable: true (an operator's explicit disable wins and is never re-enabled by a request).
{{% /notice %}}| Option | Type | Default | Description |
|---|---|---|---|
mmproj_use_gpu / mmproj_offload | bool | true | Set false to keep the multimodal projector on CPU (saves VRAM at cost of speed). |
image_min_tokens | int | -1 | Minimum vision tokens per image. -1 keeps the model default. |
image_max_tokens | int | -1 | Maximum vision tokens per image. -1 keeps the model default. |
| Option | Type | Default | Description |
|---|---|---|---|
pooling_type / pooling | string | auto | Pooling strategy for embeddings: none, mean, cls, last, rank. Reranking automatically uses rank. |
embd_normalize / embedding_normalize | int | 2 | Normalization: -1 none, 0 max-abs, 1 taxicab, 2 Euclidean (L2), >2 p-norm. |
These llama.cpp options are passed through the options: array.
| Option | Type | Default | Description |
|---|---|---|---|
n_ubatch / ubatch | int | same as batch | Physical batch size. Decouple from n_batch when an embedding/rerank workload needs a different value. |
threads_batch / n_threads_batch | int | same as threads | Threads used during prompt processing. <= 0 means hardware_concurrency(). |
direct_io / use_direct_io | bool | false | Open the model with O_DIRECT (faster cold loads on NVMe; ignored if not supported). |
verbosity | int | 3 | llama.cpp internal log verbosity threshold. Higher = more verbose. |
override_tensor / tensor_buft_overrides | string | "" | Per-tensor buffer-type overrides for the main model. Format: <tensor regex>=<buffer type>,<tensor regex>=<buffer type>,.... Mirrors the existing draft_override_tensor syntax for the draft model. |
The recommended way to enable prompt caching for the llama-cpp backend is the server-side prompt cache controlled by cache_ram / kv_unified / cache_idle_slots in the options: array (see [llama.cpp backend options]({{%relref "features/text-generation#server-side-prompt-cache-repeated-system-prompts" %}})). It's on by default since LocalAI v4.3 and is what gives repeated system prompts a near-zero prefill on the second call.
The fields below come from upstream llama.cpp's CLI completion tool and are passed through to the gRPC backend for compatibility, but the gRPC server itself does not consume them: keep them empty unless you're targeting a non-llama-cpp backend that reads them.
| Field | Type | Description |
|---|---|---|
prompt_cache_path | string | (legacy / unused by llama-cpp gRPC server) Path to a file-backed prompt cache for upstream's CLI completion tool. |
prompt_cache_all | bool | (legacy / unused by llama-cpp gRPC server) |
prompt_cache_ro | bool | (legacy / unused by llama-cpp gRPC server) |
| Field | Type | Description |
|---|---|---|
stopwords | array | Words or phrases that stop generation |
cutstrings | array | Strings to cut from responses |
trimspace | array | Strings to trim whitespace from |
trimsuffix | array | Suffixes to trim from responses |
extract_regex | array | Regular expressions to extract content |
| Field | Type | Description |
|---|---|---|
system_prompt | string | Default system prompt for the model |
These options apply when using the vllm backend:
| Field | Type | Description |
|---|---|---|
gpu_memory_utilization | float32 | GPU memory utilization (0.0-1.0, default 0.9) |
trust_remote_code | bool | Trust and execute remote code |
enforce_eager | bool | Force eager execution mode |
swap_space | int | Swap space in GB |
max_model_len | int | Maximum model length |
tensor_parallel_size | int | Tensor parallelism size |
disable_log_stats | bool | Disable logging statistics |
dtype | string | Data type (e.g., float16, bfloat16) |
flash_attention | string | Flash attention configuration |
cache_type_k | string | Key cache quantization type. Maps to llama.cpp's -ctk. Accepted values for llama.cpp-family backends (llama-cpp, ik-llama-cpp, turboquant): f16, f32, q8_0, q4_0, q4_1, q5_0, q5_1. The turboquant backend additionally accepts turbo2, turbo3, turbo4 — the fork's TurboQuant KV-cache schemes. turbo3/turbo4 auto-enable flash_attention. |
cache_type_v | string | Value cache quantization type. Maps to llama.cpp's -ctv. Same accepted values as cache_type_k. Note: any quantized V cache requires flash_attention to be enabled. |
limit_mm_per_prompt | object | Limit multimodal content per prompt: {image: int, video: int, audio: int} |
Templates use Go templates with Sprig functions.
| Field | Type | Description |
|---|---|---|
template.chat | string | Template for chat completion endpoint |
template.chat_message | string | Template for individual chat messages |
template.completion | string | Template for text completion |
template.edit | string | Template for edit operations |
template.function | string | Template for function/tool calls |
template.multimodal | string | Template for multimodal interactions |
template.reply_prefix | string | Prefix to add to model replies |
template.use_tokenizer_template | bool | Use tokenizer's built-in template (vLLM/transformers) |
template.join_chat_messages_by_character | string | Character to join chat messages (default: \n) |
Templating supports sprig functions.
Following are common variables available in templates:
{{.Input}} - User input{{.Instruction}} - Instruction for edit operations{{.System}} - System message{{.Prompt}} - Full prompt{{.Functions}} - Function definitions (for function calling){{.FunctionCall}} - Function call resulttemplate:
chat: |
{{.System}}
{{range .Messages}}
{{if eq .Role "user"}}User: {{.Content}}{{end}}
{{if eq .Role "assistant"}}Assistant: {{.Content}}{{end}}
{{end}}
Assistant:
Configure how the model handles function/tool calls:
| Field | Type | Default | Description |
|---|---|---|---|
function.disable_no_action | bool | false | Disable the no-action behavior |
function.no_action_function_name | string | answer | Name of the no-action function |
function.no_action_description_name | string | Description for no-action function | |
function.function_name_key | string | name | JSON key for function name |
function.function_arguments_key | string | arguments | JSON key for function arguments |
function.response_regex | array | Named regex patterns to extract function calls | |
function.argument_regex | array | Named regex to extract function arguments | |
function.argument_regex_key_name | string | key | Named regex capture for argument key |
function.argument_regex_value_name | string | value | Named regex capture for argument value |
function.json_regex_match | array | Regex patterns to match JSON in tool mode | |
function.replace_function_results | array | Replace function call results with patterns | |
function.replace_llm_results | array | Replace LLM results with patterns | |
function.capture_llm_results | array | Capture LLM results as text (e.g., for "thinking" blocks) |
| Field | Type | Default | Description |
|---|---|---|---|
function.grammar.disable | bool | false | Completely disable grammar enforcement |
function.grammar.parallel_calls | bool | false | Allow parallel function calls |
function.grammar.mixed_mode | bool | false | Allow mixed-mode grammar enforcing |
function.grammar.no_mixed_free_string | bool | false | Disallow free strings in mixed mode |
function.grammar.disable_parallel_new_lines | bool | false | Disable parallel processing for new lines |
function.grammar.prefix | string | Prefix to add before grammar rules | |
function.grammar.expect_strings_after_json | bool | false | Expect strings after JSON data |
For image generation models using the diffusers backend:
| Field | Type | Description |
|---|---|---|
diffusers.cuda | bool | Enable CUDA for diffusers |
diffusers.pipeline_type | string | Pipeline type (e.g., stable-diffusion, stable-diffusion-xl) |
diffusers.scheduler_type | string | Scheduler type (e.g., euler, ddpm) |
diffusers.enable_parameters | string | Comma-separated parameters to enable |
diffusers.cfg_scale | float32 | Classifier-free guidance scale |
diffusers.img2img | bool | Enable image-to-image transformation |
diffusers.clip_skip | int | Number of CLIP layers to skip |
diffusers.clip_model | string | CLIP model to use |
diffusers.clip_subfolder | string | CLIP model subfolder |
diffusers.control_net | string | ControlNet model to use |
step | int | Number of diffusion steps |
For text-to-speech models:
| Field | Type | Description |
|---|---|---|
tts.voice | string | Voice file path or voice ID |
tts.audio_path | string | Path to audio files (for Vall-E) |
Map conversation roles to specific strings:
roles:
user: "### Instruction:"
assistant: "### Response:"
system: "### System Instruction:"
Enable or disable experimental features:
feature_flags:
feature_name: true
another_feature: false
Model Context Protocol (MCP) configuration:
| Field | Type | Description |
|---|---|---|
mcp.remote | string | YAML string defining remote MCP servers |
mcp.stdio | string | YAML string defining STDIO MCP servers |
Agent/autonomous agent configuration:
| Field | Type | Description |
|---|---|---|
agent.max_attempts | int | Maximum number of attempts |
agent.max_iterations | int | Maximum number of iterations |
agent.enable_reasoning | bool | Enable reasoning capabilities |
agent.enable_planning | bool | Enable planning capabilities |
agent.enable_mcp_prompts | bool | Enable MCP prompts |
agent.enable_plan_re_evaluator | bool | Enable plan re-evaluation |
Configure how reasoning tags are extracted and processed from model output. Reasoning tags are used by models like DeepSeek, Command-R, and others to include internal reasoning steps in their responses.
| Field | Type | Default | Description |
|---|---|---|---|
reasoning.disable | bool | false | When true, disables reasoning extraction entirely. The original content is returned without any processing. |
reasoning.disable_reasoning_tag_prefill | bool | false | When true, disables automatic prepending of thinking start tokens. Use this when your model already includes reasoning tags in its output format. |
reasoning.strip_reasoning_only | bool | false | When true, extracts and removes reasoning tags from content but discards the reasoning text. Useful when you want to clean reasoning tags from output without storing the reasoning content. |
reasoning.thinking_start_tokens | array | [] | List of custom thinking start tokens to detect in prompts. Custom tokens are checked before default tokens. |
reasoning.tag_pairs | array | [] | List of custom tag pairs for reasoning extraction. Each entry has start and end fields. Custom pairs are checked before default pairs. |
The reasoning extraction supports multiple tag formats used by different models:
<thinking>...</thinking> - General thinking tag<think>...</think> - DeepSeek, Granite, ExaOne, GLM models<|START_THINKING|>...<|END_THINKING|> - Command-R models<|inner_prefix|>...<|inner_suffix|> - Apertus models<seed:think>...</seed:think> - Seed models<|think|>...<|end|><|begin|>assistant<|content|> - Solar Open models[THINK]...[/THINK] - Magistral modelsDisable reasoning extraction:
reasoning:
disable: true
Extract reasoning but don't prepend tags:
reasoning:
disable_reasoning_tag_prefill: true
Strip reasoning tags without storing reasoning content:
reasoning:
strip_reasoning_only: true
Complete example with reasoning configuration:
name: deepseek-model
backend: llama-cpp
parameters:
model: deepseek.gguf
reasoning:
disable: false
disable_reasoning_tag_prefill: false
strip_reasoning_only: false
Example with custom tokens and tag pairs:
name: custom-reasoning-model
backend: llama-cpp
parameters:
model: custom.gguf
reasoning:
thinking_start_tokens:
- "<custom:think>"
- "<my:reasoning>"
tag_pairs:
- start: "<custom:think>"
end: "</custom:think>"
- start: "<my:reasoning>"
end: "</my:reasoning>"
Note: Custom tokens and tag pairs are checked before the default ones, giving them priority. This allows you to override default behavior or add support for new reasoning tag formats.
The reasoning.disable setting from model configuration can be overridden on a per-request basis using the metadata field in the OpenAI chat completion request. This allows you to enable or disable thinking for individual requests without changing the model configuration.
The metadata field accepts a map[string]string that is forwarded to the backend. The enable_thinking key controls thinking behavior:
# Enable thinking for a single request (overrides model config)
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3",
"messages": [{"role": "user", "content": "Explain quantum computing"}],
"metadata": {"enable_thinking": "true"}
}'
# Disable thinking for a single request (overrides model config)
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3",
"messages": [{"role": "user", "content": "Hello"}],
"metadata": {"enable_thinking": "false"}
}'
Priority order:
metadata.enable_thinking (highest priority)reasoning.disable (fallback)Define pipelines for audio-to-audio processing and the [Realtime API]({{%relref "features/openai-realtime" %}}):
| Field | Type | Description |
|---|---|---|
pipeline.tts | string | TTS model name |
pipeline.llm | string | LLM model name |
pipeline.transcription | string | Transcription model name |
pipeline.vad | string | Voice activity detection model name |
Backend gRPC communication settings:
| Field | Type | Description |
|---|---|---|
grpc.attempts | int | Number of retry attempts |
grpc.attempts_sleep_time | int | Sleep time between retries (seconds) |
Override model configuration values at runtime (llama.cpp):
overrides:
- "qwen3moe.expert_used_count=int:10"
- "some_key=string:value"
Format: KEY=TYPE:VALUE where TYPE is int, float, string, or bool.
Specify which endpoints this model supports:
known_usecases:
- chat
- completion
- embeddings
Available flags: chat, completion, edit, embeddings, rerank, image, transcript, tts, sound_generation, tokenize, vad, video, detection, llm (combination of CHAT, COMPLETION, EDIT).
Here's a comprehensive example combining many options:
name: my-llm-model
description: A high-performance LLM model
backend: llama-stable
parameters:
model: my-model.gguf
temperature: 0.7
top_p: 0.9
top_k: 40
max_tokens: 2048
context_size: 4096
threads: 8
f16: true
gpu_layers: 35
system_prompt: "You are a helpful AI assistant."
template:
chat: |
{{.System}}
{{range .Messages}}
{{if eq .Role "user"}}User: {{.Content}}
{{else if eq .Role "assistant"}}Assistant: {{.Content}}
{{end}}
{{end}}
Assistant:
roles:
user: "User:"
assistant: "Assistant:"
system: "System:"
stopwords:
- "\n\nUser:"
- "\n\nHuman:"
prompt_cache_path: "cache/my-model"
prompt_cache_all: true
function:
grammar:
parallel_calls: true
mixed_mode: false
feature_flags:
experimental_feature: true
Note: By default, LocalAI sets gpu_layers to a very large value (9999999), which effectively disables llama-cpp's auto-fit functionality. This is intentional to work with LocalAI's VRAM-based model unloading mechanism.
To enable llama-cpp's auto-fit mode, set gpu_layers: -1 in your model configuration. However, be aware of the following:
Trade-off: Enabling auto-fit conflicts with LocalAI's built-in VRAM threshold-based unloading. Auto-fit attempts to fit all tensors into GPU memory automatically, while LocalAI's unloading mechanism removes models when VRAM usage exceeds thresholds.
Known Issues: Setting gpu_layers: -1 may trigger tensor_buft_override buffer errors in some configurations, particularly when the model exceeds available GPU memory.
Recommendation:
gpu_layers: -1 if you understand the implications and have tested on your specific hardwareThis is a known limitation being tracked in issue #8562. A future implementation may provide a runtime toggle or custom logic to reconcile auto-fit with threshold-based unloading.