TOML configuration

mistralrs from-config -f <path> reads a TOML file. The top-level command field selects serve or run. Every key maps to a CLI flag of the same subcommand; the tables below list the mapping section by section. For per-flag semantics, see the generated CLI reference.

Minimal example

```toml
command = "serve"

[server]
host = "0.0.0.0"
port = 1234

[[models]]
model_id = "Qwen/Qwen3-4B"

[models.quantization]
quant = "4"
```

With the file above saved as this.toml, mistralrs from-config -f this.toml starts the server.

Top-level fields

| Field | Type | Applies to | Purpose |
| --- | --- | --- | --- |
| command | string | both | "serve" or "run". |
| default_model_id | string | serve | Model id treated as the default. Must match one of the [[models]] entries. |
| thinking | bool | run | Force thinking mode on or off for models that support it (alias: enable_thinking). Omit to defer to the chat template default. Maps to --thinking on mistralrs run. |
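
The top-level fields above can be combined into a run configuration. This is a minimal sketch, assuming run accepts the same [[models]] shape as the serve example; the model id is reused from the minimal example, and that it honors thinking is an assumption.

```toml
# Start an interactive run session instead of the HTTP server.
command = "run"

# Force thinking mode on. Omit this key to defer to the chat template default.
thinking = true

[[models]]
model_id = "Qwen/Qwen3-4B"
```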

[global] section

| Field | CLI flag | Default | Purpose |
| --- | --- | --- | --- |
| seed | --seed | not set | Sampling seed. |
| log | -l, --log | not set | Log file for requests/responses. |
| token_source | --token-source | cache | Token source string (literal:<token>, env:<var>, path:<file>, cache, none). |

-v/--verbose has no TOML equivalent; use RUST_LOG instead.
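
As an illustration of the [global] keys, the fragment below sets all three. It is a sketch under assumptions: the seed is taken to be an integer, requests.log is a hypothetical path, and HF_TOKEN is an assumed environment variable name used with the env:<var> form.

```toml
[global]
seed = 42                      # assumed integer; fixes the sampling seed
log = "requests.log"           # hypothetical path for the request/response log
token_source = "env:HF_TOKEN"  # env:<var> form; HF_TOKEN is an assumed variable name
```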

[runtime] section

| Field | CLI flag | Default | Purpose |
| --- | --- | --- | --- |
| max_seqs | --max-seqs | 32 | Max concurrent sequences. |
| no_kv_cache | --no-kv-cache | false | Disable KV cache entirely. |
| prefix_cache_n | --prefix-cache-n | 16 | Prefix caches retained (0 to disable). |
| chat_template | -c, --chat-template | not set | Custom chat template file (.json or .jinja), applied to every model. Per-model chat_template in [[models]] overrides it. |
| jinja_explicit | -j, --jinja-explicit | not set | Explicit Jinja template override. Per-model jinja_explicit also exists. |
| matformer_config_path | --matformer-config-path | not set | MatFormer (nested-submodel) slice config (CSV/JSON). |
| matformer_slice_name | --matformer-slice-name | not set | MatFormer slice to load. Requires matformer_config_path. |
| mtp_model | --mtp-model | not set | MTP (multi-token prediction) assistant model id or path. |
| mtp_n_predict | --mtp-n-predict | not set | MTP draft tokens proposed per target step. |
| mcp_config | --mcp-config | not set | MCP (Model Context Protocol) client configuration JSON for outbound servers. Also reads MCP_CONFIG_PATH if unset. |
| agent | --agent (alias --agentic) | false | Shortcut for enable_search = true + enable_code_execution = true + enable_shell = true. |
| enable_search | --enable-search | false | Enable the built-in web search tool. |
| search_embedding_model | --search-embedding-model | not set | Search reranker; embedding-gemma is the only accepted value. Requires enable_search (or agent). |
| enable_code_execution | --enable-code-execution | false | Enable Python code execution. |
| code_exec_python | --code-exec-python | python on Windows, python3 elsewhere | Python interpreter. Requires enable_code_execution (or agent). |
| code_exec_timeout | --code-exec-timeout | 60 | Per-call timeout in seconds. Requires enable_code_execution (or agent). |
| code_exec_workdir | --code-exec-workdir | per-session temp dir | Code execution working directory. Requires enable_code_execution (or agent). |
| enable_shell | --enable-shell | false | Enable the built-in shell tool for Responses tools[*].type="shell". |
| shell_path | --shell-path | /bin/sh on Unix, cmd on Windows | Shell executable. Requires enable_shell (or agent). |
| shell_timeout | --shell-timeout | 600 | Per-call shell timeout in seconds. Requires enable_shell (or agent). |
| shell_workdir | --shell-workdir | per-session temp dir | Root directory for per-session shell working directories. Requires enable_shell (or agent). |
| skills_dir | --skills-dir | system temp dir | Directory for uploaded OpenAI-compatible Skills. Requires enable_shell (or agent). |
| agent_permission | --agent-permission | auto | auto, ask, or deny: whether model-requested agent actions run automatically, require approval, or are denied. code_exec_permission / --code-exec-permission are accepted as aliases. |
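
The dependencies in the table ("Requires enable_code_execution", "Requires enable_search") are easiest to see in a concrete fragment. The sketch below enables search and code execution individually rather than through agent, and asks for approval before model-requested actions run. The numeric values are arbitrary, and writing agent_permission as a quoted string is an assumption based on the CLI values.

```toml
[runtime]
max_seqs = 16

# Web search plus the only accepted reranker value.
enable_search = true
search_embedding_model = "embedding-gemma"

# Python code execution; the timeout key is only valid because
# enable_code_execution is true.
enable_code_execution = true
code_exec_timeout = 120   # seconds (default: 60)

# Require approval instead of running agent actions automatically.
agent_permission = "ask"
```

Setting agent = true instead would turn on search, code execution, and the shell tool together.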

[server] section (serve only)

| Field | CLI flag | Default | Purpose |
| --- | --- | --- | --- |
| host | --host | 0.0.0.0 | Bind address. |
| port | -p, --port | 1234 | TCP port. |
| no_ui | --no-ui | false | Disable the built-in web UI (mounted at /ui by default). |
| mcp_port | --mcp-port | not set | Also expose the loaded model as an MCP server on this port (JSON-RPC 2.0 at POST /mcp). See serve over MCP. |
| max_tool_rounds | --max-tool-rounds | not set | Default cap on agentic tool loop rounds. Per-request values from the HTTP API override it; the safety cap is 256 when unset. |
| tool_dispatch_url | --tool-dispatch-url | not set | URL to POST tool calls to for server-side execution. Only configurable server-side, never per-request. |
| disable_access_log | --disable-access-log | false | Disable info-level HTTP access logs. |
| access_log_format | --access-log-format | text | Access-log format: text or json. |
| access_log_health | --access-log-health | false | Include health, metrics, docs, and UI requests in HTTP access logs. |
| disable_request_id_header | --disable-request-id-header | false | Stop echoing x-request-id on responses. |
| disable_metrics | --disable-metrics | false | Disable Prometheus HTTP metrics and recorder installation. |

:::caution
The default host = "0.0.0.0" binds on all interfaces, exposing the server to your network. There is no built-in authentication. Set host = "127.0.0.1" for local-only access, or put an authenticating reverse proxy in front before exposing it.
:::

The MCP client configuration (mcp_config) lives under [runtime], not [server]: it applies to run as well as serve.
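
Following the caution above, a local-only [server] section might look like the sketch below. The values for max_tool_rounds and the choice to disable the UI are arbitrary illustrations, not recommendations from the source.

```toml
[server]
host = "127.0.0.1"          # loopback only; the default 0.0.0.0 binds all interfaces
port = 1234
no_ui = true                # skip the built-in web UI at /ui
access_log_format = "json"  # "text" (default) or "json"
max_tool_rounds = 8         # default cap; per-request API values override it
```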

[paged_attn] section

| Field | CLI flag | Default | Purpose |
| --- | --- | --- | --- |
| mode | --paged-attn | auto | auto (on for CUDA, off for Metal/CPU), on, or off. |
| context_len | --pa-context-len | not set | Allocate KV cache for this context length. |
| memory_mb | --pa-memory-mb | not set | KV cache budget in MB. Conflicts with context_len. |
| memory_fraction | --pa-memory-fraction | not set | KV cache budget as fraction of VRAM (0.0 to 1.0). Conflicts with context_len and memory_mb. |
| block_size | --pa-block-size | not set | Tokens per block. |
| cache_type | --pa-cache-type | auto | KV cache quantization type. |
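
Because context_len, memory_mb, and memory_fraction conflict with one another, a config should set at most one of them. The fragment below is a sketch that picks memory_fraction; writing mode as a quoted string is an assumption based on the CLI values, and 0.8 is an arbitrary example.

```toml
[paged_attn]
mode = "on"            # force paged attention instead of relying on auto
memory_fraction = 0.8  # 80% of VRAM for the KV cache
# Do not also set context_len or memory_mb: they conflict with memory_fraction.
```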

[sandbox] section

OS-level isolation for the code-execution subprocess. Mechanics and threat model: sandbox reference.

| Field | CLI flag | Default | Purpose |
| --- | --- | --- | --- |
| mode | --sandbox | auto | auto (on for Linux/macOS, no-op elsewhere), on (missing isolation is a hard error), or off. |
| profile | --sandbox-profile | profile-dependent | developer for agent/code/shell tools, otherwise restricted. |
| max_memory_mb | --sb-max-memory-mb | 2048 | Per-session memory cap in MiB. |
| max_cpu_secs | --sb-max-cpu-secs | 600 | Per-session CPU time cap in seconds. When rlimits apply, this is raised before execution to at least the enabled code or shell timeout. |
| max_procs | --sb-max-procs | 64 | Per-session process/thread cap. |
| network | --sandbox-network | profile-dependent | none, loopback, or full. Defaults to full for developer, loopback for restricted. |
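
A stricter-than-default sandbox could be sketched as follows. The limits are arbitrary examples, and writing mode, profile, and network as quoted strings is an assumption based on the CLI values listed above.

```toml
[sandbox]
mode = "on"             # fail hard if OS-level isolation is unavailable
profile = "restricted"
network = "none"        # stricter than the restricted default (loopback)
max_memory_mb = 1024    # MiB per session (default: 2048)
max_cpu_secs = 120      # may be raised to the enabled code or shell timeout
max_procs = 32          # default: 64
```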

[[models]] array

Each entry defines one loaded model.

| Field | Type | Required | Purpose |
| --- | --- | --- | --- |
| kind | enum | no | Defaults to auto. Set to text, multimodal, diffusion, speech, or embedding only to force a loader. |
| model_id | string | yes | Hugging Face id or local path. |
| tokenizer | path | no | Local tokenizer.json. |
| arch | enum | no | Architecture override (text models). |
| dtype | enum | no | auto, f16, bf16, f32. |
| chat_template | path | no | Chat template override for this model. |
| jinja_explicit | path | no | Jinja override for this model. |
| matformer_config_path | path | no | MatFormer slice config (CSV/JSON). |
| matformer_slice_name | string | no | MatFormer slice to load. |

Each [[models]] entry can carry nested sections whose field shapes mirror the corresponding CLI flags:

| Section | Purpose |
| --- | --- |
| [models.format] | Weight format selection (e.g. GGUF file/repo). |
| [models.adapter] | LoRA/X-LoRA adapter configuration. |
| [models.quantization] | Quantization: quant (front-door, same as --quant), isq (explicit ISQ, same as --isq), from_uqff, isq_organization, imatrix. |
| [models.device] | Device placement: cpu, device_layers, topology, hf_cache, max_seq_len, max_batch_size. cpu must be consistent across every entry. |
| [models.multimodal] | Multimodal load-time caps (image/video/audio limits). |
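
A single entry that uses several of the optional fields might look like the sketch below. The template path is hypothetical, and forcing kind and dtype is shown only for illustration; both can normally be left at auto. In TOML, a [models.*] table attaches to the most recent [[models]] entry above it.

```toml
[[models]]
model_id = "Qwen/Qwen3-4B"
kind = "text"                            # force the text loader instead of auto
dtype = "bf16"
chat_template = "templates/qwen3.jinja"  # hypothetical path; overrides [runtime] chat_template

# Belongs to the [[models]] entry directly above.
[models.quantization]
quant = "4"
```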

Multi-model example

```toml
command = "serve"
default_model_id = "Qwen/Qwen3-4B"

[server]
host = "0.0.0.0"
port = 1234

[runtime]
enable_search = true
search_embedding_model = "embedding-gemma"

[[models]]
model_id = "Qwen/Qwen3-4B"

[models.quantization]
quant = "4"

[[models]]
model_id = "google/gemma-4-E4B-it"

[models.quantization]
quant = "4"
```

Validation

A config is accepted only if all of the following hold; otherwise startup aborts with a message identifying the problem:

  • [[models]] contains at least one entry.
  • default_model_id, when set, matches a model_id in [[models]].
  • cpu, when set, is consistent across all models.
  • search_embedding_model requires enable_search = true (or agent = true).
  • code_exec_python, code_exec_timeout, and code_exec_workdir each require enable_code_execution = true (or agent = true).
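
To make one of these rules concrete, the sketch below contrasts a [runtime] section that would be rejected with one that satisfies the search_embedding_model rule. The rejected variant is commented out so the fragment stays valid TOML; the exact error text is not reproduced here because the source does not give it.

```toml
# Rejected: search_embedding_model is set, but neither enable_search
# nor agent is true.
# [runtime]
# search_embedding_model = "embedding-gemma"

# Accepted: the required switch is enabled alongside it.
[runtime]
enable_search = true
search_embedding_model = "embedding-gemma"
```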

CLI usage notes

Flag interactions that hold on the command line and as TOML keys:

  • quant (CLI --quant, TOML key quant) is the front door: it tries a prebuilt UQFF (Universal Quantized File Format) first and falls back to ISQ (in-situ quantization). It conflicts with isq (--isq, the explicit ISQ level) and from_uqff (--from-uqff). mistralrs tune rejects quant = "auto" (--quant auto) because tune is the recommender.
  • --calibration-file conflicts with --imatrix.
  • --xlora conflicts with --lora. --xlora-order and --tgt-non-granular-index require --xlora; --xlora alone is accepted.
  • --matformer-slice-name requires --matformer-config-path.
  • mistralrs run: --image, --video, and --audio require -i/--input.
  • mistralrs bench: --prompt-len and --depth accept comma-separated values for sweeps.
    • Each --prompt-len value produces a prefill measurement at that prompt length.
    • Each --depth value produces a decode measurement that prefills depth tokens and then generates --gen-len tokens.
    • --depth must be greater than 0 when --gen-len is greater than 0.

Server behavior notes

  • CORS and body limit. Not exposed as CLI flags or TOML keys. Defaults: any origin; methods GET, POST, PUT, DELETE; allowed headers Content-Type, Authorization, x-api-key, anthropic-version, anthropic-beta, x-request-id; exposed headers x-request-id; 50 MB request body limit. Configure programmatically through MistralRsServerRouterBuilder in mistralrs-server-core.
  • Authentication. mistral.rs does not implement authentication. Put a reverse proxy (nginx, Caddy, Traefik) in front for auth and TLS. OpenAI-protocol clients always send Authorization: Bearer ... because the OpenAI SDK requires an API key; mistral.rs does not validate the header.
  • Logging and metrics. Access logs are written to normal server stdout/stderr by default, with request ids and route/status/latency metadata. GET /metrics exposes Prometheus HTTP metrics by default. See observability.
  • Payload logging. -v enables debug detail and -vv enables trace-level file/cache internals; RUST_LOG module filters (e.g. RUST_LOG=mistralrs_core=debug,tower_http=info) override both. -l <path> logs all requests and responses to a file.