docs/src/content/docs/guides/models/model-family-notes.mdx
import { Tabs, TabItem } from '@astrojs/starlight/components';
Most models need nothing beyond mistralrs run -m <id> (see Run any model). This page collects the per-family exceptions. The full architecture inventory is in the supported models reference.
Qwen3 and SmolLM3 are hybrid reasoning models; their chat templates enable thinking by default. Toggle it per request. Inline /think and /no_think prompt tags work everywhere; the --thinking flag and the enable_thinking field do the same without editing user text (true forces on, false forces off, omit/None for the template default).
mistralrs run --thinking false -m Qwen/Qwen3-4B
--thinking applies to both one-shot and interactive use. Prompt tags also work inline:
How many rs are in blueberry? /no_think
Are you sure? /think
curl http://localhost:1234/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [{"role": "user", "content": "How many rs are in blueberry?"}],
"enable_thinking": false
}'
runner.send_chat_completion_request(
ChatCompletionRequest(
model="default",
messages=[{"role": "user", "content": "How many rs are in blueberry?"}],
enable_thinking=False,
)
)
Qwen3 also publishes FP8 pre-quantized checkpoints; pass the FP8 model ID directly when you want those weights instead of runtime ISQ (in-situ quantization).
MoE (Mixture of Experts) families (DeepSeek V2/V3, GLM-4.7, GLM-4.7-Flash, Phi 3.5 MoE, Qwen3 MoE, Qwen3-VL MoE, Qwen3.5 MoE) support MoQE (Mixture of Quantized Experts): quantizing only the routed experts, which dominate memory, while leaving the rest of the model alone. Enable it with --isq-organization moqe; it composes with either --isq <level> or --quant <level>:
mistralrs run --isq 4 --isq-organization moqe -m Qwen/Qwen3-30B-A3B
In the Python SDK, pass organization=IsqOrganization.MoQE inside Which.Plain(...) or Which.MultimodalPlain(...). Expect small output differences between quantization levels: router decisions are sensitive to numerical noise.
DeepSeek V2, DeepSeek V3 (including non-distill R1, which uses the V3 architecture), and GLM-4.7-Flash use MLA (Multi-head Latent Attention). The KV cache stores a low-dimensional latent instead of full K/V, so the cache footprint is substantially smaller than standard attention at the same context length.
On CUDA (Unix builds), a specialized MLA decode kernel is used when all of the following hold:
A parallel fast path covers prefill with prefix caching (paged attention on, CUDA device). Otherwise the generic attention path reconstructs the latent per step.
MISTRALRS_NO_MLA=1 forces the generic path; use it when debugging suspected MLA kernel issues, and try --paged-attn off as a sanity check for unexpected paged-attention behavior. Background: the DeepSeek V2 paper.
GPT-OSS experts are stored pre-quantized in MXFP4 (4-bit microscaling float), and its attention uses per-head sinks. Load it without a quantization flag first:
mistralrs run -m openai/gpt-oss-20b
ISQ applies only to the attention layers (and lm_head); the expert weights are already quantized.
Qwen3 Next mixes Gated Delta Network (linear attention) layers with full softmax attention, so its cost profile at long contexts differs from a pure softmax model. Qwen3-Coder-Next checkpoints use the same loader.
IBM Granite 4.0 checkpoints (e.g. ibm-granite/granite-4.0-micro) mix Mamba-2 recurrent layers with attention layers. They load through auto-detection like any other text model.
Gemma repos are gated: accept the license on the Hugging Face model page, then authenticate with mistralrs login.
Gemma 4 accepts image, audio, and video parts mixed in one message, and enforces its tool-call format through constrained decoding by default; see tool calling.
MatFormer-trained models encode multiple model sizes in one checkpoint; the desired slice is selected at load time with two values:
matformer_config_path: path to the slice config file (CSV or JSON) shipped with the model card.matformer_slice_name: the named slice within that file.Without these, the default (full) configuration loads. Gemma 3n (google/gemma-3n-E4B-it) is the MatFormer model in the supported list; the bundled matformer_configs/gemma3n.csv contains the full E4B configuration, the official E2B slice, and intermediate E1.96B-E3.79B slices:
mistralrs run -m google/gemma-3n-E4B-it \
--matformer-config-path matformer_configs/gemma3n.csv \
--matformer-slice-name "Config for E2.49B (block-level)"
The same slice selection is available on every surface:
--matformer-config-path / --matformer-slice-name on run, serve, and bench.matformer_config_path / matformer_slice_name.matformer_config_path / matformer_slice_name on the Which selectors.with_matformer_config_path / with_matformer_slice_name on the model builders.Use the full configuration for quality and smaller slices for constrained devices.
Mistral Small 3 checkpoints can do tool calling, but some repos do not ship the right chat template. Use the bundled one:
mistralrs serve --quant 4 \
--jinja-explicit chat_templates/mistral_small_tool_call.jinja \
-m mistralai/Mistral-Small-3.2-24B-Instruct-2506
Mistral-backed LLaVA checkpoints work with the default template. Vicuna-backed checkpoints need the Vicuna template:
mistralrs run -m llava-hf/llava-v1.6-vicuna-7b-hf \
-c chat_templates/vicuna.json --image photo.jpg -i "Describe this image"
Per-request video frame-sampling overrides are not exposed. In multi-turn conversations reusing prefix cache entries, pixel inputs are narrowed per turn by grid count, not image count.
Llama 4 Scout supports up to 10M tokens of context. Using the full window requires paged attention with a large memory budget, generally with multi-GPU tensor parallelism.
For most multimodal models the text backbone holds most of the parameters, so device mapping and topology apply mainly to the text portion; the vision, audio, or video encoder stays on its supported device path.
Phi 3.5 Vision works best with a single image; multiple images are resized together. Phi 4 Multimodal accepts audio and image parts in the same message. Phi 3.5 MoE has 16 experts and routes each token to 2 of them; it benefits from MoQE.
Block-diffusion generation has its own page: block-diffusion models.
File an issue on GitHub, with a reproducer when possible.