This is the comprehensive CLI reference for mistralrs. The CLI provides commands for interactive chat, an OpenAI-compatible HTTP server, a built-in web UI, quantization, benchmarking, and system diagnostics.
Start a model in interactive mode for conversational use, or run a single one-shot request with -i.
```bash
mistralrs run [MODEL_TYPE] -m <MODEL_ID> [OPTIONS]
```
Note: MODEL_TYPE is optional and defaults to auto if not specified. This allows a shorter syntax.
Interactive mode examples:
```bash
# Run a text model interactively (shorthand - auto type is implied)
mistralrs run -m Qwen/Qwen3-4B

# Run with thinking mode enabled
mistralrs run -m Qwen/Qwen3-4B --thinking

# Run a multimodal model
mistralrs run -m google/gemma-3n-E4B-it
```
One-shot mode examples:
When -i is provided, the model processes a single request and exits. Combine with --image, --video, or --audio for multimodal input.
```bash
# Text-only one-shot
mistralrs run -m Qwen/Qwen3-4B -i "What is the capital of France?"

# Describe an image
mistralrs run -m google/gemma-3n-E4B-it --image photo.jpg -i "Describe this image"

# Multiple images
mistralrs run -m google/gemma-3n-E4B-it --image img1.jpg --image img2.png -i "Compare these images"

# Video input
mistralrs run -m google/gemma-3n-E4B-it --video clip.mp4 -i "What happens in this video?"

# Audio input
mistralrs run -m google/gemma-3n-E4B-it --audio recording.wav -i "Transcribe this audio"

# Mixed media (image + audio)
mistralrs run -m google/gemma-3n-E4B-it --image photo.jpg --audio clip.mp3 -i "Describe the image and transcribe the audio"

# URLs work too
mistralrs run -m google/gemma-3n-E4B-it --image https://example.com/photo.jpg -i "What is in this image?"
```
Options:
| Option | Description |
|---|---|
| `--thinking [true\|false]` | Control thinking mode. `--thinking` forces on, `--thinking false` forces off. Omit to use the chat template default. |
| `-i, --input <TEXT>` | One-shot prompt. Sends a single request and exits instead of entering interactive mode. |
| `--image <URL\|PATH>` | Image file path or URL to include (repeatable, requires `-i`) |
| `--video <URL\|PATH>` | Video file path or URL to include (repeatable, requires `-i`) |
| `--audio <URL\|PATH>` | Audio file path or URL to include (repeatable, requires `-i`) |
The run command also accepts all runtime options.
Start an HTTP server with OpenAI-compatible API endpoints.
```bash
mistralrs serve [MODEL_TYPE] -m <MODEL_ID> [OPTIONS]
```
Note: MODEL_TYPE is optional and defaults to auto if not specified.
Examples:
```bash
# Start server on default port 1234 (shorthand)
mistralrs serve -m Qwen/Qwen3-4B

# Explicit auto type (equivalent to above)
mistralrs serve auto -m Qwen/Qwen3-4B

# Start server with web UI
mistralrs serve -m Qwen/Qwen3-4B --ui

# Start server on custom port
mistralrs serve -m Qwen/Qwen3-4B -p 3000

# Start server with MCP support
mistralrs serve -m Qwen/Qwen3-4B --mcp-port 8081
```
Server Options:
| Option | Default | Description |
|---|---|---|
| `-p, --port <PORT>` | 1234 | HTTP server port |
| `--host <HOST>` | 0.0.0.0 | Bind address |
| `--ui` | disabled | Serve built-in web UI at `/ui` |
| `--mcp-port <PORT>` | none | MCP protocol server port |
| `--mcp-config <PATH>` | none | MCP client configuration file |
The serve command also accepts all runtime options.
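Once the server is running, any OpenAI-compatible client can talk to it. A minimal sketch using curl against the chat completions endpoint (assuming the default port and that the loaded model is addressed by its model ID):

```bash
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-4B",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```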
Generate UQFF (Unified Quantized File Format) files from a model. Supports multiple quantization types in a single command.
```bash
mistralrs quantize [MODEL_TYPE] -m <MODEL_ID> --isq <LEVEL>[,<LEVEL>...] -o <OUTPUT>
```
Examples:
```bash
# Quantize to a single type (file output)
mistralrs quantize -m Qwen/Qwen3-4B --isq q4k -o qwen3-4b-uqff/qwen3-4b-q4k.uqff

# Quantize to a single type (directory output, auto-named)
mistralrs quantize -m Qwen/Qwen3-4B --isq q4k -o qwen3-4b-uqff/

# Quantize to multiple types at once (directory output)
mistralrs quantize -m Qwen/Qwen3-4B --isq q4k,q8_0 -o qwen3-4b-uqff/

# Equivalent: repeated --isq flags
mistralrs quantize -m Qwen/Qwen3-4B --isq q4k --isq q8_0 -o qwen3-4b-uqff/

# Quantize a multimodal model
mistralrs quantize -m google/gemma-3n-E4B-it --isq 4 -o gemma3n-E4B-uqff/

# Quantize with imatrix for better quality
mistralrs quantize -m Qwen/Qwen3-4B --isq q4k --imatrix imatrix.dat -o qwen3-4b-uqff/qwen3-4b-q4k.uqff
```
When using directory output mode, the quantize command automatically:
- Generates a README.md model card with Hugging Face frontmatter and example commands
- Prints a `huggingface-cli upload` command to upload your UQFF to Hugging Face

Quantize Options:
| Option | Required | Description |
|---|---|---|
| `-m, --model-id <ID>` | Yes | Model ID or local path |
| `--isq <LEVEL>` | Yes | Quantization level(s), comma-separated or repeated (see ISQ Quantization) |
| `-o, --output <PATH>` | Yes | Output path: `.uqff` file (single ISQ) or directory (auto-named per ISQ type) |
| `--isq-organization <TYPE>` | No | ISQ organization strategy: `default` or `moqe` |
| `--imatrix <PATH>` | No | imatrix file for enhanced quantization |
| `--calibration-file <PATH>` | No | Calibration file for imatrix generation |
| `--no-readme` | No | Skip automatic README.md model card generation |
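To use the result, point `--from-uqff` at a generated file. A short sketch, assuming the auto-named output follows the `q4k-0.uqff` pattern described under UQFF loading below:

```bash
# Load the UQFF produced by the directory-output example above
mistralrs run -m Qwen/Qwen3-4B --from-uqff qwen3-4b-uqff/q4k-0.uqff
```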
Get quantization and device mapping recommendations for a model. The tune command analyzes your hardware and shows all quantization options with their estimated memory usage, context room, and quality trade-offs.
```bash
mistralrs tune [MODEL_TYPE] -m <MODEL_ID> [OPTIONS]
```
Note: MODEL_TYPE is optional and defaults to auto if not specified, which supports all model types.
Examples:
```bash
# Get balanced recommendations (shorthand)
mistralrs tune -m Qwen/Qwen3-4B

# Get quality-focused recommendations
mistralrs tune -m Qwen/Qwen3-4B --profile quality

# Get fast inference recommendations
mistralrs tune -m Qwen/Qwen3-4B --profile fast

# Output as JSON
mistralrs tune -m Qwen/Qwen3-4B --json

# Generate a TOML config file with recommendations
mistralrs tune -m Qwen/Qwen3-4B --emit-config config.toml
```
Example Output (CUDA):
```text
Tuning Analysis
===============
Model: Qwen/Qwen3-4B
Profile: Balanced
Backend: cuda
Total VRAM: 24.0 GB

Quantization Options
--------------------
┌─────────────┬───────────┬────────┬──────────────┬───────────────┬──────────────────┐
│ Quant       │ Est. Size │ VRAM % │ Context Room │ Quality       │ Status           │
├─────────────┼───────────┼────────┼──────────────┼───────────────┼──────────────────┤
│ None (FP16) │ 8.50 GB   │ 35%    │ 48k          │ Baseline      │ ✅ Fits          │
│ Q8_0        │ 4.50 GB   │ 19%    │ 96k          │ Near-lossless │ 🚀 Recommended   │
│ Q6K         │ 3.70 GB   │ 15%    │ 128k (max)   │ Good          │ ✅ Fits          │
│ Q5K         │ 3.20 GB   │ 13%    │ 128k (max)   │ Good          │ ✅ Fits          │
│ Q4K         │ 2.60 GB   │ 11%    │ 128k (max)   │ Acceptable    │ ✅ Fits          │
│ Q3K         │ 2.00 GB   │ 8%     │ 128k (max)   │ Degraded      │ ✅ Fits          │
│ Q2K         │ 1.50 GB   │ 6%     │ 128k (max)   │ Degraded      │ ✅ Fits          │
└─────────────┴───────────┴────────┴──────────────┴───────────────┴──────────────────┘

Recommended Command
-------------------
mistralrs serve -m Qwen/Qwen3-4B --isq q8_0

[INFO] PagedAttention is available (mode: auto)
```
Example Output (Metal):
On macOS with Metal, the command recommends Apple Format Quantization (AFQ) types:
```text
Quantization Options
--------------------
┌─────────────┬───────────┬────────┬──────────────┬───────────────┬──────────────────┐
│ Quant       │ Est. Size │ VRAM % │ Context Room │ Quality       │ Status           │
├─────────────┼───────────┼────────┼──────────────┼───────────────┼──────────────────┤
│ None (FP16) │ 8.50 GB   │ 53%    │ 24k          │ Baseline      │ ✅ Fits          │
│ AFQ8        │ 4.50 GB   │ 28%    │ 56k          │ Near-lossless │ 🚀 Recommended   │
│ AFQ6        │ 3.70 GB   │ 23%    │ 64k          │ Good          │ ✅ Fits          │
│ AFQ4        │ 2.60 GB   │ 16%    │ 128k (max)   │ Acceptable    │ ✅ Fits          │
│ AFQ3        │ 2.00 GB   │ 13%    │ 128k (max)   │ Degraded      │ ✅ Fits          │
│ AFQ2        │ 1.50 GB   │ 9%     │ 128k (max)   │ Degraded      │ ✅ Fits          │
└─────────────┴───────────┴────────┴──────────────┴───────────────┴──────────────────┘
```
Status Legend:

- 🚀 Recommended - the suggested option for the selected profile
- ✅ Fits - fits within available memory
Tune Options:
| Option | Default | Description |
|---|---|---|
| `--profile <PROFILE>` | balanced | Tuning profile: `quality`, `balanced`, or `fast` |
| `--json` | disabled | Output JSON instead of human-readable text |
| `--emit-config <PATH>` | none | Emit a TOML config file with recommended settings |
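The emitted TOML can be fed straight back into the CLI via from-config (described below), so a typical workflow is:

```bash
# Generate recommended settings, then launch with them
mistralrs tune -m Qwen/Qwen3-4B --emit-config config.toml
mistralrs from-config --file config.toml
```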
Run comprehensive system diagnostics and environment checks. The doctor command helps identify configuration issues and validates your system is ready for inference.
```bash
mistralrs doctor [OPTIONS]
```
Examples:
```bash
# Run diagnostics
mistralrs doctor

# Output as JSON
mistralrs doctor --json
```
Checks Performed:
Options:
| Option | Description |
|---|---|
| `--json` | Output JSON instead of human-readable text |
Authenticate with HuggingFace Hub by saving your token to the local cache.
```bash
mistralrs login [OPTIONS]
```
Examples:
```bash
# Interactive login (prompts for token)
mistralrs login

# Provide token directly
mistralrs login --token hf_xxxxxxxxxxxxx
```
The token is saved to the standard HuggingFace cache location:
- Linux/macOS: `~/.cache/huggingface/token`
- Windows: `C:\Users\<user>\.cache\huggingface\token`

If the HF_HOME environment variable is set, the token is saved to `$HF_HOME/token` instead.
Options:
| Option | Description |
|---|---|
| `--token <TOKEN>` | Provide token directly (non-interactive) |
Manage the HuggingFace model cache. List cached models or delete specific models.
```bash
mistralrs cache <SUBCOMMAND>
```
Subcommands:
List all cached models with their sizes and last used times.
```bash
mistralrs cache list
```
Example output:
```text
HuggingFace Model Cache
-----------------------
┌──────────────────────────┬──────────┬─────────────┐
│ Model                    │ Size     │ Last Used   │
├──────────────────────────┼──────────┼─────────────┤
│ Qwen/Qwen3-4B            │ 8.5 GB   │ today       │
│ google/gemma-3n-E4B-it   │ 6.2 GB   │ 2 days ago  │
│ meta-llama/Llama-3.2-3B  │ 5.8 GB   │ 1 week ago  │
└──────────────────────────┴──────────┴─────────────┘

Total: 3 models, 20.5 GB
Cache directory: /home/user/.cache/huggingface/hub
```
Delete a specific model from the cache.
```bash
mistralrs cache delete -m <MODEL_ID>
```
Examples:
```bash
# Delete a specific model
mistralrs cache delete -m Qwen/Qwen3-4B

# Delete a model with organization
mistralrs cache delete -m meta-llama/Llama-3.2-3B
```
Run performance benchmarks to measure prefill and decode speeds.
```bash
mistralrs bench [MODEL_TYPE] -m <MODEL_ID> [OPTIONS]
```
Note: MODEL_TYPE is optional and defaults to auto if not specified.
Examples:
```bash
# Run default benchmark (512 prompt tokens, 128 generated tokens, 3 iterations)
mistralrs bench -m Qwen/Qwen3-4B

# Custom prompt and generation lengths
mistralrs bench -m Qwen/Qwen3-4B --prompt-len 1024 --gen-len 256

# More iterations for better statistics
mistralrs bench -m Qwen/Qwen3-4B --iterations 10

# With ISQ quantization
mistralrs bench -m Qwen/Qwen3-4B --isq q4k
```
Example output:
```text
Benchmark Results
=================
Model: Qwen/Qwen3-4B
Iterations: 3

┌────────────────────────┬─────────────────┬─────────────────┐
│ Test                   │ T/s             │ Latency         │
├────────────────────────┼─────────────────┼─────────────────┤
│ Prefill (512 tokens)   │ 2847.3 ± 45.2   │ 179.82 ms (TTFT)│
│ Decode (128 tokens)    │ 87.4 ± 2.1      │ 11.44 ms/T      │
└────────────────────────┴─────────────────┴─────────────────┘
```
Options:
| Option | Default | Description |
|---|---|---|
| `--prompt-len <N>` | 512 | Number of tokens in prompt (prefill test) |
| `--gen-len <N>` | 128 | Number of tokens to generate (decode test) |
| `--iterations <N>` | 3 | Number of benchmark iterations |
| `--warmup <N>` | 1 | Number of warmup runs (discarded) |
The bench command also accepts all model loading options (ISQ, device mapping, etc.).
Run the CLI from a TOML configuration file. This is the recommended way to run multiple models simultaneously, including models of different types (e.g., text + multimodal + embedding).
See CLI_CONFIG.md for full TOML configuration format details.
```bash
mistralrs from-config --file <PATH>
```
Example:
```bash
mistralrs from-config --file config.toml
```
Multi-model example (config.toml):
command = "serve"
[server]
port = 1234
ui = true
[[models]]
kind = "auto"
model_id = "Qwen/Qwen3-4B"
[[models]]
kind = "multimodal"
model_id = "google/gemma-4-E4B-it"
[[models]]
kind = "embedding"
model_id = "google/embeddinggemma-300m"
Generate shell completions for your shell.
```bash
mistralrs completions <SHELL>
```
Examples:
```bash
# Generate bash completions
mistralrs completions bash > ~/.local/share/bash-completion/completions/mistralrs

# Generate zsh completions
mistralrs completions zsh > ~/.zfunc/_mistralrs

# Generate fish completions
mistralrs completions fish > ~/.config/fish/completions/mistralrs.fish
```
Supported Shells: bash, zsh, fish, elvish, powershell
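For zsh, the directory holding the generated file must be on fpath before compinit runs. A typical ~/.zshrc snippet (standard zsh setup, not specific to mistralrs):

```bash
# ~/.zshrc
fpath=(~/.zfunc $fpath)
autoload -Uz compinit && compinit
```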
Auto-detect model type. This is the recommended option for most models and is the default whenever the explicit model type is omitted.
```bash
mistralrs run -m Qwen/Qwen3-4B
mistralrs serve -m Qwen/Qwen3-4B
```
The auto type supports text, multimodal, and other model types through automatic detection.
Explicit text generation model configuration.
```bash
mistralrs run text -m Qwen/Qwen3-4B
mistralrs serve text -m Qwen/Qwen3-4B
```
Multimodal models that can process images, audio, and text.
```bash
mistralrs run multimodal -m google/gemma-3n-E4B-it
mistralrs serve multimodal -m google/gemma-3n-E4B-it
```
Multimodal Options:
| Option | Description |
|---|---|
| `--max-edge <SIZE>` | Maximum edge length for image resizing (aspect ratio preserved) |
| `--max-num-images <N>` | Maximum number of images per request |
| `--max-image-length <SIZE>` | Maximum image dimension for device mapping |
Image generation models using diffusion.
```bash
mistralrs run diffusion -m black-forest-labs/FLUX.1-schnell
mistralrs serve diffusion -m black-forest-labs/FLUX.1-schnell
```
Speech synthesis models.
```bash
mistralrs run speech -m nari-labs/Dia-1.6B
mistralrs serve speech -m nari-labs/Dia-1.6B
```
Text embedding models. These do not support interactive mode but can be used with the HTTP server.
```bash
mistralrs serve embedding -m google/embeddinggemma-300m
```
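A sketch of querying the embedding model over HTTP, assuming the server exposes the OpenAI-style /v1/embeddings route:

```bash
curl http://localhost:1234/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/embeddinggemma-300m",
    "input": "The quick brown fox"
  }'
```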
In-situ quantization (ISQ) reduces model memory usage by quantizing weights at load time. See the ISQ documentation for details.
Usage:
```bash
# Simple bit-width quantization
mistralrs run -m Qwen/Qwen3-4B --isq 4
mistralrs run -m Qwen/Qwen3-4B --isq 8

# GGML-style quantization
mistralrs run -m Qwen/Qwen3-4B --isq q4_0
mistralrs run -m Qwen/Qwen3-4B --isq q4_1
mistralrs run -m Qwen/Qwen3-4B --isq q4k
mistralrs run -m Qwen/Qwen3-4B --isq q5k
mistralrs run -m Qwen/Qwen3-4B --isq q6k
```
ISQ Organization:
```bash
# Use MOQE organization for potentially better quality
mistralrs run -m Qwen/Qwen3-4B --isq q4k --isq-organization moqe
```
UQFF (Unified Quantized File Format) provides pre-quantized model files for faster loading.
Generate UQFF files:
```bash
mistralrs quantize -m Qwen/Qwen3-4B --isq q4k -o qwen3-4b-uqff/
```
Load from UQFF:
```bash
# Specify just the first shard -- remaining shards are auto-discovered
mistralrs run -m Qwen/Qwen3-4B --from-uqff q4k-0.uqff
```
Multiple UQFF files (semicolon-separated, for different quantizations in one load):
mistralrs run -m Qwen/Qwen3-4B --from-uqff "q4k-0.uqff;q8_0-0.uqff"
Note: Shard auto-discovery means you no longer need to list every shard file. Specifying `q4k-0.uqff` will automatically find `q4k-1.uqff`, `q4k-2.uqff`, etc.
PagedAttention enables efficient memory management for the KV cache. It is automatically enabled on CUDA and disabled on Metal/CPU by default.
Control PagedAttention:
```bash
# Auto mode (default): enabled on CUDA, disabled on Metal/CPU
mistralrs serve -m Qwen/Qwen3-4B --paged-attn auto

# Force enable
mistralrs serve -m Qwen/Qwen3-4B --paged-attn on

# Force disable
mistralrs serve -m Qwen/Qwen3-4B --paged-attn off
```
Memory allocation options (mutually exclusive):
```bash
# Allocate for specific context length (recommended)
mistralrs serve -m Qwen/Qwen3-4B --pa-context-len 8192

# Allocate specific GPU memory in MB
mistralrs serve -m Qwen/Qwen3-4B --pa-memory-mb 4096

# Allocate fraction of GPU memory (0.0-1.0)
mistralrs serve -m Qwen/Qwen3-4B --pa-memory-fraction 0.8
```
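As a rough guide when sizing these flags: KV cache memory grows linearly with context length. A back-of-envelope sketch (the layer and head counts below are illustrative assumptions, not any particular model's):

```text
kv_bytes_per_token = 2 (K and V) x n_layers x n_kv_heads x head_dim x bytes_per_element
e.g. 2 x 36 x 8 x 128 x 2 (FP16) = ~0.28 MB per token
     so --pa-context-len 8192 needs roughly 2.3 GB of KV cache
```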
Additional options:
| Option | Description |
|---|---|
| `--pa-block-size <SIZE>` | Tokens per block (default: 32 on CUDA) |
| `--pa-cache-type <TYPE>` | KV cache quantization type (default: auto) |
Control how model layers are distributed across devices.
Automatic mapping:
```bash
# Use defaults (automatic)
mistralrs run -m Qwen/Qwen3-4B
```
Manual layer assignment:
```bash
# Assign 10 layers to GPU 0, 20 layers to GPU 1
mistralrs run -m Qwen/Qwen3-4B -n "0:10;1:20"

# Equivalent long form
mistralrs run -m Qwen/Qwen3-4B --device-layers "0:10;1:20"
```
CPU-only execution:
```bash
mistralrs run -m Qwen/Qwen3-4B --cpu
```
Topology file:
```bash
mistralrs run -m Qwen/Qwen3-4B --topology topology.yaml
```
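A hypothetical sketch of a topology file, assuming a range-based schema where layer ranges map to ISQ levels and devices; consult the topology documentation for the authoritative format:

```yaml
# topology.yaml (illustrative; exact schema may differ)
0-16:
  isq: q4k
  device: cuda[0]
16-32:
  isq: q8_0
  device: cuda[1]
```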
Custom HuggingFace cache:
```bash
mistralrs run -m Qwen/Qwen3-4B --hf-cache /path/to/cache
```
Device mapping options:
| Option | Default | Description |
|---|---|---|
| `-n, --device-layers <MAPPING>` | auto | Device layer mapping (format: `ORD:NUM;...`) |
| `--topology <PATH>` | none | Topology YAML file for device mapping |
| `--hf-cache <PATH>` | none | Custom HuggingFace cache directory |
| `--cpu` | disabled | Force CPU-only execution |
| `--max-seq-len <LEN>` | 4096 | Max sequence length for automatic device mapping |
| `--max-batch-size <SIZE>` | 1 | Max batch size for automatic device mapping |
Apply LoRA or X-LoRA adapters to models.
LoRA:
```bash
# Single LoRA adapter
mistralrs run -m Qwen/Qwen3-4B --lora my-lora-adapter

# Multiple LoRA adapters (semicolon-separated)
mistralrs run -m Qwen/Qwen3-4B --lora "adapter1;adapter2"
```
X-LoRA:
```bash
# X-LoRA adapter with ordering file
mistralrs run -m Qwen/Qwen3-4B --xlora my-xlora-adapter --xlora-order ordering.json

# With target non-granular index
mistralrs run -m Qwen/Qwen3-4B --xlora my-xlora-adapter --xlora-order ordering.json --tgt-non-granular-index 2
```
Override the model's default chat template.
Use a template file:
```bash
# JSON template file
mistralrs run -m Qwen/Qwen3-4B --chat-template template.json

# Jinja template file
mistralrs run -m Qwen/Qwen3-4B --chat-template template.jinja
```
Explicit Jinja override:
```bash
mistralrs run -m Qwen/Qwen3-4B --jinja-explicit custom.jinja
```
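Chat templates follow the Hugging Face Jinja convention: the template renders a list of messages into the model's prompt format. A minimal illustrative template (the role markers are placeholders, not any real model's special tokens):

```jinja
{%- for message in messages -%}
<|{{ message['role'] }}|>
{{ message['content'] }}
{%- endfor -%}
{%- if add_generation_prompt -%}
<|assistant|>
{%- endif -%}
```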
Enable web search capabilities (requires an embedding model).
```bash
# Enable search with default embedding model
mistralrs run -m Qwen/Qwen3-4B --enable-search

# Specify embedding model
mistralrs run -m Qwen/Qwen3-4B --enable-search --search-embedding-model embedding-gemma
```
Control thinking/reasoning mode for models that support it (like DeepSeek, Qwen3).
```bash
# Force thinking on (equivalent to --thinking true)
mistralrs run -m Qwen/Qwen3-4B --thinking

# Force thinking off
mistralrs run -m Qwen/Qwen3-4B --thinking false
```
--thinking (or --thinking true) forces thinking on. --thinking false forces thinking off. If you omit the flag entirely, mistralrs run defers to the chat template's default behavior. Templates with an explicit thinking toggle use the repository fallback of true when no override is provided.
In interactive mode, thinking content is displayed in gray text before the final response.
These options apply to all commands.
| Option | Default | Description |
|---|---|---|
| `--seed <SEED>` | none | Random seed for reproducibility |
| `-l, --log <PATH>` | none | Log all requests and responses to a file |
| `--token-source <SOURCE>` | cache | HuggingFace authentication token source |
| `-V, --version` | N/A | Print version information and exit |
| `-h, --help` | N/A | Print help message (use with any subcommand) |
Token source formats:
- `cache` - Use cached HuggingFace token (default)
- `literal:<token>` - Use literal token value
- `env:<var>` - Read token from environment variable
- `path:<file>` - Read token from file
- `none` - No authentication

Examples:
```bash
# Set random seed
mistralrs run -m Qwen/Qwen3-4B --seed 42

# Enable logging
mistralrs run -m Qwen/Qwen3-4B --log requests.log

# Use token from environment variable
mistralrs run -m meta-llama/Llama-3.2-3B-Instruct --token-source env:HF_TOKEN
```
These options are available for both run and serve commands.
| Option | Default | Description |
|---|---|---|
| `--max-seqs <N>` | 32 | Maximum concurrent sequences |
| `--no-kv-cache` | disabled | Disable KV cache entirely |
| `--prefix-cache-n <N>` | 16 | Number of prefix caches to hold (0 to disable) |
| `-c, --chat-template <PATH>` | none | Custom chat template file (`.json` or `.jinja`) |
| `-j, --jinja-explicit <PATH>` | none | Explicit JINJA template override |
| `--enable-search` | disabled | Enable web search |
| `--search-embedding-model <MODEL>` | none | Embedding model for search |
These options are common across model types.
| Option | Description |
|---|---|
| `-m, --model-id <ID>` | HuggingFace model ID or local path (required) |
| `-t, --tokenizer <PATH>` | Path to a local tokenizer.json file |
| `-a, --arch <ARCH>` | Model architecture (auto-detected if not specified) |
| `--dtype <TYPE>` | Model data type (default: auto) |
For loading quantized models.
| Option | Description |
|---|---|
| `--format <FORMAT>` | Model format: `plain`, `gguf`, or `ggml` (auto-detected) |
| `-f, --quantized-file <FILE>` | Quantized model filename(s) for GGUF/GGML (semicolon-separated) |
| `--tok-model-id <ID>` | Model ID for tokenizer when using quantized format |
| `--gqa <VALUE>` | GQA value for GGML models (default: 1) |
Examples:
```bash
# Load a GGUF model
mistralrs run -m Qwen/Qwen3-4B --format gguf -f model.gguf

# Multiple GGUF files
mistralrs run -m Qwen/Qwen3-4B --format gguf -f "model-part1.gguf;model-part2.gguf"
```
When running in interactive mode (mistralrs run), the following commands are available:
| Command | Description |
|---|---|
| `\help` | Display help message |
| `\exit` | Quit interactive mode |
| `\system <message>` | Add a system message without running the model |
| `\clear` | Clear the chat history |
| `\temperature <float>` | Set sampling temperature (0.0 to 2.0) |
| `\topk <int>` | Set top-k sampling value (>0) |
| `\topp <float>` | Set top-p sampling value (0.0 to 1.0) |
Examples:
```text
> \system Always respond as a pirate.
> \temperature 0.7
> \topk 50
> Hello!
Ahoy there, matey! What brings ye to these waters?
> \clear
> \exit
```
Multimodal Model Interactive Mode:
For multimodal models, you can include images, audio, or video in your prompts by specifying file paths or URLs:
```text
> Describe this image: /path/to/image.jpg
> Compare these images: image1.png image2.png
> Describe the image and transcribe the audio: photo.jpg recording.mp3
```
Note: The CLI automatically detects paths to supported image, audio, and video files within your prompt. You do not need special syntax; simply paste the absolute or relative path to the file.
- Supported image formats: PNG, JPEG, BMP, GIF, WebP
- Supported audio formats: WAV, MP3, FLAC, OGG
- Supported video formats: MP4, AVI, MOV, MKV, WebM, M4V, GIF (see VIDEO.md)