README.md
<a name="top"></a>
<!-- <h1 align="center"> mistral.rs </h1> --> <div align="center"> </div> <h3 align="center"> Fast, flexible LLM inference. </h3> <p align="center"> | <a href="https://ericlbuehler.github.io/mistral.rs/"><b>Documentation</b></a> | <a href="https://ericlbuehler.github.io/mistral.rs/quickstart/"><b>Quickstart</b></a> | <a href="https://crates.io/crates/mistralrs"><b>Rust SDK</b></a> | <a href="https://ericlbuehler.github.io/mistral.rs/guides/python/getting-started/"><b>Python SDK</b></a> | <a href="https://discord.gg/SZrecqK8qw"><b>Discord</b></a> | </p> <p align="center"> <a href="https://github.com/EricLBuehler/mistral.rs/stargazers"> </a> </p>/v1/skills bundles and reference them from Responses requests for reusable procedures, helper scripts, and local data. Guide/v1/files, attach Responses input_file or Chat file parts, and mount request files into shell/code sessions. Guidemistralrs serve now exposes Anthropic-compatible /v1/messages and /v1/messages/count_tokens endpoints alongside the OpenAI-compatible /v1 API. GuideMean tokens per second across prompt lengths and decode depths from 128 to 16384 tokens. Decode uses 256 generated tokens. See the full v0.8.2 report for commands, model revisions, host metadata, and appendix tables.
Q8 prefill TPS: mistral.rs UQFF q8 vs llama.cpp GGUF Q8_0
| Model | Hardware | mistral.rs | llama.cpp |
|---|---|---|---|
| Gemma 4 E4B | GB10 | 7395.7 | 3973.7 |
| Gemma 4 E4B | B200 | 27705.6 | 11992.4 |
| Gemma 4 E4B | H100 SXM | 26220.6 | 11702.1 |
| Gemma 4 26B-A4B | GB10 | 2947.0 | 2178.5 |
| Gemma 4 26B-A4B | B200 | 12725.3 | 8503.4 |
| Gemma 4 26B-A4B | H100 SXM | 12362.3 | 8055.1 |
Q8 decode TPS: mistral.rs UQFF q8 vs llama.cpp GGUF Q8_0
| Model | Hardware | mistral.rs | llama.cpp |
|---|---|---|---|
| Gemma 4 E4B | GB10 | 44.1 | 40.5 |
| Gemma 4 E4B | B200 | 241.4 | 194.4 |
| Gemma 4 E4B | H100 SXM | 223.1 | 183.0 |
| Gemma 4 26B-A4B | GB10 | 46.8 | 46.4 |
| Gemma 4 26B-A4B | B200 | 210.9 | 192.2 |
| Gemma 4 26B-A4B | H100 SXM | 199.8 | 183.9 |
BF16 prefill TPS: mistral.rs BF16 vs vLLM BF16
| Model | Hardware | mistral.rs | vLLM |
|---|---|---|---|
| Gemma 4 E4B | GB10 | 5838.9 | 5812.9 |
| Gemma 4 E4B | B200 | 43547.8 | 39431.2 |
| Gemma 4 E4B | H100 SXM | 35852.2 | 39293.7 |
| Gemma 4 26B-A4B | GB10 | 592.2 | 3878.6 |
| Gemma 4 26B-A4B | B200 | 3467.3 | 28532.8 |
| Gemma 4 26B-A4B | H100 SXM | 2766.0 | 26295.9 |
BF16 decode TPS: mistral.rs BF16 vs vLLM BF16
| Model | Hardware | mistral.rs | vLLM |
|---|---|---|---|
| Gemma 4 E4B | GB10 | 25.1 | 18.8 |
| Gemma 4 E4B | B200 | 202.6 | 196.2 |
| Gemma 4 E4B | H100 SXM | 174.4 | 153.0 |
| Gemma 4 26B-A4B | GB10 | 26.9 | 23.2 |
| Gemma 4 26B-A4B | B200 | 159.6 | 220.2 |
| Gemma 4 26B-A4B | H100 SXM | 138.7 | 148.0 |
mistralrs run -m user/model. Architecture, quantization format, and chat template are auto-detected.--quant automatically selects the best quantization format at that level: using a prebuilt UQFF if one is published, otherwise applying ISQ. Docsmistralrs serve process exposes OpenAI-compatible /v1 endpoints and Anthropic-compatible Messages endpoints.mistralrs serve exposes a /metrics endpoint in Prometheus format, recording per-request counts and latency labeled by method, route, and status. Docs/ui by default. Shows reasoning, code execution, plots, and files inline. Edit any message and the new branch runs with its own Python state. Pass --no-ui to disable.mistralrs tune recommends quantization and device mapping from the model config and your detected hardware.Linux/macOS:
curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.sh | sh
Windows (PowerShell):
irm https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.ps1 | iex
Downloads a self-contained prebuilt binary for your platform (Metal on Apple Silicon; per-GPU CUDA or CPU on Linux; CPU on Windows), falling back to a source build if none matches. No Rust or CUDA toolkit needed for the prebuilt path.
Manual installation, accelerator details & other platforms
# Interactive chat
mistralrs run -m Qwen/Qwen3-4B
# One-shot prompt (no interactive session)
mistralrs run -m Qwen/Qwen3-4B -i "What is the capital of France?"
# One-shot with an image
mistralrs run -m google/gemma-4-E4B-it --image photo.jpg -i "Describe this image"
# Agentic REPL: search + code execution + shell from the terminal
mistralrs run --agent -m Qwen/Qwen3-4B
# Start an API server with the built-in web UI
mistralrs serve -m google/gemma-4-E4B-it
For the server command, visit http://localhost:1234/ui for the web chat interface. OpenAI-compatible clients use http://localhost:1234/v1; Anthropic-compatible clients use http://localhost:1234.
mistralrs CLIThe CLI is designed to be zero-config: just point it at a model and go.
run, serve, bench)mistralrs tune recommends quantization and device mapping for your model and hardware# Recommend settings for your hardware and emit a config file
mistralrs tune -m Qwen/Qwen3-4B --emit-config config.toml
# Run using the generated config
mistralrs from-config -f config.toml
# Diagnose system issues (CUDA, Metal, HuggingFace connectivity)
mistralrs doctor
Performance
Quantization (full docs)
Flexibility
Agentic Features
/v1/files, Responses input_file, Chat file, and workdir mounts40+ model families: text (Llama, Qwen 3, GLM, DeepSeek, GPT-OSS, Granite, and more), multimodal (Gemma 4, Qwen 3-VL, Llama 4, Phi 4 multimodal, and more), speech (Voxtral ASR, Dia), image generation (FLUX), and embeddings (Embedding Gemma, Qwen 3 Embedding).
Full compatibility tables | Request a new model
pip install mistralrs
In-process inference from Python: load a model with Runner and send OpenAI-shaped requests, no server required. Accelerator-specific wheels (CUDA, Metal, MKL, Accelerate) are listed in the getting-started guide.
Get started | API reference | Examples
cargo add mistralrs
Embed the engine in a Rust application with the high-level mistralrs crate.
Get started | docs.rs | Crate | Examples
Prebuilt CPU and CUDA images are published to GHCR. Pull commands, tags, and Kubernetes notes are in the Docker guide.
For complete documentation, see the Documentation.
Quick Links:
Contributions welcome! Please open an issue to discuss new features or report bugs. If you want to add a new model, please contact us via an issue and we can coordinate.
This project would not be possible without the excellent work at Candle. Thank you to all contributors!
mistral.rs is not affiliated with Mistral AI.
<p align="right"> <a href="#top">Back to Top</a> </p>