<a name="top"></a>
<!-- <h1 align="center"> mistral.rs </h1> --> <div align="center"> </div> <h3 align="center"> Fast, flexible LLM inference. </h3> <p align="center"> | <a href="https://ericlbuehler.github.io/mistral.rs/"><b>Documentation</b></a> | <a href="https://crates.io/crates/mistralrs"><b>Rust SDK</b></a> | <a href="https://ericlbuehler.github.io/mistral.rs/tutorials/03-python-sdk/"><b>Python SDK</b></a> | <a href="https://discord.gg/SZrecqK8qw"><b>Discord</b></a> | </p> <p align="center"> <a href="https://github.com/EricLBuehler/mistral.rs/stargazers"> </a> </p>

Mean tokens per second across prompt lengths and decode depths from 128 to 16384 tokens. Decode uses 256 generated tokens. See the full v0.8.2 report for commands, model revisions, host metadata, and appendix tables.

Q8 prefill TPS: mistral.rs UQFF q8 vs llama.cpp GGUF Q8_0
| Model | Hardware | mistral.rs | llama.cpp |
|---|---|---|---|
| Gemma 4 E4B | GB10 | 7395.7 | 3973.7 |
| Gemma 4 E4B | B200 | 27705.6 | 11992.4 |
| Gemma 4 E4B | H100 SXM | 26220.6 | 11702.1 |
| Gemma 4 26B-A4B | GB10 | 2947.0 | 2178.5 |
| Gemma 4 26B-A4B | B200 | 12725.3 | 8503.4 |
| Gemma 4 26B-A4B | H100 SXM | 12362.3 | 8055.1 |
Q8 decode TPS: mistral.rs UQFF q8 vs llama.cpp GGUF Q8_0
| Model | Hardware | mistral.rs | llama.cpp |
|---|---|---|---|
| Gemma 4 E4B | GB10 | 44.1 | 40.5 |
| Gemma 4 E4B | B200 | 241.4 | 194.4 |
| Gemma 4 E4B | H100 SXM | 223.1 | 183.0 |
| Gemma 4 26B-A4B | GB10 | 46.8 | 46.4 |
| Gemma 4 26B-A4B | B200 | 210.9 | 192.2 |
| Gemma 4 26B-A4B | H100 SXM | 199.8 | 183.9 |
BF16 prefill TPS: mistral.rs BF16 vs vLLM BF16
| Model | Hardware | mistral.rs | vLLM |
|---|---|---|---|
| Gemma 4 E4B | GB10 | 5838.9 | 5812.9 |
| Gemma 4 E4B | B200 | 43547.8 | 39431.2 |
| Gemma 4 E4B | H100 SXM | 35852.2 | 39293.7 |
| Gemma 4 26B-A4B | GB10 | 592.2 | 3878.6 |
| Gemma 4 26B-A4B | B200 | 3467.3 | 28532.8 |
| Gemma 4 26B-A4B | H100 SXM | 2766.0 | 26295.9 |
BF16 decode TPS: mistral.rs BF16 vs vLLM BF16
| Model | Hardware | mistral.rs | vLLM |
|---|---|---|---|
| Gemma 4 E4B | GB10 | 25.1 | 18.8 |
| Gemma 4 E4B | B200 | 202.6 | 196.2 |
| Gemma 4 E4B | H100 SXM | 174.4 | 153.0 |
| Gemma 4 26B-A4B | GB10 | 26.9 | 23.2 |
| Gemma 4 26B-A4B | B200 | 159.6 | 220.2 |
| Gemma 4 26B-A4B | H100 SXM | 138.7 | 148.0 |
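
The tables above are easier to compare as ratios. The sketch below copies only the B200 rows and divides the mistral.rs figure by the baseline figure. The names `B200_ROWS`, `relative_throughput`, and `rows_behind_baseline` are illustrative and not part of mistral.rs.

```python
# B200 rows copied from the four tables above, in tokens per second.
# Each entry: (table, model, mistral.rs TPS, baseline TPS).
B200_ROWS = [
    ("Q8 prefill vs llama.cpp", "Gemma 4 E4B", 27705.6, 11992.4),
    ("Q8 prefill vs llama.cpp", "Gemma 4 26B-A4B", 12725.3, 8503.4),
    ("Q8 decode vs llama.cpp", "Gemma 4 E4B", 241.4, 194.4),
    ("Q8 decode vs llama.cpp", "Gemma 4 26B-A4B", 210.9, 192.2),
    ("BF16 prefill vs vLLM", "Gemma 4 E4B", 43547.8, 39431.2),
    ("BF16 prefill vs vLLM", "Gemma 4 26B-A4B", 3467.3, 28532.8),
    ("BF16 decode vs vLLM", "Gemma 4 E4B", 202.6, 196.2),
    ("BF16 decode vs vLLM", "Gemma 4 26B-A4B", 159.6, 220.2),
]


def relative_throughput(mistralrs_tps, baseline_tps):
    """Return mistral.rs throughput as a multiple of the baseline."""
    return mistralrs_tps / baseline_tps


def rows_behind_baseline(rows):
    """Return (table, model) pairs where mistral.rs is slower than the baseline."""
    return [
        (table, model)
        for table, model, ours, theirs in rows
        if relative_throughput(ours, theirs) < 1.0
    ]


# Print one ratio per row; values above 1.00x favor mistral.rs.
for table, model, ours, theirs in B200_ROWS:
    print(f"{table:26s} {model:16s} {relative_throughput(ours, theirs):.2f}x")
```

On these figures, Q8 prefill for Gemma 4 E4B on B200 comes out at roughly 2.3x llama.cpp, while BF16 prefill for Gemma 4 26B-A4B is roughly 0.12x vLLM. Both BF16 rows for the 26B-A4B model on B200 fall below 1.0x.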
- Run any model with `mistralrs run -m user/model`. Architecture, quantization format, and chat template are auto-detected.
- `--quant` automatically selects the best quantization format at that level, using a prebuilt UQFF if one is published and otherwise applying ISQ.
- The web UI is served at `/ui` by default. It shows reasoning, code execution, plots, and files inline. Edit any message and the new branch runs with its own Python state. Pass `--no-ui` to disable.
- `mistralrs tune` benchmarks your system and picks optimal quantization + device mapping.

Linux/macOS:
curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.sh | sh
Windows (PowerShell):
irm https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.ps1 | iex
Manual installation & other platforms
# Interactive chat
mistralrs run -m Qwen/Qwen3-4B
# One-shot prompt (no interactive session)
mistralrs run -m Qwen/Qwen3-4B -i "What is the capital of France?"
# One-shot with an image
mistralrs run -m google/gemma-4-E4B-it --image photo.jpg -i "Describe this image"
# Agentic REPL: search + code execution from the terminal
mistralrs run --agent -m Qwen/Qwen3-4B
# Start an API server with the built-in web UI
mistralrs serve -m google/gemma-4-E4B-it
Once the server is running, visit http://localhost:1234/ui for the web chat interface.
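
The running server can also be called from a script. The sketch below uses only the Python standard library and rests on two assumptions that are not stated in this README: that the server exposes an OpenAI-compatible chat route at `/v1/chat/completions`, and that `"default"` is accepted as the model name over HTTP as it is in the Python SDK example. The helper names `build_chat_request` and `send_chat` are hypothetical.

```python
import json
import urllib.request

# Port 1234 comes from the README; the route is an assumption
# (OpenAI-compatible chat completions).
BASE_URL = "http://localhost:1234"
CHAT_PATH = "/v1/chat/completions"


def build_chat_request(prompt, model="default", max_tokens=256, base_url=BASE_URL):
    """Build a POST request carrying a single-turn chat payload."""
    payload = {
        "model": model,  # assumed to resolve to the model being served
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        base_url + CHAT_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With `mistralrs serve` running, `print(send_chat("What is the capital of France?"))` would print the reply, provided the assumptions above hold. Building the request is kept separate from sending it so the payload can be inspected without a live server.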
mistralrs CLI

The CLI is designed to be zero-config: just point it at a model and go.

- Subcommands include `run`, `serve`, and `bench`.
- Use `mistralrs tune` to automatically benchmark and configure optimal settings for your hardware.

# Auto-tune for your hardware and emit a config file
mistralrs tune -m Qwen/Qwen3-4B --emit-config config.toml
# Run using the generated config
mistralrs from-config -f config.toml
# Diagnose system issues (CUDA, Metal, HuggingFace connectivity)
mistralrs doctor
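
The tune and from-config commands above can be chained from a script. The sketch below is a minimal wrapper, assuming `mistralrs` is on `PATH`. It uses only the subcommands and flags shown above, and the function names are hypothetical.

```python
import shlex
import subprocess


def tune_command(model, config_path="config.toml"):
    """Argument list for: mistralrs tune -m MODEL --emit-config PATH."""
    return ["mistralrs", "tune", "-m", model, "--emit-config", config_path]


def from_config_command(config_path="config.toml"):
    """Argument list for: mistralrs from-config -f PATH."""
    return ["mistralrs", "from-config", "-f", config_path]


def tune_then_run(model, config_path="config.toml"):
    """Tune first, then launch from the emitted config.

    check=True raises CalledProcessError if either step exits non-zero,
    so a failed tune never starts a run with a stale config.
    """
    subprocess.run(tune_command(model, config_path), check=True)
    subprocess.run(from_config_command(config_path), check=True)


# Show the exact command lines without executing anything.
print(shlex.join(tune_command("Qwen/Qwen3-4B")))
print(shlex.join(from_config_command()))
```

Calling `tune_then_run("Qwen/Qwen3-4B")` runs both steps in order. Passing argument lists instead of a shell string avoids quoting problems with model IDs and paths.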
Performance
Quantization (full docs)
Flexibility
Agentic Features
Request a new model | Full compatibility tables
pip install mistralrs # or mistralrs-cuda, mistralrs-metal, mistralrs-mkl, mistralrs-accelerate
from mistralrs import Runner, Which, ChatCompletionRequest

# Load the model and apply in-situ quantization (ISQ) at level 4.
runner = Runner(
    which=Which.Plain(model_id="Qwen/Qwen3-4B"),
    in_situ_quant="4",
)

# Send a single-turn chat request and print the reply.
res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=256,
    )
)
print(res.choices[0].message.content)
Python SDK | Installation | Examples | Cookbook
cargo add mistralrs
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, TextMessages, MultimodalModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    let model = MultimodalModelBuilder::new("google/gemma-4-E4B-it")
        .with_isq(IsqType::Q4K)
        .with_logging()
        .build()
        .await?;

    let messages = TextMessages::new().add_message(
        TextMessageRole::User,
        "Hello!",
    );

    let response = model.send_chat_request(messages).await?;

    println!("{:?}", response.choices[0].message.content);

    Ok(())
}
For quick containerized deployment:
docker pull ghcr.io/ericlbuehler/mistral.rs:latest
docker run --gpus all -p 1234:1234 ghcr.io/ericlbuehler/mistral.rs:latest \
    serve -m Qwen/Qwen3-4B
For production use, we recommend installing the CLI directly for maximum flexibility.
For complete documentation, see https://ericlbuehler.github.io/mistral.rs/.
Contributions welcome! Please open an issue to discuss new features or report bugs. If you want to add a new model, please contact us via an issue and we can coordinate.
This project would not be possible without the excellent work at Candle. Thank you to all contributors!
mistral.rs is not affiliated with Mistral AI.
<p align="right"> <a href="#top">Back to Top</a> </p>