Python SDK getting started

The Python SDK loads the model in-process and wraps the same Rust engine that backs the mistralrs binary.

python
from mistralrs import Runner, Which, ChatCompletionRequest

runner = Runner(
    which=Which.Plain(model_id="Qwen/Qwen3-4B"),
    in_situ_quant="4",
)

response = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {"role": "user", "content": "In one sentence, what is Rust known for?"}
        ],
        max_tokens=256,
    )
)

print(response.choices[0].message.content)

The first run downloads the weights into the Hugging Face cache.

Installing

pip install mistralrs covers CPU (Linux x86_64/aarch64, Windows) and Metal (macOS arm64) with a single package; pip picks the wheel for your platform. Requires Python 3.10 or newer.

bash
pip install mistralrs

NVIDIA GPU (CUDA)

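CUDA wheels are published as GitHub release assets because PyPI cannot select a wheel by GPU. Pick the wheel suffix that matches the CUDA version your driver reports and your GPU's compute capability ({cc} in the table, for example 89 for compute capability 8.9):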

| Driver reports | Wheel suffix |
| --- | --- |
| CUDA 13.1+ | +cuda131.sm{cc} |
| CUDA 12.9 or 13.0 on GB10 / sm121 | +cuda129.sm121 |
| CUDA 12.8 to 13.0 on other GPUs | +cuda128.sm{cc} |

Install with --find-links pointed at the release. Replace 0.8.21 / v0.8.21 with the release you want:

bash
pip install "mistralrs==0.8.21+cuda128.sm89" \
  --find-links https://github.com/EricLBuehler/mistral.rs/releases/expanded_assets/v0.8.21

Look up your GPU's compute capability in hardware support. CUDA wheels bundle the CUDA runtime and use the CUTLASS MoE backend. For cuTile, use the prebuilt CLI binary.
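The wheel table above can be read as a small lookup function. The sketch below is illustrative only: wheel_suffix and parse_version are hypothetical helpers, not part of the mistralrs package. It assumes the rows are checked top to bottom and that any combination the table does not list has no prebuilt wheel. It also does not check that a wheel for a given compute capability was actually published in the release.

```python
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, int]:
    """Turn a 'major.minor' string such as '12.8' into (12, 8)."""
    major, minor = text.strip().split(".")[:2]
    return int(major), int(minor)


def wheel_suffix(driver_cuda: str, compute_capability: str) -> Optional[str]:
    """Return the local version suffix from the table, or None if no row matches.

    driver_cuda: CUDA version the driver reports, e.g. '12.8'.
    compute_capability: e.g. '8.9' for sm89 or '12.1' for sm121.
    """
    cuda = parse_version(driver_cuda)
    cc = "".join(str(part) for part in parse_version(compute_capability))

    if cuda >= (13, 1):
        return f"+cuda131.sm{cc}"
    if cc == "121":
        # GB10 / sm121 row: only CUDA 12.9 and 13.0 are listed.
        return "+cuda129.sm121" if cuda in {(12, 9), (13, 0)} else None
    if (12, 8) <= cuda <= (13, 0):
        return f"+cuda128.sm{cc}"
    return None  # Assumption: anything the table does not list is unsupported.


# Reproduces the requirement string used in the install command above.
print("mistralrs==0.8.21" + wheel_suffix("12.8", "8.9"))
```

The resulting string is what goes inside the quotes of the pip install command, alongside --find-links for the matching release.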

All install paths expose the same from mistralrs import ... API.

The pieces

Runner owns the loaded model. Construction loads the weights; reuse one Runner for the lifetime of the process to avoid reloading.

Which selects the model loader. Which.Plain(model_id="...") is correct for standard text models. Other variants cover multimodal models (Which.MultimodalPlain), GGUF checkpoints (Which.GGUF), embeddings (Which.Embedding), and LoRA adapters (Which.Lora).

in_situ_quant="4" is the equivalent of the CLI's --isq 4: it applies ISQ (in-situ quantization), quantizing the weights to 4 bits at load time. Omit it for full precision.

Full example: plain.

Streaming tokens

Set stream=True to receive an iterator of chunks instead of a single response:

python
from mistralrs import Runner, Which, ChatCompletionRequest

runner = Runner(
    which=Which.Plain(model_id="Qwen/Qwen3-4B"),
    in_situ_quant="4",
)

stream = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[{"role": "user", "content": "Write me a haiku about ownership."}],
        max_tokens=128,
        stream=True,
    )
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()

Each chunk is a ChatCompletionChunkResponse with the OpenAI streaming shape. choices[0].delta.content carries one incremental piece of the reply. It can be None, for example on the final chunk, which carries finish_reason, so the example checks delta before printing.
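When the full reply is needed as one string rather than printed piece by piece, the same loop can accumulate the deltas. The sketch below is a hypothetical helper, not an SDK function. It assumes finish_reason sits on choices[0], as it does in the OpenAI streaming shape, and it uses fake_chunk (a stand-in built from SimpleNamespace) so the logic can be run without loading a model.

```python
from types import SimpleNamespace
from typing import Iterable, Optional, Tuple


def collect_stream(chunks: Iterable) -> Tuple[str, Optional[str]]:
    """Join the delta pieces of a stream and return (text, finish_reason)."""
    parts = []
    finish_reason = None
    for chunk in chunks:
        choice = chunk.choices[0]
        if choice.delta.content:  # None or "" carries no text
            parts.append(choice.delta.content)
        # getattr guards against chunk objects that lack the attribute.
        reason = getattr(choice, "finish_reason", None)
        if reason is not None:
            finish_reason = reason
    return "".join(parts), finish_reason


def fake_chunk(content, finish_reason=None):
    """Hypothetical stand-in with the same attribute layout as a streamed chunk."""
    delta = SimpleNamespace(content=content)
    choice = SimpleNamespace(delta=delta, finish_reason=finish_reason)
    return SimpleNamespace(choices=[choice])


demo = [fake_chunk("Borrowed, "), fake_chunk("not owned."), fake_chunk(None, "stop")]
text, reason = collect_stream(demo)
print(text, reason)
```

With a real Runner, the stream returned by send_chat_completion_request with stream=True would be passed in place of demo.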

Full example: streaming. For async iteration, FastAPI integration, and mid-stream error handling, see streaming from Python.

Notes

The Runner keeps the model in memory for the process lifetime. Requests can be sent sequentially or from multiple threads, all reusing the loaded weights. To swap models, construct a new Runner; the old one releases GPU memory when it goes out of scope.
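Sending requests from multiple threads against one loaded model could look like the sketch below. ask_all is a hypothetical helper and fake_ask stands in for the real call so the example runs without a model; the comment shows how ask might be wired to a shared Runner using only the calls from the examples above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def ask_all(ask: Callable[[str], str], prompts: List[str], workers: int = 4) -> List[str]:
    """Run ask(prompt) for every prompt on a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ask, prompts))


# With the SDK, ask would close over the single shared Runner, for example:
#   def ask(prompt):
#       request = ChatCompletionRequest(
#           model="default",
#           messages=[{"role": "user", "content": prompt}],
#           max_tokens=128,
#       )
#       return runner.send_chat_completion_request(request).choices[0].message.content


def fake_ask(prompt: str) -> str:
    """Hypothetical stand-in for a call into a shared Runner."""
    return f"echo: {prompt}"


print(ask_all(fake_ask, ["What is a borrow?", "What is a lifetime?"]))
```

The point of the design is that every worker calls into the same Runner, so the weights are loaded once regardless of the number of threads.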

Chat history is not tracked. Each call to send_chat_completion_request is independent; multi-turn conversation means assembling the messages list yourself, appending each new user question and prior assistant reply.
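One way to assemble the messages list for a multi-turn conversation is a small wrapper that records both sides of each exchange. Conversation is a hypothetical class, not part of the SDK. It takes a send callable that receives the messages list and returns the reply text; fake_send stands in for a function that would build a ChatCompletionRequest and call send_chat_completion_request.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class Conversation:
    """Keeps the messages list that each independent request must carry."""

    def __init__(self, send: Callable[[List[Message]], str]):
        self._send = send
        self.messages: List[Message] = []

    def ask(self, question: str) -> str:
        self.messages.append({"role": "user", "content": question})
        # Pass a copy so the callee cannot mutate the stored history.
        reply = self._send(list(self.messages))
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def fake_send(messages: List[Message]) -> str:
    """Hypothetical stand-in: reports how many messages it received."""
    return f"saw {len(messages)} message(s)"


chat = Conversation(fake_send)
print(chat.ask("What is Rust known for?"))
print(chat.ask("And what about its borrow checker?"))
```

The second call sends three messages (user, assistant, user), which is how the model sees the earlier turn.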

The full Python surface (embeddings, speech, image generation, multimodal requests) is documented in the Python reference.