docs/ENGINE.md
This document describes internal engine behaviors in mistral.rs.
The mistral.rs engine manages model inference on dedicated background threads. Each loaded model runs in its own engine thread, which handles request queuing, batching, and execution.
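As a rough illustration of this pattern (not the actual mistral.rs internals), an engine thread can be modeled as a worker that owns its own request queue. The `InferenceRequest` type and `spawn_engine_thread` function below are hypothetical names introduced for this sketch:

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical request type standing in for the engine's real request enum.
struct InferenceRequest {
    prompt: String,
    respond_to: mpsc::Sender<String>,
}

// Spawn a dedicated engine thread that owns the model state and drains its
// own request queue; the returned sender is the handle callers submit to.
fn spawn_engine_thread() -> mpsc::Sender<InferenceRequest> {
    let (tx, rx) = mpsc::channel::<InferenceRequest>();
    thread::spawn(move || {
        for request in rx {
            // Placeholder for the real batching + forward-pass logic.
            let output = format!("completion for: {}", request.prompt);
            let _ = request.respond_to.send(output);
        }
    });
    tx
}

fn main() {
    let engine = spawn_engine_thread();
    let (tx, rx) = mpsc::channel();
    engine
        .send(InferenceRequest { prompt: "hello".into(), respond_to: tx })
        .unwrap();
    println!("{}", rx.recv().unwrap());
}
```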
When a text or multimodal model is loaded in a multi-threaded runtime, mistral.rs automatically performs a warmup ("dummy") run. This ensures that CUDA kernel compilation and memory allocation happen during model loading rather than during the first user request.
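The warmup itself is conceptually simple: run one throwaway generation and discard the result. Here is a minimal sketch of the idea, where the `Model` type and `generate` method are stand-ins rather than the real mistral.rs API:

```rust
// Stand-in for the real model; only the side effects of the first call matter.
struct Model;

impl Model {
    fn generate(&self, prompt: &str, max_tokens: usize) -> String {
        // A real implementation would run the forward pass here. On CUDA, the
        // first call pays for kernel compilation and memory-pool growth.
        format!("{} (+{} tokens)", prompt, max_tokens)
    }
}

// Run a tiny "dummy" generation at load time so the one-time costs are paid
// before the first user request arrives.
fn warmup(model: &Model) {
    let _ = model.generate("hello", 1);
}
```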
If the inference engine thread dies unexpectedly (e.g., due to a panic), mistral.rs can automatically recover. This keeps the engine available without manual intervention.
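One common way to implement this pattern in Rust (illustrative only; `supervise` is not a mistral.rs function) is a supervisor loop: a panicked thread surfaces as an `Err` from `JoinHandle::join`, at which point a fresh engine thread is spawned:

```rust
use std::thread;
use std::time::Duration;

// Illustrative supervisor: respawn the engine thread whenever it panics.
fn supervise<F>(spawn_engine: F)
where
    F: Fn() -> thread::JoinHandle<()>,
{
    loop {
        let handle = spawn_engine();
        match handle.join() {
            // Clean shutdown: the engine loop returned normally.
            Ok(()) => break,
            // join() returns Err when the thread panicked; back off and respawn.
            Err(_) => {
                eprintln!("engine thread panicked; restarting");
                thread::sleep(Duration::from_millis(100));
            }
        }
    }
}

fn main() {
    supervise(|| thread::spawn(|| { /* engine loop would run here */ }));
}
```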
Each model loaded in mistral.rs runs in its own dedicated engine thread. In multi-model setups, this allows true parallel inference across different models: requests to one model do not block requests to another, as the sketch below illustrates.
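Continuing with the hypothetical `spawn_engine_thread` and `InferenceRequest` from the first sketch, a multi-model setup amounts to one independent queue per model:

```rust
use std::collections::HashMap;
use std::sync::mpsc;

// One independent engine (and request queue) per model ID, reusing the
// hypothetical `spawn_engine_thread` and `InferenceRequest` from above.
fn spawn_engines(model_ids: &[&str]) -> HashMap<String, mpsc::Sender<InferenceRequest>> {
    model_ids
        .iter()
        .map(|&id| (id.to_string(), spawn_engine_thread()))
        .collect()
}
```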