docs/ENGINE.md
This document describes internal engine behaviors in mistral.rs.
The mistral.rs engine manages model inference on dedicated background threads. Each loaded model runs in its own engine thread, which handles request queuing, batching, and execution.
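As a rough illustration of this pattern (not the actual mistral.rs internals), an engine thread can be modeled as a worker that owns its own request queue. The `InferenceRequest` type and `spawn_engine_thread` function below are hypothetical names introduced for this sketch:

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical request type standing in for the engine's real request enum.
struct InferenceRequest {
    prompt: String,
    respond_to: mpsc::Sender<String>,
}

// Spawn a dedicated engine thread that owns the model state and drains its
// own request queue; the returned sender is the handle callers submit to.
fn spawn_engine_thread() -> mpsc::Sender<InferenceRequest> {
    let (tx, rx) = mpsc::channel::<InferenceRequest>();
    thread::spawn(move || {
        for request in rx {
            // Placeholder for the real batching + forward-pass logic.
            let output = format!("completion for: {}", request.prompt);
            let _ = request.respond_to.send(output);
        }
    });
    tx
}

fn main() {
    let engine = spawn_engine_thread();
    let (tx, rx) = mpsc::channel();
    engine
        .send(InferenceRequest { prompt: "hello".into(), respond_to: tx })
        .unwrap();
    println!("{}", rx.recv().unwrap());
}
```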
When a text or multimodal model is loaded in a multi-threaded runtime, mistral.rs automatically performs a warmup ("dummy") run. This ensures that CUDA kernel compilation and memory allocation happen during model loading rather than during the first user request.
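The warmup itself is conceptually simple: run one throwaway generation and discard the result. Here is a minimal sketch of the idea, where the `Model` type and `generate` method are stand-ins rather than the real mistral.rs API:

```rust
// Stand-in for the real model; only the side effects of the first call matter.
struct Model;

impl Model {
    fn generate(&self, prompt: &str, max_tokens: usize) -> String {
        // A real implementation would run the forward pass here. On CUDA, the
        // first call pays for kernel compilation and memory-pool growth.
        format!("{} (+{} tokens)", prompt, max_tokens)
    }
}

// Run a tiny "dummy" generation at load time so the one-time costs are paid
// before the first user request arrives.
fn warmup(model: &Model) {
    let _ = model.generate("hello", 1);
}
```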
If the inference engine thread dies unexpectedly (e.g., due to a panic), mistral.rs can automatically recover. This keeps the engine available without manual intervention.
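One common way to implement this pattern in Rust (illustrative only; `supervise` is not a mistral.rs function) is a supervisor loop: a panicked thread surfaces as an `Err` from `JoinHandle::join`, at which point a fresh engine thread is spawned:

```rust
use std::thread;
use std::time::Duration;

// Illustrative supervisor: respawn the engine thread whenever it panics.
fn supervise<F>(spawn_engine: F)
where
    F: Fn() -> thread::JoinHandle<()>,
{
    loop {
        let handle = spawn_engine();
        match handle.join() {
            // Clean shutdown: the engine loop returned normally.
            Ok(()) => break,
            // join() returns Err when the thread panicked; back off and respawn.
            Err(_) => {
                eprintln!("engine thread panicked; restarting");
                thread::sleep(Duration::from_millis(100));
            }
        }
    }
}

fn main() {
    supervise(|| thread::spawn(|| { /* engine loop would run here */ }));
}
```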
Each model loaded in mistral.rs runs in its own dedicated engine thread. In multi-model setups, this allows true parallel inference across different models: requests to one model do not block requests to another, as the sketch below illustrates.
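Continuing with the hypothetical `spawn_engine_thread` and `InferenceRequest` from the first sketch, a multi-model setup amounts to one independent queue per model:

```rust
use std::collections::HashMap;
use std::sync::mpsc;

// One independent engine (and request queue) per model ID, reusing the
// hypothetical `spawn_engine_thread` and `InferenceRequest` from above.
fn spawn_engines(model_ids: &[&str]) -> HashMap<String, mpsc::Sender<InferenceRequest>> {
    model_ids
        .iter()
        .map(|&id| (id.to_string(), spawn_engine_thread()))
        .collect()
}
```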