docs/src/content/docs/reference/troubleshooting.md
Before debugging setup issues, run mistralrs doctor. It reports detected hardware, compiled accelerator features, and Hugging Face connectivity.
For unlisted issues, file an issue on GitHub with a reproducer.
mistralrs: command not found after install
The binary is at ~/.cargo/bin/mistralrs. The directory is added to PATH by rustup, but the change does not apply to the current shell. Open a new shell or run source "$HOME/.cargo/env".
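To see which of the two conditions applies (binary missing versus directory not on PATH), a small standard-library check can help. This is a minimal sketch; cargo_bin_on_path is a hypothetical helper name, and it only inspects the PATH of the process that runs it.

```python
import os
import shutil
from pathlib import Path


def cargo_bin_on_path(path_value: str, home: str) -> bool:
    """Return True if <home>/.cargo/bin is one of the entries of a PATH string."""
    target = Path(home) / ".cargo" / "bin"
    entries = [entry for entry in path_value.split(os.pathsep) if entry]
    return any(Path(entry) == target for entry in entries)


if __name__ == "__main__":
    home = str(Path.home())
    # shutil.which returns None when the command cannot be resolved on PATH.
    print("mistralrs resolves to:", shutil.which("mistralrs"))
    print("~/.cargo/bin on PATH:", cargo_bin_on_path(os.environ.get("PATH", ""), home))
```

If the second line prints False, the current shell has not picked up the rustup change yet.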
flash-attn feature enabled
Flash attention requires compute capability 8.0+ (see hardware support). On older GPUs, drop flash-attn and rebuild:
cuda nccl cudnn on Linux with NCCL installed.
cuda cudnn otherwise.
mistralrs login rejects the token
The token must start with hf_. The validation happens in mistralrs login before saving.
Accept the license on the model's Hugging Face page, then save a token with mistralrs login. The token is stored at ~/.cache/huggingface/token (or $HF_HOME/token).
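To confirm that a token was saved where the text above says it should be, and that it carries the hf_ prefix, something like the following can be used. It is a sketch under the stated path rules only; hf_token_path and has_hf_prefix are hypothetical helper names, and the script never prints the token itself.

```python
import os
from pathlib import Path


def hf_token_path(env=None) -> Path:
    """$HF_HOME/token when HF_HOME is set, otherwise ~/.cache/huggingface/token."""
    env = os.environ if env is None else env
    hf_home = env.get("HF_HOME")
    if hf_home:
        return Path(hf_home) / "token"
    return Path.home() / ".cache" / "huggingface" / "token"


def has_hf_prefix(token: str) -> bool:
    """Check only the prefix; this does not prove the token is valid."""
    return token.strip().startswith("hf_")


if __name__ == "__main__":
    path = hf_token_path()
    if path.is_file():
        print(f"{path}: prefix ok = {has_hf_prefix(path.read_text())}")
    else:
        print(f"{path}: no token file found")
```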
Out of memory on load
Add --quant 4. If still too large, try --quant 2 or split across GPUs with -n "0:N1;1:N2;...". See quantize a model.
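The -n value can be tedious to write by hand for several GPUs. The sketch below builds the string, assuming each entry is a device ordinal followed by the number of layers placed on that device; that reading of N1 and N2, both helper names, and the layer counts in the usage note are illustrative assumptions, not values taken from any model.

```python
def split_layers(total_layers: int, num_devices: int) -> list:
    """Split a layer count as evenly as possible; earlier devices get the remainder."""
    if total_layers <= 0 or num_devices <= 0:
        raise ValueError("total_layers and num_devices must be positive")
    base, extra = divmod(total_layers, num_devices)
    return [base + (1 if i < extra else 0) for i in range(num_devices)]


def device_layers_arg(layers_per_device) -> str:
    """Format per-device layer counts as '0:N1;1:N2;...'."""
    if not layers_per_device:
        raise ValueError("need at least one device")
    return ";".join(
        f"{ordinal}:{count}" for ordinal, count in enumerate(layers_per_device)
    )
```

For a hypothetical 32-layer model on two GPUs, device_layers_arg(split_layers(32, 2)) returns "0:16;1:16". An even split is only a starting point; a GPU that also drives a display may need fewer layers.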
Verify accelerator features are compiled in with mistralrs doctor. If cuda is missing, the binary was built without GPU support.
For CUDA decode throughput, also check whether paged attention is active. Paged decode through FlashInfer (a CUDA attention backend) and CUDA graphs are both enabled by default on compatible CUDA paged decode paths.
CUDA graphs apply to supported single-token decode steps only. They do not speed up prompt prefill. The first time a graph shape is seen, mistral.rs pays warmup and capture overhead; steady-state decode is the part that can improve.
If graph capture or replay fails, mistral.rs logs a warning and disables CUDA graphs for that loaded pipeline. Set MISTRALRS_CUDA_GRAPHS=0 to compare with the normal CUDA path. See CUDA graphs.
max_tokens is most likely too low. Check finish_reason:
length - token limit reached.
stop - generation ended naturally (EOS token) or a configured stop token/string matched.
Connection refused hitting localhost
Check the Server listening on http://... line in the server output to confirm host and port.
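If the listening line is no longer visible, a plain TCP probe can tell whether anything accepts connections on the host and port you expect. This is a minimal standard-library sketch; can_connect is a hypothetical helper, and the host and port must come from your own server output.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

A False result for 127.0.0.1 combined with True for the machine's LAN address (or the reverse) usually points at a host binding mismatch, not a port mismatch.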
The default allows any origin. Custom CORS configuration is only available programmatically through MistralRsServerRouterBuilder.
413 Payload Too Large
The default body limit is 50 MB and is not configurable via the CLI. Configure programmatically through MistralRsServerRouterBuilder.
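On the client side, it can help to estimate the request size before sending, especially when a request embeds base64 images or long documents. The sketch below assumes "50 MB" means 50 * 1024 * 1024 bytes, which the text above does not confirm; the helper names are hypothetical, and the real size depends on how your HTTP client serializes the body.

```python
import json

# Assumption: "50 MB" is read as 50 MiB; adjust if the server counts differently.
DEFAULT_BODY_LIMIT_BYTES = 50 * 1024 * 1024


def encoded_size(payload: dict) -> int:
    """Size in bytes of the payload when serialized as UTF-8 JSON."""
    return len(json.dumps(payload).encode("utf-8"))


def exceeds_body_limit(payload: dict, limit: int = DEFAULT_BODY_LIMIT_BYTES) -> bool:
    """True if the serialized payload is larger than the limit."""
    return encoded_size(payload) > limit
```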
/ui
The UI is on by default. Check that --no-ui was not passed at startup, and that no reverse proxy is rewriting /ui.
The session expired (30-minute idle TTL) or was evicted (128-session cap, LRU). Long-lived sessions need explicit export/import via /v1/sessions/{id}. See persist sessions.
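The two ways a session can disappear interact: a session can be evicted by the cap long before its idle timer runs out. The toy store below models an idle TTL plus an LRU cap with the numbers quoted above. It is an illustration of the policy only, not the mistral.rs implementation, and SessionStore is a hypothetical name.

```python
import time
from collections import OrderedDict

IDLE_TTL_SECONDS = 30 * 60  # 30-minute idle TTL
MAX_SESSIONS = 128          # session cap


class SessionStore:
    def __init__(self, ttl=IDLE_TTL_SECONDS, cap=MAX_SESSIONS, clock=time.monotonic):
        self._ttl = ttl
        self._cap = cap
        self._clock = clock
        # session_id -> (last_used, value), ordered from least to most recently used.
        self._items = OrderedDict()

    def put(self, session_id, value):
        self._items[session_id] = (self._clock(), value)
        self._items.move_to_end(session_id)
        while len(self._items) > self._cap:
            self._items.popitem(last=False)  # evict the least recently used session

    def get(self, session_id):
        entry = self._items.get(session_id)
        if entry is None:
            return None  # never existed, or already evicted
        last_used, value = entry
        now = self._clock()
        if now - last_used > self._ttl:
            del self._items[session_id]  # idle for longer than the TTL
            return None
        # A successful read counts as use: refresh the timer and the LRU position.
        self._items[session_id] = (now, value)
        self._items.move_to_end(session_id)
        return value
```

With many concurrent users, the cap is often the binding limit, which is why long-lived sessions are better exported explicitly.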
from mistralrs import Runner fails with ImportError
The wrong wheel was installed. pip install mistralrs gives the CPU (Linux/Windows) or Metal (macOS) wheel; for NVIDIA, install a CUDA wheel from the release with --find-links, choosing the +cudaNNN.smNN variant that matches your driver and GPU. See Python SDK getting started.
ModelBuilder::build() requires a tokio runtime
The SDK requires a running tokio runtime. Use #[tokio::main] or create a runtime with tokio::runtime::Runtime::new().