docs/src/content/docs/guides/deploy/docker.md
The published images ship the unified mistralrs binary as their entrypoint, so any CLI subcommand works directly: serve, run, bench, quantize. Running a container with no arguments prints the CLI help.
docker run --rm -p 1234:1234 -v hf-cache:/data -e HF_TOKEN=<token> \
ghcr.io/ericlbuehler/mistral.rs:latest \
serve -m Qwen/Qwen3-4B
:latest is the CPU image. For NVIDIA GPUs, choose a CUDA tag and add --gpus all:
docker run --rm --gpus all -p 1234:1234 -v hf-cache:/data \
ghcr.io/ericlbuehler/mistral.rs:cuda128-sm89-latest \
serve -m Qwen/Qwen3-4B
The host needs the NVIDIA Container Toolkit; see NVIDIA's install guide. To pin a specific GPU: --gpus '"device=0"'.
All images live at ghcr.io/ericlbuehler/mistral.rs (package page).
- latest (alias of cpu-latest), cpu-latest, cpu-X.Y.Z.
- cuda128-sm{cc}-latest, cuda129-sm121-latest, cuda130-sm{cc}-latest, cuda131-sm{cc}-latest, cuda132-sm{cc}-latest, cuda133-sm90-latest and matching X.Y.Z version tags.
- cuda-sm{cc}-latest, cuda-sm{cc}-X.Y.Z point at the cuda131 image.

Choose the CUDA lane from the CUDA version shown by nvidia-smi:
| Driver reports | Use |
|---|---|
| CUDA 13.3+ on Hopper / sm90 | cuda133-sm90 |
| CUDA 13.2+ on Ampere/Ada / sm80, sm86, sm89 | cuda132-sm{cc} |
| CUDA 13.1+ on Blackwell / sm100, sm120, sm121 | cuda131-sm{cc} |
| CUDA 13.0+ | cuda130-sm{cc} |
| CUDA 12.9+ on GB10 / sm121 | cuda129-sm121 |
| CUDA 12.8+ | cuda128-sm{cc} |
cuTile is included only on lanes whose CUDA toolkit supports that SM.
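One way to read the lane table is top to bottom, taking the first row whose CUDA version and compute capability both match. The sketch below encodes that reading as a shell function. `pick_cuda_lane` is a hypothetical helper, not part of mistral.rs, and it does not check that the resulting tag is actually published for a given SM.

```shell
#!/bin/sh
# Hypothetical helper: map a driver CUDA version and a compute capability
# to a tag lane, assuming the table is read top to bottom (first match wins).
pick_cuda_lane() {
    cuda="$1"   # CUDA version from the nvidia-smi header, e.g. 12.8
    cc="$2"     # compute capability without the "sm" prefix, e.g. 89

    # Treat a bare major version such as "13" as "13.0".
    case "$cuda" in
        *.*) ;;
        *) cuda="$cuda.0" ;;
    esac
    major="${cuda%%.*}"
    minor="${cuda#*.}"
    minor="${minor%%.*}"
    ver=$((major * 100 + minor))   # 12.8 -> 1208, 13.1 -> 1301

    # Rows restricted to specific SMs.
    case "$cc" in
        90)          [ "$ver" -ge 1303 ] && { echo "cuda133-sm90"; return 0; } ;;
        80|86|89)    [ "$ver" -ge 1302 ] && { echo "cuda132-sm$cc"; return 0; } ;;
        100|120|121) [ "$ver" -ge 1301 ] && { echo "cuda131-sm$cc"; return 0; } ;;
    esac

    # Remaining rows, still in table order.
    if [ "$ver" -ge 1300 ]; then
        echo "cuda130-sm$cc"
    elif [ "$ver" -ge 1209 ] && [ "$cc" = "121" ]; then
        echo "cuda129-sm121"
    elif [ "$ver" -ge 1208 ]; then
        echo "cuda128-sm$cc"
    else
        echo "no CUDA lane for CUDA $cuda" >&2
        return 1
    fi
}
```

For example, `pick_cuda_lane 12.8 89` prints `cuda128-sm89`; append `-latest` or a version suffix to form the full tag.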
CUDA compute capability variants (SM80+):
- 80 (A100)
- 86 (A-series workstation/RTX 30)
- 89 (RTX 40/L4)
- 90 (H100)
- 100 (B200)
- 120 (RTX 50)
- 121 (DGX Spark)

See hardware support for the full GPU mapping.
The CPU image and Grace CUDA images (90, 100, 121) are multi-arch (amd64 + arm64). Docker picks the right architecture automatically. The other CUDA tags are x86_64 only.
The *-latest tags publish on releases and on manual CI dispatch from master; version tags pin a release.
For production, pin a version or sha tag rather than *-latest. Model ids also float: -m Qwen/Qwen3-4B resolves to whatever revision is tagged main at download time. The CLI has no revision flag; to pin a revision, use the Rust SDK's with_hf_revision.
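A deployment script can guard against floating tags before it pulls anything. The following is a minimal sketch, assuming a tag counts as floating when it is `latest`, ends in `-latest`, or is missing altogether; `is_pinned_image` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical guard: succeed only when an image reference is pinned,
# i.e. it carries a digest or a tag that is not one of the *-latest tags.
is_pinned_image() {
    ref="$1"

    # A digest reference is always pinned.
    case "$ref" in
        *@sha256:*) return 0 ;;
    esac

    # Look at the last path component only, so a registry port
    # (host:5000/name) is not mistaken for a tag.
    name="${ref##*/}"
    case "$name" in
        *:*) tag="${name##*:}" ;;
        *)   tag="latest" ;;      # Docker's implicit default tag
    esac

    case "$tag" in
        latest|*-latest) return 1 ;;
        *) return 0 ;;
    esac
}
```

Used as `is_pinned_image "$IMAGE" || exit 1`, this rejects `ghcr.io/ericlbuehler/mistral.rs:cuda128-sm89-latest` and accepts a `cpu-X.Y.Z` style tag. It says nothing about the model revision, which still floats as described above.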
- The entrypoint is the mistralrs binary; pass a subcommand and its flags as the container command.
- mistralrs serve listens on port 1234 by default (the image's EXPOSEd port). To change it, change the flag and the mapping together: serve -p 8080 with -p 8080:8080. There is no PORT environment variable.
- HF_HOME=/data is set in the image: mount a volume at /data to persist downloaded weights (they land in /data/hub). HF authentication for gated models: -e HF_TOKEN=<token>.
- Chat templates are available in /chat_templates for models that need one: --chat-template /chat_templates/<file>.json.

From a repository checkout:
# CPU
docker build -t mistralrs:latest -f Dockerfile .
# CUDA (set the compute capability for your GPU)
docker build -t mistralrs:cuda -f Dockerfile.cuda-all \
--build-arg CUDA_COMPUTE_CAP=89 .
Dockerfile.cuda-all accepts CUDA_COMPUTE_CAP, BASE_TAG, and WITH_FEATURES build args. The default base is CUDA 12.8.1 and default features are cuda,cudnn; CI builds add flash-attn, and release images add cutile on supported CUDA/SM pairs.
Dockerfile.cuda-13.0-ubi9 is a Red Hat UBI 9 variant for air-gapped and enterprise deployments.
Persist the cache. Weights are large enough that re-downloading on every restart is wasteful. Mount a named volume or host path at /data.
Health check. /health returns 200 when the server is up. Add a Docker healthcheck:
HEALTHCHECK --interval=30s --timeout=5s --start-period=180s \
CMD curl -fsS http://localhost:1234/health || exit 1
The generous --start-period matters: first-run model loading can take minutes.
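Outside of Docker's own health check, a startup script can poll the same endpoint before sending traffic. This is a minimal sketch that assumes curl is available on the host; `wait_for_health` is a hypothetical helper, and its defaults (36 tries, 5 seconds apart) are chosen only to mirror the 180-second start period.

```shell
#!/bin/sh
# Hypothetical helper: poll /health until it answers with a success status,
# or give up after a fixed number of attempts.
wait_for_health() {
    url="${1:-http://localhost:1234/health}"
    tries="${2:-36}"   # 36 attempts x 5 s is roughly the 180 s start period
    delay="${3:-5}"    # seconds between attempts

    i=0
    while [ "$i" -lt "$tries" ]; do
        # -f makes curl fail on HTTP errors; -sS hides progress but keeps errors.
        if curl -fsS -o /dev/null "$url" 2>/dev/null; then
            return 0
        fi
        i=$((i + 1))
        # Skip the pause after the final failed attempt.
        if [ "$i" -lt "$tries" ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

A caller might write `wait_for_health || { echo "server did not become healthy" >&2; exit 1; }` after starting the container.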
Resource limits. Set --memory and --gpus on docker run to bound the container's resources.
Video input. Install FFmpeg inside the image when serving video-capable models. See set up video input for the Docker snippet and runtime check.
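The operational notes combine into a single docker run invocation. The sketch below only prints the command, so it can be reviewed before it is run. `mistralrs_run_cmd` is a hypothetical helper, the 16g memory limit is a placeholder rather than a recommendation, and placing -p before -m after serve is an assumption about flag ordering.

```shell
#!/bin/sh
# Hypothetical helper: print a docker run command in which the serve port
# flag and the published port mapping always agree.
mistralrs_run_cmd() {
    port="${1:-1234}"                                    # serve port and host port
    model="${2:-Qwen/Qwen3-4B}"
    image="${3:-ghcr.io/ericlbuehler/mistral.rs:latest}" # pin a version tag in production
    memory="${4:-16g}"                                   # placeholder limit, size for your model

    printf 'docker run --rm --memory %s -p %s:%s -v hf-cache:/data %s serve -p %s -m %s\n' \
        "$memory" "$port" "$port" "$image" "$port" "$model"
}
```

Calling `mistralrs_run_cmd 8080` prints a command that publishes 8080:8080 and passes `serve -p 8080`, keeping the two in step. For a CUDA image, add `--gpus all` (or a specific device) to the printed command.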
The pieces above translate directly to Kubernetes:
- Health probe on /health (or a model-aware check; see the production checklist).
- Volume at /data for the Hugging Face cache.
- nvidia.com/gpu resource request for CUDA.

There is no official Helm chart. Contributions welcome.
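In the absence of a chart, the three pieces can be sketched as a plain Deployment manifest. The function below prints one; it is an illustration under stated assumptions, not an official manifest. The names `mistralrs` and `hf-cache`, the probe timings, and the single-GPU limit are all hypothetical, and the PersistentVolumeClaim is assumed to exist already.

```shell
#!/bin/sh
# Hypothetical helper: print a minimal Deployment manifest that wires up
# the /health probe, the /data cache volume, and a GPU resource limit.
render_deployment() {
    # Default image is a floating tag; pass a pinned tag for real use.
    image="${1:-ghcr.io/ericlbuehler/mistral.rs:cuda128-sm89-latest}"
    cat <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistralrs
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mistralrs
  template:
    metadata:
      labels:
        app: mistralrs
    spec:
      containers:
        - name: mistralrs
          image: $image
          # The image entrypoint is the mistralrs binary, so args carry the subcommand.
          args: ["serve", "-m", "Qwen/Qwen3-4B"]
          ports:
            - containerPort: 1234
          # Allow about 180 s for first-run model loading (36 x 5 s).
          startupProbe:
            httpGet:
              path: /health
              port: 1234
            periodSeconds: 5
            failureThreshold: 36
          readinessProbe:
            httpGet:
              path: /health
              port: 1234
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: hf-cache
              mountPath: /data
      volumes:
        - name: hf-cache
          persistentVolumeClaim:
            claimName: hf-cache   # assumed to exist already
EOF
}
```

The output can be piped to `kubectl apply -f -`. Scheduling on `nvidia.com/gpu` requires the NVIDIA device plugin on the cluster, and gated models would additionally need HF_TOKEN supplied, for example from a Secret.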
See also: mistralrs serve options.