docs/multi_model/overview.md
The mistralrs CLI supports loading and serving multiple models simultaneously, allowing you to switch between different models in the same server instance.
```bash
# Traditional usage - loads one model
mistralrs serve -p 1234 -m meta-llama/Llama-3.2-3B-Instruct

# Load multiple models from a configuration file
mistralrs from-config --file config.json
```
Create a JSON file with model configurations as object keys:
```json
{
  "llama3-3b": {
    "alias": "llama3-3b",
    "Plain": {
      "model_id": "meta-llama/Llama-3.2-3B-Instruct"
    }
  },
  "qwen3-4b": {
    "alias": "qwen3-4b",
    "Plain": {
      "model_id": "Qwen/Qwen3-4B"
    },
    "in_situ_quant": "Q4K"
  }
}
```
"llama3-3b", "qwen3-4b"): Organizational labels (for human readability)model_id inside the model spec). You can override this with alias.alias: Custom model ID (nickname) used in API requestschat_template: Custom chat templatejinja_explicit: JINJA template filenum_device_layers: Device layer configurationin_situ_quant: In-situ quantization settingHow API identifiers work:
alias is set, it becomes the API model IDmodel_id field) is usedUse the model field in your requests to specify which model to use:
```bash
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3-3b",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```
"llama3-3b"), otherwise the full pipeline name (e.g., "meta-llama/Llama-3.2-3B-Instruct")"default" to explicitly request the default modelmodel field is omitted entirely, the default model will be used# Use default model explicitly
curl http://localhost:1234/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [{"role": "user", "content": "Hello!"}]
}'
The default model is either:

- The model specified with `--default-model-id` when starting the server
- Otherwise, the first model in the configuration file

List all available models:

```bash
curl http://localhost:1234/v1/models
```
Returns:
```json
{
  "object": "list",
  "data": [
    {
      "id": "default",
      "object": "model",
      "created": 1234567890,
      "owned_by": "local"
    },
    {
      "id": "llama3-3b",
      "object": "model",
      "created": 1234567890,
      "owned_by": "local"
    },
    {
      "id": "qwen3-4b",
      "object": "model",
      "created": 1234567890,
      "owned_by": "local"
    }
  ]
}
```
Note: The "default" model is always listed first and represents the server's default model. If aliases are configured, they will appear in the list while the canonical pipeline names remain accepted.
Use the `multi-model` subcommand with these options:

- `--config <PATH>` (required): Path to the JSON configuration file
- `--default-model-id <ID>` (optional): Default model ID for requests that don't specify a model (alias or pipeline name)

New syntax:

```bash
mistralrs from-config --file <CONFIG>
```
Example configuration with two text models:

```json
{
  "llama3-3b": {
    "Plain": {
      "model_id": "meta-llama/Llama-3.2-3B-Instruct"
    }
  },
  "qwen3-4b": {
    "Plain": {
      "model_id": "Qwen/Qwen3-4B"
    },
    "in_situ_quant": "Q4K"
  }
}
```
Example mixing a text model and a multimodal model:

```json
{
  "text-model": {
    "Plain": {
      "model_id": "meta-llama/Llama-3.2-3B-Instruct"
    }
  },
  "multimodal-model": {
    "MultimodalPlain": {
      "model_id": "google/gemma-4-E4B-it"
    }
  }
}
```
Example with a GGUF quantized model:

```json
{
  "llama-gguf": {
    "GGUF": {
      "tok_model_id": "meta-llama/Llama-3.2-3B-Instruct",
      "quantized_model_id": "bartowski/Llama-3.2-3B-Instruct-GGUF",
      "quantized_filename": "Llama-3.2-3B-Instruct-Q4_K_M.gguf"
    }
  }
}
```
You can dynamically unload models to free memory and reload them on demand. This is useful for managing GPU memory when working with multiple large models.
Unload a model from memory while preserving its configuration for later reload:
```bash
curl -X POST http://localhost:1234/v1/models/unload \
  -H "Content-Type: application/json" \
  -d '{"model_id": "meta-llama/Llama-3.2-3B-Instruct"}'
```
Response:
```json
{
  "model_id": "meta-llama/Llama-3.2-3B-Instruct",
  "status": "unloaded"
}
```
Manually reload a previously unloaded model:
```bash
curl -X POST http://localhost:1234/v1/models/reload \
  -H "Content-Type: application/json" \
  -d '{"model_id": "meta-llama/Llama-3.2-3B-Instruct"}'
```
Response:
```json
{
  "model_id": "meta-llama/Llama-3.2-3B-Instruct",
  "status": "loaded"
}
```
Get the current status of a specific model:
```bash
curl -X POST http://localhost:1234/v1/models/status \
  -H "Content-Type: application/json" \
  -d '{"model_id": "meta-llama/Llama-3.2-3B-Instruct"}'
```
Response:
```json
{
  "model_id": "meta-llama/Llama-3.2-3B-Instruct",
  "status": "loaded"
}
```
Possible status values:
- `loaded`: Model is loaded and ready
- `unloaded`: Model is unloaded but can be reloaded
- `reloading`: Model is currently being reloaded
- `not_found`: Model ID not recognized
- `no_loader_config`: Model cannot be reloaded (missing loader configuration)
- `internal_error`: An internal error occurred

When a request is sent to an unloaded model, it will automatically reload before processing the request. This enables a "lazy loading" pattern where models are only loaded when needed.
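The lazy-loading flow can be exercised end to end over HTTP. Here is a minimal sketch using the Python `requests` package (an assumption; any HTTP client works) against the endpoints documented above:

```python
import requests

BASE = "http://localhost:1234"
MODEL = "meta-llama/Llama-3.2-3B-Instruct"

# Explicitly free the model's memory.
requests.post(f"{BASE}/v1/models/unload", json={"model_id": MODEL}).raise_for_status()

# Sending a request to the unloaded model triggers an automatic reload.
resp = requests.post(
    f"{BASE}/v1/chat/completions",
    json={"model": MODEL, "messages": [{"role": "user", "content": "Hello!"}]},
)
resp.raise_for_status()

# After serving the request, the model reports as loaded again.
status = requests.post(f"{BASE}/v1/models/status", json={"model_id": MODEL}).json()
print(status["status"])  # expected: "loaded"
```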
The `/v1/models` endpoint now includes status information:
```bash
curl http://localhost:1234/v1/models
```
Response:
```json
{
  "object": "list",
  "data": [
    {
      "id": "default",
      "object": "model",
      "created": 1234567890,
      "owned_by": "local"
    },
    {
      "id": "meta-llama/Llama-3.2-3B-Instruct",
      "object": "model",
      "created": 1234567890,
      "owned_by": "local",
      "status": "loaded"
    },
    {
      "id": "Qwen/Qwen3-4B",
      "object": "model",
      "created": 1234567890,
      "owned_by": "local",
      "status": "unloaded"
    }
  ]
}
```
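The per-model status field enables simple client-side memory housekeeping. The sketch below (Python `requests` again; the keep-list contents are illustrative assumptions) unloads every loaded model except the ones you want to keep resident:

```python
import requests

BASE = "http://localhost:1234"
KEEP = {"default", "meta-llama/Llama-3.2-3B-Instruct"}  # illustrative keep-list

for entry in requests.get(f"{BASE}/v1/models").json()["data"]:
    # The synthetic "default" entry carries no status field; real models do.
    if entry["id"] not in KEEP and entry.get("status") == "loaded":
        requests.post(f"{BASE}/v1/models/unload", json={"model_id": entry["id"]})
```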
The `mistralrs` crate provides `MultiModelBuilder` for loading multiple models and `Model` methods for multi-model management.

By default, model IDs are the pipeline names (usually the Hugging Face model path, e.g., `"google/gemma-4-E4B-it"`). You can provide custom aliases with `add_model_with_alias` for shorter IDs.
```rust
use mistralrs::{IsqType, MultiModelBuilder, TextModelBuilder, MultimodalModelBuilder, TextMessages, TextMessageRole};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Build a multi-model instance with a multimodal model and a text model.
    // Use aliases for shorter model IDs in requests.
    let model = MultiModelBuilder::new()
        .add_model_with_alias(
            "gemma-multimodal",
            MultimodalModelBuilder::new("google/gemma-4-E4B-it") // Multimodal model
                .with_isq(IsqType::Q4K)
                .with_logging(),
        )
        .add_model_with_alias(
            "qwen-text",
            TextModelBuilder::new("Qwen/Qwen3-4B") // Text model
                .with_isq(IsqType::Q4K),
        )
        .with_default_model("gemma-multimodal")
        .build()
        .await?;

    // Send a request to the default model
    let messages = TextMessages::new().add_message(TextMessageRole::User, "Hello!");
    let response = model.send_chat_request(messages).await?;

    // Send a request to a specific model using its alias
    let messages = TextMessages::new().add_message(TextMessageRole::User, "Hello from Qwen!");
    let response = model
        .send_chat_request_with_model(messages, Some("qwen-text"))
        .await?;

    Ok(())
}
```
```rust
// List all models (returns aliases if configured, otherwise pipeline names)
let models = model.list_models()?;

// Get/set the default model
let default = model.get_default_model_id()?;
model.set_default_model_id("qwen-text")?;

// List models with status
let status = model.list_models_with_status()?;
// Returns Vec<(String, ModelStatus)> where ModelStatus is Loaded, Unloaded, or Reloading

// Check if a model is loaded
let is_loaded = model.is_model_loaded("gemma-multimodal")?;

// Unload a model to free memory
model.unload_model("gemma-multimodal")?;

// Reload when needed
model.reload_model("gemma-multimodal").await?;
```
### `_with_model` Methods

All request methods have `_with_model` variants that accept an optional model ID:

- `send_chat_request_with_model(request, model_id: Option<&str>)`
- `stream_chat_request_with_model(request, model_id: Option<&str>)`
- `generate_image_with_model(..., model_id: Option<&str>)`
- `generate_speech_with_model(prompt, model_id: Option<&str>)`
- `generate_embeddings_with_model(request, model_id: Option<&str>)`
- `tokenize_with_model(..., model_id: Option<&str>)`
- `detokenize_with_model(..., model_id: Option<&str>)`
- `config_with_model(model_id: Option<&str>)`
- `max_sequence_length_with_model(model_id: Option<&str>)`
- `re_isq_model_with_model(isq_type, model_id: Option<&str>)`

When `model_id` is `None`, the default model is used. If aliases are configured, you can pass either the alias or the canonical pipeline name.
The Python `Runner` class supports multi-model operations directly.
```python
from mistralrs import Runner, Which, ChatCompletionRequest, MultimodalArchitecture, Architecture

# Create a runner with a multimodal model (Gemma 4 E4B)
runner = Runner(
    which=Which.MultimodalPlain(
        model_id="google/gemma-4-E4B-it",
        arch=MultimodalArchitecture.Gemma4,
    ),
    in_situ_quant="Q4K",
)

# Or create a runner with a text model (Qwen3 4B)
# runner = Runner(
#     which=Which.Plain(
#         model_id="Qwen/Qwen3-4B",
#         arch=Architecture.Qwen3,
#     ),
#     in_situ_quant="Q4K",
# )
```
```python
# List models
models = runner.list_models()
print(f"Available models: {models}")

# Get/set the default model
default = runner.get_default_model_id()
runner.set_default_model_id("google/gemma-4-E4B-it")

# Send a request with a specific model_id
request = ChatCompletionRequest(
    messages=[{"role": "user", "content": "Hello!"}]
)
response = runner.send_chat_completion_request(request, model_id=models[0])
```
If aliases are configured (for example via the server config or the Rust `MultiModelBuilder`), `list_models()` will return those aliases and you can pass them in `model_id`. The canonical pipeline names remain accepted.
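For example, the sketch below fans one prompt out to every registered ID returned by `list_models()`, whichever form those IDs take (the prompt is illustrative, and the OpenAI-style `choices[0].message.content` access is assumed to match the binding's response shape):

```python
# Send the same prompt to every registered model ID (alias or pipeline name).
request = ChatCompletionRequest(
    messages=[{"role": "user", "content": "Summarize multi-model serving in one line."}]
)
for mid in runner.list_models():
    response = runner.send_chat_completion_request(request, model_id=mid)
    print(mid, "->", response.choices[0].message.content)
```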
```python
# List models with their status
status = runner.list_models_with_status()
# Returns a list of (model_id, status) tuples

# Check if a model is loaded
is_loaded = runner.is_model_loaded("google/gemma-4-E4B-it")

# Unload a model to free memory
runner.unload_model("google/gemma-4-E4B-it")

# Reload when needed
runner.reload_model("google/gemma-4-E4B-it")
```
All request methods accept an optional `model_id` parameter:
```python
# Chat completion
response = runner.send_chat_completion_request(request, model_id="model-id")

# Completion
response = runner.send_completion_request(request, model_id="model-id")

# Embeddings
embeddings = runner.send_embedding_request(request, model_id="model-id")

# Image generation
image = runner.generate_image(prompt, response_format, model_id="model-id")

# Speech generation
audio = runner.generate_audio(prompt, model_id="model-id")

# Tokenization
tokens = runner.tokenize_text(text, add_special_tokens=True, model_id="model-id")
text = runner.detokenize_text(tokens, skip_special_tokens=True, model_id="model-id")
```
When `model_id` is `None` or omitted, the default model is used.
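For instance, omitting the parameter routes to the default model (a two-line sketch reusing the `request` object from above):

```python
# Omitting model_id routes the request to the default model.
response = runner.send_chat_completion_request(request)
```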
### `MultiModel` (Rust)

The `MultiModel` struct has been removed. Use `Model` directly with `MultiModelBuilder`:

```rust
// Old (deprecated)
let multi = MultiModel::new(...);
multi.send_chat_request_to_model(request, "model-id").await?;

// New - model IDs are pipeline names by default (aliases optional)
let model = MultiModelBuilder::new()
    .add_model(MultimodalModelBuilder::new("google/gemma-4-E4B-it"))
    .add_model(TextModelBuilder::new("Qwen/Qwen3-4B"))
    .build()
    .await?;
model.send_chat_request_with_model(request, Some("Qwen/Qwen3-4B")).await?;
```
### `MultiModelRunner` (Python)

The `MultiModelRunner` class has been removed. Use `Runner` directly:

```python
# Old (deprecated)
multi_runner = MultiModelRunner(runner)
multi_runner.send_chat_completion_request_to_model(request, "model-id")

# New - model IDs are the registered IDs (aliases if configured)
runner = Runner(which=Which.Plain(model_id="google/gemma-4-E4B-it", ...))
runner.send_chat_completion_request(request, model_id="google/gemma-4-E4B-it")
```