# mistralrs-web-chat
> **Deprecated:** The standalone `mistralrs-web-chat` binary is deprecated. Use `mistralrs serve --ui` instead for the same functionality.
>
> Migration:
>
> ```bash
> # Old
> cargo run --release --features cuda --bin mistralrs-web-chat -- --text-model Qwen/Qwen3-4B
>
> # New
> mistralrs serve --ui -m Qwen/Qwen3-4B
> ```
>
> The new built-in UI provides the same features and is accessible at `/ui` when running the server.
A minimal, fast, and modern web chat interface for mistral.rs, supporting text, multimodal, and speech models with drag-and-drop image and file upload, markdown rendering, and multi-model selection.
Press Ctrl+Enter (or Cmd+Enter on Mac) to send messages.

## Quick start

Note: choose the build features based on this guide.
```bash
cargo run --release --features <specify feature(s) here> --bin mistralrs-web-chat -- \
  --text-model Qwen/Qwen3-4B \
  --multimodal-model google/gemma-4-E4B-it \
  --speech-model nari-labs/Dia-1.6B
```
Any combination of `--text-model`, `--multimodal-model`, and `--speech-model` can be specified. `--port` is optional (defaults to 1234).

Options:
```
      --isq <ISQ>                  In-situ quantization to apply. Defaults to Q6K on CPU, AFQ6 on Metal.
                                   Options: Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1, Q2K, Q3K, Q4K, Q5K, Q6K, Q8K,
                                   HQQ1, HQQ2, HQQ3, HQQ4, HQQ8, AFQ2, AFQ4, AFQ6, AFQ8
      --text-model <MODEL>         Text-only models (HuggingFace ID or local path). Can be repeated.
      --multimodal-model <MODEL>   Multimodal models (HuggingFace ID or local path). Can be repeated.
      --speech-model <MODEL>       Speech/TTS models (HuggingFace ID or local path). Can be repeated.
      --enable-search              Enable web search tool (requires embedding model)
      --search-embedding-model <M> Built-in search embedding model (e.g., embedding_gemma)
  -p, --port <PORT>                Port to listen on (default: 1234)
      --host <HOST>                IP address to serve on (default: 0.0.0.0)
      --cpu                        Use CPU only (disable GPU acceleration)
      --temperature <TEMP>         Default temperature for generation (0.0-2.0). Default: 0.7
      --top-p <TOP_P>              Default top_p for generation (0.0-1.0). Default: 0.9
      --top-k <TOP_K>              Default top_k for generation. Default: 40
      --max-tokens <MAX>           Default max tokens to generate. Default: 2048
      --repetition-penalty <PEN>   Default repetition penalty (1.0 = no penalty). Default: 1.1
      --system-prompt <PROMPT>     Default system prompt for all chats
  -h, --help                       Print help
  -V, --version                    Print version
```
## Examples

Basic usage with a text model:

```bash
cargo run --release --features cuda --bin mistralrs-web-chat -- \
  --text-model meta-llama/Llama-3.2-3B-Instruct
```
With custom generation defaults:
```bash
cargo run --release --features cuda --bin mistralrs-web-chat -- \
  --text-model Qwen/Qwen3-4B \
  --temperature 0.8 \
  --max-tokens 4096 \
  --system-prompt "You are a helpful coding assistant."
```
Multiple models with web search:
```bash
cargo run --release --features cuda --bin mistralrs-web-chat -- \
  --text-model Qwen/Qwen3-4B \
  --multimodal-model google/gemma-4-E4B-it \
  --enable-search \
  --search-embedding-model embedding_gemma
```
## HTTP API

| Method | Endpoint | Description |
|---|---|---|
| GET | /ws | WebSocket connection for streaming chat |
| GET | /api/settings | Get server default settings |
| GET | /api/list_models | List available models |
| POST | /api/select_model | Switch active model |
| GET | /api/list_chats | List saved chats |
| POST | /api/new_chat | Create new chat |
| POST | /api/load_chat | Load chat history |
| POST | /api/delete_chat | Delete chat |
| POST | /api/rename_chat | Rename chat |
| POST | /api/upload_image | Upload image (multimodal models) |
| POST | /api/upload_text | Upload text/code file |
| POST | /api/upload_audio | Upload audio file |
| POST | /api/generate_speech | Generate speech (TTS models) |
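The endpoints above can be exercised from any HTTP client. A minimal sketch in Python, assuming the server is running on the default port 1234; the request-body field name for `/api/select_model` is an assumption (check the server source for the exact schema):

```python
import json
import urllib.request

BASE = "http://localhost:1234"  # matches the default --port


def api_request(method, endpoint, body=None):
    """Build (but do not send) a urllib request for one of the endpoints above."""
    data = json.dumps(body).encode() if body is not None else None
    headers = {"Content-Type": "application/json"} if body is not None else {}
    return urllib.request.Request(BASE + endpoint, data=data, method=method, headers=headers)


# GET /api/list_models
models_req = api_request("GET", "/api/list_models")

# POST /api/select_model -- the {"model": ...} body shape is an assumption
select_req = api_request("POST", "/api/select_model", {"model": "Qwen/Qwen3-4B"})
```

Once the server is up, send a built request with `urllib.request.urlopen(models_req)`.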
## WebSocket messages

Messages sent to the WebSocket can include generation parameters:
```json
{
  "content": "Your message here",
  "generation_params": {
    "temperature": 0.8,
    "top_p": 0.95,
    "top_k": 50,
    "max_tokens": 2048,
    "repetition_penalty": 1.1
  },
  "web_search_options": {
    "search_context_size": "medium"
  }
}
```
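Because the server documents ranges for these parameters, a client can validate before sending. A minimal Python sketch (the helper name and the validation are illustrative, not part of the server API); the resulting string is what a client would send over `/ws` with any WebSocket library:

```python
import json


def chat_message(content, *, temperature=0.7, top_p=0.9, top_k=40,
                 max_tokens=2048, repetition_penalty=1.1):
    """Serialize a chat message for the /ws endpoint.

    Defaults mirror the CLI defaults; range checks follow the documented
    bounds (temperature 0.0-2.0, top_p 0.0-1.0).
    """
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be in [0.0, 2.0]")
    if not 0.0 <= top_p <= 1.0:
        raise ValueError("top_p must be in [0.0, 1.0]")
    return json.dumps({
        "content": content,
        "generation_params": {
            "temperature": temperature,
            "top_p": top_p,
            "top_k": top_k,
            "max_tokens": max_tokens,
            "repetition_penalty": repetition_penalty,
        },
    })


msg = chat_message("Your message here", temperature=0.8)
```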
To update the system prompt:
```json
{
  "set_system_prompt": "You are a helpful assistant."
}
```
`mistralrs_settings`.