mistral.rs can act as a local-first runtime for agent applications. The most complete app-facing event stream today is /v1/chat/completions with stream: true. It emits normal OpenAI-compatible chunks plus mistral.rs agentic_tool_call_progress Server-Sent Events (SSE).
A single runtime request can include:
| Runtime part | What mistral.rs provides |
|---|---|
| Model output | Chat-completion responses and streaming chunks. |
| Tool execution | Built-in search, code execution, shell, OpenAI-compatible Skills, MCP (Model Context Protocol) tools, callbacks, or HTTP tool dispatch. |
| Generated media | Captured images and video frames from tools as base64 fields. |
| Files | User-provided input files plus generated output files in the same /v1/files registry. |
| Session state | Reusable session_id values for multi-turn tool and code state. |
Use this when an app wants inference and tool execution in one process rather than running its own tool loop around a model server. Built-in runtime tools are strict by default; whether an action may run at all is governed by permissions and approvals.
The server-side loop engages for a chat request when any of these hold:

- The request includes web_search_options (advertises the web search tools).
- The request includes tools: [{"type":"code_interpreter","container":{"type":"auto"}}] on a server or runner with code execution enabled.
- The request includes tools: [{"type":"shell","environment":{"type":"container_auto"}}] on the Responses API, or the SDK request enables shell.
- The request includes tools and server-side executors exist for them (SDK tool_callbacks or connected MCP tools).
- The request sets max_tool_rounds, or the server has a --tool-dispatch-url.

Otherwise the request is dispatched normally: the model's tool_calls field is returned to the client and the client runs the next round (the standard OpenAI-compatible flow).
Each round:

- A progress event with phase calling and the tool arguments.
- A progress event with phase complete and the structured result.
- A tool-role response, so the next inference pass sees the outcome.

The cap and dispatch URL are configured on the tool calling page. At termination, the expanded message list is written back to the session, so the next request with the same session id sees the synthesized tool messages as history.
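The round structure can be pictured with a small sketch. This is an illustration of the general shape only, not the mistral.rs implementation: `run_tool_rounds`, `infer`, `execute`, and the message dictionaries are hypothetical names chosen for the example.

```python
def run_tool_rounds(messages, infer, execute, max_tool_rounds=4, on_progress=print):
    """Illustrative loop: infer, run any requested tools, feed results back.

    `infer` and `execute` are caller-supplied stand-ins (hypothetical) for the
    model and the tool executor; they are not mistral.rs APIs.
    """
    messages = list(messages)  # work on a copy; the caller's list is untouched
    for round_idx in range(max_tool_rounds):
        reply = infer(messages)
        messages.append(reply)
        calls = reply.get("tool_calls") or []
        if not calls:
            break  # the model answered without asking for a tool
        for call in calls:
            on_progress({"round": round_idx, "tool_name": call["name"],
                         "phase": "calling", "data": call["arguments"]})
            result = execute(call)
            on_progress({"round": round_idx, "tool_name": call["name"],
                         "phase": "complete", "data": result})
            # The tool-role message is what the next inference pass reads.
            messages.append({"role": "tool", "tool_call_id": call["id"],
                             "content": str(result)})
    return messages
```

The returned list plays the role of the expanded message list: it contains the original messages plus every assistant and tool message synthesized along the way.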
Start a server with the tools your app is allowed to use:
mistralrs serve --agent -m google/gemma-4-E4B-it
(--agent enables search, code execution, and shell; see build an agent.)
Send a streaming chat-completions request:
{
"model": "default",
"stream": true,
"messages": [
{"role": "user", "content": "Use Python to plot sin(x), then explain the chart."}
],
"tools": [{"type": "code_interpreter", "container": {"type": "auto"}}],
"web_search_options": {},
"max_tool_rounds": 4,
"session_id": "analysis-demo"
}
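As a rough sketch, the same request can be assembled from Python with only the standard library. The base URL is an assumption (adjust host and port to your server), and `build_chat_request` is a hypothetical helper name; the body fields mirror the JSON above.

```python
import json
import urllib.request

# Assumption: the server listens here; change to match your deployment.
BASE_URL = "http://localhost:1234"

def build_chat_request(prompt, session_id, max_tool_rounds=4):
    """Build (but do not send) a streaming chat-completions request."""
    payload = {
        "model": "default",
        "stream": True,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{"type": "code_interpreter", "container": {"type": "auto"}}],
        "web_search_options": {},
        "max_tool_rounds": max_tool_rounds,
        "session_id": session_id,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Accept": "text/event-stream"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` returns a response object that can be read line by line as the stream arrives.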
Model output arrives as standard chat-completion chunks. Tool progress arrives as named SSE events with round, an opaque tool_name for correlation, phase (calling or complete), and tool-type-specific data:
event: agentic_tool_call_progress
data: {"type":"agentic_tool_call_progress","round":0,"tool_name":"<tool identifier>","phase":"calling","data":{"tool_type":"code_execution","code":"print('hello')"}}
Complete events carry tool-type-specific payloads:

- Code execution: stdout, stderr, images_base64, video_frames_base64, working_directory, execution_time_ms.
- Shell: commands, stdout, stderr, exit_code, timed_out, and status.
- Search: query, results_count.
- Other tools: arguments, content.

The full event tables are in the HTTP API reference. Non-streaming responses include the same information as an agentic_tool_calls array.
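A client has to separate the named progress events from the unnamed model chunks. The sketch below is a minimal SSE line parser under two assumptions: lines are already decoded to text, and the stream ends with the OpenAI-style `[DONE]` sentinel. `iter_sse_events` and `split_stream` are hypothetical helper names.

```python
import json

def iter_sse_events(lines):
    """Yield (event_name, data) pairs from an iterable of decoded SSE lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:  # a blank line terminates one event
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # SSE comment or keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            data.append(value[1:] if value.startswith(" ") else value)
    if data:  # flush an event that was not followed by a blank line
        yield event, "\n".join(data)

def split_stream(lines):
    """Separate model chunks from agentic_tool_call_progress events."""
    chunks, progress = [], []
    for event, data in iter_sse_events(lines):
        if data == "[DONE]":  # assumption: OpenAI-style end-of-stream marker
            break
        payload = json.loads(data)
        if event == "agentic_tool_call_progress":
            progress.append(payload)
        else:
            chunks.append(payload)
    return chunks, progress
```

Keeping the two lists separate lets a UI render the model text and a tool timeline independently, joining them on `round` when needed.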
A File is a typed output produced by a tool, typically code execution or shell. Each file has a stable id, a name, a format, a MIME type, a size in bytes, and either an inline body or a reference for fetching it. Files are first-class on the wire: they ride alongside the model transcript rather than being buried inside tool output strings.
Declare required outputs on the request to give the model a contract:
{
"model": "default",
"messages": [
{"role": "user", "content": "Generate a sin(x) plot and a CSV of the samples."}
],
"tools": [{"type": "code_interpreter", "container": {"type": "auto"}}],
"files": [
{"name": "plot.png", "format": "png"},
{"name": "samples.csv", "format": "csv", "description": "x, sin(x) columns"}
]
}
Chat Completions and Anthropic Messages carry produced files in a top-level files array; when streaming, each file is emitted as soon as it is produced via a file_produced SSE event. Each agentic_tool_calls[*] record gains a file_ids field listing the files attributable to that round, so apps can correlate files with the tool that wrote them.
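One way an app might do that correlation on a non-streaming response is sketched below. The top-level `files` array, `agentic_tool_calls`, and `file_ids` come from the text above; the `id` key on each file object and the helper name `files_per_tool_call` are assumptions made for the example, so check the HTTP API reference for the exact schema.

```python
def files_per_tool_call(response):
    """Pair each agentic_tool_calls record with the files it produced.

    Assumes each entry in response["files"] carries an "id" key (hypothetical
    key name) matching the values listed in a record's "file_ids".
    """
    by_id = {f["id"]: f for f in response.get("files", [])}
    pairs = []
    for call in response.get("agentic_tool_calls", []):
        produced = [by_id[fid] for fid in call.get("file_ids", []) if fid in by_id]
        pairs.append((call, produced))
    return pairs
```

Records without a `file_ids` field simply pair with an empty list, so the helper also works for rounds that produced no files.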
Responses follows the OpenAI artifact shape: produced files are attached to assistant output_text content as container_file_citation annotations. The same bytes remain available through GET /v1/files/{id}/content; OpenAI-style clients can also fetch them through GET /v1/containers/{container_id}/files/{file_id}/content.
User-provided files use OpenAI-compatible request shapes: upload with POST /v1/files, reference file_id, or attach inline file_data. Responses also supports input_file.file_url.
Text-like UTF-8 input files get bounded decoded previews. When agentic tools are active, the model can request additional slices if the preview is not enough. Binary files are metadata-only in prompt context, but are still downloadable and mounted into shell/code workdirs when those tools are active. See OpenAI-compatible file inputs.
Behavior worth designing around:

- Small bodies are inlined (text or data_base64); larger bodies are elided from the wire and fetched via GET /v1/files/{id}/content. is_truncated() on the SDK File reports an elided body.
- Tool calls accept an outputs parameter for files the model wrote but the request did not declare. Shell also advertises mistralrs_surface_outputs, which lets the model surface files created in earlier shell calls. Files declared via request.files are surfaced regardless; missing declared files come back as error placeholders. Files written but not named in outputs, mistralrs_surface_outputs, or request.files remain internal to the session.

The exact file schema, metadata endpoint, and content-endpoint status codes are in the HTTP API reference.
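A client can hide the inline-versus-elided distinction behind one helper. This sketch assumes the file object exposes its inline body under `text` or `data_base64` and its identifier under `id`; `file_bytes` and the caller-supplied `fetch` callable are hypothetical names, not SDK functions.

```python
import base64

def file_bytes(file, fetch):
    """Return a file's bytes, using the inline body when one is present.

    `fetch` is a caller-supplied function (hypothetical) that takes a request
    path and returns the response body as bytes.
    """
    if file.get("text") is not None:
        return file["text"].encode("utf-8")
    if file.get("data_base64") is not None:
        return base64.b64decode(file["data_base64"])
    # Body was elided from the wire: fall back to the content endpoint.
    return fetch(f"/v1/files/{file['id']}/content")
```

Taking `fetch` as a parameter keeps the helper independent of any particular HTTP client and makes it easy to test without a running server.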
Use session_id when your app needs continuity across requests: message history, tool records, media, and code-execution state. Session behavior, the export/import/delete endpoints, and lifetime rules live in persist sessions.
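As a small illustration, two turns can share state by sending the same `session_id`. The helper name `session_turn` is hypothetical, and sending only the new user message on the second turn is an assumption about how history is replayed; the persist sessions page is the authority on that.

```python
def session_turn(session_id, text):
    """Build one chat-completions body tied to a reusable session.

    Assumption: the server restores earlier history for this session_id,
    so only the new user message is included here.
    """
    return {
        "model": "default",
        "session_id": session_id,
        "messages": [{"role": "user", "content": text}],
        "tools": [{"type": "code_interpreter", "container": {"type": "auto"}}],
    }

first = session_turn("analysis-demo", "Load the samples into a DataFrame.")
second = session_turn("analysis-demo", "Now plot the first column.")
```

Because both bodies carry `analysis-demo`, the second request is meant to see the variables and tool records left behind by the first.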
| Surface | Current behavior |
|---|---|
| HTTP | Best surface for live model chunks, tool-progress timelines, files, and agent approval events. |
| Rust SDK | Supports request input files via InputFile and RequestBuilder::with_input_file(...); Model::stream_chat_request yields raw Response::AgenticToolCallProgress events. |
| Python SDK | Supports request input files via InputFile, plus agentic requests, callbacks, code execution, shell, local skill mounts, and sessions. The streaming iterator currently yields model chunks; use HTTP SSE for the full timeline. |
| Web UI | Renders code execution, shell, search, reasoning blocks, generated media, and approval cards inline. |
Full examples: Rust file inputs, Python file inputs, server file inputs, Rust agent, Rust agent streaming, Python agentic tools, HTTP tool rounds, and server Skills.
Code and shell execution run with the permissions of the configured subprocess, inside the sandbox where enabled. Agent mode defaults to the developer sandbox profile, which keeps writes scoped to the session workdir while allowing common local toolchains to run. For untrusted workloads, set profile = "restricted" and tighter network settings in the TOML config, or use the matching CLI flags. Use agent_permission: "ask" or "deny" when an app needs tighter control over server-executed actions; a server-wide ask or deny cannot be loosened by the request (see permissions and approvals). For untrusted users, run mistral.rs in a container or VM, use a low-privilege user, and constrain network access.