Agentic runtime for apps

docs/src/content/docs/guides/agents/agentic-runtime.md

mistral.rs can act as a local-first runtime for agent applications. A single runtime request can include:

  • Model generation (chat-completion responses and chunks).
  • Server-side tool execution.
  • Python code execution, sandboxed by default on Linux and macOS.
  • Shell execution, sandboxed by default on Linux and macOS.
  • OpenAI-compatible Skills.
  • OpenAI-compatible file inputs.
  • Web search.
  • Generated images or video frames from tools.
  • Persistent session state.

The most complete app-facing event stream today is /v1/chat/completions with stream: true. It emits normal OpenAI-compatible chunks plus mistral.rs agentic_tool_call_progress Server-Sent Events (SSE).

| Runtime part | What mistral.rs provides |
| --- | --- |
| Model output | Chat-completion responses and streaming chunks. |
| Tool execution | Built-in search, code execution, shell, OpenAI-compatible Skills, MCP (Model Context Protocol) tools, callbacks, or HTTP tool dispatch. |
| Generated media | Captured images and video frames from tools as base64 fields. |
| Files | User-provided input files plus generated output files in the same /v1/files registry. |
| Session state | Reusable session_id values for multi-turn tool and code state. |

Use this when an app wants inference and tool execution in one process rather than running its own tool loop around a model server. Built-in runtime tools are strict by default; whether an action may run at all is governed by permissions and approvals.

How the loop runs

The server-side loop engages for a chat request when any of these hold:

  • The request sets web_search_options (advertises the web search tools).
  • The request includes tools: [{"type":"code_interpreter","container":{"type":"auto"}}] on a server or runner with code execution enabled.
  • The request includes tools: [{"type":"shell","environment":{"type":"container_auto"}}] on the Responses API, or the SDK request enables shell.
  • The request carries tools and server-side executors exist for them (SDK tool_callbacks or connected MCP tools).
  • The request sets max_tool_rounds, or the server has a --tool-dispatch-url.

Otherwise the request is dispatched normally: the model's tool_calls field is returned to the client and the client runs the next round (the standard OpenAI-compatible flow).
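The conditions above can be read as a single predicate. The following is a minimal sketch, not the mistral.rs implementation: the function name `loop_engages` and every key in the `ctx` dictionary (`code_execution_enabled`, `is_responses_api`, `sdk_shell_enabled`, `has_server_side_executors`, `tool_dispatch_url`) are hypothetical names chosen for illustration. Only the request fields (`web_search_options`, `tools`, `max_tool_rounds`) come from the text.

```python
def loop_engages(request: dict, ctx: dict) -> bool:
    """Return True when the server-side tool loop would engage.

    `ctx` is a hypothetical stand-in for server/runner state; its keys
    are assumptions made for this sketch, not real configuration names.
    """
    tools = request.get("tools") or []
    tool_types = {tool.get("type") for tool in tools}

    # The request sets web_search_options (an empty object still counts).
    if request.get("web_search_options") is not None:
        return True
    # code_interpreter tool on a server or runner with code execution enabled.
    if "code_interpreter" in tool_types and ctx.get("code_execution_enabled"):
        return True
    # shell tool on the Responses API, or shell enabled through the SDK.
    if ("shell" in tool_types and ctx.get("is_responses_api")) or ctx.get("sdk_shell_enabled"):
        return True
    # Tools are present and server-side executors exist for them.
    if tools and ctx.get("has_server_side_executors"):
        return True
    # max_tool_rounds on the request, or a dispatch URL on the server.
    if request.get("max_tool_rounds") is not None or ctx.get("tool_dispatch_url"):
        return True
    # Otherwise tool_calls go back to the client (standard OpenAI-compatible flow).
    return False
```

A request that carries only client-side function tools, with no executors and no dispatch URL, returns `False` here, which corresponds to the normal dispatch path.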

Each round:

  1. The engine runs inference. The result either contains tool calls or does not.
  2. No tool calls: the loop exits and the response is forwarded to the client.
  3. Otherwise, the loop emits a progress event with phase calling and the tool arguments.
  4. The tool is executed through one of the paths above (built-in search, code execution, shell, file helpers, a registered callback, or a POST to the dispatch URL). If the model returns more than one tool call, only the first is executed and a warning is logged.
  5. The loop emits a progress event with phase complete and the structured result.
  6. The message history is extended with the assistant's tool-call message and a tool-role response, so the next inference pass sees the outcome.
  7. If the round counter reaches the cap, the loop exits without another tool opportunity.

The cap and dispatch URL are configured on the tool calling page. At termination, the expanded message list is written back to the session, so the next request with the same session id sees the synthesized tool messages as history.
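The round structure can be sketched in a few lines of Python. This is an illustrative model of the numbered steps, not the runtime's code: `run_tool_loop`, `infer`, `execute_tool`, and `emit` are hypothetical names, the message and event dictionaries are simplified, and the sketch assumes `max_rounds` is at least 1. What the real runtime returns when the cap is hit is not specified here, so the sketch simply returns the last inference result.

```python
def run_tool_loop(messages, infer, execute_tool, max_rounds, emit):
    """Hypothetical sketch of the per-round steps described above."""
    history = list(messages)
    rounds = 0
    while True:
        # Step 1: run inference on the current history.
        response = infer(history)
        calls = response.get("tool_calls") or []
        # Step 2: no tool calls means the loop exits with this response.
        if not calls:
            return response, history
        # Only the first tool call is executed; any others are dropped.
        call = calls[0]
        # Step 3: progress event with phase "calling" and the arguments.
        emit({"round": rounds, "tool_name": call["name"],
              "phase": "calling", "data": call["arguments"]})
        # Step 4: execute the tool through whichever path handles it.
        result = execute_tool(call)
        # Step 5: progress event with phase "complete" and the result.
        emit({"round": rounds, "tool_name": call["name"],
              "phase": "complete", "data": result})
        # Step 6: extend history so the next pass sees the outcome.
        history.append({"role": "assistant", "tool_calls": [call]})
        history.append({"role": "tool", "content": result})
        # Step 7: stop once the round counter reaches the cap.
        rounds += 1
        if rounds >= max_rounds:
            return response, history
```

The returned `history` is the expanded message list, which is what would be written back to the session at termination.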

HTTP run stream

Start a server with the tools your app is allowed to use:

```bash
mistralrs serve --agent -m google/gemma-4-E4B-it
```

(--agent enables search, code execution, and shell; see build an agent.)

Send a streaming chat-completions request:

```json
{
  "model": "default",
  "stream": true,
  "messages": [
    {"role": "user", "content": "Use Python to plot sin(x), then explain the chart."}
  ],
  "tools": [{"type": "code_interpreter", "container": {"type": "auto"}}],
  "web_search_options": {},
  "max_tool_rounds": 4,
  "session_id": "analysis-demo"
}
```

Model output arrives as standard chat-completion chunks. Tool progress arrives as named SSE events with round, an opaque tool_name for correlation, phase (calling or complete), and tool-type-specific data:

```text
event: agentic_tool_call_progress
data: {"type":"agentic_tool_call_progress","round":0,"tool_name":"<tool identifier>","phase":"calling","data":{"tool_type":"code_execution","code":"print('hello')"}}
```

Complete events carry tool-type-specific payloads:

  • Code execution: stdout, stderr, images_base64, video_frames_base64, working_directory, execution_time_ms.
  • Shell: commands, stdout, stderr, exit_code, timed_out, and status.
  • Web search: query, results_count.
  • Custom tools: arguments, content.

The full event tables are in the HTTP API reference. Non-streaming responses include the same information as an agentic_tool_calls array.
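On the client side, separating tool progress from model chunks comes down to reading the SSE `event:` name. The sketch below is a small standard-library parser that assumes the stream follows ordinary SSE framing (blank-line-terminated events, unnamed events defaulting to `message`); the helper names `parse_sse` and `tool_progress` are hypothetical and not part of any mistral.rs SDK.

```python
import json


def parse_sse(lines):
    """Yield (event_name, data) pairs from an iterable of SSE text lines."""
    event, data = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event or "message", "\n".join(data)
            event, data = None, []
        elif line.startswith(":"):
            continue  # SSE comment / keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # SSE strips a single leading space after the colon.
            data.append(value[1:] if value.startswith(" ") else value)
    if data:
        yield event or "message", "\n".join(data)


def tool_progress(lines):
    """Yield decoded agentic_tool_call_progress payloads, skipping model chunks."""
    for name, data in parse_sse(lines):
        if name == "agentic_tool_call_progress":
            yield json.loads(data)
```

Any iterable of decoded text lines works as input, for example the lines of a streaming HTTP response body. An app would typically route `phase == "calling"` events to a pending indicator and `phase == "complete"` events to the rendered result, keyed by `round` and `tool_name`.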

Files

A File is a typed output produced by a tool, typically code execution or shell. Each file has a stable id, a name, a format, a MIME type, a size in bytes, and either an inline body or a reference for fetching it. Files are first-class on the wire: they travel alongside the model transcript rather than being buried inside tool output strings.

Declare required outputs on the request to give the model a contract:

```json
{
  "model": "default",
  "messages": [
    {"role": "user", "content": "Generate a sin(x) plot and a CSV of the samples."}
  ],
  "tools": [{"type": "code_interpreter", "container": {"type": "auto"}}],
  "files": [
    {"name": "plot.png", "format": "png"},
    {"name": "samples.csv", "format": "csv", "description": "x, sin(x) columns"}
  ]
}
```

Chat Completions and Anthropic Messages carry produced files in a top-level files array; when streaming, each file is emitted as soon as it is produced via a file_produced SSE event. Each agentic_tool_calls[*] record gains a file_ids field listing the files attributable to that round, so apps can correlate files with the tool that wrote them.
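As a rough illustration of that correlation, the sketch below groups file names by tool-call record in a non-streaming Chat Completions response. It assumes each entry in `files` exposes `id` and `name` keys and indexes records by their position in `agentic_tool_calls`; the exact schema is in the HTTP API reference, and `files_by_tool_call` is a hypothetical helper name.

```python
def files_by_tool_call(response: dict) -> dict:
    """Map each agentic_tool_calls index to the names of the files it produced.

    Assumes file objects carry "id" and "name" keys (an assumption for
    this sketch); ids with no matching file entry are skipped.
    """
    files = {f["id"]: f for f in response.get("files", [])}
    grouped = {}
    for index, call in enumerate(response.get("agentic_tool_calls", [])):
        grouped[index] = [
            files[file_id]["name"]
            for file_id in call.get("file_ids", [])
            if file_id in files
        ]
    return grouped
```

With this mapping an app can render each produced file next to the tool step that wrote it instead of in one flat attachment list.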

Responses follows the OpenAI artifact shape: produced files are attached to assistant output_text content as container_file_citation annotations. The same bytes remain available through GET /v1/files/{id}/content; OpenAI-style clients can also fetch them through GET /v1/containers/{container_id}/files/{file_id}/content.

User-provided files use OpenAI-compatible request shapes: upload with POST /v1/files, reference file_id, or attach inline file_data. Responses also supports input_file.file_url.

Text-like UTF-8 input files get bounded decoded previews. When agentic tools are active, the model can request additional slices if the preview is not enough. Binary files are metadata-only in prompt context, but are still downloadable and mounted into shell/code workdirs when those tools are active. See OpenAI-compatible file inputs.

Behavior worth designing around:

  • Inline vs fetched: bodies up to 8 MB are inlined (text or data_base64); larger bodies are elided from the wire and fetched via GET /v1/files/{id}/content. is_truncated() on the SDK File reports an elided body.
  • Context preview: input files expose decoded text previews of up to 4096 chars per file and 32768 chars per request. Agent-produced text outputs expose a 1024-byte preview. Agentic runs can inspect more text when the relevant file-access tool is available.
  • Undeclared outputs: the Python executor and shell tools accept an outputs parameter for files the model wrote but the request did not declare. Shell also advertises mistralrs_surface_outputs, which lets the model surface files created in earlier shell calls. Files declared via request.files are surfaced regardless; missing declared files come back as error placeholders. Files written but not named in outputs, mistralrs_surface_outputs, or request.files remain internal to the session.
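The inline-versus-fetched rule means a client needs one code path for small files and another for elided bodies. A minimal sketch, assuming the file object exposes `id` alongside the `text` and `data_base64` fields named above; `read_file_bytes` and the injected `fetch` callable are hypothetical, and `fetch` stands in for whatever HTTP client the app uses to issue the GET.

```python
import base64


def read_file_bytes(file: dict, fetch) -> bytes:
    """Return a file's bytes, using the inline body when one is present.

    `fetch` is a caller-supplied function that takes a request path and
    returns the response body as bytes (an assumption for this sketch).
    """
    # Inline text body.
    if file.get("text") is not None:
        return file["text"].encode("utf-8")
    # Inline binary body, base64-encoded.
    if file.get("data_base64") is not None:
        return base64.b64decode(file["data_base64"])
    # Body elided from the wire: fetch it from the content endpoint.
    return fetch(f"/v1/files/{file['id']}/content")
```

Passing `fetch` in keeps the sketch independent of any particular HTTP library and of the server's host and port.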

The exact file schema, metadata endpoint, and content-endpoint status codes are in the HTTP API reference.

Sessions

Use session_id when your app needs continuity across requests: message history, tool records, media, and code-execution state. Session behavior, the export/import/delete endpoints, and lifetime rules live in persist sessions.

SDK boundaries

| Surface | Current behavior |
| --- | --- |
| HTTP | Best surface for live model chunks, tool-progress timelines, files, and agent approval events. |
| Rust SDK | Supports request input files via InputFile and RequestBuilder::with_input_file(...); Model::stream_chat_request yields raw Response::AgenticToolCallProgress events. |
| Python SDK | Supports request input files via InputFile, plus agentic requests, callbacks, code execution, shell, local skill mounts, and sessions. The streaming iterator currently yields model chunks; use HTTP SSE for the full timeline. |
| Web UI | Renders code execution, shell, search, reasoning blocks, generated media, and approval cards inline. |

Full examples: Rust file inputs, Python file inputs, server file inputs, Rust agent, Rust agent streaming, Python agentic tools, HTTP tool rounds, and server Skills.

Security

Code and shell execution run with the permissions of the configured subprocess, inside the sandbox where enabled. Agent mode defaults to the developer sandbox profile, which keeps writes scoped to the session workdir while allowing common local toolchains to run. For untrusted workloads, set profile = "restricted" and tighter network settings in the TOML config, or use the matching CLI flags. Use agent_permission: "ask" or "deny" when an app needs tighter control over server-executed actions; a server-wide ask or deny cannot be loosened by the request (see permissions and approvals). For untrusted users, run mistral.rs in a container or VM, use a low-privilege user, and constrain network access.