docs/src/content/docs/guides/serve/openai-compatible-apis.md
mistralrs serve puts a local model behind OpenAI-compatible endpoints under /v1. OpenAI SDKs and compatible clients work unchanged with http://localhost:1234/v1 as the base URL.
mistralrs serve -m Qwen/Qwen3-4B
Then send a request:
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [
      {"role": "user", "content": "Write a haiku about local inference."}
    ],
    "max_tokens": 128
  }'
With a single -m model, the request model is "default" (or omitted). In multi-model serving, use a model id exactly as it appears in GET /v1/models.
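The model-selection rule above can be sketched in a few lines of client code. This is a minimal illustration, not part of mistral.rs: the helper names are hypothetical, and it assumes GET /v1/models returns the usual OpenAI list shape (a "data" array of objects with an "id" field).

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:1234/v1"  # default port used throughout this guide


def fetch_models(base_url: str = BASE_URL) -> dict:
    """GET /v1/models and return the decoded JSON body (needs a running server)."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        return json.load(resp)


def choose_model_id(models_payload: dict, preferred: Optional[str] = None) -> str:
    """Pick the value for the request's "model" field.

    Assumes the OpenAI list shape: {"object": "list", "data": [{"id": ...}, ...]}.
    """
    ids = [entry["id"] for entry in models_payload.get("data", [])]
    if preferred is not None:
        # Multi-model serving: the id must match GET /v1/models exactly.
        if preferred not in ids:
            raise ValueError(f"{preferred!r} is not loaded; available: {ids}")
        return preferred
    if len(ids) <= 1:
        # Single-model serving: "default" is accepted.
        return "default"
    raise ValueError(f"several models are loaded, pick one of: {ids}")
```

Against a live server, choose_model_id(fetch_models()) yields a value that can be passed straight to the model field of a request.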
First time serving a model? The Quickstart walks through installation, Hugging Face authentication for gated models, and the first run.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-used")
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Say hello from mistral.rs."}],
)
print(response.choices[0].message.content)
The api_key is required by the client but not validated by the server; see authentication. Set stream=True for token-by-token output (full example).
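The OpenAI SDK decodes streamed responses for you. For clients that read the HTTP stream directly, the sketch below shows one way to pull text out of it. It assumes the server emits OpenAI-style server-sent events: "data:" lines carrying chat.completion.chunk JSON, terminated by "data: [DONE]". The function name is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines (assumed wire format)."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            # The first chunk often carries only the role, so content may be absent.
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text
```

Feeding it the decoded lines of a streaming response body and joining the results reconstructs the full message.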
| Endpoint | Purpose |
|---|---|
| GET /v1/models | List loaded models. |
| POST /v1/chat/completions | Chat, streaming, tool calling, multimodal inputs, and mistral.rs agentic extensions. |
| POST /v1/responses | OpenAI Responses API: response objects, polling, background runs, cancellation. |
| POST /v1/skills | Upload OpenAI-compatible Skills. |
| GET /v1/skills | List uploaded skills. |
| POST /v1/skills/{skill_id}/versions | Upload a new version of an existing skill. |
| POST /v1/messages | Anthropic Messages API (base URL without /v1). |
| POST /v1/completions | Legacy text completions. |
| POST /v1/embeddings | Embedding generation. |
| POST /v1/images/generations | Image generation. |
| POST /v1/audio/speech | Text to speech. |
| POST /v1/files | Upload OpenAI-compatible user files. |
| GET /v1/files | List uploaded and generated files. |
Every path with full request and response schemas is in the generated HTTP API reference. Streaming events, authentication, and protocol semantics are in the HTTP API reference; field-level compatibility notes (including Responses API restrictions) are in OpenAI compatibility.
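Every path in the table already includes the /v1 prefix, which is why an Anthropic client takes the bare host as its base URL while an OpenAI client takes the host plus /v1. The sketch below builds requests for the JSON endpoints using only the standard library. The helper name is hypothetical, and multipart uploads such as POST /v1/files are out of scope.

```python
import json
import urllib.request
from typing import Optional

SERVER = "http://localhost:1234"  # bare host; paths below carry the /v1 prefix


def build_request(path: str, body: Optional[dict] = None,
                  server: str = SERVER) -> urllib.request.Request:
    """Build (but do not send) a request for one of the JSON endpoints.

    A request without a body becomes a GET; one with a body becomes a JSON POST.
    """
    url = server.rstrip("/") + path
    if body is None:
        return urllib.request.Request(url, method="GET")
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to urllib.request.urlopen sends it, for example build_request("/v1/models") to list loaded models.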
:::caution[Compatibility gaps]
Most OpenAI-compatible fields work, but a few common ones have limitations:

- seed, user, stream_options, metadata, parallel_tool_calls - accepted but ignored.
- code_interpreter supports only {"container":{"type":"auto"}}; OpenAI code-interpreter container ids and container.file_ids are not supported.
- web_search does not support image search or external_web_access: false.
- shell supports environment.type = "container_auto" and OpenAI-compatible uploaded skill_reference entries; local environments, container references, and inline container-created skills are not implemented.
- file_url, but binary formats are not converted with OpenAI's private PDF/image/spreadsheet extraction pipeline.
- dimensions (embeddings) - errors rather than truncating.

Full list in OpenAI compatibility.
:::
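Because a dimensions value on an embeddings request produces an error, a client that wants shorter vectors has to shorten them itself. The sketch below is one possible client-side workaround, not a mistral.rs feature: it keeps the first dims components and rescales to unit length. Truncation only preserves quality for models trained to support it, so treat this as an assumption to verify for your model.

```python
import math
from typing import List, Sequence


def truncate_embedding(vector: Sequence[float], dims: int) -> List[float]:
    """Keep the first `dims` components and renormalize to unit L2 length."""
    if dims <= 0 or dims > len(vector):
        raise ValueError(f"dims must be between 1 and {len(vector)}, got {dims}")
    head = list(vector[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero prefix cannot be normalized
    return [x / norm for x in head]
```

Call it on each vector returned by POST /v1/embeddings, with the request itself sent without a dimensions field.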
A live Swagger UI for the running server is at http://localhost:1234/docs.
OpenAI-compatible function tools work on Chat Completions and Responses, including strict: true for JSON-Schema-constrained tool arguments. See tool calling.
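As a rough illustration of a strict function tool, the sketch below builds a definition in the OpenAI Chat Completions shape. The helper and the get_weather tool are hypothetical. Marking every property as required and setting additionalProperties to false follows OpenAI's strict-mode rules, which this server is assumed to mirror. The Responses API uses a flatter tool shape, so check the tool calling guide before reusing this there.

```python
def strict_function_tool(name: str, description: str, properties: dict) -> dict:
    """Build a Chat Completions function tool with strict argument validation."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "strict": True,
            "parameters": {
                "type": "object",
                "properties": properties,
                # Strict mode (per OpenAI's rules) expects every property to be
                # required and no undeclared properties.
                "required": list(properties),
                "additionalProperties": False,
            },
        },
    }


# Hypothetical example tool; pass as tools=[weather_tool] in a chat request.
weather_tool = strict_function_tool(
    "get_weather",
    "Look up the current weather for a city.",
    {"city": {"type": "string"}, "unit": {"type": "string", "enum": ["c", "f"]}},
)
```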
response_format with json_schema and the grammar extension constrain output server-side. See structured output.
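A request body using response_format might look like the sketch below. It follows the OpenAI json_schema layout (a name, a schema, and a strict flag), which the server is assumed to accept unchanged. The helper name and the example schema are illustrative only, and the grammar extension is not shown.

```python
def structured_chat_payload(prompt: str, schema_name: str, schema: dict,
                            model: str = "default") -> dict:
    """Build a Chat Completions body whose output is constrained to `schema`."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": schema_name, "schema": schema, "strict": True},
        },
    }


# Hypothetical schema: a single haiku with a title.
haiku_schema = {
    "type": "object",
    "properties": {"title": {"type": "string"}, "lines": {"type": "array", "items": {"type": "string"}}},
    "required": ["title", "lines"],
    "additionalProperties": False,
}
payload = structured_chat_payload("Write a haiku about local inference.", "haiku", haiku_schema)
```

The message content in the response is then a JSON string, so json.loads on it should yield an object matching the schema.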
Start the server with agentic capabilities to use server-side tools and agentic fields. Chat Completions uses web_search_options for web search and tools: [{"type":"code_interpreter","container":{"type":"auto"}}] for code execution. Responses uses hosted tools in the tools array for web search, code execution, shell, and OpenAI-compatible Skills.
mistralrs serve --agent -m Qwen/Qwen3-4B
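With the server started as above, a Chat Completions body can opt into the server-side tools. The sketch below assembles one from the two fields named in this section. The helper is hypothetical, and the empty object for web_search_options is an assumption borrowed from the OpenAI API; the server may accept further options.

```python
def agentic_chat_payload(prompt: str, web_search: bool = False,
                         code_interpreter: bool = False,
                         model: str = "default") -> dict:
    """Build a Chat Completions body that enables server-side agentic tools."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    if web_search:
        # Assumed: an empty options object is enough to turn web search on.
        payload["web_search_options"] = {}
    if code_interpreter:
        # Only the "auto" container type is supported, per the compatibility notes.
        payload["tools"] = [{"type": "code_interpreter", "container": {"type": "auto"}}]
    return payload
```

Without either flag the function returns an ordinary chat request, so the same helper works against a server started without --agent.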
For tool timelines, generated files, search, code execution, shell, Skills, and session state, see agentic runtime for apps.
-p/--port (default 1234) and --host (default 0.0.0.0) control the bind address. --no-ui disables the web UI at /ui. All flags are in the CLI reference; the equivalent config file for multi-model, repeatable deployments is the TOML config reference, which also covers CORS, body limits, authentication, and logging.
:::caution
The default --host 0.0.0.0 accepts connections from any host on the network. Use --host 127.0.0.1 to restrict to the local machine, and put authentication in a reverse proxy before exposing the server.
:::
Runnable client scripts live in examples/server/ and render under server examples:
| Example | What it shows |
|---|---|
| chat | Basic Chat Completions request. |
| streaming | Chat Completions streaming. |
| tool_calling | OpenAI-compatible function tools. |
| openai_response_format | Structured output via response_format. |
| responses | Responses API request. |
| responses_tools | Responses hosted tools: web search and code interpreter. |
| skills | OpenAI-compatible Skills upload and execution. |
| responses_vision | Responses API with image input. |
| web_search | Search through OpenAI-compatible request fields. |
| anthropic_chat | Anthropic Messages request. |
| multi_model_chat | Routing requests across loaded models. |
For Codex and Claude Code setup, see coding agents.