Serve an OpenAI-compatible API

docs/src/content/docs/guides/serve/openai-compatible-apis.md

mistralrs serve puts a local model behind OpenAI-compatible endpoints under /v1. OpenAI SDKs and compatible clients work unchanged with http://localhost:1234/v1 as the base URL.

```bash
mistralrs serve -m Qwen/Qwen3-4B
```

Then send a request:

```bash
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [
      {"role": "user", "content": "Write a haiku about local inference."}
    ],
    "max_tokens": 128
  }'
```

When the server is started with a single -m model, set the request's model field to "default" or omit it. In multi-model serving, use a model id exactly as it appears in GET /v1/models.
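To choose an id in multi-model serving, a client can read GET /v1/models before sending requests. The sketch below assumes the response follows the OpenAI list shape ({"data": [{"id": ...}]}); fetch_models and pick_model_id are hypothetical helper names, and the fallback to "default" mirrors the single-model behaviour described above.

```python
import json
import urllib.request


def fetch_models(base_url="http://localhost:1234/v1"):
    """GET /v1/models and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        return json.load(resp)


def pick_model_id(models_response, preferred=None):
    """Choose a model id from a /v1/models response.

    Assumes the OpenAI list shape: {"object": "list", "data": [{"id": ...}]}.
    With no preference, returns the first listed id, or "default" if the
    list is empty. With a preference, the id must match exactly.
    """
    ids = [entry["id"] for entry in models_response.get("data", [])]
    if preferred is None:
        return ids[0] if ids else "default"
    if preferred not in ids:
        raise ValueError(f"{preferred!r} is not loaded; available: {ids}")
    return preferred


# Usage against a running server:
#   model_id = pick_model_id(fetch_models(), preferred="Qwen/Qwen3-4B")
```

Raising on an unknown id, rather than silently picking another model, keeps a typo from routing requests to the wrong model.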

First time serving a model? The Quickstart walks through installation, Hugging Face authentication for gated models, and the first run.

OpenAI Python client

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-used")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Say hello from mistral.rs."}],
)

print(response.choices[0].message.content)
```

The api_key is required by the client but not validated by the server; see authentication. Set stream=True for token-by-token output (full example).
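A minimal streaming sketch with the same client might look like the following. It assumes the server emits standard Chat Completions chunks, where each chunk carries choices[0].delta.content; stream_text and stream_chat are hypothetical helper names, not part of mistral.rs or the OpenAI SDK.

```python
def stream_text(chunks):
    """Yield text fragments from an iterable of Chat Completions stream chunks."""
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks may carry no choices (for example, usage-only)
        delta = chunk.choices[0].delta
        if delta.content:  # skips None and empty strings (role-only or final chunks)
            yield delta.content


def stream_chat(prompt, base_url="http://localhost:1234/v1", model="default"):
    """Print a reply as it is generated and return the full text."""
    # Imported lazily so stream_text can be used without the SDK installed.
    from openai import OpenAI

    client = OpenAI(base_url=base_url, api_key="not-used")
    chunks = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    parts = []
    for text in stream_text(chunks):
        print(text, end="", flush=True)
        parts.append(text)
    print()
    return "".join(parts)


# Usage against a running server:
#   stream_chat("Say hello from mistral.rs.")
```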

Endpoints

| Endpoint | Purpose |
| --- | --- |
| GET /v1/models | List loaded models. |
| POST /v1/chat/completions | Chat, streaming, tool calling, multimodal inputs, and mistral.rs agentic extensions. |
| POST /v1/responses | OpenAI Responses API: response objects, polling, background runs, cancellation. |
| POST /v1/skills | Upload OpenAI-compatible Skills. |
| GET /v1/skills | List uploaded skills. |
| POST /v1/skills/{skill_id}/versions | Upload a new version of an existing skill. |
| POST /v1/messages | Anthropic Messages API (base URL without /v1). |
| POST /v1/completions | Legacy text completions. |
| POST /v1/embeddings | Embedding generation. |
| POST /v1/images/generations | Image generation. |
| POST /v1/audio/speech | Text to speech. |
| POST /v1/files | Upload OpenAI-compatible user files. |
| GET /v1/files | List uploaded and generated files. |
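The "base URL without /v1" note on the Messages endpoint is easy to miss. The sketch below derives both base URLs from one host and port, assuming (as the Anthropic SDKs do by default) that an Anthropic-style client appends /v1/messages to its base URL itself. client_base_urls is a hypothetical helper name.

```python
def client_base_urls(host="localhost", port=1234):
    """Return the base URL each client family should be configured with."""
    root = f"http://{host}:{port}"
    return {
        # OpenAI clients append paths such as /chat/completions to the base URL.
        "openai": f"{root}/v1",
        # Anthropic clients append /v1/messages, so the base URL omits /v1.
        "anthropic": root,
    }


urls = client_base_urls()
print(urls["openai"] + "/chat/completions")
print(urls["anthropic"] + "/v1/messages")
```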

Every path with full request and response schemas is in the generated HTTP API reference. Streaming events, authentication, and protocol semantics are in the HTTP API reference; field-level compatibility notes (including Responses API restrictions) are in OpenAI compatibility.

:::caution[Compatibility gaps]
Most OpenAI-compatible fields work, but a few common ones have limitations:

  • seed, user, stream_options, metadata, parallel_tool_calls - accepted but ignored.
  • code_interpreter supports only {"container":{"type":"auto"}}; OpenAI code-interpreter container ids and container.file_ids are not supported.
  • Responses web_search does not support image search or external_web_access: false.
  • Responses shell supports environment.type = "container_auto" and OpenAI-compatible uploaded skill_reference entries; local environments, container references, and inline container-created skills are not implemented.
  • File inputs support uploaded ids, inline base64/Data URLs, and Responses file_url, but binary formats are not converted with OpenAI's private PDF/image/spreadsheet extraction pipeline.
  • dimensions (embeddings) - errors rather than truncating.

Full list in OpenAI compatibility.
:::
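Because ignored fields are accepted without error, a client that relies on them (for example, seed for reproducibility) gets no signal that they had no effect. One way to surface this is a client-side check before sending. The sketch below is a hypothetical helper that covers only the fields named in the list above; it is not part of mistral.rs.

```python
# Fields the caution above lists as accepted but ignored.
IGNORED_FIELDS = ("seed", "user", "stream_options", "metadata", "parallel_tool_calls")


def ignored_fields_in(payload):
    """Return the top-level request fields that the server would silently ignore."""
    return [name for name in IGNORED_FIELDS if name in payload]


payload = {
    "model": "default",
    "messages": [{"role": "user", "content": "Hello"}],
    "seed": 7,
    "user": "alice",
}
for name in ignored_fields_in(payload):
    print(f"warning: '{name}' is accepted but has no effect on this server")
```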

A live Swagger UI for the running server is at http://localhost:1234/docs.

Tools, structured output, and agentic features

OpenAI-compatible function tools work on Chat Completions and Responses, including strict: true for JSON-Schema-constrained tool arguments. See tool calling.
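As an illustration, a Chat Completions request body with one strict function tool could be built as follows. The tool itself (get_weather and its parameters) is made up for the example. The schema follows OpenAI's strict-mode conventions, with every property listed in required and additionalProperties set to false; whether mistral.rs enforces those same constraints is an assumption here.

```python
import json


def weather_tool():
    """A hypothetical function tool with JSON-Schema-constrained arguments."""
    return {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "strict": True,
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city", "unit"],
                "additionalProperties": False,
            },
        },
    }


def chat_request_with_tools(prompt, tools, model="default"):
    """Build a Chat Completions request body that offers `tools` to the model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
    }


body = chat_request_with_tools("What is the weather in Oslo?", [weather_tool()])
print(json.dumps(body, indent=2))
```

The same dictionary can be passed to the OpenAI Python client as keyword arguments or posted as JSON to /v1/chat/completions.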

response_format with json_schema and the grammar extension constrain output server-side. See structured output.
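A request using response_format with json_schema might be assembled like this. The envelope (type, json_schema.name, json_schema.schema, json_schema.strict) follows the OpenAI Chat Completions format; the schema contents and helper names are invented for the example, and the grammar extension is not shown because its request shape is documented separately.

```python
import json


def json_schema_format(name, schema):
    """Wrap a JSON Schema in the OpenAI-style response_format envelope."""
    return {
        "type": "json_schema",
        "json_schema": {"name": name, "schema": schema, "strict": True},
    }


# Hypothetical schema: a city name and its population.
CITY_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["name", "population"],
    "additionalProperties": False,
}

request_body = {
    "model": "default",
    "messages": [{"role": "user", "content": "Describe Oslo as JSON."}],
    "response_format": json_schema_format("city", CITY_SCHEMA),
}
print(json.dumps(request_body, indent=2))
```

Since the constraint is applied server-side, the returned message content should parse with json.loads against the same schema.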

Start the server with agentic capabilities to use server-side tools and agentic fields. Chat Completions uses web_search_options for web search and tools: [{"type":"code_interpreter","container":{"type":"auto"}}] for code execution. Responses uses hosted tools in the tools array for web search, code execution, shell, and OpenAI-compatible Skills.

```bash
mistralrs serve --agent -m Qwen/Qwen3-4B
```
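With the server started in agent mode, a Chat Completions request can opt in to the server-side tools described above. The sketch below builds such a body. The code_interpreter entry is copied from the text; passing an empty object for web_search_options to request default search behaviour is an assumption, and agentic_chat_request is a hypothetical helper name.

```python
import json


def agentic_chat_request(prompt, web_search=True, code_interpreter=True, model="default"):
    """Build a Chat Completions body that enables server-side agentic tools."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    if web_search:
        # Assumption: an empty object enables web search with default options.
        payload["web_search_options"] = {}
    if code_interpreter:
        # Only the "auto" container type is supported (see the compatibility gaps).
        payload["tools"] = [{"type": "code_interpreter", "container": {"type": "auto"}}]
    return payload


body = agentic_chat_request("Plot the first 20 primes and summarise the gaps.")
print(json.dumps(body, indent=2))
```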

For tool timelines, generated files, search, code execution, shell, Skills, and session state, see agentic runtime for apps.

Configuration

-p/--port (default 1234) and --host (default 0.0.0.0) control the bind address. --no-ui disables the web UI at /ui. All flags are in the CLI reference; the equivalent config file for multi-model, repeatable deployments is the TOML config reference, which also covers CORS, body limits, authentication, and logging.

:::caution
The default --host 0.0.0.0 accepts connections from any host on the network. Use --host 127.0.0.1 to restrict to the local machine, and put authentication in a reverse proxy before exposing the server.
:::
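A deployment script can check the bind address before launching the server. The following sketch uses Python's standard ipaddress module to classify a --host value; bind_exposure is a hypothetical helper, and it accepts IP literals only (a hostname such as localhost raises ValueError).

```python
import ipaddress


def bind_exposure(host):
    """Classify an IP literal by how widely a server bound to it is reachable."""
    addr = ipaddress.ip_address(host)  # raises ValueError for non-IP strings
    if addr.is_loopback:
        return "local-only"        # e.g. 127.0.0.1 or ::1
    if addr.is_unspecified:
        return "all-interfaces"    # e.g. 0.0.0.0 or ::
    return "single-interface"      # one specific address on this machine


for host in ("127.0.0.1", "0.0.0.0"):
    print(host, "->", bind_exposure(host))
```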

Examples

Runnable client scripts live in examples/server/ and render under server examples:

| Example | What it shows |
| --- | --- |
| chat | Basic Chat Completions request. |
| streaming | Chat Completions streaming. |
| tool_calling | OpenAI-compatible function tools. |
| openai_response_format | Structured output via response_format. |
| responses | Responses API request. |
| responses_tools | Responses hosted tools: web search and code interpreter. |
| skills | OpenAI-compatible Skills upload and execution. |
| responses_vision | Responses API with image input. |
| web_search | Search through OpenAI-compatible request fields. |
| anthropic_chat | Anthropic Messages request. |
| multi_model_chat | Routing requests across loaded models. |

For Codex and Claude Code setup, see coding agents.