.plans/openai-api-server.md
Every major chat frontend (Open WebUI 126k★, LobeChat 73k★, LibreChat 34k★, AnythingLLM 56k★, NextChat 87k★, ChatBox 39k★, Jan 26k★, HF Chat-UI 8k★, big-AGI 7k★) connects to backends via the OpenAI-compatible REST API with SSE streaming. By exposing this endpoint, hermes-agent becomes instantly usable as a backend for all of them — no custom adapters needed.
┌──────────────────┐
│  Open WebUI      │──┐
│  LobeChat        │  │    POST /v1/chat/completions
│  LibreChat       │  ├──► Authorization: Bearer <key>     ┌─────────────────┐
│  AnythingLLM     │  │    {"messages": [...]}             │  hermes-agent   │
│  NextChat        │  │                                    │  gateway        │
│  Any OAI client  │──┘    ◄── SSE streaming response      │  (API server)   │
└──────────────────┘                                       └─────────────────┘
A user would:
1. Set API_SERVER_ENABLED=true in ~/.hermes/.env
2. Run hermes gateway (API server starts alongside Telegram/Discord/etc.)
3. Point their frontend at http://localhost:8642/v1

| Method | Path | Purpose |
|---|---|---|
| POST | /v1/chat/completions | Chat with the agent (streaming + non-streaming) |
| GET | /v1/models | List available "models" (returns hermes-agent as a model) |
| GET | /health | Health check |
Option A: Create gateway/platforms/api_server.py as a new platform adapter that
extends BasePlatformAdapter. This is the cleanest approach because:

- It uses aiohttp.web (already a dependency) for the HTTP server

The adapter would start an aiohttp.web.Application server in connect()
and route incoming HTTP requests through the standard handle_message() pipeline.

Option B: A separate HTTP server class in gateway/api_server.py that creates its own
AIAgent instances directly. Simpler but duplicates session/auth logic.
Recommendation: Option A — fits the existing architecture, less code to maintain, gets all gateway features for free.
POST /v1/chat/completions
Authorization: Bearer hermes-api-key-here
Content-Type: application/json
{
"model": "hermes-agent",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What files are in the current directory?"}
],
"stream": false,
"temperature": 0.7
}
Response:
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1710000000,
"model": "hermes-agent",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": "Here are the files in the current directory:\n..."
},
"finish_reason": "stop"
}],
"usage": {
"prompt_tokens": 50,
"completion_tokens": 200,
"total_tokens": 250
}
}
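The response shape above could be produced by a small helper. This is a minimal sketch, not the planned implementation: the function name `build_chat_completion` and its parameters are hypothetical, and the token counts are assumed to be supplied by the caller.

```python
import time
import uuid


def build_chat_completion(content, model="hermes-agent",
                          prompt_tokens=0, completion_tokens=0):
    """Wrap the agent's final text in a chat.completion response body.

    Hypothetical helper: the name and signature are illustrative only.
    """
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex[:12]}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": content},
            "finish_reason": "stop",
        }],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }
```

Keeping the envelope in one function means the non-streaming handler only has to serialize the returned dict as JSON.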
Same request with "stream": true. Response is SSE:
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Here "},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"are "},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
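The framing of this stream can be sketched as a generator. The name `sse_events` is hypothetical; the sketch assumes the text pieces are already available and only shows how each one is wrapped as a `data:` line followed by a blank line, as SSE requires.

```python
import json


def sse_events(completion_id, pieces):
    """Yield SSE frames: a role chunk, one chunk per piece, a stop chunk, then [DONE].

    Hypothetical helper: the name and signature are illustrative only.
    """
    def frame(delta, finish_reason=None):
        chunk = {
            "id": completion_id,
            "object": "chat.completion.chunk",
            "choices": [{"index": 0, "delta": delta,
                         "finish_reason": finish_reason}],
        }
        # Each SSE event is one "data:" line terminated by a blank line.
        return "data: " + json.dumps(chunk, separators=(",", ":")) + "\n\n"

    yield frame({"role": "assistant"})
    for piece in pieces:
        yield frame({"content": piece})
    yield frame({}, finish_reason="stop")
    yield "data: [DONE]\n\n"
```

Calling `list(sse_events("chatcmpl-abc123", ["Here ", "are "]))` yields five frames matching the example stream above.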
GET /v1/models
Authorization: Bearer hermes-api-key-here
Response:
{
"object": "list",
"data": [{
"id": "hermes-agent",
"object": "model",
"created": 1710000000,
"owned_by": "hermes-agent"
}]
}
The OpenAI API is stateless — each request includes the full conversation. But hermes-agent sessions have persistent state (memory, skills, tool context).
Approach: Hybrid
- Stateless by default: the messages array IS the conversation. No session persistence between requests.
- Optional sessions via an X-Session-ID header. When provided, the server maintains session state across requests (conversation history, memory context, tool state). This enables richer agent behavior.

The agent's run_conversation() is synchronous and returns the full response.
For real SSE streaming, we need to emit chunks as they're generated.
Phase 1 (MVP): Run agent in a thread, return the complete response as
a single SSE chunk + [DONE]. This works with all frontends — they just see
a fast single-chunk response. Not true streaming but functional.
Phase 2: Add a response callback to AIAgent that emits text chunks as the LLM generates them. The API server captures these via a queue and streams them as SSE events. This gives real token-by-token streaming.
Phase 3: Stream tool execution progress too — emit tool call/result events as the agent works, giving frontends visibility into what the agent is doing.
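The Phase 2 idea (a callback feeding a queue that the server drains) might look roughly like the sketch below. It assumes the agent call accepts a `stream_callback` keyword argument; `stream_agent` and `fake_agent` are hypothetical names, and `fake_agent` only stands in for the real `AIAgent.run_conversation()`.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the agent's output


def stream_agent(run_agent, prompt):
    """Run a blocking agent call in a thread and yield its chunks as they arrive."""
    chunks = queue.Queue()

    def worker():
        try:
            # Assumption: the agent invokes stream_callback(text) per chunk.
            run_agent(prompt, stream_callback=chunks.put)
        finally:
            chunks.put(_DONE)  # always unblock the consumer, even on error

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = chunks.get()
        if item is _DONE:
            break
        yield item


def fake_agent(prompt, stream_callback):
    """Hypothetical stand-in for a synchronous agent that streams text."""
    for word in ("Here ", "are ", "the files."):
        stream_callback(word)
```

This version uses a thread-safe `queue.Queue` and a plain generator. Inside an asyncio-based server the same pattern would need an async-friendly hand-off, since `asyncio.Queue` is not thread-safe on its own.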
Two modes:
- Authorization: Bearer <key> header
- API_SERVER_KEY env var

Frontends send "model": "hermes-agent" (or any other name). The actual LLM model
used is configured server-side in config.yaml. The API server maps any
requested model name to the configured hermes-agent model.
Optionally, allow model passthrough: if the frontend sends
"model": "anthropic/claude-sonnet-4", the agent uses that model. Controlled
by a config flag.
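The mapping rule can be written as a single function. This is a sketch under stated assumptions: `resolve_model` is a hypothetical name, and treating "hermes-agent" as an alias for the configured model even when overrides are enabled is a guess at the intended behavior.

```python
def resolve_model(requested, configured, allow_override=False):
    """Return the LLM model to use for one request.

    Hypothetical helper. Assumption: "hermes-agent" (or a missing name)
    always maps to the server-side model, even with overrides enabled.
    """
    if allow_override and requested and requested != "hermes-agent":
        return requested  # passthrough: the client chose a concrete model
    return configured     # default: ignore the requested name
```

With the flag off, `resolve_model("gpt-4o", "configured-model")` returns `"configured-model"`. With the flag on, `resolve_model("anthropic/claude-sonnet-4", "configured-model", allow_override=True)` returns the requested name.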
# In config.yaml
api_server:
  enabled: true
  port: 8642
  host: "127.0.0.1"              # localhost only by default
  key: "your-secret-key"         # or via API_SERVER_KEY env var
  allow_model_override: false    # let clients choose the model
  max_concurrent: 5              # max simultaneous requests
Environment variables:
API_SERVER_ENABLED=true
API_SERVER_PORT=8642
API_SERVER_HOST=127.0.0.1
API_SERVER_KEY=your-secret-key
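Reading these variables and checking the bearer key could be sketched as follows. `ApiServerConfig`, `load_config`, and `is_authorized` are hypothetical names. The accepted truthy strings, and the rule that an empty key disables authentication, are assumptions rather than decisions recorded in this plan. Precedence between config.yaml and the environment is not covered.

```python
import hmac
import os
from dataclasses import dataclass


@dataclass
class ApiServerConfig:
    enabled: bool = False
    port: int = 8642
    host: str = "127.0.0.1"
    key: str = ""


def load_config(env=None):
    """Build the config from API_SERVER_* variables (defaults as listed above)."""
    env = os.environ if env is None else env
    # Assumption: "1", "true", and "yes" all count as enabled.
    enabled = env.get("API_SERVER_ENABLED", "").strip().lower() in ("1", "true", "yes")
    return ApiServerConfig(
        enabled=enabled,
        port=int(env.get("API_SERVER_PORT", "8642")),
        host=env.get("API_SERVER_HOST", "127.0.0.1"),
        key=env.get("API_SERVER_KEY", ""),
    )


def is_authorized(auth_header, key):
    """Check an Authorization header value against the configured key."""
    if not key:
        return True  # assumption: no key configured means auth is disabled
    expected = f"Bearer {key}"
    # Constant-time comparison avoids leaking the key through timing.
    return hmac.compare_digest((auth_header or "").encode(), expected.encode())
```

Passing a plain dict to `load_config` instead of relying on `os.environ` keeps the parsing easy to test.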
- gateway/platforms/api_server.py: new adapter
  - POST /v1/chat/completions: Chat Completions API (universal compat)
  - POST /v1/responses: Responses API (server-side state, tool preservation)
  - GET /v1/models: list available models
  - GET /health: health check
- gateway/config.py: add Platform.API_SERVER enum + config
- gateway/run.py: register adapter in _create_adapter()
- Tests in tests/gateway/test_api_server.py
- Add response streaming to both endpoints
  - choices[0].delta.content SSE format
- Add stream_callback parameter to AIAgent.run_conversation()
| File | Change |
|---|---|
| gateway/platforms/api_server.py | NEW: main adapter (~300 lines) |
| gateway/config.py | Add Platform.API_SERVER + config (~20 lines) |
| gateway/run.py | Register adapter in _create_adapter() (~10 lines) |
| tests/gateway/test_api_server.py | NEW: tests (~200 lines) |
| cli-config.yaml.example | Add api_server section |
| README.md | Mention API server in platform list |
Once implemented, hermes-agent works as a drop-in backend for:
| Frontend | Stars | How to Connect |
|---|---|---|
| Open WebUI | 126k | Settings → Connections → Add OpenAI API, URL: http://localhost:8642/v1 |
| NextChat | 87k | BASE_URL env var |
| LobeChat | 73k | Custom provider endpoint |
| AnythingLLM | 56k | LLM Provider → Generic OpenAI |
| Oobabooga | 42k | Already a backend, not a frontend |
| ChatBox | 39k | API Host setting |
| LibreChat | 34k | librechat.yaml custom endpoint |
| Chatbot UI | 29k | Custom API endpoint |
| Jan | 26k | Remote model config |
| AionUI | 18k | Custom API endpoint |
| HF Chat-UI | 8k | OPENAI_BASE_URL env var |
| big-AGI | 7k | Custom endpoint |
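Scripts can connect the same way the frontends do. The sketch below uses only the Python standard library and assumes the server is running at http://localhost:8642/v1 with a key configured. `build_chat_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, user_text, stream=False):
    """Build a POST request for the chat completions endpoint."""
    body = {
        "model": "hermes-agent",
        "messages": [{"role": "user", "content": user_text}],
        "stream": stream,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request):
    """Send a non-streaming request and return the reply text.

    Requires a running server; not called anywhere in this sketch.
    """
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

For example, `send(build_chat_request("http://localhost:8642/v1", "hermes-api-key-here", "What files are in the current directory?"))` would return the assistant's reply once the gateway is up.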