llmfit REST API Guide

This document is for agent/client builders integrating with llmfit serve.

Purpose

llmfit serve exposes node-local model fit analysis (same core data used by TUI/CLI) over HTTP and serves a local web dashboard.

Primary use case:

  • Query each node in a cluster for top runnable models.
  • Aggregate externally (scheduler/controller/UI) for placement decisions.

Start the server

sh
llmfit serve --port 8787

Global flags still apply:

sh
llmfit --memory 24G --ram 64G --cpu-cores 16 --max-context 8192 serve --port 8787

Hardware overrides (--memory, --ram, --cpu-cores) are reflected in API responses: the server reports the overridden values instead of the detected hardware.

Base URL

Default local base URL:

text
http://127.0.0.1:8787

To expose outside localhost, pass --host 0.0.0.0.

If you are building from source and want the dashboard embedded in llmfit, build web assets first:

sh
cd llmfit-web && npm ci && npm run build

Endpoints

GET /

Web dashboard entrypoint (same-origin UI for fit exploration).

GET /health

Liveness probe.

Example response:

json
{
  "status": "ok",
  "node": {
    "name": "worker-1",
    "os": "linux"
  }
}

GET /api/v1/system

Returns node identity + detected hardware.

Example response shape:

json
{
  "node": {
    "name": "worker-1",
    "os": "linux"
  },
  "system": {
    "total_ram_gb": 62.23,
    "available_ram_gb": 41.08,
    "cpu_cores": 14,
    "cpu_name": "Intel(R) Core(TM) Ultra 7 165U",
    "has_gpu": false,
    "gpu_vram_gb": null,
    "gpu_name": null,
    "gpu_count": 0,
    "unified_memory": false,
    "backend": "CPU (x86)",
    "gpus": []
  }
}

GET /api/v1/models

Returns filtered/sorted model-fit rows for this node.

Envelope shape:

json
{
  "node": { "name": "worker-1", "os": "linux" },
  "system": { "...": "..." },
  "total_models": 23,
  "returned_models": 10,
  "filters": { "...": "echo of query state" },
  "models": [
    {
      "name": "Qwen/Qwen2.5-Coder-7B-Instruct",
      "provider": "Qwen",
      "parameter_count": "7B",
      "params_b": 7.0,
      "context_length": 32768,
      "use_case": "Coding",
      "category": "Coding",
      "release_date": "2025-03-14",
      "is_moe": false,
      "fit_level": "good",
      "fit_label": "Good",
      "run_mode": "gpu",
      "run_mode_label": "GPU",
      "score": 86.5,
      "score_components": {
        "quality": 87.0,
        "speed": 81.2,
        "fit": 90.1,
        "context": 88.0
      },
      "estimated_tps": 42.5,
      "runtime": "llamacpp",
      "runtime_label": "llama.cpp",
      "best_quant": "Q5_K_M",
      "memory_required_gb": 5.8,
      "memory_available_gb": 12.0,
      "utilization_pct": 48.3,
      "notes": [],
      "gguf_sources": []
    }
  ]
}

GET /api/v1/models/top

Key scheduling endpoint. Same schema as /api/v1/models, but defaults to top 5 runnable entries.

Important behavior:

  • Defaults limit=5.
  • Excludes too_tight rows unless explicitly overridden with include_too_tight=true; the endpoint is meant to return only runnable models.

GET /api/v1/models/{name}

Path-constrained search: equivalent to a free-text search scoped to {name}.

Useful for:

  • Client-side drilldown after selecting a model family.

Query parameters

Supported on /api/v1/models, /api/v1/models/top, and /api/v1/models/{name}:

  • limit (or alias n): max rows returned.
  • perfect: true|false (when true, only perfect fits).
  • min_fit: perfect|good|marginal|too_tight.
  • runtime: any|mlx|llamacpp.
  • use_case: general|coding|reasoning|chat|multimodal|embedding.
  • provider: provider substring filter.
  • search: free-text filter (name/provider/params/use-case/category).
  • sort: score|tps|params|mem|ctx|date|use_case.
  • include_too_tight: include unrunnable rows (defaults true for /models, false for /models/top).
  • max_context: per-request context cap used by memory estimation.
  • force_runtime: mlx|llamacpp|vllm — override automatic runtime selection during analysis (e.g. get llama.cpp recommendations on Apple Silicon instead of MLX).
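
As a sketch of how a client might assemble these filters (assuming the default local base URL; only the parameter names listed above are API-specific):

py
import json
import urllib.parse
import urllib.request

BASE = "http://127.0.0.1:8787"  # default llmfit serve address

# Filters for a coding-focused, runnable-only query, sorted by score.
params = {
    "limit": 10,
    "min_fit": "good",
    "use_case": "coding",
    "include_too_tight": "false",
    "sort": "score",
}
url = f"{BASE}/api/v1/models?{urllib.parse.urlencode(params)}"

with urllib.request.urlopen(url, timeout=5) as resp:
    payload = json.load(resp)

print(payload["returned_models"], "of", payload["total_models"], "models returned")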

Error handling

Invalid filter values return HTTP 400:

json
{
  "error": "invalid min_fit value: use perfect|good|marginal|too_tight"
}

Server errors return HTTP 500 with {"error": "..."}.
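
A hedged example of handling these responses in Python (the error-body shape is taken from above; the deliberately invalid min_fit value is only there to trigger a 400):

py
import json
import urllib.error
import urllib.request

url = "http://127.0.0.1:8787/api/v1/models?min_fit=bogus"  # invalid on purpose

try:
    with urllib.request.urlopen(url, timeout=5) as resp:
        payload = json.load(resp)
except urllib.error.HTTPError as exc:
    # 400 (bad filter value) and 500 (server error) both carry {"error": "..."}.
    detail = json.loads(exc.read().decode("utf-8", errors="replace"))
    print(f"request failed with HTTP {exc.code}: {detail.get('error')}")
except urllib.error.URLError as exc:
    # Network-level failure: node unreachable, connection refused, etc.
    print(f"node unreachable: {exc.reason}")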

Client integration recommendations

1) Polling pattern for schedulers

For each node agent:

  1. Call /health.
  2. Call /api/v1/system.
  3. Call /api/v1/models/top?limit=K&min_fit=good.
  4. Attach node metadata and forward to your central scheduler.
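
A minimal sketch of that loop for one node, using only the endpoints documented above (the host list, K, and the forwarding step are placeholders for your own scheduler):

py
import json
import urllib.request

def fetch(base: str, path: str) -> dict:
    """GET a JSON endpoint from one node's llmfit server."""
    with urllib.request.urlopen(f"{base}{path}", timeout=5) as resp:
        return json.load(resp)

def poll_node(base: str, k: int = 5) -> dict:
    health = fetch(base, "/health")            # 1. liveness
    system = fetch(base, "/api/v1/system")     # 2. node identity + hardware
    top = fetch(base, f"/api/v1/models/top?limit={k}&min_fit=good")  # 3. candidates
    # 4. Attach node metadata; forwarding to a scheduler is left to the caller.
    return {
        "node": health["node"]["name"],
        "system": system["system"],
        "candidates": [m["name"] for m in top["models"]],
    }

# Hypothetical node agents in a small cluster.
for base in ("http://worker-1:8787", "http://worker-2:8787"):
    print(poll_node(base))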

2) Conservative placement defaults

For production placement, prefer:

text
min_fit=good
include_too_tight=false
sort=score
limit=5..20

3) Per-workload targeting

Examples:

  • Coding workloads: use_case=coding
  • Embedding workloads: use_case=embedding
  • Runtime constrained to llama.cpp fleet: runtime=llamacpp

4) Stable parsing

Treat unknown fields as forward-compatible additions:

  • Parse required fields you depend on.
  • Ignore unknown fields.
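
For example, a placement client might extract only the handful of fields it depends on and drop everything else, so new fields in later versions cannot break it (a sketch; the field names come from the /api/v1/models schema above):

py
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    fit_level: str
    score: float
    memory_required_gb: float

def parse_models(payload: dict) -> list[Candidate]:
    # Pull only the required fields; unknown/extra fields are simply ignored.
    return [
        Candidate(
            name=row["name"],
            fit_level=row["fit_level"],
            score=row["score"],
            memory_required_gb=row["memory_required_gb"],
        )
        for row in payload.get("models", [])
    ]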

Curl examples

sh
curl http://127.0.0.1:8787/health
curl http://127.0.0.1:8787/api/v1/system
curl "http://127.0.0.1:8787/api/v1/models?limit=20&min_fit=marginal&sort=score"
curl "http://127.0.0.1:8787/api/v1/models/top?limit=5&min_fit=good&use_case=coding"
curl "http://127.0.0.1:8787/api/v1/models/Mistral?runtime=any"

MCP Server Mode

llmfit can run as an MCP (Model Context Protocol) server over stdio, making it discoverable by AI agents (Claude, Cursor, etc.).

Start the MCP server

sh
llmfit serve --mcp

Global hardware overrides still apply:

sh
llmfit --memory 24G --ram 64G serve --mcp

MCP client configuration

Add to your MCP client config (e.g. claude_desktop_config.json):

json
{
  "mcpServers": {
    "llmfit": {
      "command": "llmfit",
      "args": ["serve", "--mcp"]
    }
  }
}

Available tools

| Tool | Description | Parameters |
| --- | --- | --- |
| get_system_specs | Node hardware info (RAM, GPU, CPU) | None |
| recommend_models | Top models for this hardware | limit?, use_case?, min_fit?, runtime?, license?, sort? |
| search_models | Free-text model search | query, limit? |
| plan_hardware | Hardware requirements for a model | model, context?, quant?, target_tps? |
| get_runtimes | Installed inference runtimes | None |
| get_installed_models | Models in local runtimes | None |
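
For reference, invoking one of these tools over stdio follows the standard MCP tools/call request shape; the tool and parameter names below come from the table above, while the id and argument values are illustrative:

json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "recommend_models",
    "arguments": { "limit": 5, "use_case": "coding", "min_fit": "good" }
  }
}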

NATS Event Publishing

When built with the nats feature, llmfit can publish hardware and model events to NATS for integration with coordination systems (e.g. Sympozium membrane).

Build with NATS support

sh
cargo build --features nats

Enable event publishing

sh
llmfit serve --send-events --nats-url nats://localhost:4222
llmfit serve --mcp --send-events  # also works with MCP mode

The NATS_URL environment variable is also supported.

Event subjects

Events are published to llmfit.{event_type}.{hostname}:

| Subject | Trigger | Payload |
| --- | --- | --- |
| llmfit.system.{hostname} | Startup + every 60s | System hardware specs |
| llmfit.fit.{hostname} | After fit analysis | Model fit summary |
| llmfit.plan.{hostname} | After plan estimate | Plan estimate |
| llmfit.runtimes.{hostname} | Startup + on query | Runtime availability |
| llmfit.installed.{hostname} | Startup + on query | Installed models |

Event envelope

All events are wrapped in a common envelope:

json
{
  "timestamp": "1747058400",
  "hostname": "worker-1",
  "event_type": "system",
  "version": "1",
  "data": { ... }
}

Subscribe to events

sh
nats sub 'llmfit.>'                    # all events from all nodes
nats sub 'llmfit.system.>'             # system specs from all nodes
nats sub 'llmfit.system.worker-1'      # system specs from specific node
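
A programmatic subscriber sketch using the nats-py client (assumed installed via pip install nats-py; the subject pattern and envelope fields come from above):

py
import asyncio
import json
import nats

async def main() -> None:
    nc = await nats.connect("nats://localhost:4222")

    async def handle(msg) -> None:
        # Every message carries the common envelope described above.
        envelope = json.loads(msg.data)
        print(envelope["hostname"], envelope["event_type"], envelope["timestamp"])

    # 'llmfit.>' matches every event type from every node.
    await nc.subscribe("llmfit.>", cb=handle)
    await asyncio.sleep(60)  # listen for a minute, then shut down
    await nc.drain()

asyncio.run(main())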

Versioning notes

Current API prefix is v1.

If you build long-lived clients, pin to /api/v1/... and validate behavior with the local test script in scripts/test_api.py.