Back to Localai

Backend Monitor

docs/content/features/backend-monitor.md

4.6.04.3 KB
Original Source

+++ disableToc = false title = "Backend Monitor" weight = 20 url = "/features/backend-monitor/" +++

LocalAI provides endpoints to monitor and manage running backends. The /backend/monitor endpoint reports the status and resource usage of loaded models, /backend/load pre-loads a model into memory, and /backend/shutdown allows stopping a model's backend process.

All three are admin-only.

Monitor API

  • Method: GET
  • Endpoints: /backend/monitor, /v1/backend/monitor

Request

The model to monitor is passed as a query parameter:

ParameterTypeRequiredLocationDescription
modelstringYesqueryName of the model to monitor

For backwards compatibility, a JSON body with the same field is still accepted when the model query parameter is not set, but new clients should use the query parameter.

Response

Returns a JSON object with the backend status:

FieldTypeDescription
stateintBackend state: 0 = uninitialized, 1 = busy, 2 = ready, -1 = error
memoryobjectMemory usage information
memory.totaluint64Total memory usage in bytes
memory.breakdownobjectPer-component memory breakdown (key-value pairs)

If the gRPC status call fails, the endpoint falls back to local process metrics:

FieldTypeDescription
memory_infoobjectProcess memory info (RSS, VMS)
memory_percentfloatMemory usage percentage
cpu_percentfloatCPU usage percentage

Usage

bash
curl "http://localhost:8080/backend/monitor?model=my-model"

Example response

json
{
  "state": 2,
  "memory": {
    "total": 1073741824,
    "breakdown": {
      "weights": 536870912,
      "kv_cache": 268435456
    }
  }
}

Load API

Pre-loads a model into memory ahead of its first request, so that request pays no cold-start load cost. It is the inverse of the Shutdown API and works for any model, not just realtime pipelines.

  • Method: POST
  • Endpoints: /backend/load, /v1/backend/load

Request

ParameterTypeRequiredDescription
modelstringYesName of the model to load

Behavior

  • For a regular model, its own backend is loaded.
  • For a realtime pipeline model (a config with a pipeline: block), every configured sub-model (VAD, transcription, LLM, TTS, sound_detection, voice_recognition) is loaded concurrently instead of the pipeline stub, which has no backend of its own.

The call blocks until loading finishes and reports which model names became resident, so partial failures are visible.

Usage

bash
curl -X POST http://localhost:8080/backend/load \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model"}'

Example response

json
{ "loaded": ["my-model"], "message": "model loaded" }

On failure the call returns 500 with loaded listing whichever sub-models did load and message naming the failures.

Shutdown API

  • Method: POST
  • Endpoints: /backend/shutdown, /v1/backend/shutdown

Request

ParameterTypeRequiredDescription
modelstringYesName of the model to shut down

Usage

bash
curl -X POST http://localhost:8080/backend/shutdown \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model"}'

Response

Returns 200 OK with the shutdown confirmation message on success.

Error Responses

Status CodeDescription
400Invalid or missing model name
500Backend error or model not loaded