docs/en/guides/05-observability.md
This guide collects the current OpenViking observability entry points in one place, including:
- telemetry
- ov tui
- Web Studio (served by the OV server at /studio)
- /metrics time-series metrics

If you just want to know where to look first, start with the table below.
| Entry point | Best for | Typical use case |
|---|---|---|
| /health, observer/* | service health, queue backlog, VikingDB and VLM status | deployment validation, on-call checks |
| ov tui | viking:// trees, directory summaries, file content, vector records, image preview for supported image files | development debugging, verifying that data actually landed |
| Web Studio (/studio) | same-origin web UI on the OV server: Home shows token / retrieval / context-commit trends; Resources browses URIs; Retrieval runs find; Request Logs shows audit | interactive investigation without typing every command |
| telemetry | per-request duration, token usage, vector retrieval, ingestion stages | debugging one specific slow or unexpected call |
| /metrics | request trends, error rates, latency distribution, queue and probe state | Prometheus scraping, Grafana dashboards, alert rules |
/health provides a simple liveness check and does not require authentication.
curl http://localhost:1933/health
{"status": "ok"}
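For deployment validation, the liveness check can be polled until the service comes up. The following is a minimal sketch using only the Python standard library; the function names `fetch_health` and `wait_until_alive`, the retry count, and the delay are illustrative assumptions, not part of OpenViking. Only the URL and the `{"status": "ok"}` body come from the example above.

```python
import json
import time
import urllib.request
from typing import Callable

# Address taken from the curl example above; adjust for your deployment.
HEALTH_URL = "http://localhost:1933/health"


def fetch_health(url: str = HEALTH_URL, timeout: float = 2.0) -> dict:
    """GET /health and decode the JSON body. No API key is sent."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_until_alive(
    fetch: Callable[[], dict] = fetch_health,
    attempts: int = 10,
    delay_s: float = 1.0,
) -> bool:
    """Poll until the body reports status == "ok" or the attempts run out.

    `fetch` is injectable so the loop can be exercised without a server.
    """
    for attempt in range(attempts):
        try:
            if fetch().get("status") == "ok":
                return True
        except (OSError, ValueError):
            # Connection refused, timeout, or a non-JSON body: not up yet.
            pass
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

Calling `wait_until_alive()` with no arguments polls the default URL for roughly ten seconds; passing a stub as `fetch` keeps the retry logic testable offline.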
Python SDK (Embedded / HTTP)
status = client.get_status()
print(f"Healthy: {status['is_healthy']}")
print(f"Errors: {status['errors']}")
HTTP API
curl http://localhost:1933/api/v1/observer/system \
-H "X-API-Key: your-key"
{
  "status": "ok",
  "result": {
    "is_healthy": true,
    "errors": [],
    "components": {
      "queue": {"name": "queue", "is_healthy": true, "has_errors": false},
      "vikingdb": {"name": "vikingdb", "is_healthy": true, "has_errors": false},
      "vlm": {"name": "vlm", "is_healthy": true, "has_errors": false}
    }
  }
}
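For on-call checks it can help to reduce this response to the list of components that need attention. The sketch below assumes the response has exactly the shape shown above; `unhealthy_components` is a hypothetical helper, not an SDK function.

```python
import json

# Sample body copied from the response shown above.
SAMPLE_RESPONSE = json.loads("""
{
  "status": "ok",
  "result": {
    "is_healthy": true,
    "errors": [],
    "components": {
      "queue": {"name": "queue", "is_healthy": true, "has_errors": false},
      "vikingdb": {"name": "vikingdb", "is_healthy": true, "has_errors": false},
      "vlm": {"name": "vlm", "is_healthy": true, "has_errors": false}
    }
  }
}
""")


def unhealthy_components(payload: dict) -> list:
    """Return the sorted names of components that are unhealthy or report errors."""
    components = payload.get("result", {}).get("components", {})
    flagged = []
    for key, component in components.items():
        # A component counts as a problem if either flag points that way.
        if not component.get("is_healthy", False) or component.get("has_errors", False):
            flagged.append(component.get("name", key))
    return sorted(flagged)


print(unhealthy_components(SAMPLE_RESPONSE))
```

With the sample body every component is healthy, so the printed list is empty.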
| Endpoint | Component | Description |
|---|---|---|
| GET /api/v1/observer/queue | Queue | Processing queue status |
| GET /api/v1/observer/vikingdb | VikingDB | Vector database status |
| GET /api/v1/observer/vlm | VLM | Vision Language Model status |
For example:
curl http://localhost:1933/api/v1/observer/queue \
-H "X-API-Key: your-key"
Python SDK (Embedded / HTTP)
if client.is_healthy():
    print("System OK")
HTTP API
curl http://localhost:1933/api/v1/debug/health \
-H "X-API-Key: your-key"
{"status": "ok", "result": {"healthy": true}}
Every API response includes an X-Process-Time header with the server-side processing time in seconds:
curl -v "http://localhost:1933/api/v1/fs/ls?uri=viking://" \
-H "X-API-Key: your-key" 2>&1 | grep X-Process-Time
# < X-Process-Time: 0.0023
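If you collect this header in a script, it can be parsed into a number and compared against a latency budget. This is a small sketch under stated assumptions: the helper names and the 0.5 second threshold are illustrative, and the only fact taken from the guide is that the header value is the processing time in seconds.

```python
from typing import Mapping, Optional


def process_time_seconds(headers: Mapping[str, str]) -> Optional[float]:
    """Extract X-Process-Time as a float, or None if absent or malformed.

    HTTP header names are case-insensitive, so names are compared in lowercase.
    """
    for name, value in headers.items():
        if name.lower() == "x-process-time":
            try:
                return float(value.strip())
            except ValueError:
                return None
    return None


def is_slow(headers: Mapping[str, str], threshold_s: float = 0.5) -> bool:
    """True when the server-side time exceeds the (assumed) threshold."""
    elapsed = process_time_seconds(headers)
    return elapsed is not None and elapsed > threshold_s


print(process_time_seconds({"X-Process-Time": "0.0023"}))
```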
This layer answers "is the service up, blocked, or unhealthy?" If you want to inspect what happened inside one request, move on to telemetry.
ov tui for data-plane inspection

The ov CLI includes a dedicated TUI file explorer:
ov tui /
You can also start from a specific scope:
ov tui viking://resources
Prerequisites:
- ovcli.conf is configured
- X-API-Key can read the target tenant data

This TUI is useful for two kinds of inspection:

- viking://resources, viking://user, viking://agent, and viking://session

Common keys:

- `q`: quit
- `Tab`: switch focus between the tree and content panels
- `j` / `k`: move up and down
- `.`: expand or collapse a directory
- `g` / `G`: jump to the top or bottom
- `v`: toggle vector-record view
- `n`: load the next page in vector-record view
- `c`: count total vector records for the current URI

A typical debugging flow is:

1. Run ov tui viking://resources and locate the target document or directory.
2. Inspect abstract, overview, or file content. Supported image files (png, jpg, jpeg, gif, bmp, webp, tiff, tif) are rendered inline as a preview.
3. Press v to inspect vector records for that URI.
4. Press c to get the total count, and n to keep paging if needed.

TUI is primarily for data-plane inspection. It helps answer "did the resource really land?" and "were vectors really written?" but it does not directly show token totals or per-stage request timing.
The OV server serves the Web Studio frontend at /studio on the server's own port, so there is no separate process to start.
http://127.0.0.1:1933/studio
On first use, open the Connection dialog in the top right and set your X-API-Key. The base URL defaults to the current origin (the URL you loaded /studio from).
The most useful pages for observability are:
- Home (/studio): today's token usage, retrieval counts, context-commit trends, agent access summary, backed by the /api/v1/console/* BFF
- Request Logs (/studio/request-logs): audit logs filterable by account / user / agent / route, backed by /api/v1/console/audit
- Resources (/studio/resources): browse URIs, view directories and files, upload resources
- Retrieval (/studio/retrieval): run find / search / grep requests and inspect results
- Sessions (/studio/sessions): browse session history, inspect message and memory commit flow

Write operations (Add Resource, Add Memory, tenant/user administration) are gated by the API key currently signed in; there is no separate --write-enabled switch.
From an observability standpoint, Studio talks to the same /api/v1/console/* BFF (dashboard summary, token series, context commits, audit logs) the old standalone console used — only the UI changed. For operations such as find, add-resource, and session commit, you can expand the result panel to inspect telemetry.summary.
Studio is best for interactive click-through debugging. If you need to feed observability data into your own logs or automation, prefer the HTTP API or SDK and request telemetry explicitly.
The public request-tracing feature in OpenViking is called operation telemetry. It attaches a structured summary to a response so you can inspect things like:
- session.commit

The most common way to request it is to pass:
{"telemetry": true}
For example:
curl -X POST http://localhost:1933/api/v1/search/find \
-H "Content-Type: application/json" \
-H "X-API-Key: your-key" \
-d '{
"query": "memory dedup",
"limit": 5,
"telemetry": true
}'
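The same request can be built from Python with the standard library. This is a sketch rather than the official SDK path: `build_find_request` and `send` are hypothetical helper names, and the URL, headers, and body fields are copied from the curl example above. The shape of the telemetry data in the response is not assumed here.

```python
import json
import urllib.request


def build_find_request(
    query: str,
    limit: int = 5,
    api_key: str = "your-key",
    base_url: str = "http://localhost:1933",
) -> urllib.request.Request:
    """Build a POST to /api/v1/search/find that asks for operation telemetry."""
    body = {"query": query, "limit": limit, "telemetry": True}
    return urllib.request.Request(
        url=f"{base_url}/api/v1/search/find",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


request = build_find_request("memory dedup")
print(request.get_method(), request.full_url)
```

Keeping request construction separate from `send` makes it easy to log or inspect the exact payload before any network call happens.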
For the full field reference, supported operations, and more examples, see:
/metrics for time-series observability

/metrics is OpenViking's time-series metrics endpoint for the Prometheus scraping model. It is well suited for questions like:
Compared with observer/*, /metrics is better for trends, aggregation, and alerting. observer/* is better for inspecting the current point-in-time state by hand.
Compared with telemetry, /metrics focuses on aggregated time series, while telemetry focuses on what happened inside one specific request.
/metrics may be disabled by default. When the metrics subsystem is not enabled, the endpoint returns 404 with the message "Prometheus metrics are disabled."
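A quick way to tell these cases apart from a script is to look at the HTTP status code. The sketch below is illustrative: `classify_metrics_status` and `probe_metrics` are hypothetical names, and a 404 is treated as "disabled" only on the assumption that the request reached the OV server rather than a proxy in front of it.

```python
import urllib.error
import urllib.request


def classify_metrics_status(status_code: int) -> str:
    """Map the /metrics status code to a coarse state."""
    if status_code == 200:
        return "enabled"
    if status_code == 404:
        # Per the guide, a disabled metrics subsystem answers with 404.
        return "disabled"
    return "unexpected"


def probe_metrics(url: str = "http://localhost:1933/metrics", timeout: float = 2.0) -> str:
    """Request /metrics once and classify the outcome (requires a running server)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify_metrics_status(resp.status)
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx; the code is still available.
        return classify_metrics_status(err.code)


print(classify_metrics_status(404))
```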
You do not need the full configuration to get started. Enabling the master switch under the server section is enough.
Minimal config (recommended)
Add the following to ~/.openviking/ov.conf (or the path passed via --config):
{
  "server": {
    "observability": {
      "metrics": {
        "enabled": true
      }
    }
  }
}
Restart OpenViking Server after editing the config.
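If the config file already has other settings, the switch can be merged in rather than pasted by hand. The following sketch assumes the file is plain JSON, as in the example above; `enable_metrics` and `update_config_file` are hypothetical helpers, and you should back up the file before rewriting it.

```python
import copy
import json
from pathlib import Path


def enable_metrics(config: dict) -> dict:
    """Return a copy of the config with server.observability.metrics.enabled = true."""
    updated = copy.deepcopy(config)
    metrics = (
        updated.setdefault("server", {})
        .setdefault("observability", {})
        .setdefault("metrics", {})
    )
    metrics["enabled"] = True
    return updated


def update_config_file(path: str = "~/.openviking/ov.conf") -> None:
    """Rewrite the config file in place (assumes plain JSON without comments)."""
    conf_path = Path(path).expanduser()
    config = json.loads(conf_path.read_text(encoding="utf-8"))
    conf_path.write_text(
        json.dumps(enable_metrics(config), indent=2) + "\n", encoding="utf-8"
    )


print(json.dumps(enable_metrics({}), indent=2))
```

Existing keys under `server` are preserved because only the missing levels are created.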
OpenViking groups signal-level observability configuration under server.observability:
- server.observability.metrics: metrics subsystem and exporters
- server.observability.traces: trace export configuration
- server.observability.logs: log export configuration
- server.observability.dump_body: attaches HTTP request/response bodies (filtered by content-type, truncated by bytes) as attributes on the active trace span so they can be inspected in trace UIs. Off by default, because bodies may contain secrets and high-cardinality content

Example:
{
  "server": {
    "observability": {
      "metrics": {
        "enabled": true,
        "exporters": {
          "prometheus": {
            "enabled": true
          },
          "otel": {
            "enabled": true,
            "protocol": "grpc",
            "tls": {
              "insecure": true
            },
            "endpoint": "otel-collector:4317",
            "service_name": "openviking-server",
            "export_interval_ms": 10000,
            "headers": {}
          }
        }
      },
      "traces": {
        "enabled": true,
        "protocol": "grpc",
        "tls": {
          "insecure": true
        },
        "endpoint": "otel-collector:4317",
        "service_name": "openviking-server",
        "headers": {}
      },
      "logs": {
        "enabled": true,
        "protocol": "grpc",
        "tls": {
          "insecure": true
        },
        "endpoint": "otel-collector:4317",
        "service_name": "openviking-server",
        "headers": {}
      },
      "dump_body": {
        "enabled": false,
        "max_bytes": 4096
      }
    }
  }
}
Notes:
- headers forwards custom OTLP request headers or gRPC metadata to the exporter.
- The headers shape is the same across traces, logs, and metrics.exporters.otel.
- When protocol="grpc", headers are sent as gRPC metadata and keys should be lowercase, for example x-byteapm-appkey; this restriction does not apply to protocol="http".

For full fields, supported ranges, and more examples, see:
/metrics directly

In the current implementation, /metrics is not wired to get_request_context or other auth dependencies, so it currently behaves as a public, unauthenticated scrape endpoint:
curl http://localhost:1933/metrics
If your deployment protects /metrics at the gateway, reverse proxy, or service discovery layer, attach auth according to the deployment environment.
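For a one-off look at the output without Prometheus, the text returned by `/metrics` can be parsed directly. The sketch below handles the common `name{labels} value` lines of the Prometheus text exposition format and skips comments; it does not unescape label values or handle every corner of the format. The sample lines and their values are made up for illustration, using metric and label names mentioned in this guide.

```python
import re
from typing import Dict, List, NamedTuple


class Sample(NamedTuple):
    name: str
    labels: Dict[str, str]
    value: float


# name, optional {labels}, value, optional integer timestamp
_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Made-up sample; real output will differ.
SAMPLE_TEXT = """\
# TYPE openviking_http_requests_total counter
openviking_http_requests_total{route="/api/v1/search/find",status="200"} 42
openviking_http_requests_total{route="/api/v1/search/find",status="500"} 3
openviking_service_readiness 1
"""


def parse_metrics(text: str) -> List[Sample]:
    """Parse sample lines, ignoring blank lines, comments, and unparseable lines."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = _LINE.match(line)
        if match is None:
            continue
        name, label_blob, value = match.groups()
        try:
            number = float(value)  # also accepts NaN, +Inf, -Inf
        except ValueError:
            continue
        samples.append(Sample(name, dict(_LABEL.findall(label_blob or "")), number))
    return samples


def total(samples: List[Sample], name: str) -> float:
    """Sum the values of all samples with the given metric name."""
    return sum(s.value for s in samples if s.name == name)


print(total(parse_metrics(SAMPLE_TEXT), "openviking_http_requests_total"))
```

For anything beyond ad hoc inspection, a maintained parser or Prometheus itself is the safer choice.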
The most common setup is to let Prometheus scrape it on a schedule:
scrape_configs:
  - job_name: openviking
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:1933"]
Once Prometheus is successfully scraping /metrics, the next common step is to import the OpenViking demo dashboard into Grafana.
Step 1: Confirm that Prometheus is already scraping /metrics
Before importing the dashboard, make sure the Prometheus data source can already query OpenViking metrics. The quickest checks are:
- Query openviking_http_requests_total in the Prometheus UI
- Query openviking_service_readiness

If there is no data yet, go back to the Prometheus scrape configuration above and verify targets, metrics_path, and network connectivity first.
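The Step 1 checks can also be scripted against the Prometheus HTTP API, whose instant-query endpoint is `/api/v1/query`. In this sketch the Prometheus address `http://localhost:9090` is an assumption (it is the Prometheus default, not something this guide specifies), and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(expr: str, prometheus_url: str = "http://localhost:9090") -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{prometheus_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def has_samples(response: dict) -> bool:
    """True when the query succeeded and returned at least one series."""
    if response.get("status") != "success":
        return False
    return bool(response.get("data", {}).get("result"))


def check_metric(expr: str, timeout: float = 5.0) -> bool:
    """Run the query against Prometheus (requires a reachable Prometheus server)."""
    with urllib.request.urlopen(build_query_url(expr), timeout=timeout) as resp:
        return has_samples(json.loads(resp.read().decode("utf-8")))


print(build_query_url("openviking_http_requests_total"))
```

Running `check_metric` for both metric names gives a quick yes/no answer before moving on to Grafana.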
Step 2: Import the official demo dashboard into Grafana
The OpenViking repository already includes ready-to-import dashboard JSON:
(The dashboard requires the tim012432-calendarheatmap-panel Grafana plugin. Install it before importing to ensure panels render correctly.)

You can import it with the following steps:

1. Open Dashboards from the left-side menu.
2. Click New or Import in the top-right corner.
3. Click Import to finish.

If the dashboard imports but panels are empty, the first two things to verify are:

- `openviking_*` metrics

Step 3: What to look at after the dashboard opens
After import, the most useful panels usually come from these metric families:
- `openviking_http_*`: HTTP request volume, latency, and inflight requests
- `openviking_operation_*`: structured operation success rates and latency
- `openviking_queue_*`: queue throughput, backlog, and in-progress work
- `openviking_*_readiness`: dependency and probe health state

A beginner-friendly viewing order is:

- account_id, route, and status.

Step 4: What the final result looks like
After a successful import, you should see a dashboard centered on OpenViking requests, queues, probes, model calls, and overall system state. For a visual reference, see:
This screenshot helps you quickly verify whether the imported layout looks correct. If the dashboard structure matches but some panels are empty, it usually means the corresponding metrics have not produced samples yet, or the filters do not match the current traffic.
The most common labels you will use while investigating dashboards are:
- account_id: tenant dimension label. It is only enabled on controlled allowlisted metric families. Unidentified requests fall into `__unknown__`, and values beyond the active-tenant budget fall into `__overflow__`
- route: HTTP route template, for example /api/v1/search/find
- status: request or stage status, such as 200, ok, or error
- valid: whether the sample came from a successful refresh; valid="0" usually indicates a fallback or stale sample

/metrics vs. other entry points

- /health and observer/*
- ov tui
- telemetry
- /metrics