Observability & Diagnostics

This guide collects the current OpenViking observability entry points in one place, including:

service health and component status
request-level telemetry
terminal-side ov tui
web-side Web Studio (served by the OV server at /studio)
/metrics time-series metrics

If you just want to know where to look first, start with the table below.

Choose the right entry point

Entry point	Best for	Typical use case
`/health`, `observer/*`	service health, queue backlog, VikingDB and VLM status	deployment validation, on-call checks
`ov tui`	`viking://` trees, directory summaries, file content, vector records, image preview for supported image files	development debugging, verifying that data actually landed
`Web Studio` (`/studio`)	same-origin web UI on the OV server: Home shows token / retrieval / context-commit trends; Resources browses URIs; Retrieval runs find; Request Logs shows audit	interactive investigation without typing every command
`telemetry`	per-request duration, token usage, vector retrieval, ingestion stages	debugging one specific slow or unexpected call
`/metrics`	request trends, error rates, latency distribution, queue and probe state	Prometheus scraping, Grafana dashboards, alert rules

Service health and component status

Health check

/health provides a simple liveness check and does not require authentication.

bash

curl http://localhost:1933/health

json

{"status": "ok"}

Overall system status

Python SDK (Embedded / HTTP)

python

status = client.get_status()
print(f"Healthy: {status['is_healthy']}")
print(f"Errors: {status['errors']}")

HTTP API

bash

curl http://localhost:1933/api/v1/observer/system \
  -H "X-API-Key: your-key"

json

{
  "status": "ok",
  "result": {
    "is_healthy": true,
    "errors": [],
    "components": {
      "queue": {"name": "queue", "is_healthy": true, "has_errors": false},
      "vikingdb": {"name": "vikingdb", "is_healthy": true, "has_errors": false},
      "vlm": {"name": "vlm", "is_healthy": true, "has_errors": false}
    }
  }
}

Component status

Endpoint	Component	Description
`GET /api/v1/observer/queue`	Queue	Processing queue status
`GET /api/v1/observer/vikingdb`	VikingDB	Vector database status
`GET /api/v1/observer/vlm`	VLM	Vision Language Model status

For example:

bash

curl http://localhost:1933/api/v1/observer/queue \
  -H "X-API-Key: your-key"

Quick health check

Python SDK (Embedded / HTTP)

python

if client.is_healthy():
    print("System OK")

HTTP API

bash

curl http://localhost:1933/api/v1/debug/health \
  -H "X-API-Key: your-key"

json

{"status": "ok", "result": {"healthy": true}}

Response time

Every API response includes an X-Process-Time header with the server-side processing time in seconds:

bash

curl -v http://localhost:1933/api/v1/fs/ls?uri=viking:// \
  -H "X-API-Key: your-key" 2>&1 | grep X-Process-Time
# < X-Process-Time: 0.0023

This layer answers "is the service up, blocked, or unhealthy?" If you want to inspect what happened inside one request, move on to telemetry.

Use `ov tui` for data-plane inspection

The ov CLI includes a dedicated TUI file explorer:

bash

ov tui /

You can also start from a specific scope:

bash

ov tui viking://resources

Prerequisites:

OpenViking Server is running
ovcli.conf is configured
the current X-API-Key can read the target tenant data

This TUI is useful for two kinds of inspection:

checking what actually exists under viking://resources, viking://user, viking://agent, and viking://session
checking whether vector records for a URI were actually written, and how many there are

Common keys:

q: quit
Tab: switch focus between the tree and content panels
j / k: move up and down
.: expand or collapse a directory
g / G: jump to the top or bottom
v: toggle vector-record view
n: load the next page in vector-record view
c: count total vector records for the current URI

A typical debugging flow is:

Run ov tui viking://resources and locate the target document or directory.
Confirm the right-side panel shows abstract, overview, or file content (supported image files — png, jpg, jpeg, gif, bmp, webp, tiff, tif — are rendered inline as a preview).
Press v to inspect vector records for that URI.
Press c to get the total count, and n to keep paging if needed.

TUI is primarily for data-plane inspection. It helps answer "did the resource really land?" and "were vectors really written?" but it does not directly show token totals or per-stage request timing.

Use Web Studio for web-based investigation

The OV server serves the Web Studio frontend at /studio on its own port — no separate process to start.

text

http://127.0.0.1:1933/studio

On first use, open the Connection dialog in the top right and set your X-API-Key. The base URL defaults to the current same origin (the URL you loaded /studio from).

The most useful pages for observability are:

Home (/studio): today's token usage, retrieval counts, context-commit trends, agent access summary — backed by the /api/v1/console/* BFF
Request Logs (/studio/request-logs): audit logs filterable by account / user / agent / route, backed by /api/v1/console/audit
Resources (/studio/resources): browse URIs, view directories and files, upload resources
Retrieval (/studio/retrieval): run find / search / grep requests and inspect results
Sessions (/studio/sessions): browse session history, inspect message and memory commit flow

Write operations (Add Resource, Add Memory, tenant/user administration) are gated by the API key currently signed in — there's no separate --write-enabled switch.

From an observability standpoint, Studio talks to the same /api/v1/console/* BFF (dashboard summary, token series, context commits, audit logs) the old standalone console used — only the UI changed. For operations such as find, add-resource, and session commit, you can expand the result panel to inspect telemetry.summary.

Studio is best for interactive click-through debugging. If you need to feed observability data into your own logs or automation, prefer the HTTP API or SDK and request telemetry explicitly.

Request-level telemetry

The public request-tracing feature in OpenViking is called operation telemetry. It attaches a structured summary to a response so you can inspect things like:

total duration
LLM and embedding token usage
vector search counts, scan volume, and returned results
resource-ingestion stages
memory extraction stats for session.commit

The most common way to request it is to pass:

json

{"telemetry": true}

For example:

bash

curl -X POST http://localhost:1933/api/v1/search/find \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-key" \
  -d '{
    "query": "memory dedup",
    "limit": 5,
    "telemetry": true
  }'

For the full field reference, supported operations, and more examples, see:

Operation Telemetry Reference

Use `/metrics` for time-series observability

/metrics is OpenViking's time-series metrics endpoint for the Prometheus scraping model. It is well suited for questions like:

Has HTTP traffic increased abnormally over the last few minutes?
Is the error rate for a route or operation continuing to rise?
Is latency distribution getting worse?
Is queue backlog starting to build up?
Have key dependencies, probes, or model providers become unhealthy?

Compared with observer/*, /metrics is better for trends, aggregation, and alerting. observer/* is better for inspecting the current point-in-time state by hand.

Compared with telemetry, /metrics focuses on aggregated time series, while telemetry focuses on what happened inside one specific request.

Enable metrics quickly

/metrics may be disabled by default. When the metrics subsystem is not enabled, the endpoint returns 404 with the message Prometheus metrics are disabled..

You do not need the full configuration to get started. Enabling the master switch under the server section is enough.

Minimal config (recommended)

Add the following to ~/.openviking/ov.conf (or the path passed via --config):

json

{
  "server": {
    "observability": {
      "metrics": {
        "enabled": true
      }
    }
  }
}

Restart OpenViking Server after editing the config.

Observability config hierarchy

OpenViking groups signal-level observability configuration under server.observability:

server.observability.metrics: metrics subsystem and exporters
server.observability.traces: trace export configuration
server.observability.logs: log export configuration
server.observability.dump_body: attaches HTTP request/response bodies (filtered by content-type, truncated by bytes) as attributes on the active trace span so they can be inspected in trace UIs. Off by default — bodies may contain secrets and high-cardinality content

Example:

json

{
  "server": {
    "observability": {
      "metrics": {
        "enabled": true,
        "exporters": {
          "prometheus": {
            "enabled": true
          },
          "otel": {
            "enabled": true,
            "protocol": "grpc",
            "tls": {
              "insecure": true
            },
            "endpoint": "otel-collector:4317",
            "service_name": "openviking-server",
            "export_interval_ms": 10000,
            "headers": {}
          }
        }
      },
      "traces": {
        "enabled": true,
        "protocol": "grpc",
        "tls": {
          "insecure": true
        },
        "endpoint": "otel-collector:4317",
        "service_name": "openviking-server",
        "headers": {}
      },
      "logs": {
        "enabled": true,
        "protocol": "grpc",
        "tls": {
          "insecure": true
        },
        "endpoint": "otel-collector:4317",
        "service_name": "openviking-server",
        "headers": {}
      },
      "dump_body": {
        "enabled": false,
        "max_bytes": 4096
      }
    }
  }
}

Notes:

headers forwards custom OTLP request headers or gRPC metadata to the exporter.
This is useful when an OTLP backend requires extra auth headers for direct ingestion.
The headers shape is the same across traces, logs, and metrics.exporters.otel.
When protocol="grpc", headers are sent as gRPC metadata and keys should be lowercase, for example x-byteapm-appkey; this restriction does not apply to protocol="http".

For full fields, supported ranges, and more examples, see:

Metrics

Access `/metrics` directly

In the current implementation, /metrics is not wired to get_request_context or other auth dependencies, so from the code-path perspective it currently behaves as a public scrape endpoint:

bash

curl http://localhost:1933/metrics

If your deployment protects /metrics at the gateway, reverse proxy, or service discovery layer, attach auth according to the deployment environment.

Prometheus scrape example

The most common setup is to let Prometheus scrape it on a schedule:

yaml

scrape_configs:
  - job_name: openviking
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:1933"]

Import and view the dashboard in Grafana

Once Prometheus is successfully scraping /metrics, the next common step is to import the OpenViking demo dashboard into Grafana.

Step 1: Confirm that Prometheus is already scraping /metrics

Before importing the dashboard, make sure the Prometheus data source can already query OpenViking metrics. The quickest checks are:

run openviking_http_requests_total in the Prometheus UI
or run openviking_service_readiness
if either query returns time series, Grafana should be able to render panels afterwards

If there is no data yet, go back to the Prometheus scrape configuration above and verify targets, metrics_path, and network connectivity first.

Step 2: Import the official demo dashboard into Grafana

The OpenViking repository already includes ready-to-import dashboard JSON:

openviking_demo_dashboard.json
openviking_token_demo_dashboard.json (Note: this dashboard depends on the tim012432-calendarheatmap-panel Grafana plugin. Install it before importing to ensure panels render correctly.)

You can import it with the following steps:

Sign in to Grafana.
Open Dashboards from the left-side menu.
Click New or Import in the top-right corner.
Upload the JSON file, or paste the contents of the linked file.
Select Prometheus as the data source on the import screen.
Click Import to finish.

If the dashboard imports but panels are empty, the first two things to verify are:

whether Grafana is bound to the correct Prometheus data source
whether Prometheus is actually scraping openviking_* metrics

Step 3: What to look at after the dashboard opens

After import, the most useful panels usually come from these metric families:

openviking_http_*: HTTP request volume, latency, and inflight requests
openviking_operation_*: structured operation success rates and latency
openviking_queue_*: queue throughput, backlog, and in-progress work
openviking_*_readiness: dependency and probe health state

A beginner-friendly viewing order is:

Start with readiness and health panels to confirm that the system and dependencies are generally healthy.
Move to HTTP and operation panels to see whether traffic, error rates, or latency have changed.
Then inspect queue panels to determine whether async work is backing up.
Finally, narrow the scope with labels such as account_id, route, and status.

Step 4: What the final result looks like

After a successful import, you should see a dashboard centered on OpenViking requests, queues, probes, model calls, and overall system state. For a visual reference, see:

grafana-demo-dashboard.png

This screenshot helps you quickly verify whether the imported layout looks correct. If the dashboard structure matches but some panels are empty, it usually means the corresponding metrics have not produced samples yet, or the filters do not match the current traffic.

Understanding common labels

The most common labels you will use while investigating dashboards are:

account_id: tenant dimension label. It is only enabled on controlled allowlisted metric families. Unidentified requests fall into __unknown__, and values beyond the active-tenant budget fall into __overflow__
route: HTTP route template, for example /api/v1/search/find
status: request or stage status, such as 200, ok, or error
valid: whether the sample came from a successful refresh; valid="0" usually indicates a fallback or stale sample

When to use `/metrics` vs. other entry points

To check whether the service is healthy and which component is currently unhealthy, start with /health and observer/*
To verify whether resources really landed and vectors were actually written, use ov tui
To inspect why one specific request was slow, where tokens went, or which stage blocked resource processing, use telemetry
To inspect whether request volume, error rates, or latency are degrading over time, use /metrics

Deployment - server setup
Authentication - API key setup
Operation Telemetry Reference - request-level structured tracing
System API - system and observer API reference
Metrics - time-series metrics and configuration

Observability & Diagnostics

Observability & Diagnostics

Choose the right entry point

Service health and component status

Health check

Overall system status

Component status

Quick health check

Response time

Use ov tui for data-plane inspection

Use Web Studio for web-based investigation

Request-level telemetry

Use /metrics for time-series observability

Enable metrics quickly

Observability config hierarchy

Access /metrics directly

Prometheus scrape example

Import and view the dashboard in Grafana

Understanding common labels

When to use /metrics vs. other entry points

Related Documentation

Use `ov tui` for data-plane inspection

Use `/metrics` for time-series observability

Access `/metrics` directly

When to use `/metrics` vs. other entry points