packages/kilo-docs/pages/contributing/features/agent-observability.md
{% callout type="info" title="Status" %} Partial - API metrics, session ingestion, storage, and burn-rate alert infrastructure exist. Higher-order agent behavior and outcome analysis remain roadmap work. {% /callout %}
Agentic coding systems combine model requests, tool execution, file changes, and external API calls. Traditional request metrics catch hard failures, but investigating loops, degraded sessions, and poor outcomes also requires agent behavior signals.
Current cloud service context is documented in Cloud Platform observability.
| Capability | Status | Notes |
|---|---|---|
| API metrics ingestion | Current | Operational request metrics ingestion exists |
| Session metrics ingestion | Current | Session-level ingestion exists |
| Burn-rate alert evaluation | Current | Alert evaluation runs against stored metrics |
| Alert config storage | Current | Alert configuration storage exists |
| Analytics Engine storage | Current | API and session metrics datasets exist |
| Export pipelines | Current infrastructure | Metrics export infrastructure exists for downstream analysis |
| Per-message feedback | Current | Explicit user feedback signal exists |

Planned or partially implemented capabilities:

| Capability | Status | Goal |
|---|---|---|
| Oscillation detection | Planned or partial | Detect repeated or alternating agent actions |
| Unique-file progress metrics | Planned or partial | Track files touched during session |
| Unique-tool progress metrics | Planned or partial | Track tool diversity and repeated operations |
| Session termination classification | Planned | Distinguish completion, abandonment, timeout, and errors |
| Higher-order outcome analysis | Planned | Assess usefulness and task success beyond hard errors |
Use the existing ingestion and alert infrastructure as the base for dashboards and service-level objectives. Validate metric coverage before treating any field as available for production analysis.
Candidate dimensions for model requests:
Candidate session aggregates:
Burn-rate evaluation infrastructure exists. Proposed alert routing should page only for recommended models using Kilo Gateway; other conditions can create tickets or remain disabled.
| Window | Burn rate | Proposed action |
|---|---|---|
| 5 min | 14.4x | Page for major outage |
| 30 min | 6x | Page for incident |
| 6 hr | 1x | Create ticket for behavior change |
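
As a rough illustration of how the thresholds in the table above could be evaluated, the sketch below computes a burn rate as the observed error rate divided by the error budget, then maps each window to its proposed action. The names `BurnRateRule`, `burn_rate`, and `proposed_actions`, the 0.1% error budget, the inclusive `>=` comparison, and the choice to downgrade pages to tickets for non-pageable traffic are all assumptions made for illustration. None of this describes the existing alert evaluator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    window_minutes: int
    threshold: float  # burn-rate multiple that triggers the rule
    action: str       # "page" or "ticket"
    reason: str


# Mirrors the proposed table above; not read from real alert config.
RULES = (
    BurnRateRule(5, 14.4, "page", "major outage"),
    BurnRateRule(30, 6.0, "page", "incident"),
    BurnRateRule(360, 1.0, "ticket", "behavior change"),
)


def burn_rate(bad_events: int, total_events: int, error_budget: float) -> float:
    """Observed error rate divided by the allowed error rate."""
    if total_events == 0:
        return 0.0  # no traffic in the window, so no budget was consumed
    return (bad_events / total_events) / error_budget


def proposed_actions(rates_by_window: dict, pageable: bool) -> list:
    """Return (window_minutes, action, reason) for every rule that fires.

    `pageable` is a hypothetical flag meaning "recommended model using
    Kilo Gateway". For other traffic, pages are downgraded to tickets here;
    leaving them disabled would be an equally valid choice.
    """
    fired = []
    for rule in RULES:
        rate = rates_by_window.get(rule.window_minutes, 0.0)
        if rate >= rule.threshold:  # assumption: threshold is inclusive
            action = rule.action if pageable else "ticket"
            fired.append((rule.window_minutes, action, rule.reason))
    return fired


# Example event counts per window against an assumed 0.1% error budget.
rates = {
    5: burn_rate(20, 1_000, 0.001),
    30: burn_rate(30, 10_000, 0.001),
    360: burn_rate(150, 100_000, 0.001),
}
print(proposed_actions(rates, pageable=True))
```

Keeping the rules as data rather than branching logic makes it straightforward to load them from alert config storage later, if that turns out to be the desired design.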
Initial behavior analysis should focus on repeated operations and progress signals:
| Signal | Purpose |
|---|---|
| Identical tool calls | Detect repeated actions with the same tool and arguments |
| Identical failing calls | Detect retries that repeat the same failure |
| Oscillation patterns | Detect alternating states without progress |
| Unique files touched | Estimate breadth of session changes |
| Unique tools used | Compare progress against repeated operations |
| Repeated-to-unique ratio | Identify sessions that may be stuck |
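
The signals in the table above could be derived from an ordered list of tool calls, as in the following minimal sketch. The `ToolCall` record and its fields are hypothetical, not an existing schema. The ratio is defined here as repeated calls divided by unique call signatures, which is one plausible reading rather than a settled definition, and oscillation is approximated as two distinct calls alternating for a minimum number of cycles.

```python
import json
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolCall:
    # Hypothetical record shape; real session events may differ.
    tool: str
    args: dict
    file: Optional[str] = None
    failed: bool = False


def signature(call: ToolCall) -> tuple:
    """Identify a call by tool name plus canonicalized arguments."""
    return (call.tool, json.dumps(call.args, sort_keys=True, default=str))


def session_signals(calls: list) -> dict:
    """Compute repeated-operation and progress signals for one session."""
    counts = Counter(signature(c) for c in calls)
    failing = Counter(signature(c) for c in calls if c.failed)
    unique_calls = len(counts)
    repeated = len(calls) - unique_calls  # calls beyond each first occurrence
    return {
        "identical_tool_calls": repeated,
        "identical_failing_calls": sum(n - 1 for n in failing.values()),
        "unique_files": len({c.file for c in calls if c.file}),
        "unique_tools": len({c.tool for c in calls}),
        # Assumed definition: repeated calls per unique call signature.
        "repeated_to_unique_ratio": repeated / unique_calls if unique_calls else 0.0,
    }


def has_oscillation(calls: list, min_cycles: int = 2) -> bool:
    """True if two distinct calls alternate (A, B, A, B, ...) min_cycles times."""
    sigs = [signature(c) for c in calls]
    needed = 2 * min_cycles
    for i in range(len(sigs) - needed + 1):
        a, b = sigs[i], sigs[i + 1]
        if a != b and all(
            sigs[i + k] == (a if k % 2 == 0 else b) for k in range(needed)
        ):
            return True
    return False


# Example session: one read, an edit that flips back and forth, a failing retry.
calls = [
    ToolCall("read", {"path": "a.py"}, file="a.py"),
    ToolCall("edit", {"path": "a.py", "value": 1}, file="a.py"),
    ToolCall("edit", {"path": "a.py", "value": 2}, file="a.py"),
    ToolCall("edit", {"path": "a.py", "value": 1}, file="a.py"),
    ToolCall("edit", {"path": "a.py", "value": 2}, file="a.py"),
    ToolCall("test", {"cmd": "pytest"}, failed=True),
    ToolCall("test", {"cmd": "pytest"}, failed=True),
]
signals = session_signals(calls)
print(signals)
print(has_oscillation(calls))
```

Any threshold for flagging a session as stuck would need to be calibrated against real sessions; the sketch only produces the raw signals.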
Neither the absence of hard errors nor healthy behavior metrics proves that a user succeeded. Later work can combine explicit per-message feedback with session termination analysis and other outcome signals. Offline model and agent comparison belongs in Benchmarking.