Agent Observability

packages/kilo-docs/pages/contributing/features/agent-observability.md

{% callout type="info" title="Status" %} Partial - API metrics, session ingestion, storage, and burn-rate alert infrastructure exist. Higher-order agent behavior and outcome analysis remain roadmap work. {% /callout %}

Overview

Agentic coding systems combine model requests, tool execution, file changes, and external API calls. Traditional request metrics catch hard failures, but investigating loops, degraded sessions, and poor outcomes also requires agent behavior signals.

The current cloud service context is documented in Cloud Platform observability.

Current implementation

| Capability | Status | Notes |
| --- | --- | --- |
| API metrics ingestion | Current | Operational request metrics ingestion exists |
| Session metrics ingestion | Current | Session-level ingestion exists |
| Burn-rate alert evaluation | Current | Alert evaluation runs against stored metrics |
| Alert config storage | Current | Alert configuration storage exists |
| Analytics Engine storage | Current | API and session metrics datasets exist |
| Export pipelines | Current infrastructure | Metrics export infrastructure exists for downstream analysis |
| Per-message feedback | Current | Explicit user feedback signal exists |

Roadmap

| Capability | Status | Goal |
| --- | --- | --- |
| Oscillation detection | Planned or partial | Detect repeated or alternating agent actions |
| Unique-file progress metrics | Planned or partial | Track files touched during session |
| Unique-tool progress metrics | Planned or partial | Track tool diversity and repeated operations |
| Session termination classification | Planned | Distinguish completion, abandonment, timeout, and errors |
| Higher-order outcome analysis | Planned | Assess usefulness and task success beyond hard errors |

Operational metrics roadmap

Use the existing ingestion and alert infrastructure as the base for dashboards and service-level objectives. Validate metric coverage before treating any field as available for production analysis.

API metrics

Candidate dimensions for model requests:

  • Provider
  • Model
  • Tool
  • Latency
  • Success or failure
  • Error type
  • Token counts
  • Client source
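
One way to validate coverage for these candidate dimensions is to measure how often each field is actually populated in a sample of ingested records. The sketch below assumes records arrive as plain dictionaries. The field names are hypothetical stand-ins for the dimensions listed above, not the real ingestion schema.

```python
from collections import Counter
from typing import Iterable, Mapping

# Hypothetical field names mirroring the candidate dimensions above;
# the real ingestion schema may use different names or types.
CANDIDATE_FIELDS = (
    "provider", "model", "tool", "latency_ms", "success",
    "error_type", "input_tokens", "output_tokens", "client_source",
)


def field_coverage(
    records: Iterable[Mapping[str, object]],
    fields: tuple[str, ...] = CANDIDATE_FIELDS,
) -> dict[str, float]:
    """Return the fraction of records in which each field is present and not None."""
    present: Counter = Counter()
    total = 0
    for record in records:
        total += 1
        for name in fields:
            if record.get(name) is not None:
                present[name] += 1
    if total == 0:
        # No data: report zero coverage rather than dividing by zero.
        return {name: 0.0 for name in fields}
    return {name: present[name] / total for name in fields}
```

Some fields are expected to be sparse. An error type, for example, would normally be absent on successful requests, so low coverage there is not necessarily a gap.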

Session metrics

Candidate session aggregates:

  • Session duration
  • Time to first model response
  • Turns and tool calls
  • Errors by type
  • Tokens consumed
  • Context compaction frequency
  • Termination reason
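
The aggregates above could be derived from a per-session event stream. The following is a minimal sketch under that assumption. The `Event` shape, the `kind` values, and the `summarize` helper are all hypothetical and do not describe the existing ingestion format.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    # Hypothetical event shape; not the actual ingestion schema.
    timestamp: float  # seconds, any monotonic reference
    kind: str         # "user_turn", "model_response", "tool_call", "compaction", "error"
    tokens: int = 0
    error_type: Optional[str] = None


@dataclass
class SessionSummary:
    duration_s: float
    time_to_first_response_s: Optional[float]
    turns: int
    tool_calls: int
    errors_by_type: dict[str, int]
    tokens: int
    compactions: int
    termination_reason: Optional[str]


def summarize(events: list[Event], termination_reason: Optional[str] = None) -> SessionSummary:
    """Fold a session's events into the candidate aggregates."""
    if not events:
        return SessionSummary(0.0, None, 0, 0, {}, 0, 0, termination_reason)
    ordered = sorted(events, key=lambda e: e.timestamp)
    start = ordered[0].timestamp
    first = next((e.timestamp for e in ordered if e.kind == "model_response"), None)
    errors = Counter(e.error_type or "unknown" for e in ordered if e.kind == "error")
    return SessionSummary(
        duration_s=ordered[-1].timestamp - start,
        time_to_first_response_s=None if first is None else first - start,
        turns=sum(1 for e in ordered if e.kind == "user_turn"),
        tool_calls=sum(1 for e in ordered if e.kind == "tool_call"),
        errors_by_type=dict(errors),
        tokens=sum(e.tokens for e in ordered),
        compactions=sum(1 for e in ordered if e.kind == "compaction"),
        termination_reason=termination_reason,
    )
```

The termination reason is passed in rather than inferred, since session termination classification is still roadmap work.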

Alert policy

Burn-rate evaluation infrastructure exists. Proposed alert routing should page only for recommended models using Kilo Gateway; other conditions can create tickets or remain disabled.

| Window | Burn rate | Proposed action |
| --- | --- | --- |
| 5 min | 14.4x | Page for major outage |
| 30 min | 6x | Page for incident |
| 6 hr | 1x | Create ticket for behavior change |
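
To make the thresholds concrete: burn rate is conventionally the observed error rate divided by the error budget, where the budget is one minus the SLO target. The sketch below applies the three windows from the table under that definition. The SLO target, the `BurnRateRule` type, the `page_eligible` flag, and the shape of `window_counts` are assumptions for illustration and do not reflect the existing alert evaluation code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    window_minutes: int
    threshold: float
    action: str
    pages: bool  # True if the action pages someone rather than filing a ticket


# Windows and thresholds taken from the table above.
RULES = (
    BurnRateRule(5, 14.4, "page: major outage", pages=True),
    BurnRateRule(30, 6.0, "page: incident", pages=True),
    BurnRateRule(360, 1.0, "ticket: behavior change", pages=False),
)


def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget (1 - SLO target)."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    if total == 0:
        return 0.0
    return (errors / total) / budget


def evaluate(
    window_counts: dict[int, tuple[int, int]],
    slo_target: float,
    page_eligible: bool = True,
) -> list[str]:
    """Return the proposed actions whose burn-rate threshold is met.

    window_counts maps window length in minutes to (errors, total requests).
    page_eligible is a hypothetical flag for "recommended model using Kilo
    Gateway"; paging rules are skipped when it is False.
    """
    actions = []
    for rule in RULES:
        if rule.pages and not page_eligible:
            continue
        errors, total = window_counts.get(rule.window_minutes, (0, 0))
        if burn_rate(errors, total, slo_target) >= rule.threshold:
            actions.append(rule.action)
    return actions
```

With an assumed 99% target, a 5-minute window showing 20 errors in 100 requests burns at roughly 20x and would trip the first rule, while 30 errors in 1,000 requests over 30 minutes burns at roughly 3x and would not trip the second.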

Agent behavior roadmap

Initial behavior analysis should focus on repeated operations and progress signals:

| Signal | Purpose |
| --- | --- |
| Identical tool calls | Detect repeated actions with same tool and arguments |
| Identical failing calls | Detect retries that repeat same failure |
| Oscillation patterns | Detect alternating states without progress |
| Unique files touched | Estimate breadth of session changes |
| Unique tools used | Compare progress against repeated operations |
| Repeated-to-unique ratio | Identify sessions that may be stuck |
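
Two of these signals can be sketched over an ordered list of tool calls. The example below assumes each call is a `(tool name, arguments)` pair with JSON-serializable arguments. The function names and the simple two-state definition of oscillation are illustrative choices, not the planned detection logic.

```python
import json
from collections import Counter
from typing import Any, Sequence

# Hypothetical call shape: (tool name, arguments).
ToolCall = tuple[str, dict[str, Any]]


def call_key(call: ToolCall) -> str:
    """Canonical key so calls with the same tool and arguments compare equal."""
    name, args = call
    return name + ":" + json.dumps(args, sort_keys=True, default=str)


def repeated_to_unique_ratio(calls: Sequence[ToolCall]) -> float:
    """Number of repeat calls divided by number of distinct calls."""
    if not calls:
        return 0.0
    counts = Counter(call_key(c) for c in calls)
    unique = len(counts)
    repeated = len(calls) - unique
    return repeated / unique


def has_oscillation(calls: Sequence[ToolCall], min_cycles: int = 2) -> bool:
    """True if the sequence ends in an A, B, A, B alternation of two distinct
    calls lasting at least min_cycles full cycles."""
    keys = [call_key(c) for c in calls]
    needed = 2 * min_cycles
    if len(keys) < needed:
        return False
    tail = keys[-needed:]
    a, b = tail[0], tail[1]
    if a == b:
        return False
    return all(k == (a if i % 2 == 0 else b) for i, k in enumerate(tail))
```

A session that alternates between reading and editing the same file four times in a row would score a ratio of 1.0 and be flagged as oscillating. Whether that actually indicates a stuck session still depends on context, which is why these are treated as investigation signals rather than verdicts.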

Outcome roadmap

Hard errors and behavior metrics do not prove user success. Later work can combine explicit per-message feedback with session termination analysis and other outcome signals. Offline comparison of models and agents belongs in Benchmarking.