packages/kilo-docs/pages/contributing/architecture/agent-observability.md
Agentic coding systems like Kilo Code operate with significant autonomy, executing multi-step tasks that involve LLM inference, tool execution, file manipulation, and external API calls. Observing them means combining traditional systems observability (e.g. request/response metrics) with observability of agentic behavior (e.g. planning, reasoning, and tool use).
At the lower level, we can observe the system as a traditional API, but at the higher level, we need to observe the agent's behavior and the quality of its outputs.
Some examples of customer-facing error modes:
All of these contribute to the overall reliability and user experience of the system.
Non-goals for this proposal:
Focus on the lower-level systems observability first, then build up to higher-level agentic behavior observability.
Objective: Establish awareness and alerting for hard failures.
This phase focuses on systems metrics we can capture with minimal changes, providing immediate operational visibility.
Capture these metrics per LLM API call:
Common dashboards, each filterable by provider, model, and tool:
Implement multi-window, multi-burn-rate alerting against error budgets:
| Window | Burn Rate | Action | Use Case |
|---|---|---|---|
| 5 min | 14.4x | Page | Major Outage |
| 30 min | 6x | Page | Incident |
| 6 hr | 1x | Ticket | Change in behavior |
Paging should occur only for Recommended Models served through the Kilo Gateway. All other alerts should be tickets, and some may be configured to be ignored.
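As a rough illustration of how the table and the paging rule could fit together, the sketch below computes a burn rate as the observed error rate divided by the error budget and checks it against each row of the table. The names `BurnRateRule`, `burn_rate`, `evaluate`, and the `pageable` flag are hypothetical, and the 0.1% error budget in the usage note is an assumed value, not one this proposal defines.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class BurnRateRule:
    window_minutes: int
    threshold: float
    action: str  # "page" or "ticket"


# Windows and thresholds copied from the table above, most urgent first.
RULES = (
    BurnRateRule(window_minutes=5, threshold=14.4, action="page"),
    BurnRateRule(window_minutes=30, threshold=6.0, action="page"),
    BurnRateRule(window_minutes=360, threshold=1.0, action="ticket"),
)


def burn_rate(errors: int, total: int, error_budget: float) -> float:
    """Observed error rate divided by the allowed error rate (the budget)."""
    if total == 0:
        return 0.0
    return (errors / total) / error_budget


def evaluate(
    window_counts: Dict[int, Tuple[int, int]],
    error_budget: float,
    pageable: bool,
) -> Optional[str]:
    """Return "page", "ticket", or None for one provider/model series.

    window_counts maps window length in minutes to (errors, total requests).
    pageable is an assumed flag: True only for a Recommended Model served
    through the Kilo Gateway; any other series is downgraded to a ticket.
    """
    for rule in RULES:
        errors, total = window_counts.get(rule.window_minutes, (0, 0))
        if burn_rate(errors, total, error_budget) >= rule.threshold:
            if rule.action == "page" and not pageable:
                return "ticket"
            return rule.action
    return None
```

For example, with an assumed error budget of `0.001`, `evaluate({5: (20, 1000)}, 0.001, pageable=True)` returns `"page"`, while the same counts with `pageable=False` return `"ticket"`.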
Initial alert conditions:
Per-session (aggregated at session close or timeout):
None.
Objective: Observe how agents use tools within a given session.
Loop and repetition detection:
Progress indicators:
None to start; we will learn.
Objective: Understand whether sessions are successful from the user's perspective.
Hard errors and behavior metrics tell us about failures, but we also need signal on overall session health.
Explicit signals:
Implicit signals:
May require LLM analysis of session transcripts to detect: