This document proposes a benchmarking system for Kilo Code with two primary goals: comparing quality, cost, and performance across models and agents, and catching regressions before they ship.
The design leverages existing open source infrastructure rather than building a custom harness: Harbor for containerized trial orchestration, ATIF for trajectory logging, and the tbench.ai dashboard plus Opik for analysis.
The key engineering deliverable is a Kilo Code Harbor adapter that runs Kilo CLI autonomously in containerized environments and emits ATIF-compliant trajectories.
{% callout type="info" %} This is separate from production observability, which monitors real user sessions via PostHog. Benchmarking is an offline evaluation system for comparing quality, cost, and performance across models and agents. {% /callout %}
As Kilo Code evolves, we need systematic answers to questions like:

- Which model delivers the best quality for the cost on realistic coding tasks?
- Did the latest release regress agent behavior?
- How does Kilo Code compare to other agents on the same tasks?

Today we have no structured way to answer these questions. Manual testing is not reproducible, and our existing PostHog telemetry does not capture the turn-by-turn detail needed for comparative analysis.
Non-goals:

- Monitoring real user sessions (that is production observability, handled via PostHog)
- Real-time alerting, dashboards, or SLO tracking (see the comparison table at the end of this document)

```
┌─────────────────────────────────────────────────────────┐
│ Harbor Framework │
│ │
│ ┌──────────────┐ ┌─────────────┐ ┌─────────────────┐ │
│ │Terminal-Bench│ │ SWE-bench │ │ Custom Tasks │ │
│ │ 2.0 │ │ │ │ (Kilo-specific) │ │
│ └──────┬───────┘ └──────┬──────┘ └───────┬─────────┘ │
│ └────────────────┼─────────────────┘ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ Containerized Trial │ │
│ │ │ │
│ │ ┌─────────────────┐ │ │
│ │ │ Agent Under │ │ │
│ │ │ Test │ │ │
│ │ │ (kilo --auto) │ │ │
│ │ └────────┬────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────┐ │ │
│ │ │ Model API │ │ │
│ │ │ (Opus, GPT-5, │ │ │
│ │ │ Gemini, etc.) │ │ │
│ │ └─────────────────┘ │ │
│ └───────────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ ATIF Trajectory │ │
│ │ (per-step traces) │ │
│ └───────────┬───────────┘ │
└──────────────────────────┼──────────────────────────────┘
│
┌────────────┴────────────┐
▼ ▼
┌──────────────────────┐ ┌──────────────────────────┐
│ tbench.ai Dashboard │ │ Opik │
│ - Leaderboard │ │ - Step-level traces │
│ - Task pass/fail │ │ - LLM judge per step │
│ - Asciinema replay │ │ - Cost attribution │
│ - Aggregate scores │ │ - Root cause comparison │
└──────────────────────┘  └──────────────────────────┘
```
Harbor is the evaluation framework built by the Terminal-Bench team and used by many frontier labs. It provides containerized trial execution, parallel runs, a registry of established benchmark datasets, and a standard agent interface. Rather than building our own harness, we write a Kilo Code adapter and plug into the existing ecosystem.
ATIF is a standardized JSON format for logging the complete interaction history of an agent run. Each trajectory captures, step by step: the agent's reasoning, the tool calls it made and their results, token usage and cost, and timing.
This granularity is what enables step-level comparison between runs -- not just "did it pass or fail" but "at step 7, Agent A chose tool X while Agent B chose tool Y."
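To make that concrete, here is a heavily simplified sketch of what a trajectory might contain. The field names are illustrative only, not the authoritative ATIF schema:

```json
{
  "session": { "agent": "kilo", "model": "anthropic/claude-opus-4" },
  "steps": [
    {
      "step": 7,
      "reasoning": "The failing test imports a helper that does not exist yet, so create it first.",
      "tool_call": { "name": "write_file", "args": { "path": "src/helper.py" } },
      "tool_result": { "status": "ok" },
      "usage": { "input_tokens": 1842, "output_tokens": 211, "cost_usd": 0.03 },
      "duration_seconds": 4
    }
  ]
}
```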
Opik (by Comet) provides trace ingestion and analysis with a first-class Harbor integration. Running benchmarks through Opik is as simple as:
```bash
opik harbor run -d terminal-bench@head -a kilo -m anthropic/claude-opus-4
```
Opik adds value beyond what the tbench.ai dashboard provides:
| Capability | tbench.ai Dashboard | Opik |
|---|---|---|
| Task-level pass/fail | Yes | Yes |
| Aggregate leaderboard | Yes | No |
| Asciinema replay | Yes | No |
| Step-level trace view | No | Yes |
| Step-level LLM judge | No | Yes |
| Cost attribution per step | No | Yes |
| Side-by-side trace comparison | No | Yes |
| Root cause analysis | No | Yes |
The two dashboards are complementary: tbench.ai for high-level leaderboard comparisons, Opik for drilling into why a specific run succeeded or failed.
Harbor's registry provides access to established benchmark datasets. Which dataset to use depends on what you are evaluating:
| Dataset | Focus | Use Case |
|---|---|---|
| Terminal-Bench 2.0 | CLI/terminal tasks (89 tasks) | General agent capability on hard, realistic tasks |
| SWE-bench | Real GitHub issues in real repos | Software engineering task completion |
| LiveCodeBench | Competitive programming problems | Code generation quality |
| Custom task sets | Whatever you define | Targeted evaluation, marketing, regression testing |
Creating a custom Harbor task set is straightforward. Each task consists of:

- An instruction: the natural-language prompt given to the agent
- A containerized environment (e.g., a Dockerfile) the trial runs in
- Verification tests that determine pass/fail

This makes it easy to create task sets that target specific Kilo Code capabilities -- for example, a set of refactoring tasks, or a set of multi-file debugging scenarios. Custom sets can be published to the Harbor registry or kept private.
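For orientation, a hypothetical task directory might look like the sketch below. The file names are indicative only; the tutorial linked next is the authoritative reference.

```
kilo-refactor-001/
├── task.yaml             # instruction given to the agent, plus metadata
├── Dockerfile            # containerized environment the trial runs in
├── solution.sh           # optional reference solution
└── tests/
    └── test_outcome.py   # verification tests that decide pass/fail
```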
See the Harbor task tutorial for a step-by-step guide.
The primary engineering deliverable. This adapter:

- Launches `kilo run --auto`, which disables all permission prompts so the agent runs fully unattended
- Converts Kilo's execution log into an ATIF-compliant trajectory once the run completes

The adapter follows the same pattern as existing Harbor agents (see the OpenHands adapter for reference). The key implementation detail is the `populate_context_post_run` method that converts Kilo's execution log into ATIF format.
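A rough sketch of the adapter's shape is below. Only `populate_context_post_run` comes from the description above; the class structure, the log location, and Kilo's log format are assumptions made for illustration, and a real implementation should copy its hooks and signatures from an existing adapter such as OpenHands.

```python
import json
from pathlib import Path


class KiloAgent:  # assumed: would extend Harbor's agent base class
    """Sketch of a Kilo Code adapter for Harbor. Not the real interface."""

    def run_command(self, instruction: str) -> list[str]:
        # --auto disables all permission prompts so the run is unattended.
        return ["kilo", "run", "--auto", instruction]

    def populate_context_post_run(self, context) -> None:
        # Assumption: Kilo writes a JSON-lines execution log at this path.
        log_path = Path("/logs/kilo-run.jsonl")
        steps = []
        for line in log_path.read_text().splitlines():
            record = json.loads(line)
            steps.append({
                "reasoning": record.get("reasoning"),
                "tool_call": record.get("tool_call"),
                "tool_result": record.get("tool_result"),
                "usage": record.get("usage"),
            })
        # Assumption: wherever Harbor expects the ATIF trajectory to land.
        context.trajectory = {"steps": steps}
```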
Autonomous execution is critical. Harbor runs containerized trials in parallel and expects agents to execute from start to finish without human intervention. The adapter must ensure:

- No tool call or confirmation ever blocks waiting for human input
- The process exits cleanly on success or failure, so Harbor can collect results and move on

Documentation and examples for creating Kilo-specific task sets: a task template, instructions for testing tasks locally, and guidance on publishing to the Harbor registry or keeping sets private.
This enables the team to create targeted benchmarks for marketing, regression testing, or capability evaluation.
Configure the Opik-Harbor integration for Kilo Code benchmark runs:
- Run `opik harbor run` with the Kilo Code adapter
- Review step-level traces, judges, and cost attribution in the Opik dashboard

{% callout type="note" %} Lower priority. Implement after the core benchmarking system is working. {% /callout %}
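Credential setup is a one-time step. A minimal sketch using the Opik Python SDK follows; the workspace name is a placeholder, the keyword arguments reflect our understanding of `opik.configure` and should be verified against the SDK, and the interactive `opik configure` CLI wizard achieves the same thing.

```python
import opik

# Stores credentials locally so subsequent `opik harbor run` invocations
# can upload traces. Both values here are placeholders.
opik.configure(api_key="YOUR_OPIK_API_KEY", workspace="kilo-code")
```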
Run a small subset of benchmark tasks (10-15) on release branches to catch regressions before shipping. Harbor supports this pattern natively. The subset should be chosen for:

- Coverage of core capabilities (e.g., refactoring, multi-file debugging)
- Short total runtime, so results fit in a CI window
- Stable, deterministic pass/fail behavior, to minimize flaky signals

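A minimal sketch of the release hook, assuming the pinned subset is published as a custom Harbor dataset (the name `kilo-smoke` is hypothetical):

```python
import subprocess
import sys

# Run the pinned 10-15 task smoke subset; "kilo-smoke" is a placeholder
# name for a custom Harbor dataset, not a published one.
result = subprocess.run([
    "opik", "harbor", "run",
    "-d", "kilo-smoke@head",
    "-a", "kilo",
    "-m", "anthropic/claude-opus-4",
])

# Fail the release pipeline if the harness exits non-zero. Thresholding on
# pass rate would require parsing Harbor's results output, whose format is
# not assumed here.
sys.exit(result.returncode)
```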
Run the same Kilo Code agent against Terminal-Bench with different models:
```bash
# Run with Claude Opus
opik harbor run -d terminal-bench@head -a kilo -m anthropic/claude-opus-4

# Run with GPT-5
opik harbor run -d terminal-bench@head -a kilo -m openai/gpt-5

# Run with Gemini 3 Pro
opik harbor run -d terminal-bench@head -a kilo -m google/gemini-3-pro
```
Compare results in tbench.ai for aggregate scores and in Opik for step-level analysis of where models diverge.
Run different agents against the same dataset with the same model:
```bash
# Run Kilo Code
opik harbor run -d terminal-bench@head -a kilo -m anthropic/claude-opus-4

# Run Claude Code
opik harbor run -d terminal-bench@head -a claude-code -m anthropic/claude-opus-4
```
Test a new release against the previous version:
```bash
# Run current release
opik harbor run -d terminal-bench@head -a kilo@<current-version> -m anthropic/claude-opus-4

# Run candidate release
opik harbor run -d terminal-bench@head -a kilo@<candidate-version> -m anthropic/claude-opus-4
```
Use Opik's trace comparison view to identify specific steps where the new version regressed or improved.
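The same question can also be asked of raw trajectories outside the UI. A minimal sketch, reusing the illustrative (non-authoritative) trajectory fields from the ATIF example earlier:

```python
import json


def first_divergence(path_a: str, path_b: str) -> int | None:
    """Return the first step index where two runs chose different tools."""
    with open(path_a) as fa, open(path_b) as fb:
        steps_a = json.load(fa)["steps"]
        steps_b = json.load(fb)["steps"]
    # zip stops at the shorter run; a divergence past that point means one
    # run simply took more steps than the other.
    for i, (a, b) in enumerate(zip(steps_a, steps_b)):
        if a.get("tool_call", {}).get("name") != b.get("tool_call", {}).get("name"):
            return i
    return None


# Paths are placeholders for two exported trajectories of the same task.
print(first_divergence("run-v1.json", "run-v2.json"))
```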
```bash
# Run against a custom Kilo-specific dataset
opik harbor run -d <custom-dataset> -a kilo -m anthropic/claude-opus-4
```
Harbor provides task-level judging (did the agent solve the task?). Opik adds step-level evaluation:
| Level | Tool | What It Tells You |
|---|---|---|
| Task-level | Harbor | Pass/fail, score, total time, total cost |
| Step-level | Opik | At step N, the agent chose tool X when it should have used tool Y. The reasoning was flawed because of Z. This step cost $0.03 and took 4 seconds. |
Step-level evaluation is where root cause debugging happens. When a benchmark score drops between versions, you can trace back to the exact decision point that caused the regression.
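As a sketch of the idea (Opik's built-in judge APIs are not assumed here), a step-level pass could walk an ATIF trajectory and attach a verdict and cost to every decision point:

```python
import json


def judge_step(step: dict) -> str:
    """Placeholder judge: a real implementation would send the step's
    reasoning and tool choice to an LLM and parse its verdict."""
    ok = step.get("tool_result", {}).get("status") == "ok"
    return "ok" if ok else "flawed"


def annotate(trajectory_path: str) -> None:
    # Field names follow the illustrative trajectory sketch earlier in
    # this document, not the authoritative ATIF schema.
    with open(trajectory_path) as f:
        trajectory = json.load(f)
    for step in trajectory["steps"]:
        usage = step.get("usage", {})
        print(
            f"step {step['step']}: {judge_step(step)}, "
            f"${usage.get('cost_usd', 0):.2f}, "
            f"{step.get('duration_seconds', '?')}s"
        )


annotate("trajectory.json")  # placeholder path
```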
This benchmarking system is complementary to, but separate from, the Agent Observability system:
| Concern | Benchmarking | Production Observability |
|---|---|---|
| Purpose | Offline evaluation of agent quality | Real-time monitoring of user sessions |
| Data source | Controlled benchmark tasks | Real user interactions |
| Tools | Harbor, Opik, tbench.ai | PostHog, custom metrics |
| When | Before release, on-demand | Continuously in production |
| Output | Leaderboard scores, trace comparisons | Alerts, dashboards, SLO tracking |