tools/llm-sequential-upgrade/DEVELOP.md
How to set up, run, and interpret the LLM cost-to-done benchmark.
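The benchmark measures the total token cost to reach a fully working chat app by alternating between two agents:

- Code Agent (run.sh): generates code, fixes bugs, and deploys. Token-tracked via OpenTelemetry.
- Grade Agent (interactive Claude Code session): tests the app in Chrome and writes BUG_REPORT.md or GRADING_RESULTS.md. Not token-tracked.

Only the Code Agent's tokens count toward the benchmark. Grading cost is the same for both SpacetimeDB and PostgreSQL, so it is excluded.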
run.sh --level 1 → Code Agent generates & deploys app (tokens tracked)
↓
You (in Claude Code) → Grade Agent tests in Chrome, writes BUG_REPORT.md
↓
run.sh --fix <app-dir> → Code Agent reads bugs, fixes code, redeploys (tokens tracked)
↓
You (in Claude Code) → Grade Agent retests, writes updated BUG_REPORT.md or GRADING_RESULTS.md
↓
... repeat until all features pass or iteration limit hit
spacetime start
cd tools/llm-oneshot/llm-sequential-upgrade
docker compose -f docker-compose.otel.yaml up -d
- Claude Code CLI: claude must be on PATH; npx @anthropic-ai/claude-code works as a fallback.
- Chrome: required for the grading agent (interactive session). Chrome must be open with the "Claude in Chrome" MCP extension active.
- Node.js: required for the SpacetimeDB TypeScript backend, the Vite dev server, and parse-telemetry.mjs.
cd tools/llm-oneshot/llm-sequential-upgrade
./run.sh --level 1 --backend spacetime
This generates and deploys the app with token tracking, then writes COST_REPORT.md.

In this Claude Code session (or a new interactive one), say:
Grade the app at sequential-upgrade/sequential-upgrade-YYYYMMDD/spacetime/results/chat-app-<timestamp>
Or use the helper script:
./grade.sh sequential-upgrade/sequential-upgrade-YYYYMMDD/spacetime/results/chat-app-<timestamp>
The grading agent will:
- Write BUG_REPORT.md in the app directory
- Update ITERATION_LOG.md and GRADING_RESULTS.md

If bugs were found:
./run.sh --fix sequential-upgrade/sequential-upgrade-YYYYMMDD/spacetime/results/chat-app-<timestamp>
This reads BUG_REPORT.md from the app directory, fixes the code, and redeploys.

Back in Claude Code:
Re-grade the app at sequential-upgrade/sequential-upgrade-YYYYMMDD/spacetime/results/chat-app-<timestamp>
Repeat Steps 3-4 until all features pass.
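The stop condition for this loop can be checked from the shell. The following is a minimal sketch, assuming the grade agent removes BUG_REPORT.md once every feature passes (as the results layout in this file describes). The function names `has_open_bugs` and `next_step` are hypothetical and are not part of run.sh. The check is only meaningful right after a grading pass, because a fix run does not remove the report.

```shell
#!/bin/sh
# Hypothetical helper: succeeds while BUG_REPORT.md is present in the app directory.
has_open_bugs() {
  [ -f "$1/BUG_REPORT.md" ]
}

# Hypothetical helper: print the next manual step instead of running it,
# because grading happens interactively in Claude Code.
next_step() {
  if has_open_bugs "$1"; then
    echo "./run.sh --fix $1"
  else
    echo "no BUG_REPORT.md in $1: all features passed, or the app is not graded yet"
  fi
}
```

Calling `next_step <app-dir>` after each grading pass prints either the fix command to run or a note that no bug report is pending.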
| Flag | Default | Description |
|---|---|---|
| `--level` | 1 | Prompt level (1-12). Level 1 = 4 features, Level 12 = all 15 |
| `--backend` | spacetime | spacetime or postgres |
| `--variant` | sequential-upgrade | Test variant: sequential-upgrade or one-shot |
| `--fix <dir>` | - | Fix mode: read BUG_REPORT.md, fix code, redeploy |
| `--upgrade <dir>` | - | Upgrade mode: add features to existing app |
| `--resume-session` | - | Resume prior Claude session for cache reuse |
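The value ranges in the table above can be checked before a long run is launched. This is a hedged sketch of such a pre-flight check. `check_run_args` is a hypothetical helper, and how run.sh itself validates its flags is not shown in this file.

```shell
#!/bin/sh
# Hypothetical pre-flight check for the --level and --backend values.
# Usage: check_run_args <level> <backend>
check_run_args() {
  level=$1
  backend=$2

  # Reject empty or non-numeric levels before doing integer comparisons.
  case "$level" in
    ''|*[!0-9]*)
      echo "level must be an integer from 1 to 12" >&2
      return 1
      ;;
  esac
  if [ "$level" -lt 1 ] || [ "$level" -gt 12 ]; then
    echo "level must be an integer from 1 to 12" >&2
    return 1
  fi

  # Only the two backends listed in the flag table are accepted.
  case "$backend" in
    spacetime|postgres) ;;
    *)
      echo "backend must be spacetime or postgres" >&2
      return 1
      ;;
  esac
}
```

A wrapper could call `check_run_args 5 postgres` and only invoke `./run.sh --level 5 --backend postgres` when it succeeds.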
| Level | Features | Est. Duration | Good For |
|---|---|---|---|
| 1 | 4 (basic chat, typing, receipts, unread) | 5-15 min | Pipeline validation |
| 5 | 8 (+ scheduled, ephemeral, reactions, edit) | 15-30 min | Mid-complexity |
| 12 | All 15 features | 30-60+ min | Full benchmark |
llm-sequential-upgrade/<variant>/<variant>-YYYYMMDD/
METRICS_DATA.json # Comparison metrics (generated after all grading)
METRICS_REPORT.md # Human-readable benchmark report
<backend>/ # e.g. spacetime/ or postgres/
inputs/ # Frozen snapshot of all inputs used for this run
results/
chat-app-<timestamp>/
GRADING_RESULTS.md # Per-feature scores (written by grade agent)
ITERATION_LOG.md # Per-iteration progress log (both agents append)
BUG_REPORT.md # Current bugs for fix agent to read (deleted when all pass)
backend/ # Generated SpacetimeDB backend (spacetime only)
server/ # Generated Express server (postgres only)
client/ # Generated React client
telemetry/
<backend>-level<N>-<timestamp>/
metadata.json # Run parameters, timing, session ID
cost-summary.json # Parsed token counts and total cost
COST_REPORT.md # Per-call breakdown
raw-telemetry.jsonl # OTel records for this session
llm-sequential-upgrade/telemetry/
logs.jsonl # Raw OTLP log records (shared across all runs)
metrics.jsonl # Raw OTLP metrics
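The app directory passed to `--fix`, `--upgrade`, and grade.sh follows the layout above. As an illustration, the path can be assembled from its parts. `app_dir_path` is a hypothetical helper, and the date and timestamp values in the usage note are placeholders, since the exact timestamp format is not specified here.

```shell
#!/bin/sh
# Hypothetical helper: build the app directory path used by --fix and grade.sh.
# Usage: app_dir_path <variant> <YYYYMMDD> <backend> <timestamp>
app_dir_path() {
  variant=$1
  run_date=$2
  backend=$3
  stamp=$4
  printf '%s/%s-%s/%s/results/chat-app-%s\n' \
    "$variant" "$variant" "$run_date" "$backend" "$stamp"
}
```

For example, `./run.sh --fix "$(app_dir_path sequential-upgrade 20250101 spacetime 101500)"` would expand to the same path shape as the commands shown earlier.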
| Metric | What It Shows |
|---|---|
| Total tokens to done | Raw LLM efficiency — fewer = easier to build with |
| Iterations to done | Fix cycles needed — fewer = less debugging |
| Final feature score | Quality of the final app |
| Lines of code | Code complexity — smaller = simpler for LLMs |
| External dependencies | Infrastructure complexity |
docker compose -f docker-compose.otel.yaml logs
ls -la telemetry/logs.jsonl
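The `ls -la` check above can be turned into a scriptable test. This is a small sketch. `check_telemetry` is a hypothetical name, and it only verifies that the collector output file exists and is non-empty, not that its records are valid.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the collector output exists and is non-empty.
# Usage: check_telemetry [path]   (defaults to telemetry/logs.jsonl)
check_telemetry() {
  file=${1:-telemetry/logs.jsonl}
  if [ -s "$file" ]; then
    # wc -l counts newline-terminated lines; tr strips padding some platforms add.
    echo "ok: $file has $(wc -l < "$file" | tr -d ' ') record line(s)"
  else
    echo "missing or empty: $file (is the OTel Collector container running?)" >&2
    return 1
  fi
}
```

Running `check_telemetry` from the benchmark directory before a run gives a quick signal on whether the collector is writing output.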
spacetime server ping local
spacetime start # if not running
--print mode)
# Generate level 1, then upgrade through each level
./run.sh --level 1 --backend spacetime
# (grade, fix loop...)
./run.sh --upgrade <app-dir> --level 2
# ... continue through level 12
# Same for PostgreSQL
./run.sh --level 1 --backend postgres
# (grade, fix loop...)
./run.sh --upgrade <app-dir> --level 2
# ... continue through level 12
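The "continue through level 12" step can be spelled out. The sketch below only prints the upgrade commands for levels 2 to 12 rather than running them, since each level still needs its grade and fix loop in between. `upgrade_commands` is a hypothetical helper, not part of run.sh.

```shell
#!/bin/sh
# Hypothetical helper: list the upgrade commands for levels 2 through 12.
# Usage: upgrade_commands <app-dir>
upgrade_commands() {
  app_dir=$1
  level=2
  while [ "$level" -le 12 ]; do
    echo "./run.sh --upgrade $app_dir --level $level"
    level=$((level + 1))
  done
}
```

The output can serve as a checklist: run one line, complete the grade and fix loop, then move on to the next.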
# Generate all 15 features in a single prompt
./run.sh --variant one-shot --backend spacetime
./run.sh --variant one-shot --backend postgres
llm-sequential-upgrade/
CLAUDE.md # Instructions for the Code Agent
DEVELOP.md # This file (for humans)
run.sh # Code Agent launcher (generate/fix/upgrade)
grade.sh # Grade Agent launcher (interactive Chrome MCP)
grade-playwright.sh # Grade via Playwright (optional, deterministic)
docker-compose.otel.yaml # OTel Collector container
otel-collector-config.yaml # Collector config (OTLP → JSON files)
parse-telemetry.mjs # Telemetry → COST_REPORT.md
backends/
spacetime.md # SpacetimeDB-specific phases
spacetime-sdk-rules.md # SpacetimeDB SDK patterns
spacetime-templates.md # Code templates
postgres.md # PostgreSQL-specific phases
test-plans/
feature-01-basic-chat.md # Per-feature browser test scripts
...
feature-15-anonymous-migration.md
playwright/ # Optional Playwright test suite
telemetry/ # Shared OTel Collector output
sequential-upgrade/ # Sequential upgrade test variant
sequential-upgrade-YYYYMMDD/ # Dated run with results, telemetry, inputs
one-shot/ # One-shot test variant
one-shot-YYYYMMDD/

                       TOKEN-TRACKED                     NOT TRACKED
                  ┌─────────────────────┐          ┌─────────────────────┐
                  │                     │          │                     │
      run.sh ────▶│  Code Agent         │          │  Grade Agent        │◀──── You
                  │  (claude --print)   │          │  (interactive CC)   │  (in Claude Code)
                  │                     │          │                     │
                  │  • Generate code    │          │  • Chrome MCP       │
                  │  • Build & deploy   │   Bug    │  • Test features    │
                  │  • Fix bugs ◀───────│── Report │  • Score 0-3        │
                  │  • Redeploy         │──────────▶  • Write BUG_REPORT │
                  │                     │          │  • Write GRADING    │
                  └────────┬────────────┘          └─────────────────────┘
                           │
                     OTel telemetry
                           │
                  ┌────────▼────────────┐
                  │  OTel Collector     │
                  │  → logs.jsonl       │
                  │  → COST_REPORT.md   │
                  └─────────────────────┘