tools/llm-sequential-upgrade/README.md
Automated benchmark harness for measuring AI app-generation cost, bug rate, and code size across backends. Designed to produce directly comparable data for the same app built on different stacks.
Results viewer: https://spacetimedb.com/llms-benchmark-sequential-upgrade
Generated test data (app source, telemetry, cost summaries): https://github.com/clockworklabs/spacetimedb-ai-test-results
For each backend under test, the harness drives a headless Claude Code session to:
BUG_REPORT.md and fixed via a separate Claude Code session
Side-by-side results give a direct comparison of AI-generation cost across backends for the same functional target.
run.sh: orchestrates generation, upgrade, and fix sessions. Supports --upgrade, --fix, --composed-prompt, --resume-session.
grade.sh / grade-agents.sh / grade-playwright.sh: grading harnesses (manual + automated)
benchmark.sh / run-loop.sh: batch runners for parallel or sequential benchmark execution
cleanup.sh / reset-app.sh: dev utilities
benchmark-viewer.html: local viewer for METRICS_DATA.json files (open in browser, drop JSON)
generate-report.mjs: aggregate per-session cost-summary.json into a markdown report
parse-telemetry.mjs: parse OTel log stream into per-session cost-summary.json
parse-playwright-results.mjs: convert Playwright JSON output to grading markdown
docker-compose.otel.yaml / otel-collector-config.yaml: OTel collector + PostgreSQL
backends/: per-backend setup / SDK reference documents given to the AI
perf-benchmark/: runtime throughput benchmark (msgs/sec) for the AI-generated apps
CLAUDE.md / DEVELOP.md / GRADING.md / GRADING_WORKFLOW.md: process documentation
# Prereqs: Claude CLI installed, Docker running, SpacetimeDB installed
docker compose -f docker-compose.otel.yaml up -d
# Generate L1 from scratch
./run.sh --backend spacetime --level 1
./run.sh --backend postgres --level 1
# Upgrade through levels
./run.sh --upgrade <app-dir> --level 2 --composed-prompt
# ... continue through L12
# Fix bugs found during grading
./run.sh --fix <app-dir> --level N
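The "continue through L12" step above can be sketched as a small dry-run helper that prints one upgrade command per level, so the sequence can be reviewed before anything runs. This is only an illustration: print_upgrade_commands is a hypothetical function, not part of the harness, and it assumes the same app directory is reused for every level. Whether grading or a --fix pass should happen between levels is not covered here, so the sketch deliberately prints the commands instead of executing them.

```shell
# Hypothetical helper (not part of the harness): print the upgrade
# commands for a range of levels without running them.
# Usage: print_upgrade_commands APP_DIR [FIRST_LEVEL] [LAST_LEVEL]
print_upgrade_commands() {
  app_dir=$1
  level=${2:-2}    # assumed default: first upgrade targets level 2
  last=${3:-12}    # assumed default: last level is 12
  while [ "$level" -le "$last" ]; do
    printf './run.sh --upgrade "%s" --level %d --composed-prompt\n' "$app_dir" "$level"
    level=$((level + 1))
  done
}
```

Calling print_upgrade_commands with an app directory lists the commands for levels 2 through 12. Run them one at a time so each level can be graded before the next upgrade.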
Generated apps and telemetry land in sequential-upgrade/sequential-upgrade-<timestamp>/ locally. For published test data from canonical runs, see the AI Test Results repo.
perf-benchmark/ contains a runtime stress tool that fires concurrent writers against a generated app's send_message handler to measure sustained throughput in messages/sec. See perf-benchmark/README.md for usage.
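As a rough illustration of the messages/sec figure, throughput is the number of messages acknowledged divided by the elapsed wall-clock time. The helper below is a hypothetical sketch of that arithmetic only. It is not the code in perf-benchmark/, which may measure and average differently.

```shell
# Hypothetical helper (not taken from perf-benchmark/): compute
# messages per second from a message count and an elapsed time.
# Usage: msgs_per_sec TOTAL_MESSAGES ELAPSED_SECONDS
msgs_per_sec() {
  awk -v n="$1" -v t="$2" 'BEGIN {
    if (t <= 0) exit 1          # refuse to divide by a non-positive duration
    printf "%.1f\n", n / t
  }'
}
```

For example, 12000 messages over 8 seconds works out to 1500.0 messages/sec.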