tools/llm-sequential-upgrade/GRADING.md
This is the manual grading session. The app has already been generated and deployed by the automated run. Your job is to test every feature in the browser and score it.
You need TWO Chrome browser profiles so that each user gets a completely separate identity (localStorage, cookies, WebSocket connections).
1. Browser A (default profile): Navigate to the app URL and register as "Alice"
   - http://localhost:6173
   - http://localhost:6273
2. Switch to Browser B: Use `switch_browser` to switch to the second Chrome profile
3. Browser B: Navigate to the SAME URL, register as "Bob"

Use `switch_browser` to go back and forth. Both browsers connect to the same backend but have separate storage and WebSocket connections.
- `navigate`: go to URL
- `read_page`: accessibility tree for element discovery
- `get_page_text`: visible text
- `find`: natural language element search
- `computer`: click, type, scroll, screenshot
- `form_input`: fill form fields
- `javascript_tool`: run JS for verification
- `read_console_messages`: check for errors
- `gif_creator`: record timing-sensitive features (typing indicators, ephemeral messages)

Every generated app has different HTML. Use this fallback chain:

1. `find("send message button")`
2. `read_page`: identify by role/text
3. `get_page_text`
4. `javascript_tool`: query DOM directly

Read the test plan from `test-plans/feature-NN-*.md` for each feature. Test in order (1 through N).
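One way to get the plans in testing order is to sort on the numeric part of the file name. This is a minimal sketch, assuming the plans sit in a local `test-plans/` directory and that `NN` is a decimal number; `ordered_plans` is a hypothetical helper, not one of the grading tools.

```python
import re
from pathlib import Path

# Matches names such as feature-03-typing-indicators.md (the suffix is illustrative)
PLAN_RE = re.compile(r"^feature-(\d+)-.*\.md$")

def ordered_plans(plan_dir):
    """Return test plan paths sorted by their feature number (hypothetical helper)."""
    plans = []
    for path in Path(plan_dir).glob("feature-*.md"):
        match = PLAN_RE.match(path.name)
        if match:
            plans.append((int(match.group(1)), path))
    # Sort on the integer so that feature-10 comes after feature-9
    plans.sort(key=lambda item: item[0])
    return [path for _, path in plans]
```

Sorting on the parsed integer rather than the raw name keeps the order correct even if some file names are not zero-padded.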
For each feature:

- Check `read_console_messages` for JS errors
- Write the results to `GRADING_RESULTS.md` immediately; do not wait until the end

## Feature N: <Name> (Score: X / 3)
- [x] <criterion> (1pt)
- [ ] <criterion> (1pt)
**Browser Test Observations:** ...
---
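Because each criterion is worth a stated number of points, the score in the heading should equal the points of the checked boxes. The sketch below checks one entry for that consistency. It assumes the entry follows the format above exactly; `check_entry` is a hypothetical helper, and the criteria in the usage example are made up for illustration.

```python
import re

HEADER_RE = re.compile(r"^## Feature (\d+): (.+?) \(Score: (\d+) / (\d+)\)\s*$")
CHECKED_RE = re.compile(r"^- \[x\] .*\((\d+)pt\)\s*$", re.IGNORECASE)

def check_entry(entry):
    """Return (feature_number, stated_score, points_from_checked_boxes)."""
    number = stated = None
    earned = 0
    for line in entry.splitlines():
        header = HEADER_RE.match(line)
        if header:
            number = int(header.group(1))
            stated = int(header.group(3))
            continue
        checked = CHECKED_RE.match(line)
        if checked:
            # Only checked boxes ("- [x]") contribute points
            earned += int(checked.group(1))
    return number, stated, earned

# Illustrative entry; the criteria text is invented for this example
example = "\n".join([
    "## Feature 1: Basic Chat (Score: 2 / 3)",
    "- [x] Messages send (1pt)",
    "- [x] Messages appear for both users (1pt)",
    "- [ ] Timestamps shown (1pt)",
    "**Browser Test Observations:** ...",
])
print(check_entry(example))  # (1, 2, 2)
```

If the second and third values differ, the heading score and the checkboxes disagree and the entry should be rechecked by hand.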
# Chat App Grading Results
**Model:** Claude Sonnet 4.6
**Date:** <YYYY-MM-DD>
**Backend:** spacetime | postgres
**Level:** <N>
**Grading Method:** Manual browser interaction
---
## Feature 1: <Name> (Score: X / 3)
- [x] <criterion> (1pt)
...
**Browser Test Observations:** ...
---
## Summary
| Feature | Score | Notes |
|---------|-------|-------|
| 1. Basic Chat | X/3 | ... |
...
| **TOTAL** | **X/33** | |
Do NOT include token counts, cost estimates, or API call counts. Cost data is in COST_REPORT.md.
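The Summary table can be derived from the per-feature headings instead of being retyped. The following is a rough sketch, assuming every feature heading in `GRADING_RESULTS.md` uses the `## Feature N: <Name> (Score: X / 3)` form; `summary_table` is a hypothetical helper, and the Notes column is left empty for the grader to fill in.

```python
import re

SCORE_RE = re.compile(
    r"^## Feature (\d+): (.+?) \(Score: (\d+) / (\d+)\)\s*$",
    re.MULTILINE,
)

def summary_table(results_text):
    """Build the Summary table rows from the feature headings (hypothetical helper)."""
    rows = ["| Feature | Score | Notes |", "|---------|-------|-------|"]
    total = possible = 0
    for number, name, score, out_of in SCORE_RE.findall(results_text):
        rows.append(f"| {number}. {name} | {score}/{out_of} | |")
        total += int(score)
        possible += int(out_of)
    rows.append(f"| **TOTAL** | **{total}/{possible}** | |")
    return "\n".join(rows)
```

Only feature headings are matched, so the `## Summary` heading itself is ignored when the file is re-read.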
| Reprompts | Score |
|---|---|
| 0 | 10/10 |
| 1 | 9/10 |
| 2 | 8/10 |
| 3 | 7/10 |
| 4–5 | 6/10 |
| 6–7 | 5/10 |
| 8–10 | 4/10 |
| 11–15 | 2/10 |
| 16+ | 0/10 |
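The table above is a step function from reprompt count to a score out of 10. A minimal sketch of that lookup, with the band boundaries copied from the table; `reprompt_score` is a hypothetical name, not part of the tooling.

```python
# (highest reprompt count in the band, score out of 10), taken from the table above
BANDS = [(0, 10), (1, 9), (2, 8), (3, 7), (5, 6), (7, 5), (10, 4), (15, 2)]

def reprompt_score(reprompts):
    """Map a reprompt count to its score out of 10."""
    if reprompts < 0:
        raise ValueError("reprompts must be non-negative")
    for upper, score in BANDS:
        if reprompts <= upper:
            return score
    return 0  # 16 or more reprompts

print(reprompt_score(4))   # 6
print(reprompt_score(12))  # 2
```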