tools/llm-oneshot/apps/chat-app/prompts/grading_rubric.md
Use this rubric to evaluate LLM-generated chat applications for both SpacetimeDB and PostgreSQL implementations.
Only score features that were included in the prompt used. Prompt levels are cumulative: each level includes every feature from the levels before it, as listed below:
| Prompt Level | Features Included | Max Score |
|---|---|---|
| `01_*_basic` | 1-4 (Basic, Typing, Read Receipts, Unread) | 12 |
| `02_*_scheduled` | 1-5 (+ Scheduled Messages) | 15 |
| `03_*_realtime` | 1-6 (+ Ephemeral Messages) | 18 |
| `04_*_reactions` | 1-7 (+ Reactions) | 21 |
| `05_*_edit_history` | 1-8 (+ Edit History) | 24 |
| `06_*_permissions` | 1-9 (+ Permissions) | 27 |
| `07_*_presence` | 1-10 (+ Rich Presence) | 30 |
| `08_*_threading` | 1-11 (+ Threading) | 33 |
| `09_*_private_rooms` | 1-12 (+ Private Rooms & DMs) | 36 |
| `10_*_activity` | 1-13 (+ Activity Indicators) | 39 |
| `11_*_drafts` | 1-14 (+ Draft Sync) | 42 |
| `12_*_anonymous` | 1-15 (All features) | 45 |
Example: If you used 05_spacetime_edit_history.md, only score features 1-8 and use 24 as the max score.
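The arithmetic behind the table can be checked with a short sketch. It assumes, based only on the rows above, that prompt level n covers features 1 through n + 3 and that each feature is worth 3 points. The function names are illustrative and are not part of any benchmark tooling.

```python
POINTS_PER_FEATURE = 3  # each feature is scored on a 0-3 scale
TOTAL_LEVELS = 12       # prompt levels 01 through 12


def features_for_level(level: int) -> int:
    """Number of features covered by a prompt level (assumed pattern: level + 3)."""
    if not 1 <= level <= TOTAL_LEVELS:
        raise ValueError(f"prompt level must be 1-{TOTAL_LEVELS}, got {level}")
    return level + 3


def max_score_for_level(level: int) -> int:
    """Maximum feature score available at a prompt level."""
    return features_for_level(level) * POINTS_PER_FEATURE


def level_from_prompt_name(name: str) -> int:
    """Parse the leading level number from a name like '05_spacetime_edit_history.md'."""
    return int(name.split("_", 1)[0])


if __name__ == "__main__":
    level = level_from_prompt_name("05_spacetime_edit_history.md")
    print(f"features 1-{features_for_level(level)}, max score {max_score_for_level(level)}")
```

For `05_spacetime_edit_history.md` this reproduces the example above: features 1-8 with a max score of 24.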
Each feature is scored on a 0–3 scale:
| Score | Meaning |
|---|---|
| 0 | Not implemented or completely broken |
| 1 | Partially implemented; major issues or missing core functionality |
| 2 | Mostly working; minor bugs or missing edge cases |
| 3 | Fully working as specified |
| N/A | Feature not included in prompt (don't count toward total) |
| Metric | Value |
|---|---|
| Prompt Level Used | ___ (e.g., 05_spacetime_edit_history) |
| Features Evaluated | 1-___ (e.g., 1-8) |
| Total Feature Score | _ / _ (features × 3 points) |
| Compiles without errors | Yes / No |
| Runs without crashing | Yes / No |
| Lines of code (backend) | ___ |
| Lines of code (frontend) | ___ |
| Number of files created | ___ |
| External dependencies | List them |
| First-try success | Yes / No (did it work without manual fixes?) |
| Reprompt Count | ___ (number of follow-up prompts to fix issues) |
| Reprompt Efficiency Score | ___ / 10 (see Reprompt Scoring below) |
Track how many follow-up prompts are needed to get the application working. This measures how close the LLM gets to a working solution on the first attempt.
A reprompt is any follow-up message you send to fix issues, including:
Does NOT count as a reprompt:
| Category | Description |
|---|---|
| Compilation/Build | Code doesn't compile, missing imports, syntax errors |
| Runtime/Crash | App crashes on startup or during use |
| Feature Broken | Feature exists but doesn't work correctly |
| Integration | Frontend/backend don't communicate properly |
| Data/State | Data not persisting, state management issues |
| Reprompts | Score | Interpretation |
|---|---|---|
| 0 | 10 | Perfect - works on first try |
| 1 | 9 | Excellent - minor fix needed |
| 2 | 8 | Very Good - few issues |
| 3 | 7 | Good - some debugging required |
| 4-5 | 6 | Acceptable - moderate iteration |
| 6-7 | 5 | Below Average - significant debugging |
| 8-10 | 4 | Poor - extensive iteration required |
| 11-15 | 2 | Very Poor - major rework needed |
| 16+ | 0 | Failing - essentially pair programming |
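As a convenience, the lookup in the table above can be sketched in code. The band boundaries and scores are copied from the table; the function name is hypothetical.

```python
# (upper bound of the reprompt range, efficiency score), in ascending order.
# Values are copied from the Reprompts/Score table; 16 or more reprompts score 0.
_BANDS = [(0, 10), (1, 9), (2, 8), (3, 7), (5, 6), (7, 5), (10, 4), (15, 2)]


def reprompt_efficiency(reprompts: int) -> int:
    """Map a reprompt count to its efficiency score out of 10."""
    if reprompts < 0:
        raise ValueError("reprompt count cannot be negative")
    for upper_bound, score in _BANDS:
        if reprompts <= upper_bound:
            return score
    return 0  # 16+ reprompts


if __name__ == "__main__":
    for count in (0, 4, 9, 20):
        print(count, "->", reprompt_efficiency(count))
```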
| # | Category | Issue Summary | Fixed? |
|---|---|---|---|
| 1 | | | Yes/No |
| 2 | | | Yes/No |
| 3 | | | Yes/No |
| 4 | | | Yes/No |
| 5 | | | Yes/No |
| ... | | | |
Total Reprompts: _
Categories Breakdown: Compilation: _ | Runtime: _ | Feature: _ | Integration: _ | Data: _
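If the reprompt log is kept as structured data, the two summary lines can be derived rather than counted by hand. The sketch below assumes each log entry is a dict with a `category` key holding one of the five short category names used in the breakdown line. That data shape is an assumption made for illustration and is not prescribed by this rubric.

```python
from collections import Counter

# Short category names, in the order used by the "Categories Breakdown" line.
CATEGORIES = ("Compilation", "Runtime", "Feature", "Integration", "Data")


def summarize_reprompts(log):
    """Return (total reprompts, breakdown string) for a list of log entries."""
    counts = Counter(entry["category"] for entry in log)
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    # Counter returns 0 for categories that never occurred.
    breakdown = " | ".join(f"{name}: {counts[name]}" for name in CATEGORIES)
    return len(log), breakdown


if __name__ == "__main__":
    example_log = [
        {"category": "Compilation", "summary": "missing import", "fixed": True},
        {"category": "Feature", "summary": "typing indicator never expires", "fixed": True},
        {"category": "Compilation", "summary": "type error in handler", "fixed": False},
    ]
    total, breakdown = summarize_reprompts(example_log)
    print(f"Total Reprompts: {total}")
    print(f"Categories Breakdown: {breakdown}")
```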
Feature 1: Basic Chat
Max Score: 3
| Criteria | Points |
|---|---|
| Users can set a display name | 0.5 |
| Users can create chat rooms | 0.5 |
| Users can join/leave rooms | 0.5 |
| Users can send messages to joined rooms | 0.5 |
| Online users are displayed | 0.5 |
| Basic validation exists (no empty messages, name limits, etc.) | 0.5 |
Scoring:
Test Cases:
Feature 2: Typing Indicators
Max Score: 3
| Criteria | Points |
|---|---|
| Typing state is broadcast to other room members | 1 |
| Typing indicator auto-expires after inactivity (3-5 seconds) | 1 |
| UI shows "User is typing..." or "Multiple users are typing..." | 1 |
Scoring:
Test Cases:
Feature 3: Read Receipts
Max Score: 3
| Criteria | Points |
|---|---|
| System tracks which users have seen which messages | 1 |
| "Seen by X, Y, Z" or similar indicator displays under messages | 1 |
| Read status updates in real-time as users view messages | 1 |
Scoring:
Test Cases:
Feature 4: Unread Counts
Max Score: 3
| Criteria | Points |
|---|---|
| Unread count badge shows on room list | 1 |
| Count tracks last-read position per user per room | 1 |
| Counts update in real-time (new messages arrive, messages read) | 1 |
Scoring:
Test Cases:
Feature 5: Scheduled Messages
Max Score: 3
| Criteria | Points |
|---|---|
| Users can compose and schedule messages for future delivery | 1 |
| Pending scheduled messages visible to author with cancel option | 1 |
| Message appears in room at scheduled time | 1 |
Scoring:
Test Cases:
Feature 6: Ephemeral Messages
Max Score: 3
| Criteria | Points |
|---|---|
| Users can send messages with auto-delete timer | 1 |
| Countdown or disappearing indicator shown in UI | 1 |
| Message is permanently deleted when timer expires | 1 |
Scoring:
Test Cases:
Feature 7: Message Reactions
Max Score: 3
| Criteria | Points |
|---|---|
| Users can add emoji reactions to messages | 0.75 |
| Reaction counts display and update in real-time | 0.75 |
| Users can toggle their own reactions on/off | 0.75 |
| Hover/click shows who reacted | 0.75 |
Scoring:
Test Cases:
Feature 8: Message Editing
Max Score: 3
| Criteria | Points |
|---|---|
| Users can edit their own messages | 1 |
| "(edited)" indicator shows on edited messages | 0.5 |
| Edit history is viewable by other users | 1 |
| Edits sync in real-time to all viewers | 0.5 |
Scoring:
Test Cases:
Feature 9: Real-Time Permissions
Max Score: 3
| Criteria | Points |
|---|---|
| Room creator is admin and can kick/ban users | 1 |
| Kicked users immediately lose access and stop receiving updates | 1 |
| Admins can promote other users to admin | 0.5 |
| Permission changes apply instantly (no reconnection needed) | 0.5 |
Scoring:
Test Cases:
Feature 10: Rich Presence
Max Score: 3
| Criteria | Points |
|---|---|
| Users can set status: online, away, do-not-disturb, invisible | 1 |
| "Last active X minutes ago" shows for offline users | 0.5 |
| Status changes sync to all viewers in real-time | 1 |
| Auto-set to "away" after inactivity period | 0.5 |
Scoring:
Test Cases:
Feature 11: Message Threading
Max Score: 3
| Criteria | Points |
|---|---|
| Users can reply to specific messages, creating a thread | 1 |
| Parent messages show reply count and preview | 0.5 |
| Threaded view shows all replies to a message | 1 |
| New replies sync in real-time to thread viewers | 0.5 |
Scoring:
Test Cases:
Feature 12: Private Rooms & DMs
Max Score: 3
| Criteria | Points |
|---|---|
| Users can create private/invite-only rooms | 0.75 |
| Room creators can invite specific users by username | 0.75 |
| Direct messages (DMs) between two users work | 0.75 |
| Only members can see private room content and member lists | 0.75 |
Scoring:
Test Cases:
Feature 13: Activity Indicators
Max Score: 3
| Criteria | Points |
|---|---|
| Activity badges show on rooms (e.g., "Active", "Hot") | 1 |
| Activity level reflects recent message velocity | 1 |
| Indicators update in real-time as activity changes | 1 |
Scoring:
Test Cases:
Feature 14: Draft Sync
Max Score: 3
| Criteria | Points |
|---|---|
| Message drafts save automatically as user types | 1 |
| Drafts sync across devices/sessions in real-time | 1 |
| Each room maintains its own draft per user | 0.5 |
| Drafts persist until sent or manually cleared | 0.5 |
Scoring:
Test Cases:
Feature 15: Anonymous Migration
Max Score: 3
| Criteria | Points |
|---|---|
| Users can join and send messages without an account | 1 |
| Anonymous identity persists for the session | 0.5 |
| Registration preserves message history and identity | 1 |
| Room memberships transfer to registered account | 0.5 |
Scoring:
Test Cases:
| Feature | Max | Score | Notes |
|---|---|---|---|
| 1. Basic Chat | 3 | | |
| 2. Typing Indicators | 3 | | |
| 3. Read Receipts | 3 | | |
| 4. Unread Counts | 3 | | |
| 5. Scheduled Messages | 3 | | |
| 6. Ephemeral Messages | 3 | | |
| 7. Message Reactions | 3 | | |
| 8. Message Editing | 3 | | |
| 9. Real-Time Permissions | 3 | | |
| 10. Rich Presence | 3 | | |
| 11. Message Threading | 3 | | |
| 12. Private Rooms & DMs | 3 | | |
| 13. Activity Indicators | 3 | | |
| 14. Draft Sync | 3 | | |
| 15. Anonymous Migration | 3 | | |
| TOTAL | 45 | | |
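Totalling the scorecard while excluding N/A features can be sketched as follows. The sketch assumes scores are listed in feature order and that `None` stands for N/A. Both conventions are illustrative choices and are not defined by this rubric.

```python
from typing import List, Optional, Tuple


def total_feature_score(scores: List[Optional[float]]) -> Tuple[float, int]:
    """Return (total score, max score), skipping N/A (None) entries.

    Fractional scores are allowed because some criteria are worth 0.5 or 0.75 points.
    """
    evaluated = [s for s in scores if s is not None]
    for s in evaluated:
        if not 0 <= s <= 3:
            raise ValueError(f"feature score must be between 0 and 3, got {s}")
    return sum(evaluated), 3 * len(evaluated)


if __name__ == "__main__":
    # Four evaluated features (e.g. a 01_*_basic prompt); later features are N/A.
    scores = [3, 2, 2.5, 1] + [None] * 11
    total, maximum = total_feature_score(scores)
    print(f"Total Feature Score: {total} / {maximum}")
```

N/A entries reduce the maximum as well as the total, so a run scored against a lower prompt level is not penalized for features it was never asked to build.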
| Metric | SpacetimeDB | PostgreSQL |
|---|---|---|
| Total Score | /45 | /45 |
| Compiles | Yes/No | Yes/No |
| Runs | Yes/No | Yes/No |
| Backend LOC | | |
| Frontend LOC | | |
| Total Files | | |
| Dependencies | | |
| First-try Success | Yes/No | Yes/No |
| Reprompt Count | | |
| Reprompt Efficiency | /10 | /10 |
To create a single comparable metric that weighs both feature completeness and iteration efficiency:
Combined Score = (Feature Score / Max Score × 70) + (Reprompt Efficiency × 3)
This weights feature completeness at 70% and iteration efficiency at 30%, for a max score of 100.
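A worked example of the formula, as a minimal sketch. The function name and the sample inputs are illustrative only.

```python
def combined_score(feature_score: float, max_score: float, reprompt_efficiency: float) -> float:
    """Combined Score = (Feature Score / Max Score x 70) + (Reprompt Efficiency x 3)."""
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    if not 0 <= feature_score <= max_score:
        raise ValueError("feature_score must be between 0 and max_score")
    if not 0 <= reprompt_efficiency <= 10:
        raise ValueError("reprompt_efficiency must be between 0 and 10")
    return feature_score / max_score * 70 + reprompt_efficiency * 3


if __name__ == "__main__":
    # Hypothetical run: 20 of 24 feature points and 2 reprompts (efficiency 8).
    # 20 / 24 * 70 is about 58.3, plus 8 * 3 = 24, for roughly 82.3 overall.
    print(round(combined_score(20, 24, 8), 1))
```

Dividing by the max score for the prompt level used keeps runs at different levels comparable on the same 100-point scale.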
| Rating | Combined Score |
|---|---|
| Excellent | 90-100 |
| Good | 75-89 |
| Acceptable | 60-74 |
| Poor | 40-59 |
| Failing | 0-39 |
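The rating bands can be applied with a small lookup. The table lists whole-number ranges, so this sketch assumes a fractional score belongs to the band whose lower bound it has reached (for example, 89.5 counts as Good). That tie-breaking rule is an assumption and is not stated in the rubric.

```python
# (lower bound, rating), highest band first; anything below 40 is "Failing".
_THRESHOLDS = [(90, "Excellent"), (75, "Good"), (60, "Acceptable"), (40, "Poor")]


def rating(score: float) -> str:
    """Map a combined score (0-100) to its rating label."""
    if not 0 <= score <= 100:
        raise ValueError(f"combined score must be between 0 and 100, got {score}")
    for lower_bound, label in _THRESHOLDS:
        if score >= lower_bound:
            return label
    return "Failing"


if __name__ == "__main__":
    for value in (100, 82.3, 60, 39.9):
        print(value, "->", rating(value))
```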