You're grading your AI agent on the wrong rubric.
Code correctness tells you if the output compiles. It tells you nothing about whether the agent will drift, ignore context, or go silent when it hits a wall.
Your agent passes the coding benchmark. It writes syntactically correct code. It handles toy problems in an eval suite.
Then you put it on a real task: refactor this authentication module, follow our patterns, don't break the existing tests.
It writes code that compiles. It also ignores half the context you gave it, doesn't tell you when it's stuck, and "helpfully" changes things you didn't ask for.
The benchmark said it was capable. Production said otherwise.
Some of the teams building these agents grade them differently. They treat the agent like an employee, not like a model.
"If you design your coding evals like you would a software engineer performance review, then you can measure their ability in the same ways as you can measure somebody who's coding."
Brian Fioca, Roo Cast S01E16
The rubric he described has four dimensions: proactivity, context management, communication, and testing.
These aren't code quality metrics. They're work style metrics. The difference matters.
A code correctness eval asks: "Did the output match the expected output?"
A work style eval asks: "How did it get there, and what happens when the task gets harder?"
An agent that scores high on correctness but low on communication will confidently produce wrong code without flagging uncertainty. An agent that scores low on context management will lose track of requirements halfway through a multi-file change. An agent that scores low on proactivity will stop and wait for you to hold its hand on every sub-task.
These failure modes don't show up in benchmarks. They show up in real work.
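To make the distinction concrete, here is a minimal sketch of a scorecard that keeps the correctness result and the work style grades separate. The 1-5 scale, the `WEAK_THRESHOLD` cutoff, and the names `WorkStyleScore` and `review` are assumptions for illustration, not something from the episode.

```python
from dataclasses import dataclass, fields

# Hypothetical scale and cutoff; pick whatever your human graders already use.
MIN_SCORE, MAX_SCORE = 1, 5
WEAK_THRESHOLD = 3


@dataclass(frozen=True)
class WorkStyleScore:
    """One reviewer's grades for a single agent run."""

    proactivity: int
    context_management: int
    communication: int
    testing: int

    def __post_init__(self) -> None:
        # Reject grades outside the agreed scale.
        for f in fields(self):
            value = getattr(self, f.name)
            if not MIN_SCORE <= value <= MAX_SCORE:
                raise ValueError(f"{f.name} must be {MIN_SCORE}-{MAX_SCORE}, got {value}")

    def weak_dimensions(self) -> list[str]:
        """Names of dimensions graded below the cutoff, in rubric order."""
        return [f.name for f in fields(self) if getattr(self, f.name) < WEAK_THRESHOLD]


def review(passed_tests: bool, score: WorkStyleScore) -> str:
    """Combine the correctness result with the work style grades."""
    weak = score.weak_dimensions()
    if not passed_tests:
        return "failed correctness"
    if weak:
        return "correct but risky: " + ", ".join(weak)
    return "ready"
```

A run that passes its tests but scores low on communication comes back as "correct but risky" rather than as a plain pass, which is the case a correctness benchmark would miss.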
| Dimension | Benchmark Approach | Work Style Approach |
|---|---|---|
| What it measures | Code correctness on isolated tasks | Behavior patterns across complex workflows |
| Failure modes caught | Syntax errors, wrong outputs | Drift, context loss, silent failures |
| Task realism | Toy problems, synthetic evals | Multi-file changes, production patterns |
| Feedback loop | Pass/fail on expected output | Grades on proactivity, communication, testing |
| Production readiness signal | "It can write code" | "It can work reliably on your team" |
The approach: human-grade first, then tune an LLM-as-a-judge until it matches your scoring.
The tradeoff: this takes more upfront work than a correctness benchmark. The payoff is catching failure modes before they hit production.
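One way to check whether the judge "matches your scoring" is to grade the same runs both ways and look at the gap per dimension. The sketch below assumes integer grades stored as dicts keyed by dimension name; the `tolerance` of 0.5 and the function names are hypothetical.

```python
from statistics import mean

DIMENSIONS = ("proactivity", "context_management", "communication", "testing")

Grades = dict[str, int]  # dimension name -> grade for one run


def mean_abs_gap(human: list[Grades], judge: list[Grades]) -> dict[str, float]:
    """Average absolute difference between human and judge grades, per dimension."""
    if len(human) != len(judge) or not human:
        raise ValueError("need the same non-zero number of human and judge grades")
    return {
        dim: mean(abs(h[dim] - j[dim]) for h, j in zip(human, judge))
        for dim in DIMENSIONS
    }


def dimensions_to_retune(gaps: dict[str, float], tolerance: float = 0.5) -> list[str]:
    """Dimensions where the judge is still too far from the human graders."""
    return [dim for dim, gap in gaps.items() if gap > tolerance]
```

If `dimensions_to_retune` returns anything, adjust the judge prompt for those dimensions and re-run on the same human-graded set before trusting it at scale.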
Closing the loop is necessary, but it's not the whole job. What matters in production is whether an agent can iterate in a real environment before the PR and hand you something you can actually review: a diff, evidence, and a clear trail of what happened.
That's the direction Roo Code is built for, and it maps directly to the rubric.
For a Series A–C team with five to twenty engineers, agent reliability is a force multiplier. If your agent drifts or goes silent on complex tasks, someone has to babysit it. That someone is an engineer who could be shipping.
Work style evals surface these problems before you've built workflows around an agent that can't handle the job. You find out in the eval, not in the incident postmortem.
The rubric: proactivity, context management, communication, testing. Grade your agent like you'd grade a junior engineer on a trial period.
If it can't tell you its plan, it's not ready for production.
Code correctness benchmarks measure whether the output matches an expected result on isolated tasks. They don't capture how an agent behaves when context is complex, when it gets stuck, or when requirements span multiple files. An agent can score perfectly on benchmarks while drifting silently on real work.
The four metrics are proactivity (does it keep moving or stop unnecessarily), context management (can it track requirements across a complex task), communication (does it share its plan and surface blockers), and testing (does it validate its own work). These predict production reliability better than correctness scores.
Work style grading can be automated. The recommended approach is to have humans grade agent work on the four dimensions first, then tune an LLM-as-a-judge until it replicates those grades. Once the automated judge correlates with human judgment, use it for scale while spot-checking with humans periodically.