# Benchmark Results — planning-with-files v2.22.0

Formal evaluation of planning-with-files using Anthropic's skill-creator framework. This document records the full methodology, test cases, grading criteria, and results.


## Why We Did This

A proactive security audit in March 2026 identified a prompt injection amplification vector in the hook system. The PreToolUse hook re-reads task_plan.md before every tool call — the mechanism that makes the skill effective — but declaring WebFetch and WebSearch in allowed-tools created a path for untrusted web content to reach that file and be re-injected into context on every subsequent tool use.

Hardened in v2.21.0: removed WebFetch/WebSearch from allowed-tools, added explicit Security Boundary guidance to SKILL.md. These evals document the performance baseline and verify zero regression in workflow fidelity.
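
The shape of the fix is easiest to see in the skill's frontmatter. A minimal sketch with illustrative field values (the real SKILL.md also carries the full description and the Security Boundary section; only the allowed-tools change is taken from the source):

```bash
# Illustrative SKILL.md frontmatter after hardening; values are a sketch,
# not the actual file contents.
cat <<'EOF'
---
name: planning-with-files
description: Encodes a 3-file planning workflow (task_plan.md, findings.md, progress.md)
# pre-v2.21.0: allowed-tools also listed WebFetch, WebSearch  <- injection path
allowed-tools: Read, Write, Edit, Bash
---
EOF
```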


## Test Environment

| Item | Value |
|---|---|
| Skill version tested | 2.21.0 |
| Eval framework | Anthropic skill-creator (github.com/anthropics/skills) |
| Executor model | claude-sonnet-4-6 |
| Eval date | 2026-03-06 |
| Eval repo | Local copy (planning-with-files-eval-test/) |
| Subagents | 10 parallel (5 with_skill + 5 without_skill) |
| Comparator agents | 3 blind A/B comparisons |

## Test 1: Evals + Benchmark

### Skill Category

planning-with-files is an encoded preference skill (not capability uplift). Claude can plan without the skill — the skill encodes a specific 3-file workflow pattern. Assertions test workflow fidelity, not general planning ability.
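
To make the encoded preference concrete, here is a minimal scaffold of the three files, using the section names the assertions below check for (placeholder content; the skill's actual templates live in SKILL.md):

```bash
# Minimal 3-file scaffold (illustrative; section names match the
# assertions in the next subsection).
cat > task_plan.md <<'EOF'
## Goal
<one-line objective>

### Phase 1: <phase name>
**Status:** in_progress

## Decisions Made

## Errors Encountered
EOF
touch findings.md progress.md
```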

### Test Cases (5 Evals)

| ID | Name | Task |
|---|---|---|
| 1 | todo-cli | Build a Python CLI todo tool with persistence |
| 2 | research-frameworks | Research Python testing frameworks, compare 3, recommend one |
| 3 | debug-fastapi | Systematically debug a TypeError in FastAPI |
| 4 | django-migration | Plan a 50k LOC Django 3.2 → 4.2 migration |
| 5 | cicd-pipeline | Create a CI/CD plan for a TypeScript monorepo |

Each eval ran two subagents simultaneously:

- with_skill: Read SKILL.md, follow it, create planning files in the output directory
- without_skill: Execute the same task naturally, with no skill or template

### Assertions per Eval

All assertions are objectively verifiable (file existence, section headers, field counts):

| Assertion | Evals |
|---|---|
| task_plan.md created in project directory | All 5 |
| findings.md created in project directory | 1, 2, 4, 5 |
| progress.md created in project directory | All 5 |
| `## Goal` section in task_plan.md | 1, 5 |
| `### Phase` sections (1+) in task_plan.md | All 5 |
| `**Status:**` fields on phases | All 5 |
| `## Errors Encountered` section | 1, 3 |
| `## Current Phase` section | 2 |
| Research content in findings.md (not task_plan.md) | 2 |
| 4+ phases | 4 |
| `## Decisions Made` section | 4 |

Total assertions: 30
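
Since every assertion reduces to a file-existence or pattern check, grading is mechanical. A sketch of the kind of check involved (the real grader belongs to the eval framework; `$out` is a hypothetical output directory):

```bash
out=with_skill/outputs   # hypothetical path
[ -f "$out/task_plan.md" ]                              && echo "PASS: task_plan.md exists"
grep -q '^## Goal' "$out/task_plan.md"                  && echo "PASS: ## Goal section"
[ "$(grep -c '^### Phase' "$out/task_plan.md")" -ge 1 ] && echo "PASS: 1+ phase sections"
grep -q '^\*\*Status:\*\*' "$out/task_plan.md"          && echo "PASS: status fields"
```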

### Results

| Eval | with_skill | without_skill | with_skill files | without_skill files |
|---|---|---|---|---|
| 1 todo-cli | 7/7 (100%) | 0/7 (0%) | task_plan.md, findings.md, progress.md | plan.md, todo.py, test_todo.py |
| 2 research | 6/6 (100%) | 0/6 (0%) | task_plan.md, findings.md, progress.md | framework_comparison.md, recommendation.md, research_plan.md |
| 3 debug | 5/5 (100%) | 0/5 (0%) | task_plan.md, findings.md, progress.md | debug_analysis.txt, routes_users_fixed.py |
| 4 django | 5/6 (83.3%) | 0/6 (0%) | task_plan.md, findings.md, progress.md | django_migration_plan.md |
| 5 cicd | 6/6 (100%) | 2/6 (33.3%) | task_plan.md, findings.md, progress.md | task_plan.md (wrong structure) |

Aggregate:

| Configuration | Pass Rate | Total Passed |
|---|---|---|
| with_skill | 96.7% | 29/30 |
| without_skill | 6.7% | 2/30 |
| Delta | +90.0 pp | +27 assertions |

Timing and token usage (from task completion notifications — captured at runtime):

| Eval | with_skill tokens | with_skill time | without_skill tokens | without_skill time |
|---|---|---|---|---|
| 1 todo-cli | 17,802 | 99.7s | 13,587 | 76.2s |
| 2 research | 22,150 | 128.7s | 13,610 | 127.3s |
| 3 debug | 17,506 | 93.4s | 11,525 | 66.5s |
| 4 django | 24,049 | 147.9s | 12,351 | 141.4s |
| 5 cicd | 18,122 | 105.0s | 8,424 | 76.7s |
| Average | 19,926 | 115s | 11,899 | 98s |

The skill uses ~67% more tokens and ~17% more time on average. The extra cost is the structured output: creating three files instead of one or two, following phase/status discipline, and populating decision and error tables. This is the intended tradeoff — the skill trades speed for structure.
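
The averages in the table are plain means over the five evals; they can be re-derived from the rows:

```bash
# Cross-check of the quoted averages (simple means, no external data).
echo '17802 22150 17506 24049 18122' | tr ' ' '\n' | awk '{s+=$1} END {printf "%.0f\n", s/NR}'  # 19926
echo '13587 13610 11525 12351 8424'  | tr ' ' '\n' | awk '{s+=$1} END {printf "%.0f\n", s/NR}'  # 11899
```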

### One Assertion Refined (Eval 4)

Assertion: `**Status:** pending` on at least one future phase
Result: not met

The agent completed all 6 migration phases in a single comprehensive planning session, leaving none pending. The skill was followed correctly — the assertion was overly prescriptive. The skill does not require phases to remain pending; it requires phases to have status fields. Revised for future evals: task_plan.md contains `**Status:**` fields (without specifying a value).
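
In grading terms, the refinement is the difference between these two checks (illustrative grep patterns):

```bash
# Original, overly prescriptive: required some phase to still be pending.
grep -q '^\*\*Status:\*\* pending' task_plan.md
# Revised: any **Status:** field passes, regardless of value.
grep -q '^\*\*Status:\*\*' task_plan.md
```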


## Test 2: A/B Blind Comparison

Three independent comparator agents evaluated pairs of outputs without knowing which was with_skill vs without_skill. Assignment was randomized (a sketch of the blinding step follows the table):

| Eval | A | B | Winner | A score | B score |
|---|---|---|---|---|---|
| 1 todo-cli | without_skill | with_skill | B (with_skill) | 6.0/10 | 10.0/10 |
| 3 debug-fastapi | with_skill | without_skill | A (with_skill) | 10.0/10 | 6.3/10 |
| 4 django-migration | without_skill | with_skill | B (with_skill) | 8.0/10 | 10.0/10 |
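
The blinding step amounts to a coin flip over which configuration gets labeled A (a sketch; directory names are hypothetical):

```bash
# Randomly map with_skill/without_skill outputs to anonymous labels A/B
# so comparators never see the configuration names.
mkdir -p pair
if [ "$(shuf -i 0-1 -n 1)" -eq 0 ]; then
  cp -r with_skill/outputs    pair/A
  cp -r without_skill/outputs pair/B
else
  cp -r without_skill/outputs pair/A
  cp -r with_skill/outputs    pair/B
fi
```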

with_skill wins: 3/3 = 100%

### Comparator Quotes

Eval 1 (todo-cli): "Output B satisfies all four structured-workflow expectations precisely... Output A delivered real, runnable code (todo.py + a complete test suite), which is impressive, but it did not fulfill the structural expectations... Output A's strength is real but out of scope for what was being evaluated."

Eval 3 (debug-fastapi): "Output A substantially outperforms Output B on every evaluated expectation. Output B is a competent ad-hoc debug response, but it does not satisfy the structured, multi-phase planning format the eval specifies. Output A passes all five expectations; Output B passes one and fails four."

Eval 4 (django-migration): "Output B is also substantively strong: it covers pytz/zoneinfo migration (a 4.2-specific item Output A omits entirely), includes 'django-upgrade' as an automated tooling recommendation... The 18,727 output characters vs 12,847 for Output A also reflects greater informational density in B."


## Test 3: Description Optimizer

Status: Not run in this cycle

Requires ANTHROPIC_API_KEY in the eval environment. Per the project's eval standards, a test is only included in results if it can be run end-to-end with verified metrics. Scheduled for the next eval cycle.
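
When it does run, the only environment prerequisite is the key itself:

```bash
export ANTHROPIC_API_KEY="sk-ant-..."   # placeholder, not a real key
```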


## Summary

| Test | Status | Result |
|---|---|---|
| Evals + Benchmark | ✅ Complete | 96.7% (with_skill) vs 6.7% (without_skill) |
| A/B Blind Comparison | ✅ Complete | 3/3 wins (100%) for with_skill |
| Description Optimizer | Pending | Scheduled for next eval cycle |

The skill demonstrably enforces the 3-file planning pattern across diverse task types. Without the skill, agents default to ad-hoc file naming and skip the structured planning workflow entirely.


## Reproducing These Results

```bash
# Clone the eval framework
gh api repos/anthropics/skills/contents/skills/skill-creator ...

# Set up workspace
mkdir -p eval-workspace/iteration-1/{eval-1,eval-2,...}/{with_skill,without_skill}/outputs

# Run with_skill subagent
# Prompt: "Read SKILL.md at path X. Follow it. Execute: <task>. Save to: <output_dir>"

# Run without_skill subagent
# Prompt: "Execute: <task>. Save to: <output_dir>. No skill or template."

# Grade assertions, produce benchmark.json
# See eval-workspace/iteration-1/benchmark.json for full data
```
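
Once benchmark.json exists, pass totals can be pulled out mechanically. A sketch assuming a schema along the lines of `{"evals": [{"id": 1, "passed": 7, "total": 7}, ...]}` (the actual schema may differ):

```bash
# Hypothetical schema assumed; adjust keys to the real benchmark.json.
jq '{passed: ([.evals[].passed] | add), total: ([.evals[].total] | add)}' \
  eval-workspace/iteration-1/benchmark.json
```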

Raw benchmark data: eval-workspace/iteration-1/benchmark.json (in eval-test copy, not tracked in main repo)