crates/but/skill/RESEARCH.md
How do we know if our skill file is "good"? How do we prove changes improve it? This document synthesizes industry research into a concrete strategy for quantitatively testing the but CLI skill.
Today, every change to SKILL.md is based on gut feel. We can read it and think "this seems clearer," but we have no way to measure whether agents actually behave better after the change. We need a quantitative, repeatable way to test the skill.
As of February 7, 2026, this project is operating in a Tier 4-first mode.
Berkeley Function-Calling Leaderboard (BFCL) — The most comprehensive public benchmark for LLM tool use. Evaluates across dimensions: tool selection (did it pick the right tool?), parameter accuracy (right arguments?), relevance detection (correctly refusing irrelevant tools), and multi-turn chaining. Uses AST comparison of generated function calls against ground truth.
SWE-bench — Tests end-to-end coding agent performance on real GitHub issues. Binary pass/fail based on whether the agent's patch passes the test suite. Key insight: test the outcome (did the task succeed?), not just individual tool calls.
MCP-Bench — Benchmarks tool-using LLM agents across tool appropriateness (right tool for the subtask) and parameter accuracy (correct/complete arguments). Uses prompt shuffling to control for ordering bias in evaluation.
DeepEval Tool Correctness — Deterministic metric: correctness = correctly_used_tools / total_tools_called. Evaluates tool selection, input parameters, and output accuracy. Can layer LLM-as-judge for optimality assessment.
From Anthropic's skill authoring best practices:
> Create evaluations BEFORE writing extensive documentation. This ensures your Skill solves real problems rather than documenting imagined ones.
Their recommended evaluation-driven development loop: write the evals first, run them to establish a baseline, make a minimal skill change, and re-run to confirm the change actually helps.
Evaluation structure they suggest:
```json
{
  "skills": ["but"],
  "query": "Commit the auth changes to the feature branch",
  "files": ["src/auth.rs"],
  "expected_behavior": [
    "Runs but status --json to check workspace state",
    "Uses but commit with --changes flag for specific files",
    "Includes --json and --status-after flags"
  ]
}
```
They also recommend testing with all model tiers (Haiku, Sonnet, Opus) since skill effectiveness varies by model capability.
| Framework | Type | Strengths | Fit for Us |
|---|---|---|---|
| promptfoo | OSS, YAML-driven | Native Claude Agent SDK provider, tool use assertions, --repeat N for statistics, HTML reports, cost tracking | Best fit for Tier 4-first harness (and Tier 2-3 when needed) |
| Braintrust | Commercial | Trajectory scoring, experiment comparison, tracing | Overkill for now |
| inspect_ai | OSS, Python | Docker sandboxes, multi-turn agent eval, solver/scorer separation | Optional for large-scale secondary evals |
| DeepEval | OSS, pytest | Tool Correctness metric, G-Eval custom criteria | Good metrics reference |
| Claude Agent SDK | Anthropic | claude -p headless mode, --output-format json, structured output | Essential for Tier 4 |
promptfoo has a dedicated Claude Agent SDK provider that runs agents with full tool access:
```yaml
providers:
  - id: anthropic:claude-agent-sdk
    config:
      model: claude-sonnet-4-5-20250929
      working_dir: ./test-repo
      append_allowed_tools: ['Bash', 'Read', 'Edit']
      permission_mode: acceptEdits
      max_turns: 10
      max_budget_usd: 0.25
      ask_user_question:
        behavior: first_option # Auto-answer prompts

tests:
  - vars:
      prompt: "Commit my auth changes"
    assert:
      - type: contains
        value: "but commit"
      - type: not-contains
        value: "git commit"
      - type: cost
        threshold: 0.25
      - type: javascript
        value: |
          const text = String(output).toLowerCase();
          return text.includes('--json') && text.includes('--status-after');
```
Based on industry standards and our specific skill file, here are the metrics to track:
| Metric | What It Measures | Target | How to Score |
|---|---|---|---|
| Tool routing accuracy | Uses but instead of git for write ops | 100% | Binary per command |
| --json compliance | All but commands include --json | 100% | Count across all commands in response |
| --status-after compliance | Mutation commands include --status-after | 100% | Check commit/absorb/rub/stage/squash/move/uncommit |
| --changes specificity | but commit uses --changes with explicit IDs (a1,b2 or repeated flag), not bare commit | >90% | Binary per commit command |
| Workflow ordering | Runs but status --json before mutations | 100% | Check command sequence |
| Unnecessary round-trips | No but status after commands with --status-after | 0 | Count redundant status calls |
| Task completion | End-to-end task succeeds | >80% | Binary per scenario |
| Metric | Formula |
|---|---|
| Instruction compliance rate | instructions_followed / instructions_relevant_to_task |
| Tool efficiency | expected_tool_calls / actual_tool_calls |
| Common mistake avoidance | 1 - (documented_mistakes_triggered / tasks_run) |
| Cost per task | Total tokens consumed (input + output) |
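Most of these metrics are deterministic functions of the captured command trace. A minimal sketch, assuming the trace is available as a list of Bash command strings; all names below are illustrative, not the harness API:

```typescript
// Illustrative deterministic scorer over a captured Bash command trace
// (e.g. from SDK hooks). Targets come from the tables above.
const MUTATIONS = ["commit", "absorb", "rub", "stage", "squash", "move", "uncommit"];
const GIT_WRITE = /^git\s+(add|commit|push|rebase|reset|checkout|merge)\b/;

const ratio = (hits: number, total: number) => (total === 0 ? 1 : hits / total);

export function scoreTrace(trace: string[], expectedToolCalls: number) {
  const but = trace.filter((c) => c.startsWith("but "));
  const muts = but.filter((c) => MUTATIONS.some((m) => c.startsWith(`but ${m}`)));
  const statusIdx = trace.findIndex((c) => c.startsWith("but status"));
  const firstMutIdx = muts.length > 0 ? trace.indexOf(muts[0]) : -1;
  return {
    // Tool routing: any git write command zeroes the score.
    routingAccuracy: trace.some((c) => GIT_WRITE.test(c)) ? 0 : 1,
    // Flag compliance, vacuously 1 when no relevant commands were run.
    jsonCompliance: ratio(but.filter((c) => c.includes("--json")).length, but.length),
    statusAfterCompliance: ratio(
      muts.filter((c) => c.includes("--status-after")).length,
      muts.length,
    ),
    // Workflow ordering: a status check must precede the first mutation.
    statusFirst: firstMutIdx === -1 || (statusIdx !== -1 && statusIdx < firstMutIdx),
    toolEfficiency: trace.length === 0 ? 1 : Math.min(1, expectedToolCalls / trace.length),
  };
}
```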
These formula-style metrics line up with the axes emphasized in the academic survey literature on tool-using agents: instruction compliance, efficiency, and error avoidance.
Tier 1 (static validation): validate the skill file structure without calling any LLM. Runs as part of cargo test.
What to check:
- Referenced files exist (references/reference.md, etc.)
- Documented command examples carry the required flags (--json --status-after)

Implementation: Rust tests in crates/but or a simple script that parses SKILL.md.
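A minimal script sketch of those two checks (TypeScript for consistency with the eval harness; paths and regexes are assumptions, and a Rust-test variant would do the same):

```typescript
import { readFileSync, existsSync } from "node:fs";
import { join } from "node:path";

// Tier 1 sketch: fail fast on structural issues in SKILL.md.
// Paths and rules here are illustrative, not the project's actual checks.
const skillDir = "crates/but/skill";
const skill = readFileSync(join(skillDir, "SKILL.md"), "utf8");
const errors: string[] = [];

// Every relative reference like (references/reference.md) must exist on disk.
for (const [, ref] of skill.matchAll(/\((references\/[\w./-]+)\)/g)) {
  if (!existsSync(join(skillDir, ref))) errors.push(`dangling reference: ${ref}`);
}

// Documented `but commit` examples must carry the required flags.
for (const [line] of skill.matchAll(/^.*\bbut commit\b.*$/gm)) {
  if (!line.includes("--json") || !line.includes("--status-after")) {
    errors.push(`example missing required flags: ${line.trim()}`);
  }
}

if (errors.length > 0) {
  console.error(errors.join("\n"));
  process.exit(1);
}
```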
Tier 2 (single-shot command selection): test whether Claude picks the right first command given a user prompt and the skill file as context. Uses the Anthropic API with mock tool definitions.
Pattern:
Input: skill_file + user_prompt → Model → First tool call
Score: Does the tool call match expectations?
Example test cases:
| User Prompt | Expected First Command | Assertions |
|---|---|---|
| "What files have I changed?" | but status --json | contains but status, contains --json |
| "Commit my auth changes" | but status --json | status first, then commit with --changes |
| "Create a new branch for auth" | but branch new auth | contains but branch new |
| "Push my changes" | but push | NOT git push |
| "Squash my last 3 commits" | but squash | NOT git rebase -i |
| "Can you do a git push?" | but push | uses but not git |
| "Check what's changed" | but status --json | NOT git status, NOT git diff |
| "Undo my last commit" | some but command | NOT git reset |
Implementation options (a single-shot sketch follows this list):

- Direct API calls via the anthropic SDK
- Headless runs: claude -p "prompt" --output-format json
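A minimal single-shot sketch with the Anthropic TypeScript SDK (the model string, file path, and tool schema are assumptions):

```typescript
import Anthropic from "@anthropic-ai/sdk";
import { readFileSync } from "node:fs";

// Tier 2 sketch: give the model the skill as system context plus a mock
// Bash tool, then assert on the FIRST tool call only.
const client = new Anthropic();

export async function firstCommand(userPrompt: string): Promise<string | null> {
  const res = await client.messages.create({
    model: "claude-sonnet-4-5", // assumption; pin whichever tier you test
    max_tokens: 1024,
    system: readFileSync("crates/but/skill/SKILL.md", "utf8"),
    tools: [
      {
        name: "bash",
        description: "Run a shell command",
        input_schema: {
          type: "object",
          properties: { command: { type: "string" } },
          required: ["command"],
        },
      },
    ],
    messages: [{ role: "user", content: userPrompt }],
  });
  const call = res.content.find(
    (b): b is Anthropic.ToolUseBlock => b.type === "tool_use",
  );
  return call ? (call.input as { command: string }).command : null;
}
```

Each row of the table above then becomes one assertion on the returned command string.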
Tier 3 (mock workflow evals): test complete workflows with mock tool execution. This is the highest-signal tier short of full integration.
Pattern:
1. Provide skill file as system context
2. Send user prompt
3. Model returns tool call → feed mock result
4. Model returns next tool call → feed mock result
5. Repeat until model finishes
6. Score the entire command sequence
Example scenarios:
Scenario: Basic commit flow
User: "I just finished implementing auth. Commit it."
Expected sequence:
1. but status --json (check state)
2. but commit <branch> -m "..." --changes <id>,<id> --json --status-after (commit)
Assertions:
- Step 1 happens before step 2
- Commit includes --changes (not bare commit)
- Commit includes --json --status-after
- No git commands used
Scenario: New feature workflow
User: "Add a dark mode feature"
Expected sequence:
1. but status --json (check state)
2. but branch new dark-mode (create branch)
3. [file edits happen]
4. but commit ... --changes ... --json --status-after
Assertions:
- Branch created before any commits
- Commit targets the new branch
Scenario: Multiple independent features
User: "Add API endpoint and update UI styling"
Expected:
- Two parallel branches created
- Files staged to appropriate branches
- Separate commits per branch
Mock tool execution:
The key insight from the research: you don't need to run real commands. Mock the tool results:
```python
import json

def mock_bash(command: str) -> str:
    if "but status --json" in command:
        return json.dumps({
            "unassignedChanges": [
                {"cliId": "a1", "filePath": "src/auth.rs", "changeType": "modified"}
            ],
            "stacks": [{"cliId": "su", "branches": [{"cliId": "bu", "name": "main"}]}],
        })
    if "but commit" in command:
        # Status payload elided; a real handler returns the post-commit state.
        return json.dumps({"result": {"commitId": "abc123"}, "status": {}})
    if "but branch new" in command:
        return json.dumps({"result": {"branchId": "bv", "name": "..."}})
    return "unknown command"
```
This gives full control, deterministic scoring, and low cost (can use Sonnet/Haiku).
Key insight: Tier 3 tests the skill file, not the but CLI. No but binary runs. No git repo exists. The mock handlers return canned JSON that looks like but status --json output. You're measuring whether SKILL.md teaches the model correctly — complementary to Tier 1's structural validation.
```
                     ┌─────────────┐
SKILL.md ──────────► │  LLM (API)  │ ◄──── user prompt
(system context)     └──────┬──────┘
                            │
              tool_use: "but status --json"
                            │
                     ┌──────▼──────┐
                     │ Mock handler│ ──► canned JSON
                     └──────┬──────┘
                            │
tool_use: "but commit ... --changes a1 --json --status-after"
                            │
     Score: did the command sequence follow SKILL.md rules?
```
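A sketch of the loop driving that diagram, using the Anthropic TypeScript SDK (type names per the SDK; the mockBash handler plays the same role as the Python mock above):

```typescript
import Anthropic from "@anthropic-ai/sdk";

// Tier 3 loop sketch: feed mock results back until the model stops
// calling tools, then return the full command trace for scoring.
const client = new Anthropic();

export async function runScenario(
  system: string, // SKILL.md contents
  userPrompt: string,
  mockBash: (cmd: string) => string,
  maxTurns = 10,
): Promise<string[]> {
  const trace: string[] = [];
  const messages: Anthropic.MessageParam[] = [{ role: "user", content: userPrompt }];
  const tools: Anthropic.Tool[] = [{
    name: "bash",
    description: "Run a shell command",
    input_schema: {
      type: "object",
      properties: { command: { type: "string" } },
      required: ["command"],
    },
  }];

  for (let turn = 0; turn < maxTurns; turn++) {
    const res = await client.messages.create({
      model: "claude-sonnet-4-5", max_tokens: 1024, system, tools, messages,
    });
    const calls = res.content.filter(
      (b): b is Anthropic.ToolUseBlock => b.type === "tool_use",
    );
    if (calls.length === 0) break; // model finished without a tool call
    messages.push({ role: "assistant", content: res.content });
    messages.push({
      role: "user",
      content: calls.map((c) => {
        const cmd = (c.input as { command: string }).command;
        trace.push(cmd); // record for deterministic scoring
        return { type: "tool_result" as const, tool_use_id: c.id, content: mockBash(cmd) };
      }),
    });
  }
  return trace;
}
```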
Tier 3 remains useful for cheap, deterministic diagnostics, but this project gates on Tier 4 integration.
Current Tier 4 scenario set (7 scenarios):
| # | Scenario | Key assertions |
|---|---|---|
| 1 | Basic commit flow | status --json before commit; commit has --changes, --json, --status-after; no git write commands |
| 2 | Branch workflow | Create branch (but branch new or but commit <branch> -c) before committing |
| 3 | Git synonym redirect | User says "git push", model uses but push and not git push |
| 4 | Ordering flow | but status --json occurs before but commit |
| 5 | Specificity flow | Single-file commit uses --changes; non-target file remains unassigned in repo state |
| 6 | Amend flow | Use but amend with --json --status-after; no git write fallback |
| 7 | Reorder flow | Use but move/but rub with --json --status-after; no git rebase/checkout fallback; repo reflects target order |
Tier 4 (full integration): run Claude Code against a real test repository with the latest but binary and skill files. Unlike Tier 3's mocks, this tests the full stack: skill file → agent behavior → actual CLI execution → real repo state changes.
What makes this different from Tier 3:
| | Tier 3 (mock) | Tier 4 (integration) |
|---|---|---|
| Runs but binary | No | Yes — freshly built from source |
| Real git repo | No | Yes — disposable fixture |
| Command trace | From mock loop | From SDK hooks or output parsing |
| Asserts on repo state | No | Yes — but status --json after |
| Cost per scenario | ~$0.02 | ~$0.10-0.50 |
| Speed | ~5 sec | ~30-120 sec |
| Catches real bugs | Skill file only | Skill + CLI interaction |
The current Tier 4 harness lives in crates/but/skill/eval/ and uses:
- providers/but-integration.ts for real Agent SDK execution with Bash hook traces
- promptfooconfig.yaml for scenario data
- assertions/but-assertions.ts for shared assertion functions (file://...:functionName)
- setup-fixture.sh for disposable repo setup and skill installation

```bash
cd crates/but/skill/eval

# One run (PR smoke)
pnpm run eval

# Repeated run (nightly/pre-release)
pnpm run eval:repeat

# View report UI
pnpm run view
```
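Shared assertions follow promptfoo's file://path:functionName convention. A sketch of two, assuming the provider serializes the Bash trace into the output one command per line (returning a bare boolean sidesteps the version pitfall noted below):

```typescript
// Sketch of shared assertions in the file://...:functionName style.
// Assumes the custom provider appends the Bash trace to the output.
const GIT_WRITE = /\bgit\s+(add|commit|push|rebase|reset|checkout|merge)\b/;

export function noGitWrites(output: string): boolean {
  return !output.split("\n").some((line) => GIT_WRITE.test(line));
}

export function statusBeforeCommit(output: string): boolean {
  const lines = output.split("\n");
  const status = lines.findIndex((l) => l.includes("but status --json"));
  const commit = lines.findIndex((l) => l.includes("but commit"));
  // Passes vacuously when no commit ran; otherwise status must come first.
  return commit === -1 || (status !== -1 && status < commit);
}
```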
Running the real Tier 4 harness surfaced a few practical issues that are not obvious from design alone.
Measured baselines from live runs:
- 2026-02-07 eval eval-5T4-2026-02-07T17:57:51: 4/5 pass (80%) on the initial suite.
- 2026-02-07 eval eval-8oX-2026-02-07T19:26:09: 1/7 pass (14.29%) after adding stricter assertions plus amend/reorder scenarios.
- 2026-02-07 eval eval-qCp-2026-02-07T19:28:35: 2/7 pass (28.57%) after iterative scenario tuning.
- 2026-02-07 eval eval-lG2-2026-02-07T19:32:49: 2/7 pass (28.57%) after adding repo-state ordering validation for reorder.

What this indicates:
- Agents still fall back to git writes (git add, git commit, git push, git checkout) under several prompts.
- The added amend and reorder scenarios fail most often, largely because they exercise explicit history-editing commands (amend, move/rub) where fallback behavior is currently weak.

Implementation fixes required for stable Tier 4 runs:

1. promptfoo javascript assertion return shape is version-sensitive.
   - On promptfoo 0.119.x, returning { pass, reason } from type: javascript caused failures; returning a bare boolean is the stable option.
2. Do not mutate process.env globally in a concurrent provider.
   - Setting E2E_TEST_APP_DATA_DIR globally races across concurrent scenarios.
   - Pass a scoped env object into SDK query/options and child processes instead (see the sketch after this list).
3. Canonicalize fixture paths before but setup.
   - but setup stored /private/var/... while later status checks used /var/..., causing "Setup required" lookup failures.
   - Resolve symlinks with pwd -P in setup-fixture.sh.
4. Keep fixture support files out of Git status.
   - .but-data/ and installed .claude/skills/ content polluted but status --json and changed CLI IDs.
   - Add .but-data/, .claude/, .tmp/ to .git/info/exclude in each fixture.
5. Fixture cleanup should be best-effort.
   - ENOTEMPTY races during directory deletion can fail otherwise-successful evals.
   - Delete fixtures in finally blocks and tolerate cleanup errors.
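For fix 2, the scoped-env pattern is standard Node; a minimal sketch (the function name is hypothetical, the variable name comes from the notes above):

```typescript
import { spawn } from "node:child_process";

// Scoped env per run instead of mutating process.env: concurrent
// scenarios each get their own data dir without racing each other.
export function runBut(args: string[], dataDir: string) {
  return spawn("but", args, {
    env: { ...process.env, E2E_TEST_APP_DATA_DIR: dataDir }, // scoped copy
    stdio: "pipe",
  });
}
```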
Behavioral takeaway:

- The remaining failures are behavioral, not mechanical: the model can mirror the user's literal phrasing (e.g. running git push despite skill intent).

Run cadence:

- PRs touching crates/but/skill/: run the Tier 4 smoke once (--repeat 1).
- Nightly/pre-release: repeated runs (--repeat 3 or higher).

This project is a Rust codebase, but Tier 4 evals are TypeScript. The reason is practical:
- The Claude Agent SDK (with PostToolUse hooks for command trace capture) only exists in TypeScript and Python — there is no Rust SDK
- The Rust fallback is shelling out via std::process::Command::new("claude").arg("-p"), which gives you final output but no command trace — strictly less informative
- The surrounding promptfoo tooling (--repeat, HTML reports, assertions) is JS/TS-native

For Tier 3 (mock tool execution), Rust is viable since it just calls the Anthropic API directly — no Agent SDK needed. The tradeoff is losing promptfoo's reporting infrastructure. For Tier 4, TypeScript is the clear choice.
Recommendations:

- Run the Tier 4 smoke (--repeat 1) as the PR gate for changes under crates/but/skill/.
- Run repeated evals (--repeat 3+) nightly or pre-release to catch stochastic regressions.
- Track pass rate, --json and --status-after compliance, and cost per scenario.
- Treat any git write command (e.g. a literal "git push") as a hard regression detector.

Use promptfoo + a custom provider as the primary Tier 4 harness. Reasons:
- Rich assertion types (contains, not-contains, javascript, llm-rubric, cost, latency)
- --repeat N for statistical significance

Important nuance: promptfoo's vanilla anthropic provider is single-turn and insufficient for workflow traces. For Tier 4, prefer a custom provider (file://dist/providers/but-integration.js) that wraps Claude Agent SDK hooks and captures Bash command traces.
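A compressed sketch of what such a provider can look like. The package name, query signature, and hook shapes are assumptions based on the TypeScript Agent SDK; the real provider lives at providers/but-integration.ts:

```typescript
import { query } from "@anthropic-ai/claude-agent-sdk"; // assumed package name

// promptfoo custom provider sketch: run the agent, capture Bash commands
// via a PostToolUse hook, and return output + trace for assertions.
export default class ButIntegrationProvider {
  id(): string {
    return "but-integration";
  }

  // promptfoo calls this per test case; we return { output } for assertions.
  async callApi(prompt: string): Promise<{ output: string }> {
    const trace: string[] = [];
    let finalText = "";

    for await (const message of query({
      prompt,
      options: {
        maxTurns: 10,
        hooks: {
          // Capture every Bash invocation for deterministic trace assertions.
          PostToolUse: [{
            matcher: "Bash",
            hooks: [async (input: any) => {
              trace.push(String(input.tool_input?.command ?? ""));
              return {};
            }],
          }],
        },
      },
    })) {
      if (message.type === "result" && message.subtype === "success") {
        finalText = message.result;
      }
    }

    // Serialize the trace alongside the answer so file:// assertions can parse it.
    return { output: [finalText, ...trace].join("\n") };
  }
}
```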
Use promptfoo eval with the custom provider for standard runs, and claude -p only for ad-hoc reproduction.
For each test, produce a composite score:
| Dimension | Weight | Scoring |
|---|---|---|
| Correctness | 40% | Did the task complete? (binary) |
| Tool compliance | 30% | Fraction of commands with correct tool + flags |
| Efficiency | 15% | Actual vs. expected tool calls, normalized (penalize >2× expected) |
| Safety | 15% | No dangerous commands (git writes, bare absorb) |
Aggregate skill score = weighted average across all test cases.
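The aggregation itself is small; a sketch with the weights from the table (type and function names hypothetical):

```typescript
// Composite skill score from the weighted dimensions above.
type Dimensions = {
  correctness: number;
  toolCompliance: number;
  efficiency: number;
  safety: number;
};

const WEIGHTS: Dimensions = {
  correctness: 0.4,
  toolCompliance: 0.3,
  efficiency: 0.15,
  safety: 0.15,
};

export const skillScore = (d: Dimensions): number =>
  (Object.keys(WEIGHTS) as (keyof Dimensions)[]).reduce(
    (sum, k) => sum + WEIGHTS[k] * d[k],
    0,
  );

// Aggregate across test cases: mean of per-case scores, tracked over time.
export const aggregate = (cases: Dimensions[]): number =>
  cases.reduce((s, c) => s + skillScore(c), 0) / Math.max(cases.length, 1);
```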
Track this score over time. Every SKILL.md change should improve or maintain the score.
LLM outputs are non-deterministic. To get meaningful signal:
- Run each scenario multiple times (promptfoo's --repeat flag) and compare pass rates, not single runs
- Pin run parameters so repeats are comparable (--max-turns, --max-budget-usd, --allowedTools, --output-format, etc.)
- Use PreToolUse/PostToolUse hooks for intercepting tool calls deterministically
- Use promptfoo's callApi() interface for custom execution loops
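A sketch of reducing repeated runs to a regression signal (threshold and names are illustrative):

```typescript
// Compare pass rates between baseline and candidate SKILL.md across N
// repeats; a simple delta check flags likely regressions for review.
export function passRate(results: boolean[]): number {
  return results.filter(Boolean).length / Math.max(results.length, 1);
}

export function looksLikeRegression(
  baseline: boolean[],
  candidate: boolean[],
  minDelta = 0.1, // illustrative threshold; tune against observed variance
): boolean {
  return passRate(baseline) - passRate(candidate) > minDelta;
}
```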