Benchmark: work-with-pr (Iteration 1)

.opencode/skills/work-with-pr-workspace/iteration-1/benchmark.md

Summary

| Metric | With Skill | Without Skill | Delta |
|---|---|---|---|
| Pass Rate | 96.8% (30/31) | 51.6% (16/31) | +45.2% |
| Mean Duration | 340.2s | 303.0s | +37.2s |
| Duration Stddev | 169.3s | 77.8s | +91.5s |

Per-Eval Breakdown

| Eval | With Skill | Without Skill | Delta |
|---|---|---|---|
| happy-path-feature-config-option | 100% (10/10) | 40% (4/10) | +60% |
| bugfix-atlas-null-check | 100% (6/6) | 67% (4/6) | +33% |
| refactor-split-constants | 100% (5/5) | 40% (2/5) | +60% |
| new-mcp-arxiv-casual | 100% (5/5) | 60% (3/5) | +40% |
| regex-fix-false-positive | 80% (4/5) | 60% (3/5) | +20% |
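
The summary pass rates can be reproduced from the per-eval counts. The following is a minimal sketch of that aggregation, assuming the counts are passed assertions per eval; the names `PER_EVAL` and `aggregate` are illustrative and not part of the benchmark tooling.

```python
# Per-eval counts copied from the breakdown table:
# eval name -> (passed with skill, passed without skill, total)
PER_EVAL = {
    "happy-path-feature-config-option": (10, 4, 10),
    "bugfix-atlas-null-check": (6, 4, 6),
    "refactor-split-constants": (5, 2, 5),
    "new-mcp-arxiv-casual": (5, 3, 5),
    "regex-fix-false-positive": (4, 3, 5),
}


def aggregate(per_eval):
    """Sum passes and totals across all evals for both conditions."""
    with_skill = sum(w for w, _, _ in per_eval.values())
    without_skill = sum(wo for _, wo, _ in per_eval.values())
    total = sum(t for _, _, t in per_eval.values())
    return with_skill, without_skill, total


with_skill, without_skill, total = aggregate(PER_EVAL)
rate_with = 100 * with_skill / total
rate_without = 100 * without_skill / total

print(f"With skill:    {rate_with:.1f}% ({with_skill}/{total})")
print(f"Without skill: {rate_without:.1f}% ({without_skill}/{total})")
print(f"Delta:         {rate_with - rate_without:+.1f} percentage points")
```

Summing the raw counts before dividing weights each eval by its number of checks, which is why the overall rate matches 30/31 and 16/31 rather than the unweighted mean of the per-eval percentages.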

Key Discriminators

  • three-gates (CI + review-work + Cubic): 5/5 vs 0/5 — strongest signal
  • worktree-isolation: 5/5 vs 1/5
  • atomic-commits: 2/2 vs 0/2
  • cubic-check-method: 1/1 vs 0/1

Non-Discriminating Assertions

  • References actual files: passes in both conditions
  • PR targets dev: passes in both conditions
  • Runs local checks before pushing: passes in both conditions

Only With-Skill Failure

  • eval-5 minimal-change: The skill-guided agent proposed config schema changes and a Go binary update for what should have been a minimal regex fix. The skill may encourage over-engineering in fix scenarios.

Analyst Notes

  • The skill adds the most value for procedural knowledge (verification gates, worktree workflow) that agents cannot infer from the codebase alone.
  • The mean duration cost is modest (+37.2s, about +12%) and acceptable given the +45.2 percentage point improvement in pass rate.
  • Consider adding explicit "fix-type tasks: stay minimal" guidance in iteration 2.
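
The cost/benefit figures in these notes can be checked from the Summary values. The following is a rough sketch, with the constants copied from the Summary table and the helper name `relative_overhead` chosen for illustration only.

```python
# Values copied from the Summary table.
MEAN_WITH_S = 340.2      # mean duration with the skill, in seconds
MEAN_WITHOUT_S = 303.0   # mean duration without the skill, in seconds
PASS_WITH = 30 / 31      # pass rate with the skill
PASS_WITHOUT = 16 / 31   # pass rate without the skill


def relative_overhead(with_value, without_value):
    """Return the increase relative to the without-skill baseline."""
    return (with_value - without_value) / without_value


overhead = relative_overhead(MEAN_WITH_S, MEAN_WITHOUT_S)
gain_pp = 100 * (PASS_WITH - PASS_WITHOUT)

print(f"Duration overhead: {overhead:+.1%}")
print(f"Pass rate gain:    {gain_pp:+.1f} percentage points")
```

The duration overhead is a relative change against the baseline, while the pass rate gain is an absolute difference in percentage points, so the two numbers are not directly comparable as ratios.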