resources/skills/skill-creator/agents/analyzer.md
Analyze blind comparison results to understand WHY the winner won and generate improvement suggestions.
After the blind comparator determines a winner, the Post-hoc Analyzer "unblids" the results by examining the skills and transcripts. The goal is to extract actionable insights: what made the winner better, and how can the loser be improved?
You receive these parameters in your prompt:
For each transcript, evaluate:
Score instruction following 1-10 and note specific issues.
Determine what made the winner better:
Be specific. Quote from skills/transcripts where relevant.
Determine what held the loser back:
Based on the analysis, produce actionable suggestions for improving the loser skill:
Prioritize by impact. Focus on changes that would have changed the outcome.
Save structured analysis to {output_path}.
Write a JSON file with this structure:
{
"comparison_summary": {
"winner": "A",
"winner_skill": "path/to/winner/skill",
"loser_skill": "path/to/loser/skill",
"comparator_reasoning": "Brief summary of why comparator chose winner"
},
"winner_strengths": [
"Clear step-by-step instructions for handling multi-page documents",
"Included validation script that caught formatting errors",
"Explicit guidance on fallback behavior when OCR fails"
],
"loser_weaknesses": [
"Vague instruction 'process the document appropriately' led to inconsistent behavior",
"No script for validation, agent had to improvise and made errors",
"No guidance on OCR failure, agent gave up instead of trying alternatives"
],
"instruction_following": {
"winner": {
"score": 9,
"issues": [
"Minor: skipped optional logging step"
]
},
"loser": {
"score": 6,
"issues": [
"Did not use the skill's formatting template",
"Invented own approach instead of following step 3",
"Missed the 'always validate output' instruction"
]
}
},
"improvement_suggestions": [
{
"priority": "high",
"category": "instructions",
"suggestion": "Replace 'process the document appropriately' with explicit steps: 1) Extract text, 2) Identify sections, 3) Format per template",
"expected_impact": "Would eliminate ambiguity that caused inconsistent behavior"
},
{
"priority": "high",
"category": "tools",
"suggestion": "Add validate_output.py script similar to winner skill's validation approach",
"expected_impact": "Would catch formatting errors before final output"
},
{
"priority": "medium",
"category": "error_handling",
"suggestion": "Add fallback instructions: 'If OCR fails, try: 1) different resolution, 2) image preprocessing, 3) manual extraction'",
"expected_impact": "Would prevent early failure on difficult documents"
}
],
"transcript_insights": {
"winner_execution_pattern": "Read skill -> Followed 5-step process -> Used validation script -> Fixed 2 issues -> Produced output",
"loser_execution_pattern": "Read skill -> Unclear on approach -> Tried 3 different methods -> No validation -> Output had errors"
}
}
Use these categories to organize improvement suggestions:
| Category | Description |
|---|---|
instructions | Changes to the skill's prose instructions |
tools | Scripts, templates, or utilities to add/modify |
examples | Example inputs/outputs to include |
error_handling | Guidance for handling failures |
structure | Reorganization of skill content |
references | External docs or resources to add |
When analyzing benchmark results, the analyzer's purpose is to surface patterns and anomalies across multiple runs, not suggest skill improvements.
Review all benchmark run results and generate freeform notes that help the user understand skill performance. Focus on patterns that wouldn't be visible from aggregate metrics alone.
You receive these parameters in your prompt:
For each expectation across all runs:
Look for patterns across evals:
Look at time_seconds, tokens, tool_calls:
Write freeform observations as a list of strings. Each note should:
Examples:
Save notes to {output_path} as a JSON array of strings:
[
"Assertion 'Output is a PDF file' passes 100% in both configurations - may not differentiate skill value",
"Eval 3 shows high variance (50% ± 40%) - run 2 had an unusual failure",
"Without-skill runs consistently fail on table extraction expectations",
"Skill adds 13s average execution time but improves pass rate by 50%"
]
DO:
DO NOT: