Autoreason: Iterative Refinement Methodology

skills/research/research-paper-writing/references/autoreason-methodology.md

Complete reference for the autoreason iterative refinement method, derived from experimental results across subjective writing tasks, competitive programming, and four model tiers. Use this when any output (paper draft, experiment script, analysis, task definition) needs iterative improvement.

Source: NousResearch/autoreason — "Autoreason: When Iterative LLM Refinement Works and Why It Fails"


Strategy Selection Guide

Decision Tree

Is the task objectively verifiable (code, math, factual)?
├── YES → Does the model solve it on the first attempt?
│   ├── YES → Use single pass (no refinement needed)
│   └── NO → Use autoreason (structured analysis → reason-informed revision)
│
└── NO (subjective) → What model tier are you using?
    ├── Weak (Llama 8B, small models)
    │   → Single pass. Model too weak for refinement to help.
    │     Invest in generation quality, not iteration.
    │
    ├── Mid-tier (Haiku 3.5, Gemini Flash)
    │   → Autoreason with stronger judges. This is the sweet spot.
    │     Self-refinement DESTROYS mid-tier outputs; autoreason prevents this.
    │
    ├── Strong (Sonnet 4)
    │   → Autoreason for open-ended tasks. Wins 3/5.
    │     Critique-and-revise for concrete technical tasks (2/5).
    │
    └── Frontier (Sonnet 4.6, Opus)
        ├── Constrained scope? → Autoreason. Wins 2/3 constrained tasks.
        └── Unconstrained? → Critique-and-revise or single pass.
            Autoreason FAILS on unconstrained frontier tasks (comes last).
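
The tree above can be encoded as a small lookup function. This is a minimal sketch, not part of the autoreason codebase: the `Tier` enum, the `choose_strategy` name, and the returned strategy strings are all illustrative assumptions.

```python
from enum import Enum


class Tier(Enum):
    """Model tiers from the decision tree (enum itself is hypothetical)."""
    WEAK = "weak"          # e.g. Llama 8B
    MID = "mid"            # e.g. Haiku 3.5, Gemini Flash
    STRONG = "strong"      # e.g. Sonnet 4
    FRONTIER = "frontier"  # e.g. Sonnet 4.6, Opus


def choose_strategy(
    verifiable: bool,
    solved_first_try: bool = False,
    tier: Tier = Tier.MID,
    open_ended: bool = True,
    constrained: bool = False,
) -> str:
    """Return the strategy the decision tree recommends."""
    if verifiable:
        # Objective tasks: refine only when the first attempt fails.
        return "single_pass" if solved_first_try else "autoreason"
    if tier is Tier.WEAK:
        return "single_pass"
    if tier is Tier.MID:
        return "autoreason"
    if tier is Tier.STRONG:
        return "autoreason" if open_ended else "critique_and_revise"
    # Frontier: the tree allows critique-and-revise OR single pass when
    # unconstrained; this sketch arbitrarily returns the former.
    return "autoreason" if constrained else "critique_and_revise"
```

For example, `choose_strategy(verifiable=False, tier=Tier.FRONTIER, constrained=True)` returns `"autoreason"`, while the same call without `constrained=True` returns `"critique_and_revise"`.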

Strategy Comparison Table

| Strategy | Best For | Avoid When | Compute (per iteration) |
|---|---|---|---|
| Single pass | Frontier models, template tasks, tight budgets | Mid-tier models where quality ceiling is low | 1 call |
| Critique-and-revise | Concrete technical requirements (system design, specifications) | Weak models (degrades output), unconstrained subjective tasks | 2 calls |
| Autoreason | Mid-tier models, constrained scope, tasks with genuine tradeoffs | Weak models (Llama 8B), frontier + unconstrained | ~6 calls |
| Best-of-N | Almost never recommended | Weak models especially (worse than single pass) | N calls |

Why Each Strategy Fails

| Strategy | Failure Mode | Mechanism |
|---|---|---|
| Single pass | Quality ceiling | No mechanism to improve beyond first attempt |
| Critique-and-revise | Progressive degradation | Model hallucinates problems (sycophancy), scope creeps each pass, never declines to change |
| Best-of-N | Random selection | Without good ranking signal, more samples = more mediocre options |
| Autoreason (unconstrained) | Synthesis drift | Stronger models produce syntheses so consistently preferred that incumbent never stabilizes |

The Autoreason Loop

Architecture

┌──────────────────────────────────────────────────────────┐
│                    ITERATION LOOP                         │
│                                                           │
│   Incumbent A ──► Critic ──► Author B ──► Synthesizer     │
│       │                                      │            │
│       │              ┌───────────────────────┘            │
│       ▼              ▼                                    │
│      [A]           [AB]          [B]                      │
│       │              │            │                       │
│       └──────────────┼────────────┘                       │
│                      ▼                                    │
│              Judge Panel (blind)                          │
│                      │                                    │
│                      ▼                                    │
│                   Winner                                  │
│                      │                                    │
│              ┌───────┴───────┐                            │
│              ▼               ▼                            │
│         A wins k=2      B or AB wins                      │
│         consecutive?    → new incumbent                   │
│              │                                            │
│              ▼                                            │
│           CONVERGED                                       │
└──────────────────────────────────────────────────────────┘
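
The control flow in the diagram can be sketched as a plain loop. This is an illustrative skeleton under stated assumptions: the four role arguments are hypothetical callables (each would wrap a fresh LLM call in practice), and `judge_panel` is assumed to return one of the keys `"A"`, `"B"`, or `"AB"` after aggregating the blind rankings.

```python
from typing import Callable, Dict, Tuple


def autoreason_loop(
    task: str,
    draft: str,
    critic: Callable[[str, str], str],
    author: Callable[[str, str, str], str],
    synthesizer: Callable[[str, str, str], str],
    judge_panel: Callable[[str, Dict[str, str]], str],
    k: int = 2,
    max_passes: int = 25,
) -> Tuple[str, int, bool]:
    """Run passes until the incumbent wins k times in a row.

    Returns (final incumbent, passes used, converged flag).
    """
    incumbent = draft
    streak = 0  # consecutive wins by the incumbent
    for n in range(1, max_passes + 1):
        critique = critic(task, incumbent)
        revised = author(task, incumbent, critique)
        synthesis = synthesizer(task, incumbent, revised)
        candidates = {"A": incumbent, "B": revised, "AB": synthesis}
        winner = judge_panel(task, candidates)
        if winner == "A":
            streak += 1
            if streak >= k:
                return incumbent, n, True
        else:
            # B or AB displaces the incumbent and resets the streak.
            incumbent = candidates[winner]
            streak = 0
    return incumbent, max_passes, False
```

Passing the roles in as callables keeps each one isolated, which matches the requirement that no role shares context with another.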

Roles

Every role is a fresh, isolated agent with no shared context:

| Role | Input | Output | Key Rule |
|---|---|---|---|
| Critic | Task + Incumbent A | List of problems | Find problems ONLY. No fixes. No suggestions. |
| Author B | Task + A + Critique | Revised version B | Address each criticism. State which problem each change fixes. |
| Synthesizer | Task + X + Y (randomized labels) | Synthesis AB | Take strongest elements of each. Not a compromise. |
| Judge Panel | Task + A, AB, B (randomized labels + order) | Ranking | Rank best to worst. No authorship stake. |

Configuration

| Parameter | Value | Rationale |
|---|---|---|
| Convergence k | 2 | k=1 premature (94% displaced later). k=2 converges 100%, quality plateaus. k=3 fails 24%, 2x cost, no quality gain. |
| Author temperature | 0.7-0.8 | Encourages diverse revisions |
| Judge temperature | 0.3 | Encourages consistent evaluation |
| In-loop judges | 3 | Balance per-pass cost vs evaluation stability |
| Final evaluation judges | 7 | Higher statistical power for final comparison |
| Max tokens | 4096 | Standard; 8192 for long-form (papers) |
| Judge type | Chain-of-thought | 3x faster convergence on some tasks. Always use. |
| Tiebreak | Conservative (incumbent wins) | Prevents false positives: A must be genuinely beaten |
| Max passes | 25 (constrained), 50 (remedy) | Safety cap; most converge by pass 10-15 |
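
One way to carry these settings through a run is a frozen dataclass. This is a sketch only: the class and field names are hypothetical, the defaults come from the configuration values above, and `author_temperature` picks 0.8 from the stated 0.7-0.8 range.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AutoreasonConfig:
    """Hypothetical container for the configuration values listed above."""
    convergence_k: int = 2
    author_temperature: float = 0.8  # source gives a 0.7-0.8 range
    judge_temperature: float = 0.3
    in_loop_judges: int = 3
    final_judges: int = 7
    max_tokens: int = 4096
    cot_judges: bool = True
    tiebreak_incumbent_wins: bool = True
    max_passes: int = 25  # constrained tasks

    def calls_per_pass(self) -> int:
        # Critic + Author B + Synthesizer, plus one call per in-loop judge.
        return 3 + self.in_loop_judges


default_config = AutoreasonConfig()
# Long-form work such as papers raises the token limit to 8192.
paper_config = replace(default_config, max_tokens=8192)
```

Freezing the dataclass prevents settings from changing mid-run, so every pass is judged under the same conditions.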

Prompts

Critic

System: You are a critical reviewer. Your only job is to find real problems. 
Be specific and concrete. Do not suggest fixes.

User: Find real problems with this proposal. Focus on:
- Things that won't work as described
- Complexity that doesn't pay for itself
- Assumptions that are wrong
- Missing pieces
Do NOT propose fixes. Just the problems.

Author B

System: You are a senior consultant revising a proposal based on specific 
criticisms. Address each valid criticism directly. Do not make changes not 
motivated by an identified problem.

User: [TASK] + [VERSION A] + [CRITIC OUTPUT]
Revise to address these problems. For each change, state which problem it fixes.

Synthesizer

System: You are given two versions as equal inputs. Take the strongest elements 
from each and produce a coherent synthesis. This is not a compromise.

User: [TASK] + [VERSION X] + [VERSION Y]
(labels randomized — synthesizer doesn't know which is incumbent)

Judge (Chain-of-Thought) — ALWAYS USE THIS VERSION

System: You are an independent evaluator. Think carefully before deciding.

User: [TASK] + Three proposals. For each, think step by step:
1. What does it get right?
2. What does it get wrong or miss?
3. Are numbers and claims defensible?
4. Is detail appropriate or bloated?
After reasoning, rank all three.
RANKING: [best], [second], [worst]
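
Because the judge ends its reply with a `RANKING:` line, the loop needs a parser that fails loudly rather than guessing. The following is a minimal sketch; `parse_ranking` and its lenient bracket handling are assumptions, not the project's actual parser.

```python
import re
from typing import List, Optional

# Matches a line such as "RANKING: [X], [Y], [Z]" anywhere in the reply.
RANKING_RE = re.compile(r"^\s*RANKING:\s*(.+?)\s*$", re.IGNORECASE | re.MULTILINE)


def parse_ranking(reply: str, labels: List[str]) -> Optional[List[str]]:
    """Return labels ordered best to worst, or None if the reply is invalid."""
    matches = RANKING_RE.findall(reply)
    if not matches:
        return None
    # Use the last RANKING line, since chain-of-thought text may mention
    # the word earlier in the reply.
    parts = [p.strip().strip("[]").strip() for p in matches[-1].split(",")]
    # Every label must appear exactly once.
    if sorted(parts) != sorted(labels):
        return None
    return parts
```

Returning `None` for a malformed reply lets the caller discard or retry that judge instead of silently shrinking the panel.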

Baseline Prompts (for comparison experiments)

| Baseline | Prompt |
|---|---|
| Conservative | "Make minimal improvements while preserving what works. Do not add new sections or significantly expand scope." |
| Improve this | "Improve this document." (no further guidance) |
| Harsh critic | "Critically evaluate and rewrite, fixing all weaknesses you identify." |
| Critique & revise | Step 1: "Produce a structured critique. List specific weaknesses." Step 2: "Revise to address each criticism." |

Scoring: Borda Count

Judges rank candidates. Points awarded by rank position:

| Rank | Points (3 candidates) |
|---|---|
| 1st | 3 |
| 2nd | 2 |
| 3rd | 1 |

Aggregation: Sum across all judges. Winner = highest total. Tiebreak: Incumbent (A) wins any tie.

Example (3 judges):

  • Judge 1: AB > A > B → AB gets 3, A gets 2, B gets 1
  • Judge 2: A > AB > B → A gets 3, AB gets 2, B gets 1
  • Judge 3: AB > B > A → AB gets 3, B gets 2, A gets 1
  • Totals: AB=8, A=6, B=4 → AB wins, becomes new incumbent
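
The aggregation above can be written in a few lines. This is a sketch under one labeled assumption: the source only says the incumbent wins ties, so when a tie involves only challengers this version falls back to the first tied candidate encountered.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def borda_winner(
    rankings: Sequence[Sequence[str]], incumbent: str = "A"
) -> Tuple[str, Dict[str, int]]:
    """Sum Borda points across judges; the incumbent wins any tie it is in."""
    totals: Counter = Counter()
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            totals[candidate] += n - position  # 1st of 3 gets 3 points
    best = max(totals.values())
    leaders = [c for c, score in totals.items() if score == best]
    # Assumption: a tie among challengers only goes to the first one seen.
    winner = incumbent if incumbent in leaders else leaders[0]
    return winner, dict(totals)


winner, totals = borda_winner(
    [["AB", "A", "B"], ["A", "AB", "B"], ["AB", "B", "A"]]
)
# winner == "AB", totals == {"AB": 8, "A": 6, "B": 4}
```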

Randomization per judge:

  • Candidate labels randomized (A might be called "Proposal X" for one judge, "Proposal Z" for another)
  • Presentation order randomized (AB might appear first or last)
  • This prevents position bias and label bias
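
Per-judge blinding can be sketched as two independent shuffles: one for which candidate gets which neutral label, one for presentation order. The function names and the "Proposal X/Y/Z" label set below are illustrative assumptions.

```python
import random
from typing import Dict, List, Tuple


def blind_candidates(
    candidates: Dict[str, str], rng: random.Random
) -> Tuple[List[Tuple[str, str]], Dict[str, str]]:
    """Return (label, text) pairs in shuffled order plus a label-to-key map."""
    labels = ["Proposal X", "Proposal Y", "Proposal Z"][: len(candidates)]
    keys = list(candidates)
    rng.shuffle(keys)  # randomize which candidate receives which label
    label_to_key = dict(zip(labels, keys))
    order = list(labels)
    rng.shuffle(order)  # randomize presentation order separately
    shown = [(label, candidates[label_to_key[label]]) for label in order]
    return shown, label_to_key


def unblind(ranking: List[str], label_to_key: Dict[str, str]) -> List[str]:
    """Translate a judge's ranking of labels back to A / B / AB."""
    return [label_to_key[label] for label in ranking]
```

Calling `blind_candidates` once per judge with a shared `random.Random` instance gives each judge its own labels and order while keeping the whole run reproducible from a single seed.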

Model Selection Guide

Empirical Results by Model Tier

| Model | Autoreason Wins | Autoreason Avg Borda | Best Baseline | Margin | Recommendation |
|---|---|---|---|---|---|
| Llama 3.1 8B | 1/3 | 23.7 | 25.0 (single) | -1.3 | Skip autoreason. Model too weak for diverse candidates. |
| Gemini 2.0 Flash | 2/3 | 25.0 | 20.0 (single) | +5.0 | Good candidate. Moderate gains. |
| Haiku 3.5 | 3/3 | 42.0 | 33.7 (single) | +8.3 | Best candidate. Perfect scores. Baselines actively destroy quality. |
| Sonnet 4 | 3/5 | 27.8 | 22.4 (C&R) | +5.4 | Good candidate for open tasks. C&R better for technical tasks. |
| Sonnet 4.6 (unconstrained) | 0/1 | 7.0 | 31.0 (C&R) | -24.0 | Do NOT use autoreason without constraints. |
| Sonnet 4.6 (constrained) | 2/3 | 29.0 | 27.0 (improve) | +2.0 | Use only with scope constraints. |

The Generation-Evaluation Gap

The core insight: autoreason's value depends on the gap between a model's generation capability and its self-evaluation capability.

Weak models (Llama 8B):
  Generation: Poor  |  Self-evaluation: Poor
  Gap: Small (both bad) → Autoreason can't help, no diverse candidates

Mid-tier models (Haiku, Flash):
  Generation: Decent  |  Self-evaluation: Poor
  Gap: LARGE → Autoreason's sweet spot. External eval bridges the gap.

Strong models (Sonnet 4):
  Generation: Good  |  Self-evaluation: Decent
  Gap: Moderate → Autoreason helps on 3/5 tasks

Frontier models (Sonnet 4.6):
  Generation: Excellent  |  Self-evaluation: Good
  Gap: Small → Simple methods suffice. Autoreason hurts on unconstrained tasks.

Practical rule: As model costs drop and capabilities improve, today's frontier becomes tomorrow's mid-tier. The generation-evaluation gap is structural, not temporary. Match refinement architecture to the model's position on the capability curve.

Judge Selection

| Author Model | Recommended Judge | Rationale |
|---|---|---|
| Llama 8B | Don't use autoreason | Model too weak |
| Gemini Flash | Sonnet 4 | Cross-model evaluation works |
| Haiku 3.5 | Sonnet 4 | Strong external eval is the mechanism |
| Haiku 3.5 | Haiku 3.5 (same) | Still works: tournament structure provides value even without strong judges (20.7 vs 18.3 avg Borda) |
| Sonnet 4 | Sonnet 4 (same) | Same-model judges work at this tier |
| Sonnet 4.6 | Sonnet 4.6 (same) | Only with scope constraints |

Scope Constraint Design

What Makes Autoreason Work on Constrained Tasks

With scope constraints added, the same model (Sonnet 4.6) goes from last place to first place. The constraints bound the improvement space so synthesis drift can't accumulate.

Effective Constraints

| Constraint Type | Example | Why It Works |
|---|---|---|
| Fixed facts | "Use only these 8 data points, add nothing else" | Bounds information space |
| Fixed deliverable | "500-word startup pitch" (not "improve this") | Defines done condition |
| Fixed structure | "Exactly 4 sections, each with 3 numbered items" | Prevents structural drift |
| Fixed change items | "Address exactly these 3 reviewer concerns" | Bounds modification scope |

Ineffective Constraints

| Constraint | Why It Fails | What Happens |
|---|---|---|
| Word count alone | Not a scope constraint | False convergence: rejected for length, not quality |
| "Be concise" | Too vague | Ignored after 2-3 passes |
| "Be comprehensive" | Anti-constraint | Invites scope creep |
| No constraints at all | Unbounded improvement space | Synthesis dominates, no convergence |

Task Categories

| Task Type | Autoreason Works? | Why |
|---|---|---|
| Tasks with genuine tradeoffs (strategy, policy) | Yes | Multiple valid approaches for tournament to select between |
| Constrained writing (pitch, memo, postmortem) | Mostly (2/3) | Bounded scope, clear evaluation criteria |
| Template-filling (incident postmortem) | No | One correct structure, minimal decision space |
| Competitive programming | Yes | Naturally scoped, test suite provides external verification |
| Open-ended unconstrained + frontier model | No | Synthesis drift, no convergence |

Failure Taxonomy

| Failure Mode | Condition | Detection | Evidence |
|---|---|---|---|
| Self-correction unreliable | No external evaluation signal | Baselines degrade below single pass | Haiku baselines: 16.3 avg vs 33.7 single pass |
| Drift / synthesis dominance | Unconstrained scope | A wins <15%, AB dominates | Sonnet 4.6 unconstrained: A wins 12%, AB wins 60%+ |
| Overfitting to visible feedback | Shallow revision loop (C&R) | High public/private divergence | C&R overfits 32% on hard code problems |
| No convergence | Broken judge pipeline | Parsing failures, <3 valid judges | Mixed panel parser failure: 11+ passes |
| Model too weak | Insufficient generation diversity | All candidates look similar | Llama 8B wins only 1/3 tasks |
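
The drift signal (A wins under 15% of passes while AB dominates) can be checked from the per-pass winner history. This is a rough sketch: the 15% floor comes from the detection rule above, while the 50% threshold for "dominates" and the 10-pass minimum are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Sequence


def detect_drift(
    winners: Sequence[str],
    a_floor: float = 0.15,
    ab_dominance: float = 0.5,  # assumption: "dominates" means a majority
    min_passes: int = 10,       # assumption: avoid flagging very short runs
) -> bool:
    """Return True when the winner history looks like synthesis drift."""
    if len(winners) < min_passes:
        return False
    counts = Counter(winners)
    a_share = counts["A"] / len(winners)
    ab_share = counts["AB"] / len(winners)
    return a_share < a_floor and ab_share > ab_dominance
```

When this returns `True`, the recovery is to add scope constraints to the task rather than to keep iterating.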

Recovery Patterns

| Failure | Recovery |
|---|---|
| No convergence (drift) | Add scope constraints to the task |
| No convergence (broken judges) | Fix parser, ensure 3 valid judges before continuing |
| Quality degrades with iteration | Switch to single pass or add constraints |
| Model too weak | Use a stronger model for generation, keep weak model for cheap roles |
| Overfitting (code) | Use structured analysis step, not just test feedback |
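
The "ensure 3 valid judges before continuing" recovery can be enforced with a guard that retries until the panel is full and refuses to proceed otherwise. This is a sketch: `ask_judge` and `parse` are hypothetical callables supplied by the caller, and the retry cap of 6 is an assumption.

```python
from typing import Callable, List, Optional


def collect_valid_rankings(
    ask_judge: Callable[[], str],
    parse: Callable[[str], Optional[List[str]]],
    required: int = 3,
    max_attempts: int = 6,  # assumption: cap on total judge calls per pass
) -> List[List[str]]:
    """Query judges until `required` replies parse; raise if that never happens."""
    valid: List[List[str]] = []
    attempts = 0
    while len(valid) < required and attempts < max_attempts:
        attempts += 1
        ranking = parse(ask_judge())
        if ranking is not None:
            valid.append(ranking)
    if len(valid) < required:
        # Stop rather than score with a short panel, which blocks convergence.
        raise RuntimeError(
            f"only {len(valid)} of {required} judges returned a valid ranking"
        )
    return valid
```

Raising instead of continuing with two judges surfaces the pipeline fault on the pass where it occurs.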

Code Domain Adaptation

The autoreason method adapts differently for code vs writing:

Writing Domain

Call 1: Critic (find problems in incumbent)
Call 2: Author B (revise based on critique)
Call 3: Synthesizer (merge A and B)
Calls 4-6: Judge Panel (3 blind judges rank A, B, AB)

Code Domain (6-call budget)

Call 1: Initial generation
Call 2: Structured analysis (5 points — NO CODE):
  - Problem analysis: what does the problem actually require?
  - Approach analysis: what approach did we use, is it correct?
  - Failure analysis: why did tests fail?
  - Alternative approaches: what else could work?
  - Edge cases: what inputs might break the solution?
Calls 3-6: Reason-informed revisions
  - Each revision must explain WHY it fixes the issue
  - Sees test results from public (visible) test cases

Key difference: The code strategy replaces the judge panel with test-suite evaluation (objective ground truth). The structured analysis step (Call 2) is what drives recovery — it forces reasoning about why the approach failed before attempting fixes.
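
The 6-call budget can be sketched as follows. All four callables are hypothetical stand-ins (three would wrap LLM calls, one would run the public tests), and `run_public_tests` is assumed to return a dict with a boolean `"passed"` key. Stopping early once the public tests pass is also an assumption of this sketch.

```python
from typing import Any, Callable, Dict, Tuple

TestReport = Dict[str, Any]


def solve_with_budget(
    problem: str,
    generate: Callable[[str], str],
    analyze: Callable[[str, str, TestReport], str],
    revise: Callable[[str, str, str, TestReport], str],
    run_public_tests: Callable[[str], TestReport],
    budget: int = 6,
) -> Tuple[str, int]:
    """Return (final code, model calls used) under a fixed call budget."""
    code = generate(problem)  # call 1: initial generation
    calls = 1
    report = run_public_tests(code)  # test runs do not count as model calls
    if report["passed"]:
        return code, calls
    analysis = analyze(problem, code, report)  # call 2: analysis, no code
    calls += 1
    while calls < budget:  # calls 3-6: reason-informed revisions
        code = revise(problem, code, analysis, report)
        calls += 1
        report = run_public_tests(code)
        if report["passed"]:
            break
    return code, calls
```

The analysis is produced once and passed to every revision, so each fix is informed by the reasoning about why the first approach failed and not only by the latest test output.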

Results: Recovery is the mechanism. Among problems where both autoreason and single-pass failed initially, autoreason recovered 62% vs single-pass's 43% (McNemar p=0.041, Cohen's h=0.32).


Applying Autoreason to Paper Writing

The paper itself was refined using autoreason (Section 8 of the paper):

Setup

  • Model: claude-opus-4
  • Judges: 3 Opus judges
  • Enhancement: Ground-truth critic (access to actual experimental data)
  • Result: Converged in 9 passes

Key Findings for Paper Refinement

  1. Ground-truth critic is essential: Without ground-truth access, Opus hallucinated a fabricated ablation study, fake confidence intervals, wrong model names, and incorrect role descriptions. With ground-truth access, the critic caught all four on pass 1.

  2. Judge panel integrity matters: A broken parser in one judge (Gemini output format mismatch) reduced the panel from 3 judges to 2, which prevented convergence for 11+ passes. Once the panel was restored to 3 working judges, the same incumbent converged in 2 passes. A broken judge doesn't add noise; it prevents equilibrium.

Critic prompt: "You are reviewing a research paper draft. You have access to the 
actual experimental results [GROUND TRUTH DATA]. Find factual errors, unsupported 
claims, hallucinated results, and structural problems. Do not suggest fixes."

Author B prompt: "Revise this paper draft to fix the identified problems. For each 
change, cite the specific problem it addresses. Do not add claims not supported by 
the provided experimental data."

Judge prompt (CoT): "Compare three versions of this paper. For each, evaluate:
1. Factual accuracy against the provided results
2. Clarity of the narrative and contribution
3. Whether claims are properly hedged and supported
4. Writing quality (concision, precision, no filler)
After reasoning, rank all three. RANKING: [best], [second], [worst]"

What to Provide as Ground Truth

  • All experimental result JSON files
  • Statistical test outputs
  • Raw numbers for every table and figure
  • Configuration files showing exact hyperparameters
  • Code that generated the results (for method description accuracy)
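
One possible way to hand this material to the critic is to serialize it into the prompt. The sketch below is an assumption about packaging, not the paper's actual pipeline: `build_critic_prompt` and the dict layout are hypothetical, and the template wording follows the critic prompt quoted earlier in this section.

```python
import json
from typing import Any, Mapping

CRITIC_TEMPLATE = (
    "You are reviewing a research paper draft. You have access to the actual "
    "experimental results:\n{ground_truth}\n\n"
    "Find factual errors, unsupported claims, hallucinated results, and "
    "structural problems. Do not suggest fixes.\n\n"
    "DRAFT:\n{draft}"
)


def build_critic_prompt(draft: str, ground_truth: Mapping[str, Any]) -> str:
    """Embed ground-truth data (already loaded from result files) in the prompt."""
    # sort_keys keeps the prompt stable across runs for identical data.
    blob = json.dumps(ground_truth, indent=2, sort_keys=True)
    return CRITIC_TEMPLATE.format(ground_truth=blob, draft=draft)


prompt = build_critic_prompt(
    "Draft text",
    {"avg_borda": {"haiku_3_5": 42.0}},  # placeholder structure for illustration
)
```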

Compute Budget Reference

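| Method | Calls per Pass | Typical Passes | Total Calls | Relative Cost |
|---|---|---|---|---|
| Single pass | 1 | 1 | 1 | 1x |
| Best-of-N | N | 1 | N | Nx |
| Critique & revise | 2 | 15 | 30 | 30x |
| Autoreason (in-loop) | ~6 | 10-15 | 60-90 | 60-90x |
| Autoreason (with final eval) | ~6 + 7 | 10-15 + 1 | 67-97 | ~80x |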

Cost-quality tradeoff: Each autoreason pass costs ~6 calls (versus 1 for single pass and 2 for critique-and-revise), and a run typically takes 10-15 passes. The method trades compute for evaluation quality. On constrained tasks with mid-tier models, this tradeoff is strongly positive. On unconstrained tasks with frontier models, it's negative.
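
The call totals in the budget reference follow from simple arithmetic, which this small helper reproduces. The function name is hypothetical; it assumes one call each for critic, author, and synthesizer plus one call per judge.

```python
def total_calls(passes: int, in_loop_judges: int = 3, final_judges: int = 0) -> int:
    """Estimate model calls for an autoreason run."""
    per_pass = 3 + in_loop_judges  # critic + author B + synthesizer + judges
    return passes * per_pass + final_judges


low = total_calls(10)                             # 60
high = total_calls(15)                            # 90
low_with_final = total_calls(10, final_judges=7)  # 67
high_with_final = total_calls(15, final_judges=7) # 97
```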

CoT judges reduce cost: 1 CoT judge provides evaluation quality comparable to 3 standard judges, at ~40% cost savings. Always use CoT judges.