skills/research/research-paper-writing/references/experiment-patterns.md
Patterns and best practices distilled from running research experiments at scale with the Hermes agent. These cover experiment infrastructure, evaluation protocols, monitoring, and failure recovery.
Organize experiments with a consistent structure:
workspace/
experiments/
run_main.py # Core experiment runner
run_baselines.py # Baseline comparison
run_ablation.py # Ablation studies
strategies.py # Method implementations
config.yaml # Shared configuration
results/
<experiment_name>/
<task_or_problem>/
<strategy>/
result.json # Final metrics
final_output.md # Final output artifact
history.json # Full trajectory/log
pass_01/ # Per-iteration artifacts (if iterative)
intermediate.md
analysis/
analyze_results.py # Statistical analysis
compute_stats.py # Significance tests
make_charts.py # Visualization
paper/
paper.tex # LaTeX source
fig_*.pdf # Generated figures
1. Incremental Saving (Crash Recovery)
Every experiment script should save results after each unit of work, and skip already-completed work on restart:
import json, os
from pathlib import Path
def run_experiment(problems, strategies, output_dir):
for problem in problems:
for strategy in strategies:
result_path = Path(output_dir) / problem["id"] / strategy / "result.json"
if result_path.exists():
print(f"Skipping {problem['id']}/{strategy} (already done)")
continue
# Run the experiment
result = execute_strategy(problem, strategy)
# Save immediately
result_path.parent.mkdir(parents=True, exist_ok=True)
with open(result_path, 'w') as f:
json.dump(result, f, indent=2)
This pattern makes re-runs safe and efficient. If a process crashes at problem 47/150, restarting skips the first 46.
2. Artifact Preservation
Save all intermediate outputs, not just final results. This enables post-hoc analysis without re-running:
def save_pass_artifacts(output_dir, pass_num, artifacts):
"""Save all artifacts from a single pass of an iterative method."""
pass_dir = Path(output_dir) / f"pass_{pass_num:02d}"
pass_dir.mkdir(parents=True, exist_ok=True)
for name, content in artifacts.items():
with open(pass_dir / f"{name}.md", 'w') as f:
f.write(content)
3. Configuration Management
Use YAML configs for reproducibility:
# config.yaml
model: anthropic/claude-sonnet-4-20250514
author_temperature: 0.8
judge_temperature: 0.3
max_tokens: 4096
num_judges: 3
max_passes: 15
convergence_k: 2
import yaml
with open("config.yaml") as f:
config = yaml.safe_load(f)
4. Separation of Concerns
Keep generation, evaluation, and visualization in separate scripts:
| Script | Purpose |
|---|---|
run_experiment.py | Core method execution |
run_baselines.py | Baseline comparisons at same compute |
run_eval.py | Blind evaluation / judge panels |
analyze_results.py | Statistical analysis |
make_charts.py | Figure generation |
This lets you re-run evaluation without re-running expensive generation, and regenerate figures without re-running analysis.
When evaluating subjective outputs (writing, analysis, recommendations), use a blind judge panel:
import random
def run_blind_evaluation(outputs: dict, task_prompt: str, num_judges: int = 7):
"""
Run blind evaluation of multiple method outputs.
Args:
outputs: {"method_name": "output_text", ...}
task_prompt: The original task description
num_judges: Number of independent judge evaluations
"""
rankings = []
for judge_i in range(num_judges):
# Randomize labels and presentation order per judge
methods = list(outputs.keys())
random.shuffle(methods)
labels = {m: chr(65 + i) for i, m in enumerate(methods)} # A, B, C...
# Present to judge with randomized labels
prompt = f"Task: {task_prompt}\n\n"
for method in methods:
prompt += f"--- Proposal {labels[method]} ---\n{outputs[method]}\n\n"
prompt += "Rank all proposals from best to worst. Format: RANKING: [best], [second], [worst]"
ranking = call_judge(prompt)
rankings.append({"labels": labels, "ranking": ranking})
# Aggregate via Borda count
return compute_borda(rankings)
def compute_borda(rankings, n_methods=3):
"""Borda count: 3/2/1 points for 1st/2nd/3rd."""
scores = {}
points = {0: n_methods, 1: n_methods - 1, 2: n_methods - 2} # Adjust for n_methods
for r in rankings:
for position, method in enumerate(r["ranking"]):
scores[method] = scores.get(method, 0) + points.get(position, 0)
return scores
Key design decisions:
For tasks with ground-truth evaluation (code, math, factual):
import subprocess
def evaluate_code(solution: str, test_cases: list, timeout: int = 30):
"""Run code solution against test cases with sandboxed execution."""
results = {"public": [], "private": []}
for test in test_cases:
try:
proc = subprocess.run(
["python3", "-c", solution],
input=test["input"],
capture_output=True,
timeout=timeout,
text=True
)
actual = proc.stdout.strip()
expected = test["expected"].strip()
passed = actual == expected
except subprocess.TimeoutExpired:
passed = False
category = "public" if test.get("public") else "private"
results[category].append(passed)
return {
"public_pass_rate": sum(results["public"]) / max(len(results["public"]), 1),
"private_pass_rate": sum(results["private"]) / max(len(results["private"]), 1),
}
Always compare methods at equal compute budget. If your method uses N API calls, baselines get N calls too:
| Method | Call Budget | Allocation |
|---|---|---|
| Single pass | 6 calls | 6 independent generations |
| Critique & revise | 6 calls | 1 generate + 5 revise rounds |
| Autoreason | 6 calls | 1 generate + 1 analysis + 4 revisions |
| Best-of-N | 6 calls | 6 independent, pick best on public test |
Many ML/NLP papers require human evaluation, especially for subjective tasks (text generation, summarization, dialogue, creative writing). Poorly designed human evals are a common rejection reason.
| Task Type | Required? | Notes |
|---|---|---|
| Text generation (open-ended) | Yes | LLM-as-judge alone is insufficient for acceptance at ACL/EMNLP |
| Summarization | Usually | At minimum for a subset of outputs |
| Dialogue systems | Yes | User studies or annotation |
| Code generation | No | Test suites are objective ground truth |
| Classification | No | Standard metrics suffice |
| Any task with subjective quality | Strongly recommended | Strengthens the paper significantly |
Human Evaluation Protocol:
1. Define the evaluation dimensions (fluency, relevance, factual accuracy, etc.)
2. Create annotation guidelines with examples of each score level
3. Run a pilot with 2-3 annotators on 20-30 examples
4. Compute pilot inter-annotator agreement — if low, revise guidelines
5. Run full evaluation
6. Report: annotator count, agreement metrics, compensation, time per item
Evaluation dimensions (pick relevant subset):
| Dimension | Definition | Scale |
|---|---|---|
| Fluency | Grammaticality and naturalness | 1-5 Likert |
| Relevance | Does it address the task? | 1-5 Likert |
| Factual accuracy | Are stated facts correct? | Binary or 1-5 |
| Coherence | Logical flow and consistency | 1-5 Likert |
| Informativeness | Does it provide useful information? | 1-5 Likert |
| Overall preference | Which output is better? | A/B/Tie (pairwise) |
Pairwise comparison (preferred over absolute scoring — more reliable):
Always report agreement metrics. Without them, reviewers assume your annotations are unreliable.
# Krippendorff's alpha (preferred — handles missing data, any scale)
# pip install krippendorffs-alpha
import krippendorff
# Ratings: rows = annotators, columns = items, values = scores
ratings = [
[3, 4, 1, 2, 5, None, 3], # Annotator 1
[3, 5, 1, 3, 5, 2, 3], # Annotator 2
[4, 4, 2, 2, 4, 2, None], # Annotator 3
]
alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.3f}")
# Interpretation: >0.80 good, 0.67-0.80 acceptable, <0.67 questionable
# Cohen's kappa (for exactly 2 annotators, categorical data)
from sklearn.metrics import cohen_kappa_score
annotator_1 = [1, 2, 3, 1, 2, 3, 2]
annotator_2 = [1, 2, 2, 1, 3, 3, 2]
kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.3f}")
# Interpretation: >0.80 excellent, 0.60-0.80 substantial, 0.40-0.60 moderate
| Metric | When to Use | Annotators | Scale |
|---|---|---|---|
| Krippendorff's alpha | Default choice | Any number | Any (ordinal, nominal, ratio) |
| Cohen's kappa | 2 annotators, categorical | Exactly 2 | Nominal/ordinal |
| Fleiss' kappa | 3+ annotators, categorical | 3+ | Nominal |
| Pearson/Spearman | Continuous scores | 2 | Interval/ratio |
| Platform | Best For | Cost | Quality |
|---|---|---|---|
| Prolific | Academic research, higher quality | $8-15/hr | High — academic participant pool |
| MTurk | Large-scale, fast turnaround | $2-10/hr | Variable — use qualifications |
| Surge AI | NLP-specific annotations | Premium | High — trained annotators |
| Expert annotators | Domain-specific (medical, legal) | Highest | Highest — but slow |
Ethics requirements:
Human Evaluation Section Checklist:
- [ ] Number of annotators
- [ ] Annotator qualifications / recruitment method
- [ ] Number of items evaluated
- [ ] Evaluation dimensions with definitions
- [ ] Scale used (Likert, pairwise, binary)
- [ ] Inter-annotator agreement (Krippendorff's alpha or Cohen's kappa)
- [ ] Compensation rate
- [ ] Time per annotation item
- [ ] Whether annotators saw model identities (should be blind)
- [ ] Randomization of presentation order
| Test | When to Use | Python |
|---|---|---|
| McNemar's test | Comparing two methods on same problems | scipy.stats.binomtest for small n |
| Two-proportion z-test | Comparing success rates | Custom or statsmodels |
| Fisher's exact test | Small sample pairwise comparison | scipy.stats.fisher_exact |
| Bootstrapped CI | Confidence intervals for any metric | Custom bootstrap |
| Cohen's h | Effect size for proportions | Manual calculation |
import numpy as np
from scipy import stats
from pathlib import Path
import json
def load_all_results(results_dir):
"""Load all results into a structured format."""
results = {}
for result_file in Path(results_dir).rglob("result.json"):
parts = result_file.relative_to(results_dir).parts
if len(parts) >= 3:
experiment, task, strategy = parts[0], parts[1], parts[2]
data = json.loads(result_file.read_text())
results.setdefault(experiment, {}).setdefault(strategy, {})[task] = data
return results
def pairwise_mcnemar(method_a_results, method_b_results):
"""McNemar's test for paired binary outcomes."""
a_win_b_lose = sum(1 for a, b in zip(method_a_results, method_b_results) if a and not b)
b_win_a_lose = sum(1 for a, b in zip(method_a_results, method_b_results) if b and not a)
n = a_win_b_lose + b_win_a_lose
if n < 25:
# Use exact binomial for small samples
result = stats.binomtest(a_win_b_lose, n, 0.5)
p_value = result.pvalue
else:
# Chi-squared approximation
chi2 = (abs(a_win_b_lose - b_win_a_lose) - 1)**2 / (a_win_b_lose + b_win_a_lose)
p_value = 1 - stats.chi2.cdf(chi2, df=1)
return {
"a_wins": a_win_b_lose,
"b_wins": b_win_a_lose,
"n_discordant": n,
"p_value": p_value,
"significant": p_value < 0.05
}
def bootstrap_ci(data, n_bootstrap=10000, ci=0.95):
"""Bootstrap confidence interval for mean."""
means = []
for _ in range(n_bootstrap):
sample = np.random.choice(data, size=len(data), replace=True)
means.append(np.mean(sample))
lower = np.percentile(means, (1 - ci) / 2 * 100)
upper = np.percentile(means, (1 + ci) / 2 * 100)
return {"mean": np.mean(data), "ci_lower": lower, "ci_upper": upper}
def cohens_h(p1, p2):
"""Cohen's h effect size for two proportions."""
return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))
Always include in the paper:
For each experiment batch, create a monitoring prompt:
Check the status of the [EXPERIMENT_NAME] experiment:
1. Process check: ps aux | grep [PROCESS_PATTERN]
2. Log check: tail -30 [LOG_FILE]
3. Results check: ls [RESULT_DIR]/eval/ (or appropriate result location)
4. If results are available:
- Read the result JSON files
- Report metrics in a table (Borda scores, accuracy, etc.)
- Compute key comparisons between methods
5. If all experiments in this batch are complete:
- git add -A && git commit -m "[COMMIT_MESSAGE]" && git push
- Report final summary
6. Key question: [SPECIFIC ANALYTICAL QUESTION]
If nothing has changed since the last check, respond with [SILENT].
## Code Experiments (Haiku 3.5) - COMPLETE
| Strategy | Pass Rate (150 problems) | vs Single |
|----------|------------------------|-----------|
| single_pass | 38.0% | — |
| critique_revise | 35.2% | -2.8pp |
| **autoreason** | **40.0%** | **+2.0pp** |
| best_of_6 | 31.0% | -7.0pp |
Key finding: Autoreason shows +2pp improvement over single pass, while
best-of-6 collapses due to single-public-test selection issue.
Committed: `git commit -m "Add Haiku code results (150 problems, 4 strategies)"`
Next: Run significance tests on these results.
| Failure | Detection | Recovery |
|---|---|---|
| API credit exhaustion | 402 errors in logs, incomplete results | Top up credits, re-run (skips completed work automatically) |
| Rate limiting | 429 errors, slow progress | Add retry logic with exponential backoff |
| Process crash | PID gone, log stops mid-problem | Re-run script (resumes from last checkpoint) |
| Wrong model ID | Model not found errors | Fix ID (e.g., claude-opus-4-6 not claude-opus-4.6) |
| Parallel slowdown | Each experiment taking 2x longer | Reduce parallel experiments to 2-3 max |
| Security scan blocks | Commands blocked by security | Use execute_code instead of piped terminal commands |
| Delegation failures | delegate_task returns errors | Fall back to doing work directly |
| Timeout on hard problems | Process stuck, no log progress | Kill, skip problem, note in results |
| Dataset path mismatch | File not found errors | Verify paths before launching |
When re-running failed experiments, use a suffix to track rounds:
logs/experiment_haiku_0_50.log # Round 1
logs/experiment_haiku_0_50_r2.log # Round 2 (after credit exhaustion)
logs/experiment_haiku_0_50_r3.log # Round 3 (after bug fix)
Before launching any experiment batch:
Pre-Flight:
- [ ] API credits sufficient for estimated calls
- [ ] Model IDs correct (test with 1 problem first)
- [ ] Output directory exists and is writable
- [ ] Resume logic works (re-run won't overwrite existing results)
- [ ] Log file path is unique (won't overwrite previous logs)
- [ ] Dataset/task files are accessible
- [ ] Config matches intended experiment
Design tasks that have clear objectives but subjective quality:
# Task: [Title]
## Context
[Specific scenario with concrete details: company size, constraints, timeline]
## Deliverable
[Exact format and structure required]
## Requirements
- [Specific, measurable requirements]
- [Not vague — "be comprehensive" is bad, "include exactly 6 sections" is good]
Constrained tasks test whether methods respect scope boundaries. Design with:
Do NOT use word count as a scope constraint. Word limits cause false convergence — outputs get rejected for length, not quality. Constrain scope (what to include) not length.
| Bad Constraint | Why | Good Constraint |
|---|---|---|
| "Max 500 words" | Judges reject for length | "Exactly 4 sections, each with 3 numbered items" |
| "Be concise" | Too vague | "Each prohibition must reference a specific base fact" |
| "Improve this" | Unbounded scope | "Write a 600-word incident postmortem with this exact structure" |
| "Make it better" | No clear criterion | "Address exactly these 3 reviewer concerns" |
Install SciencePlots for publication-ready defaults:
pip install SciencePlots matplotlib numpy
Option A: SciencePlots styles (recommended — handles most defaults automatically):
import matplotlib.pyplot as plt
import scienceplots # registers the styles
# Pick a style:
# 'science' — clean, serif fonts, suitable for most venues
# 'science+ieee' — IEEE-style (good for two-column papers)
# 'science+nature' — Nature-style
# Add 'no-latex' if LaTeX is not installed on the machine generating plots
with plt.style.context(['science', 'no-latex']):
fig, ax = plt.subplots(figsize=(3.5, 2.5)) # single-column width
# ... plot ...
fig.savefig('paper/fig_results.pdf', bbox_inches='tight')
Option B: Manual rcParams (when you need full control):
import matplotlib.pyplot as plt
plt.rcParams.update({
'font.size': 10,
'font.family': 'serif',
'axes.labelsize': 11,
'axes.titlesize': 11,
'xtick.labelsize': 9,
'ytick.labelsize': 9,
'legend.fontsize': 9,
'figure.figsize': (3.5, 2.5), # single-column default
'figure.dpi': 300,
'savefig.dpi': 300,
'savefig.bbox': 'tight',
'savefig.pad_inches': 0.05,
'axes.linewidth': 0.8,
'lines.linewidth': 1.5,
'lines.markersize': 5,
'axes.grid': True,
'grid.alpha': 0.3,
'grid.linewidth': 0.5,
})
| Use Case | figsize | Notes |
|---|---|---|
| Single column | (3.5, 2.5) | Fits in one column of two-column layout |
| Double column | (7.0, 3.0) | Spans full page width |
| Square (heatmap, confusion matrix) | (3.5, 3.5) | Single column |
| Tall single (many rows) | (3.5, 5.0) | Use sparingly |
Use this palette for all paper figures. It is distinguishable by people with all common forms of color vision deficiency:
COLORS = {
'blue': '#0072B2',
'orange': '#E69F00',
'green': '#009E73',
'red': '#D55E00',
'purple': '#CC79A7',
'cyan': '#56B4E9',
'yellow': '#F0E442',
'black': '#000000',
}
# As a list for cycling:
COLOR_CYCLE = ['#0072B2', '#D55E00', '#009E73', '#E69F00', '#CC79A7', '#56B4E9']
Also differentiate lines by marker and linestyle, not just color:
STYLES = [
{'color': '#0072B2', 'marker': 'o', 'linestyle': '-'},
{'color': '#D55E00', 'marker': 's', 'linestyle': '--'},
{'color': '#009E73', 'marker': '^', 'linestyle': '-.'},
{'color': '#E69F00', 'marker': 'D', 'linestyle': ':'},
]
import matplotlib.pyplot as plt
import numpy as np
try:
import scienceplots
style = ['science', 'no-latex']
except ImportError:
style = 'default'
with plt.style.context(style):
methods = ['Single Pass', 'Critique+Revise', 'Best-of-N', 'Ours']
scores = [73.2, 74.1, 68.5, 77.0]
errors = [2.1, 1.8, 3.2, 1.5]
colors = ['#56B4E9', '#E69F00', '#CC79A7', '#0072B2']
fig, ax = plt.subplots(figsize=(3.5, 2.5))
bars = ax.bar(methods, scores, yerr=errors, capsize=3,
color=colors, edgecolor='black', linewidth=0.5)
# Highlight "Ours"
bars[-1].set_edgecolor('#0072B2')
bars[-1].set_linewidth(1.5)
ax.set_ylabel('Pass Rate (%)')
ax.set_ylim(60, 85)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
fig.savefig('paper/fig_comparison.pdf', bbox_inches='tight')
with plt.style.context(style):
fig, ax = plt.subplots(figsize=(3.5, 2.5))
passes = np.arange(1, 16)
ours = [65, 72, 78, 82, 85, 87, 88, 89, 89.5, 90, 90, 90, 90, 90, 90]
baseline = [65, 68, 70, 71, 69, 67, 66, 65, 64, 63, 62, 61, 60, 59, 58]
ax.plot(passes, ours, **STYLES[0], label='Ours', markersize=4)
ax.plot(passes, baseline, **STYLES[1], label='Critique+Revise', markersize=4)
# Mark convergence point
ax.axvline(x=10, color='gray', linestyle=':', alpha=0.5, linewidth=0.8)
ax.annotate('Converged', xy=(10, 90), fontsize=8, ha='center',
xytext=(10, 93), arrowprops=dict(arrowstyle='->', color='gray'))
ax.set_xlabel('Iteration')
ax.set_ylabel('Quality Score')
ax.legend(loc='lower right')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
fig.savefig('paper/fig_trajectory.pdf', bbox_inches='tight')
fig.savefig('fig.pdf') — vector graphics, sharp at any zoom| Comparison Type | Chart | Notes |
|---|---|---|
| Method vs method | Grouped bar chart | Include error bars |
| Across model sizes | Line chart with CI bands | Log scale for model size axis |
| Ablation study | Stacked/grouped bar | Highlight removed component |
| Trajectory/convergence | Line chart over iterations | Show winner per iteration |
| Per-task breakdown | Heatmap or grouped bar | Show variance across tasks |