doc/plans/2026-03-13-agent-evals-framework.md
Date: 2026-03-13
We need evals for the thing Paperclip actually ships:
We do not primarily need a fine-tuning pipeline. We need a regression framework that can answer:
This plan is based on:
- doc/GOAL.md
- doc/PRODUCT.md
- doc/SPEC-implementation.md
- docs/agents-runtime.md
- doc/plans/2026-03-13-TOKEN-OPTIMIZATION-PLAN.md

Paperclip should take a two-stage approach:
So the recommendation is no longer “skip Promptfoo.” It is:
More specifically:
- The long-term home should be a first-party harness in the evals/ directory.
- v0 should use Promptfoo to run focused test cases across models and providers.

A bundle is:
That is the right unit because that is what actually changes behavior in Paperclip.
Prompt-only tools are useful, but Paperclip’s real failure modes are often:
Those are control-plane behaviors. They require scenario setup, execution, and trace inspection.
The existing monorepo already uses:
- pnpm
- tsx
- vitest

A TypeScript-first harness will fit the repo and CI better than introducing a Python-first test subsystem as the default path.
Python can stay optional later for specialty scorers or research experiments.
OpenAI’s guidance is directionally right:
But OpenAI’s Evals API is not the right control plane for Paperclip as the primary system because our target is explicitly multi-model and multi-provider.
The current tradeoff:
The community suggestion is directionally right:
That makes it the best v0 tool for “did this prompt/skill/model change obviously regress?”
But Paperclip should still avoid making a hosted platform or a third-party config format the core abstraction before we have our own stable eval model.
The right move is:
We should split evals into four layers.
These should require no judge model.
Examples:
These are cheap, reliable, and should be the first line of defense.
These test narrow behaviors in isolation.
Examples:
These are the closest thing to prompt evals, but still framed in Paperclip terms.
These run a full heartbeat or short sequence of heartbeats against a seeded scenario.
Examples:
These should evaluate both final state and trace quality.
These are not “did the answer look good?” evals. They are “did we preserve quality while improving cost/latency?” evals.
Examples:
This layer is especially important for token optimization work.
Each EvalCase should define:
Suggested shape:
```ts
type EvalCase = {
  id: string;
  description: string;
  tags: string[];
  setup: {
    fixture: string;
    agentId: string;
    trigger: "assignment" | "timer" | "on_demand" | "comment" | "approval";
  };
  inputs?: Record<string, unknown>;
  checks: {
    hard: HardCheck[];
    rubric?: RubricCheck[];
    pairwise?: PairwiseCheck[];
  };
  metrics: MetricSpec[];
};
```
The important part is that the case is about a Paperclip scenario, not a standalone prompt string.
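The check and metric types referenced above are not defined anywhere yet; a minimal sketch of what they could look like, assuming small discriminated unions (all names and fields here are placeholders, not a decided API):

```ts
// Hypothetical companions to EvalCase; field names are illustrative only.
type HardCheck =
  | { kind: "tool_called"; tool: string; atMost?: number }
  | { kind: "state_equals"; path: string; expected: unknown }
  | { kind: "no_forbidden_action"; tools: string[] };

type RubricCheck = {
  id: string;
  question: string; // narrow yes/no or 0-2 criterion, not "how good was this?"
  scale: 1 | 2;     // maximum score; the judge returns an integer in [0, scale]
};

type PairwiseCheck = {
  id: string;
  aspect: string;   // e.g. "progress update clarity"
};

type MetricSpec = {
  id: string;                     // e.g. "total_tokens", "tool_calls", "wall_clock_ms"
  source: "trace" | "provider";
};
```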
Suggested shape for an EvalBundle:
```ts
type EvalBundle = {
  id: string;
  adapter: string;
  model: string;
  promptTemplate: string;
  bootstrapPromptTemplate?: string;
  skills: string[];
  flags?: Record<string, string | number | boolean>;
};
```
Every comparison run should say which bundle was tested.
This avoids the common mistake of saying “model X is better” when the real change was model + prompt + skills + runtime behavior.
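For illustration, a baseline and a candidate bundle using the shape above might look like this; the adapter, model, skill, and flag values are hypothetical, and only the bundle ids echo the commands suggested later in this plan:

```ts
// Illustrative bundle definitions; "default-model" and the skill names are placeholders.
const baseline: EvalBundle = {
  id: "baseline/codex-default",
  adapter: "codex",
  model: "default-model",
  promptTemplate: "prompts/heartbeat-default.md",
  skills: ["issues", "comments", "approvals"],
};

const candidate: EvalBundle = {
  id: "experiments/codex-lean-skillset",
  adapter: "codex",
  model: "default-model",
  promptTemplate: "prompts/heartbeat-lean.md",
  skills: ["issues", "comments"],      // the reduced skill set is the change under test
  flags: { compactHistory: true },
};
```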
We should capture a normalized EvalTrace for scoring:
The scorer layer should never need to scrape ad hoc logs.
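The exact trace fields are still open; one possible shape, assuming we at least need the step sequence, tool calls, token counts, and final scenario state (field names are assumptions):

```ts
// Hypothetical normalized trace; scorers read this instead of raw logs.
type EvalTrace = {
  caseId: string;
  bundleId: string;
  steps: Array<{
    role: "assistant" | "tool" | "system";
    toolName?: string;                   // set when the step is a tool call
    tokensIn: number;
    tokensOut: number;
  }>;
  finalState: Record<string, unknown>;   // scenario state snapshot after the run
  wallClockMs: number;
  error?: string;                        // populated when the run failed outright
};
```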
Every eval should start with pass/fail checks that can invalidate the run immediately.
Examples:
If a hard check fails, the scenario fails regardless of style or judge score.
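As an example of how a hard check could be evaluated against the hypothetical EvalTrace above, here is one illustrative invariant, "no restricted tool call before an approval request"; the tool names are placeholders:

```ts
// Illustrative hard check: a restricted tool must not run before an approval request.
function approvalPrecedesRestrictedTools(
  trace: EvalTrace,
  restrictedTools: string[],
): boolean {
  let approvalRequested = false;
  for (const step of trace.steps) {
    if (step.toolName === "request_approval") approvalRequested = true;
    if (step.toolName && restrictedTools.includes(step.toolName) && !approvalRequested) {
      return false; // hard failure: invalidates the run regardless of judge scores
    }
  }
  return true;
}
```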
Rubric scoring should use narrow criteria, not vague “how good was this?” prompts.
Good rubric dimensions:
Each rubric should be a small 0-1 or 0-2 decision, not a mushy 1-10 scale.
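For example, a progress-update case might carry rubric criteria like these (the wording is illustrative, not a settled rubric):

```ts
// Narrow, anchored rubric criteria; each one is a 0-1 decision.
const progressUpdateRubrics: RubricCheck[] = [
  { id: "mentions_blockers", question: "Does the update state whether anything is blocked?", scale: 1 },
  { id: "grounded_in_issue", question: "Does the update only reference work that exists in the issue thread?", scale: 1 },
];
```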
OpenAI’s eval guidance is right that LLMs are better at discrimination than open-ended generation.
So for non-deterministic quality checks, the default pattern should be pairwise judging: show the judge the baseline output and the candidate output and have it pick baseline, candidate, or tie.

This is better than asking a judge for an absolute quality score with no anchor.
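A minimal sketch of that pattern, assuming one verdict per aspect; the prompt wording and label mapping are placeholders, and the real pairwise-judge.ts should also shuffle presentation order to reduce position bias:

```ts
type PairwiseVerdict = "baseline" | "candidate" | "tie";

// Illustrative prompt assembly; assumes Output A is the baseline and Output B is the candidate.
function buildPairwisePrompt(aspect: string, baselineOutput: string, candidateOutput: string): string {
  return [
    `Compare two agent outputs on: ${aspect}.`,
    `Output A:\n${baselineOutput}`,
    `Output B:\n${candidateOutput}`,
    `Answer with exactly one word: A, B, or tie.`,
  ].join("\n\n");
}

// Maps the judge's raw answer back onto the anchored verdict (A = baseline, B = candidate).
function parseVerdict(raw: string): PairwiseVerdict {
  const answer = raw.trim().toLowerCase();
  if (answer.startsWith("a")) return "baseline";
  if (answer.startsWith("b")) return "candidate";
  return "tie";
}
```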
Do not bury efficiency inside a single blended quality score.
Record it separately:
Then compute a summary decision such as:
That is much easier to reason about than one magic number.
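One way the summary decision could be computed, assuming per-bundle aggregates for quality and token usage (the thresholds are placeholders, not agreed policy):

```ts
// Illustrative gate: quality must not regress beyond a small tolerance,
// and efficiency wins are reported separately rather than blended in.
type RunSummary = { qualityScore: number; totalTokens: number };

function compareBundles(baseline: RunSummary, candidate: RunSummary) {
  const qualityDelta = candidate.qualityScore - baseline.qualityScore;
  const tokenDelta = (candidate.totalTokens - baseline.totalTokens) / baseline.totalTokens;
  const verdict =
    qualityDelta < -0.02 ? "quality regression"
    : tokenDelta < -0.1 ? "accept: cheaper at comparable quality"
    : "neutral";
  return { qualityDelta, tokenDelta, verdict };
}
```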
For PR gating:
For deeper comparison reports, show:
We should explicitly build the dataset from three sources.
Start here.
These should cover core product invariants:
These are small, clear, and stable.
Per OpenAI’s guidance, we should log everything and mine real usage for eval cases.
Paperclip should grow eval coverage by promoting real runs into cases when we see:
The initial version can be manual: review the real run, trim the trace, and hand-write the corresponding EvalCase.

Later we can automate trace-to-case generation.
These should intentionally probe failure modes:
This is where promptfoo-style red-team ideas can become useful later, but it is not the first slice.
Recommended initial layout:
```
evals/
  README.md
  promptfoo/
    promptfooconfig.yaml
    prompts/
    cases/
  cases/
    core/
    approvals/
    delegation/
    efficiency/
  fixtures/
    companies/
    issues/
  bundles/
    baseline/
    experiments/
  runners/
    scenario-runner.ts
    compare-runner.ts
  scorers/
    hard/
    rubric/
    pairwise/
  judges/
    rubric-judge.ts
    pairwise-judge.ts
  lib/
    types.ts
    traces.ts
    metrics.ts
  reports/
    .gitignore
```
Why top-level evals/:
- eval assets do not belong in server/ even though they span adapters and runtime behavior
- one home for the v0 config plus the later first-party runner

The harness should support three modes.
Purpose:
Characteristics:
Purpose:
Characteristics:
Purpose:
Characteristics:
Suggested commands:
```sh
pnpm evals:smoke
pnpm evals:compare --baseline baseline/codex-default --candidate experiments/codex-lean-skillset
pnpm evals:nightly
```
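At a high level, the compare runner could reduce to a loop like this (a sketch only; runScenario and scorePair stand in for the real scenario-runner and scorer wiring, which this plan does not pin down):

```ts
// Hypothetical core loop of evals/runners/compare-runner.ts.
type RunScenario = (evalCase: EvalCase, bundleId: string) => Promise<EvalTrace>;
type ScorePair = (evalCase: EvalCase, baseline: EvalTrace, candidate: EvalTrace) => unknown;

async function runCompare(
  baselineId: string,
  candidateId: string,
  cases: EvalCase[],
  runScenario: RunScenario,
  scorePair: ScorePair,
) {
  const results: unknown[] = [];
  for (const evalCase of cases) {
    const baselineTrace = await runScenario(evalCase, baselineId);
    const candidateTrace = await runScenario(evalCase, candidateId);
    results.push(scorePair(evalCase, baselineTrace, candidateTrace));
  }
  return results;
}
```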
PR behavior:
- evals:smoke on prompt/skill/adapter/runtime changes
- evals:compare for labeled PRs or manual runs

Nightly behavior:
Best use for Paperclip:
What changed in this recommendation:
Why it still should not be the only long-term system:
Recommendation: use it for the v0 slice, scoped to evals/promptfoo/.

What it gets right:
Why not the primary system today:
Recommendation:
What it gets right:
Why not the primary system today:
Recommendation:
What it gets right:
Why not the primary system:
Recommendation:
The first version should be intentionally small.
Build:
- evals/promptfoo/promptfooconfig.yaml (sketched below)

Target scope:
Success criteria:
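To make the v0 slice concrete, a minimal sketch of what evals/promptfoo/promptfooconfig.yaml could look like; the providers, prompt file, variables, and assertions are illustrative values, not decided choices:

```yaml
# Illustrative v0 config; every concrete value below is a placeholder.
description: "Paperclip v0 prompt/model comparisons"
prompts:
  - file://prompts/assignment-pickup.txt
providers:
  - openai:gpt-4o-mini
  - anthropic:messages:claude-3-5-sonnet-20241022
tests:
  - description: "agent acknowledges a new assignment"
    vars:
      issue_title: "Fix login redirect loop"
    assert:
      - type: contains
        value: "redirect"
      - type: llm-rubric
        value: "The reply acknowledges the assignment and states a concrete next step."
```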
Build:
- evals/ scaffold
- EvalCase, EvalBundle, EvalTrace types

Target cases:
Success criteria:
- v0 cases either migrate into or coexist with this layer cleanly

Build:
Success criteria:
Build:
Dependency:
- 2026-03-13-TOKEN-OPTIMIZATION-PLAN.md

Success criteria:
Build:
Success criteria:
We should start with these categories:
- core.assignment_pickup
- core.progress_update
- core.blocked_reporting
- governance.approval_required
- governance.company_boundary
- delegation.correct_report
- threads.long_context_followup
- efficiency.no_unnecessary_reloads

That is enough to start catching the classes of regressions we actually care about.
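As a concrete illustration of one of these, governance.approval_required might be expressed with the EvalCase shape above (the fixture, agent id, tool names, and criteria are hypothetical):

```ts
// Hypothetical case definition; reuses the EvalCase and check shapes sketched earlier.
const approvalRequired: EvalCase = {
  id: "governance.approval_required",
  description: "Agent must request approval before taking a restricted action.",
  tags: ["governance", "smoke"],
  setup: {
    fixture: "companies/acme-basic",
    agentId: "agent-ops",
    trigger: "assignment",
  },
  checks: {
    hard: [{ kind: "tool_called", tool: "request_approval" }],
    rubric: [
      { id: "explains_request", question: "Does the approval request say what is being requested and why?", scale: 1 },
    ],
  },
  metrics: [{ id: "total_tokens", source: "trace" }],
};
```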
Every important scenario needs deterministic checks first.
Use pass/fail invariants plus a small number of stable rubric or pairwise checks.
The suite must keep growing from real runs, otherwise it will become a toy benchmark.
Trajectory matters for agents:
Our eval model should survive changes in:
Should the first scenario runner invoke the real server over HTTP, or call services directly in-process? My recommendation: start in-process for speed, then add HTTP-mode coverage once the model stabilizes.
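If we want those two modes to stay swappable, the runner could depend on a small environment interface rather than either transport directly (a sketch; the method names are assumptions):

```ts
// Hypothetical seam so scenario cases do not care whether they hit the real
// HTTP server or call services in-process.
interface ScenarioEnvironment {
  seed(fixture: string): Promise<void>;
  trigger(agentId: string, kind: "assignment" | "timer" | "on_demand" | "comment" | "approval"): Promise<void>;
  collectTrace(): Promise<EvalTrace>;
  teardown(): Promise<void>;
}
```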
Should we support Python scorers in v1? My recommendation: no. Keep v1 all-TypeScript.
Should we commit baseline outputs? My recommendation: commit case definitions and bundle definitions, but keep run artifacts out of git.
Should we add hosted experiment tracking immediately? My recommendation: no. Revisit after the local harness proves useful.
Start with Promptfoo for immediate, narrow model-and-prompt comparisons, then grow into a first-party evals/ framework in TypeScript that evaluates Paperclip scenarios and bundles, not just prompts.
Use this structure:
- evals/promptfoo/ for the v0 bootstrap

Use external tools selectively:
But keep the canonical eval model inside the Paperclip repo and aligned to Paperclip’s actual control-plane behaviors.