doc/plans/2026-03-13-agent-evals-framework.md
Date: 2026-03-13
We need evals for the thing Paperclip actually ships:
We do not primarily need a fine-tuning pipeline. We need a regression framework that can answer:
This plan is based on:
- doc/GOAL.md
- doc/PRODUCT.md
- doc/SPEC-implementation.md
- docs/agents-runtime.md
- doc/plans/2026-03-13-TOKEN-OPTIMIZATION-PLAN.md

Paperclip should take a two-stage approach:
So the recommendation is no longer “skip Promptfoo.” It is:
More specifically:
- The long-term home should be a first-party harness in the evals/ directory.
- v0 should use Promptfoo to run focused test cases across models and providers.

A bundle is:
That is the right unit because that is what actually changes behavior in Paperclip.
Prompt-only tools are useful, but Paperclip’s real failure modes are often:
Those are control-plane behaviors. They require scenario setup, execution, and trace inspection.
The existing monorepo already uses:
- pnpm
- tsx
- vitest

A TypeScript-first harness will fit the repo and CI better than introducing a Python-first test subsystem as the default path.
Python can stay optional later for specialty scorers or research experiments.
OpenAI’s guidance is directionally right:
But OpenAI’s Evals API is not the right control plane for Paperclip as the primary system because our target is explicitly multi-model and multi-provider.
The current tradeoff:
The community suggestion is directionally right:
That makes it the best v0 tool for “did this prompt/skill/model change obviously regress?”
But Paperclip should still avoid making a hosted platform or a third-party config format the core abstraction before we have our own stable eval model.
The right move is:
We should split evals into four layers.
These should require no judge model.
Examples:
These are cheap, reliable, and should be the first line of defense.
These test narrow behaviors in isolation.
Examples:
These are the closest thing to prompt evals, but still framed in Paperclip terms.
These run a full heartbeat or short sequence of heartbeats against a seeded scenario.
Examples:
These should evaluate both final state and trace quality.
These are not “did the answer look good?” evals. They are “did we preserve quality while improving cost/latency?” evals.
Examples:
This layer is especially important for token optimization work.
Each EvalCase should define:
Suggested shape:
```ts
type EvalCase = {
  id: string;
  description: string;
  tags: string[];
  setup: {
    fixture: string;
    agentId: string;
    trigger: "assignment" | "timer" | "on_demand" | "comment" | "approval";
  };
  inputs?: Record<string, unknown>;
  checks: {
    hard: HardCheck[];
    rubric?: RubricCheck[];
    pairwise?: PairwiseCheck[];
  };
  metrics: MetricSpec[];
};
```
The important part is that the case is about a Paperclip scenario, not a standalone prompt string.
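The check and metric types referenced above are not defined anywhere yet; a minimal sketch of what they could look like, assuming small discriminated unions (all names and fields here are placeholders, not a decided API):

```ts
// Hypothetical companions to EvalCase; field names are illustrative only.
type HardCheck =
  | { kind: "tool_called"; tool: string; atMost?: number }
  | { kind: "state_equals"; path: string; expected: unknown }
  | { kind: "no_forbidden_action"; tools: string[] };

type RubricCheck = {
  id: string;
  question: string; // narrow yes/no or 0-2 criterion, not "how good was this?"
  scale: 1 | 2;     // maximum score; the judge returns an integer in [0, scale]
};

type PairwiseCheck = {
  id: string;
  aspect: string;   // e.g. "progress update clarity"
};

type MetricSpec = {
  id: string;                     // e.g. "total_tokens", "tool_calls", "wall_clock_ms"
  source: "trace" | "provider";
};
```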
Suggested shape for an EvalBundle:
```ts
type EvalBundle = {
  id: string;
  adapter: string;
  model: string;
  promptTemplate: string;
  bootstrapPromptTemplate?: string;
  skills: string[];
  flags?: Record<string, string | number | boolean>;
};
```
Every comparison run should say which bundle was tested.
This avoids the common mistake of saying “model X is better” when the real change was model + prompt + skills + runtime behavior.
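For illustration, a baseline and a candidate bundle using the shape above might look like this; the adapter, model, skill, and flag values are hypothetical, and only the bundle ids echo the commands suggested later in this plan:

```ts
// Illustrative bundle definitions; "default-model" and the skill names are placeholders.
const baseline: EvalBundle = {
  id: "baseline/codex-default",
  adapter: "codex",
  model: "default-model",
  promptTemplate: "prompts/heartbeat-default.md",
  skills: ["issues", "comments", "approvals"],
};

const candidate: EvalBundle = {
  id: "experiments/codex-lean-skillset",
  adapter: "codex",
  model: "default-model",
  promptTemplate: "prompts/heartbeat-lean.md",
  skills: ["issues", "comments"],      // the reduced skill set is the change under test
  flags: { compactHistory: true },
};
```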
We should capture a normalized EvalTrace for scoring:
The scorer layer should never need to scrape ad hoc logs.
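The exact trace fields are still open; one possible shape, assuming we at least need the step sequence, tool calls, token counts, and final scenario state (field names are assumptions):

```ts
// Hypothetical normalized trace; scorers read this instead of raw logs.
type EvalTrace = {
  caseId: string;
  bundleId: string;
  steps: Array<{
    role: "assistant" | "tool" | "system";
    toolName?: string;                   // set when the step is a tool call
    tokensIn: number;
    tokensOut: number;
  }>;
  finalState: Record<string, unknown>;   // scenario state snapshot after the run
  wallClockMs: number;
  error?: string;                        // populated when the run failed outright
};
```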
Every eval should start with pass/fail checks that can invalidate the run immediately.
Examples:
If a hard check fails, the scenario fails regardless of style or judge score.
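As an example of how a hard check could be evaluated against the hypothetical EvalTrace above, here is one illustrative invariant, "no restricted tool call before an approval request"; the tool names are placeholders:

```ts
// Illustrative hard check: a restricted tool must not run before an approval request.
function approvalPrecedesRestrictedTools(
  trace: EvalTrace,
  restrictedTools: string[],
): boolean {
  let approvalRequested = false;
  for (const step of trace.steps) {
    if (step.toolName === "request_approval") approvalRequested = true;
    if (step.toolName && restrictedTools.includes(step.toolName) && !approvalRequested) {
      return false; // hard failure: invalidates the run regardless of judge scores
    }
  }
  return true;
}
```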
Rubric scoring should use narrow criteria, not vague “how good was this?” prompts.
Good rubric dimensions:
Each rubric should be a small 0-1 or 0-2 decision, not a mushy 1-10 scale.
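For example, a progress-update case might carry rubric criteria like these (the wording is illustrative, not a settled rubric):

```ts
// Narrow, anchored rubric criteria; each one is a 0-1 decision.
const progressUpdateRubrics: RubricCheck[] = [
  { id: "mentions_blockers", question: "Does the update state whether anything is blocked?", scale: 1 },
  { id: "grounded_in_issue", question: "Does the update only reference work that exists in the issue thread?", scale: 1 },
];
```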
OpenAI’s eval guidance is right that LLMs are better at discrimination than open-ended generation.
So for non-deterministic quality checks, the default pattern should be pairwise judging: show the judge the baseline output and the candidate output and have it pick baseline, candidate, or tie.

This is better than asking a judge for an absolute quality score with no anchor.
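A minimal sketch of that pattern, assuming one verdict per aspect; the prompt wording and label mapping are placeholders, and the real pairwise-judge.ts should also shuffle presentation order to reduce position bias:

```ts
type PairwiseVerdict = "baseline" | "candidate" | "tie";

// Illustrative prompt assembly; assumes Output A is the baseline and Output B is the candidate.
function buildPairwisePrompt(aspect: string, baselineOutput: string, candidateOutput: string): string {
  return [
    `Compare two agent outputs on: ${aspect}.`,
    `Output A:\n${baselineOutput}`,
    `Output B:\n${candidateOutput}`,
    `Answer with exactly one word: A, B, or tie.`,
  ].join("\n\n");
}

// Maps the judge's raw answer back onto the anchored verdict (A = baseline, B = candidate).
function parseVerdict(raw: string): PairwiseVerdict {
  const answer = raw.trim().toLowerCase();
  if (answer.startsWith("a")) return "baseline";
  if (answer.startsWith("b")) return "candidate";
  return "tie";
}
```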
Do not bury efficiency inside a single blended quality score.
Record it separately:
Then compute a summary decision such as:
That is much easier to reason about than one magic number.
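One way the summary decision could be computed, assuming per-bundle aggregates for quality and token usage (the thresholds are placeholders, not agreed policy):

```ts
// Illustrative gate: quality must not regress beyond a small tolerance,
// and efficiency wins are reported separately rather than blended in.
type RunSummary = { qualityScore: number; totalTokens: number };

function compareBundles(baseline: RunSummary, candidate: RunSummary) {
  const qualityDelta = candidate.qualityScore - baseline.qualityScore;
  const tokenDelta = (candidate.totalTokens - baseline.totalTokens) / baseline.totalTokens;
  const verdict =
    qualityDelta < -0.02 ? "quality regression"
    : tokenDelta < -0.1 ? "accept: cheaper at comparable quality"
    : "neutral";
  return { qualityDelta, tokenDelta, verdict };
}
```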
For PR gating:
For deeper comparison reports, show:
We should explicitly build the dataset from three sources.
Start here.
These should cover core product invariants:
These are small, clear, and stable.
Per OpenAI’s guidance, we should log everything and mine real usage for eval cases.
Paperclip should grow eval coverage by promoting real runs into cases when we see:
The initial version can be manual: review the real run, trim the trace, and hand-write the corresponding EvalCase.

Later we can automate trace-to-case generation.
These should intentionally probe failure modes:
This is where promptfoo-style red-team ideas can become useful later, but it is not the first slice.
Recommended initial layout:
```
evals/
  README.md
  promptfoo/
    promptfooconfig.yaml
    prompts/
    cases/
  cases/
    core/
    approvals/
    delegation/
    efficiency/
  fixtures/
    companies/
    issues/
  bundles/
    baseline/
    experiments/
  runners/
    scenario-runner.ts
    compare-runner.ts
  scorers/
    hard/
    rubric/
    pairwise/
  judges/
    rubric-judge.ts
    pairwise-judge.ts
  lib/
    types.ts
    traces.ts
    metrics.ts
  reports/
    .gitignore
```
Why top-level evals/:
- eval assets do not belong in server/ even though they span adapters and runtime behavior
- one home for the v0 config plus the later first-party runner

The harness should support three modes.
Purpose:
Characteristics:
Purpose:
Characteristics:
Purpose:
Characteristics:
Suggested commands:
```sh
pnpm evals:smoke
pnpm evals:compare --baseline baseline/codex-default --candidate experiments/codex-lean-skillset
pnpm evals:nightly
```
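At a high level, the compare runner could reduce to a loop like this (a sketch only; runScenario and scorePair stand in for the real scenario-runner and scorer wiring, which this plan does not pin down):

```ts
// Hypothetical core loop of evals/runners/compare-runner.ts.
type RunScenario = (evalCase: EvalCase, bundleId: string) => Promise<EvalTrace>;
type ScorePair = (evalCase: EvalCase, baseline: EvalTrace, candidate: EvalTrace) => unknown;

async function runCompare(
  baselineId: string,
  candidateId: string,
  cases: EvalCase[],
  runScenario: RunScenario,
  scorePair: ScorePair,
) {
  const results: unknown[] = [];
  for (const evalCase of cases) {
    const baselineTrace = await runScenario(evalCase, baselineId);
    const candidateTrace = await runScenario(evalCase, candidateId);
    results.push(scorePair(evalCase, baselineTrace, candidateTrace));
  }
  return results;
}
```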
PR behavior:
- evals:smoke on prompt/skill/adapter/runtime changes
- evals:compare for labeled PRs or manual runs

Nightly behavior:
Best use for Paperclip:
What changed in this recommendation:
Why it still should not be the only long-term system:
Recommendation: use it for the v0 slice, scoped to evals/promptfoo/.

What it gets right:
Why not the primary system today:
Recommendation:
What it gets right:
Why not the primary system today:
Recommendation:
What it gets right:
Why not the primary system:
Recommendation:
The first version should be intentionally small.
Build:
- evals/promptfoo/promptfooconfig.yaml (sketched below)

Target scope:
Success criteria:
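To make the v0 slice concrete, a minimal sketch of what evals/promptfoo/promptfooconfig.yaml could look like; the providers, prompt file, variables, and assertions are illustrative values, not decided choices:

```yaml
# Illustrative v0 config; every concrete value below is a placeholder.
description: "Paperclip v0 prompt/model comparisons"
prompts:
  - file://prompts/assignment-pickup.txt
providers:
  - openai:gpt-4o-mini
  - anthropic:messages:claude-3-5-sonnet-20241022
tests:
  - description: "agent acknowledges a new assignment"
    vars:
      issue_title: "Fix login redirect loop"
    assert:
      - type: contains
        value: "redirect"
      - type: llm-rubric
        value: "The reply acknowledges the assignment and states a concrete next step."
```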
Build:
- evals/ scaffold
- EvalCase, EvalBundle, EvalTrace types

Target cases:
Success criteria:
- v0 cases either migrate into or coexist with this layer cleanly

Build:
Success criteria:
Build:
Dependency:
- 2026-03-13-TOKEN-OPTIMIZATION-PLAN.md

Success criteria:
Build:
Success criteria:
We should start with these categories:
- core.assignment_pickup
- core.progress_update
- core.blocked_reporting
- governance.approval_required
- governance.company_boundary
- delegation.correct_report
- threads.long_context_followup
- efficiency.no_unnecessary_reloads

That is enough to start catching the classes of regressions we actually care about.
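As a concrete illustration of one of these, governance.approval_required might be expressed with the EvalCase shape above (the fixture, agent id, tool names, and criteria are hypothetical):

```ts
// Hypothetical case definition; reuses the EvalCase and check shapes sketched earlier.
const approvalRequired: EvalCase = {
  id: "governance.approval_required",
  description: "Agent must request approval before taking a restricted action.",
  tags: ["governance", "smoke"],
  setup: {
    fixture: "companies/acme-basic",
    agentId: "agent-ops",
    trigger: "assignment",
  },
  checks: {
    hard: [{ kind: "tool_called", tool: "request_approval" }],
    rubric: [
      { id: "explains_request", question: "Does the approval request say what is being requested and why?", scale: 1 },
    ],
  },
  metrics: [{ id: "total_tokens", source: "trace" }],
};
```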
Every important scenario needs deterministic checks first.
Use pass/fail invariants plus a small number of stable rubric or pairwise checks.
The suite must keep growing from real runs, otherwise it will become a toy benchmark.
Trajectory matters for agents:
Our eval model should survive changes in:
Should the first scenario runner invoke the real server over HTTP, or call services directly in-process? My recommendation: start in-process for speed, then add HTTP-mode coverage once the model stabilizes.
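If we want those two modes to stay swappable, the runner could depend on a small environment interface rather than either transport directly (a sketch; the method names are assumptions):

```ts
// Hypothetical seam so scenario cases do not care whether they hit the real
// HTTP server or call services in-process.
interface ScenarioEnvironment {
  seed(fixture: string): Promise<void>;
  trigger(agentId: string, kind: "assignment" | "timer" | "on_demand" | "comment" | "approval"): Promise<void>;
  collectTrace(): Promise<EvalTrace>;
  teardown(): Promise<void>;
}
```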
Should we support Python scorers in v1? My recommendation: no. Keep v1 all-TypeScript.
Should we commit baseline outputs? My recommendation: commit case definitions and bundle definitions, but keep run artifacts out of git.
Should we add hosted experiment tracking immediately? My recommendation: no. Revisit after the local harness proves useful.
Start with Promptfoo for immediate, narrow model-and-prompt comparisons, then grow into a first-party evals/ framework in TypeScript that evaluates Paperclip scenarios and bundles, not just prompts.
Use this structure:
- evals/promptfoo/ for the v0 bootstrap

Use external tools selectively:
But keep the canonical eval model inside the Paperclip repo and aligned to Paperclip’s actual control-plane behaviors.