2026-04-06 Smart Model Routing

Status: Proposed Date: 2026-04-06 Audience: Product and engineering Related:

doc/SPEC-implementation.md
doc/PRODUCT.md
doc/plans/2026-03-14-adapter-skill-sync-rollout.md

1. Purpose

This document defines a V1 plan for "smart model routing" in Paperclip.

The goal is not to build a generic cross-provider router in the server. The goal is:

let supported adapters use a cheaper model for lightweight heartbeat orchestration work
keep the main task execution on the adapter's normal primary model
preserve Paperclip's existing task, session, and audit invariants
report cost and model usage truthfully when more than one model participates in a single heartbeat

The motivating use case is a local coding adapter where a cheap model can handle the first fast pass:

read the wake context
orient to the task and workspace
leave an immediate progress comment when appropriate
perform bounded lightweight triage

Then the primary model does the substantive work.

2. Hermes Findings

Hermes does have a real "smart model routing" feature, but it is narrower than the name suggests.

Observed behavior:

agent/smart_model_routing.py implements a conservative classifier for "simple" turns
the cheap path only triggers for short, single-line, non-code, non-URL, non-tool-heavy messages
complexity is detected with hardcoded thresholds plus a keyword denylist like debug, implement, test, plan, tool, docker, and similar terms
if the cheap route cannot be resolved, Hermes silently falls back to the primary model

Important architectural detail:

Hermes applies this routing before constructing the agent for that turn
the route is resolved in cron/scheduler.py and passed into agent creation as the active provider/model/runtime

More useful than the routing heuristic itself is Hermes' broader model-slot design:

main conversational model
fallback model for failover
auxiliary model slots for side tasks like compression and classification

That separation is a better fit for Paperclip than copying Hermes' exact keyword heuristic.

3. Current Paperclip State

Paperclip already has the right execution shape for adapter-specific routing, but it currently assumes one model per heartbeat run.

Current implementation facts:

server/src/services/heartbeat.ts builds rich run context, including paperclipWake, workspace metadata, and session handoff context
each adapter receives a single resolved config object and executes once
built-in local adapters read one config.model and pass it directly to the underlying CLI
UI config today exposes one main model field plus adapter-specific thinking-effort controls
cost accounting currently records one provider/model tuple per run via AdapterExecutionResult

What this means:

there is no shared routing layer in the server today
model choice already lives at the adapter boundary, which is good
multi-model execution in a single heartbeat needs explicit contract work or cost reporting will become misleading

4. Product Decision

Paperclip should implement smart model routing as an adapter-local, opt-in execution pattern.

V1 decision:

Do not add a global server-side router that tries to understand every adapter.
Do not copy Hermes' prompt-keyword classifier as Paperclip's default routing policy.
Add an adapter-specific "cheap preflight" phase for supported adapters.
Keep the primary model as the canonical work model.
Persist only the primary session unless an adapter can prove that cross-model session resume is safe.

Rationale:

Paperclip heartbeats are structured, issue-scoped, and already include wake metadata
routing by execution phase is more reliable than routing by free-text prompt complexity
session semantics differ by adapter, so resume behavior must stay adapter-owned

5. Proposed V1 Behavior

5.1 Config shape

Supported adapters should add an optional routing block to adapterConfig.

Proposed shape:

smartModelRouting?: {
  enabled: boolean;
  cheapModel: string;
  cheapThinkingEffort?: string;
  maxPreflightTurns?: number;
  allowInitialProgressComment?: boolean;
}

Notes:

keep existing model as the primary model
cheapModel is adapter-specific, not global
adapters that cannot safely support this block simply ignore it

For adapters with provider-specific model fields later, the shape can expand to include provider/base-url overrides. V1 should start simple.

5.2 Routing policy

Supported adapters should run cheap preflight only when all are true:

smartModelRouting.enabled is true
cheapModel is configured
the run is issue-scoped
the adapter is starting a fresh session, not resuming a persisted one
the run is expected to do real task work rather than just resume an existing thread

Supported adapters should skip cheap preflight when any are true:

a persisted task session already exists
the adapter cannot safely isolate preflight from the primary session
the issue or wake type implies the task is already mid-flight and continuity matters more than first-response speed

This is intentionally phase-based, not text-heuristic-based.

5.3 Cheap preflight responsibilities

The cheap phase should be narrow and bounded.

Allowed responsibilities:

ingest wake context and issue summary
inspect the workspace at a shallow level
leave a short "starting investigation" style comment when appropriate
collect a compact handoff summary for the primary phase

Not allowed in V1:

long tool loops
risky file mutations
being the canonical persisted task session
deciding final completion without either explicit adapter support or a trivial success case

Implementation detail:

the adapter should inject an explicit preflight prompt telling the model this is a bounded orchestration pass
preflight should use a very small turn budget, for example 1-2 turns

5.4 Primary execution responsibilities

After preflight, the adapter launches the normal primary execution using the existing prompt and primary model.

The primary phase should receive:

the normal Paperclip prompt
any preflight-generated handoff summary
normal workspace and wake context

The primary phase remains the source of truth for:

persisted session state
final task completion
most file changes
most cost

6. Required Contract Changes

The current AdapterExecutionResult is too narrow for truthful multi-model accounting.

Add an optional segmented execution report, for example:

executionSegments?: Array<{
  phase: "cheap_preflight" | "primary";
  provider?: string | null;
  biller?: string | null;
  model?: string | null;
  billingType?: AdapterBillingType | null;
  usage?: UsageSummary;
  costUsd?: number | null;
  summary?: string | null;
}>

V1 server behavior:

if executionSegments is absent, keep current single-result behavior unchanged
if present, write one cost_events row per segment that has cost or token usage
store the segment array in run usage/result metadata for later UI inspection
keep the existing top-level provider / model fields as a summary, preferably the primary phase when present

This avoids breaking existing adapters while giving routed adapters truthful reporting.

7. Adapter Rollout Plan

7.1 Phase 1: contract and server plumbing

Work:

Extend adapter result types with segmented execution metadata.
Update heartbeat cost recording to emit multiple cost events when segments are present.
Include segment summaries in run metadata for transcript/debug views.

Success criteria:

existing adapters behave exactly as before
a routed adapter can report cheap plus primary usage without collapsing them into one fake model

7.2 Phase 2: `codex_local`

Why first:

Codex already has rich prompt/handoff handling
the adapter already injects Paperclip skills and workspace metadata cleanly
the current implementation already distinguishes bootstrap, wake delta, and handoff prompt sections

Implementation work:

Add config support for smartModelRouting.
Add a cheap-preflight prompt builder.
Run cheap preflight only on fresh sessions.
Pass a compact preflight handoff note into the primary prompt.
Report segmented usage and model metadata.

Important guardrail:

do not resume the cheap-model session as the primary session in V1

7.3 Phase 3: `claude_local`

Implementation work is similar, but the session model-switch risk is even less attractive.

Same rule:

cheap preflight is ephemeral
primary Claude session remains canonical

7.4 Phase 4: other adapters

Candidates:

cursor
gemini_local
opencode_local
external plugin adapters through createServerAdapter()

These should come later because each runtime has different session and model-switch semantics.

8. UI and Config Changes

For supported built-in adapters, the agent config UI should expose:

model as the primary model
smart model routing toggle
cheap model
optional cheap thinking effort
optional allow initial progress comment toggle

The run detail UI should also show when routing occurred, for example:

cheap preflight model
primary model
token/cost split

This matters because Paperclip's board UI is supposed to make cost and behavior legible.

9. Why Not Copy Hermes Exactly

Hermes' cheap-route heuristic is useful precedent, but Paperclip should not start there.

Reasons:

Hermes is optimizing free-form conversational turns
Paperclip agents run structured, issue-scoped heartbeats with explicit task and workspace context
Paperclip already knows whether a run is fresh vs resumed, issue-scoped vs approval follow-up, and what workspace/session exists
those execution facts are stronger routing signals than prompt keyword matching

If Paperclip later wants a cheap-only completion path for trivial runs, that can be a second-stage feature built on observed run data, not the first implementation.

10. Risks

10.1 Duplicate or noisy comments

If the cheap phase posts an update and the primary phase posts another near-identical update, the issue thread gets worse.

Mitigation:

keep cheap comments optional
make the preflight prompt explicitly avoid repeating status if a useful comment was already posted

10.2 Misleading cost reporting

If we only record the primary model, the board loses visibility into the routing cost tradeoff.

Mitigation:

add segmented execution reporting before shipping adapter behavior

10.3 Session corruption

Cross-model session reuse may fail or degrade context quality.

Mitigation:

V1 does not persist or resume cheap preflight sessions

10.4 Cheap model overreach

A cheap model with full tools and permissions may do too much low-quality work.

Mitigation:

hard cap preflight turns
use an explicit orchestration-only prompt
start with supported adapters where we can test the behavior well

11. Verification Plan

Required tests:

adapter unit tests for route eligibility
adapter unit tests for "fresh session -> cheap preflight + primary"
adapter unit tests for "resumed session -> primary only"
heartbeat tests for segmented cost-event creation
UI tests for config save/load of cheap-model fields

Manual checks:

create a fresh issue for a routed Codex or Claude agent
verify the run metadata shows both phases
verify only the primary session is persisted
verify cost rows reflect both models
verify the issue thread does not get duplicate kickoff comments

12. Recommended Sequence

Add segmented execution reporting to the adapter/server contract.
Implement codex_local cheap preflight.
Validate cost visibility and transcript UX.
Implement claude_local cheap preflight.
Decide later whether any adapters need Hermes-style text heuristics in addition to phase-based routing.

13. Recommendation

Paperclip should ship smart model routing as:

adapter-specific
opt-in
phase-based
session-safe
cost-truthful

The right V1 is not "choose the cheapest model for simple prompts." The right V1 is "use a cheap model for bounded orchestration work on fresh runs, then hand off to the primary model for the real task."

2026-04-06 Smart Model Routing

2026-04-06 Smart Model Routing

1. Purpose

2. Hermes Findings

3. Current Paperclip State

4. Product Decision

5. Proposed V1 Behavior

5.1 Config shape

5.2 Routing policy

5.3 Cheap preflight responsibilities

5.4 Primary execution responsibilities

6. Required Contract Changes

7. Adapter Rollout Plan

7.1 Phase 1: contract and server plumbing

7.2 Phase 2: codex_local

7.3 Phase 3: claude_local

7.4 Phase 4: other adapters

8. UI and Config Changes

9. Why Not Copy Hermes Exactly

10. Risks

10.1 Duplicate or noisy comments

10.2 Misleading cost reporting

10.3 Session corruption

10.4 Cheap model overreach

11. Verification Plan

12. Recommended Sequence

13. Recommendation

7.2 Phase 2: `codex_local`

7.3 Phase 3: `claude_local`