SKILL - Ruflo — ContextQMD

Multi-baseline counterfactual cost analysis. Pairs with the existing observability surface:

cost-budget-check — "have we crossed a threshold?" (reactive)
cost-projection — "when will we cross a threshold?" (predictive)
cost-counterfactual — "is the routing earning its keep?" (comparative) ← this one

Algorithm

Read all session-* records from the cost-tracking namespace.
Apply --since window filter (default all-time).
Sum tokens across byModel[*] entries for each session.
For each requested baseline (default: all three):
- counterfactualUsd = (input × tier.input + output × tier.output + cache_write × tier.cache_write + cache_read × tier.cache_read) / 1M
Compute savings = counterfactualUsd − actualUsd.
Emit per-baseline totals + savings % across the comparison set.

Smoke transcript (2 sessions: 50K haiku tokens + 50K sonnet tokens)

| Sessions considered | 2 |
| Total input tokens  | 100,000 |
| Actual spend        | $0.162500 |

| Baseline           | Hypothetical | Actual    | Savings    | %       |
| `always-haiku`     | $0.025000    | $0.162500 | -$0.137500 | -550.00% |
| `always-sonnet`    | $0.300000    | $0.162500 | +$0.137500 |   45.83% |
| `always-opus`      | $1.500000    | $0.162500 | +$1.337500 |   89.17% |

How to read negative savings

A negative always-haiku result means the router chose more-expensive models than haiku on tasks haiku could have handled. That's an over-escalation signal:

Maybe qualityBar is set too high
Maybe the sonnet/opus session was warranted by complexity but the baseline doesn't know that
Run cost optimize (or inspect specific sessions via cost conversation) to investigate

Positive savings quantify the router's win against that baseline. The most informative number is usually always-sonnet — it's the standard "safe default" baseline most teams would pick if they didn't have routing.

When to use

Quarterly cost review: "We saved $X vs always-Sonnet — here's the proof."
CI gate: cost counterfactual --format json | jq '.baselines[1].savingsPct > 30' — fail builds if routing isn't saving ≥30% vs sonnet baseline (workload-shift detector).
Routing-config validation: When introducing a new qualityBar or cost-ceiling, re-run counterfactual to confirm savings didn't regress.

Stationarity caveat

Like all counterfactual analyses, this assumes the same tokens at the same complexity would have produced the same outcome from the baseline model. That's an upper bound — the baseline might have failed and required retries, which the math doesn't capture. Treat the numbers as a quality-blind ceiling.