.agents/skills/maintainer-review/references/evaluation-framework.md
Use this reference when a claim is ambiguous, severity is disputed, or a PR is technically correct but may not justify merge effort.
Treat validity, severity, and merge-worthiness as separate results. Also distinguish a `Preliminary assessment`, which may still require approved runtime evidence, from a final `Maintainer decision`. Do not label a provisional positive result as a verdict or final decision.
| Dimension | Questions | Strong evidence |
|---|---|---|
| Claim validity | Does the exact reported behavior occur? Is the proposed cause correct? | Reproduction, failing focused test, or complete reachable code path |
| Reachability | Can supported, realistic inputs reach it? | Public API trace, real configuration, linked user report, or release comparison |
| Consequence | What fails, and is the result silent or recoverable? | Observed output/error/state plus downstream effect |
| Breadth | Who is affected? | Supported providers, platforms, versions, and configurations identified precisely |
| Frequency | Is this normal, intermittent, or pathological? | Repeat runs, telemetry or reports when available, deterministic preconditions |
| Compatibility | Is released behavior or durable state changed? | Latest release comparison and explicit contract inspection |
| Solution fit | Does the fix enforce the invariant at the right layer? | Equivalent paths behave consistently; simpler alternatives considered |
| Maintenance cost | What permanent complexity and review burden does it add? | Changed surface, new branches/configuration, test burden, remaining work |
Severity is approximately consequence multiplied by realistic reach and frequency, reduced by recoverability. Do not raise severity because a report sounds alarming or lower it because a patch is small.
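As a rough illustration, the heuristic above can be written as arithmetic. This is only a sketch: the 0-to-1 scales, the linear recoverability discount, and the name `severity_score` are assumptions made for the example, not part of the framework. The number is an aid for comparing cases and does not replace judgment.

```python
def severity_score(consequence: float, reach: float, frequency: float,
                   recoverability: float) -> float:
    """Hypothetical heuristic; every input is assumed to lie on a 0.0-1.0 scale."""
    inputs = {
        "consequence": consequence,
        "reach": reach,
        "frequency": frequency,
        "recoverability": recoverability,
    }
    for name, value in inputs.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")
    # Consequence multiplied by realistic reach and frequency ...
    raw = consequence * reach * frequency
    # ... reduced by recoverability (assumed here to be a linear discount).
    return raw * (1.0 - recoverability)


# Unrecoverable failure on a rare path that few users can reach.
rare_but_severe = severity_score(1.0, 0.1, 0.1, 0.0)
# Moderate failure on a common path that is easy to recover from.
common_but_recoverable = severity_score(0.5, 0.9, 0.9, 0.9)
```

Under these assumed inputs the alarming-sounding rare case scores lower than the common one, which is the point of the heuristic: how a report sounds does not set its severity.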
Before calling a claim confirmed, answer:
Use `partially confirmed` when the symptom is real but the cause, reach, or claimed scope is wrong. Use `unproven` when decisive evidence is missing. Use `contradicted` only when evidence directly disproves the claim.
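The status rules above can be sketched as a small decision function. The three boolean inputs, the order in which they are checked, and the names `ClaimStatus` and `classify_claim` are simplifying assumptions for illustration; a real review weighs evidence that does not reduce to three flags.

```python
from enum import Enum


class ClaimStatus(Enum):
    CONFIRMED = "confirmed"
    PARTIALLY_CONFIRMED = "partially confirmed"
    UNPROVEN = "unproven"
    CONTRADICTED = "contradicted"


def classify_claim(directly_disproved: bool, symptom_observed: bool,
                   details_correct: bool) -> ClaimStatus:
    """Hypothetical mapping from review findings to a claim status.

    details_correct means the claimed cause, reach, and scope all hold.
    """
    # Contradicted only when evidence directly disproves the claim.
    if directly_disproved:
        return ClaimStatus.CONTRADICTED
    # Without decisive evidence of the symptom, the claim stays unproven.
    if not symptom_observed:
        return ClaimStatus.UNPROVEN
    # The symptom is real, but the cause, reach, or claimed scope is wrong.
    if not details_correct:
        return ClaimStatus.PARTIALLY_CONFIRMED
    return ClaimStatus.CONFIRMED
```

Note that a missing reproduction yields `UNPROVEN` rather than `CONTRADICTED`: absence of evidence is kept distinct from evidence against the claim.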
Choose one primary action:
When requesting evidence, ask only for information that could change the disposition.
Assess these independently:
A PR can be correct but not merge-worthy. Typical reasons include a nonexistent or negligible need, a no-op on the actual runtime path, incomplete cross-path semantics, an abstraction cost larger than the benefit, or a simpler existing mechanism.
Keep issue impact and patch risk separate. Severity describes the underlying issue or user need. A regression, compatibility break, lifecycle leak, or maintenance hazard introduced by the proposed patch belongs under Patch risk and must not inflate or obscure the issue severity.
Do not treat documentation as automatically required for every public option, constructor parameter, provider setting, or behavior change. Make docs merge-blocking only when at least one of these is true:
If docs would merely improve discoverability or completeness, keep them non-blocking. Do not change `Merge-worthy as-is` to `Merge-worthy after focused changes` solely for optional docs, and do not include optional docs in the maintainer comment's required-action paragraph. Respect an explicit maintainer choice to omit docs or defer them to a separate follow-up.
Apply this section when a change adds validation, fail-fast behavior, cleanup, retries, interruption, background work, or concurrency.
Test at least one alternative against the proposed patch:
When two or more open PRs address the same issue, first verify that they belong in one comparison set. Accept an explicit issue link, the same minimal reproduction, the same violated invariant, or materially overlapping runtime paths as association evidence. Do not treat a shared label or subsystem as sufficient.
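A minimal sketch of that association check follows. The string identifiers and the function name `belongs_in_comparison_set` are hypothetical; only the four accepted kinds of evidence and the rejection of a bare label or subsystem match come from the text.

```python
# The four kinds of association evidence the text accepts (identifiers are illustrative).
ACCEPTED_EVIDENCE = frozenset({
    "explicit_issue_link",
    "same_minimal_reproduction",
    "same_violated_invariant",
    "overlapping_runtime_paths",
})


def belongs_in_comparison_set(evidence: set[str]) -> bool:
    """Return True if at least one accepted kind of evidence is present.

    Anything else, such as a shared label or subsystem, is ignored
    and therefore never sufficient on its own.
    """
    return bool(ACCEPTED_EVIDENCE & evidence)


# A shared label and subsystem alone do not associate two PRs.
weak = belongs_in_comparison_set({"shared_label", "same_subsystem"})
# An explicit issue link does, regardless of what else is present.
strong = belongs_in_comparison_set({"shared_label", "explicit_issue_link"})
```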
Compare each candidate on the same evidence basis:
| Criterion | Question |
|---|---|
| Coverage | Does it solve the whole confirmed issue, a useful subset, or an adjacent problem? |
| Correctness | Does the fix work on the real path and meaningful boundaries? |
| Placement | Does it enforce the invariant at the correct shared layer? |
| Tests | Does it reproduce the base failure and distinguish the candidate approaches? |
| Compatibility | Does it preserve released APIs, state, protocol, providers, and established behavior? |
| Complexity | What permanent branches, abstractions, configuration, or coupling does it add? |
| Readiness | Is it mergeable now, or how much focused work remains? |
| Reuse | Are there valuable tests or implementation pieces that should be combined into another candidate? |
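One way to keep candidates on the same evidence basis is to check that every criterion in the table has been assessed for every candidate before comparing them. The sketch below assumes review notes are stored as plain dictionaries; the function name `missing_assessments` and the candidate names in the usage lines are hypothetical.

```python
# Criteria taken from the table above, lowercased for use as dictionary keys.
CRITERIA = (
    "coverage", "correctness", "placement", "tests",
    "compatibility", "complexity", "readiness", "reuse",
)


def missing_assessments(candidates: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Map each candidate to the criteria that still lack a non-empty note."""
    gaps: dict[str, list[str]] = {}
    for name, notes in candidates.items():
        missing = [c for c in CRITERIA if not notes.get(c, "").strip()]
        if missing:
            gaps[name] = missing
    return gaps


# Hypothetical notes: candidate-a is fully assessed, candidate-b is not.
notes = {
    "candidate-a": {c: "assessed" for c in CRITERIA},
    "candidate-b": {"coverage": "useful subset", "tests": "  "},
}
gaps = missing_assessments(notes)
```

An empty result means the comparison can proceed; any gap means one candidate would otherwise be judged on less evidence than another.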
Choose one portfolio-level disposition:
Do not split the decision into independent approvals. Competing PRs consume overlapping review and maintenance budgets, so recommend one path for the issue as a whole.
Always write maintainer comments in English, regardless of the assessment language. Produce a draft when the recommendation is to close, request evidence, request focused code changes, supersede a PR, or choose one competing PR over another.
Keep each draft polite, direct, and copy-paste-ready. Aim for 60 to 160 words in one to three short paragraphs:
Do not include internal labels such as `severity: low`, speculate about AI authorship or contributor intent, repeat the full review, or soften the message until the requested action becomes unclear.
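The mechanical parts of these constraints can be checked before a draft is posted. The following is a sketch only: the regular expression for internal labels covers just the `severity:` form, the function name `draft_problems` is hypothetical, and the findings are best treated as warnings because the length range is a usual target rather than a hard limit. Tone, speculation, and clarity still need a human read.

```python
import re

MIN_WORDS, MAX_WORDS = 60, 160
# Assumed pattern for a leaked internal label such as "severity: low".
INTERNAL_LABEL = re.compile(r"\bseverity\s*:\s*\w+", re.IGNORECASE)


def draft_problems(draft: str) -> list[str]:
    """Return warnings for a maintainer comment draft; empty means none found."""
    problems: list[str] = []
    words = len(draft.split())
    if not MIN_WORDS <= words <= MAX_WORDS:
        problems.append(f"word count {words} is outside {MIN_WORDS}-{MAX_WORDS}")
    # Paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", draft) if p.strip()]
    if not 1 <= len(paragraphs) <= 3:
        problems.append(f"{len(paragraphs)} paragraphs; expected one to three")
    if INTERNAL_LABEL.search(draft):
        problems.append("contains an internal label")
    return problems
```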
Thanks for taking the time to investigate this. I traced the reported case through <path or behavior>, and <decisive finding>. In the supported path, <practical result>, so the added complexity is not justified by the demonstrated impact.
I am going to close this <issue/PR>. If you can provide <specific reproduction or evidence that would change the decision>, we can revisit the underlying problem with that narrower scope.
Thanks for the contribution. The underlying issue is valid, and this approach is directionally reasonable. Before we can merge it, please address the following points: <bounded list of required changes>.
These changes are needed because <concise contract, lifecycle, compatibility, or test reason>. Once they are covered with a regression test that fails on the base and passes on the updated branch, the PR should be ready for another review.
Adapt the wording to the actual evidence. Do not use these templates as generic filler.
Use `Maintainer decision` for a concluded review. Use `Preliminary assessment` when a desk review is tentatively positive but a decision-relevant runtime concern remains. `Verdict` is intentionally avoided in the report headings because it does not communicate whether the result is provisional or final.
## Preliminary assessment
<Tentative issue or PR assessment based on desk review only.>
## Static evidence
- <decisive code-path or test-inspection evidence>
- <what remains uncertain at runtime>
## Proposed runtime probe
- Concern: <the uncertainty that could change the decision>
- Probe: <smallest exact execution path>
- Control: <base, release, or known-good comparison when relevant>
- Scope: <local-only or any live-service, cost, mutation, or cleanup implications>
## Approval request
<Ask whether to run this exact probe. Do not present a final positive recommendation yet.>
## Maintainer decision
<Confirmed/partially confirmed/unproven/contradicted, severity, and disposition.>
## Evidence
- <decisive evidence>
- <scope or uncertainty>
## Recommendation
<Prioritize, accept low priority, narrow, request evidence, or close.>
## Maintainer comment draft
<Include when closure or additional evidence should be requested.>
## Maintainer decision
<Need, practical impact, and merge-worthiness.>
- Code recommendation: <code disposition>
- Repository readiness: <integration status; include only for a merge-worthy recommendation when material>
## Evidence
- <runtime or code-path result>
- <test and compatibility result>
## Issue impact
- Validity: <claim validity>
- Severity: <severity of the underlying issue or need>
- Reach: <realistic reach>
## Patch risk
<Include only when the proposed patch introduces a meaningful regression, compatibility, lifecycle, or maintenance risk.>
## PR quality
- Solution fit: <assessment>
- Tests: <assessment>
- Remaining effort: <bounded/unbounded and why>
## Recommendation
<Merge, focused changes, simpler replacement, or close.>
## Maintainer comment draft
<Include only when closure, evidence, or changes should be requested.>
## Maintainer decision
<Issue validity, practical severity, and preferred implementation path.>
## Open PR comparison
| PR | Approach | Correctness | Tests | Compatibility/complexity | Readiness |
|---|---|---|---|---|---|
| #... | ... | ... | ... | ... | ... |
## Recommendation
<Select one, request focused changes, combine specific parts, replace all, or merge none.>
<State what should happen to every other open candidate.>
## Maintainer comment drafts
<One copy-paste-ready draft for each PR that should be closed, changed, or superseded.>