docs/help/gpt55-codex-agentic-parity-maintainers.md
This note explains how to review the GPT-5.5 / Codex parity program as four merge units without losing the original six-contract architecture.
Scope highlights per merge unit:

- PR A owns the `executionContract` and treats `update_plan` as non-terminal progress tracking.
- PR B owns `/elevated full` availability and blocked reasons.
- PR C owns tool contract/schema compatibility and replay/continuation/liveness correctness.
- PR D owns the benchmark/release gate.

The original six contracts map onto the merge units as follows:
| Original contract | Merge unit |
|---|---|
| Provider transport/auth correctness | PR B |
| Tool contract/schema compatibility | PR C |
| Same-turn execution | PR A |
| Permission truthfulness | PR B |
| Replay/continuation/liveness correctness | PR C |
| Benchmark/release gate | PR D |
PR D is the proof layer. It should not be the reason runtime-correctness PRs are delayed.
The runtime fixes stand on their own:

- `update_plan` no longer looks like progress by itself.
- `/elevated full` is only described as available when it is actually available.

Expected artifacts from PR D:
- `qa-suite-report.md` / `qa-suite-summary.json` for each model run
- `qa-agentic-parity-report.md` with aggregate and scenario-level comparison
- `qa-agentic-parity-summary.json` with a machine-readable verdict

Do not claim GPT-5.5 parity or superiority over Opus 4.6 until the flow below ends in a pass:
```mermaid
flowchart LR
    A["PR A-C merged"] --> B["Run GPT-5.5 parity pack"]
    A --> C["Run Opus 4.6 parity pack"]
    B --> D["qa-suite-summary.json"]
    C --> E["qa-suite-summary.json"]
    D --> F["qa parity-report"]
    E --> F
    F --> G["Markdown report + JSON verdict"]
    G --> H{"Pass?"}
    H -- "yes" --> I["Parity claim allowed"]
    H -- "no" --> J["Keep runtime fixes / review loop open"]
```
The parity harness is not the only evidence source. Keep this split explicit in review:

- Deterministic runtime suites (PR A-C) prove the contract behavior directly.
- The parity harness (PR D) compares GPT-5.5 and Opus 4.6 on the same scenario coverage.
Use this when you are ready to land a parity PR and want a repeatable, low-risk sequence.
- Use `r:*` auto-close labels when the PR should not land.
- Run `pnpm check:changed` (a scriptable sketch of the checks follows after this checklist).
- Run `pnpm test:changed` when tests changed or bug-fix confidence depends on test coverage.
- Land the PR (via the `/landpr` process), then verify the change is on `main`.

If any one of the evidence bar items is missing, request changes instead of merging.
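For a repeatable run of the two check commands, a minimal sketch (assuming a Node toolchain, since the checks are pnpm scripts) could look like the following. Only `pnpm check:changed` and `pnpm test:changed` come from the checklist above; the `--tests-changed` flag is illustrative, not an existing repo option.

```ts
// Hedged sketch of the pre-land check sequence.
// execSync throws on a non-zero exit, so the script fails fast if a check fails.
import { execSync } from "node:child_process";

function run(cmd: string): void {
  console.log(`$ ${cmd}`);
  execSync(cmd, { stdio: "inherit" });
}

// Pass --tests-changed when tests changed or bug-fix confidence
// depends on test coverage (illustrative flag, see the lead-in above).
const testsChanged = process.argv.includes("--tests-changed");

run("pnpm check:changed");
if (testsChanged) {
  run("pnpm test:changed");
}
```

Label handling (`r:*`) and the `/landpr` step stay manual; the script only covers the two deterministic checks.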
| Completion gate item | Primary owner | Review artifact |
|---|---|---|
| No plan-only stalls | PR A | strict-agentic runtime tests and approval-turn-tool-followthrough |
| No fake progress or fake tool completion | PR A + PR D | parity fake-success count plus scenario-level report details |
| No false /elevated full guidance | PR B | deterministic runtime-truthfulness suites |
| Replay/liveness failures remain explicit | PR C + PR D | lifecycle/replay suites plus compaction-retry-mutating-tool |
| GPT-5.5 matches or beats Opus 4.6 | PR D | qa-agentic-parity-report.md and qa-agentic-parity-summary.json |
| User-visible problem before | Review signal after |
|---|---|
| GPT-5.5 stopped after planning | PR A shows act-or-block behavior instead of commentary-only completion |
| Tool use felt brittle with strict OpenAI/Codex schemas | PR C keeps tool registration and parameter-free invocation predictable |
| /elevated full hints were sometimes misleading | PR B ties guidance to actual runtime capability and blocked reasons |
| Long tasks could disappear into replay/compaction ambiguity | PR C emits explicit paused, blocked, abandoned, and replay-invalid state |
| Parity claims were anecdotal | PR D produces a report plus JSON verdict with the same scenario coverage on both models |