Faster PR Workflows Without Sacrificing Quality

Generated by swarm planning session on 2026-02-15

Summary

Improve the speed of landing PRs for the maintainer (wwwillchen/keppo-bot) by optimizing the CI pipeline, automating the fix-and-retry loop, and improving developer feedback during wait times. The current workflow is already highly automated — this plan targets the remaining friction: CI wall-clock time, human-in-the-loop steps, and the "waiting black hole" between push and merge.

Problem Statement

The current PR landing flow involves a multi-cycle loop: push → wait for CI (~10-15 min) → wait for AI review (~10-30 min) → address comments → re-push → wait again. Each iteration adds latency, and for 3-4 PRs per day, this compounds into 1-2 hours of waiting plus context-switching costs. The quality gates (multi-reviewer AI review, comprehensive E2E testing, lint/type checks) are working well — the problem is throughput, not quality.

Scope

In Scope (MVP)

CI pipeline optimizations for self-hosted macOS ARM64 runners (caching, job consolidation)
Auto-trigger fix loop for privileged author PRs (remove label requirement)
Tiered test selection based on changed files
PR status watcher for terminal notifications
Unified /pr skill entry point
Move pr-review-responder to self-hosted runners (enable E2E fix capability)

Out of Scope (Follow-up)

Auto-merge changes (keep manual "merge-when-ready" label)
External contributor experience improvements
Bash script alternative to /fast-push (optimize existing skill instead)
Dry-run/preview mode for skills
Incremental Electron builds
Notification channels beyond terminal (desktop, Slack)
Multi-PR-in-flight status dashboard

User Stories

As a maintainer, I want CI to complete faster on my self-hosted runners, so I spend less time waiting between push and feedback.
As a maintainer, I want CI failures to be automatically diagnosed and fixed without me triggering a skill, so the fix-push-verify loop runs hands-free.
As a maintainer, I want to be notified in my terminal when CI completes (pass or fail), so I can confidently context-switch to other work during CI runs.
As a maintainer, I want a single /pr command that detects what my PR needs and does the right thing, so I don't need to remember which of 10+ skills to invoke.
As a maintainer, I want CI to skip E2E tests for low-risk changes (docs, config, test-only), so trivial PRs land faster without wasting runner time.
As a maintainer, I want the automated fix loop to be able to fix E2E failures (not just lint/type/unit), so fewer PRs get stuck at needs-human:review-issue.

UX Design

User Flow (Optimized)

Developer writes code (with Claude Code assistance)
Runs /pr — unified entry point detects uncommitted changes, runs lint/format/type-check/unit-tests locally, commits, pushes, creates/updates PR
/pr kicks off a background PR watcher that polls CI status
Developer context-switches to other work
Terminal notification: "CI passed / CI failed on PR #123"
If CI failed: auto-fix loop triggers automatically (no label needed), developer sees notification when fix is pushed and re-CI starts
If all checks pass: developer adds "merge-when-ready" label → auto-merge
If auto-fix loop fails after 4 iterations: needs-human:review-issue label applied, developer investigates

Key States

Pushing: /pr runs local validation and pushes. Terminal shows progress via task tracking.
Waiting: Background watcher polls. Developer sees no output unless something happens.
CI Passed: Terminal notification (bell + message). Developer returns to add merge label.
CI Failed: Terminal notification. Auto-fix loop kicks in automatically. Developer notified of each iteration.
Fix Loop Exhausted: needs-human:review-issue label. Terminal notification. Developer investigates.
Merged: After "merge-when-ready" label + CI pass, auto-merge fires. PR closed.

Interaction Details

/pr is the primary entry point. It inspects current state (uncommitted changes? open PR? failing checks? review comments?) and routes to the appropriate action.
Sub-commands available for power users: /pr push, /pr fix, /pr watch, /pr rebase
Background watcher uses gh pr checks --watch or polling gh pr checks every 30 seconds
Terminal notifications use bell character + colored status line

Accessibility

All feedback is text-based (terminal output), compatible with screen readers
Status notifications include text labels, not just colors
No custom UI components — leverages GitHub's accessible web interface for detailed status

Technical Design

Architecture

All changes are to the CI/CD pipeline, GitHub Actions workflows, and Claude Code skills. No application code changes.

Components Affected

Component	Change	Complexity
`.github/workflows/ci.yml`	Add persistent caching, merge build+E2E for self-hosted, tiered test selection	Medium
`.github/workflows/pr-review-responder.yml`	Remove label requirement for privileged authors, move to self-hosted runners	Small
`scripts/ci-cleanup-macos.sh`	Preserve caches (Playwright, pnpm store, nextjs-template, scaffold deps)	Small
`playwright.config.ts`	Add test tagging/grouping for tiered selection	Medium
`.claude/skills/pr/SKILL.md` (new)	Unified `/pr` entry point that routes to sub-skills	Medium
`.claude/skills/pr-watch/SKILL.md` (new)	Background CI status watcher with terminal notifications	Small
`.claude/skills/fast-push/SKILL.md`	Optimize for speed — reduce unnecessary steps	Small

Data Model Changes

None. All improvements are to pipeline/tooling configuration.

API Changes

None. Internal tooling only.

Implementation Plan

Phase 0: Instrument (prerequisite)

Add CI job timing annotations (start/end timestamps per job) to measure baseline
Track PR cycle time (first push to merge) — could be a simple GitHub Action that logs timestamps
Measure auto-fix loop iteration frequency (how often does cc:request:2, cc:request:3, cc:request:4 get reached?)
Measure CI first-push pass rate for privileged author PRs
Document current scripts/ci-cleanup-macos.sh behavior — confirm what's already cached vs. cleaned

Phase 1: CI Pipeline Quick Wins

Optimize self-hosted runner caching: Preserve Playwright browser binaries, pnpm store, nextjs-template clone + deps, scaffold deps between runs. Update scripts/ci-cleanup-macos.sh to keep these cached (keyed on lockfile/dependency hashes). Note: node_modules/ already persists (cleanup script doesn't remove it), so npm ci is already partially cached — focus caching effort on Playwright, pnpm store, and template deps.
Skip redundant installs: Check if Chromium is already installed before running npx playwright install chromium. Check if nextjs-template exists and is up-to-date before cloning.
Merge build + E2E into single job for self-hosted runners: Eliminate artifact tar/upload/download overhead and redundant setup steps (second checkout, second npm ci, second pnpm/scaffold/nextjs-template setup). Keep separate jobs for external contributors (they need cross-OS artifacts). Estimated savings: 3-5 min per CI run.
Validate caching gains: Compare CI timing before/after to confirm actual savings. Run a single diagnostic PR with date commands at each step boundary to establish baseline.

Phase 2: Auto-Fix Loop Enhancement

Auto-trigger fix loop for privileged authors: Modify pr-review-responder.yml to trigger for all privileged-author PRs, not just those with cc:request label. Auto-apply cc:request on PR creation for privileged authors, or remove the label check entirely.
Move pr-review-responder to self-hosted macOS runners: Currently runs on ubuntu-latest which cannot execute Electron Playwright tests. Moving to self-hosted enables the fix loop to actually fix E2E failures, not just lint/type/unit failures.
Add skip-review for trivial fixes: When the fix commit only changes lint/formatting or updates snapshots, automatically skip re-triggering the Claude PR Review. Use a heuristic: if diff is under N lines and only touches files mentioned in resolved review comments.
Add notification comments: When auto-fix triggers, post a brief PR comment ("CI failed on lint. Auto-fix triggered.") so the developer has context without reading Action logs.

Phase 3: Tiered Test Selection

Implement change classification in check-changes job: Classify changed files into tiers:
- Tier 0 (docs/config/.claude/rules only): Skip CI entirely (already exists)
- Tier 1 (test files only): Run lint + type-check + only the changed test files
- Tier 2 (source code, no schema/config changes): Run lint + type-check + unit tests + targeted E2E
- Tier 3 (broad changes, deps, config): Full CI suite
Tag E2E tests by feature area: Add metadata to Playwright tests indicating which source files/features they cover. Start with coarse groupings (chat, database, deployment, settings, etc.).
Add full-ci escape label: Allow forcing full CI via a label for cases where the tier classification seems wrong. Post a CI comment explaining which tier was selected and why.
Maintain nightly full-suite coverage: The existing deflake-e2e scheduled run provides a safety net for regressions missed by tiered selection.

Phase 4: Developer Experience

Create unified /pr skill: Single entry point that inspects state and routes:
- Uncommitted changes → push flow (lint, commit, push, create/update PR)
- Open PR with failing checks → fix flow (diagnose + fix CI failures)
- Open PR with unresolved review comments → comment fix flow
- Open PR with all checks passing → suggest adding "merge-when-ready" label
- Open PR needing rebase → rebase flow
- Fold /fast-push behavior into /pr as the default for deterministic operations (use haiku), with opus reserved for reasoning-heavy tasks (conflict resolution, complex review responses). Deprecate /fast-push as a separate entry point.
Create /pr-watch skill: Background watcher that polls gh pr checks and notifies on completion. Should surface specific failure type (lint, type-check, unit test, E2E) so developer knows whether to intervene or let auto-fix handle it. Could be as simple as gh pr checks --watch with a terminal bell on exit.
Add pre-push snapshot warning: If UI source files changed but E2E snapshot files didn't, warn that snapshots likely need updating before push. Lightweight heuristic to prevent the most common E2E failure cause (~30% of E2E failures).

Testing Strategy

CI timing: Compare wall-clock CI duration before/after each phase. Target: 30-50% reduction for typical PRs.
Cache correctness: Verify builds with cached dependencies produce identical outputs to clean builds. Run both in parallel for a verification period.
Tier classification accuracy: Run full CI alongside tiered CI for 2 weeks to measure false-negative rate (changes that should have triggered full CI but didn't).
Auto-fix loop effectiveness: Track how often the fix loop resolves issues automatically vs. escalating to human. Target: fix loop handles >80% of CI failures without human intervention.
Quality regression: Monitor for bugs/reverts within 24 hours of merge. Must not increase from current baseline.

Risks & Mitigations

Risk	Likelihood	Impact	Mitigation
Stale caches cause false CI passes on self-hosted runners	Medium	High	Hash-based cache keys, periodic full cache clear, nightly clean build
Tiered test selection misses regressions	Medium	High	Conservative tier classification, `full-ci` escape label, nightly full suite, always full CI on `main` push
Auto-fix loop on self-hosted runners consumes capacity	Low	Medium	2+ runners available; add concurrency limits if needed
Merged build+E2E job can't retry just E2E on flakes	Medium	Medium	Monitor flake rate; revert to split jobs if flake-retry savings > artifact-transfer savings
Unified `/pr` skill makes wrong routing decision	Low	Low	Show what action will be taken before executing; sub-commands for manual override
Moving pr-review-responder to self-hosted increases runner load	Low	Low	2+ runners available; responder runs are short-lived

Open Questions

What is the actual CI time breakdown per job? Phase 0 instrumentation will answer this. The answer determines whether Phase 1 (caching) or Phase 3 (test selection) delivers more value.
How often does the auto-fix loop iterate beyond 1 cycle? If most PRs converge in 1 iteration, loop optimization matters less. If 3-4 iterations are common, optimizing the loop is critical path.
What is the E2E flake rate? Determines whether merging build+E2E jobs (losing independent E2E retry) is a net win or net loss.
How should E2E tests be tagged/grouped? Coarse groupings (by feature area) are easier to maintain but less precise. Fine-grained file-level dependency tracking is more accurate but harder to keep updated. Start coarse and refine.

Decision Log

Decision	Options Considered	Chosen	Reasoning
Auto-merge approach	Default auto-merge, auto-merge on cc:done, keep current	Keep current	User prefers the deliberate "merge-when-ready" checkpoint. Focus speed improvements elsewhere.
Target persona	Maintainer only, both maintainer + contributors	Maintainer only	Focused scope delivers faster. External contributor improvements deferred.
Fast push implementation	Bash script, optimize existing skill, both paths	Optimize existing skill	Avoids maintaining two implementations. User prefers single path.
Runner capacity	Constrained (1 runner), unconstrained (2+)	2+ runners available	Runner capacity is not a constraint. Can add more concurrent workflows freely.
Review blocking merge	Non-blocking (post-merge review), blocking	Keep blocking	Team consensus: post-merge review comments have low compliance. Keeping review as blocking preserves quality gate. Speed up the review instead of removing it.
Bash fast-push script	Build it, don't build it	Don't build	UX team raised valid concern about fragmentation. Optimize `/fast-push` skill instead.

Generated by dyad:swarm-to-plan