docs/backend-migration/notes/team-operations-playbook.md
Practical rules for running team-mode migrations in this codebase. Append new lessons as they happen; newest on top.
All team-mode pilots MUST follow this worktree + PR convention.
origin/mainorigin/main, local-only during the pilotmain.
Do NOT git merge coord branch directly into main — always go through
GitHub PR for review + CI./Users/zhoukai/Documents/worktrees/aionui-backend-<topic>/origin/feat/backend-migrationorigin/feat/backend-migration, local-onlyfeat/backend-migration. Do NOT merge directly./Users/zhoukai/Documents/worktrees/aionui-<topic>/main / feat/backend-migration are integration points other consumers
expect to be green. PR gates give reviewable diff + CI.origin/<base> (not another pilot's in-flight coord branch) avoids
cross-pilot state collision when concurrent pilots run.After T3 green:
gh pr create --base main (backend) or --base feat/backend-migration (AionUi).main / feat/backend-migration happens AFTER PR merge,
via git pull.Symptom: Teammate sends idle notification then goes silent. Messages sent
by coordinator land with read=True but no TaskUpdate, no git changes, no
follow-up. Idle notifications may stop entirely.
Diagnosis threshold: 10 minutes of zero activity after a message was read.
Silence with NO git changes and NO new messages for 10+ min is dead, not busy.
(Long cargo build can be 10-20 min; check git/inbox before assuming dead.)
Diagnose before messaging: TaskList() + git status + inbox read state
before sending another message. Multiple "please execute" messages to a dead
agent compound confusion without effect.
Replacement protocol (autonomous, do NOT ask user):
~/.claude/teams/{team}/config.json — remove dead member from members.
Use python3 -c with json.load/dump for safety.rm ~/.claude/teams/{team}/inboxes/{name}.json — clear stale messages.The replacement protocol is deterministic and safe. Asking for approval each time just slows recovery. User can override after the fact.
No diagnostic questions to user either. Don't ask "was that rebase you?",
"did you change X?". Figure it out from git/fs directly. If the new spawn
needs resilience against unknown remote state, bake
git fetch && git reset --hard origin/<branch> into the spawn prompt — the
new agent self-heals.
Spawn prompt discipline (to avoid repeat zombies):
TaskGet + read plan file rather
than duplicating detail in the prompt.Problem: Coordinator can lose visibility for long stretches because teammate completions publish via git commits + TaskUpdate, not just inbox messages. Idle-notification noise can mask real completions.
Rules:
TaskList, git log -3 on both repos, read last 10 inbox entries.When teammate reports "task complete", coordinator verifies:
cd <repo>
git fetch origin <branch>
git log origin/<branch> --oneline -3 # must include teammate's new commit
Do NOT accept "completed" without this check. If teammate's SHA is missing, reopen (TaskUpdate in_progress) and send a precise commit+push instruction. Frame it explicitly as "NOT a replay" — teammates sometimes misread re-assignment-looking messages.
Spawn prompt boilerplate: "'Task complete' means commit + push succeeded
AND upstream sync verified, not just tests pass. Run
git log origin/<branch> --oneline -1 before claiming completion and confirm
your SHA is there."
Every migration-class plan must include:
enabled=false
on built-ins, sort_order, last_used_at). For each: preserve, drop
explicitly, or document.find src/process/resources -type d post-migration returns only active dirs.
(Grep-clean isn't enough — a directory with zero refs still carries size
and misleads future readers.)Why hand-crafted tests miss real-user bugs: Frontend unit tests mock the bridge. Backend integration tests validate endpoints in isolation. E2E fixtures are designed by the plan author and can't predict the shape of "broken input from v1.9.17" or legacy flags users actually have.
Any "which side adapts?" question must first run:
git log --oneline -- crates/aionui-api-types/
If there's a recent blanket refactor (e.g. dae96f8 "remove camelCase serde
rename from all aionui-api-types structs"), its decision is the oracle.
The other side adapts.
If no recent refactor: check AGENTS.md §API Conventions. If still unclear, it's a design decision to brainstorm with the user, NOT to hot-patch.
Under no circumstance pick direction by "what makes the failing test pass
right now." That path produced a camelCase island inside a snake_case
project during the builtin-skill pilot. See memory
feedback_wire_flip_audit_scope.md for audit-scope rules when flipping wire
keys.
When a pilot's motivation is "kill sibling assets/ assumption", verifying
the fix requires running the binary in a state where no sibling assets/
exist. A dev-machine cargo run always has the sibling via
target/debug/assets/; full Electron bun run build is ~15 min.
Mid-fidelity 2-minute smoke (run during T4 closure for any asset-delivery pilot):
cargo build --release
TMPDIR=$(mktemp -d)
cp target/release/aionui-backend "$TMPDIR/"
# Note: no `assets/` copied
"$TMPDIR/aionui-backend" --local --port 25905 --data-dir "$TMPDIR/data" &
# ... curl checks
Populated endpoints → binary is genuinely self-contained. Empty → packaging bug exists even if dev builds "work".
aionui-api-types/src/* public structs with
Serialize/Deserialize derive should assert rename_all = "camelCase"
(or explicit opt-out). ~50-line test in aionui-api-types/tests/. Filed in
coordinator-builtin-skill-migration-2026-04-23.md §Followups.