docs/plans/2026-01-05-plan-check-run-ha.md
For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Make the plan check run scheduler HA-compatible using database-level atomic claiming.
Architecture: Add AVAILABLE status to plan check runs. Scheduler atomically claims AVAILABLE runs using FOR UPDATE SKIP LOCKED, transitioning them to RUNNING. Remove in-memory tracking from bus.
Tech Stack: Go, PostgreSQL, protobuf
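For intuition, the claiming statement at the core of this plan looks like the following when shown standalone (the store task below builds the same query through the query builder). Because of `FOR UPDATE SKIP LOCKED`, a second scheduler executing this statement at the same time skips rows the first has already locked instead of blocking on them, so each run is claimed by exactly one scheduler.

```sql
-- Illustration of the claim: lock AVAILABLE rows, flip them to RUNNING,
-- and return what was claimed. A concurrent scheduler running the same
-- statement skips the locked rows rather than waiting on them.
UPDATE plan_check_run
SET status = 'RUNNING', updated_at = now()
WHERE id IN (
    SELECT id FROM plan_check_run
    WHERE status = 'AVAILABLE'
    FOR UPDATE SKIP LOCKED
)
RETURNING id, plan_id;
```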
Files:
- backend/store/plan_check_run.go:18-27

Step 1: Add the new status constant
In backend/store/plan_check_run.go, add PlanCheckRunStatusAvailable after line 18:
```go
const (
	// PlanCheckRunStatusAvailable is the plan check status for AVAILABLE.
	PlanCheckRunStatusAvailable PlanCheckRunStatus = "AVAILABLE"
	// PlanCheckRunStatusRunning is the plan check status for RUNNING.
	PlanCheckRunStatusRunning PlanCheckRunStatus = "RUNNING"
	// PlanCheckRunStatusDone is the plan check status for DONE.
	PlanCheckRunStatusDone PlanCheckRunStatus = "DONE"
	// PlanCheckRunStatusFailed is the plan check status for FAILED.
	PlanCheckRunStatusFailed PlanCheckRunStatus = "FAILED"
	// PlanCheckRunStatusCanceled is the plan check status for CANCELED.
	PlanCheckRunStatusCanceled PlanCheckRunStatus = "CANCELED"
)
```
Step 2: Verify build
Run: go build ./backend/store/...
Expected: Build succeeds
Step 3: Commit
but commit plan-check-ha -m "feat(store): add AVAILABLE status for plan check runs"
Files:
- backend/store/plan_check_run.go

Step 1: Add ClaimedPlanCheckRun struct and claiming function
Add after BatchCancelPlanCheckRuns function (after line 189):
```go
// ClaimedPlanCheckRun represents a plan check run that was atomically claimed.
type ClaimedPlanCheckRun struct {
	UID     int
	PlanUID int64
}

// ClaimAvailablePlanCheckRuns atomically claims all AVAILABLE plan check runs by updating them to RUNNING
// and returns the claimed UIDs. Uses FOR UPDATE SKIP LOCKED to allow concurrent schedulers to claim different runs.
func (s *Store) ClaimAvailablePlanCheckRuns(ctx context.Context) ([]*ClaimedPlanCheckRun, error) {
	q := qb.Q().Space(`
		UPDATE plan_check_run
		SET status = ?, updated_at = now()
		WHERE id IN (
			SELECT id FROM plan_check_run
			WHERE status = ?
			FOR UPDATE SKIP LOCKED
		)
		RETURNING id, plan_id
	`, PlanCheckRunStatusRunning, PlanCheckRunStatusAvailable)
	query, args, err := q.ToSQL()
	if err != nil {
		return nil, errors.Wrapf(err, "failed to build sql")
	}
	rows, err := s.GetDB().QueryContext(ctx, query, args...)
	if err != nil {
		return nil, errors.Wrapf(err, "failed to claim plan check runs")
	}
	defer rows.Close()
	var claimed []*ClaimedPlanCheckRun
	for rows.Next() {
		var c ClaimedPlanCheckRun
		if err := rows.Scan(&c.UID, &c.PlanUID); err != nil {
			return nil, err
		}
		claimed = append(claimed, &c)
	}
	if err := rows.Err(); err != nil {
		return nil, err
	}
	return claimed, nil
}
```
Step 2: Verify build
Run: go build ./backend/store/...
Expected: Build succeeds
Step 3: Commit
but commit plan-check-ha -m "feat(store): add ClaimAvailablePlanCheckRuns for HA scheduling"
Files:
- backend/store/plan_check_run.go:50-68

Step 1: Modify CreatePlanCheckRun to always use AVAILABLE
Replace the CreatePlanCheckRun function to ignore the passed status and always use AVAILABLE:
```go
// CreatePlanCheckRun creates or replaces the plan check run for a plan.
// Always creates with AVAILABLE status for HA-safe scheduling.
func (s *Store) CreatePlanCheckRun(ctx context.Context, create *PlanCheckRunMessage) error {
	result, err := protojson.Marshal(create.Result)
	if err != nil {
		return errors.Wrapf(err, "failed to marshal result")
	}
	query := `
		INSERT INTO plan_check_run (plan_id, status, result)
		VALUES ($1, $2, $3)
		ON CONFLICT (plan_id) DO UPDATE SET
			status = EXCLUDED.status,
			result = EXCLUDED.result,
			updated_at = now()
	`
	if _, err := s.GetDB().ExecContext(ctx, query, create.PlanUID, PlanCheckRunStatusAvailable, result); err != nil {
		return errors.Wrapf(err, "failed to upsert plan check run")
	}
	return nil
}
```
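One assumption worth calling out: `ON CONFLICT (plan_id)` only works if the table has a unique constraint or unique index on plan_id. If the current schema does not already enforce one plan check run per plan, the upsert will error at runtime; a guard like the following (index name illustrative) would satisfy it:

```sql
-- Required for the ON CONFLICT (plan_id) clause above; skip if the schema
-- already enforces one plan check run per plan. Index name is illustrative.
CREATE UNIQUE INDEX IF NOT EXISTS idx_plan_check_run_plan_id ON plan_check_run (plan_id);
```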
Step 2: Verify build
Run: go build ./backend/store/...
Expected: Build succeeds
Step 3: Commit
but commit plan-check-ha -m "feat(store): CreatePlanCheckRun always uses AVAILABLE status"
Files:
- backend/component/bus/bus.go:19-22

Step 1: Remove RunningPlanChecks and RunningPlanCheckRunsCancelFunc
Remove these two lines from the Bus struct:
```go
	// RunningPlanChecks is the set of running plan checks.
	RunningPlanChecks sync.Map
	// RunningPlanCheckRunsCancelFunc is the cancelFunc of running plan checks.
	RunningPlanCheckRunsCancelFunc sync.Map // map[planCheckRunUID]context.CancelFunc
```
The Bus struct should now look like:
```go
// Bus is the message bus for all in-memory communication within the server.
type Bus struct {
	// ApprovalCheckChan signals when an issue needs approval template finding.
	// Triggered by plan check completion, issue creation (if checks already done).
	ApprovalCheckChan    chan int64 // issue UID
	TaskRunSchedulerInfo sync.Map   // map[taskRunID]*storepb.SchedulerInfo
	// RunningTaskRunsCancelFunc is the cancelFunc of running taskruns.
	RunningTaskRunsCancelFunc sync.Map // map[taskRunID]context.CancelFunc
	// PlanCheckTickleChan is the tickler for plan check scheduler.
	PlanCheckTickleChan chan int
	// TaskRunTickleChan is the tickler for task run scheduler.
	TaskRunTickleChan chan int
	// RolloutCreationChan is the channel for automatic rollout creation.
	RolloutCreationChan chan int64
	// PlanCompletionCheckChan signals when a plan might be complete (for PIPELINE_COMPLETED webhook).
	PlanCompletionCheckChan chan int64
}
```
Step 2: Verify build fails (expected - scheduler still references these)
Run: go build ./backend/...
Expected: Build fails with references to deleted fields
Step 3: Commit (partial - will fix in next task)
but commit plan-check-ha -m "refactor(bus): remove in-memory plan check tracking"
Files:
- backend/runner/plancheck/scheduler.go

Step 1: Update runOnce to use claiming
Replace the runOnce function:
```go
func (s *Scheduler) runOnce(ctx context.Context) {
	defer func() {
		if r := recover(); r != nil {
			err, ok := r.(error)
			if !ok {
				err = errors.Errorf("%v", r)
			}
			slog.Error("Plan check scheduler PANIC RECOVER", log.BBError(err), log.BBStack("panic-stack"))
		}
	}()

	claimed, err := s.store.ClaimAvailablePlanCheckRuns(ctx)
	if err != nil {
		slog.Error("failed to claim available plan check runs", log.BBError(err))
		return
	}
	for _, c := range claimed {
		go s.runPlanCheckRun(ctx, c.UID, c.PlanUID)
	}
}
```
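For orientation, runOnce is presumably driven by the scheduler's existing run loop, roughly like the sketch below. This is not the verbatim Start method from scheduler.go; the ticker interval is an assumption, though PlanCheckTickleChan does exist on the bus.

```go
// Sketch of the assumed driver loop: poll on a ticker and on the tickle
// channel, claiming whatever is AVAILABLE on each pass.
func (s *Scheduler) loop(ctx context.Context) {
	ticker := time.NewTicker(5 * time.Second) // interval is an assumption
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			s.runOnce(ctx)
		case <-s.bus.PlanCheckTickleChan:
			s.runOnce(ctx)
		}
	}
}
```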
Step 2: Update runPlanCheckRun signature and remove in-memory tracking
Replace the runPlanCheckRun function:
```go
func (s *Scheduler) runPlanCheckRun(ctx context.Context, uid int, planUID int64) {
	// Fetch plan to derive check targets at runtime.
	plan, err := s.store.GetPlan(ctx, &store.FindPlanMessage{UID: &planUID})
	if err != nil {
		s.markPlanCheckRunFailed(ctx, uid, planUID, err.Error())
		return
	}
	if plan == nil {
		s.markPlanCheckRunFailed(ctx, uid, planUID, "plan not found")
		return
	}
	project, err := s.store.GetProject(ctx, &store.FindProjectMessage{ResourceID: &plan.ProjectID})
	if err != nil {
		s.markPlanCheckRunFailed(ctx, uid, planUID, err.Error())
		return
	}
	if project == nil {
		s.markPlanCheckRunFailed(ctx, uid, planUID, "project not found")
		return
	}
	// Get database group if needed (for spec expansion).
	databaseGroup, err := s.getDatabaseGroupForPlan(ctx, plan)
	if err != nil {
		s.markPlanCheckRunFailed(ctx, uid, planUID, err.Error())
		return
	}
	// Derive check targets from plan.
	targets, err := DeriveCheckTargets(project, plan, databaseGroup)
	if err != nil {
		s.markPlanCheckRunFailed(ctx, uid, planUID, err.Error())
		return
	}

	var results []*storepb.PlanCheckRunResult_Result
	for _, target := range targets {
		targetResults, targetErr := s.executor.RunForTarget(ctx, target)
		if targetErr != nil {
			err = targetErr
			break
		}
		results = append(results, targetResults...)
	}
	if err != nil {
		if errors.Is(err, context.Canceled) {
			s.markPlanCheckRunCanceled(ctx, uid, planUID, err.Error())
		} else {
			s.markPlanCheckRunFailed(ctx, uid, planUID, err.Error())
		}
	} else {
		s.markPlanCheckRunDone(ctx, uid, planUID, results)
	}
}
```
Step 3: Update helper functions to use uid and planUID parameters
Replace the three mark functions:
```go
func (s *Scheduler) markPlanCheckRunDone(ctx context.Context, uid int, planUID int64, results []*storepb.PlanCheckRunResult_Result) {
	result := &storepb.PlanCheckRunResult{
		Results: results,
	}
	if err := s.store.UpdatePlanCheckRun(ctx,
		store.PlanCheckRunStatusDone,
		result,
		uid,
	); err != nil {
		slog.Error("failed to mark plan check run done", log.BBError(err))
		return
	}
	// Auto-create rollout if plan checks pass.
	issue, err := s.store.GetIssue(ctx, &store.FindIssueMessage{PlanUID: &planUID})
	if err != nil {
		slog.Error("failed to get issue for approval check after plan check",
			slog.Int("plan_id", int(planUID)),
			log.BBError(err))
		return
	}
	if issue != nil && issue.PlanUID != nil {
		// Trigger approval finding.
		s.bus.ApprovalCheckChan <- int64(issue.UID)
		// Trigger rollout creation (existing behavior).
		s.bus.RolloutCreationChan <- planUID
	}
}

func (s *Scheduler) markPlanCheckRunFailed(ctx context.Context, uid int, planUID int64, reason string) {
	result := &storepb.PlanCheckRunResult{
		Error: reason,
	}
	if err := s.store.UpdatePlanCheckRun(ctx,
		store.PlanCheckRunStatusFailed,
		result,
		uid,
	); err != nil {
		slog.Error("failed to mark plan check run failed", log.BBError(err))
	}
}

func (s *Scheduler) markPlanCheckRunCanceled(ctx context.Context, uid int, planUID int64, reason string) {
	result := &storepb.PlanCheckRunResult{
		Error: reason,
	}
	if err := s.store.UpdatePlanCheckRun(ctx,
		store.PlanCheckRunStatusCanceled,
		result,
		uid,
	); err != nil {
		slog.Error("failed to mark plan check run canceled", log.BBError(err))
	}
}
```
Step 4: Verify build
Run: go build ./backend/...
Expected: Build succeeds
Step 5: Commit
but commit plan-check-ha -m "refactor(plancheck): use atomic claiming instead of in-memory tracking"
Files:
- backend/migrator/migration/3.14/0022##plan_check_run_ha.sql
- backend/migrator/migration/LATEST.sql:218,225

Step 1: Create migration file
Create backend/migrator/migration/3.14/0022##plan_check_run_ha.sql:
```sql
-- Add AVAILABLE status for HA-compatible plan check scheduling.
-- Uses FOR UPDATE SKIP LOCKED pattern for atomic claiming.

-- Update status constraint to include AVAILABLE
ALTER TABLE plan_check_run
    DROP CONSTRAINT plan_check_run_status_check,
    ADD CONSTRAINT plan_check_run_status_check
        CHECK (status IN ('AVAILABLE', 'RUNNING', 'DONE', 'FAILED', 'CANCELED'));

-- Convert existing RUNNING to AVAILABLE (will be re-claimed after deployment)
UPDATE plan_check_run SET status = 'AVAILABLE' WHERE status = 'RUNNING';

-- Update index to include AVAILABLE for efficient claiming
DROP INDEX IF EXISTS idx_plan_check_run_active_status;
CREATE INDEX idx_plan_check_run_active_status ON plan_check_run(status, id) WHERE status IN ('AVAILABLE', 'RUNNING');
```
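As an optional sanity check after the migration applies, confirm the status conversion took effect (no RUNNING rows should remain until schedulers re-claim):

```sql
-- Expect zero RUNNING rows immediately after the migration; anything that
-- was RUNNING should now be AVAILABLE, awaiting re-claim.
SELECT status, count(*) FROM plan_check_run GROUP BY status;
```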
Step 2: Update LATEST.sql constraint
In backend/migrator/migration/LATEST.sql, change line 218 from:
```sql
status text NOT NULL CHECK (status IN ('RUNNING', 'DONE', 'FAILED', 'CANCELED')),
```
to:
```sql
status text NOT NULL CHECK (status IN ('AVAILABLE', 'RUNNING', 'DONE', 'FAILED', 'CANCELED')),
```
Step 3: Update LATEST.sql index
In backend/migrator/migration/LATEST.sql, change line 225 from:
```sql
CREATE INDEX idx_plan_check_run_active_status ON plan_check_run(status, id) WHERE status = 'RUNNING';
```
to:
```sql
CREATE INDEX idx_plan_check_run_active_status ON plan_check_run(status, id) WHERE status IN ('AVAILABLE', 'RUNNING');
```
Step 4: Commit
but commit plan-check-ha -m "chore(migration): add AVAILABLE status for plan check runs"
Files:
- backend/migrator/migrator_test.go

Step 1: Find and update TestLatestVersion
Search for TestLatestVersion and update the migration count to include the new migration file.
Run: grep -n "TestLatestVersion\|3.14" backend/migrator/migrator_test.go
Update the 3.14 entry to include the new migration count (should be 22 now).
Step 2: Verify test
Run: go test -v -count=1 github.com/bytebase/bytebase/backend/migrator -run ^TestLatestVersion$
Expected: Test passes
Step 3: Commit
but commit plan-check-ha -m "test(migrator): update version for plan check run HA migration"
Step 1: Run golangci-lint
Run: golangci-lint run --allow-parallel-runners
Step 2: Fix any issues reported
Common issues to watch for: unused function parameters left behind by the refactor. For example, planUID is unused in markPlanCheckRunFailed and markPlanCheckRunCanceled; rename it to `_` if the linter flags it.

Step 3: Run lint again until clean
Run: golangci-lint run --allow-parallel-runners
Expected: No issues
Step 4: Commit if any fixes
but commit plan-check-ha -m "fix: address linter issues"
Step 1: Full backend build
Run: go build -ldflags "-w -s" -p=16 -o ./bytebase-build/bytebase ./backend/bin/server/main.go
Expected: Build succeeds
Step 2: Run related tests
Run: go test -v -count=1 github.com/bytebase/bytebase/backend/store -run PlanCheck
Run: go test -v -count=1 github.com/bytebase/bytebase/backend/runner/plancheck/...
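If no test covers concurrent claiming yet, a sketch along these lines would exercise the SKIP LOCKED guarantee. The testStore helper is hypothetical; adapt it to the store package's existing test fixtures.

```go
// Sketch: two concurrent claimers must never return the same run.
// testStore is a hypothetical helper that yields a *store.Store backed by
// a test database seeded with AVAILABLE plan check runs.
func TestClaimAvailablePlanCheckRunsConcurrent(t *testing.T) {
	ctx := context.Background()
	s := testStore(t)

	var mu sync.Mutex
	seen := map[int]int{} // claimed UID -> number of times claimed
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			claimed, err := s.ClaimAvailablePlanCheckRuns(ctx)
			if err != nil {
				t.Error(err)
				return
			}
			mu.Lock()
			defer mu.Unlock()
			for _, c := range claimed {
				seen[c.UID]++
			}
		}()
	}
	wg.Wait()
	for uid, n := range seen {
		if n > 1 {
			t.Errorf("plan check run %d claimed %d times", uid, n)
		}
	}
}
```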
Step 3: Commit if any fixes needed
Step 1: Push branch
but push plan-check-ha
Step 2: Create PR
```bash
gh pr create --base main --head plan-check-ha \
  --title "feat: make plan check run scheduler HA compatible" \
  --body "$(cat <<'EOF'
## Summary
- Add `AVAILABLE` status to plan check runs
- Implement atomic claiming with `FOR UPDATE SKIP LOCKED`
- Remove in-memory tracking from bus component
- Follows the same HA pattern as task run scheduler

## Test plan
- [ ] Verify plan checks still execute correctly in single-instance mode
- [ ] Verify migration applies cleanly
- [ ] Verify existing RUNNING plan checks are re-executed after deployment

🤖 Generated with [Claude Code](https://claude.com/claude-code)
EOF
)"
```