docs/plans/2025-12-30-task-scheduler-available-status-design.md
Refactor the task scheduler to introduce a dedicated AVAILABLE status between PENDING and RUNNING. This separates "ready to execute" from "actively executing", preparing the foundation for High Availability (HA) scheduler instances.
PENDING → AVAILABLE → RUNNING → DONE/FAILED/CANCELED
| Status | Meaning |
|---|---|
| PENDING | Waiting for constraints: RunAt time, version ordering, database locks, parallel limits |
| AVAILABLE | All constraints satisfied, ready for immediate execution by any scheduler instance |
| RUNNING | Actively executing |
| DONE/FAILED/CANCELED | Terminal states |
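The status flow and table above can be read as a small state machine. The following is a minimal sketch of that reading, not the scheduler's actual code: the `Status` type, the `allowed` map, and `CanTransition` are hypothetical names introduced here for illustration, and cancellation from PENDING or AVAILABLE is deliberately left out because the flow above does not list it.

```go
package main

import "fmt"

// Status mirrors the task run states named in this design.
type Status string

const (
	Pending   Status = "PENDING"
	Available Status = "AVAILABLE"
	Running   Status = "RUNNING"
	Done      Status = "DONE"
	Failed    Status = "FAILED"
	Canceled  Status = "CANCELED"
)

// allowed lists only the forward transitions shown in the flow above.
// Assumption: no other transitions (for example canceling a PENDING task)
// are modeled, because the design text does not describe them here.
var allowed = map[Status][]Status{
	Pending:   {Available},
	Available: {Running},
	Running:   {Done, Failed, Canceled},
}

// CanTransition reports whether from -> to is one of the listed transitions.
func CanTransition(from, to Status) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(Pending, Available)) // true
	fmt.Println(CanTransition(Pending, Running))   // false: must pass through AVAILABLE
	fmt.Println(CanTransition(Done, Running))      // false: terminal states have no exits
}
```

Keeping the transitions in one table like this makes it easy to see that PENDING can no longer jump straight to RUNNING.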
- PENDING → AVAILABLE: pending_scheduler promotes when ALL gating checks pass
- AVAILABLE → RUNNING: running_scheduler atomically claims via optimistic locking
- RUNNING → terminal: running_scheduler updates after execution completes

pending_scheduler responsibility: PENDING → AVAILABLE transitions with all gating logic.
Every 5 seconds:
1. Query task_run WHERE status = 'PENDING'
2. Track in-memory for this round:
   - availableDBs: databases already promoted this round
   - rolloutCounts: tasks promoted per rollout this round
3. For each pending task:
   a. Check RunAt time
   b. Check version ordering (no smaller versions pending/available/running on same DB)
   c. Check database mutual exclusion (sequential tasks only):
      - Skip if availableDBs[database_id] is set
      - Query: no RUNNING or AVAILABLE on same database
   d. Check parallel limit:
      - currentCount = COUNT(*) WHERE rollout_id = ? AND status IN ('RUNNING', 'AVAILABLE')
      - Skip if currentCount + rolloutCounts[rollout_id] >= limit
   e. If all pass:
      - UPDATE status = 'AVAILABLE'
      - availableDBs[database_id] = true
      - rolloutCounts[rollout_id]++
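The promotion round above can be sketched in Go as follows. This is an illustration under stated assumptions rather than the real pending_scheduler: `TaskRun`, `Snapshot`, and `promoteRound` are hypothetical names, the database queries are replaced by a pre-computed in-memory snapshot, and the version-ordering check (step 3b) is omitted for brevity.

```go
package main

import (
	"fmt"
	"time"
)

// TaskRun is a hypothetical, trimmed-down view of a task_run row.
type TaskRun struct {
	ID         int
	DatabaseID int
	RolloutID  int
	RunAt      time.Time
	Sequential bool // true if the task needs exclusive access to its database
}

// Snapshot stands in for what the store would report when the round starts.
type Snapshot struct {
	BusyDBs       map[int]bool // databases that already have a RUNNING or AVAILABLE task
	ActivePerRoll map[int]int  // RUNNING + AVAILABLE count per rollout
}

// promoteRound returns the task IDs that would be moved to AVAILABLE.
// Version ordering (step 3b) is intentionally not modeled in this sketch.
func promoteRound(pending []TaskRun, snap Snapshot, limit int, now time.Time) []int {
	availableDBs := map[int]bool{} // database_id -> promoted this round
	rolloutCounts := map[int]int{} // rollout_id -> count promoted this round
	var promoted []int
	for _, t := range pending {
		if t.RunAt.After(now) {
			continue // RunAt not reached yet
		}
		if t.Sequential && (availableDBs[t.DatabaseID] || snap.BusyDBs[t.DatabaseID]) {
			continue // database mutual exclusion
		}
		if snap.ActivePerRoll[t.RolloutID]+rolloutCounts[t.RolloutID] >= limit {
			continue // parallel limit reached for this rollout
		}
		promoted = append(promoted, t.ID)
		availableDBs[t.DatabaseID] = true
		rolloutCounts[t.RolloutID]++
	}
	return promoted
}

func main() {
	now := time.Now()
	pending := []TaskRun{
		{ID: 1, DatabaseID: 10, RolloutID: 1, RunAt: now, Sequential: true},
		{ID: 2, DatabaseID: 10, RolloutID: 1, RunAt: now, Sequential: true},                // same database as task 1
		{ID: 3, DatabaseID: 11, RolloutID: 1, RunAt: now.Add(time.Hour), Sequential: true}, // not due yet
		{ID: 4, DatabaseID: 12, RolloutID: 1, RunAt: now, Sequential: true},
		{ID: 5, DatabaseID: 13, RolloutID: 1, RunAt: now, Sequential: true}, // exceeds the limit of 2
	}
	snap := Snapshot{BusyDBs: map[int]bool{}, ActivePerRoll: map[int]int{}}
	fmt.Println(promoteRound(pending, snap, 2, now)) // [1 4]
}
```

With a parallel limit of 2, only tasks 1 and 4 are promoted: task 2 loses to task 1 on the same database, task 3 is not yet due, and task 5 would exceed the rollout limit.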
running_scheduler responsibility: claim AVAILABLE tasks and execute them. No gating logic.
Every 5 seconds (or when tickled):
1. Query task_run WHERE status = 'AVAILABLE'
2. For each available task:
   a. Attempt atomic claim:
      UPDATE task_run SET status = 'RUNNING', started_at = NOW()
      WHERE id = ? AND status = 'AVAILABLE'
   b. If claim succeeds (rows affected = 1):
      - Spawn goroutine to execute task
   c. If claim fails (rows affected = 0):
      - Another instance claimed it, skip
3. Re-execute orphaned RUNNING tasks on startup (maintains current behavior)
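To illustrate why the conditional UPDATE in step 2a lets only one instance win, here is a small in-memory simulation. It is a sketch, not the running_scheduler itself: the `store` type and the `claim` and `race` functions are hypothetical, and a mutex stands in for the row-level atomicity that the database provides for a single UPDATE statement.

```go
package main

import (
	"fmt"
	"sync"
)

// store is an in-memory stand-in for the task_run table (id -> status).
type store struct {
	mu     sync.Mutex
	status map[int]string
}

// claim mimics:
//   UPDATE task_run SET status = 'RUNNING' WHERE id = ? AND status = 'AVAILABLE'
// and returns the number of rows affected (0 or 1).
func (s *store) claim(id int) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.status[id] != "AVAILABLE" {
		return 0 // someone else already claimed it, or it was never available
	}
	s.status[id] = "RUNNING"
	return 1
}

// race lets n simulated scheduler instances try to claim the same task
// concurrently and returns how many of them succeeded.
func race(s *store, id, n int) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	wins := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if s.claim(id) == 1 {
				mu.Lock()
				wins++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return wins
}

func main() {
	s := &store{status: map[int]string{42: "AVAILABLE"}}
	fmt.Println(race(s, 42, 8), s.status[42]) // 1 RUNNING
}
```

However many instances race, exactly one sees "rows affected = 1" and goes on to execute the task; the rest skip it.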
running_scheduler no longer maintains:
- RunningDatabaseMigration map

All gating logic is centralized in pending_scheduler.
- RunAt time: task_run.run_at <= NOW()
- Parallel limit: COUNT(*) WHERE rollout_id = ? AND status IN ('RUNNING', 'AVAILABLE') < limit

Within a single pending_scheduler iteration, track locally:
availableDBs := map[int]bool{} // database_id -> has AVAILABLE this round
rolloutCounts := map[int]int{} // rollout_id -> count promoted this round
This prevents marking multiple tasks AVAILABLE for the same database or exceeding rollout limits within one loop.
For HA with multiple scheduler instances, atomic check-and-update may be needed:
UPDATE task_run
SET status = 'AVAILABLE'
WHERE id = ?
AND status = 'PENDING'
AND (SELECT COUNT(*) FROM task_run
WHERE rollout_id = ? AND status IN ('RUNNING', 'AVAILABLE')) < ?
This is future work; current implementation assumes single pending_scheduler instance.
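If that future work is picked up, the statement above could be wrapped roughly as follows. This is a hedged sketch: `tryPromote`, `execer`, and the `fakeDB` test double are hypothetical names, and the `$1`-style placeholders assume a PostgreSQL driver. Whether this single statement is sufficient under concurrent promoters depends on the transaction isolation level, which this sketch does not settle.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
)

// execer is satisfied by *sql.DB and *sql.Tx, and by the fake below.
type execer interface {
	ExecContext(ctx context.Context, query string, args ...any) (sql.Result, error)
}

// promoteSQL is the check-and-update statement from this section,
// rewritten with PostgreSQL-style placeholders (an assumption).
const promoteSQL = `
UPDATE task_run
SET status = 'AVAILABLE'
WHERE id = $1
  AND status = 'PENDING'
  AND (SELECT COUNT(*) FROM task_run
       WHERE rollout_id = $2 AND status IN ('RUNNING', 'AVAILABLE')) < $3`

// tryPromote reports whether this caller moved the task to AVAILABLE.
func tryPromote(ctx context.Context, db execer, taskID, rolloutID, limit int) (bool, error) {
	res, err := db.ExecContext(ctx, promoteSQL, taskID, rolloutID, limit)
	if err != nil {
		return false, err
	}
	n, err := res.RowsAffected()
	if err != nil {
		return false, err
	}
	return n == 1, nil
}

// fakeResult and fakeDB are test doubles that report a fixed row count,
// so the sketch runs without a real database.
type fakeResult int64

func (r fakeResult) LastInsertId() (int64, error) { return 0, nil }
func (r fakeResult) RowsAffected() (int64, error) { return int64(r), nil }

type fakeDB struct{ rows int64 }

func (f fakeDB) ExecContext(context.Context, string, ...any) (sql.Result, error) {
	return fakeResult(f.rows), nil
}

func main() {
	ctx := context.Background()
	ok, _ := tryPromote(ctx, fakeDB{rows: 1}, 1, 1, 2)
	fmt.Println(ok) // true: one row updated, the task was promoted
	ok, _ = tryPromote(ctx, fakeDB{rows: 0}, 1, 1, 2)
	fmt.Println(ok) // false: no longer PENDING, or the rollout is at its limit
}
```

The caller only needs the affected-row count to know whether it won the promotion, which mirrors how running_scheduler detects a successful claim.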
-- Add AVAILABLE to CHECK constraint
ALTER TABLE task_run
DROP CONSTRAINT task_run_status_check,
ADD CONSTRAINT task_run_status_check
CHECK (status IN ('PENDING', 'AVAILABLE', 'RUNNING', 'DONE', 'FAILED', 'CANCELED'));
-- Update partial index for active statuses
DROP INDEX idx_task_run_active_status_id;
CREATE INDEX idx_task_run_active_status_id ON task_run (status, id)
WHERE status IN ('PENDING', 'AVAILABLE', 'RUNNING');
proto/store/task_run.proto:
enum Status {
STATUS_UNSPECIFIED = 0;
PENDING = 1;
RUNNING = 2;
DONE = 3;
FAILED = 4;
CANCELED = 5;
NOT_STARTED = 6;
SKIPPED = 7;
AVAILABLE = 8; // NEW: Ready for immediate execution
}
frontend/src/components/RolloutV1/components/utils/taskStatus.ts:
const ACTIONABLE_STATUSES = [
"NOT_STARTED", "PENDING", "AVAILABLE", "RUNNING", "FAILED", "CANCELED"
];
const TERMINAL_STATUSES = ["DONE", "SKIPPED"];
Add translations for "Available" status in frontend/src/locales/.
| File | Changes |
|---|---|
| proto/store/task_run.proto | Add AVAILABLE = 8 to Status enum |
| backend/migrator/migration/XXX/ | New migration for schema changes |
| backend/migrator/migration/LATEST.sql | Update CHECK constraint and index |
| backend/runner/taskrun/pending_scheduler.go | Add all gating logic, promote to AVAILABLE |
| backend/runner/taskrun/running_scheduler.go | Simplify to: claim AVAILABLE → execute |
| backend/store/task_run.go | Add AVAILABLE constant, update queries |
| frontend/src/components/RolloutV1/.../taskStatus.ts | Add AVAILABLE to actionable statuses |
| frontend/src/locales/*.json | Add "available" translation |
| Frontend status display components | Add AVAILABLE visual styling |
No breaking changes. Existing PENDING/RUNNING tasks continue to work. New tasks will use the AVAILABLE intermediate state.
This refactoring prepares for HA by: