Back to Fastled

Stale Lock Detection and Recovery

docs/stale-lock-recovery.md

3.10.34.0 KB
Original Source

Stale Lock Detection and Recovery

Problem

When build processes are killed (Ctrl+C, task manager, etc.), lock files can be left behind without PID metadata. This happens when a process:

  1. Acquires the lock file (creates .lock)
  2. Gets killed BEFORE writing PID metadata (.lock.pid)
  3. Leaves an orphaned lock file that blocks future builds

Previous behavior: Age-based fallback only triggered after 30 minutes, causing long build delays.

Solution

Changes Made

1. Reduced Age Threshold (file_lock_rw.py)

  • Before: 30 minutes for locks without metadata
  • After: 2 minutes for locks without metadata
  • Rationale: Faster recovery from killed processes, more practical threshold
python
# Old threshold
if age_seconds > 1800:  # 30 minutes

# New threshold
if age_seconds > 120:  # 2 minutes

2. Immediate Stale Lock Check (build_lock.py)

  • Added stale lock check at START of lock acquisition (before wait loop)
  • Prevents immediate blocking on stale locks from previous runs
  • Removes stale locks proactively rather than waiting for timeout
python
# Check for stale lock BEFORE attempting acquisition
if self.lock_file.exists():
    if self._check_stale_lock():
        print(f"Removed stale lock before acquisition: {self.lock_file}")

3. Improved Logging

  • Better diagnostic messages for stale lock detection
  • Clearer indication when locks are from killed processes
  • Debug-level logging for recent locks (reduces noise)

How It Works

Lock Database

Lock state is stored in a centralized SQLite database (.cache/locks.db or ~/.fastled/locks.db). Each lock record contains: lock name, owner PID, lock mode (read/write), operation, hostname, timestamp.

Stale Detection Logic

  1. Check all holders: Query DB for processes holding the lock
  2. PID liveness check: For each holder, check if process is still alive
    • Uses os.kill(pid, 0) on Unix
    • Uses OpenProcess() on Windows
  3. If ALL holder PIDs are dead → lock is stale

Recovery Process

  1. At acquisition start: Check if lock exists and is stale
  2. During acquisition wait: Check periodically (every 1 second)
  3. If stale detected: Remove dead-PID rows from database
  4. Retry acquisition: Attempt to acquire immediately after removal

Testing

Test script demonstrates two scenarios:

bash
uv run python test_stale_lock.py

Test 1: Stale lock (3 minutes old, no metadata)

  • ✅ Detected as stale
  • ✅ Removed automatically
  • ✅ New lock acquired in <0.1s

Test 2: Recent lock (30 seconds old, no metadata)

  • ✅ Acquirable if no process holds it (OS-level lock released)
  • ✅ Correct behavior - SQLite DB tracks lock state properly

Real-World Scenario

Before fix:

bash
bash test
# Blocks for up to 30 minutes on stale lock
# User has to manually kill processes or delete lock files

After fix:

bash
bash test
# Detects stale lock immediately
# Removes it automatically
# Build proceeds in <0.1s

# Output:
# Detected stale lock at .build/locks/libfastled_build.lock (process dead)
# Removed stale lock file: .build/locks/libfastled_build.lock

Edge Cases Handled

  1. All holder PIDs dead: Stale (processes crashed)
  2. Any holder PID alive: Active (legitimate lock)
  3. No holders in DB: Not locked (available)
  4. Can't check PID: Assume active (fail safe)
  5. Multiple readers, one dead: Remove dead reader only (others remain)

Performance Impact

  • Stale lock with metadata: ~0.01s to check PID + remove
  • Stale lock without metadata: ~0.01s to check age + remove
  • Active lock (no contention): No overhead (immediate acquisition)
  • Active lock (contention): +1s per stale check iteration

Future Improvements

Possible enhancements (not currently implemented):

  • Reduce 2-minute threshold further (30 seconds?) for faster recovery
  • Add lock file corruption detection and recovery
  • Track lock acquisition duration for performance monitoring
  • Implement lock analytics (how often stale locks occur)