docs/plans/2026-05-21-clean-slate-data-loss-prevention.md
Date: 2026-05-21 Status: Design — revised after multi-review (6 parallel reviewers: Correctness, Security, Architecture, Alternatives, Performance, Simplicity). Convergent findings folded in. Codex skipped (stdin hang in this environment). Scope: Sync upload path; SuperSync + file-based providers. Client-side only. Tracking issue: #7709
Issue #7709 describes a multi-device data-loss chain: a single device uploads a partial / stale local state to the server as a full-state op (SYNC_IMPORT, BACKUP_IMPORT, or REPAIR — all are treated as clean-slate by operation-log-upload.service.ts:129-130), the server deletes its op-log history, and every other device that auto-syncs inherits the partial snapshot. The reported user lost ~2 weeks of work across three laptops this way.
Five user-reachable triggers produce destructive uploads today (the multi-review added the fifth):
CleanSlateService.createCleanSlate() → upload with isCleanSlate: trueSyncImportConflictCoordinatorService.forceUploadLocalState() (passes skipServerEmptyCheck: true)isCleanSlate=true flag)BackupService.importComplete (backup-file restore) — runs the same non-atomic clearAllOperations → append → setVectorClock → saveStateCache sequence as createCleanSlate (backup.service.ts:194/221/225/227). Fifth destructive trigger; the original plan missed it.This plan addresses two independent root causes. Either one alone is sufficient to cause the reported bug.
Both clean-slate.service.ts:149-168 and backup.service.ts:194-227 run in order:
1. clearAllOperations() ← OPS table emptied
2. append(syncImportOp) ← lastSeq goes back to 1
3. setVectorClock(...)
4. saveStateCache(...) ← state_cache populated
If the process is interrupted between step 1 and step 4 (crash, tab close, browser kill, even a thrown exception in append/setVectorClock), the device is left with OPS empty AND state_cache either stale or never written. On a low-activity device (under COMPACTION_THRESHOLD = 500 ops) state_cache was never written in the first place. Result: isWhollyFreshClient()===true on next launch, which routes through the LocalDataConflictError(0, {}) throw at operation-log-sync.service.ts:606.
This is the exact precondition chain the issue describes.
Destructive uploads use the current in-memory NgRx state as authoritative. If NgRx is partial (hydration incomplete, post-clearAllOperations() window, post-rollback empty state), the partial state is what gets sent. Nothing compares "what's about to be uploaded" against "what was here a minute ago."
Per multi-review: Problem B is not in the reported incident (Problem A explains it cleanly). Problem B is hypothetical — defending against unobserved failure modes. The original plan made Problem B the headline fix; this revision demotes it to a follow-up gated on forensic evidence from production logs.
{} empty remote snapshot at operation-log-sync.service.ts:606 so the conflict dialog has something to render.BACKUP_IMPORT / REPAIR clean-slate-by-default semantics in operation-log-upload.service.ts:129-130.Where: A new helper on OperationLogStoreService:
async runDestructiveStateReplacement(opts: {
syncImportOp: Operation;
newVectorClock: VectorClock;
newState: unknown;
schemaVersion: number;
snapshotEntityKeys: string[];
archiveYoung?: ArchiveStoreEntry['data'];
archiveOld?: ArchiveStoreEntry['data'];
}): Promise<void>
Two callers refactor to use it: CleanSlateService.createCleanSlate() and BackupService.importComplete(). Both previously ran a four-step sequence as independent IDB transactions.
Implementation: single multi-store readwrite transaction. All writes (clear OPS, append the SYNC_IMPORT entry, write vector_clock, write state_cache, optionally write archive_young / archive_old) happen inside one db.transaction(stores, 'readwrite'). If any step rejects the IDB request, tx.done rejects, the engine auto-aborts, and no committed change to any of the touched stores survives. The catch block calls an explicit tx.abort() as well — that branch is unreachable in production (rejected IDB requests already abort the tx) but is load-bearing for the spy-based fault-injection seam used by the interrupt integration test, where the spy throws synchronously instead of rejecting an IDB request.
Why a single multi-store tx, not snapshot-then-swap. The plan's first revision proposed staging the new state to a STATE_CACHE_STAGING_KEY row outside the destructive tx, then swapping references inside a small cross-store tx, on the premise that "every existing db.transaction(...) call in operation-log-store.service.ts is single-store" and WebKit/Capacitor had no precedent for multi-store + multi-MB. That premise was wrong: appendWithVectorClockUpdate already runs a 2-store (OPS + VECTOR_CLOCK) readwrite tx on every local action, including on Capacitor iOS. Staging would have cost one extra full-state structured-clone, a boot-time staging-row reconciliation step on every cold start, a catch-block cleanup, two integration tests, and ~30 lines of JSDoc — in exchange for a crash-detection sentinel nothing would have acted on. The simpler design provides the same atomicity guarantee with none of the machinery.
Payload duplication trade-off. Both syncImportOp.payload (full state) and newState (structurally identical) are persisted in the same tx: the payload via the OPS row's encoded operation, the state via the STATE_CACHE singleton. Both writes are required — OPS holds the payload the uploader sends in the snapshot endpoint; STATE_CACHE is what isWhollyFreshClient reads on next launch. Eliminating the duplication would require either lazy hydration of the OPS payload from STATE_CACHE at upload time (unsafe if compaction advances STATE_CACHE past this op's seq) or a dedicated payload-staging store. Neither pays back for an infrequent (password change / backup restore) operation.
On pre-migration-backup.service.ts — DELETED in PR-A. The original plan kept a placeholder PreMigrationBackupService and proposed implementing it as a recovery path independent of IDB atomicity. With runDestructiveStateReplacement now atomic, the safety net it was meant to provide (recover from a partial destructive write) cannot fire — the destructive tx either fully commits or fully rolls back. The placeholder service was deleted along with its DI wiring and stub tests. If a future requirement appears for "user-initiated undo of a successful clean-slate," that should be designed as its own feature, not a vestigial backup layer.
operation-log-sync.service.ts:606Critical correction from multi-review: Line 606 is in the op-streaming branch (result.newOps.length > 0). It is reached after the three if (result.providerMode === 'fileSnapshotOps' && result.snapshotState) early-returns at lines 458/484/517. result.snapshotState is undefined at line 606. The original plan's "thread result.snapshotState" instruction was wrong for this site.
Two-part fix:
(a) Earlier throws (lines 458, 484) already construct LocalDataConflictError(unsyncedCount, result.snapshotState, result.snapshotVectorClock) — verify these and add the vectorClock argument to line 484 if missing.
(b) Line 606 throw has no snapshot. Compute a synthetic payload from result.newOps:
const remoteOpCount = result.newOps.length;
throw new LocalDataConflictError(
0,
{ __synthetic: true, opCount: remoteOpCount },
result.snapshotVectorClock, // may be undefined; OK
);
(c) Narrow LocalDataConflictError.remoteSnapshotState type to a discriminated union so the dialog can render "Remote: N operations" for the synthetic case and a real entity-count summary for the snapshot case:
type RemoteDataDescriptor =
| { kind: 'snapshot'; counts: { tasks: number; projects: number; tags: number; notes: number; trackedHours: number } }
| { kind: 'opCount'; opCount: number };
This requires changes to:
LocalDataConflictError constructor (sync-errors.ts:95-105) — replace remoteSnapshotState: Record<string, unknown> with remoteData: RemoteDataDescriptor.result.snapshotState before throwing._handleLocalDataConflict (sync-wrapper.service.ts:1161-1262) — pass remoteData into conflictData.remote instead of mainModelData.dialog-sync-conflict.component.html — branch on descriptor.kind.Security note: the narrowed type eliminates the future-XSS risk Security C1 raised — decrypted remote task titles can no longer leak into the dialog via arbitrary mainModelData payloads.
Where: OpLog.warn at every clean-slate upload entry point and at every LocalDataConflictError throw.
Payload (counts-only; no entity content per CLAUDE.md sync rule 9):
{
cleanSlateReason: 'PASSWORD_CHANGED' | 'USE_LOCAL' | 'FORCE_UPLOAD' | 'SERVER_MIGRATION' | 'FIRST_ENCRYPTION' | 'BACKUP_RESTORE' | 'REPAIR',
triggerSource: string, // for FORCE_UPLOAD: which error class triggered it
inMemoryCounts: { tasks, projects, tags, notes, trackedHours },
stateCacheCounts: { tasks, projects, tags, notes, trackedHours } | null,
vectorClockSize: number, // size only, never the contents (Security C2)
lastSeq: number,
hasPriorStateCache: boolean,
}
Constraints:
LocalDataConflictError throw sites (458/484/606) with the remote counts where available.Why this is load-bearing: PR-D (conditional preflight) is gated on the evidence this generates. Without these logs we have no basis to decide whether Problem B happens in the wild.
Single modal, no counts in copy (per Simplicity W4 — counts make the dialog look load-bearing, invite litigation):
This will replace the data on the server.
Every other device that syncs after this will be overwritten by
what's currently on this device. This cannot be undone except by
restoring from a backup.
[Cancel] [Yes, replace server]
Counts go to the forensic log (Fix 4), not the modal UI.
Performance constraint (Performance S2): the modal must NOT hold the sync lock while waiting for the user. Acquire the lock only after the user clicks "Yes, replace server." Pre-modal reads (preflight, if added in PR-D) are lock-free.
Trigger sites: all four user-initiated paths (password change, "Keep local", force-upload snackbars, backup restore). Skipped for first-encryption-enable on a truly fresh client (lastSeq===0 && state_cache===null && server-side reports empty), since this is the legitimate bootstrapping case.
incrementCompactionCounter() (operation-log-store.service.ts:1145-1173) has no production callers (only test specs). Its state: null placeholder write at lines 1151-1163 motivates the defensive guard at loadStateCache line 1051.
Delete both the write path and the guard (per Correctness S1 — keeping the guard without the writer is confusion-bait for future readers). Update the spec files that reference incrementCompactionCounter to either delete those tests or test the deletion semantics (i.e., that the dead write path is gone).
The original plan made CleanSlatePreflightService the headline fix. Multi-review converged on cutting it from v1:
If PR-D ships (only if Fix 4 logs show partial-state uploads):
Three call sites (the multi-review corrected this — the plan's original three-callers-through-createCleanSlate claim was wrong; createCleanSlate has only one production caller at encryption-password-change.service.ts:97):
encryption-password-change.service.ts around line 97 (before createCleanSlate)sync-import-conflict-coordinator.service.ts forceUploadLocalState around line 51sync-wrapper.service.ts forceUpload around line 843Gate logic (temporal, not threshold):
const lastSuccessfulDownloadAt = await syncProvider.getLastSuccessfulDownloadAt();
const fresh = lastSuccessfulDownloadAt && (Date.now() - lastSuccessfulDownloadAt < 5 * 60 * 1000);
if (!fresh) {
// Refuse with: "Sync hasn't downloaded server state recently. Reload the app
// to download, then try again. (If you really want to replace the server data,
// use Export → Reset → Re-import.)"
return REFUSE;
}
Escape hatch (Alternatives alt #8): the refusal dialog includes a "Save local data to file" button calling the existing JSON-export flow. The user always has a recovery path regardless of preflight outcome.
TOCTOU mitigation (Correctness W1): if the preflight ever needs to compare in-memory state to a reference, compute the snapshot once via getStateSnapshotAsync() and pass it through to the destructive call — don't re-read.
Reordered per multi-review convergence (Correctness S3, Architecture S1, Simplicity S1, Security S3):
| PR | Contents | Risk | Rollback |
|---|---|---|---|
| PR-A | Atomicity via runDestructiveStateReplacement (Fix 2); PreMigrationBackupService deleted (atomic replace makes the recovery layer unnecessary). Empty-snapshot fix + RemoteDataDescriptor type (Fix 3) and dead-code cleanup (Fix 6) deferred to follow-up PRs. | Medium — touches IDB write patterns | Single revert |
| PR-B | Forensic logging (Fix 4) | Very low — pure observability | Trivial |
| PR-C | Confirmation modal (Fix 5) | Low — UI-only, doesn't refuse uploads | Trivial |
| PR-D | Conditional. Temporal preflight gate (Fix 1 — temporal, not count-based) + export-to-file escape hatch | Medium — refuses uploads | Trivial via call-site removal |
PR-A is the actual bug-closing PR — it eliminates Problem A and fixes the empty-snapshot rendering. PR-B builds the evidence base for PR-D. PR-C is a defense-in-depth layer. PR-D ships only if Fix 4 logs show partial-state uploads happening.
runDestructiveStateReplacement fault injection: simulate failure at each step (stage write, verify read, each line of the destructive tx). Verify post-condition is "OPS unchanged, STATE_CACHE singleton unchanged, VECTOR_CLOCK unchanged" — even with a staging row left over.STATE_CACHE_STAGING_KEY row, run init, verify the staging row is deleted and the singleton is untouched.LocalDataConflictError shape: verify each of the three throw sites constructs the correct RemoteDataDescriptor variant.pre-migration-backup round-trip: write a backup via the new implementation, verify it can be restored.In compaction.integration.spec.ts style:
createCleanSlate interrupt-during-tx → device state unchanged.BackupService.importComplete interrupt-during-tx → device state unchanged.Per Performance W1: verify the destructive tx commits on each runtime (Electron, Chromium web, Capacitor iOS, Capacitor Android) with a multi-MB pre-staged STATE_CACHE row.
One Playwright test reproducing the issue #7709 chain on the fixed code:
createCleanSlate and inject an exception at the destructive tx.OperationLogStoreService (now that BackupService is a second caller, helper is justified).{ kind: 'opCount', opCount: N } descriptor; dialog branches on kind.These are smaller decisions, not design alternatives:
IMPORT_BACKUP store be reused for the pre-migration backup, or should we add a new store?PreMigrationBackupService deleted in PR-A.runDestructiveStateReplacement accept the operation entry directly or build it from primitives? Affects testability vs. caller ergonomics.StateSnapshotService, which can be poisoned.isCleanSlate=true unconditionally. Server-side completeness gating (Alternatives alt #1) is the right follow-up and is tracked as a separate plan.Code references verified at feat/issue-7709-567f99 HEAD c5158dd35b:
encryption-password-change.service.ts:97 (only production caller of createCleanSlate), sync-import-conflict-coordinator.service.ts:51-67, sync-wrapper.service.ts:687/699/712/739/843/1021/1140, _handleLocalDataConflict at 1161-1262, backup.service.ts:191-233 (5th trigger).clean-slate.service.ts:81-176. Non-atomic sequence at 149-168. Client-ID rotation + vector-clock reset at 121-126.BackupService.importComplete non-atomic sequence: backup.service.ts:194 (clear), :221 (append), :225 (setVectorClock), :227 (saveStateCache).operation-log-upload.service.ts:129-130.isWhollyFreshClient: sync-local-state.service.ts:25-30.operation-log-sync.service.ts:606. result.snapshotState is undefined at this line (verified — past all three fileSnapshotOps early-returns at 458/484/517).pre-migration-backup.service.ts: confirmed placeholder, all methods are no-ops.handleServerMigration (server-migration.service.ts:200-269): only appends a SYNC_IMPORT op locally; no clearAllOperations / saveStateCache / setVectorClock. The "Keep local" path is destructive only on the server (via isCleanSlate=true flag on upload), not on the local device. (One multi-review reviewer overstated this as a non-atomic local sequence; verification shows it isn't.)incrementCompactionCounter: no production callers (verified via grep across src/).operation-log.const.ts:88 (COMPACTION_THRESHOLD = 500).dialog-sync-conflict.component.ts:153-168. Only fires when |remoteChanges - localChanges| >= 20.FeatureFlag|featureToggle|growthbook|posthog all absent from src/).BackupService.importComplete as the 5th destructive trigger (multi-review C2 finding).snapshotState; synthetic descriptor for op-streaming case; introduced RemoteDataDescriptor discriminated union.createCleanSlate has only one production caller; preflight wraps three independent call sites if it ships.appendWithVectorClockUpdate already uses multi-store readwrite on every action (the staging row was over-engineering on a wrong premise).runDestructiveStateReplacement helper (BackupService is a second caller, helper justified).pre-migration-backup.service.ts is a placeholder; PR-A implements it for real.LocalDataConflictError.remoteSnapshotState type to eliminate future XSS surface (per Security C1).