docs/sync-and-op-log/background-info/operation-log-best-practises2.md
Status: Research Complete
Date: December 2, 2025
Purpose: Inform the design of Super Productivity's server sync architecture
This document synthesizes best practices from industry leaders (Figma, Linear, Replicache) and academic research on operation-based synchronization systems. Key findings inform our architecture decisions for transitioning from file-based to operation-based sync.
| Pattern | Use Case | Examples |
|---|---|---|
| Server-authoritative | Single source of truth needed | Linear, Replicache |
| Peer-to-peer | No central server required | CRDTs, Local-first |
| Hybrid | Server for ordering, peers ok | Figma multiplayer |
Recommendation for Super Productivity: Server-authoritative pattern. The server assigns monotonic sequence numbers, providing total ordering while clients handle optimistic updates.
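To make this concrete, the sketch below shows one possible envelope for an operation in a server-authoritative log. The field names are illustrative assumptions for Super Productivity, not taken from any of the cited sources; the client fills in everything except `serverSeq`, which the server assigns when it accepts the operation.

```typescript
// Illustrative operation envelope (all names are assumptions, not an existing API).
interface Operation {
  opId: string; // Client-generated UUID, used for idempotent retries
  clientId: string; // Originating device
  clientSeq: number; // Per-device counter, orders ops from one client
  serverSeq?: number; // Assigned by the server on accept; defines the total order
  entityType: 'task' | 'project' | 'tag';
  entityId: string;
  kind: 'create' | 'update' | 'delete';
  payload: unknown;
  createdAt: number; // Client wall-clock time, informational only
}
```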
Source: Replicache - How It Works
The most robust pattern uses separate push and pull phases:
```
┌─────────┐                              ┌─────────┐
│ Client  │                              │ Server  │
└────┬────┘                              └────┬────┘
     │                                        │
     │  PUSH: mutations[]                     │
     │───────────────────────────────────────►│
     │                                        │  Execute mutations
     │                                        │  Update lastMutationID
     │                                        │
     │  PULL: since cookie                    │
     │───────────────────────────────────────►│
     │                                        │
     │  Response: patch, cookie,              │
     │            lastMutationIDChanges       │
     │◄───────────────────────────────────────│
     │                                        │
     │  Rebase local state                    │
     │                                        │
```
Key insight: The server re-executes client mutations rather than simply storing them. This allows server-side validation, side effects, and authoritative conflict resolution.
Source: Replicache Push/Pull Reference
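A minimal sketch of a push handler in this style. The helper names (`getLastMutationID`, `applyMutation`, `setLastMutationID`) are hypothetical, not Replicache's actual server API; the point is that the server executes each mutation itself and advances a per-client `lastMutationID`, so retried mutations are applied exactly once.

```typescript
interface Mutation {
  clientID: string;
  id: number; // Per-client, monotonically increasing mutation ID
  name: string; // Mutator name, e.g. 'updateTask'
  args: unknown;
}

// Hypothetical server-side push handler: re-execute mutations, skip duplicates.
async function handlePush(mutations: Mutation[]): Promise<void> {
  for (const m of mutations) {
    const last = await getLastMutationID(m.clientID);
    if (m.id <= last) continue; // Already applied (client retried)
    if (m.id > last + 1) throw new Error('mutation received out of order');
    await applyMutation(m.name, m.args); // Server-side validation and side effects
    await setLastMutationID(m.clientID, m.id);
  }
}

// Storage and mutator dispatch are assumed to exist elsewhere.
declare function getLastMutationID(clientID: string): Promise<number>;
declare function setLastMutationID(clientID: string, id: number): Promise<void>;
declare function applyMutation(name: string, args: unknown): Promise<void>;
```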
Linear uses a transaction-based model with delta broadcasting:
Key insight: Delta packets may differ from original transactions because the server performs side effects (e.g., generating history, enforcing constraints).
Source: Reverse Engineering Linear's Sync Engine
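To illustrate why the broadcast differs from what the client sent, the types below sketch the two message kinds. The shapes are assumptions for illustration only, not Linear's actual wire format.

```typescript
// What a client sends: its optimistic local change.
interface ClientTransaction {
  id: string;
  model: string; // e.g. 'Issue'
  modelId: string;
  changes: Record<string, unknown>; // Fields the user edited
}

// What the server broadcasts after applying the transaction: the authoritative
// result, possibly including extra rows created as side effects
// (history entries, denormalized fields, enforced constraints).
interface DeltaPacket {
  syncId: number; // Server-assigned, totally ordered
  mutations: Array<{
    model: string;
    modelId: string;
    action: 'insert' | 'update' | 'delete';
    data: Record<string, unknown>;
  }>;
}
```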
Vector clocks track causal relationships between events:
```typescript
type VectorClock = Record<string, number>; // { clientId: sequenceNumber }

// Comparison results:
// - BEFORE: a happened-before b
// - AFTER: a happened-after b
// - EQUAL: same logical time
// - CONCURRENT: conflict - neither ordered
```
Properties: comparing any two clocks yields exactly one of the four results above, so concurrent (conflicting) edits are detected explicitly instead of being silently overwritten.
Limitation: vector clocks grow with the number of clients; pruning strategies are needed for long-lived systems.
Source: Vector Clocks and Conflicting Data
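A straightforward comparison function over the `VectorClock` type above, following the standard happened-before rules (a sketch, not tied to any particular library):

```typescript
type VectorClock = Record<string, number>; // As defined above

type Ordering = 'BEFORE' | 'AFTER' | 'EQUAL' | 'CONCURRENT';

function compareVectorClocks(a: VectorClock, b: VectorClock): Ordering {
  let aAhead = false; // a has at least one component greater than b's
  let bAhead = false; // b has at least one component greater than a's
  const clientIds = new Set([...Object.keys(a), ...Object.keys(b)]);
  for (const id of clientIds) {
    const av = a[id] ?? 0;
    const bv = b[id] ?? 0;
    if (av > bv) aAhead = true;
    if (bv > av) bAhead = true;
  }
  if (aAhead && bAhead) return 'CONCURRENT'; // Neither dominates: conflict
  if (aAhead) return 'AFTER'; // a happened-after b
  if (bAhead) return 'BEFORE'; // a happened-before b
  return 'EQUAL';
}
```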
HLC combines physical and logical clocks, addressing vector clock limitations:
```typescript
interface HLC {
  wallTime: number; // Physical clock component
  logical: number; // Logical counter for same wallTime
  nodeId: string; // For tiebreaking
}
```
Advantages: timestamps have a fixed size regardless of the number of clients (unlike vector clocks), stay close to physical wall-clock time, and still preserve causal ordering.
Best Practices:
Source: Hybrid Logical Clocks in Depth
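A sketch of the standard HLC update rules, written against the interface above. Real implementations typically also bound clock drift and persist the clock across restarts; that is omitted here.

```typescript
interface HLC {
  wallTime: number; // Physical clock component (ms since epoch)
  logical: number; // Logical counter for the same wallTime
  nodeId: string; // For tiebreaking
}

// Advance the clock for a locally generated event (e.g. before sending an op).
function hlcNext(local: HLC, physicalNow: number = Date.now()): HLC {
  if (physicalNow > local.wallTime) {
    return { wallTime: physicalNow, logical: 0, nodeId: local.nodeId };
  }
  // Physical clock has not advanced (or went backwards): bump the counter.
  return { ...local, logical: local.logical + 1 };
}

// Merge a timestamp received from another node.
function hlcReceive(local: HLC, remote: HLC, physicalNow: number = Date.now()): HLC {
  const wallTime = Math.max(local.wallTime, remote.wallTime, physicalNow);
  let logical: number;
  if (wallTime === local.wallTime && wallTime === remote.wallTime) {
    logical = Math.max(local.logical, remote.logical) + 1;
  } else if (wallTime === local.wallTime) {
    logical = local.logical + 1;
  } else if (wallTime === remote.wallTime) {
    logical = remote.logical + 1;
  } else {
    logical = 0;
  }
  return { wallTime, logical, nodeId: local.nodeId };
}
```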
The simplest approach for server-authoritative systems:
```sql
-- Per-user monotonic sequence
UPDATE user_sync_state
SET last_seq = last_seq + 1
WHERE user_id = ?
RETURNING last_seq;
```
Trade-off: Requires server connectivity for total ordering, but simplifies conflict detection to sequence comparison.
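With server-assigned sequence numbers, conflict detection reduces to comparing the sequence the client last saw for an entity against the entity's current sequence on the server. A hedged sketch (all names are illustrative):

```typescript
interface IncomingOp {
  entityId: string;
  baseSeq: number; // Highest server seq the client had seen for this entity
  payload: unknown;
}

// Hypothetical server-side check: apply the op, or flag it as conflicting.
async function checkForConflict(op: IncomingOp): Promise<'apply' | 'conflict'> {
  const currentSeq = await getEntitySeq(op.entityId); // Latest seq applied to the entity
  // Someone else changed the entity after the client's snapshot.
  return currentSeq > op.baseSeq ? 'conflict' : 'apply';
}

declare function getEntitySeq(entityId: string): Promise<number>;
```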
| Strategy | Use Case | Complexity | User Friction |
|---|---|---|---|
| Last-Write-Wins | Low-stakes, non-collab | Low | Medium |
| Object Versioning | Git-like history needed | Medium | High |
| CRDTs | Math-guaranteed convergence | High | Low |
| Server Aggregate | Simplified client | Medium | Low |
| Application Logic | Custom per-field rules | Medium | Low |
Source: Hasura Offline-First Design Guide
For operation-based CRDTs (relevant to our op-log approach):
```typescript
// CRDT interface pattern
interface OperationCRDT<State, Op> {
  empty(): State;
  query(state: State): unknown;
  prepare(state: State, command: unknown): Op;
  effect(state: State, op: Op): State;
}
```
Source: Operation-Based CRDTs Protocol
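As a concrete (if toy) instance of this interface, here is a grow-only counter. The prepare/effect split is the point: `prepare` turns a local command into an operation, and `effect` applies an operation regardless of where it originated. (Like all op-based CRDTs, this assumes each operation is delivered exactly once.)

```typescript
interface OperationCRDT<State, Op> {
  empty(): State;
  query(state: State): unknown;
  prepare(state: State, command: unknown): Op;
  effect(state: State, op: Op): State;
}

// Grow-only counter: state is per-node counts, ops are "node X incremented by N".
type GCounterState = Record<string, number>;
type GCounterOp = { nodeId: string; amount: number };

const gCounter: OperationCRDT<GCounterState, GCounterOp> = {
  empty: () => ({}),
  // The observable value is the sum of all per-node counts.
  query: (state) => Object.values(state).reduce((a, b) => a + b, 0),
  // Turn a local "increment" command into an op that can be shipped to peers.
  prepare: (_state, command) => command as GCounterOp,
  // Applying the same set of ops in any order converges on every replica.
  effect: (state, op) => ({
    ...state,
    [op.nodeId]: (state[op.nodeId] ?? 0) + op.amount,
  }),
};
```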
Instead of applying server patches on top of local changes, Replicache uses a git-like rebase: local state is reset to the latest server snapshot, the server patch is applied, and any still-unacknowledged local mutations are replayed on top.
Key insight: Mutators are arbitrary code that can express any conflict resolution policy. The application defines merge semantics, not the sync engine.
Source: Replicache Concepts
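A sketch of the rebase loop on the client side. Names are illustrative; in Replicache the mutators are the application-defined functions registered with the client, and re-running them against the new base is exactly where application-defined conflict resolution happens.

```typescript
type AppState = Record<string, unknown>;
type Mutator = (state: AppState, args: unknown) => AppState;

interface PendingMutation {
  name: string; // Mutator name
  args: unknown;
}

// After a pull: start from the authoritative server snapshot, then replay any
// local mutations the server has not yet acknowledged.
function rebase(
  serverSnapshot: AppState,
  pending: PendingMutation[],
  mutators: Record<string, Mutator>,
): AppState {
  let state = serverSnapshot;
  for (const m of pending) {
    state = mutators[m.name](state, m.args);
  }
  return state;
}
```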
Modern systems apply different strategies per field:
| Field Type | Strategy | Rationale |
|---|---|---|
| Single value | LWW | User expects latest |
| Set/Tags | Union | Additive, no data loss |
| Counter/Time | Sum deltas | Mathematically correct |
| Ordered list | LCS + interleave | Preserve both orderings |
| Rich text | OT or CRDT | Character-level merge |
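A sketch of field-level merging following the table above, simplified to three cases: timestamps for LWW, set union for tags, and summed deltas for tracked time. Ordered lists and rich text are omitted because they need dedicated algorithms.

```typescript
interface FieldChange<T> {
  value: T;
  updatedAt: number; // HLC or server seq in a real system
}

// Last-Write-Wins for single values: the user expects the latest edit.
function mergeLww<T>(a: FieldChange<T>, b: FieldChange<T>): FieldChange<T> {
  return a.updatedAt >= b.updatedAt ? a : b;
}

// Union for tag sets: additive, nothing is lost.
function mergeTags(a: string[], b: string[]): string[] {
  return [...new Set([...a, ...b])];
}

// Sum deltas for counters / tracked time: both increments survive.
function mergeTimeSpent(base: number, deltaA: number, deltaB: number): number {
  return base + deltaA + deltaB;
}
```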
In eventually consistent systems, deletions must be tracked:
Without tombstones:

```
Node A: DELETE item-123
Node B: (offline, has item-123)
Node B: (comes online) → "Node A is missing item-123, let me sync it!"
Result: Deleted item resurrects
```
Source: Tombstones in Distributed Systems
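A minimal sketch of how a tombstone check prevents the resurrection above (names are illustrative):

```typescript
interface Tombstone {
  entityId: string;
  deletedAt: number; // Used later for grace-period garbage collection
}

// When a peer or an old device offers an entity we don't have locally, consult
// the tombstone table before accepting it back into the dataset.
function shouldAcceptIncomingEntity(
  entityId: string,
  tombstones: Map<string, Tombstone>,
): boolean {
  // Known-deleted: reject the "helpful" re-sync instead of resurrecting the item.
  return !tombstones.has(entityId);
}
```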
| Practice | Recommendation |
|---|---|
| Grace period | 90 days minimum (Cassandra default: 10d) |
| Repair before expiry | All nodes must see tombstone before GC |
| Soft delete flag | Better for frequent un-deletes |
| Tombstone table | Better for audit trails |
| Avoid mass deletions | Creates tombstone storms |
Cleanup safety rule: only garbage-collect a tombstone after every registered device has acknowledged the deletion and the retention period (e.g., 90 days) has elapsed.
Source: Cassandra Tombstones
Operations become garbage when superseded by newer operations on the same entity:
```
Op 1: CREATE task-123 {title: "Buy milk"}       → GARBAGE after Op 2
Op 2: UPDATE task-123 {title: "Buy groceries"}  → GARBAGE after Op 3
Op 3: DELETE task-123                           → LIVE (until tombstone expires)
```
Source: Apache Geode Log Compaction
Live Key Identification (FASTER pattern):
Trigger Conditions:
Source: Microsoft FASTER Compaction
```typescript
interface CompactionConfig {
  // Never delete operations that haven't been:
  // 1. Acknowledged by all devices
  // 2. Past the retention period
  minRetentionDays: 90;
  requireAllDevicesAcked: true;

  // Snapshot before compacting
  createSnapshotBeforeCompaction: true;
}
```
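Putting these rules together, a sketch of a compaction pass over the op log. It assumes the config above; the `LogOp` shape and the per-device acknowledgement bookkeeping are illustrative.

```typescript
interface LogOp {
  serverSeq: number;
  entityId: string;
  kind: 'create' | 'update' | 'delete';
  appliedAt: number; // ms since epoch
}

function compact(
  ops: LogOp[], // Ordered by serverSeq ascending
  minAckedSeqAcrossDevices: number, // Lowest serverSeq acknowledged by every device
  retentionMs: number,
  now: number = Date.now(),
): LogOp[] {
  // A snapshot should be written before this runs (createSnapshotBeforeCompaction).
  const latestPerEntity = new Map<string, LogOp>();
  for (const op of ops) {
    latestPerEntity.set(op.entityId, op); // Later ops supersede earlier ones
  }
  return ops.filter((op) => {
    const isLatest = latestPerEntity.get(op.entityId) === op;
    const ackedByAll = op.serverSeq <= minAckedSeqAcrossDevices;
    const pastRetention = now - op.appliedAt > retentionMs;
    // Drop only superseded ops that every device has seen and that are past the
    // retention window; keep everything else.
    return isLatest || !ackedByAll || !pastRetention;
  });
}
```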
Figma uses PostgreSQL's Write-Ahead Log (WAL) for real-time updates:
```
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│ PostgreSQL  │─WAL─►│   Kafka     │─────►│  LiveGraph  │
│  (primary)  │      │             │      │  (servers)  │
└─────────────┘      └─────────────┘      └──────┬──────┘
                                                 │
                                             WebSocket
                                                 │
                                         ┌───────▼───────┐
                                         │    Clients    │
                                         └───────────────┘
```
Key insight: Rather than polling, subscribe to the database replication stream. This provides millisecond-level updates without polling overhead.
Source: Figma LiveGraph
Instead of pushing full data, send lightweight "poke" hints:
```typescript
// Server → Client
interface Poke {
  type: 'poke';
  // No data payload - just a hint to pull
}

// Client receives poke → triggers pull
```
Benefits: the WebSocket channel never carries user data (authorization and consistency stay in the pull path), messages are tiny, and a missed poke is harmless because the next pull catches the client up.
Source: Replicache Poke Mechanism
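On the client, handling a poke is just a trigger; `pull()` below stands in for whatever pull implementation the client already has (a sketch, not a specific library API):

```typescript
// Hypothetical WebSocket wiring: the poke carries no data, it only schedules a pull.
function listenForPokes(socket: WebSocket, pull: () => Promise<void>): void {
  socket.addEventListener('message', (event: MessageEvent) => {
    const msg = JSON.parse(event.data as string);
    if (msg.type === 'poke') {
      // Fire-and-forget: a missed poke is harmless, the next pull catches up.
      void pull();
    }
  });
}
```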
```typescript
class OfflineQueue {
  // Operations stored in IndexedDB while offline
  pendingOps: Operation[] = [];

  // Metadata
  lastSyncedAt = 0;
  lastServerSeq = 0;

  constructor(private push: (op: Operation) => Promise<void>) {}

  // On reconnect, replay queued operations in order
  async flush(): Promise<void> {
    for (const op of this.pendingOps) {
      await this.push(op);
    }
    this.pendingOps = [];
  }
}
```
When offline duration is long (days/weeks):
| Pending Ops | Duration | Recommended Action |
|---|---|---|
| < 100 | < 1 day | Normal sync |
| 100-500 | 1-7 days | Show warning, proceed |
| 500-2000 | > 7 days | Offer recovery options |
| > 2000 | Any | Force snapshot upload/download |
Source: Hasura Offline-First Guide
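The thresholds above translate into a simple decision function (numbers taken from the table; the action names are illustrative):

```typescript
type ReconcileAction = 'normal-sync' | 'warn-and-sync' | 'offer-recovery' | 'force-snapshot';

function chooseReconcileAction(pendingOps: number, offlineDays: number): ReconcileAction {
  if (pendingOps > 2000) return 'force-snapshot'; // Replaying is riskier than a fresh snapshot
  if (pendingOps > 500 || offlineDays > 7) return 'offer-recovery';
  if (pendingOps >= 100 || offlineDays >= 1) return 'warn-and-sync';
  return 'normal-sync';
}
```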
Both PowerSync and ElectricSQL use PostgreSQL logical replication:
```
┌─────────────┐                       ┌─────────────┐
│ PostgreSQL  │───logical repl───────►│ Sync Server │
│  (source)   │                       │             │
└─────────────┘                       └──────┬──────┘
                                             │
                                        Delta stream
                                             │
                                      ┌──────▼──────┐
                                      │   SQLite    │
                                      │  (client)   │
                                      └─────────────┘
```
PowerSync approach: Writes go through the application backend (custom logic, validation)
ElectricSQL approach: Direct writes to Postgres with CRDT merge
Source: PowerSync vs ElectricSQL
From Figma LiveGraph 100x:
Source: Figma LiveGraph 100x
Based on this research, key recommendations for Super Productivity:
- Use a server-authoritative operation log with per-user monotonic sequence numbers for total ordering, while clients apply changes optimistically.
- Sync with separate push and pull phases, re-executing client operations on the server rather than blindly storing them.
- Resolve conflicts per field (LWW for single values, union for tags, summed deltas for tracked time) instead of one global strategy.
- Track deletions with tombstones, and garbage-collect them only after all devices have acknowledged them and the retention period (90 days) has passed.
- Compact the operation log only after acknowledgement by all devices, and snapshot before compacting.
- Notify clients with lightweight pokes over WebSocket that trigger pulls, instead of pushing data.
- Detect long-offline clients by pending-op count and offline duration, and fall back to snapshot upload/download beyond the thresholds above.