Five-Phase Indexing Pipeline

.tasks/core/INDEX-002-five-phase-indexing-pipeline.md

Description

Implement the multi-phase indexing pipeline, which breaks filesystem discovery and processing into atomic, resumable stages. The ephemeral engine runs only Phase 1 (Discovery); the persistent engine runs up to all five phases, depending on the indexing mode, with full database writes and content analysis.

Phase Architecture

Phase 1: Discovery

Used by: Ephemeral & Persistent

Parallel filesystem walk optimized for raw speed:

  • Work-Stealing Parallelism: Multiple threads scan concurrently, communicating via channels
  • Rules Engine Integration: IndexerRuler filters at discovery edge (.git, node_modules, .gitignore)
  • Lightweight Output: Stream of DirEntry objects
  • Progress Tracking: Measured by directories discovered
  • Batching: Collects 1,000 entries before moving to processing

Implementation: core/src/ops/indexing/phases/discovery.rs
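
The filtering and batching steps above can be sketched in isolation. This is a minimal illustration, not the code in discovery.rs: `is_ignored` is a hypothetical stand-in for the IndexerRuler (it checks only two hard-coded directory names and does not parse .gitignore), `batch_discovered` is an assumed helper name, and the parallel walk itself is left out.

```rust
use std::path::{Path, PathBuf};

/// Batch size named in the text above.
const BATCH_SIZE: usize = 1_000;

/// Hypothetical stand-in for the rules engine: reject any path that has an
/// ignored directory name among its components.
fn is_ignored(path: &Path) -> bool {
    const IGNORED: [&str; 2] = [".git", "node_modules"];
    path.components().any(|component| {
        component
            .as_os_str()
            .to_str()
            .map_or(false, |name| IGNORED.contains(&name))
    })
}

/// Filter at the discovery edge, then group the surviving paths into
/// fixed-size batches for the next phase. The final batch may be smaller.
fn batch_discovered(
    paths: impl IntoIterator<Item = PathBuf>,
    batch_size: usize,
) -> Vec<Vec<PathBuf>> {
    let kept: Vec<PathBuf> = paths.into_iter().filter(|p| !is_ignored(p)).collect();
    // `chunks` panics on a size of 0, so clamp to at least 1.
    kept.chunks(batch_size.max(1))
        .map(|chunk| chunk.to_vec())
        .collect()
}
```

Filtering before batching keeps ignored subtrees out of every later phase, so they never cost a database write.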

Phase 2: Processing

Used by: Persistent Only

Converts discovered entries into database records:

  • Topology Sorting: Entries sorted by depth (parents before children)
  • Batch Transactions: 1,000 items per transaction to minimize SQLite locking
  • Change Detection: ChangeDetector compares filesystem vs database (New/Modified/Moved/Deleted)
  • UUID Preservation: Carries over ephemeral UUIDs via state.ephemeral_uuids
  • Boundary Validation: Ensures indexing path stays within location boundaries
  • Closure Table Updates: Inserts ancestor-descendant pairs for hierarchy
  • Directory Path Cache: Updates directory_paths table for O(1) lookups

Implementation: core/src/ops/indexing/phases/processing.rs
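
Two of the steps above, depth ordering and closure-table row generation, can be sketched with the standard library alone. This is an illustration under assumptions: the function names are hypothetical, rows are keyed by path rather than by database entry ID, and the self-row at distance 0 is a common closure-table convention that the source does not confirm.

```rust
use std::path::{Path, PathBuf};

/// Depth of a path, measured as its number of components.
fn depth(path: &Path) -> usize {
    path.components().count()
}

/// Sort so that every parent directory precedes its children.
/// The sort is stable, so entries at equal depth keep their discovery order.
fn sort_parents_first(entries: &mut [PathBuf]) {
    entries.sort_by_key(|p| depth(p));
}

/// Closure-table rows for one entry: (ancestor, descendant, distance) for the
/// entry itself (distance 0) and each ancestor up to the location root.
/// An entry outside `root` yields no rows, which doubles as a boundary check.
fn closure_rows(root: &Path, entry: &Path) -> Vec<(PathBuf, PathBuf, usize)> {
    entry
        .ancestors() // yields the entry first, then each parent in turn
        .take_while(|ancestor| ancestor.starts_with(root))
        .enumerate()
        .map(|(distance, ancestor)| (ancestor.to_path_buf(), entry.to_path_buf(), distance))
        .collect()
}
```

Inserting parents first means a child row can always reference an existing parent row, which is why the sort happens before any batch is written.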

Phase 3: Aggregation

Used by: Persistent Only

Bottom-up recursive statistics calculation:

  • Closure Table Queries: O(1) descendant lookups
  • Leaf-to-Root Traversal: Calculates sizes from deepest directories upward
  • Aggregates Stored:
    • aggregate_size - Total bytes including subdirectories
    • child_count - Direct children only
    • file_count - Recursive file count

Enables instant "True Size" sorting without traversing descendants.

Implementation: core/src/ops/indexing/phases/aggregation.rs
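
The leaf-to-root calculation can be sketched over an in-memory list instead of closure-table queries. The `Entry` and `Aggregates` types and the `aggregate` function are hypothetical; only the three aggregate field names come from the list above.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory entry; the real pipeline reads these from the database.
#[derive(Debug, Clone)]
struct Entry {
    id: i32,
    parent: Option<i32>,
    depth: usize,
    is_dir: bool,
    size: u64, // the entry's own size in bytes
}

#[derive(Debug, Default, Clone, PartialEq)]
struct Aggregates {
    aggregate_size: u64, // total bytes including subdirectories
    child_count: u64,    // direct children only
    file_count: u64,     // recursive file count
}

/// Fold totals from the deepest entries upward. Files are seeded with their
/// own size and a file count of 1 so the same fold covers files and directories.
fn aggregate(entries: &[Entry]) -> HashMap<i32, Aggregates> {
    let mut totals: HashMap<i32, Aggregates> = HashMap::new();
    for e in entries {
        let agg = totals.entry(e.id).or_default();
        agg.aggregate_size = e.size;
        agg.file_count = if e.is_dir { 0 } else { 1 };
    }

    // Deepest first: by the time an entry is folded into its parent,
    // all of its own descendants have already been folded into it.
    let mut order: Vec<&Entry> = entries.iter().collect();
    order.sort_by(|a, b| b.depth.cmp(&a.depth));

    for e in order {
        if let Some(parent_id) = e.parent {
            let own = totals[&e.id].clone();
            let parent = totals.entry(parent_id).or_default();
            parent.aggregate_size += own.aggregate_size;
            parent.file_count += own.file_count;
            parent.child_count += 1;
        }
    }
    totals
}
```

Each entry is folded into its parent exactly once, so the pass is linear in the number of entries after the sort.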

Phase 4: Content Identification

Used by: Persistent Only

Content-addressable storage via BLAKE3 hashing:

  • BLAKE3 Hashing: Generates content hashes for deduplication
  • Globally Deterministic UUIDs: v5 UUIDs from content hash (offline duplicate detection)
  • Sync Ordering: Content identities synced before entries (foreign key safety)
  • File Type Detection: FileTypeRegistry populates kind_id and mime_type_id
  • Link to Content Records: Entries reference shared content_identity table

Implementation: core/src/ops/indexing/phases/content.rs
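
The linking step, where entries with identical bytes share one content record, can be sketched as below. To stay dependency-free, the sketch swaps in the standard library's `DefaultHasher` for BLAKE3; it is not cryptographic and is used only to show the grouping. The v5 UUID derivation is omitted, and `content_hash` and `link_content` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::Hasher;

/// Stand-in for a BLAKE3 digest (assumption: the real phase hashes file
/// contents with BLAKE3). DefaultHasher is NOT suitable for real deduplication.
fn content_hash(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write(bytes);
    hasher.finish()
}

/// Hypothetical in-memory view of the content_identity relationship:
/// content hash -> IDs of all entries that share that content.
fn link_content(files: &[(i32, &[u8])]) -> HashMap<u64, Vec<i32>> {
    let mut identities: HashMap<u64, Vec<i32>> = HashMap::new();
    for (entry_id, bytes) in files {
        identities
            .entry(content_hash(bytes))
            .or_default()
            .push(*entry_id);
    }
    identities
}
```

Because the key depends only on the bytes, two devices that hash the same file independently arrive at the same identity, which is the property the deterministic UUIDs rely on.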

Phase 5: Finalizing

Used by: Persistent Only

Post-processing and processor dispatch:

  • Directory Aggregation Updates: Final aggregate calculations
  • Processor Dispatch: Triggers thumbnail generation for Deep Mode
  • Cleanup: Marks indexing as complete

Implementation: Handled in core/src/ops/indexing/job.rs

Implementation Files

Phase Implementations

  • core/src/ops/indexing/phases/discovery.rs - Phase 1
  • core/src/ops/indexing/phases/processing.rs - Phase 2
  • core/src/ops/indexing/phases/aggregation.rs - Phase 3
  • core/src/ops/indexing/phases/content.rs - Phase 4
  • core/src/ops/indexing/phases/mod.rs - Phase enum and orchestration

Orchestration

  • core/src/ops/indexing/job.rs - IndexerJob runs phases sequentially
  • core/src/ops/indexing/state.rs - IndexerState tracks current phase and progress
  • core/src/ops/indexing/progress.rs - Progress reporting per phase

Acceptance Criteria

  • Phase 1 (Discovery) runs in both ephemeral and persistent modes
  • Phases 2-5 only run for persistent indexing
  • Each phase is resumable (state preserved in IndexerState)
  • Discovery uses work-stealing parallelism (8+ threads on capable systems)
  • Processing sorts entries by depth (parents before children)
  • Processing batches database writes (1,000 items/transaction)
  • ChangeDetector detects New/Modified/Moved/Deleted during processing
  • Aggregation uses closure table for O(1) descendant queries
  • Content phase generates BLAKE3 hashes
  • Content phase creates globally deterministic v5 UUIDs
  • FileTypeRegistry identifies file types during content phase
  • Progress tracking works for all phases
  • Job can pause/resume at any phase boundary
  • Ephemeral UUID preservation works in Phase 2

Indexing Modes

The pipeline supports three depth modes:

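| Mode    | Phases Run    | Speed  | Use Case                      |
|---------|---------------|--------|-------------------------------|
| Shallow | 1, 2, 3       | Fast   | UI navigation, quick scan     |
| Content | 1, 2, 3, 4    | Medium | Normal indexing with dedup    |
| Deep    | 1, 2, 3, 4, 5 | Slow   | Media libraries with thumbnails |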
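
The mode-to-phase mapping in the table can be expressed as a lookup. The enum and variant names here are assumptions made for illustration; the source mentions a Phase enum in phases/mod.rs but does not show its definition.

```rust
/// The five pipeline phases, in execution order (variant names are hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Phase {
    Discovery,
    Processing,
    Aggregation,
    ContentIdentification,
    Finalizing,
}

/// The three depth modes from the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum IndexMode {
    Shallow,
    Content,
    Deep,
}

impl IndexMode {
    /// Phases this mode runs, in order.
    fn phases(self) -> &'static [Phase] {
        use Phase::*;
        match self {
            IndexMode::Shallow => &[Discovery, Processing, Aggregation],
            IndexMode::Content => &[Discovery, Processing, Aggregation, ContentIdentification],
            IndexMode::Deep => &[
                Discovery,
                Processing,
                Aggregation,
                ContentIdentification,
                Finalizing,
            ],
        }
    }
}
```

Each mode is a strict prefix of the next, so a job runner could iterate `mode.phases()` without any per-phase conditionals.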

Indexing Scopes

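| Scope     | Behavior                       | Use Case                 |
|-----------|--------------------------------|--------------------------|
| Current   | Index immediate directory only | Responsive UI navigation |
| Recursive | Index entire tree              | Full location indexing   |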

Performance Characteristics

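| Configuration       | Performance    | Notes                      |
|---------------------|----------------|----------------------------|
| Current + Shallow   | <500 ms        | No subdirectories          |
| Recursive + Shallow | ~10K files/sec | Metadata only              |
| Recursive + Content | ~1K files/sec  | With BLAKE3 hashing        |
| Recursive + Deep    | ~100 files/sec | Full analysis + thumbnails |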

Resumability

Each phase stores sufficient state in IndexerState to resume:

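```rust
use std::collections::{HashMap, VecDeque};
use std::path::PathBuf;
use uuid::Uuid;

// `Phase`, `DirEntry`, and `IndexerStats` are defined elsewhere in the indexing module.
pub struct IndexerState {
    pub phase: Phase,                            // phase to resume from
    pub dirs_to_walk: VecDeque<PathBuf>,         // directories still queued for discovery
    pub entry_batches: Vec<Vec<DirEntry>>,       // discovered batches awaiting processing
    pub entry_id_cache: HashMap<PathBuf, i32>,   // path -> database entry ID
    pub ephemeral_uuids: HashMap<PathBuf, Uuid>, // UUIDs carried over from ephemeral indexing
    pub stats: IndexerStats,                     // running counters for progress reporting
}
```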

When interrupted:

  1. State serialized to jobs database (MessagePack)
  2. On resume, job loads state and continues from saved phase
  3. No work is lost
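
The pause-and-resume behavior at phase boundaries can be sketched with a reduced state type. This is an assumption-laden illustration: `SavedState` keeps only the phase marker and a log of completed phases, `run` and the `Complete` variant are hypothetical, and MessagePack serialization is left out because it needs an external crate.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Phase {
    Discovery,
    Processing,
    Aggregation,
    Content,
    Finalizing,
    Complete, // hypothetical terminal marker
}

impl Phase {
    fn next(self) -> Phase {
        match self {
            Phase::Discovery => Phase::Processing,
            Phase::Processing => Phase::Aggregation,
            Phase::Aggregation => Phase::Content,
            Phase::Content => Phase::Finalizing,
            Phase::Finalizing | Phase::Complete => Phase::Complete,
        }
    }
}

/// Reduced stand-in for IndexerState.
#[derive(Debug, Clone, PartialEq)]
struct SavedState {
    phase: Phase,
    completed: Vec<Phase>,
}

/// Run phases starting at `state.phase`. After each phase, `should_pause` is
/// asked about the upcoming phase; returning true stops at that boundary.
fn run(mut state: SavedState, mut should_pause: impl FnMut(Phase) -> bool) -> SavedState {
    while state.phase != Phase::Complete {
        // Real phase work would happen here.
        state.completed.push(state.phase);
        state.phase = state.phase.next();
        if state.phase != Phase::Complete && should_pause(state.phase) {
            break; // a real job would serialize `state` to the jobs database here
        }
    }
    state
}
```

Passing the returned state back into `run` continues from the saved phase, so a phase that already finished is not executed a second time.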

Related Tasks

  • INDEX-001 - Hybrid Architecture (defines ephemeral vs persistent)
  • INDEX-003 - Database Architecture (closure tables used in Phase 3)
  • INDEX-004 - Change Detection (ChangeDetector used in Phase 2)
  • INDEX-005 - Indexer Rules (filters in Phase 1)
  • JOB-000 - Job System (provides resumability infrastructure)