docs/core/indexing.mdx
Spacedrive's indexing system solves a specific challenge: How do you build a distributed database that feels as fast as a local file explorer?
The answer is a Hybrid Indexing Engine that layers an ultra-fast, in-memory ephemeral index over a robust, SQLite-backed persistent index. These two systems operate in tandem, allowing Spacedrive to instantly browse unmanaged locations (like a file manager) while seamlessly upgrading those paths to managed libraries (like a DAM) without UI flicker or state loss.
Most file management software forces a choice: fast, dumb directory listing (Explorer/Finder) or slow, heavy database ingestion (Lightroom/Photos). Spacedrive does both simultaneously by decoupling Discovery from Persistence.
When you open a location that hasn't been added to your library—an external drive, network share, or local directory—Spacedrive runs only Phase 1 (Discovery) of the indexing pipeline.
For files you want to track across devices, Spacedrive persists data to a synchronized SQLite database using the full multi-phase pipeline with deep content analysis, deduplication, and closure-table hierarchy management.
The critical innovation is how these two layers communicate. When you add a location to your library for a folder you're currently browsing ephemerally, the system performs an Intelligent Promotion: UUIDs assigned during the browsing session are preserved, so tags and notes attached to files carry over to the new managed location.
The indexing system consists of specialized components working together:
IndexerJob orchestrates the entire indexing process as a resumable job. It maintains state across application restarts and provides detailed progress reporting.
IndexerState preserves all necessary information to resume indexing from any interruption point. This includes the current phase, directories to process, accumulated statistics, and ephemeral UUID mappings for preserving user metadata across browsing-to-persistent transitions.
DatabaseStorage provides the low-level database CRUD layer. All database operations (create, update, move, delete) flow through this module for consistency.
DatabaseAdapter implements both ChangeHandler (for filesystem watcher events) and IndexPersistence (for indexer job batches). Both pipelines use the same code to write entries to the database via DatabaseStorage.
MemoryAdapter implements both ChangeHandler (for filesystem watcher events) and IndexPersistence (for indexer job batches). Both pipelines use the same code to write entries to the in-memory EphemeralIndex.
This dual-implementation architecture unifies watcher and job pipelines, eliminating code duplication between real-time filesystem monitoring and batch indexing operations.
FileTypeRegistry identifies files through extensions, magic bytes, and content analysis.
The system integrates deeply with Spacedrive's job infrastructure, which provides automatic state persistence through MessagePack serialization. When you pause an indexing operation, the entire job state is saved to a dedicated jobs database, allowing seamless resumption even after application restarts.
<Note>
Indexing jobs can run for hours on large directories. The resumable architecture ensures no work is lost if interrupted.
</Note>

The indexing system uses a closure table for hierarchy management instead of recursive queries:
Parent-child relationships are stored in the entry_closure table with precomputed ancestor-descendant pairs. This answers "find all descendants" with a single non-recursive query regardless of nesting depth, at the cost of additional storage (worst-case N² rows for deeply nested trees).
```sql
CREATE TABLE entry_closure (
    ancestor_id INTEGER,
    descendant_id INTEGER,
    depth INTEGER
);
```
The closure table stores all transitive relationships. For a file at /home/user/docs/report.pdf, entries exist for:

- report.pdf → report.pdf (depth 0)
- docs → report.pdf (depth 1)
- user → report.pdf (depth 2)
- home → report.pdf (depth 3)
Move operations require rebuilding closures for the entire moved subtree, which can affect thousands of rows when moving large directories.
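The closure bookkeeping can be sketched in miniature (hypothetical types, not Spacedrive's actual storage layer). Note how every insert copies one row per ancestor, which is exactly why moving a large subtree touches many rows:

```rust
/// One precomputed (ancestor, descendant, depth) pair, as in entry_closure.
#[derive(Debug, PartialEq)]
struct ClosureRow { ancestor: u32, descendant: u32, depth: u32 }

struct ClosureTable { rows: Vec<ClosureRow> }

impl ClosureTable {
    fn new() -> Self { Self { rows: Vec::new() } }

    /// Insert `entry` under `parent`: copy every ancestor row of the parent
    /// with depth bumped by one, then add the entry's self-row (depth 0).
    fn insert(&mut self, entry: u32, parent: Option<u32>) {
        if let Some(p) = parent {
            let inherited: Vec<ClosureRow> = self.rows.iter()
                .filter(|r| r.descendant == p)
                .map(|r| ClosureRow { ancestor: r.ancestor, descendant: entry, depth: r.depth + 1 })
                .collect();
            self.rows.extend(inherited);
        }
        self.rows.push(ClosureRow { ancestor: entry, descendant: entry, depth: 0 });
    }

    /// All descendants of `id` in a single pass, with no recursive walk.
    fn descendants(&self, id: u32) -> Vec<u32> {
        self.rows.iter()
            .filter(|r| r.ancestor == id && r.depth > 0)
            .map(|r| r.descendant)
            .collect()
    }
}

fn main() {
    // /home (1) / user (2) / docs (3) / report.pdf (4)
    let mut t = ClosureTable::new();
    t.insert(1, None);
    t.insert(2, Some(1));
    t.insert(3, Some(2));
    t.insert(4, Some(3));
    assert_eq!(t.descendants(1), vec![2, 3, 4]);
    // report.pdf carries one closure row per ancestor level, plus itself:
    assert_eq!(t.rows.iter().filter(|r| r.descendant == 4).count(), 4);
}
```

Because each entry carries a row per ancestor, relocating a subtree means deleting and re-deriving those rows for every moved descendant.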
The directory_paths table provides O(1) absolute path lookups for directories:
```sql
CREATE TABLE directory_paths (
    entry_id INTEGER PRIMARY KEY,
    path TEXT UNIQUE
);
```
This eliminates recursive parent traversal when building file paths. Each directory stores its complete absolute path, enabling instant resolution for child entries.
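A toy sketch of the lookup (assumed shapes, not the real API): resolving a child's path is one map lookup plus a string join, with no walk up the parent chain:

```rust
use std::collections::HashMap;

// Hypothetical sketch of the directory_paths idea: every directory caches
// its full absolute path, so a child's path is one lookup plus a join
// rather than a recursive parent traversal.
fn resolve(dir_paths: &HashMap<u32, String>, parent_id: u32, name: &str) -> Option<String> {
    dir_paths.get(&parent_id).map(|dir| format!("{}/{}", dir, name))
}

fn main() {
    let mut dirs = HashMap::new();
    dirs.insert(3, "/home/user/docs".to_string());
    assert_eq!(
        resolve(&dirs, 3, "report.pdf").unwrap(),
        "/home/user/docs/report.pdf"
    );
}
```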
```sql
CREATE TABLE entry (
    id INTEGER PRIMARY KEY,
    uuid UUID UNIQUE,
    parent_id INTEGER,
    name TEXT,
    extension TEXT,
    kind INTEGER,
    size BIGINT,
    inode BIGINT,
    content_id INTEGER,
    aggregate_size BIGINT,
    child_count INTEGER,
    file_count INTEGER
);
```
The pipeline is broken into atomic, resumable phases. The Ephemeral engine runs only Phase 1. The Persistent engine runs all five phases.
Used by: Ephemeral & Persistent
A parallel, asynchronous filesystem walk designed for raw speed:
- Skips unwanted directories (.git, node_modules) at the discovery edge through IndexerRuler, which applies toggleable system rules (NO_HIDDEN, NO_DEV_DIRS) and dynamically loaded .gitignore patterns when inside a Git repository
- Emits DirEntry objects
- Progress is measured by directories discovered

Entries are collected into batches of 1,000 items before moving to processing.
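A simplified, single-threaded sketch of this walk (the real implementation is parallel with work stealing; the `skip` closure stands in for IndexerRuler):

```rust
use std::collections::VecDeque;
use std::fs;
use std::path::PathBuf;

/// Breadth-first traversal that emits paths in fixed-size batches, with a
/// pluggable skip rule applied at the discovery edge.
fn discover(root: PathBuf, batch_size: usize, skip: &dyn Fn(&str) -> bool) -> Vec<Vec<PathBuf>> {
    let mut queue = VecDeque::from([root]);
    let mut batches = Vec::new();
    let mut current = Vec::new();
    while let Some(dir) = queue.pop_front() {
        let Ok(entries) = fs::read_dir(&dir) else { continue };
        for entry in entries.flatten() {
            let name = entry.file_name().to_string_lossy().into_owned();
            if skip(&name) { continue; } // rejected before it ever reaches processing
            let path = entry.path();
            if path.is_dir() { queue.push_back(path.clone()); }
            current.push(path);
            if current.len() == batch_size {
                batches.push(std::mem::take(&mut current));
            }
        }
    }
    if !current.is_empty() { batches.push(current); }
    batches
}

fn main() {
    // Walk a temp tree, skipping dot-entries (a stand-in for NO_HIDDEN).
    let tmp = std::env::temp_dir().join("sd_walk_demo");
    let _ = fs::create_dir_all(tmp.join("sub"));
    fs::write(tmp.join("a.txt"), b"x").unwrap();
    fs::write(tmp.join(".hidden"), b"x").unwrap();
    let batches = discover(tmp.clone(), 1000, &|n| n.starts_with('.'));
    let all: Vec<_> = batches.concat();
    assert!(all.iter().all(|p| !p.file_name().unwrap().to_string_lossy().starts_with('.')));
    let _ = fs::remove_dir_all(&tmp);
}
```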
Used by: Persistent Only
Converts discovered entries into database records:
Change Detection runs during this phase. The ChangeDetector loads existing database entries for the indexing path, then compares against filesystem state to identify new, modified, moved, and deleted entries.
Changes are processed in batch transactions. Each batch inserts closure table rows, updates the directory paths cache, and syncs entries across devices.
Ephemeral UUID Preservation happens here. When a browsed folder is promoted to a managed location, UUIDs assigned during ephemeral indexing are preserved (state.ephemeral_uuids). This prevents orphaning user metadata like tags and notes attached during browsing sessions.
The processing phase validates that the indexing path stays within location boundaries, preventing catastrophic cross-location deletion if watcher routing bugs send events for the wrong path.
Used by: Persistent Only
To allow sorting folders by "True Size" (the size of all children recursively), we aggregate statistics from the bottom up:
- Uses the entry_closure table to perform O(1) descendant lookups

These aggregates are stored in the entry table:

- aggregate_size: Total bytes including subdirectories
- child_count: Direct children only
- file_count: Recursive file count

This enables instant directory size display without traversing descendants.
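A minimal sketch of the bottom-up roll-up (hypothetical shapes; it assumes entries arrive parents-before-children, as discovery emits them):

```rust
use std::collections::HashMap;

/// The three aggregate columns stored on each entry row.
#[derive(Default, Clone, Copy, Debug, PartialEq)]
struct Aggregate { aggregate_size: u64, child_count: u32, file_count: u32 }

/// Fold child statistics into parents: entries are (id, parent, size, is_dir),
/// listed parents-before-children, so one reverse pass visits each child
/// after its whole subtree is already summed.
fn aggregate(entries: &[(u32, Option<u32>, u64, bool)]) -> HashMap<u32, Aggregate> {
    let mut out: HashMap<u32, Aggregate> = HashMap::new();
    // Seed each entry with its own size and file flag.
    for &(id, _, size, is_dir) in entries {
        out.insert(id, Aggregate {
            aggregate_size: size,
            child_count: 0,
            file_count: if is_dir { 0 } else { 1 },
        });
    }
    // Reverse pass: fold each child into its parent exactly once.
    for &(id, parent, _, _) in entries.iter().rev() {
        if let Some(p) = parent {
            let child = out[&id];
            let pa = out.get_mut(&p).unwrap();
            pa.aggregate_size += child.aggregate_size;
            pa.file_count += child.file_count;
            pa.child_count += 1;
        }
    }
    out
}

fn main() {
    let entries = [
        (1, None, 0, true),       // root dir
        (2, Some(1), 0, true),    // docs dir
        (3, Some(2), 100, false), // a.txt
        (4, Some(2), 50, false),  // b.txt
    ];
    let agg = aggregate(&entries);
    assert_eq!(agg[&2].aggregate_size, 150); // "true size" of docs
    assert_eq!(agg[&2].child_count, 2);      // direct children only
    assert_eq!(agg[&1].file_count, 2);       // recursive file count
}
```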
Used by: Persistent Only
Enables Spacedrive's deduplication capabilities through Content Addressable Storage (CAS):
- Creates content_identity records
- Derives each Content UUID deterministically (from the content_hash only) so any device can independently identify identical files and arrive at the exact same Content UUID without communicating. This enables offline duplicate detection across all devices and libraries
- Uses the FileTypeRegistry to populate kind_id and mime_type_id fields for new content

Used by: Persistent Only
Finalizing handles post-processing tasks like directory aggregation updates and potential processor dispatch (thumbnail generation for Deep Mode).
The indexing system includes both batch and real-time change detection:
ChangeDetector compares database state against filesystem during indexer job scans:
```rust
let mut detector = ChangeDetector::new();
detector.load_existing_entries(ctx, location_id, indexing_path).await?;
for entry in discovered_entries {
    if let Some(change) = detector.check_path(&path, &metadata, inode) {
        // Process New, Modified, or Moved change
    }
}
let deleted = detector.find_deleted(&seen_paths);
```
The detector tracks paths by inode to identify moves. On Unix systems, inodes provide stable file identity across renames. Windows falls back to path-only matching since file indices are unstable across reboots.
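A compact sketch of inode-based move detection (a simplification of what ChangeDetector does, with assumed types): a known inode reappearing at a new path is reported as a move rather than a delete plus create:

```rust
use std::collections::HashMap;
use std::path::PathBuf;

#[derive(Debug, PartialEq)]
enum Change { New(PathBuf), Moved { from: PathBuf, to: PathBuf } }

/// Record the (inode, path) pair and classify what changed. On Unix the
/// inode survives renames, which is what makes move detection possible.
fn check(known: &mut HashMap<u64, PathBuf>, inode: u64, path: PathBuf) -> Option<Change> {
    match known.insert(inode, path.clone()) {
        None => Some(Change::New(path)),
        Some(old) if old != path => Some(Change::Moved { from: old, to: path }),
        Some(_) => None, // same inode, same path: unchanged
    }
}

fn main() {
    let mut known = HashMap::new();
    assert_eq!(
        check(&mut known, 42, "/a/report.pdf".into()),
        Some(Change::New("/a/report.pdf".into()))
    );
    assert_eq!(check(&mut known, 42, "/a/report.pdf".into()), None);
    assert_eq!(
        check(&mut known, 42, "/b/report.pdf".into()),
        Some(Change::Moved { from: "/a/report.pdf".into(), to: "/b/report.pdf".into() })
    );
}
```

On Windows, where file indices are unstable across reboots, the same classification has to fall back to path-only matching.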
Both DatabaseAdapter and MemoryAdapter implement the ChangeHandler trait, which defines the interface for responding to filesystem watcher events:
```rust
pub trait ChangeHandler {
    async fn find_by_path(&self, path: &Path) -> Result<Option<EntryRef>>;
    async fn create(&mut self, metadata: &DirEntry, parent_path: &Path) -> Result<EntryRef>;
    async fn update(&mut self, entry: &EntryRef, metadata: &DirEntry) -> Result<()>;
    async fn move_entry(&mut self, entry: &EntryRef, old_path: &Path, new_path: &Path) -> Result<()>;
    async fn delete(&mut self, entry: &EntryRef) -> Result<()>;
}
```
The watcher routes events to the appropriate handler based on whether the path belongs to a persistent location (DatabaseAdapter → database) or ephemeral session (MemoryAdapter → memory).
The system provides flexible configuration through modes and scopes:
Shallow Mode extracts only filesystem metadata (name, size, dates). Completes in under 500ms for typical directories.
Content Mode adds BLAKE3 hashing to identify files by content. Enables deduplication and content tracking.
Deep Mode performs full analysis including file type identification and metadata extraction. Triggers thumbnail generation for images and videos.
Current Scope indexes only immediate directory contents. Used for responsive UI navigation.
Recursive Scope indexes the entire directory tree. Used for full location indexing.
Spacedrive supports both persistent and ephemeral indexing modes:
Persistent indexing stores all data in the database permanently. This is the default for library locations.
Ephemeral indexing keeps data in memory only, perfect for browsing external drives without permanent storage. The system uses highly memory-optimized structures (detailed in the Data Structures section below):
- FileNode entries with 32-bit entry IDs instead of 64-bit pointers

Memory usage is around 50 bytes per entry vs 200+ bytes with naive approaches: a 4-6x reduction that enables browsing hundreds of thousands of files without database overhead.
The EphemeralIndexCache tracks which paths have been indexed, are currently being indexed, or are registered for filesystem watching. When a watched path receives filesystem events, the system updates the in-memory index in real-time through the unified ChangeHandler trait (shared with persistent storage).
Specific low-level optimizations make the hybrid architecture viable:
The ephemeral index doesn't use standard HashMaps. Instead, it uses a memory-mapped NodeArena—a contiguous slab of memory that stores file nodes using 32-bit integers as pointers rather than 64-bit pointers. This reduces memory overhead by 4-6x compared to naive HashMap<PathBuf, Entry> implementations, enabling browsing of hundreds of thousands of files without database overhead.
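A minimal arena sketch under these assumptions (the real NodeArena layout is more elaborate): nodes live in one contiguous Vec and reference each other by 32-bit index rather than 64-bit pointer:

```rust
/// A file node addressed by 32-bit arena index. Links cost 4 bytes each
/// instead of the 8 a pointer would take, and nodes stay contiguous.
#[derive(Debug)]
struct FileNode {
    name_id: u32, // index into the interned name pool
    parent: u32,  // arena index of parent (u32::MAX = no parent)
    size: u64,
}

struct NodeArena { nodes: Vec<FileNode> }

impl NodeArena {
    const NONE: u32 = u32::MAX;
    fn new() -> Self { Self { nodes: Vec::new() } }
    fn alloc(&mut self, node: FileNode) -> u32 {
        let id = self.nodes.len() as u32;
        self.nodes.push(node);
        id
    }
    fn get(&self, id: u32) -> &FileNode { &self.nodes[id as usize] }
}

fn main() {
    let mut arena = NodeArena::new();
    let root = arena.alloc(FileNode { name_id: 0, parent: NodeArena::NONE, size: 0 });
    let child = arena.alloc(FileNode { name_id: 1, parent: root, size: 1024 });
    assert_eq!(arena.get(child).parent, root);
    assert_eq!(std::mem::size_of::<FileNode>(), 16); // 4 + 4 + 8 bytes, no padding
}
```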
In typical filesystems, filenames like index.js, .DS_Store, or conf.yaml repeat thousands of times. The NameCache interns these strings, storing them once and referencing them by pointer. Multiple directory trees can coexist in the same EphemeralIndex (browsing both /mnt/nas and /media/usb simultaneously), sharing the string interning pool for maximum deduplication.
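String interning can be sketched like this (an assumed shape, not the actual NameCache API): each distinct name is stored once and referenced by a 4-byte id:

```rust
use std::collections::HashMap;

/// Interning pool: repeated filenames share one stored String, and every
/// node holds only a small integer id.
struct NameCache {
    lookup: HashMap<String, u32>,
    names: Vec<String>,
}

impl NameCache {
    fn new() -> Self { Self { lookup: HashMap::new(), names: Vec::new() } }

    /// Return the existing id for `name`, or store it and mint a new one.
    fn intern(&mut self, name: &str) -> u32 {
        if let Some(&id) = self.lookup.get(name) { return id; }
        let id = self.names.len() as u32;
        self.names.push(name.to_string());
        self.lookup.insert(name.to_string(), id);
        id
    }

    fn resolve(&self, id: u32) -> &str { &self.names[id as usize] }
}

fn main() {
    let mut cache = NameCache::new();
    let a = cache.intern("index.js");
    let b = cache.intern("index.js"); // thousands of repeats share one entry
    let c = cache.intern(".DS_Store");
    assert_eq!(a, b);
    assert_ne!(a, c);
    assert_eq!(cache.resolve(a), "index.js");
    assert_eq!(cache.names.len(), 2);
}
```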
While the database uses an adjacency list (parent_id) for structure, recursive queries are slow. The directory_paths table caches the full absolute path of every directory, enabling O(1) path resolution for any file without recursive parent traversal.
The IndexerRuler applies filtering rules during discovery to skip unwanted files:
System Rules are toggleable patterns like:
- NO_HIDDEN: Skip dotfiles (.git, .DS_Store)
- NO_DEV_DIRS: Skip node_modules, target, dist
- NO_SYSTEM: Skip OS folders (System32, Windows)

Git Integration: When indexing inside a Git repository, rules are dynamically loaded from .gitignore files. This automatically excludes build artifacts and local configuration.
Rules return a RulerDecision (Accept/Reject) for each path during discovery, preventing unwanted entries from ever reaching the processing phase.
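A hedged sketch of rule evaluation (the rule names mirror the ones above; the matching logic is simplified): the first rejection wins, so filtered paths never reach processing:

```rust
#[derive(Debug, PartialEq)]
enum RulerDecision { Accept, Reject }

/// Evaluate the toggleable system rules against a single path component.
fn evaluate(name: &str, no_hidden: bool, no_dev_dirs: bool) -> RulerDecision {
    if no_hidden && name.starts_with('.') {
        return RulerDecision::Reject; // dotfiles like .git, .DS_Store
    }
    if no_dev_dirs && matches!(name, "node_modules" | "target" | "dist") {
        return RulerDecision::Reject; // common build/dependency directories
    }
    RulerDecision::Accept
}

fn main() {
    assert_eq!(evaluate(".git", true, true), RulerDecision::Reject);
    assert_eq!(evaluate("node_modules", true, true), RulerDecision::Reject);
    assert_eq!(evaluate("report.pdf", true, true), RulerDecision::Accept);
    assert_eq!(evaluate(".git", false, true), RulerDecision::Accept); // rule toggled off
}
```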
The IndexVerifyAction checks integrity by running a fresh ephemeral scan and comparing metadata against the existing persistent index:
```rust
let verify = IndexVerifyAction::from_input(IndexVerifyInput { path }).await?;
let output = verify.execute(library, context).await?;
// output.report contains:
// - missing_from_index: Files on disk but not in database
// - stale_in_index: Entries in database but missing from filesystem
// - metadata_mismatches: Size, mtime, or inode differences
```
The verification system detects files on disk that are missing from the index, stale entries that no longer exist on the filesystem, and metadata mismatches in size, mtime, or inode.
Verification runs as a library action and returns a detailed IntegrityReport with per-file diagnostics.
The indexing system leverages Spacedrive's job infrastructure for reliability and monitoring.
When interrupted, the entire job state is serialized:
```rust
#[derive(Serialize, Deserialize)]
pub struct IndexerState {
    phase: Phase,
    dirs_to_walk: VecDeque<PathBuf>,
    entry_batches: Vec<Vec<DirEntry>>,
    entry_id_cache: HashMap<PathBuf, i32>,
    ephemeral_uuids: HashMap<PathBuf, Uuid>,
    stats: IndexerStats,
}
```
This state is stored in the jobs database, separate from your library data. On resume, the job picks up exactly where it left off.
Real-time progress flows through multiple channels:
```rust
pub struct IndexerProgress {
    pub phase: IndexPhase,
    pub total_found: IndexerStats,
    pub processing_rate: f32,
    pub estimated_remaining: Option<Duration>,
}
```
Progress updates are sent to the UI via channels, persisted to the database, and available through job queries for time estimates.
Non-critical errors are accumulated but don't stop indexing. Critical errors halt the job with state preserved.
Indexing performance varies by mode and scope:
| Configuration | Performance | Use Case |
|---|---|---|
| Current + Shallow | <500ms | UI navigation |
| Recursive + Shallow | ~10K files/sec | Quick scan |
| Recursive + Content | ~1K files/sec | Normal indexing |
| Recursive + Deep | ~100 files/sec | Media libraries |
Batch Processing: Groups operations into transactions of 1,000 items, reducing database overhead by 30x.
Parallel Discovery: Work-stealing model with atomic counters for directory traversal, using half of available CPU cores by default.
Entry ID Cache: Eliminates redundant parent lookups during hierarchy construction, critical for deep directory trees.
Checkpoint Strategy: Checkpoints occur every 5,000 items or 30 seconds, balancing durability with performance.
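The dual trigger can be sketched as a small helper (hypothetical, not the actual job-system API): checkpoint after 5,000 items or 30 seconds, whichever comes first:

```rust
use std::time::{Duration, Instant};

/// Tracks progress since the last checkpoint and signals when the next
/// one is due, by item count or by elapsed time.
struct Checkpointer {
    since_last: usize,
    last_at: Instant,
}

impl Checkpointer {
    const MAX_ITEMS: usize = 5_000;
    const MAX_AGE: Duration = Duration::from_secs(30);

    fn new() -> Self { Self { since_last: 0, last_at: Instant::now() } }

    /// Call once per processed item; returns true when a checkpoint is due.
    fn tick(&mut self) -> bool {
        self.since_last += 1;
        if self.since_last >= Self::MAX_ITEMS || self.last_at.elapsed() >= Self::MAX_AGE {
            self.since_last = 0;
            self.last_at = Instant::now();
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut cp = Checkpointer::new();
    let mut checkpoints = 0;
    for _ in 0..10_000 {
        if cp.tick() { checkpoints += 1; }
    }
    assert_eq!(checkpoints, 2); // at item 5,000 and item 10,000
}
```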
For responsive directory browsing:
```rust
let config = IndexerJobConfig::ui_navigation(location_id, path);
let handle = library.jobs().dispatch(IndexerJob::new(config)).await?;
```
Explore without permanent storage:
```rust
let config = IndexerJobConfig::ephemeral_browse(
    usb_path,
    IndexScope::Recursive
);
let job = IndexerJob::new(config);
```
Full indexing with content identification:
```rust
let config = IndexerJobConfig::new(
    location_id,
    path,
    IndexMode::Deep
);
```
The indexer is fully accessible through the CLI:
```bash
# Quick current directory scan
spacedrive index quick-scan ~/Documents

# Browse external drive
spacedrive index browse /media/usb --ephemeral

# Full location with progress monitoring
spacedrive index location ~/Pictures --mode deep
spacedrive job monitor  # Watch progress
```
Slow Indexing: Check for large node_modules or build directories. System rules automatically skip common patterns, or use .gitignore to exclude project-specific artifacts.
High Memory Usage: Reduce batch size for directories over 1M files. Ephemeral mode uses around 50 bytes per entry, so 100K files requires roughly 5MB.
Resume Not Working: Ensure the jobs database isn't corrupted. Check logs for serialization errors.
Enable detailed logging:
```bash
RUST_LOG=sd_core::ops::indexing=debug spacedrive start
```
Inspect job state:
```bash
spacedrive job info <job-id> --detailed
```
Windows: Uses file indices for change detection where available, falling back to path-only matching. Supports long paths transparently. Network drives may require polling.
macOS: Leverages FSEvents and native inodes. Integrates with Time Machine exclusions. APFS provides efficient cloning.
Linux: Full inode support with detailed permissions. Handles diverse filesystems from ext4 to ZFS. Symbolic links supported with cycle detection.