docs/core/indexing.mdx
Spacedrive's indexing system solves a specific challenge: How do you build a distributed database that feels as fast as a local file explorer?
The answer is a Hybrid Indexing Engine that layers an ultra-fast, in-memory ephemeral index over a robust, SQLite-backed persistent index. These two systems operate in tandem, allowing Spacedrive to instantly browse unmanaged locations (like a file manager) while seamlessly upgrading those paths to managed libraries (like a DAM) without UI flicker or state loss.
Most file management software forces a choice: fast, dumb directory listing (Explorer/Finder) or slow, heavy database ingestion (Lightroom/Photos). Spacedrive does both simultaneously by decoupling Discovery from Persistence.
When you open a location that hasn't been added to your library—an external drive, network share, or local directory—Spacedrive runs only Phase 1 (Discovery) of the indexing pipeline.
For files you want to track across devices, Spacedrive persists data to a synchronized SQLite database using the full multi-phase pipeline with deep content analysis, deduplication, and closure-table hierarchy management.
The critical innovation is how these two layers communicate. When you add a location to your library for a folder you're currently browsing ephemerally, the system performs an Intelligent Promotion: UUIDs assigned during the browsing session are preserved, so tags and notes attached to files carry over to the new managed location.
The indexing system consists of specialized components working together:
IndexerJob orchestrates the entire indexing process as a resumable job. It maintains state across application restarts and provides detailed progress reporting.
IndexerState preserves all necessary information to resume indexing from any interruption point. This includes the current phase, directories to process, accumulated statistics, and ephemeral UUID mappings for preserving user metadata across browsing-to-persistent transitions.
DatabaseStorage provides the low-level database CRUD layer. All database operations (create, update, move, delete) flow through this module for consistency.
DatabaseAdapter implements both ChangeHandler (for filesystem watcher events) and IndexPersistence (for indexer job batches). Both pipelines use the same code to write entries to the database via DatabaseStorage.
MemoryAdapter implements both ChangeHandler (for filesystem watcher events) and IndexPersistence (for indexer job batches). Both pipelines use the same code to write entries to the in-memory EphemeralIndex.
This dual-implementation architecture unifies watcher and job pipelines, eliminating code duplication between real-time filesystem monitoring and batch indexing operations.
FileTypeRegistry identifies files through extensions, magic bytes, and content analysis.
The system integrates deeply with Spacedrive's job infrastructure, which provides automatic state persistence through MessagePack serialization. When you pause an indexing operation, the entire job state is saved to a dedicated jobs database, allowing seamless resumption even after application restarts.
<Note>
Indexing jobs can run for hours on large directories. The resumable architecture ensures no work is lost if interrupted.
</Note>

The indexing system uses a closure table for hierarchy management instead of recursive queries:
Parent-child relationships are stored in the entry_closure table with precomputed ancestor-descendant pairs. This answers "find all descendants" with a single non-recursive query regardless of nesting depth, at the cost of additional storage (worst-case N² rows for deeply nested trees).
```sql
CREATE TABLE entry_closure (
    ancestor_id INTEGER,
    descendant_id INTEGER,
    depth INTEGER
);
```
The closure table stores all transitive relationships. For a file at /home/user/docs/report.pdf, entries exist for:

- report.pdf → report.pdf (depth 0)
- docs → report.pdf (depth 1)
- user → report.pdf (depth 2)
- home → report.pdf (depth 3)
Move operations require rebuilding closures for the entire moved subtree, which can affect thousands of rows when moving large directories.
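The closure bookkeeping can be sketched in miniature (hypothetical types, not Spacedrive's actual storage layer). Note how every insert copies one row per ancestor, which is exactly why moving a large subtree touches many rows:

```rust
/// One precomputed (ancestor, descendant, depth) pair, as in entry_closure.
#[derive(Debug, PartialEq)]
struct ClosureRow { ancestor: u32, descendant: u32, depth: u32 }

struct ClosureTable { rows: Vec<ClosureRow> }

impl ClosureTable {
    fn new() -> Self { Self { rows: Vec::new() } }

    /// Insert `entry` under `parent`: copy every ancestor row of the parent
    /// with depth bumped by one, then add the entry's self-row (depth 0).
    fn insert(&mut self, entry: u32, parent: Option<u32>) {
        if let Some(p) = parent {
            let inherited: Vec<ClosureRow> = self.rows.iter()
                .filter(|r| r.descendant == p)
                .map(|r| ClosureRow { ancestor: r.ancestor, descendant: entry, depth: r.depth + 1 })
                .collect();
            self.rows.extend(inherited);
        }
        self.rows.push(ClosureRow { ancestor: entry, descendant: entry, depth: 0 });
    }

    /// All descendants of `id` in a single pass, with no recursive walk.
    fn descendants(&self, id: u32) -> Vec<u32> {
        self.rows.iter()
            .filter(|r| r.ancestor == id && r.depth > 0)
            .map(|r| r.descendant)
            .collect()
    }
}

fn main() {
    // /home (1) / user (2) / docs (3) / report.pdf (4)
    let mut t = ClosureTable::new();
    t.insert(1, None);
    t.insert(2, Some(1));
    t.insert(3, Some(2));
    t.insert(4, Some(3));
    assert_eq!(t.descendants(1), vec![2, 3, 4]);
    // report.pdf carries one closure row per ancestor level, plus itself:
    assert_eq!(t.rows.iter().filter(|r| r.descendant == 4).count(), 4);
}
```

Because each entry carries a row per ancestor, relocating a subtree means deleting and re-deriving those rows for every moved descendant.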
The directory_paths table provides O(1) absolute path lookups for directories:
```sql
CREATE TABLE directory_paths (
    entry_id INTEGER PRIMARY KEY,
    path TEXT UNIQUE
);
```
This eliminates recursive parent traversal when building file paths. Each directory stores its complete absolute path, enabling instant resolution for child entries.
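A toy sketch of the lookup (assumed shapes, not the real API): resolving a child's path is one map lookup plus a string join, with no walk up the parent chain:

```rust
use std::collections::HashMap;

// Hypothetical sketch of the directory_paths idea: every directory caches
// its full absolute path, so a child's path is one lookup plus a join
// rather than a recursive parent traversal.
fn resolve(dir_paths: &HashMap<u32, String>, parent_id: u32, name: &str) -> Option<String> {
    dir_paths.get(&parent_id).map(|dir| format!("{}/{}", dir, name))
}

fn main() {
    let mut dirs = HashMap::new();
    dirs.insert(3, "/home/user/docs".to_string());
    assert_eq!(
        resolve(&dirs, 3, "report.pdf").unwrap(),
        "/home/user/docs/report.pdf"
    );
}
```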
```sql
CREATE TABLE entry (
    id INTEGER PRIMARY KEY,
    uuid UUID UNIQUE,
    parent_id INTEGER,
    name TEXT,
    extension TEXT,
    kind INTEGER,
    size BIGINT,
    inode BIGINT,
    content_id INTEGER,
    aggregate_size BIGINT,
    child_count INTEGER,
    file_count INTEGER
);
```
The pipeline is broken into atomic, resumable phases. The Ephemeral engine runs only Phase 1. The Persistent engine runs all five phases.
Used by: Ephemeral & Persistent
A parallel, asynchronous filesystem walk designed for raw speed:
- Skips unwanted directories (.git, node_modules) at the discovery edge through IndexerRuler, which applies toggleable system rules (NO_HIDDEN, NO_DEV_DIRS) and dynamically loaded .gitignore patterns when inside a Git repository
- Emits DirEntry objects
- Progress is measured by directories discovered

Entries are collected into batches of 1,000 items before moving to processing.
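A simplified, single-threaded sketch of this walk (the real implementation is parallel with work stealing; the `skip` closure stands in for IndexerRuler):

```rust
use std::collections::VecDeque;
use std::fs;
use std::path::PathBuf;

/// Breadth-first traversal that emits paths in fixed-size batches, with a
/// pluggable skip rule applied at the discovery edge.
fn discover(root: PathBuf, batch_size: usize, skip: &dyn Fn(&str) -> bool) -> Vec<Vec<PathBuf>> {
    let mut queue = VecDeque::from([root]);
    let mut batches = Vec::new();
    let mut current = Vec::new();
    while let Some(dir) = queue.pop_front() {
        let Ok(entries) = fs::read_dir(&dir) else { continue };
        for entry in entries.flatten() {
            let name = entry.file_name().to_string_lossy().into_owned();
            if skip(&name) { continue; } // rejected before it ever reaches processing
            let path = entry.path();
            if path.is_dir() { queue.push_back(path.clone()); }
            current.push(path);
            if current.len() == batch_size {
                batches.push(std::mem::take(&mut current));
            }
        }
    }
    if !current.is_empty() { batches.push(current); }
    batches
}

fn main() {
    // Walk a temp tree, skipping dot-entries (a stand-in for NO_HIDDEN).
    let tmp = std::env::temp_dir().join("sd_walk_demo");
    let _ = fs::create_dir_all(tmp.join("sub"));
    fs::write(tmp.join("a.txt"), b"x").unwrap();
    fs::write(tmp.join(".hidden"), b"x").unwrap();
    let batches = discover(tmp.clone(), 1000, &|n| n.starts_with('.'));
    let all: Vec<_> = batches.concat();
    assert!(all.iter().all(|p| !p.file_name().unwrap().to_string_lossy().starts_with('.')));
    let _ = fs::remove_dir_all(&tmp);
}
```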
Used by: Persistent Only
Converts discovered entries into database records:
Change Detection runs during this phase. The ChangeDetector loads existing database entries for the indexing path, then compares against filesystem state to identify new, modified, moved, and deleted entries.
Changes are processed in batch transactions. Each batch inserts closure table rows, updates the directory paths cache, and syncs entries across devices.
Ephemeral UUID Preservation happens here. When a browsed folder is promoted to a managed location, UUIDs assigned during ephemeral indexing are preserved (state.ephemeral_uuids). This prevents orphaning user metadata like tags and notes attached during browsing sessions.
The processing phase validates that the indexing path stays within location boundaries, preventing catastrophic cross-location deletion if watcher routing bugs send events for the wrong path.
Used by: Persistent Only
To allow sorting folders by "True Size" (the size of all children recursively), we aggregate statistics from the bottom up:
- Uses the entry_closure table to perform O(1) descendant lookups

These aggregates are stored in the entry table:

- aggregate_size: Total bytes including subdirectories
- child_count: Direct children only
- file_count: Recursive file count

This enables instant directory size display without traversing descendants.
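A minimal sketch of the bottom-up roll-up (hypothetical shapes; it assumes entries arrive parents-before-children, as discovery emits them):

```rust
use std::collections::HashMap;

/// The three aggregate columns stored on each entry row.
#[derive(Default, Clone, Copy, Debug, PartialEq)]
struct Aggregate { aggregate_size: u64, child_count: u32, file_count: u32 }

/// Fold child statistics into parents: entries are (id, parent, size, is_dir),
/// listed parents-before-children, so one reverse pass visits each child
/// after its whole subtree is already summed.
fn aggregate(entries: &[(u32, Option<u32>, u64, bool)]) -> HashMap<u32, Aggregate> {
    let mut out: HashMap<u32, Aggregate> = HashMap::new();
    // Seed each entry with its own size and file flag.
    for &(id, _, size, is_dir) in entries {
        out.insert(id, Aggregate {
            aggregate_size: size,
            child_count: 0,
            file_count: if is_dir { 0 } else { 1 },
        });
    }
    // Reverse pass: fold each child into its parent exactly once.
    for &(id, parent, _, _) in entries.iter().rev() {
        if let Some(p) = parent {
            let child = out[&id];
            let pa = out.get_mut(&p).unwrap();
            pa.aggregate_size += child.aggregate_size;
            pa.file_count += child.file_count;
            pa.child_count += 1;
        }
    }
    out
}

fn main() {
    let entries = [
        (1, None, 0, true),       // root dir
        (2, Some(1), 0, true),    // docs dir
        (3, Some(2), 100, false), // a.txt
        (4, Some(2), 50, false),  // b.txt
    ];
    let agg = aggregate(&entries);
    assert_eq!(agg[&2].aggregate_size, 150); // "true size" of docs
    assert_eq!(agg[&2].child_count, 2);      // direct children only
    assert_eq!(agg[&1].file_count, 2);       // recursive file count
}
```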
Used by: Persistent Only
Enables Spacedrive's deduplication capabilities through Content Addressable Storage (CAS):
- Creates content_identity records
- Derives each Content UUID deterministically (from the content_hash only) so any device can independently identify identical files and arrive at the exact same Content UUID without communicating. This enables offline duplicate detection across all devices and libraries
- Uses the FileTypeRegistry to populate kind_id and mime_type_id fields for new content

Used by: Persistent Only
Finalizing handles post-processing tasks like directory aggregation updates and potential processor dispatch (thumbnail generation for Deep Mode).
The indexing system includes both batch and real-time change detection:
ChangeDetector compares database state against filesystem during indexer job scans:
```rust
let mut detector = ChangeDetector::new();
detector.load_existing_entries(ctx, location_id, indexing_path).await?;
for entry in discovered_entries {
    if let Some(change) = detector.check_path(&path, &metadata, inode) {
        // Process New, Modified, or Moved change
    }
}
let deleted = detector.find_deleted(&seen_paths);
```
The detector tracks paths by inode to identify moves. On Unix systems, inodes provide stable file identity across renames. Windows falls back to path-only matching since file indices are unstable across reboots.
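A compact sketch of inode-based move detection (a simplification of what ChangeDetector does, with assumed types): a known inode reappearing at a new path is reported as a move rather than a delete plus create:

```rust
use std::collections::HashMap;
use std::path::PathBuf;

#[derive(Debug, PartialEq)]
enum Change { New(PathBuf), Moved { from: PathBuf, to: PathBuf } }

/// Record the (inode, path) pair and classify what changed. On Unix the
/// inode survives renames, which is what makes move detection possible.
fn check(known: &mut HashMap<u64, PathBuf>, inode: u64, path: PathBuf) -> Option<Change> {
    match known.insert(inode, path.clone()) {
        None => Some(Change::New(path)),
        Some(old) if old != path => Some(Change::Moved { from: old, to: path }),
        Some(_) => None, // same inode, same path: unchanged
    }
}

fn main() {
    let mut known = HashMap::new();
    assert_eq!(
        check(&mut known, 42, "/a/report.pdf".into()),
        Some(Change::New("/a/report.pdf".into()))
    );
    assert_eq!(check(&mut known, 42, "/a/report.pdf".into()), None);
    assert_eq!(
        check(&mut known, 42, "/b/report.pdf".into()),
        Some(Change::Moved { from: "/a/report.pdf".into(), to: "/b/report.pdf".into() })
    );
}
```

On Windows, where file indices are unstable across reboots, the same classification has to fall back to path-only matching.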
Both DatabaseAdapter and MemoryAdapter implement the ChangeHandler trait, which defines the interface for responding to filesystem watcher events:
```rust
pub trait ChangeHandler {
    async fn find_by_path(&self, path: &Path) -> Result<Option<EntryRef>>;
    async fn create(&mut self, metadata: &DirEntry, parent_path: &Path) -> Result<EntryRef>;
    async fn update(&mut self, entry: &EntryRef, metadata: &DirEntry) -> Result<()>;
    async fn move_entry(&mut self, entry: &EntryRef, old_path: &Path, new_path: &Path) -> Result<()>;
    async fn delete(&mut self, entry: &EntryRef) -> Result<()>;
}
```
The watcher routes events to the appropriate handler based on whether the path belongs to a persistent location (DatabaseAdapter → database) or ephemeral session (MemoryAdapter → memory).
The system provides flexible configuration through modes and scopes:
Shallow Mode extracts only filesystem metadata (name, size, dates). Completes in under 500ms for typical directories.
Content Mode adds BLAKE3 hashing to identify files by content. Enables deduplication and content tracking.
Deep Mode performs full analysis including file type identification and metadata extraction. Triggers thumbnail generation for images and videos.
Current Scope indexes only immediate directory contents. Used for responsive UI navigation.
Recursive Scope indexes the entire directory tree. Used for full location indexing.
Spacedrive supports both persistent and ephemeral indexing modes:
Persistent indexing stores all data in the database permanently. This is the default for library locations.
Ephemeral indexing keeps data in memory only, perfect for browsing external drives without permanent storage. The system uses highly memory-optimized structures (detailed in the Data Structures section below):
- FileNode entries with 32-bit entry IDs instead of 64-bit pointers

Memory usage is around 50 bytes per entry vs 200+ bytes with naive approaches: a 4-6x reduction that enables browsing hundreds of thousands of files without database overhead.
The EphemeralIndexCache tracks which paths have been indexed, are currently being indexed, or are registered for filesystem watching. When a watched path receives filesystem events, the system updates the in-memory index in real-time through the unified ChangeHandler trait (shared with persistent storage).
Specific low-level optimizations make the hybrid architecture viable:
The ephemeral index doesn't use standard HashMaps. Instead, it uses a memory-mapped NodeArena—a contiguous slab of memory that stores file nodes using 32-bit integers as pointers rather than 64-bit pointers. This reduces memory overhead by 4-6x compared to naive HashMap<PathBuf, Entry> implementations, enabling browsing of hundreds of thousands of files without database overhead.
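A minimal arena sketch under these assumptions (the real NodeArena layout is more elaborate): nodes live in one contiguous Vec and reference each other by 32-bit index rather than 64-bit pointer:

```rust
/// A file node addressed by 32-bit arena index. Links cost 4 bytes each
/// instead of the 8 a pointer would take, and nodes stay contiguous.
#[derive(Debug)]
struct FileNode {
    name_id: u32, // index into the interned name pool
    parent: u32,  // arena index of parent (u32::MAX = no parent)
    size: u64,
}

struct NodeArena { nodes: Vec<FileNode> }

impl NodeArena {
    const NONE: u32 = u32::MAX;
    fn new() -> Self { Self { nodes: Vec::new() } }
    fn alloc(&mut self, node: FileNode) -> u32 {
        let id = self.nodes.len() as u32;
        self.nodes.push(node);
        id
    }
    fn get(&self, id: u32) -> &FileNode { &self.nodes[id as usize] }
}

fn main() {
    let mut arena = NodeArena::new();
    let root = arena.alloc(FileNode { name_id: 0, parent: NodeArena::NONE, size: 0 });
    let child = arena.alloc(FileNode { name_id: 1, parent: root, size: 1024 });
    assert_eq!(arena.get(child).parent, root);
    assert_eq!(std::mem::size_of::<FileNode>(), 16); // 4 + 4 + 8 bytes, no padding
}
```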
In typical filesystems, filenames like index.js, .DS_Store, or conf.yaml repeat thousands of times. The NameCache interns these strings, storing them once and referencing them by pointer. Multiple directory trees can coexist in the same EphemeralIndex (browsing both /mnt/nas and /media/usb simultaneously), sharing the string interning pool for maximum deduplication.
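String interning can be sketched like this (an assumed shape, not the actual NameCache API): each distinct name is stored once and referenced by a 4-byte id:

```rust
use std::collections::HashMap;

/// Interning pool: repeated filenames share one stored String, and every
/// node holds only a small integer id.
struct NameCache {
    lookup: HashMap<String, u32>,
    names: Vec<String>,
}

impl NameCache {
    fn new() -> Self { Self { lookup: HashMap::new(), names: Vec::new() } }

    /// Return the existing id for `name`, or store it and mint a new one.
    fn intern(&mut self, name: &str) -> u32 {
        if let Some(&id) = self.lookup.get(name) { return id; }
        let id = self.names.len() as u32;
        self.names.push(name.to_string());
        self.lookup.insert(name.to_string(), id);
        id
    }

    fn resolve(&self, id: u32) -> &str { &self.names[id as usize] }
}

fn main() {
    let mut cache = NameCache::new();
    let a = cache.intern("index.js");
    let b = cache.intern("index.js"); // thousands of repeats share one entry
    let c = cache.intern(".DS_Store");
    assert_eq!(a, b);
    assert_ne!(a, c);
    assert_eq!(cache.resolve(a), "index.js");
    assert_eq!(cache.names.len(), 2);
}
```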
While the database uses an adjacency list (parent_id) for structure, recursive queries are slow. The directory_paths table caches the full absolute path of every directory, enabling O(1) path resolution for any file without recursive parent traversal.
The IndexerRuler applies filtering rules during discovery to skip unwanted files:
System Rules are toggleable patterns like:
- NO_HIDDEN: Skip dotfiles (.git, .DS_Store)
- NO_DEV_DIRS: Skip node_modules, target, dist
- NO_SYSTEM: Skip OS folders (System32, Windows)

Git Integration: When indexing inside a Git repository, rules are dynamically loaded from .gitignore files. This automatically excludes build artifacts and local configuration.
Rules return a RulerDecision (Accept/Reject) for each path during discovery, preventing unwanted entries from ever reaching the processing phase.
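A hedged sketch of rule evaluation (the rule names mirror the ones above; the matching logic is simplified): the first rejection wins, so filtered paths never reach processing:

```rust
#[derive(Debug, PartialEq)]
enum RulerDecision { Accept, Reject }

/// Evaluate the toggleable system rules against a single path component.
fn evaluate(name: &str, no_hidden: bool, no_dev_dirs: bool) -> RulerDecision {
    if no_hidden && name.starts_with('.') {
        return RulerDecision::Reject; // dotfiles like .git, .DS_Store
    }
    if no_dev_dirs && matches!(name, "node_modules" | "target" | "dist") {
        return RulerDecision::Reject; // common build/dependency directories
    }
    RulerDecision::Accept
}

fn main() {
    assert_eq!(evaluate(".git", true, true), RulerDecision::Reject);
    assert_eq!(evaluate("node_modules", true, true), RulerDecision::Reject);
    assert_eq!(evaluate("report.pdf", true, true), RulerDecision::Accept);
    assert_eq!(evaluate(".git", false, true), RulerDecision::Accept); // rule toggled off
}
```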
The IndexVerifyAction checks integrity by running a fresh ephemeral scan and comparing metadata against the existing persistent index:
```rust
let verify = IndexVerifyAction::from_input(IndexVerifyInput { path }).await?;
let output = verify.execute(library, context).await?;
// output.report contains:
// - missing_from_index: Files on disk but not in database
// - stale_in_index: Entries in database but missing from filesystem
// - metadata_mismatches: Size, mtime, or inode differences
```
The verification system detects files on disk that are missing from the index, stale entries that no longer exist on the filesystem, and metadata mismatches in size, mtime, or inode.
Verification runs as a library action and returns a detailed IntegrityReport with per-file diagnostics.
The indexing system leverages Spacedrive's job infrastructure for reliability and monitoring.
When interrupted, the entire job state is serialized:
```rust
#[derive(Serialize, Deserialize)]
pub struct IndexerState {
    phase: Phase,
    dirs_to_walk: VecDeque<PathBuf>,
    entry_batches: Vec<Vec<DirEntry>>,
    entry_id_cache: HashMap<PathBuf, i32>,
    ephemeral_uuids: HashMap<PathBuf, Uuid>,
    stats: IndexerStats,
}
```
This state is stored in the jobs database, separate from your library data. On resume, the job picks up exactly where it left off.
Real-time progress flows through multiple channels:
```rust
pub struct IndexerProgress {
    pub phase: IndexPhase,
    pub total_found: IndexerStats,
    pub processing_rate: f32,
    pub estimated_remaining: Option<Duration>,
}
```
Progress updates are sent to the UI via channels, persisted to the database, and available through job queries for time estimates.
Non-critical errors are accumulated but don't stop indexing. Critical errors halt the job with state preserved.
Indexing performance varies by mode and scope:
| Configuration | Performance | Use Case |
|---|---|---|
| Current + Shallow | <500ms | UI navigation |
| Recursive + Shallow | ~10K files/sec | Quick scan |
| Recursive + Content | ~1K files/sec | Normal indexing |
| Recursive + Deep | ~100 files/sec | Media libraries |
Batch Processing: Groups operations into transactions of 1,000 items, reducing database overhead by 30x.
Parallel Discovery: Work-stealing model with atomic counters for directory traversal, using half of available CPU cores by default.
Entry ID Cache: Eliminates redundant parent lookups during hierarchy construction, critical for deep directory trees.
Checkpoint Strategy: Checkpoints occur every 5,000 items or 30 seconds, balancing durability with performance.
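The dual trigger can be sketched as a small helper (hypothetical, not the actual job-system API): checkpoint after 5,000 items or 30 seconds, whichever comes first:

```rust
use std::time::{Duration, Instant};

/// Tracks progress since the last checkpoint and signals when the next
/// one is due, by item count or by elapsed time.
struct Checkpointer {
    since_last: usize,
    last_at: Instant,
}

impl Checkpointer {
    const MAX_ITEMS: usize = 5_000;
    const MAX_AGE: Duration = Duration::from_secs(30);

    fn new() -> Self { Self { since_last: 0, last_at: Instant::now() } }

    /// Call once per processed item; returns true when a checkpoint is due.
    fn tick(&mut self) -> bool {
        self.since_last += 1;
        if self.since_last >= Self::MAX_ITEMS || self.last_at.elapsed() >= Self::MAX_AGE {
            self.since_last = 0;
            self.last_at = Instant::now();
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut cp = Checkpointer::new();
    let mut checkpoints = 0;
    for _ in 0..10_000 {
        if cp.tick() { checkpoints += 1; }
    }
    assert_eq!(checkpoints, 2); // at item 5,000 and item 10,000
}
```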
For responsive directory browsing:
```rust
let config = IndexerJobConfig::ui_navigation(location_id, path);
let handle = library.jobs().dispatch(IndexerJob::new(config)).await?;
```
Explore without permanent storage:
```rust
let config = IndexerJobConfig::ephemeral_browse(
    usb_path,
    IndexScope::Recursive
);
let job = IndexerJob::new(config);
```
Full indexing with content identification:
```rust
let config = IndexerJobConfig::new(
    location_id,
    path,
    IndexMode::Deep
);
```
The indexer is fully accessible through the CLI:
```bash
# Quick current directory scan
spacedrive index quick-scan ~/Documents

# Browse external drive
spacedrive index browse /media/usb --ephemeral

# Full location with progress monitoring
spacedrive index location ~/Pictures --mode deep
spacedrive job monitor  # Watch progress
```
Slow Indexing: Check for large node_modules or build directories. System rules automatically skip common patterns, or use .gitignore to exclude project-specific artifacts.
High Memory Usage: Reduce batch size for directories over 1M files. Ephemeral mode uses around 50 bytes per entry, so 100K files requires roughly 5MB.
Resume Not Working: Ensure the jobs database isn't corrupted. Check logs for serialization errors.
Enable detailed logging:
```bash
RUST_LOG=sd_core::ops::indexing=debug spacedrive start
```
Inspect job state:
```bash
spacedrive job info <job-id> --detailed
```
Windows: Uses file indices for change detection where available, falling back to path-only matching. Supports long paths transparently. Network drives may require polling.
macOS: Leverages FSEvents and native inodes. Integrates with Time Machine exclusions. APFS provides efficient cloning.
Linux: Full inode support with detailed permissions. Handles diverse filesystems from ext4 to ZFS. Symbolic links supported with cycle detection.