.tasks/core/INDEX-005-indexer-rules-engine.md
Implement the filtering rules system that allows selective indexing by skipping unwanted files at discovery time. The system supports toggleable system rules (hidden files, dev directories, OS folders) and dynamic .gitignore integration for Git repositories.
The IndexerRuler applies rules during Phase 1 (Discovery) to filter files before they enter the processing pipeline:
pub struct IndexerRuler {
// Toggleable system rules
enabled_rules: HashSet<SystemRule>,
// .gitignore patterns (loaded dynamically)
gitignore: Option<Gitignore>,
// Custom user rules
custom_rules: Vec<Rule>,
}
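// PartialEq is needed for the `==` checks below; Copy lets check_path return rule.decision by value
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RulerDecision {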
Accept, // Include in index
Reject, // Skip this file
}
Predefined patterns that can be toggled on/off:
| Rule | Pattern | Example Matches |
|---|---|---|
| NO_HIDDEN | Files starting with . | .git, .DS_Store, .env |
| NO_DEV_DIRS | Common dev folders | node_modules, target, dist, build |
| NO_SYSTEM | OS system folders | System32, Windows, /proc, /sys |
| NO_TEMP | Temporary files | *.tmp, *.temp, ~* |
| NO_CACHE | Cache directories | .cache, __pycache__, .pytest_cache |
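As a rough illustration of how the table above might map onto code, here is a minimal sketch of a `SystemRule` enum with a `matches` method. The variant names, the exact name lists, and the choice to match only on the final path component (plus the literal `/proc` and `/sys` paths) are assumptions made for this example, not the actual implementation.

```rust
use std::path::Path;

/// Hypothetical enum mirroring the rule names in the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum SystemRule {
    NoHidden,
    NoDevDirs,
    NoSystem,
    NoTemp,
    NoCache,
}

impl SystemRule {
    /// Returns true when `path` should be skipped under this rule.
    /// Matching looks at the final path component only (an assumption).
    pub fn matches(self, path: &Path, is_dir: bool) -> bool {
        let name = match path.file_name().and_then(|n| n.to_str()) {
            Some(n) => n,
            None => return false,
        };
        match self {
            // Anything whose name starts with a dot.
            SystemRule::NoHidden => name.starts_with('.'),
            // Well-known build and dependency directories.
            SystemRule::NoDevDirs => {
                is_dir && matches!(name, "node_modules" | "target" | "dist" | "build")
            }
            // OS folders: by name on Windows, by absolute path on Linux.
            SystemRule::NoSystem => {
                is_dir
                    && (matches!(name, "System32" | "Windows")
                        || path == Path::new("/proc")
                        || path == Path::new("/sys"))
            }
            // *.tmp, *.temp, ~*
            SystemRule::NoTemp => {
                !is_dir
                    && (name.ends_with(".tmp")
                        || name.ends_with(".temp")
                        || name.starts_with('~'))
            }
            // Cache directories.
            SystemRule::NoCache => {
                is_dir && matches!(name, ".cache" | "__pycache__" | ".pytest_cache")
            }
        }
    }
}
```

Deriving `Hash` and `Eq` lets the variants live in the `HashSet` held by `enabled_rules`, so toggling a rule is just an insert or remove.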
When indexing inside a Git repository, the ruler automatically loads .gitignore:
impl IndexerRuler {
pub fn load_gitignore(&mut self, repo_root: &Path) -> Result<()> {
let gitignore_path = repo_root.join(".gitignore");
if gitignore_path.exists() {
let patterns = parse_gitignore(&gitignore_path)?;
self.gitignore = Some(Gitignore::new(patterns));
}
Ok(())
}
pub fn check_path(&self, path: &Path, is_dir: bool) -> RulerDecision {
// Check system rules first
if self.check_system_rules(path, is_dir) == RulerDecision::Reject {
return RulerDecision::Reject;
}
// Check .gitignore patterns
if let Some(gitignore) = &self.gitignore {
if gitignore.matches(path, is_dir) {
return RulerDecision::Reject;
}
}
// Check custom rules
for rule in &self.custom_rules {
if rule.matches(path, is_dir) {
return rule.decision;
}
}
RulerDecision::Accept
}
}
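The `parse_gitignore` helper and the `Gitignore` type used above are not shown in this task. The sketch below is one simplified way such a matcher could work, assuming a hypothetical `Gitignore::parse` constructor. It handles only comments, `*` and `?` wildcards, `!` negation, and trailing-slash directory patterns, and it matches against the final path component only. Character classes, `**`, and anchored patterns are deliberately left out, so a real implementation would need more than this (or an existing crate such as `ignore`).

```rust
use std::path::Path;

/// One parsed line of a .gitignore file (simplified).
struct IgnorePattern {
    glob: String,
    negated: bool,
    dir_only: bool,
}

pub struct Gitignore {
    patterns: Vec<IgnorePattern>,
}

/// Matches `text` against a glob where `*` is any run of characters
/// and `?` is exactly one character.
fn glob_match(pattern: &str, text: &str) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let t: Vec<char> = text.chars().collect();
    let (mut pi, mut ti) = (0, 0);
    // Position to resume from after the most recent `*`.
    let mut backtrack: Option<(usize, usize)> = None;
    while ti < t.len() {
        if pi < p.len() && p[pi] == '*' {
            backtrack = Some((pi + 1, ti));
            pi += 1;
        } else if pi < p.len() && (p[pi] == '?' || p[pi] == t[ti]) {
            pi += 1;
            ti += 1;
        } else if let Some((bp, bt)) = backtrack {
            // Let the last `*` swallow one more character and retry.
            pi = bp;
            ti = bt + 1;
            backtrack = Some((bp, bt + 1));
        } else {
            return false;
        }
    }
    while pi < p.len() && p[pi] == '*' {
        pi += 1;
    }
    pi == p.len()
}

impl Gitignore {
    /// Hypothetical constructor: parses the text of a .gitignore file.
    pub fn parse(contents: &str) -> Self {
        let patterns = contents
            .lines()
            .map(str::trim)
            .filter(|l| !l.is_empty() && !l.starts_with('#'))
            .map(|l| {
                let (negated, rest) = match l.strip_prefix('!') {
                    Some(r) => (true, r),
                    None => (false, l),
                };
                let (dir_only, glob) = match rest.strip_suffix('/') {
                    Some(g) => (true, g),
                    None => (false, rest),
                };
                IgnorePattern { glob: glob.to_string(), negated, dir_only }
            })
            .collect();
        Gitignore { patterns }
    }

    /// Returns true when the path is ignored. The last matching pattern
    /// wins, which is how a later `!pattern` re-includes a file.
    pub fn matches(&self, path: &Path, is_dir: bool) -> bool {
        let name = match path.file_name().and_then(|n| n.to_str()) {
            Some(n) => n,
            None => return false,
        };
        let mut ignored = false;
        for p in &self.patterns {
            if p.dir_only && !is_dir {
                continue;
            }
            if glob_match(&p.glob, name) {
                ignored = !p.negated;
            }
        }
        ignored
    }
}
```

The `matches(path, is_dir)` signature is kept identical to the call in `check_path`, so a matcher of this shape could slot in behind the `gitignore` field.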
Rules are applied at the edge of discovery:
// In Phase 1 (Discovery)
for entry in read_dir(path)? {
let entry = entry?;
let path = entry.path();
// Apply rules BEFORE queuing for processing
// std::fs::DirEntry has no is_dir(); ask its file type instead
if ruler.check_path(&path, entry.file_type()?.is_dir()) == RulerDecision::Reject {
continue; // Skip this file entirely
}
// File passed rules, add to processing queue
discovered_entries.push(entry);
}
This prevents unwanted files from ever reaching Phase 2, saving significant processing time.
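Much of the saving comes from pruning: when a directory is rejected, nothing beneath it is ever read. The sketch below shows that effect with a recursive walk over `std::fs::read_dir`. The `discover` function and its `reject` callback are hypothetical stand-ins for the real discovery phase and `IndexerRuler::check_path`.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Walks `root` recursively and collects every path that passes `reject`.
/// A rejected directory is skipped without being opened, so its contents
/// cost nothing.
pub fn discover(
    root: &Path,
    reject: &dyn Fn(&Path, bool) -> bool,
    out: &mut Vec<PathBuf>,
) -> io::Result<()> {
    for entry in fs::read_dir(root)? {
        let entry = entry?;
        let path = entry.path();
        let is_dir = entry.file_type()?.is_dir();

        // Apply rules before doing anything else with the entry.
        if reject(&path, is_dir) {
            continue;
        }
        if is_dir {
            // Only accepted directories are descended into.
            discover(&path, reject, out)?;
        }
        out.push(path);
    }
    Ok(())
}
```

With a callback that rejects `node_modules`, a project with hundreds of thousands of dependency files costs a single directory-entry check rather than a full traversal.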
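- core/src/ops/indexing/rules.rs - IndexerRuler, SystemRule, RulerDecision
- core/src/ops/indexing/phases/discovery.rs - Rules applied during filesystem walk
- core/src/ops/indexing/input.rs - IndexerJobConfig with enabled_rules field

Rules are evaluated in order of specificity, matching the order in check_path:
1. System rules (toggleable)
2. .gitignore patterns, when a .gitignore was loaded
3. Custom user rules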
First rejection wins - no need to check remaining rules.
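The short-circuit behaviour can be sketched as a chain of stages, each of which either returns a decision or passes. The `Stage` struct, the `evaluate` function, and the three toy checks are hypothetical and exist only to show the ordering. They stand in for system rules, .gitignore patterns, and custom rules respectively.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RulerDecision {
    Accept,
    Reject,
}

/// One evaluation stage. `check` returns None when it has no opinion.
pub struct Stage {
    pub label: &'static str,
    pub check: fn(&str) -> Option<RulerDecision>,
}

/// Runs the stages in order and stops at the first one that decides.
/// Returns the decision and the label of the stage that made it.
pub fn evaluate(stages: &[Stage], file_name: &str) -> (RulerDecision, Option<&'static str>) {
    for stage in stages {
        if let Some(decision) = (stage.check)(file_name) {
            return (decision, Some(stage.label));
        }
    }
    // Nothing objected, so the file is indexed.
    (RulerDecision::Accept, None)
}

// Toy stand-ins for the three rule sources (assumptions for this sketch).
fn system_rules(name: &str) -> Option<RulerDecision> {
    name.starts_with('.').then_some(RulerDecision::Reject)
}

fn gitignore(name: &str) -> Option<RulerDecision> {
    name.ends_with(".tmp").then_some(RulerDecision::Reject)
}

fn custom_rules(name: &str) -> Option<RulerDecision> {
    name.starts_with('~').then_some(RulerDecision::Reject)
}

pub fn default_stages() -> [Stage; 3] {
    [
        Stage { label: "system", check: system_rules },
        Stage { label: "gitignore", check: gitignore },
        Stage { label: "custom", check: custom_rules },
    ]
}
```

A name such as `.cache.tmp` would match both the system and .gitignore stages here, but only the system stage runs, since it decides first.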
Applying rules at the discovery edge speeds up indexing by roughly 1.7x to 6.7x in the scenarios below:
| Scenario | Without Rules | With Rules | Speedup |
|---|---|---|---|
| Node.js project (500K files) | 50 seconds | 8 seconds | 6.25x |
| Rust project (target/ dir) | 20 seconds | 3 seconds | 6.67x |
| Home directory (hidden files) | 100 seconds | 60 seconds | 1.67x |
By rejecting files at discovery, we avoid queuing and processing them in Phase 2.
# Skip all hidden files and dev directories
spacedrive index location ~/Projects \
--skip-hidden \
--skip-dev-dirs
# Use .gitignore patterns
spacedrive index location ~/code/my-app \
--use-gitignore
# Custom rule
spacedrive index location ~/Documents \
--exclude "*.tmp" \
--exclude "~*"
[location."~/Projects"]
rules = ["NO_HIDDEN", "NO_DEV_DIRS"]
use_gitignore = true
[location."~/Documents"]
custom_rules = [
{ pattern = "*.tmp", decision = "Reject" },
{ pattern = "~*", decision = "Reject" }
]
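The rule names in the `rules` array have to be turned into typed values before they can populate `enabled_rules`. One possible approach, sketched here with a hypothetical `SystemRule` enum and `parse_enabled_rules` helper, is a `FromStr` implementation that rejects unknown names rather than silently ignoring them. How the real config loader handles this is not specified in this task.

```rust
use std::collections::HashSet;
use std::str::FromStr;

/// Hypothetical enum mirroring the rule names used in the config.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum SystemRule {
    NoHidden,
    NoDevDirs,
    NoSystem,
    NoTemp,
    NoCache,
}

impl FromStr for SystemRule {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "NO_HIDDEN" => Ok(SystemRule::NoHidden),
            "NO_DEV_DIRS" => Ok(SystemRule::NoDevDirs),
            "NO_SYSTEM" => Ok(SystemRule::NoSystem),
            "NO_TEMP" => Ok(SystemRule::NoTemp),
            "NO_CACHE" => Ok(SystemRule::NoCache),
            other => Err(format!("unknown rule: {other}")),
        }
    }
}

/// Converts the `rules = [...]` strings into a set, failing on the
/// first name that is not a known rule.
pub fn parse_enabled_rules(names: &[&str]) -> Result<HashSet<SystemRule>, String> {
    names.iter().map(|n| n.parse()).collect()
}
```

Failing loudly on a misspelled rule name avoids the case where a typo in the config quietly disables filtering for a location.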
Supported .gitignore syntax:
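- Wildcards (*.log, temp*)
- Directory patterns (build/)
- Negation (!important.log)
- Character classes ([abc].txt)
- Recursive matching (**/node_modules)
- Comments (# ignore this)

# Create test directory with common patterns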
mkdir -p ~/test-rules
cd ~/test-rules
touch .hidden visible.txt
mkdir -p node_modules/.cache
echo "*.tmp" > .gitignore
touch test.tmp test.txt
# Index with rules
spacedrive index location ~/test-rules \
--skip-hidden \
--skip-dev-dirs \
--use-gitignore
# Verify filtered correctly
spacedrive db query "SELECT name FROM entry WHERE parent_id IN (
SELECT id FROM entry WHERE name = 'test-rules'
)"
# Should only see: visible.txt, test.txt
# Should NOT see: .hidden, .gitignore (both hidden), node_modules, test.tmp
Located in core/tests/indexing/:
- test_ruler_no_hidden - Verify hidden files skipped
- test_ruler_no_dev_dirs - Verify dev directories skipped
- test_ruler_gitignore - Verify .gitignore patterns respected
- test_ruler_precedence - Verify rule evaluation order
- test_ruler_custom_rules - Verify custom user rules work