.tasks/core/INDEX-012-ephemeral-cache-parent-path-dedup.md
When a volume is indexed ephemerally (recursive scan), browsing a subdirectory within that volume triggers a redundant ephemeral scan. This creates duplicate registrations in indexed_paths and wastes work. The primary cause is a symlink mismatch on macOS combined with exact-match path lookups.
$ sd index ephemeral-cache
INDEXED PATHS Children
○ /System/Volumes/Data 11
○ /Users/jamespine 111
User indexed their system volume (registered as /System/Volumes/Data), then browsed /Users/jamespine in the Explorer. The cache created a second entry with 111 children from a redundant shallow scan, despite the volume index already containing all 1.8M entries recursively.
On macOS with APFS, /Users is a symlink to /System/Volumes/Data/Users. The volume indexer walks from the real mount point /System/Volumes/Data, so the arena stores all paths under /System/Volumes/Data/Users/jamespine/....
When the Explorer browses /Users/jamespine, the chain of events:
get_for_search("/Users/jamespine") (directory_listing.rs:649) — canonicalizes the path to /System/Volumes/Data/Users/jamespine, finds it starts with the indexed root /System/Volumes/Data → returns the index ✅
list_directory("/Users/jamespine") (directory_listing.rs:659) — does a raw path_index.get(path) lookup. The arena has /System/Volumes/Data/Users/jamespine as the key, not /Users/jamespine → returns None ❌
Fallthrough (directory_listing.rs:736+) — concludes the path isn't indexed, calls create_for_indexing("/Users/jamespine"), triggers a redundant shallow scan, registers /Users/jamespine as a second entry in indexed_paths
The fix requires addressing both the symlink resolution and the parent-path awareness.
The core issue is that list_directory, get_entry_ref, and other EphemeralIndex methods use raw path lookups against path_index without resolving symlinks. When the arena was populated via a real path (/System/Volumes/Data/...) and the query uses a symlink path (/Users/...), the lookup fails.
// core/src/ops/indexing/ephemeral/index.rs
impl EphemeralIndex {
/// Resolve a path to its canonical form if it exists in the arena.
/// Tries the raw path first (fast path), then canonicalizes (slow path).
fn resolve_path<'a>(&'a self, path: &Path) -> Option<&'a PathBuf> {
// Fast path: direct lookup
if self.path_index.contains_key(path) {
return Some(
self.path_index.keys().find(|k| k.as_path() == path).unwrap()
);
}
// Slow path: canonicalize and retry
if let Ok(canonical) = path.canonicalize() {
if self.path_index.contains_key(&canonical) {
return Some(
self.path_index.keys().find(|k| **k == canonical).unwrap()
);
}
}
None
}
/// Updated list_directory with symlink resolution
pub fn list_directory(&self, path: &Path) -> Option<Vec<PathBuf>> {
// Try direct lookup first, then canonical
let lookup_path = if self.path_index.contains_key(path) {
path.to_path_buf()
} else if let Ok(canonical) = path.canonicalize() {
if self.path_index.contains_key(&canonical) {
canonical
} else {
return None;
}
} else {
return None;
};
let id = self.path_index.get(&lookup_path)?;
let node = self.arena.get(*id)?;
Some(
node.children
.iter()
.filter_map(|&child_id| self.reconstruct_path(child_id))
.collect(),
)
}
}
The same pattern applies to get_entry_ref, get_or_assign_uuid, and get_entry_uuid. Rather than duplicating the canonicalization in every method, a single internal resolve_to_entry_id helper handles it:
/// Internal: resolve a path to its EntryId, handling symlinks.
fn resolve_entry_id(&self, path: &Path) -> Option<EntryId> {
// Fast path
if let Some(&id) = self.path_index.get(path) {
return Some(id);
}
// Canonicalize and retry
path.canonicalize()
.ok()
.and_then(|canonical| self.path_index.get(&canonical).copied())
}
Then all public methods use resolve_entry_id instead of self.path_index.get(path).
is_indexed and get_for_pathEven after fixing symlink resolution, is_indexed() and get_for_path() should check parent paths. This handles the non-symlink case (e.g., Linux or paths without symlinks) where /mnt/nas is indexed and /mnt/nas/photos/2024 should be considered covered.
// core/src/ops/indexing/ephemeral/cache.rs
pub fn is_indexed(&self, path: &Path) -> bool {
let indexed = self.indexed_paths.read();
if indexed.contains_key(path) {
return true;
}
// Check canonical form
let canonical = path.canonicalize().ok();
if let Some(ref canon) = canonical {
if indexed.contains_key(canon.as_path()) {
return true;
}
}
// Check if any recursively-indexed parent covers this path
for (indexed_path, scope) in indexed.iter() {
if *scope != IndexScope::Recursive {
continue;
}
if path.starts_with(indexed_path) {
return true;
}
if let Some(ref canon) = canonical {
if canon.starts_with(indexed_path) {
return true;
}
}
}
false
}
Apply the same pattern to get_for_path.
Replace HashSet<PathBuf> with HashMap<PathBuf, IndexScope> to distinguish shallow browses from recursive scans. Parent coverage only applies to recursive scans.
indexed_paths: RwLock<HashMap<PathBuf, IndexScope>>,
Callers that register paths need to provide the scope. The scope is available at every callsite from IndexerJobConfig.scope:
mark_indexing_complete takes scope parametercreate_for_indexing takes scope parameterRecursiveCurrentpub fn create_for_indexing(
&self,
path: PathBuf,
scope: IndexScope,
) -> Arc<TokioRwLock<EphemeralIndex>> {
let in_progress = self.indexing_in_progress.read();
let indexed = self.indexed_paths.read();
// Check if already covered by a recursive parent (with symlink resolution)
let canonical = path.canonicalize().ok();
for (existing_path, existing_scope) in indexed.iter() {
if *existing_scope != IndexScope::Recursive {
continue;
}
let covered = path.starts_with(existing_path)
|| canonical.as_ref().map_or(false, |c| c.starts_with(existing_path));
if covered {
tracing::debug!(
"Path {} already covered by recursive index at {}, skipping",
path.display(),
existing_path.display()
);
return self.index.clone();
}
}
drop(indexed);
drop(in_progress);
let mut in_progress = self.indexing_in_progress.write();
let mut indexed = self.indexed_paths.write();
indexed.remove(&path);
in_progress.insert(path);
self.index.clone()
}
When a volume index completes, remove individually-browsed child paths that are now redundant.
pub fn mark_indexing_complete(&self, path: &Path, scope: IndexScope) {
let mut in_progress = self.indexing_in_progress.write();
let mut indexed = self.indexed_paths.write();
in_progress.remove(path);
// If this is a recursive scan, subsume child paths
if scope == IndexScope::Recursive {
indexed.retain(|existing, _| {
!existing.starts_with(path) || existing == path
});
}
indexed.insert(path.to_path_buf(), scope);
}
core/src/ops/indexing/ephemeral/index.rs - Add resolve_entry_id(), update list_directory(), get_entry_ref(), get_or_assign_uuid(), get_entry_uuid() to use symlink-aware lookupcore/src/ops/indexing/ephemeral/cache.rs - Change indexed_paths to HashMap<PathBuf, IndexScope>, update is_indexed(), get_for_path(), create_for_indexing(), mark_indexing_complete() with scope tracking and canonicalizationcore/src/ops/indexing/job.rs - Pass scope to mark_indexing_complete()core/src/ops/volumes/index/action.rs - Pass scope to create_for_indexing()core/src/ops/files/query/directory_listing.rs - Pass scope to create_for_indexing()list_directory("/Users/jamespine") returns children when arena has /System/Volumes/Data/Users/jamespineis_indexed() returns true for symlink paths under a recursively-indexed volumeget_for_path() returns the index for symlink paths under a recursively-indexed volumecreate_for_indexing() is a no-op when the path is covered by a recursive parent (including symlinks)mark_indexing_complete() with Recursive scope subsumes child pathsindexed_paths stores scope per pathsd index ephemeral-cache#[test]
fn test_symlink_path_resolution() {
let cache = EphemeralIndexCache::new().expect("failed to create cache");
// Simulate volume index at real path
let real_root = PathBuf::from("/System/Volumes/Data");
let _index = cache.create_for_indexing(real_root.clone(), IndexScope::Recursive);
cache.mark_indexing_complete(&real_root, IndexScope::Recursive);
// Symlink path should be considered indexed (on macOS /Users -> /System/Volumes/Data/Users)
// This test verifies the canonicalization logic
let symlink_path = PathBuf::from("/Users/jamespine");
assert!(cache.is_indexed(&symlink_path)); // canonicalizes to /System/Volumes/Data/Users/jamespine
}
#[test]
fn test_parent_path_coverage() {
let cache = EphemeralIndexCache::new().expect("failed to create cache");
let root = PathBuf::from("/mnt/volume");
let _index = cache.create_for_indexing(root.clone(), IndexScope::Recursive);
cache.mark_indexing_complete(&root, IndexScope::Recursive);
// Child path should be considered indexed
assert!(cache.is_indexed(&PathBuf::from("/mnt/volume/photos/2024")));
assert!(cache.get_for_path(&PathBuf::from("/mnt/volume/photos/2024")).is_some());
}
#[test]
fn test_shallow_browse_no_parent_coverage() {
let cache = EphemeralIndexCache::new().expect("failed to create cache");
// Shallow browse of root (Current scope)
let root = PathBuf::from("/mnt/volume");
let _index = cache.create_for_indexing(root.clone(), IndexScope::Current);
cache.mark_indexing_complete(&root, IndexScope::Current);
// Child path should NOT be covered by a shallow scan
assert!(!cache.is_indexed(&PathBuf::from("/mnt/volume/photos/2024")));
}
#[test]
fn test_no_redundant_scan_under_volume() {
let cache = EphemeralIndexCache::new().expect("failed to create cache");
let root = PathBuf::from("/mnt/volume");
let _index = cache.create_for_indexing(root.clone(), IndexScope::Recursive);
cache.mark_indexing_complete(&root, IndexScope::Recursive);
// Attempting to create_for_indexing on a child should be a no-op
let child = PathBuf::from("/mnt/volume/photos");
let _index = cache.create_for_indexing(child.clone(), IndexScope::Current);
// indexed_paths should still only contain the root
assert_eq!(cache.len(), 1);
}
#[test]
fn test_volume_subsumes_child_paths() {
let cache = EphemeralIndexCache::new().expect("failed to create cache");
// Browse individual directories first
let dir1 = PathBuf::from("/mnt/volume/photos");
let dir2 = PathBuf::from("/mnt/volume/documents");
let _index = cache.create_for_indexing(dir1.clone(), IndexScope::Current);
cache.mark_indexing_complete(&dir1, IndexScope::Current);
let _index = cache.create_for_indexing(dir2.clone(), IndexScope::Current);
cache.mark_indexing_complete(&dir2, IndexScope::Current);
assert_eq!(cache.len(), 2);
// Now volume index the root (recursive)
let root = PathBuf::from("/mnt/volume");
let _index = cache.create_for_indexing(root.clone(), IndexScope::Recursive);
cache.mark_indexing_complete(&root, IndexScope::Recursive);
// Child paths subsumed — only root remains
assert_eq!(cache.len(), 1);
// Children still covered via parent
assert!(cache.is_indexed(&dir1));
assert!(cache.is_indexed(&dir2));
}
Path::canonicalize() is a syscall that hits the filesystem. On the hot path (every list_directory call), this adds ~1-5μs. For the cache methods (is_indexed, get_for_path), the indexed_paths set is tiny (1-10 entries), so linear scan with one canonicalization is negligible.
For EphemeralIndex methods that are called per-entry (like get_or_assign_uuid in a listing loop), the canonicalization should be done once per listing call and the canonical path passed down, not re-computed per entry.
indexed_paths SizeTypically 1-10 entries. Linear iteration is faster than any tree structure at this scale. If it grows (unlikely), a sorted vec with binary prefix search would be the upgrade.
resolve_entry_id CachingConsider adding a small HashMap<PathBuf, PathBuf> cache for symlink resolutions on the EphemeralIndex to avoid repeated canonicalize() calls for the same path prefixes. This is optional and can be added later if profiling shows it matters.