.tasks/core/FILE-006-path-intersection-smart-diff.md
Implement a path intersection operation that compares two paths (potentially on different volumes or devices) and returns the set difference: files present at path A that don't exist at path B. This enables "smart copy" — transferring only what's missing without re-copying files that already exist at the destination.
Primary use case: an external drive with files, a NAS with a partial backup. Select both, see what's missing, copy the diff.
1. Heuristic Match (fast, default) Compare by relative path + file size + modification time. No hashing required. Suitable for most backup/copy scenarios where files haven't been modified in place.
2. Content Match (exact)
Compare by BLAKE3 content hash (content_id). Catches files that have been renamed or moved. Requires Phase 4 content identification on both sides — either from the persistent index or computed on demand.
3. Hybrid Match Heuristic first, then content-verify ambiguous cases (same name, different size or mtime). Best balance of speed and accuracy.
Input: source_path (SdPath), target_path (SdPath), strategy
↓
Ensure both paths are indexed (ephemeral or persistent)
Use complete scan if needed (INDEX-011)
↓
Build relative path maps for both directories
↓
Apply matching strategy:
Heuristic → compare (rel_path, size, mtime) tuples
Content → compare content_id hashes
Hybrid → heuristic first, hash verify edge cases
↓
Output: PathDiffResult {
only_in_source: Vec<DiffEntry>, // Missing from target
only_in_target: Vec<DiffEntry>, // Extra at target
modified: Vec<DiffEntry>, // Same path, different content
matched: usize, // Count of identical files
total_copy_size: u64, // Bytes needed to sync
}
The diff result feeds directly into a copy operation:
let diff = path_intersection(source, target, Strategy::Heuristic).await?;
// Copy only what's missing
let copy_input = FileCopyInput {
sources: SdPathBatch::from_paths(
diff.only_in_source.iter().map(|e| e.sd_path.clone())
),
destination: target_path,
overwrite: false,
..Default::default()
};
// core/src/ops/files/diff/mod.rs
pub struct PathDiffInput {
pub source: SdPath,
pub target: SdPath,
pub strategy: DiffStrategy,
pub use_index_rules: bool, // false = complete scan for full coverage
}
pub enum DiffStrategy {
/// Compare by relative path + size + mtime. Fast, no hashing.
Heuristic,
/// Compare by BLAKE3 content hash. Catches renames/moves.
Content,
/// Heuristic first, content-verify ambiguous matches.
Hybrid,
}
pub struct PathDiffResult {
/// Files at source that don't exist at target.
pub only_in_source: Vec<DiffEntry>,
/// Files at target that don't exist at source.
pub only_in_target: Vec<DiffEntry>,
/// Files at both paths but with different content.
pub modified: Vec<DiffEntry>,
/// Number of files that matched exactly.
pub matched_count: usize,
/// Total bytes that would need to be copied (only_in_source + modified).
pub copy_size: u64,
/// Total number of files scanned.
pub total_scanned: usize,
}
pub struct DiffEntry {
pub relative_path: PathBuf,
pub sd_path: SdPath,
pub uuid: Option<Uuid>,
pub size: u64,
pub modified_at: DateTime<Utc>,
pub content_id: Option<String>, // BLAKE3 hash if available
pub kind: EntryKind,
}
Before diffing, ensure both paths are indexed. Check ephemeral cache first, then request a scan if needed.
// core/src/ops/files/diff/resolver.rs
async fn ensure_indexed(
path: &SdPath,
use_rules: bool,
ctx: &ActionContext,
) -> Result<()> {
let cache = ctx.core_context().ephemeral_cache();
let abs_path = path.resolve(ctx)?;
if cache.is_indexed(&abs_path).await {
return Ok(());
}
// Request index scan
let config = if use_rules {
IndexerJobConfig::ephemeral_browse(path.clone(), IndexScope::Recursive, false)
} else {
IndexerJobConfig::complete_scan(path.clone(), IndexScope::Recursive)
};
let job_id = ctx.job_manager().submit(IndexerJob::new(config)).await?;
ctx.job_manager().wait_for(job_id).await?;
Ok(())
}
Extract entries from the ephemeral index and build maps keyed by relative path for comparison.
async fn build_path_map(
root: &Path,
cache: &EphemeralIndexCache,
) -> Result<HashMap<PathBuf, DiffEntry>> {
let index = cache.get_for_path(root)
.ok_or_else(|| anyhow!("Path not indexed: {}", root.display()))?;
let index_read = index.read().await;
let mut map = HashMap::new();
for (abs_path, _entry_id) in index_read.entries() {
if let Ok(relative) = abs_path.strip_prefix(root) {
let entry = build_diff_entry(&index_read, abs_path, relative)?;
map.insert(relative.to_path_buf(), entry);
}
}
Ok(map)
}
Heuristic:
fn diff_heuristic(
source_map: &HashMap<PathBuf, DiffEntry>,
target_map: &HashMap<PathBuf, DiffEntry>,
) -> PathDiffResult {
let mut result = PathDiffResult::default();
for (rel_path, source_entry) in source_map {
match target_map.get(rel_path) {
None => result.only_in_source.push(source_entry.clone()),
Some(target_entry) => {
if source_entry.size != target_entry.size
|| source_entry.modified_at != target_entry.modified_at
{
result.modified.push(source_entry.clone());
} else {
result.matched_count += 1;
}
}
}
}
for (rel_path, target_entry) in target_map {
if !source_map.contains_key(rel_path) {
result.only_in_target.push(target_entry.clone());
}
}
result.copy_size = result.only_in_source.iter().map(|e| e.size).sum::<u64>()
+ result.modified.iter().map(|e| e.size).sum::<u64>();
result
}
Content: Same structure but matches on content_id instead of (path, size, mtime). Can detect files that were renamed or moved. Falls back to heuristic for entries without content IDs.
Hybrid: Runs heuristic first. For entries where path matches but size/mtime differ, checks content_id to determine if the file actually changed or just had its timestamp updated.
// core/src/ops/files/diff/action.rs
pub struct PathDiffAction;
crate::register_library_action!(PathDiffAction, "files.diff");
impl Action for PathDiffAction {
type Input = PathDiffInput;
type Output = PathDiffResult;
async fn run(input: Self::Input, ctx: &ActionContext) -> Result<Self::Output> {
// 1. Ensure both paths indexed
ensure_indexed(&input.source, input.use_index_rules, ctx).await?;
ensure_indexed(&input.target, input.use_index_rules, ctx).await?;
// 2. Build path maps
let source_map = build_path_map(&input.source.resolve(ctx)?, cache).await?;
let target_map = build_path_map(&input.target.resolve(ctx)?, cache).await?;
// 3. Run strategy
match input.strategy {
DiffStrategy::Heuristic => Ok(diff_heuristic(&source_map, &target_map)),
DiffStrategy::Content => Ok(diff_content(&source_map, &target_map)),
DiffStrategy::Hybrid => Ok(diff_hybrid(&source_map, &target_map)),
}
}
}
# Show what's missing on the NAS
sd-cli files diff /Volumes/ExtDrive/Photos /Volumes/NAS/Photos
# Copy only what's missing
sd-cli files diff /Volumes/ExtDrive/Photos /Volumes/NAS/Photos --copy
# Content-based matching (catches renames)
sd-cli files diff /Volumes/ExtDrive /Volumes/NAS --strategy content
# Include normally-filtered files
sd-cli files diff /path/a /path/b --no-rules
core/src/ops/files/diff/mod.rs - Module definition, typescore/src/ops/files/diff/action.rs - PathDiffAction registrationcore/src/ops/files/diff/resolver.rs - Index acquisition and path map buildingcore/src/ops/files/diff/strategies.rs - Heuristic, Content, and Hybrid matchingModified Files:
core/src/ops/files/mod.rs - Add diff modulefiles diffuse_index_rules: false triggers complete scan via INDEX-011files diff command shows human-readable summaryfiles diff --copy triggers copy of missing files