Content Identity System for Deduplication

.tasks/core/CORE-003-content-identity.md

Description

Implement the Content Identity system, which forms the foundation for data deduplication and redundancy tracking. It fingerprints files with an adaptive hashing strategy that selects the hashing method based on file size.

Implementation Notes

  • The core logic is in src/domain/content_identity.rs.
  • The ContentHashGenerator uses a fast, sampled BLAKE3 hash for large files (>100KB) and a full hash for smaller files, as described in the whitepaper.
  • The corresponding content_identities table in the database schema stores these hashes.
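The size-based strategy selection described in these notes could be sketched roughly as follows. This is a minimal illustration, not the code in src/domain/content_identity.rs: the standard library's `DefaultHasher` stands in for BLAKE3 so the example needs no external crates, the threshold is assumed to mean 100 * 1024 bytes, and the names `choose_strategy`, `content_hash`, `SAMPLE_SIZE`, and `SAMPLE_COUNT` are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Assumption: "large files (>100KB)" is read here as more than 100 * 1024 bytes.
const LARGE_FILE_THRESHOLD: usize = 100 * 1024;
// Hypothetical sampling parameters; the real values are not given in this task.
const SAMPLE_SIZE: usize = 8 * 1024;
const SAMPLE_COUNT: usize = 4;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HashStrategy {
    Full,
    Sampled,
}

/// Pick a strategy purely from the content length.
fn choose_strategy(len: usize) -> HashStrategy {
    if len > LARGE_FILE_THRESHOLD {
        HashStrategy::Sampled
    } else {
        HashStrategy::Full
    }
}

/// Fingerprint `data`, returning the strategy used and the digest.
/// `DefaultHasher` is only a stand-in for BLAKE3 in this sketch.
fn content_hash(data: &[u8]) -> (HashStrategy, u64) {
    let strategy = choose_strategy(data.len());
    let mut hasher = DefaultHasher::new();
    // Mix in the length so inputs of different sizes with identical
    // samples still produce different digests.
    hasher.write_u64(data.len() as u64);
    match strategy {
        HashStrategy::Full => hasher.write(data),
        HashStrategy::Sampled => {
            // Evenly spaced fixed-size samples, from the first bytes
            // of the content through to the last bytes.
            let last_start = data.len() - SAMPLE_SIZE;
            for i in 0..SAMPLE_COUNT {
                let start = last_start * i / (SAMPLE_COUNT - 1);
                hasher.write(&data[start..start + SAMPLE_SIZE]);
            }
        }
    }
    (strategy, hasher.finish())
}
```

A sampled hash trades some collision resistance for speed: two large files that differ only in bytes outside the sampled regions would receive the same fingerprint, which is why the length is mixed into the digest in this sketch.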

Acceptance Criteria

  • ContentHashGenerator produces deterministic hashes: the same file content always yields the same hash.
  • The system correctly uses different hashing strategies for small and large files.
  • The database schema supports storing content hashes and linking them to entries.
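The last criterion, linking content hashes to entries, can be pictured with a small in-memory model. It is only a sketch under stated assumptions: the struct and field names (`ContentIdentity`, `Entry`, `Store`, `entry_count`, `content_id`) are hypothetical and do not describe the actual content_identities table, and a `HashMap` plays the role a unique index on the hash column might play in a database.

```rust
use std::collections::HashMap;

/// Hypothetical row of a content_identities table.
#[derive(Debug)]
struct ContentIdentity {
    id: u32,
    content_hash: String,
    entry_count: u32,
}

/// Hypothetical entry row that references a content identity.
#[derive(Debug)]
struct Entry {
    path: String,
    content_id: u32,
}

#[derive(Default)]
struct Store {
    identities: Vec<ContentIdentity>,
    by_hash: HashMap<String, u32>, // stands in for a unique index on the hash
    entries: Vec<Entry>,
}

impl Store {
    /// Link an entry to the identity for `content_hash`, creating the
    /// identity only if this hash has not been seen before.
    fn link_entry(&mut self, path: &str, content_hash: &str) -> u32 {
        let existing = self.by_hash.get(content_hash).copied();
        let id = match existing {
            Some(id) => id,
            None => {
                let id = self.identities.len() as u32;
                self.identities.push(ContentIdentity {
                    id,
                    content_hash: content_hash.to_string(),
                    entry_count: 0,
                });
                self.by_hash.insert(content_hash.to_string(), id);
                id
            }
        };
        self.identities[id as usize].entry_count += 1;
        self.entries.push(Entry {
            path: path.to_string(),
            content_id: id,
        });
        id
    }
}
```

In this model, two entries with the same hash share one identity row, so duplicates can be found by looking for identities whose `entry_count` is greater than one.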