eden/mononoke/docs/2.1-bonsai-data-model.md
This document explains Mononoke's core data model—the Bonsai format—which serves as the canonical, VCS-agnostic representation of repository data. Understanding Bonsai is essential for working with Mononoke, as it forms the foundation upon which all other repository operations are built.
Bonsai is Mononoke's internal representation of version control data. Every commit in Mononoke, regardless of whether it originated from Git, Mercurial, or Sapling, is stored internally as a Bonsai changeset. This unified format allows Mononoke to maintain a single source of truth while serving multiple client types.
Mononoke distinguishes between two fundamental categories of data:
Inherent data constitutes the core Merkle DAG and serves as the source of truth. This data forms the basis for content-addressed hashing and must be sufficient to represent the entire state and history of the repository. In Mononoke, inherent data consists of:
Derived data comprises indexes and representations computed from inherent data. This data is not included in content-addressed hashes and can be regenerated from inherent data. Derived data enables efficient operations that would be impractical using only the minimal Bonsai representation.
This separation is fundamental to Mononoke's architecture. The write path stores only inherent data, keeping the critical section minimal. Derived data is computed asynchronously off the critical path, allowing Mononoke to maintain high write throughput while still providing efficient read operations.
A Bonsai changeset represents a single commit. The structure is defined in mononoke_types/src/bonsai_changeset.rs and contains the following components:
hg_extra: Key-value pairs for Mercurial-specific metadata
git_extra_headers: Key-value pairs for Git-specific headers
git_tree_hash: Not used (deprecated)
git_annotated_tag: Optional Git annotated tag information
These fields allow Bonsai to preserve the complete semantics of both Git and Mercurial commits while maintaining a unified structure.
File changes are represented as a flat list of path-to-change mappings. Each entry specifies the full path to a file and the change that occurred. This differs from Git's nested tree structure and Mercurial's manifest system.
Changes can be:
Tracked Changes
Tracked Deletions
Untracked Changes and Deletions
The file change types are defined in mononoke_types/src/file_change.rs.
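As a minimal sketch of the flat path-to-change mapping described above, the following Python model uses hypothetical stand-ins (`TrackedChange`, `Deletion`, and the `renames` helper are illustrative names, not the Rust types in mononoke_types). The shape of `copy_from` as a (source path, parent id) pair is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union


@dataclass(frozen=True)
class TrackedChange:
    """Illustrative stand-in for a tracked file addition or modification."""
    content_id: str                               # hex ContentId of the new contents
    file_type: str = "regular"                    # "regular", "executable", or "symlink"
    copy_from: Optional[Tuple[str, str]] = None   # assumed shape: (source path, parent id)


@dataclass(frozen=True)
class Deletion:
    """Illustrative stand-in for a tracked deletion."""


Change = Union[TrackedChange, Deletion]

# Flat mapping: full path from the repository root -> change. No tree objects.
file_changes: Dict[str, Change] = {
    "docs/README.md": TrackedChange("c0ffee"),
    "src/new_name.rs": TrackedChange("abc123", copy_from=("src/old_name.rs", "parent-id")),
    "src/old_name.rs": Deletion(),
}


def renames(changes: Dict[str, Change]) -> List[Tuple[str, str]]:
    """Return (source, destination) pairs: a copy whose source is deleted in the same changeset."""
    found = []
    for path, change in sorted(changes.items()):
        if isinstance(change, TrackedChange) and change.copy_from is not None:
            source = change.copy_from[0]
            if isinstance(changes.get(source), Deletion):
                found.append((source, path))
    return found
```

Because every entry carries its full path, a rename can be read directly from one mapping without walking nested directory objects.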
Snapshot State
Subtree Changes
Bonsai uses Blake2b hashing to create a content-addressed Merkle DAG. Blake2b supports digests of up to 512 bits; Mononoke uses 256-bit (32-byte) hashes, which provide cryptographic strength while typically being faster to compute in software than SHA-256.
Changeset Identifiers
A Bonsai changeset is serialized (using Thrift compact protocol) and hashed with Blake2b. The resulting hash becomes the changeset identifier (ChangesetId). This identifier depends on:
Any modification to the changeset produces a different identifier.
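The hashing step can be sketched as follows. This is an illustration only: it substitutes canonical JSON for the Thrift compact protocol, and the field names (`parents`, `author`, `message`, `file_changes`) and the `changeset_id` helper are assumptions for the example, so the resulting values will not match real ChangesetId values.

```python
import hashlib
import json


def changeset_id(changeset: dict) -> str:
    """Hash a changeset's serialized form with Blake2b (256-bit digest)."""
    # Stand-in serialization: canonical JSON instead of Thrift compact protocol.
    payload = json.dumps(changeset, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.blake2b(payload, digest_size=32).hexdigest()


base = {
    "parents": [],
    "author": "alice",
    "message": "initial",
    "file_changes": {"a.txt": "content-1"},
}
# Changing any field changes the identifier.
edited = dict(base, message="initial commit")

# A child embeds its parent's identifier, so a different parent
# yields a different child identifier even if nothing else differs.
child_of_base = dict(base, parents=[changeset_id(base)], message="second")
child_of_edited = dict(base, parents=[changeset_id(edited)], message="second")
```

Since the parent identifier is part of the hashed payload, a change anywhere in history propagates to every descendant's identifier.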
Content Identifiers
File contents are hashed separately. Each file's content is hashed with Blake2b to produce a ContentId. Identical files across different commits or repositories produce the same content identifier, enabling deduplication.
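A small sketch of why content addressing deduplicates: the hypothetical `InMemoryBlobstore` below keys blobs by a Blake2b digest of their bytes, so storing the same contents twice yields one entry. It is a toy model, not Mononoke's blobstore interface.

```python
import hashlib


class InMemoryBlobstore:
    """Toy content-addressed store keyed by a Blake2b-256 digest of the bytes."""

    def __init__(self) -> None:
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = hashlib.blake2b(data, digest_size=32).hexdigest()
        # Identical bytes map to the same key, so this is a no-op on repeats.
        self._blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def __len__(self) -> int:
        return len(self._blobs)


store = InMemoryBlobstore()
first = store.put(b"fn main() {}\n")   # file added in one commit
second = store.put(b"fn main() {}\n")  # same contents added elsewhere
other = store.put(b"fn main() { run(); }\n")
```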
Merkle DAG Properties
The parent references in each changeset create a directed acyclic graph. The content-addressed nature ensures:
The hash types are defined in mononoke_types/src/hash.rs and mononoke_types/src/typed_hash.rs.
File contents are stored separately from changesets in the blobstore. This separation provides several characteristics:
Content Blobs
ContentId as the key
Chunking for Large Files
File content can be stored in chunks for large files. This is managed by the filestore, which handles:
The FileContents type (defined in mononoke_types/src/file_contents.rs) can represent either the file's bytes directly or chunked content that references separately stored chunks.
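The two representations can be sketched as below. The chunk size, the tuple encoding, and the `store_file` and `load_file` helpers are assumptions made for illustration; the real filestore's thresholds and types are not shown here.

```python
import hashlib

CHUNK_SIZE = 4  # deliberately tiny so the example chunks a short string


def content_id(data: bytes) -> str:
    return hashlib.blake2b(data, digest_size=32).hexdigest()


def store_file(blobs: dict, data: bytes, chunk_size: int = CHUNK_SIZE):
    """Return ("bytes", data) for small files, or ("chunked", [(chunk id, size), ...])."""
    if len(data) <= chunk_size:
        return ("bytes", data)
    chunks = []
    for start in range(0, len(data), chunk_size):
        piece = data[start:start + chunk_size]
        key = content_id(piece)
        blobs[key] = piece          # each chunk is stored under its own hash
        chunks.append((key, len(piece)))
    return ("chunked", chunks)


def load_file(blobs: dict, contents) -> bytes:
    """Reassemble the original bytes from either representation."""
    kind, value = contents
    if kind == "bytes":
        return value
    return b"".join(blobs[key] for key, _size in value)


blobs = {}
small = store_file(blobs, b"hi")
large = store_file(blobs, b"hello world")
```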
Bonsai serves as an intermediate representation between different version control systems. The format is designed to capture the semantics of both Git and Mercurial while avoiding the implementation details of either.
Git Compatibility
Bonsai can represent all Git commit information:
Mercurial Compatibility
Bonsai can represent all Mercurial changeset information:
hg_extra
Simplified Structure
Unlike both Git and Mercurial, Bonsai uses:
Bonsai changesets are converted to and from VCS-specific formats:
Git Conversion
bonsai_git_mapping table maintains bidirectional mappings between ChangesetId and Git commit SHA-1
Mercurial Conversion
bonsai_hg_mapping table maintains bidirectional mappings between ChangesetId and Mercurial changeset hash
These conversions allow Mononoke to serve Git and Mercurial clients from a single Bonsai backend while maintaining compatibility with each VCS.
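A bidirectional mapping of this kind can be modelled with two dictionaries kept in step. The `BonsaiVcsMapping` class and its method names are hypothetical and only illustrate the lookup pattern; the real tables are database-backed.

```python
from typing import Dict, Optional


class BonsaiVcsMapping:
    """Toy one-to-one mapping between Bonsai ids and VCS-specific commit hashes."""

    def __init__(self) -> None:
        self._to_vcs: Dict[str, str] = {}
        self._to_bonsai: Dict[str, str] = {}

    def add(self, bonsai_id: str, vcs_id: str) -> None:
        # Reject an entry that would contradict an existing one in either direction.
        if self._to_vcs.get(bonsai_id, vcs_id) != vcs_id:
            raise ValueError(f"{bonsai_id} is already mapped to another hash")
        if self._to_bonsai.get(vcs_id, bonsai_id) != bonsai_id:
            raise ValueError(f"{vcs_id} is already mapped to another changeset")
        self._to_vcs[bonsai_id] = vcs_id
        self._to_bonsai[vcs_id] = bonsai_id

    def vcs_from_bonsai(self, bonsai_id: str) -> Optional[str]:
        return self._to_vcs.get(bonsai_id)

    def bonsai_from_vcs(self, vcs_id: str) -> Optional[str]:
        return self._to_bonsai.get(vcs_id)


git_mapping = BonsaiVcsMapping()
git_mapping.add("bonsai-1", "git-sha1-aaa")
```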
Bonsai changesets are immutable. Once a changeset is created and stored:
Write-Once Semantics
Blobstore Storage
Bonsai changesets are stored in the blobstore:
ChangesetId
Metadata Database References
While changesets are immutable, mutable state is maintained separately:
This separation of immutable content from mutable references allows efficient repository operations while maintaining data integrity.
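The split between immutable content and mutable references can be illustrated with a write-once store plus a separate pointer table. Both `WriteOnceStore` and the `refs` dictionary (with its `main` entry) are hypothetical stand-ins chosen for the example, not Mononoke components.

```python
class WriteOnceStore:
    """Toy store in which a key, once written, can never hold a different value."""

    def __init__(self) -> None:
        self._blobs = {}

    def put(self, key: str, value: bytes) -> None:
        existing = self._blobs.get(key)
        if existing is not None and existing != value:
            raise ValueError(f"refusing to overwrite {key}")
        self._blobs[key] = value

    def get(self, key: str) -> bytes:
        return self._blobs[key]


changesets = WriteOnceStore()
changesets.put("cs1", b"first commit")
changesets.put("cs2", b"second commit")

# Mutable state lives elsewhere: a named reference that points at a changeset.
refs = {"main": "cs1"}
refs["main"] = "cs2"  # moving the reference leaves stored changesets untouched
```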
Understanding how Bonsai differs from Git and Mercurial clarifies the design choices:
Git Trees vs. Bonsai File Changes
Git represents repository state using nested tree objects. Each tree contains references to blobs (files) and subtrees (subdirectories), forming a recursive structure. Bonsai uses a flat list of file changes, where each entry specifies the complete path from the repository root. This eliminates the need to create and track intermediate tree objects.
Mercurial Manifests vs. Bonsai File Changes
Mercurial uses manifest objects that, like Git trees, can be nested. Additionally, Mercurial maintains filelogs (per-file history) with linknodes connecting file revisions to changesets. Bonsai represents only what changed in each commit, with file history derived from the changeset graph.
Rename and Copy Tracking
Git does not explicitly track renames or copies. Instead, Git detects renames heuristically when comparing trees. Mercurial explicitly records copy-from information in changeset metadata. Bonsai follows Mercurial's approach, explicitly storing copy-from information as part of tracked file changes.
Git: Uses SHA-1 (40 hex characters) for all object identifiers by default. Newer Git versions also support a SHA-256 object format (64 hex characters).
Mercurial: Uses SHA-1 (40 hex characters) for changeset and manifest identifiers.
Bonsai: Uses Blake2b (64 hex characters for 256-bit hashes). Blake2b provides cryptographic strength comparable to SHA-256 with better performance.
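The identifier lengths above can be checked with Python's standard `hashlib`. This only compares digest sizes; it does not reproduce how Git, Mercurial, or Mononoke frame the bytes they hash.

```python
import hashlib

data = b"hello"

sha1_hex = hashlib.sha1(data).hexdigest()
sha256_hex = hashlib.sha256(data).hexdigest()
# digest_size=32 selects a 256-bit Blake2b digest (the default is 512 bits).
blake2b_hex = hashlib.blake2b(data, digest_size=32).hexdigest()

print(len(sha1_hex), len(sha256_hex), len(blake2b_hex))  # prints: 40 64 64
```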
Git distinguishes between author (who wrote the code) and committer (who committed it), each with their own timestamp. This supports workflows where commits are authored by one person and applied by another.
Mercurial typically has only an author and date, though extra fields can store additional metadata.
Bonsai includes both author and optional committer fields, accommodating Git's model while allowing Mercurial commits (which typically have no committer) to be represented naturally.
The Bonsai data structures are defined in the mononoke_types crate, specifically:
Core Types (mononoke_types/src/bonsai_changeset.rs)
BonsaiChangeset - The immutable changeset structure
BonsaiChangesetMut - The mutable builder used during changeset creation
File Changes (mononoke_types/src/file_change.rs)
FileChange - Enumeration of change types (tracked change, deletion, untracked change, untracked deletion)
TrackedFileChange - A file addition or modification with copy-from information
BasicFileChange - File change without copy-from tracking
FileType - File type (regular, executable, symlink)
Content Addressing (mononoke_types/src/typed_hash.rs)
ChangesetId - Blake2b hash identifying a changeset
ContentId - Blake2b hash identifying file contents
File Contents (mononoke_types/src/file_contents.rs)
FileContents - Either direct bytes or chunked content
ChunkedFileContents - References to content chunks for large files
Paths (mononoke_types/src/path.rs)
MPath - A path within the repository (may include root)
NonRootMPath - A path guaranteed not to be the root
Additional types for dates, extra fields, and subtree changes are defined in their respective files within mononoke_types/src/.
While Bonsai changesets contain the core commit information, many repository operations require additional data structures. These are provided by derived data, which is computed from Bonsai changesets:
Manifests
fsnodes - Filesystem-like directory structure
unodes - Full directory structure with file and directory history. Similar to Mercurial's manifests but with the "linknode problem" fixed.
File Metadata
VCS-Specific Formats
These derived data types are described in detail in the Derived Data documentation.
The separation between Bonsai (inherent data) and these structures (derived data) is a fundamental architectural decision. Bonsai provides the minimal representation needed for accepting pushes at high speed, while derived data provides the indexes and formats needed for efficient operations and VCS compatibility.
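To make the inherent/derived split concrete, the sketch below recomputes a directory view purely from flat file changes. The helpers `apply_changes` and `list_directory` are hypothetical, and the output is a simplified stand-in for a manifest, not the fsnodes or unodes format.

```python
from typing import Dict, List, Optional


def apply_changes(parent: Dict[str, str], changes: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Derive a child's full path -> content id view; a value of None means deletion."""
    manifest = dict(parent)
    for path, content in changes.items():
        if content is None:
            manifest.pop(path, None)
        else:
            manifest[path] = content
    return manifest


def list_directory(manifest: Dict[str, str], directory: str) -> List[str]:
    """List the immediate entries of a directory ("" is the repository root)."""
    prefix = directory + "/" if directory else ""
    entries = set()
    for path in manifest:
        if path.startswith(prefix):
            entries.add(path[len(prefix):].split("/", 1)[0])
    return sorted(entries)


# Inherent data: only what each commit changed.
commit_1 = {"README.md": "c1", "src/main.rs": "c2", "src/lib.rs": "c3"}
commit_2 = {"src/lib.rs": None, "docs/intro.md": "c4"}

# Derived data: the full tree state, recomputable at any time.
state_1 = apply_changes({}, commit_1)
state_2 = apply_changes(state_1, commit_2)
```

Nothing in `state_2` needs to be stored for correctness; discarding it loses only the time needed to recompute it.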
When working with Mononoke code, Bonsai changesets are accessed through repository facets:
Reading Changesets
The repo_blobstore facet provides access to the blobstore, allowing changesets to be loaded by their ChangesetId. The Loadable trait (implemented for ChangesetId) enables loading the corresponding BonsaiChangeset from storage.
Creating Changesets
New Bonsai changesets are typically created using BonsaiChangesetMut, which provides a builder pattern for assembling the changeset components. Once complete, the mutable changeset is frozen into an immutable BonsaiChangeset and stored.
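The build-then-freeze pattern can be sketched in Python as follows. `ChangesetBuilder` and `FrozenChangeset` are illustrative stand-ins for BonsaiChangesetMut and BonsaiChangeset; the fields shown are a subset chosen for the example, and no validation performed by the real freeze step is modelled.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FrozenChangeset:
    """Immutable result: attribute assignment raises an error."""
    parents: Tuple[str, ...]
    author: str
    message: str
    file_changes: MappingProxyType  # read-only view of path -> content id


@dataclass
class ChangesetBuilder:
    """Mutable builder: fields can be assembled in any order."""
    parents: List[str] = field(default_factory=list)
    author: str = ""
    message: str = ""
    file_changes: Dict[str, str] = field(default_factory=dict)

    def freeze(self) -> FrozenChangeset:
        # Copy the mutable containers so later builder edits cannot leak in.
        return FrozenChangeset(
            parents=tuple(self.parents),
            author=self.author,
            message=self.message,
            file_changes=MappingProxyType(dict(sorted(self.file_changes.items()))),
        )


builder = ChangesetBuilder(author="alice", message="add a.txt")
builder.file_changes["a.txt"] = "c1"
changeset = builder.freeze()

builder.file_changes["b.txt"] = "c2"  # does not affect the frozen changeset
```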
Commit Graph Traversal
The commit_graph facet provides efficient parent-child traversal without loading full changesets. This enables operations like ancestry checking and graph walking without deserializing all changeset metadata.
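An ancestry check needs only the parent edges, which is why it can avoid loading full changesets. The function below is a plain breadth-first search over a hypothetical in-memory parent map; it does not reflect how the commit_graph facet is indexed or queried.

```python
from collections import deque
from typing import Dict, List


def is_ancestor(parents: Dict[str, List[str]], ancestor: str, descendant: str) -> bool:
    """Return True if `ancestor` is reachable from `descendant` via parent edges.

    A changeset is treated as an ancestor of itself.
    """
    seen = {descendant}
    queue = deque([descendant])
    while queue:
        node = queue.popleft()
        if node == ancestor:
            return True
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return False


# A is the root; B and C branch from A; D merges B and C; E continues from C.
graph = {
    "A": [],
    "B": ["A"],
    "C": ["A"],
    "D": ["B", "C"],
    "E": ["C"],
}
```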
VCS Mapping
The bonsai_git_mapping and bonsai_hg_mapping facets provide bidirectional lookup between Bonsai changeset identifiers and Git or Mercurial commit hashes. These are used extensively when serving VCS clients.
Repository operations compose these facets to implement higher-level functionality like pushrebase, cross-repository sync, and hooks.
The Bonsai data model serves as Mononoke's canonical representation of version control data. Its design reflects several architectural principles:
VCS Independence - Bonsai is not tied to Git or Mercurial implementation details, allowing it to serve as an intermediate representation.
Content Addressing - Blake2b hashing creates a Merkle DAG ensuring data integrity and enabling efficient deduplication.
Minimal Representation - Bonsai contains only the information necessary to represent commit history, with additional indexes provided by derived data.
Flat File Changes - The flat list of file changes simplifies many operations compared to nested tree structures.
Immutability - Once created, Bonsai changesets never change, enabling aggressive caching and simplifying consistency reasoning.
Explicit Semantics - Copy-from information, file types, and metadata are explicitly represented rather than inferred.