eden/mononoke/docs/1.2-key-concepts.md
This document introduces the essential concepts you'll encounter when working with Mononoke. Each concept is covered briefly here and explored in greater depth in later sections of the documentation.
Bonsai is Mononoke's canonical, VCS-agnostic representation of version control data. Every commit in Mononoke, regardless of whether it originated from Git, Mercurial, or Sapling, is stored internally as a Bonsai changeset. This unified format enables Mononoke to serve multiple client types from a single backend.
A Bonsai changeset contains commit metadata (author, message, timestamps), parent changeset references, and a list of file changes (additions, modifications, deletions). Unlike Git or Mercurial, Bonsai uses a flat list of changed files rather than nested tree structures, which simplifies many operations. Each changeset is identified by a Blake2b hash of its contents, forming a Merkle DAG that ensures data integrity. File contents are stored separately as content-addressed blobs, allowing efficient deduplication across the repository. The Bonsai format was designed specifically for high-throughput writes and scalability, making it the foundation of Mononoke's performance characteristics.
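The shape described above can be sketched in a few lines of Rust. This is an illustrative model, not Mononoke's actual types: the field names, the `commit` helper, and the 64-bit IDs are assumptions, and the standard library's `DefaultHasher` stands in for Blake2b only so the sketch has no dependencies (it is not cryptographic and its algorithm is unspecified).

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a file content ID (Mononoke uses 256-bit Blake2b).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct ContentId(pub u64);

/// Hypothetical stand-in for a changeset ID.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct ChangesetId(pub u64);

/// One entry in the flat list of changed paths.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub enum FileChange {
    /// The path was added or modified and now has this content.
    Change(ContentId),
    /// The path was deleted.
    Deletion,
}

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct BonsaiChangeset {
    pub parents: Vec<ChangesetId>,
    pub author: String,
    pub message: String,
    pub timestamp_secs: i64,
    /// Flat map from full path to change; there are no nested trees.
    pub file_changes: BTreeMap<String, FileChange>,
}

impl BonsaiChangeset {
    /// The ID is a hash over every field, parents included, so changing
    /// any ancestor changes the ID of all of its descendants (Merkle DAG).
    pub fn id(&self) -> ChangesetId {
        let mut hasher = DefaultHasher::new(); // assumption: placeholder for Blake2b
        self.hash(&mut hasher);
        ChangesetId(hasher.finish())
    }
}

/// Hypothetical helper that builds a changeset with fixed author and time.
pub fn commit(
    parents: Vec<ChangesetId>,
    message: &str,
    changes: &[(&str, FileChange)],
) -> BonsaiChangeset {
    BonsaiChangeset {
        parents,
        author: "author@example.com".to_string(),
        message: message.to_string(),
        timestamp_secs: 0,
        file_changes: changes
            .iter()
            .map(|(path, change)| (path.to_string(), change.clone()))
            .collect(),
    }
}
```

Because file contents are referenced by `ContentId` rather than embedded, two changesets that touch the same content share one blob.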
Content addressing identifies data by the cryptographic hash of its contents rather than by location or arbitrary name, ensuring the same content produces the same identifier regardless of where it's stored. Mononoke uses Blake2b (256-bit) for hashing, while Git and Mercurial/Sapling use SHA-1.
The blobstore is a key-value store whose data follows several addressing strategies, only some of which are content-addressed. Fully content-addressed data uses the hash of the stored value as the key. For example, Bonsai changesets are stored under their ChangesetId, which is the Blake2b hash of the changeset contents. Logically content-addressed data also uses a content hash as the key, but the stored value may be structured differently in the store. For example, file contents are keyed by ContentId (the hash of the file), but large files are chunked for storage. Non-content-addressed data uses keys derived from other objects. For example, blame data is keyed by the corresponding unode ID rather than by a hash of the blame information itself.
Content-addressed data provides immutability (any modification changes the hash), integrity verification (hashes serve as checksums), and deduplication (identical content produces identical keys). The Bonsai changeset graph forms a Merkle DAG where each changeset includes hashes of its parent changesets, making the repository history tamper-evident.
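The three properties listed above (immutability, integrity verification, deduplication) fall out of the keying scheme, as a toy store can show. This is a minimal sketch, not Mononoke's blobstore API: `ContentStore`, `content_key`, and `get_verified` are hypothetical names, and `DefaultHasher` is again only a dependency-free placeholder for Blake2b.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::Hasher;

/// Derive the key from the bytes themselves.
/// Assumption: DefaultHasher stands in for 256-bit Blake2b.
pub fn content_key(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write(bytes);
    hasher.finish()
}

/// Hypothetical in-memory, fully content-addressed store.
#[derive(Default)]
pub struct ContentStore {
    blobs: HashMap<u64, Vec<u8>>,
}

impl ContentStore {
    /// Store the bytes under their own hash. Writing identical content
    /// twice yields the same key and keeps a single copy (deduplication).
    pub fn put(&mut self, bytes: &[u8]) -> u64 {
        let key = content_key(bytes);
        self.blobs.entry(key).or_insert_with(|| bytes.to_vec());
        key
    }

    /// Fetch a blob and re-hash it, so the key doubles as a checksum.
    pub fn get_verified(&self, key: u64) -> Option<&[u8]> {
        let bytes = self.blobs.get(&key)?;
        if content_key(bytes) == key {
            Some(bytes.as_slice())
        } else {
            None // stored bytes no longer match their key
        }
    }

    /// Number of distinct blobs held.
    pub fn len(&self) -> usize {
        self.blobs.len()
    }
}
```

Any edit to the bytes produces a different key, so an existing key can never come to refer to different content.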
Mononoke uses a facet pattern to compose repository functionality. Rather than having a monolithic repository class with all possible methods, repository capabilities are broken into discrete "facets" that can be composed together. Each facet provides a specific capability. For example, RepoIdentity provides the repository name and ID, RepoBlobstore provides access to immutable blob storage, and CommitGraph provides commit graph traversal operations.
Functions declare their requirements explicitly by specifying which facets they need through trait bounds. This makes dependencies clear and enables better modularity and testability. The facet pattern also allows different repository types to mix and match capabilities as needed. Facets are defined in the repo_attributes/ directory and are used throughout Mononoke's codebase as the standard way to access repository functionality.
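The idea of declaring requirements through trait bounds can be illustrated with plain Rust traits. The trait names, method signatures, and `TestRepo` type below are hypothetical simplifications for this sketch; the real facet definitions in repo_attributes/ are wired up differently.

```rust
use std::collections::HashMap;

/// Hypothetical capability trait: repository name and ID.
pub trait HasRepoIdentity {
    fn repo_name(&self) -> &str;
    fn repo_id(&self) -> i32;
}

/// Hypothetical capability trait: read access to blobs.
pub trait HasRepoBlobstore {
    fn get_blob(&self, key: &str) -> Option<&[u8]>;
}

/// Needs only identity, so any type with that one facet can call it.
pub fn describe_repo(repo: &impl HasRepoIdentity) -> String {
    format!("{} (id {})", repo.repo_name(), repo.repo_id())
}

/// Needs identity and blob access; the bound documents both dependencies.
pub fn blob_report(repo: &(impl HasRepoIdentity + HasRepoBlobstore), key: &str) -> String {
    match repo.get_blob(key) {
        Some(bytes) => format!("{}: {} is {} bytes", repo.repo_name(), key, bytes.len()),
        None => format!("{}: {} is missing", repo.repo_name(), key),
    }
}

/// A small test repository type that composes just these two facets.
pub struct TestRepo {
    name: String,
    id: i32,
    blobs: HashMap<String, Vec<u8>>,
}

impl TestRepo {
    pub fn new(name: &str, id: i32) -> Self {
        TestRepo { name: name.to_string(), id, blobs: HashMap::new() }
    }

    pub fn with_blob(mut self, key: &str, bytes: &[u8]) -> Self {
        self.blobs.insert(key.to_string(), bytes.to_vec());
        self
    }
}

impl HasRepoIdentity for TestRepo {
    fn repo_name(&self) -> &str {
        &self.name
    }
    fn repo_id(&self) -> i32 {
        self.id
    }
}

impl HasRepoBlobstore for TestRepo {
    fn get_blob(&self, key: &str) -> Option<&[u8]> {
        self.blobs.get(key).map(|bytes| bytes.as_slice())
    }
}
```

A function such as `describe_repo` can be tested with a tiny stub that implements one trait, which is the testability benefit the facet pattern aims for.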
Derived data is computed information that can be regenerated from Bonsai changesets and file content blobs. While Bonsai represents the core source of truth, many common operations would be inefficient without precomputed indexes. Derived data types include manifests (directory structures), file history information, blame annotations, and VCS-specific formats needed for Git and Mercurial protocol compatibility.
The key architectural decision is that derived data computation happens asynchronously, off the critical path of commit ingestion. When a new commit is pushed, only the Bonsai changeset and file contents are written synchronously. Derived data is then computed in the background, often by a separate derivation service that can scale horizontally. This separation allows Mononoke to maintain high write throughput even as more derived data types are added. Derived data is stored in the blobstore and can be backfilled or migrated independently of the core commit data.
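The split between the synchronous write path and background derivation can be sketched with a channel and a worker thread. Everything here is an assumption made for illustration: the key formats, the `push` and `spawn_deriver` functions, and the "file count" derived type are hypothetical, and a single thread stands in for a horizontally scalable derivation service.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

/// Toy shared blobstore: key -> value.
pub type Store = Arc<Mutex<HashMap<String, String>>>;

/// Start a background worker that derives data for each queued changeset.
/// The worker exits once every sender has been dropped.
pub fn spawn_deriver(store: Store) -> (Sender<u64>, JoinHandle<()>) {
    let (tx, rx) = channel::<u64>();
    let handle = thread::spawn(move || {
        for cs_id in rx {
            let mut blobs = store.lock().unwrap();
            let source = blobs.get(&format!("changeset.{}", cs_id)).cloned();
            if let Some(paths) = source {
                // Hypothetical derived type: the number of changed files.
                let count = paths.split(',').filter(|p| !p.is_empty()).count();
                blobs.insert(format!("derived.file_count.{}", cs_id), count.to_string());
            }
        }
    });
    (tx, handle)
}

/// Critical path of a push: write the changeset synchronously, then only
/// enqueue the derivation work and return without waiting for it.
pub fn push(store: &Store, queue: &Sender<u64>, cs_id: u64, changed_paths: &[&str]) {
    store
        .lock()
        .unwrap()
        .insert(format!("changeset.{}", cs_id), changed_paths.join(","));
    queue.send(cs_id).expect("derivation worker has stopped");
}
```

Since the derived value is computed purely from the stored changeset, it could be deleted and recomputed at any time, which is what makes backfilling possible.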
The blobstore is Mononoke's immutable key-value storage layer for repository data. It stores Bonsai changesets, file content blobs, derived data, and other repository artifacts. Each blob is identified by a unique key and, once written, is never modified.
Mononoke uses a layered blobstore architecture built on the decorator pattern. Storage backends include fileblob (filesystem), sqlblob (MySQL/SQLite), s3blob (Amazon S3), and manifoldblob (Meta-internal). These backends are wrapped with decorators that add functionality: cacheblob provides multi-level caching (memcache and in-process cachelib), multiplexedblob enables writes to multiple backends for redundancy, packblob provides compression, and redactedblobstore enforces content redaction policies. This decorator stack allows Mononoke to compose complex storage behaviors from simple, reusable components. The immutability of blobstore data enables aggressive caching and simplifies consistency reasoning.
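The decorator stack can be illustrated with one backend and one wrapper that share a trait. This is a simplified, synchronous sketch under stated assumptions: the `Blobstore` trait shown here, `MemBlob`, and `CacheBlob` are hypothetical stand-ins, not the signatures of Mononoke's real blobstore crates.

```rust
use std::cell::{Cell, RefCell};
use std::collections::HashMap;

/// Hypothetical minimal blobstore interface.
pub trait Blobstore {
    fn get(&self, key: &str) -> Option<Vec<u8>>;
    fn put(&self, key: &str, value: Vec<u8>);
}

/// In-memory backend that counts how often it is read.
#[derive(Default)]
pub struct MemBlob {
    data: RefCell<HashMap<String, Vec<u8>>>,
    reads: Cell<usize>,
}

impl MemBlob {
    pub fn reads(&self) -> usize {
        self.reads.get()
    }
}

impl Blobstore for MemBlob {
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.reads.set(self.reads.get() + 1);
        self.data.borrow().get(key).cloned()
    }

    fn put(&self, key: &str, value: Vec<u8>) {
        self.data.borrow_mut().insert(key.to_string(), value);
    }
}

/// Decorator: wraps any Blobstore and is itself a Blobstore, so
/// decorators can be stacked in any order.
pub struct CacheBlob<B: Blobstore> {
    inner: B,
    cache: RefCell<HashMap<String, Vec<u8>>>,
}

impl<B: Blobstore> CacheBlob<B> {
    pub fn new(inner: B) -> Self {
        CacheBlob { inner, cache: RefCell::new(HashMap::new()) }
    }

    pub fn inner(&self) -> &B {
        &self.inner
    }
}

impl<B: Blobstore> Blobstore for CacheBlob<B> {
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        let cached = self.cache.borrow().get(key).cloned();
        if let Some(hit) = cached {
            return Some(hit);
        }
        // Miss: read through to the wrapped store and remember the result.
        // Blobs are immutable, so a cached entry never needs invalidating.
        let value = self.inner.get(key)?;
        self.cache.borrow_mut().insert(key.to_string(), value.clone());
        Some(value)
    }

    fn put(&self, key: &str, value: Vec<u8>) {
        self.inner.put(key, value.clone());
        self.cache.borrow_mut().insert(key.to_string(), value);
    }
}
```

Because `CacheBlob<B>` only requires `B: Blobstore`, the same wrapper works over any backend or over another decorator.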
While the blobstore holds immutable content, the metadata database stores mutable repository state and indexes. This includes bookmarks (branch pointers), the commit graph index for efficient ancestry queries, and various mapping tables that connect Bonsai changesets to their external representations.
The metadata database uses MySQL in production and SQLite for development and testing. Unlike blobstore operations, database operations can involve transactions and updates. For example, when a bookmark moves, the database is updated to point to the new changeset. The metadata database is also used for operational concerns such as tracking cross-repository sync progress and recording the derivation state of derived data. The separation between immutable blobstore data and mutable database state is a fundamental architectural principle that allows Mononoke to scale writes while maintaining consistency.
VCS mappings connect Mononoke's internal Bonsai representation to external version control system identities. When a Git commit or Mercurial changeset is imported into Mononoke, it is converted to Bonsai format, and a bidirectional mapping is stored. The bonsai_git_mapping table maps Bonsai changeset IDs to Git commit SHA-1 hashes, while bonsai_hg_mapping maps them to Mercurial changeset hashes.
These mappings are essential for serving Git and Mercurial clients. When a client requests a commit by its Git or Mercurial ID, Mononoke uses the mapping to find the corresponding Bonsai changeset, operates on the Bonsai representation internally, and then converts results back to the client's expected format. Additional mappings exist for other identifier schemes like globalrevs (sequential integers for SVN-style workflows) and svnrevs (for repositories imported from Subversion). The mapping tables are stored in the metadata database and enable Mononoke to present a consistent view of repository history regardless of which client type is being used.
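The lookup flow described above (external ID in, Bonsai ID internally, external ID back out) amounts to a bidirectional index. The sketch below models it with two in-memory maps. `ToyGitMapping` and its methods are hypothetical, IDs are plain strings for readability, and the real mappings live in metadata database tables rather than in memory.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory model of a Bonsai <-> Git mapping.
#[derive(Default)]
pub struct ToyGitMapping {
    git_by_bonsai: HashMap<String, String>,
    bonsai_by_git: HashMap<String, String>,
}

impl ToyGitMapping {
    /// Record both directions when a commit is imported.
    pub fn add(&mut self, bonsai_id: &str, git_sha1: &str) {
        self.git_by_bonsai
            .insert(bonsai_id.to_string(), git_sha1.to_string());
        self.bonsai_by_git
            .insert(git_sha1.to_string(), bonsai_id.to_string());
    }

    /// Client request path: translate a Git ID to the internal Bonsai ID.
    pub fn bonsai_for_git(&self, git_sha1: &str) -> Option<&str> {
        self.bonsai_by_git.get(git_sha1).map(|id| id.as_str())
    }

    /// Response path: translate a Bonsai ID back to the client's Git ID.
    pub fn git_for_bonsai(&self, bonsai_id: &str) -> Option<&str> {
        self.git_by_bonsai.get(bonsai_id).map(|id| id.as_str())
    }
}
```

A Mercurial mapping would have the same shape with a different external ID type, which is why one Bonsai changeset can be presented consistently to either kind of client.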