eden/mononoke/docs/2.4-storage-architecture.md
This document explains Mononoke's storage architecture, including the blobstore system, metadata database, caching strategy, and how these components work together to provide scalable repository storage.
Mononoke has two main data stores:
Immutable Blobstore - Content-addressed key-value storage for all immutable data including Bonsai changesets, file contents, and derived data. Once written, blobs are never modified or deleted (except through explicit redaction processes).
Metadata Database - SQL database (MySQL in production, SQLite for development) storing repository state that changes over time, such as bookmarks, VCS mappings, and the commit graph index.
This separation allows different optimization strategies for each type of data. The blobstore can use aggressive caching and replication since data is immutable, while the database handles transactional updates for mutable state.
In addition to these main stores, Mononoke keeps some data in auxiliary data stores:
Mutable Blobstore - Storage for pre-computed data like the pre-loaded commit graph. This data is periodically recomputed and overwritten in the mutable blobstore.
Ephemeral Blobstore - Garbage-collectable blobstore used for ephemeral objects like snapshots. The ephemeral blobstore groups blobs into "bubbles" that expire together.
Zelos Derivation Queue - A graph-aware queue used to coordinate derivation of derived data.
The blobstore is Mononoke's primary storage layer for repository content. All blobs are identified by unique keys and are immutable after creation.
The blobstore stores several categories of data:
Core Repository Data - Bonsai changesets, file contents, and file metadata.

Derived Data - Pre-computed structures derived from changesets, such as unodes and blame data.

VCS-Specific Formats - Mercurial and Git representations (such as filenodes, manifests, and Git objects) kept for protocol compatibility.
Most blobs use content-addressed keys based on Blake2b hashing, which enables deduplication and integrity verification. Some blobs are keyed using the hash of a corresponding item, for example blame data is keyed using unode hashes.
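As an illustration of content addressing, a key can be derived from the blob bytes alone. This Python sketch uses the standard library's blake2b; the "content.blake2." key format and the 32-byte digest size are assumptions for illustration, not Mononoke's exact key formats.

```python
import hashlib

def content_key(data: bytes) -> str:
    # Content-addressed key: the key is a pure function of the bytes,
    # so identical content deduplicates to a single blob, and a reader
    # can re-hash the bytes to verify integrity.
    digest = hashlib.blake2b(data, digest_size=32).hexdigest()
    return "content.blake2." + digest
```

Because the key is derived from the content, two writers storing the same bytes independently produce the same key, and the blobstore stores the data only once.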
Mononoke supports multiple blobstore backend implementations:
Production Backends:
Manifoldblob (facebook/manifoldblob/) - Primary production backend using Manifold, Meta's internal distributed storage service. This is the standard deployment option for Meta's infrastructure.
SQLblob (blobstore/sqlblob/) - SQL database backend supporting both MySQL (production) and SQLite (development). Stores blobs as rows in a table with the key as the primary key and the blob bytes as a column.
S3blob (blobstore/s3blob/) - Amazon S3 backend for deployments using AWS infrastructure.
Development and Testing Backends:
Fileblob (blobstore/fileblob/) - Filesystem-based storage where each blob is a file. Used for local development and testing.
Memblob (blobstore/memblob/) - In-memory storage with no persistence. Used exclusively for unit tests.
Backend implementations share a common Blobstore trait, making them interchangeable. The choice of backend is a deployment decision that does not affect higher-level code.
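A minimal sketch of the shared-interface idea in Python (Mononoke's actual Blobstore trait is Rust; the names and signatures here are illustrative):

```python
from abc import ABC, abstractmethod
from typing import Optional

class Blobstore(ABC):
    # Stand-in for the shared trait: every backend exposes the same
    # get/put surface, so callers never depend on a concrete backend.
    @abstractmethod
    def get(self, key: str) -> Optional[bytes]: ...

    @abstractmethod
    def put(self, key: str, value: bytes) -> None: ...

class Memblob(Blobstore):
    # In-memory backend in the spirit of blobstore/memblob/.
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def put(self, key, value):
        self._data[key] = value

def roundtrip(store: Blobstore) -> bytes:
    # Higher-level code works against the interface, so swapping the
    # backend is purely a deployment decision.
    store.put("bonsai.abc", b"changeset bytes")
    return store.get("bonsai.abc")
```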
Mononoke uses the decorator pattern to compose blobstore functionality. Rather than implementing features directly in backend storage implementations, decorators wrap blobstores to add specific capabilities. This allows flexible configuration of the storage stack.
Core Decorators:
Prefixblob (blobstore/prefixblob/) - Adds a prefix to all keys, typically used to namespace keys by repository. This allows multiple repositories to share a single blobstore backend without key collisions.
Cacheblob (blobstore/cacheblob/) - Implements multi-level caching using both cachelib (in-process cache) and memcache (shared cache across servers). Caching is discussed in detail in the Caching Strategy section below.
Multiplexedblob (blobstore/multiplexedblob/) - Writes blobs to multiple underlying blobstores simultaneously and reads from any available backend. This provides redundancy and availability, discussed further in the Multiplexing section.
Packblob (blobstore/packblob/) - Packs multiple related blobs together and applies compression. This reduces storage costs and is detailed in the Packblob section below.
Operational Decorators:
Redactedblobstore (blobstore/redactedblobstore/) - Enforces content redaction policies, preventing access to blobs that have been marked as redacted. Used for removing sensitive content from repositories.
Samplingblob (blobstore/samplingblob/) - Samples operations for observability, logging a configurable fraction of all blobstore operations for monitoring and debugging.
Throttledblob (blobstore/throttledblob/) - Applies rate limiting to blobstore operations, controlling load on downstream storage systems.
Logblob (blobstore/logblob/) - Logs all blobstore operations for debugging and auditing purposes.
Readonlyblob (blobstore/readonlyblob/) - Wraps a blobstore to make it read-only, preventing writes. Used when operating in read-only modes.
Testing Decorators:
Chaosblob (blobstore/chaosblob/) - Randomly fails operations to test error handling and resilience.
Delayblob (blobstore/delayblob/) - Adds artificial latency to operations to test timeout handling and performance under slow storage.
Specialized Storage:
Ephemeral blobstore (blobstore/ephemeral_blobstore/) - Temporary storage for snapshots and draft commits that have not been made permanent. Uses a separate SQL table with time-based expiration.
Virtually sharded blobstore (blobstore/virtually_sharded_blobstore/) - Caching blobstore that groups blobstore keys into virtual shards to allow deduplication of in-flight requests.
Decorators can be stacked in different orders to achieve different behaviors. The factory in blobstore/factory/ constructs the appropriate stack based on configuration.
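The composition idea can be sketched like this, with Python stand-ins for two of the Rust decorators (the real wrappers implement the full Blobstore trait; this uses duck typing for brevity):

```python
class Memblob:
    # Plain dict-backed store standing in for real backend storage.
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def put(self, key, value):
        self._data[key] = value

class Prefixblob:
    # Namespaces keys per repository, like blobstore/prefixblob/.
    def __init__(self, prefix, inner):
        self.prefix, self.inner = prefix, inner
    def get(self, key):
        return self.inner.get(self.prefix + key)
    def put(self, key, value):
        self.inner.put(self.prefix + key, value)

class Readonlyblob:
    # Rejects writes, like blobstore/readonlyblob/.
    def __init__(self, inner):
        self.inner = inner
    def get(self, key):
        return self.inner.get(key)
    def put(self, key, value):
        raise PermissionError("blobstore is read-only")

# Decorators stack freely; each layer only sees the layer below it.
backend = Memblob()
store = Prefixblob("repo123.", backend)
store.put("bonsai.abc", b"blob-bytes")
```

After the put, the backend holds the namespaced key repo123.bonsai.abc, so a second repository with a different prefix can share the same backend without key collisions.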
A typical production blobstore stack composes multiple decorators:

    Application (repo attributes, features, etc.)
                  ↓
    Prefixblob (adds repo-specific key prefix)
                  ↓
    Cacheblob (memcache + cachelib caching)
                  ↓
    Multiplexedblob (write to multiple backends)
                  ↓
    Packblob (compression)
                  ↓
    Backend storage (sqlblob, manifoldblob, or s3blob)
In this stack, Prefixblob namespaces every key with a repository-specific prefix (for example, a Bonsai changeset key becomes repo123.bonsai.abc123...).

Additional decorators can be inserted at various points in the stack. For example, redactedblobstore typically sits near the top to prevent redacted content from being accessed, while samplingblob might sit near the bottom to observe actual backend operations.
Mononoke's storage must remain available even when individual storage systems have issues. A particular concern is cyclic dependencies: if Mononoke uses a storage system, and that storage system's infrastructure uses Mononoke for source control, then a storage outage could prevent deploying fixes to that storage system.
To address this, Mononoke writes blobs to multiple independent storage backends simultaneously. The multiplexedblob decorator implements a write-all, read-any strategy:
Write Operations:

Blobs are written to all configured backends. A write can succeed even if some backends fail, with the outstanding work recorded in a write-ahead log (WAL).

Read Operations:

Reads can be served by any backend that has the blob. If one backend fails or is missing the blob, the other backends are tried.

Consistency Maintenance:

A background healer job (jobs/blobstore_healer/) periodically reads the WAL and re-writes any blobs that are missing from some of the backends.

Backend Independence:

The multiplexedblob WAL is stored in the metadata database (tables defined in blobstore_sync_queue/schemas/).
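The write-all, read-any behavior can be sketched as follows. This is a hypothetical simplification: here only failed writes are logged to the WAL, whereas the real implementation tracks writes until they are known to be fully persisted.

```python
class Memblob:
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def put(self, key, value):
        self._data[key] = value

class DownBlob:
    # Simulates an unavailable backend.
    def get(self, key):
        raise OSError("backend unavailable")
    def put(self, key, value):
        raise OSError("backend unavailable")

class Multiplexedblob:
    def __init__(self, backends, wal):
        self.backends = backends
        self.wal = wal  # entries consumed later by the healer job
    def put(self, key, value):
        succeeded = 0
        for backend in self.backends:
            try:
                backend.put(key, value)
                succeeded += 1
            except OSError:
                # Record the gap so the healer can copy the blob later.
                self.wal.append((key, backend))
        if succeeded == 0:
            raise OSError("write failed on all backends")
    def get(self, key):
        for backend in self.backends:
            try:
                value = backend.get(key)
                if value is not None:
                    return value
            except OSError:
                continue  # read-any: fall through to the next backend
        return None

healthy = Memblob()
wal = []
mux = Multiplexedblob([healthy, DownBlob()], wal)
mux.put("bonsai.abc", b"changeset bytes")
```

The put succeeds despite one backend being down, and the WAL records what the healer must repair; reads succeed as long as any backend holds the blob.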
Packblob (blobstore/packblob/) reduces storage costs by packing multiple blobs together and applying compression.
Packing Strategy:
Packblob can pack multiple key-value pairs into a single underlying blob. Related blobs (such as different versions of the same file) can be packed together, improving compression ratios and reducing the number of objects stored in the underlying backend.
Compression:
Packblob uses Zstd compression with delta compression. One blob serves as a dictionary, and other blobs in the pack are compressed using deltas against that dictionary. This is effective for Mononoke data where related blobs (such as different versions of the same file) share significant content.
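The effect of dictionary-based delta compression can be demonstrated with Python's zlib, using its preset-dictionary support as a stand-in for Zstd's dictionary mode:

```python
import random
import zlib

# Two "versions" of a blob: v2 is v1 plus a small edit. Random bytes
# make v1 essentially incompressible on its own, isolating the benefit
# of the dictionary.
random.seed(0)
v1 = bytes(random.randrange(256) for _ in range(2000))
v2 = v1 + b"appended change"

# Compressing v2 in isolation barely helps.
alone = zlib.compress(v2, 9)

# Compressing v2 against v1 as a preset dictionary stores only the
# delta: matches can reference the dictionary instead of repeating it.
c = zlib.compressobj(level=9, zdict=v1)
delta = c.compress(v2) + c.flush()

# Decompression requires the same dictionary.
d = zlib.decompressobj(zdict=v1)
restored = d.decompress(delta) + d.flush()
```

The delta form is far smaller than compressing the version on its own, which is exactly the situation packblob exploits for related blobs.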
Storage Envelope:
All blobs stored through packblob are wrapped in a Thrift-based storage envelope that indicates whether the blob is stored independently or as part of a pack. This discriminator allows reads to determine the storage layout without requiring external metadata.
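A sketch of the discriminator idea, with a single tag byte and hand-rolled framing standing in for the real Thrift envelope (the tag values and layout here are illustrative assumptions):

```python
import zlib

RAW, PACKED = 0, 1  # discriminator values; the real envelope is Thrift

def wrap_raw(data: bytes) -> bytes:
    # Blob stored independently: tag byte followed by the raw bytes.
    return bytes([RAW]) + data

def wrap_packed(blobs: dict) -> bytes:
    # Several key/value pairs framed and compressed into one payload.
    payload = b"".join(
        len(k).to_bytes(4, "big") + k.encode() +
        len(v).to_bytes(4, "big") + v
        for k, v in blobs.items()
    )
    return bytes([PACKED]) + zlib.compress(payload)

def unwrap(envelope: bytes, want_key=None):
    # The tag byte alone tells the reader which layout follows, so no
    # external metadata is needed.
    tag, body = envelope[0], envelope[1:]
    if tag == RAW:
        return body
    payload = zlib.decompress(body)
    blobs, i = {}, 0
    while i < len(payload):
        klen = int.from_bytes(payload[i:i + 4], "big"); i += 4
        key = payload[i:i + klen].decode(); i += klen
        vlen = int.from_bytes(payload[i:i + 4], "big"); i += 4
        blobs[key] = payload[i:i + vlen]; i += vlen
    return blobs[want_key] if want_key else blobs
```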
Position in Stack:
Packblob sits between the underlying storage backend and the multiplexedblob decorator. This allows different backends in a multiplex to use different packing strategies. The typical stack is:

    multiplexedblob
          ↓
    packblob
          ↓
    backend storage (sqlblob, manifoldblob, etc.)
Cache Interaction:
Packblob can cache both the packed form (compressed blobs) and the unpacked form (individual blobs). The packed form might be cached close to the storage backend to reduce decompression overhead, while the unpacked form is cached at higher levels for application access.
See blobstore/packblob/README.md for detailed design documentation.
While the blobstore holds immutable content, the metadata database stores mutable repository state and indexes that change as the repository evolves.
The metadata database contains several categories of mutable data:
Bookmarks and References:
Bookmarks table - Current positions of all branches (bookmarks), mapping bookmark names to Bonsai changeset IDs. Schema defined in repo_attributes/bookmarks/dbbookmarks/schemas/.
Bookmarks update log - Audit trail of all bookmark movements, recording the changeset a bookmark moved from, the changeset it moved to, the reason for the move, and a timestamp. This provides a complete history of branch updates.
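These two tables can be sketched with SQLite. The schemas and column names here are hypothetical simplifications; the real definitions live in repo_attributes/bookmarks/dbbookmarks/schemas/.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE bookmarks (
    name TEXT PRIMARY KEY,
    changeset_id TEXT NOT NULL
);
CREATE TABLE bookmarks_update_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    from_changeset_id TEXT,          -- NULL for a newly created bookmark
    to_changeset_id TEXT NOT NULL,
    reason TEXT NOT NULL,
    timestamp INTEGER NOT NULL DEFAULT (strftime('%s', 'now'))
);
""")

def move_bookmark(name, new_cs, reason):
    # The bookmark move and its audit-log entry commit atomically.
    with db:
        row = db.execute(
            "SELECT changeset_id FROM bookmarks WHERE name = ?", (name,)
        ).fetchone()
        old_cs = row[0] if row else None
        db.execute(
            "INSERT OR REPLACE INTO bookmarks (name, changeset_id) "
            "VALUES (?, ?)",
            (name, new_cs),
        )
        db.execute(
            "INSERT INTO bookmarks_update_log "
            "(name, from_changeset_id, to_changeset_id, reason) "
            "VALUES (?, ?, ?, ?)",
            (name, old_cs, new_cs, reason),
        )

move_bookmark("main", "bonsai_aaa", "push")
move_bookmark("main", "bonsai_bbb", "pushrebase")
```

After the two moves, the bookmarks table holds only the current position, while the log retains the full from/to history of each move.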
VCS Mappings:
Bonsai-Mercurial mapping (bonsai_hg_mapping) - Bidirectional mapping between Bonsai changeset IDs and Mercurial changeset hashes. Schema in repo_attributes/bonsai_hg_mapping/schemas/.
Bonsai-Git mapping (bonsai_git_mapping) - Bidirectional mapping between Bonsai changeset IDs and Git commit SHA-1 hashes. Schema in repo_attributes/bonsai_git_mapping/schemas/.
Bonsai-Globalrev mapping (bonsai_globalrev_mapping) - Mapping between Bonsai changeset IDs and sequential integer globalrevs used in some workflows. Schema in repo_attributes/bonsai_globalrev_mapping/schemas/.
Bonsai-SVN revision mapping (bonsai_svnrev_mapping) - Mapping for repositories imported from Subversion. Schema in repo_attributes/bonsai_svnrev_mapping/schemas/.
Commit Graph Index:
The commit graph index is stored in SQL tables, with schema in repo_attributes/commit_graph/sql_commit_graph_storage/schemas/. This includes generation numbers, merge ancestors, and skip tree structure.

Repository State:
Phases - Tracks whether commits are in draft or public phase. Schema in repo_attributes/phases/sqlphases/schemas/.
Mutable counters - Integer counters used for cross-repository sync and other operational tasks. Schema in repo_attributes/mutable_counters/schemas/.
Deletion log - Records content deletions for auditing. Schema in repo_attributes/deletion_log/schemas/.
Repository locks - Allows locking of the repository for administrative actions. Schema in repo_attributes/repo_lock/schemas/.
Additional Mappings and Metadata:
Git references - Git symbolic refs and ref content mappings. Schemas in repo_attributes/git_symbolic_refs/schemas/ and repo_attributes/git_ref_content_mapping/schemas/.
Pushrebase mutation mapping - Tracks how commits were rewritten during pushrebase. Schema in repo_attributes/pushrebase_mutation_mapping/schemas/.
Cross-repo sync mappings - Maps commits between synchronized repositories. Schema in features/commit_rewriting/synced_commit_mapping/schemas/.
Filenodes - Mercurial filenode data for protocol compatibility. Schema in repo_attributes/newfilenodes/schemas/.
All SQL schemas support both MySQL (production) and SQLite (development), with schema files located in component-specific schemas/ directories.
The choice of where to store data follows these principles:
Store in the blobstore when: the data is immutable, can be content-addressed, and may be large.

Store in the metadata database when: the data is mutable, requires atomic or transactional updates, or needs indexed lookups.
For example, a Bonsai changeset is immutable and content-addressed, so it lives in the blobstore. A bookmark is mutable and requires atomic updates, so it lives in the metadata database. The mapping between a Bonsai changeset ID and its Mercurial hash is immutable once established but requires indexed lookups in both directions, so it lives in the metadata database.
Mononoke employs multi-level caching to reduce load on storage backends and improve performance.
Level 1: In-Process Cache (Cachelib)
Each Mononoke server process maintains an in-memory cache using Cachelib, Meta's internal caching library. This cache is process-local and not shared between servers. Data is serialized using bincode for efficient memory representation.
Level 2: Shared Cache (Memcache)
A shared memcache cluster provides caching across all Mononoke servers in a region. Data is serialized using custom formats defined by the MemcacheEntity trait (often using Thrift Compact Protocol for complex types).
Cache Hierarchy:
When fetching a blob, Mononoke first checks the in-process cachelib cache, then the shared memcache, and finally the storage backend, populating both cache levels on the way back.
This hierarchy means that frequently accessed data can be served from in-process memory, moderately accessed data comes from the shared cache, and infrequently accessed data requires backend access.
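The lookup order can be sketched as follows, modelling both cache levels as plain dictionaries:

```python
class TwoLevelCache:
    def __init__(self, backend_get):
        self.cachelib = {}   # level 1: in-process, per-server
        self.memcache = {}   # level 2: shared across servers
        self.backend_get = backend_get

    def get(self, key):
        if key in self.cachelib:           # 1. check in-process cache
            return self.cachelib[key]
        if key in self.memcache:           # 2. check shared cache
            value = self.memcache[key]
            self.cachelib[key] = value     # promote into local cache
            return value
        value = self.backend_get(key)      # 3. fall through to backend
        if value is not None:
            self.memcache[key] = value     # fill both levels on the
            self.cachelib[key] = value     #    way back
        return value
```

A second server that shares the memcache level but has its own empty cachelib still avoids hitting the backend for data another server has already fetched.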
Some backend storage implementations (notably Manifold) have their own caches, which sit below the Mononoke caches.
Cache behavior can be controlled through command-line flags:
--cache-mode=local-only - Use only cachelib, skip memcache

--cache-mode=cache-disabled - Disable all caching

Different cache modes are useful for testing, debugging, and scenarios where caching behavior needs to be controlled.
Derived data is generated asynchronously. If a client were given the new value of a bookmark immediately after it moved to a newly landed commit, subsequent requests for that commit would stall while its derived data was computed. To prevent this, Mononoke maintains a cache of bookmarks that are "warm", meaning derived data has already been generated for the commit each bookmark points to. This delays the visibility of bookmark updates until derivation has completed. The delay is typically small (a few seconds) and has the same characteristics as the eventual consistency property of distributed systems.

Mononoke services can maintain the warm bookmark cache locally. Alternatively, the bookmark service microservice can maintain this cache and serve bookmark-related queries to front-end services, which avoids duplicating cache maintenance across hosts and reduces load on the metadata database.
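The delayed-visibility behavior can be sketched like this (a hypothetical simplification: the real cache tracks derivation per derived-data type and can serve the latest fully derived ancestor):

```python
class WarmBookmarksCache:
    def __init__(self, read_bookmarks, is_fully_derived):
        self._read = read_bookmarks       # () -> {name: changeset_id}
        self._derived = is_fully_derived  # changeset_id -> bool
        self._warm = {}

    def refresh(self):
        for name, cs in self._read().items():
            if self._derived(cs):
                self._warm[name] = cs
            # else: keep serving the previous warm position until
            # derivation for the new target completes

    def get(self, name):
        return self._warm.get(name)

# A bookmark moves to a commit whose data is still being derived.
bookmarks = {"main": "cs1"}
derived = {"cs1"}
cache = WarmBookmarksCache(lambda: bookmarks, lambda cs: cs in derived)
cache.refresh()               # main -> cs1 is warm
bookmarks["main"] = "cs2"     # bookmark moves to an underived commit
cache.refresh()               # cache still serves cs1
derived.add("cs2")            # derivation completes
cache.refresh()               # now main -> cs2 becomes visible
```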
Microwave (microwave/) is a cache warming system that preloads caches before they are needed. When a new server starts up or when caches are cold, microwave can load a previously saved snapshot of frequently accessed data into the local cache before the server takes traffic.
This reduces the "cold start" problem where a server with empty caches would perform poorly until caches are populated through normal access patterns.
Microwave was particularly important for the Mercurial wireproto implementation. It is less important now, and may be removed in the future.
The cacheblob decorator (blobstore/cacheblob/) implements the multi-level caching strategy. It wraps an underlying blobstore and intercepts get and put operations:
On get: the caches are checked first, and a hit is returned immediately; on a miss, the blob is fetched from the underlying blobstore and the caches are populated with the result.

On put: the blob is written to the underlying blobstore and the caches are populated, so subsequent reads are served from cache.
The caching implementation in common/caching_ext/ provides utilities for implementing caches with proper serialization, batching, and chunking of requests.
The virtually sharded blobstore is a variant of cacheblob that shards the blobstore keyspace into multiple shards to allow deduplication of simultaneous requests through the cache.
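The in-flight deduplication aspect can be sketched with asyncio (the virtual sharding itself, hashing keys into a fixed number of shards to bound concurrency, is omitted from this sketch):

```python
import asyncio

class CoalescingBlobstore:
    # Deduplicates simultaneous fetches of the same key: the first
    # caller issues the backend request, later callers await the same
    # in-flight result instead of issuing duplicates.
    def __init__(self, fetch):
        self._fetch = fetch      # async key -> value
        self._inflight = {}

    async def get(self, key):
        if key in self._inflight:
            # Shield so a cancelled waiter does not cancel the shared
            # fetch that other waiters depend on.
            return await asyncio.shield(self._inflight[key])
        task = asyncio.ensure_future(self._fetch(key))
        self._inflight[key] = task
        try:
            return await task
        finally:
            del self._inflight[key]

async def demo():
    calls = []
    async def slow_fetch(key):
        calls.append(key)
        await asyncio.sleep(0.01)   # simulate backend latency
        return b"value-" + key.encode()
    store = CoalescingBlobstore(slow_fetch)
    results = await asyncio.gather(*[store.get("k") for _ in range(10)])
    return calls, results

calls, results = asyncio.run(demo())
```

Ten concurrent requests for the same key result in a single backend fetch, which is the load-shedding property the virtually sharded blobstore provides at scale.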
Mononoke's storage architecture has several characteristics that affect how the system operates:
Immutability Enables Aggressive Caching - Since most blobstore data never changes, caches can retain data indefinitely without worrying about invalidation. TTLs are used primarily to manage cache capacity rather than consistency.
Multiplexing Trades Write Latency for Availability - Writing to multiple backends increases write latency but provides redundancy. This trade-off is acceptable because write operations are less frequent than reads, and high availability is critical.
Separation of Storage Types Enables Optimization - The blobstore and metadata database can be optimized differently. The blobstore uses content-addressed keys and eventual consistency, while the metadata database uses transactional updates and indexed queries.
Decorator Composition Increases Flexibility - The decorator pattern allows different repositories to use different storage configurations without changing application code. A repository storing large binary files might use packblob differently than a repository with mostly source code.
Caching Reduces Backend Load - Multi-level caching reduces load on storage backends by several orders of magnitude for frequently accessed data. The majority of reads are served from cachelib without hitting network storage.
Storage is External to Servers - Servers are stateless and can be added or removed without data migration. All persistent state lives in external storage systems.
Several other Mononoke components interact closely with storage:
Blobstore Healer (jobs/blobstore_healer/) - Background job that ensures all blobs exist in all multiplexed backends. Reads the multiplexedblob WAL and repairs missing blobs.
Walker (jobs/walker/) - Graph traversal tool that validates data integrity by walking the commit graph and verifying that all referenced blobs exist and are readable. See jobs/walker/src/README.md for details.
Filestore (filestore/) - Higher-level abstraction over the blobstore for storing file contents. Handles chunking of large files, content deduplication, and metadata storage.
Blobstore Factory (blobstore/factory/) - Constructs blobstore stacks based on configuration, assembling the appropriate decorators and backends.
Component-specific storage details are documented in the respective component directories.