Storage Architecture

This document explains Mononoke's storage architecture, including the blobstore system, metadata database, caching strategy, and how these components work together to provide scalable repository storage.

Storage Model Overview

Mononoke has two main data stores:

Immutable Blobstore - Content-addressed key-value storage for all immutable data including Bonsai changesets, file contents, and derived data. Once written, blobs are never modified or deleted (except through explicit redaction processes).

Metadata Database - SQL database (MySQL in production, SQLite for development) storing repository state that changes over time, such as bookmarks, VCS mappings, and the commit graph index.

This separation allows different optimization strategies for each type of data. The blobstore can use aggressive caching and replication since data is immutable, while the database handles transactional updates for mutable state.

In addition to these main stores, Mononoke uses several auxiliary data stores:

Mutable Blobstore - Storage for pre-computed data like the pre-loaded commit graph. This data is periodically recomputed and overwritten in the mutable blobstore.

Ephemeral Blobstore - Garbage-collectable blobstore used for ephemeral objects such as snapshots. The ephemeral blobstore groups blobs into "bubbles" that expire together.

Zelos Derivation Queue - A graph-aware queue used to coordinate derivation of derived data.

Blobstore Architecture

The blobstore is Mononoke's primary storage layer for repository content. All blobs are identified by unique keys and are immutable after creation.

What Lives in the Blobstore

The blobstore stores several categories of data:

Core Repository Data:

  • Bonsai changesets - Commit metadata including author, message, timestamps, parents, and file changes
  • File content blobs - Actual file contents, chunked for large files
  • File content metadata - Metadata about file contents

Derived Data:

  • Manifests - Directory structures (fsnodes, unodes, skeleton manifests)
  • File metadata - blame annotations, fastlog entries
  • Mercurial-specific data - Augmented Mercurial manifests
  • Git-specific data - Git delta manifests
  • Utility data - Changeset info, deleted manifest, case conflict detection

VCS-Specific Formats:

  • Mercurial artifacts - Revlog-compatible file nodes and manifests
  • Git artifacts - Git commit and tree objects

Most blobs use content-addressed keys based on Blake2b hashing, which enables deduplication and integrity verification. Some blobs are keyed by the hash of a corresponding item; for example, blame data is keyed by unode hashes.
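
As a rough illustration of content addressing, the sketch below derives a key from a blob's bytes with Blake2b. The crates used (blake2, hex) and the key layout are assumptions for the example, not Mononoke's actual key scheme.

use blake2::{Blake2b512, Digest};

// Derive an illustrative content-addressed key from raw blob bytes.
// The "category" segment and hex digest layout are hypothetical.
fn content_key(repo_prefix: &str, category: &str, data: &[u8]) -> String {
    let digest = Blake2b512::digest(data);
    format!("{repo_prefix}.{category}.{}", hex::encode(digest))
}

fn main() {
    // Identical payloads always map to the same key, enabling deduplication.
    let key = content_key("repo123", "content", b"fn main() {}\n");
    println!("{key}");
}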

Storage Backends

Mononoke supports multiple blobstore backend implementations:

Production Backends:

  • Manifoldblob (facebook/manifoldblob/) - Primary production backend using Manifold, Meta's internal distributed storage service. This is the standard deployment option for Meta's infrastructure.

  • SQLblob (blobstore/sqlblob/) - SQL database backend supporting both MySQL (production) and SQLite (development). Stores blobs as rows in a table with the key as the primary key and the blob bytes as a column.

  • S3blob (blobstore/s3blob/) - Amazon S3 backend for deployments using AWS infrastructure.

Development and Testing Backends:

  • Fileblob (blobstore/fileblob/) - Filesystem-based storage where each blob is a file. Used for local development and testing.

  • Memblob (blobstore/memblob/) - In-memory storage with no persistence. Used exclusively for unit tests.

Backend implementations share a common Blobstore trait, making them interchangeable. The choice of backend is a deployment decision that does not affect higher-level code.
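To make the idea concrete, here is a deliberately simplified sketch of what such a trait and an in-memory backend might look like. The real trait in Mononoke's blobstore crate carries additional context arguments, types, and methods that are omitted here.

use std::collections::HashMap;
use std::sync::Mutex;

use anyhow::Result;
use async_trait::async_trait;

// Simplified stand-in for the shared Blobstore trait (hypothetical signatures).
#[async_trait]
pub trait Blobstore: Send + Sync {
    async fn get(&self, key: &str) -> Result<Option<Vec<u8>>>;
    async fn put(&self, key: String, value: Vec<u8>) -> Result<()>;
}

// In-memory backend in the spirit of memblob: no persistence, test use only.
#[derive(Default)]
pub struct Memblob {
    data: Mutex<HashMap<String, Vec<u8>>>,
}

#[async_trait]
impl Blobstore for Memblob {
    async fn get(&self, key: &str) -> Result<Option<Vec<u8>>> {
        Ok(self.data.lock().unwrap().get(key).cloned())
    }

    async fn put(&self, key: String, value: Vec<u8>) -> Result<()> {
        self.data.lock().unwrap().insert(key, value);
        Ok(())
    }
}

Because higher-level code depends only on the trait, swapping this in-memory backend for a Manifold-, SQL-, or S3-backed implementation is a configuration change rather than a code change.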

The Decorator Pattern

Mononoke uses the decorator pattern to compose blobstore functionality. Rather than implementing features directly in backend storage implementations, decorators wrap blobstores to add specific capabilities. This allows flexible configuration of the storage stack.

Core Decorators:

  • Prefixblob (blobstore/prefixblob/) - Adds a prefix to all keys, typically used to namespace keys by repository. This allows multiple repositories to share a single blobstore backend without key collisions.

  • Cacheblob (blobstore/cacheblob/) - Implements multi-level caching using both cachelib (in-process cache) and memcache (shared cache across servers). Caching is discussed in detail in the Caching Strategy section below.

  • Multiplexedblob (blobstore/multiplexedblob/) - Writes blobs to multiple underlying blobstores simultaneously and reads from any available backend. This provides redundancy and availability, discussed further in the Multiplexing section.

  • Packblob (blobstore/packblob/) - Packs multiple related blobs together and applies compression. This reduces storage costs and is detailed in the Packblob section below.

Operational Decorators:

  • Redactedblobstore (blobstore/redactedblobstore/) - Enforces content redaction policies, preventing access to blobs that have been marked as redacted. Used for removing sensitive content from repositories.

  • Samplingblob (blobstore/samplingblob/) - Samples operations for observability, logging a configurable fraction of all blobstore operations for monitoring and debugging.

  • Throttledblob (blobstore/throttledblob/) - Applies rate limiting to blobstore operations, controlling load on downstream storage systems.

  • Logblob (blobstore/logblob/) - Logs all blobstore operations for debugging and auditing purposes.

  • Readonlyblob (blobstore/readonlyblob/) - Wraps a blobstore to make it read-only, preventing writes. Used when operating in read-only modes.

Testing Decorators:

  • Chaosblob (blobstore/chaosblob/) - Randomly fails operations to test error handling and resilience.

  • Delayblob (blobstore/delayblob/) - Adds artificial latency to operations to test timeout handling and performance under slow storage.

Specialized Storage:

  • Ephemeral blobstore (blobstore/ephemeral_blobstore/) - Temporary storage for snapshots and draft commits that have not been made permanent. Uses a separate SQL table with time-based expiration.

  • Virtually sharded blobstore (blobstore/virtually_sharded_blobstore/) - Caching blobstore that groups blobstore keys into virtual shards to allow deduplication of in-flight requests.

Decorators can be stacked in different orders to achieve different behaviors. The factory pattern in blobstore/factory/ constructs the appropriate stack based on configuration.
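The sketch below illustrates the decorator idea using the simplified Blobstore trait from the earlier sketch; PrefixBlob here is a hypothetical stand-in for prefixblob, not its real implementation.

use anyhow::Result;
use async_trait::async_trait;

// Decorator that namespaces keys by prepending a repo-specific prefix,
// in the spirit of prefixblob (hypothetical, built on the trait above).
pub struct PrefixBlob<T> {
    prefix: String, // e.g. "repo123."
    inner: T,
}

impl<T> PrefixBlob<T> {
    pub fn new(prefix: impl Into<String>, inner: T) -> Self {
        Self { prefix: prefix.into(), inner }
    }
}

#[async_trait]
impl<T: Blobstore> Blobstore for PrefixBlob<T> {
    async fn get(&self, key: &str) -> Result<Option<Vec<u8>>> {
        // Delegate to the wrapped blobstore with the namespaced key.
        self.inner.get(&format!("{}{}", self.prefix, key)).await
    }

    async fn put(&self, key: String, value: Vec<u8>) -> Result<()> {
        self.inner.put(format!("{}{}", self.prefix, key), value).await
    }
}

Because every decorator both consumes and implements the same trait, stacking them in a different order (for example, caching above or below multiplexing) is purely a construction-time decision, which is what the factory encodes.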

Example Blobstore Stack

A typical production blobstore stack composes multiple decorators:

Application (repo attributes, features, etc.)
    ↓
Prefixblob (adds repo-specific key prefix)
    ↓
Cacheblob (memcache + cachelib caching)
    ↓
Multiplexedblob (write to multiple backends)
    ↓
Packblob (compression)
    ↓
Backend storage (sqlblob, manifoldblob, or s3blob)

In this stack:

  1. Application code accesses the blobstore through repository facets
  2. Prefixblob namespaces keys by repository ID (e.g., repo123.bonsai.abc123...)
  3. Cacheblob checks cachelib and memcache before querying lower layers
  4. Multiplexedblob coordinates writes to multiple backends and reads from any available backend
  5. Packblob packs related blobs and applies compression
  6. Backend storage persists blobs to the underlying storage system

Additional decorators can be inserted at various points in the stack. For example, redactedblobstore typically sits near the top to prevent redacted content from being accessed, while samplingblob might sit near the bottom to observe actual backend operations.

Multiplexing for Availability

Mononoke's storage must remain available even when individual storage systems have issues. A particular concern is cyclic dependencies: if Mononoke uses a storage system, and that storage system's infrastructure uses Mononoke for source control, then a storage outage could prevent deploying fixes to that storage system.

To address this, Mononoke writes blobs to multiple independent storage backends simultaneously. The multiplexedblob decorator implements a write-all, read-any strategy:

Write Operations:

  • Blobs are written to all configured backends
  • The write succeeds if a quorum of backends acknowledges the write
  • If some backends fail, the write is logged to a write-ahead log (WAL)

Read Operations:

  • Reads are submitted to all backends by default
  • The first backend to successfully return the blob satisfies the read
  • If one backend is unavailable, reads succeed from other backends

Consistency Maintenance:

  • The blobstore healer job (jobs/blobstore_healer/) periodically reads the WAL
  • For each incomplete write, the healer copies blobs from backends that have the data to backends that are missing it
  • Once all backends have the blob, the WAL entry is deleted

Backend Independence:

  • Backends are chosen to be as independent as possible
  • For example, multiple Manifold buckets might use separate ZippyDB tiers
  • One backend might be sqlblob while another is manifoldblob
  • This reduces the probability that all backends fail simultaneously

The multiplexedblob WAL is stored in the metadata database (tables defined in blobstore_sync_queue/schemas/).
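
A rough sketch of the write-all, succeed-on-quorum idea, again in terms of the simplified Blobstore trait from earlier. The real multiplexedblob also records incomplete writes in the WAL, which is only hinted at in a comment here.

use anyhow::{anyhow, Result};
use futures::future::join_all;

// Write a blob to every backend concurrently and succeed once a quorum of
// them acknowledge it (hypothetical sketch of the multiplexed put path).
async fn multiplexed_put(
    backends: &[Box<dyn Blobstore>],
    write_quorum: usize,
    key: String,
    value: Vec<u8>,
) -> Result<()> {
    let writes = backends.iter().map(|b| b.put(key.clone(), value.clone()));
    let results = join_all(writes).await;

    let successes = results.iter().filter(|r| r.is_ok()).count();
    if successes >= write_quorum {
        // In the real system, backends that missed the write are noted in the
        // write-ahead log so the healer can backfill them later.
        Ok(())
    } else {
        Err(anyhow!(
            "only {successes} of {} backends acknowledged the write",
            backends.len()
        ))
    }
}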

Packblob: Compressed Storage

Packblob (blobstore/packblob/) reduces storage costs by packing multiple blobs together and applying compression.

Packing Strategy:

Packblob can pack multiple key-value pairs into a single underlying blob. Related blobs (such as different versions of the same file) can be packed together, improving compression ratios and reducing the number of objects stored in the underlying backend.

Compression:

Packblob uses Zstd compression with delta compression. One blob serves as a dictionary, and other blobs in the pack are compressed as deltas against that dictionary. This is effective for Mononoke data, where related blobs often share significant content.

Storage Envelope:

All blobs stored through packblob are wrapped in a Thrift-based storage envelope that indicates whether the blob is stored independently or as part of a pack. This discriminator allows reads to determine the storage layout without requiring external metadata.
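
As an illustration only, the envelope can be thought of as a tagged union along the following lines; the actual envelope is a Thrift-defined type inside packblob whose names and fields differ.

// Hypothetical Rust model of the storage-envelope discriminator.
enum StorageEnvelope {
    // The blob is stored on its own (optionally compressed).
    Standalone { payload: Vec<u8> },
    // The blob is one entry in a pack: a dictionary blob plus entries
    // delta-compressed against it.
    Packed { dictionary: Vec<u8>, entries: Vec<PackEntry> },
}

struct PackEntry {
    key: String,         // the logical blobstore key of this entry
    zstd_delta: Vec<u8>, // entry bytes, delta-compressed against the dictionary
}

fn is_packed(envelope: &StorageEnvelope) -> bool {
    matches!(envelope, StorageEnvelope::Packed { .. })
}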

Position in Stack:

Packblob sits between the underlying storage backend and the multiplexedblob decorator. This allows different backends in a multiplex to use different packing strategies. The typical stack is:

multiplexedblob
    ↓
packblob
    ↓
backend storage (sqlblob, manifoldblob, etc.)

Cache Interaction:

Packblob can cache both the packed form (compressed blobs) and the unpacked form (individual blobs). The packed form might be cached close to the storage backend to reduce decompression overhead, while the unpacked form is cached at higher levels for application access.

See blobstore/packblob/README.md for detailed design documentation.

Metadata Database

While the blobstore holds immutable content, the metadata database stores mutable repository state and indexes that change as the repository evolves.

What Lives in the Metadata Database

The metadata database contains several categories of mutable data:

Bookmarks and References:

  • Bookmarks table - Current positions of all branches (bookmarks), mapping bookmark names to Bonsai changeset IDs. Schema defined in repo_attributes/bookmarks/dbbookmarks/schemas/.

  • Bookmarks update log - Audit trail of all bookmark movements, recording the changeset a bookmark moved from, the changeset it moved to, the reason for the move, and a timestamp. This provides a complete history of branch updates.

VCS Mappings:

  • Bonsai-Mercurial mapping (bonsai_hg_mapping) - Bidirectional mapping between Bonsai changeset IDs and Mercurial changeset hashes. Schema in repo_attributes/bonsai_hg_mapping/schemas/.

  • Bonsai-Git mapping (bonsai_git_mapping) - Bidirectional mapping between Bonsai changeset IDs and Git commit SHA-1 hashes. Schema in repo_attributes/bonsai_git_mapping/schemas/.

  • Bonsai-Globalrev mapping (bonsai_globalrev_mapping) - Mapping between Bonsai changeset IDs and sequential integer globalrevs used in some workflows. Schema in repo_attributes/bonsai_globalrev_mapping/schemas/.

  • Bonsai-SVN revision mapping (bonsai_svnrev_mapping) - Mapping for repositories imported from Subversion. Schema in repo_attributes/bonsai_svnrev_mapping/schemas/.

Commit Graph Index:

  • Commit graph tables - Index of parent-child relationships in the commit graph with precomputed skip tree pointers for efficient ancestry queries. Schema in repo_attributes/commit_graph/sql_commit_graph_storage/schemas/. This includes generation numbers, merge ancestors, and skip tree structure.

Repository State:

  • Phases - Tracks whether commits are in draft or public phase. Schema in repo_attributes/phases/sqlphases/schemas/.

  • Mutable counters - Integer counters used for cross-repository sync and other operational tasks. Schema in repo_attributes/mutable_counters/schemas/.

  • Deletion log - Records content deletions for auditing. Schema in repo_attributes/deletion_log/schemas/.

  • Repository locks - Allows locking of the repository for administrative actions. Schema in repo_attributes/repo_lock/schemas/.

Additional Mappings and Metadata:

  • Git references - Git symbolic refs and ref content mappings. Schemas in repo_attributes/git_symbolic_refs/schemas/ and repo_attributes/git_ref_content_mapping/schemas/.

  • Pushrebase mutation mapping - Tracks how commits were rewritten during pushrebase. Schema in repo_attributes/pushrebase_mutation_mapping/schemas/.

  • Cross-repo sync mappings - Maps commits between synchronized repositories. Schema in features/commit_rewriting/synced_commit_mapping/schemas/.

  • Filenodes - Mercurial filenode data for protocol compatibility. Schema in repo_attributes/newfilenodes/schemas/.

All SQL schemas support both MySQL (production) and SQLite (development), with schema files located in component-specific schemas/ directories.

Blobstore vs. Metadata Database Decision

The choice of where to store data follows these principles:

Store in the blobstore when:

  • Data is immutable after creation
  • Data is content-addressed
  • Data is large or varies significantly in size
  • Data benefits from caching

Store in the metadata database when:

  • Data is mutable and requires transactional updates
  • Data is small and fixed-size
  • Data requires indexed queries or range scans
  • Data represents relationships between objects

For example, a Bonsai changeset is immutable and content-addressed, so it lives in the blobstore. A bookmark is mutable and requires atomic updates, so it lives in the metadata database. The mapping between a Bonsai changeset ID and its Mercurial hash is immutable once established but requires indexed lookups in both directions, so it lives in the metadata database.

Caching Strategy

Mononoke employs multi-level caching to reduce load on storage backends and improve performance.

Cache Levels

Level 1: In-Process Cache (Cachelib)

Each Mononoke server process maintains an in-memory cache using Cachelib, Meta's internal caching library. This cache is process-local and not shared between servers. Data is serialized using bincode for efficient memory representation.

  • Cache size is configurable
  • Eviction uses LRU or other configured policies
  • Cache keys are the same as blobstore keys
  • Only positive results are cached (successful reads)

Level 2: Shared Cache (Memcache)

A shared memcache cluster provides caching across all Mononoke servers in a region. Data is serialized using custom formats defined by the MemcacheEntity trait (often using Thrift Compact Protocol for complex types).

  • Shared across all server instances
  • Larger capacity than per-process cache
  • Network overhead for cache hits
  • TTLs control how long data is cached

Cache Hierarchy:

When fetching a blob:

  1. Check cachelib (process-local cache)
  2. If miss, check memcache (shared cache)
  3. If miss, fetch from blobstore backend
  4. On backend fetch, populate both memcache and cachelib
  5. On memcache hit, populate cachelib

This hierarchy means that frequently accessed data can be served from in-process memory, moderately accessed data comes from the shared cache, and infrequently accessed data requires backend access.

Some backend storage implementations (notably Manifold) have their own caches, which sit below the Mononoke caches.

Cache Configuration

Cache behavior can be controlled through command-line flags:

  • --cache-mode=local-only - Use only cachelib, skip memcache
  • --cache-mode=cache-disabled - Disable all caching
  • Additional flags control cache sizes and policies

Different cache modes are useful for testing, debugging, and scenarios where caching behavior needs to be controlled.

Warm Bookmark Cache

Derived data is generated asynchronously. If a client requested the value of a bookmark immediately after it moved to a newly landed commit, subsequent requests for that commit would stall while its derived data was generated. To prevent this, Mononoke maintains a cache of bookmarks that are "warm", meaning that derived data already exists for the commit each bookmark points to. The effect is to delay visibility of bookmark updates until derivation has completed. This delay is typically small (a few seconds) and has the same characteristics as the eventual consistency property of distributed systems.

Mononoke services can maintain the warm bookmark cache locally. Alternatively, the bookmark service microservice can maintain the cache and serve bookmark-related queries to front-end services, which avoids duplicating cache maintenance across hosts and reduces load on the metadata database.

Cache Warming: Microwave

Microwave (microwave/) is a cache warming system that preloads caches before they are needed. When a new server starts up or when caches are cold, microwave can:

  • Preload derived data for recent commits on important bookmarks
  • Warm blobstore caches with frequently accessed blobs
  • Prepare caches before traffic is directed to a server

This reduces the "cold start" problem where a server with empty caches would perform poorly until caches are populated through normal access patterns.

Microwave was particularly important for the Mercurial wireproto implementation. It is less important now, and may be removed in the future.

Cacheblob Implementation

The cacheblob decorator (blobstore/cacheblob/) implements the multi-level caching strategy. It wraps an underlying blobstore and intercepts get and put operations:

On get:

  1. Hash the key to create cache keys for both cachelib and memcache
  2. Check cachelib
  3. If cachelib miss, check memcache
  4. If memcache hit, populate cachelib and return
  5. If memcache miss, fetch from underlying blobstore
  6. Populate both cachelib and memcache before returning

On put:

  • Write to underlying blobstore
  • Optionally populate caches (configuration-dependent)

The caching implementation in common/caching_ext/ provides utilities for implementing caches with proper serialization, batching, and chunking of requests.
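
The get path can be sketched as follows, with hypothetical LocalCache and SharedCache traits standing in for the cachelib and memcache bindings, and the simplified Blobstore trait from earlier as the wrapped store.

use anyhow::Result;

// Hypothetical cache handles; the real code goes through cachelib bindings
// and a memcache client rather than traits like these.
trait LocalCache {
    fn get(&self, key: &str) -> Option<Vec<u8>>;
    fn set(&self, key: &str, value: &[u8]);
}

#[async_trait::async_trait]
trait SharedCache: Send + Sync {
    async fn get(&self, key: &str) -> Option<Vec<u8>>;
    async fn set(&self, key: &str, value: &[u8]);
}

async fn cached_get(
    cachelib: &impl LocalCache,
    memcache: &impl SharedCache,
    backend: &dyn Blobstore,
    key: &str,
) -> Result<Option<Vec<u8>>> {
    // Steps 1-2: process-local cache first.
    if let Some(value) = cachelib.get(key) {
        return Ok(Some(value));
    }
    // Steps 3-4: shared cache next; on a hit, backfill the local cache.
    if let Some(value) = memcache.get(key).await {
        cachelib.set(key, &value);
        return Ok(Some(value));
    }
    // Steps 5-6: fall through to the wrapped blobstore and populate both caches.
    let fetched = backend.get(key).await?;
    if let Some(value) = &fetched {
        memcache.set(key, value).await;
        cachelib.set(key, value);
    }
    Ok(fetched)
}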

The virtually sharded blobstore is a variant of cacheblob that shards the blobstore keyspace into multiple shards to allow deduplication of simultaneous requests through the cache.

Storage Characteristics

Mononoke's storage architecture has several characteristics that affect how the system operates:

Immutability Enables Aggressive Caching - Since most blobstore data never changes, caches can retain data indefinitely without worrying about invalidation. TTLs are used primarily to manage cache capacity rather than consistency.

Multiplexing Trades Write Latency for Availability - Writing to multiple backends increases write latency but provides redundancy. This trade-off is acceptable because write operations are less frequent than reads, and high availability is critical.

Separation of Storage Types Enables Optimization - The blobstore and metadata database can be optimized differently. The blobstore uses content-addressed keys and eventual consistency, while the metadata database uses transactional updates and indexed queries.

Decorator Composition Increases Flexibility - The decorator pattern allows different repositories to use different storage configurations without changing application code. A repository storing large binary files might use packblob differently than a repository with mostly source code.

Caching Reduces Backend Load - Multi-level caching reduces load on storage backends by several orders of magnitude for frequently accessed data. The majority of reads are served from cachelib without hitting network storage.

Storage is External to Servers - Servers are stateless and can be added or removed without data migration. All persistent state lives in external storage systems.

Several other Mononoke components interact closely with storage:

Blobstore Healer (jobs/blobstore_healer/) - Background job that ensures all blobs exist in all multiplexed backends. Reads the multiplexedblob WAL and repairs missing blobs.

Walker (jobs/walker/) - Graph traversal tool that validates data integrity by walking the commit graph and verifying that all referenced blobs exist and are readable. See jobs/walker/src/README.md for details.

Filestore (filestore/) - Higher-level abstraction over the blobstore for storing file contents. Handles chunking of large files, content deduplication, and metadata storage.

Blobstore Factory (blobstore/factory/) - Constructs blobstore stacks based on configuration, assembling the appropriate decorators and backends.

Component-specific storage details are documented in the respective component directories.