# Derived Data

This document explains Mononoke's derived data system: the framework for computing and caching indexes and alternative representations derived from the canonical Bonsai changesets and file content blobs.
Derived data is information computed from Bonsai changesets and file contents that can always be regenerated from the core data. While Bonsai changesets and file content blobs constitute the source of truth for repository history, many operations would be inefficient using only this minimal representation. Derived data provides indexes, directory structures, file metadata, and VCS-specific formats that enable performant read operations.
Examples of derived data include:

- Manifests describing the directory structure of each commit (fsnodes, unodes, skeleton manifests)
- History indexes such as fastlog and blame
- VCS-specific formats such as Mercurial changesets and Git commits
The distinguishing characteristic of derived data is that it can be computed asynchronously after a commit has been accepted, rather than requiring synchronous computation during the commit operation.
Mononoke separates repository data into two categories:
Core Data - Information that must be written synchronously during commit operations:

- Bonsai changesets
- File content blobs
- Bookmark updates
Derived Data - Information computed from core data after the commit completes:

- Manifests and other indexes
- Mercurial and Git representations of commits
- File history and blame data
This separation allows Mononoke to maintain high commit throughput. When a commit is pushed, only core data is written before the operation completes and the next commit can proceed. Derived data is computed asynchronously by background workers or on-demand when needed for read operations.
The write path accepts commits and stores core data:

1. Upload file content blobs
2. Store Bonsai changesets and commit graph entries
3. Move the bookmark to publish the new commits
Derived data computation is not on this critical path. The next commit can proceed immediately.
The read path uses derived data to serve queries efficiently:

- Directory listings are answered from manifests
- File history queries use fastlog and filenodes
- Mercurial and Git clients are served their native formats
For repositories with high commit rates, this separation prevents derived data computation from becoming a bottleneck. Additional derived data types can be added without affecting write latency.
Derived data types are implemented using a trait-based framework centered on the BonsaiDerivable trait. This framework handles dependency management, caching, batch derivation, and storage.
Each derived data type implements BonsaiDerivable, which defines:
Core Methods:

- derive_single - Compute derived data for one changeset given its Bonsai representation and the derived data for its parents
- store_mapping - Store the derived data in a way that can be retrieved by changeset ID
- fetch - Retrieve previously computed derived data by changeset ID

Optimization Methods:

- derive_batch - Compute derived data for multiple changesets efficiently (default: sequential derivation)
- fetch_batch - Retrieve derived data for multiple changesets efficiently (default: individual fetches)

Dependencies:

- Dependencies type - Other derived data types that must be available before this type can be derived
- PredecessorDependencies type - Optional predecessor types that can be used for parallel backfilling

The framework is located in derived_data/ and derived_data/manager/.
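To make the trait's shape concrete, here is a much-simplified, synchronous sketch. The real BonsaiDerivable trait is async and threads a derivation context and blobstore through every call; the trait, the Generation toy type, and the derive helper below are all hypothetical stand-ins, with a HashMap playing the role of the mapping store (the moral equivalent of store_mapping and fetch).

```rust
use std::collections::HashMap;

type ChangesetId = &'static str;

// Simplified mirror of the BonsaiDerivable shape (hypothetical; the real
// trait is async and takes a derivation context).
trait Derivable: Sized + Clone {
    const NAME: &'static str;

    /// Compute derived data for one changeset, given the derived data
    /// for each of its parents.
    fn derive_single(cs_id: ChangesetId, parents: &[Self]) -> Self;
}

/// Toy derived type: the generation number (distance from a root commit).
#[derive(Clone, Debug, PartialEq)]
struct Generation(u64);

impl Derivable for Generation {
    const NAME: &'static str = "generation";

    fn derive_single(_cs_id: ChangesetId, parents: &[Self]) -> Self {
        Generation(parents.iter().map(|p| p.0).max().map_or(0, |g| g + 1))
    }
}

/// Derive a changeset, reusing already-derived values from the mapping
/// (the stand-in for store_mapping / fetch).
fn derive<D: Derivable>(
    cs_id: ChangesetId,
    graph: &HashMap<ChangesetId, Vec<ChangesetId>>,
    mapping: &mut HashMap<ChangesetId, D>,
) -> D {
    if let Some(d) = mapping.get(cs_id) {
        return d.clone(); // "fetch": already derived
    }
    let parent_ids = graph[cs_id].clone();
    let parents: Vec<D> = parent_ids
        .into_iter()
        .map(|p| derive(p, graph, mapping))
        .collect();
    let derived = D::derive_single(cs_id, &parents);
    mapping.insert(cs_id, derived.clone()); // "store_mapping"
    derived
}
```

Note how deriving one changeset recursively derives its ancestors first, which is exactly why incremental derivation (each commit building on its parents' values) is cheap once a repository is backfilled.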
Derived data types can depend on other derived data types. These dependencies form a directed acyclic graph that the derivation framework respects. When deriving a changeset, all dependencies are derived first.
Example dependencies:

- Blame V2 depends on unodes
- Fastlog depends on unodes
Dependencies are declared using the Dependencies associated type in the BonsaiDerivable implementation. The framework automatically ensures dependencies are satisfied before derivation proceeds.
Derived data is stored in the blobstore using content-addressed keys. Each derived data type defines its own key format, typically incorporating the changeset ID and a type-specific prefix.
Example key patterns:

- derived_root_blame_v2.<ChangesetId> - Blame data
- derived_root_fsnode.<ChangesetId> - Fsnode manifest root

Mappings from changeset IDs to derived data are also stored in the metadata database for some types, providing fast lookups without requiring blobstore access.
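As an illustration of the key scheme (the function and the "redo1." prefix below are hypothetical; each real type defines its own key format), a mapping key is the type-specific prefix followed by the changeset ID, optionally namespaced by a rederivation prefix in the spirit of DerivedDataTypesConfig::mapping_key_prefixes:

```rust
/// Build a mapping key for a derived data type. Illustrative only:
/// real key formats are defined per derived data type. The optional
/// rederivation prefix moves keys into a separate namespace so a type
/// can be rederived independently of its existing data.
fn mapping_key(type_prefix: &str, rederivation_prefix: Option<&str>, cs_id: &str) -> String {
    format!(
        "{}{}.{}",
        rederivation_prefix.unwrap_or(""),
        type_prefix,
        cs_id
    )
}
```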
Derived data can be computed in several ways depending on operational needs.
When a client requests data that requires a specific derived data type, and that data has not yet been derived for the requested changeset, derivation occurs on-demand. The server derives the data, stores it, and returns the result.
On-demand derivation follows the dependency graph: if type A depends on type B, and neither is derived, type B is derived first, then type A.
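The dependency-first ordering can be sketched as a depth-first walk over the type dependency graph. The dependency table below is a hypothetical stand-in for what each type's Dependencies associated type declares; the example entries (blame and fastlog both depending on unodes) come from the dependencies listed earlier.

```rust
use std::collections::HashMap;

/// Return the order in which derived data types must be derived so that
/// every type appears after all of its dependencies. Sketch only: the
/// real framework resolves this from each type's Dependencies
/// associated type, not from a runtime table.
fn derivation_order(deps: &HashMap<&str, Vec<&str>>, requested: &str) -> Vec<String> {
    fn visit(ty: &str, deps: &HashMap<&str, Vec<&str>>, order: &mut Vec<String>) {
        // Derive all dependencies before the type itself.
        for dep in deps.get(ty).into_iter().flatten() {
            visit(dep, deps, order);
        }
        // Linear scan to avoid duplicates; fine for a handful of types.
        if !order.iter().any(|t| t == ty) {
            order.push(ty.to_string());
        }
    }
    let mut order = Vec::new();
    visit(requested, deps, &mut order);
    order
}
```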
For bulk operations or backfilling, derived data can be computed in batches. The derive_batch method allows implementations to optimize derivation across multiple changesets by:

- Deriving linear stacks of commits in one pass, feeding each result into the next derivation
- Amortizing blobstore and database round-trips across the batch
Batch derivation is used by the derived data service and bulk derivation tools.
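A toy sketch of the stack case: each derivation result feeds the next one, and all changeset-to-value mappings are returned for a single batched write. The derived "value" here is just a counter standing in for a real derived data type; the function name is hypothetical.

```rust
/// Derive a linear stack of changesets in one pass. `parent_value` is
/// the already-derived value of the stack's parent; the returned vector
/// is written to the mapping store in one batch instead of per commit.
fn derive_stack(stack: &[&str], parent_value: u64) -> Vec<(String, u64)> {
    let mut results = Vec::with_capacity(stack.len());
    let mut value = parent_value;
    for cs_id in stack {
        value += 1; // the per-changeset "derive_single" step (toy type)
        results.push((cs_id.to_string(), value));
    }
    results
}
```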
Backfilling is the process of deriving data for all changesets in a repository, typically after introducing a new derived data type or upgrading to a new version. Backfilling can be performed:
Sequentially - Derive changesets in topological order from repository roots to heads, so each changeset's parents are already derived when it is reached.
In Parallel with Slicing - Divide the commit graph into slices, derive data for slice boundaries using predecessor dependencies, then derive each slice in parallel.
Using Predecessor Optimization - Some types support derive_from_predecessor, which can compute derived data using a different derived data type without requiring parent data. This enables parallel backfilling by deriving "anchor points" first.
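The sequential strategy amounts to a topological sort of the commit graph. The sketch below uses Kahn's algorithm over an illustrative changeset-to-parents table (the real tools walk the commit graph store, not an in-memory map):

```rust
use std::collections::{HashMap, VecDeque};

/// Order changesets so that every changeset appears after its parents,
/// i.e. a valid sequential backfill order. Graph shape is illustrative:
/// changeset id -> parent ids, with roots mapping to an empty vector.
fn backfill_order(parents: &HashMap<&str, Vec<&str>>) -> Vec<String> {
    // Build a children index and the in-degree (= parent count) per node.
    let mut children: HashMap<&str, Vec<&str>> = HashMap::new();
    let mut indegree: HashMap<&str, usize> = HashMap::new();
    for (&cs, ps) in parents {
        indegree.entry(cs).or_insert(ps.len());
        for &p in ps {
            children.entry(p).or_default().push(cs);
        }
    }
    // Start from the roots (no parents) and peel the graph layer by layer.
    let mut queue: VecDeque<&str> = indegree
        .iter()
        .filter(|&(_, &d)| d == 0)
        .map(|(&cs, _)| cs)
        .collect();
    let mut order = Vec::new();
    while let Some(cs) = queue.pop_front() {
        order.push(cs.to_string());
        for &c in children.get(cs).into_iter().flatten() {
            let d = indegree.get_mut(c).unwrap();
            *d -= 1;
            if *d == 0 {
                queue.push_back(c);
            }
        }
    }
    order
}
```

The slicing strategy effectively cuts this ordering into chunks whose boundaries are derived first via predecessor dependencies, so the chunks themselves can proceed in parallel.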
Backfilling is performed using the admin CLI or bulk derivation tools in derived_data/bulk_derivation/.
After backfilling or initial derivation, new commits are derived incrementally as they arrive. This is typically handled by:

- The warm bookmark cache, which derives new commits as public bookmarks move
- On-demand derivation when a client requests data for a commit that has not yet been derived
The warm bookmark cache mechanism ensures that derived data for public bookmarks (like master) is kept up-to-date.
The Remote Derivation Service (derived_data/remote/) provides centralized, asynchronous derivation. Instead of each Mononoke server deriving data locally, derivation requests are enqueued and processed by a pool of derivation workers.
Architecture:

- Derivation requests are enqueued in per-repository derivation queues (repo_attributes/repo_derivation_queues/)
- A pool of derivation workers dequeues requests and performs the derivation

API:

- derive - Request asynchronous derivation, returns a token
- poll - Check status of a derivation request using the token

Benefits:

- Derivation work is deduplicated across servers
- Derivation capacity can be scaled horizontally, independently of serving hosts
The service is defined in derived_data/remote/if/derived_data_service.thrift and implemented in facebook/derived_data_service/.
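The client-side shape of this API can be sketched as a derive-then-poll loop. Everything below is a stand-in: the method names mirror the Thrift API described above, but the in-memory queue, token type, and "two polls until ready" behavior are invented for illustration.

```rust
#[derive(Clone, PartialEq, Debug)]
enum PollResult {
    Pending,
    Derived(String),
}

/// Hypothetical in-process stand-in for the remote derivation service.
struct DerivationService {
    // token -> (remaining polls until completion, result payload)
    requests: Vec<(usize, String)>,
}

impl DerivationService {
    fn new() -> Self {
        DerivationService { requests: Vec::new() }
    }

    /// Enqueue a derivation request; returns a token for polling.
    fn derive(&mut self, cs_id: &str, ty: &str) -> usize {
        // Pretend every request takes two polls to complete.
        self.requests.push((2, format!("{ty} for {cs_id}")));
        self.requests.len() - 1
    }

    /// Check the status of a previously enqueued request.
    fn poll(&mut self, token: usize) -> PollResult {
        let (remaining, result) = &mut self.requests[token];
        if *remaining > 0 {
            *remaining -= 1;
            PollResult::Pending
        } else {
            PollResult::Derived(result.clone())
        }
    }
}
```

A caller loops on poll (typically with backoff) until it receives a Derived result, which is what lets derivation run asynchronously on the worker pool rather than on the serving host.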
Mononoke includes a variety of derived data types serving different purposes. These are organized into several categories.
Manifests represent directory structures in different formats. Manifests can be tree-based or flat, and can also be sharded or unsharded. Unsharded manifests suffer from performance issues as the number of files in a directory increases, while sharded manifests can scale to large directories.
Fsnodes (derived_data/fsnodes/)
Unodes (derived_data/unodes/)
Skeleton Manifest (derived_data/skeleton_manifest/)
Skeleton Manifest V2 (derived_data/skeleton_manifest_v2/)
Basename Suffix Skeleton Manifest V3 (derived_data/basename_suffix_skeleton_manifest_v3/) - Supports efficient path queries of the form prefix/**/*.suffix.

Case Conflict Skeleton Manifest (derived_data/case_conflict_skeleton_manifest/)
Content Manifest (derived_data/content_manifest_derivation/)
Deleted Manifest V2 (derived_data/deleted_manifest/)
Filenodes (derived_data/filenodes_derivation/)
Fastlog (derived_data/fastlog/)
Blame V2 (derived_data/blame/)
Inferred Copy From (derived_data/inferred_copy_from/)
Git Commit (git/git_types/)
Git Delta Manifest V2 (git/git_types/)
Git Delta Manifest V3 (git/git_types/)
Mercurial Changeset (derived_data/mercurial_derivation/)
Mercurial Augmented Manifest (derived_data/mercurial_derivation/)
Changeset Info (derived_data/changeset_info/)
Test Manifest (derived_data/test_manifest/)
Test Sharded Manifest (derived_data/test_sharded_manifest/)
Several manifest types use Sharded Maps, a specialized data structure for storing large mappings across multiple blobstore blobs. Sharded Maps allow loading subsets of a manifest without fetching the entire structure.
Sharded Map V2 - An improved implementation that reduces the number of blob fetches needed for lookups, including inlining small submaps directly in their parent blob.
Sharded Maps are used by Skeleton Manifest V2, Basename Suffix Skeleton Manifest V3, Content Manifest, and Git Delta Manifest V2.
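The core idea can be sketched as a map split across multiple blobs by key prefix, so a lookup fetches only the shard that can contain the key. This is a toy: the real sharded maps are trie-shaped, handle rebalancing, and live in the blobstore; here a HashMap simulates blob storage, sharding is by first byte only, and keys are assumed non-empty.

```rust
use std::collections::HashMap;

/// Index of a sharded map: shard id (first key byte) -> blob key.
struct ShardedMap {
    shards: HashMap<u8, String>,
}

/// Split entries into shards by first key byte and "store" each shard
/// as its own blob. Keys are assumed to be non-empty.
fn store_sharded(
    blobstore: &mut HashMap<String, HashMap<String, String>>,
    entries: Vec<(String, String)>,
) -> ShardedMap {
    let mut shards: HashMap<u8, HashMap<String, String>> = HashMap::new();
    for (k, v) in entries {
        shards.entry(k.as_bytes()[0]).or_default().insert(k, v);
    }
    let mut index = HashMap::new();
    for (byte, shard) in shards {
        let blob_key = format!("shard.{byte}");
        blobstore.insert(blob_key.clone(), shard);
        index.insert(byte, blob_key);
    }
    ShardedMap { shards: index }
}

impl ShardedMap {
    /// Look up a key, fetching only the one shard that can contain it
    /// rather than the whole map.
    fn lookup(
        &self,
        blobstore: &HashMap<String, HashMap<String, String>>,
        key: &str,
    ) -> Option<String> {
        let blob_key = self.shards.get(&key.as_bytes()[0])?;
        blobstore.get(blob_key)?.get(key).cloned()
    }
}
```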
Derived data types can be versioned, allowing the data model to evolve while maintaining backward compatibility.
Versioning Strategy:

- Each version of a derived data type is a distinct type with its own name and storage keys (for example, Blame V2 alongside the original blame format)
- Old and new versions can coexist while a migration is in progress

Adding a New Version:

1. Implement the new version as a separate type implementing BonsaiDerivable
2. Backfill the new version across the repository
3. Switch readers over to the new version, then retire the old one

This allows derived data improvements without downtime or risky migrations.
Existing types can be rederived using a mapping key prefix, defined in DerivedDataTypesConfig::mapping_key_prefixes. This prefix is added to all mapping keys before the changeset ID, separating them into a different namespace so that rederivation can proceed independently of the existing data.
To add a new derived data type:

1. Define the derived data structure and its serialization format
2. Implement the BonsaiDerivable trait, including derive_single and the store_mapping and fetch methods
3. Add a variant to the DerivableType enum in mononoke_types
4. Enable the new type in the repository's derived data configuration

Example locations:

- derived_data/*/ - Existing derived data type implementations to use as references
- derived_data/manager/ - The derivation framework
- mononoke_types/src/derivable_type.rs - The DerivableType enum
- metaconfig/types/ - Derived data configuration types

The BonsaiDerivable trait documentation in derived_data/src/lib.rs provides detailed usage information.
The derived data system is designed for performance at scale:
Batch Derivation - Deriving multiple changesets together reduces database round-trips and allows amortization of common work.
Dependency Management - The framework ensures dependencies are derived in the correct order, minimizing redundant derivation.
Remote Derivation - Centralizing derivation reduces duplicate work across servers and enables horizontal scaling.
Caching - Derived data is stored in the blobstore and benefits from multi-level caching (cachelib, memcache).
Incremental Computation - Most derived data types can compute new values based on parent values, avoiding full recomputation.
Selective Derivation - Not all derived data types need to be derived for all changesets. Configuration controls which types are active.
Component-specific documentation for individual derived data types lives in the respective directories under derived_data/.