# 3.2 Jobs and Background Workers
This document describes the background jobs and maintenance tasks that operate on Mononoke repositories. These processes run independently of user-facing servers and handle asynchronous operations, validation, and maintenance work.
Mononoke's architecture separates user-facing operations from background maintenance. While servers handle client requests synchronously, background jobs perform work that does not need to block user operations. This separation allows the system to optimize the critical path for user requests while delegating expensive or periodic operations to dedicated workers.
Background jobs fall into several categories:
- Data validation and integrity checking (the walker)
- Storage healing for multiplexed blobstores (the blobstore healer)
- Asynchronous derived data computation (the derived data service)
- Cross-repository synchronization (the backsyncer)
- Replication to external systems (CAS sync, modern sync)
- Repository statistics collection (the statistics collector)
These jobs are typically long-running processes that continuously monitor for work, process batches of data, and report status via metrics and logging.
Location: jobs/walker/
The walker traverses the Mononoke commit graph and validates data integrity. It defines a graph schema that represents Mononoke's data model as nodes connected by edges, then uses bounded traversal to walk the graph efficiently.
The walker supports several subcommands:
Scrub - Verifies storage durability in multiplexed blobstores. When Mononoke uses multiple blobstore backends (for redundancy), the scrub operation checks that each component blobstore contains data for every key. If a blob is missing from one backend but present in another, the scrub repairs the inconsistency. Keeping every component complete means Mononoke can operate on a single component store if necessary. (A sketch of the check-and-repair step appears after this subcommand list.)
Validate - Checks data validity by traversing the graph and verifying references. This can detect issues like dangling references, missing data, or inconsistent metadata that might indicate bugs in derivation or write logic.
Corpus - Extracts a subset of repository data based on sampling criteria. The corpus is written to disk in a structured directory layout, allowing offline analysis or experimentation with compression techniques. Sampling can be based on path patterns or hash values.
Compression Benefit - Analyzes potential compression ratios by compressing each blob individually and comparing its size before and after. This helps evaluate storage optimization strategies.
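As promised above, here is a minimal, self-contained sketch of the scrub check-and-repair step. The `Blobstore` trait and in-memory stores are simplified stand-ins for illustration, not Mononoke's actual blobstore API (which is async and far richer).

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a component blobstore.
trait Blobstore {
    fn get(&self, key: &str) -> Option<Vec<u8>>;
    fn put(&mut self, key: &str, value: Vec<u8>);
}

// Simple in-memory store for the sketch.
#[derive(Default)]
struct MemStore(HashMap<String, Vec<u8>>);

impl Blobstore for MemStore {
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.0.get(key).cloned()
    }
    fn put(&mut self, key: &str, value: Vec<u8>) {
        self.0.insert(key.to_string(), value);
    }
}

/// Scrub one key across all component stores: if any component has the
/// blob, copy it into every component that is missing it. Fails only
/// when *no* component has the data.
fn scrub_key(stores: &mut [Box<dyn Blobstore>], key: &str) -> Result<(), String> {
    let value = stores
        .iter()
        .find_map(|s| s.get(key))
        .ok_or_else(|| format!("blob {key} missing from every component"))?;
    for store in stores.iter_mut() {
        if store.get(key).is_none() {
            store.put(key, value.clone()); // repair the lagging component
        }
    }
    Ok(())
}

fn main() {
    let mut a = MemStore::default();
    a.put("blob1", b"data".to_vec());
    let b = MemStore::default(); // missing blob1
    let mut stores: Vec<Box<dyn Blobstore>> = vec![Box::new(a), Box::new(b)];
    scrub_key(&mut stores, "blob1").unwrap();
    assert!(stores[1].get("blob1").is_some()); // inconsistency repaired
}
```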
The walker represents Mononoke data as a directed graph where nodes correspond to Bonsai changesets, file contents, manifests, and other data types. Edges represent relationships like parent commits, file references, or manifest entries. The graph is discovered dynamically during traversal.
Visit tracking prevents re-visiting nodes, which is necessary because the graph contains cycles (for example, filenode-to-changeset backlinks). The walker uses concurrent hash maps to track visited nodes while keeping memory usage bounded.
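To make the cycle problem concrete, here is a minimal sketch of cycle-safe traversal with visit tracking. The node and edge types are drastically simplified assumptions, and a plain `HashSet` stands in for the concurrent maps the real walker uses.

```rust
use std::collections::{HashSet, VecDeque};

// Hypothetical, drastically simplified node type; the real walker
// models many more node and edge variants.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
enum Node {
    Changeset(u32),
    FileNode(u32),
}

// Edges are discovered dynamically from each node. The FileNode ->
// Changeset backlink introduces a cycle.
fn outgoing_edges(node: &Node) -> Vec<Node> {
    match node {
        Node::Changeset(id) => vec![Node::FileNode(*id)],
        Node::FileNode(id) => vec![Node::Changeset(*id)], // backlink: cycle!
    }
}

/// Breadth-first walk that records visited nodes, so cyclic edges
/// (like filenode-to-changeset backlinks) terminate.
fn walk(roots: Vec<Node>) -> Vec<Node> {
    let mut visited: HashSet<Node> = HashSet::new();
    let mut queue: VecDeque<Node> = roots.into();
    let mut order = Vec::new();
    while let Some(node) = queue.pop_front() {
        // insert() returns false when the node was already visited,
        // which is what breaks the cycle.
        if !visited.insert(node.clone()) {
            continue;
        }
        for next in outgoing_edges(&node) {
            queue.push_back(next);
        }
        order.push(node);
    }
    order
}

fn main() {
    let order = walk(vec![Node::Changeset(1)]);
    assert_eq!(order.len(), 2); // each node visited exactly once
    println!("{order:?}");
}
```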
Sampling support allows the walker to operate on subsets of large repositories. Sampling can be based on node hash or repository path, enabling operations to be split into batches that can be processed independently.
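Sampling predicates can be as simple as the following hypothetical helpers; the real walker derives its sampling rules from command-line configuration.

```rust
/// Hash-based sampling: include a node when its hash lands in shard
/// `index` of `count` equally sized buckets, so `count` independent
/// walker runs together cover the whole repository exactly once.
fn in_hash_shard(node_hash: u64, index: u64, count: u64) -> bool {
    node_hash % count == index
}

/// Path-based sampling: include only files under a given prefix.
fn in_path_sample(path: &str, prefix: &str) -> bool {
    path.starts_with(prefix)
}

fn main() {
    assert!(in_hash_shard(10, 0, 5)); // 10 % 5 == 0, so in shard 0 of 5
    assert!(!in_hash_shard(11, 0, 5));
    assert!(in_path_sample("eden/mononoke/lib.rs", "eden/"));
}
```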
Documentation: See jobs/walker/src/README.md for detailed information about the walker's architecture, memory management, and usage patterns.
Location: jobs/blobstore_healer/
The blobstore healer maintains storage durability when using multiplexed blobstores. Mononoke writes to multiple independent storage backends simultaneously to avoid dependencies on any single storage system. The healer monitors a write-ahead log (WAL) queue and repairs any inconsistencies.
The healer continuously processes entries from the blobstore synchronization queue. For each entry, it verifies that all component blobstores in the multiplex configuration contain the expected blob. If a blob is missing from one backend, the healer copies it from another backend that has the data.
This addresses several scenarios:
- A write that succeeded on some backends but failed on others, for example because one backend was temporarily unavailable
- A backend that was down or unreachable long enough to miss writes entirely
- A write that was interrupted partway through, leaving the multiplex inconsistent
The healer operates in batches, processing queue entries within configured age and concurrency limits. It can be run in dry-run mode to report issues without making changes, or in drain-only mode to clear processed entries without healing.
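Here is a hedged sketch of the batch loop and its modes. `QueueEntry`, the `Mode` enum, the key format, and the age check are assumptions chosen to illustrate the description above, not the healer's actual types.

```rust
use std::time::{Duration, SystemTime};

// Hypothetical WAL entry: which blob key needs checking and when the
// write was logged.
struct QueueEntry {
    blob_key: String,
    logged_at: SystemTime,
}

#[derive(Clone, Copy, PartialEq)]
enum Mode {
    Heal,      // verify and repair
    DryRun,    // report what would be healed, change nothing
    DrainOnly, // clear processed entries without healing
}

/// Process one batch of queue entries, honoring a minimum entry age so
/// in-flight writes are not treated as failures. Returns the keys that
/// were (or, in DryRun, would be) healed.
fn process_batch(entries: Vec<QueueEntry>, mode: Mode, min_age: Duration) -> Vec<String> {
    let now = SystemTime::now();
    let mut healed = Vec::new();
    for entry in entries {
        // Skip entries younger than min_age: the original write may
        // still be propagating to all components.
        match now.duration_since(entry.logged_at) {
            Ok(age) if age >= min_age => {}
            _ => continue,
        }
        if mode == Mode::DrainOnly {
            continue; // entry is dropped without inspection
        }
        // A real healer would check every component store here and,
        // unless in DryRun, copy the blob to components missing it.
        healed.push(entry.blob_key);
    }
    healed
}

fn main() {
    let entries = vec![QueueEntry {
        blob_key: "repo0001.content.blake2.abc".to_string(),
        logged_at: SystemTime::now() - Duration::from_secs(600),
    }];
    let healed = process_batch(entries, Mode::Heal, Duration::from_secs(120));
    assert_eq!(healed.len(), 1);
}
```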
The healer accepts parameters for:
- Batch size: how many queue entries to process per iteration
- Entry age limits: which queue entries are old enough to be eligible for processing
- Concurrency: how many blobs to check and copy in parallel
- Operating mode: normal healing, dry-run, or drain-only
Location: derived_data/remote/
Derived data is computed asynchronously after commits are accepted. The remote derivation service coordinates this work across multiple worker processes, enabling horizontal scaling of derivation workload.
The derived data service is a Thrift-based microservice that accepts derivation requests and delegates work to worker processes. Clients (typically Mononoke servers) send requests to derive specific data types for specific changesets. The service tracks derivation progress and deduplicates requests for the same data.
This architecture allows derivation to scale independently from the main servers. During periods of high commit rate, additional derivation workers can be added to handle the load without affecting the write path.
When a commit is pushed, Mononoke stores the Bonsai changeset and file contents but does not compute derived data immediately. Instead, derivation happens asynchronously:
1. A client (or a batch job) requests derivation of a data type for a changeset.
2. The service deduplicates the request against work already in flight and enqueues it.
3. A worker process performs the derivation and writes the result to storage.
4. Subsequent requests for the same data are served directly from storage.
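Step 2 above coalesces duplicate requests. Here is a minimal sketch of that bookkeeping, assuming hypothetical types; the real service tracks far more state (progress, failures, priorities).

```rust
use std::collections::HashSet;

// Hypothetical request key: which changeset and which derived data type.
#[derive(Clone, PartialEq, Eq, Hash)]
struct DerivationKey {
    changeset_id: String,
    data_type: String, // e.g. "unodes" or "blame"
}

/// Tracks in-flight derivations so concurrent requests for the same
/// (changeset, type) pair are coalesced into one unit of work.
#[derive(Default)]
struct InflightTracker {
    inflight: HashSet<DerivationKey>,
}

impl InflightTracker {
    /// Returns true when the caller should start deriving, false when
    /// an identical request is already running and can be awaited.
    fn try_claim(&mut self, key: DerivationKey) -> bool {
        self.inflight.insert(key)
    }

    fn complete(&mut self, key: &DerivationKey) {
        self.inflight.remove(key);
    }
}

fn main() {
    let key = DerivationKey {
        changeset_id: "abc123".to_string(),
        data_type: "blame".to_string(),
    };
    let mut tracker = InflightTracker::default();
    assert!(tracker.try_claim(key.clone()));  // first request starts work
    assert!(!tracker.try_claim(key.clone())); // duplicate is coalesced
    tracker.complete(&key);
}
```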
Different derived data types have dependency relationships. For example, blame derivation depends on unodes, which must be derived first. The derivation framework manages these dependencies automatically.
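A minimal sketch of dependency-aware derivation follows. The dependency table is a hypothetical stand-in; the text above only confirms that blame depends on unodes.

```rust
use std::collections::HashSet;

// Hypothetical dependency table; the real framework declares
// dependencies on each derived data type's implementation.
fn dependencies(data_type: &str) -> &'static [&'static str] {
    match data_type {
        "blame" => &["unodes"],
        _ => &[],
    }
}

/// Derive `data_type`, recursively deriving its dependencies first.
/// `done` records what has already been derived for this changeset.
fn derive(data_type: &str, done: &mut HashSet<String>, log: &mut Vec<String>) {
    if done.contains(data_type) {
        return;
    }
    for dep in dependencies(data_type) {
        derive(dep, done, log); // dependencies first
    }
    log.push(data_type.to_string()); // the actual derivation would run here
    done.insert(data_type.to_string());
}

fn main() {
    let mut done = HashSet::new();
    let mut log = Vec::new();
    derive("blame", &mut done, &mut log);
    // unodes is derived before blame, as the dependency requires.
    assert_eq!(log, vec!["unodes".to_string(), "blame".to_string()]);
}
```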
Derivation can be triggered on-demand (when a client requests data that hasn't been derived) or via batch jobs that derive data for all recent commits.
Location: features/commit_rewriting/backsyncer/
The backsyncer synchronizes commits between different repositories based on defined mapping configurations. This supports workflows where a large repository is split into smaller repositories, or where commits need to be mirrored with path or content transformations.
The backsyncer monitors the bookmark update log in a source repository. When bookmarks move (typically due to new commits), the backsyncer:
1. Reads the new bookmark update log entries.
2. Applies the configured transformations to each affected commit.
3. Creates the rewritten commits in the target repository, recording the source-to-target mapping.
4. Moves the corresponding bookmarks in the target repository.
Transformations are defined in repository configuration and can include:
- Path remapping, such as adding or stripping a directory prefix
- Excluding paths that fall outside the subset synced to the target repository
- Content transformations applied to the rewritten commits
The backsyncer maintains mappings between source and target commits, allowing it to establish parent relationships correctly even when commit histories diverge.
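To make the transformation and mapping concrete, here is a small sketch. The prefix-stripping rule and the shape of the commit mapping are assumptions for illustration, not Mononoke's actual configuration format.

```rust
use std::collections::HashMap;

// Hypothetical path mapping: strip a prefix when syncing into a
// smaller repo; paths outside the prefix are not synced at all.
fn rewrite_path<'a>(source_path: &'a str, prefix: &str) -> Option<&'a str> {
    source_path.strip_prefix(prefix)
}

fn main() {
    // Mapping from already-synced source commits to target commits,
    // used to resolve parents of newly rewritten commits.
    let mut synced: HashMap<&str, &str> = HashMap::new();
    synced.insert("source_cs_1", "target_cs_1");

    // A child of source_cs_1 arrives; its parent in the target repo is
    // found through the mapping.
    let target_parent = synced.get("source_cs_1").copied();
    assert_eq!(target_parent, Some("target_cs_1"));

    // Path transformation for the new commit's file changes.
    assert_eq!(rewrite_path("small/dir/file.rs", "small/"), Some("dir/file.rs"));
    assert_eq!(rewrite_path("other/file.rs", "small/"), None); // not synced
}
```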
The backsyncer can run in different modes:
- A continuous mode that tails the bookmark update log and syncs new entries as they appear
- A one-shot mode that processes the existing backlog and then exits
Location: jobs/cas_sync/
The CAS (Content Addressed Storage) sync job uploads commits and file contents to external content-addressed storage systems. This provides an additional layer of data redundancy and enables integration with other systems that consume repository data.
The CAS sync job monitors the bookmark update log and processes new commits. For each commit, it:
1. Identifies the file contents and tree data reachable from the commit.
2. Uploads each blob to the content-addressed store, skipping blobs the store already holds (content addressing makes this deduplication natural).
3. Records the commit as synced once its data is fully uploaded.
The job handles retries for failed uploads and can be configured to sync specific bookmarks. It maintains a counter of the most recently synced bookmark update log entry, allowing it to resume from the correct position after restarts.
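A hedged sketch of the resume-counter pattern described above: `LogEntry`, `upload_to_cas`, and the retry policy are hypothetical, but the key idea is real, namely that the counter only advances past an entry once it has fully synced.

```rust
// Hypothetical bookmark-update-log entry and upload routine; the real
// job persists its position in a mutable counters table.
struct LogEntry {
    id: u64,
    changeset: String,
}

fn upload_to_cas(changeset: &str) -> Result<(), String> {
    // Pretend the upload always succeeds in this sketch.
    let _ = changeset;
    Ok(())
}

/// Process all entries newer than the persisted counter, retrying each
/// failed upload a bounded number of times, and return the new counter
/// value to persist.
fn sync_from(counter: u64, log: &[LogEntry], max_retries: u32) -> Result<u64, String> {
    let mut latest = counter;
    for entry in log.iter().filter(|e| e.id > counter) {
        let mut attempt = 0;
        loop {
            match upload_to_cas(&entry.changeset) {
                Ok(()) => break,
                Err(e) if attempt < max_retries => {
                    attempt += 1;
                    eprintln!("upload failed ({e}), retry {attempt}");
                }
                Err(e) => return Err(e), // give up; counter stays at `latest`
            }
        }
        // Only advance past an entry once it has fully synced, so a
        // restart resumes from the first unsynced entry.
        latest = entry.id;
    }
    Ok(latest)
}

fn main() {
    let log = vec![
        LogEntry { id: 41, changeset: "cs_a".into() },
        LogEntry { id: 42, changeset: "cs_b".into() },
    ];
    // Entries up to 41 were synced before the restart.
    assert_eq!(sync_from(41, &log, 3), Ok(42));
}
```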
Location: jobs/modern_sync/
The modern sync job synchronizes repository data to remote endpoints using the EdenAPI protocol. This enables replication of repository data to other Mononoke instances or compatible systems.
Similar to CAS sync, modern sync processes bookmark update log entries, grouping them into batches and syncing the associated commits and derived data to the target system.
The job can operate in sharded mode, with multiple instances processing different repositories.
Location: jobs/statistics_collector/
The statistics collector computes repository metrics such as file counts, total file size, and line counts. These metrics are tracked over time and logged to Scuba for monitoring and analysis.
The statistics collector monitors a bookmark (typically the main branch) and recomputes statistics when new commits appear. For the first commit it processes, it computes statistics from scratch:
1. Walks the commit's manifest to enumerate all files.
2. Accumulates the file count, total file size, and total line count.
3. Logs the resulting statistics.
For subsequent commits, the collector computes incremental changes by examining the diff between manifests, which is more efficient than recalculating statistics from scratch.
The collector can operate in two modes: processing a single repository specified on the command line, or running under the sharded executor framework, which assigns repositories to instances dynamically (see Sharded Execution below).
Statistics are computed only for regular files below a size threshold (to avoid processing very large files). Redacted content is handled by reporting zero lines for redacted files.
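The incremental update from the previous paragraphs can be sketched as follows. The `FileInfo` shape, the 10 MiB threshold, and the rule that over-threshold files contribute nothing are assumptions for illustration; a modified file can be modeled as a removal plus an addition.

```rust
// Hypothetical per-file info produced by diffing two manifests.
struct FileInfo {
    size: u64,
    lines: u64,
    redacted: bool,
}

#[derive(Default, Debug, PartialEq)]
struct RepoStats {
    files: i64,
    total_size: i64,
    lines: i64,
}

const MAX_SIZE: u64 = 10 * 1024 * 1024; // assumed threshold for "very large"

fn contribution(f: &FileInfo) -> RepoStats {
    if f.size > MAX_SIZE {
        return RepoStats::default(); // too large: contributes nothing
    }
    RepoStats {
        files: 1,
        total_size: f.size as i64,
        // Redacted content is reported as zero lines.
        lines: if f.redacted { 0 } else { f.lines as i64 },
    }
}

/// Apply a manifest diff incrementally: add contributions of added
/// files, subtract contributions of removed files.
fn apply_diff(stats: &mut RepoStats, added: &[FileInfo], removed: &[FileInfo]) {
    for f in added {
        let c = contribution(f);
        stats.files += c.files;
        stats.total_size += c.total_size;
        stats.lines += c.lines;
    }
    for f in removed {
        let c = contribution(f);
        stats.files -= c.files;
        stats.total_size -= c.total_size;
        stats.lines -= c.lines;
    }
}

fn main() {
    let mut stats = RepoStats::default();
    let added = vec![FileInfo { size: 100, lines: 10, redacted: false }];
    apply_diff(&mut stats, &added, &[]);
    assert_eq!(stats, RepoStats { files: 1, total_size: 100, lines: 10 });
}
```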
Background jobs are deployed as independent processes or services. Scheduling and lifecycle management depends on the deployment environment:
Sharded Execution - Jobs that process multiple repositories (like the statistics collector or backsyncer) can use the sharded executor framework. This integrates with Meta's ShardManager to automatically distribute repositories across worker instances and handle instance failures. (A toy illustration of the assignment idea follows this list.)
Continuous Operation - Jobs like the blobstore healer run continuously, polling their input queues and processing batches. They include sleep periods when no work is available to avoid excessive database load.
Triggered Jobs - Some jobs (like walker scrub operations) are triggered periodically via external scheduling systems rather than running continuously.
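For intuition only, here is a toy of the repository-to-instance assignment idea mentioned under Sharded Execution. ShardManager's real assignment is dynamic and failure-aware; nothing like this static hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy illustration only: a static hash assignment, standing in for
// ShardManager's dynamic distribution of repositories to instances.
fn assigned_instance(repo_name: &str, num_instances: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    repo_name.hash(&mut hasher);
    hasher.finish() % num_instances
}

fn main() {
    let repos = ["repo_a", "repo_b", "repo_c", "repo_d"];
    let me = 1; // this worker's index out of 3 instances
    for repo in repos {
        if assigned_instance(repo, 3) == me {
            println!("instance {me} processes {repo}");
        }
    }
}
```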
Background jobs report their status through several mechanisms:
Metrics - Jobs publish ODS counters and timeseries metrics for processed items, queue depths, error rates, and processing latency. These metrics power operational dashboards and alerting.
Scuba Logging - Detailed logs of individual operations (synced commits, healed blobs, derived changesets) are written to Scuba tables. This enables debugging and historical analysis.
Health Checks - Jobs integrate with Mononoke's monitoring framework to report liveness and readiness. This allows orchestration systems to detect and restart unhealthy instances.
Error Reporting - Errors are logged with context (repository, changeset ID, operation type) to facilitate investigation.
Jobs typically include command-line flags for dry-run mode, iteration limits, and custom logging, allowing operators to test changes safely and control resource usage.
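As a sketch of what such flags might look like in a Rust job using the clap crate (version 4 with the derive feature), with illustrative flag names rather than any job's actual interface:

```rust
use clap::Parser;

/// Hypothetical operational flags for a background job; real jobs
/// define their own flag sets, so treat these names as illustrative.
#[derive(Parser, Debug)]
struct JobArgs {
    /// Report what would change without writing anything.
    #[arg(long)]
    dry_run: bool,

    /// Stop after this many iterations (run forever when absent).
    #[arg(long)]
    limit: Option<u64>,

    /// How many queue entries to process per batch.
    #[arg(long, default_value_t = 1000)]
    batch_size: usize,
}

fn main() {
    let args = JobArgs::parse();
    println!("dry_run={} batch_size={}", args.dry_run, args.batch_size);
    if let Some(limit) = args.limit {
        println!("stopping after {limit} iterations");
    }
}
```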
Background jobs occupy a distinct layer in Mononoke's architecture:
Jobs use the same library code as servers and tools. They access repositories through the same facet-based abstractions and use the same storage backends. The difference is in their execution model: jobs are long-running processes that continuously look for work, while servers are stateless request handlers and tools are one-shot commands.
This separation allows Mononoke to optimize each layer independently. Servers minimize latency by avoiding expensive operations. Jobs maximize throughput by processing batches. Tools provide flexibility for operators to perform manual interventions when needed.