Jobs and Background Workers

This document describes the background jobs and maintenance tasks that operate on Mononoke repositories. These processes run independently of user-facing servers and handle asynchronous operations, validation, and maintenance work.

Overview

Mononoke's architecture separates user-facing operations from background maintenance. While servers handle client requests synchronously, background jobs perform work that does not need to block user operations. This separation allows the system to optimize the critical path for user requests while delegating expensive or periodic operations to dedicated workers.

Background jobs fall into several categories:

Validation and integrity - Verify data consistency and storage durability
Asynchronous derivation - Compute derived data off the critical path
Synchronization - Replicate data across repositories or storage systems
Metrics and monitoring - Collect repository statistics

These jobs are typically long-running processes that continuously monitor for work, process batches of data, and report status via metrics and logging.

Walker

Location: jobs/walker/

The walker traverses the Mononoke commit graph and validates data integrity. It defines a graph schema that represents Mononoke's data model as nodes connected by edges, then uses bounded traversal to walk the graph efficiently.

Capabilities

The walker supports several subcommands:

Scrub - Verifies storage durability in multiplexed blobstores. When Mononoke uses multiple blobstore backends (for redundancy), the scrub operation checks that each component blobstore contains data for every key. If a blob is missing from one backend but present in another, the scrub can repair the inconsistency. This allows Mononoke to operate on a single component store if necessary.

Validate - Checks data validity by traversing the graph and verifying references. This can detect issues like dangling references, missing data, or inconsistent metadata that might indicate bugs in derivation or write logic.

Corpus - Extracts a subset of repository data based on sampling criteria. The corpus is written to disk in a structured directory layout, allowing offline analysis or experimentation with compression techniques. Sampling can be based on path patterns or hash values.

Compression Benefit - Estimates achievable compression ratios by measuring the size of each blob before and after it is compressed individually. This helps evaluate storage optimization strategies.
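
As a rough illustration of the compression benefit measurement, the sketch below compresses each blob on its own and compares total sizes. The source does not say which compression algorithm the walker measures, so zlib from the Python standard library is used purely as a stand-in, and the function name `compression_benefit` is hypothetical.

```python
import zlib

def compression_benefit(blobs):
    """Return (raw_bytes, compressed_bytes), compressing each blob on its own."""
    raw = sum(len(blob) for blob in blobs)
    # Each blob is compressed independently, so redundancy shared between
    # different blobs is deliberately not exploited.
    compressed = sum(len(zlib.compress(blob)) for blob in blobs)
    return raw, compressed

# Two highly repetitive sample blobs (illustrative data only).
blobs = [b"fn main() {}\n" * 200, b"abc" * 1000]
raw, compressed = compression_benefit(blobs)
print(f"raw={raw} compressed={compressed} ratio={compressed / raw:.3f}")
```

Summing per-blob results like this gives an estimate for individual compression only; it says nothing about schemes that compress several blobs together.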

Implementation

The walker represents Mononoke data as a directed graph where nodes correspond to Bonsai changesets, file contents, manifests, and other data types. Edges represent relationships like parent commits, file references, or manifest entries. The graph is discovered dynamically during traversal.

Visit tracking prevents re-visiting nodes, which is necessary because the graph contains cycles (for example, filenode-to-changeset backlinks). The walker uses concurrent hash maps to track visited nodes while keeping memory usage bounded.
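
The role of visit tracking can be shown with a minimal sketch. The toy `EDGES` table, the node names, and the `walk` function are hypothetical and do not reflect the walker's actual types. The real walker discovers edges dynamically and tracks visits in concurrent hash maps, whereas this sketch is single-threaded and uses a plain set.

```python
from collections import deque

# Hypothetical toy graph: node -> outgoing edges. The filenode-to-changeset
# backlink creates a cycle, as described above.
EDGES = {
    "changeset:a": ["manifest:a"],
    "manifest:a": ["filenode:a"],
    "filenode:a": ["content:a", "changeset:a"],  # backlink closes the cycle
    "content:a": [],
}

def walk(roots, edges):
    """Breadth-first walk that visits every reachable node exactly once."""
    visited = set(roots)
    queue = deque(roots)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for target in edges.get(node, []):
            if target not in visited:  # skip nodes that were already seen
                visited.add(target)
                queue.append(target)
    return order

visit_order = walk(["changeset:a"], EDGES)
```

Without the `visited` check, the backlink from the filenode to the changeset would make the loop run forever.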

Sampling support allows the walker to operate on subsets of large repositories. Sampling can be based on node hash or repository path, enabling operations to be split into batches that can be processed independently.
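
One way hash-based sampling can split a walk into independent batches is sketched below. This is an assumption about the general technique, not the walker's actual scheme: the `in_sample` helper, its `sample_rate` and `sample_offset` parameters, and the choice of SHA-256 are all hypothetical.

```python
import hashlib

def in_sample(node_key: str, sample_rate: int, sample_offset: int = 0) -> bool:
    """Return True if the node falls in the 1-in-`sample_rate` slice
    selected by `sample_offset`."""
    digest = hashlib.sha256(node_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % sample_rate
    return bucket == sample_offset

# Split 1000 hypothetical node keys into four disjoint batches.
keys = [f"content:{i}" for i in range(1000)]
batches = [[k for k in keys if in_sample(k, 4, offset)] for offset in range(4)]
```

Because the bucket depends only on the key, every node lands in exactly one batch, and each batch can be processed by a separate run.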

Documentation: See jobs/walker/src/README.md for detailed information about the walker's architecture, memory management, and usage patterns.

Blobstore Healer

Location: jobs/blobstore_healer/

The blobstore healer maintains storage durability when using multiplexed blobstores. Mononoke writes to multiple independent storage backends simultaneously to avoid dependencies on any single storage system. The healer monitors a write-ahead log (WAL) queue and repairs any inconsistencies.

Operation

The healer continuously processes entries from the blobstore synchronization queue. For each entry, it verifies that all component blobstores in the multiplex configuration contain the expected blob. If a blob is missing from one backend, the healer copies it from another backend that has the data.

This addresses several scenarios:

Temporary unavailability of a component blobstore during writes
Blobs stored to only a subset of component stores due to configuration changes
Corruption or data loss in a single backend

The healer operates in batches, processing queue entries within configured age and concurrency limits. It can be run in dry-run mode to report issues without making changes, or in drain-only mode to clear processed entries without healing.
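
The core heal step and the dry-run mode described above can be sketched as follows. Plain dictionaries stand in for component blobstores, and the `heal_entry` function and store names are hypothetical, not the healer's real interface.

```python
def heal_entry(key, stores, dry_run=False):
    """Copy `key` into every store that lacks it.

    `stores` maps a store name to a dict acting as a stand-in blobstore.
    Returns the names of the stores that were missing the blob.
    """
    present = [name for name, blobs in stores.items() if key in blobs]
    missing = [name for name, blobs in stores.items() if key not in blobs]
    if not present:
        raise LookupError(f"{key} is missing from every component store")
    if not dry_run:
        blob = stores[present[0]][key]
        for name in missing:
            stores[name][key] = blob  # repair from a store that has the data
    return missing

stores = {
    "store_a": {"blob1": b"data"},
    "store_b": {},
    "store_c": {"blob1": b"data"},
}
reported = heal_entry("blob1", stores, dry_run=True)  # report only, no writes
repaired = heal_entry("blob1", stores)                # copy into store_b
```

In dry-run mode the function only reports which stores would be repaired, so the same entry is still reported as missing on the next real run.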

Configuration

The healer accepts parameters for:

Queue batch size and entry age thresholds
Concurrency limits (number of blobs and total bytes)
Shard ranges for parallel operation across multiple healer instances
Storage configuration identifying which multiplexed blobstore to heal

Derived Data Workers

Location: derived_data/remote/

Derived data is computed asynchronously after commits are accepted. The remote derivation service coordinates this work across multiple worker processes, enabling horizontal scaling of derivation workload.

Remote Derivation Service

The derived data service is a Thrift-based microservice that accepts derivation requests and delegates work to worker processes. Clients (typically Mononoke servers) send requests to derive specific data types for specific changesets. The service tracks derivation progress and deduplicates requests for the same data.
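
Request deduplication can be pictured as a lookup keyed by data type and changeset. The sketch below is an in-memory illustration only: the `DerivationTracker` class, its methods, and the status strings are hypothetical, and the real service is a Thrift-based service whose storage and API are not described here.

```python
class DerivationTracker:
    """Hypothetical in-memory tracker keyed by (data type, changeset id)."""

    def __init__(self):
        self._status = {}

    def request(self, data_type, changeset_id):
        """Register a request; return True only if new work was enqueued."""
        key = (data_type, changeset_id)
        if key in self._status:
            return False  # already in progress or derived: deduplicated
        self._status[key] = "in_progress"
        return True

    def mark_done(self, data_type, changeset_id):
        self._status[(data_type, changeset_id)] = "derived"

    def status(self, data_type, changeset_id):
        return self._status.get((data_type, changeset_id), "not_requested")

tracker = DerivationTracker()
first = tracker.request("unodes", "cs1")   # enqueues work
second = tracker.request("unodes", "cs1")  # duplicate, no new work
```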

This architecture allows derivation to scale independently from the main servers. During periods of high commit rate, additional derivation workers can be added to handle the load without affecting the write path.

Derivation Process

When a commit is pushed, Mononoke stores the Bonsai changeset and file contents but does not compute derived data immediately. Instead, derivation happens asynchronously:

A derivation request is queued or sent to the remote service
Workers fetch the Bonsai changeset and its dependencies
The worker computes the derived data (manifests, filenodes, etc.)
The derived data is stored in the blobstore
Metadata tables are updated to mark the data as derived

Different derived data types have dependency relationships. For example, blame derivation depends on unodes, which must be derived first. The derivation framework manages these dependencies automatically.
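
Dependency handling of this kind amounts to a topological ordering of data types. The sketch below shows the idea using `graphlib` from the Python standard library (Python 3.9 or later). Only the blame-on-unodes dependency comes from the text above; the `DEPENDENCIES` table and the `derivation_order` function are hypothetical and are not the derivation framework's API.

```python
from graphlib import TopologicalSorter

# Data type -> set of types that must be derived first.
DEPENDENCIES = {
    "unodes": set(),
    "blame": {"unodes"},
}

def derivation_order(requested, dependencies):
    """Return the requested types plus their transitive dependencies,
    ordered so that dependencies come first."""
    needed, stack = set(), list(requested)
    while stack:
        data_type = stack.pop()
        if data_type not in needed:
            needed.add(data_type)
            stack.extend(dependencies[data_type])
    graph = {t: dependencies[t] for t in needed}
    return list(TopologicalSorter(graph).static_order())

order = derivation_order(["blame"], DEPENDENCIES)
```

Requesting blame alone therefore schedules unodes first, without the caller having to know about the dependency.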

Derivation can be triggered on-demand (when a client requests data that hasn't been derived) or via batch jobs that derive data for all recent commits.

Cross-Repository Sync (Backsyncer)

Location: features/commit_rewriting/backsyncer/

The backsyncer synchronizes commits between different repositories based on defined mapping configurations. This supports workflows where a large repository is split into smaller repositories, or where commits need to be mirrored with path or content transformations.

Sync Process

The backsyncer monitors the bookmark update log in a source repository. When bookmarks move (typically due to new commits), the backsyncer:

Reads bookmark update log entries
Fetches the new commits from the source repository
Applies configured transformations (path rewriting, file filtering, etc.)
Creates equivalent commits in the target repository
Updates bookmarks in the target repository

Transformations are defined in repository configuration and can include:

Path prefix changes (moving all files under a different directory)
File inclusion/exclusion patterns
Commit message rewriting
Author/timestamp preservation

The backsyncer maintains mappings between source and target commits, allowing it to establish parent relationships correctly even when commit histories diverge.
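
The interplay between path rewriting and the commit mapping can be sketched as follows. Everything here is a simplified assumption: commits are plain dictionaries, target ids are placeholder strings rather than content hashes, and `rewrite_path` and `sync_commit` are hypothetical helpers, not backsyncer functions.

```python
def rewrite_path(path, src_prefix, dst_prefix):
    """Move `path` from under `src_prefix` to `dst_prefix`.
    Returns None if the path is outside `src_prefix` and is filtered out."""
    if not path.startswith(src_prefix):
        return None
    return dst_prefix + path[len(src_prefix):]

def sync_commit(commit, mapping, src_prefix, dst_prefix):
    """Build the target-side commit and record the source -> target mapping."""
    files = {}
    for path, content in commit["files"].items():
        new_path = rewrite_path(path, src_prefix, dst_prefix)
        if new_path is not None:
            files[new_path] = content
    target = {
        "id": "target-" + commit["id"],  # placeholder id for illustration
        # Parents are resolved through the mapping of already-synced commits.
        "parents": [mapping[p] for p in commit["parents"]],
        "files": files,
        "author": commit["author"],  # author is preserved
    }
    mapping[commit["id"]] = target["id"]
    return target

mapping = {}
c1 = {"id": "s1", "parents": [], "author": "alice",
      "files": {"lib/a.rs": "A", "docs/x.md": "X"}}
c2 = {"id": "s2", "parents": ["s1"], "author": "bob",
      "files": {"lib/b.rs": "B"}}
t1 = sync_commit(c1, mapping, "lib/", "")
t2 = sync_commit(c2, mapping, "lib/", "")
```

The second commit finds its parent through `mapping`, which is why commits must be synced in an order where parents come before children.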

Modes of Operation

The backsyncer can run in different modes:

Continuous tailing of the bookmark update log
One-time sync of specific commits or bookmark ranges
Sharded execution across multiple worker processes

CAS Sync

Location: jobs/cas_sync/

The CAS (Content Addressed Storage) sync job uploads commits and file contents to external content-addressed storage systems. This provides an additional layer of data redundancy and enables integration with other systems that consume repository data.

Operation

The CAS sync job monitors the bookmark update log and processes new commits. For each commit, it:

Fetches the commit metadata and file contents
Uploads data to the CAS backend
Records progress in mutable counters
Logs sync status to Scuba

The job handles retries for failed uploads and can be configured to sync specific bookmarks. It maintains a counter of the most recently synced bookmark update log entry, allowing it to resume from the correct position after restarts.
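
The resume behavior can be sketched with a counter that advances only after a successful upload. A dictionary stands in for the mutable counters table, and the counter name, `sync_pending`, and the failing `upload` function are hypothetical.

```python
def sync_pending(log_entries, counters, upload, counter_name="latest_synced_entry"):
    """Upload entries newer than the stored counter, advancing the counter
    after each successful upload."""
    last_synced = counters.get(counter_name, 0)
    for entry_id, commit in sorted(log_entries.items()):
        if entry_id <= last_synced:
            continue  # already synced before the restart
        upload(commit)  # may raise; the counter then stays at the last success
        counters[counter_name] = entry_id
    return counters.get(counter_name, 0)

log = {1: "c1", 2: "c2", 3: "c3"}
counters = {}
uploaded = []
failures = {"c3"}  # commits whose next upload attempt fails

def upload(commit):
    if commit in failures:
        failures.discard(commit)
        raise ConnectionError(f"upload of {commit} failed")
    uploaded.append(commit)

try:
    sync_pending(log, counters, upload)
except ConnectionError:
    pass  # simulated crash; a restart simply calls sync_pending again
position_after_failure = counters["latest_synced_entry"]
final_position = sync_pending(log, counters, upload)
```

After the simulated failure the counter points at entry 2, so the second run uploads only the third commit.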

Modern Sync

Location: jobs/modern_sync/

The modern sync job synchronizes repository data to remote endpoints using the EdenAPI protocol. This enables replication of repository data to other Mononoke instances or compatible systems.

Operation

Similar to CAS sync, modern sync processes bookmark update log entries. It groups entries into batches and syncs the associated commits and derived data to the target system. The job supports:

Batching of bookmark updates for efficiency
Configurable sync targets via TLS and proxy settings
Progress tracking via mutable counters
Optional ODS (operational data store) metrics

The job can operate in sharded mode, with multiple instances processing different repositories.

Statistics Collector

Location: jobs/statistics_collector/

The statistics collector computes repository metrics such as file counts, total file size, and line counts. These metrics are tracked over time and logged to Scuba for monitoring and analysis.

Operation

The statistics collector monitors a bookmark (typically the main branch) and recomputes statistics when new commits appear. For each commit, it:

Fetches the manifest from the commit
Iterates over all files in the repository
Computes statistics (file count, total size, line counts)
Logs results to Scuba with the commit timestamp

For subsequent commits, the collector computes incremental changes by examining the diff between manifests, which is more efficient than recalculating statistics from scratch.
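
The difference between a full computation and an incremental update can be sketched as follows. Manifests are modeled as flat dictionaries mapping a path to a `(size, line_count)` pair, which is an assumption for illustration. This sketch also compares every path, whereas a real manifest diff would only surface changed entries.

```python
def full_stats(manifest):
    """Compute statistics from scratch; manifest maps path -> (size, lines)."""
    return {
        "files": len(manifest),
        "size": sum(size for size, _ in manifest.values()),
        "lines": sum(lines for _, lines in manifest.values()),
    }

def incremental_stats(stats, old_manifest, new_manifest):
    """Update `stats` using only the paths that differ between manifests."""
    updated = dict(stats)
    for path in old_manifest.keys() | new_manifest.keys():
        old, new = old_manifest.get(path), new_manifest.get(path)
        if old == new:
            continue  # unchanged file contributes nothing
        if old is not None:  # file was removed or modified: subtract old values
            updated["files"] -= 1
            updated["size"] -= old[0]
            updated["lines"] -= old[1]
        if new is not None:  # file was added or modified: add new values
            updated["files"] += 1
            updated["size"] += new[0]
            updated["lines"] += new[1]
    return updated

m1 = {"a.txt": (100, 10), "b.txt": (50, 5)}
m2 = {"a.txt": (120, 12), "c.txt": (40, 4)}
stats1 = full_stats(m1)
stats2 = incremental_stats(stats1, m1, m2)
```

The incremental result matches what `full_stats(m2)` would return, but the work is proportional to the number of changed files rather than the size of the repository.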

The collector can operate in two modes:

Continuous monitoring of a bookmark
One-time computation for a set of commits from a file

Statistics are computed only for regular files below a size threshold (to avoid processing very large files). Redacted files are reported as having zero lines.

Job Scheduling and Orchestration

Background jobs are deployed as independent processes or services. Scheduling and lifecycle management depend on the deployment environment:

Sharded Execution - Jobs that process multiple repositories (like the statistics collector or backsyncer) can use the sharded executor framework. This integrates with Meta's ShardManager to automatically distribute repositories across worker instances and handle instance failures.

Continuous Operation - Jobs like the blobstore healer run continuously, polling their input queues and processing batches. They include sleep periods when no work is available to avoid excessive database load.

Triggered Jobs - Some jobs (like walker scrub operations) are triggered periodically via external scheduling systems rather than running continuously.
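
The continuous-operation pattern described above can be sketched as a polling loop that sleeps when no work is available. The `run_loop` function and its parameters are hypothetical, and the iteration limit is included so that the example terminates.

```python
import time

def run_loop(fetch_batch, process, idle_sleep_secs, max_iterations):
    """Poll for batches; sleep when the queue is empty to limit database load."""
    processed = 0
    for _ in range(max_iterations):
        batch = fetch_batch()
        if not batch:
            time.sleep(idle_sleep_secs)  # back off instead of polling hot
            continue
        for item in batch:
            process(item)
            processed += 1
    return processed

# Simulated queue: two items, an empty poll, then one more item.
pending = [["a", "b"], [], ["c"]]
handled = []
count = run_loop(
    fetch_batch=lambda: pending.pop(0) if pending else [],
    process=handled.append,
    idle_sleep_secs=0,  # no real delay in this example
    max_iterations=5,
)
```

In a real deployment the sleep interval trades queue latency against load on the database backing the queue.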

Monitoring and Alerting

Background jobs report their status through several mechanisms:

Metrics - Jobs publish ODS counters and timeseries metrics for processed items, queue depths, error rates, and processing latency. These metrics power operational dashboards and alerting.

Scuba Logging - Detailed logs of individual operations (synced commits, healed blobs, derived changesets) are written to Scuba tables. This enables debugging and historical analysis.

Health Checks - Jobs integrate with Mononoke's monitoring framework to report liveness and readiness. This allows orchestration systems to detect and restart unhealthy instances.

Error Reporting - Errors are logged with context (repository, changeset ID, operation type) to facilitate investigation.

Jobs typically include command-line flags for dry-run mode, iteration limits, and custom logging, allowing operators to test changes safely and control resource usage.

Relationship to Servers and Tools

Background jobs occupy a distinct layer in Mononoke's architecture:

Servers handle synchronous client requests and must respond quickly
Jobs handle asynchronous operations and can take longer to process batches
Tools are invoked on-demand by operators for administrative tasks

Jobs use the same library code as servers and tools. They access repositories through the same facet-based abstractions and use the same storage backends. The difference is in their execution model: jobs are long-running processes that continuously look for work, while servers are stateless request handlers and tools are one-shot commands.

This separation allows Mononoke to optimize each layer independently. Servers minimize latency by avoiding expensive operations. Jobs maximize throughput by processing batches. Tools provide flexibility for operators to perform manual interventions when needed.