eden/mononoke/docs/6.2-walker-and-validation.md
This document explains the walker tool, which traverses Mononoke's data graph to validate integrity, verify storage durability, and perform analysis operations.
The walker (jobs/walker/) is a graph traversal tool that represents Mononoke's data as a directed graph of nodes connected by edges. It traverses this graph to validate data consistency, verify storage durability across multiple blobstore backends, extract repository data for analysis, and measure compression characteristics.
The walker operates as both a background job for continuous validation and a command-line tool for on-demand operations. It can process entire repositories or work on sampled subsets to handle large-scale repositories within memory constraints.
The walker defines a graph schema that models Mononoke's data structures as nodes with relationships represented by edges. This abstraction spans both blobstore-stored data (Bonsai changesets, file contents, manifests) and SQL-stored data (VCS mappings, bookmark data).
Node Types: Nodes correspond to the data structures described above, including Bonsai changesets, file contents, manifests, bookmarks, and VCS mappings.
Edge Types: Edges represent relationships between nodes. For example, a changeset links to its parent changesets and to the file contents it references, and Mercurial filenodes link back to the changesets that introduced them.
The graph is discovered dynamically during traversal. The walker expands each node by loading its data and following outgoing edges to discover connected nodes. This dynamic approach handles Mononoke's large data sets without requiring the entire graph to be loaded into memory.
Cycles: The graph contains cycles due to backlinks from files to changesets (for example, Mercurial filenodes link back to changesets). Visit tracking prevents infinite loops during traversal.
Mutability: Most data represented in the graph is immutable. Exceptions include bookmarks (which are resolved once at the start of a walk) and certain mappings that may be incomplete during derivation. The walker handles these cases by allowing revisits until data reaches a terminal state.
Traversal Depth: A walk is "deep" if it covers the entire repository history by following parent relationships. A walk is "shallow" if it only examines data reachable from specific commits without traversing history.
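These properties can be illustrated with a minimal sketch, using hypothetical node types and an in-memory edge map (the real walker discovers edges by loading each node's data rather than consulting a prebuilt map):

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Hypothetical, simplified node identities; the real walker's Node enum
// covers many more types (manifests, file contents, mappings, ...).
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
enum Node {
    Bookmark(&'static str),
    Changeset(u32),
    HgFilenode(u32),
}

/// Expand a node into its outgoing edges.
fn outgoing(node: &Node, edges: &HashMap<Node, Vec<Node>>) -> Vec<Node> {
    edges.get(node).cloned().unwrap_or_default()
}

/// Breadth-first walk with visit tracking, so cycles (like filenode
/// backlinks to changesets) cannot cause infinite loops.
fn walk(root: Node, edges: &HashMap<Node, Vec<Node>>) -> Vec<Node> {
    let mut seen: HashSet<Node> = HashSet::new();
    let mut queue = VecDeque::from([root.clone()]);
    let mut order = Vec::new();
    seen.insert(root);
    while let Some(node) = queue.pop_front() {
        for child in outgoing(&node, edges) {
            if seen.insert(child.clone()) {
                queue.push_back(child);
            }
        }
        order.push(node);
    }
    order
}

fn main() {
    let edges = HashMap::from([
        (Node::Bookmark("main"), vec![Node::Changeset(2)]),
        (Node::Changeset(2), vec![Node::Changeset(1), Node::HgFilenode(7)]),
        (Node::HgFilenode(7), vec![Node::Changeset(2)]), // backlink cycle
    ]);
    let visited = walk(Node::Bookmark("main"), &edges);
    println!("visited {} nodes", visited.len());
}
```

The bookmark is resolved once at the root, and the backlink from `HgFilenode(7)` to `Changeset(2)` is ignored because the changeset is already in the visited set.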
The walker provides four subcommands, each serving a distinct operational purpose.
The scrub subcommand verifies storage durability in multiplexed blobstore configurations. When Mononoke writes to multiple blobstore backends for redundancy, scrub checks that each component blobstore contains data for every key.
Scrub integrates with the underlying ScrubBlobstore, which provides callbacks when issues are detected. The blobstore healer job (jobs/blobstore_healer/) performs similar healing operations continuously, processing the multiplexed blobstore write-ahead log. Scrub serves as a verification and on-demand repair tool.
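The core scrub check can be sketched with in-memory maps standing in for component blobstores (the real ScrubBlobstore works through Mononoke's blobstore traits and repair callbacks; the names below are illustrative only):

```rust
use std::collections::HashMap;

// Hypothetical in-memory stores standing in for the component
// blobstores of a multiplex configuration.
type Blobstore = HashMap<String, Vec<u8>>;

/// Scrub one key: every component must hold the blob. Missing copies
/// are repaired from any component that still has the data, in the
/// spirit of ScrubBlobstore's repair callback.
fn scrub_key(key: &str, components: &mut [Blobstore]) -> Result<usize, String> {
    let value = components
        .iter()
        .find_map(|c| c.get(key).cloned())
        .ok_or_else(|| format!("{key}: missing from all components"))?;
    let mut repaired = 0;
    for component in components.iter_mut() {
        if !component.contains_key(key) {
            component.insert(key.to_string(), value.clone());
            repaired += 1;
        }
    }
    Ok(repaired)
}

fn main() {
    let mut components = vec![
        Blobstore::from([("repo0.content.abc".to_string(), b"data".to_vec())]),
        Blobstore::new(), // this backend lost the blob
    ];
    let repaired = scrub_key("repo0.content.abc", &mut components).unwrap();
    println!("repaired {repaired} copies");
}
```

A key missing from every component is unrecoverable and reported as an error rather than repaired, which is the case scrub exists to surface.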
The validate subcommand checks data validity by traversing the graph and verifying references. This detects data corruption, missing data, or inconsistencies that might indicate bugs in write or derivation logic.
The validate operation logs issues to Scuba with route information, allowing operators to trace back from problematic nodes to their sources in the graph.
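One representative check, confirming that a blob's content matches the identity embedded in its key, can be sketched as follows (hypothetical helper names; std's DefaultHasher stands in for the real content hashing, and the route is just a list of node labels):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy content hash; the real system uses Blake2-based content ids.
fn content_id(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

/// Verify a blob against its expected identity; on mismatch, report
/// the route walked to reach it, as validate logs to Scuba.
fn validate_blob(expected: u64, data: &[u8], route: &[&str]) -> Result<(), String> {
    if content_id(data) == expected {
        Ok(())
    } else {
        Err(format!("hash mismatch; route: {}", route.join(" -> ")))
    }
}

fn main() {
    let data = b"file contents";
    let good = content_id(data);
    assert!(validate_blob(good, data, &[]).is_ok());
    let err = validate_blob(good, b"corrupted", &["Bookmark(main)", "Changeset(2)"]);
    println!("{}", err.unwrap_err());
}
```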
The corpus subcommand extracts a subset of repository data and writes it to disk in a structured format. This supports offline analysis, compression experiments, and investigation of repository characteristics.
Directory Layout:
The corpus is written to a directory structure that separates node types and preserves repository paths:
```
<output-dir>/
  <NodeType>/
    <repo-path>/
      .mononoke,/
        <hash-prefix>/
          <blob-key>
```
For example:

```
FileContent/root/src/main.rs/.mononoke,/49/f0/c1/repo123.content.blake2.49f0...
```
The repository path and blob key are percent-encoded. Because commas are among the encoded characters, an encoded path can never collide with the magic .mononoke, directory name, and shell operations can distinguish the blob layout from repository paths.
Node types without repository paths (such as Bonsai changesets) omit the path component:

```
BonsaiChangeset/.mononoke,/ab/cd/ef/repo123.changeset.abcdef...
```
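The layout can be sketched with a simplified encoder (hypothetical helpers; the real implementation percent-encodes more characters than shown here, and hash prefix depth may differ):

```rust
/// Percent-encode the characters that would collide with the corpus
/// layout: '%' itself and ',' (so an encoded path can never contain
/// the literal ".mononoke," separator). Simplified vs the real set.
fn encode(s: &str) -> String {
    s.chars()
        .map(|c| match c {
            '%' => "%25".to_string(),
            ',' => "%2C".to_string(),
            c => c.to_string(),
        })
        .collect()
}

/// Assemble a corpus output path for a blob that has a repo path.
fn corpus_path(node_type: &str, repo_path: &str, key: &str) -> String {
    // Take the trailing hex portion of the key for the prefix dirs.
    let hex = &key[key.rfind('.').map_or(0, |i| i + 1)..];
    let (p0, p1, p2) = (&hex[0..2], &hex[2..4], &hex[4..6]);
    format!(
        "{node_type}/{}/.mononoke,/{p0}/{p1}/{p2}/{}",
        encode(repo_path),
        encode(key)
    )
}

fn main() {
    let path = corpus_path(
        "FileContent",
        "root/src/main.rs",
        "repo123.content.blake2.49f0c1aa",
    );
    println!("{path}");
}
```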
Sampling:
The corpus subcommand supports sampling by node hash or repository path pattern. This allows extracting representative subsets of large repositories for analysis. Sampling can be based on hash ranges (--sample-rate and --sample-offset) or path regex patterns.
The compression-benefit subcommand measures potential storage savings from applying compression to individual blobs. It computes the size of blobs before and after zstd compression to evaluate compression ratios.
This analysis helps evaluate storage optimization strategies and understand the compression characteristics of repository data. The results inform decisions about packblob configuration and compression settings.
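The per-blob measurement amounts to comparing sizes before and after compression. A dependency-free sketch, with a trivial run-length encoder standing in for the zstd compression the real subcommand uses:

```rust
/// Trivial run-length encoder; a stand-in for zstd so that this
/// example needs no external crates.
fn rle(data: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut iter = data.iter().peekable();
    while let Some(&b) = iter.next() {
        let mut run: u8 = 1;
        while run < u8::MAX && iter.peek() == Some(&&b) {
            iter.next();
            run += 1;
        }
        out.push(run);
        out.push(b);
    }
    out
}

/// Compression benefit for one blob: bytes saved and the size ratio.
fn benefit(blob: &[u8]) -> (i64, f64) {
    let compressed = rle(blob).len();
    let saved = blob.len() as i64 - compressed as i64;
    (saved, compressed as f64 / blob.len() as f64)
}

fn main() {
    let blob = vec![0u8; 1000]; // highly compressible sample
    let (saved, ratio) = benefit(&blob);
    println!("saved {saved} bytes, ratio {ratio:.3}");
}
```

Aggregating these per-blob numbers over a sampled walk yields the repository-level compression estimate.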
The walker uses bounded traversal to process the graph efficiently. The bounded_traversal_stream function dynamically unfolds nodes in parallel, expanding edges and queueing child nodes for processing.
To prevent revisiting nodes (which would cause infinite loops in cyclic graphs), the walker maintains concurrent hash maps tracking visited nodes. The WalkStateCHashmap stores node identities and prevents re-expansion of nodes already seen during the current walk.
Visit tracking has memory implications. For large repositories, the set of visited nodes can consume significant memory. The walker is designed to keep node representations small (minimal identity information only) to reduce memory overhead.
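Concurrent visit tracking can be sketched with a mutex-guarded set shared across worker threads (the real WalkStateCHashmap uses a concurrent hash map rather than a single lock, and stores only minimal identity information):

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};
use std::thread;

/// Workers race over overlapping ranges of node ids; only the first
/// to record a node "expands" it, so each node is expanded once.
fn expand_all() -> usize {
    let seen = Arc::new(Mutex::new(HashSet::<u64>::new()));
    let mut handles = Vec::new();
    for worker in 0..4u64 {
        let seen = Arc::clone(&seen);
        handles.push(thread::spawn(move || {
            let mut expanded = 0usize;
            // Each worker's range overlaps its neighbor's by half.
            for node in worker * 500..worker * 500 + 1000 {
                if seen.lock().unwrap().insert(node) {
                    expanded += 1; // first visitor wins
                }
            }
            expanded
        }));
    }
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    // 4000 visit attempts collapse to 2500 unique expansions.
    println!("expanded {} unique nodes", expand_all());
}
```

The memory cost of this set is exactly why the walker keeps node identities small: the set must hold every visited node for the duration of the walk.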
The WalkVisitor trait defines callbacks invoked during graph traversal:
start_node(): Called before loading a node's data. This allows setup operations like configuring sampling for blobstore accesses.
visit(): Called after a node is loaded and expanded. This receives the node, its data, and its outgoing edges. Visitors can inspect the loaded data, record per-node results, and influence which outgoing edges the walk follows.
Different subcommands implement different visitors. For example, ScrubVisitor validates storage durability, ValidatingVisitor checks data consistency, and CorpusVisitor extracts blobs to disk.
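A pared-down sketch of the visitor shape (hypothetical signatures; the real trait carries context, routes, and typed nodes and edges rather than strings):

```rust
// Hypothetical, simplified version of the WalkVisitor callbacks.
trait WalkVisitor {
    /// Called before a node's data is loaded.
    fn start_node(&mut self, node: &str);
    /// Called with the loaded node and its outgoing edges; returns
    /// the edges the walk should actually follow.
    fn visit(&mut self, node: &str, children: Vec<String>) -> Vec<String>;
}

/// A toy visitor that counts callbacks and prunes edges to
/// Hg-derived nodes, the kind of filtering type-inclusion flags drive.
struct CountingVisitor {
    started: usize,
    visited: usize,
}

impl WalkVisitor for CountingVisitor {
    fn start_node(&mut self, _node: &str) {
        self.started += 1; // e.g. arm blobstore sampling here
    }
    fn visit(&mut self, _node: &str, children: Vec<String>) -> Vec<String> {
        self.visited += 1;
        children.into_iter().filter(|c| !c.starts_with("Hg")).collect()
    }
}

fn main() {
    let mut v = CountingVisitor { started: 0, visited: 0 };
    v.start_node("BonsaiChangeset");
    let next = v.visit(
        "BonsaiChangeset",
        vec!["FileContent".into(), "HgChangeset".into()],
    );
    println!("following {} edges", next.len());
}
```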
For large repositories, processing the entire graph may be impractical due to memory constraints or time limits. The walker supports sampling operations on subsets of data.
Sampling Strategies:
Node hash sampling: Select nodes based on their hash values. This provides stable sampling where the same nodes are selected across runs.
Repository path sampling: Select nodes based on repository paths using hash or regex patterns. Path sampling may be non-deterministic when multiple paths reference the same content.
Sampling Implementation:
The SamplingWalkVisitor combines high-level node sampling with low-level blobstore sampling. It sets a SamplingKey in the CoreContext, which the SamplingBlobstore uses to decide whether to sample individual blobstore operations.
The --sample-rate and --sample-offset flags divide the repository into slices. By incrementing the offset, the entire repository can be processed one slice at a time. This allows large-scale operations to be distributed across multiple walker invocations or run incrementally.
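Slicing can be sketched as hashing each node identity modulo the sample rate (hypothetical helper; DefaultHasher stands in for whatever hash the real flags apply, but the slicing arithmetic is the same idea):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Assign a node to a slice: with --sample-rate=N, running with
/// --sample-offset=k selects the nodes whose hash % N == k.
fn slice_of(node: &str, sample_rate: u64) -> u64 {
    let mut h = DefaultHasher::new();
    node.hash(&mut h);
    h.finish() % sample_rate
}

fn main() {
    let nodes: Vec<String> = (0..1000).map(|i| format!("changeset{i}")).collect();
    let rate = 10u64;
    // Walking offsets 0..rate covers every node exactly once.
    let covered: usize = (0..rate)
        .map(|offset| {
            nodes
                .iter()
                .filter(|n| slice_of(n.as_str(), rate) == offset)
                .count()
        })
        .sum();
    println!("covered {covered} of {} nodes", nodes.len());
}
```

Because the slice assignment depends only on the node's hash, the same nodes fall into the same slice on every run, which is what makes incremental, slice-at-a-time processing sound.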
Visit tracking state can be maintained across slices to prevent re-visits when processing a repository in batches.
Memory usage is a primary design constraint for the walker. Two factors drive memory consumption:
Visit Tracking: To prevent cycles, the walker must track all visited nodes. For repositories with hundreds of millions of nodes, this requires efficient concurrent hash maps and minimal node representations.
Pending Queue: As the graph expands, the queue of nodes waiting to be processed can grow to millions of entries. The size of the Node enum representation directly affects queue memory usage. Paths associated with nodes (for corpus extraction and routing) also contribute to memory overhead.
Optimization Strategies:
The walker employs several strategies to manage memory:
- Keeping node representations minimal (identity information only, without full data)
- Separating Node (identity) from NodeData (loaded content)

Further optimizations under consideration include interning paths and nodes to reduce duplication, returning iterators instead of collected vectors, and extending bounded traversal to adjust queue sizes based on node type characteristics.
The walker provides operational visibility through metrics and logging.
ODS Metrics: The walker publishes counters and timeseries metrics.
These metrics power dashboards and alerting for continuous validation operations.
Scuba Logging: The walker logs detailed information to Scuba tables.
Scuba logs support debugging and historical analysis of validation runs.
Progress Logging: When running locally or in integration tests, the walker provides progress output via glog, showing traversal statistics and real-time status.
The walker serves multiple operational purposes:
Continuous Validation: Run as a background job to continuously validate repository integrity. This detects data corruption, missing data, or derivation bugs.
Storage Verification: Use scrub mode to verify multiplexed blobstore consistency. Schedule regular scrub runs to detect and repair storage inconsistencies before they impact availability.
Data Analysis: Extract corpus subsets for offline analysis. Experiment with compression techniques, analyze repository structure, or test storage optimizations using real repository data.
Incident Response: Run on-demand validation or scrubbing during incidents to assess impact and verify repairs. The walker can validate specific changesets or time ranges rather than entire repositories.
Capacity Planning: Use compression-benefit analysis to understand storage characteristics and plan capacity. Measure the effectiveness of compression strategies before deploying them.
The walker integrates with Mononoke's storage architecture at multiple levels:
Blobstore Integration: The walker accesses blobs through the standard blobstore interface. This means it benefits from cacheblob caching, respects redaction policies via redactedblobstore, and can sample operations via samplingblob.
Scrub Integration: The scrub subcommand uses ScrubBlobstore, which provides access to individual component blobstores in a multiplex configuration. This allows verification and repair at the component level.
Healer Coordination: While the blobstore healer job continuously processes the write-ahead log, the walker's scrub mode can detect issues that don't appear in the WAL (such as corruption or data loss in a single backend). The two systems are complementary.
Metadata Database: The walker reads VCS mappings, bookmark data, and other metadata from SQL tables. This allows it to validate consistency between blobstore data and database state.
The walker runs using the standard cmdlib/mononoke_app/ framework. It accepts configuration for:
Repository Selection: Specify which repositories to process. Sharded execution can distribute repositories across multiple walker instances.
The walker also accepts traversal parameters, per-subcommand options, and execution-mode settings.
The walker interacts with several other Mononoke components:
Blobstore Healer (jobs/blobstore_healer/) - Continuously heals storage inconsistencies from the multiplexed blobstore WAL. The walker's scrub complements this by detecting issues outside the WAL.
Admin Tool (tools/admin/) - Provides commands for on-demand validation and data inspection. Some admin subcommands use walker-like traversal internally.
Bounded Traversal (common/bounded_traversal/) - The core traversal primitive used by the walker. This library provides efficient parallel tree/graph traversal with bounded concurrency.
Derived Data System (derived_data/) - The walker validates derived data consistency. It can verify that derived data is correctly computed from Bonsai changesets.
Detailed implementation documentation is available in the walker source:
jobs/walker/src/README.md - Comprehensive design documentation covering graph representation, memory management, sampling, and implementation details
jobs/walker/src/ - Source code with inline documentation
tests/integration/ - Integration tests demonstrating walker usage patterns

The walker is a graph traversal tool that validates Mononoke's data integrity, verifies storage durability, and supports data analysis operations. It models Mononoke's data structures as a dynamically-discovered graph and uses bounded traversal to process the graph efficiently.
The four subcommands—scrub, validate, corpus, and compression-benefit—serve distinct operational purposes from storage verification to data analysis. Sampling support allows the walker to handle large-scale repositories by processing subsets of data, making it applicable to repositories of any size.
The walker operates both as a continuous validation job and as an on-demand tool for incident response and analysis. Its integration with Mononoke's storage architecture and monitoring systems makes it an integral component of operational data integrity verification.
For storage architecture details, see Storage Architecture. For background job context, see Jobs and Background Workers.