Cross-Repository Sync

eden/mononoke/docs/4.2-cross-repo-sync.md

This document explains Mononoke's cross-repository synchronization system, the framework that automatically synchronizes commits between repositories.

What is Cross-Repo Sync?

Cross-repo sync is a system that maintains bidirectional synchronization between repositories. It automatically replicates commits from one repository to another, transforming file paths and commit metadata as needed to match each repository's structure.

The most common use case is synchronizing between a large repository (monorepo) and smaller project-specific repositories. When a developer pushes to a small repository, those changes are automatically synced to the corresponding location in the large repository. Conversely, changes in the large repository can be synced back to the small repositories.

Cross-repo sync handles:

  • Path transformation (files in different locations between repos)
  • Bookmark mapping (branch names may differ)
  • Commit metadata preservation
  • History linearization through pushrebase
  • Merge commit handling (with limitations)

Sync Directions

Cross-repo sync operates in two directions:

Forward Sync (Small → Large)

  • Commits from small repositories are synced into the large repository
  • Each small repo's files are placed in a designated prefix within the large repo
  • This is the primary direction when small repos are the source of truth
  • Implemented by the x-repo sync job

Backward Sync (Large → Small)

  • Commits from the large repository are synced to small repositories
  • Only changes affecting the small repo's prefix are synced
  • Maintains the ability to develop in either repository
  • Implemented by the backsyncer

The relationship between repositories is configured rather than inferred from structure. A repository can participate in multiple sync relationships.

Commit Transformation

When syncing commits between repositories, the sync system transforms them to match the target repository's structure.

Path Transformation

Files are moved according to configured path mappings. Each small repository has a map defining how its paths are transformed.

Example transformation:

  • Small repo path: src/foo.rs
  • Large repo path: projects/myproject/src/foo.rs

The mapping is defined in the commit sync configuration using prefix replacements. More complex transformations can be expressed through multiple mapping entries or by using the default action configuration.
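Prefix replacement can be illustrated with a minimal sketch. It assumes a longest-prefix-wins rule and a default action that either prepends a prefix or drops the path. The function name `map_small_to_large` and its parameters are hypothetical and are not taken from Mononoke's API.

```python
from typing import Dict, Optional


def map_small_to_large(
    path: str,
    prefix_map: Dict[str, str],
    default_prefix: Optional[str],
) -> Optional[str]:
    """Map a small-repo path to a large-repo path, or None if it is not synced.

    Assumption: the longest matching prefix wins, and unmatched paths fall
    back to the default action (prepend `default_prefix`, or drop if None).
    """
    # Try explicit prefix replacements, most specific first.
    for src in sorted(prefix_map, key=len, reverse=True):
        if path == src or path.startswith(src.rstrip("/") + "/"):
            rest = path[len(src):].lstrip("/")
            dst = prefix_map[src].rstrip("/")
            return f"{dst}/{rest}" if rest else dst
    # No explicit entry matched: apply the (hypothetical) default action.
    if default_prefix is None:
        return None
    return f"{default_prefix.rstrip('/')}/{path}"
```

With an empty prefix map and a default prefix of `projects/myproject`, `src/foo.rs` maps to `projects/myproject/src/foo.rs`, matching the example transformation above.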

Commit Rewriting

Commits are rewritten during sync using the commit transformation library (features/commit_transformation/). The rewriting process:

  1. Loads the source commit
  2. Applies path transformations to file changes
  3. Remaps parent commits (already synced)
  4. Preserves author, timestamp, and commit message
  5. Stores the rewritten commit in the target repository

Commits that would be empty after path transformation (no files in the mapped paths) are handled specially. These may be recorded as "not a sync candidate" or mapped to an equivalent working copy ancestor.
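The rewriting steps can be sketched roughly as follows. The `Commit` type, the `mover` callback, and `rewrite_commit` are illustrative stand-ins rather than the types used in features/commit_transformation/. Returning `None` for an empty result is an assumption that leaves the choice of outcome to the caller.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Commit:
    author: str
    timestamp: int
    message: str
    parents: List[str]                        # parent commit IDs
    file_changes: Dict[str, Optional[bytes]]  # path -> content, None = deletion


def rewrite_commit(
    commit: Commit,
    mover: Callable[[str], Optional[str]],
    parent_map: Dict[str, str],
) -> Optional[Commit]:
    """Rewrite `commit` for the target repo, or return None if it becomes empty."""
    # Remap parents; a KeyError here means a parent has not been synced yet.
    new_parents = [parent_map[p] for p in commit.parents]
    # Apply the path transformation, dropping changes outside mapped paths.
    new_changes: Dict[str, Optional[bytes]] = {}
    for path, content in commit.file_changes.items():
        target = mover(path)
        if target is not None:
            new_changes[target] = content
    if not new_changes:
        return None
    # Author, timestamp, and message are carried over unchanged.
    return replace(commit, parents=new_parents, file_changes=new_changes)
```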

Special Cases

Merge Commits - Supported with limitations. Both parents must already be synced. Some complex merge scenarios are not supported and will cause sync to fail.

Copy/Move Information - File copy and move metadata is preserved when both source and destination paths map to the target repository.

Git Submodules - When configured, submodule expansion can be performed during sync, converting submodule pointers into expanded directory contents.

Sync Process

Cross-repo sync operates by tailing the bookmark update log of the source repository.

Forward Sync Process

The x-repo sync job (features/commit_rewriting/mononoke_x_repo_sync_job/) performs forward sync:

  1. Tail Bookmark Updates - Monitor bookmark update log for changes
  2. Find Unsynced Ancestors - Identify commits not yet synced to target repo
  3. Sync in Order - Process commits topologically (parents before children)
  4. Transform Commits - Rewrite each commit for the target repository
  5. Pushrebase - Land synced commits onto target bookmarks via pushrebase
  6. Record Mapping - Store source-to-target commit mapping
  7. Update Counter - Record progress for resumption

Common pushrebase bookmarks (configured as common_pushrebase_bookmarks) receive special handling. These are bookmarks where pushrebase is used to maintain linear history in both repositories.
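The numbered forward sync steps can be sketched as a small loop. This is a simplified model, not the sync job's code: the commit graph is a plain dictionary, rewriting and pushrebase are reduced to a comment, and both function names are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def unsynced_ancestors(
    head: str, parents: Dict[str, List[str]], synced: Set[str]
) -> List[str]:
    """Return unsynced ancestors of `head` (inclusive), parents before children."""
    order: List[str] = []
    seen: Set[str] = set()
    stack: List[Tuple[str, bool]] = [(head, False)]
    while stack:
        node, expanded = stack.pop()
        if expanded:
            # All of this node's parents have been emitted already.
            order.append(node)
            continue
        if node in seen or node in synced:
            continue
        seen.add(node)
        stack.append((node, True))
        for parent in parents.get(node, []):
            stack.append((parent, False))
    return order


def run_forward_sync(
    log_entries: Iterable[Tuple[int, str]],
    parents: Dict[str, List[str]],
    synced: Set[str],
    counter: int,
) -> int:
    """Process bookmark log entries newer than `counter`; return the new counter."""
    for entry_id, head in log_entries:
        if entry_id <= counter:
            continue  # already processed before a restart
        for commit in unsynced_ancestors(head, parents, synced):
            # Rewriting, pushrebase, and mapping updates would happen here.
            synced.add(commit)
        counter = entry_id
    return counter
```

Passing the returned counter back in on the next run skips entries that were already processed, which is the resumption behaviour described in step 7.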

Backward Sync Process

The backsyncer (features/commit_rewriting/backsyncer/) performs backward sync:

  1. Tail Large Repo - Monitor bookmark updates in the large repository
  2. Filter Changes - Select commits affecting the small repo's mapped paths
  3. Transform Commits - Rewrite commits removing unrelated paths
  4. Sync to Small Repo - Push transformed commits to small repository
  5. Record Mapping - Store bidirectional mapping

Backsyncer can run continuously or in catch-up mode to process historical commits.
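Steps 2 and 3 of the backward sync can be illustrated with a short sketch. It assumes the small repo occupies a single prefix in the large repo. `backsync_changes` is a hypothetical helper, not a backsyncer function.

```python
from typing import Dict, Optional


def backsync_changes(
    file_changes: Dict[str, Optional[bytes]], large_prefix: str
) -> Dict[str, Optional[bytes]]:
    """Keep only changes under `large_prefix` and strip the prefix from their paths."""
    prefix = large_prefix.rstrip("/") + "/"
    return {
        path[len(prefix):]: content
        for path, content in file_changes.items()
        if path.startswith(prefix)
    }
```

An empty result means the large-repo commit does not touch the small repo's paths and would not be rewritten into a new small-repo commit.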

Synced Commit Mapping

All synced commits are recorded in the synced commit mapping database (features/commit_rewriting/synced_commit_mapping/). This mapping stores:

  • Source repository and changeset ID
  • Target repository and changeset ID
  • Sync config version used
  • Sync outcome (rewritten, equivalent working copy, not a sync candidate)

The mapping is bidirectional, allowing queries in either direction. It is consulted before syncing to avoid duplicate work and to remap parent commits correctly.
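A mapping table with lookups in both directions might look like the sketch below. The schema, column names, and helper functions are assumptions made for illustration and do not reflect the actual database layout in synced_commit_mapping/.

```python
import sqlite3
from typing import Optional


def open_mapping() -> sqlite3.Connection:
    """Create an in-memory mapping table (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE synced_commits (
               source_repo INTEGER NOT NULL,
               source_cs   TEXT    NOT NULL,
               target_repo INTEGER NOT NULL,
               target_cs   TEXT    NOT NULL,
               version     TEXT    NOT NULL,
               outcome     TEXT    NOT NULL,
               PRIMARY KEY (source_repo, source_cs, target_repo)
           )"""
    )
    return db


def record(db, source_repo, source_cs, target_repo, target_cs, version, outcome):
    """Store one source-to-target entry."""
    db.execute(
        "INSERT INTO synced_commits VALUES (?, ?, ?, ?, ?, ?)",
        (source_repo, source_cs, target_repo, target_cs, version, outcome),
    )


def lookup(db, repo: int, cs: str, other_repo: int) -> Optional[str]:
    """Find the counterpart of `cs` in `other_repo`, whichever side it was recorded on."""
    row = db.execute(
        "SELECT target_cs FROM synced_commits "
        "WHERE source_repo = ? AND source_cs = ? AND target_repo = ?",
        (repo, cs, other_repo),
    ).fetchone()
    if row is not None:
        return row[0]
    row = db.execute(
        "SELECT source_cs FROM synced_commits "
        "WHERE target_repo = ? AND target_cs = ? AND source_repo = ?",
        (repo, cs, other_repo),
    ).fetchone()
    return row[0] if row is not None else None
```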

Commit Sync Outcomes

When a commit is synced, the result is recorded as a commit sync outcome:

RewrittenAs - The commit was transformed and created as a new commit in the target repository. This is the most common outcome.

EquivalentWorkingCopyAncestor - The commit would be empty after transformation (no files in mapped paths), so it maps to an ancestor commit with the same working copy state.

NotSyncCandidate - The commit should not be synced to the target repository. This is recorded when all file changes are outside mapped paths.

Multiple source commits may map to the same target commit when they only affect unmapped paths. The plural commit sync outcome type represents this many-to-one relationship.
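The three outcomes can be modelled as small record types. The classification rule below is only one plausible reading of the descriptions above. It assumes that a commit with no mapped changes inherits the target counterpart of an already-synced parent when one exists, and is otherwise not a sync candidate. The `classify` function is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class RewrittenAs:
    target: str
    version: str


@dataclass(frozen=True)
class EquivalentWorkingCopyAncestor:
    target: str
    version: str


@dataclass(frozen=True)
class NotSyncCandidate:
    version: str


Outcome = Union[RewrittenAs, EquivalentWorkingCopyAncestor, NotSyncCandidate]


def classify(
    rewritten_target: Optional[str],
    parent_outcome: Optional[Outcome],
    version: str,
) -> Outcome:
    """Pick an outcome for a commit (assumed rule, see the lead-in)."""
    if rewritten_target is not None:
        # The commit produced a new commit in the target repository.
        return RewrittenAs(rewritten_target, version)
    if isinstance(parent_outcome, (RewrittenAs, EquivalentWorkingCopyAncestor)):
        # Nothing to rewrite: reuse the parent's counterpart in the target repo.
        return EquivalentWorkingCopyAncestor(parent_outcome.target, version)
    return NotSyncCandidate(version)
```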

Configuration

Cross-repo sync is configured through the CommitSyncConfig structure in repository metadata.

Commit Sync Config Structure

Each sync relationship defines:

Large Repository - The repository ID of the large (monorepo) repository

Small Repositories - Map of small repository IDs to their configurations, each containing:

  • Default Action - What to do with paths not explicitly mapped
  • Path Map - Prefix replacements (small repo path → large repo path)
  • Submodule Config - Git submodule handling if applicable

Common Pushrebase Bookmarks - Bookmarks where changes from small repos are pushed via pushrebase

Version Name - Identifies this configuration version
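As a rough picture of the shape described above, the structure could be modelled like this. The field names follow the list in this section but are illustrative. The `validate` helper and the string-valued `default_action` are assumptions, not part of Mononoke's configuration schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SmallRepoConfig:
    default_action: str                                       # what to do with unmapped paths
    path_map: Dict[str, str] = field(default_factory=dict)    # small prefix -> large prefix
    submodule_config: Optional[dict] = None                   # only if submodules are handled


@dataclass
class CommitSyncConfig:
    large_repo_id: int
    small_repos: Dict[int, SmallRepoConfig]
    common_pushrebase_bookmarks: List[str]
    version_name: str


def validate(config: CommitSyncConfig) -> List[str]:
    """Return a list of problems found in `config` (hypothetical sanity checks)."""
    problems: List[str] = []
    if config.large_repo_id in config.small_repos:
        problems.append("large repo is also listed as a small repo")
    if not config.version_name:
        problems.append("version name is empty")
    return problems
```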

Configuration Versioning

Sync configurations are versioned using CommitSyncConfigVersion. This allows the sync rules to evolve over time:

  • New path mappings can be added in a new version
  • Different commits may be synced using different config versions
  • The version is recorded in the synced commit mapping
  • Live config can change while preserving historical mappings

Configuration is managed through live_commit_sync_config (features/commit_rewriting/live_commit_sync_config/), which provides access to both current and historical configuration versions.
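The idea of keeping historical versions available next to the current one can be sketched with a small in-memory store. `VersionedSyncConfigs` and its methods are hypothetical and do not correspond to the live_commit_sync_config interface.

```python
from typing import Dict


class VersionedSyncConfigs:
    """Hypothetical store of path maps keyed by config version name."""

    def __init__(self) -> None:
        self._versions: Dict[str, Dict[str, str]] = {}
        self._current: str = ""

    def add_version(self, name: str, path_map: Dict[str, str]) -> None:
        """Register a new version and make it the current one."""
        if name in self._versions:
            # Existing versions are never modified, so old mappings stay valid.
            raise ValueError(f"version {name!r} already exists")
        self._versions[name] = dict(path_map)
        self._current = name

    def current_version(self) -> str:
        return self._current

    def path_map_for(self, version: str) -> Dict[str, str]:
        """Look up the path map that was in effect for a recorded version."""
        return self._versions[version]
```

A commit recorded in the mapping with version `v1` can still be interpreted with the `v1` path map after `v2` becomes current.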

Bookmark Mapping

Small repository bookmarks can be mapped to large repository bookmarks with a configured prefix. For example:

  • Small repo bookmark: main
  • Large repo bookmark: projectname/main

The bookmark prefix is configured per small repository in the permanent configuration.
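Bookmark renaming with a prefix is simple enough to show in a few lines. Both helper names are hypothetical. Returning `None` for a large-repo bookmark that lacks the prefix is an assumption about how unrelated bookmarks would be treated.

```python
from typing import Optional


def small_to_large_bookmark(name: str, prefix: str) -> str:
    """Prepend the small repo's configured bookmark prefix."""
    return prefix + name


def large_to_small_bookmark(name: str, prefix: str) -> Optional[str]:
    """Strip the prefix, or return None if the bookmark belongs elsewhere."""
    if name.startswith(prefix) and len(name) > len(prefix):
        return name[len(prefix):]
    return None
```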

Repository Integration

Cross-repo sync is integrated into repositories through the repo_cross_repo facet (repo_attributes/repo_cross_repo/). This facet provides:

  • Synced Commit Mapping - Database of synced commits
  • Live Commit Sync Config - Current and historical sync configurations
  • Sync Lease - Coordination to prevent concurrent sync attempts
  • Submodule Dependencies - Repository IDs of submodule dependencies

The facet is used by both sync jobs and by other repository operations that need to query or update sync state.

Use Cases

Cross-repo sync serves several use cases:

Monorepo and Project Repos - A large monorepo contains many projects. Each project can have a dedicated small repository. Developers can commit to either repository, and changes are automatically synchronized.

Code Sharing - Common code can be shared between repositories. Changes in the shared code are propagated to all repositories that include it.

Migration - Repositories can be gradually migrated into or out of a monorepo while maintaining dual operation during the transition period.

Megarepo Operations - Initial import of repositories into a megarepo uses cross-repo sync mechanisms to establish the initial mappings and history. See megarepo_api/ for megarepo-specific operations.

Directory Isolation - A large repository can be split into smaller repositories by directory, allowing teams to work in isolated repositories while maintaining the option to work in the full repository.

Jobs and Tools

X-Repo Sync Job

The primary sync job (features/commit_rewriting/mononoke_x_repo_sync_job/) runs continuously to sync commits from small repositories to the large repository. It can be run in sharded mode across multiple processes for scalability.

Operation:

  • Tails bookmark update log
  • Processes commits in topological order
  • Uses pushrebase for common bookmarks
  • Records progress in mutable counters
  • Resumes from last processed position on restart

Backsyncer

The backsyncer (features/commit_rewriting/backsyncer/) syncs commits from large repository to small repositories. It can run continuously or in catch-up mode.

Modes:

  • Continuous tailing of large repo bookmarks
  • Catch-up mode to process historical commits
  • Validation mode to verify sync correctness

Admin Tools

The admin CLI (tools/admin/) provides commands for cross-repo sync operations:

  • Manually triggering sync of specific commits
  • Querying synced commit mappings
  • Validating working copy equivalence
  • Inspecting sync configuration

Limitations and Constraints

The cross-repo sync system has several limitations:

Merge Commits - Some merge commit scenarios are not supported. A merge fails to sync if any of its parents has not yet been synced.

Root Commits - Root commits (commits with no parents) and their descendants may not sync unless merged into a main line of development.

Path Conflicts - If path transformations would create conflicts (multiple source paths mapping to the same target path), sync will fail.

Bookmark Filters - Not all bookmarks are synced. Only explicitly configured bookmarks participate in sync.

Sequential Processing - Commits must be synced in topological order, which can limit parallelism.

These limitations are documented in the sync job source code and are checked during sync operations.
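The path-conflict limitation can be illustrated with a small check that groups source paths by their transformed target. `find_path_conflicts` and the `mover` callback are hypothetical. This is not the check the sync job performs.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional


def find_path_conflicts(
    paths: Iterable[str], mover: Callable[[str], Optional[str]]
) -> Dict[str, List[str]]:
    """Return target paths that more than one source path maps to."""
    by_target: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        target = mover(path)
        if target is not None:  # unmapped paths cannot conflict
            by_target[target].append(path)
    return {t: srcs for t, srcs in by_target.items() if len(srcs) > 1}
```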

Performance Considerations

Cross-repo sync is designed for continuous operation with commits arriving at high rates:

Incremental Sync - Only new commits since the last sync are processed, not the entire repository history.

Batching - Multiple commits can be processed in a single sync iteration when catching up.

Leasing - Sync lease prevents multiple servers from syncing the same commits concurrently, avoiding wasted work.

Derived Data - Derived data can be computed asynchronously after sync completes, not blocking the sync operation.

Caching - Live commit sync config is cached to avoid repeated configuration lookups.

Sharding - The sync job can be sharded across multiple processes, each handling a subset of repositories.
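The leasing point above can be pictured with a toy in-process lease that expires after a time-to-live. Real deployments would need shared storage. `SyncLease` and its methods are hypothetical and unrelated to the facet's actual lease interface.

```python
from typing import Dict, Tuple


class SyncLease:
    """Toy lease table: one owner per key until the lease expires."""

    def __init__(self) -> None:
        self._held: Dict[str, Tuple[str, float]] = {}  # key -> (owner, expiry)

    def try_acquire(self, key: str, owner: str, now: float, ttl: float) -> bool:
        """Take or renew the lease; fail if another owner holds an unexpired one."""
        holder = self._held.get(key)
        if holder is not None and holder[0] != owner and holder[1] > now:
            return False
        self._held[key] = (owner, now + ttl)
        return True

    def release(self, key: str, owner: str) -> None:
        """Release the lease, but only if `owner` currently holds it."""
        holder = self._held.get(key)
        if holder is not None and holder[0] == owner:
            del self._held[key]
```

The expiry keeps a crashed holder from blocking sync forever, at the cost of possible duplicate work if a slow holder outlives its lease.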

Monitoring and Validation

Cross-repo sync includes validation mechanisms:

Working Copy Verification - The verify_working_copy function confirms that synced commits have identical working copies (after path transformation).

Bookmark Diff - The find_bookmark_diff function identifies discrepancies between bookmarks in source and target repositories.

Commit Validator - The commit validator (features/commit_rewriting/commit_validator/) checks that forward and backward sync produce consistent results.

Scuba Logging - All sync operations are logged to Scuba for monitoring and debugging.

Validation can detect sync configuration errors, sync job failures, or unexpected commit transformations.
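Working copy equivalence can be pictured as comparing the transformed source tree with the in-scope part of the target tree. The sketch below is not the `verify_working_copy` implementation. `working_copy_mismatches`, `mover`, and `in_scope` are hypothetical names, and whole working copies are modelled as plain dictionaries.

```python
from typing import Callable, Dict, List, Optional


def working_copy_mismatches(
    source_files: Dict[str, bytes],
    target_files: Dict[str, bytes],
    mover: Callable[[str], Optional[str]],
    in_scope: Callable[[str], bool],
) -> List[str]:
    """Return target paths whose content differs; an empty list means equivalent."""
    # What the target should contain, according to the path transformation.
    expected: Dict[str, bytes] = {}
    for path, content in source_files.items():
        target = mover(path)
        if target is not None:
            expected[target] = content
    # Ignore target files that belong to other parts of the repository.
    actual = {p: c for p, c in target_files.items() if in_scope(p)}
    return sorted(
        p for p in expected.keys() | actual.keys() if expected.get(p) != actual.get(p)
    )
```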

Component-specific documentation:

  • features/cross_repo_sync/ - Core sync library implementation
  • features/commit_transformation/ - Commit rewriting logic
  • features/commit_rewriting/mononoke_x_repo_sync_job/ - Forward sync job
  • features/commit_rewriting/backsyncer/ - Backward sync job
  • features/commit_rewriting/synced_commit_mapping/ - Mapping database
  • features/commit_rewriting/live_commit_sync_config/ - Configuration management
  • repo_attributes/repo_cross_repo/ - Repository facet
  • megarepo_api/ - Megarepo operations