eden/mononoke/docs/A.1-better-engineering.md
This document tracks long-term engineering improvements and technical debt that could be addressed in the Mononoke codebase. These are substantive improvements identified through codebase analysis, not immediate action items.
Note: These were mostly generated by automated analysis and may not be relevant or applicable to the current codebase. Please use your judgment when considering these items.
BonsaiChangeset::git_tree_hashLocation: mononoke_types/src/bonsai_changeset.rs:57
Problem: This was added as the first attempt of storing refs to trees. We now use git_ref_content_mapping. The validation logic enforces constraints that make this field mutually exclusive with normal changeset data and with git_annotated_tag.
Improvement: Migrate git_tree_hash to a separate mapping table (similar to bonsai_git_mapping) or create a distinct type for git tree representations rather than overloading BonsaiChangeset.
Impact: Reduces complexity in core type validation, eliminates special-case handling, improves type safety by separating concerns.
Location: repo_attributes/bonsai_*_mapping/src/lib.rs (8 different directories)
Problem: The codebase has 6+ nearly identical mapping traits (BonsaiHgMapping, BonsaiGitMapping, BonsaiGlobalrevMapping, BonsaiSvnrevMapping, BonsaiTagMapping) with duplicated conversion methods and nearly identical BonsaiOrX enum types.
Improvement: Create a generic BonsaiMapping<T> trait or macro to eliminate duplication. The common pattern could be abstracted into a single parameterized implementation with type-specific hooks only where needed.
Impact: Reduces maintenance burden significantly (changes need to be made in 6+ places currently), improves consistency across VCS backends, easier to add new mapping types.
Location: mononoke_types/src/file_change.rs:101-107
Problem: The FileChange enum has four variants (Change, Deletion, UntrackedChange, UntrackedDeletion) where the tracked/untracked distinction creates duplication. The simplify() method exists specifically to collapse this distinction, suggesting the separation may be too granular. There's a TODO about adding copy information to untracked changes.
Improvement: Restructure as a struct with change_type, tracking, and optional data fields. This would make adding copy_from to untracked changes trivial and eliminate the need for simplify().
Impact: More flexible type that's easier to extend, eliminates redundant enum variants, resolves TODO about copy info.
Location: mononoke_types/src/unode.rs:116,235
Problem: Both FileUnode::get_unode_id() and ManifestUnode::get_unode_id() have "FIXME: try avoid clone" comments. The methods clone the entire structure just to compute its ID.
Improvement: Cache the ID when the unode is created/deserialized, or implement BorrowedBlobstoreValue trait that can compute IDs from references.
Impact: Performance improvement (eliminates unnecessary allocations/copies), cleaner API, resolves long-standing FIXMEs.
Location: mononoke_types/src/path.rs and path/mpath_element.rs
Problem: Overlapping path types with subtle differences: MPath, NonRootMPath, RepoPath, MPathElement, MPathHash. The relationship creates conversion overhead and cognitive load.
Improvement: Consolidate to 2-3 core types with clearer roles using const generics or the type system for root/non-root distinction.
Impact: Reduced cognitive load, fewer conversions, clearer type relationships.
Location: Throughout features/, repo_attributes/bookmarks/bookmarks_movement/
Problem: Features define trait aliases requiring 10-20+ individual facet refs, creating verbose and error-prone code.
Improvement: Create higher-level facet groupings like RepoStorageFacets, RepoVcsFacets, RepoPermissionFacets, RepoDerivedDataFacets.
Impact: Reduces trait bound lists from 15+ items to 3-5 semantic groups, easier to understand, faster compilation, clearer grouping for new facets.
Location: All facets in repo_attributes/
Problem: Inconsistent patterns - some have only XxxRef, some only XxxArc, some have both. Features randomly mix these.
Improvement: Standardize on a single pattern (preferably only Ref variants for cleaner API) with Arc used only where required.
Impact: Eliminates confusion, reduces cognitive load, makes refactoring safer, smaller binary size.
Location: repo_attributes/repo_derived_data/src/lib.rs
Problem: RepoDerivedData violates single responsibility by containing configuration, multiple managers, selection logic, 7+ boilerplate methods, bubble context management, and scuba mutation.
Improvement: Split into multiple facets: RepoDerivedDataConfig, RepoDerivedDataManager, RepoDerivedDataRegistry. Use builder pattern for overrides.
Impact: Easier to mock in tests, clearer separation of concerns, reduces code duplication (150+ lines of boilerplate).
Location: derived_data/{fsnodes,unodes,skeleton_manifest}/mapping.rs
Problem: The same get_file_changes() function is duplicated in at least 3 derived data types with identical implementations.
Improvement: Move to a shared utility module (e.g., derived_data/src/common.rs) and reuse.
Impact: Reduces ~50 lines of duplication, ensures consistent behavior, single place to optimize.
Location: All mapping.rs files (17+ files)
Problem: Every derived data type manually implements format_key(), store_mapping(), and fetch() with nearly identical patterns.
Improvement: Create a derive_mapping! macro or trait-based abstraction that generates these methods given just the key prefix.
Impact: Eliminates ~100+ lines of boilerplate per type, prevents copy-paste errors, easier versioning/key format changes.
Location: Throughout framework
Problem: Magic numbers scattered across codebase (buffer_unordered(64), buffered(100), DEFAULT_BATCH_SIZE: u64 = 20) with no configuration or tuning capability.
Improvement: Add a DerivationTuning config struct in DerivationContext with justknobs-backed defaults for buffer sizes, batch sizes, concurrency limits.
Impact: Enables production tuning without code changes, better performance optimization, clearer intent.
Location: derived_data/skeleton_manifest/
Problem: Multiple skeleton manifest versions (v1, v2) with substantially duplicated derivation logic.
Improvement: Create deprecation plan for old versions.
Impact: Removes duplicated code.
Location: Various derive_batch implementations
Problem: No clear guidance on when to use sequential vs parallel vs stacked batch derivation. Each type implements differently.
Improvement: Create documented batch derivation patterns with decision tree. Provide template implementations.
Impact: Better batch performance for types not currently optimized, consistency, easier to implement new types correctly.
Location: Only in 5 derived data types
Problem: The "predecessor optimization" is extremely valuable for backfilling but only 5 types use it. No documentation or guide for implementing.
Improvement: Document the pattern in framework docs. Create guide/checklist for when it's applicable. Add helper methods to make implementation easier. Identify candidates among existing types.
Impact: 10-100x faster backfilling for large repos when enabled, clearer migration paths between versions.
Location: deleted_manifest/derive_batch.rs:224, inferred_copy_from/derive.rs:172, mercurial_derivation/src/derive_hg_changeset.rs:264
Problem: Known optimization opportunities documented as TODOs but not prioritized. Actual performance impact unknown.
Improvement: Benchmark each TODO to quantify impact. Create tasks for high-impact ones. Either implement or remove low-value TODOs.
Impact: Measured performance improvements (potentially 20-50% for impacted paths), cleaner codebase.
Location: from_thrift() and into_thrift() implementations in all 17+ mapping files
Problem: Repeated pattern of match statements with nearly identical error messages. Error handling is verbose and duplicated.
Improvement: Create derive_thrift_mapping! macro that generates these methods with less boilerplate. Consider auto-generation from thrift definitions.
Impact: Reduces ~30 lines per type, consistent error messages, easier to add new serialization formats.
Location: Test modules across all derived data types
Problem: Each type's tests define nearly identical TestRepo facet containers. Test utilities follow similar patterns but are duplicated.
Improvement: Create derived_data_test_utils with common test repo factory and trait-based test helpers.
Impact: Reduces test code by ~40%, ensures consistent test coverage, easier to add new types with good tests.
Location: blobstore/factory/src/blobstore.rs:451-628
Problem: The make_blobstore function recursively applies decorators with hardcoded order and needs_wrappers flag. Hard to understand which decorators will be applied, difficult to test interactions.
Improvement: Create a DecoratorBuilder pattern that explicitly documents the decorator stack and makes order testable and customizable.
Impact: Makes decorator ordering visible and documented, easier to add/remove/reorder, testable decorator combinations.
Location: blobstore/factory/src/blobstore.rs:70-181
Problem: BlobstoreOptions contains 8+ different option structs bundled together even when only 1-2 are needed. Creates excessive cloning and tight coupling.
Improvement: Use builder pattern with trait-based options or split into logical groups (StorageOptions, ObservabilityOptions, TestingOptions).
Impact: Reduce unnecessary cloning, only pass relevant options, easier to add new options.
Location: blobstore/multiplexedblob/src/scrub.rs:93-110
Problem: ScrubOptions has grown to include many modes with subtle interactions. ScrubWriteOnly enum has 4 variants with complex semantics.
Improvement: Split into separate concerns: ScrubPolicy, WriteOnlyStrategy, TimingConfig. Use type-state pattern to prevent invalid configurations.
Impact: Prevent misconfiguration, easier to understand scrubbing behavior, independently testable concerns.
Location: blobstore/src/lib.rs:454-513
Problem: Two related but distinct types (PutBehaviour and OverwriteStatus) where the relationship isn't obvious. Some decorators don't preserve status correctly.
Improvement: Rename to make relationship clearer (PutPolicy and PutOutcome). Add documentation showing the state machine. Add debug assertions in decorators.
Impact: Ensure put guarantees are maintained through decorators, easier to trace operations, self-documenting code.
Location: blobstore/src/lib.rs:515-609
Problem: BlobstoreKeySource enumeration has decorators manually stripping/adding key transformations. No support for pagination limits or filtering. Continuation tokens are opaque.
Improvement: Add a KeyTransform trait that decorators implement. Support server-side filtering. Make continuation tokens structured. Add limit parameter.
Impact: Reduce data transfer, prevent OOM from enumerating millions of keys, debuggable continuation tokens.
Location: lfs_server/src/middleware/request_context.rs, gotham_ext/src/middleware/request_context.rs
Problem: LFS server has its own implementation with custom method tracking, while git_server uses the generic version. Code duplication (~140+ lines), inconsistent authentication enforcement.
Improvement: Create unified, configurable RequestContextMiddleware in gotham_ext that supports optional method tracking, consistent authentication, and unified session building.
Impact: Reduces ~200 lines of duplication, ensures consistent auth behavior, easier to add new servers.
Location: git_server/src/service/error_formatter.rs, lfs_server/src/service/error_formatter.rs, edenapi_service
Problem: Each server implements its own error formatter - git returns plain text, LFS returns JSON with request_id, inconsistent error detail exposure.
Improvement: Create shared error formatting library with protocol-aware formatters (Git, LFS, JSON), consistent request_id inclusion, and configurable detail exposure.
Impact: Consistent error responses across all services, better debugging with request_id everywhere, easier to add structured logging.
Location: git_server/src/service/router.rs:101, lfs_server/src/service/router.rs:80
Problem: Different health check implementations - git has special wait-time header, LFS returns "EXITING" when shutting down, inconsistent for load balancers.
Improvement: Create shared health check handler in gotham_ext with support for optional will_exit state, delay header, consistent response format, and ready vs alive probes.
Impact: Consistent load balancer behavior, reusable logic (~30 lines per server), better graceful shutdown.
Location: server/src/main.rs:101-215, git_server/src/sharding.rs:23-105, scs/scs_server/src/main.rs:144-221
Problem: Each server implements nearly identical sharding logic (~100 lines duplicated 3+ times).
Improvement: Create a generic GenericRepoShardedProcess<R> in cmdlib/sharding.
Impact: Eliminates ~300 lines of duplication, single place to fix sharding bugs, easier to add new sharded services.
Location: server/src/main.rs:253-275, git_server/src/main.rs:195-207, lfs_server/src/main.rs:213-225
Problem: Nearly identical TLS acceptor building code (~20-30 lines duplicated per server).
Improvement: Add helper method to TLSArgs like build_acceptor(&self, alpn_protocols: Option<&[u8]>) -> Result<SslAcceptor>.
Impact: Reduces ~60-80 lines of duplication, ensures consistent TLS configuration, single point for security updates.
Location: HTTP vs Thrift server initialization
Problem: Two different termination methods (run_until_terminated vs wait_until_terminated) with different signatures and shutdown coordination patterns.
Improvement: Unify into a single termination pattern or clearly rename methods (run_http_server_until_terminated vs wait_for_termination). Or create unified ServerShutdown builder.
Impact: Clearer API for server authors, consistent shutdown behavior, reduces cognitive load.
Location: git_server/src/main.rs:300-332, lfs_server/src/main.rs:319-343, edenapi_service/src/lib.rs:80-110
Problem: Each server builds middleware stacks in different orders - TLS, Scuba, Load positions vary. Hard to understand why ordering differs, potential for bugs.
Improvement: Create a ServerMiddlewareBuilder with documented middleware ordering and semantic groupings (core, auth, app, observability).
Impact: Consistent middleware ordering across servers, prevents ordering bugs, ~40 lines of documentation vs scattered code.
Location: Scattered across servers
Problem: Repo name extraction scattered across servers with different path formats, no validation, inconsistent errors.
Improvement: Create shared repo resolution utilities with RepoResolver trait for different path schemes, common validation, consistent errors, and support for aliases.
Impact: Consistent repo not found errors, easier to add features like repo aliases, single place for validation, better security.
Location: git_server/src/main.rs:360-361, lfs_server/src/main.rs:356-357
Problem: Each HTTP server creates ConnectionSecurityChecker independently with same pattern duplicated 3+ times.
Improvement: Add helper to MononokeApp: pub async fn connection_security_checker(&self) -> Result<ConnectionSecurityChecker>.
Impact: Ensures consistent security checking, single place to update security logic, harder to forget checks in new servers.
Location: git_server/src/main.rs:336-340, lfs_server/src/main.rs:348-352, scs/scs_server/src/main.rs:379-383
Problem: Identical bound address file writing code duplicated across all servers (~5 lines per server).
Improvement: Create utility function: pub fn write_bound_address(path: Option<&Path>, addr: &str) -> Result<()>.
Impact: Consistent file writing behavior, single place for error handling, could add features like atomic writes.
Location: tests/integration/library.sh (1,843 lines)
Problem: Single 48KB file containing all test utilities. Difficult to navigate, maintain, and understand. No clear organization.
Improvement: Split into focused modules (server-helpers.sh, repo-setup.sh, wait-utils.sh, admin-helpers.sh). Group related functions together.
Impact: Better maintainability, easier onboarding, reduced merge conflicts, clearer ownership.
Location: tests/integration/library.sh (18+ wait functions)
Problem: 18+ specialized wait functions with similar polling logic and slight variations. Hardcoded timeouts and sleep intervals.
Improvement: Create generic wait_for_condition function with configurable condition check, timeout, poll interval with exponential backoff, and failure message.
Impact: Faster tests (via exponential backoff), consistent timeout behavior, easier debugging, ~300 lines of code reduction.
Location: Throughout library.sh and test files
Problem: 17+ hardcoded sleep calls with arbitrary durations. Fixed sleep durations cause unnecessary test slowness and race conditions.
Improvement: Replace all fixed sleeps with condition-based waiting. Implement exponential backoff for all polling. Use inotify/file watching where possible.
Impact: 20-30% faster test execution, more reliable tests, better failure diagnostics.
Location: Across 586 .t files
Problem: Only 25 tests use default_setup_* helpers. Most manually initialize repos. Two competing methods (blobimport vs testtool_drawdag). Massive duplication of setup boilerplate.
Improvement: Expand fixture library using Rust TestRepoFixture trait. Encourage fixture use via better documentation. Deprecate manual setup patterns.
Impact: Reduced test setup code by 50%+, faster test authoring, consistent environments, easier improvements propagate to all tests.
Location: Error handling throughout integration tests
Problem: When tests fail, often just see "timeout" or generic error. Server logs buried. No structured logging or failure context. Hard to diagnose which parallel operation failed.
Improvement: Wrap all wait_for_* functions to automatically dump relevant logs on timeout. Add structured failure context. Implement test operation timing. Auto-capture server logs on failure.
Impact: 50-80% reduction in debugging time, faster failure diagnosis, fewer "flaky test" retries.
Location: 31 tests marked with #require slow
Problem: 31 tests marked as "slow" but no performance tracking. No visibility into which operations are slow. Tests get slower over time without anyone noticing.
Improvement: Implement test timing infrastructure that logs operation durations. Add performance regression detection (warn if test >20% slower than baseline). Profile slow tests and optimize.
Impact: 2-5x speedup for slowest tests, prevent performance regression, better CI resource utilization.
Location: library.sh (32 uses of Python wrapper), dbrtest_runner.py
Problem: Functions implemented in Python called via python_fn wrapper from shell. Fragile environment variable passing. TODOs indicate incomplete migration. Debugging is difficult.
Improvement: Complete Python migration for functions flagged with TODO, or go fully shell-based. Eliminate python_fn wrapper complexity. Make language choice deliberate per-function.
Impact: Simpler mental model, easier debugging, clearer maintenance ownership, eliminate TODOs.
Location: library-push-redirector.sh (22KB), library-git-lfs.sh (4KB)
Problem: Large specialized library files that duplicate patterns from main library. Similar configuration generation logic duplicated.
Improvement: Extract common patterns into shared utilities. Create declarative config builders instead of imperative JSON string construction. Use Rust test utilities for complex setup.
Impact: Less duplication, easier to add new test scenarios, better config validation.
Location: Repetitive patterns across .t files
Problem: No helpers for common patterns like "push and verify bookmark moved", "make commit with specific files", manual jq parsing repeated hundreds of times, ad-hoc scuba log validation.
Improvement: Add high-level test helpers like push_and_verify_bookmark, create_commit_with_files, assert_scuba_log_contains, assert_bookmark_equals.
Impact: Tests become more readable, common errors prevented, faster test authoring.
Location: library.sh, dbrtest_runner.py, test utilities
Problem: 48KB of shell code with no unit tests. Test infrastructure bugs affect all 586 integration tests. Helper functions can break silently.
Impact: Add unit tests for critical library functions. Create test-library-*.t tests. Test edge cases. Add CI job for library unit tests.
Impact: Safer refactoring, prevent infrastructure regressions, faster development iteration.
This document is a living collection of improvement opportunities. Items should be evaluated for impact and feasibility before implementation. Not all items will be implemented, and priorities may shift based on business needs.