Back to Sapling

Better Engineering

eden/mononoke/docs/A.1-better-engineering.md

latest23.3 KB
Original Source

Better Engineering

This document tracks long-term engineering improvements and technical debt that could be addressed in the Mononoke codebase. These are substantive improvements identified through codebase analysis, not immediate action items.

Note: These were mostly generated by automated analysis and may not be relevant or applicable to the current codebase. Please use your judgment when considering these items.


Bonsai Data Model and Core Types

Remove BonsaiChangeset::git_tree_hash

Location: mononoke_types/src/bonsai_changeset.rs:57

Problem: This was added as the first attempt of storing refs to trees. We now use git_ref_content_mapping. The validation logic enforces constraints that make this field mutually exclusive with normal changeset data and with git_annotated_tag.

Improvement: Migrate git_tree_hash to a separate mapping table (similar to bonsai_git_mapping) or create a distinct type for git tree representations rather than overloading BonsaiChangeset.

Impact: Reduces complexity in core type validation, eliminates special-case handling, improves type safety by separating concerns.

Consolidate Duplicated VCS Mapping Trait Pattern

Location: repo_attributes/bonsai_*_mapping/src/lib.rs (8 different directories)

Problem: The codebase has 6+ nearly identical mapping traits (BonsaiHgMapping, BonsaiGitMapping, BonsaiGlobalrevMapping, BonsaiSvnrevMapping, BonsaiTagMapping) with duplicated conversion methods and nearly identical BonsaiOrX enum types.

Improvement: Create a generic BonsaiMapping<T> trait or macro to eliminate duplication. The common pattern could be abstracted into a single parameterized implementation with type-specific hooks only where needed.

Impact: Reduces maintenance burden significantly (changes need to be made in 6+ places currently), improves consistency across VCS backends, easier to add new mapping types.

Restructure FileChange Enum

Location: mononoke_types/src/file_change.rs:101-107

Problem: The FileChange enum has four variants (Change, Deletion, UntrackedChange, UntrackedDeletion) where the tracked/untracked distinction creates duplication. The simplify() method exists specifically to collapse this distinction, suggesting the separation may be too granular. There's a TODO about adding copy information to untracked changes.

Improvement: Restructure as a struct with change_type, tracking, and optional data fields. This would make adding copy_from to untracked changes trivial and eliminate the need for simplify().

Impact: More flexible type that's easier to extend, eliminates redundant enum variants, resolves TODO about copy info.

Eliminate Clone Overhead in Unode ID Computation

Location: mononoke_types/src/unode.rs:116,235

Problem: Both FileUnode::get_unode_id() and ManifestUnode::get_unode_id() have "FIXME: try avoid clone" comments. The methods clone the entire structure just to compute its ID.

Improvement: Cache the ID when the unode is created/deserialized, or implement BorrowedBlobstoreValue trait that can compute IDs from references.

Impact: Performance improvement (eliminates unnecessary allocations/copies), cleaner API, resolves long-standing FIXMEs.

Consolidate Path Type Proliferation

Location: mononoke_types/src/path.rs and path/mpath_element.rs

Problem: Overlapping path types with subtle differences: MPath, NonRootMPath, RepoPath, MPathElement, MPathHash. The relationship creates conversion overhead and cognitive load.

Improvement: Consolidate to 2-3 core types with clearer roles using const generics or the type system for root/non-root distinction.

Impact: Reduced cognitive load, fewer conversions, clearer type relationships.


Repository Facets and Composition

Reduce Trait Bound Verbosity

Location: Throughout features/, repo_attributes/bookmarks/bookmarks_movement/

Problem: Features define trait aliases requiring 10-20+ individual facet refs, creating verbose and error-prone code.

Improvement: Create higher-level facet groupings like RepoStorageFacets, RepoVcsFacets, RepoPermissionFacets, RepoDerivedDataFacets.

Impact: Reduces trait bound lists from 15+ items to 3-5 semantic groups, easier to understand, faster compilation, clearer grouping for new facets.

Standardize Arc vs Ref Trait Suffixes

Location: All facets in repo_attributes/

Problem: Inconsistent patterns - some have only XxxRef, some only XxxArc, some have both. Features randomly mix these.

Improvement: Standardize on a single pattern (preferably only Ref variants for cleaner API) with Arc used only where required.

Impact: Eliminates confusion, reduces cognitive load, makes refactoring safer, smaller binary size.

Split RepoDerivedData Facet

Location: repo_attributes/repo_derived_data/src/lib.rs

Problem: RepoDerivedData violates single responsibility by containing configuration, multiple managers, selection logic, 7+ boilerplate methods, bubble context management, and scuba mutation.

Improvement: Split into multiple facets: RepoDerivedDataConfig, RepoDerivedDataManager, RepoDerivedDataRegistry. Use builder pattern for overrides.

Impact: Easier to mock in tests, clearer separation of concerns, reduces code duplication (150+ lines of boilerplate).


Derived Data Framework

Eliminate Duplicate get_file_changes Implementations

Location: derived_data/{fsnodes,unodes,skeleton_manifest}/mapping.rs

Problem: The same get_file_changes() function is duplicated in at least 3 derived data types with identical implementations.

Improvement: Move to a shared utility module (e.g., derived_data/src/common.rs) and reuse.

Impact: Reduces ~50 lines of duplication, ensures consistent behavior, single place to optimize.

Abstract Mapping Storage Boilerplate

Location: All mapping.rs files (17+ files)

Problem: Every derived data type manually implements format_key(), store_mapping(), and fetch() with nearly identical patterns.

Improvement: Create a derive_mapping! macro or trait-based abstraction that generates these methods given just the key prefix.

Impact: Eliminates ~100+ lines of boilerplate per type, prevents copy-paste errors, easier versioning/key format changes.

Centralize Hardcoded Buffer/Concurrency Constants

Location: Throughout framework

Problem: Magic numbers scattered across codebase (buffer_unordered(64), buffered(100), DEFAULT_BATCH_SIZE: u64 = 20) with no configuration or tuning capability.

Improvement: Add a DerivationTuning config struct in DerivationContext with justknobs-backed defaults for buffer sizes, batch sizes, concurrency limits.

Impact: Enables production tuning without code changes, better performance optimization, clearer intent.

Deprecate and Remove Skeleton Manifest V1

Location: derived_data/skeleton_manifest/

Problem: Multiple skeleton manifest versions (v1, v2) with substantially duplicated derivation logic.

Improvement: Create deprecation plan for old versions.

Impact: Removes duplicated code.

Standardize Batch Derivation Strategies

Location: Various derive_batch implementations

Problem: No clear guidance on when to use sequential vs parallel vs stacked batch derivation. Each type implements differently.

Improvement: Create documented batch derivation patterns with decision tree. Provide template implementations.

Impact: Better batch performance for types not currently optimized, consistency, easier to implement new types correctly.

Expand derive_from_predecessor Pattern

Location: Only in 5 derived data types

Problem: The "predecessor optimization" is extremely valuable for backfilling but only 5 types use it. No documentation or guide for implementing.

Improvement: Document the pattern in framework docs. Create guide/checklist for when it's applicable. Add helper methods to make implementation easier. Identify candidates among existing types.

Impact: 10-100x faster backfilling for large repos when enabled, clearer migration paths between versions.

Implement Outstanding Performance TODOs

Location: deleted_manifest/derive_batch.rs:224, inferred_copy_from/derive.rs:172, mercurial_derivation/src/derive_hg_changeset.rs:264

Problem: Known optimization opportunities documented as TODOs but not prioritized. Actual performance impact unknown.

Improvement: Benchmark each TODO to quantify impact. Create tasks for high-impact ones. Either implement or remove low-value TODOs.

Impact: Measured performance improvements (potentially 20-50% for impacted paths), cleaner codebase.

Create Shared Thrift Serialization Framework

Location: from_thrift() and into_thrift() implementations in all 17+ mapping files

Problem: Repeated pattern of match statements with nearly identical error messages. Error handling is verbose and duplicated.

Improvement: Create derive_thrift_mapping! macro that generates these methods with less boilerplate. Consider auto-generation from thrift definitions.

Impact: Reduces ~30 lines per type, consistent error messages, easier to add new serialization formats.

Unify Testing Infrastructure

Location: Test modules across all derived data types

Problem: Each type's tests define nearly identical TestRepo facet containers. Test utilities follow similar patterns but are duplicated.

Improvement: Create derived_data_test_utils with common test repo factory and trait-based test helpers.

Impact: Reduces test code by ~40%, ensures consistent test coverage, easier to add new types with good tests.


Storage Architecture

Simplify Decorator Stack Configuration

Location: blobstore/factory/src/blobstore.rs:451-628

Problem: The make_blobstore function recursively applies decorators with hardcoded order and needs_wrappers flag. Hard to understand which decorators will be applied, difficult to test interactions.

Improvement: Create a DecoratorBuilder pattern that explicitly documents the decorator stack and makes order testable and customizable.

Impact: Makes decorator ordering visible and documented, easier to add/remove/reorder, testable decorator combinations.

Reduce BlobstoreOptions Proliferation

Location: blobstore/factory/src/blobstore.rs:70-181

Problem: BlobstoreOptions contains 8+ different option structs bundled together even when only 1-2 are needed. Creates excessive cloning and tight coupling.

Improvement: Use builder pattern with trait-based options or split into logical groups (StorageOptions, ObservabilityOptions, TestingOptions).

Impact: Reduce unnecessary cloning, only pass relevant options, easier to add new options.

Clarify Scrub Configuration

Location: blobstore/multiplexedblob/src/scrub.rs:93-110

Problem: ScrubOptions has grown to include many modes with subtle interactions. ScrubWriteOnly enum has 4 variants with complex semantics.

Improvement: Split into separate concerns: ScrubPolicy, WriteOnlyStrategy, TimingConfig. Use type-state pattern to prevent invalid configurations.

Impact: Prevent misconfiguration, easier to understand scrubbing behavior, independently testable concerns.

Clarify PutBehaviour vs OverwriteStatus

Location: blobstore/src/lib.rs:454-513

Problem: Two related but distinct types (PutBehaviour and OverwriteStatus) where the relationship isn't obvious. Some decorators don't preserve status correctly.

Improvement: Rename to make relationship clearer (PutPolicy and PutOutcome). Add documentation showing the state machine. Add debug assertions in decorators.

Impact: Ensure put guarantees are maintained through decorators, easier to trace operations, self-documenting code.

Improve Blobstore Enumeration

Location: blobstore/src/lib.rs:515-609

Problem: BlobstoreKeySource enumeration has decorators manually stripping/adding key transformations. No support for pagination limits or filtering. Continuation tokens are opaque.

Improvement: Add a KeyTransform trait that decorators implement. Support server-side filtering. Make continuation tokens structured. Add limit parameter.

Impact: Reduce data transfer, prevent OOM from enumerating millions of keys, debuggable continuation tokens.


Server and API Layer

Unify RequestContextMiddleware Implementations

Location: lfs_server/src/middleware/request_context.rs, gotham_ext/src/middleware/request_context.rs

Problem: LFS server has its own implementation with custom method tracking, while git_server uses the generic version. Code duplication (~140+ lines), inconsistent authentication enforcement.

Improvement: Create unified, configurable RequestContextMiddleware in gotham_ext that supports optional method tracking, consistent authentication, and unified session building.

Impact: Reduces ~200 lines of duplication, ensures consistent auth behavior, easier to add new servers.

Standardize Error Formatting Across Servers

Location: git_server/src/service/error_formatter.rs, lfs_server/src/service/error_formatter.rs, edenapi_service

Problem: Each server implements its own error formatter - git returns plain text, LFS returns JSON with request_id, inconsistent error detail exposure.

Improvement: Create shared error formatting library with protocol-aware formatters (Git, LFS, JSON), consistent request_id inclusion, and configurable detail exposure.

Impact: Consistent error responses across all services, better debugging with request_id everywhere, easier to add structured logging.

Create Shared Health Check Handler

Location: git_server/src/service/router.rs:101, lfs_server/src/service/router.rs:80

Problem: Different health check implementations - git has special wait-time header, LFS returns "EXITING" when shutting down, inconsistent for load balancers.

Improvement: Create shared health check handler in gotham_ext with support for optional will_exit state, delay header, consistent response format, and ready vs alive probes.

Impact: Consistent load balancer behavior, reusable logic (~30 lines per server), better graceful shutdown.

Eliminate Repo Sharding Logic Duplication

Location: server/src/main.rs:101-215, git_server/src/sharding.rs:23-105, scs/scs_server/src/main.rs:144-221

Problem: Each server implements nearly identical sharding logic (~100 lines duplicated 3+ times).

Improvement: Create a generic GenericRepoShardedProcess<R> in cmdlib/sharding.

Impact: Eliminates ~300 lines of duplication, single place to fix sharding bugs, easier to add new sharded services.

Unify TLS Acceptor Setup

Location: server/src/main.rs:253-275, git_server/src/main.rs:195-207, lfs_server/src/main.rs:213-225

Problem: Nearly identical TLS acceptor building code (~20-30 lines duplicated per server).

Improvement: Add helper method to TLSArgs like build_acceptor(&self, alpn_protocols: Option<&[u8]>) -> Result<SslAcceptor>.

Impact: Reduces ~60-80 lines of duplication, ensures consistent TLS configuration, single point for security updates.

Clarify Server Initialization Patterns

Location: HTTP vs Thrift server initialization

Problem: Two different termination methods (run_until_terminated vs wait_until_terminated) with different signatures and shutdown coordination patterns.

Improvement: Unify into a single termination pattern or clearly rename methods (run_http_server_until_terminated vs wait_for_termination). Or create unified ServerShutdown builder.

Impact: Clearer API for server authors, consistent shutdown behavior, reduces cognitive load.

Standardize Middleware Stack Ordering

Location: git_server/src/main.rs:300-332, lfs_server/src/main.rs:319-343, edenapi_service/src/lib.rs:80-110

Problem: Each server builds middleware stacks in different orders - TLS, Scuba, Load positions vary. Hard to understand why ordering differs, potential for bugs.

Improvement: Create a ServerMiddlewareBuilder with documented middleware ordering and semantic groupings (core, auth, app, observability).

Impact: Consistent middleware ordering across servers, prevents ordering bugs, ~40 lines of documentation vs scattered code.

Create Shared Repo Resolution Logic

Location: Scattered across servers

Problem: Repo name extraction scattered across servers with different path formats, no validation, inconsistent errors.

Improvement: Create shared repo resolution utilities with RepoResolver trait for different path schemes, common validation, consistent errors, and support for aliases.

Impact: Consistent repo not found errors, easier to add features like repo aliases, single place for validation, better security.

Add ConnectionSecurityChecker Helper

Location: git_server/src/main.rs:360-361, lfs_server/src/main.rs:356-357

Problem: Each HTTP server creates ConnectionSecurityChecker independently with same pattern duplicated 3+ times.

Improvement: Add helper to MononokeApp: pub async fn connection_security_checker(&self) -> Result<ConnectionSecurityChecker>.

Impact: Ensures consistent security checking, single place to update security logic, harder to forget checks in new servers.

Create Bound Address File Writing Utility

Location: git_server/src/main.rs:336-340, lfs_server/src/main.rs:348-352, scs/scs_server/src/main.rs:379-383

Problem: Identical bound address file writing code duplicated across all servers (~5 lines per server).

Improvement: Create utility function: pub fn write_bound_address(path: Option<&Path>, addr: &str) -> Result<()>.

Impact: Consistent file writing behavior, single place for error handling, could add features like atomic writes.


Testing Infrastructure

Split Monolithic Test Library File

Location: tests/integration/library.sh (1,843 lines)

Problem: Single 48KB file containing all test utilities. Difficult to navigate, maintain, and understand. No clear organization.

Improvement: Split into focused modules (server-helpers.sh, repo-setup.sh, wait-utils.sh, admin-helpers.sh). Group related functions together.

Impact: Better maintainability, easier onboarding, reduced merge conflicts, clearer ownership.

Consolidate wait_for_* Functions

Location: tests/integration/library.sh (18+ wait functions)

Problem: 18+ specialized wait functions with similar polling logic and slight variations. Hardcoded timeouts and sleep intervals.

Improvement: Create generic wait_for_condition function with configurable condition check, timeout, poll interval with exponential backoff, and failure message.

Impact: Faster tests (via exponential backoff), consistent timeout behavior, easier debugging, ~300 lines of code reduction.

Replace Hardcoded Sleep Calls

Location: Throughout library.sh and test files

Problem: 17+ hardcoded sleep calls with arbitrary durations. Fixed sleep durations cause unnecessary test slowness and race conditions.

Improvement: Replace all fixed sleeps with condition-based waiting. Implement exponential backoff for all polling. Use inotify/file watching where possible.

Impact: 20-30% faster test execution, more reliable tests, better failure diagnostics.

Standardize Test Repository Setup

Location: Across 586 .t files

Problem: Only 25 tests use default_setup_* helpers. Most manually initialize repos. Two competing methods (blobimport vs testtool_drawdag). Massive duplication of setup boilerplate.

Improvement: Expand fixture library using Rust TestRepoFixture trait. Encourage fixture use via better documentation. Deprecate manual setup patterns.

Impact: Reduced test setup code by 50%+, faster test authoring, consistent environments, easier improvements propagate to all tests.

Improve Test Failure Diagnostics

Location: Error handling throughout integration tests

Problem: When tests fail, often just see "timeout" or generic error. Server logs buried. No structured logging or failure context. Hard to diagnose which parallel operation failed.

Improvement: Wrap all wait_for_* functions to automatically dump relevant logs on timeout. Add structured failure context. Implement test operation timing. Auto-capture server logs on failure.

Impact: 50-80% reduction in debugging time, faster failure diagnosis, fewer "flaky test" retries.

Add Test Performance Tracking

Location: 31 tests marked with #require slow

Problem: 31 tests marked as "slow" but no performance tracking. No visibility into which operations are slow. Tests get slower over time without anyone noticing.

Improvement: Implement test timing infrastructure that logs operation durations. Add performance regression detection (warn if test >20% slower than baseline). Profile slow tests and optimize.

Impact: 2-5x speedup for slowest tests, prevent performance regression, better CI resource utilization.

Complete Shell/Python Migration

Location: library.sh (32 uses of Python wrapper), dbrtest_runner.py

Problem: Functions implemented in Python called via python_fn wrapper from shell. Fragile environment variable passing. TODOs indicate incomplete migration. Debugging is difficult.

Improvement: Complete Python migration for functions flagged with TODO, or go fully shell-based. Eliminate python_fn wrapper complexity. Make language choice deliberate per-function.

Impact: Simpler mental model, easier debugging, clearer maintenance ownership, eliminate TODOs.

Abstract Test-Specific Library Patterns

Location: library-push-redirector.sh (22KB), library-git-lfs.sh (4KB)

Problem: Large specialized library files that duplicate patterns from main library. Similar configuration generation logic duplicated.

Improvement: Extract common patterns into shared utilities. Create declarative config builders instead of imperative JSON string construction. Use Rust test utilities for complex setup.

Impact: Less duplication, easier to add new test scenarios, better config validation.

Add Missing Test Helpers

Location: Repetitive patterns across .t files

Problem: No helpers for common patterns like "push and verify bookmark moved", "make commit with specific files", manual jq parsing repeated hundreds of times, ad-hoc scuba log validation.

Improvement: Add high-level test helpers like push_and_verify_bookmark, create_commit_with_files, assert_scuba_log_contains, assert_bookmark_equals.

Impact: Tests become more readable, common errors prevented, faster test authoring.

Add Unit Tests for Test Infrastructure

Location: library.sh, dbrtest_runner.py, test utilities

Problem: 48KB of shell code with no unit tests. Test infrastructure bugs affect all 586 integration tests. Helper functions can break silently.

Impact: Add unit tests for critical library functions. Create test-library-*.t tests. Test edge cases. Add CI job for library unit tests.

Impact: Safer refactoring, prevent infrastructure regressions, faster development iteration.


This document is a living collection of improvement opportunities. Items should be evaluated for impact and feasibility before implementation. Not all items will be implemented, and priorities may shift based on business needs.