
Monitoring and Observability

eden/mononoke/docs/6.3-monitoring-and-observability.md

This document describes the monitoring and observability infrastructure in Mononoke. These systems provide visibility into server health, performance, and operational behavior.

Overview

Mononoke exports metrics, logs, and traces to support operational monitoring and debugging. The observability infrastructure is designed to handle high-volume production traffic while providing detailed diagnostics when needed.

The monitoring system is organized into several layers:

  • Metrics - Quantitative measurements exported to ODS (Operational Data Store)
  • Logging - Structured logs written to Scuba for analysis
  • Tracing - Request-level tracking using the tracing framework
  • Health checks - Service health reporting via HTTP and FB303
  • Performance counters - Per-request operation metrics

Metrics (ODS Integration)

Mononoke exports operational metrics using the stats crate. Metrics are exported to ODS for visualization and alerting.

Metric Types

Counters and Timeseries

```rust
define_stats! {
    prefix = "mononoke.edenapi.request";
    total_requests: timeseries(Rate, Sum),
    requests: dynamic_timeseries("{}.requests", (method: String); Rate, Sum),
}
```

These track request rates, success counts, and error counts. Dynamic timeseries allow metrics to be broken down by dimensions like method name or repository.

Histograms

```rust
files2_duration_ms: histogram(100, 0, 5000, Average, Sum, Count; P 50; P 75; P 95; P 99),
```

Histograms measure distributions of values such as request latency, response size, or operation duration. They report percentiles (P50, P75, P95, P99) for understanding tail latency.
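Assuming the first three arguments to histogram() are bucket width, minimum, and maximum (here: width 100 over the range 0..5000), the bucketing of a single sample can be sketched as follows. This is illustrative only, not the stats crate's internals:

```rust
// Sketch of fixed-width histogram bucketing, assuming histogram(100, 0, 5000, ...)
// means bucket width 100 over the range 0..5000 (illustrative, not the stats crate).
fn bucket_index(value: u64, width: u64, min: u64, max: u64) -> usize {
    // Clamp out-of-range samples into the first or last bucket.
    let clamped = value.clamp(min, max.saturating_sub(1));
    ((clamped - min) / width) as usize
}
```

A 250ms latency sample would land in bucket 2; anything at or above 5000ms falls into the last bucket.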

Metric Locations

Metrics are defined in components throughout Mononoke:

Server Metrics (edenapi_service/src/middleware/ods.rs, git_server/, lfs_server/)

  • Request duration by endpoint
  • Success and failure rates (4xx, 5xx responses)
  • Response bytes sent
  • Request load

Background Job Metrics (jobs/walker/, jobs/blobstore_healer/)

  • Items processed (commits walked, blobs healed)
  • Validation pass/fail counts
  • Scrub repair operations
  • Queue depths and processing latency

Storage Metrics (blobstore implementations)

  • Blob gets, puts, and presence checks
  • Cache hit/miss rates
  • Multiplex operations
  • Packblob compression ratios

Feature Metrics (pushrebase/, features/)

  • Pushrebase duration and conflicts
  • Cross-repo sync operations
  • Derived data derivation latency

Using Metrics

Metrics are updated by calling methods on the statics generated in the STATS module:

```rust
STATS::total_requests.add_value(1);
STATS::files2_duration_ms.add_value(duration_ms);
STATS::failure_5xx.add_value(1, (method_name.to_string(),));
```

The stats framework aggregates these values and exports them to ODS. Dashboards and alerts are built on these metrics.

Logging (Scuba)

Scuba is Mononoke's primary structured logging system. Scuba samples (log entries) contain key-value pairs describing operations, requests, and events.

CoreContext and Scuba

Every Mononoke operation carries a CoreContext that includes a Scuba sample builder. The context flows through the request lifecycle, accumulating fields:

```rust
pub struct CoreContext {
    fb: FacebookInit,
    session: SessionContainer,
    logging: LoggingContainer,
}
```

The logging container holds:

  • MononokeScubaSampleBuilder - Accumulates fields for the final Scuba sample
  • Performance counters - Operation metrics
  • Scribe client - For writing to Scribe streams

Scuba Fields

Scuba samples typically include:

Request Metadata

  • Session ID
  • Client info (hostname, username, identities)
  • Repository name
  • Request method and parameters

Performance Data

  • Request duration
  • Blob operations (gets, puts)
  • Cache statistics
  • Bytes transferred

Outcome Information

  • Success or failure status
  • Error messages
  • Result sizes
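Conceptually, the sample builder accumulates these fields as key-value pairs over the request's lifetime, and the completed sample is logged at the end. A minimal model (the names and API here are illustrative; MononokeScubaSampleBuilder's real interface differs):

```rust
use std::collections::BTreeMap;

// Minimal model of a sample builder that accumulates key-value fields over a
// request's lifetime (illustrative, not MononokeScubaSampleBuilder's API).
#[derive(Default, Clone)]
struct SampleBuilder {
    fields: BTreeMap<String, String>,
}

impl SampleBuilder {
    // Each layer of the request adds its own fields before the sample is logged.
    fn add(&mut self, key: &str, value: impl ToString) -> &mut Self {
        self.fields.insert(key.to_string(), value.to_string());
        self
    }
}
```

In the real system, middleware adds request metadata first, then deeper layers contribute performance and outcome fields to the same sample.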

Verbosity Levels

The observability framework supports configurable verbosity (observability/):

Normal Level - Logs all normal operations. This is the default level for production traffic.

Verbose Level - Logs additional detail for debugging. Verbose logging can be enabled globally or selectively based on:

  • Session ID
  • Unix username
  • Source hostname (via regex)

The ObservabilityContext (in observability/src/context.rs) determines whether a sample should be logged based on its verbosity level and the current configuration. This allows detailed logging for specific users or sessions without overwhelming the logging system.
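The shape of that decision can be sketched roughly as follows. Field and method names are hypothetical, and the hostname-regex case is omitted for brevity:

```rust
use std::collections::HashSet;

// Hypothetical model of the verbosity decision; the real ObservabilityContext
// also matches source hostnames by regex, omitted here.
enum Level {
    Normal,
    Verbose,
}

struct VerbosityConfig {
    verbose_session_ids: HashSet<String>,
    verbose_unixnames: HashSet<String>,
}

impl VerbosityConfig {
    fn should_log(&self, sample_level: Level, session_id: &str, unixname: &str) -> bool {
        match sample_level {
            // Normal-level samples are always logged.
            Level::Normal => true,
            // Verbose samples are logged only for explicitly enabled sessions or users.
            Level::Verbose => {
                self.verbose_session_ids.contains(session_id)
                    || self.verbose_unixnames.contains(unixname)
            }
        }
    }
}
```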

Scuba Tables

Different components write to different Scuba tables:

  • mononoke_edenapi - EdenAPI protocol requests
  • mononoke_scs - SCS Thrift API calls
  • mononoke_git - Git protocol operations
  • mononoke_lfs - LFS operations
  • mononoke_walker - Walker validation and scrubbing
  • mononoke_backsyncer - Cross-repo sync events

Each table has component-specific fields reflecting the operation types and parameters.

Tracing

Mononoke uses the tracing crate for structured logging within components. Tracing provides hierarchical logging with spans and events.

Usage

Tracing calls are embedded throughout the code:

```rust
use tracing::{info, warn, debug, error};

info!("Starting derivation for changeset {}", cs_id);
warn!("Slow operation detected: {}ms", duration_ms);
debug!(keys = ?blob_keys, "Fetching blobs");
```

Spans group related operations:

```rust
use tracing::Instrument;

async fn derive_data(ctx: &CoreContext, cs_id: ChangesetId) -> Result<()> {
    async move {
        // derivation work
    }
    .instrument(tracing::info_span!("derive_data", ?cs_id))
    .await
}
```

Tracing integrates with the request context and can include correlation IDs for following a request through multiple services.

Log Levels

Error - Unexpected failures requiring attention

Warn - Recoverable issues or concerning patterns

Info - Normal operational events

Debug - Detailed diagnostic information

Log output is configured per deployment and can be directed to local logs or centralized logging systems.

Request Context and Tracking

The CoreContext (server/context/) serves as the request tracking mechanism.

Context Contents

Session Information (SessionContainer)

  • Session class (user, background, backup)
  • Client metadata
  • Permission checker
  • Identity set

Logging Container

  • Scuba sample builder
  • Performance counters stack
  • Scribe client
  • Sampling key

Metadata

  • Client request info (entry point, correlator)
  • Source hostname
  • Unix username

Context Flow

A context is created when a request enters Mononoke (typically in the protocol server). The context is cloned and passed through:

  1. Protocol handlers (EdenAPI, Git, SCS)
  2. API layer (mononoke_api/)
  3. Features (pushrebase, hooks, etc.)
  4. Repository facets
  5. Storage operations

Each layer can add fields to the Scuba sample or update performance counters. When the request completes, the accumulated data is logged.

Context Operations

Clone and Reset

```rust
let new_ctx = ctx.clone_and_reset();
```

Creates a new context with reset performance counters, useful for sub-operations.

Fork Performance Counters

```rust
let counters = ctx.fork_perf_counters();
```

Creates a snapshot of current performance counters for parallel operations.

Mutate Scuba Sample

```rust
let new_ctx = ctx.with_mutated_scuba(|scuba| {
    scuba.add("field_name", value)
});
```

Adds fields to the Scuba sample for the context.

Performance Counters

Performance counters track detailed operation metrics within a request. Counters are defined in server/context/src/perf_counters.rs.

Counter Types

Blobstore Operations

  • BlobGets, BlobPuts, BlobPresenceChecks
  • BlobGetsMaxLatency, BlobPutsMaxLatency
  • BlobGetsTotalSize, BlobPutsTotalSize
  • BlobGetsDeduplicated, BlobPutsDeduplicated

Caching

  • CachelibHits, CachelibMisses

Protocol-Specific

  • EdenapiFiles, EdenapiTrees
  • GetpackNumFiles, GettreepackNumTreepacks
  • GetbundleNumCommits, GetbundleNumManifests

Data Transfer

  • BytesSent

Performance counters are accumulated in the PerfCountersStack and can be nested for tracking operations within sub-operations. When a request completes, counters are exported to Scuba for analysis.
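A stripped-down model of such per-request counters, using atomics so they can be updated from concurrent tasks sharing the same context (illustrative only; the real PerfCountersStack supports many more counters plus nesting):

```rust
use std::sync::atomic::{AtomicI64, Ordering};

// Stripped-down per-request counters; atomics allow updates from concurrent
// tasks sharing the same context (illustrative, not the real PerfCountersStack).
#[derive(Default)]
struct PerfCounters {
    blob_gets: AtomicI64,
    blob_gets_total_size: AtomicI64,
}

impl PerfCounters {
    // Called by the blobstore layer on every get.
    fn record_blob_get(&self, size: i64) {
        self.blob_gets.fetch_add(1, Ordering::Relaxed);
        self.blob_gets_total_size.fetch_add(size, Ordering::Relaxed);
    }

    // Read out (count, total bytes) when the request completes, for Scuba export.
    fn snapshot(&self) -> (i64, i64) {
        (
            self.blob_gets.load(Ordering::Relaxed),
            self.blob_gets_total_size.load(Ordering::Relaxed),
        )
    }
}
```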

Counter Access

Operations access counters through the CoreContext:

```rust
let perf_counters = ctx.perf_counters();
```

The blobstore layer automatically updates blob operation counters. Protocol handlers update protocol-specific counters based on the data served.

Health Checks

Mononoke servers expose health check endpoints for monitoring and load balancing.

HTTP Health Checks

The main server (server/repo_listener/) responds to health check requests:

```
GET /             -> "I_AM_ALIVE"
GET /health_check -> "I_AM_ALIVE"
```

If the server is shutting down, these endpoints return "EXITING". Load balancers use these endpoints to determine which servers should receive traffic.
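The shutdown-aware response can be modeled with a single flag (an illustrative sketch, not the repo_listener code):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Illustrative shutdown-aware health response, not the repo_listener code.
fn health_response(shutting_down: &AtomicBool) -> &'static str {
    if shutting_down.load(Ordering::Relaxed) {
        "EXITING"
    } else {
        "I_AM_ALIVE"
    }
}
```

Flipping the flag at the start of graceful shutdown drains traffic: load balancers see "EXITING" and stop routing new requests to the instance.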

FB303 Service

Mononoke applications integrate with FB303 (cmdlib/mononoke_app/src/monitoring.rs), a service framework that provides:

Status Reporting

  • FbStatus::Alive - Server is ready
  • FbStatus::Starting - Server is initializing
  • FbStatus::Stopping - Server is shutting down

The ReadyFlagService implementation starts in the Starting state and transitions to Alive once initialization completes.
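That transition can be modeled with a single atomic flag (illustrative; the real ReadyFlagService implements the FB303 status interface and also reports Stopping during shutdown):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

#[derive(Debug, PartialEq)]
enum FbStatus {
    Starting,
    Alive,
}

// Illustrative model of the ready flag; the real ReadyFlagService implements
// the FB303 status interface and also reports Stopping during shutdown.
#[derive(Default)]
struct ReadyFlag {
    ready: AtomicBool,
}

impl ReadyFlag {
    // Flipped once server initialization completes.
    fn mark_ready(&self) {
        self.ready.store(true, Ordering::Relaxed);
    }

    fn status(&self) -> FbStatus {
        if self.ready.load(Ordering::Relaxed) {
            FbStatus::Alive
        } else {
            FbStatus::Starting
        }
    }
}
```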

Thrift Interface

FB303 exposes a Thrift interface on a configured port (via --fb303-thrift-port). This interface allows monitoring systems to:

  • Query server status
  • Retrieve counter values
  • Check build information

Prometheus Export

In fbcode builds, FB303 metrics can be exported in Prometheus format via the --prometheus-host-port flag. This enables integration with Prometheus-based monitoring stacks.

Monitoring Framework

The mononoke_app framework (cmdlib/mononoke_app/) initializes monitoring automatically. Applications using this framework receive:

  • FB303 service
  • Stats aggregation
  • Graceful shutdown handling
  • Health check integration

Common Patterns

Request Logging

Protocol servers use middleware to log requests:

Log Middleware (gotham_ext/src/middleware/log.rs)

Logs HTTP requests and responses:

```
IN  > GET /repo/trees
OUT < 200 150ms 1024bytes
```
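The OUT line above could be produced by a trivial formatter along these lines (a hypothetical helper, not the middleware's actual code):

```rust
// Hypothetical helper reproducing the OUT line format; not the middleware's code.
fn format_out_line(status: u16, duration_ms: u64, bytes: u64) -> String {
    format!("OUT < {} {}ms {}bytes", status, duration_ms, bytes)
}
```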

Scuba Middleware (gotham_ext/src/middleware/scuba.rs)

Constructs and logs Scuba samples for each request with timing, status, and metadata.

ODS Middleware (edenapi_service/src/middleware/ods.rs)

Updates ODS metrics for request duration, success/failure, and response size.

Operational Dashboards

Metrics exported to ODS are visualized in operational dashboards. Common dashboard categories:

Service Health

  • Request rate and error rate
  • P50, P95, P99 latency
  • Success vs. failure breakdown

Resource Usage

  • Blobstore operation rates
  • Cache hit rates
  • Bytes transferred

Feature-Specific

  • Pushrebase operations and conflicts
  • Derivation latency and backlog
  • Cross-repo sync lag

Background Jobs

  • Walker progress and error rates
  • Healer repair operations
  • Statistics collection status

Query Patterns

Scuba Queries

Scuba samples can be queried to analyze specific requests, debug failures, or identify performance patterns:

  • Filter by session ID to trace a specific client session
  • Filter by repository to analyze repository-specific behavior
  • Aggregate by endpoint to identify slow operations
  • Join with performance counters to correlate latency with blob operations

ODS Queries

ODS timeseries support aggregation and alerting:

  • Monitor P99 latency for SLA compliance
  • Track error rates for alerting
  • Compare metrics across deployments
  • Analyze capacity and scaling needs

Integration with Deployment

Monitoring configuration is specified via command-line arguments and configuration files:

Command-Line Flags

  • --fb303-thrift-port - Enable FB303 service
  • --prometheus-host-port - Export Prometheus metrics
  • --scuba-dataset - Scuba table name
  • --cache-mode - Caching configuration (affects cache metrics)

Configuration Files

Observability configuration (scm/mononoke/observability/observability_config) controls:

  • Scuba verbosity levels
  • Verbose sessions/usernames
  • Sampling rates

Configuration is loaded via cached_config and can be updated without restarting servers.
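The hot-reload behavior can be modeled as a shared, swappable handle: readers always see the latest snapshot, and an updater can swap in new values without restarting the process. This is an illustrative model; cached_config's real API differs:

```rust
use std::sync::{Arc, RwLock};

// Illustrative hot-reloadable config handle; cached_config's real API differs.
#[derive(Clone)]
struct ObservabilityConfig {
    verbose: bool,
}

#[derive(Clone)]
struct ConfigHandle(Arc<RwLock<ObservabilityConfig>>);

impl ConfigHandle {
    // Readers clone the current snapshot rather than holding the lock.
    fn get(&self) -> ObservabilityConfig {
        self.0.read().unwrap().clone()
    }

    // A background refresher swaps in new config without a server restart.
    fn update(&self, new: ObservabilityConfig) {
        *self.0.write().unwrap() = new;
    }
}
```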

Relationship to Architecture

The monitoring system reflects Mononoke's layered architecture:

Service Layer - HTTP middleware logs requests, updates per-endpoint metrics

API Layer - Scuba samples include high-level operation types

Feature Layer - Features log specific events (pushrebase conflicts, derivation completion)

Repository Layer - Facets update performance counters (blob operations, cache hits)

Storage Layer - Blobstore implementations track latency and throughput

This layering allows monitoring at multiple granularities, from high-level service health to detailed storage operations.