
Monitoring and Observability

eden/mononoke/docs/6.3-monitoring-and-observability.md

This document describes the monitoring and observability infrastructure in Mononoke. These systems provide visibility into server health, performance, and operational behavior.

Overview

Mononoke exports metrics, logs, and traces to support operational monitoring and debugging. The observability infrastructure is designed to handle high-volume production traffic while providing detailed diagnostics when needed.

The monitoring system is organized into several layers:

  • Metrics - Quantitative measurements exported to ODS (Operational Data Store)
  • Logging - Structured logs written to Scuba for analysis
  • Tracing - Request-level tracking using the tracing framework
  • Health checks - Service health reporting via HTTP and FB303
  • Performance counters - Per-request operation metrics

Metrics (ODS Integration)

Mononoke exports operational metrics using the stats crate. Metrics are exported to ODS for visualization and alerting.

Metric Types

Counters and Timeseries

```rust
define_stats! {
    prefix = "mononoke.edenapi.request";
    total_requests: timeseries(Rate, Sum),
    requests: dynamic_timeseries("{}.requests", (method: String); Rate, Sum),
}
```

These track request rates, success counts, and error counts. Dynamic timeseries allow metrics to be broken down by dimensions like method name or repository.

Histograms

```rust
files2_duration_ms: histogram(100, 0, 5000, Average, Sum, Count; P 50; P 75; P 95; P 99),
```

Histograms measure distributions of values such as request latency, response size, or operation duration. They report percentiles (P50, P75, P95, P99) for understanding tail latency.
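Assuming the first three arguments to histogram() are bucket width, minimum, and maximum (here: width 100 over the range 0..5000), the bucketing of a single sample can be sketched as follows. This is illustrative only, not the stats crate's internals:

```rust
// Sketch of fixed-width histogram bucketing, assuming histogram(100, 0, 5000, ...)
// means bucket width 100 over the range 0..5000 (illustrative, not the stats crate).
fn bucket_index(value: u64, width: u64, min: u64, max: u64) -> usize {
    // Clamp out-of-range samples into the first or last bucket.
    let clamped = value.clamp(min, max.saturating_sub(1));
    ((clamped - min) / width) as usize
}
```

A 250ms latency sample would land in bucket 2; anything at or above 5000ms falls into the last bucket.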

Metric Locations

Metrics are defined in components throughout Mononoke:

Server Metrics (edenapi_service/src/middleware/ods.rs, git_server/, lfs_server/)

  • Request duration by endpoint
  • Success and failure rates (4xx, 5xx responses)
  • Response bytes sent
  • Request load

Background Job Metrics (jobs/walker/, jobs/blobstore_healer/)

  • Items processed (commits walked, blobs healed)
  • Validation pass/fail counts
  • Scrub repair operations
  • Queue depths and processing latency

Storage Metrics (blobstore implementations)

  • Blob gets, puts, and presence checks
  • Cache hit/miss rates
  • Multiplex operations
  • Packblob compression ratios

Feature Metrics (pushrebase/, features/)

  • Pushrebase duration and conflicts
  • Cross-repo sync operations
  • Derived data derivation latency

Using Metrics

Metrics are updated by calling methods on the statics generated in the STATS module:

```rust
STATS::total_requests.add_value(1);
STATS::files2_duration_ms.add_value(duration_ms);
STATS::failure_5xx.add_value(1, (method_name.to_string(),));
```

The stats framework aggregates these values and exports them to ODS. Dashboards and alerts are built on these metrics.

Logging (Scuba)

Scuba is Mononoke's primary structured logging system. Scuba samples (log entries) contain key-value pairs describing operations, requests, and events.

CoreContext and Scuba

Every Mononoke operation carries a CoreContext that includes a Scuba sample builder. The context flows through the request lifecycle, accumulating fields:

```rust
pub struct CoreContext {
    fb: FacebookInit,
    session: SessionContainer,
    logging: LoggingContainer,
}
```

The logging container holds:

  • MononokeScubaSampleBuilder - Accumulates fields for the final Scuba sample
  • Performance counters - Operation metrics
  • Scribe client - For writing to Scribe streams

Scuba Fields

Scuba samples typically include:

Request Metadata

  • Session ID
  • Client info (hostname, username, identities)
  • Repository name
  • Request method and parameters

Performance Data

  • Request duration
  • Blob operations (gets, puts)
  • Cache statistics
  • Bytes transferred

Outcome Information

  • Success or failure status
  • Error messages
  • Result sizes
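Conceptually, the sample builder accumulates these fields as key-value pairs over the request's lifetime, and the completed sample is logged at the end. A minimal model (the names and API here are illustrative; MononokeScubaSampleBuilder's real interface differs):

```rust
use std::collections::BTreeMap;

// Minimal model of a sample builder that accumulates key-value fields over a
// request's lifetime (illustrative, not MononokeScubaSampleBuilder's API).
#[derive(Default, Clone)]
struct SampleBuilder {
    fields: BTreeMap<String, String>,
}

impl SampleBuilder {
    // Each layer of the request adds its own fields before the sample is logged.
    fn add(&mut self, key: &str, value: impl ToString) -> &mut Self {
        self.fields.insert(key.to_string(), value.to_string());
        self
    }
}
```

In the real system, middleware adds request metadata first, then deeper layers contribute performance and outcome fields to the same sample.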

Verbosity Levels

The observability framework supports configurable verbosity (observability/):

Normal Level - Logs all normal operations. This is the default level for production traffic.

Verbose Level - Logs additional detail for debugging. Verbose logging can be enabled globally or selectively based on:

  • Session ID
  • Unix username
  • Source hostname (via regex)

The ObservabilityContext (in observability/src/context.rs) determines whether a sample should be logged based on its verbosity level and the current configuration. This allows detailed logging for specific users or sessions without overwhelming the logging system.
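The shape of that decision can be sketched roughly as follows. Field and method names are hypothetical, and the hostname-regex case is omitted for brevity:

```rust
use std::collections::HashSet;

// Hypothetical model of the verbosity decision; the real ObservabilityContext
// also matches source hostnames by regex, omitted here.
enum Level {
    Normal,
    Verbose,
}

struct VerbosityConfig {
    verbose_session_ids: HashSet<String>,
    verbose_unixnames: HashSet<String>,
}

impl VerbosityConfig {
    fn should_log(&self, sample_level: Level, session_id: &str, unixname: &str) -> bool {
        match sample_level {
            // Normal-level samples are always logged.
            Level::Normal => true,
            // Verbose samples are logged only for explicitly enabled sessions or users.
            Level::Verbose => {
                self.verbose_session_ids.contains(session_id)
                    || self.verbose_unixnames.contains(unixname)
            }
        }
    }
}
```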

Scuba Tables

Different components write to different Scuba tables:

  • mononoke_edenapi - EdenAPI protocol requests
  • mononoke_scs - SCS Thrift API calls
  • mononoke_git - Git protocol operations
  • mononoke_lfs - LFS operations
  • mononoke_walker - Walker validation and scrubbing
  • mononoke_backsyncer - Cross-repo sync events

Each table has component-specific fields reflecting the operation types and parameters.

Tracing

Mononoke uses the tracing crate for structured logging within components. Tracing provides hierarchical logging with spans and events.

Usage

Tracing calls are embedded throughout the code:

```rust
use tracing::{info, warn, debug, error};

info!("Starting derivation for changeset {}", cs_id);
warn!("Slow operation detected: {}ms", duration_ms);
debug!(keys = ?blob_keys, "Fetching blobs");
```

Spans group related operations:

```rust
use tracing::Instrument;

async fn derive_data(ctx: &CoreContext, cs_id: ChangesetId) -> Result<()> {
    async move {
        // derivation work
    }
    .instrument(tracing::info_span!("derive_data", ?cs_id))
    .await
}
```

Tracing integrates with the request context and can include correlation IDs for following a request through multiple services.

Log Levels

Error - Unexpected failures requiring attention

Warn - Recoverable issues or concerning patterns

Info - Normal operational events

Debug - Detailed diagnostic information

Log output is configured per deployment and can be directed to local logs or centralized logging systems.

Request Context and Tracking

The CoreContext (server/context/) serves as the request tracking mechanism.

Context Contents

Session Information (SessionContainer)

  • Session class (user, background, backup)
  • Client metadata
  • Permission checker
  • Identity set

Logging Container

  • Scuba sample builder
  • Performance counters stack
  • Scribe client
  • Sampling key

Metadata

  • Client request info (entry point, correlator)
  • Source hostname
  • Unix username

Context Flow

A context is created when a request enters Mononoke (typically in the protocol server). The context is cloned and passed through:

  1. Protocol handlers (EdenAPI, Git, SCS)
  2. API layer (mononoke_api/)
  3. Features (pushrebase, hooks, etc.)
  4. Repository facets
  5. Storage operations

Each layer can add fields to the Scuba sample or update performance counters. When the request completes, the accumulated data is logged.

Context Operations

Clone and Reset

```rust
let new_ctx = ctx.clone_and_reset();
```

Creates a new context with reset performance counters, useful for sub-operations.

Fork Performance Counters

```rust
let counters = ctx.fork_perf_counters();
```

Creates a snapshot of current performance counters for parallel operations.

Mutate Scuba Sample

```rust
let new_ctx = ctx.with_mutated_scuba(|scuba| {
    scuba.add("field_name", value)
});
```

Adds fields to the Scuba sample for the context.

Performance Counters

Performance counters track detailed operation metrics within a request. Counters are defined in server/context/src/perf_counters.rs.

Counter Types

Blobstore Operations

  • BlobGets, BlobPuts, BlobPresenceChecks
  • BlobGetsMaxLatency, BlobPutsMaxLatency
  • BlobGetsTotalSize, BlobPutsTotalSize
  • BlobGetsDeduplicated, BlobPutsDeduplicated

Caching

  • CachelibHits, CachelibMisses

Protocol-Specific

  • EdenapiFiles, EdenapiTrees
  • GetpackNumFiles, GettreepackNumTreepacks
  • GetbundleNumCommits, GetbundleNumManifests

Data Transfer

  • BytesSent

Performance counters are accumulated in the PerfCountersStack and can be nested for tracking operations within sub-operations. When a request completes, counters are exported to Scuba for analysis.
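A stripped-down model of such per-request counters, using atomics so they can be updated from concurrent tasks sharing the same context (illustrative only; the real PerfCountersStack supports many more counters plus nesting):

```rust
use std::sync::atomic::{AtomicI64, Ordering};

// Stripped-down per-request counters; atomics allow updates from concurrent
// tasks sharing the same context (illustrative, not the real PerfCountersStack).
#[derive(Default)]
struct PerfCounters {
    blob_gets: AtomicI64,
    blob_gets_total_size: AtomicI64,
}

impl PerfCounters {
    // Called by the blobstore layer on every get.
    fn record_blob_get(&self, size: i64) {
        self.blob_gets.fetch_add(1, Ordering::Relaxed);
        self.blob_gets_total_size.fetch_add(size, Ordering::Relaxed);
    }

    // Read out (count, total bytes) when the request completes, for Scuba export.
    fn snapshot(&self) -> (i64, i64) {
        (
            self.blob_gets.load(Ordering::Relaxed),
            self.blob_gets_total_size.load(Ordering::Relaxed),
        )
    }
}
```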

Counter Access

Operations access counters through the CoreContext:

```rust
let perf_counters = ctx.perf_counters();
```

The blobstore layer automatically updates blob operation counters. Protocol handlers update protocol-specific counters based on the data served.

Health Checks

Mononoke servers expose health check endpoints for monitoring and load balancing.

HTTP Health Checks

The main server (server/repo_listener/) responds to health check requests:

```
GET /             -> "I_AM_ALIVE"
GET /health_check -> "I_AM_ALIVE"
```

If the server is shutting down, these endpoints return "EXITING". Load balancers use these endpoints to determine which servers should receive traffic.
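The shutdown-aware response can be modeled with a single flag (an illustrative sketch, not the repo_listener code):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Illustrative shutdown-aware health response, not the repo_listener code.
fn health_response(shutting_down: &AtomicBool) -> &'static str {
    if shutting_down.load(Ordering::Relaxed) {
        "EXITING"
    } else {
        "I_AM_ALIVE"
    }
}
```

Flipping the flag at the start of graceful shutdown drains traffic: load balancers see "EXITING" and stop routing new requests to the instance.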

FB303 Service

Mononoke applications integrate with FB303 (cmdlib/mononoke_app/src/monitoring.rs), a service framework that provides:

Status Reporting

  • FbStatus::Alive - Server is ready
  • FbStatus::Starting - Server is initializing
  • FbStatus::Stopping - Server is shutting down

The ReadyFlagService implementation starts in the Starting state and transitions to Alive once initialization completes.
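That transition can be modeled with a single atomic flag (illustrative; the real ReadyFlagService implements the FB303 status interface and also reports Stopping during shutdown):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

#[derive(Debug, PartialEq)]
enum FbStatus {
    Starting,
    Alive,
}

// Illustrative model of the ready flag; the real ReadyFlagService implements
// the FB303 status interface and also reports Stopping during shutdown.
#[derive(Default)]
struct ReadyFlag {
    ready: AtomicBool,
}

impl ReadyFlag {
    // Flipped once server initialization completes.
    fn mark_ready(&self) {
        self.ready.store(true, Ordering::Relaxed);
    }

    fn status(&self) -> FbStatus {
        if self.ready.load(Ordering::Relaxed) {
            FbStatus::Alive
        } else {
            FbStatus::Starting
        }
    }
}
```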

Thrift Interface

FB303 exposes a Thrift interface on a configured port (via --fb303-thrift-port). This interface allows monitoring systems to:

  • Query server status
  • Retrieve counter values
  • Check build information

Prometheus Export

In fbcode builds, FB303 metrics can be exported in Prometheus format via the --prometheus-host-port flag. This enables integration with Prometheus-based monitoring stacks.

Monitoring Framework

The mononoke_app framework (cmdlib/mononoke_app/) initializes monitoring automatically. Applications using this framework receive:

  • FB303 service
  • Stats aggregation
  • Graceful shutdown handling
  • Health check integration

Common Patterns

Request Logging

Protocol servers use middleware to log requests:

Log Middleware (gotham_ext/src/middleware/log.rs)

Logs HTTP requests and responses:

```
IN  > GET /repo/trees
OUT < 200 150ms 1024bytes
```
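The OUT line above could be produced by a trivial formatter along these lines (a hypothetical helper, not the middleware's actual code):

```rust
// Hypothetical helper reproducing the OUT line format; not the middleware's code.
fn format_out_line(status: u16, duration_ms: u64, bytes: u64) -> String {
    format!("OUT < {} {}ms {}bytes", status, duration_ms, bytes)
}
```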

Scuba Middleware (gotham_ext/src/middleware/scuba.rs)

Constructs and logs Scuba samples for each request with timing, status, and metadata.

ODS Middleware (edenapi_service/src/middleware/ods.rs)

Updates ODS metrics for request duration, success/failure, and response size.

Operational Dashboards

Metrics exported to ODS are visualized in operational dashboards. Common dashboard categories:

Service Health

  • Request rate and error rate
  • P50, P95, P99 latency
  • Success vs. failure breakdown

Resource Usage

  • Blobstore operation rates
  • Cache hit rates
  • Bytes transferred

Feature-Specific

  • Pushrebase operations and conflicts
  • Derivation latency and backlog
  • Cross-repo sync lag

Background Jobs

  • Walker progress and error rates
  • Healer repair operations
  • Statistics collection status

Query Patterns

Scuba Queries

Scuba samples can be queried to analyze specific requests, debug failures, or identify performance patterns:

  • Filter by session ID to trace a specific client session
  • Filter by repository to analyze repository-specific behavior
  • Aggregate by endpoint to identify slow operations
  • Join with performance counters to correlate latency with blob operations

ODS Queries

ODS timeseries support aggregation and alerting:

  • Monitor P99 latency for SLA compliance
  • Track error rates for alerting
  • Compare metrics across deployments
  • Analyze capacity and scaling needs

Integration with Deployment

Monitoring configuration is specified via command-line arguments and configuration files:

Command-Line Flags

  • --fb303-thrift-port - Enable FB303 service
  • --prometheus-host-port - Export Prometheus metrics
  • --scuba-dataset - Scuba table name
  • --cache-mode - Caching configuration (affects cache metrics)

Configuration Files

Observability configuration (scm/mononoke/observability/observability_config) controls:

  • Scuba verbosity levels
  • Verbose sessions/usernames
  • Sampling rates

Configuration is loaded via cached_config and can be updated without restarting servers.
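The hot-reload behavior can be modeled as a shared, swappable handle: readers always see the latest snapshot, and an updater can swap in new values without restarting the process. This is an illustrative model; cached_config's real API differs:

```rust
use std::sync::{Arc, RwLock};

// Illustrative hot-reloadable config handle; cached_config's real API differs.
#[derive(Clone)]
struct ObservabilityConfig {
    verbose: bool,
}

#[derive(Clone)]
struct ConfigHandle(Arc<RwLock<ObservabilityConfig>>);

impl ConfigHandle {
    // Readers clone the current snapshot rather than holding the lock.
    fn get(&self) -> ObservabilityConfig {
        self.0.read().unwrap().clone()
    }

    // A background refresher swaps in new config without a server restart.
    fn update(&self, new: ObservabilityConfig) {
        *self.0.write().unwrap() = new;
    }
}
```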

Relationship to Architecture

The monitoring system reflects Mononoke's layered architecture:

Service Layer - HTTP middleware logs requests, updates per-endpoint metrics

API Layer - Scuba samples include high-level operation types

Feature Layer - Features log specific events (pushrebase conflicts, derivation completion)

Repository Layer - Facets update performance counters (blob operations, cache hits)

Storage Layer - Blobstore implementations track latency and throughput

This layering allows monitoring at multiple granularities, from high-level service health to detailed storage operations.