# 6.3 Monitoring and Observability
This document describes the monitoring and observability infrastructure in Mononoke. These systems provide visibility into server health, performance, and operational behavior.
Mononoke exports metrics, logs, and traces to support operational monitoring and debugging. The observability infrastructure is designed to handle high-volume production traffic while providing detailed diagnostics when needed.
The monitoring system is organized into several layers, described in the sections below.
## Metrics

Mononoke exports operational metrics using the `stats` crate. Metrics are exported to ODS for visualization and alerting.
### Counters and Timeseries

```rust
define_stats! {
    prefix = "mononoke.edenapi.request";
    total_requests: timeseries(Rate, Sum),
    requests: dynamic_timeseries("{}.requests", (method: String); Rate, Sum),
}
```
These track request rates, success counts, and error counts. Dynamic timeseries allow metrics to be broken down by dimensions like method name or repository.
### Histograms

```rust
files2_duration_ms: histogram(100, 0, 5000, Average, Sum, Count; P 50; P 75; P 95; P 99),
```
Histograms measure distributions of values such as request latency, response size, or operation duration. They report percentiles (P50, P95, P99) for understanding tail latency.
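To make the percentile exports concrete, here is a small self-contained sketch of a nearest-rank percentile calculation over a batch of latency samples. This is purely illustrative of what P50/P95/P99 represent; the real aggregation is performed inside the stats framework, not by code like this.

```rust
/// Nearest-rank percentile: the smallest sample such that `p` percent of
/// all samples are less than or equal to it. Illustrative only.
pub fn percentile(samples: &mut Vec<u64>, p: f64) -> u64 {
    assert!(!samples.is_empty() && p > 0.0 && p <= 100.0);
    samples.sort_unstable();
    let rank = ((p / 100.0) * samples.len() as f64).ceil() as usize;
    samples[rank - 1]
}

fn main() {
    // 100 request latencies: 1ms..=100ms.
    let mut latencies: Vec<u64> = (1..=100).collect();
    println!("P50 = {}ms", percentile(&mut latencies, 50.0));
    println!("P99 = {}ms", percentile(&mut latencies, 99.0));
}
```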
Metrics are defined in components throughout Mononoke:

- **Server Metrics** (`edenapi_service/src/middleware/ods.rs`, `git_server/`, `lfs_server/`)
- **Background Job Metrics** (`jobs/walker/`, `jobs/blobstore_healer/`)
- **Storage Metrics** (blobstore implementations)
- **Feature Metrics** (`pushrebase/`, `features/`)
Metrics are updated by calling methods on the generated STATS object:
```rust
STATS::total_requests.add_value(1);
STATS::files2_duration_ms.add_value(duration_ms);
STATS::failure_5xx.add_value(1, (method_name.to_string(),));
```
The stats framework aggregates these values and exports them to ODS. Dashboards and alerts are built on these metrics.
## Scuba Logging

Scuba is Mononoke's primary structured logging system. Scuba samples (log entries) contain key-value pairs describing operations, requests, and events.
Every Mononoke operation carries a CoreContext that includes a Scuba sample builder. The context flows through the request lifecycle, accumulating fields:
```rust
pub struct CoreContext {
    fb: FacebookInit,
    session: SessionContainer,
    logging: LoggingContainer,
}
```
The logging container holds:

- `MononokeScubaSampleBuilder` - Accumulates fields for the final Scuba sample

Scuba samples typically include:

- Request Metadata
- Performance Data
- Outcome Information
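The accumulation pattern can be sketched as a builder that collects key-value fields over the life of a request. This is an illustrative stand-in for `MononokeScubaSampleBuilder`; the field names below are examples, not a fixed schema.

```rust
use std::collections::HashMap;

/// Illustrative sketch of accumulating key-value fields into a structured
/// log sample. The real builder writes to Scuba; here we just render the
/// sample as a string.
#[derive(Default)]
pub struct SampleBuilder {
    fields: HashMap<String, String>,
}

impl SampleBuilder {
    pub fn add(&mut self, key: &str, value: impl ToString) -> &mut Self {
        self.fields.insert(key.to_string(), value.to_string());
        self
    }

    /// "Log" the sample by rendering its fields deterministically.
    pub fn log(&self) -> String {
        let mut pairs: Vec<String> = self
            .fields
            .iter()
            .map(|(k, v)| format!("{}={}", k, v))
            .collect();
        pairs.sort(); // stable output for display
        pairs.join(" ")
    }
}

fn main() {
    let mut sample = SampleBuilder::default();
    sample
        .add("method", "gettreepack") // example request metadata
        .add("duration_ms", 150)      // example performance data
        .add("status", 200);          // example outcome information
    println!("{}", sample.log());
}
```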
The observability framework supports configurable verbosity (`observability/`):

- **Normal Level** - Logs all normal operations. This is the default level for production traffic.
- **Verbose Level** - Logs additional detail for debugging. Verbose logging can be enabled globally or selectively, for example for specific users or sessions.
The ObservabilityContext (in observability/src/context.rs) determines whether a sample should be logged based on its verbosity level and the current configuration. This allows detailed logging for specific users or sessions without overwhelming the logging system.
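The gating decision described above can be sketched as a simple level comparison. This is a hypothetical simplification; the real `ObservabilityContext` in `observability/src/context.rs` is driven by dynamic configuration and richer selection criteria.

```rust
/// Verbosity levels, ordered: Normal < Verbose.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum ScubaVerbosityLevel {
    Normal,
    Verbose,
}

/// Hypothetical sketch: holds the highest level this deployment is
/// currently willing to log.
pub struct ObservabilityContext {
    enabled_level: ScubaVerbosityLevel,
}

impl ObservabilityContext {
    pub fn new(enabled_level: ScubaVerbosityLevel) -> Self {
        Self { enabled_level }
    }

    /// A sample is logged only if its level is at or below the enabled level,
    /// so verbose samples are dropped under the default production config.
    pub fn should_log(&self, sample_level: ScubaVerbosityLevel) -> bool {
        sample_level <= self.enabled_level
    }
}

fn main() {
    let ctx = ObservabilityContext::new(ScubaVerbosityLevel::Normal);
    assert!(ctx.should_log(ScubaVerbosityLevel::Normal));
    assert!(!ctx.should_log(ScubaVerbosityLevel::Verbose));
}
```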
Different components write to different Scuba tables.
Each table has component-specific fields reflecting the operation types and parameters.
## Tracing

Mononoke uses the `tracing` crate for structured logging within components. Tracing provides hierarchical logging with spans and events.
Tracing calls are embedded throughout the code:
```rust
use tracing::{info, warn, debug, error};

info!("Starting derivation for changeset {}", cs_id);
warn!("Slow operation detected: {}ms", duration_ms);
debug!(keys = ?blob_keys, "Fetching blobs");
```
Spans group related operations:
```rust
use tracing::Instrument;

async fn derive_data(ctx: &CoreContext, cs_id: ChangesetId) -> Result<()> {
    async move {
        // derivation work
        Ok(())
    }
    .instrument(tracing::info_span!("derive_data", ?cs_id))
    .await
}
```
Tracing integrates with the request context and can include correlation IDs for following a request through multiple services.
Log levels follow standard severity conventions:

- **Error** - Unexpected failures requiring attention
- **Warn** - Recoverable issues or concerning patterns
- **Info** - Normal operational events
- **Debug** - Detailed diagnostic information
Log output is configured per deployment and can be directed to local logs or centralized logging systems.
## Request Context

The CoreContext (`server/context/`) serves as the request tracking mechanism. It carries:

- **Session Information** (`SessionContainer`)
- **Logging Container**
- **Metadata**
A context is created when a request enters Mononoke (typically in the protocol server). The context is cloned and passed through:
the layers of the stack (including `mononoke_api/`).

Each layer can add fields to the Scuba sample or update performance counters. When the request completes, the accumulated data is logged.
**Clone and Reset**

```rust
let new_ctx = ctx.clone_and_reset();
```

Creates a new context with reset performance counters, useful for sub-operations.

**Fork Performance Counters**

```rust
let counters = ctx.fork_perf_counters();
```

Creates a snapshot of current performance counters for parallel operations.

**Mutate Scuba Sample**

```rust
let new_ctx = ctx.with_mutated_scuba(|scuba| {
    scuba.add("field_name", value)
});
```

Adds fields to the Scuba sample for the context.
## Performance Counters

Performance counters track detailed operation metrics within a request. Counters are defined in `server/context/src/perf_counters.rs`.
**Blobstore Operations**

- `BlobGets`, `BlobPuts`, `BlobPresenceChecks`
- `BlobGetsMaxLatency`, `BlobPutsMaxLatency`
- `BlobGetsTotalSize`, `BlobPutsTotalSize`
- `BlobGetsDeduplicated`, `BlobPutsDeduplicated`

**Caching**

- `CachelibHits`, `CachelibMisses`

**Protocol-Specific**

- `EdenapiFiles`, `EdenapiTrees`
- `GetpackNumFiles`, `GettreepackNumTreepacks`
- `GetbundleNumCommits`, `GetbundleNumManifests`

**Data Transfer**

- `BytesSent`

Performance counters are accumulated in the PerfCountersStack and can be nested for tracking operations within sub-operations. When a request completes, counters are exported to Scuba for analysis.
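The nesting behaviour can be sketched as a stack of counter frames where an increment is visible to every enclosing frame, so a parent operation's totals include its sub-operations. This is a hypothetical simplification of `PerfCountersStack`, which uses typed counters rather than string keys.

```rust
use std::collections::HashMap;

/// Hypothetical sketch of nested performance counters. Innermost frame is
/// last; increments apply to every frame so parents see sub-operation counts.
pub struct PerfCountersStack {
    frames: Vec<HashMap<&'static str, i64>>,
}

impl PerfCountersStack {
    pub fn new() -> Self {
        Self { frames: vec![HashMap::new()] }
    }

    /// Push a fresh frame for a sub-operation.
    pub fn fork(&mut self) {
        self.frames.push(HashMap::new());
    }

    /// Pop the sub-operation frame and return its counters.
    pub fn finish(&mut self) -> HashMap<&'static str, i64> {
        self.frames.pop().unwrap_or_default()
    }

    pub fn increment(&mut self, name: &'static str, delta: i64) {
        for frame in &mut self.frames {
            *frame.entry(name).or_insert(0) += delta;
        }
    }

    /// Read a counter from the innermost frame.
    pub fn get(&self, name: &'static str) -> i64 {
        self.frames.last().and_then(|f| f.get(name)).copied().unwrap_or(0)
    }
}

fn main() {
    let mut stack = PerfCountersStack::new();
    stack.increment("BlobGets", 1);
    stack.fork(); // enter a sub-operation
    stack.increment("BlobGets", 2);
    let sub = stack.finish(); // sub-operation saw only its own 2 gets
    assert_eq!(sub["BlobGets"], 2);
    assert_eq!(stack.get("BlobGets"), 3); // request-level frame saw all 3
}
```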
Operations access counters through the CoreContext:
```rust
let perf_counters = ctx.perf_counters();
```
The blobstore layer automatically updates blob operation counters. Protocol handlers update protocol-specific counters based on the data served.
## Health Checks

Mononoke servers expose health check endpoints for monitoring and load balancing.
The main server (server/repo_listener/) responds to health check requests:
```
GET /             -> "I_AM_ALIVE"
GET /health_check -> "I_AM_ALIVE"
```
If the server is shutting down, these endpoints return "EXITING". Load balancers use these endpoints to determine which servers should receive traffic.
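The contract above is small enough to sketch directly. The response strings come from the document; the function itself is an illustrative stand-in, not the real handler in `server/repo_listener/`.

```rust
/// Returns the body served on `/` and `/health_check`, given whether the
/// server has begun shutting down. Illustrative sketch only.
pub fn health_response(shutting_down: bool) -> &'static str {
    if shutting_down {
        // Load balancers see this and drain traffic away from the host.
        "EXITING"
    } else {
        "I_AM_ALIVE"
    }
}

fn main() {
    assert_eq!(health_response(false), "I_AM_ALIVE");
    assert_eq!(health_response(true), "EXITING");
}
```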
## FB303 Integration

Mononoke applications integrate with FB303 (`cmdlib/mononoke_app/src/monitoring.rs`), a service framework that provides:
**Status Reporting**

- `FbStatus::Alive` - Server is ready
- `FbStatus::Starting` - Server is initializing
- `FbStatus::Stopping` - Server is shutting down

The ReadyFlagService implementation starts in the Starting state and transitions to Alive once initialization completes.
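The state transitions can be sketched as follows. This is a hypothetical simplification of ReadyFlagService; the method names here are illustrative.

```rust
/// FB303 status values, per the document.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum FbStatus {
    Starting,
    Alive,
    Stopping,
}

/// Hypothetical sketch: starts in Starting, moves to Alive once
/// initialization completes, and to Stopping when shutdown begins.
pub struct ReadyFlagService {
    status: FbStatus,
}

impl ReadyFlagService {
    pub fn new() -> Self {
        Self { status: FbStatus::Starting }
    }

    /// Called once initialization has finished.
    pub fn set_ready(&mut self) {
        self.status = FbStatus::Alive;
    }

    /// Called when shutdown begins.
    pub fn set_stopping(&mut self) {
        self.status = FbStatus::Stopping;
    }

    pub fn status(&self) -> FbStatus {
        self.status
    }
}

fn main() {
    let mut svc = ReadyFlagService::new();
    assert_eq!(svc.status(), FbStatus::Starting);
    svc.set_ready();
    assert_eq!(svc.status(), FbStatus::Alive);
    svc.set_stopping();
    assert_eq!(svc.status(), FbStatus::Stopping);
}
```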
**Thrift Interface**

FB303 exposes a Thrift interface on a configured port (via `--fb303-thrift-port`). Monitoring systems use this interface to query server status and exported counters.
**Prometheus Export**
In fbcode builds, FB303 metrics can be exported to Prometheus format via the --prometheus-host-port flag. This enables integration with Prometheus-based monitoring stacks.
The mononoke_app framework (`cmdlib/mononoke_app/`) initializes this monitoring automatically for applications built on it.
## Request Middleware

Protocol servers use middleware to log requests:
**Log Middleware** (`gotham_ext/src/middleware/log.rs`)

Logs HTTP requests and responses:

```
IN  > GET /repo/trees
OUT < 200 150ms 1024bytes
```

**Scuba Middleware** (`gotham_ext/src/middleware/scuba.rs`)

Constructs and logs Scuba samples for each request with timing, status, and metadata.

**ODS Middleware** (`edenapi_service/src/middleware/ods.rs`)

Updates ODS metrics for request duration, success/failure, and response size.
Metrics exported to ODS are visualized in operational dashboards. Common dashboard categories:
- Service Health
- Resource Usage
- Feature-Specific
- Background Jobs
**Scuba Queries** - Scuba samples can be queried to analyze specific requests, debug failures, or identify performance patterns.

**ODS Queries** - ODS timeseries support aggregation and alerting.
## Configuration

Monitoring configuration is specified via command-line arguments and configuration files:
**Command-Line Flags**

- `--fb303-thrift-port` - Enable FB303 service
- `--prometheus-host-port` - Export Prometheus metrics
- `--scuba-dataset` - Scuba table name
- `--cache-mode` - Caching configuration (affects cache metrics)

**Configuration Files**
Observability configuration (`scm/mononoke/observability/observability_config`) controls settings such as Scuba logging verbosity.
Configuration is loaded via cached_config and can be updated without restarting servers.
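The hot-reload behaviour can be sketched as a shared handle whose readers always see the latest snapshot. This is a hypothetical illustration in the spirit of `cached_config`; all names below are assumptions.

```rust
use std::sync::{Arc, RwLock};

/// Example config payload; field name is illustrative.
#[derive(Clone, Debug, PartialEq)]
pub struct ObservabilityConfig {
    pub verbose_logging_enabled: bool,
}

/// Hypothetical sketch: a cloneable handle that a background refresher
/// updates in place, so long-lived readers never need a restart.
#[derive(Clone)]
pub struct ConfigHandle {
    inner: Arc<RwLock<ObservabilityConfig>>,
}

impl ConfigHandle {
    pub fn new(initial: ObservabilityConfig) -> Self {
        Self { inner: Arc::new(RwLock::new(initial)) }
    }

    /// Called by the refresher when the source config changes.
    pub fn update(&self, new: ObservabilityConfig) {
        *self.inner.write().unwrap() = new;
    }

    /// Readers take a snapshot of the current config.
    pub fn get(&self) -> ObservabilityConfig {
        self.inner.read().unwrap().clone()
    }
}

fn main() {
    let handle = ConfigHandle::new(ObservabilityConfig { verbose_logging_enabled: false });
    let reader = handle.clone(); // e.g. held by a request handler
    assert!(!reader.get().verbose_logging_enabled);

    // Config updated at runtime; the reader sees it without a restart.
    handle.update(ObservabilityConfig { verbose_logging_enabled: true });
    assert!(reader.get().verbose_logging_enabled);
}
```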
## Layered Monitoring

The monitoring system reflects Mononoke's layered architecture:
- **Service Layer** - HTTP middleware logs requests, updates per-endpoint metrics
- **API Layer** - Scuba samples include high-level operation types
- **Feature Layer** - Features log specific events (pushrebase conflicts, derivation completion)
- **Repository Layer** - Facets update performance counters (blob operations, cache hits)
- **Storage Layer** - Blobstore implementations track latency and throughput
This layering allows monitoring at multiple granularities, from high-level service health to detailed storage operations.