Rate Limiting and Load Shedding

eden/mononoke/docs/6.1-rate-limiting-and-load-shedding.md

This document describes Mononoke's rate limiting and load shedding mechanisms, which control request load to maintain system stability at scale.

Introduction

Mononoke serves source control traffic for repositories with thousands of commits per day and hundreds of thousands of clients. Without load control, individual clients or groups of clients can overwhelm system resources, degrading service for all users. Rate limiting and load shedding provide mechanisms to control this load.

These mechanisms operate at different layers of the system: rate limiting controls individual client behavior based on request patterns, while load shedding responds to system-wide resource pressure. Together, they maintain service availability under varying load conditions.

Rate Limiting vs. Load Shedding

Rate limiting and load shedding serve distinct purposes:

Rate Limiting

Rate limiting restricts the number of requests from specific clients within a time window. Limits are configured based on metrics tracked per client, such as request rate or egress bandwidth. When a client exceeds its limit, additional requests are rejected with HTTP 429 (Too Many Requests) responses.

Characteristics:

  • Targets specific clients or groups based on identity
  • Measured over configured time windows
  • Applied before client behavior impacts the system
  • Configuration specifies which metrics to track and threshold values

Example: Reject requests from a specific client identity when that client's EdenAPI request rate exceeds 1 million requests per second globally.

Load Shedding

Load shedding rejects requests when system resources reach critical thresholds. Rather than targeting individual clients, load shedding responds to resource utilization (CPU, memory, queue depth) across the service. When load shedding activates, requests are rejected based on priority, with lower-priority traffic dropped first.

Characteristics:

  • Responds to system-wide resource metrics
  • Activates when CPU, memory, or other resources exceed thresholds
  • Can prioritize traffic by dropping lower-priority requests first
  • Protects system stability during overload conditions

Example: When regional CPU utilization exceeds 95%, reject all incoming requests until utilization returns to acceptable levels.

Rate Limiting Framework

The rate limiting framework is implemented in rate_limiting/ and integrated into request handling middleware.

Tracked Metrics

Rate limits can be configured for various metrics:

Request Metrics:

  • EdenApiQps - Requests per second to the EdenAPI service
  • LocationToHashCount - Commit location-to-hash queries

Data Transfer Metrics:

  • EgressBytes - Bytes sent to clients
  • TotalManifests - Manifest objects served
  • GetpackFiles - File objects served via getpack

Commit Metrics:

  • Commits - Commits created or modified
  • CommitsPerAuthor - Commits attributed to a specific author
  • CommitsPerUser - Commits from a specific user identity

Each metric is tracked over a configured time window (e.g., 1 second, 1 minute) to determine if limits are exceeded.
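A per-metric window like this can be sketched as a minimal fixed-window counter. The struct and method names below are illustrative, not Mononoke's internal types, which track many metrics per client:

```rust
use std::time::{Duration, Instant};

/// A fixed-window counter sketch for tracking one metric over a
/// configured time window. Names here are assumptions for illustration.
struct WindowedCounter {
    window: Duration,
    window_start: Instant,
    count: u64,
}

impl WindowedCounter {
    fn new(window: Duration, now: Instant) -> Self {
        Self { window, window_start: now, count: 0 }
    }

    /// Record one event; the count resets when the window rolls over.
    fn bump(&mut self, now: Instant) {
        if now.duration_since(self.window_start) >= self.window {
            self.window_start = now;
            self.count = 0;
        }
        self.count += 1;
    }

    /// True when the tracked value exceeds a configured threshold.
    fn over_limit(&self, limit: u64) -> bool {
        self.count > limit
    }
}
```

A production limiter would likely use a sliding window or token bucket to avoid boundary spikes at window edges; a fixed window is shown only because it is the simplest form of the idea.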

Scope Configuration

Rate limits can apply at different scopes:

Global Scope - Limits apply across all regions and server instances. A client exceeding the global limit is throttled regardless of which region or server handles the request. Global limits require coordination across server instances.

Regional Scope - Limits apply within a single geographic region or data center. A client might be throttled in one region while remaining under limits in another region. Regional limits allow independent capacity management.

Target Specification

Rate limits target specific clients or groups:

Main Client ID - Identifies individual client applications or services. Each client is assigned a unique identifier used for quota tracking. The main client ID can include additional context such as Sandcastle job alias for CI traffic.

Identity Set - Targets clients matching a set of identities. Identities include service identity (e.g., SERVICE_IDENTITY:quicksand), machine tier (e.g., MACHINE_TIER:sandcastle), or other attributes. All specified identities must be present for the rule to apply.

Static Slice - Applies limits to a percentage of clients based on consistent hashing of client hostname. This allows gradual rollout of new limits by targeting an increasing slice of traffic.

When no target is specified, the rate limit applies to all traffic.
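Static-slice membership via consistent hashing can be sketched as follows. The hash function and the 100-bucket space are assumptions for illustration; the key property is that a given hostname always lands in the same bucket, so raising the percentage only adds hosts to the slice:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Returns true when `hostname` falls inside a static slice covering
/// `pct` percent of clients. Deterministic: the same host always maps
/// to the same bucket, so slice membership is stable across requests.
fn in_static_slice(hostname: &str, pct: u64) -> bool {
    let mut h = DefaultHasher::new();
    hostname.hash(&mut h);
    (h.finish() % 100) < pct
}
```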

Configuration Structure

Rate limits are configured using Thrift-based configuration files that can be updated dynamically. Each rate limit specifies:

  • Target - Which clients the limit applies to (optional, defaults to all clients)
  • Metric - What to measure (EdenApiQps, EgressBytes, etc.)
  • Window - Time duration for measurement (in seconds)
  • Scope - Global or Regional
  • Limit - Threshold value above which requests are rejected
  • Status - Whether the limit is enabled or disabled

Configuration is loaded via cached config handles and updates are applied without service restart. Configuration files define arrays of rate limits, allowing multiple limits with different targets and metrics to be active simultaneously.
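The fields above might map to a structure like the following. This is an illustrative mirror only; the real schema is defined in Thrift, and the field and variant names here are assumptions:

```rust
/// Illustrative rate limit configuration mirroring the fields listed
/// above. Not the actual Thrift-generated types.
#[derive(Debug, Clone)]
enum Scope {
    Global,
    Regional,
}

#[derive(Debug, Clone)]
enum Metric {
    EdenApiQps,
    EgressBytes,
    Commits,
}

#[derive(Debug, Clone)]
struct RateLimitConfig {
    /// Which clients the limit applies to; None means all traffic.
    target: Option<String>,
    /// What to measure.
    metric: Metric,
    /// Measurement window in seconds.
    window_secs: u64,
    /// Global or Regional.
    scope: Scope,
    /// Threshold above which requests are rejected.
    limit: f64,
    /// Whether the limit is enabled.
    enabled: bool,
}
```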

Request Flow

Rate limiting is checked at the beginning of request processing, in the middleware layer before expensive operations are performed:

  1. Extract client metadata - The request handler extracts the client's main ID and identity set from connection metadata
  2. Check applicable limits - The rate limiter identifies which configured limits apply to this client based on target matching
  3. Query metric values - Current metric values are retrieved from in-memory counters or external metrics systems
  4. Compare to thresholds - Each applicable limit's current value is compared to its configured threshold
  5. Accept or reject - If any limit is exceeded, the request is rejected with HTTP 429; otherwise, processing continues
  6. Update counters - Accepted requests update relevant metric counters for future limit checks

This early rejection avoids consuming resources for requests that will be dropped, and the 429 response code triggers exponential backoff in Sapling clients.
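Steps 2 through 5 of the flow above reduce to a comparison loop like this sketch, assuming the applicable limits and a lookup for current metric values have already been resolved for the client (names are illustrative):

```rust
/// One applicable limit for the current client.
struct Limit {
    metric: &'static str,
    threshold: u64,
}

/// Returns Ok(()) to continue processing, or Err(429) when any
/// applicable limit is exceeded, so rejection happens before any
/// expensive work.
fn check_limits(limits: &[Limit], current: impl Fn(&str) -> u64) -> Result<(), u16> {
    for limit in limits {
        // Step 4: compare the tracked value against the threshold.
        if current(limit.metric) > limit.threshold {
            // Step 5: reject early with HTTP 429.
            return Err(429);
        }
    }
    Ok(())
}
```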

Load Shedding Framework

Load shedding is implemented alongside rate limiting in the rate_limiting/ component and uses similar configuration structures.

Load Shedding Metrics

Load shedding decisions are based on system resource metrics:

Local Counters - Metrics exposed by the Mononoke process itself via the fb303 interface, such as CPU utilization, memory usage, request queue depth, or thread pool saturation. These are prefixed with mononoke. and available without external queries.

External Counters - Metrics from external systems such as ODS (Operational Data Store), including storage backend latency, database connection pool utilization, or cache hit rates. External counters require periodic queries to external metrics systems, with results cached locally.

The choice between local and external counters depends on what resource triggers load shedding. Backend storage pressure might use external counters for storage system metrics, while frontend overload uses local CPU and memory metrics.

Load Shedding Configuration

Each load shed limit specifies:

  • Target - Which clients are shed when the limit is exceeded (optional)
  • Metric - Resource metric to monitor (local or external counter)
  • Threshold - Value above which load shedding activates
  • Status - Whether the limit is enabled or disabled

Multiple load shed limits can be configured with different targets, allowing progressive shedding. Lower-priority traffic is shed at lower thresholds, while critical traffic is only shed when resources are critically exhausted.

Example configuration:

  • At 90% CPU: Shed requests from SERVICE_IDENTITY:quicksand
  • At 95% CPU: Shed all requests

This progressive approach preserves service for high-priority traffic while protecting the system from complete failure.
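The progressive example above can be sketched as an ordered set of rules, each pairing a threshold with an optional identity target (None sheds all traffic). The rule shape and identity strings are illustrative assumptions:

```rust
/// One load shed rule: shed matching clients once CPU crosses the
/// threshold. `target: None` means the rule applies to everyone.
struct ShedRule {
    cpu_threshold: f64,
    target: Option<&'static str>,
}

/// True when any enabled rule's threshold is exceeded and its target
/// matches the request's identities.
fn should_shed(rules: &[ShedRule], cpu: f64, identities: &[&str]) -> bool {
    rules.iter().any(|rule| {
        cpu >= rule.cpu_threshold
            && match rule.target {
                // Targeted rule: shed only matching identities.
                Some(id) => identities.contains(&id),
                // Untargeted rule: shed all traffic.
                None => true,
            }
    })
}
```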

Load Shedding Decision Process

Load shedding is checked for each request, similar to rate limiting:

  1. Fetch current metrics - Resource utilization metrics are retrieved from local counters or external metrics cache
  2. Evaluate thresholds - Each configured load shed limit is checked to determine if its threshold is exceeded
  3. Identify targets - For exceeded thresholds, the configured target is checked against the request's client metadata
  4. Shed or allow - If a matching load shed limit is exceeded, the request is rejected; otherwise, processing continues

Shed requests receive the same HTTP 429 response as rate-limited requests, triggering client backoff behavior.

Load Limiter Service

The Load Limiter service (facebook/load_limiter/) is an external microservice that coordinates load-aware job scheduling for continuous integration systems.

Integration with Job Scheduling

The Load Limiter implements the Jupiter External Dependencies interface, which allows jobs to block on arbitrary external conditions. When load is high, jobs wait in queue rather than executing and failing due to service overload.

The service implements three operations:

acquire - Attempts to acquire the dependency. Returns success if load is acceptable, or blocked if load exceeds thresholds. Jobs receiving blocked responses remain in the scheduler queue rather than executing.

canAcquire - Checks whether the dependency could be acquired without actually acquiring it. Used by schedulers to predict wait times.

release - Releases the dependency when a job completes. For source control dependencies, this is typically a no-op since Mononoke load is not quota-based.

Load-Aware Scheduling

The Load Limiter monitors Mononoke metrics (CPU utilization, request rates, error rates) and returns blocked responses when these metrics exceed configured thresholds. The relationship between Mononoke load and Load Limiter responses:

| Avg CPU Utilization | Mononoke Response     | Load Limiter Response | Job Outcome                              |
| ------------------- | --------------------- | --------------------- | ---------------------------------------- |
| 80%                 | 200 (success)         | Acquired              | Job executes successfully                |
| 97%                 | Intermittent 200/429  | Blocked               | Job queues, executes when load decreases |
| 99%                 | 429 (rate limited)    | Blocked               | Job queues until system recovers         |

When jobs are blocked, they remain in the scheduler queue rather than consuming worker resources and failing. This reduces the number of failed CI jobs caused by transient Mononoke overload.
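The acquire decision in the table above amounts to a threshold check on the monitored metrics. A minimal sketch, assuming a single averaged CPU metric and a configured blocking threshold (the enum and function names are illustrative, not the Jupiter interface's actual types):

```rust
/// Simplified Load Limiter response: acquired jobs execute, blocked
/// jobs stay in the scheduler queue instead of failing.
#[derive(Debug, PartialEq)]
enum AcquireResponse {
    Acquired,
    Blocked,
}

/// Block acquisition once monitored CPU reaches the configured
/// threshold (e.g. 95.0); below it, the job may run.
fn acquire(avg_cpu_pct: f64, block_threshold: f64) -> AcquireResponse {
    if avg_cpu_pct >= block_threshold {
        AcquireResponse::Blocked
    } else {
        AcquireResponse::Acquired
    }
}
```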

Configuration and Monitoring

The Load Limiter uses separate configuration (scm/mononoke/load_limiter/load_limiter) specifying which Mononoke metrics to monitor and threshold values for blocking jobs. Configuration includes:

  • Metrics to monitor - ODS counters for CPU, QPS, error rates
  • Blocking thresholds - Values above which jobs are blocked
  • Query intervals - How frequently to poll metrics

The service logs decisions to Scuba (mononoke_load_limiter dataset) for analysis of blocking patterns and load correlation.

Blobstore-Level Rate Limiting

In addition to request-level rate limiting, Mononoke can throttle blobstore operations using the ThrottledBlob decorator.

ThrottledBlob Implementation

ThrottledBlob (blobstore/throttledblob/) wraps a blobstore and applies rate limiting to individual operations. It can throttle on two dimensions:

Operations per Second (QPS):

  • Read QPS limit - Maximum get operations per second
  • Write QPS limit - Maximum put operations per second

Bytes per Second:

  • Read bytes limit - Maximum bytes read per second
  • Write bytes limit - Maximum bytes written per second
  • Burst allowance - Temporary exceeding of limits for large blobs

Throttling uses token bucket rate limiting (via the governor crate) with jitter to prevent thundering herd effects when multiple operations wait for tokens.
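The token bucket mechanism can be sketched as follows. This is a hand-rolled illustration rather than the governor crate's API, and it omits the jitter applied to waits; capacity and refill rate map to the QPS or bytes-per-second limits above, with burst allowance corresponding to a bucket capacity above the steady refill rate:

```rust
use std::time::{Duration, Instant};

/// Minimal token bucket: tokens refill continuously at
/// `refill_per_sec` up to `capacity`; each operation takes one token.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    fn new(capacity: f64, refill_per_sec: f64, now: Instant) -> Self {
        Self { capacity, tokens: capacity, refill_per_sec, last_refill: now }
    }

    /// Refill based on elapsed time, then try to take one token.
    /// Returns false when the caller must wait (the real implementation
    /// sleeps with jitter to avoid thundering herds).
    fn try_acquire(&mut self, now: Instant) -> bool {
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```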

Use Cases

Blobstore throttling is used to:

Protect Storage Backends - Limit load on underlying storage systems (SQL, S3, Manifold) to prevent overwhelming them during traffic spikes or bulk operations.

Control Resource Consumption - Limit network bandwidth consumption when multiple repositories share network capacity.

Test Resilience - Simulate slow storage during testing by applying artificial throttling to verify timeout and retry behavior.

Throttled blobstores appear in the blobstore decorator stack between the application and storage backend, applying limits before requests reach the underlying storage.

Client Behavior

Sapling and EdenFS clients implement retry behavior for rate limiting and load shedding responses:

429 Responses (Rate Limit/Load Shed):

  • Trigger exponential backoff with a base of 2 seconds
  • Maximum backoff of 2^3 = 8 seconds
  • After maximum retries, the operation fails and reports an error to the user

Other Error Responses:

  • Trigger linear retries increasing by 1 second per attempt
  • The different retry behavior reflects the lower expected success rate when retrying non-throttling errors
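The two retry schedules described above can be sketched as follows, assuming a 1-based attempt number (the function names are illustrative, not the client's actual code):

```rust
/// Delay before retry `attempt` after a 429: exponential with base 2,
/// capped at 2^3 = 8 seconds, giving 2, 4, 8, 8, ... seconds.
fn throttle_backoff_secs(attempt: u32) -> u64 {
    2u64.pow(attempt.min(3))
}

/// Delay before retry `attempt` after other errors: linear, increasing
/// by 1 second per attempt, giving 1, 2, 3, ... seconds.
fn error_backoff_secs(attempt: u32) -> u64 {
    attempt as u64
}
```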

Data Preservation:

  • Successfully fetched data is retained locally even if subsequent operations fail
  • Partial clone/fetch operations leave the repository in a usable state with whatever data was retrieved

EdenFS Behavior:

  • File access operations that exceed retries result in filesystem errors (ENOENT, EIO)
  • Checkout operations fail if files cannot be fetched, preventing incomplete checkouts in CI environments

This client behavior provides automatic recovery from transient overload while preventing indefinite hangs.

Configuration and Deployment

Rate limiting and load shedding configuration is managed through Mononoke's repository configuration system:

Configuration Storage:

  • Rate limits defined in Thrift-based configuration files
  • Loaded via ConfigStore with cached config handles
  • Updates applied dynamically without service restart

Configuration Validation:

  • Configuration is validated during loading
  • Invalid targets or metrics are rejected
  • Deployment errors are detected before configuration takes effect

Monitoring Integration:

  • Rate limit decisions logged to Scuba for analysis
  • ODS counters track reject rates per limit
  • Dashboards show which limits are actively rejecting traffic

Configuration updates follow standard Mononoke configuration deployment practices, with changes committed to version control and deployed through the configuration service.

Implementation Details

The rate limiting implementation consists of several components:

RateLimiter Trait (rate_limiting/src/lib.rs)

  • Defines the interface for checking rate limits and load shedding
  • check_rate_limit - Verifies if a request exceeds configured limits
  • check_load_shed - Verifies if system load requires shedding
  • bump_load - Updates metric counters for accepted requests
  • find_rate_limit - Retrieves applicable limits for a client

RateLimitEnvironment (rate_limiting/src/lib.rs)

  • Holds rate limiting state: configuration, counter manager, category
  • Provides factory method for creating rate limiters with appropriate configuration
  • Manages ODS counter manager for external metric queries

Configuration Handling (rate_limiting/src/config.rs)

  • Deserializes Thrift configuration into internal types
  • Validates target specifications and metric definitions
  • Converts between configuration format and runtime structures

Middleware Integration (server/repo_listener/src/request_handler.rs)

  • Checks load shedding at request start, before expensive operations
  • Logs shedding decisions to Scuba with client metadata
  • Returns 429 responses for shed requests

Counter Management (rate_limiting/src/facebook.rs for internal builds)

  • Maintains in-memory counters for tracked metrics
  • Periodically fetches external ODS counters
  • Provides counter values for limit comparison

The OSS build (rate_limiting/src/oss.rs) provides a minimal implementation that always allows requests, as full rate limiting depends on internal infrastructure.
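A much-simplified shape of that interface, together with an allow-all implementation mirroring the OSS build's behavior, might look like this. The real trait is async and carries richer context, so these signatures are assumptions built only from the method names listed above (`find_rate_limit` is omitted for brevity):

```rust
/// Simplified sketch of the rate limiter interface; not the actual
/// trait from rate_limiting/src/lib.rs.
trait RateLimiter {
    fn check_rate_limit(&self, client_id: &str) -> Result<(), String>;
    fn check_load_shed(&self, client_id: &str) -> Result<(), String>;
    fn bump_load(&mut self, client_id: &str, metric: &str, delta: f64);
}

/// Mirrors the OSS build: always allow, never shed, counters unused.
struct AllowAll;

impl RateLimiter for AllowAll {
    fn check_rate_limit(&self, _client_id: &str) -> Result<(), String> {
        Ok(())
    }
    fn check_load_shed(&self, _client_id: &str) -> Result<(), String> {
        Ok(())
    }
    fn bump_load(&mut self, _client_id: &str, _metric: &str, _delta: f64) {}
}
```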

Several Mononoke components interact with rate limiting:

Request Handlers - All frontend servers (Mononoke, SCS, Git, LFS) check rate limits in their request handling middleware before processing requests.

Repository Facets - The repo_permission_checker facet is consulted alongside rate limiting to enforce both authorization and load control.

Monitoring Infrastructure - Scuba logging and ODS counters provide visibility into rate limiting decisions and load patterns.

Configuration System - Cached config handles enable dynamic rate limit updates without service restart.

Summary

Mononoke's rate limiting and load shedding mechanisms operate at multiple layers to control request load:

Request-Level Rate Limiting - Tracks per-client metrics (QPS, egress bytes, commits) over time windows, rejecting requests from clients exceeding configured thresholds. Applies at global or regional scope.

System-Level Load Shedding - Monitors resource utilization (CPU, memory, queue depth) and sheds requests when thresholds are exceeded, with progressive shedding based on request priority.

Blobstore Throttling - Limits storage operations (QPS and bytes/sec) to protect backend storage systems from overload.

Load-Aware Job Scheduling - The Load Limiter service integrates with CI systems to queue jobs instead of executing them during high load periods.

These mechanisms use dynamic configuration, early request rejection, and coordinated client backoff to maintain service stability under varying load conditions. Configuration is deployed through standard configuration management with monitoring integration for observability.

Component-specific details are documented in rate_limiting/, facebook/load_limiter/, and blobstore/throttledblob/ directories.