Adaptive Rate Limit Scheduler - Architecture

Overview

The adaptive rate limit scheduler automatically handles provider rate limits during evaluations. It is zero-configuration: users don't need to change anything. The scheduler transparently wraps all provider calls with rate limit detection, retry logic, and adaptive concurrency management.

Design Goals

The scheduler addresses common challenges when running evaluations against rate-limited APIs:

  • No manual tuning: Users shouldn't need to guess the right -j (concurrency) value
  • Automatic recovery: Rate limit errors (429) should be retried, not fail permanently
  • Prevent cascading failures: High concurrency shouldn't cause mass failures
  • Zero configuration: Works out of the box with sensible defaults

Architecture

```text
┌─────────────────────────────────────────────────────────────────────────────┐
│                              Evaluator                                       │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                     RateLimitRegistry                                │    │
│  │  (Central coordinator - one per evaluation)                          │    │
│  │                                                                      │    │
│  │  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐   │    │
│  │  │ProviderRateLimit │  │ProviderRateLimit │  │ProviderRateLimit │   │    │
│  │  │     State        │  │     State        │  │     State        │   │    │
│  │  │  (openai/key1)   │  │  (openai/key2)   │  │  (anthropic)     │   │    │
│  │  │                  │  │                  │  │                  │   │    │
│  │  │ ┌─────────────┐  │  │ ┌─────────────┐  │  │ ┌─────────────┐  │   │    │
│  │  │ │  SlotQueue  │  │  │ │  SlotQueue  │  │  │ │  SlotQueue  │  │   │    │
│  │  │ │ (FIFO)      │  │  │ │ (FIFO)      │  │  │ │ (FIFO)      │  │   │    │
│  │  │ └─────────────┘  │  │ └─────────────┘  │  │ └─────────────┘  │   │    │
│  │  │ ┌─────────────┐  │  │ ┌─────────────┐  │  │ ┌─────────────┐  │   │    │
│  │  │ │  Adaptive   │  │  │ │  Adaptive   │  │  │ │  Adaptive   │  │   │    │
│  │  │ │ Concurrency │  │  │ │ Concurrency │  │  │ │ Concurrency │  │   │    │
│  │  │ └─────────────┘  │  │ └─────────────┘  │  │ └─────────────┘  │   │    │
│  │  └──────────────────┘  └──────────────────┘  └──────────────────┘   │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘
```

Component Responsibilities

RateLimitRegistry

File: src/scheduler/rateLimitRegistry.ts

Central coordinator that:

  • Creates/retrieves per-provider state based on rate limit keys
  • Routes provider calls to the appropriate state
  • Aggregates metrics across all providers
  • Emits events for monitoring
```typescript
// Usage (automatic in the evaluator)
const result = await registry.execute(provider, () => provider.callApi(...), {
  getHeaders: (result) => result.metadata?.headers,
  isRateLimited: (result, error) => error?.message?.includes('429'),
  getRetryAfter: (result, error) =>
    parseRetryAfter(result.metadata?.headers?.['retry-after']),
});
```

ProviderRateLimitState

File: src/scheduler/providerRateLimitState.ts

Per-provider state manager that:

  • Manages the slot queue for concurrency control
  • Tracks rate limit headers from responses
  • Implements retry logic with exponential backoff
  • Adapts concurrency based on success/failure patterns
  • Collects latency metrics
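
The retry flow through these responsibilities looks roughly like the sketch below. The function shape and hook names (`acquireSlot`, `backoffMs`, and so on) are illustrative assumptions, not the actual promptfoo internals:

```typescript
// Sketch only: hook names and shapes are hypothetical, not the real internals.
async function executeWithRetry<T>(
  callFn: () => Promise<T>,
  opts: {
    acquireSlot: () => Promise<void>;
    releaseSlot: () => void;
    isRateLimited: (result: T | undefined, error: unknown) => boolean;
    backoffMs: (attempt: number) => number;
    maxRetries: number;
  },
): Promise<T> {
  for (let attempt = 0; attempt <= opts.maxRetries; attempt++) {
    await opts.acquireSlot(); // wait for a free concurrency slot
    try {
      const result = await callFn(); // actual provider call
      if (!opts.isRateLimited(result, undefined)) {
        return result;
      }
      // Rate-limited response: fall through to backoff and retry.
    } catch (error) {
      if (!opts.isRateLimited(undefined, error) || attempt === opts.maxRetries) {
        throw error; // non-retryable error, or retries exhausted
      }
    } finally {
      opts.releaseSlot(); // free the slot before any backoff sleep
    }
    await new Promise((resolve) => setTimeout(resolve, opts.backoffMs(attempt)));
  }
  throw new Error('rate limited: max retries exceeded');
}
```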

SlotQueue

File: src/scheduler/slotQueue.ts

FIFO queue with concurrency limiting:

  • Acquires/releases "slots" for concurrent requests
  • Blocks when at max concurrency or quota exhausted
  • Tracks remaining requests/tokens from headers
  • Schedules queue processing after rate limit windows

Key insight: slot allocation is race-condition-free. Every request enqueues first, and slots are then granted strictly in FIFO order, as sketched below.
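
A minimal sketch of that pattern, assuming a fixed maxConcurrency (the real SlotQueue also tracks header-derived request/token quotas and can pause until a window resets):

```typescript
// Illustrative FIFO slot queue: every acquire enqueues, and a single drain
// step grants slots to the oldest waiters first.
class SlotQueue {
  private active = 0;
  private waiters: Array<() => void> = [];

  constructor(private maxConcurrency: number) {}

  async acquire(): Promise<void> {
    // Every request enqueues first; slots are granted strictly in FIFO
    // order, which avoids fast-path races.
    await new Promise<void>((resolve) => {
      this.waiters.push(resolve);
      this.drain();
    });
  }

  release(): void {
    this.active--;
    this.drain();
  }

  private drain(): void {
    while (this.active < this.maxConcurrency && this.waiters.length > 0) {
      this.active++; // claim the slot before waking the waiter
      this.waiters.shift()!(); // wake the oldest waiter
    }
  }
}
```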

AdaptiveConcurrency

File: src/scheduler/adaptiveConcurrency.ts

Dynamic concurrency adjustment:

  • On rate limit: Reduce concurrency by 50% (multiplicative decrease)
  • On sustained success: Increase by 1 (additive increase)
  • Proactive throttling: Reduce when approaching limits (via headers)

This implements AIMD (Additive Increase, Multiplicative Decrease) - the same algorithm TCP uses for congestion control.
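
In TypeScript, AIMD reduces to a few lines. This is a sketch: the ceiling and the success threshold are assumed values, and the real module also reacts to header-derived limits:

```typescript
// Illustrative AIMD controller: halve on a rate limit, +1 after a run of
// successes. Ceiling and streak threshold are hypothetical.
class AdaptiveConcurrency {
  private successStreak = 0;

  constructor(
    private limit = 4, // conservative default from the fail-safe settings
    private readonly min = 1,
    private readonly max = 20, // hypothetical ceiling
    private readonly successesPerIncrease = 10, // hypothetical threshold
  ) {}

  get current(): number {
    return this.limit;
  }

  onRateLimit(): void {
    this.successStreak = 0;
    this.limit = Math.max(this.min, Math.floor(this.limit / 2)); // multiplicative decrease
  }

  onSuccess(): void {
    if (++this.successStreak >= this.successesPerIncrease) {
      this.successStreak = 0;
      this.limit = Math.min(this.max, this.limit + 1); // additive increase
    }
  }
}
```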

HeaderParser

File: src/scheduler/headerParser.ts

Parses rate limit headers from multiple providers:

  • OpenAI: x-ratelimit-remaining-requests, x-ratelimit-limit-requests
  • Anthropic: anthropic-ratelimit-requests-remaining
  • Generic: retry-after, retry-after-ms, ratelimit-reset
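
The normalization step might look like this sketch. Header names are the ones listed above; the return shape and parsing details are illustrative:

```typescript
// Illustrative header normalization: try provider-specific names first,
// then fall back to generic ones. Returns undefined for absent/unparsable values.
interface RateLimitInfo {
  remainingRequests?: number;
  retryAfterMs?: number;
}

function parseRateLimitHeaders(headers: Record<string, string>): RateLimitInfo {
  const num = (key: string): number | undefined => {
    const raw = headers[key];
    const n = raw !== undefined ? Number(raw) : NaN;
    return Number.isFinite(n) ? n : undefined;
  };

  const remainingRequests =
    num('x-ratelimit-remaining-requests') ?? // OpenAI
    num('anthropic-ratelimit-requests-remaining'); // Anthropic

  // Generic: retry-after is in seconds, retry-after-ms in milliseconds.
  // (retry-after may also be an HTTP date; that case is omitted here.)
  const retryAfterSec = num('retry-after');
  const retryAfterMs =
    num('retry-after-ms') ??
    (retryAfterSec !== undefined ? retryAfterSec * 1000 : undefined);

  return { remainingRequests, retryAfterMs };
}
```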

RetryPolicy

File: src/scheduler/retryPolicy.ts

Determines retry behavior:

  • Exponential backoff with jitter
  • Respects server retry-after headers
  • Configurable max retries (default: 3)
  • Retries on: rate limits, timeouts, 502/503/504
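
Exponential backoff with full jitter looks roughly like this sketch (the base delay is an assumption; the cap matches the 60-second fail-safe default):

```typescript
// Illustrative backoff: exponential growth with full jitter, capped, and
// overridden by a server-provided retry-after when one is present.
function computeDelayMs(
  attempt: number, // 0-based retry attempt
  serverRetryAfterMs?: number,
  baseMs = 1000, // hypothetical base delay
  capMs = 60_000, // matches the 60s default when no header is present
): number {
  if (serverRetryAfterMs !== undefined) {
    return serverRetryAfterMs; // always respect the server's hint
  }
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * exp; // full jitter spreads retries apart
}
```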

Data Flow

```text
1. Evaluator calls registry.execute(provider, callFn)
       │
       ▼
2. Registry gets/creates ProviderRateLimitState for this provider
       │
       ▼
3. State.executeWithRetry() is called
       │
       ▼
4. SlotQueue.acquire() - wait for available slot
       │
       ▼
5. Execute callFn() - actual provider API call
       │
       ▼
6. Parse response headers → update rate limit state
       │
       ▼
7. Check if rate limited:
   ├─ Yes → retry with backoff, reduce concurrency
   └─ No  → record success, maybe increase concurrency
       │
       ▼
8. SlotQueue.release() - free slot for next request
       │
       ▼
9. Return result (or throw after max retries)
```

Rate Limit Key Generation

Each provider gets a unique "rate limit key" based on:

  • Provider ID (e.g., "openai:chat:gpt-4o")
  • API key hash (different keys = different rate limits)
  • Organization ID (if applicable)

This ensures:

  • Same provider + same key = shared rate limit state
  • Same provider + different keys = separate rate limits
  • Different providers = completely isolated
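
A sketch of key derivation (the hash choice, truncation, and delimiter are illustrative, not the actual scheme):

```typescript
import { createHash } from 'node:crypto';

// Illustrative key derivation: same provider + same API key share state;
// a different key or org yields a different string, hence separate limits.
function rateLimitKey(providerId: string, apiKey?: string, orgId?: string): string {
  const keyHash = apiKey
    ? createHash('sha256').update(apiKey).digest('hex').slice(0, 8)
    : 'no-key';
  return [providerId, keyHash, orgId ?? ''].join('|');
}

// rateLimitKey('openai:chat:gpt-4o', 'sk-abc') !== rateLimitKey('openai:chat:gpt-4o', 'sk-xyz')
```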

Key Design Decisions

1. Zero Configuration

Users shouldn't need to tune rate limit settings. The scheduler learns from response headers and adapts automatically.

2. Fail-Safe Defaults

  • Default max concurrency: 4 (conservative)
  • Default retry delay: 60 seconds (when no header)
  • Max retries: 3 (prevents infinite loops)

3. Proactive Throttling

Don't wait for 429 errors: when headers show less than 10% of the quota remaining, proactively reduce concurrency, as sketched below.
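
The check itself is small. A sketch, using header-derived request counts (the real logic may also weigh token quotas, which the headers above expose):

```typescript
// Illustrative proactive check: shrink concurrency before a 429 ever happens.
function shouldThrottle(remaining: number, limit: number): boolean {
  return limit > 0 && remaining / limit < 0.1; // under 10% of quota left
}
```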

4. Per-Provider Isolation

Different providers have different rate limits. Don't let OpenAI rate limits affect Anthropic calls.

5. Transparent Integration

The scheduler wraps provider.callApi() without changing the interface. Existing code works unchanged.

Metrics

The scheduler tracks:

  • totalRequests - All requests attempted
  • completedRequests - Successful completions
  • failedRequests - Permanent failures (after retries)
  • rateLimitHits - Times 429 was encountered
  • retriedRequests - Requests that required retry
  • avgLatencyMs, p50LatencyMs, p99LatencyMs - Latency distribution
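
As a TypeScript shape, mirroring the fields listed above (the interface name is illustrative):

```typescript
// Illustrative metrics shape; field names are those documented above.
interface SchedulerMetrics {
  totalRequests: number;
  completedRequests: number;
  failedRequests: number;
  rateLimitHits: number;
  retriedRequests: number;
  avgLatencyMs: number;
  p50LatencyMs: number;
  p99LatencyMs: number;
}
```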

Events

For monitoring/debugging, the scheduler emits:

  • slot:acquired / slot:released - Concurrency tracking
  • ratelimit:hit - Rate limit encountered
  • ratelimit:learned - First time seeing provider's limits
  • ratelimit:warning - Approaching rate limit
  • concurrency:increased / concurrency:decreased - Adaptive changes
  • request:retrying - Retry in progress
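
Assuming the registry exposes a Node EventEmitter-style API (an assumption; the payload shapes here are also illustrative), subscribing looks like:

```typescript
import type { EventEmitter } from 'node:events';

// Assumes an EventEmitter-style registry; payload shapes are hypothetical.
declare const registry: EventEmitter;

registry.on('ratelimit:hit', (e: { providerKey: string; retryAfterMs?: number }) => {
  console.warn(`rate limited on ${e.providerKey}; backing off ${e.retryAfterMs ?? 'unknown'} ms`);
});

registry.on('concurrency:decreased', (e: { providerKey: string; limit: number }) => {
  console.info(`concurrency for ${e.providerKey} lowered to ${e.limit}`);
});
```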

Testing

256 tests covering:

  • Unit tests for each component
  • Edge cases (negative values, zero values, overflow)
  • Race condition prevention
  • Integration with evaluator

Performance Characteristics

  • Overhead: Minimal - just slot acquisition and header parsing
  • Memory: O(providers) - one state object per unique rate limit key
  • Latency buffer: Circular buffer, last 100 requests, O(1) insertion
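
A fixed-size circular buffer for the last 100 latencies might look like this sketch (the percentile method is illustrative):

```typescript
// Illustrative circular buffer: O(1) insertion; one buffer per rate limit
// key gives the O(providers) memory profile described above.
class LatencyBuffer {
  private readonly samples: number[];
  private next = 0;
  private count = 0;

  constructor(capacity = 100) {
    this.samples = new Array(capacity).fill(0);
  }

  record(latencyMs: number): void {
    this.samples[this.next] = latencyMs; // overwrite the oldest sample
    this.next = (this.next + 1) % this.samples.length;
    this.count = Math.min(this.count + 1, this.samples.length);
  }

  percentile(p: number): number {
    const sorted = this.samples.slice(0, this.count).sort((a, b) => a - b);
    if (sorted.length === 0) return 0;
    const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
    return sorted[idx];
  }
}
```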