Adaptive Rate Limit Scheduler - Architecture

Overview

The adaptive rate limit scheduler automatically handles provider rate limits during evaluations. It is zero-configuration: users don't need to change anything. The scheduler transparently wraps all provider calls with rate limit detection, retry logic, and adaptive concurrency management.

Design Goals

The scheduler addresses common challenges when running evaluations against rate-limited APIs:

  • No manual tuning: Users shouldn't need to guess the right -j (concurrency) value
  • Automatic recovery: Rate limit errors (429) should be retried, not fail permanently
  • Prevent cascading failures: High concurrency shouldn't cause mass failures
  • Zero configuration: Works out of the box with sensible defaults

Architecture

```text
┌─────────────────────────────────────────────────────────────────────────────┐
│                              Evaluator                                       │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                     RateLimitRegistry                                │    │
│  │  (Central coordinator - one per evaluation)                          │    │
│  │                                                                      │    │
│  │  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐   │    │
│  │  │ProviderRateLimit │  │ProviderRateLimit │  │ProviderRateLimit │   │    │
│  │  │     State        │  │     State        │  │     State        │   │    │
│  │  │  (openai/key1)   │  │  (openai/key2)   │  │  (anthropic)     │   │    │
│  │  │                  │  │                  │  │                  │   │    │
│  │  │ ┌─────────────┐  │  │ ┌─────────────┐  │  │ ┌─────────────┐  │   │    │
│  │  │ │  SlotQueue  │  │  │ │  SlotQueue  │  │  │ │  SlotQueue  │  │   │    │
│  │  │ │ (FIFO)      │  │  │ │ (FIFO)      │  │  │ │ (FIFO)      │  │   │    │
│  │  │ └─────────────┘  │  │ └─────────────┘  │  │ └─────────────┘  │   │    │
│  │  │ ┌─────────────┐  │  │ ┌─────────────┐  │  │ ┌─────────────┐  │   │    │
│  │  │ │  Adaptive   │  │  │ │  Adaptive   │  │  │ │  Adaptive   │  │   │    │
│  │  │ │ Concurrency │  │  │ │ Concurrency │  │  │ │ Concurrency │  │   │    │
│  │  │ └─────────────┘  │  │ └─────────────┘  │  │ └─────────────┘  │   │    │
│  │  └──────────────────┘  └──────────────────┘  └──────────────────┘   │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘
```

Component Responsibilities

RateLimitRegistry

File: src/scheduler/rateLimitRegistry.ts

Central coordinator that:

  • Creates/retrieves per-provider state based on rate limit keys
  • Routes provider calls to the appropriate state
  • Aggregates metrics across all providers
  • Emits events for monitoring
```typescript
// Usage (automatic in the evaluator)
const result = await registry.execute(provider, () => provider.callApi(...), {
  getHeaders: (result) => result.metadata?.headers,
  isRateLimited: (result, error) => error?.message?.includes('429'),
  getRetryAfter: (result, error) =>
    parseRetryAfter(result.metadata?.headers?.['retry-after']),
});
```

ProviderRateLimitState

File: src/scheduler/providerRateLimitState.ts

Per-provider state manager that:

  • Manages the slot queue for concurrency control
  • Tracks rate limit headers from responses
  • Implements retry logic with exponential backoff
  • Adapts concurrency based on success/failure patterns
  • Collects latency metrics
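
The retry flow through these responsibilities looks roughly like the sketch below. The function shape and hook names (`acquireSlot`, `backoffMs`, and so on) are illustrative assumptions, not the actual promptfoo internals:

```typescript
// Sketch only: hook names and shapes are hypothetical, not the real internals.
async function executeWithRetry<T>(
  callFn: () => Promise<T>,
  opts: {
    acquireSlot: () => Promise<void>;
    releaseSlot: () => void;
    isRateLimited: (result: T | undefined, error: unknown) => boolean;
    backoffMs: (attempt: number) => number;
    maxRetries: number;
  },
): Promise<T> {
  for (let attempt = 0; attempt <= opts.maxRetries; attempt++) {
    await opts.acquireSlot(); // wait for a free concurrency slot
    try {
      const result = await callFn(); // actual provider call
      if (!opts.isRateLimited(result, undefined)) {
        return result;
      }
      // Rate-limited response: fall through to backoff and retry.
    } catch (error) {
      if (!opts.isRateLimited(undefined, error) || attempt === opts.maxRetries) {
        throw error; // non-retryable error, or retries exhausted
      }
    } finally {
      opts.releaseSlot(); // free the slot before any backoff sleep
    }
    await new Promise((resolve) => setTimeout(resolve, opts.backoffMs(attempt)));
  }
  throw new Error('rate limited: max retries exceeded');
}
```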

SlotQueue

File: src/scheduler/slotQueue.ts

FIFO queue with concurrency limiting:

  • Acquires/releases "slots" for concurrent requests
  • Blocks when at max concurrency or quota exhausted
  • Tracks remaining requests/tokens from headers
  • Schedules queue processing after rate limit windows

Key insight: slot allocation is race-condition-free. Every request enqueues first, and slots are then granted strictly in FIFO order, as sketched below.
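
A minimal sketch of that pattern, assuming a fixed maxConcurrency (the real SlotQueue also tracks header-derived request/token quotas and can pause until a window resets):

```typescript
// Illustrative FIFO slot queue: every acquire enqueues, and a single drain
// step grants slots to the oldest waiters first.
class SlotQueue {
  private active = 0;
  private waiters: Array<() => void> = [];

  constructor(private maxConcurrency: number) {}

  async acquire(): Promise<void> {
    // Every request enqueues first; slots are granted strictly in FIFO
    // order, which avoids fast-path races.
    await new Promise<void>((resolve) => {
      this.waiters.push(resolve);
      this.drain();
    });
  }

  release(): void {
    this.active--;
    this.drain();
  }

  private drain(): void {
    while (this.active < this.maxConcurrency && this.waiters.length > 0) {
      this.active++; // claim the slot before waking the waiter
      this.waiters.shift()!(); // wake the oldest waiter
    }
  }
}
```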

AdaptiveConcurrency

File: src/scheduler/adaptiveConcurrency.ts

Dynamic concurrency adjustment:

  • On rate limit: Reduce concurrency by 50% (multiplicative decrease)
  • On sustained success: Increase by 1 (additive increase)
  • Proactive throttling: Reduce when approaching limits (via headers)

This implements AIMD (Additive Increase, Multiplicative Decrease) - the same algorithm TCP uses for congestion control.
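
In TypeScript, AIMD reduces to a few lines. This is a sketch: the ceiling and the success threshold are assumed values, and the real module also reacts to header-derived limits:

```typescript
// Illustrative AIMD controller: halve on a rate limit, +1 after a run of
// successes. Ceiling and streak threshold are hypothetical.
class AdaptiveConcurrency {
  private successStreak = 0;

  constructor(
    private limit = 4, // conservative default from the fail-safe settings
    private readonly min = 1,
    private readonly max = 20, // hypothetical ceiling
    private readonly successesPerIncrease = 10, // hypothetical threshold
  ) {}

  get current(): number {
    return this.limit;
  }

  onRateLimit(): void {
    this.successStreak = 0;
    this.limit = Math.max(this.min, Math.floor(this.limit / 2)); // multiplicative decrease
  }

  onSuccess(): void {
    if (++this.successStreak >= this.successesPerIncrease) {
      this.successStreak = 0;
      this.limit = Math.min(this.max, this.limit + 1); // additive increase
    }
  }
}
```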

HeaderParser

File: src/scheduler/headerParser.ts

Parses rate limit headers from multiple providers:

  • OpenAI: x-ratelimit-remaining-requests, x-ratelimit-limit-requests
  • Anthropic: anthropic-ratelimit-requests-remaining
  • Generic: retry-after, retry-after-ms, ratelimit-reset
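
The normalization step might look like this sketch. Header names are the ones listed above; the return shape and parsing details are illustrative:

```typescript
// Illustrative header normalization: try provider-specific names first,
// then fall back to generic ones. Returns undefined for absent/unparsable values.
interface RateLimitInfo {
  remainingRequests?: number;
  retryAfterMs?: number;
}

function parseRateLimitHeaders(headers: Record<string, string>): RateLimitInfo {
  const num = (key: string): number | undefined => {
    const raw = headers[key];
    const n = raw !== undefined ? Number(raw) : NaN;
    return Number.isFinite(n) ? n : undefined;
  };

  const remainingRequests =
    num('x-ratelimit-remaining-requests') ?? // OpenAI
    num('anthropic-ratelimit-requests-remaining'); // Anthropic

  // Generic: retry-after is in seconds, retry-after-ms in milliseconds.
  // (retry-after may also be an HTTP date; that case is omitted here.)
  const retryAfterSec = num('retry-after');
  const retryAfterMs =
    num('retry-after-ms') ??
    (retryAfterSec !== undefined ? retryAfterSec * 1000 : undefined);

  return { remainingRequests, retryAfterMs };
}
```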

RetryPolicy

File: src/scheduler/retryPolicy.ts

Determines retry behavior:

  • Exponential backoff with jitter
  • Respects server retry-after headers
  • Configurable max retries (default: 3)
  • Retries on: rate limits, timeouts, 502/503/504
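
Exponential backoff with full jitter looks roughly like this sketch (the base delay is an assumption; the cap matches the 60-second fail-safe default):

```typescript
// Illustrative backoff: exponential growth with full jitter, capped, and
// overridden by a server-provided retry-after when one is present.
function computeDelayMs(
  attempt: number, // 0-based retry attempt
  serverRetryAfterMs?: number,
  baseMs = 1000, // hypothetical base delay
  capMs = 60_000, // matches the 60s default when no header is present
): number {
  if (serverRetryAfterMs !== undefined) {
    return serverRetryAfterMs; // always respect the server's hint
  }
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * exp; // full jitter spreads retries apart
}
```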

Data Flow

```text
1. Evaluator calls registry.execute(provider, callFn)
       │
       ▼
2. Registry gets/creates ProviderRateLimitState for this provider
       │
       ▼
3. State.executeWithRetry() is called
       │
       ▼
4. SlotQueue.acquire() - wait for available slot
       │
       ▼
5. Execute callFn() - actual provider API call
       │
       ▼
6. Parse response headers → update rate limit state
       │
       ▼
7. Check if rate limited:
   ├─ Yes → retry with backoff, reduce concurrency
   └─ No  → record success, maybe increase concurrency
       │
       ▼
8. SlotQueue.release() - free slot for next request
       │
       ▼
9. Return result (or throw after max retries)
```

Rate Limit Key Generation

Each provider gets a unique "rate limit key" based on:

  • Provider ID (e.g., "openai:chat:gpt-4o")
  • API key hash (different keys = different rate limits)
  • Organization ID (if applicable)

This ensures:

  • Same provider + same key = shared rate limit state
  • Same provider + different keys = separate rate limits
  • Different providers = completely isolated
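
A sketch of key derivation (the hash choice, truncation, and delimiter are illustrative, not the actual scheme):

```typescript
import { createHash } from 'node:crypto';

// Illustrative key derivation: same provider + same API key share state;
// a different key or org yields a different string, hence separate limits.
function rateLimitKey(providerId: string, apiKey?: string, orgId?: string): string {
  const keyHash = apiKey
    ? createHash('sha256').update(apiKey).digest('hex').slice(0, 8)
    : 'no-key';
  return [providerId, keyHash, orgId ?? ''].join('|');
}

// rateLimitKey('openai:chat:gpt-4o', 'sk-abc') !== rateLimitKey('openai:chat:gpt-4o', 'sk-xyz')
```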

Key Design Decisions

1. Zero Configuration

Users shouldn't need to tune rate limit settings. The scheduler learns from response headers and adapts automatically.

2. Fail-Safe Defaults

  • Default max concurrency: 4 (conservative)
  • Default retry delay: 60 seconds (when no header)
  • Max retries: 3 (prevents infinite loops)

3. Proactive Throttling

Don't wait for 429 errors: when headers show less than 10% of the quota remaining, proactively reduce concurrency, as sketched below.
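
The check itself is small. A sketch, using header-derived request counts (the real logic may also weigh token quotas, which the headers above expose):

```typescript
// Illustrative proactive check: shrink concurrency before a 429 ever happens.
function shouldThrottle(remaining: number, limit: number): boolean {
  return limit > 0 && remaining / limit < 0.1; // under 10% of quota left
}
```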

4. Per-Provider Isolation

Different providers have different rate limits. Don't let OpenAI rate limits affect Anthropic calls.

5. Transparent Integration

The scheduler wraps provider.callApi() without changing the interface. Existing code works unchanged.

Metrics

The scheduler tracks:

  • totalRequests - All requests attempted
  • completedRequests - Successful completions
  • failedRequests - Permanent failures (after retries)
  • rateLimitHits - Times 429 was encountered
  • retriedRequests - Requests that required retry
  • avgLatencyMs, p50LatencyMs, p99LatencyMs - Latency distribution
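
As a TypeScript shape, mirroring the fields listed above (the interface name is illustrative):

```typescript
// Illustrative metrics shape; field names are those documented above.
interface SchedulerMetrics {
  totalRequests: number;
  completedRequests: number;
  failedRequests: number;
  rateLimitHits: number;
  retriedRequests: number;
  avgLatencyMs: number;
  p50LatencyMs: number;
  p99LatencyMs: number;
}
```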

Events

For monitoring/debugging, the scheduler emits:

  • slot:acquired / slot:released - Concurrency tracking
  • ratelimit:hit - Rate limit encountered
  • ratelimit:learned - First time seeing provider's limits
  • ratelimit:warning - Approaching rate limit
  • concurrency:increased / concurrency:decreased - Adaptive changes
  • request:retrying - Retry in progress
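
Assuming the registry exposes a Node EventEmitter-style API (an assumption; the payload shapes here are also illustrative), subscribing looks like:

```typescript
import type { EventEmitter } from 'node:events';

// Assumes an EventEmitter-style registry; payload shapes are hypothetical.
declare const registry: EventEmitter;

registry.on('ratelimit:hit', (e: { providerKey: string; retryAfterMs?: number }) => {
  console.warn(`rate limited on ${e.providerKey}; backing off ${e.retryAfterMs ?? 'unknown'} ms`);
});

registry.on('concurrency:decreased', (e: { providerKey: string; limit: number }) => {
  console.info(`concurrency for ${e.providerKey} lowered to ${e.limit}`);
});
```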

Testing

256 tests covering:

  • Unit tests for each component
  • Edge cases (negative values, zero values, overflow)
  • Race condition prevention
  • Integration with evaluator

Performance Characteristics

  • Overhead: Minimal - just slot acquisition and header parsing
  • Memory: O(providers) - one state object per unique rate limit key
  • Latency buffer: Circular buffer, last 100 requests, O(1) insertion
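
A fixed-size circular buffer for the last 100 latencies might look like this sketch (the percentile method is illustrative):

```typescript
// Illustrative circular buffer: O(1) insertion; one buffer per rate limit
// key gives the O(providers) memory profile described above.
class LatencyBuffer {
  private readonly samples: number[];
  private next = 0;
  private count = 0;

  constructor(capacity = 100) {
    this.samples = new Array(capacity).fill(0);
  }

  record(latencyMs: number): void {
    this.samples[this.next] = latencyMs; // overwrite the oldest sample
    this.next = (this.next + 1) % this.samples.length;
    this.count = Math.min(this.count + 1, this.samples.length);
  }

  percentile(p: number): number {
    const sorted = this.samples.slice(0, this.count).sort((a, b) => a - b);
    if (sorted.length === 0) return 0;
    const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
    return sorted[idx];
  }
}
```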