docs/scheduler-architecture.md
The adaptive rate limit scheduler automatically handles provider rate limits during evaluations. It is zero-configuration: users don't need to change anything. The scheduler transparently wraps all provider calls with intelligent rate limit detection, retry logic, and adaptive concurrency management.
The scheduler addresses common challenges when running evaluations against rate-limited APIs, such as picking the right `-j` (concurrency) value by hand.

```
┌──────────────────────────────────────────────────────────────────────┐
│ Evaluator                                                            │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ RateLimitRegistry                                                │ │
│ │ (central coordinator - one per evaluation)                       │ │
│ │                                                                  │ │
│ │ ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐ │ │
│ │ │ProviderRateLimit │  │ProviderRateLimit │  │ProviderRateLimit │ │ │
│ │ │      State       │  │      State       │  │      State       │ │ │
│ │ │  (openai/key1)   │  │  (openai/key2)   │  │   (anthropic)    │ │ │
│ │ │                  │  │                  │  │                  │ │ │
│ │ │ ┌─────────────┐  │  │ ┌─────────────┐  │  │ ┌─────────────┐  │ │ │
│ │ │ │  SlotQueue  │  │  │ │  SlotQueue  │  │  │ │  SlotQueue  │  │ │ │
│ │ │ │   (FIFO)    │  │  │ │   (FIFO)    │  │  │ │   (FIFO)    │  │ │ │
│ │ │ └─────────────┘  │  │ └─────────────┘  │  │ └─────────────┘  │ │ │
│ │ │ ┌─────────────┐  │  │ ┌─────────────┐  │  │ ┌─────────────┐  │ │ │
│ │ │ │  Adaptive   │  │  │ │  Adaptive   │  │  │ │  Adaptive   │  │ │ │
│ │ │ │ Concurrency │  │  │ │ Concurrency │  │  │ │ Concurrency │  │ │ │
│ │ │ └─────────────┘  │  │ └─────────────┘  │  │ └─────────────┘  │ │ │
│ │ └──────────────────┘  └──────────────────┘  └──────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
```
File: src/scheduler/rateLimitRegistry.ts
The central coordinator. It creates one `ProviderRateLimitState` per rate limit key and routes every provider call through it:
```typescript
// Usage (automatic in the evaluator)
const result = await registry.execute(provider, () => provider.callApi(...), {
  getHeaders: (result) => result.metadata?.headers,
  isRateLimited: (result, error) => error?.message?.includes('429'),
  getRetryAfter: (result, error) =>
    parseRetryAfter(result.metadata?.headers?.['retry-after']),
});
```
File: src/scheduler/providerRateLimitState.ts
Per-provider state manager. It owns the provider's `SlotQueue` and adaptive concurrency controller and runs the retry loop (`executeWithRetry()`).
File: src/scheduler/slotQueue.ts
A FIFO queue that limits how many requests are in flight at once.
Key insight: Race-condition-free slot allocation. All requests queue, then slots are allocated in FIFO order.
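That idea can be sketched as follows. This is a simplified illustration with assumed method names, not the actual module's API: `acquire()` resolves immediately while slots are free, otherwise the caller joins a FIFO queue, and `release()` hands the freed slot directly to the oldest waiter.

```typescript
// Simplified FIFO slot queue sketch (illustrative, not the real module).
class SlotQueue {
  private active = 0;
  private waiters: Array<() => void> = [];

  constructor(private maxConcurrency: number) {}

  acquire(): Promise<void> {
    if (this.active < this.maxConcurrency) {
      this.active++;
      return Promise.resolve();
    }
    // No free slot: queue the caller in arrival (FIFO) order.
    return new Promise((resolve) => this.waiters.push(() => resolve()));
  }

  release(): void {
    const next = this.waiters.shift(); // oldest waiter first
    if (next) {
      next(); // slot passes directly to the next request; `active` is unchanged
    } else {
      this.active--;
    }
  }
}
```

Because every waiter enters the same queue and slots are handed off on release, no late arrival can "steal" a slot from an earlier request.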
File: src/scheduler/adaptiveConcurrency.ts
Dynamically adjusts the per-provider concurrency limit: successful requests may raise it, rate limit hits lower it. This implements AIMD (Additive Increase, Multiplicative Decrease), the same algorithm TCP uses for congestion control.
File: src/scheduler/headerParser.ts
Parses rate limit headers from multiple providers:
- OpenAI: `x-ratelimit-remaining-requests`, `x-ratelimit-limit-requests`
- Anthropic: `anthropic-ratelimit-requests-remaining`
- Standard: `retry-after`, `retry-after-ms`, `ratelimit-reset`

File: src/scheduler/retryPolicy.ts
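A sketch of how those headers might be normalized into one shape. The `RateLimitInfo` fields and function name are illustrative, not the real module's types:

```typescript
// Normalize provider-specific rate limit headers (illustrative sketch).
interface RateLimitInfo {
  remainingRequests?: number;
  limitRequests?: number;
  retryAfterMs?: number;
}

function parseRateLimitHeaders(headers: Record<string, string>): RateLimitInfo {
  const h = (name: string) => headers[name.toLowerCase()];
  const num = (v?: string) => {
    const n = v === undefined || v === '' ? NaN : Number(v);
    return Number.isFinite(n) ? n : undefined;
  };

  // retry-after carries seconds (HTTP-date forms are ignored in this sketch);
  // retry-after-ms is already milliseconds.
  const retryAfterSec = num(h('retry-after'));
  const retryAfterMs = num(h('retry-after-ms'));

  return {
    // OpenAI-style and Anthropic-style remaining counts
    remainingRequests:
      num(h('x-ratelimit-remaining-requests')) ??
      num(h('anthropic-ratelimit-requests-remaining')),
    limitRequests: num(h('x-ratelimit-limit-requests')),
    retryAfterMs:
      retryAfterMs ??
      (retryAfterSec !== undefined ? retryAfterSec * 1000 : undefined),
  };
}
```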
Determines retry behavior, including honoring `retry-after` headers.

The lifecycle of each request:

```
1. Evaluator calls registry.execute(provider, callFn)
   │
   ▼
2. Registry gets/creates ProviderRateLimitState for this provider
   │
   ▼
3. State.executeWithRetry() is called
   │
   ▼
4. SlotQueue.acquire() - wait for available slot
   │
   ▼
5. Execute callFn() - actual provider API call
   │
   ▼
6. Parse response headers → update rate limit state
   │
   ▼
7. Check if rate limited:
   ├─ Yes → retry with backoff, reduce concurrency
   └─ No  → record success, maybe increase concurrency
   │
   ▼
8. SlotQueue.release() - free slot for next request
   │
   ▼
9. Return result (or throw after max retries)
```
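The steps above can be condensed into a hypothetical retry loop. The function signature, option names, and backoff constants are assumptions of this sketch, not the actual `executeWithRetry()`:

```typescript
// Hypothetical retry loop tying the lifecycle steps together (illustrative).
async function executeWithRetry<T>(
  callFn: () => Promise<T>,
  opts: {
    acquire: () => Promise<void>;   // step 4: wait for a slot
    release: () => void;            // step 8: free the slot
    isRateLimited: (err: unknown) => boolean;
    maxRetries?: number;
    baseBackoffMs?: number;
  },
): Promise<T> {
  const maxRetries = opts.maxRetries ?? 3;
  const base = opts.baseBackoffMs ?? 1000;
  for (let attempt = 0; ; attempt++) {
    await opts.acquire();
    try {
      return await callFn(); // step 5: the actual provider API call
    } catch (err) {
      // step 7: rate limited → back off and retry; anything else rethrows
      if (!opts.isRateLimited(err) || attempt >= maxRetries) throw err;
      const backoffMs = Math.min(30_000, base * 2 ** attempt); // exponential backoff, capped
      await new Promise((r) => setTimeout(r, backoffMs));
    } finally {
      opts.release(); // runs on success, retry, and rethrow alike
    }
  }
}
```

Releasing the slot in `finally` is what keeps steps 4 and 8 balanced even when a retry or a permanent failure intervenes.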
Each provider gets a unique "rate limit key" based on its provider id and API credential (e.g. `openai/key1` vs. `openai/key2` vs. `anthropic`). This ensures that state is isolated per provider and per key: one provider's rate limits never throttle another's calls.
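An illustrative key derivation (the exact fields used by the real module are an assumption here): same provider id and credential share state; different credentials get independent state.

```typescript
import { createHash } from 'node:crypto';

// Illustrative rate limit key: provider id plus a short hash of the
// credential, so the key is stable without ever exposing the secret.
function rateLimitKey(providerId: string, apiKey?: string): string {
  if (!apiKey) return providerId;
  const digest = createHash('sha256').update(apiKey).digest('hex').slice(0, 8);
  return `${providerId}/${digest}`;
}
```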
Users shouldn't need to tune rate limit settings. The scheduler learns from response headers and adapts automatically.
Don't wait for 429 errors. When headers show <10% remaining quota, proactively reduce concurrency.
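The 10% threshold comes from the text above; the predicate itself is a sketch with assumed names:

```typescript
// Proactive throttling check: shrink concurrency before a 429 ever happens.
// Returns false when the limit is unknown (0), since there is nothing to compare.
function shouldThrottle(remaining: number, limit: number): boolean {
  return limit > 0 && remaining / limit < 0.1;
}
```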
Different providers have different rate limits. Don't let OpenAI rate limits affect Anthropic calls.
The scheduler wraps provider.callApi() without changing the interface. Existing code works unchanged.
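A simplified illustration of that transparency (the interface here is cut down to one method; `withScheduler` and `execute` are assumptions of this sketch): the wrapped object exposes the same `callApi()` signature, so callers never see the scheduler.

```typescript
// Minimal provider shape for illustration; the real interface has more members.
interface ApiProvider {
  id(): string;
  callApi(prompt: string): Promise<{ output: string }>;
}

// Wrap a provider so every call is routed through a scheduler's execute()
// without changing the provider's public interface.
function withScheduler(
  provider: ApiProvider,
  execute: <T>(fn: () => Promise<T>) => Promise<T>,
): ApiProvider {
  return {
    id: () => provider.id(),
    callApi: (prompt) => execute(() => provider.callApi(prompt)),
  };
}
```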
The scheduler tracks:
- `totalRequests` - All requests attempted
- `completedRequests` - Successful completions
- `failedRequests` - Permanent failures (after retries)
- `rateLimitHits` - Times 429 was encountered
- `retriedRequests` - Requests that required retry
- `avgLatencyMs`, `p50LatencyMs`, `p99LatencyMs` - Latency distribution

For monitoring/debugging, the scheduler emits:

- `slot:acquired` / `slot:released` - Concurrency tracking
- `ratelimit:hit` - Rate limit encountered
- `ratelimit:learned` - First time seeing a provider's limits
- `ratelimit:warning` - Approaching the rate limit
- `concurrency:increased` / `concurrency:decreased` - Adaptive changes
- `request:retrying` - Retry in progress

The scheduler is covered by 256 tests.