# Rate Limits
Promptfoo automatically handles rate limits from LLM providers. When a provider returns HTTP 429 or similar rate limit errors, requests are automatically retried with exponential backoff.
Rate limit handling is built into the evaluator and requires no configuration:
- Requests that hit rate limits are retried automatically with exponential backoff
- Retry counts are configurable per provider (`maxRetries`, including 0 to disable retries)
- `retry-after` headers from providers are respected

Promptfoo parses rate limit headers from major providers:
| Provider | Headers |
|---|---|
| OpenAI | x-ratelimit-remaining-requests, x-ratelimit-limit-requests, x-ratelimit-remaining-tokens, retry-after-ms |
| Anthropic | anthropic-ratelimit-requests-remaining, anthropic-ratelimit-tokens-remaining, retry-after |
| Azure OpenAI | x-ratelimit-remaining-requests, retry-after-ms, retry-after |
| Generic | retry-after, ratelimit-remaining, ratelimit-reset |
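To illustrate how such headers translate into wait times, here is a minimal sketch (not promptfoo's internal code) that derives a retry delay in milliseconds from the header names above:

```js
// Sketch: derive a retry delay (ms) from common rate limit headers.
// Assumes a plain object keyed by lowercased header names; illustrative only.
function retryDelayMs(headers, fallbackMs = 1000) {
  // OpenAI and Azure OpenAI send retry-after-ms directly in milliseconds.
  if (headers['retry-after-ms']) {
    return parseInt(headers['retry-after-ms'], 10);
  }
  // The generic retry-after header is in seconds
  // (it can also be an HTTP date, which this sketch ignores).
  if (headers['retry-after']) {
    const seconds = parseInt(headers['retry-after'], 10);
    if (!Number.isNaN(seconds)) return seconds * 1000;
  }
  return fallbackMs;
}

console.log(retryDelayMs({ 'retry-after': '60' })); // 60000
```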
Promptfoo automatically retries requests that fail with transient server errors:
| Status Code | Description | Retry Condition |
|---|---|---|
| 502 | Bad Gateway | Status text contains "bad gateway" |
| 503 | Service Unavailable | Status text contains "service unavailable" |
| 504 | Gateway Timeout | Status text contains "gateway timeout" |
| 524 | A Timeout Occurred | Status text contains "timeout" (Cloudflare-specific) |
These errors are retried up to 3 times with exponential backoff (1s, 2s, 4s). The status text check ensures that permanent failures (like authentication errors that happen to use 502) are not retried.
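The retry schedule described above can be sketched as a simple loop. This is a hypothetical helper, not promptfoo's internal implementation:

```js
// Sketch of retry with exponential backoff: up to 3 retries,
// waiting 1s, 2s, then 4s between attempts.
async function fetchWithBackoff(doRequest, maxRetries = 3, baseMs = 1000) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await doRequest();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // out of retries
      const delay = baseMs * 2 ** attempt; // 1000, 2000, 4000
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```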
The scheduler uses AIMD (Additive Increase, Multiplicative Decrease) to optimize throughput: concurrency is increased gradually while requests succeed, and cut back sharply when rate limits are hit.
This allows you to set a higher maxConcurrency and let promptfoo find the optimal rate automatically.
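To make the AIMD idea concrete, here is an illustrative controller (not promptfoo's actual scheduler) that adds one slot per success and halves concurrency on a rate limit:

```js
// Sketch of AIMD concurrency control: additive increase on success,
// multiplicative decrease on rate limit hits. Illustrative only.
class AimdController {
  constructor(maxConcurrency, minConcurrency = 1) {
    this.max = maxConcurrency;
    this.min = minConcurrency;
    this.current = maxConcurrency;
  }
  onSuccess() {
    // Additive increase: creep back up toward the configured maximum.
    this.current = Math.min(this.max, this.current + 1);
  }
  onRateLimit() {
    // Multiplicative decrease: back off sharply when the provider pushes back.
    this.current = Math.max(this.min, Math.floor(this.current / 2));
  }
}
```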
Control the maximum number of concurrent requests:
```yaml
evaluateOptions:
  maxConcurrency: 10
```
Or via CLI:
```sh
promptfoo eval --max-concurrency 10
```
The adaptive scheduler will reduce this if rate limits are encountered, but cannot exceed your configured maximum.
Add a fixed delay between requests (in addition to any rate limit backoff):
```yaml
evaluateOptions:
  delay: 1000 # milliseconds
```
Or via CLI:
```sh
promptfoo eval --delay 1000
```
Or via environment variable:
```sh
PROMPTFOO_DELAY_MS=1000 promptfoo eval
```
Promptfoo has two retry layers:
1. **Scheduler retry**: the adaptive scheduler retries `callApi()` with a 1-second base backoff, up to 3 times by default. If a provider config sets `maxRetries`, the scheduler uses that value (including 0 to disable scheduler retries entirely).
2. **HTTP-level retry**: `fetchWithRetries` retries transient HTTP failures and also honors `maxRetries` when set.

When a provider config includes `maxRetries`, promptfoo propagates that value to both layers. Explicit per-call overrides (e.g. a provider that passes a specific `maxRetries` to `fetchWithRetries`) still take precedence. For direct `fetchWithProxy` calls, transient retries (502/503/504/524) are disabled when the provider sets `maxRetries: 0`.
Example — disable retries for a provider to fail fast on rate limits:
```yaml
providers:
  - id: openai:chat:gpt-4.1-mini
    config:
      maxRetries: 0
```
Environment variables for the scheduler:
| Environment Variable | Description | Default |
|---|---|---|
| PROMPTFOO_DISABLE_ADAPTIVE_SCHEDULER | Disable adaptive concurrency (use fixed) | false |
| PROMPTFOO_MIN_CONCURRENCY | Minimum concurrency (floor for adaptive) | 1 |
| PROMPTFOO_SCHEDULER_QUEUE_TIMEOUT_MS | Timeout for queued requests (0 to disable) | 300000ms |
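For example, to disable the adaptive scheduler and run with fixed concurrency:

```sh
PROMPTFOO_DISABLE_ADAPTIVE_SCHEDULER=true promptfoo eval
```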
Environment variables for HTTP-level retry:
| Environment Variable | Description | Default |
|---|---|---|
| PROMPTFOO_REQUEST_BACKOFF_MS | Base delay for HTTP retry backoff | 5000ms |
| PROMPTFOO_RETRY_5XX | Retry on HTTP 500 errors | false |
Example:
```sh
PROMPTFOO_REQUEST_BACKOFF_MS=10000 PROMPTFOO_RETRY_5XX=true promptfoo eval
```
The scheduler's retry handles most rate limiting automatically. The HTTP-level retry provides additional resilience for network issues.
OpenAI has separate rate limits for requests and tokens. The scheduler tracks both. For high-volume evaluations:
```yaml
evaluateOptions:
  maxConcurrency: 20 # Scheduler will adapt down if needed
```
See OpenAI troubleshooting for additional options.
Anthropic rate limits are typically per-minute. The scheduler respects retry-after headers from the API.
Custom providers trigger automatic retry when errors contain rate limit indicators, such as an HTTP 429 status.
To provide retry timing, include headers in your response metadata:
```js
return {
  output: 'response',
  metadata: {
    headers: {
      'retry-after': '60', // seconds
    },
  },
};
```
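Putting this together, a custom provider might look like the following sketch. The endpoint URL and response body shape are hypothetical; only the returned `output`/`metadata.headers` structure comes from the snippet above:

```js
// Sketch of a custom provider that forwards rate limit headers to promptfoo.
// The API endpoint and JSON response shape here are illustrative assumptions.
class MyProvider {
  id() {
    return 'my-custom-provider';
  }

  async callApi(prompt) {
    const response = await fetch('https://api.example.com/v1/complete', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ prompt }),
    });
    const body = await response.json();
    return {
      output: body.text,
      metadata: {
        headers: {
          // Forward the provider's retry timing so retries wait long enough.
          'retry-after': response.headers.get('retry-after') ?? undefined,
        },
      },
    };
  }
}
```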
To see rate limit events, enable debug logging:
```sh
LOG_LEVEL=debug promptfoo eval -c config.yaml
```
Events logged:
- `ratelimit:hit` - Rate limit encountered
- `ratelimit:learned` - Provider limits discovered from headers
- `ratelimit:warning` - Approaching rate limit threshold
- `concurrency:decreased` / `concurrency:increased` - Adaptive concurrency changes
- `request:retrying` - Retry in progress

Some best practices for high-throughput evaluations:

- **Start with higher concurrency** - Set `maxConcurrency` to your desired throughput; the scheduler will adapt down if needed
- **Use caching** - Enable caching to avoid re-running identical requests
- **Monitor debug logs** - If evaluations are slow, check for frequent `ratelimit:hit` events
- **Consider provider tiers** - Higher API tiers typically have higher rate limits; the scheduler will automatically use whatever limits the provider allows
The scheduler is always active but has minimal overhead. For fully deterministic behavior (e.g., in tests), use:
```yaml
evaluateOptions:
  maxConcurrency: 1
  delay: 1000
```
This ensures sequential execution with fixed delays between requests.