# Rate Limits
Promptfoo automatically handles rate limits from LLM providers. When a provider returns HTTP 429 or similar rate limit errors, requests are automatically retried with exponential backoff.
Rate limit handling is built into the evaluator and requires no configuration:
- Requests that hit rate limits are retried automatically with exponential backoff
- Retry counts are configurable per provider (`maxRetries`, including 0 to disable retries)
- `retry-after` headers from providers are respected

Promptfoo parses rate limit headers from major providers:
| Provider | Headers |
|---|---|
| OpenAI | x-ratelimit-remaining-requests, x-ratelimit-limit-requests, x-ratelimit-remaining-tokens, retry-after-ms |
| Anthropic | anthropic-ratelimit-requests-remaining, anthropic-ratelimit-tokens-remaining, retry-after |
| Azure OpenAI | x-ratelimit-remaining-requests, retry-after-ms, retry-after |
| Generic | retry-after, ratelimit-remaining, ratelimit-reset |
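To illustrate how such headers translate into wait times, here is a minimal sketch (not promptfoo's internal code) that derives a retry delay in milliseconds from the header names above:

```js
// Sketch: derive a retry delay (ms) from common rate limit headers.
// Assumes a plain object keyed by lowercased header names; illustrative only.
function retryDelayMs(headers, fallbackMs = 1000) {
  // OpenAI and Azure OpenAI send retry-after-ms directly in milliseconds.
  if (headers['retry-after-ms']) {
    return parseInt(headers['retry-after-ms'], 10);
  }
  // The generic retry-after header is in seconds
  // (it can also be an HTTP date, which this sketch ignores).
  if (headers['retry-after']) {
    const seconds = parseInt(headers['retry-after'], 10);
    if (!Number.isNaN(seconds)) return seconds * 1000;
  }
  return fallbackMs;
}

console.log(retryDelayMs({ 'retry-after': '60' })); // 60000
```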
Promptfoo automatically retries requests that fail with transient server errors:
| Status Code | Description | Retry Condition |
|---|---|---|
| 502 | Bad Gateway | Status text contains "bad gateway" |
| 503 | Service Unavailable | Status text contains "service unavailable" |
| 504 | Gateway Timeout | Status text contains "gateway timeout" |
| 524 | A Timeout Occurred | Status text contains "timeout" (Cloudflare-specific) |
These errors are retried up to 3 times with exponential backoff (1s, 2s, 4s). The status text check ensures that permanent failures (like authentication errors that happen to use 502) are not retried.
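The retry schedule described above can be sketched as a simple loop. This is a hypothetical helper, not promptfoo's internal implementation:

```js
// Sketch of retry with exponential backoff: up to 3 retries,
// waiting 1s, 2s, then 4s between attempts.
async function fetchWithBackoff(doRequest, maxRetries = 3, baseMs = 1000) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await doRequest();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // out of retries
      const delay = baseMs * 2 ** attempt; // 1000, 2000, 4000
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```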
The scheduler uses AIMD (Additive Increase, Multiplicative Decrease) to optimize throughput: concurrency is increased gradually while requests succeed, and cut back sharply when rate limits are hit.
This allows you to set a higher maxConcurrency and let promptfoo find the optimal rate automatically.
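To make the AIMD idea concrete, here is an illustrative controller (not promptfoo's actual scheduler) that adds one slot per success and halves concurrency on a rate limit:

```js
// Sketch of AIMD concurrency control: additive increase on success,
// multiplicative decrease on rate limit hits. Illustrative only.
class AimdController {
  constructor(maxConcurrency, minConcurrency = 1) {
    this.max = maxConcurrency;
    this.min = minConcurrency;
    this.current = maxConcurrency;
  }
  onSuccess() {
    // Additive increase: creep back up toward the configured maximum.
    this.current = Math.min(this.max, this.current + 1);
  }
  onRateLimit() {
    // Multiplicative decrease: back off sharply when the provider pushes back.
    this.current = Math.max(this.min, Math.floor(this.current / 2));
  }
}
```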
Control the maximum number of concurrent requests:
```yaml
evaluateOptions:
  maxConcurrency: 10
```
Or via CLI:
```sh
promptfoo eval --max-concurrency 10
```
The adaptive scheduler will reduce this if rate limits are encountered, but cannot exceed your configured maximum.
Add a fixed delay between requests (in addition to any rate limit backoff):
```yaml
evaluateOptions:
  delay: 1000 # milliseconds
```
Or via CLI:
```sh
promptfoo eval --delay 1000
```
Or via environment variable:
```sh
PROMPTFOO_DELAY_MS=1000 promptfoo eval
```
Promptfoo has two retry layers:
1. **Scheduler retry**: the adaptive scheduler retries `callApi()` with a 1-second base backoff, up to 3 times by default. If a provider config sets `maxRetries`, the scheduler uses that value (including 0 to disable scheduler retries entirely).
2. **HTTP-level retry**: `fetchWithRetries` retries transient HTTP failures and also honors `maxRetries` when set.

When a provider config includes `maxRetries`, promptfoo propagates that value to both layers. Explicit per-call overrides (e.g. a provider that passes a specific `maxRetries` to `fetchWithRetries`) still take precedence. For direct `fetchWithProxy` calls, transient retries (502/503/504/524) are disabled when the provider sets `maxRetries: 0`.
Example — disable retries for a provider to fail fast on rate limits:
```yaml
providers:
  - id: openai:chat:gpt-4.1-mini
    config:
      maxRetries: 0
```
Environment variables for the scheduler:
| Environment Variable | Description | Default |
|---|---|---|
| PROMPTFOO_DISABLE_ADAPTIVE_SCHEDULER | Disable adaptive concurrency (use fixed) | false |
| PROMPTFOO_MIN_CONCURRENCY | Minimum concurrency (floor for adaptive) | 1 |
| PROMPTFOO_SCHEDULER_QUEUE_TIMEOUT_MS | Timeout for queued requests (0 to disable) | 300000ms |
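For example, to disable the adaptive scheduler and run with fixed concurrency:

```sh
PROMPTFOO_DISABLE_ADAPTIVE_SCHEDULER=true promptfoo eval
```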
Environment variables for HTTP-level retry:
| Environment Variable | Description | Default |
|---|---|---|
| PROMPTFOO_REQUEST_BACKOFF_MS | Base delay for HTTP retry backoff | 5000ms |
| PROMPTFOO_RETRY_5XX | Retry on HTTP 500 errors | false |
Example:
```sh
PROMPTFOO_REQUEST_BACKOFF_MS=10000 PROMPTFOO_RETRY_5XX=true promptfoo eval
```
The scheduler's retry handles most rate limiting automatically. The HTTP-level retry provides additional resilience for network issues.
OpenAI has separate rate limits for requests and tokens. The scheduler tracks both. For high-volume evaluations:
```yaml
evaluateOptions:
  maxConcurrency: 20 # Scheduler will adapt down if needed
```
See OpenAI troubleshooting for additional options.
Anthropic rate limits are typically per-minute. The scheduler respects retry-after headers from the API.
Custom providers trigger automatic retry when errors contain rate limit indicators, such as an HTTP 429 status.
To provide retry timing, include headers in your response metadata:
```js
return {
  output: 'response',
  metadata: {
    headers: {
      'retry-after': '60', // seconds
    },
  },
};
```
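Putting this together, a custom provider might look like the following sketch. The endpoint URL and response body shape are hypothetical; only the returned `output`/`metadata.headers` structure comes from the snippet above:

```js
// Sketch of a custom provider that forwards rate limit headers to promptfoo.
// The API endpoint and JSON response shape here are illustrative assumptions.
class MyProvider {
  id() {
    return 'my-custom-provider';
  }

  async callApi(prompt) {
    const response = await fetch('https://api.example.com/v1/complete', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ prompt }),
    });
    const body = await response.json();
    return {
      output: body.text,
      metadata: {
        headers: {
          // Forward the provider's retry timing so retries wait long enough.
          'retry-after': response.headers.get('retry-after') ?? undefined,
        },
      },
    };
  }
}
```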
To see rate limit events, enable debug logging:
```sh
LOG_LEVEL=debug promptfoo eval -c config.yaml
```
Events logged:
- `ratelimit:hit` - Rate limit encountered
- `ratelimit:learned` - Provider limits discovered from headers
- `ratelimit:warning` - Approaching rate limit threshold
- `concurrency:decreased` / `concurrency:increased` - Adaptive concurrency changes
- `request:retrying` - Retry in progress

Some best practices for high-throughput evaluations:

- **Start with higher concurrency** - Set `maxConcurrency` to your desired throughput; the scheduler will adapt down if needed
- **Use caching** - Enable caching to avoid re-running identical requests
- **Monitor debug logs** - If evaluations are slow, check for frequent `ratelimit:hit` events
- **Consider provider tiers** - Higher API tiers typically have higher rate limits; the scheduler will automatically use whatever limits the provider allows
The scheduler is always active but has minimal overhead. For fully deterministic behavior (e.g., in tests), use:
```yaml
evaluateOptions:
  maxConcurrency: 1
  delay: 1000
```
This ensures sequential execution with fixed delays between requests.