docs/reference/rate-limiting.md
PicoClaw prevents 429 errors from LLM provider APIs by enforcing configurable per-model request-rate limits before sending each request. Unlike the reactive cooldown/fallback system (which activates after a 429 is received), rate limiting is proactive: it keeps outbound QPS within the provider's free-tier or plan limits.
Each rate-limited model gets a token bucket:
- Capacity (burst size): `rpm` tokens, i.e. the burst size equals the per-minute limit
- Refill rate: `rpm / 60` tokens per second

The limiter sits in the existing call path:

```
AgentLoop.callLLM()
└─ FallbackChain.Execute()            ← iterate candidates
   ├─ CooldownTracker.IsAvailable()   ← skip if post-429 cooldown active
   ├─ RateLimiterRegistry.Wait()      ← NEW: block until token available
   └─ provider.Chat()                 ← actual LLM HTTP call
```
The rate limiter runs after the cooldown check and before the provider call, so a candidate skipped by an active cooldown never consumes or waits for a token. The same check applies in `ExecuteImage`.
`RateLimiterRegistry` is safe for concurrent use. The per-limiter token bucket uses a fine-grained mutex, so concurrent goroutines each acquire their own token independently.
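A minimal sketch of what such a limiter can look like in Go; the constructor name and field layout are illustrative assumptions based on this page, not the verbatim `pkg/providers/ratelimiter.go` implementation:

```go
package providers

import (
	"context"
	"sync"
	"time"
)

// RateLimiter is a token bucket: capacity == rpm, refilled at rpm/60 tokens/s.
type RateLimiter struct {
	mu       sync.Mutex
	tokens   float64   // current token count; starts full
	capacity float64   // burst size == rpm
	perSec   float64   // refill rate: rpm / 60
	last     time.Time // last refill timestamp
}

// NewRateLimiter builds a full bucket; rpm must be > 0.
func NewRateLimiter(rpm int) *RateLimiter {
	return &RateLimiter{
		tokens:   float64(rpm),
		capacity: float64(rpm),
		perSec:   float64(rpm) / 60.0,
		last:     time.Now(),
	}
}

// refill credits tokens accrued since the last call. Caller must hold mu.
func (r *RateLimiter) refill(now time.Time) {
	r.tokens += now.Sub(r.last).Seconds() * r.perSec
	if r.tokens > r.capacity {
		r.tokens = r.capacity
	}
	r.last = now
}

// Wait blocks until a token is available or ctx is done.
func (r *RateLimiter) Wait(ctx context.Context) error {
	for {
		r.mu.Lock()
		r.refill(time.Now())
		if r.tokens >= 1 {
			r.tokens--
			r.mu.Unlock()
			return nil
		}
		// Time until one full token accrues.
		wait := time.Duration((1 - r.tokens) / r.perSec * float64(time.Second))
		r.mu.Unlock()

		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(wait):
			// Loop and re-check: another goroutine may have taken the token.
		}
	}
}
```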
Set `rpm` on any model in `model_list`:

```yaml
model_list:
  - model_name: gpt-4o-free
    provider: openai
    model: gpt-4o
    api_base: https://api.openai.com/v1
    rpm: 3                # max 3 requests per minute
    api_keys:
      - sk-...
  - model_name: claude-haiku
    provider: anthropic
    model: claude-haiku-4-5
    rpm: 60               # 60 rpm (Anthropic free tier)
    api_keys:
      - sk-ant-...
  - model_name: local-llm
    provider: ollama
    model: llama3
    api_base: http://localhost:11434/v1
    # no rpm → unrestricted
```
| Field | Type | Default | Description |
|---|---|---|---|
| `rpm` | int | 0 | Requests per minute. 0 means no limit. |
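The registry ties these `rpm` values to buckets. A minimal sketch, continuing the `RateLimiter` above; the method names (`Register`, `Wait`) and the keying are assumptions drawn from this page, not the verbatim PicoClaw API:

```go
// RateLimiterRegistry maps a stable config key (e.g. model_name) to a limiter.
// Models registered with rpm == 0 get no limiter and are never throttled.
type RateLimiterRegistry struct {
	mu       sync.RWMutex
	limiters map[string]*RateLimiter
}

func NewRateLimiterRegistry() *RateLimiterRegistry {
	return &RateLimiterRegistry{limiters: make(map[string]*RateLimiter)}
}

// Register creates a limiter for key when rpm > 0.
func (reg *RateLimiterRegistry) Register(key string, rpm int) {
	if rpm <= 0 {
		return // 0 means no limit
	}
	reg.mu.Lock()
	defer reg.mu.Unlock()
	if _, ok := reg.limiters[key]; !ok {
		reg.limiters[key] = NewRateLimiter(rpm)
	}
}

// Wait blocks on the key's bucket; unknown keys pass through immediately.
func (reg *RateLimiterRegistry) Wait(ctx context.Context, key string) error {
	reg.mu.RLock()
	lim := reg.limiters[key]
	reg.mu.RUnlock()
	if lim == nil {
		return nil
	}
	return lim.Wait(ctx)
}
```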
When a model has fallbacks configured, each candidate is rate-limited independently:
```yaml
model_list:
  - model_name: gpt4-with-fallback
    provider: openai
    model: gpt-4o
    rpm: 5
    fallbacks:
      - gpt-4o-mini   # must also be in model_list; its own rpm applies
```
If the current candidate's bucket is empty and there are more candidates available, PicoClaw skips the locally saturated candidate and tries the next fallback immediately. Only the last remaining candidate waits for a token to refill. If the context deadline is hit while waiting on that last candidate, the wait error propagates.
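A sketch of that candidate walk, continuing the limiter and registry sketches above; `TryAcquire`, `lookup`, and `executeWithRateLimit` are hypothetical names used here for illustration, and cooldown handling is elided:

```go
// TryAcquire takes a token if one is immediately available; it never blocks.
func (r *RateLimiter) TryAcquire() bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.refill(time.Now())
	if r.tokens >= 1 {
		r.tokens--
		return true
	}
	return false
}

// lookup returns the limiter for a key, or nil when the model is unrestricted.
func (reg *RateLimiterRegistry) lookup(key string) *RateLimiter {
	reg.mu.RLock()
	defer reg.mu.RUnlock()
	return reg.limiters[key]
}

// executeWithRateLimit walks candidates in order. A saturated candidate is
// skipped while later candidates remain; only the last one blocks in Wait,
// so a context deadline hit there propagates as the error.
func executeWithRateLimit(
	ctx context.Context,
	reg *RateLimiterRegistry,
	keys []string, // stable config identities, e.g. model_name values
	call func(ctx context.Context, key string) error,
) error {
	var lastErr error
	for i, key := range keys {
		if i < len(keys)-1 {
			if lim := reg.lookup(key); lim != nil && !lim.TryAcquire() {
				continue // bucket empty and alternatives remain: skip
			}
		} else if err := reg.Wait(ctx, key); err != nil {
			return err // e.g. context.DeadlineExceeded while waiting
		}
		if lastErr = call(ctx, key); lastErr == nil {
			return nil
		}
	}
	return lastErr
}
```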
For `model_list` aliases that resolve to the same underlying provider/model, rate limiting is keyed by the stable config identity (for example `model_name`) rather than the resolved runtime model string. This preserves distinct RPM settings for multi-key and alias-based configurations.
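For instance, two aliases over the same provider/model each keep their own bucket (illustrative config; the alias names are made up for this example):

```yaml
model_list:
  - model_name: gpt-4o-key-a      # bucket keyed "gpt-4o-key-a"
    provider: openai
    model: gpt-4o
    rpm: 3
  - model_name: gpt-4o-key-b      # separate bucket keyed "gpt-4o-key-b"
    provider: openai
    model: gpt-4o                 # same resolved model, distinct limiter
    rpm: 10
```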
The bucket starts full (burst = RPM). With `rpm: 3`, the first 3 requests fire instantly; subsequent requests are spaced ~20 s apart.
To reduce burstiness for strict APIs, set a lower `rpm` and rely on the steady-state refill.
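The burst-then-refill behavior can be observed directly. Continuing the sketch above (with `fmt` imported; `NewRateLimiter` is the hypothetical constructor from the earlier sketch):

```go
func main() {
	lim := NewRateLimiter(3) // bucket starts full: 3 tokens, refill = 1 token / 20 s
	for i := 1; i <= 5; i++ {
		start := time.Now()
		_ = lim.Wait(context.Background())
		fmt.Printf("request %d waited %s\n", i, time.Since(start).Round(time.Second))
	}
	// Requests 1-3 wait 0s (burst); request 4 ~20s; request 5 another ~20s.
}
```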
| File | What |
|---|---|
| `pkg/providers/ratelimiter.go` | `RateLimiter` (token bucket) + `RateLimiterRegistry` |
| `pkg/providers/ratelimiter_test.go` | Unit tests for the limiter and registry |
| `pkg/providers/fallback.go` | `FallbackCandidate.RPM` field; `FallbackChain.rl`; `Wait()` call in `Execute`/`ExecuteImage` |
| `pkg/agent/model_resolution.go` | Resolves candidates from `model_list`, preserving stable config identity and propagating RPM into `FallbackCandidate` |
| `pkg/agent/loop.go` | Builds the `RateLimiterRegistry`, registers all agents' candidates, passes it to `NewFallbackChain` |