Retries and Resilience

The SDK retries failed requests automatically using two layers of resilience: per-request backoff with decorrelated jitter, and adaptive concurrency for bulk operations. This page explains how to tune that behavior and when to hand retry control back to your orchestrator.

For the exceptions the SDK raises when retries are exhausted, see {doc}`/guides/error-handling`.


Defaults at a Glance

Out of the box — no configuration needed:

| What | Default behavior |
| --- | --- |
| Max retries (after initial attempt) | 3 for REST (4 total), 5 for gRPC (6 total) |
| Retryable HTTP status codes | 408, 429, 500, 502, 503, 504 |
| Retryable gRPC status codes | UNAVAILABLE, RESOURCE_EXHAUSTED, ABORTED |
| Backoff algorithm | Decorrelated jitter: a random walk bounded by the backoff_factor floor and the max_wait cap |
| Adaptive concurrency (bulk paths) | Self-tunes downward on throttling; max_concurrency is a ceiling, not a constant |

Configuring Retries

Pass a RetryConfig to the Pinecone constructor to customize retry behavior for all REST requests made by that client:

python
from pinecone import Pinecone, RetryConfig

pc = Pinecone(
    retry_config=RetryConfig(
        max_retries=5,
        backoff_factor=0.5,
        max_wait=60.0,
        retryable_status_codes=frozenset({429, 500, 503}),
    )
)

RetryConfig fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| max_retries | int | 3 | Number of retry attempts after the initial attempt. Total attempts = max_retries + 1. |
| backoff_factor | float | 0.25 | Minimum delay floor in seconds (lower bound of decorrelated jitter). See Jitter strategy for the full formula. |
| max_wait | float | 60.0 | Maximum delay cap in seconds. The jitter algorithm never waits longer than this between retries. |
| retryable_status_codes | frozenset[int] | {408, 429, 500, 502, 503, 504} | HTTP status codes that trigger a retry. The SDK retries on these codes and raises on all others. |

RetryConfig applies to REST only. The gRPC transport (Rust-backed) has its own retry policy, which RetryConfig does not affect and which defaults to 5 retries. See Transport differences.

Disabling retries

To disable retries entirely, set max_retries=0:

python
from pinecone import Pinecone, RetryConfig

pc = Pinecone(retry_config=RetryConfig(max_retries=0))

With max_retries=0, the SDK makes exactly one attempt and raises immediately on any error.

Handling rate limits without retrying

By default, 429 responses are retried automatically. To receive RateLimitError immediately instead (for example, so your orchestrator can handle the retry), exclude 429 from the retryable set:

python
import time

from pinecone import Pinecone, RetryConfig
from pinecone.errors import RateLimitError

pc = Pinecone(
    retry_config=RetryConfig(
        retryable_status_codes=frozenset({408, 500, 502, 503, 504}),  # no 429
    )
)
index = pc.index(host="product-search-abc123.svc.pinecone.io")

try:
    index.upsert(vectors=[...])
except RateLimitError:
    # Stand-in for orchestrator logic: wait, then try once more.
    time.sleep(30.0)
    index.upsert(vectors=[...])

Migration note: backoff_factor semantic change (v8 → v9)

In v8 and earlier, backoff_factor was an exponential multiplier. In v9, it is the minimum delay floor in seconds, that is, the lower bound of the decorrelated jitter window. The default also changed from 2.0 to 0.25.

If you pinned backoff_factor=2.0 in v8, the v9 value that produces a similar mean first-retry delay is backoff_factor=0.5. To restore the old default behavior (which caused ~4× longer delays than v9), pass backoff_factor=2.0 explicitly. Most users should leave backoff_factor unset and take the v9 default.


Jitter Strategy

Jitter spreads retries across time so that concurrent clients with the same retry budget don't collide on the server at the same moment.

Decorrelated jitter (backoff path)

When no server hint is present, the SDK uses decorrelated jitter:

delay = uniform(backoff_factor, max(backoff_factor, prev_delay * 3))
delay = min(delay, max_wait)

Starting from prev_delay = backoff_factor, each retry delay is drawn uniformly from [backoff_factor, prev_delay × 3], capped at max_wait. Because the next window's upper bound grows with the previous delay, the sequence performs a random walk that diverges naturally without a hard exponential schedule — neighboring clients are unlikely to pick the same delay even when they start at the same time.

Concrete example with defaults (backoff_factor=0.25, max_wait=60.0):

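| Attempt | Window (seconds) | Typical delay |
| --- | --- | --- |
| 1st retry | [0.25, 0.75] | ~0.5 s |
| 2nd retry | [0.25, ~1.5] | ~0.9 s |
| 3rd retry | [0.25, ~2.6] | ~1.4 s |

Each row assumes the previous delay landed at the midpoint of its window, and the typical delay is the midpoint of the row's own window. The upper bound is three times the previous delay, so a run of high draws widens the windows faster than shown here.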
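The formula above can be sketched in a few lines of Python. This is an illustration of the stated formula, not the SDK's implementation: the function name `decorrelated_jitter_delays` is hypothetical, and feeding the capped delay back in as `prev_delay` is an assumption.

```python
import random
from typing import List, Optional


def decorrelated_jitter_delays(
    retries: int,
    backoff_factor: float = 0.25,
    max_wait: float = 60.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Draw one delay per retry using the decorrelated jitter formula."""
    rng = rng or random.Random()
    delays = []
    prev_delay = backoff_factor  # the walk starts at the floor
    for _ in range(retries):
        upper = max(backoff_factor, prev_delay * 3)
        delay = min(rng.uniform(backoff_factor, upper), max_wait)
        delays.append(delay)
        prev_delay = delay  # assumption: the capped value feeds the next window
    return delays


if __name__ == "__main__":
    for attempt, delay in enumerate(decorrelated_jitter_delays(3), start=1):
        print(f"retry {attempt}: sleep {delay:.3f}s")
```

Running it several times shows the point of the design: each run produces a different sequence, yet every delay stays between `backoff_factor` and `max_wait`.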

Adaptive Concurrency for Bulk Operations

When you run bulk upserts or other parallel operations, the SDK observes throttling signals and automatically reduces the number of concurrent in-flight requests. When throttling subsides, concurrency recovers.

How it works

Each Pinecone client maintains a per-host concurrency limiter. On every retryable response (429, 503, or equivalent gRPC code), the limiter halves the effective concurrency limit for that host. After a streak of consecutive successful requests, it recovers by one slot. The algorithm is AIMD (Additive Increase, Multiplicative Decrease), the same control loop used by TCP congestion control.

You don't configure this directly. The max_concurrency parameter you pass to upsert() is a ceiling — the SDK self-tunes between 1 and that ceiling based on what the server can absorb.
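A toy model may make the AIMD loop easier to picture. The class below is a sketch, not the SDK's limiter: `AimdLimiter`, its method names, and especially `recovery_streak=10` are assumptions, since this page does not say how long the success streak must be.

```python
class AimdLimiter:
    """Toy AIMD limiter: halve on throttle, add one slot after a success streak."""

    def __init__(self, ceiling: int, recovery_streak: int = 10) -> None:
        # recovery_streak is a made-up value chosen for illustration only.
        self.ceiling = ceiling
        self.limit = ceiling
        self.recovery_streak = recovery_streak
        self._successes = 0

    def on_throttle(self) -> None:
        """Multiplicative decrease: halve the limit, never going below 1."""
        self.limit = max(1, self.limit // 2)
        self._successes = 0

    def on_success(self) -> None:
        """Additive increase: one more slot per completed success streak."""
        self._successes += 1
        if self._successes >= self.recovery_streak:
            self._successes = 0
            self.limit = min(self.ceiling, self.limit + 1)


if __name__ == "__main__":
    limiter = AimdLimiter(ceiling=8)
    limiter.on_throttle()  # 8 -> 4
    limiter.on_throttle()  # 4 -> 2
    for _ in range(10):
        limiter.on_success()  # the tenth success recovers one slot: 2 -> 3
    print(limiter.limit)
```

Halving reacts quickly to pressure while recovery is deliberately slow, which is why the effective concurrency can sit well below `max_concurrency` for a while after a burst of throttling.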

Example

python
from pinecone import Pinecone

pc = Pinecone()
index = pc.index(host="product-search-abc123.svc.pinecone.io")

# max_concurrency=8 is the ceiling.
# If the index throttles during the run, the SDK will automatically
# reduce effective concurrency (e.g. to 4, then 2) and recover as
# throttling subsides. No code changes required.
response = index.upsert(
    vectors=large_list,
    batch_size=200,
    max_concurrency=8,
)
print(response.upserted_count)

Limiter scope

One limiter per index host per Pinecone client. If you create two Pinecone clients and both target the same index, they each maintain an independent limiter — there is no cross-client coordination (see Multi-process and serverless workloads).
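The per-host registry can be pictured as a small LRU map. The sketch below assumes the 256-host cap with LRU eviction noted under Multi-Process and Serverless Workloads. The class `LimiterRegistry` and its methods are hypothetical, and a plain integer limit stands in for whatever state the SDK actually keeps per host.

```python
from collections import OrderedDict


class LimiterRegistry:
    """Per-host limit store with least-recently-used eviction."""

    def __init__(self, max_hosts: int = 256) -> None:
        self.max_hosts = max_hosts
        self._limits: "OrderedDict[str, int]" = OrderedDict()

    def get(self, host: str, ceiling: int) -> int:
        """Return the learned limit for host, starting at ceiling if unknown."""
        if host in self._limits:
            self._limits.move_to_end(host)  # mark as most recently used
        else:
            self._limits[host] = ceiling
            if len(self._limits) > self.max_hosts:
                self._limits.popitem(last=False)  # evict the least recently used
        return self._limits[host]

    def set(self, host: str, limit: int) -> None:
        """Record a newly learned limit for host."""
        self.get(host, limit)  # ensure the entry exists and is most recent
        self._limits[host] = limit


if __name__ == "__main__":
    registry = LimiterRegistry(max_hosts=2)
    registry.set("a.example", 2)    # host "a" learned a reduced limit
    registry.get("b.example", 8)
    registry.get("c.example", 8)    # cap exceeded: "a" is evicted
    print(registry.get("a.example", 8))  # "a" starts over at its ceiling
```

Two registries built this way share nothing, which mirrors the lack of cross-client coordination described above.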


Transport Differences

Retry behavior is designed for parity across REST and gRPC. The remaining differences are small:

| Aspect | REST (Index, AsyncIndex) | gRPC (GrpcIndex) |
| --- | --- | --- |
| Default max_retries | 3 (4 total attempts) | 5 (6 total attempts) |
| Configured via | RetryConfig passed to Pinecone() | max_retries on a directly constructed GrpcIndex; RetryConfig has no effect |
| Retryable codes | {408, 429, 500, 502, 503, 504} | UNAVAILABLE, RESOURCE_EXHAUSTED, ABORTED |
| Jitter algorithm | Decorrelated jitter (Python) | Decorrelated jitter (Rust) |
| Async support | Yes (AsyncIndex) | No; the gRPC transport is sync-only |
| Adaptive concurrency | Yes (REST + gRPC share the same per-host limiter registry) | Yes |

gRPC retry is not configurable via RetryConfig. If you need to tune gRPC retry behavior, construct GrpcIndex directly (rather than through Pinecone.index(grpc=True)) and pass max_retries explicitly.


Multi-Process and Serverless Workloads

What the SDK cannot do

The SDK's retry and adaptive concurrency machinery is per-process. If your workload fans out across multiple Lambda invocations, Cloud Run instances, or Kubernetes pods, each process runs its own independent retry loop. There is no shared state, no cross-process coordination, and no distributed rate-limit awareness.

This means:

  • N simultaneously throttled invocations each independently back off and retry. Without coordination, they can collide again at the end of the retry window.
  • The adaptive concurrency limiter starts from scratch for each new process instance (e.g. a fresh Lambda cold start). It cannot inherit a reduced limit that another invocation learned from throttling.

Within a single process, the per-client adaptive-limiter registry is capped at 256 hosts with LRU eviction. Long-running services that rotate through more than 256 distinct hosts will see the adaptive state of infrequently used hosts reset on next use, which is harmless.

Let your orchestrator handle retries at the job level, and keep the SDK's retry window narrow:

python
from pinecone import Pinecone, RetryConfig
from pinecone.errors import RateLimitError

# Set max_retries=0 or 1: one attempt (or one fast retry), then raise.
# Let the SQS visibility timeout / Cloud Tasks retry / Step Functions catch
# handle the outer retry loop.
pc = Pinecone(retry_config=RetryConfig(max_retries=1))
index = pc.index(host="product-search-abc123.svc.pinecone.io")

try:
    response = index.upsert(vectors=batch, batch_size=100, max_concurrency=4)
except RateLimitError as exc:
    # Re-raise so the orchestrator sees a task failure and schedules a retry
    # after the visibility timeout expires.
    raise

Why jitter still helps across processes

Even without coordination, the SDK's decorrelated jitter provides statistical relief. If N independent Lambda invocations are all throttled at once, they don't all retry at the same instant — each draws its own delay, spreading the retries across a window. The larger N is, the more this matters.
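A quick simulation shows the spreading effect. It assumes every client is on its first retry with the default backoff_factor=0.25, so each draws from [0.25, 0.75]. The function name `first_retry_buckets` and the 50 ms bucket size are arbitrary choices for illustration.

```python
import random
from collections import Counter


def first_retry_buckets(
    n_clients: int,
    backoff_factor: float = 0.25,
    bucket_s: float = 0.05,
    seed: int = 0,
) -> Counter:
    """Count how many clients schedule their first retry in each time bucket."""
    rng = random.Random(seed)
    buckets: Counter = Counter()
    for _ in range(n_clients):
        # First retry: prev_delay equals backoff_factor, so the window is [b, 3b].
        delay = rng.uniform(backoff_factor, backoff_factor * 3)
        buckets[int(delay / bucket_s)] += 1
    return buckets


if __name__ == "__main__":
    for bucket, count in sorted(first_retry_buckets(1000).items()):
        print(f"{bucket * 0.05:.2f}s: {count} clients")
```

Instead of 1000 requests arriving at the same instant, each 50 ms slice receives only a fraction of them. Later retries spread further still, because each client's window depends on its own previous draw.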

Summary: when to trust the SDK vs. the orchestrator

| Scenario | Recommended approach |
| --- | --- |
| Single-process bulk upsert | Use defaults; the SDK handles everything |
| Long-running worker (persistent process) | Use defaults; the adaptive limiter learns and recovers |
| Lambda / Cloud Functions / Cloud Run (stateless) | max_retries=1, catch RateLimitError, re-raise for orchestrator retry |
| Fan-out across many pods (e.g. Kubernetes Job) | Same as stateless: set low max_retries, rely on orchestrator |
| Strict per-invocation SLA (must not block) | max_retries=0, retryable_status_codes=frozenset(); raise immediately |
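When choosing between these rows, it can help to bound how long the SDK might sleep in the worst case. The sketch below derives that bound from the decorrelated jitter formula by assuming every draw lands at the top of its window. It covers REST backoff only, ignores time spent in the requests themselves and any server-provided hint, and `worst_case_backoff` is a hypothetical helper, not part of the SDK.

```python
def worst_case_backoff(
    max_retries: int = 3,
    backoff_factor: float = 0.25,
    max_wait: float = 60.0,
) -> float:
    """Upper bound on total sleep time if every delay hits its window's maximum."""
    total = 0.0
    prev_delay = backoff_factor
    for _ in range(max_retries):
        # Largest value the formula can produce for this retry.
        delay = min(max(backoff_factor, prev_delay * 3), max_wait)
        total += delay
        prev_delay = delay
    return total


if __name__ == "__main__":
    print(worst_case_backoff())               # defaults: 0.75 + 2.25 + 6.75 = 9.75
    print(worst_case_backoff(max_retries=1))  # one fast retry: 0.75
```

With the defaults the bound is under ten seconds, while max_retries=1 keeps it below one second, which is why the stateless rows recommend a low retry count.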

Observability

The SDK emits structured log records so you can diagnose retry storms and throttling pressure without adding instrumentation yourself.

Log namespaces

| Logger | Events |
| --- | --- |
| pinecone._internal.http_client | Throttled HTTP response received; retry delay computed |
| pinecone._internal.adaptive | AIMD concurrency limit transitions |

INFO messages

An INFO-level record is emitted the first time a given host rate-limits a client instance:

Rate limited by host=<host>. Adaptive concurrency will reduce in-flight requests.
See https://docs.pinecone.io/python/retries for details.

This fires once per host per Pinecone / AsyncPinecone object, so it surfaces in your logs without flooding them on repeated throttling.

DEBUG messages

Enable DEBUG-level logging on the two namespaces above to see granular retry events:

python
import logging
logging.getLogger("pinecone._internal.http_client").setLevel(logging.DEBUG)
logging.getLogger("pinecone._internal.adaptive").setLevel(logging.DEBUG)

Throttle record (emitted once per retry attempt that receives a retryable response):

Throttled response: status=429 host=my-index.svc.pinecone.io attempt=1/4 delay=0.531s

Fields: status (HTTP status code), host, attempt (N of total attempts), delay (computed wait in seconds).
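If you want a per-host count of throttle events rather than raw log lines, one option is a small logging handler that parses the record text. This sketch assumes the message format matches the sample above exactly. Parsing log text is brittle and the format may change between SDK versions, so treat the regex and the `ThrottleCounter` class as illustrative only.

```python
import logging
import re
from collections import Counter

# Pattern derived from the sample throttle record shown above.
THROTTLE_RE = re.compile(
    r"Throttled response: status=(?P<status>\d+) host=(?P<host>\S+) "
    r"attempt=(?P<attempt>\d+)/(?P<total>\d+) delay=(?P<delay>[\d.]+)s"
)


class ThrottleCounter(logging.Handler):
    """Counts throttle records per host by parsing the DEBUG message text."""

    def __init__(self) -> None:
        super().__init__(level=logging.DEBUG)
        self.by_host: Counter = Counter()

    def emit(self, record: logging.LogRecord) -> None:
        match = THROTTLE_RE.search(record.getMessage())
        if match:
            self.by_host[match["host"]] += 1


# Attach the counter to the HTTP client logger named in the table above.
logger = logging.getLogger("pinecone._internal.http_client")
logger.setLevel(logging.DEBUG)
counter = ThrottleCounter()
logger.addHandler(counter)
```

After a bulk run, `counter.by_host` holds the number of throttled attempts per index host, which is often enough to tell a retry storm from an occasional 429.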

AIMD limit decrease (emitted when the adaptive limiter reduces concurrency):

AIMD limiter decreased: before=8 after=4 ceiling=8

AIMD limit increase (emitted when the limiter recovers a concurrency slot):

AIMD limiter increased: now=5 ceiling=8

Increase records only fire on actual transitions — not on every successful request — so the volume is proportional to recovery events, not request throughput.


See Also

  • {doc}`/guides/error-handling`: Exception hierarchy and how to catch specific errors
  • {doc}`/guides/performance`: Bulk upsert patterns, max_concurrency tuning, and transport selection
  • {doc}`/guides/sync-vs-async`: When to use the async client and how to manage concurrency with asyncio