relayer/docs/dynamic-retry-after-design.md
This document describes the design for computing dynamic Retry-After values based on queue state, drain rate, and request processing stage. The goal is to provide clients with intelligent polling intervals that adapt to system load.
The relayer has different queue structures per request type:

Input proof:

    [HTTP] → [TX Throttler Queue] → [Gateway TX]
                      ↑
         TPS-based drain (per_seconds)

Decrypt (user and public):

    [HTTP] → [Readiness Queue] → [Readiness Check] → [TX Throttler Queue] → [Gateway TX]
                    ↑                                          ↑
         Concurrency-based drain                        TPS-based drain
            (max_concurrency)                            (per_seconds)

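Key differences:

- Input proof requests enter the TX throttler queue directly; decrypt requests pass through the readiness queue and the readiness check first.
- The readiness queue drains by concurrency (max_concurrency), while the TX throttler queue drains at a fixed rate (per_seconds).
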
| Parameter | Description | Default |
|---|---|---|
| `min_seconds` | Minimum retry interval in seconds (floor) | 1 |
| `max_seconds` | Maximum retry interval in seconds (ceiling) | 300 |
| `safety_margin` | Fractional margin added to the computed ETA (0.0-1.0); applied as × (1 + margin) | 0.2 |
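
As a minimal sketch of how these three parameters could be applied, the function below pads a raw ETA (in milliseconds) by the safety margin, rounds up to whole seconds, and clamps to the configured bounds. The `RetryAfterConfig` struct and `finalize_retry_after` function are hypothetical names used for illustration; they are not taken from the relayer code.

```rust
/// Hypothetical container for the three parameters in the table above.
#[derive(Debug, Clone, Copy)]
pub struct RetryAfterConfig {
    pub min_seconds: u64,
    pub max_seconds: u64,
    /// Fraction in 0.0..=1.0; 0.2 means "add 20%".
    pub safety_margin: f64,
}

impl Default for RetryAfterConfig {
    /// Defaults copied from the parameter table.
    fn default() -> Self {
        Self { min_seconds: 1, max_seconds: 300, safety_margin: 0.2 }
    }
}

/// Pads `eta_ms` by (1 + margin), rounds up to whole seconds, then clamps
/// the result into [min_seconds, max_seconds].
pub fn finalize_retry_after(eta_ms: f64, cfg: &RetryAfterConfig) -> u64 {
    let padded_ms = eta_ms.max(0.0) * (1.0 + cfg.safety_margin);
    let seconds = (padded_ms / 1000.0).ceil() as u64;
    seconds.clamp(cfg.min_seconds, cfg.max_seconds)
}
```

With the defaults, a raw ETA of 2100 ms becomes 3 seconds, an ETA of 0 ms is raised to the 1-second floor, and anything above 250 000 ms is capped at 300 seconds.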
All nominal times are required in configuration (no code defaults):
| Request Type | Processing By | Config Field |
|---|---|---|
| Input Proof | Copro | input_proof_processing_seconds |
| User Decrypt | KMS | user_decrypt_processing_seconds |
| Public Decrypt | KMS | public_decrypt_processing_seconds |
| Readiness Check | (decrypts only) | readiness_check_seconds |
| TX Confirmation | Blockchain | tx_confirmation_ms |
For ReceiptReceived state only (Copro/KMS response time is unpredictable):
| Elapsed Time | Retry-After | Reason |
|---|---|---|
| 0-60s | 4s | Expect response soon |
| 60s-2m | 10s | Taking longer than usual |
| 2m-5m | 30s | Significant delay |
| 5m-15m | 60s | Major delay |
| 15m+ | 300s | Likely stuck, minimal polling |
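
The backoff table can be expressed as a simple step function. The sketch below assumes each range includes its lower bound and excludes its upper bound (the table does not say which side owns the boundary), and the name `backoff_seconds` is illustrative.

```rust
/// Step-function backoff B(E) for the ReceiptReceived state.
/// `elapsed_ms` is the time spent in the current state, in milliseconds.
/// Assumption: ranges are half-open, [lower, upper).
pub fn backoff_seconds(elapsed_ms: u64) -> u64 {
    const SECOND: u64 = 1_000;
    const MINUTE: u64 = 60 * SECOND;

    match elapsed_ms {
        e if e < MINUTE => 4,        // 0-60s: expect response soon
        e if e < 2 * MINUTE => 10,   // 60s-2m: taking longer than usual
        e if e < 5 * MINUTE => 30,   // 2m-5m: significant delay
        e if e < 15 * MINUTE => 60,  // 5m-15m: major delay
        _ => 300,                    // 15m+: likely stuck, minimal polling
    }
}
```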
| Var | Description |
|---|---|
| p | Request's position in queue (0-indexed) |
| Q | TX queue size (used when the request will join at the end) |
| D | TX drain rate (requests per second) |
| C | Readiness max concurrency |
| R | Nominal readiness check time (ms) |
| P | Nominal processing time (ms); the examples use 2s for input proof and 4s for decrypt |
| T | Nominal TX confirmation time (ms) |
| M | Safety margin (e.g., 0.2) |
| E | Elapsed time in current state (ms) |
| B(E) | Backoff function based on elapsed time |
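
Input proof requests (queue terms are converted from seconds to milliseconds with × 1000 so that all terms share one unit):

| Status | Formula |
|---|---|
| Queued | clamp(⌈(p/D × 1000 + P + T) × (1+M) / 1000⌉, min, max) |
| Processing | clamp(⌈(p/D × 1000 + P + T) × (1+M) / 1000⌉, min, max) |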
| TxInFlight | clamp(⌈P × (1+M) / 1000⌉, min, max) |
| ReceiptReceived | B(E) |
| Completed/TimedOut/Failure | 0 |
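
The input proof rows can be sketched as a single `match`. Everything named here (`InputProofStatus`, `InputProofParams`, `input_proof_retry_after`) is hypothetical and only illustrates the formulas; the queue wait is converted to milliseconds before it is added to `P` and `T`, and the backoff function is passed in so the block stays self-contained.

```rust
/// Request states for an input proof, mirroring the rows of the table.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum InputProofStatus {
    Queued { position: u64 },
    Processing { position: u64 },
    TxInFlight,
    ReceiptReceived { elapsed_ms: u64 },
    /// Completed, TimedOut, or Failure.
    Terminal,
}

/// Hypothetical parameter bundle; field names are illustrative only.
#[derive(Debug, Clone, Copy)]
pub struct InputProofParams {
    pub drain_tps: f64,          // D
    pub processing_ms: f64,      // P
    pub tx_confirmation_ms: f64, // T
    pub safety_margin: f64,      // M
    pub min_seconds: u64,
    pub max_seconds: u64,
}

impl InputProofParams {
    /// clamp(ceil(eta_ms * (1 + M) / 1000), min, max)
    fn finalize(&self, eta_ms: f64) -> u64 {
        let seconds = (eta_ms * (1.0 + self.safety_margin) / 1000.0).ceil() as u64;
        seconds.clamp(self.min_seconds, self.max_seconds)
    }
}

/// Computes Retry-After in seconds; `backoff` plays the role of B(E).
pub fn input_proof_retry_after(
    status: InputProofStatus,
    params: &InputProofParams,
    backoff: fn(u64) -> u64,
) -> u64 {
    match status {
        InputProofStatus::Queued { position } | InputProofStatus::Processing { position } => {
            // p / D is in seconds; convert to ms before adding P and T.
            let queue_wait_ms = position as f64 / params.drain_tps * 1000.0;
            params.finalize(queue_wait_ms + params.processing_ms + params.tx_confirmation_ms)
        }
        InputProofStatus::TxInFlight => params.finalize(params.processing_ms),
        // Fixed backoff: no safety margin and no clamp applied here.
        InputProofStatus::ReceiptReceived { elapsed_ms } => backoff(elapsed_ms),
        InputProofStatus::Terminal => 0,
    }
}
```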

Decrypt requests (user and public; queue terms are converted from seconds to milliseconds with × 1000):

| Status | Queue Location | Formula |
|---|---|---|
| Queued | In readiness queue | clamp(⌈(p/C × 1000 + Q/D × 1000 + P + T) × (1+M) / 1000⌉, min, max) |
| Processing | Out of readiness, not in TX | clamp(⌈(R + Q/D × 1000 + P + T) × (1+M) / 1000⌉, min, max) |
| Processing | In TX queue | clamp(⌈(p/D × 1000 + P + T) × (1+M) / 1000⌉, min, max) |
| TxInFlight | - | clamp(⌈P × (1+M) / 1000⌉, min, max) |
| ReceiptReceived | - | B(E) |
| Completed/TimedOut/Failure | - | 0 |
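
A possible shape for the decrypt computation is sketched below. It assumes the caller has already looked up the request's optional position in each queue and packed the results into a hypothetical `QueueView`; `DecryptStatus`, `DecryptParams`, and `decrypt_retry_after` are likewise illustrative names. A `Queued` request with no readiness position falls back to position 0, which is an assumption of this sketch.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DecryptStatus {
    Queued,
    Processing,
    TxInFlight,
    ReceiptReceived { elapsed_ms: u64 },
    /// Completed, TimedOut, or Failure.
    Terminal,
}

/// Hypothetical snapshot of where the request currently sits.
#[derive(Debug, Clone, Copy)]
pub struct QueueView {
    pub readiness_position: Option<u64>, // p in the readiness queue
    pub tx_position: Option<u64>,        // p in the TX queue
    pub tx_queue_len: u64,               // Q
}

/// Hypothetical parameter bundle; field names are illustrative only.
#[derive(Debug, Clone, Copy)]
pub struct DecryptParams {
    pub drain_tps: f64,          // D
    pub max_concurrency: f64,    // C
    pub readiness_check_ms: f64, // R
    pub processing_ms: f64,      // P
    pub tx_confirmation_ms: f64, // T
    pub safety_margin: f64,      // M
    pub min_seconds: u64,
    pub max_seconds: u64,
}

impl DecryptParams {
    /// clamp(ceil(eta_ms * (1 + M) / 1000), min, max)
    fn finalize(&self, eta_ms: f64) -> u64 {
        let seconds = (eta_ms * (1.0 + self.safety_margin) / 1000.0).ceil() as u64;
        seconds.clamp(self.min_seconds, self.max_seconds)
    }

    /// Time for `entries` requests to drain from the TX queue, in ms.
    fn tx_wait_ms(&self, entries: u64) -> f64 {
        entries as f64 / self.drain_tps * 1000.0
    }
}

pub fn decrypt_retry_after(
    status: DecryptStatus,
    view: &QueueView,
    params: &DecryptParams,
    backoff: fn(u64) -> u64,
) -> u64 {
    let tail_ms = params.processing_ms + params.tx_confirmation_ms; // P + T
    match status {
        DecryptStatus::Queued => {
            // Still in the readiness queue; it will join the END of the TX queue.
            let p = view.readiness_position.unwrap_or(0) as f64;
            let readiness_wait_ms = p / params.max_concurrency * 1000.0;
            params.finalize(readiness_wait_ms + params.tx_wait_ms(view.tx_queue_len) + tail_ms)
        }
        DecryptStatus::Processing => match view.tx_position {
            // Already in the TX queue: only entries ahead of it matter.
            Some(p) => params.finalize(params.tx_wait_ms(p) + tail_ms),
            // Out of readiness, not yet in TX: readiness check plus full TX queue.
            None => params.finalize(
                params.readiness_check_ms + params.tx_wait_ms(view.tx_queue_len) + tail_ms,
            ),
        },
        DecryptStatus::TxInFlight => params.finalize(params.processing_ms),
        DecryptStatus::ReceiptReceived { elapsed_ms } => backoff(elapsed_ms),
        DecryptStatus::Terminal => 0,
    }
}
```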

Notes:

- Queued and Processing use the request's position `p` in the queue (not total queue size)
- `readiness_throttler.get_position(id)` returns `None` → removed from readiness, check TX queue
- `tx_throttler.get_position(id)` returns `Some(p)` → use TX queue formula
- Both return `None` → out of readiness, not yet in TX queue
- TxInFlight uses only `P` (time to get response after TX sent)
- ReceiptReceived uses `B(E)` since Copro/KMS response time is unpredictable

Worked examples, using parameters: D=10, C=50, R=2s, P_input=2s, P_decrypt=4s, T=100ms, M=0.2, B(E)=3s

Input proof, Queued or Processing: ⌈(p/D × 1000 + P + T) × (1+M) / 1000⌉

| p | p/D (s) | + P + T (ms) | × 1.2 | Result |
|---|---|---|---|---|
| 0 | 0 | 2100 | 2520 | 3s |
| 1 | 0.1 | 2200 | 2640 | 3s |
| 10 | 1 | 3100 | 3720 | 4s |
| 100 | 10 | 12100 | 14520 | 15s |
| 1000 | 100 | 102100 | 122520 | 123s |

Input proof, TxInFlight: ⌈P × 1.2 / 1000⌉ = ⌈2000 × 1.2 / 1000⌉ = 3s (constant)

Input proof, ReceiptReceived: B(E) = 3s (constant)

Decrypt, Queued (in readiness queue): ⌈(p/C × 1000 + Q/D × 1000 + P + T) × (1+M) / 1000⌉

Assuming p = Q (same number of entries in both queues):

| p | p/C (s) | Q/D (s) | + P + T (ms) | × 1.2 | Result |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 4100 | 4920 | 5s |
| 1 | 0.02 | 0.1 | 4220 | 5064 | 6s |
| 10 | 0.2 | 1 | 5300 | 6360 | 7s |
| 100 | 2 | 10 | 16100 | 19320 | 20s |
| 1000 | 20 | 100 | 124100 | 148920 | 149s |

Decrypt, Processing (out of readiness, not yet in TX queue): ⌈(R + Q/D × 1000 + P + T) × (1+M) / 1000⌉

| Q | R (ms) | Q/D (s) | + P + T (ms) | × 1.2 | Result |
|---|---|---|---|---|---|
| 0 | 2000 | 0 | 6100 | 7320 | 8s |
| 1 | 2000 | 0.1 | 6200 | 7440 | 8s |
| 10 | 2000 | 1 | 7100 | 8520 | 9s |
| 100 | 2000 | 10 | 16100 | 19320 | 20s |
| 1000 | 2000 | 100 | 106100 | 127320 | 128s |

Decrypt, Processing (in TX queue): ⌈(p/D × 1000 + P + T) × (1+M) / 1000⌉

| p | p/D (s) | + P + T (ms) | × 1.2 | Result |
|---|---|---|---|---|
| 0 | 0 | 4100 | 4920 | 5s |
| 1 | 0.1 | 4200 | 5040 | 6s |
| 10 | 1 | 5100 | 6120 | 7s |
| 100 | 10 | 14100 | 16920 | 17s |
| 1000 | 100 | 104100 | 124920 | 125s |

Decrypt, TxInFlight: ⌈P × 1.2 / 1000⌉ = ⌈4000 × 1.2 / 1000⌉ = 5s (constant)

Decrypt, ReceiptReceived: B(E) = 3s (constant)

Summary at p = Q = 100:

| Status | Input Proof | Decrypt |
|---|---|---|
| Queued | 15s | 20s |
| Processing (in readiness check, not yet in TX queue) | - | 20s |
| Processing (in TX queue) | 15s | 17s |
| TxInFlight | 3s | 5s |
| ReceiptReceived | 3s | 3s |

Example responses:

    HTTP/1.1 202 Accepted
    Retry-After: 27

    {"status": "queued", "job_id": "...", "eta_seconds": 27}

    HTTP/1.1 202 Accepted
    Retry-After: 10

    {"status": "queued", "state": "tx_in_flight", "eta_seconds": 10, "elapsed_seconds": 15}

All parameters are runtime-updatable via admin endpoints.

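Why ReceiptReceived uses fixed backoff (no safety margin): Copro/KMS response time is unpredictable, so there is no computed ETA for a margin to scale; the interval comes from the elapsed-time backoff table instead.
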
Why milliseconds internally:
Why position-based instead of queue-size-based:

- For a request that is already `Queued`, using total queue size is incorrect: only the entries ahead of it delay it
- `get_position(id)` provides accurate estimates as the request advances through the queue