Dynamic Retry-After Computation Design

relayer/docs/dynamic-retry-after-design.md


Overview

This document describes the design for computing dynamic Retry-After values based on queue state, drain rate, and request processing stage. The goal is to provide clients with intelligent polling intervals that adapt to system load.

Queue Architecture

The relayer has different queue structures per request type:

Input Proof (Single Queue)

```
[HTTP] → [TX Throttler Queue] → [Gateway TX]
              ↑
         TPS-based drain (per_seconds)
```

User Decrypt / Public Decrypt (Dual Queue)

```
[HTTP] → [Readiness Queue] → [Readiness Check] → [TX Throttler Queue] → [Gateway TX]
              ↑                                          ↑
    Concurrency-based drain                       TPS-based drain
       (max_concurrency)                          (per_seconds)
```

Key differences:

  • TX Throttler: Rate-limited via TPS (tokens per second) using governor
  • Readiness Queue: Concurrency-limited via semaphore (max_concurrency parallel tasks)
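Both drain mechanisms can be reduced to an effective drain rate in requests per second, which is how the ETA formulas in this document treat them (p/D for the TX throttler, p/C for the readiness queue). A minimal illustrative model, not the relayer's code:

```rust
// Illustrative model of the two drain mechanisms (not the relayer's code).
// Both collapse to an effective drain rate in requests/second, matching how
// the ETA formulas in this document use D and C.
enum DrainModel {
    /// TX throttler: token-bucket rate limit (e.g. via the `governor` crate).
    Tps(f64),
    /// Readiness queue: semaphore bounding parallel readiness checks.
    Concurrency(f64),
}

impl DrainModel {
    /// Estimated queue wait in milliseconds for a request at 0-indexed position `p`.
    fn queue_wait_ms(&self, p: u64) -> u64 {
        let rate = match self {
            DrainModel::Tps(d) => *d,
            DrainModel::Concurrency(c) => *c,
        };
        ((p as f64 / rate) * 1000.0).ceil() as u64
    }
}

fn main() {
    let tx = DrainModel::Tps(10.0);                // D = 10 tps
    let readiness = DrainModel::Concurrency(50.0); // C = 50
    println!("{}", tx.queue_wait_ms(100));         // 10000 ms
    println!("{}", readiness.queue_wait_ms(100));  // 2000 ms
}
```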

Configuration

Core Parameters

| Parameter | Description | Default |
|---|---|---|
| min_seconds | Minimum retry interval (floor) | 1 |
| max_seconds | Maximum retry interval (ceiling) | 300 |
| safety_margin | Multiplier applied to computed ETA (0.0-1.0) | 0.2 |
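A minimal sketch of how these parameters might be grouped, assuming a hypothetical `RetryAfterConfig` struct (field and method names are illustrative, not the relayer's actual config types):

```rust
// Hypothetical grouping of the core retry-after parameters above.
// Names are illustrative, not the relayer's actual configuration types.
#[derive(Debug, Clone)]
struct RetryAfterConfig {
    min_seconds: u64,   // floor for Retry-After (default 1)
    max_seconds: u64,   // ceiling for Retry-After (default 300)
    safety_margin: f64, // multiplier M applied to the computed ETA, e.g. 0.2
}

impl RetryAfterConfig {
    /// Clamp a computed ETA (in seconds) into [min_seconds, max_seconds].
    fn clamp_eta(&self, eta_seconds: u64) -> u64 {
        eta_seconds.clamp(self.min_seconds, self.max_seconds)
    }
}

fn main() {
    let cfg = RetryAfterConfig { min_seconds: 1, max_seconds: 300, safety_margin: 0.2 };
    // An ETA of 0s is raised to the floor; 1000s is capped at the ceiling.
    println!("{} {}", cfg.clamp_eta(0), cfg.clamp_eta(1000)); // prints "1 300"
}
```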

Nominal Processing Times

All nominal times are required in configuration (no code defaults):

| Request Type | Processing By | Config Field |
|---|---|---|
| Input Proof | Copro | input_proof_processing_seconds |
| User Decrypt | KMS | user_decrypt_processing_seconds |
| Public Decrypt | KMS | public_decrypt_processing_seconds |
| Readiness Check | (decrypts only) | readiness_check_seconds |
| TX Confirmation | Blockchain | tx_confirmation_ms |

Copro/KMS Backoff Intervals

For ReceiptReceived state only (Copro/KMS response time is unpredictable):

| Elapsed Time | Retry-After | Reason |
|---|---|---|
| 0-60s | 4s | Expect response soon |
| 60s-2m | 10s | Taking longer than usual |
| 2m-5m | 30s | Significant delay |
| 5m-15m | 60s | Major delay |
| 15m+ | 300s | Likely stuck, minimal polling |
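The backoff schedule B(E) above can be sketched as a simple lookup on elapsed time. The function name and the choice of which side each exact cutoff falls on are illustrative; elapsed time is taken in milliseconds, matching the document's internal units:

```rust
// Sketch of the backoff schedule B(E) for the ReceiptReceived state.
// Thresholds mirror the table above; exact-boundary handling is a choice.
fn receipt_backoff_secs(elapsed_ms: u64) -> u64 {
    match elapsed_ms {
        0..=59_999 => 4,            // 0-60s: expect response soon
        60_000..=119_999 => 10,     // 60s-2m: taking longer than usual
        120_000..=299_999 => 30,    // 2m-5m: significant delay
        300_000..=899_999 => 60,    // 5m-15m: major delay
        _ => 300,                   // 15m+: likely stuck, minimal polling
    }
}

fn main() {
    println!("{}", receipt_backoff_secs(30_000));    // 4
    println!("{}", receipt_backoff_secs(150_000));   // 30
    println!("{}", receipt_backoff_secs(3_600_000)); // 300
}
```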

Variables

| Var | Description |
|---|---|
| p | Request's position in queue (0-indexed) |
| Q | TX queue size (used when the request will join at the end) |
| D | TX drain rate (tps) |
| C | Readiness max concurrency |
| R | Nominal readiness check time (ms) |
| P | Nominal processing time (ms); e.g. 2s for input proof, 4s for decrypt |
| T | Nominal TX confirmation time (ms) |
| M | Safety margin (e.g. 0.2) |
| E | Elapsed time in current state (ms) |
| B(E) | Backoff function based on elapsed time |

ETA Computation Formulas

Input Proof

| Status | Formula |
|---|---|
| Queued | clamp(⌈(p/D × 1000 + P + T) × (1+M) / 1000⌉, min, max) |
| Processing | clamp(⌈(p/D × 1000 + P + T) × (1+M) / 1000⌉, min, max) |
| TxInFlight | clamp(⌈P × (1+M) / 1000⌉, min, max) |
| ReceiptReceived | B(E) |
| Completed/TimedOut/Failure | 0 |
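The input-proof formulas can be sketched as plain functions. Units follow the variables table (drain rate D in tps, nominal times P and T in ms, margin M as a fraction); function names are illustrative, not the relayer's API:

```rust
// Sketch of the input-proof ETA formulas above. Names are illustrative.

/// Queued / Processing: clamp(⌈(p/D × 1000 + P + T) × (1+M) / 1000⌉, min, max)
fn input_proof_queued_eta(p: u64, d_tps: f64, p_ms: u64, t_ms: u64, m: f64,
                          min_s: u64, max_s: u64) -> u64 {
    let eta_ms = (p as f64 / d_tps) * 1000.0 + (p_ms + t_ms) as f64;
    ((eta_ms * (1.0 + m) / 1000.0).ceil() as u64).clamp(min_s, max_s)
}

/// TxInFlight: clamp(⌈P × (1+M) / 1000⌉, min, max)
fn input_proof_tx_in_flight_eta(p_ms: u64, m: f64, min_s: u64, max_s: u64) -> u64 {
    ((p_ms as f64 * (1.0 + m) / 1000.0).ceil() as u64).clamp(min_s, max_s)
}

fn main() {
    // Example parameters from this document: D=10, P=2000ms, T=100ms, M=0.2.
    println!("{}", input_proof_queued_eta(100, 10.0, 2000, 100, 0.2, 1, 300)); // 15
    println!("{}", input_proof_tx_in_flight_eta(2000, 0.2, 1, 300));           // 3
}
```

The `clamp` at the end is what enforces the min_seconds/max_seconds bounds from the configuration section.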

Decrypt (User & Public)

| Status | Queue Location | Formula |
|---|---|---|
| Queued | In readiness queue | clamp(⌈(p/C × 1000 + Q/D × 1000 + P + T) × (1+M) / 1000⌉, min, max) |
| Processing | Out of readiness, not in TX | clamp(⌈(R + Q/D × 1000 + P + T) × (1+M) / 1000⌉, min, max) |
| Processing | In TX queue | clamp(⌈(p/D × 1000 + P + T) × (1+M) / 1000⌉, min, max) |
| TxInFlight | - | clamp(⌈P × (1+M) / 1000⌉, min, max) |
| ReceiptReceived | - | B(E) |
| Completed/TimedOut/Failure | - | 0 |
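The two decrypt-specific formulas can be sketched the same way (same unit conventions as the variables table; names are illustrative):

```rust
// Sketch of the decrypt ETA formulas above. Names are illustrative.

/// Queued (in readiness queue):
/// clamp(⌈(p/C × 1000 + Q/D × 1000 + P + T) × (1+M) / 1000⌉, min, max)
fn decrypt_queued_eta(p: u64, c: f64, q: u64, d_tps: f64, p_ms: u64, t_ms: u64,
                      m: f64, min_s: u64, max_s: u64) -> u64 {
    let eta_ms = (p as f64 / c + q as f64 / d_tps) * 1000.0 + (p_ms + t_ms) as f64;
    ((eta_ms * (1.0 + m) / 1000.0).ceil() as u64).clamp(min_s, max_s)
}

/// Processing (out of readiness, not in TX):
/// clamp(⌈(R + Q/D × 1000 + P + T) × (1+M) / 1000⌉, min, max)
fn decrypt_processing_eta(r_ms: u64, q: u64, d_tps: f64, p_ms: u64, t_ms: u64,
                          m: f64, min_s: u64, max_s: u64) -> u64 {
    let eta_ms = r_ms as f64 + (q as f64 / d_tps) * 1000.0 + (p_ms + t_ms) as f64;
    ((eta_ms * (1.0 + m) / 1000.0).ceil() as u64).clamp(min_s, max_s)
}

fn main() {
    // Example parameters: C=50, D=10, R=2000ms, P=4000ms, T=100ms, M=0.2.
    println!("{}", decrypt_queued_eta(100, 50.0, 100, 10.0, 4000, 100, 0.2, 1, 300)); // 20
    println!("{}", decrypt_processing_eta(2000, 100, 10.0, 4000, 100, 0.2, 1, 300));  // 20
}
```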

Key Points

  1. Queued: Uses request's actual position p in the queue (not total queue size)
  2. Processing for Decrypt: Check which queue the request is in:
    • readiness_throttler.get_position(id) returns None → removed from readiness, check TX queue
    • tx_throttler.get_position(id) returns Some(p) → use TX queue formula
    • Both return None → out of readiness, not yet in TX queue
  3. TxInFlight: Uses processing time P (time to get response after TX sent)
  4. ReceiptReceived: Uses backoff B(E) since Copro/KMS response time is unpredictable
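The queue-location check in point 2 can be sketched as a match over the two position lookups. The document references `readiness_throttler.get_position(id)` and `tx_throttler.get_position(id)`; the enum and function below are hypothetical stand-ins for wiring those results together:

```rust
// Sketch of the queue-location check from point 2 above. The enum and
// `locate` are hypothetical; the two Option arguments stand in for
// readiness_throttler.get_position(id) and tx_throttler.get_position(id).
#[derive(Debug)]
enum DecryptLocation {
    InReadinessQueue { position: usize }, // use the Queued formula
    InTxQueue { position: usize },        // use the TX-queue Processing formula
    ReadinessDoneNotInTx,                 // use the out-of-readiness Processing formula
}

fn locate(readiness_pos: Option<usize>, tx_pos: Option<usize>) -> DecryptLocation {
    match (readiness_pos, tx_pos) {
        // Still waiting in the readiness queue.
        (Some(p), _) => DecryptLocation::InReadinessQueue { position: p },
        // Removed from readiness and already queued for the TX throttler.
        (None, Some(p)) => DecryptLocation::InTxQueue { position: p },
        // Out of readiness, not yet in the TX queue (readiness check running).
        (None, None) => DecryptLocation::ReadinessDoneNotInTx,
    }
}

fn main() {
    println!("{:?}", locate(Some(3), None));
    println!("{:?}", locate(None, Some(7)));
    println!("{:?}", locate(None, None));
}
```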

Example Calculations

Using parameters: D=10, C=50, R=2s, P_input=2s, P_decrypt=4s, T=100ms, M=0.2, B(E)=3s

Input Proof (P = 2000ms)

Queued / Processing: ⌈(p/D × 1000 + P + T) × (1+M) / 1000⌉

| p | p/D (s) | + P + T (ms) | × 1.2 (ms) | Result |
|---|---|---|---|---|
| 0 | 0 | 2100 | 2520 | 3s |
| 1 | 0.1 | 2200 | 2640 | 3s |
| 10 | 1 | 3100 | 3720 | 4s |
| 100 | 10 | 12100 | 14520 | 15s |
| 1000 | 100 | 102100 | 122520 | 123s |

TxInFlight: ⌈P × 1.2 / 1000⌉

= ⌈2000 × 1.2 / 1000⌉ = 3s (constant)

ReceiptReceived: B(E) = 3s (constant)

Decrypt (P = 4000ms)

Queued (in readiness): ⌈(p/C × 1000 + Q/D × 1000 + P + T) × (1+M) / 1000⌉

Assuming p = Q (same number of entries in both queues):

| p | p/C (s) | Q/D (s) | + P + T (ms) | × 1.2 (ms) | Result |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 4100 | 4920 | 5s |
| 1 | 0.02 | 0.1 | 4220 | 5064 | 6s |
| 10 | 0.2 | 1 | 5300 | 6360 | 7s |
| 100 | 2 | 10 | 16100 | 19320 | 20s |
| 1000 | 20 | 100 | 124100 | 148920 | 149s |

Processing (out of readiness, not in TX): ⌈(R + Q/D × 1000 + P + T) × (1+M) / 1000⌉

| Q | R (ms) | Q/D (s) | + P + T (ms) | × 1.2 (ms) | Result |
|---|---|---|---|---|---|
| 0 | 2000 | 0 | 6100 | 7320 | 8s |
| 1 | 2000 | 0.1 | 6200 | 7440 | 8s |
| 10 | 2000 | 1 | 7100 | 8520 | 9s |
| 100 | 2000 | 10 | 16100 | 19320 | 20s |
| 1000 | 2000 | 100 | 106100 | 127320 | 128s |

Processing (in TX queue): ⌈(p/D × 1000 + P + T) × (1+M) / 1000⌉

| p | p/D (s) | + P + T (ms) | × 1.2 (ms) | Result |
|---|---|---|---|---|
| 0 | 0 | 4100 | 4920 | 5s |
| 1 | 0.1 | 4200 | 5040 | 6s |
| 10 | 1 | 5100 | 6120 | 7s |
| 100 | 10 | 14100 | 16920 | 17s |
| 1000 | 100 | 104100 | 124920 | 125s |

TxInFlight: ⌈P × 1.2 / 1000⌉

= ⌈4000 × 1.2 / 1000⌉ = 5s (constant)

ReceiptReceived: B(E) = 3s (constant)

Summary Table (p=100, Q=100)

| Status | Input Proof | Decrypt |
|---|---|---|
| Queued | 15s | 20s |
| Processing (out of readiness, not in TX) | - | 20s |
| Processing (in TX queue) | 15s | 17s |
| TxInFlight | 3s | 5s |
| ReceiptReceived | 3s | 3s |

Response Format

POST Response (202 Accepted)

```http
HTTP/1.1 202 Accepted
Retry-After: 27

{"status": "queued", "job_id": "...", "eta_seconds": 27}
```

GET Response (202 In Progress)

```http
HTTP/1.1 202 Accepted
Retry-After: 10

{"status": "queued", "state": "tx_in_flight", "eta_seconds": 10, "elapsed_seconds": 15}
```

Admin Configuration

All parameters are runtime-updatable via admin endpoints:

  • Nominal processing times per request type
  • TX throttler TPS (drain rate)
  • Retry-after bounds (min/max)
  • Safety margin
  • Copro/KMS backoff intervals

Design Rationale

Why ReceiptReceived uses fixed backoff (no safety margin):

  • Copro/KMS response time is fundamentally unpredictable
  • Backoff intervals are already conservative by design
  • Adding margin would just increase polling delay unnecessarily

Why milliseconds internally:

  • All internal calculations use milliseconds to avoid rounding errors
  • Conversion to seconds only happens when setting the Retry-After header

Why position-based instead of queue-size-based:

  • For GET requests polling an existing Queued request, using total queue size is incorrect
  • A request that's been waiting and is now at position 5 should get a shorter ETA than a new request at position 100
  • Using get_position(id) provides accurate estimates as the request advances through the queue