Pegboard Tunnel Retries


TODO: Clean up this AI slop explanation

This document explains how retries are coordinated between Guard and Pegboard-based handlers when transient tunnel (UPS) issues occur, for both HTTP and WebSocket traffic.

HTTP

  • Signal: A retryable transient tunnel failure is signaled by returning an HTTP 503 with the X-RIVET-ERROR header set.

    • Example (Pegboard Gateway): on tunnel closed (e.g., UPS request_timeout), the gateway replies with 503 and X-RIVET-ERROR: pegboard_gateway.tunnel_closed.
  • Guard behavior

    • Guard considers a response retryable if status == 503 and the X-RIVET-ERROR header is present.
    • Guard then applies exponential backoff (from middleware-config: max_attempts, initial_interval), re-resolves the route with ignore_cache = true, and retries the request.
    • On successful retry, traffic proceeds normally to the new target.
    • If attempts are exhausted, Guard returns an upstream error to the client.
  • Notes for implementers

    • For transient tunnel failures, return a 503 with X-RIVET-ERROR to trigger Guard retries. Use an empty body or minimal payload as appropriate.
    • Do not return a 503 for non-transient errors; let the normal error flow apply.
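
The Guard-side rules above can be sketched roughly as follows. This is a minimal illustration, not Guard's actual code: UpstreamResponse, RetryConfig, is_retryable, backoff_delay, and send_with_retries are hypothetical names, the doubling factor is an assumption (the text only says "exponential"), and so is the choice to count the first request toward max_attempts. Mapping the final failure to an upstream error is left out.

```rust
use std::time::Duration;

/// Hypothetical stand-in for an upstream HTTP response; not Guard's real type.
pub struct UpstreamResponse {
    pub status: u16,
    pub headers: Vec<(String, String)>,
}

/// Hypothetical view of the middleware-config retry settings.
pub struct RetryConfig {
    pub max_attempts: u32,
    pub initial_interval: Duration,
}

/// Retryable means: status 503 and the X-RIVET-ERROR header is present.
pub fn is_retryable(res: &UpstreamResponse) -> bool {
    res.status == 503
        && res
            .headers
            .iter()
            .any(|(name, _)| name.eq_ignore_ascii_case("x-rivet-error"))
}

/// Delay before retry number `attempt` (1-based).
/// Assumption: the interval doubles on each retry; the exponent is capped at 16.
pub fn backoff_delay(cfg: &RetryConfig, attempt: u32) -> Duration {
    let exp = attempt.saturating_sub(1).min(16);
    cfg.initial_interval * 2u32.pow(exp)
}

/// Drive the retry loop. `send` is a hypothetical callback that resolves the
/// route (its argument is the ignore_cache flag) and sends the request.
pub fn send_with_retries<F>(cfg: &RetryConfig, mut send: F) -> UpstreamResponse
where
    F: FnMut(bool) -> UpstreamResponse,
{
    // First attempt may use the cached route.
    let mut res = send(false);
    let mut attempt = 1;
    while is_retryable(&res) && attempt < cfg.max_attempts {
        std::thread::sleep(backoff_delay(cfg, attempt));
        // Every retry re-resolves the route with ignore_cache = true.
        res = send(true);
        attempt += 1;
    }
    // If attempts ran out, this is still the last 503; the caller would turn
    // it into an upstream error for the client.
    res
}
```

A plain 503 without the header is not retried in this sketch, which matches the rule that non-transient errors follow the normal error flow.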

WebSocket

This section explains how WebSocket retries are coordinated between Guard and Pegboard-based handlers.

Overview

  • Retries are only possible before the client WebSocket is accepted ("opening" stage).
  • A retryable transient failure is signaled via the error guard.websocket_service_unavailable (WebSocketServiceUnavailable).
  • When Guard receives this error during opening, it re-resolves routes (ignoring cache), applies backoff, and retries with the same client socket and a new handler if available.
  • After the client socket is accepted ("open"), retries are not possible; the handler must close gracefully on failure.

Lifecycle Behavior

  • Opening (before accept)

    • Source: the handler detects a transient UPS/tunnel issue before awaiting the HyperWebsocket (e.g., a failed ups.request(...) to open the tunnel, or a failed ups.subscribe(...)).
    • Handler contract:
      • Do not await the client websocket yet.
      • Return the untouched HyperWebsocket in the error tuple so Guard still owns it: Err((client_ws, err)).
      • The outer wrapper maps tunnel-closed UPS errors (e.g., ups.request_timeout) to WebSocketServiceUnavailable.
    • Guard reaction:
      • Treats WebSocketServiceUnavailable as retryable.
      • Re-resolves the route with ignore_cache = true, using the middleware-config retry/backoff settings.
      • Outcomes:
        • Re-resolve → CustomServe: reuse the same client_ws and retry with the new handler.
        • Re-resolve → Response: accept client, send a Close with the response message as the reason.
        • Re-resolve → Target (non-CustomServe) or mismatch: accept client, send a Close with a generic message (cannot retry).
        • Attempts exhausted: accept client and send a Close with the original error message.
  • Open (after accept)

    • The handler has awaited the client websocket; Guard can no longer retry.
    • Any failures (UPS send/receive, serialization, etc.) should be handled by closing the connection gracefully.
  • Closing

    • Best-effort signaling to the server via ups.request(...) and to the client via Close frames.
    • Failures are ignored; no retries.
  • Closed

    • No further action.
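
The opening-stage outcomes listed above amount to a small decision table. The sketch below encodes it under stated assumptions: Resolved, NextStep, and next_step are hypothetical names, the generic close message is invented for illustration, and checking for exhausted attempts before looking at the re-resolved route is a guess about ordering.

```rust
/// What re-resolving the route produced (mirrors the outcomes in the text).
pub enum Resolved {
    /// A handler that can take over the same unaccepted client socket.
    CustomServe,
    /// The route resolved to a plain response carrying a message.
    Response { message: String },
    /// A non-CustomServe target, or some other mismatch.
    Target,
}

/// What Guard does next while the client socket is still in the opening stage.
#[derive(Debug, PartialEq)]
pub enum NextStep {
    /// Hand the same, still-unaccepted socket to the new handler.
    RetryWithNewHandler,
    /// Accept the client, then send a Close frame with this reason.
    AcceptAndClose { reason: String },
}

pub fn next_step(resolved: Resolved, attempts_left: u32, original_error: &str) -> NextStep {
    // Attempts exhausted: close with the original error message.
    if attempts_left == 0 {
        return NextStep::AcceptAndClose {
            reason: original_error.to_string(),
        };
    }
    match resolved {
        Resolved::CustomServe => NextStep::RetryWithNewHandler,
        Resolved::Response { message } => NextStep::AcceptAndClose { reason: message },
        // Assumption: the exact wording of the generic message is made up here.
        Resolved::Target => NextStep::AcceptAndClose {
            reason: "websocket retry is not possible for this route".to_string(),
        },
    }
}
```

Only the CustomServe branch keeps the socket unaccepted; every other branch accepts the client solely so that a Close frame can be delivered.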

Implementer Guidance

  • Keep the client socket intact for retries:

    • Only return a retryable error (that maps to WebSocketServiceUnavailable) before awaiting the client websocket.
    • Return the socket in the error tuple: Err((client_ws, err)).
  • Map tunnel-closed errors at the wrapper:

    • In the outer handle_websocket wrapper, detect tunnel-closed (e.g., ups.request_timeout) and map to WebSocketServiceUnavailable.
    • handle_websocket_inner should return raw errors; do not construct WebSocketServiceUnavailable inside the inner function.
  • Use ups.request for all tunnel operations (open, messages, close):

    • Pre-accept failures should surface as errors with the unconsumed client_ws so Guard can retry.
    • Post-accept failures should break streams and close gracefully; do not attempt retries.
  • Backoff and attempts:

    • Guard uses middleware-config values for max_attempts and initial_interval to perform exponential backoff between retries.
    • Routes are re-resolved with ignore_cache = true on each retry to avoid stale targets.
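
The split between the inner function and the outer wrapper might look like the following. This is a synchronous sketch with stand-in types, assuming nothing about the real signatures: ClientWs, RawError, GuardError, and the tunnel_up flag are hypothetical, and the real handlers are presumably async. It only shows the two rules from the guidance: the inner function returns raw errors together with the untouched socket, and the wrapper alone maps tunnel-closed errors to WebSocketServiceUnavailable.

```rust
/// Hypothetical stand-in for the unaccepted client socket (HyperWebsocket in the text).
#[derive(Debug, PartialEq)]
pub struct ClientWs(pub u32);

/// Raw errors surfaced by the inner function (illustrative variants only).
#[derive(Debug, PartialEq)]
pub enum RawError {
    UpsRequestTimeout,
    Other(String),
}

/// Errors as Guard sees them.
#[derive(Debug, PartialEq)]
pub enum GuardError {
    WebSocketServiceUnavailable,
    Other(String),
}

/// Inner function: returns raw errors and hands the untouched socket back.
fn handle_websocket_inner(
    client_ws: ClientWs,
    tunnel_up: bool,
) -> Result<(), (ClientWs, RawError)> {
    if !tunnel_up {
        // Pre-accept failure: the socket has not been awaited, so return it.
        return Err((client_ws, RawError::UpsRequestTimeout));
    }
    // Real code would await the client socket here; from this point on,
    // failures must close the connection instead of asking for a retry.
    drop(client_ws);
    Ok(())
}

/// Outer wrapper: the only place where the sentinel error is constructed.
pub fn handle_websocket(
    client_ws: ClientWs,
    tunnel_up: bool,
) -> Result<(), (ClientWs, GuardError)> {
    handle_websocket_inner(client_ws, tunnel_up).map_err(|(ws, err)| {
        let mapped = match err {
            RawError::UpsRequestTimeout => GuardError::WebSocketServiceUnavailable,
            RawError::Other(msg) => GuardError::Other(msg),
        };
        // The socket travels with the error so Guard still owns it.
        (ws, mapped)
    })
}
```

Keeping the mapping in one place means the inner function never needs to know which of its errors Guard treats as retryable.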

Rationale

  • Returning the untouched HyperWebsocket in errors preserves the ability for Guard to re-route and retry without disconnecting the client.
  • Mapping tunnel-closed conditions to a single sentinel error (WebSocketServiceUnavailable) provides a consistent, guard-specific signal for retryability.
  • Restricting retries to pre-accept avoids protocol violations and simplifies resource ownership.