Back to Onyx

External-App OAuth Token Refresh — Implementation Plan

docs/craft/features/external-apps/oauth-token-refresh.md

4.1.014.8 KB
Original Source

External-App OAuth Token Refresh — Implementation Plan

Scope. This plan implements lazy, just-in-time refresh of a connected external app's OAuth access token at the egress credential-injection seam. It builds on action policies (which decides whether a request is forwarded) and the egress proxy's credential injection — this plan only ensures the token injected on an approved forward is live.

Why lazy / why this shape. The approach trade-offs (lazy vs. background vs. on-401, and the three ways to single-flight a lazy refresh) are argued in plans/external-app-lazy-token-refresh-design.md. This doc is the implementation contract for the chosen path — Approach 2: offloaded refresh via asyncio.to_thread, single-flighted by a system-wide Redis lock — and exists so a reviewer can map the PR diff to an agreed design.

Summary

OAuth access tokens for connected apps are captured once at connect time and never refreshed. Google Calendar tokens expire ~1 hour later, after which the egress proxy injects a dead Bearer token, the upstream 401s, and the agent's request fails until the user manually reconnects. The refresh_token is already stored — it is simply never used.

This change refreshes the access token transparently, the first time it is needed after it goes stale, using the stored refresh_token. A burst of concurrent sandbox requests is single-flighted through a Redis lock so the provider's token endpoint is hit at most once per (tenant, app, user) per expiry.

Problem

  • extract_credentials persists refresh_token + expires_in at connect time (backend/onyx/external_apps/providers/{google_calendar,slack,linear}.py), but nothing ever calls the token endpoint again and no absolute expiry is stored.
  • The sole injection seam, GateAddon._inject_credentials (backend/onyx/sandbox_proxy/addons/gate.py) → resolve_injection_headers (backend/onyx/external_apps/credentials.py), renders whatever is stored — including an expired token. It already opens a synchronous DB session here, so the refresh is a natural in-line addition at this one seam.

What is already in place (not part of this PR)

  • Encryption at rest (PR #11514): ExternalAppUserCredential.user_credentials and ExternalApp.organization_credentials are EncryptedJson() / SensitiveValue[dict], read via .get_value(apply_mask=False). No encryption migration is needed here.
  • Refresh token + expires_in capture: providers already store both. This PR adds only the absolute expires_at and the refresh machinery.

Design decisions (locked)

DecisionChoiceRationale
TriggerLazy, at the injection seam (GateAddon._inject_credentials)Two steps — refresh (ensure_fresh_credentials) then render (resolve_injection_headers).
Event-loop safetyThe refresh runs via await asyncio.to_thread(...)It can block (token POST + Redis lock); off-thread it stalls only the triggering request, not the whole proxy loop. The injection seam and _resolve_and_match are async so both verdict paths await the refresh inline (no flow.metadata side channel).
Interface boundaryThe seam passes the session factory + ids; the helper abstracts read + refresh + store entirelyThe gate stays ignorant of refresh mechanics (no DB calls, no lock, no token POST, no session handling at the seam).
ensure_fresh_credentials shape(db_session_factory, tenant_id, external_app_id, user_id) -> NoneSelf-contained: opens its own short sessions, single-flights, persists. Returns nothing — the caller just renders next.
Single-flightFleet-wide Redis lock (redis_shared_lock) with double-checked re-readDedupes a stampede across processes/pods and is safe for token-rotating providers; fully hidden inside the helper.
Session lifetimeThe helper takes its own short sessions per step (pre-check, re-read, persist)No connection held across the Redis-lock wait or the token POST. The factory is passed in (not a live session, not threaded through call after call).
Terminal handlingHelper clears the credential and returns (does not raise)A revoked grant becomes "disconnected" — the same path as a never-connected app (render finds no creds → forwards unauthenticated → upstream 401 + UI reconnect). Keeps the gate free of refresh-specific error handling.
Expiry storageAbsolute UTC expires_at instantNo "when was this written" bookkeeping on read.
expires_at stampingAt the callback and inside the refresh helper, not the providerextract_credentials / refresh_credentials stay clockless and trivially testable.
Refresh-token-only rows backfillSelf-heal (treat missing expires_at as never-expire)The pre-change population is tiny and transient; one extra 401 then a reconnect.

Event-loop note. gate.request runs on the mitmproxy event loop, which multiplexes egress for many sessions/tenants. The refresh's blocking work (token POST up to refresh_http_timeout_seconds, Redis-lock wait up to _LOCK_WAIT_S) therefore runs in a worker thread via await asyncio.to_thread(...), so a slow or hung token endpoint stalls only the triggering request — not every concurrent proxied flow. Because refreshes now run on threads (no longer serialized by the loop), the Redis lock is the real in-process single-flight, not redundant.

Changes (file-by-file — the PR review map)

1. Stamp absolute expires_at at write time

  • backend/onyx/server/features/build/api/external_apps_oauth_api.py — after provider.extract_credentials(...) in the callback, compute expires_at (UTC, from the response's expires_in) and merge it into the stored dict before upsert_external_app_user_credential. extract_credentials stays a pure response→dict mapper (no clock).

2. Provider refresh capability (template method)

  • backend/onyx/external_apps/providers/base.pyOAuthExternalAppProvider owns refresh as a template method, so the format knowledge lives on the provider class (not in free functions) and a divergent provider overrides only the piece that differs — it never re-implements the POST + error boilerplate:
    • refresh_credentials(stored, client_id, client_secret) — the template: read the refresh token → build request → POST → classify error → map, then merge onto the stored creds ({**stored, **mapped}) so connect-time-only fields and the (un-rotated) refresh token survive. Clockless (caller stamps expires_at).
    • Class properties (override per provider): refresh_http_timeout_seconds, terminal_refresh_errors (which OAuth error codes are fatal vs. retryable).
    • Hooks (override per provider): build_refresh_request(...) -> dict (the refresh POST form body — add scope/resource/etc.), classify_token_response(...) (failure detection), and extract_credentials (response → creds, shared with the initial grant).
    • All built-ins use the default RFC-6749 path today; the hooks exist so the next built-in or a (future, config-driven) custom OAuth provider slots in by overriding one method. Slack/Linear return no expires_in, so they're never refreshed regardless.

3. Refresh-and-persist helper (new)

  • backend/onyx/external_apps/token_refresh.py (new) — the one call the gate makes; everything about keeping the token fresh lives behind it:
    ensure_fresh_credentials(db_session_factory, tenant_id, external_app_id, user_id) -> None
    
    • Pre-check (own short session, then closed): read creds, needs_refresh? If not → return (the common path — just one cred read, no lock, no network, no provider/client-cred resolution).
    • Acquire redis_shared_lock("ea_token_refresh:{tenant}:{app}:{user}", …).
    • Re-read + gather (own short session, then closed): a fresh session sees a concurrent winner's committed refresh (read-committed); bail if no longer stale, non-OAuth, or missing client creds. Otherwise gather the POST inputs (provider, stored creds, client_id/client_secret) and close the session.
    • provider.refresh_credentials(...) — the token POST runs with no DB connection held. refresh_credentials raises only TokenRefreshError (a 2xx body it can't map is reclassified transient), so a refresh outcome never escapes the helper.
    • Persist (own short session): upsert_external_app_user_credential with the stamp_expires_at-stamped creds.
    • Never raises for a refresh outcome: transient → log + return (keep the existing token); terminal → delete_external_app_user_credential + return (app reads disconnected); lock contention → yield to the winner + return.
    • needs_refresh(stored, now, skew_s=120): no expires_atFalse; else expires_at - now <= skew. stamp_expires_at builds a new dict, so the get_value(apply_mask=False) cache is never mutated. The factory is passed in (not a live session); persistence stays in db/external_app.py.

4. Wire into the egress seam

  • backend/onyx/sandbox_proxy/addons/gate.py_inject_credentials (now async) refreshes off the event loop, then renders:
    python
    await asyncio.to_thread(
        ensure_fresh_credentials, self._db_session_factory, tenant_id, app_id, user_id
    )
    with self._db_session_factory(tenant_id) as db:
        headers = resolve_injection_headers(db, app_id, user_id)
    
    The gate imports only ensure_fresh_credentials + resolve_injection_headers — no DB functions, no lock, no refresh exceptions. _inject_credentials_or_block_inject_credentials and _resolve_and_match are all async, so both verdict paths await injection inline: the ALWAYS path in _resolve_and_match, the ASK-approved path in request (no flow.metadata hand-off). A revoked grant is cleared inside the helper → render finds no creds → forwards unauthenticated (upstream 401), same as a never-connected app. Unexpected errors hit the broad exceptFalse (fail closed).

Failure handling

All inside ensure_fresh_credentials — the gate never sees a refresh-specific exception:

  • invalid_grant / revoked (terminal): clear the credential row (delete_external_app_user_credential) and return. The render then finds no creds → forwards unauthenticated → upstream 401 + the app reads as disconnected (UI prompts reconnect). No retry loop.
  • Transient network / 5xx: log and return the existing token in place; the request proceeds (may 401) and the next request retries. Never destroy a possibly-valid token on a blip.
  • Lock contention (RedisSharedLockAcquisitionError): yield to the concurrent refresher and return; this request proceeds with the current token.
  • Unexpected (DB down, etc.): propagates to the gate's broad exceptFalse → block (fail closed).

Concurrency & edge cases

  • Stampede: N parallel stale requests → one wins the Redis lock and refreshes; the rest re-read (fresh session) and see the committed token.
  • Skew: the 120s window refreshes early so no in-flight request reaches upstream with a just-expired token.
  • Slack / Linear / static-credential apps: no expires_atneeds_refresh is Falseensure_fresh_credentials is a no-op; behaviour unchanged.
  • Lock scope: redis_shared_lock is on the shared (not tenant) Redis client, so the lock name encodes tenant_id:app_id:user_id.
  • No connection across the POST: each step takes its own short session; the token POST holds none.
  • Clock: compare against UTC now; store/parse absolute instants.

Tests

  • Unit (backend/tests/unit/external_apps/):
    • needs_refresh: fresh → no-op; within-skew / expired → refresh; no expires_at → no-op. (test_token_utils.py)
    • refresh_credentials mapping: success maps fields; rotation persists a new refresh_token; no-rotation carries the old one forward; the merge preserves connect-time-only fields (e.g. team_id); invalid_grant → terminal; 5xx / network → transient. (test_token_refresh.py)
    • Template-method extensibility: a subclass overriding one hook (build_refresh_request, terminal_refresh_errors) changes only that behavior and inherits the rest. (test_token_refresh.py)
    • ensure_fresh_credentials orchestration (mocked DB factory + lock + provider): fresh → no-op; double-checked re-read skips when the winner refreshed; stale → upserts a stamp_expires_at-stamped dict; terminal clears the row without raising; transient keeps the existing token; non-OAuth → no-op. (test_token_refresh.py)
  • Gate (backend/tests/unit/sandbox_proxy/test_gate.py):
    • _inject_credentials calls ensure_fresh_credentials(factory, tenant, app, user) before rendering. An autouse fixture defaults the refresh to a no-op so the other gate tests pin injection/approval only.

Reviewer checklist

  • extract_credentials remains a pure mapper; expires_at is stamped by the callback and the refresh helper — never the provider.
  • Refresh format lives on OAuthExternalAppProvider as overridable properties/hooks (template method), not in free functions/constants — a new provider's divergence is a one-method override, not a reimplementation.
  • The gate seam imports only ensure_fresh_credentials + resolve_injection_headers — no DB functions, lock, or refresh exceptions.
  • The (blocking) refresh runs via await asyncio.to_thread(...) — never synchronously on the mitmproxy event loop. _LOCK_WAIT_S is bounded so a waiter doesn't hold a worker thread long.
  • ensure_fresh_credentials takes the session factory (not a live session); each step's session is short — none spans the lock wait or POST.
  • stamp_expires_at builds a new dict — the get_value(apply_mask=False) cache is never mutated.
  • All DB calls go through db/external_app.py; the provider token POST has a bounded HTTP timeout; lock name includes tenant + app + user.
  • Failure paths (all inside the helper, never raised to the gate): terminal clears the row; transient/lock-contention keep the existing token; only unexpected errors propagate (gate blocks).
  • Slack/Linear (no expires_in) and static-credential apps are provably no-ops.
  • Refresh tokens never logged; injection logging stays header-names-only.

References