management/agentnetwork — domain layer + synth pipeline

docs/agent-networks/modules/21-management-agentnetwork.md

Risk level: High. Central business logic, budget enforcement, and the source of every middleware-chain change the proxy executes.

Backward-compat impact: Additive within the agent-network surface; one behavioural difference for opted-out accounts in parser capture (the capture flag is stamped explicitly false instead of being absent; see capture-pointer semantics below). Non-agent-network proxy services are untouched (the synth chain only ships on agent-net-svc-* targets).

Module boundary

management/server/agentnetwork owns every agent-network entity (providers, policies, guardrails, account budget rules, per-account settings, consumption rows) and translates them into the in-memory *rpservice.Service that the reverse-proxy controller turns into proto.ProxyMappings and pushes to clusters. It is the only writer of the agent-network middleware chain.

Inside the package: manager.go is the CRUD + permissions-gated facade; synthesizer.go walks settings + providers + policies + guardrails and emits the per-account service plus every middleware's JSON config; policyselect.go runs per-request attribution (min-wins account ceiling, then "drain bigger pool first"); reconcile.go diffs successive synth outputs and emits precise Create/Update/Delete proxy-mapping updates plus a peer-map refresh. labelgen/ mints DNS-safe subdomain labels; catalog/ is the static provider catalogue; types/ carries gorm entity structs. The _realstack_test.go files in the parent management/server/ directory exercise the manager + network-map controller end-to-end with no mocks.

Files

| Path | Role |
| --- | --- |
| agentnetwork/manager.go | Manager interface + CRUD + permission gates + bootstrap-settings + reconcile trigger |
| agentnetwork/synthesizer.go | Settings/policy → wire-format synthesis; sole writer of the proxy middleware chain |
| agentnetwork/policyselect.go | Per-request policy attribution + account-budget ceiling (min-wins) |
| agentnetwork/reconcile.go | Per-account synth diff vs in-memory cache → Create/Update/Delete |
| agentnetwork/catalog/catalog.go | Static provider catalogue (auth headers, identity-injection shapes) |
| agentnetwork/labelgen/{labelgen,words}.go | DNS-safe subdomain picker + curated wordlist |
| agentnetwork/types/provider.go | Provider entity + APIKey + Models + ExtraValues + SessionKeys |
| agentnetwork/types/policy.go | Policy entity + PolicyLimits (token + budget) |
| agentnetwork/types/guardrail.go | Guardrail entity (ModelAllowlist, PromptCapture) |
| agentnetwork/types/budgetrule.go | AccountBudgetRule (reuses PolicyLimits) |
| agentnetwork/types/settings.go | Per-account Settings (Cluster, Subdomain, 3 toggles) |
| agentnetwork/types/consumption.go | Consumption row + WindowStart aligner |
| agentnetwork/{synthesizer,policyselect,reconcile,wire_shape}_*test.go | See test coverage table |
| agentnetwork/types/consumption_test.go | WindowStart alignment proofs |
| agentnetwork/labelgen/labelgen_test.go | Deterministic picks + exhaustion + fallback |
| management/server/agentnetwork_realstack_test.go | No-mock provider CRUD → network-map fan-out |
| management/server/agentnetwork_budgetrule_realstack_test.go | No-mock budget-rule CRUD + settings preserve-immutable |

Architecture & flow

Synthesis (settings/policy → wire format)

```mermaid
flowchart TD
    A[Mutation: provider/policy/guardrail/settings] --> B[managerImpl.reconcile accountID]
    B --> C{proxyController nil?}
    C -- yes --> D[accountManager.UpdateAccountPeers only]
    C -- no --> E[SynthesizeServices]
    E --> F[loadSettings: NotFound returns ok=false, no synth]
    F --> G[filterEnabledProviders sorted by CreatedAt]
    G --> H[filterEnabledPolicies]
    H --> I[backfillProviderSessionKeys if missing]
    I --> J[indexProviderGroups: providerID -> sorted source groups]
    J --> K[buildRouterConfigJSON drops orphan providers]
    J --> L[buildIdentityInjectConfigJSON per catalog entry]
    H --> M[mergeGuardrails: union allowlist, OR redact]
    M --> N[applyAccountCollectionControls account toggle = SOLE capture control]
    N --> O[marshalGuardrailConfig]
    K --> P[buildMiddlewareChain 8 middleware entries]
    L --> P
    O --> P
    P --> Q[buildAccountService: AccessGroups=union source groups, noop.invalid target]
    Q --> R[reconcile.diffMappings vs cache]
    R --> S[SendServiceUpdateToCluster CREATE/MODIFY/REMOVE]
    R --> T[accountManager.UpdateAccountPeers: fans synth ACLs into network map]
```

Budget rule resolution (min-wins, group+user bound)

```mermaid
flowchart TD
    A[SelectPolicyForRequest in] --> B[checkAccountBudget: runs FIRST, independent of policies]
    B --> C[GetAccountAgentNetworkBudgetRules]
    C --> D{for each enabled rule}
    D --> E{budgetRuleApplies?}
    E -- no --> D
    E -- yes --> F[attrGroup = lowestIntersect TargetGroups, in.GroupIDs]
    F --> G{Token cap enabled?}
    G -- yes --> H[evalTokenCap user dim + group dim]
    H --> I{exhausted?}
    I -- yes --> J[DENY: llm_account.token_cap_exceeded - STOP]
    I -- no --> K{Budget cap enabled?}
    G -- no --> K
    K -- yes --> L[evalBudgetCap user dim + group dim]
    L --> M{exhausted?}
    M -- yes --> N[DENY: llm_account.budget_cap_exceeded - STOP]
    M -- no --> D
    K -- no --> D
    D --> O[All rules passed -> fall through to per-policy selection]
```

Key invariant: rules are checked sequentially and ANY exhausted rule denies (all-must-pass / min-wins). Untargeted rules (len(TargetGroups)==0 && len(TargetUsers)==0) apply to every caller (policyselect.go:393).
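
The all-must-pass gate above can be sketched in Go as follows. This is a minimal illustration, not the code in policyselect.go: the `budgetRule` struct, the `ruleApplies` and `checkRules` helpers, the "0 means disabled" convention, and the `>=` exhaustion test are all assumptions made for the example. Only the deny codes and the untargeted-rule condition come from the text.

```go
package main

import "fmt"

// budgetRule is a hypothetical, trimmed-down stand-in for an account budget rule.
type budgetRule struct {
	Enabled      bool
	TargetGroups []string
	TargetUsers  []string
	TokenCap     int64   // assumption: 0 means the token cap is disabled
	TokensUsed   int64
	BudgetCapUSD float64 // assumption: 0 means the budget cap is disabled
	SpentUSD     float64
}

func contains(xs []string, x string) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

// ruleApplies reports whether a rule binds the caller.
// Untargeted rules (no groups, no users) bind every caller.
func ruleApplies(r budgetRule, userID string, groupIDs []string) bool {
	if len(r.TargetGroups) == 0 && len(r.TargetUsers) == 0 {
		return true
	}
	if contains(r.TargetUsers, userID) {
		return true
	}
	for _, g := range groupIDs {
		if contains(r.TargetGroups, g) {
			return true
		}
	}
	return false
}

// checkRules walks the rules in order; the first exhausted cap denies.
func checkRules(rules []budgetRule, userID string, groupIDs []string) (bool, string) {
	for _, r := range rules {
		if !r.Enabled || !ruleApplies(r, userID, groupIDs) {
			continue
		}
		if r.TokenCap > 0 && r.TokensUsed >= r.TokenCap {
			return false, "llm_account.token_cap_exceeded"
		}
		if r.BudgetCapUSD > 0 && r.SpentUSD >= r.BudgetCapUSD {
			return false, "llm_account.budget_cap_exceeded"
		}
	}
	return true, ""
}

func main() {
	rules := []budgetRule{
		{Enabled: true, TokenCap: 1000, TokensUsed: 200},                            // untargeted, not exhausted
		{Enabled: true, TargetGroups: []string{"eng"}, BudgetCapUSD: 5, SpentUSD: 5}, // exhausted for "eng"
	}
	fmt.Println(checkRules(rules, "alice", []string{"eng"}))
	fmt.Println(checkRules(rules, "bob", []string{"sales"}))
}
```

A caller in the exhausted group is denied even though the untargeted rule still has headroom, which is the min-wins property.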

Policy selection (per-peer, per-request)

```mermaid
flowchart TD
    A[Account-budget gate passed] --> B[GetAccountAgentNetworkPolicies]
    B --> C[filterApplicablePolicies enabled + provider match + group intersect]
    C --> D{candidates empty?}
    D -- yes --> E[Allow, empty SelectedPolicyID]
    D -- no --> F[scoreCandidates -> scoreOne per policy]
    F --> G[scoreOne: attrGroup + window]
    G --> H{any cap exhausted?}
    H -- yes --> I[Drop policy; record last deny code]
    H -- no --> K[Keep as live candidate]
    F --> L{live candidates exist?}
    L -- no --> M[Deny with last exhaustion code]
    L -- yes --> N[Sort: uncapped wins -> larger group token -> group budget -> user token -> user budget -> oldest CreatedAt]
    N --> O[winner = scored 0]
    O --> P[Allow + SelectedPolicyID + AttributionGroupID + WindowSeconds]
```
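
The winner ordering in the flowchart can be illustrated with a stable sort. The sketch below is a simplification under stated assumptions: the `candidate` struct and `pickWinner` are hypothetical, and only three of the sort keys are modelled (uncapped, group token pool, CreatedAt). The group-budget, user-token and user-budget keys are omitted for brevity.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// candidate is a hypothetical view of a live (non-exhausted) policy.
type candidate struct {
	PolicyID      string
	Uncapped      bool
	GroupTokenCap int64
	CreatedAt     time.Time
}

// pickWinner orders live candidates and returns the first one.
// It reports false when there are no live candidates.
func pickWinner(live []candidate) (candidate, bool) {
	if len(live) == 0 {
		return candidate{}, false
	}
	sorted := append([]candidate(nil), live...) // leave the caller's slice untouched
	sort.SliceStable(sorted, func(i, j int) bool {
		a, b := sorted[i], sorted[j]
		if a.Uncapped != b.Uncapped {
			return a.Uncapped // uncapped policies win outright
		}
		if a.GroupTokenCap != b.GroupTokenCap {
			return a.GroupTokenCap > b.GroupTokenCap // drain the bigger pool first
		}
		return a.CreatedAt.Before(b.CreatedAt) // oldest policy breaks remaining ties
	})
	return sorted[0], true
}

func main() {
	live := []candidate{
		{PolicyID: "small", GroupTokenCap: 1_000},
		{PolicyID: "large", GroupTokenCap: 50_000},
	}
	w, _ := pickWinner(live)
	fmt.Println(w.PolicyID)
}
```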

End-to-end: a mutation calls managerImpl.reconcile(ctx, accountID) (manager.go:205,239,...). Reconcile defers an accountManager.UpdateAccountPeers so the network-map controller re-runs and injectAllProxyPolicies picks up the new access groups; with a proxyController wired, it re-synthesizes the service, diffs against reconcileCache[accountID] (guarded by reconcileMu), and emits proto mappings to the cluster derived from the mapping's domain (reconcile.go:120). Synthesis is stateless and idempotent. Sole persistent side effect: backfillProviderSessionKeys (synthesizer.go:249) mints ed25519 keys on legacy provider rows and writes them back.
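
The diff step in that flow can be sketched as a comparison between the cached mappings and the freshly synthesized ones. The `mapping` and `update` types and the `diff` function below are hypothetical, and the sketch assumes unchanged entries are skipped; the real reconcile.go may behave differently on a no-change pass.

```go
package main

import (
	"fmt"
	"sort"
)

// mapping is a hypothetical, comparable summary of one synthesized proxy mapping.
type mapping struct {
	ID     string
	Domain string
	Config string
}

// update pairs a mapping with the action the proxy should take.
type update struct {
	Kind    string // "create", "update" or "delete"
	Mapping mapping
}

// diff compares the cached mappings with the new synth output.
func diff(prev, next map[string]mapping) []update {
	var out []update
	for id, m := range next {
		old, existed := prev[id]
		switch {
		case !existed:
			out = append(out, update{Kind: "create", Mapping: m})
		case old != m:
			out = append(out, update{Kind: "update", Mapping: m})
		}
	}
	for id, m := range prev {
		if _, still := next[id]; !still {
			out = append(out, update{Kind: "delete", Mapping: m})
		}
	}
	// Map iteration order is random, so sort for a deterministic result.
	sort.Slice(out, func(i, j int) bool { return out[i].Mapping.ID < out[j].Mapping.ID })
	return out
}

func main() {
	prev := map[string]mapping{"a": {ID: "a", Config: "v1"}, "b": {ID: "b", Config: "v1"}}
	next := map[string]mapping{"a": {ID: "a", Config: "v2"}, "c": {ID: "c", Config: "v1"}}
	for _, u := range diff(prev, next) {
		fmt.Println(u.Kind, u.Mapping.ID)
	}
}
```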

At request time the path is independent: the proxy calls SelectPolicyForRequest (policyselect.go:56); account-budget ceiling first, then per-policy scoring. Token + budget caps share evalTokenCap / evalBudgetCap — same primitive for account rules and policy limits, label differentiates the deny reason. After a served request, RecordAccountBudgetUsage (policyselect.go:415) fans deltas to every applicable rule's distinct (dim_kind, dim_id, window) tuple, deduplicating to prevent double-count when two rules share target+window.
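
The dedup described for RecordAccountBudgetUsage amounts to collecting counter keys in a set before booking usage. A minimal sketch follows; `counterKey`, `appliedRule`, `usageTargets`, and the "user" / "group" kind labels are assumptions for illustration only.

```go
package main

import "fmt"

// counterKey identifies one consumption counter. All three fields take part
// in equality, so two rules sharing target and window collapse to one key.
type counterKey struct {
	Kind          string // assumption: "user" or "group"
	DimID         string
	WindowSeconds int64
}

// appliedRule is a hypothetical summary of a rule that applies to the caller.
type appliedRule struct {
	AttrGroupID   string // empty when the rule has no group attribution
	WindowSeconds int64
}

// usageTargets returns each distinct counter exactly once, in first-seen order.
func usageTargets(rules []appliedRule, userID string) []counterKey {
	seen := make(map[counterKey]struct{})
	var out []counterKey
	add := func(k counterKey) {
		if _, dup := seen[k]; dup {
			return
		}
		seen[k] = struct{}{}
		out = append(out, k)
	}
	for _, r := range rules {
		add(counterKey{Kind: "user", DimID: userID, WindowSeconds: r.WindowSeconds})
		if r.AttrGroupID != "" {
			add(counterKey{Kind: "group", DimID: r.AttrGroupID, WindowSeconds: r.WindowSeconds})
		}
	}
	return out
}

func main() {
	rules := []appliedRule{
		{AttrGroupID: "eng", WindowSeconds: 86400},
		{AttrGroupID: "eng", WindowSeconds: 86400}, // same target and window: no extra counters
		{AttrGroupID: "eng", WindowSeconds: 3600},  // different window: separate counters
	}
	fmt.Println(len(usageTargets(rules, "alice")))
}
```

Because the key is a comparable struct, it can be used directly as a map key without string concatenation.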

Public contracts

  • Manager interface (manager.go:48-80): CRUD for Providers/Policies/Guardrails/BudgetRules; GetSettings/UpdateSettings (cluster + subdomain immutable, only the three toggles mutate); ListConsumption/RecordConsumption(account, kind, dimID, windowSec, in, out, USD); RecordAccountBudgetUsage(account, user, groups, in, out, USD); SelectPolicyForRequest(ctx, PolicySelectionInput) → *PolicySelectionResult{Allow, SelectedPolicyID, AttributionGroupID, WindowSeconds, DenyCode, DenyReason}.

  • PolicySelectionInput (manager.go:85-90): {AccountID, UserID, GroupIDs, ProviderID} — populated by the proxy from CapturedData + llm_router resolution.

  • Synthesized middleware chain (synthesizer.go:576-657), order load-bearing — response slot runs reverse-of-slice:

    | Slot | Idx | ID | ConfigJSON shape | CanMutate |
    | --- | --- | --- | --- | --- |
    | on_request | 0 | llm_request_parser | {"capture_prompt": <bool>, "redact_pii"?: true} | |
    | on_request | 1 | llm_router | {"providers":[{id, models[], upstream_*, auth_header_*, allowed_group_ids[]}]} | true |
    | on_request | 2 | llm_limit_check | {} | |
    | on_request | 3 | llm_identity_inject | {"providers":[{provider_id, header_pair?, json_metadata?, extra_headers?}]} | true |
    | on_request | 4 | llm_guardrail | {"model_allowlist"?, "prompt_capture":{enabled,redact_pii}} | |
    | on_response | 5 | llm_limit_record | {} (runs LAST at runtime) | |
    | on_response | 6 | cost_meter | {} | |
    | on_response | 7 | llm_response_parser | {"capture_completion": <bool>, "redact_pii"?: true} | |

  • Synthesized service shape (synthesizer.go:739): Mode=HTTP, Private=true, Domain=<subdomain>.<cluster>, AccessGroups=unionSourceGroups(enabledPolicies), one TargetTypeCluster target with Host=noop.invalid:443 (router rewrites per request), Options.{DirectUpstream,AgentNetwork}=true, DisableAccessLog=!settings.EnableLogCollection, CaptureMax{Req,Resp}Bytes=1<<20, CaptureContentTypes=["application/json","text/event-stream"].

Invariants

  • Min-wins / all-must-pass for account budget rules (checkAccountBudget, policyselect.go:353): every applicable enabled rule is checked; first exhausted cap denies. Untargeted rules bind every caller.
  • Account toggle is the SOLE control for capture enablement. applyAccountCollectionControls (synthesizer.go:701) sets merged.PromptCapture.Enabled = settings.EnablePromptCollection unconditionally.
  • Capture-pointer semantics on parser configs — see "Things to scrutinize" below.
  • EnableLogCollection → DisableAccessLog is the only access-log toggle (synthesizer.go:770): the service carries DisableAccessLog=!settings.EnableLogCollection. Default off ⇒ access log suppressed.
  • RedactPii flows verbatim to BOTH parsers (synthesizer.go:584-585) and is OR'd into the merged guardrail (synthesizer.go:706).
  • Cluster and Subdomain are immutable on Settings. UpdateSettings reloads existing row and overlays only the three toggles (manager.go:558-561).
  • Orphan providers (no enabled policy authorises them) NEVER reach the router (synthesizer.go:351-357); skipped from identity_inject for symmetry.
  • Provider creation refuses empty api_key (manager.go:175); deletion refuses while any policy still references it (manager.go:265-273).
  • Session keypair stability across provider edits (manager.go:226-228) — server-managed, copied through every UpdateProvider, never API-surfaced.
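
The immutability invariant on Settings can be pictured as an overlay that starts from the stored row and copies only the three toggles. A minimal sketch, assuming a hypothetical `overlayToggles` helper and a flattened `Settings` struct (the field names follow the text; the real entity lives in types/settings.go and may differ):

```go
package main

import "fmt"

// Settings is a simplified stand-in for the per-account settings entity.
type Settings struct {
	Cluster                string
	Subdomain              string
	EnableLogCollection    bool
	EnablePromptCollection bool
	RedactPii              bool
}

// overlayToggles starts from the stored row and copies only the three
// mutable toggles from the request, so Cluster and Subdomain never change.
func overlayToggles(existing, requested Settings) Settings {
	out := existing
	out.EnableLogCollection = requested.EnableLogCollection
	out.EnablePromptCollection = requested.EnablePromptCollection
	out.RedactPii = requested.RedactPii
	return out
}

func main() {
	stored := Settings{Cluster: "eu.example", Subdomain: "calm-river"}
	req := Settings{Cluster: "attacker.example", Subdomain: "other", EnableLogCollection: true}
	fmt.Printf("%+v\n", overlayToggles(stored, req))
}
```

Whatever cluster or subdomain the request carries is discarded, which is why a client cannot rebind the cluster through an update.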

Things to scrutinize

Correctness

  • Capture-pointer semantics — *bool vs bool. Three states, owned by separate sides:
    • Wire JSON this module emits: buildParserConfigJSON (synthesizer.go:678-693) always stamps the capture field. Agent-network targets ship "capture_prompt": false or "capture_prompt": true — never absent. Same for "capture_completion". The happy-path test pins {"capture_prompt":false} (synthesizer_test.go:174).
    • Proxy-side parser config (consumer): parsers decode into *bool. Matrix:
      • nil (field absent) → legacy default = emit. Preserved for non-agent-network callers and pre-existing tests (the backward-compat hook).
      • false (field present, value false) → suppress emission entirely. The behaviour for opted-out agent-network accounts. Without this, enable_log_collection=true + enable_prompt_collection=false would leak raw user input AND raw model output to the access log.
      • true → emit normally.
    • Why the synth always stamps a value: an agent-network mapping that omitted the field would hit the legacy "always emit" default and re-introduce the leak. The json.Marshal error fallback at synthesizer.go:687 degrades to {}. The code comment claims this path is unreachable, but if it ever fired it would re-introduce the leak. Consider failing closed (return the literal {"capture_prompt":false}) instead.
  • scoreCandidates non-cumulative deny code. Only the last exhausted policy's deny code survives (policyselect.go:188-190). Iteration order is the store's natural order. The auth signal is len(scored)==0, so the deny code is informational only; verify no UI depends on "first exhausted policy" semantics.
  • effectiveWindowSeconds token-wins tiebreak. When both halves are enabled with different windows, token's window wins (policyselect.go:482). Verify RecordLLMUsage increments against the winning window only.
  • RecordAccountBudgetUsage dedup. Two rules with the same (kind, dim_id, window) would double-count without the tuples map (policyselect.go:434-449). Key includes all three dimensions — correct.
  • Fail-closed on bad provider: unknown catalog id (synthesizer.go:794-796) or empty API key (synthesizer.go:801-803) drops the entire account's synth, not just the bad provider. Confirm matches operator UX.
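
The three-state capture matrix described in the first bullet can be reproduced with encoding/json. The sketch below is illustrative only: `parserConfig`, `wireConfig`, `shouldEmit` and `buildConfig` are hypothetical names, and treating an undecodable config as "do not emit" is an assumption. The fallback in `buildConfig` shows the fail-closed variant suggested above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parserConfig is the consumer side: a pointer keeps "absent" distinct
// from "explicitly false".
type parserConfig struct {
	CapturePrompt *bool `json:"capture_prompt,omitempty"`
}

// shouldEmit applies the three-state matrix to a raw config.
func shouldEmit(raw []byte) bool {
	var cfg parserConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return false // assumption: an undecodable config fails closed
	}
	if cfg.CapturePrompt == nil {
		return true // field absent: legacy default, emit
	}
	return *cfg.CapturePrompt // explicit true emits, explicit false suppresses
}

// wireConfig is the producer side: a plain bool without omitempty, so the
// capture field is always present on the wire.
type wireConfig struct {
	CapturePrompt bool `json:"capture_prompt"`
	RedactPii     bool `json:"redact_pii,omitempty"`
}

// buildConfig marshals the config and falls back to an explicit "false"
// rather than {} if marshalling ever fails.
func buildConfig(capture, redact bool) []byte {
	b, err := json.Marshal(wireConfig{CapturePrompt: capture, RedactPii: redact})
	if err != nil {
		return []byte(`{"capture_prompt":false}`)
	}
	return b
}

func main() {
	fmt.Println(shouldEmit([]byte(`{}`)))               // absent field
	fmt.Println(string(buildConfig(false, false)))      // opted-out account
	fmt.Println(shouldEmit(buildConfig(false, false)))  // explicit false
	fmt.Println(shouldEmit(buildConfig(true, true)))    // explicit true
}
```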

Security

  • Redact OR-merge: merged RedactPii = account OR guardrail (synthesizer.go:706). Parser-side flag is settings.RedactPii only, NOT the OR — a guardrail-only opt-in does not propagate to parsers. Correct because the account toggle gates capture, but worth noting on the proxy side.
  • Group resolution must not leak across accounts. Every store call carries accountID (policyselect.go:73, 286, 298, 322, 334, 354); lowestIntersect uses caller's claimed groups only (policyselect.go:494). Risk surface is upstream (handler populates in.GroupIDs).
  • UpdateSettings preserves immutable Cluster + Subdomain (manager.go:558). A client can't rebind the cluster.
  • Provider session keypair backfill writes through SaveAgentNetworkProvider (synthesizer.go:256) from a read-shaped call. Idempotent → worst case is a wasted write under concurrent reconcile + snapshot.
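
The split between the merged guardrail and the parser-side flag can be sketched as follows. All type and function names here (`accountSettings`, `guardrail`, `mergedGuardrail`, `mergeGuardrails`, `parserRedact`) are hypothetical; the rules modelled are the ones stated in the text: allowlist union, redact OR-merge, capture driven only by the account toggle, and parsers reading the account redact flag alone.

```go
package main

import (
	"fmt"
	"sort"
)

// accountSettings holds the two account-level toggles used by the merge.
type accountSettings struct {
	EnablePromptCollection bool
	RedactPii              bool
}

// guardrail is a simplified guardrail entity.
type guardrail struct {
	ModelAllowlist []string
	CaptureEnabled bool // present on the entity, ignored by the merge below
	RedactPii      bool
}

// mergedGuardrail is the result shipped to the guardrail middleware.
type mergedGuardrail struct {
	ModelAllowlist []string
	CaptureEnabled bool
	RedactPii      bool
}

// mergeGuardrails unions allowlists, ORs the redact flags, and lets the
// account toggle alone decide whether capture is enabled.
func mergeGuardrails(s accountSettings, gs []guardrail) mergedGuardrail {
	seen := make(map[string]bool)
	var m mergedGuardrail
	for _, g := range gs {
		for _, model := range g.ModelAllowlist {
			if !seen[model] {
				seen[model] = true
				m.ModelAllowlist = append(m.ModelAllowlist, model)
			}
		}
		m.RedactPii = m.RedactPii || g.RedactPii
	}
	sort.Strings(m.ModelAllowlist)
	m.RedactPii = m.RedactPii || s.RedactPii
	m.CaptureEnabled = s.EnablePromptCollection // guardrail capture flags are not consulted
	return m
}

// parserRedact is the flag the parsers receive: the account setting only.
func parserRedact(s accountSettings) bool { return s.RedactPii }

func main() {
	s := accountSettings{}
	gs := []guardrail{{ModelAllowlist: []string{"b", "a"}, CaptureEnabled: true, RedactPii: true}}
	fmt.Printf("%+v parserRedact=%v\n", mergeGuardrails(s, gs), parserRedact(s))
}
```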

Concurrency

  • reconcileMu guards reconcileCache. Lock window is narrow — compute diff inside, send outside (reconcile.go:56-68).
  • labelRngMu guards labelRng because math/rand.Source is unsafe for concurrent use (manager.go:638-640).
  • Real-store tests use store.NewTestStoreFromSQL with t.TempDir() per test — no shared state, no t.Parallel().
  • RecordAccountBudgetUsage dedup tuples map is per-call; concurrent calls fan out fully — correct (each request's tokens book once per applicable rule).
  • Deferred UpdateAccountPeers runs inline after the proxy push (reconcile.go:28-35); a slow call stretches CRUD response time.

Backward compatibility

  • Capture-pointer semantics (restated): non-agent-network callers see no field → legacy nil-default emit, identical to pre-PR. Agent-network targets always carry an explicit capture_* value.
  • TestSynthesizeServices_HappyPath was updated: request-parser config moved from {} to {"capture_prompt":false} (synthesizer_test.go:174). External snapshot tests against synth output need updating.
  • MergedGuardrails retains zeroed TokenLimits/Budget/Retention even though Policy.Limits carries the real values now; llm_limit_check is the authoritative enforcement. Comment at synthesizer.go:940-948 calls this out.

Performance

  • SynthesizeServices runs on every controller tick / mutation reconcile. Cost: 4 store reads + optional per-provider keypair backfill. Sort + index + merge are O(N log N) / O(P × G); dominant cost is JSON marshalling. No nested loops escape these dimensions.
  • reconcile.diffMappings is O(N + M) with N=M=1 per account today — effectively constant.
  • SynthesizeServicesForCluster (synthesizer.go:71) walks every account on a cluster; per-account failures are swallowed (synthesizer.go:91-93) so a single misconfigured account doesn't drop the cluster. Runs per proxy reconnect.

Observability

  • Activity codes: AgentNetwork{Provider,Policy,Guardrail,BudgetRule}{Created,Updated,Deleted}; AgentNetworkSettingsUpdated with log_collection/prompt_collection/redact_pii payload (manager.go:567-571). No activity code for SelectPolicyForRequest denies — surfaced via proxy access log only (likely intentional given volume).
  • Deny codes namespaced: llm_policy.{token,budget}_cap_exceeded, llm_account.{token,budget}_cap_exceeded (policyselect.go:18-26).
  • Reconcile failures are logged at warn and swallowed (reconcile.go:42-44). Persistent synth failures (e.g. unknown catalog id) silently keep the proxy out of sync — consider a manager-level synth-health surface if this becomes a support burden.

Test coverage

| Test file | Locks down |
| --- | --- |
| synthesizer_test.go | Mock-store: HappyPath (8-mw chain ordering, {"capture_prompt":false} baseline); No{Settings,Providers}; Disabled{Provider,Policy}_NoService; RouterConfigOrdering; PolicyCheckConfig_UnionsSourceGroups; OrphanProvider_HasEmptyAllowedGroups; identity-inject for LiteLLM / Bifrost (overrides + partial disable) / Cloudflare / Portkey / Vercel / OpenRouter / generic non-customizable; GuardrailMerge_AllowlistUnion_LimitsRestrictive; BackfillsMissingSessionKeys; HTTPUpstream_KeepsExplicitPort; UpstreamURLPath_FlowsToRouter; UnknownProviderID_FailsClosed; EmptyAPIKey_FailsClosed. |
| synthesizer_realstore_test.go | Real-sqlite: SurvivesStatusToggle reproduces the disable/re-enable 403 regression; Reconcile_RealStore_PushesPrivateAfterStatusToggle extends through reconcile push. |
| synthesizer_guardrail_realstore_test.go | PromptCaptureAccountIsSoleControl; PromptCaptureFlowsWhenAccountOptsIn; AccountRedactWithoutGuardrailRedact; NoGuardrail_CaptureOff. |
| synthesizer_log_collection_realstore_test.go | LogCollection{Off_SuppressesAccessLog,On_PermitsAccessLog}: verifies DisableAccessLog propagation through ToProtoMapping. |
| synthesizer_parser_redact_realstore_test.go | Capture-pointer regression suite: ParserConfigsCarryRedactPii; ParserConfigsSuppressCaptureWhenLogCollectionOnly (log=on/prompt=off ⇒ both capture flags false); ParserConfigsOmitRedactPiiWhenOff. |
| policyselect_test.go | Mock-store: NoApplicablePolicies; AllowWithLowestGroupAttribution; LargerPoolWinsAcrossUsageLevels; StaysOnLargerPoolAfterPartialDrain; FallsThroughToSmallerPoolWhenLargerExhausted; TiebreakBy{LargerGroupPool,CreatedAt}; DeniesWhenAllExhausted; UncappedPolicyAlwaysWinsAgainstCapped; DisabledPolicyIgnored; StoreErrorPropagates; RejectsEmptyAccount; SharesGroupCounterAcrossPolicies; AntiFallThroughOnLowestGroup; BudgetOnlyExhaustionDenies; BudgetTighterThanTokenWins. |
| policyselect_realstore_test.go | Real-sqlite regression guard: NoApplicablePolicies; AllowAndLowestGroupAttribution; LargerPoolWins_FallsThroughWhenExhausted; BudgetCapDenies; GroupCounterSharedAcrossPolicies; DisabledPolicyIgnored. |
| policyselect_account_realstore_test.go | Account budget rules: AccountCeilingBindsEvenWithUncappedPolicy (min-wins); AccountGroupCeiling; AccountTargetUsersBindsOnlyThatUser; AccountRuleRecordsToOwnWindow. |
| reconcile_test.go | FirstSynth_EmitsCreate; NoChange_EmitsNothingExtra (re-push as Modified; verify desired); PolicyRemoved_EmitsDelete; NilProxyController_NoOp; EmptyAccountID_NoOp; ClusterFromMapping. |
| wire_shape_test.go | TestSynthesizedService_WireShape: proto-shape lockdown via ToProtoMapping. Catches "service not matching" (mapping reaches proxy but no SNI/HTTP route). Asserts ID, Domain, Mode, AuthToken, Private, Auth.Oidc=false, one path / + https://noop.invalid/, 8 middlewares with correct slot enums, router config auth_header_value="Bearer sk-test-key". |
| labelgen/labelgen_test.go | PickUnique_{DeterministicWithSeededRng,AvoidsTakenWordsWhenMostAreReserved,FallsBackWhenAllReserved}; UniqueWords_DropsDuplicates. |
| types/consumption_test.go | WindowStart_{AlignedToUnixEpoch,WithinWindowConverges,AcrossWindowsDiverges,DifferentWindowsHaveDifferentBuckets,SubMinuteAndMinuteAlignment,ZeroWindowReturnsInputUTC}. Bucket alignment so multi-node reads converge. |
| agentnetwork_realstack_test.go | ProviderCRUD_FansOutToProxyAndClientPeers: no-mock end-to-end through real account manager + network-map + agentnetwork; provider create propagates the updated map to both proxy peer and client peer with the synth DNS surface. |
| agentnetwork_budgetrule_realstack_test.go | BudgetRuleCRUD_RealManager; UpdateSettings_PreservesImmutableAndTogglesCollection. |
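
The WindowStart properties listed for types/consumption_test.go (alignment to the Unix epoch, convergence within a window, zero window returning the input in UTC) are consistent with simple integer bucketing. The sketch below is one way to get those properties; `alignUnix` and `windowStart` are hypothetical helpers, not the aligner in types/consumption.go.

```go
package main

import (
	"fmt"
	"time"
)

// alignUnix floors a Unix timestamp to the start of its window.
// A non-positive window returns the timestamp unchanged.
func alignUnix(sec, windowSeconds int64) int64 {
	if windowSeconds <= 0 {
		return sec
	}
	rem := sec % windowSeconds
	if rem < 0 {
		rem += windowSeconds // keep floor semantics for pre-epoch timestamps
	}
	return sec - rem
}

// windowStart returns the bucket start in UTC. Every node computes the same
// value for timestamps inside one window, so reads converge on one row.
func windowStart(t time.Time, windowSeconds int64) time.Time {
	if windowSeconds <= 0 {
		return t.UTC()
	}
	return time.Unix(alignUnix(t.Unix(), windowSeconds), 0).UTC()
}

func main() {
	a := windowStart(time.Unix(3725, 0), 3600)
	b := windowStart(time.Unix(7199, 0), 3600)
	c := windowStart(time.Unix(7200, 0), 3600)
	fmt.Println(a.Equal(b), a.Equal(c))
}
```

Anchoring buckets to the epoch rather than to the first request means two nodes never disagree about where a window begins.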

Known limitations / explicit non-goals

  • MergedGuardrails.TokenLimits/Budget/Retention emit at zero (synthesizer.go:940-948); real enforcement is Policy.Limits via llm_limit_check. Future cleanup implied.
  • Session keys picked from first enabled provider by created_at (pickServiceSessionKeys, synthesizer.go:270). Existing session cookies survive provider edits only while the first-by-CreatedAt provider stays in place. Document for operators.
  • Reconcile failures silently swallowed (reconcile.go:42-44). Persistent failures keep the proxy out of sync until the next reconcile.
  • scoreCandidates exposes only the LAST exhaustion's deny code when multiple policies are exhausted.
  • bootstrapSettingsIfNeeded failure is non-fatal to provider create (manager.go:200): provider lands, synth is no-op until the next provider create retries the bootstrap.
  • Budget rules do not trigger a reconcile (manager.go:476-477). Request-time evaluation only; new rules take effect on the next request without a proxy push.

Cross-references