Plan 06 — Worker Env Isolation

plans/06-worker-env-isolation.md


Goal: Stop host-side environment variables from contaminating the worker's Anthropic SDK subprocess. Two confirmed bugs anchor this plan: ANTHROPIC_BASE_URL leaks from the parent shell while ANTHROPIC_AUTH_TOKEN is blocked, breaking proxy/gateway auth (#2375); and CLAUDE_CODE_EFFORT_LEVEL propagates from host CLI settings into the SDK subprocess where it triggers a permanent HTTP 400 that the retry classifier mistakes for transient (#2357). Adjacent feature #2289 ($TIER alias syntax) is in scope where it shares the same env/model-resolution surface.

Net effect:

  • The OAuth-skip predicate requires a real credential (ANTHROPIC_API_KEY or ANTHROPIC_AUTH_TOKEN), not a bare ANTHROPIC_BASE_URL. Proxy/gateway users put credentials in ~/.claude-mem/.env; nothing relies on parent-shell leaks.
  • BLOCKED_ENV_VARS adds ANTHROPIC_BASE_URL and the CLAUDE_CODE_EFFORT_LEVEL / CLAUDE_CODE_ALWAYS_ENABLE_EFFORT pair (defense in depth alongside the existing env-sanitizer.ts CLAUDE_CODE_* prefix filter).
  • The Claude provider's error classifier explicitly handles HTTP 400 as unrecoverable, matching GeminiProvider/OpenRouterProvider. No more unbounded retry loop on permanent-error responses.
  • Every spawn boundary that hands env to a child process applies sanitizeEnv; the credential-bearing SDK spawns (ClaudeProvider, KnowledgeAgent) additionally apply buildIsolatedEnv. A grep-based CI check forbids spawning subprocesses with raw process.env.
  • ~/.claude-mem/.env becomes the single source of truth for non-OAuth Anthropic credentials. The loader's whitelist documents this contract.

Out of scope:

  • Hook-side env handling (Plan 01 / 02 territory).
  • Worker daemon lifecycle, DB bloat, and chroma-mcp leaks (Plan 03).
  • Observer/Knowledge SDK tool enforcement (Plan 05).
  • Re-auth UX flow (different concern; out of scope for this plan).
  • General provider-router refactor — $TIER alias is scoped to model resolution only (Phase 4).

Problem Statement (line citations)

Bug A — ANTHROPIC_BASE_URL leaks, OAuth gets skipped, ANTHROPIC_AUTH_TOKEN is missing (#2375)

src/shared/EnvManager.ts lines 14–24 (BLOCKED_ENV_VARS):

ts
const BLOCKED_ENV_VARS = [
  'ANTHROPIC_API_KEY',       // #733
  'ANTHROPIC_AUTH_TOKEN',    // added 5edf1557 (2026-05-04) — leak prevention
  'CLAUDECODE',
  'CLAUDE_CODE_OAUTH_TOKEN', // #2215
];

ANTHROPIC_BASE_URL is not in the list, so it survives buildIsolatedEnv() (lines 166–205) and reaches isolatedEnv from process.env.

buildIsolatedEnvWithFreshOAuth() lines 222–288 then runs the OAuth-skip predicate at lines 237–244:

ts
if (
  isolatedEnv.ANTHROPIC_API_KEY ||
  isolatedEnv.ANTHROPIC_BASE_URL ||
  isolatedEnv.ANTHROPIC_AUTH_TOKEN
) {
  clearStaleMarker();
  return isolatedEnv;
}

The bare BASE_URL branch was added in commit a122d34e (2026-05-04) under the rationale "tokenless gateways may exist." Combined with the AUTH_TOKEN block from 5edf1557 the same day, the subprocess ends up with:

  • ANTHROPIC_BASE_URL ✅ (leaked from parent)
  • ANTHROPIC_AUTH_TOKEN ❌ (blocked, never re-injected because ~/.claude-mem/.env is empty for first-time proxy users)
  • CLAUDE_CODE_OAUTH_TOKEN ❌ (skip path bypassed the keychain read)

Result: every SDK subprocess reports "Not logged in · Please run /login".
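The leak can be reproduced in isolation. The sketch below is a simplified stand-in for the deny-list filter, not the actual body of buildIsolatedEnv; `filterBlocked` and the sample values are hypothetical, and only the four list entries are taken from the source quoted above.

```typescript
// Simplified stand-in for the deny-list filter. `filterBlocked` is a
// hypothetical name; the real logic lives in buildIsolatedEnv().
type Env = Record<string, string | undefined>;

// The pre-fix deny-list, as quoted from EnvManager.ts.
const PRE_FIX_BLOCKED: readonly string[] = [
  'ANTHROPIC_API_KEY',
  'ANTHROPIC_AUTH_TOKEN',
  'CLAUDECODE',
  'CLAUDE_CODE_OAUTH_TOKEN',
];

function filterBlocked(parentEnv: Env, blocked: readonly string[]): Env {
  const isolated: Env = {};
  for (const [key, value] of Object.entries(parentEnv)) {
    // Copy everything that is defined and not on the deny-list.
    if (value !== undefined && !blocked.includes(key)) {
      isolated[key] = value;
    }
  }
  return isolated;
}

// A parent shell configured for a gateway (values are made up).
const parentShell: Env = {
  LANG: 'en_US.UTF-8',
  ANTHROPIC_BASE_URL: 'https://proxy.example',
  ANTHROPIC_AUTH_TOKEN: 'parent-shell-token',
};

// `leaked` keeps ANTHROPIC_BASE_URL but has lost ANTHROPIC_AUTH_TOKEN:
// the URL points at the gateway while no credential accompanies it.
const leaked = filterBlocked(parentShell, PRE_FIX_BLOCKED);
```

Passing a list that also names ANTHROPIC_BASE_URL removes the mismatch, which is the change Phase 2 makes.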

Bug B — CLAUDE_CODE_EFFORT_LEVEL triggers permanent 400 + unbounded retry (#2357)

The Anthropic SDK subprocess reads CLAUDE_CODE_EFFORT_LEVEL from its env and forwards it as the effort parameter on Messages API calls. claude-mem's source contains zero references to effort — the leak path is environmental, not code. Models without effort support (Haiku 4.5, Sonnet 4.5, older) reject with HTTP 400.

src/supervisor/env-sanitizer.ts lines 1–51 already filters CLAUDE_CODE_* via ENV_PREFIXES (with explicit allowances in ENV_PRESERVE). But:

  1. buildIsolatedEnv does NOT call sanitizeEnv internally; callers are expected to chain them.
  2. BLOCKED_ENV_VARS is the canonical leak deny-list and does not name CLAUDE_CODE_EFFORT_LEVEL. Defense-in-depth is currently single-layer.
  3. The retry classifier in src/services/worker/ClaudeProvider.ts has no HTTP 400 case; the default branch at line 98 returns kind: 'transient', so a permanent 400 loops forever.

src/services/worker/GeminiProvider.ts lines 89–94 and src/services/worker/OpenRouterProvider.ts lines 82–87 already classify 400 as unrecoverable; that pattern is the copy-target for ClaudeProvider.
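Why the missing 400 case matters can be shown with a toy retry loop. This is a minimal sketch under stated assumptions: the kind names come from this plan, but `isRetryable`, `countAttempts`, and the choice of which kinds are retryable are illustrative and may differ from the real isRetryableKind in retry.ts.

```typescript
// Kinds as listed in this plan.
type ErrorKind =
  | 'transient'
  | 'rate_limit'
  | 'unrecoverable'
  | 'auth_invalid'
  | 'quota_exhausted';

// Assumption: only these two kinds are retried.
function isRetryable(kind: ErrorKind): boolean {
  return kind === 'transient' || kind === 'rate_limit';
}

// Classifier with no HTTP 400 case: everything falls through to 'transient'.
function classifyWithout400(_status: number): ErrorKind {
  return 'transient';
}

// Classifier with the 400 case added.
function classifyWith400(status: number): ErrorKind {
  return status === 400 ? 'unrecoverable' : 'transient';
}

// Counts attempts against a request that always fails with `status`.
// `maxAttempts` is only a safety cap for this demo; the bug described above
// is a loop with no effective cap.
function countAttempts(
  status: number,
  classify: (status: number) => ErrorKind,
  maxAttempts: number,
): number {
  let attempts = 0;
  while (attempts < maxAttempts) {
    attempts += 1;
    const kind = classify(status); // the request "fails" on every attempt
    if (!isRetryable(kind)) break;
  }
  return attempts;
}

const attemptsBefore = countAttempts(400, classifyWithout400, 50); // runs until the cap
const attemptsAfter = countAttempts(400, classifyWith400, 50); // stops after one attempt
```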

Adjacent — $TIER alias syntax (#2289)

src/shared/SettingsDefaultsManager.ts line 116 already implements a portable 'haiku' alias for CLAUDE_MEM_TIER_SIMPLE_MODEL (per #1463). What's missing is the user-facing $TIER syntax in the CLAUDE_MEM_MODEL field that resolves to a provider-appropriate model at request time. Same code surface (model resolution in ClaudeProvider.getModelId at lines 442–446); minimal extension.


Phase 0 — Documentation Discovery (already completed)

Findings below are direct file reads dated 2026-05-08. Each implementation phase cites by line number; do not re-derive. Confidence: HIGH on file/API inventory. Local-only files were read end-to-end.

Allowed APIs / patterns to copy

| Item | Location | What to copy |
|---|---|---|
| BLOCKED_ENV_VARS array | src/shared/EnvManager.ts:14–24 | Add new entries; keep the comment-per-entry convention |
| buildIsolatedEnv filter pattern | src/shared/EnvManager.ts:166–205 | Filter on BLOCKED_ENV_VARS.includes(key); defensive delete isolatedEnv.X post-filter |
| buildIsolatedEnvWithFreshOAuth skip-check | src/shared/EnvManager.ts:237–244 | Restrict predicate to real credentials only |
| loadClaudeMemEnv whitelist + ClaudeMemEnv interface | src/shared/EnvManager.ts:26–32, 79–100 | Single source of truth for what ~/.claude-mem/.env accepts |
| ENV_PRESERVE / ENV_EXACT_MATCHES / ENV_PREFIXES | src/supervisor/env-sanitizer.ts:1–51 | Whitelist-based env stripping; do NOT add CLAUDE_CODE_EFFORT_LEVEL to ENV_PRESERVE |
| Provider error classifier (HTTP 400 → unrecoverable) | src/services/worker/GeminiProvider.ts:89–94, src/services/worker/OpenRouterProvider.ts:82–87 | Identical pattern to apply in ClaudeProvider |
| ClassifiedProviderError constructor + kind: 'unrecoverable' \| 'auth_invalid' \| 'transient' \| 'rate_limit' \| 'quota_exhausted' | src/services/worker/retry.ts | Use existing kind enum; do not invent permanent |
| isRetryableKind predicate | src/services/worker/retry.ts:37–44 | Used by all retry sites; no edit needed once classifier is correct |
| Tier model resolution + 'haiku' alias | src/services/worker/http/routes/SessionRoutes.ts:503–521, src/shared/SettingsDefaultsManager.ts:51–53, 115–117 | Pattern for extending $TIER syntax |
| Settings flat-key + loadFromFile | src/shared/SettingsDefaultsManager.ts:6–67, 70–131, 137–139, 161–206 | New keys MUST be added to interface AND DEFAULTS block |
| Plan format (phase numbering, line-cited edits, anti-patterns block) | plans/01-hook-io-discipline.md, plans/05-observer-tool-enforcement.md | Reuse layout |

Anti-patterns / methods that DO NOT exist (avoid inventing)

  • claude-mem source has zero references to effort, CLAUDE_CODE_EFFORT_LEVEL, CLAUDE_CODE_ALWAYS_ENABLE_EFFORT, or reasoning_effort. Do not "remove the effort parameter we forward" — there is none. The leak is the SDK subprocess reading the env var directly.
  • BLOCKED_ENV_VARS is an Array<string> with .includes lookup. Do NOT convert to Set in the same change — that touches every caller and is an unrelated refactor.
  • ClassifiedProviderError.kind does NOT support the value 'permanent'. The existing enum is 'transient' | 'rate_limit' | 'unrecoverable' | 'auth_invalid' | 'quota_exhausted'. Use unrecoverable for permanent 400s.
  • pending_messages has no retry_count column (dropped — see src/services/sqlite/SessionStore.ts:104's deadColumns array). Issue #2357's "retry counter climbed past #1874" refers to log-line numbering, not a DB counter. Do not add a counter as part of this plan; that's Plan 03 territory.
  • sanitizeEnv strips by prefix and exact match (everything matching CLAUDE_CODE_* etc.) and exempts only the fixed set of names in ENV_PRESERVE. Any name added to ENV_PRESERVE passes through to every spawn path that relies on the sanitizer. Do not add CLAUDE_CODE_EFFORT_LEVEL to ENV_PRESERVE; that is the opposite of what we want.
  • buildIsolatedEnv and sanitizeEnv are independent layers. Some callers chain (sanitizeEnv(buildIsolatedEnv(...))); some only use one. Do not assume chaining is universal — Phase 5 audits every spawn boundary.
  • The ~/.claude-mem/.env loader at src/shared/EnvManager.ts:79–100 uses property-by-property assignment as an implicit whitelist. Do NOT replace with Object.assign(result, parsed) — that breaks the whitelist guarantee.
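The last guard above is easiest to see side by side. The following is a minimal sketch, assuming a loader that accepts only the three Anthropic keys named in this plan (the real ClaudeMemEnv interface may accept more); `parseDotEnv` and `pickAllowed` are hypothetical helpers, not the real loader.

```typescript
// Only the keys named in this plan; the real interface may list others.
interface ClaudeMemEnvSketch {
  ANTHROPIC_API_KEY?: string;
  ANTHROPIC_AUTH_TOKEN?: string;
  ANTHROPIC_BASE_URL?: string;
}

// Very small KEY=VALUE parser: skips blanks and comments, no quoting rules.
function parseDotEnv(text: string): Record<string, string> {
  const parsed: Record<string, string> = {};
  for (const rawLine of text.split('\n')) {
    const line = rawLine.trim();
    if (line === '' || line.startsWith('#')) continue;
    const eq = line.indexOf('=');
    if (eq <= 0) continue;
    parsed[line.slice(0, eq).trim()] = line.slice(eq + 1).trim();
  }
  return parsed;
}

// Property-by-property assignment: each accepted key is named explicitly,
// so anything else in the file is silently dropped.
function pickAllowed(parsed: Record<string, string>): ClaudeMemEnvSketch {
  const result: ClaudeMemEnvSketch = {};
  if (parsed.ANTHROPIC_API_KEY) result.ANTHROPIC_API_KEY = parsed.ANTHROPIC_API_KEY;
  if (parsed.ANTHROPIC_AUTH_TOKEN) result.ANTHROPIC_AUTH_TOKEN = parsed.ANTHROPIC_AUTH_TOKEN;
  if (parsed.ANTHROPIC_BASE_URL) result.ANTHROPIC_BASE_URL = parsed.ANTHROPIC_BASE_URL;
  return result;
}

const fileText = [
  '# proxy settings',
  'ANTHROPIC_BASE_URL=https://proxy.example',
  'ANTHROPIC_AUTH_TOKEN=test-token',
  'CLAUDE_CODE_EFFORT_LEVEL=MAX', // not whitelisted
].join('\n');

const whitelisted = pickAllowed(parseDotEnv(fileText));
// The anti-pattern: a blind copy lets the stray key through.
const copiedBlindly = Object.assign({}, parseDotEnv(fileText));
```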

File inventory used by this plan

| File | Lines | Disposition |
|---|---|---|
| src/shared/EnvManager.ts | 319 | Edited heavily (Phase 2, Phase 3) |
| src/supervisor/env-sanitizer.ts | 51 | Light edit (Phase 3: comment change only; CLAUDE_CODE_* prefix already filters EFFORT_LEVEL) |
| src/services/worker/ClaudeProvider.ts | 448 | Edited (Phase 3: error classifier on query() rejection path) |
| src/services/worker/retry.ts | small | Confirm-only (Phase 3: isRetryableKind already correct) |
| src/services/worker/GeminiProvider.ts | reference only | Read for pattern (Phase 3) |
| src/services/worker/OpenRouterProvider.ts | reference only | Read for pattern (Phase 3) |
| src/shared/SettingsDefaultsManager.ts | 209 | Edited (Phase 4: $TIER alias resolution) |
| src/services/worker/http/routes/SessionRoutes.ts | reference | Read tier-routing pattern (Phase 4) |
| src/services/infrastructure/ProcessManager.ts | line 415 | Audit (Phase 5): confirm sanitizeEnv chain is sufficient |
| src/services/sync/ChromaMcpManager.ts | line 585 | Audit (Phase 5) |
| src/supervisor/process-registry.ts | line 539 | Audit (Phase 5) |
| src/services/worker-service.ts | line 412 | Audit (Phase 5) |
| src/services/worker/knowledge/KnowledgeAgent.ts | lines 54, 149 | Confirm-only (Phase 5) |
| tests/env-isolation.test.ts | NEW | CREATED (Phase 1; extended in Phase 5) |
| scripts/check-spawn-env-discipline.cjs | NEW | CREATED (Phase 7) |
| CLAUDE.md | small | Edited (Phase 7: document ~/.claude-mem/.env contract) |

Phase 1 — Audit & write the failing tests first

Goal: Pin down current behavior with red tests so the fix can prove itself green. No production-code changes in this phase.

1.1 Tests to add (tests/env-isolation.test.ts)

Use bun:test per package.json "test": "bun test". Pattern from tests/claude-provider-resume.test.ts:1.

  1. buildIsolatedEnvWithFreshOAuth strips ANTHROPIC_BASE_URL when no .env credentials are configured
    • Stub process.env.ANTHROPIC_BASE_URL = 'https://proxy.example', no ~/.claude-mem/.env, no API_KEY/AUTH_TOKEN in env.
    • Call buildIsolatedEnvWithFreshOAuth().
    • Assert: result does NOT have ANTHROPIC_BASE_URL (post-fix). Currently RED.
  2. OAuth-skip does not fire on bare ANTHROPIC_BASE_URL
    • Same setup. Spy on readClaudeOAuthToken.
    • Assert: readClaudeOAuthToken was called (because BASE_URL alone is not enough to skip). Currently RED — readClaudeOAuthToken is NOT called today.
  3. ANTHROPIC_AUTH_TOKEN from ~/.claude-mem/.env reaches the isolated env
    • Write a temp .env with ANTHROPIC_AUTH_TOKEN=test-token and ANTHROPIC_BASE_URL=https://proxy.example.
    • Assert: isolatedEnv.ANTHROPIC_AUTH_TOKEN === 'test-token' AND isolatedEnv.ANTHROPIC_BASE_URL === 'https://proxy.example'. Currently GREEN (already works); test guards against regression.
  4. CLAUDE_CODE_EFFORT_LEVEL is stripped from the isolated env
    • Stub process.env.CLAUDE_CODE_EFFORT_LEVEL = 'MAX'.
    • Assert: sanitizeEnv(buildIsolatedEnv()) does NOT contain CLAUDE_CODE_EFFORT_LEVEL. Currently GREEN via env-sanitizer.ENV_PREFIXES; test guards.
  5. CLAUDE_CODE_EFFORT_LEVEL is in BLOCKED_ENV_VARS for defense-in-depth
    • Assert: BLOCKED_ENV_VARS.includes('CLAUDE_CODE_EFFORT_LEVEL'). Currently RED.
  6. HTTP 400 from Claude SDK is classified unrecoverable
    • Construct an error matching the SDK's 400 shape (error.status === 400, body contains does not support the effort parameter).
    • Assert: classifyClaudeProviderError(err).kind === 'unrecoverable'. Currently RED — falls through to transient.
  7. HTTP 400 with effort-parameter body emits a once-only warn log
    • Same setup as 6, plus capture logger.warn calls.
    • Assert: warn fires once with category SDK and a hint pointing at #2357 / ~/.claude-mem/.env. Currently RED.

1.2 Verification checklist (Phase 1)

  • All 7 tests added; tests 1, 2, 5, 6, 7 are RED; tests 3, 4 are GREEN.
  • bun test tests/env-isolation.test.ts runs cleanly (RED tests fail with the expected assertion, no other errors).
  • No production-code changes in this phase (git diff src/ empty).

1.3 Anti-pattern guards

  • Do NOT mock EnvManager.buildIsolatedEnv — it's the unit under test.
  • Do NOT use vi.* (project uses bun:test, not vitest).
  • Do NOT skip cleanup of temp .env files. Use a per-test beforeEach/afterEach with mkdtempSync.

Phase 2 — Fix #2375 (BASE_URL leak + OAuth-skip predicate)

Goal: Make the OAuth-skip require a real credential, and add ANTHROPIC_BASE_URL to the deny-list so it can only be configured via ~/.claude-mem/.env.

2.1 Edit src/shared/EnvManager.ts:14–24 — extend BLOCKED_ENV_VARS

Before:

ts
const BLOCKED_ENV_VARS = [
  'ANTHROPIC_API_KEY',
  'ANTHROPIC_AUTH_TOKEN',
  'CLAUDECODE',
  'CLAUDE_CODE_OAUTH_TOKEN',
];

After (add ANTHROPIC_BASE_URL):

ts
const BLOCKED_ENV_VARS = [
  'ANTHROPIC_API_KEY',       // #733
  'ANTHROPIC_AUTH_TOKEN',    // 5edf1557 — leak prevention; re-injected from ~/.claude-mem/.env when configured
  'ANTHROPIC_BASE_URL',      // #2375 — same leak class as AUTH_TOKEN; re-injected from ~/.claude-mem/.env. Without this entry, a leaked BASE_URL alone triggered the OAuth-skip while no auth credential reached the subprocess.
  'CLAUDECODE',
  'CLAUDE_CODE_OAUTH_TOKEN', // #2215
];

2.2 Edit src/shared/EnvManager.ts:237–244 — restrict OAuth-skip to real credentials

Before:

ts
if (
  isolatedEnv.ANTHROPIC_API_KEY ||
  isolatedEnv.ANTHROPIC_BASE_URL ||
  isolatedEnv.ANTHROPIC_AUTH_TOKEN
) {
  clearStaleMarker();
  return isolatedEnv;
}

After:

ts
// Skip OAuth lookup ONLY when a real credential is configured. A bare
// ANTHROPIC_BASE_URL is not a credential — every documented gateway needs
// either an AUTH_TOKEN or an API_KEY. This guards #2375 against a class of
// leaks where a parent shell exports BASE_URL (e.g. for the Claude Code CLI
// itself) while no token is present.
if (isolatedEnv.ANTHROPIC_API_KEY || isolatedEnv.ANTHROPIC_AUTH_TOKEN) {
  clearStaleMarker();
  return isolatedEnv;
}
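The effect of the narrower predicate can be tabulated over the three configurations that matter. The two conditions below are copied from the Before and After blocks; the function names and sample environments are hypothetical.

```typescript
type IsolatedEnv = Record<string, string | undefined>;

// Pre-fix condition: a bare BASE_URL is enough to skip the OAuth lookup.
function skipsOAuthBefore(env: IsolatedEnv): boolean {
  return Boolean(
    env.ANTHROPIC_API_KEY || env.ANTHROPIC_BASE_URL || env.ANTHROPIC_AUTH_TOKEN,
  );
}

// Post-fix condition: only a real credential skips the OAuth lookup.
function skipsOAuthAfter(env: IsolatedEnv): boolean {
  return Boolean(env.ANTHROPIC_API_KEY || env.ANTHROPIC_AUTH_TOKEN);
}

const bareBaseUrl: IsolatedEnv = { ANTHROPIC_BASE_URL: 'https://proxy.example' };
const gatewayWithToken: IsolatedEnv = {
  ANTHROPIC_BASE_URL: 'https://proxy.example',
  ANTHROPIC_AUTH_TOKEN: 'test-token',
};
const oauthOnly: IsolatedEnv = {};

// Only the bare-BASE_URL case changes: it skipped the keychain read before
// and falls through to the OAuth lookup after.
const predicateCases = [bareBaseUrl, gatewayWithToken, oauthOnly].map((env) => ({
  before: skipsOAuthBefore(env),
  after: skipsOAuthAfter(env),
}));
```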

2.3 Verify the ~/.claude-mem/.env re-injection at src/shared/EnvManager.ts:178–195

Currently the loader path covers BASE_URL re-injection from .env. Confirm by reading the function. No code change required here, but add a TS comment block above lines 178–195 documenting the new contract:

ts
// Contract (post-#2375): ANTHROPIC_BASE_URL, ANTHROPIC_AUTH_TOKEN, and
// ANTHROPIC_API_KEY are *only* populated from ~/.claude-mem/.env. They are
// in BLOCKED_ENV_VARS so parent-shell values never leak through.

2.4 Verification checklist (Phase 2)

  • Tests 1, 2 from Phase 1 now GREEN.
  • Existing test suite still passes (bun test).
  • grep -n "ANTHROPIC_BASE_URL" src/shared/EnvManager.ts shows entries at BLOCKED_ENV_VARS, the ClaudeMemEnv interface, the loader, and the re-injection block. The OAuth-skip condition itself no longer references it (the explanatory comment above the condition still matches).
  • Smoke: with a ~/.claude-mem/.env containing ANTHROPIC_BASE_URL=... and ANTHROPIC_AUTH_TOKEN=..., the worker actually authenticates against the proxy. Test with BigModel or any sandboxed proxy.

2.5 Anti-pattern guards

  • Do NOT add ANTHROPIC_BASE_URL to ENV_PRESERVE in env-sanitizer.ts. BLOCKED_ENV_VARS is the right layer; env-sanitizer is a downstream filter.
  • Do NOT keep the BASE_URL branch in the OAuth-skip predicate on the grounds that "tokenless gateways may exist": every documented gateway requires a token. The skip path was a misdesign.
  • Do NOT delete the existing delete isolatedEnv.CLAUDE_CODE_OAUTH_TOKEN defensive line at line 229. That guard is intact; it's belt-and-suspenders for #2215 and orthogonal to this plan.

Phase 3 — Fix #2357 (CLAUDE_CODE_EFFORT_LEVEL leak + 400 retry classification)

Goal: Two-layer defense for the env leak (existing CLAUDE_CODE_* prefix filter + new BLOCKED_ENV_VARS entries), plus a permanent classification for the resulting HTTP 400 so the retry loop terminates if the leak ever sneaks past either layer.

3.1 Edit src/shared/EnvManager.ts:14–24 — add EFFORT entries to BLOCKED_ENV_VARS

After the Phase 2 edit, the list is:

ts
const BLOCKED_ENV_VARS = [
  'ANTHROPIC_API_KEY',
  'ANTHROPIC_AUTH_TOKEN',
  'ANTHROPIC_BASE_URL',
  'CLAUDECODE',
  'CLAUDE_CODE_OAUTH_TOKEN',
  // #2357 — host CLI config, not part of the plugin's contract. The
  // env-sanitizer's CLAUDE_CODE_* prefix filter strips these for spawn paths
  // that go through it, but BLOCKED_ENV_VARS is the canonical deny-list and
  // belongs in defense-in-depth.
  'CLAUDE_CODE_EFFORT_LEVEL',
  'CLAUDE_CODE_ALWAYS_ENABLE_EFFORT',
];

3.2 Edit src/services/worker/ClaudeProvider.ts — classify HTTP 400 as unrecoverable

Locate the existing error-classification path. The Anthropic SDK raises errors with error.status and a body containing the failure description. Pattern from src/services/worker/GeminiProvider.ts:89–94 (the canonical copy-target):

ts
if (status === 400) {
  return new ClassifiedProviderError(
    `Gemini bad request (status 400)`,
    { kind: 'unrecoverable', cause: input.cause },
  );
}

Add the equivalent in ClaudeProvider's error classifier (new function or existing — read the file; create if absent, mirroring GeminiProvider shape):

ts
function classifyClaudeProviderError(input: { cause: unknown }): ClassifiedProviderError {
  const err = input.cause;
  const status = (err as { status?: number })?.status;
  const bodyText = String((err as { message?: string })?.message ?? '');

  // Permanent: SDK rejected the request itself. Most common cause in the wild
  // is a leaked CLAUDE_CODE_EFFORT_LEVEL the SDK subprocess forwarded as
  // `effort` against a model that doesn't support it (#2357). The leak is
  // also blocked at BLOCKED_ENV_VARS + env-sanitizer; this classifier ends
  // the retry loop if either layer is bypassed.
  if (status === 400) {
    if (/effort parameter/i.test(bodyText)) {
      logger.warn(
        'SDK',
        'Claude API rejected effort parameter — likely CLAUDE_CODE_EFFORT_LEVEL leaked into SDK env (issue #2357). Configure CLAUDE_MEM_MODEL or set credentials in ~/.claude-mem/.env.',
        { status, bodyText },
      );
    }
    return new ClassifiedProviderError(
      `Claude bad request (status 400): ${bodyText}`,
      { kind: 'unrecoverable', cause: input.cause },
    );
  }

  // 401 / 403 → auth_invalid (existing pattern from GeminiProvider:96-103)
  if (status === 401 || status === 403) {
    return new ClassifiedProviderError(
      `Claude auth rejected (status ${status})`,
      { kind: 'auth_invalid', cause: input.cause },
    );
  }

  // 429 → rate_limit
  if (status === 429) {
    return new ClassifiedProviderError(
      `Claude rate limited (status 429)`,
      { kind: 'rate_limit', cause: input.cause },
    );
  }

  // Default: transient (preserves the existing fall-through behavior).
  return new ClassifiedProviderError(
    `Claude SDK error: ${bodyText}`,
    { kind: 'transient', cause: input.cause },
  );
}

Wire this classifier into the existing try { ... } catch around query(...) in ClaudeProvider.ts. Read the actual catch shape before editing: the call site is near lines 180–195, and the existing for await over queryResult is where rejections surface.
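Because the rejection surfaces mid-iteration rather than at the call, the catch has to wrap the loop itself. The sketch below illustrates that shape only: `fakeQuery`, `consume`, and `kindForStatus` are hypothetical stand-ins, the SDK's real message type is not modeled, and the status mapping simply repeats the classifier above.

```typescript
type Kind = 'unrecoverable' | 'auth_invalid' | 'rate_limit' | 'transient';

// Same status mapping as the classifier above, reduced to the kind.
function kindForStatus(status: number | undefined): Kind {
  if (status === 400) return 'unrecoverable';
  if (status === 401 || status === 403) return 'auth_invalid';
  if (status === 429) return 'rate_limit';
  return 'transient';
}

// Stand-in for the SDK stream: yields one message, then fails the way an
// HTTP error would, with a numeric `status` on the thrown error.
async function* fakeQuery(status: number): AsyncGenerator<string> {
  yield 'first message';
  throw Object.assign(new Error(`API Error: ${status}`), { status });
}

async function consume(
  stream: AsyncIterable<string>,
): Promise<{ received: string[]; kind: Kind | null }> {
  const received: string[] = [];
  try {
    // Errors thrown by the generator reject here, while iterating.
    for await (const message of stream) {
      received.push(message);
    }
    return { received, kind: null };
  } catch (cause) {
    const status = (cause as { status?: number } | null)?.status;
    return { received, kind: kindForStatus(status) };
  }
}
```

A caller would then hand the classified kind to the retry predicate; messages received before the failure are still available to it.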

3.3 Confirm src/supervisor/env-sanitizer.ts already strips CLAUDE_CODE_EFFORT_LEVEL

Read lines 1–51. Verify:

  • ENV_PREFIXES includes 'CLAUDE_CODE_'.
  • ENV_PRESERVE does NOT include CLAUDE_CODE_EFFORT_LEVEL, CLAUDE_CODE_ALWAYS_ENABLE_EFFORT.

Add an inline comment at the ENV_PREFIXES declaration:

ts
// Filters CLAUDE_CODE_* unless explicitly preserved in ENV_PRESERVE.
// This is layer 2 of defense for #2357 — layer 1 is BLOCKED_ENV_VARS in EnvManager.

No code change to behavior here.
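For orientation, prefix stripping with a preserve list can be sketched as follows. The constant names follow env-sanitizer.ts, but their contents here are assumptions: only the 'CLAUDE_CODE_' prefix is confirmed by this plan, and the preserved name is purely hypothetical.

```typescript
type SpawnEnv = Record<string, string | undefined>;

const ENV_PREFIXES_SKETCH = ['CLAUDE_CODE_'];
// Hypothetical entry, present only to show how an exemption behaves.
const ENV_PRESERVE_SKETCH = new Set(['CLAUDE_CODE_EXAMPLE_KEEP']);

function sanitizeEnvSketch(env: SpawnEnv): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [key, value] of Object.entries(env)) {
    if (value === undefined) continue;
    // A key is stripped when it matches a prefix and is not explicitly preserved.
    const stripped =
      !ENV_PRESERVE_SKETCH.has(key) &&
      ENV_PREFIXES_SKETCH.some((prefix) => key.startsWith(prefix));
    if (!stripped) out[key] = value;
  }
  return out;
}

// Keeps LANG and the preserved name; both EFFORT variables are removed.
const sanitized = sanitizeEnvSketch({
  LANG: 'en_US.UTF-8',
  CLAUDE_CODE_EFFORT_LEVEL: 'MAX',
  CLAUDE_CODE_ALWAYS_ENABLE_EFFORT: 'true',
  CLAUDE_CODE_EXAMPLE_KEEP: '1',
});
```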

3.4 Verification checklist (Phase 3)

  • Tests 5, 6, 7 from Phase 1 now GREEN.
  • grep -rn "CLAUDE_CODE_EFFORT_LEVEL" src/ tests/ returns hits only in EnvManager.ts (BLOCKED_ENV_VARS), the ClaudeProvider.ts classifier (comment and warn hint), and the test file.
  • Reproduce #2357 scenario locally:
    bash
    CLAUDE_CODE_EFFORT_LEVEL=MAX bun run src/services/worker-service.ts --daemon
    # Observe: no `effort` parameter on outgoing requests.
    
  • If a 400 is forced (e.g., via a mocked SDK reject), the retry loop terminates after the first attempt; logger.warn fires once.

3.5 Anti-pattern guards

  • Do NOT add a separate "permanent error" enum value — kind: 'unrecoverable' already exists and is the right slot.
  • Do NOT regex on the entire error stack — error.status === 400 is the deterministic signal; the body text check is purely for the user-facing log hint.
  • Do NOT log inside classifyClaudeProviderError for every 400 — only the effort-parameter sub-case warrants a hint. Generic 400s are noisy enough at the call site.
  • Do NOT mark all 400s with body matching /effort/i as auth_invalid — that would trigger the "re-login" flow incorrectly. Use unrecoverable.
  • Do NOT rely on the SDK supporting an effort SDK-option that we strip. The SDK type does not expose effort; the leak is the SDK's own subprocess (pathToClaudeCodeExecutable) reading the env var. Stripping at our env layer is the only fix we control.

Phase 4 — $TIER alias syntax (#2289)

Goal: Allow CLAUDE_MEM_MODEL=$TIER:summary (and similar) to resolve at request time to a provider-appropriate model, reusing the existing 'haiku' portable alias machinery (line 116, #1463). Optional phase; can be deferred without blocking Phase 2/3.

4.1 Edit src/shared/SettingsDefaultsManager.ts — extend tier interface

Add to the SettingsDefaults interface near lines 51–53:

ts
CLAUDE_MEM_TIER_FAST_MODEL: string;     // for $TIER:fast — defaults to 'haiku'
CLAUDE_MEM_TIER_SMART_MODEL: string;    // for $TIER:smart — defaults to 'sonnet' (or provider-equivalent)

Add to the DEFAULTS block near lines 115–117:

ts
CLAUDE_MEM_TIER_FAST_MODEL: 'haiku',
CLAUDE_MEM_TIER_SMART_MODEL: 'sonnet',

4.2 Edit src/services/worker/ClaudeProvider.ts:442–446 — add $TIER resolution

Replace getModelId():

ts
private getModelId(): string {
  const settingsPath = paths.settings();
  const settings = SettingsDefaultsManager.loadFromFile(settingsPath);
  return resolveTierAlias(settings.CLAUDE_MEM_MODEL, settings);
}

Add resolveTierAlias to a shared util (src/services/worker/model-aliases.ts, NEW):

ts
import type { SettingsDefaults } from '../../shared/SettingsDefaultsManager';

const TIER_PATTERN = /^\$TIER:(fast|smart|simple|summary)$/;

export function resolveTierAlias(model: string, settings: SettingsDefaults): string {
  const match = TIER_PATTERN.exec(model);
  if (!match) return model;

  switch (match[1]) {
    case 'fast':    return settings.CLAUDE_MEM_TIER_FAST_MODEL || 'haiku';
    case 'smart':   return settings.CLAUDE_MEM_TIER_SMART_MODEL || 'sonnet';
    case 'simple':  return settings.CLAUDE_MEM_TIER_SIMPLE_MODEL || 'haiku';
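    // Caveat: getModelId passes CLAUDE_MEM_MODEL as `model`, so when the summary
    // tier is unset this fallback returns the unresolved '$TIER:summary' string.
    case 'summary': return settings.CLAUDE_MEM_TIER_SUMMARY_MODEL || settings.CLAUDE_MEM_MODEL;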
    default:        return model;
  }
}

4.3 Same call site in KnowledgeAgent.ts:149 (getModelId)

Apply the same resolveTierAlias wrap. Knowledge agent uses the same settings path.

4.4 Verification checklist (Phase 4)

  • New test: resolveTierAlias('$TIER:fast', settings) returns settings.CLAUDE_MEM_TIER_FAST_MODEL.
  • New test: resolveTierAlias('claude-haiku-4-5-20251001', settings) returns input unchanged (non-tier passthrough).
  • Setting CLAUDE_MEM_MODEL=$TIER:fast and starting the worker actually queries against the fast-tier model.
  • Documentation updated in docs/public/configuration.mdx with the four tier aliases.

4.5 Anti-pattern guards

  • Do NOT match $TIER:* greedily — the regex is anchored.
  • Do NOT add $PROVIDER: or $MODEL: aliases in this phase — out of scope; one syntax at a time.
  • Do NOT mutate settings inside resolveTierAlias; pure function only.
  • Do NOT resolve the alias at settings-load time — resolve at request time so users can edit settings without restarting the worker.
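The anchoring guard can be checked directly against the pattern. In this small sketch the regular expression is copied from resolveTierAlias; `tierOf` and the sample inputs are made up for illustration.

```typescript
const TIER_PATTERN_DEMO = /^\$TIER:(fast|smart|simple|summary)$/;

// Returns the tier name for an exact alias, or null for anything else.
function tierOf(model: string): string | null {
  const match = TIER_PATTERN_DEMO.exec(model);
  return match?.[1] ?? null;
}

const tierChecks = {
  exact: tierOf('$TIER:fast'), // matches: exact alias
  unknownTier: tierOf('$TIER:fastest'), // no match: not one of the four names
  prefixed: tierOf('my-$TIER:fast'), // no match: alias must start the string
  suffixed: tierOf('$TIER:fast-v2'), // no match: alias must end the string
  concrete: tierOf('claude-haiku-4-5-20251001'), // no match: concrete IDs pass through
};
```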

Phase 5 — Cross-spawn-boundary audit

Goal: Every place claude-mem spawns a subprocess must apply sanitizeEnv, and every SDK spawn that carries Anthropic credentials must also apply buildIsolatedEnv (or the async variant). A grep-based check codifies the rule.

5.1 Audit table — current state per call site

| File | Line | Spawn target | Env construction | Sufficient? |
|---|---|---|---|---|
| src/services/worker/ClaudeProvider.ts | 155 | Anthropic SDK subprocess | sanitizeEnv(await buildIsolatedEnvWithFreshOAuth()) | ✅ both layers applied |
| src/services/worker/knowledge/KnowledgeAgent.ts | 54, 149 | Knowledge SDK subprocess | sanitizeEnv(await buildIsolatedEnvWithFreshOAuth()) | ✅ both layers applied |
| src/services/infrastructure/ProcessManager.ts | 415 | Worker daemon | sanitizeEnv({...process.env, CLAUDE_MEM_WORKER_PORT, ...extraEnv}) | ⚠️ daemon inherits parent env then sanitizes; it does not pass through buildIsolatedEnv. Document why this is OK: daemon is the trust boundary; parent env IS the truth. But it should still strip CLAUDE_CODE_EFFORT_LEVEL via the prefix filter. Confirm. |
| src/services/sync/ChromaMcpManager.ts | 585 | chroma-mcp subprocess | sanitizeEnv(process.env) | ⚠️ same as above. |
| src/supervisor/process-registry.ts | 539 | Generic spawn factory | sanitizeEnv(options.env ?? process.env) | ⚠️ same. |
| src/services/worker-service.ts | 412 | MCP server subprocess | sanitizeEnv(process.env) | ⚠️ same. |

For the worker-daemon and downstream MCP/chroma spawns, parent-process env IS the source of truth — they are pre-credential paths. As long as CLAUDE_CODE_EFFORT_LEVEL and the Anthropic credentials are stripped (which sanitizeEnv does via CLAUDE_CODE_* prefix and the existing ANTHROPIC_AUTH_TOKEN block), behavior is correct. The plan does not change these paths — it adds tests that prove they stay correct.

5.2 Add audit test — tests/env-isolation.test.ts

  1. every documented spawn site applies sanitizeEnv
    • Read each file from the audit table.
    • Assert: each line cited contains sanitizeEnv(. Currently GREEN; test prevents regression.
  2. worker-daemon spawn env does not contain CLAUDE_CODE_EFFORT_LEVEL
    • Stub process.env.CLAUDE_CODE_EFFORT_LEVEL = 'MAX'.
    • Construct the env block as ProcessManager.ts:415 does.
    • Assert: result does not contain CLAUDE_CODE_EFFORT_LEVEL. Currently GREEN.

5.3 Verification checklist (Phase 5)

  • Tests 8 and 9 (the two audit tests from 5.2, numbered after the seven Phase 1 tests) GREEN.
  • No new spawn sites introduced; if any are added by accident, the CI check (Phase 7) flags them.

5.4 Anti-pattern guards

  • Do NOT add buildIsolatedEnv calls to ProcessManager / ChromaMcpManager / MCP server spawn paths. They legitimately need parent-shell PATH, HOME, etc. — those would be wiped by the credential-isolated builder.
  • Do NOT consolidate the two layers into one helper "for clarity" — they have distinct contracts and are layered intentionally.
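The two compositions from the audit table can be put next to each other. This is a rough sketch under stated assumptions: `isolate` and `sanitize` are simplified, hypothetical stand-ins for buildIsolatedEnv and sanitizeEnv (the real functions have more rules), and the port value is made up.

```typescript
type ProcEnv = Record<string, string | undefined>;

const BLOCKED_SKETCH = ['ANTHROPIC_API_KEY', 'ANTHROPIC_AUTH_TOKEN', 'ANTHROPIC_BASE_URL'];

const dropKeys = (env: ProcEnv, drop: (key: string) => boolean): ProcEnv =>
  Object.fromEntries(Object.entries(env).filter(([key]) => !drop(key)));

// Layer 1: credential deny-list, then re-inject values from ~/.claude-mem/.env.
const isolate = (parent: ProcEnv, dotEnv: ProcEnv): ProcEnv => ({
  ...dropKeys(parent, (key) => BLOCKED_SKETCH.includes(key)),
  ...dotEnv,
});

// Layer 2: strip host CLI configuration by prefix.
const sanitize = (env: ProcEnv): ProcEnv =>
  dropKeys(env, (key) => key.startsWith('CLAUDE_CODE_'));

const parentEnv: ProcEnv = {
  ANTHROPIC_BASE_URL: 'https://parent-shell.example',
  CLAUDE_CODE_EFFORT_LEVEL: 'MAX',
};
const dotEnvValues: ProcEnv = {
  ANTHROPIC_BASE_URL: 'https://proxy.example',
  ANTHROPIC_AUTH_TOKEN: 'test-token',
};

// SDK subprocess: both layers, so the .env values replace the parent's.
const sdkEnv = sanitize(isolate(parentEnv, dotEnvValues));
// Daemon / MCP / chroma: sanitizer only, with the worker port added.
const daemonEnv = sanitize({ ...parentEnv, CLAUDE_MEM_WORKER_PORT: '37777' });
```

Keeping the layers as separate functions is what lets the daemon path opt out of the first one without duplicating the second.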

Phase 6 — Test the full integration end-to-end

Goal: Smoke test the proxy/gateway path so we know the fix works in the real world.

6.1 Manual smoke (BigModel proxy or any equivalent)

bash
# Setup:
cat > ~/.claude-mem/.env <<'EOF'
ANTHROPIC_BASE_URL=https://open.bigmodel.cn/api/anthropic
ANTHROPIC_AUTH_TOKEN=<your-bigmodel-token>
EOF
chmod 600 ~/.claude-mem/.env

# Reset worker:
npm run build-and-sync
pkill -f worker-service.cjs

# Trigger:
# In any Claude Code session, use any tool — PostToolUse hook should land an observation.

# Verify:
tail -f ~/.claude-mem/logs/claude-mem-$(date +%Y-%m-%d).log
# Expect: no "Not logged in" errors; observations land via the proxy.

6.2 Manual smoke (CLAUDE_CODE_EFFORT_LEVEL leak)

bash
# Setup:
export CLAUDE_CODE_EFFORT_LEVEL=MAX
export CLAUDE_CODE_ALWAYS_ENABLE_EFFORT=true

# Restart Claude Code so the env propagates to the hook subprocess.

# Verify:
tail -f ~/.claude-mem/logs/claude-mem-$(date +%Y-%m-%d).log
# Expect: NO repeated "API Error: 400 This model does not support the effort parameter."
# Expect: NO "PARSER returned non-XML response; marking messages as failed for retry".

6.3 Verification checklist (Phase 6)

  • Both smoke scenarios pass.
  • bun test is green.
  • One iteration on a fresh machine confirms ~/.claude-mem/.env is the only knob users need for proxy auth.

Phase 7 — CI guard + documentation

Goal: A grep-based CI check rejects PRs that introduce a subprocess spawn without sanitizeEnv. Documentation aligns with the new contract.

7.1 Add scripts/check-spawn-env-discipline.cjs

Pattern from plans/01-hook-io-discipline.md Phase 6 (scripts/check-hook-io-discipline.cjs):

js
#!/usr/bin/env node
// Forbid raw process.env in subprocess spawn calls. Every spawn must use
// sanitizeEnv(...) and (where credentials are involved) buildIsolatedEnv*.

const { execSync } = require('node:child_process');
const { readFileSync } = require('node:fs');

const VIOLATIONS = [];

// Find every `spawn(` / `spawnSync(` / `child_process.spawn(` call in src/.
// `|| true` keeps execSync from throwing when the pipeline matches nothing
// (grep exits 1 on zero matches).
const grep = execSync(
  `grep -rEn "spawn(Sync)?\\(" src/ | grep -v "node_modules" | grep -v "\\.test\\." || true`,
  { encoding: 'utf8' },
);

for (const line of grep.split('\n').filter(Boolean)) {
  // Allow if the same logical block contains sanitizeEnv
  // (heuristic: scan the matching line plus the 8 lines after it)
  const [filePath, lineNumStr] = line.split(':', 2);
  const lineNum = Number.parseInt(lineNumStr, 10);
  const src = readFileSync(filePath, 'utf8').split('\n');
  const window = src.slice(lineNum - 1, lineNum + 8).join('\n');
  if (!/sanitizeEnv\s*\(/.test(window)) {
    VIOLATIONS.push(`${filePath}:${lineNum}: spawn without sanitizeEnv`);
  }
}

if (VIOLATIONS.length > 0) {
  console.error('Spawn-env discipline check FAILED:');
  VIOLATIONS.forEach(v => console.error('  ' + v));
  process.exit(1);
}
console.log('Spawn-env discipline check passed.');

Wire to package.json scripts.test:env-discipline. Add to CI alongside existing hook checks.
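The window heuristic is easier to reason about as a pure function over a source string. The sketch below is a hypothetical in-memory variant for unit tests, not part of the planned script; `findSpawnViolations` is an invented name, and the 9-line window mirrors the slice used in the script.

```typescript
// Returns the 1-based line numbers of spawn calls whose forward window
// (the matching line plus the following lines) never mentions sanitizeEnv(.
function findSpawnViolations(source: string, windowSize = 9): number[] {
  const lines = source.split('\n');
  const violations: number[] = [];
  lines.forEach((line, index) => {
    if (!/spawn(Sync)?\(/.test(line)) return;
    const window = lines.slice(index, index + windowSize).join('\n');
    if (!/sanitizeEnv\s*\(/.test(window)) violations.push(index + 1);
  });
  return violations;
}

const compliantSource = [
  "const child = spawn('node', args, {",
  '  env: sanitizeEnv(process.env),',
  '});',
].join('\n');

const rawEnvSource = [
  '// worker bootstrap',
  "const child = spawn('node', args, {",
  '  env: process.env,',
  '});',
].join('\n');

const compliantResult = findSpawnViolations(compliantSource);
const rawEnvResult = findSpawnViolations(rawEnvSource);
```

One limitation worth keeping in mind: the window only looks forward, so a call site that builds `const env = sanitizeEnv(...)` on a line above the spawn would be flagged even though it is compliant.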

7.2 Edit CLAUDE.md — document the ~/.claude-mem/.env contract

Add a section under "Configuration":

markdown
### Anthropic Credentials (proxies, gateways, BigModel, etc.)

For non-OAuth Anthropic credentials (proxies / gateways / `ANTHROPIC_AUTH_TOKEN` / `ANTHROPIC_API_KEY`), put them in `~/.claude-mem/.env`:

\```
ANTHROPIC_BASE_URL=https://your-proxy.example
ANTHROPIC_AUTH_TOKEN=your-token
\```

The file is read at worker spawn time and re-injected into the SDK subprocess. **Parent-shell exports of these variables are intentionally ignored** — they are in `BLOCKED_ENV_VARS` to prevent host-config bleed-through (#2375).

If you only have an OAuth subscription, no `.env` is needed; the worker reads the token from your keychain at spawn time.

7.3 Verification checklist (Phase 7)

  • npm run test:env-discipline passes on the post-fix tree.
  • CI pipeline runs the new check.
  • CLAUDE.md section exists and accurately reflects the new contract.

7.4 Anti-pattern guards

  • Do NOT extend the CI check to flag every process.env read — only spawn*() call sites need sanitizeEnv. Reads are fine.
  • Do NOT add the .env file path to .gitignore — it lives in ~/.claude-mem/, not in the repo, so it's already outside.

Cross-plan dependencies

  • Plan 01 (Hook IO Discipline): Independent. Both can be implemented in parallel.
  • Plan 02 (Spawn-Contract Templating): Independent. Both touch templating but at different layers.
  • Plan 03 (Worker Lifecycle): Phase 3.2's HTTP 400 classification removes a class of unbounded retries. Plan 03's "circuit breaker" + "stale-session sweep" handles other retry classes. Merge order: this plan first (small, surgical), then Plan 03.
  • Plan 04 (Installer Transparency): Independent.
  • Plan 05 (Observer Tool Enforcement): Adjacent — KnowledgeAgent is touched in both plans (this one for getModelId, Plan 05 for tool enforcement). Sequence Plan 05 first (security urgency), then Plan 06.

Pre-/do checklist

  • Verify BLOCKED_ENV_VARS is still an Array<string> and not converted to a Set (Phase 2 refactor risk).
  • Verify the existing test suite passes against current main before starting (bun test).
  • Re-confirm effort is still absent from src/ (grep -rn "effort" src/) — if a future change adds the parameter, Phase 3.2's regex needs revisiting.
  • Read node_modules/@anthropic-ai/claude-agent-sdk/sdk.d.ts to confirm query() options does NOT support effort natively. If the SDK adds it, Phase 3.2's body-text regex still works as a fallback, but a code-level strip becomes the right fix.
  • Verify ~/.claude-mem/.env permissions are 0o600 post-fix (the saver enforces this; readers should not weaken it).
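For the permissions item, the comparison itself is a one-line bitmask. A minimal sketch, assuming the mode value comes from something like Node's `fs.statSync(path).mode`; `isOwnerReadWriteOnly` is a hypothetical helper name.

```typescript
// True when the permission bits are exactly 0o600 (owner read/write only).
function isOwnerReadWriteOnly(mode: number): boolean {
  const permissionBits = mode & 0o777; // drop the file-type bits
  return permissionBits === 0o600;
}

const regularFile600 = isOwnerReadWriteOnly(0o100600); // regular file, rw-------
const regularFile644 = isOwnerReadWriteOnly(0o100644); // regular file, rw-r--r--
```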