plans/2026-06-24-release-recovery-plan.md
Ship a reliability-first recovery release that removes the largest June 2026 error sources, stops observer/chroma churn, and gives users a setup path that does not fail only after their first session starts.
This plan cross-references the attached PostHog error report with the live thedotmack/claude-mem GitHub backlog as of 2026-06-24:
/Users/alexnewman/.superset/host/e7c5cb1f-3f94-4b7b-b6b7-37a97d3b4a51/attachments/08a4bcfe-650a-4094-a534-815c15b67701/08a4bcfe-650a-4094-a534-815c15b67701.jsongh:
/tmp/claude_mem_open_issues_full.json/tmp/claude_mem_open_prs.jsonsrc/shared/find-claude-executable.tssrc/services/sync/ChromaMcpManager.tssrc/services/sync/ChromaSync.tssrc/services/worker/GeminiProvider.tssrc/services/worker/OpenAICompatibleProvider.tssrc/services/sqlite/SessionStore.tssrc/services/telemetry/backfill.tssrc/services/worker/agents/ResponseProcessor.tsplugin/hooks/codex-hooks.jsonscripts/build-hooks.jssrc/cli/adapters/codex.tssrc/services/integrations/CodexCliInstaller.ts| Report item | Report impact | Matching GitHub issues | Matching PRs | Current root cause |
|---|---|---|---|---|
| Claude executable not found | 466,499 occurrences / 9,039 users | No exact open issue found | No exact PR found | Claude CLI dependency is discovered only when the generator starts; no first-run preflight or one-time remediation state. |
uvx not found | 67,958 occurrences / about 966 users | Partly covered by #2961 / plan #2779 | #2920, #2880, #2940, partly #3039 | Installer has ensureUv, but existing installs can still hit runtime uvx spawn failure. Runtime Chroma path does not degrade cleanly when uvx is absent. |
Bun.randomUUIDv5 not a function | 5,908 / 36 | No exact open issue found | No exact PR found | src/services/telemetry/backfill.ts calls a Bun-specific API; replacing it with a small UUIDv5 helper is better than requiring newer Bun. |
| Chroma 30s timeout | 102,186 / 7,061 | #2897, #2961, #3016, #3012 | #2920, #2880/#2940, #2536 | The MCP handshake timeout includes cold uvx environment installation; repeated timeout kills prevent cache completion and can leak temp dirs/processes. |
MCP -32000 Connection closed | 210,951 / 2,833 | #2879, #2939, #2954, #2961, #2959, #2950 | #2880, #2940, #2536 | Multiple causes collapse to a generic close: old uv rejects bare chroma-mcp==..., Windows shell handling mangles args, and stderr is not surfaced. |
| Chroma backoff throws into sync | 5,810 / 639 | #3016, #2896, #2959 | #2536, partly local singleton tests | ensureCollectionExists() can throw before addDocuments() reaches its per-batch catch path; write paths should return "not synced yet" instead of throwing user-visible errors. |
| Gemini bad request 400 | 100,784 / 555 | No exact open issue found | No exact PR found | Gemini request shaping/truncation can produce invalid conversation envelopes; 400s are classified but not prevented or bucketed by closed reason. |
| Platform source conflict | 22,078 / 465 | No exact open issue found | No exact PR found | sdk_sessions.content_session_id is globally unique, and tests currently require throwing when the same raw session ID appears from two platforms. |
| JSON parse error with Chinese chars | 4,965 / 78 | Partly plan #2782, no exact issue | No exact PR found | ChromaSync.formatObservationDocs() raw-parses facts and concepts; bad legacy rows can kill backfill instead of being quarantined. |
| Observer poison/respawn loop | Not in report top 10, but dominates GitHub | #3037, #3032, #3022, #3007, #2960, #2955, #2935, #2817 | #3028, #2857, #2943, #2927, #2901 | Non-XML/idle/quota prose is treated as invalid output and can trigger respawn loops that wipe context and stop memory generation. |
GitHub-only Codex compatibility blockers to include in the same recovery release:
| Codex blocker | Matching GitHub issues | Matching PRs | Current root cause |
|---|---|---|---|
| Codex refuses to load hooks config | #2972, #2947 | #2953, #2948 | plugin/hooks/codex-hooks.json still has a root-level description field. Codex 0.140.0-0.142.0 rejects unknown root keys, so all hooks appear enabled but never run. |
| Codex rejects hook output | #2975, #2871 | #2953 | Some Codex hook paths can emit Claude-style suppressOutput, which current Codex reports as an unsupported field on PreToolUse/PostToolUse. |
| Codex/Windows spawn contract regressions | #2962, #2941, #2914 | #2945, #2598 | Published bundles and hook commands have had shell: true + args and fragile login-shell PATH probes; Codex installs are sensitive to both. |
This release should be a recovery release, not a feature release. Hold broad feature PRs unless they remove a top recovery blocker.
Release blockers:
Explicitly hold from this release unless already required by a blocker:
Merge or rebase into the recovery branch:
fix: prevent a broken/partial dependency install from bricking the worker — clean, directly supports setup/upgrade survival.fix(windows): strip UTF-8 BOM in all settings.json readers — relevant to hook-breaking setup failures; rebase/check because merge state is unstable.Preserve proxy variables during environment sanitization — relevant to enterprise installs and provider/chroma network failures; rebase/check because merge state is unstable.fix: ignore unparseable observer output instead of poisoned respawn — use as canonical observer-loop PR if cleaned; supersede narrower #2857/#2943/#2927/#2901.fix(chroma): prewarm uvx installs before the MCP connect deadline — clean, essential for #2897.fix(chroma): spawn chroma-mcp via --from so uv < 0.5.31 works — prefer this over #2940 because it handles old uv and avoids bare positional package syntax. Pull any useful #2940 tests into the canonical Chroma PR, then close #2940 as superseded.fix(build): bundle zod into worker-service.cjs — clean and removes a known install-bricking path.fix(sqlite): apply busy_timeout to primary SQLite connections — clean, low-risk data durability improvement.Fix claude-mem codex-hooks.json for current Codex — use as the canonical Codex compatibility PR if it rebases cleanly. It should remove the unsupported root description, verify Codex output never includes suppressOutput, and include generated artifact updates.fix: install Windows Claude Code hooks without bash — merge if the spawn/PATH changes cover Codex-distributed hooks too; otherwise pull the shared hook-template fix into the Codex compatibility PR.Close or mark superseded after consolidation:
--from invocation is adopted and tested across uv versions.Create these GitHub plan-master issues because the current backlog does not cover the report's biggest missing roots:
[plan-15] Startup Dependency Health -- preflight, runtime degradation, and repairChildren to route:
CLAUDE_CODE_PATH.uvx missing after old install.Fix sequence:
setup_required and do not keep retrying until settings or PATH changes.vector_search_unavailable; SQLite capture must continue.[plan-16] Chroma Runtime Lifecycle -- launch contract, backoff semantics, and data-dir hygieneChildren to route:
Fix sequence:
uvx --from chroma-mcp==<pin> chroma-mcp ....ChromaSync, not as a thrown user-flow error.[plan-17] Provider Request Envelopes -- Gemini/OpenRouter shape, truncation, and closed-error reasonsChildren to route:
Fix sequence:
contents[] sequence after truncation.role_sequence, context_limit, model_unsupported, api_key, unknown_bad_request.[plan-18] Platform-Namespaced Session Identity -- one raw session ID can exist in multiple clientsChildren to route:
existing=claude, received=cursor.Fix sequence:
platform_source + '\0' + content_session_id.sdk_sessions away from global content_session_id TEXT UNIQUE to uniqueness on (platform_source, content_session_id).pending_messages uniqueness and joins to include session_db_id or the same composite platform key.createSDKSession() with get-or-create per platform.[plan-19] Codex Hook Compatibility -- strict schema, output contract, and spawn safetyChildren to route:
Fix sequence:
plugin/hooks/codex-hooks.json; Codex hook config root must be only the keys Codex accepts.scripts/build-hooks.js that fails if the Codex hooks file contains unsupported root keys.codexAdapter.formatOutput() and never emits Claude-only suppressOutput.hookSpecificOutput.additionalContext only.shell: true + args, no required login-shell PATH probe.Create release/recovery-2026-06-24. Merge only recovery-scoped fixes above. No new providers, UI features, or storage refactors.
Verification:
gh pr list for the release branch contains only blocker PRs.Closes #... for every child issue covered.Implement plan-15 plus merge #3039/#3033/#3018/#2887 as applicable.
Required code:
Bun.randomUUIDv5 in src/services/telemetry/backfill.ts with a local deterministic UUIDv5 implementation or small dependency-free helper.Tests:
Bun.randomUUIDv5.Implement plan-19 before the recovery release candidate is cut. This is a user-visible compatibility gate even though it is not in the PostHog top-10 report.
Required code:
plugin/hooks/codex-hooks.json without the root description.suppressOutput on success, skipped input, worker-unavailable, and error paths.Tests:
plugin/hooks/codex-hooks.json root keys match the Codex-accepted schema.suppressOutput.suppressOutput for Codex.Implement plan-16 by consolidating #2920 + #2880 + current singleton/process-tree work.
Required code:
buildCommandArgs() emits --from chroma-mcp==0.2.6 chroma-mcp.StdioClientTransport stderr is drained into bounded logs.addDocuments() returns 0 when collection creation hits known Chroma-unavailable/backoff states.Tests:
ensureConnected() calls spawn one process.> / < through cmd.exe.Land the broad #3028 behavior, then add the missing quota branch from #3037.
Required code:
poisoned telemetry once no behavior depends on it.Tests:
Implement plan-17 and plan-18.
Required code:
contents[] builder that repairs alternation after truncation.(platform_source, content_session_id) uniqueness.session_db_id or composite identity where needed.Tests:
model.Fold the JSON-parse issue into plan-09.
Required code:
JSON.parse(obs.facts) / JSON.parse(obs.concepts) in ChromaSync with tolerant JSON-array parsing.malformed_json_column.Tests:
facts = '开始' or raw non-JSON string does not crash backfill.| Axis | Required proof |
|---|---|
| Clean install | macOS, Linux, Windows install/repair succeeds; missing Claude CLI gives actionable setup state. |
| Existing broken install | Partial deps, BOM settings, missing uvx, stale worker port all degrade or recover without blocking hooks. |
| Codex | Codex 0.140.0+ parses codex-hooks.json; all five Codex hook events return accepted output without suppressOutput. |
| Chroma | uv 0.5.29, latest uv, slow cold cache, Windows direct-spawn, process leak regression. |
| Provider | Gemini long histories and odd truncation limits do not generate invalid request bodies. |
| Multi-client | Claude + Cursor with same raw session id do not conflict. |
| Data pipeline | Chroma backfill survives malformed JSON and Chroma backoff. |
| Observer | Idle/prose/quota outputs do not poison-loop. |
| Packaging | npm run build, npm run typecheck:root, targeted Bun test matrix, clean-room smoke install. |
Ship only when:
gh issue list --state open has all report-related symptoms routed to plan masters or closing PRs.Closes #... for the covered child issues.For 72 hours after release:
suppressOutput should stop after users upgrade.