docs/design/f2-mcp-transport-pool.md
Targets
daemon_mode_b_main(per #4175 branching strategy). Replaces #4175 Wave 5 PR 23. Single-PR delivery per maintainer's feature-cohesive batch guidance (2026-05-19). Author: doudouOUC. Date: 2026-05-20. Revised: 2026-05-20 (v2.2 — implementation review fold-ins).
PR #4336 shipped F2 as 6 atomic commits + 6 fix commits over ~4 hours. Wenshao reviewed cumulatively in 3 batches; each batch produced inline + critical fixes that were folded back. The table below records what changed vs. v2.1, organized by review batch.
| # | Site | What was wrong | Fold-in commit |
|---|---|---|---|
| C1 | acpAgent.ts:269 — IDE-close path | Pool drain only ran in SIGTERM handler; IDE-initiated normal close leaked entries until OS reaped. Mirror SIGTERM's pool drain on await connection.closed | ae0b296c4 |
| C2 | mcp-pool-entry.ts:cancelDrainTimer | cancelDrainTimer reset maxIdleTimer on every flap, defeating the §6.3 hard cap. Now only clears drainTimer; max-idle survives entire entry lifetime | ae0b296c4 |
| C3 | mcp-pool-entry.ts:doRestart | Reconnect failure left entry in zombie state (localStatus=CONNECTED, state='active', stale snapshot). Try/catch + transition to 'failed' on failure | ae0b296c4 |
| C4 | mcp-pool-entry.ts:forceShutdown | state='closed' set AFTER awaits, so concurrent acquire could observe 'active' and hand out stale connection. Set synchronously at top | ae0b296c4 |
| C5 | mcp-transport-pool.ts:drainAll | Concurrent acquire could spawn fresh entry mid-drain. Added draining mutex flag + await Promise.allSettled(spawnInFlight) before clearing | ae0b296c4 |
| C6 | mcp-pool-entry.ts:statusChangeListener | Listener wasn't filtered by serverName; every entry got every server's status notifications + entry's own markActive write echoed back | ae0b296c4 |
| C7 | mcp-client-manager.ts:discoverAllMcpToolsIncremental | Pool-mode gate added to discoverAllMcpTools but missed Incremental — /mcp refresh bypassed pool, spawned per-session client | ae0b296c4 |
| S1 | session-mcp-view.ts:passesSessionFilter | Doc didn't call out that excludeTools uses direct equality (no parens-form support); divergence vs. mcp-client.ts:isEnabled | ae0b296c4 |
| S2 | pid-descendants.ts docstring | Claimed Windows-specific taskkill /F branch that didn't exist — Node polyfills process.kill('SIGTERM') to TerminateProcess | ae0b296c4 |
| S3 | session-mcp-view.ts:applyTools debug log | String contained literal "N" instead of interpolation — operators saw applied 12 tools (filtered to N registered) | ae0b296c4 |
| S4 | mcp-transport-pool.ts:createUnpooledConnection status cb | Hardcoded to () => CONNECTED so aggregateStatusByName lied after disconnect. Now () => client.getStatus() | ae0b296c4 |
| # | Site | What was wrong | Fold-in commit |
|---|---|---|---|
| R1 | server.test.ts:918 /capabilities envelope | Test asserted getAdvertisedServeFeatures() (no toggles) but server.ts passes mcpPoolActive: opts.mcpPoolActive !== false (default-on). Anchor toggle | 3e68c00bc |
| R2 | server.test.ts capability default-on coverage | No test booted with default options to verify pool tags advertise. Added explicit mcpPoolActive: false test | 3e68c00bc |
| R3 | events.ts:DaemonMcpServerRestartRefusedData | Doc said pre-PR SDKs would "see new value as unknown and surface generically" — actually MCP_RESTART_REFUSED_REASONS.has(...) rejects → silent drop | 3e68c00bc |
| # | Site | What was wrong | Fold-in commit |
|---|---|---|---|
| WR1 | mcp-pool-entry.ts:maxIdleTimer | C2 fix correctly preserved maxIdleTimer across flap, but fire-action force-closed regardless of refs.size. Active session with re-attach inside grace would lose tools after 5min | 72399f109 |
| WR2 | mcp-client-manager.ts:discoverAllMcpToolsViaPool | releaseAllPooledConnections + re-acquire ALL on every pass left brief window with zero MCP tools registered AND bounced every drain timer. Diff against desired (name, fingerprint) | 72399f109 |
| WR3 | mcp-pool-entry.ts:doRestart snapshot fan-out | Restart updated toolsSnapshot/promptsSnapshot and emitted typed events — but no SessionMcpView instance subscribed to that stream. Iterate subscribers directly post-snapshot | 72399f109 |
| WR4 | mcp-transport-pool.ts:getSnapshot subprocessCount | Counted websocket toward subprocessCount — websocket dials remote, no local child. Restricted to 'stdio' only | 72399f109 |
| WR5 | pid-descendants.ts PowerShell -Filter | Interpolated ${pid} directly into -Filter string. Entry-point Number.isInteger guard prevents injection today; bind to $p for defense-in-depth against future guard relaxations | 72399f109 |
| WR6 | mcp-pool-entry.ts ctor cfg field | readonly cfg: MCPServerConfig was implicitly public, exposing env API keys / header auth / OAuth fields. Made private; new transportKind getter for the only external reader | 72399f109 |
| WR7 | mcp-pool-events.ts premature exports | 5 PoolEvent type guards + Prompt re-export + PoolEntryConnectionStatus had zero callers. Removed; kept MCPCallInterruptedError (design §13.4 mandate) | 72399f109 |
| WR8 | acpAgent.ts:269,300 pool drain duplication | SIGTERM + IDE-close had identical if (agentInstance) { try { await shutdownMcpPool(8_000) } catch... } blocks. Extracted drainPoolBeforeExit(label) helper | 72399f109 |
| # | Site | What was wrong | Fold-in commit |
|---|---|---|---|
| 6R1 | mcp-transport-pool.ts:onClosed | Slot-release race: A finishes spawn, B (different fingerprint, same name) starts spawn, A drains. Close-cb checked only entries (B not yet registered) → premature release | 0e58a098f |
| 6R2 | events.ts:mcpBudgetWarningCount JSDoc | Workspace-scoped events fan to N sessions → N reducer increments; consumers aggregating across sessions double-count. Docstring updated to call out the multiplier | 0e58a098f |
| 6R3 | acpAgent.ts:broadcastBudgetEvent | Iterated this.sessions.keys() directly during async fan-out; concurrent killSession could corrupt iterator. Snapshot via Array.from(...) | 0e58a098f |
| # | Site | What was wrong | Fold-in commit |
|---|---|---|---|
| W1 | mcp-transport-pool.ts:spawnEntry catch | Spawn failure leaked statusChangeListener permanently — only forceShutdown removes it. Added entry.forceShutdown('manual') to catch | 4a3c5cd90 |
| W2 | mcp-pool-entry.ts:statusChangeListener cross-check | Module-level serverStatuses map shared across multi-fingerprint entries. A's transport error wrote DISCONNECTED, B's listener corrupted B's localStatus. Added client.getStatus() check | 4a3c5cd90 |
| W3 | mcp-pool-entry.ts:doRestart pid sweep | Restart skipped listDescendantPids + sigtermPids — every restart of npx/uvx-wrapped stdio orphaned the actual MCP grandchild. Added sweep before disconnect | 4a3c5cd90 |
| W4 | mcp-pool-entry.ts:doRestart drain timer race | Drain timer could fire mid-restart yield → forceShutdown removes entry → client.connect spawns orphan. Added cancelDrainTimer + state→active at top of doRestart | 4a3c5cd90 |
| W5 | mcp-client-manager.ts:pooledConnections dead handles | When entry transitioned to 'failed', manager held dead PooledConnection forever. Subscribe to entry events; evict on 'failed' (idempotent via get(name) === conn guard) | 4a3c5cd90 |
| W6 | mcp-client-manager.ts:discoverAllMcpToolsViaPool re-entrancy | Two passes interleaving could both set(name, conn) → first conn leaked. Added discoveryInFlight mutex; second caller awaits same promise. New regression test | 4a3c5cd90 |
| W9 | acpAgent.ts:parsePoolDrainMs strictness | Number.parseInt accepted '30000ms' / '30000abc'. Strict ^\d+$ regex; reject with stderr warning + default fallback | 4a3c5cd90 |
| W10 | mcp-transport-pool.ts:acquire indexAttach order | indexAttach mutated sessionToEntries BEFORE entry.attach(). If attach threw, stale reverse-index mapping. Moved indexAttach after attach succeeds (both fast + in-flight paths) | 4a3c5cd90 |
| W13 | mcp-transport-pool.ts:subprocessCount JSDoc | Doc still claimed stdio + websocket after WR4 restricted to stdio. Updated | 4a3c5cd90 |
| W14 | mcp-transport-pool.ts:createUnpooledConnection catch | Same statusChangeListener leak as W1 in the unpooled path. Same mirror: forceShutdown before disconnect | 4a3c5cd90 |
| W15 | bridge.ts:restartMcpServer response | as PoolEntries cast was unsound — untyped JSON from ACP child. Array.isArray check + per-entry shape guard; malformed entries skipped with stderr breadcrumb | 4a3c5cd90 |
| # | Site | Reason for declining |
|---|---|---|
| W7 | Test coverage gaps (4 untested critical paths) | 1/4 added (W6 regression test); rest deferred to focused test-coverage PR after F2 series merges |
| W8 | maxReconnectAttempts / reconnectStrategy unused | Forward-compat placeholders for the deferred health-monitor-driven reconnect (design §6.6); removing + re-adding churns the public type |
| W11 | Duplicate fast-path / in-flight-path attach blocks | ✅ Done in PR A: attachPooledSession + rollbackReservationOnSpawnFailure private helpers (commit 2d546efca) |
| W12 | passesSessionFilter O(M×N) per applyTools | ✅ Done in PR A: applyTools / applyPrompts precompute filter Sets once per pass; predicate becomes O(1) per tool (commit a4a855ab3) |
| R9 | McpClientManager ctor 7-positional sentinels | ✅ Done in PR A: options-object ctor + mkManager test factory (commit 0cb1eaa27) |
| R10 | pgrep -P <pid> per-PID-per-level cost | ✅ Done in PR A: single ps -A -o pid=,ppid= snapshot + in-memory BFS walk; pgrep BFS retained as fallback for BusyBox <v1.28 / distroless (commit landing as final PR A piece) |
// F2 (#4175 commit X review fix — wenshao YN): breadcrumb pointing at the original review| # | What | Why |
|---|---|---|
| V21-1 | Switched from 6-sub-PR plan to single feature-cohesive PR with 6 atomic commits | Per maintainer guidance (#4175 branching strategy); reviewer can read commit-by-commit via git log -p |
| V21-2 | Added sessionToEntries: Map<sid, Set<ConnectionId>> reverse index in pool (§6) | releaseSession O(N entries) → O(refs of session); needed for 1000-session scale |
| V21-3 | ?fingerprint= query param on restart route (§13.1) | Operator may want to restart only one entry when same name has multiple fingerprints; near-zero cost to add now |
| V21-4 | Spawn-failure path explicitly releases reserved slot (§6.1, §6.5) | Otherwise slot leaks until next health-monitor pass; subtle real bug |
| V21-5 | New §13.4: in-flight tool call during reconnect semantics | MCPCallInterruptedError; pool does NOT auto-replay (writes unsafe) |
| V21-6 | New §10.4: /mcp disable X triggers SessionMcpView re-apply | Otherwise mid-session disable doesn't drop already-registered tools |
| V21-7 | Status route exposes entryIndex not raw fingerprint (§8.3) | Avoids side-channel exposure of OAuth token rotation via fingerprint change |
| V21-8 | Reconnect backoff spec'd: stdio fixed 5s × 3, HTTP/SSE exponential 1/2/4/8/16s × 5 (§6.6) | v2 didn't say; HTTP needs longer retry budget for network flap |
| V21-9 | canonicalOAuth(o) normalizes {enabled: false} ≡ undefined ≡ null (§5.1) | Otherwise functionally equivalent configs produce distinct entries |
| V21-10 | Renamed pool fallback helper from "legacy in-process acquire" to createUnpooledConnection (§5.3, §6.1) | SDK MCP bypass is permanent, not legacy |
| V21-11 | drainAll(opts?) returns Promise<void> with timeoutMs wall-clock budget (§17) | Caller needs to know when drain finishes for shutdown ordering |
| V21-12 | Locked SDK reducer field names (Q1 resolved): keep mcpBudgetWarningCount etc. with scope semantics in JSDoc | No public-API rename mid-PR |
| V21-13 | Locked Q3 (default pool-on, --no-mcp-pool kill switch), Q4 (HTTP/SSE opt-in), Q6 (eager construction) | Single-PR delivery; no flag gating needed |
| V21-14 | Added R9/R10/R11 single-PR risks (§23) | Review fatigue, daemon_mode_b_main merge conflict, CI time |
| V21-15 | Extension uninstall orphan entry handling deferred to MAX_IDLE_MS natural reap (§16.3) | No explicit invalidateByExtension; keeps model uniform |
| # | What | Why |
|---|---|---|
| C1 | Pool fans out Tools + Prompts (was: tools only) | McpClient ctor takes both registries; prompts otherwise silently lost in pool mode |
| C2 | New section on global state coexistence (serverStatuses / mcpServerRequiresOAuth module Maps) | Cross-session sharing already exists today; pool inherits + formalizes |
| C3 | connectToMcpServer factory path unified with McpClient class in F2-1 | v1 only refactored the class; would leave a parallel non-pooled path |
| C4 | Snapshot replay on attach (earlyEvents-style) added to PoolEntry.attach() | New race: session-B attaches → server emits tools/list_changed before subscription wired |
| C5 | spawnInFlight: Map<ConnectionId, Promise<PoolEntry>> for concurrent-acquire dedupe | v1 mentioned in test matrix but missed in implementation contract |
| C6 | Cross-platform descendant-pid sweep (Linux/macOS pgrep, Windows wmic/PowerShell) | v1 said "copy opencode's pgrep -P" — that's Unix-only |
| C7 | trust field per-session copy of tool object | trust lives on DiscoveredMCPTool; shared instance would mix per-session trust |
| C8 | HTTP/SSE transports opt-in to pooling (default: stdio + websocket only) | Some MCP HTTP servers maintain per-transport session state; sharing risks state-bleed |
| C9 | SDK MCP server (isSdkMcpServerConfig) explicit bypass | sendSdkMcpMessage is per-session by design |
| C10 | OAuth path explicitly deferred to F3 | OAuth flow needs PermissionMediator-style routing; not F2 scope |
| C11 | Restart route semantics spec'd (name → all matching entries) | PR 17's POST /workspace/mcp/:server/restart previously unambiguous (1 entry); now 1..N |
| C12 | Status route refactor section (new path: QwenAgent.getMcpPoolAccounting()) | httpAcpBridge.ts:733-770 currently reads bootstrap session's manager — must change |
| C13 | Generation counter on PoolEntry for stale tools/list_changed handler guard | Opencode pattern: if (s.clients[name] !== client) return |
| C14 | Sub-PR breakdown 4 → 6 | v1 underestimated; A2/B1/B3/C6 each add real work |
| C15 | Lazy pool construction (only when N≥2 sessions seen) — optional | qwen serve --foreground single-session won't benefit; saves init cost |
Goals
ToolRegistry / PromptRegistry views preserved (filtering, trust)Non-goals (F2 scope)
PermissionMediator)MCPServerConfig diff to in-place mutate entries (config change → new entry, old drains)acpAgent.newSession(sessionId)
→ newSessionConfig(cwd, mcpServers) // acpAgent.ts:1771
→ loadCliConfig → new Config → config.initialize()
→ ToolRegistry ctor → new McpClientManager(config, ...) // tool-registry.ts:199
→ for (name, cfg) in config.getMcpServers():
new McpClient(name, cfg, toolRegistry, promptRegistry, workspaceContext, ...)
→ client.connect() → client.discover(config)
Coupling map (what must be broken or threaded through):
| Coupling | Location | Action in F2 |
|---|---|---|
McpClient ctor binds 1 ToolRegistry + 1 PromptRegistry | mcp-client.ts:106-119 | Pool owns transport; SessionMcpView (per session) owns the per-session registries |
McpClient.discover() calls toolRegistry.registerTool() inline | mcp-client.ts:178-198 | Split: discoverAndReturn() returns snapshot; view registers |
ListRootsRequestSchema handler closes over workspaceContext.getDirectories() | mcp-client.ts:142-153 + connectToMcpServer.ts:893 | Pool's single workspace-bound context |
workspaceContext.onDirectoriesChanged listener registered per connect | mcp-client.ts:907 | Pool registers once per entry |
McpClientManager new'd inside ToolRegistry | tool-registry.ts:199 | Add optional pool? ctor param; injection from Config |
| Budget enforcement per-session | mcp-client-manager.ts:91-95 comment | Move state machine into pool |
serverDiscoveryPromises dedupe in-flight per server | mcp-client-manager.ts:350 | Pool has spawnInFlight: Map<ConnectionId, Promise<PoolEntry>> |
setMcpBudgetEventCallback per-session registration | acpAgent.ts:1851-1899 | Pool emits → QwenAgent broadcasts to all sessions |
Already-shared state (pool inherits, does not introduce):
| State | Location | Note |
|---|---|---|
serverStatuses: Map<string, MCPServerStatus> | mcp-client.ts:292 (module-level) | Process-wide today; pool key still by name → "any-CONNECTED-wins" |
mcpServerRequiresOAuth: Map<string, boolean> | mcp-client.ts:302 (module-level) | Same |
MCPOAuthTokenStorage on-disk tokens | ~/.qwen/mcp-oauth/<name>.json | Daemon-host shared; pool just exploits more efficiently |
| Project | Pool? | Key | Lifecycle | Patterns to steal |
|---|---|---|---|---|
| claude-code | No, per-process | name + JSON.stringify(cfg) (lodash.memoize) | clearServerCache + remote backoff×5; stdio crash → failed | Sorted-key SHA-256 hashMcpConfig for invalidation/keying |
| opencode | Yes, per workspace | server name only (no config hash) | No refcount / no eviction / no restart; Effect finalizer + pgrep -P recursive SIGTERM | Descendant-pid sweep, stale-handler guard (if (s.clients[name] !== client) return), tools/list_changed fan-out via event bus |
What F2 inherits from each: config-hash from claude-code (handles per-session env/auth divergence opencode doesn't), descendant-pid sweep from opencode (npx/uvx wrappers leak). What we add: refcount + drain (multi-client daemon), auto-restart (long-running daemon), prompt fan-out, generation guard.
HTTP daemon (packages/cli/src/serve, qwen serve)
│ spawns
▼
ACP child (qwen --acp, single process per workspace)
│
QwenAgent (acpAgent.ts)
├── McpTransportPool ◄── new, workspace-scoped, 1 instance
│ ├── entries: Map<ConnectionId, PoolEntry>
│ ├── spawnInFlight: Map<ConnectionId, Promise<PoolEntry>>
│ ├── workspaceContext (bound to daemon workspace)
│ └── budget guardrails (PR 14 state machine, graduated to workspace)
│
└── sessions: Map<sessionId, Session>
└── Session.Config → ToolRegistry → McpClientManager(pool?)
│
┌────────┴────────┐
│ pool injected │
▼ ▼
pool.acquire(name,cfg,sid) legacy in-process
→ SessionMcpView (standalone qwen)
.applyTools/Prompts
(filter + register into
session's own registries)
Pool lives in the ACP child, not the HTTP daemon. The HTTP daemon queries pool state via the existing bridge.client extMethod surface (getMcpPoolAccounting, restartMcpServer). F2 code lives in packages/core/src/tools/ (peer of mcp-client-manager.ts), not packages/acp-bridge/.
McpTransportPool
├─ acquire(name, cfg, sid) → PooledConnection
├─ release(connectionId, sid) → void
├─ releaseSession(sid) → void (bulk release for session teardown)
├─ restartByName(name) → RestartResult[]
├─ getAccounting() → McpClientAccounting (workspace-scope)
├─ getBudgetMode/Budget()
├─ drainAll() → Promise<void> (shutdown)
└─ onBudgetEvent: (event) => void (set by QwenAgent)
PoolEntry (internal)
├─ refs: Set<sessionId>
├─ client: McpClient
├─ toolsSnapshot: DiscoveredMCPTool[]
├─ promptsSnapshot: Prompt[]
├─ generation: number (++ on reconnect; stale-event guard)
├─ state: 'spawning' | 'active' | 'draining' | 'closed' | 'failed'
├─ drainTimer?: NodeJS.Timeout
├─ healthMonitor: { intervalTimer, consecutiveFailures, isReconnecting }
├─ subscribers: Map<sid, SessionMcpView>
├─ attach(sid, view) → PooledConnection
└─ detach(sid) → void
PooledConnection (handle returned to caller)
├─ id: ConnectionId
├─ on('toolsChanged' | 'promptsChanged' | 'disconnected' | 'reconnected' | 'failed', cb)
├─ callTool(name, args, { sessionId }) → CallToolResult
├─ readResource(uri, { sessionId, signal })
└─ release()
SessionMcpView (per session, per server)
├─ ctor(toolRegistry, promptRegistry, sessionId, serverName, cfg)
├─ applyTools(snapshot) → void (filters by include/exclude, decorates trust)
├─ applyPrompts(snapshot) → void
└─ teardown() → void (removes its registrations)
type PoolKey = string; // sha256 hex, first 16 chars sufficient (collision-free for realistic N)
type ConnectionId = `${serverName}::${PoolKey}`;
function fingerprint(cfg: MCPServerConfig): PoolKey {
const canonical = {
transport: mcpTransportOf(cfg),
command: cfg.command ?? null,
args: cfg.args ?? [],
cwd: cfg.cwd ?? null,
env: sortedEntries(cfg.env ?? {}), // [[k,v],...] sorted by k
url: cfg.url ?? null,
httpUrl: cfg.httpUrl ?? null,
headers: sortedEntries(cfg.headers ?? {}),
timeout: cfg.timeout ?? null,
oauth: canonicalOAuth(cfg.oauth),
};
return sha256(JSON.stringify(canonical)).slice(0, 16);
}
/**
* V21-9: normalize functionally-equivalent OAuth configs so they
* collapse to the same fingerprint. `{enabled: false}`, `undefined`,
* `null`, and `{}` all mean "no OAuth" → all return `null`.
*/
function canonicalOAuth(o?: OAuthConfig | null): OAuthConfig | null {
if (!o || !o.enabled) return null;
return {
enabled: true,
clientId: o.clientId ?? null,
scopes: o.scopes ? [...o.scopes].sort() : null,
authorizationUrl: o.authorizationUrl ?? null,
tokenUrl: o.tokenUrl ?? null,
};
}
// Excluded fields (per-session filters, NOT transport-level):
// includeTools, excludeTools, trust, description, extensionName
const POOLED_TRANSPORTS_DEFAULT = new Set(['stdio', 'websocket']);
function isPoolable(cfg: MCPServerConfig, opts: PoolOptions): boolean {
if (isSdkMcpServerConfig(cfg)) return false;
const transport = mcpTransportOf(cfg);
return opts.pooledTransports.has(transport);
}
Default pooledTransports = {stdio, websocket}. Operators opt HTTP/SSE in via:
--mcp-pool-transports=stdio,websocket,http,sseQWEN_SERVE_MCP_POOL_TRANSPORTS=stdio,websocket,httpWhy default exclude HTTP/SSE: some MCP HTTP server implementations bind state (auth context, conversation memory) to the TCP/SSE stream; multiple ACP sessions sharing it would bleed state. stdio + websocket are true OS processes whose state is observable and isolatable.
isSdkMcpServerConfig(cfg) true → pool returns a thin PooledConnection wrapper via createUnpooledConnection(name, cfg, sid) that constructs an McpClient immediately, no sharing, no entry stored in pool. Reason: sendSdkMcpMessage is per-session by design (routes through ACP control plane back to the originating session). Same path used for HTTP/SSE when transport not in pooledTransports (§10.3).
V21-10: name is createUnpooledConnection, not legacyInProcessAcquire — SDK MCP and HTTP-opt-out are permanent design choices, not legacy code.
class McpTransportPool {
private entries = new Map<ConnectionId, PoolEntry>();
private spawnInFlight = new Map<ConnectionId, Promise<PoolEntry>>();
/** V21-2: reverse index, O(refs) releaseSession instead of O(entries). */
private sessionToEntries = new Map<string, Set<ConnectionId>>();
async acquire(
name: string,
cfg: MCPServerConfig,
sid: string,
): Promise<PooledConnection> {
if (!isPoolable(cfg, this.opts)) {
return this.createUnpooledConnection(name, cfg, sid);
}
const id: ConnectionId = `${name}::${fingerprint(cfg)}`;
if (this.entries.has(id)) {
this.indexAttach(sid, id);
return this.entries.get(id)!.attach(sid);
}
let inFlight = this.spawnInFlight.get(id);
if (!inFlight) {
const slot = this.tryReserveSlot(name);
if (slot === 'refused') {
throw new BudgetExhaustedError(
name,
this.clientBudget!,
this.reservedSlots.size,
);
}
inFlight = this.spawnEntry(name, cfg, id)
.catch((err) => {
// V21-4: release reserved slot on spawn failure. Without
// this, slot leaks until health monitor's release path
// runs (which it doesn't, because there's no entry to monitor).
if (slot === 'reserved') this.releaseSlotName(name);
throw err;
})
.finally(() => this.spawnInFlight.delete(id));
this.spawnInFlight.set(id, inFlight);
}
const entry = await inFlight;
this.indexAttach(sid, id);
return entry.attach(sid);
}
release(id: ConnectionId, sid: string): void {
const entry = this.entries.get(id);
if (!entry) return;
entry.detach(sid);
this.indexDetach(sid, id);
if (entry.refs.size === 0) entry.startDrainTimer(this.opts.drainDelayMs);
}
/** V21-2: O(refs of this session), not O(all entries). */
releaseSession(sid: string): void {
const ids = this.sessionToEntries.get(sid);
if (!ids) return;
for (const id of ids) {
const entry = this.entries.get(id);
if (!entry) continue;
entry.detach(sid);
if (entry.refs.size === 0) entry.startDrainTimer(this.opts.drainDelayMs);
}
this.sessionToEntries.delete(sid);
}
private indexAttach(sid: string, id: ConnectionId): void {
let ids = this.sessionToEntries.get(sid);
if (!ids) {
ids = new Set();
this.sessionToEntries.set(sid, ids);
}
ids.add(id);
}
private indexDetach(sid: string, id: ConnectionId): void {
const ids = this.sessionToEntries.get(sid);
if (!ids) return;
ids.delete(id);
if (ids.size === 0) this.sessionToEntries.delete(sid);
}
}
spawnInFlight)Mirrors McpClientManager.serverDiscoveryPromises (mcp-client-manager.ts:350). Without it, 5 sessions spawning at boot all see entries.has(id) === false and race to spawn 5 child processes.
const DRAIN_DELAY_MS_DEFAULT = 30_000; // grace after last release
const MAX_IDLE_MS_DEFAULT = 5 * 60_000; // hard cap (defense against drain cancellation loop)
State machine in PoolEntry:
spawning ──spawn ok──► active ──last detach──► draining ──timeout──► closed
│ │ │
│ │ └──attach──► active (cancel timer)
spawn fail───────────►failed
│
└──manual restart──► spawning
Hard idle cap: drain timer can be cancelled+restarted indefinitely (acquire/release flap). MAX_IDLE_MS is a separate timer started at first idle and never reset; when it fires, force-close even if drain is currently in active grace. Prevents zombie pool entries from buggy clients that thrash acquire/release.
R10 / R23 T7 / PR A update (2026-05-22): switched from per-pid BFS (one pgrep -P <pid> / Get-CimInstance -Filter subprocess per node) to a single process-table snapshot followed by in-memory tree walk. Two motivations: (1) one fork instead of B^D forks on the hot pool-shutdown path; (2) snapshot consistency — pre-fix BFS could miss descendants that forked between adjacent BFS levels. Per-pid path retained as fallback for BusyBox ps <v1.28 (no -o support) and distroless containers without ps.
// packages/core/src/tools/pid-descendants.ts
export async function listDescendantPids(rootPid: number): Promise<number[]> {
if (!Number.isInteger(rootPid) || rootPid <= 0) return [];
try {
if (process.platform === 'win32')
return await listDescendantPidsWin(rootPid);
return await listDescendantPidsUnix(rootPid);
} catch {
return []; // OS reaps orphans; pool shutdown still proceeds.
}
}
async function listDescendantPidsUnix(root: number): Promise<number[]> {
let tree: Map<number, number[]> | undefined;
try {
tree = await snapshotProcessTreeUnix(); // ps -A -o pid=,ppid=
} catch {
/* fall through to fallback */
}
if (tree) return walkDescendants(tree, root); // O(descendants), 1 fork
return await listDescendantPidsUnixPgrepFallback(root); // legacy BFS
}
async function snapshotProcessTreeUnix(): Promise<Map<number, number[]>> {
// -A: all processes (POSIX, equivalent to -e but unambiguous on BSD).
// -o pid=,ppid=: pid + ppid columns, trailing `=` suppresses headers.
const { stdout } = await execFile('ps', ['-A', '-o', 'pid=,ppid='], {
timeout: 2000,
maxBuffer: 8 * 1024 * 1024, // covers >250k-process pathological hosts
});
const childrenByPpid = new Map<number, number[]>();
for (const line of stdout.split('\n')) {
const m = line.trim().match(/^(\d+)\s+(\d+)$/);
if (!m) continue;
/* parse, push into childrenByPpid */
}
return childrenByPpid;
}
// Windows: single Get-CimInstance Win32_Process | ConvertTo-Csv snapshot
// of all (ProcessId, ParentProcessId) rows + in-memory walk; per-pid
// `Get-CimInstance -Filter "ParentProcessId=$p"` retained as fallback.
Called from PoolEntry.shutdown() before client.disconnect(). Handles npx @modelcontextprotocol/server-X, uvx ..., pnpm dlx ... wrapper leaks. MAX_DESCENDANTS=256 / MAX_DEPTH=8 caps preserved.
If spawnEntry rejects after multiple subscribers attached (via spawnInFlight):
tryReserveSlot released via explicit .catch arm in acquire (V21-4); without this fix the slot leaked until next health-monitor pass, which never ran because no entry existed to monitor.entriesacquire originally failed (existing per-session discoverMcpToolsForServer catch logic remains valid)When a PoolEntry enters reconnect after transport drop:
| Transport family | Strategy | Cap |
|---|---|---|
| stdio | Fixed 5s × 3 attempts | Per existing DEFAULT_HEALTH_CONFIG.reconnectDelayMs |
| websocket | Fixed 5s × 3 attempts | Same as stdio |
| http (opt-in) | Exponential 1s, 2s, 4s, 8s, 16s × 5 attempts | Remote endpoints flap on transient network issues; longer budget |
| sse (opt-in) | Exponential 1s, 2s, 4s, 8s, 16s × 5 attempts | Same as http |
After cap exhaustion: entry transitions to failed state; subscribers receive failed event; new acquire for same ConnectionId retries spawn once, then throws. Operator restart (§13) resets state.
// packages/core/src/tools/mcp-client.ts — split discover into pure
async discoverAndReturn(cliConfig: Config): Promise<{
tools: DiscoveredMCPTool[];
prompts: Prompt[];
}> {
if (this.status !== MCPServerStatus.CONNECTED) throw new Error('Client is not connected.');
try {
const [prompts, tools] = await Promise.all([
discoverPrompts(this.serverName, this.client, /* no registry */),
discoverTools(this.client, this.serverConfig, this.serverName, this.debugMode, this.workspaceContext),
]);
if (prompts.length === 0 && tools.length === 0) {
throw new Error('No prompts or tools found on the server.');
}
return { tools, prompts };
} catch (e) {
this.updateStatus(MCPServerStatus.DISCONNECTED);
throw e;
}
}
// Legacy discover() retained, delegates to discoverAndReturn + registers (for standalone qwen)
async discover(cliConfig: Config): Promise<void> {
const { tools, prompts } = await this.discoverAndReturn(cliConfig);
for (const t of tools) this.toolRegistry.registerTool(t);
for (const p of prompts) this.promptRegistry.registerPrompt(p);
}
class SessionMcpView {
applyTools(snapshot: DiscoveredMCPTool[]) {
this.sessionToolRegistry.removeToolsByServer(this.serverName);
for (const tool of snapshot) {
if (!this.passesFilter(tool)) continue;
// C7: per-session copy of trust (don't mutate shared snapshot)
const localTool = tool.withTrust(this.cfg.trust);
this.sessionToolRegistry.registerTool(localTool);
}
}
applyPrompts(snapshot: Prompt[]) {
this.sessionPromptRegistry.removePromptsByServer(this.serverName);
for (const p of snapshot) this.sessionPromptRegistry.registerPrompt(p);
}
}
class PoolEntry {
attach(sid: string): PooledConnection {
this.refs.add(sid);
this.cancelDrainTimer();
const view = new SessionMcpView(...);
this.subscribers.set(sid, view);
// Immediately replay current snapshot so subscriber doesn't miss
// updates that landed between in-flight discover completion and
// attach.
if (this.state === 'active') {
view.applyTools(this.toolsSnapshot);
view.applyPrompts(this.promptsSnapshot);
}
return this.makeHandle(sid, view);
}
}
Mirrors PR 14b fix #1's BridgeClient.earlyEvents pattern — solves analogous race for pool attachment.
class PoolEntry {
private generation = 0;
private async reconnect(): Promise<void> {
this.generation += 1;
const myGen = this.generation;
await this.client.disconnect();
await this.client.connect();
if (myGen !== this.generation) return; // superseded by another reconnect
const snap = await this.client.discoverAndReturn(this.cfg);
if (myGen !== this.generation) return;
this.toolsSnapshot = snap.tools;
this.promptsSnapshot = snap.prompts;
this.fanOut('toolsChanged');
this.fanOut('promptsChanged');
}
private onServerToolsListChanged = () => {
const myGen = this.generation;
this.client
.discoverAndReturn(this.cfg)
.then((snap) => {
if (myGen !== this.generation) return;
this.toolsSnapshot = snap.tools;
this.fanOut('toolsChanged');
})
.catch(/* swallow + log */);
};
}
Without this, a stale handler from a pre-reconnect Client instance could overwrite the post-reconnect snapshot with stale data.
Monotonicity invariant (V21 clarification): generation only increments, never resets. Any in-flight operation captures myGen at entry, then post-await checks myGen === this.generation. Equivalent to "no superseding event has happened since I started". Bounded by Number.MAX_SAFE_INTEGER (~285k years at 1Hz reconnect), no overflow concern.
packages/core/src/tools/mcp-client.ts has TWO connect-to-server paths:
McpClient class (mcp-client.ts:100) — used by McpClientManagerconnectToMcpServer factory function (mcp-client.ts:875) — used by discoverMcpTools (line 560) and connectAndDiscover (line 607)F2-1 must converge both behind McpClient.discoverAndReturn (with connectToMcpServer becoming a private helper of McpClient or both calling a shared establishConnection() primitive). Otherwise pool only covers the class path; the factory path remains per-session and undermines the whole effort.
serverStatuses (mcp-client.ts:292) — collision-tolerant writeModule-level Map<serverName, MCPServerStatus>. Pool's ConnectionId is name::hash, but updateMCPServerStatus(name, status) writes by name. Multiple pool entries for same name (different fingerprints, e.g. token-divergence) would clobber each other's status.
Resolution: pool intercepts status writes:
class PoolEntry {
updateStatus(s: MCPServerStatus) {
this.localStatus = s;
const aggregated = this.pool.aggregateStatusByName(this.serverName);
updateMCPServerStatus(this.serverName, aggregated);
}
}
class McpTransportPool {
aggregateStatusByName(name: string): MCPServerStatus {
// Any CONNECTED ⇒ CONNECTED
// Else any CONNECTING ⇒ CONNECTING
// Else DISCONNECTED
const entries = [...this.entries.values()].filter(
(e) => e.serverName === name,
);
if (entries.some((e) => e.localStatus === CONNECTED)) return CONNECTED;
if (entries.some((e) => e.localStatus === CONNECTING)) return CONNECTING;
return DISCONNECTED;
}
}
Status route surfaces entryCount: number so operators see when name → multiple entries.
MCPOAuthTokenStorage writes to ~/.qwen/mcp-oauth/<serverName>.json — already daemon-host-shared. Pool benefits incidentally (first session's OAuth completes → token on disk → pool entry's reconnect picks up token → all other sessions piggy-back).
Caveat — multi-fingerprint case: 2 entries for same name (different headers/env) but same OAuth provider → both read the same token file. If tokens are server-scoped (OAuth typical), this works. If tokens are env-scoped (rare), explicit storage key extension needed. Punt to F3 with a documented known-limitation.
entryCount in snapshotGET /workspace/mcp per-server cell adds:
{
kind: 'mcp_server',
name: 'github',
status: 'ok',
mcpStatus: 'connected',
entryCount: 2, // NEW — N pool entries for this name
entrySummary?: [ // NEW — opaque per-entry breakdown
{ entryIndex: 0, refs: 2, status: 'connected' },
{ entryIndex: 1, refs: 1, status: 'connecting' },
],
...
}
V21-7: entrySummary[].entryIndex is a stable opaque integer assigned at entry creation (insertion order within name group), NOT the raw fingerprint. Reasoning: fingerprint changes when OAuth tokens or env vars rotate, which would leak that information through snapshot diffs (operator could infer "token rotated at T+5min" from 'a3b1' → 'f972' transition). entryIndex is monotonic within name group but stays stable across rotations because old entry drains and new entry gets next index.
Old SDK clients ignore unknown fields per PR 14 contract; new clients use entryCount for badges. Internal restart-by-fingerprint path uses an opaque token returned only via privileged extMethod, not exposed in HTTP snapshot.
Pool's McpClient instances share one WorkspaceContext — the daemon's bound workspace context (PR #4113 invariant). connectToMcpServer's ListRootsRequestSchema handler closes over this single context.
onDirectoriesChanged listener registered once per entry, not once per acquire. Detached on entry shutdown.
roots/list_changed fan-upServer notifies client of new roots → pool fans out:
toolsChanged event → all subscriber views re-applyupdateWorkspaceDirectoriesContract: in Mode B, per-session directory additions are a soft hint, not authoritative. Pool's WorkspaceContext is daemon-level.
Two implementations choices:
extraRoots: Map<sessionId, Set<dir>>, ListRoots handler returns union of bound workspace + all extras. Per-session removal triggers roots/list_changed. Adds 50-80 LOC complexity.Pick v1 simple for F2; v2 union as follow-up if user pain materializes.
mcpServers from newSession({mcpServers})newSessionConfig(cwd, mcpServers, ...) merges injected list with settings.merged.mcpServers (acpAgent.ts:1778-1831). Pool consumes the per-session merged view:
async newSessionConfig(...) {
const config = await loadCliConfig(...);
if (this.mcpPool) config.setMcpTransportPool(this.mcpPool);
// ...existing setMcpBudgetEventCallback REMOVED — pool handles broadcast directly
}
When two sessions inject same-name server with different env/headers → different fingerprints → two pool entries. Pool sharing kicks in only when sessions agree exactly.
Static ~/.qwen/settings.json mcpServers are identical across sessions → all share → 80% case. Per-session injected mcpServers with per-user tokens → unique fingerprints → no sharing. Both safe.
Default pooledTransports = {stdio, websocket}. HTTP/SSE servers go through createUnpooledConnection path (one McpClient per session) unless operator opts in.
/mcp disable X mid-session (V21-6)When operator runs /mcp disable github against a live session:
Config.disableMcpServer('github') adds to per-Config disabledMcpServers setConfig.onDisabledMcpServersChanged fires; SessionMcpView for that name calls teardown() (removes its tool/prompt registrations from session registries)Without step 2, mid-session disable would leave already-registered tools in the session's ToolRegistry until next session restart. Test 21.4 covers this.
/mcp enable github is the inverse: triggers fresh pool.acquire for the session, attaches new view, re-applies snapshot.
tryReserveSlot / releaseSlotName / 75% hysteresis / refused_batch coalescing / bulkPassDepth / pendingRefusalNames — all migrate from McpClientManager to McpTransportPool. McpClientManager retains the state only when running standalone (no pool injected).
{
kind: 'mcp_budget',
scope: 'workspace', // NEW value (PR 14 v1 returned 'session')
liveCount: 5,
clientBudget: 10,
budgetMode: 'enforce',
status: 'ok',
}
Per PR 14 contract: "Consumers MUST tolerate additional entries with unrecognized scope values (drop, don't fail)." Old SDK clients see scope: 'workspace', render as unknown (or fallback to top-level numbers). New SDK adds isWorkspaceScopedBudget(cell) helper.
class QwenAgent {
constructor() {
this.mcpPool = new McpTransportPool({
onBudgetEvent: (event) => this.broadcastBudgetEvent(event),
});
}
private broadcastBudgetEvent(event: McpBudgetEvent) {
for (const [sid, session] of this.sessions) {
const enriched = {
...event,
scope: 'workspace' as const,
sessionId: sid,
};
session.connection
.extNotification('qwen/notify/session/mcp-budget-event', enriched)
.catch((err) =>
debugLogger.debug('budget event delivery failed', { sid, err }),
);
}
}
}
PR 14b exported these (must extend additively):
DaemonMcpBudgetWarningData — add scope?: 'workspace' | 'session' (optional for backward compat; absent = 'session')DaemonMcpChildRefusedBatchData — same scope? extensionDaemonMcpGuardrailEvent — discriminator unchangedNew SDK helpers:
export function isWorkspaceScopedBudgetEvent(
e: DaemonMcpGuardrailEvent,
): boolean;
Reducer state on DaemonSessionViewState:
mcpBudgetWarningCount / mcpChildRefusedBatchCount increment regardless of scope (scope is a property of each event, not a separate stream)V21-12 (Q1 resolved, locked in v2.1): keep existing field names (mcpBudgetWarningCount, mcpChildRefusedBatchCount, lastMcpBudgetWarning, lastMcpChildRefusedBatch) with extended scope semantics documented in JSDoc:
/**
* Count of `mcp_budget_warning` events the session has observed.
* Under F2 (`scope: 'workspace'`), this increments simultaneously
* across all attached sessions because budget events fan out at
* workspace level. Use `isWorkspaceScopedBudgetEvent(lastMcpBudgetWarning)`
* to inspect scope of the most recent event.
*/
mcpBudgetWarningCount: number;
Rationale: PR 14b already shipped these names as public SDK surface; renaming is a breaking change worse than the slightly imprecise semantics.
OAuth 401 fallback in connectToMcpServer (mcp-client.ts:950-1010) needs interactive resolution (browser open or device-flow). Mode B daemon must not spawn a browser (per PR 21 design — static-source grep test fails build on open/xdg-open/shell.openExternal).
F2 behavior on OAuth-requiring server:
connectToMcpServer → 401 detectedfailed_auth_requirederrorKind: 'auth_env_error' (existing PR 13 errorKind)/mcp auth <name> (existing CLI) OR uses PR 21's device-flow route to get a token on disk → next session acquire re-attempts and succeedsF3 will replace step 4-5 with PermissionMediator routing OAuth completion request to attached sessions for first-responder.
This avoids F2 mixing into auth state-machine work.
POST /workspace/mcp/:server/restart under poolToday (PR 17): restart in bootstrap session's manager = restart the single entry for that name.
Under pool: name → possibly multiple entries (different fingerprints for same name = different sessions with different configs).
Spec'd behavior:
| Request | Behavior |
|---|---|
POST /workspace/mcp/:server/restart | Restart all entries matching serverName (parallel via Promise.allSettled) |
POST /workspace/mcp/:server/restart?entryIndex=0 | V21-3: restart only entry #0 (the opaque index from snapshot §8.3); 404 if not found |
POST /workspace/mcp/:server/restart?entryIndex=* | Explicit "all" (same as no param) |
Response shape:
type RestartResult = {
entryIndex: number; // V21-7: opaque index, not raw fingerprint
restarted: boolean;
durationMs?: number;
reason?: string; // 'budget_would_exceed' | 'not_connected' | 'in_flight'
};
POST /workspace/mcp/:server/restart → { entries: RestartResult[] }
Old shape {restarted: true, durationMs} retained when entries.length === 1 AND no entryIndex query param for backward compat; clients can detect new shape by checking 'entries' in response.
class PoolEntry {
private restartInFlight?: Promise<void>;
async restart(): Promise<void> {
if (this.restartInFlight) return this.restartInFlight;
this.restartInFlight = this.doRestart().finally(() => {
this.restartInFlight = undefined;
});
return this.restartInFlight;
}
}
Pre-restart, pool checks budget: if disconnect+reconnect would still fit, OK. The current PR 17 {restarted:false, skipped:true, reason:'budget_would_exceed'} semantic preserved (just now applied per-entry).
Session A invokes pool.callTool('git.commit', args) → request hits stdin of underlying child → child process crashes mid-write → entry transitions to reconnect:
class MCPCallInterruptedError extends Error {
readonly serverName: string;
readonly entryIndex: number;
readonly clientGeneration: number; // pre-reconnect generation
readonly args: unknown; // original args, for caller to retry if safe
constructor(serverName, entryIndex, clientGeneration, args) { ... }
}
Spec:
MCPCallInterruptedError as soon as transport drop detected (don't wait for reconnect)PooledConnection.callTool); pool routes to the new transport instance transparentlyMCPCallInterruptedError.clientGeneration lets caller correlate with subsequent reconnected event if neededTest 21.6 must cover: spawn a long-running stdio MCP, send tool call, kill the child mid-call, assert MCPCallInterruptedError rejection with non-zero clientGeneration.
// httpAcpBridge.ts:733 buildWorkspaceMcpStatus — replace data source
let accounting: McpClientAccounting | undefined;
try {
// NEW: query pool directly via bridge extMethod, not bootstrap session
accounting = await this.bridge.client.getMcpPoolAccounting();
} catch (err) {
// Fallback to legacy bootstrap session path for non-pool daemon
const manager = config.getToolRegistry()?.getMcpClientManager();
if (manager) accounting = manager.getMcpClientAccounting();
}
QwenAgent exposes getMcpPoolAccounting():
class QwenAgent {
getMcpPoolAccounting(): McpClientAccounting | undefined {
return this.mcpPool?.getAccounting();
}
}
ACP child bridges through extMethod for the daemon to call.
Per §8.3.
Today (PR 12), when daemon is idle (no sessions yet), GET /workspace/mcp returns initialized: false because there's no bootstrap session to query.
Under pool: pool exists from QwenAgent ctor → status route can return live accounting even with zero sessions. Cell initialized: true even pre-first-session. Documented behavior change in PR description; not a regression.
session-A active, holds entry-X ref
session-A disconnect (no explicit close) → eventually killSession → pool.releaseSession(A) → entry-X.refs.size === 0 → drain timer starts (30s)
session-A resume within 30s → new newSessionConfig → pool.acquire returns entry-X → attach cancels drain
session-A resume after 30s → entry-X already closed → pool spawns new entry (cold start)
restoreState cache window (5min, from PR 6)acpAgent.restoreState is held 5 min after disconnect. Pool drain (30s default) < restore window (5min) → resume between 30s and 5min pays MCP cold start. Acceptable trade-off (resume itself is rare path).
Alternative: pool reads daemon's restore-window config and extends drain to match. Adds coupling between pool and session state machine; defer to follow-up unless user reports cold-start pain.
pendingRestoreIds interactionacpAgent.killSession() must call pool.releaseSession(sid) AFTER cleaning pendingRestoreIds. Order:
pendingRestoreIds.add(sid))RESTORE_WINDOW_MS elapses without resume: killSession permanently cleans → pool.releaseSession(sid) triggers drainAvoids drain firing during a restore window.
User edits ~/.qwen/settings.json mid-flight, changes a server's env:
Config/McpServers snapshot → keep acquiring old fingerprint → entry-OLD ref persistsNo live-mutation of running connections — clean separation between sessions on different config versions.
POST /workspace/mcp/reload-all
→ for each session: re-load settings, swap Config.mcpServers
→ for each entry no longer referenced: schedule eviction
Useful for "I changed env vars and want immediate effect across all sessions." Defer to F2 follow-up (not blocking).
Scenario: extension foo-ext registers MCP server foo-server. Operator runs /extension uninstall foo-ext. Extension lifecycle removes foo-server from extensionMcpServers so future loadCliConfig calls don't include it. But:
Config snapshots that still include foo-server → those sessions keep using the entryResolution: rely on natural drain. As old sessions close, refcount drops; eventually entry hits MAX_IDLE_MS = 5min and is force-closed. No explicit pool.invalidateByExtension(name) API — keeps the model uniform with hot config reload (§16.1).
Trade-off: extension's server may run up to 5min after uninstall if a long session keeps it alive. Acceptable; operators can /mcp restart foo-server then kill the session if urgency requires.
QwenAgent.close() sequence (must be enforced):
1. Set acceptingNewSessions = false; reject new POST /session
2. For each in-flight prompt: signal cancel, await completion (existing PR 11 lifecycle)
3. For each session: trigger close → pool.releaseSession(sid)
4. await pool.drainAll({ force: true, timeoutMs: 10_000 }) ← bypasses 30s grace
├── For each entry: cancel drain + health timers, mark draining
├── For each entry in parallel: listDescendantPids → SIGTERM children
├── For each entry in parallel: client.disconnect()
└── Promise.race against timeoutMs; abandoned entries get SIGKILL
5. Bridge channel close
6. Process exit
V21-11: drainAll signature:
async drainAll(opts?: {
force?: boolean; // default false; true bypasses 30s grace timer
timeoutMs?: number; // default 10_000; wall-clock budget; SIGKILL stragglers after
}): Promise<DrainResult>;
type DrainResult = {
drained: number; // entries that disconnected cleanly
forced: number; // entries SIGKILLed after timeout
errors: Array<{ entryIndex: number; serverName: string; error: string }>;
};
Caller uses DrainResult for shutdown logging; on forced > 0 log a warning so operator knows a server didn't shut down cleanly.
New files:
packages/core/src/tools/
mcp-transport-pool.ts # McpTransportPool main (~700 LOC)
mcp-pool-key.ts # fingerprint + canonicalize helpers (~150 LOC)
mcp-pool-entry.ts # PoolEntry: refcount + drain + health + generation (~500 LOC)
session-mcp-view.ts # SessionMcpView: filter + register tools/prompts (~200 LOC)
mcp-pool-events.ts # PoolEvent discriminated union (~80 LOC)
pid-descendants.ts # listDescendantPids cross-platform (~150 LOC, incl. tests)
packages/core/src/tools/
mcp-transport-pool.test.ts # ~900 LOC
mcp-pool-entry.test.ts # ~400 LOC
session-mcp-view.test.ts # ~250 LOC
mcp-pool-key.test.ts # ~150 LOC
pid-descendants.test.ts # ~200 LOC (Unix + Windows skip-gated)
Changed files:
packages/core/src/tools/mcp-client.ts # discoverAndReturn() split; connectToMcpServer unified
packages/core/src/tools/mcp-client-manager.ts # optional pool param; budget state conditional
packages/core/src/tools/tool-registry.ts # threads pool from config into McpClientManager
packages/core/src/config/config.ts # setMcpTransportPool / getMcpTransportPool
packages/cli/src/acp-integration/acpAgent.ts # QwenAgent.mcpPool construction; broadcastBudgetEvent;
# newSessionConfig wires pool into Config;
# killSession calls pool.releaseSession
packages/cli/src/serve/runQwenServe.ts # pass --mcp-pool-transports + budget env to ACP child
packages/cli/src/serve/httpAcpBridge.ts # buildWorkspaceMcpStatus reads pool;
# restartMcpServer extMethod returns RestartResult[]
packages/cli/src/serve/capabilities.ts # advertise mcp_workspace_pool
packages/sdk/src/daemon/mcpEvents.ts # scope?: optional field; isWorkspaceScopedBudgetEvent helper
Per maintainer's feature-cohesive batch guidance (#4175 branching strategy 2026-05-19), F2 ships as one PR with 6 atomic commits. Reviewer can step through with git log -p HEAD~6..HEAD and review commit-by-commit.
| Commit # | Title | Scope | Touches |
|---|---|---|---|
| 1 | refactor(core): split McpClient.discover into pure tool/prompt list and unify connect paths | Add discoverAndReturn(); extract shared establishConnection() used by both McpClient.connect() and connectToMcpServer() factory; legacy discover() becomes thin wrapper that registers (preserves standalone qwen behavior). Zero observable behavior change. | mcp-client.ts, mcp-client.test.ts |
| 2 | feat(core): McpTransportPool + SessionMcpView | Pool core: fingerprint, refcount, spawnInFlight dedupe, sessionToEntries reverse index, drain state machine, snapshot replay on attach, generation guard, tool+prompt dual fan-out, per-session trust copy. Mock McpClient for unit tests. No production wiring. | new mcp-transport-pool.ts, mcp-pool-key.ts, mcp-pool-entry.ts, session-mcp-view.ts, mcp-pool-events.ts + tests |
| 3 | feat(core): cross-platform descendant pid sweep + pool health monitor | listDescendantPids (Unix pgrep -P recursive, Windows PowerShell CIM); unified health monitor inside PoolEntry (interval check + failure count + reconnect backoff per §6.6); subprocess-spawn integration tests gated on QWEN_INTEGRATION === '1'. | new pid-descendants.ts + tests; mcp-pool-entry.ts |
| 4 | feat(serve): wire McpTransportPool into QwenAgent daemon mode | Config.setMcpTransportPool + getMcpTransportPool; ToolRegistry threads pool into McpClientManager; McpClientManager optional pool? ctor param; acpAgent.QwenAgent constructs pool at init; newSessionConfig injection; killSession calls pool.releaseSession; SDK MCP + HTTP/SSE bypass via createUnpooledConnection; CLI flags --mcp-pool-transports, --mcp-pool-drain-ms, --no-mcp-pool. | config.ts, tool-registry.ts, mcp-client-manager.ts, acpAgent.ts, runQwenServe.ts |
| 5 | feat(serve): pool-aware status + restart routes | QwenAgent.getMcpPoolAccounting extMethod; httpAcpBridge.buildWorkspaceMcpStatus pool-first + bootstrap-session fallback; restartMcpServer accepts ?entryIndex= and returns RestartResult[]; entryCount + entrySummary[].entryIndex on cell; capability tags mcp_workspace_pool + mcp_pool_restart. | httpAcpBridge.ts, capabilities.ts, SDK types |
| 6 | feat(serve): graduate MCP budget guardrails to workspace scope | Move tryReserveSlot/releaseSlotName/hysteresis state machine from McpClientManager to pool; remove per-session setMcpBudgetEventCallback wiring in acpAgent.newSessionConfig; QwenAgent.broadcastBudgetEvent fan-out; snapshot cell scope: 'workspace'; SDK scope? additive field; isWorkspaceScopedBudgetEvent helper; inline doc updates. | mcp-transport-pool.ts, mcp-client-manager.ts, acpAgent.ts, httpAcpBridge.ts, SDK |
Total LOC estimate: ~4100 production + ~1900 tests = ~6000 LOC (v2 estimate ~3850; growth absorbs V21 corrections).
Merge target: single PR into daemon_mode_b_main. Periodic batch merge to main per #4175 strategy.
Self-review process before opening PR:
code-reviewer agent on the commit diff; fold adopted findings into the same commitsilent-failure-hunter + type-design-analyzerMirror PR 21's specialist pre-review pattern.
Because F2 ships as one PR, all three tags advertise together. Pool consumers may assume mcp_workspace_pool advertise ⇒ entryCount/entrySummary/scope? fields all present; no per-field capability check needed.
| Tag | When advertised | Meaning |
|---|---|---|
mcp_workspace_pool | When QwenAgent.mcpPool !== undefined (always true in daemon mode unless --no-mcp-pool kill switch) | GET /workspace/mcp reflects pool-level state; entryCount + entrySummary fields present |
mcp_pool_restart | Always when mcp_workspace_pool is on | POST /workspace/mcp/:server/restart accepts ?entryIndex= and may return entries: RestartResult[] |
(extends mcp_guardrails) | unchanged | Same tag, payload extended with scope ('workspace' under F2) |
// @qwen-code/sdk — additive only
export interface DaemonMcpBudgetWarningData {
// existing fields...
scope?: 'workspace' | 'session'; // NEW — absent on old daemons (means 'session')
}
export interface DaemonMcpChildRefusedBatchData {
// existing fields...
scope?: 'workspace' | 'session';
}
export interface ServeWorkspaceMcpServerStatus {
// existing fields...
entryCount?: number;
entrySummary?: Array<{
fingerprint: string;
refs: number;
status: MCPServerStatus;
}>;
}
export function isWorkspaceScopedBudgetEvent(
e: DaemonMcpGuardrailEvent,
): boolean;
EVENT_SCHEMA_VERSION stays at 1 (additive).
Authorization value diff → different keyincludeTools/excludeTools/trust mutated → SAME key (per-session filter)new MCPServerConfig(...) with identical content → same key (canonical hash, not identity)client.connect)MAX_IDLE_MS (5min) hard close even if drain flappingacquire(sameKey) while no entry exists → exactly 1 spawnEntry call, all 5 get same entryexcludeTools: ['foo'], Session B no exclusion → A's ToolRegistry omits foo, B has it; both from same toolsSnapshottrust: true, Session B trust: false → Session A's DiscoveredMCPTool.trust === true, B's false; verify NOT shared reference (mutating one doesn't affect other)notifications/tools/list_changed → all subscribers' applyTools called with new snapshotnotifications/prompts/list_changed analogprocess.kill → subscribers receive disconnected eventMCPHealthMonitorConfig) → success → reconnected + fresh snapshotfailed; entry transitions to failed state; new acquires retry once then throwbash -c "sleep 60 & sleep 60" as stdio command → kill root → verify both descendants reaped (/proc/<pid>/status poll, or kill(0, pid) === false)cmd /c "ping -t localhost" wrapper → kill → verify ping subprocess gonepgrep unavailable (PATH missing) → graceful degradation: log warning, just SIGTERM root, don't crash--mcp-client-budget=2 with 3 static MCP servers → workspace total = 3 (not 12); snapshot cell scope: 'workspace', liveCount: 3qwen (no daemon) → mcpPool === undefined → all existing mcp-client-manager.test.ts tests pass unchanged--no-mcp-pool daemon flag → falls back to per-session, all existing daemon e2e tests pass{name: 'github', headers: {Authorization: 'Bearer tokenA'}}, Session B tokenB → 2 separate processes; verify by snapshot entryCount: 2; verify A's tool calls go through A's transport (by header inspection in stdin/log)client.connect spy count)POST /workspace/mcp/foo/restart returns legacy {restarted: true, durationMs} shape{entries: [{fingerprint, restarted, ...}, ...]}{restarted: false, skipped: true, reason: 'budget_would_exceed'} per entryGET /workspace/mcp returns initialized: true with live accountingmcpBudgetWarningCount increments simultaneously across all subscriber sessions when workspace event broadcastsisWorkspaceScopedBudgetEvent(e) correctly identifies scope from payloadscope field) → defaults to 'session' interpretationQwenAgent.close() triggers in order: stop accepting → drain prompts → close sessions → pool.drainAll → no zombie pids in pgrep -P <acpChildPid> after exitV21 locked Q1/Q3/Q4/Q6 in design defaults (single-PR delivery). Q2/Q5/Q7/Q8/Q9 remain.
| # | Question | F2 design default | Decision needed before |
|---|---|---|---|
| Q1 ✅ | SDK reducer field names — rename or keep? | LOCKED v2.1: keep mcpBudgetWarningCount etc. with extended scope semantics in JSDoc | resolved |
| Q2 | mcp_workspace_pool capability — bump protocolVersions ('v1' → 'v1.1'), or stay 'v1' additive? | Stay 'v1' additive (consistent with PR 14b precedent) | commit 5 |
| Q3 ✅ | --no-mcp-pool flag — default on or opt-in? | LOCKED v2.1: default on; --no-mcp-pool is kill switch | resolved |
| Q4 ✅ | HTTP/SSE default — pool off or on? | LOCKED v2.1: pool off; opt-in via --mcp-pool-transports | resolved |
| Q5 | POST /workspace/mcp/reload-all — include in F2 or follow-up? | Follow-up | n/a (deferred) |
| Q6 ✅ | Lazy pool construction — worth the conditional? | LOCKED v2.1: eager (always construct in QwenAgent ctor) | resolved |
| Q7 | restoreState window vs pool drain — keep separate, align, or read from settings? | Keep separate 30s default + config knob --mcp-pool-drain-ms | commit 4 |
| Q8 | OAuth handling — confirm F3 deferral, document workaround? | Deferred to F3, document /mcp auth <name> workaround | commit 4 |
| Q9 | entrySummary exposure — always include, or behind verbose flag? | Always include (small payload, useful for ops) | commit 5 |
| Q10 | Update codeagents/qwen-code-daemon-design/02-architectural-decisions.md decision #3 — coordinate with @wenshao? | F2 PR description links codeagents PR; two PRs reviewed independently | PR open |
serverStatuses collision on multi-entry same-name. Mitigated by aggregate-status function; remaining risk is SDK consumers reading the raw global Map (unlikely — only used via getMCPServerStatus(name) accessor).connectToMcpServer factory and McpClient class have subtle behavioral diffs (e.g. capabilities advertised at construct time vs connect time). Mitigated by F2-1 being a pure refactor PR with full regression coverage before pool work begins.Get-CimInstance may be slow (spawn cost) or blocked by AppLocker. Mitigated by 2s timeout + graceful degradation.Promise.all parallelization + per-session catch (existing PR 14b pattern).MCPServerConfig not included in fingerprint would silently allow incorrect sharing. Mitigated by explicit canonicalization function + test that enumerates all MCPServerConfig fields and asserts coverage.daemon_mode_b_main merge conflict accumulation): F2 touches acpAgent.ts, httpAcpBridge.ts, capabilities.ts, mcp-client*.ts — all hot paths. F3 / F4 contributors landing concurrently risk conflicts during F2's 1–2 week review window. Mitigation: daily git rebase origin/daemon_mode_b_main; coordinate via #4175 update that F2 is in-flight + asks F3/F4 to defer hot-file changes until F2 merges.process.env.QWEN_INTEGRATION === '1', run subset in PR CI + full set in nightly; (b) Vitest parallelism ≥ 4; (c) Windows pid sweep tests skip-gated on GHA Windows runner only.| Doc | Update | When |
|---|---|---|
codeagents/qwen-code-daemon-design/02-architectural-decisions.md | Decision #3 "MCP server lifetime": currently "per-session"; update to "workspace-pooled with config-hash key under daemon mode; per-session standalone" | F2-3 merges (coordinate with @wenshao codeagents PR) |
codeagents/qwen-code-daemon-design/06-roadmap.md | Wave 5 PR 23 → mark as F2 series; link to PRs | F2-3 merges |
packages/cli/src/serve/README.md (if exists) or new docs/serve/mcp-pool.md | New section: pool semantics, fingerprint key, transport opt-in, restart semantics, status snapshot interpretation | F2-3b |
packages/sdk/README.md | scope? field on guardrail events, entryCount on server status, helper isWorkspaceScopedBudgetEvent | F2-4 |
| Issue #4175 body | Update F2 entry with sub-PR table, link to design v2 (this doc) | Before F2-1 opens |
| Issue #3803 body | Decision #3 row: update "Currently per-session" → "Workspace-pooled under daemon (F2)" | After F2-3 merges |
acpAgent.ts:869-936 inline comment | Remove "Wave 5 PR 23" forward reference; update to "graduated by F2 to scope: 'workspace'" | F2-4 PR |
| CHANGELOG / release notes (Wave 6 / F5) | "MCP processes now shared across sessions in a workspace" headline | F5 release |
## feat(serve): shared MCP transport pool (workspace-scoped) [F2]
Single feature-cohesive PR per #4175 branching strategy (2026-05-19).
Replaces what was originally planned as Wave 5 PR 23 + sub-PRs F2-1..F2-4.
### Scope
~4100 LOC production + ~1900 LOC tests across 6 atomic commits.
Step through with `git log -p HEAD~6..HEAD` for commit-by-commit review.
### Design doc
See `docs/design/f2-mcp-transport-pool.md` (v2.1).
### Pre-review specialist agents (per PR 21 pattern)
Folded into first commit before opening:
- code-reviewer: N findings, all adopted
- silent-failure-hunter: N findings, all adopted
- type-design-analyzer: N findings, all adopted
### Closes
(none — F2 entry in #4175 stays open until PR merges into main batch)
### Related
- #3803 decision #3 update (codeagents PR <link>)
- PR 14b (#4271 merged) — budget guardrail base; F2 graduates scope to workspace
- F1 (#4319 merged) — acp-bridge package; F2 depends on injection seams
### Backward compatibility
- Standalone `qwen` (non-daemon): pool not constructed; existing behavior preserved
- Daemon `qwen serve --no-mcp-pool`: kill switch falls back to per-session
- SDK: all new fields additive (`entryCount`, `scope?`); EVENT_SCHEMA_VERSION stays at 1
- Old SDK clients: unknown `scope: 'workspace'` ignored per PR 14 contract
- Old daemons: SDK consumers can detect absence of `mcp_workspace_pool` capability and fall back
### Test plan
- [ ] Pool key: env permutation stability, header divergence, per-session filter exclusion
- [ ] Lifecycle: 3-session sharing, drain grace, concurrent acquire dedupe, spawn failure slot release
- [ ] Tools + Prompts dual fan-out, per-session trust copy, snapshot replay on attach
- [ ] Generation guard: pre-reconnect handler doesn't overwrite post-reconnect snapshot
- [ ] Crash + reconnect with stdio backoff (5s × 3) and HTTP backoff (1/2/4/8/16s × 5)
- [ ] Descendant pid sweep: Linux/macOS pgrep recursion, Windows PowerShell CIM
- [ ] Budget at workspace scope: 4 sessions × budget=2 → 3 max (not 12); fan-out to all attached
- [ ] LoadSession resume within drain window: pool entry reused, no cold start
- [ ] Hot config reload: old/new entries coexist; old drains naturally
- [ ] Restart route: `?entryIndex=` selectivity; legacy single-entry response shape preserved
- [ ] In-flight tool call during reconnect: `MCPCallInterruptedError` rejection
- [ ] Standalone qwen: all existing mcp-client-manager tests pass unchanged
F2 v2.1 = single PR with 6 atomic commits (~6000 LOC), targeting daemon_mode_b_main. Key design pillars:
McpTransportPool in packages/core (ACP child side), workspace-scoped, refcount + 30s drainSessionMcpView per-session tool+prompt registry projection with trust copyscope?)entryCount + RestartResult[]Open questions Q1–Q10 in §22 need maintainer decisions before respective sub-PRs open. Recommend resolving Q1–Q4 before F2-3 starts (those gate the broad direction); Q5–Q10 can resolve incrementally.