docs/developers/daemon/05-mcp-transport-pool.md
McpTransportPool (packages/core/src/tools/mcp-transport-pool.ts) is the F2 (#4175 commit 5) workspace-scoped pool: multiple ACP sessions on one daemon share one transport per unique (serverName + configFingerprint) tuple, instead of each spawning its own MCP child process. The pool lives inside the ACP child (QwenAgent.mcpPool), is constructed once at agent startup with the daemon's bootstrap Config, and survives session lifecycles. Entries reference-count session attaches and close after a configurable grace period when the reference count reaches zero.
It is the main mechanism that prevents a multi-session daemon from forking one copy of every MCP server per session.
(name + fingerprint), deduplicating concurrent acquires via spawnInFlight.MAX_IDLE_MS cap so a thrashing client cannot keep an idle transport alive forever.sessionToEntries) so releaseSession(sessionId) is O(refs) rather than O(entries).restartByName) — single-entry returns {restarted, durationMs}, multi-entry returns {entries: RestartResult[]} (F2 multi-entry contract).WorkspaceMcpBudget (see 06-mcp-budget-guardrails.md) on acquire to enforce per-name reservation caps; release the slot on entry close when no sibling entry holds the same name.SessionMcpView so a discovery in one session does not register tools into other sessions.class McpTransportPool {
constructor(cliConfig: Config, options: McpTransportPoolOptions);
acquire(
serverName,
cfg,
sessionId,
sessionToolRegistry,
sessionPromptRegistry,
): Promise<PooledConnection>;
release(id, sessionId): void;
releaseSession(sessionId): void;
restartByName(
name,
opts?,
): Promise<RestartResult | { entries: RestartResult[] }>;
drainAll(opts?): Promise<void>;
getBudget(): WorkspaceMcpBudget | undefined;
getSnapshot(): McpPoolSnapshot;
}
McpTransportPoolOptions:
workspaceContext: WorkspaceContext (required).debugMode: boolean.sendSdkMcpMessage? — per-session callback (pool bypasses SDK MCP).pooledTransports?: ReadonlySet<McpTransportKind> — default {stdio, websocket}. HTTP/SSE transports stay unpooled by default because their headers can carry session-specific OAuth state, but operators can explicitly opt them into pooling with QWEN_SERVE_MCP_POOL_TRANSPORTS.drainDelayMs? — default 30_000.entryOptions?: (transport) => PoolEntryOptions.budget?: WorkspaceMcpBudget.| State | Type | Purpose |
|---|---|---|
entries | Map<ConnectionId, PoolEntry> | Live pool entries keyed by connectionIdOf(name, fingerprint). |
unpooledIds | Set<ConnectionId> | Entries for transports outside the configured pooledTransports allowlist. |
spawnInFlight | Map<ConnectionId, Promise<PoolEntry>> | Deduplicates concurrent cold acquires for the same key. |
sessionToEntries | Map<string, Set<ConnectionId>> | V21-2 reverse index for O(refs) releaseSession. |
draining | boolean | Drain mutex — once set, all acquire calls reject. |
nextIndexByName | Map<string, number> | V21-7 monotonic entryIndex per server name (dashboards do not reshuffle when a new entry appears). |
PoolEntry (per-entry structure, mcp-pool-entry.ts)State machine: spawning → active ⇄ (active ↔ reconnect) → (active → draining on last detach, draining → active on attach OR draining → closed on timer).
| Field | Purpose |
|---|---|
localStatus: MCPServerStatus | Driven by MCPServerStatus lifecycle. |
state: PoolEntryState | spawning/active/draining/closed/failed. |
generation: number | Bumped on each restart; subscribers compare to detect reconnect cycles. |
refs: Set<string> | Session ids currently attached. |
subscribers: Map<string, SessionMcpView> | Per-session filtered views. |
subscriberHandles: Map<string, PooledConnectionImpl> | Handles returned from acquire. |
toolsSnapshot[], promptsSnapshot[] | Canonical pool-level snapshots; re-issued on toolsChanged / promptsChanged. |
drainTimer? | Armed when refs.size === 0; default 30s. Reset on attach. |
maxIdleTimer? | Armed at first idle; never reset by acquire/release churn. Default 5 min. |
firstIdleAt? | Watermark for the max-idle hard cap. |
restartInFlight? | Mutex for restart(). |
PoolEntryOptionsinterface PoolEntryOptions {
drainDelayMs: number; // default 30_000
maxIdleMs: number; // default 5 * 60_000
maxReconnectAttempts: number; // default 3 (stdio/ws) or 5 (http/sse)
reconnectStrategy:
| { kind: 'fixed'; delayMs: number }
| { kind: 'exponential'; baseMs: number; capMs: number };
}
defaultPoolEntryOptions(transport) (mcp-pool-entry.ts) returns stdio/ws defaults {fixed 5s, 3 attempts} and http/sse defaults {exponential 1s → 16s, 5 attempts}. Remote transports get longer retry budgets because their failures are more often transient.
acquiresequenceDiagram
autonumber
participant S as Session
participant P as Pool
participant SIF as spawnInFlight
participant E as PoolEntry
participant BDG as WorkspaceMcpBudget
participant SRV as MCP server
S->>P: acquire(name, cfg, sessionId, sessionToolRegistry, sessionPromptRegistry)
P->>P: refuse if draining
P->>P: connectionId = connectionIdOf(name, fingerprint)
P->>P: if !isPoolable(cfg) → mark unpooled
alt entry in entries (warm)
E-->>P: existing PoolEntry
else inflight cold spawn
SIF-->>P: existing Promise<PoolEntry>
else cold start
P->>BDG: tryReserve(name) (if budget set + poolable)
BDG-->>P: 'reserved' | 'already_held' | 'refused'
alt refused
P->>BDG: recordRefusal(name, transport)
P-->>S: BudgetExhaustedError
else ok
P->>E: spawnEntry(name, cfg)
E->>SRV: connect transport
SRV-->>E: ready
P->>P: entries.set(id, E); nextIndexByName++
E-->>P: connected
end
end
P->>E: addSubscriber(sessionId, sessionToolRegistry, sessionPromptRegistry)
P->>P: sessionToEntries.add(sessionId, id)
P->>P: cancel drain timer (refs>0)
P-->>S: PooledConnection { id, serverName, entryIndex, client, toolsSnapshot, promptsSnapshot, on, off, release }
release + drainsequenceDiagram
autonumber
participant S as Session
participant P as Pool
participant E as PoolEntry
participant BDG as WorkspaceMcpBudget
S->>P: release(id, sessionId)
P->>E: removeSubscriber(sessionId)
P->>P: sessionToEntries.delete(sessionId, id)
alt refs > 0
E-->>P: ok
else refs == 0
E->>E: firstIdleAt = now (if unset)
E->>E: arm drainTimer(drainDelayMs)
E->>E: arm maxIdleTimer(maxIdleMs - elapsed)
end
Note over E: drainTimer fires →
E->>SRV: disconnect transport
E->>P: emit 'closed'
P->>P: entries.delete(id)
P->>P: if !hasNameSibling(name) → BDG.release(name)
hasNameSibling(name) (mcp-transport-pool.ts) iterates both entries.values() and spawnInFlight.keys() parsing the latter with parseConnectionId (server names can legitimately contain ::, so startsWith would false-positive on a sibling name beginning with ${name}::).
releaseSession(sessionId) reads from sessionToEntries, releases all referenced entries in O(refs), then clears the index entry. Used by the bridge's session-close path so it does not iterate the full entry map.
restartByNamesequenceDiagram
autonumber
participant Op as POST /workspace/mcp/:server/restart
participant P as Pool
participant E as PoolEntry
participant SRV as MCP server
Op->>P: restartByName(name, opts?)
alt opts.entryIndex specified
P->>E: find entry by (name, entryIndex)
else
P->>P: gather all entries with matching name
end
par per entry
P->>E: restart() (mutex via restartInFlight)
E->>SRV: disconnect
E->>SRV: reconnect
E->>E: bump generation, re-emit snapshots
end
alt single entry
P-->>Op: {restarted: true, durationMs}
else multi-entry
P-->>Op: {entries: [{restarted, durationMs, entryIndex}, ...]}
end
The preflight budget check at the daemon HTTP layer returns {restarted:false, skipped:true, reason:'budget_would_exceed'} (Wave 4 mutation control) when the target's slot is not already reserved and a restart would push live count over enforce budget.
drainAllsequenceDiagram
autonumber
participant D as Daemon shutdown
participant P as Pool
participant E as PoolEntries
D->>P: drainAll({timeoutMs?})
P->>P: draining = true (refuse new acquires)
par for each entry
P->>E: trigger drain (close transport, clear timers)
E-->>P: closed
end
P-->>D: done (or timeout reached, force close)
acquire cold-starts a transport.drainDelayMs (default 30s) is reset to cancellation on attach.maxIdleMs (default 5 min) is never reset by attach/detach — it starts ticking at the FIRST idle and only stops when the entry actually closes or attaches before the deadline. Defense against thrashing clients.nextIndexByName is monotonic. Old entries keep their assigned index even after newer ones appear, so dashboards reading entryIndex do not reshuffle.packages/core/src/tools/mcp-client.ts — McpClient, status enum, SendSdkMcpMessage.packages/core/src/tools/mcp-pool-entry.ts — PoolEntry, PoolEntryOptions, defaultPoolEntryOptions.packages/core/src/tools/mcp-pool-key.ts — connectionIdOf, parseConnectionId, isPoolable, mcpTransportOf, POOLED_TRANSPORTS_DEFAULT.packages/core/src/tools/mcp-pool-events.ts — ConnectionId, PoolEntryState, PoolEvent.packages/core/src/tools/session-mcp-view.ts — per-session view that filters pool snapshots.packages/core/src/tools/mcp-workspace-budget.ts — WorkspaceMcpBudget (see 06-mcp-budget-guardrails.md).packages/core/src/tools/mcp-discovery-timeout.ts — discoveryTimeoutFor, runWithTimeout.| Source | Knob | Effect |
|---|---|---|
| Env | QWEN_SERVE_NO_MCP_POOL=1 | Kill switch — QwenAgent.mcpPool stays undefined; per-session McpClientManager enforces (pre-F2 path). |
| Flag | --mcp-client-budget=N, --mcp-budget-mode={off,warn,enforce} | Forwarded to ACP child via childEnvOverrides; child constructs WorkspaceMcpBudget and passes to pool. |
| Capability tags (conditional) | mcp_workspace_pool, mcp_pool_restart | Advertised together when pool is on. SDK pre-flights both to branch on pool-aware response shapes. |
Transports outside the configured pooledTransports allowlist (HTTP, SSE, and SDK-MCP by default) take a separate path: createUnpooledConnection(name, cfg, sessionId, ...) (mcp-transport-pool.ts) creates a per-session entry with id ${name}::unpooled-${entryIndex}. Differences from pooled entries:
entries AND tracked in unpooledIds: Set<ConnectionId> so release / releaseSession can fast-path the close-on-detach behavior (refs always max out at 1).McpClient.discover() is used directly instead of pool replay; applyTools / applyPrompts are no-ops because the session's registries already hold what was registered (W77 / skipReplay: true in attach()).tryReserve; the same WorkspaceMcpBudget slot is reserved and released on entry close (whether pooled or unpooled).The W77 race (cb206da36): createUnpooledConnection stores the entry in this.entries BEFORE awaiting client.connect() / client.discover(), but only indexes sessionToEntries[sessionId] AFTER attach() succeeds. A concurrent closeStoredSession() / releaseSession(sessionId) during the connect/discover window saw an empty index, let the unpooled spawn finish, and attach() then registered tools/prompts into an already-closed session. The fix:
mcp-pool-entry.ts: public isTerminated(): boolean probe (state === 'closed' || state === 'failed').mcp-pool-entry.ts: markActive() short-circuits if isTerminated() so a torn-down entry cannot be resurrected to 'active'.isTerminated() between the awaits and abort the attach if the parent session went away.This race was latent at the time (the W61/W71 per-session releaseSession hooks land in F4), but would become live the moment that hook arrived. The fix was applied early in the F2 series.
GET /workspace/mcp pool-aware snapshot fieldsWhen the pool is active, each ServeWorkspaceMcpStatus server cell
(packages/acp-bridge/src/status.ts) includes three additional fields:
| Field | Type | Purpose |
|---|---|---|
disabledReason | 'config' | 'budget' | Distinguishes operator-disabled servers (disabled: true from disabledMcpServers) from budget refusal (status: 'error', errorKind: 'budget_exhausted'). Dashboards can render one server row without cross-reading errors[] or budgets[]. |
entryCount | number (>=1) | In pool mode a workspace can have multiple PoolEntry instances with the same name when sessions inject different fingerprints such as per-session OAuth headers. This field is absent when QWEN_SERVE_NO_MCP_POOL=1 disables the pool. New clients render an "N entries" badge when entryCount > 1. |
entrySummary | ReadonlyArray<{entryIndex, refs, status}> | Per-entry breakdown. entryIndex is the stable opaque integer assigned when the entry was created; it is not the raw fingerprint, so snapshot diffs do not leak OAuth or env rotation timing. refs is the current attached-session count. status lets dashboards show per-entry health while aggregate mcpStatus is already connected. |
(entryCount, entrySummary) are always broadcast as a pair. The
mcp_workspace_pool capability tag implies both fields. Older SDK clients
ignore them under the additive protocol contract.
Pool snapshots also expose subprocessCount. It counts only the 'stdio'
family. WebSocket, HTTP, and SSE transports connect to remote servers and do
not spawn local child processes. Early versions counted WebSocket transports as
local subprocesses, which inflated resource dashboards.
Pool drain is not limited to the SIGTERM handler. The normal IDE shutdown path
(await connection.closed) also calls drainAll via
packages/cli/src/acp-integration/acpAgent.ts's drainPoolBeforeExit. Whether
the daemon receives a process signal or the IDE closes its connection cleanly,
the pool enters draining, refuses new acquires, and waits for entries to
close.
/mcp refresh shares the boot discovery pathdiscoverAllMcpTools (boot discovery) and
discoverAllMcpToolsIncremental (/mcp refresh / hot reload) both consult the
pool first in pool mode (packages/core/src/tools/mcp-client-manager.ts). The
shared gate prevents hot reload from accidentally creating a per-session
client, double-counting budget, or leaving an orphan transport behind.
MCPCallInterruptedError)When the underlying MCP transport silently disconnects (the connection jumps
from 'active' / 'draining' to localStatus === DISCONNECTED without an
explicit close), the pool marks the entry 'failed', evicts it from
pool.entries, and emits the failed event before detaching subscriber views.
That emit-before-detach order matters: subscribers receive the failed event
soon enough to route pending callTool promises to
MCPCallInterruptedError, so a stuck await client.callTool(...) rejects
cleanly instead of hanging. forceShutdown uses the same emit-then-detach
ordering.
canonicalOAuth normalizationThe pool key comes from fingerprint(cfg) in mcp-pool-key.ts. The hash covers
all transport-defining fields:
transport, command, args, cwd, env, url, httpUrl, tcp, headers, timeout, oauth
Per-session filtering and metadata fields (includeTools, excludeTools,
trust, description, extensionName, discoveryTimeoutMs) are excluded, so
sessions with different filters can share one entry.
For the OAuth cell, canonicalOAuth(o) hashes every MCPOAuthConfig field:
clientId, clientSecret, sorted scopes, sorted audiences,
authorizationUrl, tokenUrl, redirectUri, tokenParamName, and
registrationUrl. This is the credential-isolation contract: two session
configs that differ only by clientSecret, audiences, or redirectUri get
different fingerprints and cannot share one entry. Confidential clients and
multi-audience token deployments depend on this.
Sorting scopes and audiences makes callsite order irrelevant. Explicit
null is normalized so undefined fields hash the same as explicit null. The
key does not include discoveryTimeoutMs; concurrent acquire calls with the
same key but different timeouts are "first wins", matching the pre-F2
per-session manager behavior.
PoolEntry keeps cfg: MCPServerConfig private. External code must use the
entry.transportKind getter when it needs the transport family. That prevents
env, header auth, and OAuth fields from leaking to consumers by accident.
MAX_IDLE_MSThere is intentionally no active cleanup path for unloading an MCP extension at
runtime. Orphan entries whose MCPServerConfig no longer appears in the merged
workspace settings are reclaimed naturally by the MAX_IDLE_MS hard cap after
the last subscriber detaches. A synchronous unload-cleanup path would add
complexity for a rare operator edge case; the hard cap limits orphan process
lifetime past the unload point to 5 minutes by default.
Operators who need faster cleanup can restart the daemon or call
POST /workspace/mcp/:server/restart for the now-unconfigured name, which goes
through the disabled-server path and tears the entry down.
The pool emits two structured diagnostics on the self-heal path:
McpClient.lastTransportError: Error | undefined (packages/core/src/tools/mcp-client.ts) — McpClient.onerror stores the most recent transport exception in a private field and clears it at connect() entry. The PoolEntry silent-drop path reads client.getLastTransportError() and includes it in emit({kind:'failed', lastError}), so subscribers and dashboards do not have to grep stderr for root cause.
SweepResult (internal interface, not exported; packages/core/src/tools/mcp-pool-entry.ts) — sweepAndDisconnect(reason) returns Promise<SweepResult>:
interface SweepResult {
pidSweepError?: Error; // listDescendantPids itself threw
descendantsFound?: number; // descendant pid count found
descendantsSignaled?: number; // successfully SIGTERM'd count
}
The only consumer is the silent-drop block in statusChangeListener. It uses
descendantsFound / descendantsSignaled to detect partial-signal cases
(fewer processes signaled than found, usually because a process exited or EPERM
occurred between listDescendantPids and sigtermPids) and sweep errors, then
logs a structured warning. forceShutdown and doRestart ignore this return
value because their catch paths already carry richer failure signals.
pid-descendants snapshot pathWhen McpTransportPool shuts down stdio subprocesses, it has to enumerate their
descendant processes; npx wrappers and shell wrappers can create multiple fork
levels. packages/core/src/tools/pid-descendants.ts exposes
listDescendantPids(rootPid) → Promise<number[]> and sigtermPids(pids) for
sweepAndDisconnect.
A single ps -A -o pid=,ppid= snapshot reads the process table, parses it into
Map<ppid, pid[]>, then walkDescendants(tree, root) performs BFS to extract
the subtree. Any depth requires only one ps fork.
walkDescendants maintains visited: Set<number> and includes root in the
set to defend against PID-reuse cycles. Under fast process churn, the snapshot
can theoretically contain A→B / B→A loops. Without visited, the walker could
fill the MAX_DESCENDANTS quota with bogus data and crowd out real descendants.
A single Get-CimInstance Win32_Process | ConvertTo-Csv -Delimiter ","
snapshot emits all (ProcessId, ParentProcessId) rows, then the same Map and
walkDescendants path runs.
The explicit -Delimiter "," is required. PowerShell 5.1, which ships with
Windows, defaults ConvertTo-Csv to the system locale list separator; DE, FR,
NL, IT, and similar locales use ;, so the pre-fix parser
^"(\d+)","(\d+)"$ never matched and every daemon shutdown fell back to the
per-pid CIM filter path, adding roughly 0.5-1s of PowerShell startup cost per
child.
BusyBox <v1.28 lacks ps -o, distroless containers might not include ps,
and some Windows environments truncate CIM output via ACLs. When the primary
path parses zero rows or throws, the code falls back to per-pid BFS: Linux /
macOS use pgrep -P <pid>, and Windows uses
Get-CimInstance -Filter "ParentProcessId=$p" where $p is a PowerShell
variable binding rather than string concatenation. The current
Number.isInteger guard is sufficient for the entry point; the binding is
defense-in-depth.
Both paths are bounded by MAX_DESCENDANTS = 256 and MAX_DEPTH = 8 to keep a
malicious or degenerate process tree from dragging down sweep.
The snapshot path uses maxBuffer: 8MB, enough for pathological hosts with
about 250k processes. Node's default 1MB buffer can truncate child-process
output around 30k processes.
The performance gain is intentionally modest (typical 200-500 process dev
machines parse in under 10ms, around 2x faster than per-pid pgrep). The main
benefit is fork hygiene and snapshot consistency: BFS sees the full subtree at
once, while the previous per-pid query path could miss a grandchild forked
between two queries.
McpClientManager constructorMcpClientManager is constructed as
(config, toolRegistry, options?: McpClientManagerOptions). Embedders that
import the class directly should pass:
new McpClientManager(config, toolRegistry, {
eventEmitter,
sendSdkMcpMessage,
healthConfig,
budgetConfig,
pool,
});
Tests should prefer an mkManager(overrides?) factory so cases that care about
one or two fields stay one line.
These helpers are internal, but source readers may see them:
McpTransportPool.acquire() uses attachPooledSession and rollbackReservationOnSpawnFailure to share fast-path attach, post-spawn attach, and pooled spawn-in-flight catch behavior. Runtime behavior is unchanged; race-window invariants still live at the call sites.SessionMcpView.applyTools / applyPrompts compile includeTools / excludeTools once via compileNameFilter(cfg) and check each tool with compiledFilterAccepts(compiled, name). Exported passesSessionFilter / passesSessionPromptFilter use the same compiled path. excludeTools is exact-match; includeTools strips the first (...) suffix so toolName(args) matches toolName.Design document: ../../design/f2-mcp-transport-pool.md §6 covers the transport pool state machine, reconnect, drain, and descendant sweep paths.
QWEN_SERVE_MCP_POOL_TRANSPORTS, each acquire mints a fresh entry that lives only as long as its session. Their headers may carry session-specific OAuth state, so pooling them by default would risk leaking credentials across sessions.maxIdleMs is a hard cap that survives attach/detach churn. A 5-minute idle hard cap means even an aggressively attaching/detaching client cannot keep an idle transport pinned past 5 minutes. Operators who want pinned long-lived transports should increase maxIdleMs or run the server outside the pool.pool.getSnapshot().subprocessCount.startsWith regression was avoided in hasNameSibling because MCP server names can legitimately contain :: (mcp-pool-key.test.ts). Always use parseConnectionId's lastIndexOf('::') split, never string-prefix matching.drainAll sets draining = true permanently; a fresh pool is required for further work.packages/core/src/tools/mcp-transport-pool.ts (entire file)packages/core/src/tools/mcp-pool-entry.ts (entry lifecycle)packages/core/src/tools/mcp-pool-key.ts (connectionIdOf, parseConnectionId)packages/core/src/tools/mcp-pool-events.ts (event types)packages/core/src/tools/session-mcp-view.ts (per-session filtered view)../../design/f2-mcp-transport-pool.md. Treat the design contract as authoritative; this page is the developer deep dive.