Qwen Code Runtime Diagnostics Benchmark Report

docs/e2e-tests/2026-05-19-qwen-runtime-diagnostics-benchmark-report.md

Date: 2026-05-19

Scope

This run repeats the previous Qwen Code benchmark shapes with the new opt-in runtime diagnostics enabled. It tests only Qwen Code, not Claude Code.

Initial model matrix:

  • pai/glm-5
  • qwen3.6-plus

Additional PR-size follow-up:

  • DeepSeek/deepseek-v4-pro through Anthropic-compatible protocol

Cases:

  • small GitHub PR review: PR #4268
  • code navigation: compression / compaction related code search and reads
  • synthetic local diff: about 94.6 KiB
  • synthetic local diff: about 968.5 KiB
  • synthetic local diff: about 4.84 MiB

The run used the local bundled CLI from the diagnostics branch, with QWEN_CODE_PROFILE_RUNTIME=1 and a temporary CLI home. Global MCP servers and hooks were not loaded for this benchmark.

Important caveat: these absolute RSS numbers are lower than the previous PATH-resolved qwen runs because this run used node dist/cli.js from the local branch plus a stripped temporary config. Treat this report as an internal diagnostics distribution run, not a direct replacement for the earlier installed CLI RSS comparison.

Installed CLI vs Local Bundle Sanity Check

A follow-up sanity check used the same minimal prompt, model, and non-interactive mode across the installed CLI and the local diagnostics bundle. The only intentional variable was whether Qwen Code loaded a stripped temporary CLI home or the normal user config.

| CLI | Config mode | Total tokens | Tree RSS peak | Root RSS peak | Process count peak | Runtime diagnostics |
| --- | --- | --- | --- | --- | --- | --- |
| PATH qwen | stripped config | 33,965 | 542.4 MiB | 249.9 MiB | 3 | no |
| local dist/cli.js | stripped config | 47,281 | 455.2 MiB | 214.2 MiB | 4 | yes |
| PATH qwen | normal config | 97,615 | 1,099.9 MiB | 250.1 MiB | 6 | no |
| local dist/cli.js | normal config | 97,954 | 1,105.4 MiB | 212.7 MiB | 8 | yes |

This check changes the attribution: the earlier 1 GiB user-visible peak is reproducible with the normal config even on the local diagnostics bundle. It is therefore not primarily explained by the local branch including PR #4186.

At the normal-config peak, the local process-tree sample was dominated by multiple Node/MCP processes rather than the Qwen root process alone:

| Role | Command shape | RSS at tree peak |
| --- | --- | --- |
| child | Node process | 252.9 MiB |
| child | Chrome DevTools MCP | 219.7 MiB |
| child | Node process | 219.2 MiB |
| root | Qwen Node process | 215.1 MiB |
| child | Chrome DevTools MCP setup | 175.2 MiB |

PR #4186 is present in the local diagnostics branch, but it is a V8 heap pressure auto-compaction safety net. It triggers at about 70% V8 heap pressure; on this environment the Node heap limit is about 4.1 GiB, while the stripped benchmark end heap was about 99-143 MiB. Based on these numbers, the lower stripped-config RSS is not caused by #4186 actively compressing context during these benchmark runs.
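The arithmetic behind this conclusion can be checked with a short sketch. The helper name `heapPressure` is hypothetical, and the 0.7 threshold is the approximate figure quoted above rather than a value read from the PR #4186 source.

```javascript
// Hypothetical helper: fraction of the V8 heap limit currently in use.
function heapPressure(heapUsedBytes, heapLimitBytes) {
  if (heapLimitBytes <= 0) throw new RangeError("heap limit must be positive");
  return heapUsedBytes / heapLimitBytes;
}

const MiB = 1024 * 1024;
const GiB = 1024 * MiB;

// Figures quoted in this report. The threshold is an assumption based on
// the "about 70%" description, not on the actual implementation.
const heapLimit = 4.1 * GiB;
const threshold = 0.7;

for (const endHeapMiB of [99, 143]) {
  const ratio = heapPressure(endHeapMiB * MiB, heapLimit);
  console.log(
    `${endHeapMiB} MiB end heap -> ${(ratio * 100).toFixed(1)}% of limit, ` +
      `trigger reached: ${ratio >= threshold}`
  );
}
```

With the stripped-benchmark end heap at roughly 2.4% to 3.4% of the limit, the runs sit far below the trigger point.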

Bare Mode Config Attribution Check

A second follow-up used qwen3.6-plus with the same PR-review prompt shape on both the installed CLI and the local bundle. This is not a normal end-to-end business benchmark. It is a controlled attribution check for startup/config memory only.

--bare changes the runtime inputs: it skips normal global settings discovery, MCP startup, hooks, implicit context, skills, and other startup integrations. It can therefore fail or behave differently when a model provider is configured only in global settings. For this run, model credentials were supplied only through the child-process environment because bare mode intentionally does not load the normal provider settings. Nothing was written back to the user's global config.

This run did not produce useful token/tool-call statistics: the model completed in one turn and did not call the requested shell command. Do not use these rows as normal task benchmark results, and do not compare their token/tool-call behavior with the matrix above. They are only useful for estimating how much process-tree RSS comes from normal config and configured child processes.

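| CLI | Mode | Wall | Turns | Tool uses | Tree RSS peak | Root RSS peak | Process count peak |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PATH qwen | normal | 5.5s | 1 | 0 | 1,021.3 MiB | 251.5 MiB | 5 |
| PATH qwen | --bare | 2.4s | 1 | 0 | 525.7 MiB | 246.4 MiB | 2 |
| local dist/cli.js | normal | 4.9s | 1 | 0 | 1,046.2 MiB | 213.3 MiB | 5 |
| local dist/cli.js | --bare | 2.3s | 1 | 0 | 454.3 MiB | 216.5 MiB | 3 |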

The result confirms the process-tree hypothesis for startup/config attribution. On this machine, normal config adds roughly 0.50-0.59 GiB of user-visible process-tree RSS over --bare, while root RSS stays in the same 0.21-0.25 GiB band. At the normal-config peak, the extra RSS again came from additional Node/MCP child processes, including a Chrome DevTools MCP process and its setup wrapper. --bare removes those startup/config children and brings installed/local runs back into the 0.45-0.53 GiB tree-RSS range.

Temporary Settings MCP / Hooks Isolation

Because --bare changes too many runtime inputs to be treated as a normal benchmark, a follow-up used temporary QWEN_HOME directories with generated settings files derived from the normal settings. The run stayed on the normal settings-loading path, but toggled only two config dimensions:

  • MCP disabled: mcpServers cleared and MCP allow/exclude lists emptied.
  • Hooks disabled: disableAllHooks set to true.

No global settings were modified. The case used qwen3.6-plus and a minimal startup prompt, so it measures startup/config process-tree cost, not task reasoning quality.

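| CLI | Temporary config | MCP servers | Tools | Tree RSS peak | Root RSS peak | Process count peak |
| --- | --- | --- | --- | --- | --- | --- |
| PATH qwen | full | 4 | 46 | 1,017.4 MiB | 249.8 MiB | 5 |
| PATH qwen | MCP disabled | 0 | 17 | 548.7 MiB | 252.4 MiB | 2 |
| PATH qwen | hooks disabled | 4 | 46 | 1,003.8 MiB | 246.4 MiB | 5 |
| PATH qwen | MCP + hooks disabled | 0 | 17 | 542.5 MiB | 248.0 MiB | 2 |
| local dist/cli.js | full | 4 | 48 | 865.9 MiB | 220.4 MiB | 6 |
| local dist/cli.js | MCP disabled | 0 | 19 | 442.9 MiB | 209.6 MiB | 2 |
| local dist/cli.js | hooks disabled | 4 | 48 | 848.3 MiB | 212.6 MiB | 5 |
| local dist/cli.js | MCP + hooks disabled | 0 | 19 | 447.2 MiB | 217.8 MiB | 2 |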

Interpretation:

  1. Disabling MCP is the dominant change. It removes 4 MCP servers, reduces the advertised tool count by 29 (46 to 17 on PATH qwen, 48 to 19 on the local bundle), and lowers process-tree RSS by about 0.42-0.47 GiB in this startup/config case.
  2. Disabling hooks alone barely changes RSS in this case. That is expected because the prompt did not produce tool calls, so PreToolUse / PostToolUse hooks were not executed.
  3. The root process stays around 0.21-0.25 GiB across all rows. The large difference is again process-tree composition, not root Qwen RSS.
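The deltas quoted in this interpretation follow directly from the table. A minimal sketch of the subtraction, with the tree RSS peaks copied from the rows above (the object layout and the `configDeltas` name are illustrative only):

```javascript
// Tree RSS peaks (MiB) and tool counts copied from the temporary-settings table.
const isolationRows = {
  "PATH qwen": {
    full: 1017.4, mcpDisabled: 548.7, hooksDisabled: 1003.8,
    toolsFull: 46, toolsNoMcp: 17,
  },
  "local dist/cli.js": {
    full: 865.9, mcpDisabled: 442.9, hooksDisabled: 848.3,
    toolsFull: 48, toolsNoMcp: 19,
  },
};

// Hypothetical helper: how much each toggle changes tree RSS and tool count.
function configDeltas(row) {
  return {
    mcpDeltaMiB: +(row.full - row.mcpDisabled).toFixed(1),
    hooksDeltaMiB: +(row.full - row.hooksDisabled).toFixed(1),
    toolDelta: row.toolsFull - row.toolsNoMcp,
  };
}

for (const [cli, row] of Object.entries(isolationRows)) {
  console.log(cli, configDeltas(row));
}
```

Disabling MCP accounts for a few hundred MiB per CLI, while disabling hooks moves tree RSS by less than 20 MiB, which is within the startup variance seen elsewhere in this report.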

Two attempted code-navigation follow-ups with qwen3.6-plus and pai/glm-5 also reproduced the same MCP-vs-no-MCP memory split, but neither model produced tool calls in those runs. Those rows are therefore not used as hooks execution evidence. A valid hooks benchmark still needs a task/model combination that reliably emits tool calls.

Per-MCP Isolation

The previous section showed that MCP as a group is the dominant startup/config memory factor. A follow-up isolated each configured MCP server while keeping hooks disabled for all rows. This keeps the test on the normal settings-loading path and changes only the MCP server subset.

Configured MCP server names:

  • approval-bridge
  • env-center
  • chrome-devtools
  • code

Single-pass isolation:

| Variant | Enabled MCPs | Tools | MCP servers | Tree RSS peak | Root RSS peak | Interpretation |
| --- | --- | --- | --- | --- | --- | --- |
| none | none | 19 | 0 | 444.4 MiB | 211.7 MiB | baseline without MCP |
| full | all 4 | 48 | 4 | 857.3 MiB | 215.9 MiB | full MCP startup shape |
| only approval-bridge | approval-bridge | 19 | 1 | 455.5 MiB | 214.0 MiB | near baseline |
| only env-center | env-center | 19 | 1 | 452.3 MiB | 214.4 MiB | near baseline |
| only chrome-devtools | chrome-devtools | 48 | 1 | 824.4 MiB | 209.5 MiB | large RSS increase and tool increase |
| only code | code | 19 | 1 | 452.1 MiB | 216.6 MiB | near baseline |
| without approval-bridge | env-center, chrome-devtools, code | 48 | 3 | 997.1 MiB | 215.4 MiB | still high; run showed variance |
| without env-center | approval-bridge, chrome-devtools, code | 48 | 3 | 863.8 MiB | 220.9 MiB | still high |
| without chrome-devtools | approval-bridge, env-center, code | 19 | 3 | 463.4 MiB | 221.6 MiB | returns near baseline |
| without code | approval-bridge, env-center, chrome-devtools | 48 | 3 | 858.1 MiB | 219.5 MiB | still high |

Because startup RSS has some variance, the key variants were repeated twice:

| Variant | Samples | Tree RSS range | Avg tree RSS | Result |
| --- | --- | --- | --- | --- |
| none | 2 | 443.3-451.9 MiB | 447.6 MiB | stable no-MCP baseline |
| full | 2 | 856.1-922.8 MiB | 889.5 MiB | stable high-MCP range |
| only chrome-devtools | 2 | 1,007.1-1,021.2 MiB | 1,014.2 MiB | enough alone to reproduce high |
| without chrome-devtools | 2 | 461.1-461.6 MiB | 461.4 MiB | removes the high RSS |
| only approval-bridge | 2 | 449.1-449.9 MiB | 449.5 MiB | near baseline |
| only env-center | 2 | 438.7-449.5 MiB | 444.1 MiB | near baseline |
| only code | 2 | 450.6-451.3 MiB | 451.0 MiB | near baseline |

Interpretation:

  1. chrome-devtools is the dominant MCP contributor in this environment. It is sufficient by itself to reproduce the high process-tree RSS.
  2. Removing chrome-devtools from the full MCP set returns RSS to the no-MCP band. Removing other MCPs while keeping chrome-devtools does not.
  3. The advertised tool count follows the same pattern: baseline is 19 tools, while chrome-devtools raises the tool count to 48. That means this MCP is also likely to increase request tool schema size and token pressure, not just process-tree RSS.
  4. approval-bridge, env-center, and code individually stay near the no-MCP baseline in these startup/config runs. They emitted startup warnings in this environment, so this result should be interpreted as "no persistent startup RSS owner observed" rather than proof that they have zero cost in all workflows.
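The "near baseline" versus "high" split can be expressed as a simple comparison of per-variant averages against the no-MCP average. The sketch below uses the repeated samples from the table above; the 100 MiB margin and the `classifyVariants` name are assumptions chosen for illustration, not thresholds used by the benchmark.

```javascript
// Repeated tree RSS samples (MiB) copied from the two-sample table.
const repeatedSamples = {
  "none": [443.3, 451.9],
  "full": [856.1, 922.8],
  "only chrome-devtools": [1007.1, 1021.2],
  "without chrome-devtools": [461.1, 461.6],
  "only approval-bridge": [449.1, 449.9],
  "only env-center": [438.7, 449.5],
  "only code": [450.6, 451.3],
};

const mean = (xs) => xs.reduce((sum, x) => sum + x, 0) / xs.length;

// Hypothetical classifier: a variant is "high" when its average tree RSS
// exceeds the baseline average by more than marginMiB.
function classifyVariants(samples, baselineKey, marginMiB) {
  const baseline = mean(samples[baselineKey]);
  const result = {};
  for (const [name, xs] of Object.entries(samples)) {
    result[name] = mean(xs) - baseline > marginMiB ? "high" : "near baseline";
  }
  return result;
}

console.log(classifyVariants(repeatedSamples, "none", 100));
```

Only the two variants that include chrome-devtools land in the "high" group; every variant without it stays within a few MiB of the baseline average.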

Runtime Summary

| Case | Model | Wall | Turns | Total tokens | Tree RSS peak | Root RSS peak | End heap | End RSS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| small PR #4268 | pai/glm-5 | 20.1s | 7 | 173,216 | 362.1 MiB | 359.8 MiB | 103.1 MiB | 216.5 MiB |
| code navigation | pai/glm-5 | 18.4s | 2 | 49,127 | 378.0 MiB | 376.0 MiB | 102.4 MiB | 313.4 MiB |
| diff 94.6 KiB | pai/glm-5 | 16.6s | 6 | 135,716 | 367.9 MiB | 366.0 MiB | 99.1 MiB | 295.0 MiB |
| diff 968.5 KiB | pai/glm-5 | 11.4s | 2 | 42,590 | 373.2 MiB | 362.5 MiB | 106.4 MiB | 345.6 MiB |
| diff 4.84 MiB | pai/glm-5 | 12.0s | 4 | 95,119 | 414.2 MiB | 412.0 MiB | 123.6 MiB | 410.7 MiB |
| small PR #4268 | qwen3.6-plus | 35.0s | 6 | 156,556 | 358.9 MiB | 356.9 MiB | 102.6 MiB | 293.1 MiB |
| code navigation | qwen3.6-plus | 28.9s | 4 | 99,800 | 370.3 MiB | 368.3 MiB | 105.8 MiB | 298.2 MiB |
| diff 94.6 KiB | qwen3.6-plus | 28.3s | 4 | 90,808 | 358.8 MiB | 356.9 MiB | 105.9 MiB | 307.0 MiB |
| diff 968.5 KiB | qwen3.6-plus | 30.9s | 6 | 151,782 | 366.1 MiB | 364.1 MiB | 101.0 MiB | 316.9 MiB |
| diff 4.84 MiB | qwen3.6-plus | 24.1s | 4 | 93,271 | 372.8 MiB | 366.0 MiB | 142.8 MiB | 366.0 MiB |

Average by model:

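| Model | Avg tree RSS peak | Avg root RSS peak | Avg turns | Avg total tokens | Avg max wire body | Avg total tool result |
| --- | --- | --- | --- | --- | --- | --- |
| pai/glm-5 | 379.1 MiB | 375.3 MiB | 4.2 | 99,154 | 111.8 KiB | 335.1 KiB |
| qwen3.6-plus | 365.4 MiB | 362.4 MiB | 4.8 | 118,443 | 119.3 KiB | 344.3 KiB |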

Overlapping small PR #4268 model snapshot:

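| Model | Protocol | Wall | Turns | Total tokens | Tree RSS peak | Root RSS peak | Max wire body |
| --- | --- | --- | --- | --- | --- | --- | --- |
| pai/glm-5 | OpenAI | 20.1s | 7 | 173,216 | 362.1 MiB | 359.8 MiB | 113.8 KiB |
| qwen3.6-plus | OpenAI | 35.0s | 6 | 156,556 | 358.9 MiB | 356.9 MiB | 134.1 KiB |
| DeepSeek/deepseek-v4-pro | Anthropic | 39.7s | 2 | 43,362 | 346.9 MiB | 344.8 MiB | 103.0 KiB |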

Request And Tool Diagnostics

| Case | Model | Requests | Max wire body | Max system prompt | Max tool schema | Tool calls | Total tool result | Max tool result | Max function response in request |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| small PR #4268 | pai/glm-5 | 7 | 113.8 KiB | 51.4 KiB | 40.2 KiB | 9 | 4.7 KiB | 3.9 KiB | 15.3 KiB |
| code navigation | pai/glm-5 | 2 | 114.6 KiB | 51.5 KiB | 40.2 KiB | 3 | 17.5 KiB | 6.2 KiB | 18.4 KiB |
| diff 94.6 KiB | pai/glm-5 | 6 | 111.2 KiB | 39.1 KiB | 37.2 KiB | 9 | 94.9 KiB | 92.6 KiB | 29.2 KiB |
| diff 968.5 KiB | pai/glm-5 | 2 | 104.8 KiB | 39.1 KiB | 37.2 KiB | 2 | 772.1 KiB | 771.9 KiB | 25.6 KiB |
| diff 4.84 MiB | pai/glm-5 | 4 | 114.7 KiB | 39.1 KiB | 37.2 KiB | 4 | 786.3 KiB | 783.2 KiB | 34.7 KiB |
| small PR #4268 | qwen3.6-plus | 6 | 134.1 KiB | 51.4 KiB | 40.2 KiB | 5 | 34.6 KiB | 15.6 KiB | 36.6 KiB |
| code navigation | qwen3.6-plus | 4 | 114.9 KiB | 51.5 KiB | 40.2 KiB | 3 | 17.5 KiB | 6.2 KiB | 18.4 KiB |
| diff 94.6 KiB | qwen3.6-plus | 4 | 112.8 KiB | 39.1 KiB | 37.2 KiB | 3 | 92.9 KiB | 92.6 KiB | 33.0 KiB |
| diff 968.5 KiB | qwen3.6-plus | 6 | 113.1 KiB | 39.1 KiB | 37.2 KiB | 5 | 778.0 KiB | 771.9 KiB | 32.1 KiB |
| diff 4.84 MiB | qwen3.6-plus | 4 | 121.5 KiB | 39.1 KiB | 37.2 KiB | 4 | 798.5 KiB | 783.2 KiB | 41.3 KiB |

Observations

  1. Process-tree RSS is almost the same as root RSS in this local bundle run. The root/tree gap is usually below 10 MiB. That means these runs did not show a persistent child-process memory owner. The dominant process is the main Node process.
  2. The local bundle run peaks around 0.36-0.41 GiB, not the earlier 0.83-1.04 GiB, because the matrix used a stripped temporary config. A follow-up normal-config sanity check reproduced about 1.1 GiB tree RSS on both PATH qwen and local dist/cli.js, with the extra memory coming from child MCP/Node processes in the process tree.
  3. V8 heap is much smaller than RSS. End heap is about 99-143 MiB while end RSS is about 216-411 MiB. The remaining footprint is likely loaded modules, native allocations, external buffers, or runtime overhead outside live JS heap.
  4. Static request overhead is large and repeated. The system prompt is about 39-51 KiB per request, and tool schema is about 37-40 KiB per request. This explains why even small tasks can produce high accumulated token counts when the model takes several turns.
  5. Large diff output is capped before it reaches the model request. The 968 KiB and 4.84 MiB diff cases produced around 772-799 KiB of captured tool result, but the largest model-facing function response in a request stayed around 25-41 KiB, and max wire body stayed around 105-122 KiB. This points to truncation / saved-output handling working on the model-facing path.
  6. Memory still increases on large-output cases even though wire body remains bounded. For example, the 4.84 MiB GLM run reached 414.2 MiB tree RSS and 410.7 MiB end RSS, and the 4.84 MiB qwen3.6-plus run ended with 142.8 MiB heap. That suggests large tool output can still affect local capture, normalization, or retained runtime state even when the final request payload is capped.
  7. Model choice changed turns and token totals more than RSS in this run. qwen3.6-plus averaged more tokens and turns than pai/glm-5, but its average tree RSS peak was slightly lower. This supports the earlier conclusion that model choice is not the main explanation for process memory.
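Observation 4 can be made concrete with a rough upper-bound estimate: if every request resends the system prompt and tool schema, the static bytes scale with the request count. The sketch below uses the per-request maxima from the diagnostics table, so it overstates requests that carried a smaller prompt; the function name is hypothetical.

```javascript
// Hypothetical estimate: static context resent across all requests of a task.
// Uses max-per-request sizes, so the result is an upper bound.
function staticOverheadKiB(requests, systemPromptKiB, toolSchemaKiB) {
  return requests * (systemPromptKiB + toolSchemaKiB);
}

// Values copied from the request diagnostics for pai/glm-5.
const overheadCases = [
  { name: "small PR #4268", requests: 7, prompt: 51.4, schema: 40.2 },
  { name: "code navigation", requests: 2, prompt: 51.5, schema: 40.2 },
];

for (const c of overheadCases) {
  const total = staticOverheadKiB(c.requests, c.prompt, c.schema);
  console.log(`${c.name}: up to ${total.toFixed(1)} KiB of static context`);
}
```

For the small PR case this is up to about 641 KiB of repeated static context across 7 requests, against only 4.7 KiB of total tool results, which is consistent with the high accumulated token count on a small task.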

Updated Working Inference

The new diagnostics make the earlier hypothesis more precise:

  • The installed-CLI user-visible 1 GiB peak is now reproducible with the normal config on the local diagnostics bundle. The stripped run should be used for internal Qwen runtime attribution; the normal-config run should be used for user-visible process-tree attribution.
  • The largest observed difference between stripped and normal config is process-tree shape: normal config starts additional MCP/Node child processes. Those children explain most of the absolute jump from about 0.35-0.55 GiB to about 1.1 GiB in the minimal prompt sanity check.
  • The --bare follow-up confirms the same direction on qwen3.6-plus: normal config costs about 0.50-0.59 GiB more process-tree RSS than bare mode for the same prompt shape, while root RSS changes only slightly.
  • The temporary-settings isolation is a better attribution test than --bare: disabling MCP alone reduces process-tree RSS by about 0.42-0.47 GiB while keeping the normal settings-loading path. Disabling hooks alone does not show a meaningful RSS change in no-tool-call cases.
  • Per-MCP isolation points to chrome-devtools as the dominant MCP contributor: it is enough by itself to reproduce the high RSS band, and removing it returns the run near the no-MCP baseline.
  • Within the local Qwen runtime, the most suspicious areas are no longer "raw diff bytes sent to the model". The model-facing request body is bounded.
  • The stronger suspects are static per-request context cost, repeated request rounds, tool schema size, and local retention/capture of large tool outputs before or outside model-facing truncation.
  • Because RSS remains much higher than V8 heap, the next profiling layer should include module/startup accounting, external memory, and heap snapshots around tool execution and final response emission.
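The distinction between a bounded model-facing payload and locally retained output can be illustrated with a small sketch. This is not Qwen Code's truncation logic; `boundForModel` and its head/tail strategy are assumptions used only to show why a capped request body does not by itself cap process memory.

```javascript
// Hypothetical sketch: cap the text sent to the model, keeping head and tail.
// The returned modelText is slightly longer than maxChars because of the marker.
function boundForModel(output, maxChars) {
  if (output.length <= maxChars) {
    return { modelText: output, truncated: false, originalLength: output.length };
  }
  const half = Math.floor(maxChars / 2);
  const head = output.slice(0, half);
  const tail = output.slice(output.length - half);
  const omitted = output.length - head.length - tail.length;
  return {
    modelText: `${head}\n[... ${omitted} characters omitted ...]\n${tail}`,
    truncated: true,
    originalLength: output.length,
  };
}

// A large captured tool result, similar in scale to the big diff cases.
const captured = "x".repeat(800 * 1024);
const bounded = boundForModel(captured, 32 * 1024);
console.log(bounded.truncated, bounded.modelText.length, bounded.originalLength);

// As long as `captured` stays referenced (history, logs, UI state), the full
// string remains in memory even though the model only sees the bounded text.
```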

RSS Attribution From Current Diagnostics

The current counters do not identify an exact retained object or source file, but they do narrow what is and is not driving RSS in these local runs:

| Signal | Current evidence | RSS implication |
| --- | --- | --- |
| Root RSS vs process-tree RSS | Root and tree peaks are usually within about 2-10 MiB; DeepSeek large PR is the widest gap at about 23.6 MiB | No persistent child process explains the RSS in this local bundle run; the main Node process dominates |
| Normal config process tree | Minimal-prompt normal-config runs reach about 1.1 GiB tree RSS while root RSS stays about 213-250 MiB | User-visible 1 GiB peaks can be dominated by MCP/Node child processes rather than Qwen root RSS alone |
| --bare comparison | qwen3.6-plus normal runs peak around 1.02-1.05 GiB tree RSS; bare runs peak around 0.45-0.53 GiB | Loading normal config adds about 0.50-0.59 GiB process-tree RSS in this environment |
| Temporary MCP isolation | Clearing MCP servers drops startup/config tree RSS from 865-1,017 MiB to 443-549 MiB | MCP startup and MCP child processes explain about 0.42-0.47 GiB of process-tree RSS in the controlled config check |
| Per-MCP isolation | chrome-devtools alone reaches about 1.0 GiB in repeated samples; without it the run stays around 461 MiB | chrome-devtools is the dominant MCP process-tree RSS contributor in this environment |
| Temporary hooks isolation | disableAllHooks=true with MCP still enabled changes tree RSS by only about 13-18 MiB in no-tool-call cases | Hook config alone is not a visible startup RSS driver here; hook execution still needs a tool-call benchmark |
| V8 heap vs RSS | End heap is about 99-143 MiB while end RSS is about 216-411 MiB | Live JS heap is not the whole footprint; loaded modules, native allocations, external buffers, or runtime overhead are likely significant |
| PR/diff size vs RSS | DeepSeek small/medium/large PRs scale from 1 to 4,750 changed lines, but tree RSS stays in a narrow 340.7-360.0 MiB band | Raw PR size is not linearly driving RSS once tool output is bounded |
| Tool output size | Large diff runs capture about 772-799 KiB tool results and show some higher end RSS / heap, but RSS does not scale linearly | Tool result capture/normalization contributes pressure, especially large-output cases, but is unlikely to be the only RSS driver |
| Request body size | Max model-facing body ranges from about 103-289 KiB while RSS stays near the same band | Request serialization size affects tokens and latency more clearly than RSS peak |
| Static per-request context | System prompt is about 39-51 KiB and tool schema about 37-48 KiB per request | Repeated rounds are a token/cost amplifier; this alone does not explain RSS but is a likely optimization target for token pressure |

Working attribution: in the stripped local bundle benchmark, the RSS floor looks mostly like task-time runtime/module/native footprint, with large tool output adding incremental pressure. In the normal-config run, the user-visible 1 GiB tree peak is mostly process-tree composition: Qwen root plus MCP/Node child processes. The next targeted measurement should split Qwen root diagnostics from configured MCP server diagnostics, then add startup/module/external-memory checkpoints inside the Qwen root process.

Progress Snapshot

Current confirmed signals:

  1. The user-visible 1 GiB startup/config peak is reproducible with both the installed CLI and the local diagnostics bundle when the normal config is loaded. It is not primarily explained by the diagnostics branch or PR #4186.
  2. In this environment, that 1 GiB peak is mostly process-tree composition: Qwen root process plus relaunch child process plus MCP child processes.
  3. chrome-devtools is the dominant configured MCP contributor in the current config. It is enough by itself to reproduce the high process-tree RSS band, even when the prompt does not explicitly use that MCP.
  4. The no-MCP normal relaunch shape still sits around 0.45 GiB process-tree RSS. A single Qwen runtime process without the relaunch parent is closer to 0.22-0.24 GiB in the startup attribution check. This means the 0.45 GiB baseline is not a single-process root RSS number.
  5. In stripped non-interactive task runs, model choice changes turns, token totals, latency, and request sizes more clearly than RSS. RSS stayed in a relatively narrow range across pai/glm-5, qwen3.6-plus, and DeepSeek/deepseek-v4-pro.
  6. Current short-task diagnostics show model-facing tool/function responses are bounded, but local tool-result capture and runtime state can still increase heap/RSS on large-output cases. This keeps large-output retention on the investigation path.

Current gaps:

  1. The short-task benchmark matrix is still short-lived. A later interactive long-review run did reproduce a 41.9 min failure, but it is still one sample and needs repeat runs plus heap/object attribution.
  2. The current counters are enough to attribute process-tree RSS and request size, but not enough to name the retained JS object graph during long sessions.
  3. Startup/config RSS and long-session OOM must remain separate tracks. MCP and relaunch explain a large idle/startup RSS band; they do not by themselves explain V8 heap OOM after long tasks.
  4. Interactive TUI memory still needs a separate run from non-interactive mode, because UI history and Ink static output are not exercised the same way.

Long-Task OOM Evidence From Issues And PRs

Issue/PR evidence points to several different OOM shapes, not one single failure mode:

| Source | Evidence summary | Hypothesis to test |
| --- | --- | --- |
| #4309 | User reports 5.84 GiB memory usage / 7.02 GiB warning with YOLO mode and DeepSeek backend; increasing Node memory to 8 GiB did not remove the symptom | Long autonomous tool loops can retain enough state that simply raising old-space limit is not a root fix |
| #4149 | Multiple reports show Ineffective mark-compacts near heap limit, including 4 GiB and much larger heap-limit cases | A large fraction of heap is reachable application state, not immediately collectible garbage |
| #4116 | OOM occurred while context display was around 9.5%; analysis points to structuredClone, UI history, Ink static tree, and large context windows | Token usage can be low while JS heap pressure is high; token threshold alone is not a reliable memory guard |
| #4167 | User says the crash happened while compressing; analysis identifies compression peak memory as a distinct shape | Compression can itself create a peak when heap is already high, especially if history is cloned/stringified around the same time |
| #2128 | Report identifies unbounded UI history, retained file diffs / terminal output, string-width caches, and checkpoint serialization | Interactive TUI long sessions may retain memory outside model history and outside non-interactive benchmarks |
| #2562 | Report focuses on GeminiChat.getHistory() deep-cloning full history in long sessions | Full-history cloning can amplify memory peaks and should be measured separately from retained steady-state size |
| #4185 | Tracks V8 heap pressure exceeding limit before token-based compaction runs | Heap-pressure guard is necessary, but it only mitigates symptoms if retained data remains large |
| #4184 | Proposes diagnostics and offload/preview for large retained tool results | Large tool output may be bounded for model requests while still retained in local hot memory |
| #4186 | Merged heap-pressure auto-compaction safety net and O(1) last-history access for nextSpeakerChecker | Covers part of heap-pressure and clone amplification, but does not claim to solve all OOM classes |
| #4127, #4168 | Open compaction-threshold PRs; one uses fixed heap thresholds, the other redesigns token thresholds and compression behavior | Useful related work, but long-task testing must verify whether heap, token, and compression signals line up in real runs |
| #3000, #4183 | Diagnostic roadmap calls out /doctor memory, heap snapshot, and bounded memory timeline | Snapshot/timeline support is needed to move from RSS attribution to retained-object attribution |

Initial interpretation:

  • Unused configured MCP can consume memory because normal startup connects to configured MCP servers and advertises their tools before the task needs them. In the measured config, chrome-devtools starts extra Node/npm MCP processes and also increases the advertised tool count from 19 to 48. This explains a large startup/config RSS band and can also increase repeated request overhead.
  • The long-session OOM reports are a different layer. GC logs where Mark-Compact frees very little memory suggest the heap is full of reachable state. The strongest candidates are retained history/tool/UI objects, full-history clones, compression intermediates, and streaming/logging accumulators.
  • PR #4186 is a useful mitigation because it can compact based on heap pressure before token thresholds trigger, and it removes one unnecessary full-history clone. It should not be treated as proof that large tool-output retention, UI history retention, or compression peak memory is already solved.

Long-Task Validation Plan

The next benchmark should keep two tracks separate:

  1. Startup/config attribution: normal config vs MCP-disabled vs chrome-devtools-only vs no-relaunch attribution. This explains what users see before meaningful work begins.
  2. Long-task runtime growth: repeated tool calls, large outputs, compression, resume, and interactive UI history. This explains OOM after real work.

Recommended long-task cases:

| Case | Shape | Why it matters |
| --- | --- | --- |
| Long PR review loop | Repeat medium/large PR review prompts for 30, 60, and 120 minutes, with fixed model and fixed config | Closest to reported agent workflows; captures turns, tool calls, token growth, and RSS/heap trend |
| Large tool-output retention | Repeatedly produce bounded 1 MiB / 5 MiB / 20 MiB command outputs, then ask follow-up questions | Tests whether raw output is retained locally after model-facing truncation |
| Compression pressure | Use a lower controlled old-space limit and large-context prompts to trigger heap-pressure compaction | Verifies PR #4186 triggers before OOM and whether compression itself creates a new peak |
| Interactive TUI history | Run the same long loop in tmux TUI mode and compare with non-interactive mode | Isolates UI history, Ink static output, rendered diffs, and terminal-output display retention |
| Resume stress | Resume a large saved session and immediately continue work | Targets /resume OOM reports and session reconstruction cost |
| Streaming/logging accumulator | Force long streamed responses with telemetry/logging enabled vs disabled | Tests the suspected collected responses / logging-retention path from issue analysis |
| MCP idle vs MCP active | Run no-MCP, chrome-devtools configured-but-unused, and chrome-devtools actively used variants | Separates idle MCP child RSS from actual MCP tool execution and tool schema/token overhead |

Metrics that should be recorded per turn or per sampling interval:

  • Root RSS current/peak and process-tree RSS current/peak.
  • Child process count and top child command shapes.
  • V8 heapUsed, heapTotal, heap_size_limit, external, and arrayBuffers.
  • Turn count, request count, tool-call count, and tool-call rounds.
  • Input/output/cache/total tokens by request and by whole task.
  • Request body bytes, system prompt bytes, tool schema bytes, and function response bytes.
  • Tool-result count, total captured tool-result bytes, max tool-result bytes, and retained tool-result bytes if available.
  • Conversation history message count and approximate history byte size.
  • Interactive-only UI history item count and approximate retained display size.
  • Compression attempts, compression trigger reason, tokens before/after, heap pressure before/after, and compression failure status.
  • Heap snapshot or bounded memory timeline artifacts when heap pressure crosses a configured threshold.
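For the root-process memory fields in this list, Node's built-in `process.memoryUsage()` already exposes `rss`, `heapUsed`, `heapTotal`, `external`, and `arrayBuffers`. The sketch below shows one possible sampling shape; the function names and record layout are assumptions, and it does not cover process-tree RSS or `heap_size_limit`, which need an external sampler and `v8.getHeapStatistics()` respectively.

```javascript
// Hypothetical per-interval sample of the current (root) Node process.
function sampleMemory(label) {
  const usage = process.memoryUsage();
  const toMiB = (bytes) => +(bytes / (1024 * 1024)).toFixed(1);
  return {
    label,
    timestamp: new Date().toISOString(),
    rssMiB: toMiB(usage.rss),
    heapUsedMiB: toMiB(usage.heapUsed),
    heapTotalMiB: toMiB(usage.heapTotal),
    externalMiB: toMiB(usage.external),
    arrayBuffersMiB: toMiB(usage.arrayBuffers),
  };
}

// Calls sink(sample) every intervalMs and returns a function that stops it.
function startSampler(intervalMs, sink) {
  const timer = setInterval(() => sink(sampleMemory("interval")), intervalMs);
  timer.unref(); // do not keep the process alive only for sampling
  return () => clearInterval(timer);
}

console.log(JSON.stringify(sampleMemory("startup")));
```

A per-turn variant would call `sampleMemory` with the turn number as the label, so that each record can be joined with the request and tool-result counters listed above.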

Validation criteria:

  1. Repeat at least the key long-task cases twice. Startup RSS has visible variance, so single-run conclusions should be avoided.
  2. Report root RSS and process-tree RSS separately. User-facing memory pressure can come from child processes, while V8 OOM comes from the Qwen root heap.
  3. Treat a flat RSS line as important evidence. If tokens and tool calls grow but heap/RSS stays flat, the issue is likely elsewhere.
  4. When RSS or heap grows, correlate the growth with a specific signal: tool-result bytes, history bytes, UI history count, compression event, streaming accumulator size, or MCP process start.
  5. If a heap snapshot is taken, write a structured diagnostics JSON first, then the snapshot. Heap snapshots may be large and can contain sensitive strings, so they should remain opt-in and local.

Interactive Long-Review Reproduction

After the short non-interactive prompts kept finishing before the target window, an interactive TUI benchmark was run with remote input. The CLI process stayed alive in one session while a controller submitted one real PR-review turn at a time. The next turn was only submitted after the assistant emitted that turn's completion marker. This avoids treating a short one-shot prompt as a long-task reproduction.

Setup:

  • Installed Qwen Code 0.15.11, model qwen-latest-series-invite-beta-v28.
  • Temporary CLI home derived from the normal settings, with MCP and hook config removed. No global config was modified.
  • Interactive TUI mode with dual JSON event output and remote JSONL input.
  • Static PR review only. The prompt disallowed dependency install, build, test, Playwright, Docker, and other long external build commands.
  • External RSS samplers recorded both process-tree RSS and the Qwen Node root RSS every 5 seconds.

Outcome:

| Signal | Value |
| --- | --- |
| Wall time before exit | 41.9 min |
| Exit status | 1 |
| Completed PR-review turns | 6 |
| Main chat records | 1,076 |
| API response telemetry | 335 |
| Tool-call telemetry | 607 |
| MCP tool-call telemetry | 0 |
| Main/root API responses | 36 |
| Subagent API responses | 299 |
| Root total tokens | 2.08M |
| Subagent total tokens | 17.24M |
| Total API telemetry tokens | 19.32M |
| Max root input tokens | 85,655 |
| Max subagent input tokens | 215,207 |
| /usr/bin/time -l max RSS | 1,072.4 MiB |
| Sampled Qwen root RSS peak | 1,028.2 MiB |
| Sampled process-tree RSS peak | 1,038.1 MiB |

The process exited with:

```text
libc++abi: terminating due to uncaught exception of type std::__1::system_error: thread constructor failed: Resource temporarily unavailable
```

This is a thread-creation failure, not a V8 heap OOM. The failure mechanism is distinct: the OS refused to create a new thread ("Resource temporarily unavailable"), likely because a thread or process count limit (such as RLIMIT_NPROC) was reached or because there was not enough memory for a new thread stack. It is still relevant because it occurred in a disabled-MCP, no-build/test, interactive long-session review where the Qwen Node process itself crossed about 1 GiB RSS. The failure happened during the final summary phase, after the controller had already completed six review turns.

Turn timeline and sampled Qwen root RSS:

| Window | Turn state | Qwen root RSS max | Qwen root RSS at window end |
| --- | --- | --- | --- |
| 0.0-9.0 min | turn 1 completed | 701.2 MiB | 255.3 MiB |
| 9.0-15.1 min | turn 2 completed | 503.2 MiB | 494.4 MiB |
| 15.1-24.1 min | turn 3 completed | 468.7 MiB | 457.5 MiB |
| 24.1-31.9 min | turn 4 completed | 619.3 MiB | 602.3 MiB |
| 31.9-40.3 min | turn 5 completed | 955.5 MiB | 955.5 MiB |
| 40.3-40.4 min | turn 6 completed | 988.6 MiB | 988.6 MiB |
| 40.4-41.9 min | final summary / exit | 1,028.2 MiB | 1,028.2 MiB |

Token and tool distribution:

| Owner | API responses | Input tokens | Output tokens | Total tokens | Max input |
| --- | --- | --- | --- | --- | --- |
| Root session | 36 | 2.06M | 22.2K | 2.08M | 85,655 |
| Subagents | 299 | 17.08M | 154.6K | 17.24M | 215,207 |

Tool-call telemetry by function:

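| Tool | Calls | Captured content length |
| --- | --- | --- |
| read_file | 271 | 1.46 MB |
| run_shell_command | 181 | 164.4 KB |
| web_fetch | 80 | 846.3 KB |
| grep_search | 25 | 15.0 KB |
| glob | 15 | 27.8 KB |
| todo_write | 16 | 16.1 KB |
| list_directory | 8 | 6.2 KB |
| agent | 10 | 0 |
| tool_search | 1 | 2.1 KB |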

The top visible TUI token counter for a single agent reached about 3.83M tokens. Telemetry also shows the heaviest subagent at about 4.05M total tokens with a 215K-token max input request. That makes subagent amplification the dominant signal in this reproduction.
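The size of the subagent amplification can be quantified from the telemetry totals above. The following sketch computes the subagent share of API responses and tokens; the object layout is illustrative, and the token figures are the rounded values reported in this section.

```javascript
// Telemetry totals copied from the token distribution in this section.
const telemetryTotals = {
  root: { responses: 36, totalTokens: 2.08e6 },
  subagents: { responses: 299, totalTokens: 17.24e6 },
};

// Fraction of a metric attributable to subagents.
function subagentShare(totals, field) {
  const whole = totals.root[field] + totals.subagents[field];
  return totals.subagents[field] / whole;
}

const tokenShare = subagentShare(telemetryTotals, "totalTokens");
const responseShare = subagentShare(telemetryTotals, "responses");
console.log(`subagent token share: ${(tokenShare * 100).toFixed(1)}%`);
console.log(`subagent response share: ${(responseShare * 100).toFixed(1)}%`);
```

Subagents account for roughly 89% of both API responses and total tokens in this run, which is why the root session's 85,655-token max input understates the overall load.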

Interpretation:

  1. This run separates long-session growth from MCP startup/config memory. MCP was disabled and there were no MCP tool calls, yet the Qwen root process still reached about 1 GiB RSS.
  2. The late memory peak aligns with subagent-heavy review turns and final summary/merge-back, not with external build/test child processes.
  3. The RSS curve is not a simple linear leak. It falls after early turns, then rises sharply after later subagent turns and remains high near exit.
  4. The failure mode is native resource exhaustion rather than a V8 heap-limit stack, so the next run should add heap/external/arrayBuffer/thread-count sampling. RSS alone cannot distinguish JS heap from native allocations or thread-resource pressure.
  5. The strongest code paths to inspect remain subagent transcript retention, agent-result merge-back, full-history cloning, checkpoint/session recording, and final summary/history assembly.
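
The extra sampling suggested in item 4 could start from Node's built-in `process.memoryUsage()`, which reports RSS, V8 heap, external, and ArrayBuffer bytes separately. The following is a minimal sketch under that assumption; `sampleMemory` and `startSampler` are hypothetical names, not part of the diagnostics branch. Node has no built-in API for the OS thread count of the process, so that signal would need a platform-specific probe and is left out here.

```javascript
// Hypothetical sampler: splits process memory into the buckets that RSS alone hides.
const MIB = 1024 * 1024;

function sampleMemory() {
  const m = process.memoryUsage();
  return {
    timestampMs: Date.now(),
    rssMiB: m.rss / MIB,
    heapUsedMiB: m.heapUsed / MIB,
    heapTotalMiB: m.heapTotal / MIB,
    externalMiB: m.external / MIB,
    arrayBuffersMiB: m.arrayBuffers / MIB,
  };
}

// Collects samples on an interval; the returned function stops sampling
// and hands back everything collected, including one final sample.
function startSampler(intervalMs = 1000) {
  const samples = [sampleMemory()];
  const timer = setInterval(() => samples.push(sampleMemory()), intervalMs);
  timer.unref(); // do not keep the process alive just for sampling
  return () => {
    clearInterval(timer);
    samples.push(sampleMemory());
    return samples;
  };
}
```

A run that shows RSS rising while `heapUsedMiB` stays flat would point toward native or external allocations rather than retained JS objects.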

Deterministic Huge-Task Clone-Pressure Reproduction

A deterministic stress harness was added as scripts/memory-pressure-repro.mjs. It does not call a model. Instead, it constructs a Qwen-like long-session object graph with root review turns, subagent transcripts, large tool results, checkpoint JSON, and retained structuredClone() copies. This gives a repeatable reproduction for the clone and checkpoint peak suspected from the user-provided OOM stack.
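
The general shape of such a harness can be illustrated with a much smaller sketch. This is not the contents of scripts/memory-pressure-repro.mjs; the function names, field names, and sizes below are assumptions chosen for illustration only.

```javascript
// Illustrative only: builds a small Qwen-like history and retains full clones of it.
function buildHistory({ turns, toolResultBytes, subagents, subagentTurns }) {
  const history = [];
  for (let t = 0; t < turns; t++) {
    const transcripts = [];
    for (let s = 0; s < subagents; s++) {
      const transcript = [];
      for (let k = 0; k < subagentTurns; k++) {
        // Each subagent turn carries one large synthetic tool result.
        transcript.push({ role: 'tool', text: 'x'.repeat(toolResultBytes) });
      }
      transcripts.push(transcript);
    }
    history.push({ role: 'user', text: `review turn ${t}` });
    history.push({ role: 'model', text: `summary ${t}`, subagentTranscripts: transcripts });
  }
  return history;
}

function runClonePressure(options, cloneCount) {
  const history = buildHistory(options);
  const retained = [];
  for (let i = 0; i < cloneCount; i++) {
    retained.push(structuredClone(history)); // every copy stays reachable
  }
  return {
    historyJsonBytes: Buffer.byteLength(JSON.stringify(history)),
    retainedClones: retained.length,
    rssMiB: process.memoryUsage().rss / (1024 * 1024),
  };
}

console.log(
  runClonePressure({ turns: 4, toolResultBytes: 2048, subagents: 2, subagentTurns: 3 }, 2),
);
```

Because no model is called, the only variables are the history shape and the number of retained copies, which is what makes this kind of run repeatable.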

The harness has a lightweight script test:

```bash
npx vitest run --config ./scripts/tests/vitest.config.ts \
  scripts/tests/memory-pressure-repro.test.js
```

Result: passed, 1 test.

Controlled runs used node --max-old-space-size=256 unless otherwise noted.

| Case | History shape | Clone/checkpoint pressure | Result | Max RSS |
| --- | --- | --- | --- | --- |
| Small sanity | 2 turns, 2 KiB tool result, 1 subagent | 1 clone + 1 checkpoint | passed; 2.6 MiB history JSON | 89.7 MiB |
| Huge build only | 12 turns, 256 KiB tool result, 2 subagents x 12 subagent turns | no retained clone/checkpoint | passed; 76.2 MiB history JSON | 491.5 MiB |
| Huge + 1 clone | same as above | 1 retained structuredClone() | passed | 569.6 MiB |
| Huge + 2 clones | same as above | 2 retained structuredClone() copies | OOM, exit 134 | 496.5 MiB |
| Huge + 1 checkpoint | same as above | one checkpoint with original + cloned history JSON | passed; 152.5 MiB checkpoint JSON | 926.9 MiB |
| Huge + 2 checkpoints | same as above | two checkpoint copies | OOM, exit 134 | 920.1 MiB |
| Huge + 2 clones, no retained subagent transcripts | same generated subagent output, but parent history keeps only summaries | | passed; parent history JSON drops to 3.8 MiB | 136.8 MiB |

The failing huge-clone run produced:

```text
FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
```

The native stack included:

  • v8::internal::ValueDeserializer::ReadObjectInternal
  • v8::internal::ValueDeserializer::ReadDenseJSArray
  • node::worker::Message::Deserialize
  • node::worker::StructuredClone

This is the same stack family as in the user-provided OOM log. The controlled reproduction also shows why 4 GiB / 8 GiB user reports are plausible: the failure is not caused by a single large object, but by large retained history/tool-result/subagent state plus one or more full-history clone or checkpoint copies. Raising --max-old-space-size can delay the crash, but the amplification pattern stays the same.

Important attribution from this deterministic run:

  1. Building a 76.2 MiB parent history JSON can succeed under the reduced heap. The OOM appears when additional full-history clone/checkpoint copies are retained.
  2. A single checkpoint copy can push RSS close to 1 GiB even before OOM.
  3. Removing retained subagent transcripts from the parent hot history changes the same generated workload from OOM to a small 136.8 MiB RSS run. That is the clearest mitigation signal so far.
  4. This reproducer is synthetic and intentionally adversarial, but it exercises the same object-graph shape as the long interactive review: parent session, subagents, large tool outputs, transcript merge-back, and full-history clone pressure.
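
The mitigation signal in item 3 can be illustrated by comparing two parent-history shapes built from the same generated subagent output. This is a minimal sketch, not code from the harness; every function name and size here is hypothetical.

```javascript
// Generates one synthetic subagent transcript with large tool results.
function makeSubagentTranscript(turns, toolResultBytes) {
  return Array.from({ length: turns }, (_, i) => ({
    role: 'tool',
    name: `step-${i}`,
    text: 'x'.repeat(toolResultBytes),
  }));
}

// Reduces a transcript to a turn count plus a short excerpt of the last result.
function summarize(transcript, maxChars = 200) {
  const last = transcript[transcript.length - 1];
  return { turns: transcript.length, summary: last ? last.text.slice(0, maxChars) : '' };
}

// Shape A: the parent history keeps every subagent transcript in full.
function parentHistoryWithTranscripts(transcripts) {
  return transcripts.map((t, i) => ({ role: 'model', agent: i, transcript: t }));
}

// Shape B: the parent history keeps only a bounded summary per subagent.
function parentHistoryWithSummaries(transcripts) {
  return transcripts.map((t, i) => ({ role: 'model', agent: i, ...summarize(t) }));
}

const transcripts = [makeSubagentTranscript(12, 4096), makeSubagentTranscript(12, 4096)];
const fullChars = JSON.stringify(parentHistoryWithTranscripts(transcripts)).length;
const slimChars = JSON.stringify(parentHistoryWithSummaries(transcripts)).length;
console.log({ fullChars, slimChars });
```

Any later clone or checkpoint of the parent history multiplies whichever of the two sizes was retained, which is why the summary-only shape survives the same clone count.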

DeepSeek PR-Size Follow-Up

After the initial model matrix, an additional Qwen Code-only run tested DeepSeek/deepseek-v4-pro across three real PR sizes. This model is configured through the Anthropic-compatible protocol; OpenAI-compatible execution returned 404 in a smoke check, so the successful benchmark uses --auth-type anthropic.

The diagnostics branch was extended to record Anthropic wire request summaries with the same privacy rule as the OpenAI path: aggregate counts and byte sizes only, no prompt text, diff content, tool arguments, headers, base URL, or API key.

PR sizes:

| Size | PR | State | Files | Changed lines | Title |
| --- | --- | --- | --- | --- | --- |
| small | #4268 | merged | 1 | 1 | fix(serve): add mcp_guardrails to E2E capabilities expectation |
| medium | #4186 | merged | 6 | 494 | fix(core): add heap-pressure auto-compaction safety net |
| large | #4168 | open | 25 | 4,750 | feat(core)!: redesign auto-compaction thresholds with three-tier ladder |

Runtime:

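| Size | PR | Wall | Turns | Total tokens | Cache-read tokens | Tree RSS peak | Root RSS peak | End heap | End RSS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| small | #4268 | 39.7s | 2 | 43,362 | 28,672 | 346.9 MiB | 344.8 MiB | 115.2 MiB | 304.3 MiB |
| medium | #4186 | 142.6s | 4 | 135,120 | 115,840 | 340.7 MiB | 337.3 MiB | 103.5 MiB | 285.6 MiB |
| large | #4168 | 191.1s | 8 | 386,891 | 332,928 | 360.0 MiB | 336.3 MiB | 119.3 MiB | 237.9 MiB |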

Request and tool diagnostics:

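| Size | PR | Requests | Anthropic wire requests | Max Anthropic body | Max system | Max tool schema | Tool calls | Total tool result | Max tool result | Max function response in request |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| small | #4268 | 2 | 2 | 103.0 KiB | 50.8 KiB | 47.6 KiB | 3 | 0.6 KiB | 0.5 KiB | 1.1 KiB |
| medium | #4186 | 4 | 4 | 159.8 KiB | 50.8 KiB | 47.6 KiB | 5 | 30.2 KiB | 29.3 KiB | 56.7 KiB |
| large | #4168 | 8 | 8 | 289.5 KiB | 50.8 KiB | 47.6 KiB | 11 | 235.0 KiB | 232.1 KiB | 182.4 KiB |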

DeepSeek observations:

  1. PR size scaled turns, tokens, Anthropic wire body size, and tool result size clearly, but did not scale RSS proportionally. The small/medium/large tree RSS peaks stayed in a narrow 340.7-360.0 MiB band.
  2. The large PR was expensive mostly in model rounds and token volume: 8 requests and 386,891 total tokens. Its max Anthropic body was 289.5 KiB, much larger than the OpenAI-compatible runs, but RSS still stayed near the same local-bundle band.
  3. The static Anthropic request cost is also visible: system prompt is about 50.8 KiB and tool schema about 47.6 KiB per request. Repeated rounds are therefore a major token amplifier.
  4. The large PR produced 235.0 KiB of captured tool results and 182.4 KiB max function response in a request. This is higher than the earlier small PR / code-navigation cases and shows large PRs still put pressure on local tool-result handling and request assembly, even when RSS does not spike.
  5. The DeepSeek run reinforces the model-choice conclusion: provider/model choice strongly changes turns, latency, token volume, and wire payload shape, but the local bundle RSS peak remains dominated by Qwen Code runtime shape rather than scaling linearly with PR size.
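
The static-cost point in item 3 can be made concrete with a short calculation over the request diagnostics table. This is a worked sketch rather than part of the diagnostics output: it assumes the system prompt and tool schema are resent in full on every request, that both are included in the largest body, and it counts wire bytes rather than tokens.

```javascript
// Values copied from the request diagnostics table (KiB).
const STATIC_KIB = 50.8 + 47.6; // system prompt + tool schema per request
const runs = [
  { size: 'small', requests: 2, maxBodyKiB: 103.0 },
  { size: 'medium', requests: 4, maxBodyKiB: 159.8 },
  { size: 'large', requests: 8, maxBodyKiB: 289.5 },
];

function staticCost(run) {
  return {
    size: run.size,
    // Static bytes resent across all requests of the run.
    totalStaticKiB: (STATIC_KIB * run.requests).toFixed(1),
    // How much of the largest request body is static content.
    staticShareOfMaxBodyPct: ((STATIC_KIB / run.maxBodyKiB) * 100).toFixed(1),
  };
}

for (const run of runs) console.log(staticCost(run));
// small:  196.8 KiB static in total, 95.5% of the largest body
// medium: 393.6 KiB static in total, 61.6% of the largest body
// large:  787.2 KiB static in total, 34.0% of the largest body
```

Under these assumptions the static share shrinks as tool results accumulate, but the absolute static volume grows linearly with the number of rounds.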

Long-Review JSONL Replay: History Clone Pressure

A recent long PR-review chat record was analyzed as a post-mortem shape for the reported OOM class. The raw JSONL is not included here because it contains prompt and tool output text. The aggregate shape is:

| Signal | Value |
| --- | --- |
| Duration | 87.0 min |
| Qwen Code version | 0.15.10 |
| Model | qwen-latest-series beta model |
| API responses | 380 |
| Tool-call telemetry | 507 events |
| MCP tool-call telemetry | 4 events |
| Subagent API responses | 313 |
| Root API responses | 67 |
| Root prompt growth | 38,622 -> 168,555 tokens |
| Max prompt tokens | 168,555 |
| Total response tokens | 31.28M |

This shape does not support MCP as the primary OOM cause for this case. Only 4 of 507 tool-call telemetry events were MCP, and all four recorded content_length=0. The dominant shape is long-session/subagent amplification: 15 agent calls produced 313 subagent API responses and 403 subagent tool-call events.
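
The ratios behind that reading can be derived directly from the counts quoted above. The snippet below is only a worked calculation; the field names are chosen for illustration and do not mirror the telemetry schema.

```javascript
// Counts copied from the aggregate JSONL shape and the paragraph above.
const shape = {
  apiResponses: 380,
  subagentApiResponses: 313,
  toolCallEvents: 507,
  mcpToolCallEvents: 4,
  agentCalls: 15,
  subagentToolCallEvents: 403,
};

function amplification(s) {
  return {
    subagentResponseSharePct: ((s.subagentApiResponses / s.apiResponses) * 100).toFixed(1),
    mcpToolCallSharePct: ((s.mcpToolCallEvents / s.toolCallEvents) * 100).toFixed(1),
    responsesPerAgentCall: (s.subagentApiResponses / s.agentCalls).toFixed(1),
    toolCallsPerAgentCall: (s.subagentToolCallEvents / s.agentCalls).toFixed(1),
  };
}

console.log(amplification(shape));
// { subagentResponseSharePct: '82.4', mcpToolCallSharePct: '0.8',
//   responsesPerAgentCall: '20.9', toolCallsPerAgentCall: '26.9' }
```

Each agent call therefore fans out into roughly 21 model responses and 27 tool-call events on average, while MCP accounts for under 1% of tool-call events.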

The replay then rebuilt the chat Content[] message shape from the JSONL and ran controlled clone/stringify pressure tests. The base retained message payload is small, so it is not itself enough to OOM:

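| Replay scale | Retained clones | History JSON | Checkpoint JSON | End heap | End RSS |
| --- | --- | --- | --- | --- | --- |
| 1x | 8 | 0.54 MB | 1.08 MB | 18.0 MB | 88.8 MB |
| 30x | 8 | 14.46 MB | 28.92 MB | 260.0 MB | 577.8 MB |
| 60x | 8 | 28.86 MB | 57.71 MB | 510.3 MB | 960.8 MB |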

The scaled replay is not a user-data claim; it is a controlled amplification of the observed JSONL shape to test whether full-history clone and checkpoint serialization can create the same failure mode as the reports.

A low-heap reproduction with --max-old-space-size=256 confirms the mechanism:

| Case | History JSON | Result |
| --- | --- | --- |
| Build history only | 38.4 MB | Succeeded; heap 131.6 MB, RSS 378.2 MB |
| Build + one clone | 38.4 MB | Succeeded; heap 183.3 MB, RSS 463.4 MB |
| Build + repeated clones | 38.4 MB | OOM after several retained structuredClone() copies |
| Checkpoint double-history | 38.4 MB | OOM while holding history plus cloned client history |

The repeated-clone OOM stack contains ValueDeserializer::ReadObjectInternal, ValueDeserializer::ReadDenseJSArray, node::worker::Message::Deserialize, and node::worker::StructuredClone, matching the same stack family seen in the user-provided OOM log. This proves that full-history structuredClone() can be the immediate OOM trigger without any MCP server involvement.

Current working hypothesis for this JSONL class:

  1. MCP can explain normal-config startup RSS in separate benchmarks, but it is not the likely trigger for this long-review OOM shape.
  2. Long task growth comes from retained chat history, large tool outputs, subagent histories, observable agent messages, and UI/tool-result state.
  3. The immediate OOM trigger can be a full-history clone or checkpoint-style double serialization after the heap is already high.
  4. Compression can mitigate retained history, but compression itself may create a temporary peak if it first clones or serializes large history.
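
The checkpoint-style double serialization in item 3 can be sketched as follows. The checkpoint layout (`history` plus `clientHistory`) is an assumption made for illustration and is not taken from the Qwen Code checkpoint format; `buildCheckpointJson` and `makeHistory` are hypothetical helpers.

```javascript
// Hypothetical checkpoint shape: original history plus a cloned "client history".
function buildCheckpointJson(history) {
  const clientHistory = structuredClone(history); // second full copy in memory
  return JSON.stringify({ history, clientHistory }); // third representation: the string
}

// Generates a synthetic chat history with fixed-size text parts.
function makeHistory(messages, charsPerMessage) {
  return Array.from({ length: messages }, (_, i) => ({
    role: i % 2 === 0 ? 'user' : 'model',
    parts: [{ text: 'x'.repeat(charsPerMessage) }],
  }));
}

const history = makeHistory(200, 5000);
const historyJsonChars = JSON.stringify(history).length;
const checkpointJsonChars = buildCheckpointJson(history).length;
console.log({
  historyJsonChars,
  checkpointJsonChars,
  ratio: (checkpointJsonChars / historyJsonChars).toFixed(2), // close to 2
});
```

While the checkpoint string is being produced, the original history, the clone, and the growing JSON string are all live at once, which is the temporary peak the hypothesis describes.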

Local Mitigation Validation: Disabled-MCP PR Review Case

Three targeted mitigations were applied locally and validated before rerunning a disabled-MCP PR review case:

  1. checkNextSpeaker() now reads only the last curated message with getHistoryTail(1, true) and sends only that message to the next-speaker side query. The next-speaker prompt only asks about the immediately previous model response, so sending full history was unnecessary clone and token pressure.
  2. AgentToolInvocation no longer retains full responseParts arrays inside the live task_execution.toolCalls display. The real response parts still flow through transcript/history paths, but the parent UI display now keeps only a bounded text summary for nested tool-result streaming instead of holding another full copy of large subagent tool outputs during long runs.
  3. GeminiChat.sendMessageStream() now builds model request contents through an internal curated-history view instead of calling public getHistory(true). Public getHistory() still returns a defensive structuredClone() for external callers, but the request hot path no longer deep-clones the whole retained chat history before every model call.
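
The difference between a defensive full clone, a tail read, and an internal view (items 1 and 3) can be sketched with a simplified stand-in class. `SketchChat` is hypothetical and is not the GeminiChat implementation; the `(count, curated)` parameter order only mirrors the `getHistoryTail(1, true)` call quoted above, and the meaning of "curated" here (dropping messages without parts) is an assumption.

```javascript
// Simplified stand-in for a chat class; NOT the real GeminiChat implementation.
class SketchChat {
  constructor() {
    this.history = [];
  }

  addMessage(message) {
    this.history.push(message);
  }

  // Assumed meaning of "curated": drop messages that have no parts.
  curatedView() {
    return this.history.filter((m) => Array.isArray(m.parts) && m.parts.length > 0);
  }

  // Public accessor: defensive deep copy of everything (expensive on long sessions).
  getHistory(curated = false) {
    return structuredClone(curated ? this.curatedView() : this.history);
  }

  // Clones only the last `count` messages instead of the whole history.
  getHistoryTail(count, curated = false) {
    if (count <= 0) return [];
    const source = curated ? this.curatedView() : this.history;
    return structuredClone(source.slice(-count));
  }

  // Request hot path: shares message references and allocates only a new outer array.
  buildRequestContents(userMessage) {
    return [...this.curatedView(), userMessage];
  }
}

const chat = new SketchChat();
chat.addMessage({ role: 'user', parts: [{ text: 'review this PR' }] });
chat.addMessage({ role: 'model', parts: [{ text: 'Done. Anything else?' }] });
console.log(chat.getHistoryTail(1, true)); // only the last curated message
```

The trade-off in this sketch is that the internal view must never be mutated by the request path, which is why the public accessor still returns a defensive copy.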

TDD checks added for these mitigations:

| Test | Expected protection |
| --- | --- |
| checkNextSpeaker > should send only the last curated model message to the side query | Prevents full-history clone/send in next-speaker checks |
| AgentTool > should not retain responseParts in live tool call display after TOOL_RESULT | Prevents live subagent display from retaining large tool responses |
| AgentTool > should keep only a bounded result summary in live tool call display | Preserves nested result readability without retaining the full response body |
| GeminiChat > sendMessageStream > does not deep-clone the full curated history when building request contents | Prevents request setup from hitting the ValueDeserializer / StructuredClone OOM path |
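
The bounded-summary behaviour that the two AgentTool checks protect can be sketched as below. This is an illustrative assumption about the shape of the display entry; `boundedResultSummary`, `toDisplayEntry`, the field names, and the 500-character limit are hypothetical and are not taken from AgentToolInvocation.

```javascript
// Hypothetical helper: keeps a short text summary instead of the full responseParts.
function boundedResultSummary(responseParts, maxChars = 500) {
  const text = responseParts
    .map((part) => (typeof part.text === 'string' ? part.text : ''))
    .join('');
  if (text.length <= maxChars) return text;
  return `${text.slice(0, maxChars)}... [${text.length - maxChars} more chars]`;
}

// Display entry after a tool result arrives: summary only, no reference to responseParts.
function toDisplayEntry(toolCall, responseParts) {
  return {
    callId: toolCall.callId,
    name: toolCall.name,
    status: 'success',
    resultSummary: boundedResultSummary(responseParts),
  };
}

const entry = toDisplayEntry(
  { callId: 'call-1', name: 'read_file' },
  [{ text: 'x'.repeat(100000) }],
);
console.log(entry.resultSummary.length); // far smaller than the 100000-char input
```

Because the entry holds a new short string rather than the parts array, the large tool output becomes collectable once the transcript path is done with it.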

Additional reproduction and fix validation:

| Step | Command shape | Result |
| --- | --- | --- |
| Pre-fix deterministic clone pressure | node --max-old-space-size=256 scripts/memory-pressure-repro.mjs ... --clone-count=2 --mode=clone | OOM, exit 134; stderr contained Reached heap limit and ValueDeserializer / StructuredClone; max RSS 528.1 MiB in the repeat run |
| Red test | targeted GeminiChat test with structuredClone forced to throw during request setup | failed at GeminiChat.getHistory() before the mitigation |
| Green test | same targeted GeminiChat test after the mitigation | passed |
| Built-code smoke | node --max-old-space-size=256 against the built core package, with a 96-entry / about 48 MiB history and structuredClone forced to throw | passed; request had 97 contents; process RSS 161.4 MiB, /usr/bin/time -l max RSS 161.6 MiB |

This narrows the earlier "same stack family" statement: the deterministic synthetic OOM still proves retained full-history clones can fail in the same V8 stack family as the user log, while the new GeminiChat red/green test proves one real production request-setup path no longer reaches that clone point. Checkpoint/resume and compression internals still need separate long-run validation because they can legitimately need durable copied history.
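
The "structuredClone forced to throw" technique behind the red/green test can be sketched in plain Node without a test framework. This is a minimal illustration, not the actual GeminiChat test; `withThrowingStructuredClone` is a hypothetical helper and only guards synchronous callbacks.

```javascript
// Runs `fn` while any call to the global structuredClone throws, then restores it.
function withThrowingStructuredClone(fn) {
  const original = Object.getOwnPropertyDescriptor(globalThis, 'structuredClone');
  Object.defineProperty(globalThis, 'structuredClone', {
    configurable: true,
    writable: true,
    value: () => {
      throw new Error('structuredClone called on a path that must not clone');
    },
  });
  try {
    return fn();
  } finally {
    if (original) {
      Object.defineProperty(globalThis, 'structuredClone', original);
    } else {
      delete globalThis.structuredClone;
    }
  }
}

const history = Array.from({ length: 96 }, (_, i) => ({ role: 'user', text: `m${i}` }));

// "Green" path: builds request contents by sharing references, so no clone happens.
const contents = withThrowingStructuredClone(() => [...history, { role: 'user', text: 'next' }]);
console.log(contents.length); // 97

// "Red" path: a deep-cloning accessor fails under the same guard.
let failed = false;
try {
  withThrowingStructuredClone(() => structuredClone(history));
} catch {
  failed = true;
}
console.log(failed); // true
```

A guard like this turns an expensive, memory-dependent failure into a cheap deterministic one, at the cost of only covering the code paths the test actually exercises.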

Verification commands:

| Command | Result |
| --- | --- |
| npx vitest run src/core/geminiChat.test.ts | passed, 89 tests |
| npx vitest run src/utils/nextSpeakerChecker.test.ts --coverage=false | passed, 13 tests |
| npx vitest run src/tools/agent/agent.test.ts --coverage=false | passed, 77 tests |
| npx vitest run --config ./scripts/tests/vitest.config.ts scripts/tests/memory-pressure-repro.test.js | passed, 1 test |
| npm run build --workspace=packages/core | passed |
| npm run build --workspace=packages/cli | passed |
| npm run typecheck --workspace=packages/core | passed |
| npm run typecheck --workspace=packages/cli | passed |
| npm run bundle | passed |
| npm run build | failed in packages/vscode-ide-companion lint on existing internal-module import rules; core, CLI, bundle, and targeted tests above passed |

The full root npm run build was not clean in this worktree because the vscode-ide-companion package hit pre-existing import/no-internal-modules lint errors. The core/CLI build and bundle needed for the local runtime test completed successfully.

The same PR review prompt was then run with a temporary config where MCP and hooks were disabled. Both runs were interrupted after a bounded long-run window instead of waiting for a full review to finish. Caveat: the two runs are confounded by workload size (about 79K vs 390K total tokens), so they cannot be compared as a controlled experiment; the comparison provides directional evidence only.

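| Variant | Runtime | MCP servers | Tools | Assistant messages | Tool use/result blocks | Parent tool ids | Total tokens | Max input tokens | Root max RSS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| before mitigation | 365.08s | 0 | 19 | 42 | 42 / 42 | 3 | 79,439 | 26,807 | 357.7 MiB |
| after mitigation | 404.52s | 0 | 19 | 58 | 52 / 42 | 2 | 390,339 | 54,000 | 310.5 MiB |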

This is not a deterministic apples-to-apples model benchmark: the patched run did more work and consumed substantially more total tokens before the manual cutoff. The useful signal is narrower: under a disabled-MCP review case with more observed work, root max RSS did not increase and was about 47.2 MiB lower. That supports the mitigation direction, but it does not prove the whole long-task OOM class is fixed.

Remaining high-risk clone/retention paths to inspect next:

  1. Compression still calls full getHistory(true) before summarization. If the heap is already high, the compression attempt can create the peak that trips OOM.
  2. Checkpoint creation can hold original history, cloned client history, and a serialized checkpoint payload at the same time.
  3. Fork subagents still seed from parent history with getHistory(true).
  4. ACP/history export/summary/copy paths still call full getHistory() and should be audited separately from the normal review loop.

Version timing:

| Issue | Created | Reported version | Signal |
| --- | --- | --- | --- |
| #2128 | 2026-03-05 | not specified | Long-session UI memory growth |
| #2562 | 2026-03-21 | not specified | structuredClone OOM in long sessions |
| #2868 | 2026-04-03 | 0.13.2 | Heap OOM |
| #2945 | 2026-04-07 | 0.14.0 | V8 heap OOM |
| #4116 | 2026-05-13 | 0.15.11 | OOM with structured-clone-style analysis |
| #4134 | 2026-05-14 | 0.15.11 | OOM |
| #4149 | 2026-05-14 | 0.15.10-nightly.20260513 | V8 heap OOM |
| #4167 | 2026-05-15 | 0.15.11 | Crash near compression |
| #4185 | 2026-05-15 | 0.15.11 | Heap pressure before token compaction |
| #4254 | 2026-05-17 | not specified | Memory keeps rising |
| #4276 | 2026-05-18 | 0.15.11 | V8 heap OOM |
| #4309 | 2026-05-19 | 0.15.11 | High memory warning around 7 GiB |

The issue history does not prove that 0.15.10 introduced the OOM class; similar reports existed in March and April. It does support a recent cluster beginning around 2026-05-13, overlapping the v0.15.10 and v0.15.11 releases. The diff between v0.15.9 and v0.15.10 heavily touched the subagent runtime, non-interactive execution, GeminiChat, and compression code, so this range is a reasonable first bisect window.

Notes

  • The first code-navigation prompt allowed open-ended exploration and hit maxSessionTurns; the successful rows above use a constrained command list.
  • The first synthetic-diff attempts used a relative bundle path from inside the temporary repositories; they failed immediately and are excluded from the tables. The successful rows use the absolute local bundle path.
  • Raw JSONL streams are not committed because they contain prompts, tool commands, and tool output. The report only includes aggregate diagnostics.