docs/plans/2026-05-18-qwen-runtime-memory-investigation.md
Date: 2026-05-18
Local benchmarks show Qwen Code using substantially more process-tree RSS than
Claude Code for similar non-interactive CLI task shapes. The latest five-case
matrix found Qwen Code peaking around 0.83-1.04 GiB while Claude Code stayed
around 0.27-0.36 GiB.
This document proposes a draft investigation and optimization direction. It is not intended to claim a final root cause yet. The immediate goal is to make the memory gap reviewable, reproducible, and explainable with internal diagnostics.
The investigation has reached the evidence-and-direction stage:
`/doctor memory` and memory-diagnostics follow-ups.

The investigation has not yet reached the final root-cause stage because external process RSS cannot show whether the retained memory is V8 heap, native memory, loaded modules, live history, tool results, or request assembly state.
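To illustrate why an in-process view is needed, the sketch below splits a single memory snapshot into V8 heap, external memory, and a remainder that neither explains. The field names mirror what Node's `process.memoryUsage()` returns; the `breakDown` helper and the `MemoryBreakdown` shape are hypothetical names introduced here, not part of Qwen Code or of the existing `/doctor memory` output. The "unattributed" figure is a rough approximation, and an in-process snapshot covers only the current process, not the whole process tree measured by the benchmark.

```typescript
// Shape of a memory snapshot in bytes. The field names follow Node's
// process.memoryUsage(); the snapshot is passed in so the helper stays pure.
interface MemorySnapshot {
  rss: number;
  heapTotal: number;
  heapUsed: number;
  external: number;
  arrayBuffers: number;
}

// Hypothetical report shape, for illustration only.
interface MemoryBreakdown {
  heapUsedMiB: number;
  externalMiB: number;
  unattributedMiB: number; // RSS not explained by V8 heap or external memory
  heapShareOfRss: number; // fraction between 0 and 1
}

const MIB = 1024 * 1024;

function breakDown(s: MemorySnapshot): MemoryBreakdown {
  // Rough approximation: whatever RSS remains after the reserved V8 heap and
  // external allocations is treated as "unattributed" (code, stacks, native).
  const unattributed = Math.max(0, s.rss - s.heapTotal - s.external);
  return {
    heapUsedMiB: s.heapUsed / MIB,
    externalMiB: s.external / MIB,
    unattributedMiB: unattributed / MIB,
    heapShareOfRss: s.rss > 0 ? s.heapUsed / s.rss : 0,
  };
}
```

In a Node process the input would typically come from `process.memoryUsage()`. A low `heapShareOfRss` alongside a high RSS would suggest looking beyond the JavaScript heap, which is exactly the distinction external RSS sampling cannot make.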
The companion benchmark report is:
docs/e2e-tests/2026-05-18-qwen-memory-benchmark-report.md

The main evidence is:
pai/glm-5 and qwen3.6-plus.

Relevant upstream work already exists:
| Item | Status | Role in the memory work |
|---|---|---|
| #4180 | merged PR | Adds baseline `/doctor memory` diagnostics. This is the first instrumentation slice. |
| #4181 | open issue, no PR yet | Adds interpretation and pressure classification for `/doctor memory`. |
| #4182 | open issue, no PR yet | Adds structured `/doctor memory --json` output and safe session-scale stats. |
| #4183 | open issue, no PR yet | Adds opt-in heap snapshots and bounded memory timeline diagnostics. |
| #4184 | open issue, no PR yet | Adds large tool-result retention diagnostics and designs offload/preview mitigation. |
| #4127 | open PR, conflicting | Adds heap-pressure safety nets for long-session OOM prevention. Useful mitigation, not enough for attribution. |
| #4168 | open PR | Redesigns auto-compaction thresholds. Useful for context pressure, not enough for task-time footprint analysis. |
| #4172 | open PR | Decouples auto-memory recall from the main request path. Useful for latency/blocking, not direct RSS proof. |
| #4188 | merged PR | Bounds build/test caches to prevent OOM in parallel test runs. Important but separate from runtime benchmarks. |
This investigation should build on that direction rather than wait for all follow-up issues to land.
Most of the remaining work is instrumentation-first. The open diagnostics
issues are designed to make memory reports explainable before attempting a
runtime fix. The open mitigation PRs may reduce specific OOM paths, but they do
not yet explain why short non-interactive CLI tasks repeatedly peak near
1 GiB.
This draft intentionally starts with benchmark evidence and an investigation plan instead of bundling a runtime code change.
Reasons:
The next implementation PR should add the missing counters and timeline points, then rerun the benchmark matrix. Only after that should a targeted optimization PR attempt to reduce memory.
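As a rough sketch of what "timeline points" could look like, the class below keeps a fixed number of labeled memory samples so that the diagnostics themselves cannot grow without bound. `BoundedTimeline`, `TimelinePoint`, and the labels in the usage note are hypothetical names chosen for illustration; the actual design belongs to the diagnostics follow-up issues.

```typescript
// One labeled sample on the memory timeline (hypothetical shape).
interface TimelinePoint {
  label: string; // e.g. a phase name chosen by the caller
  atMs: number; // timestamp in milliseconds
  rssBytes: number;
  heapUsedBytes: number;
}

// Keeps at most `capacity` points and drops the oldest one first.
class BoundedTimeline {
  private readonly capacity: number;
  private points: TimelinePoint[] = [];

  constructor(capacity: number) {
    if (!Number.isInteger(capacity) || capacity < 1) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.capacity = capacity;
  }

  record(point: TimelinePoint): void {
    this.points.push(point);
    if (this.points.length > this.capacity) {
      this.points.shift(); // evict the oldest sample
    }
  }

  // Returns a copy so callers cannot mutate the internal buffer.
  snapshot(): TimelinePoint[] {
    return [...this.points];
  }

  // Returns the retained point with the highest RSS, if any.
  peakRss(): TimelinePoint | undefined {
    let peak: TimelinePoint | undefined;
    for (const p of this.points) {
      if (peak === undefined || p.rssBytes > peak.rssBytes) peak = p;
    }
    return peak;
  }
}
```

A caller might record one point at startup, one after each tool result, and one before the final response, then attach `snapshot()` to the benchmark output. Because old points are evicted, the reported peak is only the peak among retained samples; a real implementation may want to track the all-time peak separately.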
The current data points toward a Qwen Code runtime/path issue more than a model provider issue.
The strongest current inference is:
Qwen Code appears to carry a high non-interactive CLI task execution footprint, likely amplified by larger context/tool-result/session handling. The likely problem area is the CLI runtime and agent data path, not the selected model alone.
More specifically, the evidence points away from "too many tool calls" as the primary cause. Tool-call counts were similar across CLIs, and Claude sometimes used more turns or tool calls while keeping lower RSS. The more plausible problem is that Qwen Code initializes or retains heavier state for the same short non-interactive CLI task, then amplifies that execution footprint with larger context, tool-result, saved-output, or session-history data.
The most likely buckets are:
This is deliberately phrased as an inference. The next step is to add enough internal measurements to confirm or rule out each bucket.
The first draft PR should be evidence and diagnostics focused:
main,

The first PR should avoid mixing several unrelated optimizations. It should either remain documentation-only or add diagnostics-only code. A separate PR should carry the first runtime memory reduction once the cause is clearer.
These are candidates, not conclusions:
Claude Code and OpenAI Codex (OpenAI's CLI coding agent) should be used as design references for diagnostic separation, bounded output retention, and lazy history loading. The implementation should still follow Qwen Code's own architecture and tests.
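One way to picture bounded output retention is a preview that keeps only the head and tail of a large tool result and records how much was dropped. This is a minimal sketch under stated assumptions: `previewToolResult`, its `maxChars` parameter, and the marker text are hypothetical, and it does not describe how Claude Code, Codex, or Qwen Code handle tool output today.

```typescript
// Hypothetical preview record for a tool result.
interface ToolResultPreview {
  preview: string;
  originalLength: number;
  truncated: boolean;
}

// Keeps roughly `maxChars` characters of content (half from the start, half
// from the end) and replaces the middle with a short marker.
function previewToolResult(output: string, maxChars: number): ToolResultPreview {
  if (!Number.isInteger(maxChars) || maxChars < 0) {
    throw new RangeError("maxChars must be a non-negative integer");
  }
  if (output.length <= maxChars) {
    return { preview: output, originalLength: output.length, truncated: false };
  }
  const head = Math.ceil(maxChars / 2);
  const tail = Math.floor(maxChars / 2);
  const omitted = output.length - head - tail;
  const preview =
    output.slice(0, head) +
    `\n[... ${omitted} chars omitted ...]\n` +
    output.slice(output.length - tail);
  return { preview, originalLength: output.length, truncated: true };
}
```

The full output would have to live somewhere else (for example on disk) for this to reduce retained memory; the sketch only shows the in-memory side. Whether this helps at all depends on the diagnostics confirming that retained tool results are a significant bucket.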
The investigation should keep the same benchmark matrix so before/after results remain comparable:
For each run, record:
The minimum success condition for a candidate fix is not just "RSS went down". It should also identify which internal metric changed and why.
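The success condition above could be checked mechanically by diffing the recorded metrics of a baseline run against a candidate run. The sketch below is an assumption-laden illustration: `diffMetrics` and the metric names used in the usage note are hypothetical, since the real counter names depend on the diagnostics that land.

```typescript
// A run's recorded metrics, keyed by metric name (names are caller-defined).
type Metrics = Record<string, number>;

interface MetricDelta {
  name: string;
  before: number;
  after: number;
  delta: number; // after - before; negative means the metric went down
}

// Compares metrics present in both runs and sorts by largest absolute change,
// so the metric that moved most is listed first.
function diffMetrics(before: Metrics, after: Metrics): MetricDelta[] {
  const deltas: MetricDelta[] = [];
  for (const name of Object.keys(before)) {
    const b = before[name];
    const a = after[name];
    if (b === undefined || a === undefined) continue; // skip unmatched metrics
    deltas.push({ name, before: b, after: a, delta: a - b });
  }
  return deltas.sort((x, y) => Math.abs(y.delta) - Math.abs(x.delta));
}
```

For example, with hypothetical metrics `peakRssMiB`, `heapUsedMiB`, and `retainedToolResultMiB`, a fix that lowers peak RSS mostly by shrinking `retainedToolResultMiB` would show both near the top of the list, which ties the RSS change to a specific internal cause rather than to run-to-run noise.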
The next PR should be diagnostics-only and should avoid changing runtime behavior. A minimal useful slice would add:
After that lands locally, rerun the same Qwen model matrix and compare:
main;

This draft does not claim that:
The intended claim is narrower: Qwen Code shows a consistent local RSS gap in the tested workloads, and the project needs internal diagnostics to explain and reduce that gap.