docs/e2e-tests/2026-05-18-qwen-memory-benchmark-report.md
Date: 2026-05-18
This report records local memory benchmarks for Qwen Code runtime behavior. It compares Qwen Code across models and compares Qwen Code with Claude Code on the same task shapes where equivalent model endpoints were available.
The headline result is consistent across the latest matrix (single run per cell, not statistically repeated):
- Qwen Code process-tree RSS peak: 852-1062 MiB (0.83-1.04 GiB).
- Claude Code process-tree RSS peak: 279-366 MiB (0.27-0.36 GiB).
- Qwen Code is 2.3x-3.6x higher in the tested non-interactive CLI task benchmarks.

Note: process-tree RSS includes MCP child processes (~350 MiB overhead on the Qwen side). This inflates the absolute numbers, but the relative comparison remains informative because both CLIs were measured the same way.
The difference was reproduced in small PR review, code navigation, and synthetic diff workloads. It is therefore unlikely to be explained by one large PR or one model provider alone.
This report is intended to make the current performance investigation visible: what has been measured, what conclusion is already supported, what remains unknown, and what diagnostics should be added next.
| Item | Value |
|---|---|
| Date | 2026-05-18 |
| Platform | macOS local development machine |
| Qwen Code version | 0.15.11 |
| Qwen Code binary | PATH-resolved qwen binary |
| Claude Code version used in the latest matrix | 2.1.129 |
| Claude Code binary used in the latest matrix | PATH-resolved claude binary |
| Node.js version | v22.x (default system install) |
| Sampling method | External ps RSS sampling once per second |
| Headline metric | Process-tree RSS peak |
Process-tree RSS is used as the headline metric because Qwen Code launches a root wrapper and a child Node/Qwen worker. Looking only at the root process can understate the memory footprint seen by users.
Temporary CLI config directories were used for matrix runs so the benchmarks did not depend on global CLI state.
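The process-tree RSS metric described above can be sketched as follows. This is a minimal illustration, not the harness used for this report: the function names (`parse_ps`, `tree_rss_mib`, `sample_peak`) are hypothetical, and it assumes `ps -axo pid=,ppid=,rss=` reports RSS in KiB, as it does on macOS and Linux.

```python
import subprocess
import time
from collections import defaultdict

# Assumption: rss is reported in KiB; the trailing "=" suppresses column headers.
PS_COMMAND = ["ps", "-axo", "pid=,ppid=,rss="]


def parse_ps(output: str) -> dict[int, tuple[int, int]]:
    """Map pid -> (ppid, rss_kib) from the ps output above."""
    table = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        pid, ppid, rss_kib = (int(f) for f in fields)
        table[pid] = (ppid, rss_kib)
    return table


def tree_rss_mib(table: dict[int, tuple[int, int]], root_pid: int) -> float:
    """Sum RSS over root_pid and all of its descendants, in MiB."""
    children = defaultdict(list)
    for pid, (ppid, _) in table.items():
        children[ppid].append(pid)
    total_kib, stack, seen = 0, [root_pid], set()
    while stack:
        pid = stack.pop()
        if pid in seen or pid not in table:
            continue
        seen.add(pid)
        total_kib += table[pid][1]
        stack.extend(children[pid])
    return total_kib / 1024


def sample_peak(root_pid: int, interval_s: float = 1.0) -> float:
    """Poll once per interval until the root process disappears; return the peak."""
    peak = 0.0
    while True:
        out = subprocess.run(PS_COMMAND, capture_output=True, text=True).stdout
        table = parse_ps(out)
        if root_pid not in table:
            return peak
        peak = max(peak, tree_rss_mib(table, root_pid))
        time.sleep(interval_s)
```

Summing over descendants is what lets the metric include the child Node/Qwen worker and any MCP child processes rather than only the root wrapper. A one-second polling interval can miss short-lived spikes, so the reported peak is a lower bound.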
Five local reports were produced before this consolidated report:
pai/glm-5.

This consolidated report covers the conclusions and headline metrics from all five reports. It does not embed every raw sample row, terminal transcript, or temporary runner artifact. Those raw artifacts stayed in local `tmp/` directories because they are experiment outputs rather than stable repository fixtures.
The latest matrix is the strongest evidence because it covers multiple task shapes rather than only one PR review workload.
The current data is strong enough to say that Qwen Code has a higher runtime memory footprint than Claude Code in these local non-interactive CLI task benchmarks. It is not strong enough to name one final root cause yet.
The leading explanation is a Qwen Code runtime/path difference rather than a model provider difference:
pai/glm-5 and qwen3.6-plus;

The most useful next measurement is therefore not another external RSS-only run. The next measurement should split RSS into V8 heap, native memory, session/history size, retained tool-result size, and subagent/process-tree activity.
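As a rough illustration of the V8-heap versus native split, the sketch below divides one snapshot shaped like Node's `process.memoryUsage()` result (fields `rss`, `heapTotal`, `heapUsed`, `external`, all in bytes) into buckets. The function name `split_rss` and the example numbers are hypothetical and are not measurements from this report. The "other native" bucket is only an approximation, because reserved heap pages are not necessarily all resident.

```python
MIB = 1024 * 1024


def split_rss(snapshot: dict) -> dict:
    """Split a process.memoryUsage()-shaped snapshot (bytes) into rough MiB buckets."""
    rss = snapshot["rss"]
    heap_total = snapshot["heapTotal"]
    external = snapshot["external"]
    # Whatever RSS is not covered by the V8 heap reservation or external
    # allocations is treated as "other native" (code, stacks, allocator overhead).
    other = max(rss - heap_total - external, 0)
    return {
        "rss_mib": rss / MIB,
        "v8_heap_used_mib": snapshot["heapUsed"] / MIB,
        "v8_heap_reserved_mib": heap_total / MIB,
        "external_mib": external / MIB,
        "other_native_mib": other / MIB,
    }


# Hypothetical snapshot for illustration only.
example = {
    "rss": 1000 * MIB,
    "heapTotal": 300 * MIB,
    "heapUsed": 250 * MIB,
    "external": 50 * MIB,
}
breakdown = split_rss(example)
for name, value in breakdown.items():
    print(f"{name:22s} {value:8.1f}")
```

Session/history size, retained tool-result size, and subagent activity are not visible in this snapshot and would need separate counters inside Qwen Code.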
The benchmark does not yet prove one root cause, but it does narrow the likely problem area.
| Signal | What it suggests | What it does not prove |
|---|---|---|
| Qwen remains near 1 GiB in small PR and code-navigation cases | A high non-interactive task-time runtime cost is likely involved | It does not identify whether the footprint is V8 heap, native memory, module loading, or retained state |
| Diff size from 100 KiB to 5 MiB does not scale linearly with RSS | Raw diff bytes alone are probably not the primary driver | Large outputs can still amplify memory in real PR review flows |
| Qwen uses more tokens than Claude in every matrix cell | Qwen likely constructs or retains larger prompt/context/tool-result state for similar work | Token count is not the same as process memory and may be an effect rather than the cause |
| Tool call counts are similar, and Claude sometimes uses more turns/tool calls with lower RSS | A longer tool-call chain is unlikely to be the main explanation by itself | Tool output size and retention still need to be measured |
| Earlier large PR runs showed saved-output recovery and subagent amplification | Tool-output truncation and saved-output paths are likely heavy-workload amplifiers | They do not explain the entire small-task execution footprint |
The current best explanation is therefore:
The next diagnostic run should answer where the ~1 GiB sits:
The latest benchmark ran:
- Models: pai/glm-5 and qwen3.6-plus.
- Small PR: #4268, one-line change
- Code navigation: `rg` plus `sed` on compression-related files

All 20 runs exited 0 with no timeout.
| Case | Model | Qwen tree peak | Claude tree peak | Qwen / Claude |
|---|---|---|---|---|
| small PR #4268 | pai/glm-5 | 1032.7 MiB | 357.8 MiB | 2.89x |
| small PR #4268 | qwen3.6-plus | 852.2 MiB | 365.5 MiB | 2.33x |
| code navigation | pai/glm-5 | 993.1 MiB | 359.6 MiB | 2.76x |
| code navigation | qwen3.6-plus | 996.9 MiB | 349.0 MiB | 2.86x |
| diff 100 KiB | pai/glm-5 | 1012.1 MiB | 350.8 MiB | 2.89x |
| diff 100 KiB | qwen3.6-plus | 1001.1 MiB | 336.2 MiB | 2.98x |
| diff 1 MiB | pai/glm-5 | 1008.3 MiB | 278.8 MiB | 3.62x |
| diff 1 MiB | qwen3.6-plus | 1003.3 MiB | 340.5 MiB | 2.95x |
| diff 5 MiB | pai/glm-5 | 858.8 MiB | 323.2 MiB | 2.66x |
| diff 5 MiB | qwen3.6-plus | 1062.0 MiB | 331.2 MiB | 3.21x |
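The ratio column and the headline ranges can be recomputed from the peaks in the table. The short sketch below copies the table values and derives the Qwen / Claude ratio per cell. The variable names are illustrative only.

```python
# (case, model, Qwen tree peak MiB, Claude tree peak MiB), copied from the table.
MATRIX = [
    ("small PR #4268", "pai/glm-5", 1032.7, 357.8),
    ("small PR #4268", "qwen3.6-plus", 852.2, 365.5),
    ("code navigation", "pai/glm-5", 993.1, 359.6),
    ("code navigation", "qwen3.6-plus", 996.9, 349.0),
    ("diff 100 KiB", "pai/glm-5", 1012.1, 350.8),
    ("diff 100 KiB", "qwen3.6-plus", 1001.1, 336.2),
    ("diff 1 MiB", "pai/glm-5", 1008.3, 278.8),
    ("diff 1 MiB", "qwen3.6-plus", 1003.3, 340.5),
    ("diff 5 MiB", "pai/glm-5", 858.8, 323.2),
    ("diff 5 MiB", "qwen3.6-plus", 1062.0, 331.2),
]

# Qwen / Claude ratio per cell, rounded to two decimals as in the table.
ratios = [round(qwen / claude, 2) for _, _, qwen, claude in MATRIX]

qwen_peaks = [row[2] for row in MATRIX]
claude_peaks = [row[3] for row in MATRIX]

print(f"Qwen range:   {min(qwen_peaks):.1f}-{max(qwen_peaks):.1f} MiB")
print(f"Claude range: {min(claude_peaks):.1f}-{max(claude_peaks):.1f} MiB")
print(f"Ratio range:  {min(ratios):.2f}x-{max(ratios):.2f}x")
```

Because each cell is a single run, these ranges describe the observed spread across cells rather than a confidence interval.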
Average process-tree RSS peak by case:
| Case | Avg Qwen tree peak | Avg Claude tree peak |
|---|---|---|
| small PR #4268 | 942.5 MiB | 361.6 MiB |
| code navigation | 995.0 MiB | 354.3 MiB |
| diff 100 KiB | 1006.6 MiB | 343.5 MiB |
| diff 1 MiB | 1005.8 MiB | 309.6 MiB |
| diff 5 MiB | 960.4 MiB | 327.2 MiB |
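The per-case averages are the mean of the two model runs in each case. The sketch below recomputes them from the matrix values and also compares how much the synthetic diff input grew against how much the average Qwen peak changed. It is a worked check, not part of the benchmark tooling.

```python
from statistics import mean

# (Qwen peaks, Claude peaks) per case in MiB, copied from the matrix table.
PEAKS = {
    "small PR #4268": ([1032.7, 852.2], [357.8, 365.5]),
    "code navigation": ([993.1, 996.9], [359.6, 349.0]),
    "diff 100 KiB": ([1012.1, 1001.1], [350.8, 336.2]),
    "diff 1 MiB": ([1008.3, 1003.3], [278.8, 340.5]),
    "diff 5 MiB": ([858.8, 1062.0], [323.2, 331.2]),
}

averages = {case: (mean(q), mean(c)) for case, (q, c) in PEAKS.items()}

# Input size grows from 100 KiB to 5 MiB (5120 KiB).
input_growth = (5 * 1024) / 100
rss_growth = averages["diff 5 MiB"][0] / averages["diff 100 KiB"][0]

for case, (q, c) in averages.items():
    print(f"{case:16s} Qwen {q:7.1f} MiB   Claude {c:6.1f} MiB")
print(f"diff input grew {input_growth:.1f}x; average Qwen peak changed {rss_growth:.2f}x")
```

With an input roughly 51x larger and an average peak that stays close to 1 GiB, raw diff bytes alone do not account for the footprint in this matrix.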
The same matrix also showed Qwen Code using more model-side tokens in every tested case.
Selected examples:
| Case | Model | CLI | Duration | Turns | Total tokens | Tool calls |
|---|---|---|---|---|---|---|
| small PR | pai/glm-5 | Qwen | 25.2s | 2 | 32,567 | 3 |
| small PR | pai/glm-5 | Claude | 21.1s | 4 | 7,899 | 3 |
| code navigation | qwen3.6-plus | Qwen | 25.2s | 2 | 38,151 | 3 |
| code navigation | qwen3.6-plus | Claude | 46.9s | 6 | 25,861 | 5 |
| diff 100 KiB | qwen3.6-plus | Qwen | 16.5s | 3 | 57,185 | 2 |
| diff 100 KiB | qwen3.6-plus | Claude | 17.2s | 3 | 6,377 | 2 |
| diff 5 MiB | pai/glm-5 | Qwen | 23.2s | 2 | 38,574 | 2 |
| diff 5 MiB | pai/glm-5 | Claude | 9.8s | 3 | 5,285 | 2 |
This token gap does not prove that token volume is the memory root cause, but it does suggest that context assembly, tool result retention, or response normalization should be measured alongside RSS and V8 heap statistics.
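To make the size of the token gap concrete, the sketch below computes the Qwen / Claude total-token ratio for the four selected examples in the table. Only those table values are used, and the names are illustrative.

```python
# (Qwen total tokens, Claude total tokens) for the selected examples in the table.
TOKENS = {
    ("small PR", "pai/glm-5"): (32_567, 7_899),
    ("code navigation", "qwen3.6-plus"): (38_151, 25_861),
    ("diff 100 KiB", "qwen3.6-plus"): (57_185, 6_377),
    ("diff 5 MiB", "pai/glm-5"): (38_574, 5_285),
}

token_ratios = {key: qwen / claude for key, (qwen, claude) in TOKENS.items()}

for (case, model), ratio in token_ratios.items():
    print(f"{case:16s} {model:14s} Qwen/Claude tokens: {ratio:.2f}x")
```

The ratio varies widely between examples, which is one more reason to treat token volume as a clue to measure alongside memory rather than as a direct proxy for RSS.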
The token gap is one of the strongest clues, but it needs internal request metrics before it can be treated as a root cause.
What the data supports today:
What this suggests:
What is still missing:
Those missing metrics are why the next step should add internal diagnostics rather than only repeat the external RSS benchmark.
An earlier strict PR review benchmark used PR #4186 and showed the same broad
shape:
| Model | CLI | Process-tree RSS peak |
|---|---|---|
| pai/glm-5 | Qwen Code | 1000.7 MiB |
| pai/glm-5 | Claude Code | 349.0 MiB |
| qwen3.6-plus | Qwen Code | 1095.8 MiB |
| qwen3.6-plus | Claude Code | 341.1 MiB |
That earlier run was not enough by itself because a large PR can trigger unusual tool-output and saved-output paths. The latest five-case matrix makes the finding stronger because small PR and code-navigation tasks also reproduce the gap.
The current evidence supports these hypotheses, in priority order:
0.7-0.8 GiB.

pai/glm-5 and qwen3.6-plus showed the same broad Qwen-vs-Claude gap.

The next local investigation branch should add or use diagnostics for:

- `process.memoryUsage()` before and after startup, tool execution, streaming, compression, and session finalization.

These measurements should be collected with the same benchmark matrix so the current RSS comparison can be connected to internal Qwen Code state.
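One way to analyze such before/after snapshots is to compute per-phase deltas offline. The sketch below assumes each snapshot is exported as a phase name plus a `process.memoryUsage()`-shaped dictionary in bytes. The export format, the phase names, the function name `phase_deltas`, and all numbers are hypothetical.

```python
MIB = 1024 * 1024
FIELDS = ("rss", "heapTotal", "heapUsed", "external")


def phase_deltas(snapshots):
    """Return (transition, {field: delta in MiB}) for consecutive snapshots."""
    deltas = []
    for (prev_name, prev), (name, cur) in zip(snapshots, snapshots[1:]):
        change = {field: (cur[field] - prev[field]) / MIB for field in FIELDS}
        deltas.append((f"{prev_name} -> {name}", change))
    return deltas


# Hypothetical snapshots for illustration only; not measured values.
snapshots = [
    ("startup", {"rss": 400 * MIB, "heapTotal": 120 * MIB,
                 "heapUsed": 90 * MIB, "external": 10 * MIB}),
    ("tool execution", {"rss": 700 * MIB, "heapTotal": 260 * MIB,
                        "heapUsed": 210 * MIB, "external": 30 * MIB}),
    ("session finalization", {"rss": 690 * MIB, "heapTotal": 260 * MIB,
                              "heapUsed": 150 * MIB, "external": 30 * MIB}),
]

for transition, change in phase_deltas(snapshots):
    summary = ", ".join(f"{field} {value:+.0f} MiB" for field, value in change.items())
    print(f"{transition}: {summary}")
```

A phase where `rss` grows much more than `heapUsed` would point toward native or retained-buffer growth, while matching growth would point toward V8 heap objects.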