docs/solutions/logic-errors/2026-03-24-docx-vml-shapes-must-be-extracted-from-comment-markup-not-domparser-nodes.md
Direct coverage on getVShapes(...) exposed a quiet mismatch between the test seam and real Word markup.
The helper looked for V:SHAPE elements through DOMParser and querySelectorAll(...). That works only if the VML survives as real DOM nodes. In the actual cleaner flow, the interesting VML often arrives inside comment markup, so the DOM query path sees nothing useful and the helper returns an empty object.
The implementation assumed the parsed HTML tree preserved the VML shape elements in a queryable form.
That assumption was wrong for the real seam we care about. The data we need is the raw shape markup itself, specifically the id and o:spid attributes. DOM traversal was the wrong extraction tool.
Read the raw comment markup and extract VML shapes directly from the string.
The working approach was:
- find the <v:shape ...> blocks in the raw markup
- read id from each block
- read o:spid from each block
This keeps the helper aligned with how Word markup actually shows up in the docx cleaner path.
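As a rough illustration of string-based extraction, the sketch below matches each v:shape opening tag with a regular expression and reads id and o:spid from its attribute text. The names extractVShapes, readAttr, and VShape, and the choice to key the result by id, are assumptions made for this example. They are not the actual getVShapes implementation or its return shape.

```typescript
// Hypothetical sketch: names and the return shape are assumptions,
// not the real getVShapes helper.
type VShape = { id: string; spid: string };

// Matches the opening tag of each <v:shape ...> element. The \b keeps
// <v:shapetype ...> from matching.
const SHAPE_TAG = /<v:shape\b([^>]*)>/gi;

// Reads one quoted attribute value from the attribute portion of a tag.
// The name must start the string or follow whitespace, so looking up
// "id" does not accidentally match the tail of "o:spid".
function readAttr(attrs: string, name: string): string | undefined {
  const escaped = name.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp(`(?:^|\\s)${escaped}\\s*=\\s*(["'])(.*?)\\1`, "i");
  return pattern.exec(attrs)?.[2];
}

// Scans raw markup (comment-wrapped or not) and collects every shape
// that carries both an id and an o:spid attribute.
function extractVShapes(rawMarkup: string): Record<string, VShape> {
  const shapes: Record<string, VShape> = {};
  for (const match of rawMarkup.matchAll(SHAPE_TAG)) {
    const attrs = match[1];
    const id = readAttr(attrs, "id");
    const spid = readAttr(attrs, "o:spid");
    if (id && spid) {
      shapes[id] = { id, spid };
    }
  }
  return shapes;
}
```

Because the scan never builds a DOM, it behaves the same whether the shape sits inside a conditional comment or in plain markup. The trade-off is that it trusts the tag text to be reasonably well formed, for example with no unescaped > inside attribute values.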
These checks passed:
bun test packages/docx/src/lib/docx-cleaner/utils/getVShapes.spec.ts
bun test packages/docx/src
pnpm turbo build --filter=./packages/docx
pnpm turbo typecheck --filter=./packages/docx
For docx cleaner helpers, prefer tests built from realistic pasted Word fragments over idealized DOM fixtures.
When the source data is comment-wrapped or namespace-heavy markup, parse the raw string first. DOM APIs are too eager to normalize the interesting bits away.
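One way to follow this guidance in a fixture is to keep the pasted fragment as a raw string and pull the comment bodies out before anything else touches them. The sketch below is a minimal illustration under stated assumptions: WORD_FRAGMENT is a hand-written approximation of pasted Word markup, not a captured sample, and commentBodies is a hypothetical helper name.

```typescript
// Hand-written approximation of a pasted Word fragment (an assumption,
// not a captured sample). The VML lives inside a conditional comment.
const WORD_FRAGMENT =
  '<p><!--[if gte vml 1]><v:shape id="_x0000_i1025" o:spid="_x0000_s1026">' +
  '<v:imagedata src="file:///image001.png"/></v:shape><![endif]-->' +
  '<![if !vml]><img src="image001.png" v:shapes="_x0000_i1025"><![endif]></p>';

// Hypothetical helper: returns the text between <!-- and --> for every
// comment in the raw string, without building a DOM.
function commentBodies(rawHtml: string): string[] {
  const bodies: string[] = [];
  for (const match of rawHtml.matchAll(/<!--([\s\S]*?)-->/g)) {
    bodies.push(match[1]);
  }
  return bodies;
}

// Keep only the comment bodies that actually contain a v:shape element.
const vmlBodies = commentBodies(WORD_FRAGMENT).filter((body) =>
  /<v:shape\b/i.test(body),
);
```

A fixture written this way exercises the same comment-wrapped path the cleaner sees, so a helper that only works on real DOM nodes fails the test instead of passing against an idealized tree.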