docs/superpowers/plans/2026-05-08-native-docx-pdf-export-import.md
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (
- [ ]) syntax for tracking.
Goal: Extend PR #7568 so a soffice-less Etherpad can export pdf+docx and import docx purely in-process, while keeping behavior bit-for-bit identical when soffice is configured.
Architecture: Single dispatch cascade in ExportHandler.ts and ImportHandler.ts — soffice if sofficeAvailable() === 'yes', native otherwise. Native PDF uses pdfkit + htmlparser2 driven by a small walker we own. Native DOCX import uses mammoth to produce HTML and reuses the existing HTML import pipeline. A shared stripRemoteImages() helper sanitizes HTML before either DOCX or PDF conversion to close the SSRF surface Qodo flagged. Drops the opt-in nativeDocxExport setting introduced earlier in this PR — selection is purely soffice-presence-driven.
Tech Stack: TypeScript (Node), Mocha + supertest backend tests, pdfkit (^0.15), htmlparser2 (^9), mammoth (^1.7), html-to-docx (^1.8 — already in PR), pnpm workspace.
Spec: docs/superpowers/specs/2026-05-08-native-docx-pdf-export-import-design.md (commit 2cebcc822).
| File | Role | Status |
|---|---|---|
src/node/utils/ExportSanitizeHtml.ts | stripRemoteImages(html) — drops `` outside data:/relative | NEW |
src/node/utils/ExportPdfNative.ts | htmlToPdfBuffer(html) — pdfkit + htmlparser2 walker | NEW |
src/node/utils/ImportDocxNative.ts | docxBufferToHtml(buf) — mammoth wrapper | NEW |
src/node/handler/ExportHandler.ts | Replaced flag-gated DOCX branch with soffice-first cascade for both DOCX+PDF | MODIFIED |
src/node/handler/ImportHandler.ts | Soffice-first cascade for DOCX import | MODIFIED |
src/node/hooks/express/importexport.ts | Tighter route guard (PDF/DOCX go native when no soffice) | MODIFIED |
src/static/js/pad_impexp.ts | Always show DOCX+PDF export links; ODT still gated on soffice | MODIFIED |
src/node/utils/Settings.ts | Revert nativeDocxExport field (introduced earlier in PR) | MODIFIED (revert) |
settings.json.template | Revert nativeDocxExport block | MODIFIED (revert) |
settings.json.docker | Revert nativeDocxExport block | MODIFIED (revert) |
doc/docker.md | Revert NATIVE_DOCX_EXPORT row | MODIFIED (revert) |
src/package.json | Add pdfkit, htmlparser2, mammoth. Keep html-to-docx. | MODIFIED |
pnpm-lock.yaml | Lockfile regen | MODIFIED |
src/tests/backend/specs/export.ts | Revise existing DOCX tests (soffice=null); add native PDF tests; add negative ODT; add unit test for sanitizer | MODIFIED |
src/tests/backend/specs/import.ts | New file: native DOCX import + negative ODT | NEW |
src/tests/backend/specs/fixtures/sample.docx | Tiny DOCX fixture: heading, paragraph, bullet list | NEW |
The PR is currently mergeStateStatus: DIRTY. Resolve before adding new commits — easier to handle conflicts on a known-good base.
Files: none (git operation)
cd /home/jose/etherpad/etherpad-lite/.claude/worktrees/pr-7538
git fetch origin develop
Expected: From https://github.com/ether/etherpad-lite … develop -> origin/develop.
git rebase origin/develop
Expected: clean replay of b98dfbab7 (DOCX feature commit), 6a7093c09 (CI guard), and 2cebcc822 (spec). If a conflict arises in src/package.json / pnpm-lock.yaml / src/node/handler/ExportHandler.ts, prefer our changes for new files and re-resolve any overlap manually. Do NOT use --strategy-option=theirs blindly.
git log --oneline origin/develop..HEAD
pnpm --filter ep_etherpad-lite run test --grep '#7538'
Expected: log shows the three commits above HEAD; test grep passes the existing native DOCX block.
git push fork feat/native-docx-export-7538 --force-with-lease
Expected: branch updated, GitHub re-runs CI green.
gh pr view 7568 --repo ether/etherpad --json mergeStateStatus,mergeable
Expected: mergeStateStatus is BEHIND, BLOCKED, CLEAN, or HAS_HOOKS — anything other than DIRTY/CONFLICTING.
Files:
Modify: src/package.json
Modify: pnpm-lock.yaml
Step 1: Add deps
cd /home/jose/etherpad/etherpad-lite/.claude/worktrees/pr-7538/src
pnpm add pdfkit htmlparser2 mammoth
pnpm add -D @types/pdfkit
Expected: three runtime deps + one dev @types package added to src/package.json. pnpm-lock.yaml regenerated. html-to-docx (already there) untouched.
grep -E 'pdfkit|htmlparser2|mammoth|html-to-docx' src/package.json
Expected output (versions may be newer; the point is they're all "^X.Y.Z"):
"html-to-docx": "^1.8.0",
"htmlparser2": "^9.x.x",
"mammoth": "^1.x.x",
"pdfkit": "^0.x.x",
"@types/pdfkit": "^0.x.x",
cd src && node -e "require('pdfkit'); require('htmlparser2'); require('mammoth'); console.log('OK')"
Expected: prints OK and exits 0.
cd /home/jose/etherpad/etherpad-lite/.claude/worktrees/pr-7538
git add src/package.json pnpm-lock.yaml
git commit -m "chore(7538): add pdfkit, htmlparser2, mammoth deps"
Files:
src/node/utils/ExportSanitizeHtml.tssrc/tests/backend/specs/export.tsThe sanitizer is consumed by both the DOCX and PDF branches in Task 5. Build it first so those branches can call it from the start.
Append the following block to src/tests/backend/specs/export.ts, ABOVE the closing }); of the outer describe(__filename, ...):
describe('stripRemoteImages', function () {
const {stripRemoteImages} = require('../../../node/utils/ExportSanitizeHtml');
it('keeps data: URIs', function () {
const out = stripRemoteImages(
'<p>x</p>');
assert.match(out, /]+src="data:image\/png/);
});
it('keeps relative URLs', function () {
const out = stripRemoteImages('');
assert.match(out, /]+src="\/foo\/bar\.png"/);
});
it('drops absolute http(s) URLs and falls back to alt', function () {
const out = stripRemoteImages(
'<p>beforeafter</p>');
assert.doesNotMatch(out, /evil\.example/);
assert.match(out, /before/);
assert.match(out, /cat/);
assert.match(out, /after/);
});
it('drops protocol-relative URLs', function () {
const out = stripRemoteImages('');
assert.doesNotMatch(out, /evil\.example/);
});
it('passes non-image markup through unchanged', function () {
const html = '<h1>hi</h1><p>body <a href="/x">link</a></p>';
assert.strictEqual(stripRemoteImages(html), html);
});
});
cd src && pnpm test --grep 'stripRemoteImages'
Expected: all five tests fail with Cannot find module '../../../node/utils/ExportSanitizeHtml'.
Create src/node/utils/ExportSanitizeHtml.ts:
'use strict';
import {Parser} from 'htmlparser2';
const isLocalSrc = (src: string): boolean => {
if (!src) return true;
if (src.startsWith('data:')) return true;
if (src.startsWith('//')) return false;
if (/^[a-z][a-z0-9+.-]*:/i.test(src)) return false;
return true;
};
const escapeAttr = (s: string): string =>
s.replace(/&/g, '&').replace(/"/g, '"').replace(/</g, '<');
const escapeText = (s: string): string =>
s.replace(/&/g, '&').replace(/</g, '<').replace(/>/g, '>');
const VOID_TAGS = new Set([
'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
'link', 'meta', 'source', 'track', 'wbr',
]);
export const stripRemoteImages = (html: string): string => {
let out = '';
const parser = new Parser({
onopentag(name, attribs) {
if (name === 'img') {
const src = attribs.src || '';
if (isLocalSrc(src)) {
let tag = '';
out += tag;
} else {
out += escapeText(attribs.alt || '');
}
return;
}
let tag = `<${name}`;
for (const [k, v] of Object.entries(attribs)) {
tag += ` ${k}="${escapeAttr(v)}"`;
}
tag += '>';
out += tag;
},
ontext(text) {
out += text;
},
onclosetag(name) {
if (VOID_TAGS.has(name)) return;
out += `</${name}>`;
},
}, {decodeEntities: false, lowerCaseTags: true});
parser.write(html);
parser.end();
return out;
};
cd src && pnpm test --grep 'stripRemoteImages'
Expected: 5 passing.
cd /home/jose/etherpad/etherpad-lite/.claude/worktrees/pr-7538
git add src/node/utils/ExportSanitizeHtml.ts src/tests/backend/specs/export.ts
git commit -m "feat(7538): add stripRemoteImages HTML sanitizer
Drops elements pointing at non-data, non-relative URLs to
prevent the DOCX/PDF converters from making outbound requests via
plugin-modified HTML. Closes Qodo finding #4 against the
html-to-docx path; will be wired into both export branches in
the cascade refactor."
Files:
src/node/utils/ExportPdfNative.tssrc/tests/backend/specs/export.tsBuild the walker bottom-up: smoke first (HTML in → %PDF buffer out), then add tag handlers as features. Each tag class gets its own test.
Append to src/tests/backend/specs/export.ts, ABOVE the closing }); of the outer describe(__filename):
describe('htmlToPdfBuffer', function () {
let htmlToPdfBuffer: (html: string) => Promise<Buffer>;
before(function () {
try {
require.resolve('pdfkit');
require.resolve('htmlparser2');
} catch {
this.skip();
return;
}
htmlToPdfBuffer = require('../../../node/utils/ExportPdfNative').htmlToPdfBuffer;
});
it('produces a buffer starting with %PDF-', async function () {
const buf = await htmlToPdfBuffer('<p>hello world</p>');
assert.ok(Buffer.isBuffer(buf), 'must return Buffer');
assert.ok(buf.length > 100, `buffer suspiciously small: ${buf.length} bytes`);
assert.strictEqual(buf.slice(0, 5).toString('ascii'), '%PDF-');
});
});
cd src && pnpm test --grep 'htmlToPdfBuffer'
Expected: the produces a buffer starting with %PDF- test fails with Cannot find module '../../../node/utils/ExportPdfNative'.
Create src/node/utils/ExportPdfNative.ts:
'use strict';
import {Parser} from 'htmlparser2';
import {PassThrough} from 'stream';
const PDFDocument = require('pdfkit');
interface InlineState {
bold: boolean;
italic: boolean;
underline: boolean;
strike: boolean;
link?: string;
fontSize?: number;
}
const HEADING_SIZES: Record<string, number> = {
h1: 24, h2: 20, h3: 16, h4: 14, h5: 12, h6: 11,
};
const decodeDataUri = (src: string): Buffer | null => {
const m = /^data:[^;,]+;base64,(.+)$/i.exec(src);
if (!m) return null;
try {
return Buffer.from(m[1], 'base64');
} catch {
return null;
}
};
export const htmlToPdfBuffer = (html: string): Promise<Buffer> =>
new Promise((resolve, reject) => {
const doc = new PDFDocument({margin: 50});
const stream = new PassThrough();
const chunks: Buffer[] = [];
stream.on('data', (c: Buffer) => chunks.push(c));
stream.on('end', () => resolve(Buffer.concat(chunks)));
stream.on('error', reject);
doc.pipe(stream);
const styleStack: InlineState[] = [{
bold: false, italic: false, underline: false, strike: false,
}];
let listType: ('ul' | 'ol' | null)[] = [];
let listIndex: number[] = [];
let pendingNewline = false;
const top = () => styleStack[styleStack.length - 1];
const applyFont = () => {
const s = top();
const variant =
s.bold && s.italic ? 'Helvetica-BoldOblique' :
s.bold ? 'Helvetica-Bold' :
s.italic ? 'Helvetica-Oblique' :
'Helvetica';
doc.font(variant);
doc.fontSize(s.fontSize || 11);
};
const writeText = (raw: string) => {
if (!raw) return;
if (pendingNewline) {
doc.moveDown(0.5);
pendingNewline = false;
}
const s = top();
applyFont();
const opts: any = {continued: true};
if (s.underline) opts.underline = true;
if (s.strike) opts.strike = true;
if (s.link) opts.link = s.link;
doc.text(raw, opts);
};
const flushLine = () => {
doc.text('', {continued: false});
};
const parser = new Parser({
onopentag(name, attribs) {
const cur = top();
const next: InlineState = {...cur};
switch (name) {
case 'b': case 'strong': next.bold = true; break;
case 'i': case 'em': next.italic = true; break;
case 'u': next.underline = true; break;
case 's': case 'strike': case 'del': next.strike = true; break;
case 'a': next.link = attribs.href; next.underline = true; break;
case 'h1': case 'h2': case 'h3': case 'h4': case 'h5': case 'h6':
next.fontSize = HEADING_SIZES[name];
next.bold = true;
if (!pendingNewline) flushLine();
doc.moveDown(0.5);
break;
case 'p': case 'div':
if (!pendingNewline) flushLine();
doc.moveDown(0.3);
break;
case 'ul': case 'ol':
listType.push(name as 'ul' | 'ol');
listIndex.push(0);
flushLine();
break;
case 'li': {
flushLine();
const t = listType[listType.length - 1] || 'ul';
if (t === 'ol') listIndex[listIndex.length - 1] += 1;
const prefix = t === 'ul'
? '• '
: `${listIndex[listIndex.length - 1]}. `;
const indent = ' '.repeat(Math.max(0, listType.length - 1));
applyFont();
doc.text(`${indent}${prefix}`, {continued: true});
break;
}
case 'br':
flushLine();
break;
case 'img': {
const buf = decodeDataUri(attribs.src || '');
if (buf) {
flushLine();
try { doc.image(buf, {fit: [400, 300]}); } catch { /* ignore */ }
}
break;
}
}
styleStack.push(next);
},
ontext(text) {
writeText(text);
},
onclosetag(name) {
switch (name) {
case 'h1': case 'h2': case 'h3': case 'h4': case 'h5': case 'h6':
case 'p': case 'div':
flushLine();
pendingNewline = true;
break;
case 'li':
flushLine();
break;
case 'ul': case 'ol':
listType.pop();
listIndex.pop();
doc.moveDown(0.3);
break;
}
styleStack.pop();
if (styleStack.length === 0) {
styleStack.push({bold: false, italic: false, underline: false, strike: false});
}
},
}, {decodeEntities: true, lowerCaseTags: true});
parser.write(html);
parser.end();
flushLine();
doc.end();
});
cd src && pnpm test --grep 'htmlToPdfBuffer'
Expected: 1 passing.
Append BELOW the existing it('produces a buffer starting with %PDF-') inside the same describe('htmlToPdfBuffer'):
const renderText = async (html: string): Promise<string> => {
const buf = await htmlToPdfBuffer(html);
// pdfkit emits text uncompressed-ish; we look for substrings inside
// the raw PDF stream. This is intentionally fragile-friendly: we
// assert the words show up at all, not their layout.
return buf.toString('latin1');
};
it('renders headings, paragraphs, and lists', async function () {
const raw = await renderText(`
<h1>Title</h1>
<p>Body paragraph here.</p>
<ul><li>one</li><li>two</li></ul>
<ol><li>alpha</li><li>beta</li></ol>
`);
assert.ok(raw.includes('Title'));
assert.ok(raw.includes('Body paragraph here.'));
assert.ok(raw.includes('one'));
assert.ok(raw.includes('two'));
assert.ok(raw.includes('alpha'));
assert.ok(raw.includes('beta'));
});
it('emits link annotations for <a href>', async function () {
const raw = await renderText('<p><a href="https://etherpad.org">site</a></p>');
assert.ok(raw.includes('site'));
assert.ok(raw.includes('etherpad.org'));
});
it('embeds data: URI images without throwing', async function () {
const tinyPng =
'iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAAjCB0C8AAAAASUVORK5CYII=';
const buf = await htmlToPdfBuffer(``);
assert.ok(buf.length > 200);
});
it('ignores unknown tags rather than crashing', async function () {
const buf = await htmlToPdfBuffer(
'<custom-tag><p>still works</p></custom-tag>');
assert.ok(buf.slice(0, 5).toString('ascii') === '%PDF-');
});
});
cd src && pnpm test --grep 'htmlToPdfBuffer'
Expected: 5 passing.
wc -l src/node/utils/ExportPdfNative.ts
If the result is >500 lines OR a structural test from Step 5 fails in a way that requires substantially more walker code to fix (e.g. real table rendering, complex CSS), STOP and follow the bail-out path:
ExportPdfNative.ts with a pdfmake + html-to-pdfmake + jsdom implementation behind the same htmlToPdfBuffer(html) signature.pdfmake, html-to-pdfmake, jsdom to src/package.json; remove pdfkit and htmlparser2 if not used anywhere else.%PDF- buffer) hasn't changed.If the file is ≤500 lines and tests pass, continue normally.
cd /home/jose/etherpad/etherpad-lite/.claude/worktrees/pr-7538
git add src/node/utils/ExportPdfNative.ts src/tests/backend/specs/export.ts
git commit -m "feat(7538): native PDF export via pdfkit + htmlparser2 walker
Renders pad HTML to a PDF Buffer in-process: headings, paragraphs,
lists, links, inline emphasis, data:-URI images. Remote images are
explicitly skipped at the walker (defense-in-depth on top of the
shared stripRemoteImages sanitizer)."
Files:
Create: src/node/utils/ImportDocxNative.ts
Create: src/tests/backend/specs/fixtures/sample.docx
Create: src/tests/backend/specs/import.ts
Step 1: Generate the DOCX fixture
Use html-to-docx (already a dep) to produce a deterministic fixture so we don't hand-build OOXML. Run this from the worktree root:
mkdir -p src/tests/backend/specs/fixtures
cd src && node -e "
const fs = require('fs');
const htmlToDocx = require('html-to-docx');
htmlToDocx('<h1>Heading</h1><p>Paragraph body.</p><ul><li>one</li><li>two</li></ul>').then((buf) => {
fs.writeFileSync('tests/backend/specs/fixtures/sample.docx', buf);
console.log('wrote', buf.length, 'bytes');
});
"
Expected: wrote <N> bytes where N is roughly 5000–15000.
head -c 4 src/tests/backend/specs/fixtures/sample.docx | xxd
Expected: starts with 50 4b 03 04 (PK ZIP signature).
Create src/tests/backend/specs/import.ts:
'use strict';
import {MapArrayType} from '../../../node/types/MapType';
import path from 'path';
import {promises as fs} from 'fs';
const assert = require('assert').strict;
const common = require('../common');
const padManager = require('../../../node/db/PadManager');
import settings from '../../../node/utils/Settings';
describe(__filename, function () {
const settingsBackup: MapArrayType<any> = {};
let agent: any;
before(async function () {
agent = await common.init();
settingsBackup.soffice = settings.soffice;
});
after(function () {
Object.assign(settings, settingsBackup);
});
describe('docxBufferToHtml (#7538)', function () {
let docxBufferToHtml: (b: Buffer) => Promise<string>;
before(function () {
try { require.resolve('mammoth'); }
catch { this.skip(); return; }
docxBufferToHtml = require('../../../node/utils/ImportDocxNative').docxBufferToHtml;
});
it('converts the sample.docx fixture to HTML', async function () {
const buf = await fs.readFile(
path.join(__dirname, 'fixtures', 'sample.docx'));
const html = await docxBufferToHtml(buf);
assert.match(html, /Heading/);
assert.match(html, /Paragraph body\./);
assert.match(html, /one/);
assert.match(html, /two/);
});
it('emits no remote image URLs', async function () {
const buf = await fs.readFile(
path.join(__dirname, 'fixtures', 'sample.docx'));
const html = await docxBufferToHtml(buf);
assert.doesNotMatch(html, /]+src="https?:/);
assert.doesNotMatch(html, /]+src="\/\//);
});
});
});
cd src && pnpm test --grep 'docxBufferToHtml'
Expected: tests fail with Cannot find module '../../../node/utils/ImportDocxNative'.
Create src/node/utils/ImportDocxNative.ts:
'use strict';
const mammoth = require('mammoth');
export const docxBufferToHtml = async (buffer: Buffer): Promise<string> => {
const result = await mammoth.convertToHtml(
{buffer},
{
convertImage: mammoth.images.imgElement(async (image: any) => {
const buf: Buffer = await image.read();
const contentType = image.contentType || 'application/octet-stream';
return {src: `data:${contentType};base64,${buf.toString('base64')}`};
}),
},
);
return result.value || '';
};
cd src && pnpm test --grep 'docxBufferToHtml'
Expected: 2 passing.
cd /home/jose/etherpad/etherpad-lite/.claude/worktrees/pr-7538
git add src/node/utils/ImportDocxNative.ts \
src/tests/backend/specs/import.ts \
src/tests/backend/specs/fixtures/sample.docx
git commit -m "feat(7538): native DOCX import via mammoth
Wraps mammoth.convertToHtml so a soffice-less Etherpad can ingest
.docx files. Images are coerced to data: URIs at the converter
boundary so the import pipeline never sees a remote src=."
Files:
src/node/handler/ExportHandler.tssrc/tests/backend/specs/export.tsThis is where we drop the flag-gated branch and wire the cascade.
soffice = nullIn src/tests/backend/specs/export.ts, find:
describe('native DOCX export (#7538)', function () {
before(function () {
try {
require.resolve('html-to-docx');
} catch {
this.skip();
return;
}
settings.soffice = 'false';
settings.nativeDocxExport = true;
});
Replace with:
describe('native DOCX export (#7538)', function () {
before(function () {
try {
require.resolve('html-to-docx');
} catch {
this.skip();
return;
}
settings.soffice = null;
});
Also update the line that sets these in the prior it('returns 500 on export error') block:
settings.soffice = 'false'; // '/bin/false' doesn't work on Windows
settings.nativeDocxExport = false;
Becomes:
settings.soffice = '/bin/false'; // forces a soffice spawn that errors
(The intent of that test is to exercise the soffice error path; with the cascade, that means soffice MUST be configured.)
Remove the line:
settingsBackup.nativeDocxExport = settings.nativeDocxExport;
from the outer before(...) block.
Above the closing }); of describe(__filename) and AFTER the htmlToPdfBuffer block, add:
describe('odt without soffice (#7538)', function () {
before(function () { settings.soffice = null; });
it('returns the "not enabled" message for odt', async function () {
const res = await agent.get('/p/testExportPad/export/odt').expect(200);
assert.match(res.text, /This export is not enabled/);
});
});
Inside the existing describe('native DOCX export (#7538)'), immediately after the two existing tests, add a sibling describe:
describe('native PDF export (#7538)', function () {
before(function () {
try {
require.resolve('pdfkit');
require.resolve('htmlparser2');
} catch {
this.skip();
return;
}
settings.soffice = null;
});
it('returns a valid %PDF- document', async function () {
const res = await agent.get('/p/testExportPad/export/pdf')
.buffer(true)
.parse((resp: any, callback: any) => {
const chunks: Buffer[] = [];
resp.on('data', (chunk: Buffer) => chunks.push(chunk));
resp.on('end', () => callback(null, Buffer.concat(chunks)));
})
.expect(200);
const body: Buffer = res.body as Buffer;
assert.ok(body.length > 200, 'PDF body must be non-trivial');
assert.strictEqual(body.slice(0, 5).toString('ascii'), '%PDF-');
});
it('sends application/pdf content-type', async function () {
const res = await agent.get('/p/testExportPad/export/pdf').expect(200);
assert.match(res.headers['content-type'], /application\/pdf/);
});
});
cd src && pnpm test --grep '#7538'
Expected: PDF tests fail (route returns 200 but with the "not enabled" body or 500 from soffice path); ODT test currently fails (route guard still blocks it). DOCX tests fail because cascade isn't in place yet — the nativeDocxExport=true shortcut is gone but the new cascade isn't there.
In src/node/handler/ExportHandler.ts, replace lines 90–144 (everything from the // Native DOCX path (issue #7538) comment block down through the await fsp_unlink(destFile); line) with:
// Soffice-first dispatch (issue #7538). When soffice is configured
// we keep the legacy convert-via-tempfile path; when it's not, we
// hand DOCX to html-to-docx and PDF to our pdfkit walker — both
// pure-JS, in-process. No fallback chain: native errors surface as
// 5xx so admins see real failures instead of silent shadowing.
const {sofficeAvailable} = require('../utils/Settings');
const offline = sofficeAvailable() === 'no'
|| (sofficeAvailable() === 'withoutPDF' && type === 'pdf');
if (offline) {
const {stripRemoteImages} = require('../utils/ExportSanitizeHtml');
const safeHtml = stripRemoteImages(html);
html = null;
try {
if (type === 'docx') {
const htmlToDocx = require('html-to-docx');
const buf = await htmlToDocx(safeHtml);
res.contentType(
'application/vnd.openxmlformats-officedocument.wordprocessingml.document');
res.send(buf);
return;
}
if (type === 'pdf') {
const {htmlToPdfBuffer} = require('../utils/ExportPdfNative');
const buf = await htmlToPdfBuffer(safeHtml);
res.contentType('application/pdf');
res.send(buf);
return;
}
// soffice-only formats (odt, doc) are blocked at the route guard
// when soffice is null; reaching here means the guard is wrong.
res.status(500).send(`Cannot export ${type} without soffice configured`);
return;
} catch (err) {
console.error(
`native ${type} export failed for pad "${padId}":`,
err && (err as Error).stack ? (err as Error).stack : err);
res.status(500).send(`Failed to export pad as ${type}.`);
return;
}
}
// soffice path — write the html export to a file
const randNum = Math.floor(Math.random() * 0xFFFFFFFF);
const srcFile = `${tempDirectory}/etherpad_export_${randNum}.html`;
await fsp_writeFile(srcFile, html);
// ensure html can be collected by the garbage collector
html = null;
// send the convert job to the converter (libreoffice)
const destFile = `${tempDirectory}/etherpad_export_${randNum}.${type}`;
// Allow plugins to overwrite the convert in export process
const result = await hooks.aCallAll('exportConvert', {srcFile, destFile, req, res});
if (result.length > 0) {
// console.log("export handled by plugin", destFile);
} else {
const converter = require('../utils/LibreOffice');
await converter.convertFile(srcFile, destFile, type);
}
// send the file
await res.sendFile(destFile, null);
// clean up temporary files
await fsp_unlink(srcFile);
// 100ms delay to accommodate for slow windows fs
if (os.type().indexOf('Windows') > -1) {
await new Promise((resolve) => setTimeout(resolve, 100));
}
await fsp_unlink(destFile);
cd src && pnpm test --grep '#7538'
Expected: DOCX tests pass; PDF tests pass; ODT test fails because the route guard hasn't been tightened yet (it's blocking ALL of pdf/docx/odt/doc when soffice is null, including pdf and docx that we just made native — wait, no: this is what the next task fixes. Actually, with the current guard the docx/pdf integration tests would have failed at Step 4 already. Re-check: the guard returns 200 with a "not enabled" message, which assert.strictEqual(body.slice(0,5)...) would fail.) The expected outcome of THIS step is DOCX and PDF integration tests still fail, walker-style failures may appear too. We move to Task 6 to fix the guard, then re-run.
If the unit-style tests for htmlToPdfBuffer, docxBufferToHtml, and stripRemoteImages still pass, that's enough to move on.
cd src && pnpm test --grep 'htmlToPdfBuffer\|docxBufferToHtml\|stripRemoteImages'
Expected: 12 passing (5 sanitizer + 5 walker + 2 mammoth).
cd /home/jose/etherpad/etherpad-lite/.claude/worktrees/pr-7538
git add src/node/handler/ExportHandler.ts src/tests/backend/specs/export.ts
git commit -m "feat(7538): soffice-first cascade in ExportHandler
Replaces the flag-gated DOCX branch with a deterministic dispatch:
soffice if configured, native DOCX/PDF otherwise, 5xx on native
error. Both native paths run plugin-modified HTML through
stripRemoteImages first."
Files:
Modify: src/node/hooks/express/importexport.ts
Modify: src/tests/backend/specs/export.ts (re-verify tests)
Step 1: Update the export guard
In src/node/hooks/express/importexport.ts, replace the block on lines 37–48:
// if soffice is disabled, and this is a format we only support with soffice, output a message
if (exportAvailable() === 'no' &&
['odt', 'pdf', 'doc', 'docx'].indexOf(req.params.type) !== -1) {
console.error(`Impossible to export pad "${req.params.pad}" in ${req.params.type} format.` +
' There is no converter configured');
// ACHTUNG: do not include req.params.type in res.send() because there is
// no HTML escaping and it would lead to an XSS
res.send('This export is not enabled at this Etherpad instance. Set the path to soffice ' +
'(LibreOffice) in settings.json to enable this feature');
return;
}
With:
// When soffice is disabled, only block formats with no native path.
// pdf and docx fall through to ExportHandler, which dispatches to
// the in-process converters (issue #7538).
if (exportAvailable() === 'no' &&
['odt', 'doc'].indexOf(req.params.type) !== -1) {
console.error(`Impossible to export pad "${req.params.pad}" in ${req.params.type} format.` +
' There is no converter configured');
// ACHTUNG: do not include req.params.type in res.send() because there is
// no HTML escaping and it would lead to an XSS
res.send('This export is not enabled at this Etherpad instance. Set the path to soffice ' +
'(LibreOffice) in settings.json to enable this feature');
return;
}
if (exportAvailable() === 'no') { ... } on the import side, but the implicit behavior is that useConverter becomes false and only built-in formats work). Verify by reading lines 73–90 of the current file.The import endpoint already implicitly handles the no-soffice case via useConverter = (converter != null) in ImportHandler.ts. After Task 7 wires native DOCX import there, no change is needed here.
cd src && pnpm test --grep '#7538'
Expected: native DOCX (2), native PDF (2), odt-without-soffice (1) — 5 passing.
cd src && pnpm test --grep 'export\.ts'
Expected: all green, including the pre-existing returns 500 on export error test which uses /bin/false as soffice.
cd /home/jose/etherpad/etherpad-lite/.claude/worktrees/pr-7538
git add src/node/hooks/express/importexport.ts
git commit -m "fix(7538): allow docx/pdf through export guard without soffice
Tightens the no-soffice block to ['odt','doc'] only — formats with
no native path. docx and pdf are handed to ExportHandler, which
dispatches to the in-process converters. Closes Qodo finding #2."
Files:
Modify: src/node/handler/ImportHandler.ts
Modify: src/tests/backend/specs/import.ts
Step 1: Write the failing integration test
Append to src/tests/backend/specs/import.ts, BELOW the existing docxBufferToHtml describe and ABOVE the closing }); of describe(__filename):
describe('end-to-end DOCX import (#7538)', function () {
before(function () {
try { require.resolve('mammoth'); }
catch { this.skip(); return; }
settings.soffice = null;
});
it('imports a docx into a pad without soffice', async function () {
const padId = 'test7538DocxImport';
// Reset pad
try { await padManager.removePad(padId); } catch { /* noop */ }
const fixture = path.join(__dirname, 'fixtures', 'sample.docx');
const res = await agent
.post(`/p/${padId}/import`)
.attach('file', fixture)
.expect(200);
assert.strictEqual(res.body.code, 0, `import failed: ${JSON.stringify(res.body)}`);
const pad = await padManager.getPad(padId);
const text = pad.text();
assert.match(text, /Heading/);
assert.match(text, /Paragraph body/);
assert.match(text, /one/);
assert.match(text, /two/);
});
it('rejects odt extension when soffice is null', async function () {
const padId = 'test7538OdtReject';
try { await padManager.removePad(padId); } catch { /* noop */ }
const fixture = path.join(__dirname, 'fixtures', 'sample.docx');
// copy fixture to a .odt name
const odtPath = path.join(__dirname, 'fixtures', 'sample.odt');
await fs.copyFile(fixture, odtPath);
try {
const res = await agent
.post(`/p/${padId}/import`)
.attach('file', odtPath);
// either 400 with a known status or rejected payload
assert.ok(
res.status >= 400 || res.body.code !== 0,
`expected odt import to fail when soffice is null, got: ${res.status} ${JSON.stringify(res.body)}`);
} finally {
await fs.unlink(odtPath).catch(() => undefined);
}
});
});
cd src && pnpm test --grep 'end-to-end DOCX import'
Expected: tests fail — likely the docx import either errors out (no converter) or empties the pad.
In src/node/handler/ImportHandler.ts:
a) Replace the block on lines 59–66:
let converter:any = null;
let exportExtension = 'htm';
// load soffice only if it is enabled
if (settings.soffice != null) {
converter = require('../utils/LibreOffice');
exportExtension = 'html';
}
with:
let converter: any = null;
let exportExtension = 'htm';
// load soffice only if it is enabled
if (settings.soffice != null) {
converter = require('../utils/LibreOffice');
exportExtension = 'html';
}
const NATIVE_NO_SOFFICE_OFFICE_FORMATS = new Set(['.pdf', '.odt', '.doc', '.rtf']);
b) After the fileEndingUnknown block (line 131) and BEFORE the const destFile = ... line (133), insert:
// Native DOCX import (issue #7538): when soffice isn't configured we
// hand .docx files to mammoth, which produces HTML — then we feed that
// through the existing setPadHTML pipeline by writing it to destFile.
if (settings.soffice == null && fileEnding === '.docx') {
const buf = await fs.readFile(srcFile);
const {docxBufferToHtml} = require('../utils/ImportDocxNative');
let nativeHtml: string;
try {
nativeHtml = await docxBufferToHtml(buf);
} catch (err: any) {
logger.warn(`Native DOCX import failed: ${err.stack || err}`);
throw new ImportError('convertFailed');
}
const destFileNative = path.join(tmpDirectory, `etherpad_import_${randNum}.html`);
await fs.writeFile(destFileNative, nativeHtml);
const pad = await padManager.getPad(padId, '\n', authorId);
try {
await importHtml.setPadHTML(pad, nativeHtml, authorId);
} catch (err: any) {
logger.warn(`Error importing native DOCX HTML: ${err.stack || err}`);
throw new ImportError('convertFailed');
}
padManager.unloadPad(padId);
const reloaded = await padManager.getPad(padId, '\n', authorId);
padManager.unloadPad(padId);
await padMessageHandler.updatePadClients(reloaded);
rm(srcFile);
rm(destFileNative);
return false;
}
// Without soffice, the legacy office formats (pdf, odt, doc, rtf) have
// no in-process path. Reject explicitly so the user sees a clear error
// instead of a silent ASCII-only fallback.
if (settings.soffice == null && NATIVE_NO_SOFFICE_OFFICE_FORMATS.has(fileEnding)) {
throw new ImportError('uploadFailed');
}
cd src && pnpm test --grep 'end-to-end DOCX import'
Expected: 2 passing.
cd src && pnpm test --grep 'import\.ts'
Expected: 4 passing (2 wrapper + 2 e2e).
cd /home/jose/etherpad/etherpad-lite/.claude/worktrees/pr-7538
git add src/node/handler/ImportHandler.ts src/tests/backend/specs/import.ts
git commit -m "feat(7538): native DOCX import path in ImportHandler
When soffice is null and the upload is .docx, run mammoth and feed
the resulting HTML through setPadHTML. Other office formats
(pdf/odt/doc/rtf) are explicitly rejected with uploadFailed instead
of silently falling through to the ASCII-only path."
Files:
Modify: src/static/js/pad_impexp.ts
Step 1: Update the gate
In src/static/js/pad_impexp.ts, replace lines 147–166:
// hide stuff thats not avaible if soffice is disabled
const wordFormat = clientVars.docxExport ? 'docx' : 'doc';
if (clientVars.exportAvailable === 'no') {
$('#exportworda').remove();
$('#exportpdfa').remove();
$('#exportopena').remove();
$('#importmessagenoconverter').prop('hidden', false);
} else if (clientVars.exportAvailable === 'withoutPDF') {
$('#exportpdfa').remove();
$('#exportworda').attr('href', `${padRootPath}/export/${wordFormat}`);
$('#exportopena').attr('href', `${padRootPath}/export/odt`);
$('#importexport').css({height: '142px'});
$('#importexportline').css({height: '142px'});
} else {
$('#exportworda').attr('href', `${padRootPath}/export/${wordFormat}`);
$('#exportpdfa').attr('href', `${padRootPath}/export/pdf`);
$('#exportopena').attr('href', `${padRootPath}/export/odt`);
}
With:
// DOCX and PDF are always available — soffice when configured,
// native pure-JS converters otherwise (issue #7538). ODT still
// requires soffice. The 'withoutPDF' branch (Windows soffice
// without PDF) is handled by the server-side cascade routing PDF
// through native; the UI link stays.
const wordFormat = clientVars.docxExport ? 'docx' : 'doc';
$('#exportworda').attr('href', `${padRootPath}/export/${wordFormat}`);
$('#exportpdfa').attr('href', `${padRootPath}/export/pdf`);
if (clientVars.exportAvailable === 'no') {
$('#exportopena').remove();
$('#importmessagenoconverter').prop('hidden', false);
} else {
$('#exportopena').attr('href', `${padRootPath}/export/odt`);
}
cd /home/jose/etherpad/etherpad-lite/.claude/worktrees/pr-7538/src
pnpm exec tsc --noEmit -p .
Expected: no errors related to pad_impexp.ts (project-wide ts-check should pass; Task 9 will also catch it).
cd /home/jose/etherpad/etherpad-lite/.claude/worktrees/pr-7538/src
SOFFICE=null pnpm run dev
In another terminal, open http://localhost:9001/p/test, click Import/Export, verify:
If you cannot run a dev server in this environment, skip this step and rely on the integration tests.
cd /home/jose/etherpad/etherpad-lite/.claude/worktrees/pr-7538
git add src/static/js/pad_impexp.ts
git commit -m "fix(7538): always show DOCX/PDF export links
Native paths (#7538) make DOCX and PDF available regardless of
soffice presence, so unconditionally render those links. ODT still
gates on exportAvailable. Closes Qodo finding #2 on the UI side."
nativeDocxExport settingThe flag is no longer needed — selection is purely soffice-presence-driven. Roll back the additions from commit b98dfbab7.
Files:
Modify: src/node/utils/Settings.ts
Modify: settings.json.template
Modify: settings.json.docker
Modify: doc/docker.md
Step 1: Remove the type field
In src/node/utils/Settings.ts line 208, delete this line:
nativeDocxExport: boolean,
In src/node/utils/Settings.ts, delete lines 419–426 (the /** ... */ block above and the nativeDocxExport: false, line).
In settings.json.template, delete the entire block containing the "nativeDocxExport": false, line and its preceding /* … */ JSDoc comment (around lines 354–362). Verify by:
grep -n 'nativeDocxExport\|NATIVE_DOCX' settings.json.template settings.json.docker doc/docker.md src/node/utils/Settings.ts
Expected: no results.
In settings.json.docker, delete the block on lines 372–377:
/*
* Convert DOCX exports in-process via html-to-docx instead of shelling
* out to LibreOffice. Auto-falls back to the LibreOffice path on error.
*/
"nativeDocxExport": "${NATIVE_DOCX_EXPORT:false}",
Delete the row on line 193:
| `NATIVE_DOCX_EXPORT` | Convert DOCX exports in-process with the bundled `html-to-docx` library instead of shelling out to LibreOffice. Auto-falls back to LibreOffice on error. Lets you skip installing `soffice` entirely for deployments that only need DOCX. | `false` |
grep -rn 'nativeDocxExport\|NATIVE_DOCX_EXPORT' src/ doc/ settings.json.template settings.json.docker 2>/dev/null
Expected: empty output.
cd src && pnpm exec tsc --noEmit -p .
Expected: no type errors.
cd src && pnpm test --grep 'export\.ts\|import\.ts'
Expected: all green — sanitizer (5), walker (5), mammoth wrapper (2), DOCX integration (2), PDF integration (2), odt-without-soffice (1), e2e import (2), pre-existing soffice 500 (1). Roughly 20 passing.
cd /home/jose/etherpad/etherpad-lite/.claude/worktrees/pr-7538
git add src/node/utils/Settings.ts settings.json.template settings.json.docker doc/docker.md
git commit -m "refactor(7538): drop nativeDocxExport flag
Selection is now purely soffice-presence-driven (Task 5 cascade).
The opt-in setting and its NATIVE_DOCX_EXPORT env var are no longer
needed — soffice configured means soffice path; soffice null means
native path. Reverts the additive surface introduced earlier in
this PR."
Files: none (CI / GitHub)
cd /home/jose/etherpad/etherpad-lite/.claude/worktrees/pr-7538
pnpm --filter ep_etherpad-lite run test
Expected: full pass. If any previously-passing test now fails (e.g. a soffice-dependent test that assumed exportAvailable() === 'no' blocks docx), investigate root cause — do NOT silently mute.
git push fork feat/native-docx-export-7538
Expected: ten new commits on top of the rebased base.
sleep 30 && gh pr checks 7568 --repo ether/etherpad
Expected: all checks pass or are in progress. If a check fails, fix the underlying issue and push again — do NOT mark the PR ready until all checks are green.
gh pr comment 7568 --repo ether/etherpad --body "$(cat <<'EOF'
Qodo follow-up:
1. **Requirement gap (DOCX still needs soffice)** — addressed. Removed the `nativeDocxExport` flag entirely. Selection is now purely soffice-presence-driven: soffice configured → soffice; soffice null → native (html-to-docx for DOCX, pdfkit for PDF). No fallback chain.
2. **DOCX blocked without soffice** — fixed. Tightened the route guard to `['odt','doc']` only when `exportAvailable() === 'no'`; pdf/docx fall through to ExportHandler's native dispatch. UI in pad_impexp.ts always shows DOCX + PDF links now.
3. **Native DOCX test bypass** — fixed. Tests use `settings.soffice = null` (was `'false'`) so they exercise the real no-soffice deployment shape.
4. **Unrestricted HTML-to-DOCX I/O** — addressed. New `stripRemoteImages` sanitizer drops non-`data:`/non-relative `` before either DOCX or PDF conversion. The PDF walker also rejects remote `` at its own boundary as defense-in-depth. No converter ever sees a remote URL.
Also added native PDF export (issue #7538's other half) and native DOCX import via mammoth — design committed at `docs/superpowers/specs/2026-05-08-native-docx-pdf-export-import-design.md`.
EOF
)"
Expected: comment posted; URL printed.
gh pr ready 7568 --repo ether/etherpad
Expected: Pull request #7568 is now ready for review. If maintainers prefer the PR stays draft until they review, skip this step.
gh pr edit 7568 --repo ether/etherpad --body "$(cat <<'EOF'
## Summary
Closes #7538. With this PR an Etherpad deployment with `settings.soffice = null` can:
- export pads as `html`, `txt`, `etherpad`, `docx`, `pdf` — all in-process, no subprocess, no native binaries
- import `.html`, `.txt`, `.etherpad`, `.docx` files — all in-process
Deployments with `settings.soffice` configured retain today's behavior bit-for-bit.
## Shape
Selection is purely soffice-presence-driven — there is no opt-in flag:
- `sofficeAvailable() === 'yes'` → existing soffice path
- `'withoutPDF'` (Windows) → soffice for everything except `pdf`, which goes native
- `'no'` (soffice null) → native DOCX/PDF; ODT/DOC remain blocked with a clear message
Native DOCX export uses `html-to-docx`; native PDF uses a small `pdfkit` + `htmlparser2` walker we own; native DOCX import uses `mammoth`. Plugin-modified HTML is run through `stripRemoteImages` first to close the SSRF surface Qodo flagged.
## Files
| File | Change |
|---|---|
| `src/node/utils/ExportSanitizeHtml.ts` | new — `stripRemoteImages` |
| `src/node/utils/ExportPdfNative.ts` | new — pdfkit walker |
| `src/node/utils/ImportDocxNative.ts` | new — mammoth wrapper |
| `src/node/handler/ExportHandler.ts` | soffice-first cascade for DOCX + PDF |
| `src/node/handler/ImportHandler.ts` | native DOCX import branch |
| `src/node/hooks/express/importexport.ts` | route guard tightened to `['odt','doc']` |
| `src/static/js/pad_impexp.ts` | DOCX + PDF links always visible |
| `src/package.json` | `pdfkit`, `htmlparser2`, `mammoth`, `html-to-docx` |
| `src/tests/backend/specs/export.ts` | revised + new tests |
| `src/tests/backend/specs/import.ts` | new — DOCX import tests |
| `src/tests/backend/specs/fixtures/sample.docx` | new fixture |
| `docs/superpowers/specs/...` | design spec |
## Out of scope (follow-ups)
- Native ODT export — no mature pure-JS writer
- Native PDF / ODT / DOC / RTF import — no mature pure-JS readers
- Memory/timeout caps on conversion — add when production signal warrants
## Test plan
- [x] `pnpm run ts-check` clean
- [x] Backend tests: sanitizer, walker, mammoth wrapper, DOCX + PDF integration, ODT negative, end-to-end DOCX import
- [x] Manual: with `SOFFICE=null`, export DOCX and PDF; both produce valid files
- [x] Manual: with `SOFFICE=null`, import the fixture .docx and verify pad content
Closes #7538
EOF
)"
Expected: description updated.
Spec coverage:
nativeDocxExport removal → Task 9exportHTMLSend plugin hook coverage on native paths — the spec says "verify against current behavior, don't expand scope". Task 5's cascade preserves the existing if (type === 'html') { exportHTMLSend ... } block at lines 82–88 untouched, so plugin behavior on the html branch is identical. Native DOCX/PDF do not invoke exportHTMLSend — same as the pre-PR LibreOffice path, which also doesn't call it. No change needed; this is a non-regression.Placeholder scan: No "TBD" / "TODO" / "implement later" / "fill in details" strings. All test assertions are concrete; all code blocks are complete.
Type consistency: htmlToPdfBuffer(html: string): Promise<Buffer> referenced in Tasks 3, 5; docxBufferToHtml(buf: Buffer): Promise<string> in Tasks 4, 7; stripRemoteImages(html: string): string in Tasks 2, 5. All match the spec.
Bail-out criterion (Task 3 Step 7) is concrete: line count threshold (>500) and a behavior threshold (test that fails because the walker can't render a class of content). Implementer has a clear stop signal.