MustardScript Sandbox for Attachment & File Access

plans/mustardscript-large-attachments.md

Generated by swarm planning session on 2026-04-20

Summary

For the local-agent flow, replace attachment-byte inlining with an on-disk storage model under .dyad/media/. The model is told in the user message that attachments are available at logical paths (attachments:<filename>), and a new agent tool, execute_sandbox_script, lets it generate short MustardScript (sandboxed JavaScript subset) snippets to read, slice, search, and aggregate file contents — returning only the concise result it actually needs. This solves context-window overflows, prompt cost, and provider latency on large attachments in the tool-capable local-agent path; as a bonus, the same tool can target any file the AI has scoped access to. When the request is not handled through src/pro/main/ipc/handlers/local_agent/local_agent_handler.ts, keep the current behavior: inline the attachment into the user message and do not add a tool loop to src/ipc/handlers/chat_stream_handlers.ts.

Problem Statement

Today, every attachment's bytes are inlined directly into the message payload sent to the LLM:

  • src/ipc/handlers/chat_stream_handlers.ts:1866–1930 reads attachment content and embeds it into TextPart / ImagePart objects per message.
  • For large files (logs, CSVs, multi-page PDFs, long source files), this produces three user-visible failures:
    1. Context overflow — the send hard-fails or the provider silently truncates.
    2. Cost spike — users pay prefill tokens on hundreds of KB of noise to get a small answer.
    3. Latency — large prompts are slow to first token, and every follow-up turn re-sends the same bytes.

The pain is most acute for power-user workflows: large error logs, spec PDFs, code dumps, long JSON/CSV exports. The fix is to stop inlining in the local-agent path and let the model ask precise questions (via a sandboxed script) about files that live on disk. Non-local-agent/default chat keeps its existing inline behavior until a separate, explicitly scoped default-chat tool-loop project exists.

Scope

In Scope (MVP)

  • On-disk attachments (A, local-agent only). When the turn is handled by src/pro/main/ipc/handlers/local_agent/local_agent_handler.ts, every user attachment (text and binary) is copied to .dyad/media/<sha256>.<ext> at send time. No size threshold — uniform rule. Text attachments are no longer inlined in this path.
  • Default-chat compatibility. Do not add tools: { execute_sandbox_script } or any other tool-loop machinery to src/ipc/handlers/chat_stream_handlers.ts. If the local-agent handler is not used, continue inlining the attachment into the user message exactly as the current default-chat path does.
  • Attachment-info user-message block (A, local-agent only). The outgoing user message gets a stable-position TextPart listing each attachment as attachments:<sanitizedOriginalFilename> with a terse type/size descriptor. The physical on-disk name (<sha256>.<ext>) is resolved by the host; the model never sees it. This is user-message content, not system-prompt content.
  • System-prompt invariance. The system prompt must not vary based on whether attachments are present in any mode. Do not add attachment-specific clauses, tool instructions, or platform-availability language to system prompts. Any attachment metadata belongs in the user message, and only for the local-agent path that can actually use it.
  • execute_sandbox_script tool (B). New agent tool wrapping MustardScript with a fixed host-capability set: read_file(path, opts?), list_files(dir), file_stats(path). No write_file, no fetch, no exec, no env. Read-only by design for v1.
  • Range-read support. read_file(path, { start?, length?, encoding? }) allows byte-range and head/tail reads so scripts avoid loading whole files.
  • Output-cap split. Tool result returns { value (≤64KB for LLM), truncated, fullOutputPath?, executionMs, instructionsUsed, heapBytesUsed }. Outputs larger than 64KB are additionally written to .dyad/media/script-output-<hash>.txt and the path is surfaced to both the LLM and the UI — the user-visible card can load the full result (up to ~1MB virtualized).
  • Consent model. Local-agent mode retains its current ask default to respect the existing user mental model there. Opt-out to never in Settings → Chat → Scripts.
  • First-run education. One-time, dismissible inline info strip anchored to the first Script card a user ever sees (not install-time, not a modal). Copy: "Dyad just ran a small script to read your file. You'll see each one here. Not into this? Turn it off in Settings → Chat → Scripts." Dismissed forever after one click. Plus a one-time composer-level tip on the user's first local-agent attachment: "Attachments stay on disk — Dyad reads what it needs when you send."
  • Transparency UI. ScriptCard component (mustard-amber accent), label "Script" (no "sandbox"), reuses DyadCard + DyadMcpToolCall expand/collapse. Collapsed by default on success, auto-expanded on error. Header auto-populates from the tool call's description field ("Read last 500 lines of server.log"), falling back to "Ran a script on server.log". Overflow menu on every card: Re-run · Copy script · Copy output · Manage scripts in Settings. Truncated outputs show "LLM saw X of Y" badge.
  • No default-chat tool-loop extension. Default chat is intentionally out of scope. Do not port the Pro ToolDefinition interface, do not add a generic registry, and do not wire Vercel AI SDK tools into chat_stream_handlers.ts for this project.
  • Small-model fallback UX. If a local-agent model returns a final reply without invoking the tool and there's an unreferenced on-disk attachment for the turn, render a gentle hint banner: "Your model didn't read the file — try a larger model or paste the contents inline." Prevents silent failure on Ollama 7B-class models in the tool-capable path.
  • Degraded-mode UX on unsupported platforms. If the MustardScript native binding is unavailable (e.g., linux-arm64), the tool's isEnabled() returns false and the local-agent attachment-info user-message block says "sandbox scripting unavailable on this platform" so the model doesn't attempt it. Attachments still land on disk in the local-agent path. Do not put platform availability in the system prompt.
  • Replay semantics. Replaying a prior chat message renders the stored script + result verbatim; it does NOT re-execute. Users get an explicit "Re-run" button on the card.
  • Backwards compatibility. Existing chats with inline attachments keep their inline bytes in history. New local-agent uploads go to disk; non-local-agent/default-chat uploads continue to inline. Attachment preparation handles the mixed history cleanly. Release notes call this out explicitly.
  • Lifecycle. Reuse cleanupOldMediaFiles() in src/main.ts for .dyad/media/ attachments (including script-output-*.txt). .dyad is already added to .gitignore via ensureDyadGitignored().
  • Power-user settings surface. Open .dyad/media/ button (using the literal path, not a euphemistic label), timeout ceiling configuration (2s default, up to 10s), consent toggle (always-allow ↔ ask ↔ never).
  • Security denylist. read_file rejects paths outside ctx.appPath; denies absolute paths, .. escapes, and a denylist covering .env*, .git/, node_modules/, ~/.ssh/, ~/.aws/, ~/.config/, .npmrc, .yarnrc, .pypirc, shell history files, ~/.netrc, *.key, *.pem. Path validation (allowlist + denylist) is the primary file-access guardrail; resource limits and timeouts provide additional containment.
  • Resource limits. 2s wall-clock default (10s user-configurable ceiling), 500ms per-host-call timeout, 16MB heap, 1M-instruction budget, per-call read_file size cap of 1MB.
  • Crash isolation. Wrap all MustardScript ExecutionContext calls in try/catch; add process-level uncaughtException and unhandledRejection guards so unexpected sandbox failures are surfaced instead of relying on a non-existent unhandledException event.
  • License hygiene. Add /NOTICE at repo root aggregating Apache-2.0 attribution (MustardScript + Playwright + any others); include MustardScript's NOTICE content if shipped in its tarball. Add a CI check for new Apache-2.0 deps.
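
The security-denylist rule in the list above can be sketched as a pure path check. This is a minimal illustration, not the planned capabilities.ts: the name `validateSandboxPath`, the string-based normalization, and the concrete shell-history file names are assumptions. Resolving symlinks against the real filesystem and mapping logical attachments:<name> paths are left out.

```ts
// Hypothetical sketch of the path guard; names and shapes are illustrative only.
type PathCheck = { ok: true; relative: string } | { ok: false; reason: string };

// Directory and file names taken from the denylist in this plan.
const DENIED_NAMES = new Set([
  ".git", "node_modules", ".ssh", ".aws", ".config",
  ".npmrc", ".yarnrc", ".pypirc", ".netrc",
  ".bash_history", ".zsh_history", // assumed examples of "shell history files"
]);

function isDeniedSegment(segment: string): boolean {
  return (
    DENIED_NAMES.has(segment) ||
    segment.startsWith(".env") ||
    /\.(key|pem)$/i.test(segment)
  );
}

function validateSandboxPath(requested: string): PathCheck {
  // Reject absolute paths (POSIX, UNC, Windows drive letters) and home shortcuts.
  if (/^(\/|\\|~|[A-Za-z]:)/.test(requested)) {
    return { ok: false, reason: "absolute or home-relative path" };
  }
  const segments: string[] = [];
  for (const part of requested.split(/[\\/]+/)) {
    if (part === "" || part === ".") continue;
    if (part === "..") {
      // A ".." with nothing left to pop would climb above the app root.
      if (segments.length === 0) return { ok: false, reason: "path escapes app root" };
      segments.pop();
      continue;
    }
    segments.push(part);
  }
  if (segments.length === 0) return { ok: false, reason: "empty path" };
  for (const segment of segments) {
    if (isDeniedSegment(segment)) return { ok: false, reason: `denied: ${segment}` };
  }
  return { ok: true, relative: segments.join("/") };
}
```

The check runs on the normalized path, so `src/../.env.local` is rejected even though the raw string does not start with `.env`.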

Out of Scope (Follow-up)

  • PDF/binary semantic reading. MustardScript gets text bytes only; PDFs and images are not passed through. Image attachments continue using the existing ImagePart path. A future pdf_to_text or image_ocr agent tool is the right shape, not pushing bytes into a 16MB VM heap.
  • User-invoked scripts (slash command / palette). Technically trivial, but has a different capability surface (likely wants write_file, longer timeout) and deserves its own scoping pass.
  • Live progress streaming during script execution ({ bytesRead } events). Adds an IPC channel + renderer subscription; M-sized. Ship static states first; revisit if p90 duration exceeds 500ms in telemetry.
  • Sidecar execution mode for MustardScript. Per the package docs, in-process is not a hard security boundary. Mitigated by path allowlist, denylist, size cap, and the fact that users already run AI-generated code via other agent tools. Sidecar is a v2 hardening option; runner.ts should be designed so swapping is a no-op for callers.
  • Cached script results across turns. execute_sandbox_script is always fresh. Memoization only within a single tool fan-out if needed.
  • Attachment management UI. A Settings surface showing "Manage attachments — NNNMB across NN chats [Clean up unused]" is post-launch.
  • Chat export privacy toggles for bundled attachment contents.
  • Any default-chat tool loop. execute_sandbox_script is not exposed in default chat in v1. Any default-chat tool support requires a separate scoping pass.

User Stories

  • As a developer debugging production in local-agent mode, I want to drop a 4MB error.log into chat and ask "group and count unique stack traces" without hitting context limits or paying for 4MB of tokens — the AI writes a script that reads only what it needs.
  • As a PM reviewing a spec in local-agent mode, I want to attach a long text export and ask "find sections mentioning auth" so the AI pulls back just relevant passages.
  • As a reviewer using local-agent mode, I want to attach a whole-repo text dump and ask "list every callsite of deprecatedFn" — the AI's script does the grep, I get the answer.
  • As a privacy-conscious user, I want to see every script the AI ran and its returned output in my chat transcript, with the ability to expand and inspect at any time.
  • As a default-chat user, I want existing attachment behavior to remain stable — if I am not using the local-agent path, Dyad still inlines attachments into my message and does not show script/tool UI.
  • As a power user, I want the "Open .dyad/media/" settings button so I can inspect or share the raw files directly.
  • As an Ollama-local user on a small model, I want graceful failure — if my model can't invoke the tool, I want a hint, not silence.

Success Metrics

Metrics retired by the "local-agent only + no default-chat tool loop" decisions:

  • On-disk usage share within local-agent mode — trivially 100% post-launch for that path.
  • Default-chat tool-use pickup — default chat continues to inline attachments and has no script tool in this project.

New leading indicators:

  • Tool-use pickup rate: share of local-agent attachment-bearing turns where the AI emits at least one read_file / execute_sandbox_script call. Target ≥95% on frontier models. Watch small/local models separately — this is the "did the feature work at all" signal.
  • Zero-tool-call attachment turns (counter-metric). If non-trivial in local-agent mode, the model is seeing the attachment-info user-message block and ignoring it — a product failure we need to catch.
  • Settings opt-out rate: share of users who disable scripts in local-agent mode. >2% should trigger investigation.
  • Tool-loop latency overhead: p50/p90 added latency per local-agent attachment turn. Uniform-on-disk in this mode means even trivial attachments pay a tool round-trip; this catches regressions in the common case.

Kept from prior framing (reframed):

  • Median & p90 input-token count per local-agent attachment turn vs. a 1-week pre-launch baseline. The local-agent always-on-disk decision only pays off if the AI actually narrows its reads — this metric proves it. Targets: -40% median, -80% p90.
  • Context-error rate (context_length_exceeded / provider-specific) on local-agent attachment-bearing chats. Target: -90%.

Instrumentation events to emit for the local-agent path (standard dashboard): attachment.stored, sandbox.script.run, sandbox.script.completed, sandbox.script.timeout, sandbox.script.truncated, sandbox.script.denied, sandbox.tool.unused_with_attachment.
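
As a rough illustration of how the pickup-rate indicator could be derived from these events, the sketch below counts attachment-bearing turns that ran at least one script. The `AttachmentTurn` record is an assumption; the real telemetry schema is not specified in this plan.

```ts
// Hypothetical per-turn rollup of the events listed above.
interface AttachmentTurn {
  turnId: string;
  scriptRuns: number; // number of sandbox.script.run events seen in the turn
}

// Share of local-agent attachment turns with at least one script run.
function toolUsePickupRate(turns: AttachmentTurn[]): number {
  if (turns.length === 0) return 0;
  const used = turns.filter((turn) => turn.scriptRuns > 0).length;
  return used / turns.length;
}

// Counter-metric: turns where the attachment was never read by a script.
function zeroToolCallTurns(turns: AttachmentTurn[]): number {
  return turns.filter((turn) => turn.scriptRuns === 0).length;
}
```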

UX Design

User Flow

  1. In local-agent mode, the user drops server.log (80MB) into the composer via existing drag-and-drop or file picker (src/hooks/useAttachments.ts). An attachment chip appears — uniform design, no size/type variant. On the user's first-ever local-agent attach, a dismissible inline tip appears under the composer: "Attachments stay on disk — Dyad reads what it needs when you send."
  2. User types a question ("what's the most common error?") and sends.
  3. In the main process, because this turn is handled by local_agent_handler.ts, the file is copied to .dyad/media/<sha256>.log. The outgoing user message gains an attachment-info TextPart:
    Attachments available on disk (use attachments:<name> with read_file / execute_sandbox_script):
    - attachments:server.log (80 MB, text/plain)
    
  4. The model responds by calling execute_sandbox_script with a short MustardScript that tails attachments:server.log, groups by error code, returns the top 5.
  5. A ScriptCard renders inline:
    • Running state: amber spinner, scramble-reveal verb (skimming, sifting, tailing, etc.), "Running script…" label.
    • Success state: collapsed, header from tool-call description ("Read last 500 lines of server.log"), stats Read 42KB · 812ms, expandable.
    • Error state: auto-expanded, red accent, error line visible, Re-run and Retry with guidance buttons.
  6. If this is the user's very first Script card ever, a small dismissible strip sits above it for onboarding: "Dyad just ran a small script to read your file. You'll see each one here. Not into this? Turn it off in Settings → Chat → Scripts." [Got it] [Settings]
  7. Below the card, the model's prose answer streams referencing the findings.
  8. If the local-agent model never invokes the tool despite an attachment, a gentle banner renders below the reply: "Your model didn't read the file — try a larger model or paste the contents inline."
  9. In default chat or any other path that does not use local_agent_handler.ts, no script tool is exposed and the attachment continues to be inlined into the user message.
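
To make step 4 concrete, here is the kind of logic such a script might contain, written as TypeScript against stand-ins for the host capabilities. The exact MustardScript subset is not described in this plan, so this is not valid MustardScript by assumption; the `ERROR <CODE>` log format and the function name `topErrorCodes` are also hypothetical.

```ts
// Host capability shapes as listed in this plan; callers supply the implementations.
type ReadFile = (path: string, opts?: { start?: number; length?: number }) => string;
type FileStats = (path: string) => { size: number; isText: boolean; mtime: string };

// Read only the tail of the file, count error codes, and return the top N as text.
function topErrorCodes(
  read_file: ReadFile,
  file_stats: FileStats,
  path: string,
  tailBytes: number,
  topN: number,
): string {
  const size = file_stats(path).size;
  const start = Math.max(0, size - tailBytes);
  const text = read_file(path, { start, length: size - start });

  const counts = new Map<string, number>();
  for (const line of text.split("\n")) {
    const code = /\bERROR\s+([A-Z0-9_]+)/.exec(line)?.[1]; // assumed log format
    if (!code) continue;
    counts.set(code, (counts.get(code) ?? 0) + 1);
  }
  return Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, topN)
    .map(([code, count]) => `${code}: ${count}`)
    .join("\n");
}
```

Only the short summary string returns to the model, and `tailBytes` would need to stay within the per-call read_file cap.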

Key States

  • Attachment chip (local-agent): uniform design across all types; no badge, no size split. Hover tooltip: "Stored at .dyad/media/<sha256>.log. Dyad reads what it needs." Default chat keeps existing inline-attachment semantics.
  • Script card — running: mustard-amber accent, animated verb, aria-live="polite" announces "Running script".
  • Script card — success (collapsed): one-liner header from description, stats Read 42KB · 812ms, chevron, keyboard-operable.
  • Script card — success (expanded): tabs Script (syntax-highlighted MustardScript) and Output (monospace, virtualized for >10KB, "Copy" / "Save as…" / search-within). Footer strip: instructionsUsed, heapBytesUsed for power users.
  • Script card — truncated output: "LLM saw 42KB of 850KB — [Open full output]" linking to the side pane backed by .dyad/media/script-output-*.txt.
  • Script card — error: auto-expanded, red accent, one-line error + "Re-run" + "Retry with guidance" buttons.
  • Script card — empty result: neutral accent, "Script returned empty — Dyad will try again" (softer than a dead end; common now that small files also use scripts).
  • Script card — timeout: "Script took too long — canceled (2s)" + retry.
  • Script card — overflow menu: Re-run · Copy script · Copy output · Manage scripts in Settings.
  • First-run inline strip: only above the user's first-ever Script card; dismissible.
  • First-attach composer tip: only on the user's first-ever local-agent attachment; dismissible.
  • Small-model fallback banner: when a local-agent attachment turn yields zero tool calls.
  • Settings → Chat → Scripts: script consent toggle (always-allow | ask | never), timeout ceiling slider (2s–10s), button Open .dyad/media/.

Interaction Details

  • Collapsed-by-default on success, auto-expanded on error — progressive disclosure.
  • Keyboard: card is a <button> with aria-expanded; Enter/Space toggles. Focus ring matches existing DyadCard.
  • Loading verbs: extend the existing pondering/conjuring/weaving family with file-appropriate entries (skimming, sifting, tailing, parsing, digesting, threading).
  • Copy/share: menu items in overflow handle script source + output.
  • Re-run button: explicit re-execution; no implicit replay.
  • Tooltip teaching: attachment chip tooltip carries the educational payload for the common case.
  • Delight: subtle mustard-jar icon on the first-run inline strip; non-MVP, low-fi.

Accessibility

  • Accessible name per card: aria-label="Script — read server.log — success, 812ms".
  • aria-live="polite" announces start and completion.
  • prefers-reduced-motion disables scramble-reveal.
  • Mustard-amber accent meets ≥4.5:1 on text, ≥3:1 on non-text in both light and dark modes (pre-audit before ship).
  • Errors use icon + text, not color alone.
  • "Skip script, jump to output" link available for screen readers on long script cards.

Technical Design

Architecture

Five layered components:

  1. Attachment data plane. In the local-agent path (src/pro/main/ipc/handlers/local_agent/local_agent_handler.ts), replace attachment byte inlining with always-on-disk attachment references. Do not make this change in src/ipc/handlers/chat_stream_handlers.ts; non-local-agent/default chat keeps inlining attachments into the user message. Handle mixed-history (legacy inlined attachments + new local-agent on-disk attachments) cleanly.
  2. Attachment-info block builder. A new utility that, given the attachments for the outgoing local-agent turn, emits a stable-position TextPart listing attachments:<name> with type/size. Placement: immediately before the user's text in the same user message, so provider prompt-cache boundaries stay consistent. This block is not part of the system prompt.
  3. System-prompt invariant. Across default chat, local-agent mode, plan mode, and any degraded platform state, system prompts must be identical for attachment and non-attachment turns. Attachment availability, unavailable-tool notices, and file metadata are user-message parts only.
  4. Sandbox runner. Located at src/ipc/utils/sandbox/ (non-Pro utility, but only wired into local-agent mode for v1). Contains runner.ts (MustardScript wrapper with lazy-init, resource limits, Promise.race timeout, try/catch + uncaughtException / unhandledRejection guard plan), capabilities.ts (read_file / list_files / file_stats host functions with path allowlist + denylist), limits.ts (timeout / heap / instruction budgets).
  5. execute_sandbox_script tool. Pro-mode definition stays under src/pro/main/ipc/handlers/local_agent/tools/execute_sandbox_script.ts (reuses the Pro ToolDefinition pattern). It is registered only through the local-agent tool system. No sibling default-chat tool, no direct wiring into streamText() in chat_stream_handlers.ts, and no default-chat generic tool-registry infrastructure.
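
The Promise.race timeout and try/catch wrapping named for runner.ts could look roughly like the sketch below. It stands in for the real runner: `runWithTimeout` and `RunOutcome` are hypothetical names, and the task is any promise-returning function rather than an actual MustardScript ExecutionContext call.

```ts
type RunOutcome<T> =
  | { status: "ok"; value: T; executionMs: number }
  | { status: "timeout"; executionMs: number }
  | { status: "error"; message: string; executionMs: number };

async function runWithTimeout<T>(
  task: () => Promise<T>,
  timeoutMs: number,
): Promise<RunOutcome<T>> {
  const started = Date.now();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), timeoutMs);
  });
  try {
    // Whichever settles first wins; the task result is boxed so it cannot be
    // confused with the "timeout" sentinel.
    const winner = await Promise.race([task().then((value) => ({ value })), timeout]);
    const executionMs = Date.now() - started;
    if (winner === "timeout") return { status: "timeout", executionMs };
    return { status: "ok", value: winner.value, executionMs };
  } catch (err) {
    // Synchronous throws and rejections both end up here instead of escaping.
    const message = err instanceof Error ? err.message : String(err);
    return { status: "error", message, executionMs: Date.now() - started };
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Promise.race only stops waiting; it does not cancel the losing task. A script that blocks the event loop synchronously would also keep the timer from firing, so the instruction budget presumably remains the limit that bounds such scripts.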

Components Affected

Attachment flow (modify):

  • src/pro/main/ipc/handlers/local_agent/local_agent_handler.ts — switch local-agent attachment handling to .dyad/media/ references and the attachment-info user-message block.
  • src/ipc/handlers/chat_stream_handlers.ts — preserve existing default-chat inline attachment behavior. Do not add a tool loop or script tool wiring here.
  • src/ipc/utils/media_path_utils.ts — add helpers for resolving attachments:<name> → <sha256>.<ext>.
  • src/ipc/types/chat.ts — ChatAttachmentSchema unchanged on wire; runtime types track onDiskPath + logicalName.
  • src/hooks/useAttachments.ts — frontend stays; chip uniform across types (no badge, tooltip carries the teaching).
  • src/main.ts — cleanupOldMediaFiles() continues to operate; ensure script-output-*.txt is also swept.
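
One way the logical-name helpers for media_path_utils.ts might look is sketched below. The field names `logicalName` and `onDiskPath` follow the runtime types mentioned above; the function names, the sanitization character set, and the per-turn lookup array are assumptions.

```ts
// One stored attachment for the current turn (shape assumed for illustration).
interface StoredAttachment {
  logicalName: string; // sanitized original file name, e.g. "server.log"
  onDiskPath: string;  // physical location, e.g. ".dyad/media/<sha256>.log"
}

const ATTACHMENT_PREFIX = "attachments:";

// Keep only the base name and replace anything outside a conservative set.
function sanitizeFilename(original: string): string {
  const base = original.split(/[\\/]/).pop() ?? "";
  const cleaned = base.replace(/[^A-Za-z0-9._-]/g, "_").replace(/^\.+/, "");
  return cleaned.length > 0 ? cleaned : "attachment";
}

// Map "attachments:<name>" to the physical path; undefined means "not an
// attachment for this turn", so the caller can fall through to normal checks.
function resolveLogicalPath(
  path: string,
  stored: StoredAttachment[],
): string | undefined {
  if (!path.startsWith(ATTACHMENT_PREFIX)) return undefined;
  const name = path.slice(ATTACHMENT_PREFIX.length);
  return stored.find((attachment) => attachment.logicalName === name)?.onDiskPath;
}
```

Because resolution happens on the host side, the hashed file name never has to appear in anything the model reads or writes.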

Sandbox runner (new, shared):

  • src/ipc/utils/sandbox/runner.ts
  • src/ipc/utils/sandbox/capabilities.ts
  • src/ipc/utils/sandbox/limits.ts
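
As an illustration of what limits.ts might hold, the sketch below collects the numeric limits stated in this plan and clamps the user-configurable timeout to the 2s–10s range. The object shape and the names `SANDBOX_LIMITS` and `resolveTimeoutMs` are assumptions.

```ts
// Values come from this plan; the structure is hypothetical.
const SANDBOX_LIMITS = {
  defaultTimeoutMs: 2_000,        // wall-clock default
  maxTimeoutMs: 10_000,           // user-configurable ceiling
  hostCallTimeoutMs: 500,         // per host call
  heapBytes: 16 * 1024 * 1024,    // 16MB heap
  instructionBudget: 1_000_000,   // 1M instructions
  readFileMaxBytes: 1024 * 1024,  // per-call read_file cap
  scriptMaxBytes: 32 * 1024,      // max script source size
  llmOutputMaxBytes: 64 * 1024,   // value returned to the LLM
} as const;

// Clamp a settings value into [default, ceiling]; fall back to the default
// when the setting is missing or not a finite number.
function resolveTimeoutMs(userSettingMs?: number): number {
  if (userSettingMs === undefined || !Number.isFinite(userSettingMs)) {
    return SANDBOX_LIMITS.defaultTimeoutMs;
  }
  const rounded = Math.round(userSettingMs);
  return Math.min(
    SANDBOX_LIMITS.maxTimeoutMs,
    Math.max(SANDBOX_LIMITS.defaultTimeoutMs, rounded),
  );
}
```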

Tool system (new):

  • src/pro/main/ipc/handlers/local_agent/tools/execute_sandbox_script.ts — Pro definition using the runner.
  • src/pro/main/ipc/handlers/local_agent/tool_definitions.ts — register it for Pro agent.
  • No default-chat tool registration in src/ipc/handlers/chat_stream_handlers.ts.

UI (new / modify):

  • src/components/chat/ScriptCard.tsx — new component (reuses DyadCard + DyadMcpToolCall patterns), label "Script", overflow menu, stats strip.
  • src/components/chat/AttachmentsList.tsx — uniform chip; tooltip with on-disk path.
  • src/components/chat/ChatMessage.tsx (or equivalent) — render local-agent script tool-call / tool-result parts using ScriptCard.
  • src/components/chat/* — first-run inline strip (anchored to first Script card), first-attach composer tip, small-model fallback banner.
  • src/pages/settings/* — Settings → Chat → Scripts section with consent toggle, timeout ceiling, "Open .dyad/media/" button.

Native binary / packaging:

  • forge.config.ts — asar-unpack @mustardscript/binding-*/*.node.
  • Ensure macOS code-signing covers the .node files.
  • Platform gating: isEnabled: () => isSupportedPlatform() on execute_sandbox_script; attachment-info user-message block communicates unavailability in local-agent mode only.
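
A minimal sketch of the platform gate, assuming it is driven by a static list of the platform/arch pairs this plan names for verification. The plan writes `isSupportedPlatform()` without arguments; the version here takes them explicitly so it can be exercised in isolation. A real gate might instead probe whether the native binding actually loads.

```ts
// Pairs use Node's process.platform / process.arch spelling.
const SUPPORTED_TARGETS = new Set([
  "darwin-arm64",
  "darwin-x64",
  "linux-x64",
  "win32-x64",
]);

function isSupportedPlatform(platform: string, arch: string): boolean {
  return SUPPORTED_TARGETS.has(`${platform}-${arch}`);
}

// Extra line for the attachment-info user-message block in degraded mode.
// Returns undefined when the tool is available, so nothing is appended.
function unavailabilityNotice(platform: string, arch: string): string | undefined {
  return isSupportedPlatform(platform, arch)
    ? undefined
    : "sandbox scripting unavailable on this platform";
}
```

The notice is only ever placed in the user message, which keeps the system prompt identical on supported and unsupported platforms.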

Licensing:

  • /NOTICE at repo root with Apache-2.0 attributions. CI check for new Apache-2.0 deps.

Data Model Changes

  • Database: none required for MVP. Scripts + results persist inside the existing aiMessagesJson column via tool_call / tool_result parts.
  • On-disk: .dyad/media/ continues to hold attachment files; adds .dyad/media/script-output-<hash>.txt for oversized script returns. .dyad/ already in gitignore.
  • No schema migration. Legacy chats with inline bytes keep their inline bytes. Default-chat messages continue to store inline attachments.

API Changes

execute_sandbox_script tool:

```ts
// Input
{
  script: string;       // MustardScript source; max 32 KB
  description?: string; // One-line human explanation rendered on the card
}

// Output (stringified JSON as tool result)
{
  value: string;             // Return value, ≤ 64 KB
  truncated: boolean;
  fullOutputPath?: string;   // `.dyad/media/script-output-<hash>.txt` if truncated
  executionMs: number;
  instructionsUsed: number;
  heapBytesUsed: number;
}
```
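
The 64 KB cap on `value` could be applied along the lines of the sketch below. It covers only the truncation step; `capForLlm` is a hypothetical name, and writing the full output to `.dyad/media/script-output-<hash>.txt` and setting `fullOutputPath` is assumed to happen in the caller.

```ts
const LLM_OUTPUT_CAP_BYTES = 64 * 1024;

// Truncate by UTF-8 byte length so the cap holds for non-ASCII output too.
function capForLlm(
  fullOutput: string,
  capBytes: number = LLM_OUTPUT_CAP_BYTES,
): { value: string; truncated: boolean } {
  const bytes = new TextEncoder().encode(fullOutput);
  if (bytes.length <= capBytes) return { value: fullOutput, truncated: false };

  // Cutting mid-character leaves a partial sequence that decodes to U+FFFD;
  // strip it so the model never sees a broken trailing character.
  const prefix = new TextDecoder("utf-8").decode(bytes.subarray(0, capBytes));
  return { value: prefix.replace(/\uFFFD+$/, ""), truncated: true };
}
```

When `truncated` is true, the card can show the "LLM saw X of Y" badge from the byte lengths of `value` and the full output.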

Host capabilities exposed into the MustardScript context (fixed set, no free-form):

```ts
read_file(path: string, opts?: {
  start?: number;
  length?: number;
  encoding?: 'utf8' | 'base64';
}): string;

list_files(dir: string): string[];

file_stats(path: string): {
  size: number;
  isText: boolean;
  mtime: string;
};
```
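
The range semantics of `read_file` can be illustrated on an in-memory buffer. This sketch makes several assumptions that the plan does not settle: out-of-range values are clamped, a read over the per-call cap is rejected rather than shortened, and the helper name `readRange` is hypothetical. Real file I/O and path validation are omitted.

```ts
interface ReadOpts {
  start?: number;
  length?: number;
  encoding?: "utf8" | "base64";
}

function readRange(
  bytes: Uint8Array,
  opts: ReadOpts = {},
  maxBytes: number = 1024 * 1024, // per-call cap from this plan
): string {
  const start = Math.min(Math.max(0, opts.start ?? 0), bytes.length);
  const requested = opts.length ?? bytes.length - start;
  if (requested < 0) throw new RangeError("length must be non-negative");
  // Clamp to the end of the data, then enforce the per-call cap.
  const length = Math.min(requested, bytes.length - start);
  if (length > maxBytes) {
    throw new RangeError(`read exceeds per-call cap of ${maxBytes} bytes`);
  }
  const slice = bytes.subarray(start, start + length);

  if ((opts.encoding ?? "utf8") === "utf8") {
    return new TextDecoder("utf-8").decode(slice);
  }
  // base64: build a binary string one byte at a time, then encode it.
  let binary = "";
  slice.forEach((byte) => {
    binary += String.fromCharCode(byte);
  });
  return btoa(binary);
}
```

A tail read is then `start = file_stats(path).size - n` with `length = n`, which lets a script avoid loading the whole file.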

Attachment-info user-message block format (v1, frozen for schema stability):

Attachments available on disk (use attachments:<name> with read_file / execute_sandbox_script):
- attachments:server.log (80 MB, text/plain)
- attachments:spec.txt (4 KB, text/plain)
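
A builder for this block might look like the sketch below. The header and line format are copied from the example above; the `AttachmentInfo` shape, the function names, and the rounding of sizes to whole units are assumptions.

```ts
interface AttachmentInfo {
  logicalName: string; // sanitized original file name
  sizeBytes: number;
  mimeType: string;
}

// Terse size descriptor, rounded to a whole number of the largest fitting unit.
function formatSize(bytes: number): string {
  if (bytes < 1024) return `${bytes} B`;
  const units = ["KB", "MB", "GB"];
  let value = bytes / 1024;
  let unit = 0;
  while (value >= 1024 && unit < units.length - 1) {
    value /= 1024;
    unit += 1;
  }
  return `${Math.round(value)} ${units[unit]}`;
}

function buildAttachmentInfoBlock(attachments: AttachmentInfo[]): string {
  const header =
    "Attachments available on disk (use attachments:<name> with read_file / execute_sandbox_script):";
  const lines = attachments.map(
    (a) => `- attachments:${a.logicalName} (${formatSize(a.sizeBytes)}, ${a.mimeType})`,
  );
  return [header, ...lines].join("\n");
}
```

Generating the text from one function keeps the format identical across turns, which matters because the format is frozen for v1.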

Implementation Plan

Phase 0: Native-binary blocker spike (1–2 days)

  • Install mustardscript in Dyad; verify optional binding downloads correctly on mac-arm64, mac-x64, linux-x64, win-x64.
  • Configure forge.config.ts asarUnpack for @mustardscript/binding-*/*.node.
  • Confirm macOS code-signing and notarization succeed on a test build including the native module.
  • Benchmark cold-start cost; confirm lazy-init keeps app startup clean.
  • Verify ExecutionContext exceptions do not escape to kill Electron main (try/catch + process guard).
  • Document a fallback plan (isolated-vm / QuickJS-WASM) in case any platform fails; do not pick up the fallback unless the spike fails.

Phase 1: Attachment data plane + attachment-info block — local-agent only (2–3 days)

  • Always-on-disk attachment path in src/pro/main/ipc/handlers/local_agent/local_agent_handler.ts. No threshold branch. Strip inline-text embedding in this path only.
  • Preserve current attachment inlining in src/ipc/handlers/chat_stream_handlers.ts; do not add tool-loop wiring there.
  • Attachment-info TextPart builder with stable placement in the user message.
  • Add a guard/test that system prompts are identical for attachment and non-attachment turns across all modes touched by this project.
  • Frontend uniform chip; tooltip with literal on-disk path; first-attach composer tip.
  • Mixed-history handling for legacy inlined attachments.
  • Telemetry: emit attachment.stored.
  • Unit tests for attachment save / attachment-info format / mixed-history / default-chat still inlines.
  • E2E: attach a large log in local-agent mode, verify attachment-info block, verify no inline bytes in the outgoing local-agent message; default chat still inlines; legacy chat still renders correctly.
  • Ship behind a feature flag for internal dogfooding.

Phase 2: execute_sandbox_script tool — Pro agent first (4–5 days)

  • Build src/ipc/utils/sandbox/ runner + capabilities + limits.
  • Path allowlist / denylist (.env*, .git/, node_modules/, ~/.ssh/, ~/.aws/, ~/.config/, .npmrc, .yarnrc, .pypirc, shell history files, ~/.netrc, *.key, *.pem); traversal test coverage.
  • execute_sandbox_script.ts Pro tool definition + registration.
  • Oversized-output spill to .dyad/media/script-output-<hash>.txt, with path surfaced in tool result.
  • ScriptCard UI with all states (running/success/error/timeout/empty/truncated); overflow menu.
  • First-run inline strip anchored to first Script card.
  • Settings → Chat → Scripts surface (consent toggle, timeout ceiling, open-folder button). Local-agent mode retains ask default.
  • Telemetry: sandbox.script.{run,completed,timeout,truncated,denied}.
  • Unit tests (runner, capabilities, path escape, limits, output spill).
  • E2E (Pro agent): successful run, timeout, denied consent, oversized output spill, replay does not re-execute.

Phase 3: Non-local-agent compatibility + prompt invariance (1–2 days)

  • Audit src/ipc/handlers/chat_stream_handlers.ts and verify no tools: { execute_sandbox_script }, no Vercel tool-loop additions, and no attachment-specific system-prompt branches are introduced.
  • If a turn is not handled by local_agent_handler.ts, continue inlining attachment content into the user message.
  • Persist script tool_call + tool_result parts only for local-agent messages in aiMessagesJson (reuse existing local-agent tool persistence patterns).
  • Provider/model capability gating: only expose the tool through the local-agent tool set when the selected model supports tool calling.
  • Small-model fallback banner when a local-agent attachment turn yields zero tool calls.
  • Degraded-mode handling on unsupported platforms uses local-agent user-message attachment info, not system-prompt changes.
  • E2E: default chat with an attachment still sends inline content and has no script card; local-agent large-log flow invokes the script tool; platform-gated degraded mode does not vary the system prompt.
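
The small-model fallback condition from this phase reduces to a small predicate. The `TurnSummary` shape and the function name below are assumptions; the same condition could presumably drive the sandbox.tool.unused_with_attachment event.

```ts
// Hypothetical summary of a completed turn.
interface TurnSummary {
  handledByLocalAgent: boolean;
  finished: boolean;             // the model has returned its final reply
  onDiskAttachmentCount: number; // attachments stored for this turn
  scriptToolCalls: number;       // execute_sandbox_script calls in this turn
}

// Show the banner only when a finished local-agent turn had on-disk
// attachments and the model never invoked the script tool.
function shouldShowSmallModelBanner(turn: TurnSummary): boolean {
  return (
    turn.handledByLocalAgent &&
    turn.finished &&
    turn.onDiskAttachmentCount > 0 &&
    turn.scriptToolCalls === 0
  );
}
```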

Phase 4: Hardening + launch prep (~3–4 days)

  • /NOTICE file with Apache-2.0 attributions; CI check for new Apache-2.0 deps.
  • Security doc: docs/security.md section on the MustardScript threat model and mitigations, including an explicit note that path validation (allowlist + denylist) is the primary file-access guardrail under always-allow, with resource limits and timeouts as additional containment.
  • Platform gating polish; banner copy.
  • Managed-model cost delta forecast (if Dyad subsidizes any model calls).
  • Ollama small-model QA gate: tool-call success rate ≥80% on llama3.1:8b and qwen2.5:7b before GA; otherwise ship with a small-model warning.
  • Dogfood period on internal builds.
  • Release notes: "Large files now work in agent mode — upload anything, Dyad reads just what it needs." Explicitly call out: legacy chats retain inline attachments; new local-agent uploads are on-disk; default chat still inlines attachments.
  • Flag removal.

Total MVP: ~3 weeks (one engineer), with ~1 week buffer if the native-binary spike surfaces platform issues. Each phase ends with something shippable behind a flag.

Testing Strategy

  • Vitest — runner: happy path, budget violation, host-capability rejection, timeout, crash-isolation (native-addon fault doesn't escape), non-determinism tolerance.
  • Vitest — capabilities: read_file path escape (.., absolute paths, symlinks where OS permits), denylist coverage (.env, .ssh, .aws, keys, pems), size cap, range-read correctness.
  • Vitest — attachment handler: local-agent always-on-disk for text and binary; attachment-info user-message block format is stable; default-chat attachments continue to inline; image attachments continue using existing default-chat behavior; legacy-inlined messages render correctly (mixed history).
  • Vitest — system prompt invariance: attachment and non-attachment turns produce the same system prompt in every affected mode, including degraded platform states.
  • Vitest — output-cap spill: fullOutputPath is written and returned for >64KB results.
  • Playwright E2E — happy path (Pro agent): attach large log, script runs, result card renders with script + output.
  • Playwright E2E — default chat compatibility: attach a large log in default chat; confirms inline content is still sent and no script card/tool loop appears.
  • Playwright E2E — consent: Local-agent ask default triggers modal; denial is logged.
  • Playwright E2E — error paths: script exceeds instruction budget; hits timeout; user cancels; empty return.
  • Playwright E2E — replay: reopening a chat with prior scripts renders them without re-execution; "Re-run" button executes fresh.
  • Playwright E2E — small-model fallback: local-agent turn yields no tool call → fallback banner appears.
  • Playwright E2E — platform gating: on a simulated unsupported platform, tool is disabled and local-agent user-message attachment info communicates unavailability without changing the system prompt.
  • Fixtures: commit 3–4 attachments under e2e-tests/fixtures/attachments/ (large log, CSV, JSON, small text); keep <1MB total.
  • Manual QA matrix: mac-arm64, mac-x64, linux-x64, win-x64. On linux-arm64 (if in shipping matrix), confirm graceful degraded mode.
  • Local-LLM QA gate: ≥80% tool-call success on llama3.1:8b and qwen2.5:7b before GA.

Risks & Mitigations

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| MustardScript native binary fails to load in Electron on some platform | Med | High | Phase 0 spike gates everything else; platform-specific isEnabled fallback; documented interpreter fallback plan |
| In-process sandbox is not a hard security boundary; prompt-injected attachment causes the LLM to write an exfil script | Med | High | Conservative path validation with strict allowlist and denylist including .env/.ssh/.aws/.npmrc/.pypirc/shell history/keys/pems; lazy-init; Script card shows source (transparency); sidecar mode on roadmap; launch-blocker review must focus here |
| Small local models (Ollama 7B-class) can't reliably emit tool calls in local-agent mode | High | Med | Fallback banner when a local-agent attachment turn yields no tool call; QA gate ≥80% on llama3.1:8b + qwen2.5:7b; if below, ship with warning banner |
| Oversized script returns re-create the original context-blowup problem | Med | Med | 64KB LLM cap with truncation signaling; spill to disk for UI viewing; tool description encourages .slice/.filter/.reduce returns |
| Native addon crash takes down Electron main process | Low | High | ExecutionContext calls wrapped in try/catch; process-level uncaughtException and unhandledRejection guards; verified in Phase 0 spike |
| Always-on-disk local-agent attachment handling widens prompt-injection attack surface in that mode | Med | High | Denylist extension (noted above); explicit mention in security doc; no-auto-replay policy prevents re-entrancy |
| Cost delta for managed-model users from added local-agent tool-loop tokens | Low | Med | Forecast input-token delta in Phase 4; track tool-loop latency overhead metric post-launch |
| Prompt-cache regression on small local-agent attachments (uniform on-disk means every attachment pays a tool round-trip in that mode) | Med | Low | Scramble-reveal verbs cover latency emotionally; track tool-loop latency overhead p50/p90; accept as scope given user's preference for mental-model consistency in local-agent mode |
| Mixed-history chats (legacy inlined + new local-agent on-disk) render inconsistently | Low | Med | Explicit mixed-history handling in attachment preparation; E2E test coverage; release-notes call-out |
| Accidental default-chat tool-loop wiring changes behavior in chat_stream_handlers.ts | Med | High | Explicit Phase 3 audit; tests proving default chat still inlines attachments and exposes no script tool |
| Attachment-specific system-prompt changes fragment behavior or prompt caching | Med | High | System-prompt invariance test for attachment vs. non-attachment turns in all affected modes; attachment metadata stays in user-message parts only |
| Apache-2.0 NOTICE obligation overlooked for bundled deps | Low | Low | One-time /NOTICE authoring; CI check on new Apache-2.0 deps |
| Users surprised by MustardScript v0.1.1 alpha status / maintainership | Low | Med | Pin exact version; add a Dyad-CI canary that re-runs MustardScript's own tests on each bump |
| Cold-start cost of native addon delays first paint | Low | Med | Lazy-init module only on first script execution; never at app startup |
| .dyad/media/ grows unbounded across sessions | Med | Low | Reuse cleanupOldMediaFiles(); Settings "Manage attachments" roadmapped for follow-up |
| Settings opt-out rate spikes (users uncomfortable with local-agent scripts) | Low | Med | >2% threshold triggers investigation; first-run inline strip clearly signals how to disable |
| Chat export leaks attachment content users didn't realize was bundled | Low | Med | Explicit "include attachment contents" toggle on export (post-MVP) |
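
The path-validation mitigation in the table can be illustrated with a small check over paths relative to an allowed root. This is a sketch under stated assumptions rather than the planned implementation: `isPathAllowed` and the exact denylist entries are hypothetical, chosen to mirror the categories the table names (.env, .ssh, .aws, .npmrc, .pypirc, shell history, keys, pems). It works on path strings only, so a real validator would additionally need to resolve symlinks before deciding.

```typescript
// Hypothetical denylist of exact segment names (compared case-insensitively).
const DENIED_NAMES = [".ssh", ".aws", ".npmrc", ".pypirc"];

// Returns true only for a relative path that stays under the allowed root
// and touches no denylisted segment. Purely string-based: it does not
// consult the filesystem, so symlink resolution must happen elsewhere.
function isPathAllowed(relativePath: string): boolean {
  // Treat Windows separators like POSIX ones before inspecting segments.
  const normalized = relativePath.replace(/\\/g, "/");

  // Reject absolute paths (POSIX root or Windows drive letter).
  if (/^\//.test(normalized) || /^[A-Za-z]:/.test(normalized)) return false;

  const segments = normalized.split("/").filter((s) => s !== "" && s !== ".");
  if (segments.length === 0) return false;

  for (const raw of segments) {
    const seg = raw.toLowerCase();
    if (seg === "..") return false; // no climbing out of the root
    if (DENIED_NAMES.indexOf(seg) !== -1) return false;
    if (/^\.env(\..*)?$/.test(seg)) return false; // .env, .env.local, ...
    if (/^\..*_history$/.test(seg)) return false; // .bash_history, .zsh_history
    if (/\.(pem|key)$/.test(seg)) return false; // key material by extension
  }
  return true;
}
```

Rejecting any `..` segment outright, instead of resolving it, is deliberately conservative: it refuses some harmless paths but leaves no traversal case to reason about.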

Open Questions

To be resolved during implementation (not blocking planning):

  1. Exact MustardScript version for pinning. Depends on upstream release cadence between now and Phase 2. Pin an exact version rather than a ^0.1 range, and expect every bump to be manual.
  2. First-run strip / toast copy. Draft proposed; UX to polish during Phase 2.
  3. Loading-verb set specific to script execution. Proposed list (skimming, sifting, tailing, parsing, digesting, threading); UX to finalize during Phase 2.
  4. Small-model QA thresholds. ≥80% on two Ollama models is the proposed gate; final numbers tuned after first test run.
  5. Platform gating details for unsupported architectures — precise banner copy. Decide after the Phase 0 spike reveals which platforms need gating.

Decision Log

  • Always on-disk, no size threshold in local-agent mode (user). Trade-off: loses prompt-cache efficiency on small attachments in the local-agent path. Gain: single mental model ("attachments are files on disk") for the tool-capable mode. Accepted.
  • Keep .dyad/media/ on disk; alias as attachments: in local-agent user-message attachment info only (user). Trade-off: filesystem name and LLM-facing name diverge. Gain: zero rename churn across 15+ files. UI copy uses whichever reads naturally; "Open .dyad/media/" button uses the literal path so power users see the transition consistently.
  • No default-chat tool loop in chat_stream_handlers.ts (user). If src/pro/main/ipc/handlers/local_agent/local_agent_handler.ts is not used, continue inlining attachments into the user message. Default-chat tool support requires a separate plan.
  • System prompt never varies with attachments (user). Attachment presence, attachment paths, tool availability, and degraded platform state must not alter the system prompt in any mode. Put attachment metadata in user-message parts only.
  • Local-agent consent retains ask default (user + Eng refinement). Trade-off: a per-call modal remains. Gain: Pro/local-agent users keep existing behavior they've opted into.
  • Sandbox runner lives at src/ipc/utils/sandbox/ (Eng). Shared utility location avoids coupling the runner to Pro internals, but v1 wires it only through the local-agent tool system. License-compatible (MustardScript is Apache-2.0).
  • Label "Script", not "Sandbox Script" (UX). The "sandbox" word implies a security frame the team deliberately steps away from whenever users enable always-allow.
  • First-run education anchored to the first Script card (UX). Non-blocking inline strip at the moment the user actually encounters the feature; no install-time modal.
  • "Open .dyad/media/" button uses the literal path (UX). Prevents confusion at the point where the naming split does surface.
  • Read-only capabilities only in v1 (Eng). No write_file/fetch/exec. Huge return values use UI-side spill, not script-side write. Drastically reduced attack surface.
  • Confused-LLM threat model, not adversarial-user (Eng). In-process isn't a hard boundary; path allowlist + denylist + lazy-init + transparent card are the defense. Sidecar deferred.
  • Output cap split 64KB (LLM) / 1MB (UI) (Eng refined from UX proposal). Transparency principle honored without blowing IPC/storage budgets.
  • Range-read read_file(path, {start, length}) (PM, Eng). Scripts can tail/head efficiently within per-call caps.
  • Replay does NOT auto-re-execute (Eng, PM). Rendering is frozen; explicit "Re-run" button for fresh execution.
  • Live progress streaming deferred (Eng). Static states in v1; revisit if telemetry shows long-running scripts.
  • 2s default timeout + 500ms per-host-call + 10s user-configurable ceiling (PM + Eng compromise).
  • PDF/binary semantic reading deferred (Eng). MustardScript stays text-first; pdf_to_text is a separate future tool.
  • License hygiene via root /NOTICE (Eng).
  • Crash isolation for native addon (Eng). ExecutionContext calls wrapped; process-level guard added.
  • Legacy chats not migrated (PM). New local-agent uploads on-disk; default-chat and legacy history keep inline bytes. Release-notes call-out.
  • Small-model fallback banner (PM). Explicit UI when an attachment turn yields zero tool calls.
  • Non-negotiable launch line items (PM): local-agent tool-use-pickup metric instrumented, small-model fallback UI shipped, denylist hardened, first-run inline strip delivered, default-chat inline behavior preserved, system-prompt invariance verified.
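
Two of the decisions above, the 64KB (LLM) / 1MB (UI) output cap split and the 2s default timeout with a 10s ceiling, reduce to small pure functions. The sketch below is illustrative only: `capOutput`, `resolveTimeoutMs`, and the truncation marker text are hypothetical, sizes are measured in string characters for simplicity (the plan specifies KB, so a real implementation would likely measure encoded bytes), and reading the 10s figure as an upper bound on a user-supplied timeout is an assumption.

```typescript
// Caps from the decision log; measured in characters here as a simplification.
const LLM_CAP = 64 * 1024;
const UI_CAP = 1024 * 1024;

interface CappedOutput {
  forLlm: string; // what the model sees, with a marker if truncated
  forUi: string; // what the Script card can render
  truncatedForLlm: boolean;
  truncatedForUi: boolean;
}

// Split one script result into an LLM-facing and a UI-facing view.
// The LLM view carries an explicit marker so the model can tell that it
// received a prefix and can narrow its next script accordingly.
function capOutput(result: string): CappedOutput {
  const truncatedForLlm = result.length > LLM_CAP;
  const truncatedForUi = result.length > UI_CAP;
  const marker =
    "\n[truncated: showing " + LLM_CAP + " of " + result.length + " characters]";
  return {
    forLlm: truncatedForLlm ? result.slice(0, LLM_CAP) + marker : result,
    forUi: truncatedForUi ? result.slice(0, UI_CAP) : result,
    truncatedForLlm,
    truncatedForUi,
  };
}

// Timeout policy as read from the decision log (an interpretation).
const DEFAULT_TIMEOUT_MS = 2000;
const MAX_TIMEOUT_MS = 10000;

// Missing, non-positive, or NaN settings fall back to the default;
// anything above the ceiling is clamped to it.
function resolveTimeoutMs(userSetting?: number): number {
  if (userSetting === undefined || !(userSetting > 0)) return DEFAULT_TIMEOUT_MS;
  return Math.min(userSetting, MAX_TIMEOUT_MS);
}
```

Keeping both policies as pure functions makes them easy to unit-test independently of the native addon and the IPC layer.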

Generated by dyad:swarm-to-plan