plugins/plugin-computeruse/docs/MULTI_MONITOR.md
This plugin treats every physical display as an independent scene. There is NO virtual-desktop coordinate space exposed to the agent or the model.
platform/displays.ts returns the live attached set:
{
id: number, // OS-stable handle or 0-based index
bounds: [x, y, w, h], // OS-global pixel space
scaleFactor: number, // 1 on Linux/Win, >1 on retina
primary: boolean,
name: string // e.g. "eDP-1", "DISPLAY1"
}
Per-OS source:
| OS | Source |
|---|---|
| Linux X11 | xrandr --listmonitors |
| Linux Wayland | hyprctl monitors -j / swaymsg -t get_outputs (else falls back to X) |
| macOS | system_profiler SPDisplaysDataType -json, then JXA CGMainDisplayID |
| Windows | PowerShell [Screen]::AllScreens |
Native sidecars (Swift ScreenCaptureKit, DXGI/WGC, Rust libdisplay) are a follow-up — the interface is shaped to absorb them without changing callers.
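For the Linux X11 row in the table above, the following is a minimal sketch of how `xrandr --listmonitors` output could be turned into display records. It is not the plugin's own parseXrandrMonitors(); the names `ParsedMonitor`, `MONITOR_LINE`, and `parseMonitorsSketch` are hypothetical, and scaleFactor is left out because this listing does not report it.

```typescript
// Hypothetical record: a subset of the descriptor fields listed earlier.
interface ParsedMonitor {
  id: number;
  bounds: [number, number, number, number]; // [x, y, w, h]
  primary: boolean;
  name: string;
}

// Matches lines such as " 0: +*eDP-1 1920/344x1080/194+0+0  eDP-1".
// Groups: index, primary marker, name, width px, height px, x offset, y offset.
const MONITOR_LINE =
  /^\s*(\d+):\s+\+?(\*)?(\S+)\s+(\d+)\/\d+x(\d+)\/\d+([+-]\d+)([+-]\d+)/;

function parseMonitorsSketch(output: string): ParsedMonitor[] {
  const monitors: ParsedMonitor[] = [];
  for (const line of output.split("\n")) {
    const m = MONITOR_LINE.exec(line);
    if (!m) continue; // skips the "Monitors: N" header and blank lines
    monitors.push({
      id: Number(m[1]),
      name: m[3],
      primary: m[2] === "*",
      bounds: [Number(m[6]), Number(m[7]), Number(m[4]), Number(m[5])],
    });
  }
  return monitors;
}

// Sample input in the format printed by xrandr --listmonitors.
const sampleXrandr = [
  "Monitors: 2",
  " 0: +*eDP-1 1920/344x1080/194+0+0  eDP-1",
  " 1: +HDMI-1 2560/597x1440/336+1920+0  HDMI-1",
].join("\n");

const parsedSample = parseMonitorsSketch(sampleXrandr);
```

The regex keeps only the pixel dimensions and offsets and ignores the millimetre sizes, which the descriptor does not carry.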
platform/capture.ts is the canonical capture entry point:
captureDisplay(id): Promise<{ display, frame: PNG-Buffer }>
captureAllDisplays(): Promise<DisplayCapture[]>
captureDisplayRegion(): Promise<DisplayCapture> // local-to-display region
frame is at backing-store resolution. On a 2× retina display reporting
2560×1440 logical bounds with scaleFactor: 2, the PNG is 5120×2880.
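The arithmetic behind that example can be written down as a small helper. This is an illustrative sketch only: `backingSize` is a hypothetical name, and rounding for fractional scale factors is an assumption, not documented behavior.

```typescript
// Hypothetical helper: expected PNG dimensions for a display,
// given its logical bounds [x, y, w, h] and its scaleFactor.
function backingSize(
  bounds: [number, number, number, number],
  scaleFactor: number,
): { width: number; height: number } {
  const [, , w, h] = bounds;
  // Assumption: fractional results are rounded to whole pixels.
  return {
    width: Math.round(w * scaleFactor),
    height: Math.round(h * scaleFactor),
  };
}

// The retina case from the text: 2560×1440 logical at scaleFactor 2.
const retinaFrame = backingSize([0, 0, 2560, 1440], 2);
```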
The legacy single-display captureScreenshot() from screenshot.ts is still
exported for back-compat, but new code should prefer the per-display path.
Every coordinate-bearing action accepts:
{
displayId: number, // which display the coords belong to
coordinate: [x, y], // LOCAL to that display
coordSource?: "logical"|"backing" // default "logical"
}
platform/coords.ts::localToGlobal translates to OS-global before the input
driver fires. The model never sees OS-global coords.
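A minimal sketch of that translation follows, assuming bounds is [x, y, w, h] in logical OS-global units and that backing-store coordinates are divided by scaleFactor before the display origin is added. The names `DisplayLike`, `localToGlobalSketch`, and `globalToLocalSketch` are hypothetical; the real platform/coords.ts may differ.

```typescript
type CoordSource = "logical" | "backing";

// Hypothetical subset of the display descriptor needed for translation.
interface DisplayLike {
  id: number;
  bounds: [number, number, number, number]; // [x, y, w, h], OS-global
  scaleFactor: number;
}

// Display-local point -> OS-global point.
function localToGlobalSketch(
  display: DisplayLike,
  coordinate: [number, number],
  coordSource: CoordSource = "logical",
): [number, number] {
  const [originX, originY] = display.bounds;
  // Backing-store pixels are scaled down to logical units first.
  const divisor = coordSource === "backing" ? display.scaleFactor : 1;
  return [originX + coordinate[0] / divisor, originY + coordinate[1] / divisor];
}

// OS-global point -> display-local logical point (inverse of the above).
function globalToLocalSketch(
  display: DisplayLike,
  point: [number, number],
): [number, number] {
  const [originX, originY] = display.bounds;
  return [point[0] - originX, point[1] - originY];
}

// A 2× display placed to the right of a 2560-wide primary.
const secondDisplay: DisplayLike = {
  id: 1,
  bounds: [2560, 0, 2560, 1440],
  scaleFactor: 2,
};

const fromBacking = localToGlobalSketch(secondDisplay, [1000, 500], "backing");
const fromLogical = localToGlobalSketch(secondDisplay, [500, 250]);
```

In this sketch both calls resolve to the same global point, because backing-store pixel (1000, 500) and logical point (500, 250) describe the same spot on a 2× display.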
- When coordinates are given in backing-store pixels, set coordSource: "backing" so the translator divides by scaleFactor before adding the display origin.
- displayId is opaque to the model: it just echoes whatever the displays[] provider gave it.

If displayId is omitted, the service:
21:9 displays are NOT sliced. Each displayId is one scene. The aspect-aware patcher in WS6/WS7 sends the whole frame at the model's max_pixels budget; M-RoPE preserves aspect inside the model.
computerState provider includes data.displays: DisplayDescriptor[] and
in-text computer_use.displays. The planner reads this to pick a target
display before issuing any coordinate-bearing COMPUTER_USE action.
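How the planner chooses is not specified here. As one possible illustration, the sketch below picks a display by name and falls back to the primary one. `pickTargetDisplay` and its selection rule are hypothetical, and the interface is transcribed from the descriptor fields listed earlier rather than taken from the plugin's source.

```typescript
// Assumed shape, transcribed from the descriptor fields in this document.
interface DisplayDescriptor {
  id: number;
  bounds: [number, number, number, number]; // [x, y, w, h]
  scaleFactor: number;
  primary: boolean;
  name: string;
}

// Hypothetical selection rule: exact name match, else primary, else first.
function pickTargetDisplay(
  displays: DisplayDescriptor[],
  preferredName?: string,
): DisplayDescriptor | undefined {
  if (preferredName !== undefined) {
    const byName = displays.find((d) => d.name === preferredName);
    if (byName) return byName;
  }
  return displays.find((d) => d.primary) ?? displays[0];
}

const attached: DisplayDescriptor[] = [
  { id: 0, bounds: [0, 0, 1920, 1080], scaleFactor: 1, primary: true, name: "eDP-1" },
  { id: 1, bounds: [1920, 0, 2560, 1440], scaleFactor: 1, primary: false, name: "HDMI-1" },
];

// The chosen id is what a coordinate-bearing action would carry as displayId.
const target = pickTargetDisplay(attached, "HDMI-1");
```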
This Linux test host has a single display. Automated tests cover:
- parseXrandrMonitors(): string → DisplayInfo[] with golden fixtures.
- parseHyprlandMonitors(), parseSwayOutputs(): JSON → DisplayInfo[].
- parseSystemProfilerDisplays(): macOS JSON → DisplayInfo[].
- parseWindowsScreens(): PowerShell JSON → DisplayInfo[].
- localToGlobal / globalToLocal round-trip.

Real multi-monitor capture and input injection on each OS still needs a manual rig:
| What | OS | Manual check |
|---|---|---|
| screencapture -D 2 | macOS | dual display rig, capture each independently |
| Retina backing-store scale | macOS | scaleFactor=2, capture is 2× logical bounds |
| WGC / DXGI vs PerMonitorV2 DPI | Windows | per-monitor DPI mix, click lands on correct pixel |
| Wayland portal capture | Linux | GNOME 45+ / KDE 6 — needs a portal-based sidecar |
| Hyprland / Sway parser | Linux | run hyprctl monitors -j against live compositor |