packages/cua-driver/test-harness/vision-agent-test/README.md
Tests cua-driver the way a vision agent actually hits it — and without the overfit the modality recorder has (hand-tuned window-local points run through a private ratio, which never exercises the driver's image→screen mapping).
The pixel an agent reads off the returned screenshot is the pixel that gets clicked — verified by the target's own instrumented state changing.
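To make "image→screen mapping" concrete, here is a minimal sketch of the kind of conversion this test exercises end to end. It assumes the screenshot is captured at the display's backing scale and that the window origin is known in screen points. The helper name and its parameters are hypothetical and are not cua-driver's API.

```python
def image_px_to_screen_pt(px, py, window_origin, scale):
    """Map a screenshot pixel to a screen point (hypothetical helper).

    Assumes: window_origin is the window's top-left in screen points,
    and scale is image pixels per screen point (e.g. 2.0 on a 2x display).
    """
    ox, oy = window_origin
    return (ox + px / scale, oy + py / scale)


# A 2x capture of a window whose top-left sits at (100, 50) points.
correct = image_px_to_screen_pt(400, 300, (100, 50), scale=2.0)

# Dropping the scale factor is one example of a mis-map: the click
# would land elsewhere and the target's state would not change.
wrong = image_px_to_screen_pt(400, 300, (100, 50), scale=1.0)
```

A test that feeds in hand-converted window-local points skips this step entirely, which is why it cannot catch an error in it.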
1. `get_window_state` (window; returns the screenshot alongside the tree by default) / `get_desktop_state` (desktop, true pixels): the exact image an agent receives.
2. `PixelRegistryLocator`: a pre-measured pixel read off the real PNG, with a dims-guard that fails loud if the pinned geometry drifts. No AX `element_index`, no hand-converted window-local points. Pluggable `locate(image, target, dims)->(x,y)`.
3. `click`/`right_click`/`scroll` at that pixel (scope set to match capture).
4. The target's instrumented state (`last_action=`, `clicks=`, …): reliable pass/fail. A coordinate mis-map leaves the oracle unchanged → FAIL.

Run: `python3 vision_agent_test.py {wkwebview-click-window|wkwebview-click-desktop|appkit-click-window|safari-learnmore-desktop|all}`
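The locator step above can be sketched as follows. This is an illustration only, not the harness's actual code: the `locate(image, target, dims)` signature comes from this README, while the registry layout, the `DimsDriftError` name, and the example target are assumptions.

```python
from typing import Dict, Tuple

Point = Tuple[int, int]
Dims = Tuple[int, int]


class DimsDriftError(AssertionError):
    """Hypothetical error: screenshot size no longer matches the pinned geometry."""


class PixelRegistryLocator:
    """Looks up a pre-measured pixel per target (assumed registry layout)."""

    def __init__(self, registry: Dict[str, Tuple[Dims, Point]]):
        # target name -> (image dims at measurement time, pixel to click)
        self.registry = registry

    def locate(self, image: bytes, target: str, dims: Dims) -> Point:
        pinned_dims, point = self.registry[target]
        # Dims-guard: refuse to return a stale pixel if the capture changed size.
        if dims != pinned_dims:
            raise DimsDriftError(
                f"{target}: pinned {pinned_dims}, got {dims}; re-measure the pixel"
            )
        return point


# Hypothetical entry: a pixel measured on a 1600x1200 screenshot.
locator = PixelRegistryLocator({"example-button": ((1600, 1200), (412, 388))})
x, y = locator.locate(b"", "example-button", (1600, 1200))
```

Because the image itself is passed in but unused here, a model-backed or image-based locator can replace this class without changing the caller.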
A model-backed locator keeps the `locate()` signature, sends the PNG to a model, and has its localization hit-rate scored separately; it never gates the coordinate regression.

A pixel click on an `NSButton`, even frontmost, lands pixel-perfect (crosshair) but does nothing: `NSButton`'s modal `mouseDown` loop reads the window-server queue, not the per-pid `CGEvent.postToPid` queue. The suite drives this via `AXPress`, so it never saw a vision agent's pixel click do nothing here.

`clicks=` oracle is unreachable that way; WKWebView exposes it fine.

Cross-product, each a one-line registry entry:
{appkit, swiftui, wkwebview, electron, real-app} × {window, desktop(, secondary-display)} × {left, right, double, scroll, drag, type}.
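As a rough illustration of the size of that matrix, the sketch below enumerates it with `itertools.product`. The three axes are copied from the line above. The `target-action-scope` naming is an assumption loosely modeled on the existing case names and may not match how the registry actually keys its entries.

```python
from itertools import product

TARGETS = ["appkit", "swiftui", "wkwebview", "electron", "real-app"]
SCOPES = ["window", "desktop", "secondary-display"]  # last one is optional
ACTIONS = ["left", "right", "double", "scroll", "drag", "type"]


def case_names():
    """Return one hypothetical case name per (target, scope, action) combination."""
    return [
        f"{target}-{action}-{scope}"
        for target, scope, action in product(TARGETS, SCOPES, ACTIONS)
    ]


names = case_names()
print(len(names))  # 5 targets x 3 scopes x 6 actions = 90
```

With `secondary-display` left out, the same enumeration yields 60 cases.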
A color-fiducial locator (harness renders a unique-color dot per control) would make
the registry robust to window moves without OCR.
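A minimal sketch of such a color-fiducial locator, assuming the screenshot has already been decoded into rows of `(r, g, b)` tuples and that each control's dot uses a color found nowhere else in the image. The function name and the exact-match rule are illustrative assumptions; a real capture may need a small tolerance if color management or antialiasing shifts pixel values.

```python
def locate_fiducial(pixels, color):
    """Return the centroid (x, y) of all pixels exactly matching `color`."""
    xs, ys = [], []
    for y, row in enumerate(pixels):
        for x, rgb in enumerate(row):
            if rgb == color:
                xs.append(x)
                ys.append(y)
    if not xs:
        # Fail loud rather than click a guessed position.
        raise LookupError(f"fiducial color {color} not found")
    return (sum(xs) // len(xs), sum(ys) // len(ys))


def make_image(width, height, dot_left, dot_top, color):
    """Build a white test image with a 3x3 dot of `color` (test fixture only)."""
    img = [[(255, 255, 255)] * width for _ in range(height)]
    for y in range(dot_top, dot_top + 3):
        for x in range(dot_left, dot_left + 3):
            img[y][x] = color
    return img


MAGENTA = (255, 0, 255)
before = locate_fiducial(make_image(8, 6, 2, 1, MAGENTA), MAGENTA)
# Simulate the window moving 2 px right: the located point follows it.
after = locate_fiducial(make_image(8, 6, 4, 1, MAGENTA), MAGENTA)
```

Since the point is derived from the image on every run, no pinned pixel needs re-measuring when the window moves.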