skills/computer-use/SKILL.md
You have a computer_use tool that drives the user's desktop in the
background — your actions do NOT move the user's cursor, steal
keyboard focus, or switch virtual desktops / Spaces. The user can keep
typing in their editor while you click around in a browser in another
window. This is the opposite of pyautogui-style automation.
Everything here works with any tool-capable model — Claude, GPT, Gemini, or an open model on a local OpenAI-compatible endpoint. There is no Anthropic-native schema to learn.
Hermes drives cua-driver under the hood
for the platform plumbing. The Hermes-side computer_use tool exposed
in this skill is a higher-level Hermes vocabulary; the raw cua-driver
MCP tools (which a different agent harness would see) are NOT what you
call — call the computer_use actions documented below.
Step 1 — Capture first. Almost every task starts with:
computer_use(action="capture", mode="som", app="<the app you're driving>")
Returns a screenshot with numbered overlays on every interactable element AND an AX-tree index like:
#1 AXButton 'Back' @ (12, 80, 28, 28) [Chrome]
#2 AXTextField 'Address bar' @ (80, 80, 900, 32) [Chrome]
#7 Link 'Sign In' @ (900, 420, 80, 24) [Chrome]
...
The role names match the host platform's accessibility framework
(AXButton on macOS, Button on Windows UIA, push button on Linux
AT-SPI) — treat them as labels, not as strict types.
Step 2 — Click by element index. This is the single most important habit:
computer_use(action="click", element=7)
Much more reliable than pixel coordinates for every model. Claude was trained on both; other models are often only reliable with indices.
Step 3 — Verify. After any state-changing action, re-capture. You can save a round-trip by asking for the post-action capture inline:
computer_use(action="click", element=7, capture_after=True)
mode | Returns | Best for |
|---|---|---|
som (default) | Screenshot + numbered overlays + AX index | Vision models; preferred default |
vision | Plain screenshot | When SOM overlay interferes with what you want to verify |
ax | AX tree only, no image | Text-only models, or when you don't need to see pixels |
capture mode=som|vision|ax app=… (default: current app)
click element=N OR coordinate=[x, y] button=left|right|middle
double_click element=N OR coordinate=[x, y]
right_click element=N OR coordinate=[x, y]
middle_click element=N OR coordinate=[x, y]
drag from_element=N, to_element=M (or from/to_coordinate)
scroll direction=up|down|left|right amount=3 (ticks)
type text="…"
key keys="<save shortcut>" | "return" | "escape" | "<modifier>+t"
wait seconds=0.5
list_apps
focus_app app="<app name>" raise_window=false (default: don't raise)
All actions accept optional capture_after=True to get a follow-up
screenshot in the same tool call. All actions that target an element
accept modifiers=[…] for held keys.
Use the host's idiomatic modifier:
| Common action | macOS | Windows / Linux |
|---|---|---|
| Save | cmd+s | ctrl+s |
| New tab | cmd+t | ctrl+t |
| Close tab / window | cmd+w | ctrl+w |
| Copy / paste | cmd+c / cmd+v | ctrl+c / ctrl+v |
| Address bar | cmd+l | ctrl+l |
| App switcher | cmd+tab | alt+tab |
When in doubt, capture and look for menu hints, or ask the user which shortcut to use.
raise_window=True unless the user explicitly asked you
to bring a window to front. Input routing works without raising.app="Chrome") — less noisy, fewer
elements, doesn't leak other windows the user has open.Prefer element indices:
computer_use(action="drag", from_element=3, to_element=17)
For a rubber-band selection on empty canvas, use coordinates:
computer_use(action="drag",
from_coordinate=[100, 200],
to_coordinate=[400, 500])
Scroll the viewport under an element (most common):
computer_use(action="scroll", direction="down", amount=5, element=12)
Or at a specific point:
computer_use(action="scroll", direction="down", amount=3, coordinate=[500, 400])
list_apps returns running apps with bundle IDs / process names, PIDs,
and window counts. focus_app routes input to an app without raising
it. You rarely need to focus explicitly — passing app=... to
capture / click / type will target that app's frontmost window
automatically.
When the user is on a messaging platform (Telegram, Discord, etc.) and
you took a screenshot they should see, save it somewhere durable and
use MEDIA:/absolute/path.png in your reply. cua-driver's screenshots
are PNG or JPEG bytes (mimeType is on the response); write them out
with write_file or the terminal (base64 -d).
On CLI, you can just describe what you see — the screenshot data stays in your conversation context.
type. You'll see an
error if the guard fires.| Symptom | Likely cause + remedy |
|---|---|
cua-driver not installed | Run hermes computer-use install, or hermes tools and enable Computer Use |
| Captures consistently return empty / "no on-screen window" | On Linux: DISPLAY may not be set (X11) or you're on pure Wayland — ask the user to run hermes computer-use doctor. On Windows: you may be in Session 0 (SSH session) instead of the interactive desktop — see the cua-driver WINDOWS.md deep-dive |
| Element index stale ("Element N not in cache") | SOM indices are only valid until the next capture. Re-capture before clicking. The wrapper carries opaque element_tokens for stale-detection; you'll see an explicit error rather than a wrong click |
| Click had no effect | Re-capture and verify. A modal that wasn't visible before may be blocking input. Dismiss it (usually escape or click its close button) before retrying |
| Type text disappears into a terminal emulator | cua-driver detects terminals (Ghostty, iTerm2, Terminal.app, Windows Terminal, mintty, etc.) and routes through key-event synthesis — should "just work" on a recent cua-driver. If it doesn't, ask the user to run hermes computer-use doctor |
blocked pattern in type text | You tried to type a shell command matching the dangerous-pattern block list (curl ... | bash, sudo rm -rf, etc.). Break the command up or reconsider |
| Anything else weird | First action: ask the user to run hermes computer-use doctor. It runs the cua-driver health_report MCP tool and prints a structured per-check matrix. Their output tells you (and them) exactly what's wrong |
computer_usebrowser_* tools — those use a
real headless Chromium and are more reliable than driving the user's
GUI browser. Reach for computer_use specifically when the task
needs the user's actual native apps (Finder/Explorer/Files, Mail/
Outlook/Thunderbird, native chat clients, Figma, Logic, games,
anything non-web).read_file / write_file / patch, not
type into an editor window.terminal, not type into Terminal.app /
Windows Terminal / gnome-terminal.Hermes intentionally keeps THIS skill focused on the Hermes-side
computer_use action vocabulary. The platform-specific deep dives
(macOS no-foreground contract, Windows UIA + Session 0, Linux AT-SPI +
X11/Wayland nuances, recording trajectory + video, browser-page
interaction, etc.) live in cua-driver's skill pack — same content the
cua-driver team ships and maintains for every other agent harness.
To link the cua-driver skill pack into your skill space:
cua-driver skills install
You'll then have access to:
SKILL.md — the cross-platform core (snapshot invariant, no-
foreground contract, click dispatch, AX tree mechanics)MACOS.md — macOS specifics (no-foreground contract, AXMenuBar
navigation, SkyLight click dispatch, Apple Events JS bridge)WINDOWS.md — Windows specifics (UIA tree, UWP / ApplicationFrameHost
hosting, Session 0 isolation, autostart pattern for SSH)LINUX.md — Linux specifics (AT-SPI tree, X11 / Wayland, terminal
emulator detection)RECORDING.md — trajectory + video recording semanticsWEB_APPS.md — browser page interaction tipsTESTS.md — replay-by-trajectory workflowThese are platform deep dives, not duplicates — when the user reports
"on Windows the click landed on the wrong element," you read
WINDOWS.md for the UIA / UWP context that explains why and what to
do differently.
When cua-driver skills install autodetects Hermes (planned follow-up
in trycua/cua), this happens automatically on install. Until then, ask
the user to run the command and the pack lands in their agent skill
space alongside this skill.