Desktop Control
GPU-accelerated CLI for AI agents to control any macOS app via screen, mouse, and keyboard.
DesktopCtl
Local CLI for AI agents to observe and control your computer via screen, mouse, and keyboard. Bring your own AI - any model, even without vision.
Runs fully local. No screenshots sent to the cloud.
Learn more at https://desktopctl.com
https://github.com/user-attachments/assets/4321b23e-6706-4792-a911-89e13766ebc0
Why DesktopCtl
- Local-first runtime. No cloud dependency
- Bring your own AI: works with any desktop AI agent
- GPU-accelerated text recognition and computer vision
- Selector-first automation (--text, --token) with coordinate fallback
- Agent-friendly explicit waits and post-action verification
- Stable JSON contracts for agent integrations
Architecture
DesktopCtl is split into two binaries:
- DesktopCtl.app (desktopctld): daemon that owns perception, state, execution, and verification
- desktopctl: stateless CLI surface for actions and queries over local IPC
Repository layout:
- src/desktop/core - shared protocol and types
- src/desktop/daemon - daemon runtime
- src/desktop/cli - CLI client
Current Scope
- macOS-first
- OCR-first perception pipeline
- Tokenized screen output for agent grounding
- Deterministic CLI primitives for click/type/wait flows
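A click/type/wait flow ends with a check that the action had its intended effect. The sketch below shows one way an agent-side script could poll for that. wait_for_output is a hypothetical helper, not a desktopctl command, and it assumes the polled command prints the text being waited for. Where DesktopCtl's own wait primitives cover the case, they are likely the better choice.

```shell
# wait_for_output EXPECTED ATTEMPTS COMMAND [ARGS...]
# Hypothetical helper (not part of desktopctl): reruns COMMAND until its
# output contains EXPECTED, giving up after ATTEMPTS tries.
wait_for_output() {
  local expected="$1" attempts="$2" i=0 out
  shift 2
  while [ "$i" -lt "$attempts" ]; do
    # Capture output; a failing command counts as "not there yet".
    out="$("$@" 2>/dev/null || true)"
    case "$out" in
      *"$expected"*) return 0 ;;   # expected text found
    esac
    i=$((i + 1))
    sleep 0.2                      # short pause between polls
  done
  return 1                         # text never appeared
}

# Possible usage, assuming tokenize output includes the visible text
# and $win_id holds a window id returned by `desktopctl app open`:
#   wait_for_output "Shopping list" 10 \
#     desktopctl screen tokenize --active-window "$win_id"
```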
Prerequisites
- macOS (current support target)
- Rust toolchain (cargo)
- just command runner
- Accessibility permission for DesktopCtl.app
- Screen Recording permission for DesktopCtl.app
Quick Start
just build run
raw="$(desktopctl app open Notes --json)"
win_id="$(printf '%s' "$raw" | jq -r '.result.window_id // empty')"
desktopctl keyboard press cmd+f --active-window "$win_id" --no-observe
desktopctl keyboard type "Shopping list" --active-window "$win_id" --no-observe
desktopctl screen tokenize --active-window "$win_id"
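
The steps above pass "$win_id" along even when it is empty, for example if the app failed to open. A minimal sketch of a guard follows, assuming the id lives at .result.window_id as in the jq filter above. window_id_from_json is a hypothetical helper, not part of desktopctl.

```shell
# window_id_from_json JSON
# Hypothetical helper: prints the window id found at .result.window_id,
# or fails with a message on stderr when the id is missing or the input
# is not valid JSON. Requires jq, as the Quick Start does.
window_id_from_json() {
  local raw="$1" id
  id="$(printf '%s' "$raw" | jq -r '.result.window_id // empty' 2>/dev/null || true)"
  if [ -z "$id" ]; then
    echo "no window_id in response" >&2
    return 1
  fi
  printf '%s\n' "$id"
}

# Possible usage with the Quick Start commands:
#   raw="$(desktopctl app open Notes --json)"
#   win_id="$(window_id_from_json "$raw")" || exit 1
```

Stopping early keeps later keyboard commands from running against an unintended window.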
Status / Roadmap
- Status: active development, with macOS-first CLI and daemon workflows already usable.
- Current focus: reliability for text/token-driven actions and verification loops, plus stable machine-readable error codes.
- Upcoming CLI: doctor, richer window/app introspection, and --explain failure output.
- Better local computer vision and semantic UI tokenization.
- Multi-platform support.