Computer Use Architecture: Diagramming AI Browser and Desktop Automation (2026)

How to diagram computer use (CUA) architectures where AI agents control browsers and desktop UIs. Covers the perception-action loop, screen capture, action execution, safety sandboxing, and production deployment — with prompt templates.

Ryan · Senior AI Engineer

Computer use (also called computer use automation, or CUA) refers to AI agents that interact with computer interfaces — browsers, desktop applications, and web UIs — by perceiving the screen and executing mouse and keyboard actions, just as a human would. Introduced as a frontier capability by Anthropic's Claude in 2024 and rapidly adopted across the industry, computer use enables AI agents to automate workflows that previously required API access or custom integrations: filling out web forms, navigating legacy software, extracting data from web pages, and completing multi-step browser tasks.

A computer use architecture diagram maps the perception-action loop at the heart of CUA systems, the sandboxing infrastructure that makes production deployment safe, and the orchestration layer that plans multi-step tasks. This guide covers the core components, the leading frameworks, and prompt templates for generating computer use architecture diagrams in seconds.

The perception-action loop

Every computer use architecture is built around a fundamental loop that your diagram should make explicit:

  1. Perceive: The agent takes a screenshot of the current screen state (or receives an accessibility tree / DOM representation for web-based UIs). This visual input is sent to a vision-capable LLM (Claude claude-opus-4-8, GPT-4o, or a specialized CUA model).
  2. Reason: The LLM analyzes the screen state in the context of the current task goal and generates an action — a click at specific coordinates, a text input, a keyboard shortcut, a scroll, or a task-level decision like "navigate to the next page."
  3. Act: An action executor translates the LLM's action specification into actual OS-level events — mouse movements and clicks via system APIs, keyboard input via OS event injection, or browser-level commands via a WebDriver or CDP (Chrome DevTools Protocol) connection.
  4. Verify: After each action, the system takes a new screenshot to observe the result. The LLM compares the new state to the expected post-action state and decides whether to continue, retry, or escalate to a human.

This loop repeats until the task is complete or a stopping condition is reached. The loop is the fundamental architectural unit — show it as a cycle in your diagram with labeled transitions between each step.
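As a rough illustration, the four steps can be sketched as a plain Python loop. Everything here is hypothetical scaffolding rather than any vendor's API: `run_loop`, the `Action` type, and the four callables (`perceive`, `reason`, `act`, `verify`) are names invented for this sketch, and the retry branch is folded into "continue" for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    """Hypothetical action record; real model APIs define their own schema."""
    kind: str                      # e.g. "click", "type", "scroll", "done"
    payload: Optional[dict] = None


def run_loop(
    goal: str,
    perceive: Callable[[], bytes],
    reason: Callable[[str, bytes], Action],
    act: Callable[[Action], None],
    verify: Callable[[Action, bytes], bool],
    max_steps: int = 100,
) -> str:
    """Run perceive -> reason -> act -> verify until done or a stop condition."""
    for _ in range(max_steps):
        screenshot = perceive()                # 1. Perceive: capture screen state
        action = reason(goal, screenshot)      # 2. Reason: model proposes an action
        if action.kind == "done":
            return "completed"
        act(action)                            # 3. Act: executor performs it
        if not verify(action, perceive()):     # 4. Verify: inspect the new state
            return "escalated"                 # hand off to a human
    return "step_limit_reached"                # stopping condition
```

In a real system `reason` and `verify` would both call the vision LLM, and `perceive` and `act` would talk to the sandbox. Keeping them as injected callables mirrors the labeled transitions in the diagram and makes the loop easy to test with stubs.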

Core architectural components

Sandboxed environment

A computer use agent executing arbitrary browser and desktop actions is a significant security risk if run on a production machine. Every production CUA system should run inside a sandboxed environment — a Docker container, a virtual machine, or a cloud browser sandbox (Browserbase, Playwright in Docker, or Steel.dev). The sandbox provides:

  • Isolation from the host file system — the agent cannot access files outside the task scope
  • Network restrictions — outbound connections can be limited to the specific domains the task requires
  • Session isolation — each task runs in a fresh browser profile with no access to other users' sessions or stored credentials
  • Automatic cleanup — the container is destroyed after task completion, leaving no persistent state

Show the sandbox boundary as a dashed box in your diagram, clearly separating the agent's execution environment from the host infrastructure.
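For a Docker-based sandbox, the properties above map fairly directly onto `docker run` flags. The following is a minimal sketch, not a hardened configuration: the function name `sandbox_run_args`, the `cua-` container name prefix, and the resource limits are assumptions chosen for illustration, and a real browser image may need extra writable paths or shared memory.

```python
from typing import List


def sandbox_run_args(image: str, task_id: str, network: str = "none") -> List[str]:
    """Build a `docker run` command line for one disposable task container."""
    return [
        "docker", "run",
        "--rm",                      # automatic cleanup: remove container on exit
        "--name", f"cua-{task_id}",  # hypothetical naming scheme, one per task
        "--read-only",               # root filesystem is mounted read-only
        "--tmpfs", "/tmp",           # scratch space that vanishes with the container
        "--cap-drop", "ALL",         # drop Linux capabilities the task does not need
        "--network", network,        # "none", or a pre-created restricted network
        "--memory", "2g",            # illustrative resource limits
        "--pids-limit", "512",
        image,
    ]
```

The resulting list can be passed to `subprocess.run(args, check=True)`. Docker has no flag for per-domain allowlisting, so restricting outbound traffic to specific domains would typically be done by attaching the container to a network whose egress passes through a filtering proxy or firewall.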

Screen capture and rendering

The screen capture layer takes screenshots of the sandbox environment at each step of the perception-action loop. For web-only agents, screenshots can be captured via Playwright's built-in screenshot method. For desktop automation, virtual frame buffers (Xvfb on Linux) provide a headless display environment that can be screen-captured. Screenshots are resized and compressed before being sent to the vision LLM to minimize token costs.
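Resizing has a consequence worth showing in the diagram: if the model picks click coordinates on the downscaled image, they have to be mapped back to real screen pixels before execution. Here is a small sketch of that arithmetic, assuming a 1024x768 bounding box and assuming the model reports coordinates in the resized image's space (this depends on the API you use). The helper names are hypothetical.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_w: int = 1024, max_h: int = 768) -> Tuple[int, int]:
    """Return the largest size that fits in max_w x max_h at the same aspect ratio."""
    scale = min(max_w / width, max_h / height, 1.0)  # never upscale
    return max(1, round(width * scale)), max(1, round(height * scale))


def to_screen_coords(
    x: int, y: int, model_size: Tuple[int, int], screen_size: Tuple[int, int]
) -> Tuple[int, int]:
    """Map a point chosen on the resized image back to real screen pixels."""
    model_w, model_h = model_size
    screen_w, screen_h = screen_size
    return round(x * screen_w / model_w), round(y * screen_h / model_h)
```

For a 1920x1080 display, `fit_within(1920, 1080)` gives `(1024, 576)`, and a model click at the center of that image, `(512, 288)`, maps back to `(960, 540)` on the real screen.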

Action executor

The action executor translates LLM action commands into actual UI interactions. For browser automation, this is typically Playwright or Puppeteer (CDP-based). For desktop automation, platform-specific APIs are used: pyautogui or xdotool on Linux, AppleScript / Accessibility APIs on macOS, Win32 API or UIAutomation on Windows. Show the action executor as the component that interfaces with the sandboxed OS layer.
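One way to keep the executor swappable between browser and desktop backends is a thin dispatch layer over a common interface. The sketch below assumes a simple dictionary action format (`type`, `x`, `y`, `text`, `key`, `dx`, `dy`) invented for illustration. Real model APIs define their own action schemas, and `InputBackend`, `RecordingBackend`, and `execute` are hypothetical names.

```python
from typing import Any, Dict, List, Protocol, Tuple


class InputBackend(Protocol):
    """Interface a browser or desktop backend would implement."""
    def click(self, x: int, y: int) -> None: ...
    def type_text(self, text: str) -> None: ...
    def press(self, key: str) -> None: ...
    def scroll(self, dx: int, dy: int) -> None: ...


class RecordingBackend:
    """Stand-in backend that records calls instead of touching a real UI."""
    def __init__(self) -> None:
        self.calls: List[Tuple[str, tuple]] = []

    def click(self, x: int, y: int) -> None:
        self.calls.append(("click", (x, y)))

    def type_text(self, text: str) -> None:
        self.calls.append(("type_text", (text,)))

    def press(self, key: str) -> None:
        self.calls.append(("press", (key,)))

    def scroll(self, dx: int, dy: int) -> None:
        self.calls.append(("scroll", (dx, dy)))


def execute(action: Dict[str, Any], backend: InputBackend) -> None:
    """Translate one action dictionary into a backend call."""
    kind = action.get("type")
    if kind == "click":
        backend.click(int(action["x"]), int(action["y"]))
    elif kind == "type":
        backend.type_text(str(action["text"]))
    elif kind == "key":
        backend.press(str(action["key"]))
    elif kind == "scroll":
        backend.scroll(int(action.get("dx", 0)), int(action.get("dy", 0)))
    else:
        # Reject anything unrecognized rather than guessing at intent.
        raise ValueError(f"unsupported action type: {kind!r}")
```

A Playwright-backed implementation would forward these methods to `page.mouse.click(x, y)`, `page.keyboard.type(text)`, `page.keyboard.press(key)`, and `page.mouse.wheel(dx, dy)`. A desktop one could use `pyautogui.click(x, y)`, `pyautogui.write(text)`, and `pyautogui.press(key)`.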

Task orchestrator

For complex multi-step tasks, a task orchestrator breaks the high-level goal into subtasks and manages the overall task lifecycle. The orchestrator handles:

  • Task planning: decomposing "book a flight" into navigate → search → select → checkout steps
  • Error recovery: retrying a failed action with a different approach
  • Human escalation: pausing when the agent encounters an unexpected state it cannot resolve
  • Success verification: confirming the task was actually completed rather than just appearing to be
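The orchestrator's responsibilities can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a production design: `run_task`, `TaskReport`, and the two callables are hypothetical names, `run_subtask(name, attempt)` stands in for a full perception-action loop, and "a different approach" is reduced to passing the attempt number so the callee can vary its strategy.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TaskReport:
    status: str                                   # "succeeded", "escalated", "unverified"
    completed: List[str] = field(default_factory=list)


def run_task(
    subtasks: List[str],
    run_subtask: Callable[[str, int], bool],
    confirm_success: Callable[[], bool],
    max_attempts: int = 2,
) -> TaskReport:
    """Run planned subtasks in order with retries, escalation, and a final check."""
    report = TaskReport(status="succeeded")
    for name in subtasks:
        for attempt in range(1, max_attempts + 1):
            if run_subtask(name, attempt):        # error recovery: retry on failure
                report.completed.append(name)
                break
        else:
            # All attempts failed: pause and hand off to a human.
            report.status = "escalated"
            return report
    if not confirm_success():                     # success verification
        report.status = "unverified"
    return report
```

For the flight example, `subtasks` would be `["navigate", "search", "select", "checkout"]`, and `confirm_success` might check for a booking confirmation on the final screen rather than trusting that the last click went through.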

Computer use frameworks (2026)

| Framework / Platform | Type | Notes |
| --- | --- | --- |
| Anthropic Computer Use (Claude claude-opus-4-8) | Model API | Native screenshot + action tools in Claude claude-opus-4-8; the model generates structured action commands |
| OpenAI CUA (o3 model) | Model API | Computer use via operator API; supports browser and desktop environment types |
| Playwright | Browser automation library | Chromium/Firefox/WebKit automation; commonly used as the action executor layer in browser-only CUA systems |
| Browserbase | Managed cloud browser sandbox | Hosted sandboxed browsers for CUA; handles scaling, proxying, and session isolation |
| Steel.dev | Open-source browser infrastructure | Self-hosted alternative to Browserbase; Docker-based browser sandbox with session management API |
| Browser Use (Python library) | Open-source agent framework | Connects LangChain agents to Playwright for web automation; provides the orchestration layer |

Prompt templates for computer use architecture diagrams

Web scraping and data extraction agent

"Computer use agent for automated web data extraction. Orchestrator: a task queue worker (Python) reads job definitions from a Redis queue — each job specifies a target website URL, a data extraction goal (e.g., 'extract all product prices and names from the search results page'), and a timeout (60 seconds). Sandbox: each job launches a fresh Browserbase cloud browser session (Chromium, isolated profile, no stored credentials). Perception-action loop: (1) Playwright captures a screenshot of the current page; (2) screenshot (resized to 1024x768) + task goal are sent to Claude claude-opus-4-8 computer use API; (3) Claude returns an action (click, scroll, extract_text, done); (4) action is executed via Playwright CDP; (5) repeat until Claude returns 'done' or timeout is reached. Output: structured JSON extracted by Claude from the final page state is written to PostgreSQL. Anti-detection: rotating residential proxies are assigned per session. Show the sandbox boundary, the perception-action loop as a cycle, and the output path from completed extraction to PostgreSQL."

End-to-end testing AI agent

"AI computer use agent for automated end-to-end QA testing of a SaaS web application. Test execution: a CI pipeline (GitHub Actions) triggers the CUA test runner on each pull request. The runner reads a test case definition (natural language task goal + success criteria). Sandbox: Playwright launches a Chromium browser pointed at the staging URL (authenticated via env var credentials injected into the session — never hardcoded). Perception-action loop: Claude claude-opus-4-8 computer use receives a screenshot + current task step, returns actions. After each action, the agent checks for error states (modal dialogs, HTTP error pages, console errors via CDP) and aborts with failure details if any are found. Success evaluation: at the end of the task, Claude evaluates whether the success criteria are met by analyzing the final screenshot. Results: pass/fail + screenshot evidence + full action trace are posted back to the GitHub PR check. No human authored Playwright selectors anywhere — the agent navigates by understanding the visual UI. Show the CI trigger, sandbox isolation, perception-action loop, and results reporting back to GitHub."

Safety architecture for computer use systems

Computer use agents have broader access than tool-calling agents — they can interact with any web page or application, not just APIs you explicitly integrated. Your architecture diagram should document the safety controls explicitly:

  • Network allowlisting: Restrict outbound connections from the sandbox to only the domains required for the task — prevent the agent from exfiltrating data to unexpected endpoints
  • Credential isolation: Credentials are injected via environment variables or a secrets manager at task start and are never stored in the browser profile or accessible across tasks
  • Action confirmation gates: For high-risk actions (form submission, file upload, purchase completion), require human confirmation before executing — show this as a pause node in the action loop
  • Action logging: Log every action (including screenshots before and after) for audit and debugging — essential for understanding what happened in autonomous multi-step tasks
  • Timeout controls: Set hard time and action-count limits per task to prevent runaway loops — a task that takes >100 actions or >5 minutes is almost certainly in an error state
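Several of these controls are simple enough to express as guard functions that run before each action. The sketch below is an application-level illustration only, using the limits mentioned above (100 actions, 5 minutes). The names `host_allowed`, `needs_confirmation`, `Budget`, and the `HIGH_RISK` action labels are hypothetical, and a check like this complements network-level enforcement at the sandbox boundary rather than replacing it.

```python
import time
from dataclasses import dataclass, field
from typing import Set
from urllib.parse import urlsplit

# Hypothetical labels for actions that should pause for human confirmation.
HIGH_RISK: Set[str] = {"submit_form", "upload_file", "complete_purchase"}


def host_allowed(url: str, allowlist: Set[str]) -> bool:
    """True if the URL's host is an allowlisted domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowlist)


def needs_confirmation(action_type: str) -> bool:
    """True if the action should wait at a confirmation gate."""
    return action_type in HIGH_RISK


@dataclass
class Budget:
    """Hard per-task limits on action count and wall-clock time."""
    max_actions: int = 100
    max_seconds: float = 300.0
    actions: int = 0
    started: float = field(default_factory=time.monotonic)

    def spend(self) -> None:
        """Record one action; raise if either limit has been exceeded."""
        self.actions += 1
        if self.actions > self.max_actions:
            raise RuntimeError("action limit exceeded")
        if time.monotonic() - self.started > self.max_seconds:
            raise RuntimeError("time limit exceeded")
```

Matching on the parsed hostname rather than a substring of the URL matters here: `https://example.com.evil.net/` contains `example.com` but is not allowed when only `example.com` is on the list.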

Frequently asked questions about computer use architecture

What is computer use in AI?

Computer use in AI refers to the capability of AI agents to interact with computers through their graphical interfaces — controlling a browser, desktop application, or web UI by viewing screenshots and executing mouse and keyboard actions. Unlike traditional API-based tool use, computer use allows agents to automate any workflow accessible through a UI, including legacy software with no API, websites that block programmatic access, and complex multi-step workflows that span multiple applications.

How is computer use different from traditional browser automation?

Traditional browser automation (Selenium, Playwright) uses deterministic selectors (CSS selectors, XPath, element IDs) to locate and interact with specific page elements. This is brittle: a change to the HTML structure that a selector depends on breaks the script. Computer use agents navigate by understanding the visual UI (reading labels, interpreting button positions, understanding context), making them more resilient to layout changes and capable of handling novel UI states the test author never anticipated. The tradeoff is higher latency (a vision LLM call per action) and higher per-action cost.

What is the biggest risk of deploying a computer use agent in production?

The primary risk is unintended actions on real systems — submitting forms, sending emails, deleting data, or making purchases that the user did not intend. Mitigate this with: (1) sandboxed environments for all development and staging work; (2) human-in-the-loop confirmation for any irreversible action; (3) network allowlisting to restrict what systems the agent can reach; and (4) comprehensive action logging so any unintended action can be identified and reversed. Never run a computer use agent with production credentials in a live environment without explicit confirmation gates for destructive actions.

Related guides: AI agent architecture diagrams, MCP architecture diagrams, agentic AI security architecture, and multi-agent orchestration patterns.
