Can OpenAI’s “Operator” Beat Selenium?
OpenAI has launched “Operator” (https://openai.com/index/introducing-operator/), available to ChatGPT Pro users in the USA.
This has been in the works for almost two years. OpenAI alluded to it when it released “Plugins” (the precursor to custom GPTs) in March 2023: “helping users with a variety of new use cases, ranging from browsing product catalogs to booking flights or ordering food.”
The third-party partners back then were Expedia, OpenTable, TripAdvisor, Instacart and so on.
Those same third-party partners reappear in the “Operator” demo.
Now, OpenAI claims that “Operator is powered by a new model called Computer-Using Agent (CUA). Combining GPT-4o’s vision capabilities with advanced reasoning through reinforcement learning, CUA is trained to interact with graphical user interfaces (GUIs) — the buttons, menus, and text fields people see on a screen.”
“Operator can “see” (through screenshots) and “interact” (using all the actions a mouse and keyboard allow) with a browser, enabling it to take action on the web without requiring custom API integrations.”
In other words, it takes the user prompt, combines it with a screenshot of the webpage, and chooses the next mouse or keyboard action (https://openai.com/index/computer-using-agent/).
This is an interesting approach.
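To make the screenshot-driven idea concrete, here is a minimal sketch of a "look, decide, act" loop in Python. It is not OpenAI's implementation: the `Action` type, `run_agent`, and the three callables (`take_screenshot`, `choose_action`, `perform`) are all hypothetical names I am using for illustration, and the vision model is replaced by a scripted stand-in.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """One step the agent wants to take (hypothetical shape, for illustration)."""
    kind: str       # "click", "type" or "done"
    x: int = 0      # screen coordinates, used by "click"
    y: int = 0
    text: str = ""  # keystrokes, used by "type"


def run_agent(
    goal: str,
    take_screenshot: Callable[[], bytes],
    choose_action: Callable[[str, bytes, List[Action]], Action],
    perform: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Repeat: capture the screen, ask the model for an action, perform it."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()
        # In a real system this call would send the goal and the image to a vision model.
        action = choose_action(goal, screenshot, history)
        if action.kind == "done":
            break
        perform(action)
        history.append(action)
    return history


if __name__ == "__main__":
    # Scripted stand-in for the model: click a box, type a query, then stop.
    scripted = iter([
        Action("click", x=120, y=48),
        Action("type", text="flights to Lisbon"),
        Action("done"),
    ])
    steps = run_agent(
        goal="search for flights",
        take_screenshot=lambda: b"",  # no real screen capture in this sketch
        choose_action=lambda goal, shot, history: next(scripted),
        perform=lambda action: print("performing", action),
    )
    print(len(steps), "actions performed")
```

Note that nothing in the loop touches HTML: the only input is the image and the only outputs are mouse and keyboard actions, which is why no site-specific integration is needed.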
In an earlier article (https://medium.com/@orren/ai-agents-are-here-but-are-they-really-breaking-down-walled-gardens-in-apps-9f4f92ac3269), I said “The other viable alternative is using a programming language like Python with web automation software like the Selenium Library which can interact with a website like a human would, then augment that web automation work with Generative AI.”
In fact, I built a working demo in a single weekend using this approach (https://medium.com/@orren/the-misleading-myth-of-ai-agents-892fe63f1189).
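For readers who have not seen that approach, the sketch below shows its general shape; it is not the code from my demo. The page's interactive elements are pulled out of the HTML, turned into a prompt, and a language model is asked which element to use. `summarise_page`, `build_prompt`, `run_with_selenium` and the `ask_llm` callable are hypothetical names, and the Selenium function is untested glue that assumes Selenium and a Chrome driver are installed.

```python
from html.parser import HTMLParser


class InteractiveElements(HTMLParser):
    """Collect the tags an automation script could click or type into."""

    TAGS = {"a", "button", "input", "select", "textarea"}

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag in self.TAGS:
            self.found.append((tag, dict(attrs)))


def summarise_page(html):
    """Return one short description per interactive element."""
    parser = InteractiveElements()
    parser.feed(html)
    lines = []
    for tag, attrs in parser.found:
        parts = [f"{key}={attrs[key]}" for key in ("id", "name", "type") if attrs.get(key)]
        lines.append(f"<{tag}> {' '.join(parts)}".strip())
    return lines


def build_prompt(goal, html):
    """Combine the user's goal with the element list into a text prompt."""
    elements = "\n".join(summarise_page(html))
    return (
        f"Goal: {goal}\n"
        f"Interactive elements on the page:\n{elements}\n"
        "Reply with a CSS selector for the element to click."
    )


def run_with_selenium(url, goal, ask_llm):
    """Hypothetical glue: ask_llm is any callable that maps a prompt to a CSS selector."""
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        selector = ask_llm(build_prompt(goal, driver.page_source))
        driver.find_element(By.CSS_SELECTOR, selector).click()
    finally:
        driver.quit()
```

The contrast with the screenshot approach is that the model here reads text extracted from the DOM, so it is cheap to run but only as good as the markup it is given.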
OpenAI’s approach may be better in certain contexts:
• Screenshots mirror how humans interact with GUIs, focusing on the layout and visible elements rather than raw HTML data. This makes “Operator” adaptable to a wide range of interfaces without needing to parse complex or obfuscated markup.
• Many modern websites use dynamic rendering, obfuscated HTML structures, or rely heavily on JavaScript. Screenshots bypass the need to deal with these complexities, as the model interacts with the rendered interface, not the underlying code.
• While HTML scraping or automation (e.g., Selenium) often requires meticulous tuning to website-specific implementations, the screenshot approach works universally, as long as the website is visually interpretable.
• By leveraging GPT-4o’s vision capabilities, OpenAI enables reasoning about the visual layout, text, and context in one step. This could be advantageous for tasks like recognising visually distinct elements (e.g., logos, icons, or non-text buttons) that HTML scraping cannot easily handle.
• Screenshots provide a complete snapshot of what the user sees, including elements like pop-ups, modals, and other on-screen cues that might not be apparent in the HTML DOM.
• Screenshots can sometimes align better with user privacy and compliance requirements. The model works only with what is visible, rather than indiscriminately extracting underlying metadata that may contain sensitive or private information.
It also has trade-offs compared with HTML scraping:
• Processing screenshots with a vision model requires significant computational resources compared to parsing HTML. This can introduce latency and increase operating costs.
• Screenshots only capture what is visually rendered on the screen. Any hidden fields, metadata, or structured information in the HTML that is not visible cannot be accessed directly.
• Complex, multi-step workflows that require navigating between pages or handling intricate logic might still require programmatic understanding of underlying DOM structures, which scraping can directly access.
• Models must infer functionality (e.g., distinguishing between a text field and a button) from visual cues. This might lead to errors in interfaces with unconventional designs or low-quality renderings.
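The point about hidden fields is easy to demonstrate. The sketch below uses Python's standard-library HTML parser to split a form's inputs into those a user (or a screenshot) would see and those that exist only in the markup. `FieldCollector` and `split_fields` are names I made up for this example, and it only detects `type="hidden"` inputs, not elements hidden with CSS.

```python
from html.parser import HTMLParser


class FieldCollector(HTMLParser):
    """Separate visible form inputs from type="hidden" ones."""

    def __init__(self):
        super().__init__()
        self.visible = []  # names of inputs that are rendered on screen
        self.hidden = {}   # name -> value for inputs that are never rendered

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name", "")
        if attributes.get("type") == "hidden":
            self.hidden[name] = attributes.get("value", "")
        else:
            self.visible.append(name)


def split_fields(html):
    """Return (visible input names, hidden input name/value pairs)."""
    collector = FieldCollector()
    collector.feed(html)
    return collector.visible, collector.hidden


if __name__ == "__main__":
    page = (
        '<form>'
        '<input type="text" name="email">'
        '<input type="hidden" name="csrf_token" value="abc123">'
        '</form>'
    )
    visible, hidden = split_fields(page)
    print("visible:", visible)
    print("hidden:", hidden)
```

A scraper sees both lists. A screenshot-based agent only ever sees the first, which helps the privacy argument but hurts whenever a task depends on data that is not drawn on screen.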