Meet Alibaba’s Page Agent: A JavaScript In-Page GUI Agent That Controls Web Interfaces With Natural Language Through the DOM
Source: MarkTechPost
Tags: Alibaba, Page Agent, browser automation, DOM, open-source, TypeScript, agentic AI
Alibaba has open-sourced Page Agent, a TypeScript library that embeds inside a webpage and controls its UI elements through natural language. It uses DOM dehydration to compress the page into a compact text map that any OpenAI-compatible model can act on, with no screenshots or headless browsers required.
Details
Most browser automation tools, including Playwright, Puppeteer, and Selenium, operate from outside the page, relying on screenshots or protocols such as the Chrome DevTools Protocol. Alibaba's Page Agent inverts this: it is a TypeScript library that runs as client-side JavaScript inside the page itself, inheriting the user's session, cookies, and authentication state.
The core technique is DOM dehydration. Rather than passing raw HTML (which can contain thousands of nodes) to a model, the agent scans the live DOM, assigns each interactive element an index with a role and label, and outputs a compact FlatDomTree text representation. Because the model sees only this short indexed text, smaller text-only models can target elements precisely without vision capabilities.
The agent is model-agnostic: any OpenAI-compatible endpoint works. It builds on browser-use for DOM processing and is released under the MIT license with TypeScript source. Because it runs inside the page, the application's existing UI validation and security rules remain intact, and no separate backend is needed.
Key limitations: safety is enforced only at the prompt level, the agent is scoped to a single page, and it does not support external or locked-down sites. It is best suited to internal copilots and form-filling automation within apps you own.
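The DOM dehydration idea described above can be illustrated with a minimal sketch. This is not Page Agent's actual code or its FlatDomTree format: the names SimpleNode, INTERACTIVE_TAGS, dehydrate, and renderMap are hypothetical, the list of interactive tags is an assumption, and a plain object tree stands in for the live DOM so the example runs outside a browser.

```typescript
// Hypothetical, simplified stand-in for a DOM element (not Page Agent's real types).
interface SimpleNode {
  tag: string;
  attrs?: Record<string, string>;
  text?: string;
  children?: SimpleNode[];
}

// One entry in the flattened map: an index the model can cite, plus role and label.
interface IndexedElement {
  index: number;
  role: string;
  label: string;
  node: SimpleNode;
}

// Tags treated as interactive in this sketch (an assumption, not an exhaustive list).
const INTERACTIVE_TAGS = new Set(["a", "button", "input", "select", "textarea"]);

// Prefer an explicit ARIA role, then fall back to a guess based on the tag.
function roleOf(node: SimpleNode): string {
  const explicit = node.attrs?.role;
  if (explicit) return explicit;
  if (node.tag === "a") return "link";
  if (node.tag === "input") return node.attrs?.type ?? "text";
  return node.tag;
}

// Concatenate the text of a node and all its descendants, collapsing whitespace.
function collectText(node: SimpleNode): string {
  const own = node.text ?? "";
  const nested = (node.children ?? []).map(collectText).join(" ");
  return `${own} ${nested}`.replace(/\s+/g, " ").trim();
}

// Pick the most descriptive label available for an element.
function labelOf(node: SimpleNode): string {
  return (
    node.attrs?.["aria-label"] ||
    node.attrs?.placeholder ||
    collectText(node) ||
    node.attrs?.name ||
    ""
  );
}

// Walk the tree depth-first and keep only interactive elements, in document order.
function dehydrate(root: SimpleNode): IndexedElement[] {
  const out: IndexedElement[] = [];
  const visit = (node: SimpleNode): void => {
    if (INTERACTIVE_TAGS.has(node.tag) || node.attrs?.role === "button") {
      out.push({ index: out.length, role: roleOf(node), label: labelOf(node), node });
    }
    (node.children ?? []).forEach(visit);
  };
  visit(root);
  return out;
}

// Render the compact text map that would be placed in the model prompt.
function renderMap(elements: IndexedElement[]): string {
  return elements.map((e) => `[${e.index}] <${e.role}> ${e.label}`).join("\n");
}

// Example page: a small sign-in form with non-interactive wrappers around the controls.
const page: SimpleNode = {
  tag: "form",
  children: [
    { tag: "label", text: "Email" },
    { tag: "input", attrs: { type: "email", placeholder: "Email address" } },
    { tag: "div", children: [{ tag: "button", children: [{ tag: "span", text: "Sign in" }] }] },
    { tag: "a", attrs: { href: "/reset" }, text: "Forgot password?" },
  ],
};

console.log(renderMap(dehydrate(page)));
// [0] <email> Email address
// [1] <button> Sign in
// [2] <link> Forgot password?
```

In a design like this, the model only needs to answer with an index (for example, "click 1"), and the agent resolves that index back to the stored node. The label and div wrappers never reach the prompt, which is what keeps the map small enough for text-only models.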