Snapshot Model and @ref IDs
How Wraith Browser builds DOM snapshots for AI agents, assigns @ref IDs to interactive elements, and enables agents to interact with pages without CSS selectors or XPath.
Wraith does not hand raw HTML to AI agents. Instead, it builds a DOM snapshot — a
compact, flat list of interactive and semantic elements optimized for LLM context windows.
Each element is assigned a stable @ref ID that agents use to click, fill, scroll to, or
otherwise interact with page elements.
This model is inspired by accessibility tree approaches (like AgentChrome) and spatial DOM representations (like browsy-core), but tuned specifically for token efficiency and agent usability.
Why Not Raw HTML?
A typical web page's HTML source is 50,000-200,000 tokens. An LLM with a 128k context
window would spend most of its budget just reading the page. Worse, most of that HTML is
boilerplate — <div> nesting, CSS classes, tracking scripts, SVG icons — that carries
zero information for an agent trying to fill out a form or click a link.
The DOM snapshot compresses a page to its interactive essence: what can be clicked, what can be filled in, what text is visible. A page that is 100,000 tokens as raw HTML becomes 200-500 tokens as a snapshot.
The DomSnapshot Structure
Defined in crates/browser-core/src/dom.rs:
pub struct DomSnapshot {
    pub url: String,
    pub title: String,
    pub elements: Vec<DomElement>,
    pub meta: PageMeta,
    pub timestamp: chrono::DateTime<chrono::Utc>,
}
DomElement
Each element in the snapshot is a flat struct:
pub struct DomElement {
    pub ref_id: u32,  // Unique @ref ID
    pub role: String,  // Semantic role: link, button, textbox, etc.
    pub text: Option<String>,  // Visible text content
    pub href: Option<String>,  // URL for links
    pub placeholder: Option<String>,  // Placeholder for inputs
    pub value: Option<String>,  // Current value for inputs/selects
    pub enabled: bool,  // Can the element be interacted with?
    pub visible: bool,  // Is the element visible?
    pub aria_label: Option<String>,  // ARIA label if present
    pub selector: String,  // CSS selector path (fallback targeting)
    pub bounds: Option<(f64, f64, f64, f64)>,  // Bounding box (x, y, w, h)
}
The ref_id is a monotonically incrementing integer assigned during snapshot construction.
It is stable within a single snapshot but may change between snapshots if the page
mutates (elements added or removed). Agents should always take a fresh snapshot after
performing actions that modify the page.
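The assignment described above can be sketched as follows. This is a minimal illustration, not the code in dom.rs: `RawNode` and `assign_ref_ids` are hypothetical names, the `DomElement` here is trimmed to three fields, and starting the counter at 1 is an assumption based on the `@e1` that appears in rendered snapshots.

```rust
/// Hypothetical stand-in for a node found while walking the DOM.
struct RawNode {
    role: String,
    text: Option<String>,
}

/// Trimmed-down DomElement with only the fields this sketch needs.
#[derive(Debug, PartialEq)]
struct DomElement {
    ref_id: u32,
    role: String,
    text: Option<String>,
}

/// Assigns ref IDs in document order, starting at 1 (an assumption).
fn assign_ref_ids(nodes: Vec<RawNode>) -> Vec<DomElement> {
    nodes
        .into_iter()
        .zip(1u32..) // pair each node with the next counter value
        .map(|(node, ref_id)| DomElement {
            ref_id,
            role: node.role,
            text: node.text,
        })
        .collect()
}
```

Because the counter restarts for every snapshot, inserting one element near the top of the page shifts the ID of every element after it, which is why refs from an older snapshot should not be reused.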
PageMeta
Page-level metadata provides situational awareness:
pub struct PageMeta {
    pub page_type: Option<String>,  // "login", "search_results", "article", etc.
    pub main_content_preview: Option<String>,  // First ~500 chars of readable content
    pub description: Option<String>,  // Open Graph / meta description
    pub form_count: usize,  // Number of forms on the page
    pub has_login_form: bool,  // Login form detected?
    pub has_captcha: bool,  // CAPTCHA detected?
    pub interactive_element_count: usize,
    pub overlays: Vec<(String, String, String)>,  // (ref_id, type, title)
}
The overlays field is particularly important for agents. When the snapshot detects a
modal dialog, cookie banner, or other overlay blocking the page, it records the overlay's
ref ID, type, and title. The agent text rendering places these at the top of the output with a
warning, so the agent sees them first and knows to dismiss them before interacting with the
underlying page.
Agent Text Rendering
When the MCP server returns a snapshot to an agent, it calls to_agent_text() to produce a
compact text representation. Here is an example of what an agent actually sees:
Page: "GitHub - wraith-browser" (https://github.com/example/wraith-browser)
@e1 [link] "Code"
@e2 [link] "Issues (3)"
@e3 [link] "Pull requests (1)"
@e4 [button] "Star"
@e5 [button] "Fork"
@e6 [textbox] "" placeholder="Go to file"
@e7 [link] "README.md"
@e8 [text] "An AI-agent-first web browser written in Rust"
Each line follows the format: @eN [role] "text" attributes. The formatting is
deliberately aligned into columns so that both humans and LLMs can scan it quickly.
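A rough sketch of how one line in this format could be produced is shown here. It is not the actual to_agent_text() implementation: `render_line` is a hypothetical helper, the `DomElement` is trimmed to the fields it reads, and column alignment is left out for brevity.

```rust
/// Trimmed-down element with only the fields the renderer reads.
struct DomElement {
    ref_id: u32,
    role: String,
    text: Option<String>,
    placeholder: Option<String>,
    enabled: bool,
}

/// Renders one element as `@eN [role] "text" attributes`.
/// Hypothetical helper; the real renderer also aligns columns.
fn render_line(el: &DomElement) -> String {
    let mut line = format!(
        "@e{} [{}] \"{}\"",
        el.ref_id,
        el.role,
        el.text.as_deref().unwrap_or("") // missing text renders as ""
    );
    if let Some(placeholder) = &el.placeholder {
        line.push_str(&format!(" placeholder=\"{}\"", placeholder));
    }
    if !el.enabled {
        line.push_str(" [DISABLED]");
    }
    line
}
```

Keeping each element on a single self-describing line means an agent can quote a ref back without parsing any nested structure.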
Form Inputs
For form fields (textbox, email, password, combobox, etc.), the rendering shows the
placeholder and, once a value is set, the current value as well:
@e12 [textbox] "" placeholder="Email address"
@e13 [password] "" placeholder="Password"
@e14 [button] "Sign in"
After an agent fills @e12:
@e12 [textbox] "user@example.com" placeholder="Email address" value="user@example.com"
@e13 [password] "" placeholder="Password"
@e14 [button] "Sign in"
Disabled Elements
Disabled elements are tagged so agents do not waste actions on them:
@e20 [button] "Submit" [DISABLED]
Overlay Detection
When an overlay is present, the snapshot begins with a warning:
⚠ OVERLAY DETECTED: [cookie_banner] "We use cookies" @e1 — interact with this first
Page: "Example Site" (https://example.com)
@e1 [button] "Accept All"
@e2 [button] "Reject All"
@e3 [link] "Cookie Settings"
...
This priming ensures the agent addresses the overlay before trying to interact with elements that may be obscured behind it.
How @ref IDs Work in Actions
Agents interact with page elements by passing @ref IDs to browser actions. The MCP tools
accept the numeric portion of the ref ID (the 42 in @e42):
Clicking
{
  "tool": "browse_click",
  "arguments": { "ref": 4 }
}
This maps to BrowserAction::Click { ref_id: 4, force: None }. The engine looks up element
@e4 in the current snapshot, resolves its CSS selector, and performs the click.
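The lookup step can be illustrated with a small sketch. Both `parse_ref` and `resolve_selector` are hypothetical helpers written for this example, and the structs are trimmed to the fields involved; the engine's real resolution code may differ.

```rust
/// Trimmed-down snapshot types with only the fields used here.
struct DomElement {
    ref_id: u32,
    selector: String,
}

struct DomSnapshot {
    elements: Vec<DomElement>,
}

/// Extracts the numeric portion of a rendered ref such as "@e42".
/// Hypothetical helper; returns None for any other shape of input.
fn parse_ref(rendered: &str) -> Option<u32> {
    rendered.strip_prefix("@e")?.parse().ok()
}

/// Looks up a ref ID in the snapshot and returns its CSS selector,
/// or None if the snapshot holds no element with that ID.
fn resolve_selector(snapshot: &DomSnapshot, ref_id: u32) -> Option<&str> {
    snapshot
        .elements
        .iter()
        .find(|el| el.ref_id == ref_id)
        .map(|el| el.selector.as_str())
}
```

Returning `None` for an unknown ref gives the caller a clear signal that the snapshot is stale, instead of silently acting on the wrong element.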
Filling
{
  "tool": "browse_fill",
  "arguments": { "ref": 12, "value": "user@example.com" }
}
This maps to BrowserAction::Fill { ref_id: 12, text: "user@example.com", force: None }.
Selecting
{
  "tool": "browse_select",
  "arguments": { "ref": 15, "value": "California" }
}
Maps to BrowserAction::Select { ref_id: 15, value: "California", force: None }.
Scrolling To
{
  "tool": "browse_scroll_to",
  "arguments": { "ref": 42 }
}
Maps to BrowserAction::ScrollTo { ref_id: 42 }. Centers the viewport on element @e42.
Force Mode
Some elements may be hidden, disabled, or obscured by overlays. Actions accept a force
flag to bypass pre-checks:
{
  "tool": "browse_click",
  "arguments": { "ref": 7, "force": true }
}
This is useful when the agent knows an element exists (from a previous snapshot) but it is temporarily covered by an animation or positioned off-screen.
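One way to picture the pre-checks and the force bypass is the sketch below. The `pre_check` function and `PreCheckError` enum are hypothetical, and only the visibility and enabled flags are modeled; the overlay check mentioned above is omitted.

```rust
/// Trimmed-down element with the two flags the pre-check reads.
struct DomElement {
    enabled: bool,
    visible: bool,
}

/// Reasons a pre-check can refuse an action (hypothetical names).
#[derive(Debug, PartialEq)]
enum PreCheckError {
    NotVisible,
    Disabled,
}

/// Runs the pre-checks unless `force` is Some(true).
/// `force: None` behaves like `Some(false)` in this sketch.
fn pre_check(el: &DomElement, force: Option<bool>) -> Result<(), PreCheckError> {
    if force.unwrap_or(false) {
        return Ok(()); // caller explicitly asked to skip the checks
    }
    if !el.visible {
        return Err(PreCheckError::NotVisible);
    }
    if !el.enabled {
        return Err(PreCheckError::Disabled);
    }
    Ok(())
}
```

Modeling `force` as `Option<bool>` mirrors the `force: None` seen in the BrowserAction variants, so an omitted flag and an explicit `false` behave the same way.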
The Snapshot Lifecycle
A typical agent interaction follows this cycle:
1. browse_navigate("https://example.com")
   └── Engine fetches page, builds DomSnapshot, returns agent text
2. Agent reads snapshot, identifies @e6 as a search input
3. browse_fill(ref=6, value="openai")
   └── Engine resolves @e6 → CSS selector, sets value
4. browse_click(ref=7) // "Search" button
   └── Engine clicks, page navigates
5. browse_snapshot()
   └── Fresh snapshot of search results page (new @ref IDs)
6. Agent reads new snapshot, identifies result links
Key rule: after any action that might change the page (click, navigate, form submit), the agent should take a fresh snapshot. The ref IDs from the previous snapshot may no longer be valid.
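An agent loop could enforce this rule with a small piece of bookkeeping, sketched below. `RefTracker` and `Action` are hypothetical names, not part of Wraith. The policy is an assumption drawn from the cycle above: navigate and snapshot hand back fresh refs, a click invalidates them, and fill or scroll leaves them usable.

```rust
/// Actions an agent can issue (a subset, named after the MCP tools).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Action {
    Navigate,
    Snapshot,
    Fill,
    Click,
    ScrollTo,
}

/// Agent-side bookkeeping: are the ref IDs we hold still trustworthy?
struct RefTracker {
    refs_fresh: bool,
}

impl RefTracker {
    /// Starts stale: no snapshot has been seen yet.
    fn new() -> Self {
        RefTracker { refs_fresh: false }
    }

    /// Updates freshness after an action completes.
    fn record(&mut self, action: Action) {
        match action {
            // Both hand a new snapshot back to the agent.
            Action::Navigate | Action::Snapshot => self.refs_fresh = true,
            // A click may navigate or mutate the page.
            Action::Click => self.refs_fresh = false,
            // Assumed not to add or remove elements.
            Action::Fill | Action::ScrollTo => {}
        }
    }

    /// Ref-based actions should only be issued while refs are fresh.
    fn can_use_refs(&self) -> bool {
        self.refs_fresh
    }
}
```

A stricter policy would also mark refs stale after a fill, since some pages re-render on input; the trade-off is one extra snapshot per form field.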
Token Budgeting
The snapshot includes a token estimate:
pub fn estimated_tokens(&self) -> usize {
    self.to_agent_text().len() / 4
}
This rough estimate (4 characters per token) lets the agent loop budget context window usage. A typical page snapshot consumes 200-800 tokens. Complex pages with many form fields might reach 1,500-2,000 tokens. Even at 2,000 tokens, the snapshot is roughly 25x to 100x smaller than the 50,000-200,000 tokens of raw HTML.
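The budgeting arithmetic can be sketched as a standalone check. Here `estimated_tokens` is rewritten as a free function over the rendered text, and `fits_budget` is a hypothetical helper. Note that `str::len()` in Rust counts bytes, so text with many non-ASCII characters is estimated slightly high.

```rust
/// Rough token estimate: 4 bytes of rendered text per token.
fn estimated_tokens(agent_text: &str) -> usize {
    agent_text.len() / 4
}

/// Returns true if the snapshot fits in what is left of the context
/// window. Hypothetical helper for an agent loop.
fn fits_budget(agent_text: &str, used_tokens: usize, context_window: usize) -> bool {
    used_tokens + estimated_tokens(agent_text) <= context_window
}
```

For example, a 4,000-character snapshot is estimated at 1,000 tokens, so it still fits a 128,000-token window when 127,000 tokens are already in use, while a 4,004-character snapshot does not.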
CSS Selector Fallback
Each DomElement carries a selector field — the CSS selector path to the element. This
serves as a fallback targeting mechanism. If an engine cannot resolve an element by ref ID
(for example, after a partial page re-render), it can fall back to the CSS selector. The
selector is also useful for debugging — it tells you exactly which DOM node a ref ID points
to.
Bounding Boxes
When the engine supports layout (SevroEngine and CdpEngine, but not NativeEngine), each
element includes a bounding box as (x, y, width, height). This enables spatial reasoning
— an agent can determine whether an element is above, below, or beside another element.
NativeEngine sets bounds to None since it has no layout engine.
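The kind of spatial reasoning this enables can be sketched with a few comparisons on the (x, y, width, height) tuples. The helper names are hypothetical, and the sketch assumes y grows downward, as in browser viewport coordinates.

```rust
/// Bounding box as (x, y, width, height), matching DomElement::bounds.
type Bounds = (f64, f64, f64, f64);

/// True if `a` ends at or above the top edge of `b`.
/// Assumes y grows downward.
fn is_above(a: Bounds, b: Bounds) -> bool {
    a.1 + a.3 <= b.1
}

/// True if the boxes share some vertical extent but do not overlap
/// horizontally, i.e. they sit side by side.
fn is_beside(a: Bounds, b: Bounds) -> bool {
    let vertical_overlap = a.1 < b.1 + b.3 && b.1 < a.1 + a.3;
    let horizontal_overlap = a.0 < b.0 + b.2 && b.0 < a.0 + a.2;
    vertical_overlap && !horizontal_overlap
}

/// Same check as `is_above`, but yields None when either element
/// has no bounds (as with NativeEngine).
fn above_if_known(a: Option<Bounds>, b: Option<Bounds>) -> Option<bool> {
    Some(is_above(a?, b?))
}
```

An agent could use checks like these to pair a label with the input directly beneath it, and fall back to element order when bounds are None.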