Extraction
Content extraction tools -- snapshots, markdown, PDF, article readability, screenshots, OCR, and web search
Extraction tools pull structured content from pages. Use browse_snapshot to see the interactive DOM with @e ref IDs, browse_extract to get clean markdown for LLM context windows, and the specialized extractors for PDFs, articles, and OCR. The browse_search tool provides web metasearch without navigating.
browse_snapshot
Get the current page's DOM as an agent-readable snapshot. Shows all interactive elements with @e reference IDs that can be used with browse_click, browse_fill, and other interaction tools.
The tool walks the full DOM tree, assigns sequential @e ref IDs to visible elements, reads the current .value of each form input from QuickJS (so values set by browse_fill are reflected), and formats the result as compact text.
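Because the snapshot is line-oriented text, an agent can index it by ref before choosing an interaction. The following is a minimal Python sketch, assuming the layout matches the sample response in this section (`@eN [kind] "label" key="value" ...`). `SnapshotElement` and `parse_snapshot` are hypothetical names for illustration, not part of the tool API.

```python
import re
from dataclasses import dataclass, field

# One snapshot line, e.g.: @e1 [text] "Jane" value="Jane" placeholder="First Name"
# The layout is inferred from the sample response, not from a formal grammar.
LINE_RE = re.compile(r'^@e(\d+) \[(\w+)\] "((?:[^"\\]|\\.)*)"(.*)$')
# Only key="value" pairs are captured; list-valued fields such as options=[...] are skipped.
ATTR_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


@dataclass
class SnapshotElement:
    ref: int    # numeric part of the @e reference
    kind: str   # text, email, select, button, ...
    label: str  # quoted text right after the kind
    attrs: dict = field(default_factory=dict)


def parse_snapshot(text: str) -> list[SnapshotElement]:
    """Collect every line that starts with an @e reference."""
    elements = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # page header, blank line, or element-count footer
        ref, kind, label, rest = match.groups()
        attrs = dict(ATTR_RE.findall(rest))
        elements.append(SnapshotElement(int(ref), kind, label, attrs))
    return elements
```

With the parsed list, finding the first empty input is a simple filter on `attrs.get("value")`.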
Parameters
None.
Request
{
"tool": "browse_snapshot",
"arguments": {}
}
Response
{
"content": [
{
"type": "text",
"text": "Page: \"Job Application\" (https://example.com/apply)\n\n@e1 [text] \"Jane\" value=\"Jane\" placeholder=\"First Name\"\n@e2 [text] \"\" placeholder=\"Last Name\"\n@e3 [email] \"\" placeholder=\"Email\"\n@e4 [file] \"\"\n@e5 [select] \"\" options=[\"US\",\"UK\",\"CA\",\"DE\"]\n@e6 [textarea] \"\" placeholder=\"Cover Letter\"\n@e7 [button] \"Submit Application\"\n\n7 interactive elements"
}
]
}
browse_screenshot
Capture a PNG screenshot of the current page. Returns a base64-encoded PNG image. Useful for visual verification, OCR input, or debugging layout issues.
Parameters
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| full_page | boolean | no | false | true captures the entire scrollable page. false captures only the visible viewport |
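The image arrives as base64 text, so a client has to decode it before saving it or handing it to another tool. The following is a small Python sketch, assuming the response shape shown in this section (an item with type, data, and mimeType). `decode_screenshot` is a hypothetical helper name.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first eight bytes of every PNG file


def decode_screenshot(response: dict) -> bytes:
    """Return the raw PNG bytes from the first image/png item in a tool response."""
    for item in response.get("content", []):
        if item.get("type") == "image" and item.get("mimeType") == "image/png":
            png = base64.b64decode(item["data"])
            if not png.startswith(PNG_SIGNATURE):
                raise ValueError("decoded data does not start with the PNG signature")
            return png
    raise ValueError("response contains no image/png item")
```

The returned bytes can be written to disk with `pathlib.Path("page.png").write_bytes(...)`.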
Request — viewport only
{
"tool": "browse_screenshot",
"arguments": {}
}
Response
{
"content": [
{
"type": "image",
"data": "iVBORw0KGgoAAAANSUhEUgAA...",
"mimeType": "image/png"
}
]
}
Request — full page
{
"tool": "browse_screenshot",
"arguments": {
"full_page": true
}
}
browse_extract
Extract the current page's content as clean LLM-optimized markdown. Strips navigation, ads, and boilerplate using the wraith_content_extract engine. The max_tokens parameter ensures the output fits your context window budget.
Parameters
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| max_tokens | integer | no | unlimited | Token budget for extracted content. Output is truncated to fit within this limit |
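To confirm that the output stayed within the budget, a client can read the summary line that closes the extracted text. The following is a Python sketch, assuming the footer always has the layout seen in the sample response for this tool ("42 links | ~1500 tokens" after a "---" rule); that layout is inferred from the example, not from a documented contract. `split_extract` is a hypothetical helper name.

```python
import re

# Footer layout taken from the sample response: "42 links | ~1500 tokens".
FOOTER_RE = re.compile(r"(\d+) links \| ~(\d+) tokens\s*$")


def split_extract(text: str):
    """Split browse_extract output into (body, link_count, token_estimate).

    Returns (text, None, None) when no footer in the expected layout is found.
    """
    match = FOOTER_RE.search(text)
    if match is None:
        return text, None, None
    body = text[: match.start()].rstrip()
    if body.endswith("---"):
        body = body[:-3].rstrip()  # drop the rule that separates body and footer
    return body, int(match.group(1)), int(match.group(2))
```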
Request
{
"tool": "browse_extract",
"arguments": {
"max_tokens": 1500
}
}
Response
{
"content": [
{
"type": "text",
"text": "# Software Engineer — Remote\n\n**Company:** Acme Corp\n**Location:** Remote (US)\n**Salary:** $150,000 - $200,000\n\n## About the Role\n\nWe are looking for a software engineer with experience in Rust and distributed systems...\n\n## Requirements\n\n- 3+ years of Rust experience\n- Familiarity with async runtimes (Tokio)\n- Experience with PostgreSQL\n\n## Benefits\n\n- Fully remote\n- Unlimited PTO\n- Health, dental, vision\n\n---\n42 links | ~1500 tokens"
}
]
}
extract_markdown
Convert HTML to clean markdown. If html is provided, converts that HTML string directly. If omitted, converts the current page's full HTML source.
Parameters
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| html | string | no | -- | Raw HTML string to convert. If omitted, uses the current page's HTML source |
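Since html is optional, a request builder can leave the key out entirely when the current page should be converted. The following is a minimal Python sketch; `extract_markdown_request` is a hypothetical helper, and the payload shape simply mirrors the request samples in this section.

```python
import json
from typing import Optional


def extract_markdown_request(html: Optional[str] = None) -> dict:
    """Build an extract_markdown call; omit "html" to convert the current page."""
    arguments = {}
    if html is not None:
        arguments["html"] = html
    return {"tool": "extract_markdown", "arguments": arguments}


# json.dumps handles escaping, so the HTML may contain quotes of either kind.
payload = json.dumps(extract_markdown_request('<p class="note">Hi</p>'))
```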
Request — convert current page
{
"tool": "extract_markdown",
"arguments": {}
}
Response
{
"content": [
{
"type": "text",
"text": "# Welcome to Acme Corp\n\nWe build tools for developers.\n\n## Products\n\n- **Acme CLI** — Command-line interface for managing deployments\n- **Acme SDK** — Client libraries for Python, Rust, and Go\n\n## Getting Started\n\n```bash\nnpm install @acme/sdk\n```\n\nSee our [documentation](https://docs.acme.com) for more details."
}
]
}
Request — convert an HTML string
{
"tool": "extract_markdown",
"arguments": {
"html": "<h1>Hello</h1><p>This is a <strong>test</strong> paragraph with a <a href='https://example.com'>link</a>.</p>"
}
}
Response
{
"content": [
{
"type": "text",
"text": "# Hello\n\nThis is a **test** paragraph with a [link](https://example.com)."
}
]
}
extract_article
Extract the main article body from the current page using readability analysis. Removes navigation, sidebars, ads, and other non-content elements. Ideal for extracting blog posts, news articles, and documentation pages.
Parameters
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| readability | boolean | no | true | If true, uses the readability extraction algorithm to isolate the main content. If false, returns all text content |
Request
{
"tool": "extract_article",
"arguments": {
"readability": true
}
}
Response
{
"content": [
{
"type": "text",
"text": "# Understanding Rust Lifetimes\n\nBy Jane Doe | Published March 15, 2025\n\nLifetimes are one of Rust's most distinctive features. They ensure that references are always valid, preventing dangling pointers and use-after-free bugs at compile time.\n\n## The Borrow Checker\n\nRust's borrow checker enforces two rules:\n\n1. At any given time, you can have either one mutable reference or any number of immutable references.\n2. References must always be valid.\n\n## Lifetime Annotations\n\nWhen the compiler can't infer lifetimes automatically, you annotate them:\n\n```rust\nfn longest<'a>(x: &'a str, y: &'a str) -> &'a str {\n if x.len() > y.len() { x } else { y }\n}\n```\n\n---\nExtracted 1,247 words | Reading time: ~5 min"
}
]
}
extract_plain_text
Convert HTML to plain text with no formatting. Strips all tags, attributes, and styles, returning only the text content. Useful for word counting, NLP processing, or plain-text comparison.
Parameters
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| html | string | no | -- | Raw HTML string to convert. If omitted, uses the current page's HTML source |
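For the word-counting use case, the plain-text result needs no further cleanup. The following is a short Python sketch, assuming the response carries its output in text items as in the samples in this section. `response_text` and `word_count` are hypothetical helper names.

```python
def response_text(response: dict) -> str:
    """Join the text items of a tool response into one string."""
    return "\n".join(
        item["text"]
        for item in response.get("content", [])
        if item.get("type") == "text"
    )


def word_count(response: dict) -> int:
    """Count whitespace-separated words in the extracted plain text."""
    return len(response_text(response).split())
```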
Request
{
"tool": "extract_plain_text",
"arguments": {}
}
Response
{
"content": [
{
"type": "text",
"text": "Welcome to Acme Corp\n\nWe build tools for developers.\n\nProducts\n\nAcme CLI — Command-line interface for managing deployments\nAcme SDK — Client libraries for Python, Rust, and Go\n\nGetting Started\n\nnpm install @acme/sdk\n\nSee our documentation for more details."
}
]
}
Request — convert an HTML string
{
"tool": "extract_plain_text",
"arguments": {
"html": "<div><h1>Title</h1><p>Paragraph with <em>emphasis</em> and <a href='/'>a link</a>.</p></div>"
}
}
Response
{
"content": [
{
"type": "text",
"text": "Title\n\nParagraph with emphasis and a link."
}
]
}
extract_pdf
Fetch a PDF from a URL over HTTP and extract its text content as markdown. Useful for processing research papers, reports, resumes, and documentation that live behind PDF links.
Parameters
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | yes | -- | Full URL of the PDF to fetch and extract (e.g., "https://example.com/report.pdf") |
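Because url must be a full URL, it can be worth rejecting relative paths before the call is sent. The following is a Python sketch; `extract_pdf_request` is a hypothetical helper, and restricting the scheme to http or https is an assumption drawn from the "over HTTP" wording and the example URL.

```python
from urllib.parse import urlparse


def extract_pdf_request(url: str) -> dict:
    """Build an extract_pdf call, rejecting URLs that are not absolute HTTP(S)."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"extract_pdf needs a full http(s) URL, got {url!r}")
    return {"tool": "extract_pdf", "arguments": {"url": url}}
```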
Request
{
"tool": "extract_pdf",
"arguments": {
"url": "https://example.com/annual-report-2024.pdf"
}
}
Response
{
"content": [
{
"type": "text",
"text": "# Annual Report 2024\n\n## Executive Summary\n\nRevenue grew 32% year-over-year to $4.2B. Operating margin improved to 18.5% from 15.2% in the prior year.\n\n## Financial Highlights\n\n| Metric | 2024 | 2023 | Change |\n|--------|------|------|--------|\n| Revenue | $4.2B | $3.2B | +32% |\n| Net Income | $780M | $480M | +63% |\n| Employees | 12,400 | 9,800 | +27% |\n\n## Outlook\n\nWe expect continued growth driven by our enterprise platform...\n\n---\nExtracted from PDF: 24 pages, 8,432 words"
}
]
}
extract_ocr
Run OCR text detection on the current page's screenshot. Useful when page content is rendered as images, Canvas elements, or uses custom fonts that resist text extraction.
Parameters
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| description | string | no | -- | Hint about what to OCR or look for on the page. Helps focus the extraction |
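One way to use OCR is as a fallback when text extraction returns little or nothing. The following is a Python sketch of that pattern. `call_tool` stands in for whatever transport sends a tool name and arguments and returns the response; it is hypothetical, as are the helper names and the `min_chars` threshold.

```python
from typing import Callable


def first_text(response: dict) -> str:
    """Return the first text item of a tool response, or an empty string."""
    for item in response.get("content", []):
        if item.get("type") == "text":
            return item.get("text", "")
    return ""


def extract_with_ocr_fallback(
    call_tool: Callable[[str, dict], dict],  # hypothetical: (tool, arguments) -> response
    hint: str,
    min_chars: int = 200,  # arbitrary cutoff for "too little text"; tune per site
) -> str:
    """Try browse_extract first; fall back to extract_ocr when it yields little text."""
    text = first_text(call_tool("browse_extract", {}))
    if len(text.strip()) >= min_chars:
        return text
    return first_text(call_tool("extract_ocr", {"description": hint}))
```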
Request
{
"tool": "extract_ocr",
"arguments": {
"description": "Extract the pricing table from the page screenshot"
}
}
Response
{
"content": [
{
"type": "text",
"text": "OCR Results:\n\nPricing\n\nStarter Pro Enterprise\n$9/mo $29/mo Custom\n5 projects Unlimited Unlimited\n1 GB storage 50 GB 500 GB\nEmail support Priority Dedicated\n\nAll plans include SSL, CDN, and CI/CD."
}
]
}
browse_search
Search the web via metasearch without navigating to any page. See the Navigation reference for full details.
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| query | string | yes | -- | Search query (supports OR for multi-variant search) |
| max_results | integer | no | 10 | Maximum number of results |
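A multi-variant query can be assembled by joining alternative phrasings with OR. The following is a Python sketch; `search_request` is a hypothetical helper, and since the exact OR grammar (quoting, grouping) is not specified here, the variants are joined as plain text.

```python
def search_request(variants: list[str], max_results: int = 10) -> dict:
    """Build a browse_search call that ORs several phrasings of one query."""
    cleaned = [v.strip() for v in variants if v.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty query variant is required")
    return {
        "tool": "browse_search",
        "arguments": {"query": " OR ".join(cleaned), "max_results": max_results},
    }
```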
Additional Extraction Tools
browse_eval_js
Execute JavaScript on the current page for custom extraction. See the Interaction reference for details.
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| code | string | yes | -- | JavaScript code to execute in the page's QuickJS context |
dom_query_selector
Run a CSS selector query against the page DOM and return matching elements with their @ref IDs.
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| selector | string | yes | -- | CSS selector (e.g., "div.job-card", "#main-content", "a[href*='apply']") |
dom_get_attribute
Read an HTML attribute value from an element by its @ref ID.
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| ref_id | integer | yes | -- | Element @e reference |
| name | string | yes | -- | Attribute name (e.g., "href", "class", "data-job-id", "aria-label") |
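Snapshots show refs as text such as @e7, while ref_id is typed as an integer. The following is a Python sketch of the conversion, assuming ref_id is the numeric part of the @e reference; that mapping is an assumption, not something this page states. `get_attribute_request` is a hypothetical helper name.

```python
import re


def get_attribute_request(ref: str, name: str) -> dict:
    """Build a dom_get_attribute call from a ref such as "@e7".

    Assumes ref_id is the numeric part of the @e reference.
    """
    match = re.fullmatch(r"@e(\d+)", ref)
    if match is None:
        raise ValueError(f"not an @e reference: {ref!r}")
    return {
        "tool": "dom_get_attribute",
        "arguments": {"ref_id": int(match.group(1)), "name": name},
    }
```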