You need content from a website. Maybe you are building a RAG knowledge base. Maybe you are researching competitors. Maybe you are extracting data from a web app. You reach for requests and BeautifulSoup, write 50 lines of code, and then discover the site renders client-side with JavaScript and your scraper gets nothing. Modern websites are JavaScript applications. Traditional HTTP scrapers see an empty shell.

This is the state of web scraping in 2026. Single-page applications built with React, Next.js, Angular, Vue, and dozens of other frameworks render content client-side. The content is not in the HTML source. It is constructed by JavaScript after the page loads. You need a real browser to see the real content. But running headless Chrome, waiting for JavaScript to execute, and then extracting structured content from arbitrary DOM structures requires significant infrastructure and careful timing.

What makes modern web scraping hard?

Four problems compound on each other.

Knowing when the page is ready. SPAs load incrementally. The initial HTML arrives, then JavaScript bundles download and execute, then API calls fire, then components render with the fetched data. A React app might show a loading spinner for two seconds before the actual content appears. A Next.js page might hydrate in 200 milliseconds. A Vue app might stream in components over several seconds. There is no universal "done" signal. The browser's load event fires long before the content you want is actually visible.

Extracting structured content. Even after the page is fully rendered, the DOM is a mess. Navigation bars, cookie banners, ad scripts, tracking pixels, footer links, social media widgets. The content you actually want is buried in there somewhere. Readability algorithms help but they are tuned for articles, not web apps.

Bot detection. Websites increasingly detect and block automated browsers. Headless Chrome has telltale fingerprints. Missing browser APIs, absent plugins, suspicious timing patterns. Sites behind Cloudflare, Akamai, or DataDome reject requests from browsers that look automated.

Format mismatch. Different use cases need different output formats. RAG ingestion works best with structured markdown. NLP pipelines need clean plain text. Debugging needs the raw page structure. Sometimes you just need a list of links. No single extraction method serves all purposes.

How does PiperScrape solve the readiness problem?

PiperScrape is ToolPiper's CDP-based web scraper. It drives a real Chrome browser with your normal profile, not a headless instance. This immediately solves the bot detection problem for most sites, because there is nothing to detect. It is a real browser with real browser APIs and a real user profile.

The readiness problem is harder. PiperScrape solves it with a RACE pattern and framework detection.

Component Intelligence is an embedded JavaScript module (ComponentIntelligenceScript) that PiperScrape injects into every page. It does three things: applies stealth patches to avoid bot detection, hooks into framework lifecycle events, and tracks network requests and DOM mutations. The script detects 16 frontend frameworks: React, Vue, Angular, Svelte, Next.js, Nuxt, SvelteKit, Remix, Gatsby, Astro, Solid, Qwik, Lit, Preact, Alpine.js, and htmx.

When a framework is detected, PiperScrape uses framework-specific signals to determine readiness. React's hydration callback. Angular's stability API. Vue's mounted lifecycle hook. Next.js's page transition events. These signals are precise. They fire when the framework considers the page content ready, not just when the browser finishes loading resources.

The RACE pattern runs the framework-specific signal and a generic idle detector in parallel. The framework signal gets a 1-second cap. Whichever fires first wins. This means a Next.js page that hydrates in 200 milliseconds does not wait for the generic 3-second idle timeout. And a page built with an undetected framework still gets scraped after the generic idle period. No false failures.
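The pattern can be sketched in a few lines of Python asyncio. This is an illustration of the racing logic, not PiperScrape's internals; the detector coroutines and timings here are stand-ins.

```python
import asyncio

async def race_readiness(framework_signal, generic_idle, framework_cap=1.0):
    # Start both detectors; the framework-specific signal gets a 1-second cap.
    fw = asyncio.create_task(asyncio.wait_for(framework_signal(), framework_cap))
    idle = asyncio.create_task(generic_idle())
    done, _ = await asyncio.wait({fw, idle}, return_when=asyncio.FIRST_COMPLETED)
    if fw in done and fw.exception() is None:
        idle.cancel()          # the framework said "ready" first
        return "framework"
    await idle                 # cap expired or idle won: generic detector decides
    fw.cancel()
    return "idle"

async def fast_hydrate():
    await asyncio.sleep(0.2)   # stand-in for a fast hydration callback

async def generic_three_second_idle():
    await asyncio.sleep(3.0)   # stand-in for a generic network/DOM idle window

print(asyncio.run(race_readiness(fast_hydrate, generic_three_second_idle)))
```

The fast-hydrating page wins the race at 200 milliseconds and never waits out the idle window; a page whose framework signal never fires falls through to the generic detector.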

What are the 7 extraction formats?

PiperScrape extracts content in seven formats from a single scrape. You choose which formats you want in the request.

1. Markdown. The accessibility tree rendered as structured markdown via AXMarkdownRenderer. Headings are preserved as proper markdown headings. Links include their URLs. Lists maintain their structure. Semantic hierarchy is maintained. This is the best format for RAG ingestion because it preserves document structure without HTML noise.

2. Text. Plain text extraction. All HTML stripped, all formatting removed. Clean, readable text with no artifacts. Good for NLP pipelines, text classification, or any use case where you want pure content without structure.

3. Readability. Mozilla Readability-style content cleaning. Strips navigation, ads, footers, sidebars, and cookie banners. Isolates the main article or content area. Good for blog posts, news articles, and documentation pages where you want just the editorial content.

4. AX Tree. The raw accessibility tree as the browser understands it. The full semantic structure of the page with roles, names, states, and hierarchy. This is what screen readers see. Best for debugging extraction issues, understanding page structure, or building custom extraction logic on top of the semantic representation.

5. HTML. The full rendered HTML after all JavaScript has executed. This is what BeautifulSoup would see if it could run JavaScript. Useful when you need the actual DOM for downstream processing with your own tools, or when you need to preserve inline styles, images, and embedded content.

6. Links. All links on the page extracted with heading context. Each link includes the URL, the anchor text, and the nearest heading it falls under. Useful for building sitemaps, discovering related content, or crawling a website by following links programmatically.

7. Screenshot. A visual capture of the page as rendered in Chrome. A PNG image of exactly what a human would see. Useful for visual regression testing, documentation, or feeding into a vision model for analysis.
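To make the links format concrete, here is a toy Python parser that pairs each link with the nearest preceding heading. It sketches the idea only; PiperScrape's actual extractor is not shown here.

```python
from html.parser import HTMLParser

class LinkContextParser(HTMLParser):
    """Collects each <a href> along with the nearest preceding heading."""
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.links = []
        self.current_heading = None
        self._href = None
        self._mode = None          # "heading", "link", or None
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._mode, self._buf = "heading", []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._mode, self._buf = "link", []

    def handle_data(self, data):
        if self._mode:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._mode == "heading":
            self.current_heading = "".join(self._buf).strip()
            self._mode = None
        elif tag == "a" and self._mode == "link":
            self.links.append({"url": self._href,
                               "text": "".join(self._buf).strip(),
                               "heading": self.current_heading})
            self._mode = None

p = LinkContextParser()
p.feed('<h2>Docs</h2><p><a href="/start">Getting started</a></p>')
print(p.links)
```

Each entry carries the URL, the anchor text, and the heading it falls under, which is exactly the shape you want for crawling or sitemap building.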

How does the AX tree approach compare to DOM scraping?

Most scrapers parse the DOM. They navigate HTML elements, follow CSS selectors, and extract text nodes. This works for simple pages but breaks down on modern web apps where content is buried inside deeply nested framework components.

PiperScrape's markdown format uses the accessibility tree, not the DOM. Chrome's AX tree is a semantic representation of the page. It describes what users see and interact with: headings, paragraphs, links, buttons, lists. It strips away presentational markup, wrapper divs, CSS-only elements, and framework scaffolding. A React app with 2,000 DOM nodes might have 200 AX tree nodes, and those 200 nodes contain the actual content.

The AXMarkdownRenderer walks the AX tree and produces structured markdown. A heading node becomes a markdown heading. A link node becomes a markdown link with its URL. A list becomes a markdown list. The output reads like a clean document, not like parsed HTML with artifacts.
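The walk can be sketched in Python. The node shape used here ({'role', 'name', 'children', ...}) is an assumed simplification for illustration, not Chrome's actual AX node format or the AXMarkdownRenderer implementation.

```python
def render_ax_markdown(node, depth=1):
    """Toy recursive walk: map AX roles to markdown lines."""
    role = node.get("role")
    name = node.get("name", "")
    lines = []
    if role == "heading":
        level = node.get("level", depth)
        lines.append("#" * level + " " + name)       # heading node -> markdown heading
    elif role == "link":
        lines.append(f"[{name}]({node.get('url', '')})")  # link node -> markdown link
    elif role == "listitem":
        lines.append(f"- {name}")                    # list item -> markdown bullet
    elif role in ("paragraph", "StaticText") and name:
        lines.append(name)                           # text content passes through
    for child in node.get("children", []):
        lines.extend(render_ax_markdown(child, depth + 1))
    return lines

tree = {"role": "RootWebArea", "children": [
    {"role": "heading", "level": 1, "name": "PiperScrape"},
    {"role": "paragraph", "name": "A CDP-based scraper."},
    {"role": "link", "name": "Docs", "url": "https://example.com/docs"},
]}
print("\n".join(render_ax_markdown(tree)))
```

Wrapper divs and presentational nodes simply never appear in the tree, so they never appear in the output.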

How do you use PiperScrape?

PiperScrape exposes a REST API and two MCP tools.

REST API: Three endpoints. POST /v1/scrape starts a scrape job with a URL and requested formats. GET /v1/scrape/:id retrieves the result. GET /v1/scrape lists all scrape jobs. The scrape runs asynchronously and SSE events (scrape.completed, scrape.failed) notify you when it finishes.
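A minimal sketch of calling the REST API from Python. The endpoint paths come from above, but the base URL, the JSON field names ("url", "formats"), and the "id" response field are assumptions about the API shape, not documented contract.

```python
import json
import urllib.request

def build_scrape_request(base, url, formats):
    # Field names "url" and "formats" are assumed, not documented API shape.
    payload = {"url": url, "formats": formats}
    return f"{base}/v1/scrape", json.dumps(payload).encode()

def submit_scrape(base, url, formats):
    endpoint, body = build_scrape_request(base, url, formats)
    req = urllib.request.Request(
        endpoint, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        # The response is assumed to carry a job id; poll GET /v1/scrape/:id
        # (or listen for the scrape.completed SSE event) to fetch the result.
        return json.loads(resp.read())["id"]

endpoint, body = build_scrape_request(
    "http://localhost:8080",     # hypothetical base URL
    "https://example.com",
    ["markdown", "links"])
```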

MCP tools: Two tools available to any MCP client. scrape handles full scraping with all format options. browser_detect runs framework detection only, without extracting content. If you are using ToolPiper with Claude Code, Cursor, or any MCP-aware tool, you can scrape websites with natural language.

How does PiperScrape work with RAG?

The practical workflow: scrape a website to markdown, feed the markdown into a ToolPiper RAG collection, then ask questions about it in chat.

PiperScrape's markdown output is specifically designed for RAG. The heading hierarchy provides natural chunk boundaries. Links are preserved with their URLs so the model can cite sources. Semantic structure is maintained so the model understands document organization. Compare this to dumping raw HTML into a RAG pipeline, where the model has to parse through div soup, navigation menus, and script tags to find the actual content.
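Heading-based chunking is simple to sketch. This is an illustrative splitter, not ToolPiper's RAG pipeline: it cuts the scraped markdown at every heading up to a chosen level, so each chunk is a self-contained section.

```python
import re

def chunk_by_headings(markdown, max_level=2):
    """Split markdown at headings of level <= max_level."""
    chunks, current = [], []
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s", line)
        if m and len(m.group(1)) <= max_level and current:
            chunks.append("\n".join(current).strip())  # close the previous section
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]

doc = "# Guide\nIntro.\n## Install\nSteps.\n## Usage\nRun it."
print(chunk_by_headings(doc))
```

Because PiperScrape emits proper markdown headings, boundaries like these fall on real section edges instead of arbitrary character offsets.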

You can scrape multiple pages and index them all into the same collection. Product documentation, competitor blogs, research papers, internal wikis. The RAG pipeline handles chunking, embedding, and vector indexing. You just provide clean markdown.

What are the limitations?

PiperScrape is honest about what it cannot do.

Requires Chrome running. PiperScrape is CDP-based. It navigates a real Chrome browser via the DevTools Protocol. No browser, no scraping. This is by design, because a real browser is what makes JavaScript rendering and bot detection evasion work. But it means you cannot run headless scrape jobs on a server without a display.

Not built for bulk crawling. The scrape queue is serialized, with a hard cap of 5 concurrent Chrome tabs. This is a tool for extracting content from tens or hundreds of pages, not for crawling thousands. If you need to scrape an entire website with 50,000 pages, use Scrapy or a dedicated crawling framework.
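The tab cap amounts to bounded concurrency, which can be sketched with a semaphore. This illustrates the idea, not PiperScrape's actual queue implementation.

```python
import asyncio

async def run_queue(jobs, max_tabs=5):
    sem = asyncio.Semaphore(max_tabs)   # one permit per Chrome tab
    active = peak = 0

    async def scrape_one(job):
        nonlocal active, peak
        async with sem:                 # waits here once all 5 tabs are busy
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)   # stand-in for an actual scrape
            active -= 1
            return job

    results = await asyncio.gather(*(scrape_one(j) for j in jobs))
    return results, peak

results, peak = asyncio.run(run_queue(range(12)))
print(peak)  # never exceeds 5
```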

Heavily protected sites may still block. PiperScrape's stealth patches handle most bot detection. But sites with aggressive CAPTCHAs, Cloudflare challenge pages with interactive verification, or sites that require login will still require manual intervention. The browser is real, but automated navigation patterns can still be detected by the most sophisticated protections.

2MB content cap. Each scrape is capped at 2MB of extracted content to prevent regex denial-of-service on pathologically large pages. In practice, this is more than enough for any single page. If a page exceeds 2MB of text content, something unusual is happening.

Two-layer SSRF protection. PiperScrape will not scrape localhost, private IPs, or internal network addresses. isPrivateHost checks the hostname. resolvesToPrivateIP resolves DNS and checks the IP. This prevents using PiperScrape as a proxy to access internal services.
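The two layers can be sketched in Python. These mirror the idea behind isPrivateHost and resolvesToPrivateIP; the actual checks PiperScrape performs are not shown here, and the hostname heuristics below are an assumption.

```python
import ipaddress
import socket

def is_private_host(hostname):
    """Layer 1 sketch: reject obvious local hosts and private IP literals."""
    if hostname == "localhost" or hostname.endswith(".local"):
        return True
    try:
        return ipaddress.ip_address(hostname).is_private
    except ValueError:
        return False  # not an IP literal; the DNS layer decides

def resolves_to_private_ip(hostname):
    """Layer 2 sketch: resolve DNS and check every returned address."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return True  # fail closed when resolution fails
    return any(ipaddress.ip_address(info[4][0]).is_private for info in infos)
```

The second layer matters because an attacker-controlled public hostname can resolve to 127.0.0.1 or an internal address; checking only the hostname string is not enough.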

Try It

Download ModelPiper, install ToolPiper, and scrape your first page. Point it at a JavaScript-heavy site that your current scraper cannot handle. Compare the markdown output to what BeautifulSoup gives you on the same URL. The difference is the accessibility tree, framework detection, and a real browser doing the work.

This is part of a series on local-first AI workflows on macOS. See also: Local RAG Chat for indexing scraped content into a searchable knowledge base.