You need content from a website. Maybe you are building a RAG knowledge base. Maybe you are researching competitors. Maybe you are extracting data from a web app. You reach for requests and BeautifulSoup, write 50 lines of code, and then discover the site renders client-side with JavaScript and your scraper gets nothing. Modern websites are JavaScript applications. Traditional HTTP scrapers see an empty shell.

This is the state of web scraping in 2026. Single-page applications built with React, Next.js, Angular, Vue, and dozens of other frameworks render content client-side. The content is not in the HTML source. It is constructed by JavaScript after the page loads. You need a real browser to see the real content. But running headless Chrome, waiting for JavaScript to execute, and then extracting structured content from arbitrary DOM structures requires significant infrastructure and careful timing.

What makes modern web scraping hard?

Four problems compound on each other.

Knowing when the page is ready. SPAs load incrementally. The initial HTML arrives, then JavaScript bundles download and execute, then API calls fire, then components render with the fetched data. A React app might show a loading spinner for two seconds before the actual content appears. A Next.js page might hydrate in 200 milliseconds. A Vue app might stream in components over several seconds. There is no universal "done" signal. The browser's load event fires long before the content you want is actually visible.

Extracting structured content. Even after the page is fully rendered, the DOM is a mess. Navigation bars, cookie banners, ad scripts, tracking pixels, footer links, social media widgets. The content you actually want is buried in there somewhere. Readability algorithms help but they are tuned for articles, not web apps.

Bot detection. Websites increasingly detect and block automated browsers. Headless Chrome has telltale fingerprints. Missing browser APIs, absent plugins, suspicious timing patterns. Sites behind Cloudflare, Akamai, or DataDome reject requests from browsers that look automated.

Format mismatch. Different use cases need different output formats. RAG ingestion works best with structured markdown. NLP pipelines need clean plain text. Debugging needs the raw page structure. Sometimes you just need a list of links. No single extraction method serves all purposes.

How does PiperScrape solve the readiness problem?

PiperScrape is ToolPiper's CDP-based web scraper. It drives a real Chrome browser with your normal profile, not a headless instance. This immediately solves the bot detection problem for most sites, because there is nothing to detect. It is a real browser with real browser APIs and a real user profile.

The readiness problem is harder. PiperScrape solves it with a RACE pattern and framework detection.

Component Intelligence is an embedded JavaScript module (ComponentIntelligenceScript) that PiperScrape injects into every page. It does three things: applies stealth patches to avoid bot detection, hooks into framework lifecycle events, and tracks network requests and DOM mutations. The script detects 16 frontend frameworks: React, Vue, Angular, Svelte, Next.js, Nuxt, SvelteKit, Remix, Gatsby, Astro, Solid, Qwik, Lit, Preact, Alpine.js, and htmx.

When a framework is detected, PiperScrape uses framework-specific signals to determine readiness. React's hydration callback. Angular's stability API. Vue's mounted lifecycle hook. Next.js's page transition events. These signals are precise. They fire when the framework considers the page content ready, not just when the browser finishes loading resources.

The RACE pattern runs the framework-specific signal and a generic idle detector in parallel. The framework signal gets a 1-second cap. Whichever fires first wins. This means a Next.js page that hydrates in 200 milliseconds does not wait for the generic 3-second idle timeout. And a page built with an undetected framework still gets scraped after the generic idle period. No false failures.
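The pattern can be sketched in a few lines of Python asyncio. This is an illustration of the racing logic, not PiperScrape's internals; the detector coroutines and timings here are stand-ins.

```python
import asyncio

async def race_readiness(framework_signal, generic_idle, framework_cap=1.0):
    # Start both detectors; the framework-specific signal gets a 1-second cap.
    fw = asyncio.create_task(asyncio.wait_for(framework_signal(), framework_cap))
    idle = asyncio.create_task(generic_idle())
    done, _ = await asyncio.wait({fw, idle}, return_when=asyncio.FIRST_COMPLETED)
    if fw in done and fw.exception() is None:
        idle.cancel()          # the framework said "ready" first
        return "framework"
    await idle                 # cap expired or idle won: generic detector decides
    fw.cancel()
    return "idle"

async def fast_hydrate():
    await asyncio.sleep(0.2)   # stand-in for a fast hydration callback

async def generic_three_second_idle():
    await asyncio.sleep(3.0)   # stand-in for a generic network/DOM idle window

print(asyncio.run(race_readiness(fast_hydrate, generic_three_second_idle)))
```

The fast-hydrating page wins the race at 200 milliseconds and never waits out the idle window; a page whose framework signal never fires falls through to the generic detector.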

What are the 7 extraction formats?

PiperScrape extracts content in seven formats from a single scrape. You choose which formats you want in the request.

1. Markdown. The accessibility tree rendered as structured markdown via AXMarkdownRenderer. Headings are preserved as proper markdown headings. Links include their URLs. Lists maintain their structure. Semantic hierarchy is maintained. This is the best format for RAG ingestion because it preserves document structure without HTML noise.

2. Text. Plain text extraction. All HTML stripped, all formatting removed. Clean, readable text with no artifacts. Good for NLP pipelines, text classification, or any use case where you want pure content without structure.

3. Readability. Mozilla Readability-style content cleaning. Strips navigation, ads, footers, sidebars, and cookie banners. Isolates the main article or content area. Good for blog posts, news articles, and documentation pages where you want just the editorial content.

4. AX Tree. The raw accessibility tree as the browser understands it. The full semantic structure of the page with roles, names, states, and hierarchy. This is what screen readers see. Best for debugging extraction issues, understanding page structure, or building custom extraction logic on top of the semantic representation.

5. HTML. The full rendered HTML after all JavaScript has executed. This is what BeautifulSoup would see if it could run JavaScript. Useful when you need the actual DOM for downstream processing with your own tools, or when you need to preserve inline styles, images, and embedded content.

6. Links. All links on the page extracted with heading context. Each link includes the URL, the anchor text, and the nearest heading it falls under. Useful for building sitemaps, discovering related content, or crawling a website by following links programmatically.

7. Screenshot. A visual capture of the page as rendered in Chrome. A PNG image of exactly what a human would see. Useful for visual regression testing, documentation, or feeding into a vision model for analysis.
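To make the links format concrete, here is a toy Python parser that pairs each link with the nearest preceding heading. It sketches the idea only; PiperScrape's actual extractor is not shown here.

```python
from html.parser import HTMLParser

class LinkContextParser(HTMLParser):
    """Collects each <a href> along with the nearest preceding heading."""
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.links = []
        self.current_heading = None
        self._href = None
        self._mode = None          # "heading", "link", or None
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._mode, self._buf = "heading", []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._mode, self._buf = "link", []

    def handle_data(self, data):
        if self._mode:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._mode == "heading":
            self.current_heading = "".join(self._buf).strip()
            self._mode = None
        elif tag == "a" and self._mode == "link":
            self.links.append({"url": self._href,
                               "text": "".join(self._buf).strip(),
                               "heading": self.current_heading})
            self._mode = None

p = LinkContextParser()
p.feed('<h2>Docs</h2><p><a href="/start">Getting started</a></p>')
print(p.links)
```

Each entry carries the URL, the anchor text, and the heading it falls under, which is exactly the shape you want for crawling or sitemap building.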

How does the AX tree approach compare to DOM scraping?

Most scrapers parse the DOM. They navigate HTML elements, follow CSS selectors, and extract text nodes. This works for simple pages but breaks down on modern web apps where content is buried inside deeply nested framework components.

PiperScrape's markdown format uses the accessibility tree, not the DOM. Chrome's AX tree is a semantic representation of the page. It describes what users see and interact with: headings, paragraphs, links, buttons, lists. It strips away presentational markup, wrapper divs, CSS-only elements, and framework scaffolding. A React app with 2,000 DOM nodes might have 200 AX tree nodes, and those 200 nodes contain the actual content.

The AXMarkdownRenderer walks the AX tree and produces structured markdown. A heading node becomes a markdown heading. A link node becomes a markdown link with its URL. A list becomes a markdown list. The output reads like a clean document, not like parsed HTML with artifacts.
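The walk can be sketched in Python. The node shape used here ({'role', 'name', 'children', ...}) is an assumed simplification for illustration, not Chrome's actual AX node format or the AXMarkdownRenderer implementation.

```python
def render_ax_markdown(node, depth=1):
    """Toy recursive walk: map AX roles to markdown lines."""
    role = node.get("role")
    name = node.get("name", "")
    lines = []
    if role == "heading":
        level = node.get("level", depth)
        lines.append("#" * level + " " + name)       # heading node -> markdown heading
    elif role == "link":
        lines.append(f"[{name}]({node.get('url', '')})")  # link node -> markdown link
    elif role == "listitem":
        lines.append(f"- {name}")                    # list item -> markdown bullet
    elif role in ("paragraph", "StaticText") and name:
        lines.append(name)                           # text content passes through
    for child in node.get("children", []):
        lines.extend(render_ax_markdown(child, depth + 1))
    return lines

tree = {"role": "RootWebArea", "children": [
    {"role": "heading", "level": 1, "name": "PiperScrape"},
    {"role": "paragraph", "name": "A CDP-based scraper."},
    {"role": "link", "name": "Docs", "url": "https://example.com/docs"},
]}
print("\n".join(render_ax_markdown(tree)))
```

Wrapper divs and presentational nodes simply never appear in the tree, so they never appear in the output.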

How do you use PiperScrape?

PiperScrape exposes a REST API and two MCP tools.

REST API: Three endpoints. POST /v1/scrape starts a scrape job with a URL and requested formats. GET /v1/scrape/:id retrieves the result. GET /v1/scrape lists all scrape jobs. The scrape runs asynchronously and SSE events (scrape.completed, scrape.failed) notify you when it finishes.
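A minimal sketch of calling the REST API from Python. The endpoint paths come from above, but the base URL, the JSON field names ("url", "formats"), and the "id" response field are assumptions about the API shape, not documented contract.

```python
import json
import urllib.request

def build_scrape_request(base, url, formats):
    # Field names "url" and "formats" are assumed, not documented API shape.
    payload = {"url": url, "formats": formats}
    return f"{base}/v1/scrape", json.dumps(payload).encode()

def submit_scrape(base, url, formats):
    endpoint, body = build_scrape_request(base, url, formats)
    req = urllib.request.Request(
        endpoint, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        # The response is assumed to carry a job id; poll GET /v1/scrape/:id
        # (or listen for the scrape.completed SSE event) to fetch the result.
        return json.loads(resp.read())["id"]

endpoint, body = build_scrape_request(
    "http://localhost:8080",     # hypothetical base URL
    "https://example.com",
    ["markdown", "links"])
```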

MCP tools: Two tools available to any MCP client. scrape handles full scraping with all format options. browser_detect runs framework detection only, without extracting content. If you are using ToolPiper with Claude Code, Cursor, or any MCP-aware tool, you can scrape websites with natural language.

How does PiperScrape work with RAG?

The practical workflow: scrape a website to markdown, feed the markdown into a ToolPiper RAG collection, then ask questions about it in chat.

PiperScrape's markdown output is specifically designed for RAG. The heading hierarchy provides natural chunk boundaries. Links are preserved with their URLs so the model can cite sources. Semantic structure is maintained so the model understands document organization. Compare this to dumping raw HTML into a RAG pipeline, where the model has to parse through div soup, navigation menus, and script tags to find the actual content.
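Heading-based chunking is simple to sketch. This is an illustrative splitter, not ToolPiper's RAG pipeline: it cuts the scraped markdown at every heading up to a chosen level, so each chunk is a self-contained section.

```python
import re

def chunk_by_headings(markdown, max_level=2):
    """Split markdown at headings of level <= max_level."""
    chunks, current = [], []
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s", line)
        if m and len(m.group(1)) <= max_level and current:
            chunks.append("\n".join(current).strip())  # close the previous section
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]

doc = "# Guide\nIntro.\n## Install\nSteps.\n## Usage\nRun it."
print(chunk_by_headings(doc))
```

Because PiperScrape emits proper markdown headings, boundaries like these fall on real section edges instead of arbitrary character offsets.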

You can scrape multiple pages and index them all into the same collection. Product documentation, competitor blogs, research papers, internal wikis. The RAG pipeline handles chunking, embedding, and vector indexing. You just provide clean markdown.

What are the limitations?

PiperScrape is honest about what it cannot do.

Requires Chrome running. PiperScrape is CDP-based. It navigates a real Chrome browser via the DevTools Protocol. No browser, no scraping. This is by design, because a real browser is what makes JavaScript rendering and bot detection evasion work. But it means you cannot run headless scrape jobs on a server without a display.

Not built for bulk crawling. The scrape queue is serialized, with a hard cap of 5 concurrent Chrome tabs. This is a tool for extracting content from tens or hundreds of pages, not for crawling thousands. If you need to scrape an entire website with 50,000 pages, use Scrapy or a dedicated crawling framework.
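The tab cap amounts to bounded concurrency, which can be sketched with a semaphore. This illustrates the idea, not PiperScrape's actual queue implementation.

```python
import asyncio

async def run_queue(jobs, max_tabs=5):
    sem = asyncio.Semaphore(max_tabs)   # one permit per Chrome tab
    active = peak = 0

    async def scrape_one(job):
        nonlocal active, peak
        async with sem:                 # waits here once all 5 tabs are busy
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)   # stand-in for an actual scrape
            active -= 1
            return job

    results = await asyncio.gather(*(scrape_one(j) for j in jobs))
    return results, peak

results, peak = asyncio.run(run_queue(range(12)))
print(peak)  # never exceeds 5
```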

Heavily protected sites may still block. PiperScrape's stealth patches handle most bot detection. But sites with aggressive CAPTCHAs, Cloudflare challenge pages with interactive verification, or sites that require login will still require manual intervention. The browser is real, but automated navigation patterns can still be detected by the most sophisticated protections.

2MB content cap. Each scrape is capped at 2MB of extracted content to prevent regex denial-of-service on pathologically large pages. In practice, this is more than enough for any single page. If a page exceeds 2MB of text content, something unusual is happening.

Two-layer SSRF protection. PiperScrape will not scrape localhost, private IPs, or internal network addresses. isPrivateHost checks the hostname. resolvesToPrivateIP resolves DNS and checks the IP. This prevents using PiperScrape as a proxy to access internal services.
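The two layers can be sketched in Python. These mirror the idea behind isPrivateHost and resolvesToPrivateIP; the actual checks PiperScrape performs are not shown here, and the hostname heuristics below are an assumption.

```python
import ipaddress
import socket

def is_private_host(hostname):
    """Layer 1 sketch: reject obvious local hosts and private IP literals."""
    if hostname == "localhost" or hostname.endswith(".local"):
        return True
    try:
        return ipaddress.ip_address(hostname).is_private
    except ValueError:
        return False  # not an IP literal; the DNS layer decides

def resolves_to_private_ip(hostname):
    """Layer 2 sketch: resolve DNS and check every returned address."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return True  # fail closed when resolution fails
    return any(ipaddress.ip_address(info[4][0]).is_private for info in infos)
```

The second layer matters because an attacker-controlled public hostname can resolve to 127.0.0.1 or an internal address; checking only the hostname string is not enough.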

Try It

Download ModelPiper, install ToolPiper, and scrape your first page. Point it at a JavaScript-heavy site that your current scraper cannot handle. Compare the markdown output to what BeautifulSoup gives you on the same URL. The difference is the accessibility tree, framework detection, and a real browser doing the work.

This is part of a series on local-first AI workflows on macOS. See also: Local RAG Chat for indexing scraped content into a searchable knowledge base.