What's wrong with testing in 2026?
End-to-end tests are the most valuable tests a team can write and the ones they write least. The testing pyramid has been gospel for over a decade, and at most companies the top layer is empty. Not because anyone disagrees with it, but because the cost of creating and maintaining E2E tests is too high relative to everything else competing for engineering time.
The tools aren't the problem. Playwright and Cypress are genuinely good frameworks. They're fast, well-documented, and handle the hard parts of browser automation competently. The problem is the authoring model.
Writing a Playwright test means writing code. You need a Node.js environment, a test configuration, and familiarity with the framework API. For a QA engineer who thinks in user flows, the gap between what they want to express and what they need to write is enormous. Codegen tools try to bridge this gap, but the code they generate relies on brittle CSS selectors that break on the next refactor. When selectors break, someone opens the test file, inspects the updated DOM, writes a new selector, and re-runs. Multiply that across fifty tests and three UI refactors a quarter, and you understand why the E2E suite is the first thing teams abandon under velocity pressure.
The selector strategy is the root cause. CSS and XPath selectors target the DOM, which is a build artifact. It changes when you update a dependency, switch component libraries, rename CSS classes, or upgrade your build tool. The accessibility tree doesn't have this problem. Chrome's AX tree represents what users see and interact with: a button labeled "Sign In." No class names, no generated IDs, no framework wrapper divs. A React app with 2,000 DOM nodes might have 200 AX tree nodes, the ones that actually matter for interaction.
The testing landscape (April 2026)
The browser testing market has consolidated around three tiers: open-source frameworks (Playwright, Cypress, Selenium), enterprise platforms (Testim/Tricentis, mabl, Katalon), and a growing wave of AI-native entrants.
Playwright (v1.58.2, ~83,500 GitHub stars)
Playwright is the default choice for new JavaScript/TypeScript projects. Multi-browser support (Chromium, Firefox, WebKit), excellent TypeScript integration, and Microsoft's backing make it the safe pick. Version 1.56 added Test Agents for LLM-guided authoring. Version 1.58 added timeline visualization and IndexedDB state management.
The Playwright MCP server ships 25 tools covering navigation, interaction, screenshots, and console access. It's decent for basic AI-driven browser control. But it has no assertion tools, no self-healing, no recording, no network interception, no coverage tracking, and no test session management. Microsoft itself released a separate CLI tool because a typical browser automation task consumes approximately 114,000 tokens via MCP versus 27,000 via CLI, a 4x overhead that matters at scale.
The deeper issue is architectural. Playwright's getByRole() does not query Chrome's real accessibility tree. It injects JavaScript (roleSelectorEngine.ts) that calls querySelectorAll('*') and computes ARIA roles by walking the DOM. This is a DOM-level ARIA simulation, not a query against the browser's native accessibility tree. It provides cross-browser consistency but diverges from what screen readers actually see, and the querySelectorAll('*') scan explains the measured 1.5x performance penalty versus CSS selectors.
Cypress (v15.13.0, ~49,600 GitHub stars)
Cypress pioneered the "run tests in the browser" model and built a strong community around it. Version 15's headline feature is cy.prompt(), which lets you write natural language test steps and have AI generate executable Cypress code. It's a smart idea. It also has real problems.
cy.prompt() requires Cypress Cloud. Every prompt goes to their servers, and the AI generates code there. Free accounts are rate-limited to 100 prompts per hour and 500 steps per hour. Paid accounts get 600 prompts and 3,000 steps. Each call is capped at 50 steps. Self-healing works by re-calling the cloud AI when cached selectors break, which means a network round-trip on every heal and a dependency on Cypress's servers being available. If Cloud is down, cy.prompt() doesn't work. If you're offline, cy.prompt() doesn't work. Your test prompts leave your machine.
The underlying selector strategy hasn't changed. cy.prompt() still generates CSS and data-cy selectors from DOM inspection, not accessibility tree queries. The AI wrapping is new. The fragile foundation is the same. The "generate once, export to repo" workflow partially addresses this by letting you eject generated code into version control, but the exported code still uses DOM selectors that break on the next refactor.
Beyond cy.prompt(), the structural gaps remain. Star growth has slowed (200-400/month versus Playwright's 800+). No multi-tab support. No real Safari support (experimental since 2020, issue #6422 still open). The proprietary command chain model confuses developers coming from standard async/await. Benchmarks show Cypress runs roughly 23% slower than Playwright on equivalent test suites. Canvas and iframes aren't supported. Component testing is E2E-only.
Selenium (v4.41.0, ~32,800 GitHub stars)
Selenium is not dead. It shipped 12 releases in 2025, gets 50 million PyPI downloads per month, and appears in 10,000+ US job postings. The Dynamic Grid for Kubernetes and WebDriver BiDi protocol are serious engineering efforts with W3C backing.
But new JavaScript/TypeScript projects overwhelmingly choose Playwright. Teams migrating from Selenium report 40% faster pipelines and 50% fewer flaky tests. The migration pattern is clear: stop writing new Selenium tests, direct new tests to Playwright, migrate high-value tests first, retire the old suite gradually. Java and Python enterprise teams are the holdout, and for good reason: Selenium's multi-language bindings and mature ecosystem have no equivalent.
Enterprise platforms (Testim, mabl, Katalon)
Tricentis acquired Testim for $200 million in 2022. Testim offers AI-powered self-healing selectors and visual test authoring. Pricing is enterprise-only, not published. mabl starts at roughly $450-500/month with a credit-based model. Katalon was named a Visionary in Gartner's 2025 Magic Quadrant for AI-Augmented Software Testing.
These tools proved that self-healing and visual authoring are what teams actually want. They also proved that enterprise pricing gates these features from the teams that need them most. A startup with 20 flaky tests can't justify $500/month for mabl when Playwright is free.
The AI-native wave
A new tier of AI-first startups is entering the market. Momentic (YC W24, $18.7M total funding) uses intent-based locators and claims 2,600+ users including Notion and Webflow. Bug0, Octomind, and Meticulous are building AI QA agents that auto-generate tests from user sessions or natural language descriptions. BrowserStack launched a suite of 5 AI agents in June 2025, including self-healing and accessibility detection, with 30+ testing products planned. The market is splitting: managed services at $8,000+/month (QA Wolf), enterprise platforms at $500+/month (mabl, Testim), and developer-owned tools (Playwright, Cypress, PiperTest) where you control the infrastructure. There's very little in between.
The accessibility tree approach
This is our thesis, and the reason we built PiperTest from scratch on raw CDP rather than wrapping an existing framework.
Every major testing tool treats the DOM as the primary interface to the page. Playwright injects JavaScript to simulate ARIA role resolution. Cypress uses CSS selectors. Selenium uses XPath and CSS. Enterprise tools add AI on top of DOM selectors to heal them when they break. The entire industry is building increasingly sophisticated ways to manage an inherently unstable foundation.
PiperTest inverts this. We query Chrome's real accessibility tree via CDP's Accessibility.queryAXTree method. The AX tree is Chrome's semantic representation of the page, computed by the rendering engine, consumed by screen readers, and stable across framework migrations. A CSS refactor that changes every class name doesn't touch it. A migration from React to Vue doesn't touch it either, as long as the UI looks and behaves the same.
The selector format reflects this:
role:button:Sign In
label:Email
text:Welcome to the app
testid:submit-btn
role:form:Login > role:button:Submit

These selectors target what users experience, not how developers built it. Tests break when behavior changes, which is exactly when they should break.
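To make the grammar concrete, here is a minimal sketch of how such selectors could be parsed. The function name, field names, and the rule that `>` chains describe an ancestor-descendant path are assumptions for illustration, not PiperTest's actual implementation:

```javascript
// Hypothetical parser for PiperTest-style selectors (illustrative only).
function parseSelector(selector) {
  // "role:form:Login > role:button:Submit" describes a path of segments.
  return selector.split(">").map((part) => {
    const [kind, ...rest] = part.trim().split(":");
    if (kind === "role") {
      // role:button:Sign In -> ARIA role plus accessible name
      const role = rest[0];
      const name = rest.slice(1).join(":") || null; // name may contain colons
      return { kind, role, name };
    }
    // label:Email, text:Welcome..., testid:submit-btn -> single value
    return { kind, value: rest.join(":") };
  });
}
```

Parsing `role:form:Login > role:button:Submit` under this sketch yields two segments, a form named "Login" containing a button named "Submit".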
We hit real obstacles making this work. Chrome 148 introduced breaking changes to the Accessibility domain that affected every CDP tool using AX queries. Node IDs started returning as integers instead of strings. getFullAXTree began requiring explicit frameId parameters. queryAXTree needs document-root backendNodeId scoping or returns empty results. Chrome also builds the AX tree lazily, so the first query on a new page must prime it. We solved all of these because we had to. Most tools never encountered these issues because they don't use the Accessibility domain at all.
Why we replaced Chrome DevTools MCP and Playwright MCP
ToolPiper ships 14 browser MCP tools and 6 test-specific MCP tools. These aren't wrappers around existing tools. They're custom-built replacements for both Google's Chrome DevTools MCP and Microsoft's Playwright MCP, designed for professional testing and automation workflows.
What Google ships (Chrome DevTools MCP)
Chrome DevTools MCP is a debugging tool. It gives AI agents access to DevTools panels: Elements inspection, Console output, network monitoring, performance profiling, and JavaScript evaluation. It connects to your existing Chrome session (no new window), which is useful for debugging.
What it doesn't do: no accessibility tree queries, no structured selectors, no self-healing, no assertions, no recording, no test format, no coverage. It's built for developers inspecting a live page, not for testing workflows.
What Microsoft ships (Playwright MCP)
Playwright MCP exposes 25 tools covering navigation, clicks, typing, screenshots, and console messages. It works in snapshot mode (accessibility tree text) or vision mode (coordinates from screenshots). It can generate Playwright test code from a session.
It doesn't have assertions, self-healing, network interception, storage management, performance metrics, code coverage, WebAuthn testing, or autofill testing. Its output is raw data structures that consume 4x the tokens of equivalent CLI operations. And critically, its accessibility snapshots use the same DOM-walking JavaScript approach as Playwright itself, not Chrome's native AX tree.
What we built
ToolPiper's 14 browser tools cover four domains:
- Observation: browser_snapshot (real AX tree, auto-connect), browser_console (typed messages + network errors), browser_network (request/response capture), browser_performance (Web Vitals + runtime metrics)
- Interaction: browser_action (click, fill, select, hover, scroll, keyboard with self-healing and AX diffs), browser_autofill (credit card + address forms), browser_eval (JavaScript execution with unwrapped results)
- Testing: browser_assert (7 assertion types with polling and snapshot-on-failure), browser_record (AX-enriched interaction recording), browser_coverage (JS + CSS code coverage)
- Infrastructure: browser_manage (connection lifecycle), browser_storage (cookies + localStorage + sessionStorage CRUD), browser_intercept (network mocking), browser_webauthn (virtual authenticator for passkey testing)
Every tool returns semantic plain text, not raw JSON. browser_snapshot returns a formatted AX tree with indentation and role labels. browser_action returns structured AX diffs showing what changed: added nodes with +, removed with -, modified with ~. This is readable by both humans and AI models without token-heavy JSON parsing.
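The +/-/~ diff format can be sketched as a comparison of two flat lists of AX nodes. The node shape ({ id, role, name }) and function name here are assumptions for the example, not PiperTest's internal representation:

```javascript
// Illustrative AX-tree diff in the +/-/~ style described above.
function diffAXNodes(before, after) {
  const prev = new Map(before.map((n) => [n.id, n]));
  const next = new Map(after.map((n) => [n.id, n]));
  const lines = [];
  for (const [id, node] of next) {
    const old = prev.get(id);
    if (!old) {
      lines.push(`+ ${node.role} "${node.name}"`); // node appeared
    } else if (old.role !== node.role || old.name !== node.name) {
      lines.push(`~ ${node.role} "${old.name}" -> "${node.name}"`); // node changed
    }
  }
  for (const [id, node] of prev) {
    if (!next.has(id)) lines.push(`- ${node.role} "${node.name}"`); // node vanished
  }
  return lines;
}
```

A diff like this tells the model what its click actually did, which is far denser signal than a raw DOM dump.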
On top of these, 6 test tools handle session management: test_list, test_get, test_save, test_delete, test_run, test_export. Any MCP-capable AI client, including Claude Code, Cursor, and Windsurf, can create, run, heal, and export PiperTests entirely through these tools.
What PiperTest ships today
PiperTest has been in development through 10 phases. Here's what's built and working.
Visual recording
Browse your app normally. PiperTest captures every interaction as an AX-enriched step with the full accessibility path from document root to target element. Each step includes element metadata (tag, role, name, bounding box), the page URL and title, and an AX mutation diff showing what changed after the action. No annotation, no switching between panels. Just use the app.
Self-healing (3 modes)
When a selector no longer matches, PiperTest doesn't fail immediately.
Passive quality improvement: During recording or after a successful run, the system notices a step uses a weak selector when a stronger one is available. It upgrades css:.btn-primary to role:button:Submit automatically.
Active AX fuzzy match (~5-15ms): On selector failure, takes a fresh AX tree snapshot and searches for nodes matching the original selector's role and approximate name. Candidates scored by role match, name edit distance, and tree position. A button renamed from "Submit" to "Save" heals automatically.
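The scoring idea can be sketched as role match plus name edit distance. The weights, function names, and the decision to zero out role mismatches are illustrative assumptions, not PiperTest's actual scoring:

```javascript
// Standard Levenshtein edit distance between two strings.
function editDistance(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,      // deletion
        dp[i][j - 1] + 1,      // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Hypothetical heal score: exact name scores 1.0, drifting names score lower.
function scoreCandidate(target, candidate) {
  if (target.role !== candidate.role) return 0; // wrong role: never heal to it
  const dist = editDistance(target.name.toLowerCase(), candidate.name.toLowerCase());
  const maxLen = Math.max(target.name.length, candidate.name.length, 1);
  return 1 - dist / maxLen;
}
```

Under this sketch, a button renamed from "Submit" to "Save" still outranks unrelated buttons on the page, which is why the rename heals without AI involvement.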
AI-assisted heal (on failure only): When fuzzy matching can't resolve the break, builds a heal context including the error, mutation diff, current snapshot, and heal history, then asks an AI model to propose revised steps. The test runs at CDP speed for passing steps. AI latency is only incurred on failures.
Enterprise tools charge thousands per month for comparable self-healing. PiperTest ships it as a core feature because we believe test maintenance shouldn't be a revenue stream.
7 assertion types
Visible, hidden, text content, URL match, element count, attribute value, and console message. All assertions use polling with configurable timeouts and capture an AX snapshot on failure for debugging.
Temporal assertions
Three modes for time-dependent verification: always (condition must hold for a duration), eventually (condition must become true within a deadline), and next (condition must hold on the very next check). These use a residual evaluation model with 100ms polling intervals and a 50-residual cap. We added these because teams kept asking for a way to verify async state without brittle waits.
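The three modes reduce to simple decisions over a series of sampled condition results (one boolean per poll). This sampling model is a simplification for illustration; it stands in for the residual engine described above:

```javascript
// Illustrative evaluation of the three temporal modes over poll samples.
function evalTemporal(mode, samples) {
  switch (mode) {
    case "always":     return samples.every(Boolean); // must hold for the whole window
    case "eventually": return samples.some(Boolean);  // must become true before the deadline
    case "next":       return samples.length > 0 && !!samples[0]; // must hold on the very next check
    default: throw new Error(`unknown mode: ${mode}`);
  }
}
```

The value of the abstraction is that a flaky "wait 3 seconds then check" step becomes a declarative claim about time that the runner can evaluate precisely.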
Background health monitors
Passive console error, JavaScript exception, and HTTP error checking runs after every step. No configuration needed. The HealthMonitorRunner reads from existing CDP buffers with timestamp-based deduplication and a 200-violation cap. If your app throws a console error during a test, you'll know.
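The deduplication described above can be sketched as filtering a CDP event buffer by a last-seen timestamp and capping the result. The event shape and function name are assumptions for the example:

```javascript
// Hypothetical sketch of timestamp-based dedup with a violation cap.
function collectNewViolations(buffer, lastSeenTs, cap = 200) {
  return buffer
    .filter((event) => event.ts > lastSeenTs) // skip events already reported
    .slice(0, cap);                           // never exceed the violation cap
}
```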
Combined coverage reports
Three coverage dimensions merged into a single weighted report: PiperProbe element coverage (60% weight) maps every interactive element on the page and tracks which ones your tests actually touch. CDP JavaScript coverage (30%) and CSS coverage (10%) round out the picture. The coverage bar shows a color-coded breakdown with an expandable list of uncovered elements per page.
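The weighted merge is straightforward arithmetic; the weights below come from the text, and each input is a coverage fraction in [0, 1]:

```javascript
// Combined coverage: element 60%, JS 30%, CSS 10% (weights from the text above).
function combinedCoverage({ element, js, css }) {
  return 0.6 * element + 0.3 * js + 0.1 * css;
}
```

So a page with half its interactive elements touched, full JS coverage, and no CSS coverage lands at 60% combined.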
Export to Playwright and Cypress
One-click deterministic export. The renderer maps AX selectors to each framework's native format: role:button:Sign In becomes page.getByRole('button', { name: 'Sign In' }) in Playwright and cy.contains('button', 'Sign In') in Cypress. The exported code is clean and idiomatic, ready to paste into your CI pipeline. Temporal assertions emit // TEMPORAL: comments explaining the intent since most frameworks don't have native equivalents.
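The selector mapping can be sketched as a small deterministic translator. Only the role:<role>:<name> form from the example above is handled here; the function name and error handling are assumptions:

```javascript
// Illustrative export of a role selector to Playwright or Cypress syntax.
function exportSelector(selector, framework) {
  const [kind, role, ...nameParts] = selector.split(":");
  if (kind !== "role") throw new Error(`unhandled selector kind: ${kind}`);
  const name = nameParts.join(":"); // accessible name may contain colons
  if (framework === "playwright") {
    return `page.getByRole('${role}', { name: '${name}' })`;
  }
  if (framework === "cypress") {
    return `cy.contains('${role}', '${name}')`;
  }
  throw new Error(`unknown framework: ${framework}`);
}
```

Because the translation is a pure mapping rather than an AI step, the same test always exports to the same code.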
Execution speed
Each step executes in 10-50ms. PiperTest talks to Chrome via a persistent CDP WebSocket connection. No browser driver binary, no WebDriver protocol translation, no process spawning per action. A 20-step login test completes in under a second. The bottleneck is your application's response time, not the test runner.
Triple-readable format
The same JSON test file is simultaneously: a visual tree in the UI (humans read it), structured MCP tool input (AI agents consume it), and CDP-executable steps (machines run it). No other format achieves all three. Playwright tests are code. Cypress tests are code. Enterprise tests are proprietary. PiperTest is JSON that works everywhere.
Smart fill
The fill action auto-detects input types via CDP's DOM tree walk. <select> elements get programmatic option selection. Date and time inputs use native value setters. Range sliders set values and dispatch events. Color pickers validate hex format. Each input type gets the right interaction strategy without configuration.
Coming soon: test anything
Everything above is browser testing. PiperTest is about to go further.
We're building a unified testing surface that covers native macOS apps, OS-level actions, and web UI in the same format, the same runner, the same MCP tools. A single test session will be able to open a native app, interact with its accessibility tree, trigger system actions, switch to a browser, verify the result, and report pass/fail across all of it.
The foundation is the same: the accessibility tree. macOS exposes a system-wide AX tree for every running application, not just browsers. The same selector strategy that targets role:button:Sign In in Chrome can target role:button:Save in Finder, Xcode, or your own Swift app. And with ActionPiper already shipping 26 domains of system actions (window management, audio control, display settings, keyboard simulation, network toggling, and more), the action layer is already built.
No other testing tool does this. Playwright tests browsers. Appium tests mobile apps. XCUITest tests Apple apps. Each lives in its own silo with its own selector strategy, its own runner, its own language. PiperTest will be one format that crosses all of them, because the accessibility tree is the one abstraction that spans every surface on the Mac.
Web UI, native apps, OS actions. One test. One format. Coming soon.
PiperTest vs everything
We're going to be honest about this comparison. Every tool in this table has strengths we don't match. We also have capabilities none of them offer.