Why does nobody have enough test coverage?
Every team knows they need more tests. Nobody has time to write them. The codebase has 400 components, 30 user flows, and a test suite that covers maybe 5 of them well. The rest get manual smoke tests before release, if that. A 2025 PractiTest survey found that 45% of QA teams report frequent test breakages, and most teams spend their limited testing budget maintaining existing tests rather than writing new ones.
The math is brutal. A single E2E test for a login flow takes 15-30 minutes to write, including selector work, assertion logic, and debugging timing issues. A checkout flow with error handling takes an hour or more. Multiply that by every flow in your application and you're looking at weeks of dedicated testing work that competes with feature development for sprint capacity.
So teams ship with gaps. They know the settings page isn't tested. They know the edge case where a user cancels mid-checkout isn't covered. They know, and they ship anyway, because the alternative is spending two sprints writing tests instead of building the feature the PM is asking for.
AI can break this tradeoff. Not by replacing human judgment about what matters, but by removing the mechanical work of translating "test the login flow" into executable test steps. The bottleneck was never knowing what to test. It was the tedium of writing it down in a format machines can execute.
How does AI generate tests from the AX tree?
PiperTest's approach to AI test generation is deliberately simple. There's no special SDK. No proprietary format the AI needs to learn. No framework-specific plugin. The entire mechanism works because the accessibility tree is already plain text that any language model can read and reason about.
Here's the actual flow:
Step 1: Capture an AX snapshot. PiperTest calls Chrome's Accessibility.getFullAXTree via CDP and formats the result as structured plain text. A page that looks complex in the DOM becomes a readable tree of semantic elements:
WebArea "Acme App - Dashboard"
  navigation "Main"
    link "Home"
    link "Settings"
    link "Billing"
  main
    heading "Welcome back, Jane" (level 1)
    region "Quick Actions"
      button "New Project"
      button "Import Data"
      button "Invite Team"
    table "Recent Projects"
      row "Project Alpha" | "Active" | "3 members"
      row "Project Beta" | "Draft" | "1 member"
    link "View all projects"

That's it. No framework-specific metadata. No CSS classes. No DOM nesting depth. Just what users see and can interact with. Every language model on earth can read this.
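The flattening step can be sketched in a few lines. This assumes the node shape of CDP's `Accessibility.getFullAXTree` response (`nodeId`, `role`, `name`, `childIds`, `ignored`); PiperTest's exact formatting rules aren't public, so the output style here is illustrative:

```python
# Sketch: flatten a CDP Accessibility.getFullAXTree result into indented
# plain text. Ignored nodes are skipped; their children are promoted.

def format_ax_tree(nodes):
    by_id = {n["nodeId"]: n for n in nodes}
    lines = []

    def walk(node_id, depth):
        node = by_id[node_id]
        if not node.get("ignored", False):
            role = node.get("role", {}).get("value", "")
            name = node.get("name", {}).get("value", "")
            lines.append("  " * depth + (f'{role} "{name}"' if name else role))
            depth += 1
        for child_id in node.get("childIds", []):
            walk(child_id, depth)

    walk(nodes[0]["nodeId"], 0)
    return "\n".join(lines)

# Tiny sample in CDP's AXNode shape.
nodes = [
    {"nodeId": "1", "role": {"value": "WebArea"},
     "name": {"value": "Dashboard"}, "childIds": ["2"]},
    {"nodeId": "2", "role": {"value": "button"},
     "name": {"value": "New Project"}, "childIds": []},
]
print(format_ax_tree(nodes))
# → WebArea "Dashboard"
#     button "New Project"
```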
Step 2: Feed the snapshot to any AI model. The snapshot goes into the prompt as context. The system prompt tells the model what PiperTest step format looks like (navigate, click, fill, assert) and asks it to generate test steps for a given flow. The model doesn't need access to your source code, your component tree, or your build system. It reads the AX tree the same way a screen reader does and reasons about what actions a user would take.
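Assembling that prompt is plain string work. A minimal sketch — the system-prompt wording and flow description here are illustrative, not PiperTest's actual text:

```python
# Sketch of building the generation prompt from an AX snapshot and a
# flow description. The instruction text is an assumption; only the
# step types (navigate, click, fill, assert) come from the format above.

SYSTEM = (
    "You generate PiperTest steps. Available step types:\n"
    "  navigate <url>\n"
    "  click <selector>\n"
    "  fill <selector> <value>\n"
    '  assert text <selector> = "<expected>"\n'
    "Selectors use AX form: role:<role>:<name> or label:<label>."
)

def build_prompt(ax_snapshot: str, flow: str) -> str:
    return (
        f"{SYSTEM}\n\n"
        f"Page snapshot (accessibility tree):\n{ax_snapshot}\n\n"
        f"Generate PiperTest steps to: {flow}"
    )

prompt = build_prompt('WebArea "Dashboard"\n  button "New Project"',
                      "create a new project")
```

The same string goes to Claude's API, a local llama.cpp server, or a chat window — the model only ever sees text.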
Step 3: AI generates PiperTest steps. The model outputs structured steps using AX selectors:
navigate https://app.example.com/dashboard
click role:button:New Project
fill label:Project Name My Test Project
fill label:Description A project for testing
click role:button:Create
assert text role:heading = "My Test Project"

Step 4: Steps save, run, and report. The generated steps are saved as a PiperTest session via test_save, executed via test_run, and results come back with pass/fail per step, heal logs, health monitor findings, and optional coverage reports. The entire loop - from AX snapshot to running test - can happen in a single conversation with an AI assistant.
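The step format is simple enough to parse with a few string splits. A sketch, assuming a structured dict output shape (not PiperTest's internal representation), and glossing over how multi-word labels in fill steps are disambiguated from the value:

```python
# Naive parser for the step format shown above. The first-space split in
# "fill" cannot handle multi-word labels; a real grammar would need to
# disambiguate selector from value.

def parse_step(line: str) -> dict:
    if line.startswith("navigate "):
        return {"action": "navigate", "url": line[len("navigate "):]}
    if line.startswith("click "):
        return {"action": "click", "selector": line[len("click "):]}
    if line.startswith("fill "):
        selector, _, value = line[len("fill "):].partition(" ")
        return {"action": "fill", "selector": selector, "value": value}
    if line.startswith("assert text "):
        selector, _, expected = line[len("assert text "):].partition(" = ")
        return {"action": "assert_text", "selector": selector,
                "expected": expected.strip('"')}
    raise ValueError(f"unknown step: {line}")

steps = [parse_step(s) for s in [
    "navigate https://app.example.com/dashboard",
    "click role:button:New Project",
    "fill label:Email jane@example.com",
    'assert text role:heading = "My Test Project"',
]]
```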
Why is the AX tree the right input for AI?
You could feed a DOM dump to an AI and ask it to generate tests. Some tools do this. The results are worse, and the reason is information density.
A typical React application renders 1,500-3,000 DOM nodes per page. Most are wrapper divs, CSS-in-JS containers, framework internals, and presentational elements with no user-facing meaning. The AI has to sift through all of that noise to find the 100-200 elements users actually interact with. It burns tokens on <div class="css-1dbjc4n r-1awozwy"> that tell it nothing about what the page does.
The AX tree strips all of that away. 200 semantic nodes instead of 2,000 DOM nodes. Every node has a role (button, link, textbox, heading) and a name (the accessible label). The AI gets a clean, semantic representation of the page that matches how users think about it. "There's a button called New Project" rather than "there's a div with class btn-primary-lg containing a span with text New Project inside a div with role=none."
This matters for token efficiency too. A full DOM dump for a moderately complex page might be 50-100KB of text. The AX tree for the same page is typically 3-8KB. You can fit the entire page context in a single prompt without hitting context limits, even with smaller local models. A 3B parameter model running on your Mac can read the full AX tree of most pages. It can't read a 100KB DOM dump without context window overflow.
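The arithmetic is easy to check with the common rough heuristic of about four characters per token (an approximation; real tokenizers vary by model):

```python
# Back-of-envelope context-budget check using the ~4 chars/token
# heuristic. Sizes are the typical figures cited above; the 8K context
# window is a representative small-local-model assumption.

def approx_tokens(n_bytes: int) -> int:
    return n_bytes // 4

ax_tree_bytes, dom_dump_bytes = 6_000, 80_000
context_window = 8_192

print(approx_tokens(ax_tree_bytes))   # ~1,500 tokens - fits easily
print(approx_tokens(dom_dump_bytes))  # ~20,000 tokens - overflows 8K
```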
Does this work with local models?
Yes. This is one of the key design choices. PiperTest doesn't require any specific AI provider. The AX tree is plain text. The step format is plain text. Any model that can read text and output structured text can generate tests.
Local llama.cpp models running on Apple Silicon through ToolPiper work well for straightforward flows. A 7B-8B model like Llama 3.1 or Qwen 2.5 can read an AX snapshot and produce valid PiperTest steps for login flows, navigation tests, form submissions, and CRUD operations. The test quality is good for happy paths and basic error cases. Complex multi-step flows with conditional logic benefit from larger models.
Cloud models like Claude, GPT-4, or Gemini produce higher-quality tests with better edge case coverage, more thoughtful assertions, and better handling of complex workflows. They also cost money per generation and require sending your page structure to a third party.
The choice is yours. For an internal admin tool where the page structure isn't sensitive, a local 8B model generates perfectly good tests for free. For a customer-facing application where you want thorough edge case coverage, Claude or GPT-4 produces more comprehensive test suites. PiperTest doesn't care which model generates the steps. It cares whether the steps are valid and the assertions pass.
What about MCP tools?
For AI clients that support the Model Context Protocol - Claude Code, Cursor, Windsurf, and others - PiperTest exposes six MCP tools that handle the full test lifecycle:
- test_save - Create or update a test session with steps
- test_list - Discover existing tests
- test_get - Read a specific test's steps and results
- test_run - Execute a test with self-healing and health monitors
- test_delete - Remove a test
- test_export - Generate Playwright or Cypress code from test steps
Plus browser_snapshot to capture the AX tree on demand. The AI assistant connects to ToolPiper's MCP server, calls browser_snapshot to see the current page, reasons about what to test, calls test_save to create the test, and test_run to execute it. The entire workflow happens in a conversation.
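On the wire, each of those calls is an ordinary MCP `tools/call` JSON-RPC request, per the Model Context Protocol spec. A sketch of what the client sends — the argument names for test_run are an assumption:

```python
import json

# Shape of an MCP tools/call request as a client would send it to
# ToolPiper's MCP server. "tools/call" is the standard MCP method;
# the "session" argument name is hypothetical.

def mcp_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

snapshot_req = mcp_tool_call(1, "browser_snapshot", {})
run_req = mcp_tool_call(2, "test_run", {"session": "settings-smoke"})
```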
But MCP isn't required. The same workflow works with any AI through copy-paste. Grab the AX snapshot from PiperTest's UI, paste it into ChatGPT or your local model's chat interface, ask it to generate PiperTest steps, and paste the result back. MCP makes it seamless. Without MCP, it's still a 60-second workflow instead of a 30-minute manual test authoring session.
What is AI good at generating?
After extensive use, clear patterns emerge about where AI excels at test generation and where it struggles.
AI is excellent at happy paths. "Test the login flow." "Test creating a new project." "Test the checkout process." The AX tree shows exactly what elements exist. The model generates a linear sequence of actions that exercises the primary flow. These tests are straightforward, correct, and exactly the tests that are most tedious to write by hand.
AI is good at coverage expansion. Give it an AX snapshot and say "generate tests for every interactive element on this page." It will produce a test for each button, each link, each form field. The tests might be shallow - click and verify something changed - but they establish a baseline of coverage that catches regressions. PiperProbe's coverage report shows which elements are untested, and AI can systematically close those gaps.
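This kind of coverage expansion is mechanical enough to sketch without a model at all. Assuming AX nodes as (role, name) pairs and the step syntax from earlier — the role-to-step mapping here is an illustrative heuristic, not PiperTest's:

```python
# Sketch: emit a shallow baseline step for every interactive element in
# a parsed AX snapshot. Clickables get a click; textboxes get a fill.

def baseline_steps(ax_nodes):
    steps = []
    for role, name in ax_nodes:
        if role in {"button", "link"}:
            steps.append(f"click role:{role}:{name}")
        elif role == "textbox":
            steps.append(f"fill label:{name} sample-value")
    return steps

nodes = [("heading", "Settings"), ("button", "Save"),
         ("link", "Billing"), ("textbox", "Email")]
print(baseline_steps(nodes))
# → ['click role:button:Save', 'click role:link:Billing',
#    'fill label:Email sample-value']
```

An AI model adds value on top of this baseline by ordering the steps sensibly and attaching assertions; the enumeration itself is the easy part.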
AI is decent at error cases. "Test what happens when the email field is empty." "Test submitting with an invalid credit card." The model understands form validation patterns and generates reasonable negative tests. It won't catch every edge case your specific application handles, but it covers the common patterns that apply to most web forms.
AI struggles with business logic assertions. It can click the "Apply Discount" button, but it doesn't know the discount should be 15%. It can navigate to the billing page, but it doesn't know the invoice total should match the sum of line items. These assertions require domain knowledge that the AX tree doesn't contain. A human needs to specify what the expected values are.
AI struggles with complex state dependencies. "Test that a user who was invited but hasn't accepted the invite sees a different dashboard than an active user." The model would need to understand the application's state machine, which isn't visible in a single AX snapshot. Multi-state tests still need human design, even if AI writes the mechanical steps.
How does this compare to other AI test generation tools?
The AI testing landscape is moving fast. Here's where things stand in early 2026.
Playwright Test Agents (v1.56+) ship three built-in agents: Planner (explores the app and writes a markdown test plan), Generator (converts the plan into .spec.ts files), and Healer (runs tests and patches failures). This is sophisticated infrastructure that uses MCP under the hood to control the browser. The output is Playwright-native TypeScript code. The tradeoff: it's tightly coupled to Playwright's ecosystem, requires VS Code v1.105+, and the agents use cloud AI models for reasoning.
Cypress cy.prompt() converts natural language steps into executable Cypress commands at runtime. You write cy.prompt(['Click the login button', 'Fill in the email']) and AI generates the Cypress code. It includes self-healing: when cached selectors fail, AI regenerates them. The tradeoff: requires Cypress Cloud authentication, each heal is a network round-trip, and the approach is tightly coupled to Cypress's runtime.
Testim (Tricentis) uses ML-based smart locators combined with visual AI to generate and maintain tests. The recording-based approach captures user interactions and builds test steps with built-in AI resilience. Heavy cloud infrastructure. Enterprise pricing.
mabl Agentic AI offers a Test Creation Agent that autonomously builds end-to-end tests from natural language descriptions. Claims 2x faster test generation with conversational planning. Uses AI Vectorization for semantic embeddings across all test assets. Fully cloud-dependent with enterprise pricing.
Katalon StudioAssist uses GPT to generate test scripts from comments and natural language descriptions. The 2026 update adds a web recording agent that translates natural language scenarios into test steps. Integrated into Katalon's IDE. Claims to reduce a 30-40 minute task to 5 minutes.
What makes PiperTest's approach different?
Three things set PiperTest apart from the tools above.
Model independence. PiperTest doesn't embed a specific AI model or require a specific cloud service. The AX tree is text. The step format is text. Use whatever model you want. Run a 3B model on your laptop for quick drafts. Use Claude for comprehensive test suites. Switch between them without changing anything about your tests. No vendor lock-in at the generation layer.
AX-native output. The generated tests use accessibility tree selectors (role:button:Submit, label:Email) rather than CSS selectors or XPath. This means the generated tests inherit PiperTest's self-healing: if a label changes slightly, the fuzzy AX matcher heals it in 5-15ms without any AI call. Tests generated by AI are just as resilient as tests written by hand, because they target the same semantic layer.
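The healing idea can be illustrated with stdlib fuzzy matching. This is a sketch of the concept, not PiperTest's matcher — the 0.6 similarity cutoff and same-role constraint are assumptions:

```python
import difflib

# Sketch of fuzzy AX-selector healing: when the recorded accessible name
# no longer matches exactly, pick the closest current name with the same
# role. Cutoff value is illustrative.

def heal_selector(role: str, recorded_name: str, current_nodes):
    candidates = [name for r, name in current_nodes if r == role]
    matches = difflib.get_close_matches(recorded_name, candidates,
                                        n=1, cutoff=0.6)
    return f"role:{role}:{matches[0]}" if matches else None

# Label changed from "Submit" to "Submit Order" after a redesign.
nodes = [("button", "Cancel"), ("button", "Submit Order")]
print(heal_selector("button", "Submit", nodes))
# → role:button:Submit Order
```

Because this is pure string comparison over a few hundred nodes, it runs in milliseconds with no model in the loop.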
Privacy by default. With a local model, nothing leaves your machine. The AX snapshot is processed locally. The test steps are generated locally. The tests run locally. No page structure, no element names, no application URLs touch any external service. With cloud models, only the AX tree text (3-8KB of semantic labels) goes to the provider - not screenshots, not DOM dumps, not source code.
What's the practical workflow?
Here's the realistic daily workflow for a developer using AI test generation with PiperTest.
Morning: Coverage check. Open PiperTest's coverage bar. PiperProbe shows which interactive elements on each page have test coverage and which don't. The settings page has 12% coverage. The dashboard has 45%. The checkout flow has 80%.
Focus: AI fills the gaps. Navigate to the settings page in Chrome. Call browser_snapshot (or copy the AX tree from PiperTest's UI). Feed it to your AI model with "generate tests for every interactive element on this page." The AI produces 8-12 test steps covering each form field, each toggle, each save button. Save and run them. Most pass on the first try. Fix the ones that don't - usually an assertion value the AI guessed wrong.
Review: Human judgment. Read through the generated tests. The AI tested that clicking "Delete Account" shows a confirmation dialog. Good. But it asserted that the dialog text contains "Are you sure?" when your app actually says "This action is permanent." Fix the assertion. The AI tested toggling the notification preference. Good. But it didn't test that the change persists after page reload. Add that step manually.
Result: 45 minutes to go from 12% to 70% coverage on the settings page. Writing those same tests by hand would have taken a full afternoon. The AI handled the mechanical work. You handled the judgment calls.
What won't AI test generation replace?
AI generates tests. Humans decide what matters.
AI doesn't know that the checkout flow is more important than the about page. It doesn't know that the discount calculation is the feature that caused a production incident last month. It doesn't know that the accessibility of the main navigation is a legal requirement for your government clients.
Prioritization, risk assessment, and business logic validation remain human responsibilities. AI is a force multiplier for the mechanical work, not a replacement for the strategic work. The teams that get the most value from AI test generation are the ones that treat it as a tool for coverage expansion guided by human priorities, not as an autonomous QA replacement.
Try it
Download ToolPiper from the Mac App Store. Open any web app in Chrome. Capture an AX snapshot. Feed it to any AI model - local or cloud - and ask it to generate PiperTest steps. Save and run them. The entire loop takes under two minutes.
For MCP-equipped AI clients, add ToolPiper as an MCP server (claude mcp add toolpiper -- ~/.toolpiper/mcp) and the AI can snapshot, generate, save, and run tests in a single conversation. Six MCP tools cover the full lifecycle.
This is part of the AI-powered testing series. Next: Reduce Test Maintenance Cost - the business case for AX-native testing and self-healing. For the self-healing mechanism, see Self-Healing Test Selectors. For the CDP engine underneath, see AX-Native Browser Automation.