The MCP testing landscape in 2026
MCP (Model Context Protocol) turned AI coding assistants into tool-using agents. Instead of just generating code, they can now call functions, read data, and drive external systems. Browser testing is one of the most natural fits for this pattern. An AI agent can look at a page, decide what to test, create the test, and run it, all through structured tool calls.
The ecosystem has responded. Four MCP servers now claim testing capabilities, but they approach the problem from very different angles.
Playwright MCP is the official Microsoft server for Playwright. It exposes around 25 tools for navigation and interaction. The AI agent can open pages, click elements, fill forms, and take screenshots. It uses Playwright's snapshot mode by default, which reads a structured accessibility tree representation. But there are gaps: no built-in assertion tools, no recording, no self-healing, and no test persistence. The agent can drive a browser but can't save or replay what it did. Microsoft now recommends Playwright CLI over MCP for coding agents because a typical MCP session consumes around 114,000 tokens, while the CLI approach uses about 27,000 for the same task.
Chrome DevTools MCP is Google's official server. It exposes roughly 29 tools organized around debugging: console messages, network requests, performance traces, JavaScript evaluation, and basic browser automation (navigate, click, fill, screenshot). It's built for debugging, not testing. There are no assertion primitives, no test format, no recording, and no export. You can reproduce a bug through it, but you can't create a reusable test.
Cypress Cloud MCP launched in beta on March 17, 2026. It's a remote MCP server that connects AI assistants to Cypress Cloud for read-only access to test run data: statuses, failure details, error messages, stack traces, and flaky test reports. It requires a personal access token and Cypress Cloud account. The server is scoped to data retrieval and debugging. It cannot create tests, run tests, or modify test code. It answers "what happened in my last CI run?" but not "write me a test for this page."
None of these servers cover the full testing lifecycle through MCP. They each handle a piece: Playwright MCP drives browsers, Chrome DevTools MCP debugs them, Cypress Cloud MCP reads test results. Creating, persisting, running, healing, and exporting tests through MCP tools requires something else.
MCP testing tools: the comparison table
This table compares what each MCP server actually exposes for testing workflows. Not marketing features. Actual tools an AI agent can call.
The differences are structural, not incremental. Playwright MCP and Chrome DevTools MCP give AI agents browser control. Cypress Cloud MCP gives AI agents test result visibility. ToolPiper gives AI agents the complete testing lifecycle: see the page, create the test, run it, heal it, and export it. These are different problems with different tool requirements.
The complete workflow: AI writes a test from scratch
Here's how an AI agent actually creates and runs a browser test through MCP, step by step. This works with Claude Code, Cursor, Windsurf, or any MCP-compatible client connected to ToolPiper.
Step 1: Take a snapshot. The agent calls browser_snapshot. This returns the current page's accessibility tree as structured plain text. Not a screenshot. Not raw HTML. A semantic representation of every interactive element: buttons, links, inputs, headings, landmarks. A page with 2,000 DOM nodes might produce 200 AX nodes, and those 200 are the ones that matter for testing.
browser_snapshot
Result:
Page: Login - Example App
URL: https://app.example.com/login
---
navigation "Main"
link "Home"
link "Pricing"
link "Docs"
heading "Sign in to your account" [level=1]
form "Login"
textbox "Email address" [required]
textbox "Password" [required]
checkbox "Remember me"
button "Sign In"
link "Forgot password?"

The output is plain text, not JSON. It's designed for LLM consumption: compact, structured, and immediately readable. The agent now understands every interactive element on the page, their roles, their labels, and their hierarchy.
Step 2: Reason about what to test. The agent reads the AX snapshot and decides what to test. For a login page, the obvious tests are: successful login, empty field validation, wrong password handling, and the forgot password link. The agent doesn't need special browser automation capabilities for this step. It reads text and reasons about it. Any model can do this.
Step 3: Generate test steps. The agent constructs a test as a sequence of typed steps. Each step has an action (navigate, click, fill, assert), a selector in AX format, and a value where applicable.
Steps:
1. navigate https://app.example.com/login
2. fill label:Email address [email protected]
3. fill label:Password secret123
4. click role:button:Sign In
5. assert url contains "dashboard"
6. assert visible role:heading:Dashboard

Step 4: Save the test. The agent calls test_save with the steps, a name, and a description. The test is persisted as a session on disk. It's now a reusable artifact that can be run, edited, exported, or shared.
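As a sketch, the typed step list above could be modeled like this in TypeScript. The field names and types here are assumptions for illustration, not ToolPiper's documented PiperTest schema:

```typescript
// Hypothetical shape of a typed test step (field names are assumptions).
type Action = "navigate" | "click" | "fill" | "assert";

interface TestStep {
  action: Action;
  selector?: string; // AX-format selector, e.g. "role:button:Sign In"
  value?: string;    // input value, URL, or assertion argument
  assert?: "visible" | "url" | "text";
}

// The six steps of the login happy path from the example above.
const loginHappyPath: TestStep[] = [
  { action: "navigate", value: "https://app.example.com/login" },
  { action: "fill", selector: "label:Email address", value: "[email protected]" },
  { action: "fill", selector: "label:Password", value: "secret123" },
  { action: "click", selector: "role:button:Sign In" },
  { action: "assert", assert: "url", value: "dashboard" },
  { action: "assert", assert: "visible", selector: "role:heading:Dashboard" },
];
```

Because every step is data rather than code, the same list can be saved, replayed, healed, or rendered into Playwright or Cypress syntax later.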
Step 5: Run the test. The agent calls test_run with the session ID. The runner walks through each step, executes the action via CDP, and evaluates the assertions. If a selector doesn't match, the self-healing engine activates: fuzzy AX matching in 5-15ms, with optional AI-assisted healing as a fallback. The run returns a structured result:
test_run session_id="ts-abc123"
Result:
Test: Login Happy Path
Status: PASSED
Steps: 6/6 passed
Duration: 2.3s
---
1. navigate https://app.example.com/login PASSED 120ms
2. fill label:Email address PASSED 18ms
3. fill label:Password PASSED 15ms
4. click role:button:Sign In PASSED 22ms
5. assert url contains "dashboard" PASSED 890ms
6. assert visible role:heading:Dashboard PASSED 12ms
---
Health: 0 console errors, 0 network failures
Heals: 0

Step 6: Export to Playwright. The agent calls test_export with the session ID and target framework. The export renderer maps AX selectors to Playwright's native format:
test_export session_id="ts-abc123" framework="playwright"
Result:
// login-happy-path.spec.ts
import { test, expect } from '@playwright/test';
test('Login Happy Path', async ({ page }) => {
await page.goto('https://app.example.com/login');
await page.getByLabel('Email address').fill('[email protected]');
await page.getByLabel('Password').fill('secret123');
await page.getByRole('button', { name: 'Sign In' }).click();
await expect(page).toHaveURL(/dashboard/);
await expect(
page.getByRole('heading', { name: 'Dashboard' })
).toBeVisible();
});

Six tool calls. A complete test from nothing to exportable Playwright code. The agent handled the entire lifecycle: observe the page, decide what to test, create the test, run it, verify it passes, and generate CI-ready code.
What makes ToolPiper's MCP tools different
Tool count isn't the differentiator. It's what the tools return and how they compose. Four design decisions set ToolPiper's MCP testing tools apart from the alternatives.
Semantic plain text output
Every tool returns structured plain text, not raw JSON. When browser_action clicks a button, the response includes a semantic AX diff showing what changed on the page:
browser_action click role:button:Sign In
Result:
Clicked: button "Sign In"
---
AX Diff:
- form "Login"
- textbox "Email address"
- textbox "Password"
- button "Sign In"
+ heading "Welcome back" [level=1]
+ paragraph "Redirecting to dashboard..."
+ progressbar "Loading"

Added nodes are marked with +, removed with -, modified with ~. The AI agent sees exactly what the action did to the page without taking another full snapshot. This keeps token usage low and gives the model precise information about state transitions.
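Since each snapshot is one node per line, a diff like this can be approximated with a line-based set comparison. This is a simplified sketch, not ToolPiper's actual algorithm (which also tracks modified nodes with ~):

```typescript
// Line-based AX diff sketch: report nodes removed from `before`
// and nodes added in `after`. Unchanged lines are omitted.
function axDiff(before: string, after: string): string[] {
  const beforeSet = new Set(before.split("\n").map((l) => l.trim()).filter(Boolean));
  const afterSet = new Set(after.split("\n").map((l) => l.trim()).filter(Boolean));
  const out: string[] = [];
  for (const line of beforeSet) if (!afterSet.has(line)) out.push(`- ${line}`);
  for (const line of afterSet) if (!beforeSet.has(line)) out.push(`+ ${line}`);
  return out;
}
```

The payoff is in the omissions: a page with hundreds of stable nodes produces a diff of only the handful that changed.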
Playwright MCP returns the full AX tree on every interaction. After several tool calls, the context window fills with stale page states. ToolPiper's diff-based approach means the agent always has fresh, compact information about what just changed.
Self-healing actions
When browser_action targets a selector that doesn't match, it doesn't fail immediately. The action enters a healing loop: query the AX tree for same-role candidates, score them by name similarity using Levenshtein distance, and execute on the best match if confidence is high. A button renamed from "Submit" to "Save" heals in 5-15ms with zero network calls. The response includes what was healed:
Clicked: button "Save" (healed from "Submit", confidence: high)

No other MCP server has built-in self-healing. Playwright MCP fails the action. Chrome DevTools MCP fails the action. The AI agent would need to catch the error, take a new snapshot, reason about what changed, and retry. That costs tokens, time, and context window space. With ToolPiper, the tool handles it automatically.
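The healing loop described above can be sketched in a few lines. The names (AxNode, heal) are illustrative, and the real engine layers confidence scoring and the optional AI fallback on top of this same-role, name-similarity core:

```typescript
interface AxNode { role: string; name: string }

// Classic dynamic-programming Levenshtein edit distance.
function levenshtein(a: string, b: string): number {
  const d = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      d[i][j] = Math.min(
        d[i - 1][j] + 1, // deletion
        d[i][j - 1] + 1, // insertion
        d[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
    }
  }
  return d[a.length][b.length];
}

// Pick the same-role candidate whose name is closest to the stale target.
// A production engine would also apply a confidence threshold before acting.
function heal(target: AxNode, tree: AxNode[]): AxNode | null {
  const candidates = tree.filter((n) => n.role === target.role);
  let best: AxNode | null = null;
  let bestDist = Infinity;
  for (const c of candidates) {
    const dist = levenshtein(target.name.toLowerCase(), c.name.toLowerCase());
    if (dist < bestDist) { bestDist = dist; best = c; }
  }
  return best;
}
```

Filtering by role first is what keeps this fast and safe: a renamed button only ever heals to another button, never to a link or heading that happens to share a word.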
Assertions with polling
browser_assert provides seven assertion types (visible, hidden, text, url, count, attribute, console) with built-in polling and configurable timeouts. The assertion retries until the condition is met or the timeout expires, then captures an AX snapshot on failure.
Playwright MCP has no assertion tools. An AI agent using Playwright MCP would need to take a snapshot, parse the result, and evaluate conditions in its own reasoning. That's brittle: the agent might check too early, miss a loading state, or misinterpret the snapshot. ToolPiper's assertions handle timing internally.
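The retry behavior amounts to a polling loop around an async predicate. A minimal sketch, with illustrative names and defaults:

```typescript
// Retry `check` until it returns true or the timeout expires.
// The real tool would capture an AX snapshot on the failure path.
async function assertWithPolling(
  check: () => Promise<boolean>,
  timeoutMs = 5000,
  intervalMs = 100,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    if (await check()) return; // condition met
    if (Date.now() >= deadline) {
      throw new Error("assertion timed out");
    }
    await new Promise((r) => setTimeout(r, intervalMs)); // wait before retrying
  }
}
```

This is why the agent never has to reason about loading spinners or slow redirects: the tool keeps checking until the page settles or the timeout proves the assertion false.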
Full test lifecycle
The six test tools (test_list, test_get, test_save, test_delete, test_run, test_export) give the AI agent persistent test management. Tests are saved to disk, rerunnable, editable, and exportable. An agent can build a test suite over multiple sessions, come back tomorrow, list existing tests, modify one, run the full suite, and export the results.
With Playwright MCP or Chrome DevTools MCP, everything is ephemeral. The agent drives a browser session and the interactions evaporate when the session ends. There's no "save this sequence of actions as a test" tool. There's no "run that test I made yesterday" tool. The AI has to regenerate everything from scratch each time.
Without MCP: the provider-agnostic approach
MCP is powerful, but it's not the only path. ToolPiper's browser tools work just as well without MCP by injecting the AX tree directly into conversation context.
The approach is simple. Take a browser snapshot via ToolPiper's REST API (GET /v1/browser/snapshot). The response is the same semantic plain text that the MCP tool returns. Paste it into any AI conversation, with any model, on any provider. Claude, GPT-4, Gemini, a local Llama model, anything that can read text and generate structured output.
The AI reads the AX snapshot, generates PiperTest steps as JSON, and you save them via the REST API (POST /v1/test-sessions). Run them via POST /v1/test-sessions/:id/run. Export via GET /v1/test-sessions/:id/export. The full lifecycle works through HTTP, no MCP required.
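Putting those endpoints together, an orchestration sketch might look like the following. The paths are the ones quoted above, but the base URL, port, and request/response shapes are assumptions, not documented API details:

```typescript
// Sketch of the MCP-free lifecycle over HTTP against a local ToolPiper
// instance. Base URL and JSON shapes are assumptions for illustration.
const BASE = "http://localhost:9998/v1";

async function createAndRunTest(steps: unknown[], name: string): Promise<string> {
  // 1. Snapshot the current page (plain-text AX tree) to feed to any model.
  const snapshot = await fetch(`${BASE}/browser/snapshot`).then((r) => r.text());
  if (!snapshot) throw new Error("empty snapshot");
  // ...the model reads `snapshot` and produces `steps`...

  // 2. Persist the generated steps as a test session.
  const session = await fetch(`${BASE}/test-sessions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ name, steps }),
  }).then((r) => r.json());

  // 3. Run the session, then export Playwright code.
  await fetch(`${BASE}/test-sessions/${session.id}/run`, { method: "POST" });
  return fetch(`${BASE}/test-sessions/${session.id}/export?framework=playwright`)
    .then((r) => r.text());
}
```

The orchestrator here could be anything: a CI job, a cron script, or a custom agent wrapping any model provider.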
This matters for three scenarios:
AI providers without MCP support. Not every model host supports MCP. OpenAI's API, Google's Gemini API, and many open-source model APIs don't have native MCP clients. But they all accept text input and produce text output. Feed them the AX tree, get test steps back.
Custom integrations. If you're building your own testing agent, you might not want MCP's overhead. A simple HTTP client that calls ToolPiper's REST API is lighter weight and gives you full control over the orchestration logic.
Local models. A 7B parameter model running on your Mac through llama.cpp can read an AX snapshot and generate test steps. The snapshot is compact plain text, not a 100KB JSON blob. Small models handle it fine because there's nothing to parse, just structured text to reason about.
The MCP tools are the convenient path. The REST API is the universal path. Both produce the same result: AI-generated tests stored in PiperTest format, runnable with self-healing, exportable to Playwright or Cypress.
Example: Claude Code writing a login test
Here's a real interaction with Claude Code connected to ToolPiper via MCP. The human provides a single instruction. The agent does the rest.
Human: "Write a test for the login page at https://staging.myapp.com/login. Test successful login and invalid password."
What the agent does:
- Calls browser_action with navigate to open the URL
- Calls browser_snapshot to read the page structure
- Reads the AX tree: finds the email field, password field, sign in button, and error message area
- Generates two test sessions: "Login - Happy Path" and "Login - Invalid Password"
- Calls test_save twice to persist both tests
- Calls test_run on the happy path test, reads the result (all steps passed)
- Calls test_run on the invalid password test, reads the result (assertion on error message passed)
- Calls test_export on both tests with target "playwright"
- Reports: "Created 2 tests, both passing. Exported Playwright code for both. The happy path test has 6 steps and the invalid password test has 7 steps including an assertion on the error message text."
Total tool calls: 8. Total time: under 30 seconds. Total code written by the human: zero. The agent handled navigation, page analysis, test design, step generation, execution, verification, and export through MCP tools.
The same workflow works in Cursor, Windsurf, or any MCP-compatible client. The tools are identical. The agent's reasoning varies by model, but the tool interface is the same.
What the AI is good (and bad) at
AI-driven test authoring isn't magic. The agent is good at specific things and genuinely bad at others.
Good at: generating happy path coverage. Given an AX snapshot of a form, the agent can generate fill-and-submit steps, navigate to the result page, and assert on visible outcomes. This is mechanical work that the agent handles consistently.
Good at: expanding a recording. Record a 5-step test manually, then ask the agent to add error handling: "What happens if I submit an empty form?" The agent reads the existing test, takes a snapshot, and adds steps for the empty-field case. This turns a basic recording into a thorough test.
Good at: generating assertions. The agent can look at a page state and suggest what to assert: "The heading should say Dashboard. The user's name should appear in the nav. The URL should contain /dashboard." Humans often forget assertions because they're focused on the interaction flow. The agent fills in the verification gaps.
Bad at: knowing which edge cases matter. The agent doesn't know that your app has a race condition when two users edit the same document simultaneously. It doesn't know that the discount code field breaks when the code contains a special character. These are business-logic edge cases that require domain knowledge.
Bad at: complex auth flows. OAuth redirects, CAPTCHA challenges, MFA tokens. These are inherently dynamic and require either environment setup (test accounts, bypass tokens) or manual intervention. The agent can record the flow once, but it can't replicate a time-based OTP code.
Bad at: replacing test strategy. The agent generates tests for what it can see. It doesn't decide that you need load testing, or that the checkout flow needs to be tested with 15 different payment methods, or that the admin panel needs tests at all. Test strategy is a human decision.
Use the AI for the tedious work: recording flows, generating assertions, expanding test coverage for visible UI. Apply human judgment for the strategy: what to test, which edge cases matter, when a test is actually valuable.
Honest limitations
ToolPiper's MCP testing tools have real constraints that affect who they work for.
Chrome only. All browser tools use Chrome DevTools Protocol. Firefox and Safari are not supported. If your CI matrix requires cross-browser testing, you need to export to Playwright and run the generated code across browsers there. The authoring happens in Chrome. The CI execution can be multi-browser via the export.
macOS only. ToolPiper is a native macOS app that runs on Apple Silicon (M1 or later). There's no Windows or Linux version. The exported Playwright and Cypress code runs on any platform, but the MCP server and test runner require a Mac.
Token efficiency varies by client. Claude Code handles 20 tools well because Anthropic designed MCP with large tool registries in mind. Other clients may struggle with tool selection when presented with ToolPiper's full 104-tool catalog. If your MCP client gets confused by tool count, you can limit the scope by configuring which tool tiers to expose.
Self-healing has confidence thresholds. Fuzzy AX matching works for label renames and element reordering. It doesn't work for fundamental page restructures where the original element's role and name have both changed completely. In those cases, the test correctly fails and AI-assisted healing (if enabled) attempts a structural repair.
AI-generated tests need review. An AI agent can produce a test that passes today but checks the wrong things. The assertions might be too loose ("page contains some text") or too tight ("heading equals exact string that changes with every deploy"). Human review of AI-generated tests is still essential.
Getting started
Install ToolPiper from modelpiper.com. Connect it to your MCP client with one command:
claude mcp add toolpiper -- ~/.toolpiper/mcp

For Cursor, Windsurf, or other clients that use JSON config, point to the same binary path. For HTTP-based clients, use http://localhost:9998/mcp as the Streamable HTTP endpoint.
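For those JSON-config clients, the entry typically follows the common mcpServers shape. A sketch, assuming the binary path above (the config file location varies by client, and some clients require an absolute path instead of ~):

```json
{
  "mcpServers": {
    "toolpiper": {
      "command": "~/.toolpiper/mcp"
    }
  }
}
```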
Open Chrome, navigate to whatever you want to test, and ask your AI agent: "Take a snapshot of this page and write a test for the login flow." The agent handles the rest.
This is part of a series on AI-powered testing workflows. For the visual testing guide (no AI required), see Visual Testing on Mac. For the self-healing deep-dive, see Self-Healing Test Selectors. For the full MCP server overview (all 104 tools), see Local MCP Server on Mac.