Was PiperMatch trained on user data?

No. We use synthetic training data exclusively. No user prompts, no chat history, no scraped third-party datasets. Privacy is a training boundary for us, not a runtime add-on.

Can a small local model really use 300+ tools?

Not all at once, and that's the point. PiperMatch picks a relevant subset per turn, sized to what the loaded model can actually handle. The catalog stays large, the per-turn footprint stays small.

Does this only help small local models?

No. Cloud frontier models benefit too, because the catalog you don't ship is the catalog you don't pay for. The savings just look bigger on small local models because the constraint is harder there.

Why not use an off-the-shelf embedding model?

We tried. General-purpose embeddings don't separate semantically similar tools well enough for routing decisions. The wrong tool wins too often on common queries. Training our own moved the right tool into the top five on every case in our internal test set.

Doesn't running retrieval before the model add latency?

A few hundred milliseconds at most, on the Apple Neural Engine. That's well under the time the language model itself takes to start producing tokens, and it pays for itself by avoiding context overflows and bad tool calls.

Tool Routing for Local AI on Mac: How Small Models Use 300+ Tools

Hand a small local model over 300 tool definitions and it stops working. Not slowly. Not approximately. It breaks. The catalog overflows the context window before you've typed a question, and what was supposed to be a useful AI assistant turns into a context error.

That's the constraint we ran into the day we connected ToolPiper's full MCP catalog to a tight-context local model. Cloud frontier models swallowed it without complaint. Llama 3.2 3B did not. The catalog alone was bigger than half its working memory.

The MCP wire is small. The inference payload isn't.

ToolPiper exposes over 300 MCP tools, and the MCP wire payload — what an external AI client like Claude Code or Cursor receives on tools/list — is only about 24 KB total. Names plus stripped schemas, no descriptions. That fits in any client's context, period.

The hard problem is one layer deeper. When ToolPiper's own chat layer makes a function-calling call to a language model — local or cloud — it has to inject full tool definitions into the prompt: name, full description, parameters with per-property docs and enum values. At that fidelity the 300+ tool catalog runs to roughly 29,000 tokens. A frontier model with a 200K context can absorb that without thinking. A local model with an 8K context cannot, and won't, ever. The catalog by itself is more than three times the model's entire working memory.
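The arithmetic above is easy to check. Here is a minimal sketch using the figures quoted in the text. The only assumption is that "8K" means 8,192 tokens; the function name is ours for illustration.

```python
# Figures quoted in the text. Treating "8K" as 8,192 tokens is an assumption.
CATALOG_TOKENS = 29_000
LOCAL_CONTEXT = 8_192
FRONTIER_CONTEXT = 200_000


def catalog_share(catalog_tokens: int, context_tokens: int) -> float:
    """Fraction of the context window the catalog alone would occupy."""
    return catalog_tokens / context_tokens


local_share = catalog_share(CATALOG_TOKENS, LOCAL_CONTEXT)
frontier_share = catalog_share(CATALOG_TOKENS, FRONTIER_CONTEXT)

print(f"8K local model: catalog is {local_share:.1f}x the window")
print(f"200K frontier model: catalog is {frontier_share:.1%} of the window")
```

On these numbers the catalog is roughly three and a half windows' worth of text for the local model and under a sixth of the window for the frontier one.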

So the only models that can be handed every tool every turn are frontier-class ones with a big context window to spare. Everything smaller, which is most of what people actually run on their own hardware, needs the catalog filtered before it reaches the model. There is no clever prompt engineering that fixes this. The arithmetic is what it is.

The cloud platforms hide this by doing tool routing behind their API. When you call OpenAI's function calling with 50 tools, it isn't pasting all 50 into the prompt. There's a retrieval layer that picks a small relevant subset before the model ever sees the catalog. Same with Anthropic. That layer is one of the reasons frontier APIs feel magical and the pricing stays sane.

Local AI on a laptop doesn't get that for free. There is no platform layer between you and the model. Whatever the client sends is what the model sees, pays for, and tries to reason about. So we built our own.

Meet PiperMatch

The piece that does this in ToolPiper is called PiperMatch. It's a small retrieval model we trained ourselves and ship inside the app. When a query arrives, PiperMatch ranks the catalog against what you actually asked for and returns the handful of tools likely to matter for that turn.

It runs on the Apple Neural Engine. The query never leaves your Mac. Selection takes a few hundred milliseconds at most, and the chosen tools are what gets handed to the language model.
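To make "ranks the catalog against the query" concrete, here is a toy sketch of the general shape: embed the query, embed each tool description, keep the top k by similarity. It is not PiperMatch. The embedding below is a plain bag-of-words stand-in for a trained model, and the tool names and descriptions are made up for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a learned embedding: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_tools(query: str, catalog: dict[str, str], k: int = 3) -> list[str]:
    """Return the k tool names whose descriptions best match the query."""
    q = embed(query)
    ranked = sorted(catalog, key=lambda name: cosine(q, embed(catalog[name])), reverse=True)
    return ranked[:k]


# Hypothetical catalog entries, for illustration only.
catalog = {
    "browser_open": "open a web page in the browser",
    "file_read": "read the contents of a file on disk",
    "system_volume": "set the system audio volume",
}

top = rank_tools("read a file from disk", catalog, k=1)
print(top)
```

A word-count embedding like this is exactly the kind of general-purpose matcher that confuses related tools, which is why the real ranking is done by a model trained for the job.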

This is the same shape of solution the big platforms use behind their APIs. The difference is we got it small enough to ship inside a desktop app and run privately on your hardware.

Why small models can really only handle one tool at a time

The general advice you'll see online says small local models "struggle" with many tools. That undersells it. With a tight context window and a long catalog, the practical ceiling is closer to one tool. Sometimes two. Anything beyond that and the prompt starts crowding out the user's actual message and the model's working memory.

So the goal of routing for small models isn't "pick a great handful." It's "pick the right one, fast." If you nail that, the model is sharp and decisive. If you miss, the model has nothing to recover with, because there's no room left in the prompt for a second attempt.

We treat tight-context routing as a single-shot problem. The retrieval has to be right the first time. Everything else in the system is shaped around making that achievable.

Why not let the model ask for tools?

We considered the obvious alternative. Hand the model a meta-tool that says "ask me for any tool by name and I'll load its schema," then let it pull what it wants on demand. Frontier models can chain that reliably. Small local models often can't. They ignore the meta-tool, invoke it with the wrong names, or call it once and skip the actual work. Asking a model to do a job it isn't capable of just produces a different kind of failure.

So for tight-context regimes we make the routing call upstream, before the model is involved, and ship the right tools directly. The model never has to know there were over 300 options. It sees a small focused set and gets to work.

Two schema sizes for one catalog

Picking fewer tools is half the work. Making each tool cheaper is the other half.

Every function-calling tool definition has a name, a description, and a JSON schema for its parameters. The per-parameter docs and the enum-value commentary are the heaviest pieces. On a verbose tool that's hundreds of tokens. On a verbose catalog of 300+, it's tens of thousands.

ToolPiper ships each tool in two inference-side sizes. Compact strips per-parameter descriptions and recursive doc fields but keeps the tool-level description and the type/required/enum surface a model needs to call correctly. Full is the unstripped form a frontier model can use for the subtlest calls. Catalog-wide the compact form runs about 30-40% leaner than full.
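A rough sketch of what a compact pass could look like, assuming tools are stored as plain JSON-schema dictionaries. The function names and the sample tool are ours for illustration, not ToolPiper's internals.

```python
import json
from typing import Any


def strip_docs(schema: Any) -> Any:
    """Recursively drop 'description' fields from a JSON-schema fragment."""
    if isinstance(schema, list):
        return [strip_docs(item) for item in schema]
    if not isinstance(schema, dict):
        return schema
    out = {}
    for key, value in schema.items():
        if key == "description":
            continue  # per-parameter docs are the heavy part
        if key == "properties" and isinstance(value, dict):
            # Keys here are parameter names, not schema keywords: keep them all.
            out[key] = {name: strip_docs(sub) for name, sub in value.items()}
        else:
            out[key] = strip_docs(value)
    return out


def to_compact(tool: dict) -> dict:
    """Keep the name and tool-level description, strip docs inside parameters."""
    return {
        "name": tool["name"],
        "description": tool["description"],
        "parameters": strip_docs(tool["parameters"]),
    }


# Hypothetical tool definition, for illustration only.
full_tool = {
    "name": "file_read",
    "description": "Read a file from disk.",
    "parameters": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Absolute path of the file to read."},
            "encoding": {
                "type": "string",
                "enum": ["utf-8", "latin-1"],
                "description": "Text encoding used to decode the file.",
            },
        },
        "required": ["path"],
    },
}

compact_tool = to_compact(full_tool)
print(len(json.dumps(full_tool)), "->", len(json.dumps(compact_tool)), "characters")
```

The type, required, and enum surface survives untouched, so a model can still form a valid call. Only the explanatory text is gone.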

Which size goes out is decided per turn, based on how much room the loaded model has. A small local model gets compact schemas for a small set of tools. A long-context cloud model gets full schemas for a wider set. You don't configure any of this. The selection logic reads the model's actual capacity and sizes the catalog to fit.
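One way to picture the per-turn decision is a greedy fill against a token budget. This is a sketch under stated assumptions, not the shipped logic: the threshold for switching to full schemas, the share of the window reserved for tools, and the per-tool token costs are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Tool:
    name: str
    compact_tokens: int
    full_tokens: int


def select_for_turn(
    ranked: list[Tool],
    context_tokens: int,
    tool_share: float,
    full_threshold: int = 32_000,
) -> tuple[str, list[str]]:
    """Pick a schema size, then take ranked tools until the budget is spent."""
    # Hypothetical rule: long-context models get full schemas, others compact.
    size = "full" if context_tokens >= full_threshold else "compact"
    budget = int(context_tokens * tool_share)
    chosen, used = [], 0
    for tool in ranked:  # best match first
        cost = tool.full_tokens if size == "full" else tool.compact_tokens
        if used + cost > budget:
            break
        chosen.append(tool.name)
        used += cost
    return size, chosen


# Made-up tools and costs, already ranked for the current query.
ranked = [
    Tool("file_read", compact_tokens=60, full_tokens=95),
    Tool("file_write", compact_tokens=70, full_tokens=110),
    Tool("browser_open", compact_tokens=65, full_tokens=100),
]

small = select_for_turn(ranked, context_tokens=8_192, tool_share=0.02)
large = select_for_turn(ranked, context_tokens=200_000, tool_share=0.02)
print(small)
print(large)
```

With these made-up numbers the 8K model gets two compact schemas and the 200K model gets all three at full fidelity, from the same ranked list.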

The MCP wire itself is separate from all of this — it's already stripped to the floor. External clients reading tools/list see names and just enough schema for argument inference, no descriptions on the wire. The compact/full split happens later, when ToolPiper actually invokes the model.

The combined effect shows up in two places: fewer tokens to pay for on every turn, and a faster first response. The same catalog that breaks a small model under naive shipping fits comfortably under PiperMatch plus schema sizing.

You decide what's in the schema

Retrieval handles the per-turn decision. The other half of catalog management is per-tool and per-category control, owned by the user, applied once and respected everywhere.

ToolPiper's Tool Permissions panel lets you mark any individual tool — or an entire category — as Allow, Ask, or Deny. Denied tools never reach PiperMatch and never enter the inference schema. The model literally cannot see them, so it cannot try them. This is not a runtime rejection; it is upstream of the catalog. A denied tool costs zero tokens.

The same is true at the category level. Don't want the model touching the macOS system actions? Deny system and the 162 tools in that category drop out of every payload. Don't want it touching Git? Deny the subset of tools you care about, or the whole category. The Schema Budget card on the same panel shows your current allow/ask/deny token totals as you toggle, so you can see the catalog shrink in real time.
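The "upstream of the catalog" idea can be sketched as a filter that runs before any ranking happens. Two assumptions are baked in and are ours, not documented behavior: a deny at either level wins, and a tool with no rule is treated as allowed. Apart from system_run_command, the tool names are invented.

```python
from enum import Enum


class Permission(Enum):
    ALLOW = "allow"
    ASK = "ask"
    DENY = "deny"


def visible_tools(
    catalog: dict[str, str],
    tool_rules: dict[str, Permission],
    category_rules: dict[str, Permission],
) -> list[str]:
    """Return the tool names allowed to reach ranking and the inference schema.

    catalog maps tool name -> category. Assumption: a deny on the tool or on
    its category removes the tool; anything else stays visible.
    """
    visible = []
    for name, category in catalog.items():
        denied = (
            tool_rules.get(name) is Permission.DENY
            or category_rules.get(category) is Permission.DENY
        )
        if not denied:
            visible.append(name)
    return visible


catalog = {
    "system_run_command": "system",
    "system_volume": "system",
    "file_read": "filesystem",
    "file_delete": "filesystem",
}
category_rules = {"system": Permission.DENY}
tool_rules = {"file_read": Permission.ASK, "file_delete": Permission.DENY}

print(visible_tools(catalog, tool_rules, category_rules))
```

Because the filter runs first, a denied tool never contributes a single token to the prompt, which is the property the panel is built around.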

For external clients, per-app gating lives one panel over, in Connected Apps. Claude Code can have access to one slice of the catalog while Cursor has access to another, with both routed through the same running ToolPiper.

How the connection is secured

The whole catalog runs behind a loopback bearer, and on the same machine you never see it. ToolPiper mints a short-lived ambient token at ~/Library/Application Support/ToolPiper/.toolpiper-token (mode 0600, rotated on every launch) and trusts loopback at the socket level, so Claude Code, Cursor, and anything else on your Mac connect with zero config. There's no key to paste into a client file. Only off-machine callers, such as a LAN device or a PiperMesh peer, must present a scoped Bearer tp_<64 hex> token with the right audience. That token is 64 hex characters (256 bits) from a CSPRNG, and it never gets written into ~/.claude/ or any config you don't control.
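For readers who want to see the two mechanics named above, here is a small sketch: minting a tp_ token with 64 hex characters from a CSPRNG, and writing a secret to a file that only its owner can read. The function names are ours, the ambient token's real format isn't documented here, and the example writes to a temporary directory rather than the real path.

```python
import os
import secrets
import stat
import tempfile
from pathlib import Path


def mint_scoped_token() -> str:
    """'tp_' followed by 64 hex characters (32 random bytes from the OS CSPRNG)."""
    return "tp_" + secrets.token_hex(32)


def write_owner_only(path: Path, secret: str) -> None:
    """Write a secret to a file readable and writable only by its owner (0600)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(secret)
    os.chmod(path, 0o600)  # also tightens a file that already existed


token = mint_scoped_token()

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in location and contents; not the real ToolPiper path or format.
    token_path = Path(tmp) / ".toolpiper-token"
    write_owner_only(token_path, secrets.token_urlsafe(32))
    mode = stat.S_IMODE(token_path.stat().st_mode)

print(len(token), oct(mode))
```

Rotating on every launch then amounts to calling the write step again with a fresh secret.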

Connected Apps is also where every authenticated client shows up as a labelable row: you can see its current MCP-tools access state and revoke it in one click. Revocation kills the bearer immediately, and the row stays visible marked as revoked — state is never hidden. One panel further, the tool governance overlay lets you deny a specific tool (say system_run_command) for every client at once, or flip the built-in tool marketplace from open browse to an explicit allow-list. Governance can only narrow what a tier already permits, never widen it. Today it's a per-device overlay you edit locally; the same policy shape is built so an organization can push it later, at which point the controls lock behind a "Managed by your organization" badge. There's no SSO/SAML admin console yet — that's demand-pulled — but the per-device controls are live now.
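"Can only narrow, never widen" has a simple shape in code: start from what the tier permits, intersect with any allow-list, subtract the deny-list. This is a sketch with hypothetical names. Only system_run_command comes from the text above.

```python
from typing import Optional


def effective_tools(
    tier_allowed: set[str],
    denied: frozenset[str] = frozenset(),
    allow_list: Optional[set[str]] = None,
) -> set[str]:
    """Apply a governance overlay on top of what a tier already permits.

    The result is always a subset of tier_allowed: an allow-list can only
    intersect and a deny-list can only subtract.
    """
    allowed = set(tier_allowed)
    if allow_list is not None:
        allowed &= allow_list
    return allowed - denied


tier = {"file_read", "file_write", "system_run_command"}

# Deny one tool for every client.
after_deny = effective_tools(tier, denied=frozenset({"system_run_command"}))

# An allow-list that names a tool the tier never granted cannot add it.
after_allow = effective_tools(tier, allow_list={"file_read", "browser_open"})

print(sorted(after_deny), sorted(after_allow))
```

Whatever the overlay says, nothing outside the tier's own set can appear in the result.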

Why we trained our own retrieval model

There are off-the-shelf embedding models you could use for this kind of retrieval. We tried them. They aren't good enough for tool selection at this granularity. The vocabulary of MCP tools overlaps in ways general-purpose embeddings don't separate. Two tools that do related-but-different things land too close together, and the wrong one wins on common queries.

So we trained our own. Synthetic data only, no user prompts and no real chat history, ever. That's a privacy boundary, not a footnote. The training pipeline is opinionated about tool routing in particular and has nothing to do with general semantic search. The result is a model small enough to load on the Neural Engine in milliseconds and accurate enough to put the right tool in the top five every time on our internal test set.

We won't go deeper than that publicly. The recipe matters and we want to keep iterating on it. The output is what's worth talking about.

What you experience as a user

Most of this is invisible by design. You install ToolPiper, you connect it to your client of choice, and it works. The catalog is there. The right tools show up when you need them. Tokens don't blow up. Small models don't choke.

If you want to see it run, ToolPiper's logs viewer surfaces a recommendation event on every turn. You'll see how many tools shipped, the size they were sent at, the total token cost, and the names that got picked. It's a useful sanity check the first few times you ask the system to do something. After that you stop looking, because the answer is always "a small relevant set, cheaply shipped."

Where this approach holds back

Two cases worth being honest about.

The first is when the model's context is big enough that the whole catalog fits without crowding anything. At that scale ranking adds no value, so we turn retrieval off and ship the catalog directly. A clever piece of machinery that earns its keep on a tight context window is the wrong tool for a generous one.
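The bypass is a one-line decision. As a sketch: the 20% cutoff below is a made-up threshold for illustration, not the value ToolPiper uses.

```python
def should_rank(catalog_tokens: int, context_tokens: int, max_catalog_share: float = 0.2) -> bool:
    """Rank only when the full catalog would take more than its share of the window."""
    return catalog_tokens > context_tokens * max_catalog_share


print(should_rank(29_000, 8_192))    # tight context: retrieval stays on
print(should_rank(29_000, 200_000))  # generous context: ship the catalog directly
```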

The second is genuinely novel phrasing. Retrieval models work on the language they've seen during training, and a query expressed in unfamiliar vocabulary can rank a less-relevant tool higher than the right one. When that happens the workaround is usually a small rephrase. We also track these cases offline and tune the catalog text over time, because the catalog vocabulary is where most of the work is.

Why this matters beyond ToolPiper

Local AI is going to spend the next few years catching up to cloud AI on the experiences that depend on tools rather than raw model intelligence. Browser automation, system actions, file operations, search, scraping, all of it routes through tool calls now. A model that can't use tools well is a model with one hand tied behind its back.

The way you make small models good at tools is to do the work the cloud does, on-device. That work is mostly retrieval and schema engineering. PiperMatch is our take on the retrieval half. The schema sizing is the engineering half. Together they're the reason a 3B local model on a MacBook can go toe-to-toe with a frontier model on tool-driven tasks where intelligence isn't the bottleneck.

It's also the part of ModelPiper we're most proud of. The user-facing surface is small. The work behind it is not.

Try it

ToolPiper is a free download from modelpiper.com/download. The retrieval layer is on by default. Nothing to configure, nothing to enable. Connect a model with a tight context, connect a model with a generous one, and watch the same catalog adapt to both.

Aspect	Naive (ship everything)	PiperMatch + schema sizing
MCP wire payload	All 300+ tool names + stripped schemas (~24 KB / ~6K tokens)	Same; the wire is already stripped by default
Inference-side payload per turn	All 300+ tools at full schema (~29K tokens)	Relevant subset, sized to the loaded model
Schema fidelity	Always maximum at inference time	Compact or full, picked per turn from the model's context
Small-model behavior	Catalog overflows context, model fails	Catalog fits with room for the conversation
Where ranking happens	Implicitly, inside the prompt at inference time	Explicitly, on-device before the model sees the catalog
Privacy	Catalog and query travel together to the model	Selection runs locally on the Apple Neural Engine
Adapts to model size	No, same payload regardless of context	Yes, sized per loaded model