Hand a small local model more than 300 tool definitions and it stops working. Not gradually. Not partially. It breaks. The catalog overflows the context window before you've typed a question, and what was supposed to be a useful AI assistant turns into a context error.

That's the constraint we ran into the day we connected ToolPiper's full MCP catalog to a tight-context local model. Cloud frontier models swallowed it without complaint. Llama 3.2 3B did not. The catalog alone was bigger than half its working memory.

The MCP wire is small. The inference payload isn't.

ToolPiper exposes over 300 MCP tools, and the MCP wire payload — what an external AI client like Claude Code or Cursor receives on tools/list — is only about 24 KB total. Names plus stripped schemas, no descriptions. That fits in any client's context, period.

The hard problem is one layer deeper. When ToolPiper's own chat layer makes a function-calling call to a language model — local or cloud — it has to inject full tool definitions into the prompt: name, full description, parameters with per-property docs and enum values. At that fidelity the 300+ tool catalog runs to roughly 29,000 tokens. A frontier model with a 200K context can absorb that without thinking. A local model with an 8K context cannot, and won't, ever. The catalog by itself is more than three times the model's entire working memory.
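The arithmetic is easy to check. A minimal sketch, using the roughly 29,000-token figure above and assuming "8K" and "200K" mean 8,192 and 200,000 tokens (the function and dictionary names are illustrative only):

```python
# Figure from the text: the full-fidelity catalog is roughly 29,000 tokens.
CATALOG_TOKENS = 29_000

# Assumption: "8K" and "200K" are read as 8,192 and 200,000 tokens.
CONTEXT_WINDOWS = {"local-8k": 8_192, "frontier-200k": 200_000}


def catalog_share(catalog_tokens: int, context_tokens: int) -> float:
    """Return how many context windows the catalog would occupy."""
    return catalog_tokens / context_tokens


for name, window in CONTEXT_WINDOWS.items():
    share = catalog_share(CATALOG_TOKENS, window)
    verdict = "overflows" if share >= 1 else "fits"
    print(f"{name}: catalog is {share:.2f}x the window ({verdict})")
```

Anything at or above 1.0x cannot be sent at all, before counting the system prompt, the user's message, or the reply.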

So the only models that can be handed every tool every turn are frontier-class ones with a big context window to spare. Everything smaller, which is most of what people actually run on their own hardware, needs the catalog filtered before it reaches the model. There is no clever prompt engineering that fixes this. The arithmetic is what it is.

The cloud platforms hide this by doing tool routing behind their API. When you call OpenAI's function calling with 50 tools, it isn't pasting all 50 into the prompt. There's a retrieval layer that picks a small relevant subset before the model ever sees the catalog. Same with Anthropic. That layer is one of the reasons frontier APIs feel magical and the pricing stays sane.

Local AI on a laptop doesn't get that for free. There is no platform layer between you and the model. Whatever the client sends is what the model sees, pays for, and tries to reason about. So we built our own.

Meet PiperMatch

The piece that does this in ToolPiper is called PiperMatch. It's a small retrieval model we trained ourselves and ship inside the app. When a query arrives, PiperMatch ranks the catalog against what you actually asked for and returns the handful of tools likely to matter for that turn.

It runs on the Apple Neural Engine. The query never leaves your Mac. Selection takes a few hundred milliseconds at most, and the chosen tools are what gets handed to the language model.
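The shape of that step, rank the catalog against the query and keep the top few, can be sketched in a few lines. This is not PiperMatch: it swaps the trained retrieval model for a plain bag-of-words cosine score, and the tool names and descriptions are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words; a crude stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_tools(query: str, catalog: dict[str, str], top_k: int = 1) -> list[str]:
    """Return the names of the top_k tools whose descriptions best match the query."""
    q = tokenize(query)
    ranked = sorted(catalog, key=lambda name: cosine(q, tokenize(catalog[name])), reverse=True)
    return ranked[:top_k]


# Hypothetical tool names and descriptions, for illustration only.
CATALOG = {
    "browser_open": "open a url in the browser",
    "file_read": "read the contents of a file on disk",
    "git_status": "show the status of a git repository",
}

print(rank_tools("read this file for me", CATALOG, top_k=1))  # ['file_read']
```

Only the names returned by the ranking step have their definitions placed in the prompt for that turn.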

This is the same shape of solution the big platforms use behind their APIs. The difference is we got it small enough to ship inside a desktop app and run privately on your hardware.

Why small models really can only handle one tool at a time

The general advice you'll see online says small local models "struggle" with many tools. That undersells it. With a tight context window and a long catalog, the practical ceiling is closer to one tool. Sometimes two. Anything beyond that and the prompt starts crowding out the user's actual message and the model's working memory.

So the goal of routing for small models isn't "pick a great handful." It's "pick the right one, fast." If you nail that, the model is sharp and decisive. If you miss, the model has nothing to recover with, because there's no room left in the prompt for a second attempt.

We treat tight-context routing as a single-shot problem. The retrieval has to be right the first time. Everything else in the system is shaped around making that achievable.

Why not let the model ask for tools?

We considered the obvious alternative. Hand the model a meta-tool that says "ask me for any tool by name and I'll load its schema," then let it pull what it wants on demand. Frontier models can chain that reliably. Small local models often can't. They ignore the meta-tool, invoke it with the wrong names, or call it once and skip the actual work. Asking a model to do a job it isn't capable of just produces a different kind of failure.
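For reference, the meta-tool pattern looks roughly like this. It is a sketch of the alternative described above, not something ToolPiper ships; the name load_tool, its parameter, and the two-entry catalog are all hypothetical.

```python
# Hypothetical meta-tool in the common JSON function-calling shape.
LOAD_TOOL = {
    "name": "load_tool",
    "description": "Ask for any tool by name and receive its schema.",
    "parameters": {
        "type": "object",
        "properties": {"tool_name": {"type": "string"}},
        "required": ["tool_name"],
    },
}

# Hypothetical catalog standing in for the real one.
CATALOG = {
    "file_read": {
        "name": "file_read",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
    "git_status": {"name": "git_status", "parameters": {"type": "object", "properties": {}}},
}


def handle_load_tool(arguments: dict, catalog: dict) -> dict:
    """Resolve a load_tool call; the model must supply an exact tool name."""
    name = arguments.get("tool_name")
    if name not in catalog:
        return {"error": f"unknown tool: {name!r}"}
    return {"schema": catalog[name]}


print(handle_load_tool({"tool_name": "file_read"}, CATALOG))  # schema returned
print(handle_load_tool({"tool_name": "read_file"}, CATALOG))  # wrong name, error returned
```

The second call shows one of the failure modes: the model has to recall an exact name it has never seen, then spend a further turn on the real call.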

So for tight-context regimes we make the routing call upstream, before the model is involved, and ship the right tools directly. The model never has to know there were over 300 options. It sees a small focused set and gets to work.

Two schema sizes for one catalog

Picking fewer tools is half the work. Making each tool cheaper is the other half.

Every function-calling tool definition has a name, a description, and a JSON schema for its parameters. The per-parameter docs and the enum-value commentary are the heaviest pieces. On a verbose tool that's hundreds of tokens. On a verbose catalog of 303, it's tens of thousands.

ToolPiper ships each tool in two inference-side sizes. Compact strips per-parameter descriptions and recursive doc fields but keeps the tool-level description and the type/required/enum surface a model needs to call correctly. Full is the unstripped form a frontier model can use for the subtlest calls. Catalog-wide, the compact form runs about 30-40% leaner than full.
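A minimal sketch of what a compact transform could look like, assuming tools are plain JSON Schema dictionaries. The set of keys treated as documentation and the example tool are assumptions, not ToolPiper's actual rules.

```python
import json

# Assumption: which JSON Schema keys count as documentation only.
DOC_KEYS = {"description", "title", "examples"}


def strip_docs(schema):
    """Recursively drop doc-only keys while keeping type/required/enum intact."""
    if isinstance(schema, dict):
        out = {}
        for key, value in schema.items():
            if key == "properties" and isinstance(value, dict):
                # Property names are data, not doc keys: keep each name, strip inside it.
                out[key] = {prop: strip_docs(sub) for prop, sub in value.items()}
            elif key in DOC_KEYS:
                continue
            else:
                out[key] = strip_docs(value)
        return out
    if isinstance(schema, list):
        return [strip_docs(item) for item in schema]
    return schema


def compact(tool: dict) -> dict:
    """Keep the tool-level description; strip docs from the parameter schema."""
    return {
        "name": tool["name"],
        "description": tool["description"],
        "parameters": strip_docs(tool["parameters"]),
    }


# Hypothetical tool definition, for illustration only.
FULL_TOOL = {
    "name": "file_search",
    "description": "Search files by name.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Text to match against file names."},
            "scope": {"type": "string", "enum": ["home", "project"], "description": "Where to look."},
        },
        "required": ["query"],
    },
}

compact_tool = compact(FULL_TOOL)
print(len(json.dumps(FULL_TOOL)), "->", len(json.dumps(compact_tool)), "characters")
```

The transform builds new dictionaries rather than editing in place, so the full form stays available for models that have room for it.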

Which size goes out is decided per turn, based on how much room the loaded model has. A small local model gets compact schemas for a small set of tools. A long-context cloud model gets full schemas for a wider set. You don't configure any of this. The selection logic reads the model's actual capacity and sizes the catalog to fit.
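One way to picture the per-turn decision is a greedy fill against a token budget. The sketch below is an illustration under invented numbers: the 10% tool share, the 100,000-token threshold for full schemas, and the per-tool token costs are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class RankedTool:
    name: str
    compact_tokens: int
    full_tokens: int


def select_for_turn(
    ranked: list[RankedTool],
    context_tokens: int,
    tool_share: float = 0.10,        # assumption: fraction of context given to tools
    full_threshold: int = 100_000,   # assumption: context size that earns full schemas
) -> tuple[str, list[str]]:
    """Pick a schema size, then take ranked tools in order until the budget is spent."""
    size = "full" if context_tokens >= full_threshold else "compact"
    budget = int(context_tokens * tool_share)
    chosen, used = [], 0
    for tool in ranked:
        cost = tool.full_tokens if size == "full" else tool.compact_tokens
        if used + cost > budget:
            break
        chosen.append(tool.name)
        used += cost
    return size, chosen


# Hypothetical ranked list, best match first, with made-up token costs.
RANKED = [RankedTool(f"tool_{i}", compact_tokens=300, full_tokens=450) for i in range(10)]

print(select_for_turn(RANKED, 8_192))    # compact schemas, first two tools
print(select_for_turn(RANKED, 200_000))  # full schemas, all ten tools
```

With these numbers the 8K model gets two compact tools and the 200K model gets all ten at full size, from the same ranked list.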

The MCP wire itself is separate from all of this — it's already stripped to the floor. External clients reading tools/list see names and just enough schema for argument inference, no descriptions on the wire. The compact/full split happens later, when ToolPiper actually invokes the model.

The combined effect shows up in both cost and responsiveness: fewer tokens to pay for on cloud models, and shorter prompts to process on local ones. The same catalog that breaks a small model under naive shipping fits comfortably under PiperMatch plus schema sizing.

You decide what's in the schema

Retrieval handles the per-turn decision. The other half of catalog management is per-tool and per-category control, owned by the user, applied once and respected everywhere.

ToolPiper's Tool Permissions panel lets you mark any individual tool — or an entire category — as Allow, Ask, or Deny. Denied tools never reach PiperMatch and never enter the inference schema. The model literally cannot see them, so it cannot try them. This is not a runtime rejection; it is upstream of the catalog. A denied tool costs zero tokens.

The same is true at the category level. Don't want the model touching the macOS system actions? Deny system and the 162 tools in that category drop out of every payload. Don't want it touching your files? Deny the filesystem subset you care about, or the whole category. The Schema Budget card on the same panel shows your current allow/ask/deny token totals as you toggle, so you can see the catalog shrink in real time.
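The key property is that denial is a filter applied before retrieval, not a check at call time. A minimal sketch, assuming a deny at either the tool or the category level removes the tool and that unlisted tools default to Allow (both assumptions); the tool names are hypothetical.

```python
from enum import Enum


class Permission(Enum):
    ALLOW = "allow"
    ASK = "ask"
    DENY = "deny"


def visible_tools(
    catalog: dict[str, str],
    tool_rules: dict[str, Permission],
    category_rules: dict[str, Permission],
) -> list[str]:
    """Return the tools that may enter retrieval; denied ones are dropped up front."""
    visible = []
    for name, category in catalog.items():
        tool_rule = tool_rules.get(name, Permission.ALLOW)              # assumed default
        category_rule = category_rules.get(category, Permission.ALLOW)  # assumed default
        if Permission.DENY in (tool_rule, category_rule):
            continue  # never ranked, never placed in the schema
        visible.append(name)
    return visible


# Hypothetical catalog mapping tool name to category.
CATALOG = {
    "system_sleep": "system",
    "system_volume": "system",
    "file_read": "filesystem",
    "file_delete": "filesystem",
}

allowed = visible_tools(
    CATALOG,
    tool_rules={"file_delete": Permission.DENY},
    category_rules={"system": Permission.DENY},
)
print(allowed)  # ['file_read']
```

Because the filter runs first, a denied tool contributes nothing to the ranking step or to the token count of the prompt.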

For external clients, per-app gating lives one panel over, in Connected Apps. Claude Code can have access to one slice of the catalog while Cursor has access to another, with both routed through the same running ToolPiper.

Why we trained our own retrieval model

There are off-the-shelf embedding models you could use for this kind of retrieval. We tried them. They aren't good enough for tool selection at this granularity. The vocabulary of MCP tools overlaps in ways general-purpose embeddings don't separate. Two tools that do related-but-different things land too close together, and the wrong one wins on common queries.

So we trained our own. Synthetic data only, no user prompts and no real chat history, ever. That's a privacy boundary, not a footnote. The training pipeline is opinionated about tool routing in particular and has nothing to do with general semantic search. The result is a model small enough to load on the Neural Engine in milliseconds and accurate enough to put the right tool in the top five every time on our internal test set.

We won't go deeper than that publicly. The recipe matters and we want to keep iterating on it. The output is what's worth talking about.

What you experience as a user

Most of this is invisible by design. You install ToolPiper, you connect it to your client of choice, and it works. The catalog is there. The right tools show up when you need them. Tokens don't blow up. Small models don't choke.

If you want to see it run, ToolPiper's logs viewer surfaces a recommendation event on every turn. You'll see how many tools shipped, the size they were sent at, the total token cost, and the names that got picked. It's a useful sanity check the first few times you ask the system to do something. After that you stop looking, because the answer is always "a small relevant set, cheaply shipped."

Where this approach holds back

Two cases worth being honest about.

The first is when the model's context is big enough that the whole catalog fits without crowding anything. At that scale ranking adds no value, so we turn retrieval off and ship the catalog directly. A clever piece of machinery that earns its keep on a tight context window is the wrong tool for a generous one.
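That bypass amounts to a single comparison. A sketch under an assumed rule: the 25% "comfortable share" threshold is invented for illustration and is not ToolPiper's actual cutoff.

```python
def should_retrieve(
    catalog_tokens: int,
    context_tokens: int,
    comfortable_share: float = 0.25,  # assumption: catalog may use up to a quarter of context
) -> bool:
    """Return True when the catalog is too large to ship whole and ranking is worth running."""
    return catalog_tokens > context_tokens * comfortable_share


print(should_retrieve(29_000, 8_192))    # True: rank and ship a small set
print(should_retrieve(29_000, 200_000))  # False: ship the catalog directly
```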

The second is genuinely novel phrasing. Retrieval models work on the language they've seen during training, and a query expressed in unfamiliar vocabulary can rank a less-relevant tool higher than the right one. When that happens the workaround is usually a small rephrase. We also track these cases offline and tune the catalog text over time, because the catalog vocabulary is where most of the work is.

Why this matters beyond ToolPiper

Local AI is going to spend the next few years catching up to cloud AI on the experiences that depend on tools rather than raw model intelligence. Browser automation, system actions, file operations, search, scraping: all of it routes through tool calls now. A model that can't use tools well is a model with one hand tied behind its back.

The way you make small models good at tools is to do the work the cloud does, on-device. That work is mostly retrieval and schema engineering. PiperMatch is our take on the retrieval half. The schema sizing is the engineering half. Together they're the reason a 3B local model on a MacBook can go toe-to-toe with a frontier model on tool-driven tasks where intelligence isn't the bottleneck.

It's also the part of ModelPiper we're most proud of. The user-facing surface is small. The work behind it is not.

Try it

ToolPiper is a free download from modelpiper.com/download. The retrieval layer is on by default. Nothing to configure, nothing to enable. Connect a model with a tight context, connect a model with a generous one, and watch the same catalog adapt to both.