Hand a small local model 147 tool definitions and it stops working. Not slowly. Not approximately. It breaks. The catalog overflows the context window before you've typed a question, and what was supposed to be a useful AI assistant turns into a context error.
That's the constraint we ran into the day we connected ToolPiper's full MCP catalog to a tight-context local model. Cloud frontier models swallowed it without complaint. Llama 3.2 3B did not. The catalog alone was more than three times its entire working memory.
Why you can't just send all 147 tools
ToolPiper exposes 147 MCP tools. At full schema fidelity that catalog runs to roughly 27,000 tokens. A frontier model with a 200K context can absorb that without thinking. A local model with an 8K context cannot, and won't, ever. The catalog by itself is more than three times the model's entire working memory.
So the only models that can be handed every tool every turn are frontier-class ones with a big context window to spare. Everything smaller, which is most of what people actually run on their own hardware, needs the catalog filtered before it reaches the model. There is no clever prompt engineering that fixes this. The arithmetic is what it is.
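To make the arithmetic concrete, here's a minimal sketch using the figures above. The token counts are the rough numbers cited in this post, not anything measured; the function name is just for illustration.

```typescript
// Rough budget check: what share of a model's context does the full catalog eat?
// Figures are the approximations from this post, not measurements.
const CATALOG_TOKENS = 27_000;    // ~147 MCP tools at full schema fidelity
const FRONTIER_CONTEXT = 200_000; // frontier-class context window
const LOCAL_CONTEXT = 8_000;      // tight-context local model

function catalogShare(contextTokens: number): number {
  return CATALOG_TOKENS / contextTokens;
}

console.log(catalogShare(FRONTIER_CONTEXT).toFixed(2)); // ~0.14 – comfortable
console.log(catalogShare(LOCAL_CONTEXT).toFixed(2));    // ~3.38 – over 3x the whole window
```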
The cloud platforms hide this by doing tool routing behind their API. When you call OpenAI's function calling with 50 tools, it isn't pasting all 50 into the prompt. There's a retrieval layer that picks a small relevant subset before the model ever sees the catalog. Same with Anthropic. That layer is one of the reasons frontier APIs feel magical and the pricing stays sane.
Local AI on a laptop doesn't get that for free. There is no platform layer between you and the model. Whatever the client sends is what the model sees, pays for, and tries to reason about. So we built our own.
Meet PiperMatch
The piece that does this in ToolPiper is called PiperMatch. It's a small retrieval model we trained ourselves and ship inside the app. When a query arrives, PiperMatch ranks the catalog against what you actually asked for and returns the handful of tools likely to matter for that turn.
It runs on the Apple Neural Engine. The query never leaves your Mac. Selection takes a few hundred milliseconds at most, and the chosen tools are what gets handed to the language model.
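For readers who want the general shape of retrieval-based tool selection, here is a minimal sketch: score every tool against the query, keep the top few. PiperMatch itself is a trained model running on the Neural Engine, not this literal code; the `embed` function below is a toy stand-in and `selectTools` is a hypothetical name.

```typescript
// Sketch only: rank a tool catalog against a query and keep the top k.
interface Tool {
  name: string;
  description: string;
}

// Toy embedding: hash words into a fixed-size bag-of-words vector.
// A trained retriever is what makes this step accurate in practice.
function embed(text: string, dims = 256): number[] {
  const v = new Array(dims).fill(0);
  for (const word of text.toLowerCase().split(/\W+/).filter(Boolean)) {
    let h = 0;
    for (const ch of word) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
    v[h % dims] += 1;
  }
  return v;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb)) || 0;
}

// k defaults to 1 because, as discussed below, tight-context routing is single-shot.
function selectTools(query: string, catalog: Tool[], k = 1): Tool[] {
  const q = embed(query);
  return catalog
    .map(tool => ({ tool, score: cosine(q, embed(tool.description)) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(s => s.tool);
}
```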
This is the same shape of solution the big platforms use behind their APIs. The difference is we got it small enough to ship inside a desktop app and run privately on your hardware.
Why small models really can only handle one tool at a time
The general advice you'll see online says small local models "struggle" with many tools. That undersells it. With a tight context window and a long catalog, the practical ceiling is closer to one tool. Sometimes two. Anything beyond that and the prompt starts crowding out the user's actual message and the model's working memory.
So the goal of routing for small models isn't "pick a great handful." It's "pick the right one, fast." If you nail that, the model is sharp and decisive. If you miss, the model has nothing to recover with, because there's no room left in the prompt for a second attempt.
We treat tight-context routing as a single-shot problem. The retrieval has to be right the first time. Everything else in the system is shaped around making that achievable.
Why not let the model ask for tools?
We considered the obvious alternative. Hand the model a meta-tool that says "ask me for any tool by name and I'll load its schema," then let it pull what it wants on demand. Frontier models can chain that reliably. Small local models often can't. They ignore the meta-tool, invoke it with the wrong names, or call it once and skip the actual work. Asking a model to do a job it isn't capable of just produces a different kind of failure.
So for tight-context regimes we make the routing call upstream, before the model is involved, and ship the right tools directly. The model never has to know there were 147 options. It sees a small focused set and gets to work.
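For concreteness, this is roughly the shape the rejected meta-tool approach takes. The name and schema here are illustrative only, not anything ToolPiper ships.

```typescript
// The rejected alternative, sketched: one meta-tool the model must call correctly
// before it can see any real schema. Name and fields are hypothetical.
const loadToolMeta = {
  name: "load_tool",
  description: "Ask for any tool in the catalog by name to receive its full schema.",
  inputSchema: {
    type: "object",
    properties: {
      tool_name: { type: "string", description: "Exact name of the tool to load" },
    },
    required: ["tool_name"],
  },
};
// Frontier models can chain load_tool -> real tool reliably. Small models often
// misname the tool, skip the follow-up call, or stop after loading the schema,
// which is why the routing happens upstream instead.
```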
Three schema sizes for one catalog
Picking fewer tools is half the work. Making each tool cheaper is the other half.
Every MCP tool definition has a name, a description, and a JSON schema for its parameters. The schema is the heaviest piece. It carries types, enum lists, nested objects, and the documentation the model uses to call the tool correctly. On a verbose tool that's hundreds of tokens; across a catalog of 147 tools, it's tens of thousands.
ToolPiper ships every tool in three sizes. The smallest is a stripped form that keeps only what the model strictly needs to invoke the tool, the middle size carries enough detail for an average model to make good arg choices, and the full size includes the documentation and edge-case fields that frontier models can use to handle subtle calls.
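As a sketch, here is what the three sizes might look like for a single hypothetical tool. The tool name, fields, and exactly what survives each cut are illustrative, not ToolPiper's actual stripping rules.

```typescript
// One hypothetical tool at three fidelities.

// Full: documentation, defaults, and edge-case fields a frontier model can exploit.
const searchFilesFull = {
  name: "search_files",
  description:
    "Search files under a root path. Supports glob patterns, case-insensitive " +
    "matching, and result limits. Returns paths relative to the root.",
  inputSchema: {
    type: "object",
    properties: {
      root: { type: "string", description: "Directory to search from" },
      pattern: { type: "string", description: "Glob pattern, e.g. **/*.md" },
      caseSensitive: { type: "boolean", default: false },
      maxResults: { type: "integer", minimum: 1, maximum: 500, default: 50 },
    },
    required: ["root", "pattern"],
  },
};

// Medium: enough detail for an average model to pick sensible arguments.
const searchFilesMedium = {
  name: "search_files",
  description: "Search files under a root path with a glob pattern.",
  inputSchema: {
    type: "object",
    properties: {
      root: { type: "string", description: "Directory to search from" },
      pattern: { type: "string", description: "Glob pattern" },
    },
    required: ["root", "pattern"],
  },
};

// Stripped: only what is strictly needed to invoke the tool at all.
const searchFilesStripped = {
  name: "search_files",
  description: "Search files by glob pattern.",
  inputSchema: {
    type: "object",
    properties: { root: { type: "string" }, pattern: { type: "string" } },
    required: ["root", "pattern"],
  },
};
```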
Which size goes out is decided per turn, based on how much room the loaded model has. A small local model gets the leanest schema for a small set of tools. A long-context cloud model gets richer schemas for a wider set. You don't configure any of this. The selection logic reads the model's actual capacity and sizes the catalog to fit.
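A minimal sketch of that per-turn decision, with invented thresholds and counts standing in for whatever the real selection logic reads from the loaded model:

```typescript
// Sketch of budget-driven catalog sizing. Thresholds and tool counts are
// illustrative; the real logic reads the loaded model's actual capacity.
type SchemaTier = "stripped" | "medium" | "full";

interface CatalogPlan {
  tier: SchemaTier;
  toolCount: number;
}

function planCatalog(contextTokens: number): CatalogPlan {
  if (contextTokens <= 8_000) return { tier: "stripped", toolCount: 1 }; // tight local model
  if (contextTokens <= 32_000) return { tier: "medium", toolCount: 5 };  // mid-size model
  return { tier: "full", toolCount: 20 };                                // long-context model
}

console.log(planCatalog(8_000));   // { tier: 'stripped', toolCount: 1 }
console.log(planCatalog(200_000)); // { tier: 'full', toolCount: 20 }
```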
The combined effect is real money saved and real responsiveness gained. The same catalog that breaks a small model under naive shipping fits comfortably under PiperMatch plus schema sizing.
Why we trained our own retrieval model
There are off-the-shelf embedding models you could use for this kind of retrieval. We tried them. They aren't good enough for tool selection at this granularity. The vocabulary of MCP tools overlaps in ways general-purpose embeddings don't separate. Two tools that do related-but-different things land too close together, and the wrong one wins on common queries.
So we trained our own. Synthetic data only, no user prompts and no real chat history, ever. That's a privacy boundary, not a footnote. The training pipeline is opinionated about tool routing in particular and has nothing to do with general semantic search. The result is a model small enough to load on the Neural Engine in milliseconds and accurate enough to put the right tool in the top five every time on our internal test set.
We won't go deeper than that publicly. The recipe matters and we want to keep iterating on it. The output is what's worth talking about.
What you experience as a user
Most of this is invisible by design. You install ToolPiper, you connect it to your client of choice, and it works. The catalog is there. The right tools show up when you need them. Tokens don't blow up. Small models don't choke.
If you want to see it run, ToolPiper's logs viewer surfaces a recommendation event on every turn. You'll see how many tools shipped, the size they were sent at, the total token cost, and the names that got picked. It's a useful sanity check the first few times you ask the system to do something. After that you stop looking, because the answer is always "a small relevant set, cheaply shipped."
Where this approach falls short
Two cases worth being honest about.
The first is when the model's context is big enough that the whole catalog fits without crowding anything. At that scale ranking adds no value, so we turn retrieval off and ship the catalog directly. A clever piece of machinery that earns its keep on a tight context window is the wrong tool for a generous one.
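The bypass condition is simple enough to sketch. The headroom fraction below is illustrative, not the threshold ToolPiper actually uses.

```typescript
// Bypass sketch: if the full catalog fits without crowding the turn, skip
// retrieval and ship everything. The headroom fraction is illustrative.
function shouldBypassRetrieval(
  fullCatalogTokens: number,
  contextTokens: number,
  headroom = 0.5, // leave at least half the window for the conversation itself
): boolean {
  return fullCatalogTokens <= contextTokens * (1 - headroom);
}

console.log(shouldBypassRetrieval(27_000, 200_000)); // true  – ship the whole catalog
console.log(shouldBypassRetrieval(27_000, 8_000));   // false – route with PiperMatch
```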
The second is genuinely novel phrasing. Retrieval models work on the language they've seen during training, and a query expressed in unfamiliar vocabulary can rank a less-relevant tool higher than the right one. When that happens the workaround is usually a small rephrase. We also track these cases offline and tune the catalog text over time, because the catalog vocabulary is where most of the work is.
Why this matters beyond ToolPiper
Local AI is going to spend the next few years catching up to cloud AI on the experiences that depend on tools rather than raw model intelligence. Browser automation, system actions, file operations, search, scraping, all of it routes through tool calls now. A model that can't use tools well is a model with one hand tied behind its back.
The way you make small models good at tools is to do the work the cloud does, on-device. That work is mostly retrieval and schema engineering. PiperMatch is our take on the retrieval half. The schema sizing is the engineering half. Together they're the reason a 3B local model on a MacBook can go toe-to-toe with a frontier model on tool-driven tasks where intelligence isn't the bottleneck.
It's also the part of ModelPiper we're most proud of. The user-facing surface is small. The work behind it is not.
Try it
ToolPiper is a free download from modelpiper.com/download. The retrieval layer is on by default. Nothing to configure, nothing to enable. Connect a model with a tight context, connect a model with a generous one, and watch the same catalog adapt to both.
