Claude Code runs against any HTTP endpoint that speaks the Anthropic Messages API. Four serious options: ToolPiper, Ollama, vLLM, LM Studio. They are not equivalent. Here is what is actually different.
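Concretely, "speaks the Anthropic Messages API" means accepting this request shape. A minimal sketch in TypeScript; the base URL, model name, and dummy API key are placeholders, not anything a specific server requires:

```ts
// Minimal sketch of the request shape Claude Code sends.
// Base URL, model name, and key are placeholders for your own setup.
const baseURL = "http://localhost:8000"; // e.g. vLLM's default port

const res = await fetch(`${baseURL}/v1/messages`, {
  method: "POST",
  headers: {
    "content-type": "application/json",
    "anthropic-version": "2023-06-01",
    "x-api-key": "unused-by-most-local-servers",
  },
  body: JSON.stringify({
    model: "qwen2.5-coder", // whatever name your server exposes
    max_tokens: 256,
    messages: [{ role: "user", content: "Say hello." }],
  }),
});

const msg = await res.json();
// Anthropic-shaped response: content is an array of typed blocks.
console.log(msg.content?.[0]?.text);
```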
| Feature | ToolPiper | Ollama | vLLM | LM Studio |
|---|---|---|---|---|
| Anthropic Messages API (`POST /v1/messages`) | Native, with provider switching | No | Yes (since v0.7) | Yes (since v0.3.x) |
| OpenAI Chat Completions (`/v1/chat/completions`) | Yes | Yes | Yes | Yes |
| Streaming with cancellation | <200 ms upstream abort on Esc | Yes | Yes | Yes |
| Tool use round-trip | Anthropic + OpenAI shapes | OpenAI only | OpenAI only | OpenAI only |
| Inference backends | 9 (llama.cpp, Apple Intelligence, FluidAudio, MLX, Apple Vision, …) | 1 (llama.cpp) | 1 (vLLM engine) | 1 (llama.cpp) |
| Apple Intelligence (Neural Engine) backend | Yes — first to ship it | No | No | No |
| BYOK cloud proxy (OpenAI / Anthropic / Gemini / etc.) | Yes — Keychain-locked, server-side injection | No | No | No |
| Cross-provider routing in one base URL | Yes — `endpoint_set` MCP tool | No | No | No |
| Conversational provider switching | "Switch to my local one" works in chat | No | No | No |
| Context-aware backend recommendation | `endpoint_recommend` reads conversation size + telemetry | No | No | No |
| Native /model picker integration | Yes — auto-writes `ANTHROPIC_CUSTOM_MODEL_OPTION_*` | No | No | No |
| Zero-config Claude Code install | One click — token + settings.json + MCP register | Manual env vars | Manual env vars + DEFAULT_*_MODEL dance | Manual env vars + DEFAULT_*_MODEL dance |
| MCP server bundled | 147 tools — browser, files, system, video, voice | No | No | No |
| MCP toolbelt injection (when client has no tools) | Header-gated, PiperMatch-retrieved | No | No | No |
| Browser automation (CDP) | 17 AX-native tools | No | No | No |
| Voice AI (STT + TTS + voice clone) | Yes | No | No | No |
| OCR (Apple Vision) | Yes | No | No | No |
| Image / video upscale on Neural Engine | PiperSR — 44 fps real-time 2× video | No | No | No |
| Resource intelligence (auto model eviction) | Real-time RAM monitoring + eviction | Reads RAM once at startup | Manual | Removed in 4.0 |
| Distribution | Native macOS DMG | Homebrew + CLI | Python wheel + GPU container | Cross-platform DMG |
| Sandboxing | Native, signed | No | No | No |
| Price | Free; Pro $9.99/mo for cloud BYOK | Free | Free | Free |
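For the "Manual env vars" columns, pointing Claude Code at one of these servers looks roughly like the sketch below. It leans on Claude Code's documented `ANTHROPIC_BASE_URL` / `ANTHROPIC_AUTH_TOKEN` / `ANTHROPIC_MODEL` overrides; the port and model name are assumptions about your local setup:

```ts
// Sketch: launch Claude Code against a local Anthropic-compatible server.
// Port and model name are examples; adjust to whatever your server exposes.
import { spawn } from "node:child_process";

spawn("claude", {
  stdio: "inherit",
  env: {
    ...process.env,
    ANTHROPIC_BASE_URL: "http://localhost:1234", // e.g. LM Studio's default port
    ANTHROPIC_AUTH_TOKEN: "local",               // most local servers ignore it
    ANTHROPIC_MODEL: "qwen2.5-coder",            // name your server advertises
  },
});
```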
Performance numbers depend heavily on your Mac, your model, and your prompt mix. ToolPiper ships `scripts/anthropic-perf-baseline.ts`, a harness that captures p50/p95 TTFT and tokens/sec across every provider you have configured, including Ollama, vLLM, and LM Studio if you wire them in as endpoints. The output lives at `docs/features/anthropic-perf-baseline.md` and updates on every run. The table below holds the workload definitions; the live comparison is in the repo, and a simplified sketch of the measurement loop follows the table.
| Workload | ToolPiper result | Definition |
|---|---|---|
| Single-turn 2K prompt (TTFT, p50) | See live numbers | Captured per `EndpointConfig` by `scripts/anthropic-perf-baseline.ts`. Runs across every provider you have configured. |
| Multi-turn 20K with tools (tokens/sec, p50) | See live numbers | Same harness — Phase 7 deliverable, populated on every fresh ToolPiper run. |
| Tool-heavy 50K (TTFT, p95) | See live numbers | Skipped automatically on backends whose context window is below 50K — annotated `context_overflow` in the output. |
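To make the workload table concrete, here is a stripped-down version of what such a probe does. This is not the real `anthropic-perf-baseline.ts`: it measures a single request, counts raw stream chunks rather than parsed SSE token deltas, and all endpoint details are placeholders.

```ts
// Hypothetical simplified probe: measure TTFT and rough throughput for one
// streaming /v1/messages call. The real harness repeats this to get p50/p95.
async function probe(baseURL: string, model: string, prompt: string) {
  const start = performance.now();
  const res = await fetch(`${baseURL}/v1/messages`, {
    method: "POST",
    headers: {
      "content-type": "application/json",
      "anthropic-version": "2023-06-01",
    },
    body: JSON.stringify({
      model,
      max_tokens: 512,
      stream: true, // server-sent events
      messages: [{ role: "user", content: prompt }],
    }),
  });

  let ttftMs = 0;
  let chunks = 0;
  const reader = res.body!.getReader();
  for (;;) {
    const { done } = await reader.read();
    if (done) break;
    if (!ttftMs) ttftMs = performance.now() - start; // first streamed bytes
    chunks += 1; // crude proxy for tokens; real harness parses SSE deltas
  }
  const totalS = (performance.now() - start) / 1000;
  return { ttftMs, chunksPerSec: chunks / totalS };
}

console.log(await probe("http://localhost:8000", "qwen2.5-coder", "2K-token prompt here"));
```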
Four ways to run it:

- Neural Engine inference: no download, no quota, no API key. The cheapest, fastest path.
- Local · 32K · Tool use: Qwen-Coder on llama.cpp via the Metal GPU. The best capability/privacy balance.
- BYOK · Keychain: gpt-4o through Claude Code, billed against your account, key locked in Keychain.
- Conversational: "Use my local one." "Try Gemini Flash for this." No env vars. The MCP-native flow (sketched below).
Download ToolPiper, click "Configure for Claude Code", run `claude`. Done.