Free Direct Download

Everything Ollama does, free. Then it does the rest of your Mac.

Model downloads, the native llama.cpp engine, multi-model, a local OpenAI-compatible API — free, no account, no caps, no terminal. Then 300+ MCP tools, browser automation, voice, vision, and pipelines on top.

Powered by llama.cpp: ToolPiper embeds upstream llama-server directly — not a fork — currently build b9533, with the exact version shown in the About panel. Models are standard GGUF files. One install replaces Ollama + Open WebUI + Playwright MCP + a filesystem MCP + LangChain.

By , Founder & Lead Engineer— Updated
9 Backends

Local AI Inference

llama.cpp on Metal GPU

Run any GGUF model. Qwen 3.5, Llama 3.2, DeepSeek R1. 30+ tok/s on M2 Air. Flash attention, speculative decoding, and Jinja templates out of the box.

Apple Intelligence

Neural Engine inference. Summarization, rewriting, Smart Reply. Zero GPU impact — runs on dedicated ML hardware alongside your other models.

FluidAudio STT/TTS

Whisper-class transcription at 210x realtime. PocketTTS with zero GPU impact. Both run on the Neural Engine, leaving Metal GPU free for LLM inference.

MLX Audio TTS

Soprano, Orpheus, voice cloning via Qwen3 TTS on Metal GPU. High-quality voice synthesis with support for multiple speakers and styles.

Apple Vision OCR

On-device text extraction from images and PDFs. No model download needed — uses the Vision framework built into macOS.

Resource Intelligence

Real-time RAM monitoring via proc_pid_rusage. Automatic model eviction under memory pressure. Your Mac stays responsive without manual intervention.

Works with Claude Code, Cursor, Windsurf

300+ MCP Tools

Core AI (31 tools)

chat, audio_transcribe, audio_speak, audio_voice_clone, text_embed, vision_ocr, model_load/list/search/download, voice_chat memory, endpoint management. All inference on Neural Engine and Metal GPU.

System Control (162 tools)

26 native macOS domains, all in-process via ActionRouter. Windows, displays, audio devices, network, Bluetooth, Dock, Spaces, Finder, calendar, contacts, focus, clipboard history, notifications, processes, reminders, timers.

Browser & Web (28 tools)

AX-native selectors, self-healing, 7 assertion types, visual recording, network interception, coverage, WebAuthn, autofill, plus web_scrape, http_request, youtube_transcript, and web_api_discover.

Video Creator (17 tools)

AI-driven screenplay-to-MP4 pipeline. screenplay, rehearse, record, render, narrate, plus full timeline edit (composition, narration, screenplay, timeline, export), import_media, and clip library.

Filesystem & Git (18 tools)

file_read, file_write, file_create, file_delete, file_list_directory, file_pick_directory, code_search, workspace_search, plus git_status, git_diff, git_commit, git_log, git_push, git_checkout.

Outreach & Social (15 tools)

github_repo_list, github_activity, github_compare, hn_search, hn_trending, reddit_search, reddit_post, gsc_analytics, gsc_inspect, queue_publish, queue_add, queue_list. Build your distribution loop.

Analysis & RAG (11 tools)

rag_ingest, rag_query, rag_collection_list with local embeddings + HNSW vector index. image_analyze, text_analyze, image_upscale, video_upscale, image_transform, pdf_extract, qr_generate.

Capture & Motion (8 tools)

vision_screenshot, vision_color_pick, audio_record, plus 60fps pose streaming: pose_detect, pose_format_list, pose_stream_start, vision_stream_start/stop.

Testing & Integrations (14 tools)

PiperTest (test_save, test_run, test_list, test_get, test_export, test_delete) with self-healing selectors. Plus OAuth (4), Sieve content filters (4) for end-to-end workflows.

Drop-in Replacement

OpenAI-Compatible API

/v1/chat/completions

Drop-in replacement for OpenAI SDK. Change base_url to localhost:9998, that's it. Streaming and non-streaming. Works with LangChain, LlamaIndex, Continue.dev, Aider, and anything that accepts a custom OpenAI base URL.

/v1/embeddings

Local vector embeddings for RAG pipelines. Apple NL embedding (zero-setup, 512-dim) or llama.cpp embedding models. Content-addressed cache for repeat queries.

/v1/audio/speech

Text-to-speech synthesis. Three engines (FluidAudio, MLX Audio, PocketTTS), eight voices, all on-device. Same API format as OpenAI's TTS endpoint.

/v1/audio/transcriptions

Speech-to-text transcription at 210x realtime on Neural Engine. Parakeet V3 model. Same API format as OpenAI's Whisper endpoint.

Cloud API Proxy

Route cloud requests through ToolPiper with Keychain key injection. Your API keys never appear in code or .env files. One base URL handles both local and cloud models.

Developer Tokens

tp_<64hex> Bearer tokens for team sharing and CI pipelines. SHA-256 hashed, stored in macOS Keychain. Works as api_key in any OpenAI SDK.

AX-Native

Browser Automation & Testing

AX-Native Selectors

Query Chrome's real accessibility tree via CDP's Accessibility.queryAXTree. Not a DOM simulation, not injected JavaScript. The actual computed AX tree that screen readers consume.

Self-Healing

3 modes — passive, fuzzy AX match, AI-assisted. Broken selectors repair in 5-15ms with zero external calls. Free, not a paid add-on.

Visual Recording

Browse your app normally. Every interaction becomes an AX-enriched test step with element metadata, page context, and a mutation diff showing what changed.

Export to Playwright/Cypress

Deterministic, idiomatic code for CI. AX selectors map to each framework's native format: role:button:Sign In becomes page.getByRole('button', { name: 'Sign In' }).

On-Device

Voice AI & Media

Speech-to-Text

Parakeet V3 at 210x realtime on Neural Engine. Whisper-class accuracy with zero GPU impact. Transcribe meetings, lectures, and voice memos locally.

Text-to-Speech

Three engines, eight voices, all on-device. FluidAudio on Neural Engine for low-latency. MLX Audio on Metal GPU for studio-quality synthesis.

Voice Cloning

Clone any voice from a 10-second sample via Qwen3 TTS on Metal GPU. Create custom voices for narration, accessibility, or creative projects.

Video Upscale

PiperSR at 44.4 FPS on Apple Neural Engine. Double-buffered ANE+Metal pipeline. Real-time 2x upscale from 360p to 720p.

Push-to-Talk

Right Option = dictate anywhere (STT to paste). Right Command = voice command (STT to LLM with tool definitions to ActionRouter). System-wide hotkeys.

Web Scraping

CDP-based scraper using a real browser. 16-framework detection, readiness-aware extraction, 7 output formats: markdown, text, readability, AX tree, HTML, links, screenshot.

ToolPiper vs The Alternatives

ToolPiperOllamaLM StudioOpen WebUI
SetupSigned DMG, one drag to ApplicationsHomebrew + CLIDMG downloadDocker + Ollama required
Inference enginellama.cpp + 8 other backendsllama.cpp onlyllama.cpp onlyNo engine (proxies to Ollama)
MCP toolsover 300 tools, stdio + HTTP transportsNoneNoneNone
Browser automation14 CDP tools with AX selectorsNoneNoneNone
Voice AISTT + TTS + voice cloningNoneNoneNone
Resource monitoringReal-time RAM, automatic evictionNone (reads RAM once at startup)Removed in v4.0None (most-requested feature)
API compatibilityOpenAI-compatible (chat, embed, speech, transcribe)OpenAI-compatible (chat, embed)OpenAI-compatible (chat)Web UI only
Cloud proxyKeychain key injection, one base URLNoneNoneNone
TestingPiperTest with self-healing + Playwright/Cypress exportNoneNoneNone
PriceFree; Pro $10/mo, Studio $29/mo, Max $49/moFreeFreeFree

How It Works

1

Install

Download the signed DMG from modelpiper.com/download. Drag to Applications. Launch. A starter model downloads automatically.

2

Connect

Open ModelPiper in your browser, or run claude mcp add toolpiper -- ~/.toolpiper/mcp for MCP.

3

Build

Chat, transcribe, automate browsers, run tests, build agents — all on localhost.

Everything Stays Local

All inference runs on your Mac's GPU and Neural Engine. No data leaves your machine.

Powered by llama.cpp

The inference engine is upstream llama-server, embedded directly — not a fork. Currently build b9533; the exact version ships in the About panel.

9 Backends, 1 App

llama.cpp, Apple Intelligence, FluidAudio, MLX Audio, Apple Vision OCR, CoreML upscale — coordinated by one process.

Open Standards

OpenAI-compatible API, Model Context Protocol, GGUF models, Playwright/Cypress export. No proprietary lock-in.

Simple Pricing

The model runner is free for everyone — no account, no caps. Paid tiers add voice, media, and developer tools.

Free
$0

forever

  • Native llama.cpp engine — any GGUF model
  • Unlimited downloads, multi-model switching
  • Local OpenAI-compatible API + embeddings
  • MCP server with all 300+ tools
  • Transcription (STT) and visual pipeline builder
  • Free companion apps (Vision, Audio, Media)
Download ToolPiper
Pro
$10

/month

  • Everything in Free
  • Push-to-talk dictation, anywhere on your Mac
  • Text-to-speech — three engines, eight voices
  • Apple Intelligence on the Neural Engine
  • Local RAG over your files
  • Cloud API proxy with Keychain keys
  • All 9 inference backends
Get Pro
Coming soon
$29

/month

  • Everything in Pro
  • Image upscaling (ANE-native)
  • Video upscaling (60fps real-time)
  • Video editing pipeline
  • Pose detection (60fps streaming)
  • Outreach toolkit
Coming soon
$49

/month

  • Everything in Studio
  • CodePiper (IDE AI extension)
  • PiperTest (self-healing browser tests)
  • Full browser automation (CDP + AX)
  • API discovery toolkit
  • Priority support

Frequently Asked Questions

Do I need Ollama?

No. ToolPiper embeds the same llama.cpp engine Ollama wraps — directly, not as a fork — and the whole runner is free: unlimited GGUF downloads, multi-model switching, the local OpenAI-compatible API, no account. Models are stored as standard GGUF files you can use with any llama.cpp tool, not a proprietary blob format. If you keep Ollama installed, ModelPiper can still connect to it as a backend, but there's nothing it provides that ToolPiper doesn't run natively.

Will my Ollama tools keep working?

Yes - ToolPiper serves the Ollama API itself. Flip on the compatibility listener (Settings → General, off by default) and anything that talks to localhost:11434 talks to ToolPiper: model list, streamed chat, embeddings, pulls, deletes, all on the native engine. It's served as the legacy dialect - every response carries a standards-based deprecation header pointing at the first-party /v1 API, which is where new integrations should land.

What models can I run?

Any GGUF model from HuggingFace (135,000+ available). ToolPiper also ships curated presets tested on Apple Silicon: Qwen 3.5, Llama 3.2, DeepSeek R1, Phi-4, Gemma 3, and more. Each preset shows exact RAM usage so you never download something your Mac can't run.

Does it work with Claude Code?

Yes. One command: claude mcp add toolpiper -- ~/.toolpiper/mcp. Restart Claude Code and all 300+ tools are available. Also works with Cursor, Windsurf, and any MCP-compatible client via stdio or Streamable HTTP transport.

How much RAM do I need?

8GB minimum for small models (0.8B-3B). 16GB recommended for the mainstream sweet spot (7B-8B models alongside normal app usage). 32GB opens up 14B models and multi-model workflows. ToolPiper's resource intelligence monitors RAM in real-time and automatically evicts models under memory pressure.

Is it really free?

The whole model runner is free with no account and no caps: the native llama.cpp engine, unlimited model downloads, multi-model switching, the local OpenAI-compatible API, embeddings, transcription, the visual pipeline builder, and all 300+ MCP tools. Pro ($10/mo) adds push-to-talk dictation, text-to-speech, Apple Intelligence, local RAG, and the cloud API proxy. Studio ($29/mo, coming soon) adds image and video upscaling, video editing, pose detection, and the outreach toolkit. Max ($49/mo, coming soon) adds CodePiper, PiperTest with self-healing, full browser automation, API discovery, and priority support.

Does it work offline?

Yes, completely. All inference runs on your Mac's hardware. Once models are downloaded, everything works without an internet connection — chat, transcribe, speak, embed, OCR, browser automation, testing, and desktop control. The only features that require network access are cloud API proxy and the social/research tools (GitHub, Hacker News, Reddit).

Ready to run AI locally?

One app. Nine backends. Over 300 tools. Zero configuration.