You have a folder of contracts. Or a codebase. Or a year of meeting transcripts. You need to find something specific — a clause, a function, a decision that was made in March. You could search manually. You could use Cmd+F across fifty files. Or you could ask a question in plain English and get an answer that cites the exact passage.
That's what RAG does. Retrieval-augmented generation connects a language model to your documents so it can answer questions using your actual data instead of its training knowledge. The model retrieves relevant passages, reads them, and synthesizes an answer — with source attribution so you can verify.
The problem is that most RAG tools require uploading your documents to a cloud service. ChatGPT's file analysis, Claude's Project Knowledge, and every RAG-as-a-service platform send your data to remote servers for processing and indexing. For confidential documents — legal contracts, financial records, proprietary code, medical notes — that's a non-starter.
What is RAG and how does it work?
RAG has three stages: ingest, index, and query.
Ingest: Your documents are split into chunks — paragraphs, sections, or fixed-length segments. Each chunk is converted into a numerical vector (an "embedding") that captures its semantic meaning. Similar concepts produce similar vectors, regardless of exact wording.
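The splitting step can be sketched in a few lines. This is an illustrative fixed-length chunker with overlap, not ToolPiper's actual implementation; the size and overlap values are arbitrary placeholders:

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-length chunks.

    Overlap keeps a sentence that straddles a chunk boundary
    retrievable from at least one chunk.
    """
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
        if start + size >= len(text):
            break
    return chunks
```

Each chunk would then be passed to the embedding model to produce its vector. Real ingest pipelines usually prefer paragraph or section boundaries over raw character offsets, which is why chunking strategy is a tuning knob rather than a constant.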
Index: The vectors are stored in a searchable index. When you ask a question, your question is also converted to a vector, and the index finds the document chunks whose vectors are most similar — the passages most likely to contain the answer.
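At query time, "most similar" usually means highest cosine similarity between the question vector and each chunk vector. A minimal brute-force version (real indexes use approximate nearest-neighbor structures to avoid scanning everything) might look like:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """index is a list of (chunk_id, vector). Returns the k best chunk ids."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]
```

With toy 2-dimensional vectors, `top_k([1, 0], [("a", [1, 0]), ("b", [0, 1]), ("c", [0.9, 0.1])])` returns the chunks pointing in roughly the query's direction first. Production embeddings are hundreds of dimensions (512 for Apple NL Embedding), but the ranking logic is the same.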
Query: The retrieved passages are injected into the language model's prompt as context. The model reads them and generates an answer grounded in your actual data. It can cite which document and which passage it used.
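The "injected into the prompt" step is just string assembly. A hedged sketch of what a grounded prompt with citation markers could look like (the exact template ToolPiper uses is not documented here):

```python
def build_prompt(question: str, passages: list[tuple[str, str]]) -> str:
    """passages is a list of (source_name, passage_text).

    Numbered markers let the model cite which passage it used.
    """
    context = "\n\n".join(
        f"[{i + 1}] ({src}) {text}" for i, (src, text) in enumerate(passages)
    )
    return (
        "Answer using only the context below. Cite passages by number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The language model sees only the retrieved excerpts plus the question, which is why answers stay grounded in your documents rather than the model's training data.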
The entire value of RAG is that the language model doesn't need to have memorized your data during training. It reads the relevant parts on the fly, every time you ask. This means it works with documents that were created yesterday, with internal data the model has never seen, and with content that changes frequently.
Why local RAG matters
Your documents never leave your machine. The embedding model, the vector index, and the language model all run on your Mac. Contracts, source code, financial statements, patient records — the data stays on your disk. There's no cloud index storing your proprietary knowledge.
No per-query cost. Cloud RAG services charge for embedding generation, vector storage, and LLM queries. A large document collection can cost $50–200/month in API fees alone. Local RAG costs nothing after the one-time model download.
No document size limits. Cloud services cap file sizes and total knowledge base volume. Locally, you're limited only by your disk space and RAM. Index a 10GB codebase or ten years of meeting notes — no artificial caps.
Works offline. Your indexed documents are queryable without internet. On a plane reviewing contracts, at a client site without Wi-Fi, or when your ISP is down — the index and models work the same.
Your index stays current. ToolPiper watches your document folder and can re-index automatically when files change. Cloud services require manual re-upload or sync configuration.
What you need
You don't need: A vector database service. An OpenAI API key. Python. Docker. Pinecone, Weaviate, or Chroma running somewhere. A subscription to anything.
You do need: A Mac with Apple Silicon (M1 or later) and at least 16GB of RAM. RAG runs an embedding model and a language model simultaneously — 16GB gives comfortable headroom for both.
What can you index?
ToolPiper ingests 39 text file types and 6 image types. The practical list covers what most people actually need:
Documents: PDF, DOCX, RTF, Markdown, plain text, HTML, LaTeX, EPUB
Code: Swift, TypeScript, Python, Go, Rust, Java, C/C++, Ruby, PHP, SQL, YAML, JSON, TOML, and more
Data: CSV, TSV, XML, JSON
Images: PNG, JPEG, WebP, GIF, BMP (indexed as image chunks with embeddings via vision models)
Point it at a folder and it indexes everything it can read. Files it doesn't recognize are skipped silently.
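Conceptually, that folder scan is an extension filter over a recursive walk. This sketch uses a small illustrative subset of extensions — not ToolPiper's real 45-type list:

```python
from pathlib import Path

# Illustrative subset only; ToolPiper recognizes 39 text and 6 image types.
SUPPORTED = {".pdf", ".md", ".txt", ".py", ".swift", ".json"}

def indexable_files(root: str) -> list[Path]:
    """Walk a folder tree, keep recognized files, skip everything else silently."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )
```

Unrecognized files simply never enter the list, which matches the "skipped silently" behavior described above.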
The embedding models
Embeddings are the foundation of RAG — the quality of your search results depends on the quality of your embeddings. ToolPiper offers three options:
Apple NL Embedding — zero-setup, runs on the Neural Engine, 512-dimensional vectors. This is the fastest option and requires no model download. It uses Apple's built-in NLContextualEmbedding framework, which means it works immediately on any Mac with Apple Silicon. Text only — images are skipped.
Snowflake Arctic Embed S — a dedicated embedding model running via llama-server on port 9997. Higher quality embeddings than Apple NL for technical and domain-specific content. Requires a one-time model download (~130MB).
Any GGUF embedding model — bring your own. If you have a preferred embedding model in GGUF format, ToolPiper's embedding server loads it via the standard /v1/embeddings endpoint.
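The /v1/embeddings endpoint follows the common OpenAI-compatible shape, so any HTTP client can talk to it. This sketch only builds the request (it doesn't send it, since that requires a running server); the port comes from the article, while the model identifier is a placeholder — use whatever name your llama-server instance reports:

```python
import json
import urllib.request

def embeddings_request(
    texts: list[str],
    base_url: str = "http://127.0.0.1:9997",
    model: str = "snowflake-arctic-embed-s",  # placeholder model name
) -> urllib.request.Request:
    """Build an OpenAI-style /v1/embeddings POST request."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it with `urllib.request.urlopen(req)` against a running server returns a JSON body whose `data` array contains one embedding vector per input text.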
For most users, Apple NL Embedding is the right default — zero download, zero configuration, good quality. Switch to Snowflake or a custom model if you need higher precision on specialized content.
The ModelPiper workflow
Load the Local RAG Chat template. The pipeline has two key nodes: a RAG node (connected to your document collection) and a language model node (Qwen 3.5 by default).
First, create a collection. Click the RAG node, point it at a folder, and choose an embedding model. Hit "Index" — ToolPiper splits your documents into chunks, embeds each chunk, and builds a vector index. A folder of 500 documents takes 1–3 minutes depending on size and embedding model.
Then ask questions. Type a question in the chat interface. The RAG node searches your index for relevant passages, injects them into the LLM prompt, and the model generates an answer grounded in your data. Source documents are cited in the response.
The search is hybrid by default — combining vector similarity (semantic meaning) with BM25 keyword matching (exact terms). This means it finds relevant passages even when the wording doesn't match your question exactly, while still surfacing results that contain specific terms you mentioned.
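The idea behind hybrid scoring is a weighted blend of the two signals. In this sketch the keyword side is a simplified term-overlap score standing in for real BM25, and the 50/50 weight is arbitrary — ToolPiper's actual weighting and BM25 implementation may differ:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms found in the chunk — a crude stand-in for BM25."""
    terms = query.lower().split()
    return sum(t in text.lower() for t in terms) / len(terms)

def hybrid_score(query: str, query_vec: list[float],
                 chunk_text: str, chunk_vec: list[float],
                 alpha: float = 0.5) -> float:
    """Blend semantic similarity with keyword overlap."""
    return alpha * cosine(query_vec, chunk_vec) + (1 - alpha) * keyword_score(query, chunk_text)
```

A chunk that paraphrases your question still scores on the vector term, while a chunk containing an exact identifier like "port 9997" gets a boost from the keyword term — which is why hybrid search beats either signal alone.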
When cloud RAG is still better
Cloud RAG services with GPT-4 or Claude as the language model will produce higher-quality synthesis on complex multi-document reasoning tasks. If you need to cross-reference a thousand legal contracts and synthesize findings, a frontier model with massive context will outperform a local 3B–8B model.
The tradeoff is cost, privacy, and latency. For 90% of document Q&A — finding specific information, summarizing sections, answering factual questions about your data — local RAG with a capable model like Qwen 3.5 4B is more than sufficient.
Try it
Download ModelPiper, install ToolPiper, and load the Local RAG Chat template. Point it at a folder, wait for indexing, and start asking questions. Your documents stay on your Mac.
This is part of a series on local-first AI workflows on macOS. See also: Private Local Chat — the same local LLM without the document retrieval layer.