You have a folder of contracts. Or a codebase. Or a year of meeting transcripts. You need to find something specific — a clause, a function, a decision that was made in March. You could search manually. You could use Cmd+F across fifty files. Or you could ask a question in plain English and get an answer that cites the exact passage.
That's what RAG does. Retrieval-augmented generation connects a language model to your documents so it can answer questions using your actual data instead of its training knowledge. The model retrieves relevant passages, reads them, and synthesizes an answer — with source attribution so you can verify.
The problem is that most RAG tools require uploading your documents to a cloud service. ChatGPT's file analysis, Claude's Project Knowledge, and every RAG-as-a-service platform send your data to remote servers for processing and indexing. For confidential documents — legal contracts, financial records, proprietary code, medical notes — that's a non-starter.
What is RAG and how does it work?
RAG has three stages: ingest, index, and query.
Ingest: Your documents are split into chunks — paragraphs, sections, or fixed-length segments. Each chunk is converted into a numerical vector (an "embedding") that captures its semantic meaning. Similar concepts produce similar vectors, regardless of exact wording.
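The splitting step can be sketched in a few lines. This is an illustrative fixed-length chunker with overlap, not ToolPiper's actual implementation; the size and overlap values are arbitrary placeholders:

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-length chunks.

    Overlap keeps a sentence that straddles a chunk boundary
    retrievable from at least one chunk.
    """
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
        if start + size >= len(text):
            break
    return chunks
```

Each chunk would then be passed to the embedding model to produce its vector. Real ingest pipelines usually prefer paragraph or section boundaries over raw character offsets, which is why chunking strategy is a tuning knob rather than a constant.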
Index: The vectors are stored in a searchable index. When you ask a question, your question is also converted to a vector, and the index finds the document chunks whose vectors are most similar — the passages most likely to contain the answer.
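At query time, "most similar" usually means highest cosine similarity between the question vector and each chunk vector. A minimal brute-force version (real indexes use approximate nearest-neighbor structures to avoid scanning everything) might look like:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """index is a list of (chunk_id, vector). Returns the k best chunk ids."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]
```

With toy 2-dimensional vectors, `top_k([1, 0], [("a", [1, 0]), ("b", [0, 1]), ("c", [0.9, 0.1])])` returns the chunks pointing in roughly the query's direction first. Production embeddings are hundreds of dimensions (512 for Apple NL Embedding), but the ranking logic is the same.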
Query: The retrieved passages are injected into the language model's prompt as context. The model reads them and generates an answer grounded in your actual data. It can cite which document and which passage it used.
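The "injected into the prompt" step is just string assembly. A hedged sketch of what a grounded prompt with citation markers could look like (the exact template ToolPiper uses is not documented here):

```python
def build_prompt(question: str, passages: list[tuple[str, str]]) -> str:
    """passages is a list of (source_name, passage_text).

    Numbered markers let the model cite which passage it used.
    """
    context = "\n\n".join(
        f"[{i + 1}] ({src}) {text}" for i, (src, text) in enumerate(passages)
    )
    return (
        "Answer using only the context below. Cite passages by number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The language model sees only the retrieved excerpts plus the question, which is why answers stay grounded in your documents rather than the model's training data.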
The entire value of RAG is that the language model doesn't need to have memorized your data during training. It reads the relevant parts on the fly, every time you ask. This means it works with documents that were created yesterday, with internal data the model has never seen, and with content that changes frequently.
Why local RAG matters
Your documents never leave your machine. The embedding model, the vector index, and the language model all run on your Mac. Contracts, source code, financial statements, patient records — the data stays on your disk. There's no cloud index storing your proprietary knowledge.
No per-query cost. Cloud RAG services charge for embedding generation, vector storage, and LLM queries. A large document collection can cost $50–200/month in API fees alone. Local RAG costs nothing after the one-time model download.
No document size limits. Cloud services cap file sizes and total knowledge base volume. Locally, you're limited only by your disk space and RAM. Index a 10GB codebase or ten years of meeting notes — no artificial caps.
Works offline. Your indexed documents are queryable without internet. On a plane reviewing contracts, at a client site without Wi-Fi, or when your ISP is down — the index and models work the same.
Your index stays current. ToolPiper watches your document folder and can re-index automatically when files change. Cloud services require manual re-upload or sync configuration.
What you need
You don't need: A vector database service. An OpenAI API key. Python. Docker. Pinecone, Weaviate, or Chroma running somewhere. A subscription to anything.
You do need: A Mac with Apple Silicon (M1 or later) and at least 16GB of RAM. RAG runs an embedding model and a language model simultaneously — 16GB gives comfortable headroom for both.
What can you index?
ToolPiper ingests 39 text file types and 6 image types. The practical list covers what most people actually need:
Documents: PDF, DOCX, RTF, Markdown, plain text, HTML, LaTeX, EPUB
Code: Swift, TypeScript, Python, Go, Rust, Java, C/C++, Ruby, PHP, SQL, YAML, JSON, TOML, and more
Data: CSV, TSV, XML, JSON
Images: PNG, JPEG, WebP, GIF, BMP (indexed as image chunks with embeddings via vision models)
Point it at a folder and it indexes everything it can read. Files it doesn't recognize are skipped silently.
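Conceptually, that folder scan is an extension filter over a recursive walk. This sketch uses a small illustrative subset of extensions — not ToolPiper's real 45-type list:

```python
from pathlib import Path

# Illustrative subset only; ToolPiper recognizes 39 text and 6 image types.
SUPPORTED = {".pdf", ".md", ".txt", ".py", ".swift", ".json"}

def indexable_files(root: str) -> list[Path]:
    """Walk a folder tree, keep recognized files, skip everything else silently."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )
```

Unrecognized files simply never enter the list, which matches the "skipped silently" behavior described above.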
The embedding models
Embeddings are the foundation of RAG — the quality of your search results depends on the quality of your embeddings. ToolPiper offers three options:
Apple NL Embedding — zero-setup, runs on the Neural Engine, 512-dimensional vectors. This is the fastest option and requires no model download. It uses Apple's built-in NLContextualEmbedding framework, which means it works immediately on any Mac with Apple Silicon. Text only — images are skipped.
Snowflake Arctic Embed S — a dedicated embedding model running via llama-server on port 9997. Higher quality embeddings than Apple NL for technical and domain-specific content. Requires a one-time model download (~130MB).
Any GGUF embedding model — bring your own. If you have a preferred embedding model in GGUF format, ToolPiper's embedding server loads it via the standard /v1/embeddings endpoint.
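The /v1/embeddings endpoint follows the common OpenAI-compatible shape, so any HTTP client can talk to it. This sketch only builds the request (it doesn't send it, since that requires a running server); the port comes from the article, while the model identifier is a placeholder — use whatever name your llama-server instance reports:

```python
import json
import urllib.request

def embeddings_request(
    texts: list[str],
    base_url: str = "http://127.0.0.1:9997",
    model: str = "snowflake-arctic-embed-s",  # placeholder model name
) -> urllib.request.Request:
    """Build an OpenAI-style /v1/embeddings POST request."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it with `urllib.request.urlopen(req)` against a running server returns a JSON body whose `data` array contains one embedding vector per input text.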
For most users, Apple NL Embedding is the right default — zero download, zero configuration, good quality. Switch to Snowflake or a custom model if you need higher precision on specialized content.
The ModelPiper workflow
Load the Local RAG Chat template. The pipeline has two key nodes: a RAG node (connected to your document collection) and a language model node (Qwen 3.5 by default).
First, create a collection. Click the RAG node, point it at a folder, and choose an embedding model. Hit "Index" — ToolPiper splits your documents into chunks, embeds each chunk, and builds a vector index. A folder of 500 documents takes 1–3 minutes depending on size and embedding model.
Then ask questions. Type a question in the chat interface. The RAG node searches your index for relevant passages, injects them into the LLM prompt, and the model generates an answer grounded in your data. Source documents are cited in the response.
The search is hybrid by default — combining vector similarity (semantic meaning) with BM25 keyword matching (exact terms). This means it finds relevant passages even when the wording doesn't match your question exactly, while still surfacing results that contain specific terms you mentioned.
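The idea behind hybrid scoring is a weighted blend of the two signals. In this sketch the keyword side is a simplified term-overlap score standing in for real BM25, and the 50/50 weight is arbitrary — ToolPiper's actual weighting and BM25 implementation may differ:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms found in the chunk — a crude stand-in for BM25."""
    terms = query.lower().split()
    return sum(t in text.lower() for t in terms) / len(terms)

def hybrid_score(query: str, query_vec: list[float],
                 chunk_text: str, chunk_vec: list[float],
                 alpha: float = 0.5) -> float:
    """Blend semantic similarity with keyword overlap."""
    return alpha * cosine(query_vec, chunk_vec) + (1 - alpha) * keyword_score(query, chunk_text)
```

A chunk that paraphrases your question still scores on the vector term, while a chunk containing an exact identifier like "port 9997" gets a boost from the keyword term — which is why hybrid search beats either signal alone.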
When cloud RAG is still better
Cloud RAG services with GPT-4 or Claude as the language model will produce higher-quality synthesis on complex multi-document reasoning tasks. If you need to cross-reference a thousand legal contracts and synthesize findings, a frontier model with massive context will outperform a local 3B–8B model.
The tradeoff is cost, privacy, and latency. For 90% of document Q&A — finding specific information, summarizing sections, answering factual questions about your data — local RAG with a capable model like Qwen 3.5 4B is more than sufficient.
Try it
Download ModelPiper, install ToolPiper, and load the Local RAG Chat template. Point it at a folder, wait for indexing, and start asking questions. Your documents stay on your Mac.
This is part of a series on local-first AI workflows on macOS. See also: Private Local Chat — the same local LLM without the document retrieval layer.