---
title: "Local RAG Chat on Mac: Ask Your Documents, Keep Your Data"
description: "Index your files and ask questions about them using local AI - retrieval-augmented generation running entirely on your Mac, no documents uploaded anywhere."
date: 2026-03-18
author: "Ben Racicot"
tags: ["RAG", "Text Generation", "Embeddings", "Privacy", "macOS", "Document Q&A", "Apple Silicon"]
type: "article"
canonical: "https://modelpiper.com/blog/local-rag-chat-mac/"
---

# Local RAG Chat on Mac: Ask Your Documents, Keep Your Data

> Index your files and ask questions about them using local AI - retrieval-augmented generation running entirely on your Mac, no documents uploaded anywhere.

## TL;DR

RAG (retrieval-augmented generation) lets you ask an AI questions about your own documents - contracts, codebases, research papers, meeting notes - and get answers grounded in your actual data. ModelPiper runs the entire pipeline locally: embedding, vector search, and language model inference all happen on your Mac. No documents are uploaded anywhere.

You have a folder of contracts. Or a codebase. Or a year of meeting transcripts. You need to find something specific - a clause, a function, a decision that was made in March. You could search manually. You could hit ⌘F across fifty files. Or you could ask a question in plain English and get an answer that cites the exact passage.

That's what RAG does. Retrieval-augmented generation connects a language model to your documents so it can answer questions using your actual data instead of its training knowledge. The model retrieves relevant passages, reads them, and synthesizes an answer - with source attribution so you can verify.

The problem is that most RAG tools require uploading your documents to a cloud service. ChatGPT's file analysis, Claude's Project Knowledge, and every RAG-as-a-service platform send your data to remote servers for processing and indexing. For confidential documents - legal contracts, financial records, proprietary code, medical notes - that's a non-starter.

## What is RAG and how does it work?

RAG (retrieval-augmented generation) has three stages: ingest, index, and query. Documents are split into chunks and embedded as vectors, then a question retrieves the closest chunks and feeds them into the language model as grounded context.

**Ingest:** Your documents are split into chunks - paragraphs, sections, or fixed-length segments. Each chunk is converted into a numerical vector (an "embedding") that captures its semantic meaning. Similar concepts produce similar vectors, regardless of exact wording.

**Index:** The vectors are stored in a searchable index. When you ask a question, your question is also converted to a vector, and the index finds the document chunks whose vectors are most similar - the passages most likely to contain the answer.

**Query:** The retrieved passages are injected into the language model's prompt as context. The model reads them and generates an answer grounded in your actual data. It can cite which document and which passage it used.

The entire value of RAG is that **the language model doesn't need to have memorized your data during training.** It reads the relevant parts on the fly, every time you ask. This means it works with documents that were created yesterday, with internal data the model has never seen, and with content that changes frequently.
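The three stages can be sketched in a few lines of Python. This is a toy illustration, not ToolPiper's implementation: the hashing-trick `embed` function stands in for a real embedding model, and the "index" is a plain list instead of a proper vector index.

```python
import hashlib
import math

def embed(text: str, dims: int = 256) -> list[float]:
    """Toy embedding: hash each word into a fixed-size vector.
    A real embedding model captures meaning; this only shows the
    shape of the pipeline (text in, normalized vector out)."""
    vec = [0.0] * dims
    for word in text.lower().split():
        word = word.strip(".,?!")
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already unit-length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

# Ingest: split documents into chunks and embed each one.
chunks = [
    "The termination clause allows either party to exit with 30 days notice.",
    "Payment is due within 45 days of invoice receipt.",
    "The quarterly review meeting was moved to March.",
]

# Index: store each chunk alongside its vector.
index = [(chunk, embed(chunk)) for chunk in chunks]

# Query: embed the question, retrieve the closest chunk.
question = "When is payment due?"
q_vec = embed(question)
top = max(index, key=lambda item: cosine(q_vec, item[1]))[0]

# The retrieved chunk is injected into the LLM prompt as grounded context.
prompt = f"Answer using this context:\n{top}\n\nQuestion: {question}"
print(top)
```

The structure is the whole idea: retrieval narrows thousands of chunks down to the few worth reading, and the model only ever sees those.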

## Why does local RAG matter?

Local RAG keeps the embedding model, the vector index, and the language model on your Mac, so confidential documents never reach a cloud index. It also removes per-query API costs and document size caps.

**Your documents never leave your machine.** The embedding model, the vector index, and the language model all run on your Mac. Contracts, source code, financial statements, patient records - the data stays on your disk. There's no cloud index storing your proprietary knowledge.

**No per-query cost.** Cloud RAG services charge for embedding generation, vector storage, and LLM queries. A large document collection can cost $50-200/month in API fees alone. Local RAG costs nothing after the one-time model download.

**No document size limits.** Cloud services cap file sizes and total knowledge base volume. Locally, you're limited only by your disk space and RAM. Index a 10GB codebase or ten years of meeting notes - no artificial caps.

**Works offline.** Your indexed documents are queryable without internet. On a plane reviewing contracts, at a client site without Wi-Fi, or when your ISP is down - the index and models work the same.

**Your index stays current.** ToolPiper watches your document folder and can re-index automatically when files change. Cloud services require manual re-upload or sync configuration.

## What do you need for local RAG?

An Apple Silicon Mac (M1 or newer) with at least 16GB of RAM. RAG runs an embedding model and a language model at the same time, and 16GB gives both comfortable headroom.

**You don't need:** A vector database service. An OpenAI API key. Python. Docker. Pinecone, Weaviate, or Chroma running somewhere. A subscription to anything.

**You do need:** only the Mac itself: Apple Silicon (M1 or later) with at least 16GB of RAM.

## What can you index?

ToolPiper indexes 39 text file types and 6 image formats: PDFs, Word docs, Markdown, code in 20+ languages, CSV and JSON data, plus PNG, JPEG, WebP, GIF, and BMP images.

The practical list covers what most people actually need:

**Documents:** PDF, DOCX, RTF, Markdown, plain text, HTML, LaTeX, EPUB

**Code:** Swift, TypeScript, Python, Go, Rust, Java, C/C++, Ruby, PHP, SQL, YAML, JSON, TOML, and more

**Data:** CSV, TSV, XML, JSON

**Images:** PNG, JPEG, WebP, GIF, BMP (indexed as image chunks with embeddings via vision models)

Point it at a folder and it indexes everything it can read. Files it doesn't recognize are skipped silently.

## Which embedding models does ToolPiper use?

Three options: Apple NL Embedding (zero-setup, runs on the Neural Engine), Snowflake Arctic Embed S (higher quality, ~130MB download), or any GGUF embedding model you bring.

Embeddings are the foundation of RAG - the quality of your search results depends on the quality of your embeddings. ToolPiper offers three options:

**Apple NL Embedding** - zero-setup, runs on the Neural Engine, 512-dimensional vectors. This is the fastest option and requires no model download. It uses Apple's built-in `NLContextualEmbedding` framework, which means it works immediately on any Mac with Apple Silicon. Text only - images are skipped.

**Snowflake Arctic Embed S** - a dedicated embedding model running via llama-server on port 9997. Higher quality embeddings than Apple NL for technical and domain-specific content. Requires a one-time model download (~130MB).

**Any GGUF embedding model** - bring your own. If you have a preferred embedding model in GGUF format, ToolPiper's embedding server loads it via the standard `/v1/embeddings` endpoint.

For most users, Apple NL Embedding is the right default - zero download, zero configuration, good quality. Switch to Snowflake or a custom model if you need higher precision on specialized content.
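For the bring-your-own-model path, a client-side sketch of the standard `/v1/embeddings` call might look like this. The `localhost:9997` base URL follows the llama-server setup described above; the model name is illustrative - adjust both to match your configuration.

```python
import json
import urllib.request

def build_embedding_request(texts: list[str],
                            model: str = "snowflake-arctic-embed-s") -> bytes:
    """Build an OpenAI-style /v1/embeddings request body.
    The model name is a placeholder for whatever GGUF model
    the local server has loaded."""
    return json.dumps({"model": model, "input": texts}).encode()

def fetch_embeddings(texts: list[str],
                     base_url: str = "http://localhost:9997") -> list[list[float]]:
    """POST to the local embedding server and return one vector per text."""
    req = urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=build_embedding_request(texts),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        payload = json.load(resp)
    # OpenAI-compatible response shape: {"data": [{"embedding": [...]}, ...]}
    return [item["embedding"] for item in payload["data"]]
```

Because the endpoint speaks the OpenAI wire format, any client library that targets the OpenAI embeddings API can point at the local server instead.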

## How do you set up RAG chat in ModelPiper?

Load the Local RAG Chat template, point the RAG node at a folder, choose an embedding model, and click Index. After indexing finishes, ask questions and the model answers with citations from your documents.

Load the **Local RAG Chat** template. The pipeline has two key nodes: a RAG node (connected to your document collection) and a language model node (Qwen 3.5 by default).

First, create a collection. Click the RAG node, point it at a folder, and choose an embedding model. Hit "Index" - ToolPiper splits your documents into chunks, embeds each chunk, and builds a vector index. A folder of 500 documents takes 1-3 minutes depending on size and embedding model.

Then ask questions. Type a question in the chat interface. The RAG node searches your index for relevant passages, injects them into the LLM prompt, and the model generates an answer grounded in your data. Source documents are cited in the response.

The search is hybrid by default - combining vector similarity (semantic meaning) with BM25 keyword matching (exact terms). This means it finds relevant passages even when the wording doesn't match your question exactly, while still surfacing results that contain specific terms you mentioned.
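How the two signals combine can be sketched with a minimal BM25 implementation and a weighted blend. The blend weight `alpha` is an assumption for illustration - ToolPiper's actual fusion method isn't specified here - but the division of labor is the point: BM25 rewards exact terms, vector similarity rewards meaning.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Minimal BM25: scores exact term matches, normalized by doc length."""
    tokenized = [d.lower().split() for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    q_terms = query.lower().split()
    # Document frequency per query term (how many docs contain it).
    df = {t: sum(1 for doc in tokenized if t in doc) for t in q_terms}
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for t in q_terms:
            if df[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            freq = tf[t]
            score += idf * freq * (k1 + 1) / (
                freq + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores

def hybrid_rank(vector_scores: list[float], keyword_scores: list[float],
                alpha: float = 0.5) -> list[float]:
    """Blend semantic and keyword scores; alpha is an assumed tunable weight."""
    return [alpha * v + (1 - alpha) * k
            for v, k in zip(vector_scores, keyword_scores)]

docs = [
    "Invoices are payable within 45 days.",
    "The handleUpload function streams files to disk.",
]
kw = bm25_scores("handleUpload function", docs)
# In a real pipeline the vector scores come from embedding similarity;
# these are stand-in values.
combined = hybrid_rank([0.2, 0.4], kw)
print(combined)  # the doc containing the exact term "handleUpload" ranks first
```

For a query like `handleUpload`, pure vector search can miss the exact identifier while BM25 nails it; for a paraphrased question, the roles reverse. The blend covers both cases.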

## When is cloud RAG still better?

Cloud RAG with a frontier LLM still wins on complex multi-document reasoning. If you need to cross-reference a thousand contracts and synthesize findings, GPT-4 or Claude Opus with massive context outperforms a 3B-8B local model.


The tradeoff is cost, privacy, and latency. For 90% of document Q&A - finding specific information, summarizing sections, answering factual questions about your data - local RAG with a capable model like Qwen 3.5 4B is more than sufficient.

## Try It

Download [ModelPiper](https://modelpiper.com), install ToolPiper, and load the Local RAG Chat template. Point it at a folder, wait for indexing, and start asking questions. Your documents stay on your Mac.

_This is part of a series on [local-first AI workflows on macOS](/blog/local-first-ai-macos). See also: [Private Local Chat](/blog/private-local-chat-mac) - the same local LLM without the document retrieval layer._

## Steps

### 1. Install ToolPiper and load the RAG Chat template

Download ModelPiper from [modelpiper.com](https://modelpiper.com) and install ToolPiper. Open ModelPiper and load the **Local RAG Chat** template from the template picker.

### 2. Create a document collection

Click the RAG node in the pipeline. Select a folder containing your documents - contracts, code, notes, whatever you want to search. Choose an embedding model (Apple NL Embedding for zero-config, or Snowflake for higher quality). Click **Index**.

### 3. Wait for indexing

ToolPiper splits your documents into chunks, generates embeddings for each chunk, and builds the vector index. Progress is shown in real time. A folder of 500 documents typically takes 1-3 minutes.

### 4. Ask questions

Type a question in the chat. The RAG system retrieves relevant passages from your documents, injects them into the language model's context, and generates an answer citing the source documents.

## FAQ

### How many documents can I index?

There's no artificial limit. You're bounded by disk space (for the vector index) and RAM (for the embedding model). In practice, collections of thousands of documents work well. The HNSW index provides O(log n) search - query time barely changes as the collection grows.

### Does RAG work with code repositories?

Yes. ToolPiper indexes Swift, TypeScript, Python, Go, Rust, Java, C/C++, Ruby, and more. The semantic chunking respects code boundaries - it won't split a function across two chunks. You can ask questions like "where is the authentication logic?" or "what does the handleUpload function do?" and get answers citing the actual source files.
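To illustrate what boundary-aware chunking means, here is a minimal Python-only sketch using the standard `ast` module. It is not ToolPiper's chunker (which covers many languages); it just demonstrates splitting at top-level definitions so no function is cut in half.

```python
import ast

def chunk_python_source(source: str) -> list[str]:
    """Split Python source at top-level statements and definitions,
    so each function or class lands in exactly one chunk."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        # lineno/end_lineno are 1-based, inclusive (Python 3.8+).
        chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return chunks

code = '''\
import os

def handle_upload(path):
    """Stream a file to disk."""
    return os.path.getsize(path)

def authenticate(user, token):
    return token == "secret"
'''

for chunk in chunk_python_source(code):
    print(chunk)
    print("---")
```

A fixed-length splitter might cut `handle_upload` mid-body, leaving a chunk whose embedding describes half a function. Definition-boundary chunks keep each retrievable unit semantically whole.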

### What happens when my documents change?

ToolPiper can re-index automatically when it detects changes (configurable per collection). You can also trigger re-indexing manually. Only changed files are re-processed - the embedding cache (keyed by content hash) skips unchanged chunks.
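A content-hash cache like the one described can be sketched in a few lines. The `embed_fn` here is a stand-in for a real embedding call; the point is that identical chunk content hashes to the same key and is embedded only once, so re-indexing only pays for what changed.

```python
import hashlib

class EmbeddingCache:
    """Skip re-embedding unchanged chunks by keying on a content hash."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.store: dict[str, list[float]] = {}
        self.misses = 0  # how many times the model was actually called

    def get(self, chunk: str) -> list[float]:
        key = hashlib.sha256(chunk.encode()).hexdigest()
        if key not in self.store:
            # New or changed content: run the (expensive) embedding call.
            self.misses += 1
            self.store[key] = self.embed_fn(chunk)
        return self.store[key]

cache = EmbeddingCache(lambda text: [float(len(text))])  # dummy "model"
cache.get("clause A")
cache.get("clause A")  # identical content: served from cache, no model call
cache.get("clause B")
print(cache.misses)  # → 2
```

Keying on the content hash rather than the filename means a renamed-but-unchanged file costs nothing to re-index, while a single edited paragraph triggers re-embedding of only the chunks it touches.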

### Can I use RAG with a cloud LLM instead of a local model?

Yes. The RAG retrieval (embedding + vector search) always runs locally. You can connect the retrieved context to any language model - local or cloud. The documents are still searched locally; only the synthesized query (with retrieved passages) goes to the cloud model if you choose that path.

### What's the difference between RAG and just pasting documents into ChatGPT?

Context window limits. ChatGPT can read ~128K tokens at once - roughly 200 pages. RAG indexes unlimited documents and retrieves only the relevant passages for each question. You can index 10,000 documents and the model only reads the 5-10 most relevant chunks per query, staying well within context limits while having access to everything.
