Your codebase is the most sensitive intellectual property your company owns. When a developer pastes a function into ChatGPT to ask "what does this do?", that code is now on someone else's server. When you use GitHub Copilot, your code context flows through Microsoft's infrastructure. For open source, that's fine. For proprietary code, it's a risk many companies won't accept.
What if you could ask questions about your entire codebase with semantic search, not just grep, without any code leaving your machine?
That's what local RAG for code gives you. You index your repository into vector embeddings on your own hardware. When you ask a question, the system finds the most relevant code chunks and hands them to a language model as context. The model reads the relevant snippets at query time and answers grounded in your actual source code. The entire pipeline runs on your Mac.
What does RAG for code actually mean?
RAG (retrieval-augmented generation) connects a language model to an external data source so it can answer questions using data it was never trained on. For code, the data source is your repository.
The pipeline has three stages. First, your source files are split into chunks and each chunk is converted into a numerical vector (an "embedding") that captures its semantic meaning. Second, those vectors are stored in a searchable index. Third, when you ask a question, your question is also embedded, the index finds the most similar code chunks, and those chunks are injected into the language model's prompt as context.
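The three stages can be sketched in a few lines. This is a toy illustration, not ToolPiper's implementation: the `embed` function here is a bag-of-words stand-in, where a real system uses a neural embedding model that captures semantics rather than shared tokens.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: a bag-of-words
    # vector over lowercase tokens. Real embeddings come from a
    # neural model and capture meaning, not just shared words.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    # Similarity between two vectors; 1.0 means identical direction.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Stage 1 + 2: split source into chunks, embed each, store in an index.
chunks = [
    "def connect_db(): return pool.acquire()",
    "def render_header(title): return '<h1>' + title + '</h1>'",
]
index = [(c, embed(c)) for c in chunks]

# Stage 3: embed the question and retrieve the most similar chunk,
# which would then be injected into the model's prompt as context.
query = embed("how do we connect to the db")
best = max(index, key=lambda pair: cosine(query, pair[1]))
```

With a real embedding model, the retrieval step would also match chunks that share no tokens with the question at all.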
The model doesn't need to have seen your code during training. It reads the relevant snippets on the fly, every time you ask. This means it works with code you wrote yesterday, with internal libraries no public model has ever seen, and with repositories that change every hour.
Why is semantic search better than grep for code?
grep and ripgrep find exact string matches. They're fast, and they're invaluable. But they only find what you already know how to name.
RAG finds conceptually related code. Ask "where do we handle authentication?" and it finds the auth middleware, the JWT validation helper, the session manager, and the login route handler, even if none of them contain the word "authentication." Ask "how do we connect to the database?" and it surfaces the connection pool setup, the retry logic, and the migration runner.
Semantic search works because embeddings capture meaning, not just characters. Two pieces of code that do similar things produce similar vectors, regardless of variable names, comments, or coding style. This is especially useful in large codebases where you didn't write every module and you don't know the naming conventions used three years ago by someone who left the team.
Why is code RAG harder than document RAG?
Code has properties that make naive RAG less effective than it is for prose documents.
Structure matters. A function signature without its body is useless context. A class definition without its methods tells you nothing. Chunking strategies that work for paragraphs of text can split a function right down the middle, giving the model half an implementation.
Cross-file references are everywhere. An import statement in one file references an export in another. A function call in your controller triggers logic defined three directories away. RAG retrieves isolated chunks, not dependency graphs. It can find the relevant pieces, but it won't automatically trace a call chain through five files.
Context dependencies are implicit. A variable named config could mean anything. The same function name can exist in multiple modules with completely different behavior. Without the surrounding context (the file path, the module, the imports), a code chunk can be ambiguous.
These challenges don't make code RAG useless. They mean that the chunking strategy, the embedding quality, and the retrieval parameters all matter more than they do for searching a folder of meeting notes.
How does ToolPiper handle code repositories?
ToolPiper's RAG system is built to handle the specific challenges of code search.
39 text file types indexed. TypeScript, JavaScript, Python, Swift, Go, Rust, Java, Ruby, C, C++, header files, Markdown, JSON, YAML, TOML, SQL, and more. Point it at a project directory and it indexes everything it can read. Binary files, images, and unknown types are skipped silently.
Semantic chunking at 500 characters with 50-character overlap. The overlap means that if a function body falls near a chunk boundary, the end of one chunk and the beginning of the next share content. This isn't perfect, but it keeps most small-to-medium functions intact within a single chunk, and the overlap bridges the gap when they split.
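The sliding-window idea is simple enough to show directly. This sketch uses the same numbers as above (500-character chunks, 50-character overlap), though the exact boundary handling is an assumption, not ToolPiper's code:

```python
def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Slide a window of `size` characters, stepping size - overlap
    # each time, so each adjacent pair of chunks shares `overlap`
    # characters at the boundary.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

The overlap is the boundary insurance: if a function starts 480 characters into a chunk, its opening lines reappear at the start of the next chunk instead of being orphaned.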
Hybrid search: vector + keyword. Every query runs through two search paths simultaneously. HNSW vector search finds semantically similar code (conceptual matches). BM25 keyword search finds exact term matches (function names, variable names, specific strings). The results are merged so you get the best of both approaches. Ask "payment processing" and you find both the processPayment() function (keyword match) and the Stripe webhook handler (semantic match).
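The article doesn't specify how ToolPiper merges the two result lists; a common technique for this is reciprocal rank fusion (RRF), sketched here under that assumption:

```python
def rrf_merge(vector_ranked: list[str],
              keyword_ranked: list[str],
              k: int = 60) -> list[str]:
    # Reciprocal rank fusion: each list contributes 1/(k + rank)
    # per document. Documents ranked highly by BOTH searches sum
    # to the largest scores; k dampens the top ranks so neither
    # list can dominate on its own.
    scores: dict[str, float] = {}
    for ranked in (vector_ranked, keyword_ranked):
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A chunk that only one search path finds still makes the merged list; a chunk both paths find floats to the top.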
Three embedding options:
- Apple NL Embedding (apple-nlembedding) - zero setup, 512-dimensional vectors, runs on-device via Apple's NLContextualEmbedding. No model download needed. Works immediately on any Mac with Apple Silicon.
- Open-source embedding models via llama.cpp - more options for specialized domains. Download a GGUF embedding model and point ToolPiper at it.
- Dedicated embedding server on port 9997 - runs the embedding model on a separate process so you can embed and chat simultaneously without competing for GPU time.
Embedding cache. Every chunk embedding is cached globally with a SHA-256 content-addressed key. If you re-index a repo and 90% of files haven't changed, 90% of embeddings are served from cache instantly. Only modified files trigger new embedding computation.
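Content-addressed caching is the key design choice here: because the cache key is a hash of the chunk's text, not its file path, moving or renaming a file still hits the cache. A minimal sketch of the idea (the class and its interface are illustrative, not ToolPiper's API):

```python
import hashlib

class EmbeddingCache:
    def __init__(self, embed_fn):
        self._embed = embed_fn   # the (expensive) embedding model call
        self._cache = {}
        self.misses = 0

    def get(self, chunk: str):
        # Key on the SHA-256 of the content itself, so identical
        # chunks are embedded exactly once, no matter which file
        # or re-index pass they come from.
        key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._embed(chunk)
        return self._cache[key]
```

On a re-index where 90% of files are unchanged, 90% of `get` calls return without ever touching the model.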
How do you set up code search in ModelPiper?
The workflow takes about two minutes for a medium-sized repository.
1. Create a RAG collection. In ModelPiper, open the RAG settings and create a new collection. Give it a name that matches the project.
2. Add your project directory. Point the collection at your repository root, or at specific subdirectories if you only want to search the source code (skip node_modules, build artifacts, etc.).
3. Choose an embedding model and index. Apple NL Embedding is the zero-config default. Hit Index and ToolPiper splits every recognized file into chunks, embeds them, and builds the HNSW + BM25 indices. A repository with 1,000 source files typically indexes in 1-3 minutes.
4. Ask questions in chat. Select the collection in the chat interface and start asking: "How does the auth middleware work?", "Find all API endpoints", "What does processPayment do?", "Where is the database connection pool configured?"
The RAG system retrieves the top matching code chunks, injects them into the LLM's context window, and the model generates an answer grounded in your actual source code. It cites which files the answer came from.
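The citation behavior falls out of how the prompt is assembled: label each retrieved chunk with its file path, and the model can name its sources. ToolPiper's actual prompt template isn't shown here; this sketch just illustrates the shape:

```python
def build_prompt(question: str, hits: list[tuple[str, str]]) -> str:
    # hits: (file_path, chunk) pairs from retrieval, best-first.
    # Prefixing each chunk with its path is what lets the model
    # cite which files the answer came from.
    context = "\n\n".join(f"[{path}]\n{chunk}" for path, chunk in hits)
    return (
        "Answer using only the code excerpts below, "
        "and cite file paths in your answer.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```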
What happens when your code changes?
Codebases change constantly. ToolPiper handles this with non-blocking auto-reindex: when files change, the system reindexes in the background while still serving queries against the existing index. You never wait for reindexing to finish before asking a question. The index updates in place once the new embeddings are ready.
The embedding cache makes re-indexing fast. Only files with actual content changes need new embeddings. A typical incremental reindex after a day of development takes seconds, not minutes.
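The non-blocking pattern described above amounts to building the new index off to the side and swapping a reference when it's ready. A minimal sketch of that architecture, assuming a dict-backed index for brevity:

```python
import threading

class HotSwapIndex:
    # Queries always read whichever index is current; a reindex
    # builds a replacement in the background and swaps it in
    # atomically, so searches never block on indexing.
    def __init__(self, index: dict):
        self._index = index
        self._lock = threading.Lock()

    def search(self, query: str):
        with self._lock:
            index = self._index        # stable snapshot for this query
        return index.get(query, [])    # served from the old index mid-reindex

    def reindex_async(self, build_fn) -> threading.Thread:
        def worker():
            new_index = build_fn()     # slow part: chunk + embed + index
            with self._lock:
                self._index = new_index  # instant swap once ready
        t = threading.Thread(target=worker)
        t.start()
        return t
```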
ToolPiper also enforces a 16K token context guard so the model isn't overwhelmed with too much retrieved code, and maintains an LRU cache of up to 8 loaded collections for fast switching between projects.
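A context guard like this is typically a greedy budget check over the ranked results. The ~4 characters-per-token heuristic below is an assumption for illustration, not ToolPiper's tokenizer:

```python
def fit_context(chunks: list[str], budget_tokens: int = 16_000) -> list[str]:
    # chunks arrive best-first from retrieval. Keep the top-ranked
    # ones until the token budget is spent, then stop, so the model
    # gets the most relevant code rather than all of it.
    kept, used = [], 0
    for chunk in chunks:
        cost = max(len(chunk) // 4, 1)   # rough chars-per-token estimate
        if used + cost > budget_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept
```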
What are the honest limitations?
Local code RAG isn't a replacement for reading the code yourself. Here's where it falls short.
Chunking can split long functions. Semantic chunking at 500 characters keeps most functions intact, but a 200-line method will span multiple chunks. The 50-character overlap helps, but the model may receive the beginning and end of a long function without the middle.
Large repos take time to index initially. A repository with 100,000+ files will take meaningful time on the first index pass. Subsequent re-indexes are fast because of the embedding cache, but that first run requires patience.
General-purpose embeddings, not code-specific. Specialized code embedding models like CodeBERT aren't available in GGUF format yet. General-purpose embeddings work well for most searches, but they may miss very specific API patterns or domain-specific naming conventions that a code-trained model would catch.
No dependency graph traversal. RAG retrieves relevant chunks, not call chains. If you ask "what happens when a user clicks Submit?", it can find the click handler and the API call and the backend route, but it finds them as separate chunks. It won't automatically trace the execution path through five files in order. You're getting the pieces, and the LLM synthesizes them, but complex multi-hop logic can be missed.
When should you use cloud code search instead?
If your code is open source and you want the absolute best model quality for complex architectural questions, GitHub Copilot or Cursor with GPT-4 or Claude will produce better synthesis. The frontier models are better at reasoning across many code chunks simultaneously.
The tradeoff is always the same: privacy, cost, and offline access versus raw model quality. For finding specific code, understanding modules, answering "where is X?" and "how does Y work?" questions, local RAG with a capable model like Qwen 3.5 is more than sufficient. And your code never leaves your disk.
Try It
Download ModelPiper, install ToolPiper, and create a RAG collection pointed at your project directory. Wait for indexing, then start asking your codebase questions. Your source code stays on your Mac.
This is part of a series on local-first AI workflows on macOS. See also: Local RAG Chat for document search, and Private Local Chat for the base local LLM experience.