Your codebase is the most sensitive intellectual property your company owns. When a developer pastes a function into ChatGPT to ask "what does this do?", that code is now on someone else's server. When you use GitHub Copilot, your code context flows through Microsoft's infrastructure. For open source, that's fine. For proprietary code, it's a risk many companies won't accept.
What if you could ask questions about your entire codebase with semantic search, not just grep, without any code leaving your machine?
That's what local RAG for code gives you. You index your repository into vector embeddings on your own hardware. When you ask a question, the system finds the most relevant code chunks and hands them to a language model as context. The model reads the relevant snippets at query time and answers grounded in your actual source code. The entire pipeline runs on your Mac.
What does RAG for code actually mean?
RAG (retrieval-augmented generation) connects a language model to an external data source so it can answer questions using data it was never trained on. For code, the data source is your repository.
The pipeline has three stages. First, your source files are split into chunks and each chunk is converted into a numerical vector (an "embedding") that captures its semantic meaning. Second, those vectors are stored in a searchable index. Third, when you ask a question, your question is also embedded, the index finds the most similar code chunks, and those chunks are injected into the language model's prompt as context.
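The three stages can be sketched in a few lines. This is a toy illustration, not ToolPiper's implementation: the `embed` function here is a bag-of-words stand-in, where a real system uses a neural embedding model that captures semantics rather than shared tokens.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: a bag-of-words
    # vector over lowercase tokens. Real embeddings come from a
    # neural model and capture meaning, not just shared words.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    # Similarity between two vectors; 1.0 means identical direction.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Stage 1 + 2: split source into chunks, embed each, store in an index.
chunks = [
    "def connect_db(): return pool.acquire()",
    "def render_header(title): return '<h1>' + title + '</h1>'",
]
index = [(c, embed(c)) for c in chunks]

# Stage 3: embed the question and retrieve the most similar chunk,
# which would then be injected into the model's prompt as context.
query = embed("how do we connect to the db")
best = max(index, key=lambda pair: cosine(query, pair[1]))
```

With a real embedding model, the retrieval step would also match chunks that share no tokens with the question at all.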
The model doesn't need to have seen your code during training. It reads the relevant snippets on the fly, every time you ask. This means it works with code you wrote yesterday, with internal libraries no public model has ever seen, and with repositories that change every hour.
Why is semantic search better than grep for code?
grep and ripgrep find exact string matches. They're fast, and they're invaluable. But they only find what you already know how to name.
RAG finds conceptually related code. Ask "where do we handle authentication?" and it finds the auth middleware, the JWT validation helper, the session manager, and the login route handler, even if none of them contain the word "authentication." Ask "how do we connect to the database?" and it surfaces the connection pool setup, the retry logic, and the migration runner.
Semantic search works because embeddings capture meaning, not just characters. Two pieces of code that do similar things produce similar vectors, regardless of variable names, comments, or coding style. This is especially useful in large codebases where you didn't write every module and you don't know the naming conventions used three years ago by someone who left the team.
Why is code RAG harder than document RAG?
Code has properties that make naive RAG less effective than it is for prose documents.
Structure matters. A function signature without its body is useless context. A class definition without its methods tells you nothing. Chunking strategies that work for paragraphs of text can split a function right down the middle, giving the model half an implementation.
Cross-file references are everywhere. An import statement in one file references an export in another. A function call in your controller triggers logic defined three directories away. RAG retrieves isolated chunks, not dependency graphs. It can find the relevant pieces, but it won't automatically trace a call chain through five files.
Context dependencies are implicit. A variable named config could mean anything. The same function name can exist in multiple modules with completely different behavior. Without the surrounding context (the file path, the module, the imports), a code chunk can be ambiguous.
These challenges don't make code RAG useless. They mean that the chunking strategy, the embedding quality, and the retrieval parameters all matter more than they do for searching a folder of meeting notes.
How does ToolPiper handle code repositories?
ToolPiper's RAG system is built to handle the specific challenges of code search.
39 text file types indexed. TypeScript, JavaScript, Python, Swift, Go, Rust, Java, Ruby, C, C++, header files, Markdown, JSON, YAML, TOML, SQL, and more. Point it at a project directory and it indexes everything it can read. Binary files, images, and unknown types are skipped silently.
Semantic chunking at 500 characters with 50-character overlap. The overlap means that if a function body falls near a chunk boundary, the end of one chunk and the beginning of the next share content. This isn't perfect, but it keeps most small-to-medium functions intact within a single chunk, and the overlap bridges the gap when they split.
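The sliding-window idea is simple enough to show directly. This sketch uses the same numbers as above (500-character chunks, 50-character overlap), though the exact boundary handling is an assumption, not ToolPiper's code:

```python
def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Slide a window of `size` characters, stepping size - overlap
    # each time, so each adjacent pair of chunks shares `overlap`
    # characters at the boundary.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

The overlap is the boundary insurance: if a function starts 480 characters into a chunk, its opening lines reappear at the start of the next chunk instead of being orphaned.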
Hybrid search: vector + keyword. Every query runs through two search paths simultaneously. HNSW vector search finds semantically similar code (conceptual matches). BM25 keyword search finds exact term matches (function names, variable names, specific strings). The results are merged so you get the best of both approaches. Ask "payment processing" and you find both the processPayment() function (keyword match) and the Stripe webhook handler (semantic match).
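The article doesn't specify how ToolPiper merges the two result lists; a common technique for this is reciprocal rank fusion (RRF), sketched here under that assumption:

```python
def rrf_merge(vector_ranked: list[str],
              keyword_ranked: list[str],
              k: int = 60) -> list[str]:
    # Reciprocal rank fusion: each list contributes 1/(k + rank)
    # per document. Documents ranked highly by BOTH searches sum
    # to the largest scores; k dampens the top ranks so neither
    # list can dominate on its own.
    scores: dict[str, float] = {}
    for ranked in (vector_ranked, keyword_ranked):
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A chunk that only one search path finds still makes the merged list; a chunk both paths find floats to the top.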
Three embedding options:
- Apple NL Embedding (apple-nlembedding) - zero setup, 512-dimensional vectors, runs on-device via Apple's NLContextualEmbedding. No model download needed. Works immediately on any Mac with Apple Silicon.
- Open-source embedding models via llama.cpp - more options for specialized domains. Download a GGUF embedding model and point ToolPiper at it.
- Dedicated embedding server on port 9997 - runs the embedding model on a separate process so you can embed and chat simultaneously without competing for GPU time.
Embedding cache. Every chunk embedding is cached globally with a SHA-256 content-addressed key. If you re-index a repo and 90% of files haven't changed, 90% of embeddings are served from cache instantly. Only modified files trigger new embedding computation.
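Content-addressed caching is the key design choice here: because the cache key is a hash of the chunk's text, not its file path, moving or renaming a file still hits the cache. A minimal sketch of the idea (the class and its interface are illustrative, not ToolPiper's API):

```python
import hashlib

class EmbeddingCache:
    def __init__(self, embed_fn):
        self._embed = embed_fn   # the (expensive) embedding model call
        self._cache = {}
        self.misses = 0

    def get(self, chunk: str):
        # Key on the SHA-256 of the content itself, so identical
        # chunks are embedded exactly once, no matter which file
        # or re-index pass they come from.
        key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._embed(chunk)
        return self._cache[key]
```

On a re-index where 90% of files are unchanged, 90% of `get` calls return without ever touching the model.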
How do you set up code search in ModelPiper?
The workflow takes about two minutes for a medium-sized repository.
1. Create a RAG collection. In ModelPiper, open the RAG settings and create a new collection. Give it a name that matches the project.
2. Add your project directory. Point the collection at your repository root, or at specific subdirectories if you only want to search the source code (skip node_modules, build artifacts, etc.).
3. Choose an embedding model and index. Apple NL Embedding is the zero-config default. Hit Index and ToolPiper splits every recognized file into chunks, embeds them, and builds the HNSW + BM25 indices. A repository with 1,000 source files typically indexes in 1-3 minutes.
4. Ask questions in chat. Select the collection in the chat interface and start asking: "How does the auth middleware work?", "Find all API endpoints", "What does processPayment do?", "Where is the database connection pool configured?"
The RAG system retrieves the top matching code chunks, injects them into the LLM's context window, and the model generates an answer grounded in your actual source code. It cites which files the answer came from.
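The citation behavior falls out of how the prompt is assembled: label each retrieved chunk with its file path, and the model can name its sources. ToolPiper's actual prompt template isn't shown here; this sketch just illustrates the shape:

```python
def build_prompt(question: str, hits: list[tuple[str, str]]) -> str:
    # hits: (file_path, chunk) pairs from retrieval, best-first.
    # Prefixing each chunk with its path is what lets the model
    # cite which files the answer came from.
    context = "\n\n".join(f"[{path}]\n{chunk}" for path, chunk in hits)
    return (
        "Answer using only the code excerpts below, "
        "and cite file paths in your answer.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```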
What happens when your code changes?
Codebases change constantly. ToolPiper handles this with non-blocking auto-reindex: when files change, the system reindexes in the background while still serving queries against the existing index. You never wait for reindexing to finish before asking a question. The index updates in place once the new embeddings are ready.
The embedding cache makes re-indexing fast. Only files with actual content changes need new embeddings. A typical incremental reindex after a day of development takes seconds, not minutes.
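The non-blocking pattern described above amounts to building the new index off to the side and swapping a reference when it's ready. A minimal sketch of that architecture, assuming a dict-backed index for brevity:

```python
import threading

class HotSwapIndex:
    # Queries always read whichever index is current; a reindex
    # builds a replacement in the background and swaps it in
    # atomically, so searches never block on indexing.
    def __init__(self, index: dict):
        self._index = index
        self._lock = threading.Lock()

    def search(self, query: str):
        with self._lock:
            index = self._index        # stable snapshot for this query
        return index.get(query, [])    # served from the old index mid-reindex

    def reindex_async(self, build_fn) -> threading.Thread:
        def worker():
            new_index = build_fn()     # slow part: chunk + embed + index
            with self._lock:
                self._index = new_index  # instant swap once ready
        t = threading.Thread(target=worker)
        t.start()
        return t
```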
ToolPiper also enforces a 16K token context guard so the model isn't overwhelmed with too much retrieved code, and maintains an LRU cache of up to 8 loaded collections for fast switching between projects.
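A context guard like this is typically a greedy budget check over the ranked results. The ~4 characters-per-token heuristic below is an assumption for illustration, not ToolPiper's tokenizer:

```python
def fit_context(chunks: list[str], budget_tokens: int = 16_000) -> list[str]:
    # chunks arrive best-first from retrieval. Keep the top-ranked
    # ones until the token budget is spent, then stop, so the model
    # gets the most relevant code rather than all of it.
    kept, used = [], 0
    for chunk in chunks:
        cost = max(len(chunk) // 4, 1)   # rough chars-per-token estimate
        if used + cost > budget_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept
```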
What are the honest limitations?
Local code RAG isn't a replacement for reading the code yourself. Here's where it falls short.
Chunking can split long functions. Semantic chunking at 500 characters keeps most functions intact, but a 200-line method will span multiple chunks. The 50-character overlap helps, but the model may receive the beginning and end of a long function without the middle.
Large repos take time to index initially. A repository with 100,000+ files will take meaningful time on the first index pass. Subsequent re-indexes are fast because of the embedding cache, but that first run requires patience.
General-purpose embeddings, not code-specific. Specialized code embedding models like CodeBERT aren't available in GGUF format yet. General-purpose embeddings work well for most searches, but they may miss very specific API patterns or domain-specific naming conventions that a code-trained model would catch.
No dependency graph traversal. RAG retrieves relevant chunks, not call chains. If you ask "what happens when a user clicks Submit?", it can find the click handler and the API call and the backend route, but it finds them as separate chunks. It won't automatically trace the execution path through five files in order. You're getting the pieces, and the LLM synthesizes them, but complex multi-hop logic can be missed.
When should you use cloud code search instead?
If your code is open source and you want the absolute best model quality for complex architectural questions, GitHub Copilot or Cursor with GPT-4 or Claude will produce better synthesis. The frontier models are better at reasoning across many code chunks simultaneously.
The tradeoff is always the same: privacy, cost, and offline access versus raw model quality. For finding specific code, understanding modules, answering "where is X?" and "how does Y work?" questions, local RAG with a capable model like Qwen 3.5 is more than sufficient. And your code never leaves your disk.
Try It
Download ModelPiper, install ToolPiper, and create a RAG collection pointed at your project directory. Wait for indexing, then start asking your codebase questions. Your source code stays on your Mac.
This is part of a series on local-first AI workflows on macOS. See also: Local RAG Chat for document search, and Private Local Chat for the base local LLM experience.