What's broken with document AI?
Every RAG tool, every "chat with your docs" product, every knowledge base AI service requires the same thing: uploading your documents to someone else's infrastructure. ChatGPT's file analysis sends your contracts to OpenAI. Claude Projects sends your code to Anthropic. NotebookLM sends your research to Google. Microsoft Copilot indexes your SharePoint on Azure. The entire premise of "AI-powered document search" as currently implemented by every major provider means giving your most sensitive documents to a third party.
This isn't an edge case. It's the defining constraint of the category. The documents that benefit most from AI search — legal contracts, patient records, proprietary source code, financial models, HR files, anything under NDA — are exactly the documents you cannot upload. Enterprise data governance policies at Fortune 500 companies explicitly prohibit sending confidential documents to third-party AI services. Regulated industries face compliance requirements (HIPAA, SOX, GDPR) that make cloud document processing a legal liability. The result: the most capable document AI tools are unavailable for the documents that need them most.
And the infrastructure layer has the same problem. The standard RAG stack in 2026 sends your data through at least three cloud services: one for embeddings (OpenAI), one for vector storage (Pinecone, Weaviate, Qdrant), and one for the language model (OpenAI, Anthropic, Google). Each service is a dependency, a cost center, and a data egress point. A production RAG deployment on Pinecone with OpenAI embeddings and GPT-4 synthesis can cost $200-500/month at moderate query volumes. Every query sends chunks of your documents to cloud APIs. Every indexed document lives on vendor infrastructure you don't control.
The fix isn't better access controls on cloud services. The fix is removing the upload step entirely. If the embedding model, the vector index, and the language model all run on your hardware, there is no data egress. No API calls. No vendor infrastructure holding your documents. No per-query costs. The privacy guarantee becomes absolute by architecture, not by policy.
The practical question has always been whether local models are good enough. As of April 2026, the answer is yes for the vast majority of document Q&A tasks.
The state of the art (April 2026)
Document AI has reached a turning point. The building blocks — embeddings, vector search, LLMs with retrieval — are mature technologies. What has changed in the past year is where they run and how accessible they have become.
RAG has gone mainstream
Every major AI platform now offers some form of document Q&A. ChatGPT's file analysis reads PDFs, spreadsheets, and images directly in GPT-4o's context window. Claude Projects accepts up to 200K tokens of uploaded documents as persistent context. Google's NotebookLM ingests documents and generates interactive audio summaries with source grounding. Microsoft Copilot for M365 indexes your SharePoint, OneDrive, and email for enterprise Q&A. These are polished, capable products that have made document AI accessible to non-technical users — and every one of them processes your documents on someone else's servers.
The infrastructure tier mirrors the same pattern. Pinecone, Weaviate, Qdrant, and Chroma offer managed vector database services. LangChain and LlamaIndex provide excellent orchestration but default to cloud embedding APIs and hosted vector stores. The standard production RAG deployment requires assembling and paying for multiple cloud services.
Embedding models have gotten very good and very small
The embedding quality gap between cloud and local has narrowed significantly. Two years ago, OpenAI's text-embedding-ada-002 was the clear best choice. Now, open-source alternatives match or exceed it on standard benchmarks.
Snowflake Arctic Embed (released January 2025) set a new standard for small embedding models. The S variant is roughly 130MB, runs comfortably on any Mac with Apple Silicon, and produces embeddings that compete with models 10x its size on MTEB retrieval benchmarks. The L and XL variants push quality higher at the cost of more memory.
BGE (BAAI General Embedding) from the Beijing Academy of AI remains a strong contender, especially bge-large-en-v1.5 for English retrieval tasks. Jina AI's jina-embeddings-v3 added late-interaction retrieval and multimodal support. Nomic's nomic-embed-text-v1.5 introduced Matryoshka representations — vectors that can be truncated to smaller dimensions without retraining, useful for memory-constrained deployments.
Apple's built-in NLContextualEmbedding, available on every Mac with Apple Silicon, produces 512-dimensional vectors with zero setup and zero model download. It runs on the Neural Engine, consuming no GPU memory. For general English document search, it is a remarkably capable zero-configuration option.
The practical takeaway: you no longer need a cloud API to get good embeddings. A 130MB local model produces vectors that are more than sufficient for document search, code search, and knowledge base queries. Cloud APIs still have an edge on multilingual retrieval and highly specialized domains, but the gap is narrow and closing.
OCR is a solved problem on-device
Apple Vision OCR, powered by the Neural Engine, handles printed text, handwritten text, multiple languages, and structured document layouts with high accuracy. It is the same framework behind Live Text in Photos and the camera — Apple has invested years of engineering into on-device text recognition. The speed difference is dramatic: Apple Vision OCR returns results in milliseconds with zero network latency, versus 2-30 seconds for cloud services that include upload time and processing queues.
The cloud OCR players (Google Cloud Vision, AWS Textract, Azure Document Intelligence) still lead on specialized tasks: complex table extraction with cell-level bounding boxes, form field recognition with key-value pair extraction, and document classification at scale. AWS Textract's Analyze Document API can identify form fields, tables, and signatures in a single call. Google's Document AI can classify documents by type and extract schema-specific fields. These are enterprise-grade capabilities designed for processing millions of documents.
For the common use case — extracting readable text from scans, photos, whiteboards, receipts, and screenshots so you can search or summarize them — on-device OCR is accurate enough that the cloud round trip is pure overhead. Apple Vision OCR handles multi-column layouts, rotated text, varying font sizes, and structured tables with a reading-order algorithm that produces coherent output. For individual and small-team document processing, on-device OCR has reached parity with cloud services on the most common tasks.
What the cloud benchmarks look like
Cloud document AI sets the quality bar. It is worth knowing where that bar is.
ChatGPT file analysis (GPT-4o): Reads PDFs, images, spreadsheets, and code files. Handles multi-document reasoning across uploaded files. Summarizes, extracts, compares. Limited to roughly 500MB per GPT and subject to context window constraints. $20/month for Plus. Your documents are processed on OpenAI servers.
Claude Projects (Claude 3 Opus/Sonnet): Accepts up to 200K tokens of uploaded context. Excellent at long-document synthesis and multi-document comparison. No persistent vector index — it reads the full context on every query. $20/month for Pro. Documents are processed on Anthropic servers.
NotebookLM (Google): Ingests PDFs, Google Docs, web pages, and YouTube transcripts. Generates interactive audio overviews. Strong citation interface showing exactly which passages support each claim. Free with a Google account. Documents are processed on Google servers.
Pinecone + OpenAI (the standard RAG stack): OpenAI for embeddings ($0.13/million tokens) and GPT-4 for synthesis. Pinecone for vector storage (free tier at 100K vectors, paid from $70/month). Production-grade infrastructure with monitoring, scaling, and multi-tenant isolation. Your data flows through both services.
Local document AI trades the quality ceiling of frontier models for complete privacy, zero recurring cost, and offline capability. For 90% of document Q&A — finding specific information, summarizing sections, answering factual questions about your data — a local 4B-8B model with good retrieval is more than sufficient. The retrieval quality matters more than the model quality in most RAG scenarios: if the right passages are retrieved, even a smaller model can synthesize an accurate answer. If the wrong passages are retrieved, even GPT-4 will hallucinate. This is why embedding quality and search strategy are the leverage points, not just model size.
The single-process RAG architecture
This is our thesis, and the reason ToolPiper exists as a single application rather than an orchestration layer over five services.
The entire RAG pipeline — embedding, indexing, retrieval, and synthesis — can run in a single process on a single Mac with zero external dependencies. Most RAG tutorials and production deployments involve assembling a stack of separate services: an embedding API (OpenAI), a vector database (Pinecone, Weaviate, or Chroma), an LLM API (OpenAI or Anthropic), an ingestion pipeline (LangChain or LlamaIndex), and a UI (custom or Open WebUI). Each service is a dependency you must configure, authenticate, monitor, and pay for. Each one is a data egress point where your documents leave your control.
ToolPiper replaces all five with one process. Apple NL Embedding runs on the Neural Engine with zero model download — 512-dimensional vectors from a framework built into macOS. HNSW vector indexing runs in-memory with O(log n) search time. BM25 keyword matching runs alongside it using Apple's NLTokenizer for proper Unicode word boundaries. llama.cpp handles LLM synthesis on the Metal GPU. A built-in ingestion pipeline splits and embeds 39 file types and 6 image types. There is no network hop between any of these stages.
The architectural insight that makes this possible is Apple Silicon's unified memory. On a discrete-GPU machine, moving data between CPU and GPU requires copying across a PCIe bus. On Apple Silicon, the CPU, GPU, and Neural Engine share the same physical memory pool. The embedding model's output vectors, the HNSW index, and the language model's weights all live in the same memory space with zero-copy data transfer between pipeline stages. A query goes from text to embedding to index lookup to LLM prompt to answer without a single network call or memory copy.
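Stripped of the hardware details, that whole query path fits in one function. The sketch below is illustrative only: the callables stand in for the embedding model, HNSW search, and LLM, and none of these names are ToolPiper's actual API.

```python
from typing import Callable, List

def answer(query: str,
           embed: Callable[[str], List[float]],
           search: Callable[[List[float], int], List[str]],
           llm: Callable[[str], str],
           k: int = 5) -> str:
    """One query, one process: embed -> index lookup -> prompt -> synthesis."""
    qvec = embed(query)                # text -> vector (Neural Engine)
    passages = search(qvec, k)         # vector -> top-k chunks (HNSW + BM25)
    context = "\n\n".join(passages)    # retrieved context for the prompt
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm(prompt)                 # prompt -> answer (Metal GPU)
```

Every arrow in the comments is a function call in the same address space; in a cloud RAG stack, each would be a network round trip.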
This matters for more than just performance. Every network boundary in a distributed RAG stack is a failure mode: the embedding API can time out, the vector database can have a cold start, the LLM API can rate-limit you. A single-process architecture has one failure mode: the process crashes. It also has one cost: the hardware you already own. No monthly bills for vector storage. No per-token charges for embeddings. No API key management. No vendor lock-in on any component.
The tradeoff is real. Cloud RAG scales horizontally to millions of documents across hundreds of concurrent users. Local RAG scales to the memory and disk of one machine. For individual developers, small teams, and anyone working with confidential data, the single-machine constraint is not a limitation — it's the entire point. Your documents never leave your disk because there's nowhere else for them to go.
What's coming
The document AI space is moving fast. Here is what is on the horizon, with clear labels for what is confirmed versus speculative.
Multi-modal RAG (in development, industry-wide). Today, RAG indexes text. Images in documents are either ignored or processed separately through OCR. The next step is indexing images and text together in a unified embedding space, so you can ask "show me the chart from the Q3 report" and retrieve the actual figure alongside the surrounding text. Multimodal embedding models like jina-embeddings-v3 are pushing in this direction. ColPali and ColQwen represent a different approach entirely: they embed document page images directly without OCR, using vision-language models to produce dense page-level vectors. ToolPiper already supports image embeddings via /v1/embeddings, but the quality varies by model and the retrieval experience is not yet seamless.
Reranking models (announced, multiple vendors). First-stage retrieval via HNSW is fast but approximate — it returns the top candidates based on vector proximity alone. Reranking models rescore those candidates with a more expensive cross-encoder that reads both the query and the candidate passage together, improving precision significantly. Jina Reranker, Cohere Reranker, and bge-reranker-v2 are all available or announced. Integrating a local reranker as a second stage between retrieval and synthesis would improve answer quality on ambiguous queries where the top vector results include false positives.
Better chunking strategies (in development). Fixed-size chunking with overlap is the current default. It is simple and works, but it can split a paragraph mid-sentence or cut a function in half. Smarter approaches are emerging: semantic chunking that uses embedding distance to detect topic boundaries, AST-aware chunking for code that splits at function and class boundaries, and recursive document structure parsing that follows heading hierarchy. Several open-source projects are experimenting with LLM-guided chunking where a small model identifies natural break points. The improvement potential here is significant — better chunks mean better retrieval, which means better answers.
Cross-collection search (planned). Today, you search one RAG collection at a time or create a single collection spanning multiple folders. True cross-collection search would let you query across all your indexed knowledge bases simultaneously — contracts and code and meeting notes in one query — with per-collection relevance weighting. This is the difference between a search tool and a knowledge platform.
Structured extraction pipelines (industry trend). Moving beyond Q&A to automated extraction: given a stack of invoices, extract every line item into a spreadsheet. Given a set of contracts, pull out all termination clauses. Given a folder of resumes, extract candidate profiles into structured records. LLMs with structured output (JSON mode, function calling) make this increasingly reliable. The workflow is OCR → LLM structured extraction → output format. ToolPiper already supports the building blocks (OCR pipeline block, LLM with JSON output), and template-level automation is a natural next step.
Longer context windows reducing RAG's role (industry debate). As context windows grow — Gemini 1.5 offers 1M tokens, Claude supports 200K — a reasonable question is whether RAG is even necessary. If you can fit your entire document collection into the context window, why bother with embedding and retrieval? The answer, as of April 2026, is cost and latency. Sending 200K tokens on every query is expensive and slow. RAG retrieves 2-5K tokens of relevant context per query. For large collections, RAG remains the practical architecture even as context windows expand.
How ToolPiper handles this today
ToolPiper ships a complete document AI stack. Every stage runs on your Mac. Here is what is available and where to go deeper.
RAG pipeline: ingest, embed, index, query
The core pipeline. Point a RAG collection at a folder. ToolPiper splits every recognized file into chunks (500-character default with 50-character overlap), embeds each chunk, and builds a searchable vector index. At query time, your question is embedded, the index finds the most relevant passages via hybrid HNSW + BM25 search, and those passages are injected into the language model's prompt.
39 text file types are supported for indexing: PDF, DOCX, Markdown, HTML, LaTeX, EPUB, and 20+ programming languages. Six image types, including PNG, JPEG, WebP, GIF, and BMP, are indexed as image chunks with embeddings via vision models. Binary files and unrecognized types are skipped silently.
The vector index uses HNSW (Hierarchical Navigable Small World) via the SwiftHNSW library. HNSW builds a multi-layer graph of vectors where nearest-neighbor search takes O(log n) time — a collection of 10,000 documents and a collection of 100,000 documents return results in roughly the same time. An LRU cache keeps up to 8 collections loaded simultaneously, so switching between indexed folders does not require reloading from disk. A 16K token context guard prevents overwhelming the language model with too many retrieved chunks.
The hybrid search combines vector similarity (semantic meaning) with BM25 keyword matching (exact terms) by default. BM25 uses Apple's NLTokenizer for proper Unicode word boundaries. The results from both paths are fused so you get the best of both approaches. Ask "payment processing" and you find both the conceptually related Stripe webhook handler (vector match) and the function literally named processPayment() (keyword match). This dual-path strategy catches queries that are conceptual ("how do we handle auth?") and queries that are specific ("the processOrder function").
Ready to try it? Set up local RAG chat — takes about two minutes.
Apple Vision OCR
On-device OCR powered by the Neural Engine. Handles printed text, handwritten text, multiple languages, and structured layouts (columns, tables, headers). The same framework behind Live Text in Photos. No model download required — it is built into macOS.
Because OCR is a pipeline block in ToolPiper, you can chain it: OCR → Summarize, OCR → Translate, OCR → Structured Extract. Drop in a scanned contract, get structured data out.
Ready to try it? Set up document OCR — drag in an image, get text back.
Web scraping for knowledge bases
PiperScrape is ToolPiper's CDP-based web scraper. It navigates a real Chrome browser, detects 16 frontend frameworks to know when a page is genuinely ready, and extracts content in 7 formats: markdown, text, readability, AX tree, HTML, links, and screenshot.
The practical workflow for document AI: scrape a website to markdown, feed the markdown into a RAG collection, then ask questions about it in chat. PiperScrape's AX-tree-based markdown preserves heading hierarchy and link structure — ideal for RAG chunking.
Ready to try it? Set up web scraping — scrape a JS-heavy site that your current tools cannot handle.
Code search
Index an entire codebase into RAG and ask questions about it. "Where is the authentication logic?" "What does processPayment do?" "How is the database connection pool configured?" The system finds conceptually related code across files, not just exact keyword matches. This is semantic search for code: a query about "error handling" surfaces try/catch blocks, result types, error boundary components, and retry logic across your entire repository.
ToolPiper indexes 20+ programming languages (Swift, TypeScript, Python, Go, Rust, Java, C/C++, Ruby, PHP, SQL, and more) with semantic chunking at 500 characters with 50-character overlap to keep most functions intact within a single chunk. The hybrid search is particularly effective for code: BM25 catches function and variable names (the exact strings developers search for), while vector search catches conceptual matches (what the code does, not what it is named). Re-indexing after a day of development takes seconds because the embedding cache skips unchanged files. Non-blocking auto-reindex means you keep querying while the index updates in the background.
Ready to try it? Set up code search — index your repo and start asking questions.
Three embedding paths
Every option is local. Every option serves an OpenAI-compatible /v1/embeddings endpoint.
- Apple NL Embedding — zero setup, Neural Engine, 512 dimensions. Works immediately on any Mac with Apple Silicon. The right default for most users.
- Open-source GGUF models — Snowflake Arctic Embed S (~130MB), nomic-embed-text, bge-base-en, or any GGUF embedding model from HuggingFace. Higher quality for specialized domains. Runs on Metal GPU.
- Dedicated embedding server (port 9997) — separate llama-server process so embedding and chat do not compete for GPU. Auto-starts on first call, auto-stops after 5 minutes idle.
Ready to try it? Explore local embeddings — the foundation under everything else.
Embedding cache
Embeddings are deterministic — the same text with the same model always produces the same vector. ToolPiper exploits this with a global content-addressed cache. Every embedding is keyed by SHA-256(modelStem + text). Re-indexing a collection where 90% of files haven't changed serves 90% of embeddings from cache instantly — in microseconds instead of milliseconds. Binary persistence with CRC32 integrity checks (hardware-accelerated via zlib). 100K entry cap with FIFO eviction. Dimension validation on incremental ingest catches model mismatches early: if you switch from a 512-dim model to a 1536-dim model, the cache detects the mismatch and re-embeds. The embedding server is lazily started and skipped entirely when all content is already cached — meaning re-indexing unchanged collections has zero model load overhead.
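The keying scheme is easy to reproduce. Below is a minimal in-memory sketch of a content-addressed cache using SHA-256(modelStem + text); the plain dict store and the embed stub stand in for ToolPiper's persisted binary format with CRC32 checks.

```python
import hashlib
from typing import Callable, Dict, List

class EmbeddingCache:
    """Content-addressed cache: identical (model, text) pairs always hit."""

    def __init__(self) -> None:
        self.store: Dict[str, List[float]] = {}
        self.hits = 0

    def key(self, model_stem: str, text: str) -> str:
        # Keying on model + content means switching models never serves
        # a stale vector from a different embedding space.
        return hashlib.sha256((model_stem + text).encode("utf-8")).hexdigest()

    def get_or_embed(self, model_stem: str, text: str,
                     embed: Callable[[str], List[float]]) -> List[float]:
        k = self.key(model_stem, text)
        if k in self.store:
            self.hits += 1          # unchanged content: no model call at all
        else:
            self.store[k] = embed(text)
        return self.store[k]
```

Because the key includes the model stem, the same text embedded under two different models occupies two cache entries rather than colliding.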
MCP integration
ToolPiper's document AI capabilities are fully accessible through MCP tools, making them available to any AI assistant that supports the Model Context Protocol. rag_collections lists all indexed collections with their metadata. rag_query searches them with hybrid retrieval and returns cited passages. embed generates embeddings directly for custom pipelines. scrape pulls web content in any of 7 formats for RAG ingestion. ocr extracts text from images using Apple Vision. These tools compose naturally: an AI assistant can scrape a documentation site, index the content into a collection, and then answer questions about it — all through natural language commands. Claude Code, Cursor, Windsurf, or any MCP-capable client gains full document AI capabilities without any API code.
OpenAI-compatible API
ToolPiper serves standard OpenAI-compatible endpoints: POST /v1/embeddings for vector generation and POST /v1/chat/completions for LLM inference. Any tool that calls the OpenAI API — LangChain, LlamaIndex, custom Python scripts, existing RAG pipelines — works by changing the base URL to http://localhost:9998 and removing the API key. Same request format, same response format, no code changes beyond the URL.
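As a concrete example, here is a stdlib-only Python call against the local embeddings endpoint. The model name is a placeholder, and the request builder is separated from the network call so the request shape is visible even without a running server:

```python
import json
import urllib.request
from typing import List

BASE_URL = "http://localhost:9998"   # ToolPiper's local OpenAI-compatible server

def build_request(texts: List[str], model: str = "default") -> urllib.request.Request:
    """Assemble a standard OpenAI-style /v1/embeddings request. No API key."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def embed(texts: List[str], model: str = "default") -> List[List[float]]:
    """POST to the local server and return one vector per input string."""
    with urllib.request.urlopen(build_request(texts, model)) as resp:
        body = json.load(resp)
    return [item["embedding"] for item in body["data"]]
```

The same two functions work against OpenAI's hosted API by changing `BASE_URL` and adding an `Authorization` header, which is the whole point of the compatible endpoint.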