What's broken with document AI?
Every RAG tool, every "chat with your docs" product, every knowledge base AI service requires the same thing: uploading your documents to someone else's infrastructure. ChatGPT's file analysis sends your contracts to OpenAI. Claude Projects sends your code to Anthropic. NotebookLM sends your research to Google. Microsoft Copilot indexes your SharePoint on Azure. The entire premise of "AI-powered document search" as currently implemented by every major provider means giving your most sensitive documents to a third party.
This isn't an edge case. It's the defining constraint of the category. The documents that benefit most from AI search — legal contracts, patient records, proprietary source code, financial models, HR files, anything under NDA — are exactly the documents you cannot upload. Enterprise data governance policies at Fortune 500 companies explicitly prohibit sending confidential documents to third-party AI services. Regulated industries face compliance requirements (HIPAA, SOX, GDPR) that make cloud document processing a legal liability. The result: the most capable document AI tools are unavailable for the documents that need them most.
And the infrastructure layer has the same problem. The standard RAG stack in 2026 sends your data through at least three cloud services: one for embeddings (OpenAI), one for vector storage (Pinecone, Weaviate, Qdrant), and one for the language model (OpenAI, Anthropic, Google). Each service is a dependency, a cost center, and a data egress point. A production RAG deployment on Pinecone with OpenAI embeddings and GPT-4 synthesis can cost $200-500/month at moderate query volumes. Every query sends chunks of your documents to cloud APIs. Every indexed document lives on vendor infrastructure you don't control.
The fix isn't better access controls on cloud services. The fix is removing the upload step entirely. If the embedding model, the vector index, and the language model all run on your hardware, there is no data egress. No API calls. No vendor infrastructure holding your documents. No per-query costs. The privacy guarantee becomes absolute by architecture, not by policy.
The practical question has always been whether local models are good enough. As of April 2026, the answer is yes for the vast majority of document Q&A tasks.
The state of the art (April 2026)
Document AI has reached a turning point. The building blocks — embeddings, vector search, LLMs with retrieval — are mature technologies. What has changed in the past year is where they run and how accessible they have become.
RAG has gone mainstream
Every major AI platform now offers some form of document Q&A. ChatGPT's file analysis reads PDFs, spreadsheets, and images directly in GPT-4o's context window. Claude Projects accepts up to 200K tokens of uploaded documents as persistent context. Google's NotebookLM ingests documents and generates interactive audio summaries with source grounding. Microsoft Copilot for M365 indexes your SharePoint, OneDrive, and email for enterprise Q&A. These are polished, capable products that have made document AI accessible to non-technical users — and every one of them processes your documents on someone else's servers.
The infrastructure tier mirrors the same pattern. Pinecone, Weaviate, Qdrant, and Chroma offer managed vector database services. LangChain and LlamaIndex provide excellent orchestration but default to cloud embedding APIs and hosted vector stores. The standard production RAG deployment requires assembling and paying for multiple cloud services.
Embedding models have gotten very good and very small
The embedding quality gap between cloud and local has narrowed significantly. Two years ago, OpenAI's text-embedding-ada-002 was the clear best choice. Now, open-source alternatives match or exceed it on standard benchmarks.
Snowflake Arctic Embed (released January 2025) set a new standard for small embedding models. The S variant is roughly 130MB, runs comfortably on any Mac with Apple Silicon, and produces embeddings that compete with models 10x its size on MTEB retrieval benchmarks. The L and XL variants push quality higher at the cost of more memory.
BGE (BAAI General Embedding) from the Beijing Academy of AI remains a strong contender, especially bge-large-en-v1.5 for English retrieval tasks. Jina AI's jina-embeddings-v3 added late-interaction retrieval and multimodal support. Nomic's nomic-embed-text-v1.5 introduced Matryoshka representations — vectors that can be truncated to smaller dimensions without retraining, useful for memory-constrained deployments.
Apple's built-in NLContextualEmbedding, available on every Mac with Apple Silicon, produces 512-dimensional vectors with zero setup and zero model download. It runs on the Neural Engine, consuming no GPU memory. For general English document search, it is a remarkably capable zero-configuration option.
The practical takeaway: you no longer need a cloud API to get good embeddings. A 130MB local model produces vectors that are more than sufficient for document search, code search, and knowledge base queries. Cloud APIs still have an edge on multilingual retrieval and highly specialized domains, but the gap is narrow and closing.
OCR is a solved problem on-device
Apple Vision OCR, powered by the Neural Engine, handles printed text, handwritten text, multiple languages, and structured document layouts with high accuracy. It is the same framework behind Live Text in Photos and the camera — Apple has invested years of engineering into on-device text recognition. The speed difference is dramatic: Apple Vision OCR returns results in milliseconds with zero network latency, versus 2-30 seconds for cloud services that include upload time and processing queues.
The cloud OCR players (Google Cloud Vision, AWS Textract, Azure Document Intelligence) still lead on specialized tasks: complex table extraction with cell-level bounding boxes, form field recognition with key-value pair extraction, and document classification at scale. AWS Textract's Analyze Document API can identify form fields, tables, and signatures in a single call. Google's Document AI can classify documents by type and extract schema-specific fields. These are enterprise-grade capabilities designed for processing millions of documents.
For the common use case — extracting readable text from scans, photos, whiteboards, receipts, and screenshots so you can search or summarize them — on-device OCR is accurate enough that the cloud round trip is pure overhead. Apple Vision OCR handles multi-column layouts, rotated text, varying font sizes, and structured tables with a reading-order algorithm that produces coherent output. For individual and small-team document processing, on-device OCR has reached parity with cloud services on the most common tasks.
What the cloud benchmarks look like
Cloud document AI sets the quality bar. It is worth knowing where that bar is.
ChatGPT file analysis (GPT-4o): Reads PDFs, images, spreadsheets, and code files. Handles multi-document reasoning across uploaded files. Summarizes, extracts, compares. Limited to roughly 500MB per GPT and subject to context window constraints. $20/month for Plus. Your documents are processed on OpenAI servers.
Claude Projects (Claude 3 Opus/Sonnet): Accepts up to 200K tokens of uploaded context. Excellent at long-document synthesis and multi-document comparison. No persistent vector index — it reads the full context on every query. $20/month for Pro. Documents are processed on Anthropic servers.
NotebookLM (Google): Ingests PDFs, Google Docs, web pages, and YouTube transcripts. Generates interactive audio overviews. Strong citation interface showing exactly which passages support each claim. Free with a Google account. Documents are processed on Google servers.
Pinecone + OpenAI (the standard RAG stack): OpenAI for embeddings ($0.13/million tokens) and GPT-4 for synthesis. Pinecone for vector storage (free tier at 100K vectors, paid from $70/month). Production-grade infrastructure with monitoring, scaling, and multi-tenant isolation. Your data flows through both services.
Local document AI trades the quality ceiling of frontier models for complete privacy, zero recurring cost, and offline capability. For 90% of document Q&A — finding specific information, summarizing sections, answering factual questions about your data — a local 4B-8B model with good retrieval is more than sufficient. The retrieval quality matters more than the model quality in most RAG scenarios: if the right passages are retrieved, even a smaller model can synthesize an accurate answer. If the wrong passages are retrieved, even GPT-4 will hallucinate. This is why embedding quality and search strategy are the leverage points, not just model size.
The single-process RAG architecture
This is our thesis, and the reason ToolPiper exists as a single application rather than an orchestration layer over five services.
The entire RAG pipeline — embedding, indexing, retrieval, and synthesis — can run in a single process on a single Mac with zero external dependencies. Most RAG tutorials and production deployments involve assembling a stack of separate services: an embedding API (OpenAI), a vector database (Pinecone, Weaviate, or Chroma), an LLM API (OpenAI or Anthropic), an ingestion pipeline (LangChain or LlamaIndex), and a UI (custom or Open WebUI). Each service is a dependency you must configure, authenticate, monitor, and pay for. Each one is a data egress point where your documents leave your control.
ToolPiper replaces all five with one process. Apple NL Embedding runs on the Neural Engine with zero model download — 512-dimensional vectors from a framework built into macOS. HNSW vector indexing runs in-memory with O(log n) search time. BM25 keyword matching runs alongside it using Apple's NLTokenizer for proper Unicode word boundaries. llama.cpp handles LLM synthesis on the Metal GPU. A built-in ingestion pipeline splits and embeds 39 file types and 6 image types. There is no network hop between any of these stages.
The architectural insight that makes this possible is Apple Silicon's unified memory. On a discrete-GPU machine, moving data between CPU and GPU requires copying across a PCIe bus. On Apple Silicon, the CPU, GPU, and Neural Engine share the same physical memory pool. The embedding model's output vectors, the HNSW index, and the language model's weights all live in the same memory space with zero-copy data transfer between pipeline stages. A query goes from text to embedding to index lookup to LLM prompt to answer without a single network call or memory copy.
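Stripped of the hardware details, that whole query path fits in one function. The sketch below is illustrative only: the callables stand in for the embedding model, HNSW search, and LLM, and none of these names are ToolPiper's actual API.

```python
from typing import Callable, List

def answer(query: str,
           embed: Callable[[str], List[float]],
           search: Callable[[List[float], int], List[str]],
           llm: Callable[[str], str],
           k: int = 5) -> str:
    """One query, one process: embed -> index lookup -> prompt -> synthesis."""
    qvec = embed(query)                # text -> vector (Neural Engine)
    passages = search(qvec, k)         # vector -> top-k chunks (HNSW + BM25)
    context = "\n\n".join(passages)    # retrieved context for the prompt
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm(prompt)                 # prompt -> answer (Metal GPU)
```

Every arrow in the comments is a function call in the same address space; in a cloud RAG stack, each would be a network round trip.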
This matters for more than just performance. Every network boundary in a distributed RAG stack is a failure mode: the embedding API can time out, the vector database can have a cold start, the LLM API can rate-limit you. A single-process architecture has one failure mode: the process crashes. It also has one cost: the hardware you already own. No monthly bills for vector storage. No per-token charges for embeddings. No API key management. No vendor lock-in on any component.
The tradeoff is real. Cloud RAG scales horizontally to millions of documents across hundreds of concurrent users. Local RAG scales to the memory and disk of one machine. For individual developers, small teams, and anyone working with confidential data, the single-machine constraint is not a limitation — it's the entire point. Your documents never leave your disk because there's nowhere else for them to go.
What's coming
The document AI space is moving fast. Here is what is on the horizon, with clear labels for what is confirmed versus speculative.
Multi-modal RAG (in development, industry-wide). Today, RAG indexes text. Images in documents are either ignored or processed separately through OCR. The next step is indexing images and text together in a unified embedding space, so you can ask "show me the chart from the Q3 report" and retrieve the actual figure alongside the surrounding text. Multimodal embedding models like jina-embeddings-v3 are pushing in this direction. ColPali and ColQwen represent a different approach entirely: they embed document page images directly without OCR, using vision-language models to produce dense page-level vectors. ToolPiper already supports image embeddings via /v1/embeddings, but the quality varies by model and the retrieval experience is not yet seamless.
Reranking models (announced, multiple vendors). First-stage retrieval via HNSW is fast but approximate — it returns the top candidates based on vector proximity alone. Reranking models rescore those candidates with a more expensive cross-encoder that reads both the query and the candidate passage together, improving precision significantly. Jina Reranker, Cohere Reranker, and bge-reranker-v2 are all available or announced. Integrating a local reranker as a second stage between retrieval and synthesis would improve answer quality on ambiguous queries where the top vector results include false positives.
Better chunking strategies (in development). Fixed-size chunking with overlap is the current default. It is simple and works, but it can split a paragraph mid-sentence or cut a function in half. Smarter approaches are emerging: semantic chunking that uses embedding distance to detect topic boundaries, AST-aware chunking for code that splits at function and class boundaries, and recursive document structure parsing that follows heading hierarchy. Several open-source projects are experimenting with LLM-guided chunking where a small model identifies natural break points. The improvement potential here is significant — better chunks mean better retrieval, which means better answers.
Cross-collection search (planned). Today, you search one RAG collection at a time or create a single collection spanning multiple folders. True cross-collection search would let you query across all your indexed knowledge bases simultaneously — contracts and code and meeting notes in one query — with per-collection relevance weighting. This is the difference between a search tool and a knowledge platform.
Structured extraction pipelines (industry trend). Moving beyond Q&A to automated extraction: given a stack of invoices, extract every line item into a spreadsheet. Given a set of contracts, pull out all termination clauses. Given a folder of resumes, extract candidate profiles into structured records. LLMs with structured output (JSON mode, function calling) make this increasingly reliable. The workflow is OCR → LLM structured extraction → output format. ToolPiper already supports the building blocks (OCR pipeline block, LLM with JSON output), and template-level automation is a natural next step.
Longer context windows reducing RAG's role (industry debate). As context windows grow — Gemini 1.5 offers 1M tokens, Claude supports 200K — a reasonable question is whether RAG is even necessary. If you can fit your entire document collection into the context window, why bother with embedding and retrieval? The answer, as of April 2026, is cost and latency. Sending 200K tokens on every query is expensive and slow. RAG retrieves 2-5K tokens of relevant context per query. For large collections, RAG remains the practical architecture even as context windows expand.
How ToolPiper handles this today
ToolPiper ships a complete document AI stack. Every stage runs on your Mac. Here is what is available and where to go deeper.
RAG pipeline: ingest, embed, index, query
The core pipeline. Point a RAG collection at a folder. ToolPiper splits every recognized file into chunks (500-character default with 50-character overlap), embeds each chunk, and builds a searchable vector index. At query time, your question is embedded, the index finds the most relevant passages via hybrid HNSW + BM25 search, and those passages are injected into the language model's prompt.
39 text file types are supported for indexing: PDF, DOCX, Markdown, HTML, LaTeX, EPUB, and 20+ programming languages. Six image types, including PNG, JPEG, WebP, GIF, and BMP, are indexed as image chunks with embeddings via vision models. Binary files and unrecognized types are skipped silently.
The vector index uses HNSW (Hierarchical Navigable Small World) via the SwiftHNSW library. HNSW builds a multi-layer graph of vectors where nearest-neighbor search takes O(log n) time — a collection of 10,000 documents and a collection of 100,000 documents return results in roughly the same time. An LRU cache keeps up to 8 collections loaded simultaneously, so switching between indexed folders does not require reloading from disk. A 16K token context guard prevents overwhelming the language model with too many retrieved chunks.
The hybrid search combines vector similarity (semantic meaning) with BM25 keyword matching (exact terms) by default. BM25 uses Apple's NLTokenizer for proper Unicode word boundaries. The results from both paths are fused so you get the best of both approaches. Ask "payment processing" and you find both the conceptually related Stripe webhook handler (vector match) and the function literally named processPayment() (keyword match). This dual-path strategy catches queries that are conceptual ("how do we handle auth?") and queries that are specific ("the processOrder function").
Ready to try it? Set up local RAG chat — takes about two minutes.
Apple Vision OCR
On-device OCR powered by the Neural Engine. Handles printed text, handwritten text, multiple languages, and structured layouts (columns, tables, headers). The same framework behind Live Text in Photos. No model download required — it is built into macOS.
Because OCR is a pipeline block in ToolPiper, you can chain it: OCR → Summarize, OCR → Translate, OCR → Structured Extract. Drop in a scanned contract, get structured data out.
Ready to try it? Set up document OCR — drag in an image, get text back.
Web scraping for knowledge bases
PiperScrape is ToolPiper's CDP-based web scraper. It navigates a real Chrome browser, detects 16 frontend frameworks to know when a page is genuinely ready, and extracts content in 7 formats: markdown, text, readability, AX tree, HTML, links, and screenshot.
The practical workflow for document AI: scrape a website to markdown, feed the markdown into a RAG collection, then ask questions about it in chat. PiperScrape's AX-tree-based markdown preserves heading hierarchy and link structure — ideal for RAG chunking.
Ready to try it? Set up web scraping — scrape a JS-heavy site that your current tools cannot handle.
Code search
Index an entire codebase into RAG and ask questions about it. "Where is the authentication logic?" "What does processPayment do?" "How is the database connection pool configured?" The system finds conceptually related code across files, not just exact keyword matches. This is semantic search for code: a query about "error handling" surfaces try/catch blocks, result types, error boundary components, and retry logic across your entire repository.
ToolPiper indexes 20+ programming languages (Swift, TypeScript, Python, Go, Rust, Java, C/C++, Ruby, PHP, SQL, and more) with semantic chunking at 500 characters with 50-character overlap to keep most functions intact within a single chunk. The hybrid search is particularly effective for code: BM25 catches function and variable names (the exact strings developers search for), while vector search catches conceptual matches (what the code does, not what it is named). Re-indexing after a day of development takes seconds because the embedding cache skips unchanged files. Non-blocking auto-reindex means you keep querying while the index updates in the background.
Ready to try it? Set up code search — index your repo and start asking questions.
Three embedding paths
Every option is local. Every option serves an OpenAI-compatible /v1/embeddings endpoint.
- Apple NL Embedding — zero setup, Neural Engine, 512 dimensions. Works immediately on any Mac with Apple Silicon. The right default for most users.
- Open-source GGUF models — Snowflake Arctic Embed S (~130MB), nomic-embed-text, bge-base-en, or any GGUF embedding model from HuggingFace. Higher quality for specialized domains. Runs on Metal GPU.
- Dedicated embedding server (port 9997) — separate llama-server process so embedding and chat do not compete for GPU. Auto-starts on first call, auto-stops after 5 minutes idle.
Ready to try it? Explore local embeddings — the foundation under everything else.
Embedding cache
Embeddings are deterministic — the same text with the same model always produces the same vector. ToolPiper exploits this with a global content-addressed cache. Every embedding is keyed by SHA-256(modelStem + text). Re-indexing a collection where 90% of files haven't changed serves 90% of embeddings from cache instantly — in microseconds instead of milliseconds. Binary persistence with CRC32 integrity checks (hardware-accelerated via zlib). 100K entry cap with FIFO eviction. Dimension validation on incremental ingest catches model mismatches early: if you switch from a 512-dim model to a 1536-dim model, the cache detects the mismatch and re-embeds. The embedding server is lazily started and skipped entirely when all content is already cached — meaning re-indexing unchanged collections has zero model load overhead.
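The keying scheme is easy to reproduce. Below is a minimal in-memory sketch of a content-addressed cache using SHA-256(modelStem + text); the plain dict store and the embed stub stand in for ToolPiper's persisted binary format with CRC32 checks.

```python
import hashlib
from typing import Callable, Dict, List

class EmbeddingCache:
    """Content-addressed cache: identical (model, text) pairs always hit."""

    def __init__(self) -> None:
        self.store: Dict[str, List[float]] = {}
        self.hits = 0

    def key(self, model_stem: str, text: str) -> str:
        # Keying on model + content means switching models never serves
        # a stale vector from a different embedding space.
        return hashlib.sha256((model_stem + text).encode("utf-8")).hexdigest()

    def get_or_embed(self, model_stem: str, text: str,
                     embed: Callable[[str], List[float]]) -> List[float]:
        k = self.key(model_stem, text)
        if k in self.store:
            self.hits += 1          # unchanged content: no model call at all
        else:
            self.store[k] = embed(text)
        return self.store[k]
```

Because the key includes the model stem, the same text embedded under two different models occupies two cache entries rather than colliding.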
MCP integration
ToolPiper's document AI capabilities are fully accessible through MCP tools, making them available to any AI assistant that supports the Model Context Protocol. rag_collections lists all indexed collections with their metadata. rag_query searches them with hybrid retrieval and returns cited passages. embed generates embeddings directly for custom pipelines. scrape pulls web content in any of 7 formats for RAG ingestion. ocr extracts text from images using Apple Vision. These tools compose naturally: an AI assistant can scrape a documentation site, index the content into a collection, and then answer questions about it — all through natural language commands. Claude Code, Cursor, Windsurf, or any MCP-capable client gains full document AI capabilities without any API code.
OpenAI-compatible API
ToolPiper serves standard OpenAI-compatible endpoints: POST /v1/embeddings for vector generation and POST /v1/chat/completions for LLM inference. Any tool that calls the OpenAI API — LangChain, LlamaIndex, custom Python scripts, existing RAG pipelines — works by changing the base URL to http://localhost:9998 and removing the API key. Same request format, same response format, no code changes beyond the URL.
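As a concrete example, here is a stdlib-only Python call against the local embeddings endpoint. The model name is a placeholder, and the request builder is separated from the network call so the request shape is visible even without a running server:

```python
import json
import urllib.request
from typing import List

BASE_URL = "http://localhost:9998"   # ToolPiper's local OpenAI-compatible server

def build_request(texts: List[str], model: str = "default") -> urllib.request.Request:
    """Assemble a standard OpenAI-style /v1/embeddings request. No API key."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def embed(texts: List[str], model: str = "default") -> List[List[float]]:
    """POST to the local server and return one vector per input string."""
    with urllib.request.urlopen(build_request(texts, model)) as resp:
        body = json.load(resp)
    return [item["embedding"] for item in body["data"]]
```

The same two functions work against OpenAI's hosted API by changing `BASE_URL` and adding an `Authorization` header, which is the whole point of the compatible endpoint.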