Every RAG system, every semantic search, every recommendation engine starts with embeddings. They turn text into numbers that capture meaning. OpenAI charges $0.13 per million tokens for their embeddings API. That adds up fast when you are indexing thousands of documents. More importantly, every document you embed through their API is sent to their servers. What if embeddings ran on your Mac for free, with zero data exposure?

Your Mac has dedicated AI hardware sitting idle most of the time. The Neural Engine, the Metal GPU, and unified memory are all capable of running embedding models at production speeds. You do not need a cloud API, a Python environment, or a vector database service. You need a model and an index.

What are embeddings and why do they matter?

An embedding is a function that takes text and returns a vector: an array of numbers, typically 384 to 1536 dimensions. The key property is that similar texts produce similar vectors. "How do I reset my password?" and "I forgot my login credentials" produce vectors that are close together in the embedding space, even though they share no words.
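"Close together" here is usually measured with cosine similarity. A minimal sketch with toy 4-dimensional vectors (the numbers are invented for illustration; real models emit 384 to 1536 dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" -- not real model outputs.
password_reset = [0.8, 0.1, 0.5, 0.2]   # "How do I reset my password?"
forgot_login   = [0.7, 0.2, 0.6, 0.1]   # "I forgot my login credentials"
pizza_recipe   = [0.1, 0.9, 0.0, 0.8]   # unrelated text

print(cosine_similarity(password_reset, forgot_login))  # high: similar meaning
print(cosine_similarity(password_reset, pizza_recipe))  # low: unrelated
```

The two password-related vectors score far higher against each other than against the unrelated one, even though the underlying sentences share no words.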

This makes embeddings the foundation of any system that needs to understand meaning rather than match keywords:

Semantic search. Find documents by what they mean, not just what words they contain. A search for "authentication failure" surfaces results about login errors, credential issues, and session timeouts.

Retrieval-augmented generation (RAG). Give a language model relevant context from your documents before it answers. The embedding index finds the right passages; the LLM synthesizes the answer.

Clustering. Group similar documents automatically. Customer support tickets, research papers, code files. The vectors reveal structure that keyword analysis misses.

Deduplication. Detect near-duplicates even when the wording differs. Two documents describing the same policy in different language produce nearly identical vectors.

Recommendation. "If you read this, you might want to read that." Vector proximity is the simplest and most effective recommendation signal.

Without embeddings, search is limited to exact keyword matching. With them, your system understands language.

How do embedding models work?

Embedding models are transformers trained on massive text datasets. They learn to compress text into dense vectors where proximity reflects semantic similarity. The model reads input text, produces contextual token embeddings for each token, then pools them (usually mean pooling) into a single fixed-size vector.
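Mean pooling is simple: average the per-token vectors dimension by dimension to get one fixed-size sentence vector. A sketch with invented 2-dimensional token vectors (real models use 384 or more dimensions):

```python
def mean_pool(token_embeddings: list[list[float]]) -> list[float]:
    """Average per-token vectors dimension-wise into one sentence vector."""
    n = len(token_embeddings)
    dims = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n for d in range(dims)]

# Three contextual token vectors, 2-dim for readability.
tokens = [[0.2, 0.4], [0.6, 0.0], [0.1, 0.8]]
print(mean_pool(tokens))  # approximately [0.3, 0.4]
```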

Model size determines quality and speed. Smaller models (50-100M parameters) are fast and produce decent embeddings. They fit comfortably in memory and run at thousands of texts per second. Larger models (400M+ parameters) produce higher-quality embeddings with better performance on difficult retrieval tasks, at the cost of more memory and slower throughput.

The practical tradeoff: for most document search and RAG use cases, a small embedding model is more than sufficient. You only need a large model when you are doing cross-lingual retrieval, highly specialized domain search, or benchmarking against academic datasets.

What are the three local embedding paths in ToolPiper?

ToolPiper gives you three ways to generate embeddings locally. Each one has different tradeoffs, and you can switch between them without changing anything in your application code.

1. Apple NL Embedding

Zero setup. Built into macOS. Uses NLContextualEmbedding from Apple's Natural Language framework to produce 512-dimensional vectors via mean-pooled token embeddings. It runs on the Neural Engine, so it consumes no GPU memory and requires no model download or configuration. Selecting the model stem apple-nlembedding bypasses llama-server entirely.

This is the right default for most users. You get production-quality embeddings immediately, on any Mac with Apple Silicon, with nothing to install. The tradeoff: it is English-focused and not state-of-the-art on multilingual or highly specialized retrieval benchmarks. For general document search, meeting notes, codebases, and knowledge bases, it works well.

2. Open-source embedding models (via llama.cpp)

Download any GGUF embedding model from HuggingFace and load it through ToolPiper. Models like nomic-embed-text, all-MiniLM-L6-v2, and bge-base-en produce higher-quality embeddings for specialized domains. They run on Metal GPU via the same llama.cpp engine that powers chat models.

This path gives you flexibility. Need 1536-dimensional embeddings for compatibility with an existing pipeline? There is a model for that. Need multilingual support? There is a model for that. Need a domain-specific model fine-tuned on medical or legal text? Download it, point ToolPiper at it, and your index quality improves.

The tradeoff: these models require a download (typically 50-500MB) and consume GPU memory while loaded. On a Mac with 16GB+ of unified memory, this is rarely a constraint.

3. Dedicated embedding server (port 9997)

A separate llama-server instance optimized for embeddings, running on port 9997 with -c 2048 -ub 2048 --embedding flags. It starts on-demand when the first embed call arrives and auto-stops after 5 minutes of idle time.

Why a separate server? Because embedding and chat compete for the same GPU resources when they share an engine. If you are indexing a thousand documents while also chatting with an LLM, the dedicated embedding server handles indexing in its own process while the main engine handles your conversation. No contention, no latency spikes.

The embedding server exposes the standard /v1/embeddings endpoint and is fully OpenAI-compatible. Any tool that calls the OpenAI embeddings API works with ToolPiper by changing the base URL to localhost.
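Because the endpoint follows the OpenAI wire format, any HTTP client works. A sketch using only the Python standard library; the port and model name here are taken from this article and may differ in your setup, and the request is only constructed, not sent, so you can inspect it without a running server:

```python
import json
import urllib.request

def build_embedding_request(text: str,
                            model: str = "apple-nlembedding",
                            base_url: str = "http://localhost:9997") -> urllib.request.Request:
    """Construct an OpenAI-style POST /v1/embeddings request without sending it."""
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_embedding_request("How do I reset my password?")
# urllib.request.urlopen(req) would return the standard OpenAI response shape:
# {"data": [{"embedding": [...], "index": 0}], "model": "...", "usage": {...}}
print(req.full_url)
```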

How does the embedding cache keep things fast?

Embeddings are deterministic. The same text with the same model always produces the same vector. ToolPiper exploits this with a global content-addressed cache.

Every embedding is keyed by SHA-256(modelStem + text). When you re-index a document collection, only the chunks that actually changed get re-embedded. The rest are served from cache in microseconds instead of milliseconds.
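The keying scheme can be sketched in a few lines. This is a simplified in-process illustration of the idea, not ToolPiper's actual implementation, which adds binary persistence and integrity checks:

```python
import hashlib
from collections import OrderedDict

class EmbeddingCache:
    """Content-addressed cache keyed by SHA-256(modelStem + text), FIFO eviction."""

    def __init__(self, max_entries: int = 100_000):
        self.max_entries = max_entries
        self._store: OrderedDict[str, list[float]] = OrderedDict()

    @staticmethod
    def key(model_stem: str, text: str) -> str:
        return hashlib.sha256((model_stem + text).encode("utf-8")).hexdigest()

    def get(self, model_stem: str, text: str):
        return self._store.get(self.key(model_stem, text))

    def put(self, model_stem: str, text: str, vector: list[float]) -> None:
        if len(self._store) >= self.max_entries:
            self._store.popitem(last=False)  # FIFO: evict the oldest entry
        self._store[self.key(model_stem, text)] = vector

cache = EmbeddingCache()
cache.put("apple-nlembedding", "hello world", [0.1, 0.2, 0.3])
print(cache.get("apple-nlembedding", "hello world"))  # cache hit
print(cache.get("nomic-embed-text", "hello world"))   # None: different model, different key
```

Including the model stem in the key is what makes switching models safe: the same text embedded by two different models never collides.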

The cache uses binary persistence with CRC32 integrity checks (hardware-accelerated via zlib). It holds up to 100K entries with FIFO eviction. Dimension validation on incremental ingest catches model mismatches early. If you switch from a 512-dim model to a 1536-dim model, the cache detects the mismatch and re-embeds rather than returning wrong-sized vectors.

The practical result: your first indexing run of a large document folder takes a few minutes. Re-indexing after adding a handful of files takes seconds. The embedding server is lazily started and skipped entirely when all content is already cached.

How does vector search work locally?

Generating embeddings is half the problem. The other half is searching them efficiently. ToolPiper uses HNSW (Hierarchical Navigable Small World) for approximate nearest-neighbor search via the SwiftHNSW library.

HNSW builds a multi-layer graph where each node is a vector. Finding the nearest neighbors to a query vector takes roughly O(log n) time, compared with the O(n) of a brute-force scan, so latency grows only slightly with collection size. A collection of 10,000 documents and a collection of 100,000 documents return results in roughly the same time.
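HNSW's graph machinery is beyond the scope of this article, but the operation it approximates is exact k-nearest-neighbor search. A brute-force baseline makes that concrete — this is the O(n)-per-query scan that the graph traversal avoids (toy vectors, invented for illustration):

```python
import math

def exact_knn(query: list[float], vectors: dict[str, list[float]], k: int = 2) -> list[str]:
    """Brute-force k-nearest-neighbor search by cosine similarity: scans every vector."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
    ranked = sorted(vectors.items(), key=lambda kv: cos(query, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

index = {
    "doc_a": [0.9, 0.1, 0.3],
    "doc_b": [0.1, 0.9, 0.2],
    "doc_c": [0.8, 0.2, 0.4],
}
print(exact_knn([1.0, 0.0, 0.3], index))  # doc_a and doc_c are closest
```

HNSW returns (approximately) the same answer, but by hopping through a small number of graph edges instead of scoring every vector.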

The index is loaded into memory with loadUnaligned for safe deserialization. An LRU cache keeps up to 8 collections loaded simultaneously, so switching between indexed folders does not require reloading from disk every time.

Search is hybrid by default: HNSW vector search (semantic meaning) combined with BM25 keyword search (exact term matching). BM25 uses NLTokenizer from Apple's Natural Language framework for proper Unicode word boundaries. The results are merged so you get documents that match both by meaning and by specific terms you mentioned.
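The article does not specify ToolPiper's exact merge strategy, but reciprocal rank fusion is one common way to combine a vector ranking with a BM25 ranking, and a sketch shows the idea (document IDs invented for illustration):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists: each doc scores sum(1 / (k + rank)) across lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_login_errors", "doc_session_timeout", "doc_vpn_setup"]
bm25_hits   = ["doc_auth_failure", "doc_login_errors", "doc_password_policy"]
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
```

A document that appears in both lists ("doc_login_errors" here) outranks documents that appear in only one, which is exactly the behavior you want from hybrid search: rewarded for matching both by meaning and by exact terms.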

Persistence is efficient. The HNSW sidecar and BM25 index files are LZ4-compressed. CRC32 checksums (hardware-accelerated) verify integrity on load. You can copy the index files to another machine and they work without re-indexing.

What are the honest limitations?

Local embeddings are not a free lunch. Here is what you should know before switching away from a cloud API:

Apple NL Embedding is English-focused. Embedding quality degrades for other languages. If you need strong multilingual retrieval, use an open-source model like multilingual-e5 instead.

Open-source models require GPU memory. Embedding models typically consume 200-500MB while loaded. On a Mac with 8GB of RAM already running a chat model, this can cause memory pressure. The dedicated embedding server helps by isolating the memory footprint, but the bytes still need to come from somewhere.

HNSW is in-memory. Very large collections (millions of documents) may require significant RAM for the index alone. For tens of thousands of documents, memory is not a concern. For millions, you may need 32GB+ of unified memory.

Vector search is CPU-based. HNSW runs on CPU, not GPU. This is fast enough for collections up to hundreds of thousands of vectors. If you need sub-millisecond search over tens of millions of vectors, a dedicated vector database like Milvus or Qdrant with GPU support will outperform a local HNSW index.

Image embeddings depend on the model. ToolPiper supports image embeddings via /v1/embeddings, but quality varies significantly by model. Text-only embedding models cannot meaningfully embed images; handling text and images in the same vector space requires a multimodal model architecture.

Cloud embedding quality is still best-in-class. OpenAI's text-embedding-3-large and Cohere's embed-v3 are trained on proprietary datasets at scales that open-source models have not matched. For mission-critical retrieval where every percentage point of recall matters, cloud APIs still have an edge. For everything else, local embeddings are good enough.

How do you use this with existing tools?

ToolPiper serves an OpenAI-compatible POST /v1/embeddings endpoint. If your application uses the OpenAI Python SDK, LangChain, LlamaIndex, or any tool that calls the OpenAI embeddings API, you change the base URL to http://localhost:9998 and remove your API key. Everything else stays the same.

For MCP-connected AI assistants, the embed and rag_query tools provide direct access to local embeddings and vector search. The assistant can index documents and search them without writing any API code.

The REST API also supports raw embedding requests for custom pipelines. Send a JSON body with model and input fields, get back a vector. Standard OpenAI response format. Standard error handling. No surprises.

Try It

Download ModelPiper, install ToolPiper, and point a RAG collection at a folder. Apple NL Embedding works immediately with zero configuration. If you want higher quality, download an embedding model from HuggingFace through the model browser. Your documents never leave your Mac.

This is part of a series on local-first AI workflows on macOS. See also: Local RAG Chat for the full document Q&A pipeline built on top of local embeddings.