Every RAG system, every semantic search, every recommendation engine starts with embeddings. They turn text into numbers that capture meaning. OpenAI charges $0.13 per million tokens for their embeddings API. That adds up fast when you are indexing thousands of documents. More importantly, every document you embed through their API is sent to their servers. What if embeddings ran on your Mac for free, with zero data exposure?

Your Mac has dedicated AI hardware sitting idle most of the time. The Neural Engine, the Metal GPU, and unified memory are all capable of running embedding models at production speeds. You do not need a cloud API, a Python environment, or a vector database service. You need a model and an index.

What are embeddings and why do they matter?

An embedding is a function that takes text and returns a vector: an array of numbers, typically 384 to 1536 dimensions. The key property is that similar texts produce similar vectors. "How do I reset my password?" and "I forgot my login credentials" produce vectors that are close together in the embedding space, even though they share no words.
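"Close together" here is usually measured with cosine similarity. A minimal sketch with toy 4-dimensional vectors (the numbers are invented for illustration; real models emit 384 to 1536 dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" -- not real model outputs.
password_reset = [0.8, 0.1, 0.5, 0.2]   # "How do I reset my password?"
forgot_login   = [0.7, 0.2, 0.6, 0.1]   # "I forgot my login credentials"
pizza_recipe   = [0.1, 0.9, 0.0, 0.8]   # unrelated text

print(cosine_similarity(password_reset, forgot_login))  # high: similar meaning
print(cosine_similarity(password_reset, pizza_recipe))  # low: unrelated
```

The two password-related vectors score far higher against each other than against the unrelated one, even though the underlying sentences share no words.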

This makes embeddings the foundation of any system that needs to understand meaning rather than match keywords:

Semantic search. Find documents by what they mean, not just what words they contain. A search for "authentication failure" surfaces results about login errors, credential issues, and session timeouts.

Retrieval-augmented generation (RAG). Give a language model relevant context from your documents before it answers. The embedding index finds the right passages; the LLM synthesizes the answer.

Clustering. Group similar documents automatically. Customer support tickets, research papers, code files. The vectors reveal structure that keyword analysis misses.

Deduplication. Detect near-duplicates even when the wording differs. Two documents describing the same policy in different language produce nearly identical vectors.

Recommendation. "If you read this, you might want to read that." Vector proximity is the simplest and most effective recommendation signal.

Without embeddings, search is limited to exact keyword matching. With them, your system understands language.

How do embedding models work?

Embedding models are transformers trained on massive text datasets. They learn to compress text into dense vectors where proximity reflects semantic similarity. The model reads input text, produces contextual token embeddings for each token, then pools them (usually mean pooling) into a single fixed-size vector.
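Mean pooling is simple: average the per-token vectors dimension by dimension to get one fixed-size sentence vector. A sketch with invented 2-dimensional token vectors (real models use 384 or more dimensions):

```python
def mean_pool(token_embeddings: list[list[float]]) -> list[float]:
    """Average per-token vectors dimension-wise into one sentence vector."""
    n = len(token_embeddings)
    dims = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n for d in range(dims)]

# Three contextual token vectors, 2-dim for readability.
tokens = [[0.2, 0.4], [0.6, 0.0], [0.1, 0.8]]
print(mean_pool(tokens))  # approximately [0.3, 0.4]
```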

Model size determines quality and speed. Smaller models (50-100M parameters) are fast and produce decent embeddings. They fit comfortably in memory and run at thousands of texts per second. Larger models (400M+ parameters) produce higher-quality embeddings with better performance on difficult retrieval tasks, at the cost of more memory and slower throughput.

The practical tradeoff: for most document search and RAG use cases, a small embedding model is more than sufficient. You only need a large model when you are doing cross-lingual retrieval, highly specialized domain search, or benchmarking against academic datasets.

What are the three local embedding paths in ToolPiper?

ToolPiper gives you three ways to generate embeddings locally. Each one has different tradeoffs, and you can switch between them without changing anything in your application code.

1. Apple NL Embedding

Zero setup. Built into macOS. Uses NLContextualEmbedding from Apple's Natural Language framework to produce 512-dimensional vectors via mean-pooled token embeddings. It runs on the Neural Engine, so it consumes no GPU memory and requires no model download or configuration. Selecting the model stem apple-nlembedding bypasses llama-server entirely.

This is the right default for most users. You get production-quality embeddings immediately, on any Mac with Apple Silicon, with nothing to install. The tradeoff: it is English-focused and not state-of-the-art on multilingual or highly specialized retrieval benchmarks. For general document search, meeting notes, codebases, and knowledge bases, it works well.

2. Open-source embedding models (via llama.cpp)

Download any GGUF embedding model from HuggingFace and load it through ToolPiper. Models like nomic-embed-text, all-MiniLM-L6-v2, and bge-base-en produce higher-quality embeddings for specialized domains. They run on Metal GPU via the same llama.cpp engine that powers chat models.

This path gives you flexibility. Need 1536-dimensional embeddings for compatibility with an existing pipeline? There is a model for that. Need multilingual support? There is a model for that. Need a domain-specific model fine-tuned on medical or legal text? Download it, point ToolPiper at it, and your index quality improves.

The tradeoff: these models require a download (typically 50-500MB) and consume GPU memory while loaded. On a Mac with 16GB+ of unified memory, this is rarely a constraint.

3. Dedicated embedding server (port 9997)

A separate llama-server instance optimized for embeddings, running on port 9997 with -c 2048 -ub 2048 --embedding flags. It starts on-demand when the first embed call arrives and auto-stops after 5 minutes of idle time.

Why a separate server? Because embedding and chat compete for the same GPU resources when they share an engine. If you are indexing a thousand documents while also chatting with an LLM, the dedicated embedding server handles indexing in its own process while the main engine handles your conversation. No contention, no latency spikes.

The embedding server exposes the standard /v1/embeddings endpoint and is fully OpenAI-compatible. Any tool that calls the OpenAI embeddings API works with ToolPiper by changing the base URL to localhost.
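Because the endpoint follows the OpenAI wire format, any HTTP client works. A sketch using only the Python standard library; the port and model name here are taken from this article and may differ in your setup, and the request is only constructed, not sent, so you can inspect it without a running server:

```python
import json
import urllib.request

def build_embedding_request(text: str,
                            model: str = "apple-nlembedding",
                            base_url: str = "http://localhost:9997") -> urllib.request.Request:
    """Construct an OpenAI-style POST /v1/embeddings request without sending it."""
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_embedding_request("How do I reset my password?")
# urllib.request.urlopen(req) would return the standard OpenAI response shape:
# {"data": [{"embedding": [...], "index": 0}], "model": "...", "usage": {...}}
print(req.full_url)
```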

How does the embedding cache keep things fast?

Embeddings are deterministic. The same text with the same model always produces the same vector. ToolPiper exploits this with a global content-addressed cache.

Every embedding is keyed by SHA-256(modelStem + text). When you re-index a document collection, only the chunks that actually changed get re-embedded. The rest are served from cache in microseconds instead of milliseconds.
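The keying scheme can be sketched in a few lines. This is a simplified in-process illustration of the idea, not ToolPiper's actual implementation, which adds binary persistence and integrity checks:

```python
import hashlib
from collections import OrderedDict

class EmbeddingCache:
    """Content-addressed cache keyed by SHA-256(modelStem + text), FIFO eviction."""

    def __init__(self, max_entries: int = 100_000):
        self.max_entries = max_entries
        self._store: OrderedDict[str, list[float]] = OrderedDict()

    @staticmethod
    def key(model_stem: str, text: str) -> str:
        return hashlib.sha256((model_stem + text).encode("utf-8")).hexdigest()

    def get(self, model_stem: str, text: str):
        return self._store.get(self.key(model_stem, text))

    def put(self, model_stem: str, text: str, vector: list[float]) -> None:
        if len(self._store) >= self.max_entries:
            self._store.popitem(last=False)  # FIFO: evict the oldest entry
        self._store[self.key(model_stem, text)] = vector

cache = EmbeddingCache()
cache.put("apple-nlembedding", "hello world", [0.1, 0.2, 0.3])
print(cache.get("apple-nlembedding", "hello world"))  # cache hit
print(cache.get("nomic-embed-text", "hello world"))   # None: different model, different key
```

Including the model stem in the key is what makes switching models safe: the same text embedded by two different models never collides.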

The cache uses binary persistence with CRC32 integrity checks (hardware-accelerated via zlib). It holds up to 100K entries with FIFO eviction. Dimension validation on incremental ingest catches model mismatches early. If you switch from a 512-dim model to a 1536-dim model, the cache detects the mismatch and re-embeds rather than returning wrong-sized vectors.

The practical result: your first indexing run of a large document folder takes a few minutes. Re-indexing after adding a handful of files takes seconds. The embedding server is lazily started and skipped entirely when all content is already cached.

How does vector search work locally?

Generating embeddings is half the problem. The other half is searching them efficiently. ToolPiper uses HNSW (Hierarchical Navigable Small World) for approximate nearest-neighbor search via the SwiftHNSW library.

HNSW builds a multi-layer graph where each node is a vector. Finding the nearest neighbors to a query vector takes roughly O(log n) time, compared with the O(n) of a brute-force scan, so latency grows only slightly with collection size. A collection of 10,000 documents and a collection of 100,000 documents return results in roughly the same time.
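HNSW's graph machinery is beyond the scope of this article, but the operation it approximates is exact k-nearest-neighbor search. A brute-force baseline makes that concrete — this is the O(n)-per-query scan that the graph traversal avoids (toy vectors, invented for illustration):

```python
import math

def exact_knn(query: list[float], vectors: dict[str, list[float]], k: int = 2) -> list[str]:
    """Brute-force k-nearest-neighbor search by cosine similarity: scans every vector."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
    ranked = sorted(vectors.items(), key=lambda kv: cos(query, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

index = {
    "doc_a": [0.9, 0.1, 0.3],
    "doc_b": [0.1, 0.9, 0.2],
    "doc_c": [0.8, 0.2, 0.4],
}
print(exact_knn([1.0, 0.0, 0.3], index))  # doc_a and doc_c are closest
```

HNSW returns (approximately) the same answer, but by hopping through a small number of graph edges instead of scoring every vector.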

The index is loaded into memory with loadUnaligned for safe deserialization. An LRU cache keeps up to 8 collections loaded simultaneously, so switching between indexed folders does not require reloading from disk every time.

Search is hybrid by default: HNSW vector search (semantic meaning) combined with BM25 keyword search (exact term matching). BM25 uses NLTokenizer from Apple's Natural Language framework for proper Unicode word boundaries. The results are merged so you get documents that match both by meaning and by specific terms you mentioned.
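The article does not specify ToolPiper's exact merge strategy, but reciprocal rank fusion is one common way to combine a vector ranking with a BM25 ranking, and a sketch shows the idea (document IDs invented for illustration):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists: each doc scores sum(1 / (k + rank)) across lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_login_errors", "doc_session_timeout", "doc_vpn_setup"]
bm25_hits   = ["doc_auth_failure", "doc_login_errors", "doc_password_policy"]
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
```

A document that appears in both lists ("doc_login_errors" here) outranks documents that appear in only one, which is exactly the behavior you want from hybrid search: rewarded for matching both by meaning and by exact terms.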

Persistence is efficient. The HNSW sidecar and BM25 index files are LZ4-compressed. CRC32 checksums (hardware-accelerated) verify integrity on load. You can copy the index files to another machine and they work without re-indexing.

What are the honest limitations?

Local embeddings are not a free lunch. Here is what you should know before switching away from a cloud API:

Apple NL Embedding is English-focused. Embedding quality degrades for other languages. If you need strong multilingual retrieval, use an open-source model like multilingual-e5 instead.

Open-source models require GPU memory. Embedding models typically consume 200-500MB while loaded. On a Mac with 8GB of RAM already running a chat model, this can cause memory pressure. The dedicated embedding server helps by isolating the memory footprint, but the bytes still need to come from somewhere.

HNSW is in-memory. Very large collections (millions of documents) may require significant RAM for the index alone. For tens of thousands of documents, memory is not a concern. For millions, you may need 32GB+ of unified memory.

Vector search is CPU-based. HNSW runs on CPU, not GPU. This is fast enough for collections up to hundreds of thousands of vectors. If you need sub-millisecond search over tens of millions of vectors, a dedicated vector database like Milvus or Qdrant with GPU support will outperform a local HNSW index.

Image embeddings depend on the model. ToolPiper supports image embeddings via /v1/embeddings, but quality varies significantly by model. Text-only embedding models cannot meaningfully embed images; handling text and images in the same vector space requires a multimodal model architecture.

Cloud embedding quality is still best-in-class. OpenAI's text-embedding-3-large and Cohere's embed-v3 are trained on proprietary datasets at scales that open-source models have not matched. For mission-critical retrieval where every percentage point of recall matters, cloud APIs still have an edge. For everything else, local embeddings are good enough.

How do you use this with existing tools?

ToolPiper serves an OpenAI-compatible POST /v1/embeddings endpoint. If your application uses the OpenAI Python SDK, LangChain, LlamaIndex, or any tool that calls the OpenAI embeddings API, you change the base URL to http://localhost:9998 and remove your API key. Everything else stays the same.

For MCP-connected AI assistants, the embed and rag_query tools provide direct access to local embeddings and vector search. The assistant can index documents and search them without writing any API code.

The REST API also supports raw embedding requests for custom pipelines. Send a JSON body with model and input fields, get back a vector. Standard OpenAI response format. Standard error handling. No surprises.

Try It

Download ModelPiper, install ToolPiper, and point a RAG collection at a folder. Apple NL Embedding works immediately with zero configuration. If you want higher quality, download an embedding model from HuggingFace through the model browser. Your documents never leave your Mac.

This is part of a series on local-first AI workflows on macOS. See also: Local RAG Chat for the full document Q&A pipeline built on top of local embeddings.