You loaded a 13B model. Now you want to load a 7B alongside it for a different task. Ollama doesn't tell you whether your Mac can handle both. It reads system RAM once at startup, makes a rough estimate, and proceeds. If it's wrong, macOS starts swapping to disk. Token generation drops from 30 tokens per second to two. Your fans spin up. Every app on the Mac slows down.
The problem isn't that you can't run multiple models. Apple Silicon's unified memory architecture is actually well-suited for it - the GPU and CPU share the same pool, so there's no VRAM limit to hit. The problem is visibility. Nothing in Ollama tells you how much memory each model is using, how much is left, or when you're about to cross the line.
How much memory do Ollama models actually use?
Model memory isn't just the file size on disk. A GGUF model file is compressed with quantization. In memory, the model expands, and the inference engine allocates additional buffers for context, KV cache, and computation.
Rough guidelines for common models at Q4 quantization on Apple Silicon:
Small models (1-3B): 1-2GB in memory. A sub-1B model like Qwen3 0.6B sits under 1GB. Llama 3.2 3B uses about 2GB. These are cheap to keep loaded.
Medium models (7-8B): 4-6GB in memory. Mistral 7B at Q4 uses about 4.5GB; Llama 3.1 8B is similar. This is where 8GB Macs hit their ceiling - one medium model plus macOS overhead fills the available memory.
Large models (13-14B): 9-11GB in memory. On a 16GB Mac, loading a 13B model leaves room for macOS and not much else. On 32GB, you have headroom.
XL models (30B+): 20-40GB+ in memory. These need 32GB or 64GB Macs. A 70B model at Q4 uses roughly 35-40GB. Only viable on M2/M3/M4 Max with 64GB+ or Ultra machines.
These numbers are approximate. Actual usage depends on quantization level, context length, and how the inference engine manages memory - which is exactly why you need measurement, not estimation.
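A back-of-envelope estimate still helps sanity-check those guidelines: weight memory is roughly parameter count times bits per weight, plus a runtime buffer for context and compute. The bit width and overhead figures below are illustrative assumptions, not measured values:

```python
def estimate_model_memory_gb(params_billions: float,
                             bits_per_weight: float = 4.5,
                             overhead_gb: float = 1.0) -> float:
    """Rough in-memory footprint for a quantized model.

    bits_per_weight ~4.5 approximates a 4-bit quant including its
    scale metadata; overhead_gb is an assumed buffer for KV cache and
    compute at modest context lengths.
    """
    weights_gb = params_billions * 1e9 * bits_per_weight / 8 / 1e9
    return weights_gb + overhead_gb

# A 7B model lands inside the 4-6GB guideline above:
print(round(estimate_model_memory_gb(7), 1))  # → 4.9
```

Treat the output as a floor for planning, then confirm with a real measurement once the model is loaded.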
Why doesn't Ollama show you per-model memory usage?
Ollama checks available system RAM once at startup and uses that number to decide whether a model fits. It doesn't re-check as conditions change. If you loaded a browser with 40 tabs after Ollama started, Ollama doesn't know the available memory shrank. If another model was already loaded by a different process, Ollama doesn't see it.
There's no ollama stats or ollama memory command. The ollama ps command lists loaded models with a size column, but that figure is Ollama's own estimate rather than measured resident memory. Open WebUI inherits the same blind spot - it can't report what Ollama doesn't measure.
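You can pull Ollama's own estimates programmatically from its /api/ps HTTP endpoint, which returns per-model size and size_vram fields - again, estimates, not measured resident memory. A sketch, where the sample payload is hypothetical:

```python
import json

def summarize_ollama_ps(payload: str) -> list:
    """Summarize an Ollama /api/ps response as (name, total GB, VRAM GB).

    Field names follow Ollama's /api/ps JSON; the byte counts are
    Ollama's own estimates of what each model occupies.
    """
    models = json.loads(payload).get("models", [])
    return [(m["name"], m["size"] / 1e9, m.get("size_vram", 0) / 1e9)
            for m in models]

# Hypothetical response for one loaded 3B model:
sample = ('{"models": [{"name": "llama3.2:3b", '
          '"size": 2400000000, "size_vram": 2400000000}]}')
for name, total, vram in summarize_ollama_ps(sample):
    print(f"{name}: {total:.1f}GB total, {vram:.1f}GB on GPU")
```

In a live setup you would GET http://localhost:11434/api/ps and feed the response body to the same function.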
Activity Monitor shows total process memory, but it reports the Ollama server process as one blob. If you have three models loaded, you see one combined number with no breakdown. You can't tell which model to unload to free the most space.
How does ToolPiper track per-model memory?
ToolPiper measures actual per-model memory usage through proc_pid_rusage, a macOS libproc API that reports per-process resource usage, including resident memory. Because ToolPiper manages each model as a separate llama.cpp server process, it can attribute memory to individual models precisely.
The resource monitor shows:
Per-model resident memory. Not the file size, not an estimate - the actual bytes the model occupies in RAM right now. You can see that Llama 3.2 3B is using 2.1GB while Parakeet v3 is using 480MB.
GPU vs CPU allocation. On Apple Silicon, Metal GPU acceleration handles most of the inference work, but some layers may fall back to CPU if GPU memory pressure is high. ToolPiper shows the split via IOKit GPU utilization metrics, so you know whether your model is running at full GPU speed or partially on CPU.
System RAM pressure. macOS kernel APIs report memory pressure at three levels: normal, warn, and critical. ToolPiper surfaces this as a simple indicator. When pressure reaches "warn," loading another model will likely cause swapping. You see this before the slowdown starts, not after.
Pre-load estimation. Before you load a model, ToolPiper shows its estimated memory requirement alongside your current available memory. If a model won't fit without causing pressure, you see a warning before loading - not after macOS has already started paging to disk.
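On recent macOS versions the kernel exposes the pressure level through the kern.memorystatus_vm_pressure_level sysctl, which reports 1 (normal), 2 (warn), or 4 (critical). A minimal sketch of the warn-before-load logic - the threshold choice is an assumption, not ToolPiper's actual heuristic:

```python
# Integer values reported by macOS's kern.memorystatus_vm_pressure_level
PRESSURE_LEVELS = {1: "normal", 2: "warn", 4: "critical"}

def should_warn_before_load(pressure_level: int) -> bool:
    """Warn before loading another model once pressure leaves 'normal' -
    at 'warn', a new model will likely push the system into swapping."""
    return PRESSURE_LEVELS.get(pressure_level, "unknown") != "normal"

# On a Mac you could feed this from:
#   sysctl -n kern.memorystatus_vm_pressure_level
```

The point is the ordering: the check runs before the load, while the slowdown is still avoidable.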
What model combinations actually fit on common Macs?
We measured actual memory usage on an M2 Max with 32GB running macOS Sonoma. All models at Q4 quantization unless noted.
8GB Mac (M1/M2 MacBook Air): One model is the practical limit. Llama 3.2 3B (2GB) works comfortably with headroom for macOS. A 7B model (4.5GB) fits alone but leaves tight margins - macOS needs roughly 3-4GB for itself, so 4.5GB + 3.5GB = 8GB with nothing left for your browser or other apps. Two models simultaneously isn't realistic on 8GB.
16GB Mac: Room for one large model or two or three small ones. A 7B chat model + Parakeet STT + PocketTTS (total ~5.5GB) runs a full voice chat pipeline comfortably. A 13B model alone leaves adequate headroom. Two 7B models simultaneously push to the edge.
32GB Mac: The sweet spot for multi-model workflows. A 13B chat model + 7B coding model + STT + TTS (total ~15-16GB) runs with plenty of room. Even a 30B model + small utility models fits. This is where local AI stops feeling constrained.
64GB+ Mac: Run almost any combination. 70B models become practical. Multiple large models simultaneously. Memory stops being the bottleneck - inference speed and context length become the limiting factors instead.
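The combinations above reduce to simple arithmetic: installed RAM minus macOS overhead minus a safety margin is what's left for models. The overhead and headroom figures in this sketch are illustrative assumptions drawn from the guidelines in this article:

```python
MACOS_OVERHEAD_GB = 3.5  # assumed baseline for macOS plus light app use

def fits(installed_ram_gb: float, model_sizes_gb: list,
         headroom_gb: float = 1.0) -> bool:
    """Check whether a model combination fits without memory pressure.

    Thresholds are illustrative; real headroom depends on what else
    is running (browser tabs, Slack, etc.).
    """
    available = installed_ram_gb - MACOS_OVERHEAD_GB - headroom_gb
    return sum(model_sizes_gb) <= available

# 16GB Mac: 7B chat + STT + TTS (~5.3GB total) fits comfortably:
print(fits(16, [4.5, 0.5, 0.3]))  # → True
# 8GB Mac: a 7B plus a 3B does not:
print(fits(8, [4.5, 2.0]))        # → False
```

A pre-load check like this is cheap insurance; the expensive mistake is discovering the answer after macOS starts paging.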
Practical multi-model scenarios
Coding model + chat model
A common setup: keep a coding-optimized model (like DeepSeek Coder 6.7B or CodeLlama 7B) loaded for programming tasks, and a general chat model (Mistral 7B) for everything else. Total: about 9-10GB. Fits on 16GB Macs. ToolPiper lets you switch between them from the same chat interface without waiting for the swap.
Voice chat pipeline
STT (Parakeet v3, ~500MB) + chat LLM (3B at ~2GB or 7B at ~4.5GB) + TTS (PocketTTS at ~300MB). Total: 3-5.5GB depending on chat model size. With a 3B model, 8GB Macs can manage if you keep other apps minimal. A 7B voice setup at 5.5GB needs 16GB to be comfortable - on 8GB, macOS overhead plus three models exceeds capacity. All three stay loaded for the duration of the voice session, so there's no model-swap latency between turns. See voice chat with Ollama for the full walkthrough.
RAG + chat
Embedding model (Apple NL Embedding uses zero additional memory since it's built into macOS, or a dedicated model at ~500MB) + chat LLM (7B at ~4.5GB). Total: 4.5-5GB. Comfortable on 16GB. The embedding model stays loaded for indexing and query embedding while the chat model handles generation.
When quantization is the answer
If your target model combination doesn't fit, the first lever to pull is quantization level. The same model at Q8 (8-bit) uses roughly double the memory of Q4 (4-bit). Going from Q8 to Q4 on a 7B model saves about 3-4GB with a modest quality reduction that's barely noticeable for most tasks.
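The savings are straightforward arithmetic, since weight memory scales with bits per parameter. The bit widths below (~8.5 for an 8-bit quant, ~4.5 for a 4-bit quant, both padded for scale metadata) are approximations:

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weight-only footprint at a given quantization bit width.
    billions of params * (bits / 8) bytes per param = GB."""
    return params_billions * bits_per_weight / 8

# Dropping a 7B model from ~Q8 to ~Q4 saves about 3.5GB:
savings = weight_gb(7, 8.5) - weight_gb(7, 4.5)
print(round(savings, 1))  # → 3.5
```

That matches the 3-4GB figure above; the KV cache and compute buffers shrink less, which is why total memory doesn't quite halve.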
For multi-model setups, use Q4 for the smaller utility models (coding assistant, summarizer) and reserve Q5 or Q6 for your primary chat model where quality matters most. ToolPiper's model browser shows the memory impact of each quantization option before you download.
What are the limitations of running multiple models locally?
Unified memory is shared. The GPU, CPU, and system all draw from the same memory pool on Apple Silicon. When you load models, you're reducing the memory available to macOS, your browser, and every other app. The model combinations above assume a clean system with minimal other load. Twenty browser tabs and Slack running alongside changes the math.
Context length multiplies memory. The numbers above assume default context lengths (2048-4096 tokens). Increasing context to 8192 or 16384 tokens increases KV cache memory proportionally. A 7B model at 16K context uses noticeably more RAM than the same model at 4K. If you're running multiple models with extended context, account for the additional memory.
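The KV cache grows linearly with context because the engine stores one key and one value vector per layer per token. A sketch using a Llama-2-7B-style shape (32 layers, 32 KV heads, head dimension 128, fp16 cache) as an assumed example - many newer 7-8B models use grouped-query attention with fewer KV heads, which shrinks this considerably:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 (K and V) * layers * tokens * heads * head_dim,
    at bytes_per_elem per value (2 for fp16)."""
    return (2 * layers * context_tokens * kv_heads * head_dim
            * bytes_per_elem) / 1e9

# Same model, 4K vs 16K context - the cache scales 4x with context:
print(round(kv_cache_gb(32, 32, 128, 4096), 2))   # → 2.15
print(round(kv_cache_gb(32, 32, 128, 16384), 2))  # → 8.59
```

At 16K context the cache alone rivals half the model's weight footprint, which is the "noticeably more RAM" the paragraph above describes.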
Model loading takes time. Swapping models in and out isn't instant. Loading a 7B model from disk to memory takes 3-5 seconds on fast NVMe storage. If you frequently switch between more models than fit in memory simultaneously, you'll feel the swap cost. The solution is to keep your most-used models loaded and unload the rest.
Ollama's model management is opaque. Ollama has its own model loading/unloading behavior that isn't fully controllable. It may keep models loaded after your last request, or unload them based on its own memory heuristics. When using Ollama alongside ToolPiper, the memory reported by ToolPiper's resource monitor covers ToolPiper's models accurately, but Ollama-managed models show as a single process in the system view.
Download ToolPiper at modelpiper.com and check the resource monitor before loading your next model. If you use Ollama, connect it as a provider and manage your model loading through ToolPiper's interface.
This is part of a series on Ollama frontends for Mac. See also: How AI Model Memory Works on Mac for the fundamentals of model memory on Apple Silicon.