You loaded a 13B model. Now you want to load a 7B alongside it for a different task. Ollama doesn't tell you whether your Mac can handle both. It reads system RAM once at startup, makes a rough estimate, and proceeds. If it's wrong, macOS starts swapping to disk. Token generation drops from 30 tokens per second to two. Your fans spin up. Every app on the Mac slows down.
The problem isn't that you can't run multiple models. Apple Silicon's unified memory architecture is actually well-suited for it - the GPU and CPU share the same pool, so there's no VRAM limit to hit. The problem is visibility. Nothing in Ollama tells you how much memory each model is using, how much is left, or when you're about to cross the line.
How much memory do Ollama models actually use?
Model memory isn't just the file size on disk. A GGUF model file is compressed with quantization. In memory, the model expands, and the inference engine allocates additional buffers for context, KV cache, and computation.
Rough guidelines for common models at Q4 quantization on Apple Silicon:
Small models (1-3B): 1-2GB in memory. A sub-1B model like Qwen3 0.6B sits under 1GB. Llama 3.2 3B uses about 2GB. These are cheap to keep loaded.
Medium models (7-8B): 4-6GB in memory. Mistral 7B at Q4 uses about 4.5GB; Llama 3.1 8B is similar. This is where 8GB Macs hit their ceiling - one medium model plus macOS overhead fills the available memory.
Large models (13-14B): 9-11GB in memory. On a 16GB Mac, loading a 13B model leaves room for macOS and not much else. On 32GB, you have headroom.
XL models (30B+): 20-40GB+ in memory. These need 32GB or 64GB Macs. A 70B model at Q4 uses roughly 35-40GB. Only viable on M2/M3/M4 Max with 64GB+ or Ultra machines.
These numbers are approximate. Actual usage depends on quantization level, context length, and how the inference engine manages memory - which is exactly why you need measurement, not estimation.
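A back-of-envelope estimate still helps sanity-check those guidelines: weight memory is roughly parameter count times bits per weight, plus a runtime buffer for context and compute. The bit width and overhead figures below are illustrative assumptions, not measured values:

```python
def estimate_model_memory_gb(params_billions: float,
                             bits_per_weight: float = 4.5,
                             overhead_gb: float = 1.0) -> float:
    """Rough in-memory footprint for a quantized model.

    bits_per_weight ~4.5 approximates a 4-bit quant including its
    scale metadata; overhead_gb is an assumed buffer for KV cache and
    compute at modest context lengths.
    """
    weights_gb = params_billions * 1e9 * bits_per_weight / 8 / 1e9
    return weights_gb + overhead_gb

# A 7B model lands inside the 4-6GB guideline above:
print(round(estimate_model_memory_gb(7), 1))  # → 4.9
```

Treat the output as a floor for planning, then confirm with a real measurement once the model is loaded.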
Why doesn't Ollama show you per-model memory usage?
Ollama checks available system RAM once at startup and uses that number to decide whether a model fits. It doesn't re-check as conditions change. If you loaded a browser with 40 tabs after Ollama started, Ollama doesn't know the available memory shrank. If another model was already loaded by a different process, Ollama doesn't see it.
There's no ollama stats or ollama memory command. The ollama ps command lists loaded models with a size column, but that figure is Ollama's own estimate rather than measured resident memory. Open WebUI inherits the same blind spot - it can't report what Ollama doesn't measure.
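You can pull Ollama's own estimates programmatically from its /api/ps HTTP endpoint, which returns per-model size and size_vram fields - again, estimates, not measured resident memory. A sketch, where the sample payload is hypothetical:

```python
import json

def summarize_ollama_ps(payload: str) -> list:
    """Summarize an Ollama /api/ps response as (name, total GB, VRAM GB).

    Field names follow Ollama's /api/ps JSON; the byte counts are
    Ollama's own estimates of what each model occupies.
    """
    models = json.loads(payload).get("models", [])
    return [(m["name"], m["size"] / 1e9, m.get("size_vram", 0) / 1e9)
            for m in models]

# Hypothetical response for one loaded 3B model:
sample = ('{"models": [{"name": "llama3.2:3b", '
          '"size": 2400000000, "size_vram": 2400000000}]}')
for name, total, vram in summarize_ollama_ps(sample):
    print(f"{name}: {total:.1f}GB total, {vram:.1f}GB on GPU")
```

In a live setup you would GET http://localhost:11434/api/ps and feed the response body to the same function.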
Activity Monitor shows total process memory, but it reports the Ollama server process as one blob. If you have three models loaded, you see one combined number with no breakdown. You can't tell which model to unload to free the most space.
How does ToolPiper track per-model memory?
ToolPiper measures actual per-model memory usage through proc_pid_rusage, a macOS libproc API that reports per-process resource usage, including resident memory. Because ToolPiper manages each model as a separate llama.cpp server process, it can attribute memory to individual models precisely.
The resource monitor shows:
Per-model resident memory. Not the file size, not an estimate - the actual bytes the model occupies in RAM right now. You can see that Llama 3.2 3B is using 2.1GB while Parakeet v3 is using 480MB.
GPU vs CPU allocation. On Apple Silicon, Metal GPU acceleration handles most of the inference work, but some layers may fall back to CPU if GPU memory pressure is high. ToolPiper shows the split via IOKit GPU utilization metrics, so you know whether your model is running at full GPU speed or partially on CPU.
System RAM pressure. macOS kernel APIs report memory pressure at three levels: normal, warn, and critical. ToolPiper surfaces this as a simple indicator. When pressure reaches "warn," loading another model will likely cause swapping. You see this before the slowdown starts, not after.
Pre-load estimation. Before you load a model, ToolPiper shows its estimated memory requirement alongside your current available memory. If a model won't fit without causing pressure, you see a warning before loading - not after macOS has already started paging to disk.
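On recent macOS versions the kernel exposes the pressure level through the kern.memorystatus_vm_pressure_level sysctl, which reports 1 (normal), 2 (warn), or 4 (critical). A minimal sketch of the warn-before-load logic - the threshold choice is an assumption, not ToolPiper's actual heuristic:

```python
# Integer values reported by macOS's kern.memorystatus_vm_pressure_level
PRESSURE_LEVELS = {1: "normal", 2: "warn", 4: "critical"}

def should_warn_before_load(pressure_level: int) -> bool:
    """Warn before loading another model once pressure leaves 'normal' -
    at 'warn', a new model will likely push the system into swapping."""
    return PRESSURE_LEVELS.get(pressure_level, "unknown") != "normal"

# On a Mac you could feed this from:
#   sysctl -n kern.memorystatus_vm_pressure_level
```

The point is the ordering: the check runs before the load, while the slowdown is still avoidable.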
What model combinations actually fit on common Macs?
We measured actual memory usage on an M2 Max with 32GB running macOS Sonoma. All models at Q4 quantization unless noted.
8GB Mac (M1/M2 MacBook Air): One model is the practical limit. Llama 3.2 3B (2GB) works comfortably with headroom for macOS. A 7B model (4.5GB) fits alone but leaves tight margins - macOS needs roughly 3-4GB for itself, so 4.5GB + 3.5GB = 8GB with nothing left for your browser or other apps. Two models simultaneously isn't realistic on 8GB.
16GB Mac: Room for one large model or two or three small ones. A 7B chat model + Parakeet STT + PocketTTS (total ~5.5GB) runs a full voice chat pipeline comfortably. A 13B model alone leaves adequate headroom. Two 7B models simultaneously push to the edge.
32GB Mac: The sweet spot for multi-model workflows. A 13B chat model + 7B coding model + STT + TTS (total ~15-16GB) runs with plenty of room. Even a 30B model + small utility models fits. This is where local AI stops feeling constrained.
64GB+ Mac: Run almost any combination. 70B models become practical. Multiple large models simultaneously. Memory stops being the bottleneck - inference speed and context length become the limiting factors instead.
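The combinations above reduce to simple arithmetic: installed RAM minus macOS overhead minus a safety margin is what's left for models. The overhead and headroom figures in this sketch are illustrative assumptions drawn from the guidelines in this article:

```python
MACOS_OVERHEAD_GB = 3.5  # assumed baseline for macOS plus light app use

def fits(installed_ram_gb: float, model_sizes_gb: list,
         headroom_gb: float = 1.0) -> bool:
    """Check whether a model combination fits without memory pressure.

    Thresholds are illustrative; real headroom depends on what else
    is running (browser tabs, Slack, etc.).
    """
    available = installed_ram_gb - MACOS_OVERHEAD_GB - headroom_gb
    return sum(model_sizes_gb) <= available

# 16GB Mac: 7B chat + STT + TTS (~5.3GB total) fits comfortably:
print(fits(16, [4.5, 0.5, 0.3]))  # → True
# 8GB Mac: a 7B plus a 3B does not:
print(fits(8, [4.5, 2.0]))        # → False
```

A pre-load check like this is cheap insurance; the expensive mistake is discovering the answer after macOS starts paging.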
Practical multi-model scenarios
Coding model + chat model
A common setup: keep a coding-optimized model (like DeepSeek Coder 6.7B or CodeLlama 7B) loaded for programming tasks, and a general chat model (Mistral 7B) for everything else. Total: about 9-10GB. Fits on 16GB Macs. ToolPiper lets you switch between them from the same chat interface without waiting for the swap.
Voice chat pipeline
STT (Parakeet v3, ~500MB) + chat LLM (3B at ~2GB or 7B at ~4.5GB) + TTS (PocketTTS at ~300MB). Total: 3-5.5GB depending on chat model size. With a 3B model, 8GB Macs can manage if you keep other apps minimal. A 7B voice setup at 5.5GB needs 16GB to be comfortable - on 8GB, macOS overhead plus three models exceeds capacity. All three stay loaded for the duration of the voice session, so there's no model-swap latency between turns. See voice chat with Ollama for the full walkthrough.
RAG + chat
Embedding model (Apple NL Embedding uses zero additional memory since it's built into macOS, or a dedicated model at ~500MB) + chat LLM (7B at ~4.5GB). Total: 4.5-5GB. Comfortable on 16GB. The embedding model stays loaded for indexing and query embedding while the chat model handles generation.
When quantization is the answer
If your target model combination doesn't fit, the first lever to pull is quantization level. The same model at Q8 (8-bit) uses roughly double the memory of Q4 (4-bit). Going from Q8 to Q4 on a 7B model saves about 3-4GB with a modest quality reduction that's barely noticeable for most tasks.
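The savings are straightforward arithmetic, since weight memory scales with bits per parameter. The bit widths below (~8.5 for an 8-bit quant, ~4.5 for a 4-bit quant, both padded for scale metadata) are approximations:

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weight-only footprint at a given quantization bit width.
    billions of params * (bits / 8) bytes per param = GB."""
    return params_billions * bits_per_weight / 8

# Dropping a 7B model from ~Q8 to ~Q4 saves about 3.5GB:
savings = weight_gb(7, 8.5) - weight_gb(7, 4.5)
print(round(savings, 1))  # → 3.5
```

That matches the 3-4GB figure above; the KV cache and compute buffers shrink less, which is why total memory doesn't quite halve.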
For multi-model setups, use Q4 for the smaller utility models (coding assistant, summarizer) and reserve Q5 or Q6 for your primary chat model where quality matters most. ToolPiper's model browser shows the memory impact of each quantization option before you download.
What are the limitations of running multiple models locally?
Unified memory is shared. The GPU, CPU, and system all draw from the same memory pool on Apple Silicon. When you load models, you're reducing the memory available to macOS, your browser, and every other app. The model combinations above assume a clean system with minimal other load. Twenty browser tabs and Slack running alongside changes the math.
Context length multiplies memory. The numbers above assume default context lengths (2048-4096 tokens). Increasing context to 8192 or 16384 tokens increases KV cache memory proportionally. A 7B model at 16K context uses noticeably more RAM than the same model at 4K. If you're running multiple models with extended context, account for the additional memory.
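The KV cache grows linearly with context because the engine stores one key and one value vector per layer per token. A sketch using a Llama-2-7B-style shape (32 layers, 32 KV heads, head dimension 128, fp16 cache) as an assumed example - many newer 7-8B models use grouped-query attention with fewer KV heads, which shrinks this considerably:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 (K and V) * layers * tokens * heads * head_dim,
    at bytes_per_elem per value (2 for fp16)."""
    return (2 * layers * context_tokens * kv_heads * head_dim
            * bytes_per_elem) / 1e9

# Same model, 4K vs 16K context - the cache scales 4x with context:
print(round(kv_cache_gb(32, 32, 128, 4096), 2))   # → 2.15
print(round(kv_cache_gb(32, 32, 128, 16384), 2))  # → 8.59
```

At 16K context the cache alone rivals half the model's weight footprint, which is the "noticeably more RAM" the paragraph above describes.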
Model loading takes time. Swapping models in and out isn't instant. Loading a 7B model from disk to memory takes 3-5 seconds on fast NVMe storage. If you frequently switch between more models than fit in memory simultaneously, you'll feel the swap cost. The solution is to keep your most-used models loaded and unload the rest.
Ollama's model management is opaque. Ollama has its own model loading/unloading behavior that isn't fully controllable. It may keep models loaded after your last request, or unload them based on its own memory heuristics. When using Ollama alongside ToolPiper, the memory reported by ToolPiper's resource monitor covers ToolPiper's models accurately, but Ollama-managed models show as a single process in the system view.
Download ToolPiper at modelpiper.com and check the resource monitor before loading your next model. If you use Ollama, connect it as a provider and manage your model loading through ToolPiper's interface.
This is part of a series on Ollama frontends for Mac. See also: How AI Model Memory Works on Mac for the fundamentals of model memory on Apple Silicon.