"Ollama out of memory" is the most common complaint on r/LocalLLaMA. You download a 7B model, launch it, and your Mac slows to a crawl. The fans spin up. Safari tabs reload. The model either crashes or starts producing garbled output. You check Activity Monitor and see memory pressure pegged in the red.
The problem is not your hardware. The problem is that no tool tells you what is actually happening inside your RAM, or warns you before things go wrong.
Why does Apple Silicon handle AI memory differently?
On a traditional PC with a discrete GPU, you have two separate memory pools. System RAM for your apps, and VRAM on the graphics card for the model. They do not compete. If your GPU has 12GB of VRAM, you know exactly how much model fits.
Apple Silicon works differently. The CPU, GPU, and Neural Engine all share one unified memory pool. A MacBook Pro with 18GB of RAM has 18GB total for everything: your OS, your browser tabs, your running apps, and any AI models you load. There is no separate VRAM.
This is genuinely an advantage. It means a Mac with 32GB of unified memory can load a model that would require a $1,200 GPU on a PC. There is no PCIe bottleneck copying data between CPU and GPU memory. The model sits in one place and every processor can access it directly.
But it also means AI models compete with everything else on your Mac for the same finite pool of RAM. Load a model that is too large, and macOS starts swapping to disk. Swap means your entire system slows down, not just the model.
How much memory does a language model actually use?
The simplest rule of thumb: GGUF file size on disk is roughly equal to RAM needed. A Q4_K_M quantized 7B model is about 4.2GB on disk, and it will occupy roughly 4.2GB in memory when loaded. A 13B model at Q4 is about 7.4GB. A 70B model at Q4 is around 40GB.
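The rule of thumb above can be sketched as a quick estimator. The bits-per-weight figures are approximations for common llama.cpp quantization schemes (mixed schemes like Q4_K_M average a bit under 5 bits per weight because some tensors stay at higher precision); treat the output as a ballpark, not a guarantee.

```python
# Rough RAM estimate from parameter count and quantization scheme.
# Bits-per-weight values are approximate averages for llama.cpp quants,
# not exact figures for any particular model file.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.69,
    "Q8_0": 8.5,
    "F16": 16.0,
}

def estimate_weight_gb(params_billions: float, quant: str) -> float:
    """Approximate on-disk (and in-RAM) size of the model weights in GB."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billions * bits / 8  # billions of params * bytes per param

print(round(estimate_weight_gb(7, "Q4_K_M"), 1))   # ~4.2 for a 7B at Q4_K_M
print(round(estimate_weight_gb(70, "Q4_K_M"), 1))  # ~42, in line with "around 40GB"
```

The estimate drifts a little per model family (tokenizer vocab size, embedding sharing), which is why measured runtime profiles beat file-size math.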
But that is not the whole story. When you actually run inference, the model allocates additional memory for three things:
KV cache. This is where the model stores the context of your conversation. It grows with context length. A 7B model with a 4K context window allocates roughly 256MB of KV cache. Bump that to 8K context and it doubles. Some models advertise 32K or even 128K context, and the KV cache for those can exceed the model weights themselves.
Scratch buffers. llama.cpp (the engine that Ollama, LM Studio, and ToolPiper all use under the hood) allocates temporary buffers for computation. These are typically 100-500MB depending on model architecture and batch size.
OS overhead. macOS itself, Finder, Spotlight indexing, your browser, Slack, Xcode - all of these are using RAM simultaneously. A fresh macOS boot with a browser open and a few apps running typically consumes 6-10GB before you load any model.
So the real calculation for a 7B model on a 16GB Mac looks like this: 4.2GB model weights + 0.3GB KV cache + 0.3GB scratch + 8GB existing usage = 12.8GB. That leaves about 3GB of headroom. Comfortable, but not generous. Open a few more browser tabs or launch Xcode and you are in swap territory.
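The worked example above can be written out as code. The KV cache formula is the standard llama.cpp accounting (two tensors, keys and values, per layer per position); the specific parameters here assume a 7B-class model with grouped-query attention and an 8-bit cache, which is one plausible way to land near the 256MB figure. Other architectures and cache precisions will give different numbers.

```python
# Total memory = model weights + KV cache + scratch buffers + whatever the
# rest of the system is already using. KV cache parameters below assume a
# GQA model (8 KV heads of dim 128, 32 layers) with a 1-byte cache element;
# these are illustrative, not universal.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 1) -> float:
    """Keys and values for every layer at every position: 2 tensors each."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

def headroom_gb(weights_gb, kv_gb, scratch_gb, existing_gb, total_ram_gb):
    """What is left after the model and the rest of the system take their cut."""
    return total_ram_gb - (weights_gb + kv_gb + scratch_gb + existing_gb)

kv = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, ctx_len=4096)
print(round(kv, 2))                                     # ~0.27 GB
print(round(headroom_gb(4.2, kv, 0.3, 8.0, 16.0), 1))   # ~3.2 GB of headroom
```

Note that the formula is linear in context length, which is exactly why doubling the context window doubles the cache.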
What is memory pressure and why does it matter?
macOS does not crash when you run out of RAM. Instead, it starts compressing memory pages and swapping them to your SSD. The system tracks how hard it is working through a metric called memory pressure, which you can see in Activity Monitor under the Memory tab.
Green means comfortable. Yellow means the system is compressing and may start swapping. Red means active swapping, which makes everything slow.
The critical thing to understand: memory pressure is not the same as memory usage. You can have 15 out of 16GB "used" and still be in green if the system has efficiently compressed inactive pages. Or you can have 12GB used and be in yellow because the active working set is fragmented and hard to compress.
This is why checking your model's file size is not enough. A model that fits on paper can still push your system into pressure if the access pattern is unfavorable, or if your other apps suddenly need more memory.
Most AI tools completely ignore memory pressure. They read total system RAM once, compare it to model size, and call it a day. That approach misses the dynamic reality of how macOS actually manages memory.
How do existing tools handle memory?
Ollama reads system RAM at startup and does basic math: if the model file is smaller than available RAM, it loads. It does not refresh this measurement. It cannot detect that Chrome just allocated another 2GB for a heavy web app. If memory pressure spikes from another process, Ollama has no idea. Your model just gets slower, and the only feedback you get is that token generation drops from 30 tok/s to 3 tok/s with no explanation.
LM Studio actually had a CPU/GPU status bar in version 4.0 that showed real-time utilization. Then they removed it in a redesign. Users have been filing bugs asking for it back (issue #1422 on their GitHub). As of this writing, there is no real-time resource monitoring in LM Studio. You get a warning if a model is too large to load, but nothing once it is running.
Open WebUI has zero resource monitoring. This is one of their most-requested features. GitHub discussions #7136, #11469, #2558, and #8825 all ask for some form of memory or GPU visibility. The responses amount to "use nvidia-smi" or "check Activity Monitor" - external tools that require you to leave the app and interpret raw system data.
The pattern across all three is the same: the tools that run AI models on your Mac do not tell you what those models are doing to your system resources. You are flying blind, and the first sign of trouble is your entire Mac grinding to a halt.
What is resource intelligence?
ToolPiper takes a different approach. Instead of treating memory as a static number checked once at startup, it continuously monitors three dimensions of resource usage and responds to changes in real time. We call this resource intelligence, and it is built into the core of how ToolPiper manages models.
The segmented memory bar is a real-time display that shows exactly how much RAM each loaded model consumes. Not estimated from file size, but measured from actual runtime memory via proc_pid_rusage, EMA-averaged over time and cross-validated against total system RAM. It updates every 3 seconds over WebSocket, and it includes GPU utilization alongside memory. If you have an STT model, an LLM, and a TTS model loaded simultaneously for a voice chat pipeline, you see three distinct color-coded segments with their individual memory footprints. You know at a glance where your RAM is going.
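The EMA averaging mentioned above can be sketched in a few lines. This is a generic exponential-moving-average smoother, not ToolPiper's actual implementation: the smoothing factor is illustrative, and the raw readings (which ToolPiper gets from proc_pid_rusage) are just passed in here as plain numbers.

```python
# A minimal EMA smoother for noisy per-model memory readings, assuming a
# new raw sample arrives every few seconds. `alpha` controls how quickly
# the displayed value follows the latest reading; 0.3 is an arbitrary choice.
class EmaMemorySample:
    def __init__(self, alpha: float = 0.3):
        self.alpha = alpha
        self.value = None  # smoothed footprint in GB

    def update(self, raw_gb: float) -> float:
        if self.value is None:
            self.value = raw_gb  # first sample: no history to blend with
        else:
            self.value = self.alpha * raw_gb + (1 - self.alpha) * self.value
        return self.value

ema = EmaMemorySample()
for reading in [4.1, 4.6, 4.2, 4.3]:  # noisy raw readings in GB
    smoothed = ema.update(reading)
print(round(smoothed, 2))  # → 4.25: spikes are damped, the trend survives
```

The point of smoothing here is that a single noisy sample should not make a segment in the bar jump around; cross-validating the sum of segments against total system RAM then catches systematic error.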
The readiness dialog appears before loading multi-model pipelines. When you launch a voice chat workflow that needs a speech-to-text model, a language model, and a text-to-speech model, ToolPiper does not just start loading blindly. It shows a preflight check: each model listed with its projected memory cost, a budget bar showing total projected usage against your available RAM, and an eviction forecast that identifies which currently-loaded models would need to be unloaded first. The dialog walks through three phases - review, loading, done - so you see exactly what is happening at each step. You approve the plan before anything loads.
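The budget-and-forecast logic behind that dialog can be sketched as follows. This is a simplified model of the behavior described above, not ToolPiper's code; the model names and memory figures are hypothetical, and eviction order is plain least-recently-used.

```python
# A sketch of the preflight check: sum the pipeline's projected memory
# costs, compare against available RAM, and forecast which currently
# loaded models would have to be evicted (LRU first) to make room.
def preflight(pipeline, loaded, available_gb):
    """pipeline: [(name, projected_gb)]; loaded: [(name, gb, last_used)]."""
    need = sum(gb for _, gb in pipeline)
    evict = []
    # Walk loaded models from least to most recently used, freeing memory
    # until the pipeline fits or nothing is left to evict.
    for name, gb, _ in sorted(loaded, key=lambda m: m[2]):
        if need <= available_gb:
            break
        evict.append(name)
        available_gb += gb
    return {"fits": need <= available_gb, "evict": evict}

plan = preflight(
    pipeline=[("whisper-small", 1.0), ("llama-7b-q4", 4.8), ("kokoro-tts", 0.6)],
    loaded=[("qwen-4b", 2.8, 100), ("llama-13b-q4", 8.1, 200)],
    available_gb=5.0,
)
print(plan)  # → {'fits': True, 'evict': ['qwen-4b']}
```

The useful property is that the user sees this plan, evictions included, before any model starts loading, instead of discovering mid-load that the pipeline does not fit.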
Memory pressure awareness is the piece no other tool has. ToolPiper registers a macOS DispatchSource for memory pressure notifications, the same kernel-level API that macOS itself uses to manage the system. When pressure rises to warning or critical levels, ToolPiper does not wait for you to notice your Mac slowing down. It automatically evicts the least-recently-used model from memory. If you were running three models and pressure spikes because you opened a large project in Xcode, the model you have not used in the longest time gets unloaded instantly. Your Mac stays responsive. The model reloads on demand when you need it again. This entire cycle is invisible unless you watch the memory bar.
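The eviction cycle described above can be sketched as a pressure callback over an LRU pool. ToolPiper wires this to a macOS DispatchSource memory-pressure source; in this sketch the pressure event is simulated, the "clock" is a simple counter, and unloading a model is just removing a dictionary entry.

```python
from itertools import count
from enum import Enum

class Pressure(Enum):
    NORMAL = 1
    WARNING = 2
    CRITICAL = 3

class ModelPool:
    """Tracks loaded models by recency; evicts the LRU one under pressure."""
    _clock = count()  # deterministic stand-in for a monotonic clock

    def __init__(self):
        self.models = {}  # name -> last-used tick

    def touch(self, name):
        """Record that a model was just used (or loaded)."""
        self.models[name] = next(self._clock)

    def on_pressure(self, level: Pressure):
        """Called when the (simulated) pressure notification fires."""
        if level is Pressure.NORMAL or not self.models:
            return None
        lru = min(self.models, key=self.models.get)  # least recently used
        del self.models[lru]  # unload; it reloads on demand later
        return lru

pool = ModelPool()
for name in ["stt", "llm", "tts"]:
    pool.touch(name)
pool.touch("llm")  # llm used again, so "stt" is now the stalest
print(pool.on_pressure(Pressure.WARNING))  # → stt
```

The real implementation also has to stop in-flight requests to the evicted model and free its buffers, but the policy itself is this simple: recency decides who goes.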
How does a RAM-aware model catalog work?
Ollama shows you every model in its library regardless of whether your Mac can run it. You can download a 40GB model on a 16GB machine, wait 20 minutes for it to finish, and only discover it will not load when you try to run it. That is a terrible experience, and it happens constantly to new users.
ToolPiper's model catalog is filtered server-side based on your actual available memory. Models that are too large for your Mac simply do not appear in the list. You never download something you cannot run. If your available RAM changes - because you loaded another model, or launched a heavy app, or plugged in fewer external displays - the catalog updates to reflect what still fits.
This filtering uses measured runtime profiles, not file-size estimation. A model that is 4GB on disk but needs 5.2GB at runtime with KV cache and scratch buffers is shown with its true cost. The number you see is the number that matters, not an optimistic figure from the GGUF header.
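The filtering step reduces to a simple predicate over measured runtime costs. This sketch uses hypothetical catalog entries and an arbitrary safety margin; the real catalog is served dynamically and re-filtered as available RAM changes.

```python
# Server-side catalog filtering: each entry carries a measured runtime
# footprint (weights + KV cache + scratch), and anything that does not fit
# in currently available RAM is dropped before the list reaches the user.
CATALOG = [
    {"name": "llama-3b-q4", "runtime_gb": 2.4},
    {"name": "llama-7b-q4", "runtime_gb": 5.2},   # ~4GB on disk, 5.2GB running
    {"name": "llama-70b-q4", "runtime_gb": 44.0},
]

def visible_models(available_gb: float, margin_gb: float = 1.0):
    """Keep models whose measured runtime cost fits, with a safety margin."""
    budget = available_gb - margin_gb
    return [m["name"] for m in CATALOG if m["runtime_gb"] <= budget]

print(visible_models(available_gb=7.0))  # → ['llama-3b-q4', 'llama-7b-q4']
```

Because the predicate runs against available memory rather than installed memory, the 70B entry simply never renders on a 16GB Mac, and the 7B entry disappears the moment something else claims the RAM it would need.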
What does this mean in practice?
Here are concrete recommendations based on real measurements:
8GB Mac (M1 MacBook Air, base M2). You can comfortably run one small model at a time. Qwen 3.5 0.8B or Llama 3.2 1B leave enough headroom for normal app usage. Do not attempt 7B models on 8GB. They will technically load, but your system will swap constantly and token generation will be painfully slow.
16GB Mac (most MacBook Pros, Mac Mini). The sweet spot for a single 7B model. Llama 3.2 3B or Qwen 3.5 4B run with plenty of room to spare. A full 7B model at Q4 works but leaves thin margins, around 3GB of headroom. Multi-model pipelines (STT + LLM + TTS for voice chat) are feasible with smaller models, but this is exactly where ToolPiper's readiness dialog earns its keep by calculating whether the combination fits before loading.
32GB Mac. Comfortable for multi-model workflows. You can run voice chat with three models loaded simultaneously without any memory pressure. A single 13B model runs well with room left for your apps. You can load a 30B model at Q4 with moderate headroom.
64GB Mac or higher. You can run 70B models, multiple large models simultaneously, and long context windows (32K+) without concern. This is where the unified memory advantage of Apple Silicon really pays off. An equivalent setup on PC would require a GPU costing as much as an entire Mac Mini.
Try it
Download and install ToolPiper. Load a model and watch the memory bar show you exactly where your RAM is going. Try loading a second model and see the readiness dialog calculate whether it fits. Push your system and watch the pressure-aware eviction keep your Mac responsive.
No other local AI tool on macOS shows you this level of detail about what is happening inside your memory. Most do not even try.
This is part of a series on local-first AI workflows on macOS. Next up: Private Local Chat, where we cover running ChatGPT-quality conversations entirely on your Mac.