"Ollama out of memory" is the most common complaint on r/LocalLLaMA. You download a 7B model, launch it, and your Mac slows to a crawl. The fans spin up. Safari tabs reload. The model either crashes or starts producing garbled output. You check Activity Monitor and see memory pressure pegged in the red.
The problem is not your hardware. The problem is that no tool tells you what is actually happening inside your RAM, or warns you before things go wrong.
Why does Apple Silicon handle AI memory differently?
On a traditional PC with a discrete GPU, you have two separate memory pools. System RAM for your apps, and VRAM on the graphics card for the model. They do not compete. If your GPU has 12GB of VRAM, you know exactly how much model fits.
Apple Silicon works differently. The CPU, GPU, and Neural Engine all share one unified memory pool. A MacBook Pro with 18GB of RAM has 18GB total for everything: your OS, your browser tabs, your running apps, and any AI models you load. There is no separate VRAM.
This is genuinely an advantage. It means a Mac with 32GB of unified memory can load a model that would require a $1,200 GPU on a PC. There is no PCIe bottleneck copying data between CPU and GPU memory. The model sits in one place and every processor can access it directly.
But it also means AI models compete with everything else on your Mac for the same finite pool of RAM. Load a model that is too large, and macOS starts swapping to disk. Swap means your entire system slows down, not just the model.
How much memory does a language model actually use?
The simplest rule of thumb: GGUF file size on disk is roughly equal to RAM needed. A Q4_K_M quantized 7B model is about 4.2GB on disk, and it will occupy roughly 4.2GB in memory when loaded. A 13B model at Q4 is about 7.4GB. A 70B model at Q4 is around 40GB.
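The rule of thumb above can be sketched as a quick estimator. The bits-per-weight figures are approximations for common llama.cpp quantization schemes (mixed schemes like Q4_K_M average a bit under 5 bits per weight because some tensors stay at higher precision); treat the output as a ballpark, not a guarantee.

```python
# Rough RAM estimate from parameter count and quantization scheme.
# Bits-per-weight values are approximate averages for llama.cpp quants,
# not exact figures for any particular model file.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.69,
    "Q8_0": 8.5,
    "F16": 16.0,
}

def estimate_weight_gb(params_billions: float, quant: str) -> float:
    """Approximate on-disk (and in-RAM) size of the model weights in GB."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billions * bits / 8  # billions of params * bytes per param

print(round(estimate_weight_gb(7, "Q4_K_M"), 1))   # ~4.2 for a 7B at Q4_K_M
print(round(estimate_weight_gb(70, "Q4_K_M"), 1))  # ~42, in line with "around 40GB"
```

The estimate drifts a little per model family (tokenizer vocab size, embedding sharing), which is why measured runtime profiles beat file-size math.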
But that is not the whole story. When you actually run inference, the model allocates additional memory for three things:
KV cache. This is where the model stores the context of your conversation. It grows with context length. A 7B model with a 4K context window allocates roughly 256MB of KV cache. Bump that to 8K context and it doubles. Some models advertise 32K or even 128K context, and the KV cache for those can exceed the model weights themselves.
Scratch buffers. llama.cpp (the engine that Ollama, LM Studio, and ToolPiper all use under the hood) allocates temporary buffers for computation. These are typically 100-500MB depending on model architecture and batch size.
OS overhead. macOS itself, Finder, Spotlight indexing, your browser, Slack, Xcode - all of these are using RAM simultaneously. A fresh macOS boot with a browser open and a few apps running typically consumes 6-10GB before you load any model.
So the real calculation for a 7B model on a 16GB Mac looks like this: 4.2GB model weights + 0.3GB KV cache + 0.3GB scratch + 8GB existing usage = 12.8GB. That leaves about 3GB of headroom. Comfortable, but not generous. Open a few more browser tabs or launch Xcode and you are in swap territory.
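The worked example above can be written out as code. The KV cache formula is the standard llama.cpp accounting (two tensors, keys and values, per layer per position); the specific parameters here assume a 7B-class model with grouped-query attention and an 8-bit cache, which is one plausible way to land near the 256MB figure. Other architectures and cache precisions will give different numbers.

```python
# Total memory = model weights + KV cache + scratch buffers + whatever the
# rest of the system is already using. KV cache parameters below assume a
# GQA model (8 KV heads of dim 128, 32 layers) with a 1-byte cache element;
# these are illustrative, not universal.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 1) -> float:
    """Keys and values for every layer at every position: 2 tensors each."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

def headroom_gb(weights_gb, kv_gb, scratch_gb, existing_gb, total_ram_gb):
    """What is left after the model and the rest of the system take their cut."""
    return total_ram_gb - (weights_gb + kv_gb + scratch_gb + existing_gb)

kv = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, ctx_len=4096)
print(round(kv, 2))                                     # ~0.27 GB
print(round(headroom_gb(4.2, kv, 0.3, 8.0, 16.0), 1))   # ~3.2 GB of headroom
```

Note that the formula is linear in context length, which is exactly why doubling the context window doubles the cache.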
What is memory pressure and why does it matter?
macOS does not crash when you run out of RAM. Instead, it starts compressing memory pages and swapping them to your SSD. The system tracks how hard it is working through a metric called memory pressure, which you can see in Activity Monitor under the Memory tab.
Green means comfortable. Yellow means the system is compressing and may start swapping. Red means active swapping, which makes everything slow.
The critical thing to understand: memory pressure is not the same as memory usage. You can have 15 out of 16GB "used" and still be in green if the system has efficiently compressed inactive pages. Or you can have 12GB used and be in yellow because the active working set is fragmented and hard to compress.
This is why checking your model's file size is not enough. A model that fits on paper can still push your system into pressure if the access pattern is unfavorable, or if your other apps suddenly need more memory.
Most AI tools completely ignore memory pressure. They read total system RAM once, compare it to model size, and call it a day. That approach misses the dynamic reality of how macOS actually manages memory.
How do existing tools handle memory?
Ollama reads system RAM at startup and does basic math: if the model file is smaller than available RAM, it loads. It does not refresh this measurement. It cannot detect that Chrome just allocated another 2GB for a heavy web app. If memory pressure spikes from another process, Ollama has no idea. Your model just gets slower, and the only feedback you get is that token generation drops from 30 tok/s to 3 tok/s with no explanation.
LM Studio actually had a CPU/GPU status bar in version 4.0 that showed real-time utilization. Then they removed it in a redesign. Users have been filing bugs asking for it back (issue #1422 on their GitHub). As of this writing, there is no real-time resource monitoring in LM Studio. You get a warning if a model is too large to load, but nothing once it is running.
Open WebUI has zero resource monitoring. This is one of their most-requested features. GitHub discussions #7136, #11469, #2558, and #8825 all ask for some form of memory or GPU visibility. The responses amount to "use nvidia-smi" or "check Activity Monitor" - external tools that require you to leave the app and interpret raw system data.
The pattern across all three is the same: the tools that run AI models on your Mac do not tell you what those models are doing to your system resources. You are flying blind, and the first sign of trouble is your entire Mac grinding to a halt.
What is resource intelligence?
ToolPiper takes a different approach. Instead of treating memory as a static number checked once at startup, it continuously monitors three dimensions of resource usage and responds to changes in real time. We call this resource intelligence, and it is built into the core of how ToolPiper manages models.
The segmented memory bar is a real-time display that shows exactly how much RAM each loaded model consumes. Not estimated from file size, but measured from actual runtime memory via proc_pid_rusage, EMA-averaged over time and cross-validated against total system RAM. It updates every 3 seconds over WebSocket, and it includes GPU utilization alongside memory. If you have an STT model, an LLM, and a TTS model loaded simultaneously for a voice chat pipeline, you see three distinct color-coded segments with their individual memory footprints. You know at a glance where your RAM is going.
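The EMA averaging mentioned above can be sketched in a few lines. This is a generic exponential-moving-average smoother, not ToolPiper's actual implementation: the smoothing factor is illustrative, and the raw readings (which ToolPiper gets from proc_pid_rusage) are just passed in here as plain numbers.

```python
# A minimal EMA smoother for noisy per-model memory readings, assuming a
# new raw sample arrives every few seconds. `alpha` controls how quickly
# the displayed value follows the latest reading; 0.3 is an arbitrary choice.
class EmaMemorySample:
    def __init__(self, alpha: float = 0.3):
        self.alpha = alpha
        self.value = None  # smoothed footprint in GB

    def update(self, raw_gb: float) -> float:
        if self.value is None:
            self.value = raw_gb  # first sample: no history to blend with
        else:
            self.value = self.alpha * raw_gb + (1 - self.alpha) * self.value
        return self.value

ema = EmaMemorySample()
for reading in [4.1, 4.6, 4.2, 4.3]:  # noisy raw readings in GB
    smoothed = ema.update(reading)
print(round(smoothed, 2))  # → 4.25: spikes are damped, the trend survives
```

The point of smoothing here is that a single noisy sample should not make a segment in the bar jump around; cross-validating the sum of segments against total system RAM then catches systematic error.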
The readiness dialog appears before loading multi-model pipelines. When you launch a voice chat workflow that needs a speech-to-text model, a language model, and a text-to-speech model, ToolPiper does not just start loading blindly. It shows a preflight check: each model listed with its projected memory cost, a budget bar showing total projected usage against your available RAM, and an eviction forecast that identifies which currently-loaded models would need to be unloaded first. The dialog walks through three phases - review, loading, done - so you see exactly what is happening at each step. You approve the plan before anything loads.
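The budget-and-forecast logic behind that dialog can be sketched as follows. This is a simplified model of the behavior described above, not ToolPiper's code; the model names and memory figures are hypothetical, and eviction order is plain least-recently-used.

```python
# A sketch of the preflight check: sum the pipeline's projected memory
# costs, compare against available RAM, and forecast which currently
# loaded models would have to be evicted (LRU first) to make room.
def preflight(pipeline, loaded, available_gb):
    """pipeline: [(name, projected_gb)]; loaded: [(name, gb, last_used)]."""
    need = sum(gb for _, gb in pipeline)
    evict = []
    # Walk loaded models from least to most recently used, freeing memory
    # until the pipeline fits or nothing is left to evict.
    for name, gb, _ in sorted(loaded, key=lambda m: m[2]):
        if need <= available_gb:
            break
        evict.append(name)
        available_gb += gb
    return {"fits": need <= available_gb, "evict": evict}

plan = preflight(
    pipeline=[("whisper-small", 1.0), ("llama-7b-q4", 4.8), ("kokoro-tts", 0.6)],
    loaded=[("qwen-4b", 2.8, 100), ("llama-13b-q4", 8.1, 200)],
    available_gb=5.0,
)
print(plan)  # → {'fits': True, 'evict': ['qwen-4b']}
```

The useful property is that the user sees this plan, evictions included, before any model starts loading, instead of discovering mid-load that the pipeline does not fit.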
Memory pressure awareness is the piece no other tool has. ToolPiper registers a macOS DispatchSource for memory pressure notifications, the same kernel-level API that macOS itself uses to manage the system. When pressure rises to warning or critical levels, ToolPiper does not wait for you to notice your Mac slowing down. It automatically evicts the least-recently-used model from memory. If you were running three models and pressure spikes because you opened a large project in Xcode, the model you have not used in the longest time gets unloaded instantly. Your Mac stays responsive. The model reloads on demand when you need it again. This entire cycle is invisible unless you watch the memory bar.
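The eviction cycle described above can be sketched as a pressure callback over an LRU pool. ToolPiper wires this to a macOS DispatchSource memory-pressure source; in this sketch the pressure event is simulated, the "clock" is a simple counter, and unloading a model is just removing a dictionary entry.

```python
from itertools import count
from enum import Enum

class Pressure(Enum):
    NORMAL = 1
    WARNING = 2
    CRITICAL = 3

class ModelPool:
    """Tracks loaded models by recency; evicts the LRU one under pressure."""
    _clock = count()  # deterministic stand-in for a monotonic clock

    def __init__(self):
        self.models = {}  # name -> last-used tick

    def touch(self, name):
        """Record that a model was just used (or loaded)."""
        self.models[name] = next(self._clock)

    def on_pressure(self, level: Pressure):
        """Called when the (simulated) pressure notification fires."""
        if level is Pressure.NORMAL or not self.models:
            return None
        lru = min(self.models, key=self.models.get)  # least recently used
        del self.models[lru]  # unload; it reloads on demand later
        return lru

pool = ModelPool()
for name in ["stt", "llm", "tts"]:
    pool.touch(name)
pool.touch("llm")  # llm used again, so "stt" is now the stalest
print(pool.on_pressure(Pressure.WARNING))  # → stt
```

The real implementation also has to stop in-flight requests to the evicted model and free its buffers, but the policy itself is this simple: recency decides who goes.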
How does a RAM-aware model catalog work?
Ollama shows you every model in its library regardless of whether your Mac can run it. You can download a 40GB model on a 16GB machine, wait 20 minutes for it to finish, and only discover it will not load when you try to run it. That is a terrible experience, and it happens constantly to new users.
ToolPiper's model catalog is filtered server-side based on your actual available memory. Models that are too large for your Mac simply do not appear in the list. You never download something you cannot run. If your available RAM changes - because you loaded another model, or launched a heavy app, or plugged in fewer external displays - the catalog updates to reflect what still fits.
This filtering uses measured runtime profiles, not file-size estimation. A model that is 4GB on disk but needs 5.2GB at runtime with KV cache and scratch buffers is shown with its true cost. The number you see is the number that matters, not an optimistic figure from the GGUF header.
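The filtering step reduces to a simple predicate over measured runtime costs. This sketch uses hypothetical catalog entries and an arbitrary safety margin; the real catalog is served dynamically and re-filtered as available RAM changes.

```python
# Server-side catalog filtering: each entry carries a measured runtime
# footprint (weights + KV cache + scratch), and anything that does not fit
# in currently available RAM is dropped before the list reaches the user.
CATALOG = [
    {"name": "llama-3b-q4", "runtime_gb": 2.4},
    {"name": "llama-7b-q4", "runtime_gb": 5.2},   # ~4GB on disk, 5.2GB running
    {"name": "llama-70b-q4", "runtime_gb": 44.0},
]

def visible_models(available_gb: float, margin_gb: float = 1.0):
    """Keep models whose measured runtime cost fits, with a safety margin."""
    budget = available_gb - margin_gb
    return [m["name"] for m in CATALOG if m["runtime_gb"] <= budget]

print(visible_models(available_gb=7.0))  # → ['llama-3b-q4', 'llama-7b-q4']
```

Because the predicate runs against available memory rather than installed memory, the 70B entry simply never renders on a 16GB Mac, and the 7B entry disappears the moment something else claims the RAM it would need.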
What does this mean in practice?
Here are concrete recommendations based on real measurements:
8GB Mac (M1 MacBook Air, base M2). You can comfortably run one small model at a time. Qwen 3.5 0.8B or Llama 3.2 1B leave enough headroom for normal app usage. Do not attempt 7B models on 8GB. They will technically load, but your system will swap constantly and token generation will be painfully slow.
16GB Mac (most MacBook Pros, Mac Mini). The sweet spot for a single 7B model. Llama 3.2 3B or Qwen 3.5 4B run with plenty of room to spare. A full 7B model at Q4 works but leaves thin margins, around 3GB of headroom. Multi-model pipelines (STT + LLM + TTS for voice chat) are feasible with smaller models, but this is exactly where ToolPiper's readiness dialog earns its keep by calculating whether the combination fits before loading.
32GB Mac. Comfortable for multi-model workflows. You can run voice chat with three models loaded simultaneously without any memory pressure. A single 13B model runs well with room left for your apps. You can load a 30B model at Q4 with moderate headroom.
64GB Mac or higher. You can run 70B models, multiple large models simultaneously, and long context windows (32K+) without concern. This is where the unified memory advantage of Apple Silicon really pays off. An equivalent setup on PC would require a GPU costing as much as an entire Mac Mini.
Try it
Download and install ToolPiper. Load a model and watch the memory bar show you exactly where your RAM is going. Try loading a second model and see the readiness dialog calculate whether it fits. Push your system and watch the pressure-aware eviction keep your Mac responsive.
No other local AI tool on macOS shows you this level of detail about what is happening inside your memory. Most do not even try.
This is part of a series on local-first AI workflows on macOS. Next up: Private Local Chat, where we cover running ChatGPT-quality conversations entirely on your Mac.