You've decided to run AI locally on your Mac. Good call. No API costs, no rate limits, no data leaving your machine. Then you open HuggingFace and find 400,000 models. Llama, Qwen, Mistral, Phi, Gemma - each with a dozen variants in different sizes and quantization formats. Which one do you actually download?

This is the decision tree nobody gives you. Not which model is "best" in some abstract benchmark, but which model is right for your specific Mac, your available RAM, and the tasks you actually do.

How do local LLMs work on Apple Silicon?

Before picking a model, it helps to understand what's happening under the hood. Local LLMs run on your Mac's unified memory architecture, where the GPU and CPU share the same RAM pool. This is fundamentally different from PC setups where models need to fit inside a dedicated GPU's VRAM.

On a Mac, the model loads into your unified memory and the Metal GPU handles inference. The practical implication: model size directly determines both quality and resource usage. A bigger model needs more RAM but produces better output. A smaller model leaves room for other apps but trades off some capability.

The core tradeoff is parameters versus available RAM. Parameters are the model's learned knowledge, measured in billions (B). More parameters generally means better reasoning, more coherent writing, and fewer errors. But each parameter takes up memory, and your Mac has a fixed amount.
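As a rough sketch of that tradeoff, memory use is parameter count times bytes per parameter. The helper below is illustrative (not from any particular library) and ignores runtime overhead like the KV cache, so treat its output as a floor, not an exact figure:

```python
def approx_model_ram_gb(params_billions: float, bits_per_param: float) -> float:
    """Rough RAM estimate: parameters x bits per parameter, in gigabytes.

    Ignores runtime overhead (KV cache, activations), so the real
    footprint is somewhat higher.
    """
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9  # decimal GB for simplicity

# A 7B model at full 16-bit precision needs roughly 14GB:
print(round(approx_model_ram_gb(7, 16)))   # 14
# The same model at ~4.5 effective bits (4-bit quantization) fits in about 4GB:
print(round(approx_model_ram_gb(7, 4.5)))  # 4
```

The same arithmetic explains every RAM figure in the tiers below: halve the bits, halve the footprint.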

What are the model size tiers for Mac?

Not all models are created equal, and not all Macs can run the same models. Here's what each tier actually delivers.

0.5B to 1B parameters (Qwen 3.5 0.8B). These are the smallest useful models. They use roughly 1GB of RAM and generate tokens at 50+ tokens per second on most Apple Silicon Macs. Good for simple tasks: quick summaries, basic Q&A, text reformatting, translation of short passages. They struggle with complex reasoning, nuanced writing, and multi-step problems. Think of them as a fast utility rather than a thinking partner.

3B to 4B parameters (Llama 3.2 3B, Qwen 3.5 4B). This is the sweet spot for most Mac users. These models use 2-3GB of RAM and run at 30+ tokens per second on an M2. They handle conversation, code assistance, drafting, brainstorming, and summarization well. The quality jump from 1B to 3B is dramatic - these models understand context, follow complex instructions, and produce genuinely useful output. If you have a 16GB Mac, start here.

7B to 8B parameters (Llama 3.1 8B, Mistral 7B). Strong general-purpose models that approach cloud quality for many tasks. They need a 16GB Mac minimum and consume 5-6GB of RAM at Q4 quantization. Speed drops to 15-20 tokens per second on an M2, which is still comfortable for interactive use. These models handle code generation, long-form writing, and analytical tasks with noticeably better quality than 3B models.

14B+ parameters (Qwen 2.5 14B). The upper practical limit for local use on Mac. Needs 32GB of RAM and runs at 8-12 tokens per second. At this size, you're approaching the quality of mid-tier cloud models for most tasks. The output is more nuanced, the reasoning is more reliable, and the model handles ambiguity better. But you need the hardware to support it, and other apps will compete for the remaining memory.

What is quantization, and should you care about it?

Every model on HuggingFace comes in multiple quantization formats: Q4_K_M, Q5_K_M, Q8_0, F16, and others. These names describe how the model's parameters are compressed.

F16 (16-bit floating point) is the original, uncompressed model. Maximum quality, maximum size. A 7B F16 model needs roughly 14GB of RAM. Most Macs can't spare that much memory for a single model.

Q8 (8-bit quantization) halves the size with almost no quality loss. That 7B model drops to about 7GB. Barely perceptible difference in output quality.

Q4_K_M (4-bit quantization, K-quant medium) is the sweet spot. It cuts size to roughly 4GB for a 7B model, with only slight quality degradation. For conversational use, code help, and general tasks, you won't notice the difference from F16. This is what most people should use.

Q2 and Q3 variants exist but compress too aggressively. Quality drops noticeably - the model starts producing more errors, losing coherence in long outputs, and struggling with instructions. Avoid these unless you're extremely RAM-constrained.

The rule of thumb: always use Q4_K_M unless you have a specific reason to go higher or lower. It gives you the most capable model that fits in your available RAM.
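That rule of thumb can be sketched as a small selection helper. The bits-per-parameter values below are approximate effective sizes for common GGUF formats, and the function is an illustration of the logic, not a real API:

```python
# Approximate effective bits per parameter for common GGUF formats.
QUANT_BITS = {"F16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.5, "Q4_K_M": 4.8}

def best_quant(params_b: float, ram_budget_gb: float):
    """Return the highest-precision format whose weights fit the budget,
    or None if even Q4_K_M is too large (pick a smaller model instead)."""
    for name in ("F16", "Q8_0", "Q5_K_M", "Q4_K_M"):  # best quality first
        size_gb = params_b * QUANT_BITS[name] / 8
        if size_gb <= ram_budget_gb:
            return name
    return None

# With about 4.5GB free, a 7B model lands on Q4_K_M:
print(best_quant(7, 4.5))  # Q4_K_M
```

In practice you rarely have 14GB to spare for a single model, which is why the search almost always ends at Q4_K_M.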

How much RAM do you actually need?

Your Mac's total RAM isn't all available for models. macOS itself uses 3-5GB, and your other apps need room too. Here's a realistic breakdown.

8GB Mac. After macOS overhead, you have roughly 4GB for a model. That means 3B models at Q4 quantization are your ceiling. Qwen 3.5 0.8B and Llama 3.2 3B both run comfortably. Don't try to load a 7B model - it will either fail to load or spill into swap and become unusably slow.

16GB Mac. The mainstream sweet spot. You can run 7B-8B models at Q4 comfortably while keeping a browser and code editor open. 3B models fly at high speed. This is enough hardware for most local AI use cases.

32GB+ Mac. The full range opens up. 14B models run well, and you can even experiment with 30B+ models at aggressive quantization, though speed drops significantly. If you regularly need the best possible local quality, this is the tier to aim for.
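The breakdown above condenses into a simple lookup. This hypothetical helper just encodes the guidelines from this section - total RAM in, largest comfortable Q4 tier out:

```python
def recommended_tier(total_ram_gb: int) -> str:
    """Map a Mac's total RAM to the largest comfortable model tier at Q4,
    leaving headroom for macOS and everyday apps (per the guidelines above)."""
    if total_ram_gb >= 32:
        return "14B"
    if total_ram_gb >= 16:
        return "7B-8B"
    if total_ram_gb >= 8:
        return "3B"
    return "0.5B-1B"

for ram in (8, 16, 32):
    print(f"{ram}GB Mac -> up to {recommended_tier(ram)} at Q4")
```

Nothing stops a 32GB Mac from running a 3B model, of course - the tiers are ceilings, not assignments.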

Why does any of this matter vs. just using the cloud?

Fair question. If you just want the best output and don't care about privacy or cost, GPT-4 and Claude are still ahead of any local model. But local models give you something cloud can't.

No rate limits. Try ten models in an afternoon. Run a hundred prompts through each. No "you've hit your limit" messages, no throttling, no waiting.

No API costs. Every query is free after the one-time download. Compare that to per-token pricing that adds up when you're iterating on a complex prompt.

Complete privacy. Paste proprietary code, client documents, financial data, personal notes. Nothing leaves your machine. This isn't a policy - it's physics.

Experimentation freedom. The best way to learn what works is to try different models on your actual tasks. Locally, switching models is free and instant. In the cloud, every experiment costs money.

How does ToolPiper simplify model selection?

Everything above is useful knowledge, but ToolPiper is designed so you don't need to memorize it. The model browser in ToolPiper presents curated presets - models that have been tested on Apple Silicon and verified to work well at specific quantization levels.

Each preset shows the model name, parameter count, quantization format, and the exact RAM it will use. A segmented memory bar shows how much of your Mac's memory the model occupies, how much is already in use by other apps, and how much headroom remains. You see the resource impact before you commit to a download.

RAM-aware filtering hides models that won't fit on your Mac. If you have a 16GB machine, you won't see 14B F16 models cluttering the list. The catalog shows you what actually works on your hardware.

Downloading is one click. The model downloads from HuggingFace, gets stored locally, and becomes available in the chat interface's model dropdown. Switching between downloaded models is instant - select from the dropdown and your next message uses the new model.

The model configs API serves these curated presets with live availability status. Models you've already downloaded show as ready. Models that would fit your Mac show download sizes. Models too large for your RAM are filtered out or flagged. This is the decision tree from earlier in this article, automated.

What should you actually download first?

If you're new to local LLMs, here's the practical order.

Start with the bundled starter model (Qwen 3.5 0.8B). It downloads automatically when you install ToolPiper. Use it for a day. Get a feel for local inference speed and the types of tasks it handles. It's fast, capable for simple tasks, and costs nothing to try.

Then download a 3B model. Llama 3.2 3B or Qwen 3.5 4B. The quality jump is immediate and significant. This is where most users settle for daily use. General chat, code help, writing assistance, brainstorming - all solid at this size.

If you have 16GB+ RAM, try an 8B model. Llama 3.1 8B is the standard benchmark. The quality improvement over 3B is noticeable in longer outputs, more complex reasoning, and code generation. If it runs at an acceptable speed on your Mac, it becomes your daily driver.

32GB users: experiment with 14B. Qwen 2.5 14B is genuinely impressive for a local model. Use it when you need the best local quality and have the patience for slightly slower output.

You don't have to pick one. Keep multiple models downloaded and switch between them based on the task. Quick question? Use the 3B. Complex code review? Switch to 8B. The model dropdown in ToolPiper makes this a two-second decision.

What are the honest limitations?

Local models top out at around 14B parameters for practical desktop use. For GPT-4 or Claude Opus levels of reasoning - multi-step analysis and genuinely hard problems - cloud models are still better. The gap is shrinking with each model generation, but it exists today.

First load takes time. A 3B model is a few gigabytes to download, and a 14B model at Q4 is about 8GB. After download, the first load into memory takes 5-15 seconds depending on model size. Subsequent loads are faster if the model stays cached.

8GB Macs are genuinely limited. You can run 3B models comfortably, but 7B models will push your system. If you're buying a Mac for local AI, 16GB is the practical minimum.

And quantization is a real tradeoff, not a free lunch. Q4_K_M is excellent for most uses, but if you're doing tasks where subtle quality differences matter - nuanced creative writing, precise technical analysis - you might notice the compression. On a Mac with enough RAM, Q8 gives you a measurable quality bump.

Try It

Download and install ToolPiper. The starter model downloads automatically - you'll be chatting in under a minute. When you're ready, open the model browser, pick a model that fits your RAM, and download it with one click.

No terminal. No configuration. No guessing which quantization format to pick. The catalog does the thinking for you.

This is part of a series on local-first AI workflows on macOS. Related: Private Local Chat covers the chat experience itself, and Deep Reasoning explores reasoning models that think step by step.