"How fast is a local LLM on my Mac?" is the first question everyone asks. The answer depends on three things: your chip (M1 through M4), your RAM (which determines the largest model you can run), and the model's size and quantization. Nobody publishes these numbers in one place with consistent methodology.

Here they are.

How does token generation work on Apple Silicon?

When you run a local LLM on your Mac, the model runs on the GPU through Metal. Not the CPU, not the Neural Engine. The Neural Engine handles audio models (speech-to-text, text-to-speech) and vision models (image upscaling, pose detection). Text generation is a Metal GPU workload.

Apple Silicon's unified memory architecture is the key advantage over traditional PCs. On a desktop GPU, model weights must be copied from system RAM to VRAM over a PCIe bus. On a Mac, the GPU has direct access to the same physical memory as the CPU. There is no copy step, no bus bottleneck. The model sits in memory and the GPU reads from it directly.

This matters because LLM inference is memory-bandwidth-bound. During token generation, the GPU reads billions of model parameters from memory for every single output token. The speed at which it can read that memory determines your tokens per second. More bandwidth means faster generation.

Here is how Apple Silicon chips compare on raw memory bandwidth:

  • M1: ~68 GB/s
  • M2: ~100 GB/s
  • M3: ~100 GB/s
  • M4: ~120 GB/s
  • Pro variants: ~150-273 GB/s (M3 Pro at ~150, M2 Pro at ~200, M4 Pro at ~273)
  • Max variants: ~300-546 GB/s (M3 Max at ~300-400, M2 Max at ~400, M4 Max at ~410-546)

The jump from base to Pro to Max is not a small uplift. Max chips have roughly 4x the memory bandwidth of base chips. That translates directly to token speed.
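Because generation is bandwidth-bound, these figures give a back-of-the-envelope ceiling: each output token requires reading every weight once, so tokens per second cannot exceed bandwidth divided by in-memory model size. A minimal sketch, assuming an 8B model at Q4 occupies roughly 5 GB:

```python
# Back-of-the-envelope ceiling: each generated token requires reading every
# model weight from memory once, so tok/s <= bandwidth / model size.
# Real-world speeds land well below this because of compute, attention over
# the KV cache, and runtime overhead.

BANDWIDTH_GB_S = {"M2": 100, "M2 Pro": 200, "M2 Max": 400}

def ceiling_tok_s(bandwidth_gb_s, model_size_gb):
    """Theoretical maximum tokens/second for a bandwidth-bound model."""
    return bandwidth_gb_s / model_size_gb

# Llama 3.1 8B at Q4_K_M occupies roughly 5 GB in memory.
for chip, bw in BANDWIDTH_GB_S.items():
    print(f"{chip}: <= ~{ceiling_tok_s(bw, 5.0):.0f} tok/s")
```

Measured speeds sit well under these ceilings, which is expected: the ceiling only bounds the memory-read side of the workload.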

What determines how fast your local LLM runs?

Four factors control your tokens per second.

1. Memory bandwidth (your chip generation and tier). This is the ceiling. A model cannot generate tokens faster than the GPU can read its weights from memory. The M2 Max with ~400 GB/s will always outperform the M2 Air with ~100 GB/s, all else being equal.

2. Model size in memory (parameters times quantization bits). A 3-billion-parameter model at Q4 quantization occupies roughly 2GB in memory. An 8B model at Q4 takes roughly 5GB. A 14B at Q4 needs about 8GB. Smaller in-memory footprint means fewer bytes the GPU must read per token, which means faster generation.

3. Context length. As the conversation grows, the model must attend over a larger key-value cache for every new token. A fresh 100-token prompt generates faster than a conversation that has accumulated 8,000 tokens of context. The slowdown is gradual but measurable.

4. Quantization level. Q4_K_M (4-bit) is the sweet spot for most users. It reduces model size by roughly 4x compared to the original FP16 weights while preserving nearly all quality. Q8 (8-bit) is higher quality but twice the memory footprint and roughly half the speed. FP16 is full precision but impractical for most Mac configurations.
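The size arithmetic in factor 2 can be sketched as follows. The ~4.5 effective bits per weight for Q4_K_M and the 20% runtime overhead are rough assumptions for illustration, not measured constants:

```python
def model_footprint_gb(params_billions, bits_per_weight=4.5, overhead=1.2):
    """Approximate in-memory size: parameters x bits per weight, plus
    ~20% for the KV cache and runtime buffers. Both the effective
    bits-per-weight and the overhead factor are rough assumptions."""
    return params_billions * bits_per_weight / 8 * overhead

for p in (3, 8, 14):
    print(f"{p}B at Q4: ~{model_footprint_gb(p):.1f} GB")
```

This reproduces the ballpark figures above: roughly 2 GB for a 3B model and roughly 5 GB for an 8B model at Q4.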

How do you read benchmark numbers?

Two metrics matter: prompt processing speed and generation speed. They measure different things.

Prompt processing (also called "prefill") is how fast the model ingests your input. When you paste a 2,000-word document and ask a question about it, the model first processes that entire document. This step is typically 2-3x faster than generation because it can process tokens in parallel.

Generation speed is how fast the model produces output tokens. This is what you feel as a user. It determines how fast the response streams back to you. Every benchmark number in this article refers to generation speed unless stated otherwise.

The unit is tokens per second (tok/s). One token is roughly 3/4 of a word in English. So 20 tok/s means roughly 15 words per second of output.
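The token-to-word conversion is simple enough to script; 0.75 words per token is the rough English average used above:

```python
def words_per_second(tok_s, words_per_token=0.75):
    """Convert generation speed to approximate English words per second."""
    return tok_s * words_per_token

def streaming_seconds(word_count, tok_s):
    """Rough time to stream a response of a given word count."""
    return word_count / words_per_second(tok_s)

print(words_per_second(20))               # 15.0 words/s, as in the text
print(round(streaming_seconds(300, 18)))  # a 300-word reply at 18 tok/s: ~22 s
```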

Here is what different speeds feel like in practice:

  • 30+ tok/s: Feels instant. Text appears faster than you can read it
  • 20-30 tok/s: Very comfortable. Faster than natural reading speed
  • 10-20 tok/s: Comfortable. Noticeable streaming but not frustrating
  • 5-10 tok/s: Usable but slow. You notice the wait
  • Below 5 tok/s: Painful. Only acceptable for long-form generation where you walk away
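The bands above can be encoded as a quick lookup for reading your own tok/s counter:

```python
def comfort_band(tok_s):
    """Map a measured generation speed to the comfort bands above."""
    if tok_s >= 30:
        return "instant"
    if tok_s >= 20:
        return "very comfortable"
    if tok_s >= 10:
        return "comfortable"
    if tok_s >= 5:
        return "usable but slow"
    return "painful"

print(comfort_band(18))  # an 8B model on an M2 Air lands here
```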

What are the actual benchmark numbers?

These benchmarks are from ToolPiper's inference engine (llama.cpp on Metal GPU) with real-world chat workloads at 2K context length. All models use Q4_K_M quantization.

M2 MacBook Air 16GB

The most popular Mac for local AI. 100 GB/s memory bandwidth, 16GB unified memory. You can comfortably run models up to 8B parameters.

Model            Parameters   Generation (tok/s)
Qwen 3.5 0.8B    0.8B         ~55
Llama 3.2 3B     3B           ~32
Qwen 3.5 4B      4B           ~28
Llama 3.1 8B     8B           ~18

M2 Pro 32GB

Double the memory bandwidth (~200 GB/s) and double the RAM. You can run 14B models and the 8B models get noticeably faster.

Model            Parameters   Generation (tok/s)
Qwen 3.5 0.8B    0.8B         ~65
Llama 3.2 3B     3B           ~38
Llama 3.1 8B     8B           ~22
Qwen 2.5 14B     14B          ~12

M2 Max 32GB

The bandwidth king at ~400 GB/s. Every model runs significantly faster, and the 14B model becomes genuinely comfortable.

Model            Parameters   Generation (tok/s)
Qwen 3.5 0.8B    0.8B         ~80
Llama 3.2 3B     3B           ~48
Llama 3.1 8B     8B           ~28
Qwen 2.5 14B     14B          ~16

M3 and M4 series chips follow the same bandwidth tiers with slight improvements. The M4 Max with its higher bandwidth ceiling pushes even faster numbers across the board.

Note: These are generation speeds with 2K context. Longer contexts reduce speed. Prompt processing is typically 2-3x faster than generation.

What helps ToolPiper run faster?

ToolPiper's llama.cpp engine includes several optimizations that affect real-world speed.

Flash attention. ToolPiper enables --flash-attn auto by default. Flash attention reduces memory usage during the attention computation, which means less memory pressure and faster generation, especially at longer context lengths. The improvement is most noticeable above 4K context where standard attention starts to bottleneck.

Speculative decoding. ToolPiper uses --spec-type ngram-simple, which predicts likely next tokens based on n-gram patterns. When the prediction is correct (which happens often for repetitive or predictable text), multiple tokens are confirmed in a single forward pass. This provides modest speed gains for structured output like code, lists, and formulaic text.
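The idea behind n-gram speculation can be shown with a toy sketch. This is an illustration of the general technique, not ToolPiper's or llama.cpp's actual implementation: draft a few tokens by reusing patterns already seen in the text, then keep however many the real model confirms.

```python
def ngram_draft(history, n=1, k=3):
    """Propose up to k tokens by looking up the last n tokens in history
    and copying what followed their most recent earlier occurrence."""
    draft = []
    ctx = list(history)
    for _ in range(k):
        key = tuple(ctx[-n:])
        nxt = None
        for i in range(len(ctx) - n - 1, -1, -1):
            if tuple(ctx[i:i + n]) == key:
                nxt = ctx[i + n]
                break
        if nxt is None:
            break
        draft.append(nxt)
        ctx.append(nxt)
    return draft

def accept(draft, target_next):
    """Keep the longest prefix of the draft the target model confirms.
    Each confirmed token saves one full forward pass."""
    kept = []
    for d, t in zip(draft, target_next):
        if d != t:
            break
        kept.append(d)
    return kept

history = "the cat sat on the".split()
target = ["cat", "sat", "down"]        # what the real model would produce next
draft = ngram_draft(history, n=1, k=3)
print(draft, accept(draft, target))    # ['cat', 'sat', 'on'] ['cat', 'sat']
```

Repetitive text (code, lists, boilerplate) makes the draft right more often, which is exactly where the technique pays off.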

Resource intelligence. ToolPiper's memory bar shows whether a model is running comfortably or under memory pressure. When a model's working memory exceeds available RAM, macOS starts swapping to disk, and performance collapses. The memory bar lets you see this happening in real time so you can switch to a smaller model or close other apps before the system grinds to a halt.

Automatic model filtering. ToolPiper hides models that will not fit in your available memory. You never accidentally download a 14B model on an 8GB machine only to discover it swaps to disk at 2 tok/s.
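A fit check in this spirit is easy to sketch. The 5 GB reserve for macOS and other apps is an assumption for illustration, not ToolPiper's actual rule:

```python
def fits(model_gb, total_ram_gb, reserved_gb=5.0):
    """Rough will-it-fit check: model must fit in RAM after reserving
    headroom for macOS and other apps (the 5 GB reserve is an assumption)."""
    return model_gb <= total_ram_gb - reserved_gb

models = {
    "Llama 3.2 3B (Q4)": 2.0,
    "Llama 3.1 8B (Q4)": 5.0,
    "Qwen 2.5 14B (Q4)": 8.0,
}
runnable = [name for name, size in models.items() if fits(size, total_ram_gb=8.0)]
print(runnable)  # on an 8 GB machine, only the 3B model passes
```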

What are the honest limitations?

These numbers are approximate. Your actual performance will vary based on several factors you should know about.

Context length matters more than people realize. The benchmarks above use 2K context. At 8K context, expect roughly 15-20% slower generation. At 16K context, expect 25-35% slower. At 32K context, the slowdown is more significant and flash attention becomes essential.
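Those bands can be turned into a rough adjustment function; the factors below use the midpoints of the stated ranges, and the 32K factor is an assumption since the text only says the slowdown is "more significant":

```python
def adjusted_tok_s(base_tok_s, context_tokens):
    """Apply the rough context-length slowdown bands (midpoint estimates)."""
    if context_tokens <= 2048:
        factor = 1.0
    elif context_tokens <= 8192:
        factor = 0.825   # ~15-20% slower
    elif context_tokens <= 16384:
        factor = 0.70    # ~25-35% slower
    else:
        factor = 0.55    # assumed; the source only says "more significant"
    return base_tok_s * factor

print(adjusted_tok_s(18, 8192))  # M2 Air, 8B model, 8K context
```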

Battery mode throttles the GPU. If you are running on battery power, macOS reduces GPU clock speeds to save energy. Expect 30-50% lower tok/s compared to plugged-in performance. If speed matters, plug in.

Background GPU load reduces bandwidth. A browser with hardware-accelerated video, a game, or even a compositing-heavy desktop all compete for GPU resources and memory bandwidth. For the best benchmark results, close GPU-heavy apps. For everyday use, the numbers will be somewhat lower.

M1 base with 8GB is the floor. You can run 0.8B and 3B models comfortably. The 8B models will load but will be memory-constrained after accounting for macOS and your other apps. Expect 10-14 tok/s for Llama 3.1 8B on M1 8GB with nothing else running.

macOS version affects Metal performance. Apple regularly improves Metal GPU performance in macOS updates. Running the latest macOS version generally gives you the best results.

How does local speed compare to cloud services?

Cloud LLM services like ChatGPT, Claude, and Gemini typically stream responses at 30-60 tok/s. But that speed includes network latency, server queuing, and rate limiting. During peak hours, cloud services can slow down significantly or refuse requests entirely.

A local model on an M2 Air at 18 tok/s for an 8B model is slower than peak cloud speed. But it is consistent, unlimited, and private. There is no queue, no rate limit, no internet requirement, and no per-token cost.

For small models (0.8B-3B), local speed on any Apple Silicon Mac matches or exceeds cloud streaming speed. For larger models, you trade raw speed for privacy and unlimited usage.

Try It

Download and install ToolPiper. Pick a model that fits your RAM. Watch the memory bar as it loads. Start chatting and see the tok/s counter in real time.

Your chip's memory bandwidth is fixed. But now you know exactly what it can do.

This is part of a series on local-first AI workflows on macOS. Related: Which Local LLM covers model selection and AI Model Memory explains RAM requirements in detail.