"How fast is a local LLM on my Mac?" is the first question everyone asks. The answer depends on three things: your chip (M1 through M4), your RAM (which determines the largest model you can run), and the model's size and quantization. Nobody publishes these numbers in one place with consistent methodology.

Here they are.

How does token generation work on Apple Silicon?

When you run a local LLM on your Mac, the model runs on the GPU through Metal. Not the CPU, not the Neural Engine. The Neural Engine handles audio models (speech-to-text, text-to-speech) and vision models (image upscaling, pose detection). Text generation is a Metal GPU workload.

Apple Silicon's unified memory architecture is the key advantage over traditional PCs. On a desktop GPU, model weights must be copied from system RAM to VRAM over a PCIe bus. On a Mac, the GPU has direct access to the same physical memory as the CPU. There is no copy step, no bus bottleneck. The model sits in memory and the GPU reads from it directly.

This matters because LLM inference is memory-bandwidth-bound. During token generation, the GPU reads billions of model parameters from memory for every single output token. The speed at which it can read that memory determines your tokens per second. More bandwidth means faster generation.

Here is how Apple Silicon chips compare on raw memory bandwidth:

  • M1: ~68 GB/s
  • M2: ~100 GB/s
  • M3: ~100 GB/s
  • M4: ~120 GB/s
  • Pro variants: ~150-273 GB/s (M3 Pro at ~150, M2 Pro at ~200, M4 Pro at ~273)
  • Max variants: ~300-546 GB/s (M3 Max at ~300-400, M2 Max at ~400, M4 Max at ~410-546)

The jump from base to Pro to Max is not a small uplift. Max chips have roughly 4x the memory bandwidth of base chips. That translates directly to token speed.
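Because generation is bandwidth-bound, these figures give a back-of-the-envelope ceiling: each output token requires reading every weight once, so tokens per second cannot exceed bandwidth divided by in-memory model size. A minimal sketch, assuming an 8B model at Q4 occupies roughly 5 GB:

```python
# Back-of-the-envelope ceiling: each generated token requires reading every
# model weight from memory once, so tok/s <= bandwidth / model size.
# Real-world speeds land well below this because of compute, attention over
# the KV cache, and runtime overhead.

BANDWIDTH_GB_S = {"M2": 100, "M2 Pro": 200, "M2 Max": 400}

def ceiling_tok_s(bandwidth_gb_s, model_size_gb):
    """Theoretical maximum tokens/second for a bandwidth-bound model."""
    return bandwidth_gb_s / model_size_gb

# Llama 3.1 8B at Q4_K_M occupies roughly 5 GB in memory.
for chip, bw in BANDWIDTH_GB_S.items():
    print(f"{chip}: <= ~{ceiling_tok_s(bw, 5.0):.0f} tok/s")
```

Measured speeds sit well under these ceilings, which is expected: the ceiling only bounds the memory-read side of the workload.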

What determines how fast your local LLM runs?

Four factors control your tokens per second.

1. Memory bandwidth (your chip generation and tier). This is the ceiling. A model cannot generate tokens faster than the GPU can read its weights from memory. The M2 Max with ~400 GB/s will always outperform the M2 Air with ~100 GB/s, all else being equal.

2. Model size in memory (parameters times quantization bits). A 3-billion-parameter model at Q4 quantization occupies roughly 2GB in memory. An 8B model at Q4 takes roughly 5GB. A 14B at Q4 needs about 8GB. Smaller in-memory footprint means fewer bytes the GPU must read per token, which means faster generation.

3. Context length. As the conversation grows, the model must attend over a larger key-value cache for every new token. A fresh 100-token prompt generates faster than a conversation that has accumulated 8,000 tokens of context. The slowdown is gradual but measurable.

4. Quantization level. Q4_K_M (4-bit) is the sweet spot for most users. It reduces model size by roughly 4x compared to the original FP16 weights while preserving nearly all quality. Q8 (8-bit) is higher quality but twice the memory footprint and roughly half the speed. FP16 is full precision but impractical for most Mac configurations.
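The size arithmetic in factor 2 can be sketched as follows. The ~4.5 effective bits per weight for Q4_K_M and the 20% runtime overhead are rough assumptions for illustration, not measured constants:

```python
def model_footprint_gb(params_billions, bits_per_weight=4.5, overhead=1.2):
    """Approximate in-memory size: parameters x bits per weight, plus
    ~20% for the KV cache and runtime buffers. Both the effective
    bits-per-weight and the overhead factor are rough assumptions."""
    return params_billions * bits_per_weight / 8 * overhead

for p in (3, 8, 14):
    print(f"{p}B at Q4: ~{model_footprint_gb(p):.1f} GB")
```

This reproduces the ballpark figures above: roughly 2 GB for a 3B model and roughly 5 GB for an 8B model at Q4.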

How do you read benchmark numbers?

Two metrics matter: prompt processing speed and generation speed. They measure different things.

Prompt processing (also called "prefill") is how fast the model ingests your input. When you paste a 2,000-word document and ask a question about it, the model first processes that entire document. This step is typically 2-3x faster than generation because it can process tokens in parallel.

Generation speed is how fast the model produces output tokens. This is what you feel as a user. It determines how fast the response streams back to you. Every benchmark number in this article refers to generation speed unless stated otherwise.

The unit is tokens per second (tok/s). One token is roughly 3/4 of a word in English. So 20 tok/s means roughly 15 words per second of output.
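The token-to-word conversion is simple enough to script; 0.75 words per token is the rough English average used above:

```python
def words_per_second(tok_s, words_per_token=0.75):
    """Convert generation speed to approximate English words per second."""
    return tok_s * words_per_token

def streaming_seconds(word_count, tok_s):
    """Rough time to stream a response of a given word count."""
    return word_count / words_per_second(tok_s)

print(words_per_second(20))               # 15.0 words/s, as in the text
print(round(streaming_seconds(300, 18)))  # a 300-word reply at 18 tok/s: ~22 s
```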

Here is what different speeds feel like in practice:

  • 30+ tok/s: Feels instant. Text appears faster than you can read it
  • 20-30 tok/s: Very comfortable. Faster than natural reading speed
  • 10-20 tok/s: Comfortable. Noticeable streaming but not frustrating
  • 5-10 tok/s: Usable but slow. You notice the wait
  • Below 5 tok/s: Painful. Only acceptable for long-form generation where you walk away
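The bands above can be encoded as a quick lookup for reading your own tok/s counter:

```python
def comfort_band(tok_s):
    """Map a measured generation speed to the comfort bands above."""
    if tok_s >= 30:
        return "instant"
    if tok_s >= 20:
        return "very comfortable"
    if tok_s >= 10:
        return "comfortable"
    if tok_s >= 5:
        return "usable but slow"
    return "painful"

print(comfort_band(18))  # an 8B model on an M2 Air lands here
```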

What are the actual benchmark numbers?

These benchmarks are from ToolPiper's inference engine (llama.cpp on Metal GPU) with real-world chat workloads at 2K context length. All models use Q4_K_M quantization.

M2 MacBook Air 16GB

The most popular Mac for local AI. 100 GB/s memory bandwidth, 16GB unified memory. You can comfortably run models up to 8B parameters.

Model            Parameters   Generation (tok/s)
Qwen 3.5 0.8B    0.8B         ~55
Llama 3.2 3B     3B           ~32
Qwen 3.5 4B      4B           ~28
Llama 3.1 8B     8B           ~18

M2 Pro 32GB

Double the memory bandwidth (~200 GB/s) and double the RAM. You can run 14B models and the 8B models get noticeably faster.

Model            Parameters   Generation (tok/s)
Qwen 3.5 0.8B    0.8B         ~65
Llama 3.2 3B     3B           ~38
Llama 3.1 8B     8B           ~22
Qwen 2.5 14B     14B          ~12

M2 Max 32GB

The bandwidth king at ~400 GB/s. Every model runs significantly faster, and the 14B model becomes genuinely comfortable.

Model            Parameters   Generation (tok/s)
Qwen 3.5 0.8B    0.8B         ~80
Llama 3.2 3B     3B           ~48
Llama 3.1 8B     8B           ~28
Qwen 2.5 14B     14B          ~16

M3 and M4 series chips follow the same bandwidth tiers with slight improvements. The M4 Max with its higher bandwidth ceiling pushes even faster numbers across the board.

Note: These are generation speeds with 2K context. Longer contexts reduce speed. Prompt processing is typically 2-3x faster than generation.

What helps ToolPiper run faster?

ToolPiper's llama.cpp engine includes several optimizations that affect real-world speed.

Flash attention. ToolPiper enables --flash-attn auto by default. Flash attention reduces memory usage during the attention computation, which means less memory pressure and faster generation, especially at longer context lengths. The improvement is most noticeable above 4K context where standard attention starts to bottleneck.

Speculative decoding. ToolPiper uses --spec-type ngram-simple, which predicts likely next tokens based on n-gram patterns. When the prediction is correct (which happens often for repetitive or predictable text), multiple tokens are confirmed in a single forward pass. This provides modest speed gains for structured output like code, lists, and formulaic text.
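The idea behind n-gram speculation can be shown with a toy sketch. This is an illustration of the general technique, not ToolPiper's or llama.cpp's actual implementation: draft a few tokens by reusing patterns already seen in the text, then keep however many the real model confirms.

```python
def ngram_draft(history, n=1, k=3):
    """Propose up to k tokens by looking up the last n tokens in history
    and copying what followed their most recent earlier occurrence."""
    draft = []
    ctx = list(history)
    for _ in range(k):
        key = tuple(ctx[-n:])
        nxt = None
        for i in range(len(ctx) - n - 1, -1, -1):
            if tuple(ctx[i:i + n]) == key:
                nxt = ctx[i + n]
                break
        if nxt is None:
            break
        draft.append(nxt)
        ctx.append(nxt)
    return draft

def accept(draft, target_next):
    """Keep the longest prefix of the draft the target model confirms.
    Each confirmed token saves one full forward pass."""
    kept = []
    for d, t in zip(draft, target_next):
        if d != t:
            break
        kept.append(d)
    return kept

history = "the cat sat on the".split()
target = ["cat", "sat", "down"]        # what the real model would produce next
draft = ngram_draft(history, n=1, k=3)
print(draft, accept(draft, target))    # ['cat', 'sat', 'on'] ['cat', 'sat']
```

Repetitive text (code, lists, boilerplate) makes the draft right more often, which is exactly where the technique pays off.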

Resource intelligence. ToolPiper's memory bar shows whether a model is running comfortably or under memory pressure. When a model's working memory exceeds available RAM, macOS starts swapping to disk, and performance collapses. The memory bar lets you see this happening in real time so you can switch to a smaller model or close other apps before the system grinds to a halt.

Automatic model filtering. ToolPiper hides models that will not fit in your available memory. You never accidentally download a 14B model on an 8GB machine only to discover it swaps to disk at 2 tok/s.
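A fit check in this spirit is easy to sketch. The 5 GB reserve for macOS and other apps is an assumption for illustration, not ToolPiper's actual rule:

```python
def fits(model_gb, total_ram_gb, reserved_gb=5.0):
    """Rough will-it-fit check: model must fit in RAM after reserving
    headroom for macOS and other apps (the 5 GB reserve is an assumption)."""
    return model_gb <= total_ram_gb - reserved_gb

models = {
    "Llama 3.2 3B (Q4)": 2.0,
    "Llama 3.1 8B (Q4)": 5.0,
    "Qwen 2.5 14B (Q4)": 8.0,
}
runnable = [name for name, size in models.items() if fits(size, total_ram_gb=8.0)]
print(runnable)  # on an 8 GB machine, only the 3B model passes
```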

What are the honest limitations?

These numbers are approximate. Your actual performance will vary based on several factors you should know about.

Context length matters more than people realize. The benchmarks above use 2K context. At 8K context, expect roughly 15-20% slower generation. At 16K context, expect 25-35% slower. At 32K context, the slowdown is more significant and flash attention becomes essential.
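Those bands can be turned into a rough adjustment function; the factors below use the midpoints of the stated ranges, and the 32K factor is an assumption since the text only says the slowdown is "more significant":

```python
def adjusted_tok_s(base_tok_s, context_tokens):
    """Apply the rough context-length slowdown bands (midpoint estimates)."""
    if context_tokens <= 2048:
        factor = 1.0
    elif context_tokens <= 8192:
        factor = 0.825   # ~15-20% slower
    elif context_tokens <= 16384:
        factor = 0.70    # ~25-35% slower
    else:
        factor = 0.55    # assumed; the source only says "more significant"
    return base_tok_s * factor

print(adjusted_tok_s(18, 8192))  # M2 Air, 8B model, 8K context
```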

Battery mode throttles the GPU. If you are running on battery power, macOS reduces GPU clock speeds to save energy. Expect 30-50% lower tok/s compared to plugged-in performance. If speed matters, plug in.

Background GPU load reduces bandwidth. A browser with hardware-accelerated video, a game, or even a compositing-heavy desktop all compete for GPU resources and memory bandwidth. For the best benchmark results, close GPU-heavy apps. For everyday use, the numbers will be somewhat lower.

M1 base with 8GB is the floor. You can run 0.8B and 3B models comfortably. The 8B models will load but will be memory-constrained after accounting for macOS and your other apps. Expect 10-14 tok/s for Llama 3.1 8B on M1 8GB with nothing else running.

macOS version affects Metal performance. Apple regularly improves Metal GPU performance in macOS updates. Running the latest macOS version generally gives you the best results.

How does local speed compare to cloud services?

Cloud LLM services like ChatGPT, Claude, and Gemini typically stream responses at 30-60 tok/s. But that speed includes network latency, server queuing, and rate limiting. During peak hours, cloud services can slow down significantly or refuse requests entirely.

A local model on an M2 Air at 18 tok/s for an 8B model is slower than peak cloud speed. But it is consistent, unlimited, and private. There is no queue, no rate limit, no internet requirement, and no per-token cost.

For small models (0.8B-3B), local speed on any Apple Silicon Mac matches or exceeds cloud streaming speed. For larger models, you trade raw speed for privacy and unlimited usage.

Try It

Download and install ToolPiper. Pick a model that fits your RAM. Watch the memory bar as it loads. Start chatting and see the tok/s counter in real time.

Your chip's memory bandwidth is fixed. But now you know exactly what it can do.

This is part of a series on local-first AI workflows on macOS. Related: Which Local LLM covers model selection and AI Model Memory explains RAM requirements in detail.