---
title: "Local LLM Benchmarks on Apple Silicon: Token Speed Across M1 to M4"
description: "Consistent benchmarks for local LLM speed on Mac. Token generation rates across M1 through M4 chips, with real models and real workloads."
date: 2026-03-26
author: "Ben Racicot"
tags: ["Benchmarks", "Performance", "Text Generation", "macOS", "Apple Silicon", "Metal GPU"]
type: "article"
canonical: "https://modelpiper.com/blog/local-llm-benchmarks-apple-silicon/"
---

# Local LLM Benchmarks on Apple Silicon: Token Speed Across M1 to M4

> Consistent benchmarks for local LLM speed on Mac. Token generation rates across M1 through M4 chips, with real models and real workloads.

## TL;DR

LLM inference on Apple Silicon is memory-bandwidth-bound. An M2 Air generates about 18 tok/s with Llama 3.1 8B and 55 tok/s with Qwen 3.5 0.8B. An M2 Max reaches about 28 and 80 tok/s respectively. Anything above 20 tok/s streams faster than you can read. These are real-world numbers from ToolPiper's llama.cpp engine on Metal GPU, not synthetic benchmarks.

"How fast is a local LLM on my Mac?" is the first question everyone asks. The answer depends on three things: your chip (M1 through M4), your RAM (which determines the largest model you can run), and the model's size and quantization. Nobody publishes these numbers in one place with consistent methodology.

Here they are.

## How does token generation work on Apple Silicon?

When you run a local LLM on your Mac, the model runs on Metal GPU. Not CPU, not the Neural Engine. The Neural Engine handles audio models (speech-to-text, text-to-speech) and vision models (image upscale, pose detection). Text generation is a Metal GPU workload.

Apple Silicon's unified memory architecture is the key advantage over traditional PCs. On a desktop GPU, model weights must be copied from system RAM to VRAM over a PCIe bus. On a Mac, the GPU has direct access to the same physical memory as the CPU. There is no copy step, no bus bottleneck. The model sits in memory and the GPU reads from it directly.

This matters because LLM inference is **memory-bandwidth-bound**. During token generation, the GPU reads billions of model parameters from memory for every single output token. The speed at which it can read that memory determines your tokens per second. More bandwidth means faster generation.

Here is how Apple Silicon chips compare on raw memory bandwidth:

-   **M1:** ~68 GB/s
-   **M2:** ~100 GB/s
-   **M3:** ~100 GB/s
-   **M4:** ~120 GB/s
-   **Pro variants:** ~200 GB/s (M2 Pro), ~150 GB/s (M3 Pro), ~273 GB/s (M4 Pro)
-   **Max variants:** ~400 GB/s (M2 Max), ~300-400 GB/s (M3 Max), ~410-546 GB/s (M4 Max), depending on configuration

The jump from base to Pro to Max is large: Max chips have roughly 4x the memory bandwidth of base chips. That raises the ceiling on token speed, although the measured gains in the benchmarks below are smaller than the bandwidth ratio.
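The bandwidth ceiling is simple enough to estimate yourself. The sketch below assumes every generated token requires one full read of the model weights and ignores compute, KV cache reads, and other overhead, so treat the result as an upper bound, not a prediction. `ceiling_tok_per_s` is a name made up for this illustration, and 5 GB is this article's rough size for an 8B model at Q4.

```python
def ceiling_tok_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on generation speed, assuming one full weight read per token."""
    return bandwidth_gb_s / model_size_gb


# Bandwidth figures from the list above (GB/s).
CHIPS = [("M1", 68), ("M2", 100), ("M4", 120), ("M2 Pro", 200), ("M2 Max", 400)]

# Assumption: ~5 GB for an 8B model at Q4, the rough figure used in this article.
MODEL_SIZE_GB = 5.0

for chip, bandwidth in CHIPS:
    ceiling = ceiling_tok_per_s(bandwidth, MODEL_SIZE_GB)
    print(f"{chip}: at most ~{ceiling:.0f} tok/s for an 8B model at Q4")
```

For the base M2 this gives a ceiling of about 20 tok/s, which sits just above the ~18 tok/s measured for Llama 3.1 8B on the M2 Air.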

## What determines how fast your local LLM runs?

Four factors control your tokens per second.

**1\. Memory bandwidth (your chip generation and tier).** This is the ceiling. A model cannot generate tokens faster than the GPU can read its weights from memory. The M2 Max with ~400 GB/s will always outperform the M2 Air with ~100 GB/s, all else being equal.

**2\. Model size in memory (parameters times quantization bits).** A 3-billion-parameter model at Q4 quantization occupies roughly 2GB in memory. An 8B model at Q4 takes roughly 5GB. A 14B at Q4 needs about 8GB. Smaller in-memory footprint means fewer bytes the GPU must read per token, which means faster generation.

**3\. Context length.** As the conversation grows longer, the model must process more data per token. A fresh 100-token prompt generates faster than a conversation that has accumulated 8,000 tokens of context. The slowdown is gradual but measurable.

**4\. Quantization level.** Q4\_K\_M (4-bit) is the sweet spot for most users. It reduces model size by roughly 3-4x compared to the original FP16 weights while preserving nearly all quality. Q8 (8-bit) is higher quality but has twice the memory footprint of Q4 and runs at roughly half the speed. FP16 is full precision but impractical on most Mac configurations.
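Factors 2 and 4 come down to the same arithmetic: parameters times bits per weight, divided by 8 bits per byte. The sketch below is a rough estimate of weight size only. It leaves out the KV cache and runtime buffers, and the 4.85 bits per weight used for Q4\_K\_M is an assumed effective value, not a figure from this article.

```python
def footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: parameters x bits per weight / 8 bits per byte."""
    return params_billions * bits_per_weight / 8


# Assumption: effective bits per weight for each format (Q4_K_M mixes block types,
# so it lands a little above 4 bits). These are illustrative values.
BITS = {"Q4_K_M": 4.85, "Q8": 8.5, "FP16": 16.0}

for name, params in [("3B", 3), ("8B", 8), ("14B", 14)]:
    sizes = ", ".join(
        f"{fmt} ~{footprint_gb(params, bits):.1f} GB" for fmt, bits in BITS.items()
    )
    print(f"{name}: {sizes}")
```

The Q4\_K\_M results land close to the 2 GB, 5 GB, and 8 GB figures quoted above, and the FP16 column shows why full precision is out of reach on a 16GB machine for anything beyond small models.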

## How do you read benchmark numbers?

Two metrics matter: **prompt processing speed** and **generation speed**. They measure different things.

Prompt processing (also called "prefill") is how fast the model ingests your input. When you paste a 2,000-word document and ask a question about it, the model first processes that entire document. This step is typically 2-3x faster than generation because it can process tokens in parallel.

Generation speed is how fast the model produces output tokens. This is what you feel as a user. It determines how fast the response streams back to you. Every benchmark number in this article refers to generation speed unless stated otherwise.

The unit is **tokens per second (tok/s)**. One token is roughly 3/4 of a word in English. So 20 tok/s means roughly 15 words per second of output.
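If you want to measure generation speed yourself, the key is to keep prefill out of the number. The sketch below starts the clock when the first token arrives, so the wait before it (prompt processing) is excluded. It assumes your client exposes output as an iterable of tokens; `generation_tok_per_s` and `fake_stream` are hypothetical names for this illustration, not part of ToolPiper or llama.cpp.

```python
import time
from typing import Callable, Iterable, Iterator


def generation_tok_per_s(
    stream: Iterable[str], clock: Callable[[], float] = time.perf_counter
) -> float:
    """Tokens per second over the generation phase only.

    Timing starts at the first token, so prompt processing is not counted.
    """
    first_token_at = None
    last_token_at = None
    tokens_after_first = 0
    for _ in stream:
        now = clock()
        if first_token_at is None:
            first_token_at = now
        else:
            tokens_after_first += 1
        last_token_at = now
    if first_token_at is None or tokens_after_first == 0:
        raise ValueError("need at least two tokens to measure generation speed")
    elapsed = last_token_at - first_token_at
    if elapsed <= 0:
        raise ValueError("clock did not advance between tokens")
    return tokens_after_first / elapsed


def fake_stream(count: int, delay_s: float) -> Iterator[str]:
    """Stand-in for a real token stream: yields `count` tokens, `delay_s` apart."""
    for i in range(count):
        time.sleep(delay_s)
        yield f"tok{i}"


print(f"{generation_tok_per_s(fake_stream(6, 0.01)):.0f} tok/s")
```

The `clock` parameter exists so the function can be checked with fixed timestamps instead of real sleeps.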

Here is what different speeds feel like in practice:

-   **30+ tok/s:** Feels instant. Text appears faster than you can read it.
-   **20-30 tok/s:** Very comfortable. Faster than natural reading speed.
-   **10-20 tok/s:** Comfortable. Noticeable streaming but not frustrating.
-   **5-10 tok/s:** Usable but slow. You notice the wait.
-   **Below 5 tok/s:** Painful. Only acceptable for long-form generation where you walk away.
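The tiers above and the 3/4-word rule of thumb can be turned into a small lookup. This is a sketch: the function names are made up for illustration, and because the ranges above share their endpoints, the code assigns a boundary value such as exactly 20 tok/s to the faster tier.

```python
def words_per_second(tok_per_s: float, words_per_token: float = 0.75) -> float:
    """Convert tok/s to English words per second using the ~3/4 word per token rule."""
    return tok_per_s * words_per_token


def feel(tok_per_s: float) -> str:
    """Map a generation speed to the tiers listed above (boundaries go to the faster tier)."""
    if tok_per_s >= 30:
        return "feels instant"
    if tok_per_s >= 20:
        return "very comfortable"
    if tok_per_s >= 10:
        return "comfortable"
    if tok_per_s >= 5:
        return "usable but slow"
    return "painful"


for speed in (55, 28, 18, 8, 2):
    print(f"{speed} tok/s = ~{words_per_second(speed):.1f} words/s: {feel(speed)}")
```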

## What are the actual benchmark numbers?

These benchmarks are from ToolPiper's inference engine (llama.cpp on Metal GPU) with real-world chat workloads at 2K context length. All models use Q4\_K\_M quantization.

### M2 MacBook Air 16GB

A popular entry point for local AI. 100 GB/s memory bandwidth, 16GB unified memory. You can comfortably run models up to 8B parameters.

| Model | Parameters | Generation (tok/s) |
| --- | --- | --- |
| Qwen 3.5 0.8B | 0.8B | ~55 |
| Llama 3.2 3B | 3B | ~32 |
| Qwen 3.5 4B | 4B | ~28 |
| Llama 3.1 8B | 8B | ~18 |

### M2 Pro 32GB

Double the memory bandwidth (~200 GB/s) and double the RAM. You can run 14B models and the 8B models get noticeably faster.

| Model | Parameters | Generation (tok/s) |
| --- | --- | --- |
| Qwen 3.5 0.8B | 0.8B | ~65 |
| Llama 3.2 3B | 3B | ~38 |
| Llama 3.1 8B | 8B | ~22 |
| Qwen 2.5 14B | 14B | ~12 |

### M2 Max 32GB

The bandwidth king at ~400 GB/s. Every model runs significantly faster, and the 14B model becomes genuinely comfortable.

| Model | Parameters | Generation (tok/s) |
| --- | --- | --- |
| Qwen 3.5 0.8B | 0.8B | ~80 |
| Llama 3.2 3B | 3B | ~48 |
| Llama 3.1 8B | 8B | ~28 |
| Qwen 2.5 14B | 14B | ~16 |

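M3 and M4 series chips follow similar bandwidth tiers, with some exceptions: the M3 Pro drops to ~150 GB/s, while the M4 Pro (~273 GB/s) and M4 Max (up to ~546 GB/s) raise the ceiling. The tables above cover M2 machines only, so use a chip's bandwidth as a guide to where it should land relative to these numbers.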

**Note:** These are generation speeds with 2K context. Longer contexts reduce speed. Prompt processing is typically 2-3x faster than generation.
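One way to read the three tables together is to compare the measured speedup over the M2 Air with the bandwidth ratio. The sketch below uses only the numbers published above, for the three models that appear in every table. The dictionary and function names are illustrative.

```python
# Memory bandwidth in GB/s, from the section headings above.
BANDWIDTH_GB_S = {"M2 Air": 100, "M2 Pro": 200, "M2 Max": 400}

# Generation speed in tok/s, from the tables above (models present in all three).
TOK_PER_S = {
    "M2 Air": {"Qwen 3.5 0.8B": 55, "Llama 3.2 3B": 32, "Llama 3.1 8B": 18},
    "M2 Pro": {"Qwen 3.5 0.8B": 65, "Llama 3.2 3B": 38, "Llama 3.1 8B": 22},
    "M2 Max": {"Qwen 3.5 0.8B": 80, "Llama 3.2 3B": 48, "Llama 3.1 8B": 28},
}


def speedup_vs_air(chip: str, model: str) -> float:
    """Measured generation speed on `chip` relative to the M2 Air for the same model."""
    return TOK_PER_S[chip][model] / TOK_PER_S["M2 Air"][model]


for chip in ("M2 Pro", "M2 Max"):
    bandwidth_ratio = BANDWIDTH_GB_S[chip] / BANDWIDTH_GB_S["M2 Air"]
    for model in TOK_PER_S["M2 Air"]:
        print(
            f"{chip} vs M2 Air, {model}: "
            f"{speedup_vs_air(chip, model):.2f}x measured, {bandwidth_ratio:.0f}x bandwidth"
        )
```

On these numbers the M2 Max is roughly 1.5x faster than the M2 Air against a 4x bandwidth gap, which suggests that bandwidth sets the ceiling while other costs keep real chat workloads below it.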

## What helps ToolPiper run faster?

ToolPiper's llama.cpp engine includes several optimizations that affect real-world speed.

**Flash attention.** ToolPiper enables `--flash-attn auto` by default. Flash attention reduces memory usage during the attention computation, which means less memory pressure and faster generation, especially at longer context lengths. The improvement is most noticeable above 4K context where standard attention starts to bottleneck.

**Speculative decoding.** ToolPiper uses `--spec-type ngram-simple`, which predicts likely next tokens based on n-gram patterns. When the prediction is correct (which happens often for repetitive or predictable text), multiple tokens are confirmed in a single forward pass. This provides modest speed gains for structured output like code, lists, and formulaic text.
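To make the n-gram idea concrete, here is a toy version of the drafting step. It is not llama.cpp's or ToolPiper's implementation, just a minimal sketch of the general technique: look for the most recent earlier occurrence of the last few tokens and propose whatever followed it. In a real engine the model then checks the drafted tokens in one forward pass; here `accepted_prefix` stands in for that check by comparing against a known continuation. All names and token IDs are hypothetical.

```python
from typing import List, Sequence


def ngram_draft(history: Sequence[int], n: int = 3, max_draft: int = 4) -> List[int]:
    """Propose up to `max_draft` tokens by copying what followed the last n-gram earlier on."""
    if len(history) <= n:
        return []
    key = tuple(history[-n:])
    # Scan backwards over earlier start positions, skipping the trailing n-gram itself.
    for start in range(len(history) - n - 1, -1, -1):
        if tuple(history[start:start + n]) == key:
            return list(history[start + n:start + n + max_draft])
    return []


def accepted_prefix(draft: Sequence[int], actual: Sequence[int]) -> int:
    """Count leading draft tokens that match what the model actually produces."""
    count = 0
    for drafted, real in zip(draft, actual):
        if drafted != real:
            break
        count += 1
    return count


# Repetitive history: the sequence 1, 2, 3 appeared before and was followed by 4, 5, 1, 2.
history = [1, 2, 3, 4, 5, 1, 2, 3]
draft = ngram_draft(history)
print("draft:", draft)
print("accepted:", accepted_prefix(draft, actual=[4, 5, 9, 9]))
```

Every accepted token is one the model did not have to generate in its own step, which is why the gains show up mostly on repetitive or formulaic output.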

**Resource intelligence.** ToolPiper's memory bar shows whether a model is running comfortably or under memory pressure. When a model's working memory exceeds available RAM, macOS starts swapping to disk, and performance collapses. The memory bar lets you see this happening in real time so you can switch to a smaller model or close other apps before the system grinds to a halt.

**Automatic model filtering.** ToolPiper hides models that will not fit in your available memory. You never accidentally download a 14B model on an 8GB machine only to discover it swaps to disk at 2 tok/s.
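The filtering idea can be sketched in a few lines. ToolPiper's actual rules are not described here, so everything below is an assumption for illustration: the function name, the 15% headroom for the KV cache and runtime buffers, and the example memory budgets. The footprints are the rough Q4 sizes quoted earlier in this article.

```python
from typing import Dict, List

# Rough Q4 footprints from this article, in GB.
MODEL_FOOTPRINT_GB: Dict[str, float] = {
    "Llama 3.2 3B": 2.0,
    "Llama 3.1 8B": 5.0,
    "Qwen 2.5 14B": 8.0,
}


def models_that_fit(available_gb: float, headroom: float = 0.15) -> List[str]:
    """Return models whose weights plus a headroom margin fit in available memory.

    `headroom` is an assumed allowance for the KV cache and runtime buffers.
    """
    return [
        name
        for name, size_gb in MODEL_FOOTPRINT_GB.items()
        if size_gb * (1 + headroom) <= available_gb
    ]


# Hypothetical budgets: memory left after macOS and other apps take their share.
print(models_that_fit(5.0))
print(models_that_fit(8.0))
```

The input is available memory, not installed RAM, because what matters is what remains once the system and your other apps are accounted for.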

## What are the honest limitations?

These numbers are approximate. Your actual performance will vary based on several factors you should know about.

**Context length matters more than people realize.** The benchmarks above use 2K context. At 8K context, expect roughly 15-20% slower generation. At 16K context, expect 25-35% slower. At 32K context, the slowdown is more significant and flash attention becomes essential.
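To apply those percentages to your own machine, scale a 2K-context baseline by the quoted slowdown range. The sketch below uses only the 8K and 16K figures from the paragraph above; the names are illustrative, and the result is a rough range, not a measurement.

```python
from typing import Dict, Tuple

# Slowdown ranges quoted above, as fractions of the 2K-context speed.
SLOWDOWN: Dict[str, Tuple[float, float]] = {
    "8K": (0.15, 0.20),
    "16K": (0.25, 0.35),
}


def expected_tok_per_s(base_tok_s: float, context: str) -> Tuple[float, float]:
    """Return (slowest, fastest) expected tok/s at the given context length."""
    smallest_cut, largest_cut = SLOWDOWN[context]
    return base_tok_s * (1 - largest_cut), base_tok_s * (1 - smallest_cut)


# Baseline: Llama 3.1 8B on the M2 Air at 2K context, from the table above.
for context in SLOWDOWN:
    slowest, fastest = expected_tok_per_s(18, context)
    print(f"{context} context: ~{slowest:.1f}-{fastest:.1f} tok/s")
```

For the 8B model on an M2 Air, that means dropping from ~18 tok/s to roughly 14-15 tok/s at 8K and 12-14 tok/s at 16K.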

**Battery mode throttles the GPU.** If you are running on battery power, macOS reduces GPU clock speeds to save energy. Expect 30-50% lower tok/s compared to plugged-in performance. If speed matters, plug in.

**Background GPU load reduces bandwidth.** A browser with hardware-accelerated video, a game, or even a compositing-heavy desktop all compete for GPU resources and memory bandwidth. For the best benchmark results, close GPU-heavy apps. For everyday use, the numbers will be somewhat lower.

**M1 base with 8GB is the floor.** You can run 0.8B and 3B models comfortably. The 8B models will load but will be memory-constrained after accounting for macOS and your other apps. Expect 10-14 tok/s for Llama 3.1 8B on M1 8GB with nothing else running.

**macOS version affects Metal performance.** Apple regularly improves Metal GPU performance in macOS updates. Running the latest macOS version generally gives you the best results.

## How does local speed compare to cloud services?

Cloud LLM services like ChatGPT, Claude, and Gemini typically stream responses at 30-60 tok/s. But that speed includes network latency, server queuing, and rate limiting. During peak hours, cloud services can slow down significantly or refuse requests entirely.

A local model on an M2 Air at 18 tok/s for an 8B model is slower than peak cloud speed. But it is consistent, unlimited, and private. There is no queue, no rate limit, no internet requirement, and no per-token cost.

For small models (0.8B-3B), local speed is in the same range as cloud streaming: roughly 30-80 tok/s on the M2 machines benchmarked above. For larger models, you trade raw speed for privacy and unlimited usage.

## Try It

Download [ModelPiper](https://modelpiper.com). Install ToolPiper. Pick a model that fits your RAM. Watch the memory bar as it loads. Start chatting and see the tok/s counter in real time.

Your chip's memory bandwidth is fixed. But now you know exactly what it can do.

_This is part of a series on [local-first AI workflows on macOS](/blog/local-first-ai-macos). Related: [Which Local LLM](/blog/which-local-llm-mac) covers model selection and [AI Model Memory](/blog/ai-model-memory-mac) explains RAM requirements in detail._

## FAQ

### Why is my local LLM slow?

The most common cause is memory pressure. If the model's working memory exceeds your available RAM, macOS swaps to SSD and performance collapses from 20+ tok/s to under 2 tok/s. Check ToolPiper's memory bar. Other causes: running on battery (expect 30-50% lower tok/s), long context (8K+ tokens slow generation), and background apps competing for GPU bandwidth (browsers with video, games).

### How much does context length affect speed?

At 2K context, you get full speed. At 8K context, expect roughly 15-20% slower generation. At 16K, expect 25-35% slower. At 32K, the slowdown is more significant. Flash attention (enabled by default in ToolPiper) helps reduce this penalty, especially above 4K context. If speed matters and your conversation is long, start a new chat to reset the context.

### Is M1 good enough for local AI?

Yes, with the right model. M1 with 8GB runs 0.8B and 3B models comfortably at 20-40 tok/s. Llama 3.1 8B will run at 10-14 tok/s, which is usable but noticeably slower. M1 with 16GB gives you more headroom and lets you run 8B models without memory pressure. The M1 is not fast enough for 14B+ models, but it handles the most popular local models well.

### Does quantization affect quality?

Minimally at Q4\_K\_M, which is why it is the standard recommendation. Q4\_K\_M reduces model size by roughly 3-4x compared to FP16 with negligible quality loss on most tasks. Q8 is slightly higher quality but uses twice the memory of Q4 and runs at roughly half the speed. For most users, Q4\_K\_M is the correct choice. You would need to run careful blind comparisons to notice the difference between Q4\_K\_M and Q8 in normal conversation.

### How does local speed compare to ChatGPT?

Cloud services stream at 30-60 tok/s at peak. A local 8B model on an M2 Air does ~18 tok/s, which is slower than peak cloud speed but still comfortable to read as it streams. Small local models (0.8B-3B) run in the same range as cloud streaming. The real comparison is consistency: local speed never degrades due to server load, rate limits, or network issues. And there is no per-token cost.