Does Ollama still use llama.cpp?

Partly. As of mid-2026 Ollama runs three engines: a vendored llama.cpp runner (its repo pins build b9509) for many architectures, an in-house Go engine built directly on the GGML library for newer models, and an opt-in MLX preview on Apple Silicon. Which engine a model hits is decided internally per architecture - our Llama 3.2 runs used the llama.cpp runner while Qwen3 and Gemma 4 used the Go engine.

Why did you benchmark Ollama's blob files directly?

Because the bytes inside Ollama's sha256-named blobs are standard GGUF, pointing llama-server at the blob file itself removes every confound about whether the two runtimes loaded "the same model." Same file on disk, same weights, same quantization - the only variables left are the engines.

Is ToolPiper faster than Ollama?

We claim parity, not a speed win. ToolPiper embeds the upstream llama-server build measured here (b9533, unmodified), so its generation speed is the llama-server column in our tables: within single-digit percentages of Ollama, ahead on some models and behind on others. The differences that matter day to day are elsewhere - plain GGUF storage, a native GUI, tuned defaults out of the box, and no cloud tier.

Can I reproduce these benchmarks?

Yes. Every command and flag, plus the prompt-caching workaround, is in the methodology section. You need Ollama, any llama-server build, and about twenty minutes. ToolPiper's embedded engine answers the same /completion endpoint, so the identical script benchmarks it too.

Ollama vs llama.cpp Benchmarks on Apple Silicon (2026)

You can find people online claiming Ollama runs models at half the speed of llama.cpp, and people claiming there's no difference at all. Both camps are usually benchmarking different files on different hardware with different settings, which is how you get a folklore gap instead of a measured one.

So we measured it. Same GGUF bytes, same Mac, same prompt, same context size, out-of-the-box defaults on both sides. Every command is in the methodology section so you can reproduce the whole thing in twenty minutes.

Is Ollama slower than llama.cpp on a Mac?

On Apple Silicon, no - not meaningfully. In our June 2026 testing on an M2 Max 32GB, token generation speeds for identical Q4_K_M GGUF files differed by 2-7% between Ollama 0.23.4 and upstream llama-server b9533, and the winner changed depending on the model.

That's the headline, and it cuts both ways. Ollama's runtime held a 2.2% edge on Llama 3.2 3B and a 3.7% edge on Gemma 4 12B. Upstream llama-server held a 6.6% edge on Qwen3 4B. Nothing here resembles the 2x penalties that circulate in Reddit threads - at least not on this hardware, with these models, at these defaults.

What exactly did we benchmark?

The trick that makes this comparison airtight: Ollama stores model weights as sha256-named blobs under ~/.ollama/models/blobs/, and the bytes inside those blobs are standard GGUF. So we pointed llama-server directly at Ollama's own blob files. Not the same model re-downloaded from somewhere else - the same file on disk, byte for byte.
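One way to sanity-check that claim yourself is to read the first bytes of each blob: a GGUF file starts with the four ASCII bytes GGUF, followed by a little-endian 32-bit format version. The sketch below is a minimal illustration, not part of Ollama's tooling; the function names are ours, and the default directory is simply the blob path mentioned above.

```python
import struct
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def gguf_version(path):
    """Return the GGUF format version if the file starts with the GGUF magic, else None."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    # Bytes 4..8 hold the format version as a little-endian uint32.
    return struct.unpack("<I", header[4:8])[0]


def find_gguf_blobs(blob_dir=Path.home() / ".ollama" / "models" / "blobs"):
    """List (path, version) for every blob that looks like GGUF; other blobs are skipped."""
    blob_dir = Path(blob_dir)
    if not blob_dir.is_dir():
        return []
    found = []
    for p in sorted(blob_dir.iterdir()):
        if p.is_file():
            version = gguf_version(p)
            if version is not None:
                found.append((p, version))
    return found
```

Any path this returns can be handed to llama-server --model as-is, which is all the same-bytes setup requires.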

Three models, three size classes, all Q4_K_M:

Llama 3.2 3B Instruct - Ollama's llama3.2:latest. Ollama dispatched this one to its vendored llama.cpp runner (its repo pins build b9509).

Qwen3 4B Thinking 2507 - Ollama's qwen3:4b. Dispatched to Ollama's own Go engine, the one built directly on GGML rather than wrapping llama.cpp.

Gemma 4 12B IT - imported into Ollama from a plain GGUF with ollama create. Also dispatched to the Go engine.

That dispatch detail matters. Ollama in mid-2026 is really three engines in a trench coat: a vendored llama.cpp runner for older architectures, the in-house Go/GGML engine for newer ones, and an MLX preview on Apple Silicon (opt-in, not tested here). When someone says "Ollama is slow," the first question is which engine their model actually hit. We confirmed the dispatch for each model in Ollama's server logs.

The numbers

Medians of five runs after one warmup, 1,216-token prompt, 256 tokens generated, temperature 0. The token generation and prompt processing tables at the end of this post are the whole result.

Two things stand out. First, the generation gap never exceeds 7% in either direction. Token generation on Apple Silicon is memory-bandwidth bound, and both runtimes sit on the same GGML kernels underneath, so this is roughly what you'd expect once the folklore is stripped away. Second, Ollama's Go engine showed its only real deficit on Qwen3 4B (83.6 vs 89.1 tok/s) while beating upstream on Gemma 4 12B (34.0 vs 32.8 tok/s) - so even the engine-fork story doesn't reduce to a clean "the fork is slower."

Prompt processing was equally close: within 1% on Llama 3.2, upstream ahead 5.3% on Qwen3, Ollama ahead 5.8% on Gemma 4.
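For anyone checking the arithmetic, each percentage is the faster median divided by the slower one, minus one. A short sketch using the medians from our tables (the function name is ours, purely for illustration):

```python
def edge(llama_server_tps, ollama_tps):
    """Return (winner, percent), where percent is the faster engine's lead over the slower one."""
    if ollama_tps >= llama_server_tps:
        winner, fast, slow = "Ollama", ollama_tps, llama_server_tps
    else:
        winner, fast, slow = "llama-server", llama_server_tps, ollama_tps
    return winner, round((fast / slow - 1) * 100, 1)


# (llama-server b9533, Ollama 0.23.4) median tok/s, copied from the tables
generation = {
    "Llama 3.2 3B Instruct": (115.0, 117.5),
    "Qwen3 4B Thinking 2507": (89.1, 83.6),
    "Gemma 4 12B IT": (32.8, 34.0),
}
prompt_processing = {
    "Llama 3.2 3B Instruct": (1465.7, 1476.6),
    "Qwen3 4B Thinking 2507": (1070.4, 1016.4),
    "Gemma 4 12B IT": (363.6, 384.7),
}

for label, table in (("generation", generation), ("prompt", prompt_processing)):
    for model, (llama_server_tps, ollama_tps) in table.items():
        winner, pct = edge(llama_server_tps, ollama_tps)
        print(f"{label:10s} {model}: {winner} +{pct}%")
```

Note that the lead is always expressed relative to the slower engine, so "Ollama +2.2%" and "llama-server +6.6%" are measured on the same footing.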

Why do people report much bigger gaps?

The dramatic Ollama-vs-llama.cpp numbers almost always come from different hardware or different settings, not from the engines themselves. The famous 20x figure was measured on an NVIDIA DGX Spark with gpt-oss-120b, where Ollama's CUDA path lagged badly - a real result, but not one that transfers to a Mac running Metal.

Four things inflate the perceived gap in practice:

Hardware path. llama.cpp's maintainer Georgi Gerganov publicly noted llama.cpp ran almost 20x faster than Ollama for gpt-oss-120b on the DGX Spark. CUDA and Metal are different codepaths with different maintenance attention. Our Metal numbers say nothing about CUDA, and vice versa.

Defaults. Ollama historically shipped a 2,048-token default context and silent reload-on-overflow. A model that quietly re-processes its context reads as "slow" in a way no tokens-per-second number captures. We pinned both sides to 4,096 context to take this off the table.

Version skew. Both projects move fast. Ollama 0.23.4 vendors llama.cpp b9509; the upstream build we tested is b9533, released days apart. A six-month-old comparison is a comparison of six-month-old software.

Different files. "Same model" often isn't. A Q4_K_M from one uploader and a Q4_0 from Ollama's registry can differ by double-digit percentages in speed and quality. Same-bytes testing is the only version of this comparison that means anything, which is why we did it that way.
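If you are comparing a GGUF you downloaded against whatever a runner has on disk, hashing both files settles the "same model" question before any benchmark runs. A minimal sketch; the function names are ours, and the last helper assumes blob filenames have the form sha256-<hex digest>, in line with the sha256-named blobs described earlier.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so a multi-gigabyte GGUF never sits fully in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def same_bytes(path_a, path_b):
    """True only if both files have identical contents, judged by SHA-256."""
    return sha256_of(path_a) == sha256_of(path_b)


def matches_blob_name(gguf_path, blob_path):
    """True if the GGUF's digest equals the hex part of a 'sha256-<hex>' filename (assumed naming)."""
    name = Path(blob_path).name
    prefix = "sha256-"
    return name.startswith(prefix) and name[len(prefix):] == sha256_of(gguf_path)
```

If same_bytes returns False, any speed difference you measure is partly a difference between files, not between engines.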

How we ran it (full methodology)

Hardware: MacBook Pro, Apple M2 Max, 32GB unified memory, macOS 26.5. Software: Ollama 0.23.4 (Metal; flash attention auto-enabled per its logs), llama-server build b9533, commit c4a278d68, built for arm64. One other model sat loaded but idle in a separate process throughout - identical conditions for both engines, and no inference ran on it during any measurement.

Protocol, in full:

1. The prompt is a fixed ~1,200-token passage. Each run prepends a unique numeric prefix of identical token shape, so neither engine can serve a cached prefix. llama-server additionally gets cache_prompt: false.

2. llama-server: llama-server --model <blob> -ngl 99 -c 4096, then POST /completion with n_predict: 256, temperature: 0. Speeds read from the response's timings object.

3. Ollama: stock ollama serve, then POST /api/generate with raw: true (bypasses chat templating, so both engines see the identical text), num_predict: 256, temperature: 0, num_ctx: 4096. Speeds computed as prompt_eval_count / prompt_eval_duration and eval_count / eval_duration, with the durations converted from nanoseconds to seconds.

4. One warmup, five measured runs per model per engine, medians reported. Models benchmarked sequentially, never two loaded by the same engine at once, and each Ollama model explicitly unloaded before the next.

5. The Gemma 4 12B GGUF was imported into Ollama with a one-line Modelfile (FROM /path/to/gemma-4-12b-it-Q4_K_M.gguf) and ollama create. Importing a GGUF into Ollama is easy. Getting one out is not - there is no ollama export, and the long-standing issue asking for one was closed "not planned" in April 2026. We have a separate guide on that.
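Steps 1 through 4 can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the exact script we ran: the URLs assume each server's default port (8080 for llama-server, 11434 for Ollama), the zero-padded prefix is one possible way to get a "unique numeric prefix of identical token shape", stream: false is added so Ollama returns a single JSON object, and all helper names are ours.

```python
import json
import statistics
import urllib.request

# Assumed default ports; adjust if your servers listen elsewhere.
LLAMA_URL = "http://127.0.0.1:8080/completion"
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"


def unique_prompt(passage, run_id):
    """Prepend a fixed-width numeric prefix so no run shares a cacheable prefix with another."""
    return f"{run_id:06d}\n{passage}"


def llama_payload(prompt):
    """Request body for llama-server's /completion endpoint (step 2)."""
    return {"prompt": prompt, "n_predict": 256, "temperature": 0, "cache_prompt": False}


def ollama_payload(model, prompt):
    """Request body for Ollama's /api/generate endpoint (step 3)."""
    return {
        "model": model,
        "prompt": prompt,
        "raw": True,      # skip chat templating so both engines see identical text
        "stream": False,  # our addition: one JSON response instead of a stream
        "options": {"num_predict": 256, "temperature": 0, "num_ctx": 4096},
    }


def llama_speeds(resp):
    """(prompt tok/s, generation tok/s) read from llama-server's timings object."""
    t = resp["timings"]
    return t["prompt_per_second"], t["predicted_per_second"]


def ollama_speeds(resp):
    """(prompt tok/s, generation tok/s); Ollama reports durations in nanoseconds."""
    prompt_tps = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
    gen_tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    return prompt_tps, gen_tps


def post_json(url, payload):
    """POST a JSON body and decode the JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        return json.load(r)


def bench(run_once, warmups=1, runs=5):
    """Call run_once(i) for each run, drop the warmups, return the median of each speed."""
    results = [run_once(i) for i in range(warmups + runs)][warmups:]
    return (
        statistics.median(p for p, _ in results),
        statistics.median(g for _, g in results),
    )
```

With a passage loaded into a string, bench(lambda i: llama_speeds(post_json(LLAMA_URL, llama_payload(unique_prompt(passage, i))))) measures llama-server, and swapping in ollama_payload, OLLAMA_URL, and ollama_speeds measures Ollama. Keeping the speed extraction separate from the HTTP call makes it easy to verify the arithmetic against saved responses.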

What should you actually pick a runner on?

Not speed. That's the practical conclusion of this whole exercise: on a Mac, in mid-2026, the engines are close enough that tokens per second is the wrong axis to decide on. The axes with real daylight between them are storage layout (named GGUF files you can point any tool at, versus sha256 blobs behind a manifest), interface (a native app versus a CLI plus env vars), defaults (KV cache quantization and flash attention on out of the box, or flags you have to know about), and where the project is headed (on-device features versus a cloud tier).

ToolPiper embeds the exact upstream llama-server build these numbers come from - b9533, stated publicly, tracked release to release - and ships the tuned defaults as the defaults. The whole runner is free: unlimited GGUF downloads, multi-model, the local OpenAI-compatible API, embeddings, and an MCP server with over 300 tools. No account, no caps, no terminal. The full comparison covers everything beyond the tokens.

Download ToolPiper at modelpiper.com/download and run the same benchmark against it - the commands above work unchanged against its embedded engine.

This post is part of our local LLM performance series. See Local LLM Benchmarks on Apple Silicon for cross-device numbers and OLLAMA_KV_CACHE_TYPE for the memory side of the tuning story.

Token Generation: Same GGUF, Same Mac (median tok/s, 256 tokens, temp 0, M2 Max 32GB)

Model (Q4_K_M)	llama-server b9533	Ollama 0.23.4	Delta	Ollama engine used
Llama 3.2 3B Instruct	115.0 tok/s	117.5 tok/s	Ollama +2.2%	vendored llama.cpp runner
Qwen3 4B Thinking 2507	89.1 tok/s	83.6 tok/s	llama-server +6.6%	Go engine (GGML)
Gemma 4 12B IT	32.8 tok/s	34.0 tok/s	Ollama +3.7%	Go engine (GGML)

Prompt Processing: 1,216-Token Prompt (median tok/s)

Model (Q4_K_M)	llama-server b9533	Ollama 0.23.4	Delta
Llama 3.2 3B Instruct	1,465.7 tok/s	1,476.6 tok/s	Ollama +0.7%
Qwen3 4B Thinking 2507	1,070.4 tok/s	1,016.4 tok/s	llama-server +5.3%
Gemma 4 12B IT	363.6 tok/s	384.7 tok/s	Ollama +5.8%
