What is the KV cache and why does it eat your memory?

When a language model generates text, it computes key and value vectors for every token in the context window. These vectors are stored in the KV cache so the model doesn't have to recompute them for every new token. Without the cache, each generation step would re-run attention over the entire context from scratch, making per-token cost grow quadratically with sequence length. With it, each new token only needs to attend to the cached keys and values.

The cost is memory. Each layer of the model stores a separate set of key and value vectors for every token, so the cache grows linearly with context length. A 7B model with 32 layers, running at FP16 precision with an 8K context window, allocates roughly 1GB for the KV cache alone. Double the context to 16K and the cache doubles to 2GB. At 32K context, it's 4GB, nearly as much as the model weights themselves at Q4 quantization.
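The arithmetic behind those numbers is straightforward to sketch in Python. One caveat: the ~1GB figure implies grouped-query attention (8 KV heads instead of 32, as in Mistral-style 7B models); a 7B model without GQA, such as Llama-2 7B, stores four times as much. The dimensions below are assumptions chosen to match the article's figures, not a universal constant.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2.0):
    """Estimate KV cache size: two tensors (K and V) per layer,
    each storing n_kv_heads * head_dim values per token."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Mistral-style 7B (assumed): 32 layers, 8 KV heads (GQA), head_dim 128, FP16
for ctx in (8_192, 16_384, 32_768):
    gb = kv_cache_bytes(32, 8, 128, ctx) / 2**30
    print(f"{ctx:>6} tokens: {gb:.1f} GB")
```

Running this prints 1.0, 2.0, and 4.0 GB for 8K, 16K, and 32K context, matching the estimates above.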

This is the wall most people hit without realizing it. You load a 7B model (4.5GB at Q4), set context to 32K, and suddenly the process is using 8-9GB. On a 16GB Mac, that's game over for running anything else alongside it. The model weights didn't change. The KV cache is what grew.

Ollama's KV cache quantization options

Ollama supports compressing the KV cache from its default FP16 representation into lower-precision formats. This is entirely separate from model weight quantization (Q4, Q5, Q8) — you can run a Q4 model with an FP16 cache, or a Q8 model with a Q4 cache. They're independent knobs.

The setting is a single environment variable: OLLAMA_KV_CACHE_TYPE. The default is f16 (no compression). One prerequisite: KV cache quantization only takes effect when flash attention is enabled, so depending on your Ollama version you may also need to set OLLAMA_FLASH_ATTENTION=1.

q8_0 — the safe default

8-bit quantization. Cuts KV cache memory roughly in half. Quality impact is negligible — published benchmarks show perplexity increases of 0.002 to 0.05, which is undetectable in conversational use. If you're going to change one thing after reading this article, set q8_0 and forget about it.

q4_0 — aggressive compression

4-bit quantization. Cuts KV cache memory to roughly one quarter of FP16. Quality impact is small but measurable — you may notice slightly less coherent output on very long contexts or complex reasoning tasks. For chat, summarization, and code generation at normal context lengths, it's hard to tell the difference. At 64K+ context, the accumulated quantization noise becomes more noticeable.
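To put the three cache types side by side, here's a sketch of the per-value storage cost. The block layouts are an assumption based on llama.cpp's quantization formats (q8_0 and q4_0 store 32-value blocks with a 2-byte scale, so 34 and 18 bytes per block); the resulting ratios line up with the "roughly half" and "roughly one quarter" figures above.

```python
# Bytes per cached value for each KV cache type
# (q8_0/q4_0: 32-value blocks plus a 2-byte fp16 scale; assumed llama.cpp layout)
BYTES_PER_VALUE = {
    "f16": 2.0,
    "q8_0": 34 / 32,   # 32 one-byte values + 2-byte scale
    "q4_0": 18 / 32,   # 32 half-byte values (16 bytes) + 2-byte scale
}

def cache_gb(ctx_len, cache_type, n_layers=32, n_kv_heads=8, head_dim=128):
    """KV cache size in GB for a Mistral-style 7B (assumed dimensions)."""
    values = 2 * n_layers * n_kv_heads * head_dim * ctx_len
    return values * BYTES_PER_VALUE[cache_type] / 2**30

for t in BYTES_PER_VALUE:
    print(f"{t:>5} at 32K context: {cache_gb(32_768, t):.2f} GB")
```

At 32K context this yields about 4.0 GB (f16), 2.1 GB (q8_0), and 1.1 GB (q4_0), consistent with the savings described in this section.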

tq3 / tq4 — TurboQuant (coming soon)

Based on Google's PolarQuant paper (ICLR 2026). TurboQuant applies a randomized Hadamard rotation to key vectors before quantizing, which distributes information more evenly across dimensions and reduces quantization error. TQ4 (4-bit) achieves quality close to q8_0 at compression ratios close to q4_0 — roughly the best of both worlds. TQ3 (3-bit) pushes further, achieving nearly 5x compression versus FP16.

TurboQuant is currently in development for llama.cpp (PR #21089) and hasn't merged into mainline yet. Once it lands in llama.cpp, Ollama and other tools that build on it will follow. The benchmarks are promising — when it ships, TQ4 will likely become the new best default for users who want both compression and quality.

How to enable it

The setup depends on how you run Ollama.

If you run Ollama from the terminal

Set the environment variable before starting the server:

OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve

Or add it to your shell profile for persistence:

export OLLAMA_KV_CACHE_TYPE=q8_0

Add that line to ~/.zshrc (macOS default) or ~/.bashrc, then restart your terminal and Ollama.

If you run the Ollama macOS app

The macOS app doesn't read shell environment variables. Use launchctl instead:

launchctl setenv OLLAMA_KV_CACHE_TYPE q8_0

Then quit and reopen the Ollama app. The setting persists until you log out or restart. To make it permanent across reboots, add the launchctl setenv command to a login script or LaunchAgent plist.
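For the LaunchAgent route, a minimal plist that re-applies the setting at every login might look like this (the label com.user.ollama-kv-cache is a placeholder; any unique reverse-DNS name works). Save it as ~/Library/LaunchAgents/com.user.ollama-kv-cache.plist:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.user.ollama-kv-cache</string>
    <key>ProgramArguments</key>
    <array>
        <string>/bin/launchctl</string>
        <string>setenv</string>
        <string>OLLAMA_KV_CACHE_TYPE</string>
        <string>q8_0</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
</dict>
</plist>
```

launchd loads agents in this directory at login and runs the launchctl setenv command for you, so the Ollama app picks up the variable without any manual step after a reboot.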

Verify it's working

After restarting Ollama, load a model and check the server logs. You should see the KV cache type mentioned during model initialization. If you're using ToolPiper's resource monitor, you'll see the difference in per-model memory consumption directly — a model at 16K context with q8_0 KV cache will show noticeably lower resident memory than the same model at FP16.

When does this actually matter?

At default context lengths (2048-4096 tokens), the KV cache is small relative to model weights. A 7B model at 4K context uses maybe 500MB for the cache. Quantizing that saves 250-375MB — nice, but not transformative.

The math changes at longer contexts:

7B model at 32K context: KV cache at FP16 is roughly 4GB. At q8_0, it's about 2GB. At q4_0, about 1GB. That's a 3GB savings — enough to load a second small model.

7B model at 128K context: KV cache at FP16 would need roughly 16GB. More than the model itself. At q4_0, it drops to about 4GB. This is the difference between "impossible on 32GB" and "comfortable on 32GB."

13B model at 16K context: KV cache at FP16 is about 4GB on top of the model's 9.5GB. Total: 13.5GB. At q8_0, the cache drops to 2GB, total 11.5GB — enough headroom on a 16GB Mac to avoid swapping.

The pattern: KV cache quantization matters most when context length × model size pushes you near your hardware's memory limit. If you're running a 3B model at 4K context on a 32GB Mac, you won't notice the difference. If you're running a 13B model at 32K context on 16GB, it's the difference between usable and unusable.
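That pattern can be turned into a quick feasibility check. The sketch below adds model weights to a scaled KV cache estimate and compares the total against usable RAM. The 75% usable-RAM fraction is an assumption (macOS needs headroom for the OS and other apps), and the cache scaling factors are the "roughly half" and "roughly one quarter" ratios from earlier in the article.

```python
def fits(weights_gb, cache_fp16_gb, cache_type="q8_0", ram_gb=16):
    """Rough check: do model weights plus the KV cache fit in usable RAM?
    Assumes ~75% of RAM is usable for models (the rest goes to the OS)."""
    scale = {"f16": 1.0, "q8_0": 0.5, "q4_0": 0.25}[cache_type]
    total = weights_gb + cache_fp16_gb * scale
    return total, total <= 0.75 * ram_gb

# The 13B scenario above: 9.5 GB weights, ~4 GB FP16 cache at 16K, 16 GB Mac
for t in ("f16", "q8_0", "q4_0"):
    total, ok = fits(9.5, 4.0, t)
    print(f"{t:>5}: {total:.1f} GB total -> {'fits' if ok else 'tight'}")
```

With these assumptions, FP16 comes out at 13.5 GB (over the 12 GB budget) while q8_0 lands at 11.5 GB, which is the "enough headroom to avoid swapping" case described above.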

Quality trade-offs: what you actually lose

Model weight quantization (Q4 vs Q8 vs FP16) affects the model's core reasoning ability across every token. KV cache quantization is different — it affects how precisely the model remembers prior context. The degradation shows up as subtle attention errors: the model might occasionally lose track of a detail mentioned 10,000 tokens ago, or slightly misattribute who said what in a long conversation.

At q8_0, these errors are vanishingly rare. Benchmark perplexity increases by 0.002 to 0.05 depending on the model and context length. In practice, nobody notices.

At q4_0, the errors are more frequent but still subtle. For chat and code generation, the quality is fine. For tasks that require precise long-range recall — "what was the third item in the list I gave you 20K tokens ago" — you might see occasional misses. The 7.6% perplexity increase reported in benchmarks is comparable to the impact of going from Q8 to Q4 on model weights. Usable, with a trade-off you can feel on demanding tasks.

When TurboQuant lands in llama.cpp, tq4 should offer quality between q8_0 and q4_0 at compression close to q4_0 — the Hadamard rotation trick preserves quality better than raw quantization at the same bit width. Early benchmarks from community forks are promising.

Apple Silicon considerations

On Apple Silicon, there's one thing worth knowing: KV cache quantization adds a small compute overhead for the quantize/dequantize step on each attention operation. On NVIDIA GPUs this is negligible. On Apple's Metal backend, some users have reported slight generation speed regressions — typically 5-10% fewer tokens per second.

Whether this matters depends on your bottleneck. If you're memory-constrained (the model barely fits), the trade-off is obviously worth it — slightly slower generation beats swapping to disk. If you have plenty of memory headroom and just want to enable it "because why not," test with your specific model and context length. For most setups, q8_0 shows no perceptible speed difference.

ToolPiper ships with KV cache quantization enabled

If you use ToolPiper, you don't need to configure any of this. ToolPiper's bundled llama.cpp engine launches with q8_0 KV cache quantization on both keys and values by default, alongside flash attention. Every model you load through ToolPiper gets the memory savings automatically — no environment variables, no restarts, no launchctl.

Ollama defaults to FP16 and requires you to opt in. ToolPiper defaults to q8_0 because there's no reason not to: the quality loss is imperceptible in practice and the memory savings are real. A 7B model at 32K context uses roughly 2GB less KV cache memory than the same model through Ollama's default settings.

ToolPiper also runs each model as a separate llama.cpp server process, which means you get per-model visibility. The resource monitor shows actual resident memory for each loaded model, so you can see exactly what the KV cache costs. Load a model, check the number, compare it against the estimates in this article. No guessing, no math.

For users running multiple models simultaneously — a chat model plus a coding model, or a voice pipeline with STT, LLM, and TTS — the cumulative savings from q8_0 KV cache across all loaded models adds up. On a 16GB Mac, it can be the difference between fitting your setup and hitting swap.

Download ToolPiper at modelpiper.com or the Mac App Store.

This is part of a series on Ollama frontends for Mac. See also: Run Multiple Ollama Models on Mac for managing memory across multiple models, and How AI Model Memory Works on Mac for the fundamentals.