By Ben Racicot, 2026-06-02

Ollama Environment Variables: The Complete Reference (2026)

TL;DR

Ollama is configured through environment variables that the server reads at startup. The most-used are OLLAMA_HOST (bind address), OLLAMA_MODELS (storage path), OLLAMA_KEEP_ALIVE (how long models stay loaded), OLLAMA_KV_CACHE_TYPE (cache quantization), and OLLAMA_FLASH_ATTENTION. Set them in your shell profile, with launchctl for the macOS app, in the systemd unit on Linux, or with -e flags in Docker, then restart Ollama.


Ollama's configuration lives in environment variables, and finding the full list is harder than it should be. The official FAQ documents a handful. The rest are scattered across GitHub issues, Reddit threads, and the source. Two of the top Google results for "ollama environment variables" are literally GitHub issues asking the maintainers to write this page down.

So here it is: every Ollama environment variable that matters, what it controls, its default, and where to set it on macOS, Linux, Docker, and Windows. If you want Ollama to use less memory, stay loaded between requests, or accept connections from another machine, the variable you need is in the table below. This reference assumes Ollama is already running. If it is not, our install guide for Mac gets you there first.

What are Ollama environment variables?

Ollama environment variables are settings the Ollama server reads at startup to control networking, model storage, memory behavior, and inference performance. The server reads them once when it launches, so changing one only takes effect after you restart Ollama.

The distinction that trips people up: these variables configure the background server (ollama serve), not the ollama run command. Setting one in the same shell where you type ollama run does nothing if the server is already running as a separate process or as the macOS app. The server has its own environment, and that is the one that counts. The one partial exception is OLLAMA_HOST, which the ollama CLI also reads to find the server it talks to.

Where do you set Ollama environment variables?

Set Ollama environment variables in your shell profile for terminal use, with launchctl setenv for the macOS app, in the systemd unit on Linux, or with -e flags for Docker. Ollama reads them at server startup, so restart Ollama after any change.

The right method depends entirely on how Ollama is running, and getting this wrong is the most common reason a variable seems to be ignored.

Terminal (you run ollama serve yourself): add the export to your shell profile.

export OLLAMA_KV_CACHE_TYPE=q8_0

Put that line in ~/.zshrc (zsh, the default macOS shell) or ~/.bashrc (bash), then open a new terminal and start the server.
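You can also script the edit instead of opening an editor. Below is a minimal bash sketch that adds the export to a profile file. If the file already has an export for the same variable, the sketch replaces it, so running it again does not stack duplicate lines. The function name set_profile_var is hypothetical and is not part of Ollama.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not an Ollama command): set_profile_var FILE NAME VALUE
# Leaves FILE with exactly one "export NAME=VALUE" line.
set_profile_var() {
  local file="$1" name="$2" value="$3"
  local line="export ${name}=${value}"
  touch "$file"
  if grep -q "^export ${name}=" "$file"; then
    # Drop the old line and append the new one. Writing back with cat keeps
    # the file's permissions and avoids GNU/BSD differences in "sed -i".
    local tmp
    tmp="$(mktemp)"
    grep -v "^export ${name}=" "$file" > "$tmp" || true  # grep exits 1 if nothing is left
    printf '%s\n' "$line" >> "$tmp"
    cat "$tmp" > "$file"
    rm -f "$tmp"
  else
    printf '%s\n' "$line" >> "$file"
  fi
}
```

For example, run set_profile_var ~/.zshrc OLLAMA_KV_CACHE_TYPE q8_0, then open a new terminal. The sketch only matches lines that begin exactly with "export NAME=". It leaves an indented export, or one inside a conditional block, untouched.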

macOS app: the menu-bar app does not read your shell profile. Use launchctl, then quit and reopen Ollama.

launchctl setenv OLLAMA_KV_CACHE_TYPE q8_0

Linux (systemd service): edit the unit with sudo systemctl edit ollama.service and add the variable under [Service].

[Service]
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"

Then run sudo systemctl daemon-reload and sudo systemctl restart ollama.
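When you set several variables, it can be easier to generate the whole drop-in file. Below is a minimal bash sketch. It assumes the standard drop-in location that systemctl edit writes to, /etc/systemd/system/ollama.service.d/override.conf. The function name is hypothetical, and it does not escape values that contain double quotes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a systemd drop-in setting one or more Ollama
# variables. Each argument must be an OLLAMA_NAME=VALUE pair.
ollama_systemd_override() {
  printf '[Service]\n'
  local pair
  for pair in "$@"; do
    case "$pair" in
      OLLAMA_*=*) printf 'Environment="%s"\n' "$pair" ;;
      *) printf 'skipping %s: expected OLLAMA_NAME=VALUE\n' "$pair" >&2 ;;
    esac
  done
}
```

Usage:

1. Run sudo mkdir -p /etc/systemd/system/ollama.service.d.
2. Run ollama_systemd_override OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_FLASH_ATTENTION=1 | sudo tee /etc/systemd/system/ollama.service.d/override.conf.
3. Run the same daemon-reload and restart.

tee replaces any override you already have, so if that file exists, merge the lines by hand instead.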

Docker: pass it with -e at run time.

docker run -e OLLAMA_KV_CACHE_TYPE=q8_0 -p 11434:11434 ollama/ollama

Windows: set it in the system environment variables panel (search "edit environment variables"), then quit Ollama from the system tray and relaunch.

The complete Ollama environment variable reference

Every variable that matters, grouped by what it controls, with its default and a common override. Defaults shift between releases, so treat the official Ollama FAQ and the source config as the version-specific ground truth. One default that moves: OLLAMA_CONTEXT_LENGTH is 4096 on most machines, but Ollama 0.15.5 made it VRAM-aware, so a Mac with 24GB or more of unified memory can default to 32K tokens or higher.

How do I keep an Ollama model loaded in memory?

Set OLLAMA_KEEP_ALIVE to control how long a model stays in memory after its last request. The default is 5 minutes. Use -1 to keep models loaded indefinitely, or 0 to unload immediately after each request.

This is the variable people reach for after the first time a model unloads mid-session and the next prompt takes ten seconds to reload from disk. OLLAMA_KEEP_ALIVE=-1 trades RAM for latency: the model sits in memory until you stop the server. On a machine you use for AI all day, that trade is usually worth it. On a 16GB Mac where you also need memory for everything else, the 5-minute default exists for a reason.
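OLLAMA_KEEP_ALIVE accepts a few value formats. According to Ollama's FAQ, they are the same ones as the keep_alive API parameter:

- a duration string
- a bare number of seconds
- any negative value (keep the model loaded indefinitely)
- 0 (unload right after each request)

The bash sketch below mirrors those forms to make them concrete. Ollama does this parsing internally in Go. This hypothetical helper handles only the h, m, and s units, with no fractions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert an OLLAMA_KEEP_ALIVE value to seconds.
# Forms handled (simplified from Ollama's documented keep_alive formats):
#   "5m", "1h30m", "90s"  duration string (h, m, s units only here)
#   "300"                 bare number, read as seconds
#   "-1", "-1m"           any negative value: keep loaded indefinitely
#   "0"                   unload right after each request
# Empty input falls back to the 5m default.
keep_alive_seconds() {
  local v="${1:-5m}" total=0 num unit
  local re='^([0-9]+)([hms])(.*)$'
  case "$v" in
    -*) echo forever; return 0 ;;
    *[!0-9]*) ;;                   # has a unit suffix: parse below
    *) echo "$v"; return 0 ;;      # digits only: already seconds
  esac
  while [[ $v =~ $re ]]; do
    num="${BASH_REMATCH[1]}"
    unit="${BASH_REMATCH[2]}"
    v="${BASH_REMATCH[3]}"
    case "$unit" in               # 10# forces base 10, so "08m" is not octal
      h) total=$(( total + 10#$num * 3600 )) ;;
      m) total=$(( total + 10#$num * 60 )) ;;
      s) total=$(( total + 10#$num )) ;;
    esac
  done
  if [ -n "$v" ]; then
    echo "unrecognized keep-alive value" >&2
    return 1
  fi
  echo "$total"
}
```

For example, keep_alive_seconds 1h30m prints 5400, keep_alive_seconds 300 prints 300, and keep_alive_seconds -1 prints forever.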

How do I reduce Ollama's memory usage?

Set OLLAMA_KV_CACHE_TYPE=q8_0 to roughly halve the memory the context cache uses, and set OLLAMA_FLASH_ATTENTION=1, which the quantized cache requires. Together they let the same model run longer contexts in less RAM with no visible quality loss.

The context (KV) cache grows with your context length and, at 32K or more tokens, can use more memory than the model weights themselves. Quantizing it from the default f16 down to q8_0 cuts that roughly in half for a perplexity increase that benchmarks put at 0.002 to 0.05, which nobody notices in practice. We cover the full quality and memory trade-off, including the more aggressive q4_0, in Ollama KV cache quantization.
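The "roughly half" comes from simple arithmetic. Below is a rough bash sketch. It assumes GGML's block layouts for each cache type: q8_0 stores 32 values in 34 bytes and q4_0 stores them in 18. It also assumes a model shape like Llama 3.1 8B's: 32 layers, 8 KV heads, head dimension 128. The function is hypothetical and ignores padding and compute buffers, so treat its output as an estimate, not what Ollama will report.

```shell
#!/usr/bin/env bash
# Rough KV cache size estimate (hypothetical helper, not an Ollama command).
#   elements = 2 (keys and values) * layers * context * kv_heads * head_dim
# Bytes per element, assuming GGML's block layouts:
#   f16 = 2 bytes, q8_0 = 34 bytes per 32 values, q4_0 = 18 bytes per 32 values
# Ignores padding and compute buffers, so actual usage can be higher.
kv_cache_mib() {
  local layers="$1" ctx="$2" kv_heads="$3" head_dim="$4" type="${5:-f16}"
  local elems=$(( 2 * layers * ctx * kv_heads * head_dim ))
  local bytes
  case "$type" in
    f16)  bytes=$(( elems * 2 )) ;;
    q8_0) bytes=$(( elems * 34 / 32 )) ;;
    q4_0) bytes=$(( elems * 18 / 32 )) ;;
    *) echo "unsupported cache type: $type" >&2; return 1 ;;
  esac
  echo $(( bytes / 1048576 ))
}
```

With that assumed shape at a 32,768-token context:

- kv_cache_mib 32 32768 8 128 f16 prints 4096.
- The same call with q8_0 prints 2176, about 53% of the f16 figure.
- The same call with q4_0 prints 1152, about 28% of the f16 figure.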

How do I let other devices connect to Ollama?

Set OLLAMA_HOST=0.0.0.0:11434 to bind Ollama to all network interfaces, and OLLAMA_ORIGINS to allow cross-origin browser requests. By default Ollama binds only to 127.0.0.1, so other machines cannot reach it, and it allows browser requests only from localhost origins.

The two are separate problems. OLLAMA_HOST controls which network interface the server binds to. OLLAMA_ORIGINS controls which web origins the browser is allowed to call. If you are wiring a web app to Ollama and getting a silent CORS failure, the host is fine and the origins are the issue. We walk through that exact fix in the Ollama CORS fix on Mac.
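When you test the binding from another machine, it helps to know which URL a given OLLAMA_HOST value implies. Below is a simplified bash sketch that fills in the defaults listed in the reference table, 127.0.0.1 and port 11434. It is hypothetical and is not Ollama's own parser: it skips scheme prefixes and IPv6 literals.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn an OLLAMA_HOST-style value (host or host:port)
# into a base URL, filling in the defaults 127.0.0.1 and port 11434.
# Simplified: no scheme prefixes, no IPv6 literals.
ollama_base_url() {
  local hostport="${1:-127.0.0.1:11434}"
  case "$hostport" in
    *:*) ;;                              # port already present
    *) hostport="${hostport}:11434" ;;   # bare host: add the default port
  esac
  printf 'http://%s\n' "$hostport"
}
```

From a second machine, run curl "$(ollama_base_url 192.168.1.50)/api/version", using your Ollama machine's address in place of the placeholder IP. It should return a small JSON object with the server version if the server is reachable.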

Can Ollama offload the KV cache to system RAM?

Ollama has no dedicated environment variable for KV cache offload. It offloads model layers and the KV cache to the GPU together, and the runner flag --no-kv-offload keeps the cache in system RAM while the layers stay on the GPU.

This comes up when a model almost fits in VRAM and the cache is what pushes it over. There is an open discussion (ollama/ollama#9750) about preferring to offload model layers over the KV cache when both will not fit, because keeping the cache on the GPU and spilling layers to CPU is usually faster than the reverse. For now this is runner behavior, not an environment variable, which is worth knowing before you go looking for an OLLAMA_KV_OFFLOAD that does not exist.

How can I see which environment variables Ollama is using?

Start the server with OLLAMA_DEBUG=1 and Ollama logs every variable and its resolved value at startup. With the macOS app, the same information appears in the server log at ~/.ollama/logs/server.log.

This is the fastest way to confirm a variable actually took effect. That matters because the most common configuration bug is setting the variable in one environment and running the server in another. If the log shows the value you set, the setting is live. If it shows the default, you set it in the wrong place. Go back to the section on where to set Ollama environment variables and use the method that matches how your server runs.
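To check a single value without reading the whole log, you can pull it from the most recent startup line. The bash sketch below does that. It assumes the line looks like msg="server config" env="map[OLLAMA_KV_CACHE_TYPE:q8_0 ...]", which recent releases appear to print, but that format is not a stable interface. The function name is hypothetical, and it handles only values without spaces.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read Ollama server log lines on stdin and print the
# value NAME resolved to in the most recent "server config" line.
# Assumed format: ... msg="server config" env="map[NAME:value NAME:value ...]"
ollama_config_value() {
  local name="$1" line env tok
  local -a toks
  # Use only the newest "server config" line, i.e. the last server start.
  line="$(grep 'msg="server config"' | tail -n 1)"
  [ -n "$line" ] || return 1
  env=${line#*env=\"map\[}     # strip everything up to env="map[
  env=${env%%\]\"*}            # strip the closing ]" and anything after it
  read -r -a toks <<< "$env"   # split on spaces without glob expansion
  for tok in "${toks[@]}"; do
    case "$tok" in
      "$name":*) printf '%s\n' "${tok#"$name":}"; return 0 ;;
    esac
  done
  return 1
}
```

How to run it:

- With the macOS app: ollama_config_value OLLAMA_KV_CACHE_TYPE < ~/.ollama/logs/server.log
- With a systemd install: journalctl -u ollama --no-pager | ollama_config_value OLLAMA_KV_CACHE_TYPE

An empty result means the server saw no value for that variable and is using the default.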

The case for not configuring any of this

Every variable on this page exists because Ollama ships conservative defaults and makes you opt in to the better ones. The context cache defaults to f16 when q8_0 would save memory at almost no quality cost. Flash attention is off until you turn it on. Browser requests from anything but a localhost origin are blocked until you set origins. None of these are wrong defaults; they are just cautious ones, and the result is a config file's worth of variables you have to learn before Ollama runs the way you want.

ToolPiper takes the other approach. It bundles the same llama.cpp engine and runs the same GGUF models, but it launches with q8_0 KV cache quantization and flash attention on by default, serves CORS headers natively so there is no origins variable to set, and shows per-model memory directly so you can see what a model actually costs before you load it. The good defaults are the defaults. There is nothing to put in a shell profile and no server to restart.

It also connects to your existing Ollama instance as a provider, so the models you already pulled stay reachable while you migrate. Treat that as a transition path, not the recommended setup. The honest limitation: ToolPiper is macOS only. If you run Ollama on Linux or Windows, the variables in this reference are how you get there, and they work.

Download ToolPiper at modelpiper.com, or use the reference above and keep tuning Ollama directly.

Part of our series on running Ollama on Mac. See also: Ollama KV cache quantization, the Ollama CORS fix, and running multiple Ollama models on Mac.

Every Ollama Environment Variable: What It Does and Its Default

Variable	What it controls	Default	Common override
OLLAMA_HOST	Address and port the server binds to	127.0.0.1:11434	0.0.0.0:11434
OLLAMA_ORIGINS	Allowed CORS origins for browser requests	localhost, 127.0.0.1 (any port)	*
OLLAMA_MODELS	Where model files are stored	~/.ollama/models	/Volumes/SSD/models
OLLAMA_KEEP_ALIVE	How long a model stays loaded after last use	5m	-1 (forever)
OLLAMA_KV_CACHE_TYPE	Quantization of the context (KV) cache	f16	q8_0
OLLAMA_FLASH_ATTENTION	Enable flash attention	off	1
OLLAMA_CONTEXT_LENGTH	Default context window size	4096	8192
OLLAMA_NUM_PARALLEL	Concurrent requests handled per model	1	4
OLLAMA_MAX_LOADED_MODELS	Models kept in memory at once	3 per GPU (3 on CPU)	2
OLLAMA_MAX_QUEUE	Requests queued before the server rejects	512	1024
OLLAMA_GPU_OVERHEAD	VRAM in bytes to reserve per GPU	0	1073741824 (1 GiB)
OLLAMA_SCHED_SPREAD	Spread one model across all GPUs	off	1
OLLAMA_DEBUG	Verbose server logging	off	1
OLLAMA_NOPRUNE	Skip pruning unused blobs at startup	off	1
OLLAMA_NOHISTORY	Disable prompt history in interactive mode	off	1

Ollama Defaults vs ToolPiper Defaults

Setting	Ollama default	ToolPiper default
KV cache type	f16 (set OLLAMA_KV_CACHE_TYPE to change)	q8_0, on by default
Flash attention	Off (set OLLAMA_FLASH_ATTENTION=1)	On by default
Browser (CORS) access	Localhost only (set OLLAMA_ORIGINS for other origins)	Headers served natively
Per-model memory visibility	None	Live per-model RAM and GPU
Config to learn first	A page of environment variables	None

How to get started

1. Choose where to set the variable
Match the method to how Ollama runs. Terminal server: your shell profile. macOS menu-bar app: launchctl. Linux: the systemd unit. Docker: a -e flag. Setting it in the wrong place is the number-one reason a variable looks ignored.

2. Set the variable
For the macOS app, run launchctl setenv OLLAMA_KV_CACHE_TYPE q8_0. For a terminal server, add export OLLAMA_KV_CACHE_TYPE=q8_0 to ~/.zshrc. Substitute whichever variable from the reference table you need.

3. Restart Ollama
Ollama reads environment variables once, at server startup. Quit the app fully and reopen it, or stop and restart ollama serve. Reloading a client or browser tab is not enough.

4. Verify it took effect
Start the server with OLLAMA_DEBUG=1 and read the startup log. If it shows your value, the setting is live. If it shows the default, you set it in an environment the server does not see, so go back to step 1.

Frequently Asked Questions

What is the OLLAMA_KV_CACHE_TYPE variable?

OLLAMA_KV_CACHE_TYPE sets the quantization of Ollama's context (KV) cache. The default is f16. Setting it to q8_0 roughly halves the cache's memory use with negligible quality loss. Going to q4_0 roughly quarters it, with a small, measurable quality trade-off. Both quantized types take effect only with OLLAMA_FLASH_ATTENTION=1. The setting applies globally to every model the server loads.

Do Ollama environment variables work in Docker?

Yes. Pass each one with a -e flag at run time, for example docker run -e OLLAMA_KEEP_ALIVE=-1 -e OLLAMA_KV_CACHE_TYPE=q8_0 -p 11434:11434 ollama/ollama. They behave identically to a native install because the container runs the same ollama serve process.

Why don't my Ollama environment variables take effect?

Almost always because the variable is set in a different environment than the one the server runs in. The macOS app ignores your shell profile and needs launchctl setenv. A systemd service ignores your shell and needs an Environment= line in the unit. And every method requires a full server restart, since Ollama reads the variables only at startup.
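On Linux you can skip the guesswork and read the environment the running server actually started with, from /proc/<pid>/environ. That file is NUL-separated. The small bash sketch below lists the OLLAMA_* entries from it. The function name is hypothetical, and reading another user's process environment requires root.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a NUL-separated environment (such as
# /proc/<pid>/environ) on stdin and print its OLLAMA_* entries, sorted.
ollama_env() {
  local found
  found="$(tr '\0' '\n' | grep '^OLLAMA_' | sort)"
  if [ -z "$found" ]; then
    echo "no OLLAMA_* variables in this environment (defaults apply)" >&2
    return 1
  fi
  printf '%s\n' "$found"
}
```

For example: sudo cat "/proc/$(pgrep -o -x ollama)/environ" | ollama_env. pgrep -o picks the oldest matching process, which is normally the server rather than a model runner. macOS has no /proc. There, launchctl getenv OLLAMA_KV_CACHE_TYPE at least shows what launchctl will hand to the app.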

Where are Ollama environment variables stored on Windows?

In the Windows system environment variables. Search "edit the system environment variables" from the Start menu, add the variable under your user or system variables, then quit Ollama from the system tray and relaunch it so the server picks up the change.

How do I set Ollama environment variables permanently?

Use a method that survives reboots. launchctl setenv on macOS resets on logout, so for a permanent macOS setting add it to a login script or LaunchAgent. On Linux, the systemd unit override is already permanent. In a shell, an export line in ~/.zshrc or ~/.bashrc persists across sessions.
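One way to make the macOS setting stick is a LaunchAgent that reruns launchctl setenv at every login. The minimal bash sketch below prints such a plist. The label local.ollama.env.NAME and the file name are made-up examples. Values containing XML special characters such as & or < would need escaping, which this sketch does not do.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a LaunchAgent plist that runs
# "launchctl setenv NAME VALUE" at login (RunAtLoad).
# The Label is an example; values are not XML-escaped.
ollama_launchagent_plist() {
  local name="$1" value="$2"
  cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>local.ollama.env.${name}</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/launchctl</string>
    <string>setenv</string>
    <string>${name}</string>
    <string>${value}</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
</dict>
</plist>
EOF
}
```

Usage:

1. Write the file: ollama_launchagent_plist OLLAMA_KV_CACHE_TYPE q8_0 > ~/Library/LaunchAgents/local.ollama.env.OLLAMA_KV_CACHE_TYPE.plist
2. Check it with plutil -lint.
3. Load it with launchctl bootstrap gui/$(id -u) followed by the file path.

launchd is not guaranteed to run the agent before the Ollama app starts. If the value is missing after a login, quit and reopen Ollama once.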

