Ollama's configuration lives in environment variables, and finding the full list is harder than it should be. The official FAQ documents a handful. The rest are scattered across GitHub issues, Reddit threads, and the source. Two of the top Google results for "ollama environment variables" are literally GitHub issues asking the maintainers to write this page down.

So here it is: every Ollama environment variable that matters, what it controls, its default, and where to set it on macOS, Linux, Docker, and Windows. If you want Ollama to use less memory, stay loaded between requests, or accept connections from another machine, the variable you need is in the table below. This reference assumes Ollama is already running. If it is not, our install guide for Mac gets you there first.

What are Ollama environment variables?

Ollama environment variables are settings the Ollama server reads at startup to control networking, model storage, memory behavior, and inference performance. The server reads them once when it launches, so changing one only takes effect after you restart Ollama.

The distinction that trips people up: these configure the background server (ollama serve), not the ollama run command. Setting one in the same shell where you type ollama run does nothing if the server is already running as a separate process or a macOS app. The server has its own environment, and that is the one that counts.
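
This separation is ordinary process behavior rather than anything specific to Ollama: a process gets a copy of its parent's environment when it starts, and later changes in the parent never reach it. A minimal sketch of that, using a background subshell as a stand-in for an already-running server (no Ollama involved, and the temp file is only there to pass the result back):

```shell
#!/usr/bin/env bash
# Stand-in for a server that is already running: a background subshell
# that reports, one second later, the OLLAMA_KEEP_ALIVE value it can see.
unset OLLAMA_KEEP_ALIVE
snapshot=$(mktemp)

( sleep 1; echo "${OLLAMA_KEEP_ALIVE:-unset}" > "$snapshot" ) &
server_pid=$!

# This export happens after the "server" started, so it cannot reach it.
export OLLAMA_KEEP_ALIVE=-1

wait "$server_pid"
seen_by_server=$(cat "$snapshot")
rm -f "$snapshot"

echo "shell sees:  $OLLAMA_KEEP_ALIVE"
echo "server sees: $seen_by_server"
```

The shell reports -1 while the stand-in server still reports unset, which is the same situation as exporting a variable and then running ollama run against a server that was launched earlier.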

Where do you set Ollama environment variables?

Set Ollama environment variables in your shell profile for terminal use, with launchctl setenv for the macOS app, in the systemd unit on Linux, with -e flags for Docker, or in the system environment variables panel on Windows. Ollama reads them at server startup, so restart Ollama after any change.

The right method depends entirely on how Ollama is running, and getting this wrong is the most common reason a variable seems to be ignored.

Terminal (you run ollama serve yourself): add the export to your shell profile.

export OLLAMA_KV_CACHE_TYPE=q8_0

Put that in ~/.zshrc (macOS default) or ~/.bashrc, then open a new terminal and start the server.

macOS app: the menu-bar app does not read your shell profile. Use launchctl, then quit and reopen Ollama.

launchctl setenv OLLAMA_KV_CACHE_TYPE q8_0

Linux (systemd service): edit the unit with sudo systemctl edit ollama.service and add the variable under [Service].

[Service]
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"

Then run sudo systemctl daemon-reload followed by sudo systemctl restart ollama.
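
systemctl edit opens an interactive editor, which is awkward in a provisioning script. As a sketch, the helper below (ollama_dropin is a hypothetical name) prints the same drop-in text for any number of VAR=VALUE pairs. The commented commands assume the conventional override path that systemctl edit normally writes to; check your distribution before relying on it.

```shell
#!/usr/bin/env bash
# Print a systemd drop-in that sets each VAR=VALUE argument for the service.
ollama_dropin() {
  printf '[Service]\n'
  local pair
  for pair in "$@"; do
    printf 'Environment="%s"\n' "$pair"
  done
}

# Preview the generated drop-in on stdout.
ollama_dropin OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0

# To install it (assumed path, not run here):
#   sudo mkdir -p /etc/systemd/system/ollama.service.d
#   ollama_dropin OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 \
#     | sudo tee /etc/systemd/system/ollama.service.d/override.conf
#   sudo systemctl daemon-reload && sudo systemctl restart ollama
```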

Docker: pass it with -e at run time.

docker run -e OLLAMA_KV_CACHE_TYPE=q8_0 -p 11434:11434 ollama/ollama

Windows: set it in the system environment variables panel (search "edit environment variables"), then quit Ollama from the system tray and relaunch.

The complete Ollama environment variable reference

Every variable Ollama reads, grouped by what it controls, with the default and a common override. Defaults shift between releases, so treat the official Ollama FAQ and the source config as the version-specific ground truth. One default that moves: OLLAMA_CONTEXT_LENGTH is 4096 on most machines, but Ollama 0.15.5 made it VRAM-aware, so a Mac with 24GB or more of unified memory can default to 32K tokens or higher.

How do I keep an Ollama model loaded in memory?

Set OLLAMA_KEEP_ALIVE to control how long a model stays in memory after its last request. The default is 5 minutes. Use -1 to keep models loaded indefinitely, or 0 to unload immediately after each request.

This is the variable people reach for after the first time a model unloads mid-session and the next prompt takes ten seconds to reload from disk. OLLAMA_KEEP_ALIVE=-1 trades RAM for latency: the model sits in memory until you stop the server. On a machine you use for AI all day, that trade is usually worth it. On a 16GB Mac where you also need memory for everything else, the 5-minute default exists for a reason.
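
The environment variable sets the server-wide default. Ollama's HTTP API also accepts a keep_alive field on individual requests, which helps when only one model should stay resident. A sketch, assuming the server listens on the default 127.0.0.1:11434 and using llama3.2 purely as a placeholder model name; both helper names are illustrative:

```shell
#!/usr/bin/env bash
# Build a request body that loads a model and sets its keep-alive.
# $2 is a JSON number of seconds: -1 keeps the model loaded, 0 unloads it.
keep_alive_body() {
  printf '{"model": "%s", "keep_alive": %s}' "$1" "$2"
}

# Send the body to a running server (assumes the default local address).
set_keep_alive() {
  curl -s http://127.0.0.1:11434/api/generate -d "$(keep_alive_body "$1" "$2")"
}

# Show the body that would be sent.
keep_alive_body llama3.2 -1; echo

# To apply it against a live server (not run here):
#   set_keep_alive llama3.2 -1
```

A generate request with no prompt only loads the model, so this pins one model in memory without changing the default for the others.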

How do I reduce Ollama's memory usage?

Set OLLAMA_FLASH_ATTENTION=1 and OLLAMA_KV_CACHE_TYPE=q8_0 to roughly halve the memory the context cache uses. Cache quantization only takes effect when flash attention is enabled, so set both. Together they let the same model run longer contexts in less RAM, usually with no visible quality loss.

The context (KV) cache grows with your context length and, at 32K or more tokens, can use more memory than the model weights themselves. Quantizing it from the default f16 down to q8_0 cuts that roughly in half for a perplexity increase that benchmarks put at 0.002 to 0.05, which is rarely noticeable in practice. We cover the full quality and memory trade-off, including the more aggressive q4_0, in Ollama KV cache quantization.
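
To put numbers on "roughly in half", the cache size can be estimated as 2 (keys and values) × layers × KV heads × head dimension × context tokens × bytes per element. The sketch below assumes dimensions typical of an 8B Llama-style model (32 layers, 8 KV heads, head dimension 128), which are illustrative rather than read from any particular model file, and the ggml block sizes of 34 bytes (q8_0) and 18 bytes (q4_0) per 32 values. Real allocations can differ, so treat the result as an estimate.

```shell
#!/usr/bin/env bash
# Estimate KV cache size in MiB.
# usage: kv_cache_mib LAYERS KV_HEADS HEAD_DIM CONTEXT_TOKENS CACHE_TYPE
kv_cache_mib() {
  local layers=$1 kv_heads=$2 head_dim=$3 ctx=$4 type=$5
  local bytes_per_32
  # Storage cost per 32 cached values:
  # f16 = 32 x 2 bytes; q8_0 and q4_0 = 32 packed values plus a 2-byte scale.
  case "$type" in
    f16)  bytes_per_32=64 ;;
    q8_0) bytes_per_32=34 ;;
    q4_0) bytes_per_32=18 ;;
    *) echo "unknown cache type: $type" >&2; return 1 ;;
  esac
  # 2x for keys and values; one entry per layer, KV head, dimension, and token.
  local elements=$(( 2 * layers * kv_heads * head_dim * ctx ))
  echo $(( elements / 32 * bytes_per_32 / 1048576 ))
}

for t in f16 q8_0 q4_0; do
  echo "$t: $(kv_cache_mib 32 8 128 32768 "$t") MiB"
done
```

With these assumed dimensions at a 32K context, the estimate comes to 4096 MiB for f16, 2176 MiB for q8_0, and 1152 MiB for q4_0.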

How do I let other devices connect to Ollama?

Set OLLAMA_HOST=0.0.0.0:11434 to bind Ollama to all network interfaces, and OLLAMA_ORIGINS to allow cross-origin browser requests. By default Ollama binds only to 127.0.0.1, so other machines cannot reach it, and it allows browser requests only from localhost origins.

The two are separate problems. OLLAMA_HOST controls which network interface the server binds to. OLLAMA_ORIGINS controls which web origins are allowed to call the server from a browser. If you are wiring a web app to Ollama and getting a silent CORS failure, the host is fine and the origins are the issue. We walk through that exact fix in the Ollama CORS fix on Mac.
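
One way to tell the two problems apart from a terminal is to send a request with an Origin header and look for Access-Control-Allow-Origin in the response. This is a sketch: the helper names are illustrative, https://app.example.com is a placeholder origin, and it assumes an allowed origin is reflected in that header, which is the usual CORS behavior.

```shell
#!/usr/bin/env bash
# Succeeds if the HTTP response headers on stdin include a CORS allow-origin header.
has_cors_header() {
  grep -qi '^access-control-allow-origin:'
}

# Ask the server at base URL $1 whether web origin $2 may call it.
check_origin() {
  if curl -s -o /dev/null -D - -H "Origin: $2" "$1/api/tags" | has_cors_header; then
    echo "origin allowed: $2"
  else
    echo "origin not allowed (or server unreachable): $2"
  fi
}

# Example against a live server (not run here):
#   check_origin http://127.0.0.1:11434 https://app.example.com
```

If a plain curl to the same URL works from the other machine but this check fails, the binding is fine and OLLAMA_ORIGINS is the setting to change.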

Can Ollama offload the KV cache to system RAM?

Ollama has no dedicated environment variable for KV cache offload. It offloads model layers and the KV cache to the GPU together, and the runner flag --no-kv-offload keeps the cache in system RAM while the layers stay on the GPU.

This comes up when a model almost fits in VRAM and the cache is what pushes it over. There is an open discussion (ollama/ollama#9750) about preferring to offload model layers over the KV cache when both will not fit, because keeping the cache on the GPU and spilling layers to CPU is usually faster than the reverse. For now this is runner behavior, not an environment variable, which is worth knowing before you go looking for an OLLAMA_KV_OFFLOAD that does not exist.

How can I see which environment variables Ollama is using?

Start the server with OLLAMA_DEBUG=1 and Ollama logs every variable and its resolved value at startup. On the macOS app, the same information appears in the server log.

This is the fastest way to confirm a variable actually took effect, which matters because the most common configuration bug is setting the variable in one environment and running the server in another. If OLLAMA_DEBUG=1 shows the value you set, the setting is live. If it shows the default, you set it in the wrong place. Go back to the section on where to set Ollama environment variables and use the method that matches how your server runs.
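
A check that does not depend on log output is to read the environment of the running process directly. The sketch below is Linux-only because it reads /proc, and ollama_env_of is a hypothetical helper name. For a service running as another user it needs to run with sudo.

```shell
#!/usr/bin/env bash
# Print the OLLAMA_* variables that process $1 was started with (Linux only).
ollama_env_of() {
  tr '\0' '\n' < "/proc/$1/environ" | grep '^OLLAMA_' | sort
}

# Examples (not run here):
#   ollama_env_of "$(pgrep -o -x ollama)"          # oldest process named "ollama"
#   systemctl show ollama --property=Environment   # what the systemd unit passes in
#   launchctl getenv OLLAMA_KV_CACHE_TYPE          # what the macOS app will inherit
```

If the variable is missing from this output, the server never received it, whatever your shell profile says.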

The case for not configuring any of this

Every variable on this page exists because Ollama ships conservative defaults and makes you opt in to the better ones. The context cache defaults to f16 when q8_0 would save memory at negligible cost. Flash attention is off until you turn it on. Browser requests from anything but a localhost origin are blocked until you set origins. None of these are wrong defaults; they are just cautious ones. The result is a config file's worth of variables you have to learn before Ollama runs the way you want.

ToolPiper takes the other approach. It bundles the same llama.cpp engine and runs the same GGUF models, but it launches with q8_0 KV cache quantization and flash attention on by default, serves CORS headers natively so there is no origins variable to set, and shows per-model memory directly so you can see what a model actually costs before you load it. The good defaults are the defaults. There is nothing to put in a shell profile and no server to restart.

It also connects to your existing Ollama instance as a provider, so the models you already pulled show up alongside ToolPiper's own engine. You do not have to choose. The honest limitation: ToolPiper is macOS only. If you run Ollama on Linux or Windows, the variables in this reference are how you get there, and they work.

Download ToolPiper at modelpiper.com, or use the reference above and keep tuning Ollama directly.

Part of our series on running Ollama on Mac. See also: Ollama KV cache quantization, the Ollama CORS fix, and running multiple Ollama models on Mac.