Ollama supports vision models. LLaVA, Gemma 3, Moondream, Llama 3.2 Vision - you can pull them the same way you pull any other model. The inference works. The problem is the interface.
Here's what using a vision model through Ollama's API looks like:
```shell
curl http://localhost:11434/api/generate -d '{
  "model": "llava",
  "prompt": "What is in this image?",
  "images": ["iVBORw0KGgoAAAANSUhEUgAA..."]
}'
```

That images field expects a base64-encoded string. For a typical screenshot, that's 50,000-200,000 characters of encoded data pasted into a terminal command. You'd generate it with base64 -i screenshot.png, then paste the output into the JSON payload. It works. Nobody does it twice voluntarily.
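Scripting the encoding step removes the copy-paste problem entirely. A minimal sketch in Python, assuming a local Ollama at the default port (the function names here are illustrative, not part of any library):

```python
import base64
import json
import urllib.request

def build_generate_payload(image_path: str, prompt: str, model: str = "llava") -> dict:
    """Read an image file and build the JSON body Ollama's /api/generate expects."""
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return {"model": model, "prompt": prompt, "images": [encoded], "stream": False}

def send_to_ollama(payload: dict, url: str = "http://localhost:11434/api/generate") -> str:
    """POST the payload and return the model's text response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With a model pulled, `send_to_ollama(build_generate_payload("screenshot.png", "What is in this image?"))` replaces the manual base64 round trip.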
The CLI ollama run llava supports a file path shorthand, but it's still a terminal workflow. Describe an image, read the text output, adjust your prompt, repeat. No visual feedback, no side-by-side comparison of the image and the model's response. For a task that's fundamentally visual - understanding what's in a picture - the text-only interface is a mismatch.
What can vision models actually do?
Vision-language models process an image and a text prompt together. They don't just classify ("this is a cat"). They reason about the image content in the context of your question.
Image Q&A. "What's the error message in this screenshot?" "How many people are in this photo?" "What brand is the laptop on the left?" The model reads the image and answers in natural language.
Document understanding. Point a vision model at a chart, a table, a diagram, or a handwritten note. Ask it to extract the data, describe the relationships, or summarize the content. This overlaps with OCR but goes further - vision models understand layout and context, not just characters.
Scene description. Describe what's happening in a photo for accessibility narration, content tagging, or creative writing prompts. Models like LLaVA produce surprisingly detailed scene descriptions when prompted well.
UI analysis. Screenshot a web page or app interface and ask the model to identify elements, describe the layout, or spot accessibility issues. Useful for design review and testing workflows.
All of these work with Ollama's vision models running on your Mac. The capability is there. What's missing is a way to use it that doesn't involve pasting base64 strings into a terminal.
How does ModelPiper handle vision model input?
Drag an image onto the chat window. That's the entire interaction. The image appears inline in your conversation. Type your question below it. The model sees both the image and the text, and responds in the same chat thread.
ModelPiper handles the encoding automatically. When you drop an image (PNG, JPEG, WebP, or GIF), the app encodes it and includes it in the API request to whichever vision model you've selected. If that model is running through Ollama, the request goes to Ollama's /api/generate endpoint with the image payload. If it's running through ModelPiper's built-in engine, it goes to the local llama.cpp server. Either way, you never see the base64 string.
The chat interface shows the image alongside the model's response, so you can compare what the model said to what's actually in the picture. For iterative work - "now describe just the chart in the upper right" - you can reference the same image across multiple messages in the conversation.
Which vision models work with Ollama on Mac?
Several vision-capable models are available through Ollama's registry. Performance depends on model size and your Mac's memory.
LLaVA 1.6 (7B and 13B). The most established open-source vision model. The 7B variant needs about 4-5GB of RAM and handles general image Q&A well. The 13B variant is more capable for complex reasoning but needs 8-9GB. Pull with ollama pull llava.
Gemma 3 (4B with vision). Google's multimodal model. Smaller memory footprint than LLaVA 7B with competitive image understanding. Good for Macs with limited RAM. Pull with ollama pull gemma3.
Moondream (1.6B). The smallest viable vision model. About 1.5GB of RAM. Fast inference, adequate for simple image descriptions and basic Q&A. Not great for complex visual reasoning, but it runs on 8GB Macs comfortably alongside other tools.
Llama 3.2 Vision (11B and 90B). Meta's multimodal entries. The 11B model offers strong reasoning capabilities. The 90B model is exceptional but needs 48GB+ of RAM - only viable on M2/M3/M4 Max or Ultra machines with 64GB or more.
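The RAM figures above can be turned into a rough fit check before you pull a model. A sketch, with the approximate numbers quoted above (the Gemma 3 and Llama 3.2 Vision 11B figures are estimates, not official requirements):

```python
# Approximate RAM footprints (GB) for the vision models discussed above.
VISION_MODEL_RAM_GB = {
    "moondream": 1.5,
    "gemma3:4b": 3.5,            # assumption: rough estimate, below LLaVA 7B
    "llava:7b": 5.0,
    "llava:13b": 9.0,
    "llama3.2-vision:11b": 8.0,  # assumption: rough estimate
    "llama3.2-vision:90b": 48.0,
}

def fits_in_memory(model: str, total_ram_gb: float, headroom_gb: float = 4.0) -> bool:
    """True if the model's approximate footprint still leaves headroom for macOS and apps."""
    return VISION_MODEL_RAM_GB[model] + headroom_gb <= total_ram_gb
```

On an 8GB Mac, `fits_in_memory("moondream", 8)` passes while `fits_in_memory("llava:13b", 8)` does not, which matches the guidance above.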
In ModelPiper, all of these appear in the model selector when Ollama is connected as a provider. Select one, drop an image, ask your question.
How do you set up vision chat with Ollama through ModelPiper?
If you already have Ollama running, this takes about two minutes.
1. Pull a vision model if you don't have one: ollama pull llava in Terminal.
2. Open ModelPiper and confirm Ollama is connected as a provider (Settings → Providers).
3. Navigate to the Chat view and select the vision model from the dropdown.
4. Drag an image onto the chat input area and type your question - "What does this screenshot show?" or "Extract the table data from this image" or "Describe this photo for alt text."
5. Send the message. The model processes both the image and your text and responds in the chat.
For follow-up questions about the same image, keep typing in the same conversation. The model retains the image context across turns.
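Under the hood, retaining image context across turns means resending the conversation history: with Ollama's /api/chat endpoint, the image attaches as base64 to the first user message, and follow-ups are plain text. A sketch of that message structure (the helper names are illustrative):

```python
import base64

def start_image_chat(image_bytes: bytes, question: str) -> list[dict]:
    """Build the initial message list for Ollama's /api/chat with an attached image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return [{"role": "user", "content": question, "images": [encoded]}]

def add_follow_up(messages: list[dict], reply: str, question: str) -> list[dict]:
    """Append the model's reply and a new text-only question. The image on the
    first message stays in the resent history, so the model keeps visual context."""
    return messages + [
        {"role": "assistant", "content": reply},
        {"role": "user", "content": question},
    ]
```

Each turn grows the list; only the first message carries the image payload, which is why follow-ups like "now describe just the chart in the upper right" work without re-uploading.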
Can you combine vision with other pipelines?
Vision models become more useful when chained with other capabilities. A few practical combinations:
Vision + OCR pipeline. Use Apple Vision OCR (built into ModelPiper) to extract raw text from a document image, then feed that text to a chat model for summarization or analysis. OCR handles the character recognition, the language model handles the understanding. More reliable than asking a vision model to read dense text directly.
Vision + TTS. Describe an image with a vision model, then pipe the description to text-to-speech for an audio narration. Useful for accessibility workflows or creating audio descriptions of visual content.
Vision + Translation. Describe an image in English, then translate the description to another language. Or OCR a document in one language and use a chat model to translate the extracted text.
These multi-step workflows are where ModelPiper's pipeline builder earns its complexity. Each step is a block on the canvas with a clear input and output.
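That block-on-canvas model maps naturally onto function composition: each block is a function, and the pipeline feeds one block's output into the next. A minimal sketch of the vision + OCR idea, with stub steps standing in for Apple Vision OCR and the chat model (both hypothetical here, not real APIs):

```python
from typing import Any, Callable

def run_pipeline(data: Any, steps: list[Callable]) -> Any:
    """Feed each step's output into the next, like blocks wired on a canvas."""
    for step in steps:
        data = step(data)
    return data

# Stub steps - in a real workflow these would call the OCR engine and a chat model.
def ocr_step(image_path: str) -> str:
    return f"raw text extracted from {image_path}"

def summarize_step(text: str) -> str:
    return f"summary of: {text}"

result = run_pipeline("invoice.png", [ocr_step, summarize_step])
```

Swapping `summarize_step` for a translation or TTS step gives the other combinations above without changing the pipeline runner.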
What are the limitations of local vision models?
Smaller models miss details. Moondream at 1.6B and even LLaVA 7B will miss fine text in screenshots, misread numbers in charts, and sometimes hallucinate details that aren't in the image. For high-accuracy document extraction, Apple Vision OCR is more reliable than asking a small vision model to read text. Use OCR for text extraction, vision models for understanding and description.
Image size matters. Vision models downscale images internally to fit their context window. A 4K screenshot gets resized to 336x336 or 672x672 pixels depending on the model's vision encoder. Fine details below that resolution are lost. For best results, crop to the relevant portion of the image before sending it to the model.
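The resolution loss is easy to quantify. A sketch of the arithmetic, assuming a 672x672 vision encoder as discussed above:

```python
def downscale_factor(width: int, height: int, encoder_px: int = 672) -> float:
    """How many source pixels collapse into one encoder pixel along the longer edge."""
    return max(width, height) / encoder_px

# A 4K screenshot (3840x2160) downscales by a factor of ~5.7, so 12px UI text
# shrinks to roughly 2px and becomes unreadable. Cropping to a region no wider
# than the encoder keeps the factor at 1.0 and preserves fine detail.
```

This is why cropping to the relevant portion beats sending the full screenshot: the crop spends the encoder's fixed pixel budget on the part you actually care about.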
Memory pressure. Vision models tend to be larger than text-only models at the same parameter count because they include a vision encoder alongside the language model, so LLaVA 7B uses more memory than a text-only 7B model. On memory-constrained Macs, consider Moondream or Gemma 3 as lighter alternatives. Check ModelPiper's resource monitor before loading a vision model if you already have other models in memory.
Cloud vision models are still better at hard tasks. GPT-4 Vision and Claude's vision capabilities outperform local models on complex document analysis, dense OCR, and multi-object reasoning. For quick image descriptions, accessibility text, and simple Q&A, local models are good enough. For legal document analysis or detailed medical image review, cloud models remain more reliable.
Download ModelPiper at modelpiper.com and try dragging an image into the chat. If you have Ollama vision models pulled, they appear in the model selector automatically.
This is part of a series on Ollama frontends for Mac. See also: Ollama Pipelines - combine vision with OCR and TTS in a workflow. Next: Run Multiple Ollama Models on Mac - see what fits in memory.