Ollama runs language models. It doesn't listen and it doesn't speak. Type a question in the terminal, read the answer on screen. That's the entire interaction model.

Voice changes what local AI feels like. Instead of typing and reading, you talk and listen. The model becomes a conversational partner instead of a text box. But getting there requires three separate AI models working together, and Ollama only handles one of them.

What does voice chat with a local model actually require?

Three models, running in sequence, every time you speak:

Speech-to-text (STT). Your voice goes in, a text transcription comes out. This needs a dedicated model - Whisper, Parakeet, or similar. Ollama doesn't include one.

Language model (LLM). The transcribed text goes to your chat model. This is what Ollama does well. Llama 3.2, Qwen 2.5, Mistral, DeepSeek - any model you have pulled works here.

Text-to-speech (TTS). The model's text response gets converted to audio. Another dedicated model - PocketTTS, Soprano, Orpheus, or similar. Ollama doesn't include this either.

The hard part isn't running each model. It's coordinating them. The STT output needs to feed into the LLM prompt. The LLM response needs to stream into the TTS engine as tokens arrive, not after the full response completes. Latency between stages compounds - if each handoff adds 500ms, the conversation feels broken.

You could wire this together manually with Python scripts, a Whisper server, and a TTS service. Some people do. It takes hours of setup, and the result is fragile.
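The coordination problem is easier to see in code. Here's a minimal sketch of the streaming handoff between the LLM and TTS stages - flush text to the TTS engine at each sentence boundary as tokens arrive, instead of waiting for the full response. The `speak` callback is a hypothetical stand-in for whatever TTS engine you wire in:

```python
import re
from typing import Callable, Iterator

SENTENCE_END = re.compile(r"[.!?]\s*$")

def stream_to_tts(tokens: Iterator[str], speak: Callable[[str], None]) -> None:
    """Buffer streamed LLM tokens and hand each complete sentence to TTS."""
    buffer = ""
    for token in tokens:
        buffer += token
        if SENTENCE_END.search(buffer):
            speak(buffer.strip())   # TTS starts before the LLM finishes
            buffer = ""
    if buffer.strip():              # flush any trailing partial sentence
        speak(buffer.strip())

# Example with tokens as an LLM might stream them:
spoken = []
stream_to_tts(iter(["Hel", "lo", ". ", "How ", "are ", "you", "?"]), spoken.append)
# spoken is now ["Hello.", "How are you?"]
```

This is the core of what makes the latency tolerable: the first sentence reaches the TTS engine while the model is still generating the second.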

How does ToolPiper add voice to Ollama models?

ToolPiper ships STT, LLM, and TTS as built-in backends, all running on Apple Silicon hardware acceleration. The tp-local-voice-chat pipeline template wires all three together in a pre-configured workflow.

The speech-to-text backend uses Parakeet v3, running on Apple's Neural Engine. It transcribes in real time on M-series chips. The language model runs through ToolPiper's bundled llama.cpp engine (or connects to your existing Ollama instance). The text-to-speech backend offers three options:

PocketTTS - runs on the Neural Engine. Fastest option, near-instant generation. Default voice: Cosette (female). Good for conversational pace where you want the response to start immediately.

Soprano - runs on Metal GPU. Higher audio quality, slightly more latency. Default voice: Tara (female). Better for longer responses where you want the voice to sound more natural.

Orpheus - expressive model with emotional range. Default voice: Tara (female). Best for content creation and narration. Overkill for quick Q&A, worth it for anything where the voice quality matters.

All three TTS options run entirely on your Mac. No audio leaves the device.

How do you set up voice chat with Ollama on Mac?

If you already have Ollama running with models downloaded, ToolPiper connects to it as an external provider. Your Ollama models appear in the pipeline's LLM block alongside ToolPiper's built-in models. You don't have to choose one or the other.
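Connecting to an existing Ollama instance needs nothing exotic, because Ollama exposes a plain HTTP API. Here's a minimal sketch of the LLM stage talking to a local Ollama server - the endpoint and payload shape are Ollama's documented `/api/chat` interface; how ToolPiper registers it as a provider internally isn't shown:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"

def build_chat_request(model: str, transcript: str) -> dict:
    """Payload for Ollama's /api/chat endpoint; stream=True returns NDJSON."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": transcript}],
        "stream": True,
    }

def stream_ollama(model: str, transcript: str):
    """Yield response tokens from a local Ollama server as they arrive."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_chat_request(model, transcript)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for line in resp:                      # NDJSON: one JSON object per line
            chunk = json.loads(line)
            if not chunk.get("done"):
                yield chunk["message"]["content"]
```

Each yielded token can feed straight into the streaming TTS handoff described earlier.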

The voice chat pipeline is three blocks connected in sequence: microphone input flows to STT, the transcript flows to the LLM, and the response flows to TTS. ToolPiper's pipeline builder shows this as a visual graph you can inspect and customize.

Push-to-talk vs continuous listening

Two input modes. Push-to-talk activates the microphone when you hold a button (or a keyboard shortcut) and stops when you release. Continuous listening keeps the microphone open and uses silence detection to determine when you've finished speaking.

Push-to-talk is more predictable. You control exactly when the model hears you. Continuous listening is more natural for extended conversations but occasionally triggers on background noise. We default to push-to-talk for the pipeline template.
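Silence detection in continuous mode usually comes down to an energy threshold over recent audio frames: the utterance ends once enough consecutive frames fall below the threshold. A rough sketch of the idea - the threshold and frame counts are illustrative, not ToolPiper's actual values:

```python
import math

def rms(frame):
    """Root-mean-square energy of one audio frame (float samples in [-1, 1])."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def utterance_ended(frames, threshold=0.01, silent_frames_needed=30):
    """True once the trailing frames have all stayed below the threshold.
    At ~10ms per frame, 30 silent frames is roughly 300ms of silence."""
    if len(frames) < silent_frames_needed:
        return False
    return all(rms(f) < threshold for f in frames[-silent_frames_needed:])
```

The threshold is where the false-trigger problem lives: set it too low and keyboard clicks keep the utterance open, too high and quiet speech gets cut off.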

What does the latency actually look like?

Voice chat latency is the sum of three stages. We measured each on an M2 Max with 32GB, using Qwen 2.5 3B (Q4) for chat:

STT (Parakeet v3): A typical spoken sentence (5-10 words) transcribes in about 400ms. Parakeet runs on the Neural Engine, which is a separate processor from the GPU - so transcription doesn't compete with the LLM for Metal compute time.

LLM (3B model, Q4): Time to first token averages about 300ms in our testing. Tokens stream as they generate, and the TTS engine picks up partial output - it doesn't wait for the full response to complete.

TTS (PocketTTS): First audio plays about 350ms after receiving text input. Because of the streaming handoff, the user hears audio before the LLM finishes generating its full response.

Total round-trip: About 1.5 seconds from the end of your sentence to the first word of the spoken response with a 3B model on M2 Max. With a 7B model, the LLM's time-to-first-token roughly doubles, pushing total latency to about 2-2.5 seconds. A 13B model pushes it to 3-4 seconds.
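The arithmetic behind that total, as a quick sanity check. Because of the streaming handoff, the round trip is STT plus LLM time-to-first-token plus TTS time-to-first-audio, not the sum of full-stage durations. The stage numbers come from the measurements above; the overhead constant is an assumption covering the silence-detection tail and buffering between stages:

```python
STT_MS = 400             # Parakeet v3, short spoken sentence
TTS_FIRST_AUDIO_MS = 350 # PocketTTS, time to first audio
OVERHEAD_MS = 450        # assumed: silence-detection tail + inter-stage buffering

def round_trip_ms(llm_ttft_ms: int) -> int:
    """Milliseconds from end of speech to first spoken word of the reply."""
    return STT_MS + llm_ttft_ms + TTS_FIRST_AUDIO_MS + OVERHEAD_MS

print(round_trip_ms(300))  # 3B model, ~300ms time-to-first-token -> 1500 ms
```

Note what dominates as models grow: the STT and TTS stages are fixed, so the LLM's time to first token is the only term that scales with model size.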

What that feels like in practice: you stop talking, there's a beat of silence, and then the model starts speaking. With a 3B model, the pause is short enough that it feels like the model is formulating a thought. With a 13B model, the pause is noticeable - you start wondering if something broke before the first word arrives. For comparison, ChatGPT's voice mode typically responds in under a second, running on optimized server hardware. Local voice chat on consumer hardware can't match that speed, but it runs entirely on-device with no internet connection and no data leaving your Mac.

What are the limitations of local voice chat?

Latency is real. Cloud voice assistants like ChatGPT's voice mode use optimized infrastructure and voice-native models to achieve sub-second response times. Local models on consumer hardware can't match that speed, especially with larger models. The 1-2 second pause with a 3B model is the floor, not the ceiling.

Three models in memory simultaneously. STT, LLM, and TTS each need RAM. Parakeet v3 uses roughly 500MB. A 3B chat model at Q4 uses about 2GB. PocketTTS uses about 300MB. Total: roughly 3GB for the smallest viable voice chat setup. On an 8GB Mac, that leaves little headroom. On 16GB or more, it's comfortable. For the full picture on running multiple models at once, see running multiple Ollama models on Mac.
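A quick check of those memory figures (values in MB, taken from the text above):

```python
MODELS_MB = {
    "Parakeet v3 (STT)": 500,
    "3B chat model (Q4)": 2000,
    "PocketTTS (TTS)": 300,
}
total_mb = sum(MODELS_MB.values())
print(f"{total_mb} MB = {total_mb / 1024:.1f} GB")  # 2800 MB = 2.7 GB
```

Swap in a 7B model at Q4 (roughly 4GB) and the total climbs past 5GB, which is why the larger models get uncomfortable on 8GB Macs.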

No interruption handling. If the model is speaking and you start talking, the current implementation doesn't stop the TTS output mid-sentence. You need to wait for it to finish or manually stop playback. This is a known limitation we're working to improve.

Ambient noise sensitivity. Continuous listening mode can false-trigger on background audio - music, other people talking, keyboard sounds. Push-to-talk avoids this entirely, which is why it's the default.

For most conversational AI tasks - brainstorming, dictation review, Q&A while your hands are busy - local voice chat is good enough that you stop reaching for the keyboard. For rapid-fire dialogue where sub-second latency matters, cloud voice modes are still faster.

Download ToolPiper at modelpiper.com and try the tp-local-voice-chat pipeline template with your existing Ollama models.

This is part of a series on Ollama frontends for Mac. See also: Voice Chat on Mac With Local AI for the general guide to local voice conversation. Next: Ollama Pipelines on Mac - chain models in a visual workflow.