---
title: "Voice Chat With Ollama on Mac: Add STT and TTS to Any Local Model"
description: "Add voice conversation to Ollama on Mac by chaining three local models: speech-to-text, your LLM, and text-to-speech. All on-device, no cloud APIs."
date: 2026-04-07
author: "Ben Racicot"
tags: ["Voice", "Ollama", "Speech to Text", "Text Generation", "Text to Speech", "Privacy", "macOS"]
type: "article"
canonical: "https://modelpiper.com/blog/ollama-voice-chat-mac/"
---

# Voice Chat With Ollama on Mac: Add STT and TTS to Any Local Model

> Add voice conversation to Ollama on Mac by chaining three local models: speech-to-text, your LLM, and text-to-speech. All on-device, no cloud APIs.

## TL;DR

Ollama runs language models. It doesn't listen or speak. ToolPiper adds full voice conversation by chaining three local models: Parakeet for speech-to-text, your Ollama LLM for reasoning, and PocketTTS or Soprano for text-to-speech. Push-to-talk or continuous listening, all running on Apple Silicon with no cloud APIs.

Ollama runs language models. It doesn't listen and it doesn't speak. Type a question in the terminal, read the answer on screen. That's the entire interaction model.

Voice changes what local AI feels like. Instead of typing and reading, you talk and listen. The model becomes a conversational partner instead of a text box. But getting there requires three separate AI models working together, and Ollama only handles one of them.

## What does voice chat with a local model actually require?

Three models, running in sequence, every time you speak:

**Speech-to-text (STT).** Your voice goes in, a text transcription comes out. This needs a dedicated model - Whisper, Parakeet, or similar. Ollama doesn't include one.

**Language model (LLM).** The transcribed text goes to your chat model. This is what Ollama does well. Llama 3.2, Qwen 3.5, Mistral, DeepSeek - any model you have pulled works here.

**Text-to-speech (TTS).** The model's text response gets converted to audio. Another dedicated model - PocketTTS, Soprano, Orpheus, or similar. Ollama doesn't include this either.

The hard part isn't running each model. It's coordinating them. The STT output needs to feed into the LLM prompt. The LLM response needs to stream into the TTS engine as tokens arrive, not after the full response completes. Latency between stages compounds - if each handoff adds 500ms, the conversation feels broken.

You could wire this together manually with Python scripts, a Whisper server, and a TTS service. Some people do. It takes hours of setup, and the result is fragile.

## How does ToolPiper add voice to Ollama models?

ToolPiper ships STT, LLM, and TTS as built-in backends, all running on Apple Silicon hardware acceleration. The `tp-local-voice-chat` pipeline template wires all three together in a pre-configured workflow.

The speech-to-text backend uses Parakeet v3, running on Apple's Neural Engine. It transcribes in real-time on M-series chips. The language model runs through ToolPiper's bundled llama.cpp engine (or connects to your existing Ollama instance). The text-to-speech backend offers three options:

**PocketTTS** - runs on the Neural Engine. Fastest option, near-instant generation. Default voice: Cosette (female). Good for conversational pace where you want the response to start immediately.

**Soprano** - runs on Metal GPU. Higher audio quality, slightly more latency. Default voice: Tara (female). Better for longer responses where you want the voice to sound more natural.

**Orpheus** - expressive model with emotional range. Default voice: Tara (female). Best for content creation and narration. Overkill for quick Q&A, worth it for anything where the voice quality matters.

All three TTS options run entirely on your Mac. No audio leaves the device.

## How do you set up voice chat with Ollama on Mac?

If you already have Ollama running with models downloaded, ToolPiper connects to it as an external provider. Your Ollama models appear in the pipeline's LLM block alongside ToolPiper's built-in models. You don't have to choose one or the other.

The voice chat pipeline is three blocks connected in sequence: microphone input flows to STT, the transcript flows to the LLM, and the response flows to TTS. ModelPiper's pipeline builder shows this as a visual graph you can inspect and customize.

### Push-to-talk vs continuous listening

Two input modes. Push-to-talk activates the microphone when you hold a button (or a keyboard shortcut) and stops when you release. Continuous listening keeps the microphone open and uses silence detection to determine when you've finished speaking.

Push-to-talk is more predictable. You control exactly when the model hears you. Continuous listening is more natural for extended conversations but occasionally triggers on background noise. We default to push-to-talk for the pipeline template.

## What does the latency actually look like?

Voice chat latency is the sum of three stages. We measured each on an M2 Max with 32GB, using Qwen 3.5 3B (Q4) for chat:

**STT (Parakeet v3):** A typical spoken sentence (5-10 words) transcribes in about 400ms. Parakeet runs on the Neural Engine, which is a separate processor from the GPU - so transcription doesn't compete with the LLM for Metal compute time.

**LLM (3B model, Q4):** Time to first token averages about 300ms in our testing. Tokens stream as they generate, and the TTS engine picks up partial output - it doesn't wait for the full response to complete.

**TTS (PocketTTS):** First audio plays about 350ms after receiving text input. Because of the streaming handoff, the user hears audio before the LLM finishes generating its full response.

**Total round-trip:** About 1.5 seconds from the end of your sentence to the first word of the spoken response with a 3B model on M2 Max. With a 7B model, the LLM's time-to-first-token roughly doubles, pushing total latency to about 2-2.5 seconds. A 13B model pushes it to 3-4 seconds.

What that feels like in practice: you stop talking, there's a beat of silence, and then the model starts speaking. With a 3B model, the pause is short enough that it feels like the model is formulating a thought. With a 13B model, the pause is noticeable - you start wondering if something broke before the first word arrives. For comparison, ChatGPT's voice mode typically responds in under a second, running on optimized server hardware. Local voice chat on consumer hardware can't match that speed, but it runs entirely on-device with no internet connection and no data leaving your Mac.

## What are the limitations of local voice chat?

**Latency is real.** Cloud voice assistants like ChatGPT's voice mode use optimized infrastructure and voice-native models to achieve sub-second response times. Local models on consumer hardware can't match that speed, especially with larger models. The 1-2 second pause with a 3B model is the floor, not the ceiling.

**Three models in memory simultaneously.** STT, LLM, and TTS each need RAM. Parakeet v3 uses roughly 500MB. A 3B chat model at Q4 uses about 2GB. PocketTTS uses about 300MB. Total: roughly 3GB for the smallest viable voice chat setup. On an 8GB Mac, that leaves tight headroom. On 16GB or more, it's comfortable. For the full picture on running multiple models at once, see [running multiple Ollama models on Mac](/blog/ollama-multi-model-mac).

**No interruption handling.** If the model is speaking and you start talking, the current implementation doesn't stop the TTS output mid-sentence. You need to wait for it to finish or manually stop playback. This is a known limitation we're working to improve.

**Ambient noise sensitivity.** Continuous listening mode can false-trigger on background audio - music, other people talking, keyboard sounds. Push-to-talk avoids this entirely, which is why it's the default.

For most conversational AI tasks - brainstorming, dictation review, Q&A while your hands are busy - local voice chat is good enough that you stop reaching for the keyboard. For rapid-fire dialogue where sub-second latency matters, cloud voice modes are still faster.

Download ToolPiper at [modelpiper.com](https://modelpiper.com) and try the `tp-local-voice-chat` pipeline template with your existing Ollama models.

_This is part of a series on [Ollama frontends for Mac](/blog/best-ollama-frontend-mac). See also: [Voice Chat on Mac With Local AI](/blog/voice-chat-mac-local-ai) for the general guide to local voice conversation. Next: [Ollama Pipelines on Mac](/blog/ollama-pipelines-mac) - chain models in a visual workflow._

## Steps

### 1. Install ToolPiper and load a chat model

Download ToolPiper from the [modelpiper.com/download](https://modelpiper.com/download) or [modelpiper.com](https://modelpiper.com). A starter model downloads on first launch. For voice chat, a 3B model (like Qwen 3.5 3B or Llama 3.2 3B) gives the best balance of speed and quality. If you have Ollama models, add Ollama as a provider - your models appear automatically.

### 2. Open the voice chat pipeline

In ModelPiper, navigate to the pipeline templates and select `tp-local-voice-chat`. This creates a three-block pipeline: STT (Parakeet v3) → LLM (your chat model) → TTS (PocketTTS by default). Each block is pre-configured with recommended settings.

### 3. Choose your models and TTS voice

Click the LLM block to select which chat model handles reasoning. Click the TTS block to choose between PocketTTS (fastest), Soprano (highest quality), or Orpheus (most expressive). Each TTS engine defaults to a female voice - Cosette for PocketTTS, Tara for Soprano and Orpheus.

### 4. Start talking

Click the microphone button (or use the keyboard shortcut) to activate push-to-talk. Speak your question, release the button. ToolPiper transcribes your speech, sends the text to the LLM, and reads the response aloud. The conversation history stays in the chat panel so you can review what was said.

## FAQ

### Can I use voice chat with my existing Ollama models?

Yes. Add Ollama as an external provider in ToolPiper's settings. Your Ollama models appear in the voice chat pipeline's LLM block selector alongside ToolPiper's built-in models. The STT and TTS blocks use ToolPiper's own audio backends regardless of which LLM provider you choose.

### How much RAM does voice chat need?

Three models load simultaneously: STT (~500MB), your chat LLM (2-5GB depending on model size), and TTS (~300MB). A 3B chat model needs roughly 3GB total. A 7B model pushes that to about 5.5GB. On an 8GB Mac, stick with 3B models for voice chat. On 16GB, a 7B model is comfortable.

### Is the voice data sent to any server?

No. All three models - speech-to-text, language model, and text-to-speech - run locally on your Mac's Apple Silicon. Audio never leaves the device. There are no cloud APIs, no internet connection required after the initial model downloads, and no telemetry on voice data.

### Can I change the voice or language?

TTS voice is configurable per engine. PocketTTS defaults to Cosette, Soprano and Orpheus default to Tara. Language depends on the models: Parakeet v3 handles English well, and multilingual STT models are available. The chat LLM responds in whatever language the prompt uses, and TTS will read non-English text with varying quality depending on the engine.
