Voice mode in ChatGPT is impressive. You talk, it listens, it responds with a natural voice. The conversation feels fluid. Then you remember that every word you're saying is being streamed to OpenAI's servers, processed, stored, and — unless you opted out — potentially used for training.
What if the same experience ran entirely on your Mac?
That's not a hypothetical. The hardware already on your desk — Apple Silicon with a Neural Engine, a capable GPU, and unified memory — can run all three stages of a voice conversation locally: speech-to-text, language model inference, and text-to-speech. The missing piece has been software that wires them together without requiring you to configure three separate tools.
The Three-Stage Pipeline
A voice chat is three AI models working in sequence.
Stage 1: Speech-to-Text (STT). Your voice is captured through the microphone and converted to text. This runs on the Neural Engine using Parakeet, a Whisper-class model. It handles accents, background noise, and natural speech patterns.
Stage 2: Language Model (LLM). The transcribed text is sent to a language model — Llama, Qwen, or whatever you've downloaded — which generates a response. This runs on the Metal GPU via llama.cpp.
Stage 3: Text-to-Speech (TTS). The model's text response is synthesized into speech. This runs on either the Neural Engine (FluidAudio) or Metal GPU (MLX Audio), depending on which voice backend you choose.
The result: you speak, the AI thinks, and it speaks back. All three stages execute on your hardware. Nothing hits the network.
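The three stages amount to a simple sequential composition: audio in, text through the middle, audio out. Here's a minimal Python sketch of that shape — the class, function names, and stub backends are hypothetical illustrations, not ModelPiper's actual API; in the real app the callables would be backed by Parakeet, llama.cpp, and FluidAudio or MLX Audio.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class VoicePipeline:
    # Hypothetical stage signatures: each stage is just a callable,
    # so any backend with the same shape can slot in.
    stt: Callable[[bytes], str]   # Stage 1: audio samples -> transcript
    llm: Callable[[str], str]     # Stage 2: prompt -> response text
    tts: Callable[[str], bytes]   # Stage 3: text -> synthesized audio

    def run(self, audio: bytes) -> bytes:
        transcript = self.stt(audio)   # runs on the Neural Engine
        reply = self.llm(transcript)   # runs on the Metal GPU
        return self.tts(reply)         # ANE or GPU, depending on backend

# Stub backends stand in for the real models in this sketch.
pipe = VoicePipeline(
    stt=lambda audio: "what is unified memory",
    llm=lambda text: f"You asked: {text}",
    tts=lambda text: text.encode("utf-8"),
)
out = pipe.run(b"\x00\x01")
print(out.decode())  # → You asked: what is unified memory
```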
The ModelPiper Workflow
Open ModelPiper and load the Voice Chat template. It pre-wires all three stages: Audio Capture → STT → LLM → TTS → Response.
Hit the record button and talk. When you stop, the pipeline fires in sequence — your speech becomes text, the LLM generates a response, and TTS reads it back to you. The response block auto-plays the audio.
The visual pipeline builder shows you exactly what's happening at each stage. You can see the transcription appear, watch the LLM generate its response, and then hear the TTS output. If you want to swap the LLM for a different model, or switch from FluidAudio TTS to MLX Audio for a higher-quality voice, it's a dropdown change.
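Why is swapping a backend just a dropdown change? Because every stage shares an interface, the app only has to look up a different implementation by name. A rough sketch of the idea — the registry, backend names, and placeholder outputs below are assumptions for illustration, not how ModelPiper is actually built:

```python
# Hypothetical TTS backend registry. Both entries take text and return
# (placeholder) audio, so selecting one is a single lookup.
TTS_BACKENDS = {
    "fluidaudio": lambda text: f"[ANE audio for: {text}]",
    "mlx_audio": lambda text: f"[Metal audio for: {text}]",
}

def synthesize(text: str, backend: str = "fluidaudio") -> str:
    # The "dropdown change" is just picking a different key.
    return TTS_BACKENDS[backend](text)
```

The same pattern applies to the LLM stage: as long as each model exposes the same text-in, text-out contract, the pipeline doesn't care which one is behind it.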
Why Voice Matters
Typing is not always the best interface. Voice is better when:
Your hands are busy. Cooking, driving, exercising, working with tools. A voice interface lets you interact with AI without stopping what you're doing.
You think better out loud. Some people process ideas more effectively by talking than by typing. Voice chat turns the AI into a thinking partner you can have a spoken conversation with.
Accessibility. For anyone who has difficulty with a keyboard — RSI, motor impairments, vision issues — voice is not a novelty. It's the primary interface.
Speed. Most people speak at 125–150 words per minute and type at 40–60. Voice input is 2–3x faster for getting your thoughts into the system.
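The 2–3x figure follows directly from the rates above. A quick check of the arithmetic, using the cited ranges:

```python
# Rates cited above: speaking 125-150 wpm, typing 40-60 wpm.
speak_wpm = (125, 150)
type_wpm = (40, 60)

slowest_gain = speak_wpm[0] / type_wpm[1]   # slow speaker vs fast typist: ~2.1x
fastest_gain = speak_wpm[1] / type_wpm[0]   # fast speaker vs slow typist: ~3.8x
midpoint_gain = (sum(speak_wpm) / 2) / (sum(type_wpm) / 2)  # ~2.75x
```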
Latency: The Local Advantage
Cloud voice services have an inherent latency floor: your audio has to travel to a server, get processed, and the response has to travel back. Even on fast connections, that's 500ms–2s of dead air.
Local voice has no network round trip. The STT runs in milliseconds on the Neural Engine. The LLM starts generating immediately. TTS synthesis begins streaming as soon as the first sentence is ready. The perceived latency of a local voice conversation can be under 500ms total — faster than most cloud services, and fast enough that the conversation feels natural rather than stilted.
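A back-of-the-envelope budget makes the difference concrete. The numbers below are illustrative assumptions for the sketch, not measurements of any particular service or model — the point is structural: the cloud pays transport and queueing costs that simply don't exist locally.

```python
# Illustrative latency budgets in milliseconds (assumed, not measured).
cloud_ms = {
    "uplink": 150,             # audio travels to the server
    "server_processing": 300,  # STT + LLM + TTS in the datacenter
    "downlink": 150,           # synthesized audio travels back
}
local_ms = {
    "stt": 50,               # Neural Engine transcription
    "llm_first_token": 250,  # Metal GPU, time to first token
    "tts_first_chunk": 100,  # first audio chunk ready to play
}
print(sum(cloud_ms.values()), sum(local_ms.values()))
```

Even with generous local numbers, dropping the two network legs keeps the perceived total under the ~500ms threshold where conversation feels natural.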
Try It
Download ModelPiper, install ToolPiper, and load the Voice Chat template. Make sure you've downloaded an LLM (the starter model works; a 3B model is better for conversation). Talk to your Mac.
Your voice, the model's response, and the synthesized speech all stay on your machine.
This is part of a series on local-first AI workflows on macOS. Next up: Transcribe & Summarize — drop an audio file, get the key points back.