macOS has had dictation since Mountain Lion. In the fourteen years since, the interaction model has been the same: you speak, it types. Words appear wherever the cursor is. The system transcribes; it doesn't understand what you meant, and it doesn't do anything about it.

Transcription is useful. It's also the narrowest version of what voice AI on a Mac could be.

What is conversational AI for Mac?

Conversational AI for Mac means using your voice to have a real back-and-forth with an AI model, or to speak a natural language intent and have your Mac act on it - not just transcribe the words. It requires a speech recognition model, a language model that interprets meaning, and an action layer that can execute the result.

The distinction matters because it changes what voice is good for. Dictation is useful when you want words on a screen. Conversational AI is useful when you want something to happen - a calendar event created, a file found, a question answered, a reminder set based on your location. The output isn't text. It's a completed task.

What made this possible now?

Two things converged in the last two years. The first is Apple Silicon. The Neural Engine in M-series chips has enough throughput to run speech recognition, a multi-billion-parameter language model, and text-to-speech simultaneously, in real time, without touching the network. An M2 Max can run a 7B parameter model at 30+ tokens per second while doing STT and TTS in parallel.

The second is model size. Llama 3.2 3B, Qwen 3 1.7B, Phi-3 Mini - these aren't toy models. They're capable enough at natural language interpretation to accurately route voice commands to system actions. A year ago the models small enough to run locally weren't good enough to be useful for command interpretation. That gap has closed.

Put those two things together and the interpretation layer between your voice and your Mac becomes practical. Not a research demo. A tool you can use all day.

How does ToolPiper's voice AI work?

ToolPiper has three voice modes. Right Option held is push-to-talk dictation: speak and release, text pastes at the cursor. Right Command held is AI command mode: speak an intent, a local LLM interprets it, your Mac executes the action. The voice chat interface is a full conversational AI: speak, hear the response, follow up - all on-device.

The two hotkeys are intentionally different so the modes stay distinct in muscle memory. Dictation is Right Option because it's the simpler, faster operation - hold, speak, paste. Commands are Right Command because you're invoking the AI deliberately. The difference in key is a difference in intent: transcribe what I said versus do what I'm asking.

All three modes use the same on-device pipeline. FluidAudio's Whisper model runs speech recognition on the Neural Engine. For command mode and voice chat, a local LLM receives the transcript. Text-to-speech (available in voice chat and command responses) runs locally through FluidAudio, PocketTTS, or Soprano. No microphone data leaves your machine at any point.
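To make the shape of that pipeline concrete, here is a minimal sketch in Python. Everything in it is an illustrative stand-in - the function names, the stubbed transcript, and the action format are assumptions for the sketch, not ToolPiper's actual API. The point is the flow: all three modes share stage one, and only command mode and voice chat continue to stages two and three.

```python
def transcribe(audio: bytes) -> str:
    """Stage 1: on-device speech recognition (Whisper-class model). Stubbed."""
    return "turn on do not disturb for 90 minutes"

def interpret(transcript: str) -> dict:
    """Stage 2: a local LLM maps the transcript to a structured action. Stubbed."""
    return {"action": "focus.enable", "params": {"minutes": 90}}

def handle_utterance(audio: bytes, mode: str) -> dict:
    transcript = transcribe(audio)
    if mode == "dictation":
        # Right Option: no interpretation layer - text pastes at the cursor.
        return {"paste": transcript}
    # Right Command / voice chat: interpret, execute, then speak a response
    # (stage 3, TTS) - represented here as the returned "spoken" string.
    action = interpret(transcript)
    return {"action": action, "spoken": f"Done: {action['action']}"}
```

The branch is the whole difference between the modes: dictation stops after transcription, while command mode hands the same transcript to the interpretation layer.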

What can voice AI commands actually do?

The short answer: 142 macOS system actions across 26 domains. Calendar, reminders, Finder, apps, notifications, Bluetooth, display, focus modes, clipboard, browser control, system settings, and more. The long answer is easier to show through examples of what real requests look like.

'Add a meeting tomorrow at 2pm with the design team about the launch' creates the calendar event. 'Remind me to follow up on this email when I leave the office' sets a location-based reminder. 'Find the contract PDF from last week' runs a Spotlight search and opens the result. 'Turn on Do Not Disturb for 90 minutes' activates focus mode. 'Set my screen brightness to 40%' adjusts the display. 'Open Figma' launches the app. 'Copy that to the clipboard' captures the last AI response.

These aren't macros or keyboard shortcuts mapped to a phrase. The LLM interprets the natural language of what you said and routes it to the appropriate action. You don't need to say the command in a specific format. 'Put a reminder for tomorrow morning about the dentist' and 'remind me about dentist, tomorrow, 9am' both work - the model handles the variation.
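One way to picture the difference from a macro system is a sketch like the following. It assumes the local LLM emits a structured intent (an action name plus parameters) rather than free text; the action names and the toy keyword "model" standing in for the LLM are hypothetical, chosen only to show that differently phrased requests normalize to the same handler.

```python
# Registered handlers, keyed by action name (names are illustrative).
ACTIONS = {
    "reminders.create": lambda p: f"reminder set: {p['title']} at {p['when']}",
    "display.brightness": lambda p: f"brightness -> {p['percent']}%",
}

def mock_llm(utterance: str) -> dict:
    """Toy stand-in for the local LLM's intent extraction."""
    if "remind" in utterance.lower():
        return {"action": "reminders.create",
                "params": {"title": "dentist", "when": "tomorrow 9am"}}
    return {"action": "display.brightness", "params": {"percent": 40}}

def route(utterance: str) -> str:
    """Interpret the utterance, then dispatch to the matching handler."""
    intent = mock_llm(utterance)
    return ACTIONS[intent["action"]](intent["params"])

# Both phrasings resolve to the same action with the same parameters:
route("Put a reminder for tomorrow morning about the dentist")
route("remind me about dentist, tomorrow, 9am")
```

A macro system would need each phrasing registered in advance; here the interpretation step absorbs the variation before dispatch ever happens.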

How does this compare to Siri, macOS dictation, and Wispr?

Siri is the closest conceptual overlap. Both systems understand spoken intent and execute actions. The architectural difference is significant: Siri routes through Apple's servers, works best with Apple's own apps, and operates within the SiriKit permission model for third-party actions. ToolPiper runs the LLM locally, exposes 142 custom system actions without any permission framework, and works fully offline.

The scope is also different. Siri's action vocabulary is Apple's vocabulary. ToolPiper's action vocabulary was built specifically for power users: fine-grained Finder operations, system-level controls, multi-step clipboard workflows, browser automation, and actions Siri can't surface because they bypass the SiriKit layer entirely.

macOS Dictation (the built-in option) does what it says: transcribes. On Apple Silicon it's on-device and fast. It has no interpretation layer and no action system. It's a good transcription tool. That's all it is.

Wispr Flow adds polish and context awareness to the transcription model - it knows you're in Gmail and formats accordingly, or knows you're in a code editor and doesn't autocorrect variable names. But the interpretation layer isn't there either. Wispr turns your voice into better-formatted text. It doesn't turn your voice into completed tasks. And at $12/month for transcription alone (at the time of this writing) versus ToolPiper Pro's $10/month for a full local AI platform, the value comparison is worth a close look.

What does the voice chat experience look like?

Open the voice chat interface in ToolPiper and the interaction model shifts from command execution to conversation. Speak a question. The STT model transcribes it, a local LLM generates a response, TTS speaks the answer back. Response latency on an M2 Max is typically under two seconds from the end of your sentence to the start of the AI's response.

Voice chat uses the same model you have configured for text chat - Llama 3.2, Qwen 3, Mistral, Phi, or any of the other supported local models. The voice interface is a wrapper around the same inference stack, not a separate system. That means the full context window, the same model capability, and the same privacy properties: nothing leaves your Mac.

For daily use, voice chat is most useful for quick questions where speaking is faster than typing, for hands-free research while doing something else, and for getting AI responses you can hear rather than read. It's also the closest thing to 'talking to your Mac' in the intuitive sense - not issuing commands, just having a conversation with a model that knows your system and has access to your tools.

Where does cloud voice AI still win?

Cross-platform is the real gap. Wispr Flow works on iPhone and Android alongside Mac. Siri works across every Apple device. ToolPiper is macOS only. If you need voice AI that follows you from your Mac to your phone, neither ToolPiper's dictation nor its command mode helps you there.

Wispr's context-aware formatting is also more mature for heavy dictation users. Detecting that you're in a code editor versus a Slack message and adjusting how it formats the output - that's two years of iteration ToolPiper hasn't fully replicated yet.

And for simple questions where the answer doesn't require privacy - 'what's the weather tomorrow,' 'how do you spell conscientious' - Siri is faster to reach because it's always one key away and doesn't require a model to be loaded in memory.

The tradeoff is clear: cross-device access and cloud-powered convenience on one side, privacy and a much deeper action system on the other. For Mac-primary users who care where their voice data goes, the local approach covers the use cases that matter.

Try it

Download ToolPiper at modelpiper.com. The Pro trial is 14 days. Start with the Right Command hotkey - say something specific and watch the action router handle it. That's the clearest demonstration of what separates conversational AI from dictation.

This is the pillar for the Voice AI for Mac series. The spokes go deeper on specific angles: Wispr Flow alternative (price and feature comparison), private voice dictation (why voice data is more sensitive than it looks), offline voice typing (what works without internet), and voice coding on Mac (local AI pair programming).