You are in the middle of writing code in your IDE. You need to dictate a quick note. You switch to a dictation app, speak, copy the text, switch back, paste it. Or you reach for Siri and wait for the cloud round-trip. What if you could just hold a key, speak, and have the text appear right where your cursor is - processed entirely on your Mac?
That is what push-to-talk dictation should be. Hold a key, talk, release the key, and the words appear. No app switching. No cloud latency. No fumbling with a separate transcription window. Just your voice turning into text, right where you need it.
Why is voice input on Mac still this clunky in 2026?
Apple's built-in dictation sends audio to Apple's servers by default. You can enable on-device dictation in System Settings, but the quality is noticeably worse, it occasionally stalls, and it still requires the awkward double-tap of the fn key followed by a mode switch that takes over your keyboard focus.
Third-party dictation apps are either cloud-dependent (Otter, Google Docs voice typing) or have workflows that break your concentration. Open the app, click record, speak, wait for processing, copy the result, switch back to your original app, paste. That is six steps too many for something that should be instant.
Siri handles some voice tasks, but it is tethered to Apple's servers, limited to Apple's predetermined command set, and cannot simply paste text at your cursor. You cannot say "type this paragraph into my code editor" and have Siri do it.
Whisper.cpp exists as a local option, but it is a command-line tool. It processes audio files in batch mode. There is no push-to-talk interface, no cursor integration, and no way to use it from the middle of another application without writing your own wrapper.
The core problem is that none of these solutions combine three things: local processing, push-to-talk activation, and system-wide cursor insertion. You get one or two, never all three.
What would real push-to-talk look like on a Mac?
Think about how a walkie-talkie works. Hold the button, talk, release. The message goes through. No menus, no mode switching, no waiting. Now apply that to your Mac: hold a key on your keyboard, speak naturally, release the key, and the transcribed text appears wherever your cursor happens to be. Your IDE. Your browser. Slack. Notes. Terminal. Anywhere.
That is the first mode - dictation. Your voice becomes text.
Now consider a second mode. Same interaction pattern - hold a key, speak, release - but instead of transcribing your words literally, the system interprets them as a command. "Turn on dark mode." "Mute my Mac." "Set the volume to fifty percent." "Move this window to the left half of the screen." You speak an instruction in natural language, and your Mac executes it.
These two modes - dictation and command - cover the vast majority of what people actually want from voice input. Either you want your words as text, or you want your words to trigger an action. Everything else is a variation on those two patterns.
How fast can local speech-to-text actually be?
Cloud STT typically adds 200 to 500 milliseconds of network latency before any processing even starts. Your audio has to travel to a data center, queue for processing, get transcribed, and travel back. On a good connection, that is barely noticeable. On a mediocre one, it is enough to make the experience feel sluggish.
Local STT on Apple's Neural Engine runs faster than real time: the Parakeet model transcribes speech faster than you can produce it. The bottleneck is not the transcription - it is how fast you can talk. When the STT engine stays loaded in memory (a "keep-warm" backend), there is no cold-start delay. The model is ready the instant you press the key.
The real benchmark is end-to-end latency: from the moment you release the key to the moment text appears at your cursor. With a keep-warm STT backend on the Neural Engine, that number is approximately 140 milliseconds. That is faster than a blink. Fast enough that the text appears to materialize the instant you stop talking.
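The arithmetic behind that comparison can be sketched as a simple latency budget. The figures below are illustrative assumptions drawn from the ranges above, not measurements of any particular service or machine:

```python
# Illustrative latency budget for cloud vs. local STT, in milliseconds.
# All figures are assumptions for illustration, not benchmarks.

CLOUD = {
    "network_upload": 150,        # audio travels to a data center
    "queue_and_transcribe": 200,  # server-side processing
    "network_download": 50,       # text travels back
}

LOCAL_KEEP_WARM = {
    "model_load": 0,              # keep-warm backend: already in memory
    "transcribe": 120,            # Neural Engine, faster than real time
    "paste_at_cursor": 20,        # insert text in the foreground app
}

def total_ms(budget):
    return sum(budget.values())

print("cloud:", total_ms(CLOUD), "ms")             # 400 ms
print("local:", total_ms(LOCAL_KEEP_WARM), "ms")   # 140 ms
```

The point is structural: the cloud budget starts with a network tax that local processing never pays, and a keep-warm backend zeroes out the model-load term entirely.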
How does ActionPiper handle push-to-talk?
ActionPiper is a free macOS menu bar app that registers two global hotkeys for voice input. It runs in the background and uses roughly 20MB of memory.
Right Option key = Push-to-Talk Dictation. Hold right Option, speak naturally, release. FluidAudio STT transcribes your speech on the Neural Engine, and the resulting text is pasted at your current cursor position. This works in any application - your IDE, browser, email client, terminal, chat app, anywhere that accepts text input. There is no app switching. The transcription happens in the background and the text appears inline.
Right Command key = Push-to-Command. Hold right Command, speak an instruction, release. The STT engine transcribes your speech, then passes the text to a local LLM with tool definitions for ActionPiper's 26 action domains. The LLM interprets your instruction, calls the appropriate tool, and ActionRouter dispatches the action. A macOS notification confirms what happened.
Both modes use the same underlying pipeline: microphone capture, FluidAudio STT on the Neural Engine, then either cursor paste (dictation) or LLM interpretation (command). FluidAudio is a keep-warm backend - it stays loaded in memory for the entire time ActionPiper is running, so there is never a cold-start delay.
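That shared pipeline with its two endpoints can be sketched in a few lines. Every function name here is hypothetical - this illustrates the flow, not ActionPiper's actual code:

```python
# Sketch of a push-to-talk pipeline that forks after transcription.
# All names are hypothetical stand-ins for illustration only.

def transcribe(audio: bytes) -> str:
    """Stand-in for the keep-warm STT backend."""
    return audio.decode("utf-8")  # pretend the audio is already text

def paste_at_cursor(text: str) -> str:
    """Stand-in for inserting text at the active cursor position."""
    return f"pasted: {text}"

def run_command(text: str) -> str:
    """Stand-in for LLM interpretation plus action dispatch."""
    return f"executed: {text}"

def handle_release(key: str, audio: bytes) -> str:
    text = transcribe(audio)            # same STT step for both modes
    if key == "right_option":           # dictation: words become text
        return paste_at_cursor(text)
    if key == "right_command":          # command: words become an action
        return run_command(text)
    raise ValueError(f"unhandled hotkey: {key}")

print(handle_release("right_option", b"hello world"))  # pasted: hello world
```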
What can you actually do with voice commands?
Push-to-Command routes through ActionPiper's action system, which covers 26 domains of macOS control. Here are some examples of what you can say:
Display and appearance. "Turn on dark mode." "Set brightness to seventy percent." "Enable Night Shift."
Audio. "Mute my Mac." "Set volume to fifty percent." "Unmute."
Windows. "Move this window to the left half of the screen." "Make this window fullscreen." "Close this window."
Apps and system. "Open Safari." "Open Finder." "Show the desktop." "Lock my screen."
Network. "Turn off Wi-Fi." "Turn on Bluetooth."
Media. "Pause." "Play." "Skip to the next track."
The LLM interprets natural language, so you do not need to memorize exact phrasing. "Make it dark" and "switch to dark mode" and "turn on dark mode" all resolve to the same action. The 26 domains cover display, audio, window management, apps, processes, Bluetooth, network, media controls, accessibility, Focus modes, Dock, Spaces, desktop, system settings, and more.
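To see why the many-phrasings-to-one-action mapping matters, here is a deliberately crude stand-in. ActionPiper uses a local LLM with tool definitions for this step; the keyword matcher below only illustrates the idea that "make it dark", "switch to dark mode", and "turn on dark mode" should all resolve to the same action:

```python
# Toy phrase-to-action resolver. A real system uses an LLM with tool
# definitions; this keyword matcher only demonstrates the concept.
# Action names are hypothetical.

ACTIONS = {
    "set_dark_mode": ["dark mode", "make it dark"],
    "mute_audio": ["mute"],
    "toggle_wifi": ["wi-fi", "wifi"],
}

def resolve(utterance):
    u = utterance.lower()
    for action, phrases in ACTIONS.items():
        if any(p in u for p in phrases):
            return action
    return None  # out-of-scope request: no action triggered

for phrase in ("Turn on dark mode", "Make it dark", "Switch to dark mode"):
    print(phrase, "->", resolve(phrase))  # all resolve to set_dark_mode
```

An LLM generalizes far beyond keyword lists, but the contract is the same: free-form speech in, one action name plus parameters out.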
How does the push-to-command pipeline work under the hood?
When you hold right Command and speak, the following sequence happens:
Step 1: Audio capture. ActionPiper captures microphone audio for the duration of the key hold.
Step 2: Speech-to-text. FluidAudio's Parakeet model transcribes the audio on the Neural Engine. Because the model is keep-warm, this begins immediately with no loading delay.
Step 3: LLM interpretation. The transcribed text is sent to a local LLM (running through ToolPiper's llama.cpp backend) along with tool definitions for all 26 action domains. The LLM determines which action to invoke and with what parameters.
Step 4: Action dispatch. ActionRouter executes the action on macOS. This might call AppleScript, Core Audio APIs, window management APIs, or system preferences depending on the domain.
Step 5: Confirmation. A macOS notification confirms the action. "Dark mode enabled." "Volume set to 50%." "Window moved to left half."
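The five steps can be sketched as a chain of plain functions. Every name and data shape below is an assumption for illustration - in particular, the hard-coded tool call stands in for what a real LLM would return:

```python
# The five pipeline steps as stubbed functions. Hypothetical names and
# data shapes throughout; this is not ActionPiper's actual API.

def capture_audio():
    return b"set volume to fifty percent"        # step 1 (stubbed mic)

def speech_to_text(audio):
    return audio.decode("utf-8")                 # step 2 (stubbed STT)

def interpret(text):
    # Step 3: a real system passes `text` plus tool definitions to a
    # local LLM; here we hard-code the tool call it might return.
    return {"tool": "set_volume", "params": {"percent": 50}}

def dispatch(call):
    # Step 4: route the tool call to the matching action handler.
    handlers = {"set_volume": lambda p: f"Volume set to {p['percent']}%"}
    return handlers[call["tool"]](call["params"])

def notify(message):
    return f"notification: {message}"            # step 5 (stubbed banner)

result = notify(dispatch(interpret(speech_to_text(capture_audio()))))
print(result)  # notification: Volume set to 50%
```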
The entire pipeline runs locally. Audio never leaves your Mac. The STT runs on the Neural Engine. The LLM runs on the Metal GPU. The action executes through native macOS APIs. No cloud services are involved at any stage.
Does push-to-talk dictation work in every app?
Yes. ActionPiper registers global hotkeys at the system level, so the right Option key triggers dictation regardless of which application is in the foreground. The transcribed text is inserted at whatever cursor position is active - a text field in your browser, a code editor, a terminal prompt, a Slack message box, a Notes document.
The only requirement is that the foreground application must accept text input at the cursor position. If you are looking at a non-editable view (a PDF viewer, a video player), the text has nowhere to go. But any standard text input field in any macOS application works.
How does this compare to existing voice input options?
The core trade-offs between the available approaches are worth understanding before choosing one.
Apple's built-in dictation is the most convenient option if you are already in its ecosystem and do not mind cloud processing. It works in every text field and requires no additional software. But it sends audio to Apple's servers by default, the on-device mode has lower quality, and it has no concept of voice commands beyond basic punctuation and formatting.
Siri handles voice commands but is limited to Apple's predefined set. You cannot extend it with custom actions, it requires cloud connectivity, and it cannot paste text at your cursor position. It is a separate interaction context, not an inline tool.
Whisper.cpp is the best option for batch transcription of audio files. The quality is excellent with larger models. But it has no real-time push-to-talk interface, no cursor integration, and requires compiling from source. It is a developer tool, not a productivity tool.
ActionPiper is the only option that combines local processing, push-to-talk activation, system-wide cursor insertion, and AI-powered voice commands in a single app. The trade-off is that it is macOS-only, currently supports English as the primary STT language, and the Right Option/Right Command key assignments cannot be remapped yet.
What are the honest limitations?
Push-to-talk is genuinely useful, but it is not magic. Here is what you should know before relying on it.
Language support. FluidAudio's Parakeet model has English as its strongest language. It supports 25 European languages with automatic detection, but accuracy and latency are best in English. If you primarily dictate in another language, test it before committing to the workflow.
Command scope. Push-to-Command can only do what ActionPiper's 26 action domains support. It cannot write emails for you, browse the web, or interact with specific applications beyond basic window management. It controls macOS system functions, not application-specific features.
Hotkey flexibility. The Right Option and Right Command keys are currently fixed assignments. If another application already uses those keys, there will be a conflict. Custom key mapping is planned but not yet available.
Memory footprint. ActionPiper itself uses roughly 20MB. However, the FluidAudio STT backend that stays warm in memory is managed by ToolPiper separately. With the STT model loaded, expect an additional several hundred MB of memory usage for the keep-warm backend. On a Mac with 16GB or more, this is negligible. On an 8GB machine, it is worth monitoring.
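As rough arithmetic - assuming a 500MB keep-warm backend, since "several hundred MB" is not an exact figure:

```python
# Back-of-envelope memory share. The 20MB app figure is from the text;
# the 500MB backend size is an assumed round number, not a measurement.

APP_MB = 20
STT_BACKEND_MB = 500  # assumption: keep-warm Parakeet backend

def share_of_ram(ram_gb):
    """Combined footprint as a percentage of total RAM."""
    total_mb = APP_MB + STT_BACKEND_MB
    return round(100 * total_mb / (ram_gb * 1024), 1)

print(share_of_ram(16), "% of 16GB")  # 3.2 % of 16GB
print(share_of_ram(8), "% of 8GB")    # 6.3 % of 8GB
```

Around 3% of a 16GB machine is easy to ignore; the share roughly doubles on 8GB, which is why the smaller configuration is worth watching.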
Background requirement. ActionPiper must be running in the menu bar for the hotkeys to work. If you quit the app or it crashes, voice input stops until you relaunch. The app is lightweight and designed to stay running, but it is another process in your menu bar.
How do you set it up?
Download ActionPiper from modelpiper.com. It installs as a menu bar app. Make sure ToolPiper is also running (it provides the STT and LLM backends). The push-to-talk hotkeys are active immediately after launch - no configuration needed.
For Push-to-Command, you will also need a local LLM model downloaded through ToolPiper. The starter model works, though a 3B or larger model produces more reliable command interpretation.
Hold right Option, say something, release. If text appears at your cursor, dictation is working. Hold right Command, say "mute my Mac," release. If the volume drops and a notification appears, command mode is working.
Everything runs on your Mac. No accounts, no API keys, no audio uploaded anywhere.
This is part of a series on local-first AI workflows on macOS.