You are in a meeting. Someone says something important. You scramble to type it before the moment passes. By the time you finish writing, the conversation has moved on and you have missed the next two points.

Existing transcription tools either work after the fact (upload a recording, wait for processing) or stream your audio to a cloud server in real time. Otter.ai, Google Meet captions, Microsoft Teams transcription: they all send your meeting audio to someone else's infrastructure in a continuous stream. Every word of your confidential discussion, salary negotiation, legal review, or strategic planning session flows through someone else's servers.

What if you could see words appearing on screen as people spoke, processed entirely on your Mac?

What is the difference between batch and streaming transcription?

Most local transcription tools process a complete audio file after recording. You record the meeting. You wait for the file to process. Then you read the transcript. The information is useful, but it is available after the conversation is over. You cannot act on it in the moment.

Streaming transcription is fundamentally different. It processes audio as it arrives. You see words forming in real time, while the conversation is still happening. You can take action during the meeting, not after it. You can flag a decision as it is made, capture an action item the moment it is assigned, or notice when a point needs clarification while there is still time to ask.

This is not a small distinction. The gap between "I will read the transcript later" and "I can see what was just said" changes how you participate in a conversation. You stop being the person furiously scribbling notes and start being the person who is actually present, with a live text record accumulating in the background.

Why is streaming speech-to-text technically harder?

Streaming STT needs to process audio chunks (typically 1-3 seconds) fast enough to keep up with speech. The model must balance accuracy with latency, and the tradeoff is real.

A longer audio buffer means more context for the model, which improves accuracy. The model can consider surrounding words, resolve ambiguity, and handle pauses more gracefully. But more context means higher delay. You might not see the words until several seconds after they were spoken.

A shorter buffer means faster output. Words appear almost immediately. But the model has less context to work with, which can mean more errors, especially with technical vocabulary, proper nouns, or accented speech.
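The buffer-versus-latency tradeoff above is mostly arithmetic. A small sketch, with illustrative numbers that are assumptions rather than ToolPiper internals:

```python
# Sketch of the buffer-size / latency tradeoff. The sample rate, sample
# width, and inference times below are illustrative assumptions.

SAMPLE_RATE = 16_000     # Hz; a common input rate for STT models
BYTES_PER_SAMPLE = 2     # 16-bit PCM

def chunk_latency(chunk_seconds: float, inference_seconds: float) -> float:
    """Minimum delay before a word can appear on screen: the model
    cannot start until the chunk is full, then must run inference."""
    return chunk_seconds + inference_seconds

def chunk_size_bytes(chunk_seconds: float) -> int:
    """Bytes of mono 16-bit PCM in one chunk."""
    return int(SAMPLE_RATE * chunk_seconds) * BYTES_PER_SAMPLE

# A 1 s chunk with 0.5 s inference shows words about 1.5 s late;
# a 3 s chunk with the same inference time shows them about 3.5 s late,
# but gives the model three times the audio context.
print(chunk_latency(1.0, 0.5))   # 1.5
print(chunk_size_bytes(1.0))     # 32000
```

The point of the sketch: shrinking the chunk shrinks the floor on latency, but every second you shave off is a second of context the model no longer sees.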

Cloud streaming services like Otter.ai and Google Meet solve this with powerful server-side hardware that can run larger models at lower latency. The tradeoff they make is different: they have the compute power, but your audio is flowing through their infrastructure the entire time. Every second of your meeting is transmitted, processed, and stored on remote servers.

How does live transcription work in ToolPiper?

ToolPiper enables real-time streaming transcription by connecting two local components: AudioPiper for audio capture and FluidAudio for speech-to-text inference.

AudioPiper captures audio from any source on your Mac. Not just the microphone. It can capture system audio (everything your Mac is playing), audio from specific apps (via Core Audio Taps on macOS 14+), or a mix of multiple sources simultaneously. It streams mixed PCM audio over a WebSocket connection in real time.

FluidAudio STT runs the Parakeet model on Apple's Neural Engine. It processes audio chunks as they arrive, producing text output within 1-2 seconds of the audio being captured. The accuracy is Whisper-class for clear speech in supported languages.

The pipeline works like this: AudioPiper captures audio from your chosen source, streams chunks to FluidAudio, which transcribes each chunk on the Neural Engine, and the text appears in ModelPiper as it is generated. The entire loop happens on your hardware. No audio data touches the network at any point.
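The loop above can be sketched in a few lines. This is a stand-in, not ToolPiper's actual code: `capture_chunks` plays the role of AudioPiper's PCM stream and `transcribe` stands in for the FluidAudio STT call, and both names are hypothetical.

```python
# Minimal sketch of the capture -> transcribe -> display loop.
# `capture_chunks` and `transcribe` are stand-ins for AudioPiper's
# WebSocket stream and FluidAudio's on-device STT; names are hypothetical.
from typing import Iterator

def capture_chunks(pcm: bytes, chunk_bytes: int) -> Iterator[bytes]:
    """Stand-in for the live PCM stream: yield fixed-size chunks as
    they would arrive from the capture source."""
    for i in range(0, len(pcm), chunk_bytes):
        yield pcm[i:i + chunk_bytes]

def transcribe(chunk: bytes) -> str:
    """Stand-in for on-device STT; returns placeholder text."""
    return f"[{len(chunk)} bytes of audio]"

def run_pipeline(pcm: bytes, chunk_bytes: int) -> list[str]:
    """Process audio as it arrives, emitting text per chunk. The whole
    loop runs locally; nothing crosses the network."""
    transcript = []
    for chunk in capture_chunks(pcm, chunk_bytes):
        transcript.append(transcribe(chunk))
    return transcript
```

The shape is what matters: text is produced per chunk while audio is still arriving, rather than once at the end of a complete file.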

What audio sources can you capture?

This is where AudioPiper's flexibility makes a real difference. Most transcription tools only capture from the microphone. That means background music, notification sounds, and audio from other apps all get mixed into your transcript. AudioPiper can capture from individual apps, giving you clean, isolated audio.

In a Zoom meeting, you can capture only the Zoom audio, not the Spotify playing in the background. In a lecture, you can capture from the browser tab playing the video. In a podcast recording, you can capture from the podcast app specifically. Each source is tapped independently.

Per-app audio capture uses Core Audio Taps, a macOS 14+ API that lets you tap into any app's audio output without virtual audio drivers. No kernel extensions, no third-party audio routing software like Loopback or BlackHole, no system-level hacks that break with every macOS update.

You can also mix sources. Capture microphone input (your voice) alongside Zoom audio (the remote participants) for a complete transcript of both sides of the conversation. Or combine system audio with microphone input to capture everything happening on your Mac.
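Mixing two PCM streams is conceptually simple: sum the samples and clip to the sample range. A sketch for mono 16-bit PCM, assuming a little-endian host; AudioPiper does its own mixing internally, so this is purely illustrative:

```python
# Sketch of mixing two mono 16-bit PCM streams (e.g. microphone + app
# audio) into one, clipping each summed sample to the 16-bit range.
import array

def mix_pcm16(a: bytes, b: bytes) -> bytes:
    """Sum two equal-length 16-bit PCM buffers (native byte order),
    clipping each sample to [-32768, 32767] to avoid wraparound."""
    sa = array.array("h"); sa.frombytes(a)
    sb = array.array("h"); sb.frombytes(b)
    mixed = array.array("h", (
        max(-32768, min(32767, x + y)) for x, y in zip(sa, sb)
    ))
    return mixed.tobytes()
```

The clipping step is why naive mixing can distort loud sources: two near-full-scale signals summed together exceed the 16-bit range and must be clamped (or attenuated before summing).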

How do you set up live transcription in ModelPiper?

In the pipeline builder, wire three blocks together: Audio Capture block, AI Provider block (configured with the STT model), and Response block.

The Audio Capture block lets you select your audio source: microphone, system audio, or a specific app. The AI Provider block runs FluidAudio's Parakeet model on the Neural Engine. The Response block shows the transcribed text as it streams in, updating live as new chunks are processed.

You can extend this pipeline further. Add a second AI Provider block with an LLM after the STT block, and you get live summarization or translation on top of the transcription. Audio Capture, then STT, then LLM (summarize, translate, or extract action items), then Response. Four blocks, wired together visually, no code required.
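The four-block chain can be written down as data. The block names below mirror the ModelPiper UI described above, but the structure itself is hypothetical, not ModelPiper's actual file format:

```python
# Sketch of the extended four-block pipeline as a list. Block names
# follow the text above; the dict structure is a hypothetical
# illustration, not ModelPiper's real configuration format.
pipeline = [
    {"block": "Audio Capture", "source": "system"},
    {"block": "AI Provider", "role": "stt"},        # speech-to-text
    {"block": "AI Provider", "role": "llm",         # summarize, translate,
     "task": "extract action items"},               # or extract items
    {"block": "Response"},
]

# Output flows block to block in order: each block feeds the next.
for upstream, downstream in zip(pipeline, pipeline[1:]):
    print(f'{upstream["block"]} -> {downstream["block"]}')
```

Read top to bottom, it is the same wiring you would do visually: capture, transcribe, transform, display.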

The extended pipeline is powerful for specific use cases. Imagine a meeting where the transcript streams in real time and an LLM simultaneously extracts action items as they are mentioned. Or a foreign-language webinar where the audio is transcribed and translated into English as the speaker talks.

Why does the Neural Engine matter for live transcription?

FluidAudio's Parakeet model runs on the Neural Engine, not the GPU. This is a practical advantage that matters more than it might sound.

The GPU on your Mac is shared by everything: the display, any running apps, and any other AI models you might be running. If you run a chat model on the GPU (via llama.cpp) while also doing live transcription, the two compete for the same hardware. One slows down the other.

The Neural Engine is dedicated silicon for ML inference. It runs independently from the GPU. This means you can run live transcription and a chat model simultaneously without either one slowing down. Transcribe a meeting in the background while using a language model in another window. Both run at full speed because they are on different hardware. This is a unique advantage of Apple Silicon's architecture: specialized processors for different workloads, not one GPU trying to do everything.

What about push-to-talk as lightweight streaming?

If you do not need continuous transcription, ActionPiper offers a simpler form. Hold the Right Option key to speak, release to see text. This is push-to-talk dictation: it captures audio only while you hold the key, transcribes it through the same FluidAudio STT engine, and pastes the result wherever your cursor is.

Think of it as the lightweight version. You do not need to open ModelPiper or build a pipeline. Just hold a key, speak, release, and the text appears. For quick dictation, capturing a thought, or hands-free text input while your hands are busy, it is faster than setting up a full streaming pipeline.
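The hold-to-record behavior is a tiny state machine: audio accumulates only while the key is held, and transcription fires once on release. A sketch, where `transcribe` again stands in for the FluidAudio STT call:

```python
# Sketch of push-to-talk as a state machine. `transcribe` is a
# hypothetical stand-in for the on-device STT call.

class PushToTalk:
    def __init__(self, transcribe):
        self._transcribe = transcribe
        self._buffer = bytearray()
        self._held = False

    def key_down(self) -> None:
        self._held = True
        self._buffer.clear()      # fresh recording on each press

    def audio_frame(self, pcm: bytes) -> None:
        if self._held:            # frames are ignored unless the key is held
            self._buffer += pcm

    def key_up(self) -> str:
        self._held = False        # stop capturing, transcribe once
        return self._transcribe(bytes(self._buffer))
```

Because nothing is captured between presses, the microphone is effectively off except for the moments you explicitly hold the key.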

For continuous meeting transcription where you want an ongoing text record, the full pipeline is what you want. For quick voice-to-text moments throughout your day, push-to-talk is the answer.

What does the experience actually feel like?

Words appear within 1-2 seconds of being spoken. There is a brief delay as the audio chunk is captured and processed, but it is short enough that you can follow along in near real time. The text updates feel natural, similar to watching someone type quickly.

Accuracy for clear speech is comparable to cloud services. The Parakeet model handles accents, filler words, and natural speech patterns well. Technical vocabulary and proper nouns are usually correct in context. It is Whisper-class in quality, which means it is significantly better than the speech-to-text engines of five years ago.

Background noise and multiple speakers are where any STT system (local or cloud) starts to degrade. A quiet room with one clear speaker produces excellent results. A noisy coffee shop or a meeting with people talking over each other will have more errors.

What are the honest limitations?

Streaming accuracy is slightly lower than batch processing because the model has less context per chunk. If you process the same audio as a complete file after the fact, the batch transcript will typically be a few percentage points more accurate. This is a fundamental tradeoff of streaming: speed versus context.

English is the primary language. Parakeet v3 supports 25 European languages with automatic detection, but English accuracy is the highest. Other languages work well for clear speech but may have lower accuracy with technical vocabulary or domain-specific terms.

Multiple simultaneous speakers (crosstalk) reduce accuracy significantly. This is a universal limitation of current STT technology, not specific to local models. Cloud services like Otter.ai have invested heavily in speaker separation, but even they struggle with heavy crosstalk.

There is no speaker diarization yet. The transcript is a single stream of text without identifying who said what. For meetings with multiple speakers, you get accurate words but no speaker labels.

Per-app audio capture requires macOS 14 or later. Older macOS versions can still use microphone capture, but the per-app Core Audio Taps feature is not available.

Try It

Download ModelPiper and install ToolPiper. Wire up an Audio Capture block, an AI Provider block with the STT model, and a Response block. Select your audio source and start speaking.

Words appear as you speak. The audio never leaves your Mac.

This is part of a series on local-first AI workflows on macOS. Related: Batch Voice Transcription for processing complete recordings, and Transcribe & Summarize for getting structured key points from audio.