Every feature you ship deserves a demo video. But recording, editing, narrating, and rendering a video takes hours. Often longer than building the feature itself. So most features never get a video. The changelog says "added dark mode support" and users either find it or they don't.
Product Hunt launches need video. App Store previews need video. Documentation needs video. Tweet threads perform better with video. But the production cost is a bottleneck that solo developers and small teams can't absorb on every release cycle.
What if you could describe what the video should show, and AI handled the recording, narration, and rendering?
Why do demo videos matter this much?
Video is consistently among the highest-converting content formats. A 30-second screencast of a feature in action communicates more than 1,000 words of changelog. Users watch a demo and immediately understand what changed. They see the interaction, the timing, the visual result. No amount of bullet points replicates that.
For developer tools specifically, video solves a discovery problem. You can document a feature exhaustively, but if nobody reads the docs, the feature might as well not exist. A short video in a release post, embedded in docs, or shared on social media reaches users who would never open a changelog.
The problem isn't that teams don't know this. It's that producing even a simple screencast takes real time. You record the screen, trim dead air, add zoom effects on the important parts, record a voiceover (probably several takes), sync audio to video, add text overlays, render, and export. Each step is manual. Each iteration means redoing multiple steps. And the result is often "good enough" rather than good, because you ran out of time and shipped it anyway.
What does the traditional video workflow look like?
For a typical demo video, the manual process looks something like this:
- Open OBS or ScreenFlow, set up the capture region
- Rehearse the demo mentally, then record the screen while clicking through the feature
- Stop recording, realize you made a mistake at the 40-second mark, start over
- Trim the footage, cut dead air, speed up boring transitions
- Record narration separately (find a quiet room, do three takes)
- Sync narration audio to the screen recording
- Add text overlays and callouts for key moments
- Render to MP4, wait, upload
When the feature changes next week, you start from step one.
This workflow is fundamentally iterative in the wrong places. You iterate on execution ("I clicked the wrong button, re-record") instead of iterating on content ("this scene should show the settings panel instead"). The mechanical parts consume more time than the creative decisions.
What does "programmatic video" mean?
Instead of manually performing each step, you define a screenplay. A screenplay is a sequence of scenes. Each scene has three elements: browser actions to perform (click this, type that, navigate here), narration text to speak during the scene, and timing constraints for pacing.
The system executes the actions in the browser, records the screen while those actions happen, generates narration from the text using a local TTS engine, composites the video and audio layers together, and renders the final output.
Change one line of the screenplay and re-render. Add a scene, remove a scene, rewrite the narration. The mechanical execution is automated. You iterate on the script, not on your clicking accuracy.
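To make this concrete, a screenplay can be pictured as plain data. The field names below are illustrative, not Video Creator's actual schema, but they capture the three elements each scene carries:

```python
# Illustrative screenplay shape: actions, narration, timing per scene.
# Field names are hypothetical, not Video Creator's actual schema.
screenplay = {
    "title": "Dark mode demo",
    "scenes": [
        {
            "actions": [
                {"click": "role:button:Settings"},
                {"click": "role:switch:Dark Mode"},
            ],
            "narration": "Toggle dark mode from the settings panel.",
            "min_duration_s": 4,  # pacing: hold the scene at least this long
        },
        {
            "actions": [{"navigate": "https://example.com/dashboard"}],
            "narration": "The whole dashboard picks up the new theme.",
            "min_duration_s": 5,
        },
    ],
}

# Iterating on the video is iterating on this data:
screenplay["scenes"][0]["narration"] = "Flip the dark mode switch in settings."
```

Rewriting a narration line or reordering the scenes list is the whole edit; the pipeline handles execution identically every time.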
This is the difference between writing a speech and recording yourself reading it ten times until you get through without stumbling. With programmatic video, you write the script once and the delivery is consistent every time.
How does Video Creator work?
Video Creator is built into ToolPiper and automates the full demo video pipeline: screenplay, rehearsal, recording, narration, composition, and rendering. Each stage has a dedicated tool, and you can run the pipeline end-to-end or execute individual stages.
Screenplay. You define scenes using the video_save tool. Each scene specifies browser actions using AX (accessibility tree) selectors, the same format used by PiperTest. If you have already written tests for a feature, you already know the selector format: role:button:Sign In, label:Email, text:Welcome. Each scene also includes narration text and optional timing. Screenplays are stored as PiperTest sessions with purpose: 'video', so they share the same infrastructure as test sessions.
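The selector shorthand splits into a kind prefix and a value, with role selectors carrying an extra name segment. A minimal sketch of how such strings decompose (this mirrors the format's shape, not PiperTest's actual parser):

```python
# Illustrative parser for the AX selector shorthand shown above.
# This sketches the format's structure; it is not PiperTest's real code.
def parse_selector(selector: str) -> dict:
    kind, _, rest = selector.partition(":")
    if kind == "role":
        # role selectors carry both a role and an accessible name
        role, _, name = rest.partition(":")
        return {"kind": "role", "role": role, "name": name}
    return {"kind": kind, "value": rest}

parse_selector("role:button:Sign In")  # {'kind': 'role', 'role': 'button', 'name': 'Sign In'}
parse_selector("label:Email")          # {'kind': 'label', 'value': 'Email'}
parse_selector("text:Welcome")         # {'kind': 'text', 'value': 'Welcome'}
```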
Rehearse. Before you commit to recording, video_rehearse runs a dry run. It executes every browser action in the screenplay without capturing video. If a selector is broken, a page doesn't load, or an action fails, you find out during rehearsal instead of after a five-minute recording session. Think of it as a test run before the shoot.
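Conceptually, a rehearsal pass is a loop that executes every action with recording disabled and reports failures up front. A sketch, with a stub standing in for the real browser driver:

```python
# Sketch of a rehearsal pass: run every scene's actions without
# capturing video, and collect failures before any recording happens.
def rehearse(screenplay, execute_action):
    failures = []
    for i, scene in enumerate(screenplay["scenes"]):
        for action in scene["actions"]:
            try:
                execute_action(action)
            except Exception as err:
                failures.append((i, action, str(err)))
    return failures  # empty list means the shoot is safe to run

# Stub driver for illustration: only one selector "exists" on the page.
def stub_driver(action):
    if action.get("click") != "role:button:Sign In":
        raise RuntimeError(f"selector not found: {action}")

play = {"scenes": [
    {"actions": [{"click": "role:button:Sign In"}]},
    {"actions": [{"click": "role:button:Renamed"}]},
]}
failures = rehearse(play, stub_driver)
# failures pinpoints scene 1's broken selector before the shoot starts
```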
Record. video_record executes the screenplay while VisionPiper captures the screen. Each scene is recorded as a separate take. VisionPiper is a separate free macOS menu bar app that handles screen capture via ScreenCaptureKit, recording to H.264 MP4 with AVAssetWriter. The separation means screen recording happens at the system level with native performance, not through a browser API with frame rate limitations.
Narrate. video_narrate generates voice audio for each scene's narration text using local TTS engines. You have three options: Soprano (fast, good quality, runs on the Neural Engine), Orpheus (slower, higher quality, runs on Metal GPU), or Qwen3 TTS (also Metal GPU, supports voice cloning). No cloud service ever hears your narration text. The audio is generated on your Mac's hardware.
Compose. video_edit_composition layers everything together using a Metal compositor. Video, narration audio, background audio, and text overlays are combined into a unified timeline. Audio ducking automatically lowers background audio during narration segments so the voice is always clear. This is not a full non-linear editor, but it handles the common composition tasks that demo videos need.
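The ducking idea can be sketched as a gain function over the timeline: full background volume normally, reduced gain inside narration spans. The intervals and gain values here are illustrative (a real compositor would also crossfade the gain change rather than step it):

```python
# Sketch of sidechain-style ducking: lower the background track's
# gain whenever narration is playing. Values are illustrative.
def background_gain(t, narration_spans, normal=1.0, ducked=0.25):
    """Gain for the background audio at time t (seconds)."""
    for start, end in narration_spans:
        if start <= t < end:
            return ducked
    return normal

spans = [(2.0, 6.5), (9.0, 12.0)]  # when the voice is speaking
background_gain(1.0, spans)   # full background volume
background_gain(3.0, spans)   # ducked under narration
```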
Render. video_render produces the final H.264 MP4 with auto-generated SRT subtitles from the narration text. The output is ready to upload to YouTube, embed in docs, or attach to a release.
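Generating SRT from narration is mechanical once each scene's text and duration are known: accumulate a running clock and emit numbered cues. A minimal sketch of that step (not Video Creator's actual implementation):

```python
# Sketch of SRT cue generation from per-scene narration and durations.
def to_timestamp(seconds):
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def to_srt(scenes):
    """scenes: list of (narration_text, duration_seconds) pairs."""
    cues, t = [], 0.0
    for i, (text, duration) in enumerate(scenes, start=1):
        cues.append(f"{i}\n{to_timestamp(t)} --> {to_timestamp(t + duration)}\n{text}\n")
        t += duration
    return "\n".join(cues)

srt = to_srt([("Toggle dark mode from settings.", 4.0),
              ("The dashboard picks up the theme.", 5.5)])
# cue 2 spans 00:00:04,000 --> 00:00:09,500
```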
What tools are available for the full workflow?
Video Creator exposes 12 MCP tools that cover every stage of production:
- video_list and video_get for discovering and reading screenplays
- video_save and video_delete for creating and removing screenplays
- video_rehearse for dry-running a screenplay without recording
- video_record for executing the screenplay with screen capture
- video_narrate for generating TTS audio per scene
- video_render for final MP4 output
- video_preview for reviewing a rendered video
- video_edit_screenplay, video_edit_composition, and video_edit_narration for modifying individual stages
These tools are available via MCP, which means any MCP-aware AI client can orchestrate the full pipeline. Claude Code, Cursor, Windsurf, or your own integration can generate screenplays, rehearse them, record, narrate, and render without you touching a video editor.
Can AI write the screenplay for me?
Yes. In a chat conversation, describe what you want to show: "Record a demo of the login flow, then navigate to the dashboard and open the settings panel." The AI generates a screenplay with the right browser actions and narration text. It uses browser_snapshot to see the current page structure, identifies the correct AX selectors for each element, and builds the scene sequence.
You review the generated screenplay, adjust narration wording, reorder scenes if needed, then rehearse and record. The AI handles the tedious selector lookups and action sequencing. You focus on what the video should communicate.
Because screenplays use the same AX selectors as PiperTest, they benefit from self-healing. If a button moves or gets renamed between video iterations, the selector engine finds it anyway using fuzzy matching against the current accessibility tree. Your screenplay doesn't break because a designer changed a label from "Submit" to "Save Changes."
How do you add narration?
Each scene in a screenplay includes a narration text field. During the narrate stage, that text is sent to a local TTS engine on your Mac. You choose which voice model to use:
Soprano runs on Apple's Neural Engine and is the fastest option. Good for quick iterations where you want to hear the pacing before committing to a final voice.
Orpheus runs on Metal GPU and produces higher-quality output with more natural prosody. Better for final renders where voice quality matters.
Qwen3 TTS also runs on Metal GPU and supports voice cloning. If you want the narration to sound like a specific person, provide a reference audio clip and the model matches the voice characteristics.
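One way to think about choosing between the three, encoded as a tiny helper. The engine names match the article; the selection logic is our own sketch of the trade-offs, not a ToolPiper API:

```python
# Illustrative helper encoding the engine trade-offs described above.
# Names match the article; the decision logic is a sketch, not an API.
def pick_engine(final_render=False, clone_voice=False):
    if clone_voice:
        return "Qwen3 TTS"  # needs a reference audio clip
    if final_render:
        return "Orpheus"    # slower, more natural prosody
    return "Soprano"        # fastest; good for checking pacing in drafts

pick_engine()                    # 'Soprano'
pick_engine(final_render=True)   # 'Orpheus'
pick_engine(clone_voice=True)    # 'Qwen3 TTS'
```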
All three engines run locally. Your narration text is never uploaded to a cloud service. You can narrate confidential product demos, internal training videos, or client-specific walkthroughs without any data leaving your machine.
What if your UI changes after recording?
This is where programmatic video has a structural advantage over manual recording. When you record a demo manually and the UI changes, you re-record from scratch. There is no other option.
With Video Creator, the screenplay is the source of truth. When your UI changes, you update the affected selectors in the screenplay (or let self-healing handle it if the change is minor), then re-run the pipeline: rehearse, record, narrate, render. The mechanical work is automated. A UI update that would cost you an hour of re-recording costs a few minutes of re-rendering.
For teams that ship weekly, this is the difference between having demo videos and not having them.
Can you export to different formats?
The render stage outputs H.264 MP4, which is the most widely compatible format. SRT subtitles are generated automatically from the narration text. For other formats, aspect ratios, or resolutions, you would need to post-process the MP4 with a video tool like FFmpeg.
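For example, downscaling the rendered MP4 for an embed could look like this. The file names are examples, and running the command requires ffmpeg on your PATH:

```python
# Sketch: build an FFmpeg command to post-process the rendered MP4.
# File names are examples; running it requires ffmpeg on your PATH.
import subprocess

def scale_cmd(src, dst, height=720):
    # scale=-2:<height> preserves aspect ratio and keeps width even
    return ["ffmpeg", "-i", src,
            "-vf", f"scale=-2:{height}",
            "-c:a", "copy",  # leave the narration audio untouched
            dst]

cmd = scale_cmd("demo.mp4", "demo_720p.mp4")
# subprocess.run(cmd, check=True)  # uncomment to actually transcode
```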
Video Creator is optimized for producing the source recording, not for format conversion. The output is a clean MP4 that any downstream tool can work with.
Does this work for non-browser demos?
The screenplay system is built around browser automation via CDP, so the automated action execution is Chrome-specific. However, VisionPiper's screen recording captures any region of your screen. You could manually drive a native app while VisionPiper records, then use Video Creator for the narration and composition stages.
The full automated pipeline (screenplay actions, rehearsal, and self-healing) requires Chrome. For native macOS apps, Electron apps, or mobile simulators, you would record manually and use Video Creator for post-production.
What are the honest limitations?
Video Creator solves the specific problem of programmatic demo video production. It is not a general-purpose video editor. Here is what you should know.
Chrome only for automated recording. The screenplay-driven recording pipeline uses Chrome DevTools Protocol to execute browser actions. Firefox, Safari, and native apps are not supported for automated action execution.
VisionPiper required for screen recording. VisionPiper is a separate free macOS app that handles the actual screen capture. It is available on the Mac App Store but must be installed separately from ToolPiper.
TTS quality varies by model. Soprano is fast but less expressive. Orpheus sounds more natural but takes longer to generate. Qwen3 TTS supports voice cloning but requires a reference audio clip. None of them sound identical to a professional voice actor, though they are significantly better than robotic system voices.
No advanced editing. Transitions, zoom effects, picture-in-picture, animated callouts, and other NLE features are not available. The Metal compositor handles layering video, audio, narration, and text overlays. For polish beyond that, export the MP4 and use a dedicated video editor.
Resolution locked to your screen. VisionPiper captures your actual screen at its native resolution. There is no virtual canvas or resolution scaling during recording. If you need 4K output, you need a 4K display.
Try It
Download ToolPiper and VisionPiper from the Mac App Store. Open Chrome, navigate to the feature you want to demo, and describe the video to your AI assistant. The screenplay gets generated, rehearsed, recorded, narrated, and rendered without you opening a video editor.
Your demo ships when your feature ships. Not a sprint later.
This is part of a series on local-first AI workflows on macOS. For visual test recording with the same AX selectors, see Visual Testing on Mac. For browser automation details, see Browser Automation on Mac. For local TTS engines, see Local Text to Speech on Mac.