Capture any region, record video, export GIFs, and stream live to vision models. A steerable camera for your screen.
VisionPiper is a standalone menu bar app for macOS that captures screen regions with precision borders, records H.264 video, trims and exports as GIF/WebP/MP4, and streams frames over WebSocket to ToolPiper's vision models at 30fps. Think of it as a screen-mounted camera you can point at anything.
Four edges + four corners, pass-through interior. Click through the capture region to interact with everything inside it normally.
Drag the border to follow content while recording is active. A steerable camera that tracks multi-step workflows across panels.
Last capture region is remembered between sessions. Launch VisionPiper and your previous region is already set.
Quick-capture, toggle recording, and adjust the region from the keyboard. Fast enough for rapid bug documentation.
Works across all connected displays. Place the capture region on any monitor regardless of resolution or scaling.
Retina-aware capture at native resolution. Every pixel captured at the display's actual density.
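As a rough illustration of what capturing at native density involves, the sketch below converts a region given in display points into whole device pixels using a display scale factor. The function name, the tuple layout, and the outward-rounding choice are assumptions made for this example, not VisionPiper's actual code.

```python
import math

def region_in_pixels(region_pts, scale):
    """Convert (x, y, width, height) in points to whole device pixels.

    Assumption: edges are rounded outward so the pixel rectangle always
    covers the full point rectangle, even at fractional coordinates.
    """
    x, y, w, h = region_pts
    left = math.floor(x * scale)
    top = math.floor(y * scale)
    right = math.ceil((x + w) * scale)
    bottom = math.ceil((y + h) * scale)
    return (left, top, right - left, bottom - top)

# The same region on a 2x (Retina) display and on a 1x display.
retina = region_in_pixels((10.5, 20, 100, 50), 2.0)
standard = region_in_pixels((10.5, 20, 100, 50), 1.0)
```

On the 2x display the region maps to twice as many pixels per side, which is why a per-display scale factor matters when the region moves between monitors.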
SCStream + AVAssetWriter for efficient, hardware-accelerated recording. Captures the selected region as .mp4 with minimal CPU overhead.
Built-in cgif + libimagequant for high-quality, small-file GIFs. Allocates color information to changing pixels, not static backgrounds.
Modern format for web-ready screen captures. Full color and alpha channel support, smaller files than GIF.
Cut the start and end of recordings before export, with no external editor needed. Preview, trim, and export in one step.
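To make the trim step concrete, here is a minimal sketch of how start and end cuts given in seconds could map onto frame indices. The function and parameter names are hypothetical, and the real editor may trim on timestamps or keyframes instead.

```python
import math

def trimmed_frame_range(total_frames, fps, trim_start_s, trim_end_s):
    """Return the range of frame indices kept after trimming.

    trim_start_s and trim_end_s are seconds removed from the start and
    end of the recording (hypothetical parameters for illustration).
    """
    first = math.ceil(trim_start_s * fps)
    last_exclusive = total_frames - math.ceil(trim_end_s * fps)
    if not 0 <= first < last_exclusive <= total_frames:
        raise ValueError("trim leaves no frames or is out of bounds")
    return range(first, last_exclusive)

# A 10-second recording at 30 fps, cutting 1 s from the start and 2 s from the end.
kept = trimmed_frame_range(300, 30, 1.0, 2.0)
```

The kept range can then be handed to whichever exporter is selected, so the same trim applies to GIF, WebP, and MP4 output.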
Metal JPEG frames streamed to ToolPiper on port 10000. Hardware-accelerated encoding keeps CPU usage low during continuous streaming.
Ask AI questions about what's on your screen in real time. The vision model sees exactly what you see, updated every frame.
Vision model describes the screen content, TTS reads it aloud. An accessibility workflow that works with any on-screen content.
Extract text from any screen region using Apple Vision OCR. No cloud API, no rate limits. Runs entirely on-device.
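Streaming at a fixed 30fps means some captured frames have to be skipped when the display refreshes faster than that. The sketch below shows one possible pacing rule using integer microsecond timestamps. The `FramePacer` class is an assumption for illustration; how VisionPiper actually schedules frames is not described here.

```python
class FramePacer:
    """Decide which captured frames to forward at a target frame rate."""

    def __init__(self, fps=30):
        # Integer microseconds avoid floating-point drift in the deadlines.
        self.interval_us = round(1_000_000 / fps)
        self.next_due_us = 0

    def should_send(self, now_us):
        """Return True if a frame captured at now_us should be sent."""
        if now_us < self.next_due_us:
            return False  # too soon after the last sent frame: drop it
        # Advance from the previous deadline, but never schedule in the past.
        self.next_due_us = max(self.next_due_us + self.interval_us, now_us)
        return True

# Simulate one second of capture at 100 frames per second.
pacer = FramePacer(fps=30)
sent = [t for t in range(0, 1_000_000, 10_000) if pacer.should_send(t)]
```

Advancing from the previous deadline rather than from the current time keeps the average rate at the target even when individual frames arrive slightly late.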
| | VisionPiper | macOS Screenshot | CleanShot X | Kap |
|---|---|---|---|---|
| Region selection | Click-through 8-window border, persisted | Crosshair selection, not saved | Crosshair + pinned overlay | Full screen or window only |
| Record video | H.264 .mp4, hardware-accelerated | QuickTime (separate app) | H.264, HEVC, GIF | H.264, WebM, GIF |
| GIF export | Built-in, optimized cgif + libimagequant | No | Yes, with compression | Yes, basic |
| Trim editor | Built-in, cut start/end before export | No (requires iMovie/QuickTime) | Yes | Yes |
| AI streaming | 30fps WebSocket to local vision models | No | No | No |
| Movable region | Drag border during recording | No | No | No |
| Price | Free | Free (built-in) | $29 one-time | Free (open source) |
| AI integration | ToolPiper vision models, OCR, narration | None | None | None |
Free from the Mac App Store. Grant screen recording permission when prompted.
Click and drag to select any area of your screen. Borders appear with pass-through interior so you can interact normally.
Screenshot, record video, or stream live to ToolPiper's vision models for AI analysis.
Screen content stays on your Mac. Live streaming goes to localhost, not the cloud.
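One way to check for yourself that a streaming endpoint stays on the machine is to confirm its host is a loopback address. The sketch below does that with the Python standard library. The example URLs are hypothetical; only the port number 10000 and the use of WebSocket come from the description above.

```python
import ipaddress
from urllib.parse import urlsplit

def is_loopback_endpoint(url):
    """Return True if the URL's host refers to the local machine."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Any other hostname is not treated as local in this sketch.
        return False

# Hypothetical endpoint URLs for illustration.
local_v4 = is_loopback_endpoint("ws://127.0.0.1:10000")
local_v6 = is_loopback_endpoint("ws://[::1]:10000")
remote = is_loopback_endpoint("ws://example.com:10000")
```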
No subscription, no watermarks, no feature gates. Free on the Mac App Store.
Yes. Screen capture, video recording, trim editing, and GIF/WebP/MP4 export all work standalone. ToolPiper is only needed for AI streaming — sending frames to vision models for Screen Q&A, image narration, and OCR.
macOS 26 or later on Apple Silicon (M1 or newer). VisionPiper's capture pipeline is built on ScreenCaptureKit and Metal and targets Apple Silicon.
Yes. The 8-window architecture uses four edge windows and four corner windows with a pass-through interior. Everything inside the border is fully interactive — clicks, drags, and scrolls pass through to the underlying app.
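The eight-window layout can be pictured as pure geometry: four edge rectangles and four corner rectangles placed just outside the capture region, leaving the interior uncovered. The sketch below computes those rectangles. It assumes a top-left origin with y increasing downward and a border drawn entirely outside the region; the names `Rect`, `border_rects`, and `overlaps` are made up for this example.

```python
from typing import NamedTuple

class Rect(NamedTuple):
    x: float
    y: float
    w: float
    h: float

def border_rects(region, thickness):
    """Return the eight border rectangles surrounding `region`."""
    x, y, w, h = region
    t = thickness
    return {
        "top": Rect(x, y - t, w, t),
        "bottom": Rect(x, y + h, w, t),
        "left": Rect(x - t, y, t, h),
        "right": Rect(x + w, y, t, h),
        "top_left": Rect(x - t, y - t, t, t),
        "top_right": Rect(x + w, y - t, t, t),
        "bottom_left": Rect(x - t, y + h, t, t),
        "bottom_right": Rect(x + w, y + h, t, t),
    }

def overlaps(a, b):
    """True if two rectangles share any interior area."""
    return (a.x < b.x + b.w and b.x < a.x + a.w
            and a.y < b.y + b.h and b.y < a.y + a.h)

region = Rect(100, 100, 400, 300)
windows = border_rects(region, 4)
```

Because none of the eight rectangles overlaps the region itself, nothing sits above the interior to intercept clicks, drags, or scrolls.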
No. VisionPiper captures video only. For audio capture, use AudioPiper — a separate free app that records mic, system audio, and per-app audio via Core Audio Taps.
Capture a region, stream to a vision model, get answers — without uploading screenshots anywhere.