Motion capture used to require a room full of infrared cameras, reflective markers taped to a bodysuit, and software that cost more than a car. Studios charged thousands of dollars per session. The data was locked in proprietary formats. If you needed skeleton data for animation, game development, or AI image generation, you either paid studio rates or went without.

Then pose estimation models made it possible to extract skeleton data from ordinary video — no markers, no special cameras, just a webcam or a phone. But the tools to actually use this are fragmented. OpenPose is a research project with complex dependencies. MediaPipe runs in a browser but outputs non-standard formats. Cloud pose APIs charge per frame and require uploading your video.

Your Mac can do this locally, in real time, with zero setup. The Neural Engine runs Apple Vision's pose detector at 30+ FPS. CoreML models like ViTPose and RTMPose push that to 133-joint wholebody detection — fingers, toes, and facial landmarks — at speeds that keep up with live video.

What is pose estimation?

Pose estimation detects the position of human body joints in an image or video frame. A basic body model tracks 17–18 keypoints: nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles. Each keypoint has an x/y coordinate and a confidence score — how certain the model is that it found that joint.

Wholebody estimation extends this to 133 keypoints: the 17 body joints plus 42 hand joints (21 per hand), 6 foot keypoints, and 68 facial landmarks. This captures finger positions, facial expressions, and foot placement — enough for detailed animation and expression transfer.

The output is structured data: a list of people detected, each with an array of keypoint coordinates. This data can be rendered as a skeleton visualization, exported as JSON for downstream tools, or used directly as conditioning input for AI image and video generation.
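To make the shape of that data concrete, here is a minimal sketch of filtering a verbose-JSON frame by confidence. The field names (`people`, `keypoints`, `confidence`) are illustrative assumptions for this example, not ToolPiper's exact schema:

```python
import json

# Hypothetical verbose-JSON frame: a list of detected people, each with
# named keypoints carrying x/y coordinates and a confidence score.
frame = json.loads("""
{
  "people": [
    {
      "keypoints": [
        {"name": "nose", "x": 412.5, "y": 96.0, "confidence": 0.97},
        {"name": "left_wrist", "x": 280.1, "y": 390.4, "confidence": 0.88}
      ]
    }
  ]
}
""")

def confident_joints(frame, threshold=0.5):
    """Return (person_index, name, x, y) for joints above the threshold."""
    out = []
    for i, person in enumerate(frame["people"]):
        for k in person["keypoints"]:
            if k["confidence"] >= threshold:
                out.append((i, k["name"], k["x"], k["y"]))
    return out

print(confident_joints(frame, threshold=0.9))
```

Downstream tools typically apply a confidence cutoff like this before rendering or retargeting, so that low-confidence joints (occluded limbs, motion blur) don't produce jitter.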

What you can do with it

AI image generation conditioning. ControlNet and similar tools accept skeleton images as input to control the pose of generated characters. OpenPose-format skeletons — colored limbs on a black background — are the standard conditioning format. ToolPiper renders these natively with the expected colors and joint topology.

AI video generation. AnimateDiff and video diffusion models use frame-by-frame skeleton sequences to control motion. ToolPiper's real-time streaming outputs skeleton data at 30–60 FPS — feed it a video and get a skeleton sequence ready for generation.

Animation and game development. The JSON keypoint output can drive character rigs, motion retargeting, or procedural animation. The compact binary format (236 bytes per frame for single-person body tracking) is designed for low-latency integration.

Fitness and rehabilitation. Track body positions for exercise form analysis, physical therapy progress monitoring, or athletic performance measurement. All processing is local — no one sees your movement data.

Accessibility research. Gesture recognition, sign language processing, and assistive technology research all start with pose data. Local processing means sensitive biometric movement data stays private.
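Exercise form analysis of the kind described above reduces to geometry on keypoints. This sketch computes the angle at a joint from three detected keypoints (e.g. shoulder, elbow, wrist for elbow flexion); the coordinates are made-up pixel positions, not real detector output:

```python
import math

def joint_angle(a, b, c):
    """Angle at b (in degrees) formed by points a-b-c, e.g. the elbow
    angle from the shoulder (a), elbow (b), and wrist (c) keypoints."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    cos = max(-1.0, min(1.0, dot / (n1 * n2)))  # clamp for float safety
    return math.degrees(math.acos(cos))

# Straight arm: shoulder, elbow, wrist collinear -> roughly 180 degrees.
print(joint_angle((100, 100), (150, 100), (200, 100)))
# Right-angle bend at the elbow -> roughly 90 degrees.
print(joint_angle((100, 100), (150, 100), (150, 150)))
```

A rep counter or form checker is then a matter of thresholding these angles over time, which is why the per-joint confidence scores matter: occluded joints should be excluded before computing angles.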

Why local matters for pose and mocap

Movement data is biometric. How you move is uniquely identifiable — gait analysis can identify individuals with over 90% accuracy. Uploading pose data to a cloud service exposes biometric information that can't be changed like a password. Local processing means your movement data never leaves your machine.

Real-time needs low latency. Cloud pose APIs add network round-trip time to every frame. For live applications — interactive installations, real-time animation, fitness tracking — that latency makes the experience unusable. Local inference on the Neural Engine processes a frame in 6–13ms. That's faster than a single network hop to most data centers.

No per-frame cost. Cloud pose APIs charge $0.001–0.01 per frame. At 30 FPS, that's $1.80–18.00 per minute of video. A one-hour dance rehearsal would cost $108–1,080. Locally, it's free.

Works offline. Record skeleton data at a location with no internet — a dance studio, a gym, a film set in a remote location. The models run the same whether you're connected or not.

What you need

You don't need: A motion capture suit. Infrared cameras. Reflective markers. A terminal. Python. OpenPose compiled from source. A cloud API subscription.

You do need: A Mac with Apple Silicon (M1 or later) and at least 8GB of RAM. A camera or video source — webcam, screen capture via VisionPiper, or pre-recorded video files.

The models

ToolPiper includes four pose estimation models, each with different tradeoffs:

Apple Pose — built into macOS via the Vision framework. 19 keypoints (COCO body + neck + center shoulder). Multi-person detection. Runs on the Neural Engine with zero download. This is the default and the fastest option — start here.

Apple PoseNet — Apple's reference CoreML model. 17 keypoints (standard COCO). 12 MB, single-person, ANE-optimized. Lightweight alternative when you need basic body tracking with minimal resource usage.

ViTPose-B Wholebody — 133 keypoints including hands, feet, and facial landmarks. 172 MB CoreML model with a ViT-B backbone. The best option for full-body mocap where finger positions and facial expressions matter. Downloads from HuggingFace on first use.

RTMPose-L Wholebody — 133 keypoints, real-time speed, multi-person. 250 MB model from MMPose. State-of-the-art speed/accuracy balance for wholebody detection. The best option when you need both finger-level detail and real-time multi-person tracking.

Output formats

ToolPiper outputs skeleton data in four standard topologies that match what downstream tools expect:

OpenPose-18 — 18 joints, rainbow-gradient colored limbs on black canvas. This is what ControlNet expects. Feed the rendered skeleton directly into Stable Diffusion's ControlNet preprocessor.

COCO-17 — 17 joints, standard COCO annotation format. Compatible with most computer vision tools and datasets.

OpenPose+Hands — 50+ joints (body + hand keypoints). Body limbs in rainbow gradient, hand joints in orange/teal. For ControlNet with hand conditioning.

DWPose-133 — 133 joints (wholebody). Body, hands, feet, face. Required for DWPose-conditioned generation and detailed animation.

Each topology can be rendered as an image (skeleton on black, overlay on source image, or depth map) or exported as structured JSON with per-joint coordinates and confidence scores.
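The difference between COCO-17 and OpenPose-18 is a single joint: OpenPose adds a neck keypoint that COCO lacks. The common convention (which ToolPiper may or may not follow internally) synthesizes it as the midpoint of the two shoulders — a minimal sketch under that assumption:

```python
# Reference ordering of the 17 standard COCO body keypoints.
COCO_17 = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

def coco17_to_openpose18(kps):
    """kps: dict of name -> (x, y, confidence). Returns a copy with a
    synthesized 'neck' joint at the midpoint of the shoulders."""
    out = dict(kps)
    ls, rs = kps["left_shoulder"], kps["right_shoulder"]
    out["neck"] = (
        (ls[0] + rs[0]) / 2,
        (ls[1] + rs[1]) / 2,
        min(ls[2], rs[2]),  # neck is only as reliable as its weaker shoulder
    )
    return out
```

Conversions like this are why the topology choice matters: a ControlNet preprocessor expecting OpenPose-18 will silently misrender a COCO-17 skeleton whose joint order or count doesn't match.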

Real-time streaming

For live applications, ToolPiper runs a dedicated WebSocket server on port 10005. Send video frames in, get skeleton data back — in real time, at up to 60 FPS.

The streaming protocol supports three output formats:

Compact binary — 236 bytes per frame (single person, OpenPose-18, 2D). Zero heap allocation on the hot path. Designed for maximum throughput and minimum latency — frame data goes from Neural Engine to network in microseconds.

Compact JSON — flat arrays with pixel-space coordinates. Easy to parse in any language.

Verbose JSON — named joints with full metadata. Human-readable, good for debugging and prototyping.

The server handles up to 4 concurrent streams. Frame dropping is automatic — if detection takes longer than the frame interval, incoming frames are silently dropped to maintain real-time responsiveness. Stats messages report the drop rate so you can tune your configuration.
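As a rough illustration of what consuming the compact binary format might look like: the wire layout is not documented here, so this sketch assumes a hypothetical structure of a 20-byte header followed by 18 joints of three little-endian float32 values (x, y, confidence), which happens to total the stated 236 bytes (20 + 18 × 12). Treat the layout as an assumption, not the actual protocol:

```python
import struct

HEADER_SIZE = 20   # assumed header length (frame id, timestamp, etc.)
NUM_JOINTS = 18    # OpenPose-18, single person, 2D

def parse_frame(buf):
    """Parse one assumed-layout binary frame into (x, y, conf) tuples."""
    assert len(buf) == HEADER_SIZE + NUM_JOINTS * 12, "unexpected frame size"
    joints = []
    for i in range(NUM_JOINTS):
        off = HEADER_SIZE + i * 12
        joints.append(struct.unpack_from("<fff", buf, off))
    return joints
```

The appeal of a fixed-size binary frame is exactly this: the client can parse it with offset arithmetic alone, no JSON decoding, which keeps per-frame overhead negligible at 60 FPS.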

Three ways to use it

Pipeline (easiest): Use the Vision block in ModelPiper's pipeline builder. Connect it to a Pose block, feed it screen captures from VisionPiper, video files, or images. The skeleton output renders in the pipeline and can be saved.

MCP tools (for AI agents): The pose_detect tool accepts a base64 image and returns skeleton data. The pose_stream tool provides WebSocket connection details for real-time streaming. Use from Claude Code, Cursor, or any MCP client.

WebSocket API (for developers): Connect to ws://127.0.0.1:10005, authenticate, configure your preferred model and output format, and start sending frames. Binary skeleton data streams back in real time.
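A client for that flow might look like the sketch below, using the third-party `websockets` package. The message schema (field names, auth mechanics, model identifiers) is invented for illustration — consult the actual protocol documentation for the real handshake:

```python
import asyncio
import json

def build_config(model="apple-pose", fmt="compact-json"):
    """Configuration message selecting a model and output format.
    Field names and values here are hypothetical."""
    return json.dumps({"type": "configure", "model": model, "format": fmt})

async def stream(frames, token):
    """Authenticate, configure, then send frames and read skeleton data."""
    import websockets  # third-party: pip install websockets
    async with websockets.connect("ws://127.0.0.1:10005") as ws:
        await ws.send(json.dumps({"type": "auth", "token": token}))
        await ws.send(build_config())
        for frame in frames:            # frame: raw encoded image bytes
            await ws.send(frame)
            print(await ws.recv())      # skeleton data for that frame
```

Note the request/response pairing per frame is a simplification; a real-time client would typically decouple sending and receiving so that the automatic frame dropping described above can do its job.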

When cloud pose estimation is still better

Google's MediaPipe and cloud pose APIs support 3D pose estimation with depth prediction from monocular video — ToolPiper's 3D support is available via Apple Vision but is less mature than specialized cloud offerings. For large-scale video processing (thousands of hours of footage), cloud batch APIs with GPU clusters will be faster than sequential local processing.

For real-time use, ControlNet conditioning, animation prototyping, and any scenario where the movement data is sensitive, local processing is the better choice.

Try it

Download ModelPiper and install ToolPiper. Use the pose_detect MCP tool to detect poses in any image, or connect to the WebSocket stream for real-time skeleton tracking. Apple Pose works immediately with no model download — start there.

This is part of a series on local-first AI workflows on macOS. For the full list of local AI workflows, see the pillar article.