Motion capture used to require a room full of infrared cameras, reflective markers taped to a bodysuit, and software that cost more than a car. Studios charged thousands of dollars per session. The data was locked in proprietary formats. If you needed skeleton data for animation, game development, or AI image generation, you either paid studio rates or went without.

Then pose estimation models made it possible to extract skeleton data from ordinary video — no markers, no special cameras, just a webcam or a phone. But the tools to actually use this are fragmented. OpenPose is a research project with complex dependencies. MediaPipe runs in a browser but outputs non-standard formats. Cloud pose APIs charge per frame and require uploading your video.

Your Mac can do this locally, in real time, with zero setup. The Neural Engine runs Apple Vision's pose detector at 30+ FPS. CoreML models like ViTPose and RTMPose push that to 133-joint wholebody detection — fingers, toes, and facial landmarks — at speeds that keep up with live video.

What is pose estimation?

Pose estimation detects the position of human body joints in an image or video frame. A basic body model tracks 17–18 keypoints: nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles. Each keypoint has an x/y coordinate and a confidence score — how certain the model is that it found that joint.

Wholebody estimation extends this to 133 keypoints: the 17 body joints plus 42 hand joints (21 per hand), 6 foot keypoints, and 68 facial landmarks. This captures finger positions, facial expressions, and foot placement — enough for detailed animation and expression transfer.

The output is structured data: a list of people detected, each with an array of keypoint coordinates. This data can be rendered as a skeleton visualization, exported as JSON for downstream tools, or used directly as conditioning input for AI image and video generation.
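To make the shape of that data concrete, here is a minimal sketch of filtering a verbose-JSON frame by confidence. The field names (`people`, `keypoints`, `confidence`) are illustrative assumptions for this example, not ToolPiper's exact schema:

```python
import json

# Hypothetical verbose-JSON frame: a list of detected people, each with
# named keypoints carrying x/y coordinates and a confidence score.
frame = json.loads("""
{
  "people": [
    {
      "keypoints": [
        {"name": "nose", "x": 412.5, "y": 96.0, "confidence": 0.97},
        {"name": "left_wrist", "x": 280.1, "y": 390.4, "confidence": 0.88}
      ]
    }
  ]
}
""")

def confident_joints(frame, threshold=0.5):
    """Return (person_index, name, x, y) for joints above the threshold."""
    out = []
    for i, person in enumerate(frame["people"]):
        for k in person["keypoints"]:
            if k["confidence"] >= threshold:
                out.append((i, k["name"], k["x"], k["y"]))
    return out

print(confident_joints(frame, threshold=0.9))
```

Downstream tools typically apply a confidence cutoff like this before rendering or retargeting, so that low-confidence joints (occluded limbs, motion blur) don't produce jitter.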

What you can do with it

AI image generation conditioning. ControlNet and similar tools accept skeleton images as input to control the pose of generated characters. OpenPose-format skeletons — colored limbs on a black background — are the standard conditioning format. ToolPiper renders these natively with the expected colors and joint topology.

AI video generation. AnimateDiff and video diffusion models use frame-by-frame skeleton sequences to control motion. ToolPiper's real-time streaming outputs skeleton data at 30–60 FPS — feed it a video and get a skeleton sequence ready for generation.

Animation and game development. The JSON keypoint output can drive character rigs, motion retargeting, or procedural animation. The compact binary format (236 bytes per frame for single-person body tracking) is designed for low-latency integration.

Fitness and rehabilitation. Track body positions for exercise form analysis, physical therapy progress monitoring, or athletic performance measurement. All processing is local — no one sees your movement data.

Accessibility research. Gesture recognition, sign language processing, and assistive technology research all start with pose data. Local processing means sensitive biometric movement data stays private.
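Exercise form analysis of the kind described above reduces to geometry on keypoints. This sketch computes the angle at a joint from three detected keypoints (e.g. shoulder, elbow, wrist for elbow flexion); the coordinates are made-up pixel positions, not real detector output:

```python
import math

def joint_angle(a, b, c):
    """Angle at b (in degrees) formed by points a-b-c, e.g. the elbow
    angle from the shoulder (a), elbow (b), and wrist (c) keypoints."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    cos = max(-1.0, min(1.0, dot / (n1 * n2)))  # clamp for float safety
    return math.degrees(math.acos(cos))

# Straight arm: shoulder, elbow, wrist collinear -> roughly 180 degrees.
print(joint_angle((100, 100), (150, 100), (200, 100)))
# Right-angle bend at the elbow -> roughly 90 degrees.
print(joint_angle((100, 100), (150, 100), (150, 150)))
```

A rep counter or form checker is then a matter of thresholding these angles over time, which is why the per-joint confidence scores matter: occluded joints should be excluded before computing angles.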

Why local matters for pose and mocap

Movement data is biometric. How you move is uniquely identifiable — gait analysis can identify individuals with over 90% accuracy. Uploading pose data to a cloud service exposes biometric information that can't be changed like a password. Local processing means your movement data never leaves your machine.

Real-time needs low latency. Cloud pose APIs add network round-trip time to every frame. For live applications — interactive installations, real-time animation, fitness tracking — that latency makes the experience unusable. Local inference on the Neural Engine processes a frame in 6–13ms. That's faster than a single network hop to most data centers.

No per-frame cost. Cloud pose APIs charge $0.001–0.01 per frame. At 30 FPS, that's $1.80–18.00 per minute of video. A one-hour dance rehearsal would cost $108–1,080. Locally, it's free.

Works offline. Record skeleton data at a location with no internet — a dance studio, a gym, a film set in a remote location. The models run the same whether you're connected or not.

What you need

You don't need: A motion capture suit. Infrared cameras. Reflective markers. A terminal. Python. OpenPose compiled from source. A cloud API subscription.

You do need: A Mac with Apple Silicon (M1 or later) and at least 8GB of RAM. A camera or video source — webcam, screen capture via VisionPiper, or pre-recorded video files.

The models

ToolPiper includes four pose estimation models, each with different tradeoffs:

Apple Pose — built into macOS via the Vision framework. 19 keypoints (COCO body + neck + center shoulder). Multi-person detection. Runs on the Neural Engine with zero download. This is the default and the fastest option — start here.

Apple PoseNet — Apple's reference CoreML model. 17 keypoints (standard COCO). 12 MB, single-person, ANE-optimized. Lightweight alternative when you need basic body tracking with minimal resource usage.

ViTPose-B Wholebody — 133 keypoints including hands, feet, and facial landmarks. 172 MB CoreML model with a ViT-B backbone. The best option for full-body mocap where finger positions and facial expressions matter. Downloads from HuggingFace on first use.

RTMPose-L Wholebody — 133 keypoints, real-time speed, multi-person. 250 MB model from MMPose. State-of-the-art speed/accuracy balance for wholebody detection. The best option when you need both finger-level detail and real-time multi-person tracking.

Output formats

ToolPiper outputs skeleton data in four standard topologies that match what downstream tools expect:

OpenPose-18 — 18 joints, rainbow-gradient colored limbs on black canvas. This is what ControlNet expects. Feed the rendered skeleton directly into Stable Diffusion's ControlNet preprocessor.

COCO-17 — 17 joints, standard COCO annotation format. Compatible with most computer vision tools and datasets.

OpenPose+Hands — 50+ joints (body + hand keypoints). Body limbs in rainbow gradient, hand joints in orange/teal. For ControlNet with hand conditioning.

DWPose-133 — 133 joints (wholebody). Body, hands, feet, face. Required for DWPose-conditioned generation and detailed animation.

Each topology can be rendered as an image (skeleton on black, overlay on source image, or depth map) or exported as structured JSON with per-joint coordinates and confidence scores.
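The difference between COCO-17 and OpenPose-18 is a single joint: OpenPose adds a neck keypoint that COCO lacks. The common convention (which ToolPiper may or may not follow internally) synthesizes it as the midpoint of the two shoulders — a minimal sketch under that assumption:

```python
# Reference ordering of the 17 standard COCO body keypoints.
COCO_17 = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

def coco17_to_openpose18(kps):
    """kps: dict of name -> (x, y, confidence). Returns a copy with a
    synthesized 'neck' joint at the midpoint of the shoulders."""
    out = dict(kps)
    ls, rs = kps["left_shoulder"], kps["right_shoulder"]
    out["neck"] = (
        (ls[0] + rs[0]) / 2,
        (ls[1] + rs[1]) / 2,
        min(ls[2], rs[2]),  # neck is only as reliable as its weaker shoulder
    )
    return out
```

Conversions like this are why the topology choice matters: a ControlNet preprocessor expecting OpenPose-18 will silently misrender a COCO-17 skeleton whose joint order or count doesn't match.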

Real-time streaming

For live applications, ToolPiper runs a dedicated WebSocket server on port 10005. Send video frames in, get skeleton data back — in real time, at up to 60 FPS.

The streaming protocol supports three output formats:

Compact binary — 236 bytes per frame (single person, OpenPose-18, 2D). Zero heap allocation on the hot path. Designed for maximum throughput and minimum latency — frame data goes from Neural Engine to network in microseconds.

Compact JSON — flat arrays with pixel-space coordinates. Easy to parse in any language.

Verbose JSON — named joints with full metadata. Human-readable, good for debugging and prototyping.

The server handles up to 4 concurrent streams. Frame dropping is automatic — if detection takes longer than the frame interval, incoming frames are silently dropped to maintain real-time responsiveness. Stats messages report the drop rate so you can tune your configuration.
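As a rough illustration of what consuming the compact binary format might look like: the wire layout is not documented here, so this sketch assumes a hypothetical structure of a 20-byte header followed by 18 joints of three little-endian float32 values (x, y, confidence), which happens to total the stated 236 bytes (20 + 18 × 12). Treat the layout as an assumption, not the actual protocol:

```python
import struct

HEADER_SIZE = 20   # assumed header length (frame id, timestamp, etc.)
NUM_JOINTS = 18    # OpenPose-18, single person, 2D

def parse_frame(buf):
    """Parse one assumed-layout binary frame into (x, y, conf) tuples."""
    assert len(buf) == HEADER_SIZE + NUM_JOINTS * 12, "unexpected frame size"
    joints = []
    for i in range(NUM_JOINTS):
        off = HEADER_SIZE + i * 12
        joints.append(struct.unpack_from("<fff", buf, off))
    return joints
```

The appeal of a fixed-size binary frame is exactly this: the client can parse it with offset arithmetic alone, no JSON decoding, which keeps per-frame overhead negligible at 60 FPS.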

Three ways to use it

Pipeline (easiest): Use the Vision block in ModelPiper's pipeline builder. Connect it to a Pose block, feed it screen captures from VisionPiper, video files, or images. The skeleton output renders in the pipeline and can be saved.

MCP tools (for AI agents): The pose_detect tool accepts a base64 image and returns skeleton data. The pose_stream tool provides WebSocket connection details for real-time streaming. Use from Claude Code, Cursor, or any MCP client.

WebSocket API (for developers): Connect to ws://127.0.0.1:10005, authenticate, configure your preferred model and output format, and start sending frames. Binary skeleton data streams back in real time.
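A client for that flow might look like the sketch below, using the third-party `websockets` package. The message schema (field names, auth mechanics, model identifiers) is invented for illustration — consult the actual protocol documentation for the real handshake:

```python
import asyncio
import json

def build_config(model="apple-pose", fmt="compact-json"):
    """Configuration message selecting a model and output format.
    Field names and values here are hypothetical."""
    return json.dumps({"type": "configure", "model": model, "format": fmt})

async def stream(frames, token):
    """Authenticate, configure, then send frames and read skeleton data."""
    import websockets  # third-party: pip install websockets
    async with websockets.connect("ws://127.0.0.1:10005") as ws:
        await ws.send(json.dumps({"type": "auth", "token": token}))
        await ws.send(build_config())
        for frame in frames:            # frame: raw encoded image bytes
            await ws.send(frame)
            print(await ws.recv())      # skeleton data for that frame
```

Note the request/response pairing per frame is a simplification; a real-time client would typically decouple sending and receiving so that the automatic frame dropping described above can do its job.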

When cloud pose estimation is still better

Google's MediaPipe and cloud pose APIs support 3D pose estimation with depth prediction from monocular video — ToolPiper's 3D support is available via Apple Vision but is less mature than specialized cloud offerings. For large-scale video processing (thousands of hours of footage), cloud batch APIs with GPU clusters will be faster than sequential local processing.

For real-time use, ControlNet conditioning, animation prototyping, and any scenario where the movement data is sensitive, local processing is the better choice.

Try it

Download ModelPiper and install ToolPiper. Use the pose_detect MCP tool to detect poses in any image, or connect to the WebSocket stream for real-time skeleton tracking. Apple Pose works immediately with no model download — start there.

This is part of a series on local-first AI workflows on macOS. For the full list of local AI workflows, see the pillar article.