Abstract
We release PiperSR, a lightweight super-resolution model purpose-built for inference on Apple Neural Engine (ANE). PiperSR upscales images 2× using 6 residual blocks with 64 channels, SiLU activations, and PixelShuffle - 453,388 parameters in a 928 KB CoreML FP16 package. On the standard Set5 benchmark, PiperSR achieves 37.54 dB PSNR at 2× upscaling - 3.88 dB above bicubic interpolation. Integrated into a double-buffered pipeline (CPU input conversion → ANE prediction → Metal GPU output conversion), it sustains 44.4 FPS on 360p→720p video on M4 Max, processing real-world H.264 content at 1.5× realtime speed.
To our knowledge, PiperSR is the first publicly released super-resolution model designed from the ground up for Apple Neural Engine inference. The model, inference code, and benchmarks are available under AGPL-3.0 (code) and the PiperSR Model License (weights).
1. Motivation
Every Mac shipped since late 2020 includes a 16-core Neural Engine, rated from 11 TOPS on M1 to 15.8+ TOPS on M2 and later. Despite this, the on-device super-resolution landscape remains dominated by GPU-targeted architectures - Real-ESRGAN, ESPCN, IMDN - that either ignore the Neural Engine entirely or run on it incidentally through CoreML's automatic device placement.
The problem is architectural. Operations that are efficient on GPU are not necessarily efficient on ANE, and vice versa. GPU inference benefits from large batch sizes, attention mechanisms, and flexible memory access patterns. ANE inference is bandwidth-bound at typical feature map sizes - reducing FLOP count yields zero speedup if memory traffic stays constant. This means the common approach of "train on GPU, convert to CoreML, hope for the best" leaves significant performance on the table.
PiperSR was designed with ANE constraints as first-class requirements, not afterthoughts. Every operation in the model was verified to execute on ANE - no silent fallback to GPU or CPU. The architecture was shaped by empirical measurement of ANE behavior, not theoretical FLOP counts.
2. Model Architecture
PiperSR uses a deliberately simple architecture optimized for ANE throughput:
- Input: 3-channel RGB image (any resolution for tiled mode; 640×360 for full-frame video)
- Feature extraction: Single Conv2d (3→64 channels, 3×3 kernel)
- Body: 6 residual blocks, each containing 2× Conv2d (64→64, 3×3) with SiLU activation and skip connection
- Upscale: Conv2d (64→12, 3×3) → PixelShuffle(2) → 3-channel output at 2× resolution
- Total: 453,388 parameters, 928 KB CoreML FP16 package
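The layer list above can be tallied with back-of-the-envelope arithmetic. This is a sketch of my own accounting, not the project's: it sums only the plain conv stack described in the bullets, and lands slightly under the reported 453,388 total (the remaining ~1.5K parameters presumably live outside this tally, e.g. in fused BN terms):

```python
# Parameters of a Conv2d with bias: out_ch * in_ch * k * k + out_ch
def conv_params(in_ch, out_ch, k=3):
    return out_ch * in_ch * k * k + out_ch

head = conv_params(3, 64)            # feature extraction: 3 -> 64, 3x3
body = 6 * 2 * conv_params(64, 64)   # 6 residual blocks, 2 convs each
tail = conv_params(64, 12)           # upscale conv: 64 -> 12, then PixelShuffle(2)

total = head + body + tail
print(head, body, tail, total)       # 1792 443136 6924 451852
```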
Key design decisions:
- No attention layers. Channel attention (SE blocks, ESAB) adds ~2.9ms per block on ANE at 360×640 - nearly the cost of an entire residual block - due to the reduce_mean over 230K spatial elements being bandwidth-dominated.
- No depthwise separable convolutions. Blueprint Separable Convolution (BSConv) reduces FLOPs by 7.9× on paper, but empirical measurement shows 0% speedup on ANE at these feature map sizes. The Neural Engine is memory-bandwidth-bound, not compute-bound. More operations means more bandwidth tax, regardless of arithmetic intensity.
- Batch normalization fused into convolutions. The video-optimized model has 30 MIL (Model Intermediate Language) operations, down from 42 before BN fusion. Each fused operation eliminates a multiply-add and its associated memory traffic.
- FP16 throughout, no quantization. At 928 KB, the model fits in ANE SRAM. INT8 quantization would add dequantization overhead at every layer boundary for zero performance benefit.
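The PixelShuffle(2) step in the architecture above is a plain depth-to-space rearrangement: the 12-channel conv output becomes 3 RGB channels at twice the spatial resolution. A minimal numpy sketch of the operation (matching PyTorch's `nn.PixelShuffle` channel ordering; not the model's actual code):

```python
import numpy as np

def pixel_shuffle(x, r=2):
    """Depth-to-space: (C*r^2, H, W) -> (C, H*r, W*r), PyTorch ordering."""
    c2, h, w = x.shape
    c = c2 // (r * r)
    # Split channels into (C, r, r), then interleave the r-blocks into space.
    return (x.reshape(c, r, r, h, w)
             .transpose(0, 3, 1, 4, 2)
             .reshape(c, h * r, w * r))

# 12 channels at H x W become 3 channels at 2H x 2W.
x = np.arange(12 * 2 * 2).reshape(12, 2, 2)
y = pixel_shuffle(x)
print(y.shape)  # (3, 4, 4)
```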
3. Benchmarks
Image Quality (PSNR / SSIM)
Evaluated on standard benchmarks at 2× upscaling, compared against models in the same parameter class:
| Model | Params | Set5 PSNR | Set14 PSNR | Architecture |
|---|---|---|---|---|
| PiperSR | 453K | 37.54 dB | 33.22 dB | Plain conv3×3 + PixelShuffle (ANE-native) |
| ESPCN | 20K | 33.13 dB | 29.82 dB | 3-layer CNN + sub-pixel |
| PAN | 272K | 37.58 dB | 33.40 dB | Pixel attention network (GPU-optimized) |
| SAFMN | 228K | 38.00 dB | 33.52 dB | Spatial feature modulation (GPU-optimized) |
| BSRN | 332K | 38.10 dB | 33.63 dB | Blueprint separable + channel attention (GPU-optimized) |
PiperSR achieves competitive quality with models targeting GPU, while being the only model in this table that runs entirely on ANE without device fallback. Models with higher PSNR (PAN, SAFMN, BSRN) use attention and separable convolutions that are empirically slower on ANE than plain convolutions at equivalent feature map sizes.
Inference Speed
Measured on M4 Max, Release build, sustained over 300 frames. "ANE only" means CoreML prediction with .cpuAndNeuralEngine compute units and verified MIL graph placement:
| Configuration | FPS | Frame Time | Device |
|---|---|---|---|
| PiperSR full-frame (640×360→1280×720) | 44.4 | 22.5ms | ANE |
| PiperSR tiled (128×128 tiles, any resolution) | 5-10 | 100-200ms | ANE |
| Real-ESRGAN x2 (via GPU) | 2-4 | 250-500ms | GPU |
Pipeline phase breakdown (M4 Max, full-frame video path):
| Phase | Hardware | Time |
|---|---|---|
| Input conversion (BGRA → Float16 tensor) | CPU | 0.3ms |
| Neural network prediction | ANE | 20.8ms |
| Output conversion (Float16 → BGRA via Metal shader) | GPU | 1.3ms (hidden by double-buffering) |
4. ANE-Specific Findings
Several findings from developing PiperSR challenge common assumptions about Neural Engine performance:
ANE is bandwidth-bound, not compute-bound. At 640×360 with 64 channels, each feature map layer is 28.8 MB in FP16. On M2 (100 GB/s memory bandwidth), every operation pays a ~3ms bandwidth tax to read and write this data, regardless of arithmetic complexity. A 3×3 convolution (36,864 multiplies per pixel) takes 2.96ms. A BSConv decomposition of the same operation (4,672 multiplies - 7.9× fewer FLOPs) takes 3.31ms. Less compute, more ops, same bandwidth, slower wall clock.
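The per-pixel multiply counts quoted above can be reproduced directly: a dense 3×3 convolution at 64 input and output channels versus its BSConv-style depthwise + pointwise decomposition:

```python
# Per-output-pixel multiply counts at 64 in/out channels, 3x3 kernel
ch, k = 64, 3

dense = ch * ch * k * k        # standard Conv2d: 36,864 multiplies
bsconv = ch * k * k + ch * ch  # depthwise 3x3 + pointwise 1x1: 4,672

print(dense, bsconv, round(dense / bsconv, 1))  # 36864 4672 7.9
```

The 7.9× FLOP reduction is real on paper; the point of the measurement above is that it buys nothing when every intermediate tensor still has to cross the memory bus.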
Compute units must be .cpuAndNeuralEngine, not .all. The .all option allows CoreML to route operations to GPU when it estimates GPU would be faster. For models designed around ANE's constraints, this "optimization" introduces device transfer overhead and is always slower.
Full-frame inference eliminates dispatch overhead. Tiling a 360p frame into 128×128 patches requires ~66 inference dispatches. Each dispatch carries scheduling overhead regardless of computational cost. A single full-frame dispatch eliminates ~98% of this overhead - from 66 round-trips to 1.
5. Pipeline Integration
PiperSR ships as a CoreML .mlpackage that works standalone with the included Python inference script. For real-time video upscaling, it integrates into ToolPiper's double-buffered pipeline:
```shell
# Standalone image upscale (Python)
pip install pipersr
pipersr upscale input.png --output output.png

# Video upscale via ToolPiper REST API
curl -X POST http://127.0.0.1:9998/v1/video/upscale \
  -H "X-Session-Key: $SK" \
  -F "[email protected]"

# Video upscale via MCP tool (Claude Code, Cursor, etc.)
# Just ask: "upscale this video" - the video_upscale tool handles it
```

The double-buffered pipeline runs CPU input conversion, ANE prediction, and Metal GPU output conversion on three different pieces of hardware simultaneously. Pre-allocated frame buffers mean zero per-frame heap allocation. A dedicated DispatchQueue bypasses Swift's cooperative thread pool to eliminate 3-5ms of scheduling jitter.
ToolPiper exposes this as a single REST endpoint (/v1/video/upscale) and MCP tool (video_upscale) - upload a video file, get the upscaled result. Progress streams via Server-Sent Events. Audio is remuxed automatically.
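The overlap idea behind the double-buffered pipeline can be sketched in a few lines. This is a simplified Python analogue under stated assumptions - the stage functions here are hypothetical stand-ins, and the real implementation is Swift with pre-allocated buffers and a dedicated DispatchQueue:

```python
import queue
import threading

def stage(fn, inbox, outbox):
    """Run fn on items from inbox until a None sentinel, forwarding results."""
    while (item := inbox.get()) is not None:
        outbox.put(fn(item))
    outbox.put(None)  # propagate shutdown to the next stage

# Hypothetical stand-ins for the three hardware stages.
convert_in  = lambda f: f + "->fp16"   # CPU: BGRA -> Float16 tensor
predict     = lambda f: f + "->sr"     # ANE: model prediction
convert_out = lambda f: f + "->bgra"   # GPU: Float16 -> BGRA

# Bounded queues give the "double-buffered" behavior: a stage can run
# one frame ahead without unbounded memory growth.
q0, q1, q2, q3 = (queue.Queue(maxsize=2) for _ in range(4))
threads = [threading.Thread(target=stage, args=a) for a in
           [(convert_in, q0, q1), (predict, q1, q2), (convert_out, q2, q3)]]
for t in threads:
    t.start()

for frame in ["frame0", "frame1", "frame2"]:
    q0.put(frame)
q0.put(None)

results = list(iter(q3.get, None))
print(results)  # ['frame0->fp16->sr->bgra', ...]
for t in threads:
    t.join()
```

Because each stage is a single worker draining a FIFO queue, frame order is preserved while all three stages run concurrently - the same property that lets the 1.3ms GPU output conversion hide behind the next frame's ANE prediction.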
6. Distribution
PiperSR is available through multiple channels:
| Channel | What | Link |
|---|---|---|
| GitHub | Inference code, benchmarks, sample images | github.com/modelpiper/pipersr |
| ModelPiper | Model card, weights download, benchmark database | modelpiper.com |
| ToolPiper | Real-time video upscale (REST + MCP) | modelpiper.com/docs/toolpiper |
| PyPI | Python inference package | pip install pipersr |
License
Code (inference script, benchmark script): AGPL-3.0 - free for personal and research use; commercial products using the code must open-source their application.
Model weights: PiperSR Model License - free for personal, academic, and non-commercial use. Attribution required ("Powered by PiperSR from ModelPiper" or equivalent) in any public-facing use. Commercial use requires a separate license. No redistribution without attribution. No use of weights to train competing models.
CoreML .mlpackage only - no PyTorch weights or ONNX export. PiperSR is an ANE model; the distribution format reflects that.
7. Limitations
- Resolution-locked full-frame path. The optimized 44 FPS pipeline only handles 640×360 → 1280×720. Other resolutions fall back to tiled inference at 5-10 FPS. Additional resolution-specific models can be produced but aren't bundled to keep app size small.
- 2× upscale only. PiperSR is a 2× model. 4× upscale requires running two 2× passes or training a dedicated 4× model (whose output tensor is 4× larger, with a corresponding throughput cost).
- M1 untested. All published benchmarks are M4 Max. M1 has the same 16 ANE cores but an older microarchitecture - we expect lower but likely still-realtime throughput.
- Quality ceiling. At 37.54 dB Set5, PiperSR trades ~0.5 dB versus state-of-the-art lightweight models (BSRN at 38.10 dB) in exchange for ANE-native execution. This tradeoff is deliberate - the GPU-optimal architectures are empirically slower on ANE.
8. Conclusion
PiperSR demonstrates that purpose-built ANE models can achieve competitive quality while unlocking performance that GPU-targeted architectures cannot match on Apple Silicon. The Apple Neural Engine is underutilized in the ML community - most models treat it as an incidental deployment target rather than a first-class design constraint.
We release PiperSR to establish a reference point for ANE-native super-resolution and to provide a practical, high-performance upscaling tool for the Apple developer community. For real-time video upscaling with zero setup, PiperSR is available in ToolPiper.
This is a technical paper in the local-first AI on macOS series. For the pipeline engineering details, see How We Achieved 44 FPS Video Upscale on Apple Neural Engine. For a user-focused guide, see Local Video Upscale on Mac.