Abstract

We release PiperSR, a lightweight super-resolution model purpose-built for inference on Apple Neural Engine (ANE). PiperSR upscales images 2× using 6 residual blocks with 64 channels, SiLU activations, and PixelShuffle - 453,388 parameters in a 928 KB CoreML FP16 package. On the standard Set5 benchmark, PiperSR achieves 37.54 dB PSNR at 2× upscaling - 3.88 dB above bicubic interpolation. Integrated into a double-buffered pipeline (CPU input conversion → ANE prediction → Metal GPU output conversion), it sustains 44.4 FPS on 360p→720p video on M4 Max, processing real-world H.264 content at 1.5× realtime speed.

To our knowledge, PiperSR is the first publicly released super-resolution model designed from the ground up for Apple Neural Engine inference. The model, inference code, and benchmarks are available under AGPL-3.0 (code) and the PiperSR Model License (weights).

1. Motivation

Every Mac shipped since late 2020 includes a 16-core Neural Engine capable of 11+ TOPS at FP16 (11 TOPS on M1, 15.8 on M2, more on later generations). Despite this, the on-device super-resolution landscape remains dominated by GPU-targeted architectures - Real-ESRGAN, ESPCN, IMDN - that either ignore the Neural Engine entirely or run on it incidentally through CoreML's automatic device placement.

The problem is architectural. Operations that are efficient on GPU are not necessarily efficient on ANE, and vice versa. GPU inference benefits from large batch sizes, attention mechanisms, and flexible memory access patterns. ANE inference is bandwidth-bound at typical feature map sizes: cutting FLOPs yields zero speedup if memory traffic stays constant, because the bottleneck is bytes moved, not arithmetic. This means the common approach of "train on GPU, convert to CoreML, hope for the best" leaves significant performance on the table.

PiperSR was designed with ANE constraints as first-class requirements, not afterthoughts. Every operation in the model was verified to execute on ANE - no silent fallback to GPU or CPU. The architecture was shaped by empirical measurement of ANE behavior, not theoretical FLOP counts.

2. Model Architecture

PiperSR uses a deliberately simple architecture optimized for ANE throughput:

  • Input: 3-channel RGB image (any resolution for tiled mode; 640×360 for full-frame video)
  • Feature extraction: Single Conv2d (3→64 channels, 3×3 kernel)
  • Body: 6 residual blocks, each containing 2× Conv2d (64→64, 3×3) with SiLU activation and skip connection
  • Upscale: Conv2d (64→12, 3×3) → PixelShuffle(2) → 3-channel output at 2× resolution
  • Total: 453,388 parameters, 928 KB CoreML FP16 package
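The upscale stage is a standard sub-pixel rearrangement: PixelShuffle(2) folds the 12-channel tensor into 3 RGB channels at twice the spatial resolution. A minimal pure-Python sketch of that rearrangement (illustrative only, not the CoreML implementation):

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) tensor (nested lists) into (C, H*r, W*r).

    Matches PyTorch PixelShuffle semantics: output[c][h*r+i][w*r+j]
    comes from input channel c*r*r + i*r + j.
    """
    cr2, h, w = len(x), len(x[0]), len(x[0][0])
    c = cr2 // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c)]
    for ch in range(c):
        for i in range(r):
            for j in range(r):
                src = x[ch * r * r + i * r + j]
                for y in range(h):
                    for z in range(w):
                        out[ch][y * r + i][z * r + j] = src[y][z]
    return out

# PiperSR's upscale head: 12 channels at H×W become 3 channels at 2H×2W.
inp = [[[float(c)]] for c in range(12)]       # 12 channels, each 1×1
out = pixel_shuffle(inp, 2)                   # 3 channels, each 2×2
print(len(out), len(out[0]), len(out[0][0]))  # 3 2 2
```

This is why the pre-shuffle convolution outputs 12 channels: 3 output channels × 2² shuffle factor.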

Key design decisions:

  • No attention layers. Channel attention (SE blocks, ESAB) adds ~2.9ms per block on ANE at 360×640 - nearly the cost of an entire residual block - due to the reduce_mean over 230K spatial elements being bandwidth-dominated.
  • No depthwise separable convolutions. Blueprint Separable Convolution (BSConv) reduces FLOPs by 7.9× on paper, but empirical measurement shows 0% speedup on ANE at these feature map sizes. The Neural Engine is memory-bandwidth-bound, not compute-bound. More operations means more bandwidth tax, regardless of arithmetic intensity.
  • Batch normalization fused into convolutions. The video-optimized model has 30 MIL (Model Intermediate Language) operations, down from 42 before BN fusion. Each fused operation eliminates a multiply-add pass and its associated memory traffic.
  • FP16 throughout, no quantization. At 928 KB, the model fits in ANE SRAM. INT8 quantization would add dequantization overhead at every layer boundary for zero performance benefit.
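The parameter and size figures follow from simple accounting. A quick check, assuming biased 3×3 convolutions (weights = k²·Cin·Cout, plus Cout biases); the convolutions listed above account for ~452K parameters, with the published total of 453,388 including whatever small auxiliary ops round out the graph:

```python
def conv3x3_params(c_in, c_out):
    """Weights + bias for a biased 3x3 Conv2d."""
    return c_in * c_out * 9 + c_out

head = conv3x3_params(3, 64)           # feature extraction: 1,792
body = 6 * 2 * conv3x3_params(64, 64)  # 6 residual blocks × 2 convs: 443,136
tail = conv3x3_params(64, 12)          # pre-PixelShuffle projection: 6,924
print(head + body + tail)              # ~452K from the listed convs alone

raw_weight_bytes = 453_388 * 2         # published total at 2 bytes per FP16 weight
print(raw_weight_bytes)                # 906776 - raw weights sit under 1 MB
```

The 928 KB .mlpackage is the raw FP16 weights plus the serialized MIL graph and metadata.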

3. Benchmarks

Image Quality (PSNR)

Evaluated on standard benchmarks at 2× upscaling, compared against models in the same parameter class:

Model     Params   Set5 PSNR   Set14 PSNR   Architecture
PiperSR   453K     37.54 dB    33.22 dB     Plain conv3×3 + PixelShuffle (ANE-native)
ESPCN     20K      33.13 dB    29.82 dB     3-layer CNN + sub-pixel
PAN       272K     37.58 dB    33.40 dB     Pixel attention network (GPU-optimized)
SAFMN     228K     38.00 dB    33.52 dB     Spatial feature modulation (GPU-optimized)
BSRN      332K     38.10 dB    33.63 dB     Blueprint separable + channel attention (GPU-optimized)

PiperSR achieves competitive quality with models targeting GPU, while being the only model in this table that runs entirely on ANE without device fallback. The models with higher PSNR (PAN, SAFMN, BSRN) rely on attention or separable convolutions that are empirically slower on ANE than plain convolutions at equivalent feature map sizes.
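The PSNR figures above use the standard definition, 10·log10(MAX²/MSE), typically computed on the luma channel in SR benchmarks. A minimal sketch with hypothetical pixel values (not the benchmark harness):

```python
import math

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(ref, test)) / len(ref)
    return float("inf") if mse == 0 else 10 * math.log10(peak * peak / mse)

ref  = [52, 60, 61, 204, 118, 90]   # hypothetical ground-truth pixels
test = [50, 62, 60, 200, 120, 91]   # hypothetical upscaled pixels
print(round(psnr(ref, test), 2))    # 41.14
```

Note the logarithmic scale: the 3.88 dB gap over bicubic quoted in the abstract corresponds to roughly a 2.4× reduction in mean squared error.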

Inference Speed

Measured on M4 Max, Release build, sustained over 300 frames. "ANE only" means CoreML prediction with .cpuAndNeuralEngine compute units and verified MIL graph placement:

Configuration                                   FPS    Frame Time   Device
PiperSR full-frame (640×360→1280×720)           44.4   22.5ms       ANE
PiperSR tiled (128×128 tiles, any resolution)   5-10   100-200ms    ANE
Real-ESRGAN x2 (via GPU)                        2-4    250-500ms    GPU

Pipeline phase breakdown (M4 Max, full-frame video path):

Phase                                                 Hardware   Time
Input conversion (BGRA → Float16 tensor)              CPU        0.3ms
Neural network prediction                             ANE        20.8ms
Output conversion (Float16 → BGRA via Metal shader)   GPU        1.3ms (hidden by double-buffering)

4. ANE-Specific Findings

Several findings from developing PiperSR challenge common assumptions about Neural Engine performance:

ANE is bandwidth-bound, not compute-bound. At 640×360 with 64 channels, each feature map layer is 29.5 MB (28.1 MiB) in FP16. On M2 (100 GB/s memory bandwidth), every operation pays a ~3ms bandwidth tax to read and write this data, regardless of arithmetic complexity. A 3×3 convolution (36,864 multiplies per pixel) takes 2.96ms. A BSConv decomposition of the same operation (4,672 multiplies - 7.9× fewer FLOPs) takes 3.31ms. Less compute, more ops, same bandwidth, slower wall clock.
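The per-pixel multiply counts quoted above follow directly from the layer shapes. A quick check of the 7.9× FLOP gap between a standard 3×3 convolution and a BSConv-style pointwise-plus-depthwise decomposition, both 64→64 channels:

```python
c_in = c_out = 64
k = 3

std = k * k * c_in * c_out             # standard 3×3 conv: multiplies per output pixel
bsconv = c_in * c_out + k * k * c_out  # 1×1 pointwise + 3×3 depthwise
print(std, bsconv, round(std / bsconv, 1))  # 36864 4672 7.9
```

Despite the 7.9× FLOP reduction, BSConv produces two operations where there was one, so the feature map crosses memory twice - which is why it measures slower, not faster, on ANE.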

Compute units must be .cpuAndNeuralEngine, not .all. The .all option allows CoreML to route operations to GPU when it estimates GPU would be faster. For models designed around ANE's constraints, this "optimization" introduces device transfer overhead and is always slower.

Full-frame inference eliminates dispatch overhead. Tiling a 360p frame into 128×128 patches requires ~66 inference dispatches. Each dispatch carries scheduling overhead regardless of computational cost. A single full-frame dispatch cuts this from ~66 round-trips to one, eliminating roughly 98% of the dispatch overhead.
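The dispatch count is a function of tile size and tile overlap. Under one plausible parameterization - 128×128 output tiles with an 8 px overlap, both assumed here for illustration (the shipping tiler may use different values) - a 1280×720 output requires 66 tiles:

```python
import math

def tile_count(dim, tile, overlap):
    """Tiles of size `tile` covering `dim`, with `overlap` px shared between neighbors."""
    if dim <= tile:
        return 1
    return math.ceil((dim - overlap) / (tile - overlap))

cols = tile_count(1280, 128, 8)   # 11 columns
rows = tile_count(720, 128, 8)    # 6 rows
print(cols * rows)                # 66 dispatches, vs 1 for full-frame
```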

5. Pipeline Integration

PiperSR ships as a CoreML .mlpackage that works standalone with the included Python inference script. For real-time video upscaling, it integrates into ToolPiper's double-buffered pipeline:

# Standalone image upscale (Python)
pip install pipersr
pipersr upscale input.png --output output.png

# Video upscale via ToolPiper REST API
curl -X POST http://127.0.0.1:9998/v1/video/upscale \
  -H "X-Session-Key: $SK" \
  -F "[email protected]"

# Video upscale via MCP tool (Claude Code, Cursor, etc.)
# Just ask: "upscale this video" - the video_upscale tool handles it

The double-buffered pipeline runs CPU input conversion, ANE prediction, and Metal GPU output conversion on three different pieces of hardware simultaneously. Pre-allocated frame buffers mean zero per-frame heap allocation. A dedicated DispatchQueue bypasses Swift's cooperative thread pool to eliminate 3-5ms of scheduling jitter.
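The frame-time accounting follows from the phase table above: run serially, the three phases sum to 22.4 ms, but with double-buffering the Metal output conversion of frame N overlaps the next frame's CPU and ANE work, so it drops out of the critical path. A sketch of that arithmetic (stage times from the M4 Max breakdown):

```python
cpu_in, ane, gpu_out = 0.3, 20.8, 1.3        # ms, from the phase table

serial = cpu_in + ane + gpu_out              # no overlap: 22.4 ms/frame
# Double-buffered: GPU conversion of frame N runs concurrently with
# frame N+1's CPU conversion and ANE prediction on separate hardware.
double_buffered = cpu_in + ane               # 21.1 ms critical path
print(round(serial, 1), round(double_buffered, 1))       # 22.4 21.1
print(round(1000 / double_buffered, 1))                  # ~47.4 FPS ceiling
```

The measured 22.5 ms frame time sits slightly above this 21.1 ms critical path; the remainder is presumably dispatch and synchronization cost, which the dedicated DispatchQueue keeps small.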

ToolPiper exposes this as a single REST endpoint (/v1/video/upscale) and MCP tool (video_upscale) - upload a video file, get the upscaled result. Progress streams via Server-Sent Events. Audio is remuxed automatically.

6. Distribution

PiperSR is available through multiple channels:

Channel      What                                               Link
GitHub       Inference code, benchmarks, sample images          github.com/modelpiper/pipersr
ModelPiper   Model card, weights download, benchmark database   modelpiper.com
ToolPiper    Real-time video upscale (REST + MCP)               modelpiper.com/docs/toolpiper
PyPI         Python inference package                           pip install pipersr

License

Code (inference script, benchmark script): AGPL-3.0 - free for personal and research use; commercial products using the code must open-source their application.

Model weights: PiperSR Model License - free for personal, academic, and non-commercial use. Attribution required ("Powered by PiperSR from ModelPiper" or equivalent) in any public-facing use. Commercial use requires a separate license. No redistribution without attribution. No use of weights to train competing models.

CoreML .mlpackage only - no PyTorch weights or ONNX export. PiperSR is an ANE model; the distribution format reflects that.

7. Limitations

  • Resolution-locked full-frame path. The optimized 44 FPS pipeline only handles 640×360 → 1280×720. Other resolutions fall back to tiled inference at 5-10 FPS. Additional resolution-specific models can be produced but aren't bundled to keep app size small.
  • 2× upscale only. PiperSR is a 2× model. 4× upscale requires running two passes or training a dedicated 4× model (which doubles each output dimension relative to the 2× model and roughly halves throughput).
  • M1 untested. All published benchmarks are M4 Max. M1 has the same 16 ANE cores but an older microarchitecture - we expect lower but likely still-realtime throughput.
  • Quality ceiling. At 37.54 dB Set5, PiperSR trades ~0.5 dB versus state-of-the-art lightweight models (BSRN at 38.10 dB) in exchange for ANE-native execution. This tradeoff is deliberate - the GPU-optimal architectures are empirically slower on ANE.

8. Conclusion

PiperSR demonstrates that purpose-built ANE models can achieve competitive quality while unlocking performance that GPU-targeted architectures cannot match on Apple Silicon. The Apple Neural Engine is underutilized in the ML community - most models treat it as an incidental deployment target rather than a first-class design constraint.

We release PiperSR to establish a reference point for ANE-native super-resolution and to provide a practical, high-performance upscaling tool for the Apple developer community. For real-time video upscaling with zero setup, PiperSR is available in ToolPiper.

This is a technical paper in the local-first AI on macOS series. For the pipeline engineering details, see How We Achieved 44 FPS Video Upscale on Apple Neural Engine. For a user-focused guide, see Local Video Upscale on Mac.