---
title: "How We Achieved 44 FPS Video Upscale on Apple Neural Engine"
description: "A deep dive into PiperSR's double-buffered ANE+Metal pipeline that upscales 360p video to 720p at 44.4 FPS - 1.5x realtime on Apple Silicon."
date: 2026-03-21
author: "Ben Racicot"
tags: ["Video Upscale", "Apple Neural Engine", "CoreML", "Privacy", "macOS", "Apple Silicon", "Performance"]
type: "paper"
canonical: "https://modelpiper.com/blog/pipersr-44fps-video-upscale-apple-neural-engine/"
---

# How We Achieved 44 FPS Video Upscale on Apple Neural Engine

> A deep dive into PiperSR's double-buffered ANE+Metal pipeline that upscales 360p video to 720p at 44.4 FPS - 1.5x realtime on Apple Silicon.

## TL;DR

PiperSR upscales 360p video to 720p at 44.4 FPS on Apple Silicon - 1.5x realtime. This paper details the double-buffered pipeline (CPU → ANE → Metal GPU) that eliminated 96% of dispatch overhead, the BN-fused 453K-parameter model, and the Metal shader that handles output conversion in 1.3ms.

PiperSR is a 453K-parameter super-resolution model that upscales video 2x in realtime on Apple Silicon. On an M4 Max, it sustains **44.4 FPS on real-world H.264 content - 1.5x the 30 FPS playback rate.** This paper describes the architecture decisions and optimizations that made this possible, starting from a naive tiled implementation at 5 FPS and ending with a double-buffered pipeline that saturates the Apple Neural Engine.

## The Problem

Video super-resolution on consumer hardware faces a fundamental throughput constraint: you need to process at least 30 frames per second to match realtime playback. Each frame at 360p (640×360) contains 230,400 pixels. At 2x upscale, the output is 1280×720 - 921,600 pixels per frame, 30 times per second.

Most on-device super-resolution implementations use a tiled approach: split each frame into small patches, run inference on each patch, stitch the results. This works but introduces massive overhead. A 360p frame split into 128×128 tiles requires roughly 66 inference dispatches per frame. Each dispatch carries ANE scheduling overhead, memory copies, and tile boundary artifacts that need blending.
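The exact tile count depends on the overlap used for seam blending, which isn't spelled out above. As a sanity check, one hypothetical scheme - 128×128 tiles with an assumed 8 px blend overlap laid over the 1280×720 output grid (both values are assumptions, not taken from the PiperSR source) - lands on exactly 66:

```python
import math

def tile_count(extent: int, tile: int, overlap: int) -> int:
    """Number of overlapping tiles needed to cover one axis."""
    stride = tile - overlap
    if extent <= tile:
        return 1
    return math.ceil((extent - tile) / stride) + 1

# Hypothetical: 128x128 tiles, 8 px overlap, over the 1280x720 output.
tiles_x = tile_count(1280, 128, 8)   # 11 columns
tiles_y = tile_count(720, 128, 8)    # 6 rows
print(tiles_x * tiles_y)             # 66 dispatches per frame
```

Whatever the precise overlap, the count scales with frame area divided by tile area, so every frame pays dozens of dispatches.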

Our initial tiled pipeline ran at 5-10 FPS depending on resolution - functional for static images, but unusable for video.

## The Model: Small, ANE-Native, Purpose-Built

PiperSR uses 6 residual blocks with 64 channels, SiLU activation, and PixelShuffle for the 2x upscale. The entire model is 928 KB in CoreML FP16 format. Key design constraints:

-   **All operations are ANE-native.** No fallback to GPU or CPU during inference. This means no operations that the ANE compiler would reject and silently reroute - we verified the MIL (Machine Learning Intermediate Language) graph to confirm every op runs on the Neural Engine.
-   **Batch normalization fused into convolutions.** The video model has 42 MIL operations reduced to 30 after BN fusion. Each fused op removes a multiply, add, and associated memory traffic. This was done during export from PyTorch, not at CoreML compile time.
-   **FP16 throughout, no quantization.** INT8 quantization would add dequantization overhead on every layer boundary. At 928 KB, the full FP16 model fits comfortably in the ANE's SRAM, so quantization would trade quality for zero performance gain.
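The BN fusion mentioned above is plain arithmetic on the exported weights. A minimal sketch of the fold for a single channel, with illustrative values (not PiperSR's actual weights):

```python
import math

def fuse_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm (gamma, beta, mean, var) into the preceding
    conv's weight w and bias b, returning fused (w', b')."""
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta

# Illustrative scalar channel: conv output y = w*x + b, then BN.
w, b = 0.8, 0.1
gamma, beta, mean, var = 1.2, -0.3, 0.05, 0.9
wf, bf = fuse_bn(w, b, gamma, beta, mean, var)

x = 2.0
y_separate = (w * x + b - mean) / math.sqrt(var + 1e-5) * gamma + beta
y_fused = wf * x + bf
assert abs(y_separate - y_fused) < 1e-9
```

The fused form computes one multiply-add per element where the unfused graph needed two, which is where the 42 → 30 op reduction comes from.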

We bundle two model variants: an ImageType model (128×128 tiles) for static image upscale, and a TensorType model (640×360 full-frame) for video. The distinction matters for pipeline architecture.

## Full-Frame Inference: 66 Dispatches to 1

The single largest optimization was eliminating tiling entirely for the video path. Instead of splitting a 360p frame into 66 tiles of 128×128, we converted the model to accept the full 640×360 frame as a single tensor input.

This required switching from CoreML's `ImageType` (which expects `CVPixelBuffer`) to `TensorType` (which accepts `MLMultiArray`). The tradeoff: we lose CoreML's automatic pixel buffer conversion, but gain complete control over the data pipeline. We handle the pixel format conversion ourselves - and we do it faster than CoreML does.

The impact was dramatic. Tiled inference spent more time on dispatch overhead and memory copies than on actual neural network computation. **A single full-frame dispatch eliminated 96% of the ANE scheduling overhead.**

## The Double-Buffered Pipeline

With full-frame inference taking ~20.8ms on the ANE, we had a theoretical ceiling of ~48 FPS if everything else were free. But "everything else" isn't free: converting the input frame from BGRA pixel buffer to Float16 tensor costs ~0.3ms on CPU, and converting the output tensor back to a displayable BGRA pixel buffer costs ~1.3ms on GPU.

A naive sequential pipeline looks like this:

```
Frame N:  [convertIn 0.3ms][predict 20.8ms][convertOut 1.3ms]  = 22.4ms = 44.6 FPS
Frame N+1:                                                     [convertIn...]
```

22.4ms per frame yields 44.6 FPS - already above realtime. But we can do better by observing that convertIn runs on CPU, predict runs on ANE, and convertOut runs on GPU. These are three different pieces of hardware that can execute simultaneously.

We allocate two `FrameSession` objects, each containing its own pre-allocated `MLMultiArray`, `MLDictionaryFeatureProvider`, and Metal buffers. The sessions alternate:

```
Frame N   (session A): [convertIn CPU][── predict ANE ──][convertOut GPU]
Frame N+1 (session B):                                   [convertIn CPU][── predict ANE ──]
                                                          ↑ overlap: GPU + CPU simultaneous
```

While the GPU runs the Metal shader to convert frame N's output, the CPU simultaneously prepares frame N+1's input tensor. The ANE starts prediction as soon as its input is ready. The effective frame period drops to convertIn + predict = 21.1ms, yielding a theoretical 47.4 FPS.
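The scheduling win can be verified from the stage timings alone (the numbers are the measurements above; the helper itself is just arithmetic):

```python
CONVERT_IN_MS, PREDICT_MS, CONVERT_OUT_MS = 0.3, 20.8, 1.3

def fps(period_ms: float) -> float:
    return 1000.0 / period_ms

# Sequential: every stage sits on the critical path.
sequential = CONVERT_IN_MS + PREDICT_MS + CONVERT_OUT_MS   # 22.4 ms
# Double-buffered: convertOut overlaps the next frame's work,
# so the steady-state period is just convertIn + predict.
overlapped = CONVERT_IN_MS + PREDICT_MS                    # 21.1 ms

print(f"{fps(sequential):.1f} FPS -> {fps(overlapped):.1f} FPS")
# 44.6 FPS -> 47.4 FPS (theoretical; measured 44.4 on real H.264)
```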

Measured sustained performance over 300 frames (10 seconds of video): **44.4 FPS on real-world H.264 content, 45.9 FPS on synthetic test patterns.** The gap between theoretical and measured is primarily H.264 decode cost from AVAssetReader.

## The Metal Shader

The output conversion - Float16 planar tensor to interleaved BGRA8 pixel buffer - deserves special attention. Our initial CPU implementation took 7.6ms per frame, which would have been the bottleneck despite double-buffering.

The problem is memory access patterns. The model outputs three separate Float16 planes (R, G, B), each stored contiguously. The display pixel buffer expects interleaved BGRA8 - four bytes per pixel, tightly packed. Converting between these layouts requires stride-4 writes that defeat CPU SIMD vectorization.

The Metal compute shader `float16PlanarToBGRA8` solves this with one GPU thread per pixel:

```metal
// One thread per pixel. The GPU handles the interleaving natively.
kernel void float16PlanarToBGRA8(
    device const half *src [[buffer(0)]],
    device uchar4 *dst     [[buffer(1)]],
    constant uint &width   [[buffer(2)]],
    constant uint &height  [[buffer(3)]],
    uint2 gid [[thread_position_in_grid]])
{
    // Guard threads past the image edge when the dispatch grid is
    // rounded up to a whole number of threadgroups.
    if (gid.x >= width || gid.y >= height) return;

    uint idx = gid.y * width + gid.x;
    uint planeSize = width * height;
    half r = src[idx];
    half g = src[idx + planeSize];
    half b = src[idx + 2 * planeSize];
    // BGRA8 byte order: blue first, alpha last.
    dst[idx] = uchar4(uchar(clamp(b, 0.0h, 1.0h) * 255.0h),
                      uchar(clamp(g, 0.0h, 1.0h) * 255.0h),
                      uchar(clamp(r, 0.0h, 1.0h) * 255.0h),
                      255);
}
```

The shader reads from a pre-allocated `MTLBuffer` (shared memory, populated via `memcpy` from the MLMultiArray output). The GPU writes interleaved BGRA8 to another pre-allocated buffer. Total GPU time: 1.3ms - a 5.8x improvement over the CPU path. And with double-buffering, this 1.3ms is completely hidden behind the next frame's ANE prediction.
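For reference, the same layout transform in scalar form - a pure-Python sketch of what one GPU thread does per pixel, not the production CPU path:

```python
def planar_to_bgra8(src, width, height):
    """Convert three contiguous float planes (R, G, B) to an
    interleaved BGRA8 byte string, one pixel at a time."""
    plane = width * height
    out = bytearray(plane * 4)
    for i in range(plane):
        r, g, b = src[i], src[i + plane], src[i + 2 * plane]
        # Same clamp-and-scale as the Metal kernel, BGRA byte order.
        out[4 * i + 0] = int(min(max(b, 0.0), 1.0) * 255)
        out[4 * i + 1] = int(min(max(g, 0.0), 1.0) * 255)
        out[4 * i + 2] = int(min(max(r, 0.0), 1.0) * 255)
        out[4 * i + 3] = 255
    return bytes(out)

# 1x1 image: pure red in planar RGB -> BGRA bytes (0, 0, 255, 255).
print(planar_to_bgra8([1.0, 0.0, 0.0], 1, 1))
```

The stride-4 writes visible in the inner loop are exactly what defeats SIMD on the CPU; on the GPU each thread owns one 4-byte write, so the pattern is free.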

## Pre-Allocation: Zero Per-Frame Heap Allocation

Each `FrameSession` pre-allocates all buffers at initialization:

-   Input `MLMultiArray` (Float16, 3×360×640) - rewritten in-place every frame via `memcpy`
-   `MLDictionaryFeatureProvider` wrapping the array - created once, reused
-   Metal input buffer (shared memory mode) - `memcpy` from MLMultiArray output
-   Metal output buffer - read back via `memcpy` to CVPixelBuffer

During the frame loop, the only allocations are the CVPixelBuffers from AVAssetReader (which we don't control) and the output CVPixelBuffers for AVAssetWriter. There are zero heap allocations in the conversion or prediction path.

We also bypass Swift's cooperative thread pool by running the frame loop on a dedicated `DispatchQueue`. The cooperative pool's work-stealing behavior added 3-5ms of scheduling jitter in early testing - unacceptable when your frame budget is 21ms.
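The session rotation itself is trivial modulo indexing. A schematic of the reuse pattern, with a hypothetical stand-in class rather than the actual Swift `FrameSession`:

```python
class FrameSession:
    """Stand-in for one pre-allocated slot. In the real pipeline this
    owns an MLMultiArray, a feature provider, and Metal buffers,
    all allocated once at init."""
    def __init__(self, slot: int):
        self.slot = slot
        self.uses = 0

    def process(self, frame_index: int):
        self.uses += 1   # buffers rewritten in place, no allocation
        return (self.slot, frame_index)

sessions = [FrameSession(0), FrameSession(1)]
for n in range(6):
    sessions[n % 2].process(n)   # frames alternate between two slots

assert sessions[0].uses == 3 and sessions[1].uses == 3
```

Two slots suffice because at most two frames are in flight at once: one finishing on the GPU while the next one enters the CPU/ANE stages.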

## The Complete Pipeline

The full video upscale pipeline uses AVFoundation for decode and encode:

1.  **AVAssetReader** decodes H.264 frames into CVPixelBuffers
2.  **FullFrameUpscaler** runs the double-buffered ANE+Metal pipeline
3.  **AVAssetWriter** encodes upscaled frames as H.264 High profile
4.  **Audio remux** - a second pass copies the original audio track unchanged

Safety guards include a concurrency lock (`OSAllocatedUnfairLock`) that rejects overlapping requests with HTTP 429 rather than producing garbled output, a 4096px dimension cap to prevent OOM, `Task.checkCancellation()` between frames, and progress reporting via SSE throttled to 500ms intervals.

## Optimization Summary

Each optimization built on the previous one. The order mattered - double-buffering without full-frame inference would have hidden only tile-stitching overhead, not the fundamental dispatch bottleneck.

| Optimization | Effect |
| --- | --- |
| Full-frame model (no tiling) | 66 dispatches → 1. Eliminates 96% of ANE dispatch overhead |
| BN fusion (12 ops removed) | Conv2d absorbs BatchNorm. 42 → 30 MIL ops |
| TensorType (no CVPixelBuffer I/O) | Eliminates CoreML's pixel buffer conversion |
| FP16 only (no INT8 quantize) | No dequantization overhead. Model fits ANE SRAM |
| Pre-allocated FrameSession | Zero per-frame heap allocation |
| Dedicated DispatchQueue | Bypasses cooperative thread pool (saves 3-5ms jitter) |
| Metal convertOut | GPU interleave: 1.3ms vs 7.6ms CPU (5.8x faster) |
| Double-buffering | Hides GPU cost behind next prediction. Net: 0ms added |

## Performance Measurements

All measurements on M4 Max, Release build, 300 frames sustained (10 seconds), no thermal throttle observed:

| Phase | Hardware | Time |
| --- | --- | --- |
| convertIn | CPU | 0.3ms |
| predict | ANE | 20.8ms |
| convertOut | GPU | 1.3ms (hidden by overlap) |
| **Frame period** | | **~21.7ms = 44-46 FPS** |

The model achieves 37.54 dB PSNR on the Set5 benchmark - 3.88 dB above bicubic interpolation - with 453,388 parameters in a 928 KB CoreML package.

## Limitations and Future Work

The full-frame pipeline is currently resolution-locked to 360p → 720p. Other input resolutions fall back to the tiled pipeline at 5-10 FPS. Additional resolution-specific models (480p, 1080p) can be exported from our training pipeline but aren't bundled yet to keep the app size small.

All benchmarks are on M4 Max. The M1 has the same 16 ANE cores but an older microarchitecture - we expect lower but still-realtime throughput. We haven't published M1 numbers because we haven't tested on that hardware.

Real-time streaming upscale is also supported via WebSocket - a persistent FrameSession processes individual frames with zero-allocation reuse. The streaming path achieves similar per-frame latency but is bounded by network transport rather than compute.

PiperSR video upscale is available in [ToolPiper](https://modelpiper.com/docs/toolpiper) via the `/v1/video/upscale` REST endpoint and the `video_upscale` MCP tool.

_This is a technical paper in the [local-first AI on macOS](/blog/local-first-ai-macos) series. For a user-focused guide, see [Local Video Upscale on Mac](/blog/local-video-upscale-mac). For the model architecture and benchmarks, see [PiperSR: Open-Source ANE Super-Resolution](/blog/pipersr-open-source-ane-super-resolution)._

## FAQ

### What hardware do I need for 44 FPS video upscale?

The 44.4 FPS benchmark was measured on M4 Max. Any Mac with Apple Silicon (M1 or later) can run PiperSR, though FPS will vary by chip. The full-frame pipeline requires 640×360 input - other resolutions fall back to the tiled pipeline at 5-10 FPS.

### Can PiperSR upscale to 4K?

Currently PiperSR is a 2× model optimized for 360p→720p. 4K upscale would require a 4× model or two sequential passes. The resolution-locked full-frame path only handles 640×360 input - other resolutions use the slower tiled pipeline.

### Why is PiperSR faster than Real-ESRGAN?

PiperSR is purpose-built for Apple Neural Engine - every operation runs natively on ANE with no GPU fallback. Real-ESRGAN targets GPU with attention layers and operations that are inefficient on ANE. Combined with full-frame inference (1 dispatch vs 66 tiles) and double-buffering, PiperSR achieves 10-20× higher throughput on Apple Silicon.
