PiperSR is a 453K-parameter super-resolution model that upscales video 2x in realtime on Apple Silicon. On an M4 Max, it sustains 44.4 FPS on real-world H.264 content — 1.5x the 30 FPS playback rate. This paper describes the architecture decisions and optimizations that made this possible, starting from a naive tiled implementation at 5 FPS and ending with a double-buffered pipeline that saturates the Apple Neural Engine.

The Problem

Video super-resolution on consumer hardware faces a fundamental throughput constraint: you need to process at least 30 frames per second to match realtime playback. Each frame at 360p (640×360) contains 230,400 pixels. At 2x upscale, the output is 1280×720 — 921,600 pixels per frame, 30 times per second.
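As a sanity check on the budget, the raw pixel throughput works out as follows (a trivial Swift sketch of the arithmetic above):

```swift
// Pixel throughput required for realtime 2x upscale of 360p video.
let inPixels  = 640 * 360          // 230,400 input pixels per frame
let outPixels = 1280 * 720         // 921,600 output pixels per frame
let perSecond = outPixels * 30     // 27,648,000 upscaled pixels every second
```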

Most on-device super-resolution implementations use a tiled approach: split each frame into small patches, run inference on each patch, stitch the results. This works but introduces massive overhead. A 360p frame split into 128×128 tiles requires roughly 66 inference dispatches per frame. Each dispatch carries ANE scheduling overhead, memory copies, and tile boundary artifacts that need blending.
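The exact dispatch count depends on how much the tiles overlap for seam blending; the ~66 figure above corresponds to one particular overlap choice. A small Swift sketch of the tile arithmetic (tile size and stride are parameters, not values from the PiperSR source):

```swift
// Tiles needed to cover a frame: first tile at offset 0, then one per
// stride step until a tile reaches the far edge. stride < tile means
// overlapping tiles whose seams must be blended afterwards.
func tileCount(width: Int, height: Int, tile: Int, stride: Int) -> Int {
    func axis(_ extent: Int) -> Int {
        extent <= tile ? 1 : (extent - tile + stride - 1) / stride + 1
    }
    return axis(width) * axis(height)
}

// Even with no overlap, a 640×360 frame needs 5 × 3 = 15 dispatches;
// a 50%-overlap stride of 64 pushes it to 9 × 5 = 45.
```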

Our initial tiled pipeline ran at 5-10 FPS depending on resolution — functional for static images, but unusable for video.

The Model: Small, ANE-Native, Purpose-Built

PiperSR uses 6 residual blocks with 64 channels, SiLU activation, and PixelShuffle for the 2x upscale. The entire model is 928 KB in CoreML FP16 format. Key design constraints:

  • All operations are ANE-native. No fallback to GPU or CPU during inference. This means no operations that the ANE compiler would reject and silently reroute — we verified the MIL (Machine Learning Intermediate Language) graph to confirm every op runs on the Neural Engine.
  • Batch normalization fused into convolutions. BN fusion reduced the video model from 42 MIL operations to 30. Each fused op removes a multiply, an add, and the associated memory traffic. This was done during export from PyTorch, not at CoreML compile time.
  • FP16 throughout, no quantization. INT8 quantization would add dequantization overhead on every layer boundary. At 928 KB, the full FP16 model fits comfortably in the ANE's SRAM, so quantization would trade quality for zero performance gain.

We bundle two model variants: an ImageType model (128×128 tiles) for static image upscale, and a TensorType model (640×360 full-frame) for video. The distinction matters for pipeline architecture.

Full-Frame Inference: 66 Dispatches to 1

The single largest optimization was eliminating tiling entirely for the video path. Instead of splitting a 360p frame into 66 tiles of 128×128, we converted the model to accept the full 640×360 frame as a single tensor input.

This required switching from CoreML's ImageType (which expects CVPixelBuffer) to TensorType (which accepts MLMultiArray). The tradeoff: we lose CoreML's automatic pixel buffer conversion, but gain complete control over the data pipeline. We handle the pixel format conversion ourselves — and we do it faster than CoreML does.

The impact was dramatic. Tiled inference spent more time on dispatch overhead and memory copies than on actual neural network computation. A single full-frame dispatch eliminated 96% of the ANE scheduling overhead.

The Double-Buffered Pipeline

With full-frame inference taking ~20.8ms on the ANE, we had a theoretical ceiling of ~48 FPS if everything else were free. But "everything else" isn't free: converting the input frame from BGRA pixel buffer to Float16 tensor costs ~0.3ms on CPU, and converting the output tensor back to a displayable BGRA pixel buffer costs ~1.3ms on GPU.
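The convertIn step is plain channel de-interleaving plus normalization. A minimal CPU sketch (using Float instead of Float16 so it runs anywhere; the real path writes Float16 directly into a pre-allocated MLMultiArray):

```swift
// CPU-side convertIn: interleaved BGRA8 frame → planar float tensor.
func bgraToPlanar(_ bgra: [UInt8], width: Int, height: Int) -> [Float] {
    let planeSize = width * height
    var planar = [Float](repeating: 0, count: 3 * planeSize)
    for i in 0..<planeSize {
        let p = i * 4  // byte order per pixel: B, G, R, A
        planar[i]                 = Float(bgra[p + 2]) / 255  // R plane
        planar[i + planeSize]     = Float(bgra[p + 1]) / 255  // G plane
        planar[i + 2 * planeSize] = Float(bgra[p])     / 255  // B plane
    }
    return planar
}
```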

A naive sequential pipeline looks like this:

Frame N:  [convertIn 0.3ms][predict 20.8ms][convertOut 1.3ms]  = 22.4ms = 44.6 FPS
Frame N+1:                                                     [convertIn...]

22.4ms per frame yields 44.6 FPS — already above realtime. But we can do better by observing that convertIn runs on CPU, predict runs on ANE, and convertOut runs on GPU. These are three different pieces of hardware that can execute simultaneously.

We allocate two FrameSession objects, each containing its own pre-allocated MLMultiArray, MLDictionaryFeatureProvider, and Metal buffers. The sessions alternate:

Frame N   (session A): [convertIn CPU][── predict ANE ──][convertOut GPU]
Frame N+1 (session B):                                   [convertIn CPU][── predict ANE ──]
                                                          ↑ overlap: GPU + CPU simultaneous

While the GPU runs the Metal shader to convert frame N's output, the CPU simultaneously prepares frame N+1's input tensor. The ANE starts prediction as soon as its input is ready. The effective frame period drops to convertIn + predict = 21.1ms, yielding a theoretical 47.4 FPS.
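The timing claims above reduce to simple arithmetic (stage costs copied from the measurements in this section):

```swift
// Per-frame wall time in ms: sequential vs. double-buffered.
let convertIn = 0.3, predict = 20.8, convertOut = 1.3

// Sequential: all three stages back to back.
let sequential = convertIn + predict + convertOut      // 22.4 ms

// Double-buffered: convertOut (GPU) overlaps the next frame's
// convertIn (CPU) + predict (ANE); the steady-state period is the
// longer of the two legs.
let pipelined = max(convertIn + predict, convertOut)   // 21.1 ms

let fpsSequential = 1000 / sequential                  // ≈ 44.6 FPS
let fpsPipelined  = 1000 / pipelined                   // ≈ 47.4 FPS
```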

Measured sustained performance over 300 frames (10 seconds of video): 44.4 FPS on real-world H.264 content, 45.9 FPS on synthetic test patterns. The gap between theoretical and measured is primarily H.264 decode cost from AVAssetReader.

The Metal Shader

The output conversion — Float16 planar tensor to interleaved BGRA8 pixel buffer — deserves special attention. Our initial CPU implementation took 7.6ms per frame, which would have been the bottleneck despite double-buffering.

The problem is memory access patterns. The model outputs three separate Float16 planes (R, G, B), each stored contiguously. The display pixel buffer expects interleaved BGRA8 — four bytes per pixel, tightly packed. Converting between these layouts requires stride-4 writes that defeat CPU SIMD vectorization.
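A simplified Swift rendition of that CPU path (not the original implementation; Float stands in for Float16 for portability). Each channel's write lands 4 bytes from its neighbor's, which is the stride-4 pattern that defeats vectorization:

```swift
// CPU reference for convertOut: planar float → interleaved BGRA8.
func planarToBGRA8(_ src: [Float], width: Int, height: Int) -> [UInt8] {
    let planeSize = width * height
    var dst = [UInt8](repeating: 0, count: planeSize * 4)
    for i in 0..<planeSize {
        let r = UInt8(min(max(src[i], 0), 1) * 255)
        let g = UInt8(min(max(src[i + planeSize], 0), 1) * 255)
        let b = UInt8(min(max(src[i + 2 * planeSize], 0), 1) * 255)
        dst[i * 4]     = b   // stride-4 writes: B, G, R, A
        dst[i * 4 + 1] = g
        dst[i * 4 + 2] = r
        dst[i * 4 + 3] = 255
    }
    return dst
}
```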

The Metal compute shader float16PlanarToBGRA8 solves this with one GPU thread per pixel:

// One thread per pixel. GPU handles interleaving natively.
kernel void float16PlanarToBGRA8(
    device const half *src [[buffer(0)]],
    device uchar4 *dst     [[buffer(1)]],
    constant uint &width   [[buffer(2)]],
    constant uint &height  [[buffer(3)]],
    uint2 gid [[thread_position_in_grid]])
{
    // Guard against over-dispatch when the grid is rounded up to a
    // multiple of the threadgroup size.
    if (gid.x >= width || gid.y >= height) return;
    uint idx = gid.y * width + gid.x;
    uint planeSize = width * height;
    half r = src[idx];
    half g = src[idx + planeSize];
    half b = src[idx + 2 * planeSize];
    dst[idx] = uchar4(uchar(clamp(r, 0.0h, 1.0h) * 255.0h),
                      uchar(clamp(g, 0.0h, 1.0h) * 255.0h),
                      uchar(clamp(b, 0.0h, 1.0h) * 255.0h),
                      255);
}

The shader reads from a pre-allocated MTLBuffer (shared memory, populated via memcpy from the MLMultiArray output). The GPU writes interleaved BGRA8 to another pre-allocated buffer. Total GPU time: 1.3ms — a 5.8x improvement over the CPU path. And with double-buffering, this 1.3ms is completely hidden behind the next frame's ANE prediction.

Pre-Allocation: Zero Per-Frame Heap Allocation

Each FrameSession pre-allocates all buffers at initialization:

  • Input MLMultiArray (Float16, 3×360×640) — rewritten in-place every frame via memcpy
  • MLDictionaryFeatureProvider wrapping the array — created once, reused
  • Metal input buffer (shared memory mode) — memcpy from MLMultiArray output
  • Metal output buffer — read back via memcpy to CVPixelBuffer

During the frame loop, the only allocations are the CVPixelBuffers from AVAssetReader (which we don't control) and the output CVPixelBuffers for AVAssetWriter. There are zero heap allocations in the conversion or prediction path.
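The pre-allocation pattern can be sketched as a session object that owns its buffers for its whole lifetime (simplified: the real FrameSession holds an MLMultiArray, a feature provider, and MTLBuffers rather than raw pointers):

```swift
// One session = one complete set of reusable frame buffers.
// Allocated once at init, rewritten in place every frame, freed at deinit.
final class FrameSession {
    let input: UnsafeMutableBufferPointer<Float>   // planar RGB, 3×H×W
    let output: UnsafeMutableBufferPointer<UInt8>  // 2x-upscaled BGRA8
    init(width: Int, height: Int) {
        input  = .allocate(capacity: 3 * width * height)
        output = .allocate(capacity: (2 * width) * (2 * height) * 4)
    }
    deinit {
        input.deallocate()
        output.deallocate()
    }
}
```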

We also bypass Swift's cooperative thread pool by running the frame loop on a dedicated DispatchQueue. The cooperative pool's work-stealing behavior added 3-5ms of scheduling jitter in early testing — unacceptable when your frame budget is 21ms.
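A minimal sketch of pinning the loop to its own serial queue (queue label and function names are illustrative, not from the PiperSR source):

```swift
import Dispatch

// A dedicated serial queue keeps the frame loop off Swift's cooperative
// thread pool and its work-stealing scheduler.
let frameQueue = DispatchQueue(label: "pipersr.frameloop", qos: .userInteractive)

func runFrameLoop(frames: [Int],
                  process: @escaping (Int) -> Void,
                  done: @escaping () -> Void) {
    frameQueue.async {
        for frame in frames { process(frame) }  // strictly in order
        done()
    }
}
```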

The Complete Pipeline

The full video upscale pipeline uses AVFoundation for decode and encode:

  1. AVAssetReader decodes H.264 frames into CVPixelBuffers
  2. FullFrameUpscaler runs the double-buffered ANE+Metal pipeline
  3. AVAssetWriter encodes upscaled frames as H.264 High profile
  4. Audio remux — a second pass copies the original audio track unchanged

Safety guards include a concurrency lock (OSAllocatedUnfairLock) that rejects concurrent jobs with HTTP 429 rather than producing garbled output, a 4096px dimension cap to prevent OOM, Task.checkCancellation() between frames, and progress reporting via SSE throttled to 500ms intervals.
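The concurrency guard amounts to a try-acquire at job start. A sketch using a semaphore in place of OSAllocatedUnfairLock (startJob is a hypothetical name; the 429 is surfaced by the HTTP layer in the real server):

```swift
import Dispatch

// One upscale job at a time: a failed try-acquire maps to HTTP 429
// instead of letting two pipelines interleave and corrupt output.
let jobSlot = DispatchSemaphore(value: 1)

func startJob(_ body: () -> Void) -> Int {
    guard jobSlot.wait(timeout: .now()) == .success else { return 429 }
    defer { jobSlot.signal() }
    body()
    return 200
}
```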

Optimization Summary

Each optimization built on the previous one. The order mattered — double-buffering without full-frame inference would have hidden only tile-stitching overhead, not the fundamental dispatch bottleneck.

Optimization                         Effect
Full-frame model (no tiling)         66 dispatches → 1; eliminates 96% of ANE dispatch overhead
BN fusion (12 ops removed)           Conv2d absorbs BatchNorm; 42 → 30 MIL ops
TensorType (no CVPixelBuffer I/O)    Eliminates CoreML's pixel buffer conversion
FP16 only (no INT8 quantize)         No dequantization overhead; model fits in ANE SRAM
Pre-allocated FrameSession           Zero per-frame heap allocation
Dedicated DispatchQueue              Bypasses cooperative thread pool (saves 3-5ms jitter)
Metal convertOut                     GPU interleave: 1.3ms vs 7.6ms CPU (5.8x faster)
Double-buffering                     Hides GPU cost behind next prediction; net 0ms added

Performance Measurements

All measurements on M4 Max, Release build, 300 frames sustained (10 seconds), no thermal throttle observed:

Phase          Hardware    Time
convertIn      CPU         0.3ms
predict        ANE         20.8ms
convertOut     GPU         1.3ms (hidden by overlap)
Frame period               ~21.7ms = 44-46 FPS

The model achieves 37.54 dB PSNR on the Set5 benchmark — 3.88 dB above bicubic interpolation — with 453,388 parameters in a 928 KB CoreML package.

Limitations and Future Work

The full-frame pipeline is currently resolution-locked to 360p → 720p. Other input resolutions fall back to the tiled pipeline at 5-10 FPS. Additional resolution-specific models (480p, 1080p) can be exported from our training pipeline but aren't bundled yet to keep the app size small.

All benchmarks are on M4 Max. The M1 has the same 16 ANE cores but an older microarchitecture — we expect lower but still-realtime throughput. We haven't published M1 numbers because we haven't tested on that hardware.

Real-time streaming upscale is also supported via WebSocket — a persistent FrameSession processes individual frames with zero-allocation reuse. The streaming path achieves similar per-frame latency but is bounded by network transport rather than compute.

PiperSR video upscale is available in ToolPiper via the /v1/video/upscale REST endpoint and the video_upscale MCP tool.