The hardware blind spot in super resolution

The entire super resolution research field optimizes for hardware most people don't own.

Real-ESRGAN, the most widely deployed super resolution model, uses Residual-in-Residual Dense Blocks (RRDB) - an architecture tuned for NVIDIA CUDA GPUs. BSRN, the NTIRE 2022 efficient SR challenge winner, benchmarks on an RTX 3090. SwinIR uses shifted window attention, a transformer design that maps naturally to CUDA tensor cores. ESPCN, SAFMN, PAN - every model in the lightweight super resolution literature assumes NVIDIA hardware for both training and deployment. The "state of the art" is measured in FP32 throughput on RTX 4090s. Papers report CUDA FPS as though it were a universal metric.

Meanwhile, every Mac sold since late 2020 ships with a Neural Engine - dedicated ML silicon capable of 11 trillion operations per second on M1 and up to 38 trillion on M4. That's purpose-built hardware for exactly this kind of workload: dense convolutions over spatial feature maps, the core operation in super resolution. And it sits idle. Not because the hardware can't handle SR inference, but because nobody builds SR models for it. The research community trains on CUDA, benchmarks on CUDA, optimizes for CUDA, and treats CoreML conversion as an afterthought. When a GPU-optimized model does get converted to CoreML, operations like channel attention and dynamic shapes silently fall back to CPU, creating pipeline stalls that destroy throughput. A model that runs at 30 FPS on an RTX 3090 limps along at 2-4 FPS on Apple Silicon - not because the hardware is slow, but because the model was never designed for it.

This is the gap PiperSR was built to close. Not by converting an existing model, but by designing one from scratch for ANE constraints: no attention layers (ANE can't run them efficiently), batch normalization fused at export time, TensorType inputs for full-frame processing, and a double-buffered pipeline that runs CPU, ANE, and Metal GPU simultaneously on different frames. The result is 44.4 FPS at 360p-to-720p on an M4 Max - 1.5x realtime - on hardware that ships in every MacBook Pro, Mac Mini, and Mac Studio.

PiperSR exists because we decided to train a super resolution model specifically for the hardware that over 100 million Mac users actually have.

What super resolution actually does

Super resolution increases an image's resolution while recovering detail that wasn't in the source. Traditional upscaling (bicubic, Lanczos) interpolates between existing pixels - the result is higher resolution but visually blurry, because no new information is added. Neural super resolution works differently: a model trained on millions of paired low-resolution and high-resolution images learns to predict what fine detail should exist in a given low-resolution input. A photo of grass goes from a green blur to individual blades. Text goes from pixelated blocks to readable characters. The model has seen enough examples of each pattern to make educated predictions about what should be there.

The pipeline is a single forward pass through a convolutional neural network: feature extraction, non-linear mapping through residual blocks, and reconstruction via PixelShuffle. No iterative optimization, no diffusion sampling. The models are surprisingly small - PiperSR is 928 KB - but the computation per pixel is intensive: a 2x upscale of a 640x360 video frame produces 921,600 individually predicted pixels, 30 times per second for video. This is why super resolution has historically required dedicated GPUs or cloud servers, and why the Neural Engine changes the equation.
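The reconstruction step is worth seeing concretely. Below is a minimal NumPy sketch of PixelShuffle (an illustration, not the PiperSR source): for a 2x model, the final convolution emits 4x the output channels, and PixelShuffle rearranges those channels into a 2x-larger spatial grid.

```python
import numpy as np

def pixel_shuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Rearrange (C*r^2, H, W) -> (C, H*r, W*r), matching PyTorch's nn.PixelShuffle."""
    c2, h, w = x.shape
    c = c2 // (r * r)
    # split channels into a (C, r, r) group, then interleave into the spatial dims
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)  # (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)

# a 2x model maps a 640x360 frame's 12 output channels (3 RGB * 2^2)
# to a 1280x720 RGB image in a single rearrangement
features = np.random.rand(12, 360, 640).astype(np.float16)
frame = pixel_shuffle(features, 2)
print(frame.shape)                       # (3, 720, 1280)
print(frame.shape[1] * frame.shape[2])   # 921600 - each pixel predicted by the model
```

This is why the upscale itself is nearly free: all the learned computation happens in the convolutional trunk, and the resolution increase is a pure memory rearrangement.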

The state of the art (April 2026)

Super resolution research has been active since SRCNN (Dong et al., 2014), the first CNN-based approach that beat traditional methods. The practical landscape in 2026 is shaped by a few dominant forces and a notable gap.

Real-ESRGAN: the GPU standard

Real-ESRGAN (Xintao Wang et al., 2021) remains the most widely deployed super resolution model. It handles complex degradations - JPEG artifacts, noise, blur, downsampling - and produces visually impressive results on real-world images. Variants like Real-ESRGAN x4plus and the anime-optimized version are available on nearly every cloud upscaling service and embedded in creative tools from Stable Diffusion to ComfyUI. As of March 2026, Real-ESRGAN is the default choice for GPU-based super resolution, with over 29,000 GitHub stars and integrations in dozens of commercial tools.

The limitation is hardware targeting. Real-ESRGAN uses RRDB blocks (Residual-in-Residual Dense Blocks) whose dense skip connections are tuned for NVIDIA GPU inference. On Apple Silicon, these models run on the GPU via Metal or fall back to CPU. They don't touch the Neural Engine because densely connected blocks generate memory traffic that is inefficient on ANE's memory-bandwidth-bound architecture. Typical throughput on Apple Silicon: 2-4 FPS for a 360p frame. Functional for single images, unusable for video.

The lightweight model wave

Research has shifted toward efficient super resolution models that balance quality against speed. The NTIRE (New Trends in Image Restoration and Enhancement) challenges at CVPR have driven a generation of models competing on the quality-versus-parameters frontier. Notable entries as of early 2026:

  • BSRN (Blueprint Separable Residual Network, Li et al.) - 332K parameters, 38.10 dB PSNR on Set5 at 2x. Uses blueprint separable convolutions and channel attention. Winner of the NTIRE 2022 efficient SR challenge. GPU-optimized.
  • SAFMN (Spatial Adaptive Feature Modulation Network) - 228K parameters, 38.00 dB on Set5. Feature modulation instead of attention. Efficient but still GPU-targeted.
  • PAN (Pixel Attention Network) - 272K parameters, 37.58 dB on Set5. Pixel-level attention with competitive quality. GPU-optimized.
  • ESPCN (Efficient Sub-Pixel Convolutional Network, Shi et al., 2016) - The original real-time model at 20K parameters. Still competitive for simple 2x upscaling at 33.13 dB, but quality is well below modern models.
  • SwinIR (Liang et al., 2021) - Transformer-based architecture using shifted window attention. Higher quality than CNN-based models at the cost of significantly more parameters and compute. Not practical for real-time inference on edge hardware.

Every model in this list targets GPU inference. Their architectures use operations that are theoretically efficient (fewer FLOPs) but empirically no faster on ANE due to increased memory traffic from additional operations. The super resolution community has largely ignored Apple Neural Engine as a deployment target. Models are trained on CUDA, benchmarked on CUDA, and optimized for CUDA. CoreML conversion is an afterthought when it happens at all.

CoreML and the ANE gap

Apple's CoreML framework makes it straightforward to convert PyTorch models to run on Apple Silicon. But "runs on Apple Silicon" is not the same as "runs efficiently on Neural Engine." CoreML's automatic device placement often routes operations to GPU when ANE would be faster, or vice versa. Operations that the ANE compiler can't handle (certain attention mechanisms, dynamic shapes, some activation functions) are silently routed to CPU, creating pipeline stalls that destroy throughput.

The result is a paradox: every Mac has a 16-core Neural Engine capable of 11-38 TOPS depending on chip generation, and almost no super resolution model uses it effectively. This isn't Apple's fault - the ANE is well-suited to convolution-heavy workloads. It's a community problem. Most ML researchers don't have access to ANE profiling tools, don't target CoreML during training, and don't verify device placement after conversion. The research ecosystem is built around NVIDIA hardware, and most papers don't even mention Apple Silicon in their evaluation sections.

This gap is what motivated PiperSR. The hardware is there. The framework is there. What was missing was a model designed from the ground up with ANE constraints as first-class requirements.

Commercial tools

Topaz Video AI ($199 one-time) is the leading desktop upscaling tool. It ships multiple specialized models for different content types (faces, text, animation) and supports up to 4x upscale with temporal consistency across video frames. Topaz also handles deinterlacing, frame interpolation, and stabilization. Quality is excellent, particularly for face restoration. The tradeoff: it's a dedicated application with a learning curve, GPU-focused (it uses Metal on Mac but not ANE), and the price point excludes casual use.

Cloud services (Let's Enhance, Upscale.media, CapCut, Descript) provide upscaling as a web feature. Pricing ranges from free-with-watermark to $34/month. All require uploading your media, which creates privacy concerns for sensitive content and adds latency from the upload/download round trip. Quality varies by service, and most use Real-ESRGAN variants or proprietary models trained on similar architectures.

waifu2x, the open-source upscaler that popularized neural upscaling for anime and illustrations, remains functional but hasn't seen significant updates since 2022. Its architecture predates Real-ESRGAN and most modern efficient models. It retains a loyal user base in the anime community but is no longer competitive on photographic content.

The ANE thesis: why we built PiperSR from scratch

The Apple Neural Engine is the most underutilized ML accelerator in consumer hardware. Every Mac with Apple Silicon has an ANE capable of at least 11 TOPS, but the super resolution model ecosystem ignores it entirely. Models are designed for CUDA, converted to CoreML as an afterthought, and run at a fraction of their potential because the architecture was never intended for ANE's constraints. PiperSR exists because we believe ANE-native design is a viable and demonstrably faster path for on-device super resolution - and we built the evidence to prove it.

Four architectural decisions define PiperSR and separate it from every converted-from-GPU model:

No attention layers. Channel attention (SE blocks, ESAB) is the standard quality-boosting technique in modern super resolution. On CUDA, attention adds negligible overhead. On ANE, it adds approximately 2.9ms per block at 360x640 resolution - nearly the cost of an entire residual block. The reason is fundamental: ANE is memory-bandwidth-bound at typical feature map sizes. The reduce_mean operation over 230K spatial elements is dominated by data movement, not arithmetic. Similarly, blueprint separable convolutions (BSConv) reduce FLOPs by 7.9x on paper but measure 0% speedup on ANE - more operations means more bandwidth tax regardless of arithmetic intensity. This is the key insight that separates ANE-native design from GPU-first-then-convert. PiperSR trades roughly 0.5 dB of PSNR for 10-20x higher throughput. The tradeoff is deliberate.
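The bandwidth argument reduces to simple arithmetic. A back-of-envelope sketch (the 64-channel width matches PiperSR; the bandwidth framing is illustrative, not a measured ANE spec): a squeeze-and-excite style reduce_mean must stream the entire feature map while performing roughly one addition per element.

```python
# Why a global-pool attention block is bandwidth-bound, not compute-bound.
# Assumes a 64-channel feature map at 360x640 in FP16 (2 bytes/element).
channels, h, w = 64, 360, 640
bytes_per_elem = 2  # FP16

spatial_elems = h * w                              # elements per channel
reads = channels * spatial_elems * bytes_per_elem  # bytes the pool streams in
flops = channels * spatial_elems                   # ~one add per element

print(f"{spatial_elems:,} spatial elements")             # 230,400
print(f"{reads / 1e6:.1f} MB read per attention block")  # ~29.5 MB
# ~0.5 FLOPs per byte: far too little arithmetic to keep a TOPS-class
# accelerator busy, so the block's cost is almost entirely data movement
print(f"{flops / reads:.2f} FLOPs/byte")
```

At half a FLOP per byte, the accelerator spends its time waiting on memory, which is why the same operation that is negligible on a CUDA GPU costs milliseconds on ANE.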

Batch normalization fused at export time. The training graph uses standard batch normalization (42 MIL operations). At CoreML export, BN parameters are folded into the preceding convolution weights, reducing the graph to 30 operations. Fewer operations means fewer ANE dispatches and less memory traffic. This is a common optimization in deployment, but most SR models skip it because GPU inference doesn't benefit as much.
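The fusion itself is standard algebra, easy to verify numerically. A NumPy sketch (illustrative, not the PiperSR export code), using a 1x1 convolution treated as a matrix multiply: with BN statistics (mu, var) and affine parameters (gamma, beta), the fused weights are W' = s*W and b' = s*(b - mu) + beta, where s = gamma / sqrt(var + eps).

```python
import numpy as np

rng = np.random.default_rng(0)
c_in, c_out, n = 8, 16, 100

# conv parameters (1x1 conv over n flattened pixels) and BN parameters
W = rng.normal(size=(c_out, c_in))
b = rng.normal(size=c_out)
gamma, beta = rng.normal(size=c_out), rng.normal(size=c_out)
mu, var = rng.normal(size=c_out), rng.uniform(0.5, 2.0, size=c_out)
eps = 1e-5

x = rng.normal(size=(c_in, n))

# reference path: convolution followed by inference-mode batch norm
y_ref = (W @ x + b[:, None] - mu[:, None]) / np.sqrt(var + eps)[:, None]
y_ref = gamma[:, None] * y_ref + beta[:, None]

# fused path: fold the BN scale/shift into the conv weights and bias
scale = gamma / np.sqrt(var + eps)
W_fused = scale[:, None] * W
b_fused = scale * (b - mu) + beta
y_fused = W_fused @ x + b_fused[:, None]

print(np.allclose(y_ref, y_fused))  # True: one dispatch instead of two
```

The outputs are identical to floating-point precision, so the fusion is free quality-wise; the win is purely fewer ANE dispatches and less intermediate memory traffic.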

TensorType inputs for full-frame processing. The video model accepts complete 640x360 frames as single Float16 tensors. A tiled approach at 128x128 would require approximately 66 ANE dispatches per frame. Each dispatch carries scheduling overhead regardless of computational cost. A single full-frame dispatch eliminates over 98% of that overhead - from 66 round-trips to the Neural Engine down to 1. This requires the model to be compiled for a specific input shape, which is why the optimized pipeline is resolution-locked to 360p-to-720p. Other resolutions fall back to tiling automatically.
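The dispatch arithmetic is easy to reproduce. A hypothetical tile counter (the overlap and padding policy here are assumptions; the article's ~66 figure depends on the exact tiler used):

```python
import math

def tile_count(width: int, height: int, tile: int, overlap: int = 0) -> int:
    """Number of tile dispatches needed to cover one frame.

    Overlapping tiles (commonly used to hide seam artifacts at tile
    borders) shrink the effective stride and raise the dispatch count.
    """
    stride = tile - overlap
    nx = max(1, math.ceil((width - overlap) / stride))
    ny = max(1, math.ceil((height - overlap) / stride))
    return nx * ny

# non-overlapping 128px tiles over a 640x360 frame
print(tile_count(640, 360, 128))              # 15 dispatches
# a 64px seam-hiding overlap triples the count
print(tile_count(640, 360, 128, overlap=64))  # 45 dispatches
# a single full-frame TensorType input needs exactly 1
```

Whatever the exact overlap policy, the pattern is the same: tiling multiplies fixed per-dispatch scheduling overhead, while a full-frame input pays it exactly once per frame.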

Double-buffered pipeline across three hardware units. This is PiperSR's strongest differentiator. The pipeline runs CPU, ANE, and Metal GPU simultaneously on different frames:

  • Frame N+1: CPU converts the input to a Float16 tensor (0.3ms)
  • Frame N: ANE runs the super resolution prediction (20.8ms)
  • Frame N-1: A Metal compute shader converts the output tensor to a pixel buffer (1.3ms)

Three different hardware units, three different frames, zero idle time. Pre-allocated FrameSession buffers mean zero per-frame heap allocation. A dedicated DispatchQueue bypasses Swift's cooperative thread pool to eliminate 3-5ms of scheduling jitter. The effective per-frame time on M4 Max is approximately 22ms, yielding 44-46 FPS sustained over 300+ frames with no thermal throttling observed.
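The double-buffering idea can be demonstrated in miniature. A toy three-stage pipeline in Python (illustrative only - the real pipeline is Swift with a dedicated DispatchQueue and hardware stages): each stage runs on its own thread with small bounded queues between them, so in steady state every stage works on a different frame and throughput is set by the slowest stage, not the sum of all three.

```python
import queue
import threading

def stage(fn, inbox, outbox):
    """Apply fn to each frame from inbox and forward it; None means shutdown."""
    while (frame := inbox.get()) is not None:
        outbox.put(fn(frame))
    outbox.put(None)  # propagate shutdown downstream

# bounded queues play the role of the double buffers between hardware units
q_in, q_mid, q_out, done = (queue.Queue(maxsize=2) for _ in range(4))
threads = [
    threading.Thread(target=stage, args=(lambda f: f, q_in, q_mid)),      # "CPU": prepare tensor
    threading.Thread(target=stage, args=(lambda f: f * 2, q_mid, q_out)), # "ANE": predict
    threading.Thread(target=stage, args=(lambda f: f + 1, q_out, done)),  # "GPU": encode
]
for t in threads:
    t.start()

for frame in range(5):  # feed 5 frames, then signal shutdown
    q_in.put(frame)
q_in.put(None)

results = []
while (r := done.get()) is not None:
    results.append(r)
print(results)  # [1, 3, 5, 7, 9] - frames emerge in order, stages overlapped
for t in threads:
    t.join()
```

With stage times of 0.3ms, 20.8ms, and 1.3ms, the overlapped pipeline's steady-state frame time approaches the 20.8ms ANE stage rather than the 22.4ms sum, which is what makes the ~22ms effective per-frame figure achievable.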

The result is real-time 2x video upscale on hardware that costs $1,599 (M4 Max Mac Mini). The equivalent CUDA pipeline on a $1,599 RTX 4090 runs Real-ESRGAN at roughly 15 FPS at 4x resolution - different scale factors and different quality tradeoffs, but the point is that ANE is a viable deployment target for real-time video SR if you design the model for it from the start. Nobody else has done this because nobody else has tried.

What's coming

Our roadmap

Higher-resolution video models. The current full-frame pipeline is resolution-locked to 360p to 720p. We can export additional resolution-specific models (480p to 960p, 540p to 1080p) from the PiperSR training pipeline. These aren't bundled yet to keep app size small, but the double-buffered pipeline architecture supports them without modification. The engineering work is in ANE profiling and training - the pipeline code is ready.

4x model. PiperSR is currently 2x only. A 4x model doubles each output dimension relative to 2x - four times the output pixels - and roughly halves throughput, but 4x upscale at 20+ FPS on ANE is architecturally feasible. The alternative - two sequential 2x passes - works today but with quality degradation from double prediction. A native 4x architecture with PixelShuffle(4) is the better path.

Batch processing UI. The REST endpoint and MCP tool already support batch workflows. A drag-and-drop batch interface in ModelPiper that processes a folder of images is planned, with progress tracking and side-by-side before/after comparison.

Industry horizon

Apple's own ML upscaling. MetalFX Temporal Upscaling ships in macOS for game rendering - it upscales game frames using motion vectors and temporal data. As of macOS 15, this is limited to games using the MetalFX API. There are no public signals that Apple plans to expose general-purpose neural upscaling as a system framework, but the hardware capability is clearly there, and Apple has shown willingness to ship on-device ML features (Live Text OCR, Visual Look Up) when the models are small enough.

ANE-aware model design spreading. The success of ANE-optimized models in other domains (FluidAudio for speech, Apple's own on-device models for OCR and language understanding) suggests the super resolution community will eventually target ANE directly. As of March 2026 this hasn't happened outside PiperSR, but the economic incentive grows with every Mac shipped - over 40 million Macs sold annually, all with Neural Engines.

Video diffusion upscaling. Diffusion-based super resolution (StableSR, PASD) produces remarkable results on still images but is too slow for video at current model sizes. Research into efficient diffusion architectures (latent consistency models, progressive distillation) may eventually make diffusion-quality upscaling practical at video rates. This is years away from consumer deployment on edge hardware.

Temporal super resolution models. Current per-frame models like PiperSR process each frame independently. Research into video-aware models that use information from adjacent frames (temporal convolutions, optical flow, recurrent architectures) could improve consistency on motion-heavy content. The ANE-specific challenge is keeping the temporal context window small enough to fit in SRAM while still providing meaningful multi-frame information.

How ToolPiper handles this today

ToolPiper bundles three CoreML super resolution models: two PiperSR 2x variants and PurePhoto SPAN for 4x image upscaling. All three are included in the app - no download step, no waiting, no model management. They work immediately after installation.

Image upscale

The Local Image Upscale template processes single images at 2x (PiperSR) or 4x (PurePhoto SPAN). Drop an image onto the pipeline input - drag from Finder, paste from clipboard, or use the file picker. The upscaled result appears immediately. Any resolution up to 8192x8192 pixels, any format (PNG, JPEG, WebP). Output is always uncompressed PNG to preserve every pixel of recovered detail.

PurePhoto SPAN 4x is a 16-layer residual network with attention, processing 256x256 tiles. It's the default for images because 4x upscale is what most users want for photos, product images, and scanned documents. It's particularly strong on photographic content where texture recovery matters - skin detail, fabric, foliage, architectural elements. PiperSR 2x is faster and produces cleaner results on screenshots, UI elements, and text-heavy images where 2x is sufficient and speed matters more than maximum enlargement.

Processing time ranges from 1-5 seconds depending on input size and the selected model. A 1000x1000 image upscaled 4x takes about 3 seconds on an M2. Both models process on the Neural Engine with GPU-assisted tile stitching.

Ready to try it? Set up local image upscaling - drop an image, see the result in seconds.

Video upscale

The Local Video Upscale template processes MP4 and MOV files with H.264 encoding. Drop a video file and the upscale starts immediately with real-time progress via Server-Sent Events showing frame count and estimated completion. Audio is remuxed unchanged - no re-encoding, no quality loss, no sync issues. The audio track passes through the pipeline untouched.

The optimized full-frame pipeline handles 360p to 720p at 44 FPS on M4 Max - about 1.5x the 30 FPS playback rate. This means a 10-minute video completes in roughly 6.8 minutes. Other input resolutions fall back to the tiled pipeline at 5-10 FPS - same visual quality, lower throughput. The output is H.264 High profile MP4.
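Those throughput numbers translate to wall-clock time with simple arithmetic (a quick sketch using the article's M4 Max benchmark figure of 44.4 FPS):

```python
fps_pipeline = 44.4   # sustained upscale rate on M4 Max (benchmark figure)
fps_playback = 30.0   # source video frame rate
minutes = 10

frames = minutes * 60 * fps_playback       # frames to process
seconds = frames / fps_pipeline            # total processing time
print(f"{frames:.0f} frames -> {seconds / 60:.1f} min")  # 18000 frames -> ~6.8 min
print(f"{fps_pipeline / fps_playback:.2f}x realtime")    # ~1.48x
```

The same arithmetic scales linearly: processing time is just source duration divided by the realtime multiple, since the audio remux adds negligible overhead.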

The pipeline architecture is what makes this fast. Three hardware units run simultaneously via double-buffering: CPU converts input frames to Float16 tensors (0.3ms), the Neural Engine runs super resolution prediction (20.8ms), and a custom Metal compute shader converts the output tensor back to a displayable pixel buffer (1.3ms). Pre-allocated frame sessions mean zero per-frame heap allocation. A dedicated DispatchQueue bypasses Swift's cooperative thread pool to eliminate 3-5ms of scheduling jitter.

A concurrency lock returns HTTP 429 if a second upscale is requested while one is running - no garbled output from competing writes to the video encoder. A 4096px dimension cap prevents out-of-memory conditions. Task.checkCancellation() between frames allows clean interruption.

Ready to try it? Upscale your first video - the model is bundled, nothing to configure.

Real-time streaming upscale

ToolPiper also supports real-time video upscale via WebSocket on port 10004. External applications can send individual frames and receive upscaled output with the same per-frame latency as the file-based pipeline. The streaming path reuses a persistent FrameSession with zero-allocation frame processing. This is a Pro feature, targeted at applications that need live upscaling - real-time preview, streaming workflows, or integration with video production pipelines.

MCP tools and REST API

Both upscale capabilities are exposed programmatically for automated and scripted workflows:

  • image_upscale MCP tool - send a base64-encoded image, get the upscaled PNG back. Works in Claude Code, Cursor, Windsurf, or any MCP client.
  • video_upscale MCP tool - send a video file path, get the upscaled MP4. Progress streams via SSE.
  • POST /v1/images/upscale REST endpoint - for scripted batch workflows. Concurrent requests queue and process sequentially to avoid memory pressure.
  • POST /v1/video/upscale REST endpoint - with progress streaming and concurrent request queuing (429 if busy).
  • benchmark_upscale MCP tool - runs the benchmark suite (5 configurations) and reports throughput for the current hardware.

The MCP interface means any AI agent can incorporate upscaling into automated workflows. A documentation agent that screenshots a UI, upscales it for high-DPI display, and embeds it in a knowledge base - all through tool calls, no human in the loop. A video processing pipeline that ingests low-resolution footage, upscales it, and publishes the result. The upscale is just a tool call.

Models and hardware

ToolPiper ships three bundled super resolution models. All run on Apple Silicon (M1 or later) with no separate download or configuration.

PiperSR 2x (Image variant) uses the PiperSR_2x.mlmodelc CoreML package with ImageType input. It processes images in 128x128 pixel tiles, handling any resolution up to 8192x8192. The model is 453K parameters / 928 KB in FP16 format. It produces clean 2x upscaling at 37.54 dB PSNR on Set5 - 3.88 dB above bicubic. Best suited for screenshots, UI elements, text-heavy content, and situations where 2x is sufficient. Processing takes 1-3 seconds for typical images.

PiperSR 2x (Video variant) uses the PiperSR_2x_video_720p.mlmodelc CoreML package with TensorType input. This is the same model architecture with batch normalization fused into convolutions (42 MIL operations reduced to 30) and compiled for full-frame 640x360 input. It accepts complete frames as single tensors - no tiling. The double-buffered pipeline achieves 44.4 FPS sustained on M4 Max. Resolution-locked to 640x360 input / 1280x720 output. Other resolutions automatically fall back to the tiled image model at 5-10 FPS.

PurePhoto SPAN 4x is a 16-layer residual network with attention, processing 256x256 tiles. It quadruples resolution: a 1000x1000 photo becomes 4000x4000. Default for image upscaling. Strong on photographic content - texture recovery, fine detail, natural scenes. Slightly slower than PiperSR 2x due to the larger model and 4x output.

Hardware requirements are minimal: any Mac with Apple Silicon (M1 or later) and at least 8 GB of RAM. The models are small enough that memory is never the bottleneck. The 44 FPS benchmark is from M4 Max. M1 has the same 16 Neural Engine cores but an older microarchitecture - expect lower but still-usable throughput. We haven't published M1 numbers because we haven't benchmarked on that hardware.

The PiperSR architecture

PiperSR is a 453,388-parameter super resolution model with 6 residual blocks, 64 channels, SiLU activation, and PixelShuffle for the 2x upscale. The entire model is 928 KB in CoreML FP16 format. The strategic rationale for every design decision is covered in the ANE thesis section above. This section covers the implementation details.

Full ANE placement verification. We verified the MIL (Machine Learning Intermediate Language) graph to confirm every operation runs on the Neural Engine with no silent fallback to GPU or CPU. This sounds trivial but isn't - most CoreML models have operations that the ANE compiler quietly routes elsewhere, creating pipeline stalls that destroy throughput. A single operation falling back to CPU can add 5-10ms as data transfers between processors. The verification step is essential and something most SR models skip entirely.

Quality benchmarks. PiperSR achieves 37.54 dB PSNR on the Set5 benchmark at 2x upscaling - 3.88 dB above bicubic interpolation and competitive with GPU-targeted models like PAN (37.58 dB) and SAFMN (38.00 dB). It trades approximately 0.5 dB versus the best lightweight models (BSRN at 38.10 dB) in exchange for 10-20x higher throughput on Apple Silicon. The GPU-optimal architectures are empirically slower on ANE despite being closer to state-of-the-art quality.
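PSNR, the metric quoted throughout, is computable in a few lines. A NumPy sketch (assumes images normalized to [0, 1]; published Set5 numbers typically evaluate the luma channel with border cropping, which this omits):

```python
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means closer to the target."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak**2 / mse)

rng = np.random.default_rng(0)
target = rng.random((64, 64))
noisy = np.clip(target + rng.normal(scale=0.01, size=target.shape), 0, 1)
print(f"{psnr(noisy, target):.1f} dB")  # ~40 dB for ~1% noise

# PSNR is logarithmic: every +6 dB halves the RMS error, so the quoted
# 3.88 dB gap over bicubic corresponds to roughly 36% lower RMS error
print(f"{10 ** (-3.88 / 20):.2f}x the bicubic RMS error")  # ~0.64x
```

The logarithmic scale is why fractions of a dB matter in the benchmark tables: the 0.5 dB traded away versus BSRN is about a 6% RMS-error difference, against a 10-20x throughput gain.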

PiperSR is open source. Inference code and benchmarks are AGPL-3.0. Model weights are released under the PiperSR Model License: free for personal, academic, and non-commercial use with attribution. The distribution format is CoreML .mlpackage only - no PyTorch weights or ONNX export, because PiperSR is an ANE model and the distribution format reflects that. Details at github.com/modelpiper/pipersr.

Local vs cloud: an honest comparison

Local and cloud upscaling serve different needs. Here's where each is genuinely stronger, without marketing spin.

Local wins on:

  • Privacy. Your images and video never leave your machine. Client photos, medical scans, legal evidence, unreleased content, security footage - none of it touches a network. For industries with data residency requirements (healthcare, legal, government), this isn't a preference, it's a compliance requirement.
  • Cost. Zero per-image, zero per-video. Process a thousand photos and the only cost is electricity. Cloud services charge $0.05-2.00 per image or per minute of video. A team processing product photography regularly can easily spend hundreds per month on cloud upscaling.
  • Speed for video. 44 FPS means a 10-minute video completes in under 7 minutes, with no upload or download time. Cloud services queue your job behind other users and add network transfer latency - minutes to hours of total waiting.
  • Offline availability. Works on a plane, at a client site with no Wi-Fi, in a secure facility. The models are bundled - no download required, ever.
  • No watermarks, no account, no waitlist. Cloud free tiers gate quality behind watermarks and usage limits. Locally, full-quality output is immediate and unlimited.
  • Programmatic access. REST endpoints and MCP tools for automated workflows at no additional cost. Most cloud services charge extra for API access or don't offer it at all.

Cloud and dedicated tools win on:

  • Specialized models. Topaz Video AI ships domain-specific models for faces, text, and animation that are trained and tuned for each content type. For professional photo restoration where every fraction of a dB matters, specialized models outperform general-purpose ones on their target domain.
  • Higher upscale factors. Let's Enhance offers up to 16x. ToolPiper currently maxes at 4x for images (PurePhoto SPAN) and 2x for video (PiperSR). For extreme enlargement needs, cloud tools offer more headroom.
  • Temporal consistency. Topaz's video models use multi-frame temporal analysis - optical flow and motion vectors - to maintain consistency across frames. PiperSR processes frames independently, with no temporal coherence model. For most content this is invisible, but on certain types of motion (slow pans, subtle camera movement) it can produce slight frame-to-frame variation.
  • Complex degradation handling. Real-ESRGAN handles noise, JPEG artifacts, and blur simultaneously because it was trained on synthetically degraded images. PiperSR is optimized for clean 2x upscaling of undegraded source material. For heavily compressed or noisy footage, a denoising step before PiperSR would be the local approach.

For everyday use - product photos, screenshots, scanned documents, screen recordings, webcam archives, old phone footage - the bundled models produce excellent results with zero friction. For professional film restoration or forensic-level enhancement, dedicated tools with larger, specialized models still have the edge.