A Competitive Edge: Can FPGAs Beat GPUs at DCNN Inference Acceleration in Resource-Limited Edge Computing Applications?
Ian Colbert, Jake Daly, Ken Kreutz-Delgado, and Srinjoy Das
Abstract—When trained as generative models, Deep Learning algorithms have shown exceptional performance on tasks involving high dimensional data such as image denoising and super-resolution. In an increasingly connected world dominated by mobile and edge devices, there is surging demand for these algorithms to run locally on embedded platforms. FPGAs, by virtue of their reprogrammability and low-power characteristics, are ideal candidates for these edge computing applications. As such, we design a spatio-temporally parallelized hardware architecture capable of accelerating a deconvolution algorithm optimized for power-efficient inference on a resource-limited FPGA. We propose this FPGA-based accelerator to be used for Deconvolutional Neural Network (DCNN) inference in low-power edge computing applications. To this end, we develop methods that systematically exploit micro-architectural innovations, design space exploration, and statistical analysis. Using a Xilinx PYNQ-Z2 FPGA, we leverage our architecture to accelerate inference for two DCNNs trained on the MNIST and CelebA datasets using the Wasserstein GAN framework. On these networks, our FPGA design achieves a higher throughput to power ratio with lower run-to-run variation when compared to the NVIDIA Jetson TX1 edge computing GPU.
I. INTRODUCTION
Generative models are widely used as a means of parameterizing distributions of high-dimensional signals and structures. Among the various types of generative models, the Generative Adversarial Network (GAN) first proposed by Goodfellow et al. [1] yields superior performance on applications such as image generation, super resolution, and language modeling [2]. The learning strategy of the GAN jointly optimizes a generator G and a discriminator D. While the generator G is trained to minimize the distance between the ground truth distribution P_g and the model-parameterized distribution P_θ, the discriminator D is trained to separate P_g from P_θ. Although training optimizes both G and D, only the generator G is needed for inference when drawing samples from P_θ.

The typical GAN framework shown in Fig. 1 involves convolution layers, where D is a Convolutional Neural Network (CNN) and G is a Deconvolutional Neural Network (DCNN). Traditionally, these networks are deployed on CPUs and GPUs using cloud computing infrastructures. However, the proliferation of applications for mobile and edge computing has created new opportunities to deploy these models on embedded hardware for local inference. In contrast to CPUs and GPUs, FPGAs offer large-scale fine-grained parallelism and provide consistent power-efficient throughput, making them well-suited for these edge computing applications [3].

In this paper, we consider DCNN inference acceleration using a resource-limited Xilinx PYNQ-Z2 FPGA. We benchmark our implementation against the NVIDIA Jetson TX1 GPU, a processor heavily optimized for edge computing applications, and achieve a superior throughput to power ratio.

Figure 1: Generative Adversarial Network [1] Architecture. After training on the cloud, we map generator G onto local hardware for low-power inference at the edge.

The contributions of this paper are as follows:
• Significant enhancements over the algorithm proposed by [4] that reduce resource utilization, improve dataflow, and exploit memory hierarchy
• A spatio-temporally parallelized hardware architecture specifically designed to exploit these algorithmic innovations for power-efficient acceleration of DCNN inference
• An application of high-dimensional statistical analyses to balance the trade-off between hardware performance and generative quality when exploring network sparsity
II. RELATED RESEARCH
Previous works take architectural and algorithmic approaches to accelerating deconvolution workloads. The authors in [5] and [6] reformulate the deconvolution operation as a sparse convolution and build complex architectures that unify SIMD and MIMD execution models. Wang et al. [7] also use the zero-insertion deconvolution algorithm, approaching the problem by parallelizing over a uniform 2D systolic array hardware architecture to accelerate both 2D and 3D DCNNs. Liu et al. [8] propose a tiling method with a memory-efficient architecture that limits off-chip memory accesses at the cost of increased resource utilization via on-chip buffering. Chang et al. [9], [10] propose an accelerator that transforms the deconvolution operation into a convolution (TDC), requiring stride² as many filters and potentially zero-padding the input and weight matrices. To improve dataflow, Tu et al. [11] explore the on-chip re-stitching of the disjoint output feature maps resulting from the TDC method. Mao et al. [12] adapt this method in a piecewise manner to handle the load imbalance resulting from zero-padding at the cost of increased hardware complexity. The algorithm first proposed by Zhang et al. [4] avoids the zero-insertion and zero-padding requirements of the methods outlined above. We adapt this algorithm to a parallel hardware architecture as described in Sections III and IV.

Figure 2: Deconvolution Mapping of Input and Output Feature Maps. Visualization from [4] for mapping input and output blocks.
III. DECONVOLUTION ALGORITHM
Standard deconvolution arithmetic traverses the input space, which requires a summation of regions that overlap in the output space [13]. When realized in hardware, accumulating over these overlapping regions can require complex dataflow and increase resource utilization via on-chip buffering [4], [8], [9]. To circumvent this, Zhang et al. [4] redesign the deconvolution algorithm to directly loop over the output space at the cost of the expensive modulo arithmetic required to calculate dependent input pixels. We propose the following three enhancements to adapt this reverse looping algorithm to a spatio-temporally parallelized hardware architecture.

(1) Preprocessing modulo arithmetic.
Standard deconvolution arithmetic calculates the indices of dependent output pixels o_h from input index i_h using weight index k_h, stride S, and padding P, as shown in Eq. 1. Here, tiling along the input space leads to overlapping blocks in the output space, creating communication overhead [4], [9], [11].

    o_h = i_h × S + k_h − P    (1)

To avoid this, Zhang et al. [4] use the mapping in Fig. 2 to loop over the output space and determine i_h using Eq. 2.

    i_h = (o_h + P − k_h) / S    (2)

When S > 1, Eq. 2 yields fractional values. To ensure functional correctness, Zhang et al. [4] propose a stride hole skipping technique, adding an offset value f_h given by Eq. 3.

    f_h = mod(S − mod(P − k_h, S), S)    (3)

However, the resulting input pixel calculation given by Eq. 4 relies on modulo arithmetic, which increases resource utilization and power consumption when implemented in hardware.

    i_h = (o_h + P − k_h + f_h) / S    (4)

Observing that, in Eq. 3, f_h is only dependent on k_h, we pre-compute and cache these offsets for each value of k_h. This process reduces the number of modulo operations to K, where K is the weight filter size. This minimizes resource utilization and on-chip memory as K tends to be small.
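As a minimal illustration of this preprocessing step (a Python sketch, not the HLS implementation; function names are ours), the offsets of Eq. 3 can be computed once per layer so that the per-pixel index calculation of Eq. 4 needs no modulo arithmetic:

    # Sketch: precompute the stride-hole-skipping offsets (Eq. 3) once per layer.
    # K, S, P are the filter size, stride, and padding of the layer.
    def precompute_offsets(K, S, P):
        # One offset per weight index k, so only K modulo operations are needed.
        return [(S - (P - k) % S) % S for k in range(K)]

    def input_index(o, k, f, S, P):
        # Eq. 4: map a base output coordinate o (before the offset is added)
        # back to its dependent input pixel using the cached offset f.
        return (o + P - k + f) // S

For example, a layer with K = 4, S = 2, and P = 1 yields the cached offsets [1, 0, 1, 0], so every output pixel visited by the kernel maps to a valid integer input index.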
(2) Dataflow Optimization. Loop interchange is an algorithm-level optimization that can be applied to improve the sequential computation order of operations [14]. We reorder the loops of the deconvolution arithmetic in [4] to sequentially traverse the weight space and maximize data reuse. Increasing weight-level data reuse also increases the impact of zero-skipping, a conditional execution paradigm that eliminates redundant operations by only processing non-zero elements.

Additionally, we exploit the opportunities for data-level parallelism exposed by directly looping over the output space. Unlike the standard deconvolution algorithm, which suffers from the overlapping sum problem, the output space of the reverse looping deconvolution can be tiled into smaller batches to execute concurrently on a parallelized hardware architecture. When the size of the output feature space increases owing to the upsampling nature of deconvolution operations, the workloads and memory requirements remain constant, simplifying hardware design requirements.

(3) Decoupling external memory accesses from compute operations. Reverse looping deconvolution arithmetic using [4] produces a non-sequential external memory access pattern over the input space. To mask any resulting overhead, we decouple all external memory accesses from compute operations to allow for the cascaded execution of these sub-tasks on a pipelined hardware architecture and restrict non-sequential memory access patterns to faster on-chip memory. This is done by first computing the pixel addresses of an input block using Eq. 4, then sequentially reading these addresses from external memory, and finally caching the data on-chip to be distributed. To do this, we determine the tile size T_IH of the input block needed for each output block from the output tiling factor T_OH and the layer parameters using Eq. 5. The resulting deconvolution kernel given by Algorithm 1 can then continuously compute T_OH × T_OW output blocks with a non-sequential access pattern over locally cached T_IH × T_IW input blocks using K × K weight blocks as the next set of inputs are fetched from external memory using sequential reads.

    T_IH = max(i_h) − min(i_h) = ⌈T_OH / S⌉ + ⌈K / S⌉    (5)

Algorithm 1: Deconvolution Kernel. Each kernel loads inputs, weights, and offsets into local memory to compute each output block.

    y ← initializeToBias()
    for i_c = 0, i_c++, while i_c < I_C do
        x ← loadInputBlock()
        w ← loadWeightBlock()
        for k_h = 0, k_h++, while k_h < K do
            for k_w = 0, k_w++, while k_w < K do
                ŵ ← w[k_h, k_w]
                f_h = loadOffset(k_h)
                f_w = loadOffset(k_w)
                for ô_h = 0, ô_h += S, while ô_h < T_OH do
                    for ô_w = 0, ô_w += S, while ô_w < T_OW do
                        o_h = ô_h + f_h
                        o_w = ô_w + f_w
                        i_h = (o_h + P − k_h) / S
                        i_w = (o_w + P − k_w) / S
                        y[o_h, o_w] ← y[o_h, o_w] + ŵ × x[i_h, i_w]
    pushOutputBlock(y)
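For reference only, a minimal single-channel NumPy sketch of this reverse-looping kernel is given below. It computes one output tile anchored at output coordinate (0, 0), omits the per-block address translation and input-channel accumulation of the full design, and reuses the illustrative precompute_offsets helper from the earlier snippet; it is not the HLS source.

    import numpy as np

    def deconv_block(x, w, bias, S, P, T_OH, T_OW, offsets):
        # x: input feature map, w: K x K weight block,
        # offsets: per-weight-index stride-hole offsets from precompute_offsets (Eq. 3).
        K = w.shape[0]
        y = np.full((T_OH, T_OW), bias, dtype=np.float32)   # initializeToBias()
        for k_h in range(K):
            for k_w in range(K):
                w_val = w[k_h, k_w]
                if w_val == 0:                               # zero-skipping
                    continue
                f_h, f_w = offsets[k_h], offsets[k_w]
                for o_h in range(f_h, T_OH, S):              # loop directly over the output tile
                    for o_w in range(f_w, T_OW, S):
                        i_h = (o_h + P - k_h) // S           # exact division: the offset is folded into o_h
                        i_w = (o_w + P - k_w) // S
                        if 0 <= i_h < x.shape[0] and 0 <= i_w < x.shape[1]:
                            y[o_h, o_w] += w_val * x[i_h, i_w]
        return y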
Figure 3: FPGA Hardware Architecture. As discussed in Section IV, we design a spatio-temporally parallelized hardware architecture customized to accelerate the deconvolution algorithm proposed in Section III for low-power DCNN inference at the edge.
IV. FPGA HARDWARE ARCHITECTURE
To accelerate DCNN inference on an FPGA, we design a SIMD (Single Instruction Multiple Data) hardware architecture with replicable compute units (CUs) that exploits the opportunities for both spatial and temporal data-level parallelism that arise from the optimizations discussed in Section III. As depicted in Figure 3, the dataflow of the deconvolution accelerator IP block is split into the three pipelined stages outlined below.

(1) Reading Inputs and Weights.
The limited amount of on-chip memory is a bottleneck when accelerating large networks on a resource-limited FPGA. As such, the input feature maps and network weights are stored in off-chip DDR memory and fetched using AXI interconnects. As described in Section III, decoupling external memory accesses masks the communication overhead when executed in a pipelined architecture. We separate input and weight external memory accesses into dedicated hardware blocks to concurrently read from DDR memory and stream to CUs through on-chip FIFOs. This efficient memory hierarchy is realized by on-chip buffers using BRAMs to store tiled input and weight blocks to be processed by CUs.

(2) Spatially Parallelized Compute Units.
Looping over the output feature map enables partitioning deconvolution arithmetic into tiled batches that can execute concurrently across an array of CUs. The CUs follow a SIMD execution model, where each workload is dependent on blocks of inputs and weights that are sequentially streamed in through FIFOs and accumulated. The CUs each perform the deconvolution arithmetic outlined in Algorithm 1 using on-chip DSP units, and the resulting T_OH × T_OW output block is streamed out to be written to off-chip memory. To maximize the occupancy of these CUs, we explore the design space as outlined in Section V-A to optimize the output tiling factor.

(3) Writing Output Pixels. Traversing the output space and avoiding the overlapping sum problem allows for a one-shot write to external memory for each output block computed by a CU. We dedicate a hardware block to stream the outputs from each element in the CU array to be written to external DDR memory. This minimizes communication overhead with DDR and on-chip BRAM memory requirements.
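To make the stage decomposition concrete, the Python sketch below mimics the three pipelined stages with queues standing in for the on-chip FIFOs (thread and queue names are illustrative, and deconv_block refers to the earlier sketch; this is an analogy for how decoupled stages overlap memory access with compute, not the accelerator implementation).

    import queue
    import threading

    def reader(blocks, fifo):
        # Stage 1: fetch tiled input and weight blocks from external memory
        # and stream them toward the compute units.
        for block in blocks:
            fifo.put(block)
        fifo.put(None)                              # end-of-stream marker

    def compute(fifo_in, fifo_out):
        # Stage 2: a compute unit consumes streamed blocks and emits one output tile each.
        while (args := fifo_in.get()) is not None:
            fifo_out.put(deconv_block(*args))       # deconv_block from the earlier sketch
        fifo_out.put(None)

    def writer(fifo_out, results):
        # Stage 3: one-shot write of each completed output tile back to external memory.
        while (tile := fifo_out.get()) is not None:
            results.append(tile)

    def run_pipeline(blocks):
        # Wiring: the three stages run concurrently, so reads, compute, and writes overlap.
        f_in, f_out, results = queue.Queue(), queue.Queue(), []
        stages = [threading.Thread(target=reader,  args=(blocks, f_in)),
                  threading.Thread(target=compute, args=(f_in, f_out)),
                  threading.Thread(target=writer,  args=(f_out, results))]
        for t in stages: t.start()
        for t in stages: t.join()
        return results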
Figure 4: DCNN Architectures. We consider the network architectures shown above for inference acceleration on low-power hardware. (a) MNIST DCNN. (b) CelebA DCNN.
V. EXPERIMENTAL RESULTS
We implement our architecture on a Xilinx PYNQ-Z2 board at 32-bit fixed point precision using the Vivado Design Suite. With the available hardware resources, we synthesize the design with 16 CUs at 125 MHz in Vivado HLS using HLSLIB [15] and benchmark performance on the two DCNNs depicted in Figure 4. Each DCNN is trained on the MNIST and CelebA datasets using the WGAN-GP [16] framework.
A. Design Space Exploration
In this work, we explore square tiling factors over the output space such that T_OH = T_OW and use the design space exploration methodology proposed by Zhang et al. [17] to optimize T_OH. Because our accelerator multiplexes through the DCNN layers, we optimize T_OH globally across all layers for each network architecture as a unified hardware design parameter as in [17]. Fig. 5 depicts all legal solutions for both the MNIST and CelebA DCNNs. Any solution to the left of the peak sustainable bandwidth slope requires a higher bandwidth than the FPGA can sustain [17]. The optimal T_OH (indicated in green) maximizes attainable throughput while satisfying the hardware constraints. Table I provides the values used in this work and the resulting FPGA resource utilization. Note that the Xilinx PYNQ-Z2 board is extremely resource-constrained, using only 9% of the DSP blocks used in [5] and 5% of that used in [7] and [10].

Table I: Xilinx PYNQ-Z2 Resource Utilization

             T_OH    DSP48s    BRAMs    Flip-Flops    LUTs
    MNIST    12      134       50       43218         36469
    CelebA   24      134       74       48938         40923
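For illustration, the selection of T_OH can be sketched as a roofline-style search in the spirit of [17]. The operation and byte counts below are a toy cost model with placeholder assumptions, not the exact model used to produce Fig. 5:

    def attainable_throughput(t_oh, S, K, peak_bw, comp_roof):
        # Roofline model: performance is capped by either the computational roof or the
        # peak bandwidth scaled by the tile's compute-to-communication (CTC) ratio.
        t_ih = -(-t_oh // S) + -(-K // S)                      # input tile size, Eq. 5 (ceil division)
        ops = 2 * (K * K) * (t_oh // S) * (t_oh // S)          # toy MAC count per weight block
        bytes_moved = 4 * (t_ih * t_ih + K * K + t_oh * t_oh)  # 32-bit inputs, weights, outputs
        return min(comp_roof, (ops / bytes_moved) * peak_bw)

    def pick_tiling_factor(candidates, S, K, peak_bw, comp_roof):
        # Keep the legal T_OH with the highest attainable throughput under the bandwidth roof.
        return max(candidates, key=lambda t: attainable_throughput(t, S, K, peak_bw, comp_roof))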
Figure 5: Design Space Exploration. The optimal tiling factor T_OH maximizes attainable throughput while satisfying the peak sustainable bandwidth constraint as measured by the STREAM benchmark [18]. (a) MNIST DCNN. (b) CelebA DCNN.

B. Performance-per-Watt Comparison with Edge GPU
GPUs are power-hungry processors heavily optimized for large batch processing of on-chip memory [19]. Unlike the FPGA, which has been shown to provide workload-insensitive throughput with better power-efficiency, the time-varying optimizations leveraged by modern GPUs give rise to a non-deterministic execution model that can rarely provide the consistent performance required by edge computing applications [3], [20]. Additionally, modern GPUs use hardware throttling (i.e., reducing clock frequency) to lower power and cool the chip when it gets hot, further increasing run-to-run variation [21]. This makes FPGAs the more suitable choice for edge computing applications when consistent throughput and power efficiency are key requirements [3].

In our experiments, we compare the throughput to power ratio of our Xilinx PYNQ-Z2 FPGA design against the heavily optimized NVIDIA Jetson TX1 edge computing GPU. As in [22], we evaluate the GPU with Torch using nvprof to collect performance and power numbers for each layer in each DCNN. We measure FPGA power using a USB Power Meter Voltage Detector and collect performance numbers using hardware counters. We compute total network throughput as the sum of the arithmetic operations of all layers divided by the sum of the execution time of all layers. Our results provided in Table II show that our design yields a higher total network throughput to power ratio with lower run-to-run variation when compared to the GPU for both DCNNs. As noted in [17], unified design parameters such as T_OH simplify implementation cost but may be sub-optimal for some layers. We observe this behavior for the CelebA DCNN as shown in Table II. In future work, we will investigate dynamically reconfiguring tiling factors to optimize dataflow per layer.
Table II: DCNN Comparison (GOps/second/Watt). We measure the mean and standard dev. (in parentheses) of the throughput to power ratio of each layer in each DCNN on each processor over 50 runs.

    MNIST     L1           L2           L3           Total
    FPGA      —            —            —            —
    GPU       1.3 (0.17)   2.7 (0.42)   1.8 (0.25)   2.1 (0.18)

    CelebA    L1           L2           L3           L4           L5           Total
    FPGA      —            —            —            —            —            —
    GPU       3.2 (0.66)   —            —            —            —            —
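As a small illustration of how the Total column is aggregated (a sketch with illustrative names; the per-layer operation counts, execution times, and power come from nvprof or hardware counters as described above, not from this helper):

    def total_throughput_to_power(layer_ops, layer_times_s, avg_power_w):
        # Total network throughput = sum of arithmetic operations over all layers
        # divided by the sum of their execution times, normalized by average power.
        gops_per_s = sum(layer_ops) / sum(layer_times_s) / 1e9
        return gops_per_s / avg_power_w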
Figure 6: FPGA Sparsity Analysis. As shown on the right, generative quality decreases as the sparsity level increases.
C. Sparsity Experiments
Weight pruning is a widely studied technique used to reduce the power consumption and memory footprint of large networks on mobile and edge computing platforms [23]. It is difficult for GPUs to effectively handle this form of unstructured sparsity as they are highly sensitive to conditional execution paradigms such as zero-skipping [3], [22]. Alternatively, FPGA performance is stable under such paradigms and can yield significant speed-ups when only executing non-zero valued computations [3], [24]. However, removing learned parameters from a network invariably leads to degradation in generative quality. Previous works optimizing DCNN dataflow for unstructured sparsity fail to account for this degradation [24]. To balance the trade-off between generative quality and hardware performance, we propose an optimization metric given by Eq. 6 based on Maximum Mean Discrepancy (MMD), a kernel-based divergence metric that computes the distance between two distributions [25]. Here, t is the execution time of the full weight matrix and t_p is that of the sparse matrix.

    −log MMD(P_g, P_θ) × (t / t_p)    (6)

In our experiments, we systematically prune weights by their magnitude as done in [23]. Pruning more weights yields higher speed-ups, but the distance between P_θ and P_g increases. As shown in Fig. 6, this leads to a concave optimization curve with a peak representing the sparsity level that balances image quality and execution time.
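A minimal sketch of this metric is shown below, using a Gaussian-kernel MMD estimator; the kernel bandwidth, epsilon term, and function names are our illustrative choices rather than the exact estimator used in our experiments.

    import numpy as np

    def mmd(X, Y, sigma=1.0):
        # Biased MMD estimate with a Gaussian kernel between samples X ~ P_g and
        # Y ~ P_theta, each an (n, d) array of flattened images or features.
        def k(A, B):
            d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
            return np.exp(-d2 / (2.0 * sigma ** 2))
        return np.sqrt(k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean())

    def sparsity_score(X, Y, t_full, t_pruned):
        # Eq. 6: reward low divergence (small MMD) and large speed-up (t / t_p).
        return -np.log(mmd(X, Y) + 1e-12) * (t_full / t_pruned)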
VI. CONCLUSIONS AND FUTURE WORK

In this paper, we adapt the deconvolution algorithm first proposed in [4] to a parallelized execution model by reducing resource utilization, improving dataflow, and exploiting memory hierarchy. We design a spatio-temporally parallelized hardware architecture to accelerate this algorithm for DCNN inference on a Xilinx PYNQ-Z2 FPGA. For edge computing applications where consistent throughput and power efficiency are key requirements, we show that this resource-limited FPGA achieves a higher throughput to power ratio with lower run-to-run variation than the NVIDIA Jetson TX1 edge computing GPU. To balance the trade-off between generative quality and hardware performance, we propose an MMD-based optimization metric when exploring unstructured sparsity. In future work, we will adapt this architecture to other GANs and investigate the effect of bitwidth reduction on hardware performance and generative quality.
REFERENCES
[1] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[2] Zhaoqing Pan, Weijie Yu, Xiaokai Yi, Asifullah Khan, Feng Yuan, and Yuhui Zheng. Recent progress on generative adversarial networks (GANs): A survey. IEEE Access, 7:36322–36333, 2019.
[3] Saman Biookaghazadeh, Ming Zhao, and Fengbo Ren. Are FPGAs suitable for edge computing? In USENIX