Why is FPGA-GPU Heterogeneity the Best Option for Embedded Deep Neural Networks?
Walther Carballo-Hernández
Université Clermont Auvergne, CNRS, SIGMA Clermont, Institut Pascal, F-63000 Clermont-Ferrand, France. Email: walther.carballo [email protected]
Maxime Pelcat
IETR, UMR CNRS 6164, Institut Pascal UMR CNRS 6602, Univ Rennes, INSA Rennes, France. Email: [email protected]
François Berry
Université Clermont Auvergne, CNRS, SIGMA Clermont, Institut Pascal, F-63000 Clermont-Ferrand, France. Email: [email protected]
Abstract—Graphics Processing Units (GPUs) are currently the dominating programmable architecture for Deep Learning (DL) accelerators. The adoption of Field Programmable Gate Arrays (FPGAs) in DL accelerators is however gaining momentum. In this paper, we demonstrate that Direct Hardware Mapping (DHM) of a Convolutional Neural Network (CNN) on an embedded FPGA substantially outperforms a GPU implementation in terms of energy efficiency and execution time. However, DHM is highly resource intensive and cannot fully substitute the GPU when implementing a state-of-the-art CNN. We thus propose a hybrid FPGA-GPU DL acceleration method and demonstrate that heterogeneous acceleration outperforms GPU acceleration even when communication overheads are included.
Experimental results are obtained on a heterogeneous multi-platform setup embedding an Nvidia® Jetson TX2 CPU-GPU board and an Intel® Cyclone 10 GX FPGA board. The SqueezeNet, MobileNetv2, and ShuffleNetv2 mobile-oriented CNNs are evaluated. We show that heterogeneous FPGA-GPU acceleration outperforms GPU acceleration for the classification inference task over MobileNetv2 (12%-30% energy reduction, 4%-26% latency reduction), SqueezeNet (21%-28% energy reduction, same latency), and ShuffleNetv2 (25% energy reduction, 21% latency reduction).
I. INTRODUCTION
Internet of Things (IoT) and the emerging adoption of heterogeneous architectures in edge devices are currently extending the possibilities of Deep Learning (DL)-powered applications. Indeed, in order to keep device energy consumption reasonable, embedded platforms have started to adopt heterogeneous architectures to keep up with an ever-growing computational demand. While GPUs currently dominate programmable DL acceleration, the state of the art is still divided on the cases in which an FPGA outperforms a GPU as an efficient DL hardware substrate. The main motivation for heterogeneous solutions is to increase computational efficiency by accelerating a subset of the tasks of a full workflow. However, this gain does not necessarily compensate the communication overheads induced by inter-layer transfers. In this paper, we propose and evaluate FPGA and GPU DL module implementations separately against a heterogeneous solution. Comparisons are based on widely used CNN building blocks using a throughput-optimised pipelined Direct Hardware Mapping (DHM) technique for deploying FPGA CNN kernels [1]. The DHM technique differs in several respects from conventional GPU network execution. The first difference is the use of a fixed-point computation approach. This compression technique allows us not only to reduce the memory complexity of features and weights, but also to use specialized hardware dedicated to fixed-point computation. In this study, we use an 8-bit fixed-point representation, as suggested in [2], to avoid heavily affecting the resulting DL accuracy (a minimal quantization sketch is given at the end of this section). Secondly, the number of external memory accesses from the device must be considered. Since DHM is based on a stream processing paradigm that keeps parameters and features close to each other, it widely deviates from the memory hierarchy approach of the GPU memory model.
In the three case studies, we aim to evaluate the inference deployment of embedded CNN models, namely MobileNetV2 [3], ShuffleNetV2 [4] and SqueezeNet [5], on an embedded FPGA-GPU heterogeneous platform. Although both hardware architectures have been well studied and evaluated in High Performance Computing (HPC) centers, their specific capabilities are still to be exploited in embedded designs. In this work, we compute energy and latency estimations for multiple layers used in these CNN models. We then propose a heterogeneous version of grouped or depth-wise convolution partitions for layer fusing, when allowed by the network architecture, at a module level.
The contributions of this work consist of:
1) demonstrating that DHM on an FPGA is a viable alternative to GPU deep learning acceleration in terms of energy, latency and throughput, although DHM is currently limited to small layers due to its extensive usage of FPGA logic resources;
2) comparing the obtained measurements against an embedded GPU implementation for specific layers and operations at a module level;
3) demonstrating that a combination of GPU and FPGA effectively outperforms homogeneous solutions, even when inter-system communication overheads are considered.
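For illustration, the following minimal sketch shows the kind of 8-bit fixed-point compression referred to above. It is not the toolchain used in this work: the Q-format, the fractional bit count and the rounding mode are assumptions chosen for the example.

```python
import torch

def quantize_q8(x: torch.Tensor, frac_bits: int = 6):
    """Quantize a float tensor to signed 8-bit fixed point.

    frac_bits is the assumed number of fractional bits; out-of-range
    values saturate. Returns the int8 tensor and the scale needed to
    recover a float approximation of x.
    """
    scale = 2 ** frac_bits
    q = torch.clamp(torch.round(x * scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize_q8(q: torch.Tensor, scale: int) -> torch.Tensor:
    """Map the fixed-point representation back to float32."""
    return q.to(torch.float32) / scale

# Quantization error stays small for well-scaled weights:
# at most 0.5 / scale per element, up to saturation.
w = torch.randn(64, 3, 3, 3) * 0.5   # a conv weight tensor
q, s = quantize_q8(w)
print((w - dequantize_q8(q, s)).abs().max())
```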
II. RELATED WORK
Heterogeneous computing has been increasingly adopted in the last few decades as a result of the power and memory walls of computing. These programmable computing nodes have a diversity of hardware capabilities, different ways to execute instructions, or multiple operation management methods [6].
In cases where there is enough (data or task) parallelism that can be exploited by scheduling, combining FPGAs and GPUs can offer significant performance gains [7]. Recent studies feed the discussion in the context of embedded vision applications, comparing for instance an ARM57 CPU, a TX2 GPU and a ZCU102 FPGA [8]. They show that FPGAs are a better solution than GPUs for more complex pipelines like image filtering, feature extraction or geometric extraction, which is the case in our study. [9] and [10] are the closest works to this study in terms of hardware architecture and partitioning between GPUs and FPGAs for embedded image processing. However, the granularity of their partitions is either too fine to affect the communication bottleneck, or too coarse to fully exploit resource allocation. In this paper, we propose heterogeneous partitioning at a module level on state-of-the-art CNNs and compare quantitative results to [9] and [10] in Table I.
In [11], closer to the current study from a communication perspective, a heterogeneous platform consisting of two programmable logic devices is presented. Both devices are interconnected and tested on image processing techniques such as histograms of oriented gradients. While some speed-ups are achieved, the inter-subsystem communication through a Peripheral Component Interconnect Express (PCIe) link tends to reduce them, resulting in a bottleneck. Adopting a host-guest computing structure, more recent works [12], [13] alleviate this bottleneck by bypassing or skipping data allocation in host memory, keeping data in the guest device for a longer time. As presented in these papers, shaping memory transfers is critical in a DL context in order to increase the number of layers or parameters that can be mapped on the most efficient accelerator before being sent back to the host.
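To make the host-memory-bypass idea of [12], [13] concrete, the sketch below shows the same principle on a CUDA device in PyTorch: inputs are staged in page-locked memory, copied asynchronously, and all intermediate results stay on the accelerator. This is our illustration of the host-guest pattern, not the code of the cited works, and it assumes a CUDA-capable device.

```python
import torch

def run_on_device(batches, model, device="cuda"):
    """Keep data on the 'guest' accelerator: pinned host staging buffers,
    asynchronous host-to-device copies, and no intermediate host copies."""
    model = model.to(device).eval()
    stream = torch.cuda.Stream()
    outputs = []
    with torch.no_grad():
        for cpu_batch in batches:
            pinned = cpu_batch.pin_memory()  # page-locked staging buffer
            with torch.cuda.stream(stream):
                gpu_batch = pinned.to(device, non_blocking=True)  # async DMA copy
                outputs.append(model(gpu_batch))  # result stays on the device
    torch.cuda.synchronize()
    return outputs
```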
III. PROBLEM DEFINITION
In this section, we describe the performance of the individual architectures, i.e. of a full implementation on an FPGA with DHM and on a GPU, for the models of Section IV. Two metrics are considered on each device in this work: processing latency (LAT) and energy (E). We further develop both solutions in a heterogeneous manner and compare the results in Section V.
A. DHM for FPGA synthesis definition
In this work, we use a data-driven approach that fully exploits the resources of an FPGA to concurrently map and execute multiple CNN layers on the device as a pipeline. DHM was first introduced in [1] as a technique to map processing nodes to logic elements or Digital Signal Processors (DSPs). The accelerators synthesized with this technique take advantage of fused layers, further explained in Section IV, since the intermediate feature maps, as well as the kernel weights, are stored internally on the device. This storage avoids the communication bottleneck of external memory accesses for intermediate data, improving energy, latency and throughput efficiency. Additionally, all weights are stored close to the logic elements, so no external memory accesses are needed for weight retrieval, which in DL applications introduces a considerable overhead. Although this method offers an indisputable performance efficiency gain, it comes at the cost of an enormous resource requirement. As a consequence of this constraint, only small designs can be mapped using DHM. Considering the opportunities and limitations of DHM, its usage for CNN acceleration must be handled carefully. The combination of DHM on a heterogeneous platform with the objective of reducing memory accesses on the GPU proves to be an efficient solution, as discussed in Section V. We show that, in fact, while the FPGA is more efficient than the GPU on all evaluation metrics for small kernels, combining such local FPGA acceleration with global GPU processing leads towards the optimal performance.
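The resource pressure is easy to quantify: since DHM instantiates one hardware operator per kernel weight, the multiplier count of a fully unrolled layer grows with the product of all kernel dimensions. The back-of-the-envelope estimate below is our illustration, not the exact mapping of [1], which also depends on the synthesis toolchain and word length.

```python
def dhm_multipliers(k_h: int, k_w: int, c_in: int, n_filters: int) -> int:
    """Multipliers needed to fully unroll one convolution under DHM:
    one operator per kernel weight."""
    return k_h * k_w * c_in * n_filters

# A small first layer fits on an embedded FPGA...
print(dhm_multipliers(3, 3, 3, 64))     # 1728
# ...but a mid-network layer far exceeds the available DSPs and logic.
print(dhm_multipliers(3, 3, 256, 256))  # 589824
```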
B. High-level CUDA code generation for GPU CNN deployment
The hardware architecture and memory model of Nvidia GPUs are highly specialized for batch processing based on Single-Instruction Multiple-Data (SIMD) parallel execution. This execution method requires a specific coding paradigm that handles memory accesses and scheduling over the memory hierarchy. GPUs embed memories with different access latencies, accessed from computing elements called Compute Unified Device Architecture (CUDA) cores. Therefore, both the latency and energy performances are highly dependent on how the kernels' threads are executed and how they use this hierarchy.
For CNN applications, multiple levels of data parallelism and data reuse can be achieved by techniques like loop unrolling, tiling or batching, which directly affect hardware performance. Fortunately, because of the wide adoption of high-level compiling tools and open-source projects, optimized software such as Pytorch [14] alleviates this task for the developer. In this work, we deploy CNN inference using the generated CUDA code on different sizes of convolutional layers. Figure 1 shows an example of the obtained measurements for latency (Figure 1a) and energy (Figure 1b). The figures are obtained by measuring the execution of a convolutional layer with an input tensor of dimensions 224x224, 3 input channels, from 2 to 64 kernel filters, and different kernel sizes. It can be observed that the FPGA implementation outperforms the GPU solution both in terms of energy and of latency. However, the FPGA DHM deployment is quickly limited by the number of available resources, constraining the depth of convolution filters that can be directly mapped: 64 filters in this case.
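As an indication of how such GPU measurements can be scripted, the sketch below times one convolutional layer with CUDA events in PyTorch, sweeping the filter counts of Figure 1. The warm-up count and the use of the median are assumptions of this example, not a description of the exact harness used for the paper's measurements.

```python
import torch

def conv_latency_ms(n_filters, k=3, c_in=3, hw=224, iters=100):
    """Median latency (ms) of one conv layer, timed with CUDA events."""
    conv = torch.nn.Conv2d(c_in, n_filters, k, padding=k // 2).cuda().eval()
    x = torch.randn(1, c_in, hw, hw, device="cuda")
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    times = []
    with torch.no_grad():
        for _ in range(10):               # warm-up (cuDNN autotuning)
            conv(x)
        for _ in range(iters):
            start.record()
            conv(x)
            end.record()
            torch.cuda.synchronize()      # make elapsed_time valid
            times.append(start.elapsed_time(end))
    return sorted(times)[len(times) // 2]

# Sweep the filter counts of Figure 1 on a 224x224x3 input.
for n in (2, 4, 8, 16, 32, 64):
    print(n, "filters:", round(conv_latency_ms(n), 3), "ms")
```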
IV. STATE-OF-THE-ART CNN MODULES AND GPU-FPGA PARTITIONING
The main motivation for the deployment of heterogeneous platforms for CNN networks is the presence of parallelism and heterogeneity in both CNN computation and communication. The main and most time-consuming operation of the presented building blocks is the convolution. Therefore, in order to be able to accelerate execution, it is essential to fully understand its computing model and relevant parameters.
Fig. 1: Latency (a) and Energy (b) comparison between multiple convolution function sizes on the Cyclone 10 GX FPGA (blue) and the Jetson TX2 GPU (green) for different CNN layers on an input image of 224x224x3. Blue bars represent the layers implemented on the FPGA and the green bars represent the energy consumption on the GPU. The performance factor in this measure is obtained as the product of the power and latency metrics.
A convolutional layer (Conv) takes as input a multidimensional tensor called Input Feature Map (IFM), $I$ of size $H_I \times W_I \times C_I$, from a previous layer $l-1$, which is multiplied and accumulated with a sliding window of a kernel tensor $K$ of size $k_h \times k_w \times C_I \times N$. Typically, in most applications, $k_h = k_w$. The resulting Output Feature Map (OFM) $O$ of the current layer $l$ is obtained from Multiply-and-ACcumulate (MAC) operations and inter-layer communication, i.e. each output element is $O[x, y, n] = \sum_{i,j,c} I[x + i, y + j, c] \cdot K[i, j, c, n]$. Recent CNN algorithmic optimizations are constantly introducing non-regular patterns into the networks in the form of different layer types with a variety of operations. In this subsection, we describe the main building blocks or modules of the current mobile CNN models and their partitioning:
• Depth-Wise separable Convolution (DWConv): This technique was first described in [15] and fully utilized in [3]. The main concept relies on a factorization of a traditional convolutional layer. The first of the resulting operations is a $k \times k$ convolution over every single input channel. The second operation over this tensor is a $1 \times 1 \times C_I$ (point-wise) convolution, resulting in one channel of the output tensor. Figure 2a shows a layer $d_l$ as a DWConv. We propose a partitioning delegating all the $1 \times 1 \times C_I$ point-wise convolutions to the FPGA, while the $k \times k$ depth-wise convolutions are executed on the GPU (Fig. 2a).
• Grouped Convolution (GConv): This partitioning method divides the computational load into workflows that can be executed in parallel and concatenated afterwards. In Figure 2b, two contiguous partitions of different sizes are created, one for each device. The GPU partition takes the subset of the IFM of size $H_I \times W_I \times (C_I - g_l)$ and the filter tensor of size $k \times k \times (C_I - g_l) \times N$, while the FPGA takes $H_I \times W_I \times g_l$ and the filter tensor of size $k \times k \times g_l \times N$ (a minimal sketch of this split is given after this list).
• Fused-Layer: This method was first introduced in [16] as a way to store the intermediate weights and neuron activity of adjacent layers in cache. This approach handles one of the most common challenges in CNN models: the data transfer burden. In Figure 2c, the $f_l$ parameters of the layers $l \in L$ are internally stored on the FPGA to be executed in a pipelined fashion [17]. The OFM of the last layer in the partition is then transferred to the GPU.
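The sketch below, referenced from the GConv item above, emulates the channel split of Fig. 2b on a single host: the first $g_l$ input channels and their filters stand in for the FPGA branch, the rest for the GPU branch, and the two OFMs are concatenated as in a grouped convolution. The per-branch filter counts are illustrative assumptions; in the real system the FPGA OFM additionally crosses the PCIe link before the concatenation.

```python
import torch
import torch.nn.functional as F

def partitioned_gconv(ifm, w_fpga, w_gpu, g_l):
    """Emulate the GConv partition of Fig. 2b: each branch is an
    independent group, so the two OFMs are simply concatenated."""
    fpga_ofm = F.conv2d(ifm[:, :g_l], w_fpga, padding=1)  # FPGA-side branch
    gpu_ofm = F.conv2d(ifm[:, g_l:], w_gpu, padding=1)    # GPU-side branch
    # In the real platform the FPGA OFM crosses PCIe before this fusion.
    return torch.cat([fpga_ofm, gpu_ofm], dim=1)

x = torch.randn(1, 16, 56, 56)    # IFM with C_I = 16
wf = torch.randn(8, 4, 3, 3)      # FPGA branch: g_l = 4 input channels, 8 filters
wg = torch.randn(24, 12, 3, 3)    # GPU branch: 12 input channels, 24 filters
print(partitioned_gconv(x, wf, wg, g_l=4).shape)  # torch.Size([1, 32, 56, 56])
```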
V. EXPERIMENTAL METHODOLOGY, EVALUATION AND RESULTS
In this section, we describe the experimental methodology deployed to obtain the proposed metrics. In Section V-A, we discuss the experimental setup and how individual performance metrics for each device are obtained. In Section V-B, we present the results of the heterogeneous platform measurements for different operations with the layer-wise partitioning from Section IV.
Figure 3 shows the embedded computing nodes selected for our case study on a custom prototyping board, linking both devices by a communication node or interface. Additionally, this section describes in more detail the experimental setup used as the case study and as the baseline for the comparison of measured metrics.
A. Measurement-based energy and latency performance comparison
The Jetson TX2 Module-on-Chip (MoC) incorporates a Tegra TX2 System-on-Chip (SoC), which in turn includes an integrated multi-channel power monitor. The ImageNet pre-trained mobile CNN models were obtained from Pytorch [14] and the torchvision model zoo. On the FPGA side, we use the Power Estimation tool® from Intel Quartus Pro Edition® targeting multiple convolutional task operations on the Intel® Cyclone 10 GX FPGA. The function synthesis is based on the DHM technique described in Section III. DHM maps the function directly onto hardware; therefore, its power varies rapidly with the number of processing elements and registers mapped on the device. Figure 1 shows an example of the energy efficiency comparison between both devices. It can be observed that the FPGA has a better energy efficiency, outperforming the GPU by orders of magnitude. This effect increases with the number of kernel filters on a fixed IFM. Nonetheless, this is only true as long as the design fits on an embedded FPGA device, like the Cyclone 10 GX FPGA.
Fig. 2: Proposed GPU and FPGA layer-wise CNN mappings. The partition in blue represents the data produced on the FPGA, while in green the data produced on the GPU. (a) Depth-wise convolution example, where the $k \times k$ convolution is executed on the GPU and the $1 \times 1 \times C_I$ convolution on the FPGA. (b) Grouped convolution example, where the input channels and kernel filters are divided between the devices. (c) Fused-layer example, where a couple of intermediate layers' activity is stored in the internal FPGA on-chip memory.
Fig. 3: Heterogeneous prototype board consisting of (a) an embedded CPU-GPU Nvidia® Jetson TX2 System-on-Module (SoM) at the top board and (b) an Intel® Cyclone 10 GX FPGA at the bottom board, interconnected by a 4-lane PCIe gen2 interface.
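A sketch of how energy can be derived from the TX2's on-module power monitor is given below: the monitor is polled from a background thread while the workload runs, and energy is the average power multiplied by the measured latency. The sysfs path of the INA3221 rail varies across L4T releases, so the node name used here is an assumption to be adapted to the target image.

```python
import threading
import time

# INA3221 rail on the Jetson TX2; assumed path, adapt to the L4T release.
POWER_NODE = "/sys/bus/i2c/devices/0-0041/iio:device0/in_power0_input"

def read_power_mw() -> int:
    """Instantaneous rail power in milliwatts."""
    with open(POWER_NODE) as f:
        return int(f.read())

def measure_energy_mj(task, period_s=0.01):
    """Poll the monitor while `task` runs; E = avg power x latency (mJ)."""
    samples, done = [], threading.Event()

    def poll():
        while not done.is_set():
            samples.append(read_power_mw())
            time.sleep(period_s)

    t0 = time.time()
    thread = threading.Thread(target=poll)
    thread.start()
    task()
    done.set()
    thread.join()
    latency_s = time.time() - t0
    avg_power_mw = sum(samples) / max(len(samples), 1)
    return avg_power_mw * latency_s, latency_s  # mW * s = mJ
```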
B. Evaluation and results
Given the metric measurements on the individual devices and the data-flow graphs of the heterogeneous platform resulting from the proposed partitioning, we validate and evaluate their efficiency on the hardware configurations described by the architecture model in Figure 3. For a fairer comparison between the monolithic homogeneous GPU-only setup and the heterogeneous FPGA-GPU evaluation, both setups were tested with the same configuration parameters and task workloads. The selected CNN models were pre-trained on the ImageNet dataset. Layer hyper-parameters, i.e. IFM and OFM sizes, were obtained from the original papers. To keep up with a better model precision, the first two IFM dimensions of the layers, $H_I$ and $W_I$, are sampled following the typical architecture tensor sizes of 224x224, 112x112, and so on down to 4x4. This allows us to fully exploit the dimension reduction of the IFM resulting from the GConv. Because our hardware setup is highly bounded by the PCIe throughput of 2.5 GBytes/s, these observations are crucial to maintain a good performance.
Figure 4a compares the energy in mJ and the latency in ms of the layers from SqueezeNet. The heterogeneous solution has a significant energy efficiency gain (up to 28%) with no significant impact on the latency. This is mostly because the energy efficiency of the convolution task on the FPGA is higher than that on the GPU. In the case of the latency, because the time spent in communicating between devices plus the processing time on the FPGA is still shorter than the execution of the convolution task on the GPU, it is possible to hide the FPGA latency during the execution time of the GPU. In other words, since the heterogeneous model executes in parallel, $LAT_{het} = \max(LAT_{GPU}, LAT_{FPGA} + LAT_{comm})$; if the latency of the FPGA plus the communication is less than the GPU latency, the total is dominated by the GPU-side latency. This is highly beneficial because this sub-task is small enough, thanks to the GConv, to be fully mapped on the FPGA for every layer of the CNN.
Fig. 4: Average metric performance space of the tested layers with different workloads for (a) SqueezeNet, (b) MobileNetv2 and (c) ShuffleNetv2 on a homogeneous GPU-only platform (green) and our FPGA-GPU heterogeneous platform (blue). The x-axis represents the average energy and the y-axis the average latency.
For MobileNetv2 (with 0.5x parameters), although our partition only considers a sequential execution of the diverse tasks in the layers, there is an improvement in both energy and latency performance. This speed-up and energy efficiency factor increases with the size of the IFM, as seen in Figure 4b, up to 23% and 30%, respectively.
Combining the strategies of both previous partitioning and scheduling approaches, ShuffleNetv2 (with 0.5x parameters) benefits from a speed-up factor on both of its block types, with and without spatial reduction. The first section of the layer incorporates a spatial reduction block that benefits from a similar gain of parallel execution. Therefore, the gain follows the same concept as the layer from SqueezeNet, but with a depth-wise convolution instead of a traditional convolution. The second section of the layer repeats a sequential execution with no spatial reduction. As a consequence, the result is similar to the layers from MobileNetv2.
Because of this combination, ShuffleNetv2 shows the highest speed-up factor, 25%, and an energy efficiency gain of 21% compared to its homogeneous GPU counterpart, as seen in Figure 4c. Table I shows the speed-up factor and energy performance comparison with the works from Section II. This work demonstrates a similar performance, showing clear heterogeneity-related gains. Notice that the evaluated algorithms are more complex than those of the compared state-of-the-art, while achieving similar results. Therefore, because of the highly parallel deployment of inference tasks, the use of FPGA-GPU heterogeneous embedded platforms is justified also for mobile DL CNN topologies, and shall result in very high gains if the GPU and FPGA substrates are put closer to each other than in the tested multi-board setup.

| Work | Heterogeneous platform | Partitioning granularity | Evaluated algorithms | Energy gain | Latency speedup |
|---|---|---|---|---|---|
| Qasaimeh, M. et al. [8] | GPU+CPU: Nvidia Jetson TX2; FPGA: Xilinx ZCU102 | Fine (element-wise) | Background subtraction | 1.74x | - |
| | | | Color segmentation | 1.86x | - |
| | | | Harris corners tracking | 3.94x | - |
| | | | Stereo block matching | 8.83x | - |
| Hosseinabady, M. et al. [9] | GPU+CPU: Nvidia Jetson TX1; FPGA+CPU: Virtex-7 and Xilinx Zynq UltraScale+ MPSoC | Fine (element-wise) | Histogram | 1.45x-2.29x | 1.18x-1.79x |
| | | | Dense matrix-vector multiplication | 0.96x-1.19x | 1.22x-1.48x |
| | | | Sparse matrix-vector multiplication | 1.1x-1.23x | 1.15x-1.25x |
| Yuexuan Tu, et al. [10] | CPU+GPU: Nvidia Jetson TX2; FPGA: Xilinx Nexys Artix 7 | Coarse (feature extraction + classification) | CNN (N=16) | 2.11x | 1.3x |
| | | | CNN (N=32) | 1.94x | 1.19x |
| | | | CNN (N=64) | 1.9x | 1.17x |
| This work | CPU+GPU: Nvidia Jetson TX2; FPGA: Intel Cyclone 10 GX | Mild (layer-wise) | SqueezeNet's Fire | 1.34x | 1.01x |
| | | | MobileNetv2's Bottleneck | 1.55x | 1.26x |
| | | | ShuffleNetv2's Stage | 1.39x | 1.35x |
TABLE I: Energy and latency comparison with state-of-the-art partitioning techniques on heterogeneous FPGA-GPU implementations.
VI. CONCLUSIONS
In this work, we have proposed, experimented with, and evaluated partitioning and scheduling of pre-trained mobile CNN architectures on an FPGA-GPU embedded heterogeneous platform. We have demonstrated that an FPGA exploiting Direct Hardware Mapping (DHM) outperforms a GPU implementation on a small piece of the network, at the cost of high resource requirements. We have also shown that the considered DL workloads benefit from a heterogeneous FPGA-GPU infrastructure when partitioned at a layer-level granularity. Indeed, the designed heterogeneous systems all outperform a homogeneous GPU-only solution in energy, latency or both on inference for classification tasks. These results call for new fully programmable architectural solutions for embedded deep learning combining reconfigurable logic and streaming multiprocessor architectures.
ACKNOWLEDGMENT