DeCoILFNet: Depth Concatenation and Inter-Layer Fusion based ConvNet Accelerator
Akanksha Baranwal∗, Ishan Bansal∗, Roopal Nahar, K. Madhava Krishna
Abstract—Convolutional Neural Networks (CNNs) are rapidly gaining popularity in varied fields. Due to their increasingly deep and computationally heavy structures, it is difficult to deploy them on energy constrained mobile applications. Hardware accelerators such as FPGAs have come up as an attractive alternative. However, with the limited on-chip memory and computation resources of FPGAs, meeting the high memory throughput requirement and exploiting the parallelism of CNNs is a major challenge. We propose a high-performance FPGA based architecture, the Depth Concatenation and Inter-Layer Fusion based ConvNet Accelerator (DeCoILFNet), which exploits the intra-layer parallelism of CNNs by flattening across depth and combines it with a highly pipelined data flow across the layers enabling inter-layer fusion. This architecture significantly reduces off-chip memory accesses and maximizes the throughput. Compared to a 3.5GHz hexa-core Intel Xeon E7 caffe implementation, our 120MHz FPGA accelerator is 30X faster. In addition, our design reduces external memory access by 11.5X along with a speedup of more than 2X in the number of clock cycles compared to state-of-the-art FPGA accelerators.
I. INTRODUCTION
From recognition to reasoning, convolution neural networks have attained impressive accuracies in a broad range of applications such as mobile robotics, natural language processing, information retrieval and speech recognition [10] [11]. In 2014, VGG-Net [14], a network which became very popular, suggested some standards including uniform filters/kernels of size 3X3 across all layers, as it could emulate the effect of larger receptive fields. This reinforced the notion that convolution neural networks have to be deep in order for the hierarchical representation of visual data to work.
General purpose processors are not able to fully exploit the inherent inter-output and intra-output parallelism of convnets, hence specialized hardware accelerators such as GPUs [13], FPGAs [1] and ASICs [12] are gaining popularity. In fields like mobile robotics, which usually have stringent energy constraints, the reconfigurability and higher energy efficiency of FPGA based implementations has made them an attractive alternative [3] [2]. The major bottleneck while implementing huge networks on FPGAs is meeting the high memory throughput requirement of CNNs with limited on-chip memory. Traditional implementations of CNNs evaluate the network layer by layer [1] and off-load data intermittently to a larger external memory, which significantly decreases throughput because of the limited data transfer bandwidth.
Robotics Research Center, IIIT-Hyderabad, India. ∗Equal contribution.
Figure 1: Data influence diagram across layers: for computing subsequent layers, each element of the input is needed only for a small region of the output.
The computation pattern of CNNs is similar to iterative stencil loops (ISLs) [9], for which data dependencies span across multiple layers and iterations. Convnet layers are characterised by uniform spatial dependencies, domain narrowness and uniform inter-iteration dependencies. Works like [3] have adapted ISL computation techniques [9] to pipeline the dataflow across different convnet layers. Since the spatial data flow across layers depends on very few data values, it is not necessary to wait for the entire intermediate output to be computed before starting to process the next layer. This fact was exploited in Fused Layer CNN [3], which restructured the computation to significantly reduce external memory access. In our paper, we leverage the fact that the reverse is also true. That is, a particular input influences only a limited neighborhood of the intermediate output layers. So once these outputs are computed, that particular input can be discarded, as shown in Fig. (1). Using techniques like line buffer windowing and depth based concatenation, our 2.78X faster architecture improves upon [3]. Specifically, we make the following contributions:
• We propose depth concatenation in both input data and filter weights, i.e. data values across depth are concatenated adjacent to each other so that they can be moved together across buffers. Since most of the computations along depth for each layer are independent and can occur concurrently, depth flattening minimises the lag due to serial data flow along depth.
• We have modified the data flow pattern of CNNs for a constrained bandwidth setup by fusing across layers using the architectural pattern of line buffering. Line buffers help maximize data re-use by storing the input serial data stream and intermediate computation results in small on-chip BRAM buffers. The effectively pipelined structure allows the values of the next layer to be computed as soon as the values they depend on are available, and discards an input as soon as the corresponding outputs have been computed, thus eliminating recomputation and optimizing memory resources.
These contributions have enabled the design of our elegantly pipelined high throughput DeCoILFNet accelerator, which is very efficient in its utilization of FPGA resources. We have evaluated our accelerator on VGG-like networks, with VGG-16 [14] as the representative. Compared to the state-of-the-art CNN FPGA accelerator [2], our accelerator performs 2.6X faster on average and reduces external memory access by 11.5X. Compared to Fused CNN [3], our accelerator performs 2.78X faster with a slight increase in off-chip memory access. We are 30X faster than the CPU-caffe implementation and almost reach the speed of GPU-caffe implementations.

II. RELATED WORK AND MOTIVATION
There are two major components of computation in Convolutional Neural Networks: the forward pass and the backward pass. While training, the network iteratively performs repeated forward and backward passes to refine weights until the desired accuracy is achieved. Since only the forward pass is required for recognition, many application designers train networks offline and use the trained weights to perform time-sensitive jobs on energy constrained devices [3] [2]. Recent developments in the deep learning community have shown that the fully connected layers can be removed with no degradation in performance [5]. Under these circumstances, works like [2] [1] which focus on accelerating convolution layers have gained prominence. However, as networks get heavier, the on-chip memory of FPGAs is becoming insufficient to store the huge intermediate outputs. Conventional works [1] have focused on designing CNN accelerators which iteratively process the CNN layers and off-load the intermediate data to external memory. This involves extensive and unnecessary repetitious read and write accesses. Because of this, the limited external bandwidth becomes a challenge for designing efficient accelerators. In CNNs, the input and output feature volume is larger for the initial layers and gradually reduces; in the later layers, the memory occupied by weights dominates as the depth increases. Thus redesigning the data flow movement for the initial layers significantly reduces the overall external memory accesses [3], which decreases the overall computation latency and power. Inspired by the structure of image processing pipelines that minimize memory bandwidth using the architectural pattern of line buffering [4], our DeCoILFNet uses small on-chip buffers to pipeline the computations within and across the layers, increasing throughput and eliminating unnecessary communication with the off-chip DDR. Our architecture has been optimized so efficiently in a bandwidth constrained setup that the restricted external memory access is no longer the bottleneck.

III. DECOILFNET ARCHITECTURE
In the following sections we describe in detail the optimizations in the different modules of the DeCoILFNet accelerator. Though our accelerator is generic, for ease of explanation we have taken the following test example: input image of 5*5*3 (l*b*d), two fused convolution layers, both with stride=1 (s), padding=1 (p) and number of filters=3 (k) with kernel size 3X3 (wXw), followed by a pooling layer with a window of 2X2 and stride=2.

Figure 2: A: expected input window, B: window obtained from the line buffer
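To make the running example above concrete, the short sketch below (an illustrative Python calculation, not part of the accelerator) applies the standard convolution and pooling output-size formulas: each padded 3X3 convolution with stride 1 preserves the 5X5 spatial size, and the 2X2 pooling with stride 2 reduces it to 2X2.

```python
def conv_out(n, w=3, p=1, s=1):
    """Spatial size after a convolution: floor((n + 2p - w)/s) + 1."""
    return (n + 2 * p - w) // s + 1

def pool_out(n, w=2, s=2):
    """Spatial size after pooling (no padding)."""
    return (n - w) // s + 1

n = 5                # 5x5x3 input of the running example
n = conv_out(n)      # conv layer 1: 3x3, p=1, s=1 -> 5
n = conv_out(n)      # conv layer 2: 3x3, p=1, s=1 -> 5
n = pool_out(n)      # 2x2 pooling, stride 2       -> 2
print(n)             # 2, i.e. a 2x2xk output volume
```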
A. Line Buffer Windowing Module
The input to the accelerator comes in the form of a serial data stream. The first layer in CNNs is the convolution layer. For the convolution operation, we need input windows similar to the expected window shown in Fig. (2). As the input data arrives serially, a valid complete window requires 9 values of the sliding window, which come sequentially. The cumulative delay of reading these values each time a valid window is needed adds a huge unnecessary delay to the overall computation. Therefore our line and window buffer module, as shown in Fig. (2), is pipelined in such a way that we are able to get a new window at each clock cycle after a certain latency.
Usually before convolution, to maintain the spatial dimensions of the output, we pad the input layer with zeros. As shown in Fig. (3), when we reach the end of the line buffer, we get some invalid windows. Using our line buffer module, we are able to smoothly incorporate the padding layer to get padded windows which are input to the next consecutive convolution.
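The following Python generator is a minimal behavioral model of the line buffer windowing idea (not the Verilog RTL, and padding is not modeled): it keeps only w-1 rows plus w pixels of the serial stream and, once filled, emits one wXw window per incoming pixel, including the invalid wrap-around windows near row boundaries that the padding logic later handles.

```python
from collections import deque

def window_stream(pixels, cols, w=3):
    """Behavioral model of line buffer windowing (illustrative, not the RTL).

    `pixels` is the serial input stream in row-major order. The buffer keeps
    the last (w-1) full rows plus w pixels of the current row; once full,
    every additional pixel shifts the buffer and yields one w x w window,
    mirroring the one-window-per-cycle behaviour after the initial latency.
    Windows straddling a row boundary are the "invalid" windows mentioned
    above.
    """
    buf = deque(maxlen=(w - 1) * cols + w)   # line buffers + window registers
    for p in pixels:
        buf.append(p)
        if len(buf) == buf.maxlen:
            # Row r of the window comes from the r-th buffered line, oldest first.
            yield [[buf[r * cols + c] for c in range(w)] for r in range(w)]

# Toy usage on a 5x5 single-channel plane like the running example.
if __name__ == "__main__":
    rows, cols = 5, 5
    stream = list(range(rows * cols))
    for win in window_stream(stream, cols):
        print(win)
```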
B. Depth Concatenation Module: Input Data and Filter Data Flattening
Figure 3: Incorporating the padding layer in our architecture using line buffers

The line buffer windowing module above has been described for a 2-D window, whereas in our case, for volume convolution, we need a 3-D window. To similarly obtain the 3-D window in every cycle, our novel method is to flatten along depth so that the data flow is the same as before, but instead of just one window of a particular depth, we get a window flattened along the third dimension. As shown in Fig. (4), the input data, after preprocessed depth-flattening, is sent to DeCoILFNet as a concatenated data stream. This concatenation increases the bandwidth, as now instead of reading the 32 bits of each of the three depth slices D1, D2 and D3 in separate cycles, we read them together as the 96-bit concatenation D1D2D3. This concatenated window can simply be split into three independent windows which are sent in parallel to the convolution block. The data of the convolving filter is flattened in the same way, i.e. the values along the depth are concatenated. Before computation, this concatenated data f of filter 1 is split into three 2-D filters and sent to the convolution module. We have instantiated w*w = 9 filter BRAMs with multiple filters kept one after the other as shown in Fig. (4). The multiple BRAMs allow us to read all 9 values of one 3-D filter in parallel, thus making the filter ready for convolution in one cycle.
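A small sketch of the depth-concatenation bookkeeping, assuming 32-bit values and d=3 as in the running example (function names are illustrative, not the RTL interface): the three depth slices are packed into one 96-bit word so they traverse the buffers together, and are split back into independent values just before convolution.

```python
def pack_depth(values, width=32):
    """Concatenate per-depth words into one wide word (illustrative widths).

    values[0] is depth slice 1; it ends up in the least-significant bits so
    that splitting recovers the slices in the same order.
    """
    word = 0
    for i, v in enumerate(values):
        assert 0 <= v < (1 << width)
        word |= v << (i * width)
    return word

def split_depth(word, depth, width=32):
    """Split a concatenated word back into `depth` independent values."""
    mask = (1 << width) - 1
    return [(word >> (i * width)) & mask for i in range(depth)]

# One input pixel with depth d=3: the three 32-bit slices travel through the
# line buffers as a single 96-bit value and are split just before convolution.
d1, d2, d3 = 17, 230, 4095
wide = pack_depth([d1, d2, d3])
assert split_depth(wide, 3) == [d1, d2, d3]
```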
C. 3-D Convolution Pipelined Module

As shown in Fig. (4), the 3-D filter and input window are split into d=3 filters and d=3 windows. We have used DSPs only for multipliers and LUTs for adders so that more computations can be performed in parallel. Both the multiplier and adder modules have an initial latency of 9 cycles, after which, because of their internal pipelining, the outputs for the next k=3 subsequent filters and input windows keep coming every cycle. Thus the 2-D convolution module is finely pipelined, giving an output every cycle after a latency of (9*(1+ceil(2*log2(w)))) = 45 cycles because of the cumulative effect of the multipliers and adders. The d values of the 2-D convolution of each filter are then added to give the final single scalar value of the 3-D convolution of the output volume. The entire 3-D convolution module is pipelined in such a way that, after an initial latency of (9*(1+ceil(2*log2(w))+ceil(log2(d)))) = 63 cycles, we get the output of the convolution of each filter with an image window in every clock cycle.
Activation functions consume a very small fraction of the overall computation and can be trivially integrated without any effect on the data flow movement. The ReLU layer, which has also been incorporated in this module (without any computation overhead), is not explicitly shown.
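The sketch below is a software model (not the pipelined hardware) of how one 3-D convolution result is formed: d independent 2-D window dot products, which the depth-concatenated design evaluates in parallel, followed by a sum across depth. It also re-derives the initial-latency figures quoted above for w=3, d=3.

```python
import math

def conv2d_window(window, kernel):
    """Dot product of one w x w window with one w x w kernel slice."""
    return sum(window[r][c] * kernel[r][c]
               for r in range(len(kernel)) for c in range(len(kernel[0])))

def conv3d_window(windows, kernels):
    """3-D convolution at one position: d independent 2-D convolutions
    (run in parallel in hardware thanks to depth concatenation),
    followed by an adder tree across depth."""
    return sum(conv2d_window(w, k) for w, k in zip(windows, kernels))

def pipeline_latency(w=3, d=3):
    """Initial latencies quoted in the text (illustrative re-derivation)."""
    lat_2d = 9 * (1 + math.ceil(2 * math.log2(w)))                            # 45 for w=3
    lat_3d = 9 * (1 + math.ceil(2 * math.log2(w)) + math.ceil(math.log2(d)))  # 63 for w=d=3
    return lat_2d, lat_3d

assert pipeline_latency() == (45, 63)
```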
Figure 4: Depth Concatenation Module for input data and filter
D. Pooling
Usually in CNNs, consecutive convolution layers are followed by pooling. In max pooling, a 2X2 window is slid across the input with a stride of 2. In our DeCoILFNet architecture, we use an intermediate pool line buffer for pipelining. As soon as we get the output of the previous convolution, we redirect it to the pool buffer at the current output column address. We update the output column address at every even step, and at odd steps we replace the current value with the max of the old value and the newly computed output. These pooled outputs are read into the next input line buffer for further computation.
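A behavioral model of our reading of the pool line buffer described above (the names and the row-wise framing are assumptions for illustration): one buffer row accumulates running maxima, the output column address advances every second value, and a pooled row is emitted after every second convolution row.

```python
def maxpool_stream(conv_rows, stride=2):
    """Streaming 2x2, stride-2 max pooling using a single pool buffer row
    (behavioral sketch, not the RTL)."""
    pool_buf = None
    for r, row in enumerate(conv_rows):
        if pool_buf is None:
            pool_buf = [float("-inf")] * (len(row) // stride)
        for c, v in enumerate(row):
            out_col = c // stride                          # address advances at even steps
            pool_buf[out_col] = max(pool_buf[out_col], v)  # write (even) or merge-with-max (odd)
        if r % stride == stride - 1:                       # a full 2-row group is complete
            yield pool_buf
            pool_buf = None

# 4x4 toy input -> 2x2 pooled output.
rows = [[1, 5, 2, 0],
        [3, 2, 1, 7],
        [4, 4, 9, 9],
        [0, 8, 6, 2]]
assert list(maxpool_stream(iter(rows))) == [[5, 7], [8, 9]]
```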
E. Inter-layer Fusion Pipelining
Figure 5: Overall pipeline design

CNNs follow the pattern of iterative stencil loops [9], i.e. each particular input influences only a limited neighborhood of the intermediate output layers, as shown in Fig. (1). The main concept of the line buffer windowing module is based on this idea. Once these outputs are computed, that particular input can be replaced by the next input, either from external memory or from the computed output of the previous layer. Hence in our architecture we start processing the next layer as soon as we get the required valid inputs. As explained above for the 3-D convolution pipelined module, we get the convolution output of an intermediate layer in every cycle, for the filters one after another. As shown in the pipeline of Fig. (5), since the first layer has three filters which are computed one after another, though we get the output of each filter in every cycle, to stream the output data as serial input to the intermediate layer we need to wait for the whole output volume to be computed. During this time, while the volume is being computed, the input window is kept constant until all filters have been processed. This output volume is serially streamed to the intermediate line buffer. Here too, we need to wait for the initial filling of the intermediate line buffer before we get a valid convolvable window. This pipelining can be continued for further convolution layers. The DeCoILFNet accelerator has been pipelined so efficiently that even if multiple convolutions are fused together, the only delay is due to the initial latencies, after which we are still able to produce one output element in every step. If we fuse the pooling layer into our architecture, as explained above, we need to wait for some more clock cycles before every new pooled row. Hence our architecture works best when we have multiple consecutive convolutions.
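The following Python sketch models the fusion idea at row granularity (single channel, valid convolution instead of padded, so it illustrates the dataflow rather than the RTL): the second convolution starts as soon as w rows of the first layer's output exist, and each input row is dropped from a small rolling buffer once the output rows that depend on it have been produced, so no full intermediate feature map is ever materialised.

```python
from collections import deque

def conv_rows_valid(get_row, kernel, n_out_rows, cols):
    """Produce output rows of a valid w x w convolution one at a time, keeping
    only the last w input rows in a small rolling buffer (the line buffer)."""
    w = len(kernel)
    buf = deque(maxlen=w)
    r_in = 0
    for _ in range(n_out_rows):
        while len(buf) < w:                 # pull just enough input rows
            buf.append(get_row(r_in))
            r_in += 1
        yield [sum(buf[i][c + j] * kernel[i][j]
                   for i in range(w) for j in range(w))
               for c in range(cols - w + 1)]
        buf.popleft()                       # oldest row is no longer needed

# Fusing two 3x3 convolutions: layer 2 starts as soon as 3 rows of layer 1
# output exist, pulling layer-1 rows lazily, one at a time.
if __name__ == "__main__":
    N, w = 7, 3
    img = [[(r * N + c) % 5 for c in range(N)] for r in range(N)]
    k1 = [[0, 1, 0], [1, 1, 1], [0, 1, 0]]
    k2 = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]
    layer1 = conv_rows_valid(lambda r: img[r], k1, N - w + 1, N)
    layer2 = conv_rows_valid(lambda r: next(layer1), k2, N - 2 * (w - 1), N - w + 1)
    for out_row in layer2:
        print(out_row)   # each row appears as soon as its 3 layer-1 rows exist
```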
IV. EXPERIMENTAL EVALUATION AND RESULTS
A. Programming Using a Hardware Description Language: Verilog
Most of the design optimization works [2] [3] [1] have been done using high level synthesis (HLS) tools, as it is easier to port code from a software to a hardware implementation. The motivation behind using HLS is to avoid the need for RTL programming; nevertheless it is still necessary to verify the HLS generated RTL output [7], and when verification fails it is difficult to determine the cause of the problem. Hence, to successfully explore and implement the deep pipelining and parallelism of our design and use resources efficiently, our implementation, testing and validation have been done entirely in Verilog using the Vivado tool.
B. Experimental Setup
• FPGA: Our design has been implemented on the Virtex-7 XC7V690T FPGA board (on-chip BRAM of 6.46MB, 3600 DSP slices and 693120 logic cells) with a working frequency of 120MHz. This is the same board as used in [3] and [2], so that our comparisons in the next section are fair. We have used the Xilinx Vivado 2017.1 tool for synthesis, placement and routing, and the results are shown in Table (I).
• Baselines: We compare our design with the following baselines:
  – CPU-caffe: We have obtained the baseline CPU-caffe timings with respect to a 3.5GHz hexa-core Intel Xeon E7 caffe implementation [8].
  – GPU-caffe: We have obtained the baseline GPU-caffe timings with respect to a GeForce GTX 1070 (1506MHz graphics clock and 1683MHz processor clock) caffe implementation [8].
  – Fused Layer CNN and Optimized convolution accelerator: We have compared the resources and timing of the first five layers of VGG-16 for the DeCoILFNet accelerator against the Fused Layer CNN accelerator [3] and Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks [2], using data from the table of [3].
• Functional verification: We performed layer by layer functional verification of our code by comparing it with our Matlab forward pass implementation using trained weights from caffe.
C. Results and Comparison
In this section, we analyze the performance of our accelerator against caffe-CPU, caffe-GPU and state-of-the-art FPGA accelerators for the initial layers of VGG-16. We chose VGG-16 because modern state-of-the-art deep networks for various applications, such as the Fully Convolutional Network (FCN-32s) [10] and SegNet (web demo model) [11], are variants of VGG-16. The common feature between them is that most of the convolution layers have kernel size=3X3, padding=1 and stride=1. These networks are also characterized by multiple consecutive convolution layers.
Table I: Resource utilization of our accelerator for the first 2 convolution layers and 1 pooling layer of VGG-16

Resource      DSP     BRAMs    LUTs      Flipflops
Used          605     474      245138    465002
Available     3600    1470     433200    866400
Utilization   16.8%   32.24%   56.58%    53.67%
We first evaluate our performance with respect to the CPU-caffe and GPU-caffe implementations for the first seven layers of VGG-Net16 (5 convolution layers and 2 pooling layers). Table (II) shows the comparison of the cumulative time after every layer of VGG-Net16 for our accelerator against the software implementations running on both CPU and GPU. As visible from the table, our DeCoILFNet's performance at 120MHz is comparable to the GPU and outperforms the CPU with a speedup ranging from 4.28X to 39.03X.
The speedup gained by DeCoILFNet over the CPU keeps increasing with the number of layers. This is because the hardware accelerator exploits inter-layer fusion, which allows it to start the next convolution without waiting for the whole output; as the number of layers increases, the amount of fusion increases, resulting in better performance compared to the CPU.
Table II: Comparing the time taken by the first seven layers of VGG-Net16 with CPU-caffe and GPU-caffe. (Here X is the time taken by DeCoILFNet; speedups relative to DeCoILFNet are shown in parentheses.)

Starting Layer   Ending Layer   CPU-caffe (ms)     GPU-caffe (ms)    DeCoILFNet (ms)
conv1_1          conv1_1        114.54 (4.28X)     23.12 (0.86X)     26.76 (X)
conv1_1          conv1_2        736.78 (27.27X)    27.42 (1.01X)     27.01 (X)
conv1_1          pool1          769.37 (28.43X)    27.15 (1.003X)    27.06 (X)
conv1_1          conv2_1        1011.71 (36.02X)   29.31 (1.04X)     28.08 (X)
conv1_1          conv2_2        1282.42 (30.93X)   33.45 (0.806X)    41.46 (X)
conv1_1          pool2          1442.47 (34.76X)   33.57 (0.809X)    41.49 (X)
conv1_1          conv3_1        1637.43 (39.03X)   34.81 (0.829X)    41.95 (X)
Fusing a pooling layer with a convolution layer takes longer than fusing two convolution layers. Fig. (6) shows the difference in speedup obtained with and without the pooling layer. This is because, to compute the pooled layer output, we need to fill up the entire line buffer initially; thus the initial latency for pooling is higher.
Our design gives its best speedup when we have multiple consecutive convolutions. This is particularly helpful in networks like FCNs [10] and SegNet [11], which follow this pattern. In order to demonstrate the performance of our hardware accelerator, we have designed our own network consisting of four consecutive convolution layers, each with 64 filters of dimension 3*3 and stride 1, and run it over the CPU, GPU and DeCoILFNet, comparing the results after each layer. This is a network pattern that is common in the initial layers of modern networks [11] [10]. As shown in Table (III), when we fuse consecutive convolution layers we are able to attain a speedup of 76.8X with respect to the CPU and even slightly surpass the GPU speed. In general, FPGAs have a much higher per-watt performance than GPUs; modern GPUs use 10-100X more power than FPGAs. Thus, on a resource constrained FPGA, even reaching the GPU computation speed increases the per-watt performance significantly.
In order to compare our architecture with the current state-of-the-art hardware accelerators, we compared it with those proposed by [2] and [3] for the first seven layers of VGG-Net16.
Table III: Comparing convolution network performance with CPU-caffe and GPU-caffe for consecutive convolution layers. (Here X is the time taken by DeCoILFNet.)

Starting Layer   Ending Layer   CPU (ms)             GPU (ms)           DeCoILFNet (ms)
Conv_1           Conv_1         114.54 (4.28X)       23.12 (0.863X)     26.76 (X)
Conv_1           Conv_2         736.78 (27.27X)      27.42 (1.015X)     27.01 (X)
Conv_1           Conv_3         1346.32 (49.42X)     35.45 (1.301X)     27.24 (X)
Conv_1           Conv_4         2113.24 (76.91X)     38.58 (1.403X)     27.48 (X)

Table IV compares the resource utilization of DeCoILFNet with the baseline architectures. The resource utilization and timing for both baseline implementations have been taken directly from [3]. Among the three, our architecture gives the best speed (compared to [2] [3]) along with a significant reduction in the data volume transferred (compared to [2]). We have been able to effectively utilize the DSPs by eliminating recomputation with the help of line-buffer pipelining. The goal of our architecture is to maximize the speedup under limited external memory accesses. Depth concatenation helped us pipeline the dataflow and perform all independent computations for the first seven layers of VGG [14] in parallel. The pipelining is also very stringent, i.e. there is no stall after the initial latency and we keep getting a continuous stream of output. Keeping these in mind, the results shown in Table IV are the best we could achieve on the Virtex-7. We are able to attain more than 2X speedup in terms of clock cycles compared to both accelerators, along with a higher working frequency.
Table IV: Comparison with FPGA accelerators for the initial layers of VGG-Net

                             Optimized [2]   Fused Layer [3]   DeCoILFNet
Clock frequency (MHz)        100             100               120
MB transferred per input     77.14           3.64              6.69
BRAMs                        2085            2509              2387
DSPs                         2880            2987              2907
V. DISCUSSION AND TRADE-OFF
Fig. (7) shows the relation between off-chip memory accesses and computation units when the five convolution and two pooling layers of VGG-16 are fused in different groups. We have assumed that the depth based parallelism is constant for all the cases considered. Point A represents no fusion, i.e. when all intermediate outputs are stored in DDR. In this case, as is visible from the diagram, since we write back to the DDR, the computation unit of a single layer is reused for every layer, i.e. each layer is its own group. Hence the DSP utilization is minimum for this case, at the cost of the highest (23.54 MB) dataflow. Point G in the diagram represents the case when all layers have been grouped and fused. Since we are computing all layers concurrently, the DSP utilization is maximum with minimum off-chip memory access.

Figure 6: Comparison of the speedup of our accelerator with respect to GPU-caffe and CPU-caffe, with and without the pooling layer (X-axis: number of layers, Y-axis: speedup)

Our high performance in Table IV compared to other accelerators is aided by our parallel computations across depth. Using depth concatenation allows us to perform several computations concurrently. Our depth-concatenation technique is, however, limited by the compute resources present on the FPGA board. As the concatenation depth increases, we need more resources to perform computations in parallel. We have used iterative decomposition to solve this problem: we divide the depth into multiple groups of parallel computation and process these groups serially. The number of serial groups decides the factor by which our clock cycles increase, as we need to wait for the results of all groups before one output is complete. This technique is particularly needed for the later layers of VGG-Net, where we need to process inputs of depth 256 or 512.
In CNNs, the input and output feature volume is large for the initial layers and gradually reduces. Keeping data-volume considerations aside, the independence in the computation pattern of the later layers is the same as for the initial layers. Though we have demonstrated improvements for the initial layers, we believe our architecture can exploit the same data independence in the later layers to give similar improvements over the baselines. For the later layers, weights dominate the memory space and the depth of the convolving filters increases significantly. Since both parallelization due to depth concatenation and layer fusion require the same compute resources, there is a trade-off between them. The number of layers fused should be maximum for the initial layers. This is because, for the initial layers, the intermediate output data is huge, and fusing fewer layers would mean a huge data volume movement to and from external memory [3]. For the later layers, on the other hand, the depth of the input and convolving filters increases significantly, and the subsampling layers reduce the intermediate data volume. Hence it makes more sense to allocate compute resources to parallel computations across depth for the later layers.
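A small sketch of the iterative-decomposition bookkeeping just described (the parallel-depth figure of 64 below is an arbitrary illustrative assumption, not our configuration): the depth is split into groups that fit the available compute resources, and the cycle count for one output scales with the number of serially processed groups.

```python
import math

def depth_groups(depth, max_parallel_depth):
    """Number of serial passes needed when only `max_parallel_depth` depth
    slices can be convolved concurrently (illustrative cost model)."""
    return math.ceil(depth / max_parallel_depth)

def relative_cycles(depth, max_parallel_depth):
    """Clock-cycle multiplier relative to a fully parallel depth computation:
    one output needs the partial sums of every group, so cycles scale with
    the number of serial groups."""
    return depth_groups(depth, max_parallel_depth)

# Later VGG layers: depth 256 or 512 with, say, 64 slices processed in parallel.
for d in (64, 256, 512):
    print(d, "->", relative_cycles(d, max_parallel_depth=64), "serial group(s)")
```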
VI. CONCLUSION
Figure 7: Trade-off between inter-layer fusion and computation resource (DSPs)

We presented the Depth Concatenation and Inter-Layer Fusion based ConvNet Accelerator, DeCoILFNet, which exploits the intra-layer parallelism of CNNs by flattening across the depth and combining it with inter-layer fusion. Our accelerator maximises data re-use and completely eliminates recomputation while fusing multiple convnet layers. We explained in detail the different components of our architecture and evaluated our accelerator on VGG-like networks, with VGG-16 as the representative. We demonstrated that our 120MHz accelerator is 30X faster than a 3.5GHz hexa-core Intel Xeon E7 caffe implementation. In addition, our design reduces external memory access by 11.5X along with a speedup of more than 2X in the number of clock cycles compared to state-of-the-art FPGA accelerators.
REFERENCES
[1] S. Chakradhar, M. Sankaradas, V. Jakkula, S. Cadambi, "A dynamically configurable coprocessor for convolutional neural networks," in ISCA, 2010.
[2] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, J. Cong, "Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks," in FPGA, 2015.
[3] M. Alwani, H. Chen, M. Ferdman, P. Milder, "Fused-Layer CNN accelerators," in MICRO, 2016.
[4] J. Hegarty, J. Brunhaver, Z. DeVito, J. Ragan-Kelley, N. Cohen, S. Bell, A. Vasilyev, M. Horowitz, P. Hanrahan, "Darkroom: compiling high-level image processing code into hardware pipelines," in ACM Transactions on Graphics (TOG) - Proceedings of ACM SIGGRAPH, 2014.
[5] K. He, X. Zhang, S. Ren, J. Sun, "Deep Residual Learning for Image Recognition," in CVPR, 2015.
[6] M. Herbordt, T. VanCourt, Y. Gu, B. Sukhwani, A. Conti, J. Model, and D. DiSabello, "Achieving high performance with FPGA-based computing," in IEEE Computer, 40(3):50-57, 2007.
[7] J. Sanguinetti, "Understanding high-level synthesis design's advantages," in EE Times Asia, pages 1-4, 26 April 2010.
[8] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proceedings of the 22nd ACM International Conference on Multimedia, 2014.
[9] IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
[10] J. Long, E. Shelhamer, T. Darrell, "Fully Convolutional Networks for Semantic Segmentation," in IEEE Transactions on Pattern Analysis and Machine Intelligence.
[11] V. Badrinarayanan, A. Kendall, R. Cipolla, "SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation," in IEEE Transactions on Pattern Analysis and Machine Intelligence.
[12] Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), IEEE Computer Society.
[13] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, "cuDNN: Efficient primitives for deep learning," in CoRR, 2014.
[14] K. Simonyan, A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014.