A scalable and efficient convolutional neural network accelerator using HLS for a System on Chip design
Kim Bjerge a,∗, Jonathan Horsted Schougaard b, Daniel Ejnar Larsen b

a School of Engineering, Aarhus University, Finlandsgade 22, 8200 Aarhus N, Denmark
b Department of Engineering, Aarhus University, Finlandsgade 22, 8200 Aarhus N, Denmark

∗ Corresponding author. Email address: [email protected] (Kim Bjerge)
Abstract
This paper presents a generic convolutional neural network accelerator (CNNA) for a system on chip design (SoC). The goal was to accelerate inference of different deep learning networks on an embedded SoC platform. The presented CNNA has a scalable architecture which uses high level synthesis (HLS) and SystemC for the hardware accelerator. It is able to accelerate any CNN exported from Python and supports a combination of convolutional, max-pooling, and fully connected layers. A training method using fixed-point quantized weights is proposed and presented in the paper. The CNNA is template-based, enabling it to scale for different targets of the Xilinx ZYNQ platform. This approach enables design space exploration, which makes it possible to explore several configurations of the CNNA during C- and RTL-simulation, fitting it to the desired platform and model. The convolutional neural network VGG16 was used to test the solution on a Xilinx Ultra96 board. The result gave a high accuracy in training with an auto-scaled fixed-point Q2.14 format compared to a similar floating-point model. It was able to perform inference of an image in about 2 seconds while consuming 2.63 W, which corresponds to a power efficiency of 6.0 GOPS/W.

Keywords: System On Chip, FPGA, High Level Synthesis, Convolutional Neural Network, PYNQ
1. Introduction
In recent years, deep learning with convolutional neural networks (CNN) has been applied in many different fields such as image classification [1],[2], object detection [3],[4] and recognition [5]. In most cases, state-of-the-art CNN models run on a server in the cloud. However, with the increase of Internet of Things (IoT), there is a demand for embedding the deep neural networks into mobile edge computing. This is especially true for computer vision systems, where the amount of collected data is high and analyses of images must be carried out in real-time.

As CNNs continue to be applied to increasingly complex problems, low throughput, latency and energy efficiency present challenges on embedded devices with central processing units (CPUs) or graphical processing units (GPUs). Due to several attractive features, field programmable gate arrays (FPGA) present promising platforms for hardware (HW) acceleration of CNNs, as reported in [6],[7],[8],[9]. CNNs optimized for fixed-point or using binary neural networks achieve even better performance [10],[11],[12],[13]. In general, FPGAs provide higher performance than CPUs and have a better energy efficiency than both CPUs and GPUs. Historically, the long design time and the need for HW experts have limited the use of FPGAs. Here, high level synthesis (HLS) tools have enabled automatic compilation from imperative high-level programs to low-level specifications in a hardware definition language (HDL) [14]. It is, however, still a challenge to accelerate large-scale CNNs [15] on an FPGA, since model parameters typically require far more memory than the on-chip capacity of the FPGAs. Another challenge is to find an optimal configuration for a given HW accelerator design due to the long design time.

The scope of our work is to develop a generic and flexible architecture, which can accelerate the inference of CNN networks on a multi-processor system
on chip design (MPSoC). It presents the design of the HW/SW architecture, i.e. the programmable logic that will reside in the FPGA fabric and the design of the software. The architecture is generic so that it can accept major CNNs such as AlexNet [16] and VGG16 [2], which can be exported from a deep learning framework such as Keras [17]. It is developed in the PYNQ [18] framework using Python and SystemC [19] in order to create a generic template-based HW accelerator. To find the optimal design, a SystemC-based simulation is used to explore the design space of the optimal configuration parameters of the CNN accelerator. The design model is translated to a HDL specification using HLS. Our paper discusses the precision, speed and power consumption of the accelerator as well as the fixed-point retraining of the CNN.

1.1. Related work

In this section, the current state-of-the-art hardware-based CNN accelerators that inspired the architecture presented in this paper will be discussed.

The Microsoft model [20] is an architecture developed by Microsoft for accelerating CNNs for a cloud server solution with several FPGA cards. The architecture uses a top-level controller to control the data-flow with a PCI memory interface. It has multiple input buffers, one kernel weight buffer, a large array of processing element arrays (PEA) and lastly, a data redistribution block. It uses a direct memory access (DMA) channel to load data in from PC memory to the buffers. On the FPGA it uses PEA blocks to perform dot product calculations of the values in the input buffer and the weight buffer. The result of the dot product is saved into the next input buffer.

ZynqNet [21] is based on the architecture of the Microsoft model. However, it focuses on making it work for both training and inference. It is built for a SoC design instead of a server solution. The proposed solution seems promising, although it appears to have a few bottlenecks due to a purely C-based HLS implementation. It uses a circular line buffer (CLB) for input data handling and uses a memory-mapped master interface to get data from the main memory, e.g. weights and input data are transferred using the memory-mapped interface.

YodaNN: An Ultra-Low Power Convolutional Neural Network Accelerator Based On Binary Weights [22] is an accelerator designed for an ASIC instead of an FPGA. It is stated that FPGAs are two orders of magnitude less energy efficient than ASICs. The architecture of the accelerator is built for accelerating an input of 32 channels with 32 kernels, which results in a 32 channel output. The accelerator only focuses on the convolution part of the CNN and is locked to a specific type of CNN architecture.

FINN-R [11] is an end-to-end deep-learning framework for fast exploration of quantized neural networks (QNN). It is a framework built upon the FINN accelerator [23], which is a QNN accelerator built for FPGA. FINN-R consists of a cascade of multiple layer accelerators that are optimized for a pipelined architecture. This design reduces the transferring of data between the main memory and the accelerators. The difficult part is to balance the layered accelerators in order to prevent bottlenecks or resource waste. However, the framework does not solve the problem of different throughput for each layer.
FINN-R optimizes the generated HW using HLS, allowing fast exploration of QNNs to create the perfect accelerator for a specific FPGA target.

NeuFlow [24] uses a so-called Runtime Reconfigurable Dataflow Architecture. It consists of processing tiles (PT), a DMA, and both global and local data lines, where each PT has a local data line with its four neighbours. It consists of a controller and bus, which allows configuration and grid wiring of all the PTs during runtime. In this way, the grid can be set up to run different types of operations, although the PTs are only fine-tuned for accelerating artificial neural networks (ANN).

Most of the architectures presented above are built around a PEA with a buffer for handling input data, which could be a CLB or a row buffer. The output handling varies. Some designs, such as FINN-R, use the output to feed the next accelerator. Others have a large memory to cache layered outputs directly to the next input buffer so that data are ready for the next CNN layer. However, due to limited internal memory, this approach is not feasible for all FPGAs. Therefore, there is a need for reloading the input data from the main memory. An example of this is the Microsoft model. Other architectures use the main memory to cache the data between layers. FINN-R, for example, does this for each block of layers.

The CNN accelerator developed in this work has some elements in common with the papers presented above. It uses the main memory to store data between layers. In addition, the architecture is built around a PEA with two buffering systems: one for the weights and one for input image data, the latter of which uses a CLB. The above architectures are very similar, but the major difference lies in the details of the CLB, which enables efficient pipelining and data alignment.

The CNN architecture in our work supports any input size and layer depth, stride, zero padding and window size. It also supports so-called stitching, which allows the splitting of CNN layers into multiple iterations. This is explained in detail in section 3.2. It makes the accelerator more flexible and enables it to run nearly any CNN model that uses convolution, pooling and fully connected layers. Unlike FINN, which uses small building blocks, the accelerator developed in this work accelerates all parts inside a single IP core. This approach allows the hardware to reuse the efficient but expensive CLB in both convolution layers and pooling layers. It can be used with most CNN models during run-time without the need for the re-compiling that is required for the Microsoft model. The accelerator is developed to work with PYNQ [25],[18] and uses an application programming interface (API) similar to Keras [17]. The CNN model may need to be retrained if too much quantization is desired, i.e. a very small fixed-point format.
2. Design methods
In this section, we will briefly describe the design methods and concepts used as a basis in designing and implementing the architecture for the convolutional neural network accelerator (CNNA).

When working with FPGAs, there are different methods for developing an Intellectual Property Core (IP). A widely used method is writing code at the behavioural level or RTL in an HDL language such as Verilog or VHDL. These languages are normally time-consuming and require skilled HDL engineering.

Another possibility is to use HLS [26],[27] to develop the IP. HLS is a way of writing behavioural models of the hardware in a higher abstraction language, typically C or C++. This behavioural model can be mapped and scheduled to an RTL model using a synthesis tool such as Xilinx Vivado HLS [28],[29]. Developing IP using HLS can be very time efficient due to the higher abstraction level as well as the inherent design flow.

In our work, SystemC is used with the design flow described in [28, ch. 1],[30]. It is an efficient way in which an IP can be written and verified using HLS. SystemC is able to model the hardware structure and concurrency in a more optimal manner than pure C and C++. It is an IEEE standard (IEEE Std 1666-2011) [19], based on a C++ template library made for HW/SW co-design.

Productivity for ZYNQ (PYNQ) [31] is an open-source framework for creating applications on a ZYNQ MPSoC. The system design in this work is based on PYNQ for controlling the IP directly from Python. This framework is divided into three layers: application, software, and hardware.

The application layer, which hosts the user code, is described in [31, ch. 22]. This is usually Python code in the form of a Jupyter notebook that runs on the ARM CPU inside the ZYNQ MPSoC. The authors of the present paper argue that the choice of Python increases productivity, since Python is a high-level interpreted language, which can be developed quickly and run piecewise, thus allowing for quick debugging.

The middle layer in the PYNQ framework is the software layer. This layer contains the Python libraries and the interaction with the IP inside the FPGA through the OS drivers. Several drivers are provided through the PYNQ libraries for interacting with the IP. The interface is called an overlay and is used to program the FPGA and manage the IP.

The last hardware layer in the PYNQ framework is the bit-file programmed into the FPGA. The interaction between the software layer and the hardware layer is done using direct memory access (DMA) or memory-mapped interfaces.
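To illustrate the typical PYNQ pattern of programming the FPGA from Python and streaming data to an IP core through a DMA, a minimal sketch is given below. The bit-file name and the DMA instance name are placeholders and not the actual names used in this design.

```python
# Minimal PYNQ usage sketch (assumed names: "cnna.bit", "axi_dma_0").
# It loads an overlay, allocates DMA-capable buffers and streams data to the PL.
import numpy as np
from pynq import Overlay, allocate

overlay = Overlay("cnna.bit")        # program the FPGA with the bit-file
dma = overlay.axi_dma_0              # handle to an AXI DMA inside the overlay

in_buf = allocate(shape=(1024,), dtype=np.int16)   # physically contiguous buffer
out_buf = allocate(shape=(1024,), dtype=np.int16)
in_buf[:] = np.arange(1024, dtype=np.int16)

dma.sendchannel.transfer(in_buf)     # stream the buffer into the IP core
dma.recvchannel.transfer(out_buf)    # receive the streamed result
dma.sendchannel.wait()
dma.recvchannel.wait()
```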
3. System architecture
The SoC design consists of three main elements: the FPGA (i.e. the programmable logic (PL)), the dual core CPU and memory (i.e. random access memory (RAM)). The goal is to run a CNN consisting of convolutional, max-pooling and fully connected layers computed in the same IP Core inside the FPGA logic. The responsibility of the CPU is to let the user control the hardware acceleration so that the IP Core processes the CNN layers in the correct sequential order. Figure 1 shows that the system uses direct memory access (DMA) to transfer data and weights between the CPU and the IP Core accelerator. The CPU controls the DMA data block transfer and converts the memory interface to the streaming interface of the IP Core.

Figure 1: Block diagram of the system architecture covering CPU (ZYNQ Ultrascale+ MPSoC), memory (DDR RAM), hardware IP Core accelerator (CNNA), five DMAs for inputs (X), outputs (Y), weights (W), control (CTRL) and splits (XBUF).
The system interacts in different manners depending on which scenario it needs to execute. There are three main scenarios: preprocessing, initialization and inference.

Preprocessing.

The first scenario, preprocessing, converts the weights to fixed-point and realigns and scales the weights so that they are ready for the system to use. Preprocessing also calculates parameters such as layer output size and the layers to be split, which can be done offline on any computer. The weights are transformed from floating-point to fixed-point representation in the chosen format, and aligned and rounded off correctly, as described later. Finally, the weights are saved in an h5-file, which is the standard format for storing weights in Keras [17], and can be transferred to the hardware target.
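As a rough sketch of this offline step, the snippet below quantizes a floating-point weight tensor to the Q2.14 format and stores it in an h5-file. The file layout and dataset names are illustrative and not the exact format used by the authors' tool chain.

```python
# Sketch of offline weight preprocessing (illustrative file/dataset names).
import numpy as np
import h5py

I_BITS, F_BITS = 2, 14                      # Q2.14 fixed-point format

def to_fixed(w, i_bits=I_BITS, f_bits=F_BITS):
    """Quantize a float array to Q[i_bits].[f_bits], stored back as float."""
    lo = -2 ** (i_bits + f_bits - 1)
    hi = 2 ** (i_bits + f_bits - 1) - 1
    q = np.clip(np.round(w * 2 ** f_bits), lo, hi)
    return q * 2.0 ** -f_bits

weights = {"conv1/kernel": np.random.randn(3, 3, 3, 64).astype(np.float32),
           "conv1/bias": np.zeros(64, dtype=np.float32)}

with h5py.File("vgg16_q2_14.h5", "w") as f:  # weights ready for the target
    for name, w in weights.items():
        f.create_dataset(name, data=to_fixed(w))
```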
Initialization.
The hardware target needs to be configured and initialized for a particular fixed-point resolution by using the synthesized bit-file of the optimized CNNA. The bit-file contains the CNNA IP Core and the interconnection setup to the CPU for the specified hardware target. This is done using a specification of the model in the form of a JSON-file and an h5-file containing the weights, which are already realigned and quantized in the preprocessing. It starts by calculating the buffer size and getting the properties of the loaded CNNA. When this is done, the SW allocates the needed resources and readies the SW for inference by allocating the buffers for each layer in the CNN.

Figure 2: Sequence diagram of the interaction between software control (PYNQ) and hardware accelerator (FPGA) during inference.
Inference.
When using the system, predicting an image will be the most commonly used task. This task is shown in the sequence diagram in figure 2. Here, the user calls the method predict, which returns the predicted class of the image when the inference is done. The image, which is the parameter to the method predict, is stored internally in a contiguous array, i.e. an array which can be used by the PYNQ DMA library. Depending on the CNN, several layers are executed in the correct order, i.e. convolution, pooling or fully connected layers. All parameters controlling the CNN execution are sent at the start of the predict method.

The CPU initiates a convolution by starting several tasks, which happen in parallel. These tasks prepare the data transfers for the input control buffer CTRL, the input data buffer X, the output data buffer Y, the weight buffer W, and the stitching buffer Xbuf. The data transfers are handled by the DMA, which streams the data from RAM to the CNNA. If Xbuf is in use, it means that the convolutional layer has been split into several operations, because the buffer for the weights in the FPGA is too small. This is called stitching and essentially interleaves an old result with new results. All configuration for the IP core is sent through the CTRL DMA so that it calculates the convolution correctly.

A pooling operation is likewise executed by the CPU initiating four different tasks in parallel. It sets up the data transfers for the input control data CTRL, the input data X, the output Y and Xbuf. Each of these data transfers is handled by the DMA, which streams the content of the buffer from RAM to the CNNA. The input control data CTRL is for sending the configuration needed for the CNNA to do the computation correctly. The last supported layer is the fully connected layer. The fully connected layer is executed similarly to both pooling and convolution. It starts four different DMAs, one for each of the input data X, the weights W and the output Y, and the configuration through CTRL.
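A sketch of how one accelerated layer could be driven from Python is shown below. The DMA instance names and buffer sizes are hypothetical; the point is only that all transfers are started before waiting, mirroring the parallel tasks described above.

```python
# Sketch of one accelerated layer (hypothetical DMA instance names).
# The control word layout is illustrative, not the real register format.
import numpy as np
from pynq import Overlay, allocate

ol = Overlay("cnna.bit")
dma_ctrl, dma_w = ol.dma_ctrl, ol.dma_w
dma_x, dma_y = ol.dma_x, ol.dma_y

ctrl = allocate(shape=(16,), dtype=np.uint32)        # layer configuration words
w = allocate(shape=(64 * 3 * 3 * 3,), dtype=np.int16)
x = allocate(shape=(224 * 224 * 3,), dtype=np.int16)
y = allocate(shape=(224 * 224 * 64,), dtype=np.int16)

dma_ctrl.sendchannel.transfer(ctrl)                  # configuration first
dma_w.sendchannel.transfer(w)                        # kernel weights
dma_x.sendchannel.transfer(x)                        # input feature map
dma_y.recvchannel.transfer(y)                        # result comes back on Y
dma_y.recvchannel.wait()                             # layer is done when Y completes
```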
Two interfaces are used. The streaming interface used for the DMA is implemented with a functional deterministic guarantee so that no race condition can happen, which makes the whole IP very stable. The other interface, AXI-lite [32], is a memory-mapped interface that is only used for reading status registers.
Streaming interface.
The streaming interface, i.e. the AXI streaming interface, is used for transmitting the data between the CPU and the IP. This interface can, in theory, be connected to any module supporting the interface. In this work, however, it is connected to a DMA interface. The main reason for choosing the streaming interface is that it enables transmission of data to the IP using the high performance (HP) buses, which can transfer a full bus word per clock cycle using standard DMA modules. The streaming interface also provides back-pressure, which allows throughput balancing of the streaming ports.
Status interface.
The status register is used for checking the properties of the current CNNA, so the software knows how the data should be aligned, and for reading status registers and debugging. The interface is implemented as an AXI4-lite [33] memory-mapped interface and is used to interact with the IP as part of the CPU memory address space. This can remove some of the overhead that comes with using DMA and is only faster when transferring few data values.
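A minimal sketch of such a register read from Python is given below; the base address and register offsets are placeholders, not the real address map of the CNNA.

```python
# Sketch of reading CNNA status/property registers over AXI4-Lite
# (base address and register offsets are assumptions).
from pynq import MMIO

CNNA_BASE = 0xA0000000          # assumed base address from the address map
CNNA_RANGE = 0x1000

regs = MMIO(CNNA_BASE, CNNA_RANGE)
version = regs.read(0x00)       # e.g. IP version / capability word
pe_count = regs.read(0x04)      # e.g. number of processing elements
print(hex(version), pe_count)
```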
Figure 3: Example of buffer stitching with a split of three shown for a single pixel.
Some convolutional layers are too large to be processed as a single CNNA iteration. This means that they are split up into several sub-convolutions. However, the result is returned from the IP Core with the depth first and thus needs to be stitched together with the later results. This is done using the IP Core, which has a DMA channel (XBUF) for this purpose, as shown in figure 1. An example of stitching can be seen in figure 3. The example illustrates the output of two pixels, a and b, from a convolution that has a depth that needs to be split.

The stitching is done using two equally sized buffers, which both have a size equal to the expected output size. The size in this example is six. The first convolution only uses the first buffer as the output, where pixels a and b are both updated by the DMA. However, only the first third of the depth is calculated in the first convolution. The second convolution calculates the next third of the output. However, these outputs need to be stitched in between the previous outputs. This is done by using the output of the first convolution as a stitch buffer. The IP Core is informed to use the first part of each pixel and appends the result to each pixel depth-wise. The result of this stitching is sent to the second output buffer. The third convolution takes two thirds from the stitch buffer and the last third from the output for each pixel. The output of the stitched convolution ends up in the buffer that was used as output in the last stitching.

Most fully connected layers are also too large to be processed at once, in which case the splits are handled differently. A fully connected layer generates a single value per output. The buffer will be filled with values from the first split when it runs the first time. The second time it runs, it will produce the next outputs, which need to be put after the first split in the buffer. This is solved by adding the number of bytes produced by the first split to the physical address, i.e. the pointer to the address in RAM. All the splits need to have an adjusted number of bytes to receive, so that each only receives the right amount of data. When all the splits have been processed, the result is in the same buffer, which is then ready for the next layer. This means that the fully connected layer only needs a single buffer, contrary to the convolution layers, which need two.
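The depth-wise interleaving can be emulated in a few lines of NumPy. This is only a functional sketch of the stitching described above, not the hardware implementation.

```python
# Functional sketch of stitching a convolution that is split into three
# depth slices. Each split produces all pixels but only a third of the
# output depth; the result is interleaved depth-wise per pixel.
import numpy as np

pixels, depth, splits = 2, 6, 3
slice_depth = depth // splits

# Pretend outputs of the three partial convolutions (pixels x slice_depth).
partial = [np.arange(pixels * slice_depth).reshape(pixels, slice_depth) + 10 * s
           for s in range(splits)]

stitched = np.zeros((pixels, depth))
for s, part in enumerate(partial):
    # Append each split after the previously stitched channels of every pixel.
    stitched[:, s * slice_depth:(s + 1) * slice_depth] = part

print(stitched)   # per pixel: [split0 | split1 | split2] along the depth axis
```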
Figure 4: Illustration of the CNNA architecture with five streaming buffer inputs for control, stitching, weights and data, as well as one result output buffer. A number of processing elements (PE) are used to accelerate the pooling and multiplications for each neuron in the network. The CNNA will be executed several times, typically once for each layer during inference.

Figure 4 shows the architecture overview of the main elements of the CNNA. The CNNA works as an accelerator for a single layer at a time. This means that the accelerator needs to be reconfigured for each layer, which is done using the streaming interface CTRL.

Furthermore, the accelerator has four other streaming interfaces, each of which has a specific use. The streaming interface W is used to load the weights, which can consist of multiple kernels, and cache these in the weight buffer. This means they can be used multiple times. The streaming interface X is used to stream in the input data, which can either be an image or the output from the previous layer. Xbuf is an interface that is used when a convolutional layer is split into several splits, which need to be stitched together correctly. The last streaming interface is Y, which streams out the output values of the operation.

The accelerator is built for three different operations: convolution, pooling and fully connected layers. During convolution acceleration, it uses the weight buffer, the data buffer and the PEA. The pooling operation is done using only the data buffer and the pooling block. When executing the fully connected layers, the weight buffer and data buffer are simply set to forward the data directly to the PEA, thus generating a dot product of the two. The CNNA is designed using the five blocks briefly described in the following sections.

Weight buffer.
The weight buffer is used to cache the weights inside the FPGA. This cached data is transmitted to the PEA where the convolution is calculated.
Data buffer.
The data buffer is used to handle the input data stream and create the windows on which the convolution or pooling is carried out. These windows are used for further calculations, which will be described later in this section.
Processing elements array (PEA).
The PEA is used for handling the math acceleration in the CNNA, i.e. it is an array of dot product accelerators. Each of these PEs is given a matching pair of input data and weights from the data buffer and the weight buffer, respectively. They then calculate the output. The PEs also handle the activation function, which can be either linear or ReLU [34]. The output of these is passed on to the output handling element.
Pooling.
The pooling element is used to accelerate the pooling operation. It gets its input from the data buffer and sends the output to the output handling part, thus bypassing the PEA, which is not used in pooling.
Output handler.
The output handling element plays a major role in getting the output of the CNNA into the right shape and alignment before streaming it out through interface Y. It merges the results from the PEA when it is used, and handles the case where the data needs to be stitched with Xbuf, i.e. if a convolution operation has been split into more than one convolution and needs to have the old output interleaved into the new output. Splitting the convolution happens when too many kernels need to be stored in the weight buffer. It also handles the output of a pooling operation, which simply means forwarding the output of the pooling element.

Figure 5: Weight buffer with alignment, illustrating how kernel weight data are aligned so that a specific kernel gets in the right spot. The illustration shows how the weight data of kernel 0 are sent to stream buffer 1. The iteration interval (II) and bandwidth (BW) of the weight input package are changed by a factor of resize_factor. The yellow part of the image cube shows which part of the kernel is sent.
The weight buffer is used for caching the weight kernels. This caching is necessary because the convolution requires the kernels to be used once for each output pixel. For instance, the first convolutional layer of VGG16 has an output of 224 × 224 = 50,176 pixels, so every kernel must be applied 50,176 times. The weight buffer changes the bandwidth of the incoming weight packages, BW(in), by a factor of resize_factor, as illustrated in figure 5. The realign module splits the raw package into smaller packages. The splitter separates the data stream into N different stream buffers, each of which has the same BW as the resized BW, BW(resize). Each stream buffer sends the kernel to a PE X times.

The realignment in the weight buffer is complicated. This is firstly due to the bias value, which uses a complete package. Secondly, it is complicated because the kernels need to match the order in which the three-dimensional window comes from the data buffer, i.e. have the same positions and depths as the data buffer. In figure 5, the first package contains the bias values transferred to the weight buffer. It shows that this single value only uses a complete resized package. This is followed by N other bias packages. After all bias packages are sent, the weight packages are sent. In this example, the weight packages contain a 3 × 3 kernel.

An image typically consists of three channels, RGB, which can be visualized as a three-dimensional cube. A three-dimensional image is illustrated in figure 6. The image is stored in raster order, i.e. firstly pixel (0,0) channel 0, then channel 1 of the same pixel, followed by the last channel. This is followed by the same three channels for the next pixel, which means that the Z-axis is processed first, then the Y-axis and the X-axis last. Raster order is the order in which the image data is streamed to the CNNA.

The Circular Line Buffer (CLB) can be considered the brain of the CNNA, because it allows it to increase the BW and removes the need to realign the input data for each layer. However, before explaining the different parts of the CLB, the parameters that the actor can set through the control interface must be explained. These parameters are:
• Row size: The row size of the input image, i.e. the Y-axis length. It is assumed that the image is quadratic.

• Depth: The depth is the number of channels of the input image, i.e. the Z-axis of the image. This should be divisible by the bandwidth.

• Stride: Stride is a convolution parameter.

• Window size: The size of the window. If the window size is 3, the real window size would be 3 × 3 × depth. This is also a convolution parameter.

• Zero pad: A convolution parameter setting the size of the zero padding around the image.

• Replay: How many times the CLB should re-send a single window.

Figure 6: An illustration of the flow of data through the CLB. It consists of two parts: a line buffer for storing the N previous lines and a shift buffer for storing the N previous pixels of each line. The leftmost image cube illustrates a single pixel 202, which is written to the line buffer. The middle cube illustrates which data is saved in which line buffer and how the new line replaces the first line. The rightmost cube illustrates what data is in the shift buffers. The missing part illustrates how much more data is needed from the line buffers before a complete window is in the shift buffer. The read pointer on the shift buffers is used for getting the N previous samples and is used for generating the output from the shift buffers.

After setting up the CLB with the parameters through the control interface, the image data can flow into the CLB. The CLB consists of two parts: a line buffer for storing the N previous lines and a shift buffer for storing the N previous pixels of each line. These parts are explained in detail below.

Line buffers.
The first module in the CLB, where the image data is ordered and stored, is the line buffer. This module streams one row of the image with all channels at the same time. The number of line buffers is equal to the maximum window size minus one, N_line buffers = window_size − 1. This is because only the N previous lines are needed to construct a window. The illustration of the data flow of the line buffer in figure 6 shows that the number of stored lines equals window_size − 1. It is also indicated that the buffer is stored circularly, i.e. the oldest value is overwritten by the newest when new data arrive. This is handled by the pointer, which can be seen in figure 6. This pointer will increase each time a new input is received, and after receiving a whole line, the line buffers will rotate, i.e. the first line will be moved to the back, the second line will be pushed forward and the pointer will be reset. This is done by multiplexing logic in the implemented design.
Shift buffers.

After the line buffers, the data reaches the shift buffers. These buffers are used for getting the N previous pixels from each line, i.e. having all the pixels needed for a convolution window, as shown in figure 6. The shift buffers have another important function too: they replay the window for the convolution if there are not enough PEs to run all the dot products in the convolution operation at once. The shift buffers are RAM-based shift buffers and consist of two pointers. The write pointer is essentially controlled by counting up whenever data is written and moving it back to the start of the shift buffer when the end has been reached. The read pointer, however, is controlled by logic which tells the shift buffer that it needs the N previous samples. This is handled by the shift buffer, which also calculates its new positions.
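A software model of what the CLB produces can be written as a sliding-window extraction over an image stored in raster order. The sketch below assumes a square input and ignores the replay mechanism; it is a functional emulation only, not the hardware structure.

```python
# Functional model of the CLB output: 3-D windows produced in raster order.
# (Software emulation only; stride/zero-padding handling is simplified.)
import numpy as np

def clb_windows(image, window=3, stride=1, zero_pad=1):
    """Yield window x window x depth windows in the order the CLB streams them."""
    padded = np.pad(image, ((zero_pad, zero_pad), (zero_pad, zero_pad), (0, 0)))
    rows, cols, _ = padded.shape
    for r in range(0, rows - window + 1, stride):
        for c in range(0, cols - window + 1, stride):
            yield padded[r:r + window, c:c + window, :]

img = np.random.rand(8, 8, 3)            # small quadratic RGB image
first = next(clb_windows(img))
print(first.shape)                        # (3, 3, 3)
```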
Figure 7: Illustration of the PE design. It consists of a parallel multiplier array followed by a summing tree and lastly an accumulator, scale and the activation function logic.

The heart of the CNNA is the PEA. Each processing element (PE) performs hardware acceleration of a dot product with a small range of activation functions, i.e. linear or ReLU. The PE operation can be written as shown in equation 1, which is a dot product of the two equal-length vectors $\vec{x}$ and $\vec{w}$.

$$PE(\vec{x}, \vec{w}) = f\left(\sum_{i=0}^{N-1} x_i \cdot w_i\right) \qquad (1)$$

Each PE receives data frames in pairs from the weight buffer and the data buffer, i.e. one from each. The acceleration of the PE is done by running the multiplications in parallel and totaling the results afterwards, as illustrated in figure 7. This data is dotted together and followed by the activation function.

Figure 7 shows how the PE has two inputs: W, the weight input, and X, the data input. When the PE has received a frame on both W and X, the data frames are dotted together and the bias is added to the result. The result is forwarded to the next part, which is the PE summer. This part accumulates the results it receives from the PE dot product. It will keep on accumulating until it receives the last flag. When this happens, it will multiply the accumulated value by a factor set by the actor, i.e. the control interface, and apply the activation, which is also set by the actor through the control interface. Lastly, it is streamed out through the port Y, and the accumulated result is reset.
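Equation 1 maps directly onto a short software model of a PE; the accumulation over several frame pairs, the output scaling and the activation mirror the description above. This is an illustrative model, not the SystemC implementation.

```python
# Software model of a single processing element (PE): dot product,
# accumulation over frames, output scaling and activation (eq. 1).
import numpy as np

def pe(frames_x, frames_w, bias=0.0, scale=1.0, activation="relu"):
    acc = bias
    for x, w in zip(frames_x, frames_w):      # one frame pair per burst
        acc += float(np.dot(x, w))            # parallel multipliers + adder tree
    acc *= scale                              # factor set via the control interface
    return max(acc, 0.0) if activation == "relu" else acc

x_frames = [np.ones(8), np.ones(8)]
w_frames = [np.full(8, 0.5), np.full(8, 0.25)]
print(pe(x_frames, w_frames))                 # 8*0.5 + 8*0.25 = 6.0
```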
Figure 8: Illustration of the logic of the pooling accelerator. The data is received as a small slice (the purple cube), which is the output of the CLB. The input data is compared with the current pooling value, or saved if it is the first value. After this, the pooling logic is executed. If it is max-pooling, for instance, it checks if the input is larger than the stored value, and stores the largest one. When a whole window has been run through the accelerator, it will start streaming out the calculated pixels. These steps are repeated for all the windows created by the CLB.

Pooling is used in a CNN to downsample an image. The reason for placing the pooling operator inside the CNNA is reuse of the CLB hardware. The pooling accelerator receives its input directly from the CLB, and the output from the pooling goes directly to the output.

Looking at figure 8, it can be seen that the pooling block consists of logic for handling the pooling operation, e.g. max-pooling, and RAM for buffering a single pixel. The pooling logic is controlled by the actor and is used for setting the depth of the current image and the size of the window, e.g. 2 × 2 × depth or 3 × 3 × depth. The last parameter controls what type of pooling operator should be run, i.e. max-, min- or average-pooling.
4. Training for fixed-point
To overcome the challenge of the CNNA using fixed-point values, an emulation of fixed-point needs to be made for the CNN to be trained and calculated correctly. This is mostly due to the large dynamic range of the weights.

This emulation is shown in equation 2, where $Q_{[I.F]}(x)$ is the fixed-point representation of $x$ in the fixed-point format Q[I].[F] [35]. Here, I is the number of integer bits and F is the number of fractional bits. First, the number $x$ is scaled up by $2^F$ and then rounded off to resolve the limited resolution of fixed-point numbers. This is followed by what is essentially a saturation of the number to the range of the fixed-point number, i.e. between $-2^{I+F-1}$ and $2^{I+F-1}-1$. Lastly, the number is scaled down by the same factor it was scaled up by. This results in a value that can be interpreted correctly by the CNNA.

$$Q_{[I.F]}(x) = \max\left(-2^{I+F-1},\; \min\left(2^{I+F-1}-1,\; \mathrm{round}(x \cdot 2^F)\right)\right) \cdot 2^{-F} \qquad (2)$$

The weights are quantized as a constraint to the optimizer, which executes the backpropagation [36]. This constraint is set to quantize all weights after each update using equation 2. This results in the stochastic gradient descent (SGD) update formula shown in equation 3, where $Q_{[I.F]}(x)$ is the quantization function shown in equation 2, $W_{ij}^{(l),t=\tau-1}$ is the previous weight, $W_{ij}^{(l),t=\tau}$ is the new weight, and $\alpha$ is the learning rate.

$$W_{ij}^{(l),t=\tau} = Q_{[I.F]}\left(W_{ij}^{(l),t=\tau-1} - \alpha\, \nabla W_{ij}^{(l),t=\tau-1}\right) \qquad (3)$$

However, this introduces a problem that makes the training freeze. The cause of the problem is that the size of the update to the weights is too small to move from one quantized value to another. It is not possible to update a single weight in Q2.6 with a value smaller than the smallest quantized value, in this case $2^{-6} = 0.015625$. The effect of a too-small update is illustrated in equation 4, where the update vanishes after quantization.

$$W_{ij}^{(l)} = Q_{[2.6]}(1.0 - 0.001) = 1.0 \qquad (4)$$

To solve this problem, a copy of the weights $W$ is saved so that the forward pass, i.e. inference, is calculated using the quantized weights, while the SGD update is calculated using the unquantized weights. This means that the weights do not get stuck between quantization steps. This is also known as lazy update SGD [37]. In this way, the weights $W$ are saved and the quantized weights $W_Q$ are used for the forward pass, as shown in equations 5 and 6.

$$W_{ij}^{(l),t=\tau} = W_{ij}^{(l),t=\tau-1} - \alpha\, \left(\nabla W^{(l),t=\tau-1}\right)_{ij} \qquad (5)$$

$$W_{Q,ij}^{(l),t=\tau} = Q_{[I.F]}\left(W_{ij}^{(l),t=\tau}\right) \qquad (6)$$

By using these equations, the optimizer can train the CNN even though the changes are too small to be significant when quantized.
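The sketch below shows equation 2 and the lazy-update idea of equations 5 and 6 in Keras: the optimizer keeps full-precision weights, while the forward pass sees quantized copies. A straight-through trick is used here as one possible way to realize this in TensorFlow; it is not necessarily the authors' exact implementation, and the layer name is hypothetical.

```python
# Sketch of eq. 2 and the lazy-update scheme (eqs. 5-6) in TensorFlow/Keras.
import tensorflow as tf

def quantize(x, i_bits=2, f_bits=14):
    """Fixed-point emulation Q[I].[F] as in eq. 2."""
    lo = -2.0 ** (i_bits + f_bits - 1)
    hi = 2.0 ** (i_bits + f_bits - 1) - 1
    q = tf.clip_by_value(tf.round(x * 2.0 ** f_bits), lo, hi)
    return q * 2.0 ** -f_bits

class QuantDense(tf.keras.layers.Dense):
    """Dense layer whose forward pass uses Q[I].[F] weights (W_Q in eq. 6),
    while the stored full-precision weights W receive the SGD update (eq. 5)."""
    def call(self, inputs):
        w = self.kernel
        w_q = w + tf.stop_gradient(quantize(w) - w)   # forward: W_Q, backward: W
        outputs = tf.matmul(inputs, w_q)
        if self.use_bias:
            b = self.bias
            outputs = outputs + b + tf.stop_gradient(quantize(b) - b)
        return self.activation(outputs)
```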
It has been found that the small kernels in the first convolutional layers of the CNN VGG16 have large weights, i.e. close to 1 or −1, but the fully connected layers have very small weights that only use the lowest bits, even in Q2.14. This means that the CNN needs more fractional bits. However, this can be solved by dynamically scaling the weights and the output. This is carried out with integers in [38]. The following will show how this can be carried out on fixed-point values as well. It has been found that the dynamic range of each kernel is almost the same within each layer. This knowledge can be used to add scaling to each layer in order to change the dynamic range of the weights. For example, consider the weights

$$W = \begin{bmatrix} 0.11 & 0.15 & -0.30 \\ -0.05 & 0.002 & 0.06 \end{bmatrix}$$

and a fixed-point format Q[I].[F] which, for simplicity, is able to store a maximum value of 1, denoted $Q^{MAX}_{[I.F]}$. To find the scaling needed for a better dynamic range, equation 7 can be used. This equation takes the maximum absolute value of the weights and divides it by the maximum value of the fixed-point format.

$$scale^{(l)} = \frac{\max_i\left(|W_i^{(l)}|\right)}{Q^{MAX}_{[I.F]}} = \frac{|-0.30|}{1} = 0.30 \qquad (7)$$

The scaled value of the weights can now be calculated as shown in equation 8, which divides the weights by $scale^{(l)}$. This shows that the element with the largest magnitude is now −1.

$$W^{(l)}_{scale} = \frac{W^{(l)}}{scale^{(l)}} = \begin{bmatrix} 0.367 & 0.5 & -1 \\ -0.167 & 0.0067 & 0.2 \end{bmatrix} \qquad (8)$$

Using this scale factor, the output of a layer is calculated as shown in equation 9, which has an added multiplication by the quantized value of the scale factor, where $z^{(l)}_{scale}$ is the scaled output of layer $l$, $W^{(l)}_{scale}$ are the scaled weights, $\vec{a}^{(l-1)}$ is the output from the previous layer and $scale^{(l)}$ is the scale factor of layer $l$.

$$z^{(l)}_{scale} = Q_{[I.F]}\left(W^{(l)}_{scale} \cdot \vec{a}^{(l-1)}\right) \cdot Q_{[I_{scale}.F_{scale}]}\left(scale^{(l)}\right) \qquad (9)$$

Because of the quantization, it cannot be guaranteed that the outputs are the same, but they should be very similar, i.e. $z^{(l)} \simeq z^{(l)}_{scale}$. The main difference between the scaled and unscaled version is that $z^{(l)}_{scale}$ is better suited for the bit range of the fixed-point format than $z^{(l)}$. More bits result in a better SNR, as shown in equation 10. This means that, for the example shown earlier, the number of bits used has increased by $\lceil \log_2((scale^{(l)})^{-1}) \rceil = 2$, which corresponds to an increase in SNR of approximately 12 dB.

$$SNR_{dB}(N_{bits}) = 20 \log_{10}\left(2^{N_{bits}}\right) \approx 6.02 \cdot N_{bits} \qquad (10)$$
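The per-layer auto-scaling of equations 7 and 8 can be reproduced in a few lines of NumPy, using the example weight matrix above. This is only an illustrative calculation.

```python
# Worked example of the per-layer auto-scaling (eqs. 7-8) in NumPy.
import numpy as np

Q_MAX = 1.0                                   # max value of the assumed format
W = np.array([[0.11, 0.15, -0.30],
              [-0.05, 0.002, 0.06]])

scale = np.max(np.abs(W)) / Q_MAX             # eq. 7 -> 0.30
W_scaled = W / scale                          # eq. 8, now spans the full range

extra_bits = int(np.ceil(np.log2(1.0 / scale)))   # ~2 extra bits of resolution
print(scale, extra_bits)
print(W_scaled)
```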
5. Design space exploration
The template-based IP Core written in SystemC has a number of parameters that must be selected to achieve an optimal solution. When optimizing the IP Core for an FPGA, it can be an enormous task to generate a design and find the optimal design parameters. It takes approximately one hour to synthesize the HLS code to RTL code. If this was to be done for all possible combinations of parameters, it would take weeks, months or even years, since the developed architecture has such a large number of parameters, e.g. bandwidth (BW) between modules, FIFO depth, the number of PEs, etc.

The design parameters are used for tuning the CNNA design in order to find a tradeoff between precision, speed and resources. The CNNA tuning parameters used are as follows:

data(W)size. The word length of the fixed-point data format in bits, i.e. I+F. It has an impact on precision.

PE_BW{×}. The internal BW, with an element size of data(W)size, used by the CNNA.

PE_N. The number of PEs; the size of the PEA is limited by the FPGA fabric.

DB(output)BW{×}. The output BW multiplier after the CLB. Normally this will be set at a value equal to CLB(rows)N, but it can be set to a lower number in order to allow the PE to run with lower BW and potentially have a bigger PEA. The internal BW in the PE will be DB(output)BW{×} · PE_BW{×}, with an element size of data(W)size. The BW used inside the weight buffer is also equal to DB(output)BW{×} · PE_BW{×}.

kernels(R×3×3)N. Used to calculate the weight buffer size, WB(buffer)size = ((3 × 3 × DB(output)BW{×} · PE_BW{×}) + bias_size) · kernels(R×3×3)N.

Tuning the CNNA can be expressed as a vector, β, with 5 hyper-parameters:

β = [data(W)size, PE_BW{×}, PE_N, DB(output)BW{×}, kernels(R×3×3)N]

To measure the performance of the different CNNA configurations, a simulation was made. It consisted of five different elements: two pooling operations, two convolution operations and a single fully connected operation, which were executed individually but evaluated together.

When looking at the latency for the combined simulation test, i.e. the five simulations carried out consecutively after each other, the dominant candidates all have DB(output)BW{×} = 1 regardless of word length (see figure 9). The figure shows that the faster the accelerator is, the higher the number of PEs.

Figure 9: Results of the C-simulation of the combined test of all fixed-point candidates, showing the average resource usage of BRAM and DSP vs latency. The PE_BW{×} is set to 128 for all solutions. The plotted text for a candidate is in the format [PE_N, DB(output)BW{×}, kernels(R×3×3)N]. The candidates are split up in three groups of the word lengths (data(W)size): 8 bit, 16 bit and 32 bit versions.

Two models were created of each configuration, one of which was done using C-simulation, i.e. simulation using the SystemC HLS code directly. The other was an RTL-simulation, which used the RTL code generated from the SystemC model. The latter was clock cycle accurate and the execution time was precise.

Several candidates were identified and are shown in greater detail in table 1. The table shows the number of DSPs and BRAMs used, as well as the total latency for C- and RTL-simulation. Some candidates, marked with a "-", use more resources than are available on the tested target platform. Other candidates were synthesized and tested using RTL-simulation, which simulates the real HDL code generated. This also gives a more precise resource usage, which is typically slightly lower than the one estimated using C-simulation. The execution time is also shown and is slightly higher than the estimated value.

Finding the optimal parameters was done using two different fixed-point formats (data(W)size): Q2.14 and Q2.6, i.e. a word length of 16 bits and 8 bits, respectively. These were chosen because of area constraints of the FPGA on the Xilinx Ultra96 board [39]. However, 32 bits would have been possible with a larger FPGA.

Finally, three different configurations of the CNNA were chosen for the final test of the system, one of which used the 16-bit fixed-point format Q2.14 while two used the 8-bit fixed-point format Q2.6.
6. Results and discussion
The dataset DETECT [40] was used to verify the system. This dataset consisted of 29 classes of micro-invertebrates suspended in alcohol. Only the first five classes were used in the first test, while the second test used all 29 classes. The training was carried out over the span of 100 epochs, which means that the complete dataset was repeated 100 times during training.

The CNN used was VGG16 [2]. Here, the dense layers following the CNN blocks are two fully connected layers with either 4096 or 1024 neurons. The last fully connected layer has either five or 29 neurons, depending on the number of classes. The training was performed on two fixed-point formats, Q2.14 and Q2.6, and tested on three configurations, which will be denoted CNNA_1, CNNA_2 and CNNA_3. CNNA_1 uses a word length of 16 bits (Q2.14), while CNNA_2 and CNNA_3 use a word length of 8 bits (Q2.6).

The CNN was trained using the small dataset in order to find suitable candidates faster, since it is easier to train for five classes than for 29 classes. If the accuracy of a fixed-point format is poor on five classes, it will likely be as poor, or worse, when training on 29 classes. In addition, training on five classes takes approximately two and a half hours on VGG16 with 4096 neurons in the fully connected layers, whereas it takes almost eight hours to train the network when using all 29 classes. Therefore, the first training is done on the small dataset.

Figure 10 shows that most of the trained models face some issues and obtain low accuracy when using a fixed-point format. The only quantized version that obtains a high accuracy is the one using fixed-point format Q2.14. It is unknown why the training that uses fixed-point format Q2.14 and no auto-scaling makes a sudden dive after 10 epochs. However, it could be caused by the learning rate being too high or too low, or too few neurons in the fully connected layers. It seems that the best results are achieved using fixed-point format Q2.14 and auto-scaling, which converges towards an accuracy of almost 100%.

Table 1: Design space exploration of resource usage and latency of possible CNNA candidates using C- and RTL-simulation. A "÷" means RTL-simulation performed, but insufficient space on target platform (Ultra96). A "-" means RTL-simulation not performed.

| Parameters β | resource average % | DSPs | DSPs (RTL) | BRAMs | BRAMs (RTL) | latency [ms] | latency [ms] (RTL) |
|---|---|---|---|---|---|---|---|
| [8, ·, ·, ·, 42] | 70 | 384 | 359 | 144 | 249 | 4.60 | ≈ 6 |
| [8, ·, ·, ·, 42] | 34 | 128 | 125 | 137 | 185 | 4.95 | ≈ 7 |
| [8, ·, ·, ·, 42] | 124 | 768 | - | 152 | - | 4.26 | - |
| [8, ·, ·, ·, 42] | 52 | 256 | 245 | 139 | 193 | 4.03 | ≈ 6 |
| [16, ·, ·, ·, 32] | 54 | 192 | 360 | 233 | 377 | 7.47 | ≈ 8 |
| [16, ·, ·, ·, 32] | 35 | 64 | - | 227 | - | 7.61 | - |
| [16, ·, ·, ·, 32] | 81 | 384 | ÷ | · | ÷ | · | ÷ |
| [16, ·, ·, ·, 32] | 44 | 128 | 293 | 229 | 349 | 6.27 | ≈ 8 |
| [32, ·, ·, ·, 20] | 94 | 384 | - | 355 | - | 13.19 | - |
| [32, ·, ·, ·, 20] | 58 | 128 | 165 | 351 | 336 | 12.92 | ≈ 14 |
| [32, ·, ·, ·, 20] | 148 | 768 | - | 359 | - | 12.42 | - |
| [32, ·, ·, ·, 20] | 76 | 256 | 325 | 353 | 408 | 10.73 | ≈ 12 |
Figure 10: Fully connected layers have 1024 neurons. Gray: floating-point, orange: fixed-point Q2.14 with auto-scaling, blue: fixed-point Q2.14 without auto-scaling, red at the bottom: fixed-point Q2.6 with and without auto-scale.

Figure 11: Fully connected layers have 4096 neurons. Orange: fixed-point Q2.14 with auto-scaling, blue: fixed-point Q2.14 without auto-scaling, red at the bottom: fixed-point Q2.6 with and without auto-scale.

Figure 11 shows that the fixed-point format Q2.14 with auto-scaling performs well on the CNN with 4096 neurons in the fully connected layers. The training that uses Q2.14 with no auto-scaling performs well and reaches approximately 84%. None of the fixed-point Q2.6 versions managed to be trained or achieve any useful results.

Table 2 shows the results of the training. It shows the type of fixed-point training, omitting the ones that did not achieve any useful results. The table shows the number of neurons in the fully connected layers, N_neurons, as well as the validation accuracy, where validation is performed on a dataset not used for training. Finally, it shows the accuracy on the training dataset.

Table 2: Training results for the training of VGG16 on 5 classes.

| Type | auto-scale | N_neurons | validate [%] | train [%] |
|---|---|---|---|---|
| float | n/a | 1024 | 97 | |

The final test was performed on all 29 classes of DETECT, using the candidates that performed well in the previous test. The best candidates were the floating-point version for reference and the versions using fixed-point format Q2.14, both with and without auto-scaling. As is evident from figure 12, only the training using fixed-point format Q2.14 and auto-scaling achieved any promising results. It shows that it is much harder to train the CNNA when using quantization, because details are lost due to the limited range of the fixed-point numbers. However, it is possible to achieve a better result using the auto-scaling technique.

Figure 12: Fully connected layers have 4096 neurons. Blue: floating-point, red: fixed-point Q2.14 with auto-scale, light blue: fixed-point Q2.14 without auto-scaling.

The specific results from the training can be found in table 3, which shows that training is possible even when using quantization. However, it takes many more iterations for the training to reach the same accuracy level as floating-point.
Table 3: Training results for the training of VGG16 on 29 classes.
| Type | auto-scale | N_neurons | validate [%] | train [%] |
|---|---|---|---|---|
| float | n/a | 1024 | 88 | |
| Q2.14 | no | 1024 | 5 | |
| Q2.14 | yes | 1024 | 86 | |
| Q2.14 | no | 4096 | 5 | |
| Q2.14 | yes | 4096 | 86 | |

The Xilinx Ultra96 board [39] was used to evaluate the performance of the system using a hardware clock of 100 MHz and 172.22 MHz for the CNNA IP Core. The inference time was measured for the configurations CNNA_1, CNNA_2 and CNNA_3, and the inference times are shown in table 4. The timing performance is measured on the Ultra96 board during inference of the quantized and trained VGG16 model with five classes. The mean time and variance are measured as an average over 30 measurements. The fastest model takes 1.22 sec per image, while the slowest, CNNA_1 at 100 MHz, takes 2.20 sec per image.
Table 4: Average inference time and variance using VGG16 for five classes using four different IP Cores.

| | CNNA_1 100 MHz | CNNA_1 172 MHz | CNNA_2 172 MHz | CNNA_3 172 MHz |
|---|---|---|---|---|
| avg [sec] | 2.20 | 1.96 | 1.22 | |
| var [·10⁻³] | 0.25 | 0.30 | 0.20 | |

The fastest configuration obtains the best performance due to the larger number of PEs (16). Note that one of the 8-bit configurations is slightly faster than the other in the convolutional layers, even though it has fewer PEs, due to the higher bandwidth of its output multiplier. There is a large number of splits (512) in the dense_1 and dense_2 layers, and they consume more than half of the total execution time for all three CNNA configurations. A larger FPGA enabling more PEs and buffers could be a solution to lower the number of splits and optimize the system further.

The power consumption of the design with CNNA_1, CNNA_2 and CNNA_3 was measured on the Ultra96 board during inference of the trained VGG16 model with five classes. The measured voltage of the power supply to the board was multiplied with the measured current to compute the power consumption. The mean and maximum power during inference are calculated as a mean over 10 inferences. The power consumption of the IP Core is defined as the difference between the Ultra96 board idling and the power during inference. The idle power consumption was measured over a five-minute period to be P_idle = 3.055 W.
Table 5: Time of execution of each VGG16 layer in [ms] using four different IP Cores.

| layer | CNNA_1 100 MHz | CNNA_1 172 MHz | CNNA_2 172 MHz | CNNA_3 172 MHz |
|---|---|---|---|---|
| l1 conv1 | 19.3 | 17.0 | 19.9 | 21.4 |
| l1 conv2 | 111 | 84.1 | 61.3 | 60.1 |
| l1 pool | 18.1 | 13.7 | 12.9 | 17.1 |
| l2 conv1 | 55.4 | 42.9 | 31.3 | 30.5 |
| l2 conv2 | 108 | 81.3 | 60.2 | 56.3 |
| l2 pool | 8.98 | 6.87 | 6.36 | 8.40 |
| l3 conv1 | 56.0 | 4.35 | 33.1 | 29.9 |
| l3 conv2 | 112 | 84.2 | 64.2 | 59.1 |
| l3 conv3 | 110 | 85.5 | 63.0 | 57.8 |
| l3 pool | 4.51 | 3.48 | 3.25 | 4.23 |
| l4 conv1 | 64.6 | 51.5 | 37.9 | 35.3 |
| l4 conv2 | 126 | 97.9 | 76.0 | 70.5 |
| l4 conv3 | 123 | 102.0 | 73.5 | 67.8 |
| l4 pool | 2.32 | 1.83 | 1.71 | 2.19 |
| l5 conv1 | 46.6 | 41.0 | 30.7 | 29.3 |
| l5 conv2 | 49.1 | 39.7 | 29.7 | 28.1 |
| l5 conv3 | 45.8 | 39.5 | 29.6 | 27.8 |
| l5 pool | 0.74 | 0.62 | 0.59 | 0.69 |
| dense 1 | 767 | 737 | 364 | 509 |
| dense 2 | 393 | 397 | 197 | 362 |
| dense 3 | 1.55 | 1.62 | 1.18 | 1.50 |

Table 6: Average and peak power consumption of the Ultra96 board and the IP Core during inference.
| | CNNA_1 100 MHz | CNNA_1 172 MHz | CNNA_2 172 MHz | CNNA_3 172 MHz |
|---|---|---|---|---|
| P_avg [W] | 5.28 | 5.68 | 4.71 | 4.80 |
| P_peak [W] | 6.60 | 7.14 | 5.76 | 6.35 |
| P_IP avg [W] | 2.23 | 2.63 | 1.66 | 1.74 |
| P_IP peak [W] | 3.55 | 4.09 | 2.71 | 3.30 |

Table 6 shows that the mean power consumption of the Ultra96 board is between 4.71 and 5.68 W. The version using a 100 MHz clock is 0.24 sec slower but consumes less power than the version using a 172 MHz clock. Table 6 also shows that the peak power consumption of the IP Core is almost the same for all tested IPs, in the range from 2.71 to 4.09 W.

Figure 13: Power consumption of the tested VGG16 with format CNNA during inference at 172 MHz.

Figure 14: Power consumption of the tested VGG16 with format CNNA during inference at 172 MHz.

The power consumption is largest in the beginning of the inference, i.e. in the convolution blocks of the CNN. The power consumption drops during execution of the fully connected layers. This indicates that most of the FPGA logic is in action during convolution, while less is used when computing the fully connected layers and pooling. Pooling activity corresponds to the big dips in power consumption in the first half of the inference.
7. Comparing to state-of-the-art CNNs
To compare the results of this work with current state-of-the-art CNNs, the same units of measurement are needed. Because of this, the throughput needs to be expressed in giga operations per second (GOPS) and the power efficiency in GOPS/W. This section will evaluate the CNNA of the present paper with the same units of measurement as in [41], and discuss whether the results are satisfactory.

Figure 15: Power consumption of the tested VGG16 with format CNNA during inference at 172 MHz.

Throughput.

The throughput in GOPS is calculated as shown in equation 11, where N_operations is the number of operations in the tested CNN. In the case of VGG16, this number is 30.76 GOP. t_run is the time of inference using the CNN in seconds and N_OPS is the number of operations per second in GOPS.

$$N_{OPS} = \frac{N_{operations}}{t_{run}} \qquad (11)$$

The results are shown in table 7, which indicates that the GOPS are lower than for the compared architectures. This can be due to different measurement methods, i.e. if only tested on convolutional layers, the number of GOPS is very large. However, if fully connected layers are taken into consideration, it will drop dramatically, which is the case for the tests carried out in this work. Furthermore, the Ultra96 target used in our evaluation is very small and low-cost compared to the ones used in the state-of-the-art examples. If a larger and more expensive target such as the Xilinx ZCU104 evaluation kit [42] was used, it would be possible to increase the number of PEs, thereby achieving a higher throughput and performance.

Power efficiency.
The power efficiency is expressed as GOPS/W. It can be calculated as shown in equation 12, where N_OPS is the number of operations per second calculated earlier, and P_IP_mean is the mean power consumption of the CNNA measured earlier.

$$GOPS/W = \frac{N_{OPS}}{P_{IP\,mean}} \qquad (12)$$

Compared to many of the current state-of-the-art accelerators, the CNN accelerator in this work performs well in terms of power efficiency. When using 16-bit fixed-point weights, it is comparable to both NeuFlow and Microsoft. However, the results are very dependent on the method of measurement, as described above. It is not the most power-efficient solution, but, unlike FINN, it does not need recompiling, and unlike YodaNN it is reconfigurable. In terms of power efficiency, it is comparable to Microsoft, although Microsoft uses a much larger and more expensive FPGA, and thus achieves a considerably higher throughput.
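As a small worked example of equations 11 and 12, the figures for the 16-bit CNNA at 172 MHz can be approximately reproduced from the measured values in tables 4 and 6.

```python
# Reproducing throughput and power efficiency (eqs. 11-12) from measured values.
N_OPERATIONS = 30.76      # GOP for one VGG16 inference
t_run = 1.96              # seconds per image (table 4, 16-bit at 172 MHz)
p_ip_mean = 2.63          # W, mean IP core power (table 6)

n_ops = N_OPERATIONS / t_run          # ~15.7 GOPS
gops_per_watt = n_ops / p_ip_mean     # ~6.0 GOPS/W
print(round(n_ops, 2), round(gops_per_watt, 2))
```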
8. Conclusion
In this paper, an architecture for a system on chip design was presented. The presented architecture implements the different operations necessary for a deep neural network to perform close to real-time inference. The architecture was implemented using Python and HLS for the IP Core and was able to run on the Ultra96 board using PYNQ. The interface for the system is similar to Keras and should be familiar to most engineers working in the field of machine learning.

The CNNA is able to accelerate deep learning algorithms that use any sequence of convolutional, max-pooling and fully connected layers. The layer operations can support many different parameters and will be able to perform inference using most modern CNNs. The network weights can use any 8-, 16- or 32-bit fixed-point format when exported from Keras, and the weights must be auto-scaled correctly. A training method was proposed which achieved inference accuracies that were almost as high when using fixed-point weights as when using floating-point weights.

The system is able to accelerate the CNN operations in HW. However, it is not able to run in real-time for a computer vision system that demands a throughput of several frames per second. The VGG16 architecture chosen for testing in this paper was able to perform inference in about 2 seconds per image. The power consumption of the developed CNN (Q2.14) is less than 7.2 W at its peak, with an average board power of 5.3–5.7 W during inference, of which 2.2–2.6 W can be attributed to the CNNA IP core.

Table 7: Comparison with state-of-the-art CNN accelerators. ÷ = unsupported and ∼ = needs re-compiling.

| Technique | Power Efficiency [GOPS/W] | GOPS | DSP | BRAM | LUT | Format [bit] | Scalable input, conv, pool, fc | SoC |
|---|---|---|---|---|---|---|---|---|
| FINN [23] | 210 | | N/A | 186 | 46253 | 1-2 | ∼ | ✓ |
| NeuFlow [24] | 14 | | N/A | N/A | N/A | | ✓ | ✓ |
| YodaNN [22] | 2756 | 1510 | N/A | N/A | N/A | | ÷ | ÷ |
| Microsoft [20] | 7.15 | 1178 | N/A | N/A | N/A | float | ✓ | ÷ |
| CNNA_1 - 100 MHz | 6.278 | 13.99 | 339 | 277 | 42490 | 16 | ✓ | ✓ |
| CNNA_1 - 172 MHz | 5.989 | 15.73 | 360 | 174 | 55626 | 16 | ✓ | ✓ |
| CNNA_2 - 172 MHz | 15.22 | 25.23 | 245 | 132 | 57232 | 8 | ✓ | ✓ |
| CNNA_3 - 172 MHz | 11.83 | 20.71 | 359 | 160 | 64998 | 8 | ✓ | ✓ |

Acknowledgments
We would like to thank Freia Martensen for language revision and proofreading of the article.
References

[1] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, Communications of the ACM. doi:10.1145/3065386.
[2] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings, 2015. arXiv:1409.1556.
[3] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2016. arXiv:1506.02640, doi:10.1109/CVPR.2016.91.
[4] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Transactions on Pattern Analysis and Machine Intelligence. arXiv:1506.01497, doi:10.1109/TPAMI.2016.2577031.
[5] K. He, G. Gkioxari, P. Dollar, R. Girshick, Mask R-CNN, IEEE Transactions on Pattern Analysis and Machine Intelligence. doi:10.1109/TPAMI.2018.2844175.
[6] A. Shawahna, S. M. Sait, A. El-Maleh, FPGA-based accelerators of deep learning networks for learning and classification: A review (2019). arXiv:1901.00121, doi:10.1109/ACCESS.2018.2890150.
[7] S. Mittal, J. S. Vetter, A survey of methods for analyzing and improving GPU energy efficiency (2014). arXiv:1404.4629, doi:10.1145/2636342.
[8] S. Mittal, A survey of FPGA-based accelerators for convolutional neural networks (2018). doi:10.1007/s00521-018-3761-1.
[9] W. Ding, Z. Huang, Z. Huang, L. Tian, H. Wang, S. Feng, Designing efficient accelerator of depthwise separable convolutional neural network on FPGA, Journal of Systems Architecture. doi:10.1016/j.sysarc.2018.12.008.
[10] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, Y. Bengio, Binarized neural networks, in: Advances in Neural Information Processing Systems, 2016. arXiv:1602.02505.
[11] M. Blott, T. B. Preusser, N. J. Fraser, G. Gambardella, K. O'Brien, Y. Umuroglu, M. Leeser, K. Vissers, FINN-R: An end-to-end deep-learning framework for fast exploration of quantized neural networks, ACM Transactions on Reconfigurable Technology and Systems. arXiv:1809.04570, doi:10.1145/3242897.
[12] E. Nurvitadhi, D. Sheffield, J. Sim, A. Mishra, G. Venkatesh, D. Marr, Accelerating binarized neural networks: Comparison of FPGA, CPU, GPU, and ASIC, in: Proceedings of the 2016 International Conference on Field-Programmable Technology, FPT 2016, 2017. doi:10.1109/FPT.2016.7929192.
[13] R. Zhao, W. Song, W. Zhang, T. Xing, J.-H. Lin, M. Srivastava, R. Gupta, Z. Zhang, Accelerating binarized convolutional neural networks with software-programmable FPGAs, in: Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA '17, ACM, New York, NY, USA, 2017, pp. 15-24. doi:10.1145/3020078.3021741.
[14] R. Nane, V. M. Sima, C. Pilato, J. Choi, B. Fort, A. Canis, Y. T. Chen, H. Hsiao, S. Brown, F. Ferrandi, J. Anderson, K. Bertels, A survey and evaluation of FPGA high-level synthesis tools, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. doi:10.1109/TCAD.2015.2513673.
[15] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, L. Wang, A high performance FPGA-based accelerator for large-scale convolutional neural networks, in: FPL 2016 - 26th International Conference on Field-Programmable Logic and Applications, 2016.
[16] A. Krizhevsky, I. Sutskever, G. E. Hinton, AlexNet (2012), Advances in Neural Information Processing Systems. arXiv:1102.0183, doi:10.1016/j.protcy.2014.09.007.
[17] F. Chollet, et al., Keras, https://keras.io (2015).
[18] L. Stornaiuolo, M. Santambrogio, D. Sciuto, On how to efficiently implement deep learning algorithms on PYNQ platform, in: Proceedings of IEEE Computer Society Annual Symposium on VLSI, ISVLSI, 2018. doi:10.1109/ISVLSI.2018.00112.
[19] Accellera Systems Initiative, IEEE standard for standard SystemC language reference manual, IEEE Std 1666-2011 (Revision of IEEE Std 1666-2005).
[20] K. Ovtcharov, O. Ruwase, J.-Y. Kim, J. Fowers, K. Strauss, E. S. Chung, Accelerating deep convolutional neural networks using specialized hardware, Microsoft Research Whitepaper.
[21] D. Gschwend, ZynqNet: An FPGA-accelerated embedded convolutional neural network. URL https://github.com/dgschwend/zynqnet
[22] R. Andri, L. Cavigelli, D. Rossi, L. Benini, YodaNN: An ultra-low power convolutional neural network accelerator based on binary weights, in: Proceedings of IEEE Computer Society Annual Symposium on VLSI, ISVLSI, 2016. doi:10.1109/ISVLSI.2016.111.
[23] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, K. Vissers, FINN: A framework for fast, scalable binarized neural network inference, in: FPGA 2017 - Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2017. arXiv:1612.07119, doi:10.1145/3020078.3021744.
[24] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, Y. LeCun, NeuFlow: A runtime reconfigurable dataflow processor for vision, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2011. doi:10.1109/CVPRW.2011.5981829.
[25] Xilinx, PYNQ: Python productivity for Zynq.
[26] D. D. Gajski, L. Ramachandran, Introduction to high-level synthesis, IEEE Design & Test of Computers 11 (4) (1994) 44-54.
[27] M. Fingeroff, High-Level Synthesis Blue Book, Mentor Graphics Corporation, 2010.
[28] Xilinx, UG902 - Vivado Design Suite User Guide - High-Level Synthesis, 2019 Edition (07 2019).
[29] Xilinx, UG998 - Introduction to FPGA Design with Vivado High-Level Synthesis, 1st Edition (01 2019).
[30] Accellera Systems Initiative, SystemC Synthesizable Subsets, 1st Edition (January 2015).
[31] L. H. Crockett, D. Northcote, C. Ramsay, F. D. Robinson, R. W. Stewart, Exploring Zynq MPSoC With PYNQ and Machine Learning Applications, Strathclyde Academic Media, 2019.
[32] Xilinx, UG761 - AXI Reference Guide, v13.1 Edition (March 2011).
[33] AXI Reference Guide.
[34] B. Xu, R. Huang, M. Li, Revise saturated activation functions, CoRR abs/1602.05980. arXiv:1602.05980. URL http://arxiv.org/abs/1602.05980
[35] E. Oberstar, Fixed-Point Representation & Fractional Math, Revision 1.2 (08 2007). doi:10.13140/RG.2.1.3602.8242.
[36] B. J. Wythoff, Backpropagation neural networks: A tutorial, Chemometrics and Intelligent Laboratory Systems 18 (2) (1993) 115-155. doi:10.1016/0169-7439(93)80052-J.
[37] H. Park, J. H. Lee, Y. Oh, S. Ha, S. Lee, Training deep neural network in limited precision, CoRR abs/1810.05486. arXiv:1810.05486. URL http://arxiv.org/abs/1810.05486
[38] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, D. Kalenichenko, Quantization and training of neural networks for efficient integer-arithmetic-only inference, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[39] 96Boards, Ultra96-V2 developer board.
[40] DETECT [online] (June 2017) [cited 7/5-2018].
[41] A. Shawahna, S. M. Sait, A. El-Maleh, FPGA-based accelerators of deep learning networks for learning and classification: A review, IEEE Access 7 (2019) 7823-7859. doi:10.1109/ACCESS.2018.2890150.
[42] Xilinx, Zynq UltraScale+ MPSoC ZCU104 evaluation kit.