An Application-Specific VLIW Processor with Vector Instruction Set for CNN Acceleration
Andreas Bytyn, Rainer Leupers and Gerd Ascheid
Institute for Communication Technologies and Embedded Systems, RWTH Aachen University
Email: [email protected]

Abstract—In recent years, neural networks have surpassed classical algorithms in areas such as object recognition, e.g. in the well-known ImageNet challenge. As a result, great effort is being put into developing fast and efficient accelerators, especially for Convolutional Neural Networks (CNNs). In this work we present ConvAix, a fully C-programmable processor that, contrary to many existing architectures, does not rely on a hard-wired array of multiply-and-accumulate (MAC) units. Instead, it maps computations onto independent vector lanes making use of a carefully designed vector instruction set. The presented processor is targeted towards latency-sensitive applications and is capable of executing up to 192 MAC operations per cycle. ConvAix operates at a target clock frequency of 400 MHz in 28nm CMOS, thereby offering state-of-the-art performance with proper flexibility within its target domain. Simulation results for several 2D convolutional layers from well-known CNNs (AlexNet, VGG-16) show an average ALU utilization of 72.5% using vector instructions with 16 bit fixed-point arithmetic. Compared to other well-known designs, which are less flexible, ConvAix offers competitive energy efficiency of up to 497 GOP/s/W while even surpassing them in terms of area efficiency and processing speed.
I. INTRODUCTION
Since their introduction to the broad public, Convolutional Neural Networks (CNNs) have been adopted for many tasks such as object classification and detection [1] [2]. Their ability to extract meaningful features out of data has been the key enabling factor for their superior performance compared to other approaches. However, this comes at the price of increased computational complexity; specifically, the number of Multiply-And-Accumulate (MAC) operations can reach well into the several GMAC per frame, as summarized in [3]. Thankfully, there is a lot of explicit parallelism contained in CNNs, thereby offering many options for accelerating them with domain-specific hardware architectures.

The most time- and energy-consuming computational kernel of every CNN is the 3D-convolution of so-called input feature maps (e.g. the RGB input for the first layer) with multiple sets of filters that compute layer-wise output feature maps [3]. The focus of this paper is therefore on the acceleration of these convolutions. As described in [4] and [5], the number of off-chip memory accesses and the management of the on-chip memories both play a decisive role for energy consumption and arithmetic utilization within the accelerator. There are many parameters for the convolutions that affect the efficiency, and it is the designer's task to find a suitable trade-off between them. Important parameters to consider are, for example, the order in which convolutions are executed as well as the tiling of input and output feature maps into e.g. column- and depth-slices. The optimal choice of these parameters, however, depends on the specific CNN model, thereby making it desirable to have some degree of flexibility in the data flow. The authors believe that an Application-Specific Instruction Set Processor (ASIP), as presented in this paper, can offer a decent trade-off between flexibility and efficiency. Some parameters, such as unrolling factors that result in hardware parallelism, must be decided at design time. Other parameters, however, such as tiling factors and loop order, can be flexibly adjusted in software. Since a fully featured C/C++ compiler is generated for our ASIP automatically, computational kernels can easily be re-used or adapted by means of software libraries.

In Section II a brief overview of existing hardware accelerators for CNNs is given. Afterwards, the convolutional kernel is introduced in Section III and an overview of the processor architecture is given in Section IV. Section V presents post place-and-route results of the ASIP implemented in a 28nm CMOS technology, as well as some relevant benchmarks running the state-of-the-art CNNs AlexNet and VGG-16. Finally, Section VI concludes the paper.

II. RELATED WORK
Many of the published accelerators are comprised of a large array of processing entities (PEs) with some application-specific interconnectivity between them. These accelerators often offer immense performance in terms of throughput (GOP/s) and energy efficiency (GOP/s/W), yet lack the desired flexibility when it comes to the employable data flow patterns and the on-chip data management. In [6], the authors present a 12x14 MAC array that aims to maximize data reuse (and therefore minimize off-chip accesses) by applying a specific computation scheme called row stationary. Some data flow flexibility is achieved by subdividing the 2D MAC array into slices and distributing parallel computations amongst these slices. This flexibility has its limits though, as only a pre-defined set of parameters can be adapted at runtime. In [7], a C-programmable processor is presented that makes use of a 16x16 MAC array to accelerate convolutions. The MAC array is supplied with data by a RISC processor, which gives flexibility in terms of the on-chip data management; however, the MAC array itself does not offer any additional flexibility. Both [6] and [7] apply voltage scaling to their circuits to demonstrate the potential energy efficiency improvement when operating at a lower voltage and clock frequency.

Furthermore, since many CNNs can be quantized down to 8 bit fixed-point [8] [9], existing architectures exploit this by either designing their MAC units to be narrow to begin with (e.g. 12 bit as in [10]), or by applying techniques such as precision gating or subword parallelism at runtime [7]. Other architectures, such as [11] and [12], attempt to alleviate the memory bottleneck in CNNs by employing near-memory computation and using 3D memory.
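To make these two reduced-precision approaches concrete, the sketch below (our own illustration, not code from [7] or [9]) contrasts precision gating, which zeroes the unused low-order operand bits of a wide MAC so they cause no switching activity, with subword parallelism, which packs two independent 8 bit MACs into one 16 bit lane:

    #include <stdint.h>

    /* Precision gating (illustrative): force the low-order operand bits
     * to zero so a 16x16 bit multiplier effectively computes with 8 bit
     * operands, reducing toggling in the unused part of the datapath.  */
    static inline int32_t mac_gated8(int32_t acc, int16_t a, int16_t b)
    {
        int16_t ag = (int16_t)(a & ~0xFF);   /* keep the 8 MSBs only */
        int16_t bg = (int16_t)(b & ~0xFF);
        return acc + (int32_t)ag * (int32_t)bg;
    }

    /* Subword parallelism (illustrative): two independent 8 bit MACs on
     * the halves of packed 16 bit registers, doubling throughput at
     * reduced precision.                                               */
    static inline void mac_subword8(int32_t acc[2], int16_t a2, int16_t b2)
    {
        acc[0] += (int32_t)(int8_t)(a2 & 0xFF) * (int8_t)(b2 & 0xFF); /* low lane  */
        acc[1] += (int32_t)(int8_t)(a2 >> 8)   * (int8_t)(b2 >> 8);   /* high lane */
    }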
III. CNN DATA FLOW

In general, CNNs consist of a number of concatenated layers, each executing a pre-defined operation on a 3- or 4-dimensional tensor (depending on whether batch processing is applied), thereby generating an output tensor that is used as input to the following layer. The most common layers include the convolutional layer, the max-pooling layer for tensor downsampling, and recently also so-called depth-wise and point-wise convolutional layers [13], which are special cases of the regular convolutional layer. For the remainder of this paper we will focus on the convolutional layers, as they constitute most of the computational expenditure in modern CNNs. Furthermore, batch processing is not considered, since the processor presented here is intended for real-time applications that are latency-sensitive.
Fig. 1: Convolutional layer example (IFMaps of size ICh x IH x IW, filter banks F_0 .. F_{OCh-1} of size ICh x FH x FW, OFMaps of size OCh x OH x OW).

Fig. 1 illustrates the 3D-convolution operation used in CNNs. A volume of input feature maps (IFMaps), consisting of ICh separate channels (also called the depth), each of them being of dimension IH x IW, is convolved with OCh banks of filters (F_0 .. F_{OCh-1}). In this process, each filter bank creates one output feature map of dimension OH x OW by convolving each IFMap with the corresponding filter of dimension FH x FW and accumulating the results of the different filters.

As mentioned before, not all IFMaps, OFMaps and filters can be kept in on-chip memory at the same time, so only subsets of the data are available for processing. This can be interpreted as slicing the input and/or output tensors into smaller chunks of data for which parallel processing is possible. For more information on the different slicing options, the interested reader is referred to [4] and [5]. Due to its software programmability, ConvAix supports a variety of slicing options. Fig. 2 illustrates one option that is particularly suitable for networks such as AlexNet and VGG-16, and which is also used for the benchmarks presented in Section V.
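For reference, the direct (untiled) form of such a layer is the six-deep loop nest sketched below; the slicing of Fig. 2 corresponds to tiling its two channel loops. This is a minimal illustration assuming unit stride and no padding, not ConvAix library code. For the third convolutional layer of AlexNet (ICh = 256, OCh = 384, OH = OW = 13, FH = FW = 3) this nest already amounts to 13*13*384*256*3*3, i.e. roughly 150 million MAC operations.

    #include <stdint.h>

    /* Direct 3D convolution of Fig. 1: OFMaps[OCh][OH][OW] from
     * IFMaps[ICh][IH][IW] and filters F[OCh][ICh][FH][FW]; unit stride
     * and no padding assumed, so IH = OH + FH - 1, IW = OW + FW - 1.  */
    void conv_layer(int OCh, int ICh, int OH, int OW, int FH, int FW,
                    const int16_t *ifm, const int16_t *flt, int32_t *ofm)
    {
        const int IH = OH + FH - 1, IW = OW + FW - 1;
        for (int oc = 0; oc < OCh; oc++)                 /* filter banks   */
            for (int oy = 0; oy < OH; oy++)              /* output rows    */
                for (int ox = 0; ox < OW; ox++) {        /* output columns */
                    int32_t acc = 0;
                    for (int ic = 0; ic < ICh; ic++)         /* IFMap depth */
                        for (int fy = 0; fy < FH; fy++)      /* filter rows */
                            for (int fx = 0; fx < FW; fx++)  /* filter cols */
                                acc += (int32_t)ifm[(ic*IH + oy+fy)*IW + ox+fx]
                                     * flt[((oc*ICh + ic)*FH + fy)*FW + fx];
                    ofm[(oc*OH + oy)*OW + ox] = acc;     /* one OFMap pixel */
                }
    }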
Fig. 2: Exemplary data flow used for the benchmarks in Sec. V (IFMap slices I_0 .. I_{M-1}, filter slices, OFMap slices O_0 .. O_{N-1}, partial sums; steps 1-3).

Both IFMaps and OFMaps are sliced along their depth dimension to build M input slices I_0 .. I_{M-1} as well as N output slices O_0 .. O_{N-1}. Each output slice is then processed in a row-wise fashion (step 1) in order to re-use existing IFMap rows when shifting the filter window to the next row. For each slice, filters are pre-loaded before processing starts, while IFMap rows and OFMap rows are loaded and stored concurrently on demand. Partial sums (PSums) of the incomplete OFMaps are accumulated in local scratchpad memories and only if necessary buffered in off-chip memory, which also happens concurrently to processing. After all IFMaps of the current slice I_m have been processed, the next slice I_m+1 is loaded (step 2). Finally, the current OFMap slice O_n is complete and the next slice O_n+1 can be processed (step 3). Note that if the IFMaps are not sliced along their depth dimension, no intermediate off-chip buffering of PSums is required.
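A sketch of the resulting loop order is given below. It is our own pseudocode-level illustration of Fig. 2; the helper functions are hypothetical stand-ins for the concurrent DMA and line-buffer transfers described in Section IV (re-loading of previously spilled PSums is omitted for brevity):

    /* Hypothetical helpers modeling the concurrent transfers of Fig. 2. */
    void preload_filters(int n, int m);           /* filters per slice pair  */
    void fetch_ifmap_rows(int m, int row);        /* IFMap rows on demand    */
    void compute_psum_row(int n, int m, int row); /* vectorized inner kernel */
    void spill_psum_row(int n, int row);          /* off-chip PSum buffering */
    void store_ofmap_row(int n, int row);         /* completed OFMap row     */

    void process_layer(int N, int M, int OH)
    {
        for (int n = 0; n < N; n++)              /* OFMap slices O_0..O_N-1 */
            for (int m = 0; m < M; m++) {        /* IFMap slices I_0..I_M-1 */
                preload_filters(n, m);
                for (int row = 0; row < OH; row++) {  /* step 1: row-wise   */
                    fetch_ifmap_rows(m, row);
                    compute_psum_row(n, m, row);
                    if (m < M - 1)
                        spill_psum_row(n, row);  /* only if depth-sliced    */
                    else
                        store_ofmap_row(n, row);
                }
            }                                    /* step 2: next I slice;
                                                    step 3: next O slice    */
    }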
IV. PROCESSOR ARCHITECTURE

The architecture overview of the proposed ASIP, called ConvAix, is shown in Fig. 3a. It consists of 8 pipeline stages (IF, ID, E1..E6) with 4 heterogeneous VLIW issue slots to exploit instruction-level parallelism. Slot 0 is reserved for control instructions as well as memory operations, both for the on-chip and the off-chip memory. ConvAix also offers a scalar ALU that is used for address calculations and house-keeping computations, e.g. loop-counter updates. In addition to the regular load/store unit, an application-specific line buffer is used to cache IFMap rows. Slots 1-3 each offer a pipelined SIMD vector datapath (vALU), whereas each datapath itself again consists of 4 separate SIMD vector slices that can be programmed in C using specific vector primitives added to the compiler. The vector parallelism is set to 16, resulting in a total of 4 x 16 = 64 MAC operations per issue slot and 3 x 64 = 192 MAC operations in total for all 3 slots. Furthermore, slot 1 includes an application-specific unit that operates on single vectors of size 16, which is used for calculating activation functions and performing max-pooling.

In general, the ASIP uses 16 bit fixed-point arithmetic for both the scalar ALU and the vector units. However, the ALU in slot 0 also offers a 32 bit datapath to perform computations for addressing larger memories such as external DRAM. Furthermore, the vector datapath supports precision gating of its operands to reduce the effective word width and therefore save energy, as described in [9]. Settings such as the rounding scheme as well as the fractional shift of the vector-ALUs can be configured at runtime.
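As an illustration of how the vector slices are exposed to the programmer, the following sketch shows an inner MAC loop written against compiler vector primitives. The type names and intrinsics (vec16, vload, vmac) are hypothetical placeholders, since the paper does not list the actual primitive names of the generated compiler; scalar reference implementations model the semantics of one 16-lane slice:

    #include <stdint.h>

    typedef struct { int16_t lane[16]; } vec16;   /* one 16-lane vector   */
    typedef struct { int32_t lane[16]; } vacc16;  /* widened accumulator  */

    static vec16 vload(const int16_t *p) {        /* aligned vector load  */
        vec16 v;
        for (int k = 0; k < 16; k++) v.lane[k] = p[k];
        return v;
    }
    static vacc16 vmac(vacc16 acc, vec16 a, vec16 b) { /* 16 parallel MACs */
        for (int k = 0; k < 16; k++)
            acc.lane[k] += (int32_t)a.lane[k] * (int32_t)b.lane[k];
        return acc;
    }

    /* Inner product of one IFMap row segment with one filter row; each
     * vector slot runs 4 such slices in lock-step (64 MACs per slot,
     * 192 MACs in total per cycle at peak).                             */
    static vacc16 mac_row(const int16_t *ifm, const int16_t *flt, int len)
    {
        vacc16 acc = {{0}};
        for (int i = 0; i + 16 <= len; i += 16)
            acc = vmac(acc, vload(&ifm[i]), vload(&flt[i]));
        return acc;  /* cross-lane reduction is handled separately */
    }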
Fig. 3: ConvAix instruction pipeline and storage overview (3a), processor area breakdown (w/o SRAMs) (3b) and exemplary power distribution (3c) for AlexNet layer 3 (8 bit gated precision). Area breakdown (3b): vALUs 56.3%, register files 20.2%, memory IF & DMA 10.9%, line buffer 5.5%, decoder 1.3%, misc 5.7%. Power distribution (3c): vALUs 44.0%, DM (SRAM) 31.9%, register files 8.6%, memory IF & DMA 6.8%, line buffer 3.6%, PM (SRAM) 2.2%, decoder 1.7%, misc 1.2%.

All slots have access to a 32-element wide scalar register file R (16 bit per entry). Two large vector register files, VR and VRl, of sizes 16 x 256 bit and 12 x 512 bit respectively, are used to provide data to the vector units, thereby acting as an intermediate storage between the on-chip SRAM (DM) and the processor pipeline. The second register file, VRl, which has double the width of VR, is used for vector accumulation. To reduce the multiplexer depth required to access the vector register files, both VR and VRl are sliced into 4 (VR0..VR3) and 3 (VRl0..VRl2) sub-regions respectively. While slot 0 can access the complete register files, which is required for data movement and load/stores, the time-critical vector-ALUs only have access to some of the aforementioned sub-regions. Each vector-ALU has an operand fetch and prepare stage that can either broadcast entire vectors to the 4 vector slices within its ALU or generate a permuted version of the input according to a pattern, which is set at runtime.

In addition to the 16 KByte program memory (PM) used to fetch instructions from, ConvAix has access to 128 KByte of dual-ported on-chip SRAM via a custom memory interface. This memory is called data memory (DM) and is partitioned into 16 banks of 8 KByte each in order to allow fetches of 2 vectors per cycle (2 x 256 bit). This is required by the application, since at least one new filter vector and one new input vector must be loaded each cycle to keep the vector-ALUs busy. To allow seamless transfer of data to/from external memory while the ASIP processes data slices as described in Section III, a simple direct memory access (DMA) engine is included in the memory interface. In addition to the DMA, the line buffer unit has direct access to the memory interface. This allows for simultaneous loads of new IFMap row chunks while providing (possibly strided) inputs to the vector-ALUs. Using this approach, strided convolutions are executed with minimal cycle overhead.
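The overlap of DMA transfers with computation described above amounts to classic double buffering. A minimal sketch, again with hypothetical helper names rather than actual ConvAix driver calls:

    #include <stdint.h>

    #define ROW_WORDS 256  /* example row-chunk size in 16 bit words (assumption) */

    /* Hypothetical DMA interface; the paper describes the engine but not
     * its software API.                                                  */
    void dma_start_load(int16_t *dst, int slice, int row);
    void dma_wait(void);
    void compute_row(const int16_t *src, int slice, int row);

    /* Double buffering: while the vector slots consume buffer `cur`, the
     * DMA engine fills the other buffer with the next IFMap row chunk.   */
    void process_slice(int slice, int rows, int16_t buf[2][ROW_WORDS])
    {
        int cur = 0;
        dma_start_load(buf[cur], slice, 0);        /* prime first buffer */
        for (int row = 0; row < rows; row++) {
            dma_wait();                            /* row data is ready  */
            if (row + 1 < rows)                    /* prefetch next row  */
                dma_start_load(buf[cur ^ 1], slice, row + 1);
            compute_row(buf[cur], slice, row);     /* overlaps next DMA  */
            cur ^= 1;                              /* swap buffers       */
        }
    }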
V. RESULTS
ConvAix was synthesized and placed & routed using a TSMC 28nm CMOS technology at 1V nominal supply voltage and standard V_T for typical conditions (25 °C). Table I summarizes the implementation results, while Fig. 3b and Fig. 3c present a detailed breakdown of the ASIP's area and power distribution, respectively. The overall layout of the processor is shown in Fig. 4. All presented power values were obtained by simulating the netlist after place & route, thereby generating detailed switching activity of the circuit.

TABLE I: PROCESSOR SPECIFICATION
Technology:          TSMC 28nm SVT 1P8M
Core voltage:        1.0 V
Clock frequency:     400 MHz
Gate count (logic):  1293 kGE
On-chip SRAM:        128 KByte (data), 16 KByte (instruction)
Out of the total chip area, the SRAM macro cells occupy the largest portion at 63%. As can be seen in Fig. 3b, the largest area contributors with regard to the logic cells are the vector-ALUs, which make up 56% in total. Regarding the power consumption, it can be observed that the SRAM data memories together with the register files and the line buffer consume roughly the same amount of power (44.1%) as the vector-ALUs (44%). The latter power figure, however, also includes the contribution of pipeline registers and multiplexers within the vector-ALUs.

TABLE II: COMPARISON WITH STATE-OF-THE-ART ACCELERATORS
(per-benchmark values are given as AlexNet / VGG-16)

Reference                              Envision [7]       Eyeriss [6]        This work (ConvAix)
Technology                             40nm LP (Silicon)  65nm LP (Silicon)  28nm LP (P&R)
Architecture                           RISC + MAC Array   ASIC               ASIP
Core Voltage [V]                       0.85-0.92          1                  1
Gate Count (logic only) [kGE]          1600               1176               1293
On-Chip SRAM [KByte]                   148                181.5              144
Registers [KByte]                      -                  11.8               3.6
Clock Frequency [MHz]                  204                200                400
Power Consumption [mW]                 70.1               116.8 / 104.8      228.8 / 223.9
Off-Chip I/O [MByte] a                 b                  c / c              d / d
MAC Utilization Rate e                 -                  - / 36%            - / 76%
Area Efficiency [GOP/s/MGE]            -                  20.85 / -          39.73 / 44.01
Energy Efficiency [GOP/s/W]            815                187 / 104          - / -
Energy Efficiency @ 28nm/1V [GOP/s/W] f  -                434 / 242          459 / 497

a: Off-chip I/O for processing batches of size 1
b: Compressed using Huffman coding
c: Compressed using run-length coding
d: Uncompressed
e: Ratio of actual and ideal processing time based on 100% MAC utilization each cycle
f: Power values were scaled according to P_scaled = P_old * (L_new/L_old) * (V_DD,new/V_DD,old)^2
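As a worked check of footnote f (using the reconstructed table entries): scaling Eyeriss' AlexNet efficiency from 65nm at 1V to 28nm at 1V involves only the feature-size term, since the voltage ratio is one and the exponent therefore cancels, and the energy efficiency scales inversely to the power:

\[
\eta_{\mathrm{scaled}}
  = \frac{\eta_{\mathrm{old}}}{(L_{\mathrm{new}}/L_{\mathrm{old}})\,(V_{DD,\mathrm{new}}/V_{DD,\mathrm{old}})^{2}}
  = \frac{187\,\mathrm{GOP/s/W}}{(28/65)\cdot(1.0/1.0)^{2}}
  \approx 434\,\mathrm{GOP/s/W},
\]

which matches the scaled Eyeriss AlexNet entry of Table II.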
Fig. 4: Layout view of ConvAix after place & route (vALU0..vALU2, vector registers, memory controller, line buffer, DMA, controller, misc, DM banks 0..15, PM).

ConvAix was benchmarked using two widely used CNN models: AlexNet [1] and VGG-16 [14]. Table II summarizes our results and compares them with two well-known accelerators (Envision [7] and Eyeriss [6]) targeting the same CNN models. To allow for a fair comparison between all designs, we scaled the energy efficiency values of all architectures to a uniform 28nm technology operating at 1V. The values presented in Table II show overall results across all layers of the respective CNNs, with optimized word width for the architectures that provide scalable precision. Furthermore, to increase the fairness of the comparison, the processing times used in Table II do not include the time required for off-chip I/O whenever possible. We hereby aim to eliminate the effect that the I/O bandwidth of the external memories could have on the presented figures.

Due to its comparatively high clock frequency, ConvAix exceeds the other designs in terms of processing speed (1.6x compared to the next fastest for AlexNet and 4.8x for VGG-16) and area efficiency (1.9x for AlexNet and 4.3x for VGG-16). At the same time it maintains a competitive energy efficiency of 459 GOP/s/W on average for AlexNet and 497 GOP/s/W for VGG-16. The average MAC utilization for AlexNet is 8% lower than that of Eyeriss. This is expected, since the proposed design is software-programmed, which always incurs a certain overhead for control code. For VGG-16, however, ConvAix demonstrates a much higher utilization of 76% vs. 36% for Eyeriss. According to the authors of Eyeriss, their lower value is caused by the added time required for repeatedly ramping up the MAC array. The required off-chip I/O is higher than that of Eyeriss, which can be explained by the lack of a memory compression engine in our design. Calculations using the sparsity values provided in [6] show that our design would achieve similar total off-chip I/O figures as Eyeriss if compression were added.
VI. CONCLUSION
It was the goal of this work to demonstrate the practical feasibility of a software-programmable architecture with an instruction set that is targeted towards, but not limited to, CNN acceleration. The envisioned architecture, called ConvAix, was implemented in a modern 28nm CMOS technology and evaluated using highly relevant benchmarks. The results show that ConvAix not only achieves competitive energy efficiency compared to other, less flexible designs, but even surpasses them in terms of area efficiency (1.9x for AlexNet, 4.3x for VGG-16) and throughput. Especially for the larger VGG-16 model, ConvAix achieves a significantly higher utilization (76% compared to 36%) and a 4.8x higher processing speed. Incorporating techniques such as dynamic voltage and frequency scaling or memory compression could further improve the efficiency of the presented design. We leave this investigation to future work.
ACKNOWLEDGMENT
This work was supported by the German Federal Ministry of Education and Research (BMBF) via the PARIS project (16ES0602) aiming at autonomous driving.
REFERENCES
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Advances in Neural Information Processing Systems, pp. 1-9, 2012.
[2] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," 2015. [Online]. Available: http://arxiv.org/abs/1506.02640
[3] V. Sze, Y.-H. Chen, T.-J. Yang, and J. Emer, "Efficient Processing of Deep Neural Networks: A Tutorial and Survey," pp. 1-31, 2017. [Online]. Available: http://arxiv.org/abs/1703.09039
[4] Y. Ma, Y. Cao, S. Vrudhula, and J.-s. Seo, "Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks," Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '17, pp. 45-54, 2017.
[5] Y.-H. Chen, J. Emer, and V. Sze, "Using Dataflow to Optimize Energy Efficiency of Deep Neural Network Accelerators," IEEE Micro, no. 3, pp. 12-21, 2017.
[6] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127-138, 2017.
[7] B. Moons and M. Verhelst, "An Energy-Efficient Precision-Scalable ConvNet Processor in 40-nm CMOS," IEEE Journal of Solid-State Circuits, vol. 52, no. 4, pp. 903-914, 2017.
[8] D. D. Lin, S. S. Talathi, and V. S. Annapureddy, "Fixed Point Quantization of Deep Convolutional Networks," vol. 48, 2015. [Online]. Available: http://arxiv.org/abs/1511.06393
[9] B. Moons, B. De Brabandere, L. Van Gool, and M. Verhelst, "Energy-Efficient ConvNets Through Approximate Computing," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), 2016, pp. 1-8.
[10] L. Cavigelli and L. Benini, "Origami: A 803-GOp/s/W Convolutional Network Accelerator," IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 11, pp. 2461-2475, 2017.
[11] E. Azarkhish, D. Rossi, I. Loi, and L. Benini, "Neurostream: Scalable and Energy Efficient Deep Learning with Smart Memory Cubes," IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 2, pp. 420-434, 2018.
[12] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, "TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory," Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS '17, pp. 751-764, 2017.
[13] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," 2017. [Online]. Available: http://arxiv.org/abs/1704.04861
[14] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," in International Conference on Learning Representations (ICLR), 2015.