An Application-Specific VLIW Processor with Vector Instruction Set for CNN Acceleration
Andreas Bytyn, Rainer Leupers and Gerd Ascheid
Institute for Communication Technologies and Embedded Systems, RWTH Aachen University
Email: [email protected]

Abstract—In recent years, neural networks have surpassed classical algorithms in areas such as object recognition, e.g. in the well-known ImageNet challenge. As a result, great effort is being put into developing fast and efficient accelerators, especially for Convolutional Neural Networks (CNNs). In this work we present ConvAix, a fully C-programmable processor that, contrary to many existing architectures, does not rely on a hard-wired array of multiply-and-accumulate (MAC) units. Instead, it maps computations onto independent vector lanes making use of a carefully designed vector instruction set. The presented processor is targeted towards latency-sensitive applications and is capable of executing up to 192 MAC operations per cycle. ConvAix operates at a target clock frequency of 400 MHz in 28nm CMOS, thereby offering state-of-the-art performance with proper flexibility within its target domain. Simulation results for several 2D convolutional layers from well-known CNNs (AlexNet, VGG-16) show an average ALU utilization of 72.5% using vector instructions with 16 bit fixed-point arithmetic. Compared to other well-known designs, which are less flexible, ConvAix offers competitive energy efficiency of up to 497 GOP/s/W while even surpassing them in terms of area efficiency and processing speed.
I. INTRODUCTION
Since their introduction to the broad public, Convolutional Neural Networks (CNNs) have been adopted for many tasks such as object classification and detection [1] [2]. Their ability to extract meaningful features out of data has been the key enabling factor for their superior performance compared to other approaches. However, this comes at the price of increased computational complexity; specifically, the number of Multiply-And-Accumulate (MAC) operations can reach well into the several GMAC per frame, as summarized in [3]. Thankfully, there is a lot of explicit parallelism contained in CNNs, thereby offering many options for accelerating them with domain-specific hardware architectures.

The most time- and energy-consuming computational kernel of every CNN is the 3D-convolution of so-called input feature maps (e.g. the RGB input for the first layer) with multiple sets of filters that compute layer-wise output feature maps [3]. The focus of this paper is therefore on the acceleration of these convolutions. As described in [4] and [5], the number of off-chip memory accesses and the management of the on-chip memories both play a decisive role for energy consumption and arithmetic utilization within the accelerator. There are many parameters for the convolutions that affect the efficiency, and it is the designer's task to find a suitable trade-off between them. Important parameters to consider are, for example, the order in which convolutions are executed as well as the tiling of input and output feature maps into e.g. column- and depth-slices. The optimal choice of these parameters, however, depends on the specific CNN model, thereby making it desirable to have some degree of flexibility in the data flow. The authors believe that an Application-Specific Instruction Set Processor (ASIP), as presented in this paper, can offer a decent trade-off between flexibility and efficiency. Some parameters, such as unrolling factors that result in hardware parallelism, must be decided at design time. Other parameters, however, such as tiling factors and loop order, can be flexibly adjusted in software. Since a fully featured C/C++ compiler is generated for our ASIP automatically, computational kernels can easily be re-used or adapted by means of software libraries.

In Section II a brief overview of existing hardware accelerators for CNNs is given. Afterwards, the convolutional kernel is introduced in Section III and an overview of the processor architecture is given in Section IV. Section V presents post place-and-route results of the ASIP implemented in a 28nm CMOS technology, as well as some relevant benchmarks running the state-of-the-art CNNs AlexNet and VGG-16. Finally, Section VI concludes the paper.

II. RELATED WORK
Many of the published accelerators are comprised of a large array of processing entities (PEs) with some application-specific interconnectivity between them. These accelerators often offer immense performance in terms of throughput (GOP/s) and energy efficiency (GOP/s/W), yet lack the desired flexibility when it comes to the employable data flow patterns and the on-chip data management. In [6], the authors present a 12x14 MAC array that aims to maximize data reuse (and therefore minimize off-chip accesses) by applying a specific computation scheme called row stationary. Some data flow flexibility is achieved by subdividing the 2D MAC array into slices and distributing parallel computations amongst these slices. This flexibility has its limits though, as only a pre-defined set of parameters can be adapted at runtime. In [7], a C-programmable processor is presented that makes use of a 16x16 MAC array to accelerate convolutions. The MAC array is supplied with data by a RISC processor, which gives flexibility in terms of the on-chip data management; however, the MAC array itself does not offer any additional flexibility. Both [6] and [7] apply voltage scaling to their circuits to demonstrate the potential energy efficiency improvement when operating at a lower voltage and clock frequency.

Furthermore, since many CNNs can be quantized down to 8 bit fixed-point [8] [9], existing architectures exploit this by either designing their MAC units to be narrow to begin with (e.g. 12 bit as in [10]), or by applying techniques such as precision gating or subword parallelism at runtime [7]. Other architectures, such as [11] and [12], attempt to alleviate the memory bottleneck in CNNs by employing near-memory computation and using 3D memory.
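To make these two reduced-precision approaches concrete, the sketch below (our own illustration, not code from [7] or [9]) contrasts precision gating, which zeroes the unused low-order operand bits of a wide MAC so they cause no switching activity, with subword parallelism, which packs two independent 8 bit MACs into one 16 bit lane:

    #include <stdint.h>

    /* Precision gating (illustrative): force the low-order operand bits
     * to zero so a 16x16 bit multiplier effectively computes with 8 bit
     * operands, reducing toggling in the unused part of the datapath.  */
    static inline int32_t mac_gated8(int32_t acc, int16_t a, int16_t b)
    {
        int16_t ag = (int16_t)(a & ~0xFF);   /* keep the 8 MSBs only */
        int16_t bg = (int16_t)(b & ~0xFF);
        return acc + (int32_t)ag * (int32_t)bg;
    }

    /* Subword parallelism (illustrative): two independent 8 bit MACs on
     * the halves of packed 16 bit registers, doubling throughput at
     * reduced precision.                                               */
    static inline void mac_subword8(int32_t acc[2], int16_t a2, int16_t b2)
    {
        acc[0] += (int32_t)(int8_t)(a2 & 0xFF) * (int8_t)(b2 & 0xFF); /* low lane  */
        acc[1] += (int32_t)(int8_t)(a2 >> 8)   * (int8_t)(b2 >> 8);   /* high lane */
    }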
III. CNN DATA FLOW

In general, CNNs consist of a number of concatenated layers, each executing a pre-defined operation on a 3- or 4-dimensional tensor (depending on whether batch processing is applied), thereby generating an output tensor that is used as input to the following layer. The most common layers include the convolutional layer, the max-pooling layer for tensor downsampling, and recently also so-called depth-wise and point-wise convolutional layers [13], which are special cases of the regular convolutional layer. For the remainder of this paper we will focus on the convolutional layers, as they constitute most of the computational expenditure in modern CNNs. Furthermore, batch processing is not considered, since the processor presented here is intended for real-time applications that are latency-sensitive.
Fig. 1: Convolutional layer example (IFMaps of size ICh x IH x IW, filter banks F_0 .. F_{OCh-1} of size ICh x FH x FW, OFMaps of size OCh x OH x OW).

Fig. 1 illustrates the 3D-convolution operation used in CNNs. A volume of input feature maps (IFMaps), consisting of ICh separate channels (also called the depth), each of them being of dimension IH x IW, is convolved with OCh banks of filters (F_0 .. F_{OCh-1}). In this process, each filter bank creates one output feature map of dimension OH x OW by convolving each IFMap with the corresponding filter of dimension FH x FW and accumulating the results of the different filters.

As mentioned before, not all IFMaps, OFMaps and filters can be kept in on-chip memory at the same time, so only subsets of the data are available for processing. This can be interpreted as slicing the input and/or output tensors into smaller chunks of data for which parallel processing is possible. For more information on the different slicing options, the interested reader is referred to [4] and [5]. Due to its software programmability, ConvAix supports a variety of slicing options. Fig. 2 illustrates one option that is particularly suitable for networks such as AlexNet and VGG-16, and which is also used for the benchmarks presented in Section V.
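For reference, the direct (untiled) form of such a layer is the six-deep loop nest sketched below; the slicing of Fig. 2 corresponds to tiling its two channel loops. This is a minimal illustration assuming unit stride and no padding, not ConvAix library code. For the third convolutional layer of AlexNet (ICh = 256, OCh = 384, OH = OW = 13, FH = FW = 3) this nest already amounts to 13*13*384*256*3*3, i.e. roughly 150 million MAC operations.

    #include <stdint.h>

    /* Direct 3D convolution of Fig. 1: OFMaps[OCh][OH][OW] from
     * IFMaps[ICh][IH][IW] and filters F[OCh][ICh][FH][FW]; unit stride
     * and no padding assumed, so IH = OH + FH - 1, IW = OW + FW - 1.  */
    void conv_layer(int OCh, int ICh, int OH, int OW, int FH, int FW,
                    const int16_t *ifm, const int16_t *flt, int32_t *ofm)
    {
        const int IH = OH + FH - 1, IW = OW + FW - 1;
        for (int oc = 0; oc < OCh; oc++)                 /* filter banks   */
            for (int oy = 0; oy < OH; oy++)              /* output rows    */
                for (int ox = 0; ox < OW; ox++) {        /* output columns */
                    int32_t acc = 0;
                    for (int ic = 0; ic < ICh; ic++)         /* IFMap depth */
                        for (int fy = 0; fy < FH; fy++)      /* filter rows */
                            for (int fx = 0; fx < FW; fx++)  /* filter cols */
                                acc += (int32_t)ifm[(ic*IH + oy+fy)*IW + ox+fx]
                                     * flt[((oc*ICh + ic)*FH + fy)*FW + fx];
                    ofm[(oc*OH + oy)*OW + ox] = acc;     /* one OFMap pixel */
                }
    }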
Fig. 2: Exemplary data flow used for the benchmarks in Sec. V (IFMap slices I_0 .. I_{M-1}, filter slices, OFMap slices O_0 .. O_{N-1}, partial sums; steps 1-3).

Both IFMaps and OFMaps are sliced along their depth dimension to build M input slices I_0 .. I_{M-1} as well as N output slices O_0 .. O_{N-1}. Each output slice is then processed in a row-wise fashion (step 1) in order to re-use existing IFMap rows when shifting the filter window to the next row. For each slice, filters are pre-loaded before processing starts, while IFMap rows and OFMap rows are loaded and stored concurrently on demand. Partial sums (PSums) of the incomplete OFMaps are accumulated in local scratchpad memories and only if necessary buffered in off-chip memory, which also happens concurrently to processing. After all IFMaps of the current slice I_m have been processed, the next slice I_m+1 is loaded (step 2). Finally, the current OFMap slice O_n is complete and the next slice O_n+1 can be processed (step 3). Note that if the IFMaps are not sliced along their depth dimension, no intermediate off-chip buffering of PSums is required.
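A sketch of the resulting loop order is given below. It is our own pseudocode-level illustration of Fig. 2; the helper functions are hypothetical stand-ins for the concurrent DMA and line-buffer transfers described in Section IV (re-loading of previously spilled PSums is omitted for brevity):

    /* Hypothetical helpers modeling the concurrent transfers of Fig. 2. */
    void preload_filters(int n, int m);           /* filters per slice pair  */
    void fetch_ifmap_rows(int m, int row);        /* IFMap rows on demand    */
    void compute_psum_row(int n, int m, int row); /* vectorized inner kernel */
    void spill_psum_row(int n, int row);          /* off-chip PSum buffering */
    void store_ofmap_row(int n, int row);         /* completed OFMap row     */

    void process_layer(int N, int M, int OH)
    {
        for (int n = 0; n < N; n++)              /* OFMap slices O_0..O_N-1 */
            for (int m = 0; m < M; m++) {        /* IFMap slices I_0..I_M-1 */
                preload_filters(n, m);
                for (int row = 0; row < OH; row++) {  /* step 1: row-wise   */
                    fetch_ifmap_rows(m, row);
                    compute_psum_row(n, m, row);
                    if (m < M - 1)
                        spill_psum_row(n, row);  /* only if depth-sliced    */
                    else
                        store_ofmap_row(n, row);
                }
            }                                    /* step 2: next I slice;
                                                    step 3: next O slice    */
    }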
IV. PROCESSOR ARCHITECTURE

The architecture overview of the proposed ASIP, called ConvAix, is shown in Fig. 3a. It consists of 8 pipeline stages (IF, ID, E1..E6) with 4 heterogeneous VLIW issue slots to exploit instruction-level parallelism. Slot 0 is reserved for control instructions as well as memory operations, both for the on-chip and the off-chip memory. ConvAix also offers a scalar ALU that is used for address calculations and house-keeping computations, e.g. loop-counter updates. In addition to the regular load/store unit, an application-specific line buffer is used to cache IFMap rows. Slots 1-3 each offer a pipelined SIMD vector datapath (vALU), whereas each datapath itself again consists of 4 separate SIMD vector slices that can be programmed in C using specific vector primitives added to the compiler. The vector parallelism is set to 16, resulting in a total of 4 x 16 = 64 MAC operations per issue slot and 3 x 64 = 192 MAC operations in total for all 3 slots. Furthermore, slot 1 includes an application-specific unit that operates on single vectors of size 16, which is used for calculating activation functions and performing max-pooling.

In general, the ASIP uses 16 bit fixed-point arithmetic for both the scalar ALU and the vector units. However, the ALU in slot 0 also offers a 32 bit datapath to perform computations for addressing larger memories such as external DRAM. Furthermore, the vector datapath supports precision gating of its operands to reduce the effective word width and therefore save energy, as described in [9]. Settings such as the rounding scheme as well as the fractional shift of the vector-ALUs can be configured at runtime.
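As an illustration of how the vector slices are exposed to the programmer, the following sketch shows an inner MAC loop written against compiler vector primitives. The type names and intrinsics (vec16, vload, vmac) are hypothetical placeholders, since the paper does not list the actual primitive names of the generated compiler; scalar reference implementations model the semantics of one 16-lane slice:

    #include <stdint.h>

    typedef struct { int16_t lane[16]; } vec16;   /* one 16-lane vector   */
    typedef struct { int32_t lane[16]; } vacc16;  /* widened accumulator  */

    static vec16 vload(const int16_t *p) {        /* aligned vector load  */
        vec16 v;
        for (int k = 0; k < 16; k++) v.lane[k] = p[k];
        return v;
    }
    static vacc16 vmac(vacc16 acc, vec16 a, vec16 b) { /* 16 parallel MACs */
        for (int k = 0; k < 16; k++)
            acc.lane[k] += (int32_t)a.lane[k] * (int32_t)b.lane[k];
        return acc;
    }

    /* Inner product of one IFMap row segment with one filter row; each
     * vector slot runs 4 such slices in lock-step (64 MACs per slot,
     * 192 MACs in total per cycle at peak).                             */
    static vacc16 mac_row(const int16_t *ifm, const int16_t *flt, int len)
    {
        vacc16 acc = {{0}};
        for (int i = 0; i + 16 <= len; i += 16)
            acc = vmac(acc, vload(&ifm[i]), vload(&flt[i]));
        return acc;  /* cross-lane reduction is handled separately */
    }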
Fig. 3: ConvAix instruction pipeline and storage overview (3a), processor area breakdown (w/o SRAMs) (3b) and exemplary power distribution (3c) for AlexNet layer 3 (8 bit gated precision). Area breakdown (3b): vALUs 56.3%, register files 20.2%, memory IF & DMA 10.9%, line buffer 5.5%, decoder 1.3%, misc 5.7%. Power distribution (3c): vALUs 44.0%, DM (SRAM) 31.9%, register files 8.6%, memory IF & DMA 6.8%, line buffer 3.6%, PM (SRAM) 2.2%, decoder 1.7%, misc 1.2%.

All slots have access to a 32-element wide scalar register file R (16 bit per entry). Two large vector register files, VR and VRl, of sizes 16 x 256 bit and 12 x 512 bit respectively, are used to provide data to the vector units, thereby acting as an intermediate storage between the on-chip SRAM (DM) and the processor pipeline. The second register file, VRl, which has double the width of VR, is used for vector accumulation. To reduce the multiplexer depth required to access the vector register files, both VR and VRl are sliced into 4 (VR0..VR3) and 3 (VRl0..VRl2) sub-regions respectively. While slot 0 can access the complete register files, which is required for data movement and load/stores, the time-critical vector-ALUs only have access to some of the aforementioned sub-regions. Each vector-ALU has an operand fetch and prepare stage that can either broadcast entire vectors to the 4 vector slices within its ALU or generate a permuted version of the input according to a pattern, which is set at runtime.

In addition to the 16 KByte program memory (PM) used to fetch instructions from, ConvAix has access to 128 KByte of dual-ported on-chip SRAM via a custom memory interface. This memory is called data memory (DM) and is partitioned into 16 banks of 8 KByte each in order to allow fetches of 2 vectors per cycle (2 x 256 bit). This is required by the application, since at least one new filter vector and one new input vector must be loaded each cycle to keep the vector-ALUs busy. To allow seamless transfer of data to/from external memory while the ASIP processes data slices as described in Section III, a simple direct memory access (DMA) engine is included in the memory interface. In addition to the DMA, the line buffer unit has direct access to the memory interface. This allows for simultaneous loads of new IFMap row chunks while providing (possibly strided) inputs to the vector-ALUs. Using this approach, strided convolutions are executed with minimal cycle overhead.
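The overlap of DMA transfers with computation described above amounts to classic double buffering. A minimal sketch, again with hypothetical helper names rather than actual ConvAix driver calls:

    #include <stdint.h>

    #define ROW_WORDS 256  /* example row-chunk size in 16 bit words (assumption) */

    /* Hypothetical DMA interface; the paper describes the engine but not
     * its software API.                                                  */
    void dma_start_load(int16_t *dst, int slice, int row);
    void dma_wait(void);
    void compute_row(const int16_t *src, int slice, int row);

    /* Double buffering: while the vector slots consume buffer `cur`, the
     * DMA engine fills the other buffer with the next IFMap row chunk.   */
    void process_slice(int slice, int rows, int16_t buf[2][ROW_WORDS])
    {
        int cur = 0;
        dma_start_load(buf[cur], slice, 0);        /* prime first buffer */
        for (int row = 0; row < rows; row++) {
            dma_wait();                            /* row data is ready  */
            if (row + 1 < rows)                    /* prefetch next row  */
                dma_start_load(buf[cur ^ 1], slice, row + 1);
            compute_row(buf[cur], slice, row);     /* overlaps next DMA  */
            cur ^= 1;                              /* swap buffers       */
        }
    }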
V. RESULTS
ConvAix was synthesized and placed & routed using a TSMC 28nm CMOS technology at 1V nominal supply voltage and standard V_T for typical conditions (25 °C). Table I summarizes the implementation results, while Fig. 3b and Fig. 3c present a detailed breakdown of the ASIP's area and power distribution, respectively. The overall layout of the processor is shown in Fig. 4. All presented power values were obtained by simulating the netlist after place & route, thereby generating detailed switching activity of the circuit.

TABLE I: PROCESSOR SPECIFICATION
Technology:          TSMC 28nm SVT 1P8M
Core voltage:        1.0 V
Clock frequency:     400 MHz
Gate count (logic):  1293 kGE
On-chip SRAM:        128 KByte (data), 16 KByte (instruction)
Out of the total chip area, the SRAM macro cells occupy the largest portion at 63%. As can be seen in Fig. 3b, the largest area contributors with regard to the logic cells are the vector-ALUs, which make up 56% in total. Regarding the power consumption, it can be observed that the SRAM data memories together with the register files and the line buffer consume roughly the same amount of power (44.1%) as the vector-ALUs (44%). The latter power figure, however, also includes the contribution of pipeline registers and multiplexers within the vector-ALUs.

TABLE II: COMPARISON WITH STATE-OF-THE-ART ACCELERATORS
(per-benchmark values are given as AlexNet / VGG-16)

Reference                              Envision [7]       Eyeriss [6]        This work (ConvAix)
Technology                             40nm LP (Silicon)  65nm LP (Silicon)  28nm LP (P&R)
Architecture                           RISC + MAC Array   ASIC               ASIP
Core Voltage [V]                       0.85-0.92          1                  1
Gate Count (logic only) [kGE]          1600               1176               1293
On-Chip SRAM [KByte]                   148                181.5              144
Registers [KByte]                      -                  11.8               3.6
Clock Frequency [MHz]                  204                200                400
Power Consumption [mW]                 70.1               116.8 / 104.8      228.8 / 223.9
Off-Chip I/O [MByte] a                 b                  c / c              d / d
MAC Utilization Rate e                 -                  - / 36%            - / 76%
Area Efficiency [GOP/s/MGE]            -                  20.85 / -          39.73 / 44.01
Energy Efficiency [GOP/s/W]            815                187 / 104          - / -
Energy Efficiency @ 28nm/1V [GOP/s/W] f  -                434 / 242          459 / 497

a: Off-chip I/O for processing batches of size 1
b: Compressed using Huffman coding
c: Compressed using run-length coding
d: Uncompressed
e: Ratio of actual and ideal processing time based on 100% MAC utilization each cycle
f: Power values were scaled according to P_scaled = P_old * (L_new/L_old) * (V_DD,new/V_DD,old)^2
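As a worked check of footnote f (using the reconstructed table entries): scaling Eyeriss' AlexNet efficiency from 65nm at 1V to 28nm at 1V involves only the feature-size term, since the voltage ratio is one and the exponent therefore cancels, and the energy efficiency scales inversely to the power:

\[
\eta_{\mathrm{scaled}}
  = \frac{\eta_{\mathrm{old}}}{(L_{\mathrm{new}}/L_{\mathrm{old}})\,(V_{DD,\mathrm{new}}/V_{DD,\mathrm{old}})^{2}}
  = \frac{187\,\mathrm{GOP/s/W}}{(28/65)\cdot(1.0/1.0)^{2}}
  \approx 434\,\mathrm{GOP/s/W},
\]

which matches the scaled Eyeriss AlexNet entry of Table II.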
Fig. 4: Layout view of ConvAix after place & route (vALU0..vALU2, vector registers, memory controller, line buffer, DMA, controller, misc, DM banks 0..15, PM).

ConvAix was benchmarked using two widely used CNN models: AlexNet [1] and VGG-16 [14]. Table II summarizes our results and compares them with two well-known accelerators (Envision [7] and Eyeriss [6]) targeting the same CNN models. To allow for a fair comparison between all designs, we scaled the energy efficiency values of all architectures to a uniform 28nm technology operating at 1V. The values presented in Table II show overall results across all layers of the respective CNNs, with optimized word width for the architectures that provide scalable precision. Furthermore, to increase the fairness of the comparison, the processing times used in Table II do not include the time required for off-chip I/O whenever possible. We hereby aim to eliminate the effect that the I/O bandwidth of the external memories could have on the presented figures.

Due to its comparatively high clock frequency, ConvAix exceeds the other designs in terms of processing speed (1.6x compared to the next fastest for AlexNet and 4.8x for VGG-16) and area efficiency (1.9x for AlexNet and 4.3x for VGG-16). At the same time it maintains a competitive energy efficiency of 459 GOP/s/W on average for AlexNet and 497 GOP/s/W for VGG-16. The average MAC utilization for AlexNet is 8% lower than that of Eyeriss. This is expected, since the proposed design is software-programmed, which always incurs a certain overhead for control code. For VGG-16, however, ConvAix demonstrates a much higher utilization of 76% vs. 36% for Eyeriss. According to the authors of Eyeriss, their lower value is caused by the added time required for repeatedly ramping up the MAC array. The required off-chip I/O is higher than that of Eyeriss, which can be explained by the lack of a memory compression engine in our design. Calculations using the sparsity values provided in [6] show that our design would achieve similar total off-chip I/O figures as Eyeriss if compression were added.
VI. CONCLUSION
It was the goal of this work to demonstrate the practical feasibility of a software-programmable architecture with an instruction set that is targeted towards, but not limited to, CNN acceleration. The envisioned architecture, called ConvAix, was implemented in a modern 28nm CMOS technology and evaluated using highly relevant benchmarks. The results show that ConvAix not only achieves competitive energy efficiency compared to other, less flexible designs, but even surpasses them in terms of area efficiency (1.9x for AlexNet, 4.3x for VGG-16) and throughput. Especially for the larger VGG-16 model, ConvAix achieves a significantly higher utilization (76% compared to 36%) and a 4.8x higher processing speed. Incorporating techniques such as dynamic voltage and frequency scaling or memory compression could further improve the efficiency of the presented design. We leave this investigation to future work.
ACKNOWLEDGMENT
This work was supported by the German Federal Ministry of Education and Research (BMBF) via the PARIS project (16ES0602) aiming at autonomous driving.
REFERENCES
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Advances in Neural Information Processing Systems, pp. 1-9, 2012.
[2] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," 2015. [Online]. Available: http://arxiv.org/abs/1506.02640
[3] V. Sze, Y.-H. Chen, T.-J. Yang, and J. Emer, "Efficient Processing of Deep Neural Networks: A Tutorial and Survey," pp. 1-31, 2017. [Online]. Available: http://arxiv.org/abs/1703.09039
[4] Y. Ma, Y. Cao, S. Vrudhula, and J.-s. Seo, "Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks," Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '17, pp. 45-54, 2017.
[5] Y.-H. Chen, J. Emer, and V. Sze, "Using Dataflow to Optimize Energy Efficiency of Deep Neural Network Accelerators," IEEE Micro, no. 3, pp. 12-21, 2017.
[6] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127-138, 2017.
[7] B. Moons and M. Verhelst, "An Energy-Efficient Precision-Scalable ConvNet Processor in 40-nm CMOS," IEEE Journal of Solid-State Circuits, vol. 52, no. 4, pp. 903-914, 2017.
[8] D. D. Lin, S. S. Talathi, and V. S. Annapureddy, "Fixed Point Quantization of Deep Convolutional Networks," vol. 48, 2015. [Online]. Available: http://arxiv.org/abs/1511.06393
[9] B. Moons, B. De Brabandere, L. Van Gool, and M. Verhelst, "Energy-Efficient ConvNets Through Approximate Computing," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), 2016, pp. 1-8.
[10] L. Cavigelli and L. Benini, "Origami: A 803-GOp/s/W Convolutional Network Accelerator," IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 11, pp. 2461-2475, 2017.
[11] E. Azarkhish, D. Rossi, I. Loi, and L. Benini, "Neurostream: Scalable and Energy Efficient Deep Learning with Smart Memory Cubes," IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 2, pp. 420-434, 2018.
[12] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, "TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory," Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS '17, pp. 751-764, 2017.
[13] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," 2017. [Online]. Available: http://arxiv.org/abs/1704.04861
[14] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," in International Conference on Learning Representations (ICLR), 2015.