Understanding Cache Boundness of ML Operators on ARM Processors
Bernhard Klein, Christoph Gratl, Manfred Mücke, Holger Fröning
Bernhard Klein, Institute of Computer Engineering, Heidelberg University, Germany ([email protected])
Christoph Gratl and Manfred Mücke, Materials Center Leoben, Austria ([email protected], @mcl.at)
Holger Fröning, Institute of Computer Engineering, Heidelberg University, Germany ([email protected])
Abstract: Machine Learning (ML) compilers like TVM allow a fast and flexible deployment on embedded CPUs. This enables the use of non-standard operators, which are common in ML compression techniques. However, it is necessary to understand the limitations of typical compute-intense operators in ML workloads to design a proper solution. This is the first in-detail analysis of dense and convolution operators, generated with TVM, that compares to the fundamental hardware limits of embedded ARM processors. Thereby it explains the gap between computational peak performance, theoretical and measured, and real-world state-of-the-art results, created with TVM and openBLAS. Instead of being compute-bound, single-precision general matrix multiply (GEMM) and convolutions turn out to be bound by L1-cache-read bandwidth. Explorations of 8-bit and bit-serial quantized operators show that quantization can be used to achieve relevant speedups compared to cache-bound floating-point operators. However, the performance of quantized operators highly depends on the interaction between data layout and bit packing.

Index Terms: Cache-bound, ARM Peak Performance, Cache Bandwidth, TVM, Quantization, Bit Serial, GEMM, Convolution
I. INTRODUCTION
Although machine learning is meanwhile ubiquitously deployed, it is still nascent and developing fast. One can observe a continuous stream of innovation, resulting in new methods, model architectures, and use cases that benefit from ML's inherent property to model complex input-output relations with high accuracy and outstanding generalization. Currently, this continuous development results in fast-changing basic operations of artificial neural networks. One prominent example is Capsule networks, in which scalar-valued neurons are replaced by small matrices in order to capture more complex relationships. There exist various reports on the inflexibility of ML frameworks to address the needs of such innovation [1], as they are mainly designed for today's needs but obviously cannot anticipate tomorrow's innovation. Due to the diminishing returns of CMOS scaling, the number of distinct processor architectures and technologies is rapidly growing, further increasing this problem, as each platform requires a set of corresponding libraries.

The ubiquity of ML also requires deployment on embedded and edge platforms, which are substantially limited in their resources, including processor, memory, network, and battery life, among others. To address the gap between model requirements and hardware capability, a plethora of methods exists to compress the models. Examples include quantization [2], which reverts from floating-point values to low-bit-width alternatives for parameters such as weights and activations, pruning, which introduces sparsity in the parameters to avoid computations [3], and architecture search, which basically trains not only the parameters but also the hyperparameters that represent the model architecture [4]. Notably, all techniques aim to maintain prediction accuracy while minimizing either the amount of computation, or parameters, or both.

One of the most promising solutions to address the gap between ML innovation and hardware trends is code generation, which gears to automatically generate high-performance operators for a given target platform, thereby providing an alternative to hand-tuned, manually written libraries. One of the most prominent examples of code generation specialized for ML is TVM [5], which is based on graph-level and operator-level optimizations and employs a learning-based cost model for automated optimization [6]. Experimental results have shown that the performance of generated operators is on par with state-of-the-art, hand-tuned libraries for various target platforms [5].

Similarly to new operators resulting from ML innovation, compression techniques usually also require custom operators: quantization can reach extremely low-precision representations, ultimately binarized values [7], while pruning needs to express the found sparsity in the operator, requiring for instance compressed-sparse-row formats or masks that represent non-zero values. Similarly, architecture search can result in complexly connected networks, which could also benefit from custom operators for efficiency reasons. As a result, it seems promising to explore the capabilities of code generation for model compression, such that deployment is feasible with high efficiency, high productivity, and performance comparable to hand-tuned libraries.

This work is mainly concerned with executing specialized ML operators on ARM processors as pervasively available embedded systems, analyzing performance limiters, and exploring countermeasures such as model compression.
To be more concrete, we provide a detailed analysis of computationally intense operators, including convolutional and dense ones, on embedded ARM processors. As the main observation of this analysis is that such operations are bound by memory-cache bandwidth, we subsequently employ quantization as a compression technique to reduce the necessary data volume. We rely on code generation to solve the problem that the resulting specialized operators are often not supported by dedicated libraries. For that, we select TVM due to its proven functionality, active community, and support for quantized computations [8, 9]. We furthermore choose bit-serial forms of computation on ARM processors, which allow for arbitrary reduced-precision representations on various platforms [10]. The bit-serial approach does not scale according to the reduced data size and, moreover, it does not seem to be bound by cache bandwidth, at least not in regard to the cache-bound model discussed in this work. Also, as in our opinion quantization shall not be limited to certain data types, such as the de-facto industry standard of 8-bit integer, any quantization option has to be flexible for the best trade-off between accuracy and performance.

In particular, this work makes the following contributions:

1) Measurement of the computational peak performance and the memory bandwidths for caches and RAM of selected ARM processors.
2) Benchmarking of convolutional and dense operations on embedded ARM processors, including auto-generated code and comparison with openBLAS.
3) Detailed performance analysis to understand the disparity between sustained and theoretical peak performance, resulting in a cache-bound model.
4) Benchmarking of 8-bit and bit-serial results, and discussion of the observed behavior based on the cache-bound model.

II. RELATED WORK
Static inference libraries, like ARM's Compute Library (https://developer.arm.com/ip-products/processors/machine-learning/compute-library), or other specialized, hand-tuned, ultra-low-precision operators [11, 12] achieve impressive performance, but make it difficult to combine multiple compression techniques if not provided by the library. However, the combination of such techniques, especially quantization and pruning, can be very effective [13].

In contrast, deep learning compilers like TVM, Tensorflow's XLA [14], and Lift [15] close the gap between high-level machine learning frameworks and deployment on hardware in a flexible way. Their independence from static libraries enables straightforward research on new ML methods and the usage of non-standard operators, and allows compression techniques to be used and combined with little effort.

Auto-tuning is one of the most important features of TVM. With an automated end-to-end feedback loop, executing run-time measurements, and a domain-specific machine-learning-based cost model, optimal parameters can be found [6]. One step further, it is possible to exploit AutoTVM and use it in the decision process of hardware design [16].

While the import and execution of previously quantized 8-bit models is supported using a specialized QNN dialect [17], ultra-low-precision quantization is still an open research topic. Cowan et al. [8] proved that such low-precision operations are feasible with TVM and that, using a hardware-aware bit-serial algorithm, a relevant speedup over the default floating-point implementation can be achieved. Moreover, TVM's scheduling mechanism can be used for high-level optimizations and program synthesis to find highly optimized low-level operators [9].

Our work does not focus on a new technique reducing the inference time even further. It focuses on the analysis and understanding of the performance bounds of compute-intense ML operators and compares them with the fundamental hardware limits. It highlights memory access as the limiting factor and thus motivates ML compression techniques like low-precision representation, which can reduce the memory pressure and thus the overall processing latency.

III. METHODOLOGICAL SETUP AND PERFORMANCE BASELINE
A. Auto-tuning Methodology
The interface of AutoTVM is designed to tune neural networks which are represented in Relay [18], TVM's high-level IR. In our work, however, specific single operators are evaluated. Creating many single-layer neural networks that contain exactly the operations to be examined, auto-tuning them, and saving the tuned parameters to a logfile enables reuse, and thus auto-tuned operator execution, in the manual examination mode.

For the regular data types, float32 and unsigned and signed int8, the XGBTuner with its xgboost cost model is used [19]. However, this tuner is, because of a not yet fixed issue, not compatible with some of the bit-serial operators; therefore, all bit-serial operators are auto-tuned with the random tuner. In principle, the tuner can have a relevant impact on the tuned parameters and thereby on the final inference time. However, for bit-serial dense and convolution operators, the search space is highly restricted due to the bit-packing implementation, resulting in less freedom in the parameter selection. Therefore, the impact of auto-tuning is relatively small and the selection of the tuner less critical.

TVM release 0.7 (commit efdac9439506d1de5eec91ecc795982c78e41909) is used and compiled with openBLAS support.
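The flow just described can be sketched in code. The following is a minimal sketch, assuming TVM 0.7's Python AutoTVM API; the operator shape, trial count, target triple, and the file name dense_tuning.log are illustrative assumptions, not the exact configuration used in this work.

```python
import tvm
from tvm import relay, autotvm
from tvm.autotvm.tuner import XGBTuner

target = tvm.target.Target("llvm -device=arm_cpu -mtriple=aarch64-linux-gnu")

# Build a single-layer network containing exactly the operator under test.
data = relay.var("data", shape=(1, 1024), dtype="float32")
weight = relay.var("weight", shape=(1024, 1024), dtype="float32")
net = relay.nn.dense(data, weight)
mod = tvm.IRModule.from_expr(relay.Function(relay.analysis.free_vars(net), net))

# Extract the tuning task(s) for this operator and tune with the XGBoost tuner,
# logging every measured configuration so the best one can be reused later.
tasks = autotvm.task.extract_from_program(mod, params={}, target=target)
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10, repeat=3),
)
log_file = "dense_tuning.log"
for task in tasks:
    tuner = XGBTuner(task)
    tuner.tune(n_trial=1000, measure_option=measure_option,
               callbacks=[autotvm.callback.log_to_file(log_file)])

# Reuse the logged parameters when compiling in the manual examination mode.
with autotvm.apply_history_best(log_file):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target)
```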
B. Target Architecture

Our experiments are performed on an ARM Cortex-A53 (https://developer.arm.com/ip-products/processors/cortex-a/cortex-a53) in a Broadcom BCM2837, and an ARM Cortex-A72 (https://developer.arm.com/ip-products/processors/cortex-a/cortex-a72) in a Broadcom BCM2711.
1) Peak Performance: The theoretical computational peak performance

p_peak = frequency · cores · FLOP/instr · instr/cycle · SIMD_width    (1)

directly represents the maximum possible performance, assuming that all compute resources can be fully utilized. Considering multiply-accumulate operations (MACs) with 2 FLOPs per instruction, one NEON MAC per cycle, and a NEON SIMD width of 128 bit (four float32 lanes), this leads to a single-precision peak performance of 38.4 GFLOP/s and 48.0 GFLOP/s for the quad-core Cortex-A53 (1.2 GHz) and Cortex-A72 (1.5 GHz), respectively.

In order to verify the theoretical expectations, a small benchmark program was written that executes many MACs using NEON's VMLA instructions with on-register operands only, avoiding any other memory access. For a fair comparison, the total amount of MACs in a GEMM is distributed to all cores, and multi-threading effects are included in the measurement, which is plainly visible for small matrices. The critical part is written in assembly to ensure that no compiler optimizations distort the workload. The benchmark code is publicly available (https://github.com/UniHD-CEG/arm-peak).

The measured performance, shown in Tables IV and V, confirms the compute boundary, provided that the workload is large enough to hide the overhead of multi-threading. Additionally, it confirms that one VMLA instruction can be computed per cycle.
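As a worked example, Eq. (1) with the numbers given above can be evaluated directly; a minimal sketch:

```python
# Worked example of Eq. (1); core and lane counts follow the text:
# quad-core SoCs, 128-bit NEON = 4 float32 lanes, 2 FLOPs per MAC instruction.
def peak_gflops(freq_ghz, cores=4, flop_per_instr=2, instr_per_cycle=1, simd_lanes=4):
    return freq_ghz * cores * flop_per_instr * instr_per_cycle * simd_lanes

print(peak_gflops(1.2))  # Cortex-A53 @ 1.2 GHz -> 38.4 GFLOP/s
print(peak_gflops(1.5))  # Cortex-A72 @ 1.5 GHz -> 48.0 GFLOP/s
```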
2) Memory Bandwidth:
With the benchmark tool RAMspeed (https://github.com/cruvolo/ramspeed-smp), the read and write bandwidths are measured with different block sizes. Since the L1 data cache is 16 KB for the ARM Cortex-A53 and 32 KB for the Cortex-A72, a block size of 4 KB is used to measure the L1 cache bandwidth. To fit into the L2 caches (512 KB for the Cortex-A53 and 1 MB for the Cortex-A72), 256 KB sized blocks are used. Last, to measure the RAM bandwidth, a block size of 16 MB is large enough to keep caching effects reasonably small. For the measurements shown in Tables I and II, RAMspeed is used in multi-threading mode with 4 threads to utilize all cores, and with 1 GB and 8 GB of data per pass, respectively, the maximum that still fits into main memory.

TABLE I
MEASURED MEMORY BANDWIDTH FOR ARM CORTEX-A53

Memory     Block Size   Read BW        Write BW
RAM        16 MB        2040 MiB/s     1600 MiB/s
L2 Cache   256 KB       7039 MiB/s     3467 MiB/s
L1 Cache   4 KB         14363 MiB/s    23703 MiB/s

TABLE II
MEASURED MEMORY BANDWIDTH FOR ARM CORTEX-A72

Memory     Block Size   Read BW        Write BW
RAM        16 MB        3661 MiB/s     2984 MiB/s
L2 Cache   256 KB       12934 MiB/s    7407 MiB/s
L1 Cache   4 KB         45733 MiB/s    30423 MiB/s

C. Operators

1) General Matrix Multiply: In dense layers, all neurons are connected with all neurons of the previous layer. They are a key component of classical ML models and can be represented as dense GEMMs with a non-linear activation function afterwards.

To compute a GEMM with squared matrices, based on three nested loops, approximately N³ MACs are necessary; thus, the total amount of arithmetic operations is 2 · N³. With an execution time t, the performance of a GEMM can be computed as:

p = 2 · MACs / t = 2 · N³ / t    (2)
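For illustration, Eq. (2) can be evaluated directly; the run time below is a hypothetical example value, not a measurement from this work:

```python
# Eq. (2): performance of a squared GEMM from its measured execution time.
def gemm_gflops(n, t_seconds):
    macs = n ** 3                       # three nested loops over N
    return 2 * macs / t_seconds / 1e9   # 2 FLOPs per MAC

print(gemm_gflops(1024, 0.31))  # hypothetical 310 ms run -> ~6.9 GFLOP/s
```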
2) Convolutions:
Convolutional Neural Networks (CNNs) are not only state-of-the-art in computer vision; moreover, they are spreading into many ML domains as a general, very effective concept. Thus, convolution layers are at the heart of most modern ML models, and are the most resource- and time-consuming part, too.

It is a common practice to employ GEMMs to compute convolution operators, relying on IM2COL [20], but native convolution algorithms are also common.

ResNet [21] is one of the most successful and wide-spread architectures, and smaller variants are common on embedded devices. The properties of all convolution layers of ResNet-18 are shown in Table III. As in previous work, the first layer is excluded, since the input layer is particularly sensitive to quantization and the input channel depth is too low for efficient bit packing [8].

Like the GEMM, convolutions rely heavily on MAC operations. For a batch size b, pad p, stride s, input and output channels c_in and c_out, input and output image sizes h_in, w_in, h_out, w_out, and a convolution kernel k = (k_x, k_y), the number of MACs in a convolution layer is:

h_out = (h_in + 2p) / s;  w_out = (w_in + 2p) / s    (3)

MACs = b · h_out · w_out · c_in · c_out · k_x · k_y    (4)

TABLE III
RESNET-18 CONVOLUTION LAYERS

Name   b   c_in   c_out   h_in   w_in   k   s   p   MACs
C2     1    64     64      56     56    3   1   1   124,010,496
C3     1    64    128      56     56    3   2   1    62,005,248
C4     1    64    128      56     56    1   2   0     6,422,528
C5     1   128    128      28     28    3   1   1   132,710,400
C6     1   128    256      28     28    3   2   1    66,355,200
C7     1   128    256      28     28    1   2   0     6,422,528
C8     1   256    256      14     14    3   1   1   150,994,944
C9     1   256    512      14     14    3   2   1    75,497,472
C10    1   256    512      14     14    1   2   0     6,422,528
C11    1   512    512       7      7    3   1   1   191,102,976

Table III demonstrates the amount of diversity among layers with regard to computational complexity. Notably, 3 × 3 convolutions are the most compute-intensive, and layers with a non-unit stride can lead to complex memory access patterns.
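The MAC counts of Table III can be reproduced from Eqs. (3) and (4); a small sketch using the simplified output-size formula above:

```python
# Eqs. (3)-(4): MAC count of a convolution layer, reproducing Table III rows.
def conv_macs(b, c_in, c_out, h_in, w_in, k, s, p):
    h_out = (h_in + 2 * p) // s   # simplified output size, Eq. (3)
    w_out = (w_in + 2 * p) // s
    return b * h_out * w_out * c_in * c_out * k * k  # Eq. (4), k_x = k_y = k

print(conv_macs(1, 64, 64, 56, 56, 3, 1, 1))    # C2: 124,010,496
print(conv_macs(1, 256, 512, 14, 14, 3, 2, 1))  # C9: 75,497,472
```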
TABLE IV
GEMM PERFORMANCE FLOAT32, CORTEX-A53 (GFLOP/s)

N      openBLAS   TVM (naive)   TVM (tuned)   peak (measured)   peak (theoretical)
32     1.07       1.16          4.43          16.49             38.4
128    4.96       2.07          6.58          37.38             38.4
256    4.71       1.83          6.93          38.04             38.4
512    4.87       0.60          5.06          38.15             38.4
1024   4.99       0.54          5.01          38.18             38.4

TABLE V
GEMM PERFORMANCE FLOAT32, CORTEX-A72 (GFLOP/s)

N      openBLAS   TVM (naive)   TVM (tuned)   peak (measured)   peak (theoretical)
32     3.01       3.59          9.20          21.92             48.0
128    14.22      4.68          16.72         47.11             48.0
256    14.86      4.77          17.24         47.83             48.0
512    14.33      2.04          17.99         47.92             48.0
1024   14.98      1.36          15.75         47.93             48.0
IV. IN-DETAIL PERFORMANCE ANALYSIS
A. Matrix Multiply Performance
Tables IV and V illustrate the performance for various squared matrices, with results generated by TVM with and without auto-tuned parameters, and with external openBLAS library calls. For all measurements, the overhead of multi-threading dominates for small matrices. However, for all matrix sizes, the auto-tuned solution clearly outperforms the one without tuning, which has to fall back to default parameters, and even outperforms the hand-tuned BLAS library. The performance saturates for larger matrices, albeit at a level significantly lower than the expected peak performance of the CPU.
B. Cache-bound Model
In the best-case scenario, all computations are executed on fast local registers. However, there are obviously not enough registers for all matrix elements. In a simplified model, it can be assumed that for a vectorized MAC operation the first operand can be kept in registers, so that the result is also accumulated in registers, but at least the second operand has to be read from some kind of memory. Therefore, in the following discussion it is assumed that for a MAC operation at least one operand has to be read from cache or RAM. Certainly, this is a very simplified model, since the variables kept in registers need to be exchanged after being multiplied with all their counterparts, and the accumulated results have to be written back to memory, too. However, this is the most common operation and should dominate the overall execution time.

Fig. 1 shows, in double-logarithmic scale, the execution time of auto-tuned TVM and openBLAS with respect to the matrix size. Additionally, the compute time representing the theoretical peak compute performance, and the time to read 4 · N³ bytes for the float32 data type from L1 cache, L2 cache, and RAM are plotted. The bandwidths of the different memory types are listed in Tables I and II.

Fig. 1. Execution time over squared matrix size for general matrix multiply in double-logarithmic representation, together with hardware boundaries: theoretical compute time and the time to read or write data to RAM or cache memory.

For small matrix sizes, the execution times are somewhere in the range between the RAM and cache bandwidth bounds, but the measured time is in the range of only a few microseconds and thus vulnerable to small systematic errors, like multi-threading overhead. However, the time for the larger matrices strongly correlates with the L1 cache boundary. This suggests that the L1-cache-read bandwidth is not fast enough to keep the floating-point units fully utilized. Therefore, the theoretical peak performance cannot be reached, and it becomes apparent that single-precision GEMM on ARM Cortex-A53 and A72 processors is not compute-bound, but cache-bound.
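To make the cache-bound model concrete, the following sketch computes the bounds plotted in Fig. 1, the theoretical compute time and the one-read-per-MAC memory-read times, for a squared GEMM using the measured Cortex-A72 values from Tables II and V; this is our illustration of the model, not the paper's plotting code:

```python
# Cache-bound model: per MAC, one float32 operand (4 bytes) must be read.
def gemm_time_bounds(n, peak_gflops=48.0):
    bw_mibs = {"L1": 45733, "L2": 12934, "RAM": 3661}  # Table II read BW
    macs = n ** 3
    bounds = {"compute": 2 * macs / (peak_gflops * 1e9)}  # theoretical compute
    bytes_read = 4 * macs                                 # 4 * N^3, as in Fig. 1
    for level, mibs in bw_mibs.items():
        bounds[f"{level} read"] = bytes_read / (mibs * 1024 ** 2)
    return bounds

for n in (256, 1024):
    print(n, {k: f"{v * 1e3:.2f} ms" for k, v in gemm_time_bounds(n).items()})
```

For N = 1024 this yields roughly 45 ms of pure compute time but about 90 ms of L1 read time, which matches the observation that the measured execution time tracks the L1 bound rather than the compute bound.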
C. Performance of Convolutions

For the experiments, the ARM-specific conv2d spatial pack operator with NCHW data layout [22] is used. There is also an NHWC [22] version available, but it performs much worse without auto-tuning, and AutoTVM cannot be applied to this operator; thus, this analysis is postponed until a fair comparison is possible.

Fig. 2 sets the measured execution time for these convolutional operations in relation to the expected time for compute and memory read, again assuming that 4 · MACs bytes of data need to be read. In particular, one can see that the execution time for TVM auto-tuned correlates with the L1 cache read time, though there are some notable exceptions. Again, this execution time is far from the theoretical compute time as well as the RAM read time.

Fig. 2. Comparison of TVM's execution time of convolution layers of ResNet-18 to minimal theoretical compute and memory-read times. Mostly, execution time correlates with L1 or L2 cache read times.

Fig. 3. Performance of ResNet-18 layers compared with theoretical compute peak performance and memory-bandwidth-limited performances. Layers are sorted in descending performance order.

The performance in terms of GFLOP/s, as shown in Fig. 3, confirms this finding. Moreover, some of the compute-intensive 3 × 3 convolutions can reach slightly better performance than the L1 memory bandwidth suggests. Probably this kernel layout allows optimizations regarding in-register reuse. However, this effect is not large enough to change the mainly dominating cache-bound behavior.

V. PERFORMANCE ANALYSIS OF QUANTIZED OPERATORS
The main take-away from the last section is that, independent of the operator type, either GEMM or CONV, execution time is bound not by computational peak performance but by memory access time, usually that of the L1 cache. While this is a surprising finding from an HPC perspective, apparently embedded processors have different design objectives.

Also, this cache boundness suggests that performance scaling can benefit most from reduced memory pressure. Therefore, in the following we apply model quantization to reduce the size of operands, thereby also reducing the data volume fetched for a given operation.
A. Measurements
TVM supports bit-serial quantized convolution and dense operators for ARM CPUs, for bipolar (−1, 1) and unipolar (0, 1) bit encodings and various bit widths [9]. Whereas the unipolar variant is the more advanced quantization scheme, generally achieving better accuracies, it needs one additional subtraction and popcount instruction and is thus a little slower. While the precision dimension is calculated sequentially, bit-packing along the spatial dimensions allows the usage of vectorized instructions [23].

The operators allow setting activation and weight bits independently of each other. It is common to use activations and weights with the same precision, or with larger activation precision than weight precision [2]. Still, in this work activations and weights are quantized equally.

For bit-serial operations, data has to be in a specific packed data layout. The weights can be pre-packed and thus do not need to be packed during runtime, but the activations require bit-packing just before the calculation.
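To illustrate the principle (not TVM's actual kernels), the following sketch emulates a bit-serial dot product for unipolar (0, 1)-encoded values: the operands are decomposed into bit planes, plane pairs are combined with a bitwise AND, and a population count, weighted by bit significance, accumulates the result. In the real operators, the planes are additionally packed into vector words and reduced with vectorized popcount instructions.

```python
import numpy as np

# Bit-serial dot product for unsigned `bits`-bit values (unipolar encoding).
def bitserial_dot(a, b, bits=2):
    acc = 0
    for i in range(bits):              # precision dimension, sequential
        a_plane = (a >> i) & 1         # bit plane i of the first operand
        for j in range(bits):
            b_plane = (b >> j) & 1
            # AND + population count replaces the multiply-accumulate;
            # each plane product is weighted by its bit significance.
            acc += (1 << (i + j)) * int(np.sum(a_plane & b_plane))
    return acc

a = np.array([1, 2, 3, 0], dtype=np.uint8)
b = np.array([3, 1, 2, 1], dtype=np.uint8)
assert bitserial_dot(a, b) == int(np.dot(a, b))  # both yield 11
```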
B. Quantized GEMM

Fig. 4 shows the performance of bit-serial GEMM for different matrix sizes. It is remarkable that for lower bit widths, even larger matrices are necessary to achieve maximum performance. For the extreme binary case, the peak might not even be reached with the largest matrices measured, which are way larger than a typical dense layer in most ML workloads.

Fig. 4. Performance of bit-serial GEMM for different squared matrices, with the size of a dimension reported on the x axis. Lower bit widths achieve their maximum performance only with larger matrices.

To test whether bit-serial GEMM is also cache-bound, the bandwidth required to achieve the measured performance is calculated. Depending on the used data type, it is assumed that per MAC one data read of d bytes is necessary, with m MACs executed in time t. Thus, the required bandwidth bw_req for a MAC rate p = m/t is:

bw_req = m · d / t = p · d    (5)

In Fig. 5 this required bandwidth is plotted. With the increased performance for larger matrices, the bandwidth requirements increase too, but they all stay below the L1-cache-read bandwidth. While this suggests that bit-serial GEMM is not cache-bound, it has to be mentioned that other effects might be present in this case. One example is the efficiency of accessing bit-packed values. It is possible that the same packed data have to be read more than once per MAC, whereby the required bandwidth would also increase. Whether bit-packed data increases the required cache reads per MAC is an open question that is left for future research as of now. Furthermore, the mandatory bit-packing step for activations before the GEMM needs additional memory accesses, which are also not covered by the one-read-per-MAC assumption.

Fig. 5. Required bandwidth to reach the measured performance according to the cache-bound model for the GEMM bit-serial operation.
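Eq. (5) is straightforward to evaluate; the MAC rate below is a hypothetical example value, not a measurement from this work:

```python
# Eq. (5): bandwidth required to feed one d-byte read per MAC at MAC rate p.
def required_bw_gibs(p_macs_per_s, d_bytes):
    return p_macs_per_s * d_bytes / 1024 ** 3

# hypothetical: 10 GMAC/s with one packed byte read per MAC -> ~9.3 GiB/s
print(f"{required_bw_gibs(10e9, 1):.1f} GiB/s")
```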
C. Quantized Convolution

The speedup of the quantized ResNet-18 convolution layers is illustrated in Fig. 6. The performance of 8-bit QNN and bit-serial operators highly depends on the interaction between operator data layout and bit-packing format. In general, 3 × 3 convolutions benefit from higher computational complexity and data reuse, and therefore perform better than their 1 × 1 counterparts. A non-unit stride can lead to less efficient memory access, especially for packed data. Furthermore, the bit-serial convolution operator uses the NHWC data layout, which together with bit-packing leads to a performance decrease for small input sizes. For instance, layer C11 performs badly with the bit-serial operator, even though this operation has the highest MAC count. In contrast, 8-bit QNN with its NCHW layout is less sensitive to the input size, and it further seems to be more robust for 1 × 1 convolutions on smaller images than the floating-point baseline. For the bit-serial approach, the computational complexity scales quadratically with the bit width; thus, the speedup of lower bit-width representations is significantly higher.

Fig. 6. Speedup over float32 for QNN 8-bit and bit-serial convolution for ResNet-18 layers.

Fig. 7 shows the required bandwidth to sustain the measured performance. The bandwidth requirements scale linearly with the bit width, but the computational complexity scales quadratically; therefore, higher required bandwidths, due to the better overall performance, are observable for lower bit widths.

More importantly, the required bandwidth indicates that, as for the GEMM, the L1 cache bandwidth is sufficient to provide data if only one read per MAC is necessary. Therefore, the 8-bit QNN and bit-serial convolutions are apparently not cache-bound. However, multiple reads due to packed data and the overhead of bit packing are not covered by this model, but could be relevant.

Fig. 7. Required bandwidth to sustain the measured performance, according to the cache-bound model of convolution operators. Comparison of float32, QNN 8-bit, and bit-serial to memory bandwidths.

VI. SUMMARY AND OUTLOOK
First, the fundamental hardware limits, the bandwidth of caches and RAM, and the computational peak performance are measured, whereby the measured peak performance fits well with the theoretical maximum. The execution times of GEMM and convolution operators, generated with TVM and openBLAS, are compared against these hardware limits. The huge difference between computational peak performance and the sustained performance of representative operators can be explained with a cache-bound model, which assumes that per MAC operation one memory read from cache, typically L1 cache, is necessary. The performance of single-precision floating-point operators correlates mainly with the limit created by the L1-cache-read bandwidth.

To overcome this bottleneck, low-bit-width operators are explored: 8-bit QNN and the bit-serial approach for bit widths between 1 and 8 bit. Both achieve relevant speedups compared to the floating-point baseline. The analysis of the bandwidth required to read bit-packed data and sustain the measured performance indicates that quantized operators are not limited by cache-read bandwidth, at least not according to the simple cache-bound model discussed in this work. However, it becomes apparent that due to the bit-packed data structure, the operator data layout is crucial. Moreover, to reach the maximal performance of bit-serial GEMMs, very large matrices, and thus more efficient access to bit-packed data, are essential.

Relevant future directions include understanding the overhead of bit packing and of access to packed data, the scaling of memory accesses with problem size, and a corresponding refinement of the cache-bound model. Such a model would be of great benefit for the design of new operators and could also be included in an improved auto-tuning. Ultimately, the study of operators with differently quantized activations and weights would be of great interest, especially from the point of view that bit packing is only necessary for activations, but packed data access applies to both.

In summary, we believe that finding the best choice in terms of quantization for a given ARM processor requires a detailed understanding of the underlying effects, so that methodological approaches can be pursued for corresponding optimizations, instead of brute-forcing the problem by exhaustive searches based on a vast number of individual performance experiments.

ACKNOWLEDGMENT

The authors would like to thank Stephan Diestelhorst from Xilinx DCG for sharing his valuable experience, which finally led us to the cache-bound model idea. Furthermore, we gratefully acknowledge the financial support under the scope of the COMET program within the K2 Center "Integrated Computational Material, Process and Product Engineering (IC-MPPE)" (Project No 859480). This program is supported by the Austrian Federal Ministries for Transport, Innovation and Technology (BMVIT) and for Digital and Economic Affairs (BMDW), represented by the Austrian research funding association (FFG), and the federal states of Styria, Upper Austria and Tyrol.

REFERENCES
[1] P. Barham and M. Isard, "Machine learning systems are stuck in a rut," in Workshop on Hot Topics in Operating Systems (HotOS), ACM, 2019. https://doi.org/10.1145/3317550.3321441
[2] W. Roth, G. Schindler, M. Zöhrer, L. Pfeifenberger, R. Peharz, S. Tschiatschek, H. Fröning, F. Pernkopf, and Z. Ghahramani, "Resource-efficient neural networks for embedded systems," CoRR, vol. abs/2001.03048, 2020. http://arxiv.org/abs/2001.03048
[3] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, "Rethinking the value of network pruning," CoRR, vol. abs/1810.05270, 2018. http://arxiv.org/abs/1810.05270
[4] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," CoRR, vol. abs/1611.01578, 2016. http://arxiv.org/abs/1611.01578
[5] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, M. Cowan, H. Shen, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, "TVM: An automated end-to-end optimizing compiler for deep learning," in OSDI, 2018. https://arxiv.org/abs/1802.04799
[6] T. Chen, L. Zheng, E. Yan, Z. Jiang, T. Moreau, L. Ceze, C. Guestrin, and A. Krishnamurthy, "Learning to optimize tensor programs," in Advances in Neural Information Processing Systems 31 (NIPS), Curran Associates Inc., 2018, pp. 3393-3404. https://arxiv.org/abs/1805.08166
[7] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in ECCV, Springer International Publishing, 2016. https://doi.org/10.1007/978-3-319-46493-0_32
[8] M. Cowan, T. Moreau, T. Chen, and L. Ceze, "Automating generation of low precision deep learning operators," CoRR, vol. abs/1810.11066, 2018. http://arxiv.org/abs/1810.11066
[9] M. Cowan, T. Moreau, T. Chen, J. Bornholt, and L. Ceze, "Automatic generation of high-performance quantized machine learning kernels," in CGO, ACM, 2020. https://doi.org/10.1145/3368826.3377912
[10] Y. Umuroglu, D. Conficconi, L. Rasnayake, T. B. Preusser, and M. Själander, "Optimizing bit-serial matrix multiplication for reconfigurable computing," Transactions on Reconfigurable Technology and Systems (TRETS), vol. 12, no. 3, Aug. 2019. https://doi.org/10.1145/3337929
[11] Q. Han, Y. Hu, F. Yu, H. Yang, B. Liu, P. Hu, R. Gong, Y. Wang, R. Wang, Z. Luan, and D. Qian, "Extremely low-bit convolution optimization for quantized neural network on modern computer architectures," in ICPP, ACM, 2020. https://doi.org/10.1145/3404397.3404407
[12] G. Schindler, M. Mücke, and H. Fröning, "Linking application description with efficient SIMD code generation for low-precision signed-integer GEMM," in Euro-Par 2017: Parallel Processing Workshops, Springer International Publishing, 2018. https://doi.org/10.1007/978-3-319-75178-8_55
[13] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," in ICLR, 2016. http://arxiv.org/abs/1510.00149
[14] M. Abadi et al., "TensorFlow: A system for large-scale machine learning," in OSDI, USENIX Association, 2016. https://arxiv.org/abs/1605.08695
[15] M. Steuwer, T. Remmelg, and C. Dubach, "Lift: A functional data-parallel IR for high-performance GPU code generation," in International Symposium on Code Generation and Optimization (CGO), IEEE, 2017, pp. 74-85. https://doi.org/10.1109/CGO.2017.7863730
[16] D. Diamantopoulos, B. Ringlein, M. Purandare, G. Singh, and C. Hagleitner, "Agile autotuning of a transprecision tensor accelerator overlay for TVM compiler stack," in FPL, IEEE, 2020. https://doi.org/10.1109/FPL50879.2020.00058
[17] A. Jain, S. Bhattacharya, M. Masuda, V. Sharma, and Y. Wang, "Efficient execution of quantized deep learning models: A compiler approach," CoRR, vol. abs/2006.10226, 2020. https://arxiv.org/abs/2006.10226
[18] J. Roesch, S. Lyubomirsky, L. Weber, J. Pollock, M. Kirisame, T. Chen, and Z. Tatlock, "Relay: A new IR for machine learning frameworks," in MAPL, ACM, 2018. https://doi.org/10.1145/3211346.3211348
[19] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in KDD, ACM, 2016. https://doi.org/10.1145/2939672.2939785
[20] K. Chellapilla, S. Puri, and P. Simard, "High performance convolutional neural networks for document processing," 2006. https://hal.inria.fr/inria-00112631
[21] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2016. https://doi.org/10.1109/CVPR.2016.90
[22] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, "cuDNN: Efficient primitives for deep learning," CoRR, vol. abs/1410.0759, 2014. http://arxiv.org/abs/1410.0759
[23] Y. Umuroglu, L. Rasnayake, and M. Själander, "BISMO: A scalable bit-serial matrix multiplication overlay for reconfigurable computing," in International Conference on Field Programmable Logic and Applications (FPL), 2018. https://ieeexplore.ieee.org/document/8533514

APPENDIX
Fig. 9 shows the performance in GFLOP/s of floating-point GEMM for TVM with and without auto-tuned parameters in comparison to openBLAS. Auto-tuned TVM clearly outperforms the naive version and, moreover, also outperforms openBLAS. For small matrices, systematic errors, like multi-threading overhead, are clearly visible. In the middle regime, auto-tuned TVM slightly outperforms openBLAS, probably due to parameters which are optimized for this specific hardware and matrix size. For larger matrices, auto-tuned TVM and openBLAS achieve comparable results. It becomes apparent that hand-tuned general BLAS concepts are on par with hardware- and operator-specific code generation. However, it is clearly visible that all measured performances are far away from the theoretical peak performance, which can be explained with the cache-bound model.

Fig. 8 compares the performance of floating point with the quantized 8-bit QNN and bit-serial approaches for ResNet-18 convolution layers. The performance of the different approaches for the diverse operators is highly related to the