Understanding Cache Boundness of ML Operators on ARM Processors
Bernhard Klein, Christoph Gratl, Manfred Mücke, Holger Fröning
Bernhard Klein, Institute of Computer Engineering, Heidelberg University, Germany ([email protected])
Christoph Gratl and Manfred Mücke, Materials Center Leoben, Austria ([email protected], @mcl.at)
Holger Fröning, Institute of Computer Engineering, Heidelberg University, Germany ([email protected])
Abstract: Machine Learning (ML) compilers like TVM allow a fast and flexible deployment on embedded CPUs. This enables the use of non-standard operators, which are common in ML compression techniques. However, it is necessary to understand the limitations of typical compute-intense operators in ML workloads to design a proper solution. This is the first in-detail analysis of dense and convolution operators, generated with TVM, that compares to the fundamental hardware limits of embedded ARM processors. Thereby it explains the gap between computational peak performance, theoretical and measured, and real-world state-of-the-art results, created with TVM and openBLAS. Instead of being compute-bound, single-precision general matrix multiply (GEMM) and convolutions turn out to be bound by L1-cache-read bandwidth. Explorations of 8-bit and bit-serial quantized operators show that quantization can be used to achieve relevant speedups compared to cache-bound floating-point operators. However, the performance of quantized operators highly depends on the interaction between data layout and bit packing.

Index Terms: Cache-bound, ARM Peak Performance, Cache Bandwidth, TVM, Quantization, Bit Serial, GEMM, Convolution
I. INTRODUCTION
Although machine learning is meanwhile ubiquitously deployed, it is still nascent and developing fast. One can observe a continuous stream of innovation, resulting in new methods, model architectures, and use cases that benefit from ML's inherent property to model complex input-output relations with high accuracy and outstanding generalization. Currently, this continuous development results in fast-changing basic operations of artificial neural networks. One prominent example is Capsule networks, in which scalar-valued neurons are replaced by small matrices in order to capture more complex relationships. There exist various reports on the inflexibility of ML frameworks to address the needs of such innovation [1], as they are mainly designed for today's needs but obviously cannot anticipate tomorrow's innovation. Due to the diminishing returns of CMOS scaling, the number of distinct processor architectures and technologies is rapidly growing, further increasing this problem, as each platform requires a set of corresponding libraries.

The ubiquity of ML also requires deployment on embedded and edge platforms, which are substantially limited in their resources, including processor, memory, network, and battery life, among others. To address the gap between model requirements and hardware capability, a plethora of methods exists to compress the models. Examples include quantization [2], which reverts from floating-point values to low-bit-width alternatives for parameters such as weights and activations, pruning, which introduces sparsity in the parameters to avoid computations [3], and architecture search, which basically trains not only the parameters but also the hyperparameters that represent the model architecture [4]. Notably, all techniques aim to maintain prediction accuracy while minimizing either the amount of computation, or parameters, or both.

One of the most promising solutions to address the gap between ML innovation and hardware trends is code generation, which gears to automatically generate high-performance operators for a given target platform, thereby providing an alternative to hand-tuned, manually written libraries. One of the most prominent examples of code generation specialized for ML is TVM [5], which is based on graph-level and operator-level optimizations and employs a learning-based cost model for automated optimization [6]. Experimental results have shown that the performance of generated operators is on par with state-of-the-art, hand-tuned libraries for various target platforms [5].

Similarly to new operators resulting from ML innovation, compression techniques usually also require custom operators: quantization can reach extremely low-precision representations, ultimately binarized values [7], while pruning needs to express the found sparsity in the operator, requiring for instance compressed-sparse-row formats or masks that represent non-zero values. Similarly, architecture search can result in complexly connected networks, which could also benefit from custom operators for efficiency reasons. As a result, it seems promising to explore the capabilities of code generation for model compression, such that deployment is feasible with high efficiency, high productivity, and performance comparable to hand-tuned libraries.

This work is mainly concerned with executing specialized ML operators on ARM processors as pervasively available embedded systems, analyzing performance limiters, and exploring countermeasures such as model compression.
To be more concrete, we provide a detailed analysis of computationally intense operators, including convolutional and dense ones, on embedded ARM processors. As the main observation of this analysis is that such operations are bound by memory-cache bandwidth, we subsequently employ quantization as a compression technique to reduce the necessary data volume. We rely on code generation to solve the problem that the resulting specialized operators are often not supported by dedicated libraries. For that, we select TVM due to its proven functionality, active community, and support for quantized computations [8, 9]. We furthermore choose bit-serial forms of computation on ARM processors, which allow for arbitrary reduced-precision representations on various platforms [10]. The bit-serial approach does not scale according to the reduced data size and, moreover, it does not seem to be bound by cache bandwidth, at least not in regard to the cache-bound model discussed in this work. Also, as in our opinion quantization shall not be limited to certain data types, such as the de-facto industry standard of 8-bit integer, any quantization option has to be flexible for the best trade-off between accuracy and performance.

In particular, this work makes the following contributions:

1) Measurement of the computational peak performance and the memory bandwidths for caches and RAM of selected ARM processors.
2) Benchmarking of convolutional and dense operations on embedded ARM processors, including auto-generated code and comparison with openBLAS.
3) Detailed performance analysis to understand the disparity between sustained and theoretical peak performance, resulting in a cache-bound model.
4) Benchmarking of 8-bit and bit-serial results, and discussion of the observed behavior based on the cache-bound model.

II. RELATED WORK
Static inference libraries, like ARM's Compute Library (https://developer.arm.com/ip-products/processors/machine-learning/compute-library), or other specialized, hand-tuned, ultra-low-precision operators [11, 12] achieve impressive performance, but make it difficult to combine multiple compression techniques if not provided by the library. However, the combination of such techniques, especially quantization and pruning, can be very effective [13].

In contrast, deep learning compilers like TVM, Tensorflow's XLA [14], and Lift [15] close the gap between high-level machine learning frameworks and deployment on hardware in a flexible way. Their independence from static libraries enables straightforward research on new ML methods and the usage of non-standard operators, and allows compression techniques to be used and combined with little effort.

Auto-tuning is one of the most important features of TVM. With an automated end-to-end feedback loop, executing run-time measurements, and a domain-specific machine-learning-based cost model, optimal parameters can be found [6]. One step further, it is possible to exploit AutoTVM and use it in the decision process of hardware design [16].

While the import and execution of previously quantized 8-bit models is supported using a specialized QNN dialect [17], ultra-low-precision quantization is still an open research topic. Cowan et al. [8] proved that such low-precision operations are feasible with TVM and that, using a hardware-aware bit-serial algorithm, a relevant speedup over the default floating-point implementation can be achieved. Moreover, TVM's scheduling mechanism can be used for high-level optimizations and program synthesis to find highly optimized low-level operators [9].

Our work does not focus on a new technique reducing the inference time even further. It focuses on the analysis and understanding of the performance bounds of compute-intense ML operators and compares them with the fundamental hardware limits. It highlights memory access as the limiting factor and thus motivates ML compression techniques like low-precision representation, which can reduce the memory pressure and thus the overall processing latency.

III. METHODOLOGICAL SETUP AND PERFORMANCE BASELINE
A. Auto-tuning Methodology
The interface of AutoTVM is designed to tune neural networks which are represented in Relay [18], TVM's high-level IR. In our work, however, specific single operators are evaluated. Creating many single-layer neural networks that contain exactly the operations to be examined, auto-tuning them, and saving the tuned parameters to a logfile enables reuse, and thus auto-tuned operator execution, in the manual examination mode.

For the regular data types, float32 and unsigned and signed int8, the XGBTuner with its xgboost cost model is used [19]. However, this tuner is, because of a not yet fixed issue, not compatible with some of the bit-serial operators; therefore, all bit-serial operators are auto-tuned with the random tuner. In principle, the tuner can have a relevant impact on the tuned parameters and thereby on the final inference time. However, for bit-serial dense and convolution operators, the search space is highly restricted due to the bit-packing implementation, resulting in less freedom in the parameter selection. Therefore, the impact of auto-tuning is relatively small and the selection of the tuner less critical.

TVM release 0.7 (commit efdac9439506d1de5eec91ecc795982c78e41909) is used and compiled with openBLAS support.
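The flow just described can be sketched in code. The following is a minimal sketch, assuming TVM 0.7's Python AutoTVM API; the operator shape, trial count, target triple, and the file name dense_tuning.log are illustrative assumptions, not the exact configuration used in this work.

```python
import tvm
from tvm import relay, autotvm
from tvm.autotvm.tuner import XGBTuner

target = tvm.target.Target("llvm -device=arm_cpu -mtriple=aarch64-linux-gnu")

# Build a single-layer network containing exactly the operator under test.
data = relay.var("data", shape=(1, 1024), dtype="float32")
weight = relay.var("weight", shape=(1024, 1024), dtype="float32")
net = relay.nn.dense(data, weight)
mod = tvm.IRModule.from_expr(relay.Function(relay.analysis.free_vars(net), net))

# Extract the tuning task(s) for this operator and tune with the XGBoost tuner,
# logging every measured configuration so the best one can be reused later.
tasks = autotvm.task.extract_from_program(mod, params={}, target=target)
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10, repeat=3),
)
log_file = "dense_tuning.log"
for task in tasks:
    tuner = XGBTuner(task)
    tuner.tune(n_trial=1000, measure_option=measure_option,
               callbacks=[autotvm.callback.log_to_file(log_file)])

# Reuse the logged parameters when compiling in the manual examination mode.
with autotvm.apply_history_best(log_file):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target)
```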
B. Target Architecture

Our experiments are performed on an ARM Cortex-A53 (https://developer.arm.com/ip-products/processors/cortex-a/cortex-a53) in a Broadcom BCM2837, and an ARM Cortex-A72 (https://developer.arm.com/ip-products/processors/cortex-a/cortex-a72) in a Broadcom BCM2711.
1) Peak Performance: The theoretical computational peak performance

p_peak = frequency · cores · FLOP/instr · instr/cycle · SIMD_width    (1)

directly represents the maximum possible performance, assuming that all compute resources can be fully utilized. Considering multiply-accumulate operations (MACs) with 2 FLOPs per instruction, one NEON MAC per cycle, and a NEON SIMD width of 128 bit (four float32 lanes), this leads to a single-precision peak performance of 38.4 GFLOP/s and 48.0 GFLOP/s for the quad-core Cortex-A53 (1.2 GHz) and Cortex-A72 (1.5 GHz), respectively.

In order to verify the theoretical expectations, a small benchmark program was written that executes many MACs using NEON's VMLA instructions with on-register operands only, avoiding any other memory access. For a fair comparison, the total amount of MACs in a GEMM is distributed to all cores, and multi-threading effects are included in the measurement, which is plainly visible for small matrices. The critical part is written in assembly to ensure that no compiler optimizations distort the workload. The benchmark code is publicly available (https://github.com/UniHD-CEG/arm-peak).

The measured performance, shown in Tables IV and V, confirms the compute boundary, provided that the workload is large enough to hide the overhead of multi-threading. Additionally, it confirms that one VMLA instruction can be computed per cycle.
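As a worked example, Eq. (1) with the numbers given above can be evaluated directly; a minimal sketch:

```python
# Worked example of Eq. (1); core and lane counts follow the text:
# quad-core SoCs, 128-bit NEON = 4 float32 lanes, 2 FLOPs per MAC instruction.
def peak_gflops(freq_ghz, cores=4, flop_per_instr=2, instr_per_cycle=1, simd_lanes=4):
    return freq_ghz * cores * flop_per_instr * instr_per_cycle * simd_lanes

print(peak_gflops(1.2))  # Cortex-A53 @ 1.2 GHz -> 38.4 GFLOP/s
print(peak_gflops(1.5))  # Cortex-A72 @ 1.5 GHz -> 48.0 GFLOP/s
```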
2) Memory Bandwidth:
With the benchmark tool RAMspeed (https://github.com/cruvolo/ramspeed-smp), the read and write bandwidths are measured with different block sizes. Since the L1 data cache is 16 KB for the ARM Cortex-A53 and 32 KB for the Cortex-A72, a block size of 4 KB is used to measure the L1 cache bandwidth. To fit into the L2 caches (512 KB for the Cortex-A53 and 1 MB for the Cortex-A72), 256 KB sized blocks are used. Last, to measure the RAM bandwidth, a block size of 16 MB is large enough to keep caching effects reasonably small. For the measurements shown in Tables I and II, RAMspeed is used in multi-threading mode with 4 threads to utilize all cores, and with 1 GB and 8 GB of data per pass, respectively, the maximum that still fits into main memory.

TABLE I
MEASURED MEMORY BANDWIDTH FOR ARM CORTEX-A53

Memory     Block Size   Read BW        Write BW
RAM        16 MB        2040 MiB/s     1600 MiB/s
L2 Cache   256 KB       7039 MiB/s     3467 MiB/s
L1 Cache   4 KB         14363 MiB/s    23703 MiB/s

TABLE II
MEASURED MEMORY BANDWIDTH FOR ARM CORTEX-A72

Memory     Block Size   Read BW        Write BW
RAM        16 MB        3661 MiB/s     2984 MiB/s
L2 Cache   256 KB       12934 MiB/s    7407 MiB/s
L1 Cache   4 KB         45733 MiB/s    30423 MiB/s

C. Operators

1) General Matrix Multiply: In dense layers, all neurons are connected with all neurons of the previous layer. They are a key component of classical ML models and can be represented as dense GEMMs with a non-linear activation function afterwards.

To compute a GEMM with squared matrices, based on three nested loops, approximately N³ MACs are necessary; thus, the total amount of arithmetic operations is 2 · N³. With an execution time t, the performance of a GEMM can be computed as:

p = 2 · MACs / t = 2 · N³ / t    (2)
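For illustration, Eq. (2) can be evaluated directly; the run time below is a hypothetical example value, not a measurement from this work:

```python
# Eq. (2): performance of a squared GEMM from its measured execution time.
def gemm_gflops(n, t_seconds):
    macs = n ** 3                       # three nested loops over N
    return 2 * macs / t_seconds / 1e9   # 2 FLOPs per MAC

print(gemm_gflops(1024, 0.31))  # hypothetical 310 ms run -> ~6.9 GFLOP/s
```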
2) Convolutions:
Convolutional Neural Networks (CNNs) are not only state-of-the-art in computer vision; moreover, they are spreading into many ML domains as a general, very effective concept. Thus, convolution layers are at the heart of most modern ML models, and are the most resource- and time-consuming part, too.

It is a common practice to employ GEMMs to compute convolution operators, relying on IM2COL [20], but native convolution algorithms are also common.

ResNet [21] is one of the most successful and wide-spread architectures, and smaller variants are common on embedded devices. The properties of all convolution layers of ResNet-18 are shown in Table III. As in previous work, the first layer is excluded, since the input layer is particularly sensitive to quantization and the input channel depth is too low for efficient bit packing [8].

Like the GEMM, convolutions rely heavily on MAC operations. For a batch size b, pad p, stride s, input and output channels c_in and c_out, input and output image sizes h_in, w_in, h_out, w_out, and a convolution kernel k = (k_x, k_y), the number of MACs in a convolution layer is:

h_out = (h_in + 2p) / s;  w_out = (w_in + 2p) / s    (3)

MACs = b · h_out · w_out · c_in · c_out · k_x · k_y    (4)

TABLE III
RESNET-18 CONVOLUTION LAYERS

Name   b   c_in   c_out   h_in   w_in   k   s   p   MACs
C2     1    64     64      56     56    3   1   1   124,010,496
C3     1    64    128      56     56    3   2   1    62,005,248
C4     1    64    128      56     56    1   2   0     6,422,528
C5     1   128    128      28     28    3   1   1   132,710,400
C6     1   128    256      28     28    3   2   1    66,355,200
C7     1   128    256      28     28    1   2   0     6,422,528
C8     1   256    256      14     14    3   1   1   150,994,944
C9     1   256    512      14     14    3   2   1    75,497,472
C10    1   256    512      14     14    1   2   0     6,422,528
C11    1   512    512       7      7    3   1   1   191,102,976

Table III demonstrates the amount of diversity among layers with regard to computational complexity. Notably, 3 × 3 convolutions are the most compute-intensive, and layers with a non-unit stride can lead to complex memory access patterns.
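The MAC counts of Table III can be reproduced from Eqs. (3) and (4); a small sketch using the simplified output-size formula above:

```python
# Eqs. (3)-(4): MAC count of a convolution layer, reproducing Table III rows.
def conv_macs(b, c_in, c_out, h_in, w_in, k, s, p):
    h_out = (h_in + 2 * p) // s   # simplified output size, Eq. (3)
    w_out = (w_in + 2 * p) // s
    return b * h_out * w_out * c_in * c_out * k * k  # Eq. (4), k_x = k_y = k

print(conv_macs(1, 64, 64, 56, 56, 3, 1, 1))    # C2: 124,010,496
print(conv_macs(1, 256, 512, 14, 14, 3, 2, 1))  # C9: 75,497,472
```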
TABLE IV
GEMM PERFORMANCE FLOAT32, CORTEX-A53 (GFLOP/s)

N      openBLAS   TVM (naive)   TVM (tuned)   peak (measured)   peak (theoretical)
32     1.07       1.16          4.43          16.49             38.4
128    4.96       2.07          6.58          37.38             38.4
256    4.71       1.83          6.93          38.04             38.4
512    4.87       0.60          5.06          38.15             38.4
1024   4.99       0.54          5.01          38.18             38.4

TABLE V
GEMM PERFORMANCE FLOAT32, CORTEX-A72 (GFLOP/s)

N      openBLAS   TVM (naive)   TVM (tuned)   peak (measured)   peak (theoretical)
32     3.01       3.59          9.20          21.92             48.0
128    14.22      4.68          16.72         47.11             48.0
256    14.86      4.77          17.24         47.83             48.0
512    14.33      2.04          17.99         47.92             48.0
1024   14.98      1.36          15.75         47.93             48.0
IV. IN-DETAIL PERFORMANCE ANALYSIS
A. Matrix Multiply Performance
Tables IV and V illustrate the performance for various squared matrices, with results generated by TVM with and without auto-tuned parameters, and with external openBLAS library calls. For all measurements, the overhead of multi-threading dominates for small matrices. However, for all matrix sizes, the auto-tuned solution clearly outperforms the one without tuning, which has to fall back to default parameters, and even outperforms the hand-tuned BLAS library. The performance saturates for larger matrices, albeit at a level significantly lower than the expected peak performance of the CPU.
B. Cache-bound Model
In the best-case scenario, all computations are executed on fast local registers. However, there are obviously not enough registers for all matrix elements. In a simplified model, it can be assumed that for a vectorized MAC operation the first operand can be kept in registers, so that the result is also accumulated in registers, but at least the second operand has to be read from some kind of memory. Therefore, in the following discussion it is assumed that for a MAC operation at least one operand has to be read from cache or RAM. Certainly, this is a very simplified model, since the variables kept in registers need to be exchanged after being multiplied with all their counterparts, and the accumulated results have to be written back to memory, too. However, this is the most common operation and should dominate the overall execution time.

Fig. 1 shows, in double-logarithmic scale, the execution time of auto-tuned TVM and openBLAS with respect to the matrix size. Additionally, the compute time representing the theoretical peak compute performance, and the time to read 4 · N³ bytes for the float32 data type from L1 cache, L2 cache, and RAM are plotted. The bandwidths of the different memory types are listed in Tables I and II.

Fig. 1. Execution time over squared matrix size for general matrix multiply in double-logarithmic representation, together with hardware boundaries: theoretical compute time and the time to read or write data to RAM or cache memory.

For small matrix sizes, the execution times are somewhere in the range between the RAM and cache bandwidth bounds, but the measured time is in the range of only a few microseconds and thus vulnerable to small systematic errors, like multi-threading overhead. However, the time for the larger matrices strongly correlates with the L1 cache boundary. This suggests that the L1-cache-read bandwidth is not fast enough to keep the floating-point units fully utilized. Therefore, the theoretical peak performance cannot be reached, and it becomes apparent that single-precision GEMM on ARM Cortex-A53 and A72 processors is not compute-bound, but cache-bound.
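To make the cache-bound model concrete, the following sketch computes the bounds plotted in Fig. 1, the theoretical compute time and the one-read-per-MAC memory-read times, for a squared GEMM using the measured Cortex-A72 values from Tables II and V; this is our illustration of the model, not the paper's plotting code:

```python
# Cache-bound model: per MAC, one float32 operand (4 bytes) must be read.
def gemm_time_bounds(n, peak_gflops=48.0):
    bw_mibs = {"L1": 45733, "L2": 12934, "RAM": 3661}  # Table II read BW
    macs = n ** 3
    bounds = {"compute": 2 * macs / (peak_gflops * 1e9)}  # theoretical compute
    bytes_read = 4 * macs                                 # 4 * N^3, as in Fig. 1
    for level, mibs in bw_mibs.items():
        bounds[f"{level} read"] = bytes_read / (mibs * 1024 ** 2)
    return bounds

for n in (256, 1024):
    print(n, {k: f"{v * 1e3:.2f} ms" for k, v in gemm_time_bounds(n).items()})
```

For N = 1024 this yields roughly 45 ms of pure compute time but about 90 ms of L1 read time, which matches the observation that the measured execution time tracks the L1 bound rather than the compute bound.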
C. Performance of Convolutions

For the experiments, the ARM-specific conv2d spatial pack operator with NCHW data layout [22] is used. There is also an NHWC [22] version available, but it performs much worse without auto-tuning, and AutoTVM cannot be applied to this operator; thus, this analysis is postponed until a fair comparison is possible.

Fig. 2 sets the measured execution time for these convolutional operations in relation to the expected time for compute and memory read, again assuming that 4 · MACs bytes of data need to be read. In particular, one can see that the execution time for TVM auto-tuned correlates with the L1 cache read time, though there are some notable exceptions. Again, this execution time is far from the theoretical compute time as well as the RAM read time.

Fig. 2. Comparison of TVM's execution time of convolution layers of ResNet-18 to minimal theoretical compute and memory-read times. Mostly, execution time correlates with L1 or L2 cache read times.

Fig. 3. Performance of ResNet-18 layers compared with theoretical compute peak performance and memory-bandwidth-limited performances. Layers are sorted in descending performance order.

The performance in terms of GFLOP/s, as shown in Fig. 3, confirms this finding. Moreover, some of the compute-intensive 3 × 3 convolutions can reach slightly better performance than the L1 memory bandwidth suggests. Probably this kernel layout allows optimizations regarding in-register reuse. However, this effect is not large enough to change the mainly dominating cache-bound behavior.

V. PERFORMANCE ANALYSIS OF QUANTIZED OPERATORS
The main take-away from the last section is that, independent of the operator type, either GEMM or CONV, execution time is bound not by computational peak performance but by memory access time, usually that of the L1 cache. While this is a surprising finding from an HPC perspective, apparently embedded processors have different design objectives.

Also, this cache boundness suggests that performance scaling can benefit most from reduced memory pressure. Therefore, in the following we apply model quantization to reduce the size of operands, thereby also reducing the data volume fetched for a given operation.
A. Measurements
TVM supports bit-serial quantized convolution and dense operators for ARM CPUs, for bipolar (−1, 1) and unipolar (0, 1) bit encodings and various bit widths [9]. Whereas the unipolar variant is the more advanced quantization scheme, generally achieving better accuracies, it needs one additional subtraction and popcount instruction and is thus a little slower. While the precision dimension is calculated sequentially, bit-packing along the spatial dimensions allows the usage of vectorized instructions [23].

The operators allow setting activation and weight bits independently of each other. It is common to use activations and weights with the same precision, or with larger activation precision than weight precision [2]. Still, in this work activations and weights are quantized equally.

For bit-serial operations, data has to be in a specific packed data layout. The weights can be pre-packed and thus do not need to be packed during runtime, but the activations require bit-packing just before the calculation.
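To illustrate the principle (not TVM's actual kernels), the following sketch emulates a bit-serial dot product for unipolar (0, 1)-encoded values: the operands are decomposed into bit planes, plane pairs are combined with a bitwise AND, and a population count, weighted by bit significance, accumulates the result. In the real operators, the planes are additionally packed into vector words and reduced with vectorized popcount instructions.

```python
import numpy as np

# Bit-serial dot product for unsigned `bits`-bit values (unipolar encoding).
def bitserial_dot(a, b, bits=2):
    acc = 0
    for i in range(bits):              # precision dimension, sequential
        a_plane = (a >> i) & 1         # bit plane i of the first operand
        for j in range(bits):
            b_plane = (b >> j) & 1
            # AND + population count replaces the multiply-accumulate;
            # each plane product is weighted by its bit significance.
            acc += (1 << (i + j)) * int(np.sum(a_plane & b_plane))
    return acc

a = np.array([1, 2, 3, 0], dtype=np.uint8)
b = np.array([3, 1, 2, 1], dtype=np.uint8)
assert bitserial_dot(a, b) == int(np.dot(a, b))  # both yield 11
```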
B. Quantized GEMM

Fig. 4 shows the performance of bit-serial GEMM for different matrix sizes. It is remarkable that for lower bit widths, even larger matrices are necessary to achieve maximum performance. For the extreme binary case, the peak might not even be reached with the largest matrices measured, which are way larger than a typical dense layer in most ML workloads.

Fig. 4. Performance of bit-serial GEMM for different squared matrices, with the size of a dimension reported on the x axis. Lower bit widths achieve their maximum performance only with larger matrices.

To test whether bit-serial GEMM is also cache-bound, the bandwidth required to achieve the measured performance is calculated. Depending on the used data type, it is assumed that per MAC one data read of d bytes is necessary, with m MACs executed in time t. Thus, the required bandwidth bw_req for a MAC rate p = m/t is:

bw_req = m · d / t = p · d    (5)

In Fig. 5 this required bandwidth is plotted. With the increased performance for larger matrices, the bandwidth requirements increase too, but they all stay below the L1-cache-read bandwidth. While this suggests that bit-serial GEMM is not cache-bound, it has to be mentioned that other effects might be present in this case. One example is the efficiency of accessing bit-packed values. It is possible that the same packed data have to be read more than once per MAC, whereby the required bandwidth would also increase. Whether bit-packed data increases the required cache reads per MAC is an open question that is left for future research as of now. Furthermore, the mandatory bit-packing step for activations before the GEMM needs additional memory accesses, which are also not covered by the one-read-per-MAC assumption.

Fig. 5. Required bandwidth to reach the measured performance according to the cache-bound model for the GEMM bit-serial operation.
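Eq. (5) is straightforward to evaluate; the MAC rate below is a hypothetical example value, not a measurement from this work:

```python
# Eq. (5): bandwidth required to feed one d-byte read per MAC at MAC rate p.
def required_bw_gibs(p_macs_per_s, d_bytes):
    return p_macs_per_s * d_bytes / 1024 ** 3

# hypothetical: 10 GMAC/s with one packed byte read per MAC -> ~9.3 GiB/s
print(f"{required_bw_gibs(10e9, 1):.1f} GiB/s")
```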
C. Quantized Convolution

The speedup of the quantized ResNet-18 convolution layers is illustrated in Fig. 6. The performance of 8-bit QNN and bit-serial operators highly depends on the interaction between operator data layout and bit-packing format. In general, 3 × 3 convolutions benefit from higher computational complexity and data reuse, and therefore perform better than their 1 × 1 counterparts. A non-unit stride can lead to less efficient memory access, especially for packed data. Furthermore, the bit-serial convolution operator uses the NHWC data layout, which together with bit-packing leads to a performance decrease for small input sizes. For instance, layer C11 performs badly with the bit-serial operator, even though this operation has the highest MAC count. In contrast, 8-bit QNN with its NCHW layout is less sensitive to the input size, and it further seems to be more robust for 1 × 1 convolutions on smaller images than the floating-point baseline. For the bit-serial approach, the computational complexity scales quadratically with the bit width; thus, the speedup of lower bit-width representations is significantly higher.

Fig. 6. Speedup over float32 for QNN 8-bit and bit-serial convolution for ResNet-18 layers.

Fig. 7 shows the required bandwidth to sustain the measured performance. The bandwidth requirements scale linearly with the bit width, but the computational complexity scales quadratically; therefore, higher required bandwidths, due to the better overall performance, are observable for lower bit widths.

More importantly, the required bandwidth indicates that, as for the GEMM, the L1 cache bandwidth is sufficient to provide data if only one read per MAC is necessary. Therefore, the 8-bit QNN and bit-serial convolutions are apparently not cache-bound. However, multiple reads due to packed data and the overhead of bit packing are not covered by this model, but could be relevant.

Fig. 7. Required bandwidth to sustain the measured performance, according to the cache-bound model of convolution operators. Comparison of float32, QNN 8-bit, and bit-serial to memory bandwidths.

VI. SUMMARY AND OUTLOOK
First, the fundamental hardware limits, the bandwidth of caches and RAM, and the computational peak performance are measured, whereby the measured peak performance fits well with the theoretical maximum. The execution times of GEMM and convolution operators, generated with TVM and openBLAS, are compared against these hardware limits. The huge difference between computational peak performance and the sustained performance of representative operators can be explained with a cache-bound model, which assumes that per MAC operation one memory read from cache, typically L1 cache, is necessary. The performance of single-precision floating-point operators correlates mainly with the limit created by the L1-cache-read bandwidth.

To overcome this bottleneck, low-bit-width operators are explored: 8-bit QNN and the bit-serial approach for bit widths between 1 and 8 bit. Both achieve relevant speedups compared to the floating-point baseline. The analysis of the bandwidth required to read bit-packed data and sustain the measured performance indicates that quantized operators are not limited by cache-read bandwidth, at least not according to the simple cache-bound model discussed in this work. However, it becomes apparent that due to the bit-packed data structure, the operator data layout is crucial. Moreover, to reach the maximal performance of bit-serial GEMMs, very large matrices, and thus more efficient access to bit-packed data, are essential.

Relevant future directions include understanding the overhead of bit packing and of access to packed data, the scaling of memory accesses with problem size, and a corresponding refinement of the cache-bound model. Such a model would be of great benefit for the design of new operators and could also be included in an improved auto-tuning. Ultimately, the study of operators with differently quantized activations and weights would be of great interest, especially from the point of view that bit packing is only necessary for activations, but packed data access applies to both.

In summary, we believe that finding the best choice in terms of quantization for a given ARM processor requires a detailed understanding of the underlying effects, so that methodological approaches can be pursued for corresponding optimizations, instead of brute-forcing the problem by exhaustive searches based on a vast number of individual performance experiments.

ACKNOWLEDGMENT

The authors would like to thank Stephan Diestelhorst from Xilinx DCG for sharing his valuable experience, which finally led us to the cache-bound model idea. Furthermore, we gratefully acknowledge the financial support under the scope of the COMET program within the K2 Center "Integrated Computational Material, Process and Product Engineering (IC-MPPE)" (Project No 859480). This program is supported by the Austrian Federal Ministries for Transport, Innovation and Technology (BMVIT) and for Digital and Economic Affairs (BMDW), represented by the Austrian research funding association (FFG), and the federal states of Styria, Upper Austria and Tyrol.

REFERENCES
[1] P. Barham and M. Isard, "Machine learning systems are stuck in a rut," in Workshop on Hot Topics in Operating Systems (HotOS), ACM, 2019. https://doi.org/10.1145/3317550.3321441
[2] W. Roth, G. Schindler, M. Zöhrer, L. Pfeifenberger, R. Peharz, S. Tschiatschek, H. Fröning, F. Pernkopf, and Z. Ghahramani, "Resource-efficient neural networks for embedded systems," CoRR, vol. abs/2001.03048, 2020. http://arxiv.org/abs/2001.03048
[3] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, "Rethinking the value of network pruning," CoRR, vol. abs/1810.05270, 2018. http://arxiv.org/abs/1810.05270
[4] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," CoRR, vol. abs/1611.01578, 2016. http://arxiv.org/abs/1611.01578
[5] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, M. Cowan, H. Shen, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, "TVM: An automated end-to-end optimizing compiler for deep learning," in OSDI, 2018. https://arxiv.org/abs/1802.04799
[6] T. Chen, L. Zheng, E. Yan, Z. Jiang, T. Moreau, L. Ceze, C. Guestrin, and A. Krishnamurthy, "Learning to optimize tensor programs," in Advances in Neural Information Processing Systems 31 (NIPS), Curran Associates Inc., 2018, pp. 3393-3404. https://arxiv.org/abs/1805.08166
[7] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in ECCV, Springer International Publishing, 2016. https://doi.org/10.1007/978-3-319-46493-0_32
[8] M. Cowan, T. Moreau, T. Chen, and L. Ceze, "Automating generation of low precision deep learning operators," CoRR, vol. abs/1810.11066, 2018. http://arxiv.org/abs/1810.11066
[9] M. Cowan, T. Moreau, T. Chen, J. Bornholt, and L. Ceze, "Automatic generation of high-performance quantized machine learning kernels," in CGO, ACM, 2020. https://doi.org/10.1145/3368826.3377912
[10] Y. Umuroglu, D. Conficconi, L. Rasnayake, T. B. Preusser, and M. Själander, "Optimizing bit-serial matrix multiplication for reconfigurable computing," Transactions on Reconfigurable Technology and Systems (TRETS), vol. 12, no. 3, Aug. 2019. https://doi.org/10.1145/3337929
[11] Q. Han, Y. Hu, F. Yu, H. Yang, B. Liu, P. Hu, R. Gong, Y. Wang, R. Wang, Z. Luan, and D. Qian, "Extremely low-bit convolution optimization for quantized neural network on modern computer architectures," in ICPP, ACM, 2020. https://doi.org/10.1145/3404397.3404407
[12] G. Schindler, M. Mücke, and H. Fröning, "Linking application description with efficient SIMD code generation for low-precision signed-integer GEMM," in Euro-Par 2017: Parallel Processing Workshops, Springer International Publishing, 2018. https://doi.org/10.1007/978-3-319-75178-8_55
[13] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," in ICLR, 2016. http://arxiv.org/abs/1510.00149
[14] M. Abadi et al., "TensorFlow: A system for large-scale machine learning," in OSDI, USENIX Association, 2016. https://arxiv.org/abs/1605.08695
[15] M. Steuwer, T. Remmelg, and C. Dubach, "Lift: A functional data-parallel IR for high-performance GPU code generation," in International Symposium on Code Generation and Optimization (CGO), IEEE, 2017, pp. 74-85. https://doi.org/10.1109/CGO.2017.7863730
[16] D. Diamantopoulos, B. Ringlein, M. Purandare, G. Singh, and C. Hagleitner, "Agile autotuning of a transprecision tensor accelerator overlay for TVM compiler stack," in FPL, IEEE, 2020. https://doi.org/10.1109/FPL50879.2020.00058
[17] A. Jain, S. Bhattacharya, M. Masuda, V. Sharma, and Y. Wang, "Efficient execution of quantized deep learning models: A compiler approach," CoRR, vol. abs/2006.10226, 2020. https://arxiv.org/abs/2006.10226
[18] J. Roesch, S. Lyubomirsky, L. Weber, J. Pollock, M. Kirisame, T. Chen, and Z. Tatlock, "Relay: A new IR for machine learning frameworks," in MAPL, ACM, 2018. https://doi.org/10.1145/3211346.3211348
[19] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in KDD, ACM, 2016. https://doi.org/10.1145/2939672.2939785
[20] K. Chellapilla, S. Puri, and P. Simard, "High performance convolutional neural networks for document processing," 2006. https://hal.inria.fr/inria-00112631
[21] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2016. https://doi.org/10.1109/CVPR.2016.90
[22] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, "cuDNN: Efficient primitives for deep learning," CoRR, vol. abs/1410.0759, 2014. http://arxiv.org/abs/1410.0759
[23] Y. Umuroglu, L. Rasnayake, and M. Själander, "BISMO: A scalable bit-serial matrix multiplication overlay for reconfigurable computing," in International Conference on Field Programmable Logic and Applications (FPL), 2018. https://ieeexplore.ieee.org/document/8533514

APPENDIX
Fig. 9 shows the performance in GFLOP/s of floating-point GEMM for TVM with and without auto-tuned parameters in comparison to openBLAS. Auto-tuned TVM clearly outperforms the naive version and, moreover, also outperforms openBLAS. For small matrices, systematic errors, like multi-threading overhead, are clearly visible. In the middle regime, auto-tuned TVM slightly outperforms openBLAS, probably due to parameters which are optimized for this specific hardware and matrix size. For larger matrices, auto-tuned TVM and openBLAS achieve comparable results. It becomes apparent that hand-tuned general BLAS concepts are on par with hardware- and operator-specific code generation. However, it is clearly visible that all measured performances are far away from the theoretical peak performance, which can be explained with the cache-bound model.

Fig. 8 compares the performance of floating point with the quantized 8-bit QNN and bit-serial approaches for ResNet-18 convolution layers. The performance of the different approaches for the diverse operators is highly related to the