High Performance and Portable Convolution Operators for ARM-based Multicore Processors
Pablo San Juan, Adrián Castelló, Manuel F. Dolz, Pedro Alonso-Jordá, Enrique S. Quintana-Ortí
Universitat Politècnica de València, Spain. [email protected], [email protected], [email protected]
Universitat Jaume I, Castellón de la Plana, Spain. {adcastel,dolzm}@icc.uji.es
May 14, 2020
Abstract
The considerable impact of Convolutional Neural Networks on many Artificial Intelligence tasks has led to the development of various high performance algorithms for the convolution operator present in this type of networks. One of these approaches leverages the im2col transform followed by a general matrix multiplication (gemm) in order to take advantage of the highly optimized realizations of the gemm kernel in many linear algebra libraries. The main problems of this approach are 1) the large memory workspace required to host the intermediate matrices generated by the im2col transform; and 2) the time to perform the im2col transform, which is not negligible for complex neural networks. This paper presents a portable high performance convolution algorithm based on the BLIS realization of the gemm kernel that avoids the use of the intermediate memory by taking advantage of the BLIS structure. In addition, the proposed algorithm eliminates the cost of the explicit im2col transform, while maintaining the portability and performance of the underlying realization of gemm in BLIS.
1 Introduction

During the past decade and a half, the use of deep neural networks (DNNs) for machine learning (also known as deep learning, or DL), and more specifically convolutional neural networks (CNNs), has gained a tremendous momentum, carrying beyond conventional problems in image classification, object detection, speech recognition and neural machine translation [11, 24, 37], to be extended to a myriad of unexplored applications, for example, in quantum computing, solid state lighting, nanoelectronics and nanomechanics, high throughput screening of new materials, computer vision in microscopy, radiography and tomography, and astrophysics simulation; see [33, 28, 6] among many others.

Current CNN models consist of a large number of neuron layers that allow them to deliver superior accuracy on many artificial intelligence (AI) tasks, at the cost of a considerable computational effort, both for training and inference [33]. This cost comes from the CNN being mostly composed of convolutional layers (conv), each basically embedding a high-dimensional convolution operator [27].

The high computational cost of the conv layers can be tackled via certain compression techniques (such as the use of low-rank approximations, quantization/low-precision arithmetic, sparsification, etc.), which aim to reduce the complexity of the convolution in exchange for a potential degradation in accuracy [19]. The application of the convolution operator can also be accelerated via optimized implementations of this kernel that carefully exploit the architecture of modern high performance processors, such as multicore processors and graphics processing units (GPUs). In particular, the convolution can be cast in terms of a general matrix multiplication (gemm) [9, 15, 1] via the im2col transform [8]. In some cases, the gemm-based approach can be accelerated employing Winograd's minimal filtering algorithms, possibly combined with the Strassen variant of the matrix multiplication [25, 39]. However, this latter strategy can also result in a decay of accuracy of the trained model.

High performance realizations of the convolution operator/gemm are available in libraries such as Intel's oneDNN/MKL and NVIDIA's cuDNN/cuBLAS, respectively [1, 2]. However, these implementations target Intel/AMD x86 architectures and NVIDIA GPUs, and therefore, they are not portable to other architectures. Moreover, except for oneDNN, these libraries take a "black-box" approach and their contents cannot be examined nor modified.

The Basic Linear Algebra Instantiation Software (BLIS) is a software framework for rapid development of high-performance dense linear algebra libraries [34]. BLIS implements the full functionality defined in the
Basic Linear Algebra Subprograms (BLAS) application programming interface (API) [12], featuring several appealing properties:
– BLIS is written in Standard C (mainly ISO C90 with a few C99 extensions).
– The BLIS code is mostly architecture-independent and, therefore, largely portable. Developing an efficient instance of BLIS for a specific processor architecture requires an efficient implementation of a small piece of code, known as the micro-kernel, and the selection of a number of cache configuration parameters that can be adjusted via an analytical model [26].
– There exist high performance realizations of the micro-kernel (and tuned selections of the cache configuration parameters) for many different architectures, including low-power ARM-based processors [7].
– On a variety of modern multicore processors, BLIS has been shown to deliver sustained high performance [36, 32, 7] that rivals that of commercial libraries, such as Intel's MKL, as well as other open high performance instances of the BLAS, such as GotoBLAS [18, 17], OpenBLAS [29] and ATLAS [35].
In this paper, we leverage the open implementation of the gemm kernel in BLIS to design high performance and portable convolution operators for DL inference on general-purpose multicore processors.
For this purpose, we modify one of the packing routines in the BLIS gemm kernel to apply the im2col transform on-the-fly (that is, during the execution of the matrix multiplication) on the input tensor for the convolution operator. As a result, our approach features:
Reduced workspace.
We avoid the explicit assembly of the large-scale matrix that results from applying the im2col transform to the input tensor, requiring no extra workspace (other than the small buffers that are used inside the BLIS gemm).
High performance.
Our solution mimics the performance of the BLIS gemm, basically eliminating the overhead of the im2col transform, to reduce the execution time of the convolution operator to that of the associated gemm kernel.
Portability.
The result remains as portable as BLIS since our modification of the gemm kernel does not affect the micro-kernel nor the cache configuration parameters.
As an additional contribution of this work, we assess the advantages of our integration of im2col into the BLIS gemm by porting and evaluating the resulting convolution operator on the ARM quad-core Cortex-A57 processor (ARMv8, 64-bit) that is integrated in the NVIDIA Jetson TX2 module.
The rest of the paper is organized as follows. After a survey of related work in the next subsection, in Section 2 we review the BLIS approach for the implementation of gemm, briefly discussing the portability and multi-threaded parallelization of this kernel. Special attention is paid there to the packing performed within BLIS, in connection with the layout of the data in memory, as these are two keys to our approach. In Section 3 we review the im2col transform and how to leverage this function to cast a convolution in terms of a matrix multiplication. We then open Section 4 with a discussion of the problems of such a straightforward scheme, proposing an alternative that embeds the im2col transform within the BLIS gemm kernel, yielding a portable, high performance, integrated convgemm operator for multicore processors. Finally, we evaluate the performance of the new routines on an ARM Cortex-A57 processor in Section 5, and offer some closing remarks in Section 6.

1.1 Related work
Direct algorithms.
Libraries such as NVIDIA's cuDNN, HexagonNN [23] and Xiaomi's MACE [3] include optimized direct convolution operators for the most frequently encountered filter dimensions and strides, falling back to default algorithms for other parameter values. In comparison, Intel's MKL-DNN [15] employs parameterized architecture-aware just-in-time code generators to produce direct optimized convolution routines at runtime.
NNPACK (Neural Networks PACKage) [4] also provides direct implementations of convolution operators involving large filters (3 × 3, 5 × 5) using either Winograd filters or FFT. NNPACK supports many popular deep learning frameworks (Caffe2, PyTorch, MXNET, etc.) and includes architecture-specific optimizations for ARMv7, ARMv8, and x86 processors.
Indirect algorithms.
In contrast with the previous approach, gemm-based algorithms reformulate the convolution in terms of a two-stage (or indirect) im2col + gemm. This makes it possible to leverage highly optimized realizations of the BLAS, which exist for almost any modern computer platform. As a result, the gemm-based approach is now used in all major deep learning frameworks [13].
Facebook's QNNPACK (Quantized NNPACK) [14] extends NNPACK to perform computations in 8-bit fixed-point precision, targeting convolution operators which cannot benefit from fast Winograd/FFT-based schemes. Similar to our approach, QNNPACK follows an indirect approach while aiming to eliminate the overhead of the im2col transform for matrix multiplication libraries.
A few other works have addressed the excessive memory consumption of gemm-based algorithms by dividing the matrix multiplication into small kernels [10, 5]. However, the authors of these works do not consider the combination of their solutions with optimized, architecture-specific realizations of the gemm kernel.
In [13], M. Dukhan tackles both the memory and performance issues of the indirect approach. Concretely, that work proposes to introduce an indirection structure of pointers to the convolution input operand, optimized for the so-called NHWC layout. Unfortunately, the author recognizes that 1) the algorithm is not expected to be competitive with state-of-the-art patch-building algorithms [38] due to strided memory access; and 2) his solution has limited applicability for the backward pass of the convolution operator and the Transposed Convolution operator.

L1: for j_c = 0, ..., n−1 in steps of n_c
L2:   for p_c = 0, ..., k−1 in steps of k_c
        B(p_c : p_c + k_c − 1, j_c : j_c + n_c − 1) → B_c        // Pack into B_c
L3:     for i_c = 0, ..., m−1 in steps of m_c
          A(i_c : i_c + m_c − 1, p_c : p_c + k_c − 1) → A_c      // Pack into A_c
L4:       for j_r = 0, ..., n_c−1 in steps of n_r                // Macro-kernel
L5:         for i_r = 0, ..., m_c−1 in steps of m_r
              C_c(i_r : i_r + m_r − 1, j_r : j_r + n_r − 1)      // Micro-kernel
                += A_c(i_r : i_r + m_r − 1, 0 : k_c − 1) · B_c(0 : k_c − 1, j_r : j_r + n_r − 1)

Figure 1: High performance implementation of gemm in BLIS. In the code, C_c ≡ C(i_c : i_c + m_c − 1, j_c : j_c + n_c − 1) is just a notation artifact, introduced to ease the presentation of the algorithm, while A_c, B_c correspond to actual buffers that are involved in data copies.
2 gemm in BLIS

General overview.
Consider the gemm operation C += A · B, where the dimensions of the operands are C → m × n, A → m × k, and B → k × n. BLIS adheres to the high-performance taxonomy in GotoBLAS [18] to implement this kernel (and any other variant, with transposed A and/or B) as three nested loops around a macro-kernel plus two packing routines; see Loops L1–L3 in the gemm algorithm in Figure 1. In addition, the macro-kernel is implemented in terms of two additional loops around a micro-kernel; see Loops L4 and L5 in the same figure. The micro-kernel is encoded as a loop around a rank-1 update (that is, an outer product; not explicitly shown in the figure). For simplicity, we will consider hereafter that m, n, k are integer multiples of m_c, n_c, k_c, respectively; and m_c, n_c are integer multiples of m_r, n_r, respectively.
In BLIS, the loop ordering, together with the packing routines and an appropriate selection of the loop strides n_c, k_c, m_c, n_r and m_r (which match the processor cache configuration), orchestrate a regular pattern of data transfers through the memory hierarchy [34, 26]. In rough detail, given a processor architecture, the goal is that a k_c × n_r micro-panel of the buffer B_c, say B_r, and an m_c × k_c macro-panel of the buffer A_c, say A_r, are streamed into the floating-point units (FPUs) from the L1 and L2 caches, respectively; while the k_c × n_c macro-panel B_c resides in the L3 cache (if present).
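For illustration purposes only, a simple (unvectorized) C encoding of such a micro-kernel could be the following; the routine name ukernel_ref is merely illustrative and this is not the actual BLIS code:

  /* Illustrative, unoptimized micro-kernel: C_r += A_r * B_r, where
   * A_r is an m_r x k_c micro-panel stored in column-major order,
   * B_r is a  k_c x n_r micro-panel stored in row-major order, and
   * C_r is an m_r x n_r micro-tile of C with leading dimension ldc. */
  static void ukernel_ref(int mr, int nr, int kc,
                          const float *Ar, const float *Br,
                          float *Cr, int ldc)
  {
    for (int p = 0; p < kc; p++)       /* loop around rank-1 updates */
      for (int j = 0; j < nr; j++)     /* one outer product per p    */
        for (int i = 0; i < mr; i++)
          Cr[i + j*ldc] += Ar[i + p*mr] * Br[j + p*nr];
  }

An actual BLIS micro-kernel replaces the two inner loops with architecture-specific vector code (assembly or intrinsics) that keeps the m_r × n_r micro-tile in registers.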
Portability.
An appealing property of BLIS is that all routines are encoded in C except, possibly, for the rank-1 update inside the micro-kernel, which may be vectorized using either assembly or vector intrinsics [34]. Furthermore, following the convention for BLAS [12], the routines for (almost) all other Level-3 BLAS are built on top of gemm. This enhances portability as, given a "generic" (architecture-oblivious) instance of the BLIS gemm, porting the whole BLIS library to a particular processor architecture only requires developing an efficient realization of the rank-1 update for the target processor, and selecting the proper values of n_c, k_c, m_c, n_r and m_r for the processor cache/memory configuration.
Multi-threaded parallelization.
BLIS allows the user to choose, at execution time, which of the five loops of the gemm kernel are parallelized. The multi-threaded parallelization of the BLIS gemm kernel has been previously analyzed for conventional multicore processors [36], modern many-threaded architectures [32], and low-power (asymmetric) ARM-based processors in [7]. The insights gained from these experimental studies show that Loop L1 is usually a good candidate for multi-socket platforms with on-chip L3 caches; Loop L3 should be parallelized when each core has its own L2 cache; and Loops L4 and L5 are convenient choices if the cores share the L2 cache.

Figure 2: Packing in the BLIS and GotoBLAS implementations of gemm. The arrows indicate the linear layout of the elements in memory: column-major for A_r and row-major for B_r.
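As a sketch of one such strategy, and assuming OpenMP (the threading mechanism employed later in Section 5), parallelizing Loop L4 can be expressed as shown below; the routine reuses the illustrative ukernel_ref above and is not the actual BLIS implementation:

  /* Sketch: macro-kernel of Figure 1 (Loops L4 and L5) with Loop L4
   * parallelized via OpenMP. Threads work on different jr blocks, hence on
   * disjoint micro-tiles of C_c, while sharing the packed buffers Ac and Bc. */
  static void macro_kernel(int mc, int nc, int kc, int mr, int nr,
                           const float *Ac, const float *Bc,
                           float *Cc, int ldc)
  {
    #pragma omp parallel for
    for (int jr = 0; jr < nc; jr += nr)        /* Loop L4 */
      for (int ir = 0; ir < mc; ir += mr)      /* Loop L5 */
        ukernel_ref(mr, nr, kc,
                    Ac + ir*kc,                /* micro-panel A_r of Ac */
                    Bc + jr*kc,                /* micro-panel B_r of Bc */
                    Cc + ir + jr*ldc, ldc);    /* micro-tile of C_c     */
  }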
Data storage.
Hereafter, unless otherwise explicitly stated, we adhere to the Fortran convention that dictates the column-major order storage of matrices. This implies that, for example, the entries of the 2D array (i.e., matrix) C → m × n are arranged in consecutive positions in memory as
  C[0][0], C[1][0], ..., C[m−1][0]          (first column of C),
  C[0][1], C[1][1], ..., C[m−1][1]          (second column of C),
  ...,
  C[0][n−1], C[1][n−1], ..., C[m−1][n−1]    (last column of C).
Note that BLAS follows the Fortran matrix storage convention and, therefore, this is necessary to be able to invoke the gemm kernel.
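For illustration, the following (hypothetical) macro maps the (i, j) entry of a column-major matrix with leading dimension ld to its position in the underlying linear array:

  /* Column-major (Fortran/BLAS) layout: entry (i,j) of a matrix stored with
   * leading dimension ld lives at offset i + j*ld in the linear array.
   * Traversing the array in memory order visits the columns one after another:
   * C[0][0], C[1][0], ..., C[m-1][0], C[0][1], C[1][1], ...            */
  #define COLMAJ(A, i, j, ld)  (A)[(size_t)(i) + (size_t)(j)*(ld)]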
The packing routines.
The purpose of these routines is to arrange the elements of A and B into A_c and B_c, respectively, so that the elements of the A_c and B_c buffers will be accessed with unit stride when executing the micro-kernel [21]. (An additional benefit of packing is that A_c and B_c are preloaded into certain cache levels of the memory hierarchy, reducing the time to access the elements of these buffers when using them to update a micro-tile of C.)
The packing routines proceed to copy and compact the data of the input operands as follows. In the packing routine for A_c, each m_c × k_c block of A is packed into the A_c buffer with its elements organized as micro-panels of size m_r × k_c; furthermore, within each micro-panel of A_c, the elements are stored in column-major order. Also, each k_c × n_c block of B is packed into B_c, with its elements arranged into micro-panels of size k_c × n_r, and each micro-panel stored in row-major order; see Figure 2 and the algorithm in Figure 3.

L1: for j_r = 0, ..., n_c−1 in steps of n_r
      i = 0
L2:   for p_s = 0, ..., k_c−1
L3:     for j_s = 0, ..., n_r−1
          B_c[i][j_r] = B[p_c + p_s][j_c + j_r + j_s]
          i = i + 1

Figure 3: Algorithm for packing B into B_c. The indices p_c and j_c correspond to the coordinates of the top-left entry of the block of matrix B that is packed; see Figure 1. Matrix B is maintained in column-major order. Each micro-panel B_r within the buffer B_c is arranged in row-major order, as expected by the BLIS micro-kernel; see Figure 2. This is attained by viewing B_c as a (k_c · n_r) × (n_c/n_r) buffer, where each column contains an entire micro-panel in row-major order.

Let us consider the overhead introduced by the data copies necessary to perform the packing. Consider, for example, the packing for B_c. In principle, packing this buffer requires k_c · n_c memory accesses, to read the elements of matrix B (from the memory) and write them into the appropriate positions of the buffer (in principle, in the L3 cache, if there is one). Each buffer is then re-utilized for the (floating-point) operations embraced by Loop L3 of the gemm kernel (see Figure 1), which amount to
  (m/m_c) · (n_c/n_r) · (m_c/m_r) · (2 · m_r · n_r · k_c) = 2 · m · n_c · k_c flops.
Thus, provided m is large enough, the cost of the packing for B_c is negligible compared with the amount of flops performed inside Loop L3. A similar reasoning applies to the overhead due to the packing for A_c.
As we will expose in the next section, the packing routines are particularly important for our implementation of the convolution operator.
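A possible C encoding of the packing in Figure 3 is sketched next; the routine name and the use of a single linear offset into B_c (instead of the two-dimensional view adopted in the figure) are merely illustrative, and n_c is assumed to be a multiple of n_r:

  /* Sketch of the packing of a kc x nc block of B (column-major, leading
   * dimension ldb) into the buffer Bc, arranged as consecutive kc x nr
   * micro-panels, each stored in row-major order (see Figures 2 and 3).
   * pc, jc give the top-left corner of the block within B.            */
  static void pack_B(int kc, int nc, int nr,
                     const float *B, int ldb, int pc, int jc,
                     float *Bc)
  {
    int i = 0;                              /* linear offset into Bc      */
    for (int jr = 0; jr < nc; jr += nr)     /* one micro-panel per jr     */
      for (int ps = 0; ps < kc; ps++)       /* row within the micro-panel */
        for (int js = 0; js < nr; js++)     /* column within the micro-panel */
          Bc[i++] = B[(pc + ps) + (size_t)(jc + jr + js)*ldb];
  }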
3 im2col + gemm

Convolution operator.
Consider a conv layer, appearing during inference with a DNN model, that comprises a convolution operator consisting of k_n filters (or kernels) of dimension k_h × k_w × c_i each. Assume the layer receives b tensor inputs of dimension h_i × w_i × c_i each, and produces b tensor outputs of size h_o × w_o × k_n each. (The parameter b is also often referred to as the batch size.) Then, each of the k_n individual filters in this layer combines a (sub)tensor of the inputs, with the same dimension as the filter, to produce a single scalar value (entry) in one of the k_n outputs. By repeatedly applying the filter to the whole input, in a sliding window manner (with a certain stride s), the convolution operator produces the complete entries of this single output; see [33]. Assuming a padding p along dimensions h_i and w_i, the output dimensions become h_o = ⌊(h_i − k_h + 2p)/s + 1⌋ and w_o = ⌊(w_i − k_w + 2p)/s + 1⌋. (For instance, with h_i = 224, k_h = 3, p = 1 and s = 1, this yields h_o = 224.)
The algorithm in Figure 4 provides a direct realization of a convolution operator O = conv(F, I), where I → h_i × w_i × c_i × b corresponds to the input tensor, F → k_n × k_h × k_w × c_i denotes the filters, and O → k_n × h_o × w_o × b is the output tensor.

L1: for i_b = 0, ..., b−1
L2:   for i_c = 0, ..., c_i−1
L3:     for i_w = 0, ..., w_o−1
L4:       for i_h = 0, ..., h_o−1
L5:         for i_kw = 0, ..., k_w−1
L6:           for i_kh = 0, ..., k_h−1
L7:             for i_k = 0, ..., k_n−1
                  O[i_k][i_h][i_w][i_b] += F[i_k][i_kh][i_kw][i_c] · I[i_h·s + i_kh][i_w·s + i_kw][i_c][i_b]

Figure 4: Direct algorithm for the application of the convolution operator O = conv(F, I).

L1: for i_b = 0, ..., b−1
L2:   for i_c = 0, ..., c_i−1
L3:     for i_w = 0, ..., w_o−1
L4:       for i_h = 0, ..., h_o−1
            c = i_h + i_w·h_o + i_b·w_o·h_o
L5:         for i_kw = 0, ..., k_w−1
L6:           for i_kh = 0, ..., k_h−1
                r = i_kh + i_kw·k_h + i_c·k_w·k_h
                B̂[r][c] = I[i_h·s + i_kh][i_w·s + i_kw][i_c][i_b]

Figure 5: Algorithm for the im2col transformation. The actual implementation moves some of the loop invariants inside Loops L4 and L6 to reduce the indexing arithmetic overhead. For simplicity, this is not shown in the algorithm.
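For concreteness, a C sketch of the explicit im2col transform in Figure 5 follows, assuming zero padding (p = 0) and the tensor layout described next (leftmost index fastest); the routine name is merely illustrative:

  /* Sketch of the explicit im2col transform of Figure 5 (padding p = 0).
   * I   : input tensor of dimension hi x wi x ci x b, leftmost index fastest,
   *       i.e., entry (x, y, c, n) lives at x + y*hi + c*hi*wi + n*hi*wi*ci.
   * Bhat: output matrix of size (kh*kw*ci) x (ho*wo*b), stored column-major. */
  static void im2col(int hi, int wi, int ci, int b,
                     int kh, int kw, int ho, int wo, int s,
                     const float *I, float *Bhat)
  {
    const int ldb = kh * kw * ci;                   /* rows of Bhat   */
    for (int ib = 0; ib < b; ib++)
      for (int ic = 0; ic < ci; ic++)
        for (int iw = 0; iw < wo; iw++)
          for (int ih = 0; ih < ho; ih++) {
            int col = ih + iw*ho + ib*wo*ho;        /* column of Bhat */
            for (int ikw = 0; ikw < kw; ikw++)
              for (int ikh = 0; ikh < kh; ikh++) {
                int row = ikh + ikw*kh + ic*kw*kh;  /* row of Bhat    */
                Bhat[row + (size_t)col*ldb] =
                  I[(ih*s + ikh) + (size_t)(iw*s + ikw)*hi
                    + (size_t)ic*hi*wi + (size_t)ib*hi*wi*ci];
              }
          }
  }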
Tensor data storage.
A tensor generalizes the concept of a matrix to that of a multidimensional array. Note, though, that from the physical point of view the tensor entries are still arranged as a linear array in memory. Here, we generalize the Fortran convention of column-major order to consider that, unless explicitly stated otherwise, the entries of the tensors are stored in consecutive positions in memory starting from the leftmost indices. This implies that, for example, if the tensor O → k_n × h_o × w_o × b is stored into a 4D array O[k_n][h_o][w_o][b], then its entries are consecutively arranged in memory as
  O[0][0][0][0], O[1][0][0][0], ..., O[k_n−1][0][0][0],
  O[0][1][0][0], O[1][1][0][0], ..., O[k_n−1][1][0][0], ...,
  O[0][h_o−1][0][0], O[1][h_o−1][0][0], ..., O[k_n−1][h_o−1][0][0],
  O[0][0][1][0], O[1][0][1][0], ..., O[k_n−1][0][1][0], ...,
  O[k_n−1][h_o−1][w_o−1][b−1].
Indirect convolution and the im2col transform.
On modern computer architectures, the performance of the direct realization of the convolution operator given in Figure 4 is limited by the memory bandwidth and, therefore, delivers only a fraction of the processor peak floating-point throughput. In practice, higher performance can be attained via an indirect (or gemm-based) approach that casts this operator in terms of a matrix multiplication via the im2col transform [8]. Concretely, the algorithm in Figure 5 shows how to transform the input tensor I into an augmented matrix B̂. With this transform, the output of the application of the convolution can be simply obtained from the gemm Ĉ = Â · B̂, where Ĉ ≡ O → k_n × (h_o · w_o · b) is the output tensor (viewed as an m × n matrix, with m = k_n and n = h_o · w_o · b); Â ≡ F → k_n × (k_h · k_w · c_i) contains the kernels; and B̂ → (k_h · k_w · c_i) × (h_o · w_o · b) is the result from applying the im2col transform to the input tensor I.
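Assuming a column-major BLAS-compatible interface (for instance, the CBLAS layer that can be enabled in BLIS), the explicit two-stage approach then reduces to a single call to sgemm on the augmented matrix; the routine and array names below are merely illustrative:

  #include <cblas.h>

  /* Explicit indirect convolution: Bhat = im2col(I) has already been
   * assembled, Ahat holds the filters F viewed as a kn x (kh*kw*ci) matrix,
   * and Chat (the output tensor O viewed as a matrix) is overwritten with
   * Chat = Ahat * Bhat. All matrices are stored in column-major order.   */
  void conv_im2col_gemm(int kn, int kh, int kw, int ci,
                        int ho, int wo, int b,
                        const float *Ahat, const float *Bhat, float *Chat)
  {
    const int m = kn;             /* rows of Chat and Ahat        */
    const int n = ho * wo * b;    /* columns of Chat and Bhat     */
    const int k = kh * kw * ci;   /* columns of Ahat, rows of Bhat */
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0f, Ahat, m,    /* alpha, A, lda */
                      Bhat, k,    /*        B, ldb */
                0.0f, Chat, m);   /* beta,  C, ldc */
  }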
4 Optimized Indirect Convolutions via Integration of im2col into gemm

There are two problems with the indirect (two-stage) procedure described in Section 3 that performs the convolution as a sequence of an explicit im2col transform followed by a call to the gemm kernel:
P1.
Starting from an input tensor I of dimension h_i × w_i × c_i × b, the im2col transform creates an augmented matrix B̂ of size (k_h · k_w · c_i) × (h_o · w_o · b). Assuming h_i, w_i ≈ h_o, w_o, this requires a workspace that is k_h · k_w times larger than the original input tensor. For current CNNs, with many layers, even when using small 3 × 3 filters (which already multiply the input storage by 9), this extra workspace can become a serious constraint.
P2.
On modern high performance processors, when using a realization of the gemm kernel that is highly optimized, the overhead due to the copy and replication required by the im2col transform in general becomes "visible" and reduces the performance of the global (explicit) im2col + BLIS gemm process.
To tackle both problems, we propose a solution that integrates the im2col transform into the packing of B̂ onto the buffer B_c. In other words, during the execution of the gemm kernel, the buffer B_c is directly assembled from the contents of the input tensor I (instead of using the augmented matrix B̂, which is never created). In the following, we will refer to our solution as an indirect convolution via a convgemm operator.
We can now justify the contributions listed in the introduction of this work (see Section 1):
Reduced workspace.
We avoid the use of the large workspace present in the two-step procedure (problem P1), as the only "additional" storage that is needed is the buffer for B_c, which is already necessary in the BLIS gemm kernel.
High performance.
Furthermore, as argued during the discussion of the packing in Section 2, the memory access cost introduced by the packing of B_c is well amortized with the flops that are performed in the innermost loops and, therefore, the overhead can be considered negligible (problem P2).
Portability.
The approach has the additional advantage that the only change that is needed to the BLIS gemm is to replace the original packing routine with a procedure that reads (and packs) the second input operand to the matrix multiplication directly from the input tensor. There is no need to modify the routine that performs the packing of Â. More importantly, there is no need to change the micro-kernel, which enhances the portability of our solution: the only part that is different is written in C and depends on a small number of architecture-dependent parameters that are adjusted during the process of porting BLIS. The parameters that define the filter dimensions are "embedded" within the dimensions of the resulting matrix and, therefore, require no specific optimization.
The algorithm in Figure 6 illustrates how to pack the corresponding entries of the input tensor I into the buffer B_c during the execution of the BLIS gemm kernel in Figure 1 while, simultaneously, performing the implicit im2col transform. The algorithm packs the k_c × n_c block of matrix B̂ starting at row p_c and column j_c into the buffer B_c, reading the corresponding entries directly from the input tensor I. As a result, the output matrix comprises the sought-after convolution:
  O = conv(F, I) ≡ Ĉ = Â · B̂ ≡ Ĉ = Â · im2col(I),
where Ĉ ≡ O and Â ≡ F.

L1: for j_r = 0, ..., n_c−1 in steps of n_r
      i = 0
L2:   for p_s = 0, ..., k_c−1
        i_c  = (p_c + p_s) / (k_h · k_w)
        i_kw = ((p_c + p_s) mod (k_h · k_w)) / k_h
        i_kh = ((p_c + p_s) mod (k_h · k_w)) mod k_h
L3:     for j_s = 0, ..., n_r−1
          i_b = (j_c + j_r + j_s) / (h_o · w_o)
          i_w = ((j_c + j_r + j_s) mod (h_o · w_o)) / h_o
          i_h = ((j_c + j_r + j_s) mod (h_o · w_o)) mod h_o
          B_c[i][j_r] = I[i_kh + i_h·s][i_kw + i_w·s][i_c][i_b]
          i = i + 1

Figure 6: Algorithm for packing I into B_c. The indices p_c and j_c correspond to the coordinates of the top-left entry of the block of matrix B̂ that is packed; see Figure 1.

The actual implementation of this algorithm eliminates some of the loop invariants and integer arithmetic to reduce the overhead. Concretely, the computation of the indices i_c, i_kw, i_kh, i_b, i_w, i_h is performed outside the loops and then properly updated during the iterations to avoid the high cost of the integer divisions and modulo operations (that is, the remainder of the integer division, abbreviated in the presentation as mod). The algorithm is shown in this basic form to improve readability.
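A C sketch of the packing in Figure 6, in this same basic (unoptimized) form and under the assumptions of the previous sketches (zero padding, leftmost index fastest for the tensor, merely illustrative routine name), is the following:

  /* Sketch of the convgemm packing (Figure 6): pack the kc x nc block of the
   * implicit matrix Bhat = im2col(I) whose top-left entry is (pc, jc) directly
   * from the input tensor I into the buffer Bc, so that Bhat is never built.
   * I is stored with the leftmost index fastest (entry (x, y, c, n) at
   * x + y*hi + c*hi*wi + n*hi*wi*ci); Bc holds consecutive kc x nr row-major
   * micro-panels, as in Figure 3. Padding p = 0 is assumed.               */
  static void pack_I_convgemm(int kc, int nc, int nr, int pc, int jc,
                              int hi, int wi, int ci,
                              int kh, int kw, int ho, int wo, int s,
                              const float *I, float *Bc)
  {
    int i = 0;                                  /* linear offset into Bc     */
    for (int jr = 0; jr < nc; jr += nr)         /* Loop L1                   */
      for (int ps = 0; ps < kc; ps++) {         /* Loop L2                   */
        int row = pc + ps;                      /* row of the implicit Bhat  */
        int icc = row / (kh * kw);              /* channel index i_c         */
        int ikw = (row % (kh * kw)) / kh;
        int ikh = (row % (kh * kw)) % kh;
        for (int js = 0; js < nr; js++) {       /* Loop L3                   */
          int col = jc + jr + js;               /* column of the implicit Bhat */
          int ib = col / (ho * wo);
          int iw = (col % (ho * wo)) / ho;
          int ih = (col % (ho * wo)) % ho;
          Bc[i++] = I[(ikh + ih*s) + (size_t)(ikw + iw*s)*hi
                      + (size_t)icc*hi*wi + (size_t)ib*hi*wi*ci];
        }
      }
  }

Replacing the standard packing of B in Figure 1 with this routine, while leaving the packing of Â and the micro-kernel untouched, yields the convgemm operator: the micro-kernel consumes B_c exactly as in the regular gemm, which is what preserves the performance and portability of the underlying BLIS realization.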
5 Experimental Results

In this section, we assess the performance of our convgemm approach (that is, im2col integrated into the BLIS gemm) against the baseline counterpart that explicitly assembles the extended input activation matrix and then performs the augmented gemm. As described next, for this evaluation we target a high performance ARM processor present in a low-power embedded system, and perform the analysis by simulating the inference stage of three representative state-of-the-art CNNs. The source of all codes employed for the evaluation, including the convgemm implementation, is publicly available in a git repository [16].

The evaluation presented in this paper was executed on an NVIDIA Jetson TX2 [22] platform, which integrates an ARM quad-core Cortex-A57, an NVIDIA dual-core Denver, an NVIDIA 256-CUDA core Pascal GPU, and 8 GiB of main memory. The results reported next were obtained on the ARM Cortex-A57 only, due to the wide spread of this architecture and the availability of optimized high performance linear algebra libraries for this processor. On the software side, the experiments were conducted using the Linux distribution Ubuntu 18.04.4, the GNU compiler gcc 7.5.0, and BLIS 0.6.0.
As the evaluation targets inference with CNNs, all the experiments employed (IEEE) single precision arithmetic. In general, the inference process does not benefit from the use of double precision arithmetic, and a reduced precision format (single or half precision floating point, or even fixed point) is often preferred in order to improve performance and/or reduce energy consumption. BLIS provides a single-precision instance of the BLAS optimized for the ARM Cortex-A57 which features an optimized micro-kernel with m_r × n_r = 8 ×
12, and sets the following cache configuration values: n_c = 3072, k_c = 640, and m_c = 120. The algorithm parallelizes loop L4 of Figure 1 and the outermost loop of the packing of A using OpenMP [30]. For the convgemm, we also parallelize loop L1 of Figure 6. The counterpart with an explicit im2col parallelizes loop L2 of Figure 5.

In order to tackle the complex software stack required for executing CNNs, we have employed an inference simulator that performs the major computational stages of the convolutional layers encountered during the inference of CNN models. For the baseline case, we emulate this behavior by executing a sequence of explicit im2col + gemm pairs, of the dimensions appearing in consecutive layers of the neural network. Our optimized alternative instead executes the specialized convgemm kernel (of the dimensions dictated by the CNN model). In both cases, the simulator reads the CNN configuration parameters for a certain model from an input file, accepting the batch size (number of input samples simultaneously processed per inference process) as an additional parameter. The simulator then allocates memory buffers for all required matrices using the maximum size of each matrix from among the matrix sizes required by each layer in the model, and performs a full model evaluation for each batch size in the specified range. During inference, the output of a certain layer is basically the input data of the next layer. Our code mimics this behaviour by using buffer swapping. In this way, we simulate more accurately the real data movements that take place across the cache hierarchy during the inference stage.
The simulator repeatedly executes the computational operations till a certain time threshold is attained, and then divides the total wall-time by the number of repetitions to avoid system load variability in the measurements.

We have applied the simulator to study the benefits of the optimized indirect convgemm algorithm using three representative CNN models: AlexNet [24], VGG16 [31], and ResNet50 [20]. The former model was selected because of its simplicity, which facilitates an easier interpretation of the results. The remaining two models were chosen because of their more complex structures and notable computational requirements. Table 1 summarizes the number and type of layers for each model as well as the extra memory consumption required by the explicit im2col transform. This latter parameter represents the maximum memory needed to hold the largest intermediate matrix assembled by the explicit im2col transform when executing each model. This is a key parameter because it may constrain the use of the explicit im2col + gemm approach for many CNN model+platform pairs due to insufficient memory capacity. Remember that our optimized algorithm with convgemm saves this extra space by avoiding the explicit creation of the intermediate matrices. The models adhere to the specifications defined in Google's TensorFlow benchmarks suite.

Model      fc   conv   Pool   Total   Memory consumption for im2col (MiB)
AlexNet     3      5      3      11   ≈ 15 · b
ResNet50    1     53      1      55   ≈ 13 · b
VGG16       3     13      5      21   ≈ 110 · b

Table 1: Number and type of layers in the target CNN models and memory required by the explicit im2col transform as a function of the batch size b.
Table 2: Configuration of the conv layers appearing in the AlexNet CNN model as a function of the batch size b, listing for each of the five conv layers the dimensions of the input (h_i × w_i × c_i × b), of the kernels (k_n × k_h × k_w × c_i), and of the associated gemm (m × n × k).

Table 2 details the configuration of the conv layers for the AlexNet model. Concretely, the table displays the number of neurons (represented by the dimensions of the input data); the kernel specifications (number of kernels, their height and width, and their number of input channels); and the dimensions of the gemm product, when applying the indirect convolution, for each layer of that type.

In this subsection we report the results obtained with the simulator applied to simulate the inference process for the three selected CNN models. In these experiments, we compare the execution time of the models with either 1) an im2col operation followed by the gemm on the augmented matrix (explicit im2col + gemm); or 2) an im2col performed on-the-fly with the gemm (referred to as convgemm). To better understand the source of the observed differences, in the comparison we also include 3) the cost of the gemm operations without (the overhead caused by) the im2col transforms; and 4) the separate cost of the latter. Note that, as our ultimate goal is to hide completely the cost of the im2col transform inside the gemm operation, the performance reference for our convgemm routine is to match the execution time/performance rate of the standalone gemm kernel.
Figures 7 and 8 show the time and performance (in GFLOPS, or billions of floating-point operations per second) obtained for the evaluated models executed using a single core and the full 4-core processor, respectively. The plots display the execution time/performance attained for a range of batch sizes, for the optimized convgemm algorithm against the baseline approach (explicit im2col + gemm). In addition, all plots include the execution time/performance attained by the gemm kernels involved in the model simulation, and the plots in the left-hand side include the time overhead required to perform the im2col transforms.
For the AlexNet and ResNet50 models, the experiments are run up to a batch size b = 80, while for VGG16 the largest value for this parameter is only b = 72. This is due to the large amount of memory required for the intermediate matrices assembled by the im2col transform, which exceeds the memory capacity of the device (8 GiB) for the VGG16 model when b = 80.
The results in Figures 7 and 8 demonstrate that our technique with an integrated im2col fully hides the cost of this transform for the AlexNet network, delivering the same execution time and GFLOPS rate observed when executing only the gemm operations. When we tackle the two remaining (more complex) CNN models, the cost and performance of the optimized algorithm still remain close to those of the standalone gemm operation, while clearly outperforming the explicit im2col + gemm counterpart.
Figure 7: Execution time (left column) and performance (right column) obtained by the indirect convolution algorithms for AlexNet (top row), ResNet50 (middle row) and VGG16 (bottom row) on a single ARM Cortex-A57 core.

Figure 8: Execution time (left column) and performance (right column) obtained by the indirect convolution algorithms for AlexNet (top row), ResNet50 (middle row) and VGG16 (bottom row) using all four ARM Cortex-A57 cores.

Figure 9: Execution time per layer obtained by the indirect convolution algorithms for AlexNet (left) and VGG16 (right) using all four ARM Cortex-A57 cores and a batch size b = 32.

There is a particular case worth discussing in some detail. Concretely, for the explicit im2col + gemm approach, Figures 7f and 8f both show a marked decrease in performance for VGG16 when b >
48. This decline is caused by the large size of the intermediate matrices, which results in I/O swapping to disk. The negative effect on performance is more pronounced in the multicore experiment, as in this case the memory access patterns performed during the explicit im2col transform are more scattered, increasing the effect of the swapping to disk.
The observed negative effect on performance for large batch sizes and complex network models demonstrates that the optimized convgemm algorithm, with an embedded im2col, not only makes it possible to perform the inference process for network models that cannot be tackled by the explicit im2col + gemm, but also avoids the efficiency pitfalls due to the earlier use of disk I/O in that approach.
To close the experimental analysis, Figure 9 reports the execution time to compute the convolutions required at each CNN layer in the AlexNet and VGG16 models. The plots there illustrate that the time required per layer varies significantly between different layers.

6 Concluding Remarks

This work introduces a new convolution algorithm that outperforms the straightforward im2col + gemm approach in several aspects. First, the new convgemm algorithm removes the need for the additional memory workspace utilized by the im2col + gemm approach, enabling inference with large CNN models in memory bound systems. In addition, the realization of the new scheme in combination with the BLIS kernel for gemm yields an efficient and portable implementation that can be migrated to other low-power architectures for which an optimized implementation of the BLIS micro-kernel exists (or can be developed).
The results of the experimental evaluation performed in this work show the remarkable performance advantage of the new convgemm scheme on a representative low-power ARM-based multicore processor, which completely eliminates the workspace and performance overheads due to the utilization of an explicit im2col transform.

Acknowledgements

This research was partially sponsored by projects TIN2017-82972-R of
Ministerio de Ciencia, Innovación y Universidades and Prometeo/2019/109 of the
Generalitat Valenciana.

References

[1] oneAPI deep neural network library (oneDNN): Performance library for deep learning, 2018. Formerly known as Intel Math Kernel Library for Deep Neural Networks (Intel MKL-DNN) and Deep Neural Network Library (DNNL). Available from https://oneapi-src.github.io/oneDNN/.
[2] Deep learning SDK documentation: cuDNN developer guide, 2020. Available from https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html.
[3] Mobile AI Compute Engine documentation, 2020. Available from https://mace.readthedocs.io/en/latest/.
[4] NNPACK: Acceleration package for neural networks on multi-core CPUs, 2020. Available from https://github.com/Maratyszcza/NNPACK.
[5] Andrew Anderson et al. Low-memory GEMM-based convolution algorithms for deep neural networks. CoRR, abs/1709.03395, 2017.
[6] Tal Ben-Nun and Torsten Hoefler. Demystifying parallel and distributed deep learning: An in-depth concurrency analysis. ACM Computing Surveys, 52(4):65:1–65:43, August 2019.
[7] Sandra Catalán, Francisco D. Igual, Rafael Mayo, Rafael Rodríguez-Sánchez, and Enrique S. Quintana-Ortí. Architecture-aware configuration and scheduling of matrix multiplication on asymmetric multicore processors. Cluster Computing, 19(3):1037–1051, 2016.
[8] Kumar Chellapilla, Sidd Puri, and Patrice Simard. High performance convolutional neural networks for document processing. In International Workshop on Frontiers in Handwriting Recognition, 2006. Available as INRIA report inria-00112631 from https://hal.inria.fr/inria-00112631.
[9] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cuDNN: Efficient primitives for deep learning, 2014. arXiv preprint 1410.0759. Available from https://arxiv.org/abs/1410.0759.
[10] Minsik Cho and Daniel Brand. MEC: Memory-efficient convolution for deep neural network. In Proceedings of the 34th Int. Conference on Machine Learning – PMLR, volume 70, pages 815–824, 2017.
[11] Li Deng, Jinyu Li, Jui-Ting Huang, Kaisheng Yao, Dong Yu, Frank Seide, Mike Seltzer, Geoff Zweig, Xiaodong He, Jason Williams, Yifan Gong, and Alex Acero. Recent advances in deep learning for speech research at Microsoft. In Proc. IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8604–8608, May 2013.
[12] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain Duff. A set of level 3 basic linear algebra subprograms. ACM Trans. on Mathematical Software, 16(1):1–17, March 1990.
[13] Marat Dukhan. The indirect convolution algorithm. CoRR, abs/1907.02129, 2019. Available from https://arxiv.org/abs/1907.02129.
[14] Marat Dukhan, Yiming Wu, and Hao Lu. QNNPACK: open source library for optimized mobile deep learning, 2020. Available from https://code.fb.com/ml-applications/qnnpack/.
[15] Evangelos Georganas, Sasikanth Avancha, Kunal Banerjee, Dhiraj Kalamkar, Greg Henry, Hans Pabst, and Alexander Heinecke. Anatomy of high-performance deep learning convolutions on SIMD architectures. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 18, pages 66:1–66:12. IEEE Press, 2018.
[16] Source code repository. https://gitlab.com/comtacts/convgemm, 2020.
[17] Kazushige Goto and Robert van de Geijn. High performance implementation of the level-3 BLAS. ACM Trans. on Mathematical Software, 35(1):4:1–4:14, July 2008.
[18] Kazushige Goto and Robert A. van de Geijn. Anatomy of a high-performance matrix multiplication. ACM Trans. on Mathematical Software, 34(3):12:1–12:25, May 2008.
[19] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding, 2015. arXiv preprint 1510.00149. Available from https://arxiv.org/abs/1510.00149.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[21] Greg Henry. BLAS based on block data structures. Theory Center Technical Report CTC92TR89, Advanced Computing Research Institute, Cornell University, 1992.
[22] NVIDIA Jetson TX2, 2020.
[23] Jintao Ke, Hai Yang, Hongyu Zheng, Xiqun Chen, Yitian Jia, Pinghua Gong, and Jieping Ye. Hexagon-based convolutional neural network for supply-demand forecasting of ride-sourcing services. IEEE Trans. on Intelligent Transportation Systems, 20(11):4160–4173, 2019.
[24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems – Volume 1, NIPS'12, pages 1097–1105, USA, 2012. Curran Associates Inc.
[25] Andrew Lavin and Scott Gray. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4013–4021, 2016.
[26] Tze Meng Low, Francisco D. Igual, Tyler M. Smith, and Enrique S. Quintana-Ortí. Analytical modeling is enough for high-performance BLIS. ACM Trans. on Mathematical Software, 43(2):12:1–12:18, August 2016.
[27] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings European Conference on Computer Vision – ECCV 2018, Lecture Notes in Computer Science, volume 11218, pages 122–138, 2018.
[28] Maryam M. Najafabadi, Flavio Villanustre, Taghi M. Khoshgoftaar, Naeem Seliya, Randall Wald, and Edin Muharemagic. Deep learning applications and challenges in big data analytics. Journal of Big Data, 2(1):1, Feb 2015.
[29] OpenBLAS, 2015.
[30] OpenMP Architecture Review Board. OpenMP application program interface version 3.0, May 2008.
[31] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[32] Tyler M. Smith, Robert van de Geijn, Mikhail Smelyanskiy, Jeff R. Hammond, and Field G. Van Zee. Anatomy of high-performance many-threaded matrix multiplication. In Proc. IEEE 28th Int. Parallel and Distributed Processing Symp., IPDPS'14, pages 1049–1059, 2014.
[33] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer. Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE, 105(12):2295–2329, Dec 2017.
[34] Field G. Van Zee and Robert A. van de Geijn. BLIS: A framework for rapidly instantiating BLAS functionality. ACM Trans. on Mathematical Software, 41(3):14:1–14:33, 2015.
[35] R. Clint Whaley and Jack J. Dongarra. Automatically tuned linear algebra software. In Proceedings of the 1998 ACM/IEEE Conference on Supercomputing, SC 98, page 127, USA, 1998. IEEE Computer Society.
[36] Field G. Van Zee, Tyler M. Smith, Bryan Marker, Tze Meng Low, Robert A. Van De Geijn, Francisco D. Igual, Mikhail Smelyanskiy, Xianyi Zhang, Michael Kistler, Vernon Austel, John A. Gunnels, and Lee Killough. The BLIS framework: Experiments in portability. ACM Trans. on Mathematical Software, 42(2):12:1–12:19, June 2016.
[37] Jiajun Zhang and Chengqing Zong. Deep neural networks in machine translation: An overview. IEEE Intelligent Systems, 30(5):16–25, Sep. 2015.
[38] Jiyuan Zhang, Franz Franchetti, and Tze Meng Low. High performance zero-memory overhead direct convolutions. In Proceedings of the 35th International Conference on Machine Learning – PMLR, volume 80, 2018.
[39] Yulin Zhao, Donghui Wang, Leiou Wang, and Peng Liu. A faster algorithm for reducing the computational complexity of convolutional neural networks.