UNIT: Unifying Tensorized Instruction Compilation
Jian Weng*†, Animesh Jain†, Jie Wang*†, Leyuan Wang†, Yida Wang†, Tony Nowatzki*
* University of California, Los Angeles, USA    † Amazon Web Services, USA
{jian.weng,jiewang,tjn}@cs.ucla.edu    {janimesh,wangleyu,wangyida}@amazon.com
Work done during Jian and Jie's internship at AWS.
Abstract—Because of the increasing demand for intensive computation in deep neural networks, researchers have developed both hardware and software mechanisms to reduce the compute and memory burden. A widely adopted approach is to use mixed precision data types. However, it is hard to benefit from mixed precision without hardware specialization because of the overhead of data casting. Recently, hardware vendors offer tensorized instructions specialized for mixed-precision tensor operations, such as Intel VNNI, Nvidia Tensor Core, and ARM DOT. These instructions involve a new computing idiom, which reduces multiple low precision elements into one high precision element. The lack of compilation techniques for this emerging idiom makes it hard to utilize these instructions. In practice, one approach is to use vendor-provided libraries for computationally-intensive kernels, but this is inflexible and prevents further optimizations. Another approach is to manually write hardware intrinsics, which is error-prone and difficult for programmers. Some prior works tried to address this problem by creating compilers for each instruction. This requires excessive effort when it comes to many tensorized instructions.

In this work, we develop a compiler framework, UNIT, to unify the compilation for tensorized instructions. The key to this approach is a unified semantics abstraction which makes the integration of new instructions easy, and the reuse of the analysis and transformations possible. Tensorized instructions from different platforms can be compiled via UNIT with moderate effort for favorable performance. Given a tensorized instruction and a tensor operation, UNIT automatically detects the applicability of the instruction, transforms the loop organization of the operation, and rewrites the loop body to take advantage of the tensorized instruction. According to our evaluation, UNIT is able to target various mainstream hardware platforms. The generated end-to-end inference model achieves 1.3× speedup over Intel oneDNN on an x86 CPU, 1.75× speedup over Nvidia cuDNN on an Nvidia GPU, and 1.13× speedup over a carefully tuned TVM solution for ARM DOT on an ARM CPU.

I. INTRODUCTION
Dense tensor operations like matrix multiplication (Matmul) and convolution (Conv) have long been the workhorses in many domains, including deep learning workloads [14]. The popularity of deep learning means that aggressively optimizing these operations has a high payoff. Essentially, Matmul and Conv are a series of multiply-accumulate (MAC) operations, which perform accumulation over a number of elementwise multiplications.

To capture the reduction behavior and perform it more efficiently, recent general-purpose processors offer native tensor operation specialized instructions (hereinafter referred to as tensorized instructions), like Intel VNNI [2], Nvidia Tensor Core [5], and ARM DOT [1]. Unlike conventional SIMD instructions, after performing elementwise arithmetic operations, these instructions introduce a "horizontal computation" to accumulate the elementwise results. Further, tensorized instructions are often mixed-precision, meaning that the elementwise operations use less precise, lower-bitwidth operands (e.g., fp16 and int8), while accumulation occurs at higher bitwidth, where it is needed. This offers a good balance between data width and precision that is generally sufficient for deep learning workloads [24], [18], and enables the use of quantized data types.

Mixed precision is difficult to express in a single SIMD instruction, because the output vector width is different than the input vector width. In most ISAs this paradigm requires multiple SIMD instructions to express. In a tensorized instruction, by definition there are fewer outputs, so allocating more bitwidth to them in the output vector is natural. In addition, tensorized instructions sometimes reuse the same inputs multiple times, which reduces the required register file bandwidth. Overall, tensorized instructions offer significant advantages over SIMD for executing MACs.

While promising, the absence of appropriate compilation techniques limits the applicability of these tensorized instructions. Conventional SIMD instructions are vector instructions, so industry-standard compilers only try parallelizing the innermost loops. In addition, it is difficult for the high-level language programmer to express the compute flow in a tensorization-friendly way and hint the compiler to try tensorization upon a loop nest, because the dependency of reduction is more complicated and error-prone.

In practice, there are normally two options to leverage tensorized instructions. One way is to call the vendor-provided libraries such as Intel oneDNN [6], Nvidia cuBLAS and cuDNN [4], which provide highly optimized performance in some pre-defined single kernels using tensorized instructions [17], [42]. However, this also brings inflexibility when it comes to new workloads or when further performance exploitation is desired. The other option is to manually write assembly intrinsics, which sets a high bar for ordinary developers and hence lacks productivity. Some prior works tried to solve this problem by developing a compiler [35], [36] for each instruction. This requires too much effort when there are many tensorized instructions, both within and across hardware platforms.
Our Goal:
Although different processors may provide different tensorized instructions, in the context of deep learning workloads, we observe that these instructions essentially handle a similar compute pattern, i.e., elementwise multiplication and then horizontal accumulation. They primarily differ in the number of elementwise computation lanes and the accepted data types. Therefore, we aim to develop a unified approach to compile these tensorized instructions on multiple platforms to optimize the tensor operations in deep learning workloads. Our techniques are extensible to tensorized instructions with other data types and operations as well.
Challenges: There are several challenges to attain a unified compilation pipeline:
• Instruction Integration: Instead of building a new specialized compiler for each new instruction, it is desirable to create a unified and extensible compilation flow;
• Detecting the applicability: Given a tensorized instruction, a first question is whether and how this instruction can be applied to the target tensor operation, which may require loop reorganization to make it applicable;
• Code rewriting: When applicable, the compiler must determine how the loops involved should be rewritten by the tensorized instruction, and how the loops should be rearranged to achieve high performance.
Our Insight:
We envision that the key to addressing these three challenges is to have a unified semantics abstraction for tensorized instructions so that the analysis and transformation can also be unified.

This paper presents UNIT, an end-to-end compilation pipeline to surmount the above three challenges. UNIT takes the tensorized instructions (e.g., Intel VNNI instructions on CPUs, or Nvidia Tensor Core instructions on GPUs) and a deep learning model as input, lowers the tensor operations of the model into loop-based IRs to identify the tensorizable components, and inserts the tensorized instructions by transforming and rewriting the loop. It achieves high performance for tensor operations, and consequently, model inference. To the best of our knowledge, this is the first work to tackle tensorized instruction compilation and optimization with a unified solution. UNIT not only achieves high performance for single tensor operations, but also provides desirable model inference latency in practice.
Key Results:
According to our evaluation, UNIT is expressive enough to target many tensorized instructions on multiple hardware platforms, including Intel VNNI, Nvidia Tensor Core, and ARM DOT. The generated programs for end-to-end model inference are 1.3× and 1.75× faster than the solutions backed by Intel oneDNN and Nvidia cuDNN on CPU and GPU, respectively. In addition, UNIT can be extended to new tensorized instructions with moderate effort. Although we designed UNIT to target Intel CPUs and Nvidia GPUs, on an ARM Cortex A-72 CPU with DOT instructions, UNIT achieves up to 1.13× speedup against a carefully manually tuned solution.

To sum up, our contribution is an end-to-end compilation pipeline of tensorized instructions for deep learning workloads, which includes:
• A unified abstraction for tensorized instructions.
• An algorithm that detects the applicability of these tensorized instructions.
• A rewriting and tuning mechanism that looks for favorable loop transformations of the tensor operations to plug in the tensorized instructions for high performance.

Fig. 1. Performance comparison on Nvidia V100-SXM2 between fp32 and fp16 without mixed precision instruction support.
Paper Organization:
We first introduce the background and challenges of tensorized compilation in Section II. The design of UNIT is presented in Section III. We explain the implementation details in Section IV. We clarify our experiment methodology in Section V, and evaluate our work in Section VI. Finally, we discuss the related work in Section VII.

II. BACKGROUND
UNIT is an end-to-end compilation pipeline capable of automatically mapping tensorized instructions to deep learning tensor operations. It defines the tensorized instruction's semantics using a suitable intermediate representation (IR) and inserts them in the proper places of the program of tensor operations. In this section, we give an overview of popular mixed precision tensorized instructions, followed by the limitations of existing solutions in the automatic mapping of these tensorized instructions. Finally, we discuss the background of the tensor domain specific language and the multi-level intermediate representation.
A. Mixed Precision Tensorized Instructions
Deep learning is computationally expensive, requiring substantial compute and memory resources. As deep learning becomes more pervasive, researchers are designing both software and hardware techniques to reduce the compute and memory burden. A widely adopted approach in this context is using mixed precision for expensive operations, e.g., convolution or dense operations [24], [18]. In practice, this means representing 32-bit floating point (fp32) operands with a lower bitwidth datatype - 16-bit floating point numbers (fp16) or 8/16-bit integer numbers (int8, int16). To keep the accuracy in check, it is helpful to accumulate the results in higher precision (fp32 or int32). This type of mixed precision computation is often called quantization for integer values [18]. In this paper, we will always use the term mixed precision for brevity.

While using mixed precision data types reduces memory footprint, it might not necessarily lead to performance improvement. To investigate this, we conducted an experiment to compare the performance of Nvidia cuDNN for fp16 and fp32 in the absence of Nvidia mixed precision tensorized instructions (Tensor Core). As shown in Figure 1, we observe that blindly using mixed precision leads to substantial slowdown because of the overhead of casting between the two data types.

Therefore, mainstream hardware vendors (Intel, ARM, and Nvidia) have introduced mixed precision tensorized instructions to achieve better performance. These instructions add mixed precision arithmetic support where operands are of lower precision while the accumulation happens in higher precision, potentially leading to 2×-4× speedup. The most popular examples of these tensorized instructions are Intel VNNI, ARM DOT, and Nvidia Tensor Core. We will discuss the semantics of these operations in Section III.

Hardware vendors have a long history of adding new instructions to accelerate important applications. However, the mixed precision tensorized instructions introduce a unique idiom - horizontal accumulation. These tensorized instructions typically conduct a sequence of elementwise multiplications governed by a memory access pattern, followed by a horizontal accumulation. The accumulation is termed horizontal because all values to be accumulated are present in the same vector register. For example, as shown in Figure 2(a), Intel VNNI executes a dot product of two vectors, each having 4 int8 elements, while performing the accumulation in int32. We observe a similar pattern, though with different numbers of entries and data types, for Nvidia Tensor Core (Figure 2(b)) and ARM DOT instructions (omitted, because it is similar to VNNI).

Fig. 2. The semantics of Intel VNNI and Nvidia Tensor Core. The text beside each is the name of the corresponding LLVM intrinsic.
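To make the horizontal-accumulation idiom concrete, the following is a minimal scalar sketch of a VNNI-style dot-product instruction with the lane counts of Figure 2(a); the plain-Python form and the function name are ours, for illustration only.

    def dot_accumulate_16x4(c, a, b):
        """Scalar model of a 16-lane mixed-precision dot-product instruction.
        a holds 64 unsigned 8-bit values, b holds 64 signed 8-bit values, and
        c holds 16 signed 32-bit accumulators. Each output lane reduces four
        elementwise products and accumulates them in 32-bit precision."""
        d = [0] * 16
        for i in range(16):                # data-parallel lanes
            acc = c[i]
            for j in range(4):             # horizontal reduction within a lane
                acc += int(a[i * 4 + j]) * int(b[i * 4 + j])
            d[i] = acc
        return d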
B. Limitations of Existing Solutions

Though tensorized instructions seem promising, their adoption pace is limited by the absence of an automatic technique that can detect and use these instructions seamlessly. Currently, their usage in the deep learning domain is limited to hardware vendor libraries like Intel oneDNN and Nvidia cuDNN, which may provide high performance for the pre-defined operations but are inflexible, as discussed in Section I. Similarly, conventional loop vectorizers find it hard to exploit the profitability of these tensorized instructions, as they are not designed to work with the horizontal reduction idiom. Conventional loop vectorizers in general-purpose compilers like GCC and LLVM mainly focus on either analyzing the innermost loop body or combining instructions in the unrolled loop bodies. When it comes to the horizontal reduction idiom, these compilers often reorder the computation and generate an epilogue reduction, preventing us from using the tensorized instructions.

There have been some recent works on compiling programs to leverage tensorized instructions. PolyDL [36] generates CPU programs for convolution kernels in neural networks that call a GEMM micro-kernel using Intel VNNI instructions. Bhaskaracharya et al. [35] generate CUDA programs for matrix computation leveraging Nvidia Tensor Core. However, these works are limited to one platform and its specific instruction, which lacks generalizability. A generic solution to handle tensorized instructions from multiple platforms together is still missing.
C. Multi-Level Intermediate Representation
Compilers often have multiple levels of intermediate representation (IR) to express the program; each level is designed to enable different analyses and transformations. In this section, we describe the background of a tensor domain specific language (DSL) and the multi-level IR.
1) Graph-Level IR:
Deep learning compilers like TVM [10], Glow [34], and XLA [41] adopt a graph-level IR to represent a deep learning model as a directed acyclic graph (DAG) of operations. This graph-level IR is useful for inter-tensor-operation optimization, like tensor shape padding, operation fusion, and choosing the proper data layout [23]. Our tensorized analysis relies on tensor padding so that loops can be tiled by the number of lanes of the instruction perfectly. However, this IR has little knowledge about the implementation of each tensor operation. When compiling a graph-level IR, each node of the DAG will be dispatched to its implementation in the tensor DSL, as explained next.
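The shape padding mentioned above is a simple rounding computation at the graph level; the snippet below is an illustrative sketch (the numbers are hypothetical, not taken from the paper).

    def pad_to_lanes(extent, lanes):
        """Round a loop extent up to a multiple of the instruction lane count."""
        return (extent + lanes - 1) // lanes * lanes

    # e.g., a 147-channel tensor targeting a 16-lane instruction is padded to
    # 160 channels, so the tensorized loop tiles the dimension perfectly.
    assert pad_to_lanes(147, 16) == 160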
2) Tensor DSL:
Tensor domain-specific languages, like Halide [31], TVM [10], and Tensor Comprehensions [37], have been developed to productively and portably express tensor programs while enabling efficient performance tuning. As shown in Figure 4 and Figure 5, programs written in tensor DSLs follow this paradigm: users first declare the tensors and the loop variables, and then the computation is described by expressions involving the declared tensors and loop variables. These DSLs also provide interfaces to split, reorder, and annotate loops without affecting the computation semantics, for performance tuning. All the information gathered from the tensor DSL frontend is stored in a tensor Op data structure, including the declared tensors, loop variables, expressions, and loop manipulation.
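As a concrete instance of this paradigm, the following is a small matrix multiply written against TVM's te API; it is a sketch of the general tensor-DSL style described above (the shapes, names, and tiling factor are arbitrary), not code taken from UNIT.

    import tvm
    from tvm import te

    # Declare the tensors and the reduction loop variable.
    M = N = K = 1024
    A = te.placeholder((M, K), dtype="int8", name="A")
    B = te.placeholder((K, N), dtype="int8", name="B")
    k = te.reduce_axis((0, K), name="k")

    # Describe the computation: a mixed-precision matmul accumulating in int32.
    C = te.compute(
        (M, N),
        lambda i, j: te.sum(A[i, k].astype("int32") * B[k, j].astype("int32"), axis=k),
        name="C")

    # Loop manipulation (split/reorder) tunes performance without changing semantics.
    s = te.create_schedule(C.op)
    i, j = s[C].op.axis
    jo, ji = s[C].split(j, factor=16)   # tile j by a lane count such as 16
    s[C].reorder(i, jo, ji, k)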
3) Tensor IR:
Each tensor Op is then lowered to Tensor IR, which is an imperative program IR with additional constraints: all the loops are canonical (starting from 0, and increased by 1 each time), and all the array operations are restricted (i.e., an element cannot be accessed by two different pointers). These two properties enable making strong assumptions for analysis and transformation. Our work conducts analysis at the tensor Op data structure level and then performs transformation on the tensor IR. Although the tensor IR provides essentially identical information for analysis, as discussed above, it is easier to reorganize the loops via the tensor Op data structure.

Fig. 3. The overview of our framework, UNIT.
4) Low-Level IR:
The tensor IR is lowered to a general-purpose low-level IR like LLVM, after all the specialized analyses and transformations on the tensor IR are done, to get ready for assembly code generation.

III. UNIFIED TENSORIZATION
Our goal is to automatically tensorize (we coin this word to mean rewriting and optimizing a given code with the tensorized instruction) mixed-precision deep learning tensor operations across a variety of hardware platforms. We resolve the challenges discussed in Section I by presenting UNIT with the following techniques:
1) Tensorized Instruction in Tensor DSL: To abstract the diverse tensorized instructions on different hardware platforms, we leverage the existing tensor DSL to represent their semantics.
2) Applicability Inspection: To determine if and how a tensorized instruction can be applied to a tensor operation, we developed an analysis pass in the Inspector component of UNIT, which analyzes the tensor Op data structure of both the instruction and the operation. The result of the analysis guides the loop reorganization and instruction injection.
3) Code Rewriter: Once the tensorized instruction is determined applicable, the Rewriter reorganizes the loop nests in accordance with the Inspector so that the innermost loop nests resemble the tensorized instruction and are ready to be replaced. Finally, it sets up the tuning space for the remaining loop nests to exploit high performance.
These components of UNIT together enable a unified compilation flow that simplifies the mapping of tensorized instructions across a variety of hardware platforms. In the rest of this section, the details of each of the above steps are discussed.
A. Semantics Abstraction - Tensor DSL
In order to unify the compilation of tensorized instructions from different platforms and keep the system open to integrating new instructions, the first question to answer is how to have a unified description of the semantics of tensorized instructions.

(a) Intel VNNI, x86.avx512.pbpdusd:
    a, b = tensor((64,), u8), tensor((64,), i8)
    c = tensor((16,), i32)
    i, j = loop_axis(0, 16), reduce_axis(0, 4)
    d[i] = c[i] + sum(i32(a[i*4+j]) * i32(b[i*4+j]))

(b) ARM DOT, arm.neon.sdot.v4i32.v16i8:
    a, b = tensor((16,), i8), tensor((16,), i8)
    c = tensor((4,), i32)
    i, j = loop_axis(0, 4), reduce_axis(0, 4)
    d[i] = c[i] + sum(i32(a[i*4+j]) * i32(b[i*4+j]))

(c) Nvidia Tensor Core, nvvm.wmma.m16n16k16.mma.row.row.f32.f32:
    a, b = tensor((16,16), fp16), tensor((16,16), fp16)
    i, j = loop_axis(0, 16), loop_axis(0, 16)
    k = reduce_axis(0, 16)
    c[i,j] += fp32(a[i,k]) * fp32(b[k,j])

Fig. 4. Tensorized instructions as abstracted in the tensor DSL.
As explained in Section II, we employ the ubiquitous tensor DSL and tensor IR to solve the abstraction problem. All mixed precision tensorized instructions perform some elementwise operations on vectors, followed by a horizontal reduction. Each tensorized instruction, therefore, can be regarded as a small tensor operation program written in the tensor DSL.

Figure 4(a) shows how an Intel VNNI instruction is described in the tensor DSL. The three source operands of Intel VNNI are 512-bit registers. Two of them are 64 lanes of unsigned 8-bit integers (uint8) and signed 8-bit integers (int8), and the other one is 16 lanes of signed 32-bit integers (int32), which correspond to the tensors a, b, c we defined. The arithmetic behavior is defined by the loop variables and the expression of d[i]. Here we annotate that loop i is data parallel, since these 16 elements are independent from each other; loop j is a reduction, since for every independent element it sums up 4 elements along this loop. A similar loop pattern appears in the other tensor operations shown in Figure 5. The description of ARM DOT, shown in Figure 4(b), is similar to Intel VNNI, with a different number of lanes and data types.

Nvidia Tensor Core, on the other hand, performs a square matrix multiplication as shown in Figure 4(c). Compared with (a) and (b), a key difference is that it requires the accumulator register to be the same as the addition register (note the +=). This is due to the data type opaqueness of the Tensor Core instruction, which prevents us from giving arbitrary initial values for the accumulators.

We describe the semantics of each tensorized instruction in the tensor DSL. The deep learning compiler pipeline parses the operation into a tensor Op, which preserves tensor information like the expression tree, the loop trip count, and the array buffers. This information is essential for the analysis and transformation passes in the Inspector and Rewriter.

B. Applicability Detection - Inspector
To determine if a tensorized instruction can be applied to a tensor operation, the Inspector pass uses a two-step approach. It first determines if (part of) the tensor operation program and the instruction can be arithmetically equivalent by checking a form of isomorphism between their associated expression trees. After that, it inspects the data access pattern to confirm that the assembly operands can be prepared, so as to guide the Rewriter transformation.

    // Convolution in tensor DSL
    a, b = tensor((H,W,C), u8), tensor((R,S,K,C), i8)
    k, rc = loop_axis(0,K), reduce_axis(0,C)
    x, y = loop_axis(0,H-R+1), loop_axis(0,W-S+1)
    r, s = reduce_axis(0,R), reduce_axis(0,S)
    c[x,y,k] += i32(a[x+r,y+s,rc]) * i32(b[r,s,k,rc])

Fig. 5. An example of applying Intel VNNI to Conv using UNIT: (a) the convolution in the tensor DSL (above); (b) the arithmetic and array-access isomorphism checks; (c) the loops reorganized in DSL primitives.

1) Compute Isomorphism: Algorithm 1 shows the algorithm we adopt to determine the isomorphism of two expression trees. It recursively traverses both trees and matches the data type and opcode of each pair of nodes. Figure 5(b).1 shows that the two trees of convolution and pbpdusd (an Intel VNNI instruction) have exactly the same topology and data types, so these two programs are arithmetically isomorphic.

    function INSPECT(a, b)
      if a.type = b.type then
        if isleaf(a) ∧ isleaf(b) then
          if a is not bound then
            bind[a] := b
          else if bind[a] ≠ b then
            return False
          end if
          return True
        else if isarith(a) ∧ isarith(b) then
          cond := a.opcode = b.opcode
          cond := cond ∧ INSPECT(a.lhs, b.lhs)
          cond := cond ∧ INSPECT(a.rhs, b.rhs)
          return cond
        end if
      end if
      return False
    end function

Algorithm 1: Determine the isomorphism between expression trees. a is for the instruction, and b is for the operation.

This analysis also finds a mapping from the operands in the tensor program to the operands in the tensorized instruction. As we explained, tensor operands in the tensorized instruction are the abstraction of registers. Therefore, a register cannot correspond to multiple data sources. This property still requires further checks, which will be explained in the next subsection.

2) Array Access Isomorphism: Once compute isomorphism is determined, the next concern is how the data are fed to this instruction. The enforcement explained in the last subsection already ensures that each register operand corresponds to only one array in the tensor operation. On top of this, we need to determine that each element in the operand tensor corresponds to only one memory address in the tensor program when mapping to the tensorized instruction. To map a tensor program to a tensorized instruction, we need to know which loop levels are tensorized. We enumerate the loop levels to be tensorized, and these loop levels will be mapped to loops in the tensorized instruction. Note that only loops with the same annotation (data parallel or reduction) can be mapped to each other.
Then we check if this enumerated mapping is feasible by scanning each pair of operand correspondences determined in the last subsection. If the operand in the tensor program is a constant, we just skip it (if it is a constant, the correspondence was already checked during compute isomorphism: this register corresponds to this constant). If the operand is a memory operation, we inspect the index expressions of both memory operations in the operation and the instruction. We define:
• A is the set of loop variables to be mapped to the tensorized instruction.
• B is the set of loop variables of the tensorized instruction.
• f : A → B is the mapping we enumerate.
• S(u) := { x | x is a loop variable in the index expression u }
• S'(u) := { f(x) | x ∈ S(u) ∩ A }
A mapping is considered feasible if every pair of memory operations' index expressions (u, v), where u is from the operation and v is from the instruction, satisfies S'(u) ⊆ S(v). Figure 5(b).2 shows an example of this inspection. If S'(u) is a proper subset of S(v), the data loaded by the tensor operation should be broadcast along the instruction loop variables that do not appear in S'(u) to fill all the register lanes. If the condition does not hold, each register lane would correspond to multiple memory addresses under this mapping, which is not realistic for code generation, so we should try another enumeration.

If there are multiple feasible mappings, we leave this as a dimension of the code tuning space. Once the mapping is determined, it will guide the further loop transformation and code generation.
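The feasibility test above is a small set computation; the following Python sketch illustrates it (the data structures and names are ours, for illustration; UNIT implements this analysis over TVM's tensor Op).

    def mapping_is_feasible(operand_pairs, f, A):
        """Check the array-access condition S'(u) ⊆ S(v) for every operand pair.
        operand_pairs: (u_vars, v_vars) pairs, where u_vars are the loop variables
          of an index expression in the tensor operation and v_vars are the loop
          variables of the matched index expression in the tensorized instruction.
        f: the enumerated mapping from operation loop variables (in A) to
          instruction loop variables."""
        for u_vars, v_vars in operand_pairs:
            s_prime = {f[x] for x in u_vars if x in A}
            if not s_prime <= v_vars:   # a register lane would need several addresses
                return False
        return True

    # The mapping of Figure 5: k -> i and rc -> j is feasible, because
    # S'({x,y,k}) = {i} ⊆ {i}, S'({x,y,rc}) = {j} ⊆ {i,j}, and
    # S'({r,s,k,rc}) = {i,j} ⊆ {i,j}.
    f, A = {"k": "i", "rc": "j"}, {"k", "rc"}
    pairs = [({"x", "y", "k"}, {"i"}),
             ({"x", "y", "rc"}, {"i", "j"}),
             ({"r", "s", "k", "rc"}, {"i", "j"})]
    assert mapping_is_feasible(pairs, f, A)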
C. Code Transformation - Rewriter

There are three phases in the code transformation: loop reorganization, tensorized instruction replacement, and tuning.

1) Loop Reorganization: As discussed in Section III-B, the Inspector selects the loop levels to be executed by the given instruction. To get poised for code generation, as shown in Figure 5(c), we need to tile these loops and reorder them to the innermost loop levels so that those innermost loops perform exactly the same semantics as the instruction. As we explained, the tensor DSL provides the capability to reorganize the loop nests easily.

2) Tensorized Instruction Replacement: After identifying the code region to be replaced by a tensorized instruction, the code generator should prepare each operand of this instruction. It is difficult to fully automate the operand preparation for different platforms because of their diverse execution models and assembly formats. Therefore, we formalize a unified programming interface for compiler developers to manually specify the rule of operand generation. In this interface, each loop variable to be replaced, and its coefficients in the index expression, are exposed. For example, as shown in Figure 5(c), by analyzing the strides and trip counts of ki and ci, the array access c[x,y,c] will be transformed to a 16-lane vector; a[x,y,rc] will be vectorized along ci by 4, and broadcast along ki by 16; b[r,s,k,c] will be vectorized along ci by 4, and unrolled and concatenated along ki.

3) Tuner: All the other loop levels that are not involved in the instruction rewriting can be reorganized to tune the performance. Here, we develop strategies to optimize the performance of tensor programs on both CPU and GPU. The generic philosophy is to exploit both fine- and coarse-grained parallelism. We also developed specialized strategies because of the different execution models and memory hierarchies.

a) CPU Tuning: On CPU, data-parallel loops are distributed to multiple threads to achieve coarse-grained parallelism. On the other hand, the loop-carried dependences in reduction loops introduce RAW hazards in the execution pipeline. To avoid this penalty, and to achieve instruction-level parallelism, we reorder and unroll a small number of data-parallel loops below the innermost reduction loop. The tuning space on CPU involves two dimensions, the degree of unrolling and the degree of parallelization. We enumerate these two parameters and profile the execution time to search for the best combination.
If the unrolling degree is too small, there will not be enough independent instructions to fill the idle penalty cycles caused by RAW hazards. If it is too large, it will cause I-cache misses. Similarly, the number of threads can be neither too few nor too many. If there are too few, the computing cores would have insufficient utilization and memory latency would not be hidden. Too many threads introduce context switching overhead. We rely on the tuning process to look for the best combination.

b) GPU Tuning: On GPU, coarse-grained parallelism is achieved by distributing the data-parallel loops across the streaming multiprocessors. Similar to CPU, fine-grained parallelism is also achieved by reordering and unrolling a small degree of data-parallel loops to avoid the pipeline penalty caused by loop-carried dependences. Moreover, on GPU, data reuse is explicitly managed by the software. Therefore, as shown in Figure 6, we adopt an outer-product style matrix multiply accumulation to reuse the buffered submatrices.

Fig. 6. Accumulating a p × p "square window" avoids loop-carried data dependences, and reuses buffered submatrices.

Besides the generic optimization, we also developed optimization mechanisms specialized for DNN kernels. Among popular DNN models, there are many layers with relatively small width and height and deep channels. We apply dimension fusion to layers with small width and height: these two dimensions are fused into one to save the redundant padding. In addition, we apply split reduction to layers with deep channels. For a reduction loop with a large trip count, we can split it and parallelize each split segment on threadIdx. After all the segments are done, we synchronize the threads and reduce the split segments in the shared memory.
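To summarize the CPU strategy above, the following is a rough sketch of the loop structure UNIT aims for on a matmul-like operation (loop names, tiling factors, and the scalar modeling of the intrinsic are illustrative; Figure 7 later shows the paper's own code sketch).

    # c[n][m] += a[n][k] * b[k][m]; assumes M divisible by lanes*unroll, K by red.
    def tuned_matmul(a, b, c, N, M, K, lanes=16, red=4, unroll=4):
        for n in range(N):                    # data-parallel loop; in generated code
            for mo in range(M // (lanes * unroll)):   # it is distributed over threads
                for ko in range(K // red):            # reduction loop
                    for u in range(unroll):           # unrolled data-parallel loop:
                        m0 = (mo * unroll + u) * lanes # independent MACs hide RAW hazards
                        for lane in range(lanes):      # one tensorized instruction
                            acc = c[n][m0 + lane]
                            for r in range(red):
                                acc += a[n][ko * red + r] * b[ko * red + r][m0 + lane]
                            c[n][m0 + lane] = acc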
IV. IMPLEMENTATION

In this section, we discuss the technical details of our implementation. UNIT is implemented by extending Apache TVM [10], a full-stack deep learning compiler, with tensorized instruction support. We leverage TVM's tensor DSL, tensor Op, and tensor IR infrastructure, and its tuning mechanisms [11], [23], to generate high performance kernels. In addition, implementing UNIT on top of TVM enables end-to-end model inference with other optimizations, such as operator fusion, in addition to tensorization.

A. Inspector

The Inspector pass is implemented by analyzing TVM's ComputeOp data structure. It matches the expression trees of both the instruction and the program and enumerates mappings between the loop variables. We enumerate the loops from the tensor's innermost dimension to its outermost dimension, and greedily return the first eligible mapping because of the better potential data locality of inner dimensions. The enumerated mapping provides us with the correspondence of loop variables between the instruction and the tensor operation.

B. Rewriter

The following steps are performed by the Rewriter:
1) According to the loop correspondence analyzed by the Inspector, we reorganize the loops to be tensorized by tiling these loops by the trip counts of the corresponding loops in the instruction, and reorder them to be the innermost loops. These loops are annotated with a tensorize pragma to hint the instruction injection.
2) Based on the strategies discussed in Section III-C, we reorganize the loops not involved in instruction rewriting to tune the performance.
3) We lower the manipulated loop nest to the tensor IR, and replace the loop body annotated with the tensorize pragma with the target instructions, as shown in Figure 5(c).
Steps 1 and 2 are achieved by invoking TVM scheduling primitives at the tensor DSL level, and step 3 is a tensor IR transformation pass.

Next, we discuss the implementation of the tuning strategies discussed in the last section.

CPU Tuning: The code sketch of tuned CPU code is shown in Figure 7. To implement the tuning we discussed in Section III-C, we enumerate two breaking points on the data-parallel loop nest, which define how the loop levels are parallelized and unrolled. A breaking point is defined by a loop level and a tiling factor, giving more flexibility to the division. Loops before the first breaking point will be fused and parallelized. Loops between the two points will be executed in serialized order. Loops after the second breaking point will be reordered to the innermost and unrolled.

Fig. 7. The code sketch of CPU tuning.

GPU Tuning: As discussed in the last paragraph of Section III-C, both coarse-grained and fine-grained parallelism optimizations are applied on data-parallel loops, so there is a tradeoff between them: data reuse is increased by increasing the unrolling degree (each buffered submatrix is reused p times), but the coarse-grained parallelism is decreased. Also, a large unrolling degree may overwhelm the register resources. Therefore, the key to the generic optimization is to choose a proper unrolling degree.

On the other hand, greedily applying each specialized optimization does not always improve the performance. Though dimension fusion may save memory traffic, it also introduces software overhead for data rearrangement. Similarly, though splitting the reduction loop introduces more parallelism, it also introduces thread synchronization overhead and register pressure. We enumerate each parameter, including the degree of reduction parallelization and whether to fuse the width and height dimensions, and then apply these transformations to the program and profile the performance to determine which transformation leads to the best performance.

V. METHODOLOGY

A. Target Hardware Platforms

We assess UNIT on three hardware platforms:
Intel x86 CPU: We use an Amazon EC2 C5.12xlarge instance as our x86 platform, with a 24-core Intel Xeon Platinum 8275CL CPU @3.00GHz (codename: Cascade Lake) and 96GB memory.
ARM CPU: We use an Amazon EC2 M6g.8xlarge instance as our ARM platform, with an AWS Graviton2 CPU, which features a 32-core ARM Cortex-A72 CPU @2.30GHz and 128GB memory.
Nvidia GPU: We use an Amazon EC2 P3.2xlarge instance as our GPU platform, with an Nvidia Tesla V100 SXM2 GPU that has 16GB of memory.

B. Software Frameworks

Code Generation: All programs implemented in Apache TVM are emitted to LLVM IR for code generation. We choose LLVM-10 as our backend, and to be compatible, we use CUDA-10.0 as the NVPTX linker and runtime.

Fig. 8. Quantized network inference (bs=1) accelerated by Intel VNNI.
Baseline: We use vendor-provided libraries for the baseline performance of operators whenever possible. Specifically, Intel oneDNN v1.6.1 and Nvidia cuDNN 7.6.5 are used as our CPU and GPU baselines, respectively. For end-to-end model inference, we looked for the best available solutions with those libraries, which are MXNet integrated with oneDNN for CPU and TVM integrated with cuDNN for GPU. Another set of baselines is the manually written implementations. To this end, we use the existing TVM solutions for Intel and ARM CPUs, which involve heavy engineering effort to carefully write intrinsics to use Intel VNNI and ARM DOT instructions. We did not find a manually written Tensor Core implementation that covers our evaluated workloads.

C. Workloads

DNN Models: All DNN models are from the MXNet Model Zoo and converted to TVM's graph IR, Relay [32], for quantization [19], layout transformation, and data padding. All these models adopt the NCHW[x]c data layout [23] for the data and KCRS[y]k[x]c for the kernel. Here N denotes the batch size, C denotes the input channels, H and W are the width and height of the input image, and [x]c denotes that the original C is split by x. Similarly, K denotes the number of output channels, R and S are the height and width of the kernel, and [y]k denotes that the original dimension K is split by y. [x] equals the number of lanes of the instruction output, and [y] equals the width of the reduction.

In the evaluation, we target the N=1 cases, because they are hard to optimize but critical for inference use cases. Compared with batched cases where N>1, we cannot reuse the kernel tensor across samples, or exploit the parallelism brought by the data-parallel batching dimension.

VI. EVALUATION

Our evaluation of UNIT attempts to answer these questions:
1) What is the performance of end-to-end deep learning model inference powered by tensorized instructions?
2) How does each optimization technique that UNIT uses impact the performance?
3) Can UNIT be extended to support new hardware platforms and tensor operations?

A. End-to-End Performance

In this subsection, we show UNIT's end-to-end effectiveness on Intel x86 and Nvidia GPU processors for tensorizing mixed precision instructions. For the Intel x86 experiments, we use MXNet integrated with Intel oneDNN (referred to as MXNet-oneDNN) as the baseline. Another comparison of ours is TVM with manually written schedules using Intel's VNNI instruction. The findings of this experiment are shown in Figure 8. We observe that UNIT achieves significant speedup compared to MXNet-oneDNN.

Fig. 9. Mixed precision network inference (bs=1) accelerated by Tensor Core.
Fig. 10. The performance impact of the code space exploration on CPU (oneDNN baseline; Parallel, +Unroll, +Tune).
Fig. 11. The performance impact of the code space exploration on GPU (cuDNN baseline; Generic, +FuseDim, +SplitK, +Tune).

TABLE I. CHARACTERISTICS OF THE SELECTED CONVOLUTION LAYERS.
Layer:   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
C:     288  160   80  128  192  256  128  576   96  576   64   64  608
IHW:    35    9    7   73   16   16   16   14   16   14   16   14   14   29   56   14
K:     384  224  192  192  128  192  256  512  160  192  128  256  128   96  128  192
R=S:     3    3    1    3    3    3    3    1    3    1    3    1    1    3    1    1
Stride:  2    1    1    1    1    1    1    1    1    1    1    1    1    1    2    1
OHW:    17    7    7   71   14   14   14   14   14   14   14   14   14   27   28   14
Note that Intel oneDNN has access to manually written schedules that have been aggressively optimized and tuned by domain experts. We also observe that TVM overall achieves better performance than MXNet-oneDNN, but has suboptimal performance on resnet50 and resnet50b, which were heavily tuned by oneDNN engineers. On the other hand, UNIT outperforms both baselines, by 1.3× over MXNet-oneDNN and by 1.18× over TVM.

Next, we test the efficacy of UNIT in utilizing Nvidia Tensor Core instructions for Nvidia GPUs. For the baseline, we integrate TVM with cuDNN, which has access to manually written, aggressively tuned Tensor Core schedules. The findings of this experiment are shown in Figure 9. We observe that UNIT consistently achieves better performance than cuDNN, with a mean speedup of 1.75× and up to 2.2×.

B. Optimization Implications

In this subsection, we focus on the convolution operators of the DNN models to perform an in-depth analysis of the impact of the different optimization techniques used by UNIT's Rewriter. This is essentially an ablation study, showing how important different parts of UNIT are. There are 148 different convolution workloads (i.e., convolutions with different feature map sizes, kernel sizes, strides, etc.) in the models, out of which we choose 16 representative convolution layers. These kernels cover diverse input shapes and strides. Other workloads behave similarly in the ablation study. We summarize the characteristics, namely, convolution attributes like shapes and strides, of the selected workloads in Table I.

Intel x86 servers: As we discussed in Section III-C, we have two breaking points in CPU scheduling. The loop nests before the first breaking point are parallelized and the loop nests after the second breaking point are unrolled, while the ones in between the breaking points are executed serially. As loop nests can either be parallelized or unrolled (the remaining ones are serialized), we have a search space represented by the tuning pairs. The Rewriter tunes this search space to generate a high-performance kernel. In this experiment, we incrementally measure the performance improvements brought by parallelizing, unrolling, and tuning. The findings of this experiment are shown in Figure 10, normalizing the speedup to the Intel oneDNN execution latency.

First, we fuse outer loop nests up to a loop-bound limit of 3000 and measure this performance (shown by Parallel). Then, we take the remaining loop nests, and tile and unroll them with an unrolling factor of at most 8, and measure this performance (shown by +Unroll). Finally, instead of setting the limits as 3000 and 8, we tune the search space and measure the performance (shown by +Tune), getting the final latency UNIT achieves. We observe that Parallel and Unroll together are responsible for most of the speedup. The additional speedup introduced by tuning is quite small. It turns out that more than half of the kernels reach their optimal performance on the first tuning pair (i.e., 3000 and 8), and more than 95% of the kernels reach their optimal performance within the first 8 tuning pairs.

Nvidia GPU servers: As discussed in Section III-C, we employ three optimizations on GPU: generic coarse- and fine-grained parallelism, fusing width and height to save memory bandwidth, and parallelizing the reduction dimension.
In this subsection, we study the impact of these optimizations on the performance. We show the findings in Figure 11, normalizing the speedup to Nvidia cuDNN.

According to our evaluation, any unrolling degree (p in Figure 6) larger than 2 may overwhelm the registers, so we use p=2 to apply the generic optimization. The generic optimization already beats cuDNN in most cases (shown by Generic). Then, depending on the height and width values, the Rewriter fuses the height and width dimensions to save memory bandwidth (shown by +FuseDim). Then, we split the reduction dimension K by 64 and measure the performance (+SplitK). Finally, we let the Rewriter choose the sizes for these three optimizations and measure performance (shown by +Tune). We observe that SplitK leads to the maximal speedup, as it leads to significant parallelism and keeps the Tensor Cores busy. More than 70% of the kernels can get high performance by employing fusion and parallelizing the reduction dimension. Similar to CPUs, the additional speedup from tuning is small.

C. Extensibility

We evaluate the extensibility of UNIT in two aspects: to new hardware platforms and to new deep learning tensor operations. We observe that by just representing the semantics of the new tensorized instruction in the tensor DSL, UNIT can easily extend to new tensorized instructions and tensor operations.

New Hardware Platforms. To demonstrate the capability of extending to new hardware platforms, we apply UNIT to an ARM CPU supporting the ARM DOT instruction. To the best of our knowledge, there is a lack of a deep learning framework with well-integrated ARM backend library support. In the absence of a framework baseline, we choose TVM compiling to ARM Neon assembly as the baseline (shown by TVM-NEON). Additionally, we find that TVM has manually written schedules using ARM DOT instructions, which form our second comparison baseline (shown by TVM-Manual). Note that in contrast to UNIT's automatic approach, this is a manually written schedule requiring intense engineering effort. Finally, we represent the semantics of the ARM DOT instruction in UNIT's tensor DSL and use UNIT to compile the models. The findings of this experiment are shown in Figure 12, showing normalized speedup compared to the TVM-NEON baseline. The results show that UNIT consistently outperforms both TVM-NEON and TVM-Manual, proving UNIT's effectiveness in extending to new hardware platforms.

3D Convolution. We test UNIT on the 3D convolution operation for mapping Intel VNNI tensorized instructions. Note that this does not require any changes from UNIT's perspective; we are just giving a new input (the tensor-level IR for conv3d) to UNIT. To evaluate this extensibility, we take all the 2D convolutions from Resnet18 and manually convert them to 3D convolutions. We then apply UNIT on these kernels and show the speedup compared to the oneDNN baseline in Figure 13. We observe that UNIT easily extends to 3D convolution, as it has comparable performance for many convolution kernels, with an average of 1.2× speedup.

Fig. 12. The performance of ARM on model inference.
Fig. 13. The performance of each layer on res18-3d.
VII. RELATED WORK

Compilation support for hardware intrinsics: There exists a large body of literature on compilation support for various hardware intrinsics [33], [20], [27], [29], [16], [22], [28], [15], [36], [35]. Existing production compilers such as GCC and LLVM implement auto-vectorization to leverage SIMD intrinsics. Prior works such as [20], [33] propose various approaches to further improve the performance of the auto-vectorizer. These approaches cannot be extended to support tensor computation intrinsics, which introduce "horizontal computation" within each lane. TVM [10] implements an extensible interface to support new hardware intrinsics that are not limited to SIMD instructions. However, programmers need to transform the program to match the behavior of the intrinsics and declare the lowering rule for the intrinsics prior to compilation. TVM will then match the computation and replace it with the code snippets that call the target hardware intrinsics. Compared to TVM, UNIT performs the code detection and transformation automatically. This achieves higher flexibility and productivity. There are some prior works that, similar to UNIT, also perform program transformation and code generation automatically for tensor computation [36], [35]. However, these are limited to one platform or certain intrinsics and hence are not as flexible as UNIT.

Decoupled computation and data access: The analysis pass of UNIT is inspired by decoupled access-execute (DAE) architectures [21], [30], [26], [40], [13]. Computation and data access are decoupled and specialized separately: the computation is offloaded onto a programmable datapath, and data access is encoded in hardware intrinsics and executed on a specialized address generation unit (AGU). UNIT adopts a reversed approach: it matches computation on a fixed datapath, and analyzes the data access fed to that datapath.

Polyhedral model: Many prior works have built program analysis and transformation frameworks based on the polyhedral model for tensor programs [20], [36], [35], [15], [37], [12], [25], [38]. Loop Tactics [9] is one representative work, which matches pre-defined computation patterns in the polyhedral IR and transforms the matched patterns into optimized programs. UNIT distinguishes itself from Loop Tactics in that: 1) compared with the schedule tree [39] in the polyhedral model, the tensor DSL provides more information, such as loop reduction properties and operand types; 2) UNIT provides an end-to-end solution including auto-tuning to obtain optimal performance, whereas Loop Tactics requires the optimized schedules to be provided manually.

Deep learning frameworks: UNIT is complementary to the existing deep learning frameworks. Existing frameworks such as TensorFlow [8], PyTorch [7], and MXNet [3] rely on vendor-crafted libraries to support the new tensor intrinsics. TVM [10] requires code re-writing on the user side. UNIT is able to handle new operators which might not be covered by the vendor libraries and spare the user from having to perform manual re-writing. We have demonstrated the effectiveness of the methodology of UNIT based on TVM. Similar techniques can be applied to other frameworks to further boost their performance.

VIII. CONCLUSION

Deep learning has prompted hardware vendors to add specialized tensorized instructions for dense tensor operations. These instructions perform a "horizontal reduction" to accumulate elementwise computations.
While promising, introducing this new idiom complicates its general-purpose applicability, as one has to rely on hand-written kernels to gain the high performance brought by these instructions. In this paper, we introduce UNIT, a unified compilation pipeline, that represents the tensorized instructions from different hardware platforms using the same IR, then automatically detects the applicability of the tensorized instructions in a given tensor operation, transforms the loop nest to enable easy mapping of the tensorized instruction, and finally rewrites the loop body with the tensorized instructions. UNIT enables automatic tensorized instruction compilation over a variety of hardware platforms like Intel/ARM CPUs and Nvidia GPUs. Our analysis shows that UNIT achieves 1.3× speedup over oneDNN (VNNI instruction), 1.75× over cuDNN (Tensor Core instruction), and 1.13× over the manually written ARM intrinsics in TVM (DOT instruction).

ACKNOWLEDGEMENTS

This work is supported by NSF grant CCF-1751400 and Mu Li's team at Amazon Web Services.

APPENDIX

A. Abstract

This guide describes how to set up the UNIT compilation infrastructure and run the workloads we discussed in Section VI. This guide provides instructions to:
• Set up the experiment environment for UNIT through Docker.
• Run the end-to-end inference models shown in Figures 8, 9, and 12.
• Run the experiments that demonstrate the effects of our tuning strategies, shown in Figures 10 and 11.
• Run the 3D-convolution results shown in Figure 13.
Our experiments are conducted on Amazon EC2 c5.12xlarge for Intel VNNI, p3.2xlarge for Nvidia Tensor Core, and m6g.8xlarge for ARM DOT. To download and install our infrastructure, approximately 32GB of disk is required. We provide a Dockerfile to set up the environment, and scripts to automatically run the experiments and plot the figures.

B. Artifact Checklist

• Program: As demonstrated in Section V, we use nine typical DNN models, including ResNet, Inception, and MobileNet.
• Compilation: We need specific versions of TVM to run our experiments and baselines. They are included in the zip release.
• Data set: The test data is included in our zip release.
• Runtime environment: We run our artifact on Ubuntu 18.04. For GPU, the Nvidia GPU driver and the additional Docker runtime should be installed.
• Hardware: We run our experiments on AWS EC2 instances: c5.12xlarge for Intel VNNI, p3.2xlarge for Nvidia Tensor Core, and m6g.8xlarge for ARM DOT.
• Execution: We provide scripts to run the experiments discussed in Section VI. It takes 2 hours to compile the models in Figure 8, half an hour to compile the models in Figure 9, and 1.4 hours to compile the models in Figure 12. It takes half an hour to run the experiments in Figures 10 and 11.
• Output: Our scripts both run the experiments and plot the figures as PDF files.
• Experiments: The results reported in our paper were generated on a physical machine, but in this artifact evaluation they all run on a virtual machine in Docker. Performance fluctuation may happen because of the overhead of virtualization.

C. Description

1) How Delivered: Download our Dockerfile, scripts, and model test data at https://doi.org/10.5281/zenodo.4420522.

2) Hardware Dependencies:
• AVX512_VNNI: This is available on Intel CPUs with the Cascade Lake architecture. In this work, we use AWS EC2 c5.12xlarge. The CPU model is Intel(R) Xeon(R) Platinum 8275CL CPU @3.00GHz. The rate is $2.04/hour, and it takes approximately one hour to set up the environment and 5 hours to run all the related experiments.
• Tensor Core: This is available on Nvidia GPUs with the Tensor Core extension. In this work, we use AWS EC2 p3.2xlarge. The GPU model is Tesla V100. Please install the GPU driver. The rate is $3.06/hour, and it takes approximately 1 hour to set up the environment, and another hour to run all the related experiments.
• ARM DOT: This is available on ARM CPUs v8.2 with the dotprod extension. In this work, we use AWS EC2 m6g.8xlarge. The CPU model is Amazon Graviton 2. The rate is $1.232/hour, and it takes 1 hour to set up the environment and run the experiments.

3) Software Dependencies: All our software dependencies are installed automatically in Docker. Refer to this link for Docker installation. When setting up the last step of the package repository, do choose the proper tab for your CPU platform (x86 or ARM). Refer to this to install the Docker runtime that supports Nvidia GPUs. Nvidia Docker requires the GPU driver installed; use this command to install it:
    $ sudo apt-get install nvidia-driver-455

D. Installation

Unzip the downloaded file; there are three sub-zips, tensorcore.zip, vnni.zip, and arm.zip, to evaluate the three platforms we discussed in this paper.

E. Experiment Workflow

1) GPU:
• After building the docker image, an image hash value will be generated in the console log:
    $ unzip tensorcore.zip && cd tensorcore
    $ sudo docker build .
• After entering the container, the experiment scripts are all in the $HOME directory:
    $ cd $HOME
• To replicate the experiments in Figures 9 and 11:
    $ bash run_e2e.sh
• It takes half an hour to run these scripts. Both the experiments and the data plotting are done in these scripts. Use these commands to take the generated PDFs out of the container and look at them.

2) CPU: We run the Intel VNNI experiment on an AWS EC2 c5.12xlarge instance. It is also used to cross-compile the ARM target.
• After building the docker image, an image hash value will be generated in the console log:
    $ unzip vnni.zip && cd vnni
    $ sudo docker build .
    $ sudo docker run -tid
• It takes about two hours to get all the models compiled for ARM. The compiled models will be in $HOME/arm-base and $HOME/arm-unit.
• Copy the compiled models to the ARM machine:
    $ scp -i key.pem -r arm-unit

F. Evaluation and Expected Result

Finally, we have these PDF files:
• Figures 8, 9, and 12 should be compared against cpu-e2e.pdf, gpu-e2e.pdf, and arm-e2e.pdf.
  – The ARM results reported in this paper were generated by an old version of TVM. The performance is improved in the newer version. We will fix this in the camera ready.
• Figures 10 and 11 should be compared against cpu-dse.pdf and gpu-dse.pdf.
• Figure 13 should be compared against conv3d.pdf.

REFERENCES
[9] ACM Transactions on Architecture and Code Optimization (TACO), 16(4):1–25, 2019.
[10] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 578–594, 2018.
[11] Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. Learning to optimize tensor programs. In Advances in Neural Information Processing Systems, pages 3389–3400, 2018.
[12] Jason Cong and Jie Wang. PolySA: Polyhedral-based systolic array auto-compilation. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 1–8. IEEE, 2018.
[13] Vidushi Dadu and Tony Nowatzki. Towards general purpose acceleration by exploiting common data-dependence forms. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019.
[14] J. Deng, W. Dong, R. Socher, L.-J. Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.
[15] Andi Drebes, Lorenzo Chelini, Oleksandr Zinenko, Albert Cohen, Henk Corporaal, Tobias Grosser, Kanishkan Vadivel, and Nicolas Vasilache. TC-CIM: Empowering tensor comprehensions for computing-in-memory. In IMPACT 2020: 10th International Workshop on Polyhedral Compilation Techniques, 2020.
[16] Alexandre E. Eichenberger, Peng Wu, and Kevin O'Brien. Vectorization for SIMD architectures with alignment constraints. ACM SIGPLAN Notices, 39(6):82–93, 2004.
[17] Qingchang Han, Yongmin Hu, Fengwei Yu, Hailong Yang, Bing Liu, Peng Hu, Ruihao Gong, Yanfei Wang, Rui Wang, Zhongzhi Luan, and Depei Qian. Extremely low-bit convolution optimization for quantized neural network on modern computer architectures. In ICPP '20: 49th International Conference on Parallel Processing, 2020.
[18] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew G. Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. CoRR, abs/1712.05877, 2017.
[19] Animesh Jain, Shoubhik Bhattacharya, Masahiro Masuda, Vin Sharma, and Yida Wang. Efficient execution of quantized deep learning models: A compiler approach. arXiv preprint arXiv:2006.10226, 2020.
[20] Martin Kong, Richard Veras, Kevin Stock, Franz Franchetti, Louis-Noël Pouchet, and Ponnuswamy Sadayappan. When polyhedral transformations meet SIMD code generation. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 127–138, 2013.
[21] Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna. MAERI: Enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects. SIGPLAN Not., 53(2):461–475, March 2018.
[22] Samuel Larsen and Saman Amarasinghe. Exploiting superword level parallelism with multimedia instruction sets. ACM SIGPLAN Notices, 35(5):145–156, 2000.
[23] Yizhi Liu, Yao Wang, Ruofei Yu, Mu Li, Vin Sharma, and Yida Wang. Optimizing CNN model inference on CPUs. In 2019 USENIX Annual Technical Conference (USENIX ATC 19), pages 1025–1040, Renton, WA, July 2019. USENIX Association.
[24] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory F. Diamos, Erich Elsen, David García, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. CoRR, abs/1710.03740, 2017.
[25] MLIR. Multi-level IR compiler framework. https://mlir.llvm.org.
[26] Tony Nowatzki, Vinay Gangadhar, Newsha Ardalani, and Karthikeyan Sankaralingam. Stream-dataflow acceleration. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), 2017.
[27] Dorit Nuzman and Richard Henderson. Multi-platform auto-vectorization. In Proceedings of the International Symposium on Code Generation and Optimization, CGO '06, pages 281–294, USA, 2006. IEEE Computer Society.
[28] Dorit Nuzman, Ira Rosen, and Ayal Zaks. Auto-vectorization of interleaved data for SIMD. In Michael I. Schwartzbach and Thomas Ball, editors, Proceedings of the ACM SIGPLAN 2006 Conference on Programming Language Design and Implementation, Ottawa, Ontario, Canada, June 11-14, 2006, pages 132–143. ACM, 2006.
[29] Phitchaya Mangpo Phothilimthana, Archibald Samuel Elliott, An Wang, Abhinav Jangda, Bastian Hagedorn, Henrik Barthels, Samuel J. Kaufman, Vinod Grover, Emina Torlak, and Rastislav Bodik. Swizzle Inventor: Data movement synthesis for GPU kernels. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 65–78, 2019.
[30] Raghu Prabhakar, Yaqi Zhang, David Koeplinger, Matt Feldman, Tian Zhao, Stefan Hadjis, Ardavan Pedram, Christos Kozyrakis, and Kunle Olukotun. Plasticine: A reconfigurable architecture for parallel patterns. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), 2017.
[31] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '13, pages 519–530, New York, NY, USA, 2013. ACM.
[32] Jared Roesch, Steven Lyubomirsky, Logan Weber, Josh Pollock, Marisa Kirisame, Tianqi Chen, and Zachary Tatlock. Relay: A new IR for machine learning frameworks. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, MAPL 2018, pages 58–68, New York, NY, USA, 2018. Association for Computing Machinery.
[33] Ira Rosen, D. Nuzman, and A. Zaks. Loop-aware SLP in GCC. Pages 131–142, 2007.
[34] Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Summer Deng, Roman Dzhabarov, James Hegeman, Roman Levenstein, Bert Maher, Satish Nadathur, Jakob Olesen, Jongsoo Park, Artem Rakhov, and Misha Smelyanskiy. Glow: Graph lowering compiler techniques for neural networks. CoRR, abs/1805.00907, 2018.
[35] Somashekaracharya G. Bhaskaracharya, Julien Demouth, and Vinod Grover. Automatic kernel generation for Volta tensor cores. arXiv preprint arXiv:2006.12645, 2020.
[36] Sanket Tavarageri, Alexander Heinecke, Sasikanth Avancha, Gagandeep Goyal, Ramakrishna Upadrasta, and Bharat Kaul. PolyDL: Polyhedral optimizations for creation of high performance DL primitives. arXiv preprint arXiv:2006.02230, 2020.
[37] Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. Tensor Comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv preprint arXiv:1802.04730, 2018.
[38] Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, Jose Ignacio Gomez, Christian Tenllado, and Francky Catthoor. Polyhedral parallel code generation for CUDA. ACM Transactions on Architecture and Code Optimization (TACO), 9(4):1–23, 2013.
[39] Sven Verdoolaege, Serge Guelton, Tobias Grosser, and Albert Cohen. Schedule trees. In International Workshop on Polyhedral Compilation Techniques (IMPACT), Vienna, Austria, 2014.
[40] J. Weng, S. Liu, V. Dadu, Z. Wang, P. Shah, and T. Nowatzki. DSAGEN: Synthesizing programmable spatial accelerators. In Proceedings of the 47th Annual International Symposium on Computer Architecture (ISCA), pages 268–281, 2020.
[41] XLA Team. XLA - TensorFlow, compiled. March 2017.
[42] D. Yan, W. Wang, and X. Chu. Demystifying tensor cores to optimize half-precision matrix multiply. In 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020.