UNIT: Unifying Tensorized Instruction Compilation
Jian Weng*†, Animesh Jain†, Jie Wang*†, Leyuan Wang†, Yida Wang†, Tony Nowatzki*
* University of California, Los Angeles, USA    † Amazon Web Services, USA
{jian.weng,jiewang,tjn}@cs.ucla.edu    {janimesh,wangleyu,wangyida}@amazon.com
Work done during Jian and Jie's internship at AWS.
Abstract—Because of the increasing demand for intensive computation in deep neural networks, researchers have developed both hardware and software mechanisms to reduce the compute and memory burden. A widely adopted approach is to use mixed precision data types. However, it is hard to benefit from mixed precision without hardware specialization because of the overhead of data casting. Recently, hardware vendors offer tensorized instructions specialized for mixed-precision tensor operations, such as Intel VNNI, Nvidia Tensor Core, and ARM DOT. These instructions involve a new computing idiom, which reduces multiple low precision elements into one high precision element. The lack of compilation techniques for this emerging idiom makes it hard to utilize these instructions. In practice, one approach is to use vendor-provided libraries for computationally-intensive kernels, but this is inflexible and prevents further optimizations. Another approach is to manually write hardware intrinsics, which is error-prone and difficult for programmers. Some prior works tried to address this problem by creating compilers for each instruction. This requires excessive effort when it comes to many tensorized instructions.

In this work, we develop a compiler framework, UNIT, to unify the compilation for tensorized instructions. The key to this approach is a unified semantics abstraction which makes the integration of new instructions easy, and the reuse of the analysis and transformations possible. Tensorized instructions from different platforms can be compiled via UNIT with moderate effort for favorable performance. Given a tensorized instruction and a tensor operation, UNIT automatically detects the applicability of the instruction, transforms the loop organization of the operation, and rewrites the loop body to take advantage of the tensorized instruction. According to our evaluation, UNIT is able to target various mainstream hardware platforms. The generated end-to-end inference model achieves 1.3× speedup over Intel oneDNN on an x86 CPU, 1.75× speedup over Nvidia cuDNN on an Nvidia GPU, and 1.13× speedup over a carefully tuned TVM solution for ARM DOT on an ARM CPU.

I. INTRODUCTION
Dense tensor operations like matrix multiplication (Matmul) and convolution (Conv) have long been the workhorses in many domains, including deep learning workloads [14]. The popularity of deep learning means that aggressively optimizing these operations has a high payoff. Essentially, Matmul and Conv are a series of multiply-accumulate (MAC) operations, which perform accumulation over a number of elementwise multiplications.

To capture the reduction behavior and perform it more efficiently, recent general-purpose processors offer native tensor operation specialized instructions (hereinafter referred to as tensorized instructions), like Intel VNNI [2], Nvidia Tensor Core [5], and ARM DOT [1]. Unlike conventional SIMD instructions, after performing elementwise arithmetic operations, these instructions introduce a "horizontal computation" to accumulate the elementwise results. Further, tensorized instructions are often mixed-precision, meaning that the elementwise operations use less precise, lower-bitwidth operands (e.g., fp16 and int8), while accumulation occurs at higher bitwidth, where it is needed. This offers a good balance between data width and precision that is generally sufficient for deep learning workloads [24], [18], and enables the use of quantized data types.

Mixed precision is difficult to express in a single SIMD instruction, because the output vector width is different than the input vector width. In most ISAs this paradigm requires multiple SIMD instructions to express. In a tensorized instruction, by definition there are fewer outputs, so allocating more bitwidth to them in the output vector is natural. In addition, tensorized instructions sometimes reuse the same inputs multiple times, which reduces the required register file bandwidth. Overall, tensorized instructions offer significant advantages over SIMD for executing MACs.

While promising, the absence of appropriate compilation techniques limits the applicability of these tensorized instructions. Conventional SIMD instructions are vector instructions, so industry-standard compilers only try parallelizing the innermost loops. In addition, it is difficult for the high-level language programmer to express the compute flow in a tensorization-friendly way and hint the compiler to try tensorization upon a loop nest, because the dependency of reduction is more complicated and error-prone.

In practice, there are normally two options to leverage tensorized instructions. One way is to call the vendor-provided libraries such as Intel oneDNN [6], Nvidia cuBLAS and cuDNN [4], which provide highly optimized performance in some pre-defined single kernels using tensorized instructions [17], [42]. However, this also brings inflexibility when it comes to new workloads or when further performance exploitation is desired. The other option is to manually write assembly intrinsics, which sets a high bar for ordinary developers and hence lacks productivity. Some prior works tried to solve this problem by developing a compiler [35], [36] for each instruction. This requires too much effort when there are many tensorized instructions, both within and across hardware platforms.
Our Goal:
Although different processors may provide different tensorized instructions, in the context of deep learning workloads, we observe that these instructions essentially handle a similar compute pattern, i.e., elementwise multiplication and then horizontal accumulation. They primarily differ in the number of elementwise computation lanes and the accepted data types. Therefore, we aim to develop a unified approach to compile these tensorized instructions on multiple platforms to optimize the tensor operations in deep learning workloads. Our techniques are extensible to tensorized instructions with other data types and operations as well.
Challenges: There are several challenges to attain a unified compilation pipeline:
• Instruction Integration: Instead of building a new specialized compiler for each new instruction, it is desirable to create a unified and extensible compilation flow;
• Detecting the applicability: Given a tensorized instruction, a first question is whether and how this instruction can be applied to the target tensor operation, which may require loop reorganization to make it applicable;
• Code rewriting: When applicable, the compiler must determine how the loops involved should be rewritten by the tensorized instruction, and how the loops should be rearranged to achieve high performance.
Our Insight:
We envision that the key to addressing these three challenges is to have a unified semantics abstraction for tensorized instructions so that the analysis and transformation can also be unified.

This paper presents UNIT, an end-to-end compilation pipeline to surmount the above three challenges. UNIT takes the tensorized instructions (e.g., Intel VNNI instructions on CPUs, or Nvidia Tensor Core instructions on GPUs) and a deep learning model as input, lowers the tensor operations of the model into loop-based IRs to identify the tensorizable components, and inserts the tensorized instructions by transforming and rewriting the loop. It achieves high performance for tensor operations, and consequently, model inference. To the best of our knowledge, this is the first work to tackle tensorized instruction compilation and optimization with a unified solution. UNIT not only achieves high performance for single tensor operations, but also provides desirable model inference latency in practice.
Key Results:
According to our evaluation, UNIT is expressive enough to target many tensorized instructions on multiple hardware platforms, including Intel VNNI, Nvidia Tensor Core, and ARM DOT. The generated programs for end-to-end model inference are 1.3× and 1.75× faster than the solutions backed by Intel oneDNN and Nvidia cuDNN on CPU and GPU, respectively. In addition, UNIT can be extended to new tensorized instructions with moderate effort. Although we designed UNIT to target Intel CPUs and Nvidia GPUs, on an ARM Cortex A-72 CPU with DOT instructions, UNIT achieves up to 1.13× speedup against a carefully manually tuned solution.

To sum up, our contribution is an end-to-end compilation pipeline of tensorized instructions for deep learning workloads, which includes:
• A unified abstraction for tensorized instructions.
• An algorithm that detects the applicability of these tensorized instructions.
• A rewriting and tuning mechanism that looks for favorable loop transformations of the tensor operations to plug in the tensorized instructions for high performance.

Fig. 1. Performance comparison on Nvidia V100-SXM2 between fp32 and fp16 without mixed precision instruction support.
Paper Organization:
We first introduce the background and challenges of tensorized compilation in Section II. The design of UNIT is presented in Section III. We explain the implementation details in Section IV. We clarify our experiment methodology in Section V, and evaluate our work in Section VI. Finally, we discuss the related work in Section VII.

II. BACKGROUND
UNIT is an end-to-end compilation pipeline capable of automatically mapping tensorized instructions to deep learning tensor operations. It defines the tensorized instruction's semantics using a suitable intermediate representation (IR) and inserts them in the proper places of the program of tensor operations. In this section, we give an overview of popular mixed precision tensorized instructions, followed by the limitations of existing solutions in the automatic mapping of these tensorized instructions. Finally, we discuss the background of the tensor domain specific language and the multi-level intermediate representation.
A. Mixed Precision Tensorized Instructions
Deep learning is computationally expensive, requiring substantial compute and memory resources. As deep learning becomes more pervasive, researchers are designing both software and hardware techniques to reduce the compute and memory burden. A widely adopted approach in this context is using mixed precision for expensive operations, e.g., convolution or dense operations [24], [18]. In practice, this means representing 32-bit floating point (fp32) operands with a lower bitwidth datatype - 16-bit floating point numbers (fp16) or 8/16-bit integer numbers (int8, int16). To keep the accuracy in check, it is helpful to accumulate the results in higher precision (fp32 or int32). This type of mixed precision computation is often called quantization for integer values [18]. In this paper, we will always use the term mixed precision for brevity.

While using mixed precision data types reduces memory footprint, it might not necessarily lead to performance improvement. To investigate this, we conducted an experiment to compare the performance of Nvidia cuDNN for fp16 and fp32 in the absence of Nvidia mixed precision tensorized instructions (Tensor Core). As shown in Figure 1, we observe that blindly using mixed precision leads to substantial slowdown because of the overhead of casting between the two data types.

Therefore, mainstream hardware vendors (Intel, ARM, and Nvidia) have introduced mixed precision tensorized instructions to achieve better performance. These instructions add mixed precision arithmetic support where operands are of lower precision while the accumulation happens in higher precision, potentially leading to 2×-4× speedup. The most popular examples of these tensorized instructions are Intel VNNI, ARM DOT, and Nvidia Tensor Core. We will discuss the semantics of these operations in Section III.

Hardware vendors have a long history of adding new instructions to accelerate important applications. However, the mixed precision tensorized instructions introduce a unique idiom - horizontal accumulation. These tensorized instructions typically conduct a sequence of elementwise multiplications governed by a memory access pattern, followed by a horizontal accumulation. The accumulation is termed horizontal because all values to be accumulated are present in the same vector register. For example, as shown in Figure 2(a), Intel VNNI executes a dot product of two vectors, each having 4 int8 elements, while performing the accumulation in int32. We observe a similar pattern, though with different numbers of entries and data types, for Nvidia Tensor Core (Figure 2(b)) and ARM DOT instructions (omitted, because it is similar to VNNI).

Fig. 2. The semantics of Intel VNNI and Nvidia Tensor Core. The text beside each is the name of the corresponding LLVM intrinsic.
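To make the horizontal-accumulation idiom concrete, the following is a minimal scalar sketch of a VNNI-style dot-product instruction with the lane counts of Figure 2(a); the plain-Python form and the function name are ours, for illustration only.

    def dot_accumulate_16x4(c, a, b):
        """Scalar model of a 16-lane mixed-precision dot-product instruction.
        a holds 64 unsigned 8-bit values, b holds 64 signed 8-bit values, and
        c holds 16 signed 32-bit accumulators. Each output lane reduces four
        elementwise products and accumulates them in 32-bit precision."""
        d = [0] * 16
        for i in range(16):                # data-parallel lanes
            acc = c[i]
            for j in range(4):             # horizontal reduction within a lane
                acc += int(a[i * 4 + j]) * int(b[i * 4 + j])
            d[i] = acc
        return d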
B. Limitations of Existing Solutions

Though tensorized instructions seem promising, their adoption pace is limited by the absence of an automatic technique that can detect and use these instructions seamlessly. Currently, their usage in the deep learning domain is limited to hardware vendor libraries like Intel oneDNN and Nvidia cuDNN, which may provide high performance for the pre-defined operations but are inflexible, as discussed in Section I. Similarly, conventional loop vectorizers find it hard to exploit the profitability of these tensorized instructions, as they are not designed to work with the horizontal reduction idiom. Conventional loop vectorizers in general-purpose compilers like GCC and LLVM mainly focus on either analyzing the innermost loop body or combining instructions in the unrolled loop bodies. When it comes to the horizontal reduction idiom, these compilers often reorder the computation and generate an epilogue reduction, preventing us from using the tensorized instructions.

There have been some recent works on compiling programs to leverage tensorized instructions. PolyDL [36] generates CPU programs for convolution kernels in neural networks that call a GEMM micro-kernel using Intel VNNI instructions. Bhaskaracharya et al. [35] generate CUDA programs for matrix computation leveraging Nvidia Tensor Core. However, these works are limited to one platform and its specific instruction, which lacks generalizability. A generic solution to handle tensorized instructions from multiple platforms together is still missing.
C. Multi-Level Intermediate Representation
Compilers often have multiple levels of intermediate representation (IR) to express the program; each level is designed to enable different analyses and transformations. In this section, we describe the background of a tensor domain specific language (DSL) and the multi-level IR.
1) Graph-Level IR:
Deep learning compilers like TVM [10], Glow [34], and XLA [41] adopt a graph-level IR to represent a deep learning model as a directed acyclic graph (DAG) of operations. This graph-level IR is useful for inter-tensor-operation optimization, like tensor shape padding, operation fusion, and choosing the proper data layout [23]. Our tensorized analysis relies on tensor padding so that loops can be tiled by the number of lanes of the instruction perfectly. However, this IR has little knowledge about the implementation of each tensor operation. When compiling a graph-level IR, each node of the DAG will be dispatched to its implementation in the tensor DSL, as explained next.
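The shape padding mentioned above is a simple rounding computation at the graph level; the snippet below is an illustrative sketch (the numbers are hypothetical, not taken from the paper).

    def pad_to_lanes(extent, lanes):
        """Round a loop extent up to a multiple of the instruction lane count."""
        return (extent + lanes - 1) // lanes * lanes

    # e.g., a 147-channel tensor targeting a 16-lane instruction is padded to
    # 160 channels, so the tensorized loop tiles the dimension perfectly.
    assert pad_to_lanes(147, 16) == 160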
2) Tensor DSL:
Tensor domain-specific languages, like Halide [31], TVM [10], and Tensor Comprehensions [37], have been developed to productively and portably express tensor programs while enabling efficient performance tuning. As shown in Figure 4 and Figure 5, programs written in tensor DSLs follow this paradigm: users first declare the tensors and the loop variables, and then the computation is described by expressions involving the declared tensors and loop variables. These DSLs also provide interfaces to split, reorder, and annotate loops without affecting the computation semantics, for performance tuning. All the information gathered from the tensor DSL frontend is stored in a tensor Op data structure, including the declared tensors, loop variables, expressions, and loop manipulation.
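As a concrete instance of this paradigm, the following is a small matrix multiply written against TVM's te API; it is a sketch of the general tensor-DSL style described above (the shapes, names, and tiling factor are arbitrary), not code taken from UNIT.

    import tvm
    from tvm import te

    # Declare the tensors and the reduction loop variable.
    M = N = K = 1024
    A = te.placeholder((M, K), dtype="int8", name="A")
    B = te.placeholder((K, N), dtype="int8", name="B")
    k = te.reduce_axis((0, K), name="k")

    # Describe the computation: a mixed-precision matmul accumulating in int32.
    C = te.compute(
        (M, N),
        lambda i, j: te.sum(A[i, k].astype("int32") * B[k, j].astype("int32"), axis=k),
        name="C")

    # Loop manipulation (split/reorder) tunes performance without changing semantics.
    s = te.create_schedule(C.op)
    i, j = s[C].op.axis
    jo, ji = s[C].split(j, factor=16)   # tile j by a lane count such as 16
    s[C].reorder(i, jo, ji, k)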
3) Tensor IR:
Each tensor Op is then lowered to Tensor IR, which is an imperative program IR with additional constraints: all the loops are canonical (starting from 0, and increased by 1 each time), and all the array operations are restricted (i.e., an element cannot be accessed by two different pointers). These two properties enable making strong assumptions for analysis and transformation. Our work conducts analysis at the tensor Op data structure level and then performs transformation on the tensor IR. Although the tensor IR provides essentially identical information for analysis, as discussed above, it is easier to reorganize the loops via the tensor Op data structure.

Fig. 3. The overview of our framework, UNIT.
4) Low-Level IR:
The tensor IR is lowered to a general-purpose low-level IR like LLVM, after all the specialized analyses and transformations on the tensor IR are done, to get ready for assembly code generation.

III. UNIFIED TENSORIZATION
Our goal is to automatically tensorize (we coin this word to mean rewriting and optimizing a given code with the tensorized instruction) mixed-precision deep learning tensor operations across a variety of hardware platforms. We resolve the challenges discussed in Section I by presenting UNIT with the following techniques:
1) Tensorized Instruction in Tensor DSL: To abstract the diverse tensorized instructions on different hardware platforms, we leverage the existing tensor DSL to represent their semantics.
2) Applicability Inspection: To determine if and how a tensorized instruction can be applied to a tensor operation, we developed an analysis pass in the Inspector component of UNIT, which analyzes the tensor Op data structure of both the instruction and the operation. The result of the analysis guides the loop reorganization and instruction injection.
3) Code Rewriter: Once the tensorized instruction is determined applicable, the Rewriter reorganizes the loop nests in accordance with the Inspector so that the innermost loop nests resemble the tensorized instruction and are ready to be replaced. Finally, it sets up the tuning space for the remaining loop nests to exploit high performance.
These components of UNIT together enable a unified compilation flow that simplifies the mapping of tensorized instructions across a variety of hardware platforms. In the rest of this section, the details of each of the above steps are discussed.
A. Semantics Abstraction - Tensor DSL
In order to unify the compilation of tensorized instructions from different platforms and keep the system open to integrating new instructions, the first question to answer is how to have a unified description of the semantics of tensorized instructions.

(a) Intel VNNI, x86.avx512.pbpdusd:
    a, b = tensor((64,), u8), tensor((64,), i8)
    c = tensor((16,), i32)
    i, j = loop_axis(0, 16), reduce_axis(0, 4)
    d[i] = c[i] + sum(i32(a[i*4+j]) * i32(b[i*4+j]))

(b) ARM DOT, arm.neon.sdot.v4i32.v16i8:
    a, b = tensor((16,), i8), tensor((16,), i8)
    c = tensor((4,), i32)
    i, j = loop_axis(0, 4), reduce_axis(0, 4)
    d[i] = c[i] + sum(i32(a[i*4+j]) * i32(b[i*4+j]))

(c) Nvidia Tensor Core, nvvm.wmma.m16n16k16.mma.row.row.f32.f32:
    a, b = tensor((16,16), fp16), tensor((16,16), fp16)
    i, j = loop_axis(0, 16), loop_axis(0, 16)
    k = reduce_axis(0, 16)
    c[i,j] += fp32(a[i,k]) * fp32(b[k,j])

Fig. 4. Tensorized instructions as abstracted in the tensor DSL.
As explained in Section II, we employ the ubiquitous tensor DSL and tensor IR to solve the abstraction problem. All mixed precision tensorized instructions perform some elementwise operations on vectors, followed by a horizontal reduction. Each tensorized instruction, therefore, can be regarded as a small tensor operation program written in the tensor DSL.

Figure 4(a) shows how an Intel VNNI instruction is described in the tensor DSL. The three source operands of Intel VNNI are 512-bit registers. Two of them are 64 lanes of unsigned 8-bit integers (uint8) and signed 8-bit integers (int8), and the other one is 16 lanes of signed 32-bit integers (int32), which correspond to the tensors a, b, c we defined. The arithmetic behavior is defined by the loop variables and the expression of d[i]. Here we annotate that loop i is data parallel, since these 16 elements are independent from each other; loop j is a reduction, since for every independent element it sums up 4 elements along this loop. A similar loop pattern appears in the other tensor operations shown in Figure 5. The description of ARM DOT, shown in Figure 4(b), is similar to Intel VNNI, with a different number of lanes and data types.

Nvidia Tensor Core, on the other hand, performs a square matrix multiplication as shown in Figure 4(c). Compared with (a) and (b), a key difference is that it requires the accumulator register to be the same as the addition register (note the +=). This is due to the data type opaqueness of the Tensor Core instruction, which prevents us from giving arbitrary initial values for the accumulators.

We describe the semantics of each tensorized instruction in the tensor DSL. The deep learning compiler pipeline parses the operation into a tensor Op, which preserves tensor information like the expression tree, the loop trip count, and the array buffers. This information is essential for the analysis and transformation passes in the Inspector and Rewriter.

B. Applicability Detection - Inspector
To determine if a tensorized instruction can be applied to a tensor operation, the Inspector pass uses a two-step approach. It first determines if (part of) the tensor operation program and the instruction can be arithmetically equivalent by checking a form of isomorphism between their associated expression trees. After that, it inspects the data access pattern to confirm that the assembly operands can be prepared, so as to guide the Rewriter transformation.

    // Convolution in tensor DSL
    a, b = tensor((H,W,C), u8), tensor((R,S,K,C), i8)
    k, rc = loop_axis(0,K), reduce_axis(0,C)
    x, y = loop_axis(0,H-R+1), loop_axis(0,W-S+1)
    r, s = reduce_axis(0,R), reduce_axis(0,S)
    c[x,y,k] += i32(a[x+r,y+s,rc]) * i32(b[r,s,k,rc])

Fig. 5. An example of applying Intel VNNI to Conv using UNIT: (a) the convolution in the tensor DSL (above); (b) the arithmetic and array-access isomorphism checks; (c) the loops reorganized in DSL primitives.

1) Compute Isomorphism: Algorithm 1 shows the algorithm we adopt to determine the isomorphism of two expression trees. It recursively traverses both trees and matches the data type and opcode of each pair of nodes. Figure 5(b).1 shows that the two trees of convolution and pbpdusd (an Intel VNNI instruction) have exactly the same topology and data types, so these two programs are arithmetically isomorphic.

    function INSPECT(a, b)
      if a.type = b.type then
        if isleaf(a) ∧ isleaf(b) then
          if a is not bound then
            bind[a] := b
          else if bind[a] ≠ b then
            return False
          end if
          return True
        else if isarith(a) ∧ isarith(b) then
          cond := a.opcode = b.opcode
          cond := cond ∧ INSPECT(a.lhs, b.lhs)
          cond := cond ∧ INSPECT(a.rhs, b.rhs)
          return cond
        end if
      end if
      return False
    end function

Algorithm 1: Determine the isomorphism between expression trees. a is for the instruction, and b is for the operation.

This analysis also finds a mapping from the operands in the tensor program to the operands in the tensorized instruction. As we explained, tensor operands in the tensorized instruction are the abstraction of registers. Therefore, a register cannot correspond to multiple data sources. This property still requires further checks, which will be explained in the next subsection.

2) Array Access Isomorphism: Once compute isomorphism is determined, the next concern is how the data are fed to this instruction. The enforcement explained in the last subsection already ensures that each register operand corresponds to only one array in the tensor operation. On top of this, we need to determine that each element in the operand tensor corresponds to only one memory address in the tensor program when mapping to the tensorized instruction. To map a tensor program to a tensorized instruction, we need to know which loop levels are tensorized. We enumerate the loop levels to be tensorized, and these loop levels will be mapped to loops in the tensorized instruction. Note that only loops with the same annotation (data parallel or reduction) can be mapped to each other.
Then we check if this enumerated mapping is feasible by scanning each pair of operand correspondences determined in the last subsection. If the operand in the tensor program is a constant, we just skip it (if it is a constant, the correspondence was already checked during compute isomorphism: this register corresponds to this constant). If the operand is a memory operation, we inspect the index expressions of both memory operations in the operation and the instruction. We define:
• A is the set of loop variables to be mapped to the tensorized instruction.
• B is the set of loop variables of the tensorized instruction.
• f : A → B is the mapping we enumerate.
• S(u) := { x | x is a loop variable in the index expression u }
• S'(u) := { f(x) | x ∈ S(u) ∩ A }
A mapping is considered feasible if every pair of memory operations' index expressions (u, v), where u is from the operation and v is from the instruction, satisfies S'(u) ⊆ S(v). Figure 5(b).2 shows an example of this inspection. If S'(u) is a proper subset of S(v), the data loaded by the tensor operation should be broadcast along the instruction loop variables that do not appear in S'(u) to fill all the register lanes. If the condition does not hold, each register lane would correspond to multiple memory addresses under this mapping, which is not realistic for code generation, so we should try another enumeration.

If there are multiple feasible mappings, we leave this as a dimension of the code tuning space. Once the mapping is determined, it will guide the further loop transformation and code generation.
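The feasibility test above is a small set computation; the following Python sketch illustrates it (the data structures and names are ours, for illustration; UNIT implements this analysis over TVM's tensor Op).

    def mapping_is_feasible(operand_pairs, f, A):
        """Check the array-access condition S'(u) ⊆ S(v) for every operand pair.
        operand_pairs: (u_vars, v_vars) pairs, where u_vars are the loop variables
          of an index expression in the tensor operation and v_vars are the loop
          variables of the matched index expression in the tensorized instruction.
        f: the enumerated mapping from operation loop variables (in A) to
          instruction loop variables."""
        for u_vars, v_vars in operand_pairs:
            s_prime = {f[x] for x in u_vars if x in A}
            if not s_prime <= v_vars:   # a register lane would need several addresses
                return False
        return True

    # The mapping of Figure 5: k -> i and rc -> j is feasible, because
    # S'({x,y,k}) = {i} ⊆ {i}, S'({x,y,rc}) = {j} ⊆ {i,j}, and
    # S'({r,s,k,rc}) = {i,j} ⊆ {i,j}.
    f, A = {"k": "i", "rc": "j"}, {"k", "rc"}
    pairs = [({"x", "y", "k"}, {"i"}),
             ({"x", "y", "rc"}, {"i", "j"}),
             ({"r", "s", "k", "rc"}, {"i", "j"})]
    assert mapping_is_feasible(pairs, f, A)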
C. Code Transformation - Rewriter

There are three phases in the code transformation: loop reorganization, tensorized instruction replacement, and tuning.

1) Loop Reorganization: As discussed in Section III-B, the Inspector selects the loop levels to be executed by the given instruction. To get poised for code generation, as shown in Figure 5(c), we need to tile these loops and reorder them to the innermost loop levels so that those innermost loops perform exactly the same semantics as the instruction. As we explained, the tensor DSL provides the capability to reorganize the loop nests easily.

2) Tensorized Instruction Replacement: After identifying the code region to be replaced by a tensorized instruction, the code generator should prepare each operand of this instruction. It is difficult to fully automate the operand preparation for different platforms because of their diverse execution models and assembly formats. Therefore, we formalize a unified programming interface for compiler developers to manually specify the rule of operand generation. In this interface, each loop variable to be replaced, and its coefficients in the index expression, are exposed. For example, as shown in Figure 5(c), by analyzing the strides and trip counts of ki and ci, the array access c[x,y,c] will be transformed to a 16-lane vector; a[x,y,rc] will be vectorized along ci by 4, and broadcast along ki by 16; b[r,s,k,c] will be vectorized along ci by 4, and unrolled and concatenated along ki.

3) Tuner: All the other loop levels that are not involved in the instruction rewriting can be reorganized to tune the performance. Here, we develop strategies to optimize the performance of tensor programs on both CPU and GPU. The generic philosophy is to exploit both fine- and coarse-grained parallelism. We also developed specialized strategies because of the different execution models and memory hierarchies.

a) CPU Tuning: On CPU, data-parallel loops are distributed to multiple threads to achieve coarse-grained parallelism. On the other hand, the loop-carried dependences in reduction loops introduce RAW hazards in the execution pipeline. To avoid this penalty, and to achieve instruction-level parallelism, we reorder and unroll a small number of data-parallel loops below the innermost reduction loop. The tuning space on CPU involves two dimensions, the degree of unrolling and the degree of parallelization. We enumerate these two parameters and profile the execution time to search for the best combination.
If the unrolling degree is too small, there will not be enough independent instructions to fill the idle penalty cycles caused by RAW hazards. If it is too large, it will cause I-cache misses. Similarly, the number of threads can be neither too few nor too many. If there are too few, the computing cores would have insufficient utilization and memory latency would not be hidden. Too many threads introduce context switching overhead. We rely on the tuning process to look for the best combination.

b) GPU Tuning: On GPU, coarse-grained parallelism is achieved by distributing the data-parallel loops across the streaming multiprocessors. Similar to CPU, fine-grained parallelism is also achieved by reordering and unrolling a small degree of data-parallel loops to avoid the pipeline penalty caused by loop-carried dependences. Moreover, on GPU, data reuse is explicitly managed by the software. Therefore, as shown in Figure 6, we adopt an outer-product style matrix multiply accumulation to reuse the buffered submatrices.

Fig. 6. Accumulating a p × p "square window" avoids loop-carried data dependences, and reuses buffered submatrices.

Besides the generic optimization, we also developed optimization mechanisms specialized for DNN kernels. Among popular DNN models, there are many layers with relatively small width and height and deep channels. We apply dimension fusion to layers with small width and height: these two dimensions are fused into one to save the redundant padding. In addition, we apply split reduction to layers with deep channels. For a reduction loop with a large trip count, we can split it and parallelize each split segment on threadIdx. After all the segments are done, we synchronize the threads and reduce the split segments in the shared memory.
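To summarize the CPU strategy above, the following is a rough sketch of the loop structure UNIT aims for on a matmul-like operation (loop names, tiling factors, and the scalar modeling of the intrinsic are illustrative; Figure 7 later shows the paper's own code sketch).

    # c[n][m] += a[n][k] * b[k][m]; assumes M divisible by lanes*unroll, K by red.
    def tuned_matmul(a, b, c, N, M, K, lanes=16, red=4, unroll=4):
        for n in range(N):                    # data-parallel loop; in generated code
            for mo in range(M // (lanes * unroll)):   # it is distributed over threads
                for ko in range(K // red):            # reduction loop
                    for u in range(unroll):           # unrolled data-parallel loop:
                        m0 = (mo * unroll + u) * lanes # independent MACs hide RAW hazards
                        for lane in range(lanes):      # one tensorized instruction
                            acc = c[n][m0 + lane]
                            for r in range(red):
                                acc += a[n][ko * red + r] * b[ko * red + r][m0 + lane]
                            c[n][m0 + lane] = acc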
IV. IMPLEMENTATION

In this section, we discuss the technical details of our implementation. UNIT is implemented by extending Apache TVM [10], a full-stack deep learning compiler, with tensorized instruction support. We leverage TVM's tensor DSL, tensor Op, and tensor IR infrastructure, and its tuning mechanisms [11], [23], to generate high performance kernels. In addition, implementing UNIT on top of TVM enables end-to-end model inference with other optimizations, such as operator fusion, in addition to tensorization.

A. Inspector

The Inspector pass is implemented by analyzing TVM's ComputeOp data structure. It matches the expression trees of both the instruction and the program and enumerates mappings between the loop variables. We enumerate the loops from the tensor's innermost dimension to its outermost dimension, and greedily return the first eligible mapping because of the better potential data locality of inner dimensions. The enumerated mapping provides us with the correspondence of loop variables between the instruction and the tensor operation.

B. Rewriter

The following steps are performed by the Rewriter:
1) According to the loop correspondence analyzed by the Inspector, we reorganize the loops to be tensorized by tiling these loops by the trip counts of the corresponding loops in the instruction, and reorder them to be the innermost loops. These loops are annotated with a tensorize pragma to hint the instruction injection.
2) Based on the strategies discussed in Section III-C, we reorganize the loops not involved in instruction rewriting to tune the performance.
3) We lower the manipulated loop nest to the tensor IR, and replace the loop body annotated with the tensorize pragma with the target instructions, as shown in Figure 5(c).
Steps 1 and 2 are achieved by invoking TVM scheduling primitives at the tensor DSL level, and step 3 is a tensor IR transformation pass.

Next, we discuss the implementation of the tuning strategies discussed in the last section.

CPU Tuning: The code sketch of tuned CPU code is shown in Figure 7. To implement the tuning we discussed in Section III-C, we enumerate two breaking points on the data-parallel loop nest, which define how the loop levels are parallelized and unrolled. A breaking point is defined by a loop level and a tiling factor, giving more flexibility to the division. Loops before the first breaking point will be fused and parallelized. Loops between the two points will be executed in serialized order. Loops after the second breaking point will be reordered to the innermost and unrolled.

Fig. 7. The code sketch of CPU tuning.

GPU Tuning: As discussed in the last paragraph of Section III-C, both coarse-grained and fine-grained parallelism optimizations are applied on data-parallel loops, so there is a tradeoff between them: data reuse is increased by increasing the unrolling degree (each buffered submatrix is reused p times), but the coarse-grained parallelism is decreased. Also, a large unrolling degree may overwhelm the register resources. Therefore, the key to the generic optimization is to choose a proper unrolling degree.

On the other hand, greedily applying each specialized optimization does not always improve the performance. Though dimension fusion may save memory traffic, it also introduces software overhead for data rearrangement. Similarly, though splitting the reduction loop introduces more parallelism, it also introduces thread synchronization overhead and register pressure. We enumerate each parameter, including the degree of reduction parallelization and whether to fuse the width and height dimensions, and then apply these transformations to the program and profile the performance to determine which transformation leads to the best performance.

V. METHODOLOGY

A. Target Hardware Platforms

We assess UNIT on three hardware platforms:
Intel x86 CPU: We use an Amazon EC2 C5.12xlarge instance as our x86 platform, with a 24-core Intel Xeon Platinum 8275CL CPU @3.00GHz (codename: Cascade Lake) and 96GB memory.
ARM CPU: We use an Amazon EC2 M6g.8xlarge instance as our ARM platform, with an AWS Graviton2 CPU, which features a 32-core ARM Cortex-A72 CPU @2.30GHz and 128GB memory.
Nvidia GPU: We use an Amazon EC2 P3.2xlarge instance as our GPU platform, with an Nvidia Tesla V100 SXM2 GPU that has 16GB of memory.

B. Software Frameworks

Code Generation: All programs implemented in Apache TVM are emitted to LLVM IR for code generation. We choose LLVM-10 as our backend, and to be compatible, we use CUDA-10.0 as the NVPTX linker and runtime.

Fig. 8. Quantized network inference (bs=1) accelerated by Intel VNNI.
Baseline: We use vendor-provided libraries for the baseline performance of operators whenever possible. Specifically, Intel oneDNN v1.6.1 and Nvidia cuDNN 7.6.5 are used as our CPU and GPU baselines, respectively. For end-to-end model inference, we looked for the best available solutions with those libraries, which are MXNet integrated with oneDNN for CPU and TVM integrated with cuDNN for GPU. Another set of baselines is the manually written implementations. To this end, we use the existing TVM solutions for Intel and ARM CPUs, which involve heavy engineering effort to carefully write intrinsics to use Intel VNNI and ARM DOT instructions. We did not find a manually written Tensor Core implementation that covers our evaluated workloads.

C. Workloads

DNN Models: All DNN models are from the MXNet Model Zoo and converted to TVM's graph IR, Relay [32], for quantization [19], layout transformation, and data padding. All these models adopt the NCHW[x]c data layout [23] for the data and KCRS[y]k[x]c for the kernel. Here N denotes the batch size, C denotes the input channels, H and W are the width and height of the input image, and [x]c denotes that the original C is split by x. Similarly, K denotes the number of output channels, R and S are the height and width of the kernel, and [y]k denotes that the original dimension K is split by y. [x] equals the number of lanes of the instruction output, and [y] equals the width of the reduction.

In the evaluation, we target the N=1 cases, because they are hard to optimize but critical for inference use cases. Compared with batched cases where N>1, we cannot reuse the kernel tensor across samples, or exploit the parallelism brought by the data-parallel batching dimension.

VI. EVALUATION

Our evaluation of UNIT attempts to answer these questions:
1) What is the performance of end-to-end deep learning model inference powered by tensorized instructions?
2) How does each optimization technique that UNIT uses impact the performance?
3) Can UNIT be extended to support new hardware platforms and tensor operations?

A. End-to-End Performance

In this subsection, we show UNIT's end-to-end effectiveness on Intel x86 and Nvidia GPU processors for tensorizing mixed precision instructions. For the Intel x86 experiments, we use MXNet integrated with Intel oneDNN (referred to as MXNet-oneDNN) as the baseline. Another comparison of ours is TVM with manually written schedules using Intel's VNNI instruction. The findings of this experiment are shown in Figure 8. We observe that UNIT achieves significant speedup compared to MXNet-oneDNN.

Fig. 9. Mixed precision network inference (bs=1) accelerated by Tensor Core.
Fig. 10. The performance impact of the code space exploration on CPU (oneDNN baseline; Parallel, +Unroll, +Tune).
Fig. 11. The performance impact of the code space exploration on GPU (cuDNN baseline; Generic, +FuseDim, +SplitK, +Tune).

TABLE I. CHARACTERISTICS OF THE SELECTED CONVOLUTION LAYERS.
Layer:   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
C:     288  160   80  128  192  256  128  576   96  576   64   64  608
IHW:    35    9    7   73   16   16   16   14   16   14   16   14   14   29   56   14
K:     384  224  192  192  128  192  256  512  160  192  128  256  128   96  128  192
R=S:     3    3    1    3    3    3    3    1    3    1    3    1    1    3    1    1
Stride:  2    1    1    1    1    1    1    1    1    1    1    1    1    1    2    1
OHW:    17    7    7   71   14   14   14   14   14   14   14   14   14   27   28   14
Note that Intel oneDNN has access to manually written schedules that have been aggressively optimized and tuned by domain experts. We also observe that TVM overall achieves better performance than MXNet-oneDNN, but has suboptimal performance on resnet50 and resnet50b, which were heavily tuned by oneDNN engineers. On the other hand, UNIT outperforms both baselines, by 1.3× over MXNet-oneDNN and by 1.18× over TVM.

Next, we test the efficacy of UNIT in utilizing Nvidia Tensor Core instructions for Nvidia GPUs. For the baseline, we integrate TVM with cuDNN, which has access to manually written, aggressively tuned Tensor Core schedules. The findings of this experiment are shown in Figure 9. We observe that UNIT consistently achieves better performance than cuDNN, with a mean speedup of 1.75× and up to 2.2×.

B. Optimization Implications

In this subsection, we focus on the convolution operators of the DNN models to perform an in-depth analysis of the impact of the different optimization techniques used by UNIT's Rewriter. This is essentially an ablation study, showing how important different parts of UNIT are. There are 148 different convolution workloads (i.e., convolutions with different feature map sizes, kernel sizes, strides, etc.) in the models, out of which we choose 16 representative convolution layers. These kernels cover diverse input shapes and strides. Other workloads behave similarly in the ablation study. We summarize the characteristics, namely, convolution attributes like shapes and strides, of the selected workloads in Table I.

Intel x86 servers: As we discussed in Section III-C, we have two breaking points in CPU scheduling. The loop nests before the first breaking point are parallelized and the loop nests after the second breaking point are unrolled, while the ones in between the breaking points are executed serially. As loop nests can either be parallelized or unrolled (the remaining ones are serialized), we have a search space represented by the tuning pairs. The Rewriter tunes this search space to generate a high-performance kernel. In this experiment, we incrementally measure the performance improvements brought by parallelizing, unrolling, and tuning. The findings of this experiment are shown in Figure 10, normalizing the speedup to the Intel oneDNN execution latency.

First, we fuse outer loop nests up to a loop-bound limit of 3000 and measure this performance (shown by Parallel). Then, we take the remaining loop nests, and tile and unroll them with an unrolling factor of at most 8, and measure this performance (shown by +Unroll). Finally, instead of setting the limits as 3000 and 8, we tune the search space and measure the performance (shown by +Tune), getting the final latency UNIT achieves. We observe that Parallel and Unroll together are responsible for most of the speedup. The additional speedup introduced by tuning is quite small. It turns out that more than half of the kernels reach their optimal performance on the first tuning pair (i.e., 3000 and 8), and more than 95% of the kernels reach their optimal performance within the first 8 tuning pairs.

Nvidia GPU servers: As discussed in Section III-C, we employ three optimizations on GPU: generic coarse- and fine-grained parallelism, fusing width and height to save memory bandwidth, and parallelizing the reduction dimension.
In this subsection, we study the impact of these optimizations on the performance. We show the findings in Figure 11, normalizing the speedup to Nvidia cuDNN.

According to our evaluation, any unrolling degree (p in Figure 6) larger than 2 may overwhelm the registers, so we use p=2 to apply the generic optimization. The generic optimization already beats cuDNN in most cases (shown by Generic). Then, depending on the height and width values, the Rewriter fuses the height and width dimensions to save memory bandwidth (shown by +FuseDim). Then, we split the reduction dimension K by 64 and measure the performance (+SplitK). Finally, we let the Rewriter choose the sizes for these three optimizations and measure performance (shown by +Tune). We observe that SplitK leads to the maximal speedup, as it leads to significant parallelism and keeps the Tensor Cores busy. More than 70% of the kernels can get high performance by employing fusion and parallelizing the reduction dimension. Similar to CPUs, the additional speedup from tuning is small.

C. Extensibility

We evaluate the extensibility of UNIT in two aspects: to new hardware platforms and to new deep learning tensor operations. We observe that by just representing the semantics of the new tensorized instruction in the tensor DSL, UNIT can easily extend to new tensorized instructions and tensor operations.

New Hardware Platforms. To demonstrate the capability of extending to new hardware platforms, we apply UNIT to an ARM CPU supporting the ARM DOT instruction. To the best of our knowledge, there is a lack of a deep learning framework with well-integrated ARM backend library support. In the absence of a framework baseline, we choose TVM compiling to ARM Neon assembly as the baseline (shown by TVM-NEON). Additionally, we find that TVM has manually written schedules using ARM DOT instructions, which form our second comparison baseline (shown by TVM-Manual). Note that in contrast to UNIT's automatic approach, this is a manually written schedule requiring intense engineering effort. Finally, we represent the semantics of the ARM DOT instruction in UNIT's tensor DSL and use UNIT to compile the models. The findings of this experiment are shown in Figure 12, showing normalized speedup compared to the TVM-NEON baseline. The results show that UNIT consistently outperforms both TVM-NEON and TVM-Manual, proving UNIT's effectiveness in extending to new hardware platforms.

3D Convolution. We test UNIT on the 3D convolution operation for mapping Intel VNNI tensorized instructions. Note that this does not require any changes from UNIT's perspective; we are just giving a new input (the tensor-level IR for conv3d) to UNIT. To evaluate this extensibility, we take all the 2D convolutions from Resnet18 and manually convert them to 3D convolutions. We then apply UNIT on these kernels and show the speedup compared to the oneDNN baseline in Figure 13. We observe that UNIT easily extends to 3D convolution, as it has comparable performance for many convolution kernels, with an average of 1.2× speedup.

Fig. 12. The performance of ARM on model inference.
Fig. 13. The performance of each layer on res18-3d.
VII. RELATED WORK

Compilation support for hardware intrinsics: There exists a large body of literature on compilation support for various hardware intrinsics [33], [20], [27], [29], [16], [22], [28], [15], [36], [35]. Existing production compilers such as GCC and LLVM implement auto-vectorization to leverage SIMD intrinsics. Prior works such as [20], [33] propose various approaches to further improve the performance of the auto-vectorizer. These approaches cannot be extended to support tensor computation intrinsics, which introduce "horizontal computation" within each lane. TVM [10] implements an extensible interface to support new hardware intrinsics that are not limited to SIMD instructions. However, programmers need to transform the program to match the behavior of the intrinsics and declare the lowering rule for the intrinsics prior to compilation. TVM will then match the computation and replace it with the code snippets that call the target hardware intrinsics. Compared to TVM, UNIT performs the code detection and transformation automatically. This achieves higher flexibility and productivity. There are some prior works that, similar to UNIT, also perform program transformation and code generation automatically for tensor computation [36], [35]. However, these are limited to one platform or certain intrinsics and hence are not as flexible as UNIT.

Decoupled computation and data access: The analysis pass of UNIT is inspired by decoupled access-execute (DAE) architectures [21], [30], [26], [40], [13]. Computation and data access are decoupled and specialized separately: the computation is offloaded onto a programmable datapath, and data access is encoded in hardware intrinsics and executed on a specialized address generation unit (AGU). UNIT adopts a reversed approach: it matches computation on a fixed datapath, and analyzes the data access fed to that datapath.

Polyhedral model: Many prior works have built program analysis and transformation frameworks based on the polyhedral model for tensor programs [20], [36], [35], [15], [37], [12], [25], [38]. Loop Tactics [9] is one representative work, which matches pre-defined computation patterns in the polyhedral IR and transforms the matched patterns into optimized programs. UNIT distinguishes itself from Loop Tactics in that: 1) compared with the schedule tree [39] in the polyhedral model, the tensor DSL provides more information, such as loop reduction properties and operand types; 2) UNIT provides an end-to-end solution including auto-tuning to obtain optimal performance, whereas Loop Tactics requires the optimized schedules to be provided manually.

Deep learning frameworks: UNIT is complementary to the existing deep learning frameworks. Existing frameworks such as TensorFlow [8], PyTorch [7], and MXNet [3] rely on vendor-crafted libraries to support the new tensor intrinsics. TVM [10] requires code re-writing on the user side. UNIT is able to handle new operators which might not be covered by the vendor libraries and spare the user from having to perform manual re-writing. We have demonstrated the effectiveness of the methodology of UNIT based on TVM. Similar techniques can be applied to other frameworks to further boost their performance.

VIII. CONCLUSION

Deep learning has prompted hardware vendors to add specialized tensorized instructions for dense tensor operations. These instructions perform a "horizontal reduction" to accumulate elementwise computations.
While promising, introducing this new idiom complicates its general-purpose applicability, as one has to rely on hand-written kernels to gain the high performance brought by these instructions. In this paper, we introduce UNIT, a unified compilation pipeline, that represents the tensorized instructions from different hardware platforms using the same IR, then automatically detects the applicability of the tensorized instructions in a given tensor operation, transforms the loop nest to enable easy mapping of the tensorized instruction, and finally rewrites the loop body with the tensorized instructions. UNIT enables automatic tensorized instruction compilation over a variety of hardware platforms like Intel/ARM CPUs and Nvidia GPUs. Our analysis shows that UNIT achieves 1.3× speedup over oneDNN (VNNI instruction), 1.75× over cuDNN (Tensor Core instruction), and 1.13× over the manually written ARM intrinsics in TVM (DOT instruction).

ACKNOWLEDGEMENTS

This work is supported by NSF grant CCF-1751400 and Mu Li's team at Amazon Web Services.

APPENDIX

A. Abstract

This guide describes how to set up the UNIT compilation infrastructure and run the workloads we discussed in Section VI. This guide provides instructions to:
• Set up the experiment environment for UNIT through Docker.
• Run the end-to-end inference models shown in Figures 8, 9, and 12.
• Run the experiments that demonstrate the effects of our tuning strategies, shown in Figures 10 and 11.
• Run the 3D-convolution results shown in Figure 13.
Our experiments are conducted on Amazon EC2 c5.12xlarge for Intel VNNI, p3.2xlarge for Nvidia Tensor Core, and m6g.8xlarge for ARM DOT. To download and install our infrastructure, approximately 32GB of disk is required. We provide a Dockerfile to set up the environment, and scripts to automatically run the experiments and plot the figures.

B. Artifact Checklist

• Program: As demonstrated in Section V, we use nine typical DNN models, including ResNet, Inception, and MobileNet.
• Compilation: We need specific versions of TVM to run our experiments and baselines. They are included in the zip release.
• Data set: The test data is included in our zip release.
• Runtime environment: We run our artifact on Ubuntu 18.04. For GPU, the Nvidia GPU driver and the additional Docker runtime should be installed.
• Hardware: We run our experiments on AWS EC2 instances: c5.12xlarge for Intel VNNI, p3.2xlarge for Nvidia Tensor Core, and m6g.8xlarge for ARM DOT.
• Execution: We provide scripts to run the experiments discussed in Section VI. It takes 2 hours to compile the models in Figure 8, half an hour to compile the models in Figure 9, and 1.4 hours to compile the models in Figure 12. It takes half an hour to run the experiments in Figures 10 and 11.
• Output: Our scripts both run the experiments and plot the figures as PDF files.
• Experiments: The results reported in our paper were generated on a physical machine, but in this artifact evaluation they all run on a virtual machine in Docker. Performance fluctuation may happen because of the overhead of virtualization.

C. Description

1) How Delivered: Download our Dockerfile, scripts, and model test data at https://doi.org/10.5281/zenodo.4420522.

2) Hardware Dependencies:
• AVX512_VNNI: This is available on Intel CPUs with the Cascade Lake architecture. In this work, we use AWS EC2 c5.12xlarge. The CPU model is Intel(R) Xeon(R) Platinum 8275CL CPU @3.00GHz. The rate is $2.04/hour, and it takes approximately one hour to set up the environment and 5 hours to run all the related experiments.
• Tensor Core: This is available on Nvidia GPUs with the Tensor Core extension. In this work, we use AWS EC2 p3.2xlarge. The GPU model is Tesla V100. Please install the GPU driver. The rate is $3.06/hour, and it takes approximately 1 hour to set up the environment, and another hour to run all the related experiments.
• ARM DOT: This is available on ARM CPUs v8.2 with the dotprod extension. In this work, we use AWS EC2 m6g.8xlarge. The CPU model is Amazon Graviton 2. The rate is $1.232/hour, and it takes 1 hour to set up the environment and run the experiments.

3) Software Dependencies: All our software dependencies are installed automatically in Docker. Refer to this link for Docker installation. When setting up the last step of the package repository, do choose the proper tab for your CPU platform (x86 or ARM). Refer to this to install the Docker runtime that supports Nvidia GPUs. Nvidia Docker requires the GPU driver installed; use this command to install it:
    $ sudo apt-get install nvidia-driver-455

D. Installation

Unzip the downloaded file; there are three sub-zips, tensorcore.zip, vnni.zip, and arm.zip, to evaluate the three platforms we discussed in this paper.

E. Experiment Workflow

1) GPU:
• After building the docker image, an image hash value will be generated in the console log:
    $ unzip tensorcore.zip && cd tensorcore
    $ sudo docker build .
• After entering the container, the experiment scripts are all in the $HOME directory:
    $ cd $HOME
• To replicate the experiments in Figures 9 and 11:
    $ bash run_e2e.sh
• It takes half an hour to run these scripts. Both the experiments and the data plotting are done in these scripts. Use these commands to take the generated PDFs out of the container and look at them.

2) CPU: We run the Intel VNNI experiment on an AWS EC2 c5.12xlarge instance. It is also used to cross-compile the ARM target.
• After building the docker image, an image hash value will be generated in the console log:
    $ unzip vnni.zip && cd vnni
    $ sudo docker build .
    $ sudo docker run -tid
• It takes about two hours to get all the models compiled for ARM. The compiled models will be in $HOME/arm-base and $HOME/arm-unit.
• Copy the compiled models to the ARM machine:
    $ scp -i key.pem -r arm-unit

F. Evaluation and Expected Result

Finally, we have these PDF files:
• Figures 8, 9, and 12 should be compared against cpu-e2e.pdf, gpu-e2e.pdf, and arm-e2e.pdf.
  – The ARM results reported in this paper were generated by an old version of TVM. The performance is improved in the newer version. We will fix this in the camera ready.
• Figures 10 and 11 should be compared against cpu-dse.pdf and gpu-dse.pdf.
• Figure 13 should be compared against conv3d.pdf.

REFERENCES
[9] ACM Transactions on Architecture and Code Optimization (TACO), 16(4):1–25, 2019.
[10] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 578–594, 2018.
[11] Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. Learning to optimize tensor programs. In Advances in Neural Information Processing Systems, pages 3389–3400, 2018.
[12] Jason Cong and Jie Wang. PolySA: Polyhedral-based systolic array auto-compilation. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 1–8. IEEE, 2018.
[13] Vidushi Dadu and Tony Nowatzki. Towards general purpose acceleration by exploiting common data-dependence forms. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019.
[14] J. Deng, W. Dong, R. Socher, L.-J. Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.
[15] Andi Drebes, Lorenzo Chelini, Oleksandr Zinenko, Albert Cohen, Henk Corporaal, Tobias Grosser, Kanishkan Vadivel, and Nicolas Vasilache. TC-CIM: Empowering tensor comprehensions for computing-in-memory. In IMPACT 2020: 10th International Workshop on Polyhedral Compilation Techniques, 2020.
[16] Alexandre E. Eichenberger, Peng Wu, and Kevin O'Brien. Vectorization for SIMD architectures with alignment constraints. ACM SIGPLAN Notices, 39(6):82–93, 2004.
[17] Qingchang Han, Yongmin Hu, Fengwei Yu, Hailong Yang, Bing Liu, Peng Hu, Ruihao Gong, Yanfei Wang, Rui Wang, Zhongzhi Luan, and Depei Qian. Extremely low-bit convolution optimization for quantized neural network on modern computer architectures. In ICPP '20: 49th International Conference on Parallel Processing, 2020.
[18] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew G. Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. CoRR, abs/1712.05877, 2017.
[19] Animesh Jain, Shoubhik Bhattacharya, Masahiro Masuda, Vin Sharma, and Yida Wang. Efficient execution of quantized deep learning models: A compiler approach. arXiv preprint arXiv:2006.10226, 2020.
[20] Martin Kong, Richard Veras, Kevin Stock, Franz Franchetti, Louis-Noël Pouchet, and Ponnuswamy Sadayappan. When polyhedral transformations meet SIMD code generation. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 127–138, 2013.
[21] Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna. MAERI: Enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects. SIGPLAN Not., 53(2):461–475, March 2018.
[22] Samuel Larsen and Saman Amarasinghe. Exploiting superword level parallelism with multimedia instruction sets. ACM SIGPLAN Notices, 35(5):145–156, 2000.
[23] Yizhi Liu, Yao Wang, Ruofei Yu, Mu Li, Vin Sharma, and Yida Wang. Optimizing CNN model inference on CPUs. In 2019 USENIX Annual Technical Conference (USENIX ATC 19), pages 1025–1040, Renton, WA, July 2019. USENIX Association.
[24] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory F. Diamos, Erich Elsen, David García, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. CoRR, abs/1710.03740, 2017.
[25] MLIR. Multi-level IR compiler framework. https://mlir.llvm.org.
[26] Tony Nowatzki, Vinay Gangadhar, Newsha Ardalani, and Karthikeyan Sankaralingam. Stream-dataflow acceleration. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), 2017.
[27] Dorit Nuzman and Richard Henderson. Multi-platform auto-vectorization. In Proceedings of the International Symposium on Code Generation and Optimization, CGO '06, pages 281–294, USA, 2006. IEEE Computer Society.
[28] Dorit Nuzman, Ira Rosen, and Ayal Zaks. Auto-vectorization of interleaved data for SIMD. In Michael I. Schwartzbach and Thomas Ball, editors, Proceedings of the ACM SIGPLAN 2006 Conference on Programming Language Design and Implementation, Ottawa, Ontario, Canada, June 11-14, 2006, pages 132–143. ACM, 2006.
[29] Phitchaya Mangpo Phothilimthana, Archibald Samuel Elliott, An Wang, Abhinav Jangda, Bastian Hagedorn, Henrik Barthels, Samuel J. Kaufman, Vinod Grover, Emina Torlak, and Rastislav Bodik. Swizzle Inventor: Data movement synthesis for GPU kernels. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 65–78, 2019.
[30] Raghu Prabhakar, Yaqi Zhang, David Koeplinger, Matt Feldman, Tian Zhao, Stefan Hadjis, Ardavan Pedram, Christos Kozyrakis, and Kunle Olukotun. Plasticine: A reconfigurable architecture for parallel patterns. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), 2017.
[31] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '13, pages 519–530, New York, NY, USA, 2013. ACM.
[32] Jared Roesch, Steven Lyubomirsky, Logan Weber, Josh Pollock, Marisa Kirisame, Tianqi Chen, and Zachary Tatlock. Relay: A new IR for machine learning frameworks. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, MAPL 2018, pages 58–68, New York, NY, USA, 2018. Association for Computing Machinery.
[33] Ira Rosen, D. Nuzman, and A. Zaks. Loop-aware SLP in GCC. Pages 131–142, 2007.
[34] Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Summer Deng, Roman Dzhabarov, James Hegeman, Roman Levenstein, Bert Maher, Satish Nadathur, Jakob Olesen, Jongsoo Park, Artem Rakhov, and Misha Smelyanskiy. Glow: Graph lowering compiler techniques for neural networks. CoRR, abs/1805.00907, 2018.
[35] Somashekaracharya G. Bhaskaracharya, Julien Demouth, and Vinod Grover. Automatic kernel generation for Volta tensor cores. arXiv preprint arXiv:2006.12645, 2020.
[36] Sanket Tavarageri, Alexander Heinecke, Sasikanth Avancha, Gagandeep Goyal, Ramakrishna Upadrasta, and Bharat Kaul. PolyDL: Polyhedral optimizations for creation of high performance DL primitives. arXiv preprint arXiv:2006.02230, 2020.
[37] Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. Tensor Comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv preprint arXiv:1802.04730, 2018.
[38] Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, Jose Ignacio Gomez, Christian Tenllado, and Francky Catthoor. Polyhedral parallel code generation for CUDA. ACM Transactions on Architecture and Code Optimization (TACO), 9(4):1–23, 2013.
[39] Sven Verdoolaege, Serge Guelton, Tobias Grosser, and Albert Cohen. Schedule trees. In International Workshop on Polyhedral Compilation Techniques (IMPACT), Vienna, Austria, 2014.
[40] J. Weng, S. Liu, V. Dadu, Z. Wang, P. Shah, and T. Nowatzki. DSAGEN: Synthesizing programmable spatial accelerators. In Proceedings of the 47th Annual International Symposium on Computer Architecture (ISCA), pages 268–281, 2020.
[41] XLA Team. XLA - TensorFlow, compiled. March 2017.
[42] D. Yan, W. Wang, and X. Chu. Demystifying tensor cores to optimize half-precision matrix multiply. In 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020.