Automatic Kernel Generation for Volta Tensor Cores
Somashekaracharya G. Bhaskaracharya (NVIDIA, [email protected])
Julien Demouth (NVIDIA, [email protected])
Vinod Grover (NVIDIA, [email protected])
Abstract—A commonly occurring computation idiom in neural networks is to perform some pointwise operations on the result of a matrix multiplication. Such a sequence of operations is typically represented as a computation graph in deep learning compilers. When compiling to a GPU target, these computations can be individually mapped to manually tuned implementations provided by libraries such as cuBLAS and cuDNN. These libraries also provide off-the-shelf support for targeting tensor cores in NVIDIA GPUs, which can lead to huge performance boosts through their specialized support for mixed-precision matrix math. Alternatively, tensor cores can be programmed directly using CUDA APIs or inline assembly instructions, which opens up the possibility of generating efficient CUDA kernels automatically for such computations.

Automatic kernel generation is particularly crucial when it is beneficial to generate efficient code for an entire computation graph by fusing several operations into a single device function instead of invoking a separate kernel for each of them. Polyhedral compilation techniques provide a systematic approach for the analysis and transformation of a sequence of affine loop-nests. In this paper, we describe a polyhedral approach to generate efficient CUDA kernels for matrix multiplication using inline assembly instructions for programming tensor cores on NVIDIA Volta GPUs. Furthermore, we build on this approach to generate fused kernels for computation sequences involving matrix multiplication and pointwise operations such as bias addition, ReLU activation etc. Experimental evaluation of these techniques shows that automatically generated kernels can provide significantly better performance than manually tuned library implementations, with speedups ranging up to 2.55×.

Index Terms—matmul, polyhedral compilation, tensor cores
I. INTRODUCTION
Tensor cores in NVIDIA GPUs are processing cores specialized for matrix operations. They provide huge boosts in throughput and efficiency by performing several mixed-precision matrix-multiply and accumulate calculations in a single operation. Consequently, tensor cores can significantly speed up generalized matrix-multiply (GEMM) and convolutions (as implicit GEMMs), both of which are used heavily in deep learning systems and other computational applications. CUDA libraries such as cuBLAS [1] and Cutlass [2] provide off-the-shelf support for leveraging tensor core capabilities through manually tuned implementations of GEMM for various input matrix layouts (row-major and column-major). Furthermore, cuDNN [3] is a widely used GPU-accelerated library with extensive support for various deep learning primitives.

Deep learning computations can be modeled as dataflow graphs where each node represents a specific computation such as matmul, convolution, pointwise operations etc. A commonly occurring computation idiom in neural networks is that of a matmul feeding a bias add, which then drives an activation function such as ReLU. These computations can be specified through various deep learning frameworks such as TensorFlow [4] and PyTorch [5]. A naive approach to map these computations onto a GPU would invoke separate kernels for each computation node in an execution order dictated by the dataflow dependences. However, the repeated round-trip through global memory is inefficient and can be eliminated if multiple operations are fused into the same kernel. So, for the above example, the entire computation should ideally be mapped to a single kernel that performs matmul as well as its downstream pointwise operations, without the overhead of writing the intermediate results to global memory. In this context, it is imperative to explore kernel generation techniques not just for matmul but also for matmul with such pointwise operations in its epilogue or prologue, i.e., downstream or upstream to it in the computation graph.

Automatic kernel generation for tensor cores requires direct programmatic access to tensor cores. On a Volta GPU, tensor cores can be programmed directly using the mma.sync.m8n8k4
PTX instruction [6] for the half-precision floating point type. The operation is defined for each quad-pair, i.e., a group of 8 threads. The mma.sync.m8n8k4 instruction for half-precision floats (the fp16 data type) on Volta requires each thread in a quad pair to own fragments of the input matrices to a matrix multiply-and-accumulate operation. Each fragment consists of 4 half-precision floats. Consequently, the input fragments from all the threads in a quad-pair constitute the two input matrices, of shapes 8 × 4 and 4 × 8, of an 8 × 8 × 4 matrix multiply-and-accumulate. CUDA also exposes a warp-level wmma API for programming tensor cores. Unlike the mma.sync.m8n8k4 instruction, the wmma operation is defined for an entire warp. A wmma fragment distribution is opaque, and is target dependent. On the other hand, the distribution of mma.sync fragments is pre-defined and not target architecture dependent. Furthermore, as we shall see later, the m8n8k4 mma operation can be composed into higher-order macro-MMA operations in a number of ways to take advantage of wider loads and stores at every level of the memory hierarchy. Due to this low-level control, the mma.sync.m8n8k4 instruction lends itself well to implementing efficient kernels for deep learning graphs, with ample scope for kernel fusion.

Our focus is on generating efficient kernels for computation graphs that involve matmul and pointwise operations such as add, subtract, and activation functions such as ReLU, Sigmoid and Tanh. In particular, the scope of this work is restricted to the following computation idioms.
• matmul without any epilogue or prologue
• matmul with pointwise operations in its prologue
• matmul with pointwise operations in its epilogue
• matmuls of the same shape feeding a pointwise operation

These operations can be expressed as affine loop-nests. Polyhedral frameworks have proven to be effective for analyzing and transforming a sequence of affine loop-nests [8]–[10]. Several deep learning compiler frameworks [11]–[13] support polyhedral compilation techniques in conjunction with other analysis tools and intermediate representations. In this paper, we describe a polyhedral approach to automatically generate efficient kernels for Volta tensor cores given a high-level computation DAG where each node represents a matmul or a pointwise operation such as add, subtract, ReLU, Sigmoid, Tanh. To summarize, our contributions are as follows.
• We describe polyhedral techniques to automatically generate efficient kernels that implement matmul using the mma.m8n8k4
PTX instruction on a Volta GPU, by composing the m8n8k4 mma tile into higher-order macro-MMA operations of shapes m16n16k8 and m32n32k8 for realizing a warp-level MMA.
• We then build on the above approach to automatically perform kernel fusion for some commonly occurring computation idioms involving matmul and pointwise operations.
• We implement and evaluate our approach for these computation idioms on various problem sizes and demonstrate significant speedups over manually-tuned implementations provided by standard libraries such as cuBLAS and cuDNN.

Section II provides the necessary background and introduces the notation used in later sections. Sections IV, V and VI describe the compute decomposition and data movement required for mapping a matmul computation to Volta tensor cores. Kernel fusion using similar decompositions is discussed in Section VII. Experimental evaluation of these techniques is provided in Section VIII. Related work and conclusions are presented in Sections IX and X respectively.

II. BACKGROUND
This section provides the notation and background for the techniques we present in the rest of this paper.
A. Programming Tensor Cores
The GPU compute hierarchy consists of blocks of threads which are organized into warps. Each warp consists of 32 threads. These threads can be further partitioned into 8 quads, i.e., threads 0–3, 4–7, 8–11, 12–15, 16–19, 20–23, 24–27 and 28–31 respectively. A Volta mma.sync.m8n8k4 instruction is executed by a pair of quads. For example, threads 0–3 and 16–19 constitute quad-pair QP0; threads 4–7 and 20–23 constitute quad-pair QP1 and so on.

As shown in Figure 1, each mma operation is of shape m8n8k4, i.e., input matrices of shape 8 × 4 and 4 × 8 and an accumulator matrix of shape 8 × 8. The input matrix elements are half-precision floats (the fp16 type) whereas the accumulator matrix elements are of type fp32 (a version of mma.m8n8k4 with fp16 accumulators also exists on Volta, although we consider the fp32 version in this paper). The input matrices can be in either row or column major format with different specializations of the instruction provided for different layout combinations. Each thread in a quad pair owns an input data fragment, which is nothing but 4 half-precision floating point values. As shown in Figure 1, thread 0 owns 4 contiguous fp16 values from the first row and column of the input matrices A and B respectively; thread 1 owns the second row and column and so on. Furthermore, each thread owns an accumulator fragment, which consists of 8 elements from the accumulator matrix. In Figure 1, the thread indices specified in the cells of the 8 × 8 accumulator matrix indicate the threads that own the corresponding elements for m8n8k4. More details on the mma.sync.m8n8k4 instruction in Volta can be found in the CUDA toolkit documentation [6].

Fig. 1: mma.sync.m8n8k4.row.col operation. Matrix A is row major while matrix B is column major.
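As a concrete illustration, the listing below is a minimal sketch (not taken from the paper) of how the fp32-accumulator, row-major-A/column-major-B variant of this instruction can be wrapped with inline PTX; the wrapper name and argument packing are our assumptions. Each thread passes its 4 fp16 input values for A and B packed into two 32-bit registers each, along with its 8 fp32 accumulator values.

    // Minimal inline-PTX wrapper for one Volta mma.sync.m8n8k4 operation
    // (row-major A, column-major B, fp32 accumulate). Illustrative sketch,
    // not the code emitted by the compiler described in this paper.
    __device__ __forceinline__ void mma_m8n8k4_row_col(float d[8],
                                                       const unsigned a[2],  // 4 fp16 of A, packed
                                                       const unsigned b[2],  // 4 fp16 of B, packed
                                                       const float c[8]) {   // 8 fp32 accumulators
      asm volatile(
          "mma.sync.aligned.m8n8k4.row.col.f32.f16.f16.f32 "
          "{%0,%1,%2,%3,%4,%5,%6,%7}, {%8,%9}, {%10,%11}, "
          "{%12,%13,%14,%15,%16,%17,%18,%19};\n"
          : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3]),
            "=f"(d[4]), "=f"(d[5]), "=f"(d[6]), "=f"(d[7])
          : "r"(a[0]), "r"(a[1]), "r"(b[0]), "r"(b[1]),
            "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]),
            "f"(c[4]), "f"(c[5]), "f"(c[6]), "f"(c[7]));
    }

A helper of this kind is the building block from which the warp-level macro-MMA helpers discussed in later sections are composed.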
B. Polyhedral Model

The polyhedral model is a mathematical representation of affine loop-nests, where the loop bounds and array access expressions are affine combinations of enclosing loop iterators and program parameters. In the polyhedral representation each execution instance of a statement S is represented as an integer point within a polyhedron. The faces of the polyhedron correspond to the bounds on the enclosing loops and the dimensionality of the polyhedron is nothing but the number of enclosing loops. The integer points within the polyhedron capture the iteration domain I_S of the statement. Each statement may access arrays whose dataspaces can also be similarly defined as polyhedra. Consequently, an access relation I_S → A mapping an iteration domain I_S to a dataspace A can be used to specify accesses to array A performed by a statement S. Similarly, relations between iteration spaces represent RAW, WAR and WAW dependences. Furthermore, the execution schedule of the statement instances is specified by mapping the execution instances to multi-dimensional logical timestamps, whose lexicographic ordering gives the execution order.

The Integer Set Library (ISL) [14] can be used to represent and manipulate such a polyhedral representation. Given the polyhedral representation of a loop-nest or a sequence of loop-nests, the polyhedral scheduler in ISL can be used to determine a valid schedule, such that all dependences are satisfied. ISL also provides a facility for manipulating a schedule through its schedule tree representation [15].
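For instance, for a single matmul statement S: C[i, j] += A[i, k] * B[k, j], the pieces of the polyhedral representation described above can be written in ISL's textual set and map notation roughly as follows (illustrative strings, not the output of any particular tool):

    iteration domain  I_S = { S[i, j, k] : 0 <= i < M and 0 <= j < N and 0 <= k < K }
    read access of A      = { S[i, j, k] -> A[i, k] }
    read access of B      = { S[i, j, k] -> B[k, j] }
    write access of C     = { S[i, j, k] -> C[i, j] }
    initial schedule      = { S[i, j, k] -> [i, j, k] }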
III. PROBLEM STATEMENT

Our focus is on automatic kernel generation for computation DAGs where the nodes represent matmul or pointwise operations. Since each of these operations can be specified as affine loop-nests, each node also encapsulates its corresponding polyhedral representation – the iteration space, data space, access relations as well as the dependence relations. Such a computation DAG serves as the input to our kernel generation problem. DSLs such as Halide [16], Tensor Comprehensions [11] can be used to derive such a DAG.

    for (i = 0; i < M; ++i)
      for (j = 0; j < N; ++j)
        for (k = 0; k < K; ++k)
          /*S1*/ C[i, j] = mul_acc(C[i, j], A[i, k], B[k, j]);
    for (i = 0; i < M; ++i)
      for (j = 0; j < N; ++j)
        /*S2*/ E[i, j] = relu_add(C[i, j], bias[i, j]);
Listing 1: Matmul + Bias + ReLU
Listing 1 shows an example where the matmul result is fed to pointwise operations – bias add followed by the ReLU activation function. Each loop-nest maps to a separate node in the computation DAG. Now, the polyhedral model would consist of the following.
• The iteration domains, I_1 and I_2, for the statements S1 and S2.
• Dataspaces, M_C and M_E, written to by S1 and S2 respectively.
• Write access relations, I_1 → M_C and I_2 → M_E, specifying write accesses performed on M_C and M_E respectively.
• Read access relations, { I_1 → M_C, I_1 → M_A, I_1 → M_B, I_2 → M_bias, I_2 → M_C }, specifying the read accesses performed on the dataspaces M_C, M_A, M_B and M_bias respectively.
• RAW dependences, which consist of both intra-node and inter-node dependences.

Additionally, to facilitate code generation, the following attributes about each statement are tagged on to its iteration space.
• Expression Tree.
The operations that are to be performed as part of the statement are encoded in the form of an expression tree. Each internal node in the expression tree corresponds to an operation. The result of the root operation is written to the output dataspace. In Listing 1, statement S2 performs a bias add and ReLU activation, represented by the compound operation relu_add. Its expression tree consists of two operations – ReLU and add, with ReLU as the root operation.
• Write and read access relations: These are the leaves of the expression tree and are used to determine the data accesses performed by a statement instance, thereby providing the operands for the parent operation nodes in the tree.

In the following sections, we first discuss the problem of generating efficient Volta kernels for matmul and then build on this approach to generate fused kernels for longer computation sequences.
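A rough sketch of how such per-statement attributes could be carried alongside the polyhedral objects is shown below; the type and field names are our own illustration, not Diesel's actual internal representation.

    #include <vector>
    #include <isl/set.h>
    #include <isl/map.h>

    // Hypothetical per-statement metadata: an expression tree whose internal
    // nodes are operations and whose leaves are read access relations, plus
    // the iteration domain the attributes are tagged on to.
    struct ExprNode {
      enum class Op { MulAcc, Add, ReLU, Read } op;
      std::vector<ExprNode*> operands;  // empty for Read leaves
      isl_map* access = nullptr;        // access relation for Read leaves
    };

    struct StatementInfo {
      isl_set* iterationDomain;  // e.g. { S2[i, j] : 0 <= i < M and 0 <= j < N }
      isl_map* writeAccess;      // e.g. { S2[i, j] -> E[i, j] }
      ExprNode* exprTree;        // e.g. ReLU(Add(Read C[i, j], Read bias[i, j])) for S2
    };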
IV. MATMUL COMPUTE DECOMPOSITION
The ISL scheduler gives an outer-parallel schedule, which is a good starting point for mapping the computation to a GPU. The matmul loop in Listing 1 has such a schedule. We now describe the compute decomposition for mapping this 3-d loop nest to the GPU compute hierarchy of blocks, warps and threads in order to target tensor cores.
A. Macro-MMA
MMA operations with the shape m8n8k4 are defined for a quad-pair. But they can be composed into higher-order macro-MMA operations that are performed by an entire warp. Two such macro-MMA compositions that we employ are of shapes 16 × 16 × 8 and 32 × 32 × 8.

For the 16 × 16 × 8 macro-MMA, the input matrices A and B are of size 16 × 8 and 8 × 16 respectively. The accumulator matrix is of size 16 × 16. As with mma.m8n8k4, the input matrices may be in row or column major layout, which determines how the input matrices are distributed across the threads. Figure 2 shows the distribution for a scenario where the input matrix A is row-major and matrix B is column-major. We refer to the subset of data elements owned by a given thread as a macro-MMA fragment. Each input macro-MMA fragment consists of 8 fp16 values – for example, thread 0 owns the 8 elements in the first row of matrix A as well as the 8 elements in the first column of B (note that these elements are also owned by threads 8 and 4 respectively). The accumulator fragment consists of 8 fp32 values. In Figure 2, the numbers inside the cells indicate the threads that own the corresponding elements in the output matrix. For example, thread 0 owns the first two elements in the first row as well as 3 other pairs of adjacent elements, all of which make up its accumulator fragment. The 4 different colours represent the 4 quad-pairs, thereby indicating the thread-to-data mapping for all the matrices. Regardless of the layout of the input matrices, the distribution of the accumulator matrix for the macro-MMA remains the same.

Fig. 2: Thread-to-data mapping for the 16 × 16 × 8 macro-MMA with row-major A and column-major B.

In case of the 32 × 32 × 8 macro-MMA, an accumulator matrix of size 32 × 32 is computed using input matrices of size 32 × 8 and 8 × 32. An input macro-MMA fragment for a 32 × 32 × 8 macro-MMA also consists of 8 fp16 values, and each thread holds 32 fp32 values in its accumulator fragment. Both of these macro-MMA compositions along with wide loads and stores can lead to performance comparable with that of hand-tuned kernels. In the following sections, without any loss of generality, we discuss our approach using the 16 × 16 × 8 macro-MMA composition.

B. Tiling for Blocks and Warps
The outer-parallel schedule obtained using ISL is shown in Figure 3, in the schedule tree form. It corresponds to a 3-d loop-nest and consists of a single permutable band with 3 schedule dimensions. Only the innermost dimension is sequential.

Suppose we distribute the loop-nest across a 2-d grid of thread blocks, with a block tile of size b_y × b_x, and with each thread block having a 2-d arrangement of warps, each warp operating on a warp tile of size w_y × w_x. Furthermore, suppose the sequential loop is to be strip-mined with a strip size of b_s. This can be achieved by successively tiling the loop-nest, first with tile sizes b_y × b_x × b_s and then with tile sizes w_y × w_x × b_s. We refer to the former as the block tile and the latter as the warp tile. Clearly, the block tile sizes along every dimension must exceed or equal the warp tile sizes along the corresponding dimension. Also, they must be integer multiples of the macro-MMA sizes chosen – 16 × 16 × 8 or 32 × 32 × 8. The problem sizes are assumed to be multiples of the block tile sizes. For other problem sizes, we can either pad the input matrices with zeroes to make them so or generate specialized code for the partial tiles and then combine the results of the full tiles computed using macro-MMA and the results of the partial tiles.

Figure 4 shows the block and warp-level tiling transformation on the schedule tree using tile sizes 128 × 128 × 32 and 64 × 64 × 32 respectively. Note that in all three band nodes of the resulting schedule tree, only the innermost dimension is sequential. Clearly, the parallel dimensions of the outermost band correspond to the block indices, blockIdx.y and blockIdx.x. Similarly, the middle band essentially iterates over the warp tiles. So, its parallel dimensions correspond to warp indices, warpIdx_y and warpIdx_x, which can be derived from other kernel parameters such as thread indices. Figure 5 shows the schedule tree after these schedule parameters have been introduced.
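To make the effect of this tiling concrete, the schedule tree in Figure 4 corresponds (ignoring parallel execution, data types and layouts) to a loop structure of the following shape; the function and loop variable names are ours, and M, N, K are assumed to be multiples of the block tile sizes:

    // Sequential sketch of the loop structure implied by Figure 4
    // (block tile 128 x 128 x 32, warp tile 64 x 64 x 32).
    void tiled_matmul_shape(int M, int N, int K, float *C, const float *A, const float *B) {
      for (int ib = 0; ib < M; ib += 128)                 // outermost band -> blockIdx.y
        for (int jb = 0; jb < N; jb += 128)               //                -> blockIdx.x
          for (int kb = 0; kb < K; kb += 32)              // sequential k over block tiles
            for (int iw = ib; iw < ib + 128; iw += 64)    // middle band -> warpIdx_y
              for (int jw = jb; jw < jb + 128; jw += 64)  //             -> warpIdx_x
                for (int i = iw; i < iw + 64; ++i)        // innermost (warp-tile) band
                  for (int j = jw; j < jw + 64; ++j)
                    for (int k = kb; k < kb + 32; ++k)
                      C[i * N + j] += A[i * K + k] * B[k * N + j];
    }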
C. Strip Mining the Warp-Level MMA
The innermost band in Figure 5, which performs a warp-level MMA operation, processes a warp tile of size 64 × 64 × 32. As shown in Figure 6, the sequential dimension in the innermost band can be strip-mined further using a strip size of 8, which corresponds to the macro-MMA size along the k dimension. Such a restructuring essentially expresses the warp-level MMA as an outer product, which exposes more instruction-level parallelism than an inner product formulation. This helps cover the latency from instructions and the memory load at a low occupancy.

D. Schedule Domain Contraction
At this stage, the innermost band in Figure 6 specifies a 3-d loop nest around the statement S, each instance of which performs a scalar matrix-multiply and accumulate operation. To target tensor cores, each statement instance must instead perform a macro-MMA operation (of shape 16 × 16 × 8 or 32 × 32 × 8). To that end, the schedule domain is constrained as shown in Figure 7, so that each macro-MMA instance is mapped to a single integer point, effectively abstracting away the inner point loops. The operation performed by the statement S is then altered to one that performs a 16 × 16 × 8 macro-MMA on input and accumulator register fragments.

DOMAIN: S[i, j, k] : 0 ≤ i < M ∧ 0 ≤ j < N ∧ 0 ≤ k < K
BAND: S[i, j, k] → [i, j, k]
Fig. 3: Initial schedule for matmul in schedule tree form.
DOMAIN: S[i, j, k] : 0 ≤ i < M ∧ 0 ≤ j < N ∧ 0 ≤ k < K
BAND: S[i, j, k] → [⌊i/128⌋, ⌊j/128⌋, ⌊k/32⌋]
BAND: S[i, j, k] → [⌊i/64⌋ − 2⌊i/128⌋, ⌊j/64⌋ − 2⌊j/128⌋, 0]
BAND: S[i, j, k] → [i − 64⌊i/64⌋, j − 64⌊j/64⌋, k − 32⌊k/32⌋]
Fig. 4: Tiling with block tile size of 128 × 128 × 32 and warp-level tile size of 64 × 64 × 32.

DOMAIN: S[i, j, k] : 0 ≤ i < M ∧ 0 ≤ j < N ∧ 0 ≤ k < K
BAND: S[i, j, k] → [blockIdx.y, blockIdx.x, ⌊k/32⌋]
BAND: S[i, j, k] → [warpIdx_y, warpIdx_x, 0]
BAND: S[i, j, k] → [i − 64⌊i/64⌋, j − 64⌊j/64⌋, k − 32⌊k/32⌋]
Fig. 5: Introducing kernel parameters – blockIdx.y, blockIdx.x, warpIdx_y, warpIdx_x.

DOMAIN: S[i, j, k] : 0 ≤ i < M ∧ 0 ≤ j < N ∧ 0 ≤ k < K
BAND: S[i, j, k] → [blockIdx.y, blockIdx.x, ⌊k/32⌋]
BAND: S[i, j, k] → [warpIdx_y, warpIdx_x, 0]
BAND: S[i, j, k] → [0, 0, ⌊k/8⌋ − 4⌊k/32⌋]
BAND: S[i, j, k] → [i − 64⌊i/64⌋, j − 64⌊j/64⌋, k − 8⌊k/8⌋]
Fig. 6: Strip-mining along the k dimension with strip-size 8.

DOMAIN: S[i, j, k] : 0 ≤ i < M ∧ 0 ≤ j < N ∧ 0 ≤ k < K ∧ 16⌊i/16⌋ = i ∧ 16⌊j/16⌋ = j ∧ 8⌊k/8⌋ = k
BAND: S[i, j, k] → [blockIdx.y, blockIdx.x, ⌊k/32⌋]
BAND: S[i, j, k] → [warpIdx_y, warpIdx_x, 0]
BAND: S[i, j, k] → [0, 0, ⌊k/8⌋ − 4⌊k/32⌋]
BAND: S[i, j, k] → [i − 64⌊i/64⌋, j − 64⌊j/64⌋, k − 8⌊k/8⌋]
Fig. 7: Constraining the schedule domain so that each macro-MMA instance is mapped to a single integer point.

Fig. 8: Schedule tree transformations for matmul.
In later sections, we discuss how these fragments can be created and loaded with the required data.
E. Split-K
The compute decomposition described so far parallelizes the matmul computation only along the parallel dimensions, i.e., the i and j loops. While the k dimension is sequential, it is reasonable to parallelize along the k dimension as well when targeting tensor cores. We refer to such a parallelization as split-K. A 2-way intra-thread-block split-K splits the computation into 2 parts and assigns them to concurrent warps along the z axis of the GPU compute hierarchy. The partial results computed by corresponding warps along the z axis then need to be summed up to obtain the final result for the matmul computation. More generally, it is possible to extend this arrangement to a p-way intra-thread-block split-K, where p is a power of 2.
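Concretely, for a p-way split-K with warp-tile depth b_s along k, the warps stacked along the z axis compute partial results that are reduced afterwards; in our notation (this formulation is ours, summarizing the description above):

    C[i, j] = Σ_{z = 0}^{p − 1} C_z[i, j],   where   C_z[i, j] = Σ_{k : ⌊k / b_s⌋ mod p = z} A[i, k] · B[k, j]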
1) Tiling for Intra-Thread-Block Split-K:
For a p-way split-K, the block tiles need to be bigger by a factor p along the k dimension, while the warp tiles are of the same size as before. So, the compute decomposition for a p-way split-K involves tiling the loop-nest, first with tile sizes b_y × b_x × (p · b_s) and then with w_y × w_x × b_s. This creates p rows of warp tiles along the z axis of the GPU compute hierarchy. Figure 9 illustrates this approach for a 2-way intra-thread-block split-K. Clearly, the resulting schedule in the middle band is for a loop-nest that iterates over warp tiles along the y, x and z dimensions respectively. So, on introducing these derived kernel parameters, along with the block indices, the schedule tree would be as shown in Figure 10.

The above tilings are followed up with strip-mining and schedule domain contraction as shown earlier in Figures 6 and 7 to complete the compute decomposition for a p-way split-K.

V. DATA MOVEMENT ACROSS MEMORY HIERARCHY
In order to obtain good performance on a GPU, it is important to optimize the movement of data across global memory, shared memory and registers. In this section, we explain how copy statements that move the data across this memory hierarchy are introduced by inserting their associated copy schedule nodes to the transformed schedule tree obtained after compute decomposition.

The read maps, I → M_A and I → M_B, capture the read access relations from the iteration domain of the matmul statement S to the dataspaces M_A and M_B, which are only read. The schedule tree transformation described in the previous section effectively maps the iteration space I to the schedule space I′ with the schedule vector (b_y, b_x, c_0, w_y, w_x, c_1, c_2, c_3, c_4, c_5, c_6, c_7). The dimensions b_y, b_x, c_0 correspond to the schedule dimensions of the outermost band in Figure 7; w_y, w_x, c_1 correspond to the schedule dimensions of the band immediately below it and so on.
1) Global, Shared and Register Data Tiles:
Using the transformation I → I′, we can derive the mappings I′ → M_A, I′ → M_B and likewise, the write mapping I′ → M_C, which describe the read and the write access relations from the schedule space I′ to the input and output dataspaces respectively. The basic idea is to infer tiles in the global memory that are accessed at block-level and warp-level. Global data tiles that are reused can then be promoted to shared memory data tiles and register tiles, i.e., copied to shared memory or register files and then reused. For each memory promotion, a new statement is introduced to perform this data copy. Furthermore, the access maps can be analyzed to determine the access depth as well as the dimensions of these data tiles. The iteration domain of the copy statement can be derived from the data space of the corresponding data tile. At the inferred access depth, a new schedule node defining a schedule for the copy statement is inserted into the schedule tree. As explained later, in some cases, the copy operation may not be a simple data assignment and may involve a call to helper functions.

A. Global to Shared Memory
DOMAIN: S[i, j, k] : 0 ≤ i < M ∧ 0 ≤ j < N ∧ 0 ≤ k < K
BAND: S[i, j, k] → [⌊i/128⌋, ⌊j/128⌋, ⌊k/64⌋]
BAND: S[i, j, k] → [⌊i/64⌋ − 2⌊i/128⌋, ⌊j/64⌋ − 2⌊j/128⌋, ⌊k/32⌋ − 2⌊k/64⌋]
BAND: S[i, j, k] → [i − 64⌊i/64⌋, j − 64⌊j/64⌋, k − 32⌊k/32⌋]
Fig. 9: Tiling initial schedule in Figure 3 with block tile size of 128 × 128 × 64 instead of 128 × 128 × 32 for 2-way split-K. Warp tile size remains the same – 64 × 64 × 32.

DOMAIN: S[i, j, k] : 0 ≤ i < M ∧ 0 ≤ j < N ∧ 0 ≤ k < K
BAND: S[i, j, k] → [blockIdx.y, blockIdx.x, ⌊k/64⌋]
BAND: S[i, j, k] → [warpIdx_y, warpIdx_x, warpIdx_z]
BAND: S[i, j, k] → [i − 64⌊i/64⌋, j − 64⌊j/64⌋, k − 32⌊k/32⌋]
Fig. 10: On introducing kernel parameters – blockIdx.y, blockIdx.x, warpIdx_y, warpIdx_x, warpIdx_z.

Fig. 11: Schedule tree transformations for matmul with 2-way intra-thread-block split-K.

Block-level read maps I′_block → M_A and I′_block → M_B can be derived from the read access relations I′ → M_A and I′ → M_B respectively, by projecting out all the dimensions other than those in the outer schedule band, namely, (b_y, b_x, c_0). The ranges of these maps determine the data tiles that are accessed by the compute tile (b_y, b_x, c_0). Since this data tile is reused by threads within the block with block indices (b_y, b_x), it can be promoted to shared memory.

Furthermore, we separate out the global-to-shared copy by splitting it into a global-to-register copy followed by a register-to-shared copy loop. The global-to-register copy loop is set up in such a way that each thread performs 128-bit accesses (8 contiguous fp16 elements) to global memory. Furthermore, this loop to copy data from global memory to registers using vectorized loads is cyclically distributed across all the threads in a thread block. Such a distribution also ensures that global accesses are coalesced.
1) Swizzled Shared Memory:
In order to ensure that there are no bank conflicts when reading and writing data to and from the shared memory, the data elements in the shared memory tile need to be permuted or swizzled. So, unlike the copy from global memory to registers, which is a straightforward copy of an 8-wide vector of fp16 data values, the copy from registers to shared memory involves the application of a swizzle function. In essence, the 8-wide vector data is stored to a shared memory buffer where the data ordering is different from that in the corresponding global data tile.

The swizzle function, used to compute the offset in the swizzled shared memory allocation, provides a one-to-one mapping from the unswizzled data tile to the swizzled tile in shared memory. Essentially, it maps the offsets of contiguous blocks of 8 fp16 data values in the unswizzled data tile to an offset in the shared memory allocation for that data tile. The exact implementation of the swizzle function depends on the number of elements per row and column in the swizzled shared array. We also use different helper functions to copy the data to shared memory for each of the 4 combinations – A or B in row or column major layout.
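The paper leaves the exact swizzle unspecified; purely as an illustration, one common choice for a tile stored as 8-wide fp16 vectors is an XOR-based permutation of the vector index, sketched below under the assumption of 8 vectors (64 fp16 values) per row of the shared tile. The function name and geometry are ours.

    // Illustrative XOR-based swizzle: maps the (row, vector) position of an
    // 8-wide fp16 vector in the unswizzled tile to a vector offset in the
    // swizzled shared-memory buffer. One 8-wide fp16 vector spans 16 bytes,
    // i.e., 4 shared-memory banks; XOR-ing with the row spreads vectors that
    // share a column across different bank groups.
    __device__ __forceinline__ int swizzle_offset(int row, int vec) {
      return row * 8 + (vec ^ (row % 8));   // assumes 8 vectors per row
    }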
B. Shared Memory to Register Fragments

The access maps I′_warp → M_A, I′_warp → M_B, I′_warp → M_C are obtained by projecting out the dimensions (c_5, c_6, c_7) of the innermost band from I′ → M_A, I′ → M_B, I′ → M_C respectively. The ranges of these warp-level access maps determine the input and output data tiles for the warp-level macro-MMA. For the schedule tree in Figure 7, these data tiles would be of size 64 × 8, 8 × 64 and 64 × 64 respectively.
1) Macro-MMA Register Fragments:
Each thread need not own the entire data tile. Recall the schedule domain contraction in Figure 7 which abstracted away the inner point loops. Similarly, these tiles can be contracted to obtain an array of register fragments. For a 16 × 16 × 8 macro-MMA:
• The 2-d input data tile of M_A is contracted by factors 16 × 8 if it is row major and 8 × 16 if it is column major. For example, a given tile of size 64 × 8 is contracted to a 4 × 1 array of input register fragments.
• The input data tile of M_B is similarly contracted by factors 8 × 16 or 16 × 8 depending on its layout.
• The output data tile for M_C is contracted by factors 16 × 16 to obtain the array of accumulator fragments.
Each such input register fragment consists of 8 fp16 elements.
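For example, with a 64 × 64 × 32 warp tile and 16 × 16 × 8 macro-MMAs, a single thread's share of these contracted tiles could be declared roughly as follows (array names and exact shapes are illustrative; compare the hmma_M_* arrays in Listing 2):

    // 64 x 64 accumulator tile contracted by 16 x 16 -> 4 x 4 fragments,
    // each holding this thread's 8 fp32 accumulator values.
    float acc_frag[4][4][8];

    // 64 x 8 tile of A and 8 x 64 tile of B contracted by the macro-MMA
    // shapes -> 4 input fragments each; every fragment holds 8 fp16 values,
    // packed here as four 32-bit registers.
    unsigned a_frag[4][4];
    unsigned b_frag[4][4];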
2) Loading Register Fragments:
In order to load the register fragments, a polyhedral schedule can be created for moving the data from shared memory to register fragments. Each register fragment load involves loading 8 fp16 values from shared memory. The swizzled storage ensures that there are no bank conflicts due to the read accesses to it. The same swizzling function that is used for storing to shared memory is used to obtain the swizzled offset from which to fetch the data vector in the swizzled storage. Again, the data copy is done using vectorized accesses. The macro-MMA thread-to-data mappings for inputs A and B are different. Furthermore, it is different for different layouts – row or column major. Also, different load fragment helper functions are needed for each of the 4 possibilities – A or B in row or column major layout.

C. Macro-MMA on Register Fragments
As described in the previous subsection, the register fragments are inferred from the access maps I′_warp → M_A, I′_warp → M_B, I′_warp → M_C. Consequently, a mapping from the global dataspace to the data space of register fragments can be inferred. Using these mappings, all accesses to global memory in the statement S can be altered to access register fragments instead.

D. Store to Global Memory
The result of a macro-MMA computation goes into an accumulator fragment, which resides in registers. So, a copy of the data from accumulator fragments to global memory is required.

    extern "C" __global__ void __launch_bounds__(256) kern0(
        int M, int N, int K,
        const half * __restrict__ M_0, int ldM_0,
        const half * __restrict__ M_1, int ldM_1,
        half * __restrict__ M_3, int ldM_3) {
      ...
      for (int c2 = -1; c2 < K / 64; c2 += 1)      // 2-way split-K
        if (K >= 64 * c2 + 128)                    // prefetch data from global memory
          for (int c5 = 0; c5 <= 3; c5 += 1) {
            // copy data from global memory to registers
            (half8&)(private_M_1[(8 * c5)]) =
                ((half8&)(M_0[(128 * blockIdx.y + 32 * c5 + linearId / 8) * K
                              + (8 * (linearId % 8) + 64 * c2 + 64) * 1]));
            (half8&)(private_M_3[(8 * c5)]) =
                ((half8&)(M_1[(128 * blockIdx.x + 32 * c5 + linearId / 8) * K
                              + (8 * (linearId % 8) + 64 * c2 + 64) * 1]));
          }
        if (c2 >= 0)       // overlap computation with data movement from global memory to registers
          for (int c8 = 0; c8 <= 3; c8 += 1) {     // strip-mine the warp-level macro-MMA
            for (int c11 = 0; c11 <= 3; c11 += 1) {
              // copy data from swizzled shared buffers to register fragments
              hmma_load_b_col_swizzled(&hmma_M_5[(c11)][(0)][0][0], &shared_mma_M_2[0],
                                       (64 * warpIdx_x + 16 * c11), (32 * warpIdx_z + 8 * c8), 64);
              hmma_load_a_row_swizzled(&hmma_M_4[(c11)][(0)][0][0], &shared_mma_M_0[0],
                                       (64 * warpIdx_y + 16 * c11), (32 * warpIdx_z + 8 * c8), 64);
            }
            // iterate over the macro-MMAs
            for (int c9 = 0; c9 <= 63; c9 += 16) {
              for (int c10 = 0; c10 <= 63; c10 += 16) {
                // perform macro-MMA (A is row & B is col-major)
                hmma_row_col(&hmma_M_6[(c9 / 16)][(c10 / 16)][0][0],
                             &hmma_M_4[(c9 / 16)][(0)][0][0],
                             &hmma_M_5[(c10 / 16)][(0)][0][0],
                             &hmma_M_6[(c9 / 16)][(c10 / 16)][0][0]);
              }
            }
          }
        if (K >= 64 * c2 + 128)                    // prefetch data from shared memory
          __syncthreads();
          for (int c5 = 0; c5 <= 3; c5 += 1) {
            // copy data from registers to swizzled shared buffers
            hmma_store_shared_b_col_swizzled(&shared_mma_M_2[0], ((half8&)(private_M_3[(8 * c5)])), (8 * c5));
            hmma_store_shared_a_row_swizzled(&shared_mma_M_0[0], ((half8&)(private_M_1[(8 * c5)])), (8 * c5));
          }
          __syncthreads();
      if (K >= 64)
        // iterate over accumulator fragments and store results to global mem, rearrange through shared mem
        for (int c4 = 0; c4 <= 3; c4 += 1)
          for (int c5 = 0; c5 <= 3; c5 += 1)
            hmma_store_global_after_reordering(
                &M_3[(64 * warpIdx_y + 128 * blockIdx.y + 16 * c4) * N
                     + (64 * warpIdx_x + 128 * blockIdx.x + 16 * c5) * 1],
                &hmma_M_6[(c4)][(c5)][0][0], ldM_3, &sharedBuffer[0]);
    }
Listing 2: Skeleton code for the matmul kernel generated with block tile size 128 × 128 × 32 and warp tile size 64 × 64 × 32.
1) Reordering through Shared Memory:
The 8 fp32 data values making up an accumulator fragment do not map to a contiguous block of 8 elements (please refer to Figure 2). Furthermore, it is necessary to convert these data values from the fp32 accumulator type to the fp16 output type. So, this precludes a straightforward 8-wide vectorized store to global memory. In order to achieve that, the overall strategy is to move the data from the accumulator fragments to global memory, via shared memory. Essentially, shared memory is used as a temporary buffer to exchange and rearrange the accumulator data in registers so that each thread can finally move a contiguous block of 8 fp16 values to global memory. For example, as shown in Figure 2, thread 0 owns the first two elements in the first row of the accumulator matrix, as well as other pairs of adjacent elements; likewise, thread 2 owns similar pairs of adjacent elements. Thread 0 packs the 4 elements that it owns into a 4-wide vector of fp16 values (after converting them from fp32 to fp16), and copies them to a contiguous block in shared memory. Now, if thread 2 does the same and copies its packed data to the adjacent contiguous block of shared memory, thread 0 can then copy back the entire contiguous block of 8 fp16 values to registers and rearrange them to obtain the correct data order. A similar procedure is employed for all the threads to make sure that the threads own data elements that form contiguous blocks in the global array. Again, the accesses to shared memory are modeled to avoid bank conflicts and to exploit wide loads and stores.
2) Split-K Reduction:
In case of a split-K schedule, the partial results computed by corresponding warps along the z axis must be summed up. This is done by adding the partial results once they are stored to shared memory for the reordering described above. The reduction is performed only by the threads in warps with warpIdx_z equal to 0. Furthermore, this means that the rest of the threads do not perform any store to global memory.

Reordering involves carefully chosen non-affine access patterns. So, all these details, including data conversion and split-K reduction, are hidden behind a helper function that copies data from an accumulator fragment to global memory through the above steps.

E. Prefetching and Hiding Latency
As explained earlier, the input data for the macro-MMA computation is obtained by copying the data from global memory to shared memory and then from the latter to the register fragments. A prefetching schedule can be obtained by shifting the schedule bands associated with the corresponding copy schedule nodes accordingly. The schedule is further modified so that the global memory access latency is hidden by overlapping it with the macro-MMA computation.

VI. CODE GENERATION
The AST generation facility provided by ISL is used to emit CUDA code from the schedule tree that results from the compute and data decomposition discussed in the previous sections. Kernel launch parameters, i.e., the grid and block sizes, are inferred through the constraints on the corresponding schedule dimensions in the schedule tree. Listing 2 provides a skeleton of the kernel generated using our approach for matmul with block tile sizes 128 × 128 × 32 and warp tile sizes 64 × 64 × 32 with a 2-way intra-thread-block split-K, where the input matrices M_0 and M_1 are in row-major and column-major layout respectively. As can be seen in Listing 2, the core matmul computation is performed through the call to the helper function hmma_row_col, which we implement using the mma.sync.m8n8k4 instructions to realize either a 16 × 16 × 8 or a 32 × 32 × 8 warp-level macro-MMA.

    for (i = 0; i < M; ++i)
      for (j = 0; j < N; ++j) {
        for (k = 0; k < K; ++k)
          C[i, j] = mul_acc(C[i, j], A[i, k], B[k, j]); //S1
        E[i, j] = relu_add(C[i, j], bias[i, j]); //S2
      }
Listing 3: Matmul + Bias + ReLU
DOMAIN: S1[i, j, k] : 0 ≤ i < M ∧ 0 ≤ j < N ∧ 0 ≤ k < K; S2[i, j] : 0 ≤ i < M ∧ 0 ≤ j < N
BAND: S1[i, j, k] → [i, j, k]; S2[i, j] → [i, j, K]
SEQUENCE
  FILTER: S1[i, j, k]
  FILTER: S2[i, j]
Fig. 12: Initial schedule tree for matmul+bias+ReLU.

DOMAIN: S1[i, j, k] : 0 ≤ i < M ∧ 0 ≤ j < N ∧ 0 ≤ k < K; S2[i, j] : 0 ≤ i < M ∧ 0 ≤ j < N
BAND: S1[i, j, k] → [⌊i/128⌋, ⌊j/128⌋, ⌊k/32⌋]; S2[i, j] → [⌊i/128⌋, ⌊j/128⌋, ⌊K/32⌋]
BAND: S1[i, j, k] → [i − 128⌊i/128⌋, j − 128⌊j/128⌋, k − 32⌊k/32⌋]; S2[i, j] → [i − 128⌊i/128⌋, j − 128⌊j/128⌋, K − 32⌊K/32⌋]
SEQUENCE
  FILTER: S1[i, j, k]
  FILTER: S2[i, j]
Fig. 13: Tiling with block tile size of 128 × 128 × 32.

DOMAIN: S1[i, j, k] : 0 ≤ i < M ∧ 0 ≤ j < N ∧ 0 ≤ k < K; S2[i, j] : 0 ≤ i < M ∧ 0 ≤ j < N
BAND: S1[i, j, k] → [⌊i/128⌋, ⌊j/128⌋]; S2[i, j] → [⌊i/128⌋, ⌊j/128⌋]
SEQUENCE
  FILTER: S1[i, j, k]
    BAND: S1[i, j, k] → [⌊k/32⌋]
    BAND: S1[i, j, k] → [i − 128⌊i/128⌋, j − 128⌊j/128⌋, k − 32⌊k/32⌋]
  FILTER: S2[i, j]
    BAND: S2[i, j] → [⌊K/32⌋]
    BAND: S2[i, j] → [i − 128⌊i/128⌋, j − 128⌊j/128⌋, K − 32⌊K/32⌋]
Fig. 14: Sequence hoisting for inter-statement dependence.

DOMAIN: S1[i, j, k] : 0 ≤ i < M ∧ 0 ≤ j < N ∧ 0 ≤ k < K ∧ 16⌊i/16⌋ = i ∧ 16⌊j/16⌋ = j ∧ 8⌊k/8⌋ = k; S2[i, j] : 0 ≤ i < M ∧ 0 ≤ j < N ∧ 16⌊i/16⌋ = i ∧ 16⌊j/16⌋ = j
BAND: S1[i, j, k] → [blockIdx.y, blockIdx.x]; S2[i, j] → [blockIdx.y, blockIdx.x]
SEQUENCE
  FILTER: S1[i, j, k]
    BAND: S1[i, j, k] → [⌊k/32⌋]
    BAND: S1[i, j, k] → [warpIdx_y, warpIdx_x, 0]
    BAND: S1[i, j, k] → [0, 0, ⌊k/8⌋ − 4⌊k/32⌋]
    BAND: S1[i, j, k] → [i − 64⌊i/64⌋, j − 64⌊j/64⌋, k − 8⌊k/8⌋]
  FILTER: S2[i, j]
    BAND: S2[i, j] → [⌊K/32⌋]
    BAND: S2[i, j] → [warpIdx_y, warpIdx_x, 0]
    BAND: S2[i, j] → [0, 0, ⌊K/8⌋ − 4⌊K/32⌋]
    BAND: S2[i, j] → [i − 64⌊i/64⌋, j − 64⌊j/64⌋, K − 8⌊K/8⌋]
Fig. 15: Schedule after tiling innermost bands in both the filter nodes with warp-level tile size of 64 × 64 × 32 and then strip-mining the sequential dimension with strip-size of 8.

Fig. 16: Schedule tree transformation for Matmul + Bias + ReLU.

VII. KERNEL FUSION
We now discuss how fused kernels can be generated for some computation idioms with both matmul and pointwise operations.
A. Pointwise Operations in Epilogue
Consider a computation sequence with pointwise operations fed by a matmul, e.g., matmul followed by a bias add and ReLU activation, with only the final result being live out. Without loss of generality, all the pointwise operations can be fused together and represented by a single compound operation as in Listing 1. The ISL schedule for this is as shown in Figure 12, which corresponds to the loop-nest shown in Listing 3. The sequence node specifies the relative ordering of the statements that appear in its filter child nodes. Only the innermost dimensions of band nodes are sequential. Note that unlike in Listing 1, the outer-parallel loops are fused in Listing 3.

As in the case of block-level tiling for just matmul, the band node can be tiled using tile sizes b_y × b_x × b_s. Figure 13 shows the result of such a block-tiling. However, given the sequence ordering between statements S1 and S2, it is necessary to hoist the sequence node further up the schedule tree so that all sequential loop dimensions appear only in descendant nodes of the hoisted sequence node. This may require splitting the band members so that the outer parallel loop dimensions fall in a separate band from that of the inner sequential dimensions. Figure 14 illustrates this restructuring. Note that the parent band node of the sequence node only contains parallel loop dimensions. The sequential loop dimensions for the two statements are now moved to separate band nodes immediately under the filter nodes. Such a restructuring ensures that the matmul reduction loop is scheduled before its result is consumed by the pointwise operations.

Clearly, the schedule dimensions constituting the outermost band in Figure 14 correspond to kernel parameters blockIdx.y and blockIdx.x, as shown in Figure 15. Furthermore, the innermost bands under both the filter nodes in Figure 14 iterate over the warp tiles. Consequently, as shown in Figure 15, we can perform warp-tiling and strip-mining on both of these filter nodes similar to the transformations illustrated in Figures 4 and 6. Schedule domain contraction, as described in Section IV-D, is then applied for statement S1 so that it performs a 16 × 16 × 8 macro-MMA. The schedule domain of statement S2 is also constrained and its operation type updated to one that performs a pointwise operation on an entire accumulator fragment computed by the macro-MMA operation, effectively distributing the pointwise operation across threads.

a) Avoiding Intermediate Writes To Global Memory: The schedule space for S1 is similar to that obtained earlier for matmul alone (see Figure 7). So, similar memory promotions to those described in Section V hold good for this part of the schedule tree. However, since the matmul result is not live-out, the data in an accumulator fragment need not be stored to global memory. Nevertheless, this result is consumed by the pointwise operations scheduled in the other filter node. So, the accumulator data needs to be converted to fp16 type and then reordered through shared memory, but finally copied back to a register fragment. The reordering ensures that each thread holds a contiguous block of 8 fp16 values in every register fragment.

b) Register Fragments for Pointwise Operation: The register fragments obtained from the accumulator data serve as one of the inputs to the downstream pointwise operation. For the statement S2, every other data tile accessed by its warp-level compute tile is promoted to registers, e.g., those from access to the bias array. These data tiles are similarly distributed across the threads by contracting them by a factor of 16 × 16 to obtain an array of register fragments, each containing 8 fp16 elements. Note that since these fragments have the same data distribution as the register fragments obtained from the accumulator data, they can be loaded from global memory using 128-bit accesses. A copy schedule node is inserted into the schedule tree for the register fragment load operations.

c) Storing Live-Out to Global Memory:
With all the data tiles accessed by a warp-level compute tile of the pointwise operation distributed across threads into register fragments, the result of the pointwise operation is also in a register fragment. Since it is a contiguous block of 8 fp16 values, a final copy schedule node is inserted after the pointwise operation to perform a 128-bit store to global memory for each register fragment.
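As an illustration of the per-fragment epilogue for the bias + ReLU case (the helper name and the definition of the half8 fragment type are our assumptions, not the generated code's), the pointwise operation applied to one 8-wide register fragment before its 128-bit store could look like this:

    #include <cuda_fp16.h>

    struct half8 { __half h[8]; };   // assumed 16-byte register fragment

    // Hypothetical per-fragment epilogue: bias add followed by ReLU on the
    // 8 fp16 values of one register fragment, applied in place.
    __device__ __forceinline__ void epilogue_bias_relu(half8 &acc, const half8 &bias) {
      for (int e = 0; e < 8; ++e) {
        float v = __half2float(acc.h[e]) + __half2float(bias.h[e]);
        acc.h[e] = __float2half(v > 0.0f ? v : 0.0f);   // ReLU
      }
    }
    // The fragment is then written back with a single 128-bit (16-byte) store.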
B. Matmuls with Pointwise Epilogue
Consider the scenario where there are two matmuls of the same shape whose results are consumed by a downstream pointwise operation. The ISL schedule obtained for such an example is shown in Figure 17. The corresponding loop-nest is shown in Listing 4.

DOMAIN: S1[i, j, k] : 0 ≤ i < M ∧ 0 ≤ j < N ∧ 0 ≤ k < K; S2[i, j, k] : 0 ≤ i < M ∧ 0 ≤ j < N ∧ 0 ≤ k < K; S3[i, j] : 0 ≤ i < M ∧ 0 ≤ j < N
BAND: S1[i, j, k] → [i, j, k]; S2[i, j, k] → [i, j, k]; S3[i, j] → [i, j, K]
SEQUENCE
  FILTER: S1[i, j, k]
  FILTER: S2[i, j, k]
  FILTER: S3[i, j]
Fig. 17: Initial schedule tree for two matmuls feeding an addition.

    for (i = 0; i < M; ++i)
      for (j = 0; j < N; ++j) {
        for (k = 0; k < K; ++k) {
          C[i, j] = mul_acc(C[i, j], A[i, k], B[k, j]); //S1
          R[i, j] = mul_acc(R[i, j], P[i, k], Q[k, j]); //S2
        }
        Z[i, j] = add(C[i, j], R[i, j]); /*S3*/
      }
Listing 4: Sum of Matmuls

Comparing Figures 12 and 17, it is clear that the main structural difference between the schedule trees is that there are three statements involved in the latter and so, three corresponding filter nodes in sequence. Consequently, the same schedule transformations can be applied to obtain a schedule tree similar to that in Figure 15, except that it would have three filter nodes – the first two for the two matmuls and the last one for the pointwise operations, i.e., at the thread-level all of these operations are performed in sequence with no interleaving. Memory promotion presents no additional issues and is handled similarly by avoiding writes of intermediate results to global memory with only the live-out data being stored to it.

In this example, the input matrices to both the matmuls have the same shape. However, the overall approach of tiling for blocks with sequence hoisting to ensure that there is no interleaving of the operations can be applied even when the input matrices are of different shapes, so long as the matmul outputs are of the same shape.
C. Pointwise Operations in Prologue
Consider the scenario where the matmul inputs are results of pointwise operations as shown in Listing 5. Since each statement is tagged with a specification of its expression tree, the compound operation of ReLU on an input followed by matmul can be represented by a single statement, with matmul as the root in the expression tree.

    for (i = 0; i < M; ++i)
      for (j = 0; j < N; ++j)
        for (k = 0; k < K; ++k)
          /*S1*/ C[i, j] = mul_acc(C[i, j], relu(A[i, k]), B[k, j]);
Listing 5: ReLU + Matmul
The initial schedule tree for this loop-nest is structurally similar to that in Figure 3. So, all the schedule transformations discussed in Section IV can be applied. This is followed up with memory promotion as discussed in Section V with one crucial difference. Recall that the global-to-shared memory copy is split into a global-to-register copy followed by a register-to-shared copy (see Section V-A). So, all the data that is reused at the block level is first moved to registers. Therefore, when we have pointwise operations in the matmul prologue, instead of moving all this data to shared memory for reuse, the pointwise operations can be performed on these data values and the result can instead be stored to shared memory for reuse. In effect, even though the compound operation in statement S1 could have multiple input dataspaces, the shared memory footprint remains the same as that for a matmul without pointwise operations in its prologue. With the pointwise operations being performed during the data movement, the statement S1 only needs to perform a macro-MMA using two input fragments and an accumulator fragment as in case of a single matmul with no prologue. For the example in Listing 5, the kernel generated would be the same as Listing 2, except that in the register-to-shared copies, the result obtained by applying a pointwise ReLU on the 8 fp16 values held in registers would then be stored to shared memory.
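For illustration, assuming half8 is a struct of 8 __half values, the register-to-shared copy of the A tile in Listing 2 (the hmma_store_shared_a_row_swizzled call) would conceptually change as follows; this is a sketch of the idea, not the exact generated code:

    // Apply ReLU to the 8 fp16 values held in registers before the
    // swizzled store to shared memory (prologue fusion for ReLU + matmul).
    half8 v = (half8&)(private_M_1[(8 * c5)]);
    for (int e = 0; e < 8; ++e)
      v.h[e] = __hgt(v.h[e], __float2half(0.0f)) ? v.h[e] : __float2half(0.0f);
    hmma_store_shared_a_row_swizzled(&shared_mma_M_0[0], v, (8 * c5));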
VIII. EXPERIMENTAL EVALUATION

In this section, we provide an experimental evaluation of the techniques described in the previous sections. We implemented these techniques in the Diesel DSL framework [17]. Diesel provides a basic front-end for specifying tensor expressions and to extract their polyhedral representation. Input to the compiler backend included various code generation options, namely, block tile sizes, warp tile sizes, and the choice of macro-MMA (16 × 16 × 8 or 32 × 32 × 8). The generated kernels were profiled using nvprof for 100 problem sizes that were randomly generated, with problem sizes being multiples of 128 (the largest tile size) up to a size of 4096. For every problem size, we present the best performance obtained using our auto-generated kernels.

Baseline versions of the benchmarks were implemented using cuBLAS 10.1 and cuDNN 7.0. The CUBLAS_GEMM_DFALT_TENSOR_OP algorithm was chosen to target the tensor cores whereas the pointwise operations for bias addition and ReLU activation were implemented using the cudnnAddTensor and cudnnActivationForward
APIs exposed by cuDNN. The experimental evaluation was performed on an NVIDIA Quadro GV100 GPU with the Volta microarchitecture. The nvcc v10.1 compiler was used to compile all the benchmarks with the options: -O3 -std=c++11 -use_fast_math -ccbin g++ -arch=compute_70 -code=sm_70 -expt-relaxed-constexpr.

a) GEMM:
Figure 18 shows a performance plot of the speedups obtained using the auto-generated kernels for GEMM over the cublasGemmEx performance. In one in three cases (32%), the auto-generated kernels were able to match or outperform the baseline version. The peak speedup obtained was 1.75× while it was 0.78× in the worst case. For 60 of the 100 problem sizes, the speedup was 0.9× or higher and the geometric mean speedup was 0.985×.

Fig. 18: Performance for GEMM.
Fig. 19: Performance of prologue fusion: (ReLU + GEMM).

b) ReLU + GEMM:
In this benchmark, a ReLU pointwise activation function is applied on an input matrix to GEMM. The auto-generated kernel fuses the ReLU in the GEMM prologue and the matmul operation into the same device function. Figure 19 shows a performance plot of the speedups obtained using the auto-generated kernels over the baseline, which uses cublasGemmEx for performing the GEMM and cudnnActivationForward. For 61 problem sizes, the auto-generated kernels performed at least as well as the baseline version. The peak speedup obtained was 2.36× with 0.80× in the worst case. The geometric mean speedup was 1.113×. Since the fused kernel is similar to that for GEMM, except that the ReLU operation is performed on registers during the data transfer from global to shared memory, the best tile sizes for the fused kernel correlated with those for GEMM.

c) GEMM + Bias + ReLU: A performance plot of the speedups obtained using the fused kernels that were auto-generated for GEMM + Bias + ReLU is shown in Figure 20. Compared to Figure 18, clearly, kernel fusion moves most of the speedups above the baseline, with the auto-generated kernels outperforming the baseline in 94 out of 100 cases. The peak speedup obtained was 2.55× with 0.89× being the worst case speedup, with a mean speedup of 1.29×.

Fig. 20: Epilogue fusion performance: GEMM + Bias + ReLU.
Fig. 21: Kernel fusion performance for Add(GEMM, GEMM).

d) Add(GEMM, GEMM): In this benchmark, the results of two GEMMs of the same shape were fed to an add operation. Figure 21 shows a plot of the speedups obtained through a fused kernel over a baseline that used cuBLAS for the GEMM and cuDNN for the add operation. For 33 out of the 100 problem sizes, the fused kernel matches or beats the baseline, with a 2.13× peak speedup. The worst case speedup is 0.73×. Overall, with multiple matmuls being fused into the kernel, we noticed there was greater register pressure compared to the other fused kernels, necessitating the use of smaller tiles which, while decreasing register pressure, were not necessarily optimal for the matmul computation in the fused kernel.

IX. RELATED WORK
CUDA libraries such as cuBLAS [1] and cuDNN [3] provide highly tuned GPU-accelerated implementations of standard basic linear algebra routines and deep learning primitives. Cutlass [2] is a CUDA C++ template library which provides performance that is comparable to cuBLAS. It features various compute decomposition and data movement strategies for implementing GEMM, with mixed-precision computation support for Volta tensor cores. The tensor core operations in Cutlass are also implemented using the mma instruction. cuTensor [18] is a recent high-performance CUDA library for GPUs with compute capability greater than or equal to 7.0. It supports various tensor operations such as tensor contractions and pointwise operations, with support for pointwise operator fusion.

Polyhedral compilation has been a topic of active research for several decades [19], [20]. With a large suite of tools and libraries [21]–[25], it has gradually been incorporated into production compilers such as RStream [26], GCC/Graphite [27], LLVM/Polly [28].

Domain specific languages such as PolyMage [29] exploit the sophisticated transformation and code generation capabilities of the polyhedral framework to automatically generate high performance implementations of image processing applications. Our work is based on the Diesel DSL compiler framework developed by Elango et al. [17], who also tackled the problem of efficient CUDA kernel generation for matmul and some epilogue fusion scenarios. However, their focus was more on generating efficient kernels that target traditional CUDA cores. We build on their work to not only target tensor cores but also cover a wider range of computation sequences. Other DSL compiler frameworks such as Halide [16] and TVM [30], which are non-polyhedral, separate the notion of the tensor computation from that of its schedule. Tiramisu [31], a polyhedral framework to generate high performance code for GPUs, also features a scheduling language to provide low-level control over the schedule to the user. However, the schedule primitives for exploiting tensor cores are limited or primarily rely on the CUDA wmma API for programming them directly [32].

Vasilache et al. developed Tensor Comprehensions (TC) [33], [34], which leverages the Halide compiler [16] in conjunction with polyhedral compilation to automatically generate CUDA kernels given a mathematical specification of a deep learning graph. It uses a modified version of the PPCG compiler developed by Verdoolaege et al. [9] with support for operator fusion. While TC handles a larger class of affine loop-nests, we deal with kernel generation for tensor cores with a focus on a few common computation idioms. Zerrell et al. [35] developed Stripe, a nested polyhedral intermediate representation used in the PlaidML [36] compiler with the facility to fuse tensor contractions. MLIR [12] is an ongoing project which aims to unify the compiler infrastructure for machine learning by providing the ability to embed multiple IR dialects in it, e.g., a linear algebra dialect or an affine dialect, with a progressive lowering and transformation of IR dialects. Overall, we believe our work is complementary and could be integrated with many of these frameworks as a library for targeting tensor cores.

X. CONCLUSION
We tackled the problem of automatic generation of efficient CUDA kernels for computation sequences involving matmul and pointwise operations. To the best of our knowledge, this is the first work to leverage polyhedral compilation techniques for exploiting tensor core capabilities on a Volta GPU. In particular, we relied upon macro-MMA compositions of size 16 × 16 × 8 and 32 × 32 × 8, implemented using the mma.sync.m8n8k4 PTX instruction, for targeting tensor cores. Furthermore, we demonstrated that these techniques can lead to significant speedups for a wide range of problem sizes. In the future, we intend to augment this approach with cost models for automatic tile size selection as well as generalize it for subsequent GPU micro-architectures such as Turing.

ACKNOWLEDGMENT
We thank all contributors to the Diesel compiler – Venmugil Elango, Mahesh Ravishankar, Norm Rubin – for creating the framework in which we could try out the ideas described in this paper. We also thank Bastian Hagedorn for his comments on leveraging the mma instructions for targeting Volta tensor cores.
REFERENCES

[1] NVIDIA, "cuBLAS," 2019, https://docs.nvidia.com/cuda/cublas/index.html.
[2] ——, "CUTLASS: CUDA templates for linear algebra subroutines," 2019, https://github.com/NVIDIA/cutlass.
[3] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, "cuDNN: Efficient primitives for deep learning," CoRR, vol. abs/1410.0759, 2014. [Online]. Available: http://arxiv.org/abs/1410.0759
[4] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. A. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: A system for large-scale machine learning."
[5] A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, 2019, pp. 8024–8035. [Online]. Available: http://papers.nips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library
[6] NVIDIA, "CUDA toolkit documentation," 2019, https://docs.nvidia.com/cuda/parallel-thread-execution/index.html.
[8] PLDI, Jun. 2008.
[9] S. Verdoolaege, J. C. Juega, A. Cohen, J. I. Gómez, C. Tenllado, and F. Catthoor, "Polyhedral parallel code generation for CUDA," TACO, vol. 9, no. 4, pp. 54:1–54:23, 2013. [Online]. Available: https://doi.org/10.1145/2400682.2400713
[10] M. Baskaran, U. Bondhugula, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan, "A compiler framework for optimization of affine loop nests for GPGPUs," in ACM Intl. Conference on Supercomputing (ICS), Jun. 2008.
[11] N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W. S. Moses, S. Verdoolaege, A. Adams, and A. Cohen, "Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions," CoRR, vol. abs/1802.04730, 2018. [Online]. Available: http://arxiv.org/abs/1802.04730
[12] C. Lattner, J. A. Pienaar, M. Amini, U. Bondhugula, R. Riddle, A. Cohen, T. Shpeisman, A. Davis, N. Vasilache, and O. Zinenko, "MLIR: A compiler infrastructure for the end of Moore's law," 2020. [Online]. Available: https://arxiv.org/abs/2002.11054
[13] S. Verdoolaege, M. Kudlur, R. Schreiber, and H. Kamepalli, "Generating SIMD instructions for Cerebras CS-1 using polyhedral compilation techniques," 2020.
[14] S. Verdoolaege, "isl: An integer set library for the polyhedral model," in Mathematical Software - ICMS 2010, Third International Congress on Mathematical Software, Kobe, Japan, September 13-17, 2010. Proceedings, 2010, pp. 299–302. [Online]. Available: https://doi.org/10.1007/978-3-642-15582-6_49
[15] S. Verdoolaege, S. Guelton, T. Grosser, and A. Cohen, "Schedule trees," in IMPACT, Jan. 2014.
[16] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. P. Amarasinghe, "Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines," in ACM SIGPLAN Symposium on Programming Languages Design and Implementation, 2013, pp. 519–530.
[17] V. Elango, N. Rubin, M. Ravishankar, H. Sandanagobalane, and V. Grover, "Diesel: DSL for linear algebra and neural net computations on GPUs," in Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, MAPL@PLDI 2018, Philadelphia, PA, USA, June 18-22, 2018, 2018, pp. 42–51. [Online]. Available: https://doi.org/10.1145/3211346.3211354
[18] NVIDIA, "cuTENSOR: A high-performance CUDA library for tensor primitives," 2019, https://docs.nvidia.com/cuda/cutensor/index.html.
[19] P. Feautrier, "Some efficient solutions to the affine scheduling problem: Part I, one-dimensional time," Intl. Journal of Parallel Programming, vol. 21, no. 5, pp. 313–348, 1992.
[20] ——, "Some efficient solutions to the affine scheduling problem: Part II, multidimensional time," Intl. Journal of Parallel Programming, 1992.
[28] IMPACT, 2011.
[29] R. T. Mullapudi, V. Vasista, and U. Bondhugula, "PolyMage: Automatic optimization for image processing pipelines," in Intl. Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '15, 2015, pp. 429–443.
[30] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Q. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, "TVM: An automated end-to-end optimizing compiler for deep learning."
[31] R. Baghdadi et al., "Tiramisu: A polyhedral compiler for expressing fast and portable code," in IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2019, Washington, DC, USA, February 16-20, 2019, 2019, pp. 193–205. [Online]. Available: https://doi.org/10.1109/CGO.2019.8661197
[32] "How to optimize convolution using TensorCores," 2018, https://docs.tvm.ai/tutorials/optimize/opt_conv_tensorcore.html.
[33] N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W. S. Moses, S. Verdoolaege, A. Adams, and A. Cohen, "The next 700 accelerated layers: From mathematical expressions of network computation graphs to accelerated GPU kernels, automatically," TACO, vol. 16, no. 4, pp. 38:1–38:26, 2020. [Online]. Available: https://doi.org/10.1145/3355606
[34] ——, "Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions," CoRR, vol. abs/1802.04730, 2018. [Online]. Available: http://arxiv.org/abs/1802.04730
[35] T. Zerrell and J. Bruestle, "Stripe: Tensor compilation via the nested polyhedral model," CoRR, vol. abs/1903.06498, 2019. [Online]. Available: http://arxiv.org/abs/1903.06498
[36] Intel, "PlaidML," 2019.