Flexible Performant GEMM Kernels on GPUs
Thomas Faingnaert, Tim Besard, Bjorn De Sutter,
Member, IEEE
Abstract — General Matrix Multiplication or GEMM kernels take center place in high performance computing and machine learning. Recent NVIDIA GPUs include GEMM accelerators, such as NVIDIA's Tensor Cores. Their exploitation is hampered by the two-language problem: it requires either low-level programming, which implies low programmer productivity, or using libraries that only offer a limited set of components. Because rephrasing algorithms in terms of established components often introduces overhead, the libraries' lack of flexibility limits the freedom to explore new algorithms. Researchers using GEMMs can hence not enjoy programming productivity, high performance, and research flexibility at once. In this paper we solve this problem. We present three sets of abstractions and interfaces to program GEMMs within the scientific Julia programming language. The interfaces and abstractions are co-designed for researchers' needs and Julia's features to achieve sufficient separation of concerns and flexibility to easily extend basic GEMMs in many different ways without paying a performance price. Comparing our GEMMs to state-of-the-art libraries cuBLAS and CUTLASS, we demonstrate that our performance is mostly on par with, and in some cases even exceeds, the libraries, without having to write a single line of code in CUDA C++ or assembly, and without facing flexibility limitations.
Index Terms — matrix multiplication, graphics processors, high-level programming languages

T. Faingnaert and B. De Sutter are with the Department of Electronics and Information Systems, Ghent University, Belgium. T. Besard works for Julia Computing. [email protected]; [email protected]
1 INTRODUCTION
GEMM (General Matrix Multiplication) kernels form the core of many computations in the fields of HPC (High Performance Computing) and ML (Machine Learning). In HPC, GEMM is at the core of linear algebra [1], including dense linear algebra [2], [3], and is used for earthquake simulation [4], plasma visualization [5], and weather and climate prediction [6]. In ML they are used to train neural networks, including fully connected layers in traditional neural networks, convolutional neural networks, long short-term memory cells, and natural language processing [7], [8]. To accelerate their computations, researchers in the mentioned domains have relied on the massively parallel computing resources of GPUs (Graphics Processing Units). To answer the demand for more efficient GEMMs, recent generations of GPUs include matrix multiplication accelerators, such as NVIDIA's TCs (Tensor Cores) [9].
Researchers can exploit these resources in two ways. They can express their algorithms in high-level PLs (Programming Languages) such as Python and express them in terms of established GEMM variants for which efficient implementations are available in third-party libraries such as cuBLAS or CUTLASS. This approach offers high research productivity, at the cost of being limited to the APIs (Application Programming Interfaces) and GEMM implementations available in the libraries. In many domains, this lack of flexibility is problematic. When non-standard, more generalized GEMMs as needed in neural networks [10], convolutional networks [8], fluid dynamics [11], electromechanics [12], computational chemistry [13], or any other computation on multidimensional tensors [14], [15], [16], [17], [18], [19], [20], [21], [22] are rephrased in terms of standard GEMM kernels available in libraries, additional custom kernels need to be launched in between the GEMM kernels for things such as precision conversions, layout conversions (transpositions), type conversions, bias operations, elementwise operations, etc. These extra kernels introduce huge overheads because they have a massive impact on the traffic to the very slow global memory.
Alternatively, researchers can rewrite the most demanding parts of their software in lower-level PLs such as CUDA C/C++ [23] or OpenCL [24]. This decreases their productivity, however, and they now require much more PL and GPU programming model knowledge outside their own application domain. In short, many researchers working with GEMM-like algorithms suffer from the two-language problem. They cannot achieve high performance, high research productivity, and algorithmic flexibility together.
The scientific PL Julia is designed to overcome the two-language problem [25]. It offers a high-level syntax, dynamic typing, managed memory, meta-programming, multiple dispatch, and other features that increase programmer productivity. Julia's compiler is based on type inference and just-ahead-of-time compilation, which allows it to generate code devoid of much of the run-time overhead (e.g., in the form of dynamic type checks) that other PLs pay for supporting the mentioned features. On CPUs (Central Processing Units), Julia code is comparable in performance to C, C++, and Fortran code [26]. Through the CUDA.jl package, it is possible to program NVIDIA GPUs directly in Julia, at the high abstraction level of arrays or at the lower level of CUDA-like kernels [27], [28]. Before our research, TCs were not supported in CUDA.jl, however. The package and its high-level APIs were hence of limited use to many researchers.
In our research, we set out to overcome this issue in three steps. First, we developed support for TCs in the Julia compiler and libraries through a WMMA API of wrapper functions around so-called compiler intrinsics. This allows for exploiting TCs in kernels written with the lower-level support in CUDA.jl. This low-level API only focuses on the WMMA operation. It does not free the programmer from the cumbersome task of coordinating the memory traffic in the memory hierarchy to move data to and from the TCs. So secondly, we developed a tiling API in Julia that allows programmers to coordinate the memory traffic to and from TCs at a high abstraction level, and, importantly, without paying a price in terms of performance.
The low-level API and the tiling API enable efficient use of TCs and the GPU memory hierarchy, but to reach good performance, the GEMM computations themselves, possibly with fused additional computations, then still need to be programmed at a rather low level of abstraction requiring a lot of expertise. In the final step, we therefore developed a high-level GEMM API in Julia that allows programmers to express and combine a range of extensions of basic GEMM computations in an abstract, intuitive way, without having to pay an unacceptable price in performance. Combined, these three APIs solve the two-language problem to a great extent with respect to hardware resources such as TCs.
Our main result is that we get performance comparable to that of hand-tuned libraries and in some cases even much better performance, without having to write a single line of code in a lower-level PL and without being limited to the specific GEMM versions supported by the libraries.
This paper focuses on the tiling and GEMM APIs. After providing the necessary background in Section 2, Section 3 discusses requirements for a tiling API, presents our novel way for abstracting tiling and the Julia API we designed based on that abstraction, and demonstrates and evaluates the API on a number of stages in GEMM computations. Section 4 discusses the requirements for flexibility in GEMMs in more detail. We present the different building blocks at the basis of our Julia GEMM API that provide that flexibility in an intuitive manner, and we demonstrate the API on a number of examples. In Section 5, our final contribution is a performance evaluation of multiple variants of GEMM computations, showing that we get close to the performance of hand-tuned libraries like cuBLAS and CUTLASS without having to write a single line in a lower-level PL. The paper then ends with a discussion of some related work in Section 6, the availability of our artifacts in Section 7, and with a conclusion and a look forward in Section 8.
2 BACKGROUND
The main difference between programming GPUs versus CPUs is their underlying programming model. GPUs are massively parallel processors, meaning that a large number of threads execute the same function in parallel. In GPU parlance, this function is commonly referred to as a kernel. GPU threads are organised in a thread hierarchy [23]. Since our main interest is in NVIDIA GPUs, we limit our discussion to NVIDIA's CUDA programming model.
Threads are the smallest unit of execution in the hierarchy. The hardware groups them into sets of 32 threads called warps. Threads in the same warp execute in a SIMT (Single Instruction Multiple Thread) fashion. These threads must hence execute the same instruction at the same time, possibly on different data. Threads are also grouped by the programmer into blocks. Threads in the same block can communicate efficiently, so that they can cooperate on a common task. Finally, the set of all blocks on the GPU device is called the grid.
Similarly to threads, GPU memory is also ordered hierarchically. We are mainly interested in three parts of this hierarchy, which correspond directly to levels in the thread hierarchy. The register file is the fastest type of memory. Each thread typically has access to 255 registers. Each block has its own set of shared memory, that may be used by threads in the same block to communicate. Finally, global memory can be accessed by all threads on the device, regardless of which block they belong to. Global memory has the largest capacity, but also has much higher latency and lower throughput.
To fully exploit the available resources on a GPU, programmers can either use low-level PLs like CUDA C/C++ [23] or OpenCL [24] to program their own kernels, or they can use the foreign function interface of high-level PLs such as Python to invoke kernels in libraries. The former option requires quite some knowledge of GPU programming models, forces the programmers to write quite some boilerplate code to manage data in memories and configure the kernels, and offers little performance portability, so manual (re)tuning of code is necessary when porting the code to different devices. Popular libraries such as cuBLAS contain kernel versions tuned for many different devices to overcome the performance portability issue.
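As a point of reference for the kernel and thread-hierarchy concepts above, the following is a minimal sketch of a GPU kernel written directly in Julia with the CUDA.jl package (introduced later in this section); the kernel name and sizes are illustrative only.

using CUDA

# Each thread computes one element; threadIdx/blockIdx/blockDim expose the
# thread hierarchy described above.
function vadd_kernel!(c, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

a = CUDA.rand(Float32, 1024); b = CUDA.rand(Float32, 1024); c = similar(a)
@cuda threads=256 blocks=cld(length(c), 256) vadd_kernel!(c, a, b)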
The open-source PL Julia features a high-level syntax [29]. A central paradigm in its design is the way it handles dispatch, the process by which the compiler chooses which implementation of a function to use for a given function call. Julia uses a multiple dispatch scheme, which means that this choice depends on the types of all of a function's arguments. Julia's type system is dynamic, meaning that the types of expressions are not necessarily known statically. However, Julia inherits some of the advantages of static type systems through several features of its compiler. For one, the Julia compiler applies type inference to deduce the types of values used by the program. Code is then specialised based on this information, e.g., function calls are devirtualized, dynamic type checks are removed, etc. This style of compilation, dubbed just-ahead-of-time, has the performance of ahead-of-time compiled languages with the flexibility of a just-in-time compiled one. We rely on this design to seamlessly compose a GEMM computation from all involved components, i.e., beyond what normal layering of libraries at different layers of abstraction allows, as is typically done with other PLs.
Julia's compiler is built on top of LLVM, a compiler infrastructure project commonly used in research and industry [30]. Julia's compilation process consists of a couple of steps. First, Julia code is converted to an IR (Intermediate Representation) that is used for type inference, called Julia IR. Next, Julia IR is lowered to LLVM IR, the representation that LLVM uses. From this point onwards, the LLVM framework takes control of the compilation process. LLVM contains a set of backends, one for each target architecture that LLVM supports. The backend corresponding to the current architecture will then convert this LLVM IR to native instructions.
The Julia package CUDA.jl reuses part of the Julia compilation process to allow executing kernels written in Julia on NVIDIA GPUs [27]. In particular, the aforementioned compilation pipeline is run to the point where Julia IR is lowered to LLVM IR. The LLVM IR is intercepted and sent to the LLVM NVPTX backend instead of the backend of the host architecture. This NVPTX backend converts the IR to PTX (Parallel Thread Execution) instructions, the virtual instruction set of NVIDIA GPUs.
With CUDA.jl, it is possible to program NVIDIA GPUs at the lower level of CUDA-like kernels and at the higher abstraction level of arrays [27]. The former involves less boilerplate and verbosity, and makes reusing code easier compared to programming in CUDA C/C++. The latter enables much more productive programming [28].
Julia's multiple dispatch enables transparent exploitation of performance-optimized functionality from popular GPU libraries. For example, for any function from the cuBLAS library, a Julia package can contain a generic implementation in pure Julia code that operates for all (numeric) data types and that hence accepts all arguments of type Number. In addition, the package can contain wrappers that each only accept a more concrete argument type such as Float32 and that invoke the corresponding cuBLAS function for that type. Users of the package can then invoke the function on any type they want. If it is supported by the cuBLAS library, they will get optimal performance "for free".
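The following hypothetical sketch illustrates this dispatch pattern. The function scale! and its methods are ours, not part of CUDA.jl; only the forwarding to a cuBLAS wrapper is meant to reflect the mechanism described above, and the exact arguments of the wrapped routine should be treated as an assumption.

using CUDA

# Generic fallback: works for any numeric element type via a generated
# broadcast kernel.
scale!(A::CuArray{<:Number}, α::Number) = (A .*= α; A)

# Specialised method: for Float32 arguments, forward to the tuned cuBLAS
# routine instead, so users get library performance "for free".
scale!(A::CuArray{Float32}, α::Float32) = (CUDA.CUBLAS.scal!(length(A), α, A); A)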
Each TC performs a matrix multiply-accumulate expression of the form D = A · B + C. TCs support a limited set of possible data types for these matrices. For example, if the A and B matrices are stored as 16-bit floating point values, the C and D matrices are 32-bit floating point.
NVIDIA exposes TCs in C++ in the so-called WMMA (Warp Matrix Multiply Accumulate) API. WMMA instructions must be used by all threads in a warp in a SIMT fashion. Each thread that cooperates in a warp-wide WMMA operation holds a part of each matrix in its registers, called a fragment. In the remainder of this paper, unless stated differently, we will assume that A is an M × K matrix, B is a K × N matrix, and C and D are M × N matrices. The tuple (M, N, K) is called the shape of the WMMA operation. Not all possible values of M, N, and K are allowed, as WMMA restricts the set of possible shapes. Conceptually, WMMA consists of three separate steps:
1) Load the input matrices A, B, and C from memory into WMMA fragments using a WMMA LOAD operation.
2) Perform the matrix multiply-accumulate using a WMMA MMA operation, resulting in a fragment of D.
3) Store the resultant D fragment to memory using a WMMA STORE operation.
In CUDA C++ each step corresponds to an overloaded C++ function. Calls to these functions are mapped one-to-one onto the corresponding WMMA PTX instruction by the compiler.
To add support for WMMA to CUDA.jl, we reused the pre-existing WMMA PTX intrinsics in the NVPTX backend. This necessitated adaptations to Julia's compiler, in particular to the code generation process. Our WMMA API consists of two different layers. The lowest layer consists of Julia wrapper functions that are mapped one-to-one to these intrinsics. The second layer is a high-level interface, similar to CUDA C++'s version of WMMA. It consists of load_a, load_b, load_c, mma, and store_d functions, which call the intrinsic wrapper corresponding to the argument types.
At its launch with the Volta architecture in 2017, WMMA only supported 16 × 16 × 16 multiply-accumulates of FP16 matrices. More recent GPU architectures extend the interface with new data types and shapes. Turing's second generation, introduced in 2018, adds support for 8-bit, 4-bit, and 1-bit datatypes, along with new WMMA shapes depending on the datatype used. The most recent version of WMMA includes support for FP64, bfloat16, and TF32 datatypes, and was launched in May 2020 with the introduction of Ampere.
TCs can also be used through libraries instead of WMMA. NVIDIA's cuDNN library contains TC kernels for common ML algorithms. ML frameworks such as TensorFlow, PyTorch, and MXNet use cuDNN for training and inference. cuBLAS, cuBLASLt, and CUTLASS contain optimised GEMM kernels for HPC applications. NVIDIA's cuTENSOR builds on CUTLASS and contains Tensor-Core-accelerated kernels for tensor computations.
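To make the high-level interface concrete, the sketch below shows a single-warp 16 × 16 × 16 multiply-accumulate written against CUDA.jl's WMMA module. It follows the load_a/load_b/load_c, mma, and store_d steps listed above; the Config and layout names are taken from CUDA.jl's documented WMMA interface, but the kernel itself is only an illustrative sketch.

using CUDA

function wmma_kernel(a, b, c, d)
    conf = WMMA.Config{16, 16, 16, Float32}           # (M, N, K) shape and accumulator type
    a_frag = WMMA.load_a(pointer(a), 16, WMMA.ColMajor, conf)
    b_frag = WMMA.load_b(pointer(b), 16, WMMA.ColMajor, conf)
    c_frag = WMMA.load_c(pointer(c), 16, WMMA.ColMajor, conf)
    d_frag = WMMA.mma(a_frag, b_frag, c_frag, conf)   # D = A · B + C on the TCs
    WMMA.store_d(pointer(d), d_frag, 16, WMMA.ColMajor, conf)
    return nothing
end

a = CUDA.rand(Float16, 16, 16); b = CUDA.rand(Float16, 16, 16)
c = CUDA.rand(Float32, 16, 16); d = similar(c)
@cuda threads=32 wmma_kernel(a, b, c, d)              # one full warp executes the operation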
3 ABSTRACTIONS FOR RECURSIVE BLOCKING
Matrix multiplication is rich in data reuse. For example, multiplying square matrices of size N requires O(N³) floating point operations, but only O(N²) storage, so each element is reused roughly O(N) times. To exploit this reuse, data needs to be re-accessed as much as possible in faster memories. When all data does not fit into the fastest memories, the transfers between different memories in the hierarchy need to be coordinated carefully to maximize reuse.
For GEMMs, the general idea is to copy tiles of the input matrices up into the memory hierarchy: from global memory to shared memory and from there to registers. The size of the tiles in each step is chosen such that they fit in the available memory. As computations of different tiles of the resultant matrix are independent, those can be performed completely in parallel to maximize the resource utilization of massively parallel GPUs. Because of the one-to-one mapping between levels of threads and the memory hierarchy, each of the tiled copy operations is also performed cooperatively, by all threads in the relevant part of the thread hierarchy.
Consider again the GEMM of D = A · B + C. With tiling, this GEMM will consist of the following stages:
1) Copy a tile of C from global memory to shared memory, cooperatively by all threads in a block.
2) Copy a tile of C from shared memory to registers, cooperatively by all threads in a warp.
3) Iterate over the K dimension, according to the tiling size of a block.
   a) Copy a tile of A from global memory to shared memory, cooperatively by all threads in a block.
   b) Do the same for a tile of B.
   c) Iterate over the K dimension, according to the tiling size of a warp.
      i) Copy a tile of A from shared memory to registers, cooperatively by all threads in a warp.
      ii) Do the same for a tile of B.
      iii) Compute a tile of D, given the A, B, and C tiles, cooperatively by all threads in a warp.
4) Copy a tile of D from registers to shared memory, cooperatively by all threads in a warp.
5) Copy a tile of D from shared memory to global memory, cooperatively by all threads in a block.
For a WMMA GEMM, stages 2, 3.c.i, and 3.c.ii correspond to WMMA LOAD operations, stage 3.c.iii to MMA operations, and stage 4 to WMMA STORE operations.
This form of recursive blocking is an absolute requirement to achieve good performance, but it is also complex to program. Tile sizes need to be chosen as a function of the hardware, the number of dimensions, and the sizes of the data; indexing of the matrices depends on their layouts; and optimal tiling parameters can differ between memory and computational stages. Determining the tiling parameters is complex, and encoding all address computations in the actual code is cumbersome, error-prone, and results in code that is hard to comprehend, port, and maintain. To make it easier to program general GEMM computations with recursive blocking, we developed a novel API with which the tiling computations can be abstracted to a much higher level. The requirements we put forward for this API are the following:
• Code readability to ease writing kernels that use blocking.
• Zero performance cost compared to manually expressed address computations of all tiles.
• Support for multiple dimensions (>2) to support tensor contractions and batched GEMMs.
• Recursive blocking for the different levels of the memory hierarchy through independent tiling parameters.
• WMMA-compatibility to exploit WMMA.
To meet the requirements, we propose a novel abstraction consisting of four different operations on tiles.
Projection is the first abstraction. The compute stages of GEMM will use tiles that refer to the three-dimensional iteration space (M, N, K). In the memory stages of GEMM, we typically only need two of these dimensions. For example, to load a slice of the A matrix, we are only interested in the M and K dimensions. Projecting a tile reduces its dimensionality by dropping one or more of its dimensions, as shown in Figure 1. The projection abstraction thus allows us to easily reduce the original three-dimensional tile to a tile containing only the relevant dimensions.
Figure 1. Projection of a 3D tile to two dimensions M and K.
Parallelisation is the most important operation of the tiling API. It corresponds to the recursive subdivision of tiles in smaller tiles, and the subsequent parallelisation of the resulting subtiles over a set of collaborating entities, such as thread blocks or warps. Consider the example in Figure 2. A tile of size M × N is divided into 16 smaller subtiles. These subtiles are handled in parallel by a set of 8 cooperating warps, indicated by the numbers 0–7. Note that the set of all cooperating warps does not need to cover the entire tile. In the example, there are 16 subtiles but only 8 warps. This means that each warp will handle 2 of these 16 subtiles. This parallelisation can be applied recursively, by dividing each of these subtiles into sub-subtiles, where each sub-subtile is handled by one thread.
Figure 2. Parallelisation of a tile over a set of 8 cooperating warps.
Translation moves a tile over a specified distance in each dimension. In the example of Figure 3, a two-dimensional tile is moved over a distance m in the M dimension, and a distance n in the N dimension. The translation operation is useful in cases where the reference point of a tile needs to be changed. For example, consider a tile referring to a submatrix stored in global memory. The coordinates of this tile are specified relative to the first element in the first row of the parent matrix in global memory. To copy this submatrix to shared memory, we need to express the tile relative to the first element stored in shared memory, which may be different. To accomplish this, we can simply translate the tile over the correct distance.
Figure 3. Translation of a tile.
Linearisation is used to convert a tile's location from a cartesian index to a linear index. This is needed to calculate the offset of a tile in memory, relative to the base pointer of the parent tile. In the example of Figure 4, we consider a subtile at a cartesian offset of (m, n) from its parent tile with size (M, N). Linearisation results in the linear offset of this tile, relative to the top-left corner of the parent tile. Note that the linearisation process assumes that the matrix is stored in column-major ordering, as this is the convention that Julia uses. In this case, we need to span n columns of M elements each, and an additional m elements to reach the subtile. This corresponds to a linear index of nM + m.
Figure 4. Linearisation to convert a tile to a linear index.
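As a small check of the nM + m formula above, Julia's built-in column-major LinearIndices type (which the tiling API reuses, as described below) gives the same offset:

M, N = 8, 4          # size of the parent tile
m, n = 3, 2          # cartesian offset of the subtile (0-based)

# LinearIndices is 1-based, hence the ±1 adjustments.
@assert LinearIndices((M, N))[m + 1, n + 1] - 1 == n * M + m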
To overcome challenges in developing a concrete API based on the four abstractions while meeting all requirements, we relied on some high-level Julia features.
struct Tile{size, names, T}
    base::NamedTuple{names, T}
    offset::NamedTuple{names, T}
end
Listing 1: The definition of a Tile in the Julia tiling API.

First, a tile is fully determined by its position and its size. Our tiling API contains a Tile struct that stores this information. Storing the size in a field of this struct does, however, not suffice to meet the zero-cost requirement. In Julia, each function is JIT-compiled once for each combination of argument types occurring during the execution of the program. During each such JIT compilation, no specialization takes place based on the values of the arguments. In order for the compiler to generate high-quality code for the different stages in tiled GEMMs, it needs to know how many registers are needed when transferring slices into registers. In other words, the sizes of the tiles need to be available at compile time, such that specialized code can be generated per tile size. Moreover, no dynamic type checking should be necessary in the generated code. To obtain the required specialization, yet avoid that any dynamic type checks are needed, we defined Tile to be a parametrized type, where one of the type parameters (rather than a field) is the size of the tile. Julia's type inference can then obtain all the necessary information to enable specialized code generation in the JIT compiler without running into type instability issues [29].
Secondly, we observe that when we want to implement GEMM using tiling, we typically do not think in terms of the first or second dimension of a tile. Instead, a tile that represents a slice of the A matrix of size M × K has M and K dimensions. Rather than writing the position as pos[1], we can increase readability by naming the dimensions, so that we may write pos.M. This form of syntactic sugar can easily be achieved with the existing Julia type NamedTuple, which we use to store both the position and size of a tile.
Thirdly, we observed a form of structural bias in the Julia-LLVM tool flow with respect to address computations such as those typically occurring in recursive blocking code. The problem is that in many stages of the tiled computation, different threads operate on tiles at different positions in the input matrices or tensors. If the position stored in the Tile is simply a single position, unnecessarily complex PTX code is generated. However, if the position is split into a thread-dependent base index and a thread-independent offset, the compiler generates code that efficiently exploits the available register + constant addressing mode.
The final definition of the parametrised Tile type is shown in Listing 1. With this definition, the implementation for the translate operation is fairly simple. We define the function translate(tile, dist) that returns a new tile with the same size and offset, but where the base is the elementwise sum of the original tile's base and the argument dist. This essentially moves the multi-dimensional tile over the distance specified by the argument.
The first argument of linearise(coord, dim) represents the coordinate of the tile. Note that we do not take the tile itself as an argument, so that linearise can be used for both the base and offset of a tile. Instead of having a separate linearise function for base and offset, we may simply write linearise(tile.base, ...) and linearise(tile.offset, ...). The second argument dim represents the size of the parent tile. To convert the cartesian index to a linear index, we use the LinearIndices type from the Julia standard library. This way, we can both reuse functionality and ensure the linearise operation works for any number of dimensions.
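For illustration, the two operations could look roughly as follows for the Tile type of Listing 1; this is a simplified sketch, not the exact library code.

# translate: only the (thread-dependent) base moves; size and offset stay.
function translate(tile::Tile{size, names, T}, dist::NamedTuple{names}) where {size, names, T}
    new_base = map(+, tile.base, dist)
    return Tile{size, names, T}(new_base, tile.offset)
end

# linearise: convert a 0-based cartesian coordinate to a 0-based linear index,
# reusing the column-major LinearIndices type from Julia's standard library.
function linearise(coord::NamedTuple{names}, dims::NamedTuple{names}) where {names}
    return LinearIndices(Tuple(dims))[(Tuple(coord) .+ 1)...] - 1
end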
One option to project tiles is to define a function project(tile, dims), where dims contains a list of the dimensions to keep. A projection of a tile to the M and N dimensions could then be written as project(tile, (:M, :N)). We instead opted to use Julia's extensibility. In Julia, the syntactic construct a.b is converted to a call to Base.getproperty(a, :b) [29]. Through the multiple dispatch mechanism, we override this function such that one can express the project operation as tile.MN instead of project(tile, (:M, :N)).
Listing 2 shows part of the implementation of the projection operation. As mentioned previously, the construct tile.MN is first converted to the call Base.getproperty(tile, :MN). The type of the second argument, :MN, is a Symbol, indicated by the colon prefix. Symbols are similar to strings, except that they are immutable and only one copy of each distinct value is stored [29]. The Base.getproperty function is specialised for arguments of type Tile on line 1. The value of the sym argument of this function determines the name of the field that was accessed. To generate custom projection implementations for each set of dimensions, we want to dispatch on the value :MN of this argument, rather than its type Symbol. To do this, we can use Julia's Val type, a parametric type with one type parameter. When we call the constructor of Val as Val(sym), a new instance of Val is created where the type parameter is set to sym. This essentially moves the value of sym to the type domain, so that we may use the multiple dispatch mechanism. After creating a Val type, we dispatch to another function getproperty_impl that implements the projection itself.
To make the abstraction zero-cost, we use @generated functions that generate custom code at type-inference time, depending on the argument types, as shown on line 3 of Listing 2. Since we moved the field name to the type domain, we can thus generate a different, specialized implementation for each projection. First note that accesses to the base or offset of a tile using tile.base or tile.offset also get converted to calls to Base.getproperty. Lines 4–8 handle this by checking whether the passed symbol is base or offset. If so, we just return the value of the field by calling getfield. Julia's @generated functions must return an Expr, which is a block of code to be compiled. Such blocks are surrounded with the quote ... end construct, as shown in lines 6–8. The projection itself is implemented in lines 10–22. Line 11 converts the symbol representing the field name to a String, which line 12 then converts to a tuple containing the individual dimensions. For example, if sym is :MN, then sym_str and new_names are "MN" and (:M, :N), respectively. In lines 16–18, an Expr is generated to create new NamedTuples that only contain the relevant dimensions for the base, offset, and size. Finally, line 21 wraps these newly generated NamedTuples in the Tile struct that represents the projected tile, and returns that tile.
@inline Base.getproperty(tile::Tile{size, names, T}, sym::Symbol) where {size, names, T} =
    getproperty_impl(tile, Val(sym))
@generated function getproperty_impl(tile::Tile{size, names, T}, ::Val{sym}) where {size, names, T, sym}
    if sym == :base || sym == :offset

        return quote
            getfield(tile, sym)
        end
    else

        sym_str = String(sym)
        new_names = ntuple(i -> Symbol(sym_str[i]), length(sym_str))

        return quote
            new_base = ...
            new_offset = ...
            new_size = ...

            return Tile{new_size, new_names, ...}(new_base, new_offset)
        end
    end
end
Listing 2: An overview of the implementation of tile projection in the Julia tiling API.

The parallelise operation is exposed as a function call parallelise(tile, tiling_size, index, count). The tile argument of type
Tile is the parent tile that will be subdivided and parallelised over a set of entities that can be blocks, warps, or threads that cooperate. The second argument, tiling_size, determines the tile size that each entity will handle, and the last argument count refers to the number of cooperating entities. Finally, the argument index is an integer from 0 to count - 1, and determines the identifier of the currently executing entity.
Figure 5 shows an example parallelisation. It starts with a parent tile of size 4m × n, divides it in subtiles of size m × n, and parallelises them across 2 warps. The 0/1 in each subtile indicates the warp responsible for it. We write the operation as parallelise(Tile(M = 4 * m, N = n), Tile(M = m, N = n), warpId, 2), where warpId is either 0 or 1, i.e., the id of the currently executing warp.
To generalise the parallelisation operation to multiple dimensions, we again reuse the indexing functionality from Julia's standard library. The information needed for iteration is then stored in a new struct, a TileIterator, that is returned by the parallelise function. Julia allows us to write customised implementations for iterating over user-defined types. For-loops are converted to calls to the Base.iterate function, which may be specialised for our own types. To iterate over TileIterators using a for loop, we must thus specialise the Base.iterate method for TileIterators. Base.iterate is called for each iteration of the for loop, and must return the value associated with each iteration. In the case of TileIterators, each call to Base.iterate will return a Tile corresponding to the tile of that iteration.
All operations on Tiles in our API are built on top of Julia interfaces that work for any number of dimensions. For example, the position and size of each Tile is stored using Julia's NamedTuples, which support any number of dimensions. Similarly, the parallelisation and linearisation operations, which involve computations using multidimensional indices, are written using Julia's generic indexing interfaces. This supports higher dimensions as required.
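The following self-contained toy illustrates the iteration protocol: by adding methods to Base.iterate (and Base.length) for a custom type, an ordinary for loop yields one item per iteration. Here the items are just linear subtile identifiers assigned round-robin to one entity; the real TileIterator yields Tile objects instead.

struct SubtileIds
    index::Int   # id of the executing entity (e.g. a warp), 0-based
    count::Int   # number of cooperating entities
    total::Int   # total number of subtiles in the parent tile
end

# Each call returns (value for this iteration, state for the next one).
function Base.iterate(it::SubtileIds, id = it.index)
    id >= it.total && return nothing
    return id, id + it.count
end

Base.length(it::SubtileIds) = cld(it.total - it.index, it.count)

# Warp 3 out of 8 cooperating warps, 16 subtiles in total: it handles 3 and 11.
collect(SubtileIds(3, 8, 16)) == [3, 11]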
Figure 5. Parallelisation over 2 warps, each handling a set of 2 subtiles.
To illustrate the use, readability, and zero cost of the API, we consider three representative stages of the tiled GEMM.
To copy a tile of C from global to shared memory in step 1 of the complete GEMM, Listing 3 implements the approach illustrated in Figure 6. Each block copies a separate tile, and we launch the GEMM kernel with enough blocks to fully cover the C matrix. The tile size is determined by the block_tile variable. It initially has three dimensions, so we first project it to the M and N dimensions using block_tile.MN on line 1 in Listing 3.
Next, we divide block_tile in subtiles and parallelise the resulting warp_tiles over a set of cooperating warps in the block, also on line 1. The @unroll macro from the Julia package GPUifyLoops.jl [31] informs LLVM to fully unroll the loop. Each of these warp_tiles has size (M = MEM_CD_WARP.M, N = MEM_CD_WARP.N). Typically, MEM_CD_WARP.N is small, so that the resulting warp_tile is highly rectangular. This is necessary to access global memory efficiently, as this guarantees that the threads in one warp access adjacent memory locations. The hardware is then able to coalesce these memory accesses into fewer memory transactions, thus increasing memory throughput. This is commonly referred to as global memory coalescing.
Similarly, we parallelise the warp_tile over the set of 32 threads in a warp on line 2. The integer laneId variable identifies the threads within a warp. Each thread handles a tile of size (M = MEM_CD_THREAD.M, N = MEM_CD_THREAD.N) in each iteration. In the case of an FP32 (Single Precision Floating Point) C matrix, the best choice is MEM_CD_THREAD.M = 4 and MEM_CD_THREAD.N = 1. This way, each thread loads/stores 4 adjacent FP32 elements, such that the GPU can issue one 128-bit load/store, the largest memory transaction size supported by the GPU, thus maximally vectorising the memory accesses.
Figure 6. Copying a tile of the C matrix from global to shared memory.
Note that the positions of all tiles are specified relative to the top-left corner of the current block's tile. This means that thread_tile.index == (M = 0, N = 0) corresponds to a linear index of 0. Because shared memory only stores the tile of the current block, this is the correct index for shared memory. For global memory, we need to offset this tile depending on the currently executing block. To accomplish this, we translate this thread_tile over the correct distance on line 3. Finally, lines 5–8 convert the base and offset of each of these thread_tiles to a linear index. We can then create a pointer to the correct memory location on lines 10–11, and perform the load or store. To separate the constant parts of the memory addresses, we create a pointer using the linearised base, and only add the linearised offset afterwards.
Listing 3 is equivalent to Listing 4, but the latter does not use our tiling API. The outer loop on lines 1–6 corresponds to the first parallelisation, the inner loop on lines 8–13 is the equivalent of the second parallelisation. In both loops the tile bases and offsets are calculated manually. Lines 15–18 convert the bases and offsets to linear indices, and are thus the equivalent of the linearisations in Listing 3. Note that the translation is handled by the addition of the translation offsets block_i and block_j on line 15. Clearly, the use of our tiling API in Listing 3 is less verbose and more maintainable.
Listing 5 shows the CUDA PTX to which Listing 3 is compiled. First, each thread's base addresses are computed in registers %rd20 and %rd13 for shared and global memory, respectively. The loads and stores are vectorised, as indicated by the suffix v4.f32. The stores to shared memory are on lines 6, 12, and 20. As the shared memory size is known at compile time, the code exploits the register plus constant addressing modes as discussed in Section 3.3. By contrast, the compiler does not know the size of the matrix in global memory, so it does not know the linearised offset either, even though the offsets in the M and N dimensions are constants. To calculate the address in global memory, LLVM emits a multiplication (using a bitshift shl.b64) and an addition.
The code in Listing 5 is identical to the code that the Julia-LLVM tool flow generates for Listing 4. We conclude that no superfluous instructions are generated because of the use of our tiling API, for both the loads from global memory and the stores to shared memory.
Note that we can use similar code for steps 2, 3-a, 3-b, 3-c-i, and 3-c-ii of the GEMM, with independently chosen tile configurations for each of them. This way, recursive double-sided blocking is supported.
Figure 7. 3D iteration space in the inner loop of the matrix product.
Figure 8. Computation of the matrix product in the innermost loop.
@unroll for warp_tile = parallelise(block_tile.MN, Tile(MEM_CD_WARP), warpId, WARPS_PER_BLOCK)
    @unroll for thread_tile = parallelise(warp_tile, Tile(MEM_CD_THREAD), laneId, 32)
        global_thread_tile = translate(thread_tile, (M = block_i, N = block_j))

        global_linear_base = linearise(global_thread_tile.base, (M = global_M, N = global_N))
        global_linear_offset = linearise(global_thread_tile.offset, (M = global_M, N = global_N))
        shared_linear_base = linearise(thread_tile.base, (M = shared_M, N = shared_N))
        shared_linear_offset = linearise(thread_tile.offset, (M = shared_M, N = shared_N))

        global_ptr = pointer(global_c, global_linear_base)
        shared_ptr = pointer(shared_c, shared_linear_base)

        value = vloada(Vec{MEM_CD_THREAD, Float32}, global_ptr, global_linear_offset)
        vstorea!(Vec{MEM_CD_THREAD, Float32}, shared_ptr, value, shared_linear_offset)
    end
end
Listing 3: Copying a tile of the C matrix from global to shared memory using our tiling API.

To implement the computation of the matrix product in the inner loop using the tiling API, we will follow the approach illustrated in Figure 7. A block_tile represents the three-dimensional iteration space (M, N, K) used to calculate the tile of the D matrix corresponding to one block. Let us consider the case where a block_tile has size (M, N, K) = (128, 128, 16). This means that each block calculates an M × N = 128 × 128 tile of D, by multiplying all M × K = 128 × 16 tiles in a row of A with all K × N = 16 × 128 tiles in a column of B. These tiles of D are subsequently accumulated by summing over the K dimension.
We want to parallelise this computation over all warps in a block. In the example of Figure 8, each block contains 8 warps. Each warp calculates a 64 × 32 tile of D in each iteration, by multiplying a 64 × 16 tile of A and a 16 × 32 tile of B. Of course, we want tiles in this three-dimensional space with the same M and N indices to be mapped to the same warp, so that we can accumulate across the K dimension. In the case where the matrices are stored in column-major, the warps are assigned to tiles in the order of the M, N, and K dimensions. We can thus simply use a parallelisation operation of size (M, N, K) = (64, 32, 16) across 8 warps, as shown in Figure 8. The 8 warps fully cover the M and N dimensions, as indicated by the 0–7 in each tile. In the next iteration, we have advanced along the K dimension, but the division along the M and N dimensions is the same, so with this choice of tiling size, the parallelisation operation implicitly iterates over the K dimension.
Line 1 of Listing 6 shows this parallelisation operation. It returns a three-dimensional warp_tile. To compute the matrix product using WMMA, we first need to load the A and B tiles into WMMA fragments. To load A, we are only interested in the M and K dimensions, so we first project warp_tile on line 3. This gives a tile of size
(M, K) = (64, 16), which thus consists of four 16 × 16 WMMA fragments. To load those, we first translate the tile in the M dimension over 0, 16, 32, and 48 elements on line 3. Lines 5–6 then convert this translated base and offset to a linear index, which can then be used to create the pointer argument to WMMA.load_a on line 8. Lines 11–18 do the same thing for the B matrix: the warp_tile is projected to the K and N dimensions, translated, and converted to a linear index. Finally, lines 20–24 calculate the 64 × 32 product of D using the WMMA.mma function from our WMMA API.
This example is perhaps the best illustration of the tiling API, as it combines all four operations on tiles: parallelisation, projection, translation, and linearisation. Using these four operations significantly improves readability compared to writing the necessary address calculations by hand.
We omit the PTX code generated for this listing because it provides no additional value, but we confirm similar observations as on the first example: the base addresses of A and B for each warp are calculated once, and stored in registers.

@unroll for warp_offset = 0 : WARPS_PER_BLOCK : (BLOCK_M * BLOCK_N) ÷ (MEM_CD_WARP.M * MEM_CD_WARP.N) - 1
    NUM_WARP_ROWS = BLOCK_M ÷ MEM_CD_WARP.M
    base_warp_i = (warpId % NUM_WARP_ROWS) * MEM_CD_WARP.M
    base_warp_j = (warpId ÷ NUM_WARP_ROWS) * MEM_CD_WARP.N
    warp_i = (warp_offset % NUM_WARP_ROWS) * MEM_CD_WARP.M
    warp_j = (warp_offset ÷ NUM_WARP_ROWS) * MEM_CD_WARP.N

    @unroll for thread_offset = 0 : 32 : (MEM_CD_WARP.M * MEM_CD_WARP.N) ÷ (MEM_CD_THREAD.M * MEM_CD_THREAD.N) - 1
        NUM_THREAD_ROWS = MEM_CD_WARP.M ÷ MEM_CD_THREAD.M
        base_thread_i = (laneId % NUM_THREAD_ROWS) * MEM_CD_THREAD.M
        base_thread_j = (laneId ÷ NUM_THREAD_ROWS) * MEM_CD_THREAD.N
        thread_i = (thread_offset % NUM_THREAD_ROWS) * MEM_CD_THREAD.M
        thread_j = (thread_offset ÷ NUM_THREAD_ROWS) * MEM_CD_THREAD.N

        global_linear_base = (block_i + base_warp_j + base_thread_j) * global_M + (block_j + base_warp_i + base_thread_i)
        global_linear_offset = (warp_j + thread_j) * global_M + (warp_i + thread_i)
        shared_linear_base = (base_warp_j + base_thread_j) * shared_M + (base_warp_i + base_thread_i)
        shared_linear_offset = (warp_j + thread_j) * shared_M + (warp_i + thread_i)

        global_ptr = pointer(global_c, global_linear_base)
        shared_ptr = pointer(shared_c, shared_linear_base)

        value = vloada(Vec{MEM_CD_THREAD, Float32}, global_ptr, global_linear_offset)
        vstorea!(Vec{MEM_CD_THREAD, Float32}, shared_ptr, value, shared_linear_offset)
    end
end
Listing 4: Implementing the first stage in GEMM using manual calculation of addresses.

// Calculate the base addresses in %rd13 and %rd20 ...
shl.b64 %rd22, %rd13, 5;
add.s64 %rd23, %rd17, %rd22;
cvta.to.global.u64 %rd24, %rd23;
ld.global.v4.f32 {%f5, %f6, %f7, %f8}, [%rd24];
st.shared.v4.f32 [%rd20+4096], {%f5, %f6, %f7, %f8};

shl.b64 %rd25, %rd13, 6;
add.s64 %rd26, %rd17, %rd25;
cvta.to.global.u64 %rd27, %rd26;
ld.global.v4.f32 {%f9, %f10, %f11, %f12}, [%rd27];
st.shared.v4.f32 [%rd20+8192], {%f9, %f10, %f11, %f12};

// ... repetition of similar blocks due to unrolling

mul.lo.s64 %rd64, %rd13, 480;
add.s64 %rd65, %rd17, %rd64;
cvta.to.global.u64 %rd66, %rd65;
ld.global.v4.f32 {%f61, %f62, %f63, %f64}, [%rd66];
st.shared.v4.f32 [%rd20+61440], {%f61, %f62, %f63, %f64};
Listing 5: The PTX code generated for Listing 3.
The code in Listing 6 is converted to a set of wmma.load.a, wmma.load.b, and wmma.mma instructions, and the addresses of the load operations are expressed as a constant offset from the base addresses stored in registers. This once again indicates that the tiling abstractions do not introduce any superfluous instructions.

@unroll for warp_tile = parallelise(block_tile, Tile(M = 64, N = 32, K = 16), warpId, 8)
    @unroll for i = 1 : 4
        a_tile = translate(warp_tile.MK, (M = (i-1)*16, K = 0))

        linear_base = linearise(a_tile.base, ...)
        linear_offset = linearise(a_tile.offset, ...)

        a_frags[i] = WMMA.load_a(...)
    end

    @unroll for j = 1 : 2
        b_tile = translate(warp_tile.KN, (K = 0, N = (j-1)*16))

        linear_base = linearise(b_tile.base, ...)
        linear_offset = linearise(b_tile.offset, ...)

        b_frags[j] = WMMA.load_b(...)
    end

    @unroll for i = 1 : 4
        @unroll for j = 1 : 2
            acc_frags[i, j] = WMMA.mma(a_frags[i], b_frags[j], acc_frags[i, j], ...)
        end
    end
end
Listing 6: The computation of the matrix product, implemented using our tiling API.

In the previous example, we studied the calculation of the matrix product in the inner loop of GEMM. After this stage, each warp has a part of the D matrix stored in WMMA fragments.
To store these WMMA fragments to shared memory, we follow the approach illustrated in Figure 9. block_tile represents the same tile as in example 2, i.e., the three-dimensional iteration space used to calculate a tile of the D matrix corresponding to one block. To copy D, we are only interested in the M and N dimensions, so we project this tile to these dimensions first.
Figure 9. Copying a tile of the D matrix from registers to shared memory.
Next, we parallelise this tile over a set of warps. This parallelisation should have the same parameters as the matrix computation in the previous example. Obviously, the tiling size is only specified in the M and N dimensions, instead of in the three dimensions M, N, and K. Figure 9 uses the same tiling sizes as our previous example: block_tile is a 128 × 128 matrix, and is parallelised across 8 warps, each handling a 64 × 32 subtile. The corresponding parallelisation operation returns a warp_tile, and is shown on line 1 of Listing 7. Note that the for loop of line 1 only has 1 iteration in this case, since 8 warps fully cover the block_tile.
Finally, this warp_tile is divided in a 4 × 2 arrangement of WMMA fragments, like in example 2. The for loops on line 2 and line 3 iterate over these 8 WMMA fragments. Line 4 then translates the tile in the M and N dimensions over multiples of 16 elements ((i-1)*16 and (j-1)*16, respectively) to obtain the final tile corresponding to each WMMA fragment. Lines 6 and 7 then convert this cartesian index to a linear index, so that it may be used to create pointers for the WMMA.store_d on line 9.
Again, we omit the generated PTX code, but we can confirm our earlier observations. First, the base address of D for each warp is calculated and stored in a register. After this computation, 8 wmma.store.d instructions are emitted, which use this register as a base address and constant offsets. Once again, we conclude that the use of the tiling API does not introduce any extra overhead.
4 FLEXIBLE GEMM KERNEL ABSTRACTIONS
In the previous section, we designed tiling abstractions to implement performant GEMM kernels. In this section, we add the necessary flexibility to this GEMM, such that users can instantiate a wide range of GEMM variants.
@unroll for warp_tile = parallelise(block_tile.MN, Tile(COMPUTE_WARP).MN, warpId, WARPS_PER_BLOCK)
    @unroll for i = 1 : 4
        @unroll for j = 1 : 2
            tile = translate(warp_tile, (M = (i-1)*16, N = (j-1)*16))

            linear_base = linearise(tile.base, ...)
            linear_offset = linearise(tile.offset, ...)

            WMMA.store_d(..., acc_frags[i, j], ...)
        end
    end
end
Listing 7: Copying a tile of the D matrix from registers to shared memory using our tiling API.

The Google Brain DL (Deep Learning) research team provides an excellent overview of why this flexibility is needed [10]. They focus on Capsule networks, a novel neural network ML idea where the neurons are matrix-valued rather than scalars [32]. In short, they observed that the inflexibility of existing ML frameworks TensorFlow [33] and PyTorch [34] forced the researchers to rephrase their computations in terms of the limited set of GEMM kernels already supported by these frameworks. They had to insert multiple data transposition and matrix materialisation stages that introduced detrimental amounts of memory access overhead. They also had to insert separate kernels in between their network layers to perform simple but not-yet-established operations on the matrix elements. Not being able to fuse those operations in the GEMM kernels themselves, this again introduced massive amounts of overhead. Clearly, for advanced research in a domain such as ML, libraries providing only established GEMM functionality do not suffice.
A flexible GEMM also needs to support a multitude of different memory layouts. Basic GEMMs involve only row-major and column-major layouts. Convolutions, which are also implemented with GEMMs, involve more dimensions than matrices, so more layouts need to be considered. For images, e.g., ML frameworks typically use four dimensions: a batch of N images with C channels, each consisting of W × H features. Among the many possible choices, NCHW and NHWC are most common [8].
Next, we consider the generalisation of matrix multiplications to multi-dimensional tensor contractions, which are common in several scientific fields, such as fluid dynamics [11], electromechanics [12], and computational chemistry [13]. Whereas a matrix-matrix multiplication only has three different indices {m, n, k}, tensor contractions involve an arbitrarily large set of indices. Matrix transpositions are extended to arbitrary permutations of those indices. The number of possible data layouts for tensor contractions is hence much higher. For example, a contraction of 4D tensors has a total of
4! × 4! × 4! = 13824 different memory layouts.
Given the importance of tensor contractions, a lot of research has been done to implement efficient support for the large number of possible cases. Springer and Bientinesi classify the traditional approaches to tensor contraction in three main categories [14]: loop nesting [15], [35], [36], [37], LoG [17], [18], and TTGT (Transpose-Transpose-GEMM-Transpose) [19], [20]. All three of them suffer from serious performance issues due to bad data reuse or the need to insert data reshuffling and transposition kernels.
In 2016, Springer and Bientinesi proposed another method for tensor contractions, GETT (GEMM-like Tensor-Tensor contraction) [37], that has since been adopted by other tensor contraction implementations [21], [22]. GETT is based on the principles of TTGT, but implicitly reorganises tensors while loading them to avoid separate transpositions. GETT can therefore be seen as a variant of TTGT, where the transposes are fused into the GEMM kernel. Clearly, this fusion requires that the underlying GEMM kernel is flexible.
While the tiling and WMMA APIs introduced in previous sections allow that flexibility, programming against them would still be quite cumbersome. We hence propose a higher-level API based on a high-performance GEMM kernel that can easily be customized through a set of higher-level abstractions that are as intuitive as possible to researchers from, e.g., the domains of ML and DL. We put forward the following main requirements for this high-level GEMM API:
• Flexibility with respect to data layouts, on-the-fly data transpositions, and fused operations are our prime objectives. Support for other, more complex data types such as complex numbers [2] or dual numbers as used in automatic differentiation [38] is another form of flexibility.
• Performance of GEMM kernels that are built using our API should be on par with the state-of-the-art implementations, such as cuBLAS or CUTLASS. This obviously implies that our GEMM should be WMMA-compatible and support double-sided recursive blocking, i.e., independent tiling parameters need to be supported for the data transfer stages at different levels of the memory hierarchy and for the computational stages.
• Portability: GEMM kernels built with our API should perform well on a range of devices. We should hence make as few assumptions about the underlying hardware as possible. For example, our API needs to be able to handle different shared memory sizes, as well as GPUs with and without TCs of different generations.
Our strategy is to implement the general structure of a performant GEMM kernel beforehand. To make it flexible, we split this GEMM in a small set of building blocks with a predetermined interface. Concretely, the GEMM contains calls to a set of functions with a predetermined name. To extend the basic implementation, it will suffice to implement new versions of the called functions that customize the behavior based on their input types. Julia's just-in-time type inference and compilation flow enables us to perform this split without introducing performance overhead. Furthermore, Julia's multiple dispatch allows us to make the split orthogonally, which results in more intuitive building blocks and eases code reuse and hence programmer productivity.
We of course still want the user to be able to customise the tiling size of each step of the GEMM kernel. This is the purpose of the params abstraction. This abstraction is essentially a structure that is passed to the kernel, and contains a set of configuration fields. Some of these fields determine the tiling sizes, others specify the kernel's launch configuration, such as the number of warps per block. The user does not need to specify all fields manually. We have implemented a set of heuristics that choose reasonable defaults for fields that are not set explicitly. For example, if the tiling size per threadblock is not set, we choose the largest square (N × N) or nearly-square tile that still fits in shared memory. For the time being, these heuristics are mainly aimed at GEMMs using TCs, but future work could expand these heuristics to other cases as well.
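A hypothetical sketch of what such a params structure can look like is shown below; the field names and defaults (taken from the running example of Section 3) are illustrative, not the package's actual interface.

# Keyword defaults stand in for the heuristics: unspecified fields fall back
# to reasonable values, here the block/warp tiling of the Section 3 example.
Base.@kwdef struct GemmParams
    block_tile      = (M = 128, N = 128, K = 16)   # tiling size per thread block
    warp_tile       = (M = 64,  N = 32,  K = 16)   # tiling size per warp
    warps_per_block = 8                            # part of the launch configuration
end

params = GemmParams(warps_per_block = 4)           # override only what is needed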
The positions of tiles at different levels in the memory hierarchy in our tiling API are expressed in logical coordinates. To convert these logical coordinates to offsets in physical memory, we introduce another abstraction, called layouts. This abstraction corresponds to three functions that can be customised using Julia's multiple dispatch. The size(layout_type, logical_size) function determines the size in physical memory of the layout for a given size in logical coordinates. This physical size is not necessarily the same as the size in logical coordinates. For example, to access shared memory efficiently, it is sometimes necessary to add p padding elements to every column of a column-major matrix. In this case, for a logical size of M × K, the corresponding physical size would be (M + p) × K. The size(...) function is used so that our GEMM API knows how many bytes it has to reserve in shared memory. This function is also used by the heuristics in the params abstraction to select the optimal tiling size in shared memory, as this depends on how much memory a given memory layout requires.
The other two functions are load(layout_type, tile, ...) and store(layout_type, tile, ...). As their names suggest, these functions are responsible for loading or storing the tile at the logical coordinates represented by the tile argument. With these functions, users can implement arbitrary logic to load or store the matrix elements corresponding to a given tile. For example, recall that NVIDIA GPUs can load and store vectors of 16 bytes (128 bits) in a single instruction. This vectorisation of memory accesses is only possible if the base address of the load or store is aligned, i.e., divisible by 16. An AlignedColumnMajor layout can indicate that the necessary alignment requirements are met, so that the corresponding load and store functions can issue vectorised loads and stores.
For a classic GEMM kernel, the most obvious instantiations of the layout building block are
RowMajor and
ColumnMajor . As mentioned before, each of these can beadapted to aligned or padded layouts. To add padding, onecould have
PaddedRowMajor and
PaddedColumnMajor lay-outs, but Julia’s type system allows us to do this more cleanly.We can make a parametrised type
PaddedLayout{Layout,Padding} , where
Padding represents the padding in num-ber of elements, and
Layout is the base layout we wish tomodify, such as
RowMajor or ColumnMajor . The load and store functions for padded layouts would then dispatch tothe implementations for the underlying
Layout .The layout building block can also be used to create aGEMM with a more complicated mapping between logicalindices and physical offsets. For example, GETT’s reinter-pretation of multidimensional tensors as matrices can beperformed using a custom implementation of the layoutbuilding block. Note that a layout does not even need tocorrespond to a matrix that is materialized in memory.Consider a matrix multiplication where the elements ofone of the matrices can be calculated from the position, i.e. a ij = f ( i, j ) for some function f . In this case, we implementa layout where the store function is a no-op, and the load function generates the necessary elements on the fly. Similarstrategies can be used for other matrices with a specialstructure, such as sparse matrices or diagonal matrices. Wecan only store the non-zero elements in memory, and createa custom layout that implements the necessary logic to loador store the correct elements. The next building block is that of transforms . Transformsare arbitrary Julia functors, i.e. functions or structuresimplementing the function call operator () . They are calledafter every load and before every store operation in theGEMM. By having a transform after every load and beforeevery store, elementwise operations to the input and resultmatrices can be applied consistently in our API. Params Logical indexTuple of elements Transformed tupleof elementsShared layout(store)Global layout(load)Tile iterator TileTransform
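To make the layout and transform abstractions just described more concrete, the following is a minimal CPU-side sketch. All names and the scalar-index signatures are hypothetical simplifications; the real GemmKernels.jl functions operate on tiles and GPU memory, and padded layouts forward their load and store implementations to the underlying layout rather than recomputing offsets.

abstract type LayoutSketch end
struct ColMajorSketch <: LayoutSketch end
struct PaddedSketch{L, P} <: LayoutSketch end   # L = base layout, P = padding per column

# size: physical extent needed for a logical (M, K) matrix.
physical_size(::Type{ColMajorSketch}, logical) = logical
physical_size(::Type{PaddedSketch{L, P}}, logical) where {L, P} =
    (logical[1] + P, logical[2])

# load/store: translate logical coordinates to a physical offset, then access memory.
offset(::Type{ColMajorSketch}, logical, i, j) = (j - 1) * logical[1] + i
offset(::Type{PaddedSketch{L, P}}, logical, i, j) where {L, P} =
    (j - 1) * (logical[1] + P) + i   # padded leading dimension

load_elem(::Type{Lay}, mem, logical, i, j) where {Lay <: LayoutSketch} =
    mem[offset(Lay, logical, i, j)]
store_elem!(::Type{Lay}, mem, v, logical, i, j) where {Lay <: LayoutSketch} =
    (mem[offset(Lay, logical, i, j)] = v)

# A transform is just a callable; e.g. an elementwise conversion applied after a load.
struct ConvertTo{T} end
(::ConvertTo{T})(x) where {T} = convert.(T, x)

Because the layout and transform are ordinary types and functions, the kernel that calls them is compiled to specialised code for each combination, which is what allows this flexibility without run-time overhead.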
Figure 10. Copying a tile of A from global to shared memory using the params, layouts, and transforms components in our GEMM API.

Transforms can serve for elementwise operations, such as a simple scaling in the case of GEMM, and for activation functions for artificial neurons in neural networks. Another use case of transforms is to implement type conversions immediately after loading data from global memory. This is useful if one wants to use a higher-precision data type to compute the GEMM, but store the matrices in lower precision in global memory to save capacity.

Figure 10 illustrates how params, layouts, and transforms interact to copy a tile of the A matrix from global to shared memory. A similar structure is used to copy tiles of the B, C, or resultant D matrix. This copy operation is performed cooperatively by all threads in a threadblock, using the parallelisation operation of our tiling API. First, the params component determines the tiling size that should be used for the tile iterator corresponding to the parallelise operation. The GEMM kernel then iterates over this tile iterator, which returns a tile in each iteration. The base and offset of this tile are specified in logical coordinates. To load the correct matrix elements from global memory, the load function is called using this tile and the layout of A in global memory. This load function returns a tuple that contains the correct matrix elements. This tuple is then sent to the transform for the global-to-shared stream of the A matrix, resulting in a transformed tuple. Finally, the store function corresponding to the layout of A in shared memory is called with this transformed tuple and the logical index of the current tile.

The previous building blocks together copy tiles from global to shared memory. The purpose of the next building block, called operators, is to load tiles from shared memory, perform the matrix multiplication, and store the resulting tile back to shared memory. To do so, this building block has five functions associated with it. The load_a, load_b, and load_c functions load tiles of the A, B, and C matrices from shared memory to registers. The matrix computation itself is performed by the mma function, and the result is stored back to shared memory using the store_d function. Like the layout building block, the load_a, load_b, load_c, and store_d functions have a tile argument that represents the logical coordinate of the tile that should be loaded or stored. The load and store functions also have an argument that determines the shared memory layout of the corresponding matrix, so that we can dispatch to different implementations depending on the memory layout that is used. Finally, the mma function has three arguments a_frag, b_frag, and c_frag that represent parts of the A, B, and C matrices stored in registers. The function should perform the multiply-accumulate operation res_frag = a_frag * b_frag + c_frag, and return the resulting fragment res_frag.

The listed functions map one-to-one onto the steps of the WMMA API mentioned in Section 2.3. This is no coincidence, as both the operator building block and the WMMA API are warp-level matrix multiply-accumulate operations. It is hence fairly easy to define an implementation of the operator building block that uses TCs via our WMMA API.
It suffices to convert the tile argument of the load and store functions to a memory address, and to call the load, store, and mma functions of the WMMA API.

The operator building block has several use cases. First, it can be used to provide a custom implementation for the computation in the inner loop of the GEMM. This is useful if the data type of our matrices has a custom multiplication operator, such as complex numbers or dual numbers. The operator building block also improves the portability of the GEMM kernel. For example, the WMMA operator may be parametrised with the WMMA shape, so that we can select the WMMA shape that is optimal for our GPU. Alternatively, we can define an alternative operator that calculates the matrix product using the traditional FPUs (Floating Point Units) instead of TCs on devices that lack the latter.

While the already discussed transform abstraction allows performing elementwise operations on the D matrix, we add an epilogue abstraction to our API to enable customization of the way global memory is updated in the last stage of the GEMM. This enables, e.g., applying a reduction operation across all threadblocks. In the general GEMM implementation, our epilogue building block only has one purpose: to copy tiles of the resultant matrix from shared memory to global memory. By default, we only include one epilogue that simply copies the current threadblock's tile in shared memory to the correct position in global memory. This default epilogue uses the previously mentioned layout building block to determine the memory layout of the resultant D matrix.

As a first example of how to use the presented building blocks and API, we consider the first step in a GEMM kernel: copying a tile of the C matrix from global to shared memory. Listing 3 showed an implementation of this step using our tiling API. In our GEMM API, this first step is implemented as shown in Listing 8. The code has a similar structure to Listing 3, but the linearisation, loads, and stores are replaced by generic calls to Layout.load and Layout.store. The first arguments of these functions, GLOBAL_C_LAYOUT and SHARED_C_LAYOUT, are types that determine the memory layout of C for global and shared memory, respectively. The transform_global_to_shared_c argument is a Julia function that represents the transform that should be applied during the global-to-shared memory stream of the C matrix.

Now suppose that we have defined the necessary components (such as layouts, operators, . . . ) for a given use case. To instantiate and execute GEMM kernels that use these components, we use the user-facing interface of our GEMM API, which is illustrated in Listing 9. This code fragment calculates a mixed-precision matrix product of the form D_ij = max(Σ_k A_ik · B_kj + C_ij, 0). These types of matrix products are common in neural networks, where the activation function max(·, 0) is commonly referred to as a rectified linear unit (ReLU). Lines 1–4 declare the two-dimensional arrays that represent the A, B, C, and D matrices. In lines 6–11, we configure the parameters of our GEMM kernel, such as the overall shape of the GEMM, the operator to be used in the inner loop, and the memory layouts of the A and C matrices. The missing fields are automatically set to reasonable default values. For example, if the memory layout of the B matrix is not specified, it is automatically set to the memory layout of the A matrix. A GEMM kernel that uses this configuration is executed in lines 13–16. The argument transform_regs_to_shared_d determines the transform that should be applied when copying tiles of the resultant D matrix from the register file to shared memory. The call to GemmKernels.matmul will execute each step in the GEMM kernel, using the components given by the user. For example, Listing 8 will be executed with GLOBAL_C_LAYOUT = Layout.AlignedColMajor{Float32}. We conclude that we can instantiate and launch customised GEMM kernels easily, without sacrificing flexibility.
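As an illustration of the operator building block introduced earlier, the following is a hedged sketch of what an FPU-based operator (for devices without TCs) might look like. The names, the scalar base-offset signatures, and the use of plain tuples as register fragments are simplifications for the sake of exposition, not the actual GemmKernels.jl interface.

struct FPUOp{M, N, K} end    # one M x N x K multiply-accumulate per invocation

# Load column-major fragments of A (M x K), B (K x N), and C (M x N) into registers.
load_a(::Type{FPUOp{M, N, K}}, a, base) where {M, N, K} = ntuple(i -> a[base + i], M * K)
load_b(::Type{FPUOp{M, N, K}}, b, base) where {M, N, K} = ntuple(i -> b[base + i], K * N)
load_c(::Type{FPUOp{M, N, K}}, c, base) where {M, N, K} = ntuple(i -> c[base + i], M * N)

# res_frag = a_frag * b_frag + c_frag on the register fragments, using plain FPU math.
function mma(::Type{FPUOp{M, N, K}}, a_frag, b_frag, c_frag) where {M, N, K}
    d = collect(c_frag)                              # mutable copy of the accumulator
    for n in 1:N, k in 1:K, m in 1:M
        d[(n - 1) * M + m] += a_frag[(k - 1) * M + m] * b_frag[(n - 1) * K + k]
    end
    return Tuple(d)
end

# Write the result fragment back to (shared) memory.
function store_d(::Type{FPUOp{M, N, K}}, dst, d_frag, base) where {M, N, K}
    for i in 1:M*N
        dst[base + i] = d_frag[i]
    end
    return
end

A TC-based operator has the same shape: its five functions simply convert the tile argument to an address and defer to the WMMA load, mma, and store functions, as described above.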
The interface of the previous section exposes maximal flexibility to the user, but differs from the interface used by cuBLAS. We also provide a more familiar BLAS-like interface which can be used if not all flexibility is needed. This interface supports all operations of cuBLAS's gemmEx, i.e., linear scaling and transposition of the input matrices, but, importantly, with support for many more input types. To ease the transitioning process, this BLAS-like interface has the same signature as CUDA.jl's gemmEx wrapper. To use our GEMM kernels in existing code, it suffices to replace CUDA.CUBLAS.gemmEx! by GemmKernels.gemmEx!.

Using the BLAS interface, there is no need to specify each component manually. Instead, the components are derived from the types of the arguments. For example, Listing 10 calculates the matrix product C := α · A · B + β · C. Based on the combination of the types of the A, B, and C matrices, our implementation of the BLAS-like interface instantiates a GEMM kernel with operator = Operator.WMMAOp{16, 16, 16}, global_a_layout = Layout.AlignedColMajor{Float16}, etc.

Using the BLAS interface in library code allows extending this library with new types, without changing the library code. For example, the library code in Listing 11 is called for GPU arrays, but does not impose any restrictions on the element type. Using custom element types is as simple as adding support for them in our GEMM framework, and calling the library code with this new type.
@unroll for warp_tile = parallelise(block_tile.MN, Tile(MEM_CD_WARP), warpId, WARPS_PER_BLOCK)
    @unroll for thread_tile = parallelise(warp_tile, Tile(MEM_CD_THREAD), laneId, 32)
        global_thread_tile = translate(thread_tile, (M = block_i, N = block_j))
        x = Layout.load(GLOBAL_C_LAYOUT, c, global_thread_tile)
        y = transform_global_to_shared_c(x, thread_tile)
        Layout.store(SHARED_C_LAYOUT, shmem_c, y, thread_tile)
    end
end

Listing 8: Copying a tile of the C matrix from global to shared memory in our GEMM API.

a = CuArray(rand(Float16, (M, K)))
b = CuArray(rand(Float16, (K, N)))
c = CuArray(rand(Float32, (M, N)))
d = similar(c)

conf = GemmKernels.get_config(
    gemm_shape = (M = M, N = N, K = K),
    operator = Operator.WMMAOp{16, 16, 16},
    global_a_layout = Layout.AlignedColMajor{Float16},
    global_c_layout = Layout.AlignedColMajor{Float32}
)

GemmKernels.matmul(
    a, b, c, d, conf;
    transform_regs_to_shared_d =
        Transform.Elementwise(x -> max(x, 0)))

Listing 9: Calculating the matrix product D_ij = max(Σ_k A_ik · B_kj + C_ij, 0) using our GEMM API.

a = CuArray(rand(Float16, (M, K)))
b = CuArray(rand(Float16, (K, N)))
c = CuArray(rand(Float32, (M, N)))
alpha = rand(Float32)
beta = rand(Float32)
GemmKernels.BLAS.gemmEx!('N', 'N', alpha, a, b, beta, c)

Listing 10: A matrix product using our BLAS-like interface.
For our GEMM framework design, we have strived for orthogonality of the different components. For example, epilogues and operators only interact via shared memory and can be combined arbitrarily as long as they both support the same shared memory layout. Transforms use broadcast expressions that work for different data types and array lengths, and can hence be combined with different layouts or params.

Nevertheless, some inevitable coupling between different components remains. Most prominently, layouts are coupled to epilogues (e.g., a bias epilogue can require different logic for row-major and column-major layouts), and to operators (e.g., WMMA supporting padded and non-padded layouts). Luckily, we can reduce the impact on code verbosity and reuse through several features of Julia. Epilogues can contain layout-agnostic code, and rely on fine-grained method overloading for layout-specific code paths. Metaprogramming can be used to redirect operator calls for padded layouts to the underlying layout, as illustrated in Listing 12.

function library_code(a::CuArray, b::CuArray, c::CuArray)
    GemmKernels.BLAS.gemmEx!('N', 'N', alpha, a, b, beta, c)
end

Listing 11: Library code making use of our BLAS-like API.

for f in (:load_a, :load_b, :load_c, :store_d)
    @eval @inline $f(op, ::Type{Layout.Padded{L, P}}, args...) where {L, P} =
        $f(op, L, args...)
end

Listing 12: Redirecting operator calls for padded layouts to the underlying layout, using Julia's metaprogramming.

Another point that merits some discussion is the extent to which our APIs and abstractions are Julia-specific, i.e., whether or not parts of them can be used for similar APIs in other PLs. While none of our proposed abstractions are Julia-specific, implementations in other PLs would suffer from reduced code reuse, or increased overhead and verbosity. Julia's unique combination of multiple dispatch, type inference, and JIT compilation allows us to compose GEMM operations from different components, without incurring any run time overhead. Due to the parametric nature of Julia's type system, tile sizes can be moved to the type domain, such that specialized code can be generated per tile size. Julia's metaprogramming capabilities prove extremely powerful to improve code reuse and reduce code verbosity.

Finally, we should discuss the issue of portability. So far, we focused on flexible GEMMs for CUDA-enabled GPUs. Nevertheless, the abstractions in our tiling API and flexible GEMM API are vendor-agnostic, and we expect they can be reused for AMD and Intel GPUs. More concretely, porting our framework necessitates two changes. First, our WMMA operator needs to be replaced with an operator using traditional floating point hardware. Secondly, our template kernel contains CUDA-specific concepts such as threadIdx, and hence needs to be ported to OpenCL. Code duplication can be avoided using the package KernelAbstractions.jl, which allows writing vendor-agnostic GPU kernels [39].

EVALUATION
To evaluate the performance and flexibility of our APIs, we created the necessary components for four GEMM variants: a normal mixed-precision GEMM, computations using diagonal matrices, computations exploiting operator fusion, and GEMMs on complex and dual numbers. Run times were measured on an NVIDIA RTX 2080 Ti with NVIDIA Nsight Compute and with BenchmarkTools.jl, which continues sampling until the standard deviation becomes small enough. We compare the performance of our kernels to CUTLASS 2.2 and cuBLAS 11.2. We set the latter's math mode to CUBLAS_TENSOR_OP_MATH and call cublasGemmEx. We use CUDA 11.0 and CUDA.jl.
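As a hedged sketch of how a single configuration might be timed with BenchmarkTools.jl (assuming the a, b, c, d, and conf objects from Listing 9; the paper does not spell out its exact measurement harness), one could write:

using BenchmarkTools, CUDA

# CUDA.@sync waits for kernel completion, so the measured time includes the GEMM itself;
# @benchmark keeps sampling until its statistics stabilise.
@benchmark CUDA.@sync GemmKernels.matmul($a, $b, $c, $d, $conf)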
Our first example is a normal mixed-precision GEMM, i.e., a computation of the form D = A · B + C. This operation is directly supported by NVIDIA's cuBLAS library. To use TCs in our GEMM framework, we create an operator that simply calls the correct WMMA functions in our WMMA API for each step in the GEMM's inner loop.

In a GEMM, the A and B matrices may be stored in a column-major memory layout (N) or a row-major memory layout (T). We hence implemented both a ColMajor and a RowMajor layout component. These layouts are suitable for global memory, but lead to inefficient memory accesses in shared memory. On NVIDIA GPUs, shared memory is split into a set of memory banks. Memory accesses to addresses that map to the same bank, so-called bank conflicts, are serialised. The simplest way to reduce these bank conflicts is to add padding to every column or row, such that the mapping of matrix elements to banks is changed. To achieve this, we use a PaddedLayout component to store matrices in shared memory. This layout serves as a wrapper for other layouts; e.g., PaddedLayout{ColMajor, 8} is a column-major layout where every column is padded by 8 elements. Out of the three functions associated with layouts, only size needs to be specialised for each type of padded layout. This is necessary because padding differs for, e.g., row-major layouts vs. column-major layouts. Calls to load or store are automatically redirected to the underlying layout. As a result, supporting padding only required adding 14 lines of source code. To use padded layouts for a GEMM, it suffices to set, e.g., shared_a_layout = Layout.PaddedLayout{Layout.ColMajor, 8}, similarly to lines 9–10 in Listing 9. The epilogue for the mixed-precision GEMM simply copies a tile from shared memory to global memory.

Figure 11 compares the performance of our mixed-precision GEMM to CUTLASS and cuBLAS. The different lines and markers represent the four combinations of the data layouts of the A and B matrices; the shaded regions represent error margins. We have no explanation for the two anomalous results in the measurements for CUTLASS and cuBLAS; they occurred consistently over many experiments. For the most interesting, larger matrices with N = 2048 and more, the performance of our kernels ranges between 82% and 86% of cuBLAS, the best performing library. We conclude that our kernels achieve reasonably good performance, despite being written completely in Julia. To the best of our knowledge, no existing implementation purely written in a single higher-level PL comes close.

The performance difference between our kernel and the state of the art in CUTLASS and cuBLAS can be attributed to two factors. First, our implementation does not yet contain data swizzling to avoid bank conflicts in shared memory. A device-dependent layout is necessary for swizzling optimally for a GPU's shared memory implementation, which we have not yet explored. Secondly, cuBLAS does not use WMMA, but accesses TCs directly, allowing the use of custom memory layouts in shared memory that perform better than our padded layouts. Implementing this in our framework necessitates a new layout and a new operator component. This custom operator would use the mma family of instructions instead of WMMA. Contrary to WMMA, mma does not have load or store instructions, so the distribution of matrix elements to different threads needs to be done explicitly. Use of the mma operator would hamper portability, though, because some mma instructions are optimized for a particular GPU architecture. Engineering this is future work.

Figure 11. Performance of our mixed-precision GEMM and the state-of-the-art implementations in CUTLASS and cuBLAS.
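For concreteness, a configuration that requests such a padded shared-memory layout for A might look as follows. The shared_a_layout field and the padded layout type are taken from the text above; the remaining fields simply mirror Listing 9, so the exact combination shown here is only an example.

conf = GemmKernels.get_config(
    gemm_shape = (M = M, N = N, K = K),
    operator = Operator.WMMAOp{16, 16, 16},
    global_a_layout = Layout.AlignedColMajor{Float16},
    shared_a_layout = Layout.PaddedLayout{Layout.ColMajor, 8},   # 8 elements of padding per column
    global_c_layout = Layout.AlignedColMajor{Float32})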
Next, we focus on a mixed-precision GEMM where the A or B matrix is a diagonal matrix. In Julia, diagonal matrices are represented using the Diagonal parametric type. This type acts as a wrapper for a one-dimensional array, which contains the diagonal elements. References to elements on the diagonal are redirected to loads or stores of this underlying array, whereas references off the diagonal return 0 without performing any actual memory access.

We leveraged Julia's multiple dispatch to provide a GEMM implementation that is specialised for diagonal matrices by means of two optimisations. First, we replace the layouts from Section 5.1 with a Diagonal layout. Similar to Julia's Diagonal wrapper, this layout simply returns 0 if the accessed element lies off the diagonal, and otherwise accesses an array. Second, we extended our template GEMM kernel with a customisable predicate that determines for each inner-loop iteration whether it should be executed or skipped. By default, this predicate is constantly true, but we specialise it for diagonal matrices so that iterations that perform computations on elements off the diagonal are skipped.
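A hedged sketch of such a diagonal layout follows. The names and the scalar-index signatures are hypothetical simplifications of the real tile-based layout interface; the point is that only the one-dimensional array of diagonal elements is ever touched.

struct DiagonalSketch{T} end

# Loads off the diagonal return a constant zero without any memory access.
load_elem(::Type{DiagonalSketch{T}}, diag, i, j) where {T} =
    i == j ? diag[i] : zero(T)

# Stores off the diagonal are dropped; on the diagonal they update the backing array.
function store_elem!(::Type{DiagonalSketch{T}}, diag, v, i, j) where {T}
    i == j && (diag[i] = v)
    return
end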
Table 1 compares the performance of this specialised GEMM kernel with cuBLAS. As cuBLAS does not include GEMM kernels specialised for diagonal matrices, the cuBLAS version of our code had to materialise the diagonal matrix before calling the standard cuBLAS GEMM kernel. This is in line with the common practice as discussed in Section 4.1. The first optimisation results in a reduction of 89% in global memory traffic. The second optimisation reduces the number of TC operations by over 95%. Together, they lead to a GEMM that is more than 6 times faster than what we can obtain with cuBLAS's inflexible kernels. Adding support for diagonal matrices required adding 23 lines of source code to our existing framework. We conclude that specialisation of kernels in our framework requires minimal effort, while at least in some cases still obtaining massive performance improvements.

Table 1
Performance of our diagonal matrix GEMM and the equivalent cuBLAS implementation, for N = 4096; run times are for 100 iterations.

                          cuBLAS               Ours
Run time (ms)             … ± 0.08             50.… ± 0.67
Tensor Core operations    3113K per kernel     109K per kernel

cuBLAS fuses linear scaling into its GEMM computation, i.e., its GEMM is of the form D = α · AB + β · C. Other computations cannot be fused into cuBLAS's kernels and require a separate kernel launch. We consider two examples of GEMM computations that can exploit operation fusion: custom elementwise operations and adding a bias vector.

In custom elementwise operations, the linear scaling with α and β is replaced by an arbitrary function. We implemented this in our GEMM framework using an ElementwiseTransform component. It has an arbitrary function as a parameter, which is applied to every element. In bias computations, a one-dimensional bias vector is added to every row of the matrix product. To add support for bias in our GEMM framework, we created a custom epilogue that loads the bias vector from global memory, and adds it to the matrix product in shared memory, before writing the result back to global memory.

Table 2 compares the run times of five GEMMs with and without elementwise operations on input and/or output matrices, and with and without bias vectors. For the purpose of this experiment, we use ReLU as the elementwise operation, a popular function in the domain of ML, but similar results are obtained with other operations that are not supported in CUTLASS, and for which researchers would have to fall back on cuBLAS or our solution. For this reason, we only compare to cuBLAS here. With cuBLAS, combining the GEMM with elementwise operations or bias vectors results in additional kernel launches. By contrast, our framework seamlessly fuses all operations into the GEMM kernel. The effect is clearly visible in the results in the table. While the standard GEMM in cuBLAS is faster than with our framework, in line with the results in Section 5.1, our framework catches up as more fusable operations are added. Whereas the extra operations require almost no additional execution time in our framework (bias vectors need to be loaded, hence their small cost), each operation costs approximately 10% in performance with cuBLAS; with all operations fused, our framework actually becomes about 5% faster. This clearly illustrates the need for operator fusion, and how effective our approach is. The fact that our approach fuses the operations seamlessly also implies that any future optimisation of our default implementation of the mixed-precision GEMM, through swizzling or use of mma instead of WMMA, will automatically benefit GEMMs with fusable operations as well. The performance advantage of our GEMM over cuBLAS will then grow even further.
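To make the fused bias epilogue described above more concrete, the following is a hedged sketch of its core idea. The index bookkeeping is schematic; in the real epilogue the loop is expressed with the tiling API and parallelised over the threads of a block, after which the default copy to global memory proceeds as usual.

function bias_epilogue!(shmem_d, bias, block_j)
    tile_m, tile_n = size(shmem_d)
    for j in 1:tile_n, i in 1:tile_m          # parallelised over threads in the kernel
        shmem_d[i, j] += bias[block_j + j]    # bias indexed by the global column
    end
    return
end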
Table 2
Run times of cuBLAS, which lacks fusion capabilities, and of our GEMMs that exploit operation fusion, for N = 4096 and 100 iterations.

                                   cuBLAS [ms]      Ours [ms]
GEMM                               … ± 0.66         390.… ± …
GEMM + ReLU on D                   … ± 0.14         389.… ± …
GEMM + bias                        … ± 0.09         392.… ± …
GEMM + bias + ReLU on D            … ± 0.51         393.… ± …
GEMM + bias + ReLU on C & D        … ± 0.10         392.… ± …

Standard mixed-precision GEMMs differ from mixed-precision GEMMs on complex numbers in two ways. First, the WMMA multiply-accumulate operation in the inner loop is replaced by four WMMA operations:
A.real * B.real, A.real * B.imag, A.imag * B.real, and A.imag * B.imag. Second, complex GEMMs use different memory layouts in global and shared memory. In global memory, complex matrices are typically stored in an interleaved layout, where the real and imaginary parts are stored contiguously. This layout is incompatible with WMMA, so in shared memory we use a split layout instead, where the real and imaginary parts are stored separately. In our GEMM framework, these two differences correspond to a new operator, WMMAComplexOp, and two new layouts, InterleavedComplex and SplitComplex, respectively.

Dual numbers differ slightly from complex numbers: the imaginary unit i is replaced by ε, and ε² = 0 whereas i² = −1. As such, we need an additional WMMADualOp operator component, but we can reuse the split and interleaved layouts we developed for complex matrices.
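The following worked example spells out the multiply-accumulate arithmetic behind these operators for scalars; in the kernel, each scalar product below corresponds to a WMMA operation on a real-valued fragment. This is a CPU-side illustration of the arithmetic only, not the framework's operator code.

# Complex: (a_re + a_im·i)(b_re + b_im·i) + (c_re + c_im·i), using i² = −1.
function complex_mma(a_re, a_im, b_re, b_im, c_re, c_im)
    d_re = a_re * b_re - a_im * b_im + c_re
    d_im = a_re * b_im + a_im * b_re + c_im
    return d_re, d_im
end

# Dual numbers: ε² = 0, so the ε·ε product vanishes and only three products remain.
function dual_mma(a_re, a_ε, b_re, b_ε, c_re, c_ε)
    d_re = a_re * b_re + c_re
    d_ε  = a_re * b_ε + a_ε * b_re + c_ε
    return d_re, d_ε
end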
Figure 12. Performance of complex and dual GEMMs in our framework and state-of-the-art implementations.

Figure 12 shows the performance of four GEMM kernels using complex or dual numbers. As cuBLAS supports neither complex numbers using TCs nor dual numbers, we did not include it in the comparison. cuBLASLt does support complex numbers, but uses CUTLASS's kernels, and CUTLASS only supports complex numbers. So we compare the performance of our two kernels to CUTLASS for complex numbers, and to the generic CUDA.jl kernel that is invoked when our API is not used to compute a mixed-precision GEMM on dual numbers. Our complex number implementation achieves a peak performance of 59% of CUTLASS's peak performance. We conclude that, despite the fact that our kernels are written completely in Julia and do not contain optimizations specific to complex GEMM, we are still able to reach reasonably good performance.

The difference in performance between our complex GEMM and CUTLASS's can partly be attributed to CUTLASS's use of a split layout in both global and shared memory, eliminating the overhead of changing layouts in the global-to-shared memory stream. Our kernels use an interleaved layout in global memory, as this is the format that Julia uses. This comes with some overhead, but eliminates the need for the extra interleaved-to-split and split-to-interleaved kernel launches that would be necessary for CUTLASS.

While CUDA.jl contains wrappers for cuBLAS's GEMM kernels, it supports only a limited number of data types. Calling these wrappers with unsupported types, such as dual numbers, falls back to a generic implementation that is many orders of magnitude slower, as is clear from the CUDA.jl line in Figure 12.

Adding support for complex and dual numbers to our framework required 169 lines of source code, of which 85 are common to both data types. We conclude that extending our framework with custom data types requires minimal effort, especially compared to the alternative of writing a performant GEMM kernel for these data types from scratch.

RELATED WORK
In Section 4.1, we already discussed the need for flexible and performant GEMMs on GPUs in various domains. As discussed in Section 2.3, several libraries exist to deliver the performance. We compared the performance of our GEMM API with some of them in Section 5.

NVIDIA's CUTLASS template library contains components to instantiate performant GEMMs on GPUs [40]. As it is written in C++, it does not solve the two-language problem and its impact on programmer productivity. It did serve, however, as an inspiration for the abstractions in our GEMM API. For example, CUTLASS also has the notion of layouts that map logical indices to physical memory offsets, and epilogues for custom post-processing. Some components have a slightly different purpose, however. In CUTLASS, transforms only apply to the global-to-shared memory stream of the A and B matrices. Elementwise transforms on the resultant matrix are handled in a custom epilogue. Adding support for custom transformations requires significant effort, as CUTLASS epilogues are typically 150–200 lines long. In our GEMM API, transforms are applied after every load and before every store, ensuring that elementwise operations to the input and resultant matrices can be applied more easily and consistently.

The CUTLASS codebase is extensive and contains many components, making it more difficult for end users to get started extending it. Each GEMM typically involves quite a few layered template instantiations, impacting code comprehension. Most templates are heavily specialised for different memory layouts, computations, etc., reducing orthogonality and code reuse. For example, CUTLASS contains different epilogues for GEMMs exploiting TCs and GEMMs using FPUs. Our GEMM API abstractions offer better separation of concerns and hence more reusability.

Like our GEMM API, CUTLASS contains both a BLAS-like interface and an interface exposing all its flexibility. Launching a CUTLASS kernel using the latter requires more boilerplate compared to our approach in Listing 9, e.g., because CUTLASS users need to explicitly allocate a workspace that is used internally by CUTLASS.

NVIDIA's cuBLASLt is a lightweight BLAS library dedicated to GEMM, with a more flexible API than cuBLAS. This flexibility comes in the form of support for more matrix layouts, input and compute data types, and algorithmic implementations. It is available on CUDA 10.1 or later. Like our GEMM API, launching a kernel in cuBLASLt is a two-step process. First, a "plan" must be created that determines the options for the GEMM computation. Second, this plan is used to launch one or more GEMM kernels. cuBLASLt features concepts similar to CUTLASS, such as epilogues that post-process the resultant matrix and layouts describing how matrices are stored in memory. Each of these corresponds to an enumeration that lists the legal values, hence limiting flexibility. For example, it only includes bias and ReLU as possible epilogues, and offers no support for custom layouts such as diagonal matrices, or custom data types such as dual numbers. While cuBLASLt's interface is an improvement over that of cuBLAS, its closed-source nature still results in limited extensibility.

BLIS is a framework that facilitates the instantiation of an entire BLAS library for new architectures, and hence has a larger scope than just GEMM [41]. It achieves this by rephrasing all BLAS operations in terms of a limited set of kernels. Its focus is on CPUs rather than GPUs, however.
Similar to our GEMM kernel, BLIS contains a set of components that can be reused for new BLAS-like operations. BLIS's GEMM kernels offer support for more memory layouts and data types than traditional BLAS libraries. Nevertheless, BLIS's flexibility is mainly aimed at developers of the BLIS library, instead of its users. For example, extending BLIS with support for complex numbers required significant effort that warranted a separate paper describing its implementation [42].

Section 3 presented a tiling API that allows programmers to coordinate memory transfers to improve data locality. Automated tools based on polyhedral optimisation exist that can automatically generate tiled code from nested loops [43], [44], [45]. Basic approaches only reorder memory accesses, but more advanced ones can also exploit parallelism. For example, Polly can exploit inter-tile parallelism using the OpenMP interface [46]. The framework by Baskaran et al. is even capable of automatically adding padding for shared memory accesses to reduce bank conflicts [47].

Recent work by Bondhugula uses the polyhedral utilities in MLIR to generate performant GEMM kernels [48]. His approach achieves a performance that is within 9% of state-of-the-art CPU GEMMs in BLIS and MKL. It still offers limited flexibility, however. Incorporating domain-specific optimisations, such as support for diagonal matrices, is significantly harder than in our approach, as it requires adaptations to the MLIR code base and/or TableGen rules.

The Diesel DSL (Domain-Specific Language) uses polyhedral techniques to compile high-level expressions to
performant GPU kernels [49]. Bhaskaracharya et al. extend Diesel with support for TCs and fused kernels that combine matrix multiplications with ReLU and bias [50]. Their work focuses only on Volta TCs, however, and does not address other forms of GEMM flexibility such as support for more complex data types. Additionally, only a limited number of elementwise operations are supported.

Halide, a DSL for image processing, was also extended with TC support by Sioutas et al. [51]. Their approach features a fixed kernel skeleton, similar to our template GEMM kernel. Their kernel also makes use of WMMA, and achieves performance results similar to ours. It still has limited flexibility, however. For example, it offers no support for complex data types, and it only handles one combination of memory layouts for the A and B matrices, which necessitates explicit transposition kernels.

Note that BLIS, the polyhedral techniques (in so far as they support GPUs), Diesel, and Halide all involve statically compiled, statically typed PLs. For the general-purpose PLs, this by itself foregoes the productivity advantages of rapid-prototyping PLs such as Julia or Python. For the DSLs, it implies that code reuse across domains is limited. Those solutions also by construction do not consider flexibility beyond the data types, layouts, and operations typical for their domains. Our solution does not suffer from these drawbacks, yet obtains performance in the same ballpark.
AVAILABILITY

Our contributions are open source and available in the relevant GitHub repositories. Support for Tensor Cores using WMMA was merged into CUDA.jl, and is available in the latest stable version. The required adaptations to the Julia compiler were sent to the developers, and have been merged upstream. Our tiling and flexible GEMM APIs are bundled in one Julia package, GemmKernels.jl. This package is available at https://github.com/thomasfaingnaert/GemmKernels.jl, and can easily be installed using Julia's built-in package manager. It contains all instantiations of our API abstractions for all experiments and listings in this paper, ready for out-of-the-box reuse.

CONCLUSIONS AND FUTURE WORK
In this paper, we first presented tiling abstractions with which programmers can use tiling techniques, which are necessary to achieve high performance for many computations, at a high level of abstraction.

We then discussed a flexible GEMM API in which the kernel consists of a set of orthogonal components. Each of these components corresponds to a set of Julia functions that can be specialised for different GEMM variants. We demonstrated the flexibility of this approach by instantiating the necessary components for four variants of GEMM computations: a normal mixed-precision GEMM, computations using diagonal matrices, computations exploiting operator fusion, and GEMMs on complex and dual numbers. We argued how specific features of the Julia compiler, such as multiple dispatch, type inference, and just-ahead-of-time compilation, allow for this flexibility without run time overhead.

An experimental evaluation showed that the performance of our GEMM kernels written entirely in Julia is in the same ballpark as, and in some cases even exceeds, the state of the art in the manually tuned cuBLAS and CUTLASS.

We presented two interfaces to use our flexible GEMM API: a fully-featured interface and a BLAS-like interface. The former exposes the full flexibility of our framework; the latter extends BLAS's GEMM with support for more data types such as dual numbers. We demonstrated our APIs for CUDA-enabled GPUs, but our abstractions are vendor-agnostic and can be ported to other GPU architectures.

In the future, we plan to port our framework to other GPUs such as those of AMD and Intel, and to add support for the mma family of instructions as well as data swizzling, as used in CUTLASS and cuBLAS, to improve performance.

ACKNOWLEDGEMENTS
This work was funded by the Research Foundation Flanders (Fonds voor Wetenschappelijk Onderzoek), grant number 3G051318.

REFERENCES

[1] BLAS contributors. (2017) BLAS (Basic Linear Algebra Subprograms).
[2] A. Abdelfattah, S. Tomov, and J. Dongarra, "Towards half-precision computation for complex matrices: A case study for mixed precision solvers on GPUs," in IEEE/ACM 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2019, pp. 17–24.
[3] A. Haidar, S. Tomov, J. Dongarra, and N. J. Higham, "Harnessing GPU Tensor Cores for fast FP16 arithmetic to speed up mixed-precision iterative refinement solvers," in Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, ser. SC '18. IEEE Press, 2018.
[4] T. Ichimura, K. Fujita, T. Yamaguchi, A. Naruse, J. C. Wells, T. C. Schulthess, T. P. Straatsma, C. J. Zimmer, M. Martinasso, K. Nakajima, M. Hori, and L. Maddegedara, "A fast scalable implicit solver for nonlinear time-evolution earthquake city problem on low-ordered unstructured finite elements with artificial intelligence and transprecision computing," in Proc. Int'l Conference for High Performance Computing, Networking, Storage, and Analysis, 2018.
[5] A. Haidar, H. Bayraktar, S. Tomov, and J. Dongarra, "Harnessing Tensor Cores FP16 arithmetic to accelerate linear solvers and HPC scientific applications," 2018, NVIDIA GPU Technology Conference.
[6] V. Mehta, "Getting started with Tensor Cores in HPC," 2019, NVIDIA GPU Technology Conference.
[7] D. Yan, W. Wang, and X. Chu, "Demystifying Tensor Cores to optimize half-precision matrix multiply," in Proc. 34th IEEE International Parallel and Distributed Processing Symposium, 2020.
[8] NVIDIA. (2019, 6) Deep learning performance guide.
[9] ——. (2020) NVIDIA V100.
[10] P. Barham and M. Isard, "Machine learning systems are stuck in a rut," in Proc. Workshop on Hot Topics in Operating Systems, 2019, pp. 177–183.
[11] N. Rink, A. Susungi, J. Castrillón, J. Stiller, and C. Tadonki, "CFDlang: High-level code generation for high-order methods in fluid dynamics," in Real World Domain Specific Languages Workshop 2018, 02 2018, pp. 1–10.
[12] R. Poya, A. J. Gil, and R. Ortigosa, "A high performance data parallel tensor contraction framework: Application to coupled electro-mechanics," Computer Physics Communications, vol. 216, pp. 35–52, 2017.
[13] A. Auer, G. Baumgartner, D. Bernholdt, A. Bibireata, V. Choppella, D. Cociorva, G. Xiaoyang, R. Harrison, S. Krishnamoorthy, S. Krishnan, C.-C. Lam, Q. Lu, M. Nooijen, R. Pitzer, J. Ramanujam, P. Sadayappan, and A. Sibiryakov, "Automatic code generation for many-body electronic structure methods: The tensor contraction engine," Molecular Physics, vol. 104, 01 2006.
[14] P. Springer and P. Bientinesi, "The landscape of high-performance tensor contractions," in Workshop on Batched, Reproducible, and Reduced Precision BLAS.
[15] T. Nelson, A. Rivera, P. Balaprakash, M. Hall, P. D. Hovland, E. Jessup, and B. Norris, "Generating efficient tensor contractions for GPUs," 2015, pp. 969–978.
[16] P. Springer and P. Bientinesi, "Design of a high-performance GEMM-like tensor–tensor multiplication," ACM Transactions on Mathematical Software (TOMS), vol. 44, no. 3, pp. 1–29, 2018.
[17] E. D. Napoli, D. Fabregat-Traver, G. Quintana-Ortí, and P. Bientinesi, "Towards an efficient use of the BLAS library for multilinear tensor contractions," Applied Mathematics and Computation, vol. 235, pp. 454–468, 2014.
[18] J. Li, C. Battaglino, I. Perros, J. Sun, and R. Vuduc, "An input-adaptive and in-place approach to dense tensor-times-matrix multiply," in Proc. Int'l Conference for High Performance Computing, Networking, Storage and Analysis, 2015, pp. 1–12.
[19] E. Solomonik, D. Matthews, J. Hammond, and J. Demmel, "Cyclops tensor framework: Reducing communication and eliminating load imbalance in massively parallel contractions," 2013, pp. 813–824.
[20] B. W. Bader and T. G. Kolda, "Algorithm 862: MATLAB tensor classes for fast algorithm prototyping," ACM Transactions on Mathematical Software, vol. 32, no. 4, pp. 635–653, December 2006.
[21] J. Kim, A. Sukumaran-Rajam, V. Thumma, S. Krishnamoorthy, A. Panyala, L.-N. Pouchet, A. Rountev, and P. Sadayappan, "A code generator for high-performance tensor contractions on GPUs," in Proc. IEEE/ACM Int'l Symposium on Code Generation and Optimization, 2019, pp. 85–95.
[22] D. A. Matthews, "High-performance tensor contraction without transposition," SIAM Journal on Scientific Computing, vol. 40, no. 1, pp. C1–C24, 2018.
[23] NVIDIA. (2020) CUDA C++ programming guide.
[24] Khronos Group. (2020) OpenCL: An open standard for parallel programming of heterogeneous systems.
[25] JuliaLang.org. (2020) The Julia language.
[26] ——. (2020) Julia micro-benchmarks.
[27] T. Besard, C. Foket, and B. De Sutter, "Effective extensible programming: Unleashing Julia on GPUs," IEEE Transactions on Parallel and Distributed Systems, vol. 30, no. 4, pp. 827–841, 2019.
[28] T. Besard, V. Churavy, A. Edelman, and B. De Sutter, "Rapid software prototyping for heterogeneous and distributed platforms," Advances in Engineering Software, vol. 132, pp. 29–46, 2019.
[29] JuliaLang.org. (2020) The Julia language official documentation.
[30] LLVM contributors. (2020) The LLVM compiler infrastructure project.
[31] V. Churavy. (2020) GPUifyLoops.jl: Support for writing loop-based code that executes both on CPU and GPU.
[32] G. Hinton, S. Sabour, and N. Frosst, "Matrix capsules with EM routing," in International Conference on Learning Representations, 2018.
[33] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: A system for large-scale machine learning," in Proc. 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016, pp. 265–283.
[34] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems 32, 2019, pp. 8024–8035.
[35] E. Aprà, M. Klemm, and K. Kowalski, "Efficient implementation of many-body quantum chemical methods on the Intel® Xeon Phi coprocessor," in Proc. Int'l Conference for High Performance Computing, Networking, Storage and Analysis, 2014, pp. 674–684.
[36] W. Ma, S. Krishnamoorthy, O. Villa, and K. Kowalski, "GPU-based implementations of the noniterative regularized-CCSD(T) corrections: Applications to strongly correlated systems," Journal of Chemical Theory and Computation, vol. 7, no. 5, pp. 1316–1327, 2011.
[37] P. Springer and P. Bientinesi, "Design of a high-performance GEMM-like tensor-tensor multiplication," 2016.
[38] J. Revels, M. Lubin, and T. Papamarkou, "Forward-mode automatic differentiation in Julia," arXiv:1607.07892 [cs.MS], 2016.
[39] V. Churavy. (2020) KernelAbstractions.jl: Heterogeneous programming in Julia. [Online]. Available: https://github.com/JuliaGPU/KernelAbstractions.jl
[40] NVIDIA. (2020) CUTLASS: CUDA templates for linear algebra subroutines.
[41] F. G. Van Zee and R. A. van de Geijn, "BLIS: A framework for rapidly instantiating BLAS functionality," ACM Trans. Math. Softw., vol. 41, no. 3, Jun. 2015.
[42] F. G. Van Zee, "Implementing high-performance complex matrix multiplication via the 1M method," SIAM Journal on Scientific Computing, vol. 42, no. 5, pp. C221–C244, 2020.
[43] U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan, "A practical automatic polyhedral parallelizer and locality optimizer," in Proc. 29th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2008, pp. 101–113.
[44] S. Verdoolaege, J. Carlos Juega, A. Cohen, J. Ignacio Gomez, C. Tenllado, and F. Catthoor, "Polyhedral parallel code generation for CUDA," ACM Transactions on Architecture and Code Optimization (TACO), vol. 9, no. 4, pp. 1–23, 2013.
[45] P. Di, D. Ye, Y. Su, Y. Sui, and J. Xue, "Automatic parallelization of tiled loop nests with enhanced fine-grained parallelism on GPUs," in Proc. 41st Int'l Conference on Parallel Processing, 2012, pp. 350–359.
[46] T. Grosser, A. Groesslinger, and C. Lengauer, "Polly - performing polyhedral optimizations on a low-level intermediate representation," Parallel Processing Letters, vol. 22, no. 04, p. 1250010, 2012.
[47] M. M. Baskaran, U. Bondhugula, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan, "A compiler framework for optimization of affine loop nests for GPGPUs," in Proc. 22nd Annual Int'l Conference on Supercomputing, 2008, pp. 225–234.
[48] U. Bondhugula, "High performance code generation in MLIR: An early case study with GEMM," preprint arXiv:2003.00532, 2020.
[49] V. Elango, N. Rubin, M. Ravishankar, H. Sandanagobalane, and V. Grover, "Diesel: DSL for linear algebra and neural net computations on GPUs," in Proc. 2nd ACM SIGPLAN Int'l Workshop on Machine Learning and Programming Languages, 2018, pp. 42–51.
[50] S. G. Bhaskaracharya, J. Demouth, and V. Grover, "Automatic kernel generation for Volta Tensor Cores," arXiv preprint arXiv:2006.12645, 2020.
[51] S. Sioutas, S. Stuijk, T. Basten, L. Somers, and H. Corporaal, "Programming tensor cores from an image processing DSL," in Proc. 23rd Int'l Workshop on Software and Compilers for Embedded Systems, 2020, pp. 36–41.
Thomas Faingnaert has just started his PhD research at Ghent University in the Computer Systems Lab, where he obtained his MSc degree in Computer Science Engineering in 2020. His research focuses on high-level abstractions for GPU programming in Julia.
Tim Besard is a software engineer at Julia Computing. He obtained his MSc in Computer Engineering from University College Ghent in 2011, and his PhD in Computer Science Engineering from Ghent University in 2019. He is currently the lead maintainer of several GPU back-ends for the Julia programming language.