Accelerating Reduction and Scan Using Tensor Core Units

Abdul Dakkak, Cheng Li
University of Illinois Urbana-Champaign, Urbana, Illinois
{dakkak,cli99}@illinois.edu

Jinjun Xiong
IBM T. J. Watson Research Center, Yorktown Heights, New York
[email protected]

Isaac Gelado
NVIDIA Corporation, Santa Clara, California
[email protected]

Wen-mei Hwu
University of Illinois Urbana-Champaign, Urbana, Illinois
[email protected]
ABSTRACT
Driven by deep learning, there has been a surge of specialized processors for matrix multiplication, referred to as Tensor Core Units (TCUs). These TCUs are capable of performing matrix multiplications on small matrices (usually 4 × 4 or 16 × 16) to accelerate HPC and deep learning workloads. Although TCUs are prevalent and promise an increase in performance and/or energy efficiency, they suffer from over-specialization, as only matrix multiplication on small matrices is supported. In this paper we express both reduction and scan in terms of matrix multiplication operations and map them onto TCUs. To our knowledge, this paper is the first to try to broaden the class of algorithms expressible as TCU operations and the first to show the benefits of this mapping in terms of program simplicity, efficiency, and performance. We implemented the reduction and scan algorithms using NVIDIA's V100 TCUs and achieved 89%–98% of peak memory copy bandwidth. Our results are orders of magnitude faster (up to 100× for reduction and 3× for scan) than state-of-the-art methods for small segment sizes (common in HPC and deep learning applications). Our implementation achieves this speedup while decreasing the power consumption by up to 22% for reduction and 16% for scan.

ACM Reference Format:
Abdul Dakkak, Cheng Li, Jinjun Xiong, Isaac Gelado, and Wen-mei Hwu. 2019. Accelerating Reduction and Scan Using Tensor Core Units. In ACM/SPEC International Conference on Supercomputing (ICS '19), June 26–28, 2019, Phoenix, AZ.
ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3330345.3331057
INTRODUCTION

Deep learning's reliance on matrix multiplication (GEMM) for compute has driven both research and industry to develop matrix-multiplication accelerator hardware — collectively called Tensor Core Units (TCUs) in this paper. TCUs are designed to accelerate Multilayer Perceptrons (MLP), Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Deep Neural Networks (DNN) in general.
Figure 1: Each subcore (processing block) in the NVIDIA Tesla V100 PCI-E architecture contains two TCUs alongside its load/store units, register file, control logic, cache, and FP64, INT, and FP32 units. In total, 640 TCUs are available, achieving a theoretical peak of 112 TFLOPS.
TCUs come under the guise of different marketing terms, be it NVIDIA's Tensor Cores [55], Google's Tensor Processing Unit [19], Intel's DLBoost [69], Apple A11's Neural Engine [3], Tesla's HW3, or ARM's ML Processor [4]. They vary in the underlying hardware implementation [15, 27, 63, 71], and are prevalent [18, 55, 58] in both cloud and edge devices. To show the theoretical benefits of TCUs, consider the NVIDIA Volta V100 GPU architecture. Using the V100 Tensor Cores, one achieves an 8× throughput increase per Streaming Multiprocessor (SM) over the previous Pascal GP100 generation. This throughput increase is because each V100 SM is capable of performing 1024 half precision operations per cycle using the TCUs, whereas the GP100 SM is capable of performing 128 half precision operations per cycle without TCUs. The throughput increase is enabled by the fact that the V100 dedicates a large chip area of the SM subcore to TCUs (Figure 1).

Although TCUs are prevalent, promise an increase in performance and/or energy efficiency, and are heavily used within supercomputers [32, 57] to achieve exascale performance, they suffer from over-specialization. Currently, no algorithm other than GEMM utilizes the NVIDIA TCUs. This results in idle TCUs, low chip utilization, and limits the TCUs' applicability to specialized libraries or narrow application domains.

The objective of the paper is to expand the class of algorithms that can execute on TCUs — enabling the TCUs to be used within a wider range of non-GEMM algorithms. We choose reduction and scan, since a large body of work [7, 9, 36] has shown that they are key primitives for data parallel implementations of radix sort, quicksort, lexical analysis, stream compaction, and polynomial evaluation. In this paper, we formulate a mapping of reduction and scan onto TCUs. We then introduce algorithms for cache- (warp-), processing element (PE)/core- (block-), and device- (grid-) level reduction and scan and show their performance on NVIDIA TCUs. We separate our algorithm description from implementation, making the algorithms, motivation, methods, and observations generally applicable to a broader range of TCUs and agnostic to numerical precision. While the formulation is the main objective of the paper, we show that an implementation of our algorithms on the NVIDIA V100 is either orders of magnitude faster than or rivals the fastest GPU implementations, with much lower programming complexity. The key contributions of the paper are:

(1) We show how to use TCUs to compute both reduction and scan. We believe we are the first to formulate these algorithms in terms of TCU operations in a manner that is independent of the underlying TCU architecture.
(2) We implement our algorithms on NVIDIA V100 GPUs and show orders of magnitude speedup over state-of-the-art algorithms for small segment sizes. Small segments are common in mathematics (e.g. evaluating polynomials), scientific applications (e.g. finite difference), and machine learning (e.g. batch norm) applications. For large segments, we are comparable to the fastest algorithms and achieve 89–98% of theoretical peak memory copy bandwidth.
(3) We show that our implementation is up to 22% more power efficient and decreases the utilization of general purpose ALUs.
(4) We describe the current usage and programmability of the NVIDIA Tensor Cores and evaluate GEMM on the TCUs using cuBLAS [46], CUTLASS [49], and the CUDA TCU API.

This paper is divided as follows: we first describe the NVIDIA TCUs and show the performance of GEMM and GEMV computation in Section 2. In Section 3, we give a background of reduction and scan and show the TCU algorithms for reduction (Section 4) and scan (Section 5). We then compare our implementation against the state of the art in Section 6. Section 7 describes the related work, before we conclude in Section 8.
A marquee feature of NVIDIA's GPUs (Volta's Tesla V100 and Turing's TU102 architectures) and Google's TPUs are their TCUs — programmable matrix multiply-and-accumulate hardware units, called Tensor Cores by NVIDIA and matrix-multiply units (MXUs) by Google. While there are other competing TCU implementations, both NVIDIA's Tensor Cores and Google's TPU are by far the most popular. At a high level, their functionality and architectural design are similar. They both subdivide the device into cores, with each having multiple processing blocks (or subcores) and TCUs. Figure 1 illustrates a subcore in an NVIDIA SM, with the V100 containing 80 SMs and each SM having 4 subcores. In turn, each subcore contains two Tensor Cores — for a total of 640 Tensor Cores, achieving a 12× throughput improvement over the previous generation Tesla P100 [54]. Google's TPUv3 device, on the other hand, has 8 cores — 4 chips each with 2 cores — with each core having 2 MXUs. We use TCU and Tensor Core interchangeably in this paper.
Since Google TPUs can currently only be used within Google Cloud using the XLA compiler [33], while the NVIDIA V100 TCUs are widely available and installed in supercomputers [32, 57], this section only describes the TCU usage and results for the NVIDIA V100. Similar analysis can be performed for other TCUs. Each NVIDIA V100 Tensor Core provides a 4 × 4 × 4 matrix processing array that performs D = A · B + C within a cycle, where A, B, C, and D are 4 × 4 matrices. A and B must be in half precision format, while the accumulators, C and D, can be either single or half precision. Each Tensor Core can perform 64 (4 × 4 × 4) FMA operations per cycle. Therefore, using the TCUs each SM can perform 1024 (= 64 × 2 × 8) floating point operations per cycle, since each FMA consists of two floating point operations and each SM contains 8 Tensor Cores. This is an 8× SM throughput increase compared to Pascal for floating point operations [54]. This section first describes the current usage of the NVIDIA Tensor Cores, then details the current NVIDIA Tensor Cores API, and presents evaluation results to motivate our work.
Currently, Tensor Cores have only been used to accelerate GEMM operations, most prominently through NVIDIA's CUDA libraries (such as cuBLAS [46] and cuDNN [48]). These libraries require users to opt in to use the Tensor Cores to accelerate GEMM computation. NVIDIA also provides the CUTLASS (CUDA Templates for Linear Algebra Subroutines) [49] library, which is a C++ templated library that provides building block primitives to write high performance GEMM-like kernels. Deep learning frameworks such as NVCaffe [25], Caffe2 [8], MXNet [43], PyTorch [59], TensorFlow [67], and TensorRT [56] leverage these NVIDIA libraries for DNN training [50] and inference acceleration.
Aside from the libraries, NVIDIA also provides a CUDA C++ Warp Matrix Multiply and Accumulate (WMMA) [47] API to program the Tensor Cores directly. The current WMMA API provides warp-level matrix operations for matrix load (load_matrix_sync), matrix store (store_matrix_sync), and matrix multiply and accumulate (mma_sync). These APIs operate on a special data type, fragment, which holds a matrix tile in thread-local registers. A helper function to broadcast a scalar constant into a fragment (fill_fragment) is provided as well. No API currently exists for calling TCU operations at sub-warp level — neither in the NVVM IR nor in PTX [52, 53].

The load_matrix_sync function distributes values of the matrix across the warp lanes. Threads within a warp utilize multiple Tensor Cores concurrently to perform the mma_sync operation — collaborating to compute D_{M×N} = A_{M×K} · B_{K×N} + C_{M×N}, with M, N, K denoting the matrix dimensions. The API imposes limitations on the dimensions — requiring the shape ⟨M, N, K⟩ to be either ⟨16, 16, 16⟩, ⟨32, 8, 16⟩, or ⟨8, 32, 16⟩.

Listing 1 shows a CUDA kernel that computes a ⟨16, 16, 16⟩ matrix multiplication within a warp using the WMMA API. Lines 4–6 declare the matrix fragments. The API supports 3 kinds of matrices — matrix_a (A), matrix_b (B), and accumulator (C or D) — with M = N = K = 16.
Figure 2: General matrix-matrix multiplication (GEMM) performance using Tensor Cores for both half (2a) and mixed (2b) precision on a V100 PCI-E GPU with a clock frequency of 1380 MHz and a 112 TFLOPS peak performance. The inputs are square matrices with variable ⟨M, N, K⟩ dimensions; (a) shows GEMM with half precision input and half precision output, while (b) shows mixed precision GEMM with half precision input and single precision output, each comparing the WMMA, naïve WMMA, cuBLAS (with and without TCU), and CUTLASS implementations. The optimized and naïve WMMA GEMM algorithms are described in the text.

Listing 1: A CUDA kernel (dot_wmma_16x16) that computes a ⟨16, 16, 16⟩ matrix multiplication within a warp using the WMMA API.
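Only the opening of Listing 1 survives in this copy, so the following is a minimal sketch of such a kernel, assuming a single warp, row-major operands, and a half-precision accumulator; the kernel name and argument order follow the surviving fragment, while the body is our reconstruction rather than the paper's exact listing:

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda::wmma;

// One warp multiplies two 16x16 half-precision tiles and writes the 16x16 result.
__global__ void dot_wmma_16x16(half *a, half *b, half *c) {
  fragment<matrix_a, 16, 16, 16, half, row_major> a_frag;   // A operand tile
  fragment<matrix_b, 16, 16, 16, half, row_major> b_frag;   // B operand tile
  fragment<accumulator, 16, 16, 16, half> c_frag;           // C/D accumulator tile
  fill_fragment(c_frag, __float2half(0.0f));        // C = 0
  load_matrix_sync(a_frag, a, 16);                   // load A with leading dimension 16
  load_matrix_sync(b_frag, b, 16);                   // load B with leading dimension 16
  mma_sync(c_frag, a_frag, b_frag, c_frag);          // D = A * B + C on the Tensor Cores
  store_matrix_sync(c, c_frag, 16, mem_row_major);   // write the result tile to memory
}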
A naïve WMMA HGEMM implementation assigns each warp to compute a 16 × 16 tile of the output C. Each warp iterates through the A rows and B columns by loading 16 × 16 tiles of A and B from global memory into the fragments using load_matrix_sync, then performing mma_sync, and repeats. After all rows of A and columns of B have been consumed, the warp uses store_matrix_sync to store the accumulated C values into global memory. An optimized implementation (referred to as WMMA HGEMM) utilizes persistent threads where each thread block collaboratively loads multiple tiles of matrix A and B into shared memory (to facilitate tile re-use). The tiles are then loaded into fragments and the mma_sync operation is performed. The mapping between individual matrix elements and their residing thread(s) is purposely opaque [47] and undocumented. We discuss how we alleviate some of the constraints in Section 6.1.
Figure 3: General matrix-vector multiplication (GEMV) performance using Tensor Cores on a V100 PCI-E GPU, comparing naïve WMMA-based HGEMV/MGEMV, cuBLAS-GEMM-based HGEMV/MGEMV, and cublasSgemv. GEMV can be implemented in terms of a GEMM in which the vector is padded to a 16-column matrix, or by calling the GEMV method in cuBLAS (which currently does not support half precision).

To show the TCU performance, we evaluate GEMM using Tensor Cores on an NVIDIA Tesla V100 PCI-E GPU with CUDA 9.2. For half precision GEMM (HGEMM), shown in Figure 2a, cuBLAS with Tensor Cores achieves close to an order of magnitude speedup over cuBLAS without the use of TCUs. For mixed precision GEMM (MGEMM), shown in Figure 2b, a maximum performance of roughly 85 TFLOPS is observed, again a sizable speedup over cuBLAS without Tensor Cores (the degradation of performance compared to HGEMM is due to the output byte count being twice as large). CUTLASS MGEMM is more performant than CUTLASS HGEMM; this is due to compiler and hardware optimizations for mixed precision that are absent for half precision [62].

The order of magnitude speedup of GEMM with TCUs raises the question: can we formulate other algorithms in terms of matrix multiplication and also benefit from the TCU? The most obvious algorithm is matrix-vector multiplication (GEMV). We implement HGEMV (half precision GEMV) and MGEMV (mixed precision GEMV) using cuBLAS HGEMM or MGEMM with the vector padded to a 16-column matrix. This method wastes at least 15N memory loads and performs 15MN extra flops. We evaluate our implementations against cuBLAS SGEMV, since half precision GEMV is not present within cuBLAS.

Figure 4: The reduction algorithm is (1) composed of warp-level reduction that reduces each segment and is used to (2) implement block-level reduction that further reduces each segment of partially reduced values. The partially reduced values are reduced across the grid (3) to perform full reduction.

Figure 3 shows that even when accounting for both resource and computation waste, HGEMV, implemented using cuBLAS HGEMM with Tensor Cores, outperforms cuBLAS SGEMV by at least 2× and saturates at 900 GFLOPS due to the HBM2 global memory bandwidth. Naïve HGEMV and MGEMV are superimposed atop each other since the overhead of using mixed precision is dwarfed by the inefficient memory access. Both naïve versions still outperform cuBLAS' SGEMV for large inputs. The GEMV evaluation shows that the performance of matrix multiplication on NVIDIA TCUs is high enough to tolerate resource and computation waste in algorithms. Driven by this observation, we examine how to formulate two widely used primitives — reduction and scan — to utilize TCUs.
We start by defining reduction and scan. Reduction (also called fold or total) of a vector A = [a_1, a_2, ..., a_n] is defined by its sum Σ_{i=1}^{n} a_i. Segmented reduction is defined as reductions on subsets of the input vector. In a regular segmented reduction, all segments are the same size. The scan operation (also called prefix sum) for the same vector A is defined by the vector [a_1, a_1 + a_2, ..., Σ_{i=1}^{n} a_i]. Segmented scan is defined similarly to segmented reduction.

Listing 2: Warp-level reduction and scan implemented using warp shuffle instructions.

__device__ half warp_reduce(half val) {
  for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2)
    val += __shfl_down_sync(0xFFFFFFFFU, val, offset);
  return val;
}
__device__ half warp_scan(half val) {
  // inclusive warp scan via __shfl_up_sync (standard shuffle-based formulation)
  for (int offset = 1; offset < WARP_SIZE; offset *= 2) {
    half n = __shfl_up_sync(0xFFFFFFFFU, val, offset);
    if ((threadIdx.x % WARP_SIZE) >= offset) val += n;
  }
  return val;
}
For GPUs, state of the art libraries [1, 12, 38] implement both reduction and scan in terms of warp-, block-, and device-level collectives, as illustrated in Figure 4. The warp-level collectives are commonly implemented using shuffle instructions [28], shown in Listing 2, which allow threads within a warp to share values via registers without synchronization or the use of shared memory. Shuffle instructions can be a bottleneck due to their limited throughput, however. For example, on the NVIDIA Volta architecture only 32 warp shuffle operations can be performed within a clock cycle per SM.
Intuitively, reduction can be implemented using TCUs by representing it as a special case of matrix multiplication, since

\[
\mathrm{Reduction}\big([a_1, a_2, \ldots, a_n]\big) =
\begin{bmatrix}
1 & 1 & \cdots & 1\\
0 & 0 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & 0
\end{bmatrix}
\cdot
\begin{bmatrix}
a_1 & a_2 & \cdots & a_n\\
0 & 0 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & 0
\end{bmatrix}^{T}
=
\begin{bmatrix}
\sum_{i=1}^{n} a_i & 0 & \cdots & 0\\
0 & 0 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & 0
\end{bmatrix}.
\]

The challenge is to map generic input sizes onto the fixed matrix dimensions supported by the TCUs. For simplicity, this paper will assume that the TCU supports only matrices with
16 × 16 dimensions. Other hardware may require other dimensions, and those can be used without modifying the core idea of the algorithms. The algorithms are also presented in a precision-agnostic way.

We use ReductionK to represent a K regular segmented reduction — partial reductions of the input uniformly partitioned into K-element subsets. We will use P to denote the matrix which has ones in the first row and zeros otherwise, i.e.

\[
p_{r,c} = \begin{cases} 1 & \text{if } r = 1\\ 0 & \text{if } r \neq 1, \end{cases}
\]

and we write a constant X (e.g. 0 or 1) within a matrix expression to denote a matrix where all elements are the constant value X. To make our formulation non-specific to the NVIDIA WMMA API, we present our algorithms in an API-neutral way. In the following sections, we use LoadTile in place of load_matrix_sync, which takes a memory address, a matrix layout (default is row-major), and a stride (default is 16) as input. We abstract store_matrix_sync to make its output addressable as if it were a linear array. We will also use the notation A · B + C to denote the mma_sync operation. This paper uses the standard CUDA terminology for warp, block, and grid to explain the algorithms, since no other standard nomenclature exists. The warp, block, and device used in this paper correspond to the three memory hierarchy levels: L-cache, PE/core, and device.

We introduce warp-level reduction first, since it is the building block for both block- and grid-level reductions. We formulate reduction using TCUs for segment sizes 16, 256, and multiples of 16 and 256. Arbitrary segment sizes can be supported either by padding the input with zeros or by masking the P matrix. We find that padding introduces minimal overhead and is required in some cases to maintain the memory alignment imposed by the TCU APIs.
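To make the "Initialize P matrix" step of the algorithms below concrete, the following sketch shows one plausible way to build P and place it into a WMMA fragment; the shared-memory staging and the helper name init_P are our own assumptions (fragments cannot be filled from constant memory, as discussed in Section 6.1), and the headers from the sketch after Listing 1 are assumed:

// Build the 16x16 P matrix (ones in the first row, zeros elsewhere) in shared
// memory and load it into a matrix_a fragment, i.e. the left operand of P . A.
__device__ void init_P(half *p_smem,
                       fragment<matrix_a, 16, 16, 16, half, row_major> &p_frag) {
  const int lane = threadIdx.x % 32;
  for (int i = lane; i < 256; i += 32)
    p_smem[i] = (i < 16) ? __float2half(1.0f) : __float2half(0.0f);  // row 0 = ones
  __syncwarp();                         // make the shared-memory writes visible
  load_matrix_sync(p_frag, p_smem, 16); // P is now ready to be used by mma_sync
}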
Segment Size 16: The Reduction16 algorithm, shown in Algorithm 1 and Figure 5, performs warp-level reduction on 256 elements which represent 16 segments of size 16. On Line 3 in Algorithm 1 or Step 1 in Figure 5, the data is loaded from memory into a column-major order fragment (matrix A). Each row is then reduced using V = P · A (Line 4 or Step 2). The result — the first row of V — is stored in the output memory (Line 5 or Step 3).

Figure 5: The Reduction16 algorithm: (1) each warp loads 256 elements into the matrix A in column-major order from the input vector, (2) performs the TCU operation V = P · A, where the P matrix has ones in the first row, and then (3) the result, which is in the first row of V, is stored into the output vector.

Algorithm 1: The Reduction16 algorithm.
1: Initialize P matrix.
2: idx ← global offset
3: A ← LoadTile(in[idx ... idx + 256], "colmajor")
4: V ← P · A + 0
5: if laneIdx < 16 then out[idx/16 + laneIdx] ← V[laneIdx]
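For concreteness, a minimal CUDA realization of Algorithm 1 might look as follows, assuming one warp per thread block, an input length that is a multiple of 256, and the init_P helper and headers from the earlier sketches; this is an illustrative sketch rather than the paper's released code:

// Warp-level Reduction16 (Algorithm 1): one warp reduces 16 segments of 16 elements.
// Assumes blockDim.x == 32; bounds checks are omitted for brevity.
__global__ void reduction16_warp(const half *in, half *out) {
  __shared__ half p_smem[256];   // staging buffer for the P matrix
  __shared__ half v_smem[256];   // staging buffer for the result V
  fragment<matrix_a, 16, 16, 16, half, row_major> p_frag;
  fragment<matrix_b, 16, 16, 16, half, col_major> a_frag;
  fragment<accumulator, 16, 16, 16, half> v_frag;

  init_P(p_smem, p_frag);                        // Line 1: initialize the P matrix
  size_t idx = 256 * (size_t)blockIdx.x;         // Line 2: this warp's global offset
  load_matrix_sync(a_frag, in + idx, 16);        // Line 3: LoadTile(..., "colmajor")
  fill_fragment(v_frag, __float2half(0.0f));
  mma_sync(v_frag, p_frag, a_frag, v_frag);      // Line 4: V = P . A + 0 (sums in row 0)
  store_matrix_sync(v_smem, v_frag, 16, mem_row_major);
  __syncwarp();
  int lane = threadIdx.x % 32;
  if (lane < 16)                                 // Line 5: write the 16 reduced values
    out[idx / 16 + lane] = v_smem[lane];
}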
Segment Size 256: For handling segments of size 256, one follows a pattern similar to Reduction16. The algorithm is shown in Algorithm 2 and is a single iteration of the algorithm illustrated in Figure 6. First, all 256 elements are loaded onto the TCU (Line 3). The rows are reduced using the same procedure as Reduction16 (Lines 2–4), the resulting columns are reduced using Pᵀ (Line 5), before we store the scalar result (Line 6) into memory.

Algorithm 2: The Reduction256 algorithm.
1: Initialize P matrix
2: idx ← global offset
3: A ← LoadTile(in[idx ... idx + 256], "colmajor")
4: V ← P · A + 0
5: V ← V · Pᵀ + 0
6: if laneIdx = 0 then out[idx/256] ← V[0]
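A sketch of the body of Algorithm 2 follows; it assumes the fragments and helpers from the previous sketches, plus a pt_frag fragment assumed to already hold Pᵀ (ones in the first column) as a matrix_b operand. Because matrix_a/matrix_b fragments cannot be produced directly from an accumulator, the intermediate V matrix is round-tripped through shared memory (Section 6.1 discusses this constraint):

// Warp-level Reduction256: reduce the rows, then reduce the resulting row to a scalar.
__device__ void reduction256_warp(const half *in, half *out, size_t idx,
    const fragment<matrix_a, 16, 16, 16, half, row_major> &p_frag,
    const fragment<matrix_b, 16, 16, 16, half, col_major> &pt_frag,
    half *scratch /* 256 halfs of shared memory */) {
  fragment<matrix_b, 16, 16, 16, half, col_major> a_frag;
  fragment<matrix_a, 16, 16, 16, half, row_major> v_frag;
  fragment<accumulator, 16, 16, 16, half> acc;

  load_matrix_sync(a_frag, in + idx, 16);            // Line 3: load 256 elements column-major
  fill_fragment(acc, __float2half(0.0f));
  mma_sync(acc, p_frag, a_frag, acc);                // Line 4: V = P . A (column sums in row 0)
  store_matrix_sync(scratch, acc, 16, mem_row_major);
  __syncwarp();
  load_matrix_sync(v_frag, scratch, 16);             // re-load V as a matrix_a operand
  fill_fragment(acc, __float2half(0.0f));
  mma_sync(acc, v_frag, pt_frag, acc);               // Line 5: R = V . P^T (scalar at (0,0))
  store_matrix_sync(scratch, acc, 16, mem_row_major);
  __syncwarp();
  if (threadIdx.x % 32 == 0) out[idx / 256] = scratch[0];   // Line 6: store the scalar
}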
Segment Size Multiples of 256: With the above Reduction16 and Reduction256 warp-level primitives, we can express segments that are multiples of either 16 (denoted by 16N) or 256 (denoted by 256N). We will first look at the 256N algorithm, since it will be used for representing the 16N algorithm. A naïve way is to implement the 256N segmented reduction as N repeated applications of Reduction256, shown in Figure 6. While this is correct, it is work-inefficient — wasting one matrix multiplication for each iteration. Instead of performing two reductions in each iteration, we can implement a work-efficient 256N segmented reduction by first reducing each row of the 16 × 16 matrix (Reduction16) in each iteration and then using the row of reduced values as an accumulator. In the final iteration, the final row is reduced into a scalar. Figure 7 illustrates the work-efficient algorithm.

Segment Size Multiples of 16: Similar to the 256N case, segmented reduction where the segment size is a multiple of 16 (16N) can be performed in two ways. The first is a strided segmented reduction, shown in Figure 8 for the case where N = 2. In each iteration, a warp loads 16 segments (each of length 16) into the matrix A with a stride of 16N (Steps 1 and 4), i.e., the beginning of each 16-element segment is 16N elements away from the beginning of the next segment in the original input vector. The 16 columns of A are then reduced and accumulated into the first row of V (Steps 2 and 5). This repeats for N iterations. This method is simple, works for arbitrary multiples of 16, and leverages the GPU cache for small N. For large N this method suffers from bad locality. Algorithm 3 makes better use of cache locality and reduces uncoalesced memory accesses. The algorithm implements Reduction16N in terms of Reduction256N for N > 16; the remaining segments, of size (N mod 16) × 16, can be reduced using the strided segmented 16N reduction method.

When the segment size is large, collaborative reduction within a block becomes profitable. We follow standard practice [38] to implement block-level reduction, but differ in that we still use the TCU to perform reduction on the partially reduced values within a block.
Figure 6: The work-inefficient Reduction256N algorithm: (1) initializes the Q matrix with all zeros and (2) loads the input elements into a matrix A in column-major order. (3) A dot product V = P · A + 0 (the P matrix has the first row as ones and the rest of the values as zeros) is performed to reduce each row into a scalar. (4) The dot product R = V · Pᵀ + Q reduces the first row into a scalar. (5) If the segmented reduction size is equal to the matrix size (i.e. N = 1) or for the last iteration, then the first element of the R matrix is stored in the output array; otherwise (6) the first element of R is used as the first element of the Q matrix and the procedure is iterated starting from step 2.

Figure 7: The work-efficient Reduction256N algorithm: (1) loads 256 input elements into matrix A_i in each iteration. It then (2) performs a matrix multiplication V_i = P · A_i + V_{i−1} for i between 1 and N with V_0 = 0. The final vector is reduced (3) by performing the R = V_N · Pᵀ operation and (4) the result is stored as output.

Figure 8: A strided Reduction16N algorithm for N = 2: (1) we load 16 × 16 elements where the stride between each row is 16N. (2) We then perform the matrix multiplication V = P · A and (3) use the V matrix as an accumulator for the next iteration, where (4) we again load the next 16 × 16 elements with the leading dimension set to 16N. (5) The matrix multiplication V = P · A + V is performed and (6) the first row is stored in the output vector.

Algorithm 4 shows how warp-level reduction is used to implement the block-level Reduction256N kernel.

Algorithm 3: The Coalesced Reduction16N algorithm.
1: Initialize P matrix
2: V ← 0
3: gidx ← global offset
4: numSegs ← ⌊N/16⌋    ▷ Number of 256-element segments
5: for i ← 0; i < numSegs; i ← i + 1 do
6:     idx ← gidx + 256i
7:     A ← LoadTile(in[idx ... idx + 256], "colmajor")
8:     V ← P · A + V
9: ...    ▷ Reduce the rest of the segments using the strided Reduction16N
10: if laneIdx < 16 then out[gidx/N + laneIdx] ← V[laneIdx]

Algorithm 4: The Block-level Reduction256N algorithm.
1: wpb ← warps per block
2: prtls ← alloc shared mem[wpb]
3: partial ← Reduction_{256N/wpb}(in)
4: if laneIdx = 0 then prtls[warpIdx] ← partial
5: sync threads
6: if warpIdx = 0 then out[blockIdx] ← Reduction_{wpb}(prtls)

When the segment size is very large a grid-level reduction might be needed. A naïve grid-level reduction for a list of length N involves two kernel launches. The first kernel launch performs a segmented reduction with the output stored in a list of partials. A second kernel then reduces the partials into a scalar. Although this algorithm is naïve, its performance is on par with the fastest algorithm.
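A host-side sketch of this naïve two-kernel strategy is shown below; the kernel names (segmented_reduce_kernel, reduce_partials_kernel) and the per-block coarsening factor are hypothetical placeholders for the paper's actual kernels:

// Naive grid-level reduction: kernel 1 emits one partial per block,
// kernel 2 reduces the partials to a single scalar.
void grid_reduce(const half *d_in, half *d_out, size_t n, cudaStream_t stream) {
  const size_t elems_per_block = 256 * 32;                    // assumed coarsening factor
  const size_t num_blocks = (n + elems_per_block - 1) / elems_per_block;
  half *d_partials;
  cudaMalloc(&d_partials, num_blocks * sizeof(half));
  segmented_reduce_kernel<<<num_blocks, 256, 0, stream>>>(d_in, d_partials, n);
  reduce_partials_kernel<<<1, 256, 0, stream>>>(d_partials, d_out, num_blocks);
  cudaFree(d_partials);
}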
It might be less intuitive to represent scan as matrix multiplication. For a vector V of 256 elements, we can store it in row-major order within a 16 × 16 matrix A, with a_{i,j} = V[16(i − 1) + j]:

\[
A =
\begin{bmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,16}\\
a_{2,1} & a_{2,2} & \cdots & a_{2,16}\\
\vdots & \vdots & \ddots & \vdots\\
a_{16,1} & a_{16,2} & \cdots & a_{16,16}
\end{bmatrix}.
\]

We notice that a row-wise scan can be obtained by multiplying the matrix A with an upper diagonal matrix — with the values on and above the diagonal being 1 and the rest 0:

\[
\mathrm{RowScan}(A) = A \cdot U = A \cdot
\begin{bmatrix}
1 & 1 & \cdots & 1\\
0 & 1 & \cdots & 1\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & 1
\end{bmatrix}
=
\begin{bmatrix}
a_{1,1} & a_{1,1}+a_{1,2} & \cdots & \sum_{i=1}^{16} a_{1,i}\\
a_{2,1} & a_{2,1}+a_{2,2} & \cdots & \sum_{i=1}^{16} a_{2,i}\\
\vdots & \vdots & \ddots & \vdots\\
a_{16,1} & a_{16,1}+a_{16,2} & \cdots & \sum_{i=1}^{16} a_{16,i}
\end{bmatrix}.
\]

Similarly, to get the scan of each column one can use a lower diagonal matrix. We use a strict lower diagonal matrix, i.e. the diagonal is 0, to get an exclusive scan of each column:

\[
\mathrm{ExclusiveColumnScan}(A) = L \cdot A =
\begin{bmatrix}
0 & 0 & \cdots & 0\\
1 & 0 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
1 & 1 & \cdots & 0
\end{bmatrix}
\cdot A =
\begin{bmatrix}
0 & 0 & \cdots & 0\\
a_{1,1} & a_{1,2} & \cdots & a_{1,16}\\
a_{1,1}+a_{2,1} & a_{1,2}+a_{2,2} & \cdots & a_{1,16}+a_{2,16}\\
\vdots & \vdots & \ddots & \vdots\\
\sum_{j=1}^{15} a_{j,1} & \sum_{j=1}^{15} a_{j,2} & \cdots & \sum_{j=1}^{15} a_{j,16}
\end{bmatrix}.
\]

We then use the L · A matrix to create a G matrix where each element G_{j,i} is the reduction of the j-th row of L · A. That is, all elements in the j-th row of G have the same value — the sum of all elements preceding the j-th row of A, i.e. G_{j,i} = Σ_{k=1}^{j−1} Σ_{c=1}^{16} a_{k,c}. The G matrix can be generated by multiplying L · A with a matrix with all element values set to 1. We then add G to the A · U matrix to generate the scan of V, which is read in linear row-major order:

\[
\mathrm{Scan}(V) = L \cdot A \cdot
\begin{bmatrix}
1 & 1 & \cdots & 1\\
1 & 1 & \cdots & 1\\
\vdots & \vdots & \ddots & \vdots\\
1 & 1 & \cdots & 1
\end{bmatrix}
+ A \cdot U = G + A \cdot U =
\begin{bmatrix}
a_{1,1} & a_{1,1}+a_{1,2} & \cdots & \sum_{i=1}^{16} a_{1,i}\\
a_{2,1}+\sum_{i=1}^{16} a_{1,i} & a_{2,1}+a_{2,2}+\sum_{i=1}^{16} a_{1,i} & \cdots & \sum_{j=1}^{2}\sum_{i=1}^{16} a_{j,i}\\
\vdots & \vdots & \ddots & \vdots\\
a_{16,1}+\sum_{j=1}^{15}\sum_{i=1}^{16} a_{j,i} & a_{16,1}+a_{16,2}+\sum_{j=1}^{15}\sum_{i=1}^{16} a_{j,i} & \cdots & \sum_{j=1}^{16}\sum_{i=1}^{16} a_{j,i}
\end{bmatrix}.
\]

Throughout this section we will use U to represent the upper diagonal matrix where the values on and above the diagonal are one, and L to represent the strict lower diagonal matrix where the values below the diagonal are one, i.e.

\[
U_{r,c} = \begin{cases} 1 & \text{if } r \le c\\ 0 & \text{if } r > c \end{cases}
\qquad\text{and}\qquad
L_{r,c} = \begin{cases} 1 & \text{if } r > c\\ 0 & \text{if } r \le c. \end{cases}
\]
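A small worked example (using 4 × 4 matrices instead of 16 × 16, and a vector of all ones so that the scan of V is simply 1, 2, ..., 16) makes it easy to check the identity Scan(V) = L · A · 1 + A · U:

\[
A = \begin{bmatrix}1&1&1&1\\1&1&1&1\\1&1&1&1\\1&1&1&1\end{bmatrix},\quad
A \cdot U = \begin{bmatrix}1&2&3&4\\1&2&3&4\\1&2&3&4\\1&2&3&4\end{bmatrix},\quad
L \cdot A = \begin{bmatrix}0&0&0&0\\1&1&1&1\\2&2&2&2\\3&3&3&3\end{bmatrix},
\]
\[
G = (L \cdot A)\cdot \mathbf{1} = \begin{bmatrix}0&0&0&0\\4&4&4&4\\8&8&8&8\\12&12&12&12\end{bmatrix},\qquad
G + A \cdot U = \begin{bmatrix}1&2&3&4\\5&6&7&8\\9&10&11&12\\13&14&15&16\end{bmatrix},
\]

which, read in row-major order, is exactly the inclusive scan of the sixteen ones.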
With the above derivation, we follow a similar structure to Section 4: first introducing warp-level primitives before presenting the block- and grid-level primitives. We write ScanK to represent a K regular segmented scan. Since the process of building warp-level, block-level, and grid-level scans from ScanK is very similar to that of reduction, we will only highlight the key differences.

Segment Size 16: Is the RowScan equation above and is illustrated in Figure 9 as steps 1, 2, and 3.
Segment Size 256:
Is implemented using 3 matrix multiplications, shown in Figure 9 and presented mathematically above.
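A warp-level sketch of this three-multiplication Scan256 follows, assuming the data tile is loaded row-major and that u_frag, l_frag, and ones_frag already hold U, L, and the all-ones matrix (initialized analogously to P in Section 4); as before, the intermediate L · A product is round-tripped through shared memory so it can be used as a left operand:

// Warp-level Scan256: R = (L . A) . 1 + (A . U), computed with three mma_sync calls.
__device__ void scan256_warp(const half *in, half *out, size_t idx,
    const fragment<matrix_b, 16, 16, 16, half, row_major> &u_frag,
    const fragment<matrix_a, 16, 16, 16, half, row_major> &l_frag,
    const fragment<matrix_b, 16, 16, 16, half, row_major> &ones_frag,
    half *scratch /* 256 halfs of shared memory */) {
  fragment<matrix_a, 16, 16, 16, half, row_major> a_frag;     // A as left operand of A . U
  fragment<matrix_b, 16, 16, 16, half, row_major> a_frag_b;   // A as right operand of L . A
  fragment<matrix_a, 16, 16, 16, half, row_major> la_frag;    // L . A as left operand
  fragment<accumulator, 16, 16, 16, half> au, la;

  load_matrix_sync(a_frag, in + idx, 16);
  load_matrix_sync(a_frag_b, in + idx, 16);
  fill_fragment(au, __float2half(0.0f));
  mma_sync(au, a_frag, u_frag, au);            // AU = A . U  (row-wise inclusive scans)
  fill_fragment(la, __float2half(0.0f));
  mma_sync(la, l_frag, a_frag_b, la);          // LA = L . A  (column-wise exclusive scans)
  store_matrix_sync(scratch, la, 16, mem_row_major);
  __syncwarp();
  load_matrix_sync(la_frag, scratch, 16);      // re-interpret LA as a matrix_a operand
  mma_sync(au, la_frag, ones_frag, au);        // R = LA . 1 + AU
  store_matrix_sync(out + idx, au, 16, mem_row_major);  // scan of the 256 elements
}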
Segment Size Multiples of 16:
Is similar to the strided 16N reduction, with the key difference being that we broadcast the last column rather than the reduced scalar value; it is shown in Algorithm 5.

Algorithm 5: The Scan16N algorithm.
1: Initialize U matrix.
2: gidx ← global offset
3: S ← 0
4: lid ← laneIdx
5: for i ← 0; i < N; i ← i + 1 do
6:     idx ← gidx + 16i
7:     A ← LoadTile(in[idx ...], stride = 16N)
8:     R ← A · U + S
9:     S ← Broadcast(LastColumn(R))
10:    if lid < 16 then
11:        oi ← idx + lid ∗ 16N
12:        out[oi ... oi + 16] ← R[16 lid ... 16 lid + 16]

Segment Size Multiples of 256:
Only a small modification to Scan256 is needed to implement Scan256N, as illustrated in Figure 9 and Algorithm 6. Line 11 in Algorithm 6 shows that we keep track of the sum (the last element of the R matrix) and broadcast it to the S matrix after each iteration. The S matrix is then used when performing subsequent iterations.

Figure 9: The Scan256N algorithm: (1) loads 256 elements from the input vector into a matrix A and (2) initializes the S matrix to 0. The multiplications (3) AU = A · U + S and (4) LA = L · A + 0 are performed, then (5) each row of LA is reduced (LA · 1) and added to the AU matrix. (6) The result R is stored in the output vector. (7) If the segment size is a multiple of 256, then the last element of R (position 16, 16) is broadcast into the S matrix and the procedure is repeated.

Algorithm 6: The Scan256N algorithm.
1: Initialize U and L matrices.
2: gidx ← global offset
3: S ← 0
4: for i ← 0; i < N; i ← i + 1 do
5:     idx ← gidx + 256i
6:     A ← LoadTile(in[idx ... idx + 256])
7:     AU ← A · U + S
8:     LA ← L · A + 0
9:     R ← LA · 1 + AU
10:    out[idx ... idx + 256] ← R
11:    S ← Broadcast(R[255])

Algorithm 7 shows how to perform the scan at the block level. It first computes the segmented scan using the warp primitives (Lines 8–13), stores the reduced values into a partials list (Line 16), performs a scan on the partials list (Line 17), and adds the values to the intermediate results to get the output (Lines 19–23). Algorithm 7 also exercises the TCU to perform the scan on the partially reduced values across tiles. On Line 16 we use the offset of the last row (240) and 256 as the leading dimension when loading the tile. This loads the last row of R across tiles into E. Line 17 then performs an exclusive scan on the last column of E and stores the results into the list of partials.

Algorithm 7: The Block-level Scan256N algorithm.
1: Initialize U and L matrices.
2: gidx ← global offset
3: wpb ← warps per block    ▷ Assumed to be less than 16
4: sout ← alloc shared mem[wpb × 256]
5: prtls ← alloc shared mem[16]    ▷ Partial sums
6: S ← 0
7: for i ← 0; i < N; i ← i + warpsPerBlock do
8:     idx ← gidx + 256(i + warpIdx)
9:     A ← LoadTile(in[idx ... idx + 256])
10:    AU ← A · U + S
11:    LA ← L · A + 0
12:    R ← LA · 1 + AU
13:    sout[256 warpIdx ... 256 warpIdx + 256] ← R
14:    sync threads
15:    if warpIdx = 0 then
16:        E ← LoadTile(sout[240 ...], stride = 256)
17:        prtls ← LastColumnScan(E)    ▷ Exclusive scan
18:    sync threads
19:    for j ← 0; j < 256; j ← j + warpSize do
20:        it ← j + laneIdx
21:        val ← sout[256 warpIdx + it] + prtls[warpIdx]
22:        out[idx + it] ← val
23:    S ← Broadcast(prtls[wpb])

Similar to reduction, the segmented scan is used as a building block for the grid-level scan. The grid-level scan uses a textbook implementation, the scan-then-propagate strategy, and involves 3 kernel calls. The first kernel uses segmented scan and produces partially reduced values for each block. The second kernel performs a scan on the partially reduced values. The third kernel then uniformly adds the partially scanned values to their corresponding segments.
We implemented the algorithms presented in Sections 4 and 5 using NVIDIA's WMMA API. The code (available at https://github.com/c3sr/tcu_scope) is implemented as a C++ header library with an API similar to CUB's — providing functions such as SegmentedReduce, Reduce, SegmentedScan, and Scan. We employ auto-tuning to select the ideal algorithm, number of warps (or independent TCU operations) per block, coarsening factor (the number of segments to process within a warp), and block dimensions for the user-provided segment size.

We evaluate our implementation on an Intel Xeon E5-2698 with CentOS 4.3, CUDA Driver 396.26, and CUDA Runtime 9.2. The Tesla V100 PCI-E GPU used has a theoretical peak memory bandwidth of 900 GB/s, or 450 billion half precision elements per second. All the results below show the throughput of the algorithms in terms of billions of half precision elements per second.

(The implementation of LastColumnScan is performed by loading the last column values into the first row and performing a TCU version of the exclusive scan algorithm. Formulating the intermediate operation this way is needed to adhere to the CUDA WMMA API's byte alignment constraint for loading fragments.)

Constraints arise when using the current WMMA API for non-GEMM computation. These limitations would not exist if one were to perform just GEMM computation. The constraints observed were:
(1) Loads or stores must be performed at fragment granularity.
(2) Loading and storing fragments can only be performed using global or shared memory; constant memory cannot be used.
(3) The memory layouts for the matrix kinds are not the same, and there is no API to perform casts between them.
We address these limitations in different ways within our implementation. For (1) and (2) we use knowledge about the internal layout of the fragment [26] and implemented WMMA API enhancements tailored to our usage. Listing 3 shows an example of our API extensions for operating on partial fragments.
Listing 3: The WMMA API can only perform loads/stores from shared or global memory and lacks the ability to fill a TCU fragment from constant memory or operate on sub-fragments. This code shows how we enhance the NVIDIA WMMA API, using knowledge of the fragment layout, to create an upper triangular matrix and get the first column of a fragment for the matrix_b fragment kind.
Although we can use the layout information to shuffle registers to address (3), we opt instead to express the cast in terms of the load/store APIs available through the WMMA API. For example, to cast a matrix in the matrix_a format to the matrix_b format, we first store the matrix into shared memory and then perform a load from shared memory into a matrix_b fragment. Using our API extensions for fragment layout information requires less block synchronization — which increases the performance of our implementation by up to 5%. Since relying on fragment layout information is not portable, we omit these results.
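A sketch of such a cast using only documented WMMA entry points is shown below. Note that the public API only allows accumulator fragments to be stored, so this sketch shows the accumulator-to-matrix_b round trip (the case needed after an mma_sync); the helper name and scratch-buffer convention are ours:

// Cast an accumulator fragment (e.g. the result of mma_sync) into a matrix_b
// operand by round-tripping through shared memory, since no direct
// fragment-to-fragment cast API exists.
__device__ void cast_acc_to_b(const fragment<accumulator, 16, 16, 16, half> &src,
                              fragment<matrix_b, 16, 16, 16, half, row_major> &dst,
                              half *scratch /* 256 halfs of shared memory */) {
  store_matrix_sync(scratch, src, 16, mem_row_major);  // spill the tile to shared memory
  __syncwarp();                                        // make the stores visible to the warp
  load_matrix_sync(dst, scratch, 16);                  // reload it with the matrix_b layout
}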
CUB is a C++ template library that contains multiple algorithms for the collectives. The CUB library contains the fastest [40, 41] implementations of the reduction and scan collectives and is used by libraries such as Thrust [5] as well as most deep learning frameworks [8, 42, 43, 59, 67]. We compare against the latest release of CUB [38] (version 1.8) and evaluate against different parameters of the collectives.

Figure 10: We evaluate the segmented reduction for the algorithms presented on different segment sizes for a fixed-size element list. Through a combination of the algorithms presented, we are able to achieve throughput within 90% and 98% of ideal throughput (the theoretical peak is 450 billion half precision elements per second). The bar on top of the figure shows the best performing algorithm for each range of segment sizes.

As an optimization, warp-level shuffle reduction and scan are implemented in PTX within CUB for the integer, float, and double data types, since NVCC is currently unable to use the shuffle instruction's predicate to guard against invalid peers [28, 44]. We observed that CUB does not contain these shuffle-based optimizations for half precision. To make the evaluations fair and meaningful, we implement these optimizations for the half precision data type in CUB. The modified CUB is used for the evaluation to provide a more aggressive base of comparison.

Theoretically (on the V100) our warp-level TCU reduction algorithms require less than one fourth of the cycles of the warp-level shuffle reduction. For example, consider performing a warp-level Reduction256: the warp-level reduction shown in Listing 2 requires 8 iterations of 32-element reduction to reduce each segment. The total cycle count is therefore 256, since each shuffle instruction and addition takes 4 cycles. Our algorithm performs the reduction using two matrix multiplications, or 64 cycles, since each TCU WMMA matrix multiplication requires 32 cycles. However, reduction is known to be memory bound, with the ideal performance bounded by memory copy speed.

We evaluate the TCU segmented reduction algorithms against cub::DeviceSegmentedReduce::Sum by fixing the number of input elements and varying the segment size (Figure 10). When the segment size is less than 256, the 16N algorithm is used. The 16N algorithm's performance degrades for large segment sizes due to its strided access pattern resulting in uncoalesced global memory access. When the segment size is larger than 256, the 256N algorithm is used, but it again suffers from performance degradation for large segment sizes due to low occupancy. When the segment size is very large, the block-level 256N reduction is used. Figure 10 shows that our TCU implementation achieves more than 90% of the peak throughput across segment sizes and is always better than CUB.

When the segment size is large and the number of segments is small, the performance of both CUB and our implementation drops. Since each segment gets mapped onto a block, a small number of segments causes some SMs to be idle. For example, for the largest segment sizes both CUB and our implementation achieve an occupancy of around 0.25 and an SM efficiency of around 40%. A better strategy for these cases would be to assign multiple thread blocks to collaboratively reduce each segment when the size of the segments is very large. This optimization can be achieved using CUDA 9's cooperative groups [45], but is outside the focus of this paper.

Our TCU implementation largely outperforms CUB's device-wide segmented reduction for different segment sizes. Through profiling, we identified the major predictors of performance to be, in order of importance, the number of half-precision floating-point instructions (inst_fp_16 in the NVProf [51] metrics), warp instructions (inst_inter_thread_communication), and integer instructions (inst_integer). We consistently find that our implementation's half-precision instruction count is approximately equal to the total number of elements, while CUB's is much larger. Moreover, CUB requires a large number of integer and warp shuffle instructions, while our implementation uses no warp shuffle instructions and a smaller number of integer instructions. This contributes to the 100× speedup for segment size 16.

We examined the power consumption by measuring the average power draw within the execution phase of the kernel using NVProf. Based on these measurements, we find that our implementation consumes between 7% and 22% less power compared to CUB across different segment sizes. Again, this is because of the efficient use of the FP and INT ALUs as well as better SM and DRAM utilization. We note that our algorithm leaves the general purpose ALUs idle, allowing less contention on these units.

CUB provides a cub::WarpReduce, applicable for segment sizes 16 and 32, to compute a parallel reduction of elements within a warp. CUB also provides cub::BlockReduce to perform reduction within a block. These two primitives require users to partition the data and construct the kernel. Since CUB's device-wide segmented reduction does not perform well for small segment sizes, we evaluate our TCU implementations against cub::WarpReduce and cub::BlockReduce implementations, shown in Figure 11. The cub::WarpReduce implementation is tunable on block size, whereas the cub::BlockReduce implementation is tunable on block size, thread coarsening factor, and reduction algorithm. We compare our implementation against the best CUB configuration. We find that our TCU implementations are still faster for segment sizes smaller than 1024, and are comparable to cub::BlockReduce for the other cases.

For segmented scan, we evaluate the TCU algorithms against Thrust's implementation (inclusive_scan_by_key), since CUB has no user-visible API for segmented scan. The Thrust implementation utilizes CUB's internal warp- and block-level scan to implement the scan-by-key operation. We evaluate different segment sizes with a fixed number of input elements; the results are shown in Figure 12. Thrust, consistent with previous work [17], has constant performance irrespective of the segment size, whereas our TCU scan implementations achieve more than 89% of the peak throughput and are 3× faster than Thrust for small segment sizes. We observe lower power consumption compared to Thrust — either equivalent power usage or up to 17% less. Our segmented scan is not ideal for large segment sizes since, as explained in Section 4, only a small number of blocks get launched and thus the GPU is underutilized. This inefficiency can be remedied using the same strategy described for reduction.

CUB provides cub::WarpScan to compute a parallel scan of data partitioned within a warp, and cub::BlockScan within a block.
Figure 11: Segmented (1) reduction and (2) scan are evaluated in terms of billions of half-precision elements per second (y-axis) for segment sizes between 16 and 8192 (x-axis), comparing CUB Warp, CUB Block, CUB Device, and our 16N and 256N implementations. The best configurations for our implementation as well as CUB are selected.
Figure 12: We evaluate the segmented scan for the algorithms presented on different segment sizes for a fixed-size element list, comparing Thrust against our 16N, 256N, and 256N Block implementations. Through a combination of the algorithms presented, we are able to achieve throughput within 89% and 98% of ideal throughput (the theoretical peak is 450 billion half precision elements per second).
Figure 13: A full reduction implementation based on the description in Section 4 achieves performance on par with CUB.
Similar to reduction, these two primitives require more programming effort from the users to partition the data and construct the kernel. The CUB scan implementations have the same tunable parameters as CUB's reduction. We evaluate our TCU segmented scan against the best cub::WarpScan and cub::BlockScan parameters, shown in Figure 11. We can see that our TCU implementations are still faster for small segment sizes, and are at least comparable to cub::BlockScan for the other cases.
Unlike the warp- and block-level operations, this paper does not attempt to optimize grid-level operations — opting to use a naïve implementation for the grid-level collectives. The naïve implementation involves multiple kernel launches. We include the evaluation results to show that even our naïve grid-level implementation achieves performance that is better than or comparable to that of CUB and Thrust.
Figure 14: A full scan implementation based on the description in Section 5 achieves performance comparable to CUB.
We compare against both CUB and Thrust for full reduction (Figure 13) and scan (Figure 14). For both cases, our implementation uses the 256N block-level algorithm. Even with our naïve grid-level implementation, we are able to match the performance of CUB and are considerably faster than the Thrust implementation. For reduction and scan, the TCU implementation is slightly faster than CUB, with large input sizes being bounded by memory bandwidth, and is within 98% (for reduction) of peak memory copy bandwidth. For scan, our current unoptimized implementation uses CUB to reduce the partial values (kernel 2 described in Section 5.3). Future implementations would not use CUB, since it fails for very large inputs, which causes our implementation to fail as well.

RELATED WORK

The mapping of algorithms onto matrix multiplication has been well studied [22, 30, 60, 68]. Similarly, both reduction and scan are well studied from a performance and application [6, 7, 21, 29, 65] aspect on a wide range of architectures and have been heavily evaluated across GPU architectures [14, 16, 37, 64, 70]. To our knowledge, however, there has been no attempt at mapping either reduction or scan in terms of matrix multiplication.

Considerable research has been done on the development of performance-portable compilers for reduction and scan kernels [2, 10, 13, 31, 66]. These compilers express the algorithms as systems of
We also pointed out directions forfuture API and architectural changes to relax some of the TCU con-straints such as loading fragments from constant, extracting singlerow or column, etc. — resulting in a simplified implementation.We implemented the proposed algorithms onto V100 TCUs,achieved up to 100 × speedup for reduction and up to 3 × for scan,and showed performance that rivals state of the art implementationin the worst cases. We observed up to 22% less power consumptionfor reduction and 16% for scan using NVPROF. As a result of thealgorithms, we were able to make use of the otherwise idle TCUs—enabling better GPU utilization for kernels that exercise the generalpurpose ALUs.Future work would leverage the techniques described in thispaper to map more algorithms and functions onto TCUs. We arespecifically interested in transcendental and special functions, sincethe NVIDIA special function units have been observed to be thebottleneck in HPC applications. We also want to express neuralnetwork layers in terms of TCUs, where some layer implementationsand layer fusion opportunities would be enabled by our work: such asthe computation of variance in batch norm [23, 24] or the evaluationof special functions in activation layers. ACKNOWLEDGMENTS
This work is supported by the IBM-ILLINOIS Center for Cognitive Computing Systems Research (C3SR), a research collaboration as part of the IBM Cognitive Horizon Network.
REFERENCES [1] Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub Kurzak,Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire Tomov. 2009.Numerical linear algebra on emerging architectures: the plasma and magmaprojects. In
Journal of Physics: Conference Series number 1. Vol. 180. IOPPublishing, 012037.[2] Jason Ansel, Cy Chan, Yee Lok Wong, Marek Olszewski, Qin Zhao, AlanEdelman, and Saman Amarasinghe. 2009.
PetaBricks: a language and compilerfor algorithmic choice . Number 6. Vol. 44. ACM.[3] Apple. 2019 (accessed January 14, 2019).
A11 Bionic
[4] Arm. 2019 (accessed January 14, 2019). Arm Machine Learning Processor. https://developer.arm.com/products/processors/machine-learning/arm-ml-processor.
GPU computing gems Jade edition . Elsevier, 359–371.[6] Guy E Blelloch. 1989. Scans as primitive parallel operations.
IEEE Transactionson computers , 38, 11, 1526–1538.[7] Guy E Blelloch, Michael A Heroux, and Marco Zagha. 1993. Segmented op-erations for sparse matrix computation on vector multiprocessors. Tech. rep.Carnegie-Mellon Univ Pittsburgh PA School of Computer Science.[8] Caffe2. 2019 (accessed January 14, 2019).
Caffe2 . https://caffe2.ai.[9] Timothy M Chan. 2010. More algorithms for all-pairs shortest paths in weightedgraphs.
SIAM Journal on Computing , 39, 5, 2075–2089.[10] Li-Wen Chang, Izzat El Hajj, Christopher Rodrigues, Juan Gómez-Luna, andWen-mei Hwu. 2016. Efficient kernel synthesis for performance portable pro-gramming. In
The 49th Annual IEEE/ACM International Symposium on Mi-croarchitecture . IEEE Press, 12.[11] Gaurav Chaurasia, Jonathan Ragan-Kelley, Sylvain Paris, George Drettakis,and Fredo Durand. 2015. Compiling high performance recursive filters. In
Proceedings of the 7th Conference on High-Performance Graphics . ACM, 85–94.[12] Leonardo Dagum and Ramesh Menon. 1998. Openmp: an industry standard apifor shared-memory programming.
IEEE computational science and engineering ,5, 1, 46–55.[13] Simon Garcia De Gonzalo, Sitao Huang, Juan Gómez-Luna, Simon Hammond,Onur Mutlu, and Wen-mei Hwu. 2019. Automatic generation of warp-levelprimitives and atomic instructions for fast and portable parallel reduction ongpus. In
Proceedings of the 2019 IEEE/ACM International Symposium on CodeGeneration and Optimization . IEEE Press, 73–84.[14] Yuri Dotsenko, Naga K Govindaraju, Peter-Pike Sloan, Charles Boyd, and JohnManferdelli. 2008. Fast scan algorithms on graphics processors. In
Proceedingsof the 22nd annual international conference on Supercomputing . ACM, 205–213.[15] Zidong Du et al. 2017. An accelerator for high efficient vision processing.
IEEETransactions on Computer-Aided Design of Integrated Circuits and Systems , 36,2, 227–240.[16] Martin Dybdal, Martin Elsman, Bo Joel Svensson, and Mary Sheeran. 2016.Low-level functional gpu programming for parallel algorithms. In
Proceedingsof the 5th International Workshop on Functional High-Performance Computing .ACM, 31–37.[17] Marco Eilers. 2014. Multireduce and multiscan on modern gpus.
Department ofComputer Science, University of Copenhagen. Master’s thesis .[18] Google. 2019 (accessed January 14, 2019).
Edge TPU . https://cloud.google.com/edge-tpu.[19] Google. 2019 (accessed January 14, 2019).
Google Cloud TPU . https://cloud.google.com/tpu.[20] Azzam Haidar, Panruo Wu, Stanimire Tomov, and Jack Dongarra. 2017. Inves-tigating half precision arithmetic to accelerate dense linear system solvers. In
Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms forLarge-Scale Systems . ACM, 10.[21] Mark Harris, Shubhabrata Sengupta, and John D Owens. 2007. Parallel prefixsum (scan) with cuda.
GPU gems , 3, 39, 851–876.[22] Xiaohan Huang and Victor Y Pan. 1998. Fast rectangular matrix multiplicationand applications.
Journal of complexity , 14, 2, 257–299.[23] Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: accelerat-ing deep network training by reducing internal covariate shift. arXiv preprintarXiv:1502.03167 .5624] X. Jia et al. 2018. Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes.
ArXiv e-prints , (July 2018).arXiv: 1807.11205.[25] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long,Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: convolu-tional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 .[26] Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P Scarpazza. 2018.Dissecting the nvidia volta gpu architecture via microbenchmarking. arXivpreprint arXiv:1804.06826 .[27] Norman P Jouppi et al. 2017. In-datacenter performance analysis of a tensorprocessing unit. In
Computer Architecture (ISCA), 2017 ACM/IEEE 44th AnnualInternational Symposium on . IEEE, 1–12.[28] Julien Demouth. 2019 (accessed January 14, 2019).
Kepler Shuffle: Tips andTricks . http://on-demand.gputechconf.com/gtc/2013/presentations/S3174-Kepler-Shuffle-Tips-Tricks.pdf.[29] Hee-Seok Kim, Shengzhao Wu, Li-wen Chang, and W Hwu Wen-mei. 2011.A scalable tridiagonal solver for gpus. In
Parallel Processing (ICPP), 2011International Conference on . IEEE, 444–453.[30] Tamara G Kolda and Brett W Bader. 2009. Tensor decompositions and applica-tions.
SIAM review , 51, 3, 455–500.[31] Rasmus Wriedt Larsen and Troels Henriksen. 2017. Strategies for regular seg-mented reductions on gpu. In
Proceedings of the 6th ACM SIGPLAN Inter-national Workshop on Functional High-Performance Computing . ACM, 42–52.[32] Lawrence Livermore National Laboratory. 2019 (accessed January 14, 2019).
Sierra Supercomputer . https://computation.llnl.gov/computers/sierra.[33] Chris Leary and Todd Wang. 2017. Xla: tensorflow, compiled.
TensorFlow DevSummit .[34] Sepideh Maleki, Annie Yang, and Martin Burtscher. 2016.
Higher-order andtuple-based massively-parallel prefix sums . Number 6. Vol. 51. ACM.[35] Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and JeffreyS Vetter. 2018. Nvidia tensor core programmability, performance & precision. arXiv preprint arXiv:1803.04014 .[36] Michael D McCool, Arch D Robison, and James Reinders. 2012.
Structuredparallel programming: patterns for efficient computation . Elsevier.[37] Trevor L McDonell, Manuel MT Chakravarty, Gabriele Keller, and Ben Lipp-meier. 2013. Optimising purely functional gpu programs.
ACM SIGPLAN No-tices , 48, 9, 49–60.[38] D Merrill. 2018. CUB v1.8.0: CUDA Unbound, a library of warp-wide, block-wide, and device-wide GPU parallel primitives.
NVIDIA Research .[39] Duane G Merrill and Andrew S Grimshaw. 2010. Revisiting sorting for gpgpustream architectures. In
Proceedings of the 19th international conference onParallel architectures and compilation techniques . ACM, 545–546.[40] Duane Merrill and Michael Garland. 2016. Single-pass parallel prefix scan withdecoupled look-back. Tech. rep. NVIDIA Technical Report NVR-2016-002.[41] Bruce Merry. 2015. A performance comparison of sort and scan libraries forgpus.
Parallel Processing Letters , 25, 04, 1550007.[42] Rory Mitchell and Eibe Frank. 2017. Accelerating the xgboost algorithm usinggpu computing.
PeerJ Computer Science , 3, e127.[43] MXNet. 2019 (accessed January 14, 2019).
MXNet . https://mxnet.apache.org.[44] Wilt Nicholas. 2013. The cuda handbook: a comprehensive guide to gpu pro-gramming. (2013).[45] NVIDIA. 2019 (accessed January 14, 2019).
Cooperative Groups: FlexibleCUDA Thread Programming . https://devblogs.nvidia.com/cooperative-groups/.[46] NVIDIA. 2019 (accessed January 14, 2019). cuBLAS . https://developer.nvidia.com/cublas.[47] NVIDIA. 2019 (accessed January 14, 2019).
CUDA C Programming Guide .https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html. [48] NVIDIA. 2019 (accessed January 14, 2019). cuDNN . https://developer.nvidia.com/cudnn.[49] NVIDIA. 2019 (accessed January 14, 2019).
CUTLASS . https://devblogs.nvidia.com/cutlass-linear-algebra-cuda.[50] NVIDIA. 2019 (accessed January 14, 2019).
Mixed Precision Training . https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training.[51] NVIDIA. 2019 (accessed January 14, 2019).
NVPROF . https://docs.nvidia.com/cuda/profiler-users-guide/index.html
[52] NVIDIA. 2019 (accessed January 14, 2019). NVVM IR Specification 1.5. https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html. [53] NVIDIA. 2019 (accessed January 14, 2019).
Parallel Thread Execution ISAVersion 6.2 . https://docs.nvidia.com/cuda/parallel-thread-execution/index.html.[54] NVIDIA. 2019 (accessed January 14, 2019).
Programming Tensor Cores inCUDA 9 . https://devblogs.nvidia.com/programming-tensor-cores-cuda-9.[55] NVIDIA. 2019 (accessed January 14, 2019).
Tensor Cores
[56] NVIDIA. 2019 (accessed January 14, 2019). TensorRT. https://developer.nvidia.com/tensorrt. [57] Oak Ridge National Laboratory. 2019 (accessed January 14, 2019).
SummitSupercomputer
Science: Internet,Data and Things (CS-E4000), Spring 2018 , 211.[59] PyTorch. 2019 (accessed January 14, 2019).
PyTorch . https://pytorch.org.[60] Stephan Rabanser, Oleksandr Shchur, and Stephan Günnemann. 2017. Introduc-tion to tensor decompositions and their applications in machine learning. arXivpreprint arXiv:1711.10781 .[61] Jonathan Ragan-Kelley, Andrew Adams, Dillon Sharlet, Connelly Barnes, Syl-vain Paris, Marc Levoy, Saman Amarasinghe, and Frédo Durand. 2017. Halide:decoupling algorithms from schedules for high-performance image processing.
Communications of the ACM , 61, 1, 106–115.[62] Md Aamir Raihan, Negar Goli, and Tor M. Aamodt. 2018. Modeling deeplearning accelerator enabled gpus.
CoRR , abs/1811.08309. arXiv: 1811.08309.http://arxiv.org/abs/1811.08309.[63] Brandon Reagen, Robert Adolf, Paul Whatmough, Gu-Yeon Wei, and DavidBrooks. 2017. Deep learning for computer architects.
Synthesis Lectures onComputer Architecture , 12, 4, 1–123.[64] Shubhabrata Sengupta, Mark Harris, and Michael Garland. 2008. Efficientparallel scan algorithms for gpus.
NVIDIA, Santa Clara, CA, Tech. Rep. NVR-2008-003 , 1, 1, 1–17.[65] Shubhabrata Sengupta, Aaron E Lefohn, and John D Owens. 2006. A work-efficient step-efficient prefix sum algorithm. In
Workshop on edge computingusing new commodity architectures , 26–27.[66] Michel Steuwer, Toomas Remmelg, and Christophe Dubach. 2017. Lift: afunctional data-parallel ir for high-performance gpu code generation. In
CodeGeneration and Optimization (CGO), 2017 IEEE/ACM International Symposiumon . IEEE, 74–85.[67] TensorFlow. 2019 (accessed January 14, 2019).
TensorFlow
Handbooks inoperations research and management science , 3, 247–321.[69] WikiChip. 2019 (accessed January 14, 2019).
Cascade Lake - Microarchitectures- Intel . https://en.wikichip.org/wiki/intel/microarchitectures/cascade_lake.[70] Shengen Yan, Guoping Long, and Yunquan Zhang. 2013. Streamscan: fast scanalgorithms for gpus without global barrier synchronization. In
ACM SIGPLANNotices number 8. Vol. 48. ACM, 229–238.[71] Yuhao Zhu, Matthew Mattina, and Paul Whatmough. 2018. Mobile machinelearning hardware at arm: a systems-on-chip (soc) perspective. arXiv preprintarXiv:1801.06274arXiv preprintarXiv:1801.06274