Optimizing Block-Sparse Matrix Multiplications on CUDA with TVM
Zijing Gu [email protected]
July 28, 2020
We implemented and optimized matrix multiplications between dense and block-sparse matrices on CUDA. We leveraged TVM, a deep learning compiler, to explore the schedule space of the operation and generate efficient CUDA code. With the automatic parameter tuning in TVM, our cross-thread-reduction-based implementation achieved competitive or better performance compared with other state-of-the-art frameworks. Our code is available here: https://github.com/ceruleangu/Block-Sparse-Benchmark
Contemporary deep learning research requires efficient GPU kernels to perform intense computations such as model training and inference. General matrix multiplication (GEMM) plays an essential role in the computation of deep neural networks because both convolution operations and fully connected operations can be expressed through GEMM. To accelerate neural network computation, the sparsity of weights has been widely exploited. For example, fully connected operations do not scale well because the weights learned for a layer form a dense matrix whose size depends not only on the size of the input layer but also on that of the output layer, whereas sparse fully connected operations have computational complexity proportional only to the number of non-zero elements. However, sparse operations with arbitrary sparsity patterns cannot be implemented efficiently on current GPUs because the highly parallel execution model of the GPU does not align with irregular sparsity patterns. In the recent literature, block-sparse operations have therefore gained attention. They have been successfully applied to different domains such as computer vision [Xie et al., 2016, Zhang et al., 2017] and natural language processing [Sak et al., 2014, Kuchaiev and Ginsburg, 2017].

Figure 1: Matrix with dense entries (left), matrix with block-sparse entries (middle), sparsity pattern of the block-sparse matrix in the middle (right).

Figure 1 [Gray et al., 2017] shows an example of a block-sparse matrix, where the empty blocks in the middle panel indicate all-zero blocks. The sparsity pattern of such a matrix can therefore be represented by the small matrix shown on the right of Figure 1. In this project, we specifically implement sparse-dense matrix multiplication, where the inputs are a dense matrix and a block-sparse matrix and the output is a dense matrix. This setting is commonly adopted in the operations of deep neural networks.

The key data structure for this task is the block-sparse matrix. A block-sparse matrix is stored in block compressed row (BSR) format, which is similar to compressed sparse row (CSR). BSR has three components (a small sketch of this layout is given at the end of this section):

1. block_data: a three-dimensional array with shape [number of non-zero blocks, block rows, block columns]
2. block_indices: an array containing the block column index of each non-zero block
3. index_pointer: an array of offsets into block_indices marking the first block of each block row

The key operation we are optimizing is the sparse-dense matrix multiplication Y = XW^T, where X is a dense matrix of shape (m, k), W is a transposed block-sparse matrix of shape (n, k) in BSR format, and Y is the output matrix of shape (m, n). Note that W is transposed from the original matrix to improve memory locality during the computation. As with GEMM for dense matrices, we can parallelize the block-sparse matrix multiplication by computing each element of the output matrix concurrently. Unlike dense GEMM, however, the block-sparse matrix in BSR format involves indirect indexing at runtime: we need to read the index_pointer and block_indices arrays of W and then locate the corresponding parts of the dense matrix X. This makes the memory-access pattern unpredictable, so it is difficult to apply memory optimizations such as using shared memory as a cache. We implemented the sparse-dense matrix multiplication on TVM [Chen et al., 2018], a deep learning compiler.
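As a concrete illustration of the BSR layout described above, the following sketch builds a small block-sparse matrix with SciPy, whose bsr_matrix stores the same three arrays under the names data, indices, and indptr (corresponding to block_data, block_indices, and index_pointer). The matrix values and block size here are arbitrary examples, not data from this project.

import numpy as np
from scipy.sparse import bsr_matrix

# A 4x8 matrix with 2x2 blocks; only two of the eight blocks are non-zero.
dense = np.zeros((4, 8), dtype=np.float32)
dense[0:2, 2:4] = 1.0   # non-zero block at block row 0, block column 1
dense[2:4, 6:8] = 2.0   # non-zero block at block row 1, block column 3

W = bsr_matrix(dense, blocksize=(2, 2))

print(W.data.shape)  # (2, 2, 2): [non-zero blocks, block rows, block columns]
print(W.indices)     # [1 3]: block column index of each non-zero block
print(W.indptr)      # [0 1 2]: offsets into indices marking each block row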
TVM uses its intermediate representation to describe and optimize tensor operations. It provides a set of domain-specific language (DSL) APIs for schedule transformations such as loop tiling and loop reordering. With TVM, we can implement the operation with the DSL API in Python and then generate efficient CUDA code, which allows us to explore different implementations for this task efficiently. The implementation of a TVM operation has two parts that are defined separately: the computation and the schedule. The computation part describes the operation in a tensor expression language, in a way similar to Halide [Ragan-Kelley et al., 2013]. Listing 1 shows an example of a tensor expression in TVM. The schedule part uses the provided schedule primitives to map a tensor expression to low-level code while preserving the logical equivalence of the program.

m, n, h = tvm.var('m'), tvm.var('n'), tvm.var('h')
A = tvm.placeholder((m, h), name='A')
B = tvm.placeholder((n, h), name='B')
k = tvm.reduce_axis((0, h), name='k')
C = tvm.compute((m, n), lambda y, x: tvm.sum(A[y, k] * B[x, k], axis=k))

Listing 1: Example of a tensor expression in TVM
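To illustrate the schedule part, the sketch below (our own example, written against the same pre-0.7 tvm API as Listing 1) tiles the output of the tensor expression from Listing 1, binds the tiles to CUDA blocks and threads, and generates CUDA code. The tile size of 8 is an arbitrary choice, not a schedule used in this project.

import tvm

# Tensor expression from Listing 1 (C = A * B^T over the reduction axis k).
m, n, h = tvm.var('m'), tvm.var('n'), tvm.var('h')
A = tvm.placeholder((m, h), name='A')
B = tvm.placeholder((n, h), name='B')
k = tvm.reduce_axis((0, h), name='k')
C = tvm.compute((m, n), lambda y, x: tvm.sum(A[y, k] * B[x, k], axis=k), name='C')

# Schedule: tile the output and map the tiles onto the CUDA thread hierarchy.
s = tvm.create_schedule(C.op)
y, x = C.op.axis
yo, yi = s[C].split(y, factor=8)
xo, xi = s[C].split(x, factor=8)
s[C].bind(yo, tvm.thread_axis("blockIdx.y"))
s[C].bind(xo, tvm.thread_axis("blockIdx.x"))
s[C].bind(yi, tvm.thread_axis("threadIdx.y"))
s[C].bind(xi, tvm.thread_axis("threadIdx.x"))

# Build for CUDA and inspect the generated kernel source.
func = tvm.build(s, [A, B, C], target="cuda")
print(func.imported_modules[0].get_source())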
As introduced in Section 3, we first implemented the block-sparse matrix multiplication with TVM's tensor expression DSL to define the computational semantics. We then used the schedule primitives of TVM to explore the schedule space of the block-sparse operation on CUDA. After that, we integrated AutoTVM, the machine-learning-based optimizer of TVM, to automatically tune the parameters of our schedules, such as the tiling sizes of the loops. We implemented and tested on one AWS g4dn.xlarge instance, which contains one NVIDIA T4 GPU.

We denote the operation we are optimizing as follows. Let the data be X ∈ R^(m×k), each block be B ∈ R^(b_r×b_c), and the transposed weights be W_B ∈ R^(n×k), where b_r divides n and b_c divides k. We want to minimize the time of computing Y = XW_B^T ∈ R^(m×n). Note that in practice W_B is stored in BSR format, which consists of the block_data, block_indices, and index_pointer arrays introduced in Section 3.

Our first approach parallelizes the computation by using one thread for each output element Y_ij. Specifically, we initially used a single thread block containing m × n threads. However, when m × n is large, a runtime error occurs because the thread block contains more threads than the hardware allows. We solved the problem by using m × n thread blocks, each of which contains one thread. The results show that the runtime increases as sparsity decreases. We believe this is because, as sparsity decreases, each thread needs to read more data from W_B, and all of these global memory accesses bring substantial overhead.

Since consecutive Y_ij share parts of X and W_B during the computation, similar to the optimization of dense GEMM, our second approach divides the m × n output elements of Y into multiple tiles of the same size and assigns each tile to a different thread. Assigning several consecutive Y_ij to the same thread may reduce some of the global memory access overhead: even though the computations of the Y_ij within one tile are no longer concurrent, they may benefit from locality when accessing X and W_B in global memory. In the experiments we tried several tile sizes, but the results were worse than with the first approach. We believe this is because the overhead of the sequential computation exceeds the savings in global memory access.

To reduce global memory access and preserve locality without losing concurrency, our third approach uses caching. We divide the m × n output elements of Y into multiple tiles of the same size, as in Section 4.2, but instead of assigning each tile to a different thread, we assign the computation of each tile to a different thread block and use one thread per element of the tile. In this way, threads within a thread block have locality when accessing X and W_B. We therefore use shared memory as a cache for the part of X required by all Y_ij in the same tile, as well as for the shared parts of W_B, the block indices, and the index pointers. Specifically, before computing the output, the threads copy the needed parts of X, block_data, and block_indices into shared memory through cooperative fetching and then call __syncthreads. Note, however, that in sparse-dense matrix multiplication the parts of X and block_data needed for the computation depend on block_indices, which requires checking the indices at runtime and then choosing the corresponding parts of X and block_data to cache.
This indirect indexing introduces additional overhead compared with dense matrix multiplication. In our implementation of this caching approach with TVM, we ran into another issue: TVM requires all caching decisions to be made at compile time. Since W_B is a runtime value and the compiler does not know which parts of W_B are dense, entire rows of X would be copied to shared memory, even though the parts of X corresponding to the sparse parts of W_B are not needed for the computation. Because runtime-dependent caching is not possible in TVM, we were not able to implement this approach.

4.4 Concurrent reduction

Based on the previous analysis, we decided to avoid shared memory and to limit memory access to only the parts of X that correspond to the dense parts of W_B, while preserving concurrency in the computation. In Section 4.1, we used one thread per thread block to compute one Y_ij. In this section, we parallelize the reduction that computes each Y_ij. Specifically, we use m × n thread blocks, each containing multiple threads that perform the reduction for a single Y_ij concurrently. In sparse-dense matrix multiplication, the reduction for Y_ij is the sum of the products between each non-sparse element in the j-th row of W_B and its corresponding element in the i-th row of X. The details of the reduction across threads are given in Section 4.4.3. We explore two ways of parallelizing the reduction, described in Section 4.4.1 and Section 4.4.2.

In the first approach, we parallelize the reduction for Y_ij as follows: for each non-sparse block B, we assign one thread to perform the reduction within B. As shown in Figure 2a, each color in the figure represents a different thread. Ideally, we would set the number of threads in each thread block to the number of non-sparse blocks in the corresponding row of W_B, because the sparse parts are not needed to compute Y_ij. However, since W_B is a runtime value and the number of non-sparse blocks per row of W_B varies, we have to set the number of threads in a thread block to the total number of blocks in that row, which is k/b_c (if k/b_c is large, one thread may handle multiple blocks). This may exceed the number of threads actually needed for the computation, so we use an if-statement to decide whether a thread should stay idle depending on whether its assigned block is sparse. This incurs control-flow divergence, and as expected the results do not improve.

In Section 4.4.1, different threads perform reductions on different blocks within a row of W_B. In the second approach, different threads perform the reduction within each block and across all dense blocks of a row of W_B. Specifically, we use a fixed number of threads t in each thread block, and each thread sums up a portion of Y_ij from every dense block of the corresponding row of W_B; the final Y_ij is obtained by summing up these partial results. As shown in Figure 2b, each color in the figure represents a different thread. During the reduction, different threads access consecutive elements within one block in global memory, which leads to global memory coalescing and improves memory efficiency. We see improvements in some cases compared with the results in Section 4.1, and we believe this approach points in the right direction for optimization. Since the number of threads per thread block t that we chose may not be optimal, we apply automatic parameter tuning with TVM in Section 4.5 (a schedule sketch of this idea follows below).
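The core scheduling idea behind this approach, splitting the reduction axis and binding part of it to threadIdx so that several threads cooperate on one output element, can be written in the same pre-0.7 tvm API roughly as follows. This is a simplified sketch applied to the dense expression from Listing 1, not the actual block-sparse kernel (which indexes W_B through block_indices), and the thread count of 32 is an arbitrary example.

import tvm

# Dense tensor expression from Listing 1, used only to illustrate the idea.
m, n, h = tvm.var('m'), tvm.var('n'), tvm.var('h')
A = tvm.placeholder((m, h), name='A')
B = tvm.placeholder((n, h), name='B')
k = tvm.reduce_axis((0, h), name='k')
C = tvm.compute((m, n), lambda y, x: tvm.sum(A[y, k] * B[x, k], axis=k), name='C')

s = tvm.create_schedule(C.op)
y, x = C.op.axis

# One thread block per output element, as in Section 4.4.
s[C].bind(y, tvm.thread_axis("blockIdx.y"))
s[C].bind(x, tvm.thread_axis("blockIdx.x"))

# Split the reduction axis and bind the outer part to threadIdx.x:
# TVM lowers this into a cross-thread reduction among the t threads.
t = 32  # number of cooperating threads per element (tuned in Section 4.5)
ko, ki = s[C].split(C.op.reduce_axis[0], nparts=t)
s[C].bind(ko, tvm.thread_axis("threadIdx.x"))

print(tvm.lower(s, [A, B, C], simple_mode=True))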
Figure 2: Example of parallel reduction over W_B. (a) Parallelize over blocks. (b) Parallelize within blocks. Each panel indicates the direction of reduction.
This section elaborates on the cross-thread reduction. To produce the output Y_ij, we sum up the partial reduction results from the different threads and store the result in the first thread of each thread block. We use shared memory to share the partial results between threads within the same thread block and perform several iterations of reduction, as shown in Figure 3. When the number of threads in a thread block is 32, we apply an additional optimization to eliminate the use of shared memory: we use the __shfl_down instruction to gather the values from the other threads in the same warp into the first thread.

Figure 3: Aggregate results from different threads (partial sums are combined over several iterations, indexed by thread ID).

To search for the optimal number of threads t mentioned in Section 4.4.2, we apply TVM's automatic parameter tuning for different shapes of the inputs X and W_B and generate specialized CUDA kernels. We define the list of candidate values of t to be all factors of k. TVM chooses values from these candidates and generates CUDA code to profile on the GPU; it then trains a decision-tree-based cost model to predict performance without running on the actual device, so the search is very efficient. We perform at most 200 iterations of search, which normally finishes within two or three minutes; a sketch of this tuning setup is given below.
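The tuning loop described above can be set up with TVM's autotvm module roughly as follows. This is a hedged sketch against the same pre-0.7 tvm/autotvm API: the template, the dense stand-in compute, the shapes, and the log file name are placeholders of our own, not the tuning script from the repository.

import tvm
from tvm import autotvm

# A tunable template in the spirit of Sections 4.4.2 and 4.5: the number of
# threads cooperating on one output element is exposed as a knob over the
# factors of k. The dense expression from Listing 1 stands in for the
# block-sparse kernel.
@autotvm.template
def matmul_reduction_template(m, k, n):
    A = tvm.placeholder((m, k), name='A')
    B = tvm.placeholder((n, k), name='B')
    r = tvm.reduce_axis((0, k), name='r')
    C = tvm.compute((m, n), lambda y, x: tvm.sum(A[y, r] * B[x, r], axis=r), name='C')

    s = tvm.create_schedule(C.op)
    cfg = autotvm.get_config()
    cfg.define_knob("num_threads", [f for f in range(1, k + 1) if k % f == 0])

    y, x = C.op.axis
    s[C].bind(y, tvm.thread_axis("blockIdx.y"))
    s[C].bind(x, tvm.thread_axis("blockIdx.x"))
    ro, ri = s[C].split(C.op.reduce_axis[0], nparts=cfg["num_threads"].val)
    s[C].bind(ro, tvm.thread_axis("threadIdx.x"))
    return s, [A, B, C]

task = autotvm.task.create(matmul_reduction_template,
                           args=(512, 512, 512), target="cuda")

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10))

# XGBTuner fits a gradient-boosted-tree cost model to the measured
# configurations so that most candidates can be ranked without running them.
tuner = autotvm.tuner.XGBTuner(task)
tuner.tune(n_trial=200,  # at most 200 trials, as described above
           measure_option=measure_option,
           callbacks=[autotvm.callback.log_to_file("tuning.log")])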
We test the approaches introduced in Section 4 with different input sizes (m, k, n), block sizes (b_r = b_c), and sparsities. The test cases are selected from popular deep learning models for computer vision and natural language processing. The testing environment is an NVIDIA T4 GPU with CUDA 10. As shown in Table 1, PEP (Section 4.1) achieves good performance in most cases. The parallel-reduction-based methods, PROB and PRWB (Section 4.4), outperform PEP in most cases where (m, k, n) is small. But as (m, k, n) increases, the running time of the parallel-reduction-based methods increases drastically. Since the number of possible options for the schedule parameters (such as the number of threads in a thread block) grows as (m, k, n) increases, it is very likely that the parameters we picked for this benchmark are not optimal. Since PROB suffers from control-flow divergence (Section 4.4.1), we chose PRWB for AutoTuning (Section 4.5).

After AutoTuning, we achieve the best results among all our approaches. We compare our best results with two state-of-the-art (SOTA) frameworks in Table 2. Gray et al. [2017] is a highly optimized library for block-sparse matrix operations based on manually tuned micro-kernels, which allows fine-grained control of instruction order and achieves competitive performance. cuSparse is the vendor-provided library for sparse and block-sparse matrices, which utilizes TensorCores. As shown in Table 2, our approach outperforms SOTA in half of the cases. When (m, k, n) is small, AutoTuning on PRWB generates the best results, even better than the TensorCore-based results. As (m, k, n) increases, our results are worse than the others. We suggest that the search space of the schedule parameters could be improved so that AutoTuning might yield better results. We also notice that our results tend to outperform SOTA when sparsity is extremely high; we suspect that cuSparse is unable to utilize TensorCores well in those scenarios.

We implemented multiplication between a dense matrix and a block-sparse matrix in BSR format on CUDA with TVM. We explored different schedules using TVM and evaluated their performance. With automatic parameter tuning in TVM, we achieved competitive performance in the benchmark.
In this project, we tried several implementations of sparse-dense matrix multiplication. Since the operation is memory intensive, the performance we achieved is still far below the compute limit of the hardware. Even though the parallel-reduction-based implementations achieved the best performance, the number of threads needed is highly dependent on the sparsity pattern. If a single row of W_B has many non-sparse blocks while some other rows are highly sparse, such imbalance incurs control-flow divergence because the number of reduction iterations differs between rows. As a result, the performance of the parallel-reduction-based implementation is significantly worsened, and the simple parallelization approach in Section 4.1 suffices. Hence, we believe there is no single implementation that performs best in all cases; to achieve good performance, we may need to combine different approaches for different cases.

Additionally, we observe that the absolute performance of both our methods and SOTA is very low. The peak throughput is about 400 GFLOPS, which is far below the theoretical performance of the NVIDIA T4 GPU (8.1 TFLOPS). We think the performance of sparse-dense matrix multiplication is mainly limited by memory access, and it is therefore difficult to gain further performance improvement.

References
Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. CoRR, abs/1611.05431, 2016. URL http://arxiv.org/abs/1611.05431.

Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. CoRR, abs/1707.01083, 2017. URL http://arxiv.org/abs/1707.01083.

Hasim Sak, Andrew W. Senior, and Françoise Beaufays. Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition. CoRR, abs/1402.1128, 2014. URL http://arxiv.org/abs/1402.1128.

Oleksii Kuchaiev and Boris Ginsburg. Factorization tricks for LSTM networks. CoRR, abs/1703.10722, 2017. URL http://arxiv.org/abs/1703.10722.

Scott Gray, Alec Radford, and Diederik P. Kingma. GPU kernels for block-sparse weights. 2017.

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 578–594, Carlsbad, CA, October 2018. USENIX Association. ISBN 978-1-939133-08-3.

Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '13, pages 519–530, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2014-6. doi: 10.1145/2491956.2462176. URL http://doi.acm.org/10.1145/2491956.2462176.

Table 1: Benchmark of matrix multiplications between a dense matrix of size m × k and a block-sparse matrix of size k × n with different block sizes and sparsities. Columns: (m, k, n), block size, sparsity, PEP, PTP, PROB, PRWB. “PEP”: Per-Element Parallelization (Section 4.1). “PTP”: Per-Tile Parallelization (Section 4.2). “PROB”: Parallel Reduction Over Blocks (Section 4.4.1). “PRWB”: Parallel Reduction Within Blocks (Section 4.4.2). Results are in milliseconds.

Table 2: Benchmark of matrix multiplications between a dense matrix of size m × k and a block-sparse matrix of size k × n with different block sizes and sparsities. Columns: (m, k, n), block size, sparsity, PRWB+AT (ours), Gray et al. [2017], cuSparse. “PRWB+AT”: Parallel Reduction Within Blocks + AutoTuning (Sections 4.4.2 and 4.5). Results are in milliseconds.