Communication-Avoiding and Memory-Constrained Sparse Matrix-Matrix Multiplication at Extreme Scale
Md Taufique Hussain, Oguz Selvitopi, Aydin Buluç, Ariful Azad
CCommunication-Avoiding and Memory-ConstrainedSparse Matrix-Matrix Multiplication at Extreme Scale
Md Taufique Hussain ∗ , Oguz Selvitopi † , Aydin Buluc¸ ‡ and Ariful Azad §∗ Indiana University, Bloomington, IN ([email protected]) † Lawrence Berkeley National Laboratory, Berkeley, CA ([email protected]) ‡ Lawrence Berkeley National Laboratory, Berkeley, CA ([email protected]) § Indiana University, Bloomington, IN ([email protected])
Abstract —Sparse matrix-matrix multiplication (SpGEMM) isa widely used kernel in various graph, scientific computingand machine learning algorithms. In this paper, we considerSpGEMMs performed on hundreds of thousands of processorsgenerating trillions of nonzeros in the output matrix. DistributedSpGEMM at this extreme scale faces two key challenges: (1)high communication cost and (2) inadequate memory to generatethe output. We address these challenges with an integratedcommunication-avoiding and memory-constrained SpGEMM al-gorithm that scales to 262,144 cores (more than 1 millionhardware threads) and can multiply sparse matrices of any sizeas long as inputs and a fraction of output fit in the aggregatedmemory. As we go from 16,384 cores to 262,144 cores on a CrayXC40 supercomputer, the new SpGEMM algorithm runs 10xfaster when multiplying large-scale protein-similarity matrices.
I. I
NTRODUCTION
Multiplication of two sparse matrices (SpGEMM) is acommon operation in numerous computing fields includingdata analytics [1, 2], graph processing [3]–[5], bioinformat-ics [6, 7], machine learning [8], computational chemistry [9],and scientific computing [10]. Most use cases of SpGEMMhave low arithmetic intensity, resulting in poor scaling tolarge concurrencies. Despite these inherent challenges, theneed to scale SpGEMM on large distributed-memory clustersfor solving important science problems generated a fruitfulline of research. As a result, the last decade saw the de-velopment of increasingly sophisticated distributed-memoryparallel SpGEMM algorithms that employ communication-avoiding techniques, perform communication overlapping, andpartitioning the matrices to minimize communication [11, 12].Different from other matrix multiplication instances suchas dense-dense and sparse-dense, SpGEMM has four uniquefeatures that make distributed SpGEMM harder to develop.(1) The output of SpGEMM often has more nonzeros thanthe inputs combined. In some applications, the output ofSpGEMM may not even fit in the aggregate memory oflarge supercomputers. (2) the number of nonzeros of theoutput matrix is not known a-priori for SpGEMM. Hence, adistributed symbolic step is needed to estimate the memoryrequired for the multiplication. (3) SpGEMM typically haslow arithmetic intensity. As a result, local computations oneach node becomes computationally expensive. (4) At extremescale, inter-process communication becomes the performancebottleneck. Since the last two problems arise in all distributed SpGEMM, they are well studied in the literature [11]–[13].By contrast, the first two problems are relatively new to thecommunity as SpGEMM performed with emerging biologicaland social networks has started to overrun the memory limitof supercomputers.This paper presents a distributed-memory algorithm and itshigh-performance implementation that harmoniously addressall four challenges. (1) When the memory requirement ofSpGEMM exceeds the aggregate memory, we multiply in b batches, where a batch computes a subset of columns of theoutput. (2) The required number of batches depends on theaggregate memory, input matrices, and the process grid usedto distribute the matrix. We developed a distributed symbolicalgorithm that efficiently identifies the memory requirementsand the optimal number of batches needed. (3) We integratethe batching technique within an existing communication-avoiding framework [13] that multiplies matrices on a 3Dprocess grid. (4) We observe that in-node SpGEMM andmerging routines needed by distributed SpGEMM do notneed nonzeros sorted within columns of input and outputmatrices. In light of this observation, we developed newsort-free SpGEMM and merging algorithms to make localcomputations significantly faster than current state-of-the-artapproaches. Thus, this paper presents novel solutions to fourchallenges in large-scale SpGEMM, and it does so within theestablished communication-avoiding algorithms framework.The proposed memory-constrained SpGEMM algorithm canbe directly used with applications that do not access the wholeoutput matrix all at once. For example, consider finding theJaccard similarity between two datasets, a problem that issuccessfully formulated as multiplication of a sparse matrixwith its transpose [14]. BELLA sequence overlapper [7] andPASTIS many-to-many protein sequence aligner [15] use theSpGEMM operation in a similar way. 
Since the subsequentanalysis only needs to access subsets of the output AA T , wecan form it in batches, and discard each batch. In hypergraphpartitioning using the effective multi-level paradigm, succes-sively coarser graphs are generated. Prior to coarsening, onetypically finds the number of shared hyperedges between allpairs of vertices in order to run a matching algorithm that iden-tifies pairs of vertices to coarsen. This process is called heavy-connectivity matching [16] or inner-product matching [17],and can be formulated as another SpGEMM of type AA T . Due a r X i v : . [ c s . D C ] O c t o memory limitations and the higher density of the product,this SpGEMM is done in batches in distributed-memory multi-level partitioners such as Zoltan [18]. Finally, HipMCL [19]is a distributed-memory scalable version of the Markov clus-tering algorithm. HipMCL iterations involve matrix squaringfollowed by various pruning steps to reduce the memoryfootprint and increase convergence. In both HipMCL andhypergraph coarsening, columns of the output can be formedin batches, and immediately used for pruning or matchingwithout requiring the whole output.In these scenarios, the higher-level application can formsubsets of the output matrix in batches, perform the requiredcomputation on it, and discard some or all of its nonzerosbefore moving into the next batch. The solutions provided inthe literature for all these examples are ad-hoc algorithms. Thecommunity lacks a thorough understanding of theoretical andpractical performance that can be expected from a memory-constrained SpGEMM that operates in batches. This resultsin late discoveries of performance and scalability problems inlarge-scale runs. Our work fills that crucial gap.Our integrated communication-avoiding (CA) and memory-constrained SpGEMM algorithm is generally applicable evenwhen forming the full output is feasible. In that case, ouralgorithm will choose to form the output all at once ifthat minimizes communication. Our algorithm strong scalesnaturally well because the increase in the available memoryallow our integrated algorithm to either decrease the numberof passes over the input, or increase replication in exchangeof reduced communication, or do both.Overall, the faster local computation, reduced communica-tion and batching techniques delivered a massively parallelalgorithm that scales to millions of threads on a Cray XC40supercomputer. Using batching, our SpGEMM can multiplymassive matrices using 0.5 PB memory whereas previousSpGEMMs would have required 2.2 PB memory, therebycould not solve the problem at all.The main contributions of the paper are summarized below.1) We develop a new distributed-memory SpGEMM algorithmthat multiplies matrices in batches when the required mem-ory to generate the output exceeds the available memory.This algorithm makes it possible to multiply sparse matricesof any size as long as inputs and a fraction of output fit inthe memory.2) We adapt existing CA techniques in a memory-constrainedsetting. We develop new algorithms to reduce the overheadof CA algorithms due to batching.3) We develop lower and upper bounds on the number ofbatches and present a symbolic algorithm to compute theexact number of batches needed.4) We demonstrate that our SpGEMM algorithm scales to4096 nodes (more than 1 million hardware threads) on aCray XC40 supercomputer. 
The utility of the SpGEMM al-gorithm is demonstrated with large-scale protein similaritynetworks where the multiplication requires up to 300 trillionfloating point operations and 2.2 PB memory. II. B ACKGROUND AND N OTATIONS
A. Notations
Given two sparse matrices A ∈ R m × k and B ∈ R k × n ,SpGEMM multiplies A and B and computes another po-tentially sparse matrix C ∈ R m × n . The operation can be alsoperformed on an arbitrary semiring S instead of the field ofreal numbers R and our algorithms are applicable to that caseas well since we do not utilize Strassen-like algorithms. Inour analysis, we consider n -by- n matrices (that is, m = k = n )for simplicity. Given a matrix A , nrows ( A ) and ncols ( A ) denote the number of rows and columns in A . nnz ( A ) denotesthe number of nonzeros in A . flops denotes the numberof multiplications needed to compute AB . The compressionfactor ( cf ) is the ratio of flops to nnz ( C ) : cf =flops / nnz ( C ) .Since at least one multiplication is needed for an outputnonzero, cf ≥ . B. Problem Description
In many data analytics applications that use SpGEMM, theoutput matrix C is often too large to store all at once inaggregate memory. Fortunately, the subsequent analysis rarelyrequires access to the whole C , so it can be formed in batches .Our paper is about this memory-constrained SpGEMM formu-lation where the available aggregate memory M = nnz ( C ) /k where k > . We also assume that the inputs fit into aggregatememory, meaning M > nnz ( A ) + nnz ( B ) . Together, thesetwo conditions imply nnz ( C ) > k ( nnz ( A ) + nnz ( B )) for k > . Note that the actual memory requirement of distributedSpGEMM can be much larger than O ( nnz ( C )) because of theneed to store unmerged results in each process. Our algorithmalso covers the case where the final output fits in the memory,but the intermediate results exceed the available memory. C. Related work
The algorithms developed for parallel SpGEMM stem fromhow the matrices and the relevant computations are dis-tributed across parallel computing units. Most of the shared-memory parallel SpGEMM algorithms rely on Gustavson’salgorithm [20], which yields a high level of parallelism as therows or the columns of the output matrix can be computedindependent of each other. In the column variant, a number ofcolumns of A need to be scaled and accumulated in order tocompute the output column C (: , j ) , i.e., C (: , j ) = (cid:88) i : B ( i,j ) (cid:54) =0 A (: , i ) B ( i, j ) . The algorithms usually differ in the data structure they usefor accumulating values. Among them are the sparse accu-mulators [21, 22], heaps [13, 23], hash tables [24, 25], andmerge sorts [26, 27]. The accumulator choice has importantimplications on the performance of the parallel algorithm andoften depends on sparsity pattern of the multiplied matrices,compression factor, and parallel architecture.The studies regarding parallelization of SpGEMM on many-and multi-core systems often focus on optimization aspects re-lated to accumulators, efficient data access, and load balancing.2 lgorithm 1
An overview of 2D sparse SUMMA algorithm.
Input and Output:
Input matrices A and B and output matrix C are distributed on a 2D process grid P D . ˜A , ˜B , and ˜C denote local submatrices in the process considered. procedure S UMMA A , B , P D ) for all processes in P D ( i, j ) in parallel do stages ← number of process columns in P D for s ← to stages do (cid:46) SUMMA stages ˜A recv ← B CAST ( ˜A , P D ( i, s ) , P D ( i, :) ) ˜B recv ← B CAST ( ˜B , P D ( s, j ) , P D (: , j )) ˜C [ s ] ← L OCAL M ULTIPLY ( ˜A recv , ˜B recv ) ˜C ← M ERGE ( ˜C [1 .. stages ] ) return C As the many-core architectures necessitate a fine-grain loadbalancing for good performance, several works on GPUs [23,24, 26, 28] aim to achieve that goal. The works for multi-core architectures [22, 25] often strive for improving the poorcache behavior with various techniques such as blocking orusing appropriate accumulators according to cf .The distributed memory algorithms for parallel SpGEMMcan be categorized according to the data distribution methodthey use. The algorithms that rely on one-dimensional (1D)distribution partition all matrices across the entire row orcolumn dimension. Although 1D distribution [29] suffers frompoor communication scalability, the communication costs canbe reduced by pre-processing with graph/hypergraph parti-tioning models [11] that exploit the sparsity pattern of themultiplied matrices in order to reduce communicated data.This pre-processing, however, can be prohibitive if SpGEMMis not utilized in a repeated context as the pre-processingstage often does not scale well. CombBLAS [5] uses SparseSUMMA algorithm [30, 31] tailored for 2D distribution ofmatrices. In 2D distribution, the matrices are partitioned intorectangular blocks and a 2D process grid is associated withthe distribution. The 2D distribution has better communicationcharacteristics compared to 1D distribution. In the 3D variantof the Sparse SUMMA algorithm, each sub-matrix is furtherdivided into layers. This approach exhibits better scalabilityat larger node counts [13, 32], where the multiplied instancesbecome more likely to be latency-bound. We examine thesealgorithms in detail in Sections III-A and III-B, respectively,as they form the basis of our work. Cannon’s algorithm [33],which uses a 2D distribution of matrices, has also been usedin parallel SpGEMM [9].III. 2D AND SPARSE
SUMMA
ALGORITHMS
Our algorithmic framework is based on the 2D SUMMAalgorithm [30, 31], and our communication-avoiding techniquerelies on the 3D SUMMA algorithm developed in priorwork [13]. In this section, we revisit 2D and 3D sparseSUMMA algorithms by putting them in a common framworkupon which the new memory-constrained algorithm will bebuilt in the next section.
A. 2D Sparse SUMMA
Data distribution.
The original sparse SUMMA algo-rithm [30] works on a 2D p r × p c process grid P D . In thispaper, we only consider square process grid with p r = p c . P D ( i, j ) denotes the process in i th row and j th column inthe 2D grid. P D ( i, :) denotes all processes in the i th row and P D (: , j ) denotes all processes in the j th column. The algorithm.
The S
UMMA
2D function in Algorithm 1describes the 2D multiplication algorithm that operates onmatrices distributed on P D . S UMMA
2D proceeds in p c stages.At stage s , each process P D ( i, j ) participates in two broadcastoperations in its process row and process column. In the i thprocess row P D ( i, :) , the process P D ( i, s ) broadcasts itslocal submatrix ˜ A to all processes in P D ( i, :) (line 5, Alg. 1).Similarly, in the j th process column P D (: , j ) , the process P D ( s, j ) broadcasts its local submatrix ˜ B to all processes in P D (; , j ) (line 6, Alg. 1). Each process stores the receiveddata in ˜ A recv and ˜ B recv . In line 7 of Alg. 1, we locallymultiply the received matrices and generate a low-rank versionof the output. To facilitate merging at the end of all SUMMAstages, partial result from each stage is stored in an array. Forexample, at stage s , each process multiplies received inputsubmatrices ˜ A recv and ˜ B recv and stores the result at ˜ C [ s ] .Finally, at the end of all SUMMA stages, each process merges p c pieces of partial results and creates its local piece ˜ C ofthe result (line 8, Alg. 1). When S UMMA
2D returns, all localpieces ˜ C form the output matrix C distributed on P D . Major steps.
Each stage of S
UMMA
2D has three majorsteps: (a)
A-Broadcast : broadcasting parts of A along theprocess row; (b) B-Broadcast : broadcasting parts of B alongthe process column; and (c) Local-Multiply : performing mul-tithreaded local multiplication. After all SUMMA stages, weperform another step
Merge (will be called
Layer-Merge inthe 3D algorithm) that merges partial results. Here, mergingmeans adding multiplied values with the same row and columnindices. While one can incrementally merge partial results afterlocal multiplications, it is computationally more expensive inthe worst case [34]. Hence, in this paper, we consider mergingafter completing all stages as shown in line 8 of Alg. 1.S
UMMA
2D performs reasonably well on a few hundred pro-cesses. However, as the number of processes increases, com-munication (broadcasting A and B ) becomes the performancebottleneck of S UMMA
2D [30]. 3D Sparse SUMMA [13] wasdeveloped to reduce the communication cost of SpGEMM.
B. 3D Sparse SUMMA
Processes grid.
Here, each matrix is distributed in a 3D p r × p c × l process grid P D . All processes P D (: , : , k ) witha fixed value k in the third dimension form the k th layer of P D , where ≤ k ≤ l . In this context, P D (: , : , k ) is similarto a 2D process grid discussed in Sec. III-A. In this paper,we only consider square process grid in each layer. Hence,with p processes divided into l layers, the shape of P D is (cid:112) p/l × (cid:112) p/l × l . Here, P D ( i, j, k ) denotes the process in the i th row and j th column in the k th layer in P D . All processes3 lgorithm 2
3D sparse SUMMA.
Input and Output:
Input matrices A and B and output matrix C are distributed in a 3D process grid P D . ˜A , ˜B , and ˜C denote local submatrices in the current process. A ( k ) denotesthe 2D submatrix in the k th layer of P D . procedure S UMMA A , B , P D ) for all processes in P D ( i, j, k ) in parallel do D ( k ) ← S UMMA A ( k ) , B ( k ) , P D (: , : , k ) ) ˜D ( k ) [1 ..l ] ← ColSplit( ˜D ( k ) , l ) ˜C ( k ) [1 ..l ] ← AllToAll ( ˜D ( k ) [1 ..l ] , P D ( i, j, :)) ˜C ( k ) ← Merge( ˜C ( k ) [1 ..l ]) return C TABLE I:
Symbols used in this paper
Symbol Meaning p total number of processes l number of layers in a 3D grid b number of batches P D a (cid:112) p/l × (cid:112) p/l × l process grid A first input matrix (distributed) B second input matrix (distributed) C output matrix (distributed) A ( k ) Part of A in the k th layer (similarly B ( k ) and C ( k ) ) D ( k ) A low-rank version of output matrix in the k th layer ˜ A parts of A stored in the current process (similarly ˜ B , ˜ C ) n total number of rows/columns in A , B , C M total available memory in all processes r number of bytes needed to store a nonzero (on average) in P D ( i, j, :) form a fiber that consists of processes with thesame row and column indices from different layers. Data distribution.
Suppose A is an n × n matrix. After dis-tributing A in a 3D grid, let A ( k ) ∈ R n × ( n/l ) be the submatrixof A distributed in the k th layer. Fig. 1(c) shows that eachlayer gets slices of A that respect the 2D process boundary.After 3D distribution, each local piece ˜ A is an (cid:0) n/ (cid:112) p/l (cid:1) × (cid:0) n/ √ pl (cid:1) submatrix. Hence, nrows ( ˜ A ) = l · ncols ( ˜ A ) for alllocal submatrices ˜ A . Similarly, we split B along the rows,and each local piece ˜ B is an (cid:0) n/ √ pl (cid:1) × (cid:0) n/ (cid:112) p/l (cid:1) submatrix(Fig. 1(f)). As l increases, ˜ A becomes tall and skinny and ˜ B becomes short and fat. Fig. 1 shows an example with a 3Dgrid where p = 8 processes are organized in l = 2 layers,and each layer is a × grid. Since A and B are distributeddifferently in P D , we chose to distribute C similar to A . The algorithm.
The S
UMMA
3D function in Algorithm 2describes the 3D multiplication algorithm that operates onmatrices distributed on P D . Initially, S UMMA
3D proceedsindependently at each layer P D (: , : , k ) by calling S UMMA
UMMA
2D multiplies A ( k ) ∈ R n × nl and B ( k ) ∈ R nl × n to produce D ( k ) ∈ R n × n . Here, D ( k ) denotes an intermediatelow-rank output matrix that needs to be merged across layersto form the final product. Next, each process splits ˜D ( k ) , thelocal submatrix of D ( k ) , into l parts ˜D ( k ) [1 ..l ] by splitting ˜D ( k ) along the column (line 4, Alg. 2). Then, each processperforms an AllToAll operation on every fiber P D ( i, j, :) (line5, Alg. 2). Finally, each process merges all the copies received from the AllToall operation in previous step to form the finallocal copy of the result matrix (line 6, Alg. 2). The last twosteps are called AllToAll-Fiber and Merge-Fiber.IV. N EW ALGORITHMS FORMEMORY - CONSTRAINED
3D S P GEMM
A. The case for batching
Memory requirement of 3D Sparse SUMMA. If r bytesare needed to store a nonzero, the memory requirement to per-form SpGEMM is a least r ( nnz ( A )+ nnz ( B )+ nnz ( C )) bytes.However, a distributed algorithm requires much more memoryin practice. The S UMMA
3D function from Algorithm 2 on a (cid:112) p/l × (cid:112) p/l × l process grid stores unmerged matrices D ( k ) at layer k . Thus, we have the following inequality: flops ≥ l (cid:88) k =1 nnz ( D ( k ) ) ≥ nnz ( C ) . (1)Here, the Merge-Layer operation will generate D ( k ) at everylayer, and the Merge-Fiber will produce C . In the worstcase, we may need to store flops nonzeros when no merginghappens inside Local-Multiply in S UMMA C )denote the memory requirement for the S UMMA
3D function.Then, mem ( C ) = r (cid:80) lk =1 nnz ( D ( k ) ) .For example, consider the Metaclust50 matrix in Table V.If we need r =
24 bytes to store a nonzero (16 bytes forrow and column indices and 8 bytes for the value), we need24TB memory to store the final output. However, the actualmemory requirement could be up to Trillion ∗
24 = 2208
TB(2.2PB) as squaring this matrix requires 92 trillion flops . Corisupercomputer used in our experiment has 1.09PB aggregatedmemory. Hence, we need algorithmic innovations to multiplymatrices at this extreme scale.
Batched 3D Sparse SUMMA.
When the memory re-quirement of S
UMMA
3D exceeds the available memory, wemultiply matrices in b batches where each batch computes n/b columns of C . Hence, in a batch, we multiply A ∈ R n × n and B ∈ R n × nb to generate C ∈ R n × nb . We chose to form batchesof C column-by-column so that an application can take aninformed decision based on the entire columns of C . Ourmotivation comes from graph analytics where we need theentire result for a batch of vertices before an applicationdecides how to process the result and move to the next batch.For example, Markov clustering [19] keeps top-k entries ineach column from the resultant matrix. Hence, we did notconsider partitioning C into square tiles despite the fact thatit may reduce communication even further. A symbolic step to determine the required number ofbatches. If M denotes the aggregated memory in p processes,then an SpGEMM algorithm needs at least b phases as follows: b ≥ (cid:24) mem ( C ) M − r ( nnz ( A ) + nnz ( B )) (cid:25) . (2)In practice, b cannot be determined analytically becausemem ( C ) is not known in advance. b also depends on the layoutof the the process grid P D and the load balancing factor indistributing the matrices.4 = n = 8 𝑝/𝑙 = 𝑛 / ) * = 𝑛𝑝𝑙 = 2 (a) An 8x8 sparse matrix (b) A 3D grid with p=8 processorganized in 2 2x2 layers (c) Partitioning A into two layers L1 and L2 L1 L1L2 L2
Grid: 2x2Matrix: 8x4 Grid: 2x2Matrix: 8x4(d) Part of A in L1 (e) Part of A in L2
A in Layer 1 A in Layer 2 (f) Partitioning B into two layers L1 and L2
L1L1L2L2 𝑛/ )* = 4 𝑛 𝑝 𝑙 = Grid: 2x2Matrix: 4x8Grid: 2x2Matrix: 4x8(g) Part of B in L1 (h) Part of B in L2
B in Layer 1B in Layer 2 (i) (shaded) batch 1 in layer 1 (b=2) ,-* (𝑛/ )* ) = 1 n = n/l = 4n=8 n / l = Fig. 1:
Distributing an input matrix shown in (a) into a × ×
3D process grid shown in (b). Here, we use the same matrix for both A and B . Thick borders denotes process boundaries in each × layer. (c) A is partitioned along the column to create layers. Red slicesform layer 1 and blue slices form layer 2. (d, e) Rectangular submatrices of A in layer 1 and 2. (f) B is partitioned along the row to createlayers. Red slices form layer 1 and blue slices form layer 2. (g, h) Rectangular submatrices of B in layer 1 and 2. (i) Assuming b = 2 , thefirst batch in layer 1 of B is shown by shaded regions. (layer 1)(layer 2) (layer 1, batch 1 in shade) (layer 1, batch 1)After 2D SUMMA (layer 1, batch 1)After Alltoall-FiberXX == (layer 1, batch 1)After Merge-Fiber n n/l n n/l n 𝐴 ( 𝐵 ( 𝐷 ( 𝐶 ( 𝐶 ( 𝐴 (() (layer 2, batch 1 in shade) (layer 2, batch 1)After 2D SUMMA (layer 2, batch 1)After Alltoall-Fiber (layer 2, batch 1)After Merge-Fiber 𝐵 (() 𝐷 (() 𝐶 (() 𝐶 (() Fig. 2:
Assuming b = 2 , major steps of the first batch of Alg. 4 are shown when multiplying the input matrices shown in Fig. 1. Columnsinvolved in this batch are shown by shaded regions. Red dots represent nonzeros in layer 1 and blue dots represent nonzeros in layer 2. Afterperforming 2D multiplication in each layer, we obtain D (1) and D (2) at layer 1 and 2. After AllToall-Fiber both layers have some red andsome blue dots as each process exchanges along the fiber some part of its result obtained from S UMMA B . We developed the S
YMBOLIC
3D function (Algorithm 3)that estimates the number of phases from input matrices andthe process grid P D . Similar to the SUMMA3D function,S YMBOLIC
3D operates in several stages within each layer(line 5-8 in Algorithm 3). At each stage, local submatrices ˜A and ˜B are broadcast along process rows and columns, respectively. After P D ( i, j, k ) receives submatrices of A and B from other processes, it performs a symbolic multiplicationlocally using L OCAL S YMBOLIC that computes the number ofnonzeros in the output. At the end of all SUMMA stages,we find the maximum nnz for A , B , and D stored at anyprocess (lines 9-11 in Algorithm 3). Finally, line 12 computes5 lgorithm 3 Symbolic step to determine b . Input : A and B are distributed on a 3D process grid P D . M denotes the total available memory in bytes and r denotesthe number of bytes needed to store each nonzero. procedure S YMBOLIC A , B , P D , M ) for all processes in P D ( i, j, k ) in parallel do nnz [ i, j, k ] ← (cid:46) per-process nnz stages ← number of process columns in P D for s ← to stages do (cid:46) SUMMA stages ˜A recv ← B CAST ( ˜A , P D ( i, s, k ) , P D ( i, : , k )) ˜B recv ← B CAST ( ˜B , P D ( s, j, k ) , P D (: , j, k )) nnz [ i, j, k ] += L OCAL S YMBOLIC ( ˜A recv , ˜B recv ) maxnnzC ← A LL R EDUCE M AX ( nnz [ i, j, k ] , P D ) maxnnzA ← A LL R EDUCE M AX ( nnz ( ˜A ) , P D ) maxnnzB ← A LL R EDUCE M AX ( nnz ( ˜B ) , P D ) b ← r × maxnnzC M/p − r ( maxnnzA+maxnnzB ) return bb from per-process available memory M/p and other memoryrequirements. Note that the S
YMBOLIC
3D function considersthe maximum unmerged nonzeros stored by a process so thatno process exhausts its available memory. This choice makesour algorithm robust to different sparsity patterns. Hence, incomparison to perfectly-balanced computation, S
YMBOLIC
YMBOLIC
3D has lightweight com-putations because L
OCAL S YMBOLIC can be computed muchfaster than L
OCAL M ULTIPLY . By contrast, S
YMBOLIC
3D stillneeds expensive broadcasts as was needed by SUMMA3D.Consequently, the communication-avoiding scheme used inS
YMBOLIC
3D has more significant impact on its performance.
B. The batched SUMMA3D algorithm
Data distribution.
We perform batching on top of 3Ddistribution discussed in Fig. 1. Since a batch computes n/b columns of C , batching does not change the distribution of A shown in Fig. 1. Globally, a batch should have n/b columns of B and C across all processes. This can be achieved by simplysplitting all local submatrices ˜ B along the column and making b pieces. Let ˜ B [ i ] be the part of local B in the i th batch.Since ˜ B has n/ (cid:112) p/l columns in 3D distribution, ˜ B [ i ] has n/ ( b (cid:112) p/l ) columns. However, a block column partitioningof ˜ B may create load imbalance in the Merge-Fiber operation(to be explained next). Hence, we use block-cyclic partitioningwhere each block has n/ ( bl (cid:112) p/l ) columns. A collection of l such blocks gives us a batch of desired shape as shown inFig. 1(i). The algorithm.
The B
ATCHED S UMMA
3D function in Al-gorithm 4 multiplies matrices distributed on P D and generatesthe result matrix batch by batch. At first, each process splits ˜ B ,its local copy of B , column-wise into b batches ˜ B [1 ..b ] (line4, Alg. 4). The exact block cycling splitting is described inthe previous paragraph and in Fig. 1(i). It effectively createsas many pieces of B as the number of batches b . Then for Algorithm 4
3D memory-constrained sparse SUMMA withbatching.
Input and Output:
Input matrices A and B and output matrix C are distributed in a 3D process grid P D . ˜A , ˜B , and ˜C denote local submatrices in the current process. b is the numberof batches. B [ batch ] denotes a 3D matrix consisting of localbatch ˜B [ batch ] from all processes. procedure B ATCHED S UMMA A , B , P D ) b ← S YMBOLIC ( A , B , P D ) for all processes in P D ( i, j, k ) in parallel do ˜B [1 ..b ] ← ColSplit ( ˜B , b ) for batch ← to b do C [ batch ] ← S UMMA A , B [ batch ] , P D ) ˜C ← ColConcat ( ˜C [1 ..b ]) return C each batch , the algorithm calls S UMMA
3D to multiply A and B [ batch ] to get the output matrix C [ batch ] (line 6, Alg. 4).At the end of all batches, each process has b pieces of ˜ C [1 ..b ] .Here, submatrices in ˜ C [1 ..b ] have non-overlapping columns.Hence, each process concatenates ˜ C [1 ..b ] along the column toform its own piece ˜ C of the final result (line 7, Alg. 4). Theoverall result matrix C is a collection of these local matricesdistributed on P D . Note that we showed the formation of C for completeness of the algorithm. In practice, the output C [ batch ] from each batch is pruned or saved to disk bythe application. Fig. 2 shows an execution of Algorithm 4.Note that when nnz ( A ) (cid:29) nnz ( B ) , column-wise batching canbe expensive. However, if inputs are square matrices, wecan easily use row-by-row batching on B using the samealgorithm. Major steps. B ATCHED S UMMA
3D uses S
UMMA
3D ineach batch and S
UMMA
3D uses S
UMMA
2D in each layer.Hence, all major steps in S
UMMA
3D and S
UMMA
2D are alsoexecuted in B
ATCHED S UMMA
ATCHED -S UMMA
3D uses Alg. 3 to estimate the number of batchesneeded for an SpGEMM. Therefore, B
ATCHED S UMMA
3D hasseven major steps: (1) Symbolic multiplication to estimate b (once; involve communication and computation), (2) A-Broadcast (once per SUMMA2D stage: along a process rowon every layer), (3) B-Broadcast (once per SUMMA2D stage:along a process column on every layer), (4) Local-Multiply(once per SUMMA2D stage: local computation), (5) Merge-Layer (once per SUMMA2D stage: local computation), (6)AllToAll-Fiber (once per batch: communicate in a fiber) (7)Merge-Fiber (once per batch: local computation). C. Communication complexity
Among seven major steps in B
ATCHED S UMMA b > , B ATCHED -S UMMA
3D increases the number of times matrix A needs tobe re-communicated, making the impact of communication-avoidance even more significant for batched SpGEMM. Even6hough we relied on the communication-avoiding techniquedeveloped in a prior work [13], batching makes communi-cation steps more fine-grained and irregular. To analyze thecommunication complexity we used the α − β model, where α is the latency constant corresponding to the fixed cost ofcommunicating a message regardless of its size, and β is theinverse bandwidth corresponding to the cost of transmittingone word of data. Consequently, communicating a message of n words takes α + βn time.Table II shows the bandwidth and latency costs of differentsteps of B ATCHED S UMMA b . Here, the bandwidth bound forAllToAll-Fiber is rather loose because there is already signifi-cant compression of intermediate products expected within (a)local SpGEMM calls, and (b) within each SUMMA layer. Asthe number of layers l increases, the effect of this compressiondiminishes. It is tighter to use (cid:80) lk =1 nnz ( D ( k ) ) /p in lieu of flops /p , which is a function that grows slowly with l .Table II shows that all communication steps are performedat least b times. Consequently, B ATCHED S UMMA
3D hashigher latency overheads relative to S
UMMA A iscommunicated b times, the bandwidth cost of A-Broadcastcould dominate the overall communication cost, especially fora large value of b . Fortunately, we can increase the number oflayers to reduce the overhead of re-communicating A . D. Faster local computations using hash tables
We developed new algorithms for local multiplication andmerging within each process with an aim to utilize specialstructures due to layering and batching. One particular opti-mization is in the sortedness of results after Local-Multiply,Merge-Layer, and Merge-Fiber. Since the final output C isobtained after Merge-Fiber, we keep the output of Merge-Fibersorted within each column. However, the outputs of Local-Multiply and Merge-Layer do not need to be sorted withineach column because they will eventually be sorted afterMerge-Fiber. To facilitate unsorted outputs in Local-Multiplyand Merge-Layer, we used hash-based local SpGEMM andmerging algorithms. By contrast, heap-based algorithms usedin prior work [13] always keep results sorted. These modifi-cations can make local computations more than × faster formany matrices as we will demonstrate in Fig. 15. Sort-free local SpGEMM.
Previous 3D SUMMA algo-rithm [13] used a multithreaded heap-based local SpGEMMalgorithm in each process. However, the heap algorithm waslater replaced by a hybrid algorithm that used either a hashtable or a heap to form the i th column depending on the com-pression ratio of the column [25]. After forming the column,that hybrid algorithm sorted the column in an ascending orderof row indices. With an aim to keep matrices unsorted, weuse a hash-based SpGEMM in our Local-Multiply routine(line 7 of Algo. 1) because the hash SpGEMM does notrequire sorted matrices as inputs. Additionally, we refrainfrom sorting columns once they are formed. Thus, our local TABLE II:
Communication complexity for different steps ofB
ATCHED S UMMA
A- B- AllToAll-Bcast Bcast FiberPer process data nnz ( A ) /p nnz ( B ) / ( bp ) flops / ( bp ) Comm. size (cid:112) p/l (cid:112) p/l l
Latency cost α lg p/l α lg p/l αl Bandwidth cost β ( nnz ( A ) /p ) β ( nnz ( B ) / ( bp )) β (flops / ( bp )) How many times b ( (cid:112) p/l ) b ( (cid:112) p/l ) b Tatal latency αb ( (cid:112) p/l lg p/l ) αb ( (cid:112) p/l lg p/l ) αbl Total bandwidth βb ( nnz ( A ) / √ pl ) β ( nnz ( B ) / √ pl ) β (flops /p ) TABLE III:
Computational complexity for different steps ofB
ATCHED S UMMA
Local- Merge- MergeMultiply Layer FiberComplexity flops / ( bp (cid:112) p/l ) flops / ( bp ) lg p/l flops / ( bp ) lg l How many times b ( (cid:112) p/l ) b b Total flops /p flops /p lg p/l flops /p lg l TABLE IV:
Overview of the evaluation platform.
Cori-KNL Cori-Haswell
Processor Intel Xeon Phi 7250 Intel Xeon E5-2698Cores/node 68 32Clock 1.4 GHz 2.3 GHzHyper-threads/core 4 2Memory/node 112GB 128GBTotal nodes 9,668 2,388Total memory 1.09 PB 298.5 TBInterconnect Cray Aries with Dragonfly topologyCompiler Intel icpc Compiler 19.0.3 with -O3 option multiplication uses an “unsorted-hash” algorithm. In practice,the unsorted-hash algorithm can be - faster than thehybrid algorithm. Sort-free hash merging algorithms.
Previous 2DSUMMA [30] and 3D SUMMA [13] algorithms used aheap-based merging algorithm in Merge-Layer and Merge-Fiber routines. Observing the benefit of hash SpGEMM algo-rithms [25], we develop a new hash-based merging algorithmfor Merge-Layer and Merge-Fiber steps. Given a collection of l matrices, the hash-merge algorithm forms the i th column ofthe merged output from the i th columns of all input matrices.The merging is done using a hash table that can work withunsorted input and outputs an unsorted output column. Thisnew “unsorted-hash-merge” algorithm can be an order ofmagnitude faster than previous heap-merge algorithm.V. R ESULTS
A. Evaluation platforms
We evaluate the performance of our algorithm on NERSCCori system. We used two types of compute nodes on Corias described in Table. IV. We used MPI+OpenMP hybridparallelization. All of our experiments used 16 and 6 threadsper MPI process on Cori-KNL and Cori-Haswell, respectively.Only one thread in every process makes MPI calls.7
ABLE V:
Statistics about test matrices used in our experiments. C = AA T for Rice-kmers and Metaclust20m. For all other cases, C = AA . M, B and T denote million, billion and trillion, respectively. Matrix ( A ) rows columns nnz ( A ) nnz ( C ) flops Eukarya 3M 3M 360M 2B 134BRice-kmers 5M 2B 4.5B 6B 12.4BMetaclust20m 20M 244M 2B 312B 347BIsolates-small 35M 35M 17B 248B 42TFriendster 66M 66M 3.6B 1T 1.4TIsolates 70M 70M 68B 984B 301TMetaclust50 282M 282M 37B 1T 92T
B. Test problems
Table V describes several large-scale matrices used in ourexperiments. Eukarya, Isolates-small, and Isolates are protein-similarity networks generated from isolate genomes in theIMG database [35]. These matrices are publicly available [19].Metaclust50 stores similarities of proteins in Metaclust50(https://metaclust.mmseqs.com/) dataset which contains pre-dicted genes from metagenomes and metatranscriptomes ofassembled contigs from IMG/M and NCBI. Friendster repre-sents an online gaming network available in the SuiteSparseMatrix Collection [36]. The rows of the Rice-kmers matrixrepresent the PacBio sequences for the Oryza sativa rice, andits columns represent a subset of the k -mers (short nucleotidesubsequences of length k ) that are used by the sequenceoverlapping software BELLA [7]. The Metaclust20m matrixis originally used in distributed protein similarity search [15]and contains a subset of sequences for Metaclust50. The rowsand columns of this matrix respectively represent protein se-quences and k-mers. The AA T operation involving this matrixproduces candidates for batch pairwise sequence alignment.Even though there are many applications of SpGEMM, weprimarily focus on problems where nnz ( C ) > ( nnz ( A ) + nnz ( B )) because for these problems, batching could beneeded. For example, square of a sparse matrix often requiressignificantly more memory than input matrices [37]. Weselected matrices in Table V considering the following threeapplications. (a) Markov clustering [19, 38] iteratively squaresa protein-similarity matrix to discover protein clusters. We cap-ture this application by extensively studying the computationof AA . (b) Clustering coefficients of nodes in a social networkare computed by counting triangles. High-performance trianglecounting algorithms rely on the multiplication of the lower tri-angular and upper triangular parts of the adjacency matrix [3]. AA captures the computation needed in triangle countingalgorithms. (c) By computing AA T , the sequence overlapperBELLA discovers all (if any) the shared k-mers between allpairs of sequences. By using the sparse matrix as an indexdata structure in lieu of a hash table, SpGEMM computes thisseemingly all-pairs computation without incurring quadraticcost in the number of reads. C. Impact on an end-to-end protein clustering application
We plugged the newly developed SpGEMM algorithm inHipMCL [19], a distributed-memory implementation of the T i m e ( s e c ) b = = = = = = = = = = SymbolicCommunicationComputation
Fig. 3:
Run times of the first 10 iterations of HipMCL when clusteringthe Isolates-small graph on 65,536 cores of Cori. The left bar ofeach group represents 1-layer setting and right bar represents 16-layer setting. Number of batches b is calculated using the symbolicstep. b = 1 is not shown. HipMCL needed 66 iterations to clusterproteins of Isolates-small graph. Overall, it was . × faster with 16layers than with 1 layer. Markov clustering algorithm. HipMCL clusters a protein-similarity network by iteratively squaring the input matrix andthen applying column-wise pruning to keep the output sparse.For large networks in Table V, Cori does not have enoughmemory to store A . When we employ B ATCHED S UMMA A batch-by-batch, apply variouspruning strategies on the current batch and then proceed tothe next batch. Hence, the presented algorithm satisfies theneed of HipMCL perfectly well.Fig. 3 show that the runtime of the first 10 iterations ofHipMCL when B ATCHED S UMMA
3D is used with 1 and 16layers. In the first few iterations, HipMCL performs expensivemultiplications that needs multiple batches on 1024 nodes.Even though the 16-layer setting needs more batches, thebenefit of communication avoidance makes most iterations × or more faster than 2D SpGEMM. Overall, B ATCHED -S UMMA
3D makes HipMCL more than . × faster than abatched 2D algorithm. More importantly, HipMCL cannot evencluster Isolates-small on 1024 nodes of Cori if batching is notused . Hence, the presented algorithm makes large-scale proteinclustering possible, and at the same time, it makes HipMCLsignificantly faster than previous algorithms.
D. Evaluating the impact of number of layers and batches
Given the use of matrix squaring in HipMCL, we use thecomputation of A to show various feature of our algorithm.At first, we study the impact of l and b on different stepsof B ATCHED S UMMA b depends on the available memory and is determined by thesymbolic step, we vary b in this experiment to capture differentamount of memory available in different distributed systems.Fig. 4 shows the results with two matrices and two differentnumber of cores. We summarize key observations from Fig. 4. A-Broadcast time:
With a fixed l , A-Broadcast time in-creases almost linearly with b . This is expected as A isbroadcast b times in total (once in every batch). With a =1 l=4 l=16 l=64 Number of Layers T i m e ( s e c ) (a) Friendster (on 16384 cores of Cori-KNL) b = = = = =
16 b =
16 b =
16 b = =
32 b =
32 b =
32 b = =
64 b =
64 b =
64 b = A-BroadcastB-BroadcastLocal-MultiplyMerge-LayerAllToAll-FiberMerge-Fiber l=1 l=4 l=16 l=64
Number of Layers T i m e ( s e c ) (b) Friendster (on 65536 cores of Cori-KNL) b = = = = = = = = = = = = =
16 b =
16 b =
16 b = =
32 b =
32 b =
32 b = =
64 b =
64 b =
64 b = A-BroadcastB-BroadcastLocal-MultiplyMerge-LayerAllToAll-FiberMerge-Fiber l=1 l=4 l=16
Number of Layers T i m e ( s e c ) (c) Isolates-small (on 65536 cores of Cori-KNL) b = = = =
16 b =
16 b = =
32 b =
32 b = =
64 b =
64 b = A-BroadcastB-BroadcastLocal-MultiplyMerge-LayerAllToAll-FiberMerge-Fiber
Fig. 4:
Evaluating the impact of the number of layers (denoted by l ) and batches (denoted by b ) on different steps of B ATCHED S UMMA A - B r oad c a s t t i m e ( s e c ) Number of layers ( l )b=2 b=2 (ideal) b=16b=16 (ideal) b=64 b=64 (ideal) Fig. 5:
With a fixed b , A-Broadcast time decreases when l increases.Here, we multiply Friendster on 65,536 cores (1024 nodes) withdifferent batches (come from Fig. 4(b)). Solid lines denote observedA-Broadcast time and dashed lines denote expected A-Broadcast timethat decreases by a factor of 2 as we increase l by a factor of 4. fixed b , when l increases, A-Broadcast time decreases at arate proportional to √ l . Fig. 5 shows this observation fordifferent batches when multiplying Friendster on 65,536 cores.In Fig. 5, solid lines (observed A-Broadcast time) closelyfollow dashed line (expected based on a factor of √ l decrease).To explain this, consider a (cid:112) p/l × (cid:112) p/l × l process grid with p processes. If we increase the number of layers by a factor of4, the grid becomes (cid:112) p/l × (cid:112) p/l × l . Hence, the numberof processes in each process row within a layer is reduced by afactor of 2. Since A-Broadcast is performed within individualprocess rows of each layer, the communicator size for A-Broadcast would be decreased by a factor 2. Thus, the A-Broadcast time decreases at a rate proportional to √ l . B-Broadcast time:
With a fixed l , the B-Broadcast timedoes not change with b . This is expected because the totaldata volume associated with B-Broadcast remains the samewith batching. Hence, the bandwidth cost for B-Broadcastdoes not depend on b . However, the latency term increaseslinearly as we increase b . Since small latency-bound matricesmay not need batching, B-Broadcast time does not rely on b asobserved in Fig. 4. With a fixed b , B-Broadcast time decreases when we increase l . This observation is consistent with thecorresponding observation in A-Broadcast. Local-Multiply time:
With a fixed l , the Local-Multiplytime does not change significantly with b . As we increase b , per layer result matrix is computed in increasing numberof batches. It does not change the complexity or data accesspatterns of local multiplication. However, if b is significantlylarge, the repeated cost of thread creation, memory allocation,and cache accesses may increase the Local-Multiply time asobserved with b = 64 in Fig. 4. With a fixed b , the Local-Multiply time decreases with the increase of l . Suppose in aLocal-Multiply with l = 1 , we multiply an ˆ n × ˆ k matrix ˆ A witha ˆ k × ˆ n matrix ˆ B and generate an ˆ n × ˆ n matrix ˆ D . With l layers,each layer multiplies an ˆ n × ˆ kl matrix ˆ A with a ˆ kl × ˆ n matrix ˆ B and generate an ˆ n × ˆ n matrix ˆ D . Therefore, as we increase l , Local-Multiply generates a lower-rank version of the finaloutput. As a result, with the increase of l , the complexity oflocal multiplication decreases. This effect is more significantfor sparser matrices, where the result of local multiplicationbecomes hyper-sparse with many layers. For example with b = 64 , when we go from l = 1 to l = 16 , Local-Multiplydecreases by a factor of . × for Friendster (Fig. 4(b)) andby a factor of . × for Isolates-small (Fig. 4(c)). AllToAll-Fiber time:
With a fixed l , the AllToAll-Fiber timedoes not change significantly with b . For a fixed l , AllToAll-Fiber is performed among l processes on each fiber in b batches. If AllToAll-Fiber in each batch is bandwidth-bound,its runtime does not change with b as is observed in Fig. 4. With a fixed b , the AllToAll-Fiber time increases as we increase l . As explain before, increasing l creates lower rank outputin each layer, needing more data to be communicated acrosslayer for merging. Furthermore, increasing l also increases thecommunicator size of AllToAll-Fiber. Hence, the AllToAll-Fiber time increases with the increase of l . Merge-Fiber time:
With a fixed l , the Merge-Fiber timedoes not change significantly with b . As with AllToAll-Fiber,the Merge-Fiber time is not influenced by b . With a fixed b ,the Merge-Fiber time increases as we increase l . As explainin the previous paragraph, increasing l requires more data to9 riendster Number of Cores T i m e ( s e c ) Symbolic (5.6x)A-Broadcast (35.2x)B-Broadcast (1.5x)Local-Multiply (17.3x)Merge-Layer (17.9x)AllToAll-Fiber (11.6x)Merge-Fiber (20.2x) b=32 b=8 b=2
Isolates-small
Number of Cores T i m e ( s e c ) Symbolic (17.8x)A-Broadcast (45.4x)B-Broadcast (1.3x)Local-Multiply (13.1x)Merge-Layer (11.1x)AllToAll-Fiber (37.1x)Merge-Fiber (18.8x) b=8b=23 b=87
Fig. 6:
Strong scaling when squaring Friendster and Isolates-small. The scaling experiments were conducted from 64 nodes (4,096 cores) to1024 nodes (65,536 cores) on Cori-KNL. Number of layers is set to 16 and number of batches is shown on top of each bar. Total speedupsfrom one bar to the next bar are shown by arrowheads. Numbers in captions denote the overall speedup of each step in batched-SUMMA3Dwhen we go from 4,096 to 65,536 cores (a 16 × increase in core counts). Isolates
Number of Cores T i m e ( s e c ) Symbolic (12.6x)A-Broadcast (35.5x)B-Broadcast (2.5x)Local-Multiply (12.9x)Merge-Layer (11.6x)AllToAll-Fiber (6.7x)Merge-Fiber (19.4x) b=125 b=35 b=12
Metaclust50
Number of Cores T i m e ( s e c ) Symbolic (4.0x) A- Broadcast (5.1x) B- Broadcast (1.3x)
Local-Multiply (8.7x)
Merge-Layer ( .2x) AllToAll-Fiber (5.1x) Merge-Fiber (15.3x) b=7b=19b=46
Fig. 7:
Strong scaling when squaring two biggest matrices in our test suite. The scaling experiments were conducted from 256 nodes (16,384cores) to 4096 nodes (262,144 cores) on Cori-KNL. Number of layers is set to 16 and number of batches is shown on top of each bar. Totalspeedups from one bar to the next bar are shown by arrowheads. Numbers in captions denote the overall speedup of each step when we gofrom 16,384 to 262,144 cores (a 16 × increase in core counts). l =1 l =4 l =16Layers010203040506070 T i m e ( s e c ) CommunicationComputation
Fig. 8:
Comparing computation and communication time in thesymbolic step for the Isolates-small graph on 65,536 Cori KNL cores. be merged in the Merge-Fiber step.
Symbolic time:
The symbolic step used to determine b also uses a communication-avoiding algorithm as discussedin Algo. 3. Fig. 8 shows that the symbolic step becomessignificantly faster as we increase l . The communication timein the symbolic step becomes more than × faster when weuse 16 layers, which results in more than × speedup ofthe total symbolic runtime. Here, the communication-avoidingalgorithm has a higher impact than the actual multiplication TABLE VI:
The impact of l and b on different steps of B ATCHED -S UMMA ↔ : no change, ↑ : increase, ↓ : decrease. The Local-Multiply time may slightly increase with b . A- B- Local- Merge- Merge AllToAll- l b
Bcast Bcast Multiply Layer Fiber Fiber ↔ ↑ ↑ ↔ ↓ ↔ ↔ ↔↑ ↔ ↓ ↓ ↓ ↔ ↑ ↑ because Algo. 3 needs lighter local computation. Selecting the number of layers and batches.
Table VIsummarizes the overall impacts of l and b on different stepsof B ATCHED S UMMA b to the smallest possible value so that the result in a batchfits in the avilable memory. Generally, the A-Broadcast and B-Broadcast time continue to decrease as we increase the numberof layers as seen in Fig. 4. However, All2All-Fiber and Merge-Fiber time increase as we increase l . Therefore, selecting theoptimum number of layers is challenging as it depends on thetradeoff between broadcasts and fiber reduction/merge costs.In the rest of our experiments, we set l = 16 as it usuallygives the best result as can be seen in Fig. 4. E. Strong scaling results
Fig. 6 and Fig. 7 show the strong scaling of batched-SUMMA3D for four different matrices. We summarize thestrong scaling results with the following key findings.10 ,096 16,384 65,536 262,144Number of cores0.40.60.81.01.2 E ff i c i en cy FriendsterIsolates-smallIsolatesMetaclust
Fig. 9:
Parallel efficiency of B
ATCHED S UMMA B ATCHED S UMMA
3D scales remarkably well at extremescale.
As we increase core counts by × , our algorithmscales remarkably well for all four matrices: Friendster( × ), Isolates-small ( . × ), Isolates ( × ), and Meta-clust50 ( . × ). For bigger matrices B ATCHED S UMMA . × reduction for the Isolate-small matrix inFig. 6) because of the reduced number of batches needed athigher concurrency. The Symbolic step also scales well. OnlyB-Broadcast does not scale as well as other steps possiblybecause of high latency overhead in broadcasting small piecesof B . However, B-Broadcast constitutes about 1% of the totalruntime for all matrices in Fig. 6 and Fig. 7. Hence, the B-Broadcast time does not significantly impact the scalability ofB ATCHED S UMMA
Super linear speedup is attainable with more aggregatedmemory at high node counts.
As we increase the numberof nodes by × , the aggregated memory also increases bya factor of 4. As a result, b decreases by at least a factorof in all cases in Fig. 6 and Fig. 7. Fewer batches athigher concurrency, can super-linearly reduce the A-Broadcasttime, resulting in possible super-linear speedups. For example,for Isolates in Fig. 7, the total runtime decreases by . × when we go from 256 nodes (16,384 cores) to 1024 nodes(65,536 cores) because b decreases from 125 to 35. The super-linear speedup is especially observed at low concurrenciesbecause of the dramatic reduction of the number of batches.Even though b decreases as we increase the number of nodes,their relationship is not straightforward as they are related viathe intermediate per-layer expanded matrix D ( k ) that in turndepends on per-layer compression factor as well as the overallcompression factor. For example, in Fig. 7, when we go from65K cores to 262K cores, the number of batches is decreasedby less than 3x even though the memory increases by 4x. Parallel efficiency.
We compute the parallel efficiency byusing P P T ( P T ( P where T ( P ) denotes the runtime with P TABLE VII:
Overview of local computation improvements whenmultiplying Isolates-small on 65,536 Cori KNL cores.
Layers Local-Multiply Merge-Layer Merge-FiberPrevious Now Previous Now Previous Now processes, and P >P . Fig. 9 shows the parallel efficiencyof four large matrices. We observe that the efficiency remainsclose to 1 for three out of four large matrices. Here, super-linear speedups resulted in an efficiency greater than one. ForMetaclust, the efficiency drops to 0.4 on 262K cores. SinceMetaclust is sparser than the other big matrix Isolates, thecommunication cost for Metaclust starts to dominate quickly.For example, on 4096 nodes, the communication takes 48%and 36% of the total runtime for Metaclust and Isolates,respectively. As communication does not scale as well ascomputation, we observe a drop in parallel efficiency forsparser matrices like Metaclust. F. The impact of faster computational kernels
As mentioned in Sec. IV-D, we developed new hash-basedmerging algorithms that replaced a heap-based merging algo-rithms used in previous work [13]. As hash-based SpGEMMand merging algorithm can work with unsorted matrices, wekeep all intermediate results unsorted to reduce the com-putational complexity. These two modifications (hash-basedmerging and unsorted matrices) made local multiplication andmerging significantly faster as shown in Table VII.We observe that the unsorted-hash SpGEMM algorithmused in the Local-Multiply step of B
ATCHED S UMMA
3D canbe up to faster than the previous hybrid SpGEMMalgorithm when 16 layers are used. The performance benefitof unsorted local multiplication is more significant with morelayers. This is because the number of nonzeros in intermediatematrices D k increases as we increase l , which requires sortingwith a larger volume of data. When l =1 , the previous hybridalgorithm can run faster than the unsorted-hash algorithmbecause the hybrid algorithm can also use a heap-basedalgorithm that is often faster when the compression ratio of acolumn is small [25].Table VII shows a dramatic improvement when we use thenew unsorted-heap-merging algorithm instead of the previousheap-merging algorithm. On 16 layers, both Merge-Layer andMerge-Fiber time reduces by an order of magnitude. Con-sequently, the total computation of B ATCHED S UMMA
G. The scalability when computing AAᵀ

We compute AAᵀ only for the Metaclust20m and Rice-kmers matrices, considering their application in the sequence overlapper BELLA [7] for computing the shared k-mers between all pairs of sequences.

Fig. 10: Computing AAᵀ with the Metaclust20m matrix on Cori. [Time breakdown (Symbolic, A-Broadcast, B-Broadcast, Local-Multiply, Merge-Layer, AllToAll-Fiber, Merge-Fiber) on 4,096, 16,384, and 65,536 cores for (l, b) = (1, 6), (1, 2), (1, 1), (4, 8), (4, 3), (4, 1), (16, 12), (16, 4), and (16, 2).]

Fig. 11: Scalability of computing AAᵀ with the Rice-kmers matrix on Cori-KNL. [Same time breakdown across core counts and layer counts.]

Fig. 10 shows the performance of BatchedSUMMA3D when computing AAᵀ for Metaclust20m. On 64 nodes (4,096 cores), BatchedSUMMA3D with 16 layers is only slightly faster than with 1 layer. This is because the 16-layer instance needed 12 phases whereas the 1-layer instance needed just 6 phases. Hence, on 64 nodes, the benefit of communication avoidance is overshadowed by the need to broadcast A 2× more times. On 1,024 nodes (65,536 cores), BatchedSUMMA3D with 16 layers is about × faster than with 1 layer even though the 1-layer case does not need any batching. The latter case highlights the significant performance benefit of BatchedSUMMA3D.

For the Rice-kmers matrix, nnz(AAᵀ) ≈ nnz(A); hence, batching is often not required to compute AAᵀ with Rice-kmers. In this case, BatchedSUMMA3D computes AAᵀ with b = 1, while the communication-avoiding algorithm reduces the communication costs. Fig. 11 shows that the AAᵀ computation on the Rice-kmers matrix is dominated by communication when only one layer is used (recall that the Symbolic step also performs A-Broadcast and B-Broadcast). This is expected since Rice-kmers has just 2 nonzeros per column on average. As this SpGEMM is dominated by communication, using more layers reduces the runtime significantly, as expected. For example, on 65,536 cores (1,024 nodes) of Cori-KNL, AAᵀ can be computed × faster if we use 16 layers instead of 1 layer. This experiment demonstrates that BatchedSUMMA3D helps any SpGEMM run faster at extreme scale, with or without batching.
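As a side note on the batch counts quoted above (b = 12 with 16 layers versus b = 6 with 1 layer for Metaclust20m): replicating inputs across more layers leaves less aggregate memory for the output, so more column batches are needed. The helper below is a hypothetical back-of-envelope version of that calculation; the formula and every name in it are our illustrative assumptions, not the paper's actual symbolic algorithm.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    // Hypothetical batch-count estimate: each batch's share of the output,
    // including its pre-merge intermediate copies on an l-layer grid, must fit
    // in the aggregate memory left after the replicated inputs are stored.
    int64_t EstimateBatchCount(double estimated_nnz_output,  // from the symbolic step
                               double bytes_per_nonzero,     // row index + value
                               double intermediate_blowup,   // pre-merge copies; grows with l
                               double free_aggregate_bytes)  // total memory minus inputs
    {
        const double bytes_needed =
            estimated_nnz_output * bytes_per_nonzero * intermediate_blowup;
        return std::max<int64_t>(
            1, static_cast<int64_t>(std::ceil(bytes_needed / free_aggregate_bytes)));
    }

Under such a model, a larger intermediate blowup (more layers) or a smaller free-memory pool (heavier input replication) both push the batch count up, consistent with the 12-versus-6 trend in Fig. 10.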
Fig. 12: The impact of using hyperthreads when squaring Metaclust50 on 4,096 nodes of Cori-KNL. HT=Yes means 4 hardware threads per core are used; at this setting with 4,096 nodes, we use 256 threads per node, totaling 1,048,576 threads. HT=No means one thread per core is used, totaling 262,144 threads. As before, we use 16 threads per process. The number of layers used in the 3D grid is shown at the top of each bar. For this matrix, hyperthreading reduces the computation time but increases the communication time; overall, hyperthreading decreases the total runtime of this SpGEMM. [Bar data, in seconds: with l = 16, computation/communication are 231.2/147.1 (HT=No) versus 80.5/208.9 (HT=Yes); with l = 64, they are 214.8/139.7 (HT=No) versus 54.1/159.2 (HT=Yes).]
Fig. 13: Squaring Isolates-small on 256 nodes of Cori-KNL and Cori-Haswell. While the communication network remains the same, Cori-Haswell uses 32 fast cores per node. We use 6 cores per process on Haswell and 16 cores per process on KNL. With 16 layers and 23 batches, both experiments use the same process grid. Arrowheads show that computation and communication become 2.1× and 1.4× faster on Haswell, respectively. [Bar data, in seconds: computation/communication are 760.8/188.0 on KNL versus 359.7/132.2 on Haswell.]

H. Impact of hyperthreading at extreme scale
As with most modern manycore processors, each core of a KNL processor supports four hardware threads. In our prior experiments, we did not use hyperthreading because it can increase the communication time significantly due to the larger process grids. Here, we study the impact of hyperthreads when squaring Metaclust50 on 4,096 nodes of Cori-KNL. Fig. 12 shows that hyperthreading can help SpGEMM run faster by reducing the computation time significantly, even though the communication time may increase. This gives us unprecedented scalability to more than one million threads, which will help scale many graph and machine learning applications to upcoming exascale systems. The impact of hyperthreading is more significant when the total runtime is dominated by computation (for example, when l = 64 in Fig. 12). That means hyperthreading may not help when SpGEMM becomes communication-bound.

Fig. 14: Squaring Eukarya, the smallest matrix in our test suite, on Cori-KNL. SpGEMM needs two batches when l = 16 on 16 nodes; for all other cases, b = 1. SUMMA3D is not useful when the communication cost is insignificant on 16 nodes. However, SUMMA3D with 4 layers is still useful on 256 nodes even though batching is not needed for this small matrix. [Time breakdown (Symbolic, A-Broadcast, B-Broadcast, Local-Multiply, Merge-Layer, AllToAll-Fiber, Merge-Fiber) on 1,024 cores (16 nodes) and 16,384 cores (256 nodes) for several layer counts.]

I. Impact of using faster processors
If we employ faster processors while using the same communication network, communication could quickly become the bottleneck. Hence, faster in-node computations are expected to make BatchedSUMMA3D even more beneficial. We investigate this hypothesis by running BatchedSUMMA3D on Cori-Haswell (Fig. 13), where computation is 2.1× faster than on KNL. Even with the same communication network, communication on Cori-Haswell is 1.4× faster, possibly because of faster data processing around MPI calls. Since communication does not scale as well as computation on Haswell, communication takes an increased fraction of the total time in comparison to KNL. We expect even greater benefits from our algorithm on GPU-based clusters because of the availability of faster in-node computations.

J. Applicability with small matrices at low concurrency
As observed in the previous experiments, the BatchedSUMMA3D algorithm is extremely effective when multiplying large-scale matrices on thousands of nodes. Fig. 14 demonstrates that BatchedSUMMA3D can reduce the A-Broadcast time even on 16 nodes. However, the reduced communication time has little impact on the total runtime when communication does not dominate the runtime. On 256 nodes, BatchedSUMMA3D runs faster with 4 layers. However, using 16 layers on 256 nodes does not reduce the runtime any further, as AllToAll-Fiber starts to become the communication bottleneck. Hence, BatchedSUMMA3D is useful even on a few hundred nodes if l is set to a small value.

K. A direct comparison with 3D Sparse SUMMA
Fig. 15 compares BatchedSUMMA3D with the SUMMA3D algorithm presented in previous work [13]; we downloaded the SUMMA3D code from the CombBLAS library. Note that the previous SUMMA3D algorithm simply fails when the memory requirement exceeds the available memory. Here, we square the Eukarya matrix, for which batching is not needed, on 16 nodes and 256 nodes of Cori-KNL using 4 layers and 16 threads per process. The computation in BatchedSUMMA3D is more than 7× faster than in the previous work, while the communication is also slightly faster. We relied on the new hash-based multiplication and merging algorithms (see Sec. IV-A), which made the computation much faster.

Fig. 15: Comparing the BatchedSUMMA3D algorithm developed in this paper with the SUMMA3D algorithm presented in [13] when squaring the Eukarya matrix with 4 layers and without batching. [Bar data on a log-scale time axis, in seconds: on 16 nodes, computation is 210.0 (SUMMA3D) versus 27.6 (BatchedSUMMA3D) and communication 4.7 versus 2.9; on 256 nodes, computation is 26.2 versus 3.1 and communication 1.2 versus 0.7.]

VI. CONCLUSIONS
This paper presents a robust SpGEMM algorithm that can multiply matrices in batches even when the output matrix exceeds the available memory of large supercomputers. Additionally, the presented algorithm reduces communication so that SpGEMM can scale to the limits of modern supercomputers. These two techniques together eliminate two fundamental barriers, memory and communication, in large-scale sparse data analysis.

Our result is unexpectedly positive because communication-avoiding (CA) matrix multiplication algorithms trade increased memory for reduced communication. Conventional wisdom suggests that CA algorithms would be detrimental in this already memory-constrained regime. However, the 3D algorithm offsets the increased broadcast costs associated with the batching required in the memory-constrained setting, creating a previously unexplored synergy.

Our algorithm will boost many applications in genomics, scientific computing, and social network analysis where SpGEMM has emerged as a key computational kernel. For example, Yelick et al. [39] regarded SpGEMM as a parallelism motif of genomic data analysis, with applications in alignment, profiling, clustering, and assembly for both single genomes and metagenomes. With the size of genomic data growing exponentially, the extreme-scale SpGEMM presented in this paper will enable rapid scientific discoveries in these applications.

ACKNOWLEDGMENTS

This work is supported in part by the Advanced Scientific Computing Research (ASCR) Program of the Department of Energy Office of Science under contract No. DE-AC02-05CH11231, and in part by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. This work used resources of the National Energy Research Scientific Computing Center (NERSC), supported by the Office of Science of the DOE under Contract No. DE-AC02-05CH11231.

REFERENCES
[1] G. He, H. Feng, C. Li, and H. Chen, "Parallel SimRank computation on large graphs with iterative aggregation," in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010, pp. 543–552.
[2] F. Jamour, I. Abdelaziz, Y. Chen, and P. Kalnis, "Matrix algebra framework for portable, scalable and efficient query engines for RDF graphs," in Proceedings of the Fourteenth EuroSys Conference, 2019, pp. 1–15.
[3] A. Azad, A. Buluç, and J. Gilbert, "Parallel triangle counting and enumeration using matrix algebra," in IEEE International Parallel and Distributed Processing Symposium Workshop, 2015, pp. 804–811.
[4] E. Solomonik, M. Besta, F. Vella, and T. Hoefler, "Scaling betweenness centrality using communication-efficient sparse matrix multiplication," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2017, pp. 1–14.
[5] A. Buluç and J. R. Gilbert, "The Combinatorial BLAS: Design, implementation, and applications," The International Journal of High Performance Computing Applications, vol. 25, no. 4, pp. 496–509, 2011.
[6] C. Jain, H. Zhang, A. Dilthey, and S. Aluru, "Validating paired-end read alignments in sequence graphs," in WABI, ser. LIPIcs, vol. 143, 2019, pp. 17:1–17:13.
[7] G. Guidi, M. Ellis, D. Rokhsar, K. Yelick, and A. Buluç, "BELLA: Berkeley efficient long-read to long-read aligner and overlapper," bioRxiv, p. 464420, 2018.
[8] E. Qin, A. Samajdar, H. Kwon, V. Nadella, S. Srinivasan, D. Das, B. Kaul, and T. Krishna, "SIGMA: A sparse and irregular GEMM accelerator with flexible interconnects for DNN training," in HPCA. IEEE, 2020, pp. 58–70.
[9] U. Borštnik, J. VandeVondele, V. Weber, and J. Hutter, "Sparse matrix multiplication: The distributed block-compressed sparse row library," Parallel Computing, vol. 40, no. 5-6, pp. 47–58, 2014.
[10] M. McCourt, B. Smith, and H. Zhang, "Sparse matrix-matrix products executed through coloring," SIAM Journal on Matrix Analysis and Applications, vol. 36, no. 1, pp. 90–109, 2015.
[11] K. Akbudak, O. Selvitopi, and C. Aykanat, "Partitioning models for scaling parallel sparse matrix-matrix multiplication," ACM Transactions on Parallel Computing, vol. 4, no. 3, pp. 13:1–13:34, 2018.
[12] G. Ballard, A. Druinsky, N. Knight, and O. Schwartz, "Hypergraph partitioning for sparse matrix-matrix multiplication," ACM Transactions on Parallel Computing (TOPC), vol. 3, no. 3, pp. 1–34, 2016.
[13] A. Azad, G. Ballard, A. Buluç, J. Demmel, L. Grigori, O. Schwartz, S. Toledo, and S. Williams, "Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication," SIAM Journal on Scientific Computing, vol. 38, no. 6, pp. C624–C651, 2016.
[14] M. Besta, R. Kanakagiri, H. Mustafa, M. Karasikov, G. Rätsch, T. Hoefler, and E. Solomonik, "Communication-efficient Jaccard similarity for high-performance distributed genome comparisons," in IEEE IPDPS, 2020.
[15] O. Selvitopi, S. Ekanayake, G. Guidi, G. Pavlopoulos, A. Azad, and A. Buluç, "Distributed many-to-many protein sequence alignment using sparse matrices," in Proceedings of the 2020 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC'20, 2020.
[16] U. V. Catalyurek and C. Aykanat, "Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication," IEEE Transactions on Parallel and Distributed Systems, vol. 10, no. 7, pp. 673–693, 1999.
[17] G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar, "Multilevel hypergraph partitioning: applications in VLSI domain," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 7, no. 1, pp. 69–79, 1999.
[18] K. D. Devine, E. G. Boman, R. T. Heaphy, R. H. Bisseling, and U. V. Catalyurek, "Parallel hypergraph partitioning for scientific computing," in Proceedings of the 20th IEEE International Parallel & Distributed Processing Symposium. IEEE, 2006, 10 pp.
[19] A. Azad, A. Buluç, G. A. Pavlopoulos, N. C. Kyrpides, and C. A. Ouzounis, "HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks," Nucleic Acids Research, vol. 46, no. 6, pp. e33–e33, 2018.
[20] F. G. Gustavson, "Two fast algorithms for sparse matrices: Multiplication and permuted transposition," ACM Transactions on Mathematical Software, vol. 4, no. 3, pp. 250–269, Sep. 1978.
[21] J. Gilbert, C. Moler, and R. Schreiber, "Sparse matrices in MATLAB: Design and implementation," SIAM Journal on Matrix Analysis and Applications, vol. 13, no. 1, pp. 333–356, 1992.
[22] M. M. A. Patwary, N. R. Satish, N. Sundaram, J. Park, M. J. Anderson, S. G. Vadlamudi, D. Das, S. G. Pudov, V. O. Pirogov, and P. Dubey, "Parallel efficient sparse matrix-matrix multiplication on multicore platforms," in ISC. Springer, 2015, pp. 48–57.
[23] W. Liu and B. Vinter, "An efficient GPU general sparse matrix-matrix multiplication for irregular data," in IEEE IPDPS, 2014, pp. 370–381.
[24] M. Deveci, C. Trott, and S. Rajamanickam, "Performance-portable sparse matrix-matrix multiplication for many-core architectures," in IEEE IPDPS Workshops, 2017, pp. 693–702.
[25] Y. Nagasaka, S. Matsuoka, A. Azad, and A. Buluç, "Performance optimization, modeling and analysis of sparse matrix-matrix products on multi-core and many-core processors," Parallel Computing, vol. 90, 2019.
[26] F. Gremse, K. Küpper, and U. Naumann, "Memory-efficient sparse matrix-matrix multiplication by row merging on many-core architectures," SIAM Journal on Scientific Computing, vol. 40, no. 4, pp. C429–C449, 2018.
[27] Z. Gu, J. Moreira, D. Edelsohn, and A. Azad, "Bandwidth-optimized parallel algorithms for sparse matrix-matrix multiplication using propagation blocking," in SPAA, 2020, pp. 293–303.
[28] Y. Nagasaka, A. Nukada, and S. Matsuoka, "High-performance and memory-saving sparse general matrix-matrix multiplication for NVIDIA Pascal GPU," in ICPP, 2017, pp. 101–110.
[29] K. L. Nusbaum, "Optimizing Tpetra's sparse matrix-matrix multiplication routine," SAND2011-6036, Sandia National Laboratories, Tech. Rep., 2011.
[30] A. Buluç and J. R. Gilbert, "Parallel sparse matrix-matrix multiplication and indexing: Implementation and experiments," SIAM Journal on Scientific Computing, vol. 34, no. 4, pp. C170–C191, 2012.
[31] R. A. van de Geijn and J. Watts, "SUMMA: Scalable universal matrix multiplication algorithm," University of Texas at Austin, Austin, TX, USA, Tech. Rep., 1995.
[32] A. Lazzaro, J. VandeVondele, J. Hutter, and O. Schütt, "Increasing the efficiency of sparse matrix-matrix multiplication with a 2.5D algorithm and one-sided MPI," in Proceedings of the Platform for Advanced Scientific Computing Conference, 2017, pp. 1–9.
[33] L. E. Cannon, "A cellular computer to implement the Kalman filter algorithm," Ph.D. dissertation, Montana State University, Bozeman, MT, 1969.
[34] O. Selvitopi, M. T. Hussain, A. Azad, and A. Buluç, "Optimizing high performance Markov clustering for pre-exascale architectures," in IEEE IPDPS, 2020.
[35] I.-M. A. Chen, V. M. Markowitz, K. Chu, K. Palaniappan, E. Szeto, M. Pillay, A. Ratner, J. Huang, E. Andersen, M. Huntemann et al., "IMG/M: integrated genome and metagenome comparative data analysis system," Nucleic Acids Research, p. gkw929, 2016.
[36] T. A. Davis and Y. Hu, "The University of Florida sparse matrix collection," ACM Trans. Math. Softw., vol. 38, no. 1, pp. 1–25, Dec. 2011.
[37] G. Ballard, A. Buluç, J. Demmel, L. Grigori, B. Lipshitz, O. Schwartz, and S. Toledo, "Communication optimal parallel multiplication of sparse random matrices," in SPAA. ACM, 2013, pp. 222–231.
[38] S. van Dongen, "Graph clustering by flow simulation," Ph.D. dissertation, Utrecht University, 2000.
[39] K. Yelick, A. Buluç, M. Awan, A. Azad, B. Brock, R. Egan, S. Ekanayake, M. Ellis, E. Georganas, G. Guidi et al., "The parallelism motifs of genomic data analysis," Philosophical Transactions of the Royal Society A, vol. 378, no. 2166, p. 20190394, 2020.