Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication
Grzegorz Kwasniewski, Marko Kabić, Maciej Besta, Joost VandeVondele, Raffaele Solcà, Torsten Hoefler
RRed-Blue Pebbling Revisited:Near Optimal Parallel Matrix-Matrix Multiplication
Technical Report
Grzegorz Kwasniewski , Marko Kabi´c , Maciej Besta ,Joost VandeVondele , Raffaele Solc`a , Torsten Hoefler Department of Computer Science, ETH Zurich , ETH Zurich , Swiss National Supercomputing Centre (CSCS)
ABSTRACT
We propose COSMA: a parallel matrix-matrix multiplication algo-rithm that is near communication-optimal for all combinations ofmatrix dimensions, processor counts, and memory sizes. The keyidea behind COSMA is to derive an optimal (up to a factor of 0.03%for 10MB of fast memory) sequential schedule and then parallelizeit, preserving I/O optimality. To achieve this, we use the red-bluepebble game to precisely model MMM dependencies and derivea constructive and tight sequential and parallel I/O lower boundproofs. Compared to 2D or 3D algorithms, which fix processordecomposition upfront and then map it to the matrix dimensions, itreduces communication volume by up to √ Matrix-matrix multiplication (MMM) is one of the most fundamen-tal building blocks in scientific computing, used in linear algebraalgorithms [13, 15, 42], (Cholesky and LU decomposition [42], eigen-value factorization [13], triangular solvers [15]), machine learn-ing [6], graph processing [4, 8, 18, 36, 44, 52], computational chem-istry [21], and others. Thus, accelerating MMM routines is of greatsignificance for many domains. In this work, we focus on mini-mizing the amount of transferred data in MMM, both across thememory hierarchy ( vertical I/O ) and between processors ( horizontalI/O , aka “communication”) .The path to I/O optimality of MMM algorithms is at least 50years old. The first parallel MMM algorithm is by Cannon [10],which works for square matrices and square processor decomposi-tions. Subsequent works [24, 25] generalized the MMM algorithm torectangular matrices, different processor decompositions, and com-munication patterns. PUMMA [17] package generalized previousworks to transposed matrices and different data layouts. SUMMA al-gorithm [56] further extended it by optimizing the communication,introducing pipelining and communication–computation overlap.This is now a state-of-the-art so-called 2D algorithm (it decomposesprocessors in a 2D grid) used e.g., in ScaLAPACK library [14].Agarwal et al. [1] showed that in a presence of extra memory,one can do better and introduces a 3D processor decomposition.The 2.5D algorithm by Solomonik and Demmel [53] effectively This is an extended version of the SC’19 publication (DOI 10.1145/3295500.3356181)Changes in the original submission are listed in the Appendix We also focus only on “classical” MMM algorithms which perform n multiplicationsand additions. We do not analyze Strassen-like routines [54], as in practice they areusually slower [19]. LIMITEDMEMORY EXTRAMEMORY LIMITEDMEMORY EXTRAMEMORY020406080100 TALL MATRICES
SQUARE MATRICES maximumgeometric mean
STRONGSCALING
ScaLAPACK [14]CARMA [ ] COSMA (this work)CTF [ ] % o f p e a k p e r f o r m a n c e STRONGSCALING
22 50
Figure 1:
Percentage of peak flop/s across the experiments ranging from 109 to 18,432cores achieved by COSMA and the state-of-the-art libraries (Sec. 9).
Year W o r s t - c a s e I / O c o s t n a i v e P U M M A [ ] C a n n o n ' s [ ] S U M M A [ ] C A R M A [ ] C O S M A [ h e r e ]
1D 2D 2.5D & 3D " . D " [ ] optimized for non-squarematrices optimal decomp. inall scenarios lo w er bound use of excessmemory1969 1994 1997 2011 2013<1969 Figure 2:
Illustratory evolution of MMM algorithms reaching the I/O lower bound. interpolates between those two results, depending on the avail-able memory. However, Demmel et al. showed that algorithmsoptimized for square matrices often perform poorly when matrixdimensions vary significantly [22]. Such matrices are common inmany relevant areas, for example in machine learning [60, 61] orcomputational chemistry [45, 49]. They introduced CARMA [22], arecursive algorithm that achieves asymptotic lower bounds for allconfigurations of dimensions and memory sizes. This evolution forchosen steps is depicted symbolically in Figure 2.Unfortunately, we observe several limitations with state-of-theart algorithms. ScaLAPACK [14] (an implementation of SUMMA)supports only the 2D decomposition, which is communication–inefficient in the presence of extra memory. Also, it requires a userto fine-tune parameters such as block sizes or a processor grid size.CARMA supports only scenarios when the number of processorsis a power of two [22], a serious limitation, as the number of pro-cessors is usually determined by the available hardware resources.Cyclops Tensor Framework (CTF) [50] (an implementation of the2.5D decomposition) can utilize any number of processors, but itsdecompositions may be far from optimal (§ 9). We also emphasizethat asymptotic complexity is an insufficient measure of practical per-formance . We later (§ 6.2) identify that CARMA performs up to √ a r X i v : . [ c s . CC ] D ec echnical Report, 2019, G. Kwasniewski et al.
2D [56] 2.5D [53] recursive [22] COSMA (this work)
Input: User–specified grid Available memory Available memory, matrix dimensions Available memory, matrix dimensions
Step 1
Split m and n Split m , n , k Split recursively the largest dimension Find the optimal sequential schedule
Step 2
Map matrices to processor grid Map matrices to processor grid Map matrices to recursion tree Map sequential domain to matrices (cid:7)
Requires manual tuning (cid:7)
Asymptotically more comm. - Optimal for m = n (cid:7) Inefficient for m (cid:28) n or n (cid:28) m (cid:7) Inefficient for some p - Asymptotically optimal for all m , n , k , p (cid:7) Up to √ times higher comm. cost (cid:7) p must be a power of 2 - Optimal for all m , n , k - Optimal for all p -- Best time-to-solution
Table 1:
Intuitive comparison between the COSMA algorithm and the state-of-the-art 2D, 2.5D, and recursive decompositions. C = AB , A ∈ R m × k , B ∈ R k × n In this work, we present COSMA (Communication Optimal S-partition-based Matrix multiplication Algorithm): an algorithmthat takes a new approach to multiplying matrices and alleviatesthe issues above. COSMA is I/O optimal for all combinations ofparameters (up to the factor of √ S /(√ S + − ) , where S is the size ofthe fast memory ). The driving idea is to develop a general methodof deriving I/O optimal schedules by explicitly modeling data reusein the red-blue pebble game. We then parallelize the sequentialschedule, minimizing the I/O between processors, and derive anoptimal domain decomposition. This is in contrast with the otherdiscussed algorithms, which fix the processor grid upfront and thenmap it to a sequential schedule for each processor. We outline thealgorithm in § 3. To prove its optimality, we first provide a newconstructive proof of a sequential I/O lower bound (§ 5.2.7), thenwe derive the communication cost of parallelizing the sequentialschedule (§ 6.2), and finally we construct an I/O optimal parallelschedule (§ 6.3). The detailed communication analysis of COSMA,2D, 2.5D, and recursive decompositions is presented in Table 3. Ouralgorithm reduces the data movement volume by a factor of upto √ ≈ .
73x compared to the asymptotically optimal recursivedecomposition and up to max { m , n , k } times compared to the 2Dalgorithms, like Cannon’s [39] or SUMMA [56].Our implementation enables transparent integration with theScaLAPACK data format [16] and delivers near-optimal computa-tion throughput. We later (§ 7) show that the schedule naturally ex-presses communication–computation overlap, enabling even higherspeedups using Remote Direct Memory Access (RDMA). Finally,our I/O-optimal approach is generalizable to other linear algebrakernels. We provide the following contributions: • We propose COSMA: a distributed MMM algorithm that is nearly-optimal (up to the factor of √ S /(√ S + − ) ) for any combinationof input parameters (§ 3). • Based on the red-blue pebble game abstraction [34], we providea new method of deriving I/O lower bounds (Lemma 4), whichmay be used to generate optimal schedules (§ 4). • Using Lemma 4, we provide a new constructive proof of thesequential MMM I/O lower bound. The proof delivers constantfactors tight up to √ S /(√ S + − ) (§ 5). • We extend the sequential proof to parallel machines and provideI/O optimal parallel MMM schedule (§ 6.3). • We reduce memory footprint for communication buffers andguarantee minimal local data reshuffling by using a blocked datalayout (§ 7.6) and a static buffer pre-allocation (§ 7.5), providingcompatibility with the ScaLAPACK format. Throughout this paper we use the original notation from Hong and Kung to denotethe memory size S . In literature, it is also common to use the symbol M [2, 3, 33]. • We evaluate the performance of COSMA, ScaLAPACK, CARMA,and CTF on the CSCS Piz Daint supercomputer for an extensiveselection of problem dimensions, memory sizes, and numbers ofprocessors, showing significant I/O reduction and the speedupof up to 8.3 times over the second-fastest algorithm (§ 9).
We first describe our machine model (§ 2.1) and computation model(§ 2.2). We then define our optimization goal: the I/O cost (§ 2.3).
We model a parallel machine with p processors, each with localmemory of size S words. A processor can send and receive fromany other processor up to S words at a time. To perform anycomputation, all operands must reside in processor’ local memory.If shared memory is present, then it is assumed that it has infinitecapacity. A cost of transferring a word from the shared to the localmemory is equal to the cost of transfer between two local memories. We now briefly specify a model of general computation; we use thismodel to derive the theoretical I/O cost in both the sequential andparallel setting. An execution of an algorithm is modeled with the computational directed acyclic graph (CDAG) G = ( V , E ) [11, 28, 47].A vertex v ∈ V represents one elementary operation in the givencomputation. An edge ( u , v ) ∈ E indicates that an operation v depends on the result of u . A set of all immediate predecessors (orsuccessors) of a vertex are its parents (or children ). Two selectedsubsets I , O ⊂ V are inputs and outputs , that is, sets of vertices thathave no parents (or no children, respectively). Red-Blue Pebble Game
Hong and Kung’s red-blue pebble game[34] models an execution of an algorithm in a two-level memorystructure with a small-and-fast as well as large-and-slow memory.A red (or a blue) pebble placed on a vertex of a CDAG denotes thatthe result of the corresponding elementary computation is insidethe fast (or slow) memory. In the initial (or terminal) configuration,only inputs (or outputs) of the CDAG have blue pebbles. There canbe at most S red pebbles used at any given time. A complete CDAGcalculation is a sequence of moves that lead from the initial to theterminal configuration. One is allowed to: place a red pebble on anyvertex with a blue pebble (load), place a blue pebble on any vertexwith a red pebble (store), place a red pebble on a vertex whose par-ents all have red pebbles (compute), remove any pebble, red or blue,from any vertex (free memory). An I/O optimal complete CDAGcalculation corresponds to a sequence of moves (called pebbling of /O Optimal Parallel Matrix Multiplication Technical Report, 2019, a graph) which minimizes loads and stores. In the MMM context, itis an order in which all n multiplications are performed. Throughout this paper we focus on the input/output (I/O) cost ofan algorithm. The I/O cost Q is the total number of words trans-ferred during the execution of a schedule. On a sequential or sharedmemory machine equipped with small-and-fast and slow-and-bigmemories, these transfers are load and store operations from andto the slow memory (also called the vertical I/O ). For a distributedmachine with a limited memory per node, the transfers are commu-nication operations between the nodes (also called the horizontalI/O ). A schedule is I/O optimal if it minimizes the I/O cost among allschedules of a given CDAG. We also model a latency cost L , whichis a maximum number of messages sent by any processor. Here we briefly describe strategies of the existing MMM algorithms.Throughout the whole paper, we consider matrix multiplication C = AB , where A ∈ R m × k , B ∈ R k × n , C ∈ R m × n , where m , n , and k are matrix dimensions. Furthermore, we assume that the size ofeach matrix element is one word, and that S < min { mn , mk , nk } ,that is, none of the matrices fits into single processor’s fast memory.We compare our algorithm with the 2D, 2.5D, and recursive de-compositions (we select parameters for 2.5D to also cover 3D). Weassume a square processor grid [√ p , √ p , ] for the 2D variant, analo-gously to Cannon’s algorithm [10], and a cubic grid [ (cid:112) p / c , (cid:112) p / c , c ] for the 2.5D variant [53], where c is the amount of the “extra” mem-ory c = pS /( mk + nk ) . For the recursive decomposition, we assumethat in each recursion level we split the largest dimension m , n , or k in half, until the domain per processor fits into memory. Thedetailed complexity analysis of these decompositions is in Table 3.We note that ScaLAPACK or CTF can handle non-square decompo-sitions, however they create different problems, as discussed in § 1.Moreover, in § 9 we compare their performance with COSMA andmeasure significant speedup in all scenarios. COSMA decomposes processors by parallelizing the near optimalsequential schedule under constraints: (1) equal work distributionand (2) equal memory size per processor. Such a local sequentialschedule is independent of matrix dimensions. Thus, intuitively,instead of dividing a global domain among p processors (the top-down approach), we start from deriving a near I/O optimal sequential schedule. 
We then parallelize it, minimizing the I/O and latencycosts Q , L (the bottom-up approach); Figure 3 presents more details.COSMA is sketched in Algorithm 1. In Line 1 we derive asequential schedule, which consists of series of a × a outer products.(Figure 4 b). In Line 2, each processor is assigned to compute b of these products, forming a local domain D (Figure 4 c), that iseach D contains a × a × b vertices (multiplications to be performed- the derivation of a and b is presented in § 6.3). In Line 3, wefind a processor grid G that evenly distributes this domain by thematrix dimensions m , n , and k . If the dimensions are not divisibleby a or b , this function also evaluates new values of a opt and b opt by fitting the best matching decomposition, possibly not utilizingsome processors (§ 7.1, Figure 4 d-f). The maximal number of idleprocessors is a tunable parameter δ . In Line 5, we determine theinitial decomposition of matrices A , B , and C to the submatrices Step 2: Map computation tothe process grid domainperprocessdivide by p
A CB
Step 1: Find the process grid"Top-down" (e.g., 3D)
I/Obetweendomainsrequired! (a) 3D domain decomposition no reusedomain per processorproducts
A BC
Step 1: Sequential scheduleStep 2: Parallel schedule Minimize crossing seriesof outerproducts "Bottom-up" (COSMA)
No I/Obetweendomains (b) COSMA decompositionFigure 3:
Domain decomposition using p = processors. In scenario (a), a straight-forward 3D decomposition divides every dimension in p / = . In scenario (b),COSMA starts by finding a near optimal sequential schedule and then parallelizes itminimizing crossing data reuse V R , i (§ 5). The total communication volume is reducedby 17% compared to the former strategy. A l , B l , C l that are local for each processor. COSMA may handle anyinitial data layout, however, an optimal block-recursive one (§ 7.6)may be achieved in a preprocessing phase. In Line 6, we computethe size of the communication step, that is, how many of b opt outer products assigned to each processor are computed in a singleround, minimizing the latency (§ 6.3). In Line 7 we compute thenumber of sequential steps (Lines 8–11) in which every processor:(1) distributes and updates its local data A l and B l among the grid G (Line 9), and (2) multiplies A l and B l (Line 10). Finally, the partialresults C l are reduced over G (Line 12). I/O Complexity of COSMA
Lines 2–7 require no communi-cation (assuming that the parameters m , n , k , p , S are already dis-tributed). The loop in Lines 8-11 executes (cid:6) ab /( S − a ) (cid:7) times. InLine 9, each processor receives | A l | + | B l | elements. Sending thepartial results in Line 12 adds a communicated elements. In § 6.3we derive the optimal values for a and b , which yield a total ofmin (cid:110) S + · mnkp √ S , (cid:16) mnkP (cid:17) / (cid:111) elements communicated. Algorithm 1
COSMA
Input: matrices A ∈ R m × k , B ∈ R k × n ,number of processors: p , memory size: S , computation-I/O tradeoff ratio ρ Output: matrix C = AB ∈ R m × n a ← FindSeqSchedule ( S , m , n , k , p ) (cid:46) sequential I/O optimality (§ 5)2: b ← ParallelizeSched ( a , m , n , k , p ) (cid:46) parallel I/O optimality (§ 6)3: (G , a opt , b opt ) ← FitRanks ( m , n , k , a , b , p , δ ) for all p i ∈ { . . . p } do in parallel ( A l , B l , C l ) ← GetDataDecomp ( A , B , G , p i ) s ← (cid:22) S − a opt aopt (cid:23) (cid:46) latency-minimizing size of a step (6.3)7: t ← (cid:108) bopts (cid:109) (cid:46) number of steps8: for j ∈ { . . . t } do ( A l , B l ) ← Distr Data ( A l , B l , G , j , p i ) C l ← Multiply ( A l , B l , j ) (cid:46) compute locally11: end for C ← Reduce ( C l , G) (cid:46) reduce the partial results13: end for echnical Report, 2019, G. Kwasniewski et al. We now present a mathematical machinery for deriving I/O lowerbounds for general CDAGs. We extend the main lemma by Hongand Kung [34], which provides a method to find an I/O lower boundfor a given CDAG. That lemma, however, does not give a tightbound, as it overestimates a reuse set size (cf. Lemma 3). Our keyresult here, Lemma 4, allows us to derive a constructive proof ofa tighter I/O lower bound for a sequential execution of the MMMCDAG (§ 5).The driving idea of both Hong and Kung’s and our approach isto show that some properties of an optimal pebbling of a CDAG (aproblem which is PSPACE-complete [40]) can be translated to theproperties of a specific partition of the CDAG (a collection of subsets V i of the CDAG; these subsets form subcomputations, see § 2.2).One can use the properties of this partition to bound the numberof I/O operations of the corresponding pebbling. Hong and Kunguse a specific variant of this partition, denoted as S -partition [34].We first introduce our generalization of S -partition, called X -partition, that is the base of our analysis. We describe symbols usedin our analysis in Table 2. MMM m , n , k Matrix dimensions A , B Input matrices A ∈ R m × k and B ∈ R k × n C = AB Output matrix C ∈ R m × n p The number of processors g r a p h s G A directed acyclic graph G = ( V , E ) Pred ( v ) A set of immediate predecessors of a vertex v : Pred ( v ) = { u : ( u , v ) ∈ E } Succ ( v ) A set of immediate successors of a vertex v : Succ ( v ) = { u : ( v , u ) ∈ E } I / O c o m p l e x i t y S The number of red pebbles (size of the fast memory) V i An i -th subcomputation of an S -partition Dom ( V i ) , Min ( V i ) Dominator and minimum sets of subcomputation V i V R , i The reuse set : a set of vertices containing red pebbles(just before V i starts) and used by V i H ( S ) The smallest cardinality of a valid S -partition R ( S ) The maximum size of the reuse set Q The I/O cost of a schedule (a number of I/O operations) ρ i The computational intensity of V i ρ = max i { ρ i } The maximum computational intensity S c h e du l e s S = { V , . . ., V h } The sequential schedule (an ordered set of V i ) P = {S , . . ., S p } The parallel schedule (a set of sequential schedules S j ) D j = (cid:208) Vi ∈S j V i The local domain (a set of vertices in S j a , b Sizes of a local domain: |D j | = a b Table 2:
The most important symbols used in the paper. X -Partitions Before we define X -partitions, we first need todefine two sets, the dominator set and the minimum set . Given asubset V i ∈ V , define a dominator set Dom ( V i ) as a set of verticesin V , such that every path from any input of a CDAG to any vertexin V i must contain at least one vertex in Dom ( V i ) . Define also the minimum set Min ( V i ) as the set of all vertices in V i that do not haveany children in V i .Now, given a CDAG G = ( V , E ) , let V , V , . . . V h ∈ V be a seriesof subcomputations that (1) are pairwise disjoint ( ∀ i , j , i (cid:44) j V i ∩ V j = ∅) , (2) cover the whole CDAG ( (cid:208) i V i = V ), (3) have no cyclicdependencies between them, and (4) their dominator and minimumsets are at most of size X ( ∀ i (| Dom ( V i )| ≤ X ∧ | Min ( V i )| ≤ X ) ).These subcomputations V i correspond to some execution order (aschedule) of the CDAG, such that at step i , only vertices in V i arepebbled. We call this series an X -partition or a schedule of the CDAGand denote this schedule with S( X ) = { V , . . . , V h } . Here we need to briefly bring back the original lemma by Hong andKung, together with an intuition of its proof, as we use a similarmethod for our Lemma 3.
Intuition
The key notion in the existing bound is to use X = S -partitions for a given fast memory size S . For any sub-computation V i , if | Dom ( V i )| = S , then at most S of them couldcontain a red pebble before V i begins. Thus, at least S additional peb-bles need to be loaded from the memory. The similar argument goesfor Min ( V i ) . Therefore, knowing the lower bound on the numberof sets V i in a valid 2 S -partition , together with the observation thateach V i performs at least S I/O operations, we phrase the lemma byHong and Kung:Lemma 1 ( [34]). The minimal number Q of I/O operations forany valid execution of a CDAG of any I/O computation is boundedby Q ≥ S · ( H ( S ) − ) (1)Proof. Assume that we know the optimal complete calculation of the CDAG, where a calculation is a sequence of allowed movesin the red-blue pebble game [34]. Divide the complete calculationinto h consecutive subcomputations V , V , ..., V h , such that duringthe execution of V i , i < h , there are exactly S I/O operations, andin V h there are at most S operations. Now, for each V i , we definetwo subsets of V , V R , i and V B , i . V R , i contains vertices that have redpebbles placed on them just before V i begins. V B , i contains verticesthat have blue pebbles placed on them just before V i begins, andhave red pebbles placed on them during V i . Using these definitions,we have: (cid:182) V R , i ∪ V B , i = Dom ( V i ) , (cid:183) | V R , i | ≤ S , (cid:184) | V B , i | ≤ S , and (cid:185) | V R , i ∪ V B , i | ≤ | V R , i | + | V B , i | ≤ S . We define similar subsets W B , i and W R , i for the minimum set Min ( V i ) . W B , i contains allvertices in V i that have a blue pebble placed on them during V i , and W R , i contains all vertices in V i that have a red pebble at the end of V i . By the definition of V i , | W B , i | ≤ S , by the constraint on the redpebbles, we have | W R , i | ≤ S , and by te definition of the minimumset, Min ( V i ) ⊂ W R , i ∪ W B , i . Finally, by the definition of S -partition, V , V , ..., V h form a valid 2 S -partition of the CDAG. (cid:3) A more careful look at sets V R , i , V B , i , W R , i ,and W B , i allows us to refine the bound on the number of I/O oper-ations on a CDAG. By definition, V B , i is a set of vertices on whichwe place a red pebble using the load rule; We call V B , i a load set of V i . Furthermore, W B , i contains all the vertices on which we placea blue pebble during the pebbling of V i ; We call W B , i a store set of V i . However, we impose more strict V R , i and W R , i definitions: V R , i contains vertices that have red pebbles placed on them just before V i begins and – for each such vertex v ∈ V R , i – at least one child of v is pebbled during the pebbling of V i using the compute rule of thered-blue pebble game . We call V R , i a reuse set of V i . Similarly, W R , i contains vertices that have red pebbles placed on them after V i endsand were pebbled during V i and – for each such vertex v ∈ W R , i –at least one child of v is pebbled during the pebbling of V i + using thecompute rule of the red-blue pebble game . We call W R , i a cache set of V i . Therefore, if Q i is the number of I/O operations during thesubcomputation V i , then Q i ≥ | V B , i | + | W B , i | .We first observe that, given the optimal complete calculation,one can divide this calculation into subcomputations such that eachsubcomputation V i performs an arbitrary number of Y I/O oper-ations. 
We still have | V R , i | ≤ S , | W R , i | ≤ S , 0 ≤ | W B , i | (by thedefinition of the red-blue pebble game rules). Moreover, observe /O Optimal Parallel Matrix Multiplication Technical Report, 2019, that, because we perform exactly Y I/O operations in each subcom-putation, and all the vertices in V B , i by definition have to be loaded, | V B , i | ≤ Y . A similar argument gives 0 ≤ | W B , i | ≤ Y .Denote an upper bound on | V R , i | and | W R , i | as R ( S ) ( ∀ i max {| V R , i | , | W R , i |} ≤ R ( S ) ≤ S ). Further, denote a lower boundon | V B , i | and | W B , i | as T ( S ) ( ∀ i ≤ T ( S ) ≤ min {| V B , i | , | W B , i |} ). Wecan use R ( S ) and T ( S ) to tighten the bound on Q . We call R ( S ) a maximum reuse and T ( S ) a minimum I/O of a CDAG. We now use the above definitionsand observations to generalize the result of Hong and Kung [34] .Lemma 2.
An optimal complete calculation of a CDAG G = ( V , E ) ,which performs q I/O operations, is associated with an X -partition of G such that q ≥ ( X − R ( S ) + T ( S )) · ( h − ) for any value of X ≥ S , where h is the number of subcomputationsin the X -partition, R ( S ) is the maximum reuse set size, and T ( S ) isthe minimum I/O in the given X -partition. Proof. We use analogous reasoning as in the original lemma.We associate the optimal pebbling with h consecutive subcompu-tations V , . . . V h with the difference that each subcomputation V i performs Y = X − R ( S ) + T ( S ) I/O operations. Within those Y operations, we consider separately q i , s store and q i , l load oper-ations. For each V i we have q i , s + q i , l = Y , q i , s ≥ T ( S ) , and q i , l ≤ Y − T ( S ) = X − R ( S ) . ∀ i : | V B , i | ≤ q l , i ≤ Y − T ( S ) ∀ i : | V R , i | ≤ q s , i ≤ R ( S ) ≤ S Since V R , i ∪ V B , i = Dom ( V i ) : | Dom ( V i )| ≤ | V R , i | + | V B , i || Dom ( V i )| ≤ R ( S ) + Y − T ( R ) = X By an analogous construction for store operations, we showthat | Min ( V i )| ≤ X . To show that S ( X ) = { V . . . V h } meets theremaining properties of a valid X -partition S( X ) , we use the samereasoning as originally done [34].Therefore, a complete calculation performing q > ( X − R ( S ) + T ( S )) · ( h − ) I/O operations has an associated S( X ) , such that |S( X )| = h (if q = ( X − R ( S ) + T ( S ))·( h − ) , then |S( X )| = h − (cid:3) From the previous lemma, we obtain a tighter I/O lower bound.Lemma 3.
Denote H ( X ) as the minimum number of subcomputa-tions in any valid X -partition of a CDAG G = ( V , E ) , for any X ≥ S .The minimal number Q of I/O operations for any valid execution of aCDAG G = ( V , E ) is bounded by Q ≥ ( X − R ( S ) + T ( S )) · ( H ( X ) − ) (2) where R ( S ) is the maximum reuse set size and T ( S ) is the minimumI/O set size. Moreover, we have H ( X ) ≥ | V || V max | (3) where V max = arg max V i ∈S( X ) | V i | is the largest subset of verticesin the CDAG schedule S( X ) = { V , . . . , V h } . Proof. By definition, H ( X ) = min S( X ) |S( X )| ≤ h , so Q ≥( X − R ( S ) + T ( S )) · ( H ( X ) − ) immediately follows from Lemma 2.To prove Eq. (3), observe that V max by definition is the largestsubset in the optimal X -partition. As the subsets are disjoint, anyother subset covers fewer remaining vertices to be pebbled than V max . Because there are no cyclic dependencies between subsets,we can order them topologically as V , V , ... V H ( X ) . To ensure thatthe indices are correct, we also define V ≡ ∅ . Now, define W i tobe the set of vertices not included in any subset from 1 to i , that is W i = V − (cid:208) ij = V j . Clearly, W = V and W H ( X ) = ∅ . Then, we have ∀ i | V i | ≤ | V max || W i | = | W i − | − | V i | ≥ | W i − | − | V max | ≥ | V | − i | V max || W H ( X ) | = ≥ | V | − H ( X ) · | V max | that is, after H ( X ) steps, we have H ( X )| V max | ≥ | V | . (cid:3) From this lemma, we derive the following lemma that we use toprove a tight I/O lower bound for MMM (Theorem 1):Lemma 4.
Define the number of computations performed by V i forone loaded element as the computational intensity ρ i = | V i | X −| V R , i | + | W B , i | of the subcomputation V i . Denote ρ = max i ( ρ i ) ≤ | V max | X − R ( S ) + T ( S ) tobe the maximal computational intensity . Then, the number of I/Ooperations Q is bounded by Q ≥ | V |/ ρ . Proof. Note that the term H ( X ) − Y − R ( S ) + T ( S ) I/O operations, since | V H ( X ) | ≤ | V max | . However, because ρ is defined as maximal computational intensity, then performing | V H ( S ) | computations requires at least Q H ( S ) ≥ | V H ( S ) |/ ρ . The totalnumber of I/O operations therefore is: Q = H ( X ) (cid:213) i = Q i ≥ H ( X ) (cid:213) i = | V i | ρ = | V | ρ (cid:3) In this section, we present our main theoretical contribution: a con-structive proof of a tight I/O lower bound for classical matrix-matrixmultiplication. In § 6, we extend it to the parallel setup (Theorem 2).This result is tight (up to diminishing factor √ S /(√ S + − ) ), andtherefore may be seen as the last step in the long sequence of im-proved bounds. Hong and Kung [34] derived an asymptotic bound Ω (cid:16) n /√ S (cid:17) for the sequential case. Irony et al. [33] extended thelower bound result to a parallel machine with p processes, eachhaving a fast private memory of size S , proving the n √ p √ S − S lower bound on the communication volume per process. Recently,Smith and van de Gein [48] proved a tight sequential lower bound(up to an additive term) of 2 mnk /√ S − S . Our proof improves theadditive term and extends it to a parallel schedule.Theorem 1 (Seqential Matrix Multiplication I/O lowerbound). Any pebbling of MMM CDAG which multiplies matrices ofsizes m × k and k × n by performing mnk multiplications requires aminimum number of mnk √ S + mn I/O operations.
The proof of Theorem 1 requires Lemmas 5 and 6, which in turn,require several definitions. echnical Report, 2019, G. Kwasniewski et al. Intuition: Restricting the analysis to greedy schedules provides explicit infor-mation of a state of memory (sets V r , V R , r , W B , r ), and to a correspondingCDAG pebbling. Additional constraints (§ 5.2.7) guarantee feasibility of aderived schedule (and therefore, lower bound tightness). Theset of vertices of MMM CDAG G = ( V , E ) consists of three subsets V = A ∪ B ∪ C , which correspond to elements in matrices A , B , and mnk partial sums of C . Each vertex v is defined uniquely by a pair ( M , T ) , where M ∈ { a , b , c } determines to which subset A , B , C vertex v belongs to, and T ∈ N d is a vector of coordinates, d = M = a ∨ b and d = M = c . E.g., v = ( a , ( , )) ∈ A is a vertexassociated with element ( , ) in matrix A , and v = ( c , ( , , )) ∈ C is associated with 8th partial sum of element ( , ) of matrix C .For every t th partial update of element ( t , t ) in matrix C , andan associated point v = ( c , ( t , t , t )) ∈ C we define ϕ c ( v ) = ( t , t ) to be a projection of this point to matrix C , ϕ a ( v ) = ( a , ( t , t )) ∈ A is its projection to matrix A , and ϕ b ( v ) = ( b , ( t , t )) ∈ B is itsprojection to matrix B . Note that while ϕ a ( v ) , ϕ b ( v ) ∈ V , projection ϕ c ( v ) (cid:60) V has not any associated point in V . Instead, verticesassociated with all k partial updates of an element of C have thesame projection ϕ c ( v ) : ∀ v = ( c , ( p , p , p )) , w = ( c , ( q , q , q ))∈C : ( p = q ) ∧ ( p = q )⇐⇒ ϕ c ( p ) = ϕ c ( q ) (4)As a consequence, ϕ c (( c , ( t , t , t ))) = ϕ c (( c , ( t , t , t − ))) .A t th update of ( t , t ) element in matrix C of a classical MMMis formulated as C ( t , t , t ) = C ( t , t , t − ) + A ( t , t ) · B ( t , t ) .Therefore for each v = ( c , ( t , t , t )) ∈ C , t >
1, we have followingedges in the CDAG: ( ϕ a ( v ) , v ) , ( ϕ b ( v ) , v ) , ( c , ( t , t , t − )) , v ) ∈ E . α , β , γ , Γ . For a given subcomputation V r ⊆ C , we denoteits projection to matrix A as α r = ϕ a ( V r ) = { v : v = ϕ a ( c ) , c ∈ V r } ,its projection to matrix B as β r = ϕ b ( V r ) , and its projection tomatrix C as γ r = ϕ c ( V r ) . We further define Γ r ⊂ C as a set ofall vertices in C that have a child in V r . The sets α , β , Γ thereforecorrespond to the inputs of V r that belong to matrices A , B , andprevious partial results of C , respectively. These inputs form aminimal dominator set of V r : Dom ( V r ) = α r ∪ β r ∪ Γ r (5)Because Min ( V r ) ⊂ C , and each vertex v ∈ C has at most onechild w with ϕ c ( v ) = ϕ c ( w ) (Equation 4), the projection ϕ c ( Min ( V r )) is also equal to γ r : ϕ c ( V r ) = ϕ c ( Γ r ) = ϕ c ( Min ( V r )) = γ r (6) Red () . Define
Red ( r ) as the set of all vertices that havered pebbles just before subcomputation V r starts, with Red ( ) = ∅ .We further have Red ( P ) , P ⊂ V is the set of all vertices in somesubset P that have red pebbles and Red ( ϕ c ( P )) is a set of uniquepairs of first two coordinates of vertices in P that have red pebbles. We call a schedule S = { V , . . . , V h } greedy if during every subcomputation V r every vertex u that willhold a red pebble either has a child in V r or belongs to V r : ∀ r : Red ( r ) ⊂ α r − ∪ β r − ∪ V r − (7) Lemma 5.
Any greedy schedule that multiplies matrices of sizes m × k and k × n using mnk multiplications requires a minimumnumber of mnk √ S + mn I/O operations.
Proof. We start by creating an X -partition for an MMM CDAG(the values of Y and R ( S ) are parameters that we determine in thecourse of the proof). The proof is divided into the following 6 steps(Sections 5.2.1 to 5.2.6). Observethat each vertex in c = ( t , t , t ) ∈ C , t = . . . m , t = . . . n , t = . . . k − c = ( t , t , t + ) . Therefore, wecan assume that in an optimal schedule there are no two vertices ( t , t , t ) , ( t , t , t + f ) ∈ C , f ∈ N + that simultaneously hold ared vertex, as when the vertex ( t , t , t + ) is pebbled, a red pebblecan be immediately removed from ( t , t , t ) : | Red ( V r )| = | ϕ c ( Red ( V r ))| (8)On the other hand, for every vertex v , if all its predecessors Pred ( v ) have red pebbles, then vertex v may be immediately com-puted, freeing a red pebble from its predecessor w ∈ C , due to thefact, that v is the only child of w : ∀ v ∈ V ∀ r : Pred ( v ) ⊂ Dom ( V r ) ∪ V r = ⇒ ∃ t ≤ r v ∈ V t (9)Furthermore, after subcomputation V r , all vertices in V r thathave red pebbles are in its minimum set: Red ( r + ) ∩ V r = Red ( r + ) ∩ Min ( V r ) (10)Combining this result with the definition of a greedy schedule(Equation 7), we have Red ( r + ) ⊆ α r ∪ β r ∪ Min ( V r ) (11) By the definitionof X -partition, the computation is divided into H ( X ) subcomputa-tions V r ⊂ C , r ∈ { , . . . H ( X )} , such that Dom ( V r ) , Min ( V r ) ≤ X .Inserting Equations 5, 6, and 8, we have: | Dom ( V r )| = | α r | + | β r | + | γ r | ≤ X (12) | Min ( V r )| = | γ r | ≤ X On the other hand, the Loomis-Whitney inequality [41] boundsthe volume of V r : V r ≤ (cid:112) | α r || β r || γ r | (13)Consider sets of all different indices accessed by projections α r , β r , γ r : T = { t , , . . . , t , a } , | T | = aT = { t , , . . . , t , b } , | T | = bT = { t , , . . . , t , c } , | T | = cα r ⊆ {( t , t ) : t ∈ T , t ∈ T } (14) β r ⊆ {( t , t ) : t ∈ T , t ∈ T } (15) γ r ⊆ {( t , t ) : t ∈ T , t ∈ T } (16) V r ⊆ {( t , t , t ) : t ∈ T , t ∈ T , t ∈ T } (17) /O Optimal Parallel Matrix Multiplication Technical Report, 2019, For fixed sizes of the projections | α r | , | β r | , | γ r | , then the volume | V r | is maximized when left and right side of Inequalities 14 to 16are equal. Using 5 and 9 we have that 17 is an equality too, and: | α r | = ac , | β r | = bc , | γ r | = ab , | V r | = abc , (18)achieving the upper bound (Equation 13). V R , r and store set W B , r . Consider two subse-quent computations, V r and V r + . After V r , α r , β r , and V r mayhave red pebbles (Equation 7). On the other hand, for the domi-nator set of V r + we have | Dom ( V r + )| = | α r + | + | β r + | + | γ r + | .Then, the reuse set V R , i + is an intersection of those sets. Since α r ∩ β r = α r ∩ γ r = β r ∩ γ r = ∅ , we have (confront Equation 11): V R , r + ⊆ ( α r ∩ α r + ) ∪ ( β r ∩ β r + ) ∪ ( Min ( V r ) ∩ Γ r + )| V R , r + | ≤ | α r ∩ α r + | + | β r ∩ β r + | + | γ r ∩ γ r + | (19)Note that vertices in α r and β r are inputs of the computation:therefore, by the definition of the red-blue pebble game, they startin the slow memory (they already have blue pebbles). Min ( V r ) ,on the other hand, may have only red pebbles placed on them.Furthermore, by the definition of the S -partition, these verticeshave children that have not been pebbled yet. They either haveto be reused forming the reuse set V R , r + , or stored back, forming W B , r and requiring the placement of the blue pebbles. 
Because Min ( V r ) ∈ C and C ∩ A = C ∩ B = ∅ , we have: W B , r ⊆ Min ( V r ) \ Γ r + | W B , r | ≤ | γ r \ γ r + | (20) Consider two subcomputations V r and V r + . Denote shared parts of their projections as α s = α r ∩ α r + , β s = β r ∩ β r + , and γ s = γ r ∩ γ r + . Then, there are twopossibilities:(1) V r and V r + are not cubic, resulting in their volume smallerthan the upper bound | V r + | < (cid:112) | α r + || β r + || γ r + | (Equa-tion 13),(2) V r and V r + are cubic. If all overlapping projections arenot empty, then they generate an overlapping computa-tion, that is, there exist vertices v , such that ϕ ik ( v ) ∈ α s , ϕ kj ( v ) ∈ β s , ϕ ij ( v ) ∈ γ s . Because we consider greedyschedules, those vertices cannot belong to computation V r + (Equation 9). Therefore, again | V r + | < (cid:112) | α r + || β r + || γ r + | . Now consider sets of all differentindices accessed by those rectangular projections (Sec-tion 5.2.2, Inequalities 14 to 16). Fixing two non-emptyprojections we define all three sets T , T , T , which inturn, generate the third (non-empty) projection, result-ing again in overlapping computations which reduce thesize of | V r + | . Therefore, for cubic subcomputations, theirvolume is maximized | V r + | = (cid:112) | α r + || β r + || γ r + | if atmost one of the overlapping projections is non-empty (andtherefore, there is no overlapping computation). Computational inten-sity ρ r of a subcomputation V r is an upper bound on ratio betweenits size | V r | and the number of I/O operations required. The numberof I/O operations is minimized when ρ is maximized (Lemma 4): maximize ρ r = | V r | X − R ( S ) + T ( S ) ≥ | V r | Dom ( V r ) − | V R , r | + | W B , r | subject to: | Dom ( V r )| ≤ X | V R , r | ≤ S To maximize the computational intensity, for a fixed number ofI/O operations, the subcomputation size | V r | is maximized. Basedon Observation 5.2.4, it is maximized only if at most one of theoverlapping projections α r ∩ α r + , β r ∩ β r + , γ r ∩ γ r + is not empty.Inserting Equations 13, 12, 19, and 20, we have the following threeequations for the computational intensity, depending on the non-empty projection: α r ∩ α r + (cid:44) ∅ : ρ r = (cid:112) | α r || β r || γ r || α r | + | β r | + | γ r | − | α r ∩ α r + | + | γ r | (21) β r ∩ β r + (cid:44) ∅ : ρ r = (cid:112) | α r || β r || γ r || α r | + | β r | + | γ r | − | β r ∩ β r + | + | γ r | (22) γ r ∩ γ r + (cid:44) ∅ : ρ r = (cid:112) | α r || β r || γ r || α r | + | β r | + | γ r | − | γ r ∩ γ r + | + | γ r \ γ r + | (23) ρ r is maximized when γ r = γ r + , γ r ∩ γ r + (cid:44) ∅ , γ r \ γ r + = ∅ (Equation 23).Then, inserting Equations 18, we have:maximize ρ r = abcac + cb subject to: ab + ac + cb ≤ Xab ≤ Sa , b , c ∈ N + , where X is a free variable. Simple optimization technique usingLagrange multipliers yields the result: a = b = (cid:98)√ S (cid:99) , c = , (24) | α r | = | β r | = (cid:98)√ S (cid:99) , | γ r | = (cid:98)√ S (cid:99) , | V r | = (cid:98)√ S (cid:99) , X = (cid:98)√ S (cid:99) + (cid:98)√ S (cid:99) ρ r = (cid:98)√ S (cid:99) √ S ∈ N + . By the compu-tational intensity corollary (cf. page 4 in the main paper): Q ≥ | V | ρ = mnk √ S This is the I/O cost of putting a red pebble at least once on everyvertex in C . Note however, that we did not put any blue pebbleson the outputs yet (all vertices in C had only red pebbles placedon them during the execution). By the definition of the red-blue echnical Report, 2019, G. Kwasniewski et al. 
pebble game, we need to place blue pebbles on mn output vertices,corresponding to the output matrix C , resulting in additional mn I/O operations, yielding final bound Q ≥ mnk √ S + mn (cid:3) Restricting the analysisto greedy schedules provides explicit information of a state of mem-ory (sets V r , V R , r , W B , r ), and therefore, to a corresponding CDAGpebbling. In Section 5.2.5, it is proven that an optimal greedy sched-ule is composed of mnkR ( S ) outer product calculations, while loading (cid:112) R ( S ) elements of each of matrices A and B . While the lower boundis achieved for R ( S ) = S , such a schedule is infeasible, as at leastsome additional red pebbles, except the ones placed on the reuseset V R , r , have to be placed on 2 (cid:112) R ( S ) vertices of A and B .A direct way to obtain a feasible greedy schedule is to set X = S ,ensuring that the dominator set can fit into the memory. Then eachsubcomputation is an outer-product of column-vector of matrix A and row-vector of B , both holding √ S + − mnk √ S + − + mn I/O operations, a factor of √ S √ S + − more than a lower bound, which quickly approach 1 for large S .Listing 1 provides a pseudocode of this algorithm, which is a well-known rank-1 update formulation of MMM. However, we can dobetter.Let’s consider a generalized case of such subcomputation V r .Assume, that in each step:(1) a elements of A (forming α r ) are loaded,(2) b elements of B (forming β r ) are loaded,(3) ab partial results of C are kept in the fast memory (forming Γ r )(4) ab values of C are updated (forming V r ),(5) no store operations are performed.Each vertex in α r has b children in V r (each of which has also aparent in β r ). Similarly, each vertex in β r has a children in V r ,each of which has also a parent in α r . We first note, that ab < S (otherwise, we cannot do any computation while keeping all ab partial results in fast memory). Any red vertex placed on α r shouldnot be removed from it until all b children are pebbled, requiringred-pebbling of corresponding b vertices from β r . But, in turn, anyred pebble placed on a vertex in β r should not be removed until all a children are red pebbled.Therefore, either all a vertices in α r , or all b vertices in β r haveto be hold red pebbles at the same time, while at least one additionalred pebble is needed on β r (or α r ). W.l.o.g., assume we keep redpebbles on all vertices of α r . We then have:maximize ρ r = aba + b subject to: ab + a + ≤ Sa , b ∈ N + , (26) The solution to this problem is a opt = (cid:113) ( S − ) − S + S − < √ S (27) b opt = − S + (cid:113) ( S − ) − S − (cid:113) ( S − ) − S + < √ S (28) for i = (cid:108) maopt (cid:109) for j = (cid:24) nbopt (cid:25) for r = k for i = i · T : min (( i + ) · a opt , m ) for j = j · T : min (( j + ) · b opt , n ) C ( i , j ) = C ( i , j ) + A ( i , r ) · B ( r , j ) Listing 1: Pseudocode of near optimal sequential MMM
In § 5.2.6, it is shown that the I/O lower bound for any greedy sched-ule is Q ≥ mnk √ S + mn . Furthermore, Listing 1 provide a schedulethat attains this lower bound (up to a a opt b opt / S factor). To provethat this bound applies to any schedule, we need to show, that anynon-greedy cannot perform better (perform less I/O operations)than the greedy schedule lower bound.Lemma 6. Any non-greedy schedule computing classical matrixmultiplication performs at least mnk √ S + mn I/O operations.
Proof. Lemma 3 applies to any schedule and for any value of X . Clearly, for any general schedule we cannot directly model V R , i , V B , i , W R , i , and W B , i , and therefore T ( S ) and R ( S ) . However,it is always true that 0 ≤ T ( S ) and R ( S ) ≤ S . Also, the dominatorset formed in Equation 5 applies for any subcomputation, as wellas a bound on | V r | from Inequality 13. We can then rewrite thecomputational intensity maximization problem:maximize ρ r = | V r | X − R ( S ) + T ( S ) ≤ (cid:112) | α r || β r || γ r || α r | + | β r | + | γ r | − S subject to: S < | α r | + | β r | + | γ r | = X (29)This is maximized for | α r | = | β r | = | γ r | = X /
3, yielding ρ r = ( X / ) / X − S Because mnk / ρ r is a valid lower bound for any X > S (Lemma 4),we want to find such value X opt for which ρ r is minimal, yieldingthe highest (tightest) lower bound on Q :minimize ρ r = ( X / ) / X − S subject to: X ≥ S (30) /O Optimal Parallel Matrix Multiplication Technical Report, 2019, which, in turn, is minimized for X = S . This again shows, thatthe upper bound on maximum computational intensity for anyschedule is √ S /
2, which matches the bound for greedy schedules(Equation 25). (cid:3)
We note that Smith and van de Gein [48] in their paper alsobounded the number of computations (interpreted geometricallyas a subset in a 3D space) by its surface and obtained an analo-gous result for this surface (here, a dominator and minimum setsizes). However, using computational intensity lemma, our boundis tighter by 2 S ( + mn , counting storing the final result). Proof of Theorem 1:
Lemma 5 establishes that the I/O lower bound for any greedy sched-ule is Q = mnk /√ S + mn . Lemma 6 establishes that no otherschedule can perform less I/O operations. (cid:3) Corollary : The greedy schedule associated with an X = S -partitionperforms at most √ S √ S + − more I/O operations than a lower bound.The optimal greedy schedule is associated with an X = a opt b opt + a opt + b opt -partition and performs √ S ( a opt + b opt ) a opt b opt I/O operations.
We now derive the schedule of COSMA from the results from § 5.2.7.The key notion is the data reuse, that determines not only the se-quential execution, as discussed in § 4.2 , but also the parallelscheduling. Specifically, if the data reuse set spans across multiplelocal domains, then this set has to be communicated between thesedomains, increasing the I/O cost (Figure 3). We first introduce aformalism required to parallelize the sequential schedule (§ 6.1).In § 6.2, we generalize parallelization strategies used by the 2D,2.5D, and recursive decompositions, deriving their communicationcost and showing that none of them is optimal in the whole rangeof parameters. We finally derive the optimal decomposition (
Find-OptimalDomain function in Algorithm 1) by expressing it as anoptimization problem (§ 6.3), and analyzing its I/O and latency cost.The remaining steps in Algorithm 1:
FitRanks , GetDataDecomp , aswell as
DistrData and
Reduce are discussed in § 7.1, § 7.6, and § 7.2,respectively. For a distributed machine, we assume that all matricesfit into collective memories of all processors: pS ≥ mn + mk + nk .For a shared memory setting, we assume that all inputs start in acommon slow memory. We now describe how a parallel schedule is formed from a sequentialone. The sequential schedule S partitions the CDAG G = ( V , E ) into H ( S ) subcomputations V i . The parallel schedule P divides S among p processors: P = {D , . . . D p } , (cid:208) pj = D j = S . The set D j of all V k assigned to processor j forms a local domain of j (Fig. 4c).If two local domains D k and D l are dependent, that is, ∃ u , ∃ v : u ∈ D k ∧ v ∈ D l ∧ ( u , v ) ∈ E , then u has to be com-municated from processor k to l . The total number of vertices com-municated between all processors is the I/O cost Q of schedule P .We say that the parallel schedule P opt is communication–optimal if Q (P opt ) is minimal among all possible parallel schedules.The vertices of MMM CDAG may be arranged in an [ m × n × k ]
3D grid called an iteration space [59]. The orthonormal vectors i , j , k correspond to the loops in Lines 1-3 in Listing 1 (Figure 3a). We call Crossing dependencies! Crossing dependencies! (d) (e) (f)(a) MMM CDAG (b) Optimal i jk matrix A matrix B3D itera � on spacematrix C (c) Local domain output size: input size: elements elementselements Figure 4: (a) An MMM CDAG as a 3D grid (iteration space). Each vertex in it (exceptfor the vertices in the bottom layer) has three parents - blue (matrix A ), red (matrix B ),and yellow (partial result of matrix C ) and one yellow child (except for vertices in thetop layer). (b) A union of inputs of all vertices in V i form the dominator set Dom ( V i ) (two blue, two red and four dark yellow). Using approximation √ S + − ≈ √ S ,we have | Dom ( V i , opt )| = S . (c) A local domain D consists of b subcomputations V i , each of a dominator size | Dom ( V i )| = a + a . (d-f) Different parallelizationschemes of near optimal sequential MMM for p = > p = . a schedule P parallelized in dimension d if we “cut” the CDAG alongdimension d . More formally, each local domain D j , j = . . . p is agrid of size either [ m / p , n , k ] , [ m , n / p , k ] , or [ m , n , k / p ] . The sched-ule may also be parallelized in two dimensions ( d d ) or three di-mensions ( d d d ) with a local domain size [ m / p m , n / p n , k / p k ] forsome p m , p n , p k , such that p m p n p k = p . We call G = [ p m , p n , p k ] the processor grid of a schedule. E.g., Cannon’s algorithm is par-allelized in dimensions ij , with the processor grid [√ p , √ p , ] .COSMA, on the other hand, may use any of the possible paral-lelizations, depending on the problem parameters. The sequential schedule S (§ 5) consists of mnk / S elementary outerproduct calculations, arranged in √ S × √ S × k “blocks” (Figure 4).The number p = mn / S of dependency-free subcomputations V i (i.e., having no parents except for input vertices) in S determinesthe maximum degree of parallelism of P opt for which no reuse set V R , i crosses two local domains D j , D k . The optimal schedule is par-allelized in dimensions ij . There is no communication between thedomains (except for inputs and outputs), and all I/O operations areperformed inside each D j following the sequential schedule. Eachprocessor is assigned to p / p local domains D j of size (cid:2) √ S , √ S , k (cid:3) ,each of which requires 2 √ Sk + S I/O operations (Theorem 1), givinga total of Q = mnk /( p √ S ) + mn / p I/O operations per processor.When p > p , the size of local domains |D j | is smaller than √ S × √ S × k . Then, the schedule has to either be parallelized indimension k , or has to reduce the size of the domain in ij plane.The former option creates dependencies between the local domains,which results in additional communication (Figure 4e). The latterdoes not utilize the whole available memory, making the sequen-tial schedule not I/O optimal and decreasing the computationalintensity ρ (Figure 4d). We now analyze three possible paralleliza-tion strategies (Figure 4) which generalize 2D, 2.5D, and recursivedecomposition strategies; see Table 3 for details. Schedule P ij The schedule is parallelized in dimensions i and j .The processor grid is G ij = (cid:2) ma , na , (cid:3) , where a = (cid:113) mnp . Because alldependencies are parallel to dimension k , there are no dependenciesbetween D j except for the inputs and the outputs. Because a < √ S , echnical Report, 2019, G. Kwasniewski et al. 
the corresponding sequential schedule has a reduced computationalintensity ρ ij < √ S / Schedule P ijk The schedule is parallelized in all dimensions.The processor grid is G ijk = (cid:2) m √ S , n √ S , pSmn (cid:3) . The computational in-tensity ρ ijk = √ S / k dimensioncreates dependencies between local domains, requiring communi-cation and increasing the I/O cost. Schedule P cubic The schedule is parallelized in all dimensions.The grid is (cid:2) ma c , na c , ka c (cid:3) , where a c = min (cid:110)(cid:0) mnkp (cid:1) / , (cid:113) S (cid:111) . Be-cause a c < √ S , the corresponding computational intensity ρ cubic < √ S / k dimension createsdependencies between local domains, increasing communication. Schedules of the State-of-the-Art Decompositions If m = n ,the P ij scheme is reduced to the classical 2D decomposition (e.g.,Cannon’s algorithm [10]), and P ijk is reduced to the 2.5D decompo-sition [53]. CARMA [22] asymptotically reaches the P cubic scheme,guaranteeing that the longest dimension of a local cuboidal domainis at most two times larger than the smallest one. We presenta detailed complexity analysis comparison for all algorithms inTable 3. Observe that none of those schedules is optimal in the whole rangeof parameters. As discussed in § 5, in sequential scheduling, interme-diate results of C are not stored to the memory: they are consumed(reused) immediately by the next sequential step. Only the final re-sult of C in the local domain is sent. Therefore, the optimal parallelschedule P opt minimizes the communication, that is, sum of the in-puts’ sizes plus the output size, under the sequential I/O constrainton subcomputations ∀ V i ∈D j ∈P opt | Dom ( V i )| ≤ S ∧ | Min ( V i )| ≤ S .The local domain D j is a grid of size [ a × a × b ] , containing b outer products of vectors of length a . The optimization problemof finding P opt using the computational intensity (Lemma 4) isformulated as follows:maximize ρ = a bab + ab + a (31)subject to: a ≤ S (the I/O constraint) a b = mnkp (the load balance constraint) pS ≥ mn + mk + nk (matrices must fit into memory)The I/O constraint a ≤ S is binding (changes to equality) for p ≤ mnkS / . Therefore, the solution to this problem is: a = min (cid:110) √ S , (cid:16) mnkp (cid:17) / (cid:111) , b = max (cid:110) mnkpS , (cid:16) mnkp (cid:17) / (cid:111) (32)The I/O complexity of this schedule is: Q ≥ a bρ = min (cid:110) mnkp √ S + S , (cid:16) mnkp (cid:17) (cid:111) (33)This can be intuitively interpreted geometrically as follows: if weimagine the optimal local domain ”growing” with the decreasingnumber of processors, then it stays cubic as long as it is still ”smallenough” (its side is smaller than √ S ). After that point, its face in the ij plane stays constant √ S × √ S and it ”grows” only in the k dimension. This schedule effectively switches from P ijk to P cubic once there is enough memory ( S ≥ ( mnk / p ) / ).Theorem 2. The I/O complexity of a classic Matrix Multiplicationalgorithm executed on p processors, each of local memory size S ≥ mn + mk + nkp is Q ≥ min (cid:110) mnkp √ S + S , (cid:16) mnkp (cid:17) (cid:111) Proof. The theorem is a direct consequence of Lemma 3 andthe computational intensity (Lemma 4). The load balance constraintenforces a size of each local domain |D j | = mnk / p . The I/O costis then bounded by |D j |/ ρ . Schedule P opt maximizes ρ by theformulation of the optimization problem (Equation 31). (cid:3) I/O-Latency Trade-off
I/O-Latency Trade-off

As shown in this section, the local domain D of the near optimal schedule P_opt is a grid of size [a × a × b], where a and b are given by Equation (32). The corresponding sequential schedule is a sequence of b outer products of vectors of length a. Denote the size of the inputs communicated in each step by I_step = 2a. This corresponds to b steps of communication (the latency cost is L = b).

The number of steps (latency) is equal to the total communication volume of D divided by the volume per step, L = Q/I_step. To reduce the latency, one either has to decrease Q or increase I_step, under the memory constraint I_step + a² ≤ S (otherwise we cannot fit both the inputs and the outputs in the memory). Express I_step = 2ah, where h is the number of sequential subcomputations V_i merged into one communication step. We can express the I/O-latency trade-off:

minimize (Q, L)
subject to:
Q = 2ab + a², L = b/h
2ah + a² ≤ S (I/O constraint)
a²b = mnk/p (load balance constraint)

Solving this problem, we have Q = 2mnk/(pa) + a² and L = 2mnk/(pa(S − a²)), where a ≤ √S. Increasing a reduces the I/O cost Q and increases the latency cost L. For the minimal value of Q (Theorem 2), L = ⌈2ab/(S − a²)⌉, where a = min{√S, (mnk/p)^{1/3}} and b = max{mnk/(pS), (mnk/p)^{1/3}}. Based on our experiments, we observe that the I/O cost is vastly greater than the latency cost; therefore our schedule by default minimizes Q and uses extra memory (if any) to reduce L.

7 IMPLEMENTATION

We now present implementation optimizations that further increase the performance of COSMA on top of the speedup due to our near I/O optimal schedule. The algorithm is designed to facilitate the overlap of communication and computation (§ 7.3). For this, to leverage the RDMA mechanisms of current high-speed network interfaces, we use the MPI one-sided interface (§ 7.4). In addition, our implementation also offers an efficient alternative two-sided communication back end that uses MPI collectives. We also use a blocked data layout (§ 7.6), a grid-fitting technique (§ 7.1), and an optimized binary broadcast tree using static information about the communication pattern (§ 7.2), together with buffer swapping (§ 7.5). For the local matrix operations, we use BLAS routines for highest performance. Our code is publicly available at https://github.com/eth-cscs/COSMA.

Table 3: The comparison of complexities of 2D, 2.5D, recursive, and COSMA algorithms: for each decomposition we list the parallel schedule, the processor grid, the local domain size, and the I/O cost Q and latency cost L in the general case and in two special cases. The parallel schedules and local domains are: 2D [56]: P_ij for m = n, grid [√p × √p × 1], domain [m/√p × n/√p × k]; 2.5D [53]: P_ijk for m = n, grid [√(p/c) × √(p/c) × c] with c = pS/(mk + nk), domain [m/√(p/c) × n/√(p/c) × k/c]; recursive [22]: P_cubic, grid [2^{a1} × 2^{a2} × 2^{a3}] with a1 + a2 + a3 = log2(p), domain [m/2^{a1} × n/2^{a2} × k/2^{a3}]; COSMA (this paper): P_opt, grid [m/a × n/a × k/b] with a, b from Equation 32, domain [a × a × b], I/O cost min{2mnk/(p√S) + S, 3(mnk/p)^{2/3}}. The 3D decomposition is a special case of 2.5D and can be obtained by instantiating c = p^{1/3}. If the matrices are square and there is no extra memory available, 2D, 2.5D, and COSMA achieve the tight communication lower bound 2n²/√p, whereas CARMA performs √3 times more communication. If one dimension is much larger than the others and there is extra memory available, the 2D and 2.5D decompositions perform asymptotically more communication than COSMA, and CARMA performs 8% more. For simplicity, we assume that parameters are chosen such that all divisions have integer results.

Figure 5: Processor decomposition for square matrices and 65 processors. (a) To utilize all resources, the local domain is drastically stretched. (b) Dropping one processor results in a symmetric grid which increases the computation per processor by 1.5%, but reduces the communication by 36%.
Processor Grid Optimization

Throughout the paper, we assume that all operations required to assess the decomposition (divisions, roots) result in natural numbers. We note that in practice this is rarely the case, as the parameters usually emerge from external constraints, like the specification of the performed calculation or the hardware resources (§ 8). If the matrix dimensions are not divisible by the local domain sizes a, b (Equation 32), a straightforward option is to use the floor function and not utilize the "boundary" processors whose local domains do not fit entirely in the iteration space, which results in more computation per processor. The other option is to find the factors of p and then construct the processor grid by matching the largest factors with the largest matrix dimensions. However, if the factors of p do not match m, n, and k, this may result in a suboptimal decomposition. Our algorithm allows some processors to remain unutilized (increasing the computation volume per processor) in order to optimize the grid, which reduces the communication volume. Figure 5 illustrates the comparison between these options. We balance this communication–computation trade-off by "stretching" the local domain size derived in § 6.3 to fit the global domain, adjusting its width, height, and length. The range of this tuning (how many processors we drop to reduce communication) depends on the hardware specification of the machine (peak flop/s, memory and network bandwidth). For our experiments on the Piz Daint machine we chose the maximal number of unutilized cores to be 3%, accounting for up to 2.4 times speedup for square matrices using 2,198 cores (§ 9).

Enhanced Communication Pattern

As shown in Algorithm 1, COSMA by default executes in t = ⌈2ab/(S − a²)⌉ rounds. In each round, each processor receives s = ab/t elements of A and of B. Thus, the input matrices are broadcast along the i and j dimensions of the processor grid. After the last round, the partial results of C are reduced along the k dimension. The communication pattern is therefore similar to ScaLAPACK or CTF. To accelerate the collective communication, we implement our own binary broadcast tree, taking advantage of the known data layout, processor grid, and communication pattern. Knowing the initial data layout (§ 7.6) and the processor grid (§ 7.1), we craft the binary reduction tree in all three dimensions i, j, and k such that the distance in the grid between communicating processors is minimized. Our implementation outperforms the standard MPI broadcast from the Cray-MPICH 3.1 library by approximately 10%.

Communication–Computation Overlap

The sequential rounds of the algorithm, t_i = 1, ..., t, naturally express communication–computation overlap. Using double buffering, at each round t_i we issue an asynchronous communication (using either MPI_Get or MPI_Isend/MPI_Irecv, § 7.4) of the data required at round t_{i+1}, while locally processing the data received in the previous round. We note that, by the construction of the local domains D_j (§ 6.3), the extra memory required for double buffering is rarely an issue. If we are constrained by the available memory, then the space required to hold the partial results of C, which is a², is much larger than the size of the receive buffers s = (S − a²)/2. If not, then there is extra memory available for the buffering.
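A minimal sketch of this double-buffered round structure (the helper functions start_receive_async and multiply_accumulate are hypothetical placeholders, not COSMA's actual API):

```python
def cosma_rounds(t, start_receive_async, multiply_accumulate):
    """Illustrative double-buffered communication-computation overlap.
    't' (>= 1) is the number of rounds; the two hypothetical helpers are:
      start_receive_async(i) -> handle with .wait() returning round i's A/B pieces
      multiply_accumulate(data) -> accumulates one round into the local C block."""
    pending = start_receive_async(0)              # prefetch round 0
    for i in range(t):
        current = pending.wait()                  # data for round i is now local
        if i + 1 < t:
            pending = start_receive_async(i + 1)  # overlap: fetch round i+1
        multiply_accumulate(current)              # local BLAS call on round i
```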
Number of rounds:
The minimum number of rounds, and therefore the minimum latency, is t = ⌈2ab/(S − a²)⌉ (§ 6.3). However, to exploit more overlap, we can increase the number of rounds t' > t. In this way, in one round we communicate less data, s' = ab/t' < s, allowing the first round of computation to start earlier.

One-Sided Communication

To reduce the latency [27] we implemented communication using MPI RMA [32]. This interface utilizes the underlying features of the Remote Direct Memory Access (RDMA) mechanism, bypassing the OS on the sender side and providing zero-copy communication: sent data is not buffered in a temporary address; instead, it is written directly to its location. All communication windows are pre-allocated using MPI_Win_allocate with the size of the maximum message in the broadcast tree (§ 7.2). Communication in each step is performed using the MPI_Get and MPI_Accumulate routines. For compatibility reasons, as well as for the performance comparison, we also implemented a communication back end using MPI two-sided (the message passing abstraction).

Buffer Swapping

The binary broadcast tree pattern is a generalization of the recursive structure of CARMA. However, in each recursive step CARMA dynamically allocates new buffers of increasing size to match the message sizes, causing an additional runtime overhead. To alleviate this problem, we pre-allocate the initial, send, and receive buffers for each of the matrices A, B, and C with the maximum message size ab/t, where t = ⌈2ab/(S − a²)⌉ is the number of steps in COSMA (Algorithm 1). Then, in each level s of the communication tree, we move the pointer in the receive buffer to the appropriate offset.

Blocked Data Layout

COSMA's schedule induces the optimal initial data layout, since for each D_j it determines its dominator set Dom(D_j), that is, the elements accessed by processor j. Denote by A_{l,j} and B_{l,j} the subsets of elements of matrices A and B that initially reside in the local memory of processor j. The optimal data layout therefore requires that A_{l,j}, B_{l,j} ⊂ Dom(D_j). However, the schedule does not specify exactly which elements of Dom(D_j) should be in A_{l,j} and B_{l,j}. As a consequence of the communication pattern (§ 7.2), each element of A_{l,j} and B_{l,j} is communicated to g_m and g_n processors, respectively. To prevent data reshuffling, we therefore split each Dom(D_j) into g_m and g_n smaller blocks, enforcing that consecutive blocks are assigned to processors that communicate first. This is unlike the distributed CARMA implementation [22], which uses a cyclic distribution among processors in the recursion base case and requires local data reshuffling after each communication round. Another advantage of our blocked data layout is its full compatibility with the block-cyclic layout used in other linear algebra libraries.
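As an illustration of the blocked assignment (a hypothetical sketch under simplifying assumptions, not COSMA's actual layout code): given the ranks of a broadcast group ordered by when they communicate, a processor's part of the dominator set is split into that many contiguous blocks, so that consecutive blocks belong to ranks that communicate first.

```python
def blocked_assignment(num_elements, ranks_in_comm_order):
    """Split a processor's part of Dom(D_j) into len(ranks) contiguous blocks
    and assign block i to the i-th rank in communication order.
    Returns a list of (rank, start, end) half-open index ranges."""
    g = len(ranks_in_comm_order)
    block = (num_elements + g - 1) // g          # ceiling division
    assignment = []
    for i, rank in enumerate(ranks_in_comm_order):
        start, end = i * block, min((i + 1) * block, num_elements)
        assignment.append((rank, start, end))
    return assignment

# Example: 10 elements, 4 ranks ordered by the broadcast tree (hypothetical order)
print(blocked_assignment(10, [3, 1, 0, 2]))
```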
We evaluate COSMA's communication volume and performance against other state-of-the-art implementations for various combinations of matrix dimensions and memory requirements. These scenarios include synthetic square matrices, in which all algorithms achieve their peak performance, as well as "flat" (two large dimensions) and real-world "tall-and-skinny" (one large dimension) cases with uneven numbers of processors.

Comparison Targets
As comparison targets, we use the widely used ScaLAPACK library as provided by Intel MKL (version 18.0.2.199, the latest available on Piz Daint when the benchmarks were performed in August 2018; no improvements of P[S,D,C,Z]GEMM have been reported in the MKL release notes since then), the Cyclops Tensor Framework (https://github.com/cyclops-community/ctf, commit ID 244561c of May 15, 2018), and the original CARMA implementation (https://github.com/lipshitz/CAPS, commit ID 7589212 of July 19, 2013). We manually tune the ScaLAPACK parameters to achieve its maximum performance.
Our experiments showed that on Piz Daint ScaLAPACK achieves the highest performance when run with 4 MPI ranks per compute node and 9 cores per rank (it is the only compared library that uses multiple cores per rank; its actual number of MPI ranks is therefore p/9). Therefore, for each matrix-size/node-count configuration, we recompute the optimal rank decomposition for ScaLAPACK. The remaining implementations use their default decomposition strategy and perform best utilizing 36 ranks per node, 1 core per rank.
Infrastructure and Implementation Details
All implementations were compiled using the GCC 6.2.0 compiler. We use the Cray-MPICH 3.1 implementation of MPI. The parallelism within a ScaLAPACK rank is handled internally by MKL BLAS (with GNU OpenMP threading), version 2017.4.196. To profile the MPI communication volume, we use mpiP version 3.4.1 [57].

Experimental Setup and Architectures
We run our experiments on the CPU partition of CSCS Piz Daint, which has 1,813 XC40 nodes with dual-socket Intel Xeon E5-2695 v4 processors (2 · 18 cores, 3.30 GHz, 45 MiB shared L3 cache, 64 GiB DDR3 RAM), interconnected by the Cray Aries network with a dragonfly topology. We set p to the number of available cores and S to the main memory size per core (§ 2.1). To additionally capture the cache size per core, the model can be extended to a three-level memory hierarchy. However, cache-size tiling is already handled internally by MKL.
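For instance, under the node specification above (36 cores and 64 GiB per node), the model parameters p and S can be derived as in the following sketch (the node count in the example is hypothetical, and S is assumed to count 8-byte matrix elements):

```python
# Illustrative derivation of the model parameters p and S.
nodes = 512                                      # hypothetical allocation
cores_per_node = 36
mem_per_node_gib = 64

p = nodes * cores_per_node                       # total number of cores
bytes_per_core = mem_per_node_gib * 2**30 // cores_per_node
S = bytes_per_core // 8                          # memory per core in 8-byte matrix elements
print(p, S)
```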
Matrix Dimensions and Number of Cores

We use square (m = n = k), "largeK" (m = n ≪ k), "largeM" (m ≫ n = k), and "flat" (m = n ≫ k) matrices. The matrix dimensions and numbers of cores are (1) powers of two, (2) determined by real-life simulations or by the hardware architecture (available nodes on a computer), or (3) chosen adversarially (a core count that does not factor into a convenient grid). In the real-life scenario, the matrix sizes are determined by the number of simulated molecules w; in the strong scaling scenario, we use w = 128 as in the original paper.

Selection of Benchmarks
We perform both strong scaling and memory scaling experiments. The memory scaling scenario fixes the input size relative to the available memory (I/(pS), with I = mn + mk + nk), as opposed to fixing the work per core (mnk/p ≠ const). We evaluate two cases: (1) "limited memory", I/(pS) = const, and (2) "extra memory", I·p^{1/3}/(pS) = const. To provide more information about the impact of the communication optimizations on the total runtime, for each of the matrix shapes we also separately measure the time spent by COSMA in different parts of the code. For each matrix shape we present the two extreme cases of strong scaling: the smallest number of processors (most compute-intensive) and the largest (most communication-intensive). To provide additional insight, we perform these measurements both with and without communication–computation overlap.
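A sketch of how the two memory-scaling regimes can be instantiated for square matrices; the constant 3n² for I and the exact memory fractions are illustrative assumptions, not the exact values used in the benchmarks:

```python
import math

def square_benchmark_size(p, S, extra_memory=False):
    """Pick n for a square (m = n = k) memory-scaling benchmark so that the
    input size I = 3*n*n keeps a fixed relation to the total memory p*S.
    In the 'extra memory' regime, p**(1/3) redundant copies of the input fit."""
    copies = p ** (1.0 / 3.0) if extra_memory else 1.0
    n = math.sqrt(p * S / (3.0 * copies))   # from 3*n*n*copies ~= p*S
    return int(n)

print(square_benchmark_size(p=4096, S=2**21))                     # limited memory
print(square_benchmark_size(p=4096, S=2**21, extra_memory=True))  # extra memory
```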
Programming Models

We use either the RMA or the Message Passing model. CTF also uses both models, whereas CARMA and ScaLAPACK use MPI two-sided (Message Passing).
Experimentation Methodology
For each combination of parameters, we perform 5 runs, each with a different node allocation. As all the algorithms use BLAS routines for the local matrix computations, for each run we execute the kernels three times and take the minimum to compensate for the BLAS setup overhead. We report the median and 95% confidence intervals of the runtimes.
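For reference, a small sketch of this measurement protocol; the bootstrap confidence-interval method shown here is one common choice and an assumption, not necessarily the exact procedure used:

```python
import numpy as np

def summarize(run_times_ms, n_boot=10000, seed=0):
    """Median and a bootstrap 95% confidence interval over the runs,
    where each run's time is already the minimum of 3 kernel executions."""
    rng = np.random.default_rng(seed)
    runs = np.asarray(run_times_ms, dtype=float)
    medians = np.median(rng.choice(runs, size=(n_boot, runs.size), replace=True), axis=1)
    return np.median(runs), (np.percentile(medians, 2.5), np.percentile(medians, 97.5))

print(summarize([142.0, 145.3, 141.8, 150.1, 143.9]))  # hypothetical runtimes in ms
```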
We now present the experimental results comparing COSMA with the existing algorithms. For both strong and memory scaling, we measure the total communication volume and the runtime on both square and tall matrices. Our experiments show that COSMA always communicates the least data and is the fastest in all scenarios.
Summary and Overall Speedups
As discussed in § 8, we evaluate three benchmarks: strong scaling, "limited memory" (no redundant copies of the input are possible), and "extra memory" (p^{1/3} extra copies of the input can fit into the combined memory of all cores). Each of them we test for square, "largeK", "largeM", and "flat" matrices, giving twelve cases in total. In Table 4, we present the arithmetic mean of the total communication volume per MPI rank across all core counts. We also report the summary of the minimum, geometric mean, and maximum speedups vs the second best-performing algorithm.

Communication Volume
As analyzed in § 5 and § 6, COSMA reaches the I/O lower bound (up to the factor of √S/(√(S+1) − 1)). Moreover, the optimizations presented in § 7 secure further improvements compared to the other state-of-the-art algorithms. In all cases, COSMA performs the least communication. The total communication volume for the square and "largeK" scenarios is shown in Figures 6 and 7.

Square Matrices
Figure 8 presents the % of achieved peak hardware performance for square matrices in all three scenarios. As COSMA is based on the near optimal schedule, it achieves the highest performance in all cases. Moreover, its performance pattern is the most stable: when the number of cores is not a power of two, the performance does not vary much, in contrast to the remaining three implementations. We note that the matrix dimensions in the strong scaling scenario are very small for a distributed setting. Yet even in this case COSMA maintains relatively high performance for large numbers of cores: using 4k cores it achieves 35% of peak performance, compared to less than 5% for CTF and ScaLAPACK, showing excellent strong scaling characteristics.

Tall and Skinny Matrices
Figure 10 presents the results for "largeK" matrices; due to space constraints, the symmetric "largeM" case is omitted here. For strong scaling, the minimum number of cores is 2048 (otherwise, the matrices do not fit into the combined memory).

"Flat" Matrices

Matrix dimensions for strong scaling are set to m = n ≫ k, scaling accordingly for the "limited" and "extra" memory cases. Such kernels take most of the execution time in, e.g., matrix factorization algorithms, where updating the Schur complement is performed as a rank-k gemm operation [31].

Unfavorable Number of Processors
Due to the processor grid optimization (§ 7.1), the performance is stable and does not suffer from unfavorable combinations of parameters. E.g., the runtime of COSMA for square matrices on a core count that factors into a near-optimal grid is 142 ms, and it is barely affected by adding an extra core. For the compared implementations, the same runs take 600 ms and 1613 ms, respectively, the latter due to a non-optimal processor decomposition.

Communication–Computation Breakdown
In Figure 12 we present the breakdown of COSMA's total runtime into communication and computation routines. Combined with the comparison of communication volumes (Figures 6 and 7, Table 4), it shows the importance of our I/O optimizations in the distributed setting, even for traditionally compute-bound MMM. E.g., for square or "flat" matrices and 16k cores, COSMA communicates more than two times less than the second-best algorithm (CARMA). Assuming constant time per MB, COSMA would be 40% slower if it communicated that much, and would then be slower than CARMA by 30%. For "largeK", the situation is even more extreme, with COSMA suffering a 2.3 times slowdown if it communicated as much as the second-best algorithm, CTF, which communicates 10 times more.
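These counterfactual slowdown estimates follow from a simple constant time-per-MB model; a sketch with illustrative numbers (the communication fraction and volume ratio below are assumptions for the example, not measured values):

```python
def slowdown_if_communicating_more(comm_fraction, volume_ratio):
    """Constant time-per-MB model: if communication takes comm_fraction of the
    runtime and the communicated volume grows by volume_ratio, the total
    runtime grows by this factor."""
    return 1.0 + comm_fraction * (volume_ratio - 1.0)

# e.g., communication at 40% of the runtime and a 2x larger volume -> 1.4x slower
print(slowdown_if_communicating_more(0.4, 2.0))
```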
Detailed Statistical Analysis
Figure 13 provides the distribution of the achieved peak performance across all numbers of cores for all six scenarios. It can be seen that, for example, in the strong scaling scenario with square matrices, COSMA is comparable to the other implementations (especially CARMA). However, for tall-and-skinny matrices with limited memory available, COSMA's lowest achieved performance is higher than the best performance of CTF and ScaLAPACK.
10 RELATED WORK
Works on data movement minimization may be divided into two categories: applicable across the memory hierarchy (vertical, also called I/O minimization), or between parallel processors (horizontal, also called communication minimization). Even though they are "two sides of the same coin", in the literature they are often treated as separate topics. In our work we combine them: we analyze the trade-offs between the communication optimal (distributed memory) and the I/O optimal (shared memory) schedule.

Figure 6: Total communication volume per core carried out by COSMA, CTF, ScaLAPACK, and CARMA for square matrices, as measured by the mpiP profiler. (a) Strong scaling. (b) Limited memory. (c) Extra memory.

Figure 7: Total communication volume per core carried out by COSMA, CTF, ScaLAPACK, and CARMA for "largeK" matrices, as measured by the mpiP profiler. (a) Strong scaling. (b) Limited memory. (c) Extra memory.

Figure 8: Achieved % of peak performance by COSMA, CTF, ScaLAPACK, and CARMA for square matrices, strong and weak scaling. We show the median and 95% confidence intervals. (a) Strong scaling. (b) Limited memory. (c) Extra memory.
shape      benchmark        ScaLAPACK   CTF    CARMA   COSMA   speedup: min   mean   max
square     strong scaling        203     222     195     107
"largeK"   strong scaling       2636    2278     659     545             1.24   2.00   6.55
"largeK"   limited memory        368     541     128      88             1.30   2.61   8.26
"largeK"   extra memory          133     152      48      35             1.31   2.55   6.70
"largeM"   strong scaling       3507    2024     541     410             1.31   2.22   3.22
"largeM"   limited memory        989     672     399     194             1.42   1.70   2.27
"largeM"   extra memory          122      77      77      29             1.35   1.76   2.80
"flat"     strong scaling        134      68      10       7             1.21   4.02
"flat"     limited memory         47     101      26       8             1.31   2.07   3.41
"flat"     extra memory           15      15      10       3             1.50   2.29   3.59
overall                                                                   1.07   2.17  12.81

Table 4: Average communication volume per MPI rank (in MB) and measured speedup of COSMA vs the second-best algorithm across all core counts for each of the scenarios.
Figure 9: Total runtime of COSMA, CTF, ScaLAPACK, and CARMA for square matrices, strong and weak scaling. We show the median and 95% confidence intervals. (a) Strong scaling. (b) Limited memory. (c) Extra memory.

Figure 10: Achieved % of peak performance by COSMA, CTF, ScaLAPACK, and CARMA for "largeK" matrices. We show the median and 95% confidence intervals. (a) Strong scaling. (b) Limited memory. (c) Extra memory.

Figure 11: Total runtime of COSMA, CTF, ScaLAPACK, and CARMA for "largeK" matrices, strong and weak scaling. We show the median and 95% confidence intervals. (a) Strong scaling. (b) Limited memory. (c) Extra memory.

Figure 12: Time distribution of COSMA communication and computation kernels (sending the inputs A and B, sending the output C, computation, other) for strong scaling, executed on the smallest and the largest core counts for each of the matrix shapes (square, "largeK", "largeM", "flat"). Left bar: no communication–computation overlap. Right bar: overlap enabled.

Figure 13: Distribution of the achieved % of peak performance of the algorithms across all numbers of cores for "flat" and square matrices.

Figure 14: Distribution of the achieved % of peak performance of the algorithms across all numbers of cores for tall-and-skinny matrices.

Hong and Kung [34] analyzed the I/O complexity of general CDAGs in their red-blue pebble game, on which we base our work. As a special case, they derived an asymptotic bound Ω(n³/√S) for MMM. Elango et al. [23] extended this work to the red-blue-white game and Liu and Terman [40] proved that it is also P-SPACE complete. Irony et al. [33] extended the MMM lower bound result to a parallel machine with p processors, each having a fast private memory of size S, proving the n³/(2√2·p√S) − S lower bound on the communication volume per processor. Chan [12] studied different variants of pebble games in the context of memory space and parallel time. Aggarwal and Vitter [2] introduced a two-memory machine that models blocked access and latency in an external storage. Arge et al. [3] extended this model to a parallel machine. Solomonik et al. [51] combined communication, synchronization, and computation in their general cost model and applied it to several linear algebra algorithms. Smith and van de Geijn [48] derived a sequential lower bound 2mnk/√S − 2S for MMM. They showed that the leading factor 2mnk/√S is tight. We improve this result by 1) improving the additive factor of 2S, but more importantly 2) generalizing the bound to a parallel machine. Our work uses a simplified model, not taking into account the memory block size, as in the external memory model, nor the cost of computation. We motivate it by assuming that the block size is significantly smaller than the input size, that the data is laid out contiguously in memory, and that the computation is evenly distributed among processors.

I/O optimization for linear algebra includes such techniques as loop tiling and skewing [59], and loop interchange and reversal [58]. For programs with multiple loop nests, Kennedy and McKinley [35] showed various techniques for loop fusion and proved that in general this problem is NP-hard. Later, Darte [20] identified cases when this problem has polynomial complexity.

Toledo [55] in his survey on Out-Of-Core (OOC) algorithms analyzed various I/O minimizing techniques for dense and sparse matrices. Mohanty [43] in his thesis optimized several OOC algorithms. Irony et al. [33] proved the I/O lower bound of classical MMM on a parallel machine. Ballard et al. [5] proved analogous results for Strassen's algorithm. This analysis was extended by Scott et al. [46] to a general class of Strassen-like algorithms. Although we consider only dense matrices, there is extensive literature on sparse matrix I/O optimizations. Bender et al. [7] extended Aggarwal's external memory model [2] and showed the I/O complexity of sparse matrix-vector (SpMV) multiplication. Greiner [29] extended those results and provided I/O complexities of other sparse computations.
Distributed algorithms for dense matrix multiplication date back to the work of Cannon [10], which has been analyzed and extended many times [30, 39]. In the presence of extra memory, Aggarwal et al. [1] included parallelization in the third dimension. Solomonik and Demmel [53] extended this scheme with their 2.5D decomposition to an arbitrary range of available memory, effectively interpolating between Cannon's 2D and Aggarwal's 3D scheme. A recursive, memory-oblivious MMM algorithm was introduced by Blumofe et al. [9] and extended to rectangular matrices by Frigo et al. [26]. Demmel et al. [22] introduced the CARMA algorithm, which achieves the asymptotic complexity for all matrix and memory sizes. We compare COSMA with these algorithms, showing that we achieve better results both in terms of communication complexity and actual runtime performance. Lazzaro et al. [38] used the 2.5D technique for sparse matrices, both for square and rectangular grids. Koanantakool et al. [37] observed that for sparse-dense MMM, a 1.5D decomposition performs less communication than the 2D and 2.5D schemes, as it distributes only the sparse matrix.
11 CONCLUSIONS
In this work we present a new method (Lemma 3) for assessing tight I/O lower bounds of algorithms using their CDAG representation and the red-blue pebble game abstraction. As a use case, we prove a tight bound for MMM, both for sequential (Theorem 1) and parallel (Theorem 2) execution. Furthermore, our proofs are constructive: our COSMA algorithm is near I/O optimal (up to the factor of √S/(√(S+1) − 1), which is less than 0.04% from the lower bound for 10 MB of fast memory) for any combination of matrix dimensions, numbers of processors, and memory sizes. This is in contrast with the current state-of-the-art algorithms, which are communication-inefficient in some scenarios.

To further increase the performance, we introduce a series of optimizations, both on the algorithmic level (processor grid optimization (§ 7.1) and blocked data layout (§ 7.6)) and hardware-related (enhanced communication pattern (§ 7.2), communication–computation overlap (§ 7.3), and one-sided communication (§ 7.4)). The experiments confirm the superiority of COSMA over the other analyzed algorithms: our algorithm significantly reduces communication in all tested scenarios, supporting our theoretical analysis. Most importantly, our work is of practical importance, being maintained as an open-source implementation and achieving a time-to-solution speedup of up to 12.8 times compared to highly optimized state-of-the-art libraries.

An important feature of our method is that it does not require any manual parameter tuning and is generalizable to other machine models (e.g., multiple levels of memory) and linear algebra kernels (e.g., LU or Cholesky decompositions), both for dense and sparse matrices. We believe that this "bottom-up" approach will lead to the development of more efficient distributed algorithms in the future.

Acknowledgements
We thank Yishai Oltchik and Niels Gleinig for invaluable help with the theoretical part of the paper, and Simon Pintarelli for advice and support with the implementation. We also thank CSCS for the compute hours needed to conduct all experiments. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 programme (grant agreement DAPP, No. 678880), and additional funding from the Platform for Advanced Scientific Computing (PASC).

REFERENCES
[1] R. C. Agarwal et al. 1995. A three-dimensional approach to parallel matrix multiplication.
IBM J. Res. Dev. (1995). [2] Alok Aggarwal and Jeffrey S. Vitter. 1988. The Input/Output Complexity of Sorting and Related Problems.
CACM (Sept. 1988). [3] Lars Arge et al. 2008. In
SPAA .[4] Ariful Azad et al. 2015. Parallel triangle counting and enumeration using matrixalgebra. In
IPDPSW .[5] Grey Ballard et al. 2012. Graph expansion and communication costs of fastmatrix multiplication.
JACM (2012).[6] Tal Ben-Nun and Torsten Hoefler. 2018. Demystifying Parallel and DistributedDeep Learning: An In-Depth Concurrency Analysis.
CoRR abs/1802.09941 (2018).[7] Michael A Bender et al. 2010. Optimal sparse matrix dense vector multiplicationin the I/O-model.
TOCS (2010).[8] Maciej Besta et al. 2017. SlimSell: A Vectorizable Graph Representation forBreadth-First Search. In
IPDPS .[9] Robert D Blumofe et al. 1996. An analysis of dag-consistent distributed shared-memory algorithms. In
SPAA .[10] Lynn Elliot Cannon. 1969.
A Cellular Computer to Implement the Kalman FilterAlgorithm . Ph.D. Dissertation.[11] Gregory J. Chaitin et al. 1981. Register allocation via coloring.
Computer Lan-guages (1981).[12] S. M. Chan. 2013. Just a Pebble Game. In
CCC. [13] Françoise Chatelin. 2012.
Eigenvalues of Matrices: Revised Edition . Siam.[14] Jaeyoung Choi et al. 1992. ScaLAPACK: A scalable linear algebra library fordistributed memory concurrent computers. In
FRONTIERS .[15] Jaeyoung Choi et al. 1996. Design and Implementation of the ScaLAPACK LU,QR, and Cholesky Factorization Routines.
Sci. Program. (1996).[16] J. Choi et al. 1996. ScaLAPACK: a portable linear algebra library for distributedmemory computers — design issues and performance.
Comp. Phys. Comm. (1996).[17] Jaeyoung Choi, David W Walker, and Jack J Dongarra. 1994. PUMMA: Paralleluniversal matrix multiplication algorithms on distributed memory concurrentcomputers.
Concurrency: Practice and Experience
6, 7 (1994), 543–570.[18] Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein.2009.
Introduction to algorithms . MIT press.[19] Paolo D’Alberto and Alexandru Nicolau. 2008. Using recursion to boost ATLAS’sperformance. In
High-Performance Computing . Springer.[20] Alain Darte. 1999. On the complexity of loop fusion. In
PACT .[21] Mauro Del Ben et al. 2015. Enabling simulation at the fifth rung of DFT: Largescale RPA calculations with excellent time to solution.
Comp. Phys. Comm. (2015).[22] J. Demmel et al. 2013. Communication-Optimal Parallel Recursive RectangularMatrix Multiplication. In
IPDPS .[23] V. Elango et al. 2013.
Data access complexity: The red/blue pebble game revisited .Technical Report.[24] Geoffrey C Fox. 1988. Solving problems on concurrent processors. (1988).[25] Geoffrey C Fox, Steve W Otto, and Anthony JG Hey. 1987. Matrix algorithms ona hypercube I: Matrix multiplication.
Parallel computing
4, 1 (1987), 17–31.[26] M. Frigo et al. 1999. Cache-oblivious algorithms. In
FOCS .[27] R. Gerstenberger et al. 2013. Enabling Highly-Scalable Remote Memory AccessProgramming with MPI-3 One Sided. In SC .[28] John R. Gilbert et al. 1979. The Pebbling Problem is Complete in PolynomialSpace. In STOC .[29] Gero Greiner. 2012.
Sparse matrix computations and their I/O complexity. Ph.D. Dissertation. Technische Universität München. [30] A. Gupta and V. Kumar. 1993. Scalability of Parallel Algorithms for Matrix Multiplication. In
ICPP .[31] Azzam Haidar, Stanimire Tomov, Jack Dongarra, and Nicholas J Higham. 2018.Harnessing GPU tensor cores for fast FP16 arithmetic to speed up mixed-precision iterative refinement solvers. In
Proceedings of the International Confer-ence for High Performance Computing, Networking, Storage, and Analysis . IEEEPress, 47.[32] T. Hoefler et al. 2015. Remote Memory Access Programming in MPI-3.
TOPC (2015).[33] Dror Irony et al. 2004. Communication Lower Bounds for Distributed-memoryMatrix Multiplication.
JPDC (2004).[34] Hong Jia-Wei and Hsiang-Tsung Kung. 1981. I/O complexity: The red-blue pebblegame. In
STOC .[35] Ken Kennedy and Kathryn S McKinley. 1993. Maximizing loop parallelism andimproving data locality via loop fusion and distribution. In
LCPC .[36] Jeremy Kepner et al. 2016. Mathematical foundations of the GraphBLAS. arXiv:1606.05790 (2016).[37] Penporn Koanantakool et al. 2016. Communication-avoiding parallel sparse-dense matrix-matrix multiplication. In
IPDPS .[38] Alfio Lazzaro et al. 2017. Increasing the efficiency of sparse matrix-matrixmultiplication with a 2.5 D algorithm and one-sided MPI. In
PASC .[39] Hyuk-Jae Lee et al. 1997. Generalized Cannon’s Algorithm for Parallel MatrixMultiplication. In
ICS .[40] Quanquan Liu. 2018. Red-Blue and Standard Pebble Games : Complexity andApplications in the Sequential and Parallel Models. [41] Lynn Loomis and Hassler Whitney. 1949. An inequality related to the isoperi-metric inequality. (1949).[42] Carl D Meyer. 2000.
Matrix analysis and applied linear algebra . SIAM.[43] Sraban Kumar Mohanty. 2010. I/O Efficient Algorithms for Matrix Computations.
CoRR abs/1006.1307 (2010).[44] Andrew Y Ng et al. 2002. On spectral clustering: Analysis and an algorithm. In
NIPS .[45] Donald W. Rogers. 2003.
Computational Chemistry Using the PC . John Wiley &Sons, Inc.[46] Jacob Scott et al. 2015. Matrix multiplication I/O-complexity by path routing. In
SPAA .[47] Ravi Sethi. 1973. Complete Register Allocation Problems. In
STOC .[48] Tyler Michael Smith and Robert A. van de Geijn. 2017. Pushing the Bounds forMatrix-Matrix Multiplication.
CoRR abs/1702.02017 (2017). [49] Raffaele Solcà, Anton Kozhevnikov, Azzam Haidar, Stanimire Tomov, Jack Dongarra, and Thomas C. Schulthess. 2015. Efficient Implementation of Quantum Materials Simulations on Distributed CPU-GPU Systems. In
Proceedingsof the International Conference for High Performance Computing, Networking,Storage and Analysis (SC ’15) . ACM, New York, NY, USA, Article 10, 12 pages.https://doi.org/10.1145/2807591.2807654[50] Edgar Solomonik et al. 2013. Cyclops tensor framework: Reducing commu-nication and eliminating load imbalance in massively parallel contractions. In
IPDPS .[51] Edgar Solomonik et al. 2016. Trade-Offs Between Synchronization, Communica-tion, and Computation in Parallel Linear Algebra omputations.
TOPC (2016).[52] E. Solomonik et al. 2017. Scaling Betweenness Centrality using Communication-Efficient Sparse Matrix Multiplication. In SC .[53] Edgar Solomonik and James Demmel. 2011. Communication-Optimal Parallel2.5D Matrix Multiplication and LU Factorization Algorithms. In EuroPar .[54] Volker Strassen. 1969. Gaussian Elimination is Not Optimal.
Numer. Math. (1969).[55] Sivan Toledo. 1999. A survey of out-of-core algorithms in numerical linearalgebra.
External Memory Algorithms and Visualization (1999).[56] Robert A Van De Geijn and Jerrell Watts. 1997. SUMMA: Scalable universalmatrix multiplication algorithm.
Concurrency: Practice and Experience
9, 4 (1997),255–274.[57] Jeffrey S Vetter and Michael O McCracken. 2001. Statistical scalability analysisof communication operations in distributed applications.
ACM SIGPLAN Notices
36, 7 (2001), 123–132.[58] Michael E. Wolf and Monica S. Lam. 1991. A Data Locality Optimizing Algorithm.In
PLDI .[59] Michael Wolfe. 1989. More iteration space tiling. In
Supercomputing’89: Proceed-ings of the 1989 ACM/IEEE conference on Supercomputing . IEEE, 655–664.[60] Nan Xiong. 2018.
Optimizing Tall-and-Skinny Matrix-Matrix Multiplication onGPUs . Ph.D. Dissertation. UC Riverside.[61] Qinqing Zheng and John D. Lafferty. 2016. Convergence Analysis for RectangularMatrix Completion Using Burer-Monteiro Factorization and Gradient Descent.
CoRR (2016).
A CHANGE LOG
• Added the DOI (10.1145/3295500.3356181) of the SC'19 paper version.
• Last paragraph of the section corresponding to Section 4 in the SC'19 paper: W_{B,i} corrected to W_{R,i} in the definition of R(S).
• Lemma 2: q ≥ (X − R(S) − T(S)) · (h − 1) corrected to q ≥ (X − R(S) + T(S)) · (h − 1).
• Lemma 3 (Section 4, Lemma 1 in the SC'19 paper): "T(S) is the minimum store size" corrected to "T(S) is the minimum I/O size".
• Section 6.2, Parallelization Strategies for MMM, Schedule P_ijk: the third dimension of the processor grid G_ijk = [m/√S, n/√S, ·] corrected to pS/(mn).