Fast Matrix Multiplication and Symbolic Computation
Jean-Guillaume Dumas [a] and Victor Y. Pan [b]

[a] Laboratoire Jean Kuntzmann, Applied Mathematics and Computer Science, Université Grenoble Alpes, 700 avenue centrale, IMAG - CS 40700, 38058 Grenoble cedex 9, France. [email protected], ljk.imag.fr/membres/Jean-Guillaume.Dumas

[b] Department of Mathematics and Computer Science, Lehman College of the City University of New York, Bronx, NY 10468 USA, and Ph.D. Programs in Mathematics and Computer Science, The Graduate Center of the City University of New York, New York, NY 10036 USA. [email protected], comet.lehman.cuny.edu/vpan

Abstract
The complexity of matrix multiplication (hereafter MM) has been intensively studied since 1969, when Strassen surprisingly decreased the exponent 3 in the cubic cost of the straightforward classical MM to log_2(7) ≈ 2.8074. In the first part of our paper we review the progress of fast MM and the curse of recursion, which separates the progress in fast MM into its most acclaimed and purely theoretical part and into valuable acceleration of MM of feasible sizes. Then, in the second part of our paper, we cover fast MM in realistic symbolic computations and discuss applications and implementation of fast exact matrix multiplication. We first review how most of exact linear algebra can be reduced to matrix multiplication over small finite fields. Then we highlight the differences in the design of approximate and exact implementations of fast MM, taking into account nowadays processor and memory hierarchies. In the concluding section we comment on current perspectives of the study of fast MM.

Key Words:
Matrix multiplication (MM), Computations, Complexity, Linear algebra, Tensor decomposition, Multilinear algebra, Inner product, Summation, Binary segmentation, Exact linear algebra, Small finite fields, Processors, Memory hierarchy, Impacts of fast MM, Numerical implementation, Perspectives

Partly supported by the OpenDreamKit Horizon 2020 European Research Infrastructures project. Supported by NSF Grants CCF-1116736 and CCF-1563942 and PSC CUNY Award 68862-00 46.

1 Introduction
Matrix multiplication (hereafter we keep using the acronym MM) is fundamentally important for symbolic and numerical computations in linear algebra and for the theory of computing. Efficient performance of MM depends on various factors, particularly on vectorization, data locality, and arithmetic cost (cf. [71, Chapter 1]).

In the first part of the paper (Sections 2-10) we review the work on the decrease of the arithmetic cost, including the purely theoretical study of MM of immense sizes (so far this part of the study has been most acclaimed and most generously supported!), but we focus on feasible MM. In our longest Section 11 we discuss realistic acceleration of symbolic MM, taking into account nowadays processor and memory hierarchies. In our concluding Section 12 we comment on current perspectives of the study of fast MM.

1.1 Fast MM and the curse of recursion

The cubic arithmetic time 2n^3 - n^2 of the straightforward algorithm for MM(n), that is, for n x n MM, was commonly believed to be optimal until 1969, when Strassen's algorithm of [142] performed
MM(n) in O(n^ω) time for ω = log_2(7) ≈ 2.8074. This implied the exponent log_2(7) also for numerous venerated computational problems in Computer Science, Linear Algebra, and Computer Algebra such as Boolean MM, parsing context-free grammars, computing paths and distances in graphs, the solution of a nonsingular linear system of equations, computation of the inverse and the determinant of a matrix, and its various factorizations (see more in Section 10). The worldwide interest in MM immediately exploded, and it was widely expected that new efficient algorithms would soon perform MM and solve the related computational problems in nearly quadratic time. (For the scientific world the news came as a miracle from the blue. Most of the readers were particularly impressed by the power of the divide and conquer method, not novel in 1969, rather than by Strassen's ingenious algorithm for 2 x 2 MM.) Even the exponent 2.8074, however, defied the attacks of literally all experts around the globe for almost a decade, until 1978, when the algorithm of [113] broke Strassen's record, improving the algorithm of [142] already at the level of feasible MM.

The mainstream research responded to that breakthrough by directing all effort to the decrease of the exponent of MM of unrestricted sizes and very soon succeeded in dramatic acceleration of infeasible MM of astronomical sizes. New surprising resources have been found, sophisticated techniques have been developed, and by 1987 the exponent of infeasible MM was decreased below 2.38 (see [44]), although as of December 2016 it still has not been decreased below 2.37, that is, in the last 3 decades the progress was nominal (see [97] for the current record exponent). Moreover, the study of infeasible MM has never made an impact on the practice of MM or any other realistic computations (cf. [4], [5], [21], and our concluding section).

For n restricted to be "moderate", say, less than 1,000,000, the current record exponent is 2.7734, achieved with the algorithm of [118] and unbeaten since 1982. All algorithms supporting smaller exponents suffer from the curse of recursion (cf. [123]): they beat the classical straightforward MM algorithm only after performing a large and typically immense number of recursive steps, with the input size growing exponentially in the number of such steps: the straightforward algorithm supersedes them until the input size by far exceeds the realistic level, typically by many orders of magnitude.

1.2 Our subjects and focus

In the context of this development, we refocus our presentation compared to the decades-old survey [119]. In Section 9 we still pay tribute to the lasting interest in the exponent of infeasible MM, but we do not cover the various amazing sophisticated techniques proposed exclusively for the acceleration of MM of immense sizes, which dominated the review of [119]. Instead we cover in some detail the techniques that support fast feasible MM.

For a long time fast MM was also dismissed based on a folk "theorem" about the numerical instability of all fast algorithms for MM, although the paper [16] had refuted that claim. The results of [16] became widely known only when the well-recognized article [50] extended them to all recursive bilinear algorithms for MM, but even in 2010 the Introduction of the important innovative paper [22] still referred to this "theorem", and in 2016 the paper [80] still talks about "numerical stability issues with many levels of recursions".

Moreover the paper [50] and its successor [10] attack [113] as well as all the work on fast feasible MM from another side. Trying to simplify the discussion or perhaps to divert public attention from the advanced work on fast feasible MM to the domain of their own study and progress, the authors of these papers call "Strassen-like" all known fast algorithms for MM.
This "innovation" was based on ignorance: "Strassen-like" algorithms, as [50] and [10] formally define them, have been long and well known under the much more informative name of noncommutative bilinear algorithms (see, e.g., [24]). Our personal communication in 2015 with the authors of [50] seems to have helped: the label "Strassen-like" is not used, e.g., in [8], but unfortunately their original widely publicized contempt for the advanced results on fast feasible MM (including those completely distinct from Strassen's old algorithm of 1969 and in various important respects superseding it) has been deeply implanted into the scientific community.

In the first part of our paper (Sections 2-10) we review the previous study of fast MM with the focus on fast feasible MM and its impacts and applications to realistic computations beyond MM; for example, we include our novel extension of an old MM technique to the computation of the inner product of 2 vectors and the summation of integers (see our Examples 8.1 and 8.2). In the second part of the paper (Section 11) we discuss in some detail symbolic implementation of fast MM directed to minimizing communication cost and improving parallel implementation. The basic algorithms for that study are mostly decades-old, and complementing them with some more recent advanced algorithms for feasible MM is one of the most natural directions to further progress in the field.
1.3 Organization of the paper

We organize our paper as follows. In Sections 2 and 4 we recall the 2 first accelerations of MM, in 1969 and 1978, respectively, and comment on their impacts beyond MM. In Sections 3, 5, and 7 we cover the fundamental classes of bilinear, trilinear, and the so-called APA algorithms, respectively, and discuss the associated fundamental techniques of the algorithm design, their impact on the acceleration of MM, and their links to other areas of computing and algebra. In Section 8 we extend an old MM technique to the computation of the inner product of 2 vectors and the summation of integers. In Section 9 we summarize the history of the decrease of the MM exponents. In Section 10 we briefly comment on applications, links, and impacts of fast MM. In Section 11 we discuss the symbolic application and implementation of exact fast MM. In Section 12 we comment on numerical implementation of fast MM and perspectives of its study.

More precisely, fast MM algorithms are slightly less stable numerically than the straightforward MM, but this instability is mild and rather little affects actual implementations of MM (see more details in [16], [73], [10], [50], and [8]). The paper [80] is not really about fast MM, since it unrolls only one or two levels of recursion of Strassen's algorithm and then recomputes several times the submatrix additions to avoid using temporary buffers. The book [121] and the MM survey articles [119] and [120] pay high respect to Strassen's fundamental contributions to fast MM (and similarly did Volker Strassen to the contribution of [111] to algebraic computations in the sections "Pan's method" of [143] and [145]), but we feel that calling all the advanced work on fast feasible MM Strassen-like is not more fair or informative than, say, labeling Democritus-like the Faraday constant, Mendeleev's periodic table, and the Heisenberg principle of uncertainty.

In addition to the acronym "MM" for "matrix multiplication", hereafter we write "MI" for "nonsingular matrix inversion", MM(m, n, p) for m x n by n x p MM, MM(n) for MM(n, n, n), and MI(n) for n x n MI. W = (w_{i,j})_{i,j=1}^{m,n} denotes an m x n matrix with the entries w_{i,j}, for i = 1, ..., m and j = 1, ..., n.

2 2x2-based recursive processes

We begin with recalling 3 old accelerations of the straightforward MM algorithm.
Example 2.1. From faster inner product to faster MM. [See [152] and notice the technical similarity to the algorithms for polynomial evaluation with preprocessing in [111] and [92].] Observe that

u^T v = Σ_{i=1}^{n/2} (u_{2i-1} + v_{2i})(v_{2i-1} + u_{2i}) - Σ_{i=1}^{n/2} u_{2i-1} u_{2i} - Σ_{i=1}^{n/2} v_{2i-1} v_{2i}   (2.1)

for any even n. Apply this identity to all n^2 inner products defining n x n MM and compute them by using 0.5n^3 + n^2 scalar multiplications and 1.5n^3 + 2n^2 - 2n additions and subtractions.
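For concreteness, here is a minimal Python sketch of identity (2.1); the function name and the tiny test are ours, not from [152]:

```python
# A sketch of Example 2.1 (Winograd's inner-product trick), assuming an
# even vector length n; all names are illustrative.
def winograd_inner_product(u, v):
    """Compute u^T v via identity (2.1): n/2 "aggregate" multiplications
    plus two correction sums that depend on one vector each."""
    n = len(u)
    assert n % 2 == 0 and len(v) == n
    aggregates = sum((u[2 * i] + v[2 * i + 1]) * (v[2 * i] + u[2 * i + 1])
                     for i in range(n // 2))
    # each correction depends on one vector only, so in n x n MM it is
    # computed once per row (or column) and reused for all n^2 entries
    cu = sum(u[2 * i] * u[2 * i + 1] for i in range(n // 2))
    cv = sum(v[2 * i] * v[2 * i + 1] for i in range(n // 2))
    return aggregates - cu - cv

u, v = [1, 2, 3, 4], [5, 6, 7, 8]
assert winograd_inner_product(u, v) == sum(x * y for x, y in zip(u, v))
```

Reusing the two correction sums across all n^2 inner products of an n x n MM is exactly what yields the 0.5n^3 + n^2 multiplication count above.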
Example 2.2. Winograd's 2 x 2 MM (cf. [68], [24, pages 45-46], [1, Exercise 6.5], or [52]). Compute the product X = UV of a pair of 2 x 2 matrices,

U = [u_11 u_12; u_21 u_22], V = [v_11 v_12; v_21 v_22], X = UV = [x_11 x_12; x_21 x_22],

by using the following expressions:

s_1 = u_21 + u_22, s_2 = s_1 - u_11, s_3 = u_11 - u_21, s_4 = u_12 - s_2,
s_5 = v_12 - v_11, s_6 = v_22 - s_5, s_7 = v_22 - v_12, s_8 = s_6 - v_21,
p_1 = s_2 s_6, p_2 = u_11 v_11, p_3 = u_12 v_21, p_4 = s_3 s_7, p_5 = s_1 s_5, p_6 = s_4 v_22, p_7 = u_22 s_8,
t_1 = p_1 + p_2, t_2 = t_1 + p_4, t_3 = t_1 + p_5,
x_11 = p_2 + p_3, x_12 = t_3 + p_6, x_21 = t_2 - p_7, x_22 = t_2 + p_5.

This algorithm performs 2 x 2 MM by using 7 scalar multiplications and 15 additions and subtractions. If n = 2^h, we can recursively cut the matrices into 2 x 2 block matrices with blocks u_ij, v_jk, and x_ik, for all subscripts i, j, k in {1, 2}, and reuse the above algorithm for every 2 x 2 block product. This performs MM(2^h) by using 7^h scalar multiplications and 5(7^h - 4^h) additions and subtractions, and it is readily extended to performing n x n MM for any n by using c n^{log_2(7)} scalar arithmetic operations overall, where log_2(7) ≈ 2.8074 and c < 3.92 [68].
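The recursion behind Example 2.2 can be sketched as follows (an illustrative Python version of ours, assuming n is a power of 2; a practical implementation would switch to the straightforward algorithm below a cutoff instead of recursing down to scalars):

```python
import numpy as np

def winograd_mm(U, V):
    """Recursive bilinear MM built on Winograd's 2 x 2 scheme."""
    n = U.shape[0]
    if n == 1:
        return U * V
    h = n // 2
    u11, u12, u21, u22 = U[:h, :h], U[:h, h:], U[h:, :h], U[h:, h:]
    v11, v12, v21, v22 = V[:h, :h], V[:h, h:], V[h:, :h], V[h:, h:]
    s1 = u21 + u22; s2 = s1 - u11; s3 = u11 - u21; s4 = u12 - s2
    s5 = v12 - v11; s6 = v22 - s5; s7 = v22 - v12; s8 = s6 - v21
    # 7 recursive block products instead of 8
    p1 = winograd_mm(s2, s6);   p2 = winograd_mm(u11, v11)
    p3 = winograd_mm(u12, v21); p4 = winograd_mm(s3, s7)
    p5 = winograd_mm(s1, s5);   p6 = winograd_mm(s4, v22)
    p7 = winograd_mm(u22, s8)
    t1 = p1 + p2; t2 = t1 + p4; t3 = t1 + p5
    return np.block([[p2 + p3, t3 + p6], [t2 - p7, t2 + p5]])

A = np.arange(16).reshape(4, 4); B = np.arange(16, 32).reshape(4, 4)
assert (winograd_mm(A, B) == A @ B).all()
```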
Example 2.3. Strassen's 2 x 2 MM [142].

p_1 = (u_11 + u_22)(v_11 + v_22), p_2 = (u_21 + u_22) v_11, p_3 = u_11 (v_12 - v_22), p_4 = (u_21 - u_11)(v_11 + v_12),
p_5 = (u_11 + u_12) v_22, p_6 = u_22 (v_21 - v_11), p_7 = (u_12 - u_22)(v_21 + v_22),
x_11 = p_1 + p_6 + p_7 - p_5, x_12 = p_3 + p_5, x_21 = p_2 + p_6, x_22 = p_1 + p_3 + p_4 - p_2.

This algorithm performs 2 x 2 MM by using 7 scalar multiplications and 18 additions and subtractions, and its recursive application performs n x n MM for any n by using c' n^{log_2(7)} scalar arithmetic operations for a constant c' independent of n.

Hereafter the recursive bilinear algorithms for MM that begin with 2 x 2 MM, such as those of Examples 2.2 and 2.3, are said to be Strassen's, Strassen-based, or
Strassen-like (cf. [50], [10]). We can see nothing in Example 2.2 borrowed from Example 2.3, but the cited impact of [50] seems to be so deeply rooted in the Community of Computer Algebra that even the authors of the advanced works [22] and [36] did not dare to call Example 2.2 "Winograd's algorithm", although their own innovative algorithms extended precisely this example and not Example 2.3.

The first line of Table 2.1 displays the estimated arithmetic cost of the recursive bilinear algorithms for MM(n), n = 2^k, that perform the MM of smaller blocks with the straightforward algorithm and apply one of the 3 above algorithms recursively on top of it. The improvement from [36] is technically interesting, but its arithmetic cost bounds are still significantly inferior to those of the algorithms of [118] as well as those of [94], whose implementation in [88] is numerically stable and highly efficient in using memory space.

Table 2.1: Arithmetic complexity of some 2 x 2-based recursive MM algorithms. [The entries of this table were lost in extraction; each of them is a cost bound of the form c n^{2.8074} plus lower-order terms in n for one of the recursive algorithms discussed above.]

3 Bilinear algorithms

The algorithms of Examples 2.2 and 2.3 belong to the important class of noncommutative bilinear algorithms, to which we refer just as bilinear. Such an algorithm for
MM(m, n, p) first computes some linear forms l_q(U) and l'_q(V) in the entries of the input matrices U = (u_ij)_{i,j=1}^{m,n} and V = (v_jk)_{j,k=1}^{n,p}, and then it computes the entries x_ik = Σ_j u_ij v_jk of the product X = UV as the mp bilinear forms:

l_q(U) = Σ_{i,j=1}^{m,n} α^(q)_ij u_ij,  l'_q(V) = Σ_{j,k=1}^{n,p} β^(q)_jk v_jk,  q = 1, ..., r,

x_ik = Σ_{q=1}^{r} γ^(q)_ik l_q(U) l'_q(V),  i = 1, ..., m;  k = 1, ..., p.

Here r is said to be the rank of the algorithm; u_ij, v_jk, and x_ik are variables or block matrices, and α^(q)_ij, β^(q)_jk, and γ^(q)_ik are constant coefficients for all i, j, k, and q. They define the coefficient tensors (α^(q)_ij)_{i,j,q}, (β^(q)_jk)_{j,k,q}, and (γ^(q)_ik)_{i,k,q}, which can also be viewed as the coefficient matrices

A = (α^(q)_ij)_{(i,j),q}, B = (β^(q)_jk)_{(j,k),q}, and C = (γ^(q)_ik)_{(i,k),q}.   (3.1)

3.1 MM exponents
Suppose m = n = p, assume that u_ij, v_jk, and x_ik are block matrices, and then again recursively apply the same bilinear algorithm of rank r to the block matrices. The resulting bilinear algorithms have ranks r_h = c' n^{hω} for MM(n^h), h = 2, 3, ..., for ω = ω_{n,r} = log_n(r) and a constant c' independent of n and h. One can readily extend these observations to the following result.

Theorem 3.1. Given a bilinear algorithm of rank r for MM(n) for a pair of positive integers n and r, one can perform MM(K) by using cK^ω arithmetic operations for any K, where ω = ω_{n,r} = log_n(r) and c is a constant independent of K.

Now we define
(i) the exponent of MM(n),
ω_n = min_r log_n(r),   (3.2)
where an integer n > 1 is fixed and the minimum is over the ranks r of all bilinear algorithms for MM(n), and
(ii) the exponent of MM,
ω = min_n ω_n,   (3.3)
where the minimum is over all integers n > 1.
(iii) The latter concept is mathematically attractive, but the highly rewarded work for its upper estimates has diverted public attention from feasible to infeasible MM. To straighten the unbalanced study of this subject, one should instead estimate the exponents ω_n of feasible MM, that is, for moderate values of n.

The straightforward algorithm for MM(m, n, p) is bilinear of rank mnp. The bilinear algorithms of Examples 2.2 and 2.3 for MM(2) have rank 7, which turned out to be optimal (see [76], [77], [31]). [112, Theorem 3] (as well as [122, Theorem 0.3] and [47]) provides an explicit expression for all bilinear algorithms of rank 7 for MM(2), including the algorithms of Examples 2.2 and 2.3 as special cases. Among them the 15 scalar additions and subtractions of Example 2.2 are optimal [127], [33].

One can define bilinear algorithms for any bilinear problem, that is, for the computation of any set of bilinear forms, e.g., the product of two complex numbers (u_1 + i u_2)(v_1 + i v_2) = (u_1 v_1 - u_2 v_2) + i (u_1 v_2 + u_2 v_1), i = sqrt(-1). The straightforward bilinear algorithm has rank 4, and here is a rank-3 bilinear algorithm: l_1 l'_1 = u_1 v_1, l_2 l'_2 = u_2 v_2, l_3 l'_3 = (u_1 + u_2)(v_1 + v_2), u_1 v_1 - u_2 v_2 = l_1 l'_1 - l_2 l'_2, u_1 v_2 + u_2 v_1 = l_3 l'_3 - l_1 l'_1 - l_2 l'_2. See [153], [65], [66], [112], [29], [78], [144], [128], [30], [31] on the early study of bilinear algorithms, and see a concise exposition in [24]. The book [155] covers various efficient bilinear algorithms for multiplication of pairs of integers and polynomials (the latter operation is also called the convolution of the coefficient vectors), with further applications to the design of FIR-filters.

The minimal rank of all bilinear algorithms for a fixed bilinear problem such as MM(m, n, p) is called the rank of the problem. It can be bounded in terms of the minimal arithmetic cost of the solution of the problem and vice versa [112, Theorem 1], [24].

The algorithm of Example 2.1, of rank r = r(n) = 0.5n^3 + n^2 for MM(n), however, is not (noncommutative) bilinear; such algorithms are called quadratic or commutative bilinear. (See [154] on some lower bounds on the rank of such algorithms for MM.) We cannot extend to them the recursive processes that bound the MM exponents by log_n(r), because MM is not commutative, e.g., the equation u_{2i-1} u_{2i} = u_{2i} u_{2i-1} is invalid for matrices u_{2i-1} and u_{2i}.

The algorithm, however, was of great importance in the history of fast MM: it was the first acceleration of the straightforward MM (it saves about 50% of the multiplications), but most importantly, it motivated the effort for the design of bilinear (rather than quadratic) algorithms of rank less than n^3 for MM(n).
It "remained" to devise such an algorithm at least for n = 2, and Strassen received ample recognition for his brilliant design that accomplished exactly this.

Here we consider MM over the fields of real and complex numbers. The exponent ω (although not the overhead constant) stays invariant over the fields having the same characteristic [133, Theorem 2.8]. Some of our results can be used in important applications of MM over semi-rings, but generally in that case distinct techniques are used (cf. [3], [54], [156], [96]). Hereafter we refer to decreasing upper bounds on the exponent of MM as decreasing the exponent, for short.

4 The second acceleration of MM, in 1978

In 1969 the exponents below Strassen's 2.8074 became the target of literally all the leading researchers in the field, worldwide, but remained a dream for almost a decade. This dream would have come true based on bilinear algorithms of rank 6 for MM(2) or rank 21 for MM(3), but it was proved that the rank of MM(2) exceeds 6, and it is still unknown whether the rank of MM(3) exceeds 21. We refer the reader to the papers [76], [77], [112, Theorem 1], and [29], [31], [32], [20], [19], [129], [137], and [95] for some earlier work in this direction, and to the papers [53] and [136] for various lower and upper bounds on the arithmetic complexity and the ranks of rectangular MM of smaller sizes.

The exponent was decreased only in 1978, after almost a decade of stalemate, when the paper [113] presented a bilinear algorithm of rank 143,640 for MM(70). This breakthrough implied the exponent ω = log_70(143,640) < 2.7962 for MM(n), MI(n), Boolean MM(n), and a variety of other well-known computational problems. The algorithm of [113] extended an algorithm of the paper [112] of 1972, published in Russian and translated into English only in 2014 in [122].

The progress was due to the novel combination of two techniques: trilinear interpretation of bilinear algorithms and the aggregation method. By following [113] we call this combination trilinear aggregation. By refining this combination of 2 techniques the paper [118] accelerated MM of moderate sizes and yielded the exponent 2.7734. As we already mentioned, various algorithms combining trilinear aggregation with other advanced techniques decreased the exponent below this level (and in 1986 even below 2.38), but only when applied to MM of immense sizes, because of the curse of recursion.

The technique of trilinear aggregation has been recognized for its impact on the decrease of the MM exponents, but the paper [112] was also a historical landmark in the study of multilinear and tensor decompositions. Such decompositions, introduced by Hitchcock in 1927, received little attention except for a minor response in 1963-70 with half a dozen papers in the psychometrics literature. The paper [112] of 1972 provided the earliest known application of nontrivial multilinear and tensor decompositions to fundamental matrix computations, now a popular flourishing area in linear and multilinear algebra with a wide range of important applications to modern computing (see [148], [93], [108], [71], and the bibliography therein). Nevertheless the paper [112] has rarely been cited at all and has never been cited in the papers on multilinear and tensor decompositions. (Until 1976 the second author lived in the Soviet Union. From 1964 to 1976 he worked in Economics in order to make his living and wrote the papers [111] and [112] in his spare time.)

5 Trilinear decompositions and duality

Next we define the trilinear representation of MM, first proposed and used in the paper [112].
We are also going to link it to some important computations beyond MM.

Let U = (u_ij)_{i,j} and V = (v_jk)_{j,k} be a pair of m x n and n x p matrices, respectively, and let the equations Σ_j u_ij v_jk = Σ_{s=1}^{r} w^(s)_ik l_s(U) l'_s(V), for all i and k, represent a bilinear algorithm of rank r for the matrix product X = UV.

Define a trilinear decomposition of rank r for trace(UVW) = Σ_{i,j,k} u_ij v_jk w_ki by multiplying these equations by the variables w_ki and summing the products in i and k. Here W = (w_ki)_{k,i}^{p,m} is an auxiliary p x m matrix and trace(M) denotes the trace of a matrix M. (Equivalently we can decompose the tensor of the trilinear form trace(UVW) into the sum of r tensors of rank 1.)

Example 5.1. A trilinear decomposition of rank 7 for MM(2):

Σ_{i,j,h=1}^{2} u_ij v_jh w_hi = Σ_{s=1}^{7} l_s l'_s l''_s,

l_1 l'_1 l''_1 = (u_11 + u_22)(v_11 + v_22)(w_11 + w_22),
l_2 l'_2 l''_2 = (u_21 + u_22) v_11 (w_12 - w_22),
l_3 l'_3 l''_3 = u_11 (v_12 - v_22)(w_21 + w_22),
l_4 l'_4 l''_4 = (u_21 - u_11)(v_11 + v_12) w_22,
l_5 l'_5 l''_5 = (u_11 + u_12) v_22 (w_21 - w_11),
l_6 l'_6 l''_6 = u_22 (v_21 - v_11)(w_11 + w_12),
l_7 l'_7 l''_7 = (u_12 - u_22)(v_21 + v_22) w_11.

Conversely, we can come back to the original bilinear algorithm for the matrix product X = UV if we interpret both sides of any decomposition of the trilinear form trace(UVW) as linear forms in the variables w_hi and equate the coefficients of these variables on both sides of the decomposition.

This duality extends any bilinear algorithm of rank r for MM(m, n, p) to a trilinear decomposition of rank r for the associated trilinear form and its tensor. Instead of equating the variables w_ij on both sides of the trilinear decomposition, we can equate the coefficients of all variables u_ij or of all variables v_jh and then arrive at 2 other dual bilinear algorithms of the same rank for the problems MM(n, p, m) and MM(p, m, n). By interchanging the subscripts of the variables, we arrive at dual bilinear algorithms of the same rank for the problems MM(m, p, n), MM(n, m, p), and MM(p, n, m) as well (cf. [112, part 5 of Theorem 1], [29], [78], [128]). The latter extension from triples to 6-tuples is pertinent to MM, because it uses the double subscripts of the variables, but triples of bilinear algorithms can be generated from their common trilinear representation for any bilinear computational problem, e.g., for multiplication of 2 complex numbers in the following example.

Example 5.2. A trilinear decomposition of rank 3 for multiplication of 2 complex numbers:

u_1 v_1 w_1 - u_2 v_2 w_1 + u_1 v_2 w_2 + u_2 v_1 w_2 = u_1 v_1 (w_1 - w_2) - u_2 v_2 (w_1 + w_2) + (u_1 + u_2)(v_1 + v_2) w_2.

For a sample application of the duality technique, one can readily deduce the following result (part 1 of [112, Theorem 1]).

Theorem 5.1. Given a bilinear or trilinear algorithm of rank r for MM(m, n, p) and any 4-tuple of integers r, m, n, and p such that mnp > 1, one can perform MM(K) by using cK^ω arithmetic operations for any K, where ω = ω_{m,n,p,r} = 3 log_{mnp}(r) and c is a constant independent of K.

For further applications of the duality technique, see the efficient bilinear algorithms for FIR-filters and for multiplication of complex numbers and polynomials in [155].
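As a sanity check of Example 5.1, the following small script (ours, not from [112]) verifies the rank-7 trilinear decomposition numerically on random integer matrices:

```python
import numpy as np

# Numeric check that trace(UVW) equals the sum of the 7 triple products
# of Example 5.1; here w_hi denotes the (h, i) entry of W.
rng = np.random.default_rng(0)
U, V, W = (rng.integers(-5, 5, (2, 2)) for _ in range(3))
(u11, u12), (u21, u22) = U
(v11, v12), (v21, v22) = V
(w11, w12), (w21, w22) = W

terms = [
    (u11 + u22) * (v11 + v22) * (w11 + w22),
    (u21 + u22) * v11 * (w12 - w22),
    u11 * (v12 - v22) * (w21 + w22),
    (u21 - u11) * (v11 + v12) * w22,
    (u11 + u12) * v22 * (w21 - w11),
    u22 * (v21 - v11) * (w11 + w12),
    (u12 - u22) * (v21 + v22) * w11,
]
assert sum(terms) == np.trace(U @ V @ W)
```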
6 Trilinear aggregation

Aggregation techniques are well-known in business, economics, computer science, telecommunication, natural sciences, medicine, and statistics. The idea is to mass together or cluster independent but similar units into much fewer aggregates. Their study is simpler, but its results are supposed to characterize all these units either directly or by means of special disaggregation techniques. Such aggregation/disaggregation processes proposed in [101] became a basis for creating the field of Algebraic Multigrid, now quite popular.

Aggregation/disaggregation techniques are behind the acceleration of MM in Example 2.1, which was preceded by a similar application of this technique to polynomial evaluation with preprocessing of coefficients [111], [92]. In that example one first rewrites u_{2i} v_{2i} = v_{2i} u_{2i} by using the commutativity of multiplication (which does not hold if u_{2i} and v_{2i} are matrices), then aggregates the terms u_{2i-1} v_{2i-1} and v_{2i} u_{2i} into the single term (u_{2i-1} + v_{2i})(v_{2i-1} + u_{2i}), thus saving 50% of the scalar multiplications, that is, 0.5n^3 for n x n MM. For disaggregation one subtracts the correction terms u_{2i-1} u_{2i} and v_{2i-1} v_{2i}. The n x n MM involves 0.5n^2 pairs of such products for the rows of U and another 0.5n^2 for the columns of V, and so the correction involves just n^2 scalar multiplications overall, which is a small sacrifice compared to saving 0.5n^3 scalar multiplications.

The papers [112] and [113] strengthen aggregation by restating MM as the problem of the decomposition of the trilinear form trace(UVW), so that some additional links among the subscripts of the input and output entries enable stronger aggregation and faster MM. Various implementations of this technique appeared in [113], [114], [115], and a number of subsequent papers. For demonstration we apply it to Disjoint MM, where we compute two independent matrix products XY and UV by decomposing the trilinear form trace(XYZ + UVW) = Σ_{i,j,k=1}^{m,n,p} (x_ij y_jk z_ki + u_jk v_ki w_ij) into the sum of rank-1 tensors.

Let T = Σ_{i,j,k=1}^{m,n,p} (x_ij + u_jk)(y_jk + v_ki)(z_ki + w_ij) denote the sum of the trilinear aggregates, each made up of the 2 monomials x_ij y_jk z_ki and u_jk v_ki w_ij, and let T_1 = Σ_{i,j=1}^{m,n} x_ij s_ij w_ij, T_2 = Σ_{j,k=1}^{n,p} u_jk y_jk r_jk, and T_3 = Σ_{k,i=1}^{p,m} q_ik v_ki z_ki denote 3 groups of correction terms, where q_ik = Σ_{j=1}^{n} (x_ij + u_jk), s_ij = Σ_{k=1}^{p} (y_jk + v_ki), and r_jk = Σ_{i=1}^{m} (z_ki + w_ij). Then the equation trace(XYZ + UVW) = T - T_1 - T_2 - T_3 defines a trilinear decomposition of rank mnp + mn + np + pm (rather than the straightforward 2mnp).

Table 6.1: Aggregation/disaggregation of a pair of terms.

x_ij   y_jk   z_ki
u_jk   v_ki   w_ij

Table 6.1 displays this aggregation/disaggregation technique. The product of the 3 sums of the pairs of input entries in the 3 columns of the table is an aggregate. The 2 products of the triples of entries in the 2 rows are the output terms x_ij y_jk z_ki and u_jk v_ki w_ij. The cross-products of the other triples of the table define 6 correction terms. Their sum over all mnp triples of indices i, j, and k has rank 2(mn + np + pm). By subtracting this sum from the sum of all mnp aggregates, we decompose the 2mnp terms of trace(XYZ + UVW) into the sum of mnp + 2(mn + np + pm) terms.
For m = n = p = 34 this implies a decomposition of rank n^3 + 6n^2 for a pair of disjoint MM(n).

The demonstration of the power of trilinear aggregation can be made most transparent for Disjoint MM, whose natural link to trilinear aggregation has been shown in [119], [120, Section 5], [121, Section 12], and [94]. Such constructions for Disjoint MM, however, can frequently be extended to MM(n). In particular, by playing with odd and even subscripts of the matrix entries, the paper [112] obtained a trilinear decomposition of rank 0.5n^3 + 3n^2 for MM(n) and any even n by means of extending the above decomposition of trace(XYZ + UVW). This implied the MM exponent log_n(0.5n^3 + 3n^2), which is less than 2.85 for n = 34.

The paper [113] defined a trilinear decomposition and bilinear algorithms of rank (n^3 - 4n)/3 + 6n^2 for MM(n), n = 2s, and any positive integer s. Substitute n = 70 and obtain the MM exponent 2.7962. Again it is convenient to demonstrate this design for the Disjoint MM associated with a decomposition of the trilinear form trace(XYZ + UVW + ABC). The basic step is the aggregation/disaggregation defined by Table 6.2.

Table 6.2: Aggregation/disaggregation of a triple of terms.

x_ij   y_jk   z_ki
u_jk   v_ki   w_ij
a_ki   b_ij   c_jk

Sum the mnp aggregates (x_ij + u_jk + a_ki)(y_jk + v_ki + b_ij)(z_ki + w_ij + c_jk), subtract the order of n^2 correction terms, and obtain a decomposition of rank n^3 + O(n^2) for trace(XYZ + UVW + ABC), versus the straightforward 3n^3. The trace represents 3 disjoint problems of MM(n), that is, the computation of the 3 independent matrix products XY, UV, and AB of size n x n (and the construction can be readily extended to 3 MM products of sizes m x n by n x p, n x p by p x m, and p x m by m x n), and we obtain a bilinear algorithm of rank n^3 + O(n^2) for this bilinear task.

With a little more work one obtains a similar trilinear decomposition of rank (n^3 - 4n)/3 + 6n^2 for MM(n), n = 2s, and any positive integer s (see [113]). For n = 70 we arrive at the upper bound 2.7962 on the MM exponent. By refining this construction the algorithm of [118] decreased the upper bound to 2.7734.

7 APA algorithms

The technique of Any Precision Approximation (hereafter we use the acronym APA) was another basic ingredient of the algorithms supporting the decrease of the exponent of MM. The paper [15] achieved the first APA acceleration of MM, yielding the exponent 2.7799. According to [130], this came from a computer search for partial MM(2), where the goal was the computation of only 3 entries of the 2 x 2 matrix product. The following variation of Table 6.1 demonstrates the technique:

x_ij    y_jk    λ^2 z_ki
λu_jk   λv_ki   w_ij

It defines the aggregate (x_ij + λu_jk)(y_jk + λv_ki)(λ^2 z_ki + w_ij) and 3 groups of correction terms, similarly to Table 6.1, but with a decisive difference: the group of correction terms λ^3 (x_ij + λu_jk) v_ki z_ki has a smaller order of magnitude as λ tends to 0 and needs no correction. Therefore we arrive at the trilinear decomposition

trace(XYZ + UVW) = T_0 - T_1 - T_2 + O(λ),

where

T_0 = λ^{-2} Σ_{i,j,k=1}^{m,n,p} (x_ij + λu_jk)(y_jk + λv_ki)(λ^2 z_ki + w_ij),
T_1 = λ^{-2} Σ_{i,j=1}^{m,n} x_ij s_ij w_ij,  T_2 = λ^{-1} Σ_{j,k=1}^{n,p} u_jk y_jk r_jk,
s_ij = Σ_{k=1}^{p} (y_jk + λv_ki), and r_jk = Σ_{i=1}^{m} (λ^2 z_ki + w_ij).

Drop the terms of order λ and obtain a decomposition for Disjoint MM(m, n, p) having border rank mnp + mn + np. For m = p = 7 and n = 1 this implies the APA exponent of MM ω = 3 log_49(31.5) < 2.66. (Here we use Schönhage's result of [133], which deduces the MM exponent from Disjoint MM.)

The above APA decomposition using mnp + mn + np terms is numerically unstable.
Indeed we would corrupt the output if we dropped the summands λu_jk and λv_ki in the sums x_ij + λu_jk and y_jk + λv_ki of the aggregates, but keeping these summands doubles the precision required for the representation of these sums. Similarly, all other known bilinear APA algorithms are prone to numerical stability problems if their rank exceeds their border rank.

In [14], however, Bini proved that, for the purpose of decreasing the exponent of MM, this deficiency is immaterial if we allow unrestricted recursive processes, that is, if we ignore the curse of recursion. Namely, he proved that Theorem 5.1 holds even if border rank replaces rank in its statement. Bini proved this result for MM, but Schönhage in [133] extended it to Disjoint MM. Both proofs yield acceleration of the straightforward MM only when the input size becomes huge, because of the curse of recursion.

Theorem 7.1. We can perform MM(K) by using cK^ω arithmetic operations for any K, where c is a constant independent of K and ω = ω_{m,n,p,r} = 3 log_{mnp}(r), provided that we are given a bilinear or trilinear APA algorithm having a border rank r for MM(m, n, p) and a 4-tuple of integers r, m, n, and p such that mnp > 1.

Proof. First observe that by means of interpolation we can extend any APA algorithm of border rank r, using a polynomial q(λ) in λ of a degree d, to a λ-free bilinear algorithm for MM(m, n, p) of rank (2d + 1)r. Now apply such an APA algorithm recursively. Notice that every recursive step squares the original problem size mnp but only doubles the degree of λ. After h steps, we arrive at an APA algorithm of degree 2^h d for the MM problem of size (mnp)^{2^h}, and then (cf. Theorem 5.1) the interpolation factor 2(2^h d + 1) only implies an increase of the MM exponent ω by a tiny factor, which converges to 1 as the number of recursive steps grows to infinity.

8 APA techniques beyond MM: binary segmentation

Next we cover an application of APA techniques beyond MM. Recall the APA algorithm of Section 7, let the entries x_ij, y_jk, u_jk, and v_ki be integers in the range [0, 2^d), and choose λ = 2^d. Notice that the product (x_ij + λu_jk)(y_jk + λv_ki) fits the length L of the computer word if L ≥ 4d. Moreover, if the ratio L/d is large enough, we can perform the APA computations of Section 7 within the precision L. [121, Section 40] exploits such observations further and devises efficient algorithms for multiplication of vectors and matrices filled with bounded integers. Next we recall that technique, and in Example 8.2 we show its surprising extension.

Suppose that the coefficient vector of a polynomial v(λ) = Σ_{i=0}^{n-1} v_i λ^i is filled with integers from the semi-open segment [0, 2^d) of the real axis, for a positive integer d. Represent this vector by the 2^d-ary integer v(2^d) = Σ_{i=0}^{n-1} v_i 2^{di}. Generally the interpolation to a polynomial of degree n - 1 requires its evaluation at n knots, but in the above special case we only need the evaluation at the single knot 2^d. Now suppose that all coefficients v_i are integers from the semi-open segment [q, r) for any pair of integers q and r, q < r. Then we can apply the above recipe to compute the shifted vector u = (u_i)_{i=0}^{n-1} = (v_i - q)_{i=0}^{n-1}, having all its components in the semi-open segment [0, s) for s = r - q. We can finally recover the vector v from u. By following [116] and [17], we call this technique binary segmentation.
Its history can be traced back to [67], and one can even view it as an application of the Kronecker map, although having a specific computational flavor. Next we follow [121, Example 40.3, pages 196-197] to compute the inner product of two integer vectors, then extend the algorithm to summation, and finally list various other applications of binary segmentation.

Example 8.1. (The inner product of two integer vectors, cf. [121, Example 40.3].) Assume two nonnegative integers g and h and two vectors u = (u_i)_{i=0}^{n-1} and v = (v_i)_{i=0}^{n-1} with nonnegative integer coordinates in two semi-open segments, namely, [0, 2^g) for the coordinates of u and [0, 2^h) for the coordinates of v. The straightforward algorithm for the inner product u^T v = Σ_{i=0}^{n-1} u_i v_i first computes the n products u_i v_i for i = 0, 1, ..., n-1 and then sums them. This involves n multiplications and n - 1 additions. Instead, however, we can just multiply a pair of bounded nonnegative integers, apply binary segmentation to the product, and output the desired inner product. Namely, introduce the two polynomials u(x) = Σ_{i=0}^{n-1} u_i x^i and v(x) = Σ_{i=0}^{n-1} v_i x^{n-1-i}. Their product is the polynomial q(x) = u(x) v(x) = Σ_{i=0}^{2n-2} q_i x^i with integer coefficients in the segment [0, 2^k) for k = g + h + ceil(log_2(n)). The coefficient q_{n-1} = Σ_{i=0}^{n-1} u_i v_i is precisely the inner product u^T v. Represent the polynomials u(x) and v(x) by their integer values u(2^k) and v(2^k) at the point 2^k. Clearly, they lie in the semi-open segments r_u = [0, 2^{(n-1)k+g}) and r_v = [0, 2^{(n-1)k+h}), respectively. Now compute the integer q(2^k) = u(2^k) v(2^k), lying in the segment [0, 2^{(2n-2)k+g+h}), and recover the coefficient q_{n-1} = u^T v by applying binary segmentation.

Remark 8.1. We only seek the coefficient q_{n-1} of the median term q_{n-1} x^{n-1} of the polynomial u(x) v(x). This term lies in the segment [2^{(n-1)k}, 2^{(n-1)k+g+h}), and the next challenge is to optimize its computation. Is such a more narrow task substantially simpler than the multiplication of two integers lying in the segments r_u and r_v?

Example 8.2. (Summation of bounded integers.) For u = (1)_{i=0}^{n-1}, g = 1, and k = h + 1 + ceil(log_2(n)), or v = (1)_{i=0}^{n-1}, h = 1, and k = g + 1 + ceil(log_2(n)), the algorithm of Example 8.1 outputs the sum of the n coordinates of the other vector.

Remark 8.2. In the same way as for polynomial interpolation in the beginning of this section, we can relax the assumption of Examples 8.1 and 8.2 that the input integers are nonnegative. Moreover, the summation of integers can be extended to the fundamental problem of the summation of binary numbers truncated to a fixed precision.

In Examples 8.1 and 8.2, multiplication of two long integers followed by binary segmentation replaces either 2n - 1 or n - 1 arithmetic operations, respectively. This increases the Boolean (bit-wise operation) cost by a factor depending on the Boolean cost of computing the product of 2 integers or, in view of Remark 8.1, of computing the median segment in the binary representation of the product.
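In code, Examples 8.1 and 8.2 can be sketched as follows (a minimal Python version of ours, relying on the built-in long-integer arithmetic; all names are illustrative):

```python
from math import ceil, log2

def inner_product_by_segmentation(u, v, g, h):
    """u_i in [0, 2**g), v_i in [0, 2**h): one long multiplication
    replaces the n multiplications and n - 1 additions of u^T v."""
    n = len(u)
    k = g + h + ceil(log2(n))     # each coefficient of u(x)v(x) fits in k bits
    U = sum(ui << (k * i) for i, ui in enumerate(u))            # u(2^k)
    V = sum(vi << (k * (n - 1 - i)) for i, vi in enumerate(v))  # v(2^k), reversed
    q = U * V                     # one multiplication of two long integers
    return (q >> (k * (n - 1))) & ((1 << k) - 1)  # extract the median coefficient

u, v = [3, 1, 4, 1], [2, 7, 1, 8]
assert inner_product_by_segmentation(u, v, 3, 4) == sum(x * y for x, y in zip(u, v))
# Example 8.2: summation as the special case of an all-ones vector u
assert inner_product_by_segmentation([1] * 4, v, 1, 4) == sum(v)
```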
This increase in the Boolean cost is minor if we multiply integers in nearly linear Boolean time (see the supporting algorithms for such multiplication in [135], [1, Section 7.5], [69]), but it grows if we multiply integers by applying the straightforward algorithm, which uses quadratic Boolean time. Nonetheless, in both cases one could still benefit from using Example 8.2 if the necessary bits of the output integer fit the computer word (i.e., the bits of the middle coefficient are not part of the overflow of the product), as long as the representation of the vector as an integer requires no additional cost. If the output integer does not fit the word length, we can apply the same algorithms to subproblems of smaller sizes, e.g., we can apply the algorithms of Examples 8.1 and 8.2 to compute the inner products of some subvectors or partial sums of integers, respectively.

Other applications of binary segmentation include polynomial multiplication (that is, the computation of the convolution of vectors) [67], [134], some basic linear algebra computations [121, Examples 40.1-40.3], polynomial division [17], [134], computing polynomial GCDs [37], and the discrete Fourier transform [134]. Binary segmentation can be potentially efficient in computations with Boolean vectors and matrices. E.g., recall that Boolean MM can be reduced to MM whose input and output entries are bounded nonnegative integers (see [1, Proof of Theorem 6.9]). Quantized tensor decompositions are another promising application area (cf. [148], [106], [107], [91], [109], [72]).

9 Summary of the decrease of the MM exponents

In 1979-81 and then again in 1986 the exponent of infeasible MM was significantly decreased based on the combination of trilinear aggregation, Disjoint MM, and APA techniques with unrestricted use of recursion. All supporting algorithms have been built on the top of the techniques of the preceding papers. More and more lenient basic bilinear/trilinear decompositions of small rank were chosen for Disjoint MM of small sizes and, since 1986, for small bilinear/trilinear problems similar to Disjoint MM. The transition back to MM relied on nested recursion, consistently intensified; consequently the acceleration of the straightforward algorithm began only with MM of astronomical sizes.

By 1987 the power of these techniques seemed to be exhausted, and the progress then stopped until 2010. Since then the record bound has been moving from 2.376 of [44] towards 2.37 at a snail's pace. That direction was prompted by the cited results of the seminal papers [14] by Dario Bini and [133] by Arnold Schönhage. Schönhage, however, concluded the introduction of [133] by pointing out that all new exponents of MM were just "of theoretical interest", because they were valid only for the inputs "beyond any practical size", and that "Pan's estimates of 1978 for moderate" input sizes were "still unbeaten". Actually, as we can see in Figure 1, the exponent 2.7962 of 1978 for MM(n) restricted to n ≤ 1,000,000 has been successively decreased in [114], [115], [117], and [118] (cf. also [119]), although by small margins. As of December 2016, the exponent 2.7734 of [118] is still record low for MM(n) with n ≤ 1,000,000. Figures 1 and 2 display the chronological decrease of the exponents of MM(n) for n ≤ 1,000,000 and for unrestricted n, respectively. The supporting algorithms of Figure 1 rely solely on trilinear aggregation, and the associated overhead constants are small. Some of these algorithms have been refined in [119], [94] and implemented in [87] and [88]. All other algorithms of Figure 2 (not supporting Figure 1) employ trilinear aggregation as well (cf.
[44, page 255]), but they also employ other techniques and suffer from the curse of recursion.

The figures link each exponent to its recorded publication in a journal, a conference proceedings, or as a research report. As we already mentioned, the MM exponent of [113] was significantly decreased in 1979-1981 and 1986. It was updated at least 4 times during the single year of 1979: reaching below the values 2.801 in February in the Research Report [115]; 2.7799 (as an APA MM exponent) in [15] in June and (as an MM exponent) in [14]; 2.548 in [133]; and 2.522 in [117]. Both of the latter two exponents appeared in the book of abstracts of the conference on Computational Complexity in Oberwolfach, West Germany, organized by Schnorr, Schönhage and Strassen in October (cf. [119, page 199] and [133]). The exponent 2.496 of [43] was reported in October 1981 at the IEEE FOCS'81, and in Figure 1 we place it after the exponent 2.517 of the paper [131] of 1982, which was submitted in March 1980. The Research Report version of the paper [44] appeared in August of 1986, but in Figure 2 we place [44] after the paper [146], published in October of 1986 in the Proceedings of the IEEE FOCS, because the paper [146] had been submitted to FOCS'86 in the Spring of 1986 and was widely circulated afterwards. One could complete the historical account of Figure 2 by including the exponents 2.7804 (announced in the Fall of 1978 in [113] and superseded in February 1979, when the paper [115] was submitted) and 2.5218007, which decreased the exponent 2.5218128 of [114] and appeared at the end of the final version of [133] in 1981, that is, before the publication, but after the submission of the exponent 2.517 of [131].

We refer the reader to [41], [99], [42], [79], [89], [96], and the references therein for similar progress in the asymptotic acceleration of rectangular MM.

Figure 1: MM(n) exponents for n ≤ 1,000,000.

10 Applications of Fast MM, Its Links to Other Subject Areas, Impacts of Its Study, and Implementation (Briefly)

The decrease of the exponent of MM implies theoretical acceleration of the solution of a number of important problems in various areas of computations in Algebra and Computer Science, such as Boolean MM, computation of paths and distances in graphs, parsing context-free grammars, the solution of a nonsingular linear system of equations, computations of the inverse, determinant, characteristic and minimal polynomials, and various factorizations of a matrix. See [142], [34], [1, Sections 6.3-6.6], [24, pages 49-51], [18, Chapter 2], [3], [79], [48], [98], [160], [157], [158], [159], [86], [25], [89], [6], [54], [156], [132], [96], [4], [138], [140], [103], [104], [105], [125], and the bibliography therein, and notice that some new important applications have been found very recently, e.g., in 4 papers at ISSAC 2016.

The reduction of other computational problems to MM increases the overhead, which is already immense for the algorithms supporting the record exponent of MM. Such a reduction, however, can still be valuable, because it reveals some links of independent interest and because by applying fast algorithms or even the straightforward algorithm for feasible MM one can take advantage of using block matrix software and parallel acceleration.

Research on fast feasible MM has had a variety of valuable links to other subject areas. The work on bilinear and trilinear decompositions for fast MM was an important part of the study of such decompositions in the general area of Algebraic Computations and has led to new insights and techniques.
• We have already cited the historical importance of the demonstration in 1972 in [112] of the power of tensor decompositions and the valuable applications of duality to the design of efficient bilinear algorithms.

• Trilinear aggregation was also a surprising demonstration of the power of aggregation/disaggregation methods.

• More applications of this kind can follow in the future, such as the quite unexpected application of APA techniques to the computation of inner products and summation presented in Section 8.

• It may be surprising, but apart from trilinear aggregation and the APA method, the advanced and amazing techniques developed for decreasing the exponent of infeasible MM have had no theoretical (as well as practical) applications to other areas of computations or algebra (and have made no impact on the actual work on MM as well).

Figure 2: MM(n) exponents for unrestricted n.

Of course, such impacts on the practice of performing this fundamental operation of modern computations are the main motivation and goal of the study of fast MM, and indeed recursive bilinear algorithms based on 2 x 2 MM are behind the fastest implementations of exact MM to date, as we detail in the next section.

11 Fast methods in exact computational linear algebra

Next we discuss applications and implementations of exact fast matrix multiplication (MM). As we mentioned in Section 1.1, we first review how most of exact linear algebra can be reduced to MM over small finite fields. Then we highlight the differences in the design of approximate and exact implementations of MM, taking into account nowadays processor and memory hierarchies.

11.1 Reductions of exact linear algebra to MM

The design of matrix multiplication routines over a word-size finite field is the main building block for computations in exact dense linear algebra, represented with coefficient domains of two kinds: word-size discrete entries (mainly in finite fields of small cardinality) and variable-size entries (larger finite fields, integers, or polynomials). Indeed, efficient methods for the latter task usually reduce it to the former one: reductions are made by means of evaluation/interpolation (Chinese remaindering over the integers) or by means of lifting small-size solutions (via Hensel-like or high-order lifting [139]); see, e.g., [85, 63, 124] for recent surveys.

Over word-size finite fields, efficient implementations of MM have been obtained by means of effective algorithmic reductions originated in the complexity analysis. In this case the reductions have been obtained directly to MM. Already Strassen in 1969 extended MM to matrix inversion (we sketch the underlying reductions in Examples 11.1 and 11.2 below); then Bunch and Hopcroft [34] extended MM to invertible LUP factorization, and further extensions followed in the eighties, for instance, in [82], to the computation of the matrix rank and of Gaussian elimination. Efficient algorithms whose complexity is sensitive to the rank or to the rank profile have only very recently been discovered [83, 64, 140]. One of the latest reductions to MM was that of the characteristic polynomial, which has extended the seminal work of [90] more than thirty years afterwards [126].

Example 11.1. Triangular system solving by means of reduction to matrix multiplication.

Denote by X = TRSM(U, B) the solution to the linear matrix equation UX = B with a matrix B on the right-hand side and an upper triangular invertible matrix U. Recursively cut the matrices U, B and X in halves as follows:

U = [U_1 V; 0 U_2], X = [X_1; X_2], and B = [B_1; B_2];

then obtain an efficient reduction of the solution to MM by means of the following algorithm:
1. Recursively compute X_2 = TRSM(U_2, B_2);
2. Compute B'_1 = B_1 - V X_2; // via fast MM
3. Recursively compute X_1 = TRSM(U_1, B'_1).

The only operations performed are fast MMs. Asymptotically the low complexity is preserved, and in practice this reduction can be made very efficient, even in the case of exact computations, where intermediate reductions might occur, as shown, e.g., in [62].

Example 11.2. Reduction of LU factorization to MM.

For simplicity, consider an invertible matrix A having generic rank profile (that is, having all its leading principal minors also invertible). Recursively cut A in halves in both dimensions, that is, represent it as a 2 x 2 block matrix, A = [A_11 A_12; A_21 A_22]. Then an efficient triangularization A = LU can be computed by means of the following algorithm:

1. Recursively compute L_1 U_1 = A_11; // triangularization of the upper left block
2. Compute G = TRSM(U_1, A_21); // G is such that G U_1 = A_21
3. Compute H = TRSM(L_1, A_12); // H is such that L_1 H = A_12
4. Compute Z = A_22 - GH; // via fast MM
5. Recursively compute L_2 U_2 = Z;
6. Return L = [L_1 0; G L_2] and U = [U_1 H; 0 U_2].

Once again, the only operations performed in this algorithm are MMs, making it efficient as long as MM is efficient. This reduction to MMs remains efficient in the extension to the more general case where rank deficiencies are allowed and pivoting is applied [64].

Example 11.3. Sketch of linear system solving in arbitrary precision.

Next we accelerate the solution of a linear system of equations with arbitrary precision by using the two previous reductions to exact MM. The idea is to solve the linear system modulo a small prime p, by intensively using fast exact MM, and then to reuse this factorization in Hensel-like p-adic lifting, producing an iterative refinement of the solution modulo p^k. This leads us to the following algorithm. Successively compute:

1. L_p, U_p ≡ A mod p; // triangularization modulo a small prime
2. x_0 ≡ TRSM(U_p, TRSM(L_p, b) mod p) mod p;
3. b_1 = (b - A x_0)/p; // this computation is over Z
4. x_1 ≡ TRSM(U_p, TRSM(L_p, b_1) mod p) mod p;
5. b_2 = (b_1 - A x_1)/p; // this computation is over Z
6. x_2 ≡ TRSM(U_p, TRSM(L_p, b_2) mod p) mod p;
7. ...

One can easily show that b ≡ A(Σ_{i=0}^{k-1} x_i p^i) mod p^k.

Now recall that we can recover a unique rational number r = a/b from its p-adic series representation truncated at p^k as soon as k is such that p^k is larger than 2|a|b and the denominator b is coprime to p. Namely, by combining Cramer's rule and Hadamard's bound on determinants, we represent the entries of the solution vector as rational numbers with bounded numerators and denominators and then recover them from their truncated p-adic series representation by applying sufficiently many steps of the iterative refinement. More details on the actual values of p and k can be found in Dixon's seminal paper [51]. We can prove (and this is important for practical application) that the overhead of this exact iterative refinement is quite a small constant. We show this in Figure 3, for which we used the LinBox exact linear algebra library: despite quite large coefficient growth (for instance, up to forty thousand bits for the solution of a random integer linear system with 32-bit entries), the approximate and exact arbitrary-precision times remain essentially proportional.
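Returning to Example 11.1, here is a minimal Python sketch of ours of TRSM over Z/pZ; the routines of [62] are organized quite differently (in place, on top of the BLAS), so this only illustrates the reduction itself:

```python
import numpy as np

# dtype=object keeps exact Python integers inside numpy arrays.
def trsm_mod_p(U, B, p):
    """Solve U X = B mod p for an invertible upper-triangular U,
    reducing all the work to matrix multiplications."""
    n = U.shape[0]
    if n == 1:
        return (pow(int(U[0, 0]), -1, p) * B) % p   # invert the pivot mod p
    h = n // 2
    U1, V, U2 = U[:h, :h], U[:h, h:], U[h:, h:]
    X2 = trsm_mod_p(U2, B[h:], p)       # step 1: recurse on the lower block
    B1 = (B[:h] - V.dot(X2)) % p        # step 2: the only MM; fast MM fits here
    X1 = trsm_mod_p(U1, B1, p)          # step 3: recurse on the upper block
    return np.vstack([X1, X2])

p = 101
U = np.array([[3, 1, 4, 1], [0, 5, 9, 2], [0, 0, 6, 5], [0, 0, 0, 7]], dtype=object)
B = np.arange(8, dtype=object).reshape(4, 2)
X = trsm_mod_p(U, B, p)
assert ((U.dot(X) - B) % p == 0).all()
```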
We achieve good practical performance of computations in linear algebra by extensively applying reductions to MM, which we mostly perform exactly over small finite fields. (Hensel-like p-adic lifting is a major tool of computer algebra, with a striking similarity to the classical iterative refinement of numerical linear algebra. The recovery of a rational r from its truncated p-adic series is called rational number reconstruction and relies on an accelerated Euclidean algorithm; see, e.g., [70]. The LinBox library is available at https://github.com/linbox-team/linbox.)

Figure 3: Solving random 32-bit rational linear systems: exact arbitrary-precision solve timing and solution size versus approximate BLAS solve timing, as functions of the matrix dimension (axes: time (s), size (KB)).

On the one hand, nowadays machine architectures, with their memory hierarchy, are well-adapted to the highly homogeneous structure of the straightforward MM. This is true for numerical routines, where the stability issues have been worked out for fast MM in [16, 73, 50, 8]. This is also true for exact routines where, symmetrically, the costly reductions (modular or polynomial) have to be delayed as much as possible.

On the other hand, this straightforward algorithm remains faster in practice only for small matrices, and the larger memory capacity and the increase in the number of computing units make it possible to handle, on desktop computers, dense matrices of dimensions in the several hundred thousands. As the practical threshold between the straightforward and fast algorithms is often at matrices of dimensions about 500, several fast routines can be very effectively combined, yielding the fastest implementations to date, as explained in the next section.

11.2 The design of fast exact MM over word-size finite fields

The design of matrix multiplication routines over a word-size finite field is the main building block for computations in exact dense linear algebra. In practice, a turning point has been the introduction of the following principles in [58]:

1. finite field arithmetic is reduced to integer arithmetic with delayed or simultaneous modular reductions;
2. integer arithmetic is performed by floating point units (in order to take advantage of SIMD instructions and of the developments in numerical routines - the BLAS);
3. computations are structured in blocks in order to optimize the use of the memory hierarchy of current architectures;
4. asymptotically fast algorithms are used, mostly the recursive bilinear algorithm for MM based on Winograd's 2 x 2 MM of Example 2.2, hereafter referred to as the WRB-MM algorithm.

Definition 11.1.
(i) Delayed reduction is the action of replacing a classical dot product with reductions, Σ_{i=0}^{n} Reduce(a_i b_i), by an algorithm that reduces only from time to time, for instance, when the internal representation cannot hold the growth anymore:
• Compute s = Σ_{i=0}^{k} a_i b_i; Reduce(s);
• For j = 1 to n/k - 1: compute s = s + Σ_{i=jk+1}^{(j+1)k} a_i b_i; Reduce(s);
• done.
(ii) Simultaneous modular reduction is the action of performing a single reduction on several coefficients at once: in other words, the idea is to replace independent reductions (for instance Reduce(a); Reduce(b); Reduce(c)) by a single reduction on grouped elements (use t = Group(a, b, c); Reduce(t); a, b, c = Ungroup(t) instead); see, e.g., [56] for more details.
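Definition 11.1(i) can be sketched as follows (illustrative Python of ours; with Python's unbounded integers the reductions only keep the operands small, whereas over floating point or machine-integer types the block size k must obey the growth bound discussed below):

```python
# Delayed modular reduction in a dot product over Z/pZ: reduce once per
# block of k accumulated products instead of once per product.
def dot_delayed(a, b, p, k):
    s = 0
    for j in range(0, len(a), k):
        for i in range(j, min(j + k, len(a))):
            s += a[i] * b[i]         # accumulate k products without reduction
        s %= p                       # one reduction per block of k terms
    return s

p = 131071                           # 2^17 - 1
a = list(range(1, 1001)); b = list(range(1001, 2001))
assert dot_delayed(a, b, p, 512) == sum(x * y for x, y in zip(a, b)) % p
```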
Moreover, in practice, for exact computations, recursion is important when fast algorithms are used: while tiling allows a static placement of blocks adapted to the memory hierarchy, larger blocks allow faster acceleration but also more delays in the application of the reductions (modular reduction, polynomial reduction, etc.) [60]. Further, the fast algorithms can be used for a few levels of recursion, with a switch back to the statically placed straightforward algorithm on smaller matrices. Eventually (nowadays, on a current personal computer, this typically happens between n = 500 and n = 1000), the exact algorithm sketched above runs on top of the numerical BLAS and is applied without any further recursion.

There is actually a cascade of algorithms where, at each recursive cutting, a new choice of the best-suited variant is performed before the final switch [68, 27]. This cascade is usually non-stationary, i.e., a different scheme, bilinear or not, can be chosen at every level of recursion. For instance, a meta-algorithm starts by partitioning the matrices into four blocks, followed by an application of WRB-MM. This meta-algorithm is called on each of the seven multiplications of the blocks; at the recursive call the meta-algorithm can decide whether to re-apply itself to a 2 x 2 partition or to switch to another variant.

For a prime field modulo p, with elements represented as integers in {-(p-1)/2, ..., (p-1)/2} and stored in a type with a mantissa of m bits, the condition is that the modular reduction in a scalar product of dimension k can be delayed to the end as long as k(p-1)^2 < 2^m. For instance, for p = 131071 and double precision (m = 53), the reductions in a classical dot product can be delayed for about k ≈ 5 x 10^5 terms. When applying l recursive levels of the WRB-MM algorithm, it can be shown instead that some intermediate computations could grow above this bound [62], and the latter condition becomes 9^l floor(k/2^l) (p-1)^2 < 2^m. This requires performing a factor of about (9/2)^l more modular reductions.

An example of the sequential speed obtained by the fgemm routine of the LinBox library [57] is shown in Table 11.1.

Table 11.1: Effective Gfops (2n^3/time/10^9) of matrix multiplication: LinBox fgemm vs. OpenBLAS (s|d)gemm on one core of a Xeon E5-4620 @2.20GHz, for increasing dimension n (entries marked ... were lost in extraction):

  OpenBLAS sgemm, O(n^3):           ...
  fgemm mod 37, O(n^3):             21.90  24.93  26.93  28.10  28.62
  fgemm mod 37, O(n^2.81):          22.32  27.40  ...
  OpenBLAS dgemm, O(n^3):           ...
  fgemm mod 131071, O(n^3):         15.69  16.20  16.40  16.43  16.47
  fgemm mod 131071, O(n^2.81):      ...

The efficiency of the fgemm routine is largely due to the efficiency of the BLAS, but as the latter do not use fast algorithms, exact computations can be faster. Modulo 37, elements are stored over single precision floats and the sgemm subroutine can be used, whereas modulo 131071 elements are stored using double precision floats and the dgemm subroutine is used. (These routines are provided within the FFLAS-FFPACK module [62] of LinBox, https://github.com/linbox-team/fflas-ffpack.) Table 11.1 first shows that the overhead of performing the modular reductions in the O(n^3) implementations is very limited if the matrix is large enough. Then, when enabling the WRB-MM O(n^2.81) algorithm, a speed-up of up to 40% can be attained in both single and double precision arithmetic. More recently, it has also been shown how the algorithm [15] by Bini et al. can be efficiently put into practice [26], also offering some interesting speed-ups for prime fields of size near 10 bits.

11.3 Memory-efficient schedules

The WRB-MM algorithm requires external temporary memory allocations in order to store the intermediate linear combinations of blocks.
The WRB-MM algorithm requires extra temporary memory allocations in order to store the intermediate linear combinations of blocks. With large matrices and current architectures, this can be penalizing because memory availability and memory access dominate the overall computational costs. It is therefore crucial to reduce as much as possible the extra memory requirements of fast methods. This can be done via a careful scheduling of the steps of the WRB-MM algorithm: it is not mandatory to perform the 8 pre-additions, 7 multiplications and 7 post-additions in that order, with temporary memory allocations for each of these 22 steps. Depending on the associated dependency graph, one can choose instead to reduce the number of allocations by following this graph and overwriting already allocated memory when the associated variable is not used anymore.

For the product of $n \times n$ matrices without accumulation, $C \leftarrow A \times B$, [52] proposed a schedule requiring, apart from $C$, two extra temporary blocks of size $n/2 \times n/2$ at the first recursive level, two extra temporary blocks of size $n/4 \times n/4$ for each of the seven recursive calls, etc. Overall, the needed extra memory is bounded by $2n^2/3$. For the product with accumulation, $C \leftarrow C + A \times B$, for more than ten years the record was three temporaries with an extra memory bounded by $n^2$ [81], but this was recently improved to two temporaries in [28]. Notice that [28] also proposed some schemes requiring smaller extra memory (that can actually be made quite close to zero), with the same asymptotic complexity as the WRB-MM algorithm, although with a larger constant overhead factor in arithmetic operations. Recently, M. Bodrato proposed a variant of the WRB-MM algorithm which is symmetric and more suitable for squaring matrices, but which uses similar schedules and therefore keeps the same extra requirements [22]. This is not the case in [80], where no extra memory is required if only one or two recursive levels of fast MM are used, but at the cost of recomputing many additions. Finally, in the case of the APA algorithm [15] by Bini et al., the requirements are also of two temporaries for the product without accumulation [26]. Since [81], new schedules have usually been discovered by hand, but with the help of a pebble game program, which rapidly discards wrong schedules and formally verifies the correct ones.

The practical efficiency of MM depends greatly on the representation of field elements. Thus we present three kinds of compact representations for the elements of a finite field with very small cardinality: bit-packing (for $\mathbb{F}_2$), bit-slicing (say, for $\mathbb{F}_3$, $\mathbb{F}_5$, or $\mathbb{F}_7$), and Kronecker substitution. These representations are designed to allow efficient linear algebra operations, including MM:

Definition 11.2. Compact representations for small finite fields.
(i) Over $\mathbb{F}_2$, the method of the four Russians [7], also called Greasing, can be used as follows [2]:
• A 64-bit machine word can be used in order to represent a row vector of dimension 64.
• Multiplication of an $m \times k$ matrix $A$ by a $k \times n$ matrix $B$ can be done by first storing all $2^k$ linear combinations of the rows of $B$ in a table. Then the $i$-th row of the product is copied from the row of the table indexed by the $i$-th row of $A$.
• By ordering the indices of the table according to a binary Gray code, each row of the table can be deduced from the previous one by using only one row addition. This decreases the bit-operation count for building the table from $k 2^k n$ to $2^k n$.
Choosing $k = \log_2 n$ in this method implies $MM(n) = O(n^3/\log n)$ over $\mathbb{F}_2$.
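A minimal C++ sketch of this table construction (our illustration, not the code of [2]): B is given as one 64-bit word per row (so n ≤ 64), g rows are combined at a time, and __builtin_ctzll is the GCC/Clang count-trailing-zeros intrinsic. Each entry is derived from a previously computed one by a single word XOR, matching the Gray-code operation count above.

    #include <cstdint>
    #include <vector>

    // Greasing table over F2: T[v] is the XOR of the rows B[j] for the bits j
    // set in v. Each of the 2^g entries costs one word XOR.
    std::vector<uint64_t> greasing_table(const uint64_t* B, unsigned g) {
        std::vector<uint64_t> T(uint64_t(1) << g, 0);
        for (uint64_t v = 1; v < T.size(); ++v) {
            unsigned j = __builtin_ctzll(v);          // lowest set bit of v
            T[v] = T[v ^ (uint64_t(1) << j)] ^ B[j];  // previous entry + one row
        }
        return T;
    }
    // Row i of the product A*B is then the single table lookup T[row_i(A)].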
In practice, the idea is once again to use a cascading algorithm: at first, some recursive steps of fast MM are performed and then, at a small enough size, one switches to the greasing.
(ii) Bit-slicing consists in representing an $n$-dimensional vector of $k$-bit sized coefficients by using $k$ binary vectors of dimension $n$ [23]. In particular, one can apply Boolean word instructions in order to perform arithmetic on 64-dimensional vectors.
• Over $\mathbb{F}_3$, the binary representation $0 \equiv [0,0]$, $1 \equiv [1,0]$, $-1 \equiv [1,1]$ allows us to add and subtract two elements in 6 Boolean operations (a code sketch is given at the end of this subsection):
Add($[x_0, x_1]$, $[y_0, y_1]$): $s \leftarrow x_0 \oplus y_1$, $t \leftarrow x_1 \oplus y_0$; Return $((s \oplus x_1) \vee (t \oplus y_1),\ s \wedge t)$;
Sub($[x_0, x_1]$, $[y_0, y_1]$): $t \leftarrow x_0 \oplus y_0$; Return $(t \vee (x_1 \oplus y_1),\ (t \oplus y_1) \wedge (y_0 \oplus x_1))$.
• Over $\mathbb{F}_5$ (resp. $\mathbb{F}_7$), a redundant representation $x = x_0 + 2 x_1 + 4 x_2 \equiv [x_0, x_1, x_2]$ allows us to add two elements by using 20 (resp. 17) Boolean operations, negate in 6 (resp. 3) Boolean operations, and double by using 5 (resp. 0) Boolean operations.

Table 11.2: Boolean operation counts for basic arithmetic by using bit-slicing.

             F_3   F_5   F_7
  Addition    6    20    17
  Negation    1     6     3
  Double      1     5     0

(iii) Bit-packing consists in representing a vector of field elements as an integer that fits in a single machine word, by using a $2^k$-adic representation:
$(x_0, \ldots, x_{n-1}) \in \mathbb{F}_q^n \equiv X = x_0 + 2^k x_1 + \cdots + (2^k)^{n-1} x_{n-1} \in \mathbb{Z}$.
Elements of extension fields are viewed as polynomials and stored as the evaluation of this polynomial at the characteristic of the field. The latter evaluation is called Kronecker substitution [55]. Once we can pack and simultaneously reduce coefficients of the finite field in a single machine word, the obtained parallelism can be used for MM. Depending on the respective sizes of the matrices in the multiplication, one can pack only the left operand, only the right one, or both [56]. Then, over the field extensions, fast floating point operations can also be used on the Kronecker substitution of the elements.
All these methods in fact improve the base case of dense linear algebra, where fast methods are not competitive anymore. As already mentioned, the generic cascading idea applies: perform at first some fast recursive steps, decreasing the matrix dimension, and, at a small enough size, switch to the (improved) classical triple loop method.
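As promised above, here is a minimal C++ sketch of the bit-sliced $\mathbb{F}_3$ addition (our own code, written from the 6-operation formulas of Definition 11.2(ii); the type and function names are hypothetical), processing 64 field elements per word operation:

    #include <cstdint>

    // Bit-sliced F3 vectors: bit i of (x0, x1) encodes element i,
    // with 0 = [0,0], 1 = [1,0], -1 = [1,1].
    struct SlicedF3 { uint64_t x0, x1; };

    // Six Boolean word operations add 64 pairs of F3 elements at once.
    SlicedF3 add3(SlicedF3 x, SlicedF3 y) {
        uint64_t s = x.x0 ^ y.x1;
        uint64_t t = x.x1 ^ y.x0;
        return { (s ^ x.x1) | (t ^ y.x1),  // "nonzero" bits of the 64 sums
                 s & t };                  // "negative" bits of the 64 sums
    }

Subtraction follows the Sub formulas above in the same way, and the 64-bit words can be replaced by wider SIMD types.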
Now we focus on the design of a parallel MM routine, which computes the matrix product $AB$ based on the WRB-MM sequential algorithm. In order to parallelize the computation at the coarsest grain, the best approach is to apply first a classical block algorithm, generating a prescribed number of independent tasks, each of which then uses the sequential WRB-MM algorithm [60, 59]. There, the parallel algorithm is recursive and splits the largest of either the row dimension of $A$ or the column dimension of $B$, to form two independent tasks. The granularity of the split is recursive and terminates whenever the number of created tasks becomes larger than the number of computing resources (e.g., the total number of cores). This maximizes the size of the blocks, and therefore the benefit of the WRB-MM algorithm, while ensuring a large enough number of tasks for the computing resources.

Figure 4: Speed of exact and numerical matrix multiplication routines on a 32 cores Intel Xeon E5-4620 @2.2GHz (Sandy Bridge) with 16384KB L3 cache [59]. [Plot: effective Gfops ($2n^3$/seconds/$10^9$) against the matrix dimensions, for the straightforward MM peak performance, LinBox WRB-MM modulo 131071 (pfgemm), and the straightforward MM (dgemm) of MKL, OpenBLAS, and Plasma-Quark.]

Figure 4 shows the computation time of various MM algorithms: the numerical dgemm implementation of Plasma-Quark, OpenBLAS and Intel-MKL, as well as the implementation of pfgemm of LinBox using the OpenMP-4.0 data-flow model. Contrary to MKL, OpenBLAS or Plasma-Quark, this pfgemm routine uses the above sketched splitting strategy with WRB-MM. This implementation is run over the finite field Z/131071Z or with real double floating point numbers. We first notice that most routines perform very similarly. More precisely, Intel-MKL dgemm is faster on small matrices, but the effect of the WRB-MM algorithm makes pfgemm faster on larger matrices, even over the finite field, where additional modular reductions occur.

12 Some Research Challenges

The field is still in progress, and here are some current observations.
(i) The implementations of the fast trilinear aggregation algorithms by Kaporin in [87] and [88] are relatively little applied in practice so far, although they have important advantages versus other algorithms now in use: being at least as fast and more stable numerically, and allowing highly efficient parallel implementation, they require much less memory. This is because the algorithms of [87] and [88] are defined by trilinear decompositions with supersparse coefficient matrices $A = (\alpha_{ij}^{(q)})_{(i,j),q}$, $B = (\beta_{jk}^{(q)})_{(j,k),q}$, and $C = (\gamma_{ik}^{(q)})_{(i,k),q}$ of (3.1), which is a general property of competently implemented trilinear aggregation algorithms. The implementations in [87] and [88] were intended to be initial rather than final steps and were supposed to leave some room for further amelioration. In particular the algorithm of [118], also relying on trilinear aggregation, has similar features, but supports an even smaller exponent, and can be a basis for further progress in the implementation of fast feasible MM.
(ii) The more recent algorithms of [136], obtained by means of computer search, are highly promising because they beat the exponent $\log_2(7)$ already for various small problems $MM(m, n, p)$ where $\min\{m, n, p\} = 2$ and $\max\{m, n, p\} \le 6$. Their coefficient matrices $A$, $B$, and $C$ of (3.1) are quite dense, however, and their tested performance is inferior to recursive bilinear algorithms based on Examples 2.2 and 2.3. This leads to the challenge of devising MM algorithms that would combine the best features of the algorithms of [87], [88], and [136].
(iii) Numerical instability of APA algorithms does not prevent them from being efficient in symbolic computations, but so far only the rudimentary algorithm of [15] has been implemented [26], while much simpler and much more efficient ones are ignored (cf., e.g., our algorithm in Section 7).
Some researchers in matrix computations still view decreasing the current record MM exponent of about 2.37 towards the lower bound 2 as the major theoretical challenge.
For the state of affairs in this area we refer the reader to our Figure 2 and the mostly negative results in [4], [5], and [21]. We, however, consider breaking the barrier of 2.7733 for the realistic exponent of $MM(n)$, $n \le 1{,}000{,}000$, a more important challenge; the decrease of that exponent should require more powerful tools than the acceleration of infeasible MM because of the limitation on the use of recursive bilinear algorithms. We hope that this important challenge will be met in reasonable time, possibly based on a combination of human ingenuity and computer search, and then a significant impact on present day computations should be expected, whereas reaching the exponent 2 for MM of infeasible sizes per se would hardly make such an impact. (Computer search has already helped the authors of [15], [22], and [149] in devising their algorithms; the title of [149] is a little deceptive.)
In view of the microscopic progress in the decrease of the exponent of infeasible MM, the present day research directions towards that goal seem to lead to various dead ends, and any potential progress shall most likely rely on radically new insights, ideas and techniques, such as aggregation and mapping the input and output of MM to another dimension (cf. [112], [118], [119] and [94]).
In the area of the implementation of MM, further progress can be expected from recursive bilinear algorithms based on fast 2 × 2 and 3 × 3 MM.

Acknowledgments

Jean-Guillaume Dumas's work is partially supported by the OpenDreamKit Horizon 2020 European Research Infrastructures project.

References

[1] A.V. Aho, J.E. Hopcroft, J.D. Ullman, The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, MA, 1974.
[2] M.R. Albrecht, G.V. Bard, W. Hart, Algorithm 898: Efficient Multiplication of Dense Matrices over GF(2). ACM Transactions on Mathematical Software, 37(1), 2010. doi:10.1145/1644001.1644010.
[3] …, 54, 2, 255–262, 1997.
[4] N. Alon, A. Shpilka, C. Umans, On Sunflowers and Matrix Multiplication. Computational Complexity, 22, 2, 219–243, 2013.
[5] A. Ambainis, Y. Filmus, F. Le Gall, Fast Matrix Multiplication: Limitations of the Laser Method. Electronic Colloquium on Computational Complexity (ECCC), 2014, paper 154. http://eccc.hpi-web.de/report/2014/154. Also available as an arXiv preprint.
[6] R.R. Amossen, R. Pagh, Faster Join-Projects and Sparse Matrix Multiplications. Proceedings of the 12th International Conference on Database Theory, 121–126, 2009.
[7] V.L. Arlazarov, E.A. Dinic, M.A. Kronrod, I.A. Faradžev, The Economical Construction of the Transitive Closure of an Oriented Graph. Doklady Akademii Nauk SSSR, 194, 487–488, 1970.
[8] G. Ballard, A.R. Benson, A. Druinsky, B. Lipshitz, O. Schwartz, Improving the Numerical Stability of Fast Matrix Multiplication Algorithms. SIMAX, in print; also available as an arXiv preprint.
[9] G. Ballard, E. Carson, J. Demmel, M. Hoemmen, N. Knight, O. Schwartz, Communication Lower Bounds and Optimal Algorithms for Numerical Linear Algebra. Acta Numerica, 23, 1–155, 2014.
[10] G. Ballard, J. Demmel, O. Holtz, O. Schwartz, Graph Expansion and Communication Costs of Fast Matrix Multiplication. Journal of the ACM, 59, 6, Article 32, 2012. doi:10.1145/2395116.2395121.
[11] G. Ballard, A. Druinsky, N. Knight, O. Schwartz, Hypergraph Partitioning for Sparse Matrix-Matrix Multiplication. arXiv preprint.
[12] …, SIAM J. on Scientific and Statistical Computing, 9, 3, 603–607, 1988.
[13] A.R. Benson, G. Ballard, A Framework for Practical Parallel Fast Matrix Multiplication. Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 42–53, ACM Press, New York, 2015.
[14] D. Bini, Relations Between Exact and Approximate Bilinear Algorithms: Applications. Calcolo, 17, 1, 87–97, 1980.
[15] D. Bini, M. Capovani, G. Lotti, F. Romani, O(n^{2.7799}) Complexity for n × n Approximate Matrix Multiplication.
Information Processing Letters, 8, 5, 234–235, June 1979.
[16] D. Bini, G. Lotti, Stability of Fast Algorithms for Matrix Multiplication. Numerische Math., 36, 1, 63–72, 1980.
[17] D. Bini, V.Y. Pan, Polynomial Division and Its Computational Complexity. J. Complexity, 2, 3, 179–203, 1986.
[18] D. Bini, V.Y. Pan, Polynomial and Matrix Computations, Volume 1: Fundamental Algorithms. Birkhäuser, Boston, 1994.
[19] M. Bläser, Lower Bounds for the Multiplicative Complexity of Matrix Multiplication. J. of Computational Complexity, 9, 2, 73–112, 2000.
[20] M. Bläser, A 5/2 n²-Lower Bound for the Multiplicative Complexity of n × n Matrix Multiplication over Arbitrary Fields. Proc. 40th Ann. IEEE Symp. on Foundations of Computer Science (FOCS 1999), 45–50, IEEE Computer Society Press, Los Alamitos, CA, 1999.
[21] J. Blasiak, T. Church, H. Cohn, J.A. Grochow, E. Naslund, W.F. Sawin, C. Umans, On Cap Sets and the Group-theoretic Approach to Matrix Multiplication. arXiv preprint.
[22] M. Bodrato, A Strassen-like Matrix Multiplication Suited for Squaring and Higher Power Computation. Proceedings of the 2010 International Symposium on Symbolic and Algebraic Computation (ISSAC '10), 273–280, ACM Press, New York, 2010. doi:10.1145/1837934.1837987.
[23] T.J. Boothby, R.W. Bradshaw, Bitslicing and the Method of Four Russians over Larger Finite Fields. arXiv preprint, January 2009.
[24] A. Borodin, I. Munro, The Computational Complexity of Algebraic and Numeric Problems. American Elsevier, New York, 1975.
[25] A. Bostan, C.-P. Jeannerod, É. Schost, Solving Structured Linear Systems with Large Displacement Rank. Theoretical Computer Science, 155–181, 2008. Proceedings version in ISSAC'07, 33–40, ACM Press, New York, 2007.
[26] B. Boyer, J.-G. Dumas, Matrix Multiplication over Word-size Modular Fields Using Approximate Formulae. ACM Transactions on Mathematical Software, 42(3), 20.1–20.12, 2016. https://hal.archives-ouvertes.fr/hal-00987812.
[27] B. Boyer, J.-G. Dumas, P. Giorgi, C. Pernet, B.D. Saunders, Elements of Design for Containers and Solutions in the LinBox Library. In H. Hong and C. Yap, editors, Mathematical Software, ICMS 2014, Lecture Notes in Computer Science 8592, 654–662, Springer, Berlin Heidelberg, 2014. doi:10.1007/978-3-662-44199-2_98.
[28] B. Boyer, J.-G. Dumas, C. Pernet, W. Zhou, Memory Efficient Scheduling of Strassen–Winograd's Matrix Multiplication Algorithm. Proc. Intern. Symposium on Symbolic and Algebraic Computation (ISSAC 2009), 55–62, ACM Press, New York, 2009.
[29] R.W. Brockett, D. Dobkin, On Optimal Evaluation of a Set of Bilinear Forms. Proc. of the 5th Annual Symposium on the Theory of Computing (STOC 1973), 88–95, ACM Press, New York, 1973.
[30] R.W. Brockett, D. Dobkin, On the Number of Multiplications Required for a Matrix Multiplication. SIAM Journal on Computing, 5, 4, 624–628, 1976.
[31] R.W. Brockett, D. Dobkin, On Optimal Evaluation of a Set of Bilinear Forms. Linear Algebra and Its Applications, 19, 3, 207–235, 1978.
[32] N.H. Bshouty, A Lower Bound for Matrix Multiplication. SIAM J. on Computing, 18, 4, 759–765, 1989.
[33] N.H. Bshouty, On the Additive Complexity of 2 × 2 Matrix Multiplication. Information Processing Letters, 56, 6, 329–335, 1995.
[34] J.R. Bunch, J.E. Hopcroft, Triangular Factorization and Inversion by Fast Matrix Multiplication. Mathematics of Computation, 28, 125, 231–236, 1974.
[35] P. Bürgisser, M. Clausen, M.A. Shokrollahi, Algebraic Complexity Theory. Springer, Berlin, 1997.
[36] M. Cenk, M.A. Hasan, On the Arithmetic Complexity of Strassen-Like Matrix Multiplications. J. of Symbolic Computation, 2016, in press. doi:10.1016/j.jsc.2016.07.004.
[37] B.W. Char, K.O. Geddes, G.H. Gonnet, GCDHEU: Heuristic Polynomial GCD Algorithm Based on Integer GCD Computation. Proceedings of EUROSAM'84, Lecture Notes in Computer Science, 285–296, Springer, New York, 1984.
[38] H. Cohn, R. Kleinberg, B. Szegedy, C. Umans, Group-theoretic Algorithms for Matrix Multiplication. Proceedings of the 46th Annual Symposium on Foundations of Computer Science (FOCS 2005), Pittsburgh, PA, 379–388, IEEE Computer Society Press, 2005.
[39] H. Cohn, C. Umans, A Group-theoretic Approach to Fast Matrix Multiplication. Proceedings of the 44th Annual Symposium on Foundations of Computer Science (FOCS 2003), Cambridge, MA, 438–449, IEEE Computer Society Press, 2003.
[40] H. Cohn, C. Umans, Fast Matrix Multiplication Using Coherent Configurations. Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2013), 1074–1087, 2013.
[41] D. Coppersmith, Rapid Multiplication of Rectangular Matrices. SIAM Journal on Computing, 11, 3, 467–471, 1982.
[42] D. Coppersmith, Rectangular Matrix Multiplication Revisited. Journal of Complexity, 13, 1, 42–49, 1997.
[43] D. Coppersmith, S. Winograd, On the Asymptotic Complexity of Matrix Multiplication. SIAM J. on Computing, 11, 3, 472–492, 1982. Proc. version in FOCS 1981 (Nashville, TN), 82–90, IEEE Computer Society Press, 1981.
[44] D. Coppersmith, S. Winograd, Matrix Multiplication via Arithmetic Progressions. J. of Symbolic Computation, 9, 3, 251–280, 1990. Proc. version in STOC 1987 (New York, NY), 1–6, ACM Press, New York, 1987. Also Research Report RC 12104, IBM T.J. Watson Research Center, August 1986.
[45] P. D'Alberto, A. Nicolau, Adaptive Winograd's Matrix Multiplication. ACM Transactions on Mathematical Software, 36, 1, paper 3, 2009.
[46] A.M. Davie, A.J. Stothers, Improved Bound for Complexity of Matrix Multiplication. Proceedings of the Royal Society of Edinburgh, 351–370, 2013.
[47] H.F. de Groote, On Varieties of Optimal Algorithms for the Computation of Bilinear Mappings. Theoretical Computer Science, 7, 2, 127–148, 1978.
[48] C. Demetrescu, G.F. Italiano, Fully Dynamic Transitive Closure: Breaking Through the O(n²) Barrier. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science (FOCS 2000), 381–389, 2000.
[49] J. Demmel, I. Dumitriu, O. Holtz, Fast Linear Algebra Is Stable. Numerische Mathematik, 59–91, 2007.
[50] J. Demmel, I. Dumitriu, O. Holtz, R. Kleinberg, Fast Matrix Multiplication Is Stable. Numerische Mathematik, 199–224, 2007.
[51] J.D. Dixon, Exact Solution of Linear Equations Using p-adic Expansions. Numerische Mathematik, 40, 137–141, 1982.
[52] C.C. Douglas, M. Heroux, G. Slishman, R.M. Smith, GEMMW: A Portable Level 3 BLAS Winograd Variant of Strassen's Matrix-Matrix Multiply Algorithm. J. of Computational Physics, 1–10, 1994.
[53] C.-É. Drevet, Md. N. Islam, É. Schost, Optimization Techniques for Small Matrix Multiplication. Theoretical Computer Science, 2219–2236, 2011.
[54] R. Duan, S. Pettie, Fast Algorithms for (max, min) Matrix Multiplication and Bottleneck Shortest Paths. Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2009), 384–391, 2009.
[55] J.-G. Dumas, Q-adic Transform Revisited. Proceedings of the 2008 International Symposium on Symbolic and Algebraic Computation (ISSAC 2008), 63–69, ACM Press, New York, 2008. doi:10.1145/1390768.1390780.
[56] J.-G. Dumas, L. Fousse, B. Salvy, Simultaneous Modular Reduction and Kronecker Substitution for Small Finite Fields. Journal of Symbolic Computation, 46(7), 823–840, 2011.
Special Issue in Honour of Keith Geddes on his 60th Birthday. doi:10.1016/j.jsc.2010.08.015.
[57] J.-G. Dumas, T. Gautier, M. Giesbrecht, P. Giorgi, B. Hovinen, E. Kaltofen, B.D. Saunders, W.J. Turner, G. Villard, LinBox: A Generic Library for Exact Linear Algebra. In ICMS'2002, Proceedings of the 2002 International Congress of Mathematical Software, 40–50, Beijing, China, 2002. http://ljk.imag.fr/membres/Jean-Guillaume.Dumas/Publications/icms.pdf.
[58] J.-G. Dumas, T. Gautier, C. Pernet, Finite Field Linear Algebra Subroutines. In T. Mora, editor, ISSAC'2002, Proceedings of the 2002 ACM International Symposium on Symbolic and Algebraic Computation, Lille, France, 63–74, ACM Press, New York, July 2002. http://ljk.imag.fr/membres/J.-G..Dumas/Publications/Field_blas.pdf.
[59] J.-G. Dumas, T. Gautier, C. Pernet, J.-L. Roch, Z. Sultan, Recursion Based Parallelization of Exact Dense Linear Algebra Routines for Gaussian Elimination. Parallel Computing, 57, 235–249, 2016. http://hal.archives-ouvertes.fr/hal-01084238.
[60] J.-G. Dumas, T. Gautier, C. Pernet, Z. Sultan, Parallel Computation of Echelon Forms. In Euro-Par 2014, Proceedings of the 20th International Conference on Parallel Processing, Porto, Portugal, Lecture Notes in Computer Science 8632, 499–510, August 2014. http://hal.archives-ouvertes.fr/hal-00947013.
[61] J.-G. Dumas, P. Giorgi, C. Pernet, FFPACK: Finite Field Linear Algebra Package. Proc. Intern. Symposium on Symbolic and Algebraic Computation (ISSAC 2004), J. Gutierrez, editor, 119–126, ACM Press, New York, 2004.
[62] J.-G. Dumas, P. Giorgi, C. Pernet, Dense Linear Algebra over Prime Fields. ACM Transactions on Mathematical Software, 35(3), 1–42, November 2008. http://hal.archives-ouvertes.fr/hal-00018223.
[63] J.-G. Dumas, C. Pernet, Computational Linear Algebra over Finite Fields. In D. Panario and G.L. Mullen, editors, Handbook of Finite Fields, Chapman & Hall/CRC, 2012. http://hal.archives-ouvertes.fr/hal-00688254.
[64] J.-G. Dumas, C. Pernet, Z. Sultan, Fast Computation of the Rank Profile Matrix and the Generalized Bruhat Decomposition. J. of Symbolic Computation, 2016. Proc. version in ISSAC 2015.
[65] C.M. Fiduccia, On Obtaining Upper Bounds on the Complexity of Matrix Multiplication. In Complexity of Computer Computations (edited by R.E. Miller, J.W. Thatcher, J.D. Bohlinger), IBM Research Symposia Series, 31–40, Plenum Press, New York, 1972.
[66] C.M. Fiduccia, Polynomial Evaluation via the Division Algorithm: The Fast Fourier Transform Revisited. Proc. 4th Annual ACM Symposium on Theory of Computing (STOC 1972), 88–93, ACM Press, New York, 1972.
[67] M.J. Fischer, M.S. Paterson, String-Matching and Other Products. SIAM–AMS Proceedings, 113–125, 1974.
[68] P.C. Fischer, Further Schemes for Combining Matrix Algorithms. Proceedings of the 2nd Colloquium on Automata, Languages and Programming, Lecture Notes in Computer Science, 428–436, Springer-Verlag, London, UK, 1974.
[69] M. Fürer, Faster Integer Multiplication. SIAM J. on Computing, 39, 3, 979–1005, 2009.
[70] J. von zur Gathen, J. Gerhard, Modern Computer Algebra. Cambridge University Press, Cambridge, UK, third edition, 2013.
[71] G.H. Golub, C.F. Van Loan, Matrix Computations. Johns Hopkins University Press, Baltimore, Maryland, fourth edition, 2013.
[72] L. Grasedyck, D. Kressner, C. Tobler, A Literature Survey of Low-rank Tensor Approximation Techniques. GAMM-Mitteilungen, 53–78, 2013.
[73] N.J. Higham, Exploiting Fast Matrix Multiplication within Level 3 BLAS. ACM Trans. on Math. Software, 16, 4, 352–368, 1990.
[74] J. van der Hoeven, G. Lecerf, G. Quintin, Modular SIMD Arithmetic in Mathemagix. ACM Transactions on Mathematical Software, 43(1), Article 5, August 2016. Also available as an arXiv preprint.
[75] J.-W. Hong, H.T. Kung, I/O Complexity: The Red-Blue Pebble Game. Proc. 13th STOC, 326–333, ACM Press, New York, 1981. doi:10.1145/800076.802486.
[76] J.E. Hopcroft, L.R. Kerr, Some Techniques for Proving Certain Simple Programs Optimal. Proceedings of the Tenth Annual Symposium on Switching and Automata Theory, 36–45, IEEE Computer Society Press, 1969.
[77] J.E. Hopcroft, L.R. Kerr, On Minimizing the Number of Multiplications Necessary for Matrix Multiplication. SIAM J. on Applied Math., 20, 1, 30–36, 1971.
[78] J.E. Hopcroft, J. Musinski, Duality Applied to Matrix Multiplication and Other Bilinear Forms. SIAM Journal on Computing, 2, 3, 159–173, 1973.
[79] X. Huang, V.Y. Pan, Fast Rectangular Matrix Multiplication and Applications. Journal of Complexity, 14, 2, 257–299, 1998. Proc. version in Proc. Annual ACM International Symposium on Parallel Algebraic and Symbolic Computation (PASCO'97), 11–23, ACM Press, New York, 1997.
[80] J. Huang, T.M. Smith, G.M. Henry, R.A. van de Geijn, Strassen's Algorithm Reloaded. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '16), 59:1–59:12, IEEE Press, Piscataway, NJ, USA, 2016.
[81] S. Huss-Lederman, E.M. Jacobson, J.R. Johnson, A. Tsao, T. Turnbull, Implementation of Strassen's Algorithm for Matrix Multiplication. In Supercomputing '96 Conference Proceedings, November 17–22, Pittsburgh, PA, ACM Press and IEEE Computer Society Press, 1996. doi:10.1145/369028.369096.
[82] O.H. Ibarra, S. Moran, R. Hui, A Generalization of the Fast LUP Matrix Decomposition Algorithm and Applications. Journal of Algorithms, 3(1), 45–56, March 1982. doi:10.1016/0196-6774(82)90007-4.
[83] C.-P. Jeannerod, C. Pernet, A. Storjohann, Rank-profile Revealing Gaussian Elimination and the CUP Matrix Decomposition. Journal of Symbolic Computation, 56, 46–68, 2013. doi:10.1016/j.jsc.2013.04.004.
[84] R.W. Johnson, A.M. McLoughlin, Noncommutative Bilinear Algorithms for 3 × 3 Matrix Multiplication. SIAM J. on Computing, 15, 2, 595–603, 1986.
[85] E. Kaltofen, A. Storjohann, Complexity of Computational Problems in Exact Linear Algebra. In Encyclopedia of Applied and Computational Mathematics, 227–233, Springer, November 2015. doi:10.1007/978-3-540-70529-1_173.
[86] H. Kaplan, M. Sharir, E. Verbin, Colored Intersection Searching via Sparse Rectangular Matrix Multiplication. In Proceedings of the 22nd ACM Symposium on Computational Geometry, 52–60, 2006.
[87] I. Kaporin, A Practical Algorithm for Faster Matrix Multiplication. Numerical Linear Algebra with Applications, 6, 8, 687–700, 1999.
[88] I. Kaporin, The Aggregation and Cancellation Techniques as a Practical Tool for Faster Matrix Multiplication. Theoretical Computer Science, 469–510, 2004.
[89] S. Ke, B. Zeng, W. Han, V.Y. Pan, Fast Rectangular Matrix Multiplication and Some Applications. Science in China, Series A: Mathematics, 51, 3, 389–406, 2008.
[90] W. Keller-Gehrig, Fast Algorithms for the Characteristic Polynomial. Theoretical Computer Science, 36(2–3), 309–317, June 1985. http://dl.acm.org/citation.cfm?id=3929.3939.
[91] B.N. Khoromskij, O(d log N)-Quantics Approximation of N-d Tensors in High-dimensional Numerical Modeling. Constructive Approximation, 257–280, 2011.
[92] D.E. Knuth, The Art of Computer Programming, Volume 2: Seminumerical Algorithms. Addison-Wesley, Reading, Massachusetts, 1969 (first edition), 1981 (second edition), 1997 (third edition).
[93] T.G. Kolda, B.W. Bader, Tensor Decompositions and Applications. SIAM Review, 51, 3, 455–500, 2009.
[94] J. Laderman, V.Y. Pan, H.X. Sha, On Practical Algorithms for Accelerated Matrix Multiplication. Linear Algebra and Its Applications, 557–588, 1992.
[95] J.M. Landsberg, New Lower Bound for the Rank of Matrix Multiplication. SIAM J. on Computing, 43, 1, 144–149, 2014.
[96] F. Le Gall, Faster Algorithms for Rectangular Matrix Multiplication. Proceedings of the 53rd Annual IEEE Symposium on Foundations of Computer Science (FOCS 2012), 514–523, IEEE Computer Society Press, 2012.
[97] F. Le Gall, Powers of Tensors and Fast Matrix Multiplication. Proceedings of the 39th International Symposium on Symbolic and Algebraic Computation (ISSAC 2014), 296–303, ACM Press, New York, 2014.
[98] L. Lee, Fast Context-free Grammar Parsing Requires Fast Boolean Matrix Multiplication. Journal of the ACM, 49, 1, 1–15, 2002.
[99] G. Lotti, F. Romani, On the Asymptotic Complexity of Rectangular Matrix Multiplication. Theoretical Computer Science, 171–185, 1983.
[100] A. Massarenti, E. Raviolo, Corrigendum to "The rank of n × n matrix multiplication is at least 3n² − 2√2 n^{3/2} − 3n" [Linear Algebra and its Applications (2013), 4500–4509]. Linear Algebra and its Applications, 369–371, 2014.
[101] W.L. Miranker, V.Y. Pan, Methods of Aggregations. Linear Algebra and Its Applications, 231–257, 1980.
[102] T.S. Motzkin, Evaluation of Polynomials and Evaluation of Rational Functions. Bull. of Amer. Math. Society, 61, 2, 163, 1955.
[103] V. Neiger, Fast Computation of Shifted Popov Forms of Polynomial Matrices via Systems of Modular Polynomial Equations. Proceedings of the International Symposium on Symbolic and Algebraic Computation (ISSAC'16), 365–372, ACM Press, New York, 2016.
[104] A. Neumaier, D. Stehlé, Faster LLL-type Reduction of Lattice Bases. Proceedings of the International Symposium on Symbolic and Algebraic Computation (ISSAC'16), 373–380, ACM Press, New York, 2016.
[105] J.S.R. Nielsen, A. Storjohann, Algorithms for Simultaneous Padé Approximation. Proceedings of the International Symposium on Symbolic and Algebraic Computation (ISSAC'16), 405–412, ACM Press, New York, 2016.
[106] I.V. Oseledets, Approximation of Matrices with Logarithmic Number of Parameters. Dokl. Math., 80, 2, 653–654, 2009.
[107] I.V. Oseledets, Approximation of 2^d × 2^d Matrices Using Tensor Decomposition. SIAM J. on Matrix Analysis and Applications, 2130–2145, 2010.
[108] I.V. Oseledets, E.E. Tyrtyshnikov, TT-cross Approximation for Multidimensional Arrays. Linear Algebra and Its Applications, 70–88, 2010.
[109] I.V. Oseledets, E.E. Tyrtyshnikov, Algebraic Wavelet Transform via Quantics Tensor Train Decomposition. SIAM J. Scientific Computing, 33, 3, 1315–1328, 2011.
[110] A.M. Ostrowski, On Two Problems in Abstract Algebra Connected with Horner's Rule. In Studies Presented to R. von Mises, 40–48, Academic Press, New York, 1954.
[111] V.Y. Pan, On Methods of Computing the Values of Polynomials. Uspekhi Matematicheskikh Nauk, 103–134, 1966. [Transl. Russian Mathematical Surveys, 105–137, 1966.]
[112] V.Y. Pan, On Schemes for the Evaluation of Products and Inverses of Matrices (in Russian). Uspekhi Matematicheskikh Nauk, 249–250, 1972.
[113] V.Y. Pan, Strassen's Algorithm Is Not Optimal.
Trilinear Technique of Aggregating for Fast Matrix Multiplication. Proc. of the 19th Annual IEEE Symposium on Foundations of Computer Science (FOCS'78), 166–176, IEEE Computer Society Press, Long Beach, California, 1978.
[114] V.Y. Pan, Fields Extension and Trilinear Aggregating, Uniting and Canceling for the Acceleration of Matrix Multiplication. Proceedings of the 20th Annual IEEE Symposium on Foundations of Computer Science (FOCS'79), 28–38, IEEE Computer Society Press, Long Beach, California, 1979.
[115] V.Y. Pan, New Fast Algorithms for Matrix Operations. SIAM J. on Computing, 9, 2, 321–342, 1980. Also Research Report RC 7555, IBM T.J. Watson Research Center, February 1979.
[116] V.Y. Pan, The Bit-Operation Complexity of the Convolution of Vectors and of the DFT. Technical Report 80-6, Computer Science Dept., SUNY, Albany, NY, 1980. (Abstract in Bulletin of the EATCS, page 95, 1981.)
[117] V.Y. Pan, New Combinations of Methods for the Acceleration of Matrix Multiplications. Computers and Mathematics (with Applications), 7, 1, 73–125, 1981.
[118] V.Y. Pan, Trilinear Aggregating with Implicit Canceling for a New Acceleration of Matrix Multiplication. Computers and Mathematics (with Applications), 8, 1, 23–34, 1982.
[119] V.Y. Pan, How Can We Speed up Matrix Multiplication? SIAM Review, 26, 3, 393–415, 1984.
[120] V.Y. Pan, Trilinear Aggregating and the Recent Progress in the Asymptotic Acceleration of Matrix Operations. Theoretical Computer Science, 33, 1, 117–138, 1984.
[121] V.Y. Pan, How to Multiply Matrices Faster. Lecture Notes in Computer Science 179, Springer, Berlin, 1984.
[122] V.Y. Pan, Better Late Than Never: Filling a Void in the History of Fast Matrix Multiplication and Tensor Decompositions. arXiv preprint.
[123] …, arXiv preprint.
[124] C. Pernet, High Performance and Reliable Algebraic Computing. Habilitation à diriger des recherches, Université Joseph Fourier, Grenoble 1, November 2014. https://tel.archives-ouvertes.fr/tel-01094212.
[125] C. Pernet, Computing with Quasiseparable Matrices. Proceedings of the International Symposium on Symbolic and Algebraic Computation (ISSAC'16), 389–396, ACM Press, New York, 2016.
[126] C. Pernet, A. Storjohann, Faster Algorithms for the Characteristic Polynomial. Proceedings of the 2007 International Symposium on Symbolic and Algebraic Computation (ISSAC '07), 307–314, ACM Press, New York, 2007. doi:10.1145/1277548.1277590.
[127] R.L. Probert, On the Additive Complexity of Matrix Multiplication. SIAM J. on Computing, 5, 2, 187–203, 1976.
[128] R.L. Probert, On the Complexity of Symmetric Computations. Canadian J. of Information Processing and Operational Research, 12, 1, 71–86, 1974.
[129] R. Raz, A. Shpilka, Lower Bounds for Matrix Product in Bounded Depth Circuits with Arbitrary Gates. SIAM J. on Computing, 32, 2, 488–513, 2003.
[130] F. Romani, private communication at the Oberwolfach Conference, October 1979.
[131] F. Romani, Some Properties of Disjoint Sums of Tensors Related to Matrix Multiplication. SIAM J. on Computing, 11, 2, 263–267, 1982.
[132] P. Sankowski, M. Mucha, Fast Dynamic Transitive Closure with Lookahead. Algorithmica, 56, 2, 180–197, 2010.
[133] A. Schönhage, Partial and Total Matrix Multiplication. SIAM J. on Computing, 10, 3, 434–455, 1981.
[134] A. Schönhage, Asymptotically Fast Algorithms for the Numerical Multiplication and Division of Polynomials with Complex Coefficients. Proc. EUROCAM, Marseille (edited by J. Calmet), Lecture Notes in Computer Science, 3–15, Springer, Berlin, 1982.
[135] A. Schönhage, V. Strassen, Schnelle Multiplikation großer Zahlen.
Computing, 7, 3–4, 281–292, 1971.
[136] A.V. Smirnov, The Bilinear Complexity and Practical Algorithms for Matrix Multiplication. Computational Mathematics and Mathematical Physics, 53, 12, 1781–1795 (Pleiades Publishing, Ltd.), 2013. Original Russian text in Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 53, 12, 1970–1984, 2013.
[137] A. Shpilka, Lower Bounds for Matrix Product. SIAM J. on Computing, 32, 5, 1185–1200, 2003.
[138] A. Storjohann, On the Complexity of Inverting Integer and Polynomial Matrices. Computational Complexity, 777–821, 2015.
[139] A. Storjohann, The Shifted Number System for Fast Linear Algebra on Integer Matrices. Journal of Complexity, 21(4), 609–650, 2005. doi:10.1016/j.jco.2005.04.002.
[140] A. Storjohann, S. Yang, A Relaxed Algorithm for Online Matrix Inversion. Proceedings of the International Symposium on Symbolic and Algebraic Computation (ISSAC'15), 339–346, ACM Press, New York, 2015.
[141] A.J. Stothers, On the Complexity of Matrix Multiplication. Ph.D. Thesis, University of Edinburgh, 2010.
[142] V. Strassen, Gaussian Elimination Is Not Optimal. Numerische Math., 13, 354–356, 1969.
[143] V. Strassen, Evaluation of Rational Functions. In Complexity of Computer Computations (edited by R.E. Miller, J.W. Thatcher, J.D. Bohlinger), 1–10, Plenum Press, New York, 1972.
[144] V. Strassen, Vermeidung von Divisionen. J. Reine Angew. Math., 264, 184–202, 1973.
[145] V. Strassen, Some Results in Algebraic Complexity Theory. In Proceedings of the International Congress of Mathematicians, Vancouver, 1974 (R.D. James, editor), 497–501, Canadian Mathematical Society, 1974.
[146] V. Strassen, The Asymptotic Spectrum of Tensors and the Exponent of Matrix Multiplication. Proc. 27th Annual Symposium on Foundations of Computer Science (FOCS 1986), 49–54, 1986.
[147] J. Todd, Motivation for Working in Numerical Analysis. Communications on Pure and Applied Mathematics, 8, 1, 97–116, 1955.
[148] E.E. Tyrtyshnikov, Tensor Approximations of Matrices Generated by Asymptotically Smooth Functions. Mat. Sbornik, 147–160, 2003.
[149] V. Vassilevska Williams, Multiplying Matrices Faster than Coppersmith–Winograd. Version available at http://theory.stanford.edu/virgi/matrixmult-f.pdf, retrieved on January 30, 2014. Also see Proc. 44th Annual ACM Symposium on Theory of Computing (STOC 2012), 887–898, ACM Press, New York, 2012.
[150] A. Waksman, On Winograd's Algorithm for Inner Products. IEEE Transactions on Computers, C-19, 4, 360–361, 1970.
[151] S. Winograd, On the Number of Multiplications Required to Compute Certain Functions. Proc. of the National Academy of Sciences, 58, 5, 1840–1842, 1967.
[152] S. Winograd, A New Algorithm for Inner Product. IEEE Transactions on Computers, C-17, 7, 693–694, 1968.
[153] S. Winograd, On the Number of Multiplications Necessary to Compute Certain Functions. Communications on Pure and Applied Mathematics, 23, 2, 165–179, 1970.
[154] S. Winograd, On Multiplication of 2 × 2 Matrices. Linear Algebra and Its Applications, 4, 381–388, 1971.
[155] S. Winograd, Arithmetic Complexity of Computations. CBMS-NSF Regional Conference Series in Applied Mathematics, SIAM, Philadelphia, 1980.
[156] R. Yuster, Efficient Algorithms on Sets of Permutations, Dominance, and Real-weighted APSP. Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2009), 384–391, 2009.
[157] R. Yuster, U. Zwick, Detecting Short Directed Cycles Using Rectangular Matrix Multiplication and Dynamic Programming.
In Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2004), 254–260, 2004.
[158] R. Yuster, U. Zwick, Fast Sparse Matrix Multiplication. ACM Transactions on Algorithms, 1, 1, 2–13, 2005.
[159] R. Yuster, U. Zwick, Answering Distance Queries in Directed Graphs Using Fast Matrix Multiplication. In Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2005), 389–396, IEEE Computer Society Press, 2005.
[160] U. Zwick, All-pairs Shortest Paths Using Bridging Sets and Rectangular Matrix Multiplication. Journal of the ACM, 49, 3, 289–317, 2002.