On Fast Multiplication of a Matrix by its Transpose
Jean-Guillaume Dumas
Université Grenoble Alpes, Laboratoire Jean Kuntzmann, CNRS, UMR 5224, 38058 Grenoble, France
Clément Pernet
Université Grenoble Alpes, Laboratoire Jean Kuntzmann, CNRS, UMR 5224, 38058 Grenoble, France
Alexandre Sedoglavic
Université de Lille, UMR CNRS 9189 CRIStAL, 59650 Villeneuve d'Ascq, France
ABSTRACT
We present a non-commutative algorithm for the multiplication of a 2×2 block-matrix by its transpose, using 5 block products (3 recursive calls and 2 general products), over C or any field of prime characteristic. We use geometric considerations on the space of bilinear forms describing 2×2 matrix products to obtain this algorithm, and we show how to reduce the number of involved additions. The resulting algorithm for arbitrary dimensions is a reduction of the multiplication of a matrix by its transpose to general matrix product, improving by a constant factor previously known reductions. We also propose schedules with a low memory footprint and show how to use the new routine in an L · D · L^T factorization.

CCS CONCEPTS
• Computing methodologies → Exact arithmetic algorithms; Linear algebra algorithms.

KEYWORDS
algebraic complexity, fast matrix multiplication, SYRK, rank-k update, symmetric matrix, Gram matrix, Wishart matrix
1 INTRODUCTION

Strassen's algorithm [20], with 7 recursive multiplications and 18 additions, was the first sub-cubic time algorithm for matrix product, with a cost of O(n^{log2 7}) = O(n^{2.81}). Summarizing the many improvements which have happened since then, the cost O(n^ω) of multiplying two arbitrary n×n matrices will be denoted by MM_ω(n) (see [17] for the best theoretical value of ω known to date).

We propose a new algorithm for the computation of the product A·A^T of a matrix by its transpose; the best previously known reduction of this product to general matrix multiplication requires 2/(2^ω − 4)·MM_ω(n) operations over any field (see [11, §6.3.1]). Here, we establish the following result:

Theorem 1.1. The product of an n×n matrix by its transpose can be computed in 2/(2^ω − 3)·MM_ω(n) field operations over a base field for which there exists a skew-orthogonal matrix.

Our algorithm is derived from the class of Strassen-like algorithms multiplying 2×2 matrices in 7 multiplications. By exploiting the symmetry of the problem, it requires about half of the arithmetic cost of general matrix multiplication when ω is log2 7. The operation at hand is the multiplication of an n×k matrix by its transpose, possibly accumulating the result to another matrix. Following the terminology of the blas standard [10], this operation is a symmetric rank k update (syrk for short).

2 MATRIX PRODUCT ALGORITHMS AND ISOTROPIES

Considered as 2×2 block matrices, the product C = A·B can be computed using Strassen's algorithm by performing the following computations (see [20]):

ρ1 ← a11·(b12 − b22),  ρ2 ← (a11 + a12)·b22,  ρ3 ← (a21 + a22)·b11,  ρ4 ← a22·(b21 − b11),
ρ5 ← (a11 + a22)·(b11 + b22),  ρ6 ← (a12 − a22)·(b21 + b22),  ρ7 ← (a11 − a21)·(b11 + b12),

(c11 c12; c21 c22) = (ρ5 + ρ4 − ρ2 + ρ6   ρ1 + ρ2 ; ρ3 + ρ4   ρ5 + ρ1 − ρ3 − ρ7).  (1)

In order to consider this algorithm from a geometric standpoint, we present it as a tensor. Matrix multiplication is a bilinear map:

K^{m×n} × K^{n×p} → K^{m×p},  (X, Y) ↦ X·Y,  (2)

where the spaces K^{a×b} are finite-dimensional vector spaces that can be endowed with the Frobenius inner product ⟨M, N⟩ = Trace(M^T·N).
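As a quick sanity check, the 7-product scheme (1) can be executed directly; a minimal sketch (the function name and entry unpacking are this sketch's own, and the entries may be scalars of any ring):

```python
def strassen_2x2(a, b):
    """2x2 product with Strassen's 7 multiplications and 18 additions, as in (1)."""
    (a11, a12), (a21, a22) = a
    (b11, b12), (b21, b22) = b
    r1 = a11 * (b12 - b22)
    r2 = (a11 + a12) * b22
    r3 = (a21 + a22) * b11
    r4 = a22 * (b21 - b11)
    r5 = (a11 + a22) * (b11 + b22)
    r6 = (a12 - a22) * (b21 + b22)
    r7 = (a11 - a21) * (b11 + b12)
    # Recombination of the 7 products into the four result coefficients
    return [[r5 + r4 - r2 + r6, r1 + r2],
            [r3 + r4, r5 + r1 - r3 - r7]]
```

Since no commutativity of the ρ_i factors is used, the same function works verbatim when the entries are themselves matrix blocks, which is what makes the recursive use of the scheme possible.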
This inner product establishes an isomorphism between K^{a×b} and its dual space (K^{a×b})*, allowing for example to associate matrix multiplication with the trilinear form Trace(Z^T·X·Y):

K^{m×n} × K^{n×p} × (K^{m×p})* → K,  (X, Y, Z^T) ↦ ⟨Z, X·Y⟩.  (3)

As, by construction, the space of trilinear forms is the canonical dual space of the order-three tensor product, we can associate the Strassen multiplication algorithm (1) with the tensor S = Σ_{i=1}^{7} S_i^(1) ⊗ S_i^(2) ⊗ S_i^(3) in (K^{m×n})* ⊗ (K^{n×p})* ⊗ K^{m×p}, with m = n = p = 2, whose i-th summand encodes ρ_i in (1):

S = (1 0; 0 0) ⊗ (0 1; 0 −1) ⊗ (0 1; 0 1)
  + (1 1; 0 0) ⊗ (0 0; 0 1) ⊗ (−1 1; 0 0)
  + (0 0; 1 1) ⊗ (1 0; 0 0) ⊗ (0 0; 1 −1)
  + (0 0; 0 1) ⊗ (−1 0; 1 0) ⊗ (1 0; 1 0)
  + (1 0; 0 1) ⊗ (1 0; 0 1) ⊗ (1 0; 0 1)
  + (0 1; 0 −1) ⊗ (0 0; 1 1) ⊗ (1 0; 0 0)
  + (1 0; −1 0) ⊗ (1 1; 0 0) ⊗ (0 0; 0 −1).  (4)

Given any couple (A, B) of 2×2 matrices, the tensor S encodes the Strassen matrix multiplication algorithm computing A·B through the partial contraction {S, A ⊗ B}:

((K^{m×n})* ⊗ (K^{n×p})* ⊗ K^{m×p}) ⊗ (K^{m×n} ⊗ K^{n×p}) → K^{m×p},
S ⊗ (A ⊗ B) ↦ Σ_{i=1}^{7} ⟨S_i^(1), A⟩ ⟨S_i^(2), B⟩ S_i^(3),  (5)

while the complete contraction {S, A ⊗ B ⊗ C^T} is Trace(A·B·C). The tensor formulation of a matrix multiplication algorithm makes its symmetries (a.k.a. isotropies) explicit. As this formulation is associated to the trilinear form Trace(A·B·C), given three invertible matrices U, V, W of suitable sizes, the classical properties of the trace show that Trace(A·B·C) is equal to:

Trace((A·B·C)^T) = Trace(C·A·B) = Trace(B·C·A), and to Trace(U⁻¹·A·V · V⁻¹·B·W · W⁻¹·C·U).  (6)

These relations illustrate the following theorem:

Theorem 2.1 ([8, §2.8]). The isotropy group of the n×n matrix multiplication tensor is psl^±(K^n)^{×3} ⋊ S3, where psl^± stands for the group of matrices of determinant ±1 and S3 for the symmetric group on 3 elements.

The following definition recalls the sandwiching isotropy action on a matrix multiplication tensor:

Definition 2.1.
Given g = (U × V × W) in psl^±(K^n)^{×3}, its action g ⋄ S on a tensor S is given by Σ_{i=1}^{7} g ⋄ (S_i^(1) ⊗ S_i^(2) ⊗ S_i^(3)), where the term g ⋄ (S_i^(1) ⊗ S_i^(2) ⊗ S_i^(3)) is equal to:

(U^{−T}·S_i^(1)·V^T) ⊗ (V^{−T}·S_i^(2)·W^T) ⊗ (W^{−T}·S_i^(3)·U^T).  (7)

Remark 2.1. In psl^±(K^n)^{×3}, the product ∘ of two isotropies g1 defined by u1 × v1 × w1 and g2 defined by u2 × v2 × w2 is the isotropy g1 ∘ g2 equal to u1·u2 × v1·v2 × w1·w2. Furthermore, the complete contraction {(g1 ∘ g2) ⋄ S, A ⊗ B ⊗ C} is equal to {g2 ⋄ S, g1^T ⋄ (A ⊗ B ⊗ C)}.

The following theorem shows that all optimal 2×2-matrix product algorithms are obtained this way:

Theorem 2.2 (de Groote). The group psl^±(K²)^{×3} acts transitively on the variety of optimal algorithms for the computation of 2×2-matrix multiplication.

Thus, the isotropy action on the Strassen tensor may define other matrix product algorithms with interesting computational properties.

2.2 Design of specific 2×2-matrix product algorithms

This observation inspires our general strategy to design specific algorithms suited for particular matrix products.

Strategy 2.1.
By applying an undetermined isotropy

g = U × V × W = (u1 u2; u3 u4) × (v1 v2; v3 v4) × (w1 w2; w3 w4)  (8)

to the Strassen tensor S, we obtain a parameterization T = g ⋄ S of all matrix product algorithms requiring 7 coefficient multiplications:

T = Σ_{i=1}^{7} T_i^(1) ⊗ T_i^(2) ⊗ T_i^(3),  with  T_i^(1) ⊗ T_i^(2) ⊗ T_i^(3) = g ⋄ (S_i^(1) ⊗ S_i^(2) ⊗ S_i^(3)).  (9)

Then, we can impose further conditions on these algorithms and check by a Gröbner basis computation whether such an algorithm exists. If so, there is subsequent work to do in choosing a point on this variety; this choice can be motivated by the additive cost bound and by the scheduling properties of the evaluation scheme given by this point.
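One well-known point of this variety, the Winograd variant discussed in the next example, can be written down explicitly; a runnable sketch (the s/t/p/u names are this sketch's own labels for the pre-additions, products and post-additions):

```python
def winograd_2x2(a, b):
    """2x2 product with 7 multiplications and only 15 additions
    (8 pre-additions and 7 post-additions)."""
    (a11, a12), (a21, a22) = a
    (b11, b12), (b21, b22) = b
    # 8 pre-additions
    s1 = a21 + a22; s2 = s1 - a11; s3 = a11 - a21; s4 = a12 - s2
    t1 = b12 - b11; t2 = b22 - t1; t3 = b22 - b12; t4 = t2 - b21
    # 7 multiplications (non-commutative: works on matrix blocks too)
    p1 = a11 * b11; p2 = a12 * b21; p3 = s4 * b22; p4 = a22 * t4
    p5 = s1 * t1; p6 = s2 * t2; p7 = s3 * t3
    # 7 post-additions
    u2 = p1 + p6; u3 = u2 + p7; u4 = u2 + p5
    return [[p1 + p2, u4 + p3], [u3 - p4, u3 + p5]]
```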
Let us first illustrate this strategy with the well-known Winograd variant of Strassen's algorithm presented in [22].

Example 1.
Apart from the number of multiplications, it is also interesting in practice to reduce the number of additions in an algorithm. Matrices with a single non-zero coefficient, such as the first components of the first and fourth summands in tensor (4), do not increase the additive cost bound of the algorithm. Hence, in order to reduce this complexity, we could try to maximize the number of such matrices involved in the associated tensor. To do so, we recall Bshouty's results on the additive complexity of matrix product algorithms.

Theorem 2.3 ([6]).
Let e_{i,j} = (δ_{i,k}·δ_{j,l})_{k,l} be the single entry elementary matrix. A 2×2 matrix product tensor cannot have 4 such matrices as first (resp. second, third) components ([6, Lemma 8]). The additive complexity bounds of the first and second components are equal ([6, eq. (11)]) and at least 4 = 7 − 3. The total additive complexity of 2×2-matrix product is at least 15 ([6, Theorem 1]).

Following our strategy, we impose on tensor T (9) the constraint that three of its first components are elementary matrices:

T_1^(1) = e_{1,1} = (1 0; 0 0),  T_2^(1) = e_{2,1},  T_3^(1) = e_{2,2},  (10)

and obtain by a Gröbner basis computation [13] that such tensors are the images of the Strassen tensor under the action of a family of isotropies

w = W1 × W2 × (w1 w2; w3 w4),  (11)

with fixed first and second components and a free invertible third component. The variant of the Winograd tensor [22], presented with a renumbering as Algorithm 1, is obtained by the action of w, for a suitable specialization of the parameters w1, …, w4, on the Strassen tensor S. While the original Strassen algorithm requires 18 additions, only 15 additions are necessary in the Winograd Algorithm 1.

Algorithm 1: C = W(A, B)
Input: A = (a11 a12; a21 a22) and B = (b11 b12; b21 b22).
Output: C = A·B.
    s1 ← a21 + a22, s2 ← s1 − a11, s3 ← a11 − a21, s4 ← a12 − s2,
    t1 ← b12 − b11, t2 ← b22 − t1, t3 ← b22 − b12, t4 ← t2 − b21.
    p1 ← a11·b11, p2 ← a12·b21, p3 ← s4·b22, p4 ← a22·t4, p5 ← s1·t1, p6 ← s2·t2, p7 ← s3·t3.
    c11 ← p1 + p2, u2 ← p1 + p6, u3 ← u2 + p7, u4 ← u2 + p5, c12 ← u4 + p3, c21 ← u3 − p4, c22 ← u3 + p5.
    return C = (c11 c12; c21 c22).

As a second example illustrating our strategy, we consider the matrix squaring that was already explored by Bodrato in [3].

Example 2. When computing A², the contraction (5) of the tensor T (9) with A ⊗ A shows that choosing a subset J of {1, …, 7} and imposing T_i^(2) = T_i^(1) as constraints, with i in J (see [3, eq. 4]), can save |J| operations and thus reduce the computational complexity. The definition (9) of T, these constraints, and the fact that the determinants of U, V and W are ±1, form a polynomial system in the 12 unknowns u1, …, w4 whose solutions define matrix squaring algorithms. The algorithm of [3, §2.2, eq. 2] is given by the action of a suitable such isotropy (12) on Strassen's tensor and is just Chatelin's algorithm [7, Appendix A], with λ = 1 (published years before [3], but not applied to squaring).

Remark 2.2.
Using symmetries in our strategy reduces the computational cost compared to the direct resolution of Brent's equations [4, §5, eq. 5.03] with a fully undetermined tensor T. Proceeding that way in the previous example would have required constructing a much larger system of algebraic equations, with unknowns for all the coefficients of T, resulting from the constraints on T together with the relations, expressed via Kronecker products, stating that T is a matrix multiplication tensor.

2.3 The 2×2-matrix product by its transpose

We now apply our strategy to the 2×2 product A·A^T. Applying Strategy 2.1, we consider in (9) a generic matrix multiplication tensor T, and our goal is to reduce the computational complexity of the partial contraction (5) with A ⊗ A^T computing A·A^T. By the properties of the transpose operator and of the trace, the following relations hold:

⟨T_i^(2), A^T⟩ = Trace(T_i^(2)T·A^T) = Trace((A·T_i^(2))^T) = Trace(A·T_i^(2)) = Trace(T_i^(2)·A) = ⟨T_i^(2)T, A⟩.  (13)

Thus, the partial contraction (5) satisfies here the following relation:

Σ_{i=1}^{7} ⟨T_i^(1), A⟩ ⟨T_i^(2), A^T⟩ T_i^(3) = Σ_{i=1}^{7} ⟨T_i^(1), A⟩ ⟨T_i^(2)T, A⟩ T_i^(3).  (14)

Our goal is to save computations in the evaluation of (14). To do so, we consider subsets J of {1, …, 7} and H of {(i, j) ∈ {1, …, 7}² | i ≠ j, i ∉ J, j ∉ J}, in order to express the following constraints:

T_i^(1) = T_i^(2)T for i ∈ J;  T_j^(1) = T_k^(2)T and T_k^(1) = T_j^(2)T for (j, k) ∈ H.  (15)

The constraints of type J allow one to save preliminary additions when applying the method to B = A^T, since then the operations on A and A^T will be the same.
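The transposition identity (13) underlying these constraints rests only on the invariance of the trace under transposition and cyclic shifts; a quick numerical sanity check (numpy, with generic matrix names):

```python
# Check <T2, A^T> = <T2^T, A> for the Frobenius pairing <M, N> = Trace(M^T · N).
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(-9, 9, (2, 2))
T2 = rng.integers(-9, 9, (2, 2))

lhs = np.trace(T2.T @ A.T)   # <T2, A^T>
rhs = np.trace(T2 @ A)       # <T2^T, A>, using Trace(M) = Trace(M^T)
assert lhs == rhs
```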
The constraints of type H allow one to save multiplications, especially when dealing with a block-matrix product: if some matrix products are transposes of others, only one product of each such pair needs to be computed, as shown in Section 3. We are thus looking for the largest possible sets J and H. By exhaustive search, we conclude that the cardinality of H is at most 2, and that then the cardinality of J is at most 3. Choosing such maximal sets J and H, we obtain for these solutions an explicit parameterization (16), expressed with a primitive element z; its key feature is that, together with the determinant condition w1·w4 − w2·w3 = 1, it forces the square of a combination of the parameters to be −1.

Remark 2.3. As such a relation occurs in this parameterization, field extensions cannot be avoided in these algorithms if the field does not contain, at least, a square root of −1. We show in Section 3 that we can avoid these extensions with block-matrix products and use our algorithm directly over any field of prime characteristic. As done in Example 1, we can also try to reduce the additive complexity, using 4 pre-additions on A (resp. B) [6, Lemma 9] and 7 post-additions on the products to form C [6, Lemma 2]. In the current situation, if the operations on B are exactly the transposes of those on A, then we have the following lower bound:

Lemma 2.1. Over a non-commutative domain, 11 additive operations are necessary to multiply a 2×2 matrix by its transpose with a bilinear algorithm that uses 7 multiplications.

Indeed, over a commutative domain, the lower left and upper right parts of the product are transposes of one another and one can also save multiplications.
Differently, over non-commutative domains, A·A^T is not symmetric in general (say a·c + b·d ≠ c·a + d·b) and all four coefficients need to be computed. But one can still save 4 additions, since there are algorithms where the pre-additions are the same on A and A^T. Now, to reach that minimum, the constraints (15) must be combined with the minimal number 4 of pre-additions for A. Those can be attained only if 3 of the T_i^(1) factors do not require any addition [6, Lemma 8]. Hence, those factors involve only one of the four elements of A and they are just permutations of e_{1,1}. We thus add these constraints to the system, for a subset K of {1, …, 7} with |K| = 3:

T_i^(1) ∈ { (1 0; 0 0), (0 1; 0 0), (0 0; 1 0), (0 0; 0 1) } for i in K.  (17)

Choosing K similarly to (10), a Gröbner basis computation yields an isotropy a, expressed with a primitive element z satisfying z² = −1, that sends the Strassen tensor to an algorithm computing the symmetric product more efficiently (18). We remark that a is equal to d ∘ w, with w the isotropy (11) that sends the Strassen tensor to the Winograd tensor, and with d = D1 × D2 × D3 a triple of diagonal matrices whose entries involve z (19). Hence, the induced algorithm can benefit from the scheduling and the additive complexity of the classical Winograd algorithm. In fact, our choice a ⋄ S is equal to (d ∘ w) ⋄ S and thus, according to Remark 2.1, the resulting algorithm, expressed as the total contraction

{(d ∘ w) ⋄ S, (A ⊗ A^T ⊗ C)} = {w ⋄ S, d^T ⋄ (A ⊗ A^T ⊗ C)},  (20)

can be written as a slight modification of the inputs of Algorithm 1. Precisely, as the components of d are diagonal, the relation d^T = d holds; hence, we can express the input modification as:

(D1⁻¹·A·D2) ⊗ (D2⁻¹·A^T·D3) ⊗ (D3⁻¹·C·D1).  (21)

The above expression is trilinear and the matrices D1 and D3 are scalings of the identity; hence, up to a scalar factor, the modifications are just:

(A·D2) ⊗ (D2⁻¹·A^T) ⊗ C.  (22)

Using the notations of Algorithm 1, this is C = W(A·D2, D2⁻¹·A^T). Allowing our isotropies to have determinant different from 1, we rescale D2 by a scalar factor in order to avoid a useless 4th root, and obtain a matrix

Q = Diag(1, y), where y denotes a square root of −1.  (23)

Hence, our algorithm to compute the symmetric product is:

C = W(A·Q, (A·(Q⁻¹)^T)^T).  (24)

In the next section, we describe and extend this algorithm to higher-dimensional symmetric products A·A^T with a 2^ℓ·m × 2^ℓ·m matrix A.

3 2×2-BLOCK RECURSIVE SYRK

The algorithm presented in the previous section is non-commutative, and thus we can extend it to higher-dimensional matrix products by a divide and conquer approach. To do so, we use in the sequel upper case letters for the coefficients in our algorithms, instead of the lower case used previously, since these coefficients now represent matrices. New properties and results are induced by this shift of perspective: for example, the coefficient y introduced in (23), now a matrix block Y, can be transposed in (24); this leads to the following definition:

Definition 3.1. An invertible matrix Y is skew-orthogonal if the relation Y^T = −Y⁻¹ holds, i.e. Y·Y^T = −I.

If Y is skew-orthogonal, then among the 7 recursive matrix products involved in expression (24): one can be avoided, since the upper right coefficient of the symmetric result is not needed anymore; another one can be avoided, since it is the transpose of another product; and 3 of the remaining 5 products are themselves recursive calls to syrk. This results in Algorithm 2.

Algorithm 2 (syrk): symmetric matrix product
Input: A = (A11 A12; A21 A22); a skew-orthogonal matrix Y.
Output: the lower triangular part of C = A·A^T = (C11 C21^T; C21 C22).
    ▷ 4 additions and 2 multiplications by Y:
    S1 ← (A21 − A11)·Y, S2 ← A22 − A21·Y, S3 ← S1 − A22, S4 ← S3 + A12.
    ▷ 3 recursive (P1, P2, P5) and 2 generic (P3, P4) products:
    P1 ← A11·A11^T, P2 ← A12·A12^T, P3 ← A22·S4^T, P4 ← S1·S2^T, P5 ← S3·S3^T.
    Low(U1) ← Low(P1) + Low(P2)  ▷ U1, P1, P2 are symmetric
    Low(U2) ← Low(P1) + Low(P5)  ▷ U2, P1, P5 are symmetric
    Up(U2) ← Low(U2)^T  ▷ P3 and P4 are not symmetric
    U3 ← U2 + P4, U4 ← U3 + P3
    Low(U5) ← Low(U3) + Low(P4^T)  ▷ U5 = U3 + P4^T is symmetric
    return (Low(U1); U4  Low(U5)).

Proposition 3.1 (Appendix A.1). Algorithm 2 is correct for any skew-orthogonal matrix Y.

Algorithm 2 requires a skew-orthogonal matrix. Unfortunately, there are no skew-orthogonal matrices over R, nor over Q: there, the diagonal of Y·Y^T is a sum of squares and can never equal −1. Hence, we report no improvement in these cases. In other domains, the simplest skew-orthogonal matrices just use a square root i of −1. Algorithm 2 is therefore directly usable over C^{n×n} with Y = i·I_n. Further, as complex numbers are usually emulated by a pair of floats, the multiplications by Y = i·I_n are essentially free, since they just exchange the real and imaginary parts, with one sign flip. Even though, over the complex field, the product zherk of a matrix by its conjugate transpose is more widely used, zsyrk has some applications; see for instance [1].

Over some fields of prime characteristic, square roots of −1 also exist. There, Algorithm 2 only requires some pre-multiplications by this square root (again with Y = i·I_n ∈ F^{n×n}), but within the field. Proposition 3.2 thereafter characterizes these fields.

Proposition 3.2. Fields with characteristic two, or with an odd characteristic p ≡ 1 mod 4, or finite fields that are an even extension, contain a square root of −1.

Proof. If p =
2, then 1² = 1 = −1. If p ≡ 1 mod 4, take a quadratic non-residue x in the base field of size p: then x^{(p−1)/2} = −1 and, as 4 divides p − 1, the element x^{(p−1)/4} is a square root of −1. If the finite field F is of cardinality p^k with k even, then 4 divides p^k − 1 (since p² ≡ 1 mod 8), and similarly x^{(p^k−1)/4} is a square root of −1 for any non-residue x. □

Finally, we show that Algorithm 2 can also be run without any field extension, even when −1 is not a square, provided that the dimension of Y is even. Whenever this dimension is odd, it is always possible to pad with zeroes, so that A·A^T = (A 0)·(A 0)^T.

Proposition 3.3. Let F be a field of characteristic p. There exists (a, b) in F² such that the matrix

(a b; −b a) ⊗ I_n = (a·I_n  b·I_n; −b·I_n  a·I_n) in F^{2n×2n}  (25)

is skew-orthogonal.

Proof. Using the relation

(a·I_n  b·I_n; −b·I_n  a·I_n) · (a·I_n  b·I_n; −b·I_n  a·I_n)^T = (a² + b²)·I_{2n},  (26)

it suffices to prove that there exist a, b such that a² + b² = −1. In characteristic 2, take a = 1 and b = 0: then 1 + 0 = 1 = −1. In odd characteristic, there are (p+1)/2 distinct square elements x_i² in the base prime field. Therefore, there are (p+1)/2 distinct elements −1 − x_i². But there are only p distinct elements in the base field, thus there exists a couple (i, j) such that −1 − x_i² = x_j² [19, Lemma 6]. □

Proposition 3.3 shows that skew-orthogonal matrices do exist for any field of prime characteristic. For Algorithm 2, we need to build them, mostly for p ≡ 3 mod 4 (otherwise Proposition 3.2 applies): (1) find a prime r satisfying both r ≡ 1 mod 4 and r ≡ −1 mod p, by enumerating the arithmetic progression (4k + 2)p − 1; (2) then, the results of [5] allow one to decompose such a prime into two squares and give a couple (a, b) in Z² such that a² + b² = r. Finally, a² + b² ≡ −1 mod p. By the prime number theorem in arithmetic progressions, the first step is polynomial in log(p), as is the second step (the modular square root, denoted sqrt thereafter, has a cost close to that of an exponentiation, and the rest of Brillhart's algorithm is gcd-like). In practice, though, it is faster to use the following Algorithm 3, even though the latter has a better asymptotic complexity bound only if the ERH is true.

Algorithm 3
SoS: Sum of squares decomposition over a finite field
Input: p ∈ P\{2}, k ∈ Z.
Output: (a, b) ∈ Z², s.t. a² + b² ≡ k mod p.
    if (k|p) = 1 then  ▷ k is a square mod p
        return (sqrt(k), 0)
    else
        s ← 2  ▷ find the smallest quadratic non-residue
        while (s|p) = 1 do s ← s + 1
        c ← sqrt(s − 1)  ▷ s − 1 must be a square
        r ← k·s⁻¹ mod p
        a ← sqrt(r)  ▷ now k ≡ a²·s ≡ a²·(1 + c²) mod p
        return (a, a·c mod p)

Proposition 3.4.
Algorithm 3 is correct and, under the ERH, runs in expected time Õ(log³(p)) bit operations.

Proof. If k is a square, then the square of one of its square roots added to the square of zero is a solution. Otherwise, the lowest quadratic non-residue (lqnr) modulo p is one plus a square b² (1 is always a square, so the lqnr is larger than 2). For any generator of Z_p^×, quadratic non-residues, as well as their inverses (s is invertible as it is non-zero and p is prime), have an odd discrete logarithm. Therefore the product of k and the inverse of the lqnr must be a square a². This means that the relation k = a²·(1 + b²) = a² + (ab)² holds. Now for the running time: under the ERH, the lqnr is lower than 3·log²(p)/2 − 44·log(p)/5 + 13 [21, Theorem 6.35]. The number of Legendre symbol computations is therefore O(log²(p)), and this dominates the expected cost of the modular square root computations. □

Remark 3.1.
Another possibility is to use randomization: instead of using the lowest quadratic non-residue (lqnr), randomly select a non-residue s, and then decrement it until s − 1 is a quadratic residue (1 is a square, so this terminates). Also, when computing t sums of squares modulo the same prime, one can compute the lqnr only once and then obtain all t decompositions for the expected cost of a single lqnr search plus O(t) modular square roots.

Remark 3.2.
Except in characteristic 2, or in algebraic closures, where every element is a square anyway, Algorithm 3 is easily extended to work over any finite field: compute the lqnr in the base prime field, then use the Tonelli-Shanks or the Cipolla-Lehmer algorithm to compute the square roots in the extension field. (In practice, the running time of the randomized variant of Remark 3.1 seems very close to that of Algorithm 3; see, e.g., the implementation in Givaro rev. 7bdefe6, https://github.com/linbox-team/givaro.) Denote by SoS(q, k) this algorithm decomposing k as a sum of two squares within any finite field F_q. Such a decomposition is not always possible over infinite fields, but there Algorithm 3 still works for the special case k = −1: just run it in the prime subfield.

Note that Algorithm 2 remains valid if transposition is replaced by conjugate transposition, provided that there exists a matrix Y such that Y·conj(Y)^T = −I. This is not possible anymore over the complex field, but works for any even extension field, thanks to Algorithm 3: if −1 is a square in F_q, then Y = √(−1)·I_n still works; otherwise, there exists a square root i of −1 in F_{q²}, from Proposition 3.2. In the latter case, build (a, b), both in F_q, such that a² + b² = −1. Now Y = (a + ib)·I_n in F_{q²}^{n×n} is appropriate: indeed, since q ≡ 3 mod 4, the conjugate of a + ib is (a + ib)^q = a − ib, so that Y·conj(Y)^T = (a² + b²)·I_n = −I_n.

4 ARITHMETIC COST ANALYSIS

Theorem 4.1.
Algorithm 2 requires 2/(2^ω − 3)·C_ω·n^ω + o(n^ω) field operations, over C or over any field of prime characteristic.

Proof. Algorithm 2 is applied recursively to compute the three products P1, P2 and P5, while P3 and P4 are each computed in MM_ω(n/2) = C_ω·(n/2)^ω + o(n^ω) using a general matrix multiplication algorithm. We will show that applying the skew-orthogonal matrix Y to an n/2 × n/2 matrix costs y·(n/2)² operations, for some constant y depending on the base field. Then, counting the additions as in Remark 4.1 thereafter, the cost T(n) of Algorithm 2 satisfies:

T(n) ≤ 3·T(n/2) + 2·C_ω·(n/2)^ω + (7.5 + 2y)·(n/2)² + o(n^ω),  (27)

and T(1) is a constant. Thus, by the master theorem:

T(n) ≤ 2·C_ω/(2^ω − 3)·n^ω + o(n^ω) = 2/(2^ω − 3)·MM_ω(n) + o(n^ω).  (28)

If the field is C or satisfies the conditions of Proposition 3.2, there is a square root i of −1: setting Y = i·I_{n/2} yields y = 1. Otherwise, in characteristic p ≡ 3 mod 4, one can use for Y the matrix (a b; −b a) ⊗ I_{n/4} of Proposition 3.3, for which y = 3. As a subcase, the latter can be improved when p ≡ 3 mod 8: then (−2|p) = (−1|p)·(2|p) = (−1)^{(p−1)/2}·(−1)^{(p²−1)/8} = (−1)·(−1) = 1, so that there exists s ≡ sqrt(−2⁻¹) mod p; taking a = b = s gives a² + b² = −1 and Y = (s s; −s s) ⊗ I_{n/4}, for which y = 2. □

To our knowledge, the best previously known result had a 2/(2^ω − 4) factor instead; see, e.g., [11, §6.3.1]. Table 1 summarizes the leading terms of the arithmetic complexity bounds.

Problem            Alg.    at ω = 3   at ω = log2 7          any ω
A·A^T ∈ F^{n×n}    [11]    n³         (2/3)·MM_{log2 7}(n)   2/(2^ω − 4)·MM_ω(n)
                   Alg. 2  0.8·n³     (1/2)·MM_{log2 7}(n)   2/(2^ω − 3)·MM_ω(n)

Table 1: Arithmetic complexity bounds leading terms.
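The skew-orthogonal matrices used in this analysis can be built effectively; here is a runnable sketch combining a sum-of-squares decomposition in the spirit of Algorithm 3 with the construction (25) of Proposition 3.3. The helpers `legendre`, `sqrt_mod` (Tonelli-Shanks) and `sos` are this sketch's own, not the paper's library routines:

```python
import numpy as np

def legendre(k, p):
    """Legendre symbol (k|p) in {-1, 0, 1}, for an odd prime p."""
    s = pow(k % p, (p - 1) // 2, p)
    return -1 if s == p - 1 else s

def sqrt_mod(k, p):
    """A square root of k modulo an odd prime p (k must be a residue)."""
    k %= p
    if k == 0:
        return 0
    if p % 4 == 3:
        return pow(k, (p + 1) // 4, p)
    # Tonelli-Shanks for p = 1 mod 4: write p - 1 = q * 2^e with q odd
    q, e = p - 1, 0
    while q % 2 == 0:
        q //= 2; e += 1
    z = 2
    while legendre(z, p) != -1:
        z += 1
    m, c, t, r = e, pow(z, q, p), pow(k, q, p), pow(k, (q + 1) // 2, p)
    while t != 1:
        i, t2 = 0, t
        while t2 != 1:
            t2 = t2 * t2 % p; i += 1
        b = pow(c, 1 << (m - i - 1), p)
        m, c, t, r = i, b * b % p, t * b * b % p, r * b % p
    return r

def sos(p, k):
    """(a, b) with a^2 + b^2 = k (mod p), following Algorithm 3."""
    k %= p
    if legendre(k, p) >= 0:              # k is a square (or zero) mod p
        return (sqrt_mod(k, p), 0)
    s = 2                                 # smallest quadratic non-residue
    while legendre(s, p) == 1:
        s += 1
    c = sqrt_mod(s - 1, p)                # s - 1 is a square by minimality
    a = sqrt_mod(k * pow(s, -1, p) % p, p)  # k = a^2 * s = a^2 * (1 + c^2)
    return (a, a * c % p)

# Build a skew-orthogonal Y as in (25), over F_p with p = 3 mod 4
p = 10007
a, b = sos(p, p - 1)                      # a^2 + b^2 = -1 (mod p)
n = 2
E = np.eye(n, dtype=np.int64)
Y = np.block([[a * E, b * E], [(p - b) * E, a * E]])
assert ((Y @ Y.T) % p == (p - 1) * np.eye(2 * n, dtype=np.int64)).all()
```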
Alternatively, over C, the 3M (Karatsuba) method for non-symmetric matrix multiplication reduces the number of multiplications of real matrices from 4 to 3 [15]: if RR_ω(n) is the cost of multiplying n×n matrices over R, then the 3M method costs 3·RR_ω(n) + o(n^ω) operations over R. Adapting this approach to the symmetric case yields a 2M method to compute the product of a complex matrix by its transpose, using only 2 real products: H = A·B^T and G = (A + B)·(A^T − B^T). Combining those into (G − H^T + H) + i·(H + H^T) yields the product (A + iB)·(A^T + iB^T). This approach costs 2·RR_ω(n) + o(n^ω) operations in R.

The classical algorithm [11, §6.3.1] applies a divide and conquer approach directly over the complex field. It uses the equivalent of 2/(2^ω − 4) complex floating-point n×n products. Using the 3M method for these complex products, this algorithm uses overall 6/(2^ω − 4)·RR_ω(n) + o(n^ω) operations in R. Finally, Algorithm 2 only costs 2/(2^ω − 3) complex multiplications, for a leading term bounded by 6/(2^ω − 3)·RR_ω(n), better than the 2·RR_ω(n) of the 2M method as soon as ω > log2(6) ≈ 2.585. Table 2 summarizes these costs.

Problem            Alg.    at ω = 3   at ω = log2 7          any ω
A·B ∈ C^{n×n}      naive   8·n³       4·RR_{log2 7}(n)       4·RR_ω(n)
                   3M      6·n³       3·RR_{log2 7}(n)       3·RR_ω(n)
A·A^T ∈ C^{n×n}    2M      4·n³       2·RR_{log2 7}(n)       2·RR_ω(n)
                   [11]    3·n³       2·RR_{log2 7}(n)       6/(2^ω − 4)·RR_ω(n)
                   Alg. 2  2.4·n³     1.5·RR_{log2 7}(n)     6/(2^ω − 3)·RR_ω(n)

Table 2: Symmetric multiplication over C: leading term of the cost in number of operations over R.

Remark 4.1.
Each recursive level of Algorithm 2 is composed of 9 block additions. An exhaustive search on all symmetric algorithms derived from Strassen's showed that this number is minimal in this class of algorithms. Note also that 3 out of these 9 additions involve symmetric matrices and are therefore only performed on the lower triangular part of the matrix. Overall, the number of scalar additions at the top recursion level is 6·(n/2)² + 3·(n/2)·((n/2) + 1)/2 = (15/8)·n² + (3/4)·n, nearly half of the (15/4)·n² optimal count in the non-symmetric case [6, Theorem 1].

To further reduce the number of additions, a promising approach is the one undertaken in [2, 16]. It is however not clear to us how to adapt our strategy to their recursive transformation of basis.
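Before turning to the C++ implementation, the whole recursion can be exercised in a few lines; a Python stand-in (not the fflas-ffpack code) over F_13, where i = 5 is a square root of −1 so that Y = i·I. The block and product names (S1..S4, P1..P5, U1..U5) are this sketch's labels, and sizes are powers of two:

```python
import numpy as np

MOD = 13    # 13 = 1 (mod 4); i = 5 is a square root of -1 since 25 = -1 (mod 13)
IM = 5

def syrk(A):
    """Full symmetric C = A*A^T mod MOD, with 3 recursive symmetric
    products (P1, P2, P5) and 2 generic half-size products (P3, P4)."""
    n = A.shape[0]
    if n == 1:
        return A @ A.T % MOD
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    S1 = (A21 - A11) * IM % MOD       # multiplications by Y = i*I are scalings
    S2 = (A22 - A21 * IM) % MOD
    S3 = (S1 - A22) % MOD
    S4 = (S3 + A12) % MOD
    P1, P2, P5 = syrk(A11), syrk(A12), syrk(S3)   # recursive syrk calls
    P3 = A22 @ S4.T % MOD                          # generic products
    P4 = S1 @ S2.T % MOD
    U1 = (P1 + P2) % MOD              # C11
    U3 = (P1 + P5 + P4) % MOD
    U4 = (U3 + P3) % MOD              # C21
    U5 = (U3 + P4.T) % MOD            # C22 = U3 + P4^T, symmetric
    return np.block([[U1, U4.T], [U4, U5]])
```

Over such a field the two multiplications by Y per level are mere scalings, matching the quadratic additive overhead of the cost analysis above.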
This section reports on an implementation of Algorithm 2 over prime fields. We propose in Table 3 and Figure 1 a schedule for the operation C ← A·A^T using no more extra storage than the unused upper triangular part of the result C. The following task list is a valid in-place order for the operations of Algorithm 2; each task is followed by the block of C where its result is stored:

 1. S2 = A22 − A21·Y            in C21
 2. S1 = (A21 − A11)·Y          in C12
 3. P4 = S1·S2^T                in C22
 4. S3 = S1 − A22               in C12
 5. S4 = S3 + A12               in C11
 6. P3 = A22·S4^T               in C21
 7. P5 = S3·S3^T                in C11
 8. P1 = A11·A11^T              in C12
 9. Low(U2) = Low(P1) + Low(P5) in C11
10. Up(U2) = Low(U2)^T          in C11
11. U3 = U2 + P4                in C11
12. U4 = U3 + P3                in C21
13. Low(U5) = Low(U3) + Low(P4^T) in C22
14. P2 = A12·A12^T              in C11
15. Low(U1) = Low(P1) + Low(P2) in C11

Table 3: Memory placement and schedule of tasks to compute the lower triangular part of C ← A·A^T when k ≤ n. The block C12 of the output matrix is the only temporary used.

Figure 1: DAG of the tasks and their memory locations for the computation of C ← A·A^T presented in Table 3.

For the more general operation C ← αA·A^T + βC, Table 4 and Figure 2 propose a schedule requiring only one additional n/2 × n/2 temporary block, since the entries of C must be preserved until they are accumulated.

Table 4: Memory placement and schedule of tasks to compute the lower triangular part of C ← αA·A^T + βC when k ≤ n. The block C12 of the output matrix as well as an n/2 × n/2 block tmp are used as temporary storage.

Figure 2: DAG of the tasks and their memory locations for the computation of C ← αA·A^T + βC presented in Table 4.

These schedules are implemented as the fsyrk routine in the fflas-ffpack library for dense linear algebra over a finite field [14, from commit 0a91d61e]. Figure 3 compares the computation speed in effective Gfops (defined as n³/(10⁹ × time)) of this implementation over Z/131071Z with that of the double-precision blas routine dsyrk, the classical cubic-time routine over a finite field (calling dsyrk and performing modular reductions on the result), and the classical divide and conquer algorithm [11, §6.3.1]. The fflas-ffpack library is linked with OpenBLAS [23, v0.3.6] and compiled with gcc-9.2 on an Intel Skylake i7-6700 running a Debian GNU/Linux system (v5.2.17). The slight overhead of performing the modular reductions is quickly compensated by the speed-up of the sub-cubic algorithm.

Figure 3: Speed of implementations of the symmetric product on an i7-6700 (Skylake): classic OpenBLAS dsyrk, classic and divide-and-conquer fsyrk modulo 131071, and fast fsyrk (Algorithm 2) modulo 131071 and modulo 131041.

5 SYMMETRIC PRODUCT WITH DIAGONAL SCALING

Symmetric rank-k updates are a key building block for symmetric triangular factorization algorithms, where their efficiency is one of the bottlenecks. In the most general setting (indefinite factorization), a block-diagonal scaling by a matrix D, with 1- or 2-dimensional diagonal blocks, has to be inserted within the product, leading to the operation C ← C − A·D·A^T. Handling the block-diagonal structure over the course of the recursive algorithm may become tedious and quite expensive: for instance, a 2×2 block overlapping a recursive splitting point may generate O(log(n)) extra columns in a recursive setting. Over a finite field, though, we show in this section how to factor the block-diagonal matrix D into D = ∆·∆^T, without needing any field extension, and then compute instead (A·∆)·(A·∆)^T. Algorithm 6 deals with non-squares and 2×2 blocks. First, we give an algorithm handling pairs of quadratic non-residues.

Proposition 5.1.
Algorithm 4 is correct.
Proof. Given α and β quadratic non-residues, a couple (a, b) such that α = a² + b² is found by the algorithm of Remark 3.2. Second, as α and β are both quadratic non-residues, over a finite field their quotient is a residue, since (β·α⁻¹)^{(q−1)/2} = (−1)·(−1) = 1. Third, if c denotes −bda⁻¹, then c² + d² is equal to (−bd/a)² + d², and thus to (b²/a² + 1)·d²; this last quantity is equal to α·d²/a², that is to α·(a²β/α)/a² = β. Fourth, a (or w.l.o.g. b) is invertible: indeed, α is not a square, therefore it is non-zero, and thus one of a or b must be non-zero. Finally, we obtain the cancellation ac + bd = a·(−dba⁻¹) + bd = −db + bd = 0, so that Y·Y^T is (a b; c d)·(a c; b d) = (a² + b²   ac + bd; ac + bd   c² + d²) = (α 0; 0 β). □

Algorithm 4 (nrsyf): symmetric factorization of a pair of non-residues
Input: (α, β) ∈ F_q², both being quadratic non-residues.
Output: Y ∈ F_q^{2×2}, s.t. Y·Y^T = (α 0; 0 β).
    (a, b) ← SoS(q, α)  ▷ α = a² + b²
    d ← a·sqrt(β·α⁻¹)  ▷ d² = a²·β·α⁻¹
    c ← −b·d·a⁻¹  ▷ ac + bd = 0
    return Y ← (a b; c d).

Using Algorithm 4, one can then factor any diagonal matrix within a finite field as a symmetric product with a tridiagonal matrix. This can then be used to compute efficiently A·D·A^T with D a diagonal matrix: factor D as D = ∆·∆^T with ∆ tridiagonal, then pre-multiply A by this tridiagonal matrix and run a fast symmetric product on the resulting matrix. This is shown in Algorithm 5, where the overhead, compared to a simple matrix product, is only O(n²) operations (that is, O(n) square roots and O(n) column scalings).

Algorithm 5 (syrkd): symmetric matrix product with diagonal scaling
Input: A ∈ F_q^{m×n} and D = Diag(d₁, …, d_n) ∈ F_q^{n×n}.
Output: A·D·A⊺ ∈ F_q^{m×m}.
1: if the number of quadratic non-residues in {d₁, …, d_n} is odd then
2:   Let d_ℓ be one of the quadratic non-residues
3:   D̄ ← Diag(d₁, …, d_n, d_ℓ) ∈ F_q^{(n+1)×(n+1)}
4:   Ā ← (A 0) ∈ F_q^{m×(n+1)} ▷ augment A with a zero column
5: else
6:   D̄ ← Diag(d₁, …, d_n) ∈ F_q^{n×n}; Ā ← A ∈ F_q^{m×n}
7: for all quadratic residues d_j in D̄ do
8:   Ā_{∗,j} ← sqrt(d_j)·Ā_{∗,j} ▷ scale column j of Ā by a square root of d_j
9: for all distinct pairs of quadratic non-residues (d_i, d_j) in D̄ do
10:  ∆ ← nrsyf(d_i, d_j) ▷ ∆·∆⊺ = Diag(d_i, d_j), using Algorithm 4
11:  (Ā_{∗,i} Ā_{∗,j}) ← (Ā_{∗,i} Ā_{∗,j})·∆
12: return syrk(Ā) ▷ Ā·Ā⊺, using Algorithm 2

In general, an L·D·L⊺ factorization may have antitriangular or antidiagonal blocks in D [12]. In order to reduce to a routine for fast symmetric multiplication with diagonal scaling, these blocks need to be processed once and for all, which is what this section is about. In odd characteristic, the 2-dimensional blocks in an L·D·L⊺ factorization are only of the form ( 0 β ; β 0 ), and always have the symmetric factorization:

( 1 −1 ; 1 1 )·( β/2 0 ; 0 −β/2 )·( 1 −1 ; 1 1 )⊺ = ( 0 β ; β 0 ). (29)

This shows the reduction to the diagonal case (note the requirement that 2 is invertible).
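Before turning to characteristic 2, the diagonal case (Algorithms 4 and 5) can be sketched concretely. The following toy Python sketch works over a small prime field F_p; the SoS and sqrt subroutines are brute-force stand-ins (assumptions — the paper's Remark 3.2 gives an efficient sum-of-squares decomposition), and the final fast recursive syrk is replaced by a naive symmetric product.

```python
# Toy sketch of Algorithms 4 (nrsyf) and 5 (syrkd) over a small prime field.

def is_residue(x, p):
    # Euler's criterion: nonzero x is a square mod p iff x^((p-1)/2) = 1.
    return pow(x % p, (p - 1) // 2, p) == 1

def sqrt_mod(x, p):
    # Brute-force modular square root (x must be a square mod p).
    return next(r for r in range(p) if r * r % p == x % p)

def sos(alpha, p):
    # Decompose alpha = a^2 + b^2 mod p, with a invertible.
    for a in range(1, p):
        b2 = (alpha - a * a) % p
        if b2 == 0 or is_residue(b2, p):
            return a, sqrt_mod(b2, p)

def nrsyf(alpha, beta, p):
    # Algorithm 4: Y such that Y.Y^T = Diag(alpha, beta),
    # for alpha, beta both quadratic non-residues.
    a, b = sos(alpha, p)
    d = a * sqrt_mod(beta * pow(alpha, -1, p) % p, p) % p
    c = -b * d * pow(a, -1, p) % p
    return [[a, b], [c, d]]

def syrk_diag(A, diag, p):
    # Algorithm 5 (sketch): A.Diag(d).A^T for nonzero d_i, by scaling or
    # pairing the columns of A, then a plain symmetric product.
    A = [row[:] for row in A]
    d = list(diag)
    nonres = [j for j, dj in enumerate(d) if not is_residue(dj, p)]
    if len(nonres) % 2 == 1:          # odd number of non-residues:
        l = nonres[-1]                # duplicate one of them, and
        d.append(d[l])                # augment A with a zero column
        for row in A:
            row.append(0)
        nonres.append(len(d) - 1)
    for j, dj in enumerate(d):        # residues: scale column by a sqrt
        if is_residue(dj, p):
            s = sqrt_mod(dj, p)
            for row in A:
                row[j] = row[j] * s % p
    for i, j in zip(nonres[0::2], nonres[1::2]):   # non-residues: pair up
        Y = nrsyf(d[i], d[j], p)
        for row in A:
            row[i], row[j] = ((row[i] * Y[0][0] + row[j] * Y[1][0]) % p,
                              (row[i] * Y[0][1] + row[j] * Y[1][1]) % p)
    m, n = len(A), len(A[0])
    return [[sum(A[i][k] * A[j][k] for k in range(n)) % p
             for j in range(m)] for i in range(m)]
```

For instance, over F₇ (where 3 and 5 are non-residues and 2 is a residue), syrk_diag([[1, 2, 3], [4, 5, 6]], [3, 2, 5], 7) matches the direct computation of A·Diag(3, 2, 5)·A⊺ mod 7.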
In characteristic 2, some 2 × 2 blocks are antitriangular, of the form ( 0 β ; β γ ) with γ ≠ 0, and then:

( 0 β ; β γ ) = (( βγ^{−1/2} 0 ; 0 γ^{1/2} )·( 1 1 ; 0 1 ))·(( βγ^{−1/2} 0 ; 0 γ^{1/2} )·( 1 1 ; 0 1 ))⊺. (30)

Therefore the antitriangular blocks also reduce to the diagonal case. There remain the antidiagonal blocks (γ = 0). The symmetric factorization in this case might require an extra row or column [18], as shown in Eq. (31):

(( β 0 ; 0 1 )·( 1 1 0 ; 0 1 1 ))·(( β 0 ; 0 1 )·( 1 1 0 ; 0 1 1 ))⊺ = ( 0 β ; β 0 ) mod 2. (31)

A first option is thus to augment A by one column for each antidiagonal block, by applying the 2 × 3 factor of Eq. (31). A better option is to combine an antidiagonal block with a 1 × 1 scalar block x, as shown in Eq. (32):

( √x √x √x ; β 0 β ; 0 1 1 )·( √x √x √x ; β 0 β ; 0 1 1 )⊺ = ( x 0 0 ; 0 0 β ; 0 β 0 ) mod 2. (32)

Hence, any antidiagonal block can be combined with any 1 × 1 scalar block, at the cost of a few column operations on A. This indeed extracts the antidiagonal elements and creates a 3 × 3 block whose leading scalar x can serve in a further combination with the next antidiagonal block. Algorithm 6 sums up the use of Eqs. (29) to (32).

REFERENCES
[1] M. Baboulin, L. Giraud, and S. Gratton. A parallel distributed solver for large dense symmetric systems: Applications to geodesy and electromagnetism problems.
Int. J. of HPC Applications, 19(4):353–363, 2005. doi:10.1177/1094342005056134.
[2] G. Beniamini and O. Schwartz. Faster matrix multiplication via sparse decomposition. In Proc. SPAA’19, pages 11–22, 2019. doi:10.1145/3323165.3323188.
[3] M. Bodrato. A Strassen-like matrix multiplication suited for squaring and higher power computation. In Proc. ISSAC’10, pages 273–280. ACM, 2010. doi:10.1145/1837934.1837987.
[4] R. P. Brent. Algorithms for matrix multiplication. Technical Report STAN-CS-70-157, C.S. Dpt., Stanford University, Mar. 1970.
[5] J. Brillhart. Note on representing a prime as a sum of two squares. Math. of Computation, 26(120):1011–1013, 1972. doi:10.1090/S0025-5718-1972-0314745-6.
[6] N. H. Bshouty. On the additive complexity of 2 × 2 matrix multiplication. Inf. Processing Letters, 56(6):329–335, Dec. 1995. doi:10.1016/0020-0190(95)00176-X.
[7] Ph. Chatelin. On transformations of algorithms to multiply 2 × 2 matrices. Inf. Processing Letters, 22(1):1–5, Jan. 1986. doi:10.1016/0020-0190(86)90033-5.
[8] H. F. de Groote. On varieties of optimal algorithms for the computation of bilinear mappings I. The isotropy group of a bilinear mapping. Theoretical Computer Science, 7(2):1–24, 1978. doi:10.1016/0304-3975(78)90038-5.
[9] H. F. de Groote. On varieties of optimal algorithms for the computation of bilinear mappings II. Optimal algorithms for 2 × 2-matrix multiplication. Theoretical Computer Science, 7(2):127–148, 1978. doi:10.1016/0304-3975(78)90045-2.
[10] J. J. Dongarra, J. Du Croz, S. Hammarling, and I. S. Duff. A Set of Level 3 Basic Linear Algebra Subprograms. ACM Trans. on Math. Soft., 16(1):1–17, Mar. 1990. doi:10.1145/77626.79170.
[11] J.-G. Dumas, P. Giorgi, and C. Pernet. Dense linear algebra over prime fields. ACM Trans. on Math. Soft., 35(3):1–42, Nov. 2008. doi:10.1145/1391989.1391992.
[12] J.-G. Dumas and C. Pernet. Symmetric indefinite elimination revealing the rank profile matrix. In
Proc. ISSAC’18, pages 151–158. ACM, 2018. doi:10.1145/3208976.3209019.

Algorithm 6: syrkbd: symmetric matrix product with block diagonal scaling
Input: A ∈ F_q^{m×n}; B ∈ F_q^{n×n}, block diagonal with scalar blocks or 2-dimensional blocks of the form ( 0 β ; β γ ), with β ≠ 0.
Output: A·B·A⊺ ∈ F_q^{m×m}.
1: Ā ← A ∈ F_q^{m×n}; D̄ ← I_n
2: for all scalar blocks in B at position j do
3:   D̄_{j,j} ← B_{j,j}
4: if q is odd then ▷ use Eq. (29)
5:   for all symmetric antidiagonal blocks in B at (j, j+1) do
6:     β ← B_{j,j+1} (= B_{j+1,j})
7:     D̄_{j,j} ← β/2; D̄_{j+1,j+1} ← −β/2
8:     (Ā_{∗,j} Ā_{∗,j+1}) ← (Ā_{∗,j} Ā_{∗,j+1})·( 1 −1 ; 1 1 )
9: else
10:  for all antitriangular blocks in B at position (j, j+1) do
11:    β ← B_{j,j+1} (= B_{j+1,j}); δ ← sqrt(B_{j+1,j+1})
12:    Ā_{∗,j} ← βδ⁻¹·Ā_{∗,j} ▷ scale column j of Ā
13:    Ā_{∗,j+1} ← δ·Ā_{∗,j+1} ▷ scale column j+1 of Ā
14:    Ā_{∗,j+1} ← Ā_{∗,j+1} + Ā_{∗,j} ▷ use Eq. (30)
15:    Swap columns j and j+1 of Ā
16:  if there are n/2 antidiagonal blocks in B then ▷ use Eq. (31)
17:    β ← B_{1,2} (= B_{2,1})
18:    Ā_{∗,1} ← β·Ā_{∗,1}; Ā ← (Ā | Ā_{∗,1} + Ā_{∗,2}) ∈ F_q^{m×(n+1)}
19:    ℓ ← 1; δ ← 1
20:  else
21:    δ ← sqrt(D̄_{ℓ,ℓ}), where ℓ is such that D̄_{ℓ,ℓ} is a scalar block
22:  for all remaining antidiagonal blocks in B at (j, j+1) do ▷ use Eq. (32)
23:    β ← B_{j,j+1} (= B_{j+1,j})
24:    Ā_{∗,ℓ} ← δ·Ā_{∗,ℓ}; Ā_{∗,j} ← β·Ā_{∗,j}
25:    (Ā_{∗,ℓ} Ā_{∗,j} Ā_{∗,j+1}) ← (Ā_{∗,ℓ} Ā_{∗,j} Ā_{∗,j+1})·( 1 1 1 ; 1 0 1 ; 0 1 1 )
26:    D̄_{ℓ,ℓ} ← 1; δ ← 1
27: return syrkd(Ā, D̄) ▷ Ā·D̄·Ā⊺, using Algorithm 5

[13] J.-C. Faugère. FGb: A Library for Computing Gröbner Bases. In Proc. ICMS’10, LNCS 6327, pages 84–87, 2010. doi:10.1007/978-3-642-15582-6_17.
[14] The FFLAS-FFPACK group.
FFLAS-FFPACK: Finite Field Linear Algebra Subroutines / Package, 2019. v2.4.1. URL: http://github.com/linbox-team/fflas-ffpack.
[15] N. J. Higham. Stability of a method for multiplying complex matrices with three real matrix multiplications. SIMAX, 13(3):681–687, 1992. doi:10.1137/0613043.
[16] E. Karstadt and O. Schwartz. Matrix multiplication, a little faster. In Proc. SPAA’17, pages 101–110. ACM, 2017. doi:10.1145/3087556.3087579.
[17] F. Le Gall. Powers of tensors and fast matrix multiplication. In Proc. ISSAC’14, pages 296–303. ACM, 2014. doi:10.1145/2608628.2608664.
[18] A. Lempel. Matrix factorization over GF(2) and trace-orthogonal bases of GF(2^n). SIAM J. on Computing, 4(2):175–186, 1975. doi:10.1137/0204014.
[19] G. Seroussi and A. Lempel. Factorization of symmetric matrices and trace-orthogonal bases in finite fields. SIAM J. on Computing, 9(4):758–767, 1980. doi:10.1137/0209059.
[20] V. Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13:354–356, 1969. doi:10.1007/BF02165411.
[21] S. Wedeniwski. Primality tests on commutator curves. PhD thesis, U. Tübingen, 2001.
[22] S. Winograd. La complexité des calculs numériques. La Recherche, 8:956–963, 1977.
[23] Z. Xianyi, M. Kroeker, et al. OpenBLAS, an Optimized BLAS library.
A APPENDIX

A.1 Proof of Proposition 3.1
Proposition 3.1. Algorithm 2 is correct for any skew-orthogonal matrix Y.

Proof. If Y is skew-orthogonal, then Y·Y⊺ = −I. Write A = ( A₁₁ A₁₂ ; A₂₁ A₂₂ ) and C = A·A⊺, and recall the intermediate quantities of Algorithm 2:
S₁ = A₂₂ − A₂₁·Y, S₃ = (A₂₁ − A₁₁)·Y, S₂ = S₃ − A₂₂, S₄ = S₂ + A₁₂;
P₁ = A₁₁·A₁₁⊺, P₂ = A₁₂·A₁₂⊺, P₃ = S₃·S₁⊺, P₄ = A₂₂·S₄⊺, P₅ = S₂·S₂⊺;
U₁ = P₁ + P₂, U₂ = P₁ + P₅, U₃ = U₂ + P₃, U₄ = U₃ + P₄, U₅ = U₃ + P₃⊺.
First,
U₁ = P₁ + P₂ = A₁₁·A₁₁⊺ + A₁₂·A₁₂⊺ = C₁₁. (33)
Denote by R₁ the product:
R₁ = A₁₁·Y·S₁⊺ = A₁₁·Y·(A₂₂⊺ − Y⊺·A₂₁⊺) = A₁₁·(Y·A₂₂⊺ + A₂₁⊺). (34)
Thus, as S₂ = S₃ − A₂₂ = (A₂₁ − A₁₁)·Y − A₂₂ = −S₁ − A₁₁·Y:
U₂ = P₁ + P₅ = A₁₁·A₁₁⊺ + S₂·S₂⊺ = A₁₁·A₁₁⊺ + (S₁ + A₁₁·Y)·(S₁⊺ + Y⊺·A₁₁⊺) = S₁·S₁⊺ + R₁⊺ + R₁. (35)
And denote R₂ = A₂₁·Y·A₂₂⊺, so that:
S₁·S₁⊺ = (A₂₂ − A₂₁·Y)·(A₂₂⊺ − Y⊺·A₂₁⊺) = A₂₂·A₂₂⊺ − A₂₁·A₂₁⊺ − R₂ − R₂⊺. (36)
Furthermore, from Equation (34):
R₁ + P₃ = R₁ + S₃·S₁⊺ = R₁ + (A₂₁ − A₁₁)·Y·(A₂₂⊺ − Y⊺·A₂₁⊺) = A₁₁·(Y·A₂₂⊺ + A₂₁⊺) + S₃·S₁⊺ = A₂₁·Y·A₂₂⊺ + A₂₁·A₂₁⊺ = R₂ + A₂₁·A₂₁⊺. (37)
Therefore, from Equations (35), (36) and (37):
U₅ = U₂ + P₃ + P₃⊺ = S₁·S₁⊺ + R₁ + R₁⊺ + P₃ + P₃⊺ = A₂₂·A₂₂⊺ + (−1 + 2)·A₂₁·A₂₁⊺ = C₂₂. (38)
And the coefficient U₄ of the result is obtained from Equations (37) and (38):
U₄ = U₃ + P₄ = U₅ − P₃⊺ + P₄
 = U₅ − P₃⊺ + A₂₂·(A₁₂⊺ + Y⊺·A₂₁⊺ − Y⊺·A₁₁⊺ − A₂₂⊺)
 = A₂₁·A₂₁⊺ − P₃⊺ + A₂₂·(A₁₂⊺ + Y⊺·A₂₁⊺ − Y⊺·A₁₁⊺)
 = R₁⊺ − R₂⊺ + A₂₂·(A₁₂⊺ + Y⊺·A₂₁⊺ − Y⊺·A₁₁⊺)
 = R₁⊺ + A₂₂·(A₁₂⊺ − Y⊺·A₁₁⊺)
 = A₂₁·A₁₁⊺ + A₂₂·A₁₂⊺ = C₂₁. (39)
Finally, P₁ = A₁₁·A₁₁⊺, P₂ = A₁₂·A₁₂⊺, and P₅ = S₂·S₂⊺ are symmetric by construction. So are therefore U₁ = P₁ + P₂, U₂ = P₁ + P₅ and U₅ = U₂ + (P₃ + P₃⊺). □

A.2 Threshold in the theoretical number of operations for dimensions that are a power of two
Here, we look for a theoretical threshold below which our fast symmetric algorithm performs fewer arithmetic operations than the classical one. Below that threshold, any recursive call should instead invoke a classical algorithm for A·A⊺. But, depending on whether padding or static/dynamic peeling is used, this threshold varies. For powers of two, however, no padding nor peeling occurs, and we thus examine the thresholds in this case. Recall that the classical syrk computes each of the n(n+1)/2 entries of the lower triangle with one dot product of size n, that is (2n−1)·n(n+1)/2 arithmetic operations in total, matching the first row of Table 5.

n               4      8      16      32       64        128
syrk            70     540    4216    33264    264160    2105280
Rec. SWSyrk-i   70     540    4216
G0-i                                           36099     221390
G1-i                                           37499     226990
G2-i                                           38899     232590
G3-i                                           40299     238190
Table 5: Number of arithmetic operations in the multiplication of an n × n matrix by its transpose: blue when Syrk-i (using Strassen-Winograd with i − 1 recursive levels) is better than the other Syrk variants; orange/red/violet/green when ours (using Strassen-Winograd with i − 1 recursive levels, and G0-i over C, G1-i if −1 is a square, G2-i or G3-i otherwise, depending on whether −2 is a square or not) is better than the others.

First, from Section 3.1, over C, we can choose Y = i·I_n. Multiplications by i then amount to just exchanging the real and imaginary parts, an extra cost accounted for in Equation (27). With this accounting, our fast algorithm performs fewer operations than the classical one for n ≥ 16 with 1 recursive level (and thus 0 recursive levels for Strassen-Winograd), for n ≥ 32 with 2 recursive levels, etc. Over a field where −1 is a square, the fast algorithm in turn becomes profitable for n ≥ 64 with 3 recursive levels, etc.
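To double-check the five-product block scheme of Appendix A.1, here is a small self-contained Python verification over F₁₃, where Y = 5·I is skew-orthogonal since 5² ≡ −1 (mod 13). The block formulas below are one self-consistent reading of the scheme (an assumption about the exact labeling in Algorithm 2, not a verbatim transcription): three symmetric products (P₁, P₂, P₅) and two general ones (P₃, P₄).

```python
# Sanity check of a five-product scheme for C = A.A^T with a skew-orthogonal
# Y (Y.Y^T = -I), over F_13 where 5^2 = -1 (mod 13), so Y = 5*I works.
import random

p = 13
b = 2  # block size; A is (2b) x (2b)

def mul(X, Z):  # matrix product mod p
    return [[sum(X[i][k] * Z[k][j] for k in range(len(Z))) % p
             for j in range(len(Z[0]))] for i in range(len(X))]

def add(X, Z):
    return [[(x + z) % p for x, z in zip(r, s)] for r, s in zip(X, Z)]

def sub(X, Z):
    return [[(x - z) % p for x, z in zip(r, s)] for r, s in zip(X, Z)]

def tr(X):  # transpose
    return [list(r) for r in zip(*X)]

random.seed(0)
rnd = lambda: [[random.randrange(p) for _ in range(b)] for _ in range(b)]
A11, A12, A21, A22 = rnd(), rnd(), rnd(), rnd()
Y = [[5 if i == j else 0 for j in range(b)] for i in range(b)]

# Additions, then five products, then recombination of the three blocks.
S1 = sub(A22, mul(A21, Y))
S3 = mul(sub(A21, A11), Y)
S2 = sub(S3, A22)
S4 = add(S2, A12)
P1, P2, P5 = mul(A11, tr(A11)), mul(A12, tr(A12)), mul(S2, tr(S2))
P3, P4 = mul(S3, tr(S1)), mul(A22, tr(S4))
U1 = add(P1, P2)        # C11
U2 = add(P1, P5)
U3 = add(U2, P3)
U4 = add(U3, P4)        # C21
U5 = add(U3, tr(P3))    # C22

# Assemble C = (C11 C21^T; C21 C22) and compare with the classical product.
C = [U1[i] + tr(U4)[i] for i in range(b)] + [U4[i] + U5[i] for i in range(b)]
A = [A11[i] + A12[i] for i in range(b)] + [A21[i] + A22[i] for i in range(b)]
assert C == mul(A, tr(A))
```

The check uses only Y·Y⊺ = −I and ring operations, so it is insensitive to the choice of base field or block size, as long as a skew-orthogonal Y exists.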