Some fast algorithms multiplying a matrix by its adjoint
Jean-Guillaume Dumas, Clément Pernet, and Alexandre Sedoglavic
Univ. Grenoble Alpes, UMR CNRS 5224, LJK, F-38000 Grenoble, France, https://ljk.imag.fr/CAS3C3 , {Jean-Guillaume.Dumas,Clement.Pernet}@univ-grenoble-alpes.fr
Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France
January 5, 2021
Abstract
We present a non-commutative algorithm for the multiplication of a 2 × 2 block-matrix by its adjoint, using only 5 block products, over C or in positive characteristic. The resulting algorithm for arbitrary dimensions is a reduction of the multiplication of a matrix by its adjoint to general matrix product, improving by a constant factor previously known reductions. We also prove that there is no algorithm derived from bilinear forms using only four products and the adjoint of one of them. Second, we give novel dedicated algorithms for the complex field and the quaternions to alternatively compute the multiplication, taking advantage of the structure of the matrix-polynomial arithmetic involved. We then analyze the respective ranges of predominance of the two strategies. Finally, we propose schedules with low memory footprint that support a fast and memory-efficient practical implementation over a prime field.
1 Introduction

Volker Strassen's algorithm [19], with 7 recursive multiplications and 18 additions, was the first sub-cubic time algorithm for matrix product, with a cost of O(n^2.81). Summarizing the many improvements which have happened since then, the cost of multiplying two arbitrary n × n matrices over a ring R will be denoted by MM_ω^R(n) = O(n^ω) ring operations, where 2 < ω ≤ 3 (with ω < 2.38 for the best bound known to date). We consider here the computation of the product of a matrix by its transpose, A · A^⊤, or by its conjugate transpose, A · A^H, which we handle in a unified way as the product A · φ(A), where φ is a matrix anti-homomorphism. In the rest of the paper, φ(A) will be referred to as the adjoint of A. For this computation, the natural divide and conquer algorithm, splitting the matrices in four quadrants, uses 6 block multiplications (as any of the two off-diagonal blocks can be recovered from the other one). We propose instead a new algorithm using only 5 block multiplications, for any antihomomorphism φ, provided that the base ring supports the existence of skew-unitary matrices.

Our contributions are the following:
1. For this product, the best previously known cost bound was equivalent to 2/(2^ω − 4) · MM_ω(n) over any field (see [6]); our algorithm requires 2/(2^ω − 3) · MM_ω(n) ring operations when there exists a skew-unitary matrix. It is derived from the class of Strassen-like algorithms multiplying 2 × 2 matrices in 7 multiplications. By exploiting the symmetry of the problem, it requires about half of the arithmetic cost of general matrix multiplication when ω is log2(7).
2. We partially address the more general question of whether there is an algorithm derived from bilinear forms, in the Strassen orbit, with fewer products, by proving the non-existence of an algorithm with four products and the (hermitian) transpose of one of them.
3. In [7] the algorithm is shown to be efficient over C, for a range of matrix multiplication exponents (including all the feasible ones), and for any positive characteristic field, unconditionally.
We extend this analysis to the case of the Hermitian transpose: while our five-products algorithm is unusable due to the non-existence of skew-unitary matrices over C, we propose a 2M algorithm, adapted from the 3M algorithm for the product of complex matrices.
4. Finally, we propose novel dedicated algorithms for the multiplication of a matrix by its transpose or conjugate transpose over the algebra of quaternions (over R or any commutative field), improving on the dominant term of the state-of-the-art complexity bounds for these problems.

After introducing the terminology in Section 2, we present in Section 3 the main recursive algorithm computing the product of a matrix by its adjoint in 5 block products, provided that a skew-unitary matrix is given. We survey in Section 4 the most classical instances of base fields supporting the existence of skew-unitary matrices. We then investigate in Section 5 the minimality of five products for computing the product of a 2 × 2 block-matrix by its adjoint.

2 Preliminaries
To unify the notions of transposition and conjugate transposition, we use the formalism of antihomomorphisms and of involutive antihomomorphisms, as recalled in the following definitions.
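Before the formal definitions, here is a small numerical illustration (not part of the paper, and with hypothetical helper names): for φ the conjugate transpose on complex matrices, represented as plain nested lists, the defining reversal property φ(A·M) = φ(M)·φ(A) and the involution φ(φ(A)) = A can be checked directly.

```python
# Hypothetical sketch: phi is the conjugate transpose; we check numerically
# that phi reverses products, phi(A*M) = phi(M)*phi(A), and is an involution.

def phi(A):
    """Conjugate transpose of an m x n matrix given as nested lists."""
    m, n = len(A), len(A[0])
    return [[A[j][i].conjugate() for j in range(m)] for i in range(n)]

def matmul(A, B):
    """Plain matrix product of compatible nested-list matrices."""
    m, n, k = len(A), len(B[0]), len(B)
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(n)]
            for i in range(m)]

A = [[1 + 2j, 3 - 1j], [0 + 1j, 2 + 0j]]   # 2 x 2 complex matrix
M = [[2 + 1j], [1 - 1j]]                   # 2 x 1 complex matrix
lhs = phi(matmul(A, M))                    # phi(A*M), a 1 x 2 matrix
rhs = matmul(phi(M), phi(A))               # phi(M)*phi(A), a 1 x 2 matrix
```

Removing the `.conjugate()` call gives the plain transposition, the other motivating instance of the definitions below.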
Definition 1.
Let R, S be two rings. A map φ : R → S is a ring antihomomorphism if and only if, for all (x, y) in R × R:
φ(1_R) = 1_S, (1a)
φ(x + y) = φ(x) + φ(y), (1b)
φ(xy) = φ(y)·φ(x). (1c)
From this, one can define a matrix antihomomorphism by induction, as shown in Definition 2.

Definition 2.
Over a ring R, an involutive matrix antihomomorphism is a family of applications φ_{m,n} : R^{m×n} → R^{n×m}, for all (m, n) in N², satisfying, for additional (ℓ, k) in N² and for all A and A′ in R^{m×n}, for all M in R^{n×k}, for all B in R^{m×k}, for all C in R^{ℓ×n} and for all D in R^{ℓ×k}, the following relations:
φ_{m,n} ∘ φ_{n,m} = Id, (2a)
φ_{m,n}(A + A′) = φ_{m,n}(A) + φ_{m,n}(A′), (2b)
φ_{m,k}(A · M) = φ_{n,k}(M) · φ_{m,n}(A), (2c)
φ_{m+ℓ,n+k}([A B; C D]) = [φ_{m,n}(A) φ_{ℓ,n}(C); φ_{m,k}(B) φ_{ℓ,k}(D)]. (2d)
For convenience, we will denote all the applications of this family by φ, as the dimensions are clear from the context. This definition implies the following:

Lemma 3.
For all A in R^{m×n}, let B = φ(A). Then, for all suitable (i, j), the coefficient b_{ij} is φ(a_{ji}).
Proof. By induction, using Equation (2d): if m = n = 1, then φ(A) = [φ(a_{11})]. Then assume the property is true for all A ∈ R^{m×n} with m, n ≤ N, and consider a matrix A in R^{(N+1)×(N+1)}. Applying Equation (2d) on the block decomposition A = [A_{11} a_{12}; a_{21} a_{22}], where A_{11} is in R^{N×N}, yields the relations:
φ(A) = [φ(A_{11}) φ(a_{21}); φ(a_{12}) φ(a_{22})] = [φ(a_{ji})]_{ij} (3)
by induction hypothesis. The case of matrices in R^{m×(N+1)} and R^{(N+1)×n} is dealt with similarly, using 0-dimensional blocks a_{21} or a_{12} respectively.

Lemma 4.
For all α in R and for all A in R^{m×n}, φ(αA) = φ(A)·φ(α).
Proof. By Lemma 3, φ(α·I_m) = φ(α)·I_m = I_m·φ(α). Then, by Equation (2c), the relations φ(αA) = φ((α·I_m)·A) = φ(A)·φ(α·I_m) = φ(A)·φ(α) hold.
The following Lemma 5 shows that Definition 2 is a natural extension of a ring antiendomorphism to matrices.

Lemma 5.
An involutive matrix antihomomorphism is a ring antiendomorphism on its base ring (seen as the ring of 1 × 1 matrices).
Proof. Equations (2b) and (2c) directly imply Equations (1b) and (1c), respectively, when m = n = k = 1. Then, we have that φ(1) = φ(1) · 1, so that φ(φ(1)) = φ(φ(1) · 1) = φ(1) · φ(φ(1)). As φ is involutive, φ(φ(1)) = 1, hence 1 = φ(1) · 1 = φ(1).

Examples 6.
For matrices over a commutative ring,
• the matrix transpose, with φ(A) = A^⊤, and
• the matrix conjugate transpose, with φ(A) = A^H,
are two examples of matrix anti-homomorphisms. However, for instance, transposition over the quaternions is a counter-example, as the non-commutativity there implies that in general (A · B)^⊤ ≠ B^⊤ · A^⊤.

Definition 7.
The image φ(A) of a matrix A by an antihomomorphism is called the adjoint of A.

Definition 8.
Let A ∈ R^{m×n}. We denote respectively by Low(A) and Up(A) the m × n lower and upper triangular parts of A, namely the matrices L and U verifying:
• L_{ij} = a_{ij} for i ≥ j, and L_{ij} = 0 otherwise;
• U_{ij} = a_{ij} for i ≤ j, and U_{ij} = 0 otherwise.

Lemma 9. If φ(A) = A in R^{n×n}, then Up(A) = φ(Low(A)).
Proof. Applying Lemma 3, the coefficients u_{ij} of U = φ(Low(A)), for 0 < i ≤ j, satisfy u_{ij} = φ(a_{ji}). Now if φ(A) = A, we have u_{ij} = a_{ij} for 0 < i ≤ j and u_{ij} = 0 otherwise, as φ(0) = 0, by Equation (2b). Hence U = Up(A).

Definition 10 (Skew-unitary). A matrix Y in R^{n×n} is skew-unitary relatively to a matrix antihomomorphism φ if the following relation holds:
Y · φ(Y) = −I_n. (4)

For the cost analysis, we will also need the following variant of the Master Theorem, reflecting the constant in the leading term of the computed cost bound.

Lemma 11. Let T(n) be defined by the recurrence T(n) = a·T(n/2) + b·(n/2)^α + o(n^α), where log₂(a) < α. Then T(n) = (b/(2^α − a))·n^α + o(n^α).
Proof.
T(n) = a^{log₂ n}·T(1) + Σ_{i=0}^{log₂(n)−1} a^i · ( b·(n/2^{i+1})^α + o((n/2^i)^α) )
     = n^{log₂ a}·T(1) + (b/2^α)·n^α · Σ_{i=0}^{log₂(n)−1} (a/2^α)^i + o(n^α)
     = (b/(2^α − a))·n^α + o(n^α).

3 An algorithm for the product of a matrix by its adjoint with five multiplications

We now show how to compute the product of a matrix by its adjoint, with respect to an involutive antihomomorphism, in only 5 block multiplications, together with 2 multiplications by a skew-unitary matrix. This is a generalization of [7, Algorithm 2] to any involutive antihomomorphism. We next give Algorithm 12 for even dimensions. In case of odd dimensions, padding or static/dynamic peeling can always be used [3].
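As a quick numerical illustration of the constant in Lemma 11 (a sketch of ours, not from the paper): unrolling the recurrence with a = 3 recursive calls and b = 2, the values used in the cost analysis of Algorithm 12 below, the ratio T(n)/n^α converges to b/(2^α − a), which is exactly 1/2 when α = log₂(7).

```python
# Sketch: iterate T(n) = a*T(n/2) + b*(n/2)**alpha and compare T(n)/n**alpha
# with the predicted leading constant b/(2**alpha - a) from Lemma 11.
from math import log2

ALPHA = log2(7)  # the exponent of Strassen-like multiplication

def T(n, a=3, b=2.0, alpha=ALPHA):
    """Unrolled recurrence T(n) = a*T(n/2) + b*(n/2)**alpha, with T(1) = 1."""
    if n == 1:
        return 1.0
    return a * T(n // 2) + b * (n / 2) ** alpha

predicted = 2.0 / (2 ** ALPHA - 3)           # b/(2**alpha - a) = 0.5 here
ratio = T(2 ** 20) / (2 ** 20) ** ALPHA      # T(n)/n**alpha for large n
```

The lower-order terms decay geometrically, so the ratio is already within floating-point noise of 0.5 at n = 2^20.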
Theorem 13.
Algorithm 12 is correct. Moreover, if any two n × n matrices over a ring R can be multiplied in MM_ω^R(n) = O(n^ω) ring operations for ω > 2, and if there exists a skew-unitary matrix which can be multiplied with any other matrix in o(n^ω) ring operations, then Algorithm 12 requires fewer than (2/(2^ω − 3))·MM_ω^R(n) + o(n^ω) ring operations.
Proof. For the cost analysis, Algorithm 12 is applied recursively to compute the three products P1, P2 and P5, while P3 and P4 are computed in MM_ω^R(n/2) operations each, using the general matrix multiplication algorithm. The second hypothesis is that applying the skew-unitary matrix Y to an n × n matrix costs Y(n) = o(n^ω) operations. Then, applying Remark 15 thereafter, the cost T(n) of Algorithm 12 satisfies:
T(n) ≤ 3·T(n/2) + 2·MM_ω^R(n/2) + 2·Y(n/2) + 7.5·(n/2)² + o(n²), (5)
and T(4) is a constant. Thus, by Lemma 11:
T(n) ≤ (2/(2^ω − 3))·MM_ω^R(n) + o(n^ω). (6)
Now, for the correctness, by Equation (2d), we have to show that the result of Algorithm 12 is indeed:
Low(A·φ(A)) = Low( [A11 A12; A21 A22] · [φ(A11) φ(A21); φ(A12) φ(A22)] )
            = [ Low(A11·φ(A11) + A12·φ(A12))        ×
                A21·φ(A11) + A22·φ(A12)             Low(A21·φ(A21) + A22·φ(A22)) ].

Algorithm 12 Product of a matrix by its adjoint
Input: A ∈ R^{m×n} (with even m and n for the sake of simplicity);
Input: φ an involutive matrix antihomomorphism;
Input: Y ∈ R^{n/2×n/2} skew-unitary for φ.
Output: Low(A · φ(A)).
Split A = [A11 A12; A21 A22], where A11 is in R^{m/2×n/2}.
▷ 4 additions and 2 multiplications by Y:
S1 ← (A21 − A11) · Y
S2 ← A22 − A21 · Y
S3 ← S1 − A22
S4 ← S3 + A12
▷ 3 recursive (P1, P2, P5) and 2 general products (P3, P4):
Low(P1) ← Low(A11 · φ(A11))
Low(P2) ← Low(A12 · φ(A12))
P3 ← A22 · φ(S4)
P4 ← S1 · φ(S2)
Low(P5) ← Low(S3 · φ(S3))
▷ half additions and complete additions:
Low(U1) ← Low(P1) + Low(P2)
Low(U2) ← Low(P1) + Low(P5)
Up(U2) ← φ(Low(U2))   ▷ forms the full matrix U2
U3 ← U2 + P4
U4 ← U3 + P3
Low(U5) ← Low(U3) + Low(φ(P4))
return [Low(U1), ×; U4, Low(U5)].

First, we have that:
Low(U1) = Low(P1) + Low(P2) = Low(A11·φ(A11) + A12·φ(A12)). (7)
Second, as Y is skew-unitary, we have that Y·φ(Y) = −I_{n/2}. Also, by Equations (2b) and (2c), φ(S2) = φ(A22) − φ(Y)·φ(A21). Then, denote by R1 the product:
R1 = A11·Y·φ(S2) = A11·Y·(φ(A22) − φ(Y)·φ(A21)) = A11·(Y·φ(A22) + φ(A21)). (8)
Further, by Equations (2a) and (2c), we have that P1 = A11·φ(A11), P2 = A12·φ(A12), and P5 = S3·φ(S3) are invariant under the action of φ. So are, therefore, U1 = P1 + P2, U2 = P1 + P5 and U5 = U2 + (P4 + φ(P4)). By Lemma 9, it suffices to compute Low(U2) and, if needed, we also have Up(U2) = φ(Low(U2)).
Then, as S3 = S1 − A22 = (A21 − A11)·Y − A22 = −S2 − A11·Y, and φ(φ(S2)) = S2 by Equation (2a), we have that:
U2 = P1 + P5 = A11·φ(A11) + S3·φ(S3) = A11·φ(A11) + (S2 + A11·Y)·(φ(S2) + φ(Y)·φ(A11)) = S2·φ(S2) + φ(R1) + R1. (9)
Also, denote R2 = A21·Y·φ(A22), so that:
S2·φ(S2) = (A22 − A21·Y)·(φ(A22) − φ(Y)·φ(A21)) = A22·φ(A22) − A21·φ(A21) − R2 − φ(R2). (10)
Furthermore, from Equation (8):
R1 + P4 = R1 + S1·φ(S2) = R1 + (A21 − A11)·Y·(φ(A22) − φ(Y)·φ(A21)) = A21·Y·φ(A22) + A21·φ(A21) = R2 + A21·φ(A21).
(11)
This shows, from Equations (9) to (11), that:
U5 = U2 + P4 + φ(P4) = S2·φ(S2) + φ(R1) + R1 + P4 + φ(P4) = A21·φ(A21) + A22·φ(A22). (12)
Third, the last coefficient U4 of the result is obtained from Equations (11) and (12):
U4 = U3 + P3 = U2 + P4 + P3
   = A22·φ(A22) − A21·φ(A21) − R2 − φ(R2) + R1 + φ(R1) + P4 + P3
   = A22·φ(A22) − φ(R2) + φ(R1) + P3, (13)
since, by Equation (11), R1 + P4 = R2 + A21·φ(A21). Now, denoting R3 = A11·Y·φ(A22), so that φ(R1) = φ(R3) + A21·φ(A11), we get:
P3 = A22·φ(S4) = A22·φ((A21 − A11)·Y + A12 − A22) = φ(R2) − φ(R3) + A22·φ(A12) − A22·φ(A22). (14)
Hence U4 = A21·φ(A11) + A22·φ(A12).
To our knowledge, the best previously known result was with a 2/(2^ω − 4) factor instead; see e.g. [6].

Examples 14.
In many cases, applying the skew-unitary matrix Y to an n × n matrix costs only y·n² operations, for some constant y depending on the base ring. If the ring is the complex field C, or satisfies the conditions of Proposition 16, there is a square root i of −1. Setting Y = i·I_n yields Y(n) = n². Otherwise, we show in Section 4 that in characteristic p ≡ 3 mod 4, Proposition 17 produces Y equal to [a b; −b a] ⊗ I_{n/2}, for which Y(n) = 3n². As a sub-case, the latter can be improved when p ≡ 3 mod 8: then, Lemma 18 shows that −2 is a square. Therefore, in this case, set a = 1 and b ≡ √(−2) mod p, so that one multiplication is saved. The relation a² + b² = −1 then yields Y = [1 √(−2); −√(−2) 1] ⊗ I_{n/2}, for which Y(n) = 2n².

                       |  O(n^{log₂ 7})         |  O(n^ω)
A·φ(A) ∈ F^{n×n} [6]   |  (2/3)·MM_{log₂ 7}(n)  |  2/(2^ω − 4)·MM_ω(n)
Alg. 12                |  0.5·MM_{log₂ 7}(n)    |  2/(2^ω − 3)·MM_ω(n)

Table 1: Leading terms of the arithmetic complexity bounds.

Remark 15.
Each recursive level of Algorithm 12 is composed of 9 block additions. An exhaustive search on all symmetric algorithms in the orbit of that of Strassen (via a Gröbner basis parameterization [7]) showed that this number is minimal in this class of algorithms. Note also that 3 out of these additions involve symmetric matrices and are therefore only performed on the lower triangular part of the matrix. Overall, the number of scalar additions is 6n² + (3/2)·n(n+1) = (15/2)·n² + (3/2)·n, nearly half of the optimal number in the non-symmetric case [5, Theorem 1]. To further reduce the number of additions, a promising approach is that undertaken in [15, 2]. It is however not clear to us how to adapt our strategy to their recursive transformation of basis.
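To close this section, here is a minimal executable sketch of the base case of Algorithm 12 (our own instantiation, not code from the paper): blocks of size 1 × 1 over F_13, so that φ is the identity on scalars, and Y is the scalar i = 5 with i² ≡ −1 (mod 13). The names S1–S4, P1–P5 and U1–U5 follow the schedule of Algorithm 12; for scalars φ(P4) = P4, so U5 = U3 + P4.

```python
# Scalar sketch of Algorithm 12 over F_p with 1x1 blocks (assumptions: p = 13,
# i = 5 a square root of -1, phi the identity on scalars).
from itertools import product

p, i = 13, 5
assert (i * i) % p == p - 1  # i is indeed a square root of -1 mod 13

def mul_by_adjoint_2x2(a11, a12, a21, a22, y=i):
    """Return (C11, C21, C22) with C = A * A^T, A = [[a11, a12], [a21, a22]],
    computed with the 5 products P1..P5 of Algorithm 12."""
    s1 = (a21 - a11) * y % p
    s2 = (a22 - a21 * y) % p
    s3 = (s1 - a22) % p
    s4 = (s3 + a12) % p
    p1, p2 = a11 * a11 % p, a12 * a12 % p   # the "recursive" products
    p3, p4 = a22 * s4 % p, s1 * s2 % p      # the general products
    p5 = s3 * s3 % p
    u2 = (p1 + p5) % p
    u3 = (u2 + p4) % p
    return ((p1 + p2) % p, (u3 + p3) % p, (u3 + p4) % p)

# brute-force check against the definition C = A * A^T
for a, b, c, d in product(range(5), repeat=4):
    c11, c21, c22 = mul_by_adjoint_2x2(a, b, c, d)
    assert c11 == (a * a + b * b) % p
    assert c21 == (c * a + d * b) % p
    assert c22 == (c * c + d * d) % p
```

The exhaustive loop confirms the correctness argument of Theorem 13 in this toy setting.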
4 Rings with skew-unitary matrices

Algorithm 12 requires a skew-unitary matrix. Unfortunately, there are no skew-unitary matrices over R, nor over Q, for φ the transposition, nor over C for φ the Hermitian transposition (over these fields, the diagonal entries of Y·φ(Y) are sums of squares, or of square moduli, and thus can never equal −1). Hence, Algorithm 12 provides no improvement in these cases. In other domains, the simplest skew-unitary matrices just use a square root of −1.

4.1 Over the complex field, with φ(A) = A^⊤

Algorithm 12 is thus directly usable over C^{n×n} with φ(A) = A^⊤ and Y = i·I_n in C^{n×n}. When complex numbers are represented in Cartesian form, as a pair of real numbers, the multiplications by Y = i·I_n are essentially free, since they just exchange the real and imaginary parts, with one sign flip. As mentioned, for the conjugate transposition, φ(A) = A^H, on the contrary, there are no candidate skew-unitary matrices, and we for now report no improvement in this case using this approach (but another one does, as shown in Section 6.1). Now, even though over the complex numbers the product of a matrix by its conjugate transpose is more widely used, there are some applications for the product of a matrix by its transpose; see for instance [1]. This is reflected in the BLAS API, where both routines zherk and zsyrk are offered.

4.2 Over some rings, with φ(A) = A^⊤

Over some rings, square roots of −1 do exist, denoted i again. There, Algorithm 12 only requires some pre-multiplications by this square root (with, again, Y = i·I_n ∈ R^{n×n}), but within the ring. Further, when the ring is a field of positive characteristic, the existence of a square root of minus one can be characterized, as shown in Proposition 16, thereafter.

Proposition 16.
Fields of characteristic two, of odd characteristic p satisfying p ≡ 1 mod 4, or finite fields that are an even extension, contain a square root of −1.
Proof. If p = 2, then 1² = 1 = −1. If p ≡ 1 mod 4, then 4 divides the order p − 1 of the cyclic multiplicative group of the base field of size p; a generator x of this group then satisfies (x^{(p−1)/4})² = x^{(p−1)/2} = −1. If the finite field F is of cardinality p^{2k} with p odd, then, similarly, 4 divides p^{2k} − 1, and the same argument applies.

4.3 Any field of positive characteristic, with φ(A) = A^⊤

Actually, we show that Algorithm 12 can also be run without any field extension, even when −1 is not a square, provided that the dimension of Y is even. Whenever this dimension is odd, it is always possible to pad with zeroes, so that A·A^⊤ = [A 0]·[A 0]^⊤.

Proposition 17.
Let F_{p^k} be a field of characteristic p. There exists (a, b) in F_p², such that the matrix
[a·I_n  b·I_n; −b·I_n  a·I_n] = [a b; −b a] ⊗ I_n ∈ F_p^{2n×2n} (15)
is skew-unitary for the transposition.
Proof. Using the relation
[a·I_n  b·I_n; −b·I_n  a·I_n] · [a·I_n  b·I_n; −b·I_n  a·I_n]^⊤ = (a² + b²)·I_{2n}, (16)
it suffices to prove that there exist a, b such that a² + b² = −1. In characteristic 2, (a, b) = (1, 0) is a solution, as 1 + 0 = 1 = −1. In odd characteristic, there are (p+1)/2 distinct square elements x_i² in the base prime field. Therefore, there are (p+1)/2 distinct elements −1 − x_i². But there are only p distinct elements in the base field; thus there exists a couple (i, j) such that −1 − x_i² = x_j² [18, Lemma 6].

To further improve the running time of multiplications by a skew-unitary matrix in this case, one can set one of the two squares to be 1. This is possible, for instance, whenever −2 is a square:

Lemma 18. If p ≡ 3 mod 8, then −2 is a square modulo p.
Proof. Using the Legendre symbol, (−2/p) = (−1/p)·(2/p) = (−1)^{(p−1)/2}·(−1)^{(p²−1)/8} = (−1)·(−1) = 1.

Now, Proposition 17 shows that skew-unitary matrices do exist for any field with positive characteristic. For Algorithm 12, we need to build them, mostly for p ≡ 3 mod 4. For this, it is possible to use the decomposition of primes into sums of squares:
1. Compute by enumeration a prime r ≡ 1 mod 4 with r ≡ −1 mod p (e.g., among the integers r = 4pk + 2p − 1 for k = 1, 2, ...);
2. Thus, the methods of [4] allow one to decompose any such prime into squares, giving a couple (a, b) in Z² such that a² + b² = r. Finally, this gives a² + b² ≡ −1 mod p.
By the prime number theorem, the first step is polynomial in log(p), as is the second step (a square root modulo a prime, denoted sqrt thereafter, has a cost close to that of an exponentiation, and the rest of Brillhart's algorithm is gcd-like). In practice, though, it is faster to use the following Algorithm 19, even though the latter has a better asymptotic complexity bound only if the ERH is true.

Algorithm 19
SoS : Sum of squares decomposition over a finite field
Input: p ∈ P \ {2}, k ∈ Z.
Output: (a, b) ∈ Z², s.t. a² + b² ≡ k mod p.
if (k/p) = 1 then   ▷ k is a square mod p
    return (sqrt(k), 0)
else   ▷ find the smallest quadratic non-residue
    s ← 2
    while (s/p) == 1 do s ← s + 1
    c ← sqrt(s − 1)   ▷ s − 1 must be a square
    r ← k·s⁻¹ mod p
    a ← sqrt(r)   ▷ now k ≡ a²·s ≡ a²·(1 + c²) mod p
    return (a, a·c mod p)

Proposition 20.
Algorithm 19 is correct and, under the ERH, runs in expected time Õ(log³(p)).
Proof. If k is a square, then the square of one of its square roots, added to the square of zero, is a solution. Otherwise, the lowest quadratic non-residue (LQNR) modulo p is one plus a square b² (1 is always a square, so the LQNR is larger than 2). For any generator of Z_p*, quadratic non-residues, as well as their inverses (s is invertible as it is non-zero and p is prime), have an odd discrete logarithm. Therefore, the product of k and the inverse of the LQNR must be a square a². This means that the relation k = a²·(1 + b²) = a² + (ab)² holds. Now, for the running time: under the ERH, [20, Theorem 6.35] shows that the LQNR is in O(log²(p)), so that finding it dominates the modular square root computations.

Remark 21.
Another possibility is to use randomization: instead of using the lowest quadratic non-residue (LQNR), randomly select a non-residue s, and then decrement it until s − 1 is a quadratic residue (1 is a square, so this will terminate). In practice, the running time seems very close to that of Algorithm 19 anyway; see, e.g., the implementation in Givaro rev. 7bdefe6, https://github.com/linbox-team/givaro . Also, when computing t sums of squares modulo the same prime, one can compute the LQNR only once to get all the sums of squares, with an expected cost bounded by Õ(log³(p) + t·log²(p)).

Remark 22.
Except in characteristic 2, or in algebraic closures, where every element is a square anyway, Algorithm 19 is easily extended to work over any finite field: compute the LQNR in the base prime field, then use the Tonelli–Shanks or the Cipolla–Lehmer algorithm to compute square roots in the extension field. Denote by SoS(q, k) this algorithm decomposing k as a sum of squares within any finite field F_q. This is not always possible over infinite fields, but there Algorithm 19 still works anyway for the special case k = −1: just run it in the prime sub-field, since −1 must be in it.

4.4 Any even extension field, with φ(A) = A^H

With φ(A) = A^H, we need a matrix Y such that Y·Y^H = Y·conj(Y)^⊤ = −I. This is not possible anymore over the complex field, but works for any even extension field, thanks to Algorithm 19. To see this, we consider next the finite field F_{q²}, where q is a power of an arbitrary prime. Given a ∈ F_{q²}, we adopt the convention that conjugation is given by the Frobenius automorphism:
ā = a^q. (17)
The bar operator is F_q-linear and has order 2 on F_{q²}. First, if −1 is a square in F_q, then Y = √(−1)·I_n works: √(−1) lies in F_q, so it is invariant under conjugation, and Y·conj(Y)^⊤ = (√(−1))²·I_n = −I_n. Second, otherwise, q ≡ 3 mod 4 and there is a square root i of −1 in F_{q²}, from Proposition 16. Further, one can build (a, b), both in the base field F_q, such that a² + b² = −1, from Algorithm 19. Finally, Y = (a + ib)·I_n in F_{q²}^{n×n} is skew-unitary: indeed, since q ≡ 3 mod 4, i^q = (i⁴)^{(q−3)/4}·i³ = −i and, therefore, conj(a + ib) = (a + ib)^q = a − ib. Thus Y·conj(Y)^⊤ = (a + ib)(a − ib)·I_n = (a² + b²)·I_n = −I_n.

4.5 Any field with positive characteristic, with φ(A) = A^H

If −1 is a square, then Y = √(−1)·I_n is suitable, as in Section 4.4. Otherwise, take (a, b) in the base field such that a² + b² = −1; then Y = [a b; −b a] ⊗ I_{n/2} is a skew-unitary matrix. Indeed, since a and b live in the base field, they are invariant under the Frobenius automorphism. Therefore, Y^H = conj(Y)^⊤ = [a −b; b a] ⊗ I_{n/2}, and Y·Y^H = (a² + b²)·I_n = −I_n.

5 Minimality of five products for 2 × 2 block products

Our Algorithm 12 computes the product of a matrix over a ring by its (hermitian) transpose using only 5 block multiplications, plus the (hermitian) transpose of one of these block multiplications. In this section, we use some vector-space arguments and thus restrict ourselves to matrices over a field. We reformulate the method introduced by de Groote in [10] in order to prove the following theorem.

Theorem 23.
There is no algorithm derived from non-commutative 2 × 2 block-matrix product algorithms that computes the product of a matrix over a field by its (hermitian) transpose using only 4 block multiplications and the (hermitian) transpose of one of these block multiplications.

This result does not state that it is never possible to multiply a matrix by its adjoint using fewer than 5 multiplications, as shown by the following remark.
Remark 24.
Over any ring with a square root i of −1, there is a computational scheme requiring only 4 multiplications and computing the product of a 2 × 2 matrix by its transpose:
[a b; c d] · [a c; b d] = [ (a+ib)(a−ib)  × ;  ac+bd  (c+id)(c−id) ] = [ a²+b²  × ;  ac+bd  c²+d² ]. (18)
This is the case for instance over F_2, where i = 1, or over the complex numbers. As this scheme requires for instance that a·i·b = i·b·a, at least some commutativity is required; thus in general it does not apply to block matrices, and it is therefore not in the scope of Theorem 23.

The following section is devoted to a short presentation of the framework used in this part of our work.
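The 4-multiplication scheme of Remark 24 can be checked numerically; here is a sketch of ours over F_13, with i = 5 a square root of −1 (the function name is hypothetical). The products (a+ib)(a−ib), ac+bd (which costs two multiplications) and (c+id)(c−id) give the three distinct entries of the symmetric result.

```python
# Sketch of the 4-multiplication scheme of Remark 24 over F_p, with p = 13
# and i = 5 satisfying i^2 = -1 (mod 13).
p, i = 13, 5
assert (i * i) % p == p - 1

def adjoint_product_4mul(a, b, c, d):
    """Entries (C11, C21, C22) of [[a,b],[c,d]] * [[a,c],[b,d]] mod p."""
    m1 = (a + i * b) * (a - i * b) % p   # = a^2 + b^2, since i^2 = -1
    m2 = (a * c + b * d) % p             # two ordinary multiplications
    m3 = (c + i * d) * (c - i * d) % p   # = c^2 + d^2
    return (m1, m2, m3)

for a, b, c, d in [(1, 2, 3, 4), (5, 0, 7, 11), (12, 12, 1, 6)]:
    m1, m2, m3 = adjoint_product_4mul(a, b, c, d)
    assert m1 == (a * a + b * b) % p
    assert m2 == (a * c + b * d) % p
    assert m3 == (c * c + d * d) % p
```

Note how the check relies on the commutativity of F_13, exactly as pointed out in Remark 24.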
We present de Groote's proof using a tensorial presentation of bilinear maps; we recall briefly this standpoint through the following well-known example of an algorithm with seven multiplications, and we refer to [16] for a complete introduction to this framework.
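As a sanity check of the seven products recalled in Example 25 below, here is a direct implementation of Strassen's 2 × 2 scheme (a sketch of ours, with hypothetical function names), verified against the ordinary matrix product.

```python
# Strassen's seven products rho_1..rho_7 and the reassembly of C = A*B,
# checked against the direct 2x2 product on random integer matrices.
import random

def strassen_2x2(A, B):
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    r1 = a11 * (b12 - b22)
    r2 = (a11 + a12) * b22
    r3 = (a21 - a11) * (b11 + b12)
    r4 = (a21 + a22) * b11
    r5 = (a11 + a22) * (b11 + b22)
    r6 = a22 * (b21 - b11)
    r7 = (a12 - a22) * (b21 + b22)
    return ((r5 + r6 - r2 + r7, r1 + r2),
            (r4 + r6, r5 + r1 + r3 - r4))

def direct_2x2(A, B):
    return tuple(tuple(sum(A[i][k] * B[k][j] for k in range(2))
                       for j in range(2)) for i in range(2))

random.seed(0)
for _ in range(100):
    A = tuple(tuple(random.randint(-9, 9) for _ in range(2)) for _ in range(2))
    B = tuple(tuple(random.randint(-9, 9) for _ in range(2)) for _ in range(2))
    assert strassen_2x2(A, B) == direct_2x2(A, B)
```

Each rho_i is one bilinear product of a linear form in A and a linear form in B: exactly the data encoded by the rank-one summands of the tensor S in Equation (22).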
Example 25.
Considered as 2 × 2 matrices, the matrix product C = A · B can be computed using Strassen's algorithm by performing the following computations (see [19]):
ρ1 ← a11·(b12 − b22),   ρ2 ← (a11 + a12)·b22,   ρ3 ← (a21 − a11)·(b11 + b12),
ρ4 ← (a21 + a22)·b11,   ρ5 ← (a11 + a22)·(b11 + b22),   ρ6 ← a22·(b21 − b11),
ρ7 ← (a12 − a22)·(b21 + b22),
[c11 c12; c21 c22] = [ρ5 + ρ6 − ρ2 + ρ7,  ρ1 + ρ2;  ρ4 + ρ6,  ρ5 + ρ1 + ρ3 − ρ4]. (19)
With m, n, p equal to 2, this algorithm encodes a bilinear map:
F^{m×n} × F^{n×p} → F^{m×p},  (A, B) ↦ A · B. (20)
We keep the indices m, n, p in this section for the sake of clarity, in order to distinguish the different spaces involved in the sequel. The spaces F^{·×·} can be endowed with the Frobenius product ⟨M, N⟩ = Trace(M^⊤ · N), which establishes an isomorphism between F^{·×·} and its dual space (F^{·×·})*; hence, it allows for example to associate to the matrix multiplication (20) the trilinear form Trace(C^⊤ · A · B):
S| : F^{m×n} × F^{n×p} × (F^{m×p})* → F,  (A, B, C^⊤) ↦ ⟨C, A·B⟩. (21)
As, by construction, the space of trilinear forms is the canonical dual space of order-three tensor products, we can encode the Strassen multiplication algorithm (19) as the tensor S defined by:
S = Σ_{i=1}^{7} Σ¹_i ⊗ Σ²_i ⊗ S_i (22)
in (F^{m×n})* ⊗ (F^{n×p})* ⊗ F^{m×p}, with m = n = p = 2, where, for each i, the matrices Σ¹_i and Σ²_i hold the coefficients of the linear forms in A and B defining ρ_i, and S_i records the coefficients with which ρ_i contributes to the entries of C. For instance, the fifth summand, encoding ρ5 = (a11 + a22)·(b11 + b22) and its contributions to c11 and c22, is [1 0; 0 1] ⊗ [1 0; 0 1] ⊗ [1 0; 0 1].
Remark that, as introduced in the above Equation (22), we are going to use in the sequel the
Einstein summation convention, in order to simplify the forthcoming notations (according to this convention, when an index variable appears twice in a term and is not otherwise defined, it represents the sum of that term over all the values of the index). Starting from the tensor representation S of our algorithm, we can consider several contractions, which are the main objects manipulated in the sequel.

5.2 Flattening tensors and isotropies

The complete contraction
S|(A ⊗ B ⊗ C^⊤) is defined as the following map:
((F^{m×n})* ⊗ (F^{n×p})* ⊗ F^{m×p}) ⊗ (F^{m×n} ⊗ F^{n×p} ⊗ (F^{m×p})*) → F,
(Σ¹_i ⊗ Σ²_i ⊗ S_i) ⊗ (A ⊗ B ⊗ C^⊤) ↦ ⟨Σ¹_i, A⟩·⟨Σ²_i, B⟩·⟨S_i, C^⊤⟩. (23)
We already saw informally in the previous section that this complete contraction is Trace(A·B·C), and we recall in the following remark some of its basic properties.

Remark 26.
Given three invertible matrices
α ∈ F^{m×m}, β ∈ F^{p×p}, γ ∈ F^{n×n} (24)
that encode changes of basis, the trace Trace(A·B·C) is equal to:
Trace((A·B·C)^⊤) = Trace(C·A·B) = Trace(B·C·A), and Trace(α⁻¹·A·β · β⁻¹·B·γ · γ⁻¹·C·α). (25)
These relations illustrate the following theorem:

Theorem 27 ([12]). The isotropy group of the n × n matrix multiplication tensor is psl^±(F^n)^{×3} ⋊ S₃, where psl stands for the group of matrices of determinant ±1 and S₃ for the symmetric group on 3 elements.
Definition 28.
Given g = (α × β × γ) in psl^±(F^n)^{×3}, its action g ⋄ S on a tensor S is given by g ⋄ (Σ¹_i ⊗ Σ²_i ⊗ S_i), where each summand is equal to:
((α⁻¹)^⊤ · Σ¹_i · β^⊤) ⊗ ((β⁻¹)^⊤ · Σ²_i · γ^⊤) ⊗ ((γ⁻¹)^⊤ · S_i · α^⊤), for each fixed i. (26)
These isotropies will be used later; for the moment, let us focus our attention on the very specific standpoint on which the forthcoming developments are based: flattenings.

Definition 29.
Given a tensor S, the third flattening S|₃ (a.k.a. third contraction) of the tensor S is:
S|₃ : F^{m×p} → (F^{m×n})* ⊗ (F^{n×p})*,  M ↦ ⟨M, S_i⟩ Σ¹_i ⊗ Σ²_i. (27)

Example 30.
To illustrate this definition and some important technicalities, let us consider Im S|₃, the image of the flattening of the Strassen tensor (22): this is a subspace of (F^{m×n})* ⊗ (F^{n×p})*. More precisely, let us first consider only the fifth summand in Equation (22) and the image of its third flattening:
Im ([1 0; 0 1] ⊗ [1 0; 0 1] ⊗ [1 0; 0 1])|₃ = { (c11 + c22)·(I₂ ⊗ I₂) | (c11, c22) ∈ F² }. (28)
The indeterminates c11 and c22 keep track of the domain of the flattening. The 4 × 4 matrix in the above right-hand side should not be confused with the Kronecker product I₂ ⊗ I₂ involved in the left-hand side. In fact, the result I₄ of this Kronecker product is of classical matrix rank 4, while the rank in (F^{m×n})* ⊗ (F^{n×p})* of the right-hand side of (28) is 1, by construction. Hence, even if we present in this section the elements of F^{·×·} as matrices, we use a vectorization (e.g. F^{m×n} ≅ F^{mn}) of these matrices in order to perform our computations correctly. Taking this standpoint into account, we obtain the following description of the whole image of the third flattening of the Strassen tensor:
Im S|₃ = { [c11 c12 0 0; 0 0 c11 c12; c21 c22 0 0; 0 0 c21 c22] | (c11, c12, c21, c22) ∈ F⁴ }, (29)
which can be guessed almost without computation. In fact, this right-hand side matrix is just the matrix of the bilinear form defining the trilinear encoding of the matrix product:
Trace(A·B·C^⊤) = (a11 a12 a21 a22) · [c11 c12 0 0; 0 0 c11 c12; c21 c22 0 0; 0 0 c21 c22] · (b11, b12, b21, b22)^⊤. (30)
Hence, the flattening Im S|₃ is a canonical description of the matrix product, independent from the algorithm/tensor used to encode it; in particular, it is invariant under the action of the isotropies introduced in Definition 28. We are going to use these objects and their properties in the following section.

We are interested in the situation where, given a bilinear map, a classical representation by a tensor C is known, and we wish to disprove the existence of a tensor representation of a given rank. Inspired by the Steinitz exchange theorem, de Groote introduced in [10] the following subspace.

Definition 31.
Given a tensor T encoding a bilinear map whose codomain is M, and given q rank-one elements P_i, linearly independent in (F^{m×n})* ⊗ (F^{n×p})*, let us denote by L(Im T|₃, P_1, ..., P_q) the linear subspace of M defined by:
LinearSpan{ M ∈ M | ∃ (u_1, ..., u_q) ∈ F^q, Rank_{(F^{m×n})* ⊗ (F^{n×p})*} ( T|₃(M) + u_j·P_j ) = 1 }. (31)
We introduced the notation M for the codomain of the considered bilinear map in order to highlight the fact that it is isomorphic, via the Frobenius isomorphism, to the domain of the flattening, and to show how it is used in the following. The following proposition allows one to construct an effective test checking whether there exists a tensor of rank dim M + q that defines the considered bilinear map.

Proposition 32.
If there exists a tensor T of rank dim M + q encoding a bi-linear map with codomain M then there are q rank-one elements P i linearlyindependent in ( F m × n ) (cid:63) ⊗ ( F n × p ) (cid:63) such that L ( Im T | , P , . . . , P q ) is M .Proof. Let us assume that there exists a tensor C encoding the considered bilin-ear map, that its tensor rank ρ is greater than dim M + q and that it is definedby the sum Q i ⊗ R i . Remark that the set ( R i ) ≤ i ≤ ρ is a generating set of thespace M by hypothesis.Suppose now that there exists a tensor T of rank r = dim M + q encodingwith fewer summands the same considered bilinear map: T = P k ⊗ S k , S k ∈ M ⊂ F m × p , P k ∈ ( F m × n ) (cid:63) ⊗ ( F n × p ) (cid:63) . (32)The elements P k are linearly independent, otherwise T could be expressed witheven fewer terms. Furthermore, there is a subset of ( S i ) ≤ i ≤ dim M that is a baseof M (otherwise, T could not encode the same bilinear map as C ).Suppose that we reorder this set so that the base is ( S i + q ) ≤ i ≤ dim M . Byinvariance of the flattening map, the following relations hold: ∀ M ∈ M , C | ( M ) = (cid:104) M, R k (cid:105)Q k = T | ( M ) = (cid:104) M, S k (cid:105)P k . (33)By introducing a base ( B i ) ≤ i ≤ dim M of M , we could summarize this situationunder a matricial standpoint as follows: P ... P q (cid:104) B , R k (cid:105)Q k ... (cid:104) B dim M , R k (cid:105)Q k = (cid:18) I q × q C D (cid:19) P ... P q P q +1 ... P q +dim M (34)where the matrices C and D are such that: ∀ i ∈ { , . . . , dim M } , C ij = (cid:104) B i , S j (cid:105) , ∀ j ∈ { , . . . , q } ,D ij = (cid:104) B i , S q + j (cid:105) , ∀ j ∈ { , . . . , dim M } . (35)As, by hypothesis, ( S i + q ) ≤ i ≤ dim M is a basis of M , the matrix D is invertibleand we could rewrite Equation (34) as follows: U = − D − · C,V = D − , (cid:18) I q × q U V (cid:19) P ... P q (cid:104) B , R k (cid:105)Q k ... (cid:104) B dim M , R k (cid:105)Q k = P ... P q P q +1 ... P q +dim M . 
(36)

The dim M last rows of the matrix (U V) in Equation (36) give us dim M vectors (u_{j1}, ..., u_{jq}, v_{j1}, ..., v_{j dim M}) such that, for all j in {1, ..., dim M}:

Rank_{(F^{m×n})^* ⊗ (F^{n×p})^*}( Σ_{i=1}^{q} u_{ji} P_i + Σ_{i=1}^{dim M} v_{ji} Σ_k ⟨B_i, R_k⟩ Q_k ) = Rank( P_{q+j} ) = 1.   (37)

To conclude, we remark that these last relations show that all the matrices Σ_i v_{ji} B_i are in the subspace L(Im T|_3, P_1, ..., P_q): indeed, on M the flattenings C|_3 and T|_3 coincide, so Σ_i v_{ji} Σ_k⟨B_i, R_k⟩Q_k = T|_3(Σ_i v_{ji} B_i). As they are dim M independent linear combinations of basis elements of M, these matrices form another of its bases, and thus the subspace L(Im T|_3, P_1, ..., P_q) is equal to M.

In order to use Proposition 32 to prove Theorem 23, we have to show that for any element P = Q ⊗ R in (F^{m×n})^* ⊗ (F^{n×p})^*, the subspace L(Im T|_3, P, P^H) is not equal to M (with P^H = R^H ⊗ Q^H). This vector space M is a 3-dimensional vector space spanned by all the outputs of our bilinear map. Let us start to make this strategy more precise with the following remark.

Remark 33.
A classical block version of a bilinear algorithm (see, e.g., [6]) computes:

( A_1 A_2 ; A_3 A_4 ) · ( A_1^H A_3^H ; A_2^H A_4^H ) = ( A_1A_1^H + A_2A_2^H   × ; A_3A_1^H + A_4A_2^H   A_3A_3^H + A_4A_4^H ).   (38)

As the result of this algorithm is self-adjoint, by Lemma 9 there is no need to compute the top-right coefficient, and thus we conclude that the dimension of M is at most 3. Hence, there exists a bilinear map encoded by a tensor H = Σ_{i=1}^{6} Q_i ⊗ R_i ⊗ S_i of rank 6 that computes the product of a matrix by its hermitian transpose, and the image of its third flattening H|_3 is:

Im H|_3(F^{m×p}) = { Σ_{k=1}^{6} ⟨M, S_k⟩ Q_k ⊗ R_k | M ∈ M },   (39)

a 3-dimensional space whose generic element we write in terms of the coordinates c_1, c_2, c_3 of M.

We need a last standard definition in order to classify all possible tensors P considered in the sequel.

Definition 34.
Given a tensor P decomposable as a sum of rank-one tensors:

P = Σ_{i=1}^{q} ⊗_{j=1}^{s} P_{ij},  where the P_{ij} are matrices.   (40)

The list [(Rank P_{ij})_{j=1...s}]_{i=1...q} is called the type of the tensor P.

Remark 35.
In our situation q is one, s is two and the P_{ij} are 2×2 matrices; hence, the tensor P can only have type [(1,1)], [(1,2)], [(2,1)] or [(2,2)].
We also use the isotropies presented in Definition 28 in order to simplify the tensor P as much as possible, as illustrated in the proof of the following statement. Furthermore, let us first introduce several notations:
• we denote by G_{x=0} the matrix G in which the indeterminate x is replaced by 0;
• the determinant of the matrix ( G_{i,j} G_{i,l} ; G_{k,j} G_{k,l} ) is denoted by Det_{[i,j|k,l]}(G);
• as the rank of a matrix is invariant under elementary row and column operations, we also use the notation G|[i, j|ℓ] for the matrix resulting from the addition to the i-th row of G of its j-th row multiplied by ℓ.

Lemma 36.
There is no tensor P in (F^{m×n})^* ⊗ (F^{n×p})^* of type (1, i), with i equal to 1 or 2, such that the subspace L(H|_3, P, P^H) is equal to M.

Proof. Let us consider a tensor P = A ⊗ B of type (1, i) with i = 1,
2. As the first component A is of rank one, there exist two vectors such that:

A = (a_1, a_2) ⊗ (b_1 ; b_2) = ( a_1 b_1  a_2 b_1 ; a_1 b_2  a_2 b_2 ).   (41)

If the coefficient a_2 is zero, we choose the matrix β to be the identity matrix. If the coefficient a_1 is zero, we may consider a permutation matrix β = ( 0 1 ; 1 0 ); otherwise, if s = a_1 ā_1 + a_2 ā_2 ≠ 0, consider:

β = ( ā_1  a_2 ; ā_2  −a_1 ).   (42)

Then we have both β·β^H = s·I_{2×2} and A·β with a single non-zero column, of entries s b_1 and s b_2. Hence, in any of these cases, there always exists a matrix β, whose inverse is a multiple of its hermitian transpose, such that the isotropy g defined by (I_{2×2} × β × I_{2×2}) satisfies the following properties:

g ⋄ P = (u_1, u_2) ⊗ ( v_1 v_2 ; v_3 v_4 ),   g ⋄ P^H = (1/s)·(g ⋄ P)^H.   (43)

With the above notations, conventions and isotropy action, given any M in M, the 4×4 matrix H|_3(M) + y_1 P + y_2 P^H has entries that are linear combinations of the y_1 u_i v_j, of their adjoints, and of the coordinates c_1, c_2, c_3 of M.   (44)

This matrix is supposed to be of rank one, so all its 2×2 minors must vanish; this forces c_1 or c_2 to be equal to 0, and for any such P the subspace L(H|_3, P, P^H) is thus not M.

There remains the case s = 0. Then let

β = ( a_2  x ; −a_1  y ),  with x, y chosen so that a_1 x + a_2 y = 1 (such x, y exist since A is non-zero).   (45)

The inverse of β is no longer related to β^H, but β still transforms A into a matrix with a single non-zero column, of entries b_1 and b_2. Thus, for this β, the action of the isotropy g = (I_{2×2} × β × I_{2×2}) is:

g ⋄ P = (b_1, b_2) ⊗ ( z_1 z_2 ; z_3 z_4 )  and  g ⋄ P^H = U ⊗ V.   (46)

With the above notations, conventions and isotropy action, given any M in M, the 4×4 matrix G = H|_3(M) + y_1 P + y_2 P^H has entries combining the y_1 b_i z_j, the y_2 u_i v_j, and the coordinates c_i of M.
(47)

This matrix is supposed to be of rank one in L(H|_3, P, P^H); thus, all its 2×2 minors must vanish. If y_2 is zero, then the constraints of Equation (47) show that a minor of G_{y_2=0} equal to a product c_i c_j must vanish. On the other hand, if y_2 is different from 0, then two minors of G are equal, up to non-zero factors, to c_i y_2 u_1 v_1 and c_i y_2 u_1 v_2, and are supposed to be equal to zero by hypothesis. We are going to explore all the consequences induced by this constraint: branching on which of the u_i and v_i vanish, each branch exhibits a 2×2 minor of G (possibly after specialization) equal, up to a non-zero factor, to one of the c_i or to a product of two of them, which must therefore vanish.   (48)

Now, if u_1 is different from 0, then from the first two minors, either the relations v_1 = v_2 = 0 hold or c_1 = 0. Further, not both b_1 and b_2 can be zero, otherwise the tensor would be of rank 0. W.l.o.g., suppose that b_2 ≠ 0 and let G′ = G_{v_1=0, v_2=0}|[1, 2|−b_1/b_2]. Then a minor of G′ is equal, up to the non-zero factor −b_1/b_2, to a term y_1 c_i u_j v_k, and either b_1 = 0 or the corresponding v_k = 0 or c_i = 0. This gives the following distinctions (recall that now u_1 ≠ 0 and y_1 ≠ 0): in each sub-case, a minor of G′ equal, up to a non-zero factor, to one of the c_i or to a product of two of them must vanish.   (49)

There remains the case u_1 = 0, v_1 = 0 and v_2 = 0 in Equation (48). Here also, w.l.o.g., suppose that b_2 ≠ 0 and let G* = G_{u_1=0, v_1=0, v_2=0}|[1, 2|−b_1/b_2]: whether b_1 vanishes or not, a minor of G* equal, up to a non-zero factor, to one of the c_i or to a product of two of them must vanish.
(50)

Thus, in all cases, for any such P, the set L(H|_3, P, P^H) is not M.

Note that a computational way to see this is to perform a Gröbner basis computation directly from Equation (47): for instance, over C this gives that the relation c_1 c_2 = 0 must hold, and that the set is not the full codomain.

According to Remark 35, the above computations deal with half of the cases to consider. We remark that, mutatis mutandis, similar computations also exclude the existence of an algorithm where P is of type (2, 1).

Lemma 37.
There is no tensor P in (F^{m×n})^* ⊗ (F^{n×p})^* of type (2, 2) such that the subspace L(H|_3, P, P^H) is equal to M.

Proof. First, let us consider a tensor P of type (2, 2). As in the proof of Lemma 36, there exists a matrix β such that the action of the isotropy g = (I_{2×2} × β × I_{2×2}) is:

g ⋄ P = I_{2×2} ⊗ ( z_1 z_2 ; z_3 z_4 )  and  g ⋄ P^H = U ⊗ V.   (51)

With the above notations, conventions and isotropy action, given any M in M, the 4×4 matrix G = H|_3(M) + y_1 P + y_2 P^H has entries combining the y_1 z_j, the y_2 u_i v_j, and the coordinates c_i of M.   (52)

This matrix is supposed to be of rank one in L(H|_3, P, P^H); thus, all its 2×2 minors must vanish. A Gröbner basis computation over C shows in that case that the relations c_1 c_2 = c_1 c_3 = c_2 c_3 = 0 hold, and this is sufficient to conclude. Nevertheless, we present a proof that does not require such computations and is valid for any field.

On the one hand, if y_2 = 0, then the constraints of Equation (52) show, for instance, that two minors of G_{y_2=0}, each equal to a product of two of the c_i, are equal to 0. On the other hand, if y_2 ≠ 0, then consider a minor of G which is equal, up to a non-zero factor, to c_i y_2 u_1 v_1 and supposed to be equal to zero by hypothesis. We are going to explore all the consequences induced by this constraint.
First, if y_1 = 0, then branching on which of the u_i and v_i vanish, each case exhibits a minor of G_{y_1=0} (after suitable specializations) equal, up to a non-zero factor, to one of the c_i or to a product of two of them, which must vanish.   (53)

If y_1 and y_2 are both non-zero, then let G′ = G|[1, 4|−1]: a minor of G′ is equal, up to a non-zero factor, to y_1 c_i u_j v_k and is supposed to be equal to zero by hypothesis. We explore all the consequences induced by this constraint: in each of the sub-cases, according to which of the u_i and v_i vanish, a minor of G′ equal, up to a non-zero factor, to one of the c_i or to a product of two of them must vanish.   (54)

Thus, in all cases, for any such P, the set L(H|_3, P, P^H) is not M.

Proposition 32, together with the computations done in Lemma 36 and in Lemma 37, is sufficient to conclude the proof of Theorem 23.

The cost comparison in Table 1 is for matrices over an arbitrary ring with skew-unitary matrices. When the ring is an extension, the input of the problem is a polynomial matrix over the base ring.
Following the traditional equivalence between polynomial matrices and matrix polynomials leads to alternative ways to multiply a matrix by its transpose, considering the product of two polynomials with matrix coefficients. More specifically, we will focus on degree-two extensions and compare the costs in terms of the number of operations over the base ring.
Over the field C of complex numbers, the 3M method (Karatsuba) for general matrix multiplication reduces the number of multiplications of real matrices from 4 to 3 [13]: if MM^R_ω(n) is the cost of multiplying n×n matrices over R, then the 3M method costs 3MM^R_ω(n) + o(n^ω) operations over R. Adapting this approach to the product of a matrix by its adjoint yields a 2M method using only 2 real products: Algorithm 38
2M multiplication in an extension
Input: a commutative ring K and one of its extensions E;
Input: A ∈ K^{m×n} and B ∈ K^{m×n};
Input: φ an involutive matrix antihomomorphism of E;
Input: i ∈ E, commuting with K, and such that ε = i·φ(i) ∈ K.
Output: (A + iB)·φ(A + iB) ∈ E^{m×m}.
Let H = A·φ(B) ∈ K^{m×m};
Let G = (A + B)·φ(A + εB) ∈ K^{m×m};
return (G − εH − φ(H)) + H·φ(i) + i·φ(H).

Lemma 39.
Algorithm 38 is correct. It costs 2MM^K_ω(n) + o(n^ω) operations over the base ring K.

Proof. Let M = (A + iB)·φ(A + iB). By Lemma 4, we have that φ(iB) = φ(B)φ(i) = φ(i)φ(B). Thus, M = A·φ(A) + A·φ(B)φ(i) + iB·φ(A) + iB·φ(B)φ(i), by Equations (2b) and (2c). As i commutes with K, we also have that M = A·φ(A) + B·φ(B)ε + A·φ(B)φ(i) + iB·φ(A). By Equations (2a) and (2c), we have that φ(H) = φ(φ(B))·φ(A) = B·φ(A). Finally, G = A·φ(A) + A·φ(B)ε + B·φ(A) + B·φ(B)ε, as φ(ε) = φ(φ(i))φ(i) = iφ(i) = ε. Therefore G − εH − φ(H) = A·φ(A) + B·φ(B)ε, and M = (G − εH − φ(H)) + H·φ(i) + i·φ(H), computed with the two products H and G only.

Example 40.
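To illustrate, here is a minimal numerical sketch of Algorithm 38 for K = R and E = C, with φ the conjugate transposition, so that φ(i) = −i and ε = i·φ(i) = 1 (the variable names below are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 5
A = rng.standard_normal((m, n))
B = rng.standard_normal((m, n))

# phi is the conjugate transposition; on real matrices it is the plain
# transpose.  With i = sqrt(-1): phi(i) = -i and eps = i*phi(i) = 1.
eps, phi_i = 1.0, -1j

H = A @ B.T                     # first real product:  H = A.phi(B)
G = (A + B) @ (A + eps * B).T   # second real product: G = (A+B).phi(A+eps*B)
M = (G - eps * H - H.T) + H * phi_i + 1j * H.T   # i.phi(H) = 1j * H.T

ref = (A + 1j * B) @ (A + 1j * B).conj().T       # direct computation
assert np.allclose(M, ref)
```

Only the two products H and G require matrix multiplication over R; the rest is additions and scalings by i.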
For instance, if K = R and E = C, then i = √−1 satisfies the conditions of Algorithm 38, both when φ is the transposition and when it is the conjugate transposition. Therefore, we obtain the multiplication of a matrix by its adjoint, whether it be the transpose or the conjugate transpose, in 2MM^R_ω(n) + o(n^ω) operations in R.

The classical divide and conquer algorithm (see, e.g., [6]) works directly over C and uses the equivalent of 2/(2^ω − 4) complex floating-point n×n matrix products. Using the 3M method for the complex products, this algorithm uses overall 6/(2^ω − 4)·MM^R_ω(n) + o(n^ω) operations in R. Finally, Algorithm 12 costs the equivalent of 2/(2^ω − 3) complex multiplications, for a leading term bounded by 6/(2^ω − 3)·MM^R_ω(n), improving over 2MM^R_ω(n) for ω > log_2(6) ≈ 2.585, but this does not apply to the conjugate transpose case. This is summarized in Table 2, also replacing ω by 3 or log_2(7) to illustrate the situation for the main feasible exponents.

Problem | Alg.    | MM_3(n) | MM_{log_2(7)}(n)       | MM_ω(n)
A·B     | naive   | 8n^3    | 4 MM^R_{log_2(7)}(n)   | 4 MM^R_ω(n)
A·B     | 3M      | 6n^3    | 3 MM^R_{log_2(7)}(n)   | 3 MM^R_ω(n)
A·A^H   | Alg. 38 | 4n^3    | 2 MM^R_{log_2(7)}(n)   | 2 MM^R_ω(n)
A·A^H   | [6]     | 3n^3    | 2 MM^R_{log_2(7)}(n)   | 6/(2^ω − 4) MM^R_ω(n)
A·A^⊺   | Alg. 38 | 4n^3    | 2 MM^R_{log_2(7)}(n)   | 2 MM^R_ω(n)
A·A^⊺   | [6]     | 3n^3    | 2 MM^R_{log_2(7)}(n)   | 6/(2^ω − 4) MM^R_ω(n)
A·A^⊺   | Alg. 12 | 2.4n^3  | 1.5 MM^R_{log_2(7)}(n) | 6/(2^ω − 3) MM^R_ω(n)

Table 2: Multiplication of a matrix by its adjoint or transpose over C: leading term of the cost in number of arithmetic operations over R. Note that 2 < 6/(2^ω − 4) only when ω < log_2(7) ≈ 2.81, and 2 < 6/(2^ω − 3) only when ω < log_2(6) ≈ 2.585.

Given a field K of characteristic not 2, the K-algebra of quaternions H(K) is the K-vector space of all formal linear combinations:

x_0 + x_1 i + x_2 j + x_3 k,   (x_0, ..., x_3) ∈ K^4,   (55)

endowed with the multiplication induced by the relations:

i^2 = j^2 = k^2 = ijk = −1.   (56)

The quaternions can also be seen as a degree 2 extension of a degree 2 extension, but a non-commutative one. Therefore, the 3M or 2M techniques of Section 6.1 only apply directly to the first degree 2 extension, while the second extension requires 4 multiplications; this gives 4 × 3 = 12 base field multiplications for a general product. The multiplication of quaternions is (x_0 + x_1 i + x_2 j + x_3 k)(y_0 + y_1 i + y_2 j + y_3 k) = (w_0 + w_1 i + w_2 j + w_3 k), with:

w_0 = x_0 y_0 − x_1 y_1 − x_2 y_2 − x_3 y_3   (57)
w_1 = x_0 y_1 + x_1 y_0 + x_2 y_3 − x_3 y_2   (58)
w_2 = x_0 y_2 − x_1 y_3 + x_2 y_0 + x_3 y_1   (59)
w_3 = x_0 y_3 + x_1 y_2 − x_2 y_1 + x_3 y_0   (60)

Fiduccia showed in [9] how to compute this product with only 10 field multiplications and 25 additions, cleverly using Gauß' trick for the multiplication of complex numbers in three multiplications. Regarding the minimal number of base field operations required for the multiplication of quaternions, de Groote shows in [11] that 10 multiplications are minimal to compute both X·Y and Y·X. In addition, over the reals and the rationals, the minimal number of multiplications is 8 [14] and [10, Proposition 1.7]. The algorithm of [14], requiring also 28 additions, is recalled in Algorithm 41.

Algorithm 41
Howell-Lafon quaternion multiplication
Input: (cid:126)x = x + x i + x j + x k ∈ H ( K ), (cid:126)y = y + y i + y j + y k ∈ H ( K ) Output: (cid:126)xy ∈ H ( K ) Q = ( x + x )( y + y ) , Q = ( x − x )( y − y ) Q = ( x − x )( y + y ) , Q = ( x + x )( y − y ) Q = ( x + x )( y + y ) , Q = ( x − x )( y − y ) Q = ( x + x )( y − y ) , Q = ( x − x )( y + y ) T = Q + Q , T = Q + Q T = Q − Q , T = Q − Q T = T − T , T = T + T T = T + T , T = T − T w = Q + T / , w = Q − T / w = T / − Q , w = T / − Q return w + w i + w j + w k . Proposition 42.
Algorithm 41 extends to the case of quaternions with matrix coefficients. If matrix multiplication over the base field costs MM^K_ω(n) field operations for n×n matrices and matrix addition O(n^2) field additions, then the dominant cost of Algorithm 41 applied to matrices is bounded by 8MM^K_ω(n).

Proof. Correctness is by inspection, since 2 is invertible in K of characteristic different from 2. The complexity bound is just the fact that the algorithm performs 8 multiplications of matrices with coefficients in the base field.

The lowest number of multiplications required to multiply two quaternions being 8, Proposition 42 is the best possible result while keeping the view of the matrices as quaternions with base field matrix coefficients, X_0, X_1, X_2, X_3 and Y_0, Y_1, Y_2, Y_3. The alternative is to use a matrix with quaternion coefficients and use classical fast matrix algorithms. Next, we see the different alternatives for the multiplication by an adjoint.

We now propose several methods to multiply a quaternion matrix by its transpose:
1. First, a "7M" method, which considers a quaternion with matrix coefficients, and thus reduces everything to seven general matrix multiplications over the base field.
2. Second, one can consider a matrix with quaternion coefficients and just apply any matrix multiplication algorithm where the multiplication of coefficients is that of Algorithm 41.
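The setting of Proposition 42 can be illustrated by a small sketch. The code below is not the 8-multiplication scheme of Algorithm 41; it is the schoolbook 16-product formula of Equations (57)-(60) applied to matrix coefficients, checked against the standard embedding of quaternions into 2×2 complex block matrices (helper names are illustrative):

```python
import numpy as np

def quat_mat_mul(X, Y):
    """Schoolbook product of quaternions with matrix coefficients,
    following Equations (57)-(60); 16 base-field matrix products."""
    x0, x1, x2, x3 = X
    y0, y1, y2, y3 = Y
    return (x0 @ y0 - x1 @ y1 - x2 @ y2 - x3 @ y3,
            x0 @ y1 + x1 @ y0 + x2 @ y3 - x3 @ y2,
            x0 @ y2 - x1 @ y3 + x2 @ y0 + x3 @ y1,
            x0 @ y3 + x1 @ y2 - x2 @ y1 + x3 @ y0)

def to_complex_block(X):
    """Embed X0 + X1 i + X2 j + X3 k as a 2x2 complex block matrix;
    this embedding is multiplicative, giving an independent check."""
    x0, x1, x2, x3 = X
    e, f = x0 + 1j * x1, x2 + 1j * x3
    return np.block([[e, f], [-f.conj(), e.conj()]])

rng = np.random.default_rng(1)
X = tuple(rng.standard_normal((3, 3)) for _ in range(4))
Y = tuple(rng.standard_normal((3, 3)) for _ in range(4))
assert np.allclose(to_complex_block(quat_mat_mul(X, Y)),
                   to_complex_block(X) @ to_complex_block(Y))
```

The point of Proposition 42 is that the 16 products above can be replaced by the 8 products of Algorithm 41, since its correctness only uses commutativity of the scalar halvings, not of the coefficients.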
7M method: a quaternion with matrix coefficients.
Many simplifications used in computing the square of a quaternion no longer apply when computing the product of a quaternion matrix by its transpose, due to the non-commutativity of the matrix product. For A, B, C, D ∈ K^{m×n}:

(A + Bi + Cj + Dk)(A^⊺ + B^⊺i + C^⊺j + D^⊺k) = S_0 + S_1 i + S_2 j + S_3 k   (61)

where:

S_0 = AA^⊺ − BB^⊺ − CC^⊺ − DD^⊺   (62)
S_1 = (AB^⊺ + BA^⊺) + (CD^⊺ − DC^⊺)   (63)
S_2 = (AC^⊺ + CA^⊺) + (DB^⊺ − BD^⊺)   (64)
S_3 = (AD^⊺ + DA^⊺) + (BC^⊺ − CB^⊺)   (65)

Using Munro's trick twice, this can be computed with 7 multiplications over the field K and 17 additions (6 of which are half-additions), as shown in Algorithm 43.

Open question 1.
Multiply a quaternion with matrix coefficients by its transpose in fewer than 7 multiplications.

Using matrices of quaternions and divide and conquer.
In the following, for the sake of simplicity, we will consider only square matrices. Here, we consider instead a matrix with quaternion coefficients and perform a matrix-matrix product: since the quaternions are not commutative, M·M^⊺ is not necessarily symmetric, and one has to compute both the top-right and bottom-left corners of the product. Thus no gain is obvious between computing M·M^⊺ and M·N that way, and Algorithm 12 is a priori useless in this case, as remarked in the (counter-)Examples 6. The idea is thus to use a non-symmetric algorithm, applied to M and M^⊺. The baseline cost would then again be MM^{H(K)}_ω(n) = 8MM^K_ω(n).

Another approach is to use a divide and conquer strategy at the higher level: cut M into ( M_11 M_12 ; M_21 M_22 ), and compute:
• M_11·M_21^⊺, M_12·M_22^⊺, M_21·M_11^⊺, M_22·M_12^⊺ by the baseline algorithm;
• and M_11·M_11^⊺, M_12·M_12^⊺, M_21·M_21^⊺, M_22·M_22^⊺ by recursive calls.

Algorithm 43: Fast quaternion matrix multiplication by its transpose
Input:
A, B, C, D ∈ K^{m×n}
Output: M·M^⊺ ∈ H(K)^{m×m}, for M = A + Bi + Cj + Dk.
U = A + B, U = A − B, U = C + D
P = CA^⊺, P = DB^⊺
P = U·U^⊺, P = U·U^⊺, P = AB^⊺, P = CD^⊺
P = (U + U)·(U^⊺ − U^⊺)
R = P + P, R = P − P
Low(R) = Low(P + P^⊺)
Low(R) = Low(P − P^⊺)
Low(S) = Low(P − P + P + R − R^⊺)
S = R + R^⊺
S = R + R
S = P + P − S^⊺
return S_0 + S_1 i + S_2 j + S_3 k.

The cost of this divide and conquer strategy is then:

C(n) ≤ 4C(n/2) + 4MM^{H(K)}_ω(n/2) + o(n^ω) ≤ 4C(n/2) + 32MM^K_ω(n/2) + o(n^ω).   (66)

By Lemma 11, we have that C(n) ≤ 32/(2^ω − 4)·MM^K_ω(n) + o(n^ω), and this is never better than 8MM^K_ω(n) (but equal when ω = 3, as expected). So this is thus useless too.

But the same strategy can be used with a Strassen-like algorithm instead. Such algorithms, for instance those of [19, 21], when applied to M and M^⊺, use two recursive calls and five general multiplications. This is:

S(n) ≤ 2S(n/2) + 5MM^{H(K)}_ω(n/2) + o(n^ω) ≤ 2S(n/2) + 40MM^K_ω(n/2) + o(n^ω).   (67)

By Lemma 11, we obtain that this is

S(n) ≤ 40/(2^ω − 2)·MM^K_ω(n) + o(n^ω).   (68)

As expected, this is again 8MM^K_{log_2(7)}(n) if a Strassen-like algorithm is also used for the baseline over the field and ω = log_2(7). This is worse if ω < log_2(7), but better, and only (13 + 1/3)n^3 = (6 + 2/3)MM^K_3(n), if ω = 3.

Positive characteristic quaternions and transposition.
In this case, Algorithm 12 is not directly usable. It nonetheless has an interesting feature: it has 3 symmetric products instead of 2 for the algorithms of [19, 21]. In the quaternion case, as transposition is not an antihomomorphism, one cannot use the symmetries directly to save computations, since in general (M·N)^⊺ ≠ N^⊺·M^⊺. But if one is willing to recompute N^⊺·M^⊺, then the algorithm still works. This yields Algorithm 44, which requires 7 multiplications instead of 5, 15 additions instead of 9, and 4 multiplications by Y or Y^⊺.

Algorithm 44
Product of a matrix by its non-antihomomorphic transpose
Input: A ∈ R^{m×n} (with even m and n for the sake of simplicity);
Input: Y ∈ R^{n/2×n/2} such that Y·Y^⊺ = −I_{n/2};
Output: A·A^⊺.
Split A = ( A_11 A_12 ; A_21 A_22 ), where A_11 is in R^{m/2×n/2}.
S = (A − A)·Y        ST = Y^⊺·(A^⊺ − A^⊺)
S = A − A·Y          ST = A^⊺ − Y^⊺·A^⊺
S = S − A            ST = ST − A^⊺
S = S + A            ST = ST + A^⊺
P = A·A^⊺            P = A·A^⊺
P = A·ST             PT = S·A^⊺
P = S·ST             PT = S·ST
P = S·ST
U = P + P            U = P + P
U = U + P            UT = U + PT
U = U + P            UT = UT + PT
U = U + PT
return ( U UT ; U U ).

Now, it turns out that transposition is still antihomomorphic if one of the matrices has its coefficients in the base field, as shown by Lemma 45.

Lemma 45.
Let A be in H(K)^{m×k} and Y in K^{k×n}; then (A·Y)^⊺ = Y^⊺·A^⊺.

Proof. Since the coefficients of Y are in the base field, they commute with the quaternions. Therefore, for all i, j: ((A·Y)^⊺)_{ij} = Σ_k a_{jk} y_{ki} = Σ_k y_{ki} a_{jk} = (Y^⊺·A^⊺)_{ij}.

Now, Section 4.3 shows that for any quaternion algebra in positive characteristic, there exists a matrix Y, with coefficients in the base field, such that Y·Y^⊺ = −I_{⌊n/2⌋}. Therefore, in this case, Lemma 45 shows that in Algorithm 44 each ST is the transpose S^⊺ of the corresponding S. This shows that not only the first two P's, but also a third product of the form S·S^⊺, are multiplications of a matrix by its own transpose. Finally, Algorithm 44 thus requires three recursive calls and four general multiplications. This is:

P(n) ≤ 3P(n/2) + 4MM^{H(K)}_ω(n/2) + o(n^ω) ≤ 3P(n/2) + 32MM^K_ω(n/2) + o(n^ω)   (69)

and Equation (68) is modified as:

P(n) ≤ 32/(2^ω − 3)·MM^K_ω(n) + o(n^ω).   (70)

As expected, this is again 8MM^K_{log_2(7)}(n) if a Strassen-like algorithm is also used for the baseline over the field and ω = log_2(7). This is again worse if ω < log_2(7), but better, and only (6 + 4/5)·2n^3 = (12 + 4/5)n^3, if ω = 3.

We now deal with the case of the product of a quaternion matrix with its conjugate transpose. This operator is now an antihomomorphism, which allows us to save some computations as in Algorithm 12. Here also we distinguish the matrix of quaternions from the quaternion with matrix coefficients.
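Lemma 45, and the failure of antihomomorphy in the general case, can be checked numerically with a minimal sketch (the helpers qmul and qT are illustrative; quaternion matrices are represented by their four components):

```python
import numpy as np

def qmul(X, Y):
    # schoolbook product of quaternion matrices given by components
    x0, x1, x2, x3 = X
    y0, y1, y2, y3 = Y
    return (x0 @ y0 - x1 @ y1 - x2 @ y2 - x3 @ y3,
            x0 @ y1 + x1 @ y0 + x2 @ y3 - x3 @ y2,
            x0 @ y2 - x1 @ y3 + x2 @ y0 + x3 @ y1,
            x0 @ y3 + x1 @ y2 - x2 @ y1 + x3 @ y0)

def qT(X):
    # transposition of a quaternion matrix: componentwise, no conjugation
    return tuple(Xi.T for Xi in X)

rng = np.random.default_rng(2)
M = tuple(rng.standard_normal((3, 3)) for _ in range(4))
N = tuple(rng.standard_normal((3, 3)) for _ in range(4))

# Transposition is not an antihomomorphism over H(K) in general:
assert not all(np.allclose(l, r)
               for l, r in zip(qT(qmul(M, N)), qmul(qT(N), qT(M))))

# ... but (A.Y)^T = Y^T.A^T does hold when Y has base-field coefficients:
Z = np.zeros((3, 3))
Y = (rng.standard_normal((3, 3)), Z, Z, Z)
assert all(np.allclose(l, r)
           for l, r in zip(qT(qmul(M, Y)), qmul(qT(Y), qT(M))))
```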
Scalar case.
The quaternion conjugation satisfies conj(XY) = conj(Y)·conj(X). Therefore, we have:

(a + bi + cj + dk)·conj(a + bi + cj + dk) = a^2 + b^2 + c^2 + d^2.   (71)

For matrices, again, such simplifications do not occur and the product is then more complex.

Using matrices of quaternions.
Now M·M^H is a hermitian matrix and Algorithm 12 works over H(K). For this, one needs to find a skew-unitary matrix in H(K). This is impossible in H(C), but always possible in the quaternions over fields of positive characteristic, using sums of squares and Equation (71).

Suppose we use a generic matrix multiplication algorithm over the quaternions with cost bound equivalent to MM^{H(K)}_ω(n) operations. Then our Algorithm 12 can multiply a matrix of quaternions by its conjugate transpose with a dominant complexity term bounded by 2/(2^ω − 3)·MM^{H(K)}_ω(n) operations, by Theorem 13. Now, for the quaternions, the best algorithm to multiply any two matrices of quaternions is given by Proposition 42 and uses 8MM^K_ω(n) field operations if the base field matrix multiplication uses MM^K_ω(n). We thus have proven:

Corollary 46.
Algorithm 12 multiplies a quaternion matrix by its conjugate transpose with dominant cost bounded by 16/(2^ω − 3)·MM^K_ω(n) base field operations.

Directly using quaternions with matrix coefficients. Using a quaternion with matrix coefficients over the field, we have:

M·M^H = (A + Bi + Cj + Dk)·(A + Bi + Cj + Dk)^H = (A + Bi + Cj + Dk)(A^⊺ − B^⊺i − C^⊺j − D^⊺k) = H_0 + H_1 i + H_2 j + H_3 k   (72)

where:

H_0 = AA^⊺ + BB^⊺ + CC^⊺ + DD^⊺   (73)
H_1 = (BA^⊺ − AB^⊺) + (DC^⊺ − CD^⊺)   (74)
H_2 = (CA^⊺ − AC^⊺) + (BD^⊺ − DB^⊺)   (75)
H_3 = (DA^⊺ − AD^⊺) + (CB^⊺ − BC^⊺)   (76)

Note that H_0 is symmetric and H_1, H_2, H_3 are skew-symmetric. The properties of the transpose in the field show that these can be computed directly with 10 multiplications. A better way is to set E = A + Bi and F = C + Di, so that M = E + Fj. This shows that:

M·M^H = (E + Fj)·(E^H − jF^H) = (EE^H + FF^H) + (FjE^H − EjF^H).   (77)

Then, we have:

FjE^H = (C + Di)j(A^⊺ − B^⊺i) = (CA^⊺ − DB^⊺)j + (DA^⊺ + CB^⊺)k = Xj + Yk   (78)
−EjF^H = (A + Bi)j(−C^⊺ + D^⊺i) = (−AC^⊺ + BD^⊺)j − (BC^⊺ + AD^⊺)k = −X^⊺j − Y^⊺k   (79)

Using Equations (78) and (79), we thus have Algorithm 47, which uses only 6 multiplications (one of which is a square) and a total of 14 additions, 7 of them being half-additions (on the one hand, 3 multiplications and 5 additions for Equations (78) and (79) overall, then 2 half-additions for Low(H_2) and Low(H_3); on the other hand, 2 multiplications, 1 square, 2 additions and 5 half-additions for Low(H_0) and Low(H_1)).

Proposition 48.
Algorithm 47 multiplies a quaternion matrix by its conjugate transpose with cost equivalent to (2/(2^ω − 3) + 5)·MM^K_ω(n) base field operations.

Proof. Algorithm 47 uses 6 multiplications, one of which, Q_6, is the product of a matrix in K by its transpose; the latter can be computed recursively via Algorithm 12 in 2/(2^ω − 3)·MM^K_ω(n) operations, while the other 5 are general products.

Open question 2.
Multiply a quaternion with matrix coefficients by its Hermitian transpose in fewer than 6 multiplications (including squares), or with more squares.

Algorithm 47: Fast quaternion matrix multiplication by its adjoint
Input:
A, B, C, D ∈ K^{m×n}
Output: M·M^H ∈ H(K)^{m×m}, for M = A + Bi + Cj + Dk.
U_1 = A + B,  U_2 = C + D
Q_1 = CA^⊺,  Q_2 = DB^⊺
Q_3 = U_1·U_2^⊺
Q_4 = AB^⊺,  Q_5 = CD^⊺
Q_6 = (U_1 + U_2)·(U_1^⊺ + U_2^⊺)
T_1 = Q_1 − Q_2,  T_2 = Q_1 + Q_2
T_3 = T_2 − Q_3^⊺,  T_4 = Q_4 + Q_5
Low(H_0) = Low(Q_6 − (Q_3^⊺ + Q_3) − (T_4^⊺ + T_4))
Low(H_1) = Low(T_4^⊺ − T_4)
Low(H_2) = Low(T_1 − T_1^⊺)
Low(H_3) = Low(T_3^⊺ − T_3)
return H_0 + H_1 i + H_2 j + H_3 k.
We summarize the results of this section about quaternion matrices in Table 3.
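As a sanity check of Equations (73)-(76), the components H_0, ..., H_3 of M·M^H can be compared against the complex 2×2-block embedding of the quaternions, under which quaternion conjugate-transposition becomes the ordinary conjugate transpose of the block matrix (a sketch with illustrative names, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(3)
A, B, C, D = rng.standard_normal((4, 5, 5))

# Components of M.M^H according to Equations (73)-(76):
H0 = A @ A.T + B @ B.T + C @ C.T + D @ D.T
H1 = (B @ A.T - A @ B.T) + (D @ C.T - C @ D.T)
H2 = (C @ A.T - A @ C.T) + (B @ D.T - D @ B.T)
H3 = (D @ A.T - A @ D.T) + (C @ B.T - B @ C.T)
assert np.allclose(H0, H0.T)    # H0 is symmetric
assert np.allclose(H1, -H1.T)   # H1, H2, H3 are skew-symmetric

# Independent check through the complex block embedding M = E + F.j:
E, F = A + 1j * B, C + 1j * D
blk = np.block([[E, F], [-F.conj(), E.conj()]])
expected_E, expected_F = H0 + 1j * H1, H2 + 1j * H3
expected = np.block([[expected_E, expected_F],
                     [-expected_F.conj(), expected_E.conj()]])
assert np.allclose(blk @ blk.conj().T, expected)
```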
This section reports on an implementation of Algorithm 12 over a prime field, as it is a core ingredient of any such computation in positive characteristic or over Z[i] or Q[i]. In order to reduce the memory footprint and increase the data locality of the computation, we first need to identify a memory placement and a scheduling of the tasks minimizing the temporary allocations. We thus propose in Table 4 and Figure 1 a memory placement and schedule for the operation C ← A·A^⊺, using no more extra storage than the unused upper triangular part of the result C.

The more general operation C ← αA·A^⊺ + βC is referred to as SYRK (SYmmetric Rank-k update) in the BLAS API. Table 5 and Figure 2 propose a schedule requiring only one additional n/2 × n/2 temporary block. We implemented this fsyrk routine in the open-source fflas-ffpack library for dense linear algebra over a finite field [8, from commit 0a91d61e].

Figure 3 compares the computation speed in effective Gfops (a normalization defined as n^3/(10^9 × time)) of this implementation over Z/131071Z with that of the double precision BLAS routine dsyrk, the classical cubic-time routine over a finite field (calling dsyrk and performing modular reductions on the result), and the classical divide and conquer algorithm [6]. The fflas-ffpack library is linked with OpenBLAS [22, v0.3.6] and compiled

Problem | Alg.     | MM_3(n)        | MM_{log_2(7)}(n)       | MM_ω(n)
A·B     | naive    | 32n^3          | 16 MM^K_{log_2(7)}(n)  | 16 MM^K_ω(n)
A·B     | [14]     | 16n^3          | 8 MM^K_{log_2(7)}(n)   | 8 MM^K_ω(n)
A·A^H   | Alg. 47  | 10.8n^3        | 5.5 MM^K_{log_2(7)}(n) | (2/(2^ω − 3) + 5) MM^K_ω(n)
A·A^H   | Alg. 12  | 6.4n^3         | 4 MM^K_{log_2(7)}(n)   | 16/(2^ω − 3) MM^K_ω(n)
A·A^⊺   | Alg. 43  | 14n^3          | 7 MM^K_{log_2(7)}(n)   | 7 MM^K_ω(n)
A·A^⊺   | Eq. (68) | (13 + 1/3)n^3  | 8 MM^K_{log_2(7)}(n)   | 40/(2^ω − 2) MM^K_ω(n)
A·A^⊺   | Eq. (70) | (12 + 4/5)n^3  | 8 MM^K_{log_2(7)}(n)   | 32/(2^ω − 3) MM^K_ω(n)

Table 3: Matrix multiplication over H(K)^{n×n}: leading term of the cost in number of operations over K. Note that (2/(2^ω − 3) + 5) < 16/(2^ω − 3) only when ω < log_2(29) − log_2(5) ≈ 2.536.

S = (A − A)·Y;  S = A − A·Y;  S = S − A;  S = S + A;
P = A·A^⊺;  P = A·A^⊺;  P = A·S^⊺;  P = S·S^⊺;  P^⊺ = S·S^⊺;
U = P + P;  U = U + P;  U = U + P;  U = U + P^⊺;  Up(U) = Low(U)^⊺.

Table 4: Memory placement and schedule of tasks to compute the lower triangular part of C ← A·A^⊺ when k ≤ n. The block C_12 of the output matrix is the only temporary used.

Figure 1: DAG of the tasks and their memory locations for the computation of C ← A·A^⊺ presented in Table 4.

S = (A − A)·Y;  S = A − A·Y;  Up(C) = Low(C)^⊺;  S = S − A;  S = S + A;
P = αA·A^⊺;  P = αA·A^⊺ + βC;  P = αA·S^⊺ + βC;  P = αS·S^⊺;  P^⊺ = αS·S^⊺;
U = P + P;  Up(U) = Low(U)^⊺;  U = U + P;  U = U + P;  U = U + P^⊺ + β·Up(C)^⊺.

Table 5: Memory placement and schedule of tasks to compute the lower triangular part of C ← αA·A^⊺ + βC when k ≤ n. The block C_12 of the output matrix, as well as an n/2 × n/2 temporary, are used.

Figure 2: DAG of the tasks and their memory locations for the computation of C ← αA·A^⊺ + βC presented in Table 5.
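For intuition, the classical divide and conquer SYRK baseline used for comparison in Figure 3 can be sketched as follows (a simplified model over R with numpy; the actual fsyrk routine works over a prime field, follows the schedules of Tables 4 and 5, and uses the five-product recursion of Algorithm 12 instead):

```python
import numpy as np

def syrk_lower(A):
    """Lower triangle of C = A.A^T by classical divide and conquer:
    4 recursive calls on half-size blocks for the diagonal of C and
    2 general matrix products for its off-diagonal block."""
    n, k = A.shape
    if n <= 2:                       # small base case: naive product
        return np.tril(A @ A.T)
    h, hk = n // 2, k // 2
    A11, A12 = A[:h, :hk], A[:h, hk:]
    A21, A22 = A[h:, :hk], A[h:, hk:]
    C = np.zeros((n, n))
    C[:h, :h] = syrk_lower(A11) + syrk_lower(A12)   # recursive calls
    C[h:, :h] = A21 @ A11.T + A22 @ A12.T           # general products
    C[h:, h:] = syrk_lower(A21) + syrk_lower(A22)   # recursive calls
    return C

A = np.random.default_rng(4).standard_normal((8, 6))
assert np.allclose(syrk_lower(A), np.tril(A @ A.T))
```

The recursion above has 4 recursive calls and 2 general products, which is exactly the 1/(2^{ω−1} − 2) cost ratio discussed in Section 6; the fast variant replaces two of the recursive calls and one general product using the skew-unitary preconditioning by Y.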
[Figure 3: Speed, in effective Gflops (n³/time), of an implementation of the fast algorithm, compiled with gcc-9.2 on an Intel Skylake i7-6700 running a Debian GNU/Linux system (v5.2.17); the curves compare classic OpenBLAS DSYRK, classic FSYRK modulo 131071, divide-and-conquer FSYRK modulo 131071, fast FSYRK modulo 131071, and fast FSYRK modulo 131041.]

The slight overhead of performing the modular reductions is quickly compensated by the speed-up of the sub-cubic algorithm (the threshold for a first recursive call is near n = 2000). The classical divide-and-conquer approach also speeds up the classical algorithm, but starting from a larger threshold, and hence at a slower pace. Lastly, the speed is nearly identical modulo 131041, where square roots of −1 exist and the skew-unitary matrix Y reduces to a scaling.

We made progress towards proving that five non-commutative products are necessary for the computation of the product of a 2 × 2 matrix by its adjoint. Over the quaternions, the five-product strategy is the fastest when ω = log₂ 7; yet for ω < 2.536, and for the case of transposition, the 6M and 7M algorithms perform best. The minimality of the number 6 of multiplications to multiply a quaternion by its conjugate is an open question, as is the minimality of the number 7 of multiplications to multiply a quaternion by its transpose. For these questions, de Groote's method could provide an answer.

We proposed several algorithms for the product of a matrix by its adjoint, each of which improves by a constant factor on the best previously known costs, depending on the algebraic nature of the field of coefficients and on the underlying matrix-multiplication exponent. When implemented in practice, the comparison may become even more complex, as other parameters, such as memory access patterns or vectorization, come into play. Our first experiments show that these constant-factor improvements do have a practical impact.
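The two primes used in the experiments differ exactly on the availability of square roots of −1: 131041 ≡ 1 (mod 4) admits one, while 131071 = 2¹⁷ − 1 ≡ 3 (mod 4) does not. A minimal sketch (ours, via Euler's criterion, assuming p prime) of computing such a root:

```python
def sqrt_minus_one(p):
    """Return i with i*i ≡ -1 (mod p), or None if none exists.

    For an odd prime p, -1 is a square iff p ≡ 1 (mod 4); then
    i = x^((p-1)/4) mod p for any quadratic non-residue x, since
    i^2 = x^((p-1)/2) ≡ -1 by Euler's criterion.
    """
    if p % 4 != 1:
        return None
    for x in range(2, p):
        if pow(x, (p - 1) // 2, p) == p - 1:  # x is a non-residue
            return pow(x, (p - 1) // 4, p)

i = sqrt_minus_one(131041)
assert (i * i) % 131041 == 131041 - 1   # i^2 ≡ -1 (mod 131041)
assert sqrt_minus_one(131071) is None   # 131071 ≡ 3 (mod 4)
```

With such a root at hand, a diagonal skew-unitary matrix exists over the prime field, which is what makes the five-product algorithm applicable without extra cost modulo 131041.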
References

[1] M. Baboulin, L. Giraud, and S. Gratton. A parallel distributed solver for large dense symmetric systems: Applications to geodesy and electromagnetism problems. Int. J. of HPC Applications, 19(4):353–363, 2005. doi:10.1177/1094342005056134.
[2] G. Beniamini and O. Schwartz. Faster matrix multiplication via sparse decomposition. In Proc. SPAA'19, pages 11–22, 2019. doi:10.1145/3323165.3323188.
[3] Brice Boyer, Jean-Guillaume Dumas, Clément Pernet, and Wei Zhou. Memory efficient scheduling of Strassen-Winograd's matrix multiplication algorithm. In Proc. ISSAC'09, pages 135–143. ACM Press, July 2009. doi:10.1145/1576702.1576713.
[4] J. Brillhart. Note on representing a prime as a sum of two squares. Math. of Computation, 26(120):1011–1013, 1972. doi:10.1090/S0025-5718-1972-0314745-6.
[5] N. H. Bshouty. On the additive complexity of 2 × 2 matrix multiplication. Inf. Processing Letters, 56(6):329–335, December 1995. doi:10.1016/0020-0190(95)00176-X.
[6] J.-G. Dumas, P. Giorgi, and C. Pernet. Dense linear algebra over prime fields. ACM TOMS, 35(3):1–42, November 2008. doi:10.1145/1391989.1391992.
[7] Jean-Guillaume Dumas, Clément Pernet, and Alexandre Sedoglavic. On fast multiplication of a matrix by its transpose. In Proc. ISSAC'20, pages 162–169, New York, July 2020. ACM Press. doi:10.1145/3373207.3404021.
[8] The FFLAS-FFPACK group. FFLAS-FFPACK: Finite Field Linear Algebra Subroutines / Package, 2019. v2.4.1. URL: http://github.com/linbox-team/fflas-ffpack.
[9] Charles M. Fiduccia. Fast matrix multiplication. In Proc. STOC'71, pages 45–49, New York, NY, USA, 1971. ACM Press. doi:10.1145/800157.805037.
[10] Hans Friedrich de Groote. On varieties of optimal algorithms for the computation of bilinear mappings II. Optimal algorithms for 2 × 2-matrix multiplication. Theoretical Computer Science, 7(2):127–148, 1978. doi:10.1016/0304-3975(78)90045-2.
[11] Hans Friedrich de Groote. On the complexity of quaternion multiplication. Inf. Processing Letters, 3(6):177–179, 1975. doi:10.1016/0020-0190(75)90036-8.
[12] Hans Friedrich de Groote. On varieties of optimal algorithms for the computation of bilinear mappings I. The isotropy group of a bilinear mapping. Theoretical Computer Science, 7(2):1–24, 1978. doi:10.1016/0304-3975(78)90038-5.
[13] N. J. Higham. Stability of a method for multiplying complex matrices with three real matrix multiplications. SIMAX, 13(3):681–687, 1992. doi:10.1137/0613043.
[14] Thomas D. Howell and Jean Lafon. The complexity of the quaternion product. Technical report, Cornell University, USA, 1975. URL: https://hdl.handle.net/1813/6458.
[15] E. Karstadt and O. Schwartz. Matrix multiplication, a little faster. In Proc. SPAA'17, pages 101–110. ACM, 2017. doi:10.1145/3087556.3087579.
[16] Joseph M. Landsberg. Geometry and complexity theory, volume 169 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, December 2016. doi:10.1017/9781108183192.
[17] F. Le Gall. Powers of tensors and fast matrix multiplication. In Proc. ISSAC'14, pages 296–303. ACM, 2014. doi:10.1145/2608628.2608664.
[18] G. Seroussi and A. Lempel. Factorization of symmetric matrices and trace-orthogonal bases in finite fields. SIAM J. on Computing, 9(4):758–767, 1980. doi:10.1137/0209059.
[19] V. Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13:354–356, 1969. doi:10.1007/BF02165411.
[20] S. Wedeniwski. Primality tests on commutator curves. PhD thesis, Universität Tübingen, 2001. URL: https://d-nb.info/963295438/34.
[21] S. Winograd. La complexité des calculs numériques. La Recherche, 8:956–963, 1977.
[22] Zhang Xianyi, Martin Kroeker, et al. OpenBLAS, an Optimized BLAS library, 2019.