Batched computation of the singular value decompositions of order two by the AVX-512 vectorization
Parallel Processing Letters © World Scientific Publishing Company
Vedran Novaković*

10000 Zagreb, Croatia

Received (received date)
Revised (revised date)

ABSTRACT

In this paper a vectorized algorithm for simultaneously computing up to eight singular value decompositions (SVDs, each of the form A = UΣV*) of real or complex matrices of order two is proposed. The algorithm extends to a batch of matrices of an arbitrary length n, that arises, for example, in the annihilation part of the parallel Kogbetliantz algorithm for the SVD of a square matrix of order 2n. The SVD algorithm for a single matrix of order two is derived first. It scales, in most instances error-free, the input matrix A such that its singular values Σ_ii cannot overflow whenever its elements are finite, and then computes the URV factorization of the scaled matrix, followed by the SVD of a non-negative upper-triangular middle factor. A vector-friendly data layout for the batch is then introduced, where the same-indexed elements of each of the input and the output matrices form vectors, and the algorithm's steps over such vectors are described. The vectorized approach is then shown to be about three times faster than processing each matrix in isolation, while slightly improving accuracy over the straightforward method for the 2 × 2 SVD.

Keywords: batched computation; singular value decomposition; AVX-512 vectorization.

* e-mail address: [email protected], https://orcid.org/0000-0003-2964-9674. The author completed a major part of this research as a collaborator on the MFBDA project.
1. Introduction
Let a finite sequence A = (A^[k])_k, where 1 ≤ k ≤ n, of complex 2 × 2 matrices be given. The task is to compute the sequences U = (U^[k])_k and V = (V^[k])_k of 2 × 2 unitary matrices, and Σ = (Σ^[k])_k of 2 × 2 diagonal matrices with non-negative elements, such that, for each k,

A^[k] = U^[k] Σ^[k] (V^[k])*,

i.e., for each k, the right hand side of the equation is the singular value decomposition (SVD) of the left hand side. Such a batch of 2 × 2 SVDs arises, e.g., in the parallel Kogbetliantz algorithm [1] for the SVD of a square matrix of order 2n [2–4]. A parallel step of the algorithm, repeated until convergence, amounts to forming and processing such a batch, with each A^[k] assembled column by column from the elements of the iteration matrix at the suitably chosen pivot positions (p_k, p_k), (q_k, p_k), (p_k, q_k), and (q_k, q_k). The iteration matrix is then updated from the left by (U^[k])* and from the right by V^[k], transforming the p_k-th and the q_k-th rows and columns, respectively, while annihilating the off-diagonal pivot positions. For each k, the matrices A^[k], U^[k], V^[k], and Σ^[k] have the following elements,

$$A^{[k]} = \begin{bmatrix} a^{[k]}_{11} & a^{[k]}_{12} \\ a^{[k]}_{21} & a^{[k]}_{22} \end{bmatrix},\quad U^{[k]} = \begin{bmatrix} u^{[k]}_{11} & u^{[k]}_{12} \\ u^{[k]}_{21} & u^{[k]}_{22} \end{bmatrix},\quad V^{[k]} = \begin{bmatrix} v^{[k]}_{11} & v^{[k]}_{12} \\ v^{[k]}_{21} & v^{[k]}_{22} \end{bmatrix},\quad \Sigma^{[k]} = \begin{bmatrix} \sigma^{[k]}_{\max} & 0 \\ 0 & \sigma^{[k]}_{\min} \end{bmatrix},$$

where σ^[k]_max ≥ σ^[k]_min ≥ 0.
When its actual index k is either implied or irrelevant, A^[k] is denoted by A. Similarly, U, V, and Σ denote U^[k], V^[k], and Σ^[k], respectively, in such a case, and the bracketed indices of the particular elements are also omitted.

When computing in the machine's floating-point arithmetic, the real and the imaginary parts of the input elements are assumed to be rounded to finite (i.e., excluding ±∞ and NaN) double precision quantities, but the SVD computations can similarly be vectorized in single precision (the float datatype in the C language [5]).

Let C and W denote the CPU's cache line size and the maximal SIMD width, both expressed in bytes. For an Intel CPU with the 512-bit Advanced Vector Extensions Foundation (AVX-512F) instruction set [6], C = W = 64. Let B be the size in bytes of the chosen underlying datatype T (here, T = double in the real and the complex case alike, so B = 8), and let S = W/B = 8. If n mod S ≠ 0, let n̂ = n + (S − (n mod S)); else, let n̂ = n.

This paper aims to show how to single-threadedly compute as many SVDs at the same time as there are SIMD/vector lanes available (S), one SVD by each lane. Furthermore, these vectorized computations can execute concurrently on the non-overlapping batch chunks assigned to the multiple CPU cores.

Techniques similar to the ones proposed in this paper have already been applied in [7] for vectorization of the Hari–Zimmermann joint diagonalizations of a complex positive definite pair of matrices [8] of order two, and could be, as future work, for the real variant of the Hari–Zimmermann algorithm for the generalized eigendecomposition [9] and the generalized SVD [10]. Those efforts do not use the C compiler intrinsics, but rely instead on the vectorization capabilities of the Intel Fortran compiler over data laid out in a vector-friendly fashion similar to the one described in section 3. Simple as it may seem, it is also a more fragile way of expressing the vector operations, should the compiler ever renege on the present behavior of its autovectorizer. The intrinsics approach has been tried in [11] with 256-bit-wide vectors of the AVX2+FMA [6] instruction set, alongside AVX-512F, for vectorization of the eigendecompositions of symmetric matrices of order two by the Jacobi rotations computed similarly to [12]. This way the one-sided Jacobi SVD (and, similarly, the hyperbolic SVD) of real n × n matrices can be significantly sped up when n is small enough to make the eigendecompositions' execution time comparable to that of the 2 × 2 transformations' applications [13, 14].

In numerical linear algebra the term "batched computation" is well-established, signifying a simultaneous processing of a large quantity of relatively small problems, e.g., the LU and the Cholesky factorizations [15] and the corresponding linear system solving [16] on the GPUs, with appropriate data layouts. It is therefore both justifiable and convenient to reuse the term in the present context.

This paper is organized as follows. A non-vectorized Kogbetliantz method for the SVD of a matrix of order two is presented in section 2. In section 3 a vector-friendly data layout is proposed, followed by an overview of the vectorized algorithm for the batched 2 × 2 SVDs in section 4. The algorithm consists of the four phases:
1. the exact scalings of the matrices A, each by a suitable power of two, to avoid any overflows and most underflows, vectorized in section 5 and based on subsection 2.1,
2. the URV factorizations [17], vectorized in section 6 and based on subsection 2.2, of the matrices from the previous phase, into the real, non-negative, upper-triangular middle factors R and the unitary left and right factors,
3. the singular value decompositions of the factors R from the previous phase, vectorized in section 7 and based on subsection 2.3, yielding the scaled Σ, and
4. assembling of the left (U) and the right (V) singular vectors of the matrices A.

The numerical testing results follow in section 8, and the conclusions in section 9.
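For concreteness, the padding of the batch length n to n̂, as defined above, can be sketched in C as follows; padded_length is a hypothetical helper name, not a routine of the proposed code:

#include <stddef.h>

/* S = W / B = 8 double precision lanes per 512-bit vector, as defined above. */
enum { S = 8 };

/* Round the batch length n up to the next multiple of S, i.e., compute n-hat. */
static size_t padded_length(size_t n)
{
    return (n % S) ? n + (S - n % S) : n;
}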
2. The Kogbetliantz algorithm for the SVD of order two
The pointwise, non-vectorized Kogbetliantz algorithm for the SVD of a matrix of order two has been an active subject of research [18–20], and has been implemented for real matrices in LAPACK's [21] xLASV2 (for the full SVD) and xLAS2 (for the singular values only) routines, where x ∈ {S, D}. Here a simplified version of the algorithm from [4, trigonometric case] is described, with an early reduction of a complex matrix to a real one that is partly influenced by, but improves on, [22].

It is assumed in the paper that the floating-point arithmetic [23] is nonstop, i.e., does not trap on exceptions, and has the gradual underflow, i.e., the Flush-To-Zero (FTZ) and Denormals-Are-Zero (DAZ) processor flags [6] are disabled.

To compute |z| and e^{i arg z} for a complex z with both components finite, including z = 0, while avoiding the complex arithmetic operations, use the hypot(a, b) = √(a² + b²) function [5]. With DBL_TRUE_MIN being the smallest positive non-zero (and thus subnormal, or denormal in the old parlance) double precision value, let

|z| = hypot(ℜz, ℑz), e^{i·arg z} = cos(arg z) + i·sin(arg z),
cos(arg z) = fmin(|ℜz|/|z|, 1) · sign ℜz, sin(arg z) = ℑz / max(|z|, DBL_TRUE_MIN). (1)

Here and in the following, fmin and fmax are functions similar to the ones in the C language [5], but with a bit relaxed semantics, that return the minimal (respectively, maximal) of their two non-NaN arguments, or the second argument if the first is a NaN, as is the case with the vector minimum and maximum [6, VMINPD and VMAXPD]. See also [7, subsection 6.2] for a similar exploitation of the NaN handling of the min and max operations. It now follows that, when |z| = 0, and so ℜz = ℑz = 0,

cos(arg z) = sign ℜz = ±1, sin(arg z) = 0 · sign ℑz = ±0.

The signs of ℜz and ℑz are thus preserved in cos(arg z) and sin(arg z), respectively.

A mandatory property of hypot for (1) to be applicable to all inputs z, where ℜz and ℑz are of sufficiently small magnitude (see subsection 2.1), is to have hypot(a, b) = 0 ⟺ a = b = 0.

A vectorized hypot implementation is accessible from the Intel Short Vector Math Library (SVML) via a compiler intrinsic, as is a reciprocal square root (invsqrt x = 1/√x) vectorized routine, helpful for the cosine calculations in (7), (15), and (16), though neither is always correctly rounded to at most half ulp.^a

^a Consult the reports on the High Accuracy functions at URL.
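Under the stated semantics of fmin and fmax (which the C functions of the same names share, since they return the non-NaN argument whenever the other one is a NaN), (1) admits, e.g., the following scalar C sketch; polar2 is a hypothetical helper name:

#include <float.h>
#include <math.h>

/* A sketch of (1): |z|, cos(arg z), and sin(arg z) for z with finite components. */
static void polar2(double re, double im, double *az, double *cs, double *sn)
{
    *az = hypot(re, im);                          /* |z| = 0 if and only if z = 0 */
    /* 0/0 = NaN is mapped to 1 by fmin, and copysign supplies sign(re) */
    *cs = copysign(fmin(fabs(re) / *az, 1.0), re);
    /* the max with DBL_TRUE_MIN turns the 0/0 case into (+-0)/DBL_TRUE_MIN = +-0 */
    *sn = im / fmax(*az, DBL_TRUE_MIN);
}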
2.1. Exact scalings of the input matrix

Even if both components of z are finite, |z| from (1) can overflow. Scaling a floating-point number by an integer power of two is exact, except when the significand of a subnormal result loses a trailing non-zero part due to shifting of the original significand to the left, or when the result overflows. Therefore, such scaling [6, VSCALEFPD] is the best remedy for the absolute value overflow problem.

Let the exponent of a floating-point value (assuming the radix two) be defined as exp 0 = −∞ and exp a = ⌊lg |a|⌋ for a finite non-zero a (see [6, VGETEXPPD]).

Let h = DBL_MAX_EXP − 3. For the scaling parameter s of A, take

s = min{DBL_MAX, min E^ℜ, min E^ℑ}, (2)

where E^ℜ = {E^ℜ_ij | 1 ≤ i, j ≤ 2} and E^ℑ = {E^ℑ_ij | 1 ≤ i, j ≤ 2} are computed as E^ℜ_ij = h − exp ℜa_ij and E^ℑ_ij = h − exp ℑa_ij. Note that −2 ≤ s ≤ DBL_MAX, due to the definition of h. If A is real, E^ℑ is not used. The upper bound on s is required to be finite, since 0 · ∞ would result in a NaN.

If there is a value of a huge magnitude (i.e., with its exponent greater than h) in A, s from (2) will be negative and the huge values will decrease, either twofold or fourfold. Else, s will be the maximal non-negative amount by which the exponents of the values in A can jointly be increased, thus taking the very small values out of the subnormal range if possible, without any of the new exponents going over h.

Let Â = 2^s A, and let â_j denote the j-th column of Â. The Frobenius norm of â_j,

‖â_j‖_F = hypot(|â_1j|, |â_2j|), (3)

cannot overflow (see (27) in the proof of Theorem 1 in subsection 2.3).
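Assuming the scalar logb and scalbn functions stand in for VGETEXPPD and VSCALEFPD (logb(0) = −∞, so a zero element cannot lower s), the scaling (2) of a real A can be sketched as follows; scale2 is a hypothetical helper name:

#include <float.h>
#include <math.h>

/* A scalar sketch of (2): the exponent s, and the scaled matrix A-hat = 2^s A. */
static double scale2(const double a[2][2], double ahat[2][2])
{
    const double h = (double)(DBL_MAX_EXP - 3);
    double s = DBL_MAX; /* the finite upper bound from (2) */
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            s = fmin(s, h - logb(a[i][j])); /* h - exp a_ij; +inf for a_ij = 0 */
    /* cast guard only: s <= 2095 for a non-zero A; s = DBL_MAX only if A = 0,
       when scaling by any amount keeps A = 0 exactly */
    const int si = (int)fmin(s, 3000.0);
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            ahat[i][j] = scalbn(a[i][j], si); /* exact power-of-two scaling */
    return s;
}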
This scaling is both a simplification and an improvement of [4, subsection 2.3.2], which guarantees that the computed scaled singular values are finite, while avoiding any branching, lane masking, or recomputing when vectorized, with the only adverse effect being a potential sacrifice of the tiniest subnormal values in the presence of a huge one (i.e., with its exponent strictly greater than h) in A.

2.2. The URV factorization of a well-scaled matrix

If ‖â_1‖_F < ‖â_2‖_F, let P_c = [[0, 1], [1, 0]]; else, let P_c = [[1, 0], [0, 1]]. Denote the column-pivoted Â by A′ = ÂP_c. If |a′_11| < |a′_21|, let P*_r = [[0, 1], [1, 0]]; else, let P*_r = [[1, 0], [0, 1]]. Denote the row-sorted A′ by A″ = P*_r A′. To make the first column of A″ real and non-negative, let

$$D^* = \begin{bmatrix} e^{-i\arg a''_{11}} & 0 \\ 0 & e^{-i\arg a''_{21}} \end{bmatrix},\quad A''' = D^* A''. \tag{4}$$

Complex multiplication, required in (4), (8), (9), and (11), is performed using the fused multiply-add operations with a single rounding [5], fma(a, b, c) = a · b + c, as in [24, cuComplex.h] and [25, subsection 3.2.1], i.e., for a complex c = a · b it holds that

ℜc = fma(ℜa, ℜb, −ℑa · ℑb), ℑc = fma(ℜa, ℑb, ℑa · ℜb). (5)

To annihilate a‴_21, compute the Givens rotation Q*_α, where −π/4 ≤ α ≤ 0, as
$$Q^*_\alpha = \cos\alpha \begin{bmatrix} 1 & -\tan\alpha \\ \tan\alpha & 1 \end{bmatrix},\quad R'' = Q^*_\alpha \left[ a'''_1\ \ a'''_2 \right] = \begin{bmatrix} r_{11} & r''_{12} \\ 0 & r''_{22} \end{bmatrix},\quad r_{11} = \|a'''_1\|_F. \tag{6}$$

Since the column norms of a well-scaled Â are finite, its column-pivoted, row-sorted QR factorization in (6) cannot result in an infinite element in R″.

If a‴_11 = 0 then A‴ = 0 as a special case. Handling special cases in a vectorized way is difficult as it implies branching or using the instructions with a lane mask. However, the fmax function aids in avoiding both of these approaches similarly to fmin in (1), since tan α and cos α from (6) can be computed as

tan α = −fmax(a‴_21/a‴_11, 0), cos α = invsqrt(fma(tan α, tan α, 1)). (7)

To make r″_12 from (6) real (see [22]) and non-negative, take D̃ and obtain R′ as

$$\widetilde{D} = \begin{bmatrix} 1 & 0 \\ 0 & e^{-i\arg r''_{12}} \end{bmatrix},\quad R' = R''\widetilde{D} = \begin{bmatrix} r_{11} & r_{12} \\ 0 & r'_{22} \end{bmatrix},\quad r_{12} \ge 0. \tag{8}$$

Similarly, to make r′_22 from (8) real and non-negative, take D̂* and obtain R as

$$\widehat{D}^* = \begin{bmatrix} 1 & 0 \\ 0 & e^{-i\arg r'_{22}} \end{bmatrix},\quad R = \widehat{D}^* R' = \begin{bmatrix} r_{11} & r_{12} \\ 0 & r_{22} \end{bmatrix},\quad r_{11} \ge \max\{r_{12}, r_{22}\}, \tag{9}$$

due to the column pivoting. Specifically, if A is already real, then
$$D^* = \begin{bmatrix} \operatorname{sign} a''_{11} & 0 \\ 0 & \operatorname{sign} a''_{21} \end{bmatrix},\quad \widetilde{D} = \begin{bmatrix} 1 & 0 \\ 0 & \operatorname{sign} r''_{12} \end{bmatrix},\quad \widehat{D}^* = \begin{bmatrix} 1 & 0 \\ 0 & \operatorname{sign} r'_{22} \end{bmatrix}. \tag{10}$$
Note that, from (4), (6), (8), and (9),

R = U*_+ Â V_+, U*_+ = D̂* Q*_α D* P*_r, V_+ = P_c D̃, (11)

where U*_+ and V_+ are unitary, i.e., U_+ R V*_+ is a specific URV factorization [17] of Â.
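Restricted to the real and non-negative first column of A‴, the rotation part (7) of the factorization can be sketched in scalar C as follows; givens2 is a hypothetical helper name, and 1/sqrt approximates the invsqrt routine:

#include <math.h>

/* A sketch of (7): tan(alpha) and cos(alpha) annihilating the (2,1) element. */
static void givens2(double a11, double a21, double *t, double *c)
{
    /* a21/a11 = 0/0 = NaN (i.e., A''' = 0) is mapped to 0 by fmax */
    *t = -fmax(a21 / a11, 0.0);
    /* -1 <= tan(alpha) <= 0 due to the row sorting, so 1 <= sec^2(alpha) <= 2 */
    *c = 1.0 / sqrt(fma(*t, *t, 1.0));
}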
2.3. The SVD of a special upper-triangular non-negative matrix

Here the plane rotations U*_φ and V_ψ are computed, such that U*_φ R V_ψ = Σ′, where

$$U^*_\varphi = \cos\varphi \begin{bmatrix} 1 & -\tan\varphi \\ \tan\varphi & 1 \end{bmatrix},\quad V_\psi = \cos\psi \begin{bmatrix} 1 & \tan\psi \\ -\tan\psi & 1 \end{bmatrix},\quad \Sigma' = \begin{bmatrix} \sigma'_{11} & 0 \\ 0 & \sigma'_{22} \end{bmatrix}, \tag{12}$$

with R from (9) and min{σ′_11, σ′_22} ≥ 0. Let

x = fmax(r_12/r_11, 0), y = fmax(r_22/r_11, 0). (13)
With x and y from (13), 0 ≤ x, y ≤ 1, compute

$$\tan(2\varphi) = -\min\left\{\operatorname{fmax}\!\left(\frac{(2\min(x,y))\cdot\max(x,y)}{\operatorname{fma}(x-y,\ x+y,\ 1)},\ 0\right),\ \sqrt{\mathtt{DBL\_MAX}}\right\}, \tag{14}$$

as justified in the next paragraph. Since the quotient in (14) is non-negative (when defined), tan(2φ) is non-positive, and thus −π/4 ≤ φ ≤ 0.
From tan(2φ) compute

tan φ = tan(2φ) / (1 + √(fma(tan(2φ), tan(2φ), 1))), cos φ = invsqrt(sec²φ), (15)

with −1 ≤ tan φ ≤ 0, where sec²φ = fma(tan φ, tan φ, 1). Note that (15) would break down if tan(2φ) was not bounded in magnitude. If |tan(2φ)| = ∞ in floating-point (this occurs rarely, when 0 < x ≪ y = 1 and fma(x − y, x + y, 1) rounds to +0), (15) would give tan φ = NaN instead of the correct result, −1. Else, if |tan(2φ)| > √DBL_MAX, adding one to its square would have made little difference before the rounding (and the sum would have overflown after it), so the square root in (15) could be approximated by |tan(2φ)|. Again, with |tan(2φ)| so obtained, adding one to it in the denominator in (15) would have been irrelevant, and tan φ would have then equaled −1. Bounding |tan(2φ)| from above as in (14) therefore avoids the argument of the square root overflowing (so using hypot(tan(2φ), 1) instead of √(fma(tan(2φ), tan(2φ), 1)) is not required), and ensures tan φ = −1 for any |tan(2φ)| that would otherwise be greater than the bound.

Having thus computed U_φ, the right plane rotation V_ψ is constructed from

tan ψ = fma(y, tan φ, −x), cos ψ = invsqrt(sec²ψ), (16)

where tan ψ ≤ 0 and sec²ψ = fma(tan ψ, tan ψ, 1). The structure of R contributes to an important property of the computed scaled singular values Σ′; namely, they are already sorted non-ascendingly, and thus never have to be swapped in a postprocessing step. Also, the scaled singular values are always finite in floating-point arithmetic.
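Before stating the properties of Σ′ formally, the whole of (13)–(17) can be summarized by a scalar C sketch; svd2r is a hypothetical helper name, and 1/sqrt again stands in for invsqrt:

#include <float.h>
#include <math.h>

/* A sketch of (13)-(17) for R = [r11, r12; 0, r22], r11 >= max{r12, r22} >= 0. */
static void svd2r(double r11, double r12, double r22, double *s1, double *s2)
{
    const double x = fmax(r12 / r11, 0.0), y = fmax(r22 / r11, 0.0);  /* (13) */
    const double t2 = -fmin(fmax((2.0 * fmin(x, y)) * fmax(x, y)
                                 / fma(x - y, x + y, 1.0), 0.0),
                            sqrt(DBL_MAX));                           /* (14) */
    const double tf = t2 / (1.0 + sqrt(fma(t2, t2, 1.0)));            /* (15) */
    const double sec2f = fma(tf, tf, 1.0), cf = 1.0 / sqrt(sec2f);
    const double tp = fma(y, tf, -x);                                 /* (16) */
    const double sec2p = fma(tp, tp, 1.0), cp = 1.0 / sqrt(sec2p);
    *s1 = (cf * cp * sec2p) * r11; /* sigma'_11, from (17) */
    *s2 = (cf * cp * sec2f) * r22; /* sigma'_22, from (17) */
}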
Theorem 1. For Σ′ it holds that ∞ > σ′_11 ≥ σ′_22 ≥ 0, where

σ′_11 = (cos φ cos ψ sec²ψ) r_11, σ′_22 = (cos φ cos ψ sec²φ) r_22, (17)

and √2 ≥ |tan ψ| ≥ |tan φ| ≥ 0.

Proof.
Assume r_11 > 0 and x > 0, since otherwise tan φ = tan ψ = 0, so σ′_11 = r_11 ≥ r_22 = σ′_22, as claimed, and only σ′_11 < ∞ remains to be proven.

From (15), cos φ ≠ 0, and from (16), tan ψ ≠ −∞, so cos ψ ≠ 0. Scaling U*_φ by 1/cos φ and V_ψ by 1/cos ψ in (12), and R by 1/r_11 in (9), from (13) it follows that

$$\begin{bmatrix} 1 & -\tan\varphi \\ \tan\varphi & 1 \end{bmatrix} \begin{bmatrix} 1 & x \\ 0 & y \end{bmatrix} \begin{bmatrix} 1 & \tan\psi \\ -\tan\psi & 1 \end{bmatrix} = \begin{bmatrix} \sigma''_{11} & 0 \\ 0 & \sigma''_{22} \end{bmatrix}. \tag{18}$$

Multiplying the matrices on the left hand side of (18) and equating the elements of the result with the corresponding elements of the right hand side, one obtains

σ″_11 = tan²ψ + 1 = sec²ψ, σ″_22 = (tan²φ + 1) y = (sec²φ) y, (19)

after an algebraic simplification using the relation (16) in the form tan ψ = y tan φ − x. The equations for σ′_11 and σ′_22 from (17) then follow by multiplying the equations for σ″_11 and σ″_22 from (19), respectively, by r_11 cos φ cos ψ. Specially, σ′_11 > 0, since σ′_11 = 0 would imply an obvious contradiction r_11 = 0 ≥ r_12 > 0.

If y = 0, from (19) and (17) it follows that σ′_11 > σ′_22 = 0, and, due to (14), (15), and (16), it holds that |tan ψ| = x > 0 = |tan φ|. If y = 1 then x = 0 from (9) and (13), contrary to the assumption. Therefore, 0 < y < 1. Let

a = −2xy < 0, b = 1 + x² − y² > 0, c = b + √(a² + b²) > 0. (20)

Then, rewrite tan(2φ) from (14) using (20) as

tan(2φ) = a/b, tan²(2φ) + 1 = (a² + b²)/b², (21)

as well as tan φ from (15) using (21) and (20) as

tan φ = tan(2φ)/(1 + √(tan²(2φ) + 1)) = (a/b) / ((b + √(a² + b²))/b) = a/(b + √(a² + b²)) = a/c. (22)

From (22), |tan φ| = |a|/c > 0, which gives |tan ψ| = y|a|/c + x with (13) and (16). Taking the ratio of these two absolute values, it has to be proven that

|tan ψ| / |tan φ| = (|a|y + cx) / |a| ≥ 1. (23)

Expanding (23) using (20), it follows that

(|a|y + cx)/|a| = y + (1 + x² − y² + √(a² + b²)) x / (2xy) = (1 + x² + y² + √(a² + b²)) / (2y), (24)

where the argument of the square root can be expressed as

a² + b² = (1 + x²)² + 2(x² − 1)y² + y⁴, (25)

after substitution of (20) for a and b and a subsequent algebraic simplification. For a fixed but arbitrary y, (25), and thus the numerator of (24), decreases monotonically as x → 0. Substituting zero for x in (24) and (25), the former becomes

(1 + y² + √((1 − y²)²)) / (2y) = 2 / (2y) = 1/y > 1,

what proves the inequality between the tangents.

The inequality between the scaled singular values follows easily from (19) as

σ′_22/σ′_11 = σ″_22/σ″_11 = ((tan²φ + 1)/(tan²ψ + 1)) · y < 1,

since tan²ψ ≥ tan²φ and y < 1.

It remains to be shown that σ′_11 < ∞ for all R from (9). If R has not been computed (what is never the case in the proposed method) from a well-scaled Â (see subsection 2.1), then even σ′_22 can overflow. Observe that 1/√2 < cos φ ≤ 1 and 0 < cos ψ ≤ cos φ. From (17),

σ′_11/r_11 = cos φ cos ψ sec²ψ ≤ sec ψ. (26)

Since, from (15) and (16), |tan ψ| = y|tan φ| + x ≤ y + x and cos ψ = 1/√(sec²ψ), x + y has to be bounded from above, to be able to bound cos ψ from below, and thus (26) from above. From (9) and the column pivoting goal, r_11² ≥ r_12² + r_22², which gives x² + y² ≤ 1, i.e., x and y are contained in the intersection of the first quadrant and the unit disc. On this domain, x + y attains the maximal value of √2 for x = y = 1/√2, so |tan ψ| ≤ √2, as claimed, and thus cos ψ ≥ 1/√3. With such cos ψ in (26), it follows that σ′_11 ≤ √3 · r_11 < 2 · r_11.

From subsection 2.1 and (6), since max_{1≤i,j≤2} {|ℜâ_ij|, |ℑâ_ij|} < 2^{h+1}, it can be concluded that

max_{1≤i,j≤2} |â_ij| < √((2^{h+1})² + (2^{h+1})²) = √2 · 2^{h+1},

and therefore

r_11 = max_{1≤j≤2} ‖â_j‖_F < √((√2 · 2^{h+1})² + (√2 · 2^{h+1})²) = 2 · 2^{h+1} = 2^{h+2}, (27)

so σ′_11 < 2 · 2^{h+2} = 2^{h+3}, where the right hand side is the immediate successor (that represents ∞) of the largest finite floating-point number, as claimed.

Using Theorem 1, from (12) and (11) it then follows that

Σ = 2^{−s} Σ′, U* = U*_φ U*_+, V = V_+ V_ψ, U = (U*)*, (28)

where Σ′ has to be backscaled to obtain the singular values of the original input matrix. However, the backscaling should be skipped if it would cause the singular values to overflow or (a less catastrophic but still inaccurate outcome) underflow, while informing the user of such an event by preserving the value of s.

Specifically, by matrix multiplication, from (28) and (11) it follows that

$$U = \cos\alpha\cos\varphi\, P_r \begin{bmatrix} d_{11}(1 - \hat d \tan\alpha\tan\varphi) & d_{11}(\tan\varphi + \hat d \tan\alpha) \\ -d_{22}(\tan\alpha + \hat d \tan\varphi) & d_{22}(\hat d - \tan\alpha\tan\varphi) \end{bmatrix},\quad V = \cos\psi\, P_c \begin{bmatrix} 1 & \tan\psi \\ -\tilde d \tan\psi & \tilde d \end{bmatrix}. \tag{29}$$

Computing each element of a complex U requires only one complex multiplication. Writing a complex number z as (ℜz, ℑz) = ℜz + i · ℑz, noting that d_11, d_22, and d̂ are the complex conjugates of (D*)_11, (D*)_22, and (D̂*)_22, respectively, and precomputing c = cos α cos φ and t = −tan α tan φ, (29) can be expanded as, e.g.,

(P*_r U)_11 = c (d_11 · (fma(ℜd̂, t, 1), ℑd̂ · t)),
(P*_r U)_21 = −c (d_22 · (fma(ℜd̂, tan φ, tan α), ℑd̂ · tan φ)),
(P*_r U)_12 = c (d_11 · (fma(ℜd̂, tan α, tan φ), ℑd̂ · tan α)),
(P*_r U)_22 = c (d_22 · (fma(−tan α, tan φ, ℜd̂), ℑd̂)),

but another mathematically equivalent computation that minimizes the number of roundings required for forming the elements of P*_r U as this one does is also valid.
3. Vector-friendly data layout
Vectors replace scalars in the SIMD arithmetic operations. A vector should hold S elements from the same matrix sequence, with the same row and column indices, and the consecutive bracketed indices. When computing with complex numbers, however, it is more efficient to keep the real and the imaginary parts of the elements in separate vectors, since there are no hardware vector instructions for, e.g., the complex multiplication and division, which thus have to be implemented manually. Also, a vector should be aligned in memory to a multiple of W bytes to employ the most efficient versions of the vector load/store operations. It is therefore essential to establish a vector-friendly layout for the matrix sequences A, U, V, and Σ′ in the linear memory space. One such layout, inspired by splitting the real and the imaginary parts of the matrix elements into separate vectors [7, subsection 6.2], is

ℜa_ij = [ ℜa^[1]_ij | ℜa^[2]_ij | ··· | ℜa^[n̂]_ij ],  ℑa_ij = [ ℑa^[1]_ij | ℑa^[2]_ij | ··· | ℑa^[n̂]_ij ],

where ℜa_ij and ℑa_ij (similarly, ℜu_ij, ℑu_ij, and ℜv_ij, ℑv_ij), for i, j ∈ {1, 2}, are the sequences of the real and the imaginary components, respectively, of the elements in the i-th row and the j-th column of the matrices in A (similarly, in U and in V).
Each train of boxes represents a contiguous region of memory aligned to W bytes. In the real case, no ℑ-boxes exist, but the layout otherwise stays the same. The scaled singular values σ′^[k]_max and σ′^[k]_min from (12) are stored as

σ′_max = [ σ′^[1]_max | σ′^[2]_max | ··· | σ′^[n̂]_max ],  σ′_min = [ σ′^[1]_min | σ′^[2]_min | ··· | σ′^[n̂]_min ],

respectively, while the scaling parameters s^[k] from (2) are laid out as

s = [ s^[1] | s^[2] | ··· | s^[n̂] ].

The virtual elements, with their bracketed indices ranging from n + 1 to n̂, serve if present as a (e.g., zero) padding, which ensures that all vectors, including the last one, formed from the consecutive elements of a box train, hold the same maximal number (S) of defined values and can thus be processed in a uniform manner.

The input sequence A may initially be in another layout and has to be repacked before any further computation. Also, the output sequences U, V, Σ′, and s may have to be repacked for further processing. Such reshufflings should be avoided, as they incur a substantial overhead in both time and memory requirements.

The layout of data, including the intermediate results, in vector registers during the computation is the same as for the box trains, but with S elements instead of n̂. The v-th vector, for 1 ≤ v ≤ V = n̂/S, encompasses the consecutive indices k,

(v − 1) · S + 1 ≤ k ≤ v · S. (30)

A vector is loaded into, kept in, and stored from, a variable of the C type __m512d. In the following, a bold lowercase letter stands for a vector, and an uppercase one for a (logical, not necessarily in-memory) matrix sequence. For example, R is a sequence of n̂ matrices R^[k], of which R[v] is a subsequence of length S, and r_12[v] is a vector containing the elements r^[k]_12 of R^[k], for some v and its corresponding indices k from (30). A bold constant denotes a vector with all its values equal to the given constant. An arithmetic operation on vectors (or collections thereof) or matrix sequences is a shorthand for a sequence of the elementwise operations; e.g.,

2^{−s} = (2^{−s^[k]})_k, BC = (B^[k] C^[k])_k, 1 ≤ k ≤ n̂,

where B and C are any two matrix sequences, B^[k] C^[k] is a product of matrices of order two, and 2 is a collection of vectors with all their values equal to two. All bracketed indices are one-based, as is customary in linear algebra, but in the C code they are zero-based, being thus one less than they are in the paper's text.
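A box train can thus be modeled in C as a single W-aligned array indexed by the bracketed index, e.g., as in the following sketch; box_alloc and box_at are hypothetical helper names:

#include <stddef.h>
#include <stdlib.h>

enum { S = 8, W = 64 }; /* the lanes per vector and the alignment in bytes */

/* One box train, e.g., re(a12): nhat doubles, W-aligned; nhat is a multiple of
   S, so nhat * sizeof(double) is a multiple of W, as aligned_alloc requires. */
static double *box_alloc(size_t nhat)
{
    return (double *)aligned_alloc(W, nhat * sizeof(double));
}

/* The element with the one-based bracketed index k of a box train, residing in
   lane (k - 1) mod S of the vector v = (k - 1) / S + 1, as in (30). */
static double box_at(const double *box, size_t k)
{
    return box[k - 1];
}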
4. Overview of the algorithm for the batched SVDs of order two
When there are two cases, the real and the complex one, all code-presenting figures cover the latter with a mixture of the actual statements and a mathematically oriented pseudocode. The real-case differences are described in them in the comments starting with R. A function name in uppercase, NAME, is a shorthand for the _mm512_name_pd C compiler intrinsic, if the operation is available in the machine's instruction set, or for an equivalent sequence of the bit-pattern preserving casts to and from the integer vectors and an equivalent integer NAME operation. More precisely, for NAME ∈ {AND, ANDNOT, OR, XOR} bitwise operations, if the AVX-512DQ instruction set extensions are not supported, an exception to the naming rule holds:

NAME(x, y) = _mm512_castsi512_pd(_mm512_name_epi64(x̂, ŷ)),
x̂ = _mm512_castpd_si512(x), ŷ = _mm512_castpd_si512(y).

All other required operations have a pd variant (with double precision vector lanes) in the core AVX-512F instruction set, so it suffices for implementing the entire algorithm. Additionally, let CMPLT_MASK stand for the _mm512_cmplt_pd_mask intrinsic, i.e., for the less-than lane-wise comparison of two vectors.

The four phases of the algorithm for the batched SVDs of order two, as listed in section 1, can be succinctly depicted by the following logical execution pipeline,

A →(s) Â →(P_c) A′ →(P*_r) A″ →(D*) A‴ →(Q*_α) R″ →(D̃) R′ →(D̂*) R →(U*_φ, V_ψ) Σ′ →(not safe) Σ;
U = P_r D Q_α D̂ U_φ, V = P_c D̃ V_ψ; Σ = 2^{−s} Σ′,

where the chain shows the transformations of A, the parenthesized labels above the arrows are the various matrix sequences that are the "by-products" of the computation, described in section 2, ending with the sequences of the left and the right singular vectors, that are formed as indicated in the trailing formulas. As the singular values Σ can overflow due to the backscaling (see subsection 2.3) of the scaled ones (Σ′), computing them unconditionally is unsafe, and such postprocessing is left to the user's discretion. In certain use-cases it might be known in advance that the singular values cannot overflow/underflow, e.g., if the initial matrices have already been well-scaled at their formation. The backscaling, performed as in Fig. 1, is then unconditionally safe.

-0 = SET1(-0.0); // a constant vector with all lanes equal to -0.0
-s[v] = XOR(s[v], -0); // negation is performed as XOR-ing with -0.0
σmax[v] = SCALEF(σ′max[v], -s[v]); σmin[v] = SCALEF(σ′min[v], -s[v]);
s[v] = -0; // inform the user that the backscaling has been performed

Fig. 1. Optional vectorized backscaling of Σ′ to Σ by 2^{−s}.

The pipeline is executed independently on each non-overlapping subsequence of S consecutive matrices. If there are more such sequences than the active threads, at a point in time some sequences might have already been processed, while the others are still waiting, either for the start or the completion of the processing. A conceptual core of a driver routine implementing such a pass over the data is shown in Fig. 2, where xSsvd2, x ∈ {d, z}, are the main (real or complex, respectively), single-threaded routines that are responsible for all vectorized computations on each particular sequence of size S. The OpenMP [26] parallel for directive in Fig. 2 assumes a user-defined maximal number and placement/affinity of threads.

const size_t V = (n + (S - 1)) / S; // V = ⌈n/S⌉, n̂ = V · S
#pragma omp parallel for shared(V, A, U, V, Σ′, s)
for (size_t v = 0; v < V; ++v)
  xSsvd2(A[v], U[v], V[v], Σ′[v], s[v]);

Fig. 2. A conceptualization of the main part of a driver routine for the batched SVDs of order two, with an OpenMP parallel loop over the data, where each of the V subsequences of length S can be processed concurrently with the others by an xSsvd2 routine that performs S SVDs simultaneously.
The input arguments of xSsvd2 are (the pointers to) the arrays, each aligned to W bytes, of S double values, e.g., const double A12r[static S] for ℜa_12[v]. The output arguments are similar, e.g., double U21i[static S] for ℑu_21[v]. Note that the same interface, up to replacing S by 1, would be applicable to a pointwise x1svd2 routine for a single 2 × 2 SVD.

No conditional branching occurs in the xSsvd2 routines; they are therefore fully branch-free, if the used SVML routines are. All data, once loaded from memory or computed, is intended to be held in the zmm vector registers until the output has been formed and written back to RAM. This goal is almost achievable in the test setting, since there are two vector register spillages, with a total of only four extra memory accesses (two writes and two reads), as reported by the optimizer. A hand-tuned self-contained assembly might do away with these as well.

The first three phases of the algorithm are vectorized as described in sections 5, 6, and 7, respectively, since each of the phases can be viewed as an algorithm on its own. They are, however, chained by the dataflow, each having as its input the output of the previous one. Should the output of a phase be made available alongside the final results, it could be written to an additional memory buffer in the same layout as presented in section 3. Otherwise, the intermediate results are not preserved.

Vectorization of the last, fourth phase of the algorithm from (29) is as tedious and uninformative as it is straightforward, and so it is omitted for brevity. It suffices to say that ℜu_ij (and ℑu_ij) and ℜv_ij (and ℑv_ij), for 1 ≤ i, j ≤ 2, are computed from (29), using the fma operation where possible, and (5) for the complex multiplications. The final row permutations by P_r or P_c are performed in the same way as the row swaps in the URV factorization phase, described in Fig. 6 in section 6. An interested reader is referred to the actual code in the supplementary material.^b

^b Supplementary material is available in the https://github.com/venovako/VecKog repository.
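A hypothetical sketch of the complex routine's C interface, as just described, might read as follows; the argument names, their order, and the trailing outputs are illustrative only, while the actual interface resides in the supplementary material:

#define S 8
void zSsvd2(const double A11r[static S], const double A11i[static S],
            const double A21r[static S], const double A21i[static S],
            const double A12r[static S], const double A12i[static S],
            const double A22r[static S], const double A22i[static S],
            double U11r[static S], double U11i[static S], /* ...rest of U... */
            double V11r[static S], double V11i[static S], /* ...rest of V... */
            double Smax[static S], double Smin[static S], double Se[static S]);

Each array is expected to be aligned to W bytes, so a single aligned vector load or store per argument suffices.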
5. Vectorized exact scalings of the input matrices
Computation of the scaling parameters s is remarkably simple, as shown in Fig. 3. It is advantageous to have the GETEXP(a) = exp(a) and SCALEF(a, b) = a · 2^b vector operations, returning a correct (or correctly rounded, respectively) result, even with subnormal inputs (the former) or outputs (the latter). Should they not be available on another platform, their scalar variants (frexp and scalbn, respectively) might be used instead on the values in each lane, slowing the execution considerably.

h = SET1((double)(DBL_MAX_EXP-3)); // set each lane of h to h
eℜij[v] = SUB(h, GETEXP(ℜaij[v])); // h − exp ℜaij[v]
// take eℜ[v] = min{eℜ11[v], eℜ21[v], eℜ12[v], eℜ22[v]} by a two-level min-reduction
eℜ[v] = MIN(MIN(eℜ11[v], eℜ21[v]), MIN(eℜ12[v], eℜ22[v]));
// eℑij[v], with 1 ≤ i, j ≤ 2, and eℑ[v] are computed analogously from ℑaij[v]
s[v] = MIN(SET1(DBL_MAX), MIN(eℜ[v], eℑ[v])); // from (2), R: eℑ[v] nonexistent
ℜâij[v] = SCALEF(ℜaij[v], s[v]); ℑâij[v] = SCALEF(ℑaij[v], s[v]);

Fig. 3. Vectorized computation of the scaling parameters s from (2) and the scaling of A.

Once Â is obtained from A, the column norms of the former are computed as in Fig. 4. Observe that ABS(b) = ANDNOT(-0, b), since ANDNOT(a, b) = ¬a ∧ b bitwise, and that having a vectorized HYPOT is essential here. Should it not be available, it would have to be carefully implemented to avoid the overflows in the intermediate results. A naïve per-lane computation of c = hypot(a, b), where a and b are finite, without adjusting the exponents of a and b, but with one extra division instead, is to let a′ = max{|a|, |b|}, b′ = min{|a|, |b|}, a_+ = max{a′, DBL_TRUE_MIN} > 0, q_+ = b′/a_+ ≤ 1, and c = a′ · √(fma(q_+, q_+, 1)).

|âij|[v] = HYPOT(ℜâij[v], ℑâij[v]); // from (1), R: |âij|[v] = ANDNOT(-0, ℜâij[v])
‖âj‖F[v] = HYPOT(|â1j|[v], |â2j|[v]); // with j ∈ {1, 2}, from (3)

Fig. 4. Vectorized computation of the column norms ‖âj‖F from (3).
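The described naïve fallback, under the assumption of finite inputs, can be written in scalar C as follows; hypot2 is a hypothetical helper name:

#include <float.h>
#include <math.h>

/* A sketch of the per-lane hypot fallback; hypot2(0, 0) = 0, as required. */
static double hypot2(double a, double b)
{
    const double aa = fabs(a), ab = fabs(b);
    const double hi = fmax(aa, ab), lo = fmin(aa, ab);
    const double q = lo / fmax(hi, DBL_TRUE_MIN); /* 0 <= q <= 1, never 0/0 */
    return hi * sqrt(fma(q, q, 1.0)); /* the sqrt argument lies in [1, 2] */
}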
6. Vectorized URV factorizations of order two
Having its column norms computed, Â has to be pivoted, each matrix by a column-swapping permutation (or an identity, if a swap is not required), such that a column with the largest norm becomes the first one. This is accomplished in Fig. 5 by the MASK_BLEND operation, that selects a value for the ℓ-th output lane from the same lane in either the first or the second argument vector, according to a bit-mask c that compactly encodes the results of the lane-wise <-comparisons of the norms by the CMPLT_MASK operation. If the ℓ-th bit in the mask is zero (i.e., the ℓ-th comparison is false), the ℓ-th output lane gets its value from the first vector, and the corresponding permutation P^[k]_c, where k = (v − 1) · S + ℓ, is the identity; else, the output value is taken from the second vector, and the permutation encodes a swap. All norms are finite and thus ordered, so the complement of the relation < is ≥.

c[v] = CMPLT_MASK(‖â1‖F[v], ‖â2‖F[v]); // S-bit mask encodes the < relation
ℜa′i1[v] = MASK_BLEND(c[v], ℜâi1[v], ℜâi2[v]); // similarly for ℑa′i1[v]
ℜa′i2[v] = MASK_BLEND(c[v], ℜâi2[v], ℜâi1[v]); // similarly for ℑa′i2[v]
|a′i1|[v] = MASK_BLEND(c[v], |âi1|[v], |âi2|[v]);
|a′i2|[v] = MASK_BLEND(c[v], |âi2|[v], |âi1|[v]);
‖a′1‖F[v] = MASK_BLEND(c[v], ‖â1‖F[v], ‖â2‖F[v]);
‖a′2‖F[v] = MASK_BLEND(c[v], ‖â2‖F[v], ‖â1‖F[v]);

Fig. 5. Vectorized column pivoting of Â.

Not only A′ itself has to be obtained. The absolute values of the elements and the column norms also have to be subject to the same (maybe identity) permutations, as in Fig. 5, to avoid recomputing them unnecessarily and at a greater cost, especially in the complex case. Similar principles hold for the row sorting of A′ in Fig. 6.

r[v] = CMPLT_MASK(|a′11|[v], |a′21|[v]); // Is |a′11|[v] < |a′21|[v], lane-wise?
ℜa″1j[v] = MASK_BLEND(r[v], ℜa′1j[v], ℜa′2j[v]); // similarly for ℑa″1j[v]
ℜa″2j[v] = MASK_BLEND(r[v], ℜa′2j[v], ℜa′1j[v]); // similarly for ℑa″2j[v]
|a″1j|[v] = MASK_BLEND(r[v], |a′1j|[v], |a′2j|[v]);
|a″2j|[v] = MASK_BLEND(r[v], |a′2j|[v], |a′1j|[v]);

Fig. 6. Vectorized row sorting of A′.

Since only the rows of A′ are possibly swapped to get A″, the column norms do not change, so ‖a″j‖F = ‖a′j‖F. To make the first column of A″ real and non-negative, D* from (4) or (10) is computed and applied as in Fig. 7. Observe how the sign extractions and the implicit complex conjugations and multiplications are performed in Fig. 7, as the same pattern is assumed for them in the following.

1 = SET1(1.0); m = SET1(DBL_TRUE_MIN); // ones & the successors of +0
|a+i1|[v] = MAX(|a″i1|[v], m); // from (1), with i ∈ {1, 2}, here and below
ℜdii[v] = OR(MIN(DIV(ANDNOT(-0, ℜa″i1[v]), |a″i1|[v]), 1), AND(ℜa″i1[v], -0));
ℑdii[v] = DIV(ℑa″i1[v], |a+i1|[v]); // ℑd*ii[v] = −ℑdii[v] implicitly
// R: ℜdii[v] = AND(ℜa″i1[v], -0), only the sign bit (±0, not ±1 from (10))
// (ℜa‴i2[v], ℑa‴i2[v]) = (ℜd*ii[v], ℑd*ii[v]) · (ℜa″i2[v], ℑa″i2[v]), from (4)
ℜa‴i2[v] = FMADD(ℜdii[v], ℜa″i2[v], MUL(ℑdii[v], ℑa″i2[v])); // and below from (5)
ℑa‴i2[v] = FMSUB(ℜdii[v], ℑa″i2[v], MUL(ℑdii[v], ℜa″i2[v])); // Fused Mul and SUB
// R: ℜa‴i2[v] = XOR(ℜdii[v], ℜa″i2[v]), i.e., ℜa‴i2[v] = ℜd*ii[v] · ℜa″i2[v] from (10)
ℜa‴i1[v] = |a″i1|[v]; // assume ℑa‴i1[v] = SETZERO(); i.e., 0

Fig. 7. Vectorized computation of D* and A‴ from (4) or (10).

The matrices D* are unitary and diagonal, so ‖a‴j‖F = ‖a″j‖F can be (and is) assumed, though numerically they might slightly differ, should the former be recomputed, due to the rounding errors accumulated in the course of the transformations of the elements of A″ as in Fig. 7, as well as due to the recomputation itself.

Fig. 8 shows how to get the QR factorizations from (6), i.e., R″ = Q*_α A‴. Only the second column of R″ has to be computed by multiplying a‴_2 (complex in the general case) from the left by the real Q*_α, while r_11 is always real and already known. Should INVSQRT not be available, there are two remedies, both starting from sec α = √(sec²α). The first, faster one computes cos α = 1/sec α, while the second, possibly more accurate one, due to requiring one rounding less than the first [4], does not require the cosine at all, and instead replaces all multiplications by it with divisions by the secant.

r11[v] = ‖a‴1‖F[v]; // from (6), r21[v] = 0 is assumed but not set
-tanα[v] = MAX(DIV(ℜa‴21[v], ℜa‴11[v]), 0); // from (7)
cosα[v] = INVSQRT(FMADD(-tanα[v], -tanα[v], 1)); // from (7)
ℜr″12[v] = MUL(cosα[v], FMADD(-tanα[v], ℜa‴22[v], ℜa‴12[v])); // similarly ℑr″12[v]
// tanα[v] implicit in Fused_Negative_Multiply-ADD(a, b, c) = −(a · b) + c
ℜr″22[v] = MUL(cosα[v], FNMADD(-tanα[v], ℜa‴12[v], ℜa‴22[v])); // ℑr″22[v] likewise

Fig. 8. The vectorized QR factorization of A‴ from (6) and (7).

Now r″_12 has to be made real and non-negative by multiplying R″ by D̃ from the right, obtaining R′ as in (8), and then r′_22 has to undergo a similar procedure by multiplying R′ from the left by D̂*, as in (9), to get the real and non-negative R. The first step involves one complex multiplication per lane, r′_22 = r″_22 · d̃, while r_12 = |r″_12|, and the second step involves none, since r_22 = |r′_22|, as shown in Fig. 9.

r12[v] = |r″12|[v] = HYPOT(ℜr″12[v], ℑr″12[v]); // R: r12[v] = ANDNOT(-0, ℜr″12[v])
ℜd̃*[v] = OR(MIN(DIV(ANDNOT(-0, ℜr″12[v]), |r″12|[v]), 1), AND(ℜr″12[v], -0)); // (8)
ℑd̃*[v] = DIV(ℑr″12[v], MAX(|r″12|[v], m)); // from (8), here and above using (1)
// R: ℜd̃[v] = AND(ℜr″12[v], -0), from (10), but only the sign bit (i.e., ±0)
ℜr′22[v] = FMADD(ℜr″22[v], ℜd̃*[v], MUL(ℑr″22[v], ℑd̃*[v])); // and below from (8)
ℑr′22[v] = FMSUB(ℑr″22[v], ℜd̃*[v], MUL(ℜr″22[v], ℑd̃*[v])); // Fused Mul and SUB
// R: ℜr′22[v] = XOR(ℜr″22[v], ℜd̃[v]), from (10), but faster than multiplication
r22[v] = |r′22|[v] = HYPOT(ℜr′22[v], ℑr′22[v]); // R: r22[v] = ANDNOT(-0, ℜr′22[v])
ℜd̂[v] = OR(MIN(DIV(ANDNOT(-0, ℜr′22[v]), |r′22|[v]), 1), AND(ℜr′22[v], -0)); // (9)
ℑd̂[v] = DIV(ℑr′22[v], MAX(|r′22|[v], m)); // ℑd̂* = −ℑd̂ implicitly; from (9)
// R: ℜd̂[v] = AND(ℜr′22[v], -0), from (10), but the sign bit extraction only

Fig. 9. Vectorized computation of D̃, D̂, and R from (8) and (9), or from (10).
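In terms of the actual AVX-512F intrinsics behind CMPLT_MASK and MASK_BLEND, the column-pivoting pattern of Fig. 5 amounts to the following minimal sketch; pivot_lanes is a hypothetical helper name:

#include <immintrin.h>

/* Swap, lane-wise, the two column-element vectors wherever norm1 < norm2. */
static void pivot_lanes(__m512d *c1, __m512d *c2, __m512d norm1, __m512d norm2)
{
    const __mmask8 c = _mm512_cmplt_pd_mask(norm1, norm2);
    const __m512d t = _mm512_mask_blend_pd(c, *c1, *c2); /* c2 where c is set */
    *c2 = _mm512_mask_blend_pd(c, *c2, *c1);             /* c1 where c is set */
    *c1 = t;
}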
7. Vectorized SVD of real upper-triangular matrices of order two
In Fig. 10 a vectorization of the 2 × 2 SVD of R from subsection 2.3 is presented. If T is not double, the precomputed upper bound for |tan(2φ)| should be replaced by the appropriate one (e.g., for T = float, √FLT_MAX should be used instead). No sines are explicitly computed here, unlike in LAPACK's DLASV2 routine, but they could be, as sin β = cos β · tan β, for β ∈ {φ, ψ}.

f = SET1(1.34078079299425956E+154); // √DBL_MAX from (14)
x[v] = MAX(DIV(r12[v], r11[v]), 0); y[v] = MAX(DIV(r22[v], r11[v]), 0); // see (13)
tan(2φ)[v] = OR(MIN(MAX(DIV(MUL(SCALEF(MIN(x[v], y[v]), 1), MAX(x[v], y[v])),
  /* from (14) */ FMADD(SUB(x[v], y[v]), ADD(x[v], y[v]), 1)), 0), f), -0);
tanφ[v] = DIV(tan(2φ)[v], ADD(1, SQRT(FMADD(tan(2φ)[v], tan(2φ)[v], 1))));
sec²φ[v] = FMADD(tanφ[v], tanφ[v], 1); cosφ[v] = INVSQRT(sec²φ[v]); // see (15)
tanψ[v] = FMSUB(y[v], tanφ[v], x[v]); sec²ψ[v] = FMADD(tanψ[v], tanψ[v], 1);
cosψ[v] = INVSQRT(sec²ψ[v]); // from (16) here and above
cψφ[v] = MUL(cosφ[v], cosψ[v]); σ′max[v] = MUL(MUL(cψφ[v], sec²ψ[v]), r11[v]);
σ′min[v] = MUL(MUL(cψφ[v], sec²φ[v]), r22[v]); // from (17) here and above

Fig. 10. Vectorization of the 2 × 2 SVD of R from subsection 2.3.

A computation functionally similar to the one proposed in Fig. 10 could be performed by S calls to DLASV2. In the Fortran syntax, one such call looks like

CALL DLASV2(F[k], G[k], H[k], SSMIN[k], SSMAX[k], SNR[k], CSR[k], SNL[k], CSL[k]),

where k lies in the range (30), for a given v. The input-only arguments are

F[k] = r^[k]_11, G[k] = r^[k]_12, H[k] = r^[k]_22,

while the outputs are related to the quantities computed or implied in Fig. 10 as

cos φ[v] = (CSL[k])_k, −sin φ[v] = (SNL[k])_k, cos ψ[v] = (CSR[k])_k,
−sin ψ[v] = (SNR[k])_k, σ′max[v] = (SSMAX[k])_k, σ′min[v] = (SSMIN[k])_k.
It is inadvisable to replace the vectorized algorithm in Fig. 10 by the DLASV2 calls, for at least two reasons. First, the input vectors have to be stored from the registers to the addressable memory. Then S function calls have to be made instead of a single pass over the data, and the results finally have to be loaded from the memory into the vector registers for the last phase of the algorithm. Second, throughout the paper the tangents are used instead of the sines, to increase the accuracy by reducing the number of the roundings performed, due to more opportunities for employing the fma-type operations [12]. However, DLASV2 provides the tangents only implicitly, as tan β = sin β / cos β for β ∈ {φ, ψ}. If the last phase of the algorithm comes after the DLASV2 calls, (29) has to be rewritten in the terms of the respective sines to avoid the superfluous divisions, as the equivalent expressions

$$U = P_r \begin{bmatrix} d_{11}(\cos\alpha\cos\varphi - \hat d \sin\alpha\sin\varphi) & d_{11}(\cos\alpha\sin\varphi + \hat d \sin\alpha\cos\varphi) \\ -d_{22}(\sin\alpha\cos\varphi + \hat d \cos\alpha\sin\varphi) & d_{22}(\hat d \cos\alpha\cos\varphi - \sin\alpha\sin\varphi) \end{bmatrix},\quad V = P_c \begin{bmatrix} \cos\psi & \sin\psi \\ -\tilde d \sin\psi & \tilde d \cos\psi \end{bmatrix}. \tag{31}$$

A quick test (albeit with computing tan β by two vector divisions for simplicity) has shown that an algorithm that calls DLASV2 as described is noticeably slower, more so in the real than in the complex (more involved in the other phases) case, relative to the timings of the fully vectorized algorithm. Using DLASV2 instead of the method in Fig. 10 might therefore be a viable alternative only in the pointwise case, within a routine (e.g., x1svd2) designed in the LAPACK's style.
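Should the tangents still be needed after such DLASV2 calls (as in the quick test above), they can be recovered lane-wise by one division and one sign flip each, in the pseudocode style of the figures, assuming the outputs have been gathered into vectors:

tanφ[v] = DIV(XOR(SNL[v], -0), CSL[v]); // tan φ = −SNL/CSL
tanψ[v] = DIV(XOR(SNR[v], -0), CSR[v]); // tan ψ = −SNR/CSR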
8. Numerical testing
All testing was performed on an Intel Xeon Phi 7210 CPU, running at 1.3 GHz. The Intel C compiler, icc (version 19.1.1.217), was invoked with the following optimization and floating-point options: -O3 -xHost -qopt-zmm-usage=high -fp-model source -no-ftz -prec-div -prec-sqrt -fimf-precision=high, to enable the gradual underflow, prohibit the aggressive floating-point optimizations that could result in a loss of precision, and, together with -fimf-use-svml=true, to employ the high-accuracy SVML library. Among the other options were -std=c18 and -qopenmp. The DLASV2 routine was provided by the sequential Intel Math Kernel Library (MKL). The quadruple precision floating-point arithmetic, used for the error checking only, was supported by the __float128 datatype and the functions operating on it, e.g., fmaq, hypotq, and scalbq, with the obvious semantics.

The test data was harvested from the /dev/urandom pseudorandom byte stream, with an approximately uniform probability distribution of each bit, and 64 consecutive bits formed a double precision value. A value that was not finite (either ±∞ or a NaN) was replaced with another one that was, sourced in the same way. The finite doubles were stored in a binary file and reused for all runs. In the real case, the file layout was assumed to be a sequence of records, where each record contained four vectors (i.e., all the elements of S matrices of order two) as

ℜa11[v], ℜa21[v], ℜa12[v], ℜa22[v],

while in the complex case eight vectors were assumed per each record, as

ℜa11[v], ℑa11[v], ℜa21[v], ℑa21[v], ℜa12[v], ℑa12[v], ℜa22[v], ℑa22[v].

In both cases, a single batch comprised n = n̂ matrices, with n a large power of two—an absurdly large number for a typical usage scenario in the Kogbetliantz algorithm for the SVD of order 2n, but necessary for a reliable timing of each batch, to compensate for the unavoidable operating system's jitter. Therefore, in the real case the test file contained 64 batches, and the same file provided 32 batches in the complex case. Execution of each batch by the parallel for loop from Fig. 2 with 32 OpenMP threads, spread across the 64 CPU cores such that each thread was affinity-bound to its own core while no two threads shared the same level-2 cache, resulted in the following summary outputs:

1. the wall time t in seconds for processing the entire batch, measured by placing the omp_get_wtime calls immediately before and after the aforesaid parallel loop,
2. the maximal a posteriori spectral condition number of the matrices in the batch,

κ = max_{1≤k≤n} κ(A^[k]) = max_{1≤k≤n} fmin(σ^[k]_max/σ^[k]_min, ∞), (32)

3. the maximal normwise relative error of the singular value decompositions, as

ρ = max_{1≤k≤n} fmax(‖U^[k] Σ^[k] (V^[k])* − A^[k]‖_F / ‖A^[k]‖_F, 0), (33)

4. the maximal departure from orthogonality of the left singular vectors, as

δ = max_{1≤k≤n} ‖(U^[k])* U^[k] − I‖_F, (34)

5. and the maximal departure from orthogonality of the right singular vectors, as

η = max_{1≤k≤n} ‖(V^[k])* V^[k] − I‖_F. (35)

The last four metrics above (κ, ρ, δ, and η) were computed after the batch had been entirely processed and the output data had been converted to quadruple precision by the value-preserving casts. The results were then printed out by rounding them first to the hardware's 80-bit extended datatype (long double, with a negligible error), while the timings were rounded to the nearest microsecond.
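For illustration, the quadruple precision evaluation of, e.g., ρ from (33) for a single real decomposition can be sketched as follows; relerr2 is a hypothetical helper name, with __float128 and sqrtq coming from the compiler's quadmath support mentioned above:

#include <quadmath.h>

/* A sketch of (33) for one real 2x2 SVD: ||U Sigma V^T - A||_F / ||A||_F. */
static double relerr2(const double a[2][2], const double u[2][2],
                      const double s[2], const double v[2][2])
{
    __float128 r = 0, na = 0;
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j) {
            /* (U Sigma V^T)_ij = s_max u_i1 v_j1 + s_min u_i2 v_j2 */
            const __float128 us = (__float128)s[0] * u[i][0] * v[j][0]
                                + (__float128)s[1] * u[i][1] * v[j][1];
            r += (us - a[i][j]) * (us - a[i][j]);
            na += (__float128)a[i][j] * a[i][j];
        }
    return (double)sqrtq(r / na);
}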
The pointwise algorithm was implemented in the real (d1svd2) and the complex (z1svd2) variant. The first two phases of the pointwise algorithm are arithmetically equivalent to those of the vectorized one if the scalar hypot and invsqrt functions are equivalent to the respective vector ones, in an arbitrary lane. The SVD of a non-negative upper-triangular matrix was performed in the pointwise algorithm by a single DLASV2 call (see section 7), and the subsequent formation of U and V was done as in (31), to compare the accuracy (i.e., ρ, δ, and η) of such an approach with the one proposed for the vectorized algorithm. Also, the pointwise implementations in C are as close as possible to the LAPACK-style Fortran routines that could be written for this specific purpose of computing the general 2 × 2 SVDs, without the overhead that a call to a general m × n SVD routine would necessarily incur, thus allowing a fair comparison of the execution times, as follows. Each call to dSsvd2 or zSsvd2 was replaced by S calls (one for each of the argument vectors' lanes) to d1svd2 or z1svd2, respectively, and the rest of the testing code (i.e., the batch timing and the error checking parts) was left intact. The speedup is the ratio of the wall time required for processing a batch with the pointwise algorithm so employed and the wall time required for the same job using the vectorized algorithm as proposed.

The maximal condition number κ from (32) varies from one batch to another, in the real case as well as in the complex one, but in each batch there was at least one almost as highly ill-conditioned matrix as possible, without being exactly singular.

Fig. 11 shows the attained speedup for each batch.

[Fig. 11. The speedup attained with each batch processed by the vectorized algorithm (xSsvd2) vs. the pointwise algorithm (x1svd2), in the real (x = d, left) and the complex (x = z, right) case.]

Fig. 12 shows that for the vast majority of batches, the vectorized algorithm gives a bit more accurate decomposition than the pointwise one, but both are usable. The optional backscaling was proven to be unsafe, since in each batch at least one singular value overflowed when backscaled, with either algorithm and in either case.

[Fig. 12. The maximal normwise relative errors ρ from (33) for each batch processed by the vectorized (■) and the pointwise (•) algorithm, in the real (left) and the complex (right) case.]

Figs. 13 and 14 demonstrate that the vectorized algorithm generally results in the more orthogonal left and right singular vectors, respectively, than the pointwise one, since there is less rounding involved in computing the vectors in the former.

[Fig. 13. The maximal departures from orthogonality δ from (34) for each batch processed by the vectorized (■) and the pointwise (•) algorithm, in the real (left) and the complex (right) case.]

[Fig. 14. The maximal departures from orthogonality η from (35) for each batch processed by the vectorized (■) and the pointwise (•) algorithm, in the real (left) and the complex (right) case.]
9. Conclusions
This paper has shown that a batched computation of the SVDs of order two can be vectorized with a relative ease on the Intel AVX-512 architecture. Other vectorization platforms might be targeted as well, if they provide the instructions analogous to those required here. Single precision could be used instead of double precision.

Compared to the pointwise processing of one matrix at a time, the vectorized algorithm is nearly or more than three times faster, in the complex and the real case, respectively, and generally slightly more accurate when computing the SVDs of non-negative upper-triangular matrices as proposed versus the standard procedure of DLASV2. Additionally, the exact scalings of the input matrices ensure that the scaled singular values never overflow, provided that the input elements are all finite.
A similar vectorization principle, relying on a vector-friendly data layout (see section 3), could be applied to the various algorithms for factorizations or decompositions of a batch of matrices of a small, fixed order, if the control flow within the algorithm can be transformed into a branch-free one, as it has been done here for the column and the row pivotings in section 6, and for handling the special values.
Acknowledgements
This work has been supported in part by Croatian Science Foundation under the project IP–2014–09–3670 "Matrix Factorizations and Block Diagonalization Algorithms" (MFBDA).

The author would like to thank Sanja Singer for her assistance in the preparation of Figs. 11, 12, 13, and 14 with METAPOST.

References

[1] E. G. Kogbetliantz, Solution of linear equations by diagonalization of coefficients matrix, Quart. Appl. Math. 13 (2) (1955) 123–132.
[2] M. Bečka, G. Okša and M. Vajteršic, Dynamic ordering for a parallel block-Jacobi SVD algorithm, Parallel Comput. 28 (2) (2002) 243–262.
[3] G. Okša, Y. Yamamoto, M. Bečka and M. Vajteršic, Asymptotic quadratic convergence of the two-sided serial and parallel block-Jacobi SVD algorithm, SIAM J. Matrix Anal. Appl. 40 (2) (2019) 639–671.
[4] V. Novaković and S. Singer, A Kogbetliantz-type algorithm for the hyperbolic SVD, arXiv:2003.06701 [math.NA] (March, 2020).
[5] ISO/IEC JTC1/SC22/WG14, ISO/IEC 9899:2018(en) Information technology — Programming languages — C, 4th edn. (ISO, 2018). International standard.
[6] Intel Corporation, Intel 64 and IA-32 Architectures Software Developer's Manual (October, 2019). https://software.intel.com/en-us/articles/intel-sdm (Order No. 325462-071US, Combined Volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D and 4).
[7] S. Singer, E. Di Napoli, V. Novaković and G. Čaklović, The LAPW method with eigendecomposition based on the Hari–Zimmermann generalized hyperbolic SVD, arXiv:1907.08560 [math.NA] (July, 2019).
[8] V. Hari, On the global convergence of the complex HZ method, SIAM J. Matrix Anal. Appl. 40 (4) (2019) 1291–1310.
[9] V. Hari, Globally convergent Jacobi methods for positive definite matrix pairs, Numer. Algorithms 79 (1) (2018) 221–249.
[10] V. Novaković, S. Singer and S. Singer, Blocking and parallelization of the Hari–Zimmermann variant of the Falk–Langemeyer algorithm for the generalized SVD, Parallel Comput. 49 (2015) 136–152.
[11] V. Novaković, Parallel Jacobi-type algorithms for the singular and the generalized singular value decomposition, PhD thesis, University of Zagreb, Croatia (https://urn.nsk.hr/urn:nbn:hr:217:515320, December 2017).
[12] Z. Drmač, Implementation of Jacobi rotations for accurate singular value computation in floating point arithmetic, SIAM J. Sci. Comput. 18 (4) (1997) 1200–1222.
[13] V. Hari, S. Singer and S. Singer, Block-oriented J-Jacobi methods for Hermitian matrices, Linear Algebra Appl. 432 (8–10) (2010) 1491–1512.
[14] V. Hari, S. Singer and S. Singer, Full block J-Jacobi method for Hermitian matrices, Linear Algebra Appl. 444 (2014) 1–27.
[15] A. Haidar, A. Abdelfattah, M. Zounon, S. Tomov and J. Dongarra, A guide for achieving high performance with very small matrices on GPU: A case study of batched LU and Cholesky factorizations, IEEE Trans. Parallel Distrib. Syst. 29 (5) (2018) 973–984.
[16] J. Dongarra, M. Gates, J. Kurzak, P. Luszczek and Y. M. Tsai, Autotuning numerical dense linear algebra for batched computation with GPU hardware accelerators, Proc. IEEE 106 (11) (2018) 2040–2055.
[17] G. W. Stewart, An updating algorithm for subspace tracking, IEEE Trans. Signal Process. 40 (6) (1992) 1535–1541.
[18] J. P. Charlier, M. Vanbegin and P. Van Dooren, On efficient implementations of Kogbetliantz's algorithm for computing the singular value decomposition, Numer. Math. 52 (3) (1987) 279–300.
[19] V. Hari and J. Matejaš, Accuracy of two SVD algorithms for 2 × 2 triangular matrices, Appl. Math. Comput. 210 (1) (2009) 232–257.
[20] J. Matejaš and V. Hari, On high relative accuracy of the Kogbetliantz method, Linear Algebra Appl. 464 (2015) 100–129.
[21] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney and D. Sorensen, LAPACK Users' Guide, 3rd edn. (Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1999).
[22] S. Qiao and X. Wang, Computing the singular values of 2-by-2 complex matrices (May, 2002).
[23] IEEE Task P754, IEEE 754-2008, Standard for Floating-Point Arithmetic (IEEE, New York, NY, USA, August 2008).
[24] NVIDIA Corporation, CUDA C++ Programming Guide v10.2.89 (November, 2019). https://docs.nvidia.com/cuda/cuda-c-programming-guide/.
[25] V. Novaković and S. Singer, An implicit Hari–Zimmermann algorithm for the generalized SVD on the GPUs, arXiv:1909.00101 [math.NA] (August, 2019).
[26] OpenMP Architecture Review Board, OpenMP Application Programming Interface Version 4.5 (November, 2015).