Parallel Robust Computation of Generalized Eigenvectors of Matrix Pencils
Carl Christian Kjelgaard Mikkelsen and Mirko Myllykoski

Department of Computing Science and HPC2N, Umeå University, 90187 Umeå, Sweden
{spock,mirkom}@cs.umu.se

Abstract.
In this paper we consider the problem of computing generalized eigenvectors of a matrix pencil in real Schur form. In exact arithmetic, this problem can be solved using substitution. In practice, substitution is vulnerable to floating-point overflow. The robust solvers xTGEVC in LAPACK prevent overflow by dynamically scaling the eigenvectors. These subroutines are sequential scalar codes which compute the eigenvectors one by one. In this paper we discuss how to derive robust blocked algorithms. The new StarNEig library contains a robust task-parallel solver Zazamoukh which runs on top of StarPU. Our numerical experiments show that Zazamoukh achieves a super-linear speedup compared with DTGEVC for sufficiently large matrices.
Keywords: generalized eigenvectors, overflow protection, task-parallelism
This manuscript was accepted to PPAM 2019, Bialystok, Poland, September 8–11, 2019. The final authenticated version is available online at https://doi.org/10.1007/978-3-030-43229-4_6.

Let A ∈ R^{m×m} and let B ∈ R^{m×m}. The matrix pencil (A, B) consists of all matrices of the form A − λB where λ ∈ C. The set of (generalized) eigenvalues of the matrix pencil (A, B) is given by

    λ(A, B) = { λ ∈ C : det(A − λB) = 0 }.   (1)

We say that z ∈ C^m is a (generalized) eigenvector of the matrix pencil (A, B) if and only if z ≠ 0 and

    Az = λBz.   (2)

The eigenvalues of (A, B) can be computed by first reducing (A, B) to real Schur form (S, T). Specifically, there exist orthogonal matrices U and V such that S = U^T A V is quasi-upper triangular and T = U^T B V is upper triangular. It is clear that

    λ(A, B) = λ(S, T).   (3)

Moreover, if z is a generalized eigenvector of (S, T) corresponding to the eigenvalue λ, then Uz is a generalized eigenvector of (A, B) corresponding to the eigenvalue λ.

In this paper, we consider the parallel computation of eigenvectors of a matrix pencil in real Schur form. In exact arithmetic, this problem can be solved using substitution. However, substitution is very vulnerable to floating-point overflow. In LAPACK [3] there exists a family xTGEVC of subroutines which compute the generalized eigenvectors of a matrix pencil in Schur form. They prevent overflow by dynamically scaling the eigenvectors. These subroutines are scalar codes which compute the eigenvectors one by one. In this paper we discuss the construction of algorithms which are not only robust, but blocked and parallel.

Our paper is organized as follows. In Section 2 we consider the problem of computing the eigenvectors of a matrix pencil in real Schur form using real arithmetic. This problem is equivalent to solving a homogeneous matrix equation of the form

    SVD − TVB = 0   (4)

where D is diagonal and B is block diagonal with diagonal blocks which are either 1-by-1 or 2-by-2. In Section 3 we present a blocked algorithm for solving this matrix equation. In Section 4 we discuss how to prevent overflow in this algorithm. The concept of an augmented matrix is central to this discussion. A robust task-parallel solver Zazamoukh has been developed and integrated into the new StarNEig library for solving non-symmetric eigenvalue problems [1,9]. The performance of Zazamoukh is compared to LAPACK in Section 6. We outline several directions for future work in Section 7.
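To make the definition in equation (2) concrete, a generalized eigenpair can be checked numerically. The sketch below uses NumPy and assumes B is nonsingular, so that Az = λBz reduces to the standard problem (B⁻¹A)z = λz; the variable names and the random test pencil are ours.

```python
import numpy as np

# When B is nonsingular, A z = lambda B z is equivalent to the
# standard eigenvalue problem (B^{-1} A) z = lambda z.
rng = np.random.default_rng(0)
m = 5
A = rng.standard_normal((m, m))
B = rng.standard_normal((m, m)) + m * np.eye(m)  # keep B safely nonsingular

lam, Z = np.linalg.eig(np.linalg.solve(B, A))

# Verify the generalized eigenvalue relation A z = lambda B z.
for j in range(m):
    residual = A @ Z[:, j] - lam[j] * (B @ Z[:, j])
    assert np.linalg.norm(residual) < 1e-10
```

Production codes avoid forming B⁻¹A explicitly; the QZ algorithm works on (A, B) directly, which is what the reduction to generalized Schur form below accomplishes.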
Let A ∈ R^{m×m} and B ∈ R^{m×m} be given. The set of generalized eigenvalues of the matrix pencil (A, B) can be computed by first reducing (A, B) to generalized real Schur form. Specifically, there exist orthogonal matrices U and V such that

    S = U^T A V = [ S_11  S_12  ...  S_1p ]
                  [       S_22  ...  S_2p ]
                  [                  ... ]
                  [                 S_pp ]

    T = U^T B V = [ T_11  T_12  ...  T_1p ]
                  [       T_22  ...  T_2p ]
                  [                  ... ]
                  [                 T_pp ]

are upper block triangular and dim(S_jj) = dim(T_jj) ∈ {1, 2}. It is clear that

    λ(S, T) = ∪_{j=1}^{p} λ(S_jj, T_jj).

In order to simplify the discussion, we will make the following assumptions.
1. If dim(S_jj) = 1, then (S_jj, T_jj) has a single real eigenvalue.
2. If dim(S_jj) = 2, then (S_jj, T_jj) has two complex conjugate eigenvalues.
3. All eigenvalues are distinct.

We follow the standard convention and represent eigenvalues λ using an ordered pair (α, β) where α ∈ C and β ≥ 0. If β > 0, then λ = α/β is a finite eigenvalue. The case of α ∈ R − {0} and β = 0 corresponds to an infinite eigenvalue. The case of α = β = 0 corresponds to an indefinite problem.

We now consider the problem of computing generalized eigenvectors of (S, T). Our goal is to obtain an equivalent problem which can be solved using a blocked algorithm.
In this subsection, we note that the problem of computing a single generalized eigenvector of (S, T) is equivalent to solving a tall homogeneous matrix equation involving real matrices. Let λ ∈ λ(S_jj, T_jj) and let λ = α_j / β_j where β_j > 0 and α_j = a_j + i b_j ∈ C. Let m_j = dim(S_jj) and let D_jj ∈ R^{m_j × m_j} and B_jj ∈ R^{m_j × m_j} be given by

    D_jj = [ β_j ],   B_jj = [ a_j ]   (5)

when dim(S_jj) = 1 (or equivalently b_j = 0) and

    D_jj = [ β_j   0  ]        B_jj = [  a_j  b_j ]   (6)
           [  0   β_j ],              [ −b_j  a_j ]

when dim(S_jj) = 2 (or equivalently b_j ≠ 0). With this notation, the problem of computing an eigenvector is equivalent to solving the homogeneous linear equation

    S V D_jj − T V B_jj = 0   (7)

with respect to V ∈ R^{m × n_j} where n_j = dim(S_jj). This is an immediate consequence of the following lemma.

Lemma 1. Let λ ∈ λ(S_jj, T_jj) and let λ = (a_j + i b_j)/β_j where β_j > 0. Then the following statements are true.

1. If dim(S_jj) = 1, then x ∈ R^m is a real eigenvector corresponding to the real eigenvalue λ ∈ R if and only if V = [ x ] has rank 1 and solves equation (7).
2. If dim(S_jj) = 2, then z = x + iy ∈ C^m is a complex eigenvector corresponding to the complex eigenvalue λ ∈ C if and only if V = [ x  y ] has rank 2 and solves equation (7).
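Equations (5) and (6) can be sketched in a few lines. The helper name pair_blocks is ours; the check confirms that the 2-by-2 real pencil built for a complex conjugate pair carries exactly the eigenvalues (a_j ± i b_j)/β_j.

```python
import numpy as np

def pair_blocks(a_j: float, b_j: float, beta_j: float):
    """Build D_jj and B_jj of equation (6) for dim(S_jj) = 2."""
    D = np.array([[beta_j, 0.0], [0.0, beta_j]])
    B = np.array([[a_j, b_j], [-b_j, a_j]])
    return D, B

D, B = pair_blocks(a_j=1.0, b_j=2.0, beta_j=0.5)

# The eigenvalues of D^{-1} B are (a_j + i b_j)/beta_j and its conjugate,
# here (1 + 2i)/0.5 = 2 + 4i and 2 - 4i.
lam = np.linalg.eigvals(np.linalg.solve(D, B))
assert np.allclose(sorted(lam.imag), [-4.0, 4.0])
assert np.allclose(lam.real, [2.0, 2.0])
```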
In this subsection, we note that the problem of computing all generalized eigenvectors of (S, T) is equivalent to solving a homogeneous matrix equation involving real matrices. Specifically, let D ∈ R^{m×m} and B ∈ R^{m×m} be given by

    D = diag{ D_11, D_22, ..., D_pp },   B = diag{ B_11, B_22, ..., B_pp },   (8)

where D_jj and B_jj are given by equations (5) and (6). Then

    V = [ V_1  V_2  ...  V_p ]

solves the homogeneous matrix equation

    SVD − TVB = 0   (9)

if and only if V_j ∈ R^{m × n_j} solves equation (7).

It is straightforward to derive a blocked algorithm for solving the homogeneous matrix equation (9). Specifically, redefine

    S = [ S_ij ],   i, j ∈ { 1, 2, ..., M }

to denote any partitioning of S into an M-by-M block matrix which does not split any of the 2-by-2 blocks along the diagonal of S. Apply the same partitioning to T, D, B, and V. Then S, T, and V are block triangular. It is straightforward to see that equation (9) can be solved using Algorithm 1.

Algorithm 1:
Blocked computation of all generalized eigenvectors

    for j ← 1, 2, ..., p do
        Compute generalized eigenvectors Y_jj for (S_jj, T_jj);
        for i ← j − 1, ..., 1 do
            for k ← 1, 2, ..., i do
                Y_kj ← Y_kj − ( S_{k,i+1} Y_{i+1,j} D_jj − T_{k,i+1} Y_{i+1,j} B_jj );
            Solve

                S_ii Z D_jj − T_ii Z B_jj = Y_ij   (10)

            with respect to Z and set Y_ij ← Z;

We see that Algorithm 1 can be implemented using three distinct kernels:

1. Compute the eigenvectors corresponding to a small matrix pencil (S, T) in real Schur form.
2. Execute linear updates of the form

    Y ← Y − ( S X D − T X B )   (11)

where S and T are small dense matrices, D is diagonal and B is block diagonal with blocks of size 1 or 2.
3. Solution of small matrix equations of the form

    S Z D − T Z B = Y   (12)

where (S, T) is a matrix pencil in real Schur form, D is diagonal, and B is block diagonal with diagonal blocks of dimension 1 or 2.

Once these kernels have been implemented, it is straightforward to parallelize Algorithm 1 using a task-based runtime system such as StarPU [4].

The subroutines xTGEVC prevent overflow using principles originally derived by Edward Anderson and implemented in the subroutines xLATRS [2]. These subroutines apply to triangular linear systems
    T x = b   (13)

with a single right-hand side b. Mikkelsen and Karlsson [7] formalized the work of Anderson and derived a robust blocked algorithm for solving (13). Mikkelsen, Schwarz and Karlsson [8] derived a robust blocked algorithm for solving triangular linear systems

    T X = B

with multiple right-hand sides. Their StarPU implementation (Kiya) is significantly faster than DLATRS when numerical scaling is necessary and not significantly slower than DTRSM when numerical scaling is unnecessary.

A robust variant of Algorithm 1 can be derived using the principles applied by Mikkelsen, Schwarz and Karlsson [8]. We can use the two real functions ProtectUpdate and ProtectDivision introduced by Mikkelsen and Karlsson [7]. They are used to prevent scalar divisions and linear updates from exceeding the overflow threshold Ω > 0. They have the following key properties.

1. If t ≠ 0 and |b| ≤ Ω, and

    ξ = ProtectDivision(|b|, |t|)   (14)

then ξ ∈ (0, 1] and |ξb| ≤ |t|Ω. It follows that the scaled division

    y ← (ξb)/t   (15)

cannot exceed Ω.

2. If Z = Y − TX is defined, with

    ‖T‖_∞ ≤ Ω,   ‖X‖_∞ ≤ Ω,   ‖Y‖_∞ ≤ Ω,   (16)

and

    ξ = ProtectUpdate(‖Y‖_∞, ‖T‖_∞, ‖X‖_∞)   (17)

then ξ ∈ (0, 1] and ξ(‖Y‖_∞ + ‖T‖_∞‖X‖_∞) ≤ Ω. It follows that

    Z ← (ξY) − T(ξX) = (ξY) − (ξT)X   (18)

can be computed without any intermediate or final result exceeding Ω.

Now consider the kernels needed for implementing Algorithm 1. It is easy to understand that they can be implemented using loops, divisions and linear updates. It is therefore plausible that robust variants can be implemented using ProtectDivision and ProtectUpdate. However, resolving all the relevant details requires many pages. Here we explain how the updates given by equation (11) can be done without exceeding the overflow threshold Ω. We will use the concept of an augmented matrix introduced by Mikkelsen, Schwarz and Karlsson [8].

Definition 1.
Let X ∈ R^{m×n} be partitioned into k block columns

    X = [ X_1  X_2  ...  X_k ],   X_j ∈ R^{m × n_j},   n_1 + n_2 + ... + n_k = n,   (19)

and let α ∈ R^k have α_j ∈ (0, 1]. The augmented matrix ⟨α, X⟩ represents the real matrix Y given by

    Y = [ Y_1  Y_2  ...  Y_k ],   Y_j = α_j^{−1} X_j.   (20)

This is a trivial extension of the original definition which only considered the case of k = n. The purpose of the scaling factors α_j is to extend the normal representational range of our floating-point numbers.

Algorithm 2:
Right updates with block diagonal matrix

Data: An augmented matrix ⟨α, X⟩ where

    X = [ X_1  X_2  ...  X_k ],   ‖X_j‖_∞ ≤ Ω,   α ∈ R^k,

and matrices B_j such that ‖B_j‖_∞ ≤ Ω and Y_j = X_j B_j is defined.

Result: An augmented matrix ⟨β, Y⟩ such that

    β_j^{−1} Y_j = (α_j^{−1} X_j) B_j,   ‖Y_j‖_∞ ≤ Ω,   (21)

and Y can be computed without exceeding Ω.

    for j = 1, ..., k do
        γ_j = ProtectUpdate(0, ‖X_j‖_∞, ‖B_j‖_∞);
        Y_j = (γ_j X_j) B_j;
        β_j = α_j γ_j;

Now Z = XD can be computed without overflow using Algorithm 2. In our case D is diagonal, so we are merely scaling the columns of X. Similarly, Y − SZ can now be computed without overflow using Algorithm 3. It follows that Y ← (Y − S(XD)) + T(XB) can be computed without overflow using two applications of Algorithm 2 and Algorithm 3. Now suppose that Y is m-by-n. Then Algorithm 2 does O(mn) flops on O(mn) data. However, Algorithm 3 then does O(m²n) flops on the same data. Therefore, the overall arithmetic intensity is O(n).

Algorithm 3:
Left update with dense matrix

Data: A matrix T and augmented matrices ⟨α, X⟩ and ⟨β, Y⟩ where

    X = [ X_1  X_2  ...  X_k ],   Y = [ Y_1  Y_2  ...  Y_k ],   (22)

such that Z_j = Y_j − T X_j is defined and

    ‖X_j‖_∞ ≤ Ω,   ‖Y_j‖_∞ ≤ Ω.   (23)

Result:
An augmented matrix ⟨ζ, Z⟩ such that

    ζ_j^{−1} Z_j = β_j^{−1} Y_j − T (α_j^{−1} X_j),   ‖Z_j‖_∞ ≤ Ω,   (24)

and Z can be computed without exceeding Ω.

    for j = 1, ..., k do
        γ_j = min{ α_j, β_j };
        δ_j = ProtectUpdate((γ_j/β_j)‖Y_j‖_∞, ‖T‖_∞, (γ_j/α_j)‖X_j‖_∞);
        X_j ← δ_j (γ_j/α_j) X_j;
        Y_j ← δ_j (γ_j/β_j) Y_j;
        ζ_j = δ_j γ_j;
    Z ← Y − T X;
    return ⟨ζ, Z⟩

In order to execute all linear updates needed for a robust variant of Algorithm 1 we require certain norms. Specifically, we need the infinity norms of all super-diagonal blocks of S and T. Moreover, we require the infinity norm of certain submatrices of Y. These submatrices consist of either a single column (segment of a real eigenvector) or two adjacent columns (segment of a complex eigenvector). The infinity norm must be computed whenever a submatrix has been initialized or updated. ProtectUpdate requires that the input arguments are bounded by Ω and failure is possible if they are not. It is therefore necessary to scale the matrices S and T to ensure that all blocks have infinity norm bounded by Ω.

The new StarNEig library runs on top of StarPU and can be used to solve dense non-symmetric eigenvalue problems. A robust variant of Algorithm 1 has been implemented in StarNEig. This implementation (Zazamoukh) uses augmented matrices and scaling factors which are signed 32-bit integers.
Zazamoukh can compute eigenvectors corresponding to a subset E ⊆ λ(S, T) which is closed under complex conjugation.
Zazamoukh is currently limited to shared memory, but an extension to distributed memory is planned.
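To make the overflow protection of Section 4 concrete, here is a simplified Python model of ProtectUpdate and the scaled update (18). This is our sketch, not the StarNEig code: the actual routines round the scaling factors to integer powers of 2 so that scaling is exact in floating point, and the toy threshold omega stands in for the true overflow threshold Ω.

```python
import numpy as np

# Simplified model of ProtectUpdate from [7]: given infinity norms of
# Y, T, X (all <= omega), return xi in (0, 1] such that
# xi * (||Y|| + ||T|| * ||X||) <= omega.
def protect_update(ynorm: float, tnorm: float, xnorm: float,
                   omega: float) -> float:
    bound = ynorm + tnorm * xnorm   # worst-case growth of Y - T X
    return 1.0 if bound <= omega else omega / bound

# Robust linear update, equation (18): Z <- (xi Y) - T (xi X).
def robust_update(Y, T, X, omega):
    xi = protect_update(np.linalg.norm(Y, np.inf),
                        np.linalg.norm(T, np.inf),
                        np.linalg.norm(X, np.inf), omega)
    return xi, (xi * Y) - T @ (xi * X)

omega = 10.0                        # artificially small "overflow" threshold
Y = np.full((2, 2), 4.0)
T = np.full((2, 2), 3.0)
X = np.full((2, 2), 2.0)
xi, Z = robust_update(Y, T, X, omega)
assert 0.0 < xi <= 1.0              # here xi = 10/32 = 0.3125
assert np.linalg.norm(Z, np.inf) <= omega
```

With these inputs the unscaled result Y − TX has infinity norm 16; the computed ξ = 0.3125 shrinks every intermediate and final quantity below the threshold, which is exactly the guarantee stated for equation (18).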
Given block sizes mb and nb, Zazamoukh partitions S, T and the matrix of eigenvectors Y conformally by rows and columns. In the absence of any 2-by-2 blocks on the diagonal of S, the tiles of S and T are mb-by-mb and the tiles of Y are mb-by-nb. The only exceptions can be found along the right and lower boundaries of the matrices. This default configuration is adjusted minimally to prevent splitting any 2-by-2 block of S or separating the real part and the imaginary part of a complex eigenvector into separate block columns.

Zazamoukh relies on four types of tasks:

1. Pre-processing tasks which compute all quantities needed for robustness. This includes the infinity norm of all super-diagonal tiles of S and T as well as all norms needed for the robust solution of equations of the type (12). If necessary, the matrices S and T are scaled minimally.
2. Solve tasks which use DTGEVC to compute the lower tips of eigenvectors and a robust solver based on DLALN2 to solve equations of the type (12).
3. Update tasks which execute updates of the type (11) robustly.
4. Post-processing tasks which enforce a consistent scaling on all eigenvectors.
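The minimal adjustment of the tile boundaries can be sketched as follows (our illustration, not the library's code). A 2-by-2 diagonal block of a quasi-triangular S reveals itself through a nonzero subdiagonal entry, so a boundary that would land inside such a block is simply moved down by one row.

```python
import numpy as np

def tile_boundaries(S, mb):
    """Row boundaries 0 = r_0 < r_1 < ... < r_q = m in steps of roughly mb,
    shifted down by one row whenever a cut would split a 2-by-2 block."""
    m = S.shape[0]
    cuts = [0]
    nxt = mb
    while nxt < m:
        if S[nxt, nxt - 1] != 0.0:  # boundary would split a 2-by-2 block
            nxt += 1                # adjust minimally
        if nxt < m:
            cuts.append(nxt)
        nxt += mb
    cuts.append(m)
    return cuts

# A 6-by-6 quasi-triangular S with a 2-by-2 block in rows 1-2:
# the default boundary at row 2 is moved to row 3.
S = np.triu(np.ones((6, 6)))
S[2, 1] = 0.5
assert tile_boundaries(S, 2) == [0, 3, 5, 6]
```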
Zazamoukh is closely related to Kiya which solves triangular linear systems with multiple right-hand sides. Apart from the pre-processing and post-processing tasks, the main task graph is the disjoint union of p task-graphs, one for each block column of the matrix of eigenvectors. Zazamoukh uses the same task insertion order and priorities as
Kiya to process each of the p sub-graphs.

In this section we give the results of a set of experiments involving tiny (m ≤ 10 000) and small (m ≤ 40 000) matrices. Each experiment consisted of computing all eigenvectors of the matrix pencil. The run time was measured for LAPACK (DTGEVC) and Zazamoukh. Results related to somewhat larger matrices (m ≤ 80 000) can be found in the NLAFET Deliverable 2.7 [10].

The experiments were executed on an Intel Xeon E5-2690v4 (Broadwell) node with 28 cores arranged in two NUMA islands with 14 cores each. The theoretical peak performance in double-precision arithmetic is 41.6 GFLOPS/s for one core and 1164.8 GFLOPS/s for a full node.

We used the StarNEig test-program starneig-test to generate reproducible experiments. The default parameters produce matrix pencils where approximately 1 percent of all eigenvalues are zeros, 1 percent of all eigenvalues are infinities and there are no indefinite eigenvalues.
Zazamoukh used the default tile size mb, which is 1.6 percent of the matrix dimension for matrix pencils of sufficiently large dimension. Zazamoukh used 28 StarPU workers and 1 master thread. The summary of our results is given in Table 1. The speedup of
Zazamoukh over LAPACK is initially very modest as there are not enough tasks to keep 28 workers busy, but it picks up rapidly and
Zazamoukh achieves a superlinear speedup over
DTGEVC when m ≥
10 000. This is an expression of the fact that
Zazamoukh uses a blocked algorithm, whereas
DTGEVC computes the eigenvectors one by one.

    dimension   eigenvalue analysis      runtime (ms)          speedup
    m           zeros   inf.   indef.    LAPACK     StarNEig
    1000        11      13     0         295        175        1.6857
    2000        25      16     0         1598       409        3.9071
    3000        24      30     0         6182       929        6.6545
    4000        42      49     0         15476      1796       8.6169
    5000        54      37     0         30730      2113       14.5433
    6000        61      64     0         53700      2637       20.3641
    7000        67      64     0         84330      3541       23.8153
    8000        56      69     0         122527     4769       25.6924
    9000        91      91     0         171800     6189       27.7589
    10000       108     94     0         242466     7821       31.0019
    20000       175     197    0         2034664    49823      40.8378
    30000       306     306    0         7183746    162747     44.1406
    40000       366     382    0         17713267   380856     46.5091
Table 1.
Comparison of sequential DTGEVC to task-parallel Zazamoukh using 28 cores. The runtimes are given in milliseconds (ms). The last column gives the speedup of Zazamoukh over LAPACK. Values above 28 correspond to super-linear speedup. All eigenvectors were computed with a relative residual less than 2u, where u denotes the double precision unit roundoff.

Previous work by Mikkelsen, Schwarz and Karlsson has shown that triangular linear systems can be solved in parallel without overflow using augmented matrices. In this paper we have shown that the eigenvectors of a matrix pencil can be computed in parallel without overflow using augmented matrices. Certainly, robust algorithms are slower than non-robust algorithms when numerical scaling is not needed, but robust algorithms will always return a result which can be evaluated in the context of the user's application. To the best of our knowledge StarNEig is the only library which contains a parallel robust solver for computing the generalized eigenvectors of a dense nonsymmetric matrix pencil. The StarNEig solver (
Zazamoukh) runs on top of StarPU and uses augmented matrices and scaling factors which are integer powers of 2 to prevent overflow. It achieves superlinear speedup compared with DTGEVC from LAPACK. In the immediate future we expect to pursue the following work:

1. Extend
Zazamoukh to also compute left eigenvectors. Here the layout of the loops is different and we must use the 1-norm instead of the infinity norm when executing the overflow protection logic.
2. Extend
Zazamoukh to distributed memory machines.
3. Extend
Zazamoukh's solver to use recursive blocking to reduce the run-time further. The solve tasks all lie on the critical path of the task graph.
4. Extend
Zazamoukh to complex data-types. This case is simpler than real arithmetic because there are no 2-by-2 blocks on the main diagonal of S.
5. Revisit the complex division routine xLADIV [6] which is the foundation for the DLALN2 routine used by
Zazamoukh's solve tasks. In particular, the failure modes of xLADIV have not been characterized [5].
Acknowledgements
This work is part of a project (NLAFET) that has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 671633. This work was supported by the Swedish strategic research programme eSSENCE. We thank the High Performance Computing Center North (HPC2N) at Umeå University for providing computational resources and valuable support during test and performance runs.
References
1. StarNEig — A task-based library for solving nonsymmetric eigenvalue problems, https://github.com/NLAFET/StarNEig
2. Anderson, E.: LAPACK Working Note No. 36: Robust Triangular Solves for Use in Condition Estimation. Tech. Rep. CS-UT-CS-91-142, University of Tennessee, Knoxville, TN, USA (August 1991)
3. Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK Users' Guide. SIAM, Philadelphia, PA, 3rd edn. (1999)
4. Augonnet, C., Thibault, S., Namyst, R., Wacrenier, P.A.: StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures. CCPE - Special Issue: Euro-Par 2009, 187–198 (Feb 2011). https://doi.org/10.1002/cpe.1631, http://hal.inria.fr/inria-00550877
5. Baudin, M.: Personal e-mail to C.C.K. Mikkelsen
6. Baudin, M., Smith, R.L.: A robust complex division in Scilab. CoRR abs/1210.4539 (2012), http://arxiv.org/abs/1210.4539
7. Kjelgaard Mikkelsen, C.C., Karlsson, L.: Blocked Algorithms for Robust Solution of Triangular Linear Systems. In: Wyrzykowski, R., Dongarra, J., Deelman, E., Karczewski, K. (eds.) Parallel Processing and Applied Mathematics. pp. 68–78. Springer International Publishing, Cham (2018)
8. Kjelgaard Mikkelsen, C.C., Schwarz, A.B., Karlsson, L.: Parallel Robust Solution of Triangular Linear Systems. CCPE (0), e5064 (2018). https://doi.org/10.1002/cpe.5064
9. Myllykoski, M., Kjelgaard Mikkelsen, C.C.: StarNEig - A Task-based Library for Solving Nonsymmetric Eigenvalue Problems. Submitted to PPAM-2019
10. Myllykoski, M., Kjelgaard Mikkelsen, C.C., Schwarz, A., Kågström, B.: D2.7 Eigenvalue solvers for nonsymmetric problems. Tech. rep., Umeå University (2019)