A parallel structured divide-and-conquer algorithm for symmetric tridiagonal eigenvalue problems
Xia Liao, Shengguo Li, Yutong Lu and Jose E. Roman

• X. Liao and S. Li are with the College of Computer Science, National University of Defense Technology, Changsha, China. E-mail: [email protected]
• Y. Lu is with the National Supercomputer Center in Guangzhou, and the School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China, 510006.
• J. E. Roman is with the D. Sistemes Informàtics i Computació, Universitat Politècnica de València, Camí de Vera s/n, 46022 València, Spain.

Manuscript received Jan 19, 2020.
Abstract — In this paper, a parallel structured divide-and-conquer (PSDC) eigensolver is proposed for symmetric tridiagonal matrices based on ScaLAPACK and a parallel structured matrix multiplication algorithm, called PSMMA. Computing the eigenvectors via matrix-matrix multiplications is the most computationally expensive part of the divide-and-conquer algorithm, and one of the matrices involved in such multiplications is a rank-structured Cauchy-like matrix. By exploiting this particular property, PSMMA constructs the local matrices by using the generators of Cauchy-like matrices without any communication, and further reduces the computation costs by using a structured low-rank approximation algorithm. Thus, both the communication and computation costs are reduced. Experimental results show that both PSMMA and PSDC are highly scalable and scale to 4096 processes at least. PSDC has better scalability than PHDC, which was proposed in [J. Comput. Appl. Math. 344 (2018) 512–520] and only scaled to 300 processes for the same matrices. Compared with
PDSTEDC in ScaLAPACK, PSDC is always faster and achieves significant speedups for some matrices with few deflations. PSDC is also comparable with ELPA, with PSDC being faster than ELPA when using few processes and a little slower when using many processes.

Index Terms — PSMMA, PUMMA algorithm, ScaLAPACK, divide-and-conquer, rank-structured matrix, Cauchy-like matrix
1 INTRODUCTION
Computing the eigendecomposition of a symmetric tridiagonal matrix is an important linear algebra problem and is widely used in many fields of science and engineering. It is usually solved by divide-and-conquer (DC) [1], [2], QR [3], MRRR [4], and some other methods. The DC algorithm is now the default method in LAPACK [5] and ScaLAPACK [6] when the eigenvectors are required. DC is usually very efficient in practice, requiring only about O(n^{2.3}) flops on average for matrices of dimension n [7], but its complexity can still be O(n^3) for some matrices with few deflations. The performance of an algorithm depends not only on the number of floating point operations, but also on other ingredients such as communication and data movement. Some communication-avoiding algorithms have been developed recently [8], [9]. In this work, we aim to accelerate the parallel DC algorithm on distributed memory machines by reducing both the computation complexity and the communication complexity. Our algorithm can be much faster than the ScaLAPACK routine
PDSTEDC for those difficult matrices, and can scale to thousands of processes. Experimental results are included in section 4.

It is known that some Cauchy-like matrices with off-diagonally low-rank properties appear in the DC algorithm [1], [2], which can be approximated by hierarchically semiseparable (HSS) matrices [10], [11]. Then, the worst-case complexity of DC can be reduced from O(n^3) to O(n^2 r), where r is a modest number and is usually much smaller than n, see [12]. This technique was extended to the bidiagonal and banded DC algorithms for the SVD problem on a shared memory multicore platform in [13], [14]. For distributed memory machines, a parallel DC algorithm was similarly proposed in [15] by using STRUMPACK (STRUctured Matrices PACKage) [16], which provides some distributed parallel HSS algorithms. The accelerated DC proposed in [15] was called
PHDC. However, numerical results show that for the simple matrix-matrix multiplication operations, STRUMPACK is not as scalable as
PDGEMM, and may become slower than
PDGEMM when using 300 or more processes on the Tianhe-2 supercomputer. See [15] for details.

Instead of using HSS, this work exploits a much simpler type of rank-structured matrix, called BLR (block low-rank format [17]). Compared with HSS, BLR abandons the hierarchy but compresses the off-diagonal blocks. It loses the near-linear complexity of hierarchical matrices such as H-matrices [18], H²-matrices [19], and HSS matrices. Because of its simple structure, BLR is easy to implement in parallel, and often improves the scalability of the corresponding algorithms, see [17], [20] for more details.

In this paper, we propose a parallel structured matrix-matrix multiplication algorithm for Cauchy-like matrices, which is named PSMMA. It exploits the off-diagonal low-rank property of matrices as BLR does, and it can further reduce the communication cost by constructing the local submatrices from the generators, which will be explained in section 3. Our main contributions include the following:
• We propose a parallel structured matrix multiplication algorithm (PSMMA) for structured matrices including Cauchy-like, Toeplitz, Hankel, Vandermonde, etc. PSMMA can reduce both the communication and computation costs by using low-rank approximations. To the best of the authors' knowledge, no existing matrix multiplication algorithm has been designed to reduce the communication cost by exploiting the structure of the matrices.
• PSMMA works for matrices both in the block cyclic data distribution (BCDD) form (like ScaLAPACK) and in the block data distribution (BDD) form (2D block partitioning). It also works for general process grids.
• Combining PSMMA with the DC algorithm in ScaLAPACK, we propose a parallel structured DC algorithm (PSDC), which can be much faster than
PDSTEDC in ScaLAPACK. PSDC is also competitive with ELPA [21].

The process of PSMMA is similar to the Cannon [22] and Fox [23] algorithms. However, PSMMA works for matrices in BCDD form and for any rectangular process grid. From this perspective, PSMMA is more like PUMMA [24], a generalized Fox algorithm. PSMMA is more efficient than PUMMA for structured matrices; details are shown in section 3.1.2. It has three advantages compared with PUMMA. One advantage is that PSMMA constructs the required submatrix locally by using the generators, without communication, and thus requires less communication. Another one is that PSMMA combines with low-rank approximations and therefore the computation complexity is also reduced. The third one is that PSMMA requires less workspace and the size of the local matrix multiplications is also larger than in PUMMA. In this paper, SRRSC [13], [25] is used to compute the low-rank approximations of Cauchy-like matrices in PSMMA, which only requires linear storage. Note that
PDGEMM implements an algorithm similar to SUMMA [26]. Compared with SUMMA [26], which is based on the outer product form of matrix multiplication, PSMMA can naturally exploit the off-diagonal low-rank property of matrices.

By incorporating PSMMA into the DC algorithm in ScaLAPACK [7], we obtain a highly scalable DC algorithm, which has much better scalability than the previous PHDC algorithm [15]. To distinguish it from PHDC, we call the newly proposed algorithm the parallel structured DC algorithm (PSDC). Numerical results show that PSDC is always faster than
PDSTEDC in ScaLAPACK, and scales to 4096 processes at least. That is because PSDC requires both fewer computations and less communication than
PDSTEDC. The speedups of PSDC over
PDSTEDC can be substantial for some large matrices with few deflations on the Tianhe-2 supercomputer. Note that PHDC in [15] could only scale to 300 processes for the same matrices.
The remaining sections of this paper are organized as follows. Section 2 introduces the DC algorithm, the SRRSC algorithm for constructing low-rank approximations of Cauchy-like matrices, and some classical parallel matrix multiplication algorithms. Section 3 presents the newly proposed parallel structured matrix multiplication algorithm PSMMA, and describes the implementation details of PSDC, which combines PSMMA with the parallel tridiagonal DC algorithm in ScaLAPACK. All the experimental results are reported in section 4, and some future works are discussed in section 4.2. Conclusions are drawn in section 5.
2 PRELIMINARIES
Assume T is a symmetric tridiagonal matrix,

T = \begin{pmatrix} a_1 & b_1 & & \\ b_1 & a_2 & \ddots & \\ & \ddots & \ddots & b_{n-1} \\ & & b_{n-1} & a_n \end{pmatrix}.    (1)

We briefly introduce some formulae of Cuppen's divide-and-conquer algorithm [1], [7]. The ScaLAPACK routine also implements this version of the DC algorithm [7], based on a rank-one update.

Firstly, T is decomposed into the sum of two matrices,

T = \begin{pmatrix} T_1 & \\ & T_2 \end{pmatrix} + b_k v v^T,    (2)

where T_1 \in R^{k \times k}, b_k is the off-diagonal element at the k-th row of T, and v = [0, \ldots, 0, 1, 1, 0, \ldots, 0]^T with ones at the k-th and (k+1)-th entries. If T_1 = Q_1 \Lambda_1 Q_1^T and T_2 = Q_2 \Lambda_2 Q_2^T, then T can be written as

T = \begin{pmatrix} Q_1 & \\ & Q_2 \end{pmatrix} \left( \begin{pmatrix} \Lambda_1 & \\ & \Lambda_2 \end{pmatrix} + b_k u u^T \right) \begin{pmatrix} Q_1^T & \\ & Q_2^T \end{pmatrix},    (3)

where

u = \begin{pmatrix} Q_1^T & \\ & Q_2^T \end{pmatrix} v = \begin{pmatrix} \text{last col. of } Q_1^T \\ \text{first col. of } Q_2^T \end{pmatrix}.
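As a quick illustration of the splitting (2), the following NumPy sketch (our illustration, not part of the paper; sizes and data are arbitrary) tears a random symmetric tridiagonal matrix into diag(T_1, T_2) plus a rank-one correction and verifies the identity numerically:

import numpy as np

# Check the splitting (2): T = diag(T1, T2) + b_k * v * v^T, where T1 and T2
# are the leading/trailing blocks of T with the two diagonal entries touching
# the split reduced by b_k, and v has ones at positions k and k+1.
n, k = 6, 3
rng = np.random.default_rng(5)
a = rng.standard_normal(n)            # diagonal entries a_1, ..., a_n
b = rng.standard_normal(n - 1)        # off-diagonal entries b_1, ..., b_{n-1}
T = np.diag(a) + np.diag(b, 1) + np.diag(b, -1)

bk = b[k - 1]                         # the coupling element b_k
T1 = T[:k, :k].copy(); T1[k - 1, k - 1] -= bk
T2 = T[k:, k:].copy(); T2[0, 0] -= bk
v = np.zeros(n); v[k - 1] = v[k] = 1.0

blockdiag = np.block([[T1, np.zeros((k, n - k))],
                      [np.zeros((n - k, k)), T2]])
print(np.allclose(T, blockdiag + bk * np.outer(v, v)))   # True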
Since Q_1 and Q_2 are orthogonal matrices, the problem is reduced to computing the spectral decomposition of a diagonal plus rank-one matrix,

M \equiv D + b_k u u^T = \widehat{Q} \Lambda \widehat{Q}^T,    (4)

where D = \mathrm{diag}(\Lambda_1, \Lambda_2) is a diagonal matrix, \Lambda is a diagonal matrix whose diagonal elements are the eigenvalues of M, and \widehat{Q} is the eigenvector matrix of M. Then, the eigenvector matrix of T is computed as

\begin{pmatrix} Q_1 & \\ & Q_2 \end{pmatrix} \widehat{Q}.    (5)

The eigenvalues \lambda_i of D + b_k u u^T are the roots of the secular equation

f(\lambda) = 1 + b_k \sum_{k=1}^{n} \frac{u_k^2}{d_k - \lambda} = 0,    (6)

where d_k is the k-th diagonal entry of D and u_k is the k-th component of u. Then, the eigenvector is computed as

\hat{q}_i = (D - \lambda_i I)^{-1} u.    (7)

The main observation of the works [15], [12] is that

\widehat{Q} = \left( \frac{u_i v_j}{d_i - \lambda_j} \right)_{i,j}, \qquad v_j = 1 \Big/ \sqrt{\sum_{k=1}^{n} \frac{u_k^2}{(d_k - \lambda_j)^2}},    (8)

is a Cauchy-like matrix, and the vectors u, v \in R^n and d = (d_1, \cdots, d_n)^T, \lambda = (\lambda_1, \cdots, \lambda_n)^T are called generators.

ALGORITHM 1: DC(T, Q, Λ) algorithm for computing the eigendecomposition of a symmetric tridiagonal matrix.
Input: T \in R^{n \times n}
Output: eigenvalues \Lambda, eigenvectors Q
if the size of T is small enough then
    apply the QR algorithm and compute T = Q \Lambda Q^T; return Q and \Lambda;
else
    form T = \mathrm{diag}(T_1, T_2) + b_k v v^T;
    call DC(T_1, Q_1, \Lambda_1); call DC(T_2, Q_2, \Lambda_2);
    form M = D + b_k u u^T from Q_i, \Lambda_i and v (i = 1, 2), where D = \mathrm{diag}(\Lambda_1, \Lambda_2);
    find the eigenvalues \Lambda and eigenvectors \widehat{Q} of M;
    compute the eigenvectors of T as Q = \mathrm{diag}(Q_1, Q_2)\, \widehat{Q};
    return Q and \Lambda;
end

The whole classical DC algorithm is shown in Algorithm 1. The main computational task of DC lies in computing the eigenvectors via matrix-matrix multiplications (MMM) (5), which costs O(n^3) flops. Since \widehat{Q} is a Cauchy-like matrix and off-diagonally low-rank, MMM (5) can be accelerated by using HSS matrix algorithms, and the computational complexity can be reduced significantly, see [15], [12] for more details. The aim of this work is not only to reduce the computation cost of MMM (5), but also its communication cost in the distributed memory environment. For simplicity, we do not consider deflation in (5). About the deflation process, we refer the interested readers to [1], [27] and section 3.2.
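To make the role of the generators concrete, the following small NumPy sketch (ours, not the paper's code) builds the eigenvector matrix of a diagonal-plus-rank-one matrix directly from the generators d, λ, u as in (7)-(8) and checks that it diagonalizes M and has orthonormal columns:

import numpy as np

# Eigenvectors of M = D + rho*u*u^T from the generators, Eqs. (7)-(8).
rng = np.random.default_rng(0)
n = 8
d = np.sort(rng.uniform(0.0, 1.0, n))      # diagonal of D (assumed distinct)
u = rng.standard_normal(n)                 # rank-one update vector
rho = 0.5                                  # plays the role of b_k > 0

M = np.diag(d) + rho * np.outer(u, u)
lam = np.linalg.eigvalsh(M)                # eigenvalues of M (the lambda_j)

C = u[:, None] / (d[:, None] - lam[None, :])        # columns (D - lambda_j I)^{-1} u
v = 1.0 / np.sqrt(np.sum(C**2, axis=0))             # normalization as in Eq. (8)
Qhat = C * v[None, :]

print(np.allclose(Qhat.T @ M @ Qhat, np.diag(lam)))  # True: Qhat diagonalizes M
print(np.allclose(Qhat.T @ Qhat, np.eye(n)))         # True: orthonormal columns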
2.1 Low-rank approximation of Cauchy-like matrices

A novel low-rank approximation method for Cauchy-like matrices is proposed in [13], [25], which only requires linear storage. For completeness, this method is introduced briefly in this section; it works only on the generators.

Assume that A is an n \times n Cauchy-like matrix, A = \left( \frac{u_i v_j}{d_i - w_j} \right)_{i,j}. The following factorization is called the k-th Schur complement factorization of A,

A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} A_{11} & \\ A_{21} & A^{(k)} \end{pmatrix} \begin{pmatrix} I & Z^{(k)} \\ & I \end{pmatrix},    (9)

where A_{11} \in R^{k \times k} and A^{(k)} is called the k-th Schur complement.

One good property of a Cauchy-like matrix is that its k-th Schur complement A^{(k)} is also Cauchy-like [28], [29], and its generators can be computed recursively as follows.

Theorem 1: The k-th Schur complement A^{(k)} satisfies the displacement equation

D_{k+1} A^{(k)} - A^{(k)} W_{k+1} = u^{(k)}(k+1:n) \cdot v^{(k)}(k+1:n)^T,

with D_{k+1} = \mathrm{diag}(d_{k+1}, \cdots, d_n) and W_{k+1} = \mathrm{diag}(w_{k+1}, \ldots, w_n). Then

u^{(k)}(k+\ell) = u^{(k-1)}(k+\ell) \cdot \frac{d_{k+\ell} - d_k}{d_{k+\ell} - w_k},    (10)

v^{(k)}(k+\ell) = v^{(k-1)}(k+\ell) \cdot \frac{w_{k+\ell} - w_k}{w_{k+\ell} - d_k},    (11)

with 1 \le \ell \le n - k, k \ge 1, A^{(0)} = A, u^{(0)} = u and v^{(0)} = v.
Corollary 1: For 1 \le \ell \le n - k, k \ge 1, the generators of A^{(k)} are

u^{(k)}(k+\ell) = \prod_{j=1}^{k} \frac{d_{k+\ell} - d_j}{d_{k+\ell} - w_j} \cdot u^{(0)}(k+\ell),    (12)

v^{(k)}(k+\ell) = \prod_{j=1}^{k} \frac{w_{k+\ell} - w_j}{w_{k+\ell} - d_j} \cdot v^{(0)}(k+\ell).    (13)

It is easy to prove that Z^{(k)} is also Cauchy-like, and its generators can also be computed recursively.
Theorem 2: The generators of Z^{(k)} satisfy the displacement equation

W_1^{(k)} Z^{(k)} - Z^{(k)} W_2^{(k)} = y^{(k)}(1:k) \cdot v^{(k)}(k+1:n)^T,

where W_1^{(k)} = \mathrm{diag}(w_1, \ldots, w_k) and W_2^{(k)} = \mathrm{diag}(w_{k+1}, \ldots, w_n), and

y^{(k)}(i) = \prod_{1 \le j \le k,\, j \ne i} \frac{d_j - w_i}{w_j - w_i} \cdot (d_i - w_i)\, v^{(0)}(i),    (14)

and v^{(k)}(k+1:n) is computed recursively by equation (13). Furthermore,

y^{(k)}(i) = \begin{cases} y^{(k-1)}(i) \cdot \dfrac{d_k - w_i}{w_k - w_i}, & \text{if } 1 \le i \le k-1, \\[1ex] \prod_{j=1}^{k-1} \dfrac{w_k - d_j}{w_k - w_j} \cdot (d_k - w_k)\, v^{(0)}(k), & \text{if } i = k. \end{cases}

Since permutation does not destroy the structure of Cauchy-like matrices, we can permute the generators u^{(k)}, v^{(k)}, d and w to make the first entry of A^{(k)},
A^{(k)}(1,1) = \frac{u^{(k)}(k+1)\, v^{(k)}(k+1)}{d_{k+1} - w_{k+1}},

large. A complete pivoting strategy could be used. Some efficient pivoting strategies have been proposed in [29] and [30]. We used the pivoting strategy proposed in [13].

If the entries of A^{(k)} are negligible, then by ignoring A^{(k)} we get a low-rank approximation to A,

A P \approx A(1:n, T) \begin{pmatrix} I & Z^{(k)} \end{pmatrix},    (15)

where T = \{i_1, i_2, \ldots, i_k\} \subset \{1, 2, \ldots, n\} and P is a permutation matrix which records the column permutations during the pivoting. To be more specific, A(1:n, T) consists of a subset of the columns of A, and from Theorem 2,

Z^{(k)}_{ij} = \frac{y^{(k)}(i)\, v^{(k)}(k+j)}{w(i) - w(k+j)},    (16)

where w is the array after permutation.
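The following NumPy sketch (ours; a fixed k, no pivoting, and made-up generators are assumed) checks Corollary 1 numerically: the k-th Schur complement computed explicitly agrees with the Cauchy-like matrix built from the scaled generators (12)-(13):

import numpy as np

# The k-th Schur complement of a Cauchy-like matrix is again Cauchy-like,
# with generators obtained by scaling u and v as in Eqs. (12)-(13).
rng = np.random.default_rng(1)
n, k = 10, 3
d = np.sort(rng.uniform(0.0, 1.0, n))
w = d + 0.5 / n                               # ensures d_i != w_j
u = rng.standard_normal(n)
v = rng.standard_normal(n)

A = np.outer(u, v) / (d[:, None] - w[None, :])

# Explicit k-th Schur complement: A22 - A21 * inv(A11) * A12
A11, A12 = A[:k, :k], A[:k, k:]
A21, A22 = A[k:, :k], A[k:, k:]
S_explicit = A22 - A21 @ np.linalg.solve(A11, A12)

# Generators of the Schur complement from Eqs. (12)-(13)
uk = u[k:] * np.prod([(d[k:] - d[j]) / (d[k:] - w[j]) for j in range(k)], axis=0)
vk = v[k:] * np.prod([(w[k:] - w[j]) / (w[k:] - d[j]) for j in range(k)], axis=0)
S_from_generators = np.outer(uk, vk) / (d[k:, None] - w[None, k:])

print(np.allclose(S_explicit, S_from_generators))   # True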
2.2 Parallel matrix multiplication algorithms

Matrix-matrix multiplication is a very important computational kernel of many scientific applications. In this subsection, we briefly introduce some parallel matrix multiplication algorithms, which compute C = A \times B.

The Cannon algorithm [22] was the first efficient algorithm for parallel matrix multiplication providing theoretically optimal communication cost. However, it requires the process grid to be square, which limits its practical usage. The Fox algorithm proposed in [23] has the same problem. The PUMMA algorithm [24] is a generalized Fox algorithm, and it works for a general P \times Q processor grid. PUMMA was designed for ScaLAPACK and uses the BCDD form.

The current version of ScaLAPACK implements SUMMA [26], which was proposed in the mid-1990s and designed for a general processor grid. It also uses the block-cyclic data distribution form. It implements the outer product form of matrix multiplication and allows the products to be pipelined. PUMMA implements the inner product form, and requires the largest possible matrices for computations and communications. Some more efficient matrix multiplication algorithms have been proposed recently, such as the 2.5D algorithm [31], CARMA [32], CTF [33], and COSMA [34], among many others. Since they are not closely related to the present work, we do not attempt to give a complete literature review.

PUMMA is more appropriate for rank-structured matrices than SUMMA, since the large off-diagonal blocks can be compressed by low-rank approximations. So are the Cannon and Fox algorithms. It is not obvious how SUMMA could exploit the rank-structured property of the input matrices.

In this paper, we propose a parallel structured matrix multiplication algorithm, which is called PSMMA for short. We assume that at least one of the two matrices is a structured matrix. By 'structured matrices' we mean those matrices which can be expressed using O(n) parameters, where n is the dimension of the matrix, e.g. Cauchy-like, Toeplitz, Hankel, and Vandermonde matrices [35]. PSMMA can be based on the BCDD structure like PUMMA and ScaLAPACK, and it can also be based on the BDD structure like the Cannon or Fox algorithms. As shown later, the second approach is more efficient at exploiting the off-diagonal low-rank structure. Its drawback is that it requires redistributing the matrix from the BCDD form to the BDD form, since the matrices are initially stored in BCDD form in ScaLAPACK routines. The PSMMA in BCDD form fits well with ScaLAPACK routines and does not need any data redistribution.

By exploiting the special structure of the matrix, PSMMA can reduce both the computation and communication costs. To the best of the authors' knowledge, this work is the first one proposing a reduction of the communication cost of matrix multiplication algorithms by exploiting the structure of the matrices.

3 PROPOSED ALGORITHM
The main contributions of this paper consist of two parts. First, we design a parallel structured matrix multiplication algorithm for structured matrices in section 3.1, which is called PSMMA and can reduce both the computation and communication costs. Secondly, section 3.2 shows how to combine PSMMA with the parallel tridiagonal DC algorithm in ScaLAPACK. It illustrates how to modify several routines in ScaLAPACK in order to use PSMMA. The parallel structured DC algorithm is called PSDC, and its whole procedure is summarized in Algorithm 3. A cartoon is included in Fig. 4 to show the whole workflow of PSDC.

3.1 The PSMMA algorithm
In this subsection, we introduce PSMMA to compute C = A \times B, where A \in R^{m \times k} is a general matrix and B \in R^{k \times n} is a structured matrix whose entries can be represented by O(n + k) parameters. The case where A is a structured matrix and B is a general matrix is similar. When both A and B are structured matrices, all operations can be performed locally without any communication, which will be explained later. In the following sections, we assume A is stored in BCDD form and B is represented by its generators.

Suppose the matrix A has M block rows and K block columns, and the matrix B has K block rows and N block columns. Block (I, J) of C is then computed by

C(I, J) = \sum_{\ell=0}^{K-1} A(I, \ell) \cdot B(\ell, J),    (17)

where I = 0, 1, \cdots, M-1 and J = 0, 1, \cdots, N-1.

The Cannon and Fox algorithms initially considered only the case of matrices in which each processor contains a single row or a single column of blocks [22], [23]. PUMMA considered matrix multiplication algorithms with the BCDD form [24]. Fig. 1 shows a block matrix distributed over a process grid. It is easy to see that the matrix in the block cyclic form is obtained from the original matrix by performing row and column block permutations.

Fig. 1. A block matrix distributed over a process grid: (a) matrix point-of-view; (b) process point-of-view.

Since the matrix B can be represented by its generators, any submatrix of B can be formed easily. This fact enables us to treat the submatrices of A in each process as one whole contiguous block, and we only need to construct the proper submatrices of B correctly. This is our main observation. After discovering this fact, the proposal of the PSMMA algorithm becomes very natural and simple, just following equation (17). The whole procedure is shown in Algorithm 2.

ALGORITHM 2: PSMMA for a structured matrix B.
Input: A \in R^{m \times k}, B \in R^{k \times n}, where A is distributed over a p \times q process grid, B is a structured matrix, and all processes have a copy of its generators.
Output: C = A \times B.
1) Each process constructs the column indexes (CIndex) of matrix B based on its process column in the process grid;
2) Set C = 0;
3) do \ell = 0, q-1
   (a) For each process (i, j), construct the column indexes (RIndex) of matrix A based on the process column mod(j + \ell, q);
   (b) Calculate the required B subblock B(RIndex, CIndex), and construct its low-rank approximation (if needed) by using its generators, B \approx U_B V_B;
   (c) Multiply the copied A subblock with the currently residing B subblock: C = C + (A \cdot U_B) \cdot V_B;
   (d) Shift matrix A leftward cyclically along each process row;
   end do

To illustrate the algorithm from the process point of view, we show how the submatrices of C stored at the process located at position (0, 0) of the process grid are computed for the matrix shown in Fig. 1(b). This consists of three steps, and the process is depicted in Fig. 2. The column indexes of B are fixed, and its row indexes are determined by the column indexes of A. After each step, matrix A is shifted leftward, and the local matrix A on each process is updated by that of the process to its right. After the process at position (0, 0) has received the matrix A from all other processes in its row, the algorithm stops.

Algorithm 2 works both for the block cyclic data distribution and for the block data distribution. Only the distribution of A is needed to determine the row indexes of the local matrix B. In step 3(b) of Algorithm 2, we construct a low-rank approximation only when the local matrix B is probably of low rank. For the tridiagonal DC algorithm in section 2, matrix B is a Cauchy-like matrix and we can check whether the intersection of CIndex and RIndex is empty or not. If the intersection is empty, the B submatrix is probably numerically low-rank. Otherwise, the B submatrix is probably of full rank. For Cauchy-like matrices, we use SRRSC, discussed in section 2.1, to construct a low-rank approximation to matrix B.

REMARK 1. The PSMMA algorithm introduced in this section is also suitable for other structured matrices, such as Toeplitz, Vandermonde, and DFT (Discrete Fourier Transform) matrices. Some results will be shown in our future works.
REMARK 2. Only the cyclic shift, step 3(d) of Algorithm 2, requires point-to-point communications. It can be overlapped with other computations if implemented carefully. In our numerical experiments, we did not implement this technique.
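To clarify the index bookkeeping of Algorithm 2, the following sequential Python sketch simulates the process loop for the BDD case (our illustration; the MPI shift of step 3(d) is replaced by direct indexing of the source block, and the low-rank compression of step 3(b) is omitted):

import numpy as np

# Sequential simulation of Algorithm 2 for C = A * B, with B a Cauchy-like
# matrix given by its generators (u, v, d, w) and a p x q "process grid".
rng = np.random.default_rng(2)
n = 12                       # hypothetical global size, divisible by p and q
p, q = 2, 3                  # process grid
mb, nb = n // p, n // q      # local block sizes

A = rng.standard_normal((n, n))
u, v = rng.standard_normal(n), rng.standard_normal(n)
d = np.sort(rng.uniform(0.0, 1.0, n)); w = d + 0.5 / n
B = np.outer(u, v) / (d[:, None] - w[None, :])     # only formed for the check;
                                                   # PSMMA never assembles B
C = np.zeros((n, n))
for i in range(p):                                 # every "process" (i, j)
    for j in range(q):
        rows = slice(i * mb, (i + 1) * mb)         # local rows of A and C
        cind = np.arange(j * nb, (j + 1) * nb)     # CIndex: local columns of C
        for l in range(q):                         # the q steps of the do-loop
            src = (j + l) % q                      # column block of A residing here
            rind = np.arange(src * nb, (src + 1) * nb)   # RIndex of the B block
            Bloc = np.outer(u[rind], v[cind]) / (d[rind][:, None] - w[cind][None, :])
            C[rows, j * nb:(j + 1) * nb] += A[rows, src * nb:(src + 1) * nb] @ Bloc

print(np.allclose(C, A @ B))    # True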
Fig. 2. The process for computing the submatrices of C located at the process at position (0, 0).

3.1.1 Data distribution and the low-rank property

For our problem, the eigenvector matrix \widehat{Q} in equation (4) is a Cauchy-like matrix, see equation (8). It is further off-diagonally low-rank, since \{d_i\}_{i=1}^{n} and \{\lambda_i\}_{i=1}^{n} are interlacing. The storage form of matrix A affects the low-rank property of matrix B in Fig. 2. We use the following example to show the main points.
Example 0. Assume that matrix B is defined as B_{ij} = \frac{u_i v_j}{d_i - w_j}, where u_i and v_j are random numbers, d_i = i \cdot \frac{b-a}{n}, w_j = d_j + \frac{b-a}{2n}, a = 1.0, b = 9.0, for i, j = 1, 2, \cdots, n. It is known that B is a rank-structured Cauchy-like matrix, see [13]. Let n = 768 and assume B is distributed over a process grid. Choosing N_B = 128 (the block size parameter of the distribution), we get the BCDD form of matrix B, as shown in Fig. 3(a). By choosing N_B = 256, we get the BDD form of B in Fig. 3(b).

Fig. 3 shows the block partitions of matrix B for different choices of N_B. The numbers in Fig. 3 are the ranks of the corresponding blocks. Fig. 3(a) shows the ranks of the blocks of B when B is in the BCDD form. From it we can see that the off-diagonal ranks in Fig. 3(a) are larger than those when B is in the BDD form, which are shown in Fig. 3(b). Many flops can be saved when the ranks are small. When B is initially stored in the BCDD form with a small N_B, we can transform it to the BDD form by using the ScaLAPACK routine PDGEMR2D. We compare these two cases in Example 1 in section 4.

Fig. 3. The ranks of the blocks of B when B is in the BCDD and BDD forms, respectively: (a) block cyclic distribution of matrix B when N_B = 128; (b) block distribution of B when N_B = 256.
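The effect described in Example 0 can be reproduced with a few lines of NumPy (our sketch; a smaller n than in the example and the generator grids as reconstructed above are assumed): an off-diagonal block taken with contiguous indices has much lower numerical rank than one whose rows and columns are interleaved as in a cyclic layout.

import numpy as np

def numrank(X, tol=1e-10):
    # numerical rank relative to the largest singular value
    s = np.linalg.svd(X, compute_uv=False)
    return int((s > tol * s[0]).sum())

n = 256
rng = np.random.default_rng(4)
u, v = rng.standard_normal(n), rng.standard_normal(n)
a, b = 1.0, 9.0
d = np.arange(1, n + 1) * (b - a) / n          # generator grids as in Example 0
w = d + (b - a) / (2 * n)
B = np.outer(u, v) / (d[:, None] - w[None, :])

half = n // 2
contiguous = B[:half, half:]                   # BDD-style off-diagonal block
cyclic_rows = np.arange(0, n, 2)               # rows owned by one process in a
cyclic_cols = np.arange(1, n, 2)               # cyclic layout (interleaved)
interleaved = B[np.ix_(cyclic_rows, cyclic_cols)]

print(numrank(contiguous), numrank(interleaved))   # the first is much smaller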
3.1.2 Comparison with Cannon, Fox, PUMMA and SUMMA

Algorithm 2 works on whole blocks of local matrices, like block-partition based algorithms such as the Cannon and Fox algorithms. The advantages of PSMMA over the Cannon and Fox algorithms are that Algorithm 2 also works for matrices in BCDD form and for any rectangular process grid.

Compared with PUMMA, which works for the block cyclic data distribution and rectangular process grids, Algorithm 2 requires less workspace and the size of the local matrix multiplications is larger than in PUMMA. Based on the LCM concept, PUMMA needs extra workspace to permute the block columns of A and the block rows of B together so that they can be multiplied simultaneously. Therefore, PUMMA requires roughly as much extra workspace as is needed to store another copy of the local matrices A and B, which makes it impractical in real applications, see [24], [36]. Furthermore, only partial block columns (rows) can be merged together in PUMMA, while PSMMA always multiplies the whole local matrix A with B and does not need to permute the block columns of A or the block rows of B.

Compared with SUMMA, which is based on the outer product form of matrix multiplication, PSMMA can further exploit the low-rank structure of matrix B. It is not easy for SUMMA to exploit this property.

3.2 The PSDC algorithm

The excellent performance of the DC algorithm is partially due to deflation [37], [1], which happens in two cases. If the entry z_i of z is negligible or zero, the corresponding (\lambda_i, \hat{q}_i) is already an eigenpair of T. Similarly, if two eigenvalues in D are identical, then one entry of z can be transformed to zero by applying a sequence of plane rotations. All the deflated eigenvalues are moved to the end of D by a permutation matrix, and so are the corresponding eigenvectors. Then, after deflation, (3) reduces to

T = Q (GP) \begin{pmatrix} \bar{D} + b_k \bar{z} \bar{z}^T & \\ & \bar{D}_d \end{pmatrix} (GP)^T Q^T,    (18)

where G is the product of all rotations, P is a permutation matrix, and \bar{D}_d contains the deflated eigenvalues.

According to (4), the eigenvectors of T are computed as

U = \left[ \begin{pmatrix} Q_1 & \\ & Q_2 \end{pmatrix} GP \right] \begin{pmatrix} \widehat{Q} & \\ & I_d \end{pmatrix}.    (19)

To improve efficiency, Gu [38] suggested a permutation strategy for reorganizing the data structure of the orthogonal matrices, which has been used in ScaLAPACK. The matrix in square brackets is permuted as

\begin{pmatrix} Q_{11} & Q_{12} & 0 & Q_{14} \\ 0 & Q_{22} & Q_{23} & Q_{24} \end{pmatrix},

where the first and third block columns contain the eigenvectors that have not been affected by deflation, the fourth block column contains the deflated eigenvectors, and the second block column contains the remaining columns. Then, the computation of U can be done by two parallel matrix-matrix products (calling PDGEMM) involving parts of \widehat{Q} and the matrices (Q_{11}\; Q_{12}), (Q_{22}\; Q_{23}). Another factor that contributes to the excellent performance of DC is that most operations can take advantage of highly optimized matrix-matrix products.
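A tiny sketch of the second deflation case (ours, with made-up data): when two entries of D coincide, a plane rotation G zeroes one entry of z without changing D, so that one eigenpair can be deflated.

import numpy as np

# If D[1,1] == D[2,2], a Givens rotation in the (1,2) plane zeroes z[2]
# while leaving D unchanged, so D + rho*z*z^T = G^T (D + rho*zt*zt^T) G.
n = 5
D = np.diag([0.1, 0.4, 0.4, 0.7, 0.9])       # two equal diagonal entries
z = np.array([0.3, 0.5, 0.2, 0.6, 0.4])
rho = 1.0

r = np.hypot(z[1], z[2])
c, s = z[1] / r, z[2] / r
G = np.eye(n)
G[1, 1], G[1, 2], G[2, 1], G[2, 2] = c, s, -s, c   # rotation in plane (1, 2)

zt = G @ z                                   # zt[2] is now (numerically) zero
M  = D + rho * np.outer(z, z)
Mt = D + rho * np.outer(zt, zt)              # G D G^T == D since D[1,1] == D[2,2]

print(abs(zt[2]) < 1e-15, np.allclose(G @ M @ G.T, Mt))   # True True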
When there are few deflations, the size of matrix \widehat{Q} in (19) will be large, and most of the time spent by DC corresponds to the matrix-matrix multiplication in (19). Furthermore, it is well known that matrix \widehat{Q} defined as in (4) is a Cauchy-like matrix with the off-diagonally low-rank property, see [2], [12]. Therefore, we simply use the parallel structured matrix-matrix multiplication algorithm to compute the eigenvector matrix U in (19). Since PSMMA requires much fewer floating point operations and much less communication than the plain matrix-matrix multiplication routine PDGEMM, this approach makes the DC algorithm in ScaLAPACK much faster.

As mentioned before, the central idea is to replace
PDGEMM by PSMMA. The eigenvectors are updated in the ScaLAPACK routine
PDLAED1, and therefore we modify it and call PSMMA in it instead of
PDGEMM. The whole procedure of PSDC accelerated by PSMMA is summarized in Algorithm 3. Compared with the classical DC algorithm (Algorithm 1), the only difference is that PSMMA is used when the size of matrix M is large. In Fig. 4 the stages of PSDC are graphically represented.

Note that after applying permutations to Q in (19), matrix \widehat{Q} should also be permuted accordingly. From the results in [2], [12], [13], we know that \widehat{Q} is a Cauchy-like matrix and off-diagonally low-rank; the numerical rank of its off-diagonal blocks is usually small. When combining with PSMMA, we do not use Gu's idea [38], since the permutation may destroy the off-diagonally low-rank structure of \widehat{Q} in (19). We need to modify the ScaLAPACK routine PDLAED2, and PSMMA is used only when the size of matrix \bar{D} in (18) is large enough; otherwise Gu's idea is used. In section 4, we denote the size of \bar{D} by K, whose value depends on the matrix as well as on the architecture of the particular parallel computer used, and may be different for different computers. In Example 2, PSMMA is used when K \ge 20,000.
REMARK 3. To keep the orthogonality of \widehat{Q} (see equation (8)), d_i - \lambda_j must be computed by

d_i - \lambda_j = \begin{cases} (d_i - d_j) - \gamma_j & \text{if } i \le j, \\ (d_i - d_{j+1}) + \mu_j & \text{if } i > j, \end{cases}    (20)

where \gamma_i = \lambda_i - d_i (the distance between \lambda_i and d_i) and \mu_i = d_{i+1} - \lambda_i (the distance between \lambda_i and d_{i+1}), which can be returned by calling the LAPACK routine DLAED4. In our implementation, \widehat{Q} is represented by using five generators, \{d_i\}, \{\gamma_i\}, \{\mu_i\}, \{u_i\} and \{v_i\}.

Fig. 4. The stages of the PSDC method for solving the symmetric tridiagonal eigenvalue problem.
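The following sketch (ours; the gaps γ and µ are fabricated here rather than returned by DLAED4) shows how (20) reproduces d_i − λ_j from the generators without forming λ_j explicitly:

import numpy as np

# Evaluate d_i - lambda_j via Eq. (20), using the gaps gamma_j = lambda_j - d_j
# and mu_j = d_{j+1} - lambda_j instead of the eigenvalues themselves, which
# avoids cancellation when lambda_j is very close to d_i.
n = 6
d = np.linspace(0.0, 1.0, n)
gamma = np.full(n, 1e-13)              # lambda_j barely to the right of d_j
lam = d + gamma                        # only formed here for the comparison
mu = np.empty(n)
mu[:-1] = d[1:] - lam[:-1]             # distance to the next pole
mu[-1] = 1.0                           # last gap, unused in this sketch

diff = np.empty((n, n))
for i in range(n):
    for j in range(n):
        if i <= j:
            diff[i, j] = (d[i] - d[j]) - gamma[j]     # Eq. (20), case i <= j
        else:
            diff[i, j] = (d[i] - d[j + 1]) + mu[j]    # Eq. (20), case i > j

print(np.allclose(diff, d[:, None] - lam[None, :]))   # same values, but the tiny
                                                      # entries keep their accuracy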
4 EXPERIMENTAL RESULTS

All the experimental results were obtained on the Tianhe-2 supercomputer [39], [40], located in Guangzhou, China. Each compute node is equipped with two 12-core Intel Xeon E5-2692 v2 CPUs, and our experiments only use CPU cores. The details of the test platform and environment of the compute nodes are shown in Table 1. For all these numerical experiments we only used plain MPI, running 24 MPI processes per node in principle, with one process per core.
Example 1. Assume that A \in R^{n \times n} is a random matrix and B is defined as B_{ij} = \frac{u_i v_j}{d_i - w_j}, where u_i and v_j are random numbers, d_i = i \cdot \frac{b-a}{n}, w_j = d_j + \frac{b-a}{2n}, a = 1.0, b = 9.0, for i, j = 1, \cdots, n. It is known that B is a rank-structured Cauchy-like matrix, see [13], [12]. We compute C = A \times B.
ALGORITHM 3: PSDC(T, Q, Λ) algorithm for computing the eigendecomposition of a symmetric tridiagonal matrix.
Input: T \in R^{n \times n}
Output: eigenvalues \Lambda, eigenvectors Q
if the size of T is small enough then
    apply the QR algorithm and compute T = Q \Lambda Q^T; return Q and \Lambda;
else
    form T = \mathrm{diag}(T_1, T_2) + b_k v v^T;
    call PSDC(T_1, Q_1, \Lambda_1); call PSDC(T_2, Q_2, \Lambda_2);
    form M = \bar{D} + b_k \bar{z} \bar{z}^T from Q_1, Q_2, \Lambda_1, \Lambda_2 and v, with \bar{D} = \mathrm{diag}(\Lambda_1, \Lambda_2) after deflation;
    if the size of matrix M is small then
        find the eigenvalues \Lambda and eigenvectors \widehat{Q} of M;
        use Gu's idea to calculate Q = \mathrm{diag}(Q_1, Q_2)\, \widehat{Q};
    else
        find the eigenvalues \Lambda and eigenvectors \widehat{Q} of M;
        use PSMMA (via SRRSC) to calculate Q = \mathrm{diag}(Q_1, Q_2)\, \widehat{Q};
    end
    return Q and \Lambda;
end
TABLE 1
The test platform and environment of one node.

Items               Values
CPU                 2 x Intel Xeon E5-2692 v2 @ 2.2GHz
Memory size         64GB (DDR3)
Operating System    Linux 3.10.0
Compiler            Intel ifort 2013 sp1.2.144
Optimization        -O3 -mavx

Let n = 8192, 16384, and 32768, respectively, and choose different numbers of processes, N_P = 16, 64, 256, 1024 and 4096, to compare PDGEMM with PSMMA. To avoid performance variance during multiple executions, we evaluated the performance of
PDGEMM and PSMMA twice in the same program and called that program three times, and chose the best results among these six executions. Our codes will be released on GitHub (available at https://github.com/shengguolsg/).

It is well known that the performance of
PDGEMM depends on the block size N_B. We tested the performance of PDGEMM for several block sizes N_B of 64 and larger, and found that their differences are very small, but they are all better than smaller block sizes. Therefore, we chose N_B = 64 and N_B = n/\sqrt{N_P}, which correspond to the BCDD form and the BDD form, respectively. Note that a large N_B is better for PSMMA, since the ranks of the off-diagonal blocks after permutations may be smaller, see Table 2.

As shown in section 3.1.1, we can redistribute matrix A from BCDD to BDD to exploit the off-diagonal low-rank property of matrix B. In this example, we tested four versions of PSMMA, which are
• PSMMA BCDD: N_B = 64, with low-rank approximation;
• PSMMA BDD: N_B = n/\sqrt{N_P}, with low-rank approximation;
• PSMMA WRedist: N_B = 64, with data redistribution and low-rank approximation (matrix A is transformed from BCDD to BDD with N_B = n/\sqrt{N_P} and then back);
• PSMMA NLowrank: N_B = 64, without low-rank approximation.

The speedups of PSMMA over PDGEMM are shown in Fig. 5(a), 5(b) and 5(c) for dimensions n = 8192, 16384, and 32768, respectively. From the results, we can see that PSMMA BCDD is always faster than PDGEMM except for N_P = 16. This is because the ranks of the B blocks are very large when using the BCDD form, and most of the time is spent in computing the low-rank approximations. Table 2 shows the ranks of the off-diagonal blocks in the first block column, see Fig. 3 for the structure of matrix B. Even without using low-rank approximation, PSMMA can be faster than PDGEMM. It is interesting to see that PSMMA NLowrank is always faster than
PDGEMM for these three matrices.
Note that PSMMA NLowrank requires more floating point operations than
PDGEMM since it needs to construct the local B submatrices. However, PSMMA NLowrank requires less communication than PDGEMM.

PSMMA WRedist is only slower than
PDGEMM when the size of the matrix is small (n = 8192) and the number of processes is large. From the last row of Table 2, we can see that the ranks of the off-diagonal submatrices in BDD form are much smaller than the sizes (4096) of the submatrices. PSMMA WRedist is generally faster than both PSMMA BCDD and PSMMA NLowrank. The disadvantage of PSMMA WRedist is that it requires performing a data redistribution (communication). When the number of processes is large, the data redistribution represents a large portion of the overall time. Fig. 6 shows the percentages of time spent transforming matrix A from BCDD to BDD and transforming it back. It takes a large fraction of the total time when n is small and N_P is large. PSMMA BDD is always the best and can be more than ten times faster than PDGEMM. This is because it assumes that matrix A is initially stored in BDD form, and the low-rank property of matrix B is not destroyed either.

It is difficult to know exactly why the speedups of PSMMA WRedist and PSMMA BDD over PDGEMM first increase and then decrease. It depends on the properties of these two algorithms. Compared with
PDGEMM, PSMMA BDD and PSMMA WRedist save both computations and communications. When the number of processes increases, these two contributions make the speedups first increase. As the number of processes increases further, the size of the local submatrix located on each process decreases, and therefore the percentage of floating point operations saved by using low-rank approximations decreases. Since PSMMA BDD then no longer saves many floating point operations compared to
PDGEMM, the speedups decrease when using more processes. For PSMMA WRedist, the cost of the data redistribution also increases as the number of processes increases. Meanwhile, as the number of processes grows, the percentage of communication saved by PSMMA BDD and PSMMA WRedist increases. Therefore, PSMMA BDD can always be faster than
PDGEMM since it reduces not only computations but also communications. It is better to use PSMMA NLowrank instead of PSMMA WRedist when the number of processes is very large.
REMARK 4. We can adaptively choose a particular PSMMA algorithm based on the size of the matrix and the number of processes. It is better to use PSMMA WRedist when the number of processes is small, and to use PSMMA BCDD or PSMMA NLowrank when the number of processes is large and/or the size of the matrix is small. It is always best to use PSMMA BDD when it is available and matrix B is off-diagonally low-rank.

TABLE 2
The ranks of the off-diagonal blocks B(2,1), B(3,1) and B(4,1) of B in the first block column, for different block sizes N_B, when B is stored in the BCDD form.
Example 2. We use some 'difficult' matrices [41] for the DC algorithm, for which few or no eigenvalues are deflated. Examples include the Clement-type, Hermite-type and Toeplitz-type matrices, which are defined as follows.

The Clement-type matrix [41] is the symmetric tridiagonal matrix with zero diagonal whose off-diagonal entries are \sqrt{i(n+1-i)}, i = 1, \ldots, n-1.

The Hermite-type matrix [41] is the symmetric tridiagonal matrix with zero diagonal whose off-diagonal entries are \sqrt{i}, i = 1, \ldots, n-1.

The Toeplitz-type matrix [41] is the symmetric tridiagonal Toeplitz matrix with constant diagonal entries and constant off-diagonal entries.

Fig. 5. The speedup of PSMMA over PDGEMM: (a) dimension n = 8192; (b) dimension n = 16384; (c) dimension n = 32768.

Fig. 6. The percentages of time required by the data redistribution.
For the strong scaling results, the matrix dimension n is fixed, and the rank-structured techniques are used only when the size of the secular equation is larger than K = 20,000. We used PSMMA WRedist and chose N_B = 64. The strong scaling results of PSDC are shown in Fig. 7(a). The speedups of PSDC over ScaLAPACK are reported in Fig. 7(b). We can see that PSDC is clearly faster than PDSTEDC in ScaLAPACK in all cases. Because of deflations, the performances for these three matrices can be different even though they have the same dimension.

The orthogonality of the eigenvectors computed by PSDC is of the same order as that of ScaLAPACK, as shown in Table 3. The orthogonality of a matrix Q is defined as \|I - QQ^T\|_{\max}, where \|\cdot\|_{\max} is the maximum absolute value of the entries of its argument. We confirm that the residuals of the eigenpairs computed by PSDC are of the same order as those computed by ScaLAPACK, though the results are not included here.

Furthermore, we compare PSDC with PHDC, which was introduced in [15] and uses STRUMPACK to accelerate the matrix-matrix multiplications. The results are shown in Fig. 8. For these three matrices, PSDC is much faster than PHDC when using many processes. It is better to use STRUMPACK when using few processes, since HSS-based multiplications can
save more floating point operations than BLR-based multiplications.

Fig. 7. The results for the matrices of Example 2: (a) the strong scaling of PSDC; (b) the speedup of PSDC over ScaLAPACK.

TABLE 3
The orthogonality of the computed eigenvectors by PSDC. For the Clement, Hermite and Toeplitz matrices, and for 64, 256, 1024 and 4096 processes, the measured orthogonality \|I - QQ^T\|_{\max} is in all cases of the order of 1e-14.

Fig. 8. The speedup of PSDC over PHDC.
In this subsection, we use three matrices that come from real applications to test PSDC. One comes from the spherical harmonic transform (SHT) [42], which has been used in [15]. This symmetric tridiagonal matrix, which will be denoted by SHT, is defined as

A_{jk} = \begin{cases} c_{m+2j-2}, & k = j-1, \\ d_{m+2j}, & k = j, \\ c_{m+2j}, & k = j+1, \\ 0, & \text{otherwise}, \end{cases}    (21)

for j, k = 0, 1, \ldots, n-1, where \xi_l = l - m,

c_l = \sqrt{\frac{(\xi_l+1)(\xi_l+2)(l+m+1)(l+m+2)}{(2l+1)(2l+3)^2(2l+5)}}, \qquad d_l = \frac{2l(l+1) - 2m^2 - 1}{(2l-1)(2l+3)},

for l = m, m+1, m+2, \ldots. We assume the dimension of this matrix is n = 30,000 and m = n.

The other two are sparse matrices obtained from the SuiteSparse matrix collection [43], called SiO and Si5H12. We first reduce each matrix to its tridiagonal form by calling ScaLAPACK routines and then call PSDC to compute its eigendecomposition. This is also the general process for computing the eigenvalue decomposition of any symmetric (sparse) matrix. These matrices are real and symmetric and their dimensions are n = 33,401 and 19,896, respectively.
Example 3. We use the matrices SHT, SiO and Si5H12 to compare PSDC with
PDSTEDC. In this example, we use the rank-structured techniques, i.e., call PSMMA, whenever the size of the secular equation is larger than K = 15,000; this smaller threshold is chosen so that PSMMA is also triggered for matrix Si5H12 when N_B = 64. The speedups of PSDC over PDSTEDC are reported in Table 4. The backward errors of the computed eigenpairs are also included in the third column, which are computed as
\text{Residual} = \frac{\|A - Q \Lambda Q^*\|_c}{\|A\|_2},    (22)

where Q \in R^{n \times n} is orthogonal, \Lambda \in R^{n \times n} is diagonal with the eigenvalues as its diagonal elements, \|X\|_c denotes the maximum Frobenius norm over the columns of X, and \|X\|_2 denotes the 2-norm of X, i.e., its largest singular value.
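For reference, the two accuracy measures used in this section can be computed as in the following NumPy sketch (our illustration, using numpy.linalg.eigh on a random symmetric matrix rather than PSDC):

import numpy as np

# Orthogonality ||I - Q Q^T||_max and backward error of Eq. (22).
rng = np.random.default_rng(3)
n = 50
A = rng.standard_normal((n, n)); A = (A + A.T) / 2     # symmetric test matrix
lam, Q = np.linalg.eigh(A)                             # reference eigensolver

orthogonality = np.max(np.abs(np.eye(n) - Q @ Q.T))
R = A - Q @ np.diag(lam) @ Q.T
residual = np.max(np.linalg.norm(R, axis=0)) / np.linalg.norm(A, 2)

print(orthogonality, residual)     # both at roundoff level, O(1e-15)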
TABLE 4
The speedups of PSDC over PDSTEDC for matrices from real applications. For each of the matrices SHT, SiO and Si5H12, the table lists the largest secular equation size K, the backward error (22) of the computed eigenpairs, and the speedups obtained with 64, 256, 1024 and 4096 processes.
Example 4. It is known that ELPA (Eigenvalue soLver for Petascale Applications [21], [44]) has better scalability and is faster than the MKL version of ScaLAPACK. As opposed to ScaLAPACK, ELPA routines do not rely on BLACS and PBLAS; ELPA can overlap the computations with communications, and the computations are also optimized by using OpenMP and even GPUs. For the tridiagonal eigensolver, ELPA rewrites the DC algorithm and implements its own matrix-matrix multiplications, without using the PBLAS routine
PDGEMM. In contrast, PSDC follows the main procedure of
PDSTEDC, and only uses PSMMA to accelerate the expensive matrix-matrix multiplication part.

We compare PSDC with ELPA (version 2018.11.001) and find that it is competitive. We use the Clement matrix for these experiments. Fig. 9 shows the execution times of PSDC and ELPA for different numbers of processes. It shows that PSDC is faster than ELPA when using few processes, but PSDC becomes slower when using many processes. This is because matrix multiplication is no longer the dominant cost when using many processes. It is shown in [15] that the percentage of time dedicated to the matrix multiplications can then be less than 10% of the total time. The gains of ELPA come from other optimization techniques.
Fig. 9. The comparison of PSDC with ELPA.
We discuss some bottlenecks of the current implementation and future works in this subsection. A restriction of our current codes is that PSMMA is only used at the top level of the DC tree. It should be used at any level, as long as the size of the matrix is large. This will be modified in the near future. We did not use OpenMP or vectorization to optimize our routines. The routines for constructing the local
submatrices from the generators can be optimized by using vectorization and OpenMP.

Following our current work, there are some interesting research projects to pursue in the future. First, PSMMA can be used to extend the banded DC algorithms proposed in [14], [45] to distributed memory platforms in a similar manner. Secondly, the structured matrix-matrix multiplication techniques can be used on heterogeneous architectures, where they can reduce the data movement from the CPU to accelerators such as GPUs (the authors would like to thank the referee for pointing out this research direction): we only need to transfer the generators to the GPU once, instead of many submatrices. Last but not least, PSMMA can be adapted for Toeplitz, Hankel, DFT (Discrete Fourier Transform) and other structured matrices [46]. Some results will be included in our future works.

5 CONCLUSIONS
The starting point of this paper was to accelerate the tridiagonal DC algorithm implemented in ScaLAPACK [7] for some difficult matrices. It is known that the main task lies in multiplying a general matrix with a rank-structured Cauchy-like matrix, see [1], [2], [12]. The main contribution of this paper is that a highly scalable parallel matrix multiplication algorithm is proposed for rank-structured Cauchy-like matrices, which fits well with the parallel tridiagonal DC algorithm.

The matrix multiplication problem is known to be very compute intensive. However, as HPC moves towards exascale computing, the development of communication-avoiding or communication-decreasing algorithms becomes more and more important. By taking advantage of the particular structure of Cauchy-like matrices, we proposed a parallel structured matrix multiplication algorithm, PSMMA, which can reduce both computation and communication costs. The workflow of PSMMA is similar to that of PUMMA [24], and it further exploits the rank-structured property of the input matrices. Experimental results show that PSMMA can be much faster than
PDGEMM for rank-structured
Cauchy-like matrices, and the speedups over
PDGEMM can be more than an order of magnitude. By combining PSMMA with the parallel tridiagonal DC algorithm, we propose a parallel structured DC algorithm (PSDC). For those difficult matrices for which DC deflates very few eigenvalues, PSDC is always much faster than the classical DC algorithm implemented in ScaLAPACK. Unlike PHDC, which was proposed in [15], PSDC does not have scalability problems and it scales to 4096 processes at least.

ACKNOWLEDGMENTS
The authors would like to thank the referees for their valuable comments, which greatly improved the presentation of this paper. This work is supported by the National Natural Science Foundation of China (No. NNW2019ZT6-B20, U1611261, 61872392 and U1811461), the National Key R&D Program of China (2018YFB0204303), NSF of Hunan (No. 2019JJ40339), NSF of NUDT (No. ZK18-03-01), the Guangdong Natural Science Foundation (2018B030312002), and the Program for Guangdong Introducing Innovative and Entrepreneurial Teams under Grant (No. 2016ZT06D211). Jose E. Roman was supported by the Spanish Agencia Estatal de Investigación (AEI) under project SLEPc-DA (PID2019-107379RB-I00).

REFERENCES

[1] J. J. M. Cuppen, "A divide and conquer method for the symmetric tridiagonal eigenproblem,"
Numer. Math., vol. 36, pp. 177–195, 1981.
[2] M. Gu and S. C. Eisenstat, "A divide-and-conquer algorithm for the symmetric tridiagonal eigenproblem," SIAM J. Matrix Anal. Appl., vol. 16, pp. 172–191, 1995.
[3] G. H. Golub and C. F. V. Loan, Matrix Computations, 3rd ed. The Johns Hopkins University Press, Baltimore, MD, 1996.
[4] I. S. Dhillon and B. N. Parlett, "Multiple representations to compute orthogonal eigenvectors of symmetric tridiagonal matrices," Linear Algebra Appl., vol. 387, pp. 1–28, 2004.
[5] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, LAPACK Users' Guide, 3rd ed. Philadelphia, PA: Society for Industrial and Applied Mathematics, 1999.
[6] J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley, D. Walker, and R. Whaley, "ScaLAPACK: A portable linear algebra library for distributed memory computers - design issues and performance," Computer Physics Communications, vol. 97, pp. 1–15, 1996.
[7] F. Tisseur and J. Dongarra, "A parallel divide and conquer algorithm for the symmetric eigenvalue problem on distributed memory architectures," SIAM J. Sci. Comput., vol. 20, no. 6, pp. 2223–2236, 1999.
[8] G. Ballard, J. Demmel, O. Holtz, and O. Schwartz, "Minimizing communication in numerical linear algebra," SIAM J. Matrix Anal. Appl., vol. 21, no. 2, pp. 562–580, 2011.
[9] J. Demmel, L. Grigori, M. Hoemmen, and J. Langou, "Communication-optimal parallel and sequential QR and LU factorizations," SIAM J. Sci. Comput., vol. 34, pp. A206–A239, 2012.
[10] S. Chandrasekaran, M. Gu, and T. Pals, "Fast and stable algorithms for hierarchically semi-separable representations," University of California, Berkeley, CA, Tech. Rep., 2004.
[11] S. Chandrasekaran, P. Dewilde, M. Gu, W. Lyons, and T. Pals, "A fast solver for HSS representations via sparse matrices," SIAM J. Matrix Anal. Appl., vol. 29, pp. 67–81, 2006.
[12] S. Li, X. Liao, J. Liu, and H. Jiang, "New fast divide-and-conquer algorithm for the symmetric tridiagonal eigenvalue problem," Numer. Linear Algebra Appl., vol. 23, pp. 656–673, 2016.
[13] S. Li, M. Gu, L. Cheng, X. Chi, and M. Sun, "An accelerated divide-and-conquer algorithm for the bidiagonal SVD problem," SIAM J. Matrix Anal. Appl., vol. 35, no. 3, pp. 1038–1057, 2014.
[14] X. Liao, S. Li, L. Cheng, and M. Gu, "An improved divide-and-conquer algorithm for the banded matrices with narrow bandwidths," Comput. Math. Appl., vol. 71, pp. 1933–1943, 2016.
[15] S. Li, F.-H. Rouet, J. Liu, C. Huang, X. Gao, and X. Chi, "An efficient hybrid tridiagonal divide-and-conquer algorithm on distributed memory architectures," Journal of Computational and Applied Mathematics, vol. 344, pp. 512–520, 2018.
[16] F. Rouet, X. Li, P. Ghysels, and A. Napov, "A distributed-memory package for dense hierarchically semi-separable matrix computations using randomization," ACM Transactions on Mathematical Software, vol. 42, no. 4, pp. 27:1–35, 2016.
[17] P. Amestoy, C. Ashcraft, O. Boiteau, A. Buttari, J.-Y. L'Excellent, and C. Weisbecker, "Improving multifrontal methods by means of block low-rank representations," SIAM Journal on Scientific Computing, vol. 37, no. 3, pp. A1451–A1474, 2015.
[18] W. Hackbusch, "A sparse matrix arithmetic based on H-matrices. Part I: Introduction to H-matrices," Computing, vol. 62, pp. 89–108, 1999.
[19] W. Hackbusch, B. Khoromskij, and S. Sauter, "On H²-matrices," in Lectures on Applied Mathematics, H. Bungartz, R. H. W. Hoppe, and C. Zenger, Eds. Berlin: Springer, 2000, pp. 9–29.
[20] I. Yamazaki, A. Ida, R. Yokota, and J. Dongarra, "Distributed-memory lattice H-matrix factorization," International Journal of High Performance Computing Applications, vol. 33, no. 5, pp. 1046–1063, 2019.
[21] T. Auckenthaler, V. Blum, H. J. Bungartz, T. Huckle, R. Johanni, L. Krämer, B. Lang, H. Lederer, and P. R. Willems, "Parallel solution of partial symmetric eigenvalue problems from electronic structure calculations," Parallel Computing, vol. 37, no. 12, pp. 783–794, 2011.
[22] L. E. Cannon, "A cellular computer to implement the Kalman filter algorithm," Ph.D. dissertation, Montana State University, 1969.
[23] G. C. Fox, S. W. Otto, and A. J. G. Hey, "Matrix algorithms on a hypercube I: matrix multiplication," Parallel Computing, vol. 4, no. 1, pp. 17–31, 1987.
[24] J. Choi, D. W. Walker, and J. J. Dongarra, "PUMMA: Parallel universal matrix multiplication algorithms on distributed memory concurrent computers," Concurrency and Computation: Practice and Experience, vol. 6, no. 7, pp. 543–570, 1994.
[25] M. Gu and J. Xia, "A multi-structured superfast Toeplitz solver," preprint, 2009.
[26] R. A. V. De Geijn and J. Watts, "SUMMA: Scalable universal matrix multiplication algorithm," Concurrency and Computation: Practice and Experience, vol. 9, no. 4, pp. 255–274, 1997.
[27] M. Gu and S. C. Eisenstat, "A stable and efficient algorithm for the rank-one modification of the symmetric eigenproblem," SIAM J. Matrix Anal. Appl., vol. 15, pp. 1266–1276, 1994.
[28] I. Gohberg, T. Kailath, and V. Olshevsky, "Fast Gaussian elimination with partial pivoting for matrices with displacement structure," Mathematics of Computation, vol. 64, no. 212, pp. 1557–1576, 1995.
[29] M. Gu, "Stable and efficient algorithms for structured systems of linear equations," SIAM J. Matrix Anal. Appl., vol. 19, no. 2, pp. 279–306, 1998.
[30] C.-T. Pan, "On the existence and computation of rank revealing LU factorizations," Linear Algebra Appl., vol. 316, pp. 199–222, 2000.
[31] E. Solomonik and J. Demmel, "Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms," in Proc. 17th International Conference on Parallel Processing (Euro-Par 2011), Part II, 2011, pp. 90–109.
[32] J. Demmel, D. Eliahu, A. Fox, S. Kamil, B. Lipshitz, O. Schwartz, and O. Spillinger, "Communication-optimal parallel recursive rectangular matrix multiplication," IEEE, 2013, pp. 261–272.
[33] E. Solomonik, D. Matthews, J. Hammond, and J. Demmel, "Cyclops Tensor Framework: Reducing communication and eliminating load imbalance in massively parallel contractions," IEEE, 2013, pp. 813–824.
[34] G. Kwasniewski, M. Kabić, M. Besta, J. VandeVondele, R. Solcà, and T. Hoefler, "Red-blue pebbling revisited: Near optimal parallel matrix-matrix multiplication," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019, pp. 1–22.
[35] V. Y. Pan, Structured Matrices and Polynomials: Unified Superfast Algorithms. New York: Birkhäuser, Springer, 2001.
[36] J. Choi, "A new parallel matrix multiplication algorithm on distributed-memory concurrent computers," Concurrency: Practice and Experience, vol. 10, no. 8, pp. 655–670, 1998.
[37] J. R. Bunch, C. P. Nielsen, and D. C. Sorensen, "Rank-one modification of the symmetric eigenproblem," Numer. Math., vol. 31, pp. 31–48, 1978.
[38] M. Gu, "Studies in numerical linear algebra," Ph.D. dissertation, Yale University, New Haven, CT, 1993.
[39] X. Liao, L. Xiao, C. Yang, and Y. Lu, "MilkyWay-2 supercomputer: System and application," Frontiers of Computer Science, vol. 8, no. 3, pp. 345–356, 2014.
[40] Y. Liu, C. Yang, F. Liu, X. Zhang, Y. Lu, Y. Du, C. Yang, M. Xie, and X. Liao, "623 Tflop/s HPCG run on Tianhe-2: Leveraging millions of hybrid cores," International Journal of High Performance Computing Applications, vol. 30, no. 1, pp. 39–54, 2016.
[41] O. A. Marques, C. Voemel, J. W. Demmel, and B. N. Parlett, "Algorithm 880: A testing infrastructure for symmetric tridiagonal eigensolvers," ACM Trans. Math. Softw., vol. 35, pp. 8:1–13, 2008.
[42] M. Tygert, "Fast algorithms for spherical harmonic expansions, II," Journal of Computational Physics, vol. 227, pp. 4260–4279, 2008.
[43] T. Davis and Y. Hu, "The University of Florida sparse matrix collection," ACM Trans. Math. Softw., vol. 38, no. 1, pp. 1:1–1:25, 2011.
[44] A. Marek, V. Blum, R. Johanni, V. Havu, B. Lang, T. Auckenthaler, A. Heinecke, H. Bungartz, and H. Lederer, "The ELPA library: scalable parallel eigenvalue solutions for electronic structure theory and computational science," J. Phys.: Condens. Matter, vol. 26, pp. 1–15, 2014.
[45] P. Arbenz, "Divide-and-conquer algorithms for the band symmetric eigenvalue problem," Parallel Comput., vol. 18, pp. 1105–1128, 1992.
[46] D. Bini and V. Pan, Polynomial and Matrix Computations, Volume I: Fundamental Algorithms, ser. Progress in Theoretical Computer Science. Birkhäuser, 1994.
Xia Liao received the B.S. degree from the College of Computer Science, National University of Defense Technology (NUDT), Changsha, China. She is a PhD candidate in the College of Computer Science at the National University of Defense Technology. Her research interests include high performance computing, big data analysis and processing, and parallel computing.

Shengguo Li received the B.S., M.S. and Ph.D. degrees in computational mathematics from the National University of Defense Technology, Changsha, China. He is currently an assistant professor with the College of Computer Science, NUDT. His research interests include numerical linear algebra, high performance computing, and machine learning.

Yutong Lu received the M.Sc. and Ph.D. degrees in computer science from the National University of Defense Technology (NUDT), Changsha, China. She is currently a professor with the School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China. She is also the director of the National Supercomputer Center in Guangzhou. Her research interests include high performance computing, parallel system management, high-speed communication, distributed file systems, and advanced programming environments with MPI.