Correlations in sequences of generalized eigenproblems arising in Density Functional Theory
Edoardo Di Napoli^{a,c,*}, Stefan Blügel^{b,**}, Paolo Bientinesi^{a,**}

a) RWTH Aachen University, AICES, Schinkelstr. 2, 52062 Aachen, Germany
b) Peter Grünberg Institut and Institute for Advanced Simulation, Forschungszentrum Jülich and JARA, 52425 Jülich, Germany
c) Jülich Supercomputing Centre, Institute for Advanced Simulation, Forschungszentrum Jülich, 52425 Jülich, Germany
Abstract
Density Functional Theory (DFT) is one of the most used ab initio theoretical frameworks in materials science. It derives the ground state properties of a multi-atomic ensemble directly from the computation of its one-particle density n(r). In DFT-based simulations the solution is calculated through a chain of successive self-consistent cycles; in each cycle a series of coupled equations (Kohn-Sham) translates to a large number of generalized eigenvalue problems whose eigenpairs are the principal means for expressing n(r). A simulation ends when n(r) has converged to the solution within the required numerical accuracy. This usually happens after several cycles, resulting in a process calling for the solution of many sequences of eigenproblems. In this paper, the authors report evidence showing unexpected correlations between adjacent eigenproblems within each sequence. By investigating the numerical properties of the sequences of generalized eigenproblems it is shown that the eigenvectors undergo an "evolution" process. At the same time it is shown that the Hamiltonian matrices exhibit a similar evolution and manifest a specific pattern in the information they carry. Correlation between eigenproblems within a sequence is of capital importance: information extracted from the simulation at one step of the sequence could be used to compute the solution at the next step. Although they are not explored in this work, the implications could be manifold: from increasing the performance of material simulations, to the development of an improved iterative solver, to modifying the mathematical foundations of the DFT computational paradigm in use, thus opening the way to the investigation of new materials.

Keywords: Density Functional Theory, sequence of generalized eigenproblems, FLAPW, eigenproblem correlation
1. Introduction
Density Functional Theory [1, 2] is a very effective theoretical framework for studying complex quantum mechanical problems in solid and liquid systems. DFT-based methods are growing as the standard tools for simulating new materials. Simulations aim at recovering and predicting physical properties (electronic structure, total energy differences, magnetic properties, etc.) of large molecules as well as systems made of many hundreds of atoms. DFT reaches this result by solving self-consistently a rather complex set of quantum mechanical equations leading to the computation of the one-particle density n(r), from which physical properties are derived.

In order to preserve self-consistency, numerical implementations of DFT methods consist of a series of iterative cycles; at the end of each cycle a new density is computed and compared to the one calculated in the previous cycle. The end result is a series of successive densities converging to an n(r) approximating the exact density within the desired level of accuracy [3]. Each cycle consists of a complex series of operations: calculation of the Kohn-Sham potential, basis function generation, numerical integration, generalized eigenproblem solution, and ground state energy computation. In one particular DFT implementation, namely the Full-potential Linearized Augmented Plane Wave (FLAPW) method [4, 5], matrix entry initialization and generalized eigenvalue problem solution are the most time consuming stages in each iterative cycle (Fig. 1). The cost for completing these stages is directly related to the number of generalized eigenproblems involved and to their size. Above a certain threshold, the eigenproblem size is proportional to the third power of the number of atoms of the physical system, while the number of eigenproblems ranges from a few to several hundred per cycle. Typically, each of the problems is dense and a significant fraction of the spectrum is required in order to compute n(r). The nature of these eigenproblems forces the use of direct methods, such as those included in LAPACK or ScaLAPACK [20, 21]. All of the most common simulation codes implementing the FLAPW method (WIEN2k, FLEUR, FLAIR, Exciting, ELK [7–11]), despite successfully simulating complex materials [12–15], treat each eigenproblem of the series of iterative cycles in isolation. This implies that no information embedded in the solution of eigenproblems in one cycle is used to speed up the computation of problems in the next.

✩ Article based on research supported by the Jülich Aachen Research Alliance (JARA-HPC) consortium, the Deutsche Forschungsgemeinschaft (DFG), and the Volkswagen Foundation.
* Principal corresponding author. ** Corresponding author.
Email addresses: [email protected] (Edoardo Di Napoli), [email protected] (Stefan Blügel), [email protected] (Paolo Bientinesi)
Preprint submitted to Elsevier, October 28, 2018.
While these routines provide users with accurate algorithms to be used as black boxes, they do not offer a mechanism for exploiting extra information relative to the application. The line of research pursued here takes inspiration from the necessity of exploring a different computational approach in an attempt to develop a high-performance algorithm specifically tailored to the FLAPW method. Contrary to the traditional view, we look at the entire succession of iterative cycles making up a simulation as constituted by a few dozen sequences of generalized eigenproblems. By mathematical construction, each problem in a sequence is expected to be, at most, weakly connected to the previous one. At odds with this observation, we present evidence showing that there is an unexpectedly strong correlation between eigenproblems of adjacent cycles in each sequence. We suggest how this extra information could be used to improve the performance of the current state-of-the-art routines.

Recently some methods have been developed that go in this direction. Among them we mention the block version of Krylov-Schur [16] (in itself an improved version of the Thick-Restart Lanczos [17]) and the Chebyshev-Davidson [18] methods. One of the most successful examples in this sense is the recently implemented Chebyshev-Filtered Subspace Acceleration [19], currently included in the PARSEC package, specifically targeting ab initio real-space computations.

In Sec. 2 we introduce the reader to the DFT framework and more specifically to the FLAPW method. We explain in more detail the series of self-consistent cycles, the computational bottlenecks inside each cycle, and how, from this picture, the importance of sequences of eigenproblems emerges. In Sec. 3, we illustrate the investigative tools we employ in studying the eigen-sequences, namely eigenvector evolution and unchanging matrix patterns.
We then present our computational results and extensively discuss their interpretation from the numerical and physical points of view. Finally, in Sec. 4 we draw our conclusions and explain how it would be possible to exploit the experimental results to improve the performance of a DFT simulation.
2. Physical framework
DFT methods are based on the simultaneous solution of a set of Schrödinger-like equations [eq. (1)]. These equations are determined by a Hamiltonian operator Ĥ that, in addition to a kinetic energy operator, contains an effective potential v[n], which functionally depends only on the one-particle electron density n(r). In turn, the wave functions φ_i(r), which solve the Schrödinger-like equations for N electrons, determine the one-particle electron density [eq. (2)] used in computing the effective potential. The latter is explicitly written in terms of the atomic Coulomb potential of the nuclei v_I(r), a Hartree term w(r, r') describing repulsions between pairs of electrons, and the exchange-correlation potential v_xc[n](r) summarizing all other collective contributions [eq. (3)]. This set of equations, also known as Kohn-Sham (KS) [2], is solved self-consistently; in other words, the equations must be solved subject to the condition that the effective potential v[n] and the electron density n(r) mutually agree.

\hat{H}\,\phi_i(\mathbf{r}) = \left( -\frac{\hbar^2}{2m} \nabla^2 + v(\mathbf{r}) \right) \phi_i(\mathbf{r}) = \epsilon_i\, \phi_i(\mathbf{r})    (1)

n(\mathbf{r}) = \sum_{i=1}^{N} |\phi_i(\mathbf{r})|^2    (2)

v(\mathbf{r}) = v_I(\mathbf{r}) + \int w(\mathbf{r}, \mathbf{r}')\, n(\mathbf{r}')\, d\mathbf{r}' + v_{xc}[n](\mathbf{r})    (3)

Computational implementations of DFT depend on the particular modeling of the effective potential and on the orbital basis used to parametrize the eigenfunctions φ_i(r). In the context of periodic solids, the vector k and band ν indices replace the generic index i; the Bloch vector k is an element of a three-dimensional Brillouin zone discretized over a finite set of values, called the set of k-points. In the FLAPW method [4, 5], the orbital functions φ_{k,ν}(r) are expanded in terms of a basis function set ψ_G(k, r) indexed by vectors G lying in the lattice reciprocal to the configuration space

\phi_{\mathbf{k},\nu}(\mathbf{r}) = \sum_{|\mathbf{G}+\mathbf{k}| \le K_{\max}} c^{\mathbf{G}}_{\mathbf{k},\nu}\, \psi_{\mathbf{G}}(\mathbf{k},\mathbf{r}).    (4)

In FLAPW, the configuration (physical) space of the quantum sample is divided into spherical regions – called Muffin-Tin (MT) spheres – centered around the atomic nuclei, and interstitial areas between the MT spheres. Within the volume of the solid's unit cell Ω, the basis set ψ_G(k, r) takes a different expression depending on the region

\psi_{\mathbf{G}}(\mathbf{k},\mathbf{r}) =
\begin{cases}
\frac{1}{\sqrt{\Omega}}\, e^{i(\mathbf{k}+\mathbf{G})\cdot\mathbf{r}} & \text{Interstitial} \\
\sum_{l,m} \left[ a^{\alpha,\mathbf{G}}_{lm}(\mathbf{k})\, u^{\alpha}_{l}(r) + b^{\alpha,\mathbf{G}}_{lm}(\mathbf{k})\, \dot{u}^{\alpha}_{l}(r) \right] Y_{lm}(\hat{\mathbf{r}}_{\alpha}) & \text{MT}.
\end{cases}

For each atom α, the coefficients a^{α,G}_{lm}(k) and b^{α,G}_{lm}(k) are determined by imposing continuity of the wave functions φ_{k,ν}(r) and their derivatives at the boundary of the MT sphere. The Y_{lm}(r̂_α) are spherical harmonics of the α-atom, r̂_α ≡ r_α/r_α is a unit vector, and r_α is the distance from the MT center. The radial functions u^α_l(r) and their energy derivatives u̇^α_l(r) are obtained by a simplified Schrödinger
Figure 1:
A schematic rendering of the principal stages of the self-consistent cycle. Colors indicate the breakdown of the computing time of each stage of the cycle. Each "time usage" is based on an average over simulations of very diverse physical systems. Each simulation was run on a single core using a sequential version of the FLEUR code.

equation, written for a given energy level E_l, containing only the spherical part of the effective potential v_ph(r)

\left\{ -\frac{\hbar^2}{2m} \frac{\partial^2}{\partial r^2} + \frac{\hbar^2}{2m} \frac{l(l+1)}{r^2} + v_{ph}(r) - E_l \right\} r\, u^{\alpha}_{l}(r) = 0.    (5)

Thanks to this expansion, the KS equations naturally translate to a set of generalized eigenvalue problems

\sum_{\mathbf{G}'} \left[ A_{\mathbf{G}\mathbf{G}'}(\mathbf{k}) - \lambda_{\mathbf{k}\nu}\, B_{\mathbf{G}\mathbf{G}'}(\mathbf{k}) \right] c^{\mathbf{G}'}_{\mathbf{k}\nu} = 0    (6)

where the coefficients of the expansion c^{G'}_{kν} are the eigenvectors, while the Hamiltonian and overlap matrices A and B are given by volume integrals and a sum over all MT spheres

\{ A_{\mathbf{G}\mathbf{G}'}(\mathbf{k}),\, B_{\mathbf{G}\mathbf{G}'}(\mathbf{k}) \} = \sum \int \psi^{*}_{\mathbf{G}}(\mathbf{k},\mathbf{r})\, \{ \hat{H},\, \hat{S} \}\, \psi_{\mathbf{G}'}(\mathbf{k},\mathbf{r})\, d\mathbf{r}.    (7)

In practical numerical computations, a solution is reached by setting up a multi-stage cycle (Fig. 1). An initial educated guess for n(r) is used to calculate the effective full-potential v[n] using the Pseudo-Charge method [6] in combination with Fast Fourier Transform (FFT) methods. The potential, in turn, is inserted into the simplified Schrödinger equation (5) whose solutions, together with the coefficients a^{α,G}_{lm}(k) and b^{α,G}_{lm}(k), lead to the basis functions ψ_G(k, r). The latter are used to calculate the entries of the matrices A(k) and B(k), an operation that requires the computation of three-dimensional spherical integrals [eq. (7)] for the non-spherical part of the potential appearing in Ĥ. In the next stage the matrices just computed are the input to dozens to hundreds of generalized eigenvalue problems [eq. (6)] that are solved simultaneously.
Each eigenproblem is of the form Ax = λBx, where both A and B are dense hermitian matrices, B is additionally positive definite, and x and λ form a sought-after eigenpair. In FLAPW-related applications, usually only a fraction of the lower part of the spectrum is computed and retained, based on the Fermi energy value. The stored eigenpairs are then used to evaluate the ground state energy of the physical system, a step which is followed by the computation of a new charge density n'(r).

At the end of the cycle, convergence is checked by comparing n'(r) with n(r). If |n'(r) − n(r)| > η, where η is the required accuracy, a suitable mixing of the two densities is selected as a new guess, and the cycle is repeated. This process is properly referred to as an outer-iteration of the DFT self-consistent cycle. Convergence is guaranteed by the Hohenberg-Kohn theorem [3] stating that there exists a unique electron density n(r) locally minimizing an energy functional E[n] closely related to the Hamiltonian operator Ĥ.

In sum, the FLAPW self-consistent scheme is formed by a series of outer-iterations, each one containing multiple large generalized eigenproblems. In order to numerically compute the charge density n(r) at each iteration, the matrices A and B need to be initialized for each k-point and the generalized eigenproblem Ax = λBx solved. These two stages are the most machine-time consuming parts of the cycle, each accounting for between 40% and 48% of the total computational time (Fig. 1). Moreover, the more complex the material, the larger the matrices and the slower the convergence, resulting in an increase in the number of outer-iterations.
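The kernel of each outer-iteration can be made concrete with a small numerical sketch. The snippet below is not taken from FLEUR; it uses random stand-ins for the Hamiltonian and overlap matrices of eqs. (6)-(7) (with toy sizes) and shows how a direct dense solver computes only the lowest portion of the spectrum of a generalized Hermitian eigenproblem Ax = λBx, mimicking the Fermi-level selection described above.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n, n_occ = 200, 20   # toy matrix size and number of retained eigenpairs

# Random stand-ins: A Hermitian (Hamiltonian-like), B Hermitian positive
# definite (overlap-like) -- cf. eqs. (6)-(7).
M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = (M + M.conj().T) / 2
L = rng.standard_normal((n, n))
B = L @ L.T + n * np.eye(n)          # diagonally boosted -> positive definite

# Direct solver for A x = lambda B x, keeping only the lowest n_occ
# eigenpairs, as done for the occupied part of the spectrum.
lam, X = eigh(A, B, subset_by_index=[0, n_occ - 1])

# The computed eigenvectors are B-orthonormal and satisfy the
# generalized problem up to numerical accuracy.
assert np.allclose(X.conj().T @ B @ X, np.eye(n_occ), atol=1e-10)
assert np.allclose(A @ X, (B @ X) * lam, atol=1e-8)
```

In a full simulation this solve is repeated for every k-point and every outer-iteration, which is why its cost dominates the cycle together with the matrix initialization.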
3. An alternative viewpoint
The results presented in this paper originate from the deliberate choice of studying the DFT self-consistent cycle from a different perspective. The entire outer-iterative process is regarded as a set of sequences of eigenproblems P_i. This interpretation is based on the observation that, for each k-point, the solution of a problem at a certain iteration P_i(k) is a prerequisite for setting up the next one, P_{i+1}(k).

Considering the single eigenproblems P_i (the k index is suppressed for the sake of simplicity) to be part of a sequence {P_i} can have far-reaching consequences: it might help to unravel correlations among them and ultimately lead to the conception of an entirely different computational approach to solving them as the simulation progresses. Since DFT is one of the most important ab initio electronic structure frameworks, the study of a computational procedure that would lead to high-performance solutions to {P_i} is of crucial importance.

In order to study the evolution of the generalized eigenproblems as part of the sequence {P_i}, we focus our attention on the transformation of the eigenvectors and the variation of the entries of the Hamiltonian matrix A (the same could be done for the overlap matrix B). For a fixed k, each eigenvector at iteration i is compared with its corresponding eigenvector at iteration i+1. A similar comparison is performed between the values of the entries of adjacent Hamiltonian matrices. Despite the apparent simplicity of the strategy, the realization of a comparative tool that quantitatively describes the evolution of {P_i} is not a trivial matter.

In this section we focus on a generic sequence and describe a procedure to study the evolution of the eigenvectors solving P_i.
The results obtained are independent of k, since each such point represents a vector in the Brillouin zone for which there is an independent sequence of eigenproblems. As a consequence, our analysis can be applied to any sequence in the simulation. In order to carry out our plan, we need an associative criterion that allows comparison between eigenvectors of successive iterations. This is not a simple task, since the ordering of a set of eigenpairs can change substantially from one iteration to the next. For instance, one could arrange the eigenvectors by the increasing magnitude of their respective eigenvalues and compare two eigenvectors, say x^(i)_ℓ and x^(i+1)_ℓ, with the same eigenvalue index ℓ. This naive comparison is bound to fail due to the fact that eigenvalues close in magnitude often swap positions across iterations. Consequently, identifying eigenvectors becomes rather difficult as the sequence advances. The nature of the self-consistent process interferes with the ability to find a one-to-one correspondence between vectors of neighboring iterations.

For a correct comparison between adjacent eigenvectors we developed an algorithm that establishes a one-to-one correspondence based on two observations: 1) a DFT simulation is basically a minimization procedure, and as such favors small eigenpair variations in its progress towards convergence, and 2) all eigenpairs contribute more or less "democratically" to the progression of the sequence. Although not fully mathematically rigorous, these statements find their a posteriori justification in the formal comparison between orbitals of two successive iteration cycles (see Appendix A), and translate directly into two specific behaviors of the eigensolutions. First, the scalar products between an eigenvector x^(i)_j at iteration i and any of the eigenvectors x^(i+1)_ℓ at iteration i+1 have a Gaussian distribution narrowly peaked around one value σ^(i)_j.
Second, the set of largest scalar products, {σ^(i)_j}, has a flat and almost constant distribution. In mathematical terms, these two behaviors can be written as

\forall j\ \exists!\ \bar{\ell} : \quad \sigma^{(i)}_{j} \doteq \langle x^{(i)}_{j},\, x^{(i+1)}_{\bar{\ell}} \rangle \gg \langle x^{(i)}_{j},\, x^{(i+1)}_{\ell} \rangle \Big|_{\ell \neq \bar{\ell}}    (8)

\forall (j_1, j_2) : \quad \frac{\left| \sigma^{(i)}_{j_1} - \sigma^{(i)}_{j_2} \right|}{\sigma^{(i)}_{j_1} + \sigma^{(i)}_{j_2}} \ll 1.    (9)

These observations motivated the design of a routine that, without claiming to be unique or optimized, succeeds in correctly relating two successive eigenvectors. Specifically, it identifies, for each eigenvector x^(i)_j, the largest scalar product σ^(i)_j subject to the condition that the (i+1)-iteration index ℓ̄ associated with j is not associated with any other j' ≠ j. By design, this procedure establishes a one-to-one correspondence between the eigenvectors of successive iterations, x^(i)_j ⇔ x^(i+1)_ℓ̄, whose information is stored in a permutation operator Π

\forall j\ \exists!\ \bar{\ell} : j = \Pi(\bar{\ell}) \quad \text{and} \quad \forall \ell\ \exists!\ \bar{j} : \ell = \Pi^{-1}(\bar{j}).    (10)

Using Π, the column positions in the matrix of scalar products ⟨x^(i)_j, x^(i+1)_{Π(ℓ)}⟩ can be rearranged so as to easily obtain the largest scalar products {σ^(i)_j} from the main diagonal. From this matrix we can easily extract the subspace deviation angles, automatically normalized to one, between corresponding eigenvectors of adjacent iterations

\theta^{(i)}_{j} = \mathrm{diag}\left( 1 - \langle x^{(i)}_{j},\, x^{(i+1)}_{\Pi(\ell)} \rangle \right).    (11)

These angles provide the means for studying the evolution of the eigenvectors of the sequence of generalized eigenproblems {P_i}. Collecting all the angles computed in one simulation results in a large set of data (there are N = dim(A) angles for each iteration and each k).
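The pairing routine just described can be sketched as follows. This is an illustrative reconstruction, not the authors' Matlab toolkit: it greedily pairs the eigenvector columns of two successive iterations by visiting scalar products in decreasing order, so that each (i+1)-index is assigned to exactly one j, and then reads the deviation angles of eq. (11) off the matched overlaps. Taking the absolute value of the scalar products is an added assumption, used to ignore arbitrary eigenvector phases.

```python
import numpy as np

def match_eigenvectors(X_i, X_ip1):
    """Greedy one-to-one pairing of the columns of X_i (iteration i) with
    the columns of X_ip1 (iteration i+1) by largest mutual overlap.
    Assumes the same number of columns in both matrices.
    Returns pi (pi[j] = matched column index in X_ip1) and the deviation
    angles theta_j = 1 - |<x_j^(i), x_pi(j)^(i+1)>|, normalized to [0, 1]."""
    S = np.abs(X_i.conj().T @ X_ip1)      # matrix of scalar products
    m = S.shape[0]
    pi = np.full(m, -1)
    taken = np.zeros(m, dtype=bool)
    # visit overlaps in decreasing order; assign each pair at most once
    for flat in np.argsort(S, axis=None)[::-1]:
        j, l = np.unravel_index(flat, S.shape)
        if pi[j] == -1 and not taken[l]:
            pi[j] = l
            taken[l] = True
    theta = 1.0 - S[np.arange(m), pi]
    return pi, theta
```

On exact data (a mere reordering of orthonormal eigenvectors) the routine recovers the inverse column permutation and all angles vanish; on simulation data the angles reproduce θ^(i)_j up to the phase convention noted above.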
For our statistical analysis, we manipulate the angles so as to plot them in three different ways, depending on which parameter characterizing the data is kept fixed. First, fixing the iteration index and a specific eigenvalue, we look at how the angles are distributed among the k-points. Then, we choose a random k and look at how all the deviation angles vary as the sequence progresses. Finally, we select an eigenvalue and examine the evolution of the angles for all k as the iteration index increases.

In order to perform the entire computational process, from eigenvector pairing to deviation angle plotting, we built a Matlab analysis toolkit. The input is the set of matrices A and B of all the eigenproblems appearing in the sequences of a simulation. Simulations of the physical systems analyzed were performed using the FLEUR code [8] running on JUROPA, a powerful cluster-based computer operating in the Supercomputing Centre of the Forschungszentrum Jülich. For each physical system studied we produced outputs for a consistent range of parameters.

Table 1: Simulation data
Material    k-points    iterations    size
Fe_ℓ        15          27            400
ZnO         40          9             490
CaFe₂As₂    15          30            2600
We present here a numerical study for two typical physical systems. The first one is a 5-layer film of iron, denoted by Fe_ℓ, with a (100) surface orientation, modeled by a simple tetragonal lattice containing 5 atoms in the unit cell embedded in two semi-infinite vacua. The second example, zinc oxide, is an ionically bonded material arranged on a wurtzite lattice – a multilayered hexagonal lattice with 2 Zn and 2 O atoms per unit cell. For each material we ran a simulation whose specifics are described in Table 1.
Figure 2:
Histograms showing a qualitative distribution of the deviation angles of each k-point eigenvector corresponding to the four lowest eigenvalues at a fixed iteration. All angles are normalized to one (1.00 = π/2). Above: the Fe_ℓ case plotted at the 3rd iteration. Below: the zinc oxide case.

For each case study we plot a set of histograms showing, for all k-points, the distribution of angle deviations for the four smallest eigenvalues at a specifically chosen iteration (Fig. 2). In each histogram the distribution is sharply peaked at the lowest end of the interval and has null or only negligible tails. This result supports our analysis as to the "democratic" contribution of all angles to the progression of the sequence [eq. (9)].
Figure 3:
Eigenvector angles of successive iterations for the iron multi-layer: evolution of angles for all eigenvectors of the sequence corresponding to k-point 11 (above), and evolution of angles for all 15 k-points of the eigenvectors corresponding to the 7th lowest eigenvalue (below). All the angles are normalized to one (1.00 = π/2).

For both physical systems we also show two diagrams, one that plots angles for all eigenvectors at one specific k-value and the other that presents the angles of the j-th eigenvector at each k-point (Figs. 3, 4). Both graphs are plotted against the iteration index on a semi-log scale to better display the evolution of the deviation angles. We can immediately notice the almost monotonic decrease of the deviation angles as the sequences progress towards convergence. Small upward oscillations are probably due to an excess of localized charge that may cause a partial restart of the sequence. We have also observed that the angles corresponding to the lowest 20% of the spectrum are, on average, higher than the rest.

(In all graph labels the abbreviation "eigs" stands for eigenvalues, while the symbol "b/w" stands for between.)
Figure 4:
Eigenvector angles of successive iterations for zinc oxide: evolution of angles for all eigenvectors of the sequence corresponding to k-point 21 (above), and evolution of angles for all 40 k-points of the eigenvectors corresponding to the 7th lowest eigenvalue (below). All the angles are normalized to one (1.00 = π/2).

In all other multi-atomic systems studied, besides the ones shown here, the great majority of angles after the 3rd or 4th iteration are very small. Contrary to intuition, the simulation is far from converged at this stage, implying again a sort of "democracy" of contribution, where all eigenvectors positively influence the process of minimizing the energy functional that depends on n(r). This behavior has a universal character, since we observed it in the bulk, layer, metallic, and ionic materials we analyzed.

In order to give a more quantitative flavor of the eigenvector evolution, we have tabulated the mean angle value θ̄^(i) for an iteration at the beginning and one at the end of the simulation (Table 2). Due to their physical relevance, we used only deviation angles of those eigenvectors whose eigenvalues represent energies below the Fermi level. As can readily be seen, the mean values at the end of the simulation are considerably smaller than those at the beginning, confirming the qualitative picture described above.

Table 2: Deviation Angle Means

Material    θ̄ (start)      θ̄ (end)
Fe_ℓ        44 (11.0 %)    1.10 · 10⁻
ZnO         27 (5.6 %)     7.29 · 10⁻

We systematically look at the variations in the entries of adjacent A matrices for two main reasons. First, we plan to develop a method that identifies those portions of the entries of A that undergo little or no change at all. Eventually this method could be used to avoid recalculating those entries at each iteration, in doing so saving computing time. Second, we believe that the connection between successive eigenvectors should somehow surface in how much the matrices defining the eigenproblems vary across iterations: both are the indirect consequence of changes in the set of basis wave functions ψ_G(k, r).

Despite its manifest simplicity, the comparison between matrix entries across adjacent iterations can be rather non-trivial. In fact, the variations of the single entries of A^(i)_k with respect to A^(i+1)_k span a range of several orders of magnitude and need to be opportunely rescaled. Our initial strategy is to normalize all variations so as to map them onto a [0, 1] interval, and to introduce a cut-off that discards all variations below a certain value. This strategy helps in identifying those areas of the matrices where the entries undergo relatively large variations; it also allows us to study the percentage of entries that varies as a function of the cut-off value. Eventually, one can determine the value of the threshold that might be chosen for saving computing time while only minimally compromising the accuracy of the eigensolutions (i.e., speed vs. accuracy).

First we had to establish the most appropriate metric to gauge the relative size of entry variation. The choice of the metric influences the mapping of the variations onto the specified interval. In this study we chose the maximal entry variation for each matrix difference, δ^(i) = max(|A^(i+1)_k − A^(i)_k|), and normalized each entry of the difference with respect to it. The entries of the resulting matrix Ã^(i)_k = |A^(i+1)_k − A^(i)_k| / δ^(i) are clearly mapped onto the [0,
1] interval. The threshold is then measured as a fraction of δ^(i), given by the cut-off value p_t, a number ∈ [0, 1] (e.g., p_t = 0.1 retains only those entries of Ã^(i)_k that are larger than or equal to 10% of δ^(i)). It has to be noted that, contrary to common intuition, the lower the cut-off value p_t, the greater the number of non-zero entries of Ã. We complement this analysis with a more conventional approach, where the median of the entries of A^(i)_k is compared with the median of the entries of the difference |A^(i)_k − A^(i−1)_k| as the sequence progresses.

All the Ã^(i)_k extracted from the simulation of a given physical system were analyzed, at a fixed k, for different values of the cut-off and for different iteration levels i. As for the eigenvector evolution, the input for our analysis is the set of matrices A that defines the eigenproblems appearing in the sequences of a simulation. All the simulations of the physical systems were performed using the FLEUR code running on JUROPA.
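A minimal sketch of the normalization and cut-off analysis just described (our own illustration, not the authors' toolkit): the entry-wise difference of two adjacent Hamiltonian matrices is rescaled by its largest element δ^(i), mapping it onto [0, 1], and the fraction of entries at or above a cut-off p_t is reported.

```python
import numpy as np

def variation_fraction(A_i, A_ip1, p_t):
    """Fraction of entries of the rescaled difference matrix
    A_tilde = |A^(i+1) - A^(i)| / delta^(i),  delta^(i) = max |A^(i+1) - A^(i)|,
    that lie at or above the cut-off value p_t in [0, 1]."""
    D = np.abs(A_ip1 - A_i)
    A_tilde = D / D.max()        # entries mapped onto the [0, 1] interval
    return np.mean(A_tilde >= p_t)

# Toy illustration: one large and one small change in a 10x10 matrix.
A0 = np.zeros((10, 10))
A1 = np.zeros((10, 10))
A1[0, 0] = 1.0                   # sets delta^(i) = 1
A1[0, 1] = 0.05                  # a variation at 5% of delta^(i)
print(variation_fraction(A0, A1, 0.10))   # only the large entry survives
print(variation_fraction(A0, A1, 0.01))   # lower cut-off -> more entries, as in the text
```

The toy run reproduces the counter-intuitive behavior noted above: lowering p_t increases the number of entries counted as "varying".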
Figure 5:
Visualization of Ã^(i)_k for k = 1 at two different iteration cycles, excerpted from a CaFe₂As₂ simulation. Above: plot of all the entries above the cut-off value p_t = 0.10 at iteration i = 4. Below: plot of all the entries above the cut-off value p_t = 0.15 at iteration i = 26.

As an example of our approach we present the analysis of a simulation of a superconducting compound, denoted by CaFe₂As₂, that undergoes a first-order phase transition from a high-temperature tetragonal phase to a low-temperature orthorhombic phase. The specific characteristics of this simulation are listed in Table 1.

In Fig. 5 we first give a qualitative picture of the portion of A that changes, for a specific k-point, at two distant iterations and different cut-off values. Two distinct observations can be made. On the one hand, the empty portions of Ã tend to preserve their shape and position as the cut-off increases; in other words, those parts of A that do not vary much seem to follow a specific pattern independent of p_t. On the other hand, the percentage of entries that undergo variation does not seem to be affected by the progress of the sequence of iterations.
Figure 6:
Percentage of varying matrix entries plotted versus cut-off values on a semilogarithmic scale for the CaFe2As2 system, for k = 1 and i = 16.

The fact that such qualitative behavior is evident in all of the physical systems investigated suggests that it is a "universal" trait of DFT-based simulations. This conclusion implies that, despite the fact that the basis function set ψ_G(k, r) changes substantially between successive iterations, for certain subsets of basis functions such a change contributes very little to the volume integrals in eq. (7).

In Fig. 6 we give a more quantitative description, at a fixed iteration, of the number of matrix entries that change as the threshold value increases. It can be noted that the percentage of varied entries becomes significant only for p_t ≤ 0.1. This behavior indicates that overall there are very few entries undergoing major changes; most of the variations are concentrated at the low end of the metric (i.e. entries varying by less than p_t × δ^(i), with p_t < 0.1), and such variations within {P_i} may have different origins.

In Fig. 7 we have also compared the median and maximum value of |A^(i)_k − A^(i−1)_k| with the median of the entries of A^(i)_k. As the simulation progresses, the median value of the Hamiltonian matrix remains approximately constant (green line), while the maximum and the median value of the matrix variation decrease. Moreover, the median value of |A^(i)_k − A^(i−1)_k| is, on average, 1-2 orders of magnitude lower than that of A^(i)_k, as can also be seen from Table 3.

Can the large number of unchanging entries be used to speed up computations at every DFT iteration? In order to answer this question, one needs to understand how the trade-off between speed and accuracy depends on the choice of cut-off value in relation to the iteration index i. On the one hand, the patterns and distribution of entries that undergo very small variation could be exploited so as to avoid computing them anew in each iteration. On the other hand, the evolution of entry variation suggests that updating some of the entries could be avoided entirely after a certain iteration.
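The statistics reported in Fig. 6, Fig. 7, and Table 3 can be made concrete with a short sketch (our own Python illustration, not the code used to produce the paper's data; in particular, taking δ^(i) to be the maximum absolute entry variation at iteration i is our assumption about the normalization of the cut-off):

```python
import numpy as np

def variation_stats(A_curr, A_prev, cutoffs):
    """Statistics of the entry-wise variation |A^(i) - A^(i-1)|.

    delta is taken here to be the maximum absolute variation at this
    iteration (our assumption for the normalization of the cut-off p_t).
    """
    diff = np.abs(A_curr - A_prev)
    delta = diff.max()
    return {
        "median_A": float(np.median(np.abs(A_curr))),
        "median_diff": float(np.median(diff)),
        "max_diff": float(delta),
        # percentage of entries varying by more than p_t * delta
        "percent_varying": {p: 100.0 * np.mean(diff > p * delta) for p in cutoffs},
    }

# Toy stand-ins for two successive Hamiltonians A^(i-1) and A^(i).
rng = np.random.default_rng(0)
A_prev = rng.standard_normal((100, 100))
A_curr = A_prev + 1e-2 * rng.standard_normal((100, 100))
stats = variation_stats(A_curr, A_prev, cutoffs=(0.5, 0.15, 0.01))
```

Sweeping `cutoffs` over a logarithmic range of p_t values yields the kind of curve shown in Fig. 6, while `median_A`, `median_diff`, and `max_diff` correspond to the quantities tracked in Fig. 7 and Table 3.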
Figure 7: Median of the matrix entries of A^(i), and median and maximum of the matrix entries of |A^(i) − A^(i−1)|, plotted versus the iteration index for the CaFe2As2 system for k = 1; the plot also reports the percentage of varying entries for p_t = 0.15.

Our study makes manifest that there is room for developing new methods to save computational time in the process of updating the matrices that define the generalized eigenproblems in {P_i}. These methods and their optimization are still the object of current research, and we refer the reader to future publications for more conclusive results.

Table 3: Matrix entry medians
Iteration index (i) | Median of A^(i) | Median of |A^(i) − A^(i−1)| | Maximum of |A^(i) − A^(i−1)|
10 | 1.22 × 10^… | … | …
15 | 1.72 × 10^… | … | …
20 | 1.29 × 10^… | … | …
25 | 1.47 × 10^… | … | …

4. Correlation and exploitation

In the previous section we analyzed DFT-based simulations from a non-conventional perspective. Departing from the traditional picture that considers the eigenproblems in each iterative cycle of a simulation as independent, we instead assumed that they form a set of sequences of eigenproblems {P_i(k)}. We then provided experimental evidence of a connection between problems with consecutive iteration indices. In particular, we uncovered a strong correlation between eigenvectors of successive problems, as well as the existence of unchanging patterns in the matrices defining the eigenproblems. We illustrated how this correlation is noticeably linked to the convergence process of the simulation: as the iteration index increases, the eigenvector deviation angles become, on average, smaller. We did not observe an equivalent drop in the variation of the matrix entries of the Hamiltonian, but only in its maximum and median values. Nonetheless, it is evident that the sequences of eigenproblems {P_i} undergo a significant evolution.

The importance of this result stems from the fact that the correlation is quite unexpected. Since each single problem P(k) at iteration i+1 is determined by the orbital wave functions obtained from the solution of all the problems {P(k)} at iteration i, the connection between P_i and P_{i+1} is non-linear and presumably very weak. Until recently it was believed that such weak non-linearity would hide any sign of correlation, a conviction that inhibited research efforts in this direction. The source of non-linearity resides in the two indirect ways each P_i is influenced by the basis functions ψ_G(k, r) computed at each new iteration. First, the matrix entries defining the eigenproblems are given by volume integrals involving basis functions [eq.
(7)], not orbital functions; second, the eigenvectors are the n-tuples of coefficients expressing the orbital wave functions as linear combinations of basis functions [eq. (4)]. As a consequence of these two considerations, the eigenvectors were long assumed to be only very loosely connected; the discovery that the contrary is true compels us to attach great relevance to the evidence of a strong correlation.

While the discovery of a strong correlation between the eigenproblems of a sequence {P_i} is in itself an important result, it is all the more so because it opens the way to the exploration of new computational strategies. In particular, the performance of the entire DFT simulation could be improved by boosting the performance of the sequences of eigenproblems. The idea is to take advantage of repetitive patterns in the eigenpencil (Hamiltonian and overlap matrices) and in the eigenvector evolution. In practice, the key element would be to reuse, for the solution of the eigenproblems at a certain iteration, numerical quantities computed in a previous one. To clarify this concept, we briefly describe in the following the implications of eigenvector manipulation alone.

Traditionally, large dense eigenproblems are solved using direct methods whenever the sought-after fraction of the spectrum exceeds a few percent. In the opposite scenario, iterative methods are mostly used for sparse eigenproblems when the fraction of eigensolutions required is very small. Due to the large number of matrix-vector operations, iterative eigensolvers do not perform well for dense problems and sometimes, in the presence of tight clusters of eigenvalues, even fail to converge. These considerations do not necessarily apply to sequences of dense correlated eigenproblems {P_i} when the fraction of eigenpairs sought is somewhat lower than 10%.
In this case the reuse of information can appreciably improve the performance of selected iterative solvers. Far from presenting conclusive evidence, we point out that the evolution of eigenvectors could be exploited to improve the performance of iterative solvers and, eventually, make them competitive with direct solvers when applied to sequences of eigenproblems. This result will depend on two distinct key properties of the iterative method of choice: 1) the ability to be fed previously computed eigenvectors and take advantage of them as a starting guess to efficiently compute the new ones, and 2) the capacity to solve simultaneously for a substantial portion of the eigenpairs. The first property would result in filtering away, as efficiently as possible, the unwanted components of the old eigenvectors. The second implies a moderate-to-substantial speedup for the matrix-vector multiplications as well as improved convergence. This is part of a study that is underway and will be presented in a future publication.
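Property 1) can be sketched with SciPy's LOBPCG solver, which accepts a block of starting vectors. In this minimal illustration (our own, not the solver studied in the paper), random Hermitian matrices stand in for two successive problems P_{i-1} and P_i, and the converged eigenvectors of the first are fed as the initial guess for the second:

```python
import numpy as np
from scipy.sparse.linalg import lobpcg

rng = np.random.default_rng(1)
n, nev = 100, 6

# Two "successive" Hermitian problems: A_next is a small perturbation of A.
# A well-separated spectrum is chosen so the toy example converges quickly.
E = rng.standard_normal((n, n))
A = np.diag(np.arange(1.0, n + 1.0)) + 1e-2 * (E + E.T) / 2
P = rng.standard_normal((n, n))
A_next = A + 1e-3 * (P + P.T) / 2

# Cold start for P_{i-1}: a random block of nev vectors.
X0 = rng.standard_normal((n, nev))
w_prev, V_prev = lobpcg(A, X0, largest=False, tol=1e-6, maxiter=500)

# Warm start for P_i: reuse the converged eigenvectors of P_{i-1}.
w_next, V_next = lobpcg(A_next, V_prev, largest=False, tol=1e-6, maxiter=500)
```

Because A_next differs only slightly from A, the warm-started run needs far fewer iterations than the cold one; this mirrors the reuse of eigenvector information along the sequence {P_i}.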
5. Conclusions
The results described in this paper are an example of how it is feasible to study a mathematical model by reverse induction. Starting from simulations, we showed how it is possible to construct a method to analyze potential improvements of the algorithmic realization of the mathematical model on which the simulations are based. This approach reverses the usual direction from theoretical model to experiment via computer-based simulations; in other words, it is an example of how to look at DFT-based computations as an inverse problem. As such, we refer to our approach as a "reverse simulation" method.

In this work we utilized the "reverse simulation" approach as a tool to investigate the properties of the non-linear generalized eigenproblem arising from the Kohn-Sham equations. In general, such an eigenproblem is linearized through the introduction of a self-consistent cycle that is repeated until convergence is reached. The non-linear problem then translates into a series of eigenproblems that are solved in total independence from each other. We showed that in reality these eigenproblems are strongly correlated and constitute a sequence. We finally suggested that the correlation discovered here could be exploited by a class of opportunely designed iterative solvers.

This result is of great impact for the community of computational physicists working in the wide field of materials science. It could be the first step towards a change of computational paradigm, especially in view of the increasing trend toward massively parallel architectures. By giving "dense" DFT methods, like FLAPW, access to iterative solvers, one will effectively increase their scalability.
Conversely, higher scalability would empower these ab initio methods with the capability of investigating larger physical systems that are currently out of reach. Besides affecting DFT methods that have until now precluded the use of iterative solvers, the methodological viewpoint used here may have consequences reaching far beyond an improved computational approach to the simulations: it could allow us to go beyond the conventional FLAPW method and create a more efficient mathematical paradigm.
6. Acknowledgements
We thank Dr. Daniel Wortmann, Dr. Gustav Bihlmayer, and Gregor Michalicek for discussions and their help in dealing with the FLEUR code. The computations were performed under the auspices of the Jülich Supercomputing Centre at the Forschungszentrum Jülich, whose generous support of CPU time is hereby acknowledged. We would also like to thank the AICES graduate school for hosting some of the authors and contributing to the success of the project.

Financial support from the following institutions is gratefully acknowledged: the JARA-HPC through the Midterm Seed Funds 2009 grant, the Deutsche Forschungsgemeinschaft (German Research Foundation) through grant GSC 111, and the Volkswagen Foundation through the fellowship "Computational Sciences".
Appendix A. A formal justification of the eigenvector computational scheme
The formal study of eigenvector evolution is based on the two statements given at the beginning of section 3.1.1. The first of these statements simply expresses the fact that the orbital functions that make up the charge density must already be a very good guess for the exact wavefunctions in order for the self-consistent cycle to converge. This fact translates into the following formal analysis.

Eq. (4), written for the orbital functions at a certain iteration (i), reads

    \phi^{(i)}_{\mathbf{k},\nu}(\mathbf{r}) = \sum_{|\mathbf{G}+\mathbf{k}| \le K_{\max}} c^{(i)}_{\mathbf{G}\mathbf{k},\nu}\, \psi^{(i)}_{\mathbf{G}}(\mathbf{k},\mathbf{r}).    (A.1)

If we take the scalar product of two \phi's with different band indices but the same k-point and the same iteration, we straightforwardly obtain

    \int d\mathbf{r}\, \phi^{*(i)}_{\mathbf{k},\nu}(\mathbf{r})\, \phi^{(i)}_{\mathbf{k},\mu}(\mathbf{r})
      = \sum_{|\mathbf{G}+\mathbf{k}|,\,|\mathbf{G}'+\mathbf{k}| \le K_{\max}} c^{*(i)}_{\mathbf{G}'\mathbf{k},\nu}\, c^{(i)}_{\mathbf{G}\mathbf{k},\mu} \int d\mathbf{r}\, \psi^{*(i)}_{\mathbf{G}'}(\mathbf{k},\mathbf{r})\, \psi^{(i)}_{\mathbf{G}}(\mathbf{k},\mathbf{r})
      = \sum_{|\mathbf{G}+\mathbf{k}|,\,|\mathbf{G}'+\mathbf{k}| \le K_{\max}} c^{*(i)}_{\mathbf{G}'\mathbf{k},\nu}\, S^{(i)}_{\mathbf{G}'\mathbf{G}}\, c^{(i)}_{\mathbf{G}\mathbf{k},\mu}
      = \langle x_\nu\, x_\mu \rangle = \delta_{\nu\mu}.    (A.2)

By repeating the same computation for \phi's from two successive iterations, we arrive at

    \int d\mathbf{r}\, \phi^{*(i)}_{\mathbf{k},\nu}(\mathbf{r})\, \phi^{(i+1)}_{\mathbf{k},\mu}(\mathbf{r})
      = \sum_{|\mathbf{G}+\mathbf{k}|,\,|\mathbf{G}'+\mathbf{k}| \le K_{\max}} c^{*(i)}_{\mathbf{G}'\mathbf{k},\nu}\, c^{(i+1)}_{\mathbf{G}\mathbf{k},\mu} \int d\mathbf{r}\, \psi^{*(i)}_{\mathbf{G}'}(\mathbf{k},\mathbf{r})\, \psi^{(i+1)}_{\mathbf{G}}(\mathbf{k},\mathbf{r})
      = \sum_{|\mathbf{G}+\mathbf{k}|,\,|\mathbf{G}'+\mathbf{k}| \le K_{\max}} c^{*(i)}_{\mathbf{G}'\mathbf{k},\nu}\, \tilde S_{\mathbf{G}'\mathbf{G}}\, c^{(i+1)}_{\mathbf{G}\mathbf{k},\mu},    (A.3)

where \tilde S_{\mathbf{G}'\mathbf{G}} is an asymmetric positive definite matrix and as such can be factorized according to Cholesky as

    \tilde S_{\mathbf{G}'\mathbf{G}} = \sum_{|\mathbf{G}''+\mathbf{k}| \le K_{\max}} \tilde L_{\mathbf{G}'\mathbf{G}''}\, \tilde L^{T}_{\mathbf{G}''\mathbf{G}},    (A.4)

with \tilde L_{\mathbf{G}'\mathbf{G}''} being a lower triangular matrix.

Let us momentarily assume the correctness of the statements in section 3.1.1. The first statement implies that the left-hand side of Eq.
(A.3) can be expanded as \langle x^{(i)}_\nu\, x^{(i+1)}_\mu \rangle = \delta_{\nu\mu} + \epsilon E_{\nu\mu} + o(\epsilon), with \epsilon being a generic numerical expansion parameter. Correspondingly, from the second statement we conclude that the matrix E_{\nu\mu} should be of quasi-block-diagonal form with a dominant diagonal, which enforces only small rotations or mixing. In the same fashion we can expand the \tilde S_{\mathbf{G}'\mathbf{G}} on the right-hand side:

    \tilde L\, \tilde L^{T} = L^{(i)} \left( I + \epsilon D + o(\epsilon) \right) L^{(i+1)\,T},    (A.5)

where L^{(i)} and L^{(i+1)} denote the lower triangular matrices decomposing S^{(i)}_{\mathbf{G}'\mathbf{G}} and S^{(i+1)}_{\mathbf{G}'\mathbf{G}}, respectively. Combining these two expansions, we arrive at the conclusion that

    \langle x^{(i)}_\nu\, x^{(i+1)}_\mu \rangle
      = c^{*(i)}_{\mathbf{k},\nu}\, L^{(i)} L^{(i+1)\,T}\, c^{(i+1)}_{\mathbf{k},\mu}
      = \delta_{\nu\mu} + \epsilon \left[ E_{\nu\mu} - c^{*(i)}_{\mathbf{k},\nu}\, L^{(i)} D\, L^{(i+1)\,T}\, c^{(i+1)}_{\mathbf{k},\mu} \right] + o(\epsilon).    (A.6)

In other words, depending on the size of the numerical parameter \epsilon, one may expect small angle variations between eigenvectors of successive iteration cycles. While this result is formally correct, it is based on an unproven assumption: from the analytical point of view, very little can be said about the validity of this expansion, since each cycle updates the basis functions in a very non-linear manner. We therefore reverse the usual reasoning and, by assuming the correctness of the expansion, work our way back: in practice, we verify numerically the consistency of Eq. (A.6) and consequently infer the validity of its premises. In other words, we follow an inverse-engineering approach as the basis for the computational scheme of subsection 3.1.1.
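The kind of numerical consistency check of Eq. (A.6) described above can be illustrated with a small toy model (our own sketch: randomly generated overlap matrices stand in for S^(i) and S^(i+1), the coefficient blocks are built to be S-orthonormal by construction, and the eigenvector "evolution" is modeled as a small rotation of size eps):

```python
import numpy as np
from scipy.linalg import cholesky, qr, expm

rng = np.random.default_rng(2)
n, eps = 40, 1e-3

def random_spd(n):
    # Random symmetric positive definite matrix (stand-in for an overlap S).
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)

S_i = random_spd(n)
S_ip1 = random_spd(n)
L_i = cholesky(S_i, lower=True)       # S^(i)   = L^(i)   L^(i)T
L_ip1 = cholesky(S_ip1, lower=True)   # S^(i+1) = L^(i+1) L^(i+1)T

# S-orthonormal coefficient blocks: C^T S C = I holds by construction,
# mirroring Eq. (A.2).
Q = qr(rng.standard_normal((n, n)))[0]
K = rng.standard_normal((n, n))
K = K - K.T                           # skew-symmetric generator
Q_next = Q @ expm(eps * K)            # small rotation: eigenvector "evolution"
C_i = np.linalg.solve(L_i.T, Q)       # C^(i)   = L^(i)-T   Q
C_ip1 = np.linalg.solve(L_ip1.T, Q_next)

# Cross-overlap factorized as in Eq. (A.5): S~ = L^(i) (I + eps*D) L^(i+1)T.
D = rng.standard_normal((n, n))
S_tilde = L_i @ (np.eye(n) + eps * D) @ L_ip1.T

# Eq. (A.6): <x^(i)_nu x^(i+1)_mu> = delta_{nu,mu} + O(eps).
M_cross = C_i.T @ S_tilde @ C_ip1
```

With eps = 10^-3 the deviation of M_cross from the identity is of order eps times the norms of K and D, i.e. the mixing terms in the bracket of Eq. (A.6) remain small, as the expansion predicts.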