Iterative diagonalization of symmetric matrices in mixed precision
Eiji Tsuchida∗, Yoong-Kee Choe
Nanosystem Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba Central 2, Umezono 1-1-1, Tsukuba 305-8568, Japan
Abstract
Diagonalization of a large matrix is the computational bottleneck in many applications such as electronic structure calculations. We show that a speedup of over 30% can be achieved by exploiting 32-bit floating point operations, while keeping 64-bit accuracy. Moreover, most of the computationally expensive operations are performed by level-3 BLAS/LAPACK routines in our implementation, thus leading to optimal performance on most platforms. Further improvement can be made by using problem-specific preconditioners which take into account nondiagonal elements.
Keywords: diagonalization, eigenvalues, electronic structure calculations, mixed precision, conjugate gradient method
1. Introduction
Matrix diagonalization plays an important role in many areas of science and engineering. In electronic structure calculations of complex systems, for instance, most of the computational effort is spent on the numerical solution of the Schrödinger equation at various levels of approximation. This procedure is equivalent to an eigenvalue problem in which only a small subset of the eigenvalues of a large symmetric matrix is desired [1, 2, 3]. To this end, iterative diagonalization is more favorable than direct methods [4], which are designed for calculating all eigenvalues of dense matrices.

Numerical calculations on modern computers are generally performed using 64-bit double precision (DP) floating-point numbers, which are accurate to 15-16 significant digits. On the other hand, 32-bit single precision (SP) floating-point numbers with 7-8 digits of accuracy are more efficient in terms of computational cost, memory usage, network bandwidth, and disk storage [5]. In recent years, considerable effort has been made to obtain results with DP accuracy at the expense of SP operations. In particular, a variety of mixed precision algorithms have been developed for the solution of linear equations [5, 6].

Similarly, several researchers have attempted to solve eigenvalue problems with mixed precision algorithms in the past [7, 8]. Unfortunately, most of these approaches correspond to direct methods for dense matrices, and thus are of limited use in electronic structure calculations. In this paper, we show how to exploit mixed precision arithmetic for the iterative solution of large-scale eigenvalue problems, with special emphasis on electronic structure calculations.
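The precision/storage trade-off quoted above can be made concrete with a few lines of numpy; this sketch is ours and is not part of the implementation described in the paper:

```python
import numpy as np

# Machine epsilon of each format: SP resolves 7-8 significant digits,
# DP resolves 15-16, in line with the figures quoted in the text.
eps_sp = np.finfo(np.float32).eps   # about 1.2e-7
eps_dp = np.finfo(np.float64).eps   # about 2.2e-16

# SP also halves the memory footprint relative to DP.
x_dp = np.zeros(1024, dtype=np.float64)
x_sp = np.zeros(1024, dtype=np.float32)
```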
2. Theory
Let H be a real, symmetric matrix of dimension N. The eigenvalues and eigenvectors of H are defined by

H e_j = λ_j e_j, j = 1, 2, ..., N, (1)

where λ_1 ≤ λ_2 ≤ ... ≤ λ_N, and e_1, e_2, ..., e_N form a set of orthonormal vectors. We also assume that H is a sparse matrix with O(N) nonzero entries.

∗ Corresponding author
Email address: [email protected] (Eiji Tsuchida)
Preprint submitted to Elsevier November 8, 2018

Our aim is to calculate the m lowest eigenvalues of H and the corresponding eigenvectors, where 1 ≪ m ≪ N, and typically m = 10^2-10^3 and N = 10^5-10^6 in electronic structure calculations [9, 10]. To be precise, it is often sufficient to calculate only the sum of the eigenvalues,

E_G = Σ_{j=1}^{m} λ_j, (2)

and the subspace spanned by e_1, e_2, ..., e_m, where E_G corresponds to the ground-state energy of the system [1, 2, 3]. Hereafter we assume the presence of a gap in the spectrum, i.e.,

ε_gap = λ_{m+1} − λ_m > 0, (3)

which is satisfied in nonmetallic systems [1, 2, 3].

The trace minimization method [3, 11] is based on the fact that if

E(c_1, c_2, ..., c_m) = Σ_{j=1}^{m} c_j^T H c_j (4)

is minimized subject to the orthonormality conditions

c_i^T c_j = δ_ij, i, j = 1, 2, ..., m, (5)

then E = E_G holds at the minimum, and C = (c_1 c_2 ... c_m) spans the same subspace as {e_1, e_2, ..., e_m}. In matrix form, Eqs. (4) and (5) can be written as

E(C) = trace(C^T H C) (6)

and

C^T C = I_m, (7)

respectively. These equations are invariant under any unitary transformation of C.

Although we focus on the standard eigenvalue problem of Eq. (1) in this work, the extension to the generalized eigenvalue problem (H e = λ S e) is straightforward if S is a symmetric, positive definite matrix [11]. This property allows us to use nonorthogonal basis functions [12, 13, 14, 15] with ease. Moreover, the trace minimization method is equally valid even if H depends on the eigenvectors of H itself, as explained later.

In the algorithm shown in Fig. 1, E(C) is minimized directly with respect to C using the nonlinear conjugate gradient method [16, 17, 18]. This procedure is often referred to as direct energy minimization in the electronic structure community [1, 2, 3]. Here C, G, P ∈ R^{N×m}, and a line minimization is performed along P_i to determine α_i [19, 20]. Moreover, γ_i is given by

γ_i = 0 for i = 0, and γ_i = trace((G_i − G_{i−1})^T G_i) / trace(G_{i−1}^T G_{i−1}) for i = 1, 2, ..., (8)

following the Polak-Ribière formula [19]. When ‖G_i‖, the Frobenius norm of G_i, is sufficiently small, E_i will be equal to E_G, and

H C = C H̃, with H̃ = C^T H C, (9)

will hold. Diagonalization of H̃ ∈ R^{m×m} will give the explicit eigenvalues and eigenvectors of H, if necessary. The number of iterations to reach convergence is estimated by [21]

N_iter ∝ 1 / √ε_gap. (10)

Therefore, a naive implementation of the trace minimization method fails in the limit of a vanishing gap. In this case, more complex (and thus more costly) algorithms [22, 23] should be employed to avoid the slow convergence.

Choose initial guess C_0 subject to C_0^T C_0 = I_m
for i = 0, 1, 2, ...
    Calculate E_i = E(C_i) and G_i = −∇E(C_i)
    check convergence; continue if necessary
    P_i = G_i + γ_i P_{i−1}
    C_{i+1} = C_i + α_i P_i
    Orthonormalize(C_{i+1})
end

Figure 1: Trace minimization method using the nonlinear conjugate gradient method.
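As a concrete illustration of Fig. 1, the numpy sketch below minimizes trace(C^T H C) with Polak-Ribière CG. Two simplifications are ours: the line minimization along P_i uses the unconstrained quadratic model (valid when H is positive definite, which can always be arranged by a spectral shift), and the orthonormalization uses QR instead of the Cholesky scheme of Fig. 4:

```python
import numpy as np

def trace_min_cg(H, m, n_iter=500, seed=0):
    """Sketch of Fig. 1: minimize trace(C^T H C) subject to C^T C = I_m."""
    rng = np.random.default_rng(seed)
    N = H.shape[0]
    C = np.linalg.qr(rng.standard_normal((N, m)))[0]   # random orthonormal guess
    P = np.zeros((N, m))
    G_prev = None
    for i in range(n_iter):
        HC = H @ C
        Hsub = C.T @ HC
        G = -(HC - C @ Hsub)                 # G_i = -grad E(C_i); note C^T G = 0
        if np.linalg.norm(G) < 1e-12:        # "check convergence" step of Fig. 1
            break
        if G_prev is None:
            gamma = 0.0                      # Eq. (8), i = 0
        else:                                # Polak-Ribiere (with restart), Eq. (8)
            gamma = max(0.0, np.trace((G - G_prev).T @ G)
                             / np.trace(G_prev.T @ G_prev))
        P = G + gamma * P
        denom = np.trace(P.T @ (H @ P))      # quadratic line minimization along P
        if denom <= 0.0:
            break
        alpha = -np.trace(P.T @ HC) / denom
        C = np.linalg.qr(C + alpha * P)[0]   # orthonormalize C_{i+1}
        G_prev = G
    return np.trace(C.T @ H @ C)

# H = diag(1, ..., 20): the exact minimum of Eq. (6) for m = 3 is 1 + 2 + 3 = 6.
E = trace_min_cg(np.diag(np.arange(1.0, 21.0)), m=3)
```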
Calculate E(C) and G = −∇E(C)

Algorithm-1:
(1) X = H C                          O(mN)
(2) H̃ = C^T X              DGEMM    O(m^2 N)
(3) E = trace(H̃)                    O(m)
(4) G = −(X − C H̃)         DGEMM    O(m^2 N)

Algorithm-2:
(1) X = H C                          O(mN)
(2) H_D = diag(C^T X)                O(mN)
(3) E = trace(H_D)                   O(m)
(4) Quit if G is unnecessary
(5) X′ = X − C H_D                   O(mN)
(6) H′ = C^T X′            DGEMM     O(m^2 N)
(7) G = −(X′ − C H′)       DGEMM     O(m^2 N)

Figure 2: Two (mathematically) equivalent procedures for calculating the energy and gradient. The second and third columns show the corresponding BLAS/LAPACK routines (if any) and their computational costs.
H̃, H′, H_D are symmetric matrices ∈ R^{m×m}, and X, X′ ∈ R^{N×m}.

The procedure for calculating the energy and the gradient is illustrated in Fig. 2. Algorithm-1 is a naive implementation which is shown only for illustrative purposes. Algorithm-2 is a more practical implementation which is appropriate for incorporating the mixed precision arithmetic explained below. Since H′ is a symmetric matrix of dimension m, only the upper (or lower) triangular part needs to be calculated explicitly in step-(6). In particular, when m is large, H′ should be divided into smaller blocks, as shown in Fig. 3, with each block being calculated by a separate DGEMM (or SGEMM) call.

Similarly, the orthonormalization procedure is shown in Fig. 4. Strictly speaking, explicit orthonormalization of C is inconsistent with the conjugate gradient method, which is designed for unconstrained minimization problems. In our experience, however, only minor effects are seen when m ≪ N is satisfied. Since all O(m^2 N) operations introduced in this section are performed by level-3 BLAS routines, optimal performance is achieved on most platforms [24].

Here we present several ideas for improving the performance of the algorithm introduced above. The most straightforward option is to perform all operations in DP (full DP), which will serve as a reference in the following. On the other hand, it is also possible to replace all DP operations by SP (full SP), which is expected to achieve the largest gain in terms of computational cost and memory usage. Unfortunately, as will be shown below, this approach does not provide sufficient accuracy, and thus is of limited use in real applications.

A more practical approach is to start from full DP, and incorporate SP operations progressively. Mixed precision variant-1 (MP1) is a conservative approach which aims at achieving a reasonable gain, while keeping full DP accuracy.

(i) The change of C_i, ‖C_{i+1} − C_i‖, in Fig. 1, will become much smaller than ‖C_i‖ as convergence is approached.
Therefore, G_i and P_i can be stored in memory in SP format without sacrificing accuracy.

Figure 3: (a) Recursive and (b) row-wise splitting of the upper triangular part of a symmetric matrix.
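The row-wise splitting of Fig. 3(b) can be sketched in a few lines of numpy: block row k of the upper triangle of H′ = C^T X′ only requires columns k..m of X′, so each block is one independent GEMM (or SGEMM) call. The block size nb is our arbitrary choice:

```python
import numpy as np

def upper_triangle_blocked(C, X, nb=3):
    """Row-wise blocked computation of the upper triangular part of C^T X
    (Fig. 3b); each block row is one small matrix-matrix product."""
    m = C.shape[1]
    Hp = np.zeros((m, m))
    for k in range(0, m, nb):
        k2 = min(k + nb, m)
        # one GEMM per block row; only columns k..m are needed
        Hp[k:k2, k:] = C[:, k:k2].T @ X[:, k:]
    return Hp

rng = np.random.default_rng(0)
C = rng.standard_normal((40, 10))
X = rng.standard_normal((40, 10))
```

The upper triangle agrees with the unblocked product np.triu(C.T @ X), while each call touches a smaller panel of memory.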
Orthonormalize(C_in):
(1) S = C_in^T C_in                 DSYRK    O(m^2 N)
(2) Calculate L, where S = L L^T    DPOTRF   O(m^3)
(3) Calculate L^{−1}                DTRTRI   O(m^3)
(4) C_out = C_in L^{−T}             DTRMM    O(m^2 N)

Figure 4: Orthonormalization procedure based on the Cholesky factorization. Same notation as in Fig. 2. S is a symmetric matrix of dimension m, and L is a lower triangular matrix of dimension m (so is L^{−1}). On exit, C_in is overwritten by C_out.

Here, the elements of G_i are first calculated in DP to minimize round-off errors in Fig. 2, followed by conversion to SP. Conversion between SP and DP is performed with the Fortran intrinsic functions Sngl and
Dble. While this change alone leads to only a minor performance gain, a large gain can be made when used in conjunction with advanced preconditioners, as will be discussed in § 4. Furthermore, a significant reduction of memory usage is expected.

(ii) Similarly, the change of C_i during the orthonormalization procedure shown in Fig. 4 decreases from iteration to iteration. Therefore, after calculating L^{−1}, we introduce a DP matrix

L_D = diag(L^{−T}), (11)

and an SP matrix

L′ = L^{−T} − L_D, (12)

where L_D → I_m and L′ → 0 as convergence is approached. Then C_out is obtained as

C_out = C_in L_D + C_in L′, (13)

where the first and the second terms are calculated in DP and SP, respectively. Obviously, the former cost is negligible, while the latter can be performed with an STRMM call instead of DTRMM. The final result, C_out, is stored in DP format.

In addition to the changes noted above, further acceleration is achieved in the mixed precision variant-2 (MP2) by reducing the cost of evaluating the gradient in Fig. 2, Algorithm-2, as follows. Since the computational cost of this procedure is dominated by steps-(1), (6), and (7), the two DGEMM calls in steps-(6) and (7) are replaced by SGEMM. The rest of the operations in this procedure, including step-(1), are performed in DP, which guarantees DP accuracy of the energy. Step-(5), corresponding to self-orthogonalization of the gradient, is also performed in DP, which allows us to reduce the round-off errors arising from the use of SGEMM in steps-(6) and (7) significantly. It is also preferable to set diag(H′) = 0 explicitly after step-(6) to minimize the errors. Nevertheless, MP2 has the potential risk of producing an inaccurate gradient when close to convergence.
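Near convergence, the splitting of Eqs. (11)-(13) retains essentially all of the DP accuracy even though the off-diagonal correction is applied in SP. The numpy sketch below runs the full Cholesky pipeline of Fig. 4 and compares the mixed-precision update with the all-DP one; the matrix sizes and the 10^-5 deviation from orthonormality are our choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N, m = 200, 8
C = np.linalg.qr(rng.standard_normal((N, m)))[0]
C = C + 1e-5 * rng.standard_normal((N, m))     # nearly orthonormal C_in

S = C.T @ C                      # DSYRK: overlap matrix
L = np.linalg.cholesky(S)        # DPOTRF: S = L L^T
L_invT = np.linalg.inv(L).T      # DTRTRI (generic inv stands in here)

# Eqs. (11)-(13): DP diagonal part (close to I_m) + small SP remainder
L_D = np.diag(np.diag(L_invT))
L_p = (L_invT - L_D).astype(np.float32)
C_dp = C @ L_invT                                                  # full DP (DTRMM)
C_mp = C @ L_D + (C.astype(np.float32) @ L_p).astype(np.float64)   # DP + STRMM parts

rel_err = np.linalg.norm(C_mp - C_dp) / np.linalg.norm(C_dp)
```

Because the entries of L′ are already tiny near convergence, rounding them to SP perturbs C_out far below DP working accuracy, and C_mp remains orthonormal to high precision.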
Figure 5: Performance of matrix-matrix multiplications in SP (SGEMM) and DP (DGEMM).
When MP2 is used, all O(m^2 N) operations except the DSYRK call in the orthonormalization procedure are performed in SP. The performance and accuracy of these algorithms are compared in the next section.
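The two procedures of Fig. 2 are indeed mathematically equivalent whenever C^T C = I_m: substituting X′ = X − C H_D into step-(7) recovers the Algorithm-1 gradient. A small numpy check, with all sizes chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)
N, m = 50, 6
A = rng.standard_normal((N, N))
H = (A + A.T) / 2                                  # symmetric test matrix
C = np.linalg.qr(rng.standard_normal((N, m)))[0]   # C^T C = I_m

X = H @ C
# Algorithm-1
Hs = C.T @ X
E1 = np.trace(Hs)
G1 = -(X - C @ Hs)
# Algorithm-2
HD = np.diag(np.diag(C.T @ X))
E2 = np.trace(HD)
Xp = X - C @ HD
Hp = C.T @ Xp
G2 = -(Xp - C @ Hp)
```

Both branches yield the same energy and gradient up to round-off, which is why the SGEMM substitution in steps-(6) and (7) affects only the precision, not the mathematics.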
3. Numerical results
All calculations shown here were performed on a single core of a 2.3 GHz AMD Opteron 6176 SE processor with the CentOS 5.5 operating system, the gfortran 4.1.2 compiler, and the GotoBLAS2 numerical library [24].

We first show the performance of matrix-matrix multiplications in Fig. 5. The SP routine (SGEMM) is found to be 1.85-1.95 times faster than the corresponding DP routine (DGEMM) when the matrix size is larger than 200.

Then, we compare the performance of full DP, MP1, MP2, and full SP for calculating the sum of the m lowest eigenvalues of large sparse matrices corresponding to the two-dimensional discrete Laplacian under Dirichlet boundary conditions,

H_N = [  D_n  −I_n
        −I_n   D_n  −I_n
               ...   ...   ...
                    −I_n   D_n  −I_n
                           −I_n   D_n ]  ∈ R^{N×N}, (14)

D_n = [  4  −1
        −1   4  −1
             ...  ...  ...
                 −1   4  −1
                     −1   4 ]  ∈ R^{n×n}. (15)

The dimension N is given by N = n^2, where n denotes the grid size in each direction. The eigenvalues of H_N are given explicitly by

λ_{p,q} = 4 [ sin^2(pπ / (2(n+1))) + sin^2(qπ / (2(n+1))) ], p, q = 1, 2, ..., n. (16)

After sorting {λ_{p,q}} in ascending order, we can calculate the exact value of E_G for any value of m.

The above Hamiltonian represents free electrons confined in a square box, which is essentially a gapless system, as shown in Fig. 6. Therefore, this problem is a stringent test for the trace minimization method, which requires the presence of an energy gap.

Figure 6: The eigenvalue distribution of H_N (Eq. (14)) for N = 192^2.

Table 1: Performance of the four algorithms for iterative diagonalization of H_N. N_iter denotes the number of iterations required to achieve R below a fixed tolerance in the full DP run. All measurements are given in units of seconds per iteration. The numbers in parentheses indicate the percentage relative to full DP.

N       m      E_G     ε_gap       N_iter   Full DP   MP1           MP2           Full SP
96^2    220    35.25   6.2 × 10^-  270      0.631     0.591 (94%)   0.491 (78%)   0.419 (66%)
192^2   220    8.991   1.9 × 10^-  630      2.520     2.357 (94%)   1.968 (78%)   1.680 (67%)
192^2   534    50.90   2.5 × 10^-  560      12.04     11.05 (92%)   8.772 (73%)   7.958 (66%)
192^2   1064           × 10^-      460      45.38     41.01 (90%)   30.52 (67%)   28.04 (62%)
192^2   1519           × 10^-      422      89.84     79.16 (88%)   60.53 (67%)   57.23 (64%)

In Table 1, we show the results of iterative diagonalization for several pairs of (N, m), following the algorithm presented in § 2. The values of m were chosen to satisfy ε_gap > 0. For simplicity, the symmetry of H′ was not exploited, and the initial guess C_0 was generated from random numbers, followed by orthonormalization. In Fig. 7, we show the convergence of the residual at iteration i, defined by

R_i = (E_i − E_G) / E_G. (17)

These results suggest that the performance of the four algorithms satisfies

Full SP > MP2 > MP1 > Full DP,

where the differences tend to increase with m, but not with N. In particular, full SP is 30-40% faster than full DP, which is consistent with the results for matrix multiplications. However, the accuracy of the results obtained from full SP is insufficient for most applications. Therefore, full SP should be used only for generating the initial guess [25].

In contrast, MP1 retains full DP accuracy for all values of (N, m) shown in Table 1, while showing only a modest gain of ≈
10%.

MP2 is found to achieve a gain of over 30% for large m, while keeping near-DP accuracy. However, the accuracy of the converged solution deteriorates slowly with m, because the use of SP is not always appropriate for evaluating the gradient. We have found that if subspace rotation is performed occasionally to diagonalize C^T H C, this problem can be overcome, thus leading to full DP accuracy. Alternatively, one can simply switch from the MP2 to the MP1 (or full DP) algorithm when the residual falls below a given tolerance. The latter approach is preferable in terms of the construction of conjugate directions [19], as well as the extrapolation of the initial guess [26].

Figure 7: Convergence of the residual for N = 192^2: (a) m = 220, (b) m = 534, (c) m = 1064, and (d) m = 1519.
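For a small grid, the analytic spectrum of Eq. (16) is easy to verify directly against a dense diagonalization of Eq. (14); the sketch below builds H_N via Kronecker products (n = 8 is our test size):

```python
import numpy as np

n = 8
D = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # Eq. (15)
T = np.eye(n, k=1) + np.eye(n, k=-1)
H = np.kron(np.eye(n), D) - np.kron(T, np.eye(n))        # Eq. (14), N = n^2

# Analytic eigenvalues, Eq. (16), sorted in ascending order
p = np.arange(1, n + 1)
s = np.sin(p * np.pi / (2 * (n + 1)))**2
lam = np.sort((4.0 * (s[:, None] + s[None, :])).ravel())

E_G = lam[:10].sum()   # exact E_G for m = 10, as used in the text
```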
4. Discussion

In § 2, the gradient G was used without modification to generate the search directions. In practice, however, convergence can be improved considerably by using a preconditioned gradient,

G′ = M G. (18)

Here, M ∈ R^{N×N} is a symmetric, positive definite matrix called a preconditioner, which should be a reasonable approximation to the inverse Hessian (see § 11 of Ref. [4]). Preconditioning allows us to reduce the number of iterations to reach convergence at the expense of increased cost per iteration. In electronic structure calculations, this generally leads to an increase in the O(mN) cost and a decrease in the O(m^2 N) cost. Therefore, the choice of the preconditioner becomes more important for larger applications. In plane-wave-based electronic structure calculations, it is common practice to use a diagonal matrix for M, which leads to a considerable reduction of the number of iterations [17] at a negligible cost.

However, more elaborate preconditioners have been developed beyond the diagonal approximation for plane waves [27], atomic orbitals [28], and real space basis sets [29, 30, 31]. These preconditioners significantly improve the convergence rate, while the computational cost of evaluating the right-hand side of Eq. (18) becomes non-negligible. Since SP is generally sufficient for representing G, as already mentioned above, the same holds for M, unless M is an ill-conditioned matrix. Therefore, the matrix multiplication of Eq. (18) can also be performed in SP, thus minimizing the overhead of preconditioning. The idea behind this approach is similar in spirit to the solution of linear equations discussed in Ref. [5]. These advanced preconditioners will be particularly beneficial when highly accurate methods [32] are used for the electronic structure calculations.

Although we have focused on the conjugate gradient method, the basic idea presented in this work should be equally valid for other iterative algorithms [33, 34, 35, 36, 37]. In our electronic structure code Femteck [12, 38], the trace minimization method is used in conjunction with the limited memory BFGS (Broyden-Fletcher-Goldfarb-Shanno) method [39, 40, 41, 42], which requires practically no line minimization.
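The claim that the preconditioning step of Eq. (18) can safely be applied in SP is easy to check for a model M; the diagonal, moderately conditioned preconditioner below is our own choice, not one from the references:

```python
import numpy as np

rng = np.random.default_rng(0)
N, m = 400, 10
G = 1e-6 * rng.standard_normal((N, m))     # small DP gradient near convergence
M = np.diag(1.0 / (1.0 + np.arange(N)))    # model preconditioner, cond(M) ~ N

Gp_dp = M @ G                                                     # reference, all DP
Gp_sp = (M.astype(np.float32) @ G.astype(np.float32)).astype(np.float64)

# relative error of the SP preconditioning step
rel_err = np.linalg.norm(Gp_sp - Gp_dp) / np.linalg.norm(Gp_dp)
```

The SP result differs from the DP reference only at the level of SP round-off, which is far below the size of the gradient itself; an ill-conditioned M would erode this margin, as noted in the text.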
When the MP2 algorithm is used, together with a multigrid approximation to the inverse Hessian [31], the computational time for the ground-state calculation is reduced by a factor of about two in large-scale applications [10], compared with the full DP calculation using a diagonal approximation to the Hessian [41, 42]. If we start from a reasonable initial guess [26], the ground-state energy is obtained in 10-15 iterations in nonmetallic materials [9, 10].
5. Conclusion
We have shown in this work that the iterative solution of the eigenvalue problem can be accelerated significantly by taking advantage of mixed precision arithmetic without relying on external devices. Even further improvement can be made by using problem-specific preconditioners which take into account nondiagonal elements. These methods will become more important with increasing problem size.

For most applications, MP2 is a good compromise between accuracy and computational cost. When highly accurate eigenvalues/vectors are desired, we recommend switching from MP2 to MP1 (or full DP) at some point before the convergence slows down.

While the current implementation is designed for ground-state electronic structure calculations, it would also be interesting to investigate the use of mixed precision arithmetic in other problems such as density-functional perturbation theory [43, 44, 45].
Acknowledgements
This work has been supported in part by a KAKENHI grant (22104001) from the Ministry of Education, Culture, Sports, Science and Technology, and a grant from the Ministry of Economy, Trade, and Industry, Japan.
References

[1] R. M. Martin, Electronic Structure: Basic Theory and Practical Methods, Cambridge University Press, 2004.
[2] D. Marx, J. Hutter, Ab Initio Molecular Dynamics: Basic Theory and Advanced Methods, Cambridge University Press, 2009.
[3] Y. Saad, J. R. Chelikowsky, S. M. Shontz, Numerical methods for electronic structure calculations of materials, SIAM Rev. 52 (2010) 3-54.
[4] Z. Bai, J. Demmel, J. Dongarra, A. Ruhe, H. van der Vorst, Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide, SIAM, 2000.
[5] M. Baboulin, A. Buttari, J. Dongarra, J. Kurzak, J. Langou, J. Langou, P. Luszczek, S. Tomov, Accelerating scientific computations with mixed precision algorithms, Comput. Phys. Commun. 180 (2009) 2526-2533.
[6] M. Papadrakakis, N. Bitoulas, Accuracy and effectiveness of preconditioned conjugate gradient algorithms for large and ill-conditioned problems, Comput. Methods Appl. Mech. Engrg. 109 (1993) 219-232.
[7] J. J. Dongarra, C. B. Moler, J. H. Wilkinson, Improving the accuracy of computed eigenvalues and eigenvectors, SIAM J. Numer. Anal. 20 (1983) 23-45.
[8] V. Drygalla, Exploiting mixed precision for computing eigenvalues of symmetric matrices and singular values, PAMM 8 (2008) 10809-10810.
[9] Y.-K. Choe, E. Tsuchida, T. Ikeshoji, S. Yamanaka, S. Hyodo, Nature of proton dynamics in a polymer electrolyte membrane, Nafion: a first-principles molecular dynamics study, Phys. Chem. Chem. Phys. 11 (2009) 3892-3899.
[10] Y.-K. Choe, E. Tsuchida, T. Ikeshoji, A. Ohira, K. Kidena, An ab initio modeling study on a modeled hydrated polymer electrolyte membrane, sulfonated polyethersulfone (SPES), J. Phys. Chem. B 114 (2010) 2411-2421.
[11] A. H. Sameh, J. A. Wisniewski, A trace minimization algorithm for the generalized eigenvalue problem, SIAM J. Numer. Anal. 19 (1982) 1243-1259.
[12] E. Tsuchida, M. Tsukada, Adaptive finite-element method for electronic-structure calculations, Phys. Rev. B 54 (1996) 7602-7605.
[13] J. E. Pask, P. A. Sterne, Finite element methods in ab initio electronic structure calculations, Modelling Simul. Mater. Sci. Eng. 13 (2005) R71-R96.
[14] J. M. Soler, E. Artacho, J. D. Gale, A. García, J. Junquera, P. Ordejón, D. Sánchez-Portal, The SIESTA method for ab initio order-N materials simulation, J. Phys.: Condens. Matter 14 (2002) 2745-2779.
[15] T. Ozaki, H. Kino, Numerical atomic basis orbitals from H to Kr, Phys. Rev. B 69 (2004) 195113.
[16] I. Stich, R. Car, M. Parrinello, S. Baroni, Conjugate gradient minimization of the energy functional: A new method for electronic structure calculation, Phys. Rev. B 39 (1989) 4997-5004.
[17] M. C. Payne, M. P. Teter, D. C. Allan, T. A. Arias, J. D. Joannopoulos, Iterative minimization techniques for ab initio total-energy calculations: molecular dynamics and conjugate gradients, Rev. Mod. Phys. 64 (1992) 1045-1097.
[18] A. Edelman, S. T. Smith, On conjugate gradient-like methods for eigen-like problems, BIT 36 (1996) 494-508.
[19] W. H. Press, S. A. Teukolsky, W. T. Vetterling, B. P. Flannery, Numerical Recipes in Fortran, Cambridge University Press, 1992.
[20] R. D. King-Smith, D. Vanderbilt, First-principles investigation of ferroelectricity in perovskite compounds, Phys. Rev. B 49 (1994) 5828-5844.
[21] J. F. Annett, Efficiency of algorithms for Kohn-Sham density functional theory, Comp. Mater. Sci. 4 (1995) 23-42.
[22] N. Marzari, D. Vanderbilt, M. C. Payne, Ensemble density-functional theory for ab initio molecular dynamics of metals and finite-temperature insulators, Phys. Rev. Lett. 79 (1997) 1337-1340.
[23] A. Sameh, Z. Tong, The trace minimization method for the symmetric generalized eigenvalue problem, J. Comput. Appl. Math. 123 (2000) 155-175.
[24] K. Goto, R. van de Geijn, High-performance implementation of the level-3 BLAS, ACM Trans. Math. Softw. 35 (2008) 1-14.
[25] R. Kosloff, H. Tal-Ezer, A direct relaxation method for calculating eigenfunctions and eigenvalues of the Schrödinger equation on a grid, Chem. Phys. Lett. 127 (1986) 223-230.
[26] T. A. Arias, M. C. Payne, J. D. Joannopoulos, Ab initio molecular dynamics: Analytically continued energy functionals and insights into iterative solutions, Phys. Rev. Lett. 69 (1992) 1077-1080.
[27] A. Sawamura, M. Kohyama, T. Keishi, An efficient preconditioning scheme for plane-wave-based electronic structure calculations, Comp. Mater. Sci. 14 (1999) 4-7.
[28] V. Weber, J. VandeVondele, J. Hutter, A. M. N. Niklasson, Direct energy functional minimization under orthogonality constraints, J. Chem. Phys. 128 (2008) 084113.
[29] J.-L. Fattebert, J. Bernholc, Towards grid-based O(N) density-functional theory methods: Optimized nonorthogonal orbitals and multigrid acceleration, Phys. Rev. B 62 (2000) 1713-1722.
[30] T. Torsti et al., Three real-space discretization techniques in electronic structure calculations, Phys. Stat. Sol. (b) 243 (2006) 1016-1053.
[31] E. Tsuchida, Ab initio molecular-dynamics study of liquid formamide, J. Chem. Phys. 121 (2004) 4740-4746.
[32] M. Guidon, F. Schiffmann, J. Hutter, J. VandeVondele,