LA-UR-21-20350
Mixed Precision Fermi-Operator Expansion on Tensor Cores From a Machine Learning Perspective

Joshua Finkelstein†∗, Justin Smith†, Susan M. Mniszewski‡, Kipton Barros†, Christian F. A. Negre†∗, Emanuel H. Rubensson††, Anders M. N. Niklasson†∗

† Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545
‡ Computer, Computational, and Statistical Sciences Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545
†† Division of Scientific Computing, Department of Information Technology, Uppsala University, Box 337, SE-751 05 Uppsala, Sweden

(Dated: January 19, 2021)

We present a second-order recursive Fermi-operator expansion scheme using mixed precision floating point operations to perform electronic structure calculations using tensor core units. A performance of over 100 teraFLOPs is achieved for half-precision floating point operations on Nvidia's A100 tensor core units. The second-order recursive Fermi-operator scheme is formulated in terms of a generalized, differentiable deep neural network structure, which solves the quantum mechanical electronic structure problem. We demonstrate how this network can be accelerated by optimizing the weight and bias values to substantially reduce the number of layers required for convergence. We also show how this machine learning approach can be used to optimize the coefficients of the recursive Fermi-operator expansion to accurately represent fractional occupation numbers of the electronic states at finite temperatures.
I. INTRODUCTION
Electronic structure calculations based on Hartree-Fock, density-functional theory, or semiempirical methods often require the intermediate construction of the single-particle density matrix [1–3]. This density matrix can be calculated with different techniques, for example, the use of a direct diagonalization of the Kohn-Sham Hamiltonian or the Fockian [4], Green's function methods [5], variational optimization [6], and various recursive Fermi-operator expansion schemes [2, 7, 8]. The method of choice often depends on several criteria such as: 1) the electronic structure basis set, i.e., if plane waves or localized atomic orbitals are used; 2) the system that is analyzed, i.e., if the system is small or large or if it is metallic or non-metallic; and 3) the computational platform, i.e., if the calculation is performed on a single or multiple central processing units (CPUs) or on a hybrid architecture with graphics processing units (GPUs).

In this article we target density matrix calculations for electronic structure methods on tensor core units with an adapted atomic-orbital-like basis set for intermediate sized nonmetallic systems. For these calculations, we use a second-order recursive Fermi-operator expansion scheme in combination with mixed precision floating point operations, which enables efficient calculations of the density matrix using tensor core accelerators [9]. Exploring the use of tensor core based architectures for electronic structure calculations follows the previous transitioning from CPU-only based electronic structure techniques to the more specialized GPU-based techniques [10–20]. Our Fermi-operator expansion scheme is formulated and presented in terms of a generalized convolutional deep neural network [21, 22]. This network formulation provides a powerful machine learning perspective on how we can further optimize and extend the applications of the recursive Fermi-operator expansion. We find that we can optimize the weight and bias values, and use a combination of multiple layers to represent Fermi functions at finite electronic temperatures with high numerical accuracy. We also find that an optimized set of weight and bias values can reduce the number of layers required to reach convergence.

The article is outlined as follows. First, we discuss the electronic structure problem and the Fermi-operator representation of the density matrix. Then, we present a second-order recursive Fermi-operator expansion method in terms of a generalized deep neural network using mixed precision floating-point operations that are well adapted for tensor core calculations. We then demonstrate and analyze the performance on tensor core units for some test examples. Thereafter, we discuss how optimized weight and bias values can be used to accelerate convergence and how they accurately represent the Fermi function for fractional occupation numbers at finite electronic temperatures. The algorithms are presented in pseudocode throughout the manuscript and implementations in Python are available in the Supporting Information (SI) document.
II. DENSITY-MATRIX FERMI-OPERATOR EXPANSION

A. The Density Matrix

The single-particle density matrix, D, is given in terms of the Fermi matrix function,

D = (e^{β(H − µI)} + I)^{−1}.  (1)

Here, I is the identity matrix, β = 1/(k_B T_e) is the inverse electronic temperature, µ is the chemical potential, and H ∈ R^{N×N} is the Hamiltonian (or Fockian) matrix with matrix elements

H_ij = ⟨φ_i| Ĥ |φ_j⟩.  (2)

For simplicity, we assume an orthonormal basis set {φ_i}_{i=1}^N, so that the overlap matrix S_ij = ⟨φ_i|φ_j⟩ = δ_ij. An orthonormalized basis-set representation can always be constructed from a congruence transform based on the inverse factorization of the overlap matrix [23]. The operator Ĥ is the effective single-particle Hamiltonian operator involved in methods such as Hartree-Fock or Kohn-Sham density functional theory. At zero electronic temperature, T_e = 0, the Fermi function becomes a Heaviside step function and the density matrix reads as

D = θ(µI − H).  (3)

There are several methods that can be used to calculate the density matrix. The traditional method is based on a direct diagonalization of H, i.e. finding the orthonormal eigenstates Q such that

Q^T H Q = E,  E_ij = ε_i δ_ij,  Q^T Q = I.  (4)

In the diagonal (eigenvector) representation, E and the identity matrix I are diagonal, and the matrix exponential and inversion can therefore be calculated directly, i.e.

D = QQ^T (e^{β(H − µI)} + I)^{−1} QQ^T = Q (e^{β(E − µI)} + I)^{−1} Q^T,  (5)

where the chemical potential µ is adjusted to account for the desired orbital occupation, N_occ, i.e. such that Tr[D] = N_occ. Alternatively, in the zero-temperature limit the density matrix becomes

D = QQ^T θ(µI − H) QQ^T = Q θ(µI − E) Q^T,  (6)

where the shifted Heaviside step function, θ(µI − E), is evaluated on each of the diagonal elements of E (the eigenvalues of H). The chemical potential also needs to be shifted to reach a desired occupation in this case.
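As a point of reference for the expansion schemes discussed below, the diagonalization route of Eqs. (4) and (5) takes only a few lines of NumPy. This is a minimal sketch under our own naming; the bisection search for µ is an illustrative choice and is not prescribed by the text.

import numpy as np

def density_matrix_diag(H, n_occ, beta):
    """Density matrix via diagonalization, Eqs. (4)-(5):
    D = Q (exp(beta(E - mu I)) + I)^(-1) Q^T, with mu tuned
    by bisection so that Tr[D] = n_occ."""
    eps, Q = np.linalg.eigh(H)              # Q^T H Q = E
    lo, hi = eps.min(), eps.max()
    for _ in range(60):                     # bisection on the occupation
        mu = 0.5 * (lo + hi)
        f = 1.0 / (np.exp(beta * (eps - mu)) + 1.0)
        lo, hi = (mu, hi) if f.sum() < n_occ else (lo, mu)
    return (Q * f) @ Q.T                    # Q diag(f) Q^T

# small random symmetric test Hamiltonian
rng = np.random.default_rng(0)
H = rng.uniform(-1, 1, (10, 10)); H = 0.5 * (H + H.T)
D = density_matrix_diag(H, n_occ=5, beta=40.0)
print(np.trace(D))   # ~5.0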
B. Serial Fermi-Operator Expansions

An alternative way to construct the density matrix, D, is the serial Chebyshev Fermi-operator expansion scheme [24, 25], where the density matrix is approximated by a linear combination of Chebyshev matrix polynomials of the Hamiltonian, T_n(H),

D = (e^{β(H − µI)} + I)^{−1} ≈ Σ_{n=1}^m c_n T_n(H).  (7)

Alternatively, we may also use a Green's function expansion, which is based on a complex contour integration [26–30] with some complex energy mesh {z_n}, where

D = (e^{β(H − µI)} + I)^{−1} ≈ Σ_{n=1}^m c_n (H − z_n I)^{−1}.  (8)

In both of these serial Fermi-operator expansion methods, the coefficients {c_n}_{n=1}^m need to be adjusted such that the approximate density matrix has the correct occupation and temperature. To reach accurate representations, high-order expansions are required and convergence can be hard to achieve at low temperatures. Higher-order polynomials are needed for the Chebyshev expansion, and the Green's function expansion requires complex energies close to the real axis, which may lead to singularity problems for low-temperature expansions. However, the Chebyshev and Green's function methods can take advantage of sparse matrix algebra for sufficiently large Hamiltonian matrices, which allows computations with linear scaling complexity [31, 32]. Chebyshev methods can sometimes also take advantage of a stochastic sampling of expectation values using a smaller randomized trial basis. In these cases the cubic O(N^3) scaling cost of the diagonalization, a function of the system size N, can be replaced by a linear O(N) scaling complexity [30–33]. This is a particular advantage for very large problems such as those including tens of thousands of atoms or electrons. For smaller problems, the construction of the density matrix with direct diagonalization is typically much faster.
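A minimal sketch of the serial Chebyshev expansion in Eq. (7) is given below, assuming the Hamiltonian spectrum is first mapped to [−1, 1] and the coefficients c_n are obtained from Chebyshev-Gauss quadrature; the names and parameters are ours, and a production implementation would use sparse matrix algebra and cheap (e.g., Gershgorin) spectral bound estimates instead of exact eigenvalues.

import numpy as np

def chebyshev_fermi(H, mu, beta, m=200):
    """Serial Chebyshev Fermi-operator expansion, Eq. (7).  The spectrum
    of H is mapped to [-1, 1], where the polynomials T_n are defined."""
    e = np.linalg.eigvalsh(H)          # exact bounds; estimates suffice
    a, b = 0.5 * (e[-1] - e[0]), 0.5 * (e[-1] + e[0])
    I = np.eye(len(H))
    Hs = (H - b * I) / a               # spectrum rescaled to [-1, 1]

    # coefficients c_n by Chebyshev-Gauss quadrature of the Fermi function,
    # written with tanh for numerical stability at large beta
    theta = np.pi * (np.arange(m) + 0.5) / m
    f = 0.5 * (1.0 - np.tanh(0.5 * beta * (a * np.cos(theta) + b - mu)))
    c = (2.0 / m) * np.array([f @ np.cos(n * theta) for n in range(m)])
    c[0] *= 0.5

    # accumulate sum_n c_n T_n(Hs) with T_{n+1} = 2 Hs T_n - T_{n-1}
    T_prev, T_curr = I, Hs.copy()
    D = c[0] * T_prev + c[1] * T_curr
    for n in range(2, m):
        T_prev, T_curr = T_curr, 2.0 * Hs @ T_curr - T_prev
        D += c[n] * T_curr
    return D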
C. Recursive Fermi-Operator Expansion

At zero electronic temperature a Fermi-operator expansion scheme has to approximate the Heaviside step function of H, with the step formed at the chemical potential. In this zero-temperature limit, Chebyshev and Green's function methods can have convergence problems [24, 25]. An alternative is given by a recursive Fermi-operator expansion [7, 34–42], where

D = θ(µI − H) ≈ f_m(f_{m−1}(... f_0(H) ...)).  (9)

The recursion can be performed by successive projections of a matrix X_i that starts with X_0 = H; then X_{i+1} = f_i(X_i) is calculated in each iteration until convergence is reached as X_i → D. The functions f_i(X_i) are chosen to project the eigenvalue spectrum of X_i to a more pure ensemble with eigenvalues closer either to 1 or to 0. The occupation is 1 for the occupied states below the chemical potential µ and 0 for the unoccupied states above. These types of recursive Fermi-operator expansion methods are also referred to as purification or spectral projection schemes [2, 36, 43].

The advantage of a recursive Fermi-operator expansion is that we can reach a very high polynomial order in the approximation with only a few iterations. There are many choices of polynomials and techniques to adjust the expansion such that the step is formed at the chemical potential. Possibly the simplest and most efficient technique is the second-order spectral projection (SP2) method [36, 39, 41, 44], which is the main focus of this article.

III. THE SP2 METHOD

A. Second-Order Spectral Projection Polynomials
In the SP2 method, the recursive expansion functions in Eq. (9) are chosen as second-order polynomials acting on the interval [0, 1],

X_{i+1} = f_i(X_i) = X_i ± (X_i − X_i^2).  (10)

The ± sign is chosen to adjust the trace of X_{i+1} in each projection such that the correct occupation, N_occ, is reached at convergence, i.e. such that Tr[X_i] → Tr[D] = N_occ [36]. In this way the step is formed automatically at the correct chemical potential µ. No prior knowledge of the chemical potential is therefore required and no post-processing adjustment is needed. The polynomial expansion order doubles in each recursion step; in only 30 recursion steps the polynomial expansion order is over one billion. The second-order polynomials in Eq. (10) are monotonically increasing functions on the expansion interval [0, 1], moving each eigenvalue toward either 0 or 1 without reordering the spectrum.
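A plain SP2 recursion of Eq. (10) can be sketched as follows; the trace-guided sign selection mirrors the description above, while the names, the fixed iteration count, and the use of exact spectral bounds are our simplifications. Thirty iterations apply an implicit polynomial of order 2^30 ≈ 10^9 to every eigenvalue.

import numpy as np

def sp2_density_matrix(H, n_occ, n_iter=30):
    """Plain SP2 recursion, Eqs. (9)-(10): X_{i+1} = X_i +/- (X_i - X_i^2),
    the sign chosen so that Tr[X] approaches n_occ.  The initial linear
    transform places the spectrum of H in [0, 1] in reverse order."""
    e = np.linalg.eigvalsh(H)                        # or cheap spectral bounds
    X = (e[-1] * np.eye(len(H)) - H) / (e[-1] - e[0])
    for _ in range(n_iter):                          # polynomial order doubles
        X2 = X @ X
        if abs(np.trace(X2) - n_occ) < abs(np.trace(2.0 * X - X2) - n_occ):
            X = X2                 # minus branch, pushes eigenvalues toward 0
        else:
            X = 2.0 * X - X2       # plus branch, pushes eigenvalues toward 1
    return X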
B. Deep-NN SP2

There are several ways to implement the SP2 recursive Fermi-operator expansion of Eq. (9) using the second-order projection polynomials in Eq. (10). Here we present a version that naturally maps onto the algorithmic structure of a generalized convolutional deep neural network (Deep-NN).

The original SP2 expansion has two key properties if we assume all eigenvalues of X_0 are in [0, 1]: it keeps the eigenvalue spectrum within [0, 1], and it improves the idempotency of each X_i by projecting the eigenstates either toward a stationary point at 1 or toward a stationary point at 0. An initial linear transform, X_0 = f_0(H), is chosen to scale the eigenvalues of H to the interval [0, 1] in reverse order. Following this initial transform, we can choose the projection polynomials that improve the convergence of the trace after each recursion. When the trace corrections no longer improve the occupation, or when all the eigenvalues of X_i are as close as possible to 0 or 1, convergence has been reached and the expansion is terminated. At this point we may use the converged density matrix X_m to calculate various quantum mechanical observables, ⟨A⟩ = Tr[X_m A], where A is the matrix representation of the relevant operator, e.g., the Hamiltonian matrix, H, for the energy.

It is easy to see how this scheme can be reformulated and mapped onto the structure of a Deep-NN as shown in Fig. 1. In the first layer we use the Hamiltonian X_0 = H as the input descriptor. The weight and bias functions, W_0 and B_0, are then chosen such that S_0 = W_0 X_0 + B_0 is the rescaled Hamiltonian with eigenvalues in reverse order inside the interval [0, 1]. The subsequent non-linear activation is given by the matrix function f(S) = S^2, which acts on the eigenvalues of S. This is in contrast to regular neural networks, where the activation function acts on the individual matrix elements. The matrix square operation of the activation function consists of a single tensor contraction, i.e. a matrix-matrix multiplication, which is an advantage since tensor cores are optimized to perform such operations at high speed. At the subsequent layer, where X_1 = f(S_0), we choose the weight and bias values such that S_1 = W_1 X_1 + B_1, with W_1 = σ_1 I and B_1 = (I − W_1) S_0. The value of σ_1 = ±1 is chosen such that S_1 has the smallest occupation error, |Tr[S_1] − N_occ|, of the two sign alternatives. These operations are continued layer by layer, and the i-th approximation to the density matrix is computed as follows:

X_i = f(... f(W_1 f(W_0 X_0 + B_0) + B_1) ...).  (11)

At the last layer, S_{m−1}, once the occupation error has converged to some sufficiently accurate value, the density matrix is output as D = X_m.

The Deep-NN formulation of the SP2 algorithm is given in pseudocode in Alg. 1 and also includes a parameter-free check for convergence. Convergence is determined from where an expected decrease, under exact arithmetics, of the estimated idempotency error, IdErr_n, is no longer fulfilled in practice. A motivation for and precise derivation of the convergence criterion is provided in the appendix. Typically, the idempotency error is computed as ‖X^2 − X‖, where ‖·‖ is either the spectral norm (2-norm) or the Frobenius norm. Here we have instead chosen to use Tr[X − X^2] as the measure of the idempotency error, which is a simpler and more computationally efficient measure. In fact, since ‖X − X^2‖_2 ≤ Tr[X − X^2] ≤ N‖X − X^2‖_2 whenever the eigenvalues of X are in [0, 1], convergence in Tr[X − X^2] is equivalent to convergence in the spectral norm. Only a single O(N) trace operation is needed in each deep layer. In Alg. 1 we also include a small constant ε which ensures that we get an alternating sign of σ_n if the occupation corrections are very small, inducing faster convergence. Otherwise, the sign σ_n is chosen to minimize the occupation error in S_n. The inclusion of the small ε term is an ad-hoc adjustment that, in general, is not a necessity of the convergence criteria. A Python script implementing the full Deep-NN SP2 algorithm is presented in the supplementary material (SI); a condensed sketch of the layer recursion is shown below.
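The following is our condensed rendering of the layer recursion of Alg. 1 (the complete script is in the SI); the seed sign σ_0 = 1 and the layer cap are our choices.

import numpy as np

def deepnn_sp2(H, n_occ, e1, eN, eps=1e-16, max_layers=100):
    """Deep-NN SP2 sketch following Alg. 1: linear layer
    S_n = W_n X_n + B_n with W_n = sigma_n I, B_n = (I - W_n) S_{n-1},
    activation X_{n+1} = f(S_n) = S_n^2, and the trace-based
    idempotency estimate IdErr_n = Tr[S_{n-1}] - Tr[S_{n-1}^2]."""
    I = np.eye(len(H))
    S = (eN * I - H) / (eN - e1)      # layer 0: reversed spectrum in [0, 1]
    NS = np.trace(S)
    sigma, iderr = [1.0], [0.0]       # sigma_0 = 1 seeds the tie-breaker
    for n in range(1, max_layers):
        X = S @ S                     # activation: one matrix-matrix multiply
        NX = np.trace(X)
        iderr.append(NS - NX)         # idempotency error estimate
        # sign minimizing the occupation error; eps forces alternation on ties
        s = np.sign(abs(2*NS - NX - n_occ) - abs(NX - n_occ) - sigma[-1]*eps)
        sigma.append(s)
        S = s * X + (1.0 - s) * S     # S_n = W_n X_n + B_n in one expression
        NS = s * NX + (1.0 - s) * NS  # updated occupation of S_n
        if iderr[n] <= 0:
            break                     # numerical noise floor reached
        if n > 2 and sigma[n-1] != sigma[n-2] and iderr[n] > 4.41*iderr[n-2]**2:
            break                     # expected quadratic decay no longer holds
    return S                          # converged X_m, before final refinement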
C. Mixed Precision Operations

The computationally dominant step in the Deep-NN SP2 Fermi-operator expansion is the calculation of the matrix square in the activation function. In dense matrix algebra, such generalized matrix-matrix multiplications can often be performed with very high performance on almost any computational platform. The SP2 scheme therefore stands out as an efficient alternative to diagonalization-based density matrix calculations. Here we are interested in using tensor core calculations. Tensor core units have been tailored to perform tensor contractions, i.e. matrix-matrix multiplications, for machine learning applications using convolutional deep neural networks with close to peak performance. Recently, Nvidia's V100 tensor core accelerated graphics processing unit broke the 100 teraFLOPs barrier for deep learning applications [48].
FIG. 1: Schematic picture of a deep neural network. The weights W_n and bias values B_n generate a linear transformation of X_n, where S_n = W_n X_n + B_n, given in standard matrix notation. A new layer X_{n+1} is provided after the application of a non-linear activation function, i.e. where X_{n+1} = f(S_n).
Algorithm 1: The Deep-NN formulation of the SP2 recursive Fermi-operator expansion algorithm. A Python script is provided in the supplementary information.

  N_occ, number of occupied states or orbitals
  ε, small number close (or equal) to 0
  H, orthonormalized Hamiltonian
  ε_1, ε_N, spectral bound estimates of H
  W_m, observable operator matrix, e.g. W_m = H
  X_0 = H, input layer
  W_0 = −(ε_N − ε_1)^{−1} I,  B_0 = ε_N (ε_N − ε_1)^{−1} I
  S_0 = W_0 X_0 + B_0
  N_S = Tr[S_0], occupation of S_0
  n = 0, number of layers
  while Not Converged do
    n = n + 1
    X_n = f(S_{n−1}), see Alg. 2
    N_X = Tr[X_n]
    IdErr_n = N_S − N_X, idempotency error estimate
    σ_n = Sign(|2N_S − N_X − N_occ| − |N_X − N_occ| − σ_{n−1} ε)
    W_n = σ_n I,  B_n = (I − W_n) S_{n−1}
    S_n = W_n X_n + B_n
    N_S = σ_n N_X + (1 − σ_n) N_S, updated occupation of S_n
    if IdErr_n ≤ 0 then
      Converged = true
    else if n > 2 and σ_{n−1} ≠ σ_{n−2} and IdErr_n > C (IdErr_{n−2})^2 then
      Converged = true   (C ≈ 4.41, see appendix)
    end if
  end while
  D = f_final(S_{m−1}), final X_m layer in Fig. 1
  α = Tr[D W_m], output observable

Our goal is to use such tensor core accelerators for the calculation of density matrices using the Deep-NN SP2 Fermi-operator expansion.

The tensor core units use low, mixed precision floating point operations. Typically, only half-precision operations with single-precision accumulation are used. The half-precision is in general too low in accuracy for meaningful density matrix calculations, but single precision accuracy is good enough for many problems. To achieve single precision accuracy we can represent a single-precision matrix X with a pair of two half-precision matrices,

X ≈ X^(0) + X^(1).  (12)

Using pseudocode notation, the corresponding dual half-precision representation of a matrix X would be generated by

X^(0) = FP16[X]
X^(1) = FP16[X − X^(0)],  (13)

where FP16[·] denotes the half-precision representation. Generalizations to a higher level of accuracy using multiple matrices, X^(n), are straightforward and will not be discussed. A matrix-matrix multiplication, X × Y, can then be performed using four separate matrix-matrix multiplications in half precision with accumulation in single precision (FP32[·]), i.e.

X × Y ≈ FP32[(X^(0) + X^(1))(Y^(0) + Y^(1))]
      = FP32[X^(0)Y^(0) + (X^(1)Y^(0) + X^(0)Y^(1)) + X^(1)Y^(1)].  (14)

In the Deep-NN SP2 Fermi-operator scheme in Alg. 1 we only need to calculate matrix squares in the activation function. If we assume that each matrix X is symmetric and neglect the small X^(1)Y^(1) term, we can reduce the calculation of a matrix square to only two matrix-matrix multiplications in half precision with single precision accumulation, i.e.

X^2 ≈ FP32[X^(0)X^(0) + X^(0)X^(1) + (X^(0)X^(1))^T].  (15)

All the matrix products and sums are assumed to be accumulated in single precision (FP32). This approach would also benefit from multiplications of symmetric matrices where only the upper or lower half of the matrix needs to be calculated.
D. Mixed Precision Deep-NN SP2

To adjust the Deep-NN SP2 algorithm in Alg. 1 to mixed precision floating-point operations, we only need to adjust the activation function, f(X) = X^2, where the matrix square is performed using tensor contractions on tensor core units in half precision with single precision accumulation. This is described by Alg. 2.

The main source of error in the mixed precision Deep-NN SP2 scheme is the eigenvalue distribution at convergence. Because of the finite precision, the eigenvalues will not be exactly 1 or 0, corresponding to fully occupied or unoccupied states.
Algorithm 2: Calculation of the activation function, f(X) = X^2, in the Deep-NN SP2 scheme in Alg. 1 for tensor core units in half precision with single precision accumulation.

  X, input matrix in single precision
  X^(0) = FP16[X]
  X^(1) = FP16[X − X^(0)]
  A = FP32[X^(0) × X^(0)], tensor core multiplication
  B = FP32[X^(0) × X^(1)], tensor core multiplication
  f(X) = X^2 ≈ FP32[A + B + B^T]
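NumPy's float16 type can emulate the arithmetic of Alg. 2 for testing purposes; the actual tensor core GEMMs are issued through cuBLAS (Sec. IV), so the sketch below is an emulation with our own naming, not the production kernel.

import numpy as np

def f_mixed(X):
    """Alg. 2: activation f(X) = X^2 for symmetric X from the dual
    half-precision representation X ~ X0 + X1 of Eqs. (12)-(15).
    float16 emulates the FP16 tensor core inputs; the products are
    formed in float32, mimicking FP16 GEMMs with FP32 accumulation."""
    X0 = X.astype(np.float16)                            # leading FP16 part
    X1 = (X - X0.astype(np.float32)).astype(np.float16)  # FP16 remainder
    A = X0.astype(np.float32) @ X0.astype(np.float32)    # X0*X0
    B = X0.astype(np.float32) @ X1.astype(np.float32)    # X0*X1
    return A + B + B.T                 # Eq. (15); the X1*X1 term is dropped

# quick accuracy check against a double-precision reference
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, (200, 200)).astype(np.float32)
X = 0.5 * (X + X.T)                    # Eq. (15) assumes a symmetric X
ref = X.astype(np.float64) @ X.astype(np.float64)
print(np.abs(f_mixed(X) - ref).max())  # far below a pure-FP16 square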
FIG. 2: The idempotency error and the energy error per layer of the Deep-NN SP2 Fermi-operator expansion scheme, for a small uniformly randomized symmetric Hamiltonian with N_occ = 5.

This may lead to significant errors in energy calculations. However, these errors can be reduced by a post-processing step. The post-processing refinement can be achieved by using a modified activation function in the final step, f_final(S_n), in Alg. 1. Instead of the matrix square, we use

f_final(S_n) = { 2S_n^2 − S_n^4,    if σ_{n−1} = 1,
               { (2S_n − S_n^2)^2,  if σ_{n−1} = −1.  (16)

Figure 2 shows the convergence of the energy, Tr[X_n H], with the error measured relative to the "exact" energy, and the idempotency error measured by the spectral norm, ‖X_n^2 − X_n‖_2. The rapid improvement by multiple orders of magnitude in the last layer is given by the final refinement step, performed in an enhanced precision, which reduces the error quadratically from layer n − 1 to layer n. The enhanced precision is often necessary to attain a sufficiently high numerical accuracy.
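In code, the refinement of Eq. (16) amounts to composing the two alternating projection polynomials in one full-precision step; the branch assignment below follows our reading of Eq. (16).

import numpy as np

def f_final(S, sigma_prev):
    """Post-processing refinement, Eq. (16): the last two alternating
    SP2 projections composed into one step, evaluated in full precision."""
    S = S.astype(np.float64)     # enhanced precision for the final layer
    S2 = S @ S
    if sigma_prev == 1:          # compose x^2 followed by 2x - x^2
        return 2.0 * S2 - S2 @ S2
    else:                        # compose 2x - x^2 followed by x^2
        Y = 2.0 * S - S2
        return Y @ Y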
E. Convergence Estimate For Low Precision Floating-Point Operations

Using only half-precision floating point operations in the Deep-NN SP2 scheme leads to fairly large errors compared to regular double precision operations, even if the dual mixed precision presented above is used. Thanks to the post-processing refinement step the final error is significantly reduced. However, we first need to determine when convergence is reached. This can be difficult to decide under numerically noisy conditions caused by the low-precision floating point operations. The idea we use to determine convergence is based on the observation that the idempotency estimate used in Alg. 1, i.e. IdErr_n = Tr[S_{n−1} − X_n] = Tr[S_{n−1} − S_{n−1}^2], decreases quadratically between every second step if we have alternating signs of σ_n. However, limitations in the finite precision will, at some point, prevent the expected decay of the idempotency error. At this point, the expansion can then be terminated, because the best possible convergence has been reached. This parameter-free convergence estimate is both efficient and easy to implement.

The refinement in Eq. (16) is the result of composing two layers in a single step with σ_n = 1 and σ_{n−1} = −1, or σ_n = −1 and σ_{n−1} = 1. These alternating signs of σ provide for a guaranteed second-order decrease in the error if exact floating point operations are used. Our convergence criterion is analogous to the parameterless stopping criterion by Kruchinina, Rudberg and Rubensson [49], which here has been adapted to a different idempotency measure. Details of the derivation are given in the appendix.

IV. MIXED PRECISION FERMI-OPERATOR EXPANSION ON TENSOR CORES
To make use of tensor cores with matrix multiplications, the Deep-NN SP2 algorithm was written using CUDA v11.0, the cuBLAS library, and several customized kernels. All matrix multiplications in the SP2 algorithm were carried out using cuBLAS general matrix multiplication (GEMM) calls. The GEMM calls execute tensor core operations automatically and no special commands are required to make use of them. This automatic feature can be disabled with the appropriate cuBLAS API call.

Implementation of the half precision multiplications needed by the X^2 activation function in Alg. 2 required several custom kernels. These kernels decompose the matrix X into a sum of two FP16 matrices that are then multiplied and summed as described in Eqs. (12) through (15) using standard cuBLAS GEMM routines. Custom kernels were also necessary to reduce, as much as possible, the amount of data transfer between the host and GPU device memory. The Deep-NN SP2 CUDA implementation will be made available through the PROGRESS [50] library.

The rate of floating point operations (FLOPs) for the Deep-NN SP2 algorithm was estimated from simulations using tensor cores on both Nvidia A100 and V100 GPUs and is shown in Fig. 3.
FIG. 3: Half-precision tera floating point operations per second (teraFLOPs) vs. Hamiltonian system size for the Deep-NN SP2 Fermi-operator expansion running on Nvidia's Volta V100 tensor core units, on its Volta GPU only, and on the more recent A100 tensor core units. The difference in maximum N values is due to device memory limitations.

This estimate does not include the initialization and memory allocation of the routine, nor the final layer, i.e. the double precision refinement step. For purposes of comparison, this FLOP rate was also computed on the V100 with the tensor cores disabled; we call this the GPU-only FLOP rate, and it is displayed in Fig. 3 as well. We observe an approximate 7-8x speedup on the V100 when tensor cores are enabled versus when they are disabled and only the GPU is used. Even more impressive, we achieve approximately 120 teraFLOPs on the A100 when utilizing tensor cores.

Although the plots in Fig. 3 suggest the Deep-NN SP2 algorithm may only be beneficial for large N values, a recent publication [51] shows how matrix-matrix multiplications can reach high performance also for smaller N by utilizing batching techniques. The same technique would most likely benefit the SP2 method for small N. Additionally, further performance increases might also be gained by considering a sparse matrix implementation [52] of the Deep-NN SP2 method.

V. CAPITALIZING ON THE MACHINE LEARNING PERSPECTIVE
There are several observations that emerge from the machine learning perspective of the SP2 Fermi-operator expansion scheme when it is formulated in terms of a layered network structure: 1) the quantum mechanical problem is solved naturally and with high efficiency through the computational structure of a generalized deep neural network; 2) the bias and weight values could be optimized using machine learning techniques to achieve improved convergence and possibly higher accuracy; 3) other functions besides the matrix Heaviside step function could potentially be approximated through the same generalized deep neural network, including Fermi functions at finite electronic temperatures; 4) recursive Fermi-operator schemes or sign-matrix expansions based on higher-order spectral projection polynomials could be mapped onto the same generalized network structure and use the same mixed precision technique; and 5) a recursive calculation of Green's functions via a Dyson series expansion could also be generalized to fit into the algorithmic structure of Deep-NN SP2, e.g. G = G_0 + G_0 V G (where G, G_0, and V are the Green's function, the unperturbed Green's function, and a perturbation to the Hamiltonian, respectively) can be rewritten recursively in a similar way to the SP2 scheme, where the corresponding weights and bias values could be optimized for convergence.

Here we will briefly discuss the ability to accelerate convergence of the Deep-NN SP2 algorithm, and how approximate Fermi functions for fractional occupation numbers at elevated electronic temperatures can be generated recursively with high accuracy.

A. Accelerated Deep-NN SP2
In machine learning we try to learn the weight and bias functions by optimizing a regularized penalty function based on, for example, some large set of predetermined data. Here we may instead use the convergence rate to the idempotent density matrix. Each layer of the Deep-NN SP2 scheme can then be seen as a generalized spectral projection with weights W_n and bias values B_n. Instead of choosing the spectral projections from W_n = σ_n I, where σ_n = ±1, we may optimize over a continuous set of values, σ_n ∈ R, as illustrated in Fig. 4. To optimize convergence we choose the values of W_n which, in each separate layer, give the highest slope of the projection around the rescaled eigenvalues corresponding to the HOMO and LUMO eigenvalues, but without risking switching places between occupied and unoccupied eigenvalues. This local choice of optimization accelerates the separation of the HOMO and LUMO eigenstates, which are the last to reach the fixed points at 1 and 0. The optimization requires prior knowledge of the rescaled HOMO and LUMO eigenvalues. The optimized spectral projections may push eigenvalues outside of the [0, 1] interval, which could lead to divergence. To avoid this we need to shift and rescale the eigenvalue spectrum to [0, 1] after each optimized projection. The combined transform from the choice of σ_n values, followed by the shift and rescaling, determines the optimized weight, W_n, and bias values, B_n, in each layer. This local optimization of the weight and bias values of each layer can lead to a significant acceleration. Our accelerated Deep-NN SP2 algorithm is presented as a Python script in the supplementary information, and the optimized choices of W_n and B_n for the network defined by X_{n+1} = f(W_n X_n + B_n) are given there explicitly; a sketch of one possible realization is shown at the end of this section.

FIG. 4: Generalized Deep-NN SP2 projections in each layer, X_n → X_{n+1} = W_n f(X_n) + (1 − W_n) X_n, with various weight values W_n = σ_n I, with σ_n ∈ R, instead of σ_n = ±1. By locally optimizing σ_n in each layer and using shift and rescale transforms to keep the eigenvalues within the interval [0, 1], accelerated Deep-NN formulations of the SP2 expansion are generated. A Python script for the accelerated Deep-NN SP2 algorithm is presented in the supplementary information.

This accelerated Deep-NN SP2 scheme turns out to be an equivalent Deep-NN formulation of the accelerated SP2 Fermi-operator expansion by Rubensson, which uses a shift and rescale technique [41, 44]. However, here we arrive at the same acceleration scheme based on the Deep-NN perspective and with a different combination of spectral projection polynomials and choice of shift and rescale transformations. An example of the convergence of the accelerated Deep-NN SP2 scheme is shown in Fig. 5. Convergence is reached after 17 layers instead of 28, which is a significant improvement. A disadvantage of the accelerated Deep-NN SP2 scheme is that it requires prior knowledge of the HOMO and LUMO eigenvalues. However, for repeated calculations of the density matrix, for example in molecular dynamics simulations, the HOMO and LUMO eigenvalues can be estimated from previous time steps with a high level of accuracy [41]. Further acceleration of the Deep-NN SP2 scheme can possibly be achieved by tailoring the optimization of the weight and bias values for Hamiltonian matrices with particular eigenvalue distributions.
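The sketch below illustrates one possible realization of such a shift-and-rescale acceleration, assuming exact knowledge of the rescaled HOMO and LUMO eigenvalues. The specific weight choices w = 2/(1 + h) and a = l/2, which center each fold on the relevant band, are our illustration in the spirit of Refs. [41, 44] and are not a transcription of the SI script.

import numpy as np

def accelerated_sp2(H, e1, eN, e_homo, e_lumo, tol=1e-12, max_layers=60):
    """Accelerated Deep-NN SP2 (illustrative sketch).

    Before each fold, a continuous weight recenters the spectrum on the
    fold point so that the occupied band (toward 1) or the unoccupied
    band (toward 0) contracts as fast as possible, without letting
    occupied and unoccupied eigenvalues swap order.  h and l track the
    rescaled HOMO and LUMO eigenvalues; all eigenvalues stay in [0, 1]."""
    I = np.eye(len(H))
    X = (eN * I - H) / (eN - e1)       # reversed spectrum in [0, 1]
    h = (eN - e_homo) / (eN - e1)      # rescaled HOMO, occupied band bottom
    l = (eN - e_lumo) / (eN - e1)      # rescaled LUMO, unoccupied band top
    for n in range(max_layers):
        if (1.0 - h) + l < tol:        # both bands at their fixed points
            return X
        if n % 2 == 0:
            # fold toward 1: stretch by w = 2/(1+h) so that w*h and w*1
            # land symmetrically around the fold point of 2x - x^2
            w = 2.0 / (1.0 + h)
            Y = w * X
            X = 2.0 * Y - Y @ Y
            h = 1.0 - ((1.0 - h) / (1.0 + h)) ** 2
            l = 2.0 * (w * l) - (w * l) ** 2
        else:
            # fold toward 0: shift by a = l/2 so the unoccupied band is
            # centered on the fold point of x^2 (the point 1 is kept fixed)
            a = 0.5 * l
            Y = (X - a * I) / (1.0 - a)
            X = Y @ Y
            h = ((h - a) / (1.0 - a)) ** 2
            l = (a / (1.0 - a)) ** 2
    return X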
B. Optimized SP2 Finite Temperature Fermi-Operator Expansion
A recursive SP2 expansion that is stopped before it has reached convergence generates a smooth approximation to the Heaviside step function θ(ε − µ) using the eigenvalues ε_i ∈ [0, 1] of the rescaled Hamiltonian matrix H. This occurs because the second-order spectral projection functions are smooth on the interval [0, 1].

FIG. 5: The idempotency error, ‖X_n^2 − X_n‖_2, for a test Hamiltonian with N_occ = 10, with and without acceleration. A Python script of the accelerated Deep-NN SP2 scheme is presented in the supplementary material.

Instead of x^2 and 2x − x^2, as in Eq. (10), we allow for general second-degree polynomials on [0, 1] to generate an approximation to the Fermi function at any ε ∈ [0, 1]. We define the n-th layer to be

ε_0 = ε,
ε_n = θ_{n−1,2} ε_{n−1}^2 + θ_{n−1,1} ε_{n−1} + θ_{n−1,0}.  (17)

To increase model flexibility, a linear combination of the intermediate values, Σ_{i=0}^n c_i ε_i, is used to enhance the approximation. Taking into account the folding of the eigenspectrum by the SP2 scheme, the Fermi function approximation then becomes

F̃(ε) := 1 − Σ_{i=0}^n c_i ε_i.  (18)

Subsequently, the θ_{i,j} and c_i are trained to minimize the mean squared error over a pre-selected grid, {ε_i}, on [0, 1],

L(F̃) = Σ_i w_i [F̃(ε_i) − F(ε_i)]^2 ≈ ∫_0^1 [F̃(ε) − F(ε)]^2 dε.  (19)

We use the Levenberg-Marquardt (LM) optimization method, which is designed specifically for a sum-of-squares loss function. LM dynamically blends the Gauss-Newton method, which yields quadratic convergence where possible, with gradient descent, which is slower but has more robust convergence guarantees.

Figure 6 shows an example of a globally optimized truncated SP2 recursive Fermi-operator scheme in comparison to the corresponding Fermi-Dirac function,

F(ε) = (e^{β(ε − 1/2)} + 1)^{−1},  (20)

with β = 40. The approximation error is shown in the lower panel.

FIG. 6: Result from learning the Fermi function with 11 layers and β = 40. The exact Fermi function (red) is compared with the learned one (dashed blue) in the top panel. The bottom panel shows the error as a function of the scaled energy ε.

Previously, we have been limited to the use of recursive Fermi-operator expansions that are based on rational Padé polynomials as their projections to reach this level of accuracy [37, 53]. However, these schemes are implicit and require the solution of a system of equations in each iteration. Here, we are able to achieve a similar level of accuracy using the explicit machine-learned generalized SP2 expansion scheme presented in Eq. (17); a minimal training sketch is shown below.
We have demonstrated how the solution to thequantum-mechanical electronic structure problem, forexample, appearing in Hartree-Fock and Kohn-Shamdensity functional theory, can be mapped onto the com-putational structure of a generalized deep neural net-work. The solution, in terms of an effective single-particle density matrix, is generated by a recursive Fermi-operator expansion derived from a second-order spectralprojection scheme. The main computational bottleneckof the layered network is dominated by the activationfunction, a matrix square operation, which can be per-formed with high efficiency on tensor core units using amixed-precision formulation that enhances the intrinsichalf-precision floating point operations. A single preci-sion matrix-matrix multiplication in the activation func-tion is replaced by two half-precision matrix-matrix mul-tiplications, allowing us to make full use of available ten-sor core architectures. This leads to an impressive speedup of about 16x for the calculation of density matriceswith respect to the same generation GPUs.By capitalizing on the machine learning perspectiveof the deep neural network formulation of the recur-sive second-order SP2 Fermi-operator expansion, we wereable to both accelerate the rate of convergence, by opti-mizing the weights of the neural net, and apply machinelearning techniques to closely approximate Fermi-Diracfunctions at finite electronic temperatures.
VII. ACKNOWLEDGMENTS
This work is supported by the U.S. Department of Energy Office of Basic Energy Sciences (FWP LANLE8AN), the LANL LDRD-ER program, and by the U.S. Department of Energy through the Los Alamos National Laboratory. We are thankful to Nicolas Bock for his advice on code development.
VIII. APPENDIX
Here, we state the result used to justify our parameter-free convergence criterion in Alg. 1, which is based on a bound on the worst-case error reduction of the general form Error_{i+1} ≤ C (Error_{i−1})^2 for some constant C. We use the estimate of the idempotency error, IdErr_i = Tr[S_{i−1} − S_{i−1}^2], as the error measure Error_i. We then determine that convergence occurs once the estimated error reduction (in Eq. (21) below) no longer holds with the available precision of the floating point operations. It is always valid in exact arithmetics. The theory is analogous to the previous parameter-free convergence criterion by Kruchinina et al. [49], which was based on a different measure of the idempotency error.

Theorem 1.
Assume that σ_i ≠ σ_{i−1}, so that either S_i = (2S_{i−2} − S_{i−2}^2)^2 or S_i = 2S_{i−2}^2 − S_{i−2}^4, and that i > 2. Assume also that S_{i−2} has all eigenvalues in [0, 1]. Then,

IdErr_{i+1} = Tr[S_i − S_i^2] ≤ C (Tr[S_{i−2} − S_{i−2}^2])^2 = C (IdErr_{i−1})^2,  (21)

with C = (71 + 17√17)/32 ≈ 4.41.

Proof. Let {λ_j^(i)}_{j=1}^N be the eigenvalues of S_i, where the ordering of the eigenvalues is such that λ_j^(i) = λ_j^(i−1) + σ_i ((λ_j^(i−1))^2 − λ_j^(i−1)), j = 1, ..., N. From Ref. [49] we have that

max_{λ∈(0,1)} [(2λ − λ^2)^2 − (2λ − λ^2)^4] / (λ − λ^2)^2  (22)
= max_{λ∈(0,1)} [2λ^2 − λ^4 − (2λ^2 − λ^4)^2] / (λ − λ^2)^2 = C,  (23)

which, given that σ_i ≠ σ_{i−1}, means that for j = 1, ..., N,

λ_j^(i) − (λ_j^(i))^2 ≤ C (λ_j^(i−2) − (λ_j^(i−2))^2)^2.  (24)

Summing over all eigenvalues,

IdErr_{i+1} = Tr[S_i − S_i^2]  (25)
= Σ_{j=1}^N [λ_j^(i) − (λ_j^(i))^2]  (26)
≤ Σ_{j=1}^N C (λ_j^(i−2) − (λ_j^(i−2))^2)^2  (27)
= C [(Σ_{j=1}^N (λ_j^(i−2) − (λ_j^(i−2))^2))^2
  − Σ_{j=1}^N Σ_{k≠j} (λ_j^(i−2) − (λ_j^(i−2))^2)(λ_k^(i−2) − (λ_k^(i−2))^2)]  (28)
≤ C (Tr[S_{i−2} − S_{i−2}^2])^2  (29)
= C (IdErr_{i−1})^2.  (30)

Note that C is not the asymptotic error constant, but since C is finite, the result implies the established quadratic convergence for sequences with alternating polynomials in the limit of idempotent matrices S_i [36, 41].
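The constant C can be checked numerically; the short script below evaluates the maximization in Eqs. (22) and (23) on a fine grid and compares it with the closed form.

import numpy as np

# sanity check of the constant C in Theorem 1: maximize
# [(2L - L^2)^2 - (2L - L^2)^4] / (L - L^2)^2 over L in (0, 1)
lam = np.linspace(1e-6, 1.0 - 1e-6, 200001)
y = 2.0 * lam - lam**2
ratio = (y**2 - y**4) / (lam - lam**2) ** 2
print(ratio.max())                      # ~4.4092
print((71 + 17 * np.sqrt(17)) / 32)     # ~4.4092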
[1] R. McWeeny, Proc. R. Soc. London Ser. A, 496 (1956).
[2] R. McWeeny, Rev. Mod. Phys., 335 (1960).
[3] M. Finnis, Interatomic Forces in Condensed Matter (OUP Oxford, 2003).
[4] A. Szabo and N. S. Ostlund, Modern Quantum Chemistry (McGraw-Hill Inc., New York, 1989), first, revised ed.
[5] E. N. Economou, Green's Functions in Quantum Physics (Springer Science & Business Media, 2006).
[6] P. W. Payne, Proc. Natl. Acad. Sci. U. S. A., 6391 (1982).
[7] A. H. R. Palser and D. E. Manolopoulos, Phys. Rev. B, 12704 (1998).
[8] D. R. Bowler and M. J. Gillan, Comput. Phys. Commun., 95 (1999).
[9] NVIDIA tensor cores, accessed: 2020-12-22.
[10] I. S. Ufimtsev and T. J. Martínez, J. Chem. Theory Comput., 222 (2008).
[11] I. S. Ufimtsev and T. J. Martínez, J. Chem. Theory Comput., 1004 (2009).
[12] I. S. Ufimtsev and T. J. Martínez, J. Chem. Theory Comput., 2619 (2009).
[13] J. E. Stone, D. J. Hardy, I. S. Ufimtsev, and K. Schulten, J. Mol. Graph. Model., 116 (2010).
[14] N. Luehr, I. S. Ufimtsev, and T. J. Martínez, J. Chem. Theory Comput., 949 (2011).
[15] J. D. C. Maia, G. A. U. Carvalho, C. P. Mangueira, S. R. Santana, L. A. F. Cabral, and G. B. Rocha, J. Chem. Theory Comput., 3131 (2015).
[16] M. Hacene, A. Anciaux-Sedrakian, X. Rozanska, D. Klahr, T. Guignon, and P. Fleurat-Lessard, J. Chem. Theory Comput., 2581 (2012).
[17] F. Liu, N. Luehr, H. J. Kulik, and T. J. Martínez, Comput. Phys. Commun., 3072 (2012).
[18] W. P. Huhn, B. Lange, V. W. Yu, M. Yoon, and V. Blum, Comput. Phys. Commun., 107314 (2020).
[19] M. S. Gordon and T. L. Windus, Chem. Rev., 9015 (2020).
[20] G. Zhou, B. Nebgen, N. Lubbers, W. Malone, A. M. N. Niklasson, and S. Tretiak, J. Chem. Theory Comput., 4951 (2020), URL https://doi.org/10.1021/acs.jctc.0c00243.
[21] J. Schmidhuber, Neural Netw., 85 (2015).
[22] C. J. Higham and D. F. Higham, SIAM Rev., 860 (2019).
[23] C. F. A. Negre, S. M. Mniszewski, M. J. Cawkwell, N. Bock, M. E. Wall, and A. M. N. Niklasson, J. Chem. Theory Comput., 3063 (2016).
[24] R. N. Silver and H. Roder, Int. J. Mod. Phys. C, 735 (1994).
[25] A. Weisse, G. Wellein, A. Alvermann, and H. Fehske, Rev. Mod. Phys., 275 (2006).
[26] R. Zeller, J. Deutz, and P. Dederichs, Solid State Commun., 993 (1985).
[27] N. Bernstein, Europhys. Lett., 52 (2001).
[28] S. Goedecker, Phys. Rev. B, 17573 (1993).
[29] T. Ozaki, Phys. Rev. B, 035123 (2007).
[30] L. Lin, M. Chen, C. Yang, and L. He, J. Phys. Condens. Matter, 195501 (2013).
[31] S. Goedecker, Rev. Mod. Phys., 1085 (1999).
[32] D. R. Bowler and T. Miyazaki, Rep. Prog. Phys., 036503 (2012).
[33] R. N. Silver, H. Roder, A. F. Voter, and J. D. Kress, Int. J. Comput. Phys., 115 (1996).
[34] K. Nemeth and G. E. Scuseria, J. Chem. Phys., 6035 (2000).
[35] A. Holas, Chem. Phys. Lett., 552 (2001).
[36] A. M. N. Niklasson, Phys. Rev. B, 155115 (2002).
[37] A. M. N. Niklasson, Phys. Rev. B, 233104 (2003).
[38] D. K. Jordan and D. A. Mazziotti, J. Chem. Phys., 084114 (2005).
[39] E. Rudberg and E. H. Rubensson, J. Phys.: Condens. Matter, 075502 (2011).
[40] P. Suryanarayana, Chem. Phys. Lett., 291 (2013).
[41] E. H. Rubensson and A. M. N. Niklasson, SIAM J. Sci. Comput., 148 (2014), URL http://arxiv.org/abs/1302.7292.
[42] L. A. Truflandier, R. M. Dianzinga, and D. R. Bowler, J. Chem. Phys. (2016), ISSN 0021-9606.
[43] G. Beylkin, N. Coult, and M. J. Mohlenkamp, J. Comp. Phys., 32 (1999).
[44] E. H. Rubensson, J. Chem. Theory Comput., 1233 (2011).
[45] R. Silver, H. Roeder, A. Voter, and J. Kress, J. Comput. Phys., 115 (1996), ISSN 0021-9991.
[46] A. Weiße, G. Wellein, A. Alvermann, and H. Fehske, Rev. Mod. Phys., 275 (2006), URL https://link.aps.org/doi/10.1103/RevModPhys.78.275.
[47] S. M. Mniszewski, R. Perriot, E. H. Rubensson, C. F. A. Negre, M. J. Cawkwell, and A. M. N. Niklasson, J. Chem. Theory Comput., 190 (2019), URL https://doi.org/10.1021/acs.jctc.8b00887.
[48] NVIDIA V100, accessed: 2020-12-23.
[49] A. Kruchinina, E. Rudberg, and E. H. Rubensson, J. Chem. Theory Comput., 5788 (2016), URL https://doi.org/10.1021/acs.jctc.6b00626.
[50] A. M. N. Niklasson, S. M. Mniszewski, C. F. A. Negre, M. E. Wall, M. J. Cawkwell, and N. Bock, PROGRESS version 1.0 (2016), URL https://github.com/lanl/qmd-progress.
[51] A. Abdelfattah, S. Tomov, and J. Dongarra, in (2019), pp. 111–122.
[52] O. Zachariadis, N. Satpute, J. Gómez-Luna, and J. Olivares, Comput. Electr. Eng., 106848 (2020).
[53] A. M. N. Niklasson, M. J. Cawkwell, E. H. Rubensson, and E. Rudberg, Phys. Rev. E 92.