DeePKS: a comprehensive data-driven approach towards chemically accurate density functional theory
Yixiao Chen^a, Linfeng Zhang^a, Han Wang^b, and Weinan E^{a,c}
a Program in Applied and Computational Mathematics, Princeton University, Princeton, NJ, USA
b Laboratory of Computational Physics, Institute of Applied Physics and Computational Mathematics, Huayuan Road 6, Beijing 100088, People's Republic of China
c Department of Mathematics, Princeton University, Princeton, NJ, USA
August 4, 2020
Abstract
We propose a general machine learning-based framework for building an accurate and widely-applicable energy functional within the framework of generalized Kohn-Sham density functional theory. To this end, we develop a way of training self-consistent models that are capable of taking large datasets from different systems and different kinds of labels. We demonstrate that the functional that results from this training procedure gives chemically accurate predictions on energy, force, dipole, and electron density for a large class of molecules. It can be continuously improved when more and more data are available.
Predicting the ground-state information of a many-electron system in an environment of clamped ions is a fundamental task in the field of molecular modeling. Over the past few decades, a wide variety of methods have been developed for addressing this problem, such as quantum Monte Carlo, post Hartree-Fock (HF) methods (also known as wave function theory, WFT), density functional theory (DFT), etc. In general, these methods follow a well-known trade-off between accuracy and efficiency. The cost of exact WFT methods like full configuration interaction (FCI) usually scales exponentially with system size. Coupled cluster singles, doubles and perturbative triples (CCSD(T)), the method often referred to as the gold standard of quantum chemistry, has a cost that scales as O(N^7) with respect to the number of electrons N. The cost of Kohn-Sham (KS) DFT and its generalized version typically scales as O(N^3)–O(N^4). However, currently available DFT models, although much more efficient, are much less accurate compared with FCI and CCSD(T), due to the approximate nature of the functionals involved.

Developing accurate and efficient DFT functionals is among the world's hardest and most important parameter fitting problems. As for all parameter fitting problems, we need a functional form with some free parameters and a way to optimize these parameters. The key notion in this context is universality. In principle, the DFT functionals are universal, and we would like our approximate functionals to be as universal as possible. It should be noted immediately that truly universal and computationally efficient functionals are very difficult, if not impossible, to come by. Therefore our goal should be to develop a functional that is efficient and chemically accurate for all the systems that can be reasonably represented by the data available. To this end, we look for models with the following requirements in mind:

1. We need to have a functional form (for these approximate functionals) that is expressive enough so that the behavior of different systems, whether small or large molecules or condensed systems, can all be accommodated.

2. We should also make maximum use of existing high quality data, including data for different systems and data with different kinds of labels, such as energy, force, and electron density. The model should be continuously improvable as more and more data become available.

Since condensed systems involve other non-trivial technical issues, we choose to focus first on molecules. For similar reasons, we do not discuss the analytical conditions that are used in the so-called non-empirical functionals. Most of these conditions are derived in some limiting cases, such as the uniform electron gas. They are less relevant to molecules or the generalized Kohn-Sham scheme that we are going to use.

A reasonably successful model that accomplishes the first requirement, termed Deep Post Hartree-Fock (DeePHF), has been developed in Ref. 2 for molecules. By exploiting both physical constraints from symmetries and the unprecedented expressivity of neural network (NN) functions, DeePHF succeeded in achieving chemical accuracy for the energy at a cost comparable to Hartree-Fock (HF). It has demonstrated impressive performance on existing datasets for molecules. One main objective of the current work is to extend DeePHF to a self-consistent framework such as KS-DFT.
We will adopt the generalized Kohn-Sham (GKS) formalism, with the domain of our functional elevated from the pure density to the Kohn-Sham orbitals, so that the functional space represented is much larger. At the same time, we will make sure that the second requirement listed above is also fulfilled.

Despite several earlier attempts, there have been serious difficulties involved in this task. For machine learning-based models such as DeePHF, it is the gradient-based optimization schemes that make them efficiently trainable. Gradient-based methods can hardly be used in the self-consistent framework, since it is very expensive to compute the gradients of the self-consistent energy, force, and density with respect to the NN parameters. For this reason, an earlier attempt reported in Ref. 19 used Monte Carlo, a gradient-free optimization scheme. This is prohibitively expensive in the self-consistent setup, particularly with large datasets. When the training data is limited to only the energies of a few molecules, the pioneering work reported in Ref. 7 successfully developed a gradient-based strategy by effectively decoupling the self-consistent constraint and the gradient-based training. We will follow a similar strategy, but we have to develop a modified reformulation to make the process more efficient so that much larger datasets can be handled. When the training data also include labels other than energy, such as forces and electron density, to the best of our knowledge, no effective gradient-based method has been developed. We will present a new training scheme that overcomes these difficulties in a very elegant way.

We name the approach proposed here Deep Kohn-Sham (DeePKS). We also use DeePKS to refer to the model (i.e., the functionals) obtained this way. DeePKS obeys all physical and gauge symmetries and is consistent with all known high quality data. For this reason, it is arguably the most competitive model in terms of accuracy and transferability, given our current knowledge. In addition, it can be continuously improved as more and more data become available. We also note that the training schemes developed here can be used in other situations where self-consistent models are trained.

We first give a brief overview of the (generalized) Kohn-Sham theory. We start from the many-body Schrödinger equation of N electrons indexed by i,

(T + W + V_ext) Ψ(x_1, x_2, ..., x_N) = E_tot Ψ(x_1, x_2, ..., x_N),   (1)

where we use E_tot to denote the ground-state energy of the N-electron Schrödinger equation. Here T = −(1/2) Σ_i ∇_i² and W = (1/2) Σ_{i≠j} 1/|x_i − x_j| denote the kinetic and electron-electron interactions, respectively. V_ext stands for the external potential. For example, in an atomic system with M ions indexed by I, V_ext = Σ_i V_ext(x_i) = −Σ_{I,i} Z_I/|x_I − x_i|.

Following the variational principle, the ground-state energy can also be written as

E_tot = min_Ψ ⟨Ψ| T + W + V_ext |Ψ⟩ = min_Ψ { G[Ψ] + E_ext[ρ[Ψ]] },   (2)

where

G[Ψ] = ⟨Ψ| T + W |Ψ⟩,   (3)
E_ext[ρ] = ∫ dx V_ext(x) ρ(x).   (4)

According to the well-known Hohenberg-Kohn theorem, this problem is equivalent to another minimization problem with respect to the electron density ρ,

E_tot = min_{ρ(x)→N} { F_HK[ρ] + E_ext[ρ] },   (5)
F_HK[ρ] = min_{Ψ→ρ(x)} G[Ψ] ≡ min_{Ψ→ρ(x)} ⟨Ψ| T + W |Ψ⟩.   (6)

Eq. 6 defines the Hohenberg-Kohn (HK) functional F_HK[ρ] using the Levy-Lieb constrained search formulation.
Note that here both F_HK[ρ] and G[Ψ] are considered to be universal, meaning that they do not depend explicitly on the external potential V_ext.

Directly solving for the ground-state energy or representing the HK functional can be very difficult, since it involves dealing with the N-particle wave function. Therefore, one often resorts to the popular Kohn-Sham (KS) scheme to simplify this problem. The key ingredient of KS-like theories is to replace the general N-particle ground state Ψ with a model system whose ground state can be represented by a single Slater determinant Φ = (1/√N!) det[ϕ_i(x_j)], where we use {ϕ_i} to denote a set of orthonormal single particle orbitals. The energy functional can also be written as G[Φ] = G[{ϕ_i}]. As a result, the ground-state energy E_KS and the density functional F_KS are given by

E_KS = min_{ρ(x)→N} { F_KS[ρ] + E_ext[ρ] },   (7)
F_KS[ρ] = min_{Φ→ρ(x)} G[Φ] = min_{{ϕ_i}→ρ(x), ⟨ϕ_i|ϕ_j⟩=δ_ij} G[{ϕ_i}].   (8)

Depending on how the functional G[Φ] is chosen, the above formulation gives many different theories. To name a few:

• If we leave G unchanged from the G defined in Eq. 3, we get the Hartree-Fock theory,

G_HF[Φ] = G[Φ] = ⟨Φ| T + W |Φ⟩ = ⟨Φ| T |Φ⟩ + E_H[ρ] + E_F[{ϕ_i}],   (9)

where E_H[ρ] and E_F[{ϕ_i}] denote the Coulomb (Hartree) and exchange (Fock) energy, respectively. Note that E_H depends only on the electron density ρ(x) = Σ_i |ϕ_i(x)|².

• If we constrain G such that the only term that explicitly depends on Φ is the kinetic energy, we get the standard KS theory,

G_KS[Φ] = ⟨Φ| T |Φ⟩ + E_H[ρ] + E_xc[ρ],   (10)

where E_xc[ρ] is the so-called exchange-correlation functional. Usually E_xc[ρ] can be split into two parts, the exchange energy E_x[ρ] and the correlation energy E_c[ρ].

• If we include part of the Fock exchange operator in addition to the standard exchange-correlation functional, we get a standard version of the hybrid Kohn-Sham theory,

G_Hyb[Φ] = ⟨Φ| T |Φ⟩ + E_H[ρ] + λ E_F[{ϕ_i}] + (1 − λ) E_x[ρ] + E_c[ρ],   (11)

where λ is a tunable factor deciding how much of the exact exchange operator is used.

The term generalized Kohn-Sham (GKS) theory simply refers to any choice of G that does not satisfy the standard KS condition (Eq. 10). Many functionals fall into this class, including all hybrid functionals and most meta-GGA functionals.

A KS-like theory is considered to be exact if its choice of G yields the same density functional as the original Hohenberg-Kohn functional, namely,

F_KS[ρ] = F_HK[ρ].   (12)

Therefore, an exact theory would give the exact ground-state energy, E_KS = E_tot, as well as the exact ground-state density ρ. As an example, the aforementioned Hartree-Fock theory is obviously not exact. It remains an open question whether, in general, there exists a choice of G that yields the exact functional, and hence the exact ground-state density. In the context of standard KS theory, this is termed the problem of non-interacting v-representability. From this point of view, the GKS theory is at least as exact as the standard KS theory.

In order to solve the KS-like problem, we reformulate Eq. 7 as a direct minimization problem with respect to the single particle orbitals {ϕ_i}, namely

E_KS = min_{{ϕ_i}, ⟨ϕ_i|ϕ_j⟩=δ_ij} { G[{ϕ_i}] + E_ext[ρ[{ϕ_i}]] }.   (13)

We now further require that the functional derivative of G[{ϕ_i}] can be cast into the form of a single particle operator,

δG[{ϕ_j}] / δ⟨ϕ_i| = O[{ϕ_j}] |ϕ_i⟩.   (14)
Therefore, using Lagrange multipliers on Eq. 13, we obtain the self-consistent field (SCF) equation:

H[{ϕ_j}] |ϕ_i⟩ ≡ (O[{ϕ_j}] + V_ext) |ϕ_i⟩ = ε_i |ϕ_i⟩  for i = 1...N,   (15)

where we use H to denote the single particle Hamiltonian. As an example, for the HF theory (Eq. 9), we have

O_HF[{ϕ_j}] = T + V_H[ρ] + V_F[{ϕ_j}].   (16)

For the standard KS theory (Eq. 10), we have

O_KS[{ϕ_j}] = T + V_H[ρ] + V_xc[ρ].   (17)

Here we use T, V_H, V_F, and V_xc to denote the single particle kinetic, Coulomb, exact exchange, and exchange-correlation operators, respectively.

We construct our GKS model on top of an existing KS-like model and add a parametrized correction term E_δ to it. To be more specific, we define our energy functional to be

G[{ϕ_i}|ω] = G_base[{ϕ_i}] + E_δ[{ϕ_i}|ω],   (18)

where ω stands for the set of parameters we use in the representation of E_δ. The corresponding single particle Hamiltonian is then given by

H[{ϕ_i}|ω] = O_base[{ϕ_j}] + V_ext + V_δ[{ϕ_i}|ω].   (19)

The reference point G_base should be a reasonable electron energy functional in KS-like theories, e.g., G_HF, G_KS;PBE, G_Hyb;SCAN0, etc.
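To make the role of V_δ in Eq. 19 concrete, the following is a minimal, self-contained sketch of a bare-bones SCF loop in which the correction potential is simply added to the Fock matrix at every cycle. The matrices and the two callables are placeholders supplied by the caller; a production implementation (e.g., in PySCF) would add DIIS acceleration and a proper total-energy evaluation.

```python
import numpy as np
from scipy.linalg import eigh

def solve_scf(h_core, s, n_occ, build_veff, v_delta, max_cycle=50, tol=1e-8):
    """Bare-bones restricted SCF loop for Eq. 15 with the corrected Hamiltonian of Eq. 19.

    h_core     : kinetic + external-potential matrix in the chosen basis
    s          : overlap matrix of that basis
    n_occ      : number of doubly occupied orbitals
    build_veff : callable, density matrix -> mean-field part of O_base (Hartree, exchange, ...)
    v_delta    : callable, density matrix -> correction potential V_delta of Eq. 24
    All inputs are assumptions of this sketch, to be supplied by the caller.
    """
    dm = np.zeros_like(h_core)                           # start from an empty density matrix
    e_orb_old = np.inf
    for _ in range(max_cycle):
        fock = h_core + build_veff(dm) + v_delta(dm)     # single-particle Hamiltonian, Eq. 19
        eps, coeff = eigh(fock, s)                       # generalized eigenproblem F C = S C eps
        c_occ = coeff[:, :n_occ]                         # aufbau filling of the lowest orbitals
        dm = 2.0 * c_occ @ c_occ.T                       # closed-shell density matrix
        e_orb = eps[:n_occ].sum()                        # crude convergence monitor
        if abs(e_orb - e_orb_old) < tol:
            break
        e_orb_old = e_orb
    return eps, coeff, dm
```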
Before proceeding further, we list the set of requirements that we ideally want E_δ[{ϕ_i}|ω] to obey: 1) Generality. The model should be general enough to be applicable to all the systems whose local electronic configurations are well represented by the training data. 2) Locality. The model should be relatively local, so that it can potentially be constructed using data from small systems and then generalize to larger ones.
3) Symmetry. The model should respect both physical and gauge symmetries. Here physical symmetry means that E_δ should be invariant under translation and rotation of the system. Gauge symmetry means that E_δ should be invariant when the occupied orbitals {|ϕ_i⟩} undergo a unitary transformation. 4) Accuracy. For target systems, the model should achieve chemical accuracy, i.e., a prediction error lower than 1 kcal/mol.
5) Efficiency. The cost of solving the model should be comparable to that of HF or other DFT models.

To satisfy these requirements, we follow our previous work and construct E_δ as a neural network model using the "local density matrix" as input. Briefly speaking, we build our functional based on the one-particle reduced density matrix

Γ(x, x′) = Σ_i ⟨x|ϕ_i⟩⟨ϕ_i|x′⟩ = Σ_i ϕ_i(x) ϕ*_i(x′).   (20)

We then project it onto a set of atomic basis functions {|α^I_nlm⟩}, indexed by the radial number n, azimuthal number l, and magnetic (angular) number m, and centered on each atom I, to get the "local density matrix"

(D^I_nl)_{mm′} = Σ_i ⟨α^I_nlm|ϕ_i⟩⟨ϕ_i|α^I_nlm′⟩.   (21)

Note that, for simplicity and locality, we only take the block diagonal part of the full matrix, i.e., the indices I, n and l are taken to be the same on both sides of the projection; only the angular indices m and m′ differ. To deal with the rotational symmetry of the basis α^I_nlm, we use the eigenvalues of the local density matrix as our descriptor,

d^I_nl = EigenVals_{mm′}[(D^I_nl)_{mm′}],   (22)

and we use a neural network model to output the "correction" energy,

E_δ = Σ_I F_NN(d^I).   (23)

Hence the corresponding potential V_δ is given by

V_δ = Σ_{Inlmm′} ∂E_δ/∂(D^I_nl)_{mm′} |α^I_nlm⟩⟨α^I_nlm′|.   (24)

We emphasize that although E_δ is constructed from the one-particle density matrix, neither the ground-state orbitals nor the density matrix calculated by our model should be expected to have a physical meaning. Instead, we consider the ground-state density to be physical, just as in the standard KS theory, and expect it to coincide with the true ground-state density once we have the exact functional. This is because we follow the GKS approach, rather than a 1-reduced density matrix functional theory, which cannot be mapped to a KS system.
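As a concrete illustration of Eqs. 20-24, the following minimal PyTorch sketch computes the local density matrices, the eigenvalue descriptors, and the summed atom-wise NN contributions. The array shapes, the random placeholder inputs, and the toy network are illustrative assumptions, not the settings used for the results reported below.

```python
import torch

def local_density_matrices(proj, occ_coeff):
    """Eq. 21: (D^I_nl)_{mm'} = sum_i <a^I_nlm|phi_i><phi_i|a^I_nlm'>.

    proj      : (n_atom, n_shell, 2l+1, n_ao) projection coefficients <a^I_nlm|mu>
                onto the computational basis (assumed precomputed; one l per shell here)
    occ_coeff : (n_ao, n_occ) occupied orbital coefficients
    """
    c = torch.einsum("asmp,pi->asmi", proj, occ_coeff)   # <a^I_nlm | phi_i>
    return torch.einsum("asmi,asni->asmn", c, c)         # block-diagonal D^I_nl

def correction_energy(dm_local, net):
    """Eqs. 22-23: eigenvalue descriptors d^I_nl fed into an atom-wise NN, summed over atoms."""
    d = torch.linalg.eigvalsh(dm_local)                  # (n_atom, n_shell, 2l+1), rotation invariant
    desc = d.flatten(start_dim=1)                        # concatenate shells into one descriptor per atom
    return net(desc).sum()                               # E_delta = sum_I F_NN(d^I)

# Toy network and random placeholder inputs for shape checking only.
net = torch.nn.Sequential(torch.nn.Linear(4 * 3, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
proj = torch.randn(5, 4, 3, 20)        # 5 atoms, 4 projector shells with l=1, 20 basis functions
occ = torch.randn(20, 8)               # 8 occupied orbitals
dm_local = local_density_matrices(proj, occ)
e_delta = correction_energy(dm_local, net)
# V_delta (Eq. 24) follows from torch.autograd.grad(e_delta, dm_local) contracted with the projectors.
```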
We now discuss how to train a self-consistent model. Here self-consistency means that the property predicted by the model is obtained via a minimization process and is given at the minimum. A KS-like DFT method is naturally self-consistent. On the contrary, methods like Møller-Plesset perturbation theory and many other post-HF theories are not self-consistent, since they do not involve a minimizing procedure. We call those methods energy models, to be distinguished from the self-consistent ones. In this context, recent machine learning-based schemes, such as the DeePHF method and the MOB-ML method, are energy models.

Similar to other supervised learning procedures, we fit the energy functional using existing datasets with certain labels. These labels can be acquired from calculations of high-accuracy methods, such as CCSD(T) and quantum Monte Carlo. Generally speaking, we consider three types of labels:

1. A quantity that is the direct output of the functional after a minimization procedure. Here it is the total energy.

2. A quantity that depends on both the direct output of the functional and its minimizer. Here we consider the atomic force.

3. A quantity that depends on the minimizer of the functional, but only implicitly through the mathematical form of the functional. Here we consider the ground-state density.

As has been mentioned, using all these labels in training is a non-trivial task, since there is a highly complicated and expensive procedure for calculating the corresponding quantities. Here we develop general and efficient training algorithms for these three types of labels.

Type one (energy). The training procedure with the energy label may seem straightforward at first glance. Using the ℓ₂ norm as the error metric, the optimization problem becomes

min_ω E_data[ ( E_label − min_{{ϕ_i}, ⟨ϕ_i|ϕ_j⟩=δ_ij} E_model[{ϕ_i}|ω] )² ].   (25)

Here

E_model[{ϕ_i}|ω] = G_base[{ϕ_i}] + E_ext[ρ[{ϕ_i}]] + E_δ[{ϕ_i}|ω],   (26)

and the expectation is taken over the training samples.

The gradient of E_model with respect to ω can be easily obtained using the Hellmann-Feynman theorem. However, the minimization over {ϕ_i} involves solving an SCF equation (Eq. 15) that is very time consuming. A typical training procedure consists of as many as a million gradient descent steps. This is unrealistic if the SCF equation is solved at every step.

We use a different optimization formalism. Instead of treating the minimized energy as a function of the parameters ω, we consider it as a function of both the orbitals {ϕ_i} and the parameters ω that satisfies the constraint that {ϕ_i} is the minimizer. Therefore, the whole optimization problem can be written as

min_ω E_data[ ( E_label − E_model[{ϕ_i}|ω] )² ]   (27)
s.t. ∃ ε_1 ≤ ⋯ ≤ ε_N, (H[{ϕ_i}|ω] − ε_i)|ϕ_i⟩ = 0, ⟨ϕ_i|ϕ_j⟩ = δ_ij for i, j = 1...N,   (28)

where Eq. 28 is a parameterized version of Eq. 15, i.e., the single particle Hamiltonian H[{ϕ_i}|ω] depends on both the orbitals {ϕ_i} and the model parameters ω.

We can now use a projection method to relax the constraint, which reduces the cost of solving the SCF equation. In other words, we first optimize the parameters ω using an unconstrained gradient-based method with the orbitals {ϕ_i} fixed. After several steps, we project the orbitals back onto the constraint manifold by solving the SCF equation. Decreasing the projection frequency can largely reduce the computational cost, since most of the computation time is spent on the SCF equation. To make this clearer, we write the procedure as the following steps (a minimal sketch of this loop is given below).

1. Initialize a set of {ϕ_i} and ω that satisfies the SCF equation, e.g., take ω to be all zero and {ϕ_i} to be the Hartree-Fock solution. Also keep track of the predicted energy E_model.

2. Update the parameters ω by training the model following Eq. 27 with fixed orbitals {ϕ_i}.

3. Update the orbitals {ϕ_i} by solving the SCF equation with fixed model parameters ω.

4. Check whether the predicted energy E_model has converged. If not, go to step 2 and do more iterations.

A schematic illustration of this approach is shown in Fig. 1. Note that we usually take many training steps in step 2. In practice, when restarting from old parameters using new orbitals, we find it possible to train the model until the validation error no longer decreases without breaking the convergence of the whole procedure. Therefore, the total time spent solving the SCF equation is significantly reduced.
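The projection-based alternation can be summarized in a few lines of Python. In the sketch below, solve_scf, train_parameters, and model_energy are hypothetical stand-ins for the SCF solver (Eq. 15), the gradient-based NN update (Eq. 27), and the energy of Eq. 26; they are supplied by the caller and are not part of the actual DeePKS implementation.

```python
def train_deepks(data, omega0, solve_scf, train_parameters, model_energy,
                 max_iter=20, tol=1e-5):
    """Projection-based training loop (steps 1-4 above).

    solve_scf(omega, sample)                -> orbitals solving Eq. 15 at fixed omega
    train_parameters(omega, orbitals, data) -> omega after gradient steps on Eq. 27, orbitals fixed
    model_energy(omega, orbitals, sample)   -> E_model of Eq. 26
    """
    omega = omega0                                           # step 1: e.g. zero correction
    orbitals = [solve_scf(omega, s) for s in data]
    e_old = [model_energy(omega, orb, s) for orb, s in zip(orbitals, data)]
    for _ in range(max_iter):
        omega = train_parameters(omega, orbitals, data)      # step 2: train with orbitals frozen
        orbitals = [solve_scf(omega, s) for s in data]       # step 3: project back to the SCF manifold
        e_new = [model_energy(omega, orb, s) for orb, s in zip(orbitals, data)]
        if max(abs(a - b) for a, b in zip(e_new, e_old)) < tol:
            break                                            # step 4: predicted energies converged
        e_old = e_new
    return omega
```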
We note that a similar formalism has been proposed and used by the NeuralXC scheme. The major difference is that, in DeePKS, a single NN function is used as a universal approximator. The functional form does not change with the iterative process, and its parameters do not depend on the chemical species of the associated atom. In contrast, in NeuralXC, the parameters depend on the chemical species, and in each iteration a new NN layer is appended to the NN model from the previous iteration. The reformulation in DeePKS is designed to make it more transferable to a larger chemical space, and better suited for larger datasets.

Figure 1: A schematic illustration of the iterative training procedure using the projection method. Here "Original" stands for the direct constrained minimization following Eq. 25 and "Relaxed" stands for the relaxed projection method in Eqs. 27-28. In step 2 of the relaxed method, the optimization of ω strays away from the minimizing manifold, while in step 3 the projection of {ϕ_i} brings it back.

Type two (force). The atomic forces from the proposed model can be easily calculated by the standard Hellmann-Feynman theorem,

F_model[{ϕ*_i[ω]}|ω] = −∂E_model[{ϕ*_i}|ω]/∂X = F[{ϕ*_i}] − Σ_{Inlmm′} ∂E_δ[{ϕ*_i}|ω]/∂(D^I_nl)_{mm′} Σ_i ⟨ϕ*_i| ∂(|α^I_nlm⟩⟨α^I_nlm′|)/∂X |ϕ*_i⟩,   (29)

where we have written out the dependence on the parameters ω explicitly. We use {ϕ*_i} to denote the minimizer of the total energy functional, which is itself a function of ω,

{ϕ*_i[ω]} = argmin_{⟨ϕ_i|ϕ_j⟩=δ_ij} E_model[{ϕ_i}|ω].   (30)

We can see that the force F_model depends directly on both the model parameters ω and the minimizing orbitals {ϕ*_i[ω]}. This introduces an additional difficulty when we evaluate the gradient of F_model with respect to ω. The contribution from the {ϕ*_i[ω]} term is very hard to compute, since it involves a whole minimization procedure, and there is no Hellmann-Feynman theorem to save us.

Luckily, this difficulty disappears in our iterative training procedure, where the gradient used to optimize ω is no longer the constrained one. The orbitals are treated as independent variables, so that they do not contribute to the gradient. Therefore, the gradient can be calculated straightforwardly using a back propagation procedure. By writing the force term into the loss function, the new optimization problem becomes

min_ω E_data[ (E_label − E_model[{ϕ_i}|ω])² + λ_f ‖F_label − F_model[{ϕ_i}|ω]‖² ]
s.t. ∃ ε_1 ≤ ⋯ ≤ ε_N, (H[{ϕ_i}|ω] − ε_i)|ϕ_i⟩ = 0, ⟨ϕ_i|ϕ_j⟩ = δ_ij for i, j = 1...N,   (31)

where λ_f is a tunable parameter that determines the weight of the force label in the loss function. We can then use the iterative algorithm described above to solve this optimization problem; a sketch of the corresponding loss evaluation is given below.
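The following sketch shows how the combined energy-and-force loss of Eq. 31 can be evaluated and backpropagated once the orbitals are frozen. The precomputed projector derivatives grad_proj, the fixed base-model quantities, and the network net are assumed inputs of this illustration rather than names from the actual implementation; all array-valued arguments are taken to be torch tensors.

```python
import torch

def energy_force_loss(net, dm_local, e_base, f_base, e_label, f_label, grad_proj, lam_f=0.1):
    """Loss of Eq. 31 for one configuration, with the orbitals held fixed.

    dm_local  : (n_atom, n_shell, m, m) frozen local density matrices from the last SCF projection
    grad_proj : (n_atom, n_shell, m, m, n_atom, 3) precomputed projector derivatives,
                sum_i <phi_i| d(|a><a'|)/dX |phi_i>, an assumption of this sketch
    e_base, f_base : base-model energy and Hellmann-Feynman force, constants here
    """
    dm_local = dm_local.detach().requires_grad_(True)
    d = torch.linalg.eigvalsh(dm_local)                          # descriptors, Eq. 22
    e_delta = net(d.flatten(start_dim=1)).sum()                  # Eq. 23
    e_model = e_base + e_delta
    # dE_delta/dD, needed for the force correction in Eq. 29; create_graph keeps the
    # dependence on the network parameters so the total loss can still be backpropagated.
    de_dD = torch.autograd.grad(e_delta, dm_local, create_graph=True)[0]
    f_model = f_base - torch.einsum("asmn,asmnxc->xc", de_dD, grad_proj)
    return (e_label - e_model) ** 2 + lam_f * ((f_label - f_model) ** 2).sum()
```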
Type three (density). The ground-state density given by the proposed model is a function of the minimizing orbitals,

ρ_model(x) = Σ_{i=1}^{N} |ϕ*_i(x)|².   (32)

Since it does not depend on the parameters ω explicitly, unlike the case for forces, we cannot write the density directly into the loss function. To solve this problem, we introduce a penalty term in the SCF equation to "guide" the training procedure. This is done by changing the minimization problem in Eq. 13 into

min_{{ϕ_i}, ⟨ϕ_i|ϕ_j⟩=δ_ij} { E_model[{ϕ_i}|ω] + λ_ρ D[ρ[{ϕ_i}], ρ_label] },   (33)

where λ_ρ > 0 and D is some non-negative error metric that equals zero only when ρ = ρ_label. Hence, if the SCF solution gives the exact density, the penalty term does not influence the minimizer. Otherwise, according to the discussion in the Appendix, because of the additional potential term in the SCF equation,

V_pnt[ρ|ρ_label] = δD[ρ, ρ_label]/δρ,   (34)

it will lead to a self-consistent energy strictly larger than the one obtained without the penalty term, and to a density that is closer to the label.

Note that λ_ρ does not need to be a fixed value. Rather, it can be a set of values or even a non-negative random variable. When the model yields the exact density, the functionals with different λ_ρ all give the exact solution. When the solution is not exact, a randomized λ_ρ serves as a regularizer that helps reduce overfitting and provides better results. If we choose it to be a random variable, the modified optimization problem becomes

min_ω E_{data, λ_ρ}[ (E_label − E_model[{ϕ_i}|ω])² + λ_f ‖F_label − F_model[{ϕ_i}|ω]‖² ]
s.t. ∃ ε_1 ≤ ⋯ ≤ ε_N, (H[{ϕ_i}|ω] + λ_ρ V_pnt[ρ[{ϕ_i}]|ρ_label] − ε_i)|ϕ_i⟩ = 0, ⟨ϕ_i|ϕ_j⟩ = δ_ij for i, j = 1...N.   (35)

The same projection-based training procedure can be applied to this loss function; a small sketch of the penalty metric and its potential is given below. We note that although in this paper we take energy, force, and density as examples, these algorithms are rather general and can be easily transferred to similar learning problems that involve an optimization procedure for the evaluation of meaningful quantities. For example, if we include the dipole as a label, we can add a penalty term similar to Eq. 33. Moreover, the training algorithms are not limited to the specific GKS model described above. Instead, they can be applied to gradient-based optimization tasks for any exchange-correlation functional and even other self-consistent learning problems.
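A minimal sketch of the Coulomb-metric penalty used later in Eq. 36 and of its potential (Eq. 34), written in an atomic-orbital matrix representation; the two-electron integral tensor eri is assumed to be provided by the quantum chemistry backend.

```python
import numpy as np

def coulomb_penalty(dm, dm_label, eri):
    """Coulomb-metric density penalty (Eq. 36) and its potential (Eq. 34) in an AO basis.

    dm, dm_label : AO density matrices of the model and of the reference density
    eri          : two-electron integrals (mu nu | lam sig), shape (nao, nao, nao, nao)
    returns      : D[rho, rho_label] and the matrix of V_pnt to be added to the Fock matrix
    """
    ddm = dm - dm_label                                   # delta rho in matrix form
    j_ddm = np.einsum("mnls,ls->mn", eri, ddm)            # Coulomb potential generated by delta rho
    penalty = np.einsum("mn,mn->", ddm, j_ddm)            # D = Coulomb repulsion of delta rho
    v_pnt = 2.0 * j_ddm                                   # functional derivative of D w.r.t. rho
    return penalty, v_pnt
```

Inside an SCF loop such as the one sketched earlier, λ_ρ times the returned v_pnt would simply be added to the Fock matrix of Eq. 19.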
We now examine the performance of the DeePKS scheme on three classes of data that have been used for benchmark purposes in the literature. Unless otherwise specified, all labels are given by CCSD(T), and all calculations are conducted using the cc-pVDZ basis.

• Malonaldehyde, including 1500 configurations with energy, force, and density labels. We use this dataset to test thoroughly our training method with all three types of labels. Since within the CCSD(T) formalism the perturbative triples correction does not give a corresponding density, we use CCSD for density related tests. The data are calculated with PySCF, with molecular configurations coming from the sGDML dataset.

• Three molecules (malonaldehyde, benzene and toluene), including 1500 configurations for each molecule with energy and force labels. This is a subset of the sGDML dataset under the same numerical setup; therefore we can train one model on all three molecules and examine the inter-molecule performance, as a first step toward universal functionals.

• The QM7b-T dataset, including 7212 molecules and one configuration for each molecule, with energy labels only. This is the largest publicly available dataset with CCSD(T) accuracy. It has been used to benchmark several other methods as well as the energy model developed in Ref. 2. We test it here to make a comparison between the new self-consistent model and the previous energy model. We also use it to examine the ability of the DeePKS method to generate "universal" functionals that are applicable to as many systems as possible.

We emphasize that the objective of our method is to build one single functional with chemical accuracy for as many systems as possible, although it is currently limited by the data we have. The functional should be able to predict accurate results for all the systems that are well represented in the training set, and its coverage can be enlarged continuously by adding more and more training data.

We implement the DeePKS method using the open-source packages PySCF and PyTorch. We start our iteration from a functional obtained using DeePHF and orbitals solved from that functional. In each iteration, the optimization of the neural network parameters is conducted for 10,000 epochs using the ADAM optimizer. For all training that includes force labels, we set the parameter λ_f to 0.1. One more trick we use is that, to speed up convergence, after training with ADAM we further correct the model with a global energy shift, which is calibrated from the training set.
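The base calculations referred to above are ordinary PySCF runs; the snippet below is a minimal sketch of such a baseline (the water geometry is a toy placeholder, not one of the paper's configurations), giving the HF energy and the orbitals from which the descriptors and the DeePHF/DeePKS corrections are built.

```python
from pyscf import gto, scf

# Toy geometry as a stand-in for one molecular configuration from the dataset.
mol = gto.M(atom="O 0.0 0.0 0.0; H 0.0 0.757 0.587; H 0.0 -0.757 0.587",
            basis="cc-pvdz")
mf = scf.RHF(mol)          # G_base = G_HF; a KS base (e.g. PBE) would use scf.RKS instead
e_base = mf.kernel()       # converged base-model total energy
mo_coeff, mo_occ = mf.mo_coeff, mf.mo_occ   # orbitals used to evaluate the descriptors (Eq. 21)
```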
We now examine the performance of our method on the malonaldehyde molecule, using the HF functional as the base model, G_base = G_HF. As a first step, to get an intuitive picture of the newly proposed iterative method, we use energy and force as training labels and study the behavior of the mean absolute error (MAE) on the testing set during the training process. The error for the forces is calculated component-wise. The training is done on 1000 molecular configurations and the testing on the remaining 500. We also include results from the sGDML and DeePMD models for comparison.

Figure 2: The energy and force errors during training for the malonaldehyde molecule. Force labels are added starting from iteration 7. Results from DeePMD and sGDML are included for comparison.

As can be seen in Fig. 2, when training with only energy labels (iterations 0 to 6), the testing accuracy quickly saturates while the training error keeps decreasing, suggesting that the model begins to overfit. On the other hand, even though we train with only energy labels, the model already outperforms the DeePMD and sGDML methods, both of which utilize forces as training labels. When we include force labels after iteration 6, the testing accuracy is further improved by two to three times. This shows the effectiveness of adding force labels in the training.

To further examine the sample efficiency of our method, we study the learning curve associated with the malonaldehyde molecule by plotting the testing MAE of both energies and forces versus the number of training samples. Each time the dataset is augmented, existing samples in the dataset are kept, and the testing error is calculated on the remaining part of the data. For comparison, we include the results of NeuralXC and DeePMD. As shown in Fig. 3, in all cases DeePKS outperforms both DeePMD and NeuralXC: using the same amount of training data, the accuracy of both the energies and the forces is improved 3 to 10 times. As an ablation study, we also examine the situation of using labels at the CCSD level and starting from the PBE functional (G_base = G_KS;PBE); we find that the results do not change much. Therefore, hereafter we focus on the HF based model, since the implementation of PBE in PySCF is rather slow.

Figure 3: The learning curve of both energy and force for the malonaldehyde molecule. Results from DeePMD and NeuralXC are included for comparison. NeuralXC results are digitally captured from Ref. 7.

We now move to density related tests. Here we use labels at the CCSD level. We follow Eq. 35 to train our model with density labels. The error penalty term is taken to be the Coulomb repulsion energy of the density difference,

D[ρ, ρ_label] = ∫ dx dx′ Δρ(x) Δρ(x′) / |x − x′|,   Δρ(x) = ρ(x) − ρ_label(x),   (36)

which can be evaluated at very small cost in PySCF. The penalty parameter λ_ρ is sampled uniformly from 0 to 1 for every data point and every SCF calculation. We train with this setup for 20 iterations and then remove the penalty and perform another 5 iterations for relaxation. As we will see later, such relaxation slightly reduces the accuracy for density, but substantially improves the accuracy for energy and force.

We study the performance of the DeePKS model in terms of the prediction error of the energy E, force F, dipole μ and point-wise electron density ρ. For comparison, we also include different training schemes and results from several other methods. We use the ℓ₁ norm for energy and density, the component-wise ℓ₁ norm for force, and the ℓ₂ norm for dipole as error metrics. All models are trained on 1000 malonaldehyde configurations and the testing errors are averaged over the remaining 500 configurations. For HF and DFT functionals, a constant energy shift, calculated from the training set, is applied to their predicted total energy. Our testing results are summarized in Table 1.
Table 1: Comparison of different methods in terms of the prediction errors for energy, force, dipole, and density, for the malonaldehyde molecule. Errors are measured in mH for energy, mH/Å for force, Debye for dipole, and e for density. For ML-based methods (sGDML, DeePMD, DeePKS), the types of labels used for training are shown in parentheses. The term "rlxd" stands for the relaxation procedure after training with density.

Methods                      ‖ΔE‖     ‖ΔF‖    ‖Δμ‖     ‖Δρ‖
sGDML (w/ E, F)              0.10     0.59    –        –
DeePMD (w/ E, F)             0.13     0.69    –        –
HF (E + 805.19)              3.29     24.1    0.66     0.58
PBE (E − .33)                1.35     7.53    0.17     0.35
SCAN0 (E − .05)              1.83     10.9    0.32     0.29
DeePKS (w/ E)                0.067    0.44    0.10     0.50
DeePKS (w/ E, F)             0.034    0.18    0.10     0.39
DeePKS (w/ E, F, ρ)          0.048    0.30    0.044    0.20
DeePKS (w/ E, F, ρ; rlxd)    0.041    0.24    0.047    0.21

In general, we find that ML-based methods perform much better than traditional HF or DFT functionals in terms of the accuracy of energy and forces. This is expected, since these methods are directly trained with corresponding labels on this specific system. Traditional functionals, on the other hand, give rather good predictions for dipoles and densities. Only by including density labels can DeePKS outperform the state-of-the-art conventional functional (SCAN0). It is also interesting to observe that even without dipole labels, the DeePKS models, obtained in different ways, significantly outperform HF, PBE, and SCAN0 in terms of testing accuracy on dipole moments.

Figure 4: Density difference with respect to ρ_CCSD, given by different training schemes for DeePKS as well as the SCAN0 functional.

For a more intuitive view, we compare the ground-state density given by SCAN0 and by the different training schemes for DeePKS. We take a sliced line that crosses an atomic core, and we plot the density difference with respect to the CCSD label. As shown in Fig. 4, when training without density, the error in the core region is relatively large and worse than the SCAN0 prediction. After we add density labels, the error is substantially reduced, showing the necessity of using density labels in the training. We also note that the absolute density value can reach 600 e/Å³ at the core, hence even the largest difference in density is still very small compared to the absolute value.

As a further step, we test the performance of DeePKS on learning one single functional for multiple molecules simultaneously. This is in general a hard task, especially when the number of training samples is very limited. As mentioned in Ref. 24, for these so-called transferable models, "energy prediction errors are often much larger than 1 kcal/mol", even with a huge amount of training data. However, this step is crucial and inevitable, since our ultimate goal is to build one universally accurate functional for a wide range of systems.

Table 2: Comparison of different methods on the prediction accuracy of energy and force for three molecules. Errors are measured in mH for energy and mH/Å for force. Force errors are calculated component-wise. Methods marked with "*" are trained separately on each molecule. NeuralXC results are digitally captured from Ref. 7.

             Malonaldehyde        Benzene              Toluene
Methods      ‖ΔE‖     ‖ΔF‖        ‖ΔE‖      ‖ΔF‖       ‖ΔE‖    ‖ΔF‖
NeuralXC*    0.35     –           0.075     –          0.20    –
sGDML*       0.10     0.59        0.006     0.06       0.05    0.33
DeePKS*      0.04     0.22        0.007     0.07       0.06    0.32
DeePKS       0.07     0.41        0.014     0.13       0.08    0.42

We then check the behavior of DeePKS when fitting malonaldehyde, benzene and toluene at the same time, with energy and force labels. This is the largest set of data we could find with both energy and force at the CCSD(T) level calculated in the same numerical setup. We take 1000 samples for each molecule in the training and test on the remaining configurations. We summarize our results in Table 2, including a comparison with NeuralXC and sGDML. Despite a small loss in accuracy, the DeePKS method is still comparable with sGDML and outperforms NeuralXC, both of which are trained separately on each individual molecule. We also note that sGDML performs relatively well on benzene and toluene, possibly due to its explicit handling of the point group symmetry. Such treatment can improve the sample efficiency for highly symmetric molecules like benzene and toluene, yet may not be very helpful for more general molecules.
Figure 5: The learning curve of the DeePHF and DeePKS methods on the QM7b-T dataset. Results of MOB-ML with regression clustering and of the FCHL methods are included for comparison. Note that the FCHL results use MP2 energies as training and testing labels, and are digitally captured from Ref. 6.

For a larger test, we examine the performance of DeePKS on the QM7b-T dataset. This is the largest dataset we have with CCSD(T) level energies, and it is also used for benchmarking the energy model DeePHF. We study the learning curve by randomly selecting some samples as the training set and testing on the rest. Since no new label is included and the model is trained only with energy, we should not expect DeePKS to exhibit any accuracy improvement with respect to DeePHF. The best result we can hope for is that the self-consistent model behaves as well as the energy model. This is indeed the case, as shown in Fig. 5.

We further examine the transferability of DeePKS to much larger systems by predicting hydrocarbon reaction and isomerization energies using the HC7 and ISOL6 benchmarks. The 7000 samples randomly selected from the QM7b-T dataset, used to train the DeePKS model, contain at most 7 heavy atoms. However, HC7 and ISOL6 contain at most 12 and 15 heavy atoms, respectively. As shown in Table 3, DeePKS outperforms conventional DFT functionals and generalizes better than the current best-performing ML-based model, ANI-1ccx, which is trained using a huge dataset of 5M molecular configurations with DFT energies and forces, and fine-tuned on about 500K configurations with CCSD(T)*/CBS energies.

Table 3: MAE of reaction and isomerization energies, calculated using the HC7 and ISOL6 datasets, respectively. Results of ANI-1ccx come from Ref. 28. Errors are given in mH. All DFT methods are conducted in the cc-pVDZ basis. The MAEs of DFT methods (including DeePKS) are calculated by comparing with results from a CCSD(T)/cc-pVDZ calculation. The MAE of ANI-1ccx is calculated by comparing with the method used for generating its training data, i.e., CCSD(T)*/CBS.

Methods     HC7      ISOL6
PBE         8.91     3.80
SCAN        16.22    3.25
B3LYP       16.74    4.16
SCAN0       23.94    3.65
ωB97X       17.71    3.45
ωB97M-V     5.86     3.81
ANI-1ccx    3.24     2.41
DeePKS      2.88     1.26
Figure 6: The CPU time spent in the calculations of alkanes using different methods.

As a final remark, we show that the DeePKS model can indeed be evaluated efficiently. Fig. 6 shows the computational cost of different methods for calculating alkanes ranging from one to seven carbon atoms. The number of iterations in all SCF-based methods is set to 10. We note that for PBE and other DFT functionals, the implementation in PySCF involves numerical integration over space grids, which is much more expensive for small molecules with a GTO basis set, wherein analytical evaluations of orbital overlaps can be carried out efficiently in the HF method and the HF-based DeePKS. As a result, DeePKS is even faster than PBE and scales similarly to HF. The additional cost over HF scales essentially linearly with respect to system size. For larger systems, where the O(N⁴) scaling of HF begins to dominate, we can switch to PBE or other KS functionals as the starting point and retain the cubic scaling.

Conclusion
We presented a general framework for learning chemically accurate self-consistent energy functionals using different types of labels, including energy, force, and density. The new training method, combined with a self-consistent extension of DeePHF, leads to a generalized Kohn-Sham functional with the accuracy of CCSD(T) and the computational cost of DFT. We examined the performance of the proposed method on multiple molecular datasets, and obtained highly accurate predictions for multiple properties like energy, force, and density. In addition, the proposed method is capable of learning a single functional that covers different molecular systems, and its accuracy can be continuously improved by adding more training data. We believe it is a good starting point towards a universally accurate functional for molecules, and we are confident that it can be extended to include condensed phases.
Acknowledgments

We thank Xiao Wang and Lin Lin for beneficial discussions. The work of Y. C., L. Z. and W. E was supported in part by a gift from iFlytek to Princeton University, the ONR grant N00014-13-1-0338, and the Center for Chemistry in Solution and at Interfaces (CSI) funded by the DOE Award DE-SC0019394. The work of H. W. is supported by the National Science Foundation of China under Grant No. 11871110, the National Key Research and Development Program of China under Grants No. 2016YFB0201200 and No. 2016YFB0201203, and the Beijing Academy of Artificial Intelligence (BAAI).
A  Properties of the modified minimization scheme for density optimization
We discuss the properties of the energy and density when we modify, as in Eq. 33, the minimization scheme for density optimization. For simplicity, let us use the following notation:

L_{λρ}[{ϕ_i}] = E_model[{ϕ_i}] + λ_ρ D[ρ[{ϕ_i}], ρ_label],   (37)

and

{ϕ^{λρ}_i} = argmin_{⟨ϕ_i|ϕ_j⟩=δ_ij} L_{λρ}[{ϕ_i}],   (38)

for which we assume that the global minimizer of L_{λρ}[{ϕ_i}] is unique. In particular,

{ϕ*_i} = argmin_{⟨ϕ_i|ϕ_j⟩=δ_ij} L_0[{ϕ_i}] = argmin_{⟨ϕ_i|ϕ_j⟩=δ_ij} E_model[{ϕ_i}],   (39)

gives the original minimizer.
For λ₁ > λ₂ ≥ 0, we have

L_{λ₁}[{ϕ^{λ₂}_i}] ≥ L_{λ₁}[{ϕ^{λ₁}_i}];   (40)
L_{λ₁}[{ϕ^{λ₁}_i}] ≥ L_{λ₂}[{ϕ^{λ₁}_i}];   (41)
L_{λ₂}[{ϕ^{λ₁}_i}] ≥ L_{λ₂}[{ϕ^{λ₂}_i}].   (42)

Eq. 40 holds since {ϕ^{λ₁}_i} is the minimizer of L_{λ₁}; similarly, Eq. 42 holds since {ϕ^{λ₂}_i} is the minimizer of L_{λ₂}; Eq. 41 holds since the term (λ₁ − λ₂) D[ρ[{ϕ^{λ₁}_i}], ρ_label] is non-negative.

It is straightforward to see that equalities hold in all these equations if and only if both D[ρ[{ϕ^{λ₁}_i}], ρ_label] and D[ρ[{ϕ^{λ₂}_i}], ρ_label] are 0. In this case, both the energy L_{λρ}[{ϕ^{λρ}_i}] and the minimizing density ρ[{ϕ^{λρ}_i}] will be the same for all λ_ρ ≥ 0.
Otherwise, we will have the following two properties:

1. E_model[{ϕ^{λρ}_i}] is strictly larger than E_model[{ϕ*_i}], which follows by taking λ₁ = λ_ρ and λ₂ = 0 in Eq. 42.

2. A larger penalty will lead to a density that is closer to the label. This can be obtained by adding Eq. 40 to Eq. 42, which leads to

(λ₁ − λ₂) ( D[ρ[{ϕ^{λ₂}_i}], ρ_label] − D[ρ[{ϕ^{λ₁}_i}], ρ_label] ) > 0.   (43)

Therefore, we have D[ρ[{ϕ^{λ₁}_i}], ρ_label] < D[ρ[{ϕ^{λ₂}_i}], ρ_label].

References

[1] Mihail Bogojeski, Leslie Vogt-Maranto, Mark E Tuckerman, Klaus-Robert Mueller, and Kieron Burke. Density functionals with quantum chemical accuracy: From machine learning to molecular dynamics. ChemRxiv preprint, 8079917:v1, 2019.
[2] Yixiao Chen, Linfeng Zhang, Han Wang, and Weinan E. Ground state energy functional with Hartree-Fock efficiency and chemical accuracy. arXiv preprint, 2005.00169, 2020.

[3] Lixue Cheng, Nikola B Kovachki, Matthew Welborn, and Thomas F Miller III. Regression clustering for improved accuracy and training costs with molecular-orbital-based machine learning. Journal of Chemical Theory and Computation, 15(12):6668–6677, 2019.

[4] Lixue Cheng, Matthew Welborn, Anders S Christensen, and Thomas F Miller III. Thermalized (350K) QM7b, GDB-13, water, and short alkane quantum chemistry dataset including MOB-ML features, 2019. URL https://data.caltech.edu/records/1177.

[5] Lixue Cheng, Matthew Welborn, Anders S Christensen, and Thomas F Miller III. A universal density matrix functional from molecular orbital-based machine learning: Transferability across organic molecules. The Journal of Chemical Physics, 150(13):131103, 2019.

[6] Anders S Christensen, Lars A Bratholm, Felix A Faber, and O Anatole von Lilienfeld. FCHL revisited: Faster and more accurate quantum machine learning. The Journal of Chemical Physics, 152(4):044107, 2020.

[7] Sebastian Dick and Marivi Fernandez-Serra. Machine learning accurate exchange and correlation functionals of the electronic density. Nature Communications, 11(1):1–10, 2020.

[8] Thomas L Gilbert. Hohenberg-Kohn theorem for nonlocal external potentials. Physical Review B, 12(6):2111, 1975.

[9] Pierre Hohenberg and Walter Kohn. Inhomogeneous electron gas. Physical Review, 136(3B):B864, 1964.

[10] Bogumil Jeziorski and Hendrik J Monkhorst. Coupled-cluster method for multideterminantal reference states. Physical Review A, 24(4):1668, 1981.

[11] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint, 1412.6980, 2014.

[12] Walter Kohn and Lu Jeu Sham. Self-consistent equations including exchange and correlation effects. Physical Review, 140(4A):A1133, 1965.

[13] Xiangyun Lei and Andrew J Medford. Design and analysis of machine learning exchange-correlation functionals via rotationally invariant convolutional descriptors. Physical Review Materials, 3(6):063801, 2019.

[14] Mel Levy. Universal variational functionals of electron densities, first-order density matrices, and natural spin-orbitals and solution of the v-representability problem. Proceedings of the National Academy of Sciences, 76(12):6062–6065, 1979.

[15] Elliott H Lieb. Density functionals for Coulomb systems. International Journal of Quantum Chemistry, 24(3):243–277, 1983.

[16] Qin Liu, JingChun Wang, PengLi Du, LiHong Hu, Xiao Zheng, and GuanHua Chen. Improving the performance of long-range-corrected exchange-correlation functional with an embedded neural network. The Journal of Physical Chemistry A, 121(38):7273–7281, 2017.

[17] Sijie Luo, Yan Zhao, and Donald G Truhlar. Validation of electronic structure methods for isomerization reactions of large organic molecules. Physical Chemistry Chemical Physics, 13(30):13683–13689, 2011.

[18] Chr Møller and Milton S Plesset. Note on an approximation treatment for many-electron systems. Physical Review, 46(7):618, 1934.

[19] Ryo Nagai, Ryosuke Akashi, and Osamu Sugino. Completing density functional theory by machine learning hidden messages from molecules. npj Computational Materials, 6(1):1–8, 2020.

[20] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019. URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.

[21] John P Perdew, Kieron Burke, and Matthias Ernzerhof. Generalized gradient approximation made simple. Physical Review Letters, 77(18):3865, 1996.

[22] Roberto Peverati, Yan Zhao, and Donald G Truhlar. Generalized gradient approximation that recovers the second-order density-gradient expansion with optimized across-the-board performance. The Journal of Physical Chemistry Letters, 2(16):1991–1997, 2011.

[23] John A Pople, Martin Head-Gordon, and Krishnan Raghavachari. Quadratic configuration interaction. A general technique for determining electron correlation energies. The Journal of Chemical Physics, 87(10):5968–5975, 1987.

[24] Huziel E Sauceda, Stefan Chmiela, Igor Poltavsky, Klaus-Robert Müller, and Alexandre Tkatchenko. Molecular force fields with gradient-domain machine learning: Construction and application to dynamics of small molecules with coupled cluster forces. The Journal of Chemical Physics, 150(11):114102, 2019.

[25] A Seidl, Andreas Görling, Peter Vogl, Jacek A Majewski, and Mel Levy. Generalized Kohn-Sham schemes and the band-gap problem. Physical Review B, 53(7):3764, 1996.

[26] Justin S Smith, Olexandr Isayev, and Adrian E Roitberg. ANI-1: an extensible neural network potential with DFT accuracy at force field computational cost. Chemical Science, 8(4):3192–3203, 2017.

[27] Justin S Smith, Ben Nebgen, Nicholas Lubbers, Olexandr Isayev, and Adrian E Roitberg. Less is more: Sampling chemical space with active learning. The Journal of Chemical Physics, 148(24):241733, 2018.

[28] Justin S Smith, Benjamin T Nebgen, Roman Zubatyuk, Nicholas Lubbers, Christian Devereux, Kipton Barros, Sergei Tretiak, Olexandr Isayev, and Adrian E Roitberg. Approaching coupled cluster accuracy with a general-purpose neural network potential through transfer learning. Nature Communications, 10(1):1–8, 2019.

[29] Jianwei Sun, Adrienn Ruzsinszky, and John P Perdew. Strongly constrained and appropriately normed semilocal density functional. Physical Review Letters, 115(3):036402, 2015.

[30] Qiming Sun, Timothy C Berkelbach, Nick S Blunt, George H Booth, Sheng Guo, Zhendong Li, Junzi Liu, James D McClain, Elvira R Sayfutyarova, Sandeep Sharma, et al. PySCF: the Python-based simulations of chemistry framework. Wiley Interdisciplinary Reviews: Computational Molecular Science, 8(1):e1340, 2018.

[31] Linfeng Zhang, Jiequn Han, Han Wang, Roberto Car, and Weinan E. Deep potential molecular dynamics: A scalable model with the accuracy of quantum mechanics. Physical Review Letters, 120:143001, Apr 2018.

[32] Linfeng Zhang, Jiequn Han, Han Wang, Wissam Saidi, Roberto Car, and Weinan E. End-to-end symmetry preserving inter-atomic potential energy model for finite and extended systems.