Learning the exchange-correlation functional from nature with fully differentiable density functional theory
M. F. Kasim∗ and S. M. Vinko†
Department of Physics, Clarendon Laboratory, University of Oxford, Parks Road, Oxford OX1 3PU, UK
(Dated: February 9, 2021)

Improving the predictive capability of molecular properties in ab initio simulations is essential for advanced material discovery. Despite recent progress making use of machine learning, utilizing deep neural networks to improve quantum chemistry modelling remains severely limited by the scarcity and heterogeneity of appropriate experimental data. Here we show how training a neural network to replace the exchange-correlation functional within a fully-differentiable three-dimensional Kohn-Sham density functional theory (DFT) framework can greatly improve simulation accuracy. Using only eight experimental data points on diatomic molecules, our trained exchange-correlation network provided improved prediction of atomization and ionization energies across a collection of 110 molecules when compared with both commonly used DFT functionals and more expensive coupled cluster simulations.
Fast and accurate predictive models of atomic and molecular properties are key in advancing novel material discovery. With the rapid development of machine learning (ML) techniques, considerable effort has been dedicated to accelerating predictive simulations, enabling the efficient exploration of configuration space. Much of this work focuses on end-to-end ML approaches [1], where the models are trained to predict some specific molecular or material property using tens to hundreds of thousands of data points, generated via simulations [2]. Once trained, ML models have the advantage of accurately predicting the simulation outputs much faster than the original simulations [3]. The level of accuracy achieved by ML models with respect to the simulated data can be very high, and may even reach chemical accuracy on select systems outside the original training dataset [4]. These machine learning approaches show great promise, but rely heavily on the availability of large datasets for training.

While there has been a concerted effort in accelerating predictive models, less attention has been dedicated to using machine learning to increase the accuracy of the model itself with respect to the experimental data. In particular, as ML models typically rely on some underlying simulation for the training data, the accuracy of the trained models can never exceed that of the simulations. However, training ML models using experimental data is challenging, as the available data is highly heterogeneous as well as limited, making it difficult to train accurate ML models tailored to reproduce a specific system or physical property. This limitation also poses a great challenge to the generalization capability of ML models to systems outside the training data.

One emerging solution to this problem is to use machine learning to build models which learn the exchange-correlation (xc) functional within the framework of Density Functional Theory (DFT). Here, the Kohn-Sham scheme (KS-DFT) [5] can be used to calculate various ground-state properties of molecules and materials, and its accuracy largely depends on the quality of the employed xc functional. Work along these lines in the one-dimensional case shows promise [6, 7], but is severely limited by the lack of availability of experimental data from one-dimensional systems. An alternative approach is to attempt to learn the xc functional using a dataset constructed from electron density profiles obtained from coupled-cluster (CCSD) simulations [8], and the xc energy densities obtained from an inverse Kohn-Sham scheme [9]. However, this method is fundamentally limited to at most match the accuracy of the original simulations. Recently, Nagai et al. [10] reported on successfully representing the xc functional as a neural network, trained to fit 2 different properties of 3 molecules using the three-dimensional KS-DFT calculation in PySCF [11] via gradient-less optimization. While their results show promise in predicting various material properties, gradient-less methods for training neural networks typically suffer from bad convergence [12] and are too computationally intensive to scale to larger and more complex neural networks.

In this context, we present here a new machine learning approach to learn the xc functional using heterogeneous experimental data. We achieve this by incorporating a deep neural network representing the xc functional in a fully differentiable three-dimensional KS-DFT calculation.
In our method, the xc functional is represented by a network that takes the three-dimensional electron density profile as its input and produces the xc energy density as its output. Notably, using an xc neural network (XCNN) in a fully differentiable KS-DFT calculation gives us a framework to train the xc functional using any available experimental properties, however heterogeneous, that can be simulated via standard KS-DFT. Moreover, as the neural network is independent of the system (atom type, structure, number of electrons, etc.), a well-trained XCNN should be generalizable to a wide range of systems, including those outside the pool of training data. In this paper, we demonstrate this approach using a dataset comprising atoms and molecules from the first few rows of the periodic table, but the results are readily extendable to more complex structures.

The single biggest challenge in realizing our method is writing a KS-DFT code in a fully differentiable manner. This is achieved here by writing the KS-DFT code using an automatic differentiation library, PyTorch [13], and by providing separately the gradients of all the KS-DFT components which are not available in PyTorch. The KS-DFT calculation involves self-consistent iterations of solving the Kohn-Sham equations,

$$H[n](\mathbf{r})\,\varphi_i(\mathbf{r}) = \varepsilon_i\,\varphi_i(\mathbf{r}), \tag{1}$$
$$n(\mathbf{r}) = \sum_i f_i\,|\varphi_i(\mathbf{r})|^2, \tag{2}$$

where $n(\mathbf{r})$ is the electron density as a function of the 3D coordinate $\mathbf{r}$, $f_i$ is the occupation number of the $i$-th orbital $\varphi_i$, $\varepsilon_i$ is the Kohn-Sham orbital energy, and $H[n](\mathbf{r})$ is the Hamiltonian operator, a functional of the electron density profile $n$ and a function of the 3D coordinate. The Hamiltonian operator is given in atomic units as

$$H[n](\mathbf{r}) = -\tfrac{1}{2}\nabla^2 + v_{\mathrm{ext}}(\mathbf{r}) + v_{\mathrm{H}}[n](\mathbf{r}) + v_{\mathrm{xc}}[n](\mathbf{r}),$$

with $-\tfrac{1}{2}\nabla^2$ the kinetic energy operator, $v_{\mathrm{ext}}(\mathbf{r})$ the external potential operator including the ionic Coulomb potentials, $v_{\mathrm{H}}[n](\mathbf{r})$ the Hartree potential operator, and $v_{\mathrm{xc}}[n](\mathbf{r})$ the xc potential operator. The xc potential operator is defined as the functional derivative of the xc energy with respect to the density, i.e. $v_{\mathrm{xc}}[n](\mathbf{r}) = \delta E_{\mathrm{xc}}[n]/\delta n(\mathbf{r})$.

In this work, the orbitals in the above equations are represented using contracted Gaussian-type orbital basis sets [14], $b_j(\mathbf{r})$ for $j = \{1 \ldots n_b\}$, where $n_b$ is the number of basis elements. As the basis set is non-orthogonal, Eq. (1) becomes Roothaan's equation [15],

$$\mathbf{F}[n]\,\mathbf{p}_i = \varepsilon_i\,\mathbf{S}\,\mathbf{p}_i, \tag{3}$$

where $\mathbf{p}_i$ is the vector of basis coefficients for the $i$-th orbital, $\mathbf{S}$ is the overlap matrix with elements $S_{rc} = \int b_r(\mathbf{r})\, b_c^*(\mathbf{r})\,\mathrm{d}\mathbf{r}$, and $\mathbf{F}[n]$ is the Fock matrix, a functional of the electron density profile $n$, with elements $F_{rc} = \int b_r(\mathbf{r})\, H[n](\mathbf{r})\, b_c^*(\mathbf{r})\,\mathrm{d}\mathbf{r}$. Unless specified otherwise, all calculations in this paper use 6-311++G** basis sets [16].

Equation (3) can be solved by performing an eigendecomposition of the Fock matrix to obtain the orbitals from the electron density profile. The ionic Coulomb potential, Hartree potential, and kinetic energy operators in the Fock matrix are constructed by calculating the Gaussian integrals with libcint [17]. The contribution from the xc potential operator to the Fock matrix is constructed by integrating it with the basis on the SG-3 integration grid [18], using the Becke scheme to combine multi-atomic grids [19]. The KS-DFT calculation proceeds by solving Eq. (3) and computing Eq. (2) iteratively until convergence, or self-consistency, is reached.
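To make the self-consistent procedure concrete, the following minimal sketch (our illustration, not the authors' released code) performs a single Roothaan step: it solves Eq. (3) as a generalized symmetric eigenproblem via Löwdin orthogonalization and rebuilds the density matrix from the occupied orbitals. The input matrices and occupations are random placeholders.

```python
import torch

def roothaan_step(F, S, focc):
    """Solve F p_i = eps_i S p_i (Eq. 3) and rebuild the density matrix.

    F:    (nb, nb) Fock matrix in the Gaussian basis
    S:    (nb, nb) overlap matrix (symmetric positive definite)
    focc: (norb,) occupation numbers f_i of the lowest orbitals
    """
    # Loewdin orthogonalization: transform to an ordinary symmetric
    # eigenproblem with S^(-1/2), then diagonalize.
    s_eval, s_evec = torch.linalg.eigh(S)
    S_inv_half = s_evec @ torch.diag(s_eval.rsqrt()) @ s_evec.T
    eps, c = torch.linalg.eigh(S_inv_half @ F @ S_inv_half)
    p = S_inv_half @ c[:, : focc.numel()]   # basis coefficients p_i of Eq. (3)
    # Density matrix D = sum_i f_i p_i p_i^T; the density n(r) of Eq. (2)
    # then follows by contracting D with the basis functions.
    D = (p * focc) @ p.T
    return eps[: focc.numel()], D

# Toy usage with random symmetric matrices (placeholders, not a real system)
nb = 6
A = torch.randn(nb, nb); F = 0.5 * (A + A.T)
B = torch.randn(nb, nb); S = B @ B.T + nb * torch.eye(nb)
focc = torch.tensor([2.0, 2.0])             # two doubly occupied orbitals
eps, D = roothaan_step(F, S, focc)
print(eps, (D @ S).trace())                 # trace(D S) = electron number = 4
```

In the full calculation this step sits inside the self-consistency loop: the Fock matrix is rebuilt from the updated density at every iteration until Eqs. (2) and (3) are mutually consistent.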
FIG. 1. Schematic of the xc neural network and the differentiable KS-DFT code. The trainable neural network is illustrated with darker gray boxes. The solid dark lines show the forward calculation flow, while the dashed blue lines show the gradient backpropagation flow.
The schematic of our differentiable KS-DFT code is shown in Fig. 1.

Although modern automatic differentiation libraries provide many operations that enable automatic propagation of the gradient calculation, there are many gradient operations that had to be implemented specifically for this project. An important example is the self-consistent iteration cycle. Previous approaches employed a fixed number of linear mixing iterations to achieve self-consistency, and let the automatic differentiation library propagate the gradient through the chain of iterations [7, 20]. In contrast, we implement the gradient propagation of the self-consistency cycle using an implicit gradient calculation (see supplementary materials), allowing us to deploy a better self-consistent algorithm, achieving faster and more robust convergence than linear mixing. We implemented the gradient calculation of the self-consistency cycle in xitorch [21], a publicly available computational library written specifically for this project.

Another example is the gradient calculation of the eigendecomposition with (near-)degenerate eigenvalues, which often produces numerical instabilities in the gradient calculation. To solve this problem, we follow the calculation by Kasim [22] to obtain a numerically stable gradient in the (near-)degenerate cases, which is also implemented in xitorch. Other parts of KS-DFT that require specific gradient implementations are the calculation of the xc energy, the Gaussian basis evaluation, and the Gaussian integrals. The details of the gradient calculations in these cases can be found in the supplementary materials.

With the fully differentiable KS-DFT code, the xc energy represented by a deep neural network can be trained efficiently. For this work we test two different types of xc deep neural networks (XCNN), based on two rungs of Jacob's ladder: the Local Density Approximation (LDA) and the Generalized Gradient Approximation (GGA). These xc energies can be written as

$$E_{\mathrm{nnLDA}}[n] = \alpha E_{\mathrm{LDA(x)}}[n] + \beta \int n(\mathbf{r})\, f(n, \xi)\,\mathrm{d}\mathbf{r}, \tag{4}$$
$$E_{\mathrm{nnPBE}}[n] = \alpha E_{\mathrm{PBE(xc)}}[n] + \beta \int n(\mathbf{r})\, f(n, \xi, s)\,\mathrm{d}\mathbf{r}, \tag{5}$$

where $E_{\mathrm{LDA(x)}}$ is the LDA exchange energy [5] and $E_{\mathrm{PBE(xc)}}$ is the PBE exchange-correlation energy [23]. Here $\alpha$ and $\beta$ are two tunable parameters, and $f(\cdot)$ is the trainable neural network. The LDA and PBE energies are calculated using Libxc [24], wrapped with PyTorch to provide the required gradient calculations. The XCNNs take as inputs the local electron density $n$, the relative spin polarization $\xi = (n_\uparrow - n_\downarrow)/n$, and the normalized density gradient $s = |\nabla n| / [(24\pi^2)^{1/3} n^{4/3}]$ [23]. The neural network is an ordinary feed-forward neural network with 3 hidden layers of 32 elements each, using the softplus [25] activation function so that the network is infinitely differentiable. To avoid non-convergence of the self-consistent iterations during the training, the parameters $\alpha$ and $\beta$ are initialized to values of 1 and 0, respectively.
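As an illustration of Eq. (5), a minimal PyTorch sketch of the XCNN is given below. The layer sizes, the softplus activations, the input features (n, ξ, s), the log(1 + x) renormalization of n and s (detailed in the supplementary materials), and the initialization of α and β follow the text; the class name and interface are our own assumptions rather than the actual implementation.

```python
import torch
import torch.nn as nn

class XCNN(nn.Module):
    """Sketch of the trainable correction f(n, xi, s) entering Eq. (5)."""

    def __init__(self, ninp=3):
        super().__init__()
        # 3 hidden layers of 32 elements with softplus activations, so the
        # network is infinitely differentiable
        self.net = nn.Sequential(
            nn.Linear(ninp, 32), nn.Softplus(),
            nn.Linear(32, 32), nn.Softplus(),
            nn.Linear(32, 32), nn.Softplus(),
            nn.Linear(32, 1),
        )
        # alpha = 1, beta = 0 at initialization, so training starts from
        # the plain PBE functional and the SCF cycle converges
        self.alpha = nn.Parameter(torch.tensor(1.0))
        self.beta = nn.Parameter(torch.tensor(0.0))

    def exc(self, e_pbe, n, xi, s, grid_w):
        """E_xc = alpha * E_PBE + beta * int n(r) f(n, xi, s) dr, Eq. (5).

        e_pbe is the PBE xc energy (from Libxc in the real code); n, xi, s
        and the quadrature weights grid_w live on the integration grid."""
        feat = torch.stack([torch.log1p(n), xi, torch.log1p(s)], dim=-1)
        f = self.net(feat).squeeze(-1)
        return self.alpha * e_pbe + self.beta * (grid_w * n * f).sum()
```

The LDA variant of Eq. (4) is identical except that the input s is dropped and the PBE energy is replaced by the LDA exchange energy.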
For the XCNN training, we compiled a small dataset consisting of experimental atomization energies (AE) from the NIST CCCBDB database [26], atomic ionization potentials (IP) from the NIST ASD database [27], and density profiles calculated with CCSD [8]. The calculated density profiles are used as a regularization, to ensure that the learned xc functional does not produce electron densities that stray too far from the CCSD predictions, a concern discussed in ref. [28].

The dataset is split into two groups: training and validation. The training dataset is used directly to update the parameters of the neural network via the gradient of a loss function (see supplementary materials for details on the loss function). After the training is finished, we select the model checkpoint that gives the lowest loss on the validation dataset.

The complete list of atoms and molecules used in the training and validation datasets is shown in Table I. Importantly, the size of the dataset used in this work is orders of magnitude smaller than the datasets used in most ML work on molecular property prediction in the literature [29-31]. We note that some atoms are present only in the validation set and not in the training set, in order to obtain checkpoints that generalize well to types of atoms outside the training set. Further note that there are two rows each for training and validation in the table: the first row contains only atoms from the first and second rows of the periodic table (i.e. H-Ne), while the second row also contains third-row atoms (i.e. Na-Cl). We will consider neural networks trained using various combinations of the training and validation datasets presented in the table.

TABLE I. Atoms and molecules present in the training and validation datasets.

  Type         Atomization energy    Density profile                        Ionization potential
  Training     H₂, LiH, O₂, CO       H, Li, Ne, H₂, Li₂, LiH, B₂, O₂, CO    O, Ne
               Si₂, Cl₂, SiO, HCl    Cl₂, SiO, HCl                          –
  Validation   N₂, NO, F₂, HF        He, Be, N, N₂, F₂, HF                  N, F
               P₂, S₂, SO, HS        –                                      P

To test the performance of the trained XCNNs, we prepared a test dataset consisting of the atomization energies of 110 molecules from ref. [32] and the ionization potentials of the atoms H-Ar. The experimental geometries as well as the atomization energies for the 110 molecules were obtained from the NIST CCCBDB database [26], while the ionization potentials were obtained from the NIST ASD database [27]. There are in fact 146 molecules in ref. [32]; however, only 110 of them have experimental data in CCCBDB, so we limit our dataset to those. The atomization energy test dataset contains molecules with 2-11 atoms built from the elements H-Cl. A complete list of all the atoms and molecules used in the test dataset can be found in the supplementary materials.

Our first experiment trains the XCNN using molecules with H-Ne atoms only, and without the ionization potential dataset, i.e., using only atomization energies and density profile data. The results are presented in Table II. We see that in this case XCNN-LDA does not provide a significant improvement over the plain LDA exchange functional, with the exception of non-hydrocarbon molecules containing only H-Ne atoms (the column "AE 33 others-1"). In contrast, XCNN-PBE is seen to perform substantially better than the PBE functional, with up to 3 times lower average prediction errors when calculating the atomization energies across the 110 test molecules. XCNN-PBE also achieves better results than PBE on all of the subsets except the non-hydrocarbon molecules containing third-row atoms ("AE 31 others-2"), where the results are comparable.
TABLE II. Mean absolute error (MAE) in kcal/mol of the atomization energy and ionization potential for atoms and molecules in the test dataset. The column "IP 18" gives the MAE in ionization potential for the atoms H-Ar. The column "AE 110" is the MAE of the atomization energy of the collection of 110 molecules from ref. [32]. The following 4 columns present the atomization energies of subsets of molecules from the full 110-molecule set: "AE 18 HC" for 18 hydrocarbons, "AE 28 subs HC" for 28 substituted hydrocarbons, "AE 33 others-1" for 33 non-hydrocarbon molecules containing only first and second row atoms, and "AE 31 others-2" for 31 non-hydrocarbon molecules containing at least one third row atom. The suffix "-IP" in the XCNN calculation indicates that the xc neural networks were trained with the ionization potential dataset, while the suffix "-2" indicates the inclusion of molecules with third row atoms in the training and validation datasets. The bolded and underlined values are the best MAE in the respective column.

  Calculation                   IP 18   AE 110   AE 18 HC   AE 28 subs HC   AE 33 others-1   AE 31 others-2
  Local density approximations (LDA)
  LDA (PW92) [33]                 6.9     82.0      127.0           126.9             60.2             38.5
  LDA (exchange only)            24.6     18.0       16.1            17.0             21.2             16.5
  XCNN-LDA                       21.9     19.0       31.2            25.1              7.5             18.6
  XCNN-LDA-IP                    13.5     16.0       26.2            25.0              8.4             10.1
  XCNN-LDA-2                    163.0     22.8       39.2            36.9             13.6             10.4
  Generalized gradient approximations (GGA)
  PBE [23]                        3.6     27.9       47.5            45.6             20.0              8.8
  XCNN-PBE                        9.0      8.1        4.8             8.2                –                –
  Other approximations
  SCAN [34]                       3.6     18.5       35.6            31.5              6.0              8.6
  CCSD (basis: 6-311++G**)        7.9     18.2       11.1            17.5             20.2             20.9
  CCSD (basis: cc-pvqz [36])        –        –          –               –                –                –

Importantly, although the training and validation sets contain atomization energies for only 8 diatomic molecules, XCNN-PBE provides an improvement in the atomization energy predictions for the 110-molecule test dataset, which includes molecules with up to 11 atoms. XCNN-PBE provides the largest improvement over PBE on the subsets of hydrocarbon and substituted hydrocarbon molecules, which contain many molecular bonds present in neither the training nor the validation datasets (e.g., C-H, C-C, C=C). This demonstrates the generalizability of our approach in learning xc functionals that can be applied to entirely new bonds and molecular systems.
FIG. 2. Statistical comparison of XCNN-PBE and CCSD for each group. The line in the box represents the median of the absolute error. For clarity, outliers are not shown.

Despite these encouraging results on the atomization energies, XCNN-PBE leads to worse predictions of atomic ionization potentials compared with PBE. Therefore, for the next experiment we trained the XCNN by adding ionization potential data to the training and validation datasets, as shown in Table I. The xc neural networks trained with this additional data are called "XCNN-LDA-IP" and "XCNN-PBE-IP". Adding ionization potential data to the training and validation datasets improves the performance of the XCNN on almost all test datasets. Further, we note that when the XCNNs are trained with molecules that contain third-row atoms (XCNN-PBE-2 and XCNN-LDA-2), they yield xc functionals that are better at predicting atomization energies of molecules containing third-row atoms, but at the cost of worse predictions for molecules containing atoms from the first two rows only. These effects are observed for both XCNN-LDA and XCNN-PBE.

The regions of the electron density in which changes take place upon ionization or atomization provide insight into the different behaviour of the XCNN-PBE-2 and XCNN-PBE-IP functionals. For the atomization process, the changes in electron density when moving from bonded molecules to isolated atoms take place mostly in the low-density regions, due to bond-breaking of valence electrons. On the other hand, ionization depends more on the high-density regions, as core electrons end up more tightly bound in ionized atoms. Thus, adding ionization potential data to the training dataset complements the atomization energy data by providing further constraints on the xc functional in regions of higher electron density. However, atomization energies from molecules with third-row atoms may compete with the data from molecules containing atoms from the first two rows only, due to the limitations of the GGA and LDA forms of the XCNN. We speculate that this competition could be resolved by moving to a higher rung of Jacob's ladder, or by incorporating more global density information into the XCNN. While perfectly viable within our framework, such investigations remain outside the scope of the current paper.

A final result of note is that the mean absolute errors of the atomization energy predictions of XCNN-PBE and XCNN-PBE-IP are lower than both SCAN [34], an xc functional on a higher rung of Jacob's ladder, and even the more expensive CCSD calculations. A detailed comparison of our XCNN-PBE results with CCSD calculations is presented in Fig. 2. We see that the results from XCNN-PBE-IP are similar to, and in some cases even better than, calculations based on CCSD.
This result highlights a key advantage of our method over other ML-based methods that rely fully on simulation data: as our method utilizes heterogeneous experimental data in the training, it enables us to learn xc functionals that can exceed in accuracy even the simulations used to help train them.

We presented a novel approach to train machine-learned xc functionals within the framework of fully-differentiable Kohn-Sham DFT using experimental data. Our xc neural networks, trained on a few atoms and diatomic molecules, are able to improve the prediction of atomization and ionization energies across a set of 110 molecules not present in the training dataset, including larger molecules containing new types of bonds. The predictions of the trained xc neural networks are, in some cases, even closer to the experimental data than those based on CCSD simulations. We have limited our study to a minimal training dataset and to a relatively small neural network, but our results are readily extendable to substantially larger systems, bigger experimental datasets, additional heterogeneous physical quantities, and more sophisticated neural networks. This demonstration of experimental-data-driven xc functional learning, embedded within a differentiable DFT simulator, shows great promise for advancing computational material discovery.

M.F.K. and S.M.V. acknowledge support from the UK EPSRC grant EP/P015794/1 and the Royal Society. S.M.V. is a Royal Society University Research Fellow. The authors declare no conflict of interest.

∗ [email protected]
† [email protected]

[1] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pages 1263–1272. PMLR, 2017.
[2] Kristof T Schütt, Huziel E Sauceda, P-J Kindermans, Alexandre Tkatchenko, and K-R Müller. SchNet: a deep learning architecture for molecules and materials. The Journal of Chemical Physics, 148(24):241722, 2018.
[3] Justin S Smith, Olexandr Isayev, and Adrian E Roitberg. ANI-1: an extensible neural network potential with DFT accuracy at force field computational cost. Chemical Science, 8(4):3192–3203, 2017.
[4] Felix A Faber, Luke Hutchison, Bing Huang, Justin Gilmer, Samuel S Schoenholz, George E Dahl, Oriol Vinyals, Steven Kearnes, Patrick F Riley, and O Anatole von Lilienfeld. Prediction errors of molecular machine learning models lower than hybrid DFT error. Journal of Chemical Theory and Computation, 13(11):5255–5264, 2017.
[5] Walter Kohn and Lu Jeu Sham. Self-consistent equations including exchange and correlation effects. Physical Review, 140(4A):A1133, 1965.
[6] John C Snyder, Matthias Rupp, Katja Hansen, Klaus-Robert Müller, and Kieron Burke. Finding density functionals with machine learning. Physical Review Letters, 108(25):253002, 2012.
[7] Li Li, Stephan Hoyer, Ryan Pederson, Ruoxi Sun, Ekin D Cubuk, Patrick Riley, Kieron Burke, et al. Kohn-Sham equations as regularizer: Building prior knowledge into machine-learned physics. Physical Review Letters, 126(3):036401, 2021.
[8] Jiří Čížek. On the correlation problem in atomic and molecular systems. Calculation of wavefunction components in Ursell-type expansion using quantum-field theoretical methods. The Journal of Chemical Physics, 45(11):4256–4266, 1966.
[9] Bikash Kanungo, Paul M Zimmerman, and Vikram Gavini. Exact exchange-correlation potentials from ground-state electron densities. Nature Communications, 10(1):1–9, 2019.
[10] Ryo Nagai, Ryosuke Akashi, and Osamu Sugino. Completing density functional theory by machine learning hidden messages from molecules. npj Computational Materials, 6(1):1–8, 2020.
[11] Qiming Sun, Timothy C Berkelbach, Nick S Blunt, George H Booth, Sheng Guo, Zhendong Li, Junzi Liu, James D McClain, Elvira R Sayfutyarova, Sandeep Sharma, et al. PySCF: the Python-based simulations of chemistry framework. Wiley Interdisciplinary Reviews: Computational Molecular Science, 8(1):e1340, 2018.
[12] Niru Maheswaranathan, Luke Metz, George Tucker, Dami Choi, and Jascha Sohl-Dickstein. Guided evolutionary strategies: Augmenting random search with surrogate gradients. In International Conference on Machine Learning, pages 4264–4273. PMLR, 2019.
[13] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32, pages 8026–8037. Curran Associates, Inc., 2019.
[14] Benjamin P Pritchard, Doaa Altarawy, Brett Didier, Tara D Gibson, and Theresa L Windus. New basis set exchange: An open, up-to-date resource for the molecular sciences community. Journal of Chemical Information and Modeling, 59(11):4814–4820, 2019.
[15] Clemens Carel Johannes Roothaan. New developments in molecular orbital theory. Reviews of Modern Physics, 23(2):69, 1951.
[16] Timothy Clark, Jayaraman Chandrasekhar, Günther W Spitznagel, and Paul von Ragué Schleyer. Efficient diffuse function-augmented basis sets for anion calculations. III. The 3-21+G basis set for first-row elements, Li-F. Journal of Computational Chemistry, 4(3):294–301, 1983.
[17] Qiming Sun. Libcint: An efficient general integral library for Gaussian basis functions. Journal of Computational Chemistry, 36(22):1664–1671, 2015.
[18] Saswata Dasgupta and John M Herbert. Standard grids for high-precision integration of modern density functionals: SG-2 and SG-3. Journal of Computational Chemistry, 38(12):869–882, 2017.
[19] Axel D Becke. A multicenter numerical integration scheme for polyatomic molecules. The Journal of Chemical Physics, 88(4):2547–2553, 1988.
[20] Teresa Tamayo-Mendoza, Christoph Kreisbeck, Roland Lindh, and Alán Aspuru-Guzik. Automatic differentiation in quantum chemistry with applications to fully variational Hartree-Fock. ACS Central Science, 4(5):559–566, 2018.
[21] Muhammad F Kasim and Sam M Vinko. ξ-torch: differentiable scientific computing library. arXiv preprint arXiv:2010.01921, 2020.
[22] Muhammad Firmansyah Kasim. Derivatives of partial eigendecomposition of a real symmetric matrix for degenerate cases. arXiv preprint arXiv:2011.04366, 2020.
[23] John P Perdew, Kieron Burke, and Matthias Ernzerhof. Generalized gradient approximation made simple. Physical Review Letters, 77(18):3865, 1996.
[24] Miguel AL Marques, Micael JT Oliveira, and Tobias Burnus. Libxc: A library of exchange and correlation functionals for density functional theory. Computer Physics Communications, 183(10):2272–2281, 2012.
[25] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
[26] Russell D Johnson III and NIST CCCBDB Team. NIST Computational Chemistry Comparison and Benchmark Database (Release 21), [Online]. Available: https://dx.doi.org/10.18434/T47C7Z [2020, August]. National Institute of Standards and Technology, Gaithersburg, MD, 2020.
[27] A. Kramida, Yu. Ralchenko, J. Reader, and NIST ASD Team. NIST Atomic Spectra Database (ver. 5.8), [Online]. Available: https://dx.doi.org/10.18434/T4W30F [2020, October]. National Institute of Standards and Technology, Gaithersburg, MD, 2020.
[28] Michael G Medvedev, Ivan S Bushmarinov, Jianwei Sun, John P Perdew, and Konstantin A Lyssenko. Density functional theory is straying from the path toward the exact functional. Science, 355(6320):49–52, 2017.
[29] Raghunathan Ramakrishnan, Pavlo O Dral, Matthias Rupp, and O Anatole von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1, 2014.
[30] Stefan Chmiela, Alexandre Tkatchenko, Huziel E Sauceda, Igor Poltavsky, Kristof T Schütt, and Klaus-Robert Müller. Machine learning of accurate energy-conserving molecular force fields. Science Advances, 3(5):e1603015, 2017.
[31] Felix Brockherde, Leslie Vogt, Li Li, Mark E Tuckerman, Kieron Burke, and Klaus-Robert Müller. Bypassing the Kohn-Sham equations with machine learning. Nature Communications, 8(1):1–10, 2017.
[32] Larry A Curtiss, Krishnan Raghavachari, Paul C Redfern, and John A Pople. Assessment of Gaussian-2 and density functional theories for the computation of enthalpies of formation. The Journal of Chemical Physics, 106(3):1063–1079, 1997.
[33] John P Perdew and Yue Wang. Accurate and simple analytic representation of the electron-gas correlation energy. Physical Review B, 45(23):13244, 1992.
[34] Jianwei Sun, Adrienn Ruzsinszky, and John P Perdew. Strongly constrained and appropriately normed semilocal density functional. Physical Review Letters, 115(3):036402, 2015.
[35] Axel D Becke. A new mixing of Hartree-Fock and local density-functional theories. The Journal of Chemical Physics, 98(2):1372–1377, 1993.
[36] Thom H Dunning Jr. Gaussian basis sets for use in correlated molecular calculations. I. The atoms boron through neon and hydrogen. The Journal of Chemical Physics, 90(2):1007–1023, 1989.

Supplementary materials: Learning the exchange-correlation functional from nature with fully differentiable density functional theory
M. F. Kasim∗ and S. M. Vinko
Department of Physics, Clarendon Laboratory, University of Oxford, Parks Road, Oxford OX1 3PU, UK
(Dated: February 9, 2021)
I. GRADIENT OF KS-DFT COMPONENTS

A. Self-consistent iteration cycle
The self-consistent iteration cycle in a KS-DFT calculation can be written mathematically as

$$\mathbf{y} = \mathbf{f}(\mathbf{y}, \theta), \tag{1}$$

where $\mathbf{y}$ are the parameters to be found self-consistently, and $\mathbf{f}(\cdot)$ is the function that needs to be satisfied, which may depend additionally on other parameters, here denoted by $\theta$. The self-consistent iteration takes the parameters $\theta$ and an initial guess $\mathbf{y}_0$, and produces the self-consistent parameters $\mathbf{y}$ that satisfy the equation above. In our case, the self-consistent parameters are the elements of the Fock matrix.

The gradient of a loss function $\mathcal{L}$ with respect to the parameters $\theta$, given the gradient with respect to the self-consistent parameters $\partial\mathcal{L}/\partial\mathbf{y}$, can be expressed as

$$\frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \left( \mathbf{I} - \frac{\partial \mathbf{f}}{\partial \mathbf{y}} \right)^{-1} \frac{\partial \mathbf{f}}{\partial \theta}, \tag{2}$$

where $\mathbf{I}$ is the identity matrix. The inverse-matrix vector product in the equation above is computed using the BiCGSTAB algorithm [1]. The gradient of this self-consistent iteration is implemented in xitorch [2].
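A minimal sketch of Eq. (2) for a generic fixed-point map f is shown below; for brevity it forms the Jacobian densely and uses torch.linalg.solve, whereas xitorch applies the same identity matrix-free with BiCGSTAB. The function names and the toy map are our own.

```python
import torch

def solve_scf(f, y0, theta, niter=200):
    """Forward pass: find y* = f(y*, theta) by plain fixed-point iteration.
    (A production code would use a robust accelerated SCF solver.)"""
    y = y0
    with torch.no_grad():   # gradients come from Eq. (2), not from the loop
        for _ in range(niter):
            y = f(y, theta)
    return y

def scf_grad_theta(f, ystar, theta, dL_dy):
    """Backward pass, Eq. (2): dL/dtheta = dL/dy (I - df/dy)^(-1) df/dtheta."""
    Jy = torch.autograd.functional.jacobian(lambda y: f(y, theta), ystar)
    I = torch.eye(ystar.numel())
    # v = dL/dy (I - df/dy)^(-1), obtained from the transposed linear system
    v = torch.linalg.solve((I - Jy).T, dL_dy)
    # contract v with df/dtheta via a vector-Jacobian product
    ys = ystar.detach().requires_grad_(True)
    with torch.enable_grad():
        fy = f(ys, theta)
    (dtheta,) = torch.autograd.grad(fy, theta, grad_outputs=v)
    return dtheta

# Toy check with a contractive map: y* = 0.1 tanh(A y*) + theta
n = 4
A = torch.randn(n, n)
theta = torch.randn(n, requires_grad=True)
f = lambda y, th: 0.1 * torch.tanh(A @ y) + th
ystar = solve_scf(f, torch.zeros(n), theta)
print(scf_grad_theta(f, ystar, theta, torch.ones(n)))   # pretend dL/dy = 1
```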
B. Gaussian basis

The differentiable KS-DFT code employs a contracted Gaussian basis. The evaluation of the Gaussian basis in 3D space is performed using libcgto, a library within PySCF [3] that can generate code to compute a Gaussian basis of any order. As libcgto does not provide an automatic differentiation feature, it has to be wrapped by PyTorch to enable automatic differentiation through the quantities calculated by libcgto.

A single Gaussian basis function of order $l, m, n$ centered at $\mathbf{r}_0$ is expressed as

$$g_{lmn}(\mathbf{r}; \alpha, c, \mathbf{r}_0) = c\, (x - x_0)^l (y - y_0)^m (z - z_0)^n\, e^{-\alpha \|\mathbf{r} - \mathbf{r}_0\|^2}, \tag{3}$$

where $c$ is the basis coefficient and $\alpha$ is the exponential parameter. A contracted Gaussian basis function is simply a linear combination of the single Gaussian functions above.

The gradient back-propagation with respect to the parameters $c$, $\alpha$, and $\mathbf{r}_0$ can be expressed as

$$\frac{\partial \mathcal{L}}{\partial c} = \frac{\partial \mathcal{L}}{\partial g_{lmn}} \frac{\partial g_{lmn}}{\partial c}, \tag{4}$$
$$\frac{\partial \mathcal{L}}{\partial \alpha} = \frac{\partial \mathcal{L}}{\partial g_{lmn}} \frac{\partial g_{lmn}}{\partial \alpha}, \tag{5}$$
$$\frac{\partial \mathcal{L}}{\partial \mathbf{r}_0} = \frac{\partial \mathcal{L}}{\partial g_{lmn}} \frac{\partial g_{lmn}}{\partial \mathbf{r}_0}, \tag{6}$$

where

$$\frac{\partial g_{lmn}}{\partial c} = \frac{g_{lmn}}{c}, \tag{7}$$
$$\frac{\partial g_{lmn}}{\partial \alpha} = -\left( g_{(l+2)mn} + g_{l(m+2)n} + g_{lm(n+2)} \right), \tag{8}$$
$$\frac{\partial g_{lmn}}{\partial \mathbf{r}_0} = -\nabla g_{lmn}. \tag{9}$$

The quantities required for the calculations above (e.g. the gradient of the basis and the higher-order basis functions) can be provided by libcgto.
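Equation (8) can be checked numerically with autograd, as in the short snippet below; note that, as written, it assumes the higher-order Gaussians share the same coefficient c as g_lmn. The orders and parameters are arbitrary placeholders.

```python
import torch

def gaussian(r, r0, c, alpha, l, m, n):
    """Primitive Cartesian Gaussian g_lmn(r; alpha, c, r0) of Eq. (3)."""
    d = r - r0
    return c * d[..., 0]**l * d[..., 1]**m * d[..., 2]**n \
             * torch.exp(-alpha * (d * d).sum(-1))

r = torch.randn(5, 3)                       # a few arbitrary evaluation points
r0 = torch.zeros(3)
alpha = torch.tensor(0.8, requires_grad=True)
c, (l, m, n) = 1.0, (1, 0, 2)

g = gaussian(r, r0, c, alpha, l, m, n)
(dg_dalpha,) = torch.autograd.grad(g.sum(), alpha)
# Eq. (8): dg/dalpha = -(g_{(l+2)mn} + g_{l(m+2)n} + g_{lm(n+2)})
rhs = -(gaussian(r, r0, c, alpha, l + 2, m, n)
        + gaussian(r, r0, c, alpha, l, m + 2, n)
        + gaussian(r, r0, c, alpha, l, m, n + 2)).sum()
print(torch.allclose(dg_dalpha, rhs))       # True
```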
C. Gaussian integrals

Constructing the Fock matrix and the overlap matrix in Roothaan's equation requires the evaluation of Gaussian integrals. The integrals typically involve two to four Gaussian basis functions of any order:

$$S_{ijk\text{-}lmn} = \int g_{ijk}(\mathbf{r})\, g_{lmn}(\mathbf{r})\,\mathrm{d}\mathbf{r}, \tag{10}$$
$$K_{ijk\text{-}lmn} = -\frac{1}{2} \int g_{ijk}(\mathbf{r})\, \nabla^2 g_{lmn}(\mathbf{r})\,\mathrm{d}\mathbf{r}, \tag{11}$$
$$C_{ijk\text{-}lmn} = \sum_c \int g_{ijk}(\mathbf{r})\, \frac{Z_c}{|\mathbf{r} - \mathbf{r}_c|}\, g_{lmn}(\mathbf{r})\,\mathrm{d}\mathbf{r}, \tag{12}$$
$$E_{ijk\text{-}lmn\text{-}pqr\text{-}stu} = \iint g_{ijk}(\mathbf{r}_1)\, g_{lmn}(\mathbf{r}_1)\, \frac{1}{|\mathbf{r}_1 - \mathbf{r}_2|}\, g_{pqr}(\mathbf{r}_2)\, g_{stu}(\mathbf{r}_2)\,\mathrm{d}\mathbf{r}_1\,\mathrm{d}\mathbf{r}_2, \tag{13}$$

where $S$ is the overlap term, $K$ the kinetic term, $C$ the ionic Coulomb term of ions of charge $Z_c$ at coordinates $\mathbf{r}_c$, and $E$ the two-electron Coulomb term. These integrals can be calculated efficiently by libcint [4].

The gradients of a loss function $\mathcal{L}$ with respect to the basis parameters $\theta = \{c, \alpha, \mathbf{r}_0\}$ are

$$\frac{\partial \mathcal{L}}{\partial \theta} = \left( \frac{\partial \mathcal{L}}{\partial \theta} \right)_S + \left( \frac{\partial \mathcal{L}}{\partial \theta} \right)_K + \left( \frac{\partial \mathcal{L}}{\partial \theta} \right)_C + \left( \frac{\partial \mathcal{L}}{\partial \theta} \right)_E, \tag{14}$$

$$\left( \frac{\partial \mathcal{L}}{\partial \theta} \right)_S = \frac{\partial \mathcal{L}}{\partial S} \left[ \int \frac{\partial g_{ijk}(\mathbf{r})}{\partial \theta}\, g_{lmn}(\mathbf{r})\,\mathrm{d}\mathbf{r} + \int g_{ijk}(\mathbf{r})\, \frac{\partial g_{lmn}(\mathbf{r})}{\partial \theta}\,\mathrm{d}\mathbf{r} \right], \tag{15}$$

$$\left( \frac{\partial \mathcal{L}}{\partial \theta} \right)_K = -\frac{1}{2} \frac{\partial \mathcal{L}}{\partial K} \left[ \int \frac{\partial g_{ijk}(\mathbf{r})}{\partial \theta}\, \nabla^2 g_{lmn}(\mathbf{r})\,\mathrm{d}\mathbf{r} + \int g_{ijk}(\mathbf{r})\, \nabla^2 \frac{\partial g_{lmn}(\mathbf{r})}{\partial \theta}\,\mathrm{d}\mathbf{r} \right], \tag{16}$$

$$\left( \frac{\partial \mathcal{L}}{\partial \theta} \right)_C = \frac{\partial \mathcal{L}}{\partial C} \sum_c \left[ \int \frac{\partial g_{ijk}(\mathbf{r})}{\partial \theta}\, \frac{Z_c}{|\mathbf{r} - \mathbf{r}_c|}\, g_{lmn}(\mathbf{r})\,\mathrm{d}\mathbf{r} + \int g_{ijk}(\mathbf{r})\, \frac{Z_c}{|\mathbf{r} - \mathbf{r}_c|}\, \frac{\partial g_{lmn}(\mathbf{r})}{\partial \theta}\,\mathrm{d}\mathbf{r} \right], \tag{17}$$

$$\left( \frac{\partial \mathcal{L}}{\partial \theta} \right)_E = \frac{\partial \mathcal{L}}{\partial E} \left[ \iint \frac{\partial g_{ijk}(\mathbf{r}_1)}{\partial \theta}\, \frac{g_{lmn}(\mathbf{r}_1)\, g_{pqr}(\mathbf{r}_2)\, g_{stu}(\mathbf{r}_2)}{|\mathbf{r}_1 - \mathbf{r}_2|}\,\mathrm{d}\mathbf{r}_1\,\mathrm{d}\mathbf{r}_2 + \iint \frac{\partial g_{lmn}(\mathbf{r}_1)}{\partial \theta}\, \frac{g_{ijk}(\mathbf{r}_1)\, g_{pqr}(\mathbf{r}_2)\, g_{stu}(\mathbf{r}_2)}{|\mathbf{r}_1 - \mathbf{r}_2|}\,\mathrm{d}\mathbf{r}_1\,\mathrm{d}\mathbf{r}_2 + \iint \frac{\partial g_{pqr}(\mathbf{r}_2)}{\partial \theta}\, \frac{g_{ijk}(\mathbf{r}_1)\, g_{lmn}(\mathbf{r}_1)\, g_{stu}(\mathbf{r}_2)}{|\mathbf{r}_1 - \mathbf{r}_2|}\,\mathrm{d}\mathbf{r}_1\,\mathrm{d}\mathbf{r}_2 + \iint \frac{\partial g_{stu}(\mathbf{r}_2)}{\partial \theta}\, \frac{g_{ijk}(\mathbf{r}_1)\, g_{lmn}(\mathbf{r}_1)\, g_{pqr}(\mathbf{r}_2)}{|\mathbf{r}_1 - \mathbf{r}_2|}\,\mathrm{d}\mathbf{r}_1\,\mathrm{d}\mathbf{r}_2 \right]. \tag{18}$$

If the parameter $\theta$ does not belong to the basis function $g$, then $\partial g/\partial\theta = 0$; otherwise, the value of $\partial g/\partial\theta$ for the various parameters is listed in the previous subsection. For the gradient with respect to the atomic position $\mathbf{r}_c$, the expression is

$$\frac{\partial \mathcal{L}}{\partial \mathbf{r}_c} = \frac{\partial \mathcal{L}}{\partial C} \left[ \int \nabla g_{ijk}(\mathbf{r})\, \frac{Z_c}{|\mathbf{r} - \mathbf{r}_c|}\, g_{lmn}(\mathbf{r})\,\mathrm{d}\mathbf{r} + \int g_{ijk}(\mathbf{r})\, \frac{Z_c}{|\mathbf{r} - \mathbf{r}_c|}\, \nabla g_{lmn}(\mathbf{r})\,\mathrm{d}\mathbf{r} \right]. \tag{19}$$

The expression above is obtained by partial integration. As the derivative of a Gaussian basis function with respect to its parameters is also a Gaussian basis function, the required integrals can be provided by libcint [4].
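The wrapping pattern used for these integrals can be illustrated with a toy torch.autograd.Function whose backward supplies the derivative integral by hand. The example below uses the closed-form overlap of two concentric 1D s-type Gaussians; it is our own construction that only mimics how the libcint integrals are wrapped, exploiting the fact that the derivative of a Gaussian integral is again a Gaussian integral.

```python
import math
import torch

class OverlapS(torch.autograd.Function):
    """Toy integral wrapper: S(a, b) = int exp(-a x^2) exp(-b x^2) dx
    = sqrt(pi / (a + b)), with the gradient supplied analytically."""

    @staticmethod
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)
        return torch.sqrt(math.pi / (a + b))

    @staticmethod
    def backward(ctx, grad_out):
        a, b = ctx.saved_tensors
        # derivative integral: dS/da = dS/db = -(1/2) sqrt(pi) (a + b)^(-3/2)
        dS = -0.5 * math.sqrt(math.pi) * (a + b) ** -1.5
        return grad_out * dS, grad_out * dS

a = torch.tensor(1.3, requires_grad=True)
b = torch.tensor(0.7, requires_grad=True)
OverlapS.apply(a, b).backward()
print(a.grad, b.grad)    # matches autograd through sqrt(pi / (a + b))
```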
D. Exchange-correlation energy and potential

The exchange-correlation (xc) energy in our calculation is a hybrid between available xc energy functionals (the LDA [5] and PBE [6] models) and a neural network xc energy. The gradient back-propagation through the xc neural network is easily provided by PyTorch. However, as we use Libxc [7] to calculate the previously available xc energies, it has to be wrapped by PyTorch to provide the gradient back-propagation. As Libxc can also calculate the gradient of the energy with respect to its parameters, providing the automatic differentiation feature by wrapping it with PyTorch is just a matter of calling the right function in Libxc.

Given the PyTorch wrapper for Libxc, calculating the xc potential can be done using the automatic differentiation feature provided by PyTorch. The component of the Fock matrix from the xc potential for the LDA spin-polarized case is

$$V^{\mathrm{(LDA)}}_{ij} = \int \varphi_i^*(\mathbf{r})\, v^{\mathrm{(LDA)}}_s(\mathbf{r})\, \varphi_j(\mathbf{r})\,\mathrm{d}\mathbf{r}, \tag{21}$$

where the subscript $s$ can be $\uparrow$ or $\downarrow$, representing the spin of the electron, and $v^{\mathrm{(LDA)}}_s$ is the xc potential,

$$v^{\mathrm{(LDA)}}_s = \frac{\partial}{\partial n_s} \int n\, \varepsilon^{\mathrm{(LDA)}}(n_\uparrow, n_\downarrow)\,\mathrm{d}\mathbf{r}, \tag{22}$$

with $\varepsilon^{\mathrm{(LDA)}}$ the energy density per unit volume per electron, provided by Libxc. The 3D spatial integration is performed using an SG-3 integration grid [8] with the Becke scheme to combine multi-atomic grids [9].

For spin-polarized GGA, the potential linear operator is [10]

$$V^{\mathrm{(GGA)}}_{ij} = \int \varphi_i^*(\mathbf{r})\, v^{\mathrm{(GGA)}}_s(\mathbf{r})\, \varphi_j(\mathbf{r})\,\mathrm{d}\mathbf{r}, \tag{23}$$
$$v^{\mathrm{(GGA)}}_s = \frac{\partial}{\partial n_s} \int n\, \varepsilon^{\mathrm{(GGA)}}(n_\uparrow, n_\downarrow, \nabla n_\uparrow, \nabla n_\downarrow)\,\mathrm{d}\mathbf{r} + \left( \frac{\partial}{\partial \nabla n_s} \int n\, \varepsilon^{\mathrm{(GGA)}}\,\mathrm{d}\mathbf{r} \right) \cdot \nabla, \tag{24}$$

where the derivatives with respect to $n_s$ and $\nabla n_s$ can be performed by automatic differentiation, and the last $\nabla$ term is applied to the basis, which can be provided by libcgto from PySCF [3].
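On a quadrature grid, the functional derivative reduces to an ordinary gradient divided by the quadrature weights, which the snippet below demonstrates for a toy spin-unpolarized LDA exchange energy. The grid weights and density are placeholders.

```python
import torch

# E_xc = sum_i w_i * eps(n_i) with the toy LDA exchange eps(n) = -Cx n^(4/3)
Cx = (3.0 / 4.0) * (3.0 / torch.pi) ** (1.0 / 3.0)
w = torch.rand(64) + 0.1                      # quadrature weights (placeholder)
n = (torch.rand(64) + 0.01).requires_grad_()  # electron density on the grid

E_xc = (w * (-Cx * n ** (4.0 / 3.0))).sum()
(dE_dn,) = torch.autograd.grad(E_xc, n)
v_xc = dE_dn / w                              # v_xc(r_i) = (dE_xc/dn_i) / w_i
print(torch.allclose(v_xc, -(4.0 / 3.0) * Cx * n.detach() ** (1.0 / 3.0)))  # True
```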
E. (Near-)degenerate case of eigendecomposition

One challenge in propagating the gradient backwards for some molecules is the degeneracy of eigenvalues in the eigendecomposition. The eigendecomposition of a real symmetric matrix $\mathbf{A}$ can be written as

$$\mathbf{A} \mathbf{v}_i = \lambda_i \mathbf{v}_i, \tag{25}$$

with $\lambda_i$ and $\mathbf{v}_i$ the $i$-th eigenvalue and eigenvector, respectively.

In the non-degenerate case, the gradient of a loss function $\mathcal{L}$ with respect to each element of the matrix $\mathbf{A}$ can be written as

$$\frac{\partial \mathcal{L}}{\partial \mathbf{A}} = -\sum_j \sum_{i \neq j} (\lambda_i - \lambda_j)^{-1}\, \mathbf{v}_i \mathbf{v}_i^T\, \frac{\partial \mathcal{L}}{\partial \mathbf{v}_j}\, \mathbf{v}_j^T. \tag{26}$$

A problem arises when degeneracy appears, as a $0^{-1}$ term appears in the equation above. To solve this problem, we follow the equation from an earlier paper by one of the authors [11], which can be written simply as

$$\frac{\partial \mathcal{L}}{\partial \mathbf{A}} = -\sum_j \sum_{i \neq j,\ \lambda_i \neq \lambda_j} (\lambda_i - \lambda_j)^{-1}\, \mathbf{v}_i \mathbf{v}_i^T\, \frac{\partial \mathcal{L}}{\partial \mathbf{v}_j}\, \mathbf{v}_j^T. \tag{27}$$

The equation above is only valid as long as the loss function does not depend on which linear combination of the degenerate eigenvectors is used, and the elements of the matrix $\mathbf{A}$ depend on parameters in a way that always keeps $\mathbf{A}$ real and symmetric. For a more detailed derivation, please see [11].
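A minimal sketch of Eq. (27) is given below, assuming (as the text requires) that the loss is invariant under rotations within each degenerate subspace; the names and the degeneracy tolerance are our own choices.

```python
import torch

def eig_grad_degenerate_safe(A, dL_dV, tol=1e-9):
    """Gradient of a loss through the eigenvectors of a real symmetric A,
    following Eq. (27): (near-)degenerate pairs are simply skipped.

    dL_dV: (n, n) matrix whose j-th column is dL/dv_j."""
    lam, V = torch.linalg.eigh(A)
    gap = lam.unsqueeze(1) - lam.unsqueeze(0)     # gap[i, j] = lam_i - lam_j
    inv_gap = torch.where(gap.abs() > tol, gap.reciprocal(),
                          torch.zeros_like(gap))  # zero out degenerate pairs
    # M[i, j] = (lam_i - lam_j)^(-1) * v_i^T (dL/dv_j) for non-degenerate pairs
    M = inv_gap * (V.T @ dL_dV)
    return -V @ M @ V.T
```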
II. NEURAL NETWORK TRAINING

The parameters of the XCNN are updated based on the gradient of a loss function with respect to those parameters. The loss function is given by a combination of three terms,

$$\mathcal{L} = \frac{w_{\mathrm{ae}}}{N_{\mathrm{ae}}} \sum_{i=1}^{N_{\mathrm{ae}}} \left( \hat{E}^{\mathrm{(ae)}}_i - E^{\mathrm{(ae)}}_i \right)^2 + \frac{w_{\mathrm{ip}}}{N_{\mathrm{ip}}} \sum_{i=1}^{N_{\mathrm{ip}}} \left( \hat{E}^{\mathrm{(ip)}}_i - E^{\mathrm{(ip)}}_i \right)^2 + \frac{w_{\mathrm{dp}}}{N_{\mathrm{dp}}} \sum_{i=1}^{N_{\mathrm{dp}}} \int \left[ \hat{n}_i(\mathbf{r}) - n_i(\mathbf{r}) \right]^2 \mathrm{d}\mathbf{r}, \tag{28}$$

where $w_{(\cdot)}$ are the weights assigned to each type of data, $N_{(\cdot)}$ are the numbers of entries of each data type in the dataset, $E^{\mathrm{(ae)}}$ and $E^{\mathrm{(ip)}}$ are, respectively, the atomization energy and ionization potential predicted using the XCNN and KS-DFT, and $n$ is the electron density profile predicted with the XCNN and KS-DFT. The variables with a hat, $\hat{E}^{(\cdot)}$ and $\hat{n}$, are the target values obtained from experimental databases (for the energies) and from CCSD calculations (for the electron density profiles).

The XCNN is trained using the RAdam [12] algorithm with a constant learning rate of $10^{-4}$. The weight of the atomization energy loss is set to 1340, while the weights for the density profile and for the ionization energy (for XCNN-PBE-IP and XCNN-LDA-IP) are 170 and 440, respectively. As the density $n$ and the normalized density gradient $s$ can take values from 0 up to very large numbers, these inputs are renormalized by applying $\log(1 + x)$, i.e. $n \leftarrow \log(1 + n)$ and $s \leftarrow \log(1 + s)$.
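A sketch of Eq. (28) with the weights quoted above is given here; the prediction and target tensors are placeholders that, in the real pipeline, come from the differentiable KS-DFT calculation and from the reference databases.

```python
import torch

w_ae, w_ip, w_dp = 1340.0, 440.0, 170.0   # loss weights from the text

def xcnn_loss(ae_pred, ae_tgt, ip_pred, ip_tgt, n_pred, n_tgt, grid_w):
    """Eq. (28). Assumed shapes: ae_* (N_ae,), ip_* (N_ip,),
    n_* (N_dp, ngrid), grid_w (ngrid,) quadrature weights."""
    loss_ae = w_ae * ((ae_tgt - ae_pred) ** 2).mean()
    loss_ip = w_ip * ((ip_tgt - ip_pred) ** 2).mean()
    # density term: (1/N_dp) sum_i int [n_hat_i(r) - n_i(r)]^2 dr by quadrature
    loss_dp = w_dp * ((grid_w * (n_tgt - n_pred) ** 2).sum(-1)).mean()
    return loss_ae + loss_ip + loss_dp

# The optimizer would then be, e.g. (learning rate as quoted in the text):
# opt = torch.optim.RAdam(xcnn.parameters(), lr=1e-4)
```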
III. MOLECULES IN THE TEST DATASET

The 110 molecules in the test dataset, as well as the subsets they belong to, are listed below.

1. LiH (Lithium hydride): Others-1
2. BeH (Beryllium monohydride): Others-1
3. CH (Methylidyne): HC
4. CH₂ (Methylene): HC
5. CH₃ (Methyl radical): HC
6. CH₄ (Methane): HC
7. NH (Imidogen): Others-1
8. NH₂ (Amino radical): Others-1
9. NH₃ (Ammonia): Others-1
10. OH (Hydroxyl radical): Others-1
11. H₂O (Water): Others-1
12. HF (Hydrogen fluoride): Others-1
13. SiH₂ (Silicon dihydride): Others-2
14. SiH₃ (Silyl radical): Others-2
15. SiH₄ (Silane): Others-2
16. PH₂ (Phosphino radical): Others-2
17. PH₃ (Phosphine): Others-2
18. H₂S (Hydrogen sulfide): Others-2
19. HCl (Hydrogen chloride): Others-2
20. Li₂ (Lithium diatomic): Others-1
21. LiF (Lithium fluoride): Others-1
22. C₂H₂ (Acetylene): HC
23. C₂H₄ (Ethylene): HC
24. C₂H₆ (Ethane): HC
25. CN (Cyano radical): Others-1
26. HCN (Hydrogen cyanide): Others-1
27. CO (Carbon monoxide): Others-1
28. HCO (Formyl radical): Others-1
29. H₂CO (Formaldehyde): Others-1
30. CH₃OH (Methyl alcohol): Subs HC
31. N₂ (Nitrogen diatomic): Others-1
32. N₂H₄ (Hydrazine): Others-1
33. NO (Nitric oxide): Others-1
34. O₂ (Oxygen diatomic): Others-1
35. H₂O₂ (Hydrogen peroxide): Others-1
36. F₂ (Fluorine diatomic): Others-1
37. CO₂ (Carbon dioxide): Others-1
38. Na₂ (Disodium): Others-2
39. Si₂ (Silicon diatomic): Others-2
40. P₂ (Phosphorus diatomic): Others-2
41. S₂ (Sulfur diatomic): Others-2
42. Cl₂ (Chlorine diatomic): Others-2
43. NaCl (Sodium chloride): Others-2
44. SiO (Silicon monoxide): Others-2
45. CS (Carbon monosulfide): Others-2
46. SO (Sulfur monoxide): Others-2
47. ClO (Monochlorine monoxide): Others-2
48. ClF (Chlorine monofluoride): Others-2
49. Si₂H₆ (Disilane): Others-2
50. CH₃Cl (Methyl chloride): Subs HC
51. CH₃SH (Methanethiol): Subs HC
52. HOCl (Hypochlorous acid): Others-2
53. SO₂ (Sulfur dioxide): Others-2
54. BF₃ (Borane, trifluoro-): Others-1
55. BCl₃ (Borane, trichloro-): Others-2
56. AlF₃ (Aluminum trifluoride): Others-2
57. AlCl₃ (Aluminum trichloride): Others-2
58. CF₄ (Carbon tetrafluoride): Others-1
59. CCl₄ (Carbon tetrachloride): Others-2
60. OCS (Carbonyl sulfide): Others-2
61. CS₂ (Carbon disulfide): Others-2
62. CF₂O (Carbonic difluoride): Others-1
63. SiF₄ (Silicon tetrafluoride): Others-2
64. SiCl₄ (Silane, tetrachloro-): Others-2
65. N₂O (Nitrous oxide): Others-1
66. ClNO (Nitrosyl chloride): Others-2
67. NF₃ (Nitrogen trifluoride): Others-1
68. O₃ (Ozone): Others-1
69. F₂O (Difluorine monoxide): Others-1
70. C₂F₄ (Tetrafluoroethylene): Others-1
71. CF₃CN (Acetonitrile, trifluoro-): Others-1
72. CH₃CCH (Propyne): HC
73. CH₂CCH₂ (Allene): HC
74. C₃H₄ (Cyclopropene): HC
75. C₃H₆ (Cyclopropane): HC
76. C₃H₈ (Propane): HC
77. CH₃CCCH₃ (2-Butyne): HC
78. C₄H₆ (Methylenecyclopropane): HC
79. C₄H₆ (Cyclobutene): HC
80. CH₃CH(CH₃)CH₃ (Isobutane): HC
81. C₆H₆ (Benzene): HC
82. CH₂F₂ (Methane, difluoro-): Subs HC
83. CHF₃ (Methane, trifluoro-): Subs HC
84. CH₂Cl₂ (Methylene chloride): Subs HC
85. CHCl₃ (Chloroform): Subs HC
86. CH₃CN (Acetonitrile): Subs HC
87. HCOOH (Formic acid): Subs HC
88. CH₃CONH₂ (Acetamide): Subs HC
89. C₂H₅N (Aziridine): Subs HC
90. C₂N₂ (Cyanogen): Subs HC
91. CH₃NHCH₃ (Dimethylamine): Subs HC
92. CH₂CO (Ketene): Subs HC
93. C₂H₄O (Ethylene oxide): Subs HC
94. C₂H₂O₂ (Ethanedial): Subs HC
95. CH₃CH₂OH (Ethanol): Subs HC
96. CH₃OCH₃ (Dimethyl ether): Subs HC
97. C₂H₄O (Ethylene oxide): Subs HC
98. CH₃CH₂SH (Ethanethiol): Subs HC
99. CH₂CHF (Ethene, fluoro-): Subs HC
100. CH₃CH₂Cl (Ethyl chloride): Subs HC
101. CH₃COF (Acetyl fluoride): Subs HC
102. CH₃COCl (Acetyl chloride): Subs HC
103. CH₂ClCH₂CH₃ (Propane, 1-chloro-): Subs HC
104. C₄H₄O (Furan): Subs HC
105. C₄H₅N (Pyrrole): Subs HC
106. H₂ (Hydrogen diatomic): Others-1
107. HS (Mercapto radical): Others-2
108. C₂H (Ethynyl radical): HC
109. CH₃O (Methoxy radical): Subs HC
110. NO₂ (Nitrogen dioxide): Others-1

[1] Henk A Van der Vorst. Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing, 13(2):631–644, 1992.
[2] Muhammad F Kasim and Sam M Vinko. ξ-torch: differentiable scientific computing library. arXiv preprint arXiv:2010.01921, 2020.
[3] Qiming Sun, Timothy C Berkelbach, Nick S Blunt, George H Booth, Sheng Guo, Zhendong Li, Junzi Liu, James D McClain, Elvira R Sayfutyarova, Sandeep Sharma, et al. PySCF: the Python-based simulations of chemistry framework. Wiley Interdisciplinary Reviews: Computational Molecular Science, 8(1):e1340, 2018.
[4] Qiming Sun. Libcint: An efficient general integral library for Gaussian basis functions. Journal of Computational Chemistry, 36(22):1664–1671, 2015.
[5] Walter Kohn and Lu Jeu Sham. Self-consistent equations including exchange and correlation effects. Physical Review, 140(4A):A1133, 1965.
[6] John P Perdew, Kieron Burke, and Matthias Ernzerhof. Generalized gradient approximation made simple. Physical Review Letters, 77(18):3865, 1996.
[7] Miguel AL Marques, Micael JT Oliveira, and Tobias Burnus. Libxc: A library of exchange and correlation functionals for density functional theory. Computer Physics Communications, 183(10):2272–2281, 2012.
[8] Saswata Dasgupta and John M Herbert. Standard grids for high-precision integration of modern density functionals: SG-2 and SG-3. Journal of Computational Chemistry, 38(12):869–882, 2017.
[9] Axel D Becke. A multicenter numerical integration scheme for polyatomic molecules. The Journal of Chemical Physics, 88(4):2547–2553, 1988.
[10] Richard M Martin. Electronic Structure: Basic Theory and Practical Methods. Cambridge University Press, 2020.
[11] Muhammad Firmansyah Kasim. Derivatives of partial eigendecomposition of a real symmetric matrix for degenerate cases. arXiv preprint arXiv:2011.04366, 2020.
[12] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265, 2019.