Kohn-Sham equations as regularizer: building prior knowledge into machine-learned physics
Li Li, Stephan Hoyer, Ryan Pederson, Ruoxi Sun, Ekin D. Cubuk, Patrick Riley, Kieron Burke
Google Research, Mountain View, CA 94043, USA
Department of Physics and Astronomy, University of California, Irvine, CA 92697, USA
Department of Chemistry, University of California, Irvine, CA 92697, USA
(Dated: November 19, 2020)

Including prior knowledge is important for effective machine learning models in physics, and is usually achieved by explicitly adding loss terms or constraints on model architectures. Prior knowledge embedded in the physics computation itself rarely draws attention. We show that solving the Kohn-Sham equations when training neural networks for the exchange-correlation functional provides an implicit regularization that greatly improves generalization. Two separations suffice for learning the entire one-dimensional H$_2$ dissociation curve within chemical accuracy, including the strongly correlated region. Our models also generalize to unseen types of molecules and overcome self-interaction error.

Differentiable programming [1] is a general paradigm of deep learning, in which the parameters of a computation flow are trained by gradient-based optimization. Building on the rapid development of automatic differentiation libraries [2–5], hardware accelerators [6] and deep learning [7], this emerging paradigm is increasingly relevant for scientific computing. It can embed strong physics prior knowledge and well-established numerical methods [8], while parameterizing the unknown parts of the computation with neural networks, which can approximate any continuous function [9]. Recent highlights include discretizing partial differential equations [10], structural optimization [11], sampling equilibrium configurations [12], differentiable molecular dynamics [13], differentiable programming tensor networks [14], optimizing basis sets in Hartree-Fock [15], and variational quantum Monte Carlo [16–18].

Density functional theory (DFT), an approach to electronic structure problems, took an enormous step forward with the creation of the Kohn-Sham (KS) equations [19], which greatly improved accuracy over the original DFT [20–22]. The results of solving the KS equations are reported in tens of thousands of papers each year [23]. Given an approximation to the exchange-correlation (XC) energy, the KS equations are solved self-consistently. Results are limited by the quality of such approximations, and a standard problem of KS-DFT is the calculation of accurate bond dissociation curves [24]. The difficulties are an example of strong correlation physics, as electrons localize on separate nuclei [25].

Naturally, there has been considerable interest in using machine learning (ML) methods to improve DFT approximations. Initial work [26, 27] focused on the KS kinetic energy, as a sufficiently accurate approximation would allow bypassing the solution of the KS equations [28, 29]. For XC, recent works focus on learning the XC potential (not the functional) from inverse KS [30] and using it in the KS-DFT scheme [31–34]. An important step forward was made last year, when it was shown that a neural network could find functionals using only three molecules, by training on both energies and densities [35], obtaining accuracy comparable to human-designed functionals and generalizing to yield accurate atomization energies of 148 small molecules [36]. But this pioneering work does not yield chemical accuracy, nor approximations that work in the dissociation limit.
Moreover, it uses gradient-free optimization, which usually suffers from poor convergence on the large number of parameters used in modern neural networks [37–39].

FIG. 1. One-dimensional H$_2$ dissociation curves for several ML models trained on two molecules (red diamonds), with optimal models (highlighted in color) selected by the validation molecule at R = 3 (black triangles). The top panels show the energy $E + E_{\rm nn}$ (with $E_{\rm nn}$ the nucleus-nucleus repulsion energy), with exact values shown by the black dashed line. The bottom panels show the difference from the exact curves, with chemical accuracy as the grey shadow. (a) Direct ML predicts E from geometries and clearly fails to capture the physics from very limited data. (b-d) Our method (KSR) with different inputs to the model, aligned with the first two rungs of Jacob's ladder [40] (LDA and GGA) and then global (a fully non-local functional). The uniform gas LDA [41] is shown in brown. Grey lines denote 15 sampled functionals during training, with darker lines denoting later samples. Atomic units are used throughout.

Here, we show that all these limitations are overcome by incorporating the KS equations themselves into the neural network training, by backpropagating through their iterations: a
KS regularizer (KSR) for the ML model. In a traditional KS calculation, the XC is given, the equations are cycled to self-consistency, and all previous iterations are ignored in the final answer. In other ML work, functionals are trained on energies alone [42–45], or even on densities [32, 33, 46], but only after convergence. By incorporating the KS equations into the training, thereby learning the relation between density and energy at every iteration, we find accurate models with very little data and much greater generalizability.

Our results are illustrated in Figure 1, which is for a one-dimensional mimic of H$_2$ designed for testing electronic structure methods [41]. The distribution of curves of the ML model that directly predicts E from geometries (direct ML) in (a) clearly fails to capture the physics. Next we demonstrate KSR with neural XC functionals from the first two rungs of Jacob's ladder [40], by constraining the receptive field of the convolutional neural network [47]. The local density approximation (LDA) has a receptive field of just the current point, while the generalized gradient approximation (GGA) includes the nearest-neighbor points, the minimal information for computing the spatial gradient of the density. In (b-c), KSR yields reasonably accurate results in the vicinity of the data, but not beyond; KSR-LDA behaves similarly to the uniform gas LDA [41]. When an XC functional with a global receptive field is used in (d), chemical accuracy is achieved for all separations, including the dissociation limit. Similar results can be achieved for H$_4$, the one-electron self-interaction error can easily be made to vanish, and the interaction of a pair of H$_2$ molecules can be found without any training on this type of molecule (discussed below).

Modern DFT finds the ground-state electronic density by solving the Kohn-Sham equations,
$$\Big\{-\tfrac{1}{2}\nabla^2 + v_{\rm S}[n](r)\Big\}\,\phi_i(r) = \epsilon_i\,\phi_i(r). \quad (1)$$
The density is obtained from the occupied orbitals, $n(r) = \sum_i |\phi_i(r)|^2$. Here $v_{\rm S}[n](r) = v(r) + v_{\rm H}[n](r) + v_{\rm XC}[n](r)$ is the KS potential, consisting of the external one-body potential and the density-dependent Hartree (H) and XC potentials. The XC potential $v_{\rm XC}[n](r) = \delta E_{\rm XC}/\delta n(r)$ is the functional derivative of the XC energy functional $E_{\rm XC}[n] = \int \epsilon_{\rm XC}[n](r)\, n(r)\, dr$, where $\epsilon_{\rm XC}[n](r)$ is the XC energy per electron. The total electronic energy E is then the sum of the non-interacting kinetic energy $T_{\rm s}[n]$, the external one-body potential energy $V[n]$, the Hartree energy $U[n]$, and the XC energy $E_{\rm XC}[n]$.

The KS equations are in principle exact, given the exact XC functional [19, 48], which in practice is the only term approximated in DFT. From a computational perspective, the eigenvalue problem of Eq. (1) is solved repeatedly, starting from an initial guess, until the density converges to a fixed point. We use linear density mixing [49] to improve convergence, $n^{(\rm in)}_{k+1} = n^{(\rm in)}_k + \alpha\,\big(n^{(\rm out)}_k - n^{(\rm in)}_k\big)$. Figure 2(a) shows the unrolled computation flow. We approximate the XC energy per electron by a neural network $\epsilon_{\rm XC,\theta}[n]$, where $\theta$ represents the trainable parameters. Together with the self-consistent iterations in Figure 2(b), the combined computational graph resembles a recurrent neural network [50] or a deep equilibrium model [51] with additional fixed computational components.
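To make the unrolled structure of Figure 2(a) concrete, the following sketch lays out the KS self-consistent loop with linear density mixing as a differentiable program in JAX. The helper single_ks_iteration (which would build $v_{\rm S}[n]$, solve Eq. (1) and refill the orbitals, calling the neural XC functional inside) is passed in as an argument and is an assumption of this sketch; the names do not reproduce the released JAX-DFT API.

import jax.numpy as jnp

def ks_self_consistent_loop(single_ks_iteration, initial_density,
                            num_iterations, alpha):
  """Unrolled KS self-consistent calculation with linear density mixing.

  single_ks_iteration: callable density -> (output_density, total_energy),
    assumed differentiable (the neural XC functional is called inside it).
  Returns the final mixed density and the energies of all iterations,
  so a loss can supervise the whole trajectory.
  """
  density = initial_density
  energies = []
  for _ in range(num_iterations):
    density_out, energy = single_ks_iteration(density)
    energies.append(energy)
    # Linear density mixing, n_{k+1} = n_k + alpha * (n_out - n_k),
    # which plays the role of a residual connection for backpropagation.
    density = density + alpha * (density_out - density)
  return density, jnp.stack(energies)

Because the whole loop is built from differentiable operations, gradients of a loss on the final density or on the energy trajectory flow back through every iteration.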
Density mixing improves the convergence of KS self-consistent calculations and parallels the now-common residual connections in deep neural networks [52] for efficient backpropagation.

If the neural XC functional were exact, the KS self-consistent calculation would output the exact density, and the intermediate energies over the iterations would converge to the exact energy. This intention can be translated into a loss function, and the neural XC functional can be updated end-to-end by backpropagating through the KS self-consistent calculations. Throughout, experiments are performed in one dimension, where accurate quantum solutions can be generated relatively easily via the density matrix renormalization group (DMRG) [53]. The electron-electron repulsion is $A \exp(-\kappa|x - x'|)$, and the attraction to a nucleus at $x = 0$ is $-A \exp(-\kappa|x|)$ [47].

FIG. 2. KS-DFT as a differentiable program. Black arrows are the conventional computation flow; the gradients flow along red dashed arrows to minimize the energy loss $L_E$ and density loss $L_n$. (a) The high-level KS self-consistent calculation with linear density mixing (purple diamonds). (b) A single KS iteration produces $v_{\rm XC,\theta}[n]$ and $E_{\rm XC,\theta}[n]$ by invoking the XC energy calculation twice, once directly and once calculating a derivative using automatic differentiation. (c) The XC energy calculation using the global XC functional.

We design the loss function as an expectation $\mathbb{E}$ over the training molecules,
$$L(\theta) = \underbrace{\mathbb{E}_{\rm train}\Big[\textstyle\int dx\,(n_{\rm KS} - n_{\rm DMRG})^2/N_e\Big]}_{\text{density loss } L_n} + \underbrace{\mathbb{E}_{\rm train}\Big[\textstyle\sum_{k=1}^{K} w_k\,(E_k - E_{\rm DMRG})^2/N_e\Big]}_{\text{energy loss } L_E}, \quad (2)$$
where $N_e$ is the number of electrons and the $w_k$ are non-negative weights. $L_n$ minimizes the difference between the final density and the exact density, and its gradient backpropagates through $v_{\rm XC,\theta}[n]$ in all KS iterations. However, if $L_E$ only optimized the final energy, no gradient would flow through $E_{\rm XC,\theta}[n]$ except at the final iteration. To make backpropagation more efficient for $E_{\rm XC,\theta}[n]$, $L_E$ optimizes the trajectory of energies over all iterations, which sends gradients directly to early iterations [54]. This makes the neural XC functional output an accurate $\epsilon_{\rm XC}$ at each iteration, and also drives the iterations to converge quickly to the exact energy. The optimal model is selected by the minimal mean absolute energy error per electron on the validation set.
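As a rough translation of Eq. (2) into code, the sketch below combines the density loss with the trajectory energy loss for a single training molecule. It assumes a uniform grid spacing dx and that the energies of all K iterations have been collected, as in the loop above; the argument names are illustrative.

import jax.numpy as jnp

def ksr_loss(final_density, energies, target_density, target_energy,
             weights, num_electrons, dx):
  """Sketch of the KSR training loss of Eq. (2) for one molecule.

  energies: energies E_k from every KS iteration (the trajectory).
  weights: non-negative w_k emphasizing the later iterations.
  """
  # Density loss: discretized integral of the squared density error per electron.
  loss_n = jnp.sum((final_density - target_density) ** 2) * dx / num_electrons
  # Energy loss: weighted squared error over the whole energy trajectory.
  loss_e = jnp.sum(weights * (energies - target_energy) ** 2) / num_electrons
  return loss_n + loss_e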
Hundreds of useful XC functional approximations have been proposed [55]. Researchers typically design the symbolic form from physics intuition, with some (or no) fitting parameters. Here we build a neural XC functional from several differentiable components with physics intuition tailored for XC, shown in Figure 2(c). A global convolution layer captures the long-range interaction, $G(n(x), \xi_p) = \frac{1}{2\xi_p}\int dx'\, n(x') \exp(-|x - x'|/\xi_p)$. Two special cases retrieve known physical quantities: the Hartree energy density, $G(n(x), \kappa^{-1}) \propto \epsilon_{\rm H}$, and the electronic density itself, $G(n(x), 0) = n(x)$. The global convolution contains multiple channels, and the $\xi_p$ of each channel is trainable to capture interactions on different scales. Although the rectified linear unit [56] is popular, we use the sigmoid linear unit (SiLU) [57, 58], $f(x) = x/(1 + \exp(-x))$, because the infinite differentiability of SiLU guarantees the smoothness of $v_{\rm XC}$ (the first derivative) and of the second- and higher-order derivatives of the neural network used in the L-BFGS training [59]. We do not enforce a specific choice of $\epsilon_{\rm XC}$ (sometimes called a gauge [60]), but we do enforce some conditions, primarily to aid convergence of the algorithm. We require $\epsilon_{\rm XC}$ to vanish whenever the density does, and to be negative if at all possible. We achieve the former by using the linearity of SiLU near the origin and turning off the bias terms in the convolution layers. We softly impose the latter with a negative transform layer at the end, where a negative SiLU makes most output values negative. Finally, we design a self-interaction gate (SIG) that mixes in a portion of $-\epsilon_{\rm H}$ to cancel the self-interaction error, $\epsilon^{\rm (out)}_{\rm XC} = \epsilon^{\rm (in)}_{\rm XC}(1 - \beta) - \epsilon_{\rm H}\,\beta$. The portion is set by a gate function $\beta(N_e) = \exp\!\big(-(N_e - 1)^2/\sigma^2\big)$. When $N_e = 1$, $\epsilon^{\rm (out)}_{\rm XC} = -\epsilon_{\rm H}$. For more electrons, $\sigma$ can be fixed or adjusted by the training algorithm to decide the sensitivity to $N_e$. For H$_2$ as $R \to \infty$, $\epsilon_{\rm XC}$ tends to a superposition of the negative of the Hartree energy density at each nucleus, and approaches half of that for H$_2^+$.
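The two output layers specific to this architecture, the negative transform and the self-interaction gate, are small enough to sketch directly. The precise forms of the negative SiLU and of the Gaussian-like gate $\beta(N_e)$ below follow the description above but should be read as an illustration, not the exact released implementation.

import jax.numpy as jnp

def silu(x):
  # Sigmoid linear unit: infinitely differentiable and linear near the origin.
  return x / (1.0 + jnp.exp(-x))

def negative_transform(eps_xc):
  # Softly pushes the XC energy density toward negative values;
  # shown here as one simple realization of a "negative SiLU".
  return -silu(eps_xc)

def self_interaction_gate(eps_xc, eps_hartree, num_electrons, sigma):
  """Mixes in -eps_hartree to cancel one-electron self-interaction.

  beta -> 1 when num_electrons = 1, so eps_xc_out -> -eps_hartree exactly.
  """
  beta = jnp.exp(-((num_electrons - 1.0) ** 2) / sigma ** 2)
  return eps_xc * (1.0 - beta) - eps_hartree * beta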
FIG. 3. (a) t-SNE visualization [61] of density trajectories (grey dots) sampled by KSR during training for R = 3.84, from the initial guess (cross) to the exact density (red diamond). Darker trajectories denote later optimization steps t. Note that the t-SNE projection does not perfectly preserve the distance between densities. The light red ellipse illustrates the vicinity of the exact density, defined by a threshold on $\log\big(\int dx\,(n_{\rm KS} - n_{\rm DMRG})^2/N_e\big)$. (b-d) KS densities over the iterations at optimization steps (b) t = 0 (untrained), (c) t = 220 (optimal in Figure 1), and (d) t = 560 (overfitting to training, with bad generalization on validation).

Now we dive deeper into the outstanding generalization we observed in a simple but not easy task: predicting the entire H$_2$ dissociation curve, as shown in Figure 1. It is not surprising that the direct ML model completely fails. Neural networks are usually underdetermined systems, as there are more parameters than training examples. Regularization is crucial to improve generalization [62, 63], especially when data is limited. Most existing works regularize models with particular physics prior knowledge by imposing constraints via feature engineering and preprocessing [64, 65], architecture design [66–69], or physics-informed loss terms [70, 71]. Another strategy is to generate extra training data using prior knowledge: in image classification, data are augmented by operations like flipping and cropping, given the prior knowledge that labels are invariant to those operations [72]. KSR provides a natural data augmentation: although the exact densities and energies of only two separations are given, KSR samples different trajectories from an initial density to the exact density at each training step. More importantly, KSR focuses on learning an XC functional that leads the KS self-consistent calculation to converge to the exact density from the initial density. Figure 3 visualizes the density trajectories sampled by KSR for one training separation, R = 3.84.
The functional with untrained parameters (t = 0) samples densities near the initial guess, but soon learns to explore broadly and finds trajectories toward the vicinity of the exact density.

FIG. 4. Test generalization of models as a function of the total number of training examples N_train: full KSR (blue), energy only KSR (pink) and direct ML (orange) on (a) hold-out H$_2$ and H$_4$, and unseen types of molecules (b) H$_2^+$ and (c) H$_2$H$_2$. Black dashed lines show chemical accuracy. See the supplemental material [47] for training details.

In contrast, most existing ML functionals learn to predict the output of a single iteration from the exact density, which is a poor surrogate for the full self-consistent calculation [73]. These standard ML models have two major shortcomings. First, the exact density is unknown for new systems, so the model is not expected to behave correctly on the unseen initial densities of KS calculations. Second, even if a model is trained on many densities for single-iteration prediction, it is not guaranteed to converge the self-consistent calculation to a good solution [74]. On the other hand, since KSR gives the model access to all the KS iterations, it learns to optimize the entire self-consistent procedure, avoiding the error accumulation of greedily optimizing single iterations. Further comparisons for training neural XC functionals without KSR, or with "weaker" KSR, are in the supplemental material [47].

Next we retrain our neural XC functional with KSR on training sets containing both H$_2$ and H$_4$ molecules. Figure 4 shows the prediction accuracy of KSR with both energy and density loss (full KSR), in comparison to KSR with only the energy loss (energy only KSR) and the direct ML model. We compute the energy mean absolute error on hold-out sets of H$_2$ and H$_4$; the test error for various N_train is shown in Figure 4(a). Full KSR has the lowest error at the minimum N_train = 4, reaching chemical accuracy at 6. As the size of the training set increases, energy only KSR reaches chemical accuracy at N_train = 10, but the direct ML model never does (even at 20). Then we test the models on unseen types of molecules. In Figure 4(b), both KSR models give essentially perfect predictions on H$_2^+$. In Figure 4(c), we take two H$_2$ molecules and separate them by R = 0.16 to 9.76 Bohr, denoted as H$_2$H$_2$.

FIG. 5. Density and KS potential of H$_4$ with R = 4.32 from neural XC functionals trained with (a) full KSR (blue) and (b) energy only KSR (pink) on a training set of size N_train = 20. Exact curves are in grey. The $v_{\rm S}$ are shifted by a constant for better comparison.

KSR models generalize much better than direct ML for "zero-shot" prediction [75], where H$_2$H$_2$ has never been exposed to the model during training.

Why is the density important in training, and what use are the non-converged iterations? The density is the functional derivative of the energy with respect to the potential, so it gives the exact slope of the energy with respect to any change in the potential, including stretching (or compressing) the bond. Thus the density implicitly contains energetic information, including the correct derivative at that point on the binding curve. KS iterations produce information about the functional in the vicinity of the minimum. During training, the network learns to construct a functional with both the correct minimum and all the correct derivatives at this minimum. In the paradigm of differentiable programming, the density is the hidden state carrying information through the recurrent structure in Figure 2(a). Correct supervision from $L_n$ greatly helps generalization from very limited data (the small-N_train region of Figure 4); as N_train increases, KSR both with and without $L_n$ performs well in energy prediction. We show the solution of H$_4$ with R = 4.32 in Figure 5.
With $L_n$, the density is clearly more accurate than without $L_n$. We then compute the corresponding exact $v_{\rm S}$ using the inverse KS method [30]. Neither functional reproduces the exact $v_{\rm S}$ exactly; however, the functional trained with $L_n$ recovers most of the KS potential. Unlike previous works [32–34] that explicitly included the KS or XC potential in the loss function, our model never uses the exact KS potential. In our KSR setup, the model aims at predicting $\epsilon_{\rm XC}$, from which the derived $v_{\rm S}$ yields an accurate density; predicting $v_{\rm XC}$ is a side product. We also note some concerns with training explicitly on $v_{\rm XC}$. One artifact is that generating the exact $v_{\rm S}$ requires an additional inverse calculation, which is known to be numerically unstable [30]; Schmidt et al. [32] observe outliers while generating training $v_{\rm XC}$ from inverse KS. While $v_{\rm XC}$ is a fascinating and useful object for theoretical study, because its relation to the density is extremely delicate, it is far more practical to simply train on the density [35].

Differentiable programming blurs the boundary between physics computation and ML. Our results for KS-DFT serve as a proof of principle for rethinking computational physics in this new paradigm. Although there is no explicit limitation of our algorithm to one dimension, we expect practical challenges with real molecules, which will require rewriting or extending a mature DFT code to support automatic differentiation. For example, our differentiable eigensolver for dense matrices [76] is not suitable for large problems and will need to be replaced with methods for the partial eigendecomposition of sparse matrices [77, 78]. Beyond density functionals, in principle all heuristics in DFT calculations, e.g., the initial guess, density updates, preconditioning, basis sets, even the entire self-consistent calculation as a meta-optimization problem [54], could be learned and optimized while maintaining rigorous physics, getting the best of both worlds.

The authors thank Michael Brenner, Sam Schoenholz, Lucas Wagner, and Hanjun Dai for helpful discussions. K.B. was supported by NSF CHE-1856165, and R.P. by DOE DE-SC0008696. Code is available at https://github.com/google-research/google-research/tree/master/jax_dft

∗ email: [email protected]

[1] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind, The Journal of Machine Learning Research, 5595 (2017).
[2] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, and S. Wanderman-Milne, "JAX: composable transformations of Python+NumPy programs," http://github.com/google/jax (2018).
[3] M. Innes, E. Saba, K. Fischer, D. Gandhi, M. C. Rudilosso, N. M. Joy, T. Karmali, A. Pal, and V. Shah, CoRR (2018), arXiv:1811.01457.
[4] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., in Advances in Neural Information Processing Systems (2019) pp. 8026–8037.
[5] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X.
Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,”(2015), software available from tensorflow.org.[6] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden,A. Borchers, et al. , in
Proceedings of the 44th Annual In-ternational Symposium on Computer Architecture (2017)pp. 1–12.[7] Y. LeCun, Y. Bengio, and G. Hinton, nature , 436(2015).[8] M. Innes, A. Edelman, K. Fischer, C. Rackauckas,E. Saba, V. B. Shah, and W. Tebbutt, (2019),arXiv:1907.07587 [cs.PL].[9] K. Hornik, Neural networks , 251 (1991).[10] Y. Bar-Sinai, S. Hoyer, J. Hickey, and M. P. Bren-ner, Proceedings of the National Academy of Sciences, 201814058 (2019).[11] S. Hoyer, J. Sohl-Dickstein, and S. Greydanus, arXivpreprint arXiv:1909.04240 (2019).[12] F. No´e, S. Olsson, J. K¨ohler, and H. Wu, Science ,eaaw1147 (2019).[13] S. S. Schoenholz and E. D. Cubuk, “JAX M.D.: End-to-end differentiable, hardware accelerated, molecular dy-namics in pure python,” https://github.com/google/jax-md , https://arxiv.org/abs/1912.04232 (2019),arXiv:1912.04232 [stat.ML].[14] H.-J. Liao, J.-G. Liu, L. Wang, and T. Xiang, PhysicalReview X , 031041 (2019).[15] T. Tamayo-Mendoza, C. Kreisbeck, R. Lindh, andA. Aspuru-Guzik, ACS Cent Sci , 559 (2018).[16] J. Hermann, Z. Sch¨atzle, and F. No´e, Nature Chemistry, 1 (2020).[17] D. Pfau, J. S. Spencer, A. G. de G. Matthews,and W. M. C. Foulkes, (2019), arXiv:1909.02487[physics.chem-ph].[18] L. Yang, Z. Leng, G. Yu, A. Patel, W.-J. Hu, and H. Pu,Phys. Rev. Research , 012039 (2020).[19] W. Kohn and L. J. Sham, Physical review , A1133(1965).[20] P. Hohenberg and W. Kohn, Physical review , B864(1964).[21] L. H. Thomas, in Mathematical Proceedings of the Cam-bridge Philosophical Society , Vol. 23 (Cambridge Univer-sity Press, 1927) pp. 542–548.[22] E. Fermi, Rend. Accad. Naz. Lincei , 5 (1927).[23] R. O. Jones, Reviews of modern physics , 897 (2015).[24] E. M. Stoudenmire, L. O. Wagner, S. R. White, andK. Burke, Phys. Rev. Lett. , 056402 (2012).[25] A. J. Cohen, P. Mori-S´anchez, and W. Yang, Science , 792 (2008).[26] L. Li, J. C. Snyder, I. M. Pelaschier, J. Huang, U.-N. Niranjan, P. Duncan, M. Rupp, K.-R. M¨uller, andK. Burke, International Journal of Quantum Chemistry , 819 (2016).[27] J. C. Snyder, M. Rupp, K. Hansen, K.-R. M¨uller, andK. Burke, Physical review letters , 253002 (2012).[28] F. Brockherde, L. Vogt, L. Li, M. E. Tuckerman,K. Burke, and K.-R. M¨uller, Nature communications , 1 (2017).[29] L. Li, T. E. Baker, S. R. White, K. Burke, et al. , PhysicalReview B , 245129 (2016).[30] D. S. Jensen and A. Wasserman, International Journal ofQuantum Chemistry , e25425 (2018).[31] D. J. Tozer, V. E. Ingamells, and N. C. Handy, TheJournal of Chemical Physics , 9200 (1996).[32] J. Schmidt, C. L. Benavides-Riveros, and M. A. Mar-ques, The journal of physical chemistry letters , 6425(2019). [33] Y. Zhou, J. Wu, S. Chen, and G. Chen, The journal ofphysical chemistry letters , 7264 (2019).[34] R. Nagai, R. Akashi, S. Sasaki, and S. Tsuneyuki, TheJournal of chemical physics , 241737 (2018).[35] R. Nagai, R. Akashi, and O. Sugino, npj ComputationalMaterials , 1 (2020).[36] L. A. Curtiss, K. Raghavachari, G. W. Trucks, and J. A.Pople, The Journal of chemical physics , 7221 (1991).[37] J. C. Duchi, M. I. Jordan, M. J. Wainwright, andA. Wibisono, IEEE Transactions on Information Theory , 2788 (2015).[38] N. Maheswaranathan, L. Metz, G. Tucker, D. Choi, andJ. Sohl-Dickstein, in International Conference on Ma-chine Learning (PMLR, 2019) pp. 4264–4273.[39] L. M. Rios and N. V. Sahinidis, Journal of Global Opti-mization , 1247 (2013).[40] J. P. Perdew, A. Ruzsinszky, J. Tao, V. N. Staroverov,G. E. Scuseria, and G. I. Csonka, The Journal of chem-ical physics , 062201 (2005).[41] T. E. Baker, E. M. 
Stoudenmire, L. O. Wagner, K. Burke,and S. R. White, Physical Review B , 235141 (2015).[42] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, andG. E. Dahl, in Proceedings of the 34th International Con-ference on Machine Learning-Volume 70 (JMLR. org,2017) pp. 1263–1272.[43] K. Sch¨utt, P.-J. Kindermans, H. E. S. Felix, S. Chmiela,A. Tkatchenko, and K.-R. M¨uller, in
Advances in neuralinformation processing systems (2017) pp. 991–1001.[44] J. Behler and M. Parrinello, Physical review letters ,146401 (2007).[45] M. Rupp, A. Tkatchenko, K.-R. M¨uller, and O. A. vonLilienfeld, Physical review letters , 058301 (2012).[46] J. R. Moreno, G. Carleo, and A. Georges, Physical Re-view Letters , 076402 (2020).[47] See Supplemental Material below for information about1d model systems; computational details of DMRG; KScalculations; training, validation and test; neural net-works; training without KSR and training with weakerKSR.[48] L. O. Wagner, E. M. Stoudenmire, K. Burke, and S. R.White, Physical review letters , 093003 (2013).[49] G. Kresse and J. Furthm¨uller, Physical review B ,11169 (1996).[50] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning internal representations by error propagation ,Tech. Rep. (California Univ San Diego La Jolla Inst forCognitive Science, 1985).[51] S. Bai, J. Z. Kolter, and V. Koltun, in
Advances in Neu-ral Information Processing Systems (NeurIPS) (2019).[52] K. He, X. Zhang, S. Ren, and J. Sun, in
Proceedingsof the IEEE conference on computer vision and patternrecognition (2016) pp. 770–778.[53] S. R. White, Physical review letters , 2863 (1992).[54] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. De Fre-itas, in Advances in neural information processing sys-tems (2016) pp. 3981–3989.[55] N. Mardirossian and M. Head-Gordon, Molecular Physics , 2315 (2017).[56] V. Nair and G. E. Hinton, in
ICML (2010).[57] S. Elfwing, E. Uchibe, and K. Doya, Neural Networks , 3 (2018).[58] P. Ramachandran, B. Zoph, and Q. V. Le, arXiv preprintarXiv:1710.05941 (2017).[59] D. C. Liu and J. Nocedal, Mathematical programming , 503 (1989).[60] J. P. Perdew, A. Ruzsinszky, J. Sun, and K. Burke, TheJournal of chemical physics , 18A533 (2014).[61] L. v. d. Maaten and G. Hinton, Journal of machine learn-ing research , 2579 (2008).[62] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learn-ing (MIT Press, 2016) .[63] J. Kukaˇcka, V. Golkov, and D. Cremers, arXiv preprintarXiv:1710.10686 (2017).[64] E. D. Cubuk, A. D. Sendek, and E. J. Reed, The Journalof chemical physics , 214701 (2019).[65] J. Hollingsworth, L. Li, T. E. Baker, and K. Burke, TheJournal of Chemical Physics , 241743 (2018).[66] N. Thomas, T. Smidt, S. Kearnes, L. Yang,L. Li, K. Kohlhoff, and P. Riley, arXiv preprintarXiv:1802.08219 (2018).[67] K. Sch¨utt, M. Gastegger, A. Tkatchenko, K.-R. M¨uller,and R. J. Maurer, Nature communications , 1 (2019).[68] R. Kondor, H. T. Son, H. Pan, B. Anderson, andS. Trivedi, arXiv preprint arXiv:1801.02144 (2018).[69] S. Seo and Y. Liu, arXiv preprint arXiv:1902.02950(2019).[70] M. Raissi, P. Perdikaris, and G. E. Karniadakis, Journalof Computational Physics , 686 (2019).[71] R. Sharma, A. B. Farimani, J. Gomes, P. Eastman, andV. Pande, arXiv preprint arXiv:1807.11374 (2018).[72] A. Krizhevsky, I. Sutskever, and G. E. Hinton, in Ad-vances in neural information processing systems (2012)pp. 1097–1105.[73] G. Tucker, A. Mnih, C. J. Maddison, J. Lawson, andJ. Sohl-Dickstein, in
Advances in Neural InformationProcessing Systems (2017) pp. 2627–2636.[74] S. Ross and D. Bagnell, in
Proceedings of the thirteenth international conference on artificial intelligence and statistics (2010) pp. 661–668.
[75] A. Mirhoseini, A. Goldie, M. Yazgan, J. Jiang, E. Songhori, S. Wang, Y.-J. Lee, E. Johnson, O. Pathak, S. Bae, et al., arXiv preprint arXiv:2004.10746 (2020).
[76] K. B. Petersen and M. S. Pedersen, "The matrix cookbook," (2012), version 20121115.
[77] T. H. Lee, AIAA Journal, 1998 (2007).
[78] H. Xie, J.-G. Liu, and L. Wang, Physical Review B, 245139 (2020).

Supplemental Material
Building prior knowledge into machine-learned physics: Kohn-Sham equations as a regularizer

Li Li (李力),∗ Stephan Hoyer, Ryan Pederson, Ruoxi Sun (孙若溪), Ekin D. Cubuk, Patrick Riley, and Kieron Burke
Google Research, Mountain View, CA 94043, USA
Department of Physics and Astronomy, University of California, Irvine, CA 92697, USA
Department of Chemistry, University of California, Irvine, CA 92697, USA
CONTENTS
I. 1D model systems
II. DMRG calculation details
III. KS calculation details
   A. Local Density Approximation
   B. Initial density
   C. Linear density mixing
   D. XC potential from automatic differentiation
   E. Symmetry
IV. Training, validation and test
   A. Weights in trajectory loss
   B. Number of KS iterations
   C. Dataset for learning H$_2$ dissociation from two molecules
   D. Dataset for learning and predicting several types of molecules
      1. Training molecules
      2. Validation molecules
      3. Test molecules
      4. Test errors
      5. Dissociation curves of H$_2$, H$_4$, H$_2^+$ and H$_2$H$_2$
V. Neural networks
   A. Architecture
   B. Layers
   C. Checkpoint selection
VI. Training a neural XC functional without KS regularization
VII. Training a neural XC functional with "weaker" Kohn-Sham regularization

I. 1D MODEL SYSTEMS
In 1D, we utilize exponential Coulomb interactions to mimic the standard 3D Coulomb potential,
$$v_{\rm exp}(x) = A \exp(-\kappa |x|), \quad ({\rm S1})$$
where A = 1.071295 and $\kappa^{-1}$ = 2.385345 [1]. A nucleus of charge Z at position $x_0$ is represented by $-Z\, v_{\rm exp}(x - x_0)$. The external potential for arbitrary molecular systems and geometries is modeled as
$$v(x) = -\sum_j Z_j\, v_{\rm exp}(x - x_j). \quad ({\rm S2})$$
For example, a 1D H$_2$ molecule at separation R = 4 can be represented by $v(x) = -v_{\rm exp}(x - 2) - v_{\rm exp}(x + 2)$. The repulsion between electrons at positions x and x' is given by the two-body potential $v_{\rm ee}(x - x') = v_{\rm exp}(x - x')$. We represent all systems on a 1D grid of m = 513 points, each separated by a distance h = 0.08. The center grid point is at the origin, x = 0, and the grid spans $x \in \{-20.48, \ldots, 20.48\}$. For consistency, all nuclear positions reside on grid points in the calculations. In this convention, all molecules in this work are symmetric either about the origin x = 0 or about x = 0.04, depending on the separation between the nuclei.
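A minimal sketch of the interaction and external potential on this grid is given below. The constants are the exponential-interaction parameters quoted above from Ref. [1]; the function and variable names are illustrative rather than the released library's API.

import jax.numpy as jnp

# Grid used throughout: 513 points with spacing 0.08, centered at the origin.
GRIDS = jnp.arange(-256, 257) * 0.08

def exponential_coulomb(x, amplitude=1.071295, kappa=1.0 / 2.385345):
  # v_exp(x) = A exp(-kappa |x|), the 1D mimic of the 3D Coulomb interaction.
  return amplitude * jnp.exp(-kappa * jnp.abs(x))

def external_potential(grids, nuclear_positions, charges):
  # v(x) = -sum_j Z_j v_exp(x - x_j), Eq. (S2).
  v = jnp.zeros_like(grids)
  for x_j, z_j in zip(nuclear_positions, charges):
    v = v - z_j * exponential_coulomb(grids - x_j)
  return v

# Example: 1D H2 at separation R = 4 (nuclei at x = -2 and x = +2).
v_h2 = external_potential(GRIDS, nuclear_positions=[-2.0, 2.0], charges=[1.0, 1.0])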
II. DMRG CALCULATION DETAILS
The real-space interacting Hamiltonian for a 1D system of lattice spacing h becomes, in second-quantized notation,
$$H = \frac{5}{4h^2}\sum_{j,\sigma} n_{j\sigma} - \frac{2}{3h^2}\sum_{\langle i,j\rangle,\sigma} c^{\dagger}_{i\sigma} c_{j\sigma} + \frac{1}{24h^2}\sum_{\langle\langle i,j\rangle\rangle,\sigma} c^{\dagger}_{i\sigma} c_{j\sigma} + \sum_j v(x_j)\, n_j + \sum_{ij} v_{\rm ee}(x_i - x_j)\, n_i n_j, \quad ({\rm S3})$$
where the operator $c^{\dagger}_{j\sigma}$ creates (and $c_{j\sigma}$ annihilates) an electron of spin $\sigma$ on site j, $n_{j\sigma} = c^{\dagger}_{j\sigma} c_{j\sigma}$, and $n_j = n_{j\uparrow} + n_{j\downarrow}$. The single and double angle brackets below the sums indicate sums over nearest and next-nearest neighbors, respectively. The hopping coefficients are determined by the fourth-order central finite-difference approximation to the second derivative. The Hamiltonian is solved using DMRG to obtain highly accurate ground-state energies and densities. Calculations are performed using the ITensor library [2] with a tight energy convergence threshold.
III. KS CALCULATION DETAILS

A. Local Density Approximation
In our 1D model the electron repulsion is an exponential interaction. To implement a local density approximation (LDA) for this interaction, we use Ref. [1], which provides the exchange energy of the exponentially repelling uniform gas analytically, together with an accurate parameterized model for the correlation energy. We use this specific implementation for all LDA calculations.
B. Initial density
We solve the Schrödinger equation of the non-interacting system with the external potential v(x) defined in Eq. (S2),
$$\Big\{-\tfrac{1}{2}\nabla^2 + v(x)\Big\}\,\phi_i(x) = \epsilon_i\,\phi_i(x). \quad ({\rm S4})$$
The density is the squared sum over all occupied orbitals, $n(x) = \sum_i |\phi_i(x)|^2$. In all the KS self-consistent calculations presented in this work, we use the density of the non-interacting system with external potential v(x) as the initial density.
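For illustration, Eq. (S4) can be solved on the grid by diagonalizing a finite-difference Hamiltonian. The sketch below uses a simple three-point stencil (the DMRG reference calculations above use a fourth-order stencil) and assumes an even number of electrons with doubly occupied orbitals; the function name is hypothetical.

import jax.numpy as jnp

def solve_noninteracting(v_ext, grids, num_electrons):
  """Solves Eq. (S4) on the grid and returns the initial density."""
  dx = grids[1] - grids[0]
  num_grids = grids.shape[0]
  # Kinetic energy -1/2 d^2/dx^2 with a three-point stencil, plus the potential.
  diag = 1.0 / dx ** 2 + v_ext
  off = jnp.full(num_grids - 1, -0.5 / dx ** 2)
  hamiltonian = jnp.diag(diag) + jnp.diag(off, 1) + jnp.diag(off, -1)
  _, eigenvectors = jnp.linalg.eigh(hamiltonian)
  # Columns are orbitals; normalize so that sum |phi|^2 dx = 1.
  orbitals = eigenvectors / jnp.sqrt(dx)
  # Doubly occupy the lowest num_electrons // 2 orbitals.
  occupations = jnp.zeros(num_grids).at[: num_electrons // 2].set(2.0)
  density = jnp.sum(occupations[None, :] * orbitals ** 2, axis=1)
  return density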
C. Linear density mixing

Linear density mixing is a well-known strategy to improve the convergence of the KS self-consistent calculation,
$$n^{(\rm in)}_{k+1} = n^{(\rm in)}_k + \alpha\,\big(n^{(\rm out)}_k - n^{(\rm in)}_k\big). \quad ({\rm S5})$$
In this work, we apply an exponential decay to the mixing factor $\alpha$ over the KS iterations.

D. XC potential from automatic differentiation
The XC functional $\epsilon_{\rm XC}[n]: \mathbb{R}^m \to \mathbb{R}^m$ is a mapping from the density $n \in \mathbb{R}^m$ to the XC energy density $\epsilon_{\rm XC} \in \mathbb{R}^m$, where m is the number of grid points. Instead of hand-deriving the functional derivative, we use automatic differentiation in JAX [3] to compute
$$v_{\rm XC} = \frac{\delta E_{\rm XC}}{\delta n} = \frac{\delta \int n\, \epsilon_{\rm XC}[n]\, dx}{\delta n}. \quad ({\rm S6})$$
The XC energy density $\epsilon_{\rm XC}[n]$ is denoted xc_energy_density_fn. It takes the density, a float array of size m, as its input argument and returns a float array of size m. This function can be a conventional physics XC functional (e.g., LDA) or a neural XC functional. The XC energy $E_{\rm XC}$ can be computed by the following function:

def get_xc_energy(density, xc_energy_density_fn, grids):
  """Gets the xc energy by discretizing the following integral:

  E_xc = \int density * xc_energy_density_fn(density) dx

  Args:
    density: Float numpy array with shape (num_grids,).
    xc_energy_density_fn: function that takes density and returns a float
        numpy array with shape (num_grids,).
    grids: Float numpy array with shape (num_grids,).

  Returns:
    Float.
  """
  return jnp.dot(xc_energy_density_fn(density), density) * utils.get_dx(grids)

Then the functional derivative $v_{\rm XC} = \delta E_{\rm XC}/\delta n$ can be computed via the jax.grad function:

def get_xc_potential(density, xc_energy_density_fn, grids):
  """Gets the xc potential.

  The xc potential is derived from xc_energy_density through automatic
  differentiation.

  Args:
    density: Float numpy array with shape (num_grids,).
    xc_energy_density_fn: function that takes density and returns a float
        numpy array with shape (num_grids,).
    grids: Float numpy array with shape (num_grids,).

  Returns:
    Float numpy array with shape (num_grids,).
  """
  return jax.grad(get_xc_energy)(
      density, xc_energy_density_fn, grids) / utils.get_dx(grids)

The jax.grad function computes the gradient of get_xc_energy with respect to its first argument, density, using automatic differentiation. Both functions can be found in the scf module of the JAX-DFT library.
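As a usage sketch, the two functions above can be exercised with any differentiable map from density to XC energy density. The toy functional below is made up purely to illustrate the call pattern and has no physical meaning; it assumes the definitions above (and the utils.get_dx helper of the JAX-DFT library) are in scope.

import jax.numpy as jnp

def toy_xc_energy_density(density):
  # A made-up local functional standing in for the neural network.
  return -0.5 * density ** 2

grids = jnp.arange(-256, 257) * 0.08
density = jnp.exp(-grids ** 2)

e_xc = get_xc_energy(density, toy_xc_energy_density, grids)
v_xc = get_xc_potential(density, toy_xc_energy_density, grids)
# e_xc is a scalar; v_xc is the XC potential on the same grid as the density.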
E. Symmetry
The training molecules used in this paper are symmetric about their centers. We define the symmetry operation on functions on the grid, $S: \mathbb{R}^m \to \mathbb{R}^m$: it flips a function at the center and averages it with itself. In each KS iteration, we enforce the symmetry on the XC functional, $\epsilon_{\rm XC}[n] \to S\big(\epsilon_{\rm XC}[S(n)]\big)$, so $E_{\rm XC}$ and $v_{\rm XC}$ are transformed as
$$E_{\rm XC} = \int n\,\epsilon_{\rm XC}[n]\,dx \;\to\; \int n\, S\big(\epsilon_{\rm XC}[S(n)]\big)\,dx, \quad ({\rm S7})$$
$$v_{\rm XC} = \frac{\delta \int n\,\epsilon_{\rm XC}[n]\,dx}{\delta n} \;\to\; \frac{\delta \int n\, S\big(\epsilon_{\rm XC}[S(n)]\big)\,dx}{\delta n}. \quad ({\rm S8})$$
Before the output of each KS iteration, $n^{(\rm out)}_k \to S\big(n^{(\rm out)}_k\big)$.

We note that applying the symmetry restriction does not change the model performance for molecules around equilibrium. However, because both stretched H$_2$ and H$_4$ have vanishing KS gaps, the KS self-consistent calculation is difficult to converge, and the gradient information for KSR relies on a stable output of the KS self-consistent calculation. In realistic 3D DFT codes, enforcing symmetry is one common strategy to improve the convergence of KS self-consistent calculations. We therefore apply the symmetry restriction so that KSR can obtain stable gradient information in the stretched-limit cases.
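A sketch of the symmetry operation S and its use on the XC energy density is below. It assumes the molecular center coincides with the center of the grid (molecules centered at x = 0.04 would need an additional one-point shift), and the function names are illustrative.

import jax.numpy as jnp

def symmetrize(f):
  # S: flips a grid function about the grid center and averages it with
  # itself, so the result is exactly mirror symmetric.
  return 0.5 * (f + jnp.flip(f))

def symmetric_xc_energy_density(density, xc_energy_density_fn):
  # Enforces eps_XC[n] -> S(eps_XC[S(n)]), as in Eqs. (S7)-(S8).
  return symmetrize(xc_energy_density_fn(symmetrize(density)))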
IV. TRAINING, VALIDATION AND TEST

A. Weights in trajectory loss

We use weights $w_k$ that decay geometrically with $K - k$ and are multiplied by a Heaviside function H that zeroes out the earliest iterations, where K is the total number of iterations.

B. Number of KS iterations
We first run KS calculations with the standard uniform gas LDA functional [1]. The numbers of iterations needed to converge the largest separation are around 8, 25, 5 and 6 for H$_2$, H$_4$, H$_2^+$ and H$_2$H$_2$, respectively. During the training of neural XC functionals with KSR, we use a fixed number of iterations that is greater than or equal to this LDA estimate for each type of molecule, so that it is sufficient for convergence. The numbers of iterations for the different molecules are listed in Table I.

TABLE I. Number of KS iterations for different molecules.

             H$_2$   H$_4$   H$_2^+$   H$_2$H$_2$
train          15      40       –          –
validation     15      40       –          –
test           15      40       5         10

C. Dataset for learning H$_2$ dissociation from two molecules

The H$_2$ dissociation curves in Figure 1 are trained from the exact densities and energies of two molecules: a compressed H$_2$ (R = 1.28) and a stretched H$_2$ (R = 3.84). The optimal checkpoint is selected using a validation molecule at an intermediate separation (black triangles in Figure 1).

D. Dataset for learning and predicting several types of molecules
1. Training molecules
The distances between nearby atoms for training molecules used in Figure 4 are listed in Table II.
2. Validation molecules
The distances between nearby atoms for validation molecules used in Figure 4 are listed in Table III.
TABLE II. Distances between nearby atoms for the H$_2$ and H$_4$ training molecules used in Figure 4, listed for each training set size N_train.

TABLE III. Distances between nearby atoms for the validation molecules used in Figure 4. The validation set is fixed for calculations with 4 ≤ N_train ≤ 20.
3. Test molecules
The distances between nearby atoms for test molecules used in Figure 4 are listed in Table IV.
TABLE IV. Distances between nearby atoms for the test molecules (H$_2$, H$_4$, H$_2^+$ and H$_2$H$_2$) used in Figure 4. The test set is fixed for calculations with 4 ≤ N_train ≤ 20.
4. Test errors
We extend the plot of test errors in Figure 4 to N train = 20 in Figure S1 and list all the numerical values in Table V.
5. Dissociation curves of H$_2$, H$_4$, H$_2^+$ and H$_2$H$_2$

Figure S2 shows the dissociation curves of H$_2$, H$_4$, H$_2^+$ and H$_2$H$_2$. The KSR-global curves (blue) are computed from the neural XC functional trained with full KSR at N_train = 8 (four H$_2$ molecules and four H$_4$ molecules) in Figure S1. KSR fits H$_2$ and H$_4$ well, even in the stretched limit. KSR predicts H$_2^+$ perfectly because of the self-interaction gate in the neural XC functional. Although H$_2$H$_2$ has never been exposed to the model during training, KSR performs well at small distances (R < 3) and at large distances (R > 8), but slightly overbinds around R = 5.

FIG. S1. Test generalization of ML models as a function of the total number of training examples N_train: full KSR (blue), energy only KSR (pink) and direct ML (orange) on (a) hold-out H$_2$ and H$_4$, and unseen types of molecules (b) H$_2^+$ and (c) H$_2$H$_2$. Black dashed lines show chemical accuracy. All the numerical values are listed in Table V.

TABLE V. Numerical values of the test errors plotted in Figure S1, for each model (full KSR with $L_n + L_E$, energy only KSR with $L_E$, and direct ML), each set of molecules (H$_2$ and H$_4$, H$_2^+$, H$_2$H$_2$), and each N_train.

V. NEURAL NETWORKS

A. Architecture
Figure S3 illustrates the model architectures used in Figure 1. In the ML model that directly predicts the energy from the geometry (Figure S3(a)), we first solve Eq. (S4) to obtain a density for a particular molecular geometry and use this density as a smooth representation of the geometry. The first few layers (global conv, conv, SiLU, conv, SiLU) are identical to KSR-global in Figure S3(d). Next, we use convolution layers with 128 channels and dense layers to increase the capacity of the model. Finally, a dense layer with a single unit outputs the scalar E.

The KSR-LDA and KSR-GGA approaches do not have the global convolution layer, because the use of global information would violate the local and semi-local approximations. The first layer of KSR-LDA is a convolution with filter size 1. It mimics the physics of the standard LDA approach by mapping the density value to the XC energy density at the same point, $\epsilon^{\rm LDA}_{\rm XC,\theta}: \mathbb{R} \to \mathbb{R}$. KSR-GGA uses a convolution layer with filter size 3 to map the density values of three nearby points to the XC energy density at the center point, $\epsilon^{\rm GGA}_{\rm XC,\theta}: \mathbb{R}^3 \to \mathbb{R}$; the XC energy density over the entire space is computed pointwise by applying this map to every triple of neighboring grid points, giving $\epsilon_{\rm XC} \in \mathbb{R}^m$. The remaining network structure of KSR-LDA and KSR-GGA is identical to KSR-global, except for the self-interaction gate (SIG).

FIG. S2. Dissociation curves of (a) H$_2$, (b) H$_4$, (c) H$_2^+$, and (d) H$_2$H$_2$. Dashed black lines are the exact curves. Dashed brown lines denote the results computed from the uniform gas LDA, and blue solid lines denote the results computed from KSR-global.

B. Layers
1. Convolution
Filter weights in the 1D convolutions are initialized by He normal initialization [4]. The stride is one. The edges are padded with zeros to ensure that the output spatial dimension has the same size as the unpadded input spatial dimension. There is no bias term.
2. Global convolution
The global convolution contains multiple channels to capture interactions on different scales. The operation in each channel is
$$G(n(x), \xi_p) = \frac{1}{2\xi_p}\int dx'\, n(x') \exp(-|x - x'|/\xi_p), \quad ({\rm S9})$$
where $\xi_p$ is trainable and controls the scale of the interaction. We parameterize $\xi_p = a + (b - a)\,\sigma(\eta_p)$ using the sigmoid function $\sigma(x) = 1/(1 + \exp(-x))$ to bound $\xi_p \in (a, b)$. Each $\eta_p$ is initialized from a zero-mean normal distribution. For the 16-channel global convolution layer used in this work, the $\eta_p$ of all channels are trainable.
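One way to realize this layer is sketched below: the trainable $\eta_p$ are squashed through a sigmoid to give the bounded length scales $\xi_p$, and the integral of Eq. (S9) is discretized as a dense matrix-vector product over the grid. The names and the (min_xi, max_xi) bounds are placeholders rather than the actual trained values.

import jax.numpy as jnp
from jax.nn import sigmoid

def global_convolution(density, eta, grids, min_xi, max_xi):
  """Sketch of the multi-channel global convolution layer, Eq. (S9).

  eta: trainable parameters, one per channel; xi_p is bounded to
  (min_xi, max_xi) through a sigmoid, as described above.
  Returns an array of shape (num_grids, num_channels).
  """
  dx = grids[1] - grids[0]
  xi = min_xi + (max_xi - min_xi) * sigmoid(eta)            # shape (channels,)
  # Pairwise distances |x - x'| on the grid.
  displacements = jnp.abs(grids[:, None] - grids[None, :])  # shape (m, m)
  # Kernel for each channel: exp(-|x - x'| / xi_p) / (2 xi_p).
  kernels = jnp.exp(-displacements[None, :, :] / xi[:, None, None]) / (2.0 * xi[:, None, None])
  # Discretized integral over x' for every channel.
  return jnp.einsum('pij,j->ip', kernels, density) * dx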
3. Dense
The dense layers are used only in the direct ML model (Figure S3(a)). The weights are initialized using Glorot normal initialization [5], and the bias terms are initialized from a zero-mean normal distribution.
FIG. S3. Model architectures of (a) the ML model that directly predicts the energy E from the geometry, (b) the neural LDA with KSR, (c) the neural GGA with KSR, and (d) the neural global functional.

C. Checkpoint selection
Each calculation is repeated with 25 random seeds. The model is trained by L-BFGS [6] as implemented in SciPy [7], scipy.optimize.fmin_l_bfgs_b with factr=1, m=20, pgtol=1e-14. Parameter checkpoints are saved every 10 steps until L-BFGS stops. The optimal checkpoint is the one with the lowest average energy error per electron on the validation sets, $\mathbb{E}_{\{\mathcal{S}_{\rm val}\}}\,\mathbb{E}_{\mathcal{M}\in\mathcal{S}_{\rm val}}\,|E - E_{\rm DMRG}|/N_e$, where E is the final energy from the KS calculation, $E_{\rm DMRG}$ is the exact energy, and $N_e$ is the number of electrons. The validation sets $\{\mathcal{S}_{\rm val}\}$ and the molecules $\mathcal{M}$ in each set are listed in Table III.
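The training loop can be sketched as a thin wrapper that hands a JAX value-and-gradient function to SciPy's L-BFGS with the options quoted above. loss_fn (which runs the KS self-consistent calculations on the training molecules and returns the loss of Eq. (2)) and the flattening of the network parameters into a single vector are assumed and not shown here.

import numpy as np
import jax
import scipy.optimize

def train_with_lbfgs(loss_fn, init_params_flat):
  """Minimal sketch: train flattened parameters with SciPy's L-BFGS.

  loss_fn maps a flat numpy array of parameters to a scalar loss;
  unflattening into network weights is assumed to happen inside it.
  """
  value_and_grad = jax.value_and_grad(loss_fn)

  def scipy_objective(params_flat):
    loss, grad = value_and_grad(params_flat)
    # SciPy expects a float loss and a float64 gradient array.
    return float(loss), np.asarray(grad, dtype=np.float64)

  params_opt, final_loss, info = scipy.optimize.fmin_l_bfgs_b(
      scipy_objective, init_params_flat,
      factr=1, m=20, pgtol=1e-14)  # options quoted in the text above
  return params_opt, final_loss, info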
VI. TRAINING A NEURAL XC FUNCTIONAL WITHOUT KS REGULARIZATION

Schmidt et al. [8] proposed a neural XC functional that can be used in a KS self-consistent calculation at the inference stage. Unlike our work, which trains the network through the KS calculations, they train the network on single steps. Their training set contains 12800 molecules and their validation set contains 6400 molecules; the molecules are exact solutions of one-dimensional two-electron problems in the external potential of up to three random nuclei (Equation 4 in Schmidt et al. [8]). The exact $v_{\rm XC}$ for each molecule is computed by an inverse KS method. They input the exact ground-state density and train the network to predict the XC energy per length, $e_{\rm XC}$, by minimizing a loss function that is a weighted combination of the mean square errors (MSE) of the XC energy, the XC potential, its numerical spatial derivative, and the difference between the XC energy and the integral over the potential (Equation 5 in Schmidt et al. [8]),
$$L(\theta) = \alpha\,{\rm MSE}(E_{\rm XC}) + \beta\,{\rm MSE}(v_{\rm XC}) + \gamma\,{\rm MSE}\Big(\frac{dv_{\rm XC}(x)}{dx}\Big) + \delta\,{\rm MSE}\Big(E_{\rm XC} - \int dx\, v_{\rm XC}(x)\, n(x)\Big), \quad ({\rm S10})$$
where $\alpha = 1.0$, $\beta = 100.0$, $\gamma = 10.0$, and $\delta = 1.0$, following Schmidt et al. [8].

It is natural to ask whether the generalization from the two H$_2$ training molecules in Figure 1 results from using KS self-consistent calculations at the inference stage rather than at the training stage. This is a reasonable concern, because the XC energy is usually a small portion of the total energy. To address it, we first use inverse KS to get the exact $v_{\rm XC}$ for the two H$_2$ molecules used in the H$_2$ experiment. Then we take the model architecture [9] and loss function of Schmidt et al. [8] and attempt to learn the entire dissociation curve of H$_2$ from the two molecules.

FIG. S4. Training the neural XC functional using (a) single-step training and (b) Kohn-Sham regularization. Both functionals use Kohn-Sham self-consistent calculations at the inference stage.

Figure S4 compares the results of KS self-consistent calculations using functionals trained with (a) single-step training and (b) Kohn-Sham regularization. It is not surprising that, even though both approaches use KS self-consistent calculations at the inference stage, the model trained on single steps fails to generalize in this small training set limit (two molecules rather than 12800). Using the KS self-consistent calculations during training, the Kohn-Sham regularizer, is crucial to the generalization. A single-step model can work well, as reported in Schmidt et al. [8], with a larger training set and exact $v_{\rm XC}$.

VII. TRAINING A NEURAL XC FUNCTIONAL WITH "WEAKER" KOHN-SHAM REGULARIZATION
Unlike other methods that build physics prior knowledge into the model through constraints, KSR "augments" the densities seen by the model during training, so there is no single coefficient that explicitly controls the strength of the regularization. A straightforward idea for controlling the KSR strength is to change the total number of iterations K in the KS self-consistent calculation. However, a small K may not be sufficient to converge the KS calculation, and it would then be ambiguous whether worse performance comes from weaker regularization or from unconverged KS calculations. Here we instead control the strength of KSR by stopping the gradient flow in the backpropagation while keeping K fixed.

Stop gradient is a common operation in differentiable programming. It acts as an identity in the forward pass, so it does not affect the KS calculations; in the backpropagation, it sets the gradient passing through it to zero. As shown in Figure S5, we stop the gradient before a certain KS iteration $k = k^*$, so all previous iterations $k < k^*$ have no access to the gradient information. Since the gradient may still flow into iteration k from $L_E$ through its energy output $E_k$, we also stop the gradient on $E_k$ for $k < k^*$. To simplify the graph, we remove the arrows between $E_k$ and $L_E$ for $k < k^*$. In Figure S5(a), the neural XC functional is updated only from the gradient information flowing through the final iteration; (b) is similar to (a) but has access to the gradient flowing through the last two iterations; no stop gradient is applied in (c), which is identical to the computational graph used in the main text.

We repeat the experiment of Figure 1. Figure S6 shows the H$_2$ dissociation curves trained with the three stop gradient settings of Figure S5. In Figure S6(a), L-BFGS converges quickly because there is not enough gradient information for training. As gradient information from more KS iterations is included, the true dissociation curve is captured.

FIG. S5. Computational graphs with different KSR strength: (a) stop gradient before iteration K, (b) stop gradient before iteration K − 1, and (c) no stop gradient, which is the same computational graph used in the main text.
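One way to realize this weaker KSR in code is to wrap the outputs of the early iterations in jax.lax.stop_gradient, as sketched below. As in the earlier sketches, single_ks_iteration and the mixing factor are placeholders, not the released implementation.

import jax
import jax.numpy as jnp

def ks_loop_with_stop_gradient(single_ks_iteration, initial_density,
                               num_iterations, stop_before, alpha):
  """KS loop where gradients only flow through iterations k >= stop_before.

  stop_before = k*: iterations with k < k* still run in the forward pass,
  but jax.lax.stop_gradient blocks backpropagation into them, giving a
  "weaker" KSR while keeping the number of iterations K fixed.
  """
  density = initial_density
  energies = []
  for k in range(1, num_iterations + 1):
    density_out, energy = single_ks_iteration(density)
    if k < stop_before:
      # Identity in the forward pass, zero gradient in the backward pass.
      density_out = jax.lax.stop_gradient(density_out)
      energy = jax.lax.stop_gradient(energy)
    energies.append(energy)
    # Linear density mixing as in the main text.
    density = density + alpha * (density_out - density)
  return density, jnp.stack(energies)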
FIG. S6. H$_2$ dissociation curves trained from two molecules (red diamonds) with the corresponding computational graphs shown in Figure S5.

[1] T. E. Baker, E. M. Stoudenmire, L. O. Wagner, K. Burke, and S. R. White, Physical Review B, 235141 (2015).
[2] M. Fishman, S. R. White, and E. M. Stoudenmire, "The ITensor software library for tensor network calculations," (2020), arXiv:2007.14822.
[3] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, and S. Wanderman-Milne, "JAX: composable transformations of Python+NumPy programs," http://github.com/google/jax (2018).
[4] K. He, X. Zhang, S. Ren, and J. Sun, in Proceedings of the IEEE International Conference on Computer Vision (2015) pp. 1026–1034.
[5] X. Glorot and Y. Bengio, in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (2010) pp. 249–256.
[6] D. C. Liu and J. Nocedal, Mathematical Programming, 503 (1989).
[7] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. Jarrod Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors, Nature Methods, 261 (2020).
[8] J. Schmidt, C. L. Benavides-Riveros, and M. A. Marques, The Journal of Physical Chemistry Letters, 6425 (2019).
[9] The only difference is that this model predicts $\epsilon_{\rm XC}[n](x)$ rather than $e_{\rm XC}[n](x)$ as in Schmidt et al. [8]. The relation between them is $e_{\rm XC}[n](x) = \epsilon_{\rm XC}[n](x)\, n(x)$.