Learning DFT
Peter Schmitteckert
HQS Quantum Simulations GmbH, Haid-und-Neu-Straße 7, 76131 Karlsruhe, Germany
(Dated: August 19, 2020)

We present an extension of reverse engineered Kohn-Sham potentials from a density matrix renormalization group calculation towards the construction of a density functional theory functional via deep learning. Instead of applying machine learning to the energy functional itself, we apply these techniques to the Kohn-Sham potentials. To this end we develop a scheme to train a neural network to represent the mapping from local densities to Kohn-Sham potentials. Finally, we use the neural network to up-scale the simulation to larger system sizes.

I. INTRODUCTION
In the quest of solving strongly interacting quantum systems the density matrix renormalization group (DMRG) technique [1–5] has turned out to be a powerful tool. Being a many-particle wave-function based approach, it is perfectly suited to attack strongly interacting many-particle problems. However, its downside are the high computational costs, rendering the DMRG too expensive for large systems. In contrast, density functional theory (DFT) has proven successful in predicting the structure of molecules, solids and surfaces [6–8], although it is based on a single-particle description only. DFT rests on the Hohenberg-Kohn theorem [6], which states that the ground state properties of a many-particle system are determined by the local density; specifically, for each observable there exists a functional of the density which provides the ground state expectation value of that observable when evaluated with the ground state density. As a result, the ground state energy functional is minimized by the ground state density. Within the Kohn-Sham construction [7] one describes an interacting system by a non-interacting system, where the local Kohn-Sham potential $v^{\mathrm{KS}}_x$ replaces the local potential to mimic the effect of the interaction. Provided the interaction is kept fixed, there is a one-to-one correspondence between the local potentials and the local densities. Despite this solid foundation of DFT and decades of research, DFT is unreliable for strongly correlated electron systems, as the nature of the associated functional is unknown. Gunnarsson and Schönhammer [9] extended DFT to a homogeneous lattice system by explicitly reverse engineering the Kohn-Sham potentials of the solution provided by exact diagonalization. This approach was then extended in [10] to inhomogeneous systems, see also [11]. In this work we study the extension of reverse engineered Kohn-Sham potentials for specific systems to the construction of DFT functionals.

To achieve such a construction we make use of deep learning and neural networks (NN), for which we refer to the excellent introduction by Nielsen [12] and Appendix A. NN actually have a long history, starting with the pioneering work of McCulloch and Pitts [13]. The major steps in the development of today's success in pattern recognition consist in the development of the back-propagation algorithm [14] and the invention of convolutional networks [15], combined with the computational power provided by graphics cards. In addition, the availability of free software packages [16–18] simplifies the application of neural networks enormously.

Machine learning already has broad application in physical simulations; for a review see [19]. In the context of simulating electronic systems machine learning has been applied to bypass the Schrödinger equation [20, 21], that is, the neural network is trained with DFT and post-DFT results in order to predict properties directly. Another approach consists in improving existing DFT functionals for molecules [22–24]. In addition, DFT functionals have been constructed by learning the energy functional [25–27]. Here we follow a different route. Instead of applying machine learning to the energy functional, we apply it to the learning of the Kohn-Sham potentials. The idea of this approach is that it should simplify a divide-and-conquer approach to larger, even inhomogeneous systems. In addition, there is no problem associated with the functional derivatives of the energy functional, as we are already learning the derivatives.

In this work we look at a one-dimensional interacting Fermi system with disorder, Eq. (1). The model has been studied for a long time and attracted attention through the work of Giamarchi and Schulz [28], who predicted a phase transition from the Anderson insulator to a metallic Luttinger liquid for attractive interaction, $U = -t$, and weak disorder. For repulsive interaction the interaction enhances the localization induced by the disorder [28, 29], while for strong disorder and interaction the interplay of disorder and charge density wave ordering renders the system complicated [30, 31].

In Section II we introduce the model under investigation; we discuss the application of a NN to the system in Section III, and the up-scaling is presented in Section IV. In Appendix A we provide a detailed introduction to fitting functions with neural networks, and finally sketch an extension using convolutional networks in Appendix B.

II. THE MODEL
The model under investigation is chosen to be formally simple, but beyond the reach of standard DFT functionals. To this end we look at spinless fermions in one dimension, with a nearest-neighbor hopping $t = 1$, periodic boundary conditions (PBC), nearest-neighbor interaction $U$ and a strong onsite disorder $v_x$,

$$ H \;=\; \underbrace{-t \sum_x \left( \hat{c}^\dagger_{x-1} \hat{c}_x + \hat{c}^\dagger_x \hat{c}_{x-1} \right)}_{\text{kinetic part}} \;+\; \underbrace{U \sum_x \hat{n}_{x-1} \hat{n}_x}_{\text{interaction}} \;+\; \underbrace{\sum_x v_x \hat{n}_x}_{\text{disorder}} . \tag{1} $$

Here, $\hat{c}^\dagger_x$ ($\hat{c}_x$) denote the fermionic creation (annihilation) operators at site $x$, $\hat{n}_x = \hat{c}^\dagger_x \hat{c}_x$ the local density, and $M = 98$ the number of lattice sites. The interaction is chosen to be $U = 1.0$, and the disorder potential $v_x$ is drawn from a uniform distribution between $\pm W/2$ and smoothed with a cosine filter with a width of three sites. Specifically, we choose $\epsilon_x \in [-W/2, W/2]$ uniformly distributed and obtain the smoothed potentials from

$$ v_x \;=\; \frac{1}{\sqrt{d+1}} \sum_{s=-d}^{s=d} \cos\!\left( \frac{s\,\pi}{d+2} \right) \epsilon_{x+s} \tag{2} $$

with $d = 3$ and $W = 2$. The reason for the smoothing is that disorder typically stems from scatterers in the substrate, so each scatterer should also affect neighboring sites. From test runs with fewer samples than provided below, but without the smoothing, we conclude that the smoothing does not alter the findings of this work.

In order to obtain the reference data for training the network we perform sample statistics using the density matrix renormalization group technique, where we used the "safety belt" approach explained in [32, 33] to avoid getting trapped in an excited state during the DMRG initialization.
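For concreteness, the disorder construction of Eq. (2) can be sketched in a few lines of Python/NumPy. This is a minimal illustration under the assumption that the smoothing window wraps around the periodic chain; the function name and the random seed are our own choices.

```python
import numpy as np

def smoothed_disorder(M=98, W=2.0, d=3, seed=0):
    """Draw eps_x uniformly from [-W/2, W/2] and apply the cosine filter of Eq. (2)."""
    rng = np.random.default_rng(seed)
    eps = rng.uniform(-W / 2, W / 2, size=M)
    v = np.zeros(M)
    for x in range(M):
        for s in range(-d, d + 1):
            # periodic boundary conditions for the smoothing window
            v[x] += np.cos(s * np.pi / (d + 2)) * eps[(x + s) % M]
    return v / np.sqrt(d + 1)
```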
III. NEURAL NETWORK AS A GENERATOR OF DFT POTENTIALS

From each DMRG run we obtain the local density $n_x$. We then perform an inverse DFT [9–11, 34, 35], where we search for a non-interacting Hamiltonian,

$$ H_0 \;=\; \underbrace{-t \sum_x \left( \hat{c}^\dagger_{x-1} \hat{c}_x + \hat{c}^\dagger_x \hat{c}_{x-1} \right)}_{\text{kinetic part}} \;+\; \underbrace{\sum_x v^{\mathrm{KS}}_x \hat{n}_x}_{\text{Kohn-Sham potential}} , \tag{3} $$

that is, we search numerically for the Kohn-Sham potential $v^{\mathrm{KS}}_x$ such that

$$ \langle \hat{n}_x \rangle_{H_0} \;=\; \langle \hat{n}_x \rangle_{H} . \tag{4} $$

In return we obtain for each disorder realization $\{v_x\}$ the corresponding Kohn-Sham potentials $v^{\mathrm{KS}}_x$. By building an infinite table listing the Kohn-Sham potential for every possible disorder configuration we would obtain the full DFT functional for this system. Since this is not feasible, we explore the possibility of using a NN to construct such a functional. We would like to remark that we are actually constructing the Kohn-Sham potentials; the energy functional is then given by solving the Kohn-Sham system.
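One simple way to carry out the inverse DFT of Eqs. (3) and (4) is a fixed-point iteration on the Kohn-Sham potential: raise the potential where the non-interacting density is too large and lower it where it is too small. The sketch below, assuming a plain update rule, is only one possible scheme and not necessarily the one used for the data in this work; the step size `alpha` and the tolerance are illustrative, and the overall constant of $v^{\mathrm{KS}}_x$, which the density does not fix, is left free.

```python
import numpy as np

def ks_density(v_ks, n_particles):
    """Ground-state density of the non-interacting chain of Eq. (3) with PBC and t = 1."""
    M = len(v_ks)
    h = np.diag(np.asarray(v_ks, dtype=float))
    for x in range(M):
        h[x, (x + 1) % M] = h[(x + 1) % M, x] = -1.0   # nearest-neighbor hopping
    _, psi = np.linalg.eigh(h)
    return np.sum(np.abs(psi[:, :n_particles]) ** 2, axis=1)

def inverse_dft(n_target, alpha=1.0, tol=1e-8, max_iter=100000):
    """Adjust v_KS until the non-interacting density matches the DMRG density, Eq. (4)."""
    n_particles = int(round(np.sum(n_target)))
    v_ks = np.zeros_like(n_target, dtype=float)
    for _ in range(max_iter):
        n = ks_density(v_ks, n_particles)
        if np.max(np.abs(n - n_target)) < tol:
            break
        v_ks += alpha * (n - n_target)   # push density away from over-occupied sites
    return v_ks
```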
At first sight it may look rather boring to construct a DFT functional for a system that one has already solved using the DMRG. The main reason for this work is that we would like to apply the functional constructed in this section to larger systems, i.e. to perform an up-scaling in system size, see Sec. IV.

In our setup we constructed a set of training and test data by performing ensemble statistics over 14,950 systems for the training set and 50 realizations for the testing set, where the testing set was never used as a training input. All simulations are performed at half filling. In order to avoid getting stuck in an excited state during the DMRG sweeping we added the ground state of a homogeneous, delocalized system to the density matrix used for the selection of the target space [29, 30, 32]; the reason for this approach is explained in detail in [32]. In addition, we targeted the three states lowest in energy in an initial run, performed three finite-lattice sweeps and kept at least 400 states per block, while we increased the number of states per block to ensure a discarded entropy below $10^{-\ldots}$. We then restarted each run keeping at least 450 states per block, targeting the two states lowest in energy, performing three sweeps and ensuring a discarded entropy below $10^{-\ldots}$. Finally, we restarted each run again, keeping at least 500 states per block, targeting the ground state only, performed seven finite-lattice sweeps and increased the number of states kept per block to ensure a discarded entropy below $10^{-\ldots}$.

For each sample we calculated the corresponding Kohn-Sham potentials $v^{\mathrm{KS}}_x$. For each site $x$ of a system we constructed a data set consisting of the densities from $s = 35$ sites to the left of $x$ up to $s$ sites to the right of $x$, employing PBC,

$$ \{\, n_{x-s},\, n_{x-s+1},\, \cdots,\, n_{x+s} \,\} \;\longrightarrow\; v^{\mathrm{KS}}_x , \tag{5} $$

mapping each set of densities to the corresponding Kohn-Sham potential $v^{\mathrm{KS}}_x$. We did not use the complete set of $M = 98$ densities, as we later want to use the data for up-scaling and wanted to avoid targeting precisely $M = 98$ sites. We therefore obtain $98 \times 14{,}950 = 1{,}465{,}100$ training sets of the form of Eq. (5) and $98 \times 50 = 4{,}900$ testing sets, with an input length of $71 = 2 \times 35 + 1$ and a single-valued result. We applied a tanh activation and therefore re-scaled the input densities by $2\,(n_x - 0.5)$. We started with the tiny-dnn [16] software package and switched later to TensorFlow [17] with the Keras [18] interface. During the training phase we tried various available optimizers, most notably stochastic gradient descent (SGD) [36], adaptive moment estimation (Adam) [37], and the adaptive gradient algorithm Adagrad [38]. In this section we are only reporting the results for the optimization run with the lowest deviation, which we achieved using tiny-dnn.
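A minimal TensorFlow/Keras sketch of the mapping of Eq. (5) is given below. The helper `make_windows` and the hidden-layer sizes are our own choices; the layer structure of the network used for the reported results (obtained with tiny-dnn) is not specified in the text, so the two layers of 200 tanh units are merely placeholders, as are the optimizer and training parameters.

```python
import numpy as np
import tensorflow as tf

def make_windows(n, v_ks, s=35):
    """Training pairs of Eq. (5): the 2s+1 = 71 densities around each site x
    (with PBC), rescaled by 2(n - 0.5) for the tanh units, mapped to v_KS at x."""
    M = len(n)
    X = np.array([[n[(x + k) % M] for k in range(-s, s + 1)] for x in range(M)])
    return 2.0 * (X - 0.5), np.asarray(v_ks)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(71,)),
    tf.keras.layers.Dense(200, activation="tanh"),   # hidden-layer sizes are a guess
    tf.keras.layers.Dense(200, activation="tanh"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")

# X_train, y_train would be stacked from the windows of all 14,950 training samples:
# model.fit(X_train, y_train, epochs=100, batch_size=256)
```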
In addition to the Kohn-Sham potentials $v^{\mathrm{KS}}_x$ we also define the Hartree exchange-correlation potentials $v^{\mathrm{HXC}}_x$,

$$ v^{\mathrm{HXC}}_x \;=\; v^{\mathrm{KS}}_x - v_x , \tag{6} $$

that is, the interaction-induced change of the local potential. Note that in the non-interacting limit, $U = 0$, the Hartree exchange-correlation potential is zero by definition. We did not subtract the Hartree contribution from $v^{\mathrm{HXC}}$, as the Hartree-Fock approximation has problems for strongly correlated one-dimensional systems, see [11].
FIG. 1. Results for the NN based DFT: (a) interaction-induced $v^{\mathrm{HXC}}$ for the test data vs. the local density; (b) the $v^{\mathrm{HXC}}$ potentials for 200 test values, the corresponding values predicted by the NN, and arrows indicating the errors; (c) the error arrows for 500 test values; (d) circles: the cloud of $v^{\mathrm{HXC}}$ potentials as in Fig. 1a, crosses: the error in the predicted Kohn-Sham values.
In Fig. 1a we show the $v^{\mathrm{HXC}}$ potentials vs. the local density. Just by looking at the plot it is obvious that any local density approximation will not be able to describe the Kohn-Sham potentials. It is also hard to imagine that any density-gradient based expansion will be able to represent the rather dense cloud of potentials. The key point in local density approximations (LDA) and gradient-based expansions consists in assuming that the density functional is smooth on short wave lengths and that one can expand the functional around its local value. In contrast, the $v^{\mathrm{HXC}}$ potentials shown in Fig. 1a are highly oscillating and multi-valued. Even if one does not require a smooth density dependence of the $v^{\mathrm{HXC}}$ potentials, one would need at least a strongly quenched distribution in order to construct an LDA-based approximation.

Fig. 1b compares the correct $v^{\mathrm{HXC}}$ potentials with the ones predicted by the NN for the first 200 testing sets, where the arrows denote the error of the prediction. In Fig. 1c we display the same data, drawing only the error arrows, for the first 500 testing sets. Here, for each sample the error arrow starts at the predicted value and ends at the desired correct value. Finally, in Fig. 1d we show the cloud of $v^{\mathrm{HXC}}$ potentials for the complete testing set as in Fig. 1a. In addition we also show the difference between the $v^{\mathrm{KS}}$ as predicted by the NN and the Kohn-Sham potential obtained from the inverse DFT. If the NN worked perfectly, these values would be zero.

At first sight the results look rather disappointing. The NN is only capable of partially capturing the interaction-induced $v^{\mathrm{HXC}}$ potentials. However, one should take into account that hand-crafting a density functional that reconstructs the cloud displayed in Fig. 1a is presumably impossible.

To make this result quantitative we show in Fig. 2a the distribution function of the $v^{\mathrm{HXC}}$ potentials in comparison to the error of the predicted Kohn-Sham potentials, where both distributions can be fit by a Gaussian distribution. The width of $\sigma = 0.13$ for the $v^{\mathrm{HXC}}$ is to be contrasted with the width of $\sigma \approx 0.035$ for the prediction error.
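Reading the two Gaussian widths of Fig. 2a together, the fraction of the interaction-induced potential captured by the network can be estimated from the relative reduction of the width; this is our reading of the 73% quoted in the summary:

$$ 1 - \frac{\sigma_{\mathrm{error}}}{\sigma_{\mathrm{HXC}}} \;\approx\; 1 - \frac{0.035}{0.13} \;\approx\; 0.73 . $$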
IV. UP-SCALING

In this section we address the question whether we can use the approach to up-scale the calculations to larger system sizes, i.e. using the NN trained on the $M = 98$ site systems to obtain results for larger systems. To this end we performed DMRG runs for ten $M = 250$ site systems using the same DMRG parameters and setup as in Sec. III, and extracted 2500 test sets as outlined in Eq. (5). In Fig. 2b we compare the error of the Kohn-Sham potentials for the 250-site systems obtained from the NN trained with the $M = 98$ site data. Interestingly, the width of the error distribution does not increase. The result therefore suggests that the up-scaling of an NN trained on small systems to an evaluation on larger systems is a fruitful concept.

FIG. 2. The error distribution of the NN: (a) the distribution of the $v^{\mathrm{HXC}}$ potentials and the distribution of the error in the prediction of the $v^{\mathrm{KS}}$ potentials for the 98-site system, together with the corresponding Gaussian fits; (b) the same data as in (a) combined with the results obtained from up-scaling the NN to 250 sites.
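In practice, up-scaling amounts to evaluating the already trained network on density windows taken from the larger system; nothing in the input refers to the total system size. The sketch below reuses `make_windows` and `model` from the sketch in Sec. III; the loader `load_dmrg_density_and_vks` is hypothetical and stands for reading one of the M = 250 DMRG samples.

```python
# Hypothetical loader for one of the ten M = 250 DMRG samples (density and
# reverse-engineered Kohn-Sham potential); not part of any real package.
n_250, v_ks_250 = load_dmrg_density_and_vks("sample_M250.dat")

X_250, y_250 = make_windows(n_250, v_ks_250, s=35)    # 250 windows of 71 densities
errors = model.predict(X_250).ravel() - y_250         # prediction error per site
print("width of the prediction error:", errors.std())
```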
V. SUMMARY

In this work we combined the concept of deep learning via neural networks with the reverse engineering of Kohn-Sham potentials in order to construct a DFT functional by learning the Kohn-Sham potentials. We applied the idea to systems of one-dimensional spinless fermions with nearest-neighbor interaction and hopping, combined with on-site disorder. We showed that, while not perfect, the network captures 73% of the interaction-induced exchange-correlation potentials. In addition, we demonstrated that the concept of constructing functionals for the Kohn-Sham potentials from small systems and applying them to larger systems is a promising route for the investigation of interacting Fermi systems.
ACKNOWLEDGMENTS
Most of the work reported here was performed while the author was at the University of Würzburg and was supported by ERC-StG-Thomale-TOPOLECTRICS-336012; it was presented at the FQMT'19 in Prague. We would like to thank Florian Eich for insightful discussions.

All authors contributed equally to the manuscript and the acquisition of the results.

[1] Steven R. White. Density matrix formulation for quantum renormalization groups. Phys. Rev. Lett., 69:2863–2866, 1992. doi:10.1103/PhysRevLett.69.2863.
[2] S. R. White and R. M. Noack. Real-space quantum renormalization groups. Phys. Rev. Lett., 68:3487–3490, 1992. doi:10.1103/PhysRevLett.68.3487.
[3] Steven R. White. Density matrix renormalization group. Phys. Rev. B, 48:10345, 1993.
[4] Reinhard M. Noack and Salvatore R. Manmana. Diagonalization- and numerical renormalization-group-based methods for interacting quantum systems. In Adolfo Avella and Ferdinando Mancini, editors, Lectures on the Physics of Highly Correlated Electron Systems IX: Ninth Training Course in the Physics of Correlated Electron Systems and High-Tc Superconductors, volume 789, pages 93–163, Salerno, Italy, 2005. AIP.
[5] Karen A. Hallberg. New trends in density matrix renormalization. Adv. Phys., 55(5):477–526, 2006. doi:10.1080/00018730600766432.
[6] P. Hohenberg and W. Kohn. Inhomogeneous electron gas. Phys. Rev., 136:B864–B871, 1964. doi:10.1103/PhysRev.136.B864.
[7] W. Kohn and L. J. Sham. Self-consistent equations including exchange and correlation effects. Phys. Rev., 140:A1133–A1138, 1965. doi:10.1103/PhysRev.140.A1133.
[8] R. M. Dreizler and E. K. U. Gross. Density Functional Theory. Springer-Verlag, 1990.
[9] O. Gunnarsson and K. Schönhammer. Density-functional treatment of an exactly solvable semiconductor model. Phys. Rev. Lett., 56(18):1968–1971, 1986. doi:10.1103/PhysRevLett.56.1968.
[10] Peter Schmitteckert and Ferdinand Evers. Exact ground state density-functional theory for impurity models coupled to external reservoirs and transport calculations. Phys. Rev. Lett., 100(8):086401, 2008. doi:10.1103/PhysRevLett.100.086401.
[11] Peter Schmitteckert. Inverse mean field theories. Phys. Chem. Chem. Phys., 20:27600–27610, 2018. doi:10.1039/C8CP03763A.
[12] Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015.
[13] Warren S. McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4):115–133, 1943. doi:10.1007/BF02478259.
[14] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323:533, 1986.
[15] Y. LeCun and Y. Bengio. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361:3539, 1995.
[16] Taiga Nomi et al. tiny-dnn.
[17] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[18] François Chollet et al. Keras, 2015.
[19] Giuseppe Carleo, Ignacio Cirac, Kyle Cranmer, Laurent Daudet, Maria Schuld, Naftali Tishby, Leslie Vogt-Maranto, and Lenka Zdeborová. Machine learning and the physical sciences. Rev. Mod. Phys., 91:045002, 2019. doi:10.1103/RevModPhys.91.045002.
[20] Felix Brockherde, Leslie Vogt, Li Li, Mark E. Tuckerman, Kieron Burke, and Klaus-Robert Müller. Bypassing the Kohn-Sham equations with machine learning. Nature Communications, 8(1):872, 2017. doi:10.1038/s41467-017-00839-3.
[21] Brian Kolb, Levi C. Lentz, and Alexie M. Kolpak. Discovering charge density functionals and structure-property relationships with PROPhet: A general framework for coupling machine learning and first-principles methods. Scientific Reports, 7(1):1192, 2017. doi:10.1038/s41598-017-01251-z.
[22] L. Hu, X. Wang, L. Wong, and G. Chen. Combined first-principles calculation and neural-network correction approach for heat of formation. J. Chem. Phys., 119:11501, 2003.
[23] Xiao Zheng, LiHong Hu, XiuJun Wang, and GuanHua Chen. A generalized exchange-correlation functional: the neural-networks approach. Chemical Physics Letters, 390(1):186–192, 2004. doi:10.1016/j.cplett.2004.04.020.
[24] Qin Liu, JingChun Wang, PengLi Du, LiHong Hu, Xiao Zheng, and GuanHua Chen. Improving the performance of long-range-corrected exchange-correlation functional with an embedded neural network. The Journal of Physical Chemistry A, 121(38):7273–7281, 2017. doi:10.1021/acs.jpca.7b07045.
[25] John C. Snyder, Matthias Rupp, Katja Hansen, Klaus-Robert Müller, and Kieron Burke. Finding density functionals with machine learning. Phys. Rev. Lett., 108:253002, 2012. doi:10.1103/PhysRevLett.108.253002.
[26] John C. Snyder, Matthias Rupp, Katja Hansen, Leo Blooston, Klaus-Robert Müller, and Kieron Burke. Orbital-free bond breaking via machine learning. The Journal of Chemical Physics, 139(22):224104, 2013. doi:10.1063/1.4834075.
[27] Li Li, Thomas E. Baker, Steven R. White, and Kieron Burke. Pure density functional for strong correlation and the thermodynamic limit from machine learning. Phys. Rev. B, 94:245129, 2016. doi:10.1103/PhysRevB.94.245129.
[28] T. Giamarchi and H. J. Schulz. Anderson localization and interactions in one-dimensional metals. Phys. Rev. B, 37:325–340, 1988. doi:10.1103/PhysRevB.37.325.
[29] P. Schmitteckert, T. Schulze, C. Schuster, P. Schwab, and U. Eckern. Anderson localization versus delocalization of interacting fermions in one dimension. Phys. Rev. Lett., 80:560–563, 1998. doi:10.1103/PhysRevLett.80.560.
[30] Peter Schmitteckert, Rodolfo A. Jalabert, Dietmar Weinmann, and Jean-Louis Pichard. From the Fermi glass towards the Mott insulator in one dimension: Delocalization and strongly enhanced persistent currents. Phys. Rev. Lett., 81:2308–2311, 1998. doi:10.1103/PhysRevLett.81.2308.
[31] Rodolfo A. Jalabert, Dietmar Weinmann, and Jean-Louis Pichard. Partial delocalization of the ground state by repulsive interactions in a disordered chain. Physica E: Low-dimensional Systems and Nanostructures, 9(3):347–351, 2001. doi:10.1016/S1386-9477(00)00226-5.
[32] Peter Schmitteckert. Disordered one-dimensional Fermi systems. In Density Matrix Renormalization [33], pages 345–355, 1999. ISBN 978-3-540-66129-0.
[33] I. Peschel, X. Wang, M. Kaulke, and K. Hallberg, editors. Density Matrix Renormalization, 1999. ISBN 978-3-540-66129-0.
[34] K. Schönhammer, O. Gunnarsson, and R. M. Noack. Density-functional theory on a lattice: Comparison with exact numerical results for a model with strongly correlated electrons. Phys. Rev. B, 52:2504–2510, 1995. doi:10.1103/PhysRevB.52.2504.
[35] Ferdinand Evers and Peter Schmitteckert. Density functional theory with exact xc-potentials: Lessons from DMRG studies and exactly solvable models. Phys. Status Solidi B, 250:2330, 2013.
[36] Herbert Robbins and Sutton Monro. A stochastic approximation method. Ann. Math. Statist., 22(3):400–407, 1951. doi:10.1214/aoms/1177729586.
[37] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014.
[38] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121–2159, 2011.
Appendix A: Neural networks for fitting functions
The main application of neural networks consists in the classification of input variables, i.e. one maps the input to a discrete set of output variables, with the standard internet example of "is it a cat or not?". Here we provide an example of applying a neural network to fitting a function.
1. The network
The basic building block of a neural network (NN) is a neuron, as depicted in Fig. 3a. The neuron consists of an input $\{x_i\}$, weight factors $\{w_i\}$, an offset $b$, a so-called activation function $\sigma(z)$, see Fig. 3b, and the output $z$:

$$ z \;=\; \sigma\!\left( b + \sum_{\ell} w_{\ell}\, x_{\ell} \right) . \tag{A1} $$

Throughout this work we have always used a tanh activation function. One now combines many neurons, Fig. 3a, into a neural network in a layered fashion by connecting the inputs of the neurons of one layer with the outputs of the neurons of the previous layer, see Fig. 4. Since, from a user perspective, the NN in Fig. 4 translates the input of the first layer into the output of the last layer, one calls the first layer the input layer, the last one the output layer, and the other layers are denoted as hidden layers. If each neuron is connected to each neuron of the previous layer one calls the network dense. The training of a NN is often referred to as machine learning, and in the presence of many hidden layers as deep learning.

FIG. 3. The building block of a NN: (a) a neuron; (b) typical activation functions: a sigmoid and a tanh.

FIG. 4. A neural network built out of the neurons shown in Fig. 3a.

In summary, the NN in Fig. 4 calculates an output $z$ from the input $\{x_j\}$, where one has to specify the parameters of Eq. (A1) for each neuron $n$ in layer $\ell$,

$$ z_{\ell,n} \;=\; \sigma\!\left( b_{\ell,n} + \sum_{j} w_{\ell,n,j}\, x_j \right) . \tag{A2} $$

Of course, this can be extended to create multiple output variables $z_k$ in the output layer.

In order to apply a NN to fitting a function $f(x)$ we use a NN with a single input $x$ and a single output $z$. The free parameters $\{b_{\ell,n}, w_{\ell,n,j}\}$ are the set of fitting parameters. We would like to note that this is in contrast to the usual approach in physics, where one tries to fit a phenomenon with a suitable function using as few fitting parameters as possible. Instead, here we take the opposite approach of using a simple fitting function unrelated to the problem and fit the desired function with a large number of parameters and a few steps of recursion.
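A direct NumPy rendering of Eqs. (A1) and (A2) may make the wiring explicit; the layer sizes and the random weights below are arbitrary and only serve as an illustration.

```python
import numpy as np

def neuron(x, w, b):
    """A single neuron, Eq. (A1): tanh of the weighted input plus offset."""
    return np.tanh(b + np.dot(w, x))

def dense_layer(x, W, b):
    """A dense layer, Eq. (A2): each neuron n sees the full input vector."""
    return np.tanh(b + W @ x)

# A tiny 1 x 3 x 1 network assembled from these building blocks
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(3, 1)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)
z = dense_layer(dense_layer(np.array([0.5]), W1, b1), W2, b2)
```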
2. Training the neural network: minimizing the cost function
The idea for determining the fit parameters for fitting a function $f(x)$ consists in minimizing a cost function, typically

$$ C \;=\; \frac{1}{N} \sum_{i=1}^{N} \left|\, f(x_i) - z_{\{b_{\ell,n},\, w_{\ell,n,j}\}}(x_i) \,\right| , \tag{A3} $$

where $N$ denotes the number of training samples. Eq. (A3) could in principle be minimized by a standard steepest-descent gradient search. However, due to the vast number of fit parameters this is not feasible in non-trivial examples, as the number of parameters, and therefore the dimensions of the associated matrices, get too large. The breakthrough for neural networks was provided by the invention of the back-propagation algorithm [14] combined with a stochastic evaluation of the gradients [36–38] and the massive computational power of graphics cards, and, for pattern recognition, the use of convolutional layers [15], see below. In the example provided in this section we used the TensorFlow [17] software package combined with the Keras [18] front end.
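As an illustration of the procedure, the following Keras sketch fits a simple target function by minimizing a quadratic cost, first with plain SGD and then continuing with Adam. The target `f` below is a stand-in (some constants of Eq. (A4) in the following subsection could not be recovered), the layer sizes follow the 1 × 50 × 50 × 1 network discussed there, and the epoch counts are illustrative.

```python
import numpy as np
import tensorflow as tf

# Stand-in target function; not the exact f(x) of Eq. (A4).
f = lambda x: np.sin(3 * x) + 4 * np.exp(-x ** 2)

x_train = np.random.uniform(-3, 3, size=(2000, 1))
y_train = f(x_train)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(50, activation="tanh"),
    tf.keras.layers.Dense(50, activation="tanh"),
    tf.keras.layers.Dense(1),
])

# Minimize a quadratic cost of the type of Eq. (A3), first with SGD ...
model.compile(optimizer=tf.keras.optimizers.SGD(), loss="mse")
model.fit(x_train, y_train, epochs=100, verbose=0)
# ... then continue with Adam; recompiling keeps the already trained weights.
model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")
model.fit(x_train, y_train, epochs=100, verbose=0)
```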
3. An example
As an example we look at the function

$$ f(x) \;=\; \sin(3x) \;-\; \ldots\,(13x)\,\mathrm{e}^{\ldots x} \;+\; 4\,\mathrm{e}^{-x^2/\ldots} \;+\; 6\,\mathrm{e}^{-(x+0.\ldots)^2} , \tag{A4} $$

which has no deeper meaning; it was just handcrafted to represent a not too trivial function combining sharp and non-sharp features.

Since $f$ is a single-valued function of a single argument, the input and output layers consist of a single neuron only. In Fig. 5 we show the results for fitting the function $f(x)$ of Eq. (A4) with two hidden layers consisting of fifty neurons each, i.e. a dense NN with a $1 \times 50 \times 50 \times 1$ structure. We trained the NN on a set of points $x_j$ with the corresponding $z_j = f(x_j)$, by performing a stochastic gradient descent search (SGD) with ten repetitions over the complete set of $\{x_j, z_j\}$. We then evaluate the NN on an equidistantly spaced set of $\{x_\ell\}$. As one can see in Fig. 5a, the result is a rather smooth function that misses the sharp features. The way to improve the NN consists in learning harder, that is, we increased the repetitions of the SGD to 100, Fig. 5b, and 1000, Fig. 5c, which finally leads to a good representation of the function.

A different strategy consists in using different gradient search strategies, i.e. a different optimizer to minimize the cost function Eq. (A3). In Fig. 5d we show the results where we used only 500 repetitions, however we switched between an SGD and an Adam optimizer, which performs much better than an SGD alone. We would like to remark that a priori it is not clear which optimizer is best, and the optimizer performance seems to be rather problem dependent, see [12].

Finally, in Fig. 6 we present results obtained from a deeper network of structure $1 \times \cdots \times 1$, evaluated over the complete range. As a result we obtained a rather good approximation to the function at the expense of using more than 15,000 fit parameters $\{b_{\ell,n}, w_{\ell,n,j}\}$.

We would like to point out that the approach of using 15,000 fit parameters may appear odd, as it renders an understanding of the network impossible. However, we are using the approach to construct a DFT functional. For the latter it is also fair to state that most users of the modern sophisticated DFT functionals have no understanding of the details of their construction.

Appendix B: A convolutional network