Emulating the First Principles of Matter: A Probabilistic Roadmap
Jianzhong Wu and Mengyang Gu
Abstract
This chapter provides a tutorial overview of first principles methods to describe the properties of matter at the ground state or equilibrium. It begins with a brief introduction to quantum and statistical mechanics for predicting the electronic structure and diverse static properties of many-particle systems useful for practical applications. Pedagogical examples are given to illustrate the basic concepts and simple applications of quantum Monte Carlo and density functional theory, two representative methods commonly used in the literature of first principles modeling. In addition, this chapter highlights the practical need for the integration of physics-based modeling and data-science approaches to reduce the computational cost and expand the scope of applicability. A special emphasis is placed on recent developments of statistical surrogate models that emulate first principles calculations from a probabilistic point of view. The probabilistic approach provides an internal assessment of the approximation accuracy of emulation that quantifies the uncertainty in predictions. Various recent advances in this direction establish a new marriage between Gaussian processes and first principles calculations, with physical properties, such as translational, rotational, and permutation symmetry, naturally encoded in new kernel functions. Finally, the chapter concludes with some prospects on future advances in the field toward faster yet more accurate computation, leveraging a synergetic combination of novel theoretical concepts and efficient numerical algorithms.
Key words: surrogate models, quantum and statistical mechanics, density functional theory, physical invariance
Jianzhong Wu
Department of Chemical and Environmental Engineering, University of California, Riverside, CA 92521, USA e-mail: [email protected]
Mengyang Gu
Department of Statistics and Applied Probability, University of California, Santa Barbara, CA 93106, USA e-mail: [email protected]
Mathematically speaking, an essential task in predicting the properties of matter from first principles is solving the Schrödinger equation. While the task is rather straightforward if one is concerned only with the properties of non-interacting particles (such as ideal fermions or the lone electron in a hydrogen atom), the problem rapidly becomes much too complicated when the procedure is extended to non-ideal systems consisting of more than a single particle. By non-ideal we mean interactions between particles, such as the Coulomb potential between charged species. Not only does the dimensionality of the wave function increase linearly with the number of particles in the system, but additional considerations must also be taken to account for correlation effects due to particle-particle interactions. The latter are responsible for the non-random spatial arrangement of particles in a many-body system, which gives rise to the system symmetry and, for macroscopic systems, the rich phase behavior of matter in response to changes of thermodynamic conditions.

The fundamental principles to describe particle-particle interactions have been well established within the framework of quantum mechanics (QM). On the other hand, structure formation and phase transitions in macroscopic systems are dictated by the fundamental laws of thermodynamics and can be described, at least in principle, by statistical mechanics (SM). From a practical perspective, the situation can thus be summarized, as famously stated many years ago by Paul A. M. Dirac [19],
The underlying physical laws necessary for the mathematical theory of a large part of physics and the whole of chemistry are thus completely known, and the difficulty is only that the exact application of these laws leads to equations much too complicated to be soluble. It therefore becomes desirable that approximate practical methods of applying quantum mechanics should be developed, which can lead to an explanation of the main features of complex atomic systems without too much computation.
Since the beginning of the last century, a perennial effort in the scientific community has been devoted to the development of analytical and numerical schemes to approximate the general procedures of QM/SM calculations such that they can be applied to materials and chemical systems to attain useful results that satisfy practical needs. Such efforts remain active today. The theoretical methods and their applications to diverse problems of practical interest constitute a major component of the curriculum for a wide variety of disciplines in physical sciences and engineering. Numerous textbooks of QM and SM are readily available on both the fundamental principles and practical applications. Here we introduce only the essential mathematical procedures to describe the properties of many-body systems with a minimal exposure to the physical details. The emphasis is placed on a few theoretical approaches commonly used in the literature to predict the electronic structure and macroscopic properties of matter. To establish connections with problems of practical interest, we will elucidate how the electronic structure is related to various physical and chemical properties of chemical systems and materials.
The ultimate goal of first principles modeling is to predict the properties of matter based on its fundamental ingredients, i.e., electrons and nuclei as they appear in gas, liquid, solid, or plasma, the four natural states of matter commonly observable in daily life. Electrons are elementary particles. Each electron has a negative unit charge, $-1.602 \times 10^{-19}$ C. Nuclei are made of neutrons and protons. In chemical systems and materials, nuclei may be represented by point charges under most conditions.

Quantum mechanics asserts that matter exists in discrete quantum states, i.e., a set of parameters that describe the ultimate details of the system. The properties of matter are manifested as the expectation of the collective behavior of the underlying particles in different quantum states. Once the quantum states are identified, we can in principle determine all properties of the system.

To elucidate the essential mathematical procedure, consider in general a system containing $N_e$ electrons and $N_\alpha$ nuclei of type $\alpha$. At the macroscopic scale, all natural states of matter satisfy the condition of charge neutrality, i.e., the total electron charge is exactly balanced by that of the nuclei. Therefore, the condition of charge neutrality requires

$$N_e - \sum_\alpha Z_\alpha N_\alpha = 0 \tag{1}$$

where $Z_\alpha$ is a positive integer standing for the valency of the nuclear charge. This integer coincides with the atomic number of nuclear particle $\alpha$.

At any moment, the electrons and nuclei may exist in a multitude of quantum states satisfying the Schrödinger equation:

$$\hat{H} |\Psi\rangle = E |\Psi\rangle \tag{2}$$

where $|\Psi\rangle$ represents a quantum state as specified by wave function $\Psi$, $E$ is a scalar value representing the system energy, and $\hat{H}$ denotes the Hamiltonian of the system.
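The structure of Eq. (2) can be made concrete on a computer by discretizing a one-dimensional Hamiltonian and diagonalizing the resulting matrix. The sketch below is a toy illustration of our own (not part of the chapter's formalism): a harmonic well with $\hbar = m = 1$, solved by finite differences. It also previews the orthonormality and superposition relations, Eqs. (4)-(6), discussed later.

```python
import numpy as np

# Toy illustration of Eq. (2) as a matrix eigenvalue problem: discretize
# H = -(1/2) d^2/dx^2 + V(x) on a 1-D grid (hbar = m = 1) with a harmonic
# well V = x^2/2, then diagonalize.
n, L = 400, 20.0
x = np.linspace(-L / 2, L / 2, n)
h = x[1] - x[0]

# central finite-difference Laplacian
lap = (np.diag(np.ones(n - 1), -1) - 2.0 * np.eye(n)
       + np.diag(np.ones(n - 1), 1)) / h**2
H = -0.5 * lap + np.diag(0.5 * x**2)

E, psi = np.linalg.eigh(H)     # energies ascending; columns = pure states
print(E[:3])                   # close to 0.5, 1.5, 2.5 (units of hbar*omega)

# eigenvectors are orthonormal (cf. Eq. (4)); an arbitrary state expands
# in pure states with sum |alpha_n|^2 = 1 (cf. Eqs. (5)-(6))
assert np.allclose(psi.T @ psi, np.eye(n))
Psi = np.exp(-(x - 1.0) ** 2)  # an arbitrary (displaced) state
Psi /= np.linalg.norm(Psi)
alpha = psi.T @ Psi            # expansion coefficients
print(np.sum(alpha**2))        # ~1.0 (completeness)
```

The harmonic potential is chosen only because its spectrum is known in closed form, which makes the discretization error easy to judge.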
The dynamic properties of the system can be described by the time-dependent Schrödinger equation, which is not of concern in this work.

In quantum mechanics, the Hamiltonian is a mathematical operator defining the kinetic and potential energies of the system, i.e., the energies affiliated with the motions of individual particles and particle-particle interactions. For a system containing $N_e$ electrons and $N_\alpha$ nuclei of type $\alpha$, the Hamiltonian is given by

$$\hat{H} = -\sum_{i=1}^{N} \frac{\hbar^2}{2 m_i} \nabla_i^2 + \frac{e^2}{8\pi\varepsilon_0} \sum_{i=1}^{N} \sum_{j \neq i}^{N} \frac{Z_i Z_j}{|\mathbf{r}_i - \mathbf{r}_j|} \tag{3}$$

where $\nabla_i^2$ is the Laplacian operator of the $i$th particle, $i = 1, \ldots, N$, with $N \equiv \sum_\alpha N_\alpha + N_e$ being the total number of particles in the system, $m_i$ stands for the rest mass of particle $i$, $\mathbf{r}_i$ is the particle position, $|\mathbf{r}_i - \mathbf{r}_j|$ represents the Euclidean distance between particles $i$ and $j$, $\hbar = h/2\pi$ is the reduced Planck constant, $e$ is the unit charge, and $\varepsilon_0$ is the free-space permittivity. The first term on the right defines the total kinetic energy, which is affiliated with the momenta of all particles in the system. The second term prescribes the potential energy, arising from the electrostatic interactions among electrons and nuclei. The electrostatic potential has an expression formally identical to that given by Coulomb's law for classical particles.

The wave function has the units of one over square root of volume. It can be represented in terms of the system configuration, i.e., a set of coordinates that define the positions and angular momenta of individual particles, $\mathbf{x}^N \equiv \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$. Here each vector $\mathbf{x}_i \equiv \{\mathbf{r}_i, s_i\}$ specifies the position $\mathbf{r}_i$ and spin state $s_i$ of particle $i$.
The electron spin is affiliated with its intrinsic angular momentum as for an elementary particle; it takes only two possible values, conventionally denoted as $|\uparrow\rangle$ and $|\downarrow\rangle$, or simply spin up and down states. By contrast, the nuclear spin arises from its subatomic constituents, i.e., protons and neutrons. The nuclear spin is commonly treated as a single entity, which is invariant with the quantum states of the system. Therefore, we may describe the configuration of a system containing $N_e$ electrons and $N_n = \sum_\alpha N_\alpha$ nuclei using $4N_e + 3N_n$ variables. As each variable represents one degree of freedom, the wave function $\Psi$ has the dimensionality of $4N_e + 3N_n$.

Mathematically, Eq. (2) represents an eigenvalue problem. The energy levels and wave functions are related to the eigenvalues and eigenfunctions corresponding to operator $\hat{H}$. The eigenstates are also known as the pure states, whose wave functions satisfy the orthonormality condition

$$\int d\mathbf{x}^N \, \Psi_n^*(\mathbf{x}^N) \Psi_m(\mathbf{x}^N) = \begin{cases} 1 & n = m \\ 0 & n \neq m \end{cases} \tag{4}$$

where superscript $*$ represents the complex conjugate, and integers $n$ and $m$ are quantum numbers. At each pure state $n$, $|\Psi_n(\mathbf{x}^N)|^2$ represents the probability density of the system in configuration $\mathbf{x}^N$.

For an arbitrary quantum state, the wave function can be expressed as a superposition of pure states

$$\Psi(\mathbf{x}^N) = \sum_n \alpha_n \Psi_n(\mathbf{x}^N) \tag{5}$$

where subscript $n$ denotes an eigenstate, and

$$\alpha_n = \int d\mathbf{x}^N \, \Psi_n^*(\mathbf{x}^N) \Psi(\mathbf{x}^N) \tag{6}$$

A mixed quantum state is one that can be written as a linear combination of more than one pure state, i.e., $\alpha_n \neq 0$ for more than one $n$. For such a state, the expectation of an observable is

$$\langle \hat{A} \rangle_\Psi = \frac{\int d\mathbf{x}^N \, \Psi^*(\mathbf{x}^N) \hat{A} \Psi(\mathbf{x}^N)}{\int d\mathbf{x}^N \, \Psi^*(\mathbf{x}^N) \Psi(\mathbf{x}^N)} \tag{7}$$

where operator $\hat{A}$ denotes an observable quantity, and $\langle \ldots \rangle_\Psi$ stands for the quantum expectation, i.e.
, the expectation value of an observable property of the system in accordance with wave function $\Psi$.

While the affiliation of particles with positions and spin states is intuitively appealing, one should keep in mind that, unlike classical particles, quantum particles are not allowed to have definite coordinates at any instant and thus, strictly speaking, cannot be "tagged" with specific positions and spin states. At any moment, quantum particles may assume positions corresponding to a superposition of all possible pure states.

A One-Particle Problem
The one-particle problem is helpful to elucidate some basic concepts related to the Schrödinger equation. If we consider a single particle in free space, the Schrödinger equation reduces to

$$-\frac{\hbar^2}{2m} \nabla^2 \Psi(\mathbf{r}) = E \Psi(\mathbf{r}) \tag{8}$$

The differential equation can be readily solved with the periodic boundary conditions (PBC)

$$\Psi(\mathbf{r}) = \Psi(\mathbf{r} + \mathbf{L}) \tag{9}$$

where $\mathbf{L} \equiv (L, L, L)$ and $L > 0$ is the edge length of a cubic box, such that each box contains an imaginary particle imaging the position of the real particle under consideration. The real and imaginary particles are assumed identical but do not interact with each other.

From Eqs. (8) and (9), we can easily find the wave function by using the Fourier transform:

$$\Psi(\mathbf{r}) = \frac{\exp(i \mathbf{r} \cdot \mathbf{k})}{\sqrt{V}} \tag{10}$$

where $V = L^3$, and $\mathbf{k}$ is a 3-dimensional vector given by

$$\mathbf{k} = \frac{2\pi}{L} \mathbf{n} \tag{11}$$

with $\mathbf{n} = (n_x, n_y, n_z)$, $n_{x,y,z} = 0, \pm 1, \pm 2, \pm 3, \ldots$ representing the quantum numbers.

It is straightforward to verify that the wave function satisfies the orthonormal conditions

$$\int d\mathbf{r} \, |\Psi_{\mathbf{k}}(\mathbf{r})|^2 = 1 \tag{12}$$

$$\int d\mathbf{r} \, \Psi_{\mathbf{k}}^*(\mathbf{r}) \Psi_{\mathbf{k}'}(\mathbf{r}) = \delta_{\mathbf{k},\mathbf{k}'} \tag{13}$$

where $\delta_{\mathbf{k},\mathbf{k}'}$ denotes the Kronecker delta, which is equal to 1 if $\mathbf{k} = \mathbf{k}'$ and zero otherwise. At each quantum state, the particle is uniformly distributed inside the box, i.e., the probability density of finding the particle is everywhere uniform.

At each quantum state, we can find the particle energy from the Schrödinger equation:

$$E = \frac{\hbar^2 k^2}{2m} = \frac{h^2}{2mL^2} \left( n_x^2 + n_y^2 + n_z^2 \right) \tag{14}$$

At the ground state, the particle has a minimum energy of $E = h^2/(2mL^2)$, which has a degeneracy of 6 corresponding to all possible assignments of the quantum numbers leading to $n_x^2 + n_y^2 + n_z^2 = 1$. Unlike a classical particle, a quantum particle cannot have zero energy, as required by the uncertainty principle.

The Schrödinger equation is applicable to systems with any number of particles, either finite or infinite. If the system is isolated from its surroundings, the total energy is fixed, and the number of quantum states corresponding to a particular energy is called the degeneracy. In other words, degenerate quantum states have the same energy. The ground state is the state of a system when it has the minimum energy. If the system is allowed to exchange energy with its surroundings (e.g., in contact with a thermal bath), the total energy fluctuates so that different excited states become accessible to the system.

In statistical mechanics, the quantum states are also known as microstates. At each microstate, we know the microscopic details of the system such as the energy and particle positions. For a system with a given number of particles, volume, and temperature, the probability of different microstates is determined by the Boltzmann distribution

$$p_\nu = \frac{\exp(-E_\nu / k_B T)}{Q} \tag{15}$$

where $\nu$ denotes a microstate, $k_B$ is the Boltzmann constant, $T$ is the absolute temperature, and $Q \equiv \sum_\nu \exp(-E_\nu / k_B T)$ is called the canonical partition function. Accordingly, the average energy of the system is given by

$$\langle E \rangle \equiv \sum_\nu p_\nu E_\nu \tag{16}$$

where $\langle \ldots \rangle$ stands for the ensemble average. From the partition function, we can in principle derive all thermodynamic properties [12].

Typically, a molecule contains no more than a few types of nuclei. A similar statement can be made for most materials. However, most systems of practical concern contain a large number of particles. For a macroscopic system, the total number of particles, $N = N_e + \sum_\alpha N_\alpha$, is astronomically large ($\sim 10^{23}$) and approaches infinity in the thermodynamic limit.
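The level structure of Eq. (14) and the averages of Eqs. (15)-(16) can be sketched by brute-force enumeration. The snippet below works in reduced units of $h^2/(2mL^2)$ with a temperature we picked arbitrarily; the truncation at a maximum quantum number is our own approximation.

```python
import numpy as np
from collections import Counter

# Energy levels of a particle in a periodic box, Eq. (14), in units of
# h^2/(2 m L^2): count degeneracies by enumerating quantum numbers.
nmax = 5
levels = Counter(
    nx * nx + ny * ny + nz * nz
    for nx in range(-nmax, nmax + 1)
    for ny in range(-nmax, nmax + 1)
    for nz in range(-nmax, nmax + 1)
)
# lowest nonzero level has 6 states: (+-1,0,0), (0,+-1,0), (0,0,+-1)
print(levels[1])

# Boltzmann weights and average energy, Eqs. (15)-(16), at k_B*T = 2
# in the same reduced units (temperature chosen arbitrarily; the cutoff
# at nmax truncates the partition sum)
E = np.array(sorted(levels))
g = np.array([levels[e] for e in sorted(levels)])
w = g * np.exp(-E / 2.0)
Q = w.sum()                # canonical partition function
E_avg = (w * E).sum() / Q  # ensemble average <E>
print(E_avg)
```

Counting states this way also makes the notion of degeneracy tangible: every distinct assignment of $(n_x, n_y, n_z)$ with the same squared sum contributes one microstate at the same energy.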
Because the dimensionality of the wave function scales linearly with the total number of particles, the Schrödinger equation becomes "much too complicated to be soluble" as the number of particles increases. For practical applications, the essential task is thus to develop "approximate methods of applying quantum and statistical mechanics".

The Born-Oppenheimer (BO) approximation assumes that the electron degrees of freedom can be decoupled from those corresponding to the nuclei, and that the latter can be represented as classical particles with negligible size. The assumption is justifiable because a nuclear particle occupies little volume inside each atom. Besides, the electron rest mass $m_e$ is much smaller than that of a proton $m_p$, the smallest nuclear particle ($m_p / m_e \approx 1836$), so that the motion of the nuclei can often be described by classical mechanics (viz., molecular dynamics simulation).

As the degree of freedom related to nuclear spins is irrelevant for most chemical systems, the configuration of the nuclei can then be specified in terms of their positions, $\mathbf{R}^{N_n} \equiv (\mathbf{R}_1, \mathbf{R}_2, \ldots, \mathbf{R}_{N_n})$.
At a time scale sufficiently long for electron relaxation but short for the motion of nuclei, which is on the order of a fraction of a femtosecond ($10^{-15}$ s), the electrons are approximately in a stationary state subject to an external field arising from electrostatic interactions with the nuclei

$$v(\mathbf{r}) = -\frac{e^2}{4\pi\varepsilon_0} \sum_{I=1}^{N_n} \frac{Z_I}{|\mathbf{R}_I - \mathbf{r}|} \tag{17}$$

The ground state energy and the electronic structure can be determined by solving the Schrödinger equation

$$\left( -\sum_{i=1}^{N} \frac{\hbar^2}{2 m_e} \nabla_i^2 + \frac{e^2}{4\pi\varepsilon_0} \sum_{i=1}^{N} \sum_{j>i}^{N} \frac{1}{|\mathbf{r}_i - \mathbf{r}_j|} + \sum_{i=1}^{N} v(\mathbf{r}_i) \right) \Psi(\mathbf{x}^N) = E \Psi(\mathbf{x}^N) \tag{18}$$

For simplicity of notation, from now on we replace $N_e$ with $N$; $\mathbf{x}^N \equiv \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$ represents the electron configuration, $\Psi(\mathbf{x}^N)$ is the electron wave function, and $E$ is the total electronic energy.

Once the electron wave function is determined from the Schrödinger equation, the force on each nucleus due to the inhomogeneous distribution of electrons can be calculated from the Hellmann-Feynman (HF) equation

$$\mathbf{F}_I = \frac{Z_I e^2}{4\pi\varepsilon_0} \int d\mathbf{r} \, \hat{\rho}(\mathbf{r}) \frac{\mathbf{r} - \mathbf{R}_I}{|\mathbf{R}_I - \mathbf{r}|^3} \tag{19}$$

where $\hat{\rho}(\mathbf{r})$ stands for the electron density. The latter is related to the wave function by

$$\hat{\rho}(\mathbf{r}) = \sum_{i=1}^{N} \int d\mathbf{x}^N \, |\Psi(\mathbf{x}^N)|^2 \delta(\mathbf{r} - \mathbf{r}_i) \tag{20}$$

where $\delta(\mathbf{r} - \mathbf{r}_i)$ is the Dirac delta function. The physical meaning of Eq. (19) is intuitive: the overall force on nucleus $I$ due to the electrons is simply the integral of the local electron number density multiplied by the Coulomb force.

Hydrogen Atom
A normal hydrogen atom contains two particles, i.e., one electron and one proton. Following the BO approximation, we may consider a hydrogen atom as an electron orbiting around the proton, with the electron wave function described by the one-particle Schrödinger equation

$$\left( -\frac{\hbar^2}{2 m_{pe}} \nabla^2 - \frac{e^2}{4\pi\varepsilon_0 r} \right) \Psi(\mathbf{r}) = E \Psi(\mathbf{r}) \tag{21}$$

where $1/m_{pe} \equiv 1/m_e + 1/m_p$ defines the reduced mass, and $r = |\mathbf{r}|$ is the radial distance. Here the proton is placed at the center of the coordinate system. For a single electron, the intrinsic magnetic momentum and thus the spin number play no role in determining the electronic properties of the system.

With the boundary conditions $\Psi(r) = \Psi'(r) = 0$ as $r \to \infty$, Eq. (21) yields an analytical solution. In spherical coordinates, the wave function is given by

$$\Psi(r, \theta, \phi) = N_{n,l} \, x_n^l \, e^{-x_n/2} L_{n,l}(x_n) Y_l^m(\theta, \phi), \tag{22}$$

and the corresponding energy is

$$E_n = -\frac{m_{pe} e^4}{2 (4\pi\epsilon_0)^2 \hbar^2 n^2}. \tag{23}$$

In atomic physics, $(n, l, m)$ are known as the principal, azimuthal, and magnetic quantum numbers, respectively. These quantum numbers take the integer values $n = 1, 2, 3, \ldots$, $l = 0, 1, 2, \ldots, n-1$, and $m = -l, \ldots, l$, and define the atomic orbitals that are commonly used as the basis functions for the wave functions of other atoms and molecular systems.

In Eqs. (22) and (23), $N_{n,l}$ is a normalization constant for the radial component of the wave function

$$N_{n,l} = \sqrt{\left( \frac{2 m_{pe}}{n a_0 m_e} \right)^3 \frac{(n-l-1)!}{2n \left[ (n+l)! \right]^3}} \tag{24}$$

where $a_0 = 4\pi\epsilon_0 \hbar^2 / (m_e e^2)$ is known as the Bohr radius. This universal constant, $a_0 = 5.29 \times 10^{-11}$ m, is often used as the unit length. $L_{n,l}(x)$ stands for an associated Laguerre polynomial of degree $(n-l-1)$, $x_n = 2 r m_{pe} / (n a_0 m_e)$ is the dimensionless radial distance, and $Y_l^m(\theta, \phi)$ is a spherical harmonic function of degree $l$ and order $m$.

A stable hydrogen atom exists in the ground state. In this case, the quantum numbers are $n = 1$, $m = 0$, and $l = 0$, and the minimum energy is

$$E_1 = -\frac{m_{pe}}{m_e} E_{Ryd} \approx -E_{Ryd} \tag{25}$$

where $E_{Ryd} = e^2 / (8\pi\epsilon_0 a_0)$ is known as the Rydberg energy. The Rydberg energy, $2.18 \times 10^{-18}$ J, is a universal constant that is often used as a unit energy in atomic physics. Intuitively, Eq. (25) may be understood as the electrostatic energy between the electron and the proton at an average distance of twice the Bohr radius.

At the ground state, the wave function for a hydrogen atom is given by

$$\Psi(r) = \frac{1}{\sqrt{\pi a_0^3}} \, e^{-r/a_0} \tag{26}$$

Correspondingly, the electron density is

$$\rho(r) = \frac{e^{-2r/a_0}}{\pi a_0^3} \tag{27}$$

The spherically symmetric function decays exponentially and has a maximum value of $1/(\pi a_0^3)$ at the nucleus (at $r = 0$). The ground-state energy, $E_1 \approx -13.598$ eV, represents the energy change when an electron and a proton bind to form a stable hydrogen atom. This energy corresponds to the negative of the hydrogen ionization energy. The changes among different energy levels of the hydrogen atom explain its light emission spectrum, which represents a major triumph in the early development of quantum mechanics.

As illustrated in Box 1.2, one of the simplest examples for the application of the BO approximation is provided by the first principles predictions for the spectrum and ionization energy of atomic hydrogen. In principle, a similar approach can be applied to polyatomic molecules by representing the molecular energy in terms of the electronic contribution plus those related to nuclear motions within the molecule, such as bond stretching, bond vibration, and molecular rotations. The BO model provides a theoretical basis to predict molecular spectroscopy and the thermodynamic properties of ideal gas systems. For a hydrogen atom, we fix the nuclear position, which is treated as the center of coordinates for solving the Schrödinger equation.
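The ground-state numbers quoted above can be reproduced directly from the SI values of the fundamental constants. The short check below (grid and cutoff are our own choices) also verifies that the density of Eq. (27) integrates to one electron.

```python
import numpy as np

# CODATA values of the fundamental constants (SI units)
hbar = 1.054571817e-34   # J s
me   = 9.1093837015e-31  # kg
e    = 1.602176634e-19   # C
eps0 = 8.8541878128e-12  # F/m

# Bohr radius and Rydberg energy from their definitions in the text
a0 = 4 * np.pi * eps0 * hbar**2 / (me * e**2)
E_ryd = e**2 / (8 * np.pi * eps0 * a0)
print(a0)          # ~5.29e-11 m
print(E_ryd / e)   # ~13.6 eV

# normalization of the ground-state density, Eq. (27):
# integral of 4*pi*r^2 * rho(r) dr over all space should be 1
r = np.linspace(0.0, 20 * a0, 20001)
integrand = 4 * np.pi * r**2 * np.exp(-2 * r / a0) / (np.pi * a0**3)
norm = np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(r))
print(norm)        # ~1.0
```

The small difference between $E_{Ryd}/e \approx 13.606$ eV and the quoted ionization energy of 13.598 eV is the reduced-mass correction $m_{pe}/m_e$ in Eq. (25).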
When a system contains multiple nuclei, the electron distribution is in general anisotropic, leading to an atomic force on each nucleus responsible for the molecular configurations as well as atomic motions including chemical reactions. If the nuclei are treated as classical particles, we may describe the motions of nuclei using Newton's equations. The combination of quantum mechanics for the electronic structure calculations and classical physics for the nuclear motions constitutes the essential idea of the Born-Oppenheimer molecular dynamics (BOMD) simulation.

With the nuclei treated as classical particles, the BO approximation greatly simplifies the computational task of predicting the properties of matter from first principles. Not only does the BO approximation reduce the dimensionality of the wave function, it also essentially transforms the complex quantum-mechanical problem into one that is only concerned with electronic structure calculations. Whereas the electronic wave function remains a multidimensional quantity, it represents the property of only a single-component system. In particular, the electron density can be fully determined from the one-body external potential, a three-dimensional function that depends only on the nuclear positions (see Eq. (17)).

An alternative energy unit is the hartree: 1 hartree = 2 rydberg = 27.211 eV.
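The BOMD loop described above can be caricatured in a few lines: at each step an (expensive) electronic-structure solve would supply the Hellmann-Feynman forces of Eq. (19), and the nuclei are then advanced classically by velocity Verlet. In this sketch the electronic solve is replaced by a placeholder harmonic force; every name and parameter is illustrative only, not part of any real BOMD code.

```python
import numpy as np

def electronic_forces(R, k=1.0):
    """Placeholder for the electronic-structure solve: in real BOMD this
    would return the Hellmann-Feynman forces, Eq. (19), at positions R.
    Here it is a toy harmonic restoring force."""
    return -k * R

def velocity_verlet(R, V, m, dt, nsteps):
    """Propagate classical nuclei with the velocity Verlet integrator."""
    F = electronic_forces(R)
    for _ in range(nsteps):
        V_half = V + 0.5 * dt * F / m
        R = R + dt * V_half
        F = electronic_forces(R)          # "re-solve" electrons at new nuclei
        V = V_half + 0.5 * dt * F / m
    return R, V

R0 = np.array([1.0, 0.0, 0.0])
R, V = velocity_verlet(R0, np.zeros(3), m=1.0, dt=0.01, nsteps=1000)

# energy-conservation check for the toy force (k = m = 1): the total
# energy should stay near the initial potential energy of 0.5
E_tot = 0.5 * np.sum(V**2) + 0.5 * np.sum(R**2)
print(E_tot)
```

The point of the sketch is the structure of the loop: forces are recomputed from the electronic problem after every nuclear move, which is exactly what makes BOMD expensive in practice.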
Monte Carlo methods for solving the many-body Schrödinger equation were suggested first by Metropolis and Ulam in 1949 [43]. However, major breakthroughs were not made until the publication of a landmark work by Ceperley and Alder in 1980 [11]. Today quantum Monte Carlo (QMC) simulation probably represents the most generic way to accurately predict electronic properties [44, 30].

The central idea of Monte Carlo methods is to generate a large number of samples using a stochastic process. It converts multidimensional operations into simple mean-value evaluations [19]. The statistical approach finds broad applications in various branches of mathematics for solving high-dimensional optimization problems and integro-differential equations. The development of the Metropolis algorithm (a.k.a. the
M(RT)^2 algorithm) marks a milestone for the broad use of Monte Carlo methods in physical sciences. As stated befittingly in the introductory sentence of their famous paper [43], Monte Carlo methods are

suitable for fast electronic computing machines, of calculating the properties of any substance....

The variational quantum Monte Carlo (VMC) represents one of the simplest ways to evaluate the many-body electronic wave function by using Monte Carlo simulation. The basic idea is that the ground state energy satisfies the variational principle

$$E_v = \frac{\int d\mathbf{x}^N \, \Psi^*(\mathbf{x}^N) \hat{H} \Psi(\mathbf{x}^N)}{\int d\mathbf{x}^N \, \Psi^*(\mathbf{x}^N) \Psi(\mathbf{x}^N)} \geq E_0 \tag{28}$$

where $\Psi(\mathbf{x}^N)$ stands for the wave function of the system in an arbitrary quantum state. The inequality is rather intuitive because, by definition, electrons in an arbitrary quantum state must have an energy no less than the ground-state value. While the mathematical proof is also elementary, evaluation of the energy entails multidimensional integrations that cannot be performed with conventional numerical methods.

In VMC, the multidimensional integration for the system energy is expressed in terms of an expectation value

$$E_v = \int d\mathbf{x}^N \, p(\mathbf{x}^N) E(\mathbf{x}^N) \tag{29}$$

where $E(\mathbf{x}^N)$ represents a local energy density

$$E(\mathbf{x}^N) \equiv \Psi^{-1}(\mathbf{x}^N) \hat{H} \Psi(\mathbf{x}^N) \tag{30}$$

and $p(\mathbf{x}^N)$ is the probability density of the system in configuration $\mathbf{x}^N$

$$p(\mathbf{x}^N) = \frac{|\Psi(\mathbf{x}^N)|^2}{\int d\mathbf{x}'^N \, |\Psi(\mathbf{x}'^N)|^2} \tag{31}$$

The Metropolis algorithm provides a convenient way to sample the configurational space with probability $p(\mathbf{x}^N)$. The probability of acceptance for a transition from configuration $\mathbf{x}^N_o$ to $\mathbf{x}^N_n$ is given by

$$\mathrm{acc}(\mathbf{x}^N_n | \mathbf{x}^N_o) = \min \left\{ 1, \frac{\tau(\mathbf{x}^N_o | \mathbf{x}^N_n) \, |\Psi(\mathbf{x}^N_n)|^2}{\tau(\mathbf{x}^N_n | \mathbf{x}^N_o) \, |\Psi(\mathbf{x}^N_o)|^2} \right\} \tag{32}$$

where $\tau(\mathbf{x}^N_n | \mathbf{x}^N_o)$ represents the probability of a trial move from configuration $\mathbf{x}^N_o$ to $\mathbf{x}^N_n$.
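To make Eqs. (28)-(32) concrete, here is a minimal VMC sketch for the hydrogen atom in atomic units, assuming the trial wave function $\psi_\alpha(r) = e^{-\alpha r}$ (our choice, not from the text). Its local energy, Eq. (30), works out to $E_L(r) = -\alpha^2/2 + (\alpha - 1)/r$, which is constant at $\alpha = 1$, where the trial function is exact.

```python
import numpy as np

rng = np.random.default_rng(1)

def local_energy(r, alpha):
    # E_L = psi^-1 H psi for psi = exp(-alpha*r), H = -grad^2/2 - 1/r (a.u.)
    return -0.5 * alpha**2 + (alpha - 1.0) / r

def vmc(alpha, nsteps=100_000, burn=5_000, delta=1.0):
    x = np.array([1.0, 0.0, 0.0])
    r = np.linalg.norm(x)
    e_sum, n_samp = 0.0, 0
    for step in range(nsteps):
        x_new = x + delta * rng.uniform(-1.0, 1.0, 3)  # random displacement
        r_new = np.linalg.norm(x_new)
        # Metropolis acceptance, Eq. (32), with symmetric proposals:
        # |psi(new)|^2 / |psi(old)|^2 = exp(-2*alpha*(r_new - r))
        if rng.random() < np.exp(-2.0 * alpha * (r_new - r)):
            x, r = x_new, r_new
        if step >= burn:                 # discard burn-in, then accumulate
            e_sum += local_energy(r, alpha)
            n_samp += 1
    return e_sum / n_samp

e_exact = vmc(1.0)   # zero-variance case: every sample gives -0.5 hartree
e_above = vmc(0.8)   # variational bound, Eq. (28): mean lies above -0.5
print(e_exact, e_above)
```

The $\alpha = 1$ run illustrates the zero-variance property: the closer the trial function is to an eigenstate, the smaller the statistical error, which is why the quality of the trial wave function matters so much in practice.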
A simple procedure to accomplish the Monte Carlo move is a random displacement of the electron configuration

$$\mathbf{x}^N_n = \mathbf{x}^N_o + \boldsymbol{\xi}^N \Delta \tag{33}$$

where $\boldsymbol{\xi}^N$ is a $3N$-dimensional vector of uniformly distributed random numbers between $-1$ and $1$, and $\Delta$ sets the maximum displacement. The sampled configurations converge to the probability density $p(\mathbf{x}^N)$ after a sufficiently large number of Monte Carlo moves. As a result, the variational energy can be obtained by averaging over these "sampled" configurations

$$E_v \approx \frac{1}{M} \sum_{i=1}^{M} E(\mathbf{x}^N_i) \tag{34}$$

where $M$ denotes the number of samples. In stark contrast to Eq. (28), Eq. (34) involves no high-dimensional integration. Because the summation is independent of the dimensionality of the wave function, the Monte Carlo method drastically reduces the computational cost for evaluation of the variational energy. In the statistics literature, the sampling scheme used in VMC is known as the Metropolis algorithm, which is widely used for sampling from the posterior distribution in Markov chain Monte Carlo methods for Bayesian inference.

To minimize the variational energy, one may express the wave function in the so-called Jastrow-Slater form

$$\Psi(\mathbf{x}^N) = e^{J(\mathbf{x}^N)} \Phi(\mathbf{x}^N) \tag{35}$$

where $J(\mathbf{x}^N)$ is known as the Jastrow factor, and $\Phi(\mathbf{x}^N)$ is a Slater determinant (or a linear combination of Slater determinants). The Jastrow factor accounts for the electron-electron and electron-nuclear correlations that are neglected in $\Phi(\mathbf{x}^N)$. The correlation effects are typically written in terms of semi-empirical functions of the particle-particle distances, with the parameters obtained by minimization of the ground-state energy. The Slater determinant can be obtained from Hartree-Fock-like low-level QM calculations.

Uniform Electron Gas
One primordial example of QMC applications is the study of the equilibrium properties of the uniform electron gas, either at the ground state (0 K) or at finite temperatures. Historically, the simulation results have played an instrumental role in the formulation of the local density approximation (LDA) (see Section 1.4). From a theoretical perspective, the properties of the uniform electron gas also provide a useful reference for understanding inhomogeneous electronic systems and benchmark data for theoretical developments of new DFT functionals.

Figure 1 presents the spin-resolved radial distribution functions (RDF) for several uniform electron gases at 0 K. Here the results calculated from VMC are compared with those from a theoretical method [57]. Similar to its classical counterpart, the RDF describes the normalized local density of electrons, $g(r) = \rho(r)/\rho_0$, given that another electron is found at the origin. For a uniform system of isotropic particles, the RDF is a function of both the bulk density and the radial distance $r = |\mathbf{r}|$. Because of electrostatic interactions and the Pauli exclusion principle, the RDF of a uniform electron gas also depends on the spin state as well as the bulk electron density $\rho_0$. In Figure 1, the bulk density is expressed in terms of the reduced Wigner-Seitz radius

$$r_s = \left( \frac{3}{4\pi \rho_0 a_0^3} \right)^{1/3} \tag{36}$$

where $a_0 = 5.29 \times 10^{-11}$ m is the Bohr radius. Despite the divergence of the Coulomb potential at $r =$
0, the RDF for electronsof opposite spins remains a finite value at the origin, manifesting the wave nature ofelectrons. The contact value falls as the reduced Wigner-Seitz radius increases from r s = mulating the First Principles of Matter: A Probabilistic Roadmap 13 Fig. 1
Radial distribution functions for different electron pairs (spin up and down) in a uniform paramagnetic system at T = 0 K and different electronic densities. Here $r_s$ is the reduced Wigner-Seitz radius, and $k_F = (3\pi^2 \rho_0)^{1/3}$. Solid lines are from an analytical theory and symbols are from variational Monte Carlo simulation. Reproduced from [57].

The comparison suggests that the contact value of the RDF arises from the electrostatic correlation, which leads to an effective attraction among electrons.

Diffusion Monte Carlo (DMC) exploits the formal similarity between the imaginary-time Schrödinger equation and a diffusion equation over time, which can be represented in terms of a stochastic process (a.k.a. a random walk process). Interestingly, the idea of DMC was discussed in the seminal article by Metropolis and Ulam [43]. DMC can be used to calculate the properties of transition metal compounds, electrons at excited states, and weak intermolecular interactions. In general, it is more accurate than VMC but is also computationally much more demanding. To a certain degree, path-integral Monte Carlo (PIMC) is similar to DMC, but it utilizes Monte Carlo methods to sample the "diffusion" paths. PIMC is commonly used to study the properties of many-particle systems at finite temperature, such as superfluids and plasmas. On the other hand, full configuration interaction quantum Monte Carlo (FCIQMC) directly samples the Slater determinants with Monte Carlo methods. It is applicable to a variety of chemical systems and solids but, at present, is most suitable for relatively small systems because of the high computational cost.

Since the original concepts were introduced in the mid-1960s by Pierre Hohenberg, Walter Kohn and Lu Jeu Sham [28, 31], density functional theory (DFT) has evolved into one of the most widely used computational tools in condensed matter physics, chemistry, materials science, and more recently, biology as well as engineering.
As an alternative to conventional many-body wave function methods, DFT is drastically more efficient from a computational perspective and has been used to predict the properties of matter of virtually all kinds, as reported in over ten thousand publications every year. Despite its great popularity, DFT remains one of the most misunderstood theoretical methods, not necessarily in the sense that its usefulness is questioned or that its predictions are incomprehensible due to its intrinsic connection with quantum mechanics, which has always been mysterious, but in the sense that its foundation, limitations, and scopes of application or misapplication have been routinely muddled even by well-respected experts in the field. To a certain degree, the situation is well summarized by Sean Carroll, a theoretical physicist at the California Institute of Technology, who remarked in an Op-Ed essay in the New York Times [10]:
What’s surprising is that physicists seem to be O.K. with not understanding the most impor-tant theory they have.
DFT had been an obscure theory, very much ignored by the scientific community for decades, before it reached today's glory. In one of his last publications [32], Walter Kohn wrote on the occasion celebrating fifty years of DFT:
As many theoretical chemists can confirm from personal experience, Density Functional Theory (DFT), for several decades after the publication of the Hohenberg-Kohn theorem (in 1964), was unfavorably received by many leading traditional quantum theorists of electronic structure, including John Pople.
As is well known, John Pople and Walter Kohn were colleagues at the same institute for a number of years and shared the Nobel Prize in Chemistry in 1998!
Basics of Statistical Mechanics
Before discussing the generic ideas of DFT, it is instructive to recall a few basic concepts from statistical mechanics. Consider a many-body system with volume $V$, temperature $T$, and a one-body potential for each type of particles, $v_\alpha(\mathbf{r})$. At equilibrium, the microstates constitute a grand canonical ensemble, which encompasses all quantum states of the system as described by particles in different configurations. The equilibrium properties of the system can be expressed in terms of various forms of ensemble averages [12].

The one-body potential $v_\alpha(\mathbf{r})$ is referred to as a point energy applied to each particle of type $\alpha$. This function is invariant with the system configuration, i.e., it is independent of the microstates of the system. For example, if we consider a uniform electron gas, the one-body potential corresponds to the negative of the electron chemical potential, $\mu$, which is a constant defined by the system temperature and the bulk electron density. For an inhomogeneous electronic system, as discussed in Section 1.2, the one-body potential is given by that corresponding to a uniform electronic system plus the Coulomb energy due to the electron interaction with the nuclei (see Eq. (17)).

The one-body particle density is defined as an ensemble average of the number density of particles of type $\alpha$ over different microstates
$$\rho_\alpha(\mathbf{r}) = \langle \hat\rho_\alpha(\mathbf{r}) \rangle = \sum_{i_\alpha} \big\langle \delta(\mathbf{r} - \mathbf{r}_{i_\alpha}) \big\rangle \qquad (37)$$
where $\hat\rho_\alpha(\mathbf{r})$ stands for an instantaneous particle density (e.g., see Eq. (20)). In the grand canonical ensemble, the particle numbers in the system are not fixed; they fluctuate along with the microstates.

At a given microstate, the system energy and the density profiles of all species are determined by the many-body Schrödinger equation.
The probability of the system at each microstate is then given by
$$p_\nu = \Xi^{-1} \exp\Big\{-\beta\Big[K_\nu + \Gamma_\nu + \sum_\alpha \int d\mathbf{r}\, \hat\rho_\alpha(\mathbf{r})\, v_\alpha(\mathbf{r})\Big]\Big\} \qquad (38)$$
where $\beta = 1/(k_B T)$, $K_\nu$ and $\Gamma_\nu$ are, respectively, the kinetic and potential energies of the system at microstate $\nu$, and $\Xi$ stands for the grand partition function
$$\Xi \equiv \sum_\nu \exp\Big\{-\beta\Big[K_\nu + \Gamma_\nu + \sum_\alpha \int d\mathbf{r}\, \hat\rho_\alpha(\mathbf{r})\, v_\alpha(\mathbf{r})\Big]\Big\} \qquad (39)$$
Eq. (38) can be derived from the second law of thermodynamics, i.e., the system entropy is maximized subject to appropriate constraints. Alternatively, it may be obtained from the Gibbs variational principle
$$\Omega[p_\nu] \le \Omega[p'_\nu] \qquad (40)$$
where
$$\Omega[p_\nu] \equiv \sum_\nu p_\nu \Big[k_B T \ln p_\nu + K_\nu + \Gamma_\nu + \sum_\alpha \int d\mathbf{r}\, \hat\rho_\alpha(\mathbf{r})\, v_\alpha(\mathbf{r})\Big] \qquad (41)$$
and $p'_\nu$ stands for the probability for an arbitrary distribution of the microstates. In Eq. (40), the equal sign holds only when $p_\nu = p'_\nu$.

The grand potential of the system is defined as
$$\Omega \equiv -k_B T \ln \Xi = \Omega[p_\nu] \qquad (42)$$
where $p_\nu$ corresponds to the equilibrium probability. From a thermodynamic perspective, the grand potential is the free energy of an open system, which takes a minimum value at equilibrium.

In a nutshell, DFT may be summarized in terms of two theorems and one corollary. These theorems were first established by Hohenberg and Kohn for inhomogeneous electronic systems at 0 K [28] and later extended by Mermin to electronic systems at finite temperature [42]. In essence, the Hohenberg-Kohn (HK) theorem shows a unique relationship between the one-body density and the one-body potential without entailing any specific knowledge of the microstates of a many-particle system. As a result, it holds true for electrons at 0 K as well as for multi-component thermodynamic systems of either quantum or classical particles [21, 9, 18, 13]. The corollary is known as the Kohn-Sham (KS) scheme or KS ansatz.
It is instrumental for practical applications of various DFT methods for electronic systems [31].

Despite its profound implications, the proof of the HK theorem (and its variations) is rather straightforward. In the following, we discuss these theorems and the corollary in their general form.
Theorem 1
For a many-particle system of volume $V$ and temperature $T$, the one-body potential for each type of particles, $v_\alpha(\mathbf{r})$, and hence all equilibrium properties of the system, can be uniquely determined by the one-body density profiles $\rho_\alpha(\mathbf{r})$.

Proof
As discussed above, an open system can be defined by volume $V$, temperature $T$, and the one-body potential for each type of particles, $v_\alpha(\mathbf{r})$. Correspondingly, there exists a set of equilibrium one-body density profiles $\rho_\alpha(\mathbf{r})$ corresponding to the statistical distributions of particles in the system. Suppose that two one-body potentials, $v_\alpha(\mathbf{r})$ and $v'_\alpha(\mathbf{r})$, lead to the same one-body density, $\rho_\alpha(\mathbf{r})$. These one-body potentials would generate two sets of probabilities for the equilibrium distributions of the microstates, $p_\nu$ and $p'_\nu$. These probabilities yield the same one-body density:
$$\rho_\alpha(\mathbf{r}) = \sum_\nu p_\nu\, \hat\rho_\alpha(\mathbf{r}) = \sum_\nu p'_\nu\, \hat\rho_\alpha(\mathbf{r}) \qquad (43)$$
According to the Gibbs variational principle, we have
$$\Omega[p_\nu] \le \sum_\nu p'_\nu \big[k_B T \ln p'_\nu + K_\nu + \Gamma_\nu\big] + \sum_\alpha \int d\mathbf{r}\, \rho_\alpha(\mathbf{r})\, v_\alpha(\mathbf{r}) = \Omega'[p'_\nu] + \sum_\alpha \int d\mathbf{r}\, \rho_\alpha(\mathbf{r})\big[v_\alpha(\mathbf{r}) - v'_\alpha(\mathbf{r})\big] \qquad (44)$$
As both $p_\nu$ and $p'_\nu$ correspond to equilibrium distributions for the microstates of the system, the same inequality holds when the primed and unprimed quantities switch positions,
$$\Omega'[p'_\nu] \le \Omega[p_\nu] + \sum_\alpha \int d\mathbf{r}\, \rho_\alpha(\mathbf{r})\big[v'_\alpha(\mathbf{r}) - v_\alpha(\mathbf{r})\big] \qquad (45)$$
Because the particle density is everywhere non-negative, the only way to satisfy both inequalities is $v_\alpha(\mathbf{r}) = v'_\alpha(\mathbf{r})$. In other words, the one-body potentials must be uniquely determined by the one-body density profiles.

Theorem 1 indicates that, in principle, one can determine the one-body potentials from the one-body density profiles. With the one-body potentials, all equilibrium properties of the system, including the distribution of microstates $p_\nu$, can be subsequently calculated by using standard statistical-mechanical methods. □

Theorem 2
For any system of volume $V$, temperature $T$, and a one-body potential for each type of particles $v_\alpha(\mathbf{r})$, the equilibrium one-body density profiles $\rho_\alpha(\mathbf{r})$ are determined by minimizing the grand potential
$$\Omega[\rho_\alpha(\mathbf{r})] \equiv F[\rho_\alpha(\mathbf{r})] + \sum_\alpha \int d\mathbf{r}\, \rho_\alpha(\mathbf{r})\, v_\alpha(\mathbf{r}) \qquad (46)$$
where $F$ is known as the intrinsic Helmholtz energy
$$F[\rho_\alpha(\mathbf{r})] \equiv \sum_\nu p_\nu \big[k_B T \ln p_\nu + K_\nu + \Gamma_\nu\big]. \qquad (47)$$

Proof
This theorem is also known as the HK variational principle. The proof proceeds as follows. Suppose that $\rho'_\alpha(\mathbf{r})$ is the equilibrium density associated with any other one-body potential $v'_\alpha(\mathbf{r})$, which generates probability $p'_\nu$ for the distribution of microstates. Because $v'_\alpha(\mathbf{r})$, and subsequently $p'_\nu$, are uniquely determined by $\rho'_\alpha(\mathbf{r})$, we can rewrite the Gibbs variational principle (see Eq. (40)) as
$$\Omega[\rho_\alpha(\mathbf{r})] \le \Omega[\rho'_\alpha(\mathbf{r})] \qquad (48)$$
Therefore, the equilibrium density $\rho_\alpha(\mathbf{r})$ minimizes the grand potential.

It is worth noting that $\rho'_\alpha(\mathbf{r})$ must be associated with some meaningful one-body potential $v'_\alpha(\mathbf{r})$. Otherwise, $p'_\nu$ is not even defined and thus the inequality may not be valid. The inherent constraint on the density profiles in the HK variational principle is known as that of "v-representable densities" [34]. □

Corollary 1
For any equilibrium system of volume $V$, temperature $T$, and one-body density profiles $\rho_\alpha(\mathbf{r})$, there exists a non-interacting reference system that reproduces the one-body density profiles.

Formally, any intrinsic Helmholtz energy can be expressed in terms of that corresponding to a non-interacting reference system of the same $V$ and $T$, $F_0[\rho_\alpha(\mathbf{r})]$, plus the difference, $\Delta F[\rho_\alpha(\mathbf{r})]$:
$$F[\rho_\alpha(\mathbf{r})] \equiv F_0[\rho_\alpha(\mathbf{r})] + \Delta F[\rho_\alpha(\mathbf{r})] \qquad (49)$$
Given a set of density profiles, $\rho_\alpha(\mathbf{r})$, Theorem 1 indicates that a unique set of one-body potentials, $v_{0,\alpha}(\mathbf{r})$, can be determined for the reference system. According to Theorem 2, the density profiles minimize the grand potential of the system under consideration as well as that of the non-interacting reference system
$$\Omega[\rho_\alpha(\mathbf{r})] = F_0[\rho_\alpha(\mathbf{r})] + \Delta F[\rho_\alpha(\mathbf{r})] + \sum_\alpha \int d\mathbf{r}\, \rho_\alpha(\mathbf{r})\, v_\alpha(\mathbf{r}) \qquad (50)$$
$$\Omega_0[\rho_\alpha(\mathbf{r})] = F_0[\rho_\alpha(\mathbf{r})] + \sum_\alpha \int d\mathbf{r}\, \rho_\alpha(\mathbf{r})\, v_{0,\alpha}(\mathbf{r}) \qquad (51)$$
Following the HK variational principle, $\delta\Omega/\delta\rho_\alpha(\mathbf{r}) = \delta\Omega_0/\delta\rho_\alpha(\mathbf{r}) = 0$, we obtain an explicit expression for the one-body potentials of the reference system
$$v_{0,\alpha}(\mathbf{r}) = v_\alpha(\mathbf{r}) + \frac{\delta \Delta F[\rho_\alpha(\mathbf{r})]}{\delta \rho_\alpha(\mathbf{r})} \qquad (52)$$
Because the intrinsic Helmholtz energy for a system of non-interacting particles is relatively easy to evaluate, the KS scheme provides a feasible way to carry out the HK variational principle without specific knowledge about the microstates of the real many-body system.

It is worth noting that the Hohenberg-Kohn (HK) theorem and the Kohn-Sham (KS) scheme are valid not only for many-body systems at the ground state but, in general, for any thermodynamic systems. While the vast majority of DFT calculations to date are concerned only with electrons at 0 K, more applications of DFT to "multi-component" systems have been emerging in recent years.

The Kohn-Sham DFT
Consider a spin-symmetric system containing $2N$ electrons at a nondegenerate ground state. The HK theorem asserts that the ground-state energy can be obtained from the variational principle. In the KS scheme, the reference system consists of non-interacting electrons, i.e., ideal Fermions, which provides a basis to evaluate the variational energy.

The wave function of ideal Fermions can be expressed in terms of the Slater determinant
$$\Psi_0(\mathbf{r}_1, \mathbf{r}_2, \dots, \mathbf{r}_N) = \frac{1}{\sqrt{N!}}\begin{vmatrix} \psi_1(\mathbf{r}_1) & \psi_2(\mathbf{r}_1) & \dots & \psi_N(\mathbf{r}_1) \\ \psi_1(\mathbf{r}_2) & \psi_2(\mathbf{r}_2) & \dots & \psi_N(\mathbf{r}_2) \\ \vdots & \vdots & & \vdots \\ \psi_1(\mathbf{r}_N) & \psi_2(\mathbf{r}_N) & \dots & \psi_N(\mathbf{r}_N) \end{vmatrix} \qquad (53)$$
where $\psi_i(\mathbf{r})$, $i = 1, 2, \dots, N$, represents a single-particle wave function. The Slater determinant accounts for the Fermion-exchange effect that remains between "non-interacting" electrons. In stark contrast to that for the interacting electronic system, the wave function for ideal Fermions can be decomposed as a product of 3-dimensional functions.

In the presence of a one-body potential $v_0(\mathbf{r})$, the single-particle wave functions in the Slater determinant satisfy the one-particle Schrödinger equation
$$\Big[-\frac{\hbar^2}{2m_e}\nabla^2 + v_0(\mathbf{r})\Big]\psi_i(\mathbf{r}) = \epsilon_i \psi_i(\mathbf{r}) \qquad (54)$$
where $\epsilon_i$ represents the single-particle energy of an ideal Fermion. According to the KS scheme, the single-particle wave functions of the ideal Fermions must satisfy the orthonormal conditions (Eq. (4)) and reproduce the electron density of the real system
$$\rho(\mathbf{r}) = 2\sum_{i=1}^N |\psi_i(\mathbf{r})|^2 \qquad (55)$$
where the factor of 2 accounts for spin pairs.

To find the one-body potential in the reference system ($v_0(\mathbf{r})$ in Eq. (54)), we use the HK variational principle.
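To make Eq. (53) concrete, the short sketch below evaluates a Slater determinant numerically and checks its antisymmetry under particle exchange. The one-dimensional harmonic-oscillator orbitals are an illustrative choice; any orthonormal set of single-particle wave functions would serve.

```python
import numpy as np
from math import factorial, sqrt, pi

# Illustration of the Slater determinant, Eq. (53), for N = 3 ideal Fermions in 1D.
# The orbitals are the first three harmonic-oscillator eigenfunctions in atomic
# units (an illustrative choice, not from the chapter).
orbitals = [
    lambda x: pi**-0.25 * np.exp(-x**2 / 2),
    lambda x: pi**-0.25 * sqrt(2.0) * x * np.exp(-x**2 / 2),
    lambda x: pi**-0.25 / sqrt(2.0) * (2 * x**2 - 1) * np.exp(-x**2 / 2),
]

def slater(positions):
    """Value of the Slater-determinant wave function at the given particle positions."""
    n = len(positions)
    a = np.array([[orbitals[j](positions[i]) for j in range(n)] for i in range(n)])
    return np.linalg.det(a) / sqrt(factorial(n))

r = [0.3, -0.7, 1.2]
r_swapped = [-0.7, 0.3, 1.2]          # exchange particles 1 and 2
print(slater(r), slater(r_swapped))   # equal magnitude, opposite sign
```

Swapping two particle positions swaps two rows of the determinant, which flips its sign: this is the Fermion exchange antisymmetry built into Eq. (53).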
The energy of the reference system and that of the real system are given by, respectively,
$$E_0[\rho(\mathbf{r})] = K_0[\rho(\mathbf{r})] + \int d\mathbf{r}\, \rho(\mathbf{r})\, v_0(\mathbf{r}) \qquad (56)$$
$$E[\rho(\mathbf{r})] = K_0[\rho(\mathbf{r})] + \Delta E[\rho(\mathbf{r})] + \int d\mathbf{r}\, \rho(\mathbf{r})\, v(\mathbf{r}) \qquad (57)$$
where $\Delta E[\rho(\mathbf{r})]$ represents the difference between the intrinsic energies of the reference and real systems. In the former case, the intrinsic energy corresponds to the kinetic energy of ideal Fermions, $K_0[\rho(\mathbf{r})]$. Meanwhile, the intrinsic energy of the electrons includes both kinetic and potential contributions.

Formally, $\Delta E[\rho(\mathbf{r})]$ can be written as
$$\Delta E[\rho(\mathbf{r})] = K[\rho(\mathbf{r})] - K_0[\rho(\mathbf{r})] + \Gamma[\rho(\mathbf{r})] \equiv J[\rho(\mathbf{r})] + E_{xc}[\rho(\mathbf{r})] \qquad (58)$$
where $J[\rho(\mathbf{r})]$ is known as the Hartree energy, which accounts for the direct electrostatic interaction among the electrons
$$J[\rho(\mathbf{r})] = \frac{e^2}{8\pi\varepsilon_0}\int d\mathbf{r}\int d\mathbf{r}'\, \frac{\rho(\mathbf{r})\,\rho(\mathbf{r}')}{|\mathbf{r} - \mathbf{r}'|} \qquad (59)$$
and $E_{xc}[\rho(\mathbf{r})]$, an unknown quantity, defines the exchange-correlation energy.

A comparison of the functional derivatives of the real and reference energies with respect to the electron density leads to
$$v_0(\mathbf{r}) = v(\mathbf{r}) + v_c(\mathbf{r}) + v_{xc}(\mathbf{r}) \qquad (60)$$
where $v_c(\mathbf{r})$ is the Coulomb potential
$$v_c(\mathbf{r}) \equiv \delta J[\rho(\mathbf{r})]/\delta\rho(\mathbf{r}) = \frac{e^2}{4\pi\varepsilon_0}\int d\mathbf{r}'\, \frac{\rho(\mathbf{r}')}{|\mathbf{r} - \mathbf{r}'|} \qquad (61)$$
and $v_{xc}(\mathbf{r})$ is known as the exchange-correlation potential
$$v_{xc}(\mathbf{r}) \equiv \delta E_{xc}/\delta\rho(\mathbf{r}) \qquad (62)$$
Substituting Eq. (60) into (54) leads to the celebrated KS equation [31]
$$\Big[-\frac{\hbar^2}{2m_e}\nabla^2 + v_c(\mathbf{r}) + v_{xc}(\mathbf{r}) + v(\mathbf{r})\Big]\psi_i(\mathbf{r}) = \epsilon_i \psi_i(\mathbf{r}) \qquad (63)$$
So far the theoretical procedure is exact, except that $E_{xc}[\rho(\mathbf{r})]$ remains unknown. Because the exchange-correlation energy is related to the difference between the energy of many electrons and that of ideal Fermions with the same one-body density, an exact expression for $E_{xc}[\rho(\mathbf{r})]$ can be attained only by solving the original many-body problem.
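The single-particle eigenvalue problem in Eq. (54) can be illustrated numerically. The sketch below diagonalizes a finite-difference Hamiltonian for a hypothetical harmonic effective potential standing in for $v_0(\mathbf{r})$ (a real KS calculation would update it self-consistently through Eq. (60)) and assembles the density according to Eq. (55).

```python
import numpy as np

# Toy illustration of Eqs. (54)-(55): solve the one-particle Schrodinger equation
# on a grid by finite differences (atomic units, hbar = m_e = 1) and build the
# density from the N lowest orbitals with a factor of 2 for spin pairs.
# The harmonic well below is a hypothetical stand-in for v0(r).
n_grid, L = 801, 16.0
x = np.linspace(-L / 2, L / 2, n_grid)
h = x[1] - x[0]
v0 = 0.5 * x**2

# Hamiltonian: -1/2 d^2/dx^2 (three-point stencil) + diag(v0)
lap = (np.diag(np.full(n_grid - 1, 1.0), -1) - 2 * np.eye(n_grid)
       + np.diag(np.full(n_grid - 1, 1.0), 1)) / h**2
H = -0.5 * lap + np.diag(v0)

eps, psi = np.linalg.eigh(H)        # single-particle energies and orbitals
psi /= np.sqrt(h)                   # normalize so that sum(|psi|^2) * h = 1

N = 2                               # 2N = 4 electrons
rho = 2 * np.sum(psi[:, :N]**2, axis=1)   # Eq. (55)
print(eps[:2])                      # close to the exact harmonic values 0.5, 1.5
print(rho.sum() * h)                # 4 (= 2N) up to rounding
```

For this test potential the exact single-particle energies are known, so the finite-difference eigenvalues provide a direct accuracy check of the discretization.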
One remarkable feature of DFT is that reasonable results can be achieved even with relatively simple approximations. For example, for systems such as metals that have nearly uniform electron densities, a reasonable guess of the exchange-correlation energy is provided by the so-called local density approximation (LDA)
$$E_{xc}[\rho(\mathbf{r})] = \int d\mathbf{r}\, \rho(\mathbf{r})\, \epsilon_{xc}(\rho(\mathbf{r})) \qquad (64)$$
where $\epsilon_{xc}(\rho)$ is the per-electron exchange-correlation energy of a uniform electron gas of density $\rho$. As discussed above, accurate results for $\epsilon_{xc}(\rho)$ can be obtained from quantum Monte Carlo simulation. Correspondingly, the exchange-correlation potential is given by
$$v_{xc}(\mathbf{r}) = \epsilon_{xc}(\rho(\mathbf{r})) + \rho(\mathbf{r})\frac{d\epsilon_{xc}}{d\rho} \qquad (65)$$
Understandably, the LDA breaks down for systems with highly inhomogeneous electron distributions. Tremendous efforts have been devoted to the development of better approximations for the exchange-correlation energy since the 1980s [40].

With an approximate expression for the exchange-correlation energy, the KS equations can be solved with various numerical methods [37]. Subsequently, the ground-state energy can be calculated from
$$E[\rho(\mathbf{r})] = 2\sum_{i=1}^N \epsilon_i - J[\rho(\mathbf{r})] - \int d\mathbf{r}\, \rho^2(\mathbf{r})\frac{d\epsilon_{xc}}{d\rho} \qquad (66)$$
It is worth noting that the KS equation applies only to the reference system of ideal Fermions, i.e., the non-interacting reference system. While the reference system reproduces the one-body density of the real electronic system, its total energy is NOT the same as the ground-state energy of the real system. Neither the single-particle energy levels nor the single-particle wave functions of the ideal Fermions are directly relevant to any physical quantities of the real electronic system under consideration. In the KS scheme, the reference system is introduced in order to avoid the direct evaluation of the many-body wave functions.
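As a concrete instance of Eqs. (64)-(65), the sketch below uses the exchange-only part of the LDA, with the Dirac exchange energy per electron of a uniform electron gas (a standard closed form, in atomic units), and verifies numerically that Eq. (65) reproduces the analytic relation $v_x = (4/3)\,\epsilon_x$ implied by the $\rho^{1/3}$ scaling.

```python
import numpy as np

# Exchange-only illustration of the LDA, Eqs. (64)-(65), using the Dirac exchange
# energy per electron of a uniform electron gas (atomic units):
#   eps_x(rho) = -(3/4) (3/pi)^(1/3) rho^(1/3).
def eps_x(rho):
    return -0.75 * (3.0 / np.pi)**(1.0 / 3.0) * rho**(1.0 / 3.0)

def v_x(rho, d=1e-6):
    # Eq. (65): v_xc = eps_xc + rho * d(eps_xc)/d(rho), derivative taken numerically
    deps = (eps_x(rho + d) - eps_x(rho - d)) / (2 * d)
    return eps_x(rho) + rho * deps

rho = 0.8
print(v_x(rho))                    # from Eq. (65) with a numerical derivative
print(4.0 / 3.0 * eps_x(rho))      # analytic check: v_x = (4/3) eps_x
```

The correlation part of $\epsilon_{xc}$, which has no simple closed form and is parameterized from quantum Monte Carlo data, is omitted here for brevity.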
Another point one should keep in mind is that, at least in its original form, KS-DFT is concerned only with the ground-state properties of electronic systems at 0 K.

For systems at finite temperature, DFT may be considered as implementing thermodynamic calculations in the Hilbert space: instead of using equations of state to represent thermodynamic properties as functions of macroscopic variables, DFT calculations are based on the HK variational principle, with the thermodynamic properties formulated as functionals of the one-body density profiles. Given an analytical expression for the grand potential functional, one can derive all thermodynamic properties, including multi-body correlation functions [40]. Just as thermodynamics offers no information on the equations of state for any macroscopic system, the HK theorems provide little insight on how one may formulate the grand-potential functional. Like equations of state for bulk thermodynamic systems, reliable density functionals can only be derived from quantum and statistical mechanics, often entailing complicated mathematical procedures.

The KS scheme is instrumental not only in the practical implementation of DFT calculations but also for the formulation of the functionals. By adopting a non-interacting system as the reference, it circumvents direct consideration of the microscopic details of interacting particles, thereby simplifying the physical picture and greatly reducing the computational effort. A similar approach has been commonly practiced in applied thermodynamics. Whereas the functionals of the real systems under consideration are typically unknown, a generic strategy may be used to derive $\Delta F$, the difference between the intrinsic Helmholtz energy of the real system and that of the non-interacting reference system.
The method is known as "adiabatic connection" in quantum mechanics, or the "adiabatic principle" and "thermodynamic integration" in statistical mechanics [46].

For a system of electrons and nuclei, its connection with the non-interacting reference system can in general be established by scaling the Coulomb potential (see Eq. (3)):
$$\hat H_\lambda = \hat H_0 + \lambda \hat H_C \qquad (67)$$
When $\lambda = 0$, $\hat H_\lambda$ corresponds to the Hamiltonian of a non-interacting reference system, and $\lambda = 1$ recovers the Hamiltonian of the real system of interacting particles. The derivative of the energy with respect to the coupling parameter is
$$\frac{dE}{d\lambda} = \left\langle \frac{d\hat H_\lambda}{d\lambda} \right\rangle_{\Psi_\lambda} \qquad (68)$$
More explicitly, Eq. (68) can be written as
$$\frac{dE}{d\lambda} = \frac{e^2}{8\pi\varepsilon_0}\sum_\alpha\sum_{\alpha'}\int d\mathbf{r}\int d\mathbf{r}'\, \hat\rho_{\alpha,\alpha'}(\mathbf{r}, \mathbf{r}'|\lambda)\,\frac{Z_\alpha Z_{\alpha'}}{|\mathbf{r} - \mathbf{r}'|} \qquad (69)$$
where
$$\hat\rho_{\alpha,\alpha'}(\mathbf{r}, \mathbf{r}'|\lambda) = \left\langle \sum_{i_\alpha}\sum_{i_{\alpha'}} \delta(\mathbf{r} - \mathbf{r}_{i_\alpha})\,\delta(\mathbf{r}' - \mathbf{r}_{i_{\alpha'}}) \right\rangle_{\Psi_\lambda} \qquad (70)$$
According to the thermodynamic integration method, the change in the free energy due to the inter-particle potential is
$$\Delta F[\rho_\alpha(\mathbf{r})] = \int_0^1 d\lambda \left\langle \frac{dE}{d\lambda} \right\rangle_\lambda = \frac{e^2}{8\pi\varepsilon_0}\int_0^1 d\lambda \sum_\alpha\sum_{\alpha'}\int d\mathbf{r}\int d\mathbf{r}'\, \rho_{\alpha,\alpha'}(\mathbf{r}, \mathbf{r}'|\lambda)\,\frac{Z_\alpha Z_{\alpha'}}{|\mathbf{r} - \mathbf{r}'|} \qquad (71)$$
where the subscript $\lambda$ denotes the ensemble average over the configurations of the system with rescaled Hamiltonian $\hat H_\lambda$, and the two-body density function is defined as
$$\rho_{\alpha,\alpha'}(\mathbf{r}, \mathbf{r}'|\lambda) = \big\langle \hat\rho_{\alpha,\alpha'}(\mathbf{r}, \mathbf{r}'|\lambda) \big\rangle \qquad (72)$$
In statistical mechanics, the two-body density is often expressed in terms of the radial distribution function
$$g_{\alpha,\alpha'}(\mathbf{r}, \mathbf{r}'|\lambda) \equiv \frac{\rho_{\alpha,\alpha'}(\mathbf{r}, \mathbf{r}'|\lambda)}{\rho_\alpha(\mathbf{r})\,\rho_{\alpha'}(\mathbf{r}')} \qquad (73)$$
or the total correlation function
$$h_{\alpha,\alpha'}(\mathbf{r}, \mathbf{r}'|\lambda) \equiv \frac{\rho_{\alpha,\alpha'}(\mathbf{r}, \mathbf{r}'|\lambda)}{\rho_\alpha(\mathbf{r})\,\rho_{\alpha'}(\mathbf{r}')} - 1 \qquad (74)$$
In terms of the total correlation function, $\Delta F$ can be written as
$$\Delta F[\rho_\alpha(\mathbf{r})] = J[\rho_\alpha(\mathbf{r})] + F_{xc}[\rho_\alpha(\mathbf{r})] \qquad (75)$$
where
$$F_{xc}[\rho_\alpha(\mathbf{r})] = \frac{e^2}{8\pi\varepsilon_0}\int_0^1 d\lambda \sum_\alpha\sum_{\alpha'}\int d\mathbf{r}\int d\mathbf{r}'\, \rho_\alpha(\mathbf{r})\,\rho_{\alpha'}(\mathbf{r}')\, h_{\alpha,\alpha'}(\mathbf{r}, \mathbf{r}'|\lambda)\,\frac{Z_\alpha Z_{\alpha'}}{|\mathbf{r} - \mathbf{r}'|}.
(76)

Alternatively, the exchange-correlation free energy may be written as
$$F_{xc}[\rho_\alpha(\mathbf{r})] = \frac{e^2}{8\pi\varepsilon_0}\sum_\alpha\sum_{\alpha'}\int d\mathbf{r}\int d\mathbf{r}'\, \frac{\rho_\alpha(\mathbf{r})\,\rho^{xc}_{\alpha,\alpha'}(\mathbf{r}, \mathbf{r}')\, Z_\alpha Z_{\alpha'}}{|\mathbf{r} - \mathbf{r}'|} \qquad (77)$$
where the exchange-correlation hole is defined as
$$\rho^{xc}_{\alpha,\alpha'}(\mathbf{r}, \mathbf{r}') = \int_0^1 d\lambda\, \rho_{\alpha'}(\mathbf{r}')\, h_{\alpha,\alpha'}(\mathbf{r}, \mathbf{r}'|\lambda) \equiv \rho_{\alpha'}(\mathbf{r}')\, \bar h_{\alpha,\alpha'}(\mathbf{r}, \mathbf{r}'; \bar\rho) \qquad (78)$$
where $\bar h$ and $\bar\rho$ represent some averaged quantities. Because the Hartree energy accounts for the direct Coulomb energy of the electrostatic interactions, the exchange-correlation hole may be understood as the Coulomb energy of a charged particle $\alpha$ with a cavity of particles $\alpha'$. It satisfies the normalization condition
$$\int d\mathbf{r}'\, \rho^{xc}_{\alpha,\alpha'}(\mathbf{r}, \mathbf{r}') = -1. \qquad (79)$$

Fig. 2 (a) The binding energy curves (atomic units) for H$_2^+$ calculated from different versions of KS-DFT. B3LYP, LDA and exact results are from the reference. The binding length refers to the center-to-center distance between the two H atoms. (b) Binding energy curves for H$_2$ calculated from different methods (atomic units). Reproduced from [39].
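The thermodynamic-integration identity underlying Eq. (71) can be checked on a classical toy model where both sides are available in closed form; the harmonic potentials and parameter values below are illustrative, not from the chapter.

```python
import numpy as np

# Classical toy analogue of the thermodynamic integration in Eq. (71).
# Reference: U_0 = k0 x^2 / 2; target: U_1 = (k0 + dk) x^2 / 2, coupled by
# U_lambda = (k0 + lam * dk) x^2 / 2. For this Gaussian model both sides are exact:
#   <dU/dlam>_lam = dk * kT / (2 (k0 + lam * dk)),
#   dF = (kT/2) * ln((k0 + dk) / k0).
kT, k0, dk = 1.0, 1.0, 3.0

lam = np.linspace(0.0, 1.0, 2001)
integrand = dk * kT / (2.0 * (k0 + lam * dk))   # <dU/dlam> at each coupling value

# trapezoidal quadrature over the coupling parameter
dlam = lam[1] - lam[0]
dF_ti = np.sum(integrand[1:] + integrand[:-1]) * dlam / 2.0

dF_exact = 0.5 * kT * np.log((k0 + dk) / k0)
print(dF_ti, dF_exact)   # agree to quadrature accuracy
```

The same structure carries over to Eq. (71): the ensemble average of the coupling derivative is evaluated at a set of $\lambda$ values and then integrated numerically from the reference system ($\lambda = 0$) to the real one ($\lambda = 1$).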
Although an analytical expression for the exchange-correlation free energy is difficult to attain, the exact equations from the adiabatic connection are appealing because the physical meanings of the various correlation functions are rather intuitive. For example, Figure 2 shows various DFT predictions for the binding energy curves of H$_2^+$ and H$_2$ [39]. The solid lines are predictions from the adiabatic connection with the total correlation function represented by a simple weighted density approximation (WDA)
$$\bar h(\mathbf{r}, \mathbf{r}'|\lambda) \approx h^{\rm UEG}(|\mathbf{r} - \mathbf{r}'|, \bar\rho) \qquad (80)$$
where the superscript "UEG" stands for uniform electron gas. Whereas noticeable discrepancies are observed in comparison with the exact results, the WDA is free of the delocalization error (viz., no self-interaction in the single-electron limit) and the static correlation error (viz., no binding energy between atoms at large separation) that are commonplace in many popular DFT functionals [16].

Like many differential equations derived from physical models, the Schrödinger equation lacks an analytical solution with closed-form expressions. Conventionally, such equations are solved with Galerkin methods, i.e., discretization of the partial differential equations into algebraic equations such that they become suitable for efficient implementation on a computer. Both plane-wave formalisms and real-space methods are well advanced for solving the Schrödinger equation (and related DFT methods) [38]. In general, these numerical methods have high computational complexity, which limits their applications to large systems of practical interest. Complementary to the numerical methods for solving the differential equations directly, statistical and machine-learning models have long been utilized to emulate the numerical results and speed up theoretical predictions [7, 58].
In the next section, we outline some recent developments in statistical and machine learning methods that offer an alternative way to circumvent directly solving the computationally expensive partial differential equations.
A Gaussian process (GP) defines a large class of statistical models that offer an alternative way to emulate a computationally expensive function with drastically lower computational cost and, at the same time, provide an internal assessment of the uncertainty in emulation. Under some regularity conditions, the estimator of GP regression is guaranteed to converge to the true underlying function with respect to a certain metric (e.g., the $L_\infty$ or $L_2$ distance), with a known convergence rate as a function of the number of observations and the "smoothness" of the truth.

GPs have been widely used for approximating computationally expensive computer models [50, 5, 27, 53]. The statistical framework of a GP emulator is closely connected to the reproducing kernel Hilbert space and kernel ridge regression (KRR), though GPs and KRR seem to have been developed independently in two streams of research communities. In this section, we first introduce GP emulation and GP regression for scalar-valued functions from the probabilistic point of view in Section 2.1 and Section 2.2, respectively. The mathematical connection to KRR and the reproducing kernel Hilbert space is introduced in Section 2.3, and the convergence properties that underpin these methods are introduced in Section 2.4. In the context of first-principles calculations, GP models with new descriptors and kernels have been developed to maintain various physical properties, such as translational, permutational and rotational invariance [3]. Recent advances in GP models for reproducing macroscopic quantities, such as energy and mechanical properties, as well as atomic forces for MD simulations, will be introduced in Section 2.6.

Suppose we want to emulate a real-valued function with a scalar output, $f : \mathcal{X} \to \mathbb{R}$, with a $p$-dimensional input $\mathbf{x} \in \mathcal{X}$.
We model $f$ by a Gaussian process, denoted as $f(\cdot) \sim \mathcal{GP}(m(\cdot), K(\cdot,\cdot))$, with mean function $m : \mathcal{X} \to \mathbb{R}$ and a covariance function (or a positive semidefinite kernel) $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Conditional on the mean and covariance function, any marginal distribution $\mathbf{f} = (f(\mathbf{x}_1), \dots, f(\mathbf{x}_n))^T$ at $n$ inputs $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_n\}$ follows a multivariate normal distribution:
$$\big((f(\mathbf{x}_1), \dots, f(\mathbf{x}_n))^T \mid \mathbf{m}_X, \mathbf{K}_{XX}\big) \sim \mathcal{MN}(\mathbf{m}_X, \mathbf{K}_{XX}),$$
where $\mathbf{m}_X = (m(\mathbf{x}_1), \dots, m(\mathbf{x}_n))^T$ is a vector of the mean and $\mathbf{K}_{XX}$ is an $n \times n$ covariance matrix with the $(i,j)$th term being $K(\mathbf{x}_i, \mathbf{x}_j)$.

The mean is often modeled through a linear model of basis functions:
$$m(\mathbf{x}) = \mathbf{h}(\mathbf{x})\boldsymbol{\theta} = \sum_{t=1}^q h_t(\mathbf{x})\theta_t, \qquad (81)$$
where $\mathbf{h}(\mathbf{x}) = (h_1(\mathbf{x}), \dots, h_q(\mathbf{x}))$ is a set of basis functions of $q$ dimensions and $\boldsymbol{\theta} = (\theta_1, \dots, \theta_q)^T$ is a vector of trend parameters estimated from the data. The mean is often held fixed at zero in applications for simplicity, whereas a physical model of the basis functions may improve the predictive accuracy if the trend of the underlying function can be captured by the basis functions.

The covariance function (or kernel) $K(\cdot,\cdot)$ is the most critical component in a GP model. The GP is often assumed to be stationary (or shift-invariant), meaning that for any two inputs $\mathbf{x} = (x_1, \dots, x_p)$ and $\mathbf{x}' = (x'_1, \dots, x'_p)$, $K(\mathbf{x}, \mathbf{x}') = \sigma^2 c(\mathbf{x} - \mathbf{x}')$, with $\sigma^2$ being a variance parameter and $c(\cdot)$ a correlation function with $c(\mathbf{0}) = 1$. In modeling spatially correlated data, the covariance function is often assumed to be isotropic, where $K(\mathbf{x}, \mathbf{x}') = \sigma^2 c(\|\mathbf{x} - \mathbf{x}'\|)$, with $\|\cdot\|$ being the Euclidean distance. Frequently used correlation functions include the power exponential correlation and the Matérn correlation [48]. The power exponential correlation function follows
$$c(\mathbf{x}, \mathbf{x}') = \exp\left\{-\left(\frac{\|\mathbf{x} - \mathbf{x}'\|}{\gamma}\right)^\alpha\right\}, \qquad (82)$$
with a range parameter $\gamma \in (0, +\infty)$ and a roughness parameter $\alpha \in (0, 2]$. When $\alpha = 2$, the kernel becomes the Gaussian kernel, whose sample path is infinitely differentiable. The roughness parameter of the kernel is often held fixed based on the smoothness of the process, and the range parameters are estimated from the data. The Matérn kernel follows
$$c(\mathbf{x}, \mathbf{x}') = \frac{2^{1-\alpha}}{\Gamma(\alpha)}\left(\frac{\|\mathbf{x} - \mathbf{x}'\|}{\gamma}\right)^\alpha \mathcal{K}_\alpha\left(\frac{\|\mathbf{x} - \mathbf{x}'\|}{\gamma}\right), \qquad (83)$$
where $\mathcal{K}_\alpha$ is the modified Bessel function of the second kind with roughness parameter $\alpha$ and range parameter $\gamma$. The Matérn kernel has a closed-form expression when $\alpha = M + 1/2$ with $M \in \mathbb{N}$, and the sample path of the GP with a Matérn kernel is $\lceil\alpha\rceil - 1$ times differentiable. When $\alpha = 5/2$, for instance, the Matérn kernel follows:
$$c(\mathbf{x}, \mathbf{x}') = \left(1 + \frac{\sqrt{5}\,\|\mathbf{x} - \mathbf{x}'\|}{\gamma} + \frac{5\,\|\mathbf{x} - \mathbf{x}'\|^2}{3\gamma^2}\right)\exp\left(-\frac{\sqrt{5}\,\|\mathbf{x} - \mathbf{x}'\|}{\gamma}\right). \qquad (84)$$

Note that, as each coordinate of the computer-model input may have a different scale, the isotropic kernel is not flexible enough. A widely used anisotropic kernel is the product kernel [50, 6, 27]:
$$K(\mathbf{x}, \mathbf{x}') = \sigma^2 \prod_{l=1}^p c_l(|x_l - x'_l|),$$
where $c_l(\cdot)$ is a correlation function of the output induced by the $l$th coordinate of the input. In the above expression, $c_l(\cdot)$ can be chosen as a power exponential correlation, a Matérn correlation, or any other suitable correlation function. Note that the parameters in $c_l(\cdot)$ (such as the range parameter $\gamma_l$) can be different for each $l$, and these parameters can be estimated by a maximum-likelihood type of estimator [5, 25], inducing a more flexible way to parameterize the correlation.

Maximum likelihood estimator. The process of computer-model emulation often begins by selecting a set of inputs $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_n\}$ from a space-filling design, such that the design points can evenly fill the input domain. Widely used random space-filling designs include the Latin hypercube design and its extensions [51]. We then run the simulator at these design inputs and obtain a set of numerical solutions, denoted as $\mathbf{f} = (f(\mathbf{x}_1), \dots, f(\mathbf{x}_n))^T$. These data will be used to estimate the model parameters, including the mean, variance and range parameters $(\boldsymbol{\theta}, \sigma^2, \boldsymbol{\gamma})$. Denote the mean basis $\mathbf{H}_X = (\mathbf{h}(\mathbf{x}_1)^T, \dots, \mathbf{h}(\mathbf{x}_n)^T)^T$, an $n \times q$ matrix of the basis functions.
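The correlation functions in Eqs. (82) and (84) are straightforward to implement; the following NumPy sketch builds a correlation matrix and checks the two defining properties, $c(0) = 1$ and positive semidefiniteness.

```python
import numpy as np

# A minimal NumPy sketch of the correlation functions in Eqs. (82) and (84).
def pow_exp_corr(d, gamma, alpha=2.0):
    """Power exponential correlation, Eq. (82); alpha = 2 gives the Gaussian kernel."""
    return np.exp(-(d / gamma)**alpha)

def matern_52_corr(d, gamma):
    """Matern correlation with roughness alpha = 5/2, Eq. (84)."""
    r = np.sqrt(5.0) * d / gamma
    return (1.0 + r + r**2 / 3.0) * np.exp(-r)

x = np.linspace(0, 1, 12)
d = np.abs(x[:, None] - x[None, :])            # pairwise distances
C = matern_52_corr(d, gamma=0.3)               # correlation matrix, c(0) = 1 on the diagonal
print(np.allclose(np.diag(C), 1.0))            # True
print(np.min(np.linalg.eigvalsh(C)) > -1e-10)  # True: positive semidefinite up to round-off
print(pow_exp_corr(np.array([0.0, 0.5]), gamma=0.3))
```

A product (anisotropic) kernel is then just the elementwise product of such one-dimensional correlations, one per input coordinate, each with its own range parameter.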
Differentiating the likelihood function with respect to the mean and variance parameters, we have closed-form expressions for the maximum likelihood estimators (MLE) of the mean and variance parameters:
$$\hat{\boldsymbol{\theta}} = \big(\mathbf{H}_X^T \mathbf{C}_{XX}^{-1}\mathbf{H}_X\big)^{-1}\mathbf{H}_X^T \mathbf{C}_{XX}^{-1}\mathbf{f} \qquad (85)$$
$$\hat\sigma^2 = S_X^2/n \qquad (86)$$
with $S_X^2 = (\mathbf{f} - \mathbf{H}_X\hat{\boldsymbol{\theta}})^T \mathbf{C}_{XX}^{-1}(\mathbf{f} - \mathbf{H}_X\hat{\boldsymbol{\theta}})$ and $\mathbf{C}_{XX} = \mathbf{K}_{XX}/\sigma^2$ being a correlation matrix with diagonal entries being 1. Plugging the MLE of the mean and variance parameters into the likelihood function leads to the profile likelihood of the range parameters in the kernel:
$$\mathcal{L}(\boldsymbol{\gamma}) \propto |\mathbf{C}_{XX}|^{-1/2}\big(S_X^2\big)^{-n/2}. \qquad (87)$$
The range parameters $\boldsymbol{\gamma}$ are often estimated by numerically maximizing the natural logarithm of Equation (87) based on a Newton algorithm [45], since a closed-form MLE expression may not exist.

The MLE is an efficient estimator of the parameters when the sample is large. When the number of available runs of a computer experiment is small, however, the MLE of the parameters of a GP emulator can be very unstable. Other estimators, such as the penalized MLE [35] and the robust marginal posterior mode estimator [25], were studied for when the sample size is small. Besides, the predictive mean in Equation (89) may be used to estimate the range parameters through cross-validation. However, more runs may be needed than for the MLE, as one needs to split the observations to estimate the parameters in a cross-validation approach.

Predictive distribution. Suppose we are interested in predicting the model value at an input $\mathbf{x}$ not run before. The joint distribution follows a multivariate normal distribution
$$\begin{pmatrix} f(\mathbf{x}) \\ \mathbf{f} \end{pmatrix}\Big|\, \hat{\boldsymbol{\theta}}, \hat\sigma^2, \hat{\boldsymbol{\gamma}} \sim \mathcal{MN}\left(\begin{pmatrix} \mathbf{h}(\mathbf{x})\hat{\boldsymbol{\theta}} \\ \mathbf{H}_X\hat{\boldsymbol{\theta}} \end{pmatrix}, \begin{pmatrix} K(\mathbf{x}, \mathbf{x}) & \mathbf{K}_{xX} \\ \mathbf{K}_{Xx} & \mathbf{K}_{XX} \end{pmatrix}\right),$$
where $\mathbf{K}_{xX} = \mathbf{K}_{Xx}^T = (K(\mathbf{x}, \mathbf{x}_1), \dots, K(\mathbf{x}, \mathbf{x}_n))$, with the variance and range parameters plugged into the kernel function $K(\cdot,\cdot)$.
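The closed-form estimators in Eqs. (85)-(86) can be sketched in a few lines; as a sanity check, with $\mathbf{C}_{XX} = \mathbf{I}_n$ the mean estimator reduces to ordinary least squares. The simulated data below are illustrative.

```python
import numpy as np

# A sketch of the closed-form MLEs in Eqs. (85)-(86), given a correlation matrix C_XX.
def gp_mle_mean_var(H, C, f):
    Cinv_H = np.linalg.solve(C, H)
    Cinv_f = np.linalg.solve(C, f)
    theta_hat = np.linalg.solve(H.T @ Cinv_H, H.T @ Cinv_f)   # Eq. (85)
    resid = f - H @ theta_hat
    sigma2_hat = resid @ np.linalg.solve(C, resid) / len(f)   # Eq. (86)
    return theta_hat, sigma2_hat

rng = np.random.default_rng(1)
n = 30
H = np.column_stack([np.ones(n), rng.uniform(0, 1, n)])   # basis functions (1, x)
f = H @ np.array([0.5, -1.0]) + rng.normal(0, 0.1, n)     # illustrative data

# Sanity check: with C = I the MLE of theta reduces to ordinary least squares
theta_hat, sigma2_hat = gp_mle_mean_var(H, np.eye(n), f)
theta_ols = np.linalg.lstsq(H, f, rcond=None)[0]
print(np.allclose(theta_hat, theta_ols))   # True
```

With a nontrivial correlation matrix, the same formulas give the generalized least-squares trend estimate, and the profile likelihood of Eq. (87) can then be maximized numerically over the range parameters.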
After obtaining the observations $\mathbf{f} = (f(\mathbf{x}_1), \dots, f(\mathbf{x}_n))^T$, by the conditional distribution of the multivariate normal, the predictive distribution of the Gaussian process at any input $\mathbf{x}$ follows a normal distribution:
$$\big(f(\mathbf{x}) \mid \mathbf{f}, \hat{\boldsymbol{\theta}}, \hat\sigma^2, \hat{\boldsymbol{\gamma}}\big) \sim \mathcal{N}(m^*(\mathbf{x}), K^*(\mathbf{x}, \mathbf{x})), \qquad (88)$$
where the predictive mean and covariance follow
$$m^*(\mathbf{x}) = \mathbf{h}(\mathbf{x})\hat{\boldsymbol{\theta}} + \mathbf{K}_{xX}\mathbf{K}_{XX}^{-1}(\mathbf{f} - \mathbf{H}_X\hat{\boldsymbol{\theta}}) \qquad (89)$$
$$K^*(\mathbf{x}, \mathbf{x}) = K(\mathbf{x}, \mathbf{x}) - \mathbf{K}_{xX}\mathbf{K}_{XX}^{-1}\mathbf{K}_{Xx}. \qquad (90)$$
The predictive mean in (88) is often used as a point estimator for predicting the value of the function at any $\mathbf{x} \in \mathcal{X}$. The GP emulator has an internal assessment of the uncertainty, as the predictive variance and any quantile of the prediction can be computed from (88).

Interpolator. Note that if $\mathbf{x} = \mathbf{x}_i$ for any $i = 1, 2, \dots, n$, we have $\mathbf{K}_{xX}\mathbf{K}_{XX}^{-1} = \mathbf{e}_i^T$, where $\mathbf{e}_i$ is a vector with 1 at the $i$th entry and 0 at the other entries. The predictive mean $m^*(\cdot)$ in (88) is an interpolator: if $\mathbf{x} = \mathbf{x}_i$ for any $i = 1, 2, \dots, n$, the predictive mean is exactly the same as $f(\mathbf{x}_i)$:
$$m^*(\mathbf{x}_i) = \mathbf{h}(\mathbf{x}_i)\hat{\boldsymbol{\theta}} + \mathbf{e}_i^T(\mathbf{f} - \mathbf{H}_X\hat{\boldsymbol{\theta}}) = f(\mathbf{x}_i).$$
An interpolator is typically suitable when the computer model is deterministic and the numerical error from the computer model is very small.

In Figure 3, we graph the predictive mean and 95% predictive interval of a GP emulator implemented in
RobustGaSP R package [26] for the function $f(x) = \sin(2\pi x/10) + \sin(2\pi x/2.5)/5$ with $x \in [0, 10]$.

Fig. 3 Emulation of the function $f(x) = \sin(2\pi x/10) + \sin(2\pi x/2.5)/5$ with $n = 10$ and $n = 15$ observations, graphed in the left and right panels, respectively. The black curves are the truth and the blue curves are the predictive mean of the GP emulator in Equation (89), based on the observations graphed as black dots. The grey area is the 95% predictive interval from the GP emulator.

We use the
Matérn kernel in (84) to parameterize the covariance and the MLE to estimate the parameters. As the sample size increases, the estimation becomes more accurate, and the uncertainty (shown as the shaded area) shrinks. We show only an example with a one-dimensional input here; the GP model implemented in the RobustGaSP package is applicable to multi-dimensional inputs and outputs, with either noise-free or noisy observations.
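The estimation and prediction steps above (Equations (85)–(90)) can be sketched in a few lines. The following is a minimal illustrative implementation of our own, not the RobustGaSP package: a Matérn-2.5 kernel, the closed-form MLEs (85)–(86), the profile likelihood (87) maximized over a grid of range parameters (standing in for the Newton algorithm mentioned above), and the plug-in predictive mean (89) and variance (90). The one-dimensional test function is ours for illustration.

```python
import numpy as np

def matern25(x1, x2, gamma):
    """Matern correlation with roughness parameter 2.5 and range gamma."""
    d = np.sqrt(5.0) * np.abs(x1[:, None] - x2[None, :]) / gamma
    return (1.0 + d + d**2 / 3.0) * np.exp(-d)

x = np.linspace(0.0, 10.0, 10)                         # n = 10 design points
f = np.sin(2*np.pi*x/10) + np.sin(2*np.pi*x/2.5)/5     # noise-free observations
n = len(x)
H = np.ones((n, 1))                                    # constant mean basis h(x) = 1

def fit(gamma):
    """Closed-form MLEs (85)-(86) and the log profile likelihood (87)."""
    C = matern25(x, x, gamma) + 1e-10 * np.eye(n)      # tiny jitter for stability
    Ci_f, Ci_H = np.linalg.solve(C, f), np.linalg.solve(C, H)
    theta = np.linalg.solve(H.T @ Ci_H, H.T @ Ci_f)    # (85)
    resid = f - H @ theta
    S2 = resid @ np.linalg.solve(C, resid)             # S_X^2, so sigma2-hat = S2/n
    _, logdet = np.linalg.slogdet(C)
    return theta, S2 / n, -0.5 * logdet - 0.5 * n * np.log(S2)

# estimate the range parameter by maximizing the log profile likelihood on a grid
grid = np.linspace(0.3, 5.0, 50)
gamma_hat = grid[np.argmax([fit(g)[2] for g in grid])]
theta_hat, sigma2_hat, _ = fit(gamma_hat)

def predict(x_new):
    """Plug-in predictive mean (89) and variance (90)."""
    C = matern25(x, x, gamma_hat) + 1e-10 * np.eye(n)
    c = matern25(x_new, x, gamma_hat)
    mean = np.ones((len(x_new), 1)) @ theta_hat + c @ np.linalg.solve(C, f - H @ theta_hat)
    var = sigma2_hat * (1.0 - np.sum(c * np.linalg.solve(C, c.T).T, axis=1))
    return mean, var

m_train, v_train = predict(x)
print(np.max(np.abs(m_train - f)))  # ~0: the predictive mean interpolates the data
```

At the design points the predictive mean reproduces the observations and the predictive variance collapses to (numerically) zero, illustrating the interpolation property stated above.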
When the observations of the computer model contain noise (e.g., non-negligible numerical error from the computer model), one can model the observations by

$y(\mathbf{x}) = f(\mathbf{x}) + \epsilon$, (91)

where $f(\cdot) \sim \mathcal{GP}(m(\cdot), K(\cdot,\cdot))$ and $\epsilon$ is an independent Gaussian noise with variance $\sigma_0^2$. The covariance function of the new process $y(\cdot)$ can be expressed as $\tilde{K}(\mathbf{x}, \mathbf{x}') = K(\mathbf{x}, \mathbf{x}') + \sigma_0^2\,1_{\mathbf{x} = \mathbf{x}'}$, where $1_{\mathbf{x} = \mathbf{x}'} = 1$ if $\mathbf{x} = \mathbf{x}'$ and 0 otherwise. The MLEs of $\boldsymbol{\theta}$ and $\sigma^2$ follow similarly to (85) and (86) by replacing $\mathbf{f}$ and $\mathbf{C}_{XX}$ with $\mathbf{y}$ and $\tilde{\mathbf{C}}_{XX} := \mathbf{C}_{XX} + \eta\mathbf{I}_n$, respectively, where $\eta = \sigma_0^2/\sigma^2$ is referred to as the nugget parameter.

Denote the observations $\mathbf{y} = (y(\mathbf{x}_1), \ldots, y(\mathbf{x}_n))^T$. The predictive distribution at any input $\mathbf{x}$ with noisy observations also follows a normal distribution, $(f(\mathbf{x}) \mid \mathbf{y}, \hat{\boldsymbol{\theta}}, \hat{\sigma}^2, \hat{\boldsymbol{\gamma}}, \hat{\eta}) \sim \mathcal{N}(\tilde{m}^*(\mathbf{x}), \tilde{K}_y^*(\mathbf{x},\mathbf{x}))$, with the predictive mean and variance below:

$\tilde{m}^*(\mathbf{x}) = \mathbf{h}(\mathbf{x})\tilde{\boldsymbol{\theta}} + \mathbf{C}_{\mathbf{x}X}(\mathbf{C}_{XX} + \eta\mathbf{I}_n)^{-1}(\mathbf{y} - \mathbf{H}_X\tilde{\boldsymbol{\theta}})$, (92)

$\tilde{K}_y^*(\mathbf{x},\mathbf{x}) = \hat{\sigma}^2\left(C(\mathbf{x},\mathbf{x}) - \mathbf{C}_{\mathbf{x}X}(\mathbf{C}_{XX} + \eta\mathbf{I}_n)^{-1}\mathbf{C}_{X\mathbf{x}}\right)$, (93)

where $\tilde{\boldsymbol{\theta}} = (\mathbf{H}_X^T\tilde{\mathbf{C}}_{XX}^{-1}\mathbf{H}_X)^{-1}\mathbf{H}_X^T\tilde{\mathbf{C}}_{XX}^{-1}\mathbf{y}$ and $\mathbf{C}_{\mathbf{x}X} = \mathbf{K}_{\mathbf{x}X}/\hat{\sigma}^2$, with the variance and range parameters plugged into the kernel function $K(\cdot,\cdot)$.

Reproducing kernel Hilbert space. We call $\mathcal{H}$ the reproducing kernel Hilbert space (RKHS) with the native norm (or RKHS norm) $\|f\|_{\mathcal{H}} = \sqrt{\langle f, f\rangle_{\mathcal{H}}}$ if there exists a kernel function $K: \mathcal{X}\times\mathcal{X} \to \mathbb{R}$ such that 1) for any $\mathbf{x}' \in \mathcal{X}$, the function $K(\cdot, \mathbf{x}')$ belongs to $\mathcal{H}$, and 2) $K$ has the reproducing property $\langle f(\cdot), K(\cdot, \mathbf{x})\rangle_{\mathcal{H}} = f(\mathbf{x})$ for any $f \in \mathcal{H}$ [48]. For simplicity, let us consider a GP with zero mean (i.e., $m(\mathbf{x}) = 0$).
The RKHS $\mathcal{H}$ attached to the GP with kernel $K(\cdot,\cdot)$ is the completion of the space of all functions

$\mathcal{H}_0 = \left\{ f(\cdot) = \sum_{i=1}^{n} w_i K(\mathbf{x}_i, \cdot) : w_1, \ldots, w_n \in \mathbb{R};\ \mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathcal{X};\ n \in \mathbb{N} \right\}$

with the inner product

$\left\langle \sum_{i=1}^{n_1} w_i K(\mathbf{x}_i, \cdot),\ \sum_{j=1}^{n_2} w'_j K(\mathbf{x}'_j, \cdot) \right\rangle_{\mathcal{H}} = \sum_{i=1}^{n_1}\sum_{j=1}^{n_2} w_i w'_j K(\mathbf{x}_i, \mathbf{x}'_j)$,

with $n_1, n_2 \in \mathbb{N}$.

We denote by $\langle f, g\rangle_{L_2(\mathcal{X})} = \int_{\mathbf{x}\in\mathcal{X}} f(\mathbf{x})g(\mathbf{x})\,d\mathbf{x}$ the inner product in $L_2(\mathcal{X})$. The RKHS $\mathcal{H}$ contains all functions $f(\cdot) = \sum_{k=1}^{\infty} f_k\phi_k(\cdot) \in L_2(\mathcal{X})$ with $f_k = \langle f, \phi_k\rangle_{L_2(\mathcal{X})}$ and $\sum_{k=1}^{\infty} f_k^2/\lambda_k < \infty$. For any $g(\cdot) = \sum_{k=1}^{\infty} g_k\phi_k(\cdot) \in \mathcal{H}$ and $f(\cdot) = \sum_{k=1}^{\infty} f_k\phi_k(\cdot) \in \mathcal{H}$, the inner product can be represented as $\langle f, g\rangle_{\mathcal{H}} = \sum_{k=1}^{\infty} f_k g_k/\lambda_k$. For more discussion of the RKHS, see Chapter 6 of [48] and Chapter 1 of [55].

Kernel ridge regression. Consider $n$ noisy observations $\mathbf{y} = (y_1, \ldots, y_n)^T$ with $y_i = y(\mathbf{x}_i)$ for $i = 1, \ldots, n$. We are interested in estimating the mean $f(\mathbf{x}) = \mathbb{E}[y(\mathbf{x})]$ of the observations for any $\mathbf{x} \in \mathcal{X}$. Denote by $\mathcal{H}$ the RKHS attached to the correlation kernel $C(\cdot,\cdot)$. Kernel ridge regression (KRR) solves the following optimization problem:

$\hat{f}_n = \arg\min_{g\in\mathcal{H}}\left\{ \frac{1}{n}\sum_{i=1}^{n}(y_i - g(\mathbf{x}_i))^2 + \lambda\|g\|_{\mathcal{H}}^2 \right\}$, (94)

where $\lambda$ is a regularization parameter typically estimated from data.

Theorem 3 (Solution of KRR). The solution of Equation (94) is unique and has the following expression:

$\hat{f}_n(\mathbf{x}) = \sum_{i=1}^{n} \hat{w}_i C(\mathbf{x}, \mathbf{x}_i) = \mathbf{C}_{\mathbf{x}X}(\mathbf{C}_{XX} + n\lambda\mathbf{I}_n)^{-1}\mathbf{y}$ (95)

for any $\mathbf{x} \in \mathcal{X}$, with $\hat{\mathbf{w}} = (\hat{w}_1, \ldots, \hat{w}_n)^T = (\mathbf{C}_{XX} + n\lambda\mathbf{I}_n)^{-1}\mathbf{y}$.

Proof.
By the representer lemma [48, 55], the minimizer of (94) has the form

$\hat{f}_n(\mathbf{x}) = \sum_{i=1}^{n} w_i C(\mathbf{x}_i, \mathbf{x})$

for any $\mathbf{x} \in \mathcal{X}$; denote by $\mathbf{w} = (w_1, \ldots, w_n)^T \in \mathbb{R}^n$ the weights in the solution. Since $\langle C(\mathbf{x}_i, \cdot), C(\mathbf{x}_j, \cdot)\rangle_{\mathcal{H}} = C(\mathbf{x}_i, \mathbf{x}_j)$, Equation (94) becomes the problem of finding $\mathbf{w}$ such that

$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}\in\mathbb{R}^n}\left\{ \frac{1}{n}(\mathbf{y} - \mathbf{C}_{XX}\mathbf{w})^T(\mathbf{y} - \mathbf{C}_{XX}\mathbf{w}) + \lambda\mathbf{w}^T\mathbf{C}_{XX}\mathbf{w} \right\}$. (96)

Differentiating (96) with respect to $\mathbf{w}$ and setting the gradient to zero, we have

$\hat{\mathbf{w}} = (\mathbf{C}_{XX} + n\lambda\mathbf{I}_n)^{-1}\mathbf{y}$. (97)

Remark 1.
The solution of KRR in (95) is exactly the same as the predictive mean estimator in Equation (92) when the mean function is zero (i.e., $m(\mathbf{x}) = 0$ for all $\mathbf{x} \in \mathcal{X}$) and $\lambda = \eta/n$.

The KRR solves the optimization problem in (94) and gives an estimator of the function. As the noise is not modeled explicitly, the uncertainty of the KRR estimator is not specified. As stated in Remark 1, the solution of the KRR is equivalent to the predictive mean of GP regression in (92). One main advantage of the GP model is that the uncertainty of the estimator can be computed from the predictive distribution.

Note that many simulators are deterministic or contain very small numerical error. In this scenario, the observations become $y_i = f(\mathbf{x}_i)$ for $i = 1, \ldots, n$. The KRR solution, however, may not be suitable for these scenarios, as it is not an interpolator. Consider the following kernel "ridgeless" problem [36]:

$\hat{f}_n = \arg\min_{g\in\mathcal{H}}\|g\|_{\mathcal{H}}^2$, subject to $g(\mathbf{x}_i) = f(\mathbf{x}_i)$ for $i = 1, \ldots, n$. (98)

The solution of Equation (98) follows [29]:

$\hat{f}_n(\mathbf{x}) = \mathbf{K}_{\mathbf{x}X}\mathbf{K}_{XX}^{-1}\mathbf{f}$. (99)

Note that the solution in Equation (99) is exactly the same as the predictive mean expression in (89) with zero mean, $m(\mathbf{x}) = 0$.

GP regression is a flexible approach to approximating nonlinear continuous functions. We briefly introduce the convergence properties of GP regression to the true underlying function. Suppose the observations are from

$y(\mathbf{x}) = f(\mathbf{x}) + \epsilon$, (100)

where $f(\mathbf{x})$ is the true deterministic function with $\mathbf{x} \in \mathcal{X}$ and $\epsilon$ is an independent noise. Let us assume we evaluate the goodness of estimation by the $L_2$ norm, $\|\hat{f}_n - f\|_{L_2} = \left(\int_{\mathbf{x}\in\mathcal{X}}(\hat{f}_n(\mathbf{x}) - f(\mathbf{x}))^2\,d\mathbf{x}\right)^{1/2}$, where $\hat{f}_n$ is the KRR estimator in (95) (or, equivalently, the predictive mean estimator of GP regression in (92)).

Loosely speaking, the convergence of the KRR estimator depends on three regularity conditions.
First, the noise $\epsilon$ should have a tail decaying no more slowly than that of a Gaussian distribution (i.e., a sub-Gaussian distribution). Second, the sequence of inputs $\{\mathbf{x}_i\}_{i=1}^{\infty}$ should fill the space $\mathcal{X}$. Third, the number of small balls needed to cover the functional space should not be too large. Denote by $N(r, \mathcal{F}, \|\cdot\|_{L_2})$ the covering number: the smallest value of $N$ for the functional space $\mathcal{F}$ over $\mathcal{X}$ such that there exists a series of $L_2$-integrable functions $\{f_1^L, \ldots, f_N^L, f_1^U, \ldots, f_N^U\}$ with $\|f_i^L(\cdot) - f_i^U(\cdot)\|_{L_2} \le r$ for $r > 0$ and $i = 1, \ldots, N$, and for each $f \in \mathcal{F}$ one has $f_i^L \le f \le f_i^U$ for some $1 \le i \le N$. We refer to the books on empirical processes for further discussion of the covering number [22, 33].

For simplicity, we assume the design follows $U([0,1]^p)$, the uniform distribution on $[0,1]^p$. Further denote $\mathcal{F}(\rho) = \{f \in \mathcal{F} : \|f\|_{\mathcal{H}} \le \rho\}$. We are ready to state the convergence theorem, which can be inferred from Theorem 10.2 of [22].

Theorem 4.
Suppose the data are generated from Equation (100) with $f \in \mathcal{F}$. Suppose $\mathbf{x}_i \sim U([0,1]^p)$, the uniform distribution on $[0,1]^p$, and there exists a constant $K$ such that $\mathbb{E}_\epsilon[\exp(K\epsilon^2)] < \infty$. Furthermore, suppose there exists $\tau \in (0, 2)$ such that

$\log\left(N(r, \mathcal{F}(\rho), \|\cdot\|_{L_2})\right) \lesssim \rho^{\tau} r^{-\tau}$, for all $r, \rho > 0$.

When $\lambda^{-1} = O(n^{2/(2+\tau)})$, the $L_2$ norm of the difference between the estimator $\hat{f}_n$ and the true underlying function $f$ is stochastically bounded by $\lambda^{1/2}$:

$\|\hat{f}_n - f\|_{L_2} = O_p(\lambda^{1/2})$.

Various conditions in Theorem 4 can be relaxed. For example, the design space can be trivially extended to any bounded rectangle in $\mathbb{R}^p$, and the distribution of the design can also be modified while retaining the same convergence properties.

Remark 2.
Various kernels and functional spaces satisfy the conditions. For the Matérn kernel, for example, the RKHS is equivalent to a Sobolev space. Assuming $\mathcal{X} = [0,1]^p$, the natural logarithm of the covering number follows [20, 54]

$\log\left(N(r, \mathcal{F}(\rho), \|\cdot\|_{L_2})\right) \lesssim \rho^{p/\alpha} r^{-p/\alpha}$,

where $\alpha$ is the roughness parameter of the Matérn kernel. Then if $f \in \mathcal{F}$, where $\mathcal{F}$ is the Sobolev space, and $\lambda^{-1} \asymp n^{2\alpha/(2\alpha+p)}$, with $\asymp$ denoting the same order of magnitude on both sides with respect to the change of $n$, we have the optimal convergence rate

$\|\hat{f}_n - f\|_{L_2} = O_p\left(n^{-\frac{\alpha}{2\alpha+p}}\right)$.

In [14], the authors introduce a gradient-domain machine learning (GDML) model based on a GP emulator of the force, with a constrained kernel constructed from the relationship between energy and force. Denote by $\mathbf{x}$ a descriptor of a molecule of $N$ atoms with positions $(\mathbf{r}_1, \ldots, \mathbf{r}_N)$, where $\mathbf{r}_i \in \mathbb{R}^3$ for $i = 1, \ldots, N$. In [14], the descriptor is an $N^2$-dimensional real-valued vector $\mathbf{x} = Vec(\mathbf{D})$, where $Vec(\cdot)$ is the vectorization operator and $\mathbf{D}$ is an $N \times N$ matrix with the $(i, i')$th entry $D_{i,i'} = \|\mathbf{r}_i - \mathbf{r}_{i'}\|^{-1}$ if $i > i'$ and 0 if $i \le i'$. Denote by $f_E(\mathbf{x})$ the total energy as a function of the descriptor $\mathbf{x}$. The energy can be modeled as a GP, meaning that for any set of descriptors $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, we have

$\left((f_E(\mathbf{x}_1), \ldots, f_E(\mathbf{x}_n))^T \mid \mathbf{m}_X, \mathbf{K}_{XX}\right) \sim \mathcal{MN}(\mathbf{m}_X, \mathbf{K}_{XX})$, (101)

where $\mathbf{m}_X = (m(\mathbf{x}_1), \ldots, m(\mathbf{x}_n))^T$ is the mean vector and $\mathbf{K}_{XX}$ is an $n \times n$ covariance matrix of energies with the $(l, l')$th term $K(\mathbf{x}_l, \mathbf{x}_{l'})$ for $l, l' = 1, \ldots, n$.
The isotropic Matérn kernel with roughness parameter 2.5 in (84) is used in [14].

Denote by $\mathbf{f}_F(\mathbf{r}_1, \ldots, \mathbf{r}_N)$ the molecular force on the atoms with positions $\{\mathbf{r}_1, \ldots, \mathbf{r}_N\}$. As the force must obey the conservation of energy, we have the following expression:

$\mathbf{f}_F(\mathbf{r}_1, \ldots, \mathbf{r}_N) = -\nabla f_E(\mathbf{r}_1, \ldots, \mathbf{r}_N)$, (102)

where $\mathbf{f}_F(\mathbf{r}_1, \ldots, \mathbf{r}_N) = (f_{F,1}, \ldots, f_{F,3N})^T$ is a vector of $3N$ dimensions and the $(3(i-1)+j)$th component of the gradient $\nabla f_E$ is $\partial f_E(\mathbf{r}_1, \ldots, \mathbf{r}_N)/\partial r_{i,j}$ for $i = 1, \ldots, N$ and $j = 1, 2, 3$. The joint distribution of the forces at $n$ sets of molecules with descriptors $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ follows

$\left((\mathbf{f}_F(\mathbf{x}_1)^T, \ldots, \mathbf{f}_F(\mathbf{x}_n)^T)^T \mid \mathbf{m}_X, \mathbf{K}_{XX}\right) \sim \mathcal{MN}(-\nabla\mathbf{m}_X, \nabla\mathbf{K}_{XX}\nabla^T)$, (103)

where $\nabla\mathbf{m}_X$ is a mean vector of $3Nn$ dimensions whose $(3(l-1)N + 3(i-1) + j)$th term is $\partial m(\mathbf{x}_l)/\partial r_{l,i,j}$, with $r_{l,i,j}$ the $j$th coordinate of the $i$th atom of the $l$th molecule, for $i = 1, \ldots, N$, $j = 1, 2, 3$, and $l = 1, \ldots, n$; and $\nabla\mathbf{K}_{XX}\nabla^T$ is a $3Nn \times 3Nn$ covariance matrix whose $(3(l-1)N + 3(i-1) + j,\ 3(l'-1)N + 3(i'-1) + j')$th term is $\partial^2 K(\mathbf{x}_l, \mathbf{x}_{l'})/\partial r_{l,i,j}\partial r_{l',i',j'}$ for $i, i' = 1, \ldots, N$, $j, j' = 1, 2, 3$, and $l, l' = 1, \ldots, n$. The matrix $\nabla\mathbf{K}_{XX}\nabla^T$ can be calculated using the matrix derivative chain rule [14, 47].

Without loss of generality, assume the mean function is zero, i.e., $m(\mathbf{x}) = 0$. Denote by $\mathbf{f}_F = (\mathbf{f}_F(\mathbf{x}_1)^T, \ldots, \mathbf{f}_F(\mathbf{x}_n)^T)^T$ a $3Nn$-dimensional vector of training forces at $n$ sets of molecules. For a new molecule with any descriptor $\mathbf{x}$, the predictive mean of the force on the atoms of this new molecule follows

Fig. 4
Emulation of the atomic force on benzene projected onto the first two dimensions, with sample sizes $n = 100$ (left panel) and $n = 200$ (right panel).

$\mathbb{E}[\mathbf{f}_F(\mathbf{x}) \mid \mathbf{f}_F] = (\nabla\mathbf{K}_{\mathbf{x}X}\nabla^T)^T(\nabla\mathbf{K}_{XX}\nabla^T)^{-1}\mathbf{f}_F = \sum_{l=1}^{n}\sum_{i=1}^{N}\sum_{j=1}^{3}\omega_{l,i,j}\,\nabla_{\mathbf{x}}\frac{\partial K(\mathbf{x},\mathbf{x}_l)}{\partial r_{l,i,j}}$,

where $\nabla\mathbf{K}_{\mathbf{x}X}\nabla^T$ is a $3Nn \times 3N$ matrix whose $(3(l-1)N + 3(i-1) + j)$th row is $\nabla_{\mathbf{x}}\,\partial K(\mathbf{x},\mathbf{x}_l)/\partial r_{l,i,j}$, and $\omega_{l,i,j}$ is the $(3(l-1)N + 3(i-1) + j)$th entry of the vector $(\nabla\mathbf{K}_{XX}\nabla^T)^{-1}\mathbf{f}_F$. The $3N \times 3N$ predictive covariance follows:

$\mathrm{COV}[\mathbf{f}_F(\mathbf{x}) \mid \mathbf{f}_F] = \nabla\mathbf{K}_{\mathbf{x}\mathbf{x}}\nabla^T - (\nabla\mathbf{K}_{\mathbf{x}X}\nabla^T)^T(\nabla\mathbf{K}_{XX}\nabla^T)^{-1}\nabla\mathbf{K}_{\mathbf{x}X}\nabla^T$.

Figure 4 shows the estimated force on the first two dimensions using $n = 100$ and $n = 200$ training data from [14]. When the number of observations increases, the predictions become more accurate.

The GDML approach satisfies translational and rotational symmetry (or invariance). Translational symmetry means that the prediction of a physical quantity (such as force or energy) is the same for any two configurations with positions $\{\mathbf{r}_1, \ldots, \mathbf{r}_N\}$ and $\{\mathbf{r}_1 + \mathbf{h}, \ldots, \mathbf{r}_N + \mathbf{h}\}$ for any real-valued vector $\mathbf{h}$. Rotational symmetry means that the prediction of a physical quantity remains the same when all atoms are rotated by the same angle about an axis. Under these two operations, the predicted force does not change, because the descriptor of the molecule in the GDML approach remains the same.

The GDML approach does not comply with permutation symmetry, which requires that the physical quantity of interest be invariant when atoms of the same species are relabeled: the descriptor of a molecule in the GDML approach changes after relabeling the atoms. An improved approach, called symmetrized gradient-domain machine learning (sGDML) [15], seeks a permutation that minimizes the $L_2$ norm between two molecular graphs. For any configuration with atomic positions $(\mathbf{r}_1, \ldots, \mathbf{r}_N)$, sGDML aims to find a permutation $\tau$ aligning the $N \times N$ distance matrices, where the distance matrix $\mathbf{A}$ has $(i, i')$th term $\|\mathbf{r}_i - \mathbf{r}_{i'}\|$ for $i, i' = 1, \ldots, N$. In other words, for any two isomorphic molecular graphs with distance matrices $\mathbf{A}_G$ and $\mathbf{A}_H$, the permutation to align the matrices is estimated by $\hat{\tau} = \arg\min_\tau\|\mathbf{P}(\tau)\mathbf{A}_G\mathbf{P}(\tau)^T - \mathbf{A}_H\|$. The predictive performance of sGDML improves over that of GDML for most of the molecules considered in [15].

Predicting the total energy of a system is one of the most important tasks in first principles modeling. In the GDML approach above, the energy of a new molecular configuration with descriptor $\mathbf{x}$ can be predicted by the predictive mean $\mathbb{E}[f_E(\mathbf{x}) \mid f_E(\mathbf{x}_1), \ldots, f_E(\mathbf{x}_n)]$ based on $n$ training model runs.
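The symmetry properties just described can be checked numerically for an inverse-distance descriptor of the kind used in GDML. The following sketch is ours (all names are illustrative): translations and rotations leave the descriptor unchanged, while relabeling two atoms of the same species changes it, which is the motivation for sGDML.

```python
# Illustrative check of descriptor symmetries (our own sketch, not code from [14]).
import numpy as np

def descriptor(R):
    """Vector of inverse pairwise distances for atomic positions R (N x 3)."""
    iu = np.triu_indices(R.shape[0], k=1)
    d = np.linalg.norm(R[iu[0]] - R[iu[1]], axis=1)
    return 1.0 / d

rng = np.random.default_rng(1)
R = rng.standard_normal((5, 3))          # toy configuration of 5 atoms

h = np.array([1.0, -2.0, 0.5])           # a translation vector
theta = 0.7                              # a rotation about the z-axis
Q = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])

print(np.allclose(descriptor(R), descriptor(R + h)))    # translation: invariant
print(np.allclose(descriptor(R), descriptor(R @ Q.T)))  # rotation: invariant
perm = [1, 0, 2, 3, 4]                   # relabel atoms 0 and 1
print(np.allclose(descriptor(R), descriptor(R[perm])))  # permutation: not invariant
```

Because the kernel depends on the atomic positions only through this descriptor, any prediction built on it automatically inherits the first two invariances but not the third.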
Various other approaches based on the KRR estimator (or, equivalently, the predictive mean of Gaussian process regression) have been developed in recent studies to predict the total energy of an atomic system from KS-DFT calculations. Recent advances focus on developing new descriptors for emulating the energy. In [49], for instance, the descriptor of the energy of a molecule with atomic positions $(\mathbf{r}_1, \ldots, \mathbf{r}_N)$ is specified as a pseudo Coulomb matrix $\mathbf{D}$, where $D_{i,j} = Z_iZ_j/\|\mathbf{r}_i - \mathbf{r}_j\|$ for $i \ne j$ and $D_{i,i} = 0.5\,Z_i^{2.4}$, for $i, j = 1, \ldots, N$, where $Z_i$ is the nuclear valence of the $i$th atom. For two molecules with descriptors $\mathbf{D}$ and $\mathbf{D}'$, the Gaussian kernel is then used to parameterize the correlation with inputs $Vec(\mathbf{D})$ and $Vec(\mathbf{D}')$. The predictive mean of the GP can be used to estimate the energy of a molecule with a new set of atomic positions.

Another recent development concerns interatomic potentials. For an atomic system, the total energy can be decomposed as [1]

$f_E = \sum_\alpha\sum_{i\in\alpha} f_{E_i}^\alpha + \text{long-range contributions}$, (104)

where $f_{E_i}^\alpha$ is the local energy functional of atom $i$ of type $\alpha$, with compact support within a radius $r_{cut} \in \mathbb{R}^+$. The long-range contributions refer to electrostatic, polarizability, and van der Waals interactions.

We model $f_{E_i}^\alpha$ as a Gaussian process with inputs representing the neighboring atomic structure. A good representation of the local atomic structure should be invariant to the permutational, rotational, and translational symmetries discussed in Section 2.5. Various descriptors built from the geometric and radial information of the neighboring atoms are developed in [3, 2], including the power spectrum, bispectrum, radial basis, and angular Fourier series. After identifying the descriptor of an atom's neighboring features, the similarity between neighboring features may be measured by a kernel function.
In [4], for example, the neighbor density of atom $i$ is represented as a sum of Gaussian functions,

$\rho_i(\mathbf{r}) = \sum_{i'} f_{cut}(r_{ii'})\exp\left(-\frac{\|\mathbf{r} - \mathbf{r}_{ii'}\|^2}{2\gamma_{atom}^2}\right)$, (105)

where the summation is over the neighboring atoms $i'$, including atom $i$ itself; $\gamma_{atom}$ is a fixed range parameter; and $f_{cut}(\cdot)$ is a cut-off function that continuously decreases to zero beyond a cutoff radius. The Smooth Overlap of Atomic Positions (SOAP) kernel developed in [3] was used in [4] to parameterize the covariance between the neighbor features of atoms $i$ and $j$, denoted $\mathbf{x}_i$ and $\mathbf{x}_j$:

$K(\mathbf{x}_i, \mathbf{x}_j) = \sigma^2\left|\frac{\tilde{K}(\mathbf{x}_i, \mathbf{x}_j)}{\sqrt{\tilde{K}(\mathbf{x}_i, \mathbf{x}_i)\tilde{K}(\mathbf{x}_j, \mathbf{x}_j)}}\right|^{\xi}$,

where $\tilde{K}$ is defined by first integrating the square of the overlap of the neighbor densities and then integrating over all possible 3D rotations:

$\tilde{K}(\mathbf{x}_i, \mathbf{x}_j) = \int_{\mathbf{R}\in SO(3)}d\mathbf{R}\left|\int_{\mathbf{r}\in\mathbb{R}^3}d\mathbf{r}\,\rho_i(\mathbf{r})\rho_j(\mathbf{R}\mathbf{r})\right|^2$, (106)

with $\mathbf{R}$ a three-dimensional rotation matrix in the 3D rotation group SO(3), $\sigma^2$ the variance parameter, and $\xi$ a positive integer exponent. The covariance between the energies $f_{E_a}$ and $f_{E_b}$ of two sets of atoms, denoted $a$ and $b$, can be computed by

$\mathbb{E}[f_{E_a}f_{E_b}] = \sum_{i\in\text{set }a}\sum_{j\in\text{set }b}K(\mathbf{x}_i, \mathbf{x}_j)$,

as the mean is assumed to be zero. The covariance of other quantities, such as the force, can be computed from derivatives of the kernel function.

The GP model with the SOAP kernel achieved accurate predictive performance for silicon clusters and the bulk crystal [3, 4]. The development of interatomic potentials is ambitious, as it allows one to use GP regression to compute the predictive distribution of the energies of possibly any set of atoms, based on the proximity of this atom set to the training atom sets.
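For illustration, the neighbor density in (105) can be evaluated directly. The smooth cosine cutoff below is a common choice that we assume for this sketch; it is not necessarily the cutoff function used in [4], and the positions and parameter values are toy inputs of our own.

```python
# Illustrative evaluation of the neighbor density (105); names and values are ours.
import numpy as np

def f_cut(r, r_cut=4.0):
    """Smooth cosine cutoff: 1 at r = 0, decreasing continuously to 0 at r = r_cut."""
    return np.where(r < r_cut, 0.5 * (1.0 + np.cos(np.pi * r / r_cut)), 0.0)

def neighbor_density(r_grid, rel_pos, gamma_atom=0.5):
    """Density rho_i at points r_grid (m x 3) from neighbor offsets rel_pos (N x 3)."""
    out = np.zeros(len(r_grid))
    for rp in rel_pos:  # sum of cutoff-weighted Gaussians centered at each neighbor
        out += f_cut(np.linalg.norm(rp)) * np.exp(
            -np.sum((r_grid - rp) ** 2, axis=1) / (2.0 * gamma_atom ** 2))
    return out

# neighbors of atom i: the atom itself plus two nearby atoms (toy geometry)
rel_pos = np.array([[0.0, 0.0, 0.0],
                    [1.2, 0.0, 0.0],
                    [0.0, 1.5, 0.0]])
r_grid = np.array([[0.0, 0.0, 0.0],    # at the central atom
                   [1.2, 0.0, 0.0],    # at a neighbor
                   [5.0, 5.0, 5.0]])   # far outside the cutoff
rho = neighbor_density(r_grid, rel_pos)
print(rho)  # peaks near atoms, essentially zero far away
```

The resulting density peaks at the atomic positions and vanishes far from all neighbors; the SOAP kernel (106) then compares two such densities after integrating out all rigid rotations.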
One important quantity in the KS-DFT calculation is the electron density, from which one can compute other quantities. Various approaches have been developed to emulate the electron density; for example, the electron density is emulated based on Gaussian potential functions in [8]. Unlike the energies and forces, the electron density is typically represented as a vector-valued output, either on a Cartesian grid or as coefficients of a Fourier basis. Emulating a vector-valued function by Gaussian processes has been studied in recent years. We briefly review these approaches in this section. We denote the input (i.e., a descriptor function of a set of atomic positions) by $\mathbf{x} \in \mathbb{R}^p$ and the electron density by $\boldsymbol{\rho}(\mathbf{x}) = [\rho_1(\mathbf{x}), \ldots, \rho_k(\mathbf{x})]^T$ at $k$ spatial grid points.

Many single GP emulators. The simplest approach is to model the electron density at each grid point independently by a GP emulator, $\rho_j(\cdot) \sim \mathcal{GP}(m_j(\cdot), K_j(\cdot,\cdot))$, with a mean function $m_j(\cdot)$ and covariance function $K_j(\cdot,\cdot)$. The parameters of each GP emulator can be estimated by maximum likelihood separately for each grid point $j$. The predictive distribution at any grid point $j$ for a new atom set with descriptor $\mathbf{x}$ can be computed as $(\rho_j(\mathbf{x}) \mid \rho_j(\mathbf{x}_1), \ldots, \rho_j(\mathbf{x}_n), \hat{\boldsymbol{\beta}}_j, \hat{\sigma}_j^2, \hat{\boldsymbol{\gamma}}_j) \sim \mathcal{N}(m_j^*(\mathbf{x}), K_j^*(\mathbf{x},\mathbf{x}))$, where $(\hat{\boldsymbol{\beta}}_j, \hat{\sigma}_j^2, \hat{\boldsymbol{\gamma}}_j)$ are the estimated mean, variance, and kernel parameters, respectively; $m_j^*(\cdot)$ and $K_j^*(\cdot,\cdot)$ follow Equations (89) and (90), respectively, with the estimated parameters $(\hat{\boldsymbol{\beta}}_j, \hat{\sigma}_j^2, \hat{\boldsymbol{\gamma}}_j)$ plugged in for grid point $j$. We call this approach many single (MS) GP emulators.

The computational cost of the MS GP emulators is of the order $O(kn^3)$, which can be prohibitive when the number of grid points or the number of training atom sets is large. Besides, the parameters of each local GP emulator are estimated from the data at a single grid point, which can be unstable.

Separable GP emulator.
The output at two neighboring spatial grid points is positively correlated, but this correlation is not exploited by the MS GP emulators. Another approach is to assume a separable GP emulator, such that $(\boldsymbol{\rho}(\mathbf{x}_1), \ldots, \boldsymbol{\rho}(\mathbf{x}_n)) \sim \mathcal{MN}(\mathbf{M}, \mathbf{K}_{SS}\otimes\mathbf{K}_{XX})$, where $\mathbf{M}$ is a $k \times n$ mean matrix, $\mathbf{K}_{SS}$ is the covariance matrix of the spatial inputs, $\mathbf{K}_{XX}$ is the covariance matrix of the descriptors, and $\otimes$ denotes the Kronecker product. Here the covariance of the data is modeled separably by a spatial covariance matrix $\mathbf{K}_{SS}$ and a covariance matrix of the descriptors $\mathbf{K}_{XX}$. Conditional on the parameters, the predictive distribution also follows a normal distribution with mean and variance in closed form [56].

When the number of spatial grid points is smaller than the number of training model runs (i.e., $k < n$), a conjugate prior distribution for $\mathbf{K}_{SS}$ can be specified as an inverse-Wishart distribution [17], and $\mathbf{K}_{SS}$ can be integrated out when computing the predictive distribution. For a 3D electron density, however, the number of grid points is typically larger than the number of model runs in the training data. In this scenario, $\mathbf{K}_{SS}$ may be parameterized by a kernel function with the spatial coordinate as the input. The computational cost of the separable GP is $O(k^3) + O(n^3)$ in general, which is daunting for even a moderate number of grid points when emulating the 3D electron density.

Parallel partial GP emulator. One computationally feasible approach is the parallel partial Gaussian process (PP GP) emulator [23]. In this model, we assume the output density at grid point $j$ follows $\rho_j(\cdot) \sim \mathcal{GP}(m_j(\cdot), \sigma_j^2C(\cdot,\cdot))$, with a different mean function and variance parameter for each spatial grid point and a kernel function shared across grid points. The maximum likelihood estimators of the mean and variance parameters have closed-form expressions, whereas the parameters in the kernel function do not.
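To see why a shared kernel pays off computationally, consider the following toy sketch of our own (not the RobustGaSP implementation): the shared correlation matrix is factorized once at $O(n^3)$ cost, and the resulting factor is reused for every one of the $k$ grid points, so all predictive means are obtained with a single matrix-vector product per new input.

```python
# Illustrative zero-mean sketch of the shared-kernel idea behind the PP GP emulator.
import numpy as np

def gauss_corr(x1, x2, gamma=1.5):
    """Isotropic Gaussian correlation for one-dimensional inputs."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-(d / gamma) ** 2)

rng = np.random.default_rng(2)
n, k = 30, 1000                           # n training runs, k spatial grid points
x = np.sort(rng.uniform(0.0, 5.0, n))     # toy 1D descriptors
# toy "densities": one smooth curve per grid point, stacked as a k x n matrix
P = np.vstack([np.sin(x + 0.01 * j) for j in range(k)])

C = gauss_corr(x, x) + 1e-8 * np.eye(n)   # shared correlation matrix (small jitter)
L = np.linalg.cholesky(C)                 # factorized once: O(n^3)

x_new = np.array([2.3])
c = gauss_corr(x_new, x)                  # 1 x n cross-correlations
w = np.linalg.solve(L.T, np.linalg.solve(L, c.T))  # reused for every grid point
pred = P @ w                              # all k predictive means at once
print(pred.shape)
```

The per-grid work reduces to the product `P @ w`, which is the source of the $O(kn^2)$ scaling (linear in $k$) discussed below.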
Since the kernel function is shared across spatial grid points, only a few kernel parameters need to be estimated numerically, which is more stable than the MS GP emulators. The computational complexity of the PP GP emulator is $O(n^3) + O(kn^2)$ for $n$ training electron densities on $k$ spatial grid points, which is more efficient than both the MS GP emulators and the separable GP emulator. The linear computational complexity with respect to $k$ allows the PP GP emulator to handle densities on a large number of grid points. Even though the computational complexity of the PP GP is much smaller than that of the separable GP emulator, as shown in [23], the predictive mean of the PP GP emulator is exactly the same as that of the separable GP emulator, and the predictive variances of the two emulators are similar.

Semiparametric latent factor model. We introduce a useful class of the linear model of coregionalization, called the semiparametric latent factor model [52], for modeling the $k$-dimensional electron density at $k$ grid points:

$\boldsymbol{\rho}(\mathbf{x}) = \mathbf{A}\mathbf{z}(\mathbf{x}) + \boldsymbol{\epsilon}$, (107)

where $\mathbf{z}(\mathbf{x}) = [z_1(\mathbf{x}), \ldots, z_d(\mathbf{x})]^T$ with $z_l(\cdot) \sim \mathcal{GP}(m_l(\cdot), K_l(\cdot,\cdot))$ independently for $l = 1, \ldots, d$; $\mathbf{A}$ is a $k \times d$ latent factor loading matrix that relates the factors to the observations; and $\boldsymbol{\epsilon}$ is a vector of independent Gaussian noises.

The latent factor loading matrix may be estimated by principal component analysis, whose linear subspace is equivalent to the maximum marginal likelihood estimator (MMLE) of the factor loadings when the factors are independent and each factor in model (107) follows a GP. Denote by $\mathbf{P} = [\boldsymbol{\rho}(\mathbf{x}_1), \ldots, \boldsymbol{\rho}(\mathbf{x}_n)]$ the $k \times n$ observation matrix of $n$ training electron densities at $k$ spatial grid points, and let $\mathbf{Z} = [\mathbf{z}(\mathbf{x}_1), \ldots, \mathbf{z}(\mathbf{x}_n)]$ be the $d \times n$ factor matrix. The MMLE of the latent factor loading matrix is stated in the following theorem:

Theorem 5.
For model (107), assume $\mathbf{A}^T\mathbf{A} = \mathbf{I}_d$. After marginalizing out $\mathbf{Z}$:

• If $\boldsymbol{\Sigma}_1 = \cdots = \boldsymbol{\Sigma}_d = \boldsymbol{\Sigma}$, the marginal likelihood is maximized when

$\hat{\mathbf{A}} = \mathbf{U}\mathbf{R}$, (108)

where $\mathbf{U}$ is a $k \times d$ matrix of the first $d$ principal eigenvectors of

$\mathbf{G} = \mathbf{P}(\sigma_0^2\boldsymbol{\Sigma}^{-1} + \mathbf{I}_n)^{-1}\mathbf{P}^T$, (109)

and $\mathbf{R}$ is an arbitrary $d \times d$ orthogonal rotation matrix;

• If $\boldsymbol{\Sigma}_i \ne \boldsymbol{\Sigma}_j$ for some $i \ne j$, denoting $\mathbf{G}_l = \mathbf{P}(\sigma_0^2\boldsymbol{\Sigma}_l^{-1} + \mathbf{I}_n)^{-1}\mathbf{P}^T$, the maximum marginal likelihood estimator is

$\hat{\mathbf{A}} = \arg\max_{\mathbf{A}}\sum_{l=1}^{d}\mathbf{a}_l^T\mathbf{G}_l\mathbf{a}_l$, s.t. $\mathbf{A}^T\mathbf{A} = \mathbf{I}_d$. (110)

The proof of Theorem 5, along with the parameter estimation, can be found in [24]. Denote by $(\hat{\boldsymbol{\gamma}}, \hat{\sigma}^2, \hat{\sigma}_0^2)$ the estimated kernel parameters, signal variance, and noise variance parameters. Assuming $\mathbf{A}^T\mathbf{A} = \mathbf{I}_d$ and marginalizing out $\mathbf{Z}$, for any $\mathbf{x}$ one has the predictive distribution

$\boldsymbol{\rho}(\mathbf{x}) \mid \mathbf{P}, \hat{\mathbf{A}}, \hat{\boldsymbol{\gamma}}, \hat{\sigma}^2, \hat{\sigma}_0^2 \sim \mathcal{MN}\left(\hat{\boldsymbol{\mu}}^*(\mathbf{x}), \hat{\boldsymbol{\Sigma}}^*(\mathbf{x})\right)$, (111)

where the predictive mean follows

$\hat{\boldsymbol{\mu}}^*(\mathbf{x}) = \hat{\mathbf{A}}\hat{\mathbf{z}}(\mathbf{x})$, (112)

with $\hat{\mathbf{z}}(\mathbf{x}) = (\hat{z}_1(\mathbf{x}), \ldots, \hat{z}_d(\mathbf{x}))^T$, $\hat{z}_l(\mathbf{x}) = \hat{\boldsymbol{\Sigma}}_l^T(\mathbf{x})(\hat{\boldsymbol{\Sigma}}_l + \hat{\sigma}_0^2\mathbf{I}_n)^{-1}\mathbf{P}^T\hat{\mathbf{a}}_l$, and $\hat{\boldsymbol{\Sigma}}_l(\mathbf{x}) = \hat{\sigma}_l^2(\hat{C}_l(\mathbf{x}_1,\mathbf{x}), \ldots, \hat{C}_l(\mathbf{x}_n,\mathbf{x}))^T$ for $l = 1, \ldots, d$; the predictive variance follows

$\hat{\boldsymbol{\Sigma}}^*(\mathbf{x}) = \hat{\mathbf{A}}\hat{\mathbf{D}}(\mathbf{x})\hat{\mathbf{A}}^T + \hat{\sigma}_0^2(\mathbf{I}_k - \hat{\mathbf{A}}\hat{\mathbf{A}}^T)$, (113)

with $\hat{\mathbf{D}}(\mathbf{x})$ a diagonal matrix whose $l$th diagonal term is $\hat{D}_l(\mathbf{x}) = \hat{\sigma}_l^2\hat{C}_l(\mathbf{x},\mathbf{x}) + \hat{\sigma}_0^2 - \hat{\boldsymbol{\Sigma}}_l^T(\mathbf{x})\left(\hat{\boldsymbol{\Sigma}}_l + \hat{\sigma}_0^2\mathbf{I}_n\right)^{-1}\hat{\boldsymbol{\Sigma}}_l(\mathbf{x})$, for $l = 1, \ldots, d$.

The covariance of the observations in model (107) follows $\sum_{l=1}^{d}\boldsymbol{\Sigma}_l\otimes\mathbf{a}_l\mathbf{a}_l^T$, where the $(i, j)$th term of $\boldsymbol{\Sigma}_l$ is $K_l(\mathbf{x}_i, \mathbf{x}_j)$ for $1 \le i, j \le n$ and $l = 1, \ldots, d$. Compared with the separable GP emulator, this covariance matrix is not separable, representing a more flexible class of models. As we will see in Section 3.1, the estimator of the electron density in [8] can be written as the predictive mean in (112).

In [8], the KRR and a Fourier basis are used to emulate electron densities based on the locations of the atoms.
Suppose the molecule has $N$ atoms. The descriptor in this approach is chosen to be a Gaussian potential function at a spatial coordinate $\mathbf{r}$:

$v(\mathbf{r}) = \sum_{j=1}^{N}Z_j\exp\left(-\frac{\|\mathbf{r} - \mathbf{r}_j^{atom}\|^2}{2\gamma^2}\right)$, (114)

where $Z_j$ and $\mathbf{r}_j^{atom}$ are the nuclear charge and spatial location of the $j$th atom, respectively, and $\gamma$ is a fixed parameter.

Consider $n$ observed electron densities denoted $\mathbf{P} = [\boldsymbol{\rho}(\mathbf{x}_1), \ldots, \boldsymbol{\rho}(\mathbf{x}_n)]$, a $k \times n$ matrix at atomic configurations $[\mathbf{x}_1, \ldots, \mathbf{x}_n]$, with $\mathbf{x}_i = [\mathbf{r}_{i1}^{atom}, \ldots, \mathbf{r}_{iN}^{atom}]$, where $\mathbf{r}_{ij}^{atom}$ is the $j$th atomic position of the $i$th simulated run for $i = 1, \ldots, n$ and $j = 1, \ldots, N$. Denote the external potential for the electron densities of the $i$th simulated run, $i = 1, \ldots, n$, by $\mathbf{v}_i = [v_i(\mathbf{r}_1), \ldots, v_i(\mathbf{r}_k)]^T$ at locations $\{\mathbf{r}_1, \ldots, \mathbf{r}_k\}$. Further denote the electron density of interest $\boldsymbol{\rho}(\mathbf{v}) = (\rho(v(\mathbf{r}_1)), \ldots, \rho(v(\mathbf{r}_k)))^T$ for any potential $\mathbf{v} = (v(\mathbf{r}_1), \ldots, v(\mathbf{r}_k))^T$ with $v(\cdot)$ following Equation (114). In [8], the estimator of the electron density at any external potential $\mathbf{v}$ can be written as

$\hat{\boldsymbol{\rho}}(\mathbf{v}) = \mathbf{A}_r\hat{\mathbf{z}}(\mathbf{v})$, (115)

where $\mathbf{A}_r = [\mathbf{a}_1, \ldots, \mathbf{a}_d]$ is a $k \times d$ matrix of basis functions over the spatial coordinates, with $\mathbf{a}_l = (a_l(\mathbf{r}_1), \ldots, a_l(\mathbf{r}_k))^T$ for $l = 1, \ldots, d$, and the $l$th term of $\hat{\mathbf{z}}(\mathbf{v}) = (\hat{z}_1(\mathbf{v}), \ldots, \hat{z}_d(\mathbf{v}))^T$ follows

Fig. 5
Emulation of the density at $r = 10$ for one held-out density of benzene. The held-out density, predictive density, and residuals are shown in the left, middle, and right panels, respectively; note that the color scale of the right panel is smaller than that of the other two panels.

$\hat{z}_l(\mathbf{v}) = \mathbf{C}_l(\mathbf{v})(\mathbf{C}_l + \lambda_l\mathbf{I}_n)^{-1}\mathbf{z}^{(l)}$,

with $\mathbf{C}_l(\mathbf{v}) = (C_l(\mathbf{v},\mathbf{v}_1), \ldots, C_l(\mathbf{v},\mathbf{v}_n))$, the $(i, j)$th term of $\mathbf{C}_l$ being $C_l(\mathbf{v}_i, \mathbf{v}_j)$ for $1 \le i, j \le n$, $\lambda_l$ a tuning parameter, and the $i$th entry of $\mathbf{z}^{(l)} = (z_1^{(l)}, \ldots, z_n^{(l)})^T$ being $z_i^{(l)} = \mathbf{a}_l^T\boldsymbol{\rho}(\mathbf{x}_i)$, for $l = 1, \ldots, d$ and $i = 1, \ldots, n$. In [8], an orthogonal Fourier basis (i.e., $\mathbf{A}_r^T\mathbf{A}_r = \mathbf{I}_d$) is used to parameterize the factor loading matrix, and an isotropic Gaussian kernel is used to parameterize the covariance between any two electron densities, with the external potential function as the input.

Remark 3.
Suppose that for any external potential $\mathbf{v}$ we model the electron density by $\boldsymbol{\rho}(\mathbf{v}) = \mathbf{A}_r\mathbf{z}(\mathbf{v})$, where $\boldsymbol{\rho}(\mathbf{v})$ is the $k$-dimensional electron density vector and the factors $\mathbf{z}(\cdot) = [z_1(\cdot), \ldots, z_d(\cdot)]^T$ are modeled as $z_l(\cdot) \sim \mathcal{GP}(0, \tilde{K}_l(\cdot,\cdot))$ with $\tilde{K}_l(\mathbf{v}_i, \mathbf{v}_j) = \sigma_l^2C_l(\mathbf{v}_i, \mathbf{v}_j) + \sigma_0^2\,1_{\mathbf{v}_i = \mathbf{v}_j}$ for $l = 1, \ldots, d$. The factor loading matrix $\mathbf{A}_r$ is a $k \times d$ matrix with $\mathbf{A}_r^T\mathbf{A}_r = \mathbf{I}_d$. Conditional on the kernel and variance parameters, the estimator of the electron density in (115) is equivalent to the predictive mean estimator in Equation (112) with the external potential as the input. The uncertainty (e.g., the 95% predictive interval) can also be obtained from Equation (111).

We show the prediction of the electron density for one held-out test data set from [8] based on the PP GP emulator implemented in the RobustGaSP R package [26]. Only 150 electron densities are used to train the PP GP emulator, and it achieves relatively high accuracy. The method based on the Fourier basis has predictive accuracy similar to that of the PP GP emulator and is thus not shown here.

Physics-based modeling and data science are complementary but progressed almost in parallel until very recently. On the one hand, quantum and statistical mechanics calculations are able to predict the properties of virtually any matter in the universe from first principles. Whereas the theoretical foundation has been in place for nearly a century, one of the greatest dilemmas in modern science and engineering is that, owing to the seemingly insurmountable computational cost, exact equations are hardly applicable to complex systems of practical interest. On the other hand, statistical and machine learning techniques have gained considerable attention in recent years, rendering new prospects for solving a wide variety of physics-based models with a tradeoff between numerical accuracy and computational cost.
In conjunction with recent progress in computer technology, in particular novel architectures such as graphics processing units (GPUs), statistical and machine-learning algorithms will potentially overcome the major hurdles of first principles calculations.

A number of theoretical and simulation methods can be used for physics-based modeling. Among them, quantum Monte Carlo (QMC) simulation and density functional theory (DFT) represent two generic theoretical frameworks that one may adopt to balance accuracy and computational efficiency. In combination with machine-learning methods, these approaches will potentially have transformative impacts on technological advances, including the computational design of innovative devices and materials.
Acknowledgements
J.W. acknowledges financial support from the U.S. National Science Foundation's Harnessing the Data Revolution (HDR) Big Ideas Program under Grant No. NSF 1940118.
References

[1] Bartók AP, Csányi G (2015) Gaussian approximation potentials: a brief tutorial introduction. International Journal of Quantum Chemistry 115(16):1051–1057
[2] Bartók AP, Payne MC, Kondor R, Csányi G (2010) Gaussian approximation potentials: the accuracy of quantum mechanics, without the electrons. Physical Review Letters 104(13):136403
[3] Bartók AP, Kondor R, Csányi G (2013) On representing chemical environments. Physical Review B 87(18):184115
[4] Bartók AP, Kermode J, Bernstein N, Csányi G (2018) Machine learning a general-purpose interatomic potential for silicon. Physical Review X 8(4):041048
[5] Bayarri MJ, Berger JO, Paulo R, Sacks J, Cafeo JA, Cavendish J, Lin CH, Tu J (2007) A framework for validation of computer models. Technometrics 49(2):138–154
[6] Bayarri MJ, Berger JO, Calder ES, Dalbey K, Lunagomez S, Patra AK, Pitman EB, Spiller ET, Wolpert RL (2009) Using statistical and computer models to quantify volcanic hazards. Technometrics 51:402–413
[7] Behler J (2017) First principles neural network potentials for reactive simulations of large molecular and condensed systems. Angewandte Chemie International Edition 56(42):12828–12840