Global optimization of atomic structures with gradient-enhanced Gaussian process regression
Sami Kaappa, Estefanía Garijo del Río, Karsten Wedel Jacobsen
Department of Physics, Technical University of Denmark, Kongens Lyngby, Denmark
(Dated: February 23, 2021)

Determination of atomic structures is a key challenge in the fields of computational physics and materials science, as a large variety of mechanical, chemical, electronic, and optical properties depend sensitively on structure. Here, we present a global optimization scheme where energy and force information from density functional theory (DFT) calculations is transferred to a probabilistic surrogate model to estimate both the potential energy surface (PES) and the associated uncertainties. The local minima in the surrogate PES are then used to guide the search for the global minimum in the DFT potential. We find that adding the gradients in most cases improves the efficiency of the search significantly. The method is applied to global optimization of [Ta2O5]x clusters with x = 1, 2, 3, and the surface structure of oxidized ZrN.
I. INTRODUCTION
Global optimization in high-dimensional space is a long-standing challenge in numerical analysis, and also in physics, chemistry, and materials science. The structure of an atomic system is at low temperature given by the global minimum point of the potential energy surface (PES), which is a function, E(x), of all atomic coordinates x. For atomic systems with more than a few atoms, the dimensionality therefore constitutes a challenge. Furthermore, the PES is usually determined by a quantum mechanical calculation with, for example, density functional theory (DFT), and these calculations are computationally demanding, so the optimization should be performed with as few function evaluations as possible.

Numerous algorithms for finding the structural ground state of a system are implemented for materials science problems [1], such as basin hopping [2], evolutionary algorithms [3–5], and particle swarm optimization [6], but the issue with these methods remains the large number of DFT evaluations required by the algorithms.

Recently, machine-learned surrogate models have been considered in order to overcome the problem of spending excessive amounts of computer resources on DFT calculations. A surrogate model for the PES is constructed based on a dataset typically obtained with DFT, and it allows for subsequent much faster evaluation of atomic energies and forces. Surrogate models have been used in local optimization [7], global optimization [4, 8–12], nudged-elastic-band calculations [13, 14], searches for transition states [15], adsorption studies [16], and the design of force fields [17–19].

Many of the surrogate models are based on Bayesian inference, or Gaussian processes (GP) [20, 21], where the resulting PES is a sum of joint kernel functions centered at the training points. In its most traditional form, the GP is trained only with the target values, i.e. the electronic ground-state energies in the context of computational chemistry.
However, since forces are readily available after a ground-state DFT calculation, we train the model also with the gradients of the target value, i.e. with the forces on each atom. The inclusion of gradients is crucial in local structure optimization based on surrogate models [7] and has also been shown to generally improve model predictions [22].

The construction of the PES based on a GP usually involves the introduction of a distance measure or similarity of two different atomic configurations. If two configurations are close, it is assumed that their energies and forces will be close as well. To this end it may be advantageous to describe the atomic configurations using a structural fingerprint (alias descriptor), which in the simplest case that we shall consider here is simply a mapping from the atomic Cartesian coordinates to a (typically high-dimensional) vector. The similarity of two atomic configurations can then be estimated based on the difference between the two fingerprint vectors.

The introduction of a fingerprint vector may have several advantages. For example, a fingerprint can be constructed to reflect the translational, rotational, and permutational invariances of the atomic configuration, i.e. if two configurations differ by only a permutation of identical atoms, the fingerprint will be unchanged. This has the consequence that the predicted PES will exhibit the same symmetries. Furthermore, a good descriptor is able to catch the relevant information of the configuration for the underlying problem.
A simple example of an atomic fingerprint is the Coulomb matrix [23], which represents the atomic configuration using inverse distances, but more elaborate fingerprints have been developed during the latest years, such as SOAP [24], ACSF [25], the many-body tensor representation [26], and FCHL [27].

This work follows the pioneering approach for efficient global optimization of atomic structures by Bisbo and Hammer [11, 28] with the essential difference that we train our GP regression model with both energies and forces. We also note that the implementation of the approach involves many choices and parameters, where we might differ from Bisbo and Hammer. For example, we use a similar global fingerprint, but introduce an additional smooth cutoff function to obtain a smoother representation of the gradients. In general, the gradients are seen to improve the efficiency of the global optimization, and we illustrate this through applications to a 15-atom Cu cluster, bulk SiO2, a Ti5O10 cluster, bulk TiO2, and bulk silicon.

This article is organized as follows. In section II, we describe the surrogate model that we use to predict energies and forces of atomic structures during a global search. In section III, we illustrate the predictive power of the model by generating learning curves for a Cu cluster and bulk SiO2. In section IV, we describe the global optimization approach, and then in section V we demonstrate its performance on a Cu cluster, bulk SiO2, [Ta2O5]x clusters, a ZrN-O surface, a Ti5O10 cluster, bulk TiO2, and bulk silicon. In section VI, we discuss computational performance issues before we finally conclude.

II. SURROGATE MODEL

A. Gaussian process with gradients
To model the potential energy surface of an atomic structure, we use a Gaussian process that learns energies and forces (i.e. gradients) from existing data. A Gaussian process uses Bayesian inference and is based on the assumption that the prior distribution for the data is given by a multi-dimensional normal distribution. The result is that the predicted energy and forces, μ(x) = (E(x), F(x)), at a given atomic configuration x with fingerprint ρ(x) can be described [20, 29] as

    μ(x) = μ_p(x) + K(ρ(x), P) C(P, P)^{-1} (y − μ_p(X)),    (1)

where X is a list with the atomic configurations of the training data and P = ρ(X) a list with the corresponding fingerprints. y is a vector that contains the energies and forces of the training data, μ_p denotes the prior energy and forces (μ_p(x) for the given configuration and μ_p(X) for the training data points), K denotes the covariance matrix, and C is the regularized K-matrix of covariances between training data points. The resulting vector μ(x) contains the predicted total energy and the predicted forces for each atom. For the sake of readability, we will write ρ = ρ(x) for the fingerprint for the rest of this section.

The covariance matrix K is built as follows. A covariance matrix K̃ between two atomic configurations contains the covariances between energies, between energy and forces, and between forces, and is written as [29]

    K̃(ρ_1, ρ_2) = ( k(ρ_1, ρ_2)       (∇_2 k(ρ_1, ρ_2))^T
                    ∇_1 k(ρ_1, ρ_2)   ∇_1 (∇_2 k(ρ_1, ρ_2))^T )    (2)

with kernel (or covariance) function k(ρ_1, ρ_2). Here, ∇_i operates on the Cartesian coordinates of the atomic configuration x_i, represented by the fingerprint ρ_i. Using the formulation of K̃, we store the covariances between a single fingerprint ρ and all the training-point fingerprints in P in the matrix K(ρ, P) as

    K_j(ρ, P) = K̃(ρ, P_j)    (3)

where j runs over the number of training points.
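As a concreteness check of equation (2), the 2×2 covariance block between energy and gradient observations can be written out for a squared-exponential kernel in one dimension. This is a minimal sketch with a scalar coordinate standing in for the fingerprint; the function and parameter names are illustrative, not taken from the actual implementation.

```python
import numpy as np

def rbf(x1, x2, sigma=1.0, ell=1.0):
    """Squared-exponential kernel k(x1, x2) for scalar inputs."""
    return sigma**2 * np.exp(-(x1 - x2)**2 / (2 * ell**2))

def covariance_block(x1, x2, sigma=1.0, ell=1.0):
    """2x2 block K~(x1, x2) of equation (2): covariances between
    (energy, gradient) at x1 and (energy, gradient) at x2."""
    k = rbf(x1, x2, sigma, ell)
    d = (x1 - x2) / ell**2
    return np.array([
        [k,      d * k],                    # cov(E1, E2),      cov(E1, dE2/dx2)
        [-d * k, (1 / ell**2 - d**2) * k],  # cov(dE1/dx1, E2), cov(dE1/dx1, dE2/dx2)
    ])
```

The off-diagonal entries are the exact kernel derivatives and can be verified against finite differences; in the full model the scalar derivative becomes a gradient with respect to all 3N_atoms Cartesian coordinates, giving blocks of size (3N_atoms + 1) × (3N_atoms + 1).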
Finally, the matrix C contains the covariances between all the training points and a diagonal regularization that describes the estimated noise, or uncertainty, of the training data. Its elements are thus given by

    C_ij(P, P) = K̃(P_i, P_j) + diag(σ_{n,E}^2, σ_{n,f}^2) δ_ij    (4)

where diag(σ_{n,E}^2, σ_{n,f}^2) represents the diagonal matrix with σ_{n,E}^2 in the first entry and σ_{n,f}^2 in the remaining ones. In this way, we have introduced separate regularizations σ_{n,E} and σ_{n,f} for the energy and force covariances, respectively. Throughout this work, the regularization is set so that the ratios σ_{n,E}/σ and σ_{n,f}/σ between the regularization parameters and the kernel prefactor σ (defined below in equation 9) are kept fixed.

The predicted uncertainty is obtained as

    Σ(x) = [K̃(ρ, ρ) − K(ρ, P) C(P, P)^{-1} K(P, ρ)]^{1/2}    (5)

The uncertainties are used for the global optimization through the acquisition function as described below in section IV.

B. The model
We describe the atomic structure by a fingerprint which has two terms: a radial distribution and an angular distribution. Using such global distributions ensures the rotational, translational, and permutational symmetries for a system. The radial distribution is motivated by the radial distribution function described by Valle and Oganov [30]. However, to remove the discontinuity at the cutoff distance, we use a smooth weighting factor. The radial part of the fingerprint is calculated as

    ρ^R_AB(r; x) = Σ_{i∈A, j∈B} (1 / r_ij^2) f_c(r_ij; R^R_c) exp(−|r − r_ij|^2 / 2δ_R^2)    (6)

where r_ij is the distance between atoms i and j in the set of coordinates x, and δ_R is a smearing factor with a fixed value. r denotes the discrete variable with 200 values, ranging from 0 to the cutoff distance R^R_c. The smooth function f_c has the form

    f_c(r; R_c) = 1 − (1 + γ)(r/R_c)^γ + γ (r/R_c)^{γ+1}   if r ≤ R_c
                  0                                        if r > R_c    (7)

with a cutoff distance R_c and γ = 2. This form for f_c(r) has zero value and zero derivative at r = R_c. Due to the factor 1/r_ij^2, the form of equation 6 has the property of giving more weight to small distances below R^R_c. The angular part of the fingerprint is given by

    ρ^α_ABC(θ; x) = Σ_{i∈A, j∈B, k∈C} f_c(r_ij; R^α_c) f_c(r_jk; R^α_c) exp(−|θ − θ_ijk|^2 / 2δ_α^2)    (8)

where θ is a discrete variable with 100 values that range from 0 to π, θ_ijk is the angle between atoms i, j, and k, and δ_α is a smearing factor with a fixed value. For f_c in the angular part we use a smaller value of γ, with R_c = R^α_c. The different γ value to that of the radial part comes from our observation that the predicted potential energy surfaces were not smooth enough at the angular cutoff radius when using the value of γ = 2 in the angular part. In our work, the cutoff radius R^R_c has values between 4.0 and 8.0 Å, and R^α_c has values between 3.6 and 4.0 Å, comparable to the radii studied in, e.g., [19].
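To illustrate equations (6) and (7), a minimal implementation of the radial fingerprint for a single-element cluster might look as follows. The 200-point grid follows the text, while the smearing width `delta` and the cutoff `r_c` are illustrative placeholder values, since the fixed numerical value of δ_R is not reproduced here.

```python
import numpy as np

def f_cut(r, r_c, gamma=2):
    """Smooth cutoff of equation (7): both the value and the first
    derivative vanish at r = r_c."""
    r = np.asarray(r, dtype=float)
    t = np.minimum(r / r_c, 1.0)
    return np.where(r <= r_c,
                    1.0 - (1.0 + gamma) * t**gamma + gamma * t**(gamma + 1),
                    0.0)

def radial_fingerprint(positions, r_c=6.0, n_bins=200, delta=0.2):
    """Radial part of the fingerprint (eq. 6) for a single element:
    a Gaussian-smeared, 1/r^2-weighted distribution of pair distances,
    damped by the smooth cutoff."""
    r_grid = np.linspace(0.0, r_c, n_bins)
    rho = np.zeros(n_bins)
    n = len(positions)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            rij = np.linalg.norm(positions[i] - positions[j])
            if rij < r_c:
                rho += (f_cut(rij, r_c) / rij**2
                        * np.exp(-(r_grid - rij)**2 / (2.0 * delta**2)))
    return rho
```

Because the fingerprint depends only on interatomic distances (and, for the angular part, angles), translating, rotating, or permuting identical atoms leaves it unchanged, which is the symmetry property discussed above.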
The fingerprint is very similar to the one used by Bisbo and Hammer [28] but with the additional cutoff function for the radial part.

The total fingerprint for an atomic configuration x is obtained by concatenating all vectors ρ^R_AB(r; x) and ρ^α_ABC(θ; x) with elements A, B, and C of the system, resulting in a single vector that we denote as ρ. To clarify, for a single-element system such as a Cu cluster, the radial part is just ρ^R = ρ^R_{Cu,Cu}(r; x), and for a two-element system, like SiO2, the radial fingerprint consists of the vectors ρ^R_{Si,Si}(r; x), ρ^R_{Si,O}(r; x), ρ^R_{O,Si}(r; x), and ρ^R_{O,O}(r; x). A similar procedure is used for the angular parts of the fingerprint.

We calculate the covariance between data points using a squared-exponential kernel

    k(ρ_1, ρ_2) = σ^2 exp(−D(ρ_1, ρ_2)^2 / 2l^2)    (9)

with the distance function D(ρ_1, ρ_2) and two descriptive hyperparameters, the prefactor σ and the length scale l. (Note that we use the term prefactor for σ and not σ^2.) The distance function we take as simply the Euclidean distance between the fingerprint vectors,

    D(ρ_1, ρ_2) = [Σ_i (ρ_{1,i} − ρ_{2,i})^2]^{1/2}.    (10)

We note that Bisbo and Hammer [28] use a kernel function which is a sum of two squared-exponential kernels with two different length scales. We tried this, but did not see any systematic improvement by adding an extra length scale.

We determine the hyperparameters σ and l by maximizing the logarithmic marginal likelihood, which is written as [20]

    log P = −(1/2) log(det C(P, P)) − (1/2)(y − μ_p(X))^T C(P, P)^{-1} (y − μ_p(X)) − [N(3N_atoms + 1)/2] log 2π    (11)

where N is the number of training points and N_atoms is the number of atoms in a single training data point. The prefactor, σ, can be determined analytically for fixed values of σ_{n,E}/σ and σ_{n,f}/σ, so the numerical optimization problem is only one-dimensional.

The Gaussian process allows for the specification of a prior function, E_p(x), for the energy landscape. In equation (1), the prior energy landscape is inserted as μ_p(x) = (E_p(x), ∇E_p(x)). Here, we apply the prior function suggested by Bisbo and Hammer [11], which is a repulsive potential of the form

    E_p(x) = E_c + E_r(x) = E_c + Σ_ij [0.7 (R_i + R_j) / r_ij(x)]^12,    (12)

where E_c is a constant, R_i and R_j are the covalent radii of the atoms with indices i and j, and r_ij(x) is the distance between the atoms in the set of atomic coordinates x. The prior energy function expresses the expectation that the energy rises steeply if two atoms come very close. As we shall see later, this helps avoid very-high-energy structures in the training data.

The constant value E_c is determined by maximizing the marginal likelihood, and is given by the analytic formula

    E_c = [U^T C(P, P)^{-1} (y − E_r(X))] / [U^T C(P, P)^{-1} U]    (13)

where E_r(X) is a vector that consists of the repulsive priors of the training data, obtained using the second term in equation (12), and U is a vector of length N(3N_atoms + 1) with elements

    U_i = 1   if i mod (3N_atoms + 1) = 0
          0   otherwise    (14)

where the indexing of i starts from 0. Therefore, U_i = 1 if y_i is an energy value, and U_i = 0 if y_i is a force.

III. MODEL VALIDATION

A. Learning curves and cross validation
To validate the model, we examine two different systems: a Cu15 cluster using an effective-medium theory (EMT) potential as implemented in the ASE package [31–33], and bulk SiO2 using DFT with the PBE functional [34], implemented in GPAW [35]. The unit cell for SiO2 consists of 12 atoms and has the lowest energy in the cristobalite structure, comprising tetrahedral SiO4 units. For both the training and validation sets, random structures are generated and relaxed loosely so that the maximum force criterion is 10 eV/Å. Different sizes of training sets are then used to generate learning curves for models with a variety of length scales in the squared-exponential kernel, while the validation set size is kept at 100. The reason for showing the learning curves for different length scales is that the length scale has, as we shall see, a dominant effect on the predictive power of the model, and we also observe that the optimal length scales do not necessarily follow the traditional expectations for a Gaussian process. The constant term in the prior function (equation (12)) is set to the mean energy of the training set, and the kernel prefactor is kept constant since it does not affect the mean of the predictions, as deduced from equation (1). In addition to fixed length scales, the learning curves are computed for models where the length scale, the prior constant, and the kernel prefactor are obtained by maximizing the marginal likelihood separately for each training set size.

FIG. 1. Cu15 learning curves obtained by (a) training on both energies and forces and (b) training on energies alone. The inset graphs show the values of the fit length scales for each training set size.

The learning curves for Cu15 are shown in figure 1 both with and without training on the gradients. For the gradient-trained curves, we observe that the root-mean-square error (RMSE) saturates to approximately 0.12 eV with length scales greater than 30 when the training set size is increased up to 100 data points.
The curves with scales 3.1 and 10 do not seem to saturate to the same degree, but they end up at the same RMSE at 100 training points as the other models. The standard deviation of the test set energies is 1.5 eV, meaning that our model can decrease the RMSE to about 7% of what random sampling of energies would produce. Remarkably, an RMSE of 0.25 eV is achieved with only one data point in the training set. This might be due to the ability of the fingerprint to catch the relatively simple radial dependence of the EMT potential, together with the smooth squared-exponential kernel.

Despite the good start with few training points, the curves develop quite slowly with an increasing number of training data. The observed saturation is in contrast with the power-law behavior that the learning curves should optimally follow, with linear learning curves on the log-log scale [36, 37]. One reason for the saturation could be the difficulty of resolving differences between some configurations as compared to others [38]. We have observed that small perturbations of the atomic structure that lead to similar variations in the energy may differ by several orders of magnitude in the variation of the fingerprint. This may indicate that the potential energy surface is highly anisotropic in fingerprint space. We note that the high dimensionality of the fingerprint also makes correcting for the anisotropy with different scales in each direction impractical.

Another explanation for the shape of the learning curves could be that the distribution functions used in the fingerprint have a finite range. Deringer and Csányi have shown in the case of amorphous carbon [39] that a finite cutoff may lead to substantial residual forces for a per-atom fingerprint. We have a global fingerprint, but still the finite range may limit the quality of the model prediction.

In practice, it seems that the gradient-trained model learns everything it can after training on only of the order of 30 data points.
On the other hand, an accuracy of 0.12 eV is clearly sufficient to be of relevance for the determination of the basins with low-energy configurations.

The effect of training on the gradients is apparent in figure 1: the prediction error is clearly lower up to training set sizes of the order of 10. At 30 training data points and above, the energy-trained model seems to do as well as the gradient-trained one. With the largest training set sizes, the energy-trained model becomes even slightly better than the one including the gradients. The difference is small, though, with the RMSE difference being only 0.012 eV at a training set size of 100 between the models with the best length scales. However, given the saturation of the gradient-trained curves, it is natural to expect that the energy-trained model in general reaches the performance of the gradient model at some point, and for this particular system that happens after around 30 training points.

The models with scales 10 and above saturate to similar performance when the training set size of 100 is reached. All of these scales are comparable to or longer than the distances between the data points in the data set with 100 points. For this data set, the distances vary between 0.1 and 2.3 in fingerprint space. The model with scale 3.1 is seen to perform less well for small data sets, but is not saturated when 100 data points are reached. The limited distances in fingerprint space have to do with the character of the fingerprint. If the atoms in a given configuration are displaced, the Cartesian distance corresponding to the displacement can grow indefinitely. In fingerprint space the distances are calculated based on differences in distribution functions, and these will saturate at some point.

The length scales obtained by maximizing the log-likelihood are always above 15 when gradients are included in the training, and above 6.8 when training on the energies alone (see figure 1).
These scales are also surprisingly long considering the distances in the training set. There is a clear increasing trend of the optimal length scale in the energy-trained models when the training set size is increased, but no trend is observed within the gradient-trained models. Nevertheless, maximizing the log-likelihood gives roughly the best model as evaluated with cross validation.

For bulk SiO2 (figure 2), the picture is different in that the power-law decay of the learning curves is roughly maintained for most models with different length scales up to 100 training points. However, the prediction power is not too impressive, with the best RMSE being 1.27 eV at 100 training points, although it is lower than the standard deviation of the validation set, which is 3.6 eV. The energy-trained model again approaches the gradient-trained one, but at 100 training points the gradient-trained model still predicts slightly better than the model based on energies alone. No clear saturation is observed within this data range with either of the approaches.

Compared to the EMT cluster, the learning ability with a single training point is reduced. This is expected when moving to a more complex potential energy surface: the quantum effects of a more realistic potential are difficult to track with a relatively simple fingerprint, in contrast to the simple radial dependency of EMT.

From the SiO2 learning curves we can also see that length scales above 100 are clearly favored over the smaller ones. Again, this is remarkable for a Gaussian process with a single squared-exponential kernel as covariance function, since the distances between the test data points vary between 1.6 and 21.3, that is, one or two orders of magnitude smaller than the most optimal length scales. According to figure 2, the most optimal length scale is as much as 1000 when training only on energies, even further from the spread of the data-point distances compared to the gradient-trained model.

FIG. 2. SiO2 learning curves obtained by (a) training on energies and forces and (b) training only on energies. The inset graphs show the values of the length scales obtained by maximizing the log-likelihood for each training set size.

The fitted length scales, obtained by maximizing the log-likelihood and shown in the insets in figure 2, have a clear decreasing trend for the models trained using gradients. For the models trained on only energies, the variation is dominated by a large jump between 10 and 30 training points. This behaviour indicates a high sensitivity of the model to the addition of more data, since we are dealing with small numbers of training data points. Another reason might be that there are multiple local maxima in the marginal likelihood, and different maximization runs end up at different local maxima.

Finally, it should be noted that for both Cu15 and SiO2, fitting the length scale by maximizing the marginal likelihood gives roughly the best model in the cross validation. This is a desired behavior, since maximizing the marginal likelihood is an easy and computationally relatively cheap procedure to carry out, and it can therefore be done repeatedly during a global search where the training set is updated often.

FIG. 3. Predicted versus true energies of the final structures of relaxations with surrogate models. 80 local relaxations of Cu15 are run with surrogate models with (a) training on the gradients and (b) training only on energies. The training set of 40 data points is the same for both models. The energies of the training data points have a distribution with a mean of 6.0 eV and a standard deviation of 1.14 eV. The reference point 0.0 eV is set to the global minimum energy. Statistics of the energies of the final structures, as well as the statistical prediction errors, are shown in the text boxes. The dashed line shows the ideal 1-to-1 mapping of the predictions.

B. Local relaxations in the surrogate model
Since our global optimization method relies on local relaxations rather than single-point calculations in the surrogate PES, it is relevant to examine how well the model performs local relaxations. We create a training data set of 40 data points of the Cu15 cluster, relaxed with EMT so that the maximum force residual is less than 1.0 eV/Å. After this, the model is trained on the data both with and without gradients. Then, 80 random structures are created independently and relaxed locally in the surrogate model.

A comparison between the EMT energies and the energies of the models is shown in figure 3. The EMT energies of the obtained structures range from 0.38 to 3.6 eV with the gradient-trained model, and from 2.0 to 6.9 eV with training only on energies. In fact, the lowest-energy structure (of 0.38 eV) corresponds to a structure that is very close to the true global minimum structure. The prediction errors range from −1.0 to 0.0 eV with gradients and from −3.6 to −1.2 eV without gradients. The data demonstrate that the gradient model is able to reach both lower energies and higher accuracy than the model including only energies, although we work at a training set size of 40, where the learning curves show similar performance to each other. The model trained on gradients exhibits some systematic errors, in particular for small energies, but the ordering of the energies seems to be well reproduced. The errors are much larger for the model trained on energies alone, but again the ordering of the states is reproduced fairly well.
FIG. 4. One-dimensional demonstration of the surrogate surface. In the figures, the true potential surface, evaluated with EMT, is shown for a system where one Cu atom is moved on top of two other Cu atoms. The two basins of the potential energy surface are labelled A and B, where the bottom of A has lower energy than B. In (a), a data point is evaluated at x = 0.96, and a Gaussian process is trained with that single training point. The resulting predicted surface, uncertainty (green area), and acquisition function are shown. The acquisition function is minimized in basin B, at x = −0.74, and in (b) we show the predictions after evaluating and adding this point to the training set. Now, the acquisition function is minimized in A, the global minimum basin of the true potential energy surface.
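The qualitative behaviour in figure 4 can be reproduced with a minimal one-dimensional Gaussian process trained on (energy, gradient) observations. This is a self-contained toy sketch with a scalar coordinate, a zero prior, and an assumed squared-exponential kernel, not the fingerprint-based model of section II:

```python
import numpy as np

def block(x1, x2, sigma=1.0, ell=1.0):
    # 2x2 covariance between (E, dE/dx) at x1 and at x2, cf. equation (2)
    k = sigma**2 * np.exp(-(x1 - x2)**2 / (2 * ell**2))
    d = (x1 - x2) / ell**2
    return np.array([[k, d * k], [-d * k, (1 / ell**2 - d**2) * k]])

def gp_predict(x_train, y_train, x_query, sigma=1.0, ell=1.0, noise=1e-8):
    """Posterior mean and uncertainty of the energy at each x_query
    (cf. eqs. 1 and 5), with y_train = [E_1, dE_1, E_2, dE_2, ...]."""
    n = len(x_train)
    C = np.zeros((2 * n, 2 * n))
    for i in range(n):
        for j in range(n):
            C[2 * i:2 * i + 2, 2 * j:2 * j + 2] = block(x_train[i], x_train[j], sigma, ell)
    C += noise * np.eye(2 * n)  # diagonal regularization, cf. equation (4)
    alpha = np.linalg.solve(C, np.asarray(y_train, dtype=float))
    means, stds = [], []
    for xq in x_query:
        kvec = np.hstack([block(xq, xt, sigma, ell)[0] for xt in x_train])
        var = sigma**2 - kvec @ np.linalg.solve(C, kvec)
        means.append(kvec @ alpha)
        stds.append(np.sqrt(max(var, 0.0)))
    return np.array(means), np.array(stds)
```

Training on E = 1 and dE/dx = 2 at x = 1 makes the posterior interpolate that point with vanishing uncertainty, while far from the data the mean returns to the (zero) prior and the uncertainty to the prefactor σ; the acquisition function of section IV then favours such unexplored regions.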
IV. GLOBAL OPTIMIZATION ALGORITHM
The algorithm for the global optimization is relatively simple: in each iterative step, multiple local relaxations are carried out on the surrogate surface, and the energy and forces for the most promising structure are evaluated using the true potential (EMT or DFT). To select the most promising configuration of all the relaxed structures, we make use of the estimated uncertainties by calculating the acquisition function

    f(x) = μ(x) − κ Σ(x)    (15)

for each structure, where we set the parameter κ = 2. This value provides a good balance between exploitation (low energy) and exploration (large uncertainty) [11, 28, 40]. The structure with the lowest acquisition function is selected for evaluation with the true potential. This structure is then added to the training set for the Gaussian process, and another set of surrogate relaxations is performed with the updated model, as illustrated in figure 4. The effect of using the fingerprint is visualized in figure 4 as well: after just a single training point, the predicted PES exhibits several important features, such as the existence of the two local minima. This would not be the case if Cartesian coordinates were used as the descriptor. The gradients also play an important role in quickly learning the features of the PES. Moreover, adding the second training point in one of the basins makes the prediction in the second basin more accurate.

The initial training set consists of randomly generated structures with energies and forces evaluated. In this work, the optimization routine is always started with 2 initial training points. The starting structures for the surrogate relaxations comprise three different types: 1) already visited structures with the lowest energies, 2) already visited structures with random displacements (also called rattling), and 3) randomly generated structures. The total number of surrogate relaxations per step in this work varies between 20 and 40, depending on the system, but is kept constant during a run.
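The selection step of the algorithm — relax many candidates in the surrogate, score each with equation (15), and evaluate only the best one with the true potential — can be sketched as follows. The candidate bookkeeping is hypothetical, not the actual data structures of the implementation:

```python
KAPPA = 2.0  # exploitation/exploration balance, kappa = 2 as in the text

def acquisition(energy, uncertainty, kappa=KAPPA):
    """Lower-confidence-bound acquisition function of equation (15)."""
    return energy - kappa * uncertainty

def select_candidate(candidates):
    """Return the label of the candidate with the lowest acquisition value.
    `candidates` is an iterable of (label, predicted_energy,
    predicted_uncertainty) tuples."""
    return min(candidates, key=lambda c: acquisition(c[1], c[2]))[0]
```

A structure with a slightly higher predicted energy but a large uncertainty can thus win over a well-known low-energy structure, which is what drives the exploration of new basins.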
About 25% of the relaxations start from structures of type 1, 25% of type 2, and 50% of type 3. Before accepting the structure given by a surrogate relaxation, it is checked that none of the bond lengths in the system are less than 0.7 times the covalent distance of the atoms. If no valid structures are acquired in an iterative step, a random structure is generated, evaluated, and added to the training set, to achieve a model that will fail with smaller probability in the next iteration.

For clusters, the random structures are generated as follows. First, one of the atoms is placed at the origin. After this, the position of the next atom is always given by r = r_rand + (r, θ, φ) in spherical coordinates, where r_rand is the position of one of the previously placed atoms, randomly selected, and (r, θ, φ) are randomly generated spherical coordinates. Here the sample distribution of r is selected manually and system-specifically, but it was observed that selecting the upper bound of r slightly smaller than the standard covalent distance of the two atoms works best in general. After generating a new position, we make sure that adding an atom at the acquired coordinates does not violate our restriction on short bond lengths with each atom already added to the cluster, as described above. For the tantalum oxide clusters, we set another restriction to add some chemical intuition: when O is being added to the system, we enforce that r_rand is the coordinate of a Ta atom and not of another O atom. This way we avoid introducing chains and clusters of oxygen in the randomly generated structures.

For bulk systems, the atoms are simply put at random coordinates inside the unit cell and then relaxed in a repulsive potential of the form

    V_rep(x) = Σ_ij [0.7 (R_i + R_j) / r_ij(x)]^12    (16)

with similar notation to that of the prior in equation 12. For surfaces, the atoms are put inside a manually defined box inside the true unit cell, where the atoms are placed in a fashion similar to that of the bulk systems and then relaxed in the repulsive potential given by equation (16).

The hyperparameters E_c, σ, and l are updated during the global optimization, assuming fixed values of the noise parameters relative to the prefactor as explained above. The prior constant E_c and the prefactor σ are obtained analytically and are updated at every step of the global optimization. The update of the length scale has to be done numerically and is performed every five steps, with an initial value of 20 times the distance in fingerprint space between the two initial structures. It is our experience that the length scale obtained from maximizing the log-likelihood may be too short, leading to a too rapidly varying surrogate PES with more local minima than the true PES. This affects the search so that it becomes too local. This is unfortunate, especially in the beginning of the search, where large parts of the configuration space have to be explored. We therefore introduce a lower bound on the value of the length scale during the update. The lower bound is set to the mean value of all distances between the training data points in fingerprint space. Using this value ensures that a large number of training data are used in each energy/force prediction and too-local searches are avoided. We also note that the investigations of the learning curves above indicate that a length scale considerably longer than the one obtained from maximizing the log-likelihood still results in a reasonable model. In some cases the prediction error is in fact reduced by increasing the length scale.

The algorithm is implemented so that the user can choose whether to train using the gradients or not.
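The random cluster generation described above can be sketched as follows. The bond-length bounds are illustrative placeholders standing in for the element-specific covalent distances mentioned in the text (0.7 times the covalent distance as the lower bound, slightly below the covalent distance as the upper bound of the sampled radius):

```python
import numpy as np

def random_cluster(n_atoms, r_min=1.6, r_max=2.3, max_tries=1000, rng=None):
    """Generate a random cluster: each new atom is placed at a random
    spherical offset from a randomly chosen existing atom, rejecting
    positions that bring any interatomic distance below r_min."""
    rng = np.random.default_rng() if rng is None else rng
    pos = [np.zeros(3)]  # first atom at the origin
    while len(pos) < n_atoms:
        for _ in range(max_tries):
            anchor = pos[rng.integers(len(pos))]     # randomly selected existing atom
            r = rng.uniform(r_min, r_max)            # sampled radial offset
            theta = np.arccos(rng.uniform(-1, 1))    # uniform direction on the sphere
            phi = rng.uniform(0, 2 * np.pi)
            new = anchor + r * np.array([np.sin(theta) * np.cos(phi),
                                         np.sin(theta) * np.sin(phi),
                                         np.cos(theta)])
            if all(np.linalg.norm(new - p) >= r_min for p in pos):
                pos.append(new)
                break
        else:
            raise RuntimeError("could not place atom without short bonds")
    return np.array(pos)
```

Element-specific rules, such as the restriction that oxygen anchors only to Ta in the tantalum oxide clusters, would amount to constraining which existing atoms are eligible as `anchor`.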
In both cases, with or without gradient training, the energy and force predictions can be obtained analytically from the surrogate potential energy surface. The approach where the gradients are not included in the training has a reduced memory usage, and the time to train the model is also significantly reduced. However, as we have seen in the investigations of the learning curves, the models without training on gradients are less accurate. In the GOFEE method [11], training is only performed on the energies, but another training point, adjacent to the one selected by the acquisition function, is always evaluated with the true potential and added to the training set. The second training point is obtained by moving the atoms a small distance along the direction of the forces. In this paper, we refer to our approach with training on both energies and forces as BEACON (from Bayesian Exploration of Atomic Configurations for OptimizatioN) and to the approach where the training is only on energies as L-BEACON, where L stands for "light". We also show results of L-BEACON-exact, where a neighboring data point is evaluated and added to the training set similarly to GOFEE, and of L-BEACON-FD, where, for each DFT-evaluated data point, we add a neighboring data point whose energy is obtained by a finite-difference estimate based on the DFT forces. The step length in the finite-difference method is discussed in section V. We will see that for the systems we investigate here, adding extra displaced training points does not lead to significant improvement even if gradients are not trained. Furthermore, we find that training only on the energies of single points (L-BEACON) yields a surrogate potential energy surface with similar or almost similar performance in global optimization, compared to L-BEACON-FD and L-BEACON-exact.

V. RESULTS AND DISCUSSION

A. Cu15

In figure 5, we show the success curves of different types of global optimization runs for the Cu15 cluster in the EMT potential.
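The construction of the extra L-BEACON-FD training point described above can be sketched in a few lines. This is an illustrative sketch, not the implementation; the function name is ours, and the first-order Taylor estimate follows directly from F = -∇E.

```python
import numpy as np

def fd_companion(positions, energy, forces, dx=0.001):
    """Companion training point for L-BEACON-FD (sketch).

    The atoms are displaced a distance dx (angstrom) along the normalized
    force vector, and the energy of the displaced geometry is estimated
    from a first-order Taylor expansion, E' = E - dx * |F|, using
    F = -grad E.  No additional DFT call is required.
    """
    f = np.asarray(forces, dtype=float).ravel()
    f_norm = np.linalg.norm(f)
    if f_norm == 0.0:                 # already at a stationary point
        return np.array(positions), energy
    step = dx * f / f_norm
    new_positions = np.asarray(positions) + step.reshape(np.shape(positions))
    return new_positions, energy - dx * f_norm
```

In L-BEACON-exact the displaced geometry is constructed the same way, but its energy is evaluated with the true calculator instead of the Taylor estimate.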
We perform 40 separate runs with BEACON, L-BEACON, L-BEACON-FD, and L-BEACON-exact with different lengths of the displacements where neighboring training points are included, and 16 separate runs with GOFEE [11]. In the figure, the cumulative curves increase by a step each time a single run finds the global minimum with an energy threshold of 0.01 eV. The threshold value means that we declare a run successful once it hits a true energy that is at most 0.01 eV higher than the lowest energy found during the runs. The reference energy here corresponds to the geometry of a centered icosahedron of 13 atoms and two adsorbed atoms in neighboring hollow sites of the icosahedron surface, as shown in the inset of figure 5. The second-lowest local minimum was found at 0.15 eV above the global minimum, possessing a centered, gyroelongated hexagonal bipyramid. This observation illustrates the ability of the global optimization approach to distinguish between local minima that are close in energy.

Let us first discuss the three best success curves: BEACON, L-BEACON, and L-BEACON-FD with step length dx = 0.001 Å. The ability of the gradient-trained model to simulate the potential energy surface of EMT is manifested again, as 50% of the BEACON runs found the true global minimum after only 7 EMT evaluations. Convergence in 7 evaluations would be appealing even for local optimization with the 45 degrees of freedom in the system, although we remind ourselves that convergence here is defined via the energy, whereas in the context of local relaxation convergence is determined through stricter requirements on the forces in the system. For L-BEACON-FD, the corresponding number for 50% success is 16 evaluations, and for L-BEACON, where no force information is used, 50% success is acquired after 20 evaluations. Our runs with GOFEE [11] show that 46 evaluations are required for 50% success.
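Success curves of the kind shown in figure 5 can be computed from the per-run traces of true energies. The sketch below is our own reconstruction of that bookkeeping, not the authors' code: a run counts as successful from the first evaluation that comes within the threshold of the global minimum energy.

```python
import numpy as np

def success_curve(energy_traces, e_global, threshold=0.01):
    """Cumulative success fraction after each true-potential evaluation.

    energy_traces: one array of evaluated true energies per run, in the
    order the evaluations were made.
    """
    n_eval = max(len(t) for t in energy_traces)
    successes = np.zeros(n_eval)
    for trace in energy_traces:
        hit = np.asarray(trace) <= e_global + threshold
        run_success = np.zeros(n_eval, dtype=bool)
        if hit.any():
            # Once a run has found the minimum, it stays successful.
            run_success[np.argmax(hit):] = True
        successes += run_success
    return successes / len(energy_traces)
```

Plotting such a curve against either the number of true-potential calls (figure 5a) or the step index (figure 5b) gives the two views discussed in the text.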
It is worth noting that despite the fast success, the program does not know that it has reached the optimal configuration but keeps searching even after the (known) global minimum is found.

For comparison, we run 480 relaxations with the true calculator, EMT, starting from random structures generated similarly to those for the global optimization. The result is that 60, or 12.5%, of all relaxations end up in the global minimum energy structure. If we performed one such EMT relaxation per step, this would statistically result in 61% success at step 7, since Σ_{i=1}^{7} (1 − 0.125)^{i−1} × 0.125 ≈ 0.61. From this perspective, we conclude again that the model and the global optimization approach of BEACON are together very efficient in the search for the global minimum. Running EMT relaxations is in fact computationally faster than running the relaxations on the surrogate surface within the global search algorithm, as we will discuss further in section VI, but when using the algorithm with DFT this situation is of course completely different.

BEACON is seen to be the fastest method up to an 80% success rate. Most of the L-BEACON-FD runs (with step length dx = 0.001 Å) find the minimum with between 10 and 20 EMT calculations. L-BEACON lags only a little behind, and finally all three approaches end up with somewhat similar performance, although full success of the BEACON curve takes slightly more steps, with one run finding the correct local minimum after 48 EMT evaluations.

Let us now take a closer look at the different approaches of L-BEACON where neighboring data points are included in the model, i.e. L-BEACON-FD and L-BEACON-exact. First of all, the success curves with the EMT-evaluated neighboring points, that is, the L-BEACON-exact curves, are always behind the curve for L-BEACON, where no neighboring points are included in the model. With an arbitrary increase of the step length in L-BEACON-exact, we expect the success curves to reach the performance of L-BEACON at best, since in that case all the data points are more or less independent and the neighboring points no longer mimic the forces. Also, the step length of dx = 0.001 Å leads to slightly slower performance than that of dx = 0.01 Å, indicating that a step length of 0.001 Å is too small to have the desired effect on the model; due to the noise term in the model, two neighboring points cannot be distinguished properly when they are too close to each other.
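The 61% estimate for independent random-start EMT relaxations quoted above is a simple geometric series and can be verified in a few lines:

```python
# Probability that at least one of 7 independent random-start EMT
# relaxations hits the global minimum, given the observed per-relaxation
# success rate of 60/480 = 12.5 %.
p = 60 / 480
prob_by_step_7 = sum(p * (1 - p) ** (i - 1) for i in range(1, 8))

# The geometric series sums to 1 - (1 - p)^7.
assert abs(prob_by_step_7 - (1 - (1 - p) ** 7)) < 1e-12
print(f"{prob_by_step_7:.2f}")   # 0.61
```

The comparison in the text is against the 50% success of BEACON at the same step count, so BEACON does better than the naive restart strategy once the much lower per-step cost of a single EMT evaluation versus a full relaxation is taken into account.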
However, comparing the red and pink curves in figure 5b, the smaller step length seems to work similarly for L-BEACON-exact and L-BEACON-FD despite the approximation. This indicates that the linear approximation is accurate enough to produce a reasonable surrogate PES. On the other hand, the step length 0.01 Å is clearly too large for a stable model in L-BEACON-FD, although it results in better performance with L-BEACON-exact. The value dx = 0.001 Å shows the fastest global optimization of all the double-point approaches in terms of numbers of EMT calculations, but the performance is not much better than with L-BEACON. Also, the problem of selecting a suitable step length might be difficult for other systems and more complex potentials. Thus, we do not find the inclusion of neighboring points beneficial in general. We will return to this topic later in the case of bulk SiO2.

Comparing our curves with GOFEE, we see that the number of EMT calculations is two-fold or more for GOFEE. As mentioned above, the essential point of interest in the methods is the way in which the energy of the neighboring data point is evaluated. In this respect, L-BEACON-exact is similar to GOFEE. It is observed that L-BEACON-exact is faster in finding the true global minimum, which we attribute to the slight differences in the details of the fingerprint, the kernel function, and the fitting of the hyperparameters along the way.

FIG. 5. Cu15 success curves with a success threshold of 0.01 eV. The colored areas denote the standard deviation of bootstrap simulations with 1000 samples of each curve. (a) Success curves as a function of the number of EMT calculations. (b) Success curves as a function of the step index. FD refers to adding a neighboring point per each EMT-evaluated data point, where the energy of the second point is estimated with the finite-difference method with step length dx (Å), based on the energy and forces of the EMT point. The notation "exact" refers to evaluating the energy of the second point using EMT.
Comparing BEACON and GOFEE, there is about a factor of 6 difference in the number of required energy evaluations.

Finally, let us connect the result with BEACON to the examination of local relaxations above (section III B). There, the training set size was 40, and even with 80 relaxations the global minimum was not found. In the global search we find the minimum after only 7 training points. This indicates that updating the model along the run is an important feature of the global optimization algorithm.

B. SiO2

Figure 6 shows the success curves for bulk SiO2 in the low-cristobalite phase with DFT/PBE. In this case, training on the gradients is clearly favourable in the search for the global minimum structure. BEACON finds the correct structure after fewer than 34 DFT evaluations in all runs. L-BEACON and L-BEACON-FD are slower than BEACON and eventually fail in 2 out of 20 runs to find the true global minimum within 80 DFT calls.

We saw that for the Cu15 cluster the search identified the global minimum based on very few EMT energy and force evaluations compared to what could be expected based on the learning curves. This feature is even more pronounced for the SiO2 system. The learning curves indicate a rather poor accuracy, with errors of more than 1 eV (for the 12-atom system) in the range all the way up to 100 training points, but still the global minimum is found within 0.05 eV using only of the order of 25 DFT calculations. The explanation for this behavior must have to do with the fact that, in the global search, states around local minima of the PES are of preferential interest and are included in the training set. The model is thus exclusively trained to predict a special part of the PES. In the cross-validation studies above, the model is trained on a wider range of points and also evaluated broadly in configuration space.

This idea is to some extent illustrated by the simple one-dimensional potential energy surface of the Cu15 cluster discussed above in figure 4.
The evaluation of the model at the minimum point of well B considerably improves the prediction at the other minimum A, because some of the local bonding characteristics are the same.

Interestingly, the energy-trained models, L-BEACON and L-BEACON-FD, have rather similar performance in the global optimization, although the training set of L-BEACON-FD includes twice the number of data points. The difference from the Cu15 case is the more complex true potential energy surface, which introduces more error in the finite-difference method. For the case of SiO2, we assert that a step length of 0.001 Å is too small for the energies of the neighboring points to be distinguishable by the Gaussian process, resulting in a failure to simulate the slopes of the PES. Increasing the step length increases the risk of running into problems with an unstable Gaussian process, as observed for Cu15. We thus conclude, somewhat at variance with Bisbo and Hammer [28], that adding neighboring data points, as done in L-BEACON-FD, L-BEACON-exact, or GOFEE, is not beneficial in general for the Bayesian approach to global optimization.

Let us now investigate how the success curves depend on the threshold value for the energy.

FIG. 6. SiO2 success curves with a success threshold of 0.05 eV. The unit cell includes 12 atoms. The global minimum structure is shown in the inset, where 5 extra O atoms are added to the unit cell to illustrate the tetrahedral coordination of Si. For L-BEACON-FD, the step length is 0.001 Å in the finite-difference method.

In figure 6 an energy threshold of 0.05 eV is used for SiO2, and we show the success curves with energy thresholds of 0.2 eV and 0.01 eV in the Supplemental Material [41]. We see that BEACON is more efficient than the L-BEACON methods, which are quite similar to each other, and therefore the overall analysis is not too sensitive to the choice of the threshold value.
However, the choice of an appropriate threshold may depend on the system being investigated. For example, an energy threshold of 0.2 eV for the Cu15 cluster has the consequence that finding the second-lowest local minimum is also counted as a successful run. On the other hand, small thresholds such as 0.01 eV might fall below the accuracy of the DFT implementation, the parameters in use (such as the k-point density), the convergence criteria of the self-consistent cycle, etc., making the judgment of whether the global minimum was found or not uncertain. The threshold of 0.05 eV seems to be a good compromise for global optimization of systems with 10-20 atoms using DFT calculators like GPAW with default settings.

The length scale is updated every 5 steps by maximizing the marginal log-likelihood, as discussed above in section IV. Furthermore, as also discussed above, the length scale is bounded from below by the mean value of all distances between the training data points in fingerprint space. In figure 7, we show the evolution of the fitted length scale in the global optimization runs of SiO2, both with and without constraining the fitted length scales from below. In both cases, we observe that after the first fitting at 5 DFT calculations there is a huge variance in the optimal length scales over the different runs: the values range from 30 to 3000. As the searches proceed, this range gets narrower, and after 20 DFT calculations the updated scales vary only between 25 and 100 when the constraint is turned on.

FIG. 7. Evolution of the updated length scales of the Gaussian process kernel during the runs of SiO2 in (a) BEACON and (b) BEACON without a lower bound for the updated length scales.

The corresponding figure for the L-BEACON runs is shown in the Supplemental Material [41]. For L-BEACON, most of the values lie below 100, but the updating also converges to higher values along the search.
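The bounded length-scale update described here is simple enough to sketch directly. The function names below are ours; the lower bound is the mean pairwise distance between training points in fingerprint space, as stated in the text.

```python
import numpy as np

def length_scale_lower_bound(fingerprints):
    """Mean of all pairwise distances between training points in
    fingerprint space, used as a lower bound for the kernel length scale."""
    X = np.asarray(fingerprints, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    i, j = np.triu_indices(len(X), k=1)
    return dist[i, j].mean()

def bounded_length_scale(ell_max_likelihood, fingerprints):
    """Accept the maximum-log-likelihood length scale only if it does
    not fall below the lower bound; otherwise clip it to the bound."""
    return max(ell_max_likelihood, length_scale_lower_bound(fingerprints))
```

With this clipping, an over-fitted (too short) length scale early in the run cannot make the surrogate PES vary faster than the typical spacing of the training data allows.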
The convergence to larger length scales does not seem to be a problem, though, as global minima are also found with the large scales.

For BEACON, the length scales at which the correct global minimum structure is found gradually decrease as the search proceeds. Furthermore, it can be seen that if the length scale is below 100, the global minimum is never found in fewer than 15 steps. Apparently, the length scales obtained by maximizing the marginal log-likelihood are not necessarily optimal in the early part of the global search, where exploration is particularly important. The large variation in the length scales in the beginning of the search is hardly surprising in light of the small number of training points. In this perspective, it seems reasonable to limit the acceptable length scales from below to prevent strong overfitting in the beginning of the search. It appears from the data in figure 7 that the lower bound could be set even higher in the beginning of the runs.

Without the lower bound on the length scale, three of the runs fail to find the global minimum in 50 DFT calculations, as shown in figure 7b. It is clear that the failing runs have very short length scales after 20 steps, and even though the length scale is increased after 25 steps, they fall back to a less exploratory mode, where the global minimum is not identified.

C. [Ta2O5]x

Ta2O5 is an optically interesting material, whose crystal structure is still under debate [42-44]. Furthermore, clusters of the material might be of interest in photocatalysis [45, 46].

To test our approach, we investigate small clusters with the composition [Ta2O5]x with x = 1, 2, 3; the resulting global minimum structures are shown in figure 8. It is interesting that in all the structures every Ta atom has a similar bonding environment with four O atoms, one of which points outwards from the cluster. Interpreting this as a double bond makes the oxidation number of each Ta atom +5.

In any case, the global optimization of the clusters provides a good demonstration of our method.
Of course there is no rigorous proof that the obtained structures are in fact the global minimum energy structures, but some indications of this are obtained by repeating the searches and noting the variety of structures visited during the search. The shown structures were found in 4/4 runs for Ta2O5, 5/6 runs for [Ta2O5]2, and 4/8 runs for [Ta2O5]3. The second-lowest local minimum for Ta2O5 was observed at 0.16 eV above the lowest one, and this was visited in all runs. This again indicates that the model and the acquisition function are able to identify small energy differences between different structures.

The average number of required single-point DFT calculations before hitting the global minimum structure (with a threshold of 0.05 eV) was 42 for Ta2O5, 40 for [Ta2O5]2, and 35 for [Ta2O5]3, calculated among the successful runs. The small number of steps required is a highly desired result, because it means that one can limit the length of the BEACON runs. This does not only have the advantage that the total number of DFT calculations is small; long runs of BEACON with many steps also lead to surrogate models with many data points, which require more memory and computational time. It is interesting that the average number of DFT calculations is smallest for the largest cluster, where the number of degrees of freedom is the largest. This could be explained by the fact that whenever we train the model with a larger cluster, more training data is available, since the number of trained gradients is larger. However, we note that in general larger systems can be expected to exhibit considerably more local minima to be explored, making the global search more difficult.

FIG. 8. Global minimum structures of the Ta2O5 (D symmetry), [Ta2O5]2 (T_d), and [Ta2O5]3 (D) clusters as found by BEACON.

We now take a closer look at some of the BEACON runs in order to get a better understanding of how the algorithm behaves.
We first investigate why some of the runs fail for the larger clusters by checking how the global minimum structure is predicted by the surrogate models of the unsuccessful runs. That is, we use the surrogate models at step number 40 of the global optimization runs (42 training points) and perform a local relaxation on this surrogate potential energy surface, starting from the global minimum structure. In every case, the relaxation does not change the structure significantly, and the acquisition function of the relaxed structure is low enough that it would have been selected for DFT evaluation in the original BEACON run. We thus conclude that for these runs the search within the surrogate space is insufficient, whereas the accuracy of the model is good enough to find the global minimum.

In figure 9, we show how a single run for one of the [Ta2O5]x clusters proceeds along the search. The search starts with high-energy structures where the prediction and the true energy do not match at all, but, as the high uncertainties show, the model knows it might be wrong about the predictions. As the search continues, different structures are suggested and evaluated with a good agreement between the predictions and the true energies, taking the estimated uncertainties into account. The global minimum structure is found at step 42 in this particular run, and the exploration continues after that, as indicated by the higher-energy structures that are visited between step 42 and step 68. As noted in the description of the algorithm, we remove structures obtained by local relaxations in the surrogate model if a bond distance is smaller than 0.7 times the sum of the covalent radii of the atoms in the bond. What happens at step 68 in this BEACON run is that the model begins to develop local minima with O-O bond lengths just above this threshold. The newly predicted structures thus contain unphysically short O-O bonds or even clusters of oxygen atoms linked to the rest of the cluster.
The model predicts their energies to be very low, and they are therefore always selected by the acquisition function (eq. 15). However, after step 80 we see that the model has learned that the unphysical structures have high energies, and the search continues in a more reasonable way, both exploring new areas and exploiting the known structures, the global minimum included. This issue illustrates how additional conditions on the search might be helpful to avoid unphysical structures, but also that the surrogate model might react to these in unexpected ways.

Let us briefly look at how the hyperparameters evolve, as shown in figure 9. To recapitulate, we have four hyperparameters: the length scale in the kernel, the kernel prefactor, the prior constant, and the noise parameter. For simplicity, we keep the ratio between the noise and the prefactor fixed, because the maximization of the log-likelihood can then be done analytically with respect to both the prefactor and the prior constant. Only the length scale has to be obtained numerically, and this update is done every 5 steps. We see in figure 9 that the prior constant has some variation in the beginning, but the changes become smoother as more training data is acquired. The dramatic changes in the model at step 68 also lead to a recalibration of the prior constant. It should be noted that the prior constant is not simply a weighted mean of the observed energies, because the gradients also play a role in its determination, as seen from equation (13).

The length scale is significantly reduced during the run, and we take this as an indication that the model initially focuses on getting the large-scale features of the PES correct, with subsequent refinements. This is an appropriate behavior for a global search strategy, which we already examined in the case of SiO2.

The kernel prefactor also gradually decreases during the global optimization, with the most significant changes every 5 steps when the length scale is updated.
The prefactor is unimportant for the prediction of energies and forces, as can be seen from equation (1), but it plays a major role for the estimated uncertainties. The decaying value might therefore indicate an increasing confidence of the model. The reduction of the prefactor can also be seen as coupled to the change in the length scale. The uncertainty at a particular point in fingerprint space is roughly given by the kernel function to the neighboring data points. This estimate involves both the prefactor of the kernel function and the distances to the neighboring data points measured in units of the length scale. A reduction of the length scale thus leads to a less "stiff" model with larger variances, and this is to some extent compensated by the reduced prefactor. The regularization follows exactly the variation of the kernel prefactor, because the quotient of these quantities is kept constant throughout the run.

In the Supplemental Material [41], we show another run similar to that in figure 9. Although the global minimum energy structure is not found, the overall behaviour of the predictions and hyperparameters is similar, with a good balance between exploration and exploitation. Even the stage of unphysical structures with short O-O bonds is the same after 60 steps, after which the search becomes stable again at around step 80. Although the prior constant goes below the global minimum energy in the beginning, we note that this does not have a negative effect on the model or the search: in the beginning the length scale is relatively long, and therefore all points in the (reasonable) part of coordinate space are close to each other compared to the length scale, and consequently the model is not very sensitive to the absolute value of the prior.

An important feature observed in both runs is that the estimated uncertainties are reasonable, so that when the error of the prediction is large, the model knows that it might be wrong.
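The uncertainty-aware selection step can be sketched as follows. The exact form of eq. (15) is given earlier in the paper and is not reproduced in this section; here we assume the usual lower-confidence-bound form of such acquisition functions, and the function names and the value of kappa are our own.

```python
def acquisition(e_pred, sigma_pred, kappa=2.0):
    """Lower-confidence-bound acquisition value (assumed form):
    a low predicted energy or a high predicted uncertainty both make a
    candidate attractive; kappa balances exploitation and exploration."""
    return e_pred - kappa * sigma_pred

def select_candidate(candidates, kappa=2.0):
    """candidates: iterable of (label, e_pred, sigma_pred) tuples;
    return the candidate with the lowest acquisition value."""
    return min(candidates, key=lambda c: acquisition(c[1], c[2], kappa))
```

A candidate with a mediocre predicted energy but a large uncertainty can thus win over a well-known low-energy structure, which is exactly the exploratory behavior observed in the runs discussed above.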
This reliability of the uncertainties is an essential property for using the acquisition function, equation (15), to control the balance between exploration and exploitation in the global search.

D. ZrN-O surface
As the last demonstration, we briefly illustrate the applicability of the approach to surface structures. Recently, a high catalytic activity of ZrN for oxygen reduction was observed [47]. To see whether anything interesting occurs on the ZrN surface when exposed to oxygen, we investigate the surface structure of ZrN with adsorbed oxygen using the global optimization method. We use a surface slab of 4 layers with the 2 bottom layers fixed during the optimization. The unit cell is orthogonal, containing 16 zirconium atoms, 16 nitrogen atoms, and 1 oxygen atom, corresponding to a coverage of one oxygen atom per four zirconium atoms in the surface layer. The electronic exchange-correlation effects are modelled using RPBE [48]. With this setup, the global optimization algorithm finds that a structure where the oxygen atom and also one of the nitrogen atoms occupy hollow surface sites minimizes the potential energy of the system, as shown in figure 10. Consequently, there is a nitrogen vacancy in the first layer, below the oxygen atom. The Zr lattice stays close to the cubic (111) surface form, although the 3 Zr atoms that are neighbors to the oxygen tend to move away from the oxygen atom and lie closer to the nitrogen on the surface, as compared to the bulk structure.

To verify the surprising finding that one of the N atoms in the unit cell prefers a surface site, we relax the obtained global minimum structure to a maximum residual force of 0.05 eV/Å. In addition, we relax three other structures that are built manually (see figure 10). S1: a structure where only oxygen is on the surface and the ZrN lattice has its bulk form. S2: a structure where a single nitrogen atom is on the surface and the oxygen is moved to the vacancy left behind by the nitrogen. S3: a structure where a single nitrogen is on the surface and the oxygen is in the second N-layer. Compared with these structures, the structure identified by the global search has the lowest energy, with energy differences of 0.25 eV, 1.6 eV, and 1.8 eV to S1, S2, and S3, respectively. The method thus finds a local minimum structure that is lower in energy than the most intuitive configurations.

FIG. 9. Energy, prediction, and hyperparameter evolution in a single global optimization run for a [Ta2O5]x cluster. Selected structures that were evaluated with DFT along the run are also shown. The global minimum structure is visited at step number 42 and at various steps after that. The green area shows the estimated uncertainty of the prediction. The prior constant is shown with respect to the global minimum energy.

The progress of one of the successful global optimization runs is shown in figure 11. It illustrates that the method visits a diverse set of structures. For the first 11 steps, the program produces rather unrealistic structures with energies more than 10 eV above the global minimum. After step 11, the low-energy structures are exploited more thoroughly, and the global minimum structure is found at step number 22. The program continues with a balance between exploring structures at fairly high energies and identifying competing low-energy structures, for example at steps 36 and 61. It is also notable that some non-trivial Zr surface structures are explored, such as those at steps 1 and 29. In general, we note that a fair share of the investigated structures are rather high in energy. This seems to be necessary to achieve a proper exploration and training of the model.

E. Other systems
One of the main points of this paper is to show how including gradients in the Gaussian process affects the performance of Bayesian global optimization in comparison with GOFEE [11], where only energies are used for training. Our results with both learning curves and success curves indicate that adding the gradient information to the model improves the performance of the search. In figure 12, we show success curves for three more systems to investigate the effect of training on the gradients as well as the energies. We run several optimizations for a Ti4O8 cluster, bulk TiO2, and bulk silicon. The bulk TiO2 system consists of 12 atoms, and the unit cell is fixed as appropriate for the rutile phase. The bulk silicon system consists of 16 atoms, and the unit cell is fixed corresponding to the diamond lattice. For the Ti4O8 cluster and bulk silicon the improvement is notable. In contrast, and a bit surprisingly, training on the gradients does not have any effect on the overall performance for bulk TiO2, so the potential gain of including gradients does depend on the system under study. However, at this point, we cannot tell how much different properties like size, symmetry, number of elements, the shape of the true potential, etc. matter for the acceleration obtained by using gradients.

FIG. 10. Investigated structures for the ZrN-O surface and their respective energies (global minimum: 0.00 eV; S1: +0.25 eV; S2: +1.6 eV; S3: +1.8 eV). The structures are shown from above the surface. See the text for details about the structures. Light blue: Zr; sky blue: N; red: O.
VI. COMPUTATIONAL TIME AND MEMORY LIMITATIONS
The primary goal of this work is to reduce the number of expensive DFT calculations necessary to find the global minimum of a PES. However, it should be noted that in some cases the wall-clock time needed to run the relaxations on the surrogate surface may become comparable to the time spent on DFT, especially if the DFT implementation is parallelized efficiently while the Gaussian process is not. In our current implementation each surrogate relaxation is always run on a separate, single processor, but no further parallelization is performed. In the beginning, most of the time in the surrogate model is spent on calculating the fingerprints and their gradients at every step of the local relaxations. Later in a global optimization run, as the training set becomes larger, calculating the Hessians of the kernel function in fingerprint space is the computational bottleneck. (See the computational times of training and predicting in the Supplemental Material [41].) In our examples, training the model with gradients typically takes more than one order of magnitude more time compared to using the energies alone. The time consumed in predicting is roughly the same with few training points, but as their number increases, ever larger Hessian matrices need to be calculated with the gradients, leading to an increase in computer time. Predicting without training on the gradients is faster, since the kernel Hessians are not calculated and the linear algebra is applied to much smaller matrices. The numerical updating of the length scale involves training the model at each optimization step, and with a large number of atoms and large training sets this becomes too heavy in practice with the gradient-trained models unless the training is parallelized. Eventually, if the DFT calculation is fast enough, the energy-trained L-BEACON might become faster in wall-clock time than BEACON, even if more DFT calculations are required to train a sufficient model.
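The growth of the matrices involved is easy to quantify: with gradient training, every configuration contributes one energy and 3 N_atoms force components to the Gaussian process, so the C-matrix discussed below is of order n = N(1 + 3 N_atoms) rather than n = N. A rough back-of-the-envelope helper (the function names and example numbers are ours):

```python
def c_matrix_order(n_train, n_atoms, with_gradients=True):
    """Order n of the square C-matrix of the Gaussian process:
    n = N * (1 + 3 * n_atoms) with gradient training (BEACON),
    n = N with energy-only training (L-BEACON)."""
    return n_train * (1 + 3 * n_atoms) if with_gradients else n_train

def c_matrix_gib(n_train, n_atoms, with_gradients=True):
    """Memory needed to store the full C-matrix in float64, in GiB."""
    n = c_matrix_order(n_train, n_atoms, with_gradients)
    return 8 * n * n / 2**30

# Example: 100 training points for a 15-atom cluster gives n = 4600 for
# BEACON versus n = 100 for L-BEACON, a factor (1 + 3*15) = 46 in matrix
# order and 46**2 in storage.
```

This also explains the order-of-magnitude gap in training time quoted above, since the dense linear algebra scales steeply with the matrix order.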
However, as we observe from the success curves, BEACON is more robust in finding the global minimum and does not get stuck as often as L-BEACON.

The memory usage is limited by the inversion of the C-matrix, as required by equation (1), which scales as O(n²) memory-wise [7, 13], where n is the order of the square matrix, i.e. in this context n = N(1 + 3N_atoms) with N the number of training points. If memory or speed becomes an issue for large systems, one is forced to use L-BEACON (or to switch to L-BEACON on the fly during the optimization), where we have merely n = N, the number of training points.

VII. CONCLUSIONS
We think that the use of Bayesian strategies, and more specifically Gaussian processes, for global atomic structure determination is only at its beginning. As demonstrated by Bisbo and Hammer [11, 28], a surrogate model based on a global atomic fingerprint in combination with a Bayesian search strategy can outperform earlier global optimization methods by orders of magnitude in reduced computer power. In the present paper we show that including gradient information, i.e. the atomic forces, in the training of the model in most cases leads to a further reduction in the number of DFT calculations necessary to identify the global minimum energy structure. The approach was successfully applied to clusters, surfaces, and bulk systems.

However, many aspects of the approach are still unexplored. The surrogate models are not particularly accurate, as shown by the learning curves, but still they work very well in the optimization process. So far only a single global fingerprint was investigated, and it is not known whether other fingerprints like SOAP [24], MBTR [26], or FCHL [27] would perform even better. The way of suggesting new candidate structures could potentially also be improved, for example in combination with a genetic algorithm, and other acquisition functions may be relevant. Finally, the approach could be combined with other artificial-intelligence techniques providing additional guiding of the search.

The code for BEACON is available at https://gitlab.com/gpatom/ase-gpatom. It is integrated with the Atomic Simulation Environment (ASE) [32, 33], so that any energy and force calculator supported by ASE can be used together with BEACON. In the present implementation the unit cell is kept fixed during the search. However, it should be possible to update the unit cell based on the currently applied fingerprint, and we expect to implement that in the near future.

FIG. 11. A single global optimization run of the ZrN-O surface. The global minimum structure was found at step 22; the structure is shown in the figure.

FIG. 12. (a) Ti4O8 cluster, (b) bulk TiO2, and (c) bulk Si/diamond success curves with a success threshold of 0.05 eV. For the Ti4O8 cluster, tight-binding DFT was used as the true potential [49, 50], whereas DFT was used for the bulk systems with the same setups as for SiO2. The unit cells of TiO2 and Si include 12 and 16 atoms, respectively.

ACKNOWLEDGMENTS
We acknowledge support from the VILLUM Center for Science of Sustainable Fuels and Chemicals, which is funded by the VILLUM Fonden research grant (9455).

[1] J. Zhang and V. A. Glezakou, Global optimization of chemical cluster structures: Methods, applications, and challenges, International Journal of Quantum Chemistry, 044114 (2020).
[2] D. J. Wales and J. P. K. Doye, Global optimization by basin-hopping and the lowest energy structures of Lennard-Jones clusters containing up to 110 atoms, The Journal of Physical Chemistry A, 5111 (1997), https://doi.org/10.1021/jp970984n.
[3] L. B. Vilhelmsen and B. Hammer, A genetic algorithm for first-principles global structure optimization of supported nanostructures, The Journal of Chemical Physics, 044711 (2014), https://doi.org/10.1063/1.4886337.
[4] S. V. Lepeshkin, V. S. Baturin, Y. A. Uspenskii, and A. R. Oganov, Method for simultaneous prediction of atomic structure and stability of nanoclusters in a wide area of compositions, The Journal of Physical Chemistry Letters, 102 (2019), https://doi.org/10.1021/acs.jpclett.8b03510.
[5] M. Jäger, R. Schäfer, and R. L. Johnston, GIGA: a versatile genetic algorithm for free and supported clusters and nanoparticles in the presence of ligands, Nanoscale, 9042 (2019).
[6] Z. Chen, W. Jia, X. Jiang, S.-S. Li, and L.-W. Wang, SGO: A fast engine for ab initio atomic structure global optimization by differential evolution, Computer Physics Communications, 35 (2017).
[7] E. Garijo del Río, J. J. Mortensen, and K. W. Jacobsen, Local Bayesian optimizer for atomic structures, Phys. Rev. B, 104103 (2019).
[8] T. Yamashita, N. Sato, H. Kino, T. Miyake, K. Tsuda, and T. Oguchi, Crystal structure prediction accelerated by Bayesian optimization, Phys. Rev. Materials, 013803 (2018).
[9] H. L. Mortensen, S. A. Meldgaard, M. K. Bisbo, M.-P. V. Christiansen, and B. Hammer, Atomistic structure learning algorithm with surrogate energy model relaxation, Phys. Rev. B, 075427 (2020).
[10] M. O. Jäger, E. V. Morooka, F. F. Canova, L. Himanen, and A. S. Foster, Machine learning hydrogen adsorption on nanoclusters through structural descriptors, npj Computational Materials, 1 (2018).
[11] M. K. Bisbo and B. Hammer, Efficient global structure optimization with a machine-learned surrogate model, Physical Review Letters, 086102 (2020).
[12] B. C. Rinderspacher, Heuristic global optimization in chemical compound space, The Journal of Physical Chemistry A, 9044 (2020), PMID: 33079549, https://doi.org/10.1021/acs.jpca.0c05941.
[13] O.-P. Koistinen, F. B. Dagbjartsdóttir, V. Ásgeirsson, A. Vehtari, and H. Jónsson, Nudged elastic band calculations accelerated with Gaussian process regression, The Journal of Chemical Physics, 152720 (2017), https://doi.org/10.1063/1.4986787.
[14] J. A. Garrido Torres, E. Garijo del Río, V. Streibel, M. H. Hansen, T. S. Choksi, J. J. Mortensen, A. Urban, M. Bajdich, F. Abild-Pedersen, K. W. Jacobsen, and T. Bligaard, An artificial intelligence approach for navigating potential energy surfaces (2021), in preparation.
[15] A. Denzel and J. Kästner, Gaussian process regression for transition state search, Journal of Chemical Theory and Computation, 5777 (2018).
[16] M. Todorović, M. U. Gutmann, J. Corander, and P. Rinke, Bayesian inference of atomistic structure in functional materials, npj Computational Materials, 1 (2019).
[17] A. P. Bartók, M. C. Payne, R. Kondor, and G. Csányi, Gaussian approximation potentials: The accuracy of quantum mechanics, without the electrons, Phys. Rev. Lett., 136403 (2010).
[18] S. Chmiela, A. Tkatchenko, H. E. Sauceda, I. Poltavsky, K. T. Schütt, and K.-R. Müller, Machine learning of accurate energy-conserving molecular force fields, Science Advances, e1603015 (2017).
[19] J. Vandermause, S. B. Torrisi, S. Batzner, Y. Xie, L. Sun, A. M. Kolpak, and B. Kozinsky, On-the-fly active learning of interpretable Bayesian force fields for atomistic rare events, npj Computational Materials, 20 (2020).
[20] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning (The MIT Press, 2006).
[21] C. M. Bishop, Pattern Recognition and Machine Learning (Springer, New York, 2006).
[22] …, 045018 (2020).
[23] M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld, Fast and accurate modeling of molecular atomization energies with machine learning, Phys. Rev. Lett., 058301 (2012).
[24] A. P. Bartók, R. Kondor, and G. Csányi, On representing chemical environments, Phys. Rev. B, 184115 (2013).
[25] J. Behler, Atom-centered symmetry functions for constructing high-dimensional neural network potentials, The Journal of Chemical Physics, 074106 (2011), https://doi.org/10.1063/1.3553717.
[26] H. Huo and M. Rupp, Unified representation of molecules and crystals for machine learning (2018), arXiv:1704.06439 [physics.chem-ph].
[27] A. S. Christensen, L. A. Bratholm, F. A. Faber, and O. Anatole von Lilienfeld, FCHL revisited: Faster and more accurate quantum machine learning, The Journal of Chemical Physics, 044107 (2020), https://doi.org/10.1063/1.5126701.
[28] M. K. Bisbo and B. Hammer, Global optimization of atomistic structure enhanced by machine learning, arXiv:2012.15222 (2020).
[29] J. Wu, M. Poloczek, A. G. Wilson, and P. Frazier, Bayesian optimization with gradients, in Advances in Neural Information Processing Systems (2017), pp. 5267-5278.
[30] M. Valle and A. R. Oganov, Crystal fingerprint space: a novel paradigm for studying crystal-structure sets, Acta Crystallographica Section A, 507 (2010), https://doi.org/10.1107/S0108767310026395.
[31] K. W. Jacobsen, J. K. Nørskov, and M. J. Puska, Interatomic interactions in the effective-medium theory, Phys. Rev. B, 7423 (1987).
[32] A. H. Larsen, J. J. Mortensen, J. Blomqvist, I. E. Castelli, R. Christensen, M. Dułak, J. Friis, M. N. Groves, B. Hammer, C. Hargus, E. D. Hermes, P. C. Jennings, P. B. Jensen, J. Kermode, J. R. Kitchin, E. L. Kolsbjerg, J. Kubal, K. Kaasbjerg, S. Lysgaard, J. B. Maronsson, T. Maxson, T. Olsen, L. Pastewka, A. Peterson, C. Rostgaard, J. Schiøtz, O. Schütt, M. Strange, K. S. Thygesen, T. Vegge, L. Vilhelmsen, M. Walter, Z. Zeng, and K. W. Jacobsen, The atomic simulation environment: a Python library for working with atoms, Journal of Physics: Condensed Matter, 273002 (2017).
[33] Atomic Simulation Environment (ASE), https://wiki.fysik.dtu.dk/ase/ (2020).
[34] J. P. Perdew, K. Burke, and M. Ernzerhof, Generalized gradient approximation made simple, Phys. Rev. Lett., 3865 (1996).
[35] J. Enkovaara, C. Rostgaard, J. J. Mortensen, J. Chen, M. Dułak, L. Ferrighi, J. Gavnholt, C. Glinsvad, V. Haikola, H. A. Hansen, H. H. Kristoffersen, M. Kuisma, A. H. Larsen, L. Lehtovaara, M. Ljungberg, O. Lopez-Acevedo, P. G. Moses, J. Ojanen, T. Olsen, V. Petzold, N. A. Romero, J. Stausholm-Møller, M. Strange, G. A. Tritsaris, M. Vanin, M. Walter, B. Hammer, H. Häkkinen, G. K. H. Madsen, R. M. Nieminen, J. K. Nørskov, M. Puska, T. T. Rantala, J. Schiøtz, K. S. Thygesen, and K. W. Jacobsen, Electronic structure calculations with GPAW: a real-space implementation of the projector augmented-wave method, Journal of Physics: Condensed Matter, 253202 (2010).
[36] B. Huang and O. A. von Lilienfeld, Communication: Understanding molecular representations in machine learning: The role of uniqueness and target similarity, The Journal of Chemical Physics, 161102 (2016).
[37] B. Huang and O. A. von Lilienfeld, Ab initio machine learning in chemical compound space (2020), arXiv:2012.07502 [physics.chem-ph].
[38] B. Parsaeifard, D. S. De, A. S. Christensen, F. A. Faber, E. Kocer, S. De, J. Behler, A. von Lilienfeld, and S. Goedecker, An assessment of the structural resolution of various fingerprints commonly used in machine learning, Machine Learning: Science and Technology (2020).
[39] V. L. Deringer and G. Csányi, Machine learning based interatomic potential for amorphous carbon, Phys. Rev. B, 094203 (2017).
[40] M. S. Jørgensen, U. F. Larsen, K. W. Jacobsen, and B. Hammer, Exploration versus exploitation in global atomistic structure optimization, The Journal of Physical Chemistry A, 1504 (2018).
[41] See Supplemental Material at [URL will be inserted by publisher] for success threshold comparison of SiO2 runs, length scale evolution of L-BEACON runs for SiO2, optimization progress for another BEACON run for Ta2O5, and the details of the computational performance of BEACON.
[42] S. Pérez-Walton, C. Valencia-Balvín, A. C. M. Padilha, G. M. Dalpian, and J. M. Osorio-Guillén, A search for the ground state structure and the phase stability of tantalum pentoxide, Journal of Physics: Condensed Matter, 035801 (2015).
[43] Y. Yang and Y. Kawazoe, Prediction of new ground-state crystal structure of Ta2O5, Phys. Rev. Materials, 034602 (2018).
[44] J.-H. Yuan, K.-H. Xue, Q. Chen, L. R. C. Fonseca, and X.-S. Miao, Ab initio simulation of Ta2O5: A high symmetry ground state phase with application to interface calculation, Annalen der Physik, 1800524 (2019), https://doi.org/10.1002/andp.201800524.
[45] S. Srivastava, J. P. Thomas, N. Heinig, M. Abd-Ellah, M. A. Rahman, and K. T. Leung, Efficient photoelectrochemical water splitting on ultrasmall defect-rich TaOx nanoclusters enhanced by size-selected Pt nanocluster promoters, Nanoscale, 14395 (2017).
[46] Y. Chen, J. L. G. Fierro, T. Tanaka, and I. E. Wachs, Supported tantalum oxide catalysts: Synthesis, physical characterization, and methanol oxidation chemical probe reaction, The Journal of Physical Chemistry B, 5243 (2003).
[47] Y. Yuan, J. Wang, S. Adimi, H. Shen, T. Thomas, R. Ma, J. P. Attfield, and M. Yang, Zirconium nitride catalysts surpass platinum for oxygen reduction, Nature Materials, 282 (2020).
[48] B. Hammer, L. B. Hansen, and J. K. Nørskov, Improved adsorption energetics within density-functional theory using revised Perdew-Burke-Ernzerhof functionals, Phys. Rev. B, 7413 (1999).
[49] B. Hourahine, B. Aradi, V. Blum, F. Bonafé, A. Buccheri, C. Camacho, C. Cevallos, M. Y. Deshaye, T. Dumitrică, A. Dominguez, S. Ehlert, M. Elstner, T. van der Heide, J. Hermann, S. Irle, J. J. Kranz, C. Köhler, T. Kowalczyk, T. Kubař, I. S. Lee, V. Lutsker, R. J. Maurer, S. K. Min, I. Mitchell, C. Negre, T. A. Niehaus, A. M. N. Niklasson, A. J. Page, A. Pecchia, G. Penazzi, M. P. Persson, J. Řezáč, C. G. Sánchez, M. Sternberg, M. Stöhr, F. Stuckenberg, A. Tkatchenko, V. W.-z. Yu, and T. Frauenheim, DFTB+, a software package for efficient approximate density functional theory based atomistic simulations, The Journal of Chemical Physics, 124101 (2020), https://doi.org/10.1063/1.5143190.
[50] R. Luschtinetz, J. Frenzel, T. Milek, and G. Seifert, Adsorption of phosphonic acid at the TiO2 anatase (101) and rutile (110) surfaces, The Journal of Physical Chemistry C, 5730 (2009), https://doi.org/10.1021/jp8110343.

Supplemental Material for: Global optimization of atomic structures with gradient-enhanced Gaussian process regression

Sami Kaappa, Estefanía Garijo del Río, Karsten Wedel Jacobsen
Department of Physics, Technical University of Denmark, Kongens Lyngby, Denmark (Dated: February 23, 2021)
Figure 1: SiO2 success curves with success thresholds of 0.20 eV (upper) and 0.01 eV (lower).

Figure 2: Fit length scales in the L-BEACON runs for SiO2.

Figure 3: A single global optimization run of a Ta2O5 cluster (panels: "With gradients", "Without gradients").