A local Bayesian optimizer for atomic structures
Estefanía Garijo del Río, Jens Jørgen Mortensen, Karsten W. Jacobsen
CAMD, Department of Physics, Technical University of Denmark

(Dated: August 30, 2019)

A local optimization method based on Bayesian Gaussian processes is developed and applied to atomic structures. The method is applied to a variety of systems including molecules, clusters, bulk materials, and molecules at surfaces. The approach is seen to compare favorably to standard optimization algorithms like conjugate gradient or BFGS in all cases. The method relies on prediction of surrogate potential energy surfaces, which are fast to optimize, and which are gradually improved as the calculation proceeds. The method includes a few hyperparameters, the optimization of which may lead to further improvements of the computational speed.
I. INTRODUCTION
One of the great successes of density functional theory (DFT) is its ability to predict ground-state atomic structures. By minimizing the total energy, the atomic positions in solids or molecules at low temperatures can be obtained. However, the optimization of atomic structures with density functional theory or higher-level quantum chemistry methods requires substantial computer resources. It is therefore important to develop new methods to perform the optimization efficiently.

It is of key interest here that for a given atomic structure a DFT calculation provides not only the total electronic energy, but also, at almost no additional computational cost, the forces on the atoms, i.e. the derivatives of the energy with respect to the atomic coordinates. This means that for a system with N atoms in a particular configuration only a single energy value is obtained, while 3N derivatives are also calculated. It is therefore essential to include the gradient information in an efficient optimization.

A number of well-known function optimizers exploring gradient information exist, and several are implemented in standard libraries like the SciPy library for use in Python. Two much-used examples are the conjugate gradient (CG) method and the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm. Both of these rely on line minimizations and perform particularly well for a nearly harmonic potential energy surface (PES). In the CG method, a series of conjugated search directions is calculated, while the BFGS method gradually builds up information about the Hessian, i.e. the second derivatives of the energy, to find appropriate search directions.

The Gaussian process (GP) method that we are going to present has the benefit that it produces smooth surrogate potential energy surfaces (SPES) even in regions of space where the potential is non-harmonic. This leads to a generally improved convergence. The number of algebraic operations that has to be carried out in order to move from one atomic structure to the next is much higher for the GP method than for the CG or BFGS methods; however, this is not of concern for optimizing atomic structures with DFT, because the electronic structure calculations themselves are so time consuming. For more general optimization problems where the function evaluations are fast, the situation may be different.

Machine learning for PES modelling has recently attracted the attention of the materials modelling community. In particular, several methods have focused on fitting the energies of electronic structure calculations to expressions of the form

$$E(\rho) = \sum_{i=1}^{n} \alpha_i\, k(\rho^{(i)}, \rho). \qquad (1)$$

Here, $\{\rho^{(i)}\}_{i=1}^{n}$ are some descriptors of the n atomic configurations sampled, $k(\rho^{(i)}, \rho)$ is known as a kernel function, and $\{\alpha_i\}_{i=1}^{n}$ are the coefficients to be determined in the fit. Since there are n data values and n free parameters, the SPES determined by this expression reproduces the values of the calculations at the configurations in the training set.

Here we note that expression (1) can easily be extended to

$$E(\rho) = \sum_{i=1}^{n} \alpha_i\, k(\rho^{(i)}, \rho) + \sum_{i=1}^{n} \sum_{j=1}^{3N} \beta_{ij}\, \frac{\partial k(\rho^{(i)}, \rho)}{\partial r_j^{(i)}}, \qquad (2)$$

where $\{r_j^{(i)}\}_{j=1}^{3N}$ represent the coordinates of the N atoms in the i-th configuration. The new set of parameters $\beta_{ij}$ together with $\alpha_i$ can be adjusted so that not only the right energy of a given configuration $\rho^{(i)}$ is predicted, but also the right forces.
This approach has two advantages with respect to the previous one: (i) the information included in the model scales with the dimensionality, and (ii) the new model is smooth and has the right gradients.

In the case of systems with many identical atoms or similar local atomic structures it becomes advantageous to construct SPESs based on descriptors or fingerprints characterizing the local environment. The descriptors can then be constructed to obey basic principles such as rotational and translational symmetries and invariance under exchange of identical atoms. Here we shall develop an approach based on Gaussian processes which works directly with the atomic coordinates and effectively produces a surrogate PES of the type Eq. (2), aimed at relaxing atomic structures. We note that Gaussian processes with derivatives for PES modeling is a fast-developing field, with recent applications in local optimization and path determination in elastic band calculations.
II. GAUSSIAN PROCESS REGRESSION

We use Gaussian process regression with derivative information to produce a combined model for the energy E and the forces f of a configuration with atomic positions $x = (r_1, r_2, \ldots, r_N)$:

$$U(x) = (E(x), -f(x)) \sim \mathcal{GP}\left(U_p(x), K(x, x')\right), \qquad (3)$$

where $U_p(x) = (E_p(x), \nabla E_p(x))$ is a vector-valued function which constitutes the prior model for the PES, and $K(x, x')$ is a matrix-valued kernel function that models the correlation between pairs of energy and force values as a function of the configuration space.

In this work, we choose the constant function $U_p(x) = (E_p, 0)$ as the prior function. For the kernel, we use the squared-exponential covariance function to model the correlation between the energy of different configurations:

$$k(x, x') = \sigma_f^2\, e^{-\|x - x'\|^2 / (2 l^2)}, \qquad (4)$$

where l is a typical scale of the problem and σ_f is a parameter describing the prior variance at any configuration x. The full kernel K can be obtained by noting that

$$\mathrm{cov}(E(x), E(x')) = k(x, x'), \qquad (5)$$

$$\mathrm{cov}\!\left(E(x), \frac{\partial E(x')}{\partial x'_i}\right) = \frac{\partial k(x, x')}{\partial x'_i} \equiv J_i(x, x'), \qquad (6)$$

$$\mathrm{cov}\!\left(\frac{\partial E(x)}{\partial x_i}, \frac{\partial E(x')}{\partial x'_j}\right) = \frac{\partial^2 k(x, x')}{\partial x_i\, \partial x'_j} \equiv H_{ij}(x, x'), \qquad (7)$$

and assembling these covariance functions in matrix form:

$$K(x, x') = \begin{pmatrix} k(x, x') & J(x, x') \\ J(x', x)^T & H(x, x') \end{pmatrix}. \qquad (8)$$

The expressions for the mean and the variance of the posterior distribution follow the usual definitions, incorporating the additional matrix structure. Let $X = \{x^{(i)}\}_{i=1}^{n}$ denote the matrix containing the n training inputs and let $Y = \{y^{(i)}\}_{i=1}^{n} = \{(E(x^{(i)}), -f(x^{(i)}))\}_{i=1}^{n}$ be the matrix containing the corresponding training targets. By defining

$$K(x, X) = \left(K(x, x^{(1)}), K(x, x^{(2)}), \ldots, K(x, x^{(n)})\right) \qquad (9)$$

and

$$(K(X, X))_{ij} = K(x^{(i)}, x^{(j)}), \qquad (10)$$

we get the following expressions for the mean,

$$\bar{U}(x) = (\bar{E}(x), -\bar{f}(x)) = U_p(x) + K(x, X)\, K_X^{-1}\, (Y - U_p(X)), \qquad (11)$$

and the variance,

$$\sigma^2(x) = K(x, x) - K(x, X)\, K_X^{-1}\, K(X, x), \qquad (12)$$

of the prediction, where $K_X = K(X, X) + \Sigma_n$. Here, we have assumed an additive Gaussian noise term with covariance matrix $\Sigma_n$. This term corrects only for the self-covariance of the points in the training set, and thus it is a diagonal matrix that models the self-correlation of the forces with a hyperparameter σ_n and the self-correlation of the energies with σ_n · l. We note that even for computational frameworks where the energy and forces can be computed with very limited numerical noise, small non-zero values of σ_n are advantageous, since they prevent the inversion of the covariance matrix K(X, X) from being numerically ill-conditioned.

In the following, we will refer to Ē(x) as defined in equation (11) as the surrogate potential energy surface (SPES) and distinguish it from the first-principles PES, E(x).
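To make the construction concrete, the following minimal sketch implements Eqs. (4)-(11) for a one-dimensional toy problem. This is illustrative only, not the ASE implementation; the toy PES and all parameter values are invented, and the noise term is interpreted as adding the variances $(\sigma_n l)^2$ and $\sigma_n^2$ to the energy and force channels, respectively.

    import numpy as np

    # 1D toy example of GP regression with derivative observations.
    sigma_f, l, sigma_n = 1.0, 0.5, 1e-3

    def k(x, xp):        # squared-exponential kernel, Eq. (4)
        return sigma_f**2 * np.exp(-(x - xp)**2 / (2 * l**2))

    def k_d(x, xp):      # dk/dx', Eq. (6)
        return k(x, xp) * (x - xp) / l**2

    def k_dd(x, xp):     # d^2 k / (dx dx'), Eq. (7)
        return k(x, xp) * (1 - (x - xp)**2 / l**2) / l**2

    def K_block(x, xp):  # 2x2 block of the matrix-valued kernel, Eq. (8)
        return np.array([[k(x, xp),   k_d(x, xp)],
                         [k_d(xp, x), k_dd(x, xp)]])

    # Training data: energies and forces of the toy PES E(x) = x**2
    X = np.array([-1.0, 0.6])
    E, f = X**2, -2.0 * X
    E_p = E.max()  # constant prior, set to the maximum sampled energy

    # Targets Y = (E, -f) per point, and the prior vector U_p = (E_p, 0)
    Y = np.ravel(np.column_stack([E, -f]))
    U_p = np.ravel(np.column_stack([np.full_like(X, E_p), np.zeros_like(X)]))

    # K_X = K(X, X) + Sigma_n (energies regularized with sigma_n*l,
    # forces with sigma_n, as described in the text)
    K_X = np.block([[K_block(xi, xj) for xj in X] for xi in X])
    K_X += np.diag(np.tile([(sigma_n * l)**2, sigma_n**2], len(X)))

    alpha = np.linalg.solve(K_X, Y - U_p)

    def predict(x):      # posterior mean (E(x), -f(x)), Eq. (11)
        KxX = np.hstack([K_block(x, xj) for xj in X])
        return np.array([E_p, 0.0]) + KxX @ alpha

    print(predict(0.1))  # SPES energy and negative force near the minimum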
III. GAUSSIAN PROCESS MINIMIZER: GPMIN

The GP framework can be used to build an optimization algorithm. In this section, we introduce the main ideas behind the proposed Gaussian process minimizer (denoted GPMin from here on). A more detailed description of the algorithm can be found in the Appendix in the form of pseudocode.

The GP regression provides a SPES that can be minimized using a gradient-based local optimizer. For this purpose, we have used the L-BFGS-B algorithm as implemented in SciPy. The prior value for the energy is initially set to the energy of the initial configuration, and then expression (11) is used to produce a SPES from that data point alone. This model is then minimized, and the evaluation at the new local minimum generates new data that is then fed into the model to produce a new SPES, which will have a different local minimum. Before generating each new SPES, the prior value for the energy is updated to the maximum value of the energies previously sampled. This step is important because it makes the algorithm more stable. If a high-energy configuration is sampled, the forces may be very large, leading to a too large new step. The increase of the prior value tends to dampen this by effectively reducing the step size. The whole process is then iterated until convergence is reached.

It is illustrative to consider in more detail the first step of the algorithm. It is straightforward to show using equations (4)-(11) that if only a single data point $x^{(1)}$ is known, the SPES is given by

$$\bar{E}(x) = E^{(1)} - f^{(1)} \cdot (x - x^{(1)})\, e^{-\|x - x^{(1)}\|^2 / (2 l^2)}, \qquad (13)$$

where $E^{(1)}$ and $f^{(1)}$ are the energy and forces calculated at the point $x^{(1)}$, respectively. We have here used that the prior energy is set to the energy of the first configuration, $E^{(1)}$. One can confirm that this is the prior energy by noting that points far away from $x^{(1)}$, where no information is available, take on this value for the energy. It is seen that the initial force $f^{(1)}$ gives rise to a Gaussian depletion of the SPES. The first step of the GPMin algorithm minimizes the SPES, leading to a new configuration

$$x = x^{(1)} + l\, \frac{f^{(1)}}{\|f^{(1)}\|}. \qquad (14)$$

The first step is thus in the direction of the force with a step length of l. Considering the information available, this is a very natural choice.
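This first step can also be checked numerically: minimizing the one-data-point SPES of Eq. (13) with L-BFGS-B reproduces the step of Eq. (14). A small sketch with invented numbers:

    import numpy as np
    from scipy.optimize import minimize

    l = 0.4                              # length scale (Å)
    x1 = np.zeros(3)                     # first configuration
    E1 = 2.0                             # energy at x1 (also the prior E_p)
    f1 = np.array([0.3, -0.1, 0.2])      # force at x1

    def spes(x):                         # one-data-point SPES, Eq. (13)
        d = x - x1
        return E1 - f1 @ d * np.exp(-d @ d / (2 * l**2))

    res = minimize(spes, x1, method='L-BFGS-B')
    step = res.x - x1
    print(np.linalg.norm(step))          # ≈ l, cf. Eq. (14)
    print(step / np.linalg.norm(step) - f1 / np.linalg.norm(f1))  # ≈ 0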
GPMin depends on a number of parameters: the length scale l, the prior value of the energy E_p, the energy width σ_f, and the noise or regularization parameter σ_n. It can be seen from expressions (4) and (11) that the prediction of the SPES depends only on the ratio of σ_f and σ_n and not on their individual values.

The prior energy E_p is, as explained above, taken initially as the energy of the first configuration and then updated if larger energies are encountered. It is important that the prior value is not too low, to avoid large steps, since the prior energy is the value of the SPES for all configurations far away (on the scale of l) from previously investigated structures.

The scale l is very important, as it sets the distance over which the SPES relaxes back to the prior value E_p when moving away from the region of already explored configurations. It therefore also effectively determines a step length in the algorithm.

One interesting advantage of the Bayesian approach is that it allows for the update of parameters (usually termed hyperparameters) based on existing data. We investigate this option by allowing the value of the length scale l to change. Since the update procedure also depends on the width parameter σ_f, we update this as well. The updated hyperparameters, θ = (l, σ_f), are determined by maximizing the marginal likelihood:

$$\theta = \arg\max_{\vartheta}\, P(Y \mid X, \vartheta). \qquad (15)$$

The optimization may fail, for example if there is not enough evidence and the marginal likelihood is very flat; if that happens, the previous scale is kept. The update procedure allows the algorithm to find its own scale as it collects more information, producing a model that self-adapts to the problem at hand. In Section VI we shall consider in more depth the adequate choices for the values of the hyperparameters and the different strategies for their update when the optimizers are applied to DFT calculations.
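For concreteness, the update of Eq. (15) can be sketched as follows. Here build_K_X is a hypothetical helper that assembles $K_X = K(X, X) + \Sigma_n$ for given (l, σ_f) as in Sec. II, and the standard closed form of the Gaussian marginal log-likelihood is used; this is a sketch of the idea, not the production code.

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_marginal_likelihood(theta, X, Y, U_p, build_K_X):
        """Return -log P(Y | X, theta) for theta = (l, sigma_f)."""
        K_X = build_K_X(X, l=theta[0], sigma_f=theta[1])
        r = Y - U_p
        _, logdet = np.linalg.slogdet(K_X)
        return 0.5 * (r @ np.linalg.solve(K_X, r) + logdet
                      + len(Y) * np.log(2.0 * np.pi))

    def update_hyperparameters(theta_old, X, Y, U_p, build_K_X):
        """Eq. (15); if the maximization fails, keep the previous values."""
        res = minimize(neg_log_marginal_likelihood, theta_old,
                       args=(X, Y, U_p, build_K_X), method='L-BFGS-B')
        return res.x if res.success else theta_old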
IV. COMPUTATIONAL DETAILS

We illustrate and test the method on a variety of different systems using two different calculation methods: an interatomic effective-medium-theory (EMT) potential as implemented in ASE, and DFT. The DFT tests have been performed using GPAW with the local density approximation (LDA) exchange-correlation functional and a plane-wave basis set with an energy cutoff at 340 eV. The Brillouin zone has been sampled using the Monkhorst-Pack scheme with a k-point density of 2.0 points per Å⁻¹ in all three directions. The PAW setup with one valence electron has been used for the sodium cluster for simplicity. In addition to the default convergence criteria of GPAW, we specify that the maximum change in magnitude of the difference in force for each atom should be smaller than 10⁻⁴ eV Å⁻¹ for the self-consistent field iteration to terminate. This improves the convergence of the forces. All systems have been relaxed until the maximum force on the atoms was below 0.01 eV Å⁻¹.
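For reference, a GPAW calculator with these settings might be constructed as follows. This is a sketch: the keyword arguments are standard GPAW/ASE options, and the force-convergence value simply mirrors the threshold quoted above.

    from gpaw import GPAW, PW

    calc = GPAW(mode=PW(340),                  # plane waves, 340 eV cutoff
                xc='LDA',
                kpts={'density': 2.0},         # Monkhorst-Pack, 2.0 per Å⁻¹
                setups={'Na': '1'},            # one-valence-electron Na setup
                                               # (sodium cluster only)
                convergence={'forces': 1e-4})  # tight SCF force criterion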
V. EXAMPLE: GOLD CLUSTERS DESCRIBED IN EFFECTIVE MEDIUM THEORY

In the first example GPMin is used to find the structure of 10-atom gold clusters as described by the EMT potential, and the efficiency is compared with other common optimizers. For this purpose, we generate 1000 random configurations of a 10-atom gold cluster. The configurations are constructed by sequentially applying three uniform displacements for each atom in a cubic box with side length 4.8 Å and only keeping those that lie further than 1.7 times the atomic radius of gold away from any of the other atoms already present in the cluster. Each configuration is then optimized with different choices of parameters for GPMin and, for comparison, the same structures are optimized with the ASE implementations of FIRE and BFGS Line Search, and the SciPy implementations of BFGS and CG.

For the gold clusters, we have investigated the effect of updating σ_f and l for six different initial scales between 0.3 and 0.8 Å, with σ_f = 1.0 eV and a small fixed ratio σ_n/σ_f for the regularization. In the update version of the optimizer, we update the scale every 5th iteration.

The statistics of the number of energy evaluations are shown in Figure 1. The GP optimizers are seen to be the fastest on average with an appropriate choice of the hyperparameters. For the initial scale of 0.5 Å, for example, the updated version of GPMin relaxed the clusters in about 42 energy evaluations on average, fewer than any of the other optimizers; the full averages with error bars for all optimizers are shown in Figure 1.
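A sketch of one such relaxation using ASE's EMT calculator and the GPMin implementation distributed with ASE is shown below. The generator follows the recipe above only loosely: the random seed is arbitrary, and the "atomic radius" of gold is taken here to be the covalent radius, which is an assumption.

    import numpy as np
    from ase import Atoms
    from ase.calculators.emt import EMT
    from ase.data import atomic_numbers, covalent_radii
    from ase.optimize import GPMin

    rng = np.random.RandomState(seed=0)
    r_min = 1.7 * covalent_radii[atomic_numbers['Au']]

    # Place 10 Au atoms sequentially at uniformly random positions in a
    # 4.8 Å box, rejecting candidates closer than r_min to existing atoms.
    positions = []
    while len(positions) < 10:
        trial = rng.uniform(0.0, 4.8, size=3)
        if all(np.linalg.norm(trial - p) > r_min for p in positions):
            positions.append(trial)

    cluster = Atoms('Au10', positions=positions)
    cluster.calc = EMT()

    opt = GPMin(cluster, update_hyperparams=True)  # the updated variant
    opt.run(fmax=0.01)                             # eV/Å, as in Sec. IV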
Figure 1. Statistics of the number of energy evaluations for 1000 relaxations of a 10-atom gold cluster. The initial conditions have been randomly generated. The left-hand side of the plot shows the distribution of the number of energy evaluations for GPMin in its two variants for scales ranging from 0.3 to 0.8 Å: keeping the scale fixed or allowing it to be updated. The right-hand side shows the performance of other widely used optimizers, which have been sorted according to the average number of function evaluations.
Figure 1 also shows the trend in the performance as the scale is varied. For this system, the average number of energy evaluations decreases as the scale is increased within the investigated range, but a long scale also increases the risk that the minimization of the SPES produces a configuration with a higher energy than the previously evaluated point. Thus, we consider that the optimization has failed if, after 30 such catastrophic attempts, the optimizer has still not been able to identify a point that reduces the energy, or if SciPy's BFGS cannot successfully optimize the predicted SPES.

Figure 2. Average number of potential energy evaluations needed to relax 10 atomic structures as a function of the two hyperparameters: the length scale l and the regularization parameter σ_n/σ_f. The label NC (Not Converged) indicates that at least one of the relaxations did not converge. The default choices for the hyperparameters are indicated by circles.
VI. DETERMINATION OF THE HYPERPARAMETERS

We now continue by considering the use of the GP optimizers more generally for systems with PESs described by DFT. Default values of the hyperparameters should be chosen such that the algorithm performs well for a variety of atomic systems. For this purpose, we have chosen
a training set consisting of two different structures: (i) a 10-atom sodium cluster with random atomic positions and (ii) a carbon dioxide molecule on a (111) surface with two layers of gold and a 2×2 surface unit cell. For a grid of values of the hyperparameters (l, σ_n/σ_f), we relax the training systems and average over the number of DFT evaluations the optimizer needs to find a local minimum. The results are shown in Figure 2. The plot shows that the metallic cluster benefits from relatively large scales, while the CO on gold system with its tight CO bond requires a shorter scale. A too long scale might even imply that the optimizer does not converge. The set of hyperparameters l = 0.4 Å, σ_n = 1 meV Å⁻¹, and σ_f = 1 eV seems to be a good compromise between the two cases, and these are the default values we shall use in the following.

Figure 3. Average number of energy evaluations needed to relax the two training-set systems as a function of the hyperparameter σ_n/σ_f or of the initial value of σ_f, while the other one is kept fixed. The results are shown for the three different updating strategies and compared with the result of running GPMin without update with the same choice of hyperparameters. The rectangles show the values of the hyperparameters that have been chosen as default values. The value of σ_f chosen in the right panels has been used in the relaxations shown in the left panels; similarly, the value of σ_n/σ_f found optimal in the left panels is the one used in the relaxations in the right panels.

A similar procedure has been used to determine the default values of the hyperparameters and their initial values in the updated versions of GPMin. Here, the hyperparameter σ_n/σ_f is kept fixed during the optimization, whereas l and σ_f are determined using expression (15). The value of σ_n/σ_f and the initial values of the other hyperparameters are then determined from the analysis of the performance of the optimizer on the two systems in the training set.

The evolution of the hyperparameters depends on the details of the optimization of the marginal likelihood together with the frequency at which the hyperparameters are optimized. Here, we explore three different strategies: unconstrained maximization of the marginal log-likelihood every 5 energy evaluations ("GPMin-5"), and two constrained optimization strategies, where the outcome of the optimization is constrained to vary within ±10% or ±20% of the value of the hyperparameter in the previous step ("GPMin-10%" and "GPMin-20%", respectively). In the latter two cases we let the optimization take place whenever new information is added to the sample. The algorithm used to maximize the marginal log-likelihood is L-BFGS-B for all strategies.

Figure 4. Evolution of the length scale l with iteration for the three optimizers with update: GPMin-5, GPMin-10%, and GPMin-20%. The upper panel shows the results for the sodium cluster, while the lower panel shows the evolution for the CO/Au system. In all cases three different initial values, l = 0.2, 0.3, and 0.4 Å, are considered.

We have relaxed the same 10 slightly different copies of the two training-set systems described before using these three strategies for three different initial values of the scale (0.2, 0.3, and 0.4 Å), 8 different initial values of σ_f, and 7 different values of the regularization parameter σ_n/σ_f. An overview of the full results can be found in the Supplementary Material.

The average numbers of energy evaluations needed to relax the training set for the different strategies and hyperparameters are shown in Figure 3. The initial value of the scale is chosen as 0.3 Å; the left panels show the dependence on σ_n/σ_f with the initial value of σ_f fixed, and the right panels the dependence on the initial value of σ_f with σ_n/σ_f = 2 × 10⁻³ Å⁻¹. The performance of the optimizers is seen to depend rather weakly on the parameter values, in particular for the sodium cluster. We shall therefore in the following use the values σ_f = 2.0 eV and σ_n/σ_f = 2 × 10⁻³ Å⁻¹.

From the figure it can also be seen that the versions of the optimizer with updates perform considerably better than GPMin without updates for the sodium cluster, while for the CO molecule on gold, the version without update works slightly better than the three optimizers with updates.

To understand this behavior further, we consider in Figure 4 the evolution of the length scale l as it is being updated. The scale is initially set at three different values, l = 0.2, 0.3, and 0.4 Å. As seen in Figure 4, the updated scale adapts to each system as more data are collected, and the results do not depend very much on the initial scale in the range 0.2-0.4 Å. We therefore choose as defaults for the updated optimizers the initial values l = 0.3 Å, σ_f = 2.0 eV, and σ_n = 0.004 eV Å⁻¹ (σ_n/σ_f = 2 × 10⁻³ Å⁻¹). These values are used in the rest of this paper.
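The constrained strategies amount to bounding the likelihood maximization around the current hyperparameters. A sketch, reusing the hypothetical negative log-likelihood function from the sketch in Sec. III:

    from scipy.optimize import minimize

    def constrained_update(theta_old, nll, ratio=0.10):
        """One GPMin-10% / GPMin-20% style update: maximize the marginal
        log-likelihood while letting each hyperparameter change by at
        most +/- ratio relative to its previous value."""
        bounds = [(t * (1.0 - ratio), t * (1.0 + ratio)) for t in theta_old]
        res = minimize(nll, theta_old, method='L-BFGS-B', bounds=bounds)
        return res.x if res.success else theta_old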
VII. RESULTS

To test the Bayesian optimizers we have investigated their performance for seven different systems with DFT: a CO molecule on a Ag(111) surface, a C adsorbate on a Cu(100) surface, a distorted Cu(111) surface, bulk copper with random displacements of the atoms with Gaussian distribution and width 0.1 Å, an aluminum cluster with 13 atoms in a configuration close to fcc, the H₂ molecule, and the pentane molecule. All surfaces are represented by 2-layer slabs with a 2×2 surface unit cell. The average numbers of DFT evaluations needed by the different optimizers are collected in Figure 5.

Figure 5. Number of DFT evaluations required to optimize a given structure. For each structure, 10 different initial configurations are generated and optimized. The vertical line represents the average number of steps of GPMin without parameter updates. The error bar represents the error on the average. A different color has been used to highlight the optimizers of the GPMin family.

The GP optimizers are seen to compare favorably to, or on par with, the best of the other optimizers in all cases. GPMin without update is on average faster than the other optimizers for 6 of the 7 systems. For the bulk Cu system, it is only slightly slower than the ASE-BFGS algorithm. The updated GP optimizers perform even better, with one exception: GPMin-5 is clearly worse than the other GP optimizers and ASE-BFGS for the copper bulk system. Since the atomic displacements from the perfect crystal structure are quite small (∼0.1 Å), the PES is close to harmonic, and an optimizer like BFGS needs only a few (∼10) iterations to converge. The ASE-BFGS can therefore be expected to perform well, which is also what is observed in Figure 5. GPMin-5 does not update the scale for the first 5 iterations, and when it does so, the new scale does not lead to immediate convergence. The plain GPMin and the two other optimizers with updates perform on par with ASE-BFGS.

Generally, the updated optimizers perform better than GPMin without updates, and both GPMin-10% and GPMin-20% with constrained update perform consistently very well. The updated optimizers are clearly better than the plain GPMin for the Al cluster, similar to the behavior for the Na cluster used in the determination of hyperparameters. For the other training system, the CO/Au system, GPMin was seen to perform better than all the updated optimizers. However, in Figure 3 the scale was chosen to be l = 0.3 Å, a value that is already close to optimal for this system.
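As an illustration of the test setup, the rattled bulk-copper system can be generated in a few lines of ASE. This is a sketch; the supercell size and lattice constant here are assumptions, while the 0.1 Å Gaussian width follows the text above.

    from ase.build import bulk
    from ase.optimize import GPMin

    atoms = bulk('Cu', 'fcc', a=3.6, cubic=True).repeat((2, 2, 2))
    atoms.rattle(stdev=0.1, seed=42)   # Gaussian displacements, width 0.1 Å
    # Attach a DFT calculator as in Sec. IV, then relax:
    # GPMin(atoms).run(fmax=0.01)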
VIII. DISCUSSION

We ascribe the overall good performance of the GP optimizers to their ability to predict smooth potential energy surfaces covering both harmonic and anharmonic regions of the energy landscape. Since the Gaussian functions applied in the construction of the SPES all have the scale l, the SPES will be harmonic at scales much smaller than this around the minimum configuration. If the initial configuration is in this regime, the performance of the optimizer can be expected to be comparable to BFGS, which is optimal for a harmonic PES, and this is what is, for example, observed for the Cu bulk system. We believe that the relatively worse performance of the SciPy implementation of BFGS can be attributed to an initial guess of the Hessian that is too far from the correct one.

Given the performance on both the training and test sets, GPMin-10% seems to be a good choice. It should be noted that updating the hyperparameters requires iteration over the marginal log-likelihood, leading to an increased computational cost. However, this is not a problem, at least for systems comparable in size to the ones considered here.

The current version of the algorithm still has room for improvement. For example, different strategies for the update of hyperparameters may be introduced. Another, maybe even more interesting possibility is to use more advanced prior models of the PES than just a constant. The prior model of the PES could for example be obtained from fast lower-quality methods. Somewhat along these lines, there have been recent attempts to use previously known semi-empirical potentials for preconditioning more traditional gradient-based optimizers. This approach might be combined with the GP framework suggested here.

We also note that the choice of the Gaussian kernel, even though encouraged by the characteristics of the resulting potential and its previously reported success for similar problems, is to some extent arbitrary. It would be worthwhile to test its performance against other kernel functions, for example the Matérn kernel, which has been reported to achieve better performance in different contexts. The kernel used in this work is also limited to a single length scale. More flexible kernels allowing for different length scales for different types of bonds would be interesting to explore.

The probabilistic aspect, including the uncertainty as expressed in Eq. (12), is presently used only in the update of the hyperparameters. It could potentially lead to a further reduction of the number of function evaluations. The uncertainty provides a measure of how much a region of configuration space has been explored and can thereby guide the search also in global optimization problems.

Finally, a note on the limitations of the present version of the optimizer. The construction of the SPES involves the inversion of a square matrix (Eq. (11)) whose number of columns is n = N_c(3N + 1), where N is the number of atoms in the system and N_c the number of previously visited configurations. This is not a problem for moderately sized systems, but for large systems, where the optimization also requires many steps, the matrix inversion can be very computationally time consuming, and the current version of the method will only be efficient if this time is still short compared to the time to perform the DFT calculations. In addition, this can also result in a memory issue for large systems where the relaxation takes many steps. These issues may be addressed by considering only a subset of the data points or other sparsification techniques. Recently, Wang et al. showed that by using the Blackbox Matrix-Matrix multiplication algorithm it is possible to reduce the cost of training from O(n³) to O(n²), and then, by using distributed memory and 8 GPUs, they were able to train a Gaussian process with n ∼ 4.5 × 10⁴ (this would correspond to about 100 steps for 150 atoms with no constraints) in 50 seconds. This time is negligible compared to the time for DFT calculations of systems of this size.

The GPMin optimizers are implemented in Python and available in ASE.

ACKNOWLEDGMENTS
We appreciate fruitful conversations with Peter Bjørn Jørgensen. This work was supported by Grant No. 9455 from VILLUM FONDEN.
IX. APPENDIX
The optimization algorithm can be represented in pseudocode as follows:
Input:
    Initial structure: x(0) = (r_1, r_2, ..., r_N)
    Hyperparameters: l, σ_n, σ_f
    Tolerance: f_max

E(0), f(0) ← Calculator(x(0))
E_p ← E(0)
while max_i |f_i(0)| > f_max do
    X, Y ← Update(x(0), E(0), f(0))
    E_p ← max_Y E
    x(1) ← L-BFGS-B(GP(X, Y), start from x(0))
    E(1), f(1) ← Calculator(x(1))
    while E(1) > E(0) do
        X, Y ← Update(x(1), E(1), f(1))
        E_p ← max_Y E
        x(1) ← L-BFGS-B(GP(X, Y), start from x(0))
        E(1), f(1) ← Calculator(x(1))
        if max_i |f_i(1)| < f_max then
            break
        end if
    end while
    x(0), E(0), f(0) ← x(1), E(1), f(1)
end while

Output: x(0), E(0)
P. Hohenberg and W. Kohn, Physical Review 136, B864 (1964).
W. Kohn and L. J. Sham, Physical Review 140, A1133 (1965).
W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes 3rd Edition: The Art of Scientific Computing, 3rd ed. (Cambridge University Press, New York, NY, USA, 2007).
E. Jones, T. Oliphant, and P. Peterson, "SciPy: Open source scientific tools for Python" (2001–).
A. P. Bartók, S. De, C. Poelking, N. Bernstein, J. R. Kermode, G. Csányi, and M. Ceriotti, Science Advances 3, e1701816 (2017).
B. Huang and O. A. von Lilienfeld, arXiv:1707.04146v3 (2017).
A. Glielmo, C. Zeni, and A. De Vita, Physical Review B 97, 184307 (2018).
J. Behler and M. Parrinello, Physical Review Letters 98, 146401 (2007).
M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld, Physical Review Letters 108, 058301 (2012).
G. Csányi, T. Albaret, M. C. Payne, and A. De Vita, Physical Review Letters 93, 175503 (2004).
A. Khorshidi and A. A. Peterson, Computer Physics Communications 207, 310 (2016).
T. Yamashita, N. Sato, H. Kino, T. Miyake, K. Tsuda, and T. Oguchi, Physical Review Materials 2, 013803 (2018).
O.-P. Koistinen, F. B. Dagbjartsdóttir, V. Ásgeirsson, A. Vehtari, and H. Jónsson, The Journal of Chemical Physics 147, 152720 (2017).
T. L. Jacobsen, M. S. Jørgensen, and B. Hammer, Physical Review Letters 120, 026102 (2018).
M. Todorović, M. U. Gutmann, J. Corander, and P. Rinke, arXiv:1708.09274 (2017).
M. S. Jørgensen, U. F. Larsen, K. W. Jacobsen, and B. Hammer, The Journal of Physical Chemistry A 122, 1504 (2018).
E. V. Podryabinkin and A. V. Shapeev, Computational Materials Science 140, 171 (2017).
K. Gubaev, E. V. Podryabinkin, G. L. Hart, and A. V. Shapeev, Computational Materials Science 156, 148 (2019).
A. Denzel and J. Kästner, The Journal of Chemical Physics 148, 094114 (2018).
O.-P. Koistinen, V. Ásgeirsson, A. Vehtari, and H. Jónsson, "Nudged elastic band calculations accelerated with Gaussian process regression based on inverse interatomic distances" (2019).
J. A. Garrido Torres, P. C. Jennings, M. H. Hansen, J. R. Boes, and T. Bligaard, Physical Review Letters 122, 156001 (2019).
C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning (MIT Press, 2006).
J. Wu, M. Poloczek, A. G. Wilson, and P. Frazier, in Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Curran Associates, Inc., 2017), pp. 5267-5278.
R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu, SIAM Journal on Scientific Computing 16, 1190 (1995).
K. W. Jacobsen, J. K. Nørskov, and M. J. Puska, Physical Review B 35, 7423 (1987).
K. W. Jacobsen, P. Stoltze, and J. K. Nørskov, Surface Science 366, 394 (1996).
"Atomic Simulation Environment (ASE)," https://wiki.fysik.dtu.dk/ase/ (2018).
A. H. Larsen, J. J. Mortensen, et al., Journal of Physics: Condensed Matter 29, 273002 (2017).
J. Enkovaara, C. Rostgaard, J. J. Mortensen, et al., Journal of Physics: Condensed Matter 22, 253202 (2010).
E. Bitzek, P. Koskinen, F. Gähler, M. Moseler, and P. Gumbsch, Physical Review Letters 97, 170201 (2006).
See Supplementary Material at [URL introduced by the publisher] for a full overview of the average number of DFT calculations needed to relax the structures in the training set with different hyperparameters and strategies for GPMin with updates.
J. O. B. Tempkin, B. Qi, M. G. Saunders, B. Roux, A. R. Dinner, and J. Weare, The Journal of Chemical Physics 140, 184114 (2014).
L. Mones, C. Ortner, and G. Csányi, Scientific Reports 8, 13991 (2018).
D. J. Lizotte, Practical Bayesian Optimization, Ph.D. thesis, University of Alberta, Edmonton, Alta., Canada (2008).
G. Schmitz and O. Christiansen, The Journal of Chemical Physics 148, 241704 (2018).
B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas, Proceedings of the IEEE 104, 148 (2016).
K. A. Wang, G. Pleiss, J. R. Gardner, S. Tyree, K. Q. Weinberger, and A. G. Wilson, arXiv:1903.08114 (2019).