Bayesian optimization for computationally extensive probability distributions
Ryo Tamura¶*, Koji Hukushima¶*

International Center for Materials Nanoarchitectonics (WPI-MANA), National Institute for Materials Science, 1-1 Namiki, Tsukuba, Ibaraki 305-0044, Japan
Research and Services Division of Materials Data and Integrated System, National Institute for Materials Science, 1-2-1 Sengen, Tsukuba, Ibaraki 305-0047, Japan
Graduate School of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8568, Japan
Department of Basic Science, Graduate School of Arts and Sciences, The University of Tokyo, Komaba, Meguro, Tokyo 153-8902, Japan

¶ These authors contributed equally to this work.
* Corresponding authors
E-mail: [email protected]
E-mail: [email protected]
Abstract
An efficient method for finding a better maximizer of computationally extensive probability distributions is proposed on the basis of a Bayesian optimization technique. A key idea of the proposed method is to use extreme values of acquisition functions by Gaussian processes for the next training phase, which should be located near a local maximum or a global maximum of the probability distribution. Our Bayesian optimization technique is applied to the posterior distribution in the effective physical model estimation, which is a computationally extensive probability distribution. Even when the number of sampling points on the posterior distribution is fixed to be small, the Bayesian optimization provides a better maximizer of the posterior distribution in comparison to those by the random search method, the steepest descent method, or the Monte Carlo method. Furthermore, the Bayesian optimization improves the results efficiently when combined with the steepest descent method, and thus it is a powerful tool to search for a better maximizer of computationally extensive probability distributions.
Introduction
Bayesian optimization [1–5] has recently attracted much attention as a method to search for the maximizer/minimizer of a black-box function in informatics and materials science [6–12]. In this method, the black-box function is interpolated by Gaussian processes, and the interpolated function is then used to predict the maximizer/minimizer of the black-box function. Bayesian optimization is effective for problems where the value of the black-box function cannot be easily obtained, in other words, when the data for the black-box function are limited.

We are currently developing a generic effective physical model estimation method from experimentally measured data using machine learning, which relates to calibration in data science [13–15]. As the first example, we developed a method to estimate a set of model parameters x = (x_1, ..., x_K) in the Hamiltonian H(x), where K is the number of model parameters [16]. Let y^ex be the set of physical quantities {y^ex(g_l)}_{l=1,...,L} depending on the external parameter g_l, with L being the number of data points. By using Bayes' theorem, the posterior distribution P(x | y^ex), or the conditional probability of x given y^ex, is expressed as

\[ P(\mathbf{x} \mid \mathbf{y}^{\mathrm{ex}}) = \frac{P(\mathbf{y}^{\mathrm{ex}} \mid \mathbf{x})\, P(\mathbf{x})}{Z(\mathbf{y}^{\mathrm{ex}})}, \tag{1} \]

where P(x) and Z(y^ex) are the prior distribution of the model parameters and the normalization constant of the posterior distribution, respectively. Assuming that the observed noise follows a Gaussian distribution with a mean of zero and a standard deviation of σ, the likelihood function P(y^ex | x) is given as

\[ P(\mathbf{y}^{\mathrm{ex}} \mid \mathbf{x}) \propto \exp\left[ -\frac{1}{2\sigma^2} \sum_{l=1}^{L} \left( y^{\mathrm{ex}}(g_l) - y^{\mathrm{cal}}(g_l, \mathbf{x}) \right)^2 \right], \tag{2} \]

where {y^cal(g_l, x)}_{l=1,...,L} is the g_l dependence of the physical quantity calculated from H(x); hereinafter, let y^cal(x) denote the set {y^cal(g_l, x)}_{l=1,...,L}. Then, the posterior distribution is expressed as

\[ P(\mathbf{x} \mid \mathbf{y}^{\mathrm{ex}}) \propto \exp\left[ -E(\mathbf{x}) \right], \tag{3} \]

where the "energy function" as a function of x is given by

\[ E(\mathbf{x}) = \frac{1}{2\sigma^2} \sum_{l=1}^{L} \left( y^{\mathrm{ex}}(g_l) - y^{\mathrm{cal}}(g_l, \mathbf{x}) \right)^2 - \log P(\mathbf{x}). \tag{4} \]

From the viewpoint of maximum a posteriori (MAP) estimation, the plausible model parameters for explaining y^ex are obtained as the maximizer of Eq. (3) or, equivalently, the minimizer of Eq. (4). Thus, the most fundamental task in constructing an effective model is to maximize Eq. (3) or minimize Eq. (4).

A computational method to evaluate the posterior distribution or the energy function consists of a double-loop calculation. In the inner loop, the physical quantities y^cal(x) are calculated from H(x) for a given set of model parameters. The computational cost of the inner loop depends on the simulation method used to calculate y^cal(x). As discussed in Ref. 16, the steepest descent method is a promising way for this calculation when the input data are assumed to be explained by a simple classical Hamiltonian at zero temperature. In general, however, evaluating y^cal(x) for a given H(x) requires statistical/quantum mechanical many-body calculations, such as the Markov-chain Monte Carlo (MCMC) method [17–21], the exact diagonalization method [22–24], and the density matrix renormalization group method [25–27], all of which drastically increase the computational cost of the inner loop.

In the outer loop, a sampling of the model parameters x from the posterior distribution is performed. In Ref. 16, we used the MCMC method with the exchange Monte Carlo method [28].
Although this combined method efficiently yields the global maximum of the probability distribution even when many local maxima exist, the enormous number of sampling points required is very time consuming. Consequently, the MCMC approach to calculating the outer loop of a complicated effective model estimation is one of the main obstacles for applications in materials science.

In this paper, a computational method that estimates the effective model with a reduced outer-loop computational cost is discussed on the basis of Bayesian optimization for computationally extensive probability distributions. In our Bayesian optimization technique, extreme values of the acquisition functions obtained by Gaussian processes are used as candidates for maximizers of Eq. (3) or minimizers of Eq. (4). We investigate the efficiency of our Bayesian optimization technique in searching for the minimizer of E(x) defined by Eq. (4), relative to the random search method, the steepest descent method, and the Monte Carlo method, when the number of sampling points is fixed to be small. In our demonstrations, the magnetization curve of the classical Ising model calculated by the mean-field approximation and the specific heat of the quantum Heisenberg model calculated by the exact diagonalization method are treated as the input measured data. Consequently, it is found that Bayesian optimization is useful to search for a better maximizer of a computationally extensive probability distribution.
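To make the optimization target concrete, the following is a minimal sketch of the energy function in Eq. (4) for a uniform prior; it is an illustration, not the authors' code, and the forward simulator y_cal is a placeholder standing in for the expensive inner-loop many-body calculation.

```python
import numpy as np

def energy(x, g, y_ex, y_cal, sigma=1.0):
    """Energy function E(x) of Eq. (4) with a uniform prior (-log P(x) = const).

    x     : array of model parameters (x_1, ..., x_K)
    g     : array of external parameters g_l, l = 1, ..., L
    y_ex  : measured quantities y_ex(g_l)
    y_cal : callable (g, x) -> simulated quantities; this is the expensive
            inner loop (mean-field, exact diagonalization, ...)
    sigma : assumed standard deviation of the Gaussian observation noise
    """
    residual = y_ex - y_cal(g, x)
    return 0.5 * np.sum(residual ** 2) / sigma ** 2
```

Minimizing this function over x is exactly the MAP estimation of the model parameters; every call costs one full inner-loop simulation, which is why the number of evaluations N_s is the budget to economize.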
Bayesian optimization

Gaussian process
Gaussian processes are a powerful machine learning technique for estimating unknown data from known data sets [29]. Here we consider the case where the given data set is {x_n, E(x_n)}_{n=1,...,N}, where N is the number of data points. In our case, x_n is a set of model parameters in the effective physical model and E(x_n) denotes the value of the energy function E(x) defined by Eq. (4) at x_n. Using zero-mean Gaussian processes, the conditional probability P(E(x) | x) of E(x) given any x is written as a Gaussian distribution with mean μ(x) and standard deviation δ(x):

\[ \mu(\mathbf{x}) = \mathbf{k}^{\mathrm T} (K + \lambda I_N)^{-1} \mathbf{E}, \tag{5} \]
\[ \delta(\mathbf{x}) = \sqrt{c - \mathbf{k}^{\mathrm T} (K + \lambda I_N)^{-1} \mathbf{k}}, \tag{6} \]

where I_N is the N-dimensional identity matrix. Furthermore, E, k, K, and c are defined as

\[ \mathbf{E} = \left( E(\mathbf{x}_1) \; \cdots \; E(\mathbf{x}_N) \right)^{\mathrm T}, \tag{7} \]
\[ \mathbf{k} = \left( k(\mathbf{x}_1, \mathbf{x}) \; \cdots \; k(\mathbf{x}_N, \mathbf{x}) \right)^{\mathrm T}, \tag{8} \]
\[ K = \begin{pmatrix} k(\mathbf{x}_1, \mathbf{x}_1) & \cdots & k(\mathbf{x}_1, \mathbf{x}_N) \\ \vdots & \ddots & \vdots \\ k(\mathbf{x}_N, \mathbf{x}_1) & \cdots & k(\mathbf{x}_N, \mathbf{x}_N) \end{pmatrix}, \tag{9} \]
\[ c = k(\mathbf{x}, \mathbf{x}) + \lambda, \tag{10} \]

where k(x_i, x_j) is the Gauss kernel function:

\[ k(\mathbf{x}_i, \mathbf{x}_j) = \exp\left[ -\gamma \| \mathbf{x}_i - \mathbf{x}_j \|^2 \right]. \tag{11} \]

Although the computational cost of Gaussian processes is O(N³), some methods to reduce it, including approximation methods, and their efficiencies are currently under investigation [30, 31]. In this formulation, λ and γ are hyperparameters, which should be specified prior to the analysis. While various methods have been proposed to determine the hyperparameters, we adopt the cross-validation method for determining λ and γ, which are chosen so as to minimize the prediction error. In the cross validation, the data set D, that is, {x_n, E(x_n)}_{n=1,...,N}, is randomly divided into S data subsets. Each data subset is denoted by D_s, labeled by s = 1, ..., S. One of the S data subsets is regarded as the testing data, while the remaining S − 1 subsets, G_s = D \ D_s, serve as the training data; Gaussian process training is performed with the training data {x_n, E(x_n)}_{n∈G_s}. The mean-square error between the testing data E(x_n) and the estimated μ(x_n) for n ∈ D_s is evaluated. The cross validation regards this mean-square error as the prediction error when the testing data D_s are treated as unknown data. The optimal values of λ and γ are chosen to minimize the prediction error averaged over the S data subsets.
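As an illustration (not the authors' implementation), the GP mean and standard deviation of Eqs. (5)–(11) can be written in a few lines of NumPy; the hyperparameters lam and gamma are assumed to be given, e.g., from the cross validation described above.

```python
import numpy as np

def gauss_kernel(xi, xj, gamma):
    """Gauss kernel of Eq. (11)."""
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def gp_posterior(x, X, E, lam, gamma):
    """Mean mu(x) and standard deviation delta(x) of Eqs. (5)-(6).

    x : query point, shape (K,)
    X : training inputs x_n, shape (N, K)
    E : training energies E(x_n), shape (N,)
    Unoptimized sketch: the O(N^3) solve is redone for every query.
    """
    N = X.shape[0]
    K_mat = np.array([[gauss_kernel(X[i], X[j], gamma) for j in range(N)]
                      for i in range(N)])                               # Eq. (9)
    k_vec = np.array([gauss_kernel(X[i], x, gamma) for i in range(N)])  # Eq. (8)
    # Solve (K + lam*I) A = [E, k] once for both the mean and the variance.
    A = np.linalg.solve(K_mat + lam * np.eye(N), np.column_stack([E, k_vec]))
    mu = k_vec @ A[:, 0]                                                # Eq. (5)
    var = gauss_kernel(x, x, gamma) + lam - k_vec @ A[:, 1]             # Eqs. (6), (10)
    return mu, np.sqrt(max(var, 0.0))
```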
Bayesian optimization for computationally extensive probability distributions

We introduce a Bayesian optimization technique to find a better minimizer of the energy function E(x) defined by Eq. (4) when the number of sampling points is limited. Our Bayesian optimization comprises the following procedure:

Step 1: Sets of model parameters x_n are randomly generated with n = 1, ..., P, and E(x_n) is calculated for each generated x_n. That is, P calculations of y^cal(x_n) from H(x_n) are necessary.

Step 2: A Gaussian process is trained on the data set {x_n, E(x_n)}_{n=1,...,P}, yielding the mean μ(x) and the standard deviation δ(x) of P(E(x) | x).

Step 3: The steepest descent method with randomly chosen initial parameters is performed for the three types of acquisition functions [32–35] defined as

\[ f_{\mathrm{LCB}}(\mathbf{x}) = \mu(\mathbf{x}) - \kappa\, \delta(\mathbf{x}), \tag{12} \]
\[ f_{\mathrm{GP\mbox{-}LCB}}(\mathbf{x}) = \mu(\mathbf{x}) - \kappa_t\, \delta(\mathbf{x}), \quad \kappa_t = \sqrt{2 \log\left( |X|\, t^2 \pi^2 / 6\epsilon \right)}, \tag{13} \]
\[ f_{\mathrm{EI}}(\mathbf{x}) = -\delta(\mathbf{x}) \left[ Z\, \Phi(Z) + \phi(Z) \right], \quad Z = \left[ E_{\min} - \mu(\mathbf{x}) \right] / \delta(\mathbf{x}), \tag{14} \]

where κ > 0 and 0 < ε ≤ 1 are hyperparameters, |X| is the size of the search space, and t is the step of repetition of BO. Furthermore, φ(Z) and Φ(Z) are the standard normal probability density function and its cumulative distribution function, respectively, and E_min is the present minimum value of E(x). Then, a local or global minimum x* of the acquisition function is obtained, and Q different model parameters are generated by repeating this operation. Note that the fixed value ε = 0.5 is used in the analysis of this paper for simplicity.

Step 4: E(x*) is calculated for each x* obtained in Step 3. By adding the new data, the data set is updated as {x_n, E(x_n)}_{n=1,...,P+Q}. Here, Q calculations of y^cal(x_n) from H(x_n) are necessary.

Step 5: Steps 2–4 are repeated R times. In each iteration, the number of data points is increased by Q evaluations.

Step 6: Finally, the minimum value of E(x) among {x_n, E(x_n)}_{n=1,...,P+Q×R} is determined.

We emphasize that the number of calculations of y^cal(x_n) from H(x_n) is N_s = P + Q × R in this procedure, which corresponds to the number of sampling points on E(x). The computational cost of Step 3 is low because μ(x) and δ(x) are quickly obtained for a given x. Thus, many candidates for a local or global minimum of E(x) are generated from the acquisition functions without calculating E(x), which is the key of our Bayesian optimization; a minimal sketch of the whole loop is given below. Notice that an alternative approach has been proposed for optimizing a continuous function with an easily calculable statistical function defined only on discrete grid points, in contrast to our method [36].
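Putting Steps 1–6 together, here is a hedged sketch of the whole loop with the LCB acquisition of Eq. (12). It reuses the hypothetical gp_posterior helper sketched above and a user-supplied energy_fn for the expensive E(x); the box bounds, the fixed values of lam and gamma, and the crude coordinate-wise descent are illustrative assumptions rather than the authors' exact setup.

```python
import numpy as np

def bo_minimize(energy_fn, bounds, P=200, Q=10, R=30,
                kappa=20.0, lam=1e-4, gamma=1.0, rng=None):
    """Steps 1-6 with f_LCB(x) = mu(x) - kappa * delta(x) (Eq. (12))."""
    rng = rng or np.random.default_rng()
    low, high = map(np.asarray, bounds)
    K = low.size

    # Step 1: random initial design of P points.
    X = rng.uniform(low, high, size=(P, K))
    E = np.array([energy_fn(x) for x in X])

    for t in range(R):                       # Step 5: repeat R times
        # Step 2: GP "training" here is conditioning on (X, E); in the
        # paper lam and gamma are tuned by cross validation instead.
        def f_lcb(x):                        # Eq. (12)
            mu, delta = gp_posterior(x, X, E, lam, gamma)
            return mu - kappa * delta

        # Step 3: coordinate-wise finite-difference descent on the cheap
        # acquisition, restarted Q times from random points (cf. Eqs. (22)-(23)).
        cands = []
        for _ in range(Q):
            x = rng.uniform(low, high)
            for _ in range(100):
                k = rng.integers(K)
                dx = np.zeros(K); dx[k] = 0.01
                grad_k = (f_lcb(x + dx) - f_lcb(x)) / 0.01
                x[k] = np.clip(x[k] - 0.01 * grad_k, low[k], high[k])
            cands.append(x)

        # Step 4: evaluate the expensive E(x) only at the Q candidates.
        X = np.vstack([X] + cands)
        E = np.append(E, [energy_fn(x) for x in cands])

    # Step 6: best sample among the N_s = P + Q*R evaluations.
    best = int(np.argmin(E))
    return X[best], E[best]
```

The point of the construction is visible in the loop body: the acquisition f_lcb is evaluated thousands of times per round, but the expensive energy_fn only Q times.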
Results

Application for posterior distribution based on a classical Ising model

We demonstrate an application for the posterior distribution in effective physical model estimation based on a classical Ising model in two dimensions. The model Hamiltonian of the classical Ising model under a magnetic field H is defined by

\[ \mathcal{H}_{\mathrm C}(\mathbf{x}) = -\sum_{i,j} J_{ij}\, \sigma^z_i \sigma^z_j - H \sum_i \sigma^z_i, \quad (\sigma^z_i = \pm 1), \tag{15} \]

where J_ij is the exchange interaction between the i-th spin and the j-th spin. Here, we consider three types of exchange interactions on the square lattice shown in Fig 1(a). In this case, three model parameters are to be estimated, that is, x = (x_1, x_2, x_3) = (J_1, J_2, J_3).

Fig 1. (a) Lattice and types of exchange interactions considered in the classical Ising model defined by Eq. (15). (b) Inputted magnetization curve {m^ex(H_l)}_{l=1,...,L} with L = 200, where (x_1, x_2, x_3) = (−0.…, −0.…, 0.3) are used for a temperature T = 3.0.
{m^ex(H_l)}_{l=1,...,L} is used as the input data, generated from the same model of Eq. (15) as follows. By performing mean-field calculations for the four-sublattice model, the magnetic-field dependence of the magnetization is calculated with (x_1, x_2, x_3) = (−0.…, −0.…, 0.3) for a temperature T = 3.0.
Here, the Boltzmann constant is set to unity and the physical energy unit is set to |J_1|. Gaussian noise with a mean of zero and a standard deviation of 0.004 is added to the obtained magnetization curve. Fig 1(b) shows the inputted magnetization curve {m^ex(H_l)}_{l=1,...,L}, where the number of data points is L = 200.

To estimate the effective model from {m^ex(H_l)}_{l=1,...,L}, we search for the maximizer of the posterior distribution, which is defined as

\[ P(\mathbf{x} \mid \{m^{\mathrm{ex}}(H_l)\}_{l=1,...,L}) \propto \exp\left[ -E_{\mathrm C}(\mathbf{x}) \right], \tag{16} \]
\[ E_{\mathrm C}(\mathbf{x}) = \frac{1}{2\sigma^2} \sum_{l=1}^{L} \left( m^{\mathrm{ex}}(H_l) - m^{\mathrm{cal}}(H_l, \mathbf{x}) \right)^2 - \log P(\mathbf{x}), \tag{17} \]

where {m^cal(H_l, x)}_{l=1,...,L} is the set of magnetization curves calculated from H_C(x). In this demonstration, the mean-field calculations for the four-sublattice model are used as the inner-loop calculation method to obtain {m^cal(H_l, x)}_{l=1,...,L}. Furthermore, instead of treating the posterior distribution itself, the minimizer of E_C(x) is searched for. For simplicity, the prior distribution of model parameters P(x) is assumed to be a uniform distribution, that is, P(x) = 1, which corresponds to least-squares fitting; the factor 1/(2σ²) is then set to a constant without loss of generality.

The minimum values of E_C(x) obtained by the random search method, the steepest descent method, the Monte Carlo method, and our Bayesian optimization are compared, depending on the number of sampling points N_s on E_C(x). The details of each method are denoted below.

Random search method: A set of model parameters x_n = (x_1, x_2, x_3) is randomly generated from the region −5 ≤ x_1, x_2, x_3 ≤ 5.
Then E_C(x_n) is calculated. This procedure is repeated N_s times, and the data set {x_n, E_C(x_n)}_{n=1,...,N_s} is obtained, from which the minimum value of E_C(x) is searched.

Steepest descent method: An initial set of model parameters x_1 = (x_1, x_2, x_3) is randomly generated from the region −5 ≤ x_1, x_2, x_3 ≤ 5.
A set of model parameters is updated N_s/2 times using the following equation from x_n = (x_1, ..., x_k, ..., x_K) to x_{n+1} = (x_1, ..., x'_k, ..., x_K):

\[ x'_k = x_k - \alpha \frac{\Delta E}{\Delta x}, \tag{18} \]
\[ \Delta E = E_{\mathrm C}(x_1, ..., x_k + \Delta x, ..., x_K) - E_{\mathrm C}(x_1, ..., x_k, ..., x_K). \tag{19} \]

Here, k is randomly chosen from k ∈ {1, ..., K}, where K = 3 in this case, and Δx = α = 0.01.
Notice that the calculation of E_C(x) must be repeated twice in each update; thus, when the number of updates is N_s/2, the number of sampling points on E_C(x) becomes N_s. With this update of the model parameters, E_C(x) decreases at each update. From the obtained {x_n, E_C(x_n)}_{n=1,...,N_s/2}, the minimum value of E_C(x) is searched.

Monte Carlo method: An initial set of model parameters x_1 = (x_1, x_2, x_3) is randomly generated from the region −5 ≤ x_1, x_2, x_3 ≤ 5.
A set of model parameters is updated N_s times using the following Metropolis-type transition probability from x_n to x_{n+1}:

\[ w(\mathbf{x}_{n+1} \mid \mathbf{x}_n) = \min\left\{ 1, \exp\left[ -\Delta E(\mathbf{x}_{n+1}, \mathbf{x}_n) \right] \right\}, \tag{20} \]
\[ \Delta E(\mathbf{x}_{n+1}, \mathbf{x}_n) = E_{\mathrm C}(\mathbf{x}_{n+1}) - E_{\mathrm C}(\mathbf{x}_n). \tag{21} \]

Here, the set of model parameters after updating is prepared as x_{n+1} = (x_1, ..., x'_k, ..., x_K) with x'_k = x_k + r from the set of model parameters before updating, x_n = (x_1, ..., x_k, ..., x_K), where k is randomly chosen from k ∈ {1, ..., K} and r is a random number drawn uniformly from an interval symmetric about zero. From the obtained {x_n, E_C(x_n)}_{n=1,...,N_s}, the minimum value of E_C(x) is searched. (A minimal sketch of this update and of the steepest descent update appears below.)

Bayesian optimization: A set of model parameters x_n is randomly generated from the region −5 ≤ x_1, x_2, x_3 ≤ 5, and E_C(x_n) is calculated. This procedure is repeated P = 200 times to form the initial data set, and the Bayesian optimization is performed with Q = 10 and R = (N_s − P)/Q. In this method, the steepest descent method in Step 3 is implemented using the following equation from x = (x_1, ..., x_k, ..., x_K) to x' = (x_1, ..., x'_k, ..., x_K):

\[ x'_k = x_k - \alpha \frac{\Delta f}{\Delta x}, \tag{22} \]
\[ \Delta f = f(x_1, ..., x_k + \Delta x, ..., x_K) - f(x_1, ..., x_k, ..., x_K), \tag{23} \]

where f(x) expresses the acquisition functions defined by Eqs. (12), (13), and (14), obtained from the Gaussian process. Here, k is randomly chosen from k ∈ {1, ..., K}, and Δx = α = 0.01. In our calculation, the steepest descent method is performed with 100 updates to obtain the extreme value of f(x). From the obtained {x_n, E_C(x_n)}_{n=1,...,N_s}, the minimum value of E_C(x) is searched.

Fig 2(a) shows the sampling-number (N_s) dependence of the averaged minimum value E_av of E_C(x) over 100 independent runs for each method. The error bars are calculated from the standard deviation. The Bayesian optimization yields the smallest E_av, indicating that the Bayesian optimization gives better minimizers of E_C(x) even when N_s is small. Furthermore, the most successful analysis is given by the Bayesian optimization using f_LCB(x) with κ = 20, while the steepest descent method and the Monte Carlo method produce worse results than the random search method. These methods are frequently trapped at a local minimum, depending on the initial set of model parameters, and eventually E_av stays at large values.
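For concreteness, here is a minimal sketch of the two update rules used above: the single-coordinate finite-difference descent of Eqs. (18)–(19) (the same rule as Eqs. (22)–(23) when f is an acquisition function) and the Metropolis update of Eqs. (20)–(21). The proposal half-width step is an assumed value, not one taken from the paper.

```python
import numpy as np

def sd_update(x, f, dx=0.01, alpha=0.01, rng=None):
    """One steepest-descent update, Eqs. (18)-(19): pick a random
    coordinate k and move it downhill along a finite-difference slope."""
    rng = rng or np.random.default_rng()
    k = rng.integers(x.size)
    e = np.zeros(x.size); e[k] = dx
    x_new = x.copy()
    x_new[k] -= alpha * (f(x + e) - f(x)) / dx   # two evaluations of f per update
    return x_new

def mc_update(x, E_x, energy_fn, step=0.1, rng=None):
    """One Metropolis update, Eqs. (20)-(21): perturb one random coordinate
    and accept with probability min{1, exp(-Delta E)}."""
    rng = rng or np.random.default_rng()
    k = rng.integers(x.size)
    x_new = x.copy()
    x_new[k] += rng.uniform(-step, step)
    E_new = energy_fn(x_new)
    # Downhill moves are always accepted; uphill moves with Boltzmann weight.
    if E_new <= E_x or rng.random() < np.exp(E_x - E_new):
        return x_new, E_new   # accepted
    return x, E_x             # rejected
```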
Fig 2. Results of the average E_av of the minimum values of E_C(x) obtained from 100 independent runs in the effective model estimation of the classical Ising model. (a) E_av as a function of N_s, the number of sampling points on E_C(x), obtained from the random search method (red circles), the steepest descent method (yellow circles), the Monte Carlo method (green circles), and the Bayesian optimization (blue circles). (b) E_av as a function of N_s obtained from the random search method (RS) (red circles), the Bayesian optimization using f_LCB(x) with κ = 20 (BO) (blue circles), the random search method with the steepest descent method (RS+SD) (red diamonds), and the Bayesian optimization with the steepest descent method (BO+SD) (blue diamonds). Dashed lines connect the initial E_av (circle points) by RS or BO alone and the E_av obtained (diamond points) by performing the steepest descent method with 50 updates after RS or BO.

Fig 3(a) shows the distribution of the estimated model parameters for 100 independent runs with various N_s by the random search method and the Bayesian optimization. The black lines indicate the exact solutions, by which the input magnetization curve without Gaussian noise is generated, except for the case where one of the parameters x_k is zero. As N_s increases, the results of the Bayesian optimization converge on the black lines, implying that the model parameters can be correctly estimated with a high probability. On the other hand, the random search method shows no significant improvement with increasing N_s. This can be understood by noticing that the accuracy of the acquisition functions by Gaussian processes in the Bayesian optimization improves as the number of sampling points N_s increases, whereas the random search method does not refer to the prior sampling points.
Fig 3. Results of the estimated model parameters in the effective model estimation based on the classical Ising model. (a) Distribution of the estimated model parameters from 100 independent runs depending on N_s by the random search method (RS) (red circles) and the Bayesian optimization using f_LCB(x) with κ = 20 (BO) (blue circles). The black lines indicate the exact solutions when the input magnetization curve without Gaussian noise is obtained. (b) Distribution of the estimated model parameters by the random search method with the steepest descent method (RS+SD) (red diamonds) and the Bayesian optimization with the steepest descent method (BO+SD) (blue diamonds). In these cases, starting from the results shown in (a) by RS and BO, the steepest descent method is further performed with 50 updates.

The Bayesian optimization, like the random search method, in general does not take into account the local structure of the energy function, such as its gradient in the parameter space. To improve the solutions, we consider combinations of the steepest descent method with the random search method or the Bayesian optimization. One may expect that the steepest descent method finds a local or global minimum around the model parameters estimated by the random search method or the Bayesian optimization. That is, the model parameters estimated by the random search method or the Bayesian optimization are used as the initial set of model parameters in the steepest descent method, which is performed with 50 updates; a minimal sketch of this combination is given below. Fig 2(b) compares the E_av values of the random search method, the Bayesian optimization using f_LCB(x) with κ = 20, and these methods combined with the steepest descent method. A drastic improvement is confirmed even for only 50 updates of the steepest descent method. Note that if the number of updates in the steepest descent method is increased, the obtained E_av should improve further. However, since the number of samplings also increases, the trade-off between the search for initial sets by the random search method or the Bayesian optimization and the evaluation of local structure by the steepest descent method should be optimized. We confirmed for some cases that the Bayesian optimization with the steepest descent method is the best among the considered methods.

Fig 3(b) shows the distribution of the estimated model parameters. For the Bayesian optimization with the steepest descent method, the real minimizer of E_C(x) is found in all independent runs, while some of the results obtained by the random search method with the steepest descent method differ from the exact solutions; these runs are trapped in local minima. The steepest descent method significantly improves the estimates by both the Bayesian optimization and the random search method. These results imply that the Bayesian optimization combined with the steepest descent method is a powerful tool to find the global minimum of E_C(x).
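As a usage note, the combined BO+SD procedure amounts to a few lines on top of the hypothetical helpers sketched earlier (bo_minimize, sd_update, and an energy_fn implementing E_C); the numbers mirror the setup above.

```python
# BO proposes a near-optimal starting point with a small evaluation budget...
x0, _ = bo_minimize(energy_fn, bounds=(low, high), P=200, Q=10, R=30, kappa=20.0)

# ...and 50 steepest-descent updates on E_C itself refine it locally.
x = x0
for _ in range(50):
    x = sd_update(x, energy_fn)   # Eqs. (18)-(19), two E_C calls per update
E_best = energy_fn(x)
```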
Application for posterior distribution based on a quantum Heisenberg model

Next, we consider a case where the number of model parameters is larger than in the previous case, using a quantum Heisenberg model on a one-dimensional chain (Fig 4(a)). The model Hamiltonian of the quantum Heisenberg model under a magnetic field H is defined by

\[ \mathcal{H}_{\mathrm Q}(\mathbf{x}) = -\sum_{i,j} J_{ij} \left[ \hat{\sigma}^x_i \hat{\sigma}^x_j + \hat{\sigma}^y_i \hat{\sigma}^y_j + \Delta\, \hat{\sigma}^z_i \hat{\sigma}^z_j \right] - H \sum_i \hat{\sigma}^z_i, \tag{24} \]

where Δ is the anisotropy parameter and (σ̂^x_i, σ̂^y_i, σ̂^z_i) are the Pauli matrices. Here, the model parameters are x = (x_1, x_2, x_3, x_4, x_5) = (J_1, J_2, J_3, Δ, H). Fig 4(a) depicts the three types of exchange interactions.

Fig 4. (a) Lattice and types of exchange interactions considered in the quantum Heisenberg model defined by Eq. (24). (b) Inputted specific heat {C^ex(T_l)}_{l=1,...,L} with L = 200 and (x_1, x_2, x_3, x_4, x_5) = (1.…, 0.…, −0.…, −0.…, 0.3).

{C^ex(T_l)}_{l=1,...,L} is generated from the model defined by Eq. (24) as follows. By performing the exact diagonalization method, the temperature dependence of the thermal average of the specific heat for (x_1, x_2, x_3, x_4, x_5) = (1.…, 0.…, −0.…, −0.…, 0.3) is calculated.
Gaussian noise with a mean of zero and a standard deviation of 0.004 is added to the obtained specific heat. Fig 4(b) shows the temperature dependence of the specific heat with L = 200, which is used as the input in the effective model estimation. As in the previous case, our task is to search for the minimizer of the energy function E_Q(x) defined as

\[ E_{\mathrm Q}(\mathbf{x}) = \sum_{l=1}^{L} \left( C^{\mathrm{ex}}(T_l) - C^{\mathrm{cal}}(T_l, \mathbf{x}) \right)^2, \tag{25} \]

where {C^cal(T_l, x)}_{l=1,...,L} is the set of specific heats calculated from H_Q(x) by performing the exact diagonalization method.
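As an illustration of the inner loop in this demonstration, the specific heat follows from the full spectrum via C(T) = (⟨E²⟩ − ⟨E⟩²)/T² (with k_B = 1). A minimal sketch, assuming a chain small enough that the dense Hamiltonian matrix fits in memory:

```python
import numpy as np

def specific_heat(H_mat, temps):
    """Specific heat C(T) = (<E^2> - <E>^2) / T^2 from exact diagonalization.

    H_mat : dense Hermitian Hamiltonian matrix of a small chain
    temps : array of temperatures T_l (k_B = 1)
    """
    eps = np.linalg.eigvalsh(H_mat)   # full spectrum
    eps = eps - eps.min()             # shift for numerical stability (C unchanged)
    C = []
    for T in temps:
        w = np.exp(-eps / T)          # Boltzmann weights
        Z = w.sum()
        e1 = (eps * w).sum() / Z      # <E>
        e2 = (eps ** 2 * w).sum() / Z # <E^2>
        C.append((e2 - e1 ** 2) / T ** 2)
    return np.array(C)
```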
We compared E_av, the average of the minimum value of E_Q(x) over 100 independent runs, for the random search method, the steepest descent method, the Monte Carlo method, and the Bayesian optimization (Fig 5(a)). The setups of these methods are the same as in the previous case, except for the number of model parameters (K = 5) and the region in which a set of model parameters is randomly generated. In this case, we use −2 ≤ x_1, x_2, x_3 ≤ 2 and −2 ≤ x_4, x_5 ≤ 2. The results are qualitatively the same as in the previous case. The most successful analysis is produced by the Bayesian optimization using f_EI(x). This result differs from the previous demonstration, which means that the appropriate acquisition function depends on the target physical model and the input physical quantities. Furthermore, as shown in Fig 5(b), the combined steepest descent method again improves the estimates of the Bayesian optimization and the random search method. As in the previous case, the Bayesian optimization with the steepest descent method gives a better minimizer of E_Q(x). Consequently, we conclude that the Bayesian optimization is useful to find a better maximizer of the posterior distribution in an effective model estimation with a small number of sampling points.
Fig 5. Results of the average E_av of the minimum values of E_Q(x) obtained from 100 independent runs in the effective model estimation of the quantum Heisenberg model. (a) E_av as a function of N_s obtained from the random search method (red circles), the steepest descent method (yellow circles), the Monte Carlo method (green circles), and the Bayesian optimization (blue circles). (b) E_av as a function of N_s obtained from the random search method (RS) (red circles), the Bayesian optimization using f_EI(x) (BO) (blue circles), the random search method with the steepest descent method (RS+SD) (red diamonds), and the Bayesian optimization with the steepest descent method (BO+SD) (blue diamonds). In the steepest descent method, 50 updates are performed after RS or BO.
Discussion

We searched for a better maximizer of the posterior distribution in effective physical model estimation, which is a computationally extensive probability distribution, using Bayesian optimization. For at least two simple models, the Bayesian optimization is found to be more efficient in finding a better maximizer of the posterior distribution than the random search method, the steepest descent method, and the Monte Carlo method when the number of sampling points on the posterior distribution is fixed to be small, although the acquisition function providing the highest efficiency still depends on the problem to be solved. Our Bayesian optimization has some hyperparameters, i.e.,
P, Q, and R. Although we did not optimize these hyperparameters, the Bayesian optimization is a better method to obtain the maximizer of the posterior distribution. In particular, since the value of Q is related to the batch/parallel problem of Bayesian optimization [37, 38], some improvement of the performance is expected by tuning Q. Furthermore, a combination of the Bayesian optimization and the steepest descent method drastically increases the efficiency of finding a better maximizer of the posterior distribution. The key of our Bayesian optimization is to predict a set of model parameters near a local or global maximum of the posterior distribution from the extreme values of acquisition functions by Gaussian processes, which requires a relatively low computational cost. Consequently, the model parameters near a global maximum can be found with a high probability. These facts suggest that the Bayesian optimization will be a powerful tool for effective model estimations. However, to find a maximizer of posterior distributions with various types of prior distributions and a large number of model parameters, the Bayesian optimization may not always be useful. In the future, we will therefore evaluate effective model estimations using the Bayesian optimization for actual materials. Because the maximizer of a probability distribution is searched for in many scientific fields, the Bayesian optimization will play an important role in the promotion of science.

Acknowledgments
We thank Shu Tanaka for useful discussions. R. T. was partially supported by the Nippon Sheet Glass Foundation for Materials Science and Engineering. K. H. was partially supported by Grants-in-Aid for Scientific Research from JSPS, Japan (Grant Nos. 25120010 and 25610102). The computations in the present work were performed on the Numerical Materials Simulator at NIMS and the supercomputer at the Supercomputer Center, Institute for Solid State Physics, The University of Tokyo. This work was done as part of the "Materials Research by Information Integration" Initiative of the Support Program for Starting Up Innovation Hub, Japan Science and Technology Agency.
References
1. Mockus J. Bayesian approach to global optimization: Theory and applications. Springer; 1989.
2. Jones DR, Schonlau M, Welch WJ. Efficient global optimization of expensive black-box functions. J Global Optim 1998; 13: 455-492.
3. Pelikan M, Goldberg DE, Cantú-Paz E. BOA: the Bayesian optimization algorithm. GECCO'99 Proceedings of the 1st Annual Conference on Genetic and Evolutionary Computation 1999; 525-532.
4. Snoek J, Larochelle H, Adams RP. Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems 25; 2012.
5. Ueno T, Rhone TD, Hou Z, Mizoguchi T, Tsuda K. COMBO: An efficient Bayesian optimization library for materials science. Materials Discovery 2016; 4: 18-21.
6. Seko A, Maekawa T, Tsuda K, Tanaka I. Machine learning with systematic density-functional theory calculations: Application to melting temperatures of single- and binary-component solids. Phys Rev B 2014; 89: 054303-1-9.
7. Toyoura K, Hirano D, Seko A, Shiga M, Kuwabara A, Karasuyama M, et al. Machine-learning-based selective sampling procedure for identifying the low-energy region in a potential energy surface: A case study on proton conduction in oxides. Phys Rev B 2016; 93: 054112-1-11.
8. Kiyohara S, Oda H, Tsuda K, Mizoguchi T. Acceleration of stable interface structure searching using a kriging approach. Jpn J Appl Phys 2016; 55: 045502-1-4.
9. Balachandran PV, Xue D, Theiler J, Hogden J, Lookman T. Adaptive strategies for materials design using uncertainties. Sci Rep 2016; 6: 19660.
10. Ju S, Shiga T, Feng L, Hou Z, Tsuda K, Shiomi J. Designing nanostructures for phonon transport via Bayesian optimization. Phys Rev X 2017; 7: 021024-1-10.
11. Packwood DM, Hitosugi T. Rapid prediction of molecule arrangements on metal surfaces via Bayesian optimization. Appl Phys Express 2017; 10: 065502-1-4.
12. Seko A, Hayashi H, Nakayama K, Takahashi A, Tanaka I. Representation of compounds for machine-learning prediction of physical properties. Phys Rev B 2017; 95: 144110-1-11.
13. Kennedy MC, O'Hagan A. Bayesian calibration of computer models. J Roy Stat Soc B 2001; 63: 425-464.
14. Higdon D, Kennedy M, Cavendish JC, Cafeo JA, Ryne RD. Combining field data and computer simulations for calibration and prediction. SIAM J Sci Comput 2004; 26: 448-466.
15. Liu F, Bayarri MJ, Berger JO. Modularization in Bayesian analysis, with emphasis on analysis of computer models. Bayesian Anal 2009; 4: 119-150.
16. Tamura R, Hukushima K. Method for estimating spin-spin interactions from magnetization curves. Phys Rev B 2017; 95: 064407-1-8.
17. Sandvik AW, Kurkijärvi J. Quantum Monte Carlo simulation method for spin systems. Phys Rev B 1991; 43: 5950-5961.
18. Wang F, Landau DP. Efficient, multiple-range random walk algorithm to calculate the density of states. Phys Rev Lett 2001; 86: 2050-2053.
19. Kawashima N, Harada K. Recent developments of world-line Monte Carlo methods. J Phys Soc Jpn 2004; 73: 1379-1414.
20. Suwa H, Todo S. Markov chain Monte Carlo method without detailed balance. Phys Rev Lett 2010; 105: 120603-1-4.
21. Landau DP, Binder K. A guide to Monte Carlo simulations in statistical physics. Cambridge University Press; 2014.
22. Lin HQ. Exact diagonalization of quantum-spin models. Phys Rev B 1990; 42: 6561-6567.
23. Jaklič J, Prelovšek P. Lanczos method for the calculation of finite-temperature quantities in correlated systems. Phys Rev B 1994; 49: 5065-5068.
24. Yamaji Y, Nomura Y, Kurita M, Arita R, Imada M. First-principles study of the honeycomb-lattice iridates Na2IrO3