New Heuristics for Parallel and Scalable Bayesian Optimization
Ran Rubin
Center for Theoretical Neuroscience, Columbia University, New York, NY, USA
Abstract
Bayesian optimization has emerged as a strong candidate tool for global optimization of functions with expensive evaluation costs. However, due to the dynamic nature of research in Bayesian approaches, and the evolution of computing technology, using Bayesian optimization in a parallel computing environment remains a challenge for the non-expert. In this report, I review the state-of-the-art in parallel and scalable Bayesian optimization methods. In addition, I propose practical ways to avoid a few of the pitfalls of Bayesian optimization, such as oversampling of edge parameters and over-exploitation of high-performance parameters. Finally, I provide relatively simple, heuristic algorithms, along with their open source software implementations, that can be immediately and easily deployed in any computing environment.
Choosing a strategy for searching for the global minimum or maximum of an unknown function is a non-trivial task. A major consideration is the balance between the function's evaluation cost and available resources. Different strategies are optimal for different cases. For example, the function values may be the results of experiments that take days or weeks to complete, and only a few can be run concurrently. In this scenario, we would like our search strategy to be as efficient as possible, i.e., maximize the function as quickly as possible by choosing the parameters of the next function evaluation given the previous results. In addition, we would probably be willing to invest considerable computation resources to produce these parameters. On the other hand, when the durations of function evaluations are only a few seconds, new parameters must be chosen quickly, which restricts the type of computations that can be performed.

Here, we consider an intermediate case encountered in many computational fields of science. In this case, the dimensionality of the function is moderately high (D on the order of tens of parameters) and function evaluations are moderately expensive (T ∼ O(10) hours). On the other hand, evaluating the function usually involves some sort of computer simulation, and current cluster computers allow us to run many function evaluations in parallel. In addition, we assume that the optimization is done given a particular amount of computational resources, for example the number of CPUs we can use and the amount of time we are willing to use them. An example that fits this scenario, from the field of machine learning, is the training of neural networks, where the parameters to optimize are usually the various learning rates of the learning algorithm, the number of neurons and connections, etc.

What would be the requirements for a good search strategy in this scenario?
First, in high dimensions exhaustive searches are unreasonable, thus the search must be efficient and find local extrema of the function as quickly as possible. Second, our strategy must allow for as many parallel evaluations of the function as our resources allow. In most scenarios, the duration of function evaluations is variable. Thus, it is also important that our search strategy is able to operate in an asynchronous way, proposing new parameters as soon as each function evaluation ends. Finally, as our computer resources continually increase, we expect to be able to evaluate a large number of parameter sets in a reasonable amount of time. Thus, our search strategy must be scalable and able to handle such large datasets.

Several fields of study have approached this problem and many different proposals have been made. However, no single accepted method has been chosen by the scientific community thus far. In addition, the software implementations of most of the proposed algorithms are non-trivial and many of them lack well designed, easy to use software packages. Therefore, many practitioners opt to simply perform manual optimization, by evaluating the function on a grid, identifying regions of interest where performance is good, and evaluating on a finer grid in these regions. This simple approach allows for unlimited parallel evaluations and is trivially scalable (since almost no computation is needed to choose parameter sets to test) but is very inefficient and labor intensive.

Many heuristic optimization algorithms, termed metaheuristics, such as Simulated Annealing, Genetic Algorithms, Swarm Algorithms, etc., have been proposed as viable solutions for global optimization [1, 2]. These approaches are usually designed for parallel evaluation of the function and are usually scalable.
However, the efficiency of their search is unclear and comparison among them is a very challenging task.

The field of non-smooth optimization has seen major advances with the introduction of Generalized Pattern Search (GPS) and Mesh Adaptive Direct Search (MADS) algorithms [3, 4]. These are fairly general frameworks for optimization that can easily combine different search approaches, are well founded in theory, come with convergence guarantees under mild assumptions, and can be fairly easily parallelized and scaled. However, combining them with efficient search approaches suitable for parallel environments remains an active field of research [5] with no currently available tools for the non-expert.

Finally, recent years have also seen a surge of research in the field of Bayesian Optimization (BO), that is, using Bayesian inference to guide the search strategy during function optimization (for a review see [6]). With a proper selection of prior distributions this approach yields a very efficient search strategy. However, Bayesian inference itself is computationally intensive, which is prohibitive in many situations. The research on how to use Bayesian inference in a parallel environment and how to scale the inference itself to large datasets is intense and ongoing, and mature algorithms and tools are not yet available.

It seems that no single approach satisfies all the requirements we set for our search strategy. In this report, I would like to examine more closely the Bayesian approach for optimization, and review the possible ways in which it can be parallelized and scaled. In addition, I will present new, relatively simple, heuristic algorithms for parallel and scalable Bayesian optimization. These algorithms combine several recently proposed approaches and some new ideas of my own.
Importantly, I also provide an open source software implementation which is designed to be modular and modifiable, and represents an immediately available, working solution for moderate scale, parallel Bayesian optimization. Hopefully, this code can serve as a foundation upon which other systematic, large-scale approaches may be built.

In the following, I will briefly review the state-of-the-art in Bayesian optimization in section 2, present a few modifications that may be advantageous in section 3, and give the details of my proposed algorithms in section 4.

We are interested in finding a suitable set of parameters, x ∈ R^D, that maximize some unknown function f(x). We assume our function evaluation values, y(x), are distributed according to

y(x) ∼ N(f(x), σ²),   (1)

with constant, unknown standard deviation σ. The goal of our optimization is to find x⋆, such that

x⋆ = arg max_x f(x).   (2)

In BO we select points to evaluate y(x) by performing Bayesian inference based on past observations: we start by adopting a probabilistic model of f(x), which we denote as f̂(x), that models our knowledge of f(x) as a distribution over functions. In particular, we choose a Gaussian Process (GP) [7] denoted as

f̂(x) ∼ GP(µ(x), K(x, x′)),   (3)

where µ(x) is a mean function and K(x, x′) is some positive definite covariance kernel. This notation implies that a set of points {x_i}_{i=1}^n induces a multivariate normal distribution on the vector f̂ ∈ R^n (with f̂_i = f̂(x_i)):

f̂ ∼ N(µ, Σ)   (4)

with means µ_i = µ(x_i) and covariance matrix Σ_ij = K(x_i, x_j).

To perform Bayesian inference we assume some priors on the model's parameters. First, we assume that f̂(x) has a GP prior with constant mean function µ(x) = µ̄ and a covariance kernel K_θ(x, x′) with a set of hyper-parameters θ:

f̂(x) ∼ GP(µ̄, K_θ(x, x′)).
(5)

Further, we specify priors for the noise magnitude, GP prior mean, and the kernel's hyper-parameters:

σ ∼ P(σ),  µ̄ ∼ P(µ̄),  θ ∼ P(θ).   (6)

The choice of kernel K_θ(x, x′) is an important step, as different kernels confer different properties on the inferred functions and may be suitable in different situations [7]. The kernel hyper-parameters, θ, usually represent the typical amplitude and length scales of our inferred functions.

Now, given a set of n observation pairs,

O = {x_i, y_i}_{i=1}^n,   (7)

we use Bayes' rule to infer the posterior distribution of the parameters:

P(f̂(x), σ, µ̄, θ | O) ∝ P(O | f̂(x), σ) P(σ) P(f̂(x) | µ̄, θ) P(µ̄) P(θ).   (8)

The full posterior distribution lets us make predictions on f(x) and also provides us with an estimate of the level of certainty of our predictions. One attractive feature of a GP is that, given the parameters σ, µ̄ and θ, the posterior distribution of functions, f̂(x), given O is also a GP, and it is possible to calculate analytically its posterior mean function and covariance kernel, simplifying calculations involving the full posterior distribution (for details see [7]). (Note that the posterior distribution will not, in general, preserve the properties of the prior distribution, and therefore the constant mean assumption is not restrictive.)

2.1 Scalability of Bayesian Inference

Due to the required inversion of the observations' covariance matrix, full Bayesian inference for GPs scales as O(n³) with the number of observations, n. To be useful in higher dimensions, when large numbers of observations are needed to explore the parameter space, we must use an approximation method that scales as O(n). The leading solution in the literature is variational inference (VI) [8], which for GPs amounts to using the kernel method to approximate the posterior mean and covariance functions with a fixed number of kernel elements.
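To make the cubic cost concrete, here is a minimal illustrative sketch of exact GP posterior prediction (my own code, not the miniBOP implementation; the squared-exponential kernel and all parameter values are illustrative assumptions). The Cholesky factorization of the n × n covariance matrix is the O(n³) step that motivates the scalable approximations discussed here:

```python
import numpy as np

def rbf_kernel(A, B, amp=1.0, ls=0.5):
    """Squared-exponential kernel; an illustrative stand-in for K_theta."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return amp * np.exp(-0.5 * d2 / ls ** 2)

def gp_posterior(X, y, X_star, mu_bar=0.0, sigma=0.1):
    """Exact GP posterior mean and variance at test points X_star."""
    K = rbf_kernel(X, X) + sigma ** 2 * np.eye(len(X))
    L = np.linalg.cholesky(K)                      # the O(n^3) bottleneck
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - mu_bar))
    K_s = rbf_kernel(X, X_star)
    mean = mu_bar + K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = rbf_kernel(X_star, X_star).diagonal() - (v ** 2).sum(axis=0)
    return mean, var
```

With small observation noise the posterior mean interpolates the data, and the predictive variance reverts to the prior amplitude far from the observations.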
A recent proposal [9] describes a VI algorithm that indeed scales as O(n). The improved scaling comes at the cost of a somewhat reduced expressibility and an overestimation of the predicted variance. An alternative approach to scaling Bayesian inference is to reduce the number of observations used in the inference by considering only local Bayesian inference around promising loci in the parameter space [5] or by performing random sub-sampling of the observations [10].

Given a set of observations, O, a BO search strategy proposes a new point x_new according to some criteria. This is usually done by choosing the point that maximizes some function over the search domain, termed the acquisition function. Although many acquisition functions have been proposed (see [6] and references therein), a common one is the expected improvement (EI), which we define here as

EI(x) = E_{P(f̂(x), σ, µ̄, θ | O)} [ f̂(x) − max_i f̂(x_i) ]_+,   (9)

where the expectation is taken over the posterior distribution given the observations, and [z]_+ = z for z > 0 and [z]_+ = 0 otherwise. We then choose

x^EI_new = arg max_x EI(x).   (10)

We note that the EI can be high either because the actual value of the posterior mean, µ_O(x), is high or because the uncertainty (posterior variance, σ²_O(x)) at point x is high. Thus, this measure implements a kind of exploration-exploitation balance.

For a GP the EI can be relatively easily calculated following a few simple approximations: First, practitioners remove the sample dependence of the max operation by replacing f̂(x_i) with the posterior mean, µ_O(x_i), or with the observations themselves, y_i.
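With this replacement, and for a fixed sample of σ, µ̄ and θ, the EI of Eq. (9) at a point with Gaussian posterior marginal N(µ, s²) takes the standard closed form; a minimal illustrative sketch (my own code, not a library API):

```python
import math

def expected_improvement(mu, s, best):
    """Closed-form EI of N(mu, s^2) over the incumbent value `best`:
       EI = (mu - best) * Phi(z) + s * phi(z),  z = (mu - best) / s."""
    if s <= 0.0:
        return max(mu - best, 0.0)
    z = (mu - best) / s
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # normal PDF
    return (mu - best) * Phi + s * phi
```

Note that EI is positive even when µ(x) is below the incumbent, provided the uncertainty s(x) is large, which is exactly the exploration-exploitation balance described above.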
Then, the EI can be calculated analytically conditioned on σ, µ̄ and θ, and practitioners usually approximate the expectation over these parameters by using a periodically updated point estimate or by averaging over a set of samples from their posterior obtained with Markov Chain Monte Carlo (MCMC) methods (which usually significantly increases the computational cost of the inference). Finally, we note that maximizing the EI (or any other acquisition function) is a non-trivial task in itself, and may be computationally expensive and require some form of approximation.

Since Bayesian inference is computationally intensive, even a relatively short inference time (for example, a few seconds) may create a computational bottleneck if a large number of inferences is required. This may leave some of our computer resources idle while waiting for the new points to evaluate. As proposed in [11], one way to make sure that the inference remains scalable is to perform the Bayesian inference itself in parallel. However, a couple of difficulties arise in this case:

First, computing several points for evaluation in parallel, given the same or very similar sets of known measurements, will typically result in proposing the same, or very similar, new points, which will hinder the exploration of the function's parameter space. One way to avoid this problem is to introduce variability in the proposed new points. Recently, the use of Thompson Sampling [12, 13, 14] has been proposed [11, 15] as a tool for generating this variability in a way that is consistent with the posterior distribution. Given observations, O, and a single sample of f̂(x), σ, µ̄, and θ from the posterior distribution, Thompson sampling proposes a new point for evaluation according to

x^Thompson_new = arg max_x f̂(x).
(11)

Simply put, Thompson Sampling can be regarded as a probability matching search strategy, setting the probability of choosing a point x to the posterior probability that this point is indeed the maximum of f(x). When several inferences run in parallel, each will sample from the posterior probability of x⋆ independently, and may therefore propose different points for evaluation. We note that, although conceptually simple, sampling f̂(x) from the posterior and maximizing over it is an additional computational step that may not be trivial to implement in a scalable way.

Second, performing inference in an asynchronous parallel environment implies that in addition to the set of observations, there is an additional, possibly significant, set of running experiments/simulations for which the parameter values are known but the results are still pending. Snoek et al. [16] propose to take this second set into account during the inference by 'fantasizing' the results of the evaluations and performing the inference with the fantasized values.

It is simple to incorporate fantasies within the framework of Thompson Sampling. Given observations, O, and pending points P = {x^Pending_i}, one first samples fantasized observation values, ỹ_i ∼ N(f̂(x^Pending_i), σ²), where f̂ and σ are sampled from the posterior distribution given the observations. Then, another f̂(x) can be sampled from the posterior distribution given the union O ∪ {x^Pending_i, ỹ_i}. While this procedure will not, on average, change the predicted mean of the function at the pending points, it will reduce the uncertainty around points that are pending and make them less likely to be selected again. In addition, as noted in [11], sampling ỹ_i from the posterior distribution does not, on average, change the posterior of the GP hyper-parameters, so there is no need to sample them again conditioned on ỹ_i.
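A sketch of one such Thompson-sampling proposal step with fantasized pending points, over a finite candidate set for simplicity (the algorithms below instead maximize a continuous posterior sample; the kernel, noise level and all values here are illustrative assumptions of mine, not the miniBOP code):

```python
import numpy as np

def rbf(A, B, ls=0.2):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def gp_post(X, y, Xq, sigma):
    """Posterior mean/covariance of f at query points Xq given data (X, y)."""
    K = rbf(X, X) + sigma ** 2 * np.eye(len(X))
    Ks = rbf(X, Xq)
    sol = np.linalg.solve(K, Ks)
    return sol.T @ y, rbf(Xq, Xq) - Ks.T @ sol

def thompson_with_fantasies(X_obs, y_obs, X_pend, X_cand, sigma, rng):
    # 1. fantasize outcomes for the pending points from the posterior
    m_p, C_p = gp_post(X_obs, y_obs, X_pend, sigma)
    y_fant = rng.multivariate_normal(m_p, C_p + sigma ** 2 * np.eye(len(X_pend)))
    # 2. draw one posterior sample of f given observations + fantasies
    X_aug = np.vstack([X_obs, X_pend])
    y_aug = np.concatenate([y_obs, y_fant])
    m_c, C_c = gp_post(X_aug, y_aug, X_cand, sigma)
    f_sample = rng.multivariate_normal(m_c, C_c + 1e-10 * np.eye(len(X_cand)))
    # 3. propose the maximizer of the sampled function
    return X_cand[np.argmax(f_sample)]
```

Because each parallel inference draws its own fantasies and its own posterior sample, the proposed points naturally differ across workers.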
Until now I have reviewed the state-of-the-art of parallel and scalable Bayesian optimization. Here I propose three modifications and improvements that may increase the efficiency of Bayesian optimization. I found these to be helpful in my experiments but did not conduct a systematic quantification of their effect. Such a quantification is not a trivial task and is beyond the scope of this technical report. Nonetheless, I present them here as possible, fruitful avenues of research.
Thompson Sampling has been shown to bound the mean regret when the goal of optimization is to maximize the cumulative value or minimize the accumulated regret over time [17, 18, 19]. As such, it tends to lean more on exploitation once a 'good' set of parameters is found. Indeed, it has been claimed that Thompson sampling asymptotically converges to the global maximizer in exponential time [20]. However, in the context of parameter optimization, there is really no need to evaluate the same point again once we are confident enough about the value of f(x) at this point.

Here I propose the concept of Sample Improvement (SI). We start with the simple idea to approximate/replace the expected improvement by a single sample of the improvement, and select the point x that maximizes this sample improvement function. Given observations, O, and a single sample of f̂(x), σ, µ̄, and θ from the posterior distribution, we define the Sample Improvement of point x as simply

SI(x) = [ f̂(x) − max_i f̂(x_i) ]_+.   (12)

As an approximation, the SI will introduce the variability that we can exploit for parallel inference. We therefore choose the new point for evaluation according to

x^SI_new = arg max_x SI(x).   (13)

Note that the expected improvement is always positive, so the arg max operation is a well-defined function. However, in case no point in the sample improves on the best result so far, we are left with SI(x) = 0 for all x and the arg max is not defined. The advantage of using SI over Thompson sampling (Eq. 11) is that we can interpret a sample function for which SI(x) ≤ ε for all x, for some a priori ε > 0, as a sign of possible convergence of the inference to a local optimum. In this case, it might be appropriate to use a more exploratory strategy for the selection of points for evaluation.
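Once a posterior draw is available, the sample-improvement rule of Eq. (12), including the ε-based convergence check, is a few lines; an illustrative sketch (my own code, with an arbitrary default ε):

```python
import numpy as np

def propose_by_sample_improvement(X_cand, f_draw, incumbent, eps=1e-6):
    """SI_j = [f_draw_j - incumbent]_+ over a set of candidate points.
    Returns the maximizing candidate, or None when SI <= eps everywhere,
    i.e. the signal to switch to a more exploratory step."""
    si = np.maximum(f_draw - incumbent, 0.0)
    if si.max() <= eps:
        return None
    return X_cand[int(np.argmax(si))]
```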
One possibility is choosing the point that locally maximizes the estimated variance or an upper confidence bound of f(x) (this will not generate variability in the selected point, which may be problematic if many inferences are run in parallel). Another is choosing random points around the current estimated maximum to encourage exploration in its vicinity.

Alternating between search strategies to improve the exploration-exploitation balance based on the current state of the search is important for parameter optimization. Indeed, several meta-strategies have been proposed to select between the sampling proposals of several strategies [21, 22]. However, finding a strategy that will be suitable at every stage of the optimization and that will always prevent oversampling of similar points remains a challenge. Here I propose a simple strategy, termed Variance Control, to ensure that no point is ever oversampled, by explicitly controlling the estimated variance of sampled points. We consider a situation in which we wish to evaluate f(x) at any given point only up to some level of accuracy. Bayesian inference can aid in setting this level: a reasonable limit might be a fraction of the estimated noise level in our observations (σ). We would like our algorithm to avoid sampling points where the estimated variance is below our required accuracy. This can be achieved by excluding these points from our search domain.

Given a set of observations O = {x_i, y_i} and pending points P = {x^Pending_i}, and a sample of σ, µ̄, and θ, we can define our search domain

D_{O,P} = { x | σ_{O,P}(x) > ρσ },   (14)

where σ_{O,P}(x) is the posterior estimated standard deviation given σ, µ̄, and θ, and ρ is our desired level of accuracy as a fraction of the estimated noise. In this setting, we can view parameter optimization not as an attempt to find the maximum of f(x) over the entire parameter space but rather only over D_{O,P}.
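Restricting the search to D_{O,P} of Eq. (14) amounts to a simple mask over candidate points; an illustrative sketch (my own code; the value of ρ is an arbitrary assumption, not the implementation's default):

```python
import numpy as np

def variance_control_filter(X_cand, post_sd, sigma, rho=0.5):
    """Keep only candidates whose posterior standard deviation exceeds
    rho * sigma, i.e. candidates still inside the search domain D_{O,P}."""
    mask = np.asarray(post_sd) > rho * sigma
    return X_cand[mask]
```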
Therefore, we choose our next point to evaluate according to our current search strategy only if the proposed point is in D_{O,P}. In addition, we also compute improvement measures based only on observations that are in D_{O,P}. We will describe two different approaches for implementing variance control in section 4.3.

3.3 Boundary Avoidance
In many texts, examples of Bayesian inference are presented for a function of one or two parameters. While this is instructive, one has to remember that our intuitive understanding of low-dimensional spaces does not always work well for higher dimensions. A famous example is the volume of a D-dimensional sphere: unlike 2-dim. circles or 3-dim. balls, when D ≫ 1 most of the volume of a sphere is concentrated near its surface.

In the context of Bayesian optimization, we are usually concerned with finding the best parameters within some range for each parameter. The range is usually picked based on our prior belief about the reasonable values of the parameters. In fact, we usually suspect that the optimal value is not close to the edges of our range. However, when choosing points to sample using Bayesian inference, we are very likely to choose points on the edges because, unless already sampled, at these points the model is extrapolating, and we expect to have large expected variance and, possibly, a monotonically increasing mean function. For functions of one or two parameters, evaluating the function at the edges of our range will only take a few samples and will quickly reduce the estimated variance there. However, for higher dimensional spaces, sampling the edges well will require many function evaluations in locations that we do not believe are optimal.

Taking the Bayesian approach, we would like to express our knowledge of the parameter range as a prior distribution on the location of the maximum of f(x). However, since we only directly model our knowledge of f(x) and not of the location of its maximum, it is not clear how to incorporate this prior within the GP framework we described. First steps at systematically addressing the edge-oversampling problem were taken in [23]. Our practical solution is to simply reject points for evaluation if they are on the edges of our parameter range.
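Assuming the parameters have been rescaled to a unit hyper-box (as is done in the implementation described later), the rejection test is a one-liner; a sketch with an illustrative margin value of my choosing:

```python
import numpy as np

def on_edge(x, margin=1e-3):
    """True if any coordinate of x lies within `margin` of the edge of
    the unit hyper-box [0, 1]^D; such proposals are rejected."""
    x = np.asarray(x)
    return bool(np.any(x <= margin) or np.any(x >= 1.0 - margin))
```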
Implementing Bayesian inference and optimization involves many choices and approximations that may affect the results. However, it is very hard to estimate systematically the effect of these choices, and the implementer must rely on choices previously made by others as well as common sense and intuition. In this section, I would like to inform the reader and users of these choices and approximations so that proper review and improvement can be made. In addition, I have tried to design my algorithmic implementation to be as modular as possible, so that various parts can be easily replaced in case some choice or approximation is not to the user's liking or is unsuitable for the user's circumstances.

As an immediately available working solution, I provide two simple algorithms that perform parallel Bayesian inference, using the above proposed heuristics to generate variability in the selection of points, avoid boundaries, alternate between global and local search strategies, and avoid oversampling of points. An open source software implementation of these algorithms, along with further implementation details, can be found in the GitHub repository: https://github.com/ranr01/miniBOP.
This algorithm assumes that each computational node can be used to either evaluate the function at a single set of parameters, x, or choose a new point for evaluation using Bayesian inference. Given m computational nodes, a simple algorithm to perform the optimization can be defined as follows:

1. Submit a batch of m function evaluations on a quasi-random grid.
2. Wait for a job to finish.
   (a) If the finished job is a function evaluation, process the results and submit a Bayesian inference job with the current information on pending and completed simulations.
   (b) If the finished job is a Bayesian inference job, submit a function evaluation with the suggested parameters.
3. Repeat step 2 until the maximal amount of resources is used.

For a more detailed discussion of different resource management strategies I refer the reader to [24, 25].

Sampling a function f̂(x) from the posterior GP is done by sampling a multivariate Gaussian with the appropriate mean and covariance according to the sampled points. This can be done in batch by computing the Cholesky decomposition of the full covariance matrix of the sampled points. For sequential sampling of points (as is usually required for maximization) the Cholesky decomposition can be performed iteratively in an efficient manner. Note that this procedure does not scale well with the number of points sampled from f̂(x). Finding an efficient, scalable way of sampling from a high-dimensional multivariate Gaussian is in fact an open question under scientific investigation (see for example [26, 27]). To avoid sampling a large number of points from a single sampled function, during our inference we do not try to find a global maximum of f̂(x). Instead, we find a local maximum by running the Nelder-Mead (Simplex) algorithm starting from a random initial point. We specify an absolute tolerance for x as a tunable parameter.

The heart of Bayesian optimization is the algorithm used to choose new points for evaluation.
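The scheduling loop of steps 1–3 above can be sketched as follows, with a thread pool standing in for the cluster and a placeholder `propose` function standing in for the Bayesian-inference job (both are illustrative assumptions of mine, not the repository's API):

```python
import random
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def optimize(evaluate, propose, initial_grid, m, budget):
    """Seed with a batch of m evaluations, then keep every node busy by
    turning finished evaluations into inference jobs and finished
    inference jobs into new evaluations (steps 1-3 above)."""
    done = []
    with ThreadPoolExecutor(max_workers=m) as pool:
        running = {pool.submit(evaluate, x): ('eval', x) for x in initial_grid[:m]}
        while running and len(done) < budget:
            finished, _ = wait(running, return_when=FIRST_COMPLETED)
            for fut in finished:
                kind, x = running.pop(fut)
                if kind == 'eval':
                    done.append((x, fut.result()))
                    pending = [xi for k, xi in running.values() if k == 'eval']
                    running[pool.submit(propose, list(done), pending)] = ('infer', None)
                else:  # an inference job produced a new point to evaluate
                    x_new = fut.result()
                    running[pool.submit(evaluate, x_new)] = ('eval', x_new)
    return max(done, key=lambda pair: pair[1])
```

A random-search `propose` makes the sketch self-contained; the real inference job would instead run one of the Bayesian point-selection algorithms described next.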
Here we present two heuristic algorithms that incorporate the principles discussed above.
This algorithm alternates between approximately maximizing the Sample Improvement and taking random steps from the current global maximum. The name 'poll steps' is borrowed from the GPS and MADS algorithms to describe the random exploration around the current estimated maximum. Given a set of observations O = {x_i, y_i} and pending points P = {x^Pending_i}, we choose the next point in the following way:

1. Sample the hyper-parameters (σ, µ̄, and θ) from the posterior given the observation set O, using MCMC.
2. Sample the results of the pending points by sampling ỹ_i ∼ N(f̂(x^Pending_i), σ²), where f̂(x) is sampled from the posterior GP given the observation set O.
3. Sample a set of candidate points, C_Bayes = {x^cand_j, f̂_j}_{j=1}^{n_cand}, f̂_j = f̂(x^cand_j), by finding a local maximum of a sample function f̂(x), sampled from the posterior GP given the union O ∪ {x^Pending_i, ỹ_i}. To avoid sampling a large number of points from a single sample function we sample a new f̂(x) for each x^cand_j.
4. Exclude candidate points in C_Bayes that have low predicted variance (i.e., x^cand_j ∉ D_{O,P}) or that are on the edges of the parameter range.
5. Calculate the improvement of the remaining points, where improvement is defined as

I_j = [ f̂_j − max_{x_i ∈ O,P} µ_{O,P}(x_i) ]_+.   (15)

6. Exclude candidate points in C_Bayes for which the improvement is zero (or smaller than some value ε).
7. If C_Bayes is not empty, return the point with the maximal improvement and exit.
8. Poll step:
   (a) Find the best parameters out of the sampled and pending points.
   (b) Choose a set of candidate points, C_poll = {x^poll_j}_{j=1}^{n_poll}, by choosing random points around the best point.
The random step size is chosen to be proportional to the estimated length scales of the covariance kernel.
   (c) Exclude points with low predicted variance.
   (d) If C_poll is not empty, return the point with the maximal predicted variance and exit.
9. Default step: return a random point and exit.

Here, we approximate the requirement of sampling points only inside D_{O,P} by adding a soft barrier function to our sampled function f̂(x). We define

g(x) = f̂(x) − b(σ_{O,P}(x)),   (16)

where σ_{O,P}(x) is the estimated standard deviation of f(x) at the point x and b(s) is a function which is very large for s < ρσ and negligible for s > ρσ. As a concrete example, we choose a power-law function,

b(s) = (ρσ/s)^z,   (17)

with z = 10. Here, z controls how sharp our barrier function is around the boundary σ_{O,P}(x) = ρσ. This parameter can control the rate of exploration around the boundary, as higher values allow sampling of points closer to the boundary.

The resulting algorithm remains the same as the BOP algorithm except for the two following modifications:

• In step 3 we replace the optimization of f̂(x) with the optimization of g(x).
• In step 5 improvement is defined as

I_j = [ g_j − max_{x_i ∈ O,P} ( µ_{O,P}(x_i) − b(σ_{O,P}(x_i)) ) ]_+.   (18)

We note that using the barrier function renders the estimated-variance check in step 4 somewhat redundant, so in practice it can be dropped.

4.4 Other implementation details

We consider a scenario in which optimization is performed in a hyper-box specified by a minimal and maximal value for each function variable. For the purpose of Bayesian inference, we follow the common practice of rescaling all the variables to lie inside the unit hyper-cube, i.e. inside the [0, 1] range.
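The rescaling and its inverse are simple affine maps; a minimal sketch (my own illustrative helpers, not the repository's functions):

```python
import numpy as np

def to_unit(x, lo, hi):
    """Map a point in the hyper-box [lo, hi] to the unit hyper-cube."""
    return (np.asarray(x, dtype=float) - lo) / (np.asarray(hi, dtype=float) - lo)

def from_unit(u, lo, hi):
    """Map a point in the unit hyper-cube back to the original hyper-box."""
    lo = np.asarray(lo, dtype=float)
    hi = np.asarray(hi, dtype=float)
    return lo + np.asarray(u, dtype=float) * (hi - lo)
```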
Following [16] we use the Matern 5/2 kernel with an individual length scale for each dimension (sometimes termed Automatic Relevance Determination). Thus the kernel has D + 1 hyper-parameters, {θ_k}_{k=0}^D, and given x_i and x_j we have

K_θ(x_i, x_j) = θ_0 ( 1 + √5 r + (5/3) r² ) exp( −√5 r ),   (19)

with

r² = Σ_{k=1}^D ( (x_i^k − x_j^k) / θ_k )².   (20)

We sample the hyper-parameters (σ, µ̄, and θ) from the posterior given the observations by performing T_MCMC
MCMC steps using Slice Sampling, in the manner of [16].
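For reference, the ARD Matern 5/2 kernel of Eqs. (19)–(20) can be written as follows (illustrative code of my own; θ_0 is the amplitude and θ_1, …, θ_D the per-dimension length scales):

```python
import numpy as np

def matern52_ard(Xi, Xj, theta0, lengthscales):
    """K(x_i, x_j) = theta0 * (1 + sqrt(5) r + (5/3) r^2) * exp(-sqrt(5) r),
    with r^2 = sum_k ((x_i^k - x_j^k) / theta_k)^2."""
    ls = np.asarray(lengthscales, dtype=float)
    d = (Xi[:, None, :] - Xj[None, :, :]) / ls
    r = np.sqrt((d ** 2).sum(-1))
    s5r = np.sqrt(5.0) * r
    return theta0 * (1.0 + s5r + (5.0 / 3.0) * r ** 2) * np.exp(-s5r)
```

Sample paths of a Matern 5/2 GP are less smooth than those of a squared-exponential GP, a property that [16] argue is more realistic for typical machine-learning objective functions.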
We specify prior distributions for µ̄, σ, and θ. For the mean of the GP we take a flat prior,

µ̄ ∼ U(µ̄_min, µ̄_max),   (21)

where U(a, b) is a uniform distribution between a and b. We set µ̄_min (µ̄_max) to be the minimum (maximum) over all observed function values.

Following [16], for the observation noise variance we take a probability distribution

P_noise(σ²) ∝ log[ 1 + (v_noise/σ)² ],   (22)

and for the GP amplitude, θ_0, we take a log-normal distribution,

P_amp(θ_0) ∝ (1/θ_0) exp( −log²(θ_0) / A ).   (23)

For each length-scale parameter we use an inverse Gamma distribution:

P_length(θ_k) ∝ θ_k^{−(α_length+1)} exp( −λ_length / θ_k ).   (24)

See the source code for the chosen default values of v_noise, A, α_length, and λ_length.

Discussion
We set out to find a search strategy suitable for global optimization of expensive functions in a massively-parallel computing environment. As such, we needed a strategy that would be efficient, parallel, and scalable. Unfortunately, most of the available open source software for optimization was not explicitly designed for this parallel scenario (see [6] for a list of some of the available software). One proprietary software solution for massive Bayesian optimization [28] exists. However, its opaque methods and algorithms make it hard to evaluate its performance.

In my opinion, the combination of quick Bayesian search methods (such as variational inference) with parallel versions of mathematically grounded grid search methods such as MADS is a promising way forward to produce robust and scalable methods of Bayesian Optimization for large-scale global optimization. First steps in this direction have been taken with the recently proposed BADS algorithm [5].

The heuristic algorithms I presented here perform efficient Bayesian search in a parallel environment. The scalability of these algorithms is limited only by the scalability of the Bayesian inference and sampling. In the current software implementation, the classic full Bayesian inference method is used, scaling as O(n³) with the number of observed points. This practically limits the application of the algorithms to moderate-size problems, where only a few thousand observed points are needed. Once a scalable implementation of Bayesian inference and sampling is available, both algorithms presented here can be scaled as well. As mentioned above, a systematic quantification of the performance of these algorithms and a thorough comparison to other approaches is still lacking. Hopefully, they can serve as a useful tool until such a study can be performed.

Acknowledgments
I thank Larry Abbott, James Murray, Fabio Stefanini, and Ari Pakman for helpful discussions and comments. This work was supported by the Center for Theoretical Neuroscience's NeuroNex award and by the charitable contribution of the Gatsby Foundation.
References

[1] Xin-She Yang. Nature-Inspired Metaheuristic Algorithms: Second Edition. Luniver Press, 2010.
[2] Jianyong Sun, Jonathan M. Garibaldi, and Charlie Hodgman. Parameter Estimation Using Metaheuristics in Systems Biology: A Comprehensive Review. IEEE/ACM Trans. Comput. Biol. Bioinformatics, 9(1):185–202, January 2012.
[3] V. Torczon. On the Convergence of Pattern Search Algorithms. SIAM Journal on Optimization, 7(1):1–25, February 1997.
[4] Charles Audet and John E. Dennis Jr. Mesh adaptive direct search algorithms for constrained optimization. SIAM Journal on Optimization, 17(1):188–217, 2006.
[5] Luigi Acerbi and Wei Ji Ma. Practical Bayesian Optimization for Model Fitting with Bayesian Adaptive Direct Search. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1836–1846. Curran Associates, Inc., 2017.
[6] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas. Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proceedings of the IEEE, 104(1):148–175, January 2016.
[7] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, Cambridge, Mass., November 2005.
[8] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational Inference: A Review for Statisticians. Journal of the American Statistical Association, 112(518):859–877, April 2017.
[9] Ching-An Cheng and Byron Boots. Variational Inference for Gaussian Process Models with Linear Complexity. arXiv:1711.10127 [cs, stat], November 2017.
[10] Ari Pakman, Dar Gilboa, David Carlson, and Liam Paninski. Stochastic Bouncy Particle Sampler. arXiv:1609.00770 [stat], September 2016.
[11] José Miguel Hernández-Lobato, James Requeima, Edward O. Pyzer-Knapp, and Alán Aspuru-Guzik. Parallel and Distributed Thompson Sampling for Large-scale Accelerated Exploration of Chemical Space. arXiv:1706.01825 [stat], June 2017.
[12] William R. Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika, 25(3-4):285–294, December 1933.
[13] William R. Thompson. On the Theory of Apportionment. American Journal of Mathematics, 57(2):450–456, 1935.
[14] Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A Tutorial on Thompson Sampling. arXiv:1707.02038 [cs], July 2017.
[15] Kirthevasan Kandasamy, Akshay Krishnamurthy, Jeff Schneider, and Barnabas Poczos. Parallelised Bayesian Optimisation via Thompson Sampling. In International Conference on Artificial Intelligence and Statistics, pages 133–142, March 2018.
[16] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian Optimization of Machine Learning Algorithms. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2951–2959. Curran Associates, Inc., 2012.
[17] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135, 2013.
[18] Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In International Conference on Algorithmic Learning Theory, pages 199–213. Springer, 2012.
[19] Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, pages 99–107, 2013.
[20] Kinjal Basu and Souvik Ghosh. Analysis of Thompson Sampling for Gaussian Process Optimization in the Bandit Setting. arXiv:1705.06808 [stat], May 2017.
[21] Eric Brochu, Matthew W. Hoffman, and Nando de Freitas. Portfolio Allocation for Bayesian Optimization. arXiv:1009.5419 [cs], September 2010.
[22] Bobak Shahriari, Ziyu Wang, Matthew W. Hoffman, Alexandre Bouchard-Côté, and Nando de Freitas. An Entropy Search Portfolio for Bayesian Optimization. arXiv:1406.4625 [cs, stat], June 2014.
[23] Eero Siivola, Aki Vehtari, Jarno Vanhatalo, and Javier González. Correcting boundary over-exploration deficiencies in Bayesian optimization with virtual derivative sign observations. arXiv:1704.00963 [stat], April 2017.
[24] Donald R. Jones. A Taxonomy of Global Optimization Methods Based on Response Surfaces. Journal of Global Optimization, 21(4):345–383, December 2001.
[25] Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Parallel Algorithm Configuration. In Learning and Intelligent Optimization, Lecture Notes in Computer Science, pages 55–70. Springer, Berlin, Heidelberg, 2012.
[26] Daniel P. Simpson, Ian W. Turner, Christopher M. Strickland, and Anthony N. Pettitt. Scalable iterative methods for sampling from massive Gaussian random vectors. arXiv:1312.1476 [math, stat], December 2013.
[27] Sivaram Ambikasaran, Daniel Foreman-Mackey, Leslie Greengard, David W. Hogg, and Michael O'Neil. Fast Direct Methods for Gaussian Processes. arXiv:1403.6015 [astro-ph, stat], March 2014.
[28] Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and D. Sculley. Google Vizier: A Service for Black-Box Optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '17, pages 1487–1495, New York, NY, USA, 2017. ACM.
Appendix: Tunable parameters of the BOP and FuBar VC algorithms
Parameter       Description

n_cand          Number of local maxima to find in each Bayesian sampling step.
n_poll          Number of random poll-point candidates to find in each poll step.
l_poll          Size of the poll step relative to the estimated length scales.
\rho            Minimal fraction of the estimated noise level used for variance control of the predicted standard deviation.
SEM_min         Minimal absolute level used for variance control of the predicted standard deviation.
z               Exponent of the power-law barrier function in the FuBar VC algorithm.
x_a.tol.        Absolute tolerance level for maximization of sample functions.
T_MCMC          Number of MCMC steps to take for each GP hyper-parameter sample.
v_noise         Scale of the prior distribution of the observation noise (\sigma^2).
A               Scale of the prior distribution of the GP amplitude (\theta_0).
\alpha_length   Shape parameter of the prior distribution of the length scales (\theta_k, k = 1, 2, ..., D).
\lambda_length  Scale of the prior distribution of the length scales (\theta_k, k = 1, 2, ..., D).