Efficient and Scalable Batch Bayesian Optimization Using K-Means
Matthew Groves
IBM Research UK, Sci-Tech Daresbury, Warrington, UK.
Edward O. Pyzer-Knapp ∗
IBM Research UK, Sci-Tech Daresbury, Warrington, UK. [email protected]
∗ Corresponding author
Abstract
We present K-Means Batch Bayesian Optimization (KMBBO), a novel batch sampling algorithm for Bayesian Optimization (BO). KMBBO uses unsupervised learning to efficiently estimate peaks of the model acquisition function. We show in empirical experiments that our method outperforms the current state-of-the-art batch allocation algorithms on a variety of test problems, including tuning of algorithm hyper-parameters and a challenging drug discovery problem. In order to accommodate the real-world problem of high-dimensional data, we propose a modification to KMBBO by combining it with compressed sensing to project the optimization into a lower-dimensional subspace. We demonstrate empirically that this 2-step method outperforms algorithms where no dimensionality reduction has taken place.
Introduction
Bayesian optimization (BO) is a popular framework for the optimization of black-box functions, where the analytic form of the function being optimized is unknown, or too expensive to evaluate. BO has found extensive use for the optimization of machine learning algorithms (Snoek, Larochelle, and Adams 2012), and for experimental design of complex systems.

In its native form, BO is a sequential optimization procedure, since new information is required to update the posterior, and therefore the acquisition function. For many of the emerging uses of BO, this is a severe limitation since, due to the size of the optimization problem, data must be acquired in a highly parallel manner in order for the optimization to be completed in a relevant time frame. Several methods for parallelizing the BO process have been proposed, and are reviewed in the Related Work section. It is important to point out that there are two separate, yet complementary, approaches to the parallel BO problem. One is to minimize the strict number of function evaluations, typically achieved by a dynamically allocated batch size, and the other is to minimize the number of epochs for a given batch size. Whilst there are situations in which both are valid, the focus of this paper is to minimize the number of epochs, since there are a number of situations in which a fixed batch size is required; for example, in the screening of potential pharmaceutical compounds, in which there are a pre-determined number of 'slots' in which compounds can be tested.
Related Work
Ginsbourger, Le Riche, and Carraro generalize the EI to the batch setting, proposing the qEI acquisition function for batches of q points (Ginsbourger, Le Riche, and Carraro 2007). Unfortunately, identifying the points that jointly maximize qEI is difficult, as the computational cost of evaluating the function and its derivative scales poorly with increasing q. Several works have suggested heuristic approaches for approximating qEI (see, for example, (Snoek, Larochelle, and Adams 2012), (Chevalier and Ginsbourger 2013), and (Wang et al. 2016)). One popular qEI-based method is the Constant Liar (CL) approach of Ginsbourger, Le Riche, and Carraro (2010). CL is a sequential batch building method, based on iteratively adding the point that maximizes the single point acquisition function, assuming that evaluating this point will return a particular constant 'lie' value, temporarily augmenting the model training set with this synthetic value and refitting the GP.

In recent work, González et al. propose an alternative batching method by approximating the repulsive effect when batching (González et al. 2016). Under a Gaussian Process prior, target values of nearby points in sample space are expected to be highly correlated. Thus, when choosing a batch of samples, we may wish for the batch members to be sufficiently far apart to maximize the information gained. To do this, the authors propose the Local Penalization (LP) method, which sequentially assembles batches of samples by successively penalizing the acquisition function around points previously selected, using a penalization radius based on the estimated Lipschitz constant of the acquisition function surface.

The above methods all take a greedy sequential approach to batch building, iteratively adding points to the batch that maximize a particular criterion, like the locally-penalized acquisition function. In contrast, Hernández-Lobato et al.
propose a fully parallel batch sampling technique using Thompson Sampling (Thompson 1933), in which the posterior is sampled to generate a 'panel of experts', which are then polled in parallel as to which data point should be acquired (Hernández-Lobato et al. 2017). Shah and Ghahramani suggest Parallel Predictive Entropy Search (PPES), a non-greedy batch sampling approach aiming to maximize the expected information gain from sampling the chosen batch, in terms of the expected reduction in differential entropy of the predictive distribution of the global maximizer given the sampling data (Shah and Ghahramani 2015).

Nguyen et al. propose a novel batch selection method called Budgeted Batch Bayesian Optimization (B3O), which aims to build sample batches containing peaks of the acquisition function (Nguyen et al. 2017). To find these peaks, whilst avoiding costly optimization routines, the authors propose a generalized slice sampling procedure. Slice sampling preferentially accepts samples from high density regions of the acquisition function surface, allowing peaks to be reliably estimated even with modest numbers of samples. Peak picking is then done using an Infinite Gaussian Mixture Model (IGMM) (Rasmussen 2000). Nguyen et al. show empirically that B3O performs well on a variety of test functions and common BO applications, such as hyperparameter tuning. However, the inability of B3O to allow fixed batch sizes is a potential limitation, as real-world applications for batch BO can have an effective constraint on possible batch size, for example the number of available compute nodes (simulation), the number of different molecules that a robotic assay can test simultaneously (drug discovery), or the quantity of samples that can fit in a furnace (alloy hardening). Under-utilizing the available resources with smaller batch sizes costs information that could be gained at little additional cost, whereas choosing to allocate too many samples to a batch may be impossible.
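To make the greedy batch-building pattern shared by these methods concrete, a minimal numpy sketch of the Constant Liar loop might look as follows. This is an illustrative assumption, not any published implementation: it uses a hand-rolled 1-D GP posterior with a squared-exponential kernel, expected improvement as the single-point acquisition function, the mean of the observed targets as the 'lie', and grid-based maximization with arbitrary hyper-parameter values.

```python
import math
import numpy as np

def gp_posterior(Xtr, ytr, Xte, ls=1.0, sf=1.0, noise=1e-6):
    """Exact GP posterior mean/std with a squared-exponential kernel (1-D inputs)."""
    k = lambda a, b: sf**2 * np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)
    K = k(Xtr, Xtr) + noise * np.eye(len(Xtr))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, ytr))
    Ks = k(Xtr, Xte)
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = sf**2 - np.sum(v**2, axis=0)          # k(x,x) = sf^2 for the SE kernel
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best):
    """EI for maximization: (mu - f*) Phi(gamma) + sigma phi(gamma)."""
    gamma = (mu - best) / sigma
    Phi = 0.5 * (1.0 + np.array([math.erf(g / math.sqrt(2)) for g in gamma]))
    phi = np.exp(-0.5 * gamma**2) / math.sqrt(2 * math.pi)
    return (mu - best) * Phi + sigma * phi

def constant_liar_batch(Xtr, ytr, grid, q):
    """Greedily build a batch of q points, refitting with a constant 'lie' each time."""
    Xtr, ytr = list(Xtr), list(ytr)
    lie = float(np.mean(ytr))                    # the CL(mean) variant
    batch = []
    for _ in range(q):
        mu, sd = gp_posterior(np.array(Xtr), np.array(ytr), grid)
        x = float(grid[np.argmax(expected_improvement(mu, sd, max(ytr)))])
        batch.append(x)
        Xtr.append(x)
        ytr.append(lie)                          # temporary synthetic observation
    return batch
```

Because each appended lie collapses the posterior variance around the chosen point, subsequent EI maximization naturally moves the next batch member elsewhere — which is exactly the repulsive effect LP achieves by explicit penalization.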
Proposed Method
In the BO formalism, the target function is not directly optimized. In its place, an acquisition function is constructed using a probabilistic model based upon previously determined values for the function f. Typically this model is a Gaussian process (GP) (Rasmussen and Williams 2004), although other models, including neural networks, have been used (Snoek et al. 2015).

There are many different versions of the acquisition function, depending upon the type of optimization task which is being performed, but the most commonly used is expected improvement, EI (Mockus 1974), which is determined as follows:

EI(x) = (µ(x) − f*) Φ(γ) + σ(x) φ(γ)   (1)

where Φ denotes the CDF (cumulative distribution function) of the standard normal distribution, φ denotes the PDF (probability density function) of the standard normal distribution, and γ denotes the improvement, which can be expressed as:

γ(x) = (µ(x) − f*) / σ(x)   (2)

where f* is the best target value observed so far, µ(x) is the predicted mean and σ(x) is the corresponding standard deviation.

At its core, this procedure is inherently serial, as it is based upon the updating of a probabilistic model, and thus limited by data acquisition. Our contribution is twofold: firstly, we propose a novel parallel (or batch) Bayesian optimization procedure based upon K-means, K-means Batch Bayesian Optimization (KMBBO); and secondly, we propose a modification for the use of this method with very high-dimensional data, using a dimensionality reduction step based upon compressed sensing.

Table 1: K-Means Batch Bayesian Optimization (KMBBO)
Input: Sampling domain X, initial samples D_0, batch size k, epochs N, slice samples n_s
For t = 1 to N:
1. Fit GP model to training data D_{t−1}
2. Collect slice samples: {s_1, ..., s_{n_s}} = BGSS(X)
3. Fit K-Means model to obtain centroids µ_i
4. Sample centroids: {y}_t = {f(µ_1), ..., f(µ_k)}
5. Add newly observed values to dataset: D_t = D_{t−1} ∪ {y}_t
End for
Return D_N

The central aim of KMBBO is to efficiently select a batch of high quality points to evaluate; i.e., during each sampling epoch, we would like our batch to contain points from high-density regions of the acquisition function. However, modeling the landscape of the acquisition function directly is generally intractable, except in very low dimensions. In order to approximately learn the locations of peaks, we fit a K-Means clustering model to the collection of points in our sample space, chosen using slice sampling. Slice sampling draws samples uniformly from the volume under the acquisition function, and so will preferentially select samples from regions where the acquisition function value is highest.

K-Means (MacQueen 1967) is one of the simplest and most commonly used clustering methods. Given a set of points X, and a number of clusters k, the K-Means method will attempt to find a partition P(X) = {X_1, ..., X_k} clustering the members of X in order to minimize the within-cluster sum of squares distance between cluster members and the cluster centroid, i.e.:

argmin_{P(X)} Σ_{i=1}^{k} Σ_{x ∈ X_i} ||x − µ_i||²   (3)

where µ_i represents the centroid of cluster i. Thus, KMBBO allows the user to specify the batch size directly as the number of clusters for the K-Means method.

To collect its slice samples, KMBBO utilizes the batch generalized slice sampling (BGSS) method described in (Nguyen et al.
2017), where the joint density is defined as

p(x, u) = { 1/z, if α_min < u < α(x); 0, otherwise }   (4)

where z = ∫ α(x) dx and α_min is obtained through minimization using a non-convex global optimizer, thus not requiring the function to be non-negative, or a proper distribution. However, like standard slice sampling, BGSS scales poorly with the dimensionality of the sampling domain (Neal 2003), making it impractical for use in high-dimensional settings.

Table 2: KMBBO with compressed sensing (CS-KMBBO)
Input: Domain X, compression error tolerance ε, batch size k, epochs N, slice samples s
1. Draw n_comp samples from X: S = {s_1, ..., s_{n_comp}}
2. Compress domain using TwIsT: X_cs = Compress(X, S, ε)
3. Run KMBBO: D_N = KMBBO(X_cs, k, N, s)
4. Decompress D_N

To address this, we add a dimensionality reduction step based upon compressed sensing. Our use of the compressed sensing methodology is based upon the observation that most high-dimensional data follows a sparse encoding, and thus is compressible. In the compressed sensing scheme, the aim is to reconstruct a signal using the smallest possible number of observations (which are linear functions of the components of the signal). This is achieved by solving the basis pursuit problem, where we search for the sparsest matrix A which can reconstruct the full matrix B:

min ‖A‖₁ subject to (P A Pᵀ)_ij = B_ij ∀ i, j ∈ W   (5)

where P is the change-of-basis matrix and W is a set of randomly measured entries in matrix B.

We apply this method to the original feature space of a high-dimensional problem, but instead of using the sparse solution to reconstruct the original function, we use it as a compressed basis in which to perform the BO sampling. The upper bound on the lossless dimensionality reduction which can be achieved using compressed sensing is thus equivalent to the number of samples M which are required for compressed sensing to perfectly recover B from A, which has been shown to scale as follows (Candès and Wakin 2008):

M ∝ µ² S log(N)   (6)

where N is the original number of features, S is the number of non-zero elements, and µ is the incoherency, which in general ranges from 1 to √N.

Whilst some other methods, such as REMBO (Wang et al. 2013), have used a compressive scheme, the exact dimensionality of this compression was left as a parameter to tune.
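Sparse-recovery problems of this kind are typically solved with iterative shrinkage-thresholding methods. As a hedged illustration of the underlying mechanics, the sketch below implements plain ISTA (the simpler ancestor of the TwIsT variant used in this work, not the authors' implementation), recovering a sparse vector from compressed linear measurements; the random Gaussian measurement matrix and all parameter values are illustrative assumptions.

```python
import numpy as np

def ista(A, b, lam=0.01, n_iter=2000):
    """Plain ISTA for min ||Ax - b||^2 / 2 + lam * ||x||_1."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)           # gradient of the quadratic term
        z = x - grad / L                   # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return x

# Recover a sparse 100-dimensional signal from 40 random linear measurements.
rng = np.random.default_rng(0)
x_true = np.zeros(100)
x_true[[5, 37, 81]] = [1.0, -2.0, 1.5]     # S = 3 non-zero components
A = rng.standard_normal((40, 100)) / np.sqrt(40)
b = A @ x_true
x_hat = ista(A, b)
```

With S = 3, N = 100 and 40 measurements, this sits comfortably within the M ∝ µ²S log(N) regime of Equation (6), so the support of x_true is recovered despite the 2.5× undersampling.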
In CS-KMBBO, we instead use the Two-step Iterative Shrinkage/Thresholding (TwIsT) (Bioucas-Dias and Figueiredo 2007) optimization technique, a variant of the popular Iterative Shrinkage Thresholding algorithm (IST) which is more robust to ill-defined measurements, to determine the optimal dimensionality of the compression step. Whilst it is possible in some discrete problems, such as the drug discovery challenge tackled within this paper, to know the entire space of inputs, we recognize that this is not always the case. Thus, we sample 1,000 data points to perform the TwIsT-based dimensionality optimization procedure, creating a process which is transferable between discrete and continuous spaces.

Experiments
Comparison to Existing Methods
For this study we compare the performance of KMBBO to a range of currently used parallel BO methods, the details of which have been described in the Related Work section, using the Expected Improvement acquisition function, which has been shown to have strong theoretical guarantees (Vazquez and Bect 2010) and empirical effectiveness (Snoek, Larochelle, and Adams 2012). In addition to Naive qEI, the most basic parallel sampling method, we compare to Thompson sampling, Constant Liar (mean), Local Penalization, a batch predictive entropy search model to represent a non-greedy search strategy, and the dynamic batch method B3O. We investigate two metrics for success:

1. The convergence of the search to the global minimum (where known) as a function of the number of epochs.
2. The robustness of the search, as demonstrated through sampling 100 repeat runs of the sampling experiment.

For this study, a batch size of 8 was arbitrarily chosen. Throughout the study, the Bayesian model was provided through the use of a Gaussian process, which was created using a squared-exponential kernel with automatic relevance determination (ARD), as implemented in the Scikit-Learn library (Pedregosa et al. 2011),

k_SE(x, x*) = σ_f² exp( −Σ_{d=1}^{D} (x_d − x*_d)² / (2 l_d²) )   (7)

seeded with 10 randomly selected data points. The GP's hyperparameters were optimized using gradient descent on the marginal likelihood. Finally, both B3O and KMBBO selected 200 slice samples when generating each batch, to maintain consistency with (Nguyen et al. 2017).

Optimization Tasks
Synthetic Functions
We test the ability of KMBBO to find the global extremes of three synthetic functions commonly used for benchmarking machine learning algorithms: Branin-Hoo (2D), Camelback-6 (2D), and Hartmann (6D), as described in the Virtual Library of Simulation Experiments test function database (Surjanovic and Bingham).
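For reference, the Branin-Hoo benchmark can be written in a few lines. The constants below are the standard ones from the Virtual Library of Simulation Experiments, and the known global minimum value f ≈ 0.397887 (attained at three points, including (π, 2.275)) makes regret straightforward to compute:

```python
import math

def branin_hoo(x1, x2):
    """Branin-Hoo test function; three global minima with value ~0.397887."""
    a = 1.0
    b = 5.1 / (4.0 * math.pi ** 2)
    c = 5.0 / math.pi
    r = 6.0
    s = 10.0
    t = 1.0 / (8.0 * math.pi)
    return a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 + s * (1.0 - t) * math.cos(x1) + s

# One of the three global minimizers:
print(branin_hoo(math.pi, 2.275))  # ≈ 0.397887
```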
SVM
A common use for Bayesian optimization is the tuning of hyperparameters for machine learning models. In order to test the effectiveness of KMBBO for this task, we use it to determine optimal hyperparameters for a support vector machine on the Abalone regression task (Nash and Laboratories 1994). In this context we tune three hyperparameters: C (regularization parameter), ε (insensitive loss) for regression, and γ (RBF kernel function). The loss function is the root mean squared error of the prediction.

Drug Discovery
This task illustrates the utility of this procedure for lead identification in drug discovery, where rapid identification of desirable compounds at low cost is essential. The target for maximization is the pEC50, a value which describes the potency of the drug. The data, taken from hits from Plasmodium falciparum (P. falciparum) whole cell screening, originates from the GlaxoSmithKline Tres Cantos Antimalarial Set (TCAMS), the Novartis-GNF Malaria Box Data set, and St. Jude Children's Research Hospital's Dataset (EC50 in µM against P. falciparum 3D7), as released through the Medicines for Malaria Venture website (mmv.org). Each molecule was described using MACCS keys (Durant et al. 2002), a common cheminformatics descriptor, generated using the RDKit software (Landrum), resulting in a 167-dimensional optimization problem.
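Since the assay data are reported as EC50 values in µM while the optimization target is the pEC50, a unit conversion is needed; the helper below follows the standard convention (pEC50 = −log10 of the EC50 in molar), which is an assumption on our part rather than something specified in the dataset release:

```python
import math

def pec50_from_ec50_um(ec50_um):
    """pEC50 = -log10(EC50 in molar); the input is assumed to be in micromolar."""
    return -math.log10(ec50_um * 1e-6)

print(pec50_from_ec50_um(1.0))  # an EC50 of 1 uM corresponds to a pEC50 of 6
```

Note that higher pEC50 means a more potent compound, which is why the optimization is framed as a maximization.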
Results
Figure 1: The distribution of the 'first encounter time', i.e. when the global optimum is first located, for the Branin-Hoo function. Statistics are generated from 100 repeats of the experiment.
Synthetic Functions
For the Branin-Hoo function, we observe that both the Constant Liar and KMBBO methods are able to approach the minimum quickly, achieving low regret after only a few sampling epochs, with both B3O and Thompson sampling also reliably finding the optimum before 8 sampling epochs. LP, however, performs poorly, achieving similar regret to Naive qEI, with many iterations of each method failing to discover the minimum after 10 epochs. The performance of LP relies heavily on the quality of the Lipschitz constant estimate, which is calculated over the entire sampling domain. For the Branin-Hoo function, this is dominated by the quartic term away from the function minima, leading to a Lipschitz constant estimate poorly suited to the region around the optimum. Figure 1 shows the 'first encounter time' of the global optimum for each method. We see that, even though the initial reduction in regret between KMBBO and Constant Liar is similar, KMBBO is able to locate the optimal value earlier and more consistently than the other methods. All methods perform well on the Camelback task, although we observe that KMBBO converges to the true minimum faster than the other methods. The 6-dimensional Hartmann function is a more challenging optimization problem. We observe in Figure 2 that after 10 epochs, all methods have still not yet managed to identify the global optimum. LP, B3O and KMBBO all performed well, achieving similar average regret values, but B3O and KMBBO performed more consistently, with lower variance on the regret obtained.
Tuning of Hyperparameters
KMBBO displays the best performance on the SVM hyperparameter tuning task, shown in Figure 4. With a low-dimensional sampling space, the slice sampling method used by B3O and KMBBO performs particularly well at approximating high density regions of the acquisition function. Indeed, the violin plot in Figure 4 shows that, not only are KMBBO and B3O the best performers at minimizing RMSE, they also perform most consistently, with the smallest error variance.

Figure 2: The optimization performance for the 6-dimensional Hartmann function. Statistics are generated from 100 repeats of the experiment, and confidence intervals are calculated to 1 sigma.

Figure 4: Optimization performance for the tuning of the hyperparameters of an SVM, as displayed through the RMSE of the SVM for the abalone problem with respect to the number of epochs. Statistics are generated from 100 repeats of the experiment, and confidence intervals are bootstrapped to 1 sigma. Note that the batch-PES methodology is excluded from this plot, as its large variance over runs made interpretation of the performance of the other methodologies impossible. The results for this method can be seen in Table 3.
Drug Discovery
The high-dimensional nature of the drug discovery task presented significant challenges to several of the benchmark methods. In 167 dimensions, the slice sampling method used by B3O is unable to produce any reasonable approximation of the acquisition function surface with the original sampling budget of 200, and we found the substantial increase in samples required led to prohibitively long running times. The LP method was hamstrung by the computational cost of approximating the Lipschitz constant in this high-dimensional space. Furthermore, the Constant Liar methodology is reliant upon a high quality model, and thus is very sensitive to hyperparameter selection and the addition of reasonable quality pseudo-inputs. Unfortunately, during our testing of this method on the drug discovery problem, a large number of runs failed due to the GP model failing to converge during the fitting task, and thus it is excluded from the results.

Of the remaining methods, Thompson sampling, qEI, batch-PES and CS-KMBBO can be used for this task. Figure 5 shows that KMBBO displays strong performance, reaching low regret after 10 sampling epochs, having sampled only 90 out of a potential circa 19,000 candidate molecules. Thompson sampling, qEI and Batch-PES display similar behaviors, discovering a local maximum on the potency landscape, but none are able to discover molecules with as low regret as KMBBO. It is worth noting that, due to the discrete nature of the search space, here regret is not a continuous function; for example, a regret of 2 places you within the top 0.7% of values for the task.

Table 3: Final performance after 10 sampling epochs for each method on each of the test problems over 100 repetitions (Regret and Std. Dev. for the Branin-Hoo, Camelback-6, Hartmann and Malaria tasks; RMSE and Std. Dev. for the SVM task). Best performance in each case is shown in bold. For the Malaria task, an X indicates that the method was not run, due to computational intractability or algorithmic instability.
Rankings
One way to measure the robustness of a search method is to compare the rankings of the search method over the whole range of tasks performed in this study. Since raw rankings can be misleading (a close second ranks the same as a search in which the gap between methods was much wider), we instead use a normalized ranking, Z, proposed in (Jasrasaria and Pyzer-Knapp 2018):

Z = (s − s′) / (s_max − s_min)   (8)

where s represents the result of a particular strategy, s′ the result of the best strategy, and s_max − s_min represents the range of results encountered in the study. This results in a score bounded in [0, 1], where 0 represents a perfect performance across tasks.

Figure 5: Optimization performance for the Malaria drug discovery problem, as displayed through instantaneous regret with respect to the number of epochs. Statistics are generated from 10 repeats of the experiment, and confidence intervals are bootstrapped to 1 sigma.

We calculate Z for both the performance of the optimization task, as measured by regret or RMSE where appropriate, and the variance of the task as measured over multiple runs. It can easily be seen from Figure 6 that KMBBO achieves a significantly better Z score than any other method for pure optimization performance, and also the best Z score, albeit by a smaller margin, for variance. This demonstrates both the class-leading nature of KMBBO and also its strong reproducibility; a property which is key for Bayesian optimization, where each data point is expensive to acquire and thus reliability of a methodology is strongly desired.

Computational Cost
We have analyzed the complexity of the rate-limiting step for each of the methods used in this work, and performed additional empirical experiments looking at real-world running times. The poor dimensionality scaling of slice sampling (O(2^d)) is common to the B3O method, and worse than the scaling of the LP method (O(d)). We address this in CS-KMBBO through the incorporation of compressed sensing for dimensionality reduction. Even when compression is not required, our empirical timings, shown in Figure 7, indicate that the runtime per sampling epoch for KMBBO is generally significantly smaller than for B3O, which we attribute to the simplicity and scalability of the K-Means algorithm compared to the IGMM used in B3O. However, it is worth keeping in mind that, in the Bayesian Optimization framework, it is generally assumed that obtaining ground truth values by sampling the black-box function f is substantially more expensive (in time/computational cost) than the calculation of the sampling batch, which somewhat mitigates concerns about the computational cost of the sampling methodology: an expensive, yet efficient, sampling scheme will have less real-world cost than an inefficient, yet fast, alternative.

Method: Naive qEI / Thompson / Constant Liar / LP / KMBBO / B3O / Batch PES
Z (Performance): 0.750 / 0.414 / 0.460 / 0.397 / 0.063 / 0.276 / 0.663
Z (Variance): 0.533 / 0.257 / 0.260 / 0.480 / 0.200 / 0.256 / 0.359

Figure 6: Z score calculated for each of the parallel optimization strategies investigated in this study.

Figure 7: Runtimes of KMBBO and B3O in 2 and 6D. Runtime is calculated as seconds per sampling epoch.

Algorithmic Insight
In this section we discuss the different characteristics of the sampling methods through analyzing their sample selections for an easy to visualize 1-dimensional optimization problem. Figure 8 shows the acquisition function curve and subsequent samples chosen by each of the sampling algorithms while minimizing the function f(x) = −x sin(x), after 5 randomly chosen initial samples. This gives some visual insight into the behavior of each of the methods. We observe that all of the methods are able to identify the main peak of the acquisition function and allocate at least one sample nearby. Naive qEI simply chooses the q points from the sample space closest to this peak, leading to highly local sampling, and insufficient exploration of other areas of density in the acquisition function. LP also does a good job of identifying the main acquisition function peak, and the local penalization factor ensures somewhat more exploration than with the Naive qEI method. However, this still seems insufficient to cause exploration of other areas of density in the acquisition function. In contrast, Constant Liar is susceptible to over-exploration and selects several low quality points. We posit that this is due to the assumption that the true value for each sample added to the batch is represented by the mean value of the GP prediction. Since the violation of this assumption can lead to large movements in the GP posterior, this can cause erratic behavior, and lead to these poor selections. In our toy example, B3O successfully identifies two of the acquisition function peaks, but does not represent the third.

Figure 8: Points selected to form the next batch for each sampling method when minimizing f(x) = −x sin(x), given 5 initial random points. The acquisition function is shown in blue, with non-zero regions shaded. The blue histogram shows the samples taken by BGSS.
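The KMBBO selection step on a toy problem of this kind can be sketched in a few lines of numpy. Everything here is an illustrative assumption rather than the authors' implementation: rejection sampling stands in for BGSS, a bare-bones Lloyd iteration stands in for a library K-Means, and the positive part of −x sin(x) stands in for a real acquisition surface.

```python
import numpy as np

def acquisition(x):
    # Toy stand-in for an acquisition surface: the positive part of -x*sin(x).
    return np.maximum(-x * np.sin(x), 0.0)

def sample_under_curve(n, lo, hi, rng):
    """Rejection sampling: accepted points fall uniformly under the curve."""
    xs = np.empty(0)
    ymax = acquisition(np.linspace(lo, hi, 2000)).max()
    while xs.size < n:
        x = rng.uniform(lo, hi, n)
        u = rng.uniform(0.0, ymax, n)
        xs = np.concatenate([xs, x[u < acquisition(x)]])
    return xs[:n]

def kmeans_1d(xs, k, rng, n_iter=50):
    """Bare-bones Lloyd iteration; the centroids become the batch to evaluate."""
    centroids = rng.choice(xs, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(xs[:, None] - centroids[None, :]), axis=1)
        for i in range(k):
            if np.any(labels == i):
                centroids[i] = xs[labels == i].mean()
    return np.sort(centroids)

rng = np.random.default_rng(0)
samples = sample_under_curve(200, 0.0, 10.0, rng)  # BGSS stand-in
batch = kmeans_1d(samples, k=3, rng=rng)           # batch = cluster centroids
```

Because the slice samples concentrate where the acquisition value is high, the centroids land on (or between) the peaks, giving the exploration/exploitation balance described above without any explicit penalization term.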
The IGMM used by B3O seems to be sensitive to the number of slice samples provided, as experiments with different numbers of slice samples lead to substantial variations in the number and location of the points chosen. KMBBO is able to achieve a good balance between exploration and exploitation, with all three maxima in the acquisition function represented, and with the remaining samples well distributed over the non-zero areas of the curve. When the number of local optima of the acquisition function is lower than the batch size, the quadratic penalization for within-cluster distance used by K-Means ensures that the remaining cluster centroids will spread out over the set of slice sample values.

Summary

We propose a novel batch sampling algorithm for Bayesian optimization based upon K-means, K-means Batch Bayesian Optimization (KMBBO). KMBBO was tested on a variety of tasks, from common synthetic functions, to the tuning of a machine learning algorithm, to a high-dimensional drug discovery problem. Over these tasks KMBBO displays superior sampling behavior to other common Bayesian optimization methods, such as LP, Thompson sampling, Constant Liar, and B3O, delivering either optimal or close to optimal behavior in all tasks. It also delivered this performance more reliably than any other method, consistently showing the smallest standard deviation in results over 100 repetitions across tasks. This is a very important result, since the major utility of Bayesian optimization is when each sample is expensive or difficult to collect, and thus reliability in optimization performance is strongly desirable. We also propose a modification to KMBBO, CS-KMBBO, for use in high-dimensional problems, where the slice sampling in KMBBO adds significant computational overhead.
In this adaptation, the optimal dimensionality is achieved through the use of the TwIsT technique on a sampled subset of the problem space. CS-KMBBO shows better performance than all other methods, despite operating on a reduced-dimensional data set. Finally, we discuss insights into the performance of KMBBO through the visualization of the batching process for a toy problem, and comparison to the other methods studied within this paper. Over a wide variety of tasks, it is inevitable that for any specific task a particular sampling technique will have optimal performance, but the strong performance of KMBBO over the whole range of tasks and dimensions makes it a reliable choice.
Acknowledgements
This work was supported by the STFC Hartree Centre's Innovation Return on Research programme, funded by the Department for Business, Energy & Industrial Strategy.
References

[Bioucas-Dias and Figueiredo 2007] Bioucas-Dias, J. M., and Figueiredo, M. A. 2007. A new TwIST: Two-step iterative shrinkage/thresholding algorithms for image restoration. IEEE Transactions on Image Processing.

[Candès and Wakin 2008] Candès, E. J., and Wakin, M. B. 2008. An introduction to compressive sampling. IEEE Signal Processing Magazine.

[Chevalier and Ginsbourger 2013] Chevalier, C., and Ginsbourger, D. 2013. Fast computation of the multi-points expected improvement with applications in batch selection. In International Conference on Learning and Intelligent Optimization, 59–69. Springer.

[Durant et al. 2002] Durant, J. L.; Leland, B. A.; Henry, D. R.; and Nourse, J. G. 2002. Reoptimization of MDL keys for use in drug discovery. Journal of Chemical Information and Computer Sciences.

[Ginsbourger, Le Riche, and Carraro 2007] Ginsbourger, D.; Le Riche, R.; and Carraro, L. 2007. A multi-points criterion for deterministic parallel global optimization based on Gaussian processes. NCP07.

[Ginsbourger, Le Riche, and Carraro 2010] Ginsbourger, D.; Le Riche, R.; and Carraro, L. 2010. Kriging is well-suited to parallelize optimization. In Computational Intelligence in Expensive Optimization Problems. Springer. 131–162.

[González et al. 2016] González, J.; Dai, Z.; Hennig, P.; and Lawrence, N. 2016. Batch Bayesian optimization via local penalization. In Artificial Intelligence and Statistics, 648–657.

[Hernández-Lobato et al. 2017] Hernández-Lobato, J. M.; Requeima, J.; Pyzer-Knapp, E. O.; and Aspuru-Guzik, A. 2017. Parallel and distributed Thompson sampling for large-scale accelerated exploration of chemical space. In Proceedings of the 34th International Conference on Machine Learning.

[Jasrasaria and Pyzer-Knapp 2018] Jasrasaria, D., and Pyzer-Knapp, E. O. 2018. Dynamic control of explore/exploit trade-off in Bayesian optimization. arXiv:1807.01279 [cs, stat].

[Landrum] Landrum, G. RDKit: Open-source cheminformatics.

[MacQueen 1967] MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, 281–297. Berkeley, Calif.: University of California Press.

[mmv.org] mmv.org. Malaria Box supporting information | MMV.

[Mockus 1974] Mockus, J. 1974. On Bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference Novosibirsk, July 1–7, 1974, 400–404. Springer, Berlin, Heidelberg.

[Nash and Laboratories 1994] Nash, W. J., and Laboratories, T. M. R. 1994. The population biology of abalone (Haliotis species) in Tasmania. 1, Blacklip abalone (H. rubra) from the north coast and the islands of Bass Strait.

[Neal 2003] Neal, R. M. 2003. Slice sampling. Annals of Statistics.

[Nguyen et al. 2017] Nguyen, V.; Rana, S.; Gupta, S.; Li, C.; and Venkatesh, S. 2017. Budgeted batch Bayesian optimization. arXiv preprint arXiv:1703.04842v2.

[Pedregosa et al. 2011] Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research.

[Rasmussen and Williams 2004] Rasmussen, C. E., and Williams, C. K. I. 2004. Gaussian Processes for Machine Learning. MIT Press.

[Rasmussen 2000] Rasmussen, C. E. 2000. The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems, 554–560.

[Shah and Ghahramani 2015] Shah, A., and Ghahramani, Z. 2015. Parallel predictive entropy search for batch global optimization of expensive objective functions. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS'15, 3330–3338. Cambridge, MA, USA: MIT Press.

[Snoek et al. 2015] Snoek, J.; Rippel, O.; Swersky, K.; Kiros, R.; Satish, N.; Sundaram, N.; Patwary, M.; Prabhat, M.; and Adams, R. 2015. Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning, 2171–2180.

[Snoek, Larochelle, and Adams 2012] Snoek, J.; Larochelle, H.; and Adams, R. P. 2012. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, 2951–2959.

[Surjanovic and Bingham] Surjanovic, S., and Bingham, D. Virtual library of simulation experiments: Test functions and datasets. Retrieved May 16, 2018.

[Thompson 1933] Thompson, W. R. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika.

[Vazquez and Bect 2010] Vazquez, E., and Bect, J. 2010. Convergence properties of the expected improvement algorithm with fixed mean and covariance functions. Journal of Statistical Planning and Inference.

[Wang et al. 2013] Wang, Z.; Zoghi, M.; Hutter, F.; Matheson, D.; and de Freitas, N. 2013. Bayesian optimization in high dimensions via random embeddings. In IJCAI, 1778–1784.

[Wang et al. 2016] Wang, J.; Clark, S. C.; Liu, E.; and Frazier, P. I. 2016. Parallel Bayesian global optimization of expensive functions. arXiv preprint arXiv:1602.05149.