Unbounded Bayesian Optimization via Regularization
Bobak Shahriari
University of British Columbia [email protected]
Alexandre Bouchard-Côté
University of British Columbia [email protected]
Nando de Freitas
University of Oxford, Google DeepMind [email protected]
Abstract
Bayesian optimization has recently emerged as a popular and efficient tool for global optimization and hyperparameter tuning. Currently, the established Bayesian optimization practice requires a user-defined bounding box which is assumed to contain the optimizer. However, when little is known about the probed objective function, it can be difficult to prescribe such bounds. In this work we modify the standard Bayesian optimization framework in a principled way to allow automatic resizing of the search space. We introduce two alternative methods and compare them on two common synthetic benchmarking test functions, as well as on the tasks of tuning the stochastic gradient descent optimizer of a multi-layered perceptron and a convolutional neural network on MNIST.
1 Introduction

Bayesian optimization has recently emerged as a powerful tool for the efficient global optimization of black-box objective functions. Since the technique was introduced over 50 years ago, it has been applied to many different applications. Perhaps the most relevant application for the machine learning community is the automated tuning of algorithms [2, 13, 18, 21, 8]. However, the current state of the art requires the user to prescribe a bounding box within which to search for the optimum. Unfortunately, setting these bounds, often arbitrarily, is one of the main difficulties hindering the broader use of Bayesian optimization as a standard framework for hyperparameter tuning. For example, this obstacle was raised at the NIPS 2014 Workshop on Bayesian Optimization as one of the open challenges in the field.

In the present work, we propose two methods that are capable of growing the search space as the optimization progresses. The first is a simple heuristic which regularly doubles the volume of the search space, while the second is a regularization method that is practical and easy to implement in any existing Bayesian optimization toolbox based on Gaussian process priors over objective functions.

At a high level, the regularized method is born out of the observation that the only component of Bayesian optimization that currently requires a bounding box on the search space is the maximization of the acquisition function; this constraint is necessary because acquisition functions can have suprema at infinity. By using a non-stationary prior mean as a regularizer, we can exclude this possibility and use an unconstrained optimizer, removing the need for a bounding box.
Although the notion of using a non-trivial Gaussian process prior mean is not new, it is usually expected to encode domain expert knowledge or known structure in the response surface. Only one recent work, to the best of the authors' knowledge, has considered using the prior mean as a regularization term, and it was primarily to avoid selecting points along boundaries and in corners of the bounding box [19].

In this work we demonstrate that regularization can be used to carry out Bayesian optimization without a rigid bounding box. We further introduce a volume doubling baseline, which we compare to the regularization approach. While the regularized algorithms exhibit a much more homogeneous search behaviour (i.e. boundaries and corners are not disproportionately favoured), the volume doubling baseline performs very well in practice.

We begin with a brief review of Bayesian optimization with Gaussian processes in the next section, followed by an introduction to regularization via non-stationary prior means in Section 3, including visualizations that show that our proposed approaches indeed venture out of the initial user-defined bounding box. Section 4 reports our results on two synthetic benchmarking problems and two neural network tuning experiments.

2 Bayesian optimization

Consider the global optimization problem of finding a global maximizer x* ∈ arg max_{x ∈ R^d} f(x) of an unknown objective function f : R^d → R. In this work, the function f is assumed to be a black-box for which we have no closed-form expression or gradient information. We further assume that f is expensive to evaluate, so we wish to locate the best possible input x with a relatively small budget of N evaluations. Finally, the evaluations y ∈ R of the objective function are noise-corrupted observations, and for the remainder of this work we assume a Gaussian noise distribution, y ∼ N(f(x), σ^2).
In contrast to typical Bayesian optimization settings, we do not assume the arg max to be restricted to a bounded subset X ⊂ R^d.

Bayesian optimization is a sequential model-based approach which involves (i) maintaining a probabilistic surrogate model over likely functions given observed data; and (ii) selecting future query points according to a selection policy, which leverages the uncertainty in the surrogate to negotiate exploration of the search space and exploitation of current modes. The selection policy is represented by an acquisition function α_n : R^d → R, where the subscript indicates the implicit dependence on the surrogate and, by extension, on the observed data D_n = {(x_i, y_i)}_{i=1}^n. More precisely, at iteration n: an input x_{n+1} is selected by maximizing the acquisition function α_n; the black-box is queried and produces a noisy y_{n+1}; and the surrogate is updated in light of the new data point (x_{n+1}, y_{n+1}). Finally, after N queries the algorithm must make a final recommendation x̂_N, which represents its best estimate of the optimizer.

In this work we prescribe a Gaussian process (GP) prior over functions [15]. When combined with a Gaussian likelihood, the posterior is also a GP and the Bayesian update can be computed analytically. Note that other surrogate models, such as random forests, have also been proposed in the model-based optimization literature [9].

A Gaussian process
GP(µ, k) is fully characterized by its prior mean function µ : R^d → R and its positive-definite kernel, or covariance, function k : R^d × R^d → R. Given any finite collection of n points x_{1:n} (here we use the convention a_{i:j} = {a_i, ..., a_j}), the values f(x_1), ..., f(x_n) are jointly Gaussian with mean m, where m_i := µ(x_i), and n × n covariance matrix K, where K_ij = k(x_i, x_j), hence the term covariance function. Given the Gaussian likelihood model, the vector of concatenated observations y = y_{1:n} is also jointly Gaussian with covariance K + σ^2 I. Therefore, at any arbitrary test location x, we can query our surrogate model (the GP) for the function value f(x) conditioned on observed data D_n. The quantity f(x) is a Gaussian random variable with the following mean µ_n(x) and marginal variance σ_n^2(x):

µ_n(x) = µ(x) + k(x)^T (K + σ^2 I)^{-1} (y − m),   (1)
σ_n^2(x) = k(x, x) − k(x)^T (K + σ^2 I)^{-1} k(x),   (2)

where k(x) is the vector of cross-covariance terms between x and x_{1:n}. Throughout this work we use the squared exponential kernel

k_SE(x, x') = θ_0^2 exp(−r^2 / 2),   (3)

where r^2 = (x − x')^T Λ^{-1} (x − x'), Λ is a diagonal matrix of d length scales θ_{1:d}, and θ_0 is the kernel amplitude. Collectively referred to as the hyperparameter, θ = θ_{0:d} parameterizes the kernel function k. When the noise variance σ^2 is unknown, it can be added as a model hyperparameter as well. Similarly, the most common agnostic choice of prior mean is a constant bias µ(x) ≡ b, which can also be added to θ.
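Equations (1)-(3) can be sketched numerically. This is a minimal illustration with a shared length scale and a constant prior mean, not the paper's implementation:

```python
import numpy as np

def sq_exp_kernel(A, B, amp=1.0, ls=1.0):
    """Squared exponential kernel of Eq. (3): amp^2 * exp(-r^2 / 2)."""
    r2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / ls ** 2
    return amp ** 2 * np.exp(-0.5 * r2)

def gp_posterior(X, y, Xtest, noise=1e-2, prior_mean=0.0):
    """Posterior mean and marginal variance of Eqs. (1)-(2)."""
    K = sq_exp_kernel(X, X) + noise * np.eye(len(X))   # K + sigma^2 I
    Ks = sq_exp_kernel(Xtest, X)                       # cross-covariances k(x)
    alpha = np.linalg.solve(K, y - prior_mean)         # (K + sigma^2 I)^{-1} (y - m)
    mu = prior_mean + Ks @ alpha
    V = np.linalg.solve(K, Ks.T)
    var = sq_exp_kernel(Xtest, Xtest).diagonal() - (Ks * V.T).sum(-1)
    return mu, var
```

Far from the data, µ_n reverts to the prior mean and σ_n^2 to the prior variance; this reversion to the prior mean is precisely what allows a coercive prior mean to act as a regularizer later on.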
Hyperparameter marginalization. As in many regression tasks, the hyperparameter θ must somehow be specified and has a dramatic effect on performance. Common tuning techniques such as cross-validation and maximum likelihood are either highly data-inefficient or run the risk of overfitting. Recently, a Bayesian treatment of the hyperparameters via Markov chain Monte Carlo has become standard practice in Bayesian optimization [18]. Similarly, in the present work we specify an uninformative prior on θ and approximately integrate it out by sampling from the posterior p(θ | D_n) via slice sampling.

So far we have described the statistical model we use to represent our belief about the unknown objective f given data D_n, and how to update this belief given new observations. We have not described any mechanism or policy for selecting the query points x_{1:n}. One could select these arbitrarily or by grid search, but this would be wasteful; uniformly random search has also been proposed as a surprisingly good alternative to grid search [3]. There is, however, a rich literature on selection strategies that utilize the surrogate model to guide the sequential search, i.e. the selection of the next query point x_{n+1} given D_n.

The key idea behind these strategies is to define an acquisition function α : R^d → R which quantifies the promise of any point in the search space. The acquisition function is carefully designed to trade off exploration of the search space and exploitation of current promising neighborhoods. There are three common types of acquisition functions: improvement-based, optimistic, and information-based policies.

The improvement-based acquisition functions, probability of improvement and expected improvement (PI and EI, respectively), select the next point with the most probable and the largest expected improvement, respectively [12, 14].
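Under the GP surrogate, EI admits the closed form given in Eq. (4) below; a scalar sketch of that expression:

```python
import math

def expected_improvement(mu, sigma, tau):
    """Closed-form EI for a Gaussian posterior N(mu, sigma^2) and target tau."""
    z = (mu - tau) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    return (mu - tau) * Phi + sigma * phi
```

EI is positive whenever σ_n > 0 and grows with both µ_n − τ (exploitation) and σ_n (exploration), which is what makes the trade-off explicit.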
On the other hand, the optimistic policy upper confidence bound (GP-UCB) measures, marginally for each test point x, how good the corresponding observation y will be in a low and fixed probability "good case scenario", hence the optimism [20]. In contrast, there exist information-based methods such as randomized probability matching, also known as Thompson sampling [22, 16, 4, 11, 1], and the more recent entropy search methods [23, 5, 6]. Thompson sampling selects the next point according to the distribution of the optimum x*, which is induced by the current posterior [16, 6, 17]. Meanwhile, entropy search methods select the point x that is expected to provide the most information towards reducing uncertainty about x*.

In this work we focus our attention on EI, which is perhaps the most common acquisition function. (The way "promise" is quantified depends on whether we care about cumulative losses or only the final recommendation x̂_N.) With the GP surrogate model, EI has the following analytic expression:

α_n^EI(x) = (µ_n(x) − τ) Φ((µ_n(x) − τ) / σ_n(x)) + σ_n(x) φ((µ_n(x) − τ) / σ_n(x)),   (4)

where Φ and φ denote the standard normal cumulative distribution and density functions, respectively. Note, however, that the technique we outline in the next section can readily be extended to any Gaussian process derived acquisition function, including all those mentioned above.

3 Unbounded Bayesian optimization
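Our first method, described next, grows the bounding box periodically. As a sketch under our own naming (the `rng` handling and function names are illustrative, not the paper's code), its two ingredients are latin hypercube initialization and isotropic volume growth by a factor γ:

```python
import numpy as np

def latin_hypercube(n, lower, upper, rng):
    """n points in the box, stratified so each dimension is evenly covered."""
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    d = len(lower)
    strata = rng.permuted(np.tile(np.arange(n), (d, 1)), axis=1).T  # (n, d)
    u = (strata + rng.uniform(size=(n, d))) / n                     # jitter in stratum
    return lower + u * (upper - lower)

def grow_bounds(lower, upper, gamma=2.0):
    """Expand the box isotropically about its centre so its volume grows by gamma."""
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    centre, half = 0.5 * (lower + upper), 0.5 * (upper - lower)
    scale = gamma ** (1.0 / len(lower))  # per-dimension factor => volume x gamma
    return centre - scale * half, centre + scale * half
```

Applying the per-dimension factor γ^(1/d) is what keeps the volume growth fixed at γ rather than exponential in d.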
Our first proposed heuristic approach consists of expanding the search space regularly as the optimization progresses, starting with an initial user-defined bounding box. This method otherwise follows the standard Bayesian optimization procedure and optimizes within the bounding box that is available at the given time step n. This approach requires two parameters: the number of iterations between expansions and the growth factor γ. Naturally, to avoid growing the feasible space X by a factor that is exponential in d, the growth factor applies to the volume of X. Finally, the expansion is isotropic about the centre of the domain. In this work, we double (γ = 2) the volume every d evaluations (only after an initial latin hypercube sampling of d points).

We motivate the regularized approach by considering improvement policies (e.g.
EI); however, in the next section we show that this proposed approach can be applied more generally to all GP-derived acquisition functions. Improvement policies are a popular class of acquisition functions that rely on the improvement function

I(x) = (f(x) − τ) I[f(x) ≥ τ],   (5)

where τ is some target value to improve upon. Expected improvement then computes E[I(x)] under the posterior GP.

When the optimal objective value f* is known and we set τ = f*, these algorithms are referred to as goal seeking [10]. When the optimum is not known, it is common to use a proxy for f* instead, such as the value of the best observation so far, y+; or, in the noisy setting, one can use either the maximum of the posterior mean or the value of the posterior at the incumbent x+, where x+ ∈ arg max_{x ∈ x_{1:n}} µ_n(x).

In some cases, the above choice of target τ = y+ can lead to a lack of exploration; therefore it is common to choose a minimum improvement parameter ξ > 0 such that τ = (1 + ξ) y+ (for convenience here we assume y+ is positive; in practice one can subtract the overall mean to make it so). Intuitively, the parameter ξ allows us to require a minimum fractional improvement over the current best observation y+. Previously, the parameter ξ had always been chosen to be constant, if not zero. In this work we propose to use a function ξ : R^d → R+ which maps points in the space to a value of fractional minimum improvement. Following the same intuition, the function ξ lets us require larger expected improvements from points that are farther away, and hence acts as a regularizer that penalizes distant points. The improvement function hence becomes

I(x) = (f(x) − τ(x)) I[f(x) ≥ τ(x)],  with  τ(x) = (1 + ξ(x)) y+,   (6)

where the choice of ξ(x) is discussed in Section 3.2.2.
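Equation (6) amounts to evaluating closed-form EI against the point-dependent target τ(x) = (1 + ξ(x)) y+. A sketch follows; the callables `mu_n`, `sigma_n`, and `xi` are illustrative placeholders, not the paper's code:

```python
import math

def ei(mu, sigma, tau):
    """Closed-form expected improvement, cf. Eq. (4)."""
    z = (mu - tau) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (mu - tau) * Phi + sigma * phi

def regularized_ei(x, mu_n, sigma_n, y_best, xi):
    """EI against the non-stationary target of Eq. (6); y_best assumed positive."""
    tau = (1.0 + xi(x)) * y_best
    return ei(mu_n(x), sigma_n(x), tau)
```

With a coercive ξ (growing without bound in ||x||), the target τ(x) diverges and the acquisition decays at infinity, so an unconstrained optimizer can be used.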
In the formulation of the previous section, our method seems restricted to improvement policies. However, many recent acquisition functions of interest are not improvement-based, such as GP-UCB, entropy search, and Thompson sampling. In this section, we describe a closely related formulation that generalizes to all acquisition functions that are derived from a GP surrogate model.

Consider expanding our choice of non-stationary target τ in Equation (6):

I(x) = (f(x) − y+(1 + ξ(x))) I[f(x) ≥ y+(1 + ξ(x))]
     = (f(x) − y+ξ(x) − y+) I[f(x) − y+ξ(x) ≥ y+]
     = (f̃(x) − y+) I[f̃(x) ≥ y+],   (7)

where f̃ is the posterior mean of a GP from (1) with prior mean µ̃(x) = µ(x) − y+ξ(x). Notice the similarity between (5) and (7). Indeed, in this form we see that the regularization can be achieved simply by using a different prior mean µ̃ and a constant target y+. This duality can be visualized by comparing the left and right panels of Figure 1.
Figure 1: Visualization of the two alternate views of regularization in Bayesian optimization: (a) the minimum improvement view (left) and (b) the prior mean view (right). The objective function and posterior surrogate mean are represented as black dashed and solid lines, respectively, with grey shaded areas indicating ± σ_n. Integrating the surrogate model above the target (orange shaded area) results in the regularized EI acquisition function (magenta and cyan). Using a non-stationary target with a constant prior mean (left) or a fixed target with a non-stationary prior mean (right) leads to indistinguishable acquisition functions, which decay at infinity.

Precisely speaking, these two views are not exactly equivalent. Starting with the µ̃ prior mean, the posterior mean yields an additional term

k(x)^T (K + σ^2 I)^{-1} ξ(X),   (8)

where [ξ(X)]_i = ξ(x_i). This term is negligible when the test point x is far from the data X due to our exponentially vanishing kernel, making the two alternate views virtually indistinguishable in Figure 1. However, with this new formulation we can apply the same regularization to any policy which uses a Gaussian process, e.g. Thompson sampling or entropy search.
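The quadratic and hinge-quadratic regularizers defined next in Eqs. (9)-(10) might be sketched as below. The exact exponents and normalizations are our reading of those equations and should be treated as assumptions:

```python
import numpy as np

def xi_quadratic(x, centre, w):
    """Quadratic regularizer, cf. Eq. (9); w holds the d per-dimension widths."""
    d = np.asarray(x, float) - centre
    return float(d @ (d / w))

def xi_hinge_quadratic(x, centre, R, beta=1.0):
    """Isotropic hinge-quadratic regularizer, cf. Eq. (10): zero inside the
    ball of radius R, quadratic growth (curvature set by beta) outside."""
    r = float(np.linalg.norm(np.asarray(x, float) - centre))
    return 0.0 if r <= R else ((r - R) / (beta * R)) ** 2
```

Both vanish near the centre (no penalty inside the region of interest) and grow without bound far away, which is what makes the prior mean µ̃ = µ − y+ξ coercive.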
By inspecting Equation (4), it is easy to see that any coercive prior mean function would lead to an asymptotically vanishing EI acquisition function; in this work, however, we consider quadratic (Q) and isotropic hinge-quadratic (H) regularizing prior mean functions, defined as follows (excluding the constant bias b):

ξ_Q(x) = (x − x̄)^T diag(w)^{-1} (x − x̄),   (9)
ξ_H(x) = I[||x − x̄|| > R] ((||x − x̄|| − R) / (βR))^2.   (10)

Both of these regularizers are parameterized by d location parameters x̄, and while ξ_Q has an additional d width parameters w, the isotropic regularizer ξ_H has a single radius R and a single β parameter, which controls the curvature of ξ_H outside the ball of radius R; in what follows we fix β = 1.

Fixed prior mean hyperparameters.
We are left with the choice of centre x̄ and radius parameters R (or widths w). Unlike the bias term b, these parameters of the regularizer are not intended to allow the surrogate to better fit the observations D_n. In fact, using the marginal likelihood to estimate or marginalize ψ = {x̄, R, w} could lead to fitting the regularizer to a local mode, which could trap the algorithm in a suboptimal well. For this reason, we use an initial, temporary user-defined bounding box to set ψ at the beginning of the run; the value of ψ remains fixed in all subsequent iterations.

Figure 2: Visualization of selections x_n made by the EI-H algorithm on two toy objective functions consisting of three Gaussian modes in one (left) and two (right) dimensions. Grey lines delimit the initial bounding box; grey square markers indicate the initial latin hypercube points, while the orange and blue points distinguish between the first and second half of the evaluation budget, respectively. In the two-dimensional example, the heights of the Gaussians are indicated by +1, +2, and +3.

Before describing our experimental results, Figure 2 provides a visualization of EI with the hinge-quadratic prior mean, optimizing two toy problems in one and two dimensions. The objective functions consist of three Gaussian modes of varying heights, and the initial bounding box does not include the optimum. We draw attention to the way the space is gradually explored outward from the initial bounding box.

4 Experiments

In this section, we evaluate our proposed methods and show that they achieve the desirable behaviour on two synthetic benchmarking functions, and on a simple task of tuning the stochastic gradient descent and regularization parameters used in training a multi-layered perceptron (MLP) and a convolutional neural network (CNN) on the MNIST dataset.

Experimental protocol.
For every test problem of dimension d and every algorithm, the optimization was run with an overall evaluation budget that scales with d, including an initial set of points sampled according to a latin hypercube sampling scheme (as suggested in [10]). Throughout each particular run, at every iteration n we record the value of the best observation up to n and report these in Figure 3.

Algorithms. We compared the two different methods from Section 3 to the standard EI with a fixed bounding box. Common random seeds were used for all methods in order to reduce confounding noise. All algorithms were implemented in the pybo framework available on GitHub [7] (https://github.com/mwhoffman/pybo), and are labelled in our plots as follows:

EI: vanilla expected improvement with hyperparameter marginalization.
EI-V: expected improvement with the search volume doubled every d iterations.
EI-H/Q: regularized EI with a hinge-quadratic and quadratic prior mean, respectively.

Note that for the regularized methods EI-H/Q, the initial bounding box is only used to fix the location and scale of the regularizers, and to sample initial query points. In particular, both regularizers are centred around the box centre; for the quadratic regularizer the width of the box in each direction is used to fix w, whereas for the hinge-quadratic R is set to the box circumradius. Once these parameters are fixed, the bounding box is no longer relevant.

Figure 3: Best observations as optimization progresses. Plotting mean and standard error over 40 (Hartmann), 14 (MLP), and 19 (CNN) repetitions.

The Hartmann 3 and 6 functions (numbers refer to their dimensionality) are standard synthetic global optimization benchmarking test functions. In a separate experiment, indicated by an appended asterisk (*), we consider a randomly located bounding box of side length 0.2 within the unit hypercube.
This window is unlikely to include the global optimum, especially in the six-dimensional problem, which makes this a good problem to test whether our proposed methods are capable of useful exploration outside the initial bounding box.

The MNIST hand-written digit recognition dataset is a very common simple task for testing neural network methods and architectures. In this work we optimize the parameters of the stochastic gradient descent (SGD) algorithm used in training the weights of the neural network. In particular, we consider an MLP with 2048 hidden units with tanh non-linearities, and a CNN with two convolutional layers. These examples were taken from the official GitHub repository of torch7 demos (https://github.com/torch/demos). The code written for this work can be readily extended to any other demo in the repository or any torch script.

For this problem, we optimize four parameters, namely the learning rate and momentum of the SGD optimizer, and the L1 and L2 regularization coefficients. The parameters were optimized in log space (base e) with an initial bounding box for each parameter. For each parameter setting, a black-box function evaluation corresponds to training the network for one epoch and returning the test set accuracy. To be clear, the goal of this experiment is not to achieve state-of-the-art results for this classification task but instead to demonstrate that our proposed algorithms can find optima well outside their initial bounding boxes.

Figure 3 shows that for the Hartmann tests, the Bayesian optimization approaches proposed in this paper work well. The results confirm our hypothesis that the proposed methods are capable of useful exploration outside the initial bounding box. We note that when using the entire unit hypercube as the initial box, all the Bayesian optimization techniques exhibit similar performance, as in this case the optimum is within the box. The Hartmann tests also show that our volume doubling heuristic is
The Hartmann tests also show that our volume doubling heuristic is https://github.com/torch/demos earningrate Momentum L reg. Figure 4: Pairwise scatterplot of selections (upper triangle) and recommendations (lower triangle) for theMLP-MNIST experiment. For example, the second plot of the first row corresponds to a scatter plot of theselected learning rates vs. momentum parameters for a single seed . In contrast, the first plot of the second rowcorresponds to a scatter plot of the recommended learning rates and momentum parameters over all 14 runs .The initial bounding box and sample points (for this particular run) are shown as a black rectangle and blacksquare dots, respectively. All other points respect the color scheme of Figure 3. ( L cropped out for space.) a good baseline method. Although it is less effective than EI-H as the dimensionality increases, it isnonetheless an improvement over standard EI in all cases.The MNIST experiment shows good performance from all three methods EI- { V,H,Q } , particularlyfrom the hinge-quadratic regularized algorithm. Indeed, when compared to the standard EI, EI-Hboasts a 20% improvement in accuracy on the MLP and almost 10% on the CNN.We believe that EI-H performs particularly well in settings where a small initial bounding box isprescribed because the hinge-quadratic regularizer allows the algorithm to explore outward morequickly. In contrast, EI-Q performs better when the optimum is included in the initial box; wesuspect that this is due to the fact that the regularizer avoids selecting boundary and corner points,which EI and EI-V tend to do, as can be seen in Figure 4. Indeed, the green dots (EI-V) follow thecorners of the growing bounding box. In contrast, EI-H/Q do not exhibit this artefact. In this work, we propose a versatile new approach to Bayesian optimization which is not limited to asearch within a bounding box. 
Indeed, given an initial bounding box that does not include the optimum, we have demonstrated that our approach can expand its region of interest and achieve greater function values. Our method fits seamlessly within the current Bayesian optimization framework, and can be readily used with any acquisition function which is induced by a GP.

Finally, this could have implications in a distributed Bayesian optimization scheme whereby multiple Bayesian optimization processes are launched, each responsible for a neighbourhood of the domain. Interactions between these processes could minimize overlaps or alternatively ensure a certain level of redundancy, which would be helpful in a noisy setting.

We emphasize that in this work we have addressed one of the challenges that must be overcome toward the development of practical Bayesian optimization hyperparameter tuning tools. A full solution, however, must also address the issues of dimensionality, non-stationarity, and early stopping. Fortunately, there has been great recent progress along these directions, and the methods proposed in this paper provide a complementary and essential piece of the puzzle.

References

[1] S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, 2013.
[2] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, pages 2546-2554, 2011.
[3] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281-305, 2012.
[4] O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249-2257, 2011.
[5] P. Hennig and C. J. Schuler. Entropy search for information-efficient global optimization. The Journal of Machine Learning Research, pages 1809-1837, 2012.
[6] J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani.
Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems, 2014.
[7] M. W. Hoffman and B. Shahriari. Modular mechanisms for Bayesian optimization. In NIPS Workshop on Bayesian Optimization, 2014.
[8] M. W. Hoffman, B. Shahriari, and N. de Freitas. On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In AI and Statistics, pages 365-374, 2014.
[9] F. Hutter, T. Bartz-Beielstein, H. H. Hoos, K. Leyton-Brown, and K. P. Murphy. Sequential model-based parameter optimisation: an experimental investigation of automated and interactive approaches. In T. Bartz-Beielstein, M. Chiarandini, L. Paquete, and M. Preuss, editors, Empirical Methods for the Analysis of Optimization Algorithms, chapter 15, pages 361-411. Springer, 2010.
[10] D. R. Jones. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21(4):345-383, 2001.
[11] E. Kaufmann, N. Korda, and R. Munos. Thompson sampling: an asymptotically optimal finite-time analysis. In Algorithmic Learning Theory, volume 7568 of Lecture Notes in Computer Science, pages 199-213. Springer Berlin Heidelberg, 2012.
[12] H. J. Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Fluids Engineering, 86(1):97-106, 1964.
[13] N. Mahendran, Z. Wang, F. Hamze, and N. de Freitas. Adaptive MCMC with Bayesian optimization. Journal of Machine Learning Research - Proceedings Track, 22:751-760, 2012.
[14] J. Močkus, V. Tiesis, and A. Žilinskas. The application of Bayesian methods for seeking the extremum. In L. Dixon and G. Szego, editors, Toward Global Optimization, volume 2. Elsevier, 1978.
[15] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
[16] S. L. Scott. A modern Bayesian look at the multi-armed bandit.
Applied Stochastic Models in Business and Industry, 26(6):639-658, 2010.
[17] B. Shahriari, Z. Wang, M. W. Hoffman, A. Bouchard-Côté, and N. de Freitas. An entropy search portfolio. In NIPS Workshop on Bayesian Optimization, 2014.
[18] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951-2959, 2012.
[19] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, M. Ali, R. P. Adams, et al. Scalable Bayesian optimization using deep neural networks. arXiv preprint arXiv:1502.05700, 2015.
[20] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. In International Conference on Machine Learning, pages 1015-1022, 2010.
[21] K. Swersky, J. Snoek, and R. P. Adams. Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems, pages 2004-2012, 2013.
[22] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285-294, 1933.
[23] J. Villemonteix, E. Vazquez, and E. Walter. An informational approach to the global optimization of expensive-to-evaluate functions.