Max-value Entropy Search for Efficient Bayesian Optimization
MMax-value Entropy Search for Efficient Bayesian Optimization
Zi Wang Stefanie Jegelka Abstract
Entropy Search (ES) and Predictive EntropySearch (PES) are popular and empirically suc-cessful Bayesian Optimization techniques. Bothrely on a compelling information-theoretic mo-tivation, and maximize the information gainedabout the arg max of the unknown function; yet,both are plagued by the expensive computationfor estimating entropies. We propose a new crite-rion, Max-value Entropy Search (MES), that in-stead uses the information about the maximumfunction value. We show relations of MES toother Bayesian optimization methods, and es-tablish a regret bound. We observe that MESmaintains or improves the good empirical perfor-mance of ES/PES, while tremendously lighten-ing the computational burden. In particular, MESis much more robust to the number of samplesused for computing the entropy, and hence moreefficient for higher dimensional problems.
1. Introduction
Bayesian optimization (BO) has become a popular and ef-fective way for black-box optimization of nonconvex, ex-pensive functions in robotics, machine learning, computervision, and many other areas of science and engineering(Brochu et al., 2009; Calandra et al., 2014; Krause & Ong,2011; Lizotte et al., 2007; Snoek et al., 2012; Thorntonet al., 2013; Wang et al., 2017). In BO, a prior is posed onthe (unknown) objective function, and the uncertainty givenby the associated posterior is the basis for an acquisitionfunction that guides the selection of the next point to querythe function. The selection of queries and hence the acqui-sition function is critical for the success of the method.Different BO techniques differ in this acquisition function. Computer Science and Artificial Intelligence Laboratory,Massachusetts Institute of Technology, Massachusetts, USA. Cor-respondence to: Zi Wang < [email protected] > , Stefanie Jegelka < [email protected] > . Proceedings of the th International Conference on MachineLearning , Sydney, Australia, PMLR 70, 2017. Copyright 2017by the author(s).
Among the most popular ones range the Gaussian processupper confidence bound (GP-UCB) (Auer, 2002; Srinivaset al., 2010), probability of improvement (PI) (Kushner,1964), and expected improvement (EI) (Mo˘ckus, 1974).Particularly successful recent additions are entropy search(ES) (Hennig & Schuler, 2012) and predictive entropysearch (PES) (Hern´andez-Lobato et al., 2014), which aimto maximize the mutual information between the queriedpoints and the location of the global optimum.ES and PES are effective in the sense that they are query-efficient and identify a good point within competitively fewiterations, but determining the next query point involvesvery expensive computations. As a result, these methodsare most useful if the black-box function requires a lot ofeffort to evaluate, and are relatively slow otherwise. More-over, they rely on estimating the entropy of the arg max ofthe function. In high dimensions, this estimation demandsa large number of samples from the input space, which canquickly become inefficient.We propose a twist to the viewpoint of ES and PES that re-tains the information-theoretic motivation and empiricallysuccessful query-efficiency of those methods, but at a muchreduced computational cost. The key insight is to replacethe uncertainty about the arg max with the uncertaintyabout the maximum function value. As a result, we referto our new method as
Max-value Entropy Search (MES) .As opposed to the arg max , the maximum function valuelives in a one-dimensional space, which greatly facilitatesthe estimation of the mutual information via sampling. Weexplore two strategies to make the entropy estimation ef-ficient: an approximation by a Gumbel distribution, and aMonte Carlo approach that uses random features.Our contributions are as follows: (1) MES, a variant ofthe entropy search methods, which enjoys efficient com-putation and simple implementation; (2) an intuitive analy-sis which establishes the first connection between ES/PESand the previously proposed criteria GP-UCB, PI andEST (Wang et al., 2016), where the bridge is formed byMES; (3) a regret bound for a variant of MES, which, toour knowledge, is the first regret bound established for anyvariant of the entropy search methods; (4) an extension ofMES to the high dimensional settings via additive Gaus-sian processes; and (5) empirical evaluations which demon- a r X i v : . [ s t a t . M L ] J a n ax-value Entropy Search for Efficient Bayesian Optimization strate that MES identifies good points as quickly or betterthan ES/PES, but is much more efficient and robust in es-timating the mutual information, and therefore much fasterthan its input-space counterparts.After acceptance of this work, we learned that Hoffman &Ghahramani (2015) independently arrived at the acquisi-tion function in Eq. (5). Yet, our approximation (Eq. (6))is different, and hence the actual acquisition function weevaluate and analyze is different.
2. Background
Our goal is to maximize a black-box function f : X → R where X ⊂ R d and X is compact. At time step t , we selectpoint x t and observe a possibly noisy function evaluation y t = f ( x t ) + (cid:15) t , where (cid:15) t ∼ N (0 , σ ) are i.i.d. Gaus-sian variables. We use Gaussian processes (Rasmussen &Williams, 2006) to build a probabilistic model of the black-box function to be optimized. For high dimensional cases,we use a variant of the additive Gaussian process (Duve-naud et al., 2011; Kandasamy et al., 2015). For complete-ness, we here introduce some basics of GP and add-GP. Gaussian processes (GPs) are distributions over functions,and popular priors for Bayesian nonparametric regres-sion. In a GP, any finite set of function values has amultivariate Gaussian distribution. A Gaussian process GP ( µ, k ) is fully specified by a mean function µ ( x ) and covariance (kernel) function k ( x , x (cid:48) ) . Let f bea function sampled from GP ( µ, k ) . Given the observa-tions D t = { ( x τ , y τ ) } tτ =1 , we obtain the posterior mean µ t ( x ) = k t ( x ) T ( K t + σ I ) − y t and posterior covariance k t ( x , x (cid:48) ) = k ( x , x (cid:48) ) − k t ( x ) T ( K t + σ I ) − k t ( x (cid:48) ) of thefunction via the kernel matrix K t = [ k ( x i , x j )] x i , x j ∈ D t and k t ( x ) = [ k ( x i , x )] x i ∈ D t (Rasmussen & Williams,2006). The posterior variance is σ t ( x ) = k t ( x , x ) . Additive Gaussian processes (add-GP) were proposedin (Duvenaud et al., 2011), and analyzed in the BO set-ting in (Kandasamy et al., 2015). Following the lat-ter, we assume that the function f is a sum of inde-pendent functions sampled from Gaussian processes thatare active on disjoint sets A m of input dimensions. Pre-cisely, f ( x ) = (cid:80) Mm =1 f ( m ) ( x A m ) , with A i ∩ A j = ∅ for all i (cid:54) = j , | ∪ Mi =1 A i | = d , and f ( m ) ∼ GP ( µ ( m ) , k ( m ) ) , for all m ≤ M ( M ≤ d < ∞ ).As a result of this decomposition, the function f isdistributed according to GP ( (cid:80) Mm =1 µ ( m ) , (cid:80) Mm =1 k ( m ) ) .Given a set of noisy observations D t = { ( x τ , y τ ) } tτ =1 where y τ ∼ N ( f ( x τ ) , σ ) , the posterior mean and covariance of the function component f ( m ) can beinferred as µ ( m ) t ( x ) = k ( m ) t ( x ) T ( K t + σ I ) − y t and k ( m ) t ( x , x (cid:48) ) = k ( m ) ( x , x (cid:48) ) − k ( m ) t ( x ) T ( K t + σ I ) − k ( m ) t ( x (cid:48) ) , where k ( m ) t ( x ) = [ k ( m ) ( x i , x )] x i ∈ D t and K t = (cid:104)(cid:80) Mm =1 k ( m ) ( x i , x j ) (cid:105) x i , x j ∈ D t . For simplicity,we use the shorthand k ( m ) ( x , x (cid:48) ) = k ( m ) ( x A m , x (cid:48) A m ) . We use two types of evaluation criteria for BO, sim-ple regret and inference regret . In each iteration, wechoose to evaluate one input x t to “learn” where the arg max of the function is. The simple regret r T =max x ∈ X f ( x ) − max t ∈ [1 ,T ] f ( x t ) measures the value ofthe best queried point so far. After all queries, we mayinfer an arg max of the function, which is usually chosenas ˜ x T = arg max x ∈ X µ T ( x ) (Hennig & Schuler, 2012;Hern´andez-Lobato et al., 2014). We denote the inferenceregret as R T = max x ∈ X f ( x ) − f (˜ x T ) which character-izes how satisfying our inference of the arg max is.
3. Max-value Entropy Search
Entropy search methods use an information-theoretic per-spective to select where to evaluate. They find a querypoint that maximizes the information about the location x ∗ = arg max x ∈ X f ( x ) whose value y ∗ = f ( x ∗ ) achievesthe global maximum of the function f . Using the negativedifferential entropy of p ( x ∗ | D t ) to characterize the uncer-tainty about x ∗ , ES and PES use the acquisition functions α t ( x ) = I ( { x , y } ; x ∗ | D t ) (1) = H ( p ( x ∗ | D t )) − E [ H ( p ( x ∗ | D t ∪ { x , y } ))] (2) = H ( p ( y | D t , x )) − E [ H ( p ( y | D t , x , x ∗ ))] . (3)ES uses formulation (2), in which the expectation is over p ( y | D t , x ) , while PES uses the equivalent, symmetric for-mulation (3), where the expectation is over p ( x ∗ | D t ) . Un-fortunately, both p ( x ∗ | D t ) and its entropy is analyticallyintractable and have to be approximated via expensive com-putations. Moreover, the optimum may not be unique,adding further complexity to this distribution.We follow the same information-theoretic idea but proposea much cheaper and more robust objective to compute. In-stead of measuring the information about the argmax x ∗ ,we use the information about the maximum value y ∗ = f ( x ∗ ) . Our acquisition function is the gain in mutual in-formation between the maximum y ∗ and the next point wequery, which can be approximated analytically by evaluat-ing the entropy of the predictive distribution: α t ( x ) = I ( { x , y } ; y ∗ | D t ) (4) = H ( p ( y | D t , x )) − E [ H ( p ( y | D t , x , y ∗ ))] (5) ax-value Entropy Search for Efficient Bayesian Optimization ≈ K (cid:88) y ∗ ∈ Y ∗ (cid:20) γ y ∗ ( x ) ψ ( γ y ∗ ( x ))2Ψ( γ y ∗ ( x )) − log(Ψ( γ y ∗ ( x ))) (cid:21) (6)where ψ is the probability density function and Ψ the cu-mulative density function of a normal distribution, and γ y ∗ ( x ) = y ∗ − µ t ( x ) σ t ( x ) . The expectation in Eq. (5) is over p ( y ∗ | D n ) , which is approximated using Monte Carlo esti-mation by sampling a set of K function maxima. Noticethat the probability in the first term p ( y | D t , x ) is a Gaus-sian distribution with mean µ t ( x ) and variance k t ( x , x ) .The probability in the second term p ( y | D n , x , y ∗ ) is atruncated Gaussian distribution: given y ∗ , the distributionof y needs to satisfy y < y ∗ . Importantly, while ESand PES rely on the expensive, d -dimensional distribution p ( x ∗ | D t ) , here, we use the one-dimensional p ( y ∗ | D n ) ,which is computationally much easier.It may not be immediately intuitive that the value shouldbear sufficient information for a good search strategy. Yet,the empirical results in Section 5 will demonstrate that thisstrategy is typically at least as good as ES/PES. From aformal perspective, Wang et al. (2016) showed how an esti-mate of the maximum value implies a good search strategy(EST). Indeed, Lemma 3.1 will make the relation betweenEST and a simpler, degenerate version of MES explicit.Hence, it remains to determine how to sample y ∗ . Wepropose two strategies: (1) sampling from an approxima-tion via a Gumbel distribution; and (2) sampling functionsfrom the posterior Gaussian distribution and maximizingthe functions to obtain samples of y ∗ . We present the MESalgorithm in Alg. 1. The marginal distribution of f ( x ) for any x is a one-dimensional Gaussian, and hence the distribution of y ∗ maybe viewed as the maximum of an infinite collection of de-pendent Gaussian random variables. Since this distributionis difficult to compute, we make two simplifications. First,we replace the continuous set X by a discrete (finite), densesubset ˆ X of representative points. If we select ˆ X to be an (cid:15) -cover of X and the function f is Lipschitz continuous withconstant L , then we obtain a valid upper bound on f ( X ) byadding (cid:15)L to any upper bound on f ( ˆ X ) .Second, we use a “mean field” approximation and treat thefunction values at the points in ˆ X as independent. This ap-proximation tends to over-estimate the maximum; this fol-lows from Slepian’s lemma if k ( x, x (cid:48) ) ≥ . Such upperbounds still lead to optimization strategies with vanishingregret, whereas lower bounds may not (Wang et al., 2016).We sample from the approximation ˆ p ( y ∗ | D n ) via its cu-mulative distribution function (CDF) (cid:99) Pr[ y ∗ < z ] = (cid:81) x ∈ ˆ X Ψ( γ z ( x )) . That means we sample r uniformly from Algorithm 1
Max-value Entropy Search (MES) function MES ( f, D ) for t = 1 , · · · , T do α t − ( · ) ← A PPROX -MI ( D t − ) x t ← arg max x ∈ X α t − ( x ) y t ← f ( x t ) + (cid:15) t , (cid:15) t ∼ N (0 , σ ) D t ← D t − ∪ { x t , y t } end for end function function Approx-MI ( D t ) if Sample with Gumbel then approximate
Pr[ˆ y ∗ < y ] with G ( a, b ) sample a K -length vector r ∼ Unif ([0 , y ∗ ← a − b log( − log r ) else for i = 1 , · · · , K do sample ˜ f ∼ GP ( µ t , k t | D t ) y ∗ ( i ) ← max x ∈ X ˜ f ( x ) end for y ∗ ← [ y ∗ ( i ) ] Ki =1 end if return α t ( · ) in Eq. (6) end function [0 , and find z such that Pr[ y ∗ < z ] = r . A binary searchfor z to accuracy δ requires O (log δ ) queries to the CDF,and each query takes O ( | ˆ X | ) ≈ O ( n d ) time, so we obtainan overall time of O ( M | ˆ X | log δ ) for drawing M samples.To sample more efficiently, we propose a O ( M + | ˆ X | log δ ) -time strategy, by approximating the CDF by aGumbel distribution: (cid:99) Pr[ y ∗ < z ] ≈ G ( a, b ) = e − e − z − ab .This choice is motivated by the Fisher-Tippett-Gnedenkotheorem (Fisher, 1930), which states that the maximum ofa set of i.i.d. Gaussian variables is asymptotically describedby a Gumbel distribution (see the appendix for further de-tails). This does not in general extend to non-i.i.d. Gaus-sian variables, but we nevertheless observe that in practice,this approach yields a good and fast approximation.We sample from the Gumbel distribution via the Gumbelquantile function: we sample r uniformly from [0 , , andlet the sample be y = G − ( a, b ) = a − b log( − log r ) .We set the appropriate Gumbel distribution parameters a and b by percentile matching and solve the two-variablelinear equations a − b log( − log r ) = y and a − b log( − log r ) = y , where Pr[ y ∗ < y ] = r and Pr[ y ∗ < y ] = r . In practice, we use r = 0 . and r = 0 . so that the scale of the approximated Gumbeldistribution is proportional to the interquartile range of theCDF ˆPr[ y ∗ < z ] . ax-value Entropy Search for Efficient Bayesian Optimization y ∗ via Posterior Functions For an alternative sampling strategy we follow (Hern´andez-Lobato et al., 2014): we draw functions from the poste-rior GP and then maximize each of the sampled functions.Given the observations D t = { ( x τ , y τ ) tτ =1 } , we can ap-proximate the posterior Gaussian process using a 1-hidden-layer neural network ˜ f ( x ) = a T t φ ( x ) where φ ( x ) ∈ R D is a vector of feature functions (Neal, 1996; Rahimi et al.,2007) and the Gaussian weight a t ∈ R D is distributed ac-cording to a multivariate Gaussian N ( ν t , Σ t ) . Computing φ ( x ) . By Bochner’s theorem (Rudin,2011), the Fourier transform ˆ k of a continuous andtranslation-invariant kernel k is guaranteed to be a prob-ability distribution. Hence we can write the kernelof the GP to be k ( x , x (cid:48) ) = E ω ∼ ˆ k ( ω ) [ e iω T ( x − x (cid:48) ) ] = E c ∼ U [0 , π ] E ˆ k [2 cos( ω T x + c ) cos( ω T x (cid:48) + c )] and approx-imate the expectation by k ( x , x (cid:48) ) ≈ φ T ( x ) φ ( x (cid:48) ) where φ i ( x ) = (cid:113) D cos( ω T i x + c i ) , ω i ∼ ˆ κ ( ω ) , and c i ∼ U [0 , π ] for i = 1 , . . . , D . Computing ν t , Σ t . By writing the GP as a random linearcombination of feature functions a Tt φ ( x ) , we are defin-ing the mean and covariance of the GP to be µ t ( x ) = ν T φ ( x ) and k ( x , x (cid:48) ) = φ ( x ) T Σ t φ ( x (cid:48) ) . Let Z =[ z , · · · , z t ] ∈ R D × t , where z τ := φ ( x τ ) ∈ R D . TheGP posterior mean and covariance in Section 2.1 become µ t ( x ) = z T Z ( Z T Z + σ I ) − y t and k t ( x , x (cid:48) ) = z T z (cid:48) − z T Z ( Z T Z + σ I ) − Z T z (cid:48) . Because Z ( Z T Z + σ I ) − =( ZZ T + σ I ) − Z , we can simplify the above equations andobtain ν t = σ − Σ t Z t y t and Σ t = ( ZZ T σ − + I ) − .To sample a function from this random 1-hidden-layer neu-ral network, we sample ˜ a from N ( ν t , Σ t ) and constructthe sampled function ˜ f = ˜ a T φ ( x ) . Then we optimize ˜ f with respect to its input to get a sample of the maximum ofthe function max x ∈ X ˜ f ( x ) . As a side effect, our new acquisition function draws con-nections between ES/PES and other popular BO methods.The connection between MES and ES/PES follows fromthe information-theoretic viewpoint; the following lemmamakes the connections to other methods explicit.
Lemma 3.1.
The following methods are equivalent:1. MES, where we only use a single sample y ∗ for α t ( x ) ;2. EST with m = y ∗ ;3. GP-UCB with β = min x ∈ X y ∗ − µ t ( x ) σ t ( x ) ;4. PI with θ = y ∗ . This equivalence no longer holds if we use
M > samplesof y ∗ in MES. Proof.
The equivalence among 2,3,4 is stated in Lemma2.1 in (Wang et al., 2016). What remains to be shownis the equivalence between 1 and 2. When using a sin-gle y ∗ in MES, the next point to evaluate is chosen bymaximizing α t ( x ) = γ y ∗ ( x ) ψ ( γ y ∗ ( x ))2Ψ( γ y ∗ ( x )) − log(Ψ( γ y ∗ ( x ))) and γ y ∗ = y ∗ − µ t ( x ) σ t ( x ) . For EST with m = y ∗ , the nextpoint to evaluate is chosen by minimizing γ y ∗ ( x ) . Let usdefine a function g ( u ) = u ψ ( u )2Ψ( u ) − log(Ψ( u )) . Clearly, α t ( x ) = g ( γ y ∗ ( x )) . Because g ( u ) is a monotonically de-creasing function, maximizing g ( γ y ∗ ( x )) is equivalent tominimizing γ y ∗ ( x ) . Hence 1 and 2 are equivalent. The connection with EST directly leads to a bound on thesimple regret of MES, when using only one sample of y ∗ .We prove Theorem 3.2 in the appendix. Theorem 3.2 (Simple Regret Bound) . Let F be the cumu-lative probability distribution for the maximum of any func-tion f sampled from GP ( µ, k ) over the compact searchspace X ⊂ R d , where k ( x , x (cid:48) ) ≤ , ∀ x , x (cid:48) ∈ X .Let f ∗ = max x ∈ X f ( x ) and w = F ( f ∗ ) ∈ (0 , ,and assume the observation noise is iid N (0 , σ ) . If ineach iteration t , the query point is chosen as x t =arg max x ∈ X γ y t ∗ ( x ) ψ ( γ yt ∗ ( x ))2Ψ( γ yt ∗ ( x )) − log(Ψ( γ y t ∗ ( x ))) , where γ y t ∗ ( x ) = y t ∗ − µ t ( x ) σ t ( x ) and y t ∗ is drawn from F , then withprobability at least − δ , in T (cid:48) = (cid:80) Ti =1 log w δ π i numberof iterations, the simple regret satisfies r T (cid:48) ≤ (cid:114) Cρ T T ( ν t ∗ + ζ T ) (7) where C = 2 / log(1 + σ − ) and ζ T = (2 log( π T δ )) ; π satisfies (cid:80) Ti =1 π − i ≤ and π t > , and t ∗ = arg max t ν t with ν t (cid:44) min x ∈ X ,y t ∗ >f ∗ γ y t ∗ ( x ) , and ρ T is the maximuminformation gain of at most T selected points. In practice we do not know the hyper-parameters of the GP,so we must adapt our GP model as we observe more data.A standard way to learn the GP hyper-parameters is to opti-mize the marginal data likelihood with respect to the hyper-parameters. As a full Bayesian treatment, we can also drawsamples of the hyper-parameters using slice sampling (Van-hatalo et al., 2013), and then marginalize out the hyper-parameters in our acquisition function in Eq. (6). Namely,if we use E to denote the set of sampled settings for the GPhyper-parameters, our acquisition function becomes α t ( x ) = (cid:88) η ∈ E (cid:88) y ∗ ∈ Y ∗ (cid:20) γ ηy ∗ ( x ) ψ ( γ ηy ∗ ( x ))2Ψ( γ ηy ∗ ( x )) − log(Ψ( γ ηy ∗ ( x ))) (cid:21) , ax-value Entropy Search for Efficient Bayesian Optimization where γ ηy ∗ ( x ) = y ∗ − µ ηt ( x ) σ ηt ( x ) and the posterior inference onthe mean function µ ηt and σ ηt depends on the GP hyper-parameter setting η . Similar approaches have been usedin (Hern´andez-Lobato et al., 2014; Snoek et al., 2012).
4. High Dimensional MES with Add-GP
The high-dimensional input setting has been a challengefor many BO methods. We extend MES to this setting viaadditive Gaussian processes (Add-GP). In the past, Add-GP has been used and analyzed for GP-UCB (Kandasamyet al., 2015), which assumed the high dimensional black-box function is a summation of several disjoint lower di-mensional functions. Utilizing this special additive struc-ture, we overcome the statistical problem of having insuffi-cient data to recover a complex function, and the difficultyof optimizing acquisition functions in high dimensions.Since the function components f ( m ) are independent, wecan maximize the mutual information between the input inthe active dimensions A m and maximum of f ( m ) for eachcomponent separately. Hence, we have a separate acquisi-tion function for each component, where y ( m ) is the evalu-ation of f ( m ) : α ( m ) t ( x ) = I ( { x A m , y ( m ) } ; y ( m ) ∗ | D t ) (8) = H ( p ( y ( m ) | D t , x A m )) − E [ H ( p ( y ( m ) | D t , x A m , y ( m ) ∗ ))] (9) ≈ (cid:88) y ( m ) ∗ γ ( m ) y ∗ ( x ) ψ ( γ ( m ) y ∗ ( x ))2Ψ( γ ( m ) y ∗ ( x )) − log(Ψ( γ ( m ) y ∗ ( x ))) (10)where γ ( m ) y ∗ ( x ) = y ( m ) ∗ − µ ( m ) t ( x ) σ ( m ) t ( x ) . Analogously to the non-additive case, we sample y ( m ) ∗ , separately for each functioncomponent. We select the final x t by choosing a sub-vector x ( m ) t ∈ arg max x ( m ) ∈ A m α ( m ) t ( x ( m ) ) and concatenatingthe components. Sampling y ( m ) ∗ with a Gumbel distribution. The Gum-bel sampling from Section 3.1 directly extends to sam-pling y ( m ) ∗ , approximately. We simply need to sam-ple from the component-wise CDF (cid:99) Pr[ y ( m ) ∗ < z ] = (cid:81) x ∈ ˆ X Ψ( γ ( m ) y ( x ))) , and use the same Gumbel approxi-mation. Sampling y ( m ) ∗ via posterior functions. The additivestructure removes some connections on the input-to-hiddenlayer of our 1-hidden-layer neural network approximation ˜ f ( x ) = a T t φ ( x ) . Namely, for each feature function φ thereexists a unique group m such that φ is only active on x A m ,and φ ( x ) = (cid:113) D cos( ω T x A m + c ) where R | A m | (cid:51) ω ∼ ˆ κ ( m ) ( ω ) and c ∼ U [0 , π ] . Similar to the non-additivecase, we may draw a posterior sample a t ∼ N ( ν t , Σ t ) where ν t = σ − Σ t Z t y t and Σ t = ( ZZ T σ − + I ) − .Let B m = { i : φ i ( x ) is active on x A m } . The posteriorsample for the function component f ( m ) is ˜ f ( m ) ( x ) =( a B m t ) T φ B m ( x A m ) . Then we can maximize ˜ f ( m ) to ob-tain a sample for y ( m ) ∗ .The algorithm for the additive max-value entropy searchmethod (add-MES) is shown in Algorithm 2. The functionA PPROX -MI does the pre-computation for approximatingthe mutual information in a similar way as in Algorithm 1,except that it only acts on the active dimensions in the m -thgroup. Algorithm 2
Additive Max-value Entropy Search function Add-MES ( f, D ) for t = 1 , · · · , T do for m = 1 , · · · , M do α ( m ) t − ( · ) ← A PPROX -MI ( D t − ) x A m t ← arg max x Am ∈ X Am α ( m ) t − ( x ) end for y t ← f ( x t ) + (cid:15) t , (cid:15) t ∼ N (0 , σ ) D t ← D t − ∪ { x t , y t } end for end function
5. Experiments
In this section, we probe the empirical performance of MESand add-MES on a variety of tasks. Here, MES-G de-notes MES with y ∗ sampled from the approximate Gumbeldistribution, and MES-R denotes MES with y ∗ computedby maximizing a sampled function represented by randomfeatures. Following (Hennig & Schuler, 2012; Hern´andez-Lobato et al., 2014), we adopt the zero mean function andnon-isotropic squared exponential kernel as the prior forthe GP. We compare to methods from the entropy searchfamily, i.e., ES and PES, and to other popular Bayesian op-timization methods including GP-UCB (denoted by UCB),PI, EI and EST. The parameter for GP-UCB was set ac-cording to Theorem 2 in (Srinivas et al., 2010); the param-eter for PI was set to be the observation noise σ . For thefunctions with unknown GP hyper-parameters, every 10 it-erations, we learn the GP hyper-parameters using the sameapproach as was used by PES (Hern´andez-Lobato et al.,2014). For the high dimensional tasks, we follow (Kan-dasamy et al., 2015) and sample the additive structure/GPparameters with the highest data likelihood when they areunknown. We evaluate performance according to the sim-ple regret and inference regret as defined in Section 2.3.We used the open source Matlab implementation of PES,ES and EST (Hennig & Schuler, 2012; Hern´andez-Lobato ax-value Entropy Search for Efficient Bayesian Optimization (a) (b)
50 100 150 200 t -0.3-0.2-0.100.10.20.30.40.50.60.7 l o g R t
50 100 150 200 t -0.3-0.2-0.100.10.20.30.40.50.60.7 l o g r t UCBPIEIESTESPES 100PES 10PES 1MES-R 100MES-R 10MES-R 1MES-G 100MES-G 10MES-G 1
Figure 1. (a) Inference regret; (b) simple regret. MES methodsare much less sensitive to the number of maxima y ∗ sampled forthe acquisition function (1, 10 or 100) than PES is to the numberof argmaxes x ∗ . Table 1.
The runtime of selecting the next input. PES 100 is sig-nificantly slower than other methods. MES-G’s runtime is com-parable to the fastest method EI while it performs better in termsof simple and inference regrets.M
ETHOD T IME ( S ) M ETHOD T IME ( S )UCB . ± . PES 1 . ± . PI . ± . MES-R 100 . ± . EI . ± . MES-R 10 . ± . EST . ± . MES-R 1 . ± . ES . ± . MES-G 100 . ± . PES 100 . ± . MES-G 10 . ± . PES 10 . ± . MES-G 1 . ± . et al., 2014; Wang et al., 2016). Our Matlab code andtest functions are available at https://github.com/zi-w/Max-value-Entropy-Search/ . We begin with a comparison on synthetic functions sam-pled from a 3-dimensional GP, to probe our conjecture thatMES is much more robust to the number of y ∗ sampled toestimate the acquisition function than PES is to the num-ber of x ∗ samples. For PES, we sample 100 (PES 100),10 (PES 10) and 1 (PES 1) argmaxes for the acquisitionfunction. Similarly, we sample 100, 10, 1 y ∗ values forMES-R and MES-G. We average the results on 100 func-tions sampled from the same Gaussian kernel with scaleparameter . and bandwidth parameter . , and obser-vation noise N (0 , . ) . Figure 1 shows the simple and inference regrets. For bothregret measures, PES is very sensitive to the the numberof x ∗ sampled for the acquisition function: 100 sampleslead to much better results than 10 or 1. In contrast, bothMES-G and MES-R perform competitively even with 1 or10 samples. Overall, MES-G is slightly better than MES-R, and both MES methods performed better than other ESmethods. MES methods performed better than all othermethods with respect to simple regret. For inference re-gret, MES methods performed similarly to EST, and muchbetter than all other methods including PES and ES.In Table 1, we show the runtime of selecting the next in-put per iteration using GP-UCB, PI, EI, EST, ES, PES,MES-R and MES-G on the synthetic data with fixed GPhyper-parameters. For PES and MES-R, every x ∗ or y ∗ re-quires running an optimization sub-procedure, so their run-ning time grows noticeably with the number of samples.MES-G avoids this optimization, and competes with thefastest methods EI and UCB.In the following experiments, we set the number of x ∗ sam-pled for PES to be 200, and the number of y ∗ sampled forMES-R and MES-G to be 100 unless otherwise mentioned. We test on three challenging optimization test functions:the 2-dimensional eggholder function, the 10-dimensionalShekel function and the 10-dimensional Michalewicz func-tion. All of these functions have many local optima. Werandomly sample 1000 points to learn a good GP hyper-parameter setting, and then run the BO methods with thesame hyper-parameters. The first observation is the samefor all methods. We repeat the experiments 10 times. Theaveraged simple regret is shown in the appendix, and theinference regret is shown in Table 2. On the 2-d eggholderfunction, PES was able to achieve better function valuesfaster than all other methods, which verified the good per-formance of PES when sufficiently many x ∗ are sampled.However, for higher-dimensional test functions, the 10-dShekel and 10-d Michalewicz function, MES methods per-formed much better than PES and ES, and MES-G per-formed better than all other methods. Next, we experiment with Levenberg-Marquardt optimiza-tion for training a 1-hidden-layer neural network. The 4 pa-rameters we tune with BO are the number of neurons, thedamping factor µ , the µ -decrease factor, and the µ -increasefactor. We test regression on the Boston housing dataset All the timing experiments were run exclusively on an In-tel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz. The function eval-uation time is excluded. ax-value Entropy Search for Efficient Bayesian Optimization
Table 2.
Inference regret R T for optimizing the eggholder func-tion, Shekel function, and Michalewicz function.M ETHOD E GGHOLDER S HEKEL M ICHALEWICZ
UCB . ± .
96 9 . ± .
26 6 . ± . PI . ± .
03 6 . ± .
00 4 . ± . EI . ± .
18 6 . ± .
87 4 . ± . EST . ± .
85 5 . ± .
56 5 . ± . ES . ± .
11 6 . ± .
73 5 . ± . PES . ± .
05 8 . ± .
67 5 . ± . MES-R . ± .
71 6 . ± .
80 4 . ± . MES-G . ± .
05 5 . ± .
07 4 . ± . (a) (b) t
50 100 150 r t UCBPIEIESTESPESMES-RMES-G t
50 100 150 r t UCBPIEIESTESPESMES-RMES-G
Figure 2.
Tuning hyper-parameters for training a neural network,(a) Boston housing dataset; (b) breast cancer dataset. MES meth-ods perform better than other methods on (a), while for (b), MES-G, UCB, PES perform similarly and better than others. and classification on the breast cancer dataset (Bache &Lichman, 2013). The experiments are repeated 20 times,and the neural network’s weight initialization and all otherparameters are set to be the same to ensure a fair com-parison. Both of the datasets were randomly split intotrain/validation/test sets. We initialize the observation setto have 10 random function evaluations which were set tobe the same across all the methods. The averaged simpleregret for the regression L2-loss on the validation set of theBoston housing dataset is shown in Fig. 2(a), and the classi-fication accuracy on the validation set of the breast cancerdataset is shown in Fig. 2(b). For the classification prob-lem on the breast cancer dataset, MES-G, PES and UCBachieved a similar simple regret. On the Boston housingdataset, MES methods achieved a lower simple regret. Wealso show the inference regrets for both datasets in Table 3.
We use BO to do active learning for the pre-image learningproblem for pushing (Kaelbling & Lozano-P´erez, 2017).The function we optimize takes as input the pushing actionof the robot, and outputs the distance of the pushed objectto the goal location. We use BO to minimize the function in
Table 3.
Inference regret R T for tuning neural network hyper-parameters on the Boston housing and breast cancer datasets.M ETHOD B OSTON C ANCER (%)UCB . ± .
43 3 . ± . PI . ± .
99 4 . ± . EI . ± .
03 4 . ± . EST . ± .
57 3 . ± . ES . ± .
61 4 . ± . PES . ± .
32 3 . ± . MES-R . ± .
56 3 . ± . MES-G . ± .
61 3 . ± . Table 4.
Inference regret R T for action selection in robot pushing.M ETHOD D ACTION D ACTION
UCB . ± .
66 0 . ± . PI . ± .
77 0 . ± . EI . ± .
87 0 . ± . EST . ± .
90 0 . ± . ES . ± .
59 0 . ± . PES . ± .
27 0 . ± . MES-R . ± .
23 0 . ± . MES-G . ± .
26 0 . ± . order to find a good pre-image for pushing the object to thedesignated goal location. The first function we tested hasa 3-dimensional input: robot location ( r x , r y ) and pushingduration t r . We initialize the observation size to be one,the same across all methods. The second function has a4-dimensional input: robot location and angle ( r x , r y , r θ ) ,and pushing duration t r . We initialize the observation to be50 random points and set them the same for all the methods.We select 20 random goal locations for each function totest if BO can learn where to push for these locations. Weshow the simple regret in Fig. 4 and the inference regret inTable 4. MES methods performed on a par with or betterthan their competitors. In this section, we test our add-MES algorithm on highdimensional black-box function optimization problems.First we compare add-MES and add-GP-UCB (Kandasamyet al., 2015) on a set of synthetic additive functions withknown additive structure and GP hyper-parameters. Eachfunction component of the synthetic additive function isactive on at most three input dimensions, and is sampledfrom a GP with zero mean and Gaussian kernel (band-width = . and scale = ). For the parameter of add-GP-UCB, we follow (Kandasamy et al., 2015) and set β ( m ) t = | A m | log 2 t/ . We set the number of y ( m ) ∗ sam-pled for each function component in add-MES-R and add-MES-G to be 1. We repeat each experiment for 50 times ax-value Entropy Search for Efficient Bayesian Optimization t200 400 r t Add-GP-UCBAdd-MES-RAdd-MES-G t200 400 r t Add-GP-UCBAdd-MES-RAdd-MES-G t200 400 r t Add-GP-UCBAdd-MES-RAdd-MES-G t200 400 r t Add-GP-UCBAdd-MES-RAdd-MES-G d=10 d=20 d=30 d=50 d=100 t200 400 r t Add-GP-UCBAdd-MES-RAdd-MES-G
Figure 3.
Simple regrets for add-GP-UCB and add-MES methods on the synthetic add-GP functions. Both add-MES methods outperformadd-GP-UCB except for add-MES-G on the input dimension d = 100 . Add-MES-G achieves the lowest simple regret when d isrelatively low, while for higher d add-MES-R becomes better than add-MES-G. (a) (b) t
10 20 30 40 r t UCBPIEIESTESPESMES-RMES-G t
10 20 30 40 r t UCBPIEIESTESPESMES-RMES-G
Figure 4.
BO for active data selection on two robot pushing tasksfor minimizing the distance to a random goal with (a) 3-D ac-tions and (b) 4-D actions. MES methods perform better than othermethods on the 3-D function. For the 4-D function, MES methodsconverge faster to a good regret, while PI achieves lower regret inthe very end. for each dimension setting. The results for simple regretare shown in Fig. 3. Add-MES methods perform muchbetter than add-GP-UCB in terms of simple regret. In-terestingly, add-MES-G works better in lower dimensionalcases where d = 10 , , , while add-MES-R outperformsboth add-MES-G and add-GP-UCB for higher dimensionswhere d = 50 , . In general, MES-G tends to overesti-mate the maximum of the function because of the indepen-dence assumption, and MES-R tends to underestimate themaximum of the function because of the imperfect globaloptimization of the posterior function samples. We con-jecture that MES-R is better for settings where exploitationis preferred over exploration (e.g., not too many local op-tima), and MES-G works better if exploration is preferred.To further verify the performance of add-MES in high di-mensional problems, we test on two real-world high dimen-sional experiments. One is a function that returns the dis-tance between a goal location and two objects being pushed by a robot which has 14 parameters . The other function re-turns the walking speed of a planar bipedal robot, with 25parameters to tune (Westervelt et al., 2007). In Fig. 5, weshow the simple regrets achieved by add-GP-UCB and add-MES. Add-MES methods performed competitively com-pared to add-GP-UCB on both tasks. t
200 400 r t Add-GP-UCBAdd-MES-RAdd-MES-G t
200 400 r t Add-GP-UCBAdd-MES-RAdd-MES-G (a) (b)
Figure 5.
Simple regrets for add-GP-UCB and add-MES methodson (a) a robot pushing task with 14 parameters and (b) a planarbipedal walker optimization task with 25 parameters. Both MESmethods perform competitively comparing to add-GP-UCB.
6. Conclusion
We proposed a new information-theoretic approach, max-value entropy search (MES), for optimizing expensiveblack-box functions. MES is competitive with or betterthan previous entropy search methods, but at a much lowercomputational cost. Via additive GPs, MES is adaptableto high-dimensional settings. We theoretically connectedMES to other popular Bayesian optimization methods in-cluding entropy search, GP-UCB, PI, and EST, and showeda bound on the simple regret for a variant of MES. Empiri-cally, MES performs well on a variety of tasks. We implemented the function in (Catto, 2011). ax-value Entropy Search for Efficient Bayesian Optimization
Acknowledgements
We thank Prof. Leslie Pack Kaelbling and Prof. Tom´asLozano-P´erez for discussions on active learning andDr. William Huber for his solution to “Extreme ValueTheory - Show: Normal to Gumbel” at stats.stackexchange.com , which leads to our Gumbel ap-proximation in Section 3.1. We gratefully acknowledgesupport from NSF CAREER award 1553284, NSF grants1420927 and 1523767, from ONR grant N00014-14-1-0486, and from ARO grant W911NF1410433. We thankMIT Supercloud and the Lincoln Laboratory Supercom-puting Center for providing computational resources. Anyopinions, findings, and conclusions or recommendationsexpressed in this material are those of the authors and donot necessarily reflect the views of our sponsors.
References
Auer, Peter. Using confidence bounds for exploitation-explorationtradeoffs.
Journal of Machine Learning Research , 3:397–422,2002.Bache, Kevin and Lichman, Moshe. UCI machine learning repos-itory. 2013.Brochu, Eric, Cora, Vlad M, and De Freitas, Nando. A tutorial onBayesian optimization of expensive cost functions, with appli-cation to active user modeling and hierarchical reinforcementlearning. Technical Report TR-2009-023, University of BritishColumbia, 2009.Calandra, Roberto, Seyfarth, Andr´e, Peters, Jan, and Deisenroth,Marc Peter. An experimental comparison of Bayesian opti-mization for bipedal locomotion. In
International Conferenceon Robotics and Automation (ICRA) , 2014.Catto, Erin. Box2D, a 2D physics engine for games. http://box2d.org , 2011.Djolonga, Josip, Krause, Andreas, and Cevher, Volkan. High-dimensional Gaussian process bandits. In
Advances in NeuralInformation Processing Systems (NIPS) , 2013.Duvenaud, David K, Nickisch, Hannes, and Rasmussen, Carl E.Additive Gaussian processes. In
Advances in Neural Informa-tion Processing Systems (NIPS) , 2011.Fisher, Ronald Aylmer.
The genetical theory of natural selection:a complete variorum edition . Oxford University Press, 1930.Hennig, Philipp and Schuler, Christian J. Entropy search forinformation-efficient global optimization.
Journal of MachineLearning Research , 13:1809–1837, 2012.Hern´andez-Lobato, Jos´e Miguel, Hoffman, Matthew W, andGhahramani, Zoubin. Predictive entropy search for efficientglobal optimization of black-box functions. In
Advances inNeural Information Processing Systems (NIPS) , 2014.Hoffman, Matthew W and Ghahramani, Zoubin. Output-spacepredictive entropy search for flexible global optimization. In
NIPS workshop on Bayesian Optimization , 2015. Kaelbling, Leslie Pack and Lozano-P´erez, Tom´as. Learning com-posable models of primitive actions. In
International Confer-ence on Robotics and Automation (ICRA) , 2017.Kandasamy, Kirthevasan, Schneider, Jeff, and Poczos, Barnabas.High dimensional Bayesian optimisation and bandits via addi-tive models. In
International Conference on Machine Learning(ICML) , 2015.Kawaguchi, Kenji, Kaelbling, Leslie Pack, and Lozano-P´erez,Tom´as. Bayesian optimization with exponential convergence.In
Advances in Neural Information Processing Systems (NIPS) ,2015.Kawaguchi, Kenji, Maruyama, Yu, and Zheng, Xiaoyu. Globalcontinuous optimization with error bound and fast conver-gence.
Journal of Artificial Intelligence Research , 56(1):153–195, 2016.Krause, Andreas and Ong, Cheng S. Contextual Gaussian pro-cess bandit optimization. In
Advances in Neural InformationProcessing Systems (NIPS) , 2011.Kushner, Harold J. A new method of locating the maximum pointof an arbitrary multipeak curve in the presence of noise.
Jour-nal of Fluids Engineering , 86(1):97–106, 1964.Li, Chun-Liang, Kandasamy, Kirthevasan, P´oczos, Barnab´as, andSchneider, Jeff. High dimensional Bayesian optimization viarestricted projection pursuit models. In
International Confer-ence on Artificial Intelligence and Statistics (AISTATS) , 2016.Lizotte, Daniel J, Wang, Tao, Bowling, Michael H, and Schuur-mans, Dale. Automatic gait optimization with Gaussian pro-cess regression. In
International Conference on Artificial In-telligence (IJCAI) , 2007.Massart, Pascal.
Concentration Inequalities and Model Selection ,volume 6. Springer, 2007.Mo˘ckus, J. On Bayesian methods for seeking the extremum. In
Optimization Techniques IFIP Technical Conference , 1974.Neal, R.M.
Bayesian Learning for Neural networks . LectureNotes in Statistics 118. Springer, 1996.Rahimi, Ali, Recht, Benjamin, et al. Random features for large-scale kernel machines. In
Advances in Neural Information Pro-cessing Systems (NIPS) , 2007.Rasmussen, Carl Edward and Williams, Christopher KI. Gaussianprocesses for machine learning.
The MIT Press , 2006.Rudin, Walter.
Fourier analysis on groups . John Wiley & Sons,2011.Slepian, David. The one-sided barrier problem for Gaussiannoise.
Bell System Technical Journal , 41(2):463–501, 1962.Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. Practi-cal Bayesian optimization of machine learning algorithms. In
Advances in Neural Information Processing Systems (NIPS) ,2012.Srinivas, Niranjan, Krause, Andreas, Kakade, Sham M, andSeeger, Matthias. Gaussian process optimization in the ban-dit setting: no regret and experimental design. In
InternationalConference on Machine Learning (ICML) , 2010. ax-value Entropy Search for Efficient Bayesian Optimization
Thornton, Chris, Hutter, Frank, Hoos, Holger H, and Leyton-Brown, Kevin. Auto-WEKA: combined selection and hyper-parameter optimization of classification algorithms. In
ACMSIGKDD Conference on Knowledge Discovery and Data Min-ing (KDD) , 2013.Vanhatalo, Jarno, Riihim¨aki, Jaakko, Hartikainen, Jouni, Jyl¨anki,Pasi, Tolvanen, Ville, and Vehtari, Aki. Gpstuff: Bayesianmodeling with gaussian processes.
Journal of Machine Learn-ing Research , 14(Apr):1175–1179, 2013.Von Mises, Richard. La distribution de la plus grande de n valeurs.
Rev. math. Union interbalcanique , 1936.Wang, Zi, Zhou, Bolei, and Jegelka, Stefanie. Optimization asestimation with Gaussian processes in bandit settings. In
In-ternational Conference on Artificial Intelligence and Statistics(AISTATS) , 2016.Wang, Zi, Jegelka, Stefanie, Kaelbling, Leslie Pack, and Lozano-P´erez, Tom´as. Focused model-learning and planning for non-Gaussian continuous state-action systems. In
InternationalConference on Robotics and Automation (ICRA) , 2017.Wang, Ziyu, Zoghi, Masrour, Hutter, Frank, Matheson, David,and De Freitas, Nando. Bayesian optimization in high dimen-sions via random embeddings. In
International Conference onArtificial Intelligence (IJCAI) , 2013.Westervelt, Eric R, Grizzle, Jessy W, Chevallereau, Christine,Choi, Jun Ho, and Morris, Benjamin.
Feedback control of dy-namic bipedal robot locomotion , volume 28. CRC press, 2007.
A. Related work
Our work is largely inspired by the entropy search (ES)methods (Hennig & Schuler, 2012; Hern´andez-Lobatoet al., 2014), which established the information-theoreticview of Bayesian optimization by evaluating the inputs thatare most informative to the arg max of the function we areoptimizing.Our work is also closely related to probability of im-provement (PI) (Kushner, 1964), expected improvement(EI) (Mo˘ckus, 1974), and the BO algorithms using up-per confidence bound to direct the search (Auer, 2002;Kawaguchi et al., 2015; 2016), such as GP-UCB (Srinivaset al., 2010). In (Wang et al., 2016), it was pointed out thatGP-UCB and PI are closely related by exchanging the pa-rameters. Indeed, all these algorithms build in the heuristicthat the next evaluation point needs to be likely to achievethe maximum function value or have high probability ofimproving the current evaluations, which in turn, may alsogive more information on the function optima like how ESmethods queries. These connections become clear as statedin Section 3.1 of our paper.Finding these points that may have good values in highdimensional space is, however, very challenging. In the past, high dimensional BO algorithms were developed un-der various assumptions such as the existence of a lower di-mensional function structure (Djolonga et al., 2013; Wanget al., 2013), or an additive function structure where eachcomponent is only active on a lower manifold of thespace (Li et al., 2016; Kandasamy et al., 2015). In thiswork, we show that our method also works well in highdimensions with the additive assumption made in (Kan-dasamy et al., 2015).
B. Using the Gumbel distribution to sample y ∗ To sample the function maximum y ∗ , our first approachis to approximate the distribution for y ∗ and then samplefrom that distribution. We use independent Gaussians toapproximate the correlated f ( x ) , ∀ x ∈ ˆ X where ˆ X is a dis-cretization of the input search space X (unless X is discrete,in which case ˆ X = X ). A similar approach was adoptedin (Wang et al., 2016). We can show that by assuming { f ( x ) } x ∈ ˆ X , our approximated distribution gives a distri-bution for an upperbound on f ( x ) . Lemma B.1 (Slepian’s Comparison Lemma (Slepian,1962; Massart, 2007)) . Let u , v ∈ R n be two multivariateGaussian random vectors with the same mean and vari-ance, such that E [ v i v j ] ≤ E [ u i u j ] , ∀ i, j. Then for every y Pr[ sup i ∈ [1 ,n ] v i ≤ y ] ≤ Pr[ sup i ∈ [1 ,n ] u i ≤ y ] . By the Slepian’s lemma, if the covariance k t ( x , x (cid:48) ) ≥ , ∀ x , x (cid:48) ∈ ˆ X , using the independent assumption with giveus a distribution on the upperbound ˆ y ∗ of f ( x ) , Pr[ˆ y ∗
Pr[ˆ y ∗ < y ] . In Figure 6, we showan example of the result of the approximation for the dis-tribution of the maximum of f ( x ) ∼ GP ( µ t , k t ) ∀ x ∈ ˆ X given 50 observed data points randomly selected from afunction sample from a GP with 0 mean and Gaussian ker-nel. y P r [ ˆ y ∗ < y ] ExactApprox
Figure 6.
An example of approximating the cumulative probabil-ity of the maximum of independent differently distributed Gaus-sians
Pr[ˆ y ∗ < y ] (Exact) with a Gumbel distribution G ( a, b ) (Ap-prox) via percentile matching. C. Regret bounds
Based on the connection of MES to EST, we show thebound on the learning regret for MES with a point estimatefor α ( x ) . Theorem 3.2 (Simple Regret Bound) . Let F be the cumu-lative probability distribution for the maximum of any func-tion f sampled from GP ( µ, k ) over the compact searchspace X ⊂ R d , where k ( x , x (cid:48) ) ≤ , ∀ x , x (cid:48) ∈ X .Let f ∗ = max x ∈ X f ( x ) and w = F ( f ∗ ) ∈ (0 , ,and assume the observation noise is iid N (0 , σ ) . If ineach iteration t , the query point is chosen as x t =arg max x ∈ X γ y t ∗ ( x ) ψ ( γ yt ∗ ( x ))2Ψ( γ yt ∗ ( x )) − log(Ψ( γ y t ∗ ( x ))) , where γ y t ∗ ( x ) = y t ∗ − µ t ( x ) σ t ( x ) and y t ∗ is drawn from F , then withprobability at least − δ , in T (cid:48) = (cid:80) Ti =1 log w δ π i numberof iterations, the simple regret satisfies r T (cid:48) ≤ (cid:114) Cρ T T ( ν t ∗ + ζ T ) (7) where C = 2 / log(1 + σ − ) and ζ T = (2 log( π T δ )) ; π satisfies (cid:80) Ti =1 π − i ≤ and π t > , and t ∗ = arg max t ν t with ν t (cid:44) min x ∈ X ,y t ∗ >f ∗ γ y t ∗ ( x ) , and ρ T is the maximuminformation gain of at most T selected points. Before we continue to the proof, notice that if the function upper bound ˆ y ∗ is sampled using the approach described inSection 3.1 and k t ( x , x (cid:48) ) ≥ , ∀ x , x (cid:48) ∈ ˆ X , we may still getthe regret guarantee by setting y ∗ = ˆ y ∗ (or y ∗ = ˆ y ∗ + (cid:15)L if X is continuous) since Pr[max ˆ X ≤ y ] ≥ Pr[ˆ y ∗ < y ] .Moreover, Theorem 3.2 assumes y ∗ is sampled from a uni-versal maximum distribution of functions from GP ( µ, k ) ,but it is not hard to see that if we have a distribution ofmaximums adapted from GP ( µ t , k t ) , we can still get thesame regret bound by setting T (cid:48) = (cid:80) Ti =1 log w i δ π i , where w i = F i ( f ∗ ) and F i corresponds to the maximum distri-bution at an iteration where y ∗ > f ∗ . Next we introduce afew lemmas and then prove Theorem 3.2. Lemma C.1 (Lemma 3.2 in (Wang et al., 2016)) . Pick δ ∈ (0 , and set ζ t = (2 log( π t δ )) , where (cid:80) Tt =1 π − t ≤ , π t > . Then, it holds that Pr[ µ t − ( x t ) − f ( x t ) ≤ ζ t σ t − ( x t ) , ∀ t ∈ [1 , T ]] ≥ − δ . Lemma C.2 (Lemma 3.3 in (Wang et al., 2016)) . If µ t − ( x t ) − f ( x t ) ≤ ζ t σ t − ( x t ) , the regret at time step t is upper bounded as ˜ r t ≤ ( ν t + ζ t ) σ t − ( x t ) , where ν t (cid:44) min x ∈ X ˆ m t − µ t − ( x ) σ t − ( x ) , and ˆ m t ≥ max x ∈ X f ( x ) , ∀ t ∈ [1 , T ] . Lemma C.3 (Lemma 5.3 in (Srinivas et al., 2010)) . Theinformation gain for the points selected can be expressed interms of the predictive variances. If f T = ( f ( x t )) ∈ R T : I ( y T ; f T ) = 12 T (cid:88) t =1 log(1 + σ − σ t − ( x t )) . Proof. (Theorem 3.2) By lemma 3.1 in our paper, we knowthat the theoretical results from EST (Wang et al., 2016)can be adapted to MES if y ∗ ≥ f ∗ . The key question iswhen a sampled y ∗ that can satisfy this condition. Becausethe cumulative density w = F ( f ∗ ) ∈ (0 , and y t ∗ areindependent samples from F , there exists at least one y t ∗ that satisfies y t ∗ > f ∗ with probability at least − w k i in k i iterations.Let T (cid:48) = (cid:80) Ti =1 k i be the total number of iterations. Wesplit these iterations to T parts where each part have k i it-erations, i = 1 , · · · , T . By union bound, with probabilityat least − (cid:80) Ti =1 w k i , in all the T parts of iterations, wehave at least one iteration t i which samples y t i ∗ satisfying y t i ∗ > f ∗ , ∀ i = 1 , · · · , T .Let (cid:80) Ti =1 w k i = δ , we can set k i = log w δ π i for any (cid:80) Ti =1 ( π i ) − = 1 . A convenient choice for π i is π i = π i .Hence with probability at least − δ , there exist a sampled y t i ∗ satisfying y t i ∗ > f ∗ , ∀ i = 1 , · · · , T .Now let ζ t i = (2 log π ti δ ) . By Lemma C.1 andLemma C.2, the immediate regret r t i = f ∗ − f ( x t i ) canbe bounded as r t i ≤ ( ν t i + ζ t i ) σ t i − ( x t i ) . ax-value Entropy Search for Efficient Bayesian Optimization t
50 100 150 200 r t UCBPIEIESTESPESMES-RMES-G t
50 100 150 200 r t UCBPIEIESTESPESMES-RMES-G t
50 100 150 200 r t UCBPIEIESTESPESMES-RMES-G (a) (b) (c)
Figure 7. (a) 2-D eggholder function; (b) 10-D Shekel function; (c) 10-D Michalewicz function. PES achieves lower regret on the 2-dfunction while MES-G performed better than other methods on the two 10-d optimization test functions.
Note that by assumption ≤ σ t i − ( x t i ) ≤ , so wehave σ t i − ≤ log(1+ σ − σ ti − ( x ti ))log(1+ σ − ) . Then by Lemma C.3,we have (cid:80) Ti =1 σ t i − ( x t i ) ≤ σ − ) I ( y T ; f T ) where f T = ( f ( x t i )) Ti =1 ∈ R T , y T = ( y t i ) Ti =1 ∈ R T . From as-sumptions, we have I ( y T ; f T ) ≤ ρ T . By Cauchy-Schwarzinequality, (cid:80) Ti =1 σ t i − ( x t i ) ≤ (cid:113) T (cid:80) Ti =1 σ t i − ( x t i ) ≤ (cid:113) T ρ T log(1+ σ − ) . It follows that with probability at least − δ , T (cid:88) i =1 r t i ≤ ( ν t ∗ + ζ T ) (cid:115) T ρ T log(1 + σ − ) . As a result, our learning regret is bounded as r T (cid:48) ≤ T T (cid:88) i =1 r t i ≤ ( ν t ∗ + ζ T ) (cid:115) ρ T T log(1 + σ − ) , where T (cid:48) = (cid:80) Ti =1 k i = (cid:80) Ti =1 log w δ π i is the total numberof iterations.At first sight, it might seem like MES with a point es-timate does not have a converging rate as good as EST or GP − U CB . However, notice that min x ∈ X γ y ( x ) < min x ∈ X γ y ( x ) if y < y , which decides the rate ofconvergence in Eq. 7. So if we use y ∗ that is too large, theregret bound could be worse. If we use y ∗ that is smallerthan f ∗ , however, its value won’t count towards the learn-ing regret in our proof, so it is also bad for the regret upperbound. With no principled way of setting y ∗ since f ∗ is un-known. Our regret bound in Theorem 3.2 is a randomizedtrade-off between sampling large and small y ∗ .For the regret bound in add-GP-MES, it should follow add-GP-UCB. However, because of some technical problems inthe proofs of the regret bound for add-GP-UCB, we haven’tbeen able to show a regret bound for add-GP-MES either.Nevertheless, from the experiments on high dimensionalfunctions, the methods worked well in practice. D. Experiments
In this section, we provide more details on our experiments.
Optimization test functions
In Fig. 7, we show the sim-ple regret comparing BO methods on the three challengingoptimization test functions: the 2-D eggholder function, the10-D Shekel function, and the 10-D Michalewicz function.