Tight Regret Bounds for Bayesian Optimization in One Dimension
Jonathan Scarlett

Abstract
We consider the problem of Bayesian optimization (BO) in one dimension, under a Gaussian process prior and Gaussian sampling noise. We provide a theoretical analysis showing that, under fairly mild technical assumptions on the kernel, the best possible cumulative regret up to time T behaves as Ω(√T) and O(√(T log T)). This gives a tight characterization up to a √(log T) factor, and includes the first non-trivial lower bound for noisy BO. Our assumptions are satisfied, for example, by the squared exponential and Matérn-ν kernels, with the latter requiring ν > 2. Our results certify the near-optimality of existing bounds (Srinivas et al., 2009) for the SE kernel, while proving them to be strictly suboptimal for the Matérn kernel with ν > 2.
1. Introduction
Bayesian optimization (BO) (Shahriari et al., 2016) is a powerful and versatile tool for black-box function optimization, with applications including parameter tuning, robotics, molecular design, sensor networks, and more. The idea is to model the unknown function as a Gaussian process with a given kernel function dictating the smoothness properties. This model is updated using (typically noisy) samples, which are selected to steer towards the function maximum. One of the most attractive properties of BO is its efficiency in terms of the number of function samples used. Consequently, algorithms with rigorous guarantees on the trade-off between samples and optimization performance are particularly valuable. Perhaps the most prominent work in the literature giving such guarantees is that of (Srinivas et al., 2010), which considers the cumulative regret defined below.

Department of Computer Science & Department of Mathematics, National University of Singapore. Correspondence to: Jonathan Scarlett
Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

The cumulative regret is defined as

R_T = Σ_{t=1}^T ( max_x f(x) − f(x_t) ),   (1)

where f is the function being optimized, and x_t is the point chosen at time t. Under a Gaussian process (GP) prior and Gaussian noise, it is shown in (Srinivas et al., 2010) that an algorithm called Gaussian Process Upper Confidence Bound (GP-UCB) achieves a cumulative regret of the form

R_T = O*(√(T γ_T)),   (2)

where γ_T = max_{x_1,...,x_T} I(f; y) (with function values f = (f(x_1), ..., f(x_T)) and noisy samples y = (y_1, ..., y_T)) is known as the maximum information gain. Here I(f; y) denotes the mutual information (Cover & Thomas, 2001) between the function values and noisy samples, and O*(·) denotes asymptotic notation up to logarithmic factors.

The guarantee (2) ensures sub-linear cumulative regret for many kernels of interest. However, the literature is severely lacking in algorithm-independent lower bounds, and without these, it is impossible to know to what extent the upper bounds, including (2), can be improved. In this work, we address this gap in detail in the special case of a one-dimensional function. We show that the best possible cumulative regret behaves as Θ*(√T) under mild assumptions on the kernel, thus identifying both cases where (2) is near-optimal, and cases where it is strictly suboptimal.
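As a concrete illustration of definition (1), the following minimal sketch computes the cumulative regret of a sequence of query points. The objective and the (uniformly random) query rule here are hypothetical placeholders standing in for a GP sample path and a BO algorithm; they are not part of the paper's analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical smooth objective on [0, 1] standing in for a GP sample path.
f = lambda x: np.sin(5.0 * x) * np.exp(-x)

# Dense-grid maximum as a numerical stand-in for max_x f(x).
grid = np.linspace(0.0, 1.0, 10001)
f_max = f(grid).max()

# Placeholder query sequence up to time horizon T.
T = 100
x_chosen = rng.uniform(0.0, 1.0, size=T)

# Cumulative regret as in (1): sum of per-round gaps to the optimum.
R_T = np.sum(f_max - f(x_chosen))
```

A good algorithm would make R_T grow sub-linearly in T; uniform sampling, as here, incurs a constant gap per round and hence linear regret.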
An extensive range of BO algorithms have been proposed in the literature, typically involving the maximization of an acquisition function (Hennig & Schuler, 2012; Hernández-Lobato et al., 2014; Russo & Van Roy, 2014; Wang et al., 2016); see (Shahriari et al., 2016) for a recent overview. As mentioned above, the most relevant algorithm to this work for the noisy setting is GP-UCB (Srinivas et al., 2010), which constructs confidence bounds in which the function lies with high probability, and samples the point with the highest upper confidence bound. Several extensions to GP-UCB have also been proposed, including contextual (Krause & Ong, 2011; Bogunovic et al., 2016a), batch (Contal et al., 2013; Desautels et al., 2014), and high-dimensional (Kandasamy et al., 2015; Rolland et al., 2018) variants.

In the noiseless setting, it has been shown that it is possible to achieve bounded cumulative regret (de Freitas et al., 2012; Kawaguchi et al., 2015) under some technical assumptions. In (de Freitas et al., 2012), this is done by keeping track of a set of potential maximizers, and sampling increasingly finely in order to shrink that set and "zoom in" towards the optimal point. Similar ideas have also been used in the noisy setting for studying batch variants of GP-UCB (Contal et al., 2013), simultaneous online optimization (SOO) methods (Wang et al., 2014), and lookahead algorithms that use confidence bounds (Bogunovic et al., 2016b).
Returning to the noiseless setting, upper and lower bounds were given in (Grünewälder et al., 2010) for kernels satisfying certain smoothness assumptions, with the lower bounds showing that bounded cumulative regret is not always to be expected.

Alongside the Bayesian view of the Gaussian process model, several works have also considered a non-Bayesian counterpart assuming that the function has a bounded norm in the associated reproducing kernel Hilbert space (RKHS). Interestingly, GP-UCB still provides similar guarantees to (2) in this setting (Srinivas et al., 2010). Moreover, lower bounds have been proved; see (Bull, 2011) for the noiseless setting, and (Scarlett et al., 2017) for the noisy setting. In the latter, the lower bounds nearly match the GP-UCB upper bound for the squared exponential (SE) kernel, but gaps remain for the Matérn kernel. For reference, we note that these kernels are defined as follows:

k_SE(x, x') = exp( −||x − x'||² / (2l²) ),   (3)

k_Matérn(x, x') = (2^{1−ν} / Γ(ν)) (√(2ν) ||x − x'|| / l)^ν B_ν(√(2ν) ||x − x'|| / l),   (4)

where l > 0 is a lengthscale parameter, ν > 0 is a smoothness parameter, B_ν is the modified Bessel function, and Γ is the gamma function.

The multi-armed bandit (MAB) (Bubeck & Cesa-Bianchi, 2012) literature has developed alongside the BO literature, with the two often bearing similar concepts. The MAB literature is far too extensive to cover here, but it is worth mentioning that sharp lower bounds are known in numerous settings (Bubeck & Cesa-Bianchi, 2012), and the above-mentioned concept of "zooming in" to the optimal point has also been explored (Kleinberg et al., 2008). To our knowledge, however, none of the existing MAB results are closely related to our own.

The main results of this paper are informally summarized as follows.
Main Results (Informal).
Under mild technical assumptions on the kernel, satisfied (for example) by the SE kernel and Matérn-ν kernel with ν > 2, the best possible cumulative regret of noisy BO in one dimension behaves as Ω(√T) and O(√(T log T)).

Our results have several important implications:

• To our knowledge, our lower bound is the first of any kind in the noisy Bayesian setting, and is tight up to a √(log T) factor under our technical assumptions.

• Our lower bound also establishes the order-optimality of the O*(√T) upper bound of (Srinivas et al., 2010) applied to the SE kernel, up to logarithmic factors.

• On the other hand, our upper bound establishes that the upper bound of (Srinivas et al., 2010) for the Matérn-ν kernel, namely O*(T^{(ν+2)/(2ν+2)}), is strictly suboptimal for ν > 2. For example, if ν = 3, then this is O*(T^{0.625}), as opposed to our upper bound of O*(T^{0.5}). (See also (Shekhar & Javidi, 2017) for recent improvements over (Srinivas et al., 2010) under the Matérn kernel in higher dimensions and/or with smaller ν.)

• Another important implication for the Matérn kernel with ν > 2 is that the Bayesian setting is provably less difficult than the non-Bayesian RKHS counterpart; the latter has cumulative regret Ω(T^{(ν+1)/(2ν+1)}) (Scarlett et al., 2017), which is strictly worse than O(√(T log T)).

Our upper bound is stated formally in Section 3, and its technical assumptions are given in Section 2.1. We build on the ideas of (de Freitas et al., 2012) for the noiseless setting, while addressing highly non-trivial challenges arising in the presence of noise.

Our lower bound is stated formally in Section 4, and its technical assumptions are given in Section 2.1. The analysis is based on a reduction to binary hypothesis testing and an application of Fano's inequality (Cover & Thomas, 2001). This approach is inspired by previous work on lower bounds for stochastic convex optimization (Raginsky & Rakhlin, 2011), but the details are very different.
2. Problem Setup
We seek to sequentially optimize an unknown reward function f(x) over the one-dimensional domain D = [0, 1]; note that any interval can be transformed to this choice via re-scaling. At time t, we query a single point x_t ∈ D and observe a noisy sample y_t = f(x_t) + z_t, where z_t ∼ N(0, σ²) for some noise variance σ² > 0, with independence across different times. We measure the performance using the cumulative regret R_T, defined in (1).

We henceforth assume f to be distributed according to a Gaussian process (GP) (Rasmussen, 2006) having mean zero and kernel function k(x, x'). The posterior distribution of f given the points x_t = [x_1, ..., x_t]^T and observations y_t = [y_1, ..., y_t]^T up to time t is again a GP, with the posterior mean and variance given by (Rasmussen, 2006)

μ_t(x) = k_t(x)^T (K_t + σ² I_t)^{−1} y_t,   (5)
σ_t²(x) = k(x, x) − k_t(x)^T (K_t + σ² I_t)^{−1} k_t(x),   (6)

where k_t(x) = [k(x_i, x)]_{i=1}^t, K_t = [k(x_i, x_j)]_{i,j}, and I_t is the t × t identity matrix.

Here we introduce several assumptions that will be adopted in our main results, some of which were also used in the noiseless setting (de Freitas et al., 2012).
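The posterior updates (5)–(6) are standard GP regression computations and can be sketched in a few lines. The SE kernel and its lengthscale below are illustrative assumptions for the example, not choices made in the paper.

```python
import numpy as np

def se_kernel(x, xp, l=0.2):
    # k(x, x') = exp(-(x - x')^2 / (2 l^2)) for 1-D inputs, so k(x, x) = 1.
    return np.exp(-((x[:, None] - xp[None, :]) ** 2) / (2.0 * l ** 2))

def gp_posterior(x_obs, y_obs, x_query, noise_var=0.01, l=0.2):
    # Posterior mean and variance as in (5)-(6).
    K = se_kernel(x_obs, x_obs, l)             # K_t
    k_star = se_kernel(x_obs, x_query, l)      # column j is k_t(x_j)
    G = np.linalg.solve(K + noise_var * np.eye(len(x_obs)), k_star)
    mu = G.T @ y_obs                           # eq. (5)
    var = 1.0 - np.sum(k_star * G, axis=0)     # eq. (6), using k(x, x) = 1
    return mu, var

# With near-zero noise, the posterior mean interpolates the observations.
x_obs = np.array([0.2, 0.5, 0.8])
y_obs = np.array([0.1, 0.4, 0.2])
mu, var = gp_posterior(x_obs, y_obs, x_obs, noise_var=1e-8)
```

Solving the linear system rather than forming the matrix inverse is the usual numerically stable way to evaluate (5)–(6).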
Assumption 1.
We have the following:

1. The kernel k is stationary, depending on its inputs (x, x') only through τ = x − x';
2. The kernel k satisfies k(x, x') ≤ 1 for all (x, x'), and k(x, x) = 1 for all x ∈ D.

Given the stationarity assumption, the assumptions k(x, x') ≤ 1 and k(x, x) = 1 are without loss of generality, as one can always re-scale the function and adjust the noise variance σ² accordingly.

Next, we give some high-probability assumptions on the random function f itself.

Assumption 2.
There exists a constant δ₁ ∈ (0, 1) such that, with probability at least 1 − δ₁, we have the following:

1. The function f has a unique maximizer x* such that

f(x*) ≥ f(x') + ε   (7)

for any local maximum x' that differs from x*, for some constant ε > 0;
2. The function f is twice differentiable;
3. The function f and its first two derivatives are bounded:

|f(x)| ≤ c₀,  |f'(x)| ≤ c₁,  |f''(x)| ≤ c₂   (8)

for all x ∈ D and some constants (c₀, c₁, c₂). This implies that f is c₁-Lipschitz continuous, and f' is c₂-Lipschitz continuous.

The assumption of a unique maximizer holds with probability one in most non-trivial cases (de Freitas et al., 2012), and (7) simply formally defines the gap to the second-highest peak. Moreover, given twice differentiability, the remaining conditions in (8) are very mild, only requiring that the function value and its derivatives are bounded, and formally defining the corresponding constants.

Next, we provide assumptions regarding the derivatives of f and the resulting Taylor expansions (typically around the optimizer x*). We adopt slightly different assumptions for the upper and lower bounds, starting with the former.

Assumption 3.
There exist constants δ₂ ∈ (0, 1) and ρ ∈ (0, 1/2) such that conditioned on the events in Assumption 2, we have with probability at least 1 − δ₂ that one of the following is true:

1. The maximizer is at an endpoint (i.e., x* = 0 or x* = 1), and satisfies the following locally linear behavior: For all ξ ∈ [0, ρ] (if x* = 0) or ξ ∈ [−ρ, 0] (if x* = 1), it holds that

f(x*) − c₃|ξ| ≥ f(x* + ξ) ≥ f(x*) − c₄|ξ|   (9)

for some constants c₄ ≥ c₃ > 0.
2. The maximizer satisfies x* ∈ (ρ, 1 − ρ), and f satisfies the following locally quadratic behavior: For all ξ ∈ [−ρ, ρ], we have

f(x*) − c₅ξ² ≥ f(x* + ξ) ≥ f(x*) − c₆ξ²   (10)

for some constants c₆ ≥ c₅ > 0.

This assumption is near-identical to the main assumption adopted in the noiseless setting (de Freitas et al., 2012), and is also mild given the assumption of twice differentiability. Indeed, (9) and (10) amount to standard Taylor expansions, with the assumptions c₃ > 0 and c₅ > 0 only requiring a non-vanishing gradient at the endpoint (first case) or a non-vanishing second derivative at the function maximizer (second case). These conditions typically hold with probability one (de Freitas et al., 2012).

The following assumption will be used for the lower bound.

Assumption 4.
There exist constants δ' ∈ (0, 1) and ρ ∈ (0, 1/2) such that conditioned on the events in Assumption 2, both of the following hold with probability at least 1 − δ':

1. For any x ∈ D and ξ ∈ [−ρ, ρ] for which x + ξ ∈ D, we have

ξ·f'(x) + c'₁ξ² ≤ f(x + ξ) − f(x) ≤ ξ·f'(x) + c'₂ξ²   (11)

for some (possibly negative) constants c'₁ and c'₂;
2. The maximizer satisfies x* ∈ (ρ, 1 − ρ), and f satisfies the following for all ξ ∈ [−ρ, ρ]:

f(x*) − c₅ξ² ≥ f(x* + ξ) ≥ f(x*) − c₆ξ²   (12)

for some constants c₆ ≥ c₅ > 0.
Figure 1.
Illustration of some of the main assumptions: The function is bounded within [−c₀, c₀] and its derivative within [−c₁, c₁], the gap to the second-highest peak is at least ε, and the function is locally quadratic for points within a distance ρ of the maximizer.

The first part is similar to (10), but performs a Taylor expansion around an arbitrary point rather than the specific point x*, and the second part is precisely (10). Note, however, that here we are assuming both of the two conditions to hold, rather than one of the two. Hence, we are implicitly assuming that the first item of Assumption 3 does not have a significant probability of occurring. For stationary kernels, the only situations where an endpoint has a high probability of being optimal are those where f varies very slowly (e.g., the SE kernel with a larger lengthscale than the domain width). Such functions are of limited practical interest.

Similarly to the noiseless setting (de Freitas et al., 2012), all of the above assumptions hold for the SE kernel, as well as the Matérn-ν kernel with ν > 2, with the added caveat that δ' in Assumption 4 is a function of the lengthscale and cannot be chosen arbitrarily. Specifically, a smaller lengthscale implies a smaller value of δ'. In contrast, δ₁ and δ₂ in Assumptions 2 and 3 can be made arbitrarily small by suitably changing the constants ε, c₀, c₁, c₂, ρ, and so on. An illustration of some of the main assumptions and their associated constants is given in Figure 1.
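For concreteness, the two kernels in (3)–(4) can be evaluated numerically as follows. This sketch uses `scipy.special.kv` for the modified Bessel function of the second kind; the lengthscale and smoothness values are illustrative assumptions only.

```python
import numpy as np
from scipy.special import gamma, kv  # kv: modified Bessel function, 2nd kind

def k_se(x, xp, l=0.2):
    # Squared exponential kernel, eq. (3).
    return np.exp(-(x - xp) ** 2 / (2.0 * l ** 2))

def k_matern(x, xp, l=0.2, nu=2.5):
    # Matern-nu kernel, eq. (4); equals 1 at tau = 0 by continuity.
    tau = abs(x - xp)
    if tau == 0.0:
        return 1.0
    arg = np.sqrt(2.0 * nu) * tau / l
    return (2.0 ** (1.0 - nu) / gamma(nu)) * arg ** nu * kv(nu, arg)
```

As a sanity check, for ν = 1/2 the Matérn kernel reduces to the exponential kernel exp(−τ/l), which can be verified numerically against the formula above.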
3. Upper Bound
Our upper bound is formally stated as follows.
Theorem 1. (Upper Bound)
Consider the problem of BO in one dimension described in Section 2.1, with time horizon T and noise variance σ² satisfying σ² ≥ c_σ T^{−ζ} for some c_σ > 0 and ζ > 0. Under Assumptions 1, 2, and 3, there exists an algorithm satisfying the following: With probability at least 1 − δ₁ − δ₂ (with respect to the Gaussian process f), the average cumulative regret (averaged over the noisy samples) satisfies

E[R_T] ≤ C (σ √(T log T)).   (13)

Here δ₁ and δ₂ are defined in Assumptions 2 and 3, and C depends only on the constants therein and (c_σ, ζ).

The assumption that σ² ≥ c_σ T^{−ζ} for some (c_σ, ζ) is very mild, since typically σ² is constant with respect to T. The proof of Theorem 1 extends immediately to a high-probability guarantee with respect to both f and the noisy samples (i.e., holding with probability 1 − δ₁ − δ₂ − δ for δ in Lemma 1 below). We have stated the above form for consistency with the lower bound, which will be given in Section 4.

The algorithm considered in the proof of Theorem 1 is described informally in Algorithm 1; the details will be established throughout the proof of Theorem 1, and a complete description is given in Appendix B.
Algorithm 1
Informal description of our algorithm, based on reducing uncertainty in epochs via repeated sampling.
Require:
Domain D, GP prior (μ₀, k), discrete sub-domain L ⊆ D, time horizon T.
Initialize t = 1, epoch number i = 1, potential maximizers M^(0) = L, and target confidence η^(0).
while less than T samples have been taken do
  Set η^(i) = η^(i−1)/2.
  Sample each point within a subset L^(i) ⊆ L repeatedly K^(i) times, where L^(i) and K^(i) are chosen such that after this sampling, all points x ∈ M^(i−1) satisfy upper and lower confidence bounds of the form LCB_t(x) ≤ f(x) ≤ UCB_t(x), with the gap between the two bounded by |UCB_t(x) − LCB_t(x)| ≤ η^(i).
  Update the set of potential maximizers:
    M^(i) = { x ∈ M^(i−1) : UCB_t(x) ≥ max_{x' ∈ M^(i−1)} LCB_t(x') }.
  Increment i.
end while

As in the noiseless setting (de Freitas et al., 2012), the idea is to operate in epochs and sample a set of increasingly closely-packed points L^(i) to reduce the posterior variance, but only within a set of potential maximizers that are updated according to the confidence bounds. As a simple means of bringing the effective noise level down, we perform resampling, i.e., sampling the same point K^(i) times consecutively. In each epoch, we sample enough to be able to produce upper and lower confidence bounds UCB_t(x) and LCB_t(x) that differ by at most a target value η^(i) within M^(i−1), and then the target is halved for the next epoch.

We do not expect our algorithm to perform well in practice by any means, but it still suffices for our purposes in establishing O(√(T log T)) regret. Indeed, we have made no attempt to optimize the corresponding constant factors, and doing so would require more sophisticated techniques. Moreover, the quantities L^(i), K^(i), UCB_t, and LCB_t in Algorithm 1 are chosen as functions of both the kernel and the constants appearing in our assumptions, which limits the algorithm's practical utility even further.
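The epoch structure can be sketched as follows. This is a simplification, not the paper's algorithm: a fixed quadratic test function stands in for the GP sample path, Hoeffding-style confidence intervals from resampling stand in for the kernel-dependent choices of L^(i), K^(i), UCB_t, and LCB_t, and all constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical objective with a unique interior maximizer at x = 0.6;
# sigma2 is the noise variance.
f = lambda x: -(x - 0.6) ** 2
sigma2 = 0.01

L = np.linspace(0.0, 1.0, 51)        # discrete sub-domain
M = np.ones(len(L), dtype=bool)      # potential maximizers M^(i)
eta, T, t = 0.5, 2000, 0

while t < T and M.sum() > 1:
    # Resample each surviving point K times so that a Hoeffding-style
    # confidence width (a stand-in for the paper's UCB/LCB) is below eta.
    K = int(np.ceil(8 * sigma2 * np.log(4 * len(L) * T) / eta ** 2))
    idx = np.flatnonzero(M)
    means = np.array([f(L[j]) + rng.normal(0.0, np.sqrt(sigma2), K).mean()
                      for j in idx])
    t += K * len(idx)
    ucb, lcb = means + eta / 2, means - eta / 2
    # Keep only points whose UCB reaches the best LCB, then halve eta.
    M[idx] = ucb >= lcb.max()
    eta /= 2
```

Note that the point with the highest empirical mean always survives the update, so the surviving set shrinks around the maximizer as η decreases.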
Note, however, that these constants are merely a function of the kernel, and that suitable bounds suffice in place of exact values (e.g., a lower bound on ρ, an upper bound on c₂, etc.).

While our algorithm assumes a known time horizon T (which is used when selecting K^(i); see Appendix B), this assumption can easily be dropped via a standard doubling trick. The details are given in Appendix A.

Here we present two very standard auxiliary lemmas. We begin with a simpler version of the conditions of Srinivas et al. (Srinivas et al., 2010) guaranteeing that the posterior mean and variance provide valid confidence bounds with high probability. The reason for being slightly simpler is that we are considering a fixed time horizon.
Lemma 1.
Fix δ ∈ (0, 1). For any finite set of points L ⊆ D and time horizon T, under the choice β_T = 2 log(|L|·T/δ), it holds that

|f(x) − μ_t(x)| ≤ β_T^{1/2} σ_t(x),  ∀x ∈ L, t = 1, ..., T,   (14)

with probability at least 1 − δ.

Proof. It was shown in (Srinivas et al., 2010) that for fixed x and t, the event |f(x) − μ_t(x)| ≤ β_T^{1/2} σ_t(x) holds with probability at least 1 − e^{−β_T/2}. The lemma follows by substituting the choice of β_T and taking the union bound over the |L|·T values of x and t.

The following lemma is also standard, and has been used (implicitly or explicitly) in the study of multiple algorithms that eliminate suboptimal points based on confidence bounds (de Freitas et al., 2012; Contal et al., 2013; Bogunovic et al., 2016b). For completeness, we give a short proof.

Lemma 2.
Suppose that at time t, for all x within a set of points L̃ ⊆ D, it holds that

LCB_t(x) ≤ f(x) ≤ UCB_t(x)   (15)

for some bounds UCB_t and LCB_t such that

max_{x ∈ L̃} |UCB_t(x) − LCB_t(x)| ≤ η.   (16)

Then any point x ∈ L̃ satisfying f(x) < max_{x' ∈ L̃} f(x') − 2η must also satisfy

UCB_t(x) < max_{x ∈ L̃} LCB_t(x).   (17)

That is, any 2η-suboptimal point can be ruled out according to the confidence bounds (15).

Proof. We have
UCB_t(x) ≤ LCB_t(x) + η   (18)
  ≤ f(x) + η   (19)
  < max_{x' ∈ L̃} f(x') − η   (20)
  ≤ max_{x' ∈ L̃} UCB_t(x') − η   (21)
  ≤ max_{x' ∈ L̃} LCB_t(x'),   (22)

where (18) and (22) follow from (16), (19) and (21) follow from the confidence bounds in (15), and (20) follows from the assumption f(x) < max_{x' ∈ L̃} f(x') − 2η.

Here we provide a high-level outline of the proof of Theorem 1; the details are given in Appendix B.

Algorithm 1 only samples on a discrete sub-domain L. This set is chosen to be a set of regularly-spaced points that are fine enough to ensure that the cumulative regret with respect to max_{x ∈ L} f(x) is within a constant value of the cumulative regret with respect to max_{x ∈ D} f(x). Working with the finite set L helps to simplify the subsequent analysis.

We split the epochs into two classes, which we call early epochs and late epochs. The late epochs are those in which we have shrunk the potential maximizers down enough to be entirely within the locally quadratic region, cf. Figure 1; here we only discuss the second case of Assumption 3, which is the more interesting of the two. Since the width of the locally quadratic region is constant, we can show that this occurs after a finite number of epochs, each lasting for at most O(log T) time. Hence, even if we naively upper bound the instant regret by 2c₀ according to (8), the overall regret incurred within the early epochs is insignificant.

In the later epochs, we exploit the locally quadratic behavior to show that the set of potential maximizers shrinks rapidly, i.e., by a constant factor after each epoch. As a result, we can let the repeatedly-sampled set L^(i) in Algorithm 1 lie within a given interval that similarly shrinks, thereby controlling the number of samples we need to take in the epoch.
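As a quick numeric check of the elimination rule in Lemma 2: given valid confidence bounds of width at most η, every point that is suboptimal by more than 2η has its UCB below the best LCB. The function values and bound widths below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic function values on a hypothetical finite set, and valid
# confidence bounds with total width at most eta.
f_vals = rng.uniform(0.0, 1.0, size=20)
eta = 0.05
a = rng.uniform(0.0, eta / 2, size=20)
b = rng.uniform(0.0, eta / 2, size=20)
lcb, ucb = f_vals - a, f_vals + b   # LCB <= f <= UCB, gap <= eta

# Points suboptimal by more than 2*eta.
suboptimal = f_vals < f_vals.max() - 2 * eta

# Lemma 2: every such point is ruled out by the bounds alone.
assert np.all(ucb[suboptimal] < lcb.max())
```

The assertion holds for any valid bounds, not just this random draw: UCB(x) ≤ f(x) + η/2 < max f − 3η/2, while the best LCB is at least max f − η/2.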
By Lemma 2, after we attain uniform η^(i)-confidence, the instant regret incurred at each time thereafter is at most 2η^(i). Using the fact that η^(i) = η^(0) 2^{−i} and summing over the epochs, we find that the overall regret behaves as in (13).

A notable difficulty that we omitted above is how we attain the confidence bounds in order to update the potential maximizers M^(i). While we directly apply Lemma 1 for the points that were repeatedly sampled, we found it difficult to do this for the non-sampled points. For those, we instead use Lipschitz properties of the function. In the early epochs, we use the global Lipschitz constant c₁ from Assumption 2, whereas in the later epochs, we find a considerably smaller Lipschitz constant due to the locally quadratic behavior.
4. Lower Bound
Our lower bound is formally stated as follows.
Theorem 2. (Lower Bound)
Consider the one-dimensional BO problem from Section 2.1, with time horizon T and noise variance σ² satisfying σ² ≤ c'_σ T^{1−ζ'} for some c'_σ > 0 and ζ' > 0. Under Assumptions 1, 2, and 4, any algorithm must yield the following: With probability at least 1 − δ₁ − δ' (with respect to the Gaussian process f), the average cumulative regret (averaged over the noisy samples) satisfies

E[R_T] ≥ C' (σ √T).   (23)

Here δ₁ and δ' are defined in Assumptions 2 and 4, and C' depends only on the constants therein and (c'_σ, ζ').

The assumption that σ² ≤ c'_σ T^{1−ζ'} for some (c'_σ, ζ') is very mild, since typically σ² is constant with respect to T. The assumption is required to avoid (23) contradicting the trivial O(T) upper bound. We also note that Theorem 2 immediately implies an Ω(σ√T) lower bound on the expected regret E[R_T] with respect to both f and the noisy samples, as long as 1 − δ₁ − δ' > 0. As discussed following Assumption 4, the latter condition is mild.

In the remainder of the section, we introduce some of the main tools and ideas, and then outline the proof. We note that E[R_T] = Ω(1) is trivial, as the average regret of the first sample alone is lower bounded by a constant. As a result, we only need to show that E[R_T] = Ω(σ√T).

Recall that f is a one-dimensional GP on [0, 1] with a stationary kernel k(x, x'). We fix ∆ > 0, and think of the GP as being generated by the following procedure:

1. Generate a GP f₀ with the same kernel on the larger domain [−∆, 1 + ∆];
2. Randomly shift f₀ along the x-axis by +∆ or −∆ with equal probability, to obtain f̃;

Figure 2.
Examples of functions f₊ and f₋ considered in the lower bound. The two are identical up to a small horizontal shift.
3. Let f(x) = f̃(x) for x ∈ [0, 1].

Since the kernel is stationary, the shifting does not affect the distribution, so the induced distribution of f is indeed the desired GP on [0, 1].

We consider a genie argument in which f₀ is revealed to the algorithm. Clearly this additional information can only help the algorithm, so any lower bound still remains valid for the original setting. Stated differently, the algorithm knows that f is either f₊ or f₋, where

f₊(x) = f₀(x + ∆),   (24)
f₋(x) = f₀(x − ∆).   (25)

See Figure 2 for an illustrative example.

This argument allows us to reduce the BO problem to a binary hypothesis test with adaptive sampling, as depicted in Figure 3. The hypothesis, indexed by v ∈ {−, +}, is that the underlying function is f_v. We show that under a suitable choice of ∆, achieving small cumulative regret means that we can construct a decision rule V̂ such that V̂ = v with high probability, i.e., the hypothesis test is successful. The contrapositive statement is then that if the hypothesis test cannot be successful, we cannot achieve small cumulative regret, from which it only remains to prove the former. This idea was used previously for stochastic convex optimization in (Raginsky & Rakhlin, 2011).

In the remainder of the analysis, we implicitly condition on an arbitrary realization of f₀, meaning that all expectations and probabilities are only with respect to the random index V and/or the noise. We also assume that f₀ satisfies the conditions in Assumptions 1, 2, and 4, which holds with probability at least 1 − δ₁ − δ'. For sufficiently small ∆, the same assumptions are directly inherited by f₊ and f₋.
Figure 3.
Illustration of reduction from optimization to binary hypothesis testing. The gray boxes are considered to be fixed, whereas the white boxes are introduced for the purpose of proving the lower bound.
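The shifted-pair construction in (24)–(25) can be sketched as follows, with a smooth bump (an assumption for illustration only) standing in for the GP sample path f₀:

```python
import numpy as np

delta = 0.05
x = np.linspace(0.0, 1.0, 1001)

# Hypothetical smooth stand-in for f0 on the larger domain [-Delta, 1 + Delta].
f0 = lambda u: np.exp(-0.5 * ((u - 0.5) / 0.15) ** 2)

f_plus = f0(x + delta)    # eq. (24)
f_minus = f0(x - delta)   # eq. (25)

# The two functions are identical up to a horizontal shift of 2*Delta,
# so their maximizers are 2*Delta apart while the optimal values coincide.
x_plus = x[np.argmax(f_plus)]
x_minus = x[np.argmax(f_minus)]
```

This makes the difficulty of the hypothesis test concrete: samples taken away from the maximizers carry very little information about which of the two shifts occurred.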
We henceforth assume that ∆ is indeed sufficiently small; we will verify that this is the case when we set its value.

We introduce some further notation. Letting x*₊, x*₋, and x* denote the maximizers of f₊, f₋, and f₀ (which are unique by Assumption 2), we see that Assumption 4 ensures these are in the interior (0, 1), and hence the optimal values coincide: f₊(x*₊) = f₋(x*₋) = f₀(x*) =: f*. To simplify some of the notation, instead of working with these functions directly, we consider the equivalent problem of minimizing the corresponding regret functions:

r₊(x) = f* − f₊(x),   (26)
r₋(x) = f* − f₋(x).   (27)

Indeed, since we assume the algorithm knows f₀ and hence also the optimal value f*, it can always choose to transform the samples as y → f* − y. In this form, we have the convenient normalization r₊(x*₊) = r₋(x*₋) = 0.

We first state the following useful properties of r₊ and r₋.

Lemma 3.
The functions r₊ and r₋ constructed above satisfy the following for sufficiently small ∆ under the conditions in Assumptions 2 and 4:

1. We have for all x ∈ D that

r₊(x) < c₅∆²  ⟹  r₋(x) > c₅∆²,   (28)

where c₅ is defined in Assumption 4.
2. There exists a constant c' > 0 such that, for all x ∈ D,

|r₊(x) − r₋(x)| ≤ c'(∆|x − x*| + ∆²).   (29)
3. There exists a constant c'' > 0 such that, for all x ∈ D,

r₊(x) ≥ c''((x − x*) + ∆)²,
r₋(x) ≥ c''((x − x*) − ∆)².   (30)
See Appendix C.

The first part states that any point can be better than c₅∆²-optimal for at most one of the two functions, the second part shows that the two functions are close for points near x*, and the third part shows that the instant regret is lower bounded by a quadratic function.

The first part of Lemma 3 allows us to bound the cumulative regret using Fano's inequality for binary hypothesis testing with adaptive sampling (Raginsky & Rakhlin, 2011). This inequality lower bounds the success probability of such a hypothesis test in terms of a mutual information quantity (Cover & Thomas, 2001). The resulting lower bound on regret is stated in the following; it is worth noting that the consideration of cumulative regret here provides a distinction from the analogous bound on the instant regret in (Raginsky & Rakhlin, 2011).

Lemma 4.
Under the preceding setup, we have

E[R_T] ≥ (c₅T∆²/2) · H⁻¹( log 2 − I(V; x, y) ),   (31)

where V is equiprobable on {+, −}, and (x, y) are the selected points and samples when the minimization algorithm is applied to r_V. Here H⁻¹: [0, log 2] → [0, 1/2] is the functional inverse of the binary entropy function H(α) = α log(1/α) + (1 − α) log(1/(1 − α)).

Since this result is particularly fundamental to our analysis, we provide a proof at the end of this section.
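The binary entropy function and its functional inverse appearing in (31) are easy to evaluate numerically, for instance by bisection (a sketch; natural logarithms throughout):

```python
import numpy as np

def H2(a):
    # Binary entropy H(a) = a log(1/a) + (1 - a) log(1/(1 - a)), in nats.
    a = np.clip(a, 1e-12, 1.0 - 1e-12)
    return -a * np.log(a) - (1.0 - a) * np.log(1.0 - a)

def H2_inv(h):
    # Functional inverse of H2 on [0, log 2] -> [0, 1/2], via bisection
    # (H2 is increasing on [0, 1/2]).
    lo, hi = 0.0, 0.5
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if H2(mid) < h:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example: if the mutual information is log(2)/2, the Fano-type bound
# gives a non-trivial lower bound on the error probability.
p_err = H2_inv(np.log(2.0) - 0.5 * np.log(2.0))
```

As the mutual information I(V; x, y) approaches log 2, the factor H⁻¹(log 2 − I) vanishes, reflecting that a fully informative sample sequence permits a reliable hypothesis test.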
Here we provide a high-level outline of the proof of Theorem 2; the details are given in Appendix D.

Once the lower bound in Lemma 4 is established, the main technical challenge is upper bounding the mutual information. A useful property called tensorization (e.g., see (Raginsky & Rakhlin, 2011)) allows us to simplify the mutual information with the vectors (x, y) to a sum of mutual informations containing only a single pair (x_t, y_t): I(V; x, y) ≤ Σ_{t=1}^T I(V; y_t | x_t).

Each such mutual information term I(V; y_t | x_t) can further be upper bounded by the KL divergence (Cover & Thomas, 2001) between the observation distributions corresponding to r₊ and r₋, which in turn equals (r₊(x) − r₋(x))²/(2σ²) when x_t = x. By substituting the property (29) given in Lemma 3, we find that I(V; x, y) is upper bounded by a constant times σ⁻²(∆² E[Σ_{t=1}^T |x_t − x*|²] + T∆⁴). If we can further upper bound I(V; x, y) by a constant in (0, log 2), then (31) establishes an Ω(T∆²) lower bound.

We proceed by considering the cases E[R_T] ≥ c''T∆² and E[R_T] < c''T∆² separately, with c'' given in (30). The former case will immediately give the lower bound in Theorem 2 when we set ∆, whereas in the latter case, we can use (30) to show that E[Σ_{t=1}^T |x_t − x*|²] is upper bounded by a constant times T∆², which means that the desired mutual information upper bound (see the previous paragraph) is attained under a choice of ∆ scaling as (σ²/T)^{1/4}. Under this choice, the lower bound E[R_T] = Ω(T∆²) evaluates to Ω(σ√T), as required.
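As a numeric sanity check on this choice of ∆ (with all constants absorbed), the following confirms that ∆ = (σ²/T)^{1/4} makes the mutual-information term T∆⁴/σ² order one, while the regret scale T∆² equals σ√T:

```python
import numpy as np

sigma2, T = 0.25, 10_000           # hypothetical noise variance and horizon
delta = (sigma2 / T) ** 0.25       # the choice of Delta from the outline

# Mutual-information scale: T * Delta^4 / sigma^2 = O(1) by construction.
mi_scale = T * delta ** 4 / sigma2

# Regret scale: T * Delta^2 = sigma * sqrt(T).
regret_scale = T * delta ** 2
```

This is the usual balancing act in Fano-based lower bounds: ∆ is chosen as large as possible subject to the hypothesis test remaining hard.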
As mentioned above, the proof of Lemma 4 follows along the lines of (Raginsky & Rakhlin, 2011), which in turn builds on previous works using Fano's inequality to establish minimax lower bounds in statistical estimation problems; see for example (Yu, 1997).

In the following, we use R_{T,+} = Σ_{t=1}^T r₊(x_t) and R_{T,−} = Σ_{t=1}^T r₋(x_t) to denote the cumulative regret associated with r₊ and r₋, respectively, and we generically write R_{T,v} to denote one of the two with v ∈ {+, −}.

We first use Markov's inequality to write

E[R_T] ≥ (1 − α)(c₄TΔ²/2) · P[R_T ≥ (1 − α)(c₄TΔ²/2)]   (32)

for any α ∈ (0, 1). We proceed by analyzing the probability on the right-hand side.

Recall that V is equiprobable on {+, −}, and (x, y) are generated by running the optimization algorithm on r_V. Given the sequence of inputs x, let V̂ be the index v̂ ∈ {+, −} with the lower cumulative regret R_{T,v̂} = Σ_{t=1}^T r_{v̂}(x_t). By the first part of Lemma 3, at most one of r₊(x) and r₋(x) can be less than c₄Δ² for any given x, so that R_{T,+} + R_{T,−} ≥ c₄TΔ²; hence, R_{T,v} can be less than c₄TΔ²/2 for at most one of the two functions, and if R_{T,v} ≤ (1 − α)c₄TΔ²/2 then we must have V̂ = v. Therefore,

P_v[R_T ≥ (1 − α)c₄TΔ²/2] ≥ P_v[V̂ ≠ v],   (33)

where, here and subsequently, P_v and E_v denote probabilities and expectations when the underlying instant regret function is r_v (i.e., the underlying function that the algorithm seeks to maximize is f_v).

Continuing, we can lower bound the probability appearing in (32) as follows:

P[R_T ≥ (1 − α)c₄TΔ²/2] = (1/2) Σ_{v∈{+,−}} P_v[R_T ≥ (1 − α)c₄TΔ²/2]   (34)
  ≥ (1/2) Σ_{v∈{+,−}} P_v[V̂ ≠ v]   (35)
  ≥ H⁻¹(log 2 − I(V; x, y)),   (36)

where (35) follows from (33), and (36) follows from Fano's inequality for binary hypothesis testing with adaptive sampling (see Eqs. (22) and (24) of (Raginsky & Rakhlin, 2011)). The proof is completed by combining (32) and (36), and recalling that α can be arbitrarily small.
5. Conclusion and Discussion
We have established tight scaling laws on the regret for Bayesian optimization in one dimension, showing that the optimal scaling is Ω(√T) and O(√(T log T)) under mild technical assumptions on the kernel. Our results highlight some limitations of the widespread upper bounds based on the information gain, and also identify cases where the noisy Bayesian setting is provably less difficult than its non-Bayesian RKHS counterpart.

An immediate direction for further work is to sharpen the constant factors in the upper and lower bounds, and to establish whether the upper bound is attained by any algorithm that also provides state-of-the-art performance in practice. We reiterate that our algorithm is certainly not suitable for this purpose, as its cumulative regret contains large constant factors, and the algorithm makes use of a variety of specific constants present in the assumptions (though they are merely a function of the kernel).

We expect our techniques to extend to any constant dimension d ≥ 2; the main ideas from the noiseless upper bound still apply (de Freitas et al., 2012), and in the lower bound we can choose an arbitrary single dimension and introduce a random shift in that direction as per Section 4.1. While these extensions may still yield √T·poly(log T) regret, the dependence on d would be exponential or worse in the upper bound, but constant in the lower bound, with the latter dependence certainly being suboptimal. Multi-dimensional lower bounding techniques based on Fano's inequality (Raginsky & Rakhlin, 2011) may improve the latter to poly(d), but overall, attaining a sharp joint dependence on T and d appears to require different techniques.

Acknowledgments.
I would like to thank Ilija Bogunovic for his helpful comments and suggestions. This work was supported by an NUS startup grant.
References
Bogunovic, I., Scarlett, J., and Cevher, V. Time-varying Gaussian process bandit optimization. In Int. Conf. Art. Intel. Stats. (AISTATS), 2016a.
Bogunovic, I., Scarlett, J., Krause, A., and Cevher, V. Truncated variance reduction: A unified approach to Bayesian optimization and level-set estimation. In Conf. Neur. Inf. Proc. Sys. (NIPS), 2016b.
Bubeck, S. and Cesa-Bianchi, N. Regret Analysis of Stochastic and Nonstochastic Multi-Armed Bandit Problems. Found. Trend. Mach. Learn. Now Publishers, 2012.
Bull, A. D. Convergence rates of efficient global optimization algorithms. J. Mach. Learn. Res., 12:2879–2904, 2011.
Contal, E., Buffoni, D., Robicquet, A., and Vayatis, N. Machine Learning and Knowledge Discovery in Databases, chapter Parallel Gaussian Process Optimization with Upper Confidence Bound and Pure Exploration, pp. 225–240. Springer Berlin Heidelberg, 2013.
Cover, T. M. and Thomas, J. A. Elements of Information Theory. John Wiley & Sons, Inc., 2001.
de Freitas, N., Zoghi, M., and Smola, A. J. Exponential regret bounds for Gaussian process bandits with deterministic observations. In Int. Conf. Mach. Learn. (ICML), 2012.
Desautels, T., Krause, A., and Burdick, J. W. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. J. Mach. Learn. Res., 15(1):3873–3923, 2014.
Grünewälder, S., Audibert, J.-Y., Opper, M., and Shawe-Taylor, J. Regret bounds for Gaussian process bandit problems. In Int. Conf. Art. Intel. Stats. (AISTATS), pp. 273–280, 2010.
Hennig, P. and Schuler, C. J. Entropy search for information-efficient global optimization. J. Mach. Learn. Res., 13(1):1809–1837, 2012.
Hernández-Lobato, J. M., Hoffman, M. W., and Ghahramani, Z. Predictive entropy search for efficient global optimization of black-box functions. In Adv. Neur. Inf. Proc. Sys. (NIPS), pp. 918–926, 2014.
Kandasamy, K., Schneider, J., and Póczos, B. High dimensional Bayesian optimisation and bandits via additive models. In Int. Conf. Mach. Learn. (ICML), 2015.
Kawaguchi, K., Kaelbling, L. P., and Lozano-Pérez, T. Bayesian optimization with exponential convergence. In Conf. Neur. Inf. Proc. Sys. (NIPS), 2015.
Kleinberg, R., Slivkins, A., and Upfal, E. Multi-armed bandits in metric spaces. In Proc. ACM Symp. Theory Comp. (STOC), pp. 681–690, 2008.
Krause, A. and Ong, C. S. Contextual Gaussian process bandit optimization. In Conf. Neur. Inf. Proc. Sys. (NIPS), pp. 2447–2455, 2011.
Raginsky, M. and Rakhlin, A. Information-based complexity, feedback and dynamics in convex programming. IEEE Trans. Inf. Theory, 57(10):7036–7056, 2011.
Rasmussen, C. E. Gaussian Processes for Machine Learning. MIT Press, 2006.
Rolland, P., Scarlett, J., Bogunovic, I., and Cevher, V. High-dimensional Bayesian optimization via additive models with overlapping groups. In Int. Conf. Art. Intel. Stats. (AISTATS), 2018.
Russo, D. and Van Roy, B. Learning to optimize via information-directed sampling. In Conf. Neur. Inf. Proc. Sys. (NIPS), 2014.
Scarlett, J., Bogunovic, I., and Cevher, V. Lower bounds on regret for noisy Gaussian process bandit optimization. In Conf. Learn. Theory (COLT), 2017.
Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and de Freitas, N. Taking the human out of the loop: A review of Bayesian optimization. Proc. IEEE, 104(1):148–175, 2016.
Shekhar, S. and Javidi, T. Gaussian process bandits with adaptive discretization. http://arxiv.org/abs/1712.01447, 2017.
Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. Gaussian process optimization in the bandit setting: No regret and experimental design. In Int. Conf. Mach. Learn. (ICML), 2010.
Wang, Z., Shakibi, B., Jin, L., and de Freitas, N. Bayesian multi-scale optimistic optimization. In Int. Conf. Art. Intel. Stats. (AISTATS), pp. 1005–1014, 2014.
Wang, Z., Zhou, B., and Jegelka, S. Optimization as estimation with Gaussian processes in bandit settings. In Int. Conf. Art. Intel. Stats. (AISTATS), 2016.
Yu, B. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pp. 423–435. Springer, 1997.
Supplementary Material
Tight Regret Bounds for Bayesian Optimization in One Dimension (Jonathan Scarlett, ICML 2018)
A. Doubling Trick for an Unknown Time Horizon
Suppose that we have an algorithm that depends on the time horizon T′ and achieves E[R_{T′}] ≤ C√(T′ log T′) for some C > 0. We show that we can also achieve E[R_T] = O(√(T log T)) when T is unknown.

To see this, fix an arbitrary integer T₀ ∈ [1, T], and repeatedly run the algorithm with fixed time horizons T₀, 2T₀, 4T₀, etc., until T points have been sampled. The number of stages is no more than ℓ_max = ⌈log₂(T/T₀)⌉. Moreover, we have

E[R_T] ≤ Σ_{ℓ=1}^{ℓ_max} C√(2^{ℓ−1}T₀ log T) = C√(T₀ log T) Σ_{ℓ=0}^{⌈log₂(T/T₀)⌉−1} 2^{ℓ/2} ≤ 4C√(T log T),   (37)

where the first inequality uses log(2^{ℓ−1}T₀) ≤ log T, and the last inequality uses Σ_{ℓ=0}^N 2^{ℓ/2} ≤ 4 · 2^{N/2}. This establishes the desired claim.

B. Proof of Theorem 1 (Upper Bound)
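As a numerical sanity check of the doubling-trick bound (37), the following sketch (with illustrative constant C = 1) sums the per-stage guarantees and confirms they never exceed 4C√(T log T):

```python
import math

def doubling_regret_bound(T: int, T0: int, C: float = 1.0) -> float:
    """Sum of per-stage bounds C*sqrt(2^{l-1} * T0 * log T) over the stages."""
    n_stages = math.ceil(math.log2(T / T0))
    return sum(C * math.sqrt(2**(l - 1) * T0 * math.log(T))
               for l in range(1, n_stages + 1))

for T in [10**3, 10**5, 10**7]:
    for T0 in [1, 8, 64]:
        total = doubling_regret_bound(T, T0)
        cap = 4 * math.sqrt(T * math.log(T))   # right-hand side of (37)
        assert total <= cap
print("doubling-trick bound verified")
```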
We continue from the auxiliary results given in Section 3, proceeding in several steps. Algorithm 2 gives a full description of the algorithm; the reader is encouraged to refer to it throughout the proof, rather than trying to understand all of its steps immediately. Note that the constants c₁, c₂, c₃, and ρ used in the algorithm come from Assumptions 2 and 3.

Reduction to a finite domain.
Our algorithm only samples f within a finite set L ⊆ D of pre-defined points. We choose these points to be regularly spaced, and close enough to ensure that the highest function value among them is within 1/T of the maximum f(x*). Under condition (8) in Assumption 2 (which implies that f is c₁-Lipschitz continuous), it suffices to choose

L = ((1/(c₁T)) · Z ∩ [0, 1]) ∪ {1},   (38)

where Z denotes the integers. Here we add x = 1 to L because it will be notationally convenient to ensure that the endpoints {0, 1} are both included in the set. Note that L satisfies |L| ≤ c₁T + 2, which we crudely upper bound by |L| ≤ 2c₁T.

Since max_{x∈L} f(x) ≥ max_{x∈D} f(x) − 1/T, the cumulative regret R^{(L)}_T with respect to the best point in L is such that

R_T ≤ R^{(L)}_T + 1.   (39)

Hence, it suffices to bound R^{(L)}_T instead of R_T. For convenience, we henceforth let x*_L denote an arbitrary input that achieves max_{x∈L} f(x), and we define the instant regret as

r(x) = f(x*) − f(x),  r_t = r(x_t) = f(x*) − f(x_t),  r^{(L)}_t = f(x*_L) − f(x_t).   (40)

Conditioning on high-probability events.
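A quick numerical illustration of the reduction to the grid (38) (the test function and constants here are our own, chosen arbitrarily): the best grid point is within 1/T of the continuous maximum, and the cardinality bound holds:

```python
import math

def build_grid(c1: float, T: int):
    """L = ((1/(c1*T)) * Z ∩ [0, 1]) ∪ {1}, as in (38)."""
    n = math.floor(c1 * T)
    grid = [i / (c1 * T) for i in range(n + 1)]
    if grid[-1] < 1.0:
        grid.append(1.0)   # ensure the endpoint x = 1 is included
    return grid

c1, T = 2.0, 500
f = lambda x: -abs(x - 0.61803)   # 1-Lipschitz, hence also c1-Lipschitz
grid = build_grid(c1, T)
assert len(grid) <= 2 * c1 * T                      # crude bound |L| <= 2*c1*T
assert f(0.61803) - max(f(x) for x in grid) <= 1.0 / T
print(len(grid))
```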
By assumption, the events in Assumptions 2 and 3 simultaneously hold with probability at least 1 − δ₂ − δ₃. Moreover, by setting δ = 1/T² in Lemma 1 and letting L be as in (38) with |L| ≤ 2c₁T, we deduce that (14) holds with probability at least 1 − 1/T² when

β_T = 2 log(2c₁T³).   (41)

Denoting the intersection of all events in Assumptions 2 and 3 by A, and the event in Lemma 1 by B, we can write the average regret given A as follows:

E[R_T | A] = E[R_T | A, B] · P[B | A] + E[R_T | A, B^c] · P[B^c | A]   (42)
  ≤ E[R_T | A, B] + E[R_T | A, B^c] · 1/(T²(1 − δ₂ − δ₃))   (43)
  ≤ E[R_T | A, B] + 2c₁/(T(1 − δ₂ − δ₃)),   (44)
Full description of our algorithm, based on reducing uncertainty in epochs via repeated sampling.
Require:
Domain D, GP prior (μ₀, k), time horizon T, constants c₁, c₂, c₃, ρ.
Set discrete sub-domain L = ((1/(c₁T)) · Z ∩ [0, 1]) ∪ {1}, confidence parameter β_T = 2 log(2c₁T³), initial target confidence η^(0) = c₁, and initial potential maximizers M^(0) = L. Initialize time index t = 1 and epoch number i = 1.
while less than T samples have been taken do
  Set η^(i) = η^(i−1)/2.
  Define the interval I^(i) = [min{x ∈ M^(i−1)}, max{x ∈ M^(i−1)}] ∩ L, and its width w^(i) = max{x ∈ M^(i−1)} − min{x ∈ M^(i−1)}.
  Set the Lipschitz constant:
    L^(i) = c₁ if w^(i) > ρ;
    L^(i) = c₁ if w^(i) ≤ ρ and either 0 ∈ I^(i) or 1 ∈ I^(i);
    L^(i) = c₂w^(i) if w^(i) ≤ ρ and I^(i) ⊆ (0, 1).
  Construct a subset L^(i) ⊆ I^(i) as follows:
  • Initialize L^(i) ← ∅.
  • Construct L̃^(i) (not necessarily a subset of I^(i) or L) containing regularly-spaced points within the interval [min{x ∈ I^(i)}, max{x ∈ I^(i)}], with spacing η^(i)/L^(i).
  • For each x ∈ L̃^(i), add its two nearest points in I^(i) to L^(i).
  Sample each point in L^(i) repeatedly K^(i) times, where K^(i) = ⌈σ²β_T/(η^(i))²⌉. For each sample taken, increment t ← t + 1, and terminate if t > T.
  Update the posterior distribution (μ_{t−1}, σ_{t−1}) according to (5)–(6), with x_{t−1} = [x₁, ..., x_{t−1}]ᵀ and y_{t−1} = [y₁, ..., y_{t−1}]ᵀ respectively containing all the selected points and noisy samples so far.
  For each x ∈ I^(i), set UCB_t(x) = μ_{t−1}(x′) + 2η^(i) and LCB_t(x) = μ_{t−1}(x′) − 2η^(i), where x′ = arg min_{x′∈L^(i)} |x − x′|.
  Update the set of potential maximizers: M^(i) = {x ∈ M^(i−1) : UCB_t(x) ≥ max_{x′∈M^(i−1)} LCB_t(x′)}.
  Increment i.
end while

where (43) follows since P[B | A] ≤ 1 and P[B^c | A] ≤ P[B^c]/P[A] ≤ 1/(T²(1 − δ₂ − δ₃)), and (44) follows since condition (8) in Assumption 2 ensures that R_T ≤ T · 2c₁. By (44), in order to prove Theorem 1, it suffices to show that R_T = O(√(T log T)) whenever the conditions of Assumptions 2–3 and Lemma 1 hold true. We henceforth condition on this being the case.

Sampling mechanism.
Recall that η^(i) represents the target confidence to attain by the end of the i-th epoch, and each such value is half of the previous value. For this interpretation to be valid, η^(0) should be sufficiently large so that the entire function is a priori known up to confidence η^(0); by (8) in Assumption 2, the choice η^(0) = c₁ certainly suffices for this purpose.

In the i-th epoch, we repeatedly sample a sufficiently fine subset of L sufficiently many times to attain an overall confidence of η^(i) within M^(i−1) (with M^(0) = L). Specifically:

• We sample each point K^(i) times and average the resulting observations, yielding an effective noise variance of σ²/K^(i), and we choose K^(i) large enough so that σ²/K^(i) ≤ (η^(i))²/β_T. Hence, K^(i) = ⌈σ²β_T/(η^(i))²⌉ is sufficient.

• To design L^(i) ⊆ L, we consider the interval

I^(i) = [min{x ∈ M^(i−1)}, max{x ∈ M^(i−1)}] ∩ L,   (45)

which is the smallest interval (intersected with L) containing M^(i−1). We select a Lipschitz constant L^(i) (to be specified later) such that f is L^(i)-Lipschitz within I^(i), and then we choose L^(i) ⊆ I^(i) to ensure the following:

Each x ∈ I^(i) is within a distance η^(i)/L^(i) of the nearest x′ ∈ L^(i).   (46)

If we were sampling at arbitrary locations, it would suffice to choose ⌈w^(i)L^(i)/η^(i)⌉ equally-spaced points, where

w^(i) = max{x ∈ M^(i−1)} − min{x ∈ M^(i−1)}   (47)

is the width of the interval. With the restriction of sampling within the fine discretization L, we can simply "round" to the two nearest points, yielding a suitable set L^(i) ⊆ I^(i) of cardinality at most 2⌈w^(i)L^(i)/η^(i)⌉.

Combining these, the total number of samples T^(i) is given by

T^(i) = K^(i) · |L^(i)|   (48)
  ≤ 2 · ⌈σ²β_T/(η^(i))²⌉ · ⌈w^(i)L^(i)/η^(i)⌉.   (49)
At the points that were sampled, we performed enough repetitions to attain a variance of at most (η^(i))²/β_T based on those samples alone. The information from any earlier samples only reduces the variance further, so the overall posterior variance σ²_{t−1}(x) also satisfies β_T^{1/2} σ_{t−1}(x) ≤ η^(i). (We consider (μ_{t−1}, σ_{t−1}) instead of (μ_t, σ_t) because when the time index is t, we have only selected t − 1 points.) Hence, Lemma 1 ensures that at these sampled points, we can set

ŨCB_t(x) = μ_{t−1}(x) + η^(i),  L̃CB_t(x) = μ_{t−1}(x) − η^(i).   (50)

For the points in M^(i−1) that we didn't sample, we note that the following confidence bounds are valid as long as f is L^(i)-Lipschitz continuous within I^(i):

ŨCB_t(x) = μ_{t−1}(x′) + η^(i) + L^(i)|x − x′|,   (51)
L̃CB_t(x) = μ_{t−1}(x′) − η^(i) − L^(i)|x − x′|,   (52)

where x′ = arg min_{x′∈L^(i)} |x − x′| is the closest sampled point to x. If x is itself in L^(i), these expressions reduce to (50). Now, since we have ensured the condition (46), we can weaken (51)–(52) to

UCB_t(x) = μ_{t−1}(x′) + 2η^(i),  LCB_t(x) = μ_{t−1}(x′) − 2η^(i).   (53)

That is, as long as the Lipschitz constant L^(i) is valid, we have 2η^(i)-confidence at the end of the i-th epoch.

Footnote: To give a concrete example, suppose that L = {0, 0.1, ..., 0.9, 1}, and that we seek a set of points such that each x ∈ L is within a distance 1/4 of the nearest one. Without constraints, the points {1/4, 3/4} would suffice, but after rounding these to {0.2, 0.7}, the point x = 1 is at a distance 0.3 > 1/4. However, doubling up and constructing the set {0.2, 0.3, 0.7, 0.8} clearly suffices.
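To make the epoch-wise confidence scheme concrete, here is a heavily simplified sketch of the sampling loop (it samples every surviving grid point directly, omitting the L^(i) rounding construction and the Lipschitz-based spacing, and all constants are illustrative rather than those of Algorithm 2):

```python
import math, random

random.seed(0)
f = lambda x: -(x - 0.3)**2          # toy objective, maximizer x* = 0.3
sigma, beta = 0.05, 10.0             # noise s.d. and an illustrative beta_T
grid = [i / 50 for i in range(51)]   # discretized domain L
M = list(grid)                       # potential maximizers M^(0) = L
eta = 0.5                            # initial target confidence eta^(0)

for epoch in range(6):
    eta /= 2                                     # eta^(i) = eta^(i-1)/2
    K = math.ceil(sigma**2 * beta / eta**2)      # repetitions per point
    # averaged observations: effective noise s.d. is sigma/sqrt(K) <= eta/sqrt(beta)
    est = {x: sum(f(x) + random.gauss(0, sigma) for _ in range(K)) / K
           for x in M}
    best_lcb = max(est[x] - 2 * eta for x in M)
    # keep x only if its UCB is not dominated by the best LCB
    M = [x for x in M if est[x] + 2 * eta >= best_lcb]

best = max(M, key=f)
print(len(M), best)   # surviving points cluster around x* = 0.3
```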
As a result, by Lemma 2, the updated set of potential maximizers

M^(i) = {x ∈ M^(i−1) : UCB_t(x) ≥ max_{x′∈M^(i−1)} LCB_t(x′)},   (54)

with t being the ending time of the epoch, must only contain points within L whose function value is within 4η^(i) of f(x*_L). Below, we will choose L^(i) differently in different epochs, while still ensuring the required Lipschitz condition is valid.

Analysis of early epochs.
Recall the following:

• By Assumption 1, the constant ε lower bounds the separation between f(x*) and the function value at the second highest local maximum (if any).

• By Assumption 3, we either have x* at an endpoint and the locally linear behavior (9), or we have x* ∈ (ρ, 1 − ρ) and the locally quadratic behavior (10).

In the epochs for which w^(i) > ρ, we choose L^(i) = c₁ (cf. (8)), which is clearly a valid Lipschitz constant. We claim that after a finite number of epochs, all points x ∈ M^(i) satisfy f(x) > f(x*) − ε and |x − x*| ≤ ρ/2, and therefore, w^(i) ceases to be greater than ρ. We henceforth distinguish between the two cases using the terminology early epochs and late epochs.

To see that the preceding claim is true, we consider the two cases of Assumption 3:

• In the first case, all points satisfying |x − x*| > ρ/2 are at least min{c₃ρ/2, ε}-suboptimal by the locally linear behavior (9) and the ε gap (7);

• In the second case, all points satisfying |x − x*| > ρ/2 are at least min{c₃ρ²/4, ε}-suboptimal by the locally quadratic behavior (10) and the ε gap (7).

Hence, in either case, all points satisfying |x − x*| > ρ/2 are at least ε′-suboptimal, where ε′ = min{c₃ρ/2, c₃ρ²/4, ε}. As a result, to establish the desired claim, we only need to show that M^(i) contains no points with instant regret r(x) ≥ ε′. Since f(x*_L) ≥ f(x*) − 1/T (as stated following (38)), we find that as long as T > 2/ε′, it suffices that M^(i) only contains points such that r_t^{(L)}(x) ≤ ε′/2. By Lemma 2, this happens as soon as 4η^(i) < ε′/2.
Since ε′ is constant and we halve η^(i) at the end of each epoch, it must be that only a finite number of epochs i_{max,1} pass before this occurs, with i_{max,1} depending only on η^(0) and ε′.

For these early epochs, we simply upper bound w^(i) in (49) by one, meaning their overall cumulative time T_early satisfies

T_early ≤ Σ_{i=1}^{i_{max,1}} T^(i) ≤ 2 i_{max,1} ⌈64σ²β_T/(ε′)²⌉ · ⌈8c₁/ε′⌉,   (55)

where we have used the fact that η^(i) ≥ ε′/8 and L^(i) = c₁ in these epochs.

Analysis of late epochs.
Recall that we consider ourselves in a late epoch as soon as w^(i) ≤ ρ. This condition implies that all points in M^(i−1) are within a distance ρ of x*, yielding the locally linear behavior (9) if x* is an endpoint, and the locally quadratic behavior (10) otherwise. Moreover, Assumption 3 assumes x* ∈ (ρ, 1 − ρ) in the latter case, and as a result, the algorithm can identify which case has occurred: If I^(i) contains an endpoint, then we are in the first case, whereas if I^(i) ⊆ (0, 1), then we are in the second case.

Footnote: It is safe to assume that T is sufficiently large, since the smaller values of T can be handled by increasing C in the theorem statement.

Footnote: Since we condition on the confidence bounds in Lemma 1 being valid, only points that are truly suboptimal are ever ruled out.
Accordingly, the algorithm can choose the Lipschitz constant L^(i) differently in the two cases. In the first case, we simply continue to use the global choice L^(i) = c₁ from (8). In the second case, we observe that f′(x*) = 0, and recall from (8) that f′ is c₂-Lipschitz continuous. Since the width of the interval of interest I^(i) is w^(i), we conclude that |f′(x)| ≤ c₂w^(i) within I^(i), and accordingly, we can set

L^(i) = c₂w^(i).   (56)

We initially focus on this second case (which is the more interesting of the two), and later return to the first case.

Recall that within the i-th epoch, all points with f(x) < f(x*_L) − 4η^(i−1) have already been removed from the potential maximizers (cf. Lemma 2). This implies that the points sampled incur instant regret at most

r_t^{(L)} ≤ 4η^(i−1),   (57)

and hence, since we have established that f(x*_L) ≥ f(x*) − 1/T,

r_t ≤ 4η^(i−1) + 1/T.   (58)

From this fact and the locally quadratic behavior (10), we deduce that the width w^(i) defined in (47) satisfies w^(i) ≤ 2√((8η^(i) + 1/T)/c₃) (since η^(i−1) = 2η^(i)), from which (49) and (56) yield

T^(i) ≤ 2⌈σ²β_T/(η^(i))²⌉ · ⌈(4c₂/c₃)(8 + 1/(Tη^(i)))⌉.   (59)

Grouping all the constants together and writing ⌈z⌉ ≤ z + 1, we can simplify this to

T^(i) ≤ c′(1 + 1/(Tη^(i)) + σ²β_T/(η^(i))² + σ²β_T/(T(η^(i))³))   (60)

for suitably-chosen c′ > 0.

Bounding the cumulative regret.
In the early epochs, we crudely upper bound the regret at each time instant by 2c₁ (cf. (8)). Hence, since the total cumulative time of these epochs satisfies (55) for bounded i_{max,1}, and β_T = O(log T) as per (41), the corresponding total cumulative regret R^{(L)}_early is upper bounded by

R^{(L)}_early ≤ c″(1 + σ² log T)   (61)

for some c″ > 0.

For the late epochs, we make use of the instant regret bound in (57), depending on the epoch index i. Since this upper bound is decreasing in i, and the epoch lengths satisfy (60), we can upper bound R^{(L)}_T by considering the hypothetical case that the epoch lengths are exactly the right-hand side of (60), and the instant regret incurred at time t is exactly r^{(L)}_t = 4η^(i−1). In this situation, we can easily upper bound the total number of epochs: The last epoch must certainly be no larger than i_{max,2}, defined to be the smallest i such that the term c′σ²β_T/(η^(i))² on the right-hand side of (60) is T or higher. Substituting η^(i) = η^(0)2^{−i} and re-arranging, we conclude that

i_{max,2} ≤ log₄ (T(η^(0))²/(c′σ²β_T)) = log₂ √(T(η^(0))²/(c′σ²β_T)).   (62)

For technical reasons, here and subsequently we can assume without loss of generality that σ ≤ κ√(T/log T) for arbitrarily small κ > 0 and sufficiently large T; otherwise, Theorem 1 states the trivial bound E[R_T] ≤ CT. Since β_T = Θ(log T), this technical condition means the right-hand side of (62) exceeds one.
Continuing, the total cumulative regret R^{(L)}_late from the late epochs is upper bounded as follows:

R^{(L)}_late ≤ Σ_{i=1}^{i_{max,2}} 4η^(i−1) T^(i)   (63)
  ≤ 4c′ Σ_{i=1}^{i_{max,2}} η^(i−1) + 8c′ (Σ_{i=1}^{i_{max,2}} 1/T) + 8c′σ²β_T Σ_{i=1}^{i_{max,2}} 1/η^(i) + 8c′(σ²β_T/T) Σ_{i=1}^{i_{max,2}} 1/(η^(i))²   (64)
  ≤ 4c′ i_{max,2}(η^(0) + 2) + 8c′σ²β_T Σ_{i=1}^{i_{max,2}} 1/η^(i) + 8c′(σ²β_T/T) Σ_{i=1}^{i_{max,2}} 1/(η^(i))²   (65)
  ≤ 4c′ i_{max,2}(η^(0) + 2) + (8c′σ²β_T/η^(0)) Σ_{i=1}^{i_{max,2}} 2^i + (8c′σ²β_T/(T(η^(0))²)) Σ_{i=1}^{i_{max,2}} 4^i   (66)
  ≤ 4c′ i_{max,2}(η^(0) + 2) + (16c′σ²β_T/η^(0)) 2^{i_{max,2}} + (16c′σ²β_T/(T(η^(0))²)) 4^{i_{max,2}}   (67)
  ≤ 4c′(η^(0) + 2) log₂ √(T(η^(0))²/(c′σ²β_T)) + 16√(c′σ²β_T T) + 16,   (68)

where (64) follows from (60) and the fact that η^(i−1) = 2η^(i), (65) follows since η^(i−1) ≤ η^(0) and 1/T ≤ 1, (66) follows since η^(i) = η^(0)2^{−i}, (67) follows since Σ_{i=1}^N 2^i ≤ 2·2^N and Σ_{i=1}^N 4^i ≤ 2·4^N, and (68) follows by substituting the upper bound on i_{max,2} from (62). Using the fact that β_T = O(log T), and recalling that η^(0) = c₁ is constant, we simplify (68) to

R^{(L)}_late ≤ c† σ√(T log T)   (69)

for some c† > 0. Note that we can safely drop the O(log(T/(σ²β_T))) = O(log(T/(σ² log T))) term in (68) due to the assumption σ² ≥ c_σT^{−ζ} in Theorem 1.

Handling the first case in Assumption 3.
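The elementary sum bounds used in step (67), together with the overall √(T log T) scaling of the chain (63)–(68), can be verified numerically; the sketch below sets the constants c′ and η^(0) to 1 for illustration:

```python
import math

# Sum bounds used in (67): sum_{i=1}^N 2^i <= 2*2^N and sum_{i=1}^N 4^i <= 2*4^N.
for N in range(1, 30):
    assert sum(2**i for i in range(1, N + 1)) <= 2 * 2**N
    assert sum(4**i for i in range(1, N + 1)) <= 2 * 4**N

def late_regret(T: float, sigma2: float) -> float:
    """Sum over epochs of 4*eta^(i-1)*T^(i), with T^(i) as in (60), c'=eta0=1."""
    beta = 2 * math.log(T)
    i_max = max(1, math.ceil(math.log2(math.sqrt(T / (sigma2 * beta)))))
    total = 0.0
    for i in range(1, i_max + 1):
        eta = 2.0**(-i)
        T_i = 1 + 1 / (T * eta) + sigma2 * beta / eta**2 + sigma2 * beta / (T * eta**3)
        total += 4 * (2 * eta) * T_i          # 4*eta^(i-1) = 8*eta^(i)
    return total

for T in [10**4, 10**6, 10**8]:
    ratio = late_regret(T, 1.0) / math.sqrt(T * math.log(T))
    assert ratio < 40    # a bounded multiple of sqrt(T log T)
print("epoch-sum bounds verified")
```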
From (56) onwards, we focused only on the second case of Assumption 3. In the first case, we have a worse Lipschitz constant L^(i) = c₁, but the width also shrinks faster: By the locally linear behavior (9), achieving η^(i)-confidence not only brings the interval width w^(i) down to at most O(√(η^(i))), but also further down to O(η^(i)). Hence, we lose a factor of √(η^(i)) in the Lipschitz constant, but we gain a factor of √(η^(i)) in the upper bound on w^(i). Since the number of points sampled in (49) contains the product of the two, the final result remains unchanged, i.e., we still have (69), possibly with a different constant c†.

Completion of the proof.
Combining (39), (44), (61) and (69), we obtain

E[R_T] ≤ C†(1 + σ² log T + σ√(T log T))   (70)

for some constant C†. As stated following (62), we can assume without loss of generality that σ ≤ O(√(T/log T)), which means that the third term of (70) dominates the second, and the proof is complete.

C. Proof of Lemma 3
For the first part, we consider Δ sufficiently small so that c₄Δ² < ε, for ε given in Assumption 2 and c₄ in Assumption 4. Since all local maxima are at least ε-suboptimal, achieving r₊(x) < c₄Δ² requires that x lies within a small interval around x*₊. Moreover, the locally quadratic behavior (12) in Assumption 4 yields r₊(x) ≥ c₄(x − x*₊)² within this interval when Δ is sufficiently small. Combining this with r₊(x) < c₄Δ² gives |x − x*₊| < Δ, and since |x*₊ − x*₋| = 2Δ, the triangle inequality yields |x − x*₋| > Δ. Again using (12), we conclude that r₋(x) > c₄Δ², as required.

For the second part, we recall from (24)–(25) that r₊(x) = r(x + Δ) and r₋(x) = r(x − Δ), where r(x) = f(x*) − f(x). Again assuming Δ is sufficiently small (i.e., less than ρ), we can apply the general Taylor expansion according to (11) to obtain

|r₊(x) − r(x)| ≤ Δ|r′(x)| + c_{3,max}Δ²,   (71)
|r₋(x) − r(x)| ≤ Δ|r′(x)| + c_{3,max}Δ²,   (72)

where c_{3,max} = max{|c′₃|, |c″₃|}. Since r′(x) is c₂-Lipschitz continuous (see (8)) and equals zero at x*, we must have |r′(x)| ≤ c₂|x − x*|. Hence, and using the triangle inequality along with (71)–(72), we have

|r₊(x) − r₋(x)| ≤ 2c₂Δ|x − x*| + 2c_{3,max}Δ²,   (73)

which proves (29).

For the third part, we note that since x*₊ = x* − Δ and x*₋ = x* + Δ, the conditions in (30) can be written as

r₊(x) ≥ c″(x − x*₊)²,  r₋(x) ≥ c″(x − x*₋)².   (74)

Using the locally quadratic behavior in (12), we deduce that (30) holds for all x within distance ρ of the respective function optimizer. On the other hand, if the distance from the optimizer is more than ρ, then a combination of (7) and (12) reveals that r₊(x) and r₋(x) are bounded away from zero.
Since the quadratic terms in (30) are also bounded from above due to the fact that x ∈ [0, 1], we conclude that (30) holds for sufficiently small c″.

D. Proof of Theorem 2 (Lower Bound)
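For intuition on Lemma 3, the properties (29)–(30) can be checked explicitly in the model case of an exactly quadratic objective (a toy instance of our own, not one of the kernels treated in the paper); here r(x) = (x − x*)², so r₊(x) = r(x + Δ) and r₋(x) = r(x − Δ), and the constants c′ = 4 and c″ = 1 suffice:

```python
import math

x_star, delta = 0.5, 0.01
r  = lambda x: (x - x_star)**2
rp = lambda x: r(x + delta)          # r_+(x), minimized at x*_+ = x* - delta
rm = lambda x: r(x - delta)          # r_-(x), minimized at x*_- = x* + delta

c_prime = 4.0                        # works for this toy instance
for k in range(1001):
    x = k / 1000
    # Property (29): |r_+ - r_-| <= c'(delta*|x - x*| + delta^2)
    assert abs(rp(x) - rm(x)) <= c_prime * (delta * abs(x - x_star) + delta**2)
    # Property (30) with c'' = 1: quadratic growth around the shifted optimizer
    assert rp(x) >= 1.0 * (x - (x_star - delta))**2 - 1e-12
print("Lemma 3 properties hold for the toy quadratic")
```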
We continue from the reduction to binary hypothesis testing and auxiliary results given in Section 4. These results hold for an arbitrary given (deterministic) BO algorithm, which in general is simply a sequence of mappings that return the next point x_t based on the previous samples y₁, ..., y_{t−1}. Recall also that we implicitly condition on an arbitrary realization of f satisfying the events in Assumptions 2 and 4, meaning that all expectations and probabilities are only with respect to the random index V ∈ {+, −} and/or the noise. We proceed in two main steps.

Bounding the mutual information.
To bound the mutual information term I(V; x, y) appearing in (31), we first apply the following tensorization bound for adaptive sampling, which is based on the chain rule for mutual information (e.g., see (Raginsky & Rakhlin, 2011)):

I(V; x, y) ≤ Σ_{t=1}^T I(V; y_t | x_t).   (75)

It is well known that the conditional mutual information I(V; y_t | x_t = x) is upper bounded by the maximum KL divergence max_{v,v′} D(P_{Y|V,X}(· | v, x) ‖ P_{Y|V,X}(· | v′, x)) between the resulting conditional output distributions P_{Y|V,X} (e.g., see Eq. (31) of (Raginsky & Rakhlin, 2011)). In our setting, there are only two values of v, and since we are considering Gaussian noise, the conditional output distributions are N(r₊(x), σ²) and N(r₋(x), σ²). Using the standard property that the KL divergence between the N(μ₁, σ²) and N(μ₂, σ²) density functions is (μ₁ − μ₂)²/(2σ²), we deduce that

I(V; y_t | x_t = x) ≤ (r₊(x) − r₋(x))²/(2σ²).   (76)

Substituting property (29) in Lemma 3 gives

I(V; y_t | x_t = x) ≤ ((c′)²/(2σ²))(Δ|x − x*| + Δ²)²   (77)
  ≤ ((c′)²/σ²)(Δ²(x − x*)² + Δ⁴),   (78)

where (78) follows since (a + b)² ≤ 2(a² + b²). Averaging over x_t, we obtain I(V; y_t | x_t) ≤ ((c′)²/σ²)(Δ² E[(x_t − x*)²] + Δ⁴), and substitution into (75) gives

I(V; x, y) ≤ ((c′)²/σ²)(Δ² E[Σ_{t=1}^T (x_t − x*)²] + TΔ⁴).   (79)

Footnote: This form of the bound is not stated explicitly in (Raginsky & Rakhlin, 2011). However, Eq. (27) of (Raginsky & Rakhlin, 2011) states that I(V; x, y) ≤ Σ_{t=1}^T I(V; y_t | x^t, y^{t−1}), where x^t = (x₁, ..., x_t) and similarly for y^{t−1}.
Letting H(·) denote the (differential) entropy function (Cover & Thomas, 2001), we obtain (75) by writing I(V; y_t | x^t, y^{t−1}) = H(y_t | x^t, y^{t−1}) − H(y_t | x^t, y^{t−1}, V), applying H(y_t | x^t, y^{t−1}) ≤ H(y_t | x_t) since conditioning reduces entropy, and applying H(y_t | x^t, y^{t−1}, V) = H(y_t | x_t, V) since in our setting y_t depends on (x^t, y^{t−1}, V) only through (x_t, V).
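The Gaussian KL identity used in (76), D(N(μ₁, σ²) ‖ N(μ₂, σ²)) = (μ₁ − μ₂)²/(2σ²), can be confirmed by direct numerical integration (a self-contained check with arbitrary values):

```python
import math

def kl_same_variance_numeric(mu1, mu2, sigma, half_width=12.0, n=200000):
    """Numerically integrate KL(N(mu1, sigma^2) || N(mu2, sigma^2))."""
    def pdf(x, mu):
        return math.exp(-((x - mu)**2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))
    lo, hi = mu1 - half_width * sigma, mu1 + half_width * sigma
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):            # midpoint rule
        x = lo + (i + 0.5) * dx
        p, q = pdf(x, mu1), pdf(x, mu2)
        total += p * math.log(p / q) * dx
    return total

mu1, mu2, sigma = 0.3, 0.1, 0.5
closed_form = (mu1 - mu2)**2 / (2 * sigma**2)
numeric = kl_same_variance_numeric(mu1, mu2, sigma)
print(closed_form, numeric)   # the two agree closely
```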
We consider the cases E [ R T ] ≥ c (cid:48)(cid:48) T ∆ and E [ R T ] < c (cid:48)(cid:48) T ∆ separately, where c (cid:48)(cid:48) is defined inLemma 3. In the former case, we immediately have a lower bound on the average cumulative regret, whereas in the lattercase, the following lemma is useful. Lemma 5. If E [ R T ] < c (cid:48)(cid:48) T ∆ with c (cid:48)(cid:48) defined in Lemma 3, then E (cid:2) (cid:80) Tt =1 | x t − x ∗ | (cid:3) < T ∆ .Proof. Since V is equiprobable on { + , −} , we have E [ R T ] = (cid:88) v ∈{ + , −} E (cid:20) T (cid:88) t =1 r v ( x t ) (cid:12)(cid:12)(cid:12) V = v (cid:21) (80) ≥ c (cid:48)(cid:48) (cid:88) v ∈{ + , −} E (cid:20) T (cid:88) t =1 (( x t − x ∗ ) + v ∆) (cid:12)(cid:12)(cid:12) V = v (cid:21) (81) ≥ c (cid:48)(cid:48) E (cid:20) T (cid:88) t =1 ( x t − x ∗ ) − T (cid:88) t =1 | x t − x ∗ | ∆ + T ∆ (cid:21) , (82)where (81) follows from (30) in Lemma 3, and (82) follows by expanding the square and lower bounding the cross-term byits negative absolute value.Substituting the assumption E [ R T ] < c (cid:48)(cid:48) T ∆ into (82), and canceling the term c (cid:48)(cid:48) T ∆ appearing on both sides, we obtain E (cid:20) T (cid:88) t =1 ( x t − x ∗ ) (cid:21) < E (cid:20) T (cid:88) t =1 | x t − x ∗ | (cid:21) (83) ≤ √ T E (cid:34)(cid:118)(cid:117)(cid:117)(cid:116) T (cid:88) t =1 ( x t − x ∗ ) (cid:35) (84) ≤ √ T (cid:118)(cid:117)(cid:117)(cid:116) E (cid:20) T (cid:88) t =1 ( x t − x ∗ ) (cid:21) , (85)where (84) follows from the Cauchy-Schwartz inequality, and (85) follows from Jensen’s inequality. Solving for E (cid:2) (cid:80) Tt =1 ( x t − x ∗ ) (cid:3) yields the desired claim.In the case E [ R T ] < c (cid:48)(cid:48) T ∆ , we claim that under the choice ∆ = (cid:0) σ (cid:101) CT (cid:1) / with a sufficiently large constant (cid:101) C , it holdsthat E [ R T ] ≥ (cid:101) cσ √ T for some constant (cid:101) c . 
Once this is established, combining the two cases with the choice of Δ gives

E[R_T] ≥ min{c″/√C̃, c̃} σ√T,   (86)

which yields Theorem 2. We also note that by the assumption σ² ≤ c_σT^{1−ζ} in Theorem 2, we have for sufficiently large T that Δ is indeed arbitrarily small under the above choice, as was assumed throughout the proof.

It only remains to establish the claim stated above (86) when E[R_T] < c″TΔ². By Lemma 5, we have E[Σ_{t=1}^T (x_t − x*)²] < 4TΔ², and substitution into (79) gives

I(V; x, y) ≤ 5(c′)² TΔ⁴/σ².   (87)

Since Δ⁴ = σ²/(C̃T), we deduce that I(V; x, y) ≤ (log 2)/4 (say) when C̃ is sufficiently large. As a result, (31) gives E[R_T] ≥ (c₄TΔ²/2) H⁻¹((3 log 2)/4) (note that H⁻¹ is an increasing function). Since TΔ² = σ√(T/C̃), we deduce that E[R_T] ≥ c̃ · σ√T, where c̃ = (c₄/(2√C̃)) H⁻¹((3 log 2)/4). This establishes the desired result.

Footnote: It is safe to assume that T is sufficiently large, since the smaller values of T can be handled by decreasing C″.
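Finally, the Cauchy-Schwarz step (84) in the proof of Lemma 5 — Σ|a_t| ≤ √T·√(Σ a_t²) — can be spot-checked on random sequences (illustrative only):

```python
import math, random

random.seed(1)
for _ in range(100):
    T = random.randint(1, 500)
    a = [random.uniform(-1, 1) for _ in range(T)]
    lhs = sum(abs(v) for v in a)
    rhs = math.sqrt(T) * math.sqrt(sum(v * v for v in a))
    assert lhs <= rhs + 1e-9
print("Cauchy-Schwarz check passed")
```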