Instance-Dependent Bounds for Zeroth-order Lipschitz Optimization with Error Certificates
aa r X i v : . [ m a t h . S T ] F e b The Sample Complexities of Global Lipschitz Optimization
Franc¸ois Bachoc , Tommaso R. Cesari , and S´ebastien Gerchinovitz Institut de Math´ematiques de Toulouse & University Paul Sabatier Toulouse School of Economics IRT Saint Exup´ery & Institut de Math´ematiques de ToulouseFebruary 4, 2021
Abstract
We study the problem of black-box optimization of a Lipschitz function f defined on a compact subset X of R d ,both via algorithms that certify the accuracy of their recommendations and those that do not. We investigate theirsample complexities, i.e., the number of samples needed to either reach or certify a given accuracy ε . We start byproving a tighter bound for the well-known DOO algorithm [Perevozchikov, 1990, Munos, 2011] that matches thebest existing upper bounds for (more computationally challenging) non-certified algorithms. We then introduce andanalyze a new certified version of DOO and prove a matching f -dependent lower bound (up to logarithmic terms).Afterwards, we show that this optimal quantity is proportional to R X d x / ( f ( x ⋆ ) − f ( x ) + ε ) d , solving as a corollarya three-decade-old conjecture by Hansen et al. [1991]. Finally, we show how to control the sample complexity ofstate-of-the-art non-certified algorithms with an integral reminiscent of the Dudley-entropy integral. Keywords: global optimization, bandit optimization.
In this paper, f : X → R denotes a function defined on a compact non-empty subset X of R d . We consider thefollowing global optimization problem: with only black-box access to f , find an ε -optimal point x ⋆n ∈ X of f with aslittle evaluations of f as possible.We make the following weak Lipschitz assumption, where the norm k·k and the constant L are known to thelearner. For some of our results, we will instead assume f to be globally L -Lipschitz. Assumption 1 (Lipschitzness around a maximizer) . The function f attains its maximum at some x ⋆ ∈ X , and thereexist a constant L > and a norm k·k such that, for all x ∈ X , f ( x ) ≥ f ( x ⋆ ) − L k x ⋆ − x k . We study the case in which f is black-box, i.e., except for some a priori knowledge on its smoothness, we canonly access f by sequentially querying its values at a sequence x , x , . . . ∈ [0 , d of points of our choice (seeOnline Protocol 1 below). At every round i ≥ , the query point x i can be chosen as a deterministic functionof the values f ( x ) , . . . , f ( x i − ) observed so far. At the end of round i , the learner outputs a recommendation x ⋆i ∈ X with some other optional information. The goal is to minimize the optimization error (or simple regret ): max x ∈X f ( x ) − f ( x ⋆i ) = f ( x ⋆ ) − f ( x ⋆i ) .In all the sequel, we consider two different types of algorithms:• certified algorithms output an accuracy certificate γ i ∈ { , } along with the recommendation x ⋆i at the end ofeach step i . By definition, certified algorithms take as input an accuracy ε > and are such that f ( x ⋆ ) − f ( x ⋆i ) ≤ ε whenever γ i = 1 , for any function f satisfying Assumption 1. In other words, the accuracy certificate γ i canonly equal if the algorithm can guarantee that the recommendation x ⋆i is ε -optimal;1 nline Protocol 1: Certified/non-certified algorithm input: accuracy ε > ( certified algorithms only ) for i = 1 , , . . . do pick the next query point x i ∈ X observe the value f ( x i ) ∈ R output a recommendation x ⋆i ∈ X output a certificate γ i ∈ { , } ( certified algorithms only )• non-certified algorithms only output the recommendation x ⋆i ∈ X at the end of step i .The goal of finding an ε -optimal point x ⋆n ∈ X of f with as little evaluations of f as possible corresponds tominimizing the sample complexity. More precisely, we associate a natural notion of sample complexity to each of thetwo types of algorithms above. We will discuss existing bounds and algorithms in Section 1.2, and summarize ourmain contributions in Section 1.3. Non-certified sample complexity.
For non-certified algorithms, a natural and classical performance criterion is theminimum number of queries made before outputting ε -optimal recommendations only. More precisely, we define the non-certified sample complexity ζ ( A, f, ε ) of a non-certified algorithm A for a function f with accuracy ε > as ζ ( A, f, ε ) := inf { n ≥ ∀ i ≥ n, x ⋆i ∈ X ε } ∈ { , , . . . } ∪ { + ∞} . (1) Certified sample complexity.
For certified algorithms, it is more natural to look at a notion of sample complexitythat does not depend on a oracle knowing f but instead that can be computed on the basis of the outputs of thealgorithm only. We define the certified sample complexity σ ( A, f, ε ) of a certified algorithm A for a target function f with accuracy ε > as σ ( A, f, ε ) = inf { i ≥ γ i = 1 } ∈ { , , . . . } ∪ { + ∞} . (2)This corresponds to the first time when we can stop the algorithm while being sure to have an ε -optimal recommenda-tion x ⋆i . The most naive approach to output an ε -optimal point knowing that f satisfies Assumption 1 is to perform a uniformgrid search in X with step-size ≈ ε/L , requiring roughly ( L/ε ) d evaluations of f . Though this approach is optimal inthe worst case (see, e.g., Theorem 1.1.2 by Nesterov 2003 where a small Lipschitz bump function f is adversariallyconstructed for any given algorithm), it is clearly suboptimal for functions that take “very suboptimal” values for a“large fraction” of the domain X . For such more benign functions, sequential algorithms can reasonably quickly ruleout very suboptimal regions and then mostly explore closer-to-optimality regions. Indeed many papers have consideredsequential algorithms with improved bounds for benign functions, as discussed below.Many of these bounds rely on the notions of sets of near-optimal points and of packing numbers, which we recallfirst. For all ε > , the set of ε - optimal points of f is defined by X ε := (cid:8) x ∈ X : f ( x ) ≥ f ( x ⋆ ) − ε (cid:9) . Wealso denote its complement (i.e., the set of ε - suboptimal points ) by X cε and, for all ≤ a < b , the ( a, b ] - layer by X ( a,b ] := X ca ∩ X b = (cid:8) x ∈ X : a < f ( x ⋆ ) − f ( x ) ≤ b (cid:9) (i.e., the set of points that are b -optimal but a -suboptimal).Since f is L -Lipschitz around x ⋆ , every point in X is ε -optimal, with ε defined by ε := L sup x , y ∈X k x − y k . Inother words, X ε = X . For this reason, without loss of generality we will only consider values of accuracy ε smallerthan or equal to ε .Recall also that for any bounded set A ⊂ R d and any real number r > , the r - packing number of A is the largestcardinality of an r -packing of A , that is, N ( A, r ) := sup (cid:8) k ∈ N ∗ : ∃ x , . . . , x k ∈ A, min i = j k x i − x j k > r (cid:9) if A is nonempty, zero otherwise. Well-known and useful properties of packing (and covering) numbers are recalled inAppendix D. 2 ounds on the non-certified sample complexity. Several (non worst-case) f -dependent upper bounds involvingsets of ε -optimal points X ε or layers X ( a,b ] of f have been derived over the past decades for the non-certified samplecomplexity. For instance, for the DOO algorithm, Munos [2011, Theorem 1] proved a bound roughly of the form P δ − ( ε ) i =1 N ( X δ ( i ) , cδ ( i )) for some constant c > independent of f , and where the sequence i δ ( i ) is decreasing.In the special case where δ ( i ) ≈ γ i with γ ∈ (0 , , and for (weakly) Lipschitz functions f such that N ( X ε , cε ) . (1 /ε ) d ⋆ with d ⋆ ∈ [0 , d ] (we say that f has near-optimality dimension smaller than d ⋆ ), this non-certified samplecomplexity bound implies a bound roughly of log(1 /ε ) if d ⋆ = 0 or (1 /ε ) d ⋆ if d ⋆ > . A similar bound wasproved by Perevozchikov [1990] for a branch-and-bound algorithm close to the DOO algorithm, with an assumptionon the volume of X ε (instead of a packing number), or by Malherbe and Vayatis [2017] for a stochastic version of thePiyavskii-Shubert algorithm under a stronger assumption on the shape of f around its maximizer.The matching worst-case lower bounds of Nesterov [2003, Theorem 1.1.2] when d ⋆ = d , of Horn [2006] when d ⋆ = d/ , and the lower bounds of Bubeck et al. [2011] for any d ⋆ but in the stochastic case suggest that this boundis worst-case optimal over the class of functions with given near-optimality dimension d ⋆ . However, the boundsmentioned above have several limitations. First, as noted earlier by Kleinberg et al. [2019, remark after Theorem 4.4],bounds involving a single notion of dimension such as the near-optimality or zooming dimension might be too crude.Indeed, functions that feature different shapes at different scales (such as, e.g., different values of d ⋆ for differentranges of ε ) are better analyzed in a non-asymptotic framework through sums of packing numbers at different scales(see discussion at the end of Section 5). Second, considering packing numbers of the near-optimal sets X δ ( i ) as in thefirst bound mentioned above can also be quite suboptimal. For instance, for constant functions f , the sets X δ ( i ) areequal to the whole domain X (maximal packing number) while f is optimized by any algorithm after only a singleevaluation of f .The second limitation was overcome in general metric spaces by Kleinberg et al. [2019] (see also Kleinberg et al.2008) by considering packing numbers of layers X ( a,b ] instead of near-optimal sets X ε . In the special deterministicsetting with compact domain X ⊂ R d considered here, Bouttier et al. [2020, Theorem 1] proved that the non-certifiedsample complexity of the Piyavskii-Shubert algorithm [Piyavskii, 1972, Shubert, 1972] is at most of S NC ( f, ε ) ,where S NC ( f, ε ) := m ε X k =1 N (cid:16) X ( ε k ,ε k − ] , ε k L (cid:17) , (3)where m ε := (cid:6) log ( ε /ε ) (cid:7) , ε m ε := ε and, for each k ∈ { , , . . . , m ε − } , ε k := ε − k .The bound (3) on the non-certified sample complexity is proved by noting that highly suboptimal regions X ( ε k ,ε k − ] (with ε k ≫ ε ) need not be explored too much thanks to Assumption 1. Any reasonable algorithm can therefore“quickly” output recommendations x ⋆i closer to ε -optimality.Notably, the two algorithms achieving (3) or similar bounds are computationally challenging. Indeed, the zoomingalgorithm of Kleinberg et al. [2008, 2019] designed for general metric spaces is based on a “covering oracle” thattakes as input a collection of balls and either declares it to be a covering of the input space, or outputs a point whichis not covered. As for the Piyavskii-Shubert algorithm [Piyavskii, 1972, Shubert, 1972], it requires at every step n to solve an inner global Lipschitz optimization problem whose computational complexity might resemble that of thecomputation of a Voronoi diagram (see discussion in Bouttier et al. 2020, Section 1.1).In Section 2.1 we show that the more computationally tractable DOO algorithm matches the same bound (3).For better interpretability, we also show in Section 5 that this bound is proportional to an integral reminiscent of theDudley-entropy integral. Bounds on the certified sample complexity.
This notion of sample complexity seems to have been less studied inthe past. In dimension d = 1 , several authors derived upper bounds for a certified version of the Piyavskii-Shubertalgorithm. In particular, Hansen et al. [1991] proved that its certified sample complexity for globally L -Lipschitzfunctions f : [0 , → R is at most proportional to the integral R (cid:0) f ( x ⋆ ) − f ( x ) + ε (cid:1) − d x , and left the question ofextending the results to arbitrary dimensions open. Recently, Bouttier et al. [2020, Theorem 2] proved a bound validfor any dimension d ≥ roughly of this form: S C ( f, ε ) := N (cid:16) X ε , εL (cid:17) + m ε X k =1 N (cid:16) X ( ε k ,ε k − ] , ε k L (cid:17) , (4) The bound in Bouttier et al. [2020, Theorem 1] with α = 0 differs a little in the smallest level considered, which can be as small as ε/ , insteadof the more natural level ε . Their bound can however be straightforwardly improved to get ε instead. m ε := (cid:6) log ( ε /ε ) (cid:7) , ε m ε := ε and, for each k ∈ { , , . . . , m ε − } , ε k := ε − k .We show in Section 2.2 how to obtain (4) with the more computationally tractable DOO algorithm, and prove amatching f -dependent lower bound (up to log factors) in Section 3. In Section 4 we also show that the bound (4) isactually proportional to R X (cid:0) f ( x ⋆ ) − f ( x ) + ε (cid:1) − d d x for globally L -Lipschitz functions f : X → R under a mildassumption on X , solving the question left open by Hansen et al. [1991] three decades ago.Though the two bounds (3) and (4) look very similar, they differ in the first term being either or N (cid:0) X ε , εL (cid:1) .This difference is not negligible in general and can be explained with a simple example. Indeed, consider a constantfunction c : [0 , d → R . Any non-certified algorithm has sample complexity for c at any scale ε , because all pointsin the domain are maxima. However, the only way to certify that the output is ε -optimal is essentially to perform agrid-search of [0 , d with step-size roughly ε/L , so as to be sure there is no hidden bump of height more than ε . This isreflected in the term N (cid:0) X ε , εL (cid:1) , which is of order ( L/ε ) d for constant functions. At a high level, the more “constant”a function is, the easier it is to recommend an ε -optimal point, but the harder it is to certify that such recommendationis actually a good recommendation. In Section 2, we refine the analysis of the DOO algorithm, showing that its certified (resp., non-certified) samplecomplexity can be upper bounded (up to constants) by S C ( f, ε ) (resp., S NC ( f, ε ) ). In Section 3, we prove an f -dependent lower bound on the sample complexity of all certified algorithms, which matches S C ( f, ε ) up to logarithmicterms. In Section 4, we show for globally Lipschitz functions f : X → R (with a mild condition on X ) that S C ( f, ε ) is proportional to the integral R X d x / ( f ( x ⋆ ) − f ( x ) + ε ) d , which thus characterizes the optimal certified samplecomplexity of global Lipschitz optimization. In Section 5, we also show that S NC ( f, ε ) is proportional to the Dudley-type integral over accuracies R ∞ ε N (cid:0) X (max( ε, u/ , u ] , u / L (cid:1) /u d u . Some proofs and useful lemmas are deferred to theappendix, together with a comparison of packing-based and volume-based bounds (Appendix C). What we do not cover
There are some interesting directions that would be worth investigating in the future but wedid not cover in this paper, such as noisy observations (see, e.g, Bubeck et al. 2011, Kleinberg et al. 2019) or adaptivityto smoothness (e.g., Munos 2011, Bartlett et al. 2019; we consider L -Lipschitz functions f with L known, althoughour lower bound suggests that no adaptivity could be possible for certified algorithms). Finally, even if the results ofSection 2 could be easily extended to general pseudo-metric spaces as in Munos [2014] and related works, our otherresults are finite-dimensional and exploit the normed space structure. We denote the set of positive integers { , , . . . } by N ∗ and let N := N ∗ ∪ { } . For all n ∈ N ∗ , we denote by [ n ] theset of the first n integers { , . . . , n } . We denote by A + B the Minkowski sum of two sets A, B and for any set A andall λ ∈ R , we let λA := { λ a : a ∈ A } .We denote the Lebesgue measure of a Lebesgue-measurable set A by vol( A ) and we simply refer to it as volume .For all ρ > and x ∈ R d , we denote by B ρ ( x ) the closed ball with radius ρ centered at x . We also write B ρ for theball with radius ρ centered at the origin and denote by v ρ its volume. In this section, we show that the DOO algorithm of Perevozchikov [1990], Munos [2011] and a new certified version ofit match the two f -dependent bounds (3) and (4) previously derived for computationally more challenging algorithms.In particular, our analysis of the non-certified sample complexity of DOO improves over that of Munos [2011]. Anearly matching lower bound on the certified sample complexity of all certified algorithms will be proved in Section 3.The definition of the DOO algorithm and the associated assumptions are recalled (and slightly adjusted) in Ap-pendix E.1, together with its new certified version. 4 .1 Upper Bound on the Non-Certified Sample Complexity The next theorem shows that the quantity S NC ( f, ε ) defined in (3) is a tight upper bound on the non-certified samplecomplexity of the DOO algorithm. This improves over the bound of Munos [2011, Theorem 1] since our boundinvolves a sum of disjoint layers X ( ε k ,ε k − ] whose values are all above ε , while the bound of Munos [2011] involvesa sum over overlapping layers of the form X [0 ,u ) for all values < u ≤ ε . See Section 1.2 and Remark 17 inAppendix E.2 for further discussion. We also note that the result could be extended to more general layers than thosebased on ( ε k ) k ∈{ ,...,m ε } , as detailed in Remark 18.Recall that ζ ( A, f, ε ) denotes the non-certified sample complexity of an algorithm A (see (1)), and that m ε := (cid:6) log ( ε /ε ) (cid:7) , ε m ε := ε and, for each k ∈ { , , . . . , m ε − } , ε k := ε − k . Theorem 1.
Assume that f satisfies Assumption 1, and that Assumptions 4 and 5 in Appendix E.1 hold. Then, the non-certified DOO algorithm (Algorithm 3 in Appendix E.1) has a non-certified sample complexity bounded as follows:for any ε ∈ (0 , ε ] , ζ ( non-certified DOO , f, ε ) ≤ C m ε X k =1 N (cid:16) X ( ε k ,ε k − ] , ε k L (cid:17) , where C = K (cid:0) ν/R ≥ + ν/R< ( R / ν ) d (cid:1) . Note that the above bound equals C · S NC ( f, ε ) , where S NC ( f, ε ) was defined in (3) in Section 1.2. It thereforematches the bound of Bouttier et al. [2020, Theorem 1] on the Piyavskii-Shubert algorithm up to constant factors, butfor the more computionally tractable DOO algorithm.As discussed in Appendix E.1, when X = [0 , d and k·k is the sup norm, we have K = 2 d , R = 1 , δ = 1 / and ν = 1 / . Hence, in this case, the multiplicative constant C above equals d .The proof is postponed to Appendix E.2 and shares common arguments with the proofs of Perevozchikov [1990],Munos [2011]. As noted in Remark 17, some steps are however tighter, leading to a tighter bound. The key changeis to partition the values of f instead of partitioning the domain X at any depth h in the tree (see Munos 2011)when counting the representatives selected at all levels. The idea of using layers X ( ε i , ε i − ] was already presentin Kleinberg et al. [2008, 2019] and Bouttier et al. [2020] but for more computionally challenging algorithms (seediscussion in Section 1.2). We now analyze the certified version of the DOO algorithm that we design in Appendix E.1, and show that its certifiedsample complexity is at most proportional to S C ( f, ε ) defined in (4).Recall that σ ( A, f, ε ) denotes the certified sample complexity of an algorithm A (see (2)), and that m ε := (cid:6) log ( ε /ε ) (cid:7) , ε m ε := ε and, for each k ∈ { , , . . . , m ε − } , ε k := ε − k . Theorem 2.
Assume that Assumptions 4 and 5 in Appendix E.1 hold. Then the certified DOO algorithm (Algorithm3 in Appendix E.1) is indeed a certified algorithm. Furthermore, for any f satisfying Assumption 1 and for any ε ∈ (0 , ε ] , its certified sample complexity is bounded as σ ( certified DOO , f, ε ) ≤ C N (cid:16) X ε , εL (cid:17) + m ε X k =1 N (cid:16) X ( ε k ,ε k − ] , ε k L (cid:17)! , where C = 1 + K (cid:0) ν/R ≥ + ν/R< ( R / ν ) d (cid:1) . Note that the above bound is proportional to the quantity S C ( f, ε ) defined in (4) in Section 1.2. It thereforematches the bound of Bouttier et al. [2020, Theorem 2] on the certified version of the Piyavskii-Shubert algorithm upto constant factors, but for the more computionally tractable DOO algorithm.Similarly to Theorem 1, the above result could be easily extended to more general layers than those based on ( ε k ) ≤ k ≤ m ε (see Remark 18 for further details).The proof is postponed to Appendix E.3. It follows approximately the same lines as that of Theorem 1, exceptthat considering the stopping time σ ( certified DOO , f, ε ) introduces the additional term N (cid:0) X ε , εL (cid:1) , which cannot beavoided as discussed in Section 1.2 and proved formally in Section 3. Similar arguments were used by Bouttier et al.[2020], which we adapt here to handle the discretization inherent to the DOO algorithm (this discretization makes theanalysis slightly less natural but the algorithm more computationally tractable).5 Lower Bounds for the Certified Sample Complexity
In this section, we will focus on f -dependent lower bounds on the certified sample complexity of certified algorithmsapplied to globally Lipschitz functions. Formally, we assume the following. Assumption 2 (Global Lipschitzness) . We say that L := sup u , v ∈X , u = v (cid:12)(cid:12) f ( u ) − f ( v ) (cid:12)(cid:12) / k u − v k , where k·k is anorm, is the Lipschitz constant of f and that f is globally L -Lipschitz if L ≥ L . It is immediate to see that Assumption 1 is a relaxation of Assumption 2. Similarly to Section 1.1, we say that A is a certified algorithm for L -Lipschitz functions if for any globally L -Lipschitz function f , accuracy ε ∈ (0 , ε ] , andtime step i ∈ N ∗ , the accuracy certificate γ i of A at time i when applied to f with accuracy ε is equal to only if itsrecommendation x ⋆i satisfies x ⋆i ∈ X ε .The main result of this section (Theorem 3) shows that when a certified algorithm for L -Lipschitz functions isapplied to a function f that is globally L ′ -Lipschitz, with L ′ bounded away below L , then its certified sample com-plexity is (up to a log factor) at least S C ( f, ε ) . This is the quantity that we presented in Sections 1 and 2, whichupper bounds the certified sample complexity of algorithms such as DOO and Piyavskii-Shubert. Putting these upperand lower bounds together proves that the optimal f -dependent certified sample complexity (of certified algorithmsfor L -Lipschitz functions) is of order S C ( f, ε ) , up to a log factor, at least for globally L ′ -Lipschitz functions with L ′ bounded away below L . The boundary case where L ′ = L is discussed later. Theorem 3.
Let < L ′ < L , K := 16 L/ ( L − L ′ ) , and c := (1 / K ) d / . Then, the certified sample complexity of anycertified algorithm A for L -Lipschitz functions satisfies, for any globally L ′ -Lipschitz function f and all ε ∈ (0 , ε ] , σ ( A, f, ε ) > c m ε S C ( f, ε ) . (5)Before proving Theorem 3, we introduce the insightful notions of worst-case error and worst-case sample com-plexity . For an algorithm A and a function f , we denote the point queried by A when applied to f at time n by x n ( A, f ) , the recommendation by x ⋆n ( A, f ) , and the accuracy certificate by γ n ( A, f ) . For any n ∈ N ⋆ we define the worst-case error of A when applied to f at time n as E L ( A, f, n ) := sup n max( g ) − g (cid:0) x ⋆n ( A, f ) (cid:1) : g is globally L -Lipschitz and g ( x ) = f ( x ) for all x ∈ (cid:8) x ( A, f ) , . . . , x n ( A, f ) (cid:9)o . (6)Note that if f is globally L -Lipschitz, the optimization error of A at time n is no larger than (but possibly equal to) E L ( A, f, n ) . We then define the worst-case minimax error for f at step n as inf A E L ( A, f, n ) , where the infimumis over all algorithms (not only certified ones). A classic performance measure in global optimization is the minimaxerror inf A sup g (cid:0) max( g ) − g (cid:0) x ⋆n ( A, g ) (cid:1)(cid:1) , where the supremum is over all gobally L -Lipschitz functions and theinfimum is over all algorithms (e.g., see Nesterov 2003). Compared to ours, this is a more pessimistic notion, thatdoes not depend on f . Let < L ′ ≤ L . Based on our worst-case minimax error, we define the the worst-case samplecomplexity of a globally L ′ -Lipschitz function f with accuracy ε ∈ (0 , ε ] , as τ ( f, ε ) := min (cid:8) n ∈ N ∗ : inf A E L ( A, f, n ) ≤ ε (cid:9) , where the infimum is over all algorithms. It is immediate to prove that τ ( f, ε ) is finite, by considering an algorithm thatqueries a dense sequence of points (independently of the observed function values) and outputs as a recommendationthe maximizer of the observed values.Crucially, the worst-case sample complexity lower bounds the certified sample complexity. Lemma 4.
Let < L ′ ≤ L . For any certified algorithm A for L -Lipschitz functions, any globally L ′ -Lipschitzfunction f , and all ε ∈ (0 , ε ] , we have σ ( A, f, ε ) ≥ τ ( f, ε ) .Proof. Let N = σ ( A, f, ε ) . Then γ N ( A, f ) = 1 . Assume that
N < τ ( f, ε ) . Then we have E L ( A, f, N ) ≥ inf A E L ( A, f, N ) > ε by definition of τ ( f, ε ) . This means that there exists a globally L -Lipschitz function g , coin-ciding with f on x ( A, f ) , . . . , x N ( A, f ) and such that max( g ) − x ⋆N ( A, g ) > ε . Then by definition of the certificate, γ N ( A, g ) = 0 . But γ N ( A, f ) = γ N ( A, g ) , which yields a contradiction and concludes the proof.6e can now present a full proof of Theorem 3, the main result of this section. The high-level idea is the following.For a given certified algorithm A (for L -Lipschitz functions) and a (globally L ′ -Lipschitz) f , we create an adversarialfunction by adding a perturbation ( ± g ˜ ε in the proof) to f . To show that the resulting function is globally L -Lipschitz,we sum the Lipschitz constants of f and ± g ˜ ε . This is why we require the Lipschitz constant L ′ of f to be boundedaway below L . Proof of Theorem 3.
Because of Lemma 4, it is sufficient to show that τ ( f, ε ) > cS C ( f, ε ) / (1+ m ε ) . If S C ( f, ε ) / (1+ m ε ) < K ) d , then τ ( f, ε ) ≥ > / > cS C ( f, ε ) / (1+ m ε ) . Consider then from now on that S C ( f, ε ) / (1+ m ε ) ≥ K ) d .We will now upper bound S C ( f, ε ) / (1 + m ε ) with (an upper bound of) the largest summand in (4). Let ˜ ε be thescale achieving the maximum in (4), that is ˜ ε = ( ε , if N (cid:0) X ε , εL (cid:1) ≥ max i ∈{ ,...,m ε } N (cid:0) X ( ε i , ε i − ] , ε i L (cid:1) ,ε i ⋆ − , otherwise, where i ⋆ ∈ argmax i ∈{ ,...,m ε } N (cid:0) X ( ε i , ε i − ] , ε i L (cid:1) . Since N ( X ε , ε/L ) ≤ N ( X ε , ε/ L ) and N (cid:0) X ( ε i , ε i − ] , ε i /L (cid:1) ≤ N (cid:0) X ε i − , ε i − / L (cid:1) , we then have S C ( f, ε ) ≤ ( m ε + 1) N (cid:0) X ˜ ε, ˜ ε/ L (cid:1) . Let now n ≤ cS C ( f, ε ) / (1 + m ε ) . We then have n ≤ c N (cid:0) X ˜ ε, ˜ ε/ L (cid:1) . From Lemma 14, N (cid:0) X ˜ ε , K ˜ εL (cid:1) ≥ (cid:0) K (cid:1) d N (cid:0) X ˜ ε , ˜ ε L (cid:1) ≥ (cid:0) K (cid:1) d S C ( f,ε ) m ε +1 ≥ , as considered earlier. Then we have n ≤ c (8 K ) d N (cid:0) X ˜ ε, K ˜ ε/L (cid:1) . Since c (8 K ) d = 1 / , we thus obtain n ≤N (cid:0) X ˜ ε, K ˜ ε/L (cid:1) − .Consider a certified algorithm A for L -Lipschitz functions. Fix a K ˜ ε/L packing x , . . . , x N of X ˜ ε with cardinality N = N ( X ˜ ε , K ˜ ε/L ) . Then the open balls of centers x , . . . , x N and radius K ˜ ε/ L are disjoint and two of them,with centers, say, ˜ x and ˜ x , do not contain any of the x ( A, f ) , . . . , x n ( A, f ) . Let, for x ∈ X , g ˜ ε ( x ) := (cid:0) ˜ ε − LK k x − ˜ x k (cid:1) I (cid:0) x ∈ X ∩ B K ˜ ε/ L (˜ x ) (cid:1) . Then g ˜ ε ( x ) is globally L/K = L − L ′ Lipschitz. Hence f + g ˜ ε and f − g ˜ ε are L -Lipschitz. Observe that f , f + g ˜ ε and f − g ˜ ε coincide on x ( A, f ) , . . . , x n ( A, f ) . As a consequence, A has the same recommendation for them: x ⋆n ( A, f ) = x ⋆n ( A, f + g ˜ ε ) = x ⋆n ( A, f − g ˜ ε ) .Consider first the case x ⋆n ( A, f ) ∈ B (˜ x , K ˜ ε/ L ) . Then, we have, by definition of g ˜ ε and the fact that ˜ x ∈ X ˜ ε , f (˜ x ) − g ˜ ε (˜ x ) − f ( x ⋆n ( A, f )) + g ˜ ε ( x ⋆n ( A, f )) ≥ − ˜ ε + 8˜ ε − LK K ˜ ε L = 3˜ ε .Now consider the case x ⋆n ( A, f ) B (˜ x , K ˜ ε/ L ) . Then, we have, by definition of g ˜ ε and the fact that ˜ x ∈ X ˜ ε , f (˜ x ) + g ˜ ε (˜ x ) − f ( x ⋆n ( A, f )) − g ˜ ε ( x ⋆n ( A, f )) ≥ − ˜ ε + 8˜ ε − ε + LK K ˜ ε L = 3˜ ε .Finally, in both cases E L ( A, f, n ) ≥ ε > ε . Hence inf A E L ( A, f, n ) > ε . Since this has been shown for any n ≤ cS C ( f, ε ) / (1 + m ε ) we thus have τ ( f, ε ) > cS C ( f, ε ) / (1 + m ε ) .We conclude the section by discussing the limit case L ′ → L , in which the constant in Theorem 3 diverges. Thenext result shows that in dimension d = 1 , the limit case L = L ′ does not hinder the validity of the result. The proofis deferred to Appendix A.1. Proposition 5. If d = 1 , let K = 8 and c = 1 / K . Then, the certified sample complexity of any certified algo-rithm A for L -Lipschitz functions satisfies, for any globally L -Lipschitz function f and all ε ∈ (0 , ε ] , σ ( A, f, ε ) > ( c / m ε ) S C ( f, ε ) . The final result of this section shows that, in higher dimension d ≥ , the improvement of Proposition 5 when L ′ is close to L is not possible in general. The proof is deferred to Appendix A.2. Proposition 6.
Let d ≥ , X := B , and k·k be a norm. The certified Piyavskii-Shubert algorithm (for L -Lipschitzfunctions) with initial guess x := is a certified algorithm (for L -Lipschitz functions) satisfying, for the globally L -Lipschitz function f := L k·k and any ε ∈ (0 , ε ) , σ ( Piyavskii-Shubert , f, ε ) = 2 ≪ c/ε d − ≤ S C ( f, ε ) , where c > is a constant independent of ε . See Appendix A for more details on the certified Piyavskii-Shubert algorithm. An Integral Characterization of the Optimal Certified Sample Complex-ity S C In dimension d = 1 , an elegant bound on the certified sample complexity was derived by Hansen et al. [1991] for acertified version of the Piyavskii-Shubert algorithm. They proved that if f is globally Lipschitz, for any accuracy ε ∈ (0 , ε ] , the smallest number of time steps σ := σ ( certified Piyavskii-Shubert , f, ε ) before outputting γ σ = 1 is at mostproportional to R (cid:0) f ( x ⋆ ) − f ( x ) + ε (cid:1) − d x . As pointed out in Bouttier et al. [2020], given that σ for this algorithmcan be upper bounded with S C , this suggests a relationship between the two quantities. However, Hansen et al. [1991]rely heavily on the one-dimensional setting and the specific form of the Piyavskii-Shubert algorithm, claiming it thateven the simpler task of “ Extending the results of this paper to the multivariate case appears to be difficult ”. In thissection, we show an equivalence between S C and this type of integral bound in any dimension d . As a corollary, thissolves the long-standing problem raised by Hansen et al. [1991] three decades ago.To tame the wild spectrum of shapes that compact subsets may have, we will assume that X satisfies the followingadditional assumption. At a high level, it says that a constant fraction of each (sufficiently small) ball centered ata point in X is included in X . This removes sets containing isolated points or “peaked” corners, but includes mostdomains that are typically used, such as non-degenerate polytopes, ellipsoids, finite unions of them, etc. This naturalassumption has already appeared in the past (e.g., Hu et al. 2020) and is weaker than another classical assumption inthe statistics literature (the rolling ball assumption, Cuevas et al. 2012, Walther 1997). Assumption 3.
There exist two constants r > , γ ∈ (0 , such that, for any x ∈ X and all r ∈ (0 , r ) , vol (cid:0) B r ( x ) ∩X (cid:1) ≥ γv r . We can now state the main result of this section. Its proof relies on some additional technical results that aredeferred to the appendix.
Theorem 7. If f is globally L -Lipschitz and X satisfies Assumption 3 with r > ε / L , γ ∈ (0 , , then there exist c, C > (e.g., c := 1 /v / L and C := 1 / ( γv / L ) ) such that, for all ε ∈ (0 , ε ] , c Z X d x (cid:0) f ( x ⋆ ) − f ( x ) + ε (cid:1) d ≤ S C ( f, ε ) ≤ C Z X d x (cid:0) f ( x ⋆ ) − f ( x ) + ε (cid:1) d . In light of Theorem 3, this also shows that the integral bound characterizes (up to a log factor) the optimal certifiedsample complexity of globally Lipschitz functions in any dimension d , outside of boundary cases. Corollary 8.
Let < L ′ < L . Then, there exist two constants c, C > such that the optimal certified samplecomplexity of any certified algorithm A for L -Lipschitz functions satisfies, for any globally L ′ -Lipschitz function f and all ε ∈ (0 , ε ] , c m ε Z X d x (cid:0) f ( x ⋆ ) − f ( x ) + ε (cid:1) d ≤ inf A σ ( A, f, ε ) ≤ C Z X d x (cid:0) f ( x ⋆ ) − f ( x ) + ε (cid:1) d . The previous result follows directly from Theorems 2, 3 and 7. We now prove Theorem 7.
Proof of Theorem 7.
Fix any ε ∈ (0 , ε ] and recall that m ε := (cid:6) log ( ε / ε ) (cid:7) , ε m ε := ε , and for all k ≤ m ε − , ε k := ε − k . Partition the domain of integration X into the following m ε + 1 sets: the set of ε -optimal points X ε andthe m ε layers X ( ε k , ε k − ] , for k ∈ [ m ε ] . We begin by proving the first inequality: Z X d x (cid:0) f ( x ⋆ ) − f ( x ) + ε (cid:1) d ≤ vol( X ε ) ε d + m ε X k =1 vol (cid:0) X ( ε k , ε k − ] (cid:1) ( ε k + ε ) d ≤ M (cid:0) X ε , εL (cid:1) · v (cid:0) εL (cid:1) d ε d + m ε X k =1 M (cid:0) X ( ε k , ε k − ] , ε k L (cid:1) · v (cid:0) ε k L (cid:1) d ε dk ≤ v L d N (cid:16) X ε , εL (cid:17) + m ε X k =1 N (cid:16) X ( ε k , ε k − ] , ε k L (cid:17)! , We actually prove a stronger result. The first inequality holds more generally for any f that is L -Lipschitz around a maximizer (Assumption 1)and Lebesgue-measurable, and does not require X to satisfy Assumption 3. f ( x ⋆ ) − f with its infimum on the partition, the second one bydropping ε > from the second denominator and upper bounding the volume of a set with the volume of the balls of asmallest ε k / L -cover, and the last one by the fact that covering numbers are always smaller than packing numbers (14).This proves the first part of the theorem.For the second one, note that Z X d x (cid:0) f ( x ⋆ ) − f ( x ) + ε (cid:1) d ≥ vol( X ε )( ε + ε ) d + m ε X k =1 vol (cid:0) X ( ε k , ε k − ] (cid:1) ( ε k − + ε ) d ≥ d vol( X ε ) ε d + 14 d m ε X k =1 vol (cid:0) X ( ε k , ε k − ] (cid:1) ε dk ≥ d vol( X ε ) ε d + m ε X k =1 vol (cid:16) X ( ε k , ε k − ] (cid:17) ε dk − ≥ N (cid:0) X ε , εL (cid:1) vol (cid:0) ε L B (cid:1) (32 ε ) d /γ + m ε X k =1 N (cid:0) X ( ε k , ε k − ] , ε k L (cid:1) vol (cid:0) ε k L B (cid:1) (32 ε k − ) d /γ ≥ γv / L N (cid:16) X ε , εL (cid:17) + γv / L m ε X k =1 N (cid:16) X ( ε k , ε k − ] , ε k L (cid:17) , where the first inequality follows by upper bounding f ( x ⋆ ) − f with its supremum on the partition, the second oneby ε ≤ ε k − (for all k ∈ [ m ε + 1] ) and ε k − ≤ ε k (for all k ∈ [ m ε ] ), the third one by Lemma 15, and the fourthone by the elementary inclusions X ε ⊇ X ε and X ( ε k , ε k − ] ⊇ X ( ε k , ε k − ] (for all k ∈ [ m ε ] ) followed byProposition 11. S NC As we mentioned earlier, the non-certified sample complexity of some algorithms can be upper bounded with a sumof packing numbers (3) (e.g., for the DOO and Piyavskii-Shubert algorithms). Notably, these types of bounds appearin fields beyond optimization (e.g., see Bachoc et al. 2020, Theorem 2). In this section, we show how this quantitycan be controlled by a more elegant Dudley-type entropy integral bound. The idea of the reduction is leveraging theusual sum-integral approximation carried out, e.g., when deriving the Dudley entropy integral bound (see, e.g., Dudley1967, Boucheron et al. 2013), with the additional care that here the integrand is not necessarily monotone.
Theorem 9. If f is L -Lipschitz around a maximizer and ( x, y, z )
7→ N (cid:0) X ( x,y ] , z (cid:1) is Lebesgue-measurable, then thereexist c, C > (e.g., c = / and C := 2 · d ) such that, for all ε ∈ (0 , ε ] , c Z ∞ ε N (cid:0) X (max( ε, u/ , u ] , uL (cid:1) u d u ≤ S NC ( f, ε ) ≤ C Z ∞ ε N (cid:0) X (max( ε, u/ , u ] , uL (cid:1) u d u . Proof.
Fix any ε ∈ (0 , ε ] and recall that m ε := (cid:6) log ( ε / ε ) (cid:7) , ε m ε := ε , and for all k ≤ m ε − , ε k := ε − k .The proof of the first inequality is deferred to Appendix B. For the second one, for all k ∈ [ m ε ] , define ν k :=inf u ∈ [ ε k − , ε k − ] N (cid:0) X ( ε k , u ] , u / L (cid:1) , fix any η k > , and take u k ∈ [ ε k − , ε k − ] such that N (cid:0) X ( ε k , u k ] , u k / L (cid:1) ≤ ν k + η k . Then, for all k ∈ [ m ε ] , N (cid:16) X ( ε k , ε k − ] , ε k L (cid:17) ≤ N (cid:16) X ( ε k , u k ] , ε k L (cid:17) ≤ (cid:18) u k ε k (cid:19) d N (cid:16) X ( ε k , u k ] , u k L (cid:17) ≤ d ( ν k + η k ) , In addition to Lemma 15, in Appendix D, we prove a more general result on controlling the sum of the volumes of a family of overlapping setscovering X with the sum of volumes over a partition of X (Proposition 16). As we do here, this can be translated to packing numbers. The trick of using ν k := inf u ∈ [ ε k − ,ε k − ] N (cid:0) X ( ε k , u ] , u / L (cid:1) is key to handle the cases when u
7→ N (cid:0) X ( ε k , u ] , u / L (cid:1) is not non-increasing. S NC ( f, ε ) = m ε X k =1 N (cid:16) X ( ε k , ε k − ] , ε k L (cid:17) ≤ m ε X k =1 d ( ν k + η k ) ε k − ( ε k − − ε k − )= 16 d m ε X k =1 inf u ∈ [ ε k − , ε k − ] N (cid:0) X ( ε k , u ] , u / L (cid:1) ε k − ( ε k − − ε k − ) + 16 d m ε X k =1 η k ≤ · d m ε X k =1 Z ε k − ε k − N (cid:0) X ( ε k , u ] , u / L (cid:1) u d u + 16 d m ε X k =1 η k . The result follows by noting that ε k ≥ max( u/ , ε ) for any k ∈ [ m ε ] and all u ∈ [ ε k − , ε k − ] , summing, recallingthat X ( a,b ] = ∅ for all ε ≤ a < b , and taking the limits for η k → .The previous result yields two direct corollaries, obtained by upper bounding X (max( u/ , ε ) , u ] with either X ( u/ , u ] or X ( ε, u ] . The latter has immediate consequences in terms of near-optimality dimension [Bubeck et al., 2011]. In-deed, if there exist two constants C ⋆ > and d ⋆ ∈ (0 , d ] such that N ( X u , u / L ) ≤ C ⋆ /u d ⋆ for all u ∈ (0 , ε ] ,then immediately there exists C > such that R ∞ ε N (cid:0) X ( ε, u ] , u / L (cid:1) /u d u ≤ C/ε d ⋆ for all ε ∈ (0 , ε ] . Simi-larly, if there exists C ⋆ > such that N ( X u , u / L ) ≤ C ⋆ for all u ∈ (0 , ε ] , then there exists C > such that R ∞ ε N (cid:0) X ( ε, u ] , u / L (cid:1) /u d u ≤ C log( ε /ε ) for all ε ∈ (0 , ε ] . Furthermore, such an integral bound makes it easyto handle multiple different regimes at the same time. This is the case if, for example, there exist C ⋆ , C ⋆ > and u ∈ (0 , ε ] such that N ( X u , u / L ) ≤ C ⋆ for all u ∈ [ u , ε ] but N ( X u , u / L ) ≤ C ⋆ /ε d ⋆ for all u ∈ (0 , u ) . In thiscase, our integral bound reflects that there exist C , C > such that R ∞ ε N (cid:0) X ( ε, u ] , u / L (cid:1) /u d u ≤ C log( ε /ε ) forall ε ∈ [ u , ε ] but R ∞ ε N (cid:0) X ( ε, u ] , u / L (cid:1) /u d u ≤ C (cid:0) log( ε /u ) + 1 /ε d ⋆ (cid:1) if ε ∈ (0 , u ) .Note that if for some ε ∈ (0 , ε ] , most points in [0 , d are ε -optimal, i.e., if X ε is large, then N ( X ε , ε / L ) ≈ /ε d is also large and does not reflect the fact that the non-certified problem is easy. This is a folklore criticism that hasbeen made to these types of bounds. Such a mismatching behavior is solved by expressing bounds as we did, in termsof layers X ( ε, u ] instead. ACKNOWLEDGEMENTS
The work of Tommaso Cesari and S´ebastien Gerchinovitz has benefitted from the AI Interdisciplinary Institute ANITI,which is funded by the French “Investing for the Future – PIA3” program under the Grant agreement ANR-19-P3IA-0004. S´ebastien Gerchinovitz gratefully acknowledges the support of the DEEL project. . This work benefited fromthe support of the project BOLD from the French national research agency (ANR). References
Franc¸ois Bachoc, Tommaso Cesari, and S´ebastien Gerchinovitz. The sample complexity of level set approximation. arXiv preprint arXiv:2010.13405 , 2020.Peter L. Bartlett, Victor Gabillon, and Michal Valko. A simple parameter-free and adaptive approach to optimizationunder a minimal local smoothness assumption. In Aur´elien Garivier and Satyen Kale, editors,
Proceedings of the30th International Conference on Algorithmic Learning Theory , volume 98 of
Proceedings of Machine LearningResearch , pages 184–206, Chicago, Illinois, 22–24 Mar 2019. PMLR.St´ephane Boucheron, G´abor Lugosi, and Pascal Massart.
Concentration inequalities: a nonasymptotic theory ofindependence . Oxford University Press, 2013.Cl´ement Bouttier, Tommaso Cesari, and S´ebastien Gerchinovitz. Regret analysis of the Piyavskii-Shubert algorithmfor global lipschitz optimization. arXiv preprint arXiv:2002.02390 , 2020. Journal of Machine LearningResearch , 12(May):1655–1695, 2011.Antonio Cuevas, Ricardo Fraiman, and Beatriz Pateiro-L´opez. On statistical properties of sets fulfilling rolling-typeconditions.
Advances in Applied Probability , 44(2):311–329, 2012.Richard M. Dudley. The sizes of compact subsets of hilbert space and continuity of Gaussian processes.
Journal ofFunctional Analysis , 1(3):290 – 330, 1967.Pierre Hansen, Brigitte Jaumard, and S-H Lu. On the number of iterations of Piyavskii’s global optimization algorithm.
Mathematics of Operations Research , 16(2):334–350, 1991.Matthias Horn. Optimal algorithms for global optimization in case of unknown Lipschitz constant.
Journal of Com-plexity , 22(1):50–70, 2006.Yichun Hu, Nathan Kallus, and Xiaojie Mao. Smooth contextual bandits: Bridging the parametric and non-differentiable regret regimes. In Jacob Abernethy and Shivani Agarwal, editors,
Proceedings of Thirty Third Con-ference on Learning Theory , volume 125 of
Proceedings of Machine Learning Research , pages 2007–2010. PMLR,09–12 Jul 2020.Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandits in metric spaces. In
Proceedings of thefortieth annual ACM symposium on Theory of computing , pages 681–690, 2008.Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Bandits and experts in metric spaces.
Journal of the ACM , 66(4), 2019.C´edric Malherbe and Nicolas Vayatis. Global optimization of Lipschitz functions. In
Proceedings of the 34th Inter-national Conference on Machine Learning , volume PMLR 70, pages 2314–2323, 2017.R´emi Munos. Optimistic optimization of a deterministic function without the knowledge of its smoothness. In
Ad-vances in Neural Information Processing Systems 24 (NIPS 2011) , pages 783–791, 2011.R´emi Munos. From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning.
Foundations and Trends in Machine Learning , 7(1):1–130, 2014.Yurii Nesterov.
Introductory lectures on convex optimization: A basic course , volume 87. Springer Science & BusinessMedia, 2003.A.G. Perevozchikov. The complexity of the computation of the global extremum in a class of multi-extremum prob-lems.
USSR Computational Mathematics and Mathematical Physics , 30(2):28–33, 1990.S.A. Piyavskii. An algorithm for finding the absolute extremum of a function.
USSR Computational Mathematics andMathematical Physics , 12(4):57–67, 1972.Bruno O Shubert. A sequential method seeking the global maximum of a function.
SIAM Journal on NumericalAnalysis , 9(3):379–388, 1972.Martin J. Wainwright.
High-Dimensional Statistics: A Non-Asymptotic Viewpoint . Cambridge Series in Statistical andProbabilistic Mathematics. Cambridge University Press, 2019.Guenther Walther. Granulometric smoothing.
The Annals of Statistics , 25(6):2273–2299, 1997.
A Missing proofs of Section 3
In this section we provide all missing details and proofs from Section 3.11 .1 Lower Bound on the Certified Sample Complexity in Dimension d = 1 We begin by proving our lower bound on the certified sample complexity in dimension d = 1 . It shows that in thespecial one-dimensional case d = 1 , S C ( f, ε ) provides a tight lower bound on the certified sample complexity of allcertified algorithms, up to the factor m ε , even in the boundary case in which f is globally L -Lipschitz. Proposition (Proposition 5) . If d = 1 , let K = 8 and c = 1 / K . Then, the certified sample complexity of anycertified algorithm A for L -Lipschitz functions satisfies, for any globally L -Lipschitz function f and all ε ∈ (0 , ε ] , σ ( A, f, ε ) > ( c / m ε ) S C ( f, ε ) .Proof. As for the proof of Theorem 3, it is sufficient to show that τ ( f, ε ) > cS C ( f, ε ) / (1 + m ε ) . If cS C ( f, ε ) / (1 + m ε ) < , then the result follows by τ ( f, ε ) ≥ . Consider then from now on that cS C ( f, ε ) / (1 + m ε ) ≥ .Defining ˜ ε as in the proof of Theorem 3, one can prove similarly that cS C ( f, ε ) / (1 + m ε ) ≤ c N (cid:0) X ˜ ε, ˜ ε/ L (cid:1) . FromLemma 14, N (cid:16) X ˜ ε, K ˜ εL (cid:17) ≥ K N (cid:16) X ˜ ε, ˜ ε L (cid:17) ≥ K S C ( f, ε ) m ε + 1 ≥ , because c = 1 / K and cS C ( f, ε ) / (1 + m ε ) ≥ . Let now n ≤ cS C ( f, ε ) / (1 + m ε ) . Then we have n ≤ c (8 K ) N (cid:0) X ˜ ε, K ˜ ε/L (cid:1) . Thus, by c (8 K ) = 1 / , n ≤ N (cid:0) X ˜ ε, K ˜ ε/L (cid:1) / , and N (cid:0) X ˜ ε, K ˜ ε/L (cid:1) ≥ , we have n ≤ N (cid:16) X ˜ ε, K ˜ εL (cid:17) − . (7)Consider a certified algorithm A for L -Lipschitz functions. Let us consider a K ˜ ε/L packing ≤ x < · · · < x N ≤ of X ˜ ε with N = N (cid:0) X ˜ ε, K ˜ ε/L (cid:1) . Consider the ⌊ N/ ⌋ − disjoint open segments ( x , x ) , ( x , x ) , . . . , ( x ⌊ N/ ⌋− , x ⌊ N/ ⌋− ) . Then from (7) there exists i ∈ (cid:8) , , . . . , ⌊ N/ ⌋ − (cid:9) such that the segment ( x i , x i +2 ) does not containany of the x ( A, f ) , . . . , x n ( A, f ) . Assume that x i +1 − x i ≤ x i +2 − x i +1 (the case x i +1 − x i > x i +2 − x i +1 can be treated analogously; we omit these straightforward details for the sake of conciseness). Consider the function h + , ˜ ε : X → R defined by h + , ˜ ε ( x ) = f ( x ) if x ∈ X \ [ x i , x i +2 ] f ( x i ) + L ( x − x i ) if x ∈ X ∩ [ x i , x i +1 ] f ( x i ) + L ( x i +1 − x i ) + ( x − x i +1 ) f ( x i +2 ) − f ( x i ) − L ( x i +1 − x i ) x i +2 − x i +1 if x ∈ X ∩ ( x i +1 , x i +2 ] . We see that h + , ˜ ε is L -Lipschitz (since x i +1 − x i ≤ x i +2 − x i +1 ). Furthermore, h + , ˜ ε coincides with f on x ( A, f ) , . . . , x n ( A, f ) .Similarly, consider the function h − , ˜ ε : X → R defined by h − , ˜ ε ( x ) = f ( x ) if x ∈ X \ [ x i , x i +2 ] f ( x i ) − L ( x − x i ) if x ∈ X ∩ [ x i , x i +1 ] f ( x i ) − L ( x i +1 − x i ) + ( x − x i +1 ) f ( x i +2 ) − f ( x i )+ L ( x i +1 − x i ) x i +2 − x i +1 if x ∈ X ∩ ( x i +1 , x i +2 ] . As before, h − , ˜ ε is L -Lipschitz and coincides with f on x ( A, f ) , . . . , x n ( A, f ) .Consider the case (1) where x ⋆n ( A, f ) ∈ X \ [ x i , x i +2 ] . Then, we have, since x i ∈ X ˜ ε and x i +1 − x i ≥ K ˜ ε/L , h + , ˜ ε ( x i +1 ) − h + , ˜ ε ( x ⋆n ( A, f )) = f ( x i ) + L ( x i +1 − x i ) − f ( x ⋆n ( A, f )) ≥ − ˜ ε + L K ˜ εL = 7˜ ε. Consider the case (2) where x ⋆n ( A, f ) ∈ X ∩ [ x i , ( x i + x i +1 ) / . Then, we have, since x i +1 − x i ≥ K ˜ ε/L , h + , ˜ ε ( x i +1 ) − h + , ˜ ε ( x ⋆n ( A, f )) = f ( x i ) + L ( x i +1 − x i ) − f ( x i ) − L ( x ⋆n ( A, f ) − x i ) ≥ L ( x i +1 − x i )2 ≥ ˜ ε K ε. Consider the case (3) where x ⋆n ( A, f ) ∈ X ∩ [( x i + x i +1 ) / , x i +1 ] . Then, we have, since x i +1 − x i ≥ K ˜ ε/L , h − , ˜ ε ( x i ) − h − , ˜ ε ( x ⋆n ( A, f )) = f ( x i ) − f ( x i ) + L ( x ⋆n ( A, f ) − x i ) ≥ L ( x i +1 − x i )2 ≥ ˜ ε K ε. x ⋆n ( A, f ) ∈ X ∩ [ x i +1 , ( x i +1 + x i +2 ) / . Then, we have, since x i +1 − x i ≥ K ˜ ε/L ,since x i , x i +2 ∈ X ˜ ε and since h − , ˜ ε is linear increasing on [ x i +1 , x i +2 ] with left value f ( x i ) − L ( x i +1 − x i ) and rightvalue f ( x i +2 ) , h − , ˜ ε ( x i ) − h − , ˜ ε ( x ⋆n ( A, f )) ≥ f ( x i ) − f ( x i ) − L ( x i +1 − x i ) + f ( x i +2 )2= f ( x i ) − f ( x i +2 )2 + L ( x i +1 − x i )2 ≥ − ˜ ε K ε ≥ ε. Consider the case (5) where x ⋆n ( A, f ) ∈ X ∩ [( x i +1 + x i +2 ) / , x i +2 ] . Then, we have, since x i +1 − x i ≥ K ˜ ε/L ,since x i , x i +2 ∈ X ˜ ε and since h + , ˜ ε is linear decreasing on [ x i +1 , x i +2 ] with left value f ( x i ) + L ( x i +1 − x i ) and rightvalue f ( x i +2 ) , h + , ˜ ε ( x i +1 ) − h + , ˜ ε ( x ⋆n ( A, f )) ≥ f ( x i ) + L ( x i +1 − x i ) − f ( x i ) + L ( x i +1 − x i ) + f ( x i +2 )2= f ( x i ) − f ( x i +2 )2 + L ( x i +1 − x i )2 ≥ − ˜ ε K ε ≥ ε. Hence, in all cases E L ( A, f, n ) ≥ ε > ε . Hence inf A E L ( A, f, n ) > ε . Since this has been shown for any n ≤ cS C ( f, ε ) / (1 + m ε ) we thus have τ ( f, ε ) > cS C ( f, ε ) / (1 + m ε ) .As discussed previously, in the proof of Theorem 3, the constraint that f be L ′ -Lipschitz with L ′ < L arose fromthe fact that we added a small bump function ± g ˜ ε to f , with the requirement that the new function f ± g ˜ ε be globally L -Lipschitz. We treat the one-dimensional case differently. For a given algorithm, a given function f and a numberof query points smaller than cS C ( f, ε ) / (1 + m ε ) , we show the existence of a segment unvisited by the algorithm andcontaining three close-to-optimal points that are separated enough. We then replace the function f on this segmentwith an upward or downward hat function which makes the algorithm fail to be ε -optimal. By replacing f with a newfunction on the segment, rather than adding a bump function to f , we can allow f to have Lipschitz constant arbitrarilyclose to L . A.2 The Piyavskii-Shubert Algorithm and Proof of Proposition 6
In this section, we recall the definition of the certified Piyavskii-Shubert algorithm (Algorithm 2, Piyavskii 1972,Shubert 1972) and we show that its sample complexity, in dimension d ≥ , is not lower bounded by S C ( f, ε ) forsome functions in the boundary case L ′ = L . Algorithm 2:
Certified Piyavskii-Shubert algorithm input: accuracy ε > , Lipschitz constant L > , norm k·k , initial guess x ∈ X for i = 1 , , . . . do pick the next query point x i observe the value f ( x i ) output the recommendation x ⋆i ← argmax x ∈{ x ,..., x i } f ( x ) output the certificate γ i ← I (cid:8) b f ⋆i − f ⋆i ≤ ε (cid:9) , where b f i ( · ) ← min j ∈ [ i ] (cid:8) f ( x j ) + L k x j − ( · ) k (cid:9) , b f ⋆i ← max x ∈X b f i ( x ) , f ⋆i ← max j ∈ [ i ] f ( x j ) , and let x i +1 ∈ argmax x ∈X b f i ( x ) In contrast to the one-dimensional case d = 1 , the next result shows that there are special cases with L ′ = L and d ≥ for which the upper bound S C ( f, ε ) is far from tight. Quantifying the optimal certified sample complexity fordimensions d ≥ in the boundary case L ′ = L (or when L ′ is arbitrarily close to L ) is left as an open problem. Proposition 6.
Let d ≥ , X := B , and k·k be a norm. The certified Piyavskii-Shubert algorithm (Algorithm 2) withinitial guess x := is a certified algorithm for L -Lipschitz functions satisfying, for the globally L -Lipschitz function f := L k·k and any ε ∈ (0 , ε ) , σ ( Piyavskii-Shubert , f, ε ) = 2 ≪ c/ε d − ≤ S C ( f, ε ) , where c > is a constantindependent of ε . roof. Fix any ε ∈ (0 , ε ) . When f is at least L -Lipschitz around a maximizer (Assumption 1), then max x ∈X b f i ( x ) ≥ max x ∈X f ( x ) for all i ∈ N ∗ (for a proof of this fact, see Bouttier et al. 2020). Hence if γ i = 1 , then max x ∈X f ( x ) − f ( x ⋆i ) ≤ b f i ( x ) − f ⋆i ≤ ε , and x ⋆i is necessarily ε -optimal. This shows that the certified Piyavskii-Shubert algorithmis indeed a certified algorithm for L -Lipschitz functions. Then, since ε = L , we have that b f = f , γ = 0 ,and x belongs to the the unit sphere, i.e., x is a maximizer of f . Since b f = f , we have that γ = 1 , hence σ ( Piyavskii-Shubert , f, ε ) = 2 . Finally, by definition (4), we have S C ( f, ε ) ≥ N (argmax X f, ε/L ) . Since argmax X f is the unit sphere, there exists a constant c , only depending on d , k·k and L , such that S C ( f, ε ) ≥ c/ε d − .We give some intuition on the previous proposition. Consider a function f that has Lipschitz constant exactly L ,and a pair of points in X whose respective values of f are maximally distant, that is the difference of values of f isexactly L times the norm of the input difference. This configuration provides strong information on the value of theglobal maximum of f , as is illustrated in the proof of Proposition 6. Another interpretation is that when f has Lipschitzconstant exactly L , there is less flexibility for the L -Lipschitz function g that yields the maximal optimization errorin (6). B Missing proofs of Section 5
In this section, we provide the missing part of the proof of Theorem 9.Fix any ε ∈ (0 , ε ] and recall that m ε := (cid:6) log ( ε / ε ) (cid:7) , ε m ε := ε , and for all k ≤ m ε − , ε k := ε − k . We beginby proving the first inequality: Z ∞ ε N (cid:16) X ( max ( ε, u ) ,u ] , uL (cid:17) u d u = m ε X k = − Z ε k − ε k N (cid:16) X ( max ( ε, u ) ,u ] , uL (cid:17) u d u ≤ m ε X k = − Z ε k − ε k N (cid:16) X ( max ( ε, εk ) , ε k − ] , ε k L (cid:17) ε k d u ≤ m ε X k = − N (cid:16) X ( max ( ε, εk ) , ε k − ] , ε k L (cid:17) = m ε X k = − (cid:16) N (cid:16) X ( εk , ε k − ] , ε k L (cid:17) I k ∈{− ,...,m ε − } + N (cid:16) X ( ε, ε k − ] , ε k L (cid:17) I k ∈{ m ε − ,...,m ε } (cid:17) ≤ m ε X k = − k +2 X h = k N (cid:16) X ( ε h , ε h − ] , ε h L (cid:17) I k ∈{− ,...,m ε − } + m ε X h = k N (cid:16) X ( ε h , ε h − ] , ε h L (cid:17) I k ∈{ m ε − ,...,m ε } ! = X k = − ( . . . ) + m ε X k =1 ( . . . ) ≤ S NC ( f, ε ) , where the first and third inequalities follow by the monotonicity properties of packing numbers, the second one by ε k − − ε k ≤ ε k for all k ∈ [ m ε ] , and the last one by noting that both brackets ( . . . ) contain at most three non-zeroterms N (cid:0) X ( ε h , ε h − ] , ε h L (cid:1) , with h ∈ [ m ε ] , that overlap at most times, and nothing else. This proves the first partof the theorem. C Packing vs Volume
The two prototypical sample complexity statements involving assumptions on volumes (8)–(9) (see e.g., Perevozchikov1990) and packing numbers (10)–(11) (e.g., Munos 2014) for algorithms applied to functions f that are L -Lipschitz We wrote ( . . . ) to avoid copying twice the long expression on the previous line. ε ∈ (0 , ε ] are: ∃ r ∈ (0 , , ∃ C > , ∀ u ∈ (0 , ε ] , vol( X u ) ≤ Cu dr (8) = ⇒ ∃ c > , ∀ n ≥ c (cid:0) ln( / ε ) I r =1 + ( / ε ) d (1 − r ) I r =1 (cid:1) , x ⋆n ∈ X ε , (9) ∃ d ⋆ ∈ [0 , d ] , ∃ C ⋆ > , ∀ u ∈ (0 , ε ] , N ( X u , u / L ) ≤ C ⋆ ( ε / ε ) d ⋆ (10) = ⇒ ∃ c > , ∀ n ≥ c (cid:0) ln( / ε ) I d ⋆ =0 + ( / ε ) d ⋆ I r =1 (cid:1) , x ⋆n ∈ X ε . (11)In the two following sections, we will study the relationship between sample complexity statements expressed underassumptions on volumes (8) and packing numbers (10). C.1 Which Assumption is Stronger
The first result suggests that packing assumptions (10) are stronger than volume’s (8).
Proposition 10. If E ⊂ R d is bounded and Lebesgue-measurable and L -Lipschitz around a maximizer, letting c := v / L , for all u ∈ (0 , ε ] , vol( X u ) ≤ c N (cid:16) X u , uL (cid:17) u d . Proof.
For all u ∈ (0 , ε ] , vol( X u ) ≤ M (cid:0) X u , uL (cid:1) vol (cid:0) uL B (cid:1) ≤ N (cid:0) X u , uL (cid:1) u d v / L , where in the first inequality weupper bounded the volume of a set with the volume of the balls of a smallest ( u / L ) -cover and in the second one weused the fact that covering numbers are always smaller than packing numbers (14).The previous proposition implies in particular that if d ⋆ ∈ [0 , d ) is a near-optimality dimension of f for a constantand C ⋆ > , i.e., if for all u ∈ (0 , ε ] , N ( X u , u / L ) ≤ C ⋆ ( ε / u ) d ⋆ (as in (10)), then vol( X u ) ≤ Cu dr , where C := v / L C ⋆ ε d ⋆ and r = 1 − d ⋆ /d , recovering the assumption on volume (8). The assumption on packing lookstherefore stronger. However, we show now that if f is globally Lipschitz (and its domain is not too pathological), theconverse is also true, i.e., that assumptions on packings and volumes bounds are essentially equivalent. Proposition 11. If f is globally L -Lipschitz and X satisfies Assumption 3 with r > , γ ∈ (0 , , then, for all < w < u < Lr , N (cid:16) X u , uL (cid:17) ≤ γ vol (cid:0) X ( / ) u (cid:1) vol (cid:0) u L B (cid:1) and N (cid:16) X ( w,u ] , wL (cid:17) ≤ γ vol (cid:0) X ( w / , u / ] (cid:1) vol (cid:0) w L B (cid:1) . Proof.
Fix any u > w > . Let η := uL , η := wL , E := X , E := X cw , and i ∈ [2] . Note that for any η > and A, E ⊂ R d , the balls of radius η / centered at the elements of an η -packing of A intersected with E are all disjointand included in ( A + B η / ) ∩ E . Thus, N ( X u ∩ E i , η i ) ≤ vol (cid:0) ( X u ∩ E i + B ηi / ) ∩ X (cid:1) / vol( B ηi / ∩ X ) . To lowerbound the denominator, simply apply Assumption 3 to obtain vol (cid:0) B ηi / ( x ) ∩ X (cid:1) ≥ γ vol( B ηi / ) . To upper boundthe numerator, take an arbitrary point x i ∈ ( X u ∩ E i + B ηi / ) ∩ X . By definition of Minkowski sum, there exists x ′ i ∈ X u ∩ E i such that k x i − x ′ i k ≤ η i / . Hence f ( x ⋆ ) − f ( x i ) ≤ f ( x ⋆ ) − f ( x ′ i ) + (cid:12)(cid:12) f ( x ′ i ) − f ( x i ) (cid:12)(cid:12) ≤ u + L ( η i / ) ≤ ( / ) u . This implies that x i ∈ X ( / ) u , which proves the first inequality. For the second one, note that x satisfies f ( x ⋆ ) − f ( x ) ≥ f ( x ⋆ ) − f ( x ′ ) − (cid:12)(cid:12) f ( x ′ ) − f ( x ) (cid:12)(cid:12) ≥ w − L ( η / ) = ( / ) w .The previous proposition implies in particular that if there exist r ∈ (0 , and C > such that, for all u ∈ (0 , ε ] , vol( X u ) ≤ Cu dr (as in (8)), then d ⋆ = d (1 − r ) is a near-optimality dimension of f for C ⋆ := (cid:0) C ( / ) dr (cid:1) / ( γv / L ε d ⋆ ) ,i.e., for all u ∈ (0 , ε ] , N ( X u , u / L ) ≤ C ⋆ ( ε / ε ) d ⋆ , recovering the assumption on packing numbers (10). Therefore,for globally L -Lipschitz functions, the two assumptions are essentially equivalent. C.2 Which Statement is More General
We recall the prototypical volume-base statement presented at the beginning of Appendix C. ∃ r ∈ (0 , , ∃ C > , ∀ u ∈ (0 , ε ] , vol( X u ) ≤ Cu dr (12) = ⇒ ∃ c > , ∀ n ≥ c (cid:0) ln( / ε ) I r =1 + ( / ε ) d (1 − r ) I r =1 (cid:1) , x ⋆n ∈ X ε . (13)15 x i x ⋆j x ⋆h x k x ⋆ ε ε Figure 1: The graph of a function f for which volume-based results do not apply.In the previous section we showed that assumptions on packing numbers imply the above assumptions (12) on volumeswhen f is Lipschitz around a maximizer (and measurable), but for the converse to be true f has to be globally Lipschitz.Although one might believe that this would make volume assumption better (because they are weaker), it turns out thatthey are in fact too weak in general when the function is only Lipschitz around a maximizer to imply the conclusion(13). We prove this in the following proposition. Proposition 12.
Let
L > and r ∈ (0 , . Then, for any deterministic algorithm and all multiplicative constants c > , there exist a function f : X → R and an accuracy ε ′ > such that:1. f is L -Lipschitz around a maximizer with respect to the Euclidean norm k·k ;2. there exists c > such that vol( X u ) ≤ c u dr for all u ∈ (0 , ε ] ;3. for any ε ∈ (0 , ε ′ ) and all n ≤ e n ε , x ⋆n / ∈ X ε , where e n ε := c (cid:0) ln( / ε ) I r =1 + ( / ε ) d (1 − r ) I r =1 (cid:1) .Proof. Fix any deterministic algorithm and c > . Denote by x ( g ) , x ( g ) , . . . the queries and by x ⋆ ( g ) , x ⋆ ( g ) , . . . the recommendations of the algorithm when applied to the constant function g = 0 . Assume first that r = 1 . Let ε ′ ∈ (cid:0) , min { ( L/ /r / (8 v c ) /rd , c /d (1 − r )2 , } (cid:3) , then fix any ε ∈ (0 , ε ′ ) and define n ε := (cid:6) c /ε d (1 − r ) (cid:7) ≥ e n ε . Let E := (cid:8) x ( g ) , x ⋆ ( g ) , . . . , x n ε ( g ) , x ⋆n ε ( g ) (cid:9) . Note that there exists x ⋆ ∈ X such that min x ∈ E k x − x ⋆ k ≥ / (4 v n ε ) /d ≥ ε − r / (8 v c ) /d . Indeed, if the firstinequality did not hold, than the spheres with radius / (4 v n ε ) /d centered at the points in E would cover X , but thiscould not happen since the largest volume that they could cover is (2 n ε ) v / (4 v n ε ) /d = 1 / < . Fix any such x ⋆ ,note that B ε/L ( x ⋆ ) ∩ E = ∅ , and let Q := Q d ∩ (cid:0) X \ (cid:0) B ε/L ( x ⋆ ) ∪ E (cid:1)(cid:1) . We can now define a function f that attainsits maximum at x ⋆ , has value ε in the zero-volume dense set Q , and satisfies f ( x ) = 0 for all x ∈ E . Concretely,consider the function f : X → R defined for all x ∈ X , by (see Fig. 1) f ( x ) = (cid:0) ε − L k x − x ⋆ k (cid:1) I X \ ( Q ∪ E ) ( x ) + ε I Q ( x ) . Then f is L -Lipschitz around x ⋆ . Moreover, letting c := v /L max (cid:8) ε d (1 − r )0 , (cid:9) , we have vol( X u ) = vol (cid:0) B u/L ( x ⋆ ) (cid:1) = v /L u d = v /L u d (1 − r ) u dr ≤ c u dr , for all u ∈ (0 , ε ] , because f is almost everywhere equal to x ε − L k x − x ⋆ k . Finally, by definition, for all n ≤ n ε , the recommendation x ⋆n of the algorithm after n evaluations of f satisfies f ( x ⋆n ) = 0 , hence x ⋆n / ∈ X ε . Thecase r = 1 can be treated similarly, by letting n ε := (cid:6) c /ε d (1 − ρ ) (cid:7) for some ρ ∈ (0 , and noting that n ε ≥ e n ε for allsufficiently small ε .The result is not surprising since for ε, f as in the proof, N (cid:0) X ε , ε / L (cid:1) ≈ /ε d because of the dense subset Q .Packing numbers of ε -optimizers better capture the difficulty of the problem. In conclusion, packing assumptions arebe the best between the two, as they imply convergence result also in the more general case of functions f that areonly Lipschitz around a maximizer. 16 Useful Results on Packing, Covering, and Volume
For the sake of completeness, we recall the definitions of packing and covering numbers, as well as several usefulresults on packing, covering, and volume.
D.1 Definitions of Packing and Covering Numbers
For any bounded set A ⊂ R d and any real number r > , the r - packing number of A is the largest cardinality of an r -packing of A , that is, N ( A, r ) := sup (cid:26) k ∈ N ∗ : ∃ x , . . . , x k ∈ A, min i = j k x i − x j k > r (cid:27) if A is nonempty, zero otherwise.The r - covering number of A is the smallest cardinality of an r -covering of A , that is, M ( A, r ) := min (cid:8) k ∈ N ∗ : ∃ x , . . . , x k ∈ R d , ∀ x ∈ A, ∃ i ∈ { , . . . , k } , k x − x i k ≤ r (cid:9) if A is nonempty, zero otherwise. D.2 Useful Inequalities
Covering numbers and packing numbers are closely related. In particular, the following well-known inequalitieshold—see, e.g., Wainwright [2019, Lemmas 5.5 and 5.7, with permuted notation of M and N ]. Lemma 13.
Fix any norm k·k . For any bounded set A ⊂ R d and any real number r > , N ( A, r ) ≤ M ( A, r ) ≤ N ( A, r ) . (14) Furthermore, for all δ > and all r > , M ( B δ , r ) ≤ (cid:18) δr I r<δ (cid:19) d . (15)We now state a known lemma about packing numbers at different scales. This is the go-to result for rescalingpacking numbers. Lemma 14.
For any E ⊂ X and < r ≤ r < ∞ , we have N ( E, r ) ≤ (cid:18) r r (cid:19) d N ( E, r ) . Proof.
Consider a r -packing of E with cardinality N = N ( E, r ) , written F = { x , . . . , x N } . Consider then thefollowing iterative procedure. Let F = F and initialize k = 1 . While F k − is non-empty, let y k be any point in F k − , let F k be obtained from F k − by removing the points at || . || -distance less or equal to r from y k and increase k by one. Then this procedure yields an r packing of E with cardinality equal to the number of steps (the final valueof k ). At each step, to the set of removed points we can associate a set of disjoint balls with radius r / in a ball withradius r . Hence the number of removed points is smaller or equal to v r /v r / = (4 r /r ) d . Hence the number ofsteps is larger or equal to N ( E, r ) ( r / r ) d . This concludes the proof since N ( E, r ) is larger than this number ofsteps.We now prove a result on the sum of volumes of overlapping layers that is used in the proof of Theorem 7. The definition of r -covering number of a subset A of R d implied by Wainwright [2019, Definition 5.1] is slightly stronger than the one usedin our paper, because elements x , . . . , x N of r -covers belong to A rather than just R d . Even if we do not need it for our analysis, Inequality (15)holds also in this stronger sense. emma 15. If f is L -Lipschitz around a maximizer (Assumption 1) and Lebesgue-measurable, fix ε > and recallthat m ε := (cid:6) log ( ε /ε ) (cid:7) , ε m ε := ε , and for all k ≤ m ε − , ε k =: ε − k . Then, there exist c, C > (e.g., c = / and C := 2 · d ) such that, for all ε ∈ (0 , ε ] , vol (cid:0) X ε (cid:1) ε d + m X k =1 vol (cid:16) X ( ε k , ε k − ] (cid:17) ε dk − ≤ d vol( X ε ) ε d + m X i =1 vol (cid:0) X ( ε k , ε k − ] (cid:1) ε dk − ! . Proof.
To avoid clutter, we denote m ε simply by m . The left hand side can be upper bounded by vol (cid:0) X ε (cid:1) + vol (cid:0) X ( ε m , ε m − ] (cid:1) + vol (cid:0) X ( ε m − , ε m − ] (cid:1) ε d + m − X k =1 vol (cid:0) X ( ε k +1 , ε k ] (cid:1) + vol (cid:0) X ( ε k , ε k − ] (cid:1) vol (cid:0) X ( ε k − , ε k − ] (cid:1) ε dk − + vol ( X ε ) + vol (cid:0) X ( ε m , ε m − ] (cid:1) + vol (cid:0) X ( ε m − , ε m − ] (cid:1) + vol (cid:0) X ( ε m − , ε m − ] (cid:1) ε dm − + vol ( X ε ) + vol (cid:0) X ( ε m , ε m − ] (cid:1) + vol (cid:0) X ( ε m − , ε m − ] (cid:1) ε dm − ≤ (cid:0) X ε (cid:1) ε d + (2 d + 2) vol (cid:0) X ( ε m , ε m − ] (cid:1) ε dm − + (4 d + 2 d + 1) vol (cid:0) X ( ε m − , ε m − ] (cid:1) ε dm − + 12 d m − X k =2 vol (cid:0) X ( ε k , ε k − ] (cid:1) ε dk − + m − X k =1 vol (cid:0) X ( ε k , ε k − ] (cid:1) ε dk − + 2 d m − X k =1 vol (cid:0) X ( ε k , ε k − ] (cid:1) ε dk − = 3 vol (cid:0) X ε (cid:1) ε d + vol (cid:0) X ( ε m , ε m − ] (cid:1) ε dm − + 4 d vol (cid:0) X ( ε m − , ε m − ] (cid:1) ε dm − + 12 d m − X k =2 vol (cid:0) X ( ε k , ε k − ] (cid:1) ε dk − + m X k =1 vol (cid:0) X ( ε k , ε k − ] (cid:1) ε dk − + 2 d m X k =1 vol (cid:0) X ( ε k , ε k − ] (cid:1) ε dk − where we applied several times the definition of the ε k ’s, the inequality follows by / ε d + / ε dm − + / ε dm − ≤ min (cid:8) / ε d ) , (2 d + 2)( / ε dm − ) , (4 d + 2 d + 1)( / ε dm − ) (cid:9) , and the proof is concluded observing that max(3 , , d ) = 4 d and d + / d + 1 + 2 d ≤ d .The previous result can be generalized to arbitrary partitions and rescaling of the layers. We prove it here for theinterested reader. Proposition 16. If f is L -Lipschitz around a maximizer (Assumption 1) and Lebesgue-measurable, fix any m ∈ N and let ξ m +1 < ξ m ≤ ξ m − ≤ . . . ≤ ξ ≤ ξ := ε . For any k ∈ { , . . . , m } , let also a k ∈ (0 , , b k ≥ , i k := min (cid:8) i ∈ { k, . . . , m + 1 } : ξ i ≤ a k ξ k (cid:9) ,j k := ( , if b k ξ k > ξ , max (cid:8) j ∈ { , . . . , k } : b k ξ k ≤ ξ j (cid:9) , otherwise . For all h ∈ [ m + 1] , define n a,h and n b,h as the following numbers of overlaps with ( ξ h , ξ h − ] : n a,h := (cid:8) k ∈ { , . . . , m + 1 } : i k − ≥ k and k ≤ h ≤ i k − (cid:9) ,n b,h := (cid:8) k ∈ [ m ] : j k ≤ k − and j k + 1 ≤ h ≤ k (cid:9) . Then, let c k := ( ξ j k +1 /ξ k +1 ) d for all k ∈ [ m − , c m := ( ξ j m +1 /ξ m ) d , and c := max k ∈ [ m ] c k . If C := 1 + c max (cid:0) n a,m +1 , max h ∈ [ m ] ( n a,h + n b,h ) (cid:1) , we have, for any function f : X → R , vol( X b m ξ m ) ξ dm + m X k =1 vol (cid:0) X ( a k ξ k , b k − ξ k − ] (cid:1) ξ dk ≤ C vol( X ξ m ) ξ dm + m X k =1 vol (cid:0) X ( ξ k , ξ k − ] (cid:1) ξ dk ! . roof. Letting V := vol( X ξ m ) /ξ dm + P mk =1 vol (cid:0) X ( ξ k , ξ k − ] (cid:1) /ξ dk , we have vol (cid:0) X b m ξ m (cid:1) ξ dm + m X k =1 vol (cid:0) X ( a k ξ k , b k − ξ k − ] (cid:1) ξ dk = vol (cid:0) X ξ m (cid:1) ξ dm + vol (cid:0) X ( ξ m , b m ξ m ] (cid:1) ξ dm + m X k =1 vol (cid:0) X ( a k ξ k , ξ k ] (cid:1) ξ dk + m X k =1 vol (cid:0) X ( ξ k , ξ k − ] (cid:1) ξ dk + m X k =1 vol (cid:0) X ( ξ k − , b k − ξ k − ] (cid:1) ξ dk ≤ V + vol (cid:0) X ( ξ m , ξ jm ] (cid:1) ξ dm + m X k =1 vol (cid:0) X ( ξ ik , ξ k ] (cid:1) ξ dk + m X k =1 vol (cid:0) X ( ξ k − , ξ jk − ] (cid:1) ξ dk =: V + (I) + (II) + (III) . We upper bound separately the addends (II) and (I) + (III) . We have (II) = m X k =1 vol (cid:0) X ( ξ ik , ξ k ] (cid:1) ξ dk = m +1 X k =2 vol (cid:0) X ( ξ ik − , ξ k − ] (cid:1) ξ dk − = m +1 X k =2 i k − ≥ k vol (cid:0) X ( ξ ik − , ξ k − ] (cid:1) ξ dk − = m +1 X k =2 i k − ≥ k i k − X i = k vol (cid:0) X ( ξ i , ξ i − ] (cid:1) ξ dk − ≤ m +1 X k =2 i k − ≥ k i k − X i = k vol (cid:0) X ( ξ i , ξ i − ] (cid:1) ξ di I i = m +1 + ξ di − I i = m +1 = n a,m +1 vol (cid:0) X ( ξ m +1 , ξ m ] (cid:1) ξ dm + m X h =1 n a,h vol (cid:0) X ( ξ h , ξ h − ] (cid:1) ξ dh ≤ n a,m +1 vol (cid:0) X ξ m (cid:1) ξ dm + m X h =1 n a,h vol (cid:0) X ( ξ h , ξ h − ] (cid:1) ξ dh . (I) + (III) = vol (cid:0) X ( ξ m , ξ jm ] (cid:1) ξ dm + m X k =1 vol (cid:0) X ( ξ k − , ξ jk − ] (cid:1) ξ dk = vol (cid:0) X ( ξ m , ξ jm ] (cid:1) ξ dm I j m ≤ m − + m X k =1 j k − ≤ k − vol (cid:0) X ( ξ k − , ξ jk − ] (cid:1) ξ dk = vol (cid:0) X ( ξ m , ξ jm ] (cid:1) ξ dm I j m ≤ m − + m X k =2 j k − ≤ k − vol (cid:0) X ( ξ k − , ξ jk − ] (cid:1) ξ dk = vol (cid:0) X ( ξ m , ξ jm ] (cid:1) ξ dm I j m ≤ m − + m − X k =1 j k ≤ k − vol (cid:0) X ( ξ k , ξ jk ] (cid:1) ξ dk +1 = m X j = j m +1 j m ≤ m − vol (cid:0) X ( ξ j , ξ j − ] (cid:1) ξ dm + m − X k =1 j k ≤ k − k X j = j k +1 vol (cid:0) X ( ξ j , ξ j − ] (cid:1) ξ dk +1 = m X j = j m +1 j m ≤ m − ξ dj ξ dm vol (cid:0) X ( ξ j , ξ j − ] (cid:1) ξ dj + m − X k =1 j k ≤ k − k X j = j k +1 ξ dj ξ dk +1 vol (cid:0) X ( ξ j , ξ j − ] (cid:1) ξ dj ≤ m X j = j m +1 j m ≤ m − ξ dj m +1 ξ dm vol (cid:0) X ( ξ j , ξ j − ] (cid:1) ξ dj + m − X k =1 j k ≤ k − k X j = j k +1 ξ dj k +1 ξ dk +1 vol (cid:0) X ( ξ j , ξ j − ] (cid:1) ξ dj = m X k =1 j k ≤ k − k X j = j k +1 c k vol (cid:0) X ( ξ j , ξ j − ] (cid:1) ξ dj ≤ c m X k =1 j k ≤ k − k X j = j k +1 vol (cid:0) X ( ξ j , ξ j − ] (cid:1) ξ dj = c m X h =1 n b,h vol (cid:0) X ( ξ h , ξ h − ] (cid:1) ξ dh . Putting everything together and noting that c ≥ gives the result. E The DOO algorithm: definition and proofs
E.1 The DOO algorithm
Let us present the (non-certified and certified) DOO algorithm and its underlying assumptions. The non-certifiedalgorithm and the assumptions will essentially be the same as in the original reference Munos [2011], with minoradjustments to our framework, for ease of exposition. The notion of certificate is not present in Munos [2011].The algorithm is defined for a fixed K ∈ N ⋆ , by an infinite sequence of subsets of X (cells) of the form ( X h,i ) h ∈ N ,i =0 ,...,K h − . For h ∈ N , the sets X h, , . . . , X h,K h − are non-empty, pairwise disjoint, and their unioncontains X . Furthermore, ( X h,i ) h ∈ N ,i =0 ,...,K h − is associated with a K -ary tree, meaning that for any h ∈ N and j ∈ { , . . . , K h − } , there exist K distinct i , . . . , i K ∈ { , . . . , K h +1 − } such that X h +1 ,i , . . . , X h +1 ,i K form apartition of X h,j . We call ( h + 1 , i ) , . . . , ( h + 1 , i K ) the children of ( h, j ) .For each cell X h,i , h ∈ N , i = 0 , . . . , K h − , there is a representative x h,i ∈ X h,i , which can be thought of, e.g.,as the center of the cell. We assume that feasible cells have feasible representatives, that is: if X h,i ∩ X 6 = ∅ , then x h,i ∈ X .The sequence of cells and representatives is well-behaved in the sense of the following two assumptions. Assumption 4.
There are fixed < δ < and R < + ∞ such that for any h ∈ N , i = 0 , . . . , K h − and u, v ∈ X h,i , || u − v || ≤ Rδ h . ssumption 5. There is a fixed ν > such that, with δ as in Assumption 4, for any h ∈ N , i = 0 , . . . , K h − , h ′ ∈ N , i ′ = 0 , . . . , K h ′ − , with ( h, i ) = ( h ′ , i ′ ) , || x h,i − x h ′ ,i ′ || ≥ νδ max( h,h ′ ) . Assumption 4 is very classical. Note that Assumption 5, which is key for our improved analysis, is slightly strongerthan in Munos [2011], yet easy to satisfy. For a compact X , a sequence ( X h,i ) h ∈ N ,i =0 ,...,K h − satisfying Assumptions4 and 5 indeed exists. For instance, when X is the unit hypercube [0 , d and || . || is the supremum norm || . || ∞ , wemay use bisections with K = 2 d , letting X h,i be an hypercube of edge length − h and x h,i be its center, for h ∈ N and i = 0 , . . . , dh − . In this case we have R = 1 , δ = 1 / and ν = 1 / in Assumptions 4 and 5.The DOO algorithm is defined in Algorithm 3. For both non-certified and certified versions, the most promisingcell among a set of active cells L n is selected at every iteration k (see (16)) before being split into its K children.For the certified version, the algorithm declares its recommendations x ⋆n to be ε -optimal (i.e., γ n = 1 ) whenever thecondition (17) is met. Algorithm 3:
DOO (non-certified and certified versions) input:
Function f , input set X , Lipschitz bound L , cells ( X h,i ) h ∈ N ,i =0 ,...,K h − , representatives ( x h,i ) h ∈ N ,i =0 ,...,K h − . initialization: Let n = 1 . Let x ⋆ = x = x , . Let L = { (0 , } . Evaluate f ( x ) . Certified version only:
Let γ = 0 . for iteration k = 1 , , . . . do Let (breaking ties arbitrarily) ( h ⋆ , i ⋆ ) ∈ argmax ( h,i ) ∈L n (cid:8) f ( x h,i ) + LRδ h (cid:9) . (16) Certified version only: If f ( x h ⋆ ,i ⋆ ) + LRδ h ⋆ ≤ max( f ( x ) , . . . , f ( x n )) + ε, (17)then let γ ← and γ n ← γ .Let L + be the set of the K children of ( h ⋆ , i ⋆ ) . for ( h ⋆ + 1 , j ) ∈ L + do If X h ⋆ +1 ,j ∩ X 6 = ∅ then, let n ← n + 1 , let x n = x h ⋆ +1 ,j , evaluate f ( x n ) , let x ⋆n ∈ argmax x ∈{ x ,..., x n } f ( x ) and let L n = L n − ∪ { ( h ⋆ + 1 , j ) } . Certified version only: let γ n = γ .Remove ( h ⋆ , i ⋆ ) from L n . E.2 Proof of Proposition 1
We start by proving Proposition 1, and will then comment on where our analysis improves over that of Munos [2011,Theorem 1], as well as on a straightforward extension.
Proof.
We first recall the guarantee (19) below, which is classical (e.g., Munos 2011). By induction, it is straightfor-ward to show that the union of the cells in L n contains X at all steps n ∈ N ⋆ . Therefore, the global maximizer x ⋆ (forwhich the inequality in Assumption 1 holds) belongs to a cell X ¯ h, ¯ i with (¯ h, ¯ i ) ∈ L n . Consider now the cell X h ⋆ ,i ⋆ in(16), at step n . We have, using first (16), and then Assumptions 1 and 4, f ( x h ⋆ ,i ⋆ ) + LRδ h ⋆ ≥ f ( x ¯ h, ¯ i ) + LRδ ¯ h ≥ f ( x ⋆ ) − LRδ ¯ h + LRδ ¯ h = f ( x ⋆ ) . (18)21his implies that, for ( h ⋆ , i ⋆ ) given by (16), f ( x h ⋆ ,i ⋆ ) ∈ X LRδ h⋆ . (19)We now proceed in a slightly different way than in the proof of Munos [2011, Theorem 1]. Consider the first timeat which the DOO algorithm reaches step (16) with f ( x h ⋆ ,i ⋆ ) ≥ f ( x ⋆ ) − ε . Then let I ε be the number of times theDOO algorithm went through step (16) strictly before that time, and denote by n ε the total number of evaluations of f strictly before that same time. Then we have n ε ≤ KI ε . Furthermore, after n ε evaluations of f , we have, by definitions of the recommendation x ⋆n ε and n ε , f ( x ⋆n ε ) = max x ∈{ x ,..., x nε } f ( x ) ≥ max x ∈L nε f ( x ) ≥ f ( x h ⋆ ,i ⋆ ) ≥ f ( x ⋆ ) − ε . Since the optimization error f ( x ⋆ ) − f ( x ⋆n ) can only decrease for n ≥ n ε , the last inequality above entails that thenon-certified sample complexity of DOO (non-certified version) is bounded by n ε and thus ζ ( non-certified DOO , f, ε ) ≤ KI ε . (20)Consider now the sequence ( h ⋆ , i ⋆ ) , . . . , ( h ⋆I ε , i ⋆I ε ) corresponding to the first I ε times the DOO algorithm A wentthrough step (16). Let E ε be the corresponding finite set { x h ⋆ ,i ⋆ , . . . , x h ⋆Iε ,i ⋆Iε } . By definition of I ε , we have E ε ⊆X ( ε,ε ] . Since ε = ε m ε ≤ ε m ε − ≤ . . . ≤ ε , we have E ε ⊆ S m ε i =1 X ( ε i , ε i − ] , so that the cardinality I ε of E ε (a leafcan never be selected twice) satisfies I ε = card( E ε ) ≤ m ε X i =1 card (cid:0) E ε ∩ X ( ε i , ε i − ] (cid:1) . (21)Let N ε,i be the cardinality of E ε ∩ X ( ε i , ε i − ] . For x h,j ∈ E ε ∩ X ( ε i , ε i − ] , from (19), we have ε i < LRδ h and thus δ h > ε i LR .
Consider two distinct x h,j , x h ′ ,j ′ ∈ E ε ∩ X ( ε i , ε i − ] . Then, from Assumption 5, we obtain || x h,j − x h ′ ,j ′ || ≥ νδ max( h,h ′ ) > νε i LR .
Hence, by definition of packing numbers, we have N ε,i ≤ N (cid:16) X ( ε i , ε i − ] , νε i LR (cid:17) . Using now Lemma 14, we obtain N ε,i ≤ ν/R ≥ + ν/R< (cid:18) Rν (cid:19) d ! N (cid:16) X ( ε i , ε i − ] , ε i L (cid:17) . (22)Combining the last inequality with (20) and (21) concludes the proof. Remark 17.
The analysis of the DOO algorithm in Munos [2011, Theorem 1] yields a bound on the non-certified sam-ple complexity than can be expressed in the form C P m ε k =1 N (cid:0) X ε k − , ε k L (cid:1) , with a constant C . The correspondingproof relies on two main arguments. First, when a cell of the form ( h ⋆ , i ⋆ ) , i ⋆ ∈ { , . . . , K h ⋆ − } , is selected in (16) , then the corresponding cell representative x i ⋆ ,h ⋆ is LRδ h ⋆ -optimal (we also use this argument). Second, as aconsequence, for a given fixed value of h ⋆ , for the sequence of values of i ⋆ that are selected in (16) , the correspondingcell representatives x h ⋆ ,i ⋆ form a packing of X [0 ,LRδ h⋆ ] .Our analysis in the proof of Proposition 1 stems from the observation that using a packing of X LRδ h⋆ yieldsa suboptimal analysis, since the cell representatives x h ⋆ ,i ⋆ can be much better than LRδ h ⋆ -optimal. Hence, weproceed differently from Munos [2011], by first partitioning all the selected cell representatives (in (16) ) accord-ing to their level of optimality as in (21) and then by exhibiting packings of the different layers of input points X ( ε,ε mε − ] , X ( ε mε − ,ε mε − ] , . . . , X ( ε ,ε ] . In a word, we partition the values of f instead of partitioning the inputspace when counting the representatives selected at all levels. emark 18. The bound of Proposition 1, based on (3) , is built by partitioning ( ε, ε ] into the m ε sets ( ε, ε m ε − ] , ( ε m ε − , ε m ε − ] , . . . , ( ε , ε ] whose lengths are sequentially doubled (except from ( ε m ε , ε m ε − ] to ( ε m ε − , ε m ε − ] ). As can be seen from the proofof Proposition 1, more general bounds could be obtained, based on more general partitions of ( ε, ε ] . The benefitsof the present partition are the following. First, it considers sets whose upper values are no more than twice thelower values, which controls the magnitude of their corresponding packing numbers in (3) (at scale the lower values).Second the number of sets in the partition is logarithmic in ε which controls the sum in (3) .The same generalization could be applied to the bound based on (4) of Proposition 2 below, for the certified versionof the DOO algorithm. In this latter case, another benefit of choosing the partition ( ε, ε m ε − ] , ( ε m ε − , ε m ε − ] , . . . , ( ε , ε ] (together with the additional set [0 , ε ] ) is that the upper bound is then tight up to a logarithmic factor for most functions f , as proved in Section 3. E.3 Proof of Proposition 2
Let us first show that Algorithm 3 (certified version) is indeed a certified algorithm. Let f be any function satisfyingAssumption 1. For notational simplicity, we set n := σ ( certified DOO , f, ε ) = inf { i ∈ N ∗ : γ i = 1 } and we show that f ( x ⋆ ) − f ( x ⋆n ) ≤ ε . After exactly n evaluations of f , when Algorithm 3 reaches step (16), thecondition (17) holds for the first time. From (18) in the proof of Proposition 1, which applies here since the non-certified and certified versions select the same leaves ( h ⋆ , i ⋆ ) and output the same queries x m and recommendations x ⋆m , we know that f ( x ⋆ ) ≤ f ( x h ⋆ ,i ⋆ ) + LRδ h ⋆ . Since condition (17) also guarantees that f ( x h ⋆ ,i ⋆ ) + LRδ h ⋆ ≤ max( f ( x ) , . . . , f ( x n )) + ε = f ( x ⋆n ) + ε this entails that f ( x ⋆ ) − f ( x ⋆n ) ≤ ε . Since the optimization error n ′ f ( x ⋆ ) − f ( x ⋆n ′ ) can only decrease over time,the requirement γ n ′ = 1 ⇒ f ( x ⋆ ) − f ( x ⋆n ′ ) ≤ ε is true for all n ′ ≥ n and thus all n ∈ N ∗ . This proves that Algorithm3 (certified version) is a certified algorithm.We now show the upper bound on σ ( certified DOO , f, ε ) . Let I ε be the number of times the algorithm went throughstep (16) strictly before the first iteration k where γ n is set to , that is, strictly before (17) holds for the first time.Note that σ ( certified DOO , f, ε ) is the total number of evaluations of f before (17) holds for the first time, so that σ ( certified DOO , f, ε ) ≤ KI ε . (23)Consider now the sequence ( h ⋆ , i ⋆ ) , . . . , ( h ⋆I ε , i ⋆I ε ) corresponding to the first I ε times the DOO algorithm wentthrough step (16). Let E ε be the corresponding finite set { x h ⋆ ,i ⋆ , . . . , x h ⋆Iε ,i ⋆Iε } . Of course we have E ε ⊂ X ε S (cid:0)S m ε i =1 X ( ε i , ε i − ] (cid:1) ,so that I ε = card( E ε ) ≤ card ( E ε ∩ X ε ) + m ε X i =1 card (cid:0) E ε ∩ X ( ε i , ε i − ] (cid:1) . (24)Let N ε,m ε +1 be the cardinality of E ε ∩ X ε . For i = 1 , . . . , m ε , let N ε,i be the cardinality of E ε ∩ X ( ε i , ε i − ] . Withexactly the same arguments as in the proof of Proposition 1, we show, for i = 1 , . . . , m ε , that (22) holds.Let now x h ⋆l ,i ⋆l ∈ E ε ∩ X ε , with ℓ ∈ { , . . . I ε } . The pair ( h ⋆l , i ⋆l ) was selected when the algorithm went throughstep (16) for the ℓ -th time. By definition of I ε , (17) does not hold at this time, which implies, with n being the numberof evaluations of f at this time, f ( x h ⋆ℓ ,i ⋆ℓ ) + LRδ h ⋆ℓ > max( f ( x ) , . . . , f ( x n )) + ε ≥ f ( x h ⋆ℓ ,i ⋆ℓ ) + ε . This implies that
LRδ h ⋆ℓ > ε and thus δ h ⋆ℓ > εLR . Note that, in the definition of Algorithm 3, the variable γ n is sometimes assigned twice, with first the value of and then the value of . In thatcase, we consider that γ n = 1 . x h,j , x h ′ ,j ′ ∈ E ε ∩ X ε . Then, from Assumption 5, we obtain || x h,j − x h ′ ,j ′ || ≥ νδ max( h,h ′ ) > νεLR . Hence, we have N ε,m ε +1 ≤ N (cid:16) X ε , νεLR (cid:17) . Using now Lemma 14, we obtain N ε,m ε +1 ≤ ν/R ≥ + ν/R< (cid:18) Rν (cid:19) d ! N (cid:16) X ε , εL (cid:17) ..