[PDF] Instance-Dependent Bounds for Zeroth-order Lipschitz Optimization with Error Certificates

Abstract

We study the problem of zeroth-order (black-box) optimization of a Lipschitz function f defined on a compact subset \mathcal X of \mathbb R^d , with the additional constraint that algorithms must certify the accuracy of their recommendations. We characterize the optimal number of evaluations of any Lipschitz function f to find and certify an approximate maximizer of f at accuracy \varepsilon . Under a weak assumption on \mathcal X , this optimal sample complexity is shown to be nearly proportional to the integral \int_{\mathcal X} \mathrm{d}\boldsymbol x/( \max(f) - f(\boldsymbol x) + \varepsilon )^d . This result, which was only (and partially) known in dimension d=1 , solves an open problem dating back to 1991. In terms of techniques, our upper bound relies on a slightly improved analysis of the DOO algorithm that we adapt to the certified setting and then link to the above integral. Our instance-dependent lower bound differs from traditional worst-case lower bounds in the Lipschitz setting and relies on a local worst-case analysis that could likely prove useful for other learning tasks.

Full PDF

aa r X i v : . [ m a t h . S T ] F e b The Sample Complexities of Global Lipschitz Optimization

Franc¸ois Bachoc , Tommaso R. Cesari , and S´ebastien Gerchinovitz Institut de Math´ematiques de Toulouse & University Paul Sabatier Toulouse School of Economics IRT Saint Exup´ery & Institut de Math´ematiques de ToulouseFebruary 4, 2021

Abstract

We study the problem of black-box optimization of a Lipschitz function f deﬁned on a compact subset X of R d ,both via algorithms that certify the accuracy of their recommendations and those that do not. We investigate theirsample complexities, i.e., the number of samples needed to either reach or certify a given accuracy ε . We start byproving a tighter bound for the well-known DOO algorithm [Perevozchikov, 1990, Munos, 2011] that matches thebest existing upper bounds for (more computationally challenging) non-certiﬁed algorithms. We then introduce andanalyze a new certiﬁed version of DOO and prove a matching f -dependent lower bound (up to logarithmic terms).Afterwards, we show that this optimal quantity is proportional to R X d x / ( f ( x ⋆ ) − f ( x ) + ε ) d , solving as a corollarya three-decade-old conjecture by Hansen et al. [1991]. Finally, we show how to control the sample complexity ofstate-of-the-art non-certiﬁed algorithms with an integral reminiscent of the Dudley-entropy integral. Keywords: global optimization, bandit optimization.

In this paper, f : X → R denotes a function deﬁned on a compact non-empty subset X of R d . We consider thefollowing global optimization problem: with only black-box access to f , ﬁnd an ε -optimal point x ⋆n ∈ X of f with aslittle evaluations of f as possible.We make the following weak Lipschitz assumption, where the norm k·k and the constant L are known to thelearner. For some of our results, we will instead assume f to be globally L -Lipschitz. Assumption 1 (Lipschitzness around a maximizer) . The function f attains its maximum at some x ⋆ ∈ X , and thereexist a constant L > and a norm k·k such that, for all x ∈ X , f ( x ) ≥ f ( x ⋆ ) − L k x ⋆ − x k . We study the case in which f is black-box, i.e., except for some a priori knowledge on its smoothness, we canonly access f by sequentially querying its values at a sequence x , x , . . . ∈ [0 , d of points of our choice (seeOnline Protocol 1 below). At every round i ≥ , the query point x i can be chosen as a deterministic functionof the values f ( x ) , . . . , f ( x i − ) observed so far. At the end of round i , the learner outputs a recommendation x ⋆i ∈ X with some other optional information. The goal is to minimize the optimization error (or simple regret ): max x ∈X f ( x ) − f ( x ⋆i ) = f ( x ⋆ ) − f ( x ⋆i ) .In all the sequel, we consider two different types of algorithms:• certiﬁed algorithms output an accuracy certiﬁcate γ i ∈ { , } along with the recommendation x ⋆i at the end ofeach step i . By deﬁnition, certiﬁed algorithms take as input an accuracy ε > and are such that f ( x ⋆ ) − f ( x ⋆i ) ≤ ε whenever γ i = 1 , for any function f satisfying Assumption 1. In other words, the accuracy certiﬁcate γ i canonly equal if the algorithm can guarantee that the recommendation x ⋆i is ε -optimal;1 nline Protocol 1: Certiﬁed/non-certiﬁed algorithm input: accuracy ε > ( certiﬁed algorithms only ) for i = 1 , , . . . do pick the next query point x i ∈ X observe the value f ( x i ) ∈ R output a recommendation x ⋆i ∈ X output a certiﬁcate γ i ∈ { , } ( certiﬁed algorithms only )• non-certiﬁed algorithms only output the recommendation x ⋆i ∈ X at the end of step i .The goal of ﬁnding an ε -optimal point x ⋆n ∈ X of f with as little evaluations of f as possible corresponds tominimizing the sample complexity. More precisely, we associate a natural notion of sample complexity to each of thetwo types of algorithms above. We will discuss existing bounds and algorithms in Section 1.2, and summarize ourmain contributions in Section 1.3. Non-certiﬁed sample complexity.

For non-certiﬁed algorithms, a natural and classical performance criterion is theminimum number of queries made before outputting ε -optimal recommendations only. More precisely, we deﬁne the non-certiﬁed sample complexity ζ ( A, f, ε ) of a non-certiﬁed algorithm A for a function f with accuracy ε > as ζ ( A, f, ε ) := inf { n ≥ ∀ i ≥ n, x ⋆i ∈ X ε } ∈ { , , . . . } ∪ { + ∞} . (1) Certiﬁed sample complexity.

For certiﬁed algorithms, it is more natural to look at a notion of sample complexitythat does not depend on a oracle knowing f but instead that can be computed on the basis of the outputs of thealgorithm only. We deﬁne the certiﬁed sample complexity σ ( A, f, ε ) of a certiﬁed algorithm A for a target function f with accuracy ε > as σ ( A, f, ε ) = inf { i ≥ γ i = 1 } ∈ { , , . . . } ∪ { + ∞} . (2)This corresponds to the ﬁrst time when we can stop the algorithm while being sure to have an ε -optimal recommenda-tion x ⋆i . The most naive approach to output an ε -optimal point knowing that f satisﬁes Assumption 1 is to perform a uniformgrid search in X with step-size ≈ ε/L , requiring roughly ( L/ε ) d evaluations of f . Though this approach is optimal inthe worst case (see, e.g., Theorem 1.1.2 by Nesterov 2003 where a small Lipschitz bump function f is adversariallyconstructed for any given algorithm), it is clearly suboptimal for functions that take “very suboptimal” values for a“large fraction” of the domain X . For such more benign functions, sequential algorithms can reasonably quickly ruleout very suboptimal regions and then mostly explore closer-to-optimality regions. Indeed many papers have consideredsequential algorithms with improved bounds for benign functions, as discussed below.Many of these bounds rely on the notions of sets of near-optimal points and of packing numbers, which we recallﬁrst. For all ε > , the set of ε - optimal points of f is deﬁned by X ε := (cid:8) x ∈ X : f ( x ) ≥ f ( x ⋆ ) − ε (cid:9) . Wealso denote its complement (i.e., the set of ε - suboptimal points ) by X cε and, for all ≤ a < b , the ( a, b ] - layer by X ( a,b ] := X ca ∩ X b = (cid:8) x ∈ X : a < f ( x ⋆ ) − f ( x ) ≤ b (cid:9) (i.e., the set of points that are b -optimal but a -suboptimal).Since f is L -Lipschitz around x ⋆ , every point in X is ε -optimal, with ε deﬁned by ε := L sup x , y ∈X k x − y k . Inother words, X ε = X . For this reason, without loss of generality we will only consider values of accuracy ε smallerthan or equal to ε .Recall also that for any bounded set A ⊂ R d and any real number r > , the r - packing number of A is the largestcardinality of an r -packing of A , that is, N ( A, r ) := sup (cid:8) k ∈ N ∗ : ∃ x , . . . , x k ∈ A, min i = j k x i − x j k > r (cid:9) if A is nonempty, zero otherwise. Well-known and useful properties of packing (and covering) numbers are recalled inAppendix D. 2 ounds on the non-certiﬁed sample complexity. Several (non worst-case) f -dependent upper bounds involvingsets of ε -optimal points X ε or layers X ( a,b ] of f have been derived over the past decades for the non-certiﬁed samplecomplexity. For instance, for the DOO algorithm, Munos [2011, Theorem 1] proved a bound roughly of the form P δ − ( ε ) i =1 N ( X δ ( i ) , cδ ( i )) for some constant c > independent of f , and where the sequence i δ ( i ) is decreasing.In the special case where δ ( i ) ≈ γ i with γ ∈ (0 , , and for (weakly) Lipschitz functions f such that N ( X ε , cε ) . (1 /ε ) d ⋆ with d ⋆ ∈ [0 , d ] (we say that f has near-optimality dimension smaller than d ⋆ ), this non-certiﬁed samplecomplexity bound implies a bound roughly of log(1 /ε ) if d ⋆ = 0 or (1 /ε ) d ⋆ if d ⋆ > . A similar bound wasproved by Perevozchikov [1990] for a branch-and-bound algorithm close to the DOO algorithm, with an assumptionon the volume of X ε (instead of a packing number), or by Malherbe and Vayatis [2017] for a stochastic version of thePiyavskii-Shubert algorithm under a stronger assumption on the shape of f around its maximizer.The matching worst-case lower bounds of Nesterov [2003, Theorem 1.1.2] when d ⋆ = d , of Horn [2006] when d ⋆ = d/ , and the lower bounds of Bubeck et al. [2011] for any d ⋆ but in the stochastic case suggest that this boundis worst-case optimal over the class of functions with given near-optimality dimension d ⋆ . However, the boundsmentioned above have several limitations. First, as noted earlier by Kleinberg et al. [2019, remark after Theorem 4.4],bounds involving a single notion of dimension such as the near-optimality or zooming dimension might be too crude.Indeed, functions that feature different shapes at different scales (such as, e.g., different values of d ⋆ for differentranges of ε ) are better analyzed in a non-asymptotic framework through sums of packing numbers at different scales(see discussion at the end of Section 5). Second, considering packing numbers of the near-optimal sets X δ ( i ) as in theﬁrst bound mentioned above can also be quite suboptimal. For instance, for constant functions f , the sets X δ ( i ) areequal to the whole domain X (maximal packing number) while f is optimized by any algorithm after only a singleevaluation of f .The second limitation was overcome in general metric spaces by Kleinberg et al. [2019] (see also Kleinberg et al.2008) by considering packing numbers of layers X ( a,b ] instead of near-optimal sets X ε . In the special deterministicsetting with compact domain X ⊂ R d considered here, Bouttier et al. [2020, Theorem 1] proved that the non-certiﬁedsample complexity of the Piyavskii-Shubert algorithm [Piyavskii, 1972, Shubert, 1972] is at most of S NC ( f, ε ) ,where S NC ( f, ε ) := m ε X k =1 N (cid:16) X ( ε k ,ε k − ] , ε k L (cid:17) , (3)where m ε := (cid:6) log ( ε /ε ) (cid:7) , ε m ε := ε and, for each k ∈ { , , . . . , m ε − } , ε k := ε − k .The bound (3) on the non-certiﬁed sample complexity is proved by noting that highly suboptimal regions X ( ε k ,ε k − ] (with ε k ≫ ε ) need not be explored too much thanks to Assumption 1. Any reasonable algorithm can therefore“quickly” output recommendations x ⋆i closer to ε -optimality.Notably, the two algorithms achieving (3) or similar bounds are computationally challenging. Indeed, the zoomingalgorithm of Kleinberg et al. [2008, 2019] designed for general metric spaces is based on a “covering oracle” thattakes as input a collection of balls and either declares it to be a covering of the input space, or outputs a point whichis not covered. As for the Piyavskii-Shubert algorithm [Piyavskii, 1972, Shubert, 1972], it requires at every step n to solve an inner global Lipschitz optimization problem whose computational complexity might resemble that of thecomputation of a Voronoi diagram (see discussion in Bouttier et al. 2020, Section 1.1).In Section 2.1 we show that the more computationally tractable DOO algorithm matches the same bound (3).For better interpretability, we also show in Section 5 that this bound is proportional to an integral reminiscent of theDudley-entropy integral. Bounds on the certiﬁed sample complexity.

This notion of sample complexity seems to have been less studied inthe past. In dimension d = 1 , several authors derived upper bounds for a certiﬁed version of the Piyavskii-Shubertalgorithm. In particular, Hansen et al. [1991] proved that its certiﬁed sample complexity for globally L -Lipschitzfunctions f : [0 , → R is at most proportional to the integral R (cid:0) f ( x ⋆ ) − f ( x ) + ε (cid:1) − d x , and left the question ofextending the results to arbitrary dimensions open. Recently, Bouttier et al. [2020, Theorem 2] proved a bound validfor any dimension d ≥ roughly of this form: S C ( f, ε ) := N (cid:16) X ε , εL (cid:17) + m ε X k =1 N (cid:16) X ( ε k ,ε k − ] , ε k L (cid:17) , (4) The bound in Bouttier et al. [2020, Theorem 1] with α = 0 differs a little in the smallest level considered, which can be as small as ε/ , insteadof the more natural level ε . Their bound can however be straightforwardly improved to get ε instead. m ε := (cid:6) log ( ε /ε ) (cid:7) , ε m ε := ε and, for each k ∈ { , , . . . , m ε − } , ε k := ε − k .We show in Section 2.2 how to obtain (4) with the more computationally tractable DOO algorithm, and prove amatching f -dependent lower bound (up to log factors) in Section 3. In Section 4 we also show that the bound (4) isactually proportional to R X (cid:0) f ( x ⋆ ) − f ( x ) + ε (cid:1) − d d x for globally L -Lipschitz functions f : X → R under a mildassumption on X , solving the question left open by Hansen et al. [1991] three decades ago.Though the two bounds (3) and (4) look very similar, they differ in the ﬁrst term being either or N (cid:0) X ε , εL (cid:1) .This difference is not negligible in general and can be explained with a simple example. Indeed, consider a constantfunction c : [0 , d → R . Any non-certiﬁed algorithm has sample complexity for c at any scale ε , because all pointsin the domain are maxima. However, the only way to certify that the output is ε -optimal is essentially to perform agrid-search of [0 , d with step-size roughly ε/L , so as to be sure there is no hidden bump of height more than ε . This isreﬂected in the term N (cid:0) X ε , εL (cid:1) , which is of order ( L/ε ) d for constant functions. At a high level, the more “constant”a function is, the easier it is to recommend an ε -optimal point, but the harder it is to certify that such recommendationis actually a good recommendation. In Section 2, we reﬁne the analysis of the DOO algorithm, showing that its certiﬁed (resp., non-certiﬁed) samplecomplexity can be upper bounded (up to constants) by S C ( f, ε ) (resp., S NC ( f, ε ) ). In Section 3, we prove an f -dependent lower bound on the sample complexity of all certiﬁed algorithms, which matches S C ( f, ε ) up to logarithmicterms. In Section 4, we show for globally Lipschitz functions f : X → R (with a mild condition on X ) that S C ( f, ε ) is proportional to the integral R X d x / ( f ( x ⋆ ) − f ( x ) + ε ) d , which thus characterizes the optimal certiﬁed samplecomplexity of global Lipschitz optimization. In Section 5, we also show that S NC ( f, ε ) is proportional to the Dudley-type integral over accuracies R ∞ ε N (cid:0) X (max( ε, u/ , u ] , u / L (cid:1) /u d u . Some proofs and useful lemmas are deferred to theappendix, together with a comparison of packing-based and volume-based bounds (Appendix C). What we do not cover

There are some interesting directions that would be worth investigating in the future but wedid not cover in this paper, such as noisy observations (see, e.g, Bubeck et al. 2011, Kleinberg et al. 2019) or adaptivityto smoothness (e.g., Munos 2011, Bartlett et al. 2019; we consider L -Lipschitz functions f with L known, althoughour lower bound suggests that no adaptivity could be possible for certiﬁed algorithms). Finally, even if the results ofSection 2 could be easily extended to general pseudo-metric spaces as in Munos [2014] and related works, our otherresults are ﬁnite-dimensional and exploit the normed space structure. We denote the set of positive integers { , , . . . } by N ∗ and let N := N ∗ ∪ { } . For all n ∈ N ∗ , we denote by [ n ] theset of the ﬁrst n integers { , . . . , n } . We denote by A + B the Minkowski sum of two sets A, B and for any set A andall λ ∈ R , we let λA := { λ a : a ∈ A } .We denote the Lebesgue measure of a Lebesgue-measurable set A by vol( A ) and we simply refer to it as volume .For all ρ > and x ∈ R d , we denote by B ρ ( x ) the closed ball with radius ρ centered at x . We also write B ρ for theball with radius ρ centered at the origin and denote by v ρ its volume. In this section, we show that the DOO algorithm of Perevozchikov [1990], Munos [2011] and a new certiﬁed version ofit match the two f -dependent bounds (3) and (4) previously derived for computationally more challenging algorithms.In particular, our analysis of the non-certiﬁed sample complexity of DOO improves over that of Munos [2011]. Anearly matching lower bound on the certiﬁed sample complexity of all certiﬁed algorithms will be proved in Section 3.The deﬁnition of the DOO algorithm and the associated assumptions are recalled (and slightly adjusted) in Ap-pendix E.1, together with its new certiﬁed version. 4 .1 Upper Bound on the Non-Certiﬁed Sample Complexity The next theorem shows that the quantity S NC ( f, ε ) deﬁned in (3) is a tight upper bound on the non-certiﬁed samplecomplexity of the DOO algorithm. This improves over the bound of Munos [2011, Theorem 1] since our boundinvolves a sum of disjoint layers X ( ε k ,ε k − ] whose values are all above ε , while the bound of Munos [2011] involvesa sum over overlapping layers of the form X [0 ,u ) for all values < u ≤ ε . See Section 1.2 and Remark 17 inAppendix E.2 for further discussion. We also note that the result could be extended to more general layers than thosebased on ( ε k ) k ∈{ ,...,m ε } , as detailed in Remark 18.Recall that ζ ( A, f, ε ) denotes the non-certiﬁed sample complexity of an algorithm A (see (1)), and that m ε := (cid:6) log ( ε /ε ) (cid:7) , ε m ε := ε and, for each k ∈ { , , . . . , m ε − } , ε k := ε − k . Theorem 1.

Assume that f satisﬁes Assumption 1, and that Assumptions 4 and 5 in Appendix E.1 hold. Then, the non-certiﬁed DOO algorithm (Algorithm 3 in Appendix E.1) has a non-certiﬁed sample complexity bounded as follows:for any ε ∈ (0 , ε ] , ζ ( non-certiﬁed DOO , f, ε ) ≤ C m ε X k =1 N (cid:16) X ( ε k ,ε k − ] , ε k L (cid:17) , where C = K (cid:0) ν/R ≥ + ν/R< ( R / ν ) d (cid:1) . Note that the above bound equals C · S NC ( f, ε ) , where S NC ( f, ε ) was deﬁned in (3) in Section 1.2. It thereforematches the bound of Bouttier et al. [2020, Theorem 1] on the Piyavskii-Shubert algorithm up to constant factors, butfor the more computionally tractable DOO algorithm.As discussed in Appendix E.1, when X = [0 , d and k·k is the sup norm, we have K = 2 d , R = 1 , δ = 1 / and ν = 1 / . Hence, in this case, the multiplicative constant C above equals d .The proof is postponed to Appendix E.2 and shares common arguments with the proofs of Perevozchikov [1990],Munos [2011]. As noted in Remark 17, some steps are however tighter, leading to a tighter bound. The key changeis to partition the values of f instead of partitioning the domain X at any depth h in the tree (see Munos 2011)when counting the representatives selected at all levels. The idea of using layers X ( ε i , ε i − ] was already presentin Kleinberg et al. [2008, 2019] and Bouttier et al. [2020] but for more computionally challenging algorithms (seediscussion in Section 1.2). We now analyze the certiﬁed version of the DOO algorithm that we design in Appendix E.1, and show that its certiﬁedsample complexity is at most proportional to S C ( f, ε ) deﬁned in (4).Recall that σ ( A, f, ε ) denotes the certiﬁed sample complexity of an algorithm A (see (2)), and that m ε := (cid:6) log ( ε /ε ) (cid:7) , ε m ε := ε and, for each k ∈ { , , . . . , m ε − } , ε k := ε − k . Theorem 2.

Assume that Assumptions 4 and 5 in Appendix E.1 hold. Then the certiﬁed DOO algorithm (Algorithm3 in Appendix E.1) is indeed a certiﬁed algorithm. Furthermore, for any f satisfying Assumption 1 and for any ε ∈ (0 , ε ] , its certiﬁed sample complexity is bounded as σ ( certiﬁed DOO , f, ε ) ≤ C N (cid:16) X ε , εL (cid:17) + m ε X k =1 N (cid:16) X ( ε k ,ε k − ] , ε k L (cid:17)! , where C = 1 + K (cid:0) ν/R ≥ + ν/R< ( R / ν ) d (cid:1) . Note that the above bound is proportional to the quantity S C ( f, ε ) deﬁned in (4) in Section 1.2. It thereforematches the bound of Bouttier et al. [2020, Theorem 2] on the certiﬁed version of the Piyavskii-Shubert algorithm upto constant factors, but for the more computionally tractable DOO algorithm.Similarly to Theorem 1, the above result could be easily extended to more general layers than those based on ( ε k ) ≤ k ≤ m ε (see Remark 18 for further details).The proof is postponed to Appendix E.3. It follows approximately the same lines as that of Theorem 1, exceptthat considering the stopping time σ ( certiﬁed DOO , f, ε ) introduces the additional term N (cid:0) X ε , εL (cid:1) , which cannot beavoided as discussed in Section 1.2 and proved formally in Section 3. Similar arguments were used by Bouttier et al.[2020], which we adapt here to handle the discretization inherent to the DOO algorithm (this discretization makes theanalysis slightly less natural but the algorithm more computationally tractable).5 Lower Bounds for the Certiﬁed Sample Complexity

In this section, we will focus on f -dependent lower bounds on the certiﬁed sample complexity of certiﬁed algorithmsapplied to globally Lipschitz functions. Formally, we assume the following. Assumption 2 (Global Lipschitzness) . We say that L := sup u , v ∈X , u = v (cid:12)(cid:12) f ( u ) − f ( v ) (cid:12)(cid:12) / k u − v k , where k·k is anorm, is the Lipschitz constant of f and that f is globally L -Lipschitz if L ≥ L . It is immediate to see that Assumption 1 is a relaxation of Assumption 2. Similarly to Section 1.1, we say that A is a certiﬁed algorithm for L -Lipschitz functions if for any globally L -Lipschitz function f , accuracy ε ∈ (0 , ε ] , andtime step i ∈ N ∗ , the accuracy certiﬁcate γ i of A at time i when applied to f with accuracy ε is equal to only if itsrecommendation x ⋆i satisﬁes x ⋆i ∈ X ε .The main result of this section (Theorem 3) shows that when a certiﬁed algorithm for L -Lipschitz functions isapplied to a function f that is globally L ′ -Lipschitz, with L ′ bounded away below L , then its certiﬁed sample com-plexity is (up to a log factor) at least S C ( f, ε ) . This is the quantity that we presented in Sections 1 and 2, whichupper bounds the certiﬁed sample complexity of algorithms such as DOO and Piyavskii-Shubert. Putting these upperand lower bounds together proves that the optimal f -dependent certiﬁed sample complexity (of certiﬁed algorithmsfor L -Lipschitz functions) is of order S C ( f, ε ) , up to a log factor, at least for globally L ′ -Lipschitz functions with L ′ bounded away below L . The boundary case where L ′ = L is discussed later. Theorem 3.

Let < L ′ < L , K := 16 L/ ( L − L ′ ) , and c := (1 / K ) d / . Then, the certiﬁed sample complexity of anycertiﬁed algorithm A for L -Lipschitz functions satisﬁes, for any globally L ′ -Lipschitz function f and all ε ∈ (0 , ε ] , σ ( A, f, ε ) > c m ε S C ( f, ε ) . (5)Before proving Theorem 3, we introduce the insightful notions of worst-case error and worst-case sample com-plexity . For an algorithm A and a function f , we denote the point queried by A when applied to f at time n by x n ( A, f ) , the recommendation by x ⋆n ( A, f ) , and the accuracy certiﬁcate by γ n ( A, f ) . For any n ∈ N ⋆ we deﬁne the worst-case error of A when applied to f at time n as E L ( A, f, n ) := sup n max( g ) − g (cid:0) x ⋆n ( A, f ) (cid:1) : g is globally L -Lipschitz and g ( x ) = f ( x ) for all x ∈ (cid:8) x ( A, f ) , . . . , x n ( A, f ) (cid:9)o . (6)Note that if f is globally L -Lipschitz, the optimization error of A at time n is no larger than (but possibly equal to) E L ( A, f, n ) . We then deﬁne the worst-case minimax error for f at step n as inf A E L ( A, f, n ) , where the inﬁmumis over all algorithms (not only certiﬁed ones). A classic performance measure in global optimization is the minimaxerror inf A sup g (cid:0) max( g ) − g (cid:0) x ⋆n ( A, g ) (cid:1)(cid:1) , where the supremum is over all gobally L -Lipschitz functions and theinﬁmum is over all algorithms (e.g., see Nesterov 2003). Compared to ours, this is a more pessimistic notion, thatdoes not depend on f . Let < L ′ ≤ L . Based on our worst-case minimax error, we deﬁne the the worst-case samplecomplexity of a globally L ′ -Lipschitz function f with accuracy ε ∈ (0 , ε ] , as τ ( f, ε ) := min (cid:8) n ∈ N ∗ : inf A E L ( A, f, n ) ≤ ε (cid:9) , where the inﬁmum is over all algorithms. It is immediate to prove that τ ( f, ε ) is ﬁnite, by considering an algorithm thatqueries a dense sequence of points (independently of the observed function values) and outputs as a recommendationthe maximizer of the observed values.Crucially, the worst-case sample complexity lower bounds the certiﬁed sample complexity. Lemma 4.

Let < L ′ ≤ L . For any certiﬁed algorithm A for L -Lipschitz functions, any globally L ′ -Lipschitzfunction f , and all ε ∈ (0 , ε ] , we have σ ( A, f, ε ) ≥ τ ( f, ε ) .Proof. Let N = σ ( A, f, ε ) . Then γ N ( A, f ) = 1 . Assume that

N < τ ( f, ε ) . Then we have E L ( A, f, N ) ≥ inf A E L ( A, f, N ) > ε by deﬁnition of τ ( f, ε ) . This means that there exists a globally L -Lipschitz function g , coin-ciding with f on x ( A, f ) , . . . , x N ( A, f ) and such that max( g ) − x ⋆N ( A, g ) > ε . Then by deﬁnition of the certiﬁcate, γ N ( A, g ) = 0 . But γ N ( A, f ) = γ N ( A, g ) , which yields a contradiction and concludes the proof.6e can now present a full proof of Theorem 3, the main result of this section. The high-level idea is the following.For a given certiﬁed algorithm A (for L -Lipschitz functions) and a (globally L ′ -Lipschitz) f , we create an adversarialfunction by adding a perturbation ( ± g ˜ ε in the proof) to f . To show that the resulting function is globally L -Lipschitz,we sum the Lipschitz constants of f and ± g ˜ ε . This is why we require the Lipschitz constant L ′ of f to be boundedaway below L . Proof of Theorem 3.

Because of Lemma 4, it is sufﬁcient to show that τ ( f, ε ) > cS C ( f, ε ) / (1+ m ε ) . If S C ( f, ε ) / (1+ m ε ) < K ) d , then τ ( f, ε ) ≥ > / > cS C ( f, ε ) / (1+ m ε ) . Consider then from now on that S C ( f, ε ) / (1+ m ε ) ≥ K ) d .We will now upper bound S C ( f, ε ) / (1 + m ε ) with (an upper bound of) the largest summand in (4). Let ˜ ε be thescale achieving the maximum in (4), that is ˜ ε = ( ε , if N (cid:0) X ε , εL (cid:1) ≥ max i ∈{ ,...,m ε } N (cid:0) X ( ε i , ε i − ] , ε i L (cid:1) ,ε i ⋆ − , otherwise, where i ⋆ ∈ argmax i ∈{ ,...,m ε } N (cid:0) X ( ε i , ε i − ] , ε i L (cid:1) . Since N ( X ε , ε/L ) ≤ N ( X ε , ε/ L ) and N (cid:0) X ( ε i , ε i − ] , ε i /L (cid:1) ≤ N (cid:0) X ε i − , ε i − / L (cid:1) , we then have S C ( f, ε ) ≤ ( m ε + 1) N (cid:0) X ˜ ε, ˜ ε/ L (cid:1) . Let now n ≤ cS C ( f, ε ) / (1 + m ε ) . We then have n ≤ c N (cid:0) X ˜ ε, ˜ ε/ L (cid:1) . From Lemma 14, N (cid:0) X ˜ ε , K ˜ εL (cid:1) ≥ (cid:0) K (cid:1) d N (cid:0) X ˜ ε , ˜ ε L (cid:1) ≥ (cid:0) K (cid:1) d S C ( f,ε ) m ε +1 ≥ , as considered earlier. Then we have n ≤ c (8 K ) d N (cid:0) X ˜ ε, K ˜ ε/L (cid:1) . Since c (8 K ) d = 1 / , we thus obtain n ≤N (cid:0) X ˜ ε, K ˜ ε/L (cid:1) − .Consider a certiﬁed algorithm A for L -Lipschitz functions. Fix a K ˜ ε/L packing x , . . . , x N of X ˜ ε with cardinality N = N ( X ˜ ε , K ˜ ε/L ) . Then the open balls of centers x , . . . , x N and radius K ˜ ε/ L are disjoint and two of them,with centers, say, ˜ x and ˜ x , do not contain any of the x ( A, f ) , . . . , x n ( A, f ) . Let, for x ∈ X , g ˜ ε ( x ) := (cid:0) ˜ ε − LK k x − ˜ x k (cid:1) I (cid:0) x ∈ X ∩ B K ˜ ε/ L (˜ x ) (cid:1) . Then g ˜ ε ( x ) is globally L/K = L − L ′ Lipschitz. Hence f + g ˜ ε and f − g ˜ ε are L -Lipschitz. Observe that f , f + g ˜ ε and f − g ˜ ε coincide on x ( A, f ) , . . . , x n ( A, f ) . As a consequence, A has the same recommendation for them: x ⋆n ( A, f ) = x ⋆n ( A, f + g ˜ ε ) = x ⋆n ( A, f − g ˜ ε ) .Consider ﬁrst the case x ⋆n ( A, f ) ∈ B (˜ x , K ˜ ε/ L ) . Then, we have, by deﬁnition of g ˜ ε and the fact that ˜ x ∈ X ˜ ε , f (˜ x ) − g ˜ ε (˜ x ) − f ( x ⋆n ( A, f )) + g ˜ ε ( x ⋆n ( A, f )) ≥ − ˜ ε + 8˜ ε − LK K ˜ ε L = 3˜ ε .Now consider the case x ⋆n ( A, f ) B (˜ x , K ˜ ε/ L ) . Then, we have, by deﬁnition of g ˜ ε and the fact that ˜ x ∈ X ˜ ε , f (˜ x ) + g ˜ ε (˜ x ) − f ( x ⋆n ( A, f )) − g ˜ ε ( x ⋆n ( A, f )) ≥ − ˜ ε + 8˜ ε − ε + LK K ˜ ε L = 3˜ ε .Finally, in both cases E L ( A, f, n ) ≥ ε > ε . Hence inf A E L ( A, f, n ) > ε . Since this has been shown for any n ≤ cS C ( f, ε ) / (1 + m ε ) we thus have τ ( f, ε ) > cS C ( f, ε ) / (1 + m ε ) .We conclude the section by discussing the limit case L ′ → L , in which the constant in Theorem 3 diverges. Thenext result shows that in dimension d = 1 , the limit case L = L ′ does not hinder the validity of the result. The proofis deferred to Appendix A.1. Proposition 5. If d = 1 , let K = 8 and c = 1 / K . Then, the certiﬁed sample complexity of any certiﬁed algo-rithm A for L -Lipschitz functions satisﬁes, for any globally L -Lipschitz function f and all ε ∈ (0 , ε ] , σ ( A, f, ε ) > ( c / m ε ) S C ( f, ε ) . The ﬁnal result of this section shows that, in higher dimension d ≥ , the improvement of Proposition 5 when L ′ is close to L is not possible in general. The proof is deferred to Appendix A.2. Proposition 6.

Let d ≥ , X := B , and k·k be a norm. The certiﬁed Piyavskii-Shubert algorithm (for L -Lipschitzfunctions) with initial guess x := is a certiﬁed algorithm (for L -Lipschitz functions) satisfying, for the globally L -Lipschitz function f := L k·k and any ε ∈ (0 , ε ) , σ ( Piyavskii-Shubert , f, ε ) = 2 ≪ c/ε d − ≤ S C ( f, ε ) , where c > is a constant independent of ε . See Appendix A for more details on the certiﬁed Piyavskii-Shubert algorithm. An Integral Characterization of the Optimal Certiﬁed Sample Complex-ity S C In dimension d = 1 , an elegant bound on the certiﬁed sample complexity was derived by Hansen et al. [1991] for acertiﬁed version of the Piyavskii-Shubert algorithm. They proved that if f is globally Lipschitz, for any accuracy ε ∈ (0 , ε ] , the smallest number of time steps σ := σ ( certiﬁed Piyavskii-Shubert , f, ε ) before outputting γ σ = 1 is at mostproportional to R (cid:0) f ( x ⋆ ) − f ( x ) + ε (cid:1) − d x . As pointed out in Bouttier et al. [2020], given that σ for this algorithmcan be upper bounded with S C , this suggests a relationship between the two quantities. However, Hansen et al. [1991]rely heavily on the one-dimensional setting and the speciﬁc form of the Piyavskii-Shubert algorithm, claiming it thateven the simpler task of “ Extending the results of this paper to the multivariate case appears to be difﬁcult ”. In thissection, we show an equivalence between S C and this type of integral bound in any dimension d . As a corollary, thissolves the long-standing problem raised by Hansen et al. [1991] three decades ago.To tame the wild spectrum of shapes that compact subsets may have, we will assume that X satisﬁes the followingadditional assumption. At a high level, it says that a constant fraction of each (sufﬁciently small) ball centered ata point in X is included in X . This removes sets containing isolated points or “peaked” corners, but includes mostdomains that are typically used, such as non-degenerate polytopes, ellipsoids, ﬁnite unions of them, etc. This naturalassumption has already appeared in the past (e.g., Hu et al. 2020) and is weaker than another classical assumption inthe statistics literature (the rolling ball assumption, Cuevas et al. 2012, Walther 1997). Assumption 3.

There exist two constants r > , γ ∈ (0 , such that, for any x ∈ X and all r ∈ (0 , r ) , vol (cid:0) B r ( x ) ∩X (cid:1) ≥ γv r . We can now state the main result of this section. Its proof relies on some additional technical results that aredeferred to the appendix.

Theorem 7. If f is globally L -Lipschitz and X satisﬁes Assumption 3 with r > ε / L , γ ∈ (0 , , then there exist c, C > (e.g., c := 1 /v / L and C := 1 / ( γv / L ) ) such that, for all ε ∈ (0 , ε ] , c Z X d x (cid:0) f ( x ⋆ ) − f ( x ) + ε (cid:1) d ≤ S C ( f, ε ) ≤ C Z X d x (cid:0) f ( x ⋆ ) − f ( x ) + ε (cid:1) d . In light of Theorem 3, this also shows that the integral bound characterizes (up to a log factor) the optimal certiﬁedsample complexity of globally Lipschitz functions in any dimension d , outside of boundary cases. Corollary 8.

Let < L ′ < L . Then, there exist two constants c, C > such that the optimal certiﬁed samplecomplexity of any certiﬁed algorithm A for L -Lipschitz functions satisﬁes, for any globally L ′ -Lipschitz function f and all ε ∈ (0 , ε ] , c m ε Z X d x (cid:0) f ( x ⋆ ) − f ( x ) + ε (cid:1) d ≤ inf A σ ( A, f, ε ) ≤ C Z X d x (cid:0) f ( x ⋆ ) − f ( x ) + ε (cid:1) d . The previous result follows directly from Theorems 2, 3 and 7. We now prove Theorem 7.

Proof of Theorem 7.

Fix any ε ∈ (0 , ε ] and recall that m ε := (cid:6) log ( ε / ε ) (cid:7) , ε m ε := ε , and for all k ≤ m ε − , ε k := ε − k . Partition the domain of integration X into the following m ε + 1 sets: the set of ε -optimal points X ε andthe m ε layers X ( ε k , ε k − ] , for k ∈ [ m ε ] . We begin by proving the ﬁrst inequality: Z X d x (cid:0) f ( x ⋆ ) − f ( x ) + ε (cid:1) d ≤ vol( X ε ) ε d + m ε X k =1 vol (cid:0) X ( ε k , ε k − ] (cid:1) ( ε k + ε ) d ≤ M (cid:0) X ε , εL (cid:1) · v (cid:0) εL (cid:1) d ε d + m ε X k =1 M (cid:0) X ( ε k , ε k − ] , ε k L (cid:1) · v (cid:0) ε k L (cid:1) d ε dk ≤ v L d N (cid:16) X ε , εL (cid:17) + m ε X k =1 N (cid:16) X ( ε k , ε k − ] , ε k L (cid:17)! , We actually prove a stronger result. The ﬁrst inequality holds more generally for any f that is L -Lipschitz around a maximizer (Assumption 1)and Lebesgue-measurable, and does not require X to satisfy Assumption 3. f ( x ⋆ ) − f with its inﬁmum on the partition, the second one bydropping ε > from the second denominator and upper bounding the volume of a set with the volume of the balls of asmallest ε k / L -cover, and the last one by the fact that covering numbers are always smaller than packing numbers (14).This proves the ﬁrst part of the theorem.For the second one, note that Z X d x (cid:0) f ( x ⋆ ) − f ( x ) + ε (cid:1) d ≥ vol( X ε )( ε + ε ) d + m ε X k =1 vol (cid:0) X ( ε k , ε k − ] (cid:1) ( ε k − + ε ) d ≥ d vol( X ε ) ε d + 14 d m ε X k =1 vol (cid:0) X ( ε k , ε k − ] (cid:1) ε dk ≥ d  vol( X ε ) ε d + m ε X k =1 vol (cid:16) X ( ε k , ε k − ] (cid:17) ε dk −  ≥ N (cid:0) X ε , εL (cid:1) vol (cid:0) ε L B (cid:1) (32 ε ) d /γ + m ε X k =1 N (cid:0) X ( ε k , ε k − ] , ε k L (cid:1) vol (cid:0) ε k L B (cid:1) (32 ε k − ) d /γ ≥ γv / L N (cid:16) X ε , εL (cid:17) + γv / L m ε X k =1 N (cid:16) X ( ε k , ε k − ] , ε k L (cid:17) , where the ﬁrst inequality follows by upper bounding f ( x ⋆ ) − f with its supremum on the partition, the second oneby ε ≤ ε k − (for all k ∈ [ m ε + 1] ) and ε k − ≤ ε k (for all k ∈ [ m ε ] ), the third one by Lemma 15, and the fourthone by the elementary inclusions X ε ⊇ X ε and X ( ε k , ε k − ] ⊇ X ( ε k , ε k − ] (for all k ∈ [ m ε ] ) followed byProposition 11. S NC As we mentioned earlier, the non-certiﬁed sample complexity of some algorithms can be upper bounded with a sumof packing numbers (3) (e.g., for the DOO and Piyavskii-Shubert algorithms). Notably, these types of bounds appearin ﬁelds beyond optimization (e.g., see Bachoc et al. 2020, Theorem 2). In this section, we show how this quantitycan be controlled by a more elegant Dudley-type entropy integral bound. The idea of the reduction is leveraging theusual sum-integral approximation carried out, e.g., when deriving the Dudley entropy integral bound (see, e.g., Dudley1967, Boucheron et al. 2013), with the additional care that here the integrand is not necessarily monotone.

Theorem 9. If f is L -Lipschitz around a maximizer and ( x, y, z )

7→ N (cid:0) X ( x,y ] , z (cid:1) is Lebesgue-measurable, then thereexist c, C > (e.g., c = / and C := 2 · d ) such that, for all ε ∈ (0 , ε ] , c Z ∞ ε N (cid:0) X (max( ε, u/ , u ] , uL (cid:1) u d u ≤ S NC ( f, ε ) ≤ C Z ∞ ε N (cid:0) X (max( ε, u/ , u ] , uL (cid:1) u d u . Proof.

Fix any ε ∈ (0 , ε ] and recall that m ε := (cid:6) log ( ε / ε ) (cid:7) , ε m ε := ε , and for all k ≤ m ε − , ε k := ε − k .The proof of the ﬁrst inequality is deferred to Appendix B. For the second one, for all k ∈ [ m ε ] , deﬁne ν k :=inf u ∈ [ ε k − , ε k − ] N (cid:0) X ( ε k , u ] , u / L (cid:1) , ﬁx any η k > , and take u k ∈ [ ε k − , ε k − ] such that N (cid:0) X ( ε k , u k ] , u k / L (cid:1) ≤ ν k + η k . Then, for all k ∈ [ m ε ] , N (cid:16) X ( ε k , ε k − ] , ε k L (cid:17) ≤ N (cid:16) X ( ε k , u k ] , ε k L (cid:17) ≤ (cid:18) u k ε k (cid:19) d N (cid:16) X ( ε k , u k ] , u k L (cid:17) ≤ d ( ν k + η k ) , In addition to Lemma 15, in Appendix D, we prove a more general result on controlling the sum of the volumes of a family of overlapping setscovering X with the sum of volumes over a partition of X (Proposition 16). As we do here, this can be translated to packing numbers. The trick of using ν k := inf u ∈ [ ε k − ,ε k − ] N (cid:0) X ( ε k , u ] , u / L (cid:1) is key to handle the cases when u

7→ N (cid:0) X ( ε k , u ] , u / L (cid:1) is not non-increasing. S NC ( f, ε ) = m ε X k =1 N (cid:16) X ( ε k , ε k − ] , ε k L (cid:17) ≤ m ε X k =1 d ( ν k + η k ) ε k − ( ε k − − ε k − )= 16 d m ε X k =1 inf u ∈ [ ε k − , ε k − ] N (cid:0) X ( ε k , u ] , u / L (cid:1) ε k − ( ε k − − ε k − ) + 16 d m ε X k =1 η k ≤ · d m ε X k =1 Z ε k − ε k − N (cid:0) X ( ε k , u ] , u / L (cid:1) u d u + 16 d m ε X k =1 η k . The result follows by noting that ε k ≥ max( u/ , ε ) for any k ∈ [ m ε ] and all u ∈ [ ε k − , ε k − ] , summing, recallingthat X ( a,b ] = ∅ for all ε ≤ a < b , and taking the limits for η k → .The previous result yields two direct corollaries, obtained by upper bounding X (max( u/ , ε ) , u ] with either X ( u/ , u ] or X ( ε, u ] . The latter has immediate consequences in terms of near-optimality dimension [Bubeck et al., 2011]. In-deed, if there exist two constants C ⋆ > and d ⋆ ∈ (0 , d ] such that N ( X u , u / L ) ≤ C ⋆ /u d ⋆ for all u ∈ (0 , ε ] ,then immediately there exists C > such that R ∞ ε N (cid:0) X ( ε, u ] , u / L (cid:1) /u d u ≤ C/ε d ⋆ for all ε ∈ (0 , ε ] . Simi-larly, if there exists C ⋆ > such that N ( X u , u / L ) ≤ C ⋆ for all u ∈ (0 , ε ] , then there exists C > such that R ∞ ε N (cid:0) X ( ε, u ] , u / L (cid:1) /u d u ≤ C log( ε /ε ) for all ε ∈ (0 , ε ] . Furthermore, such an integral bound makes it easyto handle multiple different regimes at the same time. This is the case if, for example, there exist C ⋆ , C ⋆ > and u ∈ (0 , ε ] such that N ( X u , u / L ) ≤ C ⋆ for all u ∈ [ u , ε ] but N ( X u , u / L ) ≤ C ⋆ /ε d ⋆ for all u ∈ (0 , u ) . In thiscase, our integral bound reﬂects that there exist C , C > such that R ∞ ε N (cid:0) X ( ε, u ] , u / L (cid:1) /u d u ≤ C log( ε /ε ) forall ε ∈ [ u , ε ] but R ∞ ε N (cid:0) X ( ε, u ] , u / L (cid:1) /u d u ≤ C (cid:0) log( ε /u ) + 1 /ε d ⋆ (cid:1) if ε ∈ (0 , u ) .Note that if for some ε ∈ (0 , ε ] , most points in [0 , d are ε -optimal, i.e., if X ε is large, then N ( X ε , ε / L ) ≈ /ε d is also large and does not reﬂect the fact that the non-certiﬁed problem is easy. This is a folklore criticism that hasbeen made to these types of bounds. Such a mismatching behavior is solved by expressing bounds as we did, in termsof layers X ( ε, u ] instead. ACKNOWLEDGEMENTS

The work of Tommaso Cesari and S´ebastien Gerchinovitz has beneﬁtted from the AI Interdisciplinary Institute ANITI,which is funded by the French “Investing for the Future – PIA3” program under the Grant agreement ANR-19-P3IA-0004. S´ebastien Gerchinovitz gratefully acknowledges the support of the DEEL project. . This work beneﬁted fromthe support of the project BOLD from the French national research agency (ANR). References

Franc¸ois Bachoc, Tommaso Cesari, and S´ebastien Gerchinovitz. The sample complexity of level set approximation. arXiv preprint arXiv:2010.13405 , 2020.Peter L. Bartlett, Victor Gabillon, and Michal Valko. A simple parameter-free and adaptive approach to optimizationunder a minimal local smoothness assumption. In Aur´elien Garivier and Satyen Kale, editors,

Proceedings of the30th International Conference on Algorithmic Learning Theory , volume 98 of

Proceedings of Machine LearningResearch , pages 184–206, Chicago, Illinois, 22–24 Mar 2019. PMLR.St´ephane Boucheron, G´abor Lugosi, and Pascal Massart.

Concentration inequalities: a nonasymptotic theory ofindependence . Oxford University Press, 2013.Cl´ement Bouttier, Tommaso Cesari, and S´ebastien Gerchinovitz. Regret analysis of the Piyavskii-Shubert algorithmfor global lipschitz optimization. arXiv preprint arXiv:2002.02390 , 2020. Journal of Machine LearningResearch , 12(May):1655–1695, 2011.Antonio Cuevas, Ricardo Fraiman, and Beatriz Pateiro-L´opez. On statistical properties of sets fulﬁlling rolling-typeconditions.

Advances in Applied Probability , 44(2):311–329, 2012.Richard M. Dudley. The sizes of compact subsets of hilbert space and continuity of Gaussian processes.

Journal ofFunctional Analysis , 1(3):290 – 330, 1967.Pierre Hansen, Brigitte Jaumard, and S-H Lu. On the number of iterations of Piyavskii’s global optimization algorithm.

Mathematics of Operations Research , 16(2):334–350, 1991.Matthias Horn. Optimal algorithms for global optimization in case of unknown Lipschitz constant.

Journal of Com-plexity , 22(1):50–70, 2006.Yichun Hu, Nathan Kallus, and Xiaojie Mao. Smooth contextual bandits: Bridging the parametric and non-differentiable regret regimes. In Jacob Abernethy and Shivani Agarwal, editors,

Proceedings of Thirty Third Con-ference on Learning Theory , volume 125 of

Proceedings of Machine Learning Research , pages 2007–2010. PMLR,09–12 Jul 2020.Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandits in metric spaces. In

Proceedings of thefortieth annual ACM symposium on Theory of computing , pages 681–690, 2008.Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Bandits and experts in metric spaces.

Journal of the ACM , 66(4), 2019.C´edric Malherbe and Nicolas Vayatis. Global optimization of Lipschitz functions. In

Proceedings of the 34th Inter-national Conference on Machine Learning , volume PMLR 70, pages 2314–2323, 2017.R´emi Munos. Optimistic optimization of a deterministic function without the knowledge of its smoothness. In

Ad-vances in Neural Information Processing Systems 24 (NIPS 2011) , pages 783–791, 2011.R´emi Munos. From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning.

Foundations and Trends in Machine Learning , 7(1):1–130, 2014.Yurii Nesterov.

Introductory lectures on convex optimization: A basic course , volume 87. Springer Science & BusinessMedia, 2003.A.G. Perevozchikov. The complexity of the computation of the global extremum in a class of multi-extremum prob-lems.

USSR Computational Mathematics and Mathematical Physics , 30(2):28–33, 1990.S.A. Piyavskii. An algorithm for ﬁnding the absolute extremum of a function.

USSR Computational Mathematics andMathematical Physics , 12(4):57–67, 1972.Bruno O Shubert. A sequential method seeking the global maximum of a function.

SIAM Journal on NumericalAnalysis , 9(3):379–388, 1972.Martin J. Wainwright.

High-Dimensional Statistics: A Non-Asymptotic Viewpoint . Cambridge Series in Statistical andProbabilistic Mathematics. Cambridge University Press, 2019.Guenther Walther. Granulometric smoothing.

The Annals of Statistics , 25(6):2273–2299, 1997.

A Missing proofs of Section 3

In this section we provide all missing details and proofs from Section 3.11 .1 Lower Bound on the Certiﬁed Sample Complexity in Dimension d = 1 We begin by proving our lower bound on the certiﬁed sample complexity in dimension d = 1 . It shows that in thespecial one-dimensional case d = 1 , S C ( f, ε ) provides a tight lower bound on the certiﬁed sample complexity of allcertiﬁed algorithms, up to the factor m ε , even in the boundary case in which f is globally L -Lipschitz. Proposition (Proposition 5) . If d = 1 , let K = 8 and c = 1 / K . Then, the certiﬁed sample complexity of anycertiﬁed algorithm A for L -Lipschitz functions satisﬁes, for any globally L -Lipschitz function f and all ε ∈ (0 , ε ] , σ ( A, f, ε ) > ( c / m ε ) S C ( f, ε ) .Proof. As for the proof of Theorem 3, it is sufﬁcient to show that τ ( f, ε ) > cS C ( f, ε ) / (1 + m ε ) . If cS C ( f, ε ) / (1 + m ε ) < , then the result follows by τ ( f, ε ) ≥ . Consider then from now on that cS C ( f, ε ) / (1 + m ε ) ≥ .Deﬁning ˜ ε as in the proof of Theorem 3, one can prove similarly that cS C ( f, ε ) / (1 + m ε ) ≤ c N (cid:0) X ˜ ε, ˜ ε/ L (cid:1) . FromLemma 14, N (cid:16) X ˜ ε, K ˜ εL (cid:17) ≥ K N (cid:16) X ˜ ε, ˜ ε L (cid:17) ≥ K S C ( f, ε ) m ε + 1 ≥ , because c = 1 / K and cS C ( f, ε ) / (1 + m ε ) ≥ . Let now n ≤ cS C ( f, ε ) / (1 + m ε ) . Then we have n ≤ c (8 K ) N (cid:0) X ˜ ε, K ˜ ε/L (cid:1) . Thus, by c (8 K ) = 1 / , n ≤ N (cid:0) X ˜ ε, K ˜ ε/L (cid:1) / , and N (cid:0) X ˜ ε, K ˜ ε/L (cid:1) ≥ , we have n ≤  N (cid:16) X ˜ ε, K ˜ εL (cid:17)  − . (7)Consider a certiﬁed algorithm A for L -Lipschitz functions. Let us consider a K ˜ ε/L packing ≤ x < · · · < x N ≤ of X ˜ ε with N = N (cid:0) X ˜ ε, K ˜ ε/L (cid:1) . Consider the ⌊ N/ ⌋ − disjoint open segments ( x , x ) , ( x , x ) , . . . , ( x ⌊ N/ ⌋− , x ⌊ N/ ⌋− ) . Then from (7) there exists i ∈ (cid:8) , , . . . , ⌊ N/ ⌋ − (cid:9) such that the segment ( x i , x i +2 ) does not containany of the x ( A, f ) , . . . , x n ( A, f ) . Assume that x i +1 − x i ≤ x i +2 − x i +1 (the case x i +1 − x i > x i +2 − x i +1 can be treated analogously; we omit these straightforward details for the sake of conciseness). Consider the function h + , ˜ ε : X → R deﬁned by h + , ˜ ε ( x ) =  f ( x ) if x ∈ X \ [ x i , x i +2 ] f ( x i ) + L ( x − x i ) if x ∈ X ∩ [ x i , x i +1 ] f ( x i ) + L ( x i +1 − x i ) + ( x − x i +1 ) f ( x i +2 ) − f ( x i ) − L ( x i +1 − x i ) x i +2 − x i +1 if x ∈ X ∩ ( x i +1 , x i +2 ] . We see that h + , ˜ ε is L -Lipschitz (since x i +1 − x i ≤ x i +2 − x i +1 ). Furthermore, h + , ˜ ε coincides with f on x ( A, f ) , . . . , x n ( A, f ) .Similarly, consider the function h − , ˜ ε : X → R deﬁned by h − , ˜ ε ( x ) =  f ( x ) if x ∈ X \ [ x i , x i +2 ] f ( x i ) − L ( x − x i ) if x ∈ X ∩ [ x i , x i +1 ] f ( x i ) − L ( x i +1 − x i ) + ( x − x i +1 ) f ( x i +2 ) − f ( x i )+ L ( x i +1 − x i ) x i +2 − x i +1 if x ∈ X ∩ ( x i +1 , x i +2 ] . As before, h − , ˜ ε is L -Lipschitz and coincides with f on x ( A, f ) , . . . , x n ( A, f ) .Consider the case (1) where x ⋆n ( A, f ) ∈ X \ [ x i , x i +2 ] . Then, we have, since x i ∈ X ˜ ε and x i +1 − x i ≥ K ˜ ε/L , h + , ˜ ε ( x i +1 ) − h + , ˜ ε ( x ⋆n ( A, f )) = f ( x i ) + L ( x i +1 − x i ) − f ( x ⋆n ( A, f )) ≥ − ˜ ε + L K ˜ εL = 7˜ ε. Consider the case (2) where x ⋆n ( A, f ) ∈ X ∩ [ x i , ( x i + x i +1 ) / . Then, we have, since x i +1 − x i ≥ K ˜ ε/L , h + , ˜ ε ( x i +1 ) − h + , ˜ ε ( x ⋆n ( A, f )) = f ( x i ) + L ( x i +1 − x i ) − f ( x i ) − L ( x ⋆n ( A, f ) − x i ) ≥ L ( x i +1 − x i )2 ≥ ˜ ε K ε. Consider the case (3) where x ⋆n ( A, f ) ∈ X ∩ [( x i + x i +1 ) / , x i +1 ] . Then, we have, since x i +1 − x i ≥ K ˜ ε/L , h − , ˜ ε ( x i ) − h − , ˜ ε ( x ⋆n ( A, f )) = f ( x i ) − f ( x i ) + L ( x ⋆n ( A, f ) − x i ) ≥ L ( x i +1 − x i )2 ≥ ˜ ε K ε. x ⋆n ( A, f ) ∈ X ∩ [ x i +1 , ( x i +1 + x i +2 ) / . Then, we have, since x i +1 − x i ≥ K ˜ ε/L ,since x i , x i +2 ∈ X ˜ ε and since h − , ˜ ε is linear increasing on [ x i +1 , x i +2 ] with left value f ( x i ) − L ( x i +1 − x i ) and rightvalue f ( x i +2 ) , h − , ˜ ε ( x i ) − h − , ˜ ε ( x ⋆n ( A, f )) ≥ f ( x i ) − f ( x i ) − L ( x i +1 − x i ) + f ( x i +2 )2= f ( x i ) − f ( x i +2 )2 + L ( x i +1 − x i )2 ≥ − ˜ ε K ε ≥ ε. Consider the case (5) where x ⋆n ( A, f ) ∈ X ∩ [( x i +1 + x i +2 ) / , x i +2 ] . Then, we have, since x i +1 − x i ≥ K ˜ ε/L ,since x i , x i +2 ∈ X ˜ ε and since h + , ˜ ε is linear decreasing on [ x i +1 , x i +2 ] with left value f ( x i ) + L ( x i +1 − x i ) and rightvalue f ( x i +2 ) , h + , ˜ ε ( x i +1 ) − h + , ˜ ε ( x ⋆n ( A, f )) ≥ f ( x i ) + L ( x i +1 − x i ) − f ( x i ) + L ( x i +1 − x i ) + f ( x i +2 )2= f ( x i ) − f ( x i +2 )2 + L ( x i +1 − x i )2 ≥ − ˜ ε K ε ≥ ε. Hence, in all cases E L ( A, f, n ) ≥ ε > ε . Hence inf A E L ( A, f, n ) > ε . Since this has been shown for any n ≤ cS C ( f, ε ) / (1 + m ε ) we thus have τ ( f, ε ) > cS C ( f, ε ) / (1 + m ε ) .As discussed previously, in the proof of Theorem 3, the constraint that f be L ′ -Lipschitz with L ′ < L arose fromthe fact that we added a small bump function ± g ˜ ε to f , with the requirement that the new function f ± g ˜ ε be globally L -Lipschitz. We treat the one-dimensional case differently. For a given algorithm, a given function f and a numberof query points smaller than cS C ( f, ε ) / (1 + m ε ) , we show the existence of a segment unvisited by the algorithm andcontaining three close-to-optimal points that are separated enough. We then replace the function f on this segmentwith an upward or downward hat function which makes the algorithm fail to be ε -optimal. By replacing f with a newfunction on the segment, rather than adding a bump function to f , we can allow f to have Lipschitz constant arbitrarilyclose to L . A.2 The Piyavskii-Shubert Algorithm and Proof of Proposition 6

In this section, we recall the deﬁnition of the certiﬁed Piyavskii-Shubert algorithm (Algorithm 2, Piyavskii 1972,Shubert 1972) and we show that its sample complexity, in dimension d ≥ , is not lower bounded by S C ( f, ε ) forsome functions in the boundary case L ′ = L . Algorithm 2:

Certiﬁed Piyavskii-Shubert algorithm input: accuracy ε > , Lipschitz constant L > , norm k·k , initial guess x ∈ X for i = 1 , , . . . do pick the next query point x i observe the value f ( x i ) output the recommendation x ⋆i ← argmax x ∈{ x ,..., x i } f ( x ) output the certiﬁcate γ i ← I (cid:8) b f ⋆i − f ⋆i ≤ ε (cid:9) , where b f i ( · ) ← min j ∈ [ i ] (cid:8) f ( x j ) + L k x j − ( · ) k (cid:9) , b f ⋆i ← max x ∈X b f i ( x ) , f ⋆i ← max j ∈ [ i ] f ( x j ) , and let x i +1 ∈ argmax x ∈X b f i ( x ) In contrast to the one-dimensional case d = 1 , the next result shows that there are special cases with L ′ = L and d ≥ for which the upper bound S C ( f, ε ) is far from tight. Quantifying the optimal certiﬁed sample complexity fordimensions d ≥ in the boundary case L ′ = L (or when L ′ is arbitrarily close to L ) is left as an open problem. Proposition 6.

Let d ≥ , X := B , and k·k be a norm. The certiﬁed Piyavskii-Shubert algorithm (Algorithm 2) withinitial guess x := is a certiﬁed algorithm for L -Lipschitz functions satisfying, for the globally L -Lipschitz function f := L k·k and any ε ∈ (0 , ε ) , σ ( Piyavskii-Shubert , f, ε ) = 2 ≪ c/ε d − ≤ S C ( f, ε ) , where c > is a constantindependent of ε . roof. Fix any ε ∈ (0 , ε ) . When f is at least L -Lipschitz around a maximizer (Assumption 1), then max x ∈X b f i ( x ) ≥ max x ∈X f ( x ) for all i ∈ N ∗ (for a proof of this fact, see Bouttier et al. 2020). Hence if γ i = 1 , then max x ∈X f ( x ) − f ( x ⋆i ) ≤ b f i ( x ) − f ⋆i ≤ ε , and x ⋆i is necessarily ε -optimal. This shows that the certiﬁed Piyavskii-Shubert algorithmis indeed a certiﬁed algorithm for L -Lipschitz functions. Then, since ε = L , we have that b f = f , γ = 0 ,and x belongs to the the unit sphere, i.e., x is a maximizer of f . Since b f = f , we have that γ = 1 , hence σ ( Piyavskii-Shubert , f, ε ) = 2 . Finally, by deﬁnition (4), we have S C ( f, ε ) ≥ N (argmax X f, ε/L ) . Since argmax X f is the unit sphere, there exists a constant c , only depending on d , k·k and L , such that S C ( f, ε ) ≥ c/ε d − .We give some intuition on the previous proposition. Consider a function f that has Lipschitz constant exactly L ,and a pair of points in X whose respective values of f are maximally distant, that is the difference of values of f isexactly L times the norm of the input difference. This conﬁguration provides strong information on the value of theglobal maximum of f , as is illustrated in the proof of Proposition 6. Another interpretation is that when f has Lipschitzconstant exactly L , there is less ﬂexibility for the L -Lipschitz function g that yields the maximal optimization errorin (6). B Missing proofs of Section 5

In this section, we provide the missing part of the proof of Theorem 9.Fix any ε ∈ (0 , ε ] and recall that m ε := (cid:6) log ( ε / ε ) (cid:7) , ε m ε := ε , and for all k ≤ m ε − , ε k := ε − k . We beginby proving the ﬁrst inequality: Z ∞ ε N (cid:16) X ( max ( ε, u ) ,u ] , uL (cid:17) u d u = m ε X k = − Z ε k − ε k N (cid:16) X ( max ( ε, u ) ,u ] , uL (cid:17) u d u ≤ m ε X k = − Z ε k − ε k N (cid:16) X ( max ( ε, εk ) , ε k − ] , ε k L (cid:17) ε k d u ≤ m ε X k = − N (cid:16) X ( max ( ε, εk ) , ε k − ] , ε k L (cid:17) = m ε X k = − (cid:16) N (cid:16) X ( εk , ε k − ] , ε k L (cid:17) I k ∈{− ,...,m ε − } + N (cid:16) X ( ε, ε k − ] , ε k L (cid:17) I k ∈{ m ε − ,...,m ε } (cid:17) ≤ m ε X k = − k +2 X h = k N (cid:16) X ( ε h , ε h − ] , ε h L (cid:17) I k ∈{− ,...,m ε − } + m ε X h = k N (cid:16) X ( ε h , ε h − ] , ε h L (cid:17) I k ∈{ m ε − ,...,m ε } ! = X k = − ( . . . ) + m ε X k =1 ( . . . ) ≤ S NC ( f, ε ) , where the ﬁrst and third inequalities follow by the monotonicity properties of packing numbers, the second one by ε k − − ε k ≤ ε k for all k ∈ [ m ε ] , and the last one by noting that both brackets ( . . . ) contain at most three non-zeroterms N (cid:0) X ( ε h , ε h − ] , ε h L (cid:1) , with h ∈ [ m ε ] , that overlap at most times, and nothing else. This proves the ﬁrst partof the theorem. C Packing vs Volume

The two prototypical sample complexity statements involving assumptions on volumes (8)–(9) (see e.g., Perevozchikov1990) and packing numbers (10)–(11) (e.g., Munos 2014) for algorithms applied to functions f that are L -Lipschitz We wrote ( . . . ) to avoid copying twice the long expression on the previous line. ε ∈ (0 , ε ] are: ∃ r ∈ (0 , , ∃ C > , ∀ u ∈ (0 , ε ] , vol( X u ) ≤ Cu dr (8) = ⇒ ∃ c > , ∀ n ≥ c (cid:0) ln( / ε ) I r =1 + ( / ε ) d (1 − r ) I r =1 (cid:1) , x ⋆n ∈ X ε , (9) ∃ d ⋆ ∈ [0 , d ] , ∃ C ⋆ > , ∀ u ∈ (0 , ε ] , N ( X u , u / L ) ≤ C ⋆ ( ε / ε ) d ⋆ (10) = ⇒ ∃ c > , ∀ n ≥ c (cid:0) ln( / ε ) I d ⋆ =0 + ( / ε ) d ⋆ I r =1 (cid:1) , x ⋆n ∈ X ε . (11)In the two following sections, we will study the relationship between sample complexity statements expressed underassumptions on volumes (8) and packing numbers (10). C.1 Which Assumption is Stronger

The ﬁrst result suggests that packing assumptions (10) are stronger than volume’s (8).

Proposition 10. If E ⊂ R d is bounded and Lebesgue-measurable and L -Lipschitz around a maximizer, letting c := v / L , for all u ∈ (0 , ε ] , vol( X u ) ≤ c N (cid:16) X u , uL (cid:17) u d . Proof.

For all u ∈ (0 , ε ] , vol( X u ) ≤ M (cid:0) X u , uL (cid:1) vol (cid:0) uL B (cid:1) ≤ N (cid:0) X u , uL (cid:1) u d v / L , where in the ﬁrst inequality weupper bounded the volume of a set with the volume of the balls of a smallest ( u / L ) -cover and in the second one weused the fact that covering numbers are always smaller than packing numbers (14).The previous proposition implies in particular that if d ⋆ ∈ [0 , d ) is a near-optimality dimension of f for a constantand C ⋆ > , i.e., if for all u ∈ (0 , ε ] , N ( X u , u / L ) ≤ C ⋆ ( ε / u ) d ⋆ (as in (10)), then vol( X u ) ≤ Cu dr , where C := v / L C ⋆ ε d ⋆ and r = 1 − d ⋆ /d , recovering the assumption on volume (8). The assumption on packing lookstherefore stronger. However, we show now that if f is globally Lipschitz (and its domain is not too pathological), theconverse is also true, i.e., that assumptions on packings and volumes bounds are essentially equivalent. Proposition 11. If f is globally L -Lipschitz and X satisﬁes Assumption 3 with r > , γ ∈ (0 , , then, for all < w < u < Lr , N (cid:16) X u , uL (cid:17) ≤ γ vol (cid:0) X ( / ) u (cid:1) vol (cid:0) u L B (cid:1) and N (cid:16) X ( w,u ] , wL (cid:17) ≤ γ vol (cid:0) X ( w / , u / ] (cid:1) vol (cid:0) w L B (cid:1) . Proof.

Fix any u > w > . Let η := uL , η := wL , E := X , E := X cw , and i ∈ [2] . Note that for any η > and A, E ⊂ R d , the balls of radius η / centered at the elements of an η -packing of A intersected with E are all disjointand included in ( A + B η / ) ∩ E . Thus, N ( X u ∩ E i , η i ) ≤ vol (cid:0) ( X u ∩ E i + B ηi / ) ∩ X (cid:1) / vol( B ηi / ∩ X ) . To lowerbound the denominator, simply apply Assumption 3 to obtain vol (cid:0) B ηi / ( x ) ∩ X (cid:1) ≥ γ vol( B ηi / ) . To upper boundthe numerator, take an arbitrary point x i ∈ ( X u ∩ E i + B ηi / ) ∩ X . By deﬁnition of Minkowski sum, there exists x ′ i ∈ X u ∩ E i such that k x i − x ′ i k ≤ η i / . Hence f ( x ⋆ ) − f ( x i ) ≤ f ( x ⋆ ) − f ( x ′ i ) + (cid:12)(cid:12) f ( x ′ i ) − f ( x i ) (cid:12)(cid:12) ≤ u + L ( η i / ) ≤ ( / ) u . This implies that x i ∈ X ( / ) u , which proves the ﬁrst inequality. For the second one, note that x satisﬁes f ( x ⋆ ) − f ( x ) ≥ f ( x ⋆ ) − f ( x ′ ) − (cid:12)(cid:12) f ( x ′ ) − f ( x ) (cid:12)(cid:12) ≥ w − L ( η / ) = ( / ) w .The previous proposition implies in particular that if there exist r ∈ (0 , and C > such that, for all u ∈ (0 , ε ] , vol( X u ) ≤ Cu dr (as in (8)), then d ⋆ = d (1 − r ) is a near-optimality dimension of f for C ⋆ := (cid:0) C ( / ) dr (cid:1) / ( γv / L ε d ⋆ ) ,i.e., for all u ∈ (0 , ε ] , N ( X u , u / L ) ≤ C ⋆ ( ε / ε ) d ⋆ , recovering the assumption on packing numbers (10). Therefore,for globally L -Lipschitz functions, the two assumptions are essentially equivalent. C.2 Which Statement is More General

We recall the prototypical volume-base statement presented at the beginning of Appendix C. ∃ r ∈ (0 , , ∃ C > , ∀ u ∈ (0 , ε ] , vol( X u ) ≤ Cu dr (12) = ⇒ ∃ c > , ∀ n ≥ c (cid:0) ln( / ε ) I r =1 + ( / ε ) d (1 − r ) I r =1 (cid:1) , x ⋆n ∈ X ε . (13)15 x i x ⋆j x ⋆h x k x ⋆ ε ε Figure 1: The graph of a function f for which volume-based results do not apply.In the previous section we showed that assumptions on packing numbers imply the above assumptions (12) on volumeswhen f is Lipschitz around a maximizer (and measurable), but for the converse to be true f has to be globally Lipschitz.Although one might believe that this would make volume assumption better (because they are weaker), it turns out thatthey are in fact too weak in general when the function is only Lipschitz around a maximizer to imply the conclusion(13). We prove this in the following proposition. Proposition 12.

Let

L > and r ∈ (0 , . Then, for any deterministic algorithm and all multiplicative constants c > , there exist a function f : X → R and an accuracy ε ′ > such that:1. f is L -Lipschitz around a maximizer with respect to the Euclidean norm k·k ;2. there exists c > such that vol( X u ) ≤ c u dr for all u ∈ (0 , ε ] ;3. for any ε ∈ (0 , ε ′ ) and all n ≤ e n ε , x ⋆n / ∈ X ε , where e n ε := c (cid:0) ln( / ε ) I r =1 + ( / ε ) d (1 − r ) I r =1 (cid:1) .Proof. Fix any deterministic algorithm and c > . Denote by x ( g ) , x ( g ) , . . . the queries and by x ⋆ ( g ) , x ⋆ ( g ) , . . . the recommendations of the algorithm when applied to the constant function g = 0 . Assume ﬁrst that r = 1 . Let ε ′ ∈ (cid:0) , min { ( L/ /r / (8 v c ) /rd , c /d (1 − r )2 , } (cid:3) , then ﬁx any ε ∈ (0 , ε ′ ) and deﬁne n ε := (cid:6) c /ε d (1 − r ) (cid:7) ≥ e n ε . Let E := (cid:8) x ( g ) , x ⋆ ( g ) , . . . , x n ε ( g ) , x ⋆n ε ( g ) (cid:9) . Note that there exists x ⋆ ∈ X such that min x ∈ E k x − x ⋆ k ≥ / (4 v n ε ) /d ≥ ε − r / (8 v c ) /d . Indeed, if the ﬁrstinequality did not hold, than the spheres with radius / (4 v n ε ) /d centered at the points in E would cover X , but thiscould not happen since the largest volume that they could cover is (2 n ε ) v / (4 v n ε ) /d = 1 / < . Fix any such x ⋆ ,note that B ε/L ( x ⋆ ) ∩ E = ∅ , and let Q := Q d ∩ (cid:0) X \ (cid:0) B ε/L ( x ⋆ ) ∪ E (cid:1)(cid:1) . We can now deﬁne a function f that attainsits maximum at x ⋆ , has value ε in the zero-volume dense set Q , and satisﬁes f ( x ) = 0 for all x ∈ E . Concretely,consider the function f : X → R deﬁned for all x ∈ X , by (see Fig. 1) f ( x ) = (cid:0) ε − L k x − x ⋆ k (cid:1) I X \ ( Q ∪ E ) ( x ) + ε I Q ( x ) . Then f is L -Lipschitz around x ⋆ . Moreover, letting c := v /L max (cid:8) ε d (1 − r )0 , (cid:9) , we have vol( X u ) = vol (cid:0) B u/L ( x ⋆ ) (cid:1) = v /L u d = v /L u d (1 − r ) u dr ≤ c u dr , for all u ∈ (0 , ε ] , because f is almost everywhere equal to x ε − L k x − x ⋆ k . Finally, by deﬁnition, for all n ≤ n ε , the recommendation x ⋆n of the algorithm after n evaluations of f satisﬁes f ( x ⋆n ) = 0 , hence x ⋆n / ∈ X ε . Thecase r = 1 can be treated similarly, by letting n ε := (cid:6) c /ε d (1 − ρ ) (cid:7) for some ρ ∈ (0 , and noting that n ε ≥ e n ε for allsufﬁciently small ε .The result is not surprising since for ε, f as in the proof, N (cid:0) X ε , ε / L (cid:1) ≈ /ε d because of the dense subset Q .Packing numbers of ε -optimizers better capture the difﬁculty of the problem. In conclusion, packing assumptions arebe the best between the two, as they imply convergence result also in the more general case of functions f that areonly Lipschitz around a maximizer. 16 Useful Results on Packing, Covering, and Volume

For the sake of completeness, we recall the deﬁnitions of packing and covering numbers, as well as several usefulresults on packing, covering, and volume.

D.1 Deﬁnitions of Packing and Covering Numbers

For any bounded set A ⊂ R d and any real number r > , the r - packing number of A is the largest cardinality of an r -packing of A , that is, N ( A, r ) := sup (cid:26) k ∈ N ∗ : ∃ x , . . . , x k ∈ A, min i = j k x i − x j k > r (cid:27) if A is nonempty, zero otherwise.The r - covering number of A is the smallest cardinality of an r -covering of A , that is, M ( A, r ) := min (cid:8) k ∈ N ∗ : ∃ x , . . . , x k ∈ R d , ∀ x ∈ A, ∃ i ∈ { , . . . , k } , k x − x i k ≤ r (cid:9) if A is nonempty, zero otherwise. D.2 Useful Inequalities

Covering numbers and packing numbers are closely related. In particular, the following well-known inequalitieshold—see, e.g., Wainwright [2019, Lemmas 5.5 and 5.7, with permuted notation of M and N ]. Lemma 13.

Fix any norm k·k . For any bounded set A ⊂ R d and any real number r > , N ( A, r ) ≤ M ( A, r ) ≤ N ( A, r ) . (14) Furthermore, for all δ > and all r > , M ( B δ , r ) ≤ (cid:18) δr I r<δ (cid:19) d . (15)We now state a known lemma about packing numbers at different scales. This is the go-to result for rescalingpacking numbers. Lemma 14.

For any E ⊂ X and < r ≤ r < ∞ , we have N ( E, r ) ≤ (cid:18) r r (cid:19) d N ( E, r ) . Proof.

Consider a r -packing of E with cardinality N = N ( E, r ) , written F = { x , . . . , x N } . Consider then thefollowing iterative procedure. Let F = F and initialize k = 1 . While F k − is non-empty, let y k be any point in F k − , let F k be obtained from F k − by removing the points at || . || -distance less or equal to r from y k and increase k by one. Then this procedure yields an r packing of E with cardinality equal to the number of steps (the ﬁnal valueof k ). At each step, to the set of removed points we can associate a set of disjoint balls with radius r / in a ball withradius r . Hence the number of removed points is smaller or equal to v r /v r / = (4 r /r ) d . Hence the number ofsteps is larger or equal to N ( E, r ) ( r / r ) d . This concludes the proof since N ( E, r ) is larger than this number ofsteps.We now prove a result on the sum of volumes of overlapping layers that is used in the proof of Theorem 7. The deﬁnition of r -covering number of a subset A of R d implied by Wainwright [2019, Deﬁnition 5.1] is slightly stronger than the one usedin our paper, because elements x , . . . , x N of r -covers belong to A rather than just R d . Even if we do not need it for our analysis, Inequality (15)holds also in this stronger sense. emma 15. If f is L -Lipschitz around a maximizer (Assumption 1) and Lebesgue-measurable, ﬁx ε > and recallthat m ε := (cid:6) log ( ε /ε ) (cid:7) , ε m ε := ε , and for all k ≤ m ε − , ε k =: ε − k . Then, there exist c, C > (e.g., c = / and C := 2 · d ) such that, for all ε ∈ (0 , ε ] , vol (cid:0) X ε (cid:1) ε d + m X k =1 vol (cid:16) X ( ε k , ε k − ] (cid:17) ε dk − ≤ d vol( X ε ) ε d + m X i =1 vol (cid:0) X ( ε k , ε k − ] (cid:1) ε dk − ! . Proof.

To avoid clutter, we denote m ε simply by m . The left hand side can be upper bounded by vol (cid:0) X ε (cid:1) + vol (cid:0) X ( ε m , ε m − ] (cid:1) + vol (cid:0) X ( ε m − , ε m − ] (cid:1) ε d + m − X k =1 vol (cid:0) X ( ε k +1 , ε k ] (cid:1) + vol (cid:0) X ( ε k , ε k − ] (cid:1) vol (cid:0) X ( ε k − , ε k − ] (cid:1) ε dk − + vol ( X ε ) + vol (cid:0) X ( ε m , ε m − ] (cid:1) + vol (cid:0) X ( ε m − , ε m − ] (cid:1) + vol (cid:0) X ( ε m − , ε m − ] (cid:1) ε dm − + vol ( X ε ) + vol (cid:0) X ( ε m , ε m − ] (cid:1) + vol (cid:0) X ( ε m − , ε m − ] (cid:1) ε dm − ≤ (cid:0) X ε (cid:1) ε d + (2 d + 2) vol (cid:0) X ( ε m , ε m − ] (cid:1) ε dm − + (4 d + 2 d + 1) vol (cid:0) X ( ε m − , ε m − ] (cid:1) ε dm − + 12 d m − X k =2 vol (cid:0) X ( ε k , ε k − ] (cid:1) ε dk − + m − X k =1 vol (cid:0) X ( ε k , ε k − ] (cid:1) ε dk − + 2 d m − X k =1 vol (cid:0) X ( ε k , ε k − ] (cid:1) ε dk − = 3 vol (cid:0) X ε (cid:1) ε d + vol (cid:0) X ( ε m , ε m − ] (cid:1) ε dm − + 4 d vol (cid:0) X ( ε m − , ε m − ] (cid:1) ε dm − + 12 d m − X k =2 vol (cid:0) X ( ε k , ε k − ] (cid:1) ε dk − + m X k =1 vol (cid:0) X ( ε k , ε k − ] (cid:1) ε dk − + 2 d m X k =1 vol (cid:0) X ( ε k , ε k − ] (cid:1) ε dk − where we applied several times the deﬁnition of the ε k ’s, the inequality follows by / ε d + / ε dm − + / ε dm − ≤ min (cid:8) / ε d ) , (2 d + 2)( / ε dm − ) , (4 d + 2 d + 1)( / ε dm − ) (cid:9) , and the proof is concluded observing that max(3 , , d ) = 4 d and d + / d + 1 + 2 d ≤ d .The previous result can be generalized to arbitrary partitions and rescaling of the layers. We prove it here for theinterested reader. Proposition 16. If f is L -Lipschitz around a maximizer (Assumption 1) and Lebesgue-measurable, ﬁx any m ∈ N and let ξ m +1 < ξ m ≤ ξ m − ≤ . . . ≤ ξ ≤ ξ := ε . For any k ∈ { , . . . , m } , let also a k ∈ (0 , , b k ≥ , i k := min (cid:8) i ∈ { k, . . . , m + 1 } : ξ i ≤ a k ξ k (cid:9) ,j k := ( , if b k ξ k > ξ , max (cid:8) j ∈ { , . . . , k } : b k ξ k ≤ ξ j (cid:9) , otherwise . For all h ∈ [ m + 1] , deﬁne n a,h and n b,h as the following numbers of overlaps with ( ξ h , ξ h − ] : n a,h := (cid:8) k ∈ { , . . . , m + 1 } : i k − ≥ k and k ≤ h ≤ i k − (cid:9) ,n b,h := (cid:8) k ∈ [ m ] : j k ≤ k − and j k + 1 ≤ h ≤ k (cid:9) . Then, let c k := ( ξ j k +1 /ξ k +1 ) d for all k ∈ [ m − , c m := ( ξ j m +1 /ξ m ) d , and c := max k ∈ [ m ] c k . If C := 1 + c max (cid:0) n a,m +1 , max h ∈ [ m ] ( n a,h + n b,h ) (cid:1) , we have, for any function f : X → R , vol( X b m ξ m ) ξ dm + m X k =1 vol (cid:0) X ( a k ξ k , b k − ξ k − ] (cid:1) ξ dk ≤ C vol( X ξ m ) ξ dm + m X k =1 vol (cid:0) X ( ξ k , ξ k − ] (cid:1) ξ dk ! . roof. Letting V := vol( X ξ m ) /ξ dm + P mk =1 vol (cid:0) X ( ξ k , ξ k − ] (cid:1) /ξ dk , we have vol (cid:0) X b m ξ m (cid:1) ξ dm + m X k =1 vol (cid:0) X ( a k ξ k , b k − ξ k − ] (cid:1) ξ dk = vol (cid:0) X ξ m (cid:1) ξ dm + vol (cid:0) X ( ξ m , b m ξ m ] (cid:1) ξ dm + m X k =1 vol (cid:0) X ( a k ξ k , ξ k ] (cid:1) ξ dk + m X k =1 vol (cid:0) X ( ξ k , ξ k − ] (cid:1) ξ dk + m X k =1 vol (cid:0) X ( ξ k − , b k − ξ k − ] (cid:1) ξ dk ≤ V + vol (cid:0) X ( ξ m , ξ jm ] (cid:1) ξ dm + m X k =1 vol (cid:0) X ( ξ ik , ξ k ] (cid:1) ξ dk + m X k =1 vol (cid:0) X ( ξ k − , ξ jk − ] (cid:1) ξ dk =: V + (I) + (II) + (III) . We upper bound separately the addends (II) and (I) + (III) . We have (II) = m X k =1 vol (cid:0) X ( ξ ik , ξ k ] (cid:1) ξ dk = m +1 X k =2 vol (cid:0) X ( ξ ik − , ξ k − ] (cid:1) ξ dk − = m +1 X k =2 i k − ≥ k vol (cid:0) X ( ξ ik − , ξ k − ] (cid:1) ξ dk − = m +1 X k =2 i k − ≥ k i k − X i = k vol (cid:0) X ( ξ i , ξ i − ] (cid:1) ξ dk − ≤ m +1 X k =2 i k − ≥ k i k − X i = k vol (cid:0) X ( ξ i , ξ i − ] (cid:1) ξ di I i = m +1 + ξ di − I i = m +1 = n a,m +1 vol (cid:0) X ( ξ m +1 , ξ m ] (cid:1) ξ dm + m X h =1 n a,h vol (cid:0) X ( ξ h , ξ h − ] (cid:1) ξ dh ≤ n a,m +1 vol (cid:0) X ξ m (cid:1) ξ dm + m X h =1 n a,h vol (cid:0) X ( ξ h , ξ h − ] (cid:1) ξ dh . (I) + (III) = vol (cid:0) X ( ξ m , ξ jm ] (cid:1) ξ dm + m X k =1 vol (cid:0) X ( ξ k − , ξ jk − ] (cid:1) ξ dk = vol (cid:0) X ( ξ m , ξ jm ] (cid:1) ξ dm I j m ≤ m − + m X k =1 j k − ≤ k − vol (cid:0) X ( ξ k − , ξ jk − ] (cid:1) ξ dk = vol (cid:0) X ( ξ m , ξ jm ] (cid:1) ξ dm I j m ≤ m − + m X k =2 j k − ≤ k − vol (cid:0) X ( ξ k − , ξ jk − ] (cid:1) ξ dk = vol (cid:0) X ( ξ m , ξ jm ] (cid:1) ξ dm I j m ≤ m − + m − X k =1 j k ≤ k − vol (cid:0) X ( ξ k , ξ jk ] (cid:1) ξ dk +1 = m X j = j m +1 j m ≤ m − vol (cid:0) X ( ξ j , ξ j − ] (cid:1) ξ dm + m − X k =1 j k ≤ k − k X j = j k +1 vol (cid:0) X ( ξ j , ξ j − ] (cid:1) ξ dk +1 = m X j = j m +1 j m ≤ m − ξ dj ξ dm vol (cid:0) X ( ξ j , ξ j − ] (cid:1) ξ dj + m − X k =1 j k ≤ k − k X j = j k +1 ξ dj ξ dk +1 vol (cid:0) X ( ξ j , ξ j − ] (cid:1) ξ dj ≤ m X j = j m +1 j m ≤ m − ξ dj m +1 ξ dm vol (cid:0) X ( ξ j , ξ j − ] (cid:1) ξ dj + m − X k =1 j k ≤ k − k X j = j k +1 ξ dj k +1 ξ dk +1 vol (cid:0) X ( ξ j , ξ j − ] (cid:1) ξ dj = m X k =1 j k ≤ k − k X j = j k +1 c k vol (cid:0) X ( ξ j , ξ j − ] (cid:1) ξ dj ≤ c m X k =1 j k ≤ k − k X j = j k +1 vol (cid:0) X ( ξ j , ξ j − ] (cid:1) ξ dj = c m X h =1 n b,h vol (cid:0) X ( ξ h , ξ h − ] (cid:1) ξ dh . Putting everything together and noting that c ≥ gives the result. E The DOO algorithm: deﬁnition and proofs

E.1 The DOO algorithm

Let us present the (non-certiﬁed and certiﬁed) DOO algorithm and its underlying assumptions. The non-certiﬁedalgorithm and the assumptions will essentially be the same as in the original reference Munos [2011], with minoradjustments to our framework, for ease of exposition. The notion of certiﬁcate is not present in Munos [2011].The algorithm is deﬁned for a ﬁxed K ∈ N ⋆ , by an inﬁnite sequence of subsets of X (cells) of the form ( X h,i ) h ∈ N ,i =0 ,...,K h − . For h ∈ N , the sets X h, , . . . , X h,K h − are non-empty, pairwise disjoint, and their unioncontains X . Furthermore, ( X h,i ) h ∈ N ,i =0 ,...,K h − is associated with a K -ary tree, meaning that for any h ∈ N and j ∈ { , . . . , K h − } , there exist K distinct i , . . . , i K ∈ { , . . . , K h +1 − } such that X h +1 ,i , . . . , X h +1 ,i K form apartition of X h,j . We call ( h + 1 , i ) , . . . , ( h + 1 , i K ) the children of ( h, j ) .For each cell X h,i , h ∈ N , i = 0 , . . . , K h − , there is a representative x h,i ∈ X h,i , which can be thought of, e.g.,as the center of the cell. We assume that feasible cells have feasible representatives, that is: if X h,i ∩ X 6 = ∅ , then x h,i ∈ X .The sequence of cells and representatives is well-behaved in the sense of the following two assumptions. Assumption 4.

There are ﬁxed < δ < and R < + ∞ such that for any h ∈ N , i = 0 , . . . , K h − and u, v ∈ X h,i , || u − v || ≤ Rδ h . ssumption 5. There is a ﬁxed ν > such that, with δ as in Assumption 4, for any h ∈ N , i = 0 , . . . , K h − , h ′ ∈ N , i ′ = 0 , . . . , K h ′ − , with ( h, i ) = ( h ′ , i ′ ) , || x h,i − x h ′ ,i ′ || ≥ νδ max( h,h ′ ) . Assumption 4 is very classical. Note that Assumption 5, which is key for our improved analysis, is slightly strongerthan in Munos [2011], yet easy to satisfy. For a compact X , a sequence ( X h,i ) h ∈ N ,i =0 ,...,K h − satisfying Assumptions4 and 5 indeed exists. For instance, when X is the unit hypercube [0 , d and || . || is the supremum norm || . || ∞ , wemay use bisections with K = 2 d , letting X h,i be an hypercube of edge length − h and x h,i be its center, for h ∈ N and i = 0 , . . . , dh − . In this case we have R = 1 , δ = 1 / and ν = 1 / in Assumptions 4 and 5.The DOO algorithm is deﬁned in Algorithm 3. For both non-certiﬁed and certiﬁed versions, the most promisingcell among a set of active cells L n is selected at every iteration k (see (16)) before being split into its K children.For the certiﬁed version, the algorithm declares its recommendations x ⋆n to be ε -optimal (i.e., γ n = 1 ) whenever thecondition (17) is met. Algorithm 3:

DOO (non-certiﬁed and certiﬁed versions) input:

Function f , input set X , Lipschitz bound L , cells ( X h,i ) h ∈ N ,i =0 ,...,K h − , representatives ( x h,i ) h ∈ N ,i =0 ,...,K h − . initialization: Let n = 1 . Let x ⋆ = x = x , . Let L = { (0 , } . Evaluate f ( x ) . Certiﬁed version only:

Let γ = 0 . for iteration k = 1 , , . . . do Let (breaking ties arbitrarily) ( h ⋆ , i ⋆ ) ∈ argmax ( h,i ) ∈L n (cid:8) f ( x h,i ) + LRδ h (cid:9) . (16) Certiﬁed version only: If f ( x h ⋆ ,i ⋆ ) + LRδ h ⋆ ≤ max( f ( x ) , . . . , f ( x n )) + ε, (17)then let γ ← and γ n ← γ .Let L + be the set of the K children of ( h ⋆ , i ⋆ ) . for ( h ⋆ + 1 , j ) ∈ L + do If X h ⋆ +1 ,j ∩ X 6 = ∅ then, let n ← n + 1 , let x n = x h ⋆ +1 ,j , evaluate f ( x n ) , let x ⋆n ∈ argmax x ∈{ x ,..., x n } f ( x ) and let L n = L n − ∪ { ( h ⋆ + 1 , j ) } . Certiﬁed version only: let γ n = γ .Remove ( h ⋆ , i ⋆ ) from L n . E.2 Proof of Proposition 1

We start by proving Proposition 1, and will then comment on where our analysis improves over that of Munos [2011,Theorem 1], as well as on a straightforward extension.

Proof.

We ﬁrst recall the guarantee (19) below, which is classical (e.g., Munos 2011). By induction, it is straightfor-ward to show that the union of the cells in L n contains X at all steps n ∈ N ⋆ . Therefore, the global maximizer x ⋆ (forwhich the inequality in Assumption 1 holds) belongs to a cell X ¯ h, ¯ i with (¯ h, ¯ i ) ∈ L n . Consider now the cell X h ⋆ ,i ⋆ in(16), at step n . We have, using ﬁrst (16), and then Assumptions 1 and 4, f ( x h ⋆ ,i ⋆ ) + LRδ h ⋆ ≥ f ( x ¯ h, ¯ i ) + LRδ ¯ h ≥ f ( x ⋆ ) − LRδ ¯ h + LRδ ¯ h = f ( x ⋆ ) . (18)21his implies that, for ( h ⋆ , i ⋆ ) given by (16), f ( x h ⋆ ,i ⋆ ) ∈ X LRδ h⋆ . (19)We now proceed in a slightly different way than in the proof of Munos [2011, Theorem 1]. Consider the ﬁrst timeat which the DOO algorithm reaches step (16) with f ( x h ⋆ ,i ⋆ ) ≥ f ( x ⋆ ) − ε . Then let I ε be the number of times theDOO algorithm went through step (16) strictly before that time, and denote by n ε the total number of evaluations of f strictly before that same time. Then we have n ε ≤ KI ε . Furthermore, after n ε evaluations of f , we have, by deﬁnitions of the recommendation x ⋆n ε and n ε , f ( x ⋆n ε ) = max x ∈{ x ,..., x nε } f ( x ) ≥ max x ∈L nε f ( x ) ≥ f ( x h ⋆ ,i ⋆ ) ≥ f ( x ⋆ ) − ε . Since the optimization error f ( x ⋆ ) − f ( x ⋆n ) can only decrease for n ≥ n ε , the last inequality above entails that thenon-certiﬁed sample complexity of DOO (non-certiﬁed version) is bounded by n ε and thus ζ ( non-certiﬁed DOO , f, ε ) ≤ KI ε . (20)Consider now the sequence ( h ⋆ , i ⋆ ) , . . . , ( h ⋆I ε , i ⋆I ε ) corresponding to the ﬁrst I ε times the DOO algorithm A wentthrough step (16). Let E ε be the corresponding ﬁnite set { x h ⋆ ,i ⋆ , . . . , x h ⋆Iε ,i ⋆Iε } . By deﬁnition of I ε , we have E ε ⊆X ( ε,ε ] . Since ε = ε m ε ≤ ε m ε − ≤ . . . ≤ ε , we have E ε ⊆ S m ε i =1 X ( ε i , ε i − ] , so that the cardinality I ε of E ε (a leafcan never be selected twice) satisﬁes I ε = card( E ε ) ≤ m ε X i =1 card (cid:0) E ε ∩ X ( ε i , ε i − ] (cid:1) . (21)Let N ε,i be the cardinality of E ε ∩ X ( ε i , ε i − ] . For x h,j ∈ E ε ∩ X ( ε i , ε i − ] , from (19), we have ε i < LRδ h and thus δ h > ε i LR .

Consider two distinct x h,j , x h ′ ,j ′ ∈ E ε ∩ X ( ε i , ε i − ] . Then, from Assumption 5, we obtain || x h,j − x h ′ ,j ′ || ≥ νδ max( h,h ′ ) > νε i LR .

Hence, by deﬁnition of packing numbers, we have N ε,i ≤ N (cid:16) X ( ε i , ε i − ] , νε i LR (cid:17) . Using now Lemma 14, we obtain N ε,i ≤ ν/R ≥ + ν/R< (cid:18) Rν (cid:19) d ! N (cid:16) X ( ε i , ε i − ] , ε i L (cid:17) . (22)Combining the last inequality with (20) and (21) concludes the proof. Remark 17.

The analysis of the DOO algorithm in Munos [2011, Theorem 1] yields a bound on the non-certiﬁed sam-ple complexity than can be expressed in the form C P m ε k =1 N (cid:0) X ε k − , ε k L (cid:1) , with a constant C . The correspondingproof relies on two main arguments. First, when a cell of the form ( h ⋆ , i ⋆ ) , i ⋆ ∈ { , . . . , K h ⋆ − } , is selected in (16) , then the corresponding cell representative x i ⋆ ,h ⋆ is LRδ h ⋆ -optimal (we also use this argument). Second, as aconsequence, for a given ﬁxed value of h ⋆ , for the sequence of values of i ⋆ that are selected in (16) , the correspondingcell representatives x h ⋆ ,i ⋆ form a packing of X [0 ,LRδ h⋆ ] .Our analysis in the proof of Proposition 1 stems from the observation that using a packing of X LRδ h⋆ yieldsa suboptimal analysis, since the cell representatives x h ⋆ ,i ⋆ can be much better than LRδ h ⋆ -optimal. Hence, weproceed differently from Munos [2011], by ﬁrst partitioning all the selected cell representatives (in (16) ) accord-ing to their level of optimality as in (21) and then by exhibiting packings of the different layers of input points X ( ε,ε mε − ] , X ( ε mε − ,ε mε − ] , . . . , X ( ε ,ε ] . In a word, we partition the values of f instead of partitioning the inputspace when counting the representatives selected at all levels. emark 18. The bound of Proposition 1, based on (3) , is built by partitioning ( ε, ε ] into the m ε sets ( ε, ε m ε − ] , ( ε m ε − , ε m ε − ] , . . . , ( ε , ε ] whose lengths are sequentially doubled (except from ( ε m ε , ε m ε − ] to ( ε m ε − , ε m ε − ] ). As can be seen from the proofof Proposition 1, more general bounds could be obtained, based on more general partitions of ( ε, ε ] . The beneﬁtsof the present partition are the following. First, it considers sets whose upper values are no more than twice thelower values, which controls the magnitude of their corresponding packing numbers in (3) (at scale the lower values).Second the number of sets in the partition is logarithmic in ε which controls the sum in (3) .The same generalization could be applied to the bound based on (4) of Proposition 2 below, for the certiﬁed versionof the DOO algorithm. In this latter case, another beneﬁt of choosing the partition ( ε, ε m ε − ] , ( ε m ε − , ε m ε − ] , . . . , ( ε , ε ] (together with the additional set [0 , ε ] ) is that the upper bound is then tight up to a logarithmic factor for most functions f , as proved in Section 3. E.3 Proof of Proposition 2

Let us ﬁrst show that Algorithm 3 (certiﬁed version) is indeed a certiﬁed algorithm. Let f be any function satisfyingAssumption 1. For notational simplicity, we set n := σ ( certiﬁed DOO , f, ε ) = inf { i ∈ N ∗ : γ i = 1 } and we show that f ( x ⋆ ) − f ( x ⋆n ) ≤ ε . After exactly n evaluations of f , when Algorithm 3 reaches step (16), thecondition (17) holds for the ﬁrst time. From (18) in the proof of Proposition 1, which applies here since the non-certiﬁed and certiﬁed versions select the same leaves ( h ⋆ , i ⋆ ) and output the same queries x m and recommendations x ⋆m , we know that f ( x ⋆ ) ≤ f ( x h ⋆ ,i ⋆ ) + LRδ h ⋆ . Since condition (17) also guarantees that f ( x h ⋆ ,i ⋆ ) + LRδ h ⋆ ≤ max( f ( x ) , . . . , f ( x n )) + ε = f ( x ⋆n ) + ε this entails that f ( x ⋆ ) − f ( x ⋆n ) ≤ ε . Since the optimization error n ′ f ( x ⋆ ) − f ( x ⋆n ′ ) can only decrease over time,the requirement γ n ′ = 1 ⇒ f ( x ⋆ ) − f ( x ⋆n ′ ) ≤ ε is true for all n ′ ≥ n and thus all n ∈ N ∗ . This proves that Algorithm3 (certiﬁed version) is a certiﬁed algorithm.We now show the upper bound on σ ( certiﬁed DOO , f, ε ) . Let I ε be the number of times the algorithm went throughstep (16) strictly before the ﬁrst iteration k where γ n is set to , that is, strictly before (17) holds for the ﬁrst time.Note that σ ( certiﬁed DOO , f, ε ) is the total number of evaluations of f before (17) holds for the ﬁrst time, so that σ ( certiﬁed DOO , f, ε ) ≤ KI ε . (23)Consider now the sequence ( h ⋆ , i ⋆ ) , . . . , ( h ⋆I ε , i ⋆I ε ) corresponding to the ﬁrst I ε times the DOO algorithm wentthrough step (16). Let E ε be the corresponding ﬁnite set { x h ⋆ ,i ⋆ , . . . , x h ⋆Iε ,i ⋆Iε } . Of course we have E ε ⊂ X ε S (cid:0)S m ε i =1 X ( ε i , ε i − ] (cid:1) ,so that I ε = card( E ε ) ≤ card ( E ε ∩ X ε ) + m ε X i =1 card (cid:0) E ε ∩ X ( ε i , ε i − ] (cid:1) . (24)Let N ε,m ε +1 be the cardinality of E ε ∩ X ε . For i = 1 , . . . , m ε , let N ε,i be the cardinality of E ε ∩ X ( ε i , ε i − ] . Withexactly the same arguments as in the proof of Proposition 1, we show, for i = 1 , . . . , m ε , that (22) holds.Let now x h ⋆l ,i ⋆l ∈ E ε ∩ X ε , with ℓ ∈ { , . . . I ε } . The pair ( h ⋆l , i ⋆l ) was selected when the algorithm went throughstep (16) for the ℓ -th time. By deﬁnition of I ε , (17) does not hold at this time, which implies, with n being the numberof evaluations of f at this time, f ( x h ⋆ℓ ,i ⋆ℓ ) + LRδ h ⋆ℓ > max( f ( x ) , . . . , f ( x n )) + ε ≥ f ( x h ⋆ℓ ,i ⋆ℓ ) + ε . This implies that

LRδ h ⋆ℓ > ε and thus δ h ⋆ℓ > εLR . Note that, in the deﬁnition of Algorithm 3, the variable γ n is sometimes assigned twice, with ﬁrst the value of and then the value of . In thatcase, we consider that γ n = 1 . x h,j , x h ′ ,j ′ ∈ E ε ∩ X ε . Then, from Assumption 5, we obtain || x h,j − x h ′ ,j ′ || ≥ νδ max( h,h ′ ) > νεLR . Hence, we have N ε,m ε +1 ≤ N (cid:16) X ε , νεLR (cid:17) . Using now Lemma 14, we obtain N ε,m ε +1 ≤ ν/R ≥ + ν/R< (cid:18) Rν (cid:19) d ! N (cid:16) X ε , εL (cid:17) ..