Contraction methods for continuous optimization
XIAOPENG LUO∗ AND XIN XU†

Abstract.
We describe a class of algorithms that establishes a contracting sequence of closed sets for solving continuous optimization problems. From the perspective of ensuring effective contractions to achieve a certain accuracy, all possible continuous optimization problems can be divided into three categories: logarithmic time contractile, polynomial time contractile, or noncontractile. For any problem from the first two categories, the constructed contracting sequence converges to the set of all global minimizers with a theoretical guarantee of linear convergence; for any problem from the last category, we also discuss possible troubles caused by using the proposed algorithms.
Key words. continuous optimization, contraction methods, categories, complexity, convergence
AMS subject classifications.
1. Introduction.
For a possibly nonlinear and non-convex continuous function f : Ω ⊂ R^n → R with global minimum f^* and the set of all global minimizers X^* in Ω, we consider the constrained optimization problem

(1) min_{x∈Ω} f(x),

where Ω is a (not necessarily convex) closed domain; and we assume that observing f is costly, so we wish to use as few observations as possible. When f is cheap to evaluate, many standard optimization algorithms exist, such as genetic algorithms [20], evolution strategies [29], differential evolution [35] and simulated annealing [12]. However, when f is expensive, we have to pay more attention to maximizing the use of the information obtained; and perhaps the best-known strategy is Bayesian optimization (BO), a sequential model-based approach in which the model is often obtained using a Gaussian process. It was first introduced by Močkus [21] and later popularized by Jones et al. [10]. In particular, theoretical results on the convergence behaviour of BO are provided in [8]. Various extensions have been suggested by further authors, and a recent review can be found in [30].

The BO method applies the Gaussian process to construct a model for generating acquisition functions. The acquisition function, which trades off exploration and exploitation, is used to determine the next candidate point. This is exactly how BO uses the information obtained; this study presents a different approach from the perspective of contractility.

The concept of contractility comes from the level set. Actually, the problem (1) is closely related to the u-sublevel set [27, 24, 1], i.e.,

(2) L(u) = { x ∈ Ω : f(x) ≤ u }, u ∈ [ f^*, max_{x∈Ω} f(x) ].

Its boundary set ∂L(u) = { x ∈ Ω : f(x) = u } is a contour surface, and L(u) contracts monotonically from Ω to X^* as u decreases continuously from max_{x∈Ω} f(x) to f^*. We introduce the contracting sequence as a direct extension of the sublevel set. A contracting sequence {D^(k)}_{k∈N} w.r.t.
the optimization problem (1) is a sequence

∗ Department of Chemistry, Princeton University, Princeton, NJ 08544, and Department of Control and Systems Engineering, School of Management and Engineering, Nanjing University, Nanjing, JS 210093, China ([email protected], [email protected]).
of decreasing nonempty closed sets satisfying

(3) Ω ⊇ D^(k) ⊃ D^(k+1) ⊃ X^*;

further, we call {D^(k)} a strictly contracting sequence if the sets also satisfy

(4) max_{x∈D^(k+1)} |f(x)| < max_{x∈D^(k)} |f(x)|.

In this work, we translate the problem (1) into the problem of constructing a strictly contracting sequence {D^(k)} with D^(0) = Ω. Each D^(k+1) is sequentially determined by an approximation model constructed from uniformly distributed samples on D^(k). Both the proposed contraction method (CO) and BO depend on the approximation model, but they have very significant differences.

The first essential difference is that BO is designed to find only one of the minimizers, while CO tries to find all of them. The most interesting fact is that if we do not find all the global minimizers, we cannot be sure that any one of them is a global minimizer. So, although BO can ensure convergence [8], at any given moment we do not know whether the current best solution is really a global minimizer, even in the sense of probability. In contrast, this is deterministic for CO because, for most hierarchical low frequency dominant functions (see subsection 3.1), the premise of each contraction is that all global minimizers are covered by the subsequent closed subsets in the sense of probability.

The second major difference is that BO always updates the model on the original domain and uses all the historical samples [10, 30], while CO updates the model on the gradually decreasing contracted sets and uses only those samples that are located in the contracted sets. This helps control the computational cost of modelling, especially for the hierarchical low frequency dominant functions mentioned above. Due to this characteristic, a contraction method might be viewed as a special case of classical branch and bound techniques [16, 36], which aim to sequentially guarantee X^* ⊂ D^(k+1) and exclude { D^(k) − D^(k+1) }_k.
The last significant disparity is the different conditions for achieving satisfactory efficiency. Specifically, the efficiency of BO depends mainly on the smoothness of the objective function [8]; however, CO may achieve logarithmic time efficiency even for non-differentiable functions; see Theorem 4.2 for details. The hierarchical low frequency dominant condition once again plays an important role here. On the one hand, the contraction strategy provides a different perspective from which to reconsider the problem (1); on the other hand, contractility can be viewed as a necessary complement to smoothness. Figure 4 gives an intuitive example that illustrates this clearly.

The remainder of the paper is organized as follows. The next section introduces the contraction algorithms and mainly focuses on their convergence. Three conditions for guaranteeing algorithm efficiency are discussed in detail in section 3, especially with regard to hierarchical low frequency dominant functions. These conditions allow us to divide all possible continuous problems into three categories in section 4. We draw some conclusions in section 5.
2. Contraction algorithms.
For any nonnegative integer k, the contracting sequence {D^(k)} is defined recursively by D^(0) = Ω and

(5) D^(k+1) = { x ∈ D^(k) : A^(k)f(x) ≤ f^*_χ(k) + u^(k) },

where χ^(k) = {χ^(k)_i} are uniformly distributed over D^(k) with sample size N^(k), the relevant data values are f_χ(k) = {f(χ^(k)_i)} with the current f^*_χ(k) = min(f_χ(k)) and f^**_χ(k) = max(f_χ(k)), and A^(k)f is an approximate model of f w.r.t. the data pairs (χ^(k), f_χ(k)) such that the model prediction error is bounded by

max_{x∈D^(k)} | A^(k)f(x) − f(x) | ≤ δ u^(k), ∀ δ ∈ (0, 1],

with u^(k) < f^**_χ(k) − f^*_χ(k). The parameter δ is used to enhance convergence, as we will see in Theorem 2.2. The following result states that the sequence of decreasing nonempty closed sets constructed by (5) monotonically converges to the set of all global minimizers X^* for any possible choice of u^(k) and δ ∈ (0, 1].

Theorem 2.1. Suppose {D^(k)}_{k∈N} is constructed by (5) for the problem (1). If f is not a constant, then for all k ∈ N, Ω ⊇ D^(k) ⊃ D^(k+1) ⊃ X^*.

Remark 2.1. There are various possible choices of u^(k); the three commonly used are the median-type Median f_χ(k) − f^*_χ(k), the mean-type Mean f_χ(k) − f^*_χ(k) and the middle-type (f^**_χ(k) − f^*_χ(k))/2.

Proof. The proof proceeds by induction on k. First, Ω = D^(0) ⊃ X^* is trivial since f is not a constant on Ω. Assume that D^(k) ⊃ X^*; then D^(k) ⊃ S^(k) ⊃ X^*, where

S^(k) = { x ∈ D^(k) : f(x) ≤ f^*_χ(k) };

now we will show that D^(k) ⊃ D^(k+1) ⊃ S^(k). The approximate model can be decomposed into

A^(k)f(x) = f(x) + ε^(k)(x),

and according to the upper bound of the model prediction error, it follows that

|ε^(k)(x)| ≤ δ u^(k) ≤ u^(k) < f^**_χ(k) − f^*_χ(k), δ ∈ (0, 1].

Hence, for any x′ ∈ S^(k), we have f(x′) ≤ f^*_χ(k), which can be rewritten as A^(k)f(x′) − ε^(k)(x′) ≤ f^*_χ(k), and equivalently,

A^(k)f(x′) ≤ f^*_χ(k) + ε^(k)(x′) ≤ f^*_χ(k) + δ u^(k) ≤ f^*_χ(k) + u^(k),

so x′ ∈ D^(k+1) by (5); together with the inclusion D^(k+1) ⊆ D^(k), which holds by construction, we obtain D^(k) ⊃ D^(k+1) ⊃ S^(k).

The above algorithm does not guarantee that the upper bound of f on every D^(k), say max_{x∈D^(k)} |f(x)|, is strictly monotonically decreasing. In fact, this can be achieved by further enhancing the accuracy requirements of the model. For convenience of theoretical analysis, without loss of generality, we may sometimes assume f^* = 0 on Ω. Under the same settings as (5) and the assumption f^* = 0, the sequence {D^(k)}_{k∈N} can be rewritten as D^(0) = Ω and

(6) D^(k+1) = { x ∈ D^(k) : A^(k)f(x) ≤ u^(k) },

where the kth model A^(k)f satisfies

(7) max_{x∈D^(k)} | A^(k)f(x) − f(x) | ≤ δ u^(k), ∀ δ ∈ (0, 1].

Then we have the following strong convergence.
Theorem 2.2. Suppose {D^(k)}_{k∈N} is constructed by (6) for the problem (1). If f is not a constant and there exists a 0 < q < ∞ such that

(8) u^(k) < (1/(1+δ)) (1/(1+q)) f^**_χ(k),

then for all k ∈ N,

(9) max_{x∈D^(k+1)} |f(x)| < (1/(1+q)) max_{x∈D^(k)} |f(x)|.

Here, (8) is called a q-strong convergence condition and (9) is called linear convergence of the contraction algorithms.

Remark 2.2. In practice, the q-strong convergence condition (8) can be written as u^(k) < (1/(1+δ)) (1/(1+q)) ( f^**_χ(k) − f^*_χ(k) ).

Proof. The approximate model can be decomposed into A^(k)f(x) = f(x) + ε^(k)(x) with |ε^(k)(x)| ≤ δ u^(k), ∀ δ ∈ (0, 1]; then for all t ∈ D^(k+1),

A^(k)f(t) ≤ u^(k) ⇒ f(t) ≤ (1+δ) u^(k) < (1/(1+q)) f^**_χ(k) ≤ (1/(1+q)) max_{x∈D^(k)} |f(x)|,

and the desired result follows.

Let µ(S) = ∫_S dt denote the n-dimensional Lebesgue measure of S ⊂ R^n; then we say the ratio

λ^(k+1) = µ(D^(k+1)) / µ(D^(k))

is the (k+1)th contraction factor. In the following, we focus on the median-type algorithm, which often leads to a relatively stable λ^(k+1). Under the same settings as (5), the median-type sequence {D^(k)}_{k∈N} can be written as D^(0) = Ω and

(10) D^(k+1) = { x ∈ D^(k) : A^(k)f(x) ≤ Median f_χ(k) },

where the kth model A^(k)f satisfies

(11) max_{x∈D^(k)} | A^(k)f(x) − f(x) | < δ ( Median f_χ(k) − f^*_χ(k) ), ∀ δ ∈ (0, 1].

Many different global continuous optimization algorithms [26, 12, 20, 35], especially model-based methods [22, 11, 31], attempt to trade off exploration and exploitation [22, 11], because sufficient exploration is a guarantee of convergence while exploitation of limited knowledge seems to be the key to improving efficiency [8, 13, 31]. However, every exploitation on the basis of inadequate information may reduce efficiency or even cause trouble.
Each contracted set D^(k) embodies the exploration of unknown information and the approximate model A^(k)f reflects the exploitation of prior information, while the condition (11) is the key for ensuring sufficient exploration and for issuing a judgment on the conversion of exploration into exploitation. Statistical methods allow us to estimate the model prediction errors in a sense of probability. Assume that ε^(k) = A^(k)f − f follows an independent, identically distributed Gaussian distribution with mean µ^(k)_ε and standard deviation σ^(k)_ε; then µ^(k)_ε and σ^(k)_ε can be estimated by the cross-validation (CV) [9, 2] estimates µ̃^(k)_ε and σ̃^(k)_ε, and a probabilistic bound for the residual ε^(k) on D^(k) is given by

P( |ε^(k)| ≤ |µ^(k)_ε| + t σ^(k)_ε ) = 2Φ(t) − 1;

for t = 3 it follows the three-sigma rule of thumb P( |ε^(k)| ≤ |µ^(k)_ε| + 3σ^(k)_ε ) ≈ 0.997. Even without the Gaussian assumption, there is Chebyshev's inequality

P( |ε^(k)| ≤ |µ^(k)_ε| + t σ^(k)_ε ) > 1 − 1/t²,

although it should be regarded as a theoretical tool rather than a practical estimate.

There will naturally be a large CV variance for an overfitting model. This might result in reduced efficiency but will not cause trouble; however, an underfitting model A^(k)f might sometimes result in a very rough estimate of the residual bound and eventually lead to convergence failure. To reduce this possibility, we should impose an additional constraint on the global approximation behavior of A^(k)f, which also ensures the effectiveness of subsequent sampling. Let r^(k) be the percentage of χ^(k) falling into D^(k+1) and r^(k)_A be the percentage of samples uniformly distributed over D^(k) falling into D^(k+1); then it is reasonable to require that the difference between r^(k) and r^(k)_A be less than a small given threshold r̄, i.e., |r^(k) − r^(k)_A| < r̄.
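Putting the pieces together, one median-type contraction step of (10) can be sketched in code. This is a minimal sketch under assumptions of ours, not the paper's setup: a toy quadratic objective on [−1, 1]², rejection sampling for the uniform draws, and an idealized zero-error surrogate standing in for the GP model (so that the contracted sets are exact sublevel sets); any regression model satisfying (11) could be substituted for `perfect_fit`.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # toy objective (an assumed example): global minimum 0 at the origin
    return float(np.sum(np.asarray(x) ** 2))

def contract(in_domain, n_samples, fit):
    """One median-type contraction step, cf. (10): draw uniform samples from
    D^(k) (by rejection from the bounding box [-1,1]^2), fit a surrogate
    A^(k)f, and return the membership indicator of D^(k+1)."""
    pts = []
    while len(pts) < n_samples:
        x = rng.uniform(-1.0, 1.0, size=2)
        if in_domain(x):
            pts.append(x)
    y = np.array([f(x) for x in pts])
    model = fit(pts, y)          # surrogate A^(k)f built from (chi, f_chi)
    thresh = np.median(y)        # Median f_chi^(k)
    return lambda x: in_domain(x) and model(x) <= thresh

# idealized zero-error surrogate: A^(k)f = f, so delta-accuracy (11) is trivial
perfect_fit = lambda X, y: f

D = lambda x: True               # D^(0): the whole box
for _ in range(3):
    D = contract(D, 64, perfect_fit)

# Theorem 2.1: the global minimizer survives every contraction,
# while points with large f-values are excluded
```

In this sketch each D^(k+1) is represented implicitly, as a chain of model-threshold tests, which mirrors how (10) only ever requires membership queries rather than an explicit description of the set.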
Two important parts of the algorithms above are (i) the method for constructing a model to fit the given data on D^(k) and (ii) the sampling strategy for generating further uniform samples over D^(k) from some known interior points. In this work, we apply Gaussian process (GP) regression for data fitting. It has merit in its flexible expressivity because its hyperparameters are automatically tuned by likelihood maximization; see [25] for more information about this approach. GP regression generally does not lead to an overfitting model and is often used as an important part of many model-based optimization methods [11, 13, 31]. Other common choices include random forests [6] and regression trees [7], etc.

Now we turn to the sampling strategy. Suppose χ = {χ_i}_{i=1}^N ∈ D^(k) are N given points and T is a union of N sample sets {T_i}_{i=1}^N, where T_i contains M samples from a normal distribution with mean χ_i and variance σ_i (sufficiently large to cover D^(k)). Further, let T^(k) = T ∩ D^(k). We now add a new point t from T^(k) to χ, and t should preferably fill in the gaps in the distribution of χ. This subsequent point t can be determined as

(12) t = argmax_{t∈T^(k)} ( min_i ‖t − χ_i‖ ).

One can generate uniform samples in D^(k) by using the above step recursively; see Figure 1 for an example of how the method performs. This recursive algorithm is closely related to Voronoi diagrams [3], because the added point will always lie in the largest Voronoi cell of χ if the size of T^(k) is large enough.

Fig. 1. 2-dimensional illustration of the performance of the recursive algorithm given in (12). Left: the domain D = { (x1, x2) ∈ [0,1]² : (x1 − 0.5)² + (x2 − 0.5)² ≤ r² } is shown as the interior of the circle; the uniform samples in the unit square, namely the first 50 points of the 2-dimensional Halton sequence, are visible as dots in blue, and those samples falling into D, denoted by χ, are shown as asterisks in black. Middle: the candidate set T, a union of sample sets from a normal distribution centered on each χ_i, is visible as circled dots. Right: samples added recursively from T ∩ D are visible as circled dots in blue, so successive points at any stage "know" how to fill in the gaps in the previously generated distribution.

Now we have the basic framework of the contraction algorithms. As an illustrative example, Figure 2 visually shows how the median-type algorithm compresses the original domain Ω to X^* step by step; two important pieces of information that can be obtained from Figure 2 are: (i) the computational complexity of the model is greatly reduced due to the domain contractions; and (ii) none of the minimizers is missed in the case of multiple global minima, which verifies the conclusions established by Theorems 2.1 and 2.2. Figure 3 shows a real-world application to Lennard-Jones (LJ) microclusters, perhaps the most intensely studied molecular conformation problem.
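The gap-filling rule (12) is straightforward to implement. The following is a minimal sketch of ours (function names, seed points, and the disk-shaped test domain are our assumptions, chosen to resemble the setting of Figure 1):

```python
import numpy as np

def add_point(chi, candidates):
    """Pick t = argmax_{t in candidates} min_i ||t - chi_i||, cf. (12):
    the candidate farthest from its nearest already-chosen point."""
    # pairwise distances: shape (M, N) for M candidates and N existing points
    d = np.linalg.norm(candidates[:, None, :] - chi[None, :, :], axis=2)
    return candidates[d.min(axis=1).argmax()]

# usage: grow a roughly uniform design inside a domain one point at a time
rng = np.random.default_rng(1)
domain = lambda x: np.sum((x - 0.5) ** 2) <= 0.16    # a disk inside [0,1]^2
pool = rng.uniform(0.0, 1.0, size=(2000, 2))
pool = pool[[bool(domain(p)) for p in pool]]          # candidate set T ∩ D
chi = pool[:3]                                        # a few seed points
for _ in range(20):
    chi = np.vstack([chi, add_point(chi, pool)])
```

Because each added point maximizes the distance to its nearest neighbor, it lands in the largest remaining "gap", which is the Voronoi-cell behavior described above.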
3. Conditions for efficiency.
In order to categorise continuous optimization problems, we will introduce the following three classes of conditions. First, we say that f is a critical regular function on Ω if the set of all critical points of f has zero n-dimensional Lebesgue measure, where a critical point is a point x ∈ Ω at which the gradient is undefined or equal to zero. This critical regular condition guarantees that the contraction factor of the median-type algorithm is asymptotically equal to 1/2.

Fig. 2. Minimizing the function f(x1, x2) = (x2 − 5.1x1²/(4π²) + 5x1/π − 6)² + 10(1 − 1/(8π)) cos(x1) + 10, x1 ∈ [−5, 10], x2 ∈ [0, 15]. The multiple global minima are located at (−π, 12.275), (π, 2.275) and (3π, 2.475). Upper and middle rows: the first six contractions; the objective function is shown by contour lines with suitable level marks, the samples used to build each model are visible as dots in black, and the random candidate points in each contracted subdomain, that is, T^(k+1) in (12), are visible as circled dots in blue. Lower left: the convergence plot for the total of seven contractions. Lower middle: the sample size used in each model. Lower right: x trace.

The efficiency of the contractions can be guaranteed if one can quickly sketch out the overall landscape of the valley in each step and the relevant deviation is not large enough to dig one or more deep holes in the highlands. This requires that the low frequency components of f always play a dominant role on every subset D^(k). This meaningful observation prompted us to impose a certain restriction on the Fourier transform of f. Suppose f̂ is the Fourier transform of f; then we say that f is a (ρ, p)-type hierarchical low frequency dominant function (HLFDF) if there exist a p > 0 and an appropriate ρ > 0 such that for all j = 1, 2, ···, it holds that

(13) ∫_{‖ω‖∞ > 2^(j−1)√n ρ} |f̂(ω)| dω < (1 + p) ∫_{2^(j−1)√n ρ < ‖ω‖∞ ≤ 2^j √n ρ} |f̂(ω)| dω.

It is worth noting that the condition (13) does not require f̂ to decay very quickly; as a univariate instance, for every δ > 0 a transform with tail |f̂(ω)| = |ω|^(−1−δ) satisfies this condition with arbitrary ρ > 0 and p = 1/(2^δ − 1), and the corresponding approximation error bound decreases by a factor of p/(1+p) every time the number of function evaluations doubles. This feature actually implies contractility.

For j ∈ N and ρ > 0, we first define the (j, ρ)-bandlimited component of f by

(14) f^(j)_ρ(x) = ∫_{‖ω‖∞ ≤ 2^j √n ρ} f̂(ω) e^{2πi x·ω} dω,

where x·ω = Σ_{i=1}^n x_i ω_i is the inner product of x and ω; then, according to the Nyquist–Shannon sampling theorem, f^(j)_ρ can be reconstructed from its samples corresponding to a sampling density of (2^j √n ρ/π)^n.

Fig. 3. Lennard-Jones (LJ) conformations of a cluster of n identical neutral atoms (interacting pairwise via the LJ potential). A conformation is a point in the 3n-dimensional Euclidean space of coordinates of the atomic centers. For a single pair of atoms, the LJ potential in reduced units is given by u(r) = r^(−12) − 2r^(−6), where r is the Euclidean interatomic distance; the total potential energy is then U_n = Σ_{i=1}^{n−1} Σ_{j=i+1}^n u(r_ij), where r_ij is the distance between atoms i and j in reduced units. The putative global minima are U*_4 = −6 and U*_5 = −9.104 [17]. Upper row: the convergence plot, the sample size used in each model, and the global minimum conformation at n = 4. Lower row: the same plots and conformation at n = 5.

Lemma 3.1. If f ∈ L¹(R^n) is a (ρ, p)-type HLFDF and f^(j)_ρ (j ∈ N) is defined as (14), then f̂ ∈ L¹(R^n) and

‖ f − f^(j)_ρ ‖∞ < ( p/(1+p) )^j ‖f̂‖₁.

Proof.
First, let

R^(j)_ρ = Σ_{i=j}^∞ I^(i)_ρ, where I^(0)_ρ = ∫_{‖ω‖∞ ≤ √n ρ} |f̂(ω)| dω and I^(j)_ρ = ∫_{2^(j−1)√n ρ < ‖ω‖∞ ≤ 2^j √n ρ} |f̂(ω)| dω

for all j ∈ N; then we have ‖f − f^(j)_ρ‖∞ ≤ R^(j+1)_ρ, and the condition (13) can be rewritten as R^(j)_ρ < (1 + p) I^(j)_ρ, or equivalently, R^(j+1)_ρ < p I^(j)_ρ. Hence f̂ ∈ L¹(R^n) and

R^(j+1)_ρ / R^(j)_ρ = R^(j+1)_ρ / ( I^(j)_ρ + R^(j+1)_ρ ) < R^(j+1)_ρ / ( R^(j+1)_ρ/p + R^(j+1)_ρ ) = p/(1+p);

furthermore, by noting that

R^(j+1)_ρ / ‖f̂‖₁ = R^(j+1)_ρ / ( I^(0)_ρ + R^(1)_ρ ) ≤ ( R^(2)_ρ/R^(1)_ρ )( R^(3)_ρ/R^(2)_ρ ) ··· ( R^(j+1)_ρ/R^(j)_ρ ) < ( p/(1+p) )^j,

the error bound can be written as ‖f − f^(j)_ρ‖∞ ≤ R^(j+1)_ρ < ( p/(1+p) )^j ‖f̂‖₁, as desired.

In order to limit the distribution of function values to match the dominant condition (13), we need to introduce the tempered functions. Assume that f^* = 0 and the sequence {D^(k)} is defined by D^(0) = Ω and

(15) D^(k+1) = { x ∈ D^(k) : f^(k+s)_ρ(x) ≤ Median f(ξ^(k)) }, k ∈ N,

where f^(j)_ρ is the (j, ρ)-bandlimited component of f defined as (14), ξ^(k) is a uniformly distributed random variable over D^(k), and s > 0 is the unique integer such that

(16) ( p/(1+p) )^s ‖f̂‖₁ < δ Median f(ξ^(0)) ≤ ( p/(1+p) )^(s−1) ‖f̂‖₁

for a fixed δ ∈ (0, 1]. We say that f is a (ρ, p, q)-type tempered function on Ω if for all k > 0 it holds that

(17) ( p/(1+p) ) Median f(ξ^(k)) < Median f(ξ^(k+1)) < ( 1/(1+q) ) Median f(ξ^(k)),

where q < 1/p. This condition requires Median f(ξ^(k)) to decrease steadily, neither too fast nor too slow. In the next section, it will be used to bound the number of function evaluations per contraction by a constant. In the following, we show that there always exists a weaker version of (17) that holds with large probability. This weaker version will be used to keep the growth of that bound from exceeding exponential order.

Lemma 3.2. Suppose {D^(k)}_{k∈N} is constructed by (15) with parameters ρ and p for the problem (1). If Ω is a compact set and f is continuous and not a constant on Ω, then for any ǫ > 0 there must exist a q = q(ǫ) ∈ (0, 1) such that

(18) ( q/(1+q) ) Median f(ξ^(k)) < Median f(ξ^(k+1)) < ( 1/(1+q) ) Median f(ξ^(k))

holds for every k > 0 with probability at least 1 − ǫ, where ξ^(k) is a uniformly distributed random variable on D^(k).

Proof. For convenience, let

µ^(k) = Mean f(ξ^(k)), m^(k) = Median f(ξ^(k)), σ^(k) = √( Var f(ξ^(k)) ).

Proving the first inequality of Lemma 3.2 is equivalent to showing that, on each D^(k+1), the upper bound of f, which is approximately m^(k), is less than (1+q)/q times the median m^(k+1) with probability at least 1 − ǫ.

Under the assumption f^* = 0, since f is continuous and not a constant on D^(k+1) ⊆ Ω, we have 0 < m^(k+1) < ∞ and 0 < σ^(k+1) < ∞, and there exists a C > 0 such that σ^(k+1) = C m^(k+1). First, the distance between the median and the mean is bounded by one standard deviation [19], i.e.,

m^(k+1) − σ^(k+1) ≤ µ^(k+1) ≤ m^(k+1) + σ^(k+1);

and it follows from the Chebyshev–Cantelli inequality that

P( f(ξ^(k+1)) > µ^(k+1) + ( √(1−ǫ)/√ǫ ) σ^(k+1) ) ≤ ǫ.

So it holds that

prctile( f(ξ^(k+1)), (1−ǫ)·100 ) ≤ µ^(k+1) + ( √(1−ǫ)/√ǫ ) σ^(k+1) ≤ m^(k+1) + ( 1 + √(1−ǫ)/√ǫ ) σ^(k+1) = ( ( √ǫ + C√ǫ + C√(1−ǫ) )/√ǫ ) m^(k+1),

or equivalently, ( q/(1+q) ) m^(k) < m^(k+1) holds for every k > 0 with probability at least 1 − ǫ, where q = √ǫ / ( C√ǫ + C√(1−ǫ) ). Similarly, by considering the translation f(x) − m^(k), the second inequality m^(k+1) < ( 1/(1+q) ) m^(k) also holds for every k > 0 with probability at least 1 − ǫ. It is clear that these two inequalities only make sense simultaneously when q < 1.
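The univariate instance of the HLFDF condition mentioned earlier in this section, a transform with tail |f̂(ω)| = |ω|^(−1−δ), can be checked in closed form: the tail masses R^(j)_ρ and band masses I^(j)_ρ from the proof of Lemma 3.1 are elementary integrals, and their ratio R^(j+1)_ρ/I^(j)_ρ equals 1/(2^δ − 1) for every j, so (13) holds for any p exceeding this critical value. The short script below is our own verification, not the paper's; it uses our reconstruction of the dyadic thresholds 2^(j−1)ρ in the univariate case.

```python
def R(j, rho, delta):
    # tail mass: integral of w^(-1-delta) over w > 2^(j-1) * rho,
    # which equals (2^(j-1) * rho)^(-delta) / delta
    return (2.0 ** (j - 1) * rho) ** (-delta) / delta

def I(j, rho, delta):
    # dyadic band mass over 2^(j-1) * rho < w <= 2^j * rho
    return R(j, rho, delta) - R(j + 1, rho, delta)

delta, rho = 0.5, 1.0
p_crit = 1.0 / (2.0 ** delta - 1.0)      # critical dominance constant
for j in range(1, 10):
    ratio = R(j + 1, rho, delta) / I(j, rho, delta)
    # R^(j+1) = p_crit * I^(j) exactly, so condition (13), i.e.
    # R^(j+1) < p * I^(j), holds strictly for any p > p_crit
    assert abs(ratio - p_crit) < 1e-9
```

Note that only the tail of f̂ enters (13), so the behavior of f̂ below ρ, which must remain integrable, does not affect this computation.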
4. Categories of continuous problems.
According to contractility, all possible continuous problems can be divided into the following three categories: logarithmic time contractile, polynomial time contractile, or noncontractile.

Definition 4.1. The problem (1) is said to be logarithmic time contractile if there exist ρ, p, q > 0 such that f is not only a (ρ, p)-type HLFDF but also a (ρ, p, q)-type tempered and critical regular function.

For such a problem, the key to gaining logarithmic time efficiency comes from three facts: (i) the median of f on D^(k) is reduced by a factor of 1/(1+q) as k increases; (ii) there is an upper bound on the number of function evaluations on every D^(k) such that a certain approximation A^(k)f can be constructed from these samples; and (iii) the error bound max_{x∈D^(k)} |A^(k)f(x) − f(x)| is strictly less than the median of f on D^(k). According to these three points, we can prove the following theorem.

Theorem 4.2. If the problem (1) is logarithmic time contractile with parameters (ρ, p, q), then for any ǫ > 0 there exist N_Ω,f > 0 and K > 0 such that after K median-type contractions defined by (10), it holds that

max_{x∈D^(K)} | f(x) − f^* | < ǫ,

with the linear convergence rate

max_{x∈D^(k+1)} | f(x) − f^* | / max_{x∈D^(k)} | f(x) − f^* | < 1/(1+q);

meanwhile, the total number of function evaluations and the time complexity are

O( N_Ω,f · log_{1+q}(1/ǫ) ) and O( N^a_Ω,f · log_{1+q}(1/ǫ) ),

respectively, where the time cost of the model computations is assumed to be O((N^(k))^a).
Fig. 4. A logarithmic time contractile example: f(x) = (x − π)² + sin²(50(x − π)), x ∈ [0, 2π]. Panels: the objective f, the convergence plot (absolute error vs. function evaluations), and the x trace.
Remark 4.1. The contraction algorithm is very suitable for this type of problem because there exists a fixed upper bound C N_Ω,f for every N^(k), where the constant C is slightly larger than one. This is important because the computational complexity of A^(k)f is closely related to N^(k).

Remark 4.2. A typical logarithmic time contractile example is illustrated in Figure 4. There are many local minima and the unique global minimizer is located at π. In the usual sense, it is not a very "good" function because a solution is easily trapped in any one of the local minima; but in the sense of contraction, it is still a "good" function because the contractions can be performed very successfully. Therefore, the detection of the space can be quickly concentrated, so that the corresponding efficiency is controlled within logarithmic time. This shows that the concept of contractility provides a new explanation of which continuous optimization problems can be effectively solved.

Remark 4.3. The example of a quadratic function illustrated in Figure 5 shows the fast convergence of the proposed algorithm. In this case, the contraction algorithm converges even faster than gradient descent with the optimal step size. The reason why the proposed contraction algorithm converges a little more slowly than the Bayesian optimization algorithm in the initial stage is that the former does more space detection in the current domain to cover the minimizers with the next subdomain, while the latter focuses more on what the current model can predict; but as we mentioned before, every exploitation on the basis of inadequate information may reduce efficiency or even cause trouble; see Figure 6 for a typical example of convergence failure in Bayesian optimization.
Fig. 5. Minimizing the function f(x1, x2) = 20x1² + x2², x1 ∈ [−5, 5], x2 ∈ [−5, 5]. The minimum is located at (0, 0). Upper left: the convergence plot of the median-type contraction algorithm with six contractions. Upper middle: the convergence plots of Bayesian optimization with different types of acquisition function: lower confidence bound (LCB), expected improvement (EI) and probability of improvement (PI). Upper right: the convergence plots of gradient descent with a fixed initial point and step sizes η = 0.047, 0.045 and 0.0475, respectively. Lower row: the x traces of the different algorithms. The contraction algorithm converges a little more slowly than the Bayesian optimization (BO) algorithm in the initial stage because BO focuses more on what the current model can predict, which may cause convergence failure, as shown in Figure 6.
Fig. 6. Minimizing the function f(x1, x2) = 20x1² + x2² − 10 exp[ −(x1 − ·)² − (x2 + 2)² ], x1 ∈ [−5, 5], x2 ∈ [−5, 5]. There are two local minima: a shallow one near the origin and a deeper one, the global minimum, near the center of the subtracted Gaussian well. Obviously, Bayesian optimization converges to the former local minimum and causes a convergence failure in the global sense; however, contraction optimization automatically increases the strength of space detection to ensure full coverage. Left: comparison of convergence between contraction optimization (CO) and different kinds of Bayesian optimization (BO). Middle: x trace of CO. Right: x trace of BO-LCB.
First, let us consider the set sequence { D ( k ) } k ∈ N defined as (15). since f ONTRACTION METHODS ρ, p, q )-type tempered functionMedian f ( ξ ( k +1) ) <
11 + q Median f ( ξ ( k ) ) , where ξ ( k ) is a uniformly distributed random variable over D ( k ) ; and then, there existscertain δ ∈ (0 ,
1) such that(19) Median f ( ξ ( k +1) ) <
11 + δ
11 + q Median f ( ξ ( k ) );and for this fixed δ , there is the unique integer s > (cid:18) p p (cid:19) s k ˆ f k < δ Median f ( ξ (0) ) (cid:18) p p (cid:19) s − k ˆ f k . Moreover, since f is a ( ρ, p, q )-type tempered function on Ω,(21) (cid:18) p p (cid:19) k Median f ( ξ (0) ) < Median f ( ξ ( k ) );meanwhile, note that f is a ( ρ, p )-type HLFDF, it follows from Lemma 3.1 that forevery k > k f − f ( k + s ) ρ k ∞ < (cid:18) p p (cid:19) k + s k ˆ f k . Clearly, it follows from (20)–(22) that(23) k f − f ( k + s ) ρ k ∞ < δ Median f ( ξ ( k ) ) , and the ( k + s, ρ )-bandlimited component f ( k + s ) ρ can be reconstructed by its samplescorresponding to a sampling density of 2 k + s ρ n /π n .Now we consider the approximation of f ( k + s ) ρ on the compact set D ( k ) . Since f is a critical regular function on Ω, µ ( D ( k ) ) = 2 − k µ (Ω) , the restriction of f ( k + s ) ρ to D ( k ) , denoted by f ( k + s ) ρ | D ( k ) , is closely related to thethreshold value N Ω ,f = 2 s µ (Ω) ρ n /π n and the prolate spheroidal functions ψ j ( x ) = ψ j ( x ; 2 k + sn ρ, D ( k ) ) which are the eigen-functions of the time and frequency limiting operator Q = Q (2 k + sn ρ, D ( k ) ) [14, 15, 32,33, 34]. More specifically, if we denote a prolate series up to and including the N ( k ) thterm by S N ( k ) f ( x ) = N ( k ) X j =1 ψ j ( x ) Z D ( k ) f ( x ) ψ j ( x )d x, X. LUO AND X. XU then S N ( k ) f ( x ) = S N ( k ) f ( k + s ) ρ ( x ) follows by noting that ψ j has a Fourier transformsupported in { ω ∈ R n : k ω k ∞ jn ρ } . Moreover, it is important to mention that asuper-exponential decay rate of the error boundmax x ∈ D ( k ) (cid:12)(cid:12)(cid:12) f ( k + s ) ρ ( x ) − S N ( k ) f ( k + s ) ρ ( x ) (cid:12)(cid:12)(cid:12) as soon as N ( k ) reaches or goes beyond the plunge region around the threshold value N Ω ,f [5, 4]. 
(And it is worth pointing out that, strictly speaking, we should first finda series of non-crossing hypercubes covering D ( k ) , and then build a piecewise prolateseries approximation on these hypercubes.)Thus, for all k there exists a C > A ( k ) f = S N ( k ) f ( k + s ) ρ where N ( k ) = CN Ω ,f satisfies max x ∈ D ( k ) (cid:12)(cid:12)(cid:12) f ( k + s ) ρ ( x ) − A ( k ) f ( x ) (cid:12)(cid:12)(cid:12) δ Median f ( ξ ( k ) ) − k f − f ( k + s ) ρ k ∞ , and further, together with (23), we havemax x ∈ D ( k ) (cid:12)(cid:12)(cid:12) f ( x ) − A ( k ) f ( x ) (cid:12)(cid:12)(cid:12) = max x ∈ D ( k ) (cid:12)(cid:12)(cid:12) f ( x ) − f ( k + s ) ρ + f ( k + s ) ρ − A ( k ) f ( x ) (cid:12)(cid:12)(cid:12) k f − f ( k + s ) ρ k ∞ + max x ∈ D ( k ) (cid:12)(cid:12)(cid:12) f ( k + s ) ρ ( x ) − A ( k ) f ( x ) (cid:12)(cid:12)(cid:12) δ Median f ( ξ ( k ) ) . (25)Now, under the assumption of f ∗ = 0, according to (10), the contracting sequence { D ( k ) } are defined recursively by D (0) = Ω and D ( k +1) = n x ∈ D ( k ) : A ( k ) f ( x ) Median f ( ξ ( k ) ) o , k ∈ N , where A ( k ) f is the prolate series approximation (24). So it follows from the strongconvergence Theorem 2.2, (19) and (25) thatmax x ∈ D ( k ) | f ( x ) | (cid:18)
$$\max_{x\in D^{(k)}}|f(x)| \le \left(\frac{1}{1+q}\right)^{k}\max_{x\in\Omega}|f(x)|.$$
Further, for a fixed accuracy $\epsilon > 0$, there exists a $K > 0$ such that
$$\left(\frac{1}{1+q}\right)^{K} \le \epsilon < \left(\frac{1}{1+q}\right)^{K-1}, \quad \text{or equivalently,} \quad \log_{1+q}\frac{1}{\epsilon} \le K < \log_{1+q}\frac{1}{\epsilon} + 1;$$
hence, after $K$ median-type contractions, one gets the approximate solution set $D^{(K)}$ with an error bound
$$\max_{x\in D^{(K)}}|f(x)| \le \epsilon,$$
and the total number of function evaluations is less than
$$\sum_{k=0}^{K-1}\lceil CN_{\Omega,f}/2\rceil + CN_{\Omega,f} = O\!\left(N_{\Omega,f}\cdot\log_{1+q}\frac{1}{\epsilon}\right),$$
where $\lceil t\rceil$ is the smallest integer greater than or equal to $t$; further, note that the computational complexity of $A^{(k)}f$ is $O((N^{(k)})^a)$, so the time complexity of median-type algorithms is less than
$$\sum_{k=0}^{K-1}\lceil C^a N_{\Omega,f}^a/2\rceil + C^a N_{\Omega,f}^a = O\!\left(N_{\Omega,f}^a\cdot\log_{1+q}\frac{1}{\epsilon}\right),$$
taking a logarithmic time for any desired accuracy $\epsilon$.

Definition. The problem (1) is said to be polynomial time contractile if there exist $\rho, p > 0$ and $q \in (0,1]$ such that (i) $f$ is a $(\rho,p)$-type HLFDF; (ii) $f$ satisfies the conclusion of Lemma 3.2 with parameters $\rho$, $p$ and $q$; and (iii) $p > q$.

When $p \le q$, a problem that satisfies the above (i) and (ii) is logarithmic time contractile after adding the critical regular condition, so it is necessary to impose $p > q$.

Theorem.
If the problem (1) is polynomial time contractile with parameters $(\rho, p)$ and $q \in (0,1]$, then for any $\epsilon > 0$ there exist $N_{\Omega,f} > 0$ and $K > 0$ such that after $K$ contractions, defined by (10) but not limited to it, the upper bound
$$\max_{x\in D^{(K)}}|f(x) - f^*| < \epsilon$$
holds, with the linear convergence rate
$$\frac{\max_{x\in D^{(k+1)}}|f(x) - f^*|}{\max_{x\in D^{(k)}}|f(x) - f^*|} < \frac{1}{1+q};$$
meanwhile, the total number of function evaluations and the time complexity are
$$O\!\left(N_{\Omega,f}\cdot 2^{\,l\log_{1+q}\frac{1}{\epsilon}}\right) \quad \text{and} \quad O\!\left(N_{\Omega,f}^a\cdot 2^{\,al\log_{1+q}\frac{1}{\epsilon}}\right),$$
respectively, where $l$ is given in (26) and the time cost of model computations is assumed to be $O((N^{(k)})^a)$.
Fig. 7. Minimizing the Rosenbrock function $f(x_1,x_2) = 100(x_2 - x_1^2)^2 + (x_1 - 1)^2$ [28]. The global minimum is located at $(1,1)$, which lies in a long, narrow, parabolic-shaped flat valley; this is a typical polynomial time contraction problem. Notice that the number of samples used in the model is almost always rising, but the contraction can still be performed effectively. In this case, the contraction optimization converges faster than three common types of Bayesian optimization. This seems to indicate that contraction is not only related to efficiency, but also helps to improve the local accuracy of model prediction. Left: comparison of convergence (absolute error vs. function evaluations) between contraction optimization (CO) and Bayesian optimization with different acquisition functions (BO-LCB, BO-EI, BO-PI). Middle: the sample size used in each model. Right: the $x$ trace of CO.
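The contraction principle behind the Figure 7 experiment can be sketched in a few lines. The sketch below makes two deliberate simplifications that are not part of the paper's actual method: the model $A^{(k)}f$ is replaced by direct evaluations of $f$, and the closed set $D^{(k)}$ is tracked by a finite point cloud; the search box around the minimizer $(1,1)$ is also an illustrative assumption.

```python
# Minimal sketch of a median-type contraction on the Rosenbrock function.
# Simplifications (not the paper's implementation): the model A^(k)f is
# replaced by direct evaluations of f, and D^(k) is a finite point cloud
# rather than a closed set.
import random

def rosenbrock(x):
    x1, x2 = x
    return 100.0 * (x2 - x1**2)**2 + (x1 - 1.0)**2

def median(values):
    s = sorted(values)
    m = len(s) // 2
    return s[m] if len(s) % 2 else 0.5 * (s[m - 1] + s[m])

random.seed(0)
# D^(0): a uniform point cloud over a box around the minimizer (1, 1)
cloud = [(random.uniform(-2.0, 2.0), random.uniform(-1.0, 3.0))
         for _ in range(4000)]

med = median(rosenbrock(x) for x in cloud)
for k in range(8):
    # D^(k+1) = { x in D^(k) : f(x) <= Median f(xi^(k)) }
    cloud = [x for x in cloud if rosenbrock(x) <= med]
    med = median(rosenbrock(x) for x in cloud)

best = min(cloud, key=rosenbrock)
```

Each pass keeps roughly the lower half of the cloud, mirroring the measure halving $\mu(D^{(k)}) = 2^{-k}\mu(\Omega)$ of the critical regular case, and the surviving points concentrate along the parabolic valley toward $(1,1)$.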
Remark. Unlike the logarithmic time contractile problems, there is no fixed upper bound for $N^{(k)}$; but it can often be controlled by $C\,2^{kl}N_{\Omega,f}$, where $l$ is given in (26). So the contraction algorithm is still effective for this type of problems, because the contraction of $D^{(k)}$ can greatly reduce the computational cost of the model $A^{(k)}f$. Moreover, the algorithm discussed here is not limited to the median-type. A typical polynomial time contractile example is illustrated in Figure 7.

Proof. Let us now see what happens after replacing (17) with (18). Since $p > q$, there is a unique integer $l$ such that
$$(26)\qquad \left(\frac{p}{1+p}\right)^{l} \le \frac{q}{1+q} < \left(\frac{p}{1+p}\right)^{l-1}.$$
Similarly, we first consider the set sequence $\{D^{(k)}\}_{k\in\mathbb{N}}$ defined as (15). Since the conclusion of Lemma 3.2 holds with a parameter $q$ for $f$, we have
$$(27)\qquad \left(\frac{q}{1+q}\right)^{k}\mathrm{Median}\,f(\xi^{(0)}) < \mathrm{Median}\,f(\xi^{(k)}),$$
and there exists certain $\delta \in (0,1)$ such that
$$(28)\qquad \mathrm{Median}\,f(\xi^{(k+1)}) < \frac{1}{1+\delta}\cdot\frac{1}{1+q}\,\mathrm{Median}\,f(\xi^{(k)});$$
moreover, there is the unique integer $s > 0$ such that
$$(29)\qquad \left(\frac{p}{1+p}\right)^{s}\|\hat f\| < \delta\,\mathrm{Median}\,f(\xi^{(0)}) \le \left(\frac{p}{1+p}\right)^{s-1}\|\hat f\|,$$
where $\xi^{(k)}$ is a uniformly distributed random variable over $D^{(k)}$.

Note that $f$ is a $(\rho,p)$-type HLFDF; it follows from Lemma 3.1, (26), (27), and (29) that for every $k > 0$,
$$\|f - f_\rho^{(kl+s)}\|_\infty \le \left(\frac{p}{1+p}\right)^{kl+s}\|\hat f\| < \delta\,\mathrm{Median}\,f(\xi^{(k)}),$$
and the $(kl+s,\rho)$-bandlimited component $f_\rho^{(kl+s)}$ can be reconstructed by its samples corresponding to a sampling density of $2^{kl+s}\rho^n/\pi^n$.

In addition, note that $\mu(D^{(k)}) < \mu(D^{(k-1)})$; according to the discussion in the previous subsection, for every $k$ there exist $C > 1$ and $A^{(k)}f$ such that $A^{(k)}f$ can be constructed from fewer than $N^{(k)} = C\,2^{kl}N_{\Omega,f}$ samples of $f$ over $D^{(k)}$ and
$$(30)\qquad \max_{x\in D^{(k)}}|f(x) - A^{(k)}f(x)| \le \delta\,\mathrm{Median}\,f(\xi^{(k)}),$$
where $A^{(k)}f = S_{N^{(k)}}f_\rho^{(kl+s)}$ is the prolate series approximation given as (24) and $N_{\Omega,f} = 2^{s}\mu(\Omega)\rho^n/\pi^n$.

Now, under the assumption of $f^* = 0$, according to (10), the contracting sequence $\{D^{(k)}\}$ is defined recursively by $D^{(0)} = \Omega$ and
$$D^{(k+1)} = \left\{x \in D^{(k)} : A^{(k)}f(x) \le \mathrm{Median}\,f(\xi^{(k)})\right\},\quad k\in\mathbb{N},$$
so that
$$\max_{x\in D^{(k)}}|f(x)| \le \left(\frac{1}{1+q}\right)^{k}\max_{x\in\Omega}|f(x)|.$$
Further, for a fixed accuracy $\epsilon > 0$, there exists a $K > 0$ such that
$$\left(\frac{1}{1+q}\right)^{K} \le \epsilon < \left(\frac{1}{1+q}\right)^{K-1}, \quad \text{or equivalently,}\quad \log_{1+q}\frac{1}{\epsilon} \le K < \log_{1+q}\frac{1}{\epsilon} + 1;$$
hence, after $K$ median-type contractions, one can obtain the approximate solution set $D^{(K)}$ with an error bound
$$\max_{x\in D^{(K)}}|f(x)| \le \epsilon,$$
and clearly, the total number of function evaluations is less than
$$\sum_{k=0}^{K} C\,2^{kl} N_{\Omega,f} = O\!\left(N_{\Omega,f}\cdot 2^{\,l\log_{1+q}\frac{1}{\epsilon}}\right);$$
similarly, since the computational complexity of $A^{(k)}f$ is $O((N^{(k)})^a)$, the time complexity is much less than
$$\sum_{k=0}^{K} C^a 2^{akl} N_{\Omega,f}^a = O\!\left(N_{\Omega,f}^a\cdot 2^{\,al\log_{1+q}\frac{1}{\epsilon}}\right),$$
taking a polynomial time for any desired accuracy $\epsilon$. It is worth noting that the median-type contraction does not actually have any special effect in the above discussion, so the conclusions of the theorem hold for any type of contraction.

For some functions the contracting condition (11) cannot be met until the global minima is reached; at that point the contraction algorithm degrades into a regular model-based approach. Although efficiency gains cannot be achieved with domain contractions, the corresponding optimization problem may still be effectively solved if the function is sufficiently smooth. The term "noncontractile" is used to describe those functions whose contraction cannot be consistently guaranteed. This does not always mean that the algorithm performs no contraction throughout the process, although such cases do exist. See Figure 8 for different noncontractile examples.
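Before turning to noncontractile problems, the bookkeeping behind the polynomial-time bound can be checked numerically: $l$ is the unique integer satisfying (26), and the per-contraction sample budget then grows like $2^{kl}$. The values of $p$, $q$, $C$, $N_{\Omega,f}$ and $K$ below are hypothetical, chosen only for illustration.

```python
# Numeric check of the bandwidth-growth bookkeeping: l is the unique integer
# with (p/(1+p))^l <= q/(1+q) < (p/(1+p))^(l-1), and the sample budget at
# contraction k is then about C * 2^(k*l) * N_Omega_f.
# All parameter values below are hypothetical.
p, q = 3.0, 1.0              # hypothetical parameters with p > q
rp, rq = p / (1 + p), q / (1 + q)

l = 1
while rp**l > rq:
    l += 1                   # smallest l with rp**l <= rq
assert rp**l <= rq < rp**(l - 1)

C, N_omega_f, K = 2, 50, 6   # hypothetical constants and contraction count
budgets = [C * 2**(k * l) * N_omega_f for k in range(K + 1)]
total = sum(budgets)         # geometric sum: dominated by the last term
```

The geometric growth of `budgets` is exactly why the total cost is polynomial rather than logarithmic in $1/\epsilon$: the sum is dominated by its last term $C\,2^{Kl}N_{\Omega,f} = O(N_{\Omega,f}\,2^{l\log_{1+q}(1/\epsilon)})$.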
Definition. The problem (1) is said to be noncontractile if there are no suitable $\rho, p, q > 0$ such that $f$ is a $(\rho,p)$-type HLFDF or satisfies the conclusion of Lemma 3.2 with parameters $\rho$, $p$ and $q$.

Here we summarize some conclusions based on smoothness for noncontractile problems. Suppose that the samples $\chi$ are uniformly distributed over $\Omega$ with sample size $N$, the relevant data values are $f_\chi = \{f(\chi_i)\}$, and $A_\chi f$ interpolates $f$ on $\chi$. For a sample set $\chi$ over $\Omega$, we denote the associated fill distance by
$$(31)\qquad h_N := \sup_{x\in\Omega}\,\min_{1\le i\le N}\|x - \chi_i\|;$$
then for $\epsilon > 0$, there exists $C_\epsilon > 0$ such that
$$(32)\qquad P\!\left(h_N > C_\epsilon(\log N/N)^{1/n}\right) = O(N^{-\epsilon}),$$
or, $h_N = O(N^{(\gamma-1)/n})$, where $0 < \gamma \ll 1$; see Lemma 12 of [8].

Fig. 8. Three examples of noncontractile functions. Left column: a Gaussian noise function $f_1$ that is obviously not a HLFDF, for which the contracting condition will never be met; a bandlimited function $f_2 = \sin(50x)$ that does not have the essential multilevel characteristic of HLFDFs, for which the contracting condition (11) cannot be met until the global minima is reached; and a deep trap function $f_3$ (a quadratic plus a sharp exponential trap) that is a HLFDF but does not satisfy the conclusion of Lemma 3.2; in this last case the contractions will still be carried out with a large probability and eventually lead to convergence failure, due to the failure of the model prediction error bound estimations. Middle column: histograms of the function value distributions of $f_1$, $f_2$ and $f_3$, respectively. Right column: discrete cosine transforms of $f_1$, $f_2$ and $f_3$, respectively.
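The fill distance (31) is easy to estimate numerically. The sketch below approximates the supremum over $\Omega = [0,1]^2$ by a dense probe grid, which only lower-bounds the true $h_N$; the sample sizes and grid resolution are illustrative choices, not values from the paper.

```python
# Sketch of the fill distance (31): h_N = sup_x min_i ||x - chi_i||,
# with the supremum approximated on a probe grid over Omega = [0,1]^2.
import math
import random

random.seed(1)
chi = [(random.random(), random.random()) for _ in range(200)]  # sample set

def fill_distance(points, probes_per_axis=40):
    """Grid approximation of sup_x min_i ||x - chi_i|| over [0,1]^2."""
    h = 0.0
    for i in range(probes_per_axis + 1):
        for j in range(probes_per_axis + 1):
            x = (i / probes_per_axis, j / probes_per_axis)
            d = min(math.dist(x, c) for c in points)
            h = max(h, d)
    return h

h_200 = fill_distance(chi)
# (32) suggests h_N shrinks roughly like (log N / N)^(1/n) as N grows;
# enlarging the sample set can only reduce the fill distance:
extra = [(random.random(), random.random()) for _ in range(600)]
h_800 = fill_distance(chi + extra)
assert h_800 <= h_200
```

Monotonicity under refinement (`h_800 <= h_200`) holds deterministically, since adding points can only shrink each probe's nearest-neighbor distance.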
Theorem 4.6. If $f \in C^s(\Omega)$ with $s > n/2$ on a bounded domain $\Omega \subset \mathbb{R}^n$, then there exists a band-limited interpolant $A_\chi f$ such that for any $\epsilon > 0$, it holds that
$$\left|\min_{x\in\Omega}A_\chi f(x) - f^*\right| < \epsilon,$$
and the time complexity of $A_\chi f$ is
$$O(N^a) = O\!\left(\left(\tfrac{1}{\epsilon}\right)^{\frac{na}{(s-n/2)(1-\gamma)}}\right),$$
where $\chi$ and $\gamma$ are consistent with (31) and (32), the time cost of $A_\chi f$ is assumed to be $O(N^a)$, $N$ is the sample size of $\chi$, and $\|f\|_{C^s(\Omega)} = \sum_{|\alpha|\le s}\sup_{x\in\Omega}|D^\alpha f|$.

According to Lemma 3.9 of [23] and (32), for all $x \in \Omega$ it holds that
$$|f(x) - A_\chi f(x)| \le Ch_N^{s-n/2}\|f\|_{C^s(\Omega)} = O\!\left(N^{-(s-n/2)(1-\gamma)/n}\right),$$
which is actually equivalent to Theorem 4.6; this result is similar to, but slightly different from, existing convergence results for smooth functions. Similarly, if $f$ is a Hölder continuous function, the corresponding problem (1) can also be completed in polynomial time. It is said that $f$ satisfies an $\alpha$-Hölder condition if there exist $C > 0$ and $0 < \alpha \le 1$ such that
$$(33)\qquad |f(x') - f(x'')| < C\|x' - x''\|^{\alpha}.$$
Then we have
Theorem 4.7. If $f$ satisfies an $\alpha$-Hölder condition on a bounded domain $\Omega \subset \mathbb{R}^n$, then there exists a nearest-neighbor interpolant $A_\chi f$ such that for any $\epsilon > 0$, it holds that
$$\left|\min_{x\in\Omega}A_\chi f(x) - f^*\right| < \epsilon,$$
and the time complexity of $A_\chi f$ is
$$O(N^a) = O\!\left(\left(\tfrac{1}{\epsilon}\right)^{\frac{na}{\alpha(1-\gamma)}}\right),$$
where $\chi$, $\gamma$ and $\alpha$ are consistent with (31)--(33), and the time cost of $A_\chi f$ is assumed to be $O(N^a)$.

For a Hölder continuous function $f$, there exists a nearest-neighbor interpolant $A_\chi f$, which is closely related to the Voronoi diagram of $\chi$ [3], such that for any $x \in \Omega$ it holds that
$$|f(x) - A_\chi f(x)| \le Ch_N^{\alpha} = O\!\left(N^{-\alpha(1-\gamma)/n}\right),$$
which is actually equivalent to Theorem 4.7. Similarly, if $f$ does not satisfy any Hölder condition, then the algorithm may not finish in polynomial time; for example, if $|f(x') - f(x'')| < C/\log(1 + \|x' - x''\|^{-1})$, then the time complexity of a nearest-neighbor interpolant $A_\chi f$ will be
$$O\!\left(\left(e^{1/\epsilon} - 1\right)^{\frac{na}{1-\gamma}}\right)$$
if the absolute error bound of $A_\chi f$ on $\Omega$ is required to be less than any given $\epsilon > 0$.
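The nearest-neighbor interpolant in the argument above is simple to realize: $A_\chi f(x)$ returns the data value at the sample point nearest to $x$, so its error is controlled by the Hölder modulus evaluated at the fill distance. The sketch below illustrates this in one dimension; the test function, sample size, and probe grid are illustrative assumptions, and the brute-force nearest-neighbor search stands in for the Voronoi-based construction mentioned in the proof.

```python
# Sketch of a nearest-neighbor interpolant A_chi f: its error on Omega=[0,1]
# is bounded by C * h_N^alpha for an alpha-Hoelder function f.
import math
import random

random.seed(2)

def f(x):
    # alpha-Hoelder with alpha = 1/2 and constant C = 1 (kink at x = 0.3)
    return math.sqrt(abs(x - 0.3))

chi = sorted(random.random() for _ in range(500))   # sample set over [0,1]
vals = [f(c) for c in chi]

def nn_interpolant(x, nodes, values):
    # brute-force nearest-neighbor rule; ties are broken arbitrarily
    i = min(range(len(nodes)), key=lambda j: abs(x - nodes[j]))
    return values[i]

# maximum error over a dense probe grid: of order h_N^(1/2)
err = max(abs(f(x / 1000) - nn_interpolant(x / 1000, chi, vals))
          for x in range(1001))
assert err <= 1.0   # crude Hoelder bound: C * h_N^alpha with h_N <= 1
```

The brute-force search costs $O(N)$ per query; precomputing the Voronoi cells (here, midpoints between consecutive sorted nodes) would reduce each query to $O(\log N)$, which does not change the polynomial dependence on $1/\epsilon$.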
5. Conclusions.
For a fairly general class of problems, it is often impossible to find a universal method that performs well in all possible situations; this is the important connotation of the no free lunch theorems [38, 18, 37]. This is not due to some kind of curse, but to a lack of common features. Actually, a certain commonality is the premise of efficiency; and when there is no such premise, it is always wise to find a subclass that has enough in common while maintaining a proper level of generality. Although the consequent loss of generality is the price we have to pay, the corresponding discriminant conditions may help us understand the problem better.

In this work, we proposed a class of contraction algorithms based on the concept of contractility. From the efficiency of these algorithms, especially the mean-type, we classified all possible continuous problems. Experiments with various categories of examples show that these categories seem to be reasonable. To ensure the existence of efficient optimization algorithms [38], we mainly impose a class of hierarchical low frequency dominant conditions on the problems. We now know that a sufficiently smooth or contractile problem can be effectively predicted using a priori information; hence, contractility might be viewed as a complement to smoothness.

The algorithm is implemented in Matlab. The source code of the implementation is available at https://github.com/xiaopengluo/contropt.

Future research is currently being conducted in several areas. One attempt is to establish time complexities that are not exponentially related to the dimension for certain function classes; we hope the expected results will provide some valuable suggestions for the efficiency of high-dimensional continuous optimization. Second, we are also considering how to establish an adaptive contracting condition to achieve optimal efficiency for various problems; a successful achievement would be very helpful in practice. Third, we hope that some difficult problems can be translated into relevant easy problems by applying preconditioning and postconditioning steps before and after each contraction. Moreover, this requires us to further distinguish which problems are inherently difficult to solve, and which are only seemingly intractable.
REFERENCES

[1] A. Y. Aravkin, J. V. Burke, D. Drusvyatskiy, M. P. Friedlander, and S. Roy, Level-set methods for convex optimization, Mathematical Programming, 174 (2019), pp. 359–390.
[2] S. Arlot and A. Celisse, A survey of cross-validation procedures for model selection, Statist. Surv., 4 (2010), pp. 40–79.
[3] F. Aurenhammer, Voronoi diagrams - a survey of a fundamental geometric data structure, ACM Comput. Surv., 23 (1991), pp. 345–405.
[4] A. Bonami and A. Karoui, Spectral decay of time and frequency limiting operator, Appl. Comput. Harmon. Anal., 42 (2017), pp. 1–20.
[5] J. P. Boyd, Approximation of an analytic function on a finite real interval by a bandlimited function and conjectures on properties of prolate spheroidal functions, Appl. Comput. Harmon. Anal., 25 (2003), pp. 168–176.
[6] L. Breiman, Random forests, Machine Learning, 45 (2001), pp. 5–32.
[7] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees, CRC Press, Boca Raton, FL, 1984.
[8] A. D. Bull, Convergence rates of efficient global optimization algorithms, Journal of Machine Learning Research, 12 (2011), pp. 2879–2904.
[9] S. Geisser, The predictive sample reuse method with applications, J. Amer. Statist. Assoc., 70 (1975), pp. 320–328.
[10] D. R. Jones, M. Schonlau, and W. J. Welch, Efficient global optimization of expensive black-box functions, Journal of Global Optimization, 13 (1998), pp. 455–492.
[11] D. R. Jones, M. Schonlau, and W. J. Welch, Efficient global optimization of expensive black-box functions, Journal of Global Optimization, 13 (1998), pp. 455–492.
[12] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, Optimization by simulated annealing, Science, 220 (1983), pp. 671–680.
[13] J. P. C. Kleijnen, W. van Beers, and I. van Nieuwenhuyse, Expected improvement in efficient global optimization through bootstrapped kriging, Journal of Global Optimization, 54 (2012), pp. 59–73.
[14] H. J. Landau and H. O. Pollak, Prolate spheroidal wave functions, Fourier analysis and uncertainty, II, Bell Systems Tech. J., 40 (1961), pp. 65–84.
[15] H. J. Landau and H. O. Pollak, Prolate spheroidal wave functions, Fourier analysis and uncertainty, III, Bell Systems Tech. J., 41 (1962), pp. 1295–1336.
[16] E. L. Lawler and D. E. Wood, Branch-and-bound methods: A survey, Operations Research, 14 (1966), pp. 699–719.
[17] R. H. Leary, Global optima of Lennard-Jones clusters, J. Global Optim., 11 (1997), pp. 35–53.
[18] W. G. Macready and D. H. Wolpert, What makes an optimization problem hard?, Complexity, 1 (1996), pp. 40–46.
[19] C. Mallows, Another comment on O'Cinneide, The American Statistician, 45 (1991), p. 257.
[20] M. Mitchell, An Introduction to Genetic Algorithms, MIT Press, Cambridge, MA, 1996.
[21] J. Močkus, On Bayesian methods for seeking the extremum, Optimization Techniques, (1974), pp. 400–404.
[22] J. Močkus, V. Tiesis, and A. Žilinskas, The application of Bayesian methods for seeking the extremum, Toward Global Optimization, 2 (1978), pp. 117–129.
[23] F. J. Narcowich, J. D. Ward, and H. Wendland, Sobolev bounds on functions with scattered zeros, with applications to radial basis function surface fitting, Math. Comp., 74 (2005), pp. 743–763.
[24] Q. Lin, S. Nadarajah, and N. Soheili, A level-set method for convex optimization with a feasible solution path, SIAM J. Optim., 28 (2018), pp. 3290–3311.
[25] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, MIT Press, Cambridge, MA, 2006.
[26] I. Rechenberg, Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution, Frommann-Holzboog, Stuttgart, 1973.
[27] R. T. Rockafellar, Convex Analysis, Princeton Landmarks in Mathematics, Princeton University Press, Princeton, NJ, 1970.
[28] H. H. Rosenbrock, An automatic method for finding the greatest or least value of a function, The Computer Journal, 3 (1960), pp. 175–184.
[29] H. P. Schwefel, Evolution and Optimum Seeking, Wiley-Interscience, New York, 1995.
[30] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas, Taking the human out of the loop: A review of Bayesian optimization, Proceedings of the IEEE, 104 (2016), pp. 148–175.
[31] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas, Taking the human out of the loop: A review of Bayesian optimization, Proceedings of the IEEE, 104 (2016), pp. 148–175.
[32] D. Slepian, Prolate spheroidal wave functions, Fourier analysis and uncertainty, IV, Bell Systems Tech. J., 43 (1964), pp. 3009–3057.
[33] D. Slepian, On bandwidth, Proc. IEEE, 64 (1976), pp. 292–300.
[34] D. Slepian and H. O. Pollak, Prolate spheroidal wave functions, Fourier analysis and uncertainty, I, Bell Systems Tech. J., 40 (1961), pp. 43–64.
[35] R. Storn and K. Price, Differential evolution - a simple and efficient heuristic for global optimization over continuous spaces, Journal of Global Optimization, 11 (1997), pp. 341–359.
[36] A. Törn and A. Žilinskas, Global Optimization, Springer-Verlag, Berlin Heidelberg, 1989.
[37] D. H. Wolpert, The lack of a priori distinctions between learning algorithms, Neural Computation, 8 (1996), pp. 1341–1390.
[38] D. H. Wolpert and W. G. Macready, No free lunch theorems for optimization, IEEE Transactions on Evolutionary Computation, 1 (1997), pp. 67–82.