Contraction methods for continuous optimization
XIAOPENG LUO∗ AND XIN XU†

Abstract.
We describe a class of algorithms that establishes a contracting sequence of closed sets for solving continuous optimization problems. From the perspective of ensuring effective contractions to achieve a certain accuracy, all possible continuous optimization problems can be divided into three categories: logarithmic time contractile, polynomial time contractile, or noncontractile. For any problem from the first two categories, the constructed contracting sequence converges to the set of all global minimizers with a theoretical guarantee of linear convergence; for any problem from the last category, we also discuss possible troubles caused by using the proposed algorithms.
Key words. continuous optimization, contraction methods, categories, complexity, convergence
AMS subject classifications.
1. Introduction.
For a possibly nonlinear and non-convex continuous function f : Ω ⊂ R^n → R with global minimum f^* and the set of all global minimizers X^* in Ω, we consider the constrained optimization problem

(1) min_{x∈Ω} f(x),

where Ω is a (not necessarily convex) closed domain; and we assume that observing f is costly, so we wish to use as few observations as possible. When f is cheap to evaluate, many standard optimization algorithms exist, such as genetic algorithms [20], evolution strategies [29], differential evolution [35] and simulated annealing [12]. However, when f is expensive, we have to pay more attention to maximizing the use of the information obtained; and perhaps the best-known strategy is Bayesian optimization (BO), a sequential model-based approach in which the model is often obtained using a Gaussian process. It was first introduced by Močkus [21] and later popularized by Jones et al. [10]. In particular, theoretical results on the convergence behaviour of BO are provided in [8]. Various extensions have been suggested by further authors, and a recent review can be found in [30].

The BO method applies the Gaussian process to construct a model for generating acquisition functions. The acquisition function, which trades off exploration and exploitation, is used to determine the next candidate point. This is exactly how BO uses the information obtained; this study presents a different approach from the perspective of contractility.

The concept of contractility comes from the level set. Actually, the problem (1) is closely related to the u-sublevel set [27, 24, 1], i.e.,

(2) L(u) = { x ∈ Ω : f(x) ≤ u }, u ∈ [ f^*, max_{x∈Ω} f(x) ].

Its boundary set ∂L(u) = { x ∈ Ω : f(x) = u } is a contour surface, and L(u) contracts monotonically from Ω to X^* as u decreases continuously from max_{x∈Ω} f(x) to f^*. We introduce the contracting sequence as a direct extension of the sublevel set. A contracting sequence {D^(k)}_{k∈N} w.r.t.
the optimization problem (1) is a sequence

∗ Department of Chemistry, Princeton University, Princeton, NJ 08544, and Department of Control and Systems Engineering, School of Management and Engineering, Nanjing University, Nanjing, JS 210093, China ([email protected], [email protected]).
of decreasing nonempty closed sets satisfying

(3) Ω ⊇ D^(k) ⊃ D^(k+1) ⊃ X^*;

further, we call {D^(k)} a strictly contracting sequence if the sets also satisfy

(4) max_{x∈D^(k+1)} |f(x)| < max_{x∈D^(k)} |f(x)|.

In this work, we translate the problem (1) into the problem of constructing a strictly contracting sequence {D^(k)} with D^(0) = Ω. Each D^(k+1) is sequentially determined by an approximation model constructed from uniformly distributed samples on D^(k). Both the proposed contraction method (CO) and BO depend on the approximation model, but they have very significant differences.

The first essential difference is that BO is designed to find only one of the minimizers, while CO tries to find all of them. The most interesting fact is that if we do not find all the global minimizers, we cannot be sure that any one of them is a global minimizer. So, although BO can ensure convergence [8], at any given moment we do not know whether the current best solution is really a global minimizer, even in the sense of probability. In contrast, this is deterministic for CO because, for most hierarchical low frequency dominant functions (see subsection 3.1), the premise of each contraction is that all global minimizers are covered by the subsequent closed subsets in the sense of probability.

The second major difference is that BO always updates the model on the original domain and uses all the historical samples [10, 30], while CO updates the model on the gradually decreasing contracted sets and uses only those samples that are located in the contracted sets. This helps control the computational cost of modelling, especially for the hierarchical low frequency dominant functions mentioned above. Due to this characteristic, a contraction method might be viewed as a special case of classical branch and bound techniques [16, 36], which aim to sequentially guarantee X^* ⊂ D^(k+1) and exclude { D^(k) − D^(k+1) }_k.
The last significant disparity is the different conditions for achieving satisfactory efficiency. Specifically, the efficiency of BO depends mainly on the smoothness of the objective function [8]; however, CO may achieve logarithmic time efficiency even for non-differentiable functions; see Theorem 4.2 for details. The hierarchical low frequency dominant condition once again plays an important role here. On the one hand, the contraction strategy provides a different perspective from which to reconsider the problem (1); on the other hand, contractility can be viewed as a necessary complement to smoothness. Figure 4 gives an intuitive example that illustrates this clearly.

The remainder of the paper is organized as follows. The next section introduces the contraction algorithms and mainly focuses on their convergence. Three conditions for guaranteeing algorithm efficiency are discussed in detail in section 3, especially with regard to hierarchical low frequency dominant functions. These conditions allow us to divide all possible continuous problems into three categories in section 4. We draw some conclusions in section 5.
2. Contraction algorithms.
For any nonnegative integer k, the contracting sequence {D^(k)} is defined recursively by D^(0) = Ω and

(5) D^(k+1) = { x ∈ D^(k) : A^(k)f(x) ≤ f^*_χ(k) + u^(k) },

where χ^(k) = {χ^(k)_i} are uniformly distributed over D^(k) with sample size N^(k), the relevant data values are f_χ(k) = {f(χ^(k)_i)} with the current f^*_χ(k) = min(f_χ(k)) and f^**_χ(k) = max(f_χ(k)), and A^(k)f is an approximate model of f w.r.t. the data pairs (χ^(k), f_χ(k)) such that the model prediction error is bounded by

max_{x∈D^(k)} | A^(k)f(x) − f(x) | ≤ δ u^(k), ∀ δ ∈ (0, 1],

with u^(k) < f^**_χ(k) − f^*_χ(k). The parameter δ is used to enhance convergence, as we will see in Theorem 2.2. The following result states that the sequence of decreasing nonempty closed sets constructed by (5) monotonically converges to the set of all global minimizers X^* for any possible choice of u^(k) and δ ∈ (0, 1].

Theorem 2.1. Suppose {D^(k)}_{k∈N} is constructed by (5) for the problem (1). If f is not a constant, then for all k ∈ N, Ω ⊇ D^(k) ⊃ D^(k+1) ⊃ X^*.

Remark 2.1. There are various possible choices of u^(k); the three commonly used are the median-type Median f_χ(k) − f^*_χ(k), the mean-type Mean f_χ(k) − f^*_χ(k) and the middle-type (f^**_χ(k) − f^*_χ(k))/2.

Proof. The proof proceeds by induction on k. First, Ω = D^(0) ⊃ X^* is trivial since f is not a constant on Ω. Assume that D^(k) ⊃ X^*; then D^(k) ⊃ S^(k) ⊃ X^*, where

S^(k) = { x ∈ D^(k) : f(x) ≤ f^*_χ(k) };

now we will show that D^(k) ⊃ D^(k+1) ⊃ S^(k). The approximate model can be decomposed into

A^(k)f(x) = f(x) + ε^(k)(x),

and according to the upper bound of the model prediction error, it follows that

|ε^(k)(x)| ≤ δ u^(k) ≤ u^(k) < f^**_χ(k) − f^*_χ(k), δ ∈ (0, 1].

Hence, for any x′ ∈ S^(k), we have f(x′) ≤ f^*_χ(k), which can be rewritten as A^(k)f(x′) − ε^(k)(x′) ≤ f^*_χ(k), and equivalently,

A^(k)f(x′) ≤ f^*_χ(k) + ε^(k)(x′) ≤ f^*_χ(k) + δ u^(k) ≤ f^*_χ(k) + u^(k),

so x′ ∈ D^(k+1) by (5); together with the inclusion D^(k+1) ⊆ D^(k), which holds by construction, we obtain D^(k) ⊃ D^(k+1) ⊃ S^(k).

The above algorithm does not guarantee that the upper bound of f on every D^(k), say max_{x∈D^(k)} |f(x)|, is strictly monotonically decreasing. In fact, this can be achieved by further enhancing the accuracy requirements of the model. For convenience of theoretical analysis, without loss of generality, we may sometimes assume f^* = 0 on Ω. Under the same settings as (5) and the assumption f^* = 0, the sequence {D^(k)}_{k∈N} can be rewritten as D^(0) = Ω and

(6) D^(k+1) = { x ∈ D^(k) : A^(k)f(x) ≤ u^(k) },

where the kth model A^(k)f satisfies

(7) max_{x∈D^(k)} | A^(k)f(x) − f(x) | ≤ δ u^(k), ∀ δ ∈ (0, 1].

Then we have the following strong convergence.
Theorem 2.2. Suppose {D^(k)}_{k∈N} is constructed by (6) for the problem (1). If f is not a constant and there exists a 0 < q < ∞ such that

(8) u^(k) < (1/(1+δ)) (1/(1+q)) f^**_χ(k),

then for all k ∈ N,

(9) max_{x∈D^(k+1)} |f(x)| < (1/(1+q)) max_{x∈D^(k)} |f(x)|.

Here, (8) is called a q-strong convergence condition and (9) is called linear convergence of the contraction algorithms.

Remark 2.2. In practice, the q-strong convergence condition (8) can be written as u^(k) < (1/(1+δ)) (1/(1+q)) ( f^**_χ(k) − f^*_χ(k) ).

Proof. The approximate model can be decomposed into A^(k)f(x) = f(x) + ε^(k)(x) with |ε^(k)(x)| ≤ δ u^(k), ∀ δ ∈ (0, 1]; then for all t ∈ D^(k+1),

A^(k)f(t) ≤ u^(k) ⇒ f(t) ≤ (1+δ) u^(k) < (1/(1+q)) f^**_χ(k) ≤ (1/(1+q)) max_{x∈D^(k)} |f(x)|,

and the desired result follows.

Let µ(S) = ∫_S dt denote the n-dimensional Lebesgue measure of S ⊂ R^n; then we say the ratio

λ^(k+1) = µ(D^(k+1)) / µ(D^(k))

is the (k+1)th contraction factor. In the following, we focus on the median-type algorithm, which often leads to a relatively stable λ^(k+1). Under the same settings as (5), the median-type sequence {D^(k)}_{k∈N} can be written as D^(0) = Ω and

(10) D^(k+1) = { x ∈ D^(k) : A^(k)f(x) ≤ Median f_χ(k) },

where the kth model A^(k)f satisfies

(11) max_{x∈D^(k)} | A^(k)f(x) − f(x) | < δ ( Median f_χ(k) − f^*_χ(k) ), ∀ δ ∈ (0, 1].

Many different global continuous optimization algorithms [26, 12, 20, 35], especially model-based methods [22, 11, 31], attempt to trade off exploration and exploitation [22, 11], because sufficient exploration is a guarantee of convergence while exploitation of limited knowledge seems to be the key to improving efficiency [8, 13, 31]. However, every exploitation on the basis of inadequate information may reduce efficiency or even cause trouble.
Each contracted set D^(k) embodies the exploration of unknown information and the approximate model A^(k)f reflects the exploitation of prior information, while the condition (11) is the key for ensuring sufficient exploration and for issuing a judgment on the conversion of exploration into exploitation. Statistical methods allow us to estimate the model prediction errors in a sense of probability. Assume that ε^(k) = A^(k)f − f follows an independent, identically distributed Gaussian distribution with mean µ^(k)_ε and standard deviation σ^(k)_ε; then µ^(k)_ε and σ^(k)_ε can be estimated by the cross-validation (CV) [9, 2] estimates µ̃^(k)_ε and σ̃^(k)_ε, and a probabilistic bound for the residual ε^(k) on D^(k) is given by

P( |ε^(k)| ≤ |µ^(k)_ε| + t σ^(k)_ε ) = 2Φ(t) − 1;

for t = 3 it follows the three-sigma rule of thumb P( |ε^(k)| ≤ |µ^(k)_ε| + 3σ^(k)_ε ) ≈ 0.997. Even without the Gaussian assumption, there is Chebyshev's inequality

P( |ε^(k)| ≤ |µ^(k)_ε| + t σ^(k)_ε ) > 1 − 1/t²,

although it should be regarded as a theoretical tool rather than a practical estimate.

There will naturally be a large CV variance for an overfitting model. This might result in reduced efficiency but will not cause trouble; however, an underfitting model A^(k)f might sometimes result in a very rough estimate of the residual bound and eventually lead to convergence failure. To reduce this possibility, we should impose an additional constraint on the global approximation behavior of A^(k)f, which also ensures the effectiveness of subsequent sampling. Let r^(k) be the percentage of χ^(k) falling into D^(k+1) and r^(k)_A be the percentage of samples uniformly distributed over D^(k) falling into D^(k+1); then it is reasonable to require that the difference between r^(k) and r^(k)_A be less than a small given threshold r̄, i.e., |r^(k) − r^(k)_A| < r̄.
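Putting the pieces together, one median-type contraction step of (10) can be sketched in code. This is a minimal sketch under assumptions of ours, not the paper's setup: a toy quadratic objective on [−1, 1]², rejection sampling for the uniform draws, and an idealized zero-error surrogate standing in for the GP model (so that the contracted sets are exact sublevel sets); any regression model satisfying (11) could be substituted for `perfect_fit`.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # toy objective (an assumed example): global minimum 0 at the origin
    return float(np.sum(np.asarray(x) ** 2))

def contract(in_domain, n_samples, fit):
    """One median-type contraction step, cf. (10): draw uniform samples from
    D^(k) (by rejection from the bounding box [-1,1]^2), fit a surrogate
    A^(k)f, and return the membership indicator of D^(k+1)."""
    pts = []
    while len(pts) < n_samples:
        x = rng.uniform(-1.0, 1.0, size=2)
        if in_domain(x):
            pts.append(x)
    y = np.array([f(x) for x in pts])
    model = fit(pts, y)          # surrogate A^(k)f built from (chi, f_chi)
    thresh = np.median(y)        # Median f_chi^(k)
    return lambda x: in_domain(x) and model(x) <= thresh

# idealized zero-error surrogate: A^(k)f = f, so delta-accuracy (11) is trivial
perfect_fit = lambda X, y: f

D = lambda x: True               # D^(0): the whole box
for _ in range(3):
    D = contract(D, 64, perfect_fit)

# Theorem 2.1: the global minimizer survives every contraction,
# while points with large f-values are excluded
```

In this sketch each D^(k+1) is represented implicitly, as a chain of model-threshold tests, which mirrors how (10) only ever requires membership queries rather than an explicit description of the set.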
Two important parts of the algorithms above are (i) the method for constructing a model to fit the given data on D^(k) and (ii) the sampling strategy for generating further uniform samples over D^(k) from some known interior points. In this work, we apply Gaussian process (GP) regression for data fitting. It has merit in its flexible expressivity because its hyperparameters are automatically tuned by likelihood maximization; see [25] for more information about this approach. GP regression generally does not lead to an overfitting model and is often used as an important part of many model-based optimization methods [11, 13, 31]. Other common choices include random forests [6] and regression trees [7], etc.

Now we turn to the sampling strategy. Suppose χ = {χ_i}_{i=1}^N ∈ D^(k) are N given points and T is a union of N sample sets {T_i}_{i=1}^N, where T_i contains M samples from a normal distribution with mean χ_i and variance σ_i (sufficiently large to cover D^(k)). Further, let T^(k) = T ∩ D^(k). We now add a new point t from T^(k) to χ, and t should preferably fill in the gaps in the distribution of χ. This subsequent point t can be determined as

(12) t = argmax_{t∈T^(k)} ( min_i ‖t − χ_i‖ ).

One can generate uniform samples in D^(k) by using the above step recursively; see Figure 1 for an example of how the method performs. This recursive algorithm is closely related to Voronoi diagrams [3], because the added point will always lie in the largest Voronoi cell of χ if the size of T^(k) is large enough.

Fig. 1. 2-dimensional illustration of the performance of the recursive algorithm given in (12). Left: the domain D = { (x1, x2) ∈ [0,1]² : (x1 − 0.5)² + (x2 − 0.5)² ≤ r² } is shown as the interior of the circle; the uniform samples in the unit square, namely the first 50 points of the 2-dimensional Halton sequence, are visible as dots in blue, and those samples falling into D, denoted by χ, are shown as asterisks in black. Middle: the candidate set T, a union of sample sets from a normal distribution centered on each χ_i, is visible as circled dots. Right: samples added recursively from T ∩ D are visible as circled dots in blue, so successive points at any stage "know" how to fill in the gaps in the previously generated distribution.

Now we have the basic framework of the contraction algorithms. As an illustrative example, Figure 2 visually shows how the median-type algorithm compresses the original domain Ω to X^* step by step; two important pieces of information that can be obtained from Figure 2 are: (i) the computational complexity of the model is greatly reduced due to the domain contractions; and (ii) none of the minimizers is missed in the case of multiple global minima, which verifies the conclusions established by Theorems 2.1 and 2.2. Figure 3 shows a real-world application to Lennard-Jones (LJ) microclusters, perhaps the most intensely studied molecular conformation problem.
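The gap-filling rule (12) is straightforward to implement. The following is a minimal sketch of ours (function names, seed points, and the disk-shaped test domain are our assumptions, chosen to resemble the setting of Figure 1):

```python
import numpy as np

def add_point(chi, candidates):
    """Pick t = argmax_{t in candidates} min_i ||t - chi_i||, cf. (12):
    the candidate farthest from its nearest already-chosen point."""
    # pairwise distances: shape (M, N) for M candidates and N existing points
    d = np.linalg.norm(candidates[:, None, :] - chi[None, :, :], axis=2)
    return candidates[d.min(axis=1).argmax()]

# usage: grow a roughly uniform design inside a domain one point at a time
rng = np.random.default_rng(1)
domain = lambda x: np.sum((x - 0.5) ** 2) <= 0.16    # a disk inside [0,1]^2
pool = rng.uniform(0.0, 1.0, size=(2000, 2))
pool = pool[[bool(domain(p)) for p in pool]]          # candidate set T ∩ D
chi = pool[:3]                                        # a few seed points
for _ in range(20):
    chi = np.vstack([chi, add_point(chi, pool)])
```

Because each added point maximizes the distance to its nearest neighbor, it lands in the largest remaining "gap", which is the Voronoi-cell behavior described above.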
3. Conditions for efficiency.
In order to categorise continuous optimization problems, we will introduce the following three classes of conditions. First, we say that f is a critical regular function on Ω if the set of all critical points of f has zero n-dimensional Lebesgue measure, where a critical point is a point x ∈ Ω at which the gradient is undefined or equal to zero. This critical regular condition guarantees that the contraction factor of the median-type algorithm is asymptotically equal to 1/2.

Fig. 2. Minimizing the function f(x1, x2) = (x2 − 5.1x1²/(4π²) + 5x1/π − 6)² + 10(1 − 1/(8π)) cos(x1) + 10, x1 ∈ [−5, 10], x2 ∈ [0, 15]. The multiple global minima are located at (−π, 12.275), (π, 2.275) and (3π, 2.475). Upper and middle rows: the first six contractions; the objective function is shown by contour lines with suitable level marks, the samples used to build each model are visible as dots in black, and the random candidate points in each contracted subdomain, that is, T^(k+1) in (12), are visible as circled dots in blue. Lower left: the convergence plot for the total of seven contractions. Lower middle: the sample size used in each model. Lower right: x trace.

The efficiency of the contractions can be guaranteed if one can quickly sketch out the overall landscape of the valley in each step and the relevant deviation is not large enough to dig one or more deep holes in the highlands. This requires that the low frequency components of f always play a dominant role on every subset D^(k). This meaningful observation prompted us to impose a certain restriction on the Fourier transform of f. Suppose f̂ is the Fourier transform of f; then we say that f is a (ρ, p)-type hierarchical low frequency dominant function (HLFDF) if there exist a p > 0 and an appropriate ρ > 0 such that for all j = 1, 2, ···, it holds that

(13) ∫_{‖ω‖∞ > 2^(j−1)√n ρ} |f̂(ω)| dω < (1 + p) ∫_{2^(j−1)√n ρ < ‖ω‖∞ ≤ 2^j √n ρ} |f̂(ω)| dω.

It is worth noting that the condition (13) does not require f̂ to decay very quickly; as a univariate instance, for every δ > 0 a transform with tail |f̂(ω)| = |ω|^(−1−δ) satisfies this condition with arbitrary ρ > 0 and p = 1/(2^δ − 1), and the corresponding approximation error bound decreases by a factor of p/(1+p) every time the number of function evaluations doubles. This feature actually implies contractility.

For j ∈ N and ρ > 0, we first define the (j, ρ)-bandlimited component of f by

(14) f^(j)_ρ(x) = ∫_{‖ω‖∞ ≤ 2^j √n ρ} f̂(ω) e^{2πi x·ω} dω,

where x·ω = Σ_{i=1}^n x_i ω_i is the inner product of x and ω; then, according to the Nyquist–Shannon sampling theorem, f^(j)_ρ can be reconstructed from its samples corresponding to a sampling density of (2^j √n ρ/π)^n.

Fig. 3. Lennard-Jones (LJ) conformations of a cluster of n identical neutral atoms (interacting pairwise via the LJ potential). A conformation is a point in the 3n-dimensional Euclidean space of coordinates of the atomic centers. For a single pair of atoms, the LJ potential in reduced units is given by u(r) = r^(−12) − 2r^(−6), where r is the Euclidean interatomic distance; the total potential energy is then U_n = Σ_{i=1}^{n−1} Σ_{j=i+1}^n u(r_ij), where r_ij is the distance between atoms i and j in reduced units. The putative global minima are U*_4 = −6 and U*_5 = −9.104 [17]. Upper row: the convergence plot, the sample size used in each model, and the global minimum conformation at n = 4. Lower row: the same plots and conformation at n = 5.

Lemma 3.1. If f ∈ L¹(R^n) is a (ρ, p)-type HLFDF and f^(j)_ρ (j ∈ N) is defined as (14), then f̂ ∈ L¹(R^n) and

‖ f − f^(j)_ρ ‖∞ < ( p/(1+p) )^j ‖f̂‖₁.

Proof.
First, let

R^(j)_ρ = Σ_{i=j}^∞ I^(i)_ρ, where I^(0)_ρ = ∫_{‖ω‖∞ ≤ √n ρ} |f̂(ω)| dω and I^(j)_ρ = ∫_{2^(j−1)√n ρ < ‖ω‖∞ ≤ 2^j √n ρ} |f̂(ω)| dω

for all j ∈ N; then we have ‖f − f^(j)_ρ‖∞ ≤ R^(j+1)_ρ, and the condition (13) can be rewritten as R^(j)_ρ < (1 + p) I^(j)_ρ, or equivalently, R^(j+1)_ρ < p I^(j)_ρ. Hence f̂ ∈ L¹(R^n) and

R^(j+1)_ρ / R^(j)_ρ = R^(j+1)_ρ / ( I^(j)_ρ + R^(j+1)_ρ ) < R^(j+1)_ρ / ( R^(j+1)_ρ/p + R^(j+1)_ρ ) = p/(1+p);

furthermore, by noting that

R^(j+1)_ρ / ‖f̂‖₁ = R^(j+1)_ρ / ( I^(0)_ρ + R^(1)_ρ ) ≤ ( R^(2)_ρ/R^(1)_ρ )( R^(3)_ρ/R^(2)_ρ ) ··· ( R^(j+1)_ρ/R^(j)_ρ ) < ( p/(1+p) )^j,

the error bound can be written as ‖f − f^(j)_ρ‖∞ ≤ R^(j+1)_ρ < ( p/(1+p) )^j ‖f̂‖₁, as desired.

In order to limit the distribution of function values to match the dominant condition (13), we need to introduce the tempered functions. Assume that f^* = 0 and the sequence {D^(k)} is defined by D^(0) = Ω and

(15) D^(k+1) = { x ∈ D^(k) : f^(k+s)_ρ(x) ≤ Median f(ξ^(k)) }, k ∈ N,

where f^(j)_ρ is the (j, ρ)-bandlimited component of f defined as (14), ξ^(k) is a uniformly distributed random variable over D^(k), and s > 0 is the unique integer such that

(16) ( p/(1+p) )^s ‖f̂‖₁ < δ Median f(ξ^(0)) ≤ ( p/(1+p) )^(s−1) ‖f̂‖₁

for a fixed δ ∈ (0, 1]. We say that f is a (ρ, p, q)-type tempered function on Ω if for all k > 0 it holds that

(17) ( p/(1+p) ) Median f(ξ^(k)) < Median f(ξ^(k+1)) < ( 1/(1+q) ) Median f(ξ^(k)),

where q < 1/p. This condition requires Median f(ξ^(k)) to decrease steadily, neither too fast nor too slow. In the next section, it will be used to bound the number of function evaluations per contraction by a constant. In the following, we show that there always exists a weaker version of (17) that holds with large probability. This weaker version will be used to keep the growth of that bound from exceeding exponential order.

Lemma 3.2. Suppose {D^(k)}_{k∈N} is constructed by (15) with parameters ρ and p for the problem (1). If Ω is a compact set and f is continuous and not a constant on Ω, then for any ǫ > 0 there must exist a q = q(ǫ) ∈ (0, 1) such that

(18) ( q/(1+q) ) Median f(ξ^(k)) < Median f(ξ^(k+1)) < ( 1/(1+q) ) Median f(ξ^(k))

holds for every k > 0 with probability at least 1 − ǫ, where ξ^(k) is a uniformly distributed random variable on D^(k).

Proof. For convenience, let

µ^(k) = Mean f(ξ^(k)), m^(k) = Median f(ξ^(k)), σ^(k) = √( Var f(ξ^(k)) ).

Proving the first inequality of Lemma 3.2 is equivalent to showing that, on each D^(k+1), the upper bound of f, which is approximately m^(k), is less than (1+q)/q times the median m^(k+1) with probability at least 1 − ǫ.

Under the assumption f^* = 0, since f is continuous and not a constant on D^(k+1) ⊆ Ω, we have 0 < m^(k+1) < ∞ and 0 < σ^(k+1) < ∞, and there exists a C > 0 such that σ^(k+1) = C m^(k+1). First, the distance between the median and the mean is bounded by one standard deviation [19], i.e.,

m^(k+1) − σ^(k+1) ≤ µ^(k+1) ≤ m^(k+1) + σ^(k+1);

and it follows from the Chebyshev–Cantelli inequality that

P( f(ξ^(k+1)) > µ^(k+1) + ( √(1−ǫ)/√ǫ ) σ^(k+1) ) ≤ ǫ.

So it holds that

prctile( f(ξ^(k+1)), (1−ǫ)·100 ) ≤ µ^(k+1) + ( √(1−ǫ)/√ǫ ) σ^(k+1) ≤ m^(k+1) + ( 1 + √(1−ǫ)/√ǫ ) σ^(k+1) = ( ( √ǫ + C√ǫ + C√(1−ǫ) )/√ǫ ) m^(k+1),

or equivalently, ( q/(1+q) ) m^(k) < m^(k+1) holds for every k > 0 with probability at least 1 − ǫ, where q = √ǫ / ( C√ǫ + C√(1−ǫ) ). Similarly, by considering the translation f(x) − m^(k), the second inequality m^(k+1) < ( 1/(1+q) ) m^(k) also holds for every k > 0 with probability at least 1 − ǫ. It is clear that these two inequalities only make sense simultaneously when q < 1.
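The univariate instance of the HLFDF condition mentioned earlier in this section, a transform with tail |f̂(ω)| = |ω|^(−1−δ), can be checked in closed form: the tail masses R^(j)_ρ and band masses I^(j)_ρ from the proof of Lemma 3.1 are elementary integrals, and their ratio R^(j+1)_ρ/I^(j)_ρ equals 1/(2^δ − 1) for every j, so (13) holds for any p exceeding this critical value. The short script below is our own verification, not the paper's; it uses our reconstruction of the dyadic thresholds 2^(j−1)ρ in the univariate case.

```python
def R(j, rho, delta):
    # tail mass: integral of w^(-1-delta) over w > 2^(j-1) * rho,
    # which equals (2^(j-1) * rho)^(-delta) / delta
    return (2.0 ** (j - 1) * rho) ** (-delta) / delta

def I(j, rho, delta):
    # dyadic band mass over 2^(j-1) * rho < w <= 2^j * rho
    return R(j, rho, delta) - R(j + 1, rho, delta)

delta, rho = 0.5, 1.0
p_crit = 1.0 / (2.0 ** delta - 1.0)      # critical dominance constant
for j in range(1, 10):
    ratio = R(j + 1, rho, delta) / I(j, rho, delta)
    # R^(j+1) = p_crit * I^(j) exactly, so condition (13), i.e.
    # R^(j+1) < p * I^(j), holds strictly for any p > p_crit
    assert abs(ratio - p_crit) < 1e-9
```

Note that only the tail of f̂ enters (13), so the behavior of f̂ below ρ, which must remain integrable, does not affect this computation.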
4. Categories of continuous problems.
According to contractility, all possible continuous problems can be divided into the following three categories: logarithmic time contractile, polynomial time contractile, or noncontractile.

Definition 4.1. The problem (1) is said to be logarithmic time contractile if there exist ρ, p, q > 0 such that f is not only a (ρ, p)-type HLFDF but also a (ρ, p, q)-type tempered and critical regular function.

For such a problem, the key to gaining logarithmic time efficiency comes from three facts: (i) the median of f on D^(k) is reduced by a factor of 1/(1+q) as k increases; (ii) there is an upper bound on the number of function evaluations on every D^(k) such that a certain approximation A^(k)f can be constructed from these samples; and (iii) the error bound max_{x∈D^(k)} |A^(k)f(x) − f(x)| is strictly less than the median of f on D^(k). According to these three points, we can prove the following theorem.

Theorem 4.2. If the problem (1) is logarithmic time contractile with parameters (ρ, p, q), then for any ǫ > 0 there exist N_Ω,f > 0 and K > 0 such that after K median-type contractions defined by (10), it holds that

max_{x∈D^(K)} | f(x) − f^* | < ǫ,

with the linear convergence rate

max_{x∈D^(k+1)} | f(x) − f^* | / max_{x∈D^(k)} | f(x) − f^* | < 1/(1+q);

meanwhile, the total number of function evaluations and the time complexity are

O( N_Ω,f · log_{1+q}(1/ǫ) ) and O( N^a_Ω,f · log_{1+q}(1/ǫ) ),

respectively, where the time cost of the model computations is assumed to be O((N^(k))^a).
Fig. 4. A logarithmic time contractile example: f(x) = (x − π)² + sin²(50(x − π)), x ∈ [0, 2π]. Panels: the objective f, the convergence plot (absolute error vs. function evaluations), and the x trace.
Remark 4.1. The contraction algorithm is very suitable for this type of problem because there exists a fixed upper bound C N_Ω,f for every N^(k), where the constant C is slightly larger than one. This is important because the computational complexity of A^(k)f is closely related to N^(k).

Remark 4.2. A typical logarithmic time contractile example is illustrated in Figure 4. There are many local minima and the unique global minimizer is located at π. In the usual sense, it is not a very "good" function because a solution is easily trapped in any one of the local minima; but in the sense of contraction, it is still a "good" function because the contractions can be performed very successfully. Therefore, the detection of the space can be quickly concentrated, so that the corresponding efficiency is controlled within logarithmic time. This shows that the concept of contractility provides a new explanation of which continuous optimization problems can be effectively solved.

Remark 4.3. The example of a quadratic function illustrated in Figure 5 shows the fast convergence of the proposed algorithm. In this case, the contraction algorithm converges even faster than gradient descent with the optimal step size. The reason why the proposed contraction algorithm converges a little more slowly than the Bayesian optimization algorithm in the initial stage is that the former does more space detection in the current domain to cover the minimizers with the next subdomain, while the latter focuses more on what the current model can predict; but as we mentioned before, every exploitation on the basis of inadequate information may reduce efficiency or even cause trouble; see Figure 6 for a typical example of convergence failure in Bayesian optimization.
Fig. 5. Minimizing the function f(x1, x2) = 20x1² + x2², x1 ∈ [−5, 5], x2 ∈ [−5, 5]. The minimum is located at (0, 0). Upper left: the convergence plot of the median-type contraction algorithm with six contractions. Upper middle: the convergence plots of Bayesian optimization with different types of acquisition function: lower confidence bound (LCB), expected improvement (EI) and probability of improvement (PI). Upper right: the convergence plots of gradient descent with a fixed initial point and step sizes η = 0.047, 0.045 and 0.0475, respectively. Lower row: the x traces of the different algorithms. The contraction algorithm converges a little more slowly than the Bayesian optimization (BO) algorithm in the initial stage because BO focuses more on what the current model can predict, which may cause convergence failure, as shown in Figure 6.
Fig. 6. Minimizing the function f(x1, x2) = 20x1² + x2² − 10 exp[ −(x1 − ·)² − (x2 + 2)² ], x1 ∈ [−5, 5], x2 ∈ [−5, 5]. There are two local minima: a shallow one near the origin and a deeper one, the global minimum, near the center of the subtracted Gaussian well. Obviously, Bayesian optimization converges to the former local minimum and causes a convergence failure in the global sense; however, contraction optimization automatically increases the strength of space detection to ensure full coverage. Left: comparison of convergence between contraction optimization (CO) and different kinds of Bayesian optimization (BO). Middle: x trace of CO. Right: x trace of BO-LCB.
First, let us consider the set sequence { D ( k ) } k ∈ N defined as (15). since f ONTRACTION METHODS ρ, p, q )-type tempered functionMedian f ( ξ ( k +1) ) <
11 + q Median f ( ξ ( k ) ) , where ξ ( k ) is a uniformly distributed random variable over D ( k ) ; and then, there existscertain δ ∈ (0 ,
1) such that(19) Median f ( ξ ( k +1) ) <
11 + δ
11 + q Median f ( ξ ( k ) );and for this fixed δ , there is the unique integer s > (cid:18) p p (cid:19) s k ˆ f k < δ Median f ( ξ (0) ) (cid:18) p p (cid:19) s − k ˆ f k . Moreover, since f is a ( ρ, p, q )-type tempered function on Ω,(21) (cid:18) p p (cid:19) k Median f ( ξ (0) ) < Median f ( ξ ( k ) );meanwhile, note that f is a ( ρ, p )-type HLFDF, it follows from Lemma 3.1 that forevery k > k f − f ( k + s ) ρ k ∞ < (cid:18) p p (cid:19) k + s k ˆ f k . Clearly, it follows from (20)–(22) that(23) k f − f ( k + s ) ρ k ∞ < δ Median f ( ξ ( k ) ) , and the ( k + s, ρ )-bandlimited component f ( k + s ) ρ can be reconstructed by its samplescorresponding to a sampling density of 2 k + s ρ n /π n .Now we consider the approximation of f ( k + s ) ρ on the compact set D ( k ) . Since f is a critical regular function on Ω, µ ( D ( k ) ) = 2 − k µ (Ω) , the restriction of f ( k + s ) ρ to D ( k ) , denoted by f ( k + s ) ρ | D ( k ) , is closely related to thethreshold value N Ω ,f = 2 s µ (Ω) ρ n /π n and the prolate spheroidal functions ψ j ( x ) = ψ j ( x ; 2 k + sn ρ, D ( k ) ) which are the eigen-functions of the time and frequency limiting operator Q = Q (2 k + sn ρ, D ( k ) ) [14, 15, 32,33, 34]. More specifically, if we denote a prolate series up to and including the N ( k ) thterm by S N ( k ) f ( x ) = N ( k ) X j =1 ψ j ( x ) Z D ( k ) f ( x ) ψ j ( x )d x, X. LUO AND X. XU then S N ( k ) f ( x ) = S N ( k ) f ( k + s ) ρ ( x ) follows by noting that ψ j has a Fourier transformsupported in { ω ∈ R n : k ω k ∞ jn ρ } . Moreover, it is important to mention that asuper-exponential decay rate of the error boundmax x ∈ D ( k ) (cid:12)(cid:12)(cid:12) f ( k + s ) ρ ( x ) − S N ( k ) f ( k + s ) ρ ( x ) (cid:12)(cid:12)(cid:12) as soon as N ( k ) reaches or goes beyond the plunge region around the threshold value N Ω ,f [5, 4]. 
(And it is worth pointing out that, strictly speaking, we should first finda series of non-crossing hypercubes covering D ( k ) , and then build a piecewise prolateseries approximation on these hypercubes.)Thus, for all k there exists a C > A ( k ) f = S N ( k ) f ( k + s ) ρ where N ( k ) = CN Ω ,f satisfies max x ∈ D ( k ) (cid:12)(cid:12)(cid:12) f ( k + s ) ρ ( x ) − A ( k ) f ( x ) (cid:12)(cid:12)(cid:12) δ Median f ( ξ ( k ) ) − k f − f ( k + s ) ρ k ∞ , and further, together with (23), we havemax x ∈ D ( k ) (cid:12)(cid:12)(cid:12) f ( x ) − A ( k ) f ( x ) (cid:12)(cid:12)(cid:12) = max x ∈ D ( k ) (cid:12)(cid:12)(cid:12) f ( x ) − f ( k + s ) ρ + f ( k + s ) ρ − A ( k ) f ( x ) (cid:12)(cid:12)(cid:12) k f − f ( k + s ) ρ k ∞ + max x ∈ D ( k ) (cid:12)(cid:12)(cid:12) f ( k + s ) ρ ( x ) − A ( k ) f ( x ) (cid:12)(cid:12)(cid:12) δ Median f ( ξ ( k ) ) . (25)Now, under the assumption of f ∗ = 0, according to (10), the contracting sequence { D ( k ) } are defined recursively by D (0) = Ω and D ( k +1) = n x ∈ D ( k ) : A ( k ) f ( x ) Median f ( ξ ( k ) ) o , k ∈ N , where A ( k ) f is the prolate series approximation (24). So it follows from the strongconvergence Theorem 2.2, (19) and (25) thatmax x ∈ D ( k ) | f ( x ) | (cid:18)
$$\max_{x\in D^{(k)}}|f(x)| \le \left(\frac{1}{1+q}\right)^{k}\max_{x\in\Omega}|f(x)|.$$
Further, for a fixed accuracy $\epsilon > 0$, there exists a $K > 0$ such that
$$\left(\frac{1}{1+q}\right)^{K} \le \epsilon < \left(\frac{1}{1+q}\right)^{K-1}, \quad \text{or equivalently,} \quad \log_{1+q}\frac{1}{\epsilon} \le K < \log_{1+q}\frac{1}{\epsilon} + 1;$$
hence, after $K$ median-type contractions, one gets the approximate solution set $D^{(K)}$ with an error bound
$$\max_{x\in D^{(K)}}|f(x)| \le \epsilon,$$
and the total number of function evaluations is less than
$$\sum_{k=0}^{K-1}\lceil CN_{\Omega,f}/2\rceil + CN_{\Omega,f} = O\!\left(N_{\Omega,f}\cdot\log_{1+q}\frac{1}{\epsilon}\right),$$
where $\lceil t\rceil$ is the smallest integer greater than or equal to $t$; further, note that the computational complexity of $A^{(k)}f$ is $O((N^{(k)})^a)$, so the time complexity of median-type algorithms is less than
$$\sum_{k=0}^{K-1}\lceil C^a N_{\Omega,f}^a/2\rceil + C^a N_{\Omega,f}^a = O\!\left(N_{\Omega,f}^a\cdot\log_{1+q}\frac{1}{\epsilon}\right),$$
taking a logarithmic time for any desired accuracy $\epsilon$.

Definition. The problem (1) is said to be polynomial time contractile if there exist $\rho, p > 0$ and $q \in (0,1]$ such that (i) $f$ is a $(\rho,p)$-type HLFDF; (ii) $f$ satisfies the conclusion of Lemma 3.2 with parameters $\rho$, $p$ and $q$; and (iii) $p > q$.

When $p \le q$, a problem that satisfies the above (i) and (ii) is logarithmic time contractile after adding the critical regular condition, so it is necessary to impose $p > q$.

Theorem.
If the problem (1) is polynomial time contractile with parameters $(\rho, p)$ and $q \in (0,1]$, then for any $\epsilon > 0$ there exist $N_{\Omega,f} > 0$ and $K > 0$ such that after $K$ contractions, defined by (10) but not limited to it, the upper bound
$$\max_{x\in D^{(K)}}|f(x) - f^*| < \epsilon$$
holds, with the linear convergence rate
$$\frac{\max_{x\in D^{(k+1)}}|f(x) - f^*|}{\max_{x\in D^{(k)}}|f(x) - f^*|} < \frac{1}{1+q};$$
meanwhile, the total number of function evaluations and the time complexity are
$$O\!\left(N_{\Omega,f}\cdot 2^{\,l\log_{1+q}\frac{1}{\epsilon}}\right) \quad \text{and} \quad O\!\left(N_{\Omega,f}^a\cdot 2^{\,al\log_{1+q}\frac{1}{\epsilon}}\right),$$
respectively, where $l$ is given in (26) and the time cost of model computations is assumed to be $O((N^{(k)})^a)$.
Fig. 7. Minimizing the Rosenbrock function $f(x_1,x_2) = 100(x_2 - x_1^2)^2 + (x_1 - 1)^2$ [28]. The global minimum is located at $(1,1)$, which lies in a long, narrow, parabolic-shaped flat valley; this is a typical polynomial time contraction problem. Notice that the number of samples used in the model is almost always rising, but the contraction can still be performed effectively. In this case, the contraction optimization converges faster than three common types of Bayesian optimization. This seems to indicate that contraction is not only related to efficiency, but also helps to improve the local accuracy of model prediction. Left: comparison of convergence (absolute error vs. function evaluations) between contraction optimization (CO) and Bayesian optimization with different acquisition functions (BO-LCB, BO-EI, BO-PI). Middle: the sample size used in each model. Right: the $x$ trace of CO.
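The contraction principle behind the Figure 7 experiment can be sketched in a few lines. The sketch below makes two deliberate simplifications that are not part of the paper's actual method: the model $A^{(k)}f$ is replaced by direct evaluations of $f$, and the closed set $D^{(k)}$ is tracked by a finite point cloud; the search box around the minimizer $(1,1)$ is also an illustrative assumption.

```python
# Minimal sketch of a median-type contraction on the Rosenbrock function.
# Simplifications (not the paper's implementation): the model A^(k)f is
# replaced by direct evaluations of f, and D^(k) is a finite point cloud
# rather than a closed set.
import random

def rosenbrock(x):
    x1, x2 = x
    return 100.0 * (x2 - x1**2)**2 + (x1 - 1.0)**2

def median(values):
    s = sorted(values)
    m = len(s) // 2
    return s[m] if len(s) % 2 else 0.5 * (s[m - 1] + s[m])

random.seed(0)
# D^(0): a uniform point cloud over a box around the minimizer (1, 1)
cloud = [(random.uniform(-2.0, 2.0), random.uniform(-1.0, 3.0))
         for _ in range(4000)]

med = median(rosenbrock(x) for x in cloud)
for k in range(8):
    # D^(k+1) = { x in D^(k) : f(x) <= Median f(xi^(k)) }
    cloud = [x for x in cloud if rosenbrock(x) <= med]
    med = median(rosenbrock(x) for x in cloud)

best = min(cloud, key=rosenbrock)
```

Each pass keeps roughly the lower half of the cloud, mirroring the measure halving $\mu(D^{(k)}) = 2^{-k}\mu(\Omega)$ of the critical regular case, and the surviving points concentrate along the parabolic valley toward $(1,1)$.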
Remark. Unlike the logarithmic time contractile problems, there is no fixed upper bound for $N^{(k)}$; but it can often be controlled by $C\,2^{kl}N_{\Omega,f}$, where $l$ is given in (26). So the contraction algorithm is still effective for this type of problems, because the contraction of $D^{(k)}$ can greatly reduce the computational cost of the model $A^{(k)}f$. Moreover, the algorithm discussed here is not limited to the median-type. A typical polynomial time contractile example is illustrated in Figure 7.

Proof. Let us now see what happens after replacing (17) with (18). Since $p > q$, there is a unique integer $l$ such that
$$(26)\qquad \left(\frac{p}{1+p}\right)^{l} \le \frac{q}{1+q} < \left(\frac{p}{1+p}\right)^{l-1}.$$
Similarly, we first consider the set sequence $\{D^{(k)}\}_{k\in\mathbb{N}}$ defined as (15). Since the conclusion of Lemma 3.2 holds with a parameter $q$ for $f$, we have
$$(27)\qquad \left(\frac{q}{1+q}\right)^{k}\mathrm{Median}\,f(\xi^{(0)}) < \mathrm{Median}\,f(\xi^{(k)}),$$
and there exists certain $\delta \in (0,1)$ such that
$$(28)\qquad \mathrm{Median}\,f(\xi^{(k+1)}) < \frac{1}{1+\delta}\cdot\frac{1}{1+q}\,\mathrm{Median}\,f(\xi^{(k)});$$
moreover, there is the unique integer $s > 0$ such that
$$(29)\qquad \left(\frac{p}{1+p}\right)^{s}\|\hat f\| < \delta\,\mathrm{Median}\,f(\xi^{(0)}) \le \left(\frac{p}{1+p}\right)^{s-1}\|\hat f\|,$$
where $\xi^{(k)}$ is a uniformly distributed random variable over $D^{(k)}$.

Note that $f$ is a $(\rho,p)$-type HLFDF; it follows from Lemma 3.1, (26), (27), and (29) that for every $k > 0$,
$$\|f - f_\rho^{(kl+s)}\|_\infty \le \left(\frac{p}{1+p}\right)^{kl+s}\|\hat f\| < \delta\,\mathrm{Median}\,f(\xi^{(k)}),$$
and the $(kl+s,\rho)$-bandlimited component $f_\rho^{(kl+s)}$ can be reconstructed by its samples corresponding to a sampling density of $2^{kl+s}\rho^n/\pi^n$.

In addition, note that $\mu(D^{(k)}) < \mu(D^{(k-1)})$; according to the discussion in the previous subsection, for every $k$ there exist $C > 1$ and $A^{(k)}f$ such that $A^{(k)}f$ can be constructed from fewer than $N^{(k)} = C\,2^{kl}N_{\Omega,f}$ samples of $f$ over $D^{(k)}$ and
$$(30)\qquad \max_{x\in D^{(k)}}|f(x) - A^{(k)}f(x)| \le \delta\,\mathrm{Median}\,f(\xi^{(k)}),$$
where $A^{(k)}f = S_{N^{(k)}}f_\rho^{(kl+s)}$ is the prolate series approximation given as (24) and $N_{\Omega,f} = 2^{s}\mu(\Omega)\rho^n/\pi^n$.

Now, under the assumption of $f^* = 0$, according to (10), the contracting sequence $\{D^{(k)}\}$ is defined recursively by $D^{(0)} = \Omega$ and
$$D^{(k+1)} = \left\{x \in D^{(k)} : A^{(k)}f(x) \le \mathrm{Median}\,f(\xi^{(k)})\right\},\quad k\in\mathbb{N},$$
so that
$$\max_{x\in D^{(k)}}|f(x)| \le \left(\frac{1}{1+q}\right)^{k}\max_{x\in\Omega}|f(x)|.$$
Further, for a fixed accuracy $\epsilon > 0$, there exists a $K > 0$ such that
$$\left(\frac{1}{1+q}\right)^{K} \le \epsilon < \left(\frac{1}{1+q}\right)^{K-1}, \quad \text{or equivalently,}\quad \log_{1+q}\frac{1}{\epsilon} \le K < \log_{1+q}\frac{1}{\epsilon} + 1;$$
hence, after $K$ median-type contractions, one can obtain the approximate solution set $D^{(K)}$ with an error bound
$$\max_{x\in D^{(K)}}|f(x)| \le \epsilon,$$
and clearly, the total number of function evaluations is less than
$$\sum_{k=0}^{K} C\,2^{kl} N_{\Omega,f} = O\!\left(N_{\Omega,f}\cdot 2^{\,l\log_{1+q}\frac{1}{\epsilon}}\right);$$
similarly, since the computational complexity of $A^{(k)}f$ is $O((N^{(k)})^a)$, the time complexity is much less than
$$\sum_{k=0}^{K} C^a 2^{akl} N_{\Omega,f}^a = O\!\left(N_{\Omega,f}^a\cdot 2^{\,al\log_{1+q}\frac{1}{\epsilon}}\right),$$
taking a polynomial time for any desired accuracy $\epsilon$. It is worth noting that the median-type contraction does not actually have any special effect in the above discussion, so the conclusions of the theorem hold for any type of contraction.

For some functions the contracting condition (11) cannot be met until the global minima is reached; at that point the contraction algorithm degrades into a regular model-based approach. Although efficiency gains cannot be achieved with domain contractions, the corresponding optimization problem may still be effectively solved if the function is sufficiently smooth. The term "noncontractile" is used to describe those functions whose contraction cannot be consistently guaranteed. This does not always mean that the algorithm performs no contraction throughout the process, although such cases do exist. See Figure 8 for different noncontractile examples.
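Before turning to noncontractile problems, the bookkeeping behind the polynomial-time bound can be checked numerically: $l$ is the unique integer satisfying (26), and the per-contraction sample budget then grows like $2^{kl}$. The values of $p$, $q$, $C$, $N_{\Omega,f}$ and $K$ below are hypothetical, chosen only for illustration.

```python
# Numeric check of the bandwidth-growth bookkeeping: l is the unique integer
# with (p/(1+p))^l <= q/(1+q) < (p/(1+p))^(l-1), and the sample budget at
# contraction k is then about C * 2^(k*l) * N_Omega_f.
# All parameter values below are hypothetical.
p, q = 3.0, 1.0              # hypothetical parameters with p > q
rp, rq = p / (1 + p), q / (1 + q)

l = 1
while rp**l > rq:
    l += 1                   # smallest l with rp**l <= rq
assert rp**l <= rq < rp**(l - 1)

C, N_omega_f, K = 2, 50, 6   # hypothetical constants and contraction count
budgets = [C * 2**(k * l) * N_omega_f for k in range(K + 1)]
total = sum(budgets)         # geometric sum: dominated by the last term
```

The geometric growth of `budgets` is exactly why the total cost is polynomial rather than logarithmic in $1/\epsilon$: the sum is dominated by its last term $C\,2^{Kl}N_{\Omega,f} = O(N_{\Omega,f}\,2^{l\log_{1+q}(1/\epsilon)})$.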
Definition. The problem (1) is said to be noncontractile if there are no suitable $\rho, p, q > 0$ such that $f$ is a $(\rho,p)$-type HLFDF or satisfies the conclusion of Lemma 3.2 with parameters $\rho$, $p$ and $q$.

Here we summarize some conclusions based on smoothness for noncontractile problems. Suppose that the samples $\chi$ are uniformly distributed over $\Omega$ with sample size $N$, the relevant data values are $f_\chi = \{f(\chi_i)\}$, and $A_\chi f$ interpolates $f$ on $\chi$. For a sample set $\chi$ over $\Omega$, we denote the associated fill distance by
$$(31)\qquad h_N := \sup_{x\in\Omega}\,\min_{1\le i\le N}\|x - \chi_i\|;$$
then for $\epsilon > 0$, there exists $C_\epsilon > 0$ such that
$$(32)\qquad P\!\left(h_N > C_\epsilon(\log N/N)^{1/n}\right) = O(N^{-\epsilon}),$$
or, $h_N = O(N^{(\gamma-1)/n})$, where $0 < \gamma \ll 1$; see Lemma 12 of [8].

Fig. 8. Three examples of noncontractile functions. Left column: a Gaussian noise function $f_1$ that is obviously not a HLFDF, for which the contracting condition will never be met; a bandlimited function $f_2 = \sin(50x)$ that does not have the essential multilevel characteristic of HLFDFs, for which the contracting condition (11) cannot be met until the global minima is reached; and a deep trap function $f_3$ (a quadratic plus a sharp exponential trap) that is a HLFDF but does not satisfy the conclusion of Lemma 3.2; in this last case the contractions will still be carried out with a large probability and eventually lead to convergence failure, due to the failure of the model prediction error bound estimations. Middle column: histograms of the function value distributions of $f_1$, $f_2$ and $f_3$, respectively. Right column: discrete cosine transforms of $f_1$, $f_2$ and $f_3$, respectively.
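The fill distance (31) is easy to estimate numerically. The sketch below approximates the supremum over $\Omega = [0,1]^2$ by a dense probe grid, which only lower-bounds the true $h_N$; the sample sizes and grid resolution are illustrative choices, not values from the paper.

```python
# Sketch of the fill distance (31): h_N = sup_x min_i ||x - chi_i||,
# with the supremum approximated on a probe grid over Omega = [0,1]^2.
import math
import random

random.seed(1)
chi = [(random.random(), random.random()) for _ in range(200)]  # sample set

def fill_distance(points, probes_per_axis=40):
    """Grid approximation of sup_x min_i ||x - chi_i|| over [0,1]^2."""
    h = 0.0
    for i in range(probes_per_axis + 1):
        for j in range(probes_per_axis + 1):
            x = (i / probes_per_axis, j / probes_per_axis)
            d = min(math.dist(x, c) for c in points)
            h = max(h, d)
    return h

h_200 = fill_distance(chi)
# (32) suggests h_N shrinks roughly like (log N / N)^(1/n) as N grows;
# enlarging the sample set can only reduce the fill distance:
extra = [(random.random(), random.random()) for _ in range(600)]
h_800 = fill_distance(chi + extra)
assert h_800 <= h_200
```

Monotonicity under refinement (`h_800 <= h_200`) holds deterministically, since adding points can only shrink each probe's nearest-neighbor distance.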
Theorem 4.6. If $f \in C^s(\Omega)$ with $s > n/2$ on a bounded domain $\Omega \subset \mathbb{R}^n$, then there exists a band-limited interpolant $A_\chi f$ such that for any $\epsilon > 0$, it holds that
$$\left|\min_{x\in\Omega}A_\chi f(x) - f^*\right| < \epsilon,$$
and the time complexity of $A_\chi f$ is
$$O(N^a) = O\!\left(\left(\tfrac{1}{\epsilon}\right)^{\frac{na}{(s-n/2)(1-\gamma)}}\right),$$
where $\chi$ and $\gamma$ are consistent with (31) and (32), the time cost of $A_\chi f$ is assumed to be $O(N^a)$, $N$ is the sample size of $\chi$, and $\|f\|_{C^s(\Omega)} = \sum_{|\alpha|\le s}\sup_{x\in\Omega}|D^\alpha f|$.

According to Lemma 3.9 of [23] and (32), for all $x \in \Omega$ it holds that
$$|f(x) - A_\chi f(x)| \le Ch_N^{s-n/2}\|f\|_{C^s(\Omega)} = O\!\left(N^{-(s-n/2)(1-\gamma)/n}\right),$$
which is actually equivalent to Theorem 4.6; this result is similar to, but slightly different from, existing convergence results for smooth functions. Similarly, if $f$ is a Hölder continuous function, the corresponding problem (1) can also be completed in polynomial time. It is said that $f$ satisfies an $\alpha$-Hölder condition if there exist $C > 0$ and $0 < \alpha \le 1$ such that
$$(33)\qquad |f(x') - f(x'')| < C\|x' - x''\|^{\alpha}.$$
Then we have
Theorem 4.7. If $f$ satisfies an $\alpha$-Hölder condition on a bounded domain $\Omega \subset \mathbb{R}^n$, then there exists a nearest-neighbor interpolant $A_\chi f$ such that for any $\epsilon > 0$, it holds that
$$\left|\min_{x\in\Omega}A_\chi f(x) - f^*\right| < \epsilon,$$
and the time complexity of $A_\chi f$ is
$$O(N^a) = O\!\left(\left(\tfrac{1}{\epsilon}\right)^{\frac{na}{\alpha(1-\gamma)}}\right),$$
where $\chi$, $\gamma$ and $\alpha$ are consistent with (31)--(33), and the time cost of $A_\chi f$ is assumed to be $O(N^a)$.

For a Hölder continuous function $f$, there exists a nearest-neighbor interpolant $A_\chi f$, which is closely related to the Voronoi diagram of $\chi$ [3], such that for any $x \in \Omega$ it holds that
$$|f(x) - A_\chi f(x)| \le Ch_N^{\alpha} = O\!\left(N^{-\alpha(1-\gamma)/n}\right),$$
which is actually equivalent to Theorem 4.7. Similarly, if $f$ does not satisfy any Hölder condition, then the algorithm may not finish in polynomial time; for example, if $|f(x') - f(x'')| < C/\log(1 + \|x' - x''\|^{-1})$, then the time complexity of a nearest-neighbor interpolant $A_\chi f$ will be
$$O\!\left(\left(e^{1/\epsilon} - 1\right)^{\frac{na}{1-\gamma}}\right)$$
if the absolute error bound of $A_\chi f$ on $\Omega$ is required to be less than any given $\epsilon > 0$.
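The nearest-neighbor interpolant in the argument above is simple to realize: $A_\chi f(x)$ returns the data value at the sample point nearest to $x$, so its error is controlled by the Hölder modulus evaluated at the fill distance. The sketch below illustrates this in one dimension; the test function, sample size, and probe grid are illustrative assumptions, and the brute-force nearest-neighbor search stands in for the Voronoi-based construction mentioned in the proof.

```python
# Sketch of a nearest-neighbor interpolant A_chi f: its error on Omega=[0,1]
# is bounded by C * h_N^alpha for an alpha-Hoelder function f.
import math
import random

random.seed(2)

def f(x):
    # alpha-Hoelder with alpha = 1/2 and constant C = 1 (kink at x = 0.3)
    return math.sqrt(abs(x - 0.3))

chi = sorted(random.random() for _ in range(500))   # sample set over [0,1]
vals = [f(c) for c in chi]

def nn_interpolant(x, nodes, values):
    # brute-force nearest-neighbor rule; ties are broken arbitrarily
    i = min(range(len(nodes)), key=lambda j: abs(x - nodes[j]))
    return values[i]

# maximum error over a dense probe grid: of order h_N^(1/2)
err = max(abs(f(x / 1000) - nn_interpolant(x / 1000, chi, vals))
          for x in range(1001))
assert err <= 1.0   # crude Hoelder bound: C * h_N^alpha with h_N <= 1
```

The brute-force search costs $O(N)$ per query; precomputing the Voronoi cells (here, midpoints between consecutive sorted nodes) would reduce each query to $O(\log N)$, which does not change the polynomial dependence on $1/\epsilon$.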
5. Conclusions.
For a fairly general class of problems, it is often impossible to find a universal method that performs well in all possible situations; this is the important connotation of the no free lunch theorems [38, 18, 37]. This is not due to some kind of curse, but to a lack of common features. Actually, a certain commonality is the premise of efficiency; and when there is no such premise, it is always wise to find a subclass that has enough in common while maintaining a proper level of generality. Although the consequent loss of generality is the price we have to pay, the corresponding discriminant conditions may help us understand the problem better.

In this work, we proposed a class of contraction algorithms based on the concept of contractility. From the efficiency of these algorithms, especially the mean-type, we classified all possible continuous problems. Experiments with various categories of examples show that these categories seem to be reasonable. To ensure the existence of efficient optimization algorithms [38], we mainly impose a class of hierarchical low frequency dominant conditions on the problems. We now know that a sufficiently smooth or contractile problem can be effectively predicted using a priori information; hence, contractility might be viewed as a complement to smoothness.

The algorithm is implemented in Matlab. The source code of the implementation is available at https://github.com/xiaopengluo/contropt.

Future research is currently being conducted in several areas. One attempt is to establish time complexities that are not exponentially related to the dimension for certain function classes; we hope the expected results will provide some valuable suggestions for the efficiency of high-dimensional continuous optimization. Second, we are also considering how to establish an adaptive contracting condition to achieve optimal efficiency for various problems; a successful achievement would be very helpful in practice. Third, we hope that some difficult problems can be translated into relevant easy problems by applying preconditioning and postconditioning steps before and after each contraction. Moreover, this requires us to further distinguish which problems are inherently difficult to solve, and which are only seemingly intractable.
REFERENCES

[1] A. Y. Aravkin, J. V. Burke, D. Drusvyatskiy, M. P. Friedlander, and S. Roy, Level-set methods for convex optimization, Mathematical Programming, 174 (2019), pp. 359–390.
[2] S. Arlot and A. Celisse, A survey of cross-validation procedures for model selection, Statist. Surv., 4 (2010), pp. 40–79.
[3] F. Aurenhammer, Voronoi diagrams - a survey of a fundamental geometric data structure, ACM Comput. Surv., 23 (1991), pp. 345–405.
[4] A. Bonami and A. Karoui, Spectral decay of time and frequency limiting operator, Appl. Comput. Harmon. Anal., 42 (2017), pp. 1–20.
[5] J. P. Boyd, Approximation of an analytic function on a finite real interval by a bandlimited function and conjectures on properties of prolate spheroidal functions, Appl. Comput. Harmon. Anal., 25 (2003), pp. 168–176.
[6] L. Breiman, Random forests, Machine Learning, 45 (2001), pp. 5–32.
[7] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees, CRC Press, Boca Raton, FL, 1984.
[8] A. D. Bull, Convergence rates of efficient global optimization algorithms, Journal of Machine Learning Research, 12 (2011), pp. 2879–2904.
[9] S. Geisser, The predictive sample reuse method with applications, J. Amer. Statist. Assoc., 70 (1975), pp. 320–328.
[10] D. R. Jones, M. Schonlau, and W. J. Welch, Efficient global optimization of expensive black-box functions, Journal of Global Optimization, 13 (1998), pp. 455–492.
[11] D. R. Jones, M. Schonlau, and W. J. Welch, Efficient global optimization of expensive black-box functions, Journal of Global Optimization, 13 (1998), pp. 455–492.
[12] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, Optimization by simulated annealing, Science, 220 (1983), pp. 671–680.
[13] J. P. C. Kleijnen, W. van Beers, and I. van Nieuwenhuyse, Expected improvement in efficient global optimization through bootstrapped kriging, Journal of Global Optimization, 54 (2012), pp. 59–73.
[14] H. J. Landau and H. O. Pollak, Prolate spheroidal wave functions, Fourier analysis and uncertainty, II, Bell Systems Tech. J., 40 (1961), pp. 65–84.
[15] H. J. Landau and H. O. Pollak, Prolate spheroidal wave functions, Fourier analysis and uncertainty, III, Bell Systems Tech. J., 41 (1962), pp. 1295–1336.
[16] E. L. Lawler and D. E. Wood, Branch-and-bound methods: A survey, Operations Research, 14 (1966), pp. 699–719.
[17] R. H. Leary, Global optima of Lennard-Jones clusters, J. Global Optim., 11 (1997), pp. 35–53.
[18] W. G. Macready and D. H. Wolpert, What makes an optimization problem hard?, Complexity, 1 (1996), pp. 40–46.
[19] C. Mallows, Another comment on O'Cinneide, The American Statistician, 45 (1991), p. 257.
[20] M. Mitchell, An Introduction to Genetic Algorithms, MIT Press, Cambridge, MA, 1996.
[21] J. Močkus, On Bayesian methods for seeking the extremum, Optimization Techniques, (1974), pp. 400–404.
[22] J. Močkus, V. Tiesis, and A. Žilinskas, The application of Bayesian methods for seeking the extremum, Toward Global Optimization, 2 (1978), pp. 117–129.
[23] F. J. Narcowich, J. D. Ward, and H. Wendland, Sobolev bounds on functions with scattered zeros, with applications to radial basis function surface fitting, Math. Comp., 74 (2005), pp. 743–763.
[24] Q. Lin, S. Nadarajah, and N. Soheili, A level-set method for convex optimization with a feasible solution path, SIAM J. Optim., 28 (2018), pp. 3290–3311.
[25] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, MIT Press, Cambridge, MA, 2006.
[26] I. Rechenberg, Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution, Frommann-Holzboog, Stuttgart, 1973.
[27] R. T. Rockafellar, Convex Analysis, Princeton Landmarks in Mathematics, Princeton University Press, Princeton, NJ, 1970.
[28] H. H. Rosenbrock, An automatic method for finding the greatest or least value of a function, The Computer Journal, 3 (1960), pp. 175–184.
[29] H. P. Schwefel, Evolution and Optimum Seeking, Wiley-Interscience, New York, 1995.
[30] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas, Taking the human out of the loop: A review of Bayesian optimization, Proceedings of the IEEE, 104 (2016), pp. 148–175.
[31] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas, Taking the human out of the loop: A review of Bayesian optimization, Proceedings of the IEEE, 104 (2016), pp. 148–175.
[32] D. Slepian, Prolate spheroidal wave functions, Fourier analysis and uncertainty, IV, Bell Systems Tech. J., 43 (1964), pp. 3009–3057.
[33] D. Slepian, On bandwidth, Proc. IEEE, 64 (1976), pp. 292–300.
[34] D. Slepian and H. O. Pollak, Prolate spheroidal wave functions, Fourier analysis and uncertainty, I, Bell Systems Tech. J., 40 (1961), pp. 43–64.
[35] R. Storn and K. Price, Differential evolution - a simple and efficient heuristic for global optimization over continuous spaces, Journal of Global Optimization, 11 (1997), pp. 341–359.
[36] A. Törn and A. Žilinskas, Global Optimization, Springer-Verlag, Berlin Heidelberg, 1989.
[37] D. H. Wolpert, The lack of a priori distinctions between learning algorithms, Neural Computation, 8 (1996), pp. 1341–1390.
[38] D. H. Wolpert and W. G. Macready, No free lunch theorems for optimization, IEEE Transactions on Evolutionary Computation, 1 (1997), pp. 67–82.