Global Optimum is not Limit Computable
K. Lakshmanan∗

Abstract
We study the limit computability of finding a global optimum of a non-convex continuous function. We give a short proof that the problem of checking whether a point is a global minimum is not limit computable, thereby showing the same for the problem of finding a global minimum. In the second part, we give an algorithm that converges to the global minimum when a lower bound on the size of the basin of attraction of the global minimum is known. We prove the convergence of this algorithm and provide some numerical experiments.
We consider the problem of finding the global minimum of a non-convex continuous function f : C → R, where C ⊂ R^n is a closed, compact subset. A global minimum is a point x* ∈ C satisfying f(x*) ≤ f(x) for all x ∈ C. By the extreme value theorem, f attains this minimum at least once. Our goal is to find one such point. This problem is well studied, with many books written on the subject; see for example [2].

In this paper, we show that this problem is not limit computable. That is, there is no algorithm that converges to the global minimum of an arbitrary continuous function without knowing any other information about the function. In fact, we show that the simpler problem of checking whether a local minimum is global is itself not limit computable. Next, given knowledge of the basin of attraction of the global minimum, we give a fast algorithm that converges to the global minimum. Before proceeding further we define some preliminary concepts about reducibility and limit computability. These definitions are as in chapters 1 and 3 of [3].

Definition 1.
A Turing machine has a two-way infinite tape divided into cells, a reading head which scans one cell of the tape at a time, and a finite set of internal states Q = {q_0, q_1, . . . , q_n}, n ≥ 1. Each cell is either blank or has the symbol 1 written on it. In a single step the machine may simultaneously (1) change from one state to another; (2) change the scanned symbol s to another symbol s′ ∈ S = {1, B}; (3) move the reading head one cell to the right (R) or left (L). This operation of the machine is controlled by a partial map δ : Q × S → Q × S × {R, L}. The map δ, viewed as a finite set of quintuples, is called a Turing program.

∗Department of Computer Science and Engineering, Indian Institute of Technology (BHU), Varanasi 221005. Email: [email protected]

Definition 2.
An oracle Turing machine (o-machine) is a Turing machine with an extra "read only" tape, called the oracle tape, upon which the characteristic function of some set A is written. Two reading heads move along the two tapes simultaneously. In a given state q, if the tapes contain symbols s and t, the machine changes the work tape symbol to another symbol, changes the state, and moves the heads either right or left independently. An oracle Turing program is a finite sequence of program lines. Fix an effective coding (Gödel numbering) of all oracle Turing programs for o-machines. Let P̃_e denote the e-th such oracle program under this effective coding. If the oracle machine halts, let u be the maximum cell on the oracle tape scanned during the computation, i.e., the maximum integer whose membership in A has been tested. We say that the elements z ≤ u are used in the computation. If no element is scanned we let u = 0.

Definition 3.
If the oracle program P̃_e with A on the oracle tape and input x halts with output y, and if u is the maximum element used on the oracle tape during the computation, then we write Φ_e^A(x) = y and φ_e^A(x) = u. We refer to Φ_e^A(x) as a Turing functional and call φ_e^A(x) the corresponding use function. The functional is determined by the program P̃_e and may be partial or total.

Definition 4.
A partial function θ is Turing computable in A (A-Turing computable), written θ ≤_T A, if there is an e such that Φ_e^A(x)↓ = y if and only if θ(x) = y. A set B is Turing reducible to A (B ≤_T A) if the characteristic function χ_B ≤_T A. We denote the set {0, 1, 2, 3, . . .} by ω.

Definition 5.
A set A is limit computable if there is a computable sequence {A_s}_{s∈ω} such that for all x, A(x) = lim_s A_s(x). Here A(x) is the characteristic function of A.

Definition 6.
A set A is Σ_2 if there is a computable relation R with x ∈ A ⇔ (∃y)(∀z) R(x, y, z). A set is Π_2 if its complement Ā is Σ_2. And a set A is Δ_2 if A is both Σ_2 and Π_2.

Lemma 1 (Limit Lemma (Shoenfield 1959)). A set A is limit computable if and only if A ∈ Δ_2.

Proof. We refer to the proof of Lemma 3.6.2 of [3].

By the Limit Lemma we also call {A_s}_{s∈ω} the Δ_2-approximation of A.

Definition 7.
Let A↾x denote the set {A(y) : y ≤ x}. Given {A_s}_{s∈ω}, any function m(x) is a modulus (of convergence) if ∀x (∀s ≥ m(x)) [A↾x = A_s↾x].

Proposition 1. If A is limit computable via {A_s}_{s∈ω} with any modulus m(x), then A ≤_T m.

Proof. Take A(x) = A_{m(x)}(x).

Let the set of global minima of the function f be denoted by G. We have the following lemma for the set G.

Lemma 2.
The set G is not limit computable.

Proof. Consider the modified problem where we define h_z(x) := min{0, f(x) − f(z)}. This function is identically zero if and only if z = x*. Hence our problem of finding the global minimum is the same as checking whether h_z(·) is identically zero. Since the objective function f is continuous, h_z(·) is also continuous. An example is shown in Figure 1: the plot on the left shows the original objective function, and the plot on the right shows the modified function which has to be checked for being identically zero. Since G is the set of all global minima, it is also the set of all points z where the function h_z(·) is identically zero.

For G to be limit computable via {G_s}_{s∈ω} with some modulus function m we need
∀x′ (∀s ≥ m(x′)) [G↾x′ = G_s↾x′],
that is, ∀x′ (∀s ≥ m(x′)) {G(z) : z ≤ x′} = {G_s(z) : z ≤ x′}.
Note that we need {G_s}_{s∈ω} to be computable. But computing whether G_s(z) = 0 or not for all z ≤ x′ involves checking whether h_z(x) ≡ 0. That is, for each point z we have to check whether a function is identically zero. But this cannot be checked for any particular z unless it is checked for all x. Since the function h_z(x) is real valued and continuous, if it is non-zero then there exists an interval I, whose length can be arbitrarily small, on which the function is non-zero. Since we do not know the length of this interval, we have to check a set of points that is dense in the domain, and such a set is not finite.
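The obstruction in this argument can be illustrated numerically. In the sketch below, the particular form min{0, f(x) − f(z)} of the auxiliary function, the test function f, and the location of its narrow dip are all illustrative assumptions: a fixed finite grid sees h_z as identically zero even though z is not a global minimizer.

```python
import numpy as np

# Numerical illustration of the obstruction in the proof above: a fixed
# finite sample of h_z can miss the arbitrarily small interval on which it
# is non-zero. The form min{0, f(x) - f(z)} of the auxiliary function, the
# test function f, and the dip location DIP are illustrative assumptions.
DIP = 0.123456

def f(x):
    # x**2 with a very narrow dip of depth 2 at DIP (the true global minimum)
    return x**2 - 2.0 * np.exp(-1e8 * (x - DIP)**2)

def h(z, x):
    # identically zero in x if and only if z is a global minimizer of f
    return np.minimum(0.0, f(x) - f(z))

z = 0.0                                 # candidate point to test
grid = np.linspace(-1.0, 1.0, 101)      # a fixed finite sample of the domain
print(np.max(np.abs(h(z, grid))))       # 0.0: h looks identically zero
print(h(z, np.array([DIP]))[0])         # strictly negative: z is not global
```

No matter how fine the fixed grid, a sufficiently narrow dip escapes it, which is exactly why the check must range over a dense, infinite set of points.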
Figure 1: The figure on the left shows a sample objective function f(x). The figure on the right shows the corresponding function h(x) := min{0, f(x) − f(60)} as in the proof of Lemma 2. The function h(x) is not identically zero, as x = 60 is not the global minimum of f(x).

As no Turing machine can halt on checking whether a function is zero at infinitely many points, we see that there is no computable {G_s}_{s∈ω} with some modulus function m such that G(x) = lim_s G_s(x). That is, we have shown that the set G is not limit computable.

Corollary 1.
The problem of checking whether a local minimum z is global is also not limit computable, as this involves checking whether h_z(·) is identically zero.

Corollary 2.
By the Limit Lemma we can in fact show that the set G of global minima is in Π_2 but not in Σ_2, as there is an oracle Turing machine which will halt and produce the right output if the function h_z(·) is not identically zero, but not when the function is identically zero. Now we can state the main theorem.
Theorem 1.
Finding the global minimum of a continuous function is not limit computable.

Proof. Suppose finding the global minimum were limit computable; then we would have an oracle machine computing the set of global minima. But this contradicts Lemma 2, which states that the set of global minima is not limit computable.
Remark 1.
By definition, a point x ∈ A is a local minimum if (∃n ∈ N⁺)(∀y ∈ B(x, 1/n)) f(x) ≤ f(y). Here B(x, 1/n) is the neighbourhood of x with radius 1/n. Take this to be the computable relation R in the Limit Lemma, i.e., we have x ∈ A ⇔ (∃n)(∀y) R(x, n, y). Thereby we get that computing a local minimum is limit computable.

When additional information is known about the global minimum, such as its basin of attraction, the global minimum may be limit computable. In fact, in the next section we give an algorithm converging to the global minimum when the basin of attraction is known.
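The relation of Remark 1 can only be probed, not decided, by sampling. A minimal sketch, in which the test function, the radius, and the sample count are illustrative assumptions:

```python
import numpy as np

# Sampling sketch of the relation in Remark 1: x is a local minimum if for
# some n, f(x) <= f(y) for all y in B(x, 1/n). A finite sample can only
# support or refute this heuristically, which is consistent with Remark 1
# giving limit computability rather than outright computability. The test
# function and parameters below are illustrative assumptions.
def f(x):
    return (x**2 - 1.0)**2              # double well: local minima at -1, 1

def looks_like_local_min(x, n, samples=1000, seed=0):
    rng = np.random.default_rng(seed)
    ys = x + rng.uniform(-1.0 / n, 1.0 / n, samples)   # points in B(x, 1/n)
    return bool(np.all(f(x) <= f(ys) + 1e-12))

print(looks_like_local_min(1.0, n=4))   # True: radius 1/4 works at x = 1
print(looks_like_local_min(0.5, n=4))   # False: x = 0.5 is not a minimum
```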
An Algorithm when Basin of Attraction is Known
In this section, we additionally assume the function f to be differentiable. Let us denote the gradient by ∇f(x). The algorithm takes as input a lower bound m on the size of the basin of attraction of the global minimizer. By basin of attraction we mean the following: if the initial point lies in the hypercube of side length m in all coordinates around the global minimum, i.e., in the basin of attraction, then the gradient descent algorithm will converge to the minimum. The algorithm finds the point z_k where the function takes its minimum amongst all grid points at a distance of m from each other, and performs a gradient descent step from the point z_k. In this algorithm, for simplicity, we do not consider line searches and use a constant step-size t > 0. Figure 2 shows the gradient descent step taken at the point which has the minimum function value amongst all the points in the grid.

We note the similarity of our algorithm with the one considered in [1], where the basin of attraction of the global minimizer is first found by searching and then a gradient descent is performed. In our algorithm, these two steps are interleaved. The major issue with their algorithm is the assumption that the value of the global minimum is known, which they take to be zero. This need not hold in real-world problems, and the assumption is not needed in our approach. Moreover, we have formally shown the convergence of our algorithm.
Figure 2: The function f to be minimized. The gradient descent step is shown for the interval where the function value is minimum. This interval is a subset of the basin of attraction of the global minimum.

We now show the convergence of the algorithm given above. We make the following assumption.
Assumption 1. The function f is twice differentiable. The gradient of f is Lipschitz continuous with constant 0 < L < ∞, i.e., ‖∇f(x) − ∇f(z)‖ ≤ L‖x − z‖. That is, we have ∇²f(x) ⪯ LI.

Algorithm 1: Global Optimization Algorithm
Input: a function f and a lower bound m on the side length of a hypercube contained in the basin of attraction of the global minimizer of f.
Let C = [a, b]^d; for simplicity we let the interval [a, b] be the same in all dimensions.
Set y^0 = [a_1^0, . . . , a_d^0] where a_i^0 = a for all i = 1, . . . , d, and set y^j = y^{j-1} + m for j = 1, . . . , (b - a)/m.
Let x_0 = z_0 = argmin_{j=0,...,(b-a)/m} f(y^j).
while k = 1, . . . , K do
    Set y^0 = [a_1^k, . . . , a_d^k] where a_i^k = a_i^{k-1} - t ∇_i f(z_{k-1}) for all i = 1, . . . , d, and set y^j = y^{j-1} + m for j = 1, . . . , (b - a)/m.
    As before, let z_k = argmin_{j=0,...,(b-a)/m} f(y^j).
    Update x_k = z_k - t ∇f(z_k).
    k = k + 1
end while
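A sketch of the algorithm in Python. The grid bookkeeping, variable names, and the test problem reflect our reading of the listing above, and the step-size and iteration budget are chosen only for illustration; this is not a verbatim implementation from the paper.

```python
import numpy as np

# Sketch of Algorithm 1 (our reading of the listing): a grid of spacing m,
# the assumed lower bound on the basin of attraction, is translated by a
# gradient step from the current best grid point z_k, and the iterate x_k
# takes a gradient step from z_k.
def global_descent(f, grad, a, b, d, m, t, n_iters):
    n_pts = int(np.floor((b - a) / m)) + 1
    axis = a + m * np.arange(n_pts)
    # all grid points of C = [a, b]^d at spacing m (n_pts**d points)
    base = np.array(np.meshgrid(*([axis] * d))).reshape(d, -1).T
    shift = np.zeros(d)
    z = min(base, key=f)                 # z_0: best point on the initial grid
    for _ in range(n_iters):
        shift = shift - t * grad(z)      # translate the grid by a gradient step
        z = min(base + shift, key=f)     # z_k: best point on the shifted grid
    return z - t * grad(z)               # x_k = z_k - t * grad f(z_k)

# Illustrative test: a 1-d double well whose global minimum is near x = -1.036.
f = lambda x: (x[0]**2 - 1.0)**2 + 0.3 * x[0]
grad = lambda x: np.array([4.0 * x[0] * (x[0]**2 - 1.0) + 0.3])
x_star = global_descent(f, grad, a=-2.0, b=2.0, d=1, m=0.5, t=0.05, n_iters=200)
print(x_star)  # close to the global minimizer near -1.036
```

The grid evaluation guards against the gradient steps settling into the shallower well at x ≈ +1, which is the role the basin-size bound m plays in the algorithm.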
We first state the following lemma used in the proof of the convergencetheorem.
Lemma 3.
Assume that the function f satisfies Assumption 1 and that the step-size t ≤ 1/L. We also assume that the global minimum x* is unique. Then there exists a constant R > 0 such that for every ball B(x*, r) with radius r < R there is an M_r > 0 such that the iterates {x_k} of the algorithm remain in the ball B(x*, r) asymptotically, i.e., x_k ∈ B(x*, r) for k ≥ M_r.

Proof. From Assumption 1 we have that ∇²f(x) − LI is a negative semi-definite matrix. Using a quadratic expansion of f around x*, we obtain the following inequality for x ∈ B(x*, r):
f(x) ≤ f(x*) + ∇f(x*)^T (x − x*) + (1/2) (x − x*)^T ∇²f(x*) (x − x*),
f(x) ≤ f(x*) + (L/2) ‖x − x*‖².   (1)
Since x* is the global minimum we have f(x*) ≤ f(x) for all x ∈ C. Let x̃ be any local minimum which is not global, so f(x̃) = f(x*) + δ_x̃. Now let δ = min_x̃ δ_x̃. Since each such x̃ is a local but not a global minimum, we have δ > 0. Choose R > 0 so that for x ∈ B(x*, R) we have (L/2)‖x − x*‖² ≤ δ/2, i.e., R ≤ √(δ/L). Then from equation (1), f(x) ≤ f(x*) + δ/2 for x ∈ B(x*, R). That is, we have shown there exists an R > 0 such that for x ∈ B(x*, R),
f(x) < f(x̃).   (2)
Now we observe the following:
1. from equation (2), no other local minimum x̃ can have a value f(x̃) lower than the function values in this ball B(x*, R);
2. for a sufficiently small step-size t ≤ 1/L, the function value decreases with each gradient step (see equation (3) in the proof of Theorem 3).
That is, once x ∈ B(x*, R), the iterates of the algorithm cannot move to another hypercube around some local minimum x̃. Hence for all r < R there exists an M_r > 0 such that for k ≥ M_r the iterates remain in the ball B(x*, r) around x*.

Theorem 2.
Let x* be the unique global minimizer of the function f. Then for the iterates {x_k} generated by the algorithm we have lim_{k→∞} f(x_k) = f(x*).

Proof.
From Lemma 3 we have that there exists R > 0 such that for all r < R there is an M_r > 0 with x_k ∈ B(x*, r) for k ≥ M_r. From the algorithm we also know that the function value decreases with each iteration. Thus the sequence {f(x_k)} converges, as it is monotonic and bounded. Take a sufficiently small r < R such that B(x*, r) lies in the basin of attraction. Hence we also have lim_{k→∞} f(x_k) = f(x*), since within the basin of attraction around the global minimum gradient descent converges to the minimum.

Theorem 3.
Let x* be the unique global minimizer of the function f. For simplicity denote M = M_r. Let the step-size t ≤ 1/L, where L is the Lipschitz constant of the gradient in Assumption 1. If we also assume that the function is convex in the ball B(x*, r), then at iteration k > M, f(x_k) satisfies
f(x_k) − f(x*) ≤ ‖x_M − x*‖² / (2t(k − M)).
That is, the gradient descent algorithm converges with rate O(1/k).

Proof. Consider the gradient descent step x_{k+1} = z_k − t∇f(z_k) in the algorithm. Since the iterates remain in a ball around the global minimum asymptotically, we have from Lemma 3 that z_k = x_k for k ≥ M_r. Now let y = x − t∇f(x). We then get:
f(y) ≤ f(x) + ∇f(x)^T (y − x) + (1/2)(y − x)^T ∇²f(x)(y − x)
     ≤ f(x) + ∇f(x)^T (y − x) + (L/2)‖y − x‖²
     = f(x) + ∇f(x)^T (x − t∇f(x) − x) + (L/2)‖y − x‖²
     = f(x) − t‖∇f(x)‖² + (Lt²/2)‖∇f(x)‖²
     = f(x) − (1 − Lt/2) t ‖∇f(x)‖².
Using the fact that t ≤ 1/L, we have −(1 − Lt/2) ≤ −1/2, hence
f(y) ≤ f(x) − (t/2)‖∇f(x)‖².   (3)
Next we bound f(y), the objective value at the next iterate, in terms of f(x*). Note that by assumption f is convex in the ball B(x*, r). Thus we have, for x ∈ B(x*, r),
f(x) ≤ f(x*) + ∇f(x)^T (x − x*).
Plugging this into equation (3) we get
f(y) ≤ f(x*) + ∇f(x)^T (x − x*) − (t/2)‖∇f(x)‖²,
f(y) − f(x*) ≤ (1/(2t)) (2t ∇f(x)^T (x − x*) − t²‖∇f(x)‖²)
            = (1/(2t)) (2t ∇f(x)^T (x − x*) − t²‖∇f(x)‖² − ‖x − x*‖² + ‖x − x*‖²)
            = (1/(2t)) (‖x − x*‖² − ‖x − t∇f(x) − x*‖²).
By definition y = x − t∇f(x); plugging this into the previous expression we have
f(y) − f(x*) ≤ (1/(2t)) (‖x − x*‖² − ‖y − x*‖²).   (4)
This holds for every gradient descent iteration i > M. Summing over all such iterations we get:
Σ_{i=M+1}^{k} (f(x_i) − f(x*)) ≤ Σ_{i=M+1}^{k} (1/(2t)) (‖x_{i−1} − x*‖² − ‖x_i − x*‖²)
 = (1/(2t)) (‖x_M − x*‖² − ‖x_k − x*‖²)
 ≤ (1/(2t)) ‖x_M − x*‖².
Finally, using the fact that f decreases in every iteration, we conclude that
f(x_k) − f(x*) ≤ (1/(k − M)) Σ_{i=M+1}^{k} (f(x_i) − f(x*)) ≤ ‖x_M − x*‖² / (2t(k − M)).

Remark 2.
If the global minimum x* is not unique, then the algorithm can oscillate between the different minima. If we assume that the function is convex in a small neighbourhood around each of these global minima, then we can show that the algorithm converges to one of the minimum points x*. In addition, as in the previous theorem, we can show the same O(1/k) rate of convergence.

Remark 3.
We have not considered momentum-based acceleration methods, which speed up the rate of convergence, in this paper.
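Returning to Theorem 3, its O(1/k) bound can be sanity-checked numerically for plain gradient descent on a convex quadratic. The quadratic, step-size, and starting point below are illustrative assumptions, with M = 0 and x* = 0 so that z_k = x_k throughout:

```python
import numpy as np

# Sanity check of the O(1/k) bound of Theorem 3 for plain gradient descent
# on a convex quadratic (illustrative choice; here M = 0 and x* = 0, so the
# grid stage of Algorithm 1 is irrelevant and z_k = x_k throughout).
A = np.diag([1.0, 4.0])                  # Hessian: L = 4, so take t = 1/L
f = lambda x: 0.5 * x @ A @ x            # f(x*) = 0 at x* = 0
grad = lambda x: A @ x
t, x0 = 0.25, np.array([3.0, -2.0])

x = x0.copy()
for k in range(1, 51):
    x = x - t * grad(x)
    bound = x0 @ x0 / (2 * t * k)        # ||x_0 - x*||^2 / (2 t k)
    assert f(x) <= bound                 # the bound of Theorem 3 holds
print(f"f(x_50) = {f(x):.3e}, bound = {bound:.3e}")
```

On this strongly convex example the actual decrease is geometric, so the O(1/k) bound holds with a wide margin, as expected from a worst-case rate.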
We present some numerical results. We tested the algorithm on the standard benchmark functions shown in Tables 1 and 2. We show plots of the function value as the iterations proceed for each of these functions. For the Rastrigin, sphere and Rosenbrock functions the dimension was set to 20. We see from these plots that the algorithm converges to the optimum for each of these functions, as expected. Table 3 gives the step-sizes and the lower bound on the basin of attraction used for each of these functions.
Figure 3: Convergence to Optimum for Ackley and Rastrigin Function

Table 1: Various Benchmark Functions for Global Optimization

Name | Formula
Rastrigin Function | f(x) = An + Σ_{i=1}^n (x_i² − A cos(2πx_i)), where A = 10
Ackley Function | f(x, y) = −20 exp(−0.2 √(0.5(x² + y²))) − exp(0.5(cos(2πx) + cos(2πy))) + e + 20
Sphere Function | f(x) = Σ_{i=1}^n x_i²
Rosenbrock Function | f(x) = Σ_{i=1}^{n−1} (100(x_{i+1} − x_i²)² + (1 − x_i)²)
Beale Function | f(x, y) = (1.5 − x + xy)² + (2.25 − x + xy²)² + (2.625 − x + xy³)²
Booth Function | f(x, y) = (x + 2y − 7)² + (2x + y − 5)²

Table 2: Global Minimum and Search Domain for these Benchmark Functions

Name | Global Minimum | Search Domain
Rastrigin Function | f(0, . . . , 0) = 0 | −5.12 ≤ x_i ≤ 5.12
Ackley Function | f(0, 0) = 0 | −5 ≤ x, y ≤ 5
Sphere Function | f(0, . . . , 0) = 0 | −∞ ≤ x_i ≤ ∞
Rosenbrock Function | f(1, . . . , 1) = 0 | −∞ ≤ x_i ≤ ∞
Beale Function | f(3, 0.5) = 0 | −4.5 ≤ x, y ≤ 4.5
Booth Function | f(1, 3) = 0 | −10 ≤ x, y ≤ 10
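For reference, the benchmark functions of Table 1 translate directly into code. The vectorized forms below, shown for Rastrigin and Booth, are our own, with the global minima of Table 2 used as quick checks.

```python
import numpy as np

# Two of the benchmark functions of Table 1 written directly in code; the
# vectorized forms are our own. The global minima of Table 2 serve as checks.
def rastrigin(x, A=10.0):
    x = np.asarray(x, dtype=float)
    return A * x.size + np.sum(x**2 - A * np.cos(2 * np.pi * x))

def booth(x, y):
    return (x + 2 * y - 7)**2 + (2 * x + y - 5)**2

print(rastrigin(np.zeros(20)))  # 0.0 at the global minimum (0, ..., 0)
print(booth(1.0, 3.0))          # 0.0 at the global minimum (1, 3)
```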
Figure 4: Convergence to Optimum for Sphere and Rosenbrock Function
Figure 5: Convergence to Optimum for Beale and Booth Function

Table 3: Parameters set in the algorithm for these functions

Function | Step-size | Lower bound on the basin
Rastrigin Function | 0.0001 | 0.5
Ackley Function | 0.0001 | 0.1
Sphere Function | 0.001 | 0.3
Rosenbrock Function | 0.001 | 0.5
Beale Function | 0.0005 | 0.3
Booth Function | 0.005 | 0.3
We have given a simple proof that finding the global minimum of a continuous function is not limit computable. To the best of our knowledge, this is the first such result. We have also given an algorithm that converges to the global minimum when a lower bound on the size of the basin of attraction of the global minimum is known. Finally, some numerical results were presented.
References

[1] C. D'Helon, V. Protopopescu, J.C. Wells, and J. Barhen. GMG — a guaranteed global optimization algorithm: Application to remote sensing. Mathematical and Computer Modelling, 45(3-4):459–472, 2007.

[2] R. Horst and H. Tuy. Global Optimization: Deterministic Approaches. Springer-Verlag, 1996.

[3] R.I. Soare.