COMPLEXITY OF THE REGULARIZED NEWTON METHOD
ROMAN A. POLYAK
Abstract.
Newton's method for finding an unconstrained minimizer of a strictly convex function, generally speaking, does not converge from any starting point. We introduce and study the damped regularized Newton's method (DRNM). It converges globally for any strictly convex function which has a minimizer in Rⁿ. Locally DRNM converges with quadratic rate. We characterize the neighborhood of the minimizer where the quadratic rate occurs. Based on it we estimate the number of DRNM steps required for finding an ε-approximation for the minimizer.
Key words and phrases. Regularized Newton's method, Newton's decrement, Newton's area, global convergence, quadratic convergence rate.

1. Introduction

Newton's method, which was introduced almost 350 years ago, is still one of the basic tools in numerical analysis, variational and control problems, and optimization, both constrained and unconstrained, just to mention a few areas. It has been used not only as a numerical tool, but also as a powerful instrument for proving existence and uniqueness results. In particular, the Newton-Kantorovich method plays a critical role in the classical KAM theory by Kolmogorov, Arnold and Moser (see [1]). Another example is the proof of Lyusternik's theorem on tangent spaces (see [3], [9]). Newton's method was the main instrument in the interior point methods (IPMs), which preoccupied the field of optimization for a long time.

Yu. Nesterov and A. Nemirovski showed that a special damped Newton's method is particularly efficient for minimizing self-concordant (SC) functions (see [6], [7]). They showed that from any starting point a special damped Newton's step reduces the SC function value by a constant, which depends only on the Newton's decrement. The decrement converges to zero; by the time it gets small enough the damped Newton's method practically turns into Newton's method and generates a sequence which converges in value with quadratic rate. They characterized the size of the minimizer's neighborhood where the quadratic rate occurs. It allows establishing the complexity of the special damped Newton's method for SC functions, that is, finding the upper bound for the number of damped Newton's steps required for finding an ε-approximation for the minimizer.
For strictly convex functions which are not self-concordant such results, to the best of our knowledge, are unknown. The purpose of the paper is to introduce and establish complexity bounds of the damped Newton's method (DNM) and of the DRNM for minimization of a twice continuously differentiable and strictly convex f: Rⁿ → R.

First, we characterize the Newton's areas for DNM and DRNM; in other words, we estimate the minimizer's neighborhoods where DNM and DRNM converge with quadratic rate. Then we estimate the number of steps needed for DNM or DRNM to enter the corresponding Newton's areas.

The key ingredients of our analysis are the Newton's and the regularized Newton's decrements. On the one hand, the decrements provide an upper bound for the distance from the current approximation to the minimizer; therefore they are used in the stopping criteria. On the other hand, they provide a lower bound for the function reduction at each step at any point which does not belong to the Newton's or to the regularized Newton's area. These bounds are used to estimate the number of DNM or DRNM steps needed to get into the corresponding Newton's areas.

2. Newton's Method
We start with the classical Newton's method for finding a root of a nonlinear equation f(t) = 0, where f: R → R has a smooth derivative f'. Let us consider t₀ ∈ R and the linear approximation

f̃(t) = f(t₀) + f'(t₀)(t − t₀) = f(t₀) + f'(t₀)Δt

of f at t₀, assuming that f'(t₀) ≠ 0. By replacing f with its linear approximation we obtain the equation

f(t₀) + f'(t₀)Δt = 0

for the Newton step Δt. The next approximation is given by the formula

(2.1) t₁ = t₀ + Δt = t₀ − (f'(t₀))⁻¹ f(t₀).

By reiterating (2.1) we obtain Newton's method

(2.2) t_{s+1} = t_s − (f'(t_s))⁻¹ f(t_s)

for finding a root of the nonlinear equation f(t) = 0.
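A minimal sketch of iteration (2.2) in Python; the test function, its derivative and the tolerance below are illustrative assumptions, not part of the paper.

    # Newton's method (2.2) for a root of f(t) = 0 in one dimension.
    def newton_1d(f, df, t0, tol=1e-12, max_iter=50):
        t = t0
        for _ in range(max_iter):
            ft = f(t)
            if abs(ft) < tol:          # |f(t)| is small enough: accept t as a root
                return t
            t = t - ft / df(t)         # the step t_{s+1} = t_s - f(t_s)/f'(t_s)
        return t

    # Assumed example: the root of f(t) = t^2 - 2, starting from t0 = 1.
    print(newton_1d(lambda t: t * t - 2.0, lambda t: 2.0 * t, 1.0))  # ~1.41421356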
Let t* be the root, that is, f(t*) = 0. Also we assume f'(t*) ≠ 0 and f ∈ C². We consider the expansion of f at t_s with the Lagrange remainder:

(2.3) 0 = f(t*) = f(t_s) + f'(t_s)(t* − t_s) + (1/2)f''(t̂_s)(t* − t_s)²,

where t̂_s ∈ [t_s, t*]. For t_s close to t* we have f'(t_s) ≠ 0, therefore from (2.3) follows

t* − t_s + f(t_s)/f'(t_s) = −(f''(t̂_s)/(2f'(t_s)))(t* − t_s)².

Using (2.2) we get

(2.4) |t* − t_{s+1}| = (1/2)(|f''(t̂_s)|/|f'(t_s)|)|t* − t_s|².

If Δ_s = |t* − t_s| is small, then there exist a > 0 and b > 0 such that |f''(t̂_s)| ≤ a and |f'(t_s)| > b. Therefore, from (2.4) follows

(2.5) Δ_{s+1} ≤ cΔ_s², where c = 0.5ab⁻¹.

This is the key characteristic of Newton's method, which makes the method so important even 350 years after it was originally introduced.

Newton's method has a natural extension for a nonlinear system of equations

(2.6) g(x) = 0,

where g: Rⁿ → Rⁿ is a vector-function with a smooth Jacobian J(g) = ∇g: Rⁿ → R^{n×n}. The linear approximation of g at x₀ is given by

(2.7) g̃(x) = g(x₀) + ∇g(x₀)(x − x₀).

We replace g in (2.6) by its linear approximation (2.7). The Newton step Δx one finds by solving the following linear system:

g(x₀) + ∇g(x₀)Δx = 0.

Assuming det ∇g(x₀) ≠ 0 we obtain Δx = −(∇g(x₀))⁻¹g(x₀). The new approximation is given by the following formula:

(2.8) x₁ = x₀ − (∇g(x₀))⁻¹g(x₀).

By reiterating (2.8) we obtain Newton's method

(2.9) x_{s+1} = x_s − (∇g(x_s))⁻¹g(x_s)

for solving the nonlinear system of equations (2.6).

Newton's method for minimization of f: Rⁿ → R follows directly from (2.9) if instead of the unconstrained minimization problem

(2.10) min f(x) s.t. x ∈ Rⁿ

we consider the nonlinear system

(2.11) ∇f(x) = 0,

which is the necessary and sufficient condition for x* to be the minimizer in (2.10) in case of convex f. The vector

(2.12) n(x) = −(∇²f(x))⁻¹∇f(x)

defines the Newton direction at x ∈ Rⁿ. Application of Newton's method (2.9) to the system (2.11) leads to Newton's method

(2.13) x_{s+1} = x_s − (∇²f(x_s))⁻¹∇f(x_s) = x_s + n(x_s)

for solving (2.10).

Method (2.13) has another interpretation. Let f: Rⁿ → R be twice differentiable with a positive definite Hessian ∇²f. The quadratic approximation of f at x₀ is given by the formula

f̃(x) = f(x₀) + (∇f(x₀), x − x₀) + (1/2)(∇²f(x₀)(x − x₀), x − x₀).

Instead of solving (2.10) let us find x̄ = argmin{f̃(x): x ∈ Rⁿ}, which is equivalent to solving the following linear system for Δx = x − x₀:

∇²f(x₀)Δx = −∇f(x₀).

We obtain Δx = n(x₀), so for the next approximation we have

(2.14) x̄ = x₀ − (∇²f(x₀))⁻¹∇f(x₀) = x₀ + n(x₀).

By reiterating (2.14) we obtain Newton's method (2.13) for solving (2.10). The local quadratic convergence of both (2.9) and (2.13) is well known (see [2], [4], [7], [8] and references therein). Away from the neighborhood of x*, however, both Newton's methods (2.9) and (2.13) can either oscillate or diverge.
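For reference, here is a minimal sketch of iteration (2.13) in Python on an assumed well-behaved strongly convex test function f(x₁, x₂) = x₁² + x₂² + exp(x₁ + x₂), before turning to examples where Newton's method fails.

    import numpy as np

    # Newton's method (2.13) for the assumed test function
    # f(x1, x2) = x1^2 + x2^2 + exp(x1 + x2).
    def grad(x):
        e = np.exp(x[0] + x[1])
        return np.array([2.0 * x[0] + e, 2.0 * x[1] + e])

    def hess(x):
        e = np.exp(x[0] + x[1])
        return np.array([[2.0 + e, e], [e, 2.0 + e]])

    x = np.array([1.0, -0.5])                  # an arbitrary starting point
    for s in range(20):
        g = grad(x)
        if np.linalg.norm(g) < 1e-12:
            break
        x = x + np.linalg.solve(hess(x), -g)   # solve ∇²f(x)Δx = -∇f(x) for the step
    print(x)                                   # both coordinates tend to the same negative value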
Example 2.1. Consider

g(t) = −(t − 1)² + 1 for t ≥ 0 and g(t) = (t + 1)² − 1 for t < 0.

The function g together with g' is continuous on (−∞, ∞). Newton's method (2.2) converges to the root t* = 0 from any starting point t₀ with |t₀| < 2/3, oscillates between t_s = −2/3 and t_{s+1} = 2/3, s = 1, 2, ..., when |t₀| = 2/3, and diverges for any t₀ with |t₀| > 2/3.
Example 2.2.
For f(t) = √(1 + t²) we have f(t*) = f(0) = min{f(t): −∞ < t < ∞}. For the first and second derivatives we have

f'(t) = t(1 + t²)^{−1/2}, f''(t) = (1 + t²)^{−3/2}.

Therefore Newton's method (2.13) is given by the following formula:

(2.15) t_{s+1} = t_s − (1 + t_s²)^{3/2} t_s (1 + t_s²)^{−1/2} = t_s − t_s(1 + t_s²) = −t_s³.

It follows from (2.15) that Newton's method converges from any t₀ ∈ (−1, 1), oscillates between t_s = −1 and t_{s+1} = 1, s = 1, 2, ..., when |t₀| = 1, and diverges from any t₀ ∉ [−1, 1]. Moreover, it converges from any t₀ ∈ (−1, 1) with the cubic rate.
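The map (2.15) can be iterated directly; the short Python check below illustrates the three regimes (the starting points are chosen for illustration).

    # Newton's method for f(t) = sqrt(1 + t^2) reduces to t_{s+1} = -t_s^3, see (2.15).
    def iterate(t, steps=6):
        path = [t]
        for _ in range(steps):
            t = -t ** 3
            path.append(t)
        return path

    print(iterate(0.9))   # |t_s| -> 0: convergence, with cubic rate
    print(iterate(1.0))   # oscillation between 1 and -1
    print(iterate(1.1))   # |t_s| -> infinity: divergence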
However, in both examples the convergence area is negligibly small as compared with the area where Newton's method diverges. Note that f is strictly convex in R and strongly convex in the neighborhood of t* = 0.

Therefore there are three important issues associated with Newton's method for unconstrained convex optimization. First, to characterize the neighborhood of the solution where Newton's method converges with quadratic rate. Second, to find a modification of Newton's method that generates a convergent sequence from any starting point and retains the quadratic convergence rate in the neighborhood of the solution. Third, to estimate the computational complexity of the globally convergent Newton's and regularized Newton's methods in terms of the total number of steps required for finding an ε-approximation for x*.

3. Local Quadratic Convergence of Newton's Method
We consider a class of convex functions f: Rⁿ → R that are strongly convex at x*, that is,

(3.1) ∇²f(x*) ≽ mI, m > 0.
In other words, there are δ > 0, a ball B(x*, δ) = {x ∈ Rⁿ: ‖x − x*‖ ≤ δ} and M > 0 such that for any x and y in B(x*, δ) we have

(3.2) ‖∇²f(x) − ∇²f(y)‖ ≤ M‖x − y‖.

The following Theorem characterizes the neighborhood of x* where Newton's method converges with quadratic rate. There are several ways to prove this fundamental result (see, for example, [2], [4], [7], [8] and references therein). In the following Theorem, which we provide for completeness, the Newton's area is characterized explicitly through the convexity constant m > 0 and the Lipschitz constant M > 0.
Theorem 3.1.
If for 0 < m < M conditions (3.1) and (3.2) are satisfied, then for δ = 2m(3M)⁻¹ and any given x₀ ∈ B(x*, δ) the entire sequence {x_s}, s ≥ 0, generated by (2.13) belongs to B(x*, δ) and the following bound holds:

(3.3) ‖x_{s+1} − x*‖ ≤ M(2(m − M‖x_s − x*‖))⁻¹ ‖x_s − x*‖², s ≥ 0.

Proof. From (2.13) and ∇f(x*) = 0 follows

(3.4) x_{s+1} − x* = x_s − x* − [∇²f(x_s)]⁻¹∇f(x_s) = x_s − x* − (∇²f(x_s))⁻¹(∇f(x_s) − ∇f(x*)) = [∇²f(x_s)]⁻¹[∇²f(x_s)(x_s − x*) − (∇f(x_s) − ∇f(x*))].

Then we have

∇f(x_s) − ∇f(x*) = ∫₀¹ ∇²f(x* + τ(x_s − x*))(x_s − x*) dτ.

From (3.4) we obtain

(3.5) x_{s+1} − x* = [∇²f(x_s)]⁻¹ H_s (x_s − x*),

where H_s = ∫₀¹ [∇²f(x_s) − ∇²f(x* + τ(x_s − x*))] dτ. Let Δ_s = ‖x_s − x*‖; then using (3.2) we get

‖H_s‖ ≤ ∫₀¹ ‖∇²f(x_s) − ∇²f(x* + τ(x_s − x*))‖ dτ ≤ ∫₀¹ M‖x_s − x* − τ(x_s − x*)‖ dτ = ∫₀¹ M(1 − τ)‖x_s − x*‖ dτ = (M/2)Δ_s.

Therefore from (3.5) and the latter bound we have

(3.6) Δ_{s+1} ≤ ‖(∇²f(x_s))⁻¹‖ ‖H_s‖ ‖x_s − x*‖ ≤ (M/2)‖(∇²f(x_s))⁻¹‖ Δ_s².

From (3.2) follows ‖∇²f(x_s) − ∇²f(x*)‖ ≤ M‖x_s − x*‖ = MΔ_s, therefore

∇²f(x*) + MΔ_s I ≽ ∇²f(x_s) ≽ ∇²f(x*) − MΔ_s I.

From (3.1) follows ∇²f(x_s) ≽ ∇²f(x*) − MΔ_s I ≽ (m − MΔ_s)I.
Hence, for any Δ_s < mM⁻¹ the matrix ∇²f(x_s) is positive definite, therefore the inverse (∇²f(x_s))⁻¹ exists and the following bound holds:

‖(∇²f(x_s))⁻¹‖ ≤ (m − MΔ_s)⁻¹.

From (3.6) and the latter bound follows

(3.7) Δ_{s+1} ≤ M(2(m − MΔ_s))⁻¹ Δ_s².

From (3.7) for Δ_s < 2m(3M)⁻¹ follows Δ_{s+1} < Δ_s, which means that for δ = 2m(3M)⁻¹ and any x₀ ∈ B(x*, δ) the entire sequence {x_s} belongs to B(x*, δ) and converges to x* with the quadratic rate (3.7). The proof is completed. □

The neighborhood B(x*, δ) with δ = 2m(3M)⁻¹ is called the Newton's area. In the following section we consider a new version of the damped Newton's method, which converges from any starting point and at the same time retains the quadratic convergence rate in the Newton's area.
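Theorem 3.1 is easy to probe numerically. Below is a one-dimensional sketch under stated assumptions: for f(t) = t²/2 + t³/6 (an illustrative choice, not from the paper) one has f''(t) = 1 + t, so m = f''(0) = 1 and M = 1, the Newton's area is |t| < δ = 2/3, and the printed ratio Δ_{s+1}/Δ_s² stays below the factor from (3.3).

    grad = lambda t: t + 0.5 * t ** 2      # f'(t) for f(t) = t^2/2 + t^3/6, t* = 0
    hess = lambda t: 1.0 + t               # f''(t); here m = 1 and M = 1

    t = 0.5                                # a starting point inside B(0, 2/3)
    for s in range(5):
        t_next = t - grad(t) / hess(t)     # Newton step (2.13) in one dimension
        factor = 1.0 / (2.0 * (1.0 - abs(t)))   # M/(2(m - M|t - t*|)) from (3.3)
        print(s, t_next / t ** 2, '<=', factor)
        t = t_next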
4. Damped Newton's Method

To make Newton's method practical we have to guarantee convergence from any starting point. To this end a step length t > 0 is applied to the Newton direction n(x), that is,

(4.1) x̂ = x + tn(x) = x − t(∇²f(x))⁻¹∇f(x).

The step length t > 0 is chosen to guarantee a reduction of f at each x ∉ B(x*, δ), while t = 1 is used when x ∈ B(x*, δ). Method (4.1) is called the damped Newton's method (DNM) (see, for example, [2], [7], [9]).

The following function λ: Rⁿ → R₊,

(4.2) λ(x) = ((∇²f(x))⁻¹∇f(x), ∇f(x))^{1/2} = [−(∇f(x), n(x))]^{1/2},

which is called the Newton's decrement of f at x ∈ Rⁿ, will play an important role later.

At this point we assume that f: Rⁿ → R is strongly convex and its Hessian ∇²f is Lipschitz continuous, that is, there exist ∞ > M > m > 0 such that

(4.3) ∇²f(x) ≽ mI

and

(4.4) ‖∇²f(x) − ∇²f(y)‖ ≤ M‖x − y‖

are satisfied for any x and y from Rⁿ.

Let x₀ ∈ Rⁿ be a starting point. Due to (4.3) the sublevel set L₀ = {x ∈ Rⁿ: f(x) ≤ f(x₀)} is bounded for any given x₀ ∈ Rⁿ. Therefore from (4.4) follows the existence of L > 0 such that

(4.5) ‖∇²f(x)‖ ≤ L, ∀x ∈ L₀.

We also assume that ε > 0 is small enough, so that

(4.6) 0 < ε < m²L⁻¹

holds. We are ready to describe our version of DNM. Let x₀ ∈ Rⁿ be a starting point and 0 < ε < δ be the required accuracy. Set x := x₀ and repeat the following steps:
1. find the Newton direction n(x);
2. if the inequality

(4.7) f(x + n(x)) ≤ f(x) + 0.5(∇f(x), n(x))

holds, then set t(x) := 1, otherwise set t(x) := m(2L)⁻¹;
3. set x := x + t(x)n(x);
4. if λ(x) ≤ ε^{3/2}, then x* := x, otherwise go to 1.
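A minimal Python sketch of the DNM 1.-4.; the callables f, grad, hess and the constants m, L, eps are assumptions supplied by the user, and for simplicity the decrement test of step 4 is performed at the top of the loop.

    import numpy as np

    def dnm(f, grad, hess, x0, m, L, eps, max_iter=1000):
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad(x)
            n = np.linalg.solve(hess(x), -g)       # 1. Newton direction n(x)
            if np.sqrt(-g.dot(n)) <= eps ** 1.5:   # 4. Newton decrement (4.2) is small
                break
            # 2. full step if (4.7) holds, otherwise the damped step m/(2L)
            t = 1.0 if f(x + n) <= f(x) + 0.5 * g.dot(n) else m / (2.0 * L)
            x = x + t * n                          # 3. update the approximation
        return x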
The following Theorem proves global convergence of the DNM 1.-4. and establishes the upper bound for the total number of DNM steps required for finding an ε-approximation for x*.

5. Global Convergence of the DNM and its Complexity

Theorem 5.1. If f: Rⁿ → R is twice continuously differentiable and conditions (4.3) and (4.4) are satisfied, then for δ = 2m(3M)⁻¹ it takes at most

(5.1) N₀ = 9L²M²m⁻⁵(f(x₀) − f(x*))

DNM steps to find x ∈ B(x*, δ).

Proof. From (4.5) follows

(5.2) ∇²f(x) ≼ LI.
On the other hand, from (4.3) follows the existence of the inverse (∇²f(x))⁻¹. Therefore from (5.2) follows

(5.3) (∇²f(x))⁻¹ ≽ L⁻¹I.

From (4.2) and (5.3) we obtain the following lower bound for the Newton's decrement:

(5.4) λ(x) = ((∇²f(x))⁻¹∇f(x), ∇f(x))^{1/2} ≥ (L⁻¹‖∇f(x)‖²)^{1/2} = L^{−1/2}‖∇f(x)‖.

From (4.3) we have ‖∇f(x)‖ ‖x − x*‖ ≥ (∇f(x) − ∇f(x*), x − x*) ≥ m‖x − x*‖², or

(5.5) ‖∇f(x)‖ ≥ m‖x − x*‖.
From (5.4) and (5.5) we obtain

(5.6) λ(x) ≥ L^{−1/2}m‖x − x*‖.

From (4.6) and the stopping criterion 4. follows

mL^{−1/2}ε ≥ ε^{3/2} ≥ λ(x) ≥ mL^{−1/2}‖x − x*‖,

or ‖x − x*‖ ≤ ε, which justifies the stopping criterion 4.

On the other hand, the Newton's decrement defines a lower bound for the function reduction at each step. In fact, for the Newton directional derivative from (2.12), (4.2) and (4.3) follows

(5.7) φ'(0) = df(x + tn(x))/dt |_{t=0} = (∇f(x), n(x)) = −(∇²f(x)n(x), n(x)) ≤ −m‖n(x)‖².

Due to the strong convexity of φ(t) = f(x + tn(x)) the derivative φ'(t) = (∇f(x + tn(x)), n(x)) is monotone increasing in t > 0, so there is t(x) > 0 such that

(5.8) (∇f(x + t(x)n(x)), n(x)) ≥ (1/2)(∇f(x), n(x));

otherwise (∇f(x + tn(x)), n(x)) < (1/2)(∇f(x), n(x)) ≤ −(m/2)‖n(x)‖² for all t > 0, hence inf f(x) = −∞, which is impossible for a strongly convex function f.

It follows from (5.7), (5.8) and the monotonicity of φ'(t) that for any t ∈ [0, t(x)] we have

df(x + tn(x))/dt = (∇f(x + tn(x)), n(x)) ≤ (1/2)(∇f(x), n(x)).
Therefore

f(x + t(x)n(x)) ≤ f(x) + (1/2)t(x)(∇f(x), n(x)).

Keeping in mind (4.2) we obtain

(5.9) f(x) − f(x + t(x)n(x)) ≥ (1/2)t(x)λ²(x).

Combining (5.7) and (5.8) we obtain

(∇f(x + t(x)n(x)) − ∇f(x), n(x)) ≥ (m/2)‖n(x)‖².

Therefore, there is 0 < θ(x) < 1 such that

t(x)(∇²f(x + θ(x)t(x)n(x))n(x), n(x)) = t(x)(∇²f(·)n(x), n(x)) ≥ (m/2)‖n(x)‖²,

or t(x)‖∇²f(·)‖ ‖n(x)‖² ≥ (m/2)‖n(x)‖². From (4.5) follows

(5.10) t(x) ≥ m(2L)⁻¹,

which justifies the choice of the step length t(x) in the DNM 1.-4. Hence, from (5.9) and (5.10) we obtain the following lower bound for the function reduction per step:

(5.11) Δf(x) = f(x) − f(x + t(x)n(x)) ≥ m(4L)⁻¹λ²(x),

which together with the lower bound (5.6) for the Newton's decrement λ(x) leads to

(5.12) Δf(x) = f(x) − f(x + t(x)n(x)) ≥ m³(4L²)⁻¹‖x − x*‖².

It means that for any x ∉ B(x*, δ) the function reduction at each step is proportional to the square of the distance between the current approximation x and the solution x*. In other words, "far from" the solution the Newton step produces a "substantial" reduction of the function value, similar to that of the gradient method.

For x ∉ B(x*, δ) we have ‖x − x*‖ ≥ 2m(3M)⁻¹, therefore from (5.12) we obtain

Δf(x) ≥ m⁵(9L²M²)⁻¹.

So it takes at most

N₀ = 9L²M²m⁻⁵(f(x₀) − f(x*))

Newton steps to obtain x ∈ B(x*, δ) from a given starting point x₀ ∈ Rⁿ. The proof is completed. □

From Theorem 3.1 follows that O(ln ln ε⁻¹) steps are needed to find an ε-approximation to x* from any x ∈ B(x*, δ), where 0 < ε < δ is the required accuracy. Therefore the total number of Newton steps required for finding an ε-approximation to the optimal solution x* from a starting point x₀ ∈ Rⁿ is

N = N₀ + O(ln ln ε⁻¹).

The bound (5.1) is similar to (9.40) from [2], but the proof is based on our version of DNM and the explicit characterization of the Newton's area. It allows extending the proof to the regularized Newton's method [10].

The DNM requires a priori knowledge of the two parameters m and L or their corresponding lower and upper bounds. The following version of DNM is free from this requirement. To adjust the step length t > 0 one can use the Armijo condition

(5.13) f(x + tn(x)) ≤ f(x) + αt(∇f(x), n(x))

with 0 < α ≤ 0.5. For a given 0 < ρ <
1, the backtracking line search consists of the following steps:

1. for the current t > 0, if (5.13) does not hold, set t := tρ and repeat until (5.13) holds, then go to 2;
2. set t(x) := t, x := x + t(x)n(x).
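A sketch of this backtracking in Python; x and n are assumed NumPy arrays, and the concrete values of alpha and rho are illustrative choices within the ranges above.

    def backtracking(f, g, x, n, alpha=0.25, rho=0.5):
        t = 1.0
        slope = g.dot(n)                 # (∇f(x), n(x)) < 0 for a descent direction
        while f(x + t * n) > f(x) + alpha * t * slope:
            t *= rho                     # 1. shrink t := t*rho until (5.13) holds
        return t                         # 2. the accepted step length t(x)

Termination is guaranteed because n(x) is a descent direction, so (5.13) holds for all sufficiently small t > 0.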
We are ready to describe another version of DNM, which does not require a priori knowledge of the parameters m and L or their lower and upper bounds. Let x₀ ∈ Rⁿ be a starting point and 0 < ε < δ be the required accuracy.

1. compute the Newton direction

(5.14) n(x) = −(∇²f(x))⁻¹∇f(x);

2. set t := 1 and use the backtracking line search until

f(x + tn(x)) ≤ f(x) + 0.5t(∇f(x), n(x));

3. set t(x) := t, x := x + t(x)n(x);
4. if λ(x) ≤ ε^{3/2}, then x* := x, otherwise go to 1.

The complexity of the DNM with backtracking line search can be established using arguments similar to those in Theorem 5.1.

Unfortunately, in the absence of strong convexity of f: Rⁿ → R Newton's method might not converge from any starting point. In the case of Example 2.2 Newton's method does not converge from any t₀ ∉ (−1, 1), in spite of f(t) = √(1 + t²) being strongly convex and smooth enough in the neighborhood of t* = 0.
In the following section we consider the regularized Newton's method (RNM) (see [10]), which eliminates this basic drawback of the classical Newton's method. It generates a converging sequence from any starting point x₀ ∈ Rⁿ and retains the quadratic convergence rate in the regularized Newton's area, which we will characterize later.

6. Regularized Newton's Method
Let f ∈ C² be a convex function in Rⁿ. We assume that the optimal set X* = Argmin{f(x): x ∈ Rⁿ} is not empty and bounded. The regularized at the point x ∈ Rⁿ function F_x: Rⁿ → R is defined by the following formula:

(6.1) F_x(y) = f(y) + (1/2)‖∇f(x)‖ ‖y − x‖².

For any x ∉ X* we have ‖∇f(x)‖ > 0; therefore, for any convex function f: Rⁿ → R the regularized function F_x is strongly convex in y for any x ∉ X*.
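In code the regularization (6.1) is a one-liner; the sketch below assumes callables f and grad for the objective and its gradient (illustrative names only).

    import numpy as np

    def regularized(f, grad, x):
        # Return the function y -> F_x(y) = f(y) + 0.5 ||∇f(x)|| ||y - x||^2, cf. (6.1).
        w = np.linalg.norm(grad(x))      # the regularization weight; positive for x ∉ X*
        return lambda y: f(y) + 0.5 * w * np.linalg.norm(y - x) ** 2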
If f is strongly convex at x*, then the regularized function F_x is strongly convex in Rⁿ. The following properties of F_x are direct consequences of the definition (6.1):

1°. F_x(y)|_{y=x} = f(x);
2°. ∇_y F_x(y)|_{y=x} = ∇f(x);
3°. ∇²_{yy} F_x(y)|_{y=x} = ∇²f(x) + ‖∇f(x)‖I = H(x),

where I is the identity matrix in Rⁿ.

For any x ∉ X* the inverse H⁻¹(x) exists for any convex f ∈ C². Therefore the regularized Newton step

(6.2) x̂ = x − (H(x))⁻¹∇f(x)

can be performed for any convex f ∈ C² from any starting point x ∉ X*.

We start by showing that the regularization (6.1) improves the "quality" of the Newton direction as well as the condition number of the Hessian ∇²f(x) at any x ∈ Rⁿ such that x ∉ X*. We assume at this point that for any given x ∈ Rⁿ there exist 0 ≤ m(x) ≤ M(x) < ∞ such that

(6.3) m(x)I ≼ ∇²f(x) ≼ M(x)I.

The Newton direction n(x) solves the system

(6.4) ∇²f(x)n(x) = −∇f(x),

and the regularized Newton direction r(x) solves the system

(6.5) H(x)r(x) = −∇f(x).

The "quality" of a direction d at x ∉ X* is measured by the cosine of the angle between d and −∇f(x):

(6.7) q(d) = −(∇f(x), d)(‖∇f(x)‖ ‖d‖)⁻¹.

Lemma 6.1. Let f: Rⁿ → R be a twice continuously differentiable convex function and let the bounds (6.3) hold; then:

1. 1 ≥ q(r(x)) ≥ (m(x) + ‖∇f(x)‖)(M(x) + ‖∇f(x)‖)⁻¹ = cond H(x) > 0 for any x ∉ X*;
2. 1 ≥ q(n(x)) ≥ m(x)(M(x))⁻¹ = cond ∇²f(x) for any x ∈ Rⁿ;
3. cond H(x) − cond ∇²f(x) = ‖∇f(x)‖(1 − cond ∇²f(x))(M(x) + ‖∇f(x)‖)⁻¹ > 0 for any x ∉ X* with cond ∇²f(x) < 1.

Proof.
1. From (6.5) we obtain

(6.9) ‖∇f(x)‖ ≤ ‖H(x)‖ · ‖r(x)‖.

Using the right inequality in (6.3) and 3°, we have

(6.10) ‖H(x)‖ ≤ M(x) + ‖∇f(x)‖.

From (6.9) and (6.10) we obtain ‖∇f(x)‖ ≤ (M(x) + ‖∇f(x)‖)‖r(x)‖. From (6.5), the left inequality in (6.3) and 3° follows

−(∇f(x), r(x)) = (H(x)r(x), r(x)) ≥ (m(x) + ‖∇f(x)‖)‖r(x)‖².

Therefore from (6.7) follows

q(r(x)) ≥ (m(x) + ‖∇f(x)‖)(M(x) + ‖∇f(x)‖)⁻¹ = cond H(x).

2. Now let us consider the Newton direction n(x). From (6.4) we have

(6.11) ∇f(x) = −∇²f(x)n(x),

therefore −(∇f(x), n(x)) = (∇²f(x)n(x), n(x)). From (6.11) and the left inequality in (6.3) we obtain

(6.12) q(n(x)) = −(∇f(x), n(x))(‖∇f(x)‖ · ‖n(x)‖)⁻¹ ≥ m(x)‖n(x)‖ · ‖∇f(x)‖⁻¹.

From (6.11) and the right inequality in (6.3) follows

(6.13) ‖∇f(x)‖ ≤ ‖∇²f(x)‖ · ‖n(x)‖ ≤ M(x)‖n(x)‖.

Combining (6.12) and (6.13) we have q(n(x)) ≥ m(x)(M(x))⁻¹ = cond ∇²f(x).

3. Using the formulas for the condition numbers of ∇²f(x) and H(x), the identity in claim 3 follows by a direct computation. □

Corollary 6.2. The regularized Newton direction r(x) is a descent direction for any convex f: Rⁿ → R, whereas the classical Newton direction n(x) exists and is a descent direction only if f is strongly convex at x ∈ Rⁿ.

Under conditions (3.1) and (3.2) the RNM retains the local quadratic convergence rate, which is typical for the classical Newton's method. On the other hand, the regularization (6.1) allows establishing global convergence and estimating the complexity of the RNM when the original function is only strongly convex at x*.
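A small numerical illustration of Lemma 6.1, using an assumed ill-conditioned quadratic f(x) = (1/2)(Ax, x): the regularized direction r(x) is far better aligned with −∇f(x) than the Newton direction n(x).

    import numpy as np

    A = np.diag([1e-3, 1.0])              # ∇²f with cond ∇²f = m/M = 1e-3
    x = np.array([1.0, 1.0])
    g = A @ x                             # ∇f(x) for f(x) = (1/2)(Ax, x)
    H = A + np.linalg.norm(g) * np.eye(2) # the regularized Hessian H(x), property 3°

    n = np.linalg.solve(A, -g)            # Newton direction, system (6.4)
    r = np.linalg.solve(H, -g)            # regularized Newton direction, system (6.5)

    q = lambda d: -g.dot(d) / (np.linalg.norm(g) * np.linalg.norm(d))  # quality (6.7)
    print(q(n), q(r))                     # q(r) is much closer to 1 than q(n)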
7. Local Quadratic Convergence Rate of the RNM

In this section we consider the RNM and determine the neighborhood of the minimizer where the RNM converges with quadratic rate. Along with assumptions (3.1) and (3.2) for the Hessian ∇²f we will use the Lipschitz condition for the gradient ∇f:

(7.1) ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖,

which is equivalent to (4.5). The RNM generates a sequence {x_s}, s ≥ 0:

(7.2) x_{s+1} = x_s − [∇²f(x_s) + ‖∇f(x_s)‖I]⁻¹∇f(x_s).

The following Theorem characterizes the regularized Newton's area.

Theorem 7.1. If (3.1), (3.2) and (7.1) hold, then for δ₁ = 2m(3(M + 2L))⁻¹ and any x₀ ∈ B(x*, δ₁) as a starting point, the sequence {x_s} generated by RNM (7.2) belongs to B(x*, δ₁) and the following bound holds:

(7.3) Δ_{s+1} = ‖x_{s+1} − x*‖ ≤ (M + 2L)(2(m − (M + 2L)Δ_s))⁻¹ Δ_s², s ≥ 0.

Proof. From (7.2) and ∇f(x*) = 0 follows

x_{s+1} − x* = x_s − x* − [∇²f(x_s) + ‖∇f(x_s)‖I]⁻¹(∇f(x_s) − ∇f(x*)).

Using

∇f(x_s) − ∇f(x*) = ∫₀¹ ∇²f(x* + τ(x_s − x*))(x_s − x*) dτ,

we obtain

(7.4) x_{s+1} − x* = [∇²f(x_s) + ‖∇f(x_s)‖I]⁻¹ H_s (x_s − x*),

where H_s = ∫₀¹ (∇²f(x_s) + ‖∇f(x_s)‖I − ∇²f(x* + τ(x_s − x*))) dτ. From (3.2) and (7.1) follows

(7.5) ‖H_s‖ ≤ ∫₀¹ ‖∇²f(x_s) − ∇²f(x* + τ(x_s − x*))‖ dτ + ∫₀¹ ‖∇f(x_s) − ∇f(x*)‖ dτ ≤ ∫₀¹ M(1 − τ)‖x_s − x*‖ dτ + ∫₀¹ L‖x_s − x*‖ dτ = ((M + 2L)/2)‖x_s − x*‖.

From (7.4) and (7.5) we have

(7.6) Δ_{s+1} = ‖x_{s+1} − x*‖ ≤ ‖(∇²f(x_s) + ‖∇f(x_s)‖I)⁻¹‖ · ‖H_s‖ · ‖x_s − x*‖ ≤ ((M + 2L)/2)‖(∇²f(x_s) + ‖∇f(x_s)‖I)⁻¹‖ Δ_s².

From (3.2) follows

(7.7) ‖∇²f(x_s) − ∇²f(x*)‖ ≤ M‖x_s − x*‖ = MΔ_s,

therefore we have

(7.8) ∇²f(x*) + MΔ_s I ≽ ∇²f(x_s) ≽ ∇²f(x*) − MΔ_s I.

From (3.1) and (7.8) we obtain

∇²f(x_s) + ‖∇f(x_s)‖I ≽ (m + ‖∇f(x_s)‖ − MΔ_s)I.

Therefore for Δ_s < (m + ‖∇f(x_s)‖)M⁻¹ the matrix ∇²f(x_s) + ‖∇f(x_s)‖I is positive definite, its inverse exists, and

(7.9) ‖(∇²f(x_s) + ‖∇f(x_s)‖I)⁻¹‖ ≤ (m + ‖∇f(x_s)‖ − MΔ_s)⁻¹ ≤ (m − MΔ_s)⁻¹.

For Δ_s ≤ 2m(3(M + 2L))⁻¹ from (7.6) and (7.9) follows

(7.10) Δ_{s+1} ≤ (M + 2L)(2(m − MΔ_s))⁻¹ Δ_s².

Therefore from (7.10) for 0 < Δ_s ≤ 2m(3(M + 2L))⁻¹ < (m + ‖∇f(x_s)‖)M⁻¹ we obtain

Δ_{s+1} ≤ 3(M + 2L)(2m)⁻¹ Δ_s² ≤ Δ_s.

Hence, for δ₁ = 2m(3(M + 2L))⁻¹ and any x₀ ∈ B(x*, δ₁) as a starting point the sequence {x_s} generated by (7.2) belongs to B(x*, δ₁) and the bound (7.3) holds. The proof of Theorem 7.1 is completed. □

Corollary 7.2. Under the conditions of Theorem 7.1, for δ₁ = 2m(3(M + 2L))⁻¹ and any x ∈ B(x*, δ₁) the Hessian ∇²f(x) is positive definite and

(7.11) ∇²f(x) ≽ m₁I, where m₁ = m((1/3)M + 2L)(M + 2L)⁻¹.

In fact, from (4.4) follows

∇²f(x*) + M‖x − x*‖I ≽ ∇²f(x) ≽ ∇²f(x*) − M‖x − x*‖I,

so for any x ∈ B(x*, δ₁) we have

∇²f(x) ≽ (m − M · 2m(3(M + 2L))⁻¹)I = m((1/3)M + 2L)(M + 2L)⁻¹I = m₁I.

From the latter inequality follows

‖∇f(x)‖ ‖x − x*‖ ≥ (∇f(x) − ∇f(x*), x − x*) ≥ m₁‖x − x*‖²,

that is, for any x ∈ B(x*, δ₁) we have

(7.12) ‖∇f(x)‖ ≥ m₁‖x − x*‖.
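A minimal sketch of the pure RNM iteration (7.2) in Python, run on the assumed test function f(x₁, x₂) = x₁² + x₂² + exp(x₁ + x₂) used earlier:

    import numpy as np

    grad = lambda x: np.array([2.0 * x[0], 2.0 * x[1]]) + np.exp(x[0] + x[1])
    hess = lambda x: np.diag([2.0, 2.0]) + np.exp(x[0] + x[1]) * np.ones((2, 2))

    x = np.array([0.3, -0.1])                 # a starting point near the minimizer
    for s in range(8):
        g = grad(x)
        H = hess(x) + np.linalg.norm(g) * np.eye(2)   # regularized Hessian H(x_s)
        x = x - np.linalg.solve(H, g)         # step (7.2)
        print(s, np.linalg.norm(g))           # decay becomes quadratic inside B(x*, δ₁)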
It follows from Theorem 7.1 that B(x*, δ₁) with δ₁ = 2m(3(M + 2L))⁻¹ is the Newton's area for the RNM. So it takes O(ln ln ε⁻¹) regularized Newton steps to find an ε-approximation for x* from any x ∈ B(x*, δ₁) as a starting point.

To make the RNM globally convergent we have to replace the RNM by the DRNM, that is, to adjust the step length. It can be done by the backtracking line search, using the Armijo condition (5.13) with the Newton direction n(x) replaced by the regularized Newton direction r(x). In the following section we introduce another version of the DRNM and estimate the number of DRNM steps required for finding x ∈ B(x*, δ₁) from any given starting point x₀ ∈ Rⁿ.

8. Damped Regularized Newton's Method

Let us consider the regularized Newton's decrement

(8.1) λ_r(x) = (H⁻¹(x)∇f(x), ∇f(x))^{1/2} = [−(∇f(x), r(x))]^{1/2}.

We assume that ε > 0 is small enough, so that

(8.2) 0 < ε^{1/2} < m₁(L + ‖∇f(x)‖)^{−1/2}, ∀x ∈ L₀.

From (4.5) follows

(8.3) ∇²f(x) + ‖∇f(x)‖I ≼ (L + ‖∇f(x)‖)I.

On the other hand, for any x ∈ B(x*, δ₁) from Corollary 7.2 we have

∇²f(x) + ‖∇f(x)‖I ≽ (m₁ + ‖∇f(x)‖)I.

Therefore the inverse (∇²f(x) + ‖∇f(x)‖I)⁻¹ exists and from (8.3) we obtain

H⁻¹(x) = (∇²f(x) + ‖∇f(x)‖I)⁻¹ ≽ (L + ‖∇f(x)‖)⁻¹I.

Therefore from (8.1) for any x ∈ B(x*, δ₁) we have

λ_r(x) = (H⁻¹(x)∇f(x), ∇f(x))^{1/2} ≥ (L + ‖∇f(x)‖)^{−1/2}‖∇f(x)‖,

which together with (7.12) leads to

λ_r(x) ≥ m₁(L + ‖∇f(x)‖)^{−1/2}‖x − x*‖.

Then from λ_r(x) ≤ ε^{3/2} and (8.2) follows

m₁(L + ‖∇f(x)‖)^{−1/2}ε ≥ ε^{3/2} ≥ λ_r(x) ≥ m₁(L + ‖∇f(x)‖)^{−1/2}‖x − x*‖,

or ‖x − x*‖ ≤ ε for any x ∈ B(x*, δ₁). Therefore λ_r(x) ≤ ε^{3/2} can be used as a stopping criterion.

We are ready to describe the DRNM. Let x₀ ∈ Rⁿ be a starting point and 0 < ε < δ₁ be the required accuracy; set x := x₀.

1. compute the regularized Newton direction r(x) by solving the system (6.5);
2. if the inequality

(8.4) f(x + r(x)) ≤ f(x) + 0.5(∇f(x), r(x))

holds, then set t(x) := 1, otherwise set t(x) := (2L)⁻¹‖∇f(x)‖;
3. set x := x + t(x)r(x);
4. if λ_r(x) ≤ ε^{3/2}, then x* := x, otherwise go to 1.

The global convergence and the complexity of the DRNM are considered in the following section.
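A minimal Python sketch of the DRNM 1.-4., parallel to the DNM sketch above; f, grad, hess, L and eps are user-supplied assumptions.

    import numpy as np

    def drnm(f, grad, hess, x0, L, eps, max_iter=1000):
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad(x)
            H = hess(x) + np.linalg.norm(g) * np.eye(x.size)  # H(x), property 3°
            r = np.linalg.solve(H, -g)            # 1. regularized direction, system (6.5)
            if np.sqrt(-g.dot(r)) <= eps ** 1.5:  # 4. decrement test via (8.1)
                break
            # 2. full step if (8.4) holds, otherwise the damped step ||∇f(x)||/(2L)
            t = 1.0 if f(x + r) <= f(x) + 0.5 * g.dot(r) else np.linalg.norm(g) / (2.0 * L)
            x = x + t * r                         # 3. update the approximation
        return x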
9. Complexity of the DRNM

We assume that conditions (3.1) and (3.2) are satisfied. Due to (3.1) the solution x* is unique. Hence, from the convexity of f follows that for any given starting point x₀ ∈ Rⁿ the sublevel set L₀ is bounded; therefore there is L > 0 such that (4.5) holds on L₀.

Let B(x*, r) = {x ∈ Rⁿ: ‖x − x*‖ ≤ r} be the ball with center x* and radius r > 0, and let

r₀ = min{r: L₀ ⊂ B(x*, r)}.

Theorem 9.1. If (3.1) and (3.2) are satisfied and δ₁ = 2m(3(M + 2L))⁻¹, then from any given starting point x₀ ∈ L₀ it takes at most

(9.1) N₁ = 13.5 L²(M + 2L)³(mm₁)⁻³(1 + r₀)(f(x₀) − f(x*))

DRNM steps to get x ∈ B(x*, δ₁).

Proof. For the regularized Newton directional derivative we have

(9.2) df(x + tr(x))/dt |_{t=0} = (∇f(x), r(x)) = −((∇²f(x) + ‖∇f(x)‖I)r(x), r(x)) ≤ −(m(x) + ‖∇f(x)‖)‖r(x)‖²,

where m(x) ≥ 0 and ‖∇f(x)‖ > 0 for x ≠ x*. It means that the regularized Newton direction is a descent direction at any x ∈ L₀, x ≠ x*.

It follows from (9.2) that φ(t) = f(x + tr(x)) is monotone decreasing for small t > 0. From the convexity of f follows that φ'(t) = (∇f(x + tr(x)), r(x)) is nondecreasing in t > 0; hence at some t = t(x) we have

(9.3) (∇f(x + t(x)r(x)), r(x)) ≥ −(1/2)(m(x) + ‖∇f(x)‖)‖r(x)‖²,

otherwise inf f(x) = −∞, which is impossible due to the boundedness of L₀.

From (9.2) and (9.3) we have

(∇f(x + t(x)r(x)) − ∇f(x), r(x)) ≥ (1/2)(m(x) + ‖∇f(x)‖)‖r(x)‖².

Therefore there exists 0 < θ(x) < 1 such that

t(x)(∇²f(x + θ(x)t(x)r(x))r(x), r(x)) = t(x)(∇²f(·)r(x), r(x)) ≥ (1/2)(m(x) + ‖∇f(x)‖)‖r(x)‖²,

or t(x)‖∇²f(·)‖ ‖r(x)‖² ≥ (1/2)(m(x) + ‖∇f(x)‖)‖r(x)‖². Keeping in mind that ‖∇²f(·)‖ ≤ L we obtain

(9.4) t(x) ≥ (m(x) + ‖∇f(x)‖)(2L)⁻¹ ≥ ‖∇f(x)‖(2L)⁻¹.

It means that for t ≤ ‖∇f(x)‖(2L)⁻¹ the inequality df(x + tr(x))/dt ≤ (1/2)(∇f(x), r(x)) holds, hence

(9.5) Δf(x) = f(x) − f(x + t(x)r(x)) ≥ (1/2)t(x)(−(∇f(x), r(x))) = (1/2)t(x)λ_r²(x).

Therefore, to find a lower bound for the decrease of f at any x ∈ L₀ such that x ∉ B(x*, δ₁) we have to find the corresponding bound for the regularized Newton's decrement.

First let x ∈ B(x*, δ₁); then from (7.11) follows

(9.6) (∇f(x) − ∇f(x*), x − x*) ≥ m₁‖x − x*‖² for any x ∈ B(x*, δ₁).

Now let x̂ ∉ B(x*, δ₁); we consider the segment [x*, x̂]. There is 0 < t̃ < 1 such that x̃ = (1 − t̃)x* + t̃x̂ ∈ ∂B(x*, δ₁). From the convexity of f follows

(∇f(x* + t(x̂ − x*)), x̂ − x*)|_{t=0} ≤ (∇f(x* + t(x̂ − x*)), x̂ − x*)|_{t=t̃} ≤ (∇f(x* + t(x̂ − x*)), x̂ − x*)|_{t=1},

or

0 = (∇f(x*), x̂ − x*) ≤ (∇f(x̃), x̂ − x*) ≤ (∇f(x̂), x̂ − x*).

The middle term can be rewritten as follows:

(∇f(x̃), x̂ − x*) = (‖x̂ − x*‖/δ₁)(∇f(x̃) − ∇f(x*), x̃ − x*) ≤ (∇f(x̂), x̂ − x*).

In view of (9.6) we obtain

‖∇f(x̂)‖ ‖x̂ − x*‖ ≥ (‖x̂ − x*‖/δ₁) m₁‖x̃ − x*‖².

Keeping in mind that x̃ ∈ ∂B(x*, δ₁) we get

(9.7) ‖∇f(x̂)‖ ≥ m₁‖x̃ − x*‖ = m₁δ₁ = (2/3)mm₁(M + 2L)⁻¹.

On the other hand, from (7.1) and x̂ ∈ L₀ follows

(9.8) ‖∇f(x̂)‖ = ‖∇f(x̂) − ∇f(x*)‖ ≤ L‖x̂ − x*‖ ≤ Lr₀.

From (4.5) follows

(9.9) ∇²f(x) ≼ LI.

For any x̂ ∉ B(x*, δ₁) we have ‖∇f(x̂)‖ > 0, therefore H(x̂) = ∇²f(x̂) + ‖∇f(x̂)‖I is positive definite and system (6.5) has a unique solution r(x̂) = −H⁻¹(x̂)∇f(x̂). Moreover, from (9.9) follows ∇²f(x̂) + ‖∇f(x̂)‖I ≼ (L + ‖∇f(x̂)‖)I. Therefore

(9.10) H⁻¹(x̂) ≽ (L + ‖∇f(x̂)‖)⁻¹I.

For the regularized Newton's decrement we obtain

(9.11) λ_r(x̂) = (H⁻¹(x̂)∇f(x̂), ∇f(x̂))^{1/2} ≥ (L + ‖∇f(x̂)‖)^{−1/2}‖∇f(x̂)‖.

Keeping in mind ‖∇f(x̂)‖ = ‖∇f(x̂) − ∇f(x*)‖ ≤ L‖x̂ − x*‖ ≤ Lr₀, from (9.4), (9.8) and (9.11) we obtain

Δf(x̂) ≥ (1/2)t(x̂)λ_r²(x̂) ≥ (‖∇f(x̂)‖/(4L))(L + ‖∇f(x̂)‖)⁻¹‖∇f(x̂)‖² ≥ ‖∇f(x̂)‖³(4L²(1 + r₀))⁻¹.

Using (9.7) we get

Δf(x̂) ≥ ((2/3)mm₁(M + 2L)⁻¹)³(4L²(1 + r₀))⁻¹ = (2/27)(mm₁)³(M + 2L)⁻³L⁻²(1 + r₀)⁻¹.

Therefore it takes at most
N₁ = (f(x₀) − f(x*))Δf⁻¹(x̂) = 13.5 L²(M + 2L)³(mm₁)⁻³(1 + r₀)(f(x₀) − f(x*))

DRNM steps to obtain x ∈ B(x*, δ₁) from any x₀ ∈ L₀. The proof of Theorem 9.1 is completed. □

From (7.3) follows that it takes O(ln ln ε⁻¹) DRNM steps to find an ε-approximation for x* from any x ∈ B(x*, δ₁). Therefore the total number of DRNM steps required for finding an ε-approximation for x* from a given starting point x₀ ∈ Rⁿ is

N = N₁ + O(ln ln ε⁻¹).

Concluding Remarks

The bounds (5.1) and (9.1) depend on the size of the Newton's and the regularized Newton's areas, which, in turn, are defined by the convexity constant m > 0 and the Lipschitz constants M > 0 and L > 0. The convexity and smoothness constants depend on the given system of coordinates.

Let us consider an affine transformation of the original system given by x = Ay, where A ∈ R^{n×n} is a nondegenerate matrix. We obtain φ(y) = f(Ay). Let {x_s}, s ≥ 0, be the sequence generated by Newton's method

x_{s+1} = x_s − (∇²f(x_s))⁻¹∇f(x_s).

For the corresponding sequence in the transformed space we obtain

y_{s+1} = y_s − (∇²φ(y_s))⁻¹∇φ(y_s).

Let y_s = A⁻¹x_s for some s ≥ 0; then

y_{s+1} = y_s − (∇²φ(y_s))⁻¹∇φ(y_s) = y_s − [Aᵀ∇²f(Ay_s)A]⁻¹Aᵀ∇f(Ay_s) = A⁻¹x_s − A⁻¹(∇²f(x_s))⁻¹∇f(x_s) = A⁻¹x_{s+1}.

It means that Newton's method is affine invariant with respect to the transformation x = Ay; therefore the areas of quadratic convergence depend only on the local topology of f (see [7]). To get the Newton sequence in the transformed space one needs to apply A⁻¹ to the elements of the Newton sequence in the original space.

Let N be such that ‖x_N − x*‖ ≤ ε; then ‖y_N − y*‖ ≤ ‖A⁻¹‖ ‖x_N − x*‖. From (3.3) follows

‖x_{N+1} − x*‖ ≤ M(2(m − M‖x_N − x*‖))⁻¹ ‖x_N − x*‖².

Therefore

‖y_{N+1} − y*‖ ≤ ‖A⁻¹‖ ‖x_{N+1} − x*‖ ≤ ‖A⁻¹‖ M(2(m − Mε))⁻¹ ε².

Hence, for small enough ε ≤ 0.5 mM⁻¹ min{1; ‖A⁻¹‖⁻¹} we have ‖y_{N+1} − y*‖ ≤ ε.

We would like to emphasize that the bound (9.1) is global, while the conditions (3.1) and (3.2) under which the bound holds are local, that is, they concern only the neighborhood of x*.

References

[1] V. Arnold, Small denominators and problems of stability in classical and celestial mechanics, Uspekhi Matematicheskikh Nauk 18(6) (1963), 91-192.
[2] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press (2004).
[3] A. Ioffe, On the local surjection property, Nonlinear Analysis 11(5) (1987), 565-592.
[4] L. Kantorovich and G. Akilov, Functional Analysis, Nauka, Moscow (1977).
[5] L. Lyusternik, On conditional extrema of functionals, Matematicheskii Sbornik 41(3) (1934), 390-401.
[6] Yu. Nesterov and A. Nemirovski, Interior Point Polynomial Algorithms in Convex Programming, SIAM, Philadelphia (1994).
[7] Yu. Nesterov, Introductory Lectures on Convex Optimization, Kluwer, Boston (2004).
[8] B. Polyak, Introduction to Optimization, Optimization Software, Inc., NY (1987).
[9] B. Polyak, Newton's method and its use in optimization, European Journal of Operational Research 181 (2007), 1086-1096.
[10] R. Polyak, Regularized Newton method for unconstrained convex optimization, Mathematical Programming Ser. B 120 (2009), 125-145.

(R. Polyak) Department of Mathematics, The Technion - Israel Institute of Technology, 32000 Haifa, Israel
E-mail address: