Accelerated Methods for α-Weakly-Quasi-Convex Problems
Sergey Guminov · Alexander Gasnikov

Abstract
Convexity of the objective function often allows one to guarantee much better convergence rates of iterative minimization methods than in the general non-convex case. However, many problems encountered in training neural networks are non-convex. Some of them satisfy conditions weaker than convexity, but which are still sufficient to guarantee the convergence of some first-order methods. In this work we present a condition to replace convexity and show that gradient descent with fixed step length retains its convergence rate under this condition. We show that the sequential subspace optimization method is optimal in terms of oracle complexity in this case. We also provide a substitute for strong convexity which is sufficient to guarantee the same convergence rate as in the strongly convex case for this new class of generally non-convex functions.
Keywords
Non-convex minimization · First-order methods · Accelerated methods · Global optimization
Introduction
Convexity of the objective function is a natural property often used to prove the convergence of iterative methods of optimization. One of the main qualities of convex functions is that they have no non-global local minima, which makes optimizing such objectives considerably easier. A new condition called α-weak-quasi-convexity was recently proposed by Hardt et al. in [6] in relation to a machine learning problem. In this paper we are going to show that this much weaker condition is still sufficient to guarantee the convergence of some iterative methods.

All the conditions used in this paper will be formally defined in section 2. In section 3 we will show that α-weak-quasi-convexity and smoothness are sufficient to guarantee the convergence of gradient descent with fixed step length. In section 4 we will demonstrate a method which retains its convergence rate (optimal, in terms of the number of iterations, for the class of smooth convex objectives) under our weaker assumptions. In section 5 we will generalize an optimal method of smooth strongly convex optimization to a particular subclass of α-weakly-quasi-convex problems.

Sergey Guminov
Moscow Institute of Physics and Technology, Moscow
E-mail: [email protected]

Alexander Gasnikov
Moscow Institute of Physics and Technology, Moscow; Institute for Information Transmission Problems RAS, Moscow
E-mail: [email protected]

0.1 Preliminaries

Throughout this paper we will be dealing with the problem

f(x) → min, x ∈ ℝⁿ.

f : F → ℝ, where F is a closed and convex domain of f(x), is assumed to be differentiable and L-smooth:

‖∇f(y) − ∇f(x)‖ ≤ L‖y − x‖ ∀x, y ∈ F,

where ‖·‖ is the Euclidean norm and ⟨·, ·⟩ denotes the scalar product defined as ⟨x, y⟩ = Σ_{i=1}^n x_i y_i. We will also assume that the solution set X∗ is not empty and denote f∗ = min_{x ∈ ℝⁿ} f(x).
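To make the smoothness condition concrete, here is a small numerical sketch (our own illustration, not from the paper): for a quadratic f(x) = (1/2)⟨x, Ax⟩ the smallest valid gradient-Lipschitz constant is the largest eigenvalue of A.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M.T @ M                          # symmetric positive semidefinite Hessian
L = np.linalg.eigvalsh(A).max()      # smallest valid gradient-Lipschitz constant

def grad(x):
    return A @ x                     # gradient of f(x) = 0.5 * <x, A x>

# ||grad(y) - grad(x)|| <= L * ||y - x|| should hold for every pair of points
for _ in range(100):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    assert np.linalg.norm(grad(y) - grad(x)) <= L * np.linalg.norm(y - x) + 1e-9
print("L =", round(float(L), 3))
```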
In our work we will be using a relaxation of convexity called α-weak-quasi-convexity, as defined in [6].

Definition 1
A function f is said to be α-weakly-quasi-convex (α-WQC) with respect to x∗ ∈ X∗ with constant α ∈ (0, 1] if for all x ∈ F

α(f(x) − f∗) ≤ ⟨∇f(x), x − x∗⟩.

α-weak-quasi-convexity guarantees that any local minimizer of f is also a global minimizer. Simply put, this condition says that the tangent plane to the function's graph constructed at any point is not much higher than its minimum. Any convex function with a non-empty solution set is also 1-WQC, but the converse is generally not true. The function f(x) = |x|(1 − e^{−|x|}), x ∈ ℝ, is one example of a non-convex 1-WQC function.
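The 1-WQC inequality for this example (with x∗ = 0 and f∗ = 0) is easy to verify numerically; a quick sketch of our own:

```python
import numpy as np

def f(x):
    return abs(x) * (1 - np.exp(-abs(x)))   # non-convex but 1-WQC, minimizer x* = 0

def df(x):
    # derivative for x != 0: sign(x) * (1 - exp(-|x|) + |x| exp(-|x|))
    return np.sign(x) * (1 - np.exp(-abs(x)) + abs(x) * np.exp(-abs(x)))

# check  1 * (f(x) - f*) <= <f'(x), x - x*>  with x* = 0, f* = 0
for x in np.linspace(-10, 10, 2001):
    if x != 0:
        assert f(x) <= df(x) * x + 1e-12
print("1-WQC inequality holds on the sampled grid")
```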
To weaken strong convexity we will be using the quadratic growth (QG) condition [1, 2].

Definition 2
A function f(x) is said to satisfy the quadratic growth condition if for some µ > 0 and all x ∈ F

f(x) − f∗ ≥ (µ/2)‖x − P(x)‖²,

where P(x) is the projection of x onto X∗.

Note that the same condition appears in [10], formulated as a property of the solution set: the solution set X∗ of f(x) is called globally non-degenerate if it satisfies the inequality in the definition of QG. Though this condition shares some similarities with strong convexity, a non-convex function may still satisfy it. The function f(x) = (‖x‖² − 1)² serves as an example [10].

1.1 Relationship with other conditions

Naturally, α-weak-quasi-convexity and the QG condition are not the only ways to weaken convexity and strong convexity. What follows is a short list of other similar conditions and their relationships with the ones used in this paper.

Let us define the Polyak–Łojasiewicz condition, another condition used to replace strong convexity in convergence arguments.

Definition 3
A function f(x) is said to satisfy the Polyak–Łojasiewicz condition if for some µ > 0 and all x ∈ F

‖∇f(x)‖² ≥ 2µ(f(x) − f∗).

As shown in [7], QG is weaker than the Polyak–Łojasiewicz condition.

Following [10], we define star-convexity.
Definition 4
We call a function f(x) star-convex if for any x∗ ∈ X∗ and any x ∈ F we have

f(λx∗ + (1 − λ)x) ≤ λf(x∗) + (1 − λ)f(x) ∀λ ∈ [0, 1].

This definition is ideologically similar to that of α-WQC in the way it restricts convexity to the direction towards the solution set X∗. In fact, for α = 1 these two definitions are equivalent.

Lemma 1
1-WQC ⇔ star-convexity.

Proof
⇒ Let us assume that f(x) is not star-convex: ∃λ ∈ (0, 1), x ∈ F s.t.

f(λx∗ + (1 − λ)x) − λf(x∗) − (1 − λ)f(x) > 0.

By maximizing the LHS of the above inequality over λ we get that for some λ∗ ∈ (0, 1) and z = λ∗x∗ + (1 − λ∗)x

⟨∇f(z), x − x∗⟩ = f(x) − f(x∗)

and

f(z) > λ∗f(x∗) + (1 − λ∗)f(x).

Now we note that x − x∗ = (z − x∗)/(1 − λ∗) and

f(z) − f(x∗) > (1 − λ∗)(f(x) − f(x∗)).

This in turn implies that

⟨∇f(z), z − x∗⟩ = (1 − λ∗)⟨∇f(z), x − x∗⟩ = (1 − λ∗)(f(x) − f(x∗)) < f(z) − f(x∗),

so f(x) is not 1-WQC.

⇐ If f is star-convex, then

f(x∗) − f(x) ≥ (f(x + λ(x∗ − x)) − f(x))/λ ∀λ ∈ (0, 1], x ∈ F.

Taking the limit λ → +0 we obtain f(x∗) − f(x) ≥ ⟨∇f(x), x∗ − x⟩, so f(x) is 1-WQC. ⊓⊔

Another condition was recently introduced in [5] to generalize convexity. Called the weak PL inequality, in our notation it may be defined as follows.
Definition 5
A function f(x) is said to satisfy the weak PL inequality with respect to x∗ ∈ X∗ if for some µ > 0 and all x ∈ F

√(2µ) (f(x) − f∗) ≤ ‖∇f(x)‖ ‖x − x∗‖.

It immediately follows from the Cauchy–Schwarz inequality that the weak PL inequality is weaker than α-WQC.

From here on, F = ℝⁿ. One of the first questions arising whenever a new condition is proposed to replace convexity is whether it is sufficient to guarantee the convergence of the gradient descent method. Fortunately, this is the case with α-WQC. Given an objective f, consider the sequence {x_k}_{k=0}^∞ generated by the following rule:

x_{k+1} = x_k − (1/L)∇f(x_k).

This is the sequence of points generated by gradient descent with step length 1/L. It is well known that for a convex L-smooth objective f

f(x_k) − f∗ = O(LR²/k),

where R = ‖x_0 − x∗‖. We will now prove a similar result for α-WQC objectives.
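As a quick numerical illustration (our own sketch, not part of the paper), here is fixed-step gradient descent on the non-convex 1-WQC example f(x) = |x|(1 − e^{−|x|}), which is L-smooth with L = 2 on ℝ since |f''| ≤ 2:

```python
import numpy as np

def f(x):
    return abs(x) * (1 - np.exp(-abs(x)))    # non-convex, 1-WQC, f* = 0 at x* = 0

def df(x):
    return np.sign(x) * (1 - np.exp(-abs(x)) + abs(x) * np.exp(-abs(x)))

L = 2.0                     # |f''(x)| <= 2 everywhere, so f is L-smooth with L = 2
x0 = 3.0
x = x0
T = 200
for k in range(T):
    x = x - df(x) / L       # fixed step length 1/L
R = abs(x0 - 0.0)
# compare against the O(L R^2 / (alpha T)) rate proved next (alpha = 1 here)
print(f(x), "<=", L * R**2 / (T + 1))
```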
Theorem 1
Let the objective function f be L-smooth and α-WQC with respect to x∗ ∈ X∗. Then the sequence {x_k}_{k=0}^∞ generated by the gradient descent method from a starting point x_0 satisfies

f(x_T) − f∗ ≤ LR²/(α(T + 1)),

where R = ‖x_0 − x∗‖.
Proof
Any L-smooth function f satisfies

f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖y − x‖² ∀x, y ∈ ℝⁿ.

Setting x = x_k, y = x_{k+1}, we obtain

f(x_{k+1}) − f(x_k) ≤ −(1/(2L))‖∇f(x_k)‖². (1)

This shows that the sequence {f(x_k)}_{k=0}^∞ is nonincreasing. We also have

(1/2)‖x_{k+1} − x∗‖² = (1/2)‖x_k − x∗‖² − (1/L)⟨∇f(x_k), x_k − x∗⟩ + (1/(2L²))‖∇f(x_k)‖².

Combining this with the gradient descent guarantee (1) results in

⟨∇f(x_k), x_k − x∗⟩ ≤ (L/2)‖x_k − x∗‖² − (L/2)‖x_{k+1} − x∗‖² + f(x_k) − f(x_{k+1}).

Denote ε_k = f(x_k) − f∗. The above inequality combined with the definition of α-WQC shows that

ε_k ≤ (1/α)⟨∇f(x_k), x_k − x∗⟩ ≤ (1/α)[(L/2)‖x_k − x∗‖² − (L/2)‖x_{k+1} − x∗‖² + ε_k − ε_{k+1}].

Summing it up for k = 0, …, T results in

Σ_{k=0}^T ε_k ≤ (1/α)[(L/2)R² − (L/2)‖x_{T+1} − x∗‖² + ε_0 − ε_{T+1}] ≤ (1/α)[(L/2)R² + ε_0].

By L-smoothness of f we also have ε_0 ≤ (L/2)R². Since {ε_k}_{k=0}^∞ is nonincreasing, we have

(T + 1)ε_T ≤ LR²/α,

which is exactly the statement of the theorem. ⊓⊔

In [3] it is also noted that in the case of non-smooth objectives gradient descent retains its slow convergence rate ε_k = O(1/√k).

Consider a quadratic minimization problem:

(1/2)⟨x, Ax⟩ + ⟨b, x⟩ → min, x ∈ ℝⁿ.

In the conjugate gradient method for quadratic objectives an optimal convergence rate is achieved by using an A-orthogonal set of descent directions {p_i}_{i=0}^{n−1}: ⟨p_i, Ap_j⟩ = 0, i ≠ j [4].

In 2005 Guy Narkiss and Michael Zibulevsky [8] presented a first-order method, SESOP. It may be viewed as a generalization of the conjugate gradient method, as the steps this method makes at each iteration are constructed to satisfy some orthogonality conditions. In this section we will demonstrate that this method retains its O(1/k²) convergence rate for α-WQC L-smooth functions, and the proof of this fact only slightly differs from the original proof in [8].

Let D_k (k ≥ 1) be an n × 3 matrix (n is the dimensionality of the objective's domain), the columns of which are the following vectors:

d_k⁰ = ∇f(x_k),
d_k¹ = x_k − x_0,
d_k² = Σ_{i=0}^k ω_i ∇f(x_i),

where ω_0 = 1 and ω_i = (1 + √(1 + 4ω_{i−1}²))/2 for i > 0. With D_k defined this way, the algorithm takes the following form:

Algorithm 1:
SESOP(f, x_0, T)
Input: The objective function f, initial point x_0, number of iterations T
Output: Approximate solution x_T

for k = 0 to T − 1 do
    τ_k ← argmin_{τ ∈ ℝ³} f(x_k + D_k τ)
    x_{k+1} ← x_k + D_k τ_k
end
return x_T
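For a quadratic objective the three-dimensional inner problem argmin_τ f(x_k + D_k τ) has a closed-form solution, which allows a compact sketch of SESOP (our own illustration; for a general objective this argmin would require a numerical inner solver):

```python
import numpy as np

def sesop_quadratic(A, b, x0, T):
    """SESOP sketch for f(x) = 0.5 * <x, A x> - <b, x> with A symmetric PD.

    The subproblem min_tau f(x_k + D_k tau) reduces to the 3x3 system
    (D^T A D) tau = D^T (b - A x); lstsq handles the rank-deficient
    cases (e.g. k = 0, where the column x_k - x_0 is zero)."""
    x = x0.astype(float).copy()
    g_acc = np.zeros_like(x)      # running sum of omega_i * grad f(x_i)
    omega = 1.0                   # omega_0 = 1
    for k in range(T):
        g = A @ x - b                               # d_k^0 = grad f(x_k)
        g_acc = g_acc + omega * g                   # d_k^2
        D = np.column_stack([g, x - x0, g_acc])     # columns d_k^0, d_k^1, d_k^2
        tau, *_ = np.linalg.lstsq(D.T @ A @ D, D.T @ (b - A @ x), rcond=None)
        x = x + D @ tau
        omega = (1 + np.sqrt(1 + 4 * omega**2)) / 2  # omega_k^2 - omega_k = omega_{k-1}^2
    return x

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
A = Q @ np.diag([1.0, 2.0, 3.0, 4.0, 5.0]) @ Q.T   # spectrum in [1, 5]
b = rng.standard_normal(5)
x = sesop_quadratic(A, b, np.zeros(5), T=60)
print(np.linalg.norm(x - np.linalg.solve(A, b)))
```

Since the gradient always lies in the subspace spanned by D_k, each iteration is at least as good as an exact line search along the gradient, which is what guarantee (2) below formalizes.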
Theorem 2
Let the objective function f be L-smooth and α-WQC with respect to x∗ ∈ X∗. Then SESOP(f, x_0, T) generates a sequence {x_k}_{k=0}^∞ from a starting point x_0 such that

f(x_k) − f∗ ≤ 2LR²/(α²k²),

where R = ‖x∗ − x_0‖.

Proof
Since ∇f(x_k) belongs to the set of directions generated by D_k, we can use the following guarantee of gradient descent with fixed step length for L-smooth functions:

f(x_{k+1}) = min_{s ∈ ℝ³} f(x_k + D_k s) ≤ f(x_k − (1/L)∇f(x_k)) ≤ f(x_k) − ‖∇f(x_k)‖²/(2L). (2)

The definition of α-WQC may be rewritten as follows:

f(x_k) − f∗ ≤ (1/α)⟨∇f(x_k), x_k − x∗⟩. (3)

By the construction of x_k, we have that x_k is a minimizer of f on the subspace containing the directions x_k − x_{k−1} and x_{k−1} − x_0, which means that ∇f(x_k) ⊥ x_k − x_0. This in turn allows us to write

f(x_k) − f∗ ≤ (1/α)⟨∇f(x_k), x_0 − x∗⟩

instead of (3). Take a weighted sum over k = 0, …, T − 1, T ∈ ℕ, with the weights ω_k defined above:

Σ_{k=0}^{T−1} ω_k(f(x_k) − f∗) ≤ (1/α)⟨Σ_{k=0}^{T−1} ω_k∇f(x_k), x_0 − x∗⟩ ≤ (1/α)‖Σ_{k=0}^{T−1} ω_k∇f(x_k)‖ R. (4)

Since x_k is also a minimizer on the subspace containing the direction Σ_{i=0}^{k−1} ω_i∇f(x_i), we may write ∇f(x_k) ⊥ Σ_{i=0}^{k−1} ω_i∇f(x_i). Using (2) and the Pythagorean theorem, we get

‖Σ_{k=0}^{T−1} ω_k∇f(x_k)‖² = Σ_{k=0}^{T−1} ω_k²‖∇f(x_k)‖² ≤ 2L Σ_{k=0}^{T−1} ω_k²(f(x_k) − f(x_{k+1})).

Note that our choice of ω_k is equivalent to choosing the greatest ω_k satisfying

ω_0 = 1, ω_k² − ω_k ≤ ω_{k−1}² for k > 0.

Returning to (4) and denoting ε_k = f(x_k) − f∗, we get

S := Σ_{k=0}^{T−1} ω_k ε_k ≤ √(2LR²/α²) · √(Σ_{k=0}^{T−1} ω_k²(ε_k − ε_{k+1}))
= √(2LR²/α²) · √(ε_0ω_0² + ε_1(ω_1² − ω_0²) + … + ε_{T−1}(ω_{T−1}² − ω_{T−2}²) − ε_Tω_{T−1}²)
= √(2LR²/α²) · √(S − ε_Tω_{T−1}²).
Rewriting that, we get

ω_{T−1}² ε_T ≤ S − α²S²/(2LR²).

Maximizing the right-hand side of this inequality over S and noting that ω_k ≥ (k + 1)/2 (which may be proven by induction), we obtain

ε_T ≤ 2LR²/(α²T²).

It remains to notice that T is an arbitrary natural number. ⊓⊔

In this section we will generalize the method of Arkadi Nemirovski presented in [9] to the class of L-smooth α-weakly-quasi-convex functions satisfying the quadratic growth condition with constant µ > 0. As in the previous section, this generalization is quite straightforward.
Algorithm 2:
CG(f, x_0, T)
Input: The objective function f, initial point x_0, number of iterations T
Output: Approximate solution x_T

q_0 ← 0
for k = 0 to T − 1 do
    E_k ← x_0 + Lin{x_k − x_0, q_k}
    x̂_k ← argmin_{x ∈ E_k} f(x)
    x_{k+1} ← x̂_k − (1/L)∇f(x̂_k)
    q_{k+1} ← q_k + ∇f(x̂_k)
end
return x_T
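As with SESOP, the two-dimensional subproblem over E_k is solvable in closed form when f is quadratic, which gives the following sketch (our own illustration, not from the paper):

```python
import numpy as np

def nemirovski_cg(A, b, x0, L, T):
    """Sketch of Algorithm 2 for f(x) = 0.5 * <x, A x> - <b, x>, A symmetric PD.

    The plane E_k = x0 + Lin{x_k - x0, q_k} is parametrized as x0 + U s, so
    the argmin over E_k solves the 2x2 system (U^T A U) s = U^T (b - A x0)."""
    x = x0.astype(float).copy()
    q = np.zeros_like(x)
    for k in range(T):
        U = np.column_stack([x - x0, q])
        s, *_ = np.linalg.lstsq(U.T @ A @ U, U.T @ (b - A @ x0), rcond=None)
        x_hat = x0 + U @ s           # argmin of f over E_k
        g = A @ x_hat - b            # grad f at x_hat_k
        x = x_hat - g / L            # gradient step with step length 1/L
        q = q + g                    # q_{k+1} = q_k + grad f(x_hat_k)
    return x

rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
A = Q @ np.diag([1.0, 2.0, 3.0, 4.0]) @ Q.T   # mu = 1, L = 4
b = rng.standard_normal(4)
x = nemirovski_cg(A, b, np.zeros(4), L=4.0, T=120)
print(np.linalg.norm(x - np.linalg.solve(A, b)))
```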
Theorem 3
Let f be an L-smooth and α-WQC with respect to P(x_0) (the projection of x_0 onto X∗) function satisfying the quadratic growth condition with constant µ > 0. Then CG(f, x_0, T) returns x_T such that

f(x_T) − f∗ ≤ (3/4)(f(x_0) − f∗),

where T = ⌈(8/(3α))√(L/µ)⌉.

Proof
Denote x∗ = P(x_0). Assume ε_T > (3/4)ε_0, which also implies ε_k > (3/4)ε_0 for k = 1, …, T, since

f(x_0) ≥ f(x̂_0) ≥ f(x_1) ≥ f(x̂_1) ≥ … ≥ f(x_T).

Our assumption also implies that ε_0 > 0.

The gradient descent guarantee

f(x_{k+1}) ≤ f(x̂_k) − ‖∇f(x̂_k)‖²/(2L)

leads us to

‖∇f(x̂_k)‖² ≤ 2L(f(x̂_k) − f(x_{k+1})) ≤ 2L(f(x_k) − f(x_{k+1})). (5)

Telescoping (5) for k = 0, …, T − 1, we obtain

Σ_{k=0}^{T−1} ‖∇f(x̂_k)‖² ≤ 2L(ε_0 − ε_T) ≤ 2Lε_0. (6)

By the definition of x̂_k, ∇f(x̂_k) ⊥ x̂_k − x_0. This allows us to use α-WQC in the following way:

⟨∇f(x̂_k), x∗ − x_0⟩ = ⟨∇f(x̂_k), x∗ − x̂_k⟩ ≤ α(f∗ − f(x̂_k)) ≤ −(3/4)αε_0. (7)

Now telescoping (7) for k = 0, …, T − 1,

−‖q_T‖ ‖x∗ − x_0‖ ≤ ⟨q_T, x∗ − x_0⟩ < −(3/4)Tαε_0.

This inequality will allow us to obtain an upper bound on T which contradicts the theorem's statement. All that remains is to get upper bounds on ‖q_T‖ and ‖x∗ − x_0‖. Again, by the definition of x̂_k, ∇f(x̂_k) ⊥ q_k. By the Pythagorean theorem and (6),

‖q_T‖ = (Σ_{k=0}^{T−1} ‖∇f(x̂_k)‖²)^{1/2} ≤ √(2Lε_0).

Quadratic growth, on the other hand, implies the following upper bound:

‖x_0 − x∗‖ ≤ √(2ε_0/µ).

Finally,

−√(2ε_0/µ) · √(2Lε_0) < −(3/4)Tαε_0,

or

T < (8/(3α))√(L/µ).
This contradicts our choice of T = ⌈(8/(3α))√(L/µ)⌉. ⊓⊔

This result shows that if f were α-WQC with respect to P(x) for all x ∈ ℝⁿ, we would be able to apply a restarting technique to this method. To be more precise, under such circumstances it is possible to achieve an accuracy of ε by performing O(log(ε_0/ε)) cycles of ⌈(8/(3α))√(L/µ)⌉ iterations and using the output of each cycle as input for the next one. This means that by using Nemirovski's conjugate gradient method we may get a point y such that f(y) − f∗ ≤ ε in O((1/α)√(L/µ) log(ε_0/ε)) iterations.

Even though the SESOP and CG methods presented above are optimal in terms of the number of iterations required to achieve the desired accuracy, each iteration involves solving a subproblem over ℝ³ or ℝ². However, since all the conditions replacing convexity and strong convexity in our paper involve some global minimizer x∗, which may not belong to the domain of any of these subproblems, the subproblems may be general non-convex optimization problems. Not only are such problems much more difficult than convex ones, the above convergence analyses also relied on these subproblems being solved exactly.

In all of the methods analysed in this paper the subspace optimization step played the key role in allowing us to generalize these methods to our more general setting. It is as of yet unknown to the authors of this paper whether any fast gradient methods not involving subspace optimization, with guaranteed convergence for α-WQC objectives with α ∈ (0, 1], exist. However, such a method for 1-WQC problems is presented in [11] (see pp. 12–14).
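The restarting scheme described above can be sketched as follows (our own illustration on a quadratic objective, reusing the closed-form subspace step; the cycle length uses the constant 8/(3α) as reconstructed in Theorem 3):

```python
import numpy as np

def cg_cycle(A, b, x0, L, T):
    # One cycle of Algorithm 2 on f(x) = 0.5 * <x, A x> - <b, x>;
    # the 2-D subproblem over E_k is solved in closed form via lstsq.
    x = x0.astype(float).copy()
    q = np.zeros_like(x)
    for _ in range(T):
        U = np.column_stack([x - x0, q])
        s, *_ = np.linalg.lstsq(U.T @ A @ U, U.T @ (b - A @ x0), rcond=None)
        x_hat = x0 + U @ s
        g = A @ x_hat - b
        x, q = x_hat - g / L, q + g
    return x

def restarted_cg(A, b, x0, L, mu, alpha, n_cycles):
    # Theorem 3: each cycle of T = ceil(8/(3 alpha) * sqrt(L/mu)) iterations
    # reduces f(x) - f* by a factor of at least 3/4; warm-start each cycle.
    T = int(np.ceil(8 / (3 * alpha) * np.sqrt(L / mu)))
    x = x0
    for _ in range(n_cycles):
        x = cg_cycle(A, b, x, L, T)
    return x

rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
A = Q @ np.diag([1.0, 2.0, 3.0, 4.0]) @ Q.T   # mu = 1, L = 4
b = rng.standard_normal(4)
x = restarted_cg(A, b, np.zeros(4), L=4.0, mu=1.0, alpha=1.0, n_cycles=30)
print(np.linalg.norm(x - np.linalg.solve(A, b)))
```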
Acknowledgements
The authors would like to thank Arkadi Nemirovski for some important remarks. This work was partially supported by RNF grant 17-11-01027.
References
1. Mihai Anitescu. Degenerate nonlinear programming with a quadratic growth condition. SIAM Journal on Optimization, 10(4):1116–1135, 2000.
2. Joseph Frédéric Bonnans and Alexander Ioffe. Second-order sufficiency and quadratic growth for nonisolated minima. Mathematics of Operations Research, 20(4):801–817, 1995.
3. Sebastien Bubeck. Geometry of linearized neural networks. https://blogs.princeton.edu/imabandit/2016/11/13/geometry-of-linearized-neural-networks/, 2016. Accessed 1 November 2017.
4. Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
5. Dominik Csiba and Peter Richtárik. Global convergence of arbitrary-block gradient methods for generalized Polyak–Łojasiewicz functions. arXiv preprint arXiv:1709.03014, 2017.
6. Moritz Hardt, Tengyu Ma, and Benjamin Recht. Gradient descent learns linear dynamical systems. CoRR, abs/1609.05191, 2016.
7. Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak–Łojasiewicz condition. In European Conference on Machine Learning and Knowledge Discovery in Databases, Volume 9851, ECML PKDD 2016, pages 795–811, New York, NY, USA, 2016. Springer-Verlag New York, Inc.
8. Guy Narkiss and Michael Zibulevsky. Sequential subspace optimization method for large-scale unconstrained problems. Technion-IIT, Department of Electrical Engineering, 2005.
9. A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Optimization Method Efficiency. M.: Nauka, 1979.
10. Yurii Nesterov and Boris T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
11. A. Tyurin. Mirror version of similar triangles method for constrained optimization problems.