Accelerated Methods for α-Weakly-Quasi-Convex Problems
Sergey Guminov · Alexander Gasnikov

Abstract
Convexity of the objective function often allows one to guarantee much better convergence rates of iterative minimization methods than in the general non-convex case. However, many problems encountered in training neural networks are non-convex. Some of them satisfy conditions weaker than convexity, but which are still sufficient to guarantee the convergence of some first-order methods. In this work we present a condition to replace convexity and show that gradient descent with fixed step length retains its convergence rate under this condition. We show that the sequential subspace optimization method is optimal in terms of oracle complexity in this case. We also provide a substitute for strong convexity which is sufficient to guarantee the same convergence rate as in the strongly convex case for this new class of generally non-convex functions.
Keywords
Non-convex minimization · First-order methods · Accelerated methods · Global optimization
Introduction
Convexity of the objective function is a natural property often used to prove the convergence of iterative methods of optimization. One of the main qualities of convex functions is that they have no non-global local minima, which makes optimizing such objectives considerably easier. A new condition called α-weak-quasi-convexity was recently proposed by Hardt et al. in [6] in relation to a machine learning problem. In this paper we are going to show that this much weaker condition is still sufficient to guarantee the convergence of some iterative methods.

All the conditions used in this paper will be formally defined in section 2. In section 3 we will show that α-weak-quasi-convexity and smoothness are sufficient to guarantee the convergence of gradient descent with fixed step length. In section 4 we will demonstrate a method which retains its convergence rate (optimal, in terms of the number of iterations, for the class of smooth convex objectives) under our weaker assumptions. In section 5 we will generalize an optimal method of smooth strongly convex optimization to a particular subclass of α-weakly-quasi-convex problems.

Sergey Guminov
Moscow Institute of Physics and Technology, Moscow
E-mail: [email protected]

Alexander Gasnikov
Moscow Institute of Physics and Technology, Moscow; Institute for Information Transmission Problems RAS, Moscow
E-mail: [email protected]

0.1 Preliminaries

Throughout this paper we will be dealing with the problem

f(x) → min, x ∈ ℝⁿ.

f : F → ℝ, where F is a closed and convex domain of f(x), is assumed to be differentiable and L-smooth:

‖∇f(y) − ∇f(x)‖ ≤ L‖y − x‖ ∀x, y ∈ F,

where ‖·‖ is the Euclidean norm and ⟨·, ·⟩ denotes the scalar product defined as ⟨x, y⟩ = Σ_{i=1}^n x_i y_i. We will also assume that the solution set X∗ is not empty and denote f∗ = min_{x ∈ ℝⁿ} f(x).
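To make the smoothness condition concrete, here is a small numerical sketch (our own illustration, not from the paper): for a quadratic f(x) = (1/2)⟨x, Ax⟩ the smallest valid gradient-Lipschitz constant is the largest eigenvalue of A.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M.T @ M                          # symmetric positive semidefinite Hessian
L = np.linalg.eigvalsh(A).max()      # smallest valid gradient-Lipschitz constant

def grad(x):
    return A @ x                     # gradient of f(x) = 0.5 * <x, A x>

# ||grad(y) - grad(x)|| <= L * ||y - x|| should hold for every pair of points
for _ in range(100):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    assert np.linalg.norm(grad(y) - grad(x)) <= L * np.linalg.norm(y - x) + 1e-9
print("L =", round(float(L), 3))
```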
In our work we will be using a relaxation of convexity called α-weak-quasi-convexity, as defined in [6].

Definition 1
A function f is said to be α-weakly-quasi-convex (α-WQC) with respect to x∗ ∈ X∗ with constant α ∈ (0, 1] if for all x ∈ F

α(f(x) − f∗) ≤ ⟨∇f(x), x − x∗⟩.

α-weak-quasi-convexity guarantees that any local minimizer of f is also a global minimizer. Simply put, this condition says that the tangent plane to the function's graph constructed at any point is not much higher than its minimum. Any convex function with a non-empty solution set is also 1-WQC, but the converse is generally not true. The function f(x) = |x|(1 − e^{−|x|}), x ∈ ℝ, is one example of a non-convex 1-WQC function.
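The 1-WQC inequality for this example (with x∗ = 0 and f∗ = 0) is easy to verify numerically; a quick sketch of our own:

```python
import numpy as np

def f(x):
    return abs(x) * (1 - np.exp(-abs(x)))   # non-convex but 1-WQC, minimizer x* = 0

def df(x):
    # derivative for x != 0: sign(x) * (1 - exp(-|x|) + |x| exp(-|x|))
    return np.sign(x) * (1 - np.exp(-abs(x)) + abs(x) * np.exp(-abs(x)))

# check  1 * (f(x) - f*) <= <f'(x), x - x*>  with x* = 0, f* = 0
for x in np.linspace(-10, 10, 2001):
    if x != 0:
        assert f(x) <= df(x) * x + 1e-12
print("1-WQC inequality holds on the sampled grid")
```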
To weaken strong convexity we will be using the quadratic growth (QG) condition [1, 2].

Definition 2
A function f(x) is said to satisfy the quadratic growth condition if for some µ > 0 and all x ∈ F

f(x) − f∗ ≥ (µ/2)‖x − P(x)‖²,

where P(x) is the projection of x onto X∗.

Note that the same condition appears in [10], formulated as a property of the solution set: the solution set X∗ of f(x) is called globally non-degenerate if it satisfies the inequality in the definition of QG. Though this condition shares some similarities with strong convexity, a non-convex function may still satisfy it. The function f(x) = (‖x‖² − 1)² serves as an example [10].

1.1 Relationship with other conditions

Naturally, α-weak-quasi-convexity and the QG condition are not the only ways to weaken convexity and strong convexity. What follows is a short list of other similar conditions and their relationships with the ones used in this paper.

Let us define the Polyak–Łojasiewicz condition, another condition used to replace strong convexity in convergence arguments.

Definition 3
A function f(x) is said to satisfy the Polyak–Łojasiewicz condition if for some µ > 0 and all x ∈ F

‖∇f(x)‖² ≥ 2µ(f(x) − f∗).

As shown in [7], QG is weaker than the Polyak–Łojasiewicz condition.

Following [10], we define star-convexity.
Definition 4
We call a function f(x) star-convex if for any x∗ ∈ X∗ and any x ∈ F we have

f(λx∗ + (1 − λ)x) ≤ λf(x∗) + (1 − λ)f(x) ∀λ ∈ [0, 1].

This definition is ideologically similar to that of α-WQC in the way it restricts convexity to the direction towards the solution set X∗. In fact, for α = 1 these two definitions are equivalent.

Lemma 1
1-WQC ⇔ star-convexity.

Proof
⇒ Let us assume that f(x) is not star-convex: ∃λ ∈ (0, 1), x ∈ F s.t.

f(λx∗ + (1 − λ)x) − λf(x∗) − (1 − λ)f(x) > 0.

By maximizing the LHS of the above inequality over λ we get that for some λ∗ ∈ (0, 1) and z = λ∗x∗ + (1 − λ∗)x

⟨∇f(z), x − x∗⟩ = f(x) − f(x∗)

and

f(z) > λ∗f(x∗) + (1 − λ∗)f(x).

Now we note that x − x∗ = (z − x∗)/(1 − λ∗) and

f(z) − f(x∗) > (1 − λ∗)(f(x) − f(x∗)).

This in turn implies that

⟨∇f(z), z − x∗⟩ = (1 − λ∗)⟨∇f(z), x − x∗⟩ = (1 − λ∗)(f(x) − f(x∗)) < f(z) − f(x∗),

so f(x) is not 1-WQC.

⇐ If f is star-convex, then

f(x∗) − f(x) ≥ (f(x + λ(x∗ − x)) − f(x))/λ ∀λ ∈ (0, 1], x ∈ F.

Taking the limit λ → +0 we obtain f(x∗) − f(x) ≥ ⟨∇f(x), x∗ − x⟩, so f(x) is 1-WQC. ⊓⊔

Another condition was recently introduced in [5] to generalize convexity. Called the weak PL inequality, in our notation it may be defined as follows.
Definition 5
A function f(x) is said to satisfy the weak PL inequality with respect to x∗ ∈ X∗ if for some µ > 0 and all x ∈ F

√(2µ) (f(x) − f∗) ≤ ‖∇f(x)‖ ‖x − x∗‖.

It immediately follows from the Cauchy–Schwarz inequality that the weak PL inequality is weaker than α-WQC.

From here on, F = ℝⁿ. One of the first questions arising whenever a new condition is proposed to replace convexity is whether it is sufficient to guarantee the convergence of the gradient descent method. Fortunately, this is the case with α-WQC. Given an objective f, consider the sequence {x_k}_{k=0}^∞ generated by the following rule:

x_{k+1} = x_k − (1/L)∇f(x_k).

This is the sequence of points generated by gradient descent with step length 1/L. It is well known that for a convex L-smooth objective f

f(x_k) − f∗ = O(LR²/k),

where R = ‖x_0 − x∗‖. We will now prove a similar result for α-WQC objectives.
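As a quick numerical illustration (our own sketch, not part of the paper), here is fixed-step gradient descent on the non-convex 1-WQC example f(x) = |x|(1 − e^{−|x|}), which is L-smooth with L = 2 on ℝ since |f''| ≤ 2:

```python
import numpy as np

def f(x):
    return abs(x) * (1 - np.exp(-abs(x)))    # non-convex, 1-WQC, f* = 0 at x* = 0

def df(x):
    return np.sign(x) * (1 - np.exp(-abs(x)) + abs(x) * np.exp(-abs(x)))

L = 2.0                     # |f''(x)| <= 2 everywhere, so f is L-smooth with L = 2
x0 = 3.0
x = x0
T = 200
for k in range(T):
    x = x - df(x) / L       # fixed step length 1/L
R = abs(x0 - 0.0)
# compare against the O(L R^2 / (alpha T)) rate proved next (alpha = 1 here)
print(f(x), "<=", L * R**2 / (T + 1))
```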
Theorem 1
Let the objective function f be L-smooth and α-WQC with respect to x∗ ∈ X∗. Then the sequence {x_k}_{k=0}^∞ generated by the gradient descent method from a starting point x_0 satisfies

f(x_T) − f∗ ≤ LR²/(α(T + 1)),

where R = ‖x_0 − x∗‖.
Proof
Any L-smooth function f satisfies

f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖y − x‖² ∀x, y ∈ ℝⁿ.

Setting x = x_k, y = x_{k+1}, we obtain

f(x_{k+1}) − f(x_k) ≤ −(1/(2L))‖∇f(x_k)‖². (1)

This shows that the sequence {f(x_k)}_{k=0}^∞ is nonincreasing. We also have

(1/2)‖x_{k+1} − x∗‖² = (1/2)‖x_k − x∗‖² − (1/L)⟨∇f(x_k), x_k − x∗⟩ + (1/(2L²))‖∇f(x_k)‖².

Combining this with the gradient descent guarantee (1) results in

⟨∇f(x_k), x_k − x∗⟩ ≤ (L/2)‖x_k − x∗‖² − (L/2)‖x_{k+1} − x∗‖² + f(x_k) − f(x_{k+1}).

Denote ε_k = f(x_k) − f∗. The above inequality combined with the definition of α-WQC shows that

ε_k ≤ (1/α)⟨∇f(x_k), x_k − x∗⟩ ≤ (1/α)[(L/2)‖x_k − x∗‖² − (L/2)‖x_{k+1} − x∗‖² + ε_k − ε_{k+1}].

Summing it up for k = 0, …, T results in

Σ_{k=0}^T ε_k ≤ (1/α)[(L/2)R² − (L/2)‖x_{T+1} − x∗‖² + ε_0 − ε_{T+1}] ≤ (1/α)[(L/2)R² + ε_0].

By L-smoothness of f we also have ε_0 ≤ (L/2)R². Since {ε_k}_{k=0}^∞ is nonincreasing, we have

(T + 1)ε_T ≤ LR²/α,

which is exactly the statement of the theorem. ⊓⊔

In [3] it is also noted that in the case of non-smooth objectives gradient descent retains its slow convergence rate ε_k = O(1/√k).

Consider a quadratic minimization problem:

(1/2)⟨x, Ax⟩ + ⟨b, x⟩ → min, x ∈ ℝⁿ.

In the conjugate gradient method for quadratic objectives an optimal convergence rate is achieved by using an A-orthogonal set of descent directions {p_i}_{i=0}^{n−1}: ⟨p_i, Ap_j⟩ = 0, i ≠ j [4].

In 2005 Guy Narkiss and Michael Zibulevsky [8] presented a first-order method, SESOP. It may be viewed as a generalization of the conjugate gradient method, as the steps this method makes at each iteration are constructed to satisfy some orthogonality conditions. In this section we will demonstrate that this method retains its O(1/k²) convergence rate for α-WQC L-smooth functions, and the proof of this fact only slightly differs from the original proof in [8].

Let D_k (k ≥ 1) be an n × 3 matrix (n is the dimensionality of the objective's domain), the columns of which are the following vectors:

d_k⁰ = ∇f(x_k),
d_k¹ = x_k − x_0,
d_k² = Σ_{i=0}^k ω_i ∇f(x_i),

where ω_0 = 1 and ω_i = (1 + √(1 + 4ω_{i−1}²))/2 for i > 0. With D_k defined this way, the algorithm takes the following form:

Algorithm 1:
SESOP(f, x_0, T)
Input: The objective function f, initial point x_0, number of iterations T
Output: Approximate solution x_T

for k = 0 to T − 1 do
    τ_k ← argmin_{τ ∈ ℝ³} f(x_k + D_k τ)
    x_{k+1} ← x_k + D_k τ_k
end
return x_T
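For a quadratic objective the three-dimensional inner problem argmin_τ f(x_k + D_k τ) has a closed-form solution, which allows a compact sketch of SESOP (our own illustration; for a general objective this argmin would require a numerical inner solver):

```python
import numpy as np

def sesop_quadratic(A, b, x0, T):
    """SESOP sketch for f(x) = 0.5 * <x, A x> - <b, x> with A symmetric PD.

    The subproblem min_tau f(x_k + D_k tau) reduces to the 3x3 system
    (D^T A D) tau = D^T (b - A x); lstsq handles the rank-deficient
    cases (e.g. k = 0, where the column x_k - x_0 is zero)."""
    x = x0.astype(float).copy()
    g_acc = np.zeros_like(x)      # running sum of omega_i * grad f(x_i)
    omega = 1.0                   # omega_0 = 1
    for k in range(T):
        g = A @ x - b                               # d_k^0 = grad f(x_k)
        g_acc = g_acc + omega * g                   # d_k^2
        D = np.column_stack([g, x - x0, g_acc])     # columns d_k^0, d_k^1, d_k^2
        tau, *_ = np.linalg.lstsq(D.T @ A @ D, D.T @ (b - A @ x), rcond=None)
        x = x + D @ tau
        omega = (1 + np.sqrt(1 + 4 * omega**2)) / 2  # omega_k^2 - omega_k = omega_{k-1}^2
    return x

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
A = Q @ np.diag([1.0, 2.0, 3.0, 4.0, 5.0]) @ Q.T   # spectrum in [1, 5]
b = rng.standard_normal(5)
x = sesop_quadratic(A, b, np.zeros(5), T=60)
print(np.linalg.norm(x - np.linalg.solve(A, b)))
```

Since the gradient always lies in the subspace spanned by D_k, each iteration is at least as good as an exact line search along the gradient, which is what guarantee (2) below formalizes.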
Theorem 2
Let the objective function f be L-smooth and α-WQC with respect to x∗ ∈ X∗. Then SESOP(f, x_0, T) generates a sequence {x_k}_{k=0}^∞ from a starting point x_0 such that

f(x_k) − f∗ ≤ 2LR²/(α²k²),

where R = ‖x∗ − x_0‖.

Proof
Since ∇f(x_k) belongs to the set of directions generated by D_k, we can use the following guarantee of gradient descent with fixed step length for L-smooth functions:

f(x_{k+1}) = min_{s ∈ ℝ³} f(x_k + D_k s) ≤ f(x_k − (1/L)∇f(x_k)) ≤ f(x_k) − ‖∇f(x_k)‖²/(2L). (2)

The definition of α-WQC may be rewritten as follows:

f(x_k) − f∗ ≤ (1/α)⟨∇f(x_k), x_k − x∗⟩. (3)

By the construction of x_k, we have that x_k is a minimizer of f on the subspace containing the directions x_k − x_{k−1} and x_{k−1} − x_0, which means that ∇f(x_k) ⊥ x_k − x_0. This in turn allows us to write

f(x_k) − f∗ ≤ (1/α)⟨∇f(x_k), x_0 − x∗⟩

instead of (3). Take a weighted sum over k = 0, …, T − 1, T ∈ ℕ, with the weights ω_k defined above:

Σ_{k=0}^{T−1} ω_k(f(x_k) − f∗) ≤ (1/α)⟨Σ_{k=0}^{T−1} ω_k∇f(x_k), x_0 − x∗⟩ ≤ (1/α)‖Σ_{k=0}^{T−1} ω_k∇f(x_k)‖ R. (4)

Since x_k is also a minimizer on the subspace containing the direction Σ_{i=0}^{k−1} ω_i∇f(x_i), we may write ∇f(x_k) ⊥ Σ_{i=0}^{k−1} ω_i∇f(x_i). Using (2) and the Pythagorean theorem, we get

‖Σ_{k=0}^{T−1} ω_k∇f(x_k)‖² = Σ_{k=0}^{T−1} ω_k²‖∇f(x_k)‖² ≤ 2L Σ_{k=0}^{T−1} ω_k²(f(x_k) − f(x_{k+1})).

Note that our choice of ω_k is equivalent to choosing the greatest ω_k satisfying

ω_0 = 1, ω_k² − ω_k ≤ ω_{k−1}² for k > 0.

Returning to (4) and denoting ε_k = f(x_k) − f∗, we get

S := Σ_{k=0}^{T−1} ω_k ε_k ≤ √(2LR²/α²) · √(Σ_{k=0}^{T−1} ω_k²(ε_k − ε_{k+1}))
= √(2LR²/α²) · √(ε_0ω_0² + ε_1(ω_1² − ω_0²) + … + ε_{T−1}(ω_{T−1}² − ω_{T−2}²) − ε_Tω_{T−1}²)
= √(2LR²/α²) · √(S − ε_Tω_{T−1}²).
Rewriting that, we get

ω_{T−1}² ε_T ≤ S − α²S²/(2LR²).

Maximizing the right-hand side of this inequality over S and noting that ω_k ≥ (k + 1)/2 (which may be proven by induction), we obtain

ε_T ≤ 2LR²/(α²T²).

It remains to notice that T is an arbitrary natural number. ⊓⊔

In this section we will generalize the method of Arkadi Nemirovski presented in [9] to the class of L-smooth α-weakly-quasi-convex functions satisfying the quadratic growth condition with constant µ > 0. As in the previous section, this generalization is quite straightforward.
Algorithm 2:
CG(f, x_0, T)
Input: The objective function f, initial point x_0, number of iterations T
Output: Approximate solution x_T

q_0 ← 0
for k = 0 to T − 1 do
    E_k ← x_0 + Lin{x_k − x_0, q_k}
    x̂_k ← argmin_{x ∈ E_k} f(x)
    x_{k+1} ← x̂_k − (1/L)∇f(x̂_k)
    q_{k+1} ← q_k + ∇f(x̂_k)
end
return x_T
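As with SESOP, the two-dimensional subproblem over E_k is solvable in closed form when f is quadratic, which gives the following sketch (our own illustration, not from the paper):

```python
import numpy as np

def nemirovski_cg(A, b, x0, L, T):
    """Sketch of Algorithm 2 for f(x) = 0.5 * <x, A x> - <b, x>, A symmetric PD.

    The plane E_k = x0 + Lin{x_k - x0, q_k} is parametrized as x0 + U s, so
    the argmin over E_k solves the 2x2 system (U^T A U) s = U^T (b - A x0)."""
    x = x0.astype(float).copy()
    q = np.zeros_like(x)
    for k in range(T):
        U = np.column_stack([x - x0, q])
        s, *_ = np.linalg.lstsq(U.T @ A @ U, U.T @ (b - A @ x0), rcond=None)
        x_hat = x0 + U @ s           # argmin of f over E_k
        g = A @ x_hat - b            # grad f at x_hat_k
        x = x_hat - g / L            # gradient step with step length 1/L
        q = q + g                    # q_{k+1} = q_k + grad f(x_hat_k)
    return x

rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
A = Q @ np.diag([1.0, 2.0, 3.0, 4.0]) @ Q.T   # mu = 1, L = 4
b = rng.standard_normal(4)
x = nemirovski_cg(A, b, np.zeros(4), L=4.0, T=120)
print(np.linalg.norm(x - np.linalg.solve(A, b)))
```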
Theorem 3
Let f be an L-smooth and α-WQC with respect to P(x_0) (the projection of x_0 onto X∗) function satisfying the quadratic growth condition with constant µ > 0. Then CG(f, x_0, T) returns x_T such that

f(x_T) − f∗ ≤ (3/4)(f(x_0) − f∗),

where T = ⌈(8/(3α))√(L/µ)⌉.

Proof
Denote x∗ = P(x_0). Assume ε_T > (3/4)ε_0, which also implies ε_k > (3/4)ε_0 for k = 1, …, T, since

f(x_0) ≥ f(x̂_0) ≥ f(x_1) ≥ f(x̂_1) ≥ … ≥ f(x_T).

Our assumption also implies that ε_0 > 0.

The gradient descent guarantee

f(x_{k+1}) ≤ f(x̂_k) − ‖∇f(x̂_k)‖²/(2L)

leads us to

‖∇f(x̂_k)‖² ≤ 2L(f(x̂_k) − f(x_{k+1})) ≤ 2L(f(x_k) − f(x_{k+1})). (5)

Telescoping (5) for k = 0, …, T − 1, we obtain

Σ_{k=0}^{T−1} ‖∇f(x̂_k)‖² ≤ 2L(ε_0 − ε_T) ≤ 2Lε_0. (6)

By the definition of x̂_k, ∇f(x̂_k) ⊥ x̂_k − x_0. This allows us to use α-WQC in the following way:

⟨∇f(x̂_k), x∗ − x_0⟩ = ⟨∇f(x̂_k), x∗ − x̂_k⟩ ≤ α(f∗ − f(x̂_k)) ≤ −(3/4)αε_0. (7)

Now telescoping (7) for k = 0, …, T − 1,

−‖q_T‖ ‖x∗ − x_0‖ ≤ ⟨q_T, x∗ − x_0⟩ < −(3/4)Tαε_0.

This inequality will allow us to obtain an upper bound on T which contradicts the theorem's statement. All that remains is to get upper bounds on ‖q_T‖ and ‖x∗ − x_0‖. Again, by the definition of x̂_k, ∇f(x̂_k) ⊥ q_k. By the Pythagorean theorem and (6),

‖q_T‖ = (Σ_{k=0}^{T−1} ‖∇f(x̂_k)‖²)^{1/2} ≤ √(2Lε_0).

Quadratic growth, on the other hand, implies the following upper bound:

‖x_0 − x∗‖ ≤ √(2ε_0/µ).

Finally,

−√(2ε_0/µ) · √(2Lε_0) < −(3/4)Tαε_0,

or

T < (8/(3α))√(L/µ).
This contradicts our choice of T = ⌈(8/(3α))√(L/µ)⌉. ⊓⊔

This result shows that if f were α-WQC with respect to P(x) for all x ∈ ℝⁿ, we would be able to apply a restarting technique to this method. To be more precise, under such circumstances it is possible to achieve an accuracy of ε by performing O(log(ε_0/ε)) cycles of ⌈(8/(3α))√(L/µ)⌉ iterations and using the output of each cycle as input for the next one. This means that by using Nemirovski's conjugate gradient method we may get a point y such that f(y) − f∗ ≤ ε in O((1/α)√(L/µ) log(ε_0/ε)) iterations.

Even though the SESOP and CG methods presented above are optimal in terms of the number of iterations required to achieve the desired accuracy, each iteration involves solving a subproblem over ℝ³ or ℝ². However, since all the conditions replacing convexity and strong convexity in our paper involve some global minimizer x∗, which may not belong to the domain of any of these subproblems, the subproblems may be general non-convex optimization problems. Not only are such problems much more difficult than convex ones, the above convergence analyses also relied on these subproblems being solved exactly.

In all of the methods analysed in this paper the subspace optimization step played the key role in allowing us to generalize these methods to our more general setting. It is as of yet unknown to the authors of this paper whether any fast gradient methods not involving subspace optimization, with guaranteed convergence for α-WQC objectives with α ∈ (0, 1], exist. However, such a method for 1-WQC problems is presented in [11] (see pp. 12–14).
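The restarting scheme described above can be sketched as follows (our own illustration on a quadratic objective, reusing the closed-form subspace step; the cycle length uses the constant 8/(3α) as reconstructed in Theorem 3):

```python
import numpy as np

def cg_cycle(A, b, x0, L, T):
    # One cycle of Algorithm 2 on f(x) = 0.5 * <x, A x> - <b, x>;
    # the 2-D subproblem over E_k is solved in closed form via lstsq.
    x = x0.astype(float).copy()
    q = np.zeros_like(x)
    for _ in range(T):
        U = np.column_stack([x - x0, q])
        s, *_ = np.linalg.lstsq(U.T @ A @ U, U.T @ (b - A @ x0), rcond=None)
        x_hat = x0 + U @ s
        g = A @ x_hat - b
        x, q = x_hat - g / L, q + g
    return x

def restarted_cg(A, b, x0, L, mu, alpha, n_cycles):
    # Theorem 3: each cycle of T = ceil(8/(3 alpha) * sqrt(L/mu)) iterations
    # reduces f(x) - f* by a factor of at least 3/4; warm-start each cycle.
    T = int(np.ceil(8 / (3 * alpha) * np.sqrt(L / mu)))
    x = x0
    for _ in range(n_cycles):
        x = cg_cycle(A, b, x, L, T)
    return x

rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
A = Q @ np.diag([1.0, 2.0, 3.0, 4.0]) @ Q.T   # mu = 1, L = 4
b = rng.standard_normal(4)
x = restarted_cg(A, b, np.zeros(4), L=4.0, mu=1.0, alpha=1.0, n_cycles=30)
print(np.linalg.norm(x - np.linalg.solve(A, b)))
```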
Acknowledgements
The authors would like to thank Arkadi Nemirovski for some important remarks. This work was partially supported by RNF grant 17-11-01027.
References
1. Mihai Anitescu. Degenerate nonlinear programming with a quadratic growth condition. SIAM Journal on Optimization, 10(4):1116–1135, 2000.
2. Joseph Frédéric Bonnans and Alexander Ioffe. Second-order sufficiency and quadratic growth for nonisolated minima. Mathematics of Operations Research, 20(4):801–817, 1995.
3. Sebastien Bubeck. Geometry of linearized neural networks. https://blogs.princeton.edu/imabandit/2016/11/13/geometry-of-linearized-neural-networks/, 2016. Accessed 1 November 2017.
4. Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
5. Dominik Csiba and Peter Richtárik. Global convergence of arbitrary-block gradient methods for generalized Polyak–Łojasiewicz functions. arXiv preprint arXiv:1709.03014, 2017.
6. Moritz Hardt, Tengyu Ma, and Benjamin Recht. Gradient descent learns linear dynamical systems. CoRR, abs/1609.05191, 2016.
7. Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak–Łojasiewicz condition. In European Conference on Machine Learning and Knowledge Discovery in Databases, Volume 9851, ECML PKDD 2016, pages 795–811, New York, NY, USA, 2016. Springer-Verlag New York, Inc.
8. Guy Narkiss and Michael Zibulevsky. Sequential subspace optimization method for large-scale unconstrained problems. Technion-IIT, Department of Electrical Engineering, 2005.
9. A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Optimization Method Efficiency. M.: Nauka, 1979.
10. Yurii Nesterov and Boris T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
11. A. Tyurin. Mirror version of similar triangles method for constrained optimization problems.