Sequential convergence of AdaGrad algorithm for smooth convex optimization
Cheik Traoré (a), Edouard Pauwels (b)
(a) ENS Paris-Saclay
(b) IRIT, CNRS, Université Toulouse 3 Paul Sabatier
Abstract
We prove that the iterates produced by either the scalar step size variant or the coordinatewise variant of the AdaGrad algorithm are convergent sequences when applied to convex objective functions with Lipschitz gradient. The key insight is to remark that such AdaGrad sequences satisfy a variable metric quasi-Fejér monotonicity property, which allows us to prove convergence.
Keywords:
Convex optimization, adaptive algorithms, sequential convergence, Fejér monotonicity.
1. Introduction
We consider the problem of unconstrained minimization of a smooth convex function $F : \mathbb{R}^n \to \mathbb{R}$ whose gradient is globally Lipschitz. We will assume that the minimum of $F$ over $\mathbb{R}^n$, denoted $F^*$, is attained. We are interested in the sequential convergence of a widely used adaptive gradient method called AdaGrad.

Sequential convergence.
Continuous optimization algorithms are meant to converge, if not to a global minimum, at least to a local minimum of the cost function $F$; a necessary condition, when the function is differentiable, is Fermat's rule, $\nabla F = 0$. The convergence of the sequences $(x_k)_{k \in \mathbb{N}}$ in $\mathbb{R}^n$ produced by such algorithms can be measured in several ways: convergence of the norm of the gradients $(\|\nabla F(x_k)\|)_{k \in \mathbb{N}}$ to zero, or convergence of the suboptimality level $F(x_k) - F^*$ to zero as $k$ grows to infinity. These measures of convergence do not translate directly into a characterization of the asymptotic behavior of the iterates $(x_k)_{k \in \mathbb{N}}$ themselves: in general, convergence of the iterates need not hold without additional assumptions. For example, when $F$ is strongly or strictly convex, the minimum is uniquely attained, and convergence of the gradient or of the suboptimality level to $0$ implies convergence of the sequence.

Convergence of iterate sequences is an important measure of algorithmic stability. Indeed, in optimization applications (statistics [1], signal processing [9]) one may be concerned with the value of the argmin more than with the minimum value. Sequential convergence ensures that the estimate of the argmin produced by the algorithm has some asymptotic stability property.

Adaptive gradient methods.
First order methods are the most widespread methods for machine learning and signal processing applications [5]. We will focus on the AdaGrad algorithm [12], which was initially developed in an online learning context, see also [18]. This is a simple gradient method for which the step size is tuned automatically, in a coordinatewise fashion, based on previous gradient observations; this is where the term "adaptive" comes from. Interestingly, this adaptivity property found a large interest in the training of deep networks [13], with extensions and variants such as the widespread Adam algorithm [14]. The reason for this success is probably the adequation between the composite structure of deep networks and the coordinatewise structure of the algorithm, as well as the empirical observation that adaptivity performs well in practice without requiring much manual tuning, although this is not a consensus [23].

Getting back to the convex world, it was suggested that adaptive step sizes give the possibility to use a single step size strategy, independent of the class of problems at hand: smooth versus nonsmooth, deterministic versus noisy [15]. Indeed, it is known in convex optimization that constant step sizes can be used in the deterministic smooth case, while a decreasing schedule has to be used in the presence of nonsmoothness or noise. This idea was extended to adaptivity to strong convexity [6] and its extensions [24], as well as to adaptivity in the context of variational inequalities [2].

In a more general nonconvex optimization context, there have been several recent attempts to develop a convergence theory for adaptive methods, with the application to deep network training in mind [16, 17, 3, 20, 22, 11, 4]. These provide qualitative convergence guarantees toward critical points, or complexity estimates on the norm of the gradient, which are of course also valid in the convex setting.
Fejér monotonicity and extensions.
In convex settings, the study of the convergence of the iterates of optimization algorithms has a long history. For many known first order algorithms, it turns out that the quantity $\|x_k - x^*\|$ is a Lyapunov function for the discrete dynamics, for any solution $x^*$. This property is called Fejér monotonicity; it allows to obtain convergence rates [19] and also to prove convergence of iterate sequences, in relation to Opial's property. For example, this property was used in [7] to prove convergence of the proximal point algorithm, the forward-backward splitting method, the Douglas-Rachford splitting method and more.

One of the most important issues in studying AdaGrad is that it is not a descent algorithm, as one is not guaranteed that a sufficient decrease condition holds. Extensions of Fejér monotonicity were proposed in order to handle such situations. Quasi-Fejér monotonicity is the property of being Fejér monotone up to a summable error; its modern description was given in [8]. It can be used to prove iterate convergence of algorithms such as block-iterative parallel algorithms, projected subgradient methods, the stochastic subgradient method, and perturbed optimization algorithms. Another issue related to AdaGrad is the fact that it induces a change of metric at each iteration, hence the notion of monotonicity which will be used is variable metric quasi-Fejér monotonicity, as introduced in [10].

Main result.
In this paper, we prove the sequential convergence of AdaGrad for smooth convex objectives. More precisely, we consider two versions of the algorithm: one with a scalar step size on the one hand, and one with a coordinatewise step size on the other hand. Both have been previously studied in the literature, but the second one is by far the most widely used in practice. Unsurprisingly, Fejér monotonicity is a central argument to prove this result. More precisely, we show that sequences generated by AdaGrad are bounded (whenever the objective attains its minimum) and comply with variable metric quasi-Fejér monotonicity, and our conclusion follows from the abstract analysis in [10].
2. Technical preliminary
In all this document we consider the set $\mathbb{R}^n$ of real vectors of dimension $n$, $n \in \mathbb{N}$. We denote by $x_i$ the $i$-th component of the vector $x \in \mathbb{R}^n$, with $i \in [1, \dots, n]$. $\nabla F$ is the gradient of a differentiable function $F : \mathbb{R}^n \to \mathbb{R}$ and $\nabla_i F$ is its $i$-th component, corresponding to the $i$-th partial derivative. $\|\cdot\|$ is the Euclidean norm and $\langle \cdot, \cdot \rangle$ its associated scalar product. Let $(u_k)_{k \in \mathbb{N}}$ be a sequence in $\mathbb{R}^n$. We denote by $u_{j,i}$ the $i$-th component of the $j$-th vector of the sequence $(u_k)_{k \in \mathbb{N}}$. The diagonal matrix with the vector $v \in \mathbb{R}^n$ as its diagonal is represented by $\mathrm{diag}(v) \in \mathbb{R}^{n \times n}$. We use the notation $\ell_+$ for the space of summable nonnegative sequences of real numbers. For a positive definite matrix $W \in \mathbb{R}^{n \times n}$ we use the notation $\|d\|_W^2 = \langle W d, d \rangle$ for the associated squared norm. We let $\succeq$ denote the partial (Loewner) order over symmetric matrices in $\mathbb{R}^{n \times n}$.

Throughout this document we will consider the following unconstrained minimization problem
$$\min_{x \in \mathbb{R}^n} F(x) \qquad (1)$$
where $F : \mathbb{R}^n \to \mathbb{R}$ is differentiable and $n \in \mathbb{N}$ is the ambient dimension. In addition, we assume that $F$ is convex and attains its minimum, that is, there exists $x^* \in \mathbb{R}^n$ such that
$$\forall x \in \mathbb{R}^n, \quad F(x) \ge F(x^*). \qquad (2)$$
We finally assume that $F$ has an $L$-Lipschitz gradient, for some $L > 0$.
More explicitly, $L$ is such that for any $x, y \in \mathbb{R}^n$,
$$\|\nabla F(x) - \nabla F(y)\| \le L \|x - y\|. \qquad (3)$$
From this property, we can derive the classical Descent Lemma, which is a quantitative bound on the difference between $F$ and its first order Taylor expansion, see for example [19, Lemma 1.2.3].

Lemma 1 (Descent Lemma). Suppose that $F : \mathbb{R}^n \to \mathbb{R}$ has an $L$-Lipschitz gradient. Then for all $x, y \in \mathbb{R}^n$, we have
$$\left| F(y) - F(x) - \langle \nabla F(x), y - x \rangle \right| \le \frac{L}{2} \|y - x\|^2. \qquad (4)$$

We study two versions of AdaGrad: the original algorithm, performing adaptive gradient steps at a coordinate level, and a simplified version which uses a scalar step size. The latter variant was coined AdaGrad-Norm in [22]; it goes as follows:
Algorithm 2.1 (AdaGrad-Norm). Given $x_0 \in \mathbb{R}^n$, $v_0 = 0$, $\delta > 0$, iterate, for $k \in \mathbb{N}$,
$$
\begin{aligned}
v_{k+1} &= v_k + \|\nabla F(x_k)\|^2 \\
x_{k+1} &= x_k - \frac{1}{\sqrt{v_{k+1} + \delta}} \, \nabla F(x_k).
\end{aligned} \qquad (5)
$$
The original version presented in [12] applies the same idea combined with coordinatewise updates using partial derivatives. The algorithm is as follows.

Algorithm 2.2 (AdaGrad). Given $x_0 \in \mathbb{R}^n$, $v_0 = 0$, $\delta > 0$, iterate, for $k \in \mathbb{N}$ and $j \in [1, \dots, n]$,
$$
\begin{aligned}
v_{k+1,j} &= v_{k,j} + \big( \nabla_j F(x_k) \big)^2 \\
x_{k+1,j} &= x_{k,j} - \frac{1}{\sqrt{v_{k+1,j} + \delta}} \, \nabla_j F(x_k).
\end{aligned} \qquad (6)
$$
Our goal is to prove that the sequences $(x_k)_{k \in \mathbb{N}}$ generated by AdaGrad are convergent for both variants.
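To make the two updates concrete, here is a minimal NumPy sketch of both variants applied to problem (1). The function names, the default value of $\delta$, the iteration budget and the quadratic test objective are illustrative choices, not prescriptions from this paper.

```python
import numpy as np

def adagrad_norm(grad_F, x0, delta=1e-3, n_iter=1000):
    """Scalar step size variant, Algorithm 2.1 (a sketch)."""
    x = np.array(x0, dtype=float)
    v = 0.0                              # v_0 = 0
    for _ in range(n_iter):
        g = grad_F(x)
        v += g @ g                       # v_{k+1} = v_k + ||grad F(x_k)||^2
        x -= g / np.sqrt(v + delta)      # x_{k+1} = x_k - grad F(x_k) / sqrt(v_{k+1} + delta)
    return x

def adagrad(grad_F, x0, delta=1e-3, n_iter=1000):
    """Coordinatewise variant, Algorithm 2.2 (a sketch)."""
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)                 # one accumulator v_{k,j} per coordinate
    for _ in range(n_iter):
        g = grad_F(x)
        v += g ** 2                      # v_{k+1,j} = v_{k,j} + (partial_j F(x_k))^2
        x -= g / np.sqrt(v + delta)      # coordinatewise step
    return x

# Example on a smooth convex quadratic F(x) = 0.5 * x^T A x, whose gradient is A x.
A = np.diag([10.0, 1.0, 0.1])
x_final = adagrad(lambda x: A @ x, x0=np.ones(3), n_iter=5000)
```

The two routines differ only in whether the squared gradients are accumulated as a scalar or per coordinate, which is exactly the distinction between (5) and (6).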
3. Results
Our main result is the following.
Theorem 3.1.
Let $F : \mathbb{R}^n \to \mathbb{R}$ be convex with $L$-Lipschitz gradient and attain its minimum on $\mathbb{R}^n$. Then any sequence $(x_k)_{k \in \mathbb{N}}$ generated by AdaGrad-Norm (Algorithm 2.1) or AdaGrad (Algorithm 2.2) converges to a global minimum of $F$ as $k$ grows to infinity.

The coming section is dedicated to the exposition of the proof arguments for this result.

3.1. Variable metric quasi-Fejér monotonicity
The following definition is a simplification adapted from the more general exposition given in [10].
Definition 3.1.
Let $(W_k)_{k \in \mathbb{N}}$ be a sequence of symmetric matrices such that $W_k \succeq \alpha I_n$, $\forall k \in \mathbb{N}$, for some $\alpha > 0$. Let $C$ be a nonempty, closed and convex subset of $\mathbb{R}^n$, and let $(x_k)_{k \in \mathbb{N}}$ be a sequence in $\mathbb{R}^n$. Then $(x_k)_{k \in \mathbb{N}}$ is variable metric quasi-Fejér monotone with respect to the target set $C$ relative to $(W_k)_{k \in \mathbb{N}}$ if
$$
\big( \exists (\eta_k)_{k \in \mathbb{N}} \in \ell_+ \big)\, (\forall z \in C)\, \big( \exists (\varepsilon_k)_{k \in \mathbb{N}} \in \ell_+ \big)\, (\forall k \in \mathbb{N}) \qquad \|x_{k+1} - z\|_{W_{k+1}}^2 \le (1 + \eta_k)\, \|x_k - z\|_{W_k}^2 + \varepsilon_k. \qquad (7)
$$
For variable metric quasi-Fejér sequences the following results have already been established [10, Proposition 3.2]; we provide a proof in Appendix A.1 for completeness.

Proposition 3.2.
Let $(u_k)_{k \in \mathbb{N}}$ be a variable metric quasi-Fejér sequence relative to a nonempty, convex and closed set $C$ in $\mathbb{R}^n$. The following assertions hold.
(i) Let $u \in C$. Then $(\|u_k - u\|_{W_k})_{k \in \mathbb{N}}$ converges.
(ii) $(u_k)_{k \in \mathbb{N}}$ is bounded.

Proposition 3.2 allows to prove sequential convergence of variable metric quasi-Fejér sequences. This result is again due to [10, Theorem 3.3] and a proof is given in Appendix A.2 for completeness.
Theorem 3.3.
Let $(W_k)_{k \in \mathbb{N}}$ be a sequence of symmetric matrices such that $W_k \succeq \alpha I_n$, $\forall k \in \mathbb{N}$, for some $\alpha > 0$. We suppose that the sequence $(W_k)_{k \in \mathbb{N}}$ converges to $W$. Let $(x_k)_{k \in \mathbb{N}}$ be a variable metric quasi-Fejér sequence with respect to a closed target set $C \subset \mathbb{R}^n$. Then $(x_k)_{k \in \mathbb{N}}$ converges to a point in $C$ if and only if every cluster point of $(x_k)_{k \in \mathbb{N}}$ is in $C$.

Remark 3.2.
If, for a variable metric quasi-Fejér sequence, $(W_k)_{k \in \mathbb{N}}$ is constant and $\eta_k$ is zero for all $k \in \mathbb{N}$, the sequence is simply called a quasi-Fejér sequence. Moreover, if $\varepsilon_k$ is zero for all $k \in \mathbb{N}$, it is called a Fejér monotone sequence, in which case the distance to the target set provides a Lyapunov function. Of course, for these two special cases, the results stated above hold.

To prove convergence of sequences generated by AdaGrad-Norm, we start with the following lemma.
Lemma 2.
Under the hypotheses of Theorem 3.1, suppose that $(x_k)_{k \in \mathbb{N}}$ is a sequence generated by AdaGrad-Norm in Algorithm 2.1. Then $\sum_{k=0}^{\infty} \|\nabla F(x_k)\|^2$ is finite.

Proof. This proof is inspired by the proof of Lemma 4.1 in [22]. Fix $x^* \in \mathbb{R}^n$ such that $F(x^*) = \inf_x F(x) > -\infty$. We split the proof into two cases.

• Suppose that there exists an index $k_0 \in \mathbb{N}$ such that $\sqrt{v_{k_0} + \delta} \ge L$. It follows, using the Descent Lemma 1, for any $j \ge 1$,
$$
\begin{aligned}
F(x_{k_0+j}) &\le F(x_{k_0+j-1}) + \langle \nabla F(x_{k_0+j-1}), x_{k_0+j} - x_{k_0+j-1} \rangle + \frac{L}{2}\|x_{k_0+j} - x_{k_0+j-1}\|^2 \\
&= F(x_{k_0+j-1}) - \left( \frac{1}{\sqrt{v_{k_0+j} + \delta}} - \frac{L}{2(v_{k_0+j} + \delta)} \right) \big\| \nabla F(x_{k_0+j-1}) \big\|^2 \qquad (8) \\
&\le F(x_{k_0+j-1}) - \frac{1}{2\sqrt{v_{k_0+j} + \delta}} \big\| \nabla F(x_{k_0+j-1}) \big\|^2 \qquad (9) \\
&\le F(x_{k_0}) - \sum_{\ell=1}^{j} \frac{1}{2\sqrt{v_{k_0+\ell} + \delta}} \big\| \nabla F(x_{k_0+\ell-1}) \big\|^2 \qquad (10) \\
&\le F(x_{k_0}) - \frac{1}{2\sqrt{v_{k_0+j} + \delta}} \sum_{\ell=1}^{j} \big\| \nabla F(x_{k_0+\ell-1}) \big\|^2,
\end{aligned}
$$
where the transition from (8) to (9) holds because $\sqrt{v_{k_0+j} + \delta} \ge \sqrt{v_{k_0} + \delta} \ge L$ for all $j \ge 0$, (10) follows by recursion, and the last step uses the fact that $(v_k)_{k \in \mathbb{N}}$ is nondecreasing. Fix any $j \ge 1$ and let $Z = \sum_{k=k_0}^{k_0+j-1} \|\nabla F(x_k)\|^2$, so that $v_{k_0+j} = v_{k_0} + Z$. It follows, using the preceding inequality, that
$$
2\big( F(x_{k_0}) - F(x^*) \big) \ge 2\big( F(x_{k_0}) - F(x_{k_0+j}) \big) \ge \frac{\sum_{k=k_0}^{k_0+j-1} \|\nabla F(x_k)\|^2}{\sqrt{v_{k_0+j} + \delta}} = \frac{Z}{\sqrt{Z + v_{k_0} + \delta}}.
$$
By Lemma 5, it follows that
$$
\sum_{k=k_0}^{k_0+j-1} \|\nabla F(x_k)\|^2 \le 4\big( F(x_{k_0}) - F(x^*) \big)^2 + 2\big( F(x_{k_0}) - F(x^*) \big) \sqrt{v_{k_0} + \delta}. \qquad (11)
$$
Since $j$ was arbitrary, one may take the limit $j \to \infty$; we have $\sum_{k=k_0}^{\infty} \|\nabla F(x_k)\|^2 < \infty$. That means $\sum_{k=0}^{\infty} \|\nabla F(x_k)\|^2 < \infty$, which concludes the proof for this case.

• On the contrary, suppose that $\sqrt{v_k + \delta} < L$ for all $k \in \mathbb{N}$; this means that
$$\forall k \in \mathbb{N}, \quad \sum_{l=0}^{k-1} \|\nabla F(x_l)\|^2 = v_k < L^2 - \delta.$$
Letting $k$ go to infinity gives
$$\sum_{l=0}^{\infty} \|\nabla F(x_l)\|^2 \le L^2 - \delta < \infty,$$
which is the desired result.

We can now conclude the proof for AdaGrad-Norm.

Proof.
Under the conditions of Theorem 3.1, assume that $(x_k)_{k \in \mathbb{N}}$ is a sequence generated by AdaGrad-Norm. Let $b_k = \sqrt{\delta + v_k} \ge \sqrt{\delta}$ for all $k \in \mathbb{N}$, which defines a nondecreasing sequence. Let $x^* \in \arg\min F$ be arbitrary. By assumption $\arg\min F$ is nonempty, and it is convex and closed since $F$ is convex and continuous. We have for all $k \in \mathbb{N}$,
$$
\|x_{k+1} - x^*\|^2 = \left\| x_k - x^* - \frac{1}{b_{k+1}} \nabla F(x_k) \right\|^2 = \|x_k - x^*\|^2 + \frac{2}{b_{k+1}} \langle \nabla F(x_k), x^* - x_k \rangle + \frac{1}{b_{k+1}^2} \|\nabla F(x_k)\|^2.
$$
Thanks to the convexity of $F$, the above equality gives
$$
\|x_{k+1} - x^*\|^2 \le \|x_k - x^*\|^2 + \frac{2}{b_{k+1}} \big( F(x^*) - F(x_k) \big) + \frac{1}{b_{k+1}^2} \|\nabla F(x_k)\|^2 \le \|x_k - x^*\|^2 + \frac{1}{\delta} \|\nabla F(x_k)\|^2. \qquad (12)
$$
By Lemma 2, $\big( \|\nabla F(x_k)\|^2 \big)_{k \in \mathbb{N}}$ is summable. Hence $(x_k)_{k \in \mathbb{N}}$ is a quasi-Fejér sequence relative to $\arg\min F$. Proposition 3.2 says that $(x_k)_{k \in \mathbb{N}}$ is bounded, thus it has at least one accumulation point. Then, thanks again to Lemma 2, the set of accumulation points of $(x_k)_{k \in \mathbb{N}}$ is included in $\arg\min F$ (indeed, Lemma 2 implies that $\|\nabla F(x_k)\| \to 0$, so by continuity of $\nabla F$ any accumulation point is a critical point of $F$, hence a global minimizer by convexity). So, using Theorem 3.3 and Remark 3.2, we conclude that $(x_k)_{k \in \mathbb{N}}$ is convergent and that its limit is a global minimum of $F$.
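As a sanity check of the quasi-Fejér inequality (12), one can run AdaGrad-Norm on a simple convex quadratic and verify numerically that the squared distance to a minimizer never increases by more than $\frac{1}{\delta}\|\nabla F(x_k)\|^2$ per iteration. The test problem, the value of $\delta$ and the iteration count below are arbitrary illustrative choices.

```python
import numpy as np

A = np.diag([5.0, 1.0, 0.2])                  # F(x) = 0.5 x^T A x, arg min F = {0}
grad_F = lambda x: A @ x
x_star = np.zeros(3)

delta, v = 1e-2, 0.0
x = np.array([2.0, -1.0, 3.0])
for k in range(2000):
    g = grad_F(x)
    v += g @ g
    x_next = x - g / np.sqrt(v + delta)
    # Inequality (12): ||x_{k+1}-x*||^2 <= ||x_k-x*||^2 + (1/delta) ||grad F(x_k)||^2.
    increase = np.sum((x_next - x_star) ** 2) - np.sum((x - x_star) ** 2)
    assert increase <= (g @ g) / delta + 1e-12
    x = x_next
print("distance to the minimizer after 2000 iterations:", np.linalg.norm(x - x_star))
```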
We now consider the case of AdaGrad in Algorithm 2.2, taking into account the coordinatewise nature of the updates. The following corresponds to Lemma 2 for this situation.

Lemma 3.
Under the hypotheses of Theorem 3.1, suppose that $(x_k)_{k \in \mathbb{N}}$ is a sequence generated by AdaGrad in Algorithm 2.2. We have that $\sum_{k=0}^{\infty} \|\nabla F(x_k)\|^2$ is finite.

Proof. Let $x^* \in \mathbb{R}^n$ be such that $F(x^*) = \inf_x F(x)$. Let $I = \big\{ i \in [1, \dots, n] : \exists k_i \in \mathbb{N},\ \sqrt{v_{k_i,i} + \delta} \ge L$, with $k_i$ the smallest possible$\big\}$, and set $k_0 = \max_{i \in I} k_i$.

If $I$ is empty, we have, $\forall k \in \mathbb{N}$ and $\forall i \in [1, \dots, n]$,
$$\sum_{l=0}^{k} \big( \nabla_i F(x_l) \big)^2 < L^2 - \delta.$$
Letting $k$ go to infinity gives, $\forall i \in [1, \dots, n]$,
$$\sum_{l=0}^{\infty} \big( \nabla_i F(x_l) \big)^2 \le L^2 - \delta < \infty \quad \text{and hence} \quad \sum_{l=0}^{\infty} \|\nabla F(x_l)\|^2 < \infty.$$

So let us assume that $I$ is not empty. By the Descent Lemma 1, for $j \ge 1$,
$$
\begin{aligned}
F(x_{k_0+j}) &\le F(x_{k_0+j-1}) + \langle \nabla F(x_{k_0+j-1}), x_{k_0+j} - x_{k_0+j-1} \rangle + \frac{L}{2}\|x_{k_0+j} - x_{k_0+j-1}\|^2 \\
&= F(x_{k_0+j-1}) + \sum_{i=1}^n \nabla_i F(x_{k_0+j-1}) (x_{k_0+j} - x_{k_0+j-1})_i + \frac{L}{2} \sum_{i=1}^n (x_{k_0+j} - x_{k_0+j-1})_i^2 \\
&= F(x_{k_0+j-1}) - \sum_{i=1}^n \frac{1}{\sqrt{v_{k_0+j,i} + \delta}} \big( \nabla_i F(x_{k_0+j-1}) \big)^2 + \frac{L}{2} \sum_{i=1}^n \frac{1}{v_{k_0+j,i} + \delta} \big( \nabla_i F(x_{k_0+j-1}) \big)^2 \\
&= F(x_{k_0+j-1}) - \sum_{i \in I} \frac{1}{\sqrt{v_{k_0+j,i} + \delta}} \left( 1 - \frac{L}{2\sqrt{v_{k_0+j,i} + \delta}} \right) \big( \nabla_i F(x_{k_0+j-1}) \big)^2 \\
&\qquad\quad - \sum_{i \notin I} \frac{1}{\sqrt{v_{k_0+j,i} + \delta}} \left( 1 - \frac{L}{2\sqrt{v_{k_0+j,i} + \delta}} \right) \big( \nabla_i F(x_{k_0+j-1}) \big)^2. \qquad (13)
\end{aligned}
$$
For $i \notin I$, $\sqrt{v_{k,i} + \delta} < L$ for all $k \in \mathbb{N}$. Therefore
$$
0 \le \sum_{k \in \mathbb{N}} \sum_{i \notin I} \left| \frac{1}{\sqrt{v_{k+1,i} + \delta}} \left( 1 - \frac{L}{2\sqrt{v_{k+1,i} + \delta}} \right) \right| \big( \nabla_i F(x_k) \big)^2 \le \frac{1}{\sqrt{\delta}} \left( 1 + \frac{L}{2\sqrt{\delta}} \right) \sum_{k \in \mathbb{N}} \sum_{i \notin I} \big( \nabla_i F(x_k) \big)^2 \le \frac{1}{\sqrt{\delta}} \left( 1 + \frac{L}{2\sqrt{\delta}} \right) n L^2 = C < +\infty, \qquad (14)
$$
where we let $C = \frac{1}{\sqrt{\delta}} \left( 1 + \frac{L}{2\sqrt{\delta}} \right) n L^2$. Since for $i \in I$ and $j \ge 1$, $1 - \frac{L}{2\sqrt{v_{k_0+j,i} + \delta}} \ge \frac{1}{2}$, we have
$$
- \frac{1}{\sqrt{v_{k_0+j,i} + \delta}} \left( 1 - \frac{L}{2\sqrt{v_{k_0+j,i} + \delta}} \right) \le - \frac{1}{2} \, \frac{1}{\sqrt{v_{k_0+j,i} + \delta}}. \qquad (15)
$$
By recurrence on (13), using (15) and (14), it follows that for all $j \ge 1$,
$$
F(x_{k_0+j}) \le F(x_{k_0}) - \frac{1}{2} \sum_{\ell=1}^{j} \sum_{i \in I} \frac{1}{\sqrt{v_{k_0+\ell,i} + \delta}} \big( \nabla_i F(x_{k_0+\ell-1}) \big)^2 + C.
$$
That is equivalent to
$$
2\big( F(x_{k_0}) - F(x_{k_0+j}) + C \big) \ge \sum_{\ell=1}^{j} \sum_{i \in I} \frac{1}{\sqrt{v_{k_0+\ell,i} + \delta}} \big( \nabla_i F(x_{k_0+\ell-1}) \big)^2 \ge \sum_{i \in I} \frac{1}{\sqrt{v_{k_0+j,i} + \delta}} \sum_{\ell=1}^{j} \big( \nabla_i F(x_{k_0+\ell-1}) \big)^2.
$$
Fixing $p \in I$, we deduce that for all $j \ge 1$,
$$
\frac{1}{\sqrt{v_{k_0+j,p} + \delta}} \sum_{\ell=1}^{j} \big( \nabla_p F(x_{k_0+\ell-1}) \big)^2 \le \sum_{i \in I} \frac{1}{\sqrt{v_{k_0+j,i} + \delta}} \sum_{\ell=1}^{j} \big( \nabla_i F(x_{k_0+\ell-1}) \big)^2 \le 2\big( F(x_{k_0}) - F(x_{k_0+j}) + C \big) \le 2\big( F(x_{k_0}) - F(x^*) + C \big).
$$
Fix any $j \ge 1$ and let $Z = \sum_{k=k_0}^{k_0+j-1} \big( \nabla_p F(x_k) \big)^2$, so that $v_{k_0+j,p} = v_{k_0,p} + Z$. We have
$$
\frac{Z}{\sqrt{Z + v_{k_0,p} + \delta}} \le 2\big( F(x_{k_0}) - F(x^*) + C \big).
$$
By Lemma 5, we get
$$
\sum_{k=k_0}^{k_0+j-1} \big( \nabla_p F(x_k) \big)^2 \le 4\big( F(x_{k_0}) - F(x^*) + C \big)^2 + 2\big( F(x_{k_0}) - F(x^*) + C \big) \sqrt{v_{k_0,p} + \delta}. \qquad (16)
$$
We may let $j$ go to infinity and obtain $\sum_{k=k_0}^{+\infty} \big( \nabla_p F(x_k) \big)^2 < \infty$. Since $p \in I$ was arbitrary, combining with (14), for all $i \in [1, \dots, n]$,
$$\sum_{k=0}^{+\infty} \big( \nabla_i F(x_k) \big)^2 < \infty,$$
and the result follows:
$$\sum_{k=0}^{+\infty} \|\nabla F(x_k)\|^2 < \infty.$$

We conclude this section with the convergence proof for AdaGrad.
Proof.
Under the conditions of Theorem 3.1, assume that $(x_k)_{k \in \mathbb{N}}$ is a sequence generated by AdaGrad, Algorithm 2.2. Let $b_{k,i} = \sqrt{\delta + v_{k,i}}$ for $k \in \mathbb{N}$ and $i \in [1, \dots, n]$; each of these sequences is nondecreasing and bounded below by $\sqrt{\delta}$. Fix any $x^* \in \arg\min F$, which is nonempty, closed and convex since $F$ is convex, continuous and attains its minimum. Let $b_k = (b_{k,1}, \dots, b_{k,n}) \in \mathbb{R}^n$. We have for all $k \in \mathbb{N}$ and $i = 1, \dots, n$,
$$
b_{k+1,i} \big( x_{k+1,i} - x^*_i \big)^2 = b_{k+1,i} \left( x_{k,i} - x^*_i - \frac{1}{b_{k+1,i}} \nabla_i F(x_k) \right)^2 = b_{k+1,i} (x_k - x^*)_i^2 + 2\, \nabla_i F(x_k)\, (x^* - x_k)_i + \frac{1}{b_{k+1,i}} \big( \nabla_i F(x_k) \big)^2.
$$
Summing over $i = 1, \dots, n$, we get for all $k \in \mathbb{N}$,
$$
\sum_{i=1}^n b_{k+1,i} (x_{k+1} - x^*)_i^2 = \sum_{i=1}^n b_{k+1,i} (x_k - x^*)_i^2 + 2 \langle \nabla F(x_k), x^* - x_k \rangle + \sum_{i=1}^n \frac{1}{b_{k+1,i}} \big( \nabla_i F(x_k) \big)^2,
$$
and hence, since $b_{k+1,i} \ge \sqrt{\delta}$,
$$
\|x_{k+1} - x^*\|_{B_{k+1}}^2 \le \sum_{i=1}^n b_{k+1,i} (x_k - x^*)_i^2 + 2 \langle \nabla F(x_k), x^* - x_k \rangle + \frac{1}{\sqrt{\delta}} \|\nabla F(x_k)\|^2,
$$
where $B_{k+1} = \mathrm{diag}(b_{k+1}) \in \mathbb{R}^{n \times n}$. Thanks to the convexity of $F$, the above gives for all $k \in \mathbb{N}$,
$$
\|x_{k+1} - x^*\|_{B_{k+1}}^2 \le \sum_{i=1}^n b_{k+1,i} (x_k - x^*)_i^2 + 2\big( F(x^*) - F(x_k) \big) + \frac{1}{\sqrt{\delta}} \|\nabla F(x_k)\|^2 \le \sum_{i=1}^n b_{k+1,i} (x_k - x^*)_i^2 + \frac{1}{\sqrt{\delta}} \|\nabla F(x_k)\|^2.
$$
It follows, for all $k \in \mathbb{N}$,
$$
\begin{aligned}
\|x_{k+1} - x^*\|_{B_{k+1}}^2 - \frac{1}{\sqrt{\delta}} \|\nabla F(x_k)\|^2 &\le \sum_{i=1}^n b_{k,i} (x_k - x^*)_i^2 \, \frac{b_{k+1,i}}{b_{k,i}} \le \left( \max_{i \in [1, \dots, n]} \frac{b_{k+1,i}}{b_{k,i}} \right) \sum_{i=1}^n b_{k,i} (x_k - x^*)_i^2 \\
&= \left( 1 + \left( \max_{i \in [1, \dots, n]} \frac{b_{k+1,i}}{b_{k,i}} - 1 \right) \right) \|x_k - x^*\|_{B_k}^2.
\end{aligned}
$$
Let $M \in \mathbb{N}$, $M \ge 1$. For all $i \in [1, \dots, n]$, we have
$$
\sum_{k=0}^{M-1} \left( \frac{b_{k+1,i}}{b_{k,i}} - 1 \right) = \sum_{k=0}^{M-1} \frac{b_{k+1,i} - b_{k,i}}{b_{k,i}} \le \frac{1}{\sqrt{\delta}} \sum_{k=0}^{M-1} \big( b_{k+1,i} - b_{k,i} \big) = \frac{b_{M,i} - b_{0,i}}{\sqrt{\delta}} \le \frac{b_{M,i}}{\sqrt{\delta}} < \infty,
$$
where the boundedness, uniformly in $M$, follows from Lemma 3. So, for all $i \in [1, \dots, n]$, the sequence $\big( \frac{b_{k+1,i}}{b_{k,i}} - 1 \big)_{k \in \mathbb{N}}$ is summable, and since $(b_{k,i})_{k \in \mathbb{N}}$ is nondecreasing, it is also nonnegative. In particular the sequence $\big( \max_{i \in [1, \dots, n]} \frac{b_{k+1,i}}{b_{k,i}} - 1 \big)_{k \in \mathbb{N}}$ is summable and nonnegative.

Therefore $(x_k)_{k \in \mathbb{N}}$ is variable metric quasi-Fejér with target set $C = \arg\min F$, metric $W_k = B_k \succeq \sqrt{\delta}\, I_n$, $\eta_k = \max_{i \in [1, \dots, n]} \frac{b_{k+1,i}}{b_{k,i}} - 1$ and $\varepsilon_k = \frac{1}{\sqrt{\delta}} \|\nabla F(x_k)\|^2$, for all $k \in \mathbb{N}$, using the notations of Definition 3.1. Note that $(\eta_k)_{k \in \mathbb{N}}$ does not depend on the choice of $x^* \in C$ and is summable, and $(\varepsilon_k)_{k \in \mathbb{N}}$ is also summable by Lemma 3, so that the definition applies. By Lemma 3, $C$ contains all the cluster points and $(W_k)_{k \in \mathbb{N}}$ converges. Thus Theorem 3.3 allows us to conclude that $(x_k)_{k \in \mathbb{N}}$ converges to a global minimum.
4. Discussion and future work
Sequential convergence of AdaGrad in the smooth convex case constitutes a further adaptivity property of this algorithm. Fejér monotonicity plays an important role here, as one would expect. It is interesting to remark that our analysis does not require any assumption on the objective $F$ beyond its Lipschitz gradient and the fact that it attains its minimum; these are sufficient to ensure boundedness and convergence of any generated sequence. This is in contrast with analyses in more advanced, nonconvex, noisy settings, where additional assumptions are required [22, 11]. Extensions of this analysis include the addition of noise or nonsmoothness in the convex case. It would also be interesting to see whether the proposed approach allows to obtain better convergence bounds than the original regret analysis [12, 15].

Acknowledgements
Most of this work took place during the first author's master's internship at IRIT. The second author would like to acknowledge the support of the ANR-3IA Artificial and Natural Intelligence Toulouse Institute, of the Air Force Office of Scientific Research, Air Force Material Command, USAF, under grant numbers FA9550-19-1-7026 and FA9550-18-1-0226, and of the ANR MasDol project.

References

[1] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.
[2] F. Bach and K. Y. Levy. A universal algorithm for variational inequalities adaptive to smoothness and noise. arXiv preprint arXiv:1902.01637, 2019.
[3] A. Barakat and P. Bianchi. Convergence and dynamical behavior of the Adam algorithm for non convex stochastic optimization. arXiv preprint arXiv:1810.02263, 2018.
[4] A. Barakat and P. Bianchi. Convergence rates of a momentum algorithm with bounded adaptive step size for nonconvex optimization. In Asian Conference on Machine Learning, pages 225–240. PMLR, 2020.
[5] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
[6] Z. Chen, Y. Xu, E. Chen, and T. Yang. SADAGRAD: Strongly adaptive stochastic gradient methods. In International Conference on Machine Learning, pages 913–921, 2018.
[7] P. L. Combettes. Fejér monotonicity in convex optimization, pages 1016–1024. Springer US, 2001.
[8] P. L. Combettes. Quasi-Fejérian analysis of some optimization algorithms. In Studies in Computational Mathematics, volume 8, pages 115–152. Elsevier, 2001.
[9] P. L. Combettes and J.-C. Pesquet. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages 185–212. Springer, 2011.
[10] P. L. Combettes and B. C. Vũ. Variable metric quasi-Fejér monotonicity. Nonlinear Analysis: Theory, Methods & Applications, 78:17–31, 2013.
[11] A. Défossez, L. Bottou, F. Bach, and N. Usunier. On the convergence of Adam and AdaGrad. arXiv preprint arXiv:2003.02395, 2020.
[12] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.
[13] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning, volume 1. MIT Press, Cambridge, 2016.
[14] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Y. Bengio and Y. LeCun, editors, International Conference on Learning Representations (ICLR), 2015.
[15] K. Y. Levy, A. Yurtsever, and V. Cevher. Online adaptive methods, universality and acceleration. In Advances in Neural Information Processing Systems, pages 6500–6509, 2018.
[16] X. Li and F. Orabona. On the convergence of stochastic gradient descent with adaptive stepsizes. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 983–992. PMLR, 2019.
[17] Y. Malitsky and K. Mishchenko. Adaptive gradient descent without descent. arXiv preprint arXiv:1910.09529, 2019.
[18] H. B. McMahan and M. Streeter. Adaptive bound optimization for online convex optimization. arXiv preprint arXiv:1002.4908, 2010.
[19] Y. Nesterov. Introductory lectures on convex programming, volume I: Basic course. Lecture notes, 3(4):5, 1998.
[20] A. Ogaltsov, D. Dvinskikh, P. Dvurechensky, A. Gasnikov, and V. Spokoiny. Adaptive gradient descent for convex and non-convex stochastic optimization. arXiv preprint arXiv:1911.08380, 2019.
[21] B. T. Polyak. Introduction to Optimization. Optimization Software, Inc., Publications Division, New York, 1987.
[22] R. Ward, X. Wu, and L. Bottou. AdaGrad stepsizes: Sharp convergence over nonconvex landscapes. In International Conference on Machine Learning, pages 6677–6686, 2019.
[23] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pages 4148–4158, 2017.
[24] Y. Xie, X. Wu, and R. Ward. Linear convergence of adaptive stochastic gradient descent. In International Conference on Artificial Intelligence and Statistics, pages 1475–1485. PMLR, 2020.

Appendix A. Remaining proofs and Lemmas

The following two proofs are simplifications of the arguments in [10]; we provide these details for completeness.
Appendix A.1. Proof of Proposition 3.2
We need the next lemma for the proof.
Lemma 4 ([21], Lemma 2.2.2). Let $(\alpha_k)_{k \in \mathbb{N}}$ be a nonnegative sequence, let $(\eta_k)_{k \in \mathbb{N}} \in \ell_+$ and let $(\varepsilon_k)_{k \in \mathbb{N}} \in \ell_+$ be such that, $\forall k \in \mathbb{N}$, $\alpha_{k+1} \le (1 + \eta_k)\alpha_k + \varepsilon_k$. Then $(\alpha_k)_{k \in \mathbb{N}}$ converges.

Proof of the proposition.
(i) As an application of Lemma 4 to (7), with $\alpha_k = \|u_k - u\|_{W_k}^2$, we have that $(\|u_k - u\|_{W_k}^2)_{k \in \mathbb{N}}$ converges, and hence so does $(\|u_k - u\|_{W_k})_{k \in \mathbb{N}}$.
(ii) Let $u \in C$. By assumption, $W_k \succeq \alpha I_n$, $\forall k \in \mathbb{N}$, for some $\alpha > 0$. So, $\forall k \in \mathbb{N}$,
$$\|u_k - u\|^2 \le \frac{1}{\alpha} \langle u_k - u, W_k (u_k - u) \rangle = \frac{\|u_k - u\|_{W_k}^2}{\alpha}.$$
Thanks to the previous point, $(\|u_k - u\|_{W_k})_{k \in \mathbb{N}}$ is bounded, and then so is $(u_k)_{k \in \mathbb{N}}$.

Appendix A.2. Proof of Theorem 3.3

Proof of the theorem.
Let $\mathcal{W}$ be the set of all accumulation points of $(x_k)_{k \in \mathbb{N}}$. First, suppose that $x_k \to \bar x \in C$ as $k \to \infty$; then obviously $\mathcal{W} = \{\bar x\} \subset C$.

Conversely, suppose that $\mathcal{W} \subset C$. $\mathcal{W}$ is not empty since, by Proposition 3.2, $(x_k)_{k \in \mathbb{N}}$ is bounded. Fix $x$ and $x'$ in $\mathcal{W}$ arbitrary. Say that $x_{m_k} \to x$ and $x_{l_k} \to x'$ along two subsequences. From Proposition 3.2 (i), we know that $(\|x_k - x\|_{W_k})_{k \in \mathbb{N}}$ and $(\|x_k - x'\|_{W_k})_{k \in \mathbb{N}}$ converge since both $x$ and $x'$ are in $C$. Also, for any $d \in \mathbb{R}^n$, $\|d\|_{W_k}^2 = \langle W_k d, d \rangle \to \langle W d, d \rangle$ as $k \to \infty$. Since, for all $k \in \mathbb{N}$,
$$
\langle W_k x_k, x - x' \rangle = \frac{1}{2} \left( \|x_k - x'\|_{W_k}^2 - \|x_k - x\|_{W_k}^2 + \|x\|_{W_k}^2 - \|x'\|_{W_k}^2 \right),
$$
the sequence $(\langle W_k x_k, x - x' \rangle)_{k \in \mathbb{N}}$ converges, say to $\lambda \in \mathbb{R}$; this means that
$$
\langle x_k, W_k (x - x') \rangle \underset{k \to \infty}{\longrightarrow} \lambda \in \mathbb{R}. \qquad (A.1)
$$
Since, as $k \to \infty$, $x_{m_k} \to x$ and $W_{m_k}(x - x') \to W(x - x')$, it follows from (A.1) that $\langle x, W(x - x') \rangle = \lambda$. In the same way we show that $\langle x', W(x - x') \rangle = \lambda$. So
$$
0 = \langle x, W(x - x') \rangle - \langle x', W(x - x') \rangle = \langle x - x', W(x - x') \rangle \ge \alpha \|x - x'\|^2,
$$
and hence we obtain $x = x'$. Since $x$ and $x'$ were arbitrary accumulation points, the set of accumulation points reduces to a singleton, that is, $x_k \to x \in C$ as $k \to \infty$.

Appendix A.3. Technical lemmas
Lemma 5.
Let $a, b \ge 0$. If, for $Z \ge 0$, $\frac{Z}{\sqrt{Z + a}} \le b$, then $Z \le b^2 + b\sqrt{a}$.

Proof. We have $Z^2 - b^2 Z - b^2 a \le 0$. For this second order equation, $\Delta = b^4 + 4 b^2 a \ge 0$, so the polynomial has real roots, and its leading coefficient is positive; hence, for $Z$ to satisfy the above inequality, we should have
$$
Z \le \frac{b^2 + \sqrt{b^4 + 4 b^2 a}}{2} \le \frac{b^2 + \sqrt{b^4} + \sqrt{4 b^2 a}}{2} = b^2 + b\sqrt{a}.
$$
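For completeness, here is a quick randomized check of Lemma 5, purely illustrative: it samples nonnegative triples $(a, b, Z)$ and verifies that the conclusion holds whenever the hypothesis does.

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(100000):
    a, b, Z = rng.uniform(0.0, 10.0, size=3)
    if Z / np.sqrt(Z + a) <= b:                 # hypothesis of Lemma 5
        assert Z <= b ** 2 + b * np.sqrt(a)     # conclusion of Lemma 5
```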