Convergence Analysis of Optimization Algorithms
HyoungSeok Kim, JiHoon Kang, WooMyoung Park, SukHyun Ko, YoonHo Cho, DaeSung Yu, YoungSook Song, JungWon Choi
Company.AI
{alexrp, don, max.park, noag.go, ed.cho, dsds, song.cai, xiao}@company.ai
July 7, 2017
Abstract
The regret bound of an optimization algorithm is one of the basic criteria for evaluating the performance of the given algorithm. By inspecting the differences between the regret bounds of traditional algorithms and adaptive ones, we provide a guide for choosing an optimizer with respect to the given data set and loss function. For the analysis, we assume that the loss function is convex and that its gradient is Lipschitz continuous.
1 Introduction

Consider the problem of minimizing a convex objective function $J(\theta)$ over a parameter vector $\theta \in \Theta$:

$$\min_{\theta \in \Theta} J(\theta) \quad (1)$$

To find a minimizer $\theta^* \in \Theta$ of (1), we use iterative methods that update the current parameter vector $\theta_t$. At each step $t$, a method uses the gradient of $J(\theta_t)$ together with a step size $\eta$; here $\nabla_\theta J(\theta_t)$ denotes the gradient of the objective function with respect to $\theta$, evaluated at the current parameter $\theta_t$. In general we write the loss function as $J(\theta) = f(\theta) + \phi(\theta)$, where $f(\theta)$ is a convex instantaneous loss and $\phi(\theta)$ is a convex regularization function. For the analysis, we define the regret

$$R_J(T) := \sum_{t=1}^{T}\left[ J(\theta_t) - J(\theta^*) \right] \quad (2)$$

to estimate the error bound. To guarantee convergence of the algorithms in this paper, we also assume convexity of $J$ and $L$-Lipschitz continuity of its gradient:

- $J$ is convex, i.e.

$$J(y) \geq J(x) + \langle \nabla J(x),\, y - x \rangle \quad \forall x, y \quad (3)$$

- $\nabla J(x)$ is $L$-Lipschitz continuous, i.e.

$$\|\nabla J(x) - \nabla J(y)\| \leq L\|x - y\| \quad \forall x, y \quad (4)$$

Moreover, (4) implies

$$J(y) \leq J(x) + \langle \nabla J(x),\, y - x \rangle + \frac{L}{2}\|y - x\|^2 \quad \forall x, y \quad (5)$$

The analysis mainly focuses on the regret bound of each algorithm. The choice of optimizer produces differences in the performance of the training procedure on the same neural network, and one can roughly classify optimization algorithms by their convergence rates. As first-order methods, we cover stochastic gradient descent (section 2), the momentum method (section 3), and the Nesterov accelerated gradient method (section 4). Among adaptive methods, Adagrad (section 5), Adadelta, and Adam (section 6) are well known. For optimization tasks such as training neural networks, adaptive methods are usually preferred, but recent research [7, Figure 1] shows that traditional first-order algorithms such as stochastic gradient descent or the momentum method can give better convergence results than the adaptive methods. One possible reason lies in how adaptive algorithms estimate the Hessian matrix; this estimation issue is discussed briefly in section 5.

2 Stochastic Gradient Descent

The basic gradient descent update with a full batch is

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta J(\theta_t) \quad (6)$$

where $\eta$ is the learning rate. In contrast, stochastic gradient descent (or mini-batch gradient descent) updates the parameter vector using each data point $i$ (or the $i$-th mini-batch):

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta J(\theta_t; x_i, y_i) \quad (7)$$

where $J(\theta_t; x_i, y_i)$ indicates that we only have partial information about the loss function; the given batch of data guides the gradient direction at each iteration.

In this section we show that the regret bound of gradient descent with a full batch is bounded by a constant, and that stochastic gradient descent shares the same regret bound. Note that the sequence $\{J(\theta_t)\}$ is not necessarily monotonically decreasing, since a stochastic gradient does not guarantee an exact descent direction. Since we assume the cost function $J$ is convex, a constant bound on $R_J(T)$ implies that the error at a given step is bounded by the inverse of the iteration number.
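To make updates (6) and (7) concrete, here is a minimal sketch of both update rules on a least-squares problem. The quadratic loss, the data, and all function names are illustrative assumptions, not part of the analysis above.

import numpy as np

def gd_step(theta, grad_J, eta):
    # Update (6): step along the full-batch gradient.
    return theta - eta * grad_J(theta)

def sgd_step(theta, grad_J_i, eta, x_i, y_i):
    # Update (7): step along the gradient of the partial loss J(theta; x_i, y_i).
    return theta - eta * grad_J_i(theta, x_i, y_i)

# Assumed example: J(theta) = 0.5 * ||X theta - y||^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.ones(5)
grad_full = lambda th: X.T @ (X @ th - y)
grad_one = lambda th, xi, yi: xi * (xi @ th - yi)

L = np.linalg.norm(X.T @ X, 2)   # Lipschitz constant of the gradient
theta = np.zeros(5)
for t in range(200):
    theta = gd_step(theta, grad_full, eta=1.0 / L)   # eta in (0, 1/L]

For the stochastic variant, one would instead call sgd_step at each iteration with a row (x_i, y_i) sampled from the data.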
Theorem (Nesterov, 2.1.14). If $J(\theta)$ is convex and its gradient is $L$-Lipschitz continuous, then for $\eta \in (0, 1/L]$ the sequence $\{\theta_t\}$ generated by update (6) or (7) satisfies

$$R_J(T) = O\!\left( \|\theta_1 - \theta^*\|^2 \right)$$

Proof. Since $\nabla J$ is $L$-Lipschitz continuous, by (5) we have

$$\begin{aligned}
J(\theta_{t+1}) &\leq J(\theta_t) + \langle \nabla_\theta J(\theta_t),\, \theta_{t+1} - \theta_t \rangle + \frac{L}{2}\|\theta_{t+1} - \theta_t\|^2 \\
&= J(\theta_t) + \langle \nabla_\theta J(\theta_t),\, -\eta\nabla_\theta J(\theta_t) \rangle + \frac{L}{2}\|{-\eta}\nabla_\theta J(\theta_t)\|^2 \\
&= J(\theta_t) - \eta\left( 1 - \frac{\eta L}{2} \right)\|\nabla_\theta J(\theta_t)\|^2 \\
&\leq J(\theta_t) - \frac{\eta}{2}\|\nabla_\theta J(\theta_t)\|^2 \qquad (\because\ \eta \in (0, 1/L]) \\
&\leq J(\theta^*) + \langle \nabla_\theta J(\theta_t),\, \theta_t - \theta^* \rangle - \frac{\eta}{2}\|\nabla_\theta J(\theta_t)\|^2 \qquad (\because\ J\ \text{is convex}) \\
&= J(\theta^*) + \frac{1}{2\eta}\left( \|\theta_t - \theta^*\|^2 - \|\theta_t - \theta^*\|^2 \right) + \langle \nabla_\theta J(\theta_t),\, \theta_t - \theta^* \rangle - \frac{\eta}{2}\|\nabla_\theta J(\theta_t)\|^2 \\
&= J(\theta^*) + \frac{1}{2\eta}\left( \|\theta_t - \theta^*\|^2 - \left( \|\theta_t\|^2 - 2\langle\theta_t, \theta^*\rangle + \|\theta^*\|^2 - 2\eta\langle \nabla_\theta J(\theta_t),\, \theta_t - \theta^*\rangle + \eta^2\|\nabla_\theta J(\theta_t)\|^2 \right) \right) \\
&= J(\theta^*) + \frac{1}{2\eta}\left( \|\theta_t - \theta^*\|^2 - \left( \|\theta_t - \eta\nabla_\theta J(\theta_t)\|^2 - 2\langle \theta_t - \eta\nabla_\theta J(\theta_t),\, \theta^*\rangle + \|\theta^*\|^2 \right) \right) \\
&= J(\theta^*) + \frac{1}{2\eta}\left( \|\theta_t - \theta^*\|^2 - \|\theta_{t+1} - \theta^*\|^2 \right)
\end{aligned}$$

That is,

$$J(\theta_{t+1}) - J(\theta^*) \leq \frac{1}{2\eta}\left( \|\theta_t - \theta^*\|^2 - \|\theta_{t+1} - \theta^*\|^2 \right) \quad (8)$$

Applying (8) and summing over the iterations,

$$\sum_{t=1}^{T}\left[ J(\theta_t) - J(\theta^*) \right] \leq \frac{1}{2\eta}\sum_{t=1}^{T}\left[ \|\theta_t - \theta^*\|^2 - \|\theta_{t+1} - \theta^*\|^2 \right] = \frac{1}{2\eta}\left( \|\theta_1 - \theta^*\|^2 - \|\theta_{T+1} - \theta^*\|^2 \right) \leq \frac{1}{2\eta}\|\theta_1 - \theta^*\|^2$$

3 Momentum

To accelerate the convergence of gradient descent, the momentum method uses past steps to update the current step: intuitively, the past steps are relevant to the next update, so using this information seems natural. Here $\gamma$ is called the momentum parameter and $\eta$ is the learning rate. The momentum update in [3, (5)] is

$$\begin{aligned} v_{t+1} &= \gamma v_t - \eta \nabla_\theta J(\theta_t) \\ \theta_{t+1} &= \theta_t + v_{t+1} \end{aligned} \quad (9)$$

As in [3, (4)], update (9) is equivalent to

$$\theta_{t+1} = \theta_t + \gamma(\theta_t - \theta_{t-1}) - \eta \nabla_\theta J(\theta_t) \quad (10)$$

Since the momentum method only modifies the basic structure of the gradient descent approach, the two share the same convergence rate. As in the previous analysis, we assume $J(\theta)$ is convex and its gradient is $L$-Lipschitz continuous.
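As a concrete illustration of update (9), here is a momentum step in the same sketch style; the function name and the reuse of grad_full and L from the previous sketch are our assumptions.

def momentum_step(theta, v, grad_J, eta, gamma):
    # Update (9): v_{t+1} = gamma * v_t - eta * grad J(theta_t),
    #             theta_{t+1} = theta_t + v_{t+1}.
    v_next = gamma * v - eta * grad_J(theta)
    return theta + v_next, v_next

# Usage on the quadratic example above; eta in (0, (1 - gamma)/L]
# matches the step-size condition of the theorem below.
gamma = 0.9
theta, v = np.zeros(5), np.zeros(5)
for t in range(200):
    theta, v = momentum_step(theta, v, grad_full, eta=(1 - gamma) / L, gamma=gamma)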
Theorem (Ghadimi, Theorem 1). If $J(\theta)$ is convex and its gradient is $L$-Lipschitz continuous, then for $\gamma \in [0, 1)$ and $\eta \in (0, (1-\gamma)/L]$, the sequence $\{\theta_t\}$ generated by update (9) satisfies

$$R_J(T) = O\!\left( \|\theta_1 - \theta^*\|^2 \right)$$

Proof. For some $\gamma \in [0, 1)$, let $p_t = \frac{\gamma}{1-\gamma}(\theta_t - \theta_{t-1})$ for $t = 1, 2, \cdots, T$, and assume $\theta_0 = \theta_1$, so that $p_1 = 0$. By (10),

$$\theta_{t+1} + p_{t+1} = \frac{1}{1-\gamma}\theta_{t+1} - \frac{\gamma}{1-\gamma}\theta_t = \theta_t + p_t - \frac{\eta}{1-\gamma}\nabla_\theta J(\theta_t)$$

With the optimal solution $\theta^*$, we have

$$\begin{aligned}
\|\theta_{t+1} + p_{t+1} - \theta^*\|^2 &= \left\| \theta_t + p_t - \frac{\eta}{1-\gamma}\nabla_\theta J(\theta_t) - \theta^* \right\|^2 \\
&= \|\theta_t + p_t - \theta^*\|^2 - \frac{2\eta}{1-\gamma}\langle \theta_t + p_t - \theta^*,\, \nabla_\theta J(\theta_t)\rangle + \left( \frac{\eta}{1-\gamma} \right)^2\|\nabla_\theta J(\theta_t)\|^2 \\
&= \|\theta_t + p_t - \theta^*\|^2 - \frac{2\eta}{1-\gamma}\langle \theta_t - \theta^*,\, \nabla_\theta J(\theta_t)\rangle - \frac{2\eta\gamma}{(1-\gamma)^2}\langle \theta_t - \theta_{t-1},\, \nabla_\theta J(\theta_t)\rangle + \left( \frac{\eta}{1-\gamma} \right)^2\|\nabla_\theta J(\theta_t)\|^2
\end{aligned}$$

Since $J(\theta)$ is a convex function with $L$-Lipschitz continuous gradient, we introduce the following proposition from [4, Theorem 2.1.5].

Proposition 1 (Nesterov, Theorem 2.1.5).

$$0 \leq J(y) - J(x) - \langle \nabla J(x),\, y - x\rangle \leq \frac{L}{2}\|x - y\|^2$$

$$J(x) + \langle \nabla J(x),\, y - x\rangle + \frac{1}{2L}\|\nabla J(x) - \nabla J(y)\|^2 \leq J(y)$$

The proofs of these properties are provided in Appendix A. Substituting suitable $x, y$, the above inequalities become

$$J(\theta_t) - J(\theta_{t-1}) \leq \langle \nabla_\theta J(\theta_t),\, \theta_t - \theta_{t-1}\rangle \quad (11)$$

$$J(\theta_t) - J(\theta^*) + \frac{1}{2L}\|\nabla_\theta J(\theta_t)\|^2 \leq \langle \nabla_\theta J(\theta_t),\, \theta_t - \theta^*\rangle \quad (12)$$

By (12), we obtain

$$\|\theta_{t+1} + p_{t+1} - \theta^*\|^2 \leq \|\theta_t + p_t - \theta^*\|^2 - \frac{2\eta}{1-\gamma}\left( J(\theta_t) - J(\theta^*) + \frac{1}{2L}\|\nabla_\theta J(\theta_t)\|^2 \right) - \frac{2\eta\gamma}{(1-\gamma)^2}\langle \theta_t - \theta_{t-1},\, \nabla_\theta J(\theta_t)\rangle + \left( \frac{\eta}{1-\gamma} \right)^2\|\nabla_\theta J(\theta_t)\|^2$$

and then, by (11),

$$\|\theta_{t+1} + p_{t+1} - \theta^*\|^2 \leq \|\theta_t + p_t - \theta^*\|^2 - \frac{2\eta}{1-\gamma}\left( J(\theta_t) - J(\theta^*) + \frac{1}{2L}\|\nabla_\theta J(\theta_t)\|^2 \right) - \frac{2\eta\gamma}{(1-\gamma)^2}\left( J(\theta_t) - J(\theta_{t-1}) \right) + \left( \frac{\eta}{1-\gamma} \right)^2\|\nabla_\theta J(\theta_t)\|^2$$

Adding $-\frac{2\eta\gamma}{(1-\gamma)^2}J(\theta^*)$ to both sides and collecting terms, we obtain

$$\left( \frac{2\eta}{1-\gamma} + \frac{2\eta\gamma}{(1-\gamma)^2} \right)\left( J(\theta_t) - J(\theta^*) \right) + \|\theta_{t+1} + p_{t+1} - \theta^*\|^2 \leq \frac{2\eta\gamma}{(1-\gamma)^2}\left( J(\theta_{t-1}) - J(\theta^*) \right) + \|\theta_t + p_t - \theta^*\|^2 + \frac{\eta}{1-\gamma}\left( \frac{\eta}{1-\gamma} - \frac{1}{L} \right)\|\nabla_\theta J(\theta_t)\|^2$$

Since we assume $\eta \in (0, (1-\gamma)/L]$, the third term on the right-hand side is non-positive, so the inequality still holds after eliminating it:

$$\left( \frac{2\eta}{1-\gamma} + \frac{2\eta\gamma}{(1-\gamma)^2} \right)\left( J(\theta_t) - J(\theta^*) \right) + \|\theta_{t+1} + p_{t+1} - \theta^*\|^2 \leq \frac{2\eta\gamma}{(1-\gamma)^2}\left( J(\theta_{t-1}) - J(\theta^*) \right) + \|\theta_t + p_t - \theta^*\|^2$$

Summing over $t = 1, 2, \cdots, T$ gives

$$\frac{2\eta}{1-\gamma}\sum_{t=1}^{T}\left[ J(\theta_t) - J(\theta^*) \right] + \sum_{t=1}^{T}\left[ \frac{2\eta\gamma}{(1-\gamma)^2}\left( J(\theta_t) - J(\theta^*) \right) + \|\theta_{t+1} + p_{t+1} - \theta^*\|^2 \right] \leq \sum_{t=1}^{T}\left[ \frac{2\eta\gamma}{(1-\gamma)^2}\left( J(\theta_{t-1}) - J(\theta^*) \right) + \|\theta_t + p_t - \theta^*\|^2 \right]$$

Since $\theta^*$ is the optimal solution of $J(\theta)$, every term is non-negative, so after telescoping

$$\frac{2\eta}{1-\gamma}\sum_{t=1}^{T}\left[ J(\theta_t) - J(\theta^*) \right] \leq \frac{2\eta\gamma}{(1-\gamma)^2}\left( J(\theta_0) - J(\theta^*) \right) + \|\theta_1 - \theta^*\|^2$$

where $\theta_0 = \theta_1$ and $p_1 = 0$ by assumption.
4 Nesterov Accelerated Gradient

In [3, (6)], the standard update equations for the NAG method are

$$\begin{aligned} y_{t+1} &= \theta_t - \eta \nabla_\theta J(\theta_t) \\ \theta_{t+1} &= y_{t+1} + \gamma(y_{t+1} - y_t) \end{aligned} \quad (13)$$

where $y_1 = \theta_1$. We can understand the NAG update more intuitively by rewriting the same equations: introducing $v_t = y_t - y_{t-1}$ with $y_0 = y_1$, (13) is equivalent to

$$\begin{aligned} v_{t+1} &= \gamma v_t - \eta \nabla_\theta J(y_t + \gamma v_t) \\ y_{t+1} &= y_t + v_{t+1} \end{aligned} \quad (14)$$

as in [3, (7)]. Rather than updating $\theta_t$, in (14) we update $y_t$ to minimize the objective function. The main idea of NAG is known as "gamble first, correct later": as we can see in (14), NAG estimates the next point by jumping along the previous direction and then calculates the gradient at that look-ahead position to correct the estimate.
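The look-ahead step in (14) can be sketched as follows; the function name and the reuse of the earlier grad_full are again our assumptions.

def nag_step(y, v, grad_J, eta, gamma):
    # Update (14): "gamble" by jumping to the look-ahead point y_t + gamma * v_t,
    # then evaluate the gradient there to correct the step.
    lookahead = y + gamma * v
    v_next = gamma * v - eta * grad_J(lookahead)
    return y + v_next, v_next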
Theorem (Ghadimi, Theorem 3). If $J(\theta)$ is convex and its gradient is $L$-Lipschitz continuous, then for $\gamma \in [0, 1)$ and $\eta \in (0, 1/L]$, the sequence $\{\theta_t\}$ generated by update (13) satisfies

$$R_J(T) = O\!\left( \|\theta_1 - \theta^*\|^2 \right) \quad (15)$$

Proof. Let $p_t = \frac{\gamma}{1-\gamma}\left( \theta_t - \theta_{t-1} + \eta\nabla_\theta J(\theta_{t-1}) \right)$ for $t = 1, 2, \cdots, T$, and assume $\theta_0 = \theta_1$ and $p_1 = 0$. By (13),

$$\theta_{t+1} + p_{t+1} = \frac{1}{1-\gamma}\theta_{t+1} + \frac{\gamma}{1-\gamma}\left( \eta\nabla_\theta J(\theta_t) - \theta_t \right) = \theta_t + p_t - \frac{\eta}{1-\gamma}\nabla_\theta J(\theta_t)$$

With the optimal solution $\theta^*$, this yields

$$\begin{aligned}
\|\theta_{t+1} + p_{t+1} - \theta^*\|^2 &= \|\theta_t + p_t - \theta^*\|^2 - \frac{2\eta}{1-\gamma}\langle \theta_t + p_t - \theta^*,\, \nabla_\theta J(\theta_t)\rangle + \left( \frac{\eta}{1-\gamma} \right)^2\|\nabla_\theta J(\theta_t)\|^2 \\
&= \|\theta_t + p_t - \theta^*\|^2 - \frac{2\eta}{1-\gamma}\langle \theta_t - \theta^*,\, \nabla_\theta J(\theta_t)\rangle - \frac{2\eta\gamma}{(1-\gamma)^2}\langle \theta_t - \theta_{t-1},\, \nabla_\theta J(\theta_t)\rangle \\
&\quad - \frac{2\eta^2\gamma}{(1-\gamma)^2}\langle \nabla_\theta J(\theta_{t-1}),\, \nabla_\theta J(\theta_t)\rangle + \left( \frac{\eta}{1-\gamma} \right)^2\|\nabla_\theta J(\theta_t)\|^2
\end{aligned}$$

Again from [4, Theorem 2.1.5] we have (12), and also

$$J(\theta_t) - J(\theta_{t-1}) + \frac{1}{2L}\|\nabla_\theta J(\theta_t) - \nabla_\theta J(\theta_{t-1})\|^2 \leq \langle \nabla_\theta J(\theta_t),\, \theta_t - \theta_{t-1}\rangle \quad (16)$$

By (12) and (16),

$$\begin{aligned}
\|\theta_{t+1} + p_{t+1} - \theta^*\|^2 &\leq \|\theta_t + p_t - \theta^*\|^2 - \frac{2\eta}{1-\gamma}\left( J(\theta_t) - J(\theta^*) + \frac{1}{2L}\|\nabla_\theta J(\theta_t)\|^2 \right) \\
&\quad - \frac{2\eta\gamma}{(1-\gamma)^2}\left( J(\theta_t) - J(\theta_{t-1}) + \frac{1}{2L}\|\nabla_\theta J(\theta_t) - \nabla_\theta J(\theta_{t-1})\|^2 \right) \\
&\quad - \frac{2\eta^2\gamma}{(1-\gamma)^2}\langle \nabla_\theta J(\theta_{t-1}),\, \nabla_\theta J(\theta_t)\rangle + \left( \frac{\eta}{1-\gamma} \right)^2\|\nabla_\theta J(\theta_t)\|^2
\end{aligned}$$

Since $\eta \in (0, 1/L]$, we have $\frac{1}{2L} \geq \frac{\eta}{2}$, so

$$\begin{aligned}
\|\theta_{t+1} + p_{t+1} - \theta^*\|^2 &\leq \|\theta_t + p_t - \theta^*\|^2 - \frac{2\eta}{1-\gamma}\left( J(\theta_t) - J(\theta^*) + \frac{\eta}{2}\|\nabla_\theta J(\theta_t)\|^2 \right) \\
&\quad - \frac{2\eta\gamma}{(1-\gamma)^2}\left( J(\theta_t) - J(\theta_{t-1}) + \frac{\eta}{2}\|\nabla_\theta J(\theta_t) - \nabla_\theta J(\theta_{t-1})\|^2 \right) \\
&\quad - \frac{2\eta^2\gamma}{(1-\gamma)^2}\langle \nabla_\theta J(\theta_{t-1}),\, \nabla_\theta J(\theta_t)\rangle + \left( \frac{\eta}{1-\gamma} \right)^2\|\nabla_\theta J(\theta_t)\|^2
\end{aligned}$$

Adding $-\frac{2\eta\gamma}{(1-\gamma)^2}J(\theta^*)$ to both sides and collecting terms, we get

$$\begin{aligned}
\frac{2\eta}{(1-\gamma)^2}\left( J(\theta_t) - J(\theta^*) \right) + \|\theta_{t+1} + p_{t+1} - \theta^*\|^2 &\leq \frac{2\eta\gamma}{(1-\gamma)^2}\left( J(\theta_{t-1}) - J(\theta^*) \right) + \|\theta_t + p_t - \theta^*\|^2 \\
&\quad + \frac{\eta^2\gamma}{(1-\gamma)^2}\left( \|\nabla_\theta J(\theta_t)\|^2 - \|\nabla_\theta J(\theta_t) - \nabla_\theta J(\theta_{t-1})\|^2 - 2\langle \nabla_\theta J(\theta_{t-1}),\, \nabla_\theta J(\theta_t)\rangle \right)
\end{aligned}$$

Adding and subtracting $\|\nabla_\theta J(\theta_{t-1})\|^2$ inside the last bracket and using $\|\nabla_\theta J(\theta_t)\|^2 - 2\langle \nabla_\theta J(\theta_{t-1}), \nabla_\theta J(\theta_t)\rangle + \|\nabla_\theta J(\theta_{t-1})\|^2 = \|\nabla_\theta J(\theta_t) - \nabla_\theta J(\theta_{t-1})\|^2$, the bracket collapses to $-\|\nabla_\theta J(\theta_{t-1})\|^2$, so

$$\frac{2\eta}{(1-\gamma)^2}\left( J(\theta_t) - J(\theta^*) \right) + \|\theta_{t+1} + p_{t+1} - \theta^*\|^2 \leq \frac{2\eta\gamma}{(1-\gamma)^2}\left( J(\theta_{t-1}) - J(\theta^*) \right) + \|\theta_t + p_t - \theta^*\|^2 - \frac{\eta^2\gamma}{(1-\gamma)^2}\|\nabla_\theta J(\theta_{t-1})\|^2$$

Dropping the last (non-positive) term, multiplying both sides by $\frac{1}{2\eta}$, using $\frac{1}{(1-\gamma)^2} = \frac{1}{1-\gamma} + \frac{\gamma}{(1-\gamma)^2}$, and summing over $t = 1, 2, \cdots, T$ gives

$$\frac{1}{1-\gamma}\sum_{t=1}^{T}\left[ J(\theta_t) - J(\theta^*) \right] + \sum_{t=1}^{T}\left[ \frac{\gamma}{(1-\gamma)^2}\left( J(\theta_t) - J(\theta^*) \right) + \frac{1}{2\eta}\|\theta_{t+1} + p_{t+1} - \theta^*\|^2 \right] \leq \sum_{t=1}^{T}\left[ \frac{\gamma}{(1-\gamma)^2}\left( J(\theta_{t-1}) - J(\theta^*) \right) + \frac{1}{2\eta}\|\theta_t + p_t - \theta^*\|^2 \right]$$

Therefore we have

$$\frac{1}{1-\gamma}\sum_{t=1}^{T}\left[ J(\theta_t) - J(\theta^*) \right] \leq \frac{\gamma}{(1-\gamma)^2}\left( J(\theta_0) - J(\theta^*) \right) + \frac{1}{2\eta}\|\theta_1 - \theta^*\|^2$$

where $\theta_0 = \theta_1$ by assumption.
5 Adagrad

Adagrad and the adaptive methods in the following sections follow Newton's method, which is known as a second-order method. Since these methods minimize the objective function $J$ with an estimated Hessian matrix, in the spirit of Newton's method, they generally perform better than the algorithms above. The exact calculation of the Hessian matrix is usually extremely expensive, however, so Adagrad estimates the Hessian with the following idea. Following [6, 5.4.2], consider the mean squared error function

$$J = \frac{1}{2}\sum_{n=1}^{N}\left( f(\theta_n) - y_n \right)^2$$

The gradient and the Hessian of $J$ are then

$$\nabla J = \sum_{n=1}^{N}\left( f(\theta_n) - y_n \right)\nabla f(\theta_n)$$

$$H(J) = \sum_{n=1}^{N}\nabla f(\theta_n)\nabla f(\theta_n)^\top + \sum_{n=1}^{N}\left( f(\theta_n) - y_n \right)\nabla^2 f(\theta_n)$$

The second term of the Hessian goes to zero as the approximation $f(\theta_n)$ approaches the true value $y_n$, which suggests estimating the Hessian matrix by the outer product of the gradient vector. This approximation is quite reasonable for the mean squared error function above, but it is not always proper for an arbitrarily designed cost function. In particular, for classification tasks we often use non-smooth cost functions such as the cross-entropy loss. Consequently, as mentioned in the introduction, this estimation is a potential limitation of adaptive methods when they are applied to various loss functions.

Additionally, one benefit of Adagrad mentioned by the authors of [5] is that, since the method updates the parameter vector element-wise, Adagrad can perform better than previous methods like SGD or momentum when the loss function $J$ is sparse. Compared with the dense case, a sparse $J$ has a relatively higher chance of producing sparse gradient vectors, and for such coordinates Adagrad gives a larger step size, so their gradient directions strongly affect the optimization process. Therefore, rarely occurring features receive more importance than frequently occurring ones.

We use a second subscript for the vector or matrix element index; i.e., $\theta_{t,i}$ means the $i$-th entry of the parameter vector at time step $t$.

Since convexity of $J$ does not imply differentiability of $J$, we import the concept of the sub-gradient, which can be applied to all of the algorithms covered here. The sub-differential set of $J$ evaluated at $\theta$ is denoted $\partial J(\theta)$, and a particular vector in this set is denoted $g_t \in \partial J(\theta_t)$. When $J$ is differentiable, $g_t$ is simply $\nabla_\theta J(\theta_t)$. We also denote by $g_{1:t} = [g_1\ g_2\ \cdots\ g_t]$ the matrix obtained by concatenating the subgradient sequence, and by $g_{1:t,i}$ its $i$-th row.

The important object in Adagrad is the outer product of the sub-gradients, denoted $G_t \in \mathbb{R}^{d \times d}$ where $d$ is the number of entries of $\theta$:

$$G_t = \sum_{\tau=1}^{t} g_\tau g_\tau^\top \quad (17)$$

As mentioned before, Adagrad updates the parameter vector element-wise. In [5, (1)], the update is

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii}} + \epsilon}\, g_{t,i} \quad (18)$$

For the analysis, we convert the form of update (18): following [5], consider a Euclidean space $\Theta$ and rewrite (18) as

$$\theta_{t+1} = \arg\min_{\theta \in \Theta}\left\| \theta - \left( \theta_t - \eta\, \mathrm{diag}(G_t)^{-1/2} g_t \right) \right\|^2_{\mathrm{diag}(G_t)^{1/2}} \quad (19)$$

where the Mahalanobis norm is $\|\cdot\|_A = \sqrt{\langle \cdot,\, A\,\cdot\rangle}$ and $A^{1/2}$ denotes the element-wise square root of the given matrix or vector $A$.
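A minimal sketch of the element-wise update (18), keeping only the diagonal of $G_t$ as the common diagonal variant of Adagrad does in practice; the function name and defaults are our assumptions.

def adagrad_step(theta, G_diag, g, eta, eps=1e-8):
    # Accumulate the diagonal of G_t = sum_tau g_tau g_tau^T, i.e. the
    # coordinate-wise sum of squared (sub)gradients.
    G_diag = G_diag + g * g
    # Update (18): per-coordinate step sizes; rarely-updated (small-G)
    # coordinates receive larger effective learning rates.
    theta = theta - eta * g / (np.sqrt(G_diag) + eps)
    return theta, G_diag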
Next, introduce the Bregman divergence associated with a strongly convex function $\psi$:

$$B_\psi(x, y) = \psi(x) - \psi(y) - \langle \nabla\psi(y),\, x - y\rangle \quad (20)$$

Following [5, (3), (4)], for some regularization function $\phi$ we can convert (19) into

$$\theta_{t+1} = \arg\min_{\theta \in \Theta}\left\{ \eta\langle g_t, \theta\rangle + \eta\phi(\theta) + B_{\psi_t}(\theta, \theta_t) \right\} \quad (21)$$

to update the parameter vector $\theta$.
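As a quick sanity check of the proximal form (21): with $\phi \equiv 0$ and $\psi_t(x) = \frac{1}{2}\langle x, H_t x\rangle$ for a diagonal matrix $H_t$, the minimizer of (21) has the closed form $\theta_t - \eta H_t^{-1}g_t$, which is exactly the element-wise scaling of (18). The snippet below verifies this numerically; all values and names are our own choices, not part of [5].

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d = 4
theta_t = rng.normal(size=d)
g = rng.normal(size=d)
H = np.diag(rng.uniform(1.0, 2.0, size=d))   # stand-in for delta*I + diag(s_t)
eta = 0.5

# Closed-form minimizer of eta*<g, theta> + 0.5*(theta - theta_t)^T H (theta - theta_t)
closed_form = theta_t - eta * np.linalg.solve(H, g)

# Numerical argmin of the same objective, as in (21) with phi = 0
objective = lambda th: eta * g @ th + 0.5 * (th - theta_t) @ H @ (th - theta_t)
numerical = minimize(objective, theta_t).x
assert np.allclose(closed_form, numerical, atol=1e-5)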
Theorem (Duchi, Theorem 5). If $J(\theta)$ is convex and its gradient is $L$-Lipschitz continuous, then for $\theta^* \in \Theta$ the sequence $\{\theta_t\}$ generated by (21) satisfies

$$R_J(T) = O\!\left( \max_{t \leq T}\|\theta_t - \theta^*\|_\infty \sum_{i=1}^{d}\|g_{1:T,i}\| \right)$$

Proof. Let $g_{1:t}$ be defined as above. We have the following proposition; its proof is in [5, Appendix F].

Proposition 2 (Duchi, Proposition 3). Let the sequence $\{\theta_t\}$ be defined by the update (21). Then for any $\theta^* \in \Theta$,

$$R_J(T) \leq \frac{1}{\eta}B_{\psi_1}(\theta^*, \theta_1) + \frac{1}{\eta}\sum_{t=1}^{T-1}\left[ B_{\psi_{t+1}}(\theta^*, \theta_{t+1}) - B_{\psi_t}(\theta^*, \theta_{t+1}) \right] + \frac{\eta}{2}\sum_{t=1}^{T}\|J'(\theta_t)\|^2_{\psi^*_t}$$

Let $s_t$ be the vector at time step $t$ whose $i$-th element is $s_{t,i} = \|g_{1:t,i}\|$. The following lemma is proved in Appendix B.

Lemma 1 (Duchi, Lemma 4). Let $g_t$, $g_{1:t}$ and $s_t$ be defined as above. Then

$$\sum_{t=1}^{T}\left\langle g_t,\, \mathrm{diag}(s_t)^{-1}g_t \right\rangle \leq 2\sum_{i=1}^{d}\|g_{1:T,i}\|$$

Here, define the dual norm associated with $\psi_t(x) = \langle x,\, (\delta I + \mathrm{diag}(s_t))x \rangle$ by

$$\|g\|^2_{\psi^*_t} = \left\langle g,\, (\delta I + \mathrm{diag}(s_t))^{-1}g \right\rangle$$

Since $g_t$ is a subgradient of $J$ at $\theta_t$, we have $\|J'(\theta_t)\|^2_{\psi^*_t} \leq \langle g_t,\, \mathrm{diag}(s_t)^{-1}g_t\rangle$. Thus, by Lemma 1,

$$\sum_{t=1}^{T}\|J'(\theta_t)\|^2_{\psi^*_t} \leq 2\sum_{i=1}^{d}\|g_{1:T,i}\|$$

Now the Bregman divergence terms in the proposition remain. We notice that

$$B_{\psi_{t+1}}(\theta^*, \theta_{t+1}) - B_{\psi_t}(\theta^*, \theta_{t+1}) = \frac{1}{2}\left\langle \theta^* - \theta_{t+1},\, \mathrm{diag}(s_{t+1} - s_t)(\theta^* - \theta_{t+1}) \right\rangle \leq \frac{1}{2}\max_i\left( \theta^*_i - \theta_{t+1,i} \right)^2\|s_{t+1} - s_t\|_1$$

Since $\|s_{t+1} - s_t\|_1 = \langle s_{t+1} - s_t,\, \mathbf{1}\rangle$ and $\langle s_T, \mathbf{1}\rangle = \sum_{i=1}^{d}\|g_{1:T,i}\|$, we have

$$\sum_{t=1}^{T-1}\left[ B_{\psi_{t+1}}(\theta^*, \theta_{t+1}) - B_{\psi_t}(\theta^*, \theta_{t+1}) \right] \leq \frac{1}{2}\sum_{t=1}^{T-1}\|\theta^* - \theta_{t+1}\|^2_\infty\langle s_{t+1} - s_t,\, \mathbf{1}\rangle \leq \frac{1}{2}\max_{t \leq T}\|\theta^* - \theta_t\|^2_\infty\sum_{i=1}^{d}\|g_{1:T,i}\| - \frac{1}{2}\|\theta^* - \theta_1\|^2_\infty\langle s_1, \mathbf{1}\rangle$$

Combining the proposition with the above results and the fact that $B_{\psi_1}(\theta^*, \theta_1) \leq \frac{1}{2}\|\theta^* - \theta_1\|^2_\infty\langle s_1, \mathbf{1}\rangle$, we finally get

$$R_J(T) \leq \frac{1}{2\eta}\max_{t \leq T}\|\theta^* - \theta_t\|^2_\infty\sum_{i=1}^{d}\|g_{1:T,i}\| + \eta\sum_{i=1}^{d}\|g_{1:T,i}\|$$

6 Adam
Consider estimates of the first and the second moments of the gradients. In [9, Algorithm 1], for some $\beta_1, \beta_2 \in [0, 1)$,

$$\begin{aligned} m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\ v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \end{aligned} \quad (22)$$

The authors of this method note that $m_t$ and $v_t$ are biased towards zero, especially during the initial steps and when the decay rates are small (i.e., $\beta_1$ and $\beta_2$ are close to 1). So we need the bias corrections

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \quad (23)$$

The final update equation is

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t \quad (24)$$
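A minimal sketch of one Adam step implementing (22), (23) and (24); the default hyperparameters follow [9, Algorithm 1], while the function and variable names are our assumptions.

import numpy as np

def adam_step(theta, m, v, g, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # (22): exponential moving averages of the gradient and its element-wise square.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    # (23): bias correction for the zero initialization of m and v (t starts at 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # (24): element-wise update.
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v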
We show the regret bound of the Adam method when the learning rate $\eta_t$ decays at a rate of $1/\sqrt{t}$ and the first-moment coefficient $\beta_{1,t}$ decays exponentially with $\lambda$.

Theorem (Kingma, Theorem 4.1). Assume that $J(\theta)$ is convex, that its gradient is bounded, i.e., $\|\nabla J(\theta)\| \leq L$ and $\|\nabla J(\theta)\|_\infty \leq L_\infty$ for all $\theta \in \Theta$, and that the iterates are bounded, i.e., $\|\theta_m - \theta_n\| \leq D$ and $\|\theta_m - \theta_n\|_\infty \leq D_\infty$ for any $m, n \in \{1, 2, \cdots, T\}$. Then for all $T \geq 1$, the sequence $\{\theta_t\}$ generated by (22), (23) and (24) with $\eta_t = \eta/\sqrt{t}$ and $\beta_{1,t} = \beta_1\lambda^{t-1}$, $\lambda \in (0, 1)$, satisfies

$$R_J(T) = O(\sqrt{T})$$

where $\beta_1, \beta_2 \in [0, 1)$ satisfy $\beta_1^2/\sqrt{\beta_2} < 1$.

Proof. The following lemma is used to support the theorem; its proof, together with an auxiliary lemma, is in Appendix C.

Lemma 2 (Kingma, Lemma 10.4). Let $\gamma := \beta_1^2/\sqrt{\beta_2}$. For $\beta_1, \beta_2 \in [0, 1)$ that satisfy $\gamma < 1$ and bounded $g_t$, i.e., $\|g_t\| \leq L$, $\|g_t\|_\infty \leq L_\infty$, the following holds:

$$\sum_{t=1}^{T}\frac{\hat{m}_{t,i}^2}{\sqrt{t\,\hat{v}_{t,i}}} \leq \frac{2L_\infty}{(1-\gamma)^2\sqrt{1-\beta_2}}\|g_{1:T,i}\|$$

where $\hat{m}_t$ and $\hat{v}_t$ are defined in (23).

Since the cost function $J$ is convex, we have

$$J(\theta_t) - J(\theta^*) \leq \langle g_t,\, \theta_t - \theta^*\rangle = \sum_{i=1}^{d} g_{t,i}\left( \theta_{t,i} - \theta^*_i \right)$$

From the update rules, with $\beta_{1,t} = \beta_1\lambda^{t-1}$ for some $\lambda \in (0, 1)$,

$$\theta_{t+1} = \theta_t - \frac{\eta_t\,\hat{m}_t}{\sqrt{\hat{v}_t}} = \theta_t - \frac{\eta_t}{1-\beta_1^t}\left( \frac{\beta_{1,t}}{\sqrt{\hat{v}_t}}m_{t-1} + \frac{1-\beta_{1,t}}{\sqrt{\hat{v}_t}}g_t \right)$$

Focus on the $i$-th element of $\theta_t$. Subtracting $\theta^*_i$ from both sides of the update equation and squaring yields

$$\left( \theta_{t+1,i} - \theta^*_i \right)^2 = \left( \theta_{t,i} - \theta^*_i \right)^2 - \frac{2\eta_t}{1-\beta_1^t}\left( \frac{\beta_{1,t}}{\sqrt{\hat{v}_{t,i}}}m_{t-1,i} + \frac{1-\beta_{1,t}}{\sqrt{\hat{v}_{t,i}}}g_{t,i} \right)\left( \theta_{t,i} - \theta^*_i \right) + \eta_t^2\left( \frac{\hat{m}_{t,i}}{\sqrt{\hat{v}_{t,i}}} \right)^2$$

Rearranging for $g_{t,i}(\theta_{t,i} - \theta^*_i)$ and using $ab \leq \frac{a^2 + b^2}{2}$ on the middle term, we obtain

$$\begin{aligned}
g_{t,i}\left( \theta_{t,i} - \theta^*_i \right) &= \frac{(1-\beta_1^t)\sqrt{\hat{v}_{t,i}}}{2\eta_t(1-\beta_{1,t})}\left( \left( \theta_{t,i} - \theta^*_i \right)^2 - \left( \theta_{t+1,i} - \theta^*_i \right)^2 \right) + \frac{\beta_{1,t}}{1-\beta_{1,t}}\left( \theta^*_i - \theta_{t,i} \right)m_{t-1,i} + \frac{\eta_t(1-\beta_1^t)}{2(1-\beta_{1,t})}\frac{\hat{m}_{t,i}^2}{\sqrt{\hat{v}_{t,i}}} \\
&\leq \frac{\sqrt{\hat{v}_{t,i}}}{2\eta_t(1-\beta_1)}\left( \left( \theta_{t,i} - \theta^*_i \right)^2 - \left( \theta_{t+1,i} - \theta^*_i \right)^2 \right) + \frac{\beta_{1,t}}{2\eta_{t-1}(1-\beta_{1,t})}\left( \theta^*_i - \theta_{t,i} \right)^2\sqrt{\hat{v}_{t-1,i}} \\
&\quad + \frac{\beta_1\eta_{t-1}}{2(1-\beta_1)}\frac{\hat{m}_{t-1,i}^2}{\sqrt{\hat{v}_{t-1,i}}} + \frac{\eta_t}{2(1-\beta_1)}\frac{\hat{m}_{t,i}^2}{\sqrt{\hat{v}_{t,i}}}
\end{aligned}$$

One can notice that $\hat{v}_{t,i} = \sum_{\tau=1}^{t}(1-\beta_2)\beta_2^{t-\tau}g_{\tau,i}^2/(1-\beta_2^t) \leq \|g_{1:t,i}\|^2$: intuitively, the exponentially decaying weighted sum must be less than or equal to the plain sum of the squared sequence. We now apply Lemma 2 to the inequality above and derive the regret bound by summing over the dimensions $i = 1, 2, \cdots, d$ in $J(\theta_t) - J(\theta^*)$ and over the steps $t = 1, 2, \cdots, T$; the summation indices below are aligned by adding or subtracting initial or final terms of the sequences:

$$\begin{aligned}
R_J(T) &\leq \sum_{t=1}^{T}\sum_{i=1}^{d} g_{t,i}\left( \theta_{t,i} - \theta^*_i \right) \\
&\leq \frac{1}{2\eta_1(1-\beta_1)}\sum_{i=1}^{d}\left( \theta_{1,i} - \theta^*_i \right)^2\sqrt{\hat{v}_{1,i}} + \frac{1}{2(1-\beta_1)}\sum_{i=1}^{d}\sum_{t=2}^{T}\left( \theta_{t,i} - \theta^*_i \right)^2\left( \frac{\sqrt{\hat{v}_{t,i}}}{\eta_t} - \frac{\sqrt{\hat{v}_{t-1,i}}}{\eta_{t-1}} \right) \\
&\quad + \sum_{i=1}^{d}\sum_{t=1}^{T}\frac{\beta_{1,t}}{2\eta_t(1-\beta_{1,t})}\left( \theta_{t,i} - \theta^*_i \right)^2\sqrt{\hat{v}_{t,i}} \\
&\quad + \frac{\beta_1\eta L_\infty}{(1-\beta_1)\sqrt{1-\beta_2}(1-\gamma)^2}\sum_{i=1}^{d}\|g_{1:T,i}\| + \frac{\eta L_\infty}{(1-\beta_1)\sqrt{1-\beta_2}(1-\gamma)^2}\sum_{i=1}^{d}\|g_{1:T,i}\|
\end{aligned}$$

From the assumptions, $\|\theta_t - \theta^*\| \leq D$ and $\|\theta_m - \theta_n\|_\infty \leq D_\infty$, so the first two terms telescope and the third is bounded using $\sqrt{\hat{v}_{t,i}} \leq L_\infty$ and $1/\eta_t = \sqrt{t}/\eta$:

$$R_J(T) \leq \frac{D^2}{2\eta(1-\beta_1)}\sum_{i=1}^{d}\sqrt{T\hat{v}_{T,i}} + \frac{D_\infty^2 L_\infty}{2\eta}\sum_{i=1}^{d}\sum_{t=1}^{T}\frac{\beta_{1,t}}{1-\beta_{1,t}}\sqrt{t} + \frac{\eta(1+\beta_1)L_\infty}{(1-\beta_1)\sqrt{1-\beta_2}(1-\gamma)^2}\sum_{i=1}^{d}\|g_{1:T,i}\|$$

For the middle sum, with $\beta_{1,t} = \beta_1\lambda^{t-1}$,

$$\sum_{t=1}^{T}\frac{\beta_{1,t}}{1-\beta_{1,t}}\sqrt{t} \leq \frac{1}{1-\beta_1}\sum_{t=1}^{T}\lambda^{t-1}t \leq \frac{1}{(1-\beta_1)(1-\lambda)^2}$$

Therefore, we have the regret bound

$$R_J(T) \leq \frac{D^2}{2\eta(1-\beta_1)}\sum_{i=1}^{d}\sqrt{T\hat{v}_{T,i}} + \sum_{i=1}^{d}\frac{D_\infty^2 L_\infty}{2\eta(1-\beta_1)(1-\lambda)^2} + \frac{\eta(1+\beta_1)L_\infty}{(1-\beta_1)\sqrt{1-\beta_2}(1-\gamma)^2}\sum_{i=1}^{d}\|g_{1:T,i}\|$$

Since $\hat{v}_{T,i} \leq L_\infty^2$ and $\|g_{1:T,i}\| \leq L_\infty\sqrt{T}$, each term on the right-hand side is $O(\sqrt{T})$, which gives $R_J(T) = O(\sqrt{T})$.
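As an illustrative check (not part of the paper's analysis), the regret (2) bounded by the theorem above can be estimated empirically. The sketch below reuses the assumed quadratic example, grad_full, and adam_step from the earlier snippets, with $\eta_t = \eta/\sqrt{t}$ as in the theorem; $R_J(T) = O(\sqrt{T})$ predicts that the printed ratio stays bounded as $T$ grows.

theta_star = np.ones(5)                     # exact minimizer of the assumed quadratic
J = lambda th: 0.5 * np.sum((X @ th - y) ** 2)

theta = np.zeros(5)
m, v = np.zeros(5), np.zeros(5)
regret, T = 0.0, 1000
for t in range(1, T + 1):
    regret += J(theta) - J(theta_star)
    theta, m, v = adam_step(theta, m, v, grad_full(theta), t, eta=0.1 / np.sqrt(t))
print(regret / np.sqrt(T))                  # R_J(T) / sqrt(T)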
Appendix

A Proof of Proposition 1

Proposition 1.
For a convex function $J(\theta)$ with $L$-Lipschitz continuous gradient, the following inequalities hold:

$$0 \leq J(y) - J(x) - \langle \nabla J(x),\, y - x\rangle \leq \frac{L}{2}\|x - y\|^2 \quad (25)$$

$$J(x) + \langle \nabla J(x),\, y - x\rangle + \frac{1}{2L}\|\nabla J(x) - \nabla J(y)\|^2 \leq J(y) \quad (26)$$

Proof.
Clearly, (25) comes from the definitions of convexity and of the $L$-Lipschitz continuous gradient (see (3) and (5)). The remaining part is (26). For some fixed $x \in \Theta$, consider the function $g(z) = J(z) - \langle \nabla J(x),\, z\rangle$. Then for any $y, z$,

$$g(y) - g(z) - \langle \nabla g(z),\, y - z\rangle = J(y) - \langle \nabla J(x), y\rangle - J(z) + \langle \nabla J(x), z\rangle - \langle \nabla J(z) - \nabla J(x),\, y - z\rangle = J(y) - J(z) - \langle \nabla J(z),\, y - z\rangle$$

Thus $g$ is also a convex function with $L$-Lipschitz continuous gradient, and since $\nabla g(x) = 0$, its optimal point is $z^* = x$. Applying the second inequality of (25) to $g$ at the points $y$ and $y - \frac{1}{L}\nabla g(y)$ yields

$$g\!\left( y - \frac{1}{L}\nabla g(y) \right) - g(y) - \left\langle \nabla g(y),\, -\frac{1}{L}\nabla g(y) \right\rangle \leq \frac{L}{2}\left\| \frac{1}{L}\nabla g(y) \right\|^2$$

Since $z^* = x$ is the optimal point of $g$, we have

$$g(x) = g(z^*) \leq g\!\left( y - \frac{1}{L}\nabla g(y) \right) \leq g(y) - \frac{1}{2L}\|\nabla g(y)\|^2$$

From $\nabla g(y) = \nabla J(y) - \nabla J(x)$, we get

$$J(x) - \langle \nabla J(x),\, x\rangle \leq J(y) - \langle \nabla J(x),\, y\rangle - \frac{1}{2L}\|\nabla J(y) - \nabla J(x)\|^2$$

Since we started with an arbitrary $x$ as a dummy variable, rearranging finally gives the inequality

$$J(x) + \langle \nabla J(x),\, y - x\rangle + \frac{1}{2L}\|\nabla J(x) - \nabla J(y)\|^2 \leq J(y)$$

B Proof of Lemma 1
Lemma 1.
Let $g_t$, $g_{1:t}$ and $s_t$ be defined as above. Then

$$\sum_{t=1}^{T}\left\langle g_t,\, \mathrm{diag}(s_t)^{-1}g_t \right\rangle \leq 2\sum_{i=1}^{d}\|g_{1:T,i}\|$$

Proof.
We prove the lemma by first considering an arbitrary real-valued sequence $\{a_t\}$ with $a_{1:t} = [a_1, a_2, \cdots, a_t]$, for which we claim

$$\sum_{t=1}^{T}\frac{a_t^2}{\|a_{1:t}\|} \leq 2\|a_{1:T}\|$$

We use induction on $T$ to prove this inequality. For $T = 1$, the inequality clearly holds. Assume it holds for $T - 1$; then by the induction assumption,

$$\sum_{t=1}^{T}\frac{a_t^2}{\|a_{1:t}\|} = \sum_{t=1}^{T-1}\frac{a_t^2}{\|a_{1:t}\|} + \frac{a_T^2}{\|a_{1:T}\|} \leq 2\|a_{1:T-1}\| + \frac{a_T^2}{\|a_{1:T}\|}$$

Let $b_T = \sum_{t=1}^{T}a_t^2$, so that $\|a_{1:T-1}\| = \sqrt{b_T - a_T^2}$. Since

$$\sqrt{b_T - a_T^2} \leq \sqrt{b_T - a_T^2 + \frac{a_T^4}{4b_T}} = \sqrt{b_T} - \frac{a_T^2}{2\sqrt{b_T}}$$

we obtain

$$2\|a_{1:T-1}\| + \frac{a_T^2}{\|a_{1:T}\|} = 2\sqrt{b_T - a_T^2} + \frac{a_T^2}{\sqrt{b_T}} \leq 2\sqrt{b_T} = 2\|a_{1:T}\|$$

Note that by construction $s_{t,i} = \|g_{1:t,i}\|$, so

$$\sum_{t=1}^{T}\left\langle g_t,\, \mathrm{diag}(s_t)^{-1}g_t \right\rangle = \sum_{t=1}^{T}\sum_{i=1}^{d}\frac{g_{t,i}^2}{\|g_{1:t,i}\|} \leq 2\sum_{i=1}^{d}\|g_{1:T,i}\|$$

C Proof of Lemma 2
Before beginning the proof of Lemma 2, we first prove the following auxiliary lemma.
Lemma 3 (Kingma, Lemma 10.3). Let $g_t = \nabla J(\theta_t)$ be bounded, i.e., $\|g_t\| \leq L$ and $\|g_t\|_\infty \leq L_\infty$, and let $g_{1:t} = [g_1, g_2, \cdots, g_t]$. Then

$$\sum_{t=1}^{T}\sqrt{\frac{g_{t,i}^2}{t}} \leq 2L_\infty\|g_{1:T,i}\|$$

Proof.
We will prove the inequality using induction over $T$. For $T = 1$, we have $\sqrt{g_{1,i}^2} = |g_{1,i}| \leq 2L_\infty\|g_{1,i}\|$. For the inductive step, we assume the following is true:

$$\sum_{t=1}^{T-1}\sqrt{\frac{g_{t,i}^2}{t}} \leq 2L_\infty\|g_{1:T-1,i}\|$$

Then

$$\sum_{t=1}^{T}\sqrt{\frac{g_{t,i}^2}{t}} \leq 2L_\infty\|g_{1:T-1,i}\| + \sqrt{\frac{g_{T,i}^2}{T}} = 2L_\infty\sqrt{\|g_{1:T,i}\|^2 - g_{T,i}^2} + \sqrt{\frac{g_{T,i}^2}{T}}$$

We want to show that the last expression is at most $2L_\infty\|g_{1:T,i}\|$. From the fact that

$$\|g_{1:T,i}\|^2 - g_{T,i}^2 \leq \|g_{1:T,i}\|^2 - g_{T,i}^2 + \frac{g_{T,i}^4}{4\|g_{1:T,i}\|^2}$$

we take the square root of both sides. Since $|g_{t,i}| \leq \|g_t\|_\infty \leq L_\infty$ implies $\|g_{1:T,i}\|^2 \leq TL_\infty^2$, we have

$$\sqrt{\|g_{1:T,i}\|^2 - g_{T,i}^2} \leq \|g_{1:T,i}\| - \frac{g_{T,i}^2}{2\|g_{1:T,i}\|} \leq \|g_{1:T,i}\| - \frac{g_{T,i}^2}{2\sqrt{TL_\infty^2}}$$

Therefore, substituting the square-root term yields

$$2L_\infty\sqrt{\|g_{1:T,i}\|^2 - g_{T,i}^2} + \sqrt{\frac{g_{T,i}^2}{T}} \leq 2L_\infty\|g_{1:T,i}\|$$

Lemma 2.
Let $\gamma := \beta_1^2/\sqrt{\beta_2}$. For $\beta_1, \beta_2 \in [0, 1)$ that satisfy $\gamma < 1$ and bounded $g_t$, i.e., $\|g_t\| \leq L$, $\|g_t\|_\infty \leq L_\infty$, the following holds:

$$\sum_{t=1}^{T}\frac{\hat{m}_{t,i}^2}{\sqrt{t\,\hat{v}_{t,i}}} \leq \frac{2L_\infty}{(1-\gamma)^2\sqrt{1-\beta_2}}\|g_{1:T,i}\|$$

where $\hat{m}_t$ and $\hat{v}_t$ are defined in (23).

Proof. Under the assumption $\frac{\sqrt{1-\beta_2^t}}{(1-\beta_1^t)^2} \leq \frac{1}{(1-\beta_1)^2}$, we can expand the last term of the summation:

$$\begin{aligned}
\sum_{t=1}^{T}\frac{\hat{m}_{t,i}^2}{\sqrt{t\,\hat{v}_{t,i}}} &= \sum_{t=1}^{T-1}\frac{\hat{m}_{t,i}^2}{\sqrt{t\,\hat{v}_{t,i}}} + \frac{\sqrt{1-\beta_2^T}}{(1-\beta_1^T)^2}\frac{\left( \sum_{k=1}^{T}(1-\beta_1)\beta_1^{T-k}g_{k,i} \right)^2}{\sqrt{T\sum_{j=1}^{T}(1-\beta_2)\beta_2^{T-j}g_{j,i}^2}} \\
&\leq \sum_{t=1}^{T-1}\frac{\hat{m}_{t,i}^2}{\sqrt{t\,\hat{v}_{t,i}}} + \frac{\sqrt{1-\beta_2^T}}{(1-\beta_1^T)^2}\sum_{k=1}^{T}\frac{T\left( (1-\beta_1)\beta_1^{T-k}g_{k,i} \right)^2}{\sqrt{T\sum_{j=1}^{T}(1-\beta_2)\beta_2^{T-j}g_{j,i}^2}} \\
&\leq \sum_{t=1}^{T-1}\frac{\hat{m}_{t,i}^2}{\sqrt{t\,\hat{v}_{t,i}}} + \frac{\sqrt{1-\beta_2^T}}{(1-\beta_1^T)^2}\sum_{k=1}^{T}\frac{T\left( (1-\beta_1)\beta_1^{T-k}g_{k,i} \right)^2}{\sqrt{T(1-\beta_2)\beta_2^{T-k}g_{k,i}^2}} \\
&\leq \sum_{t=1}^{T-1}\frac{\hat{m}_{t,i}^2}{\sqrt{t\,\hat{v}_{t,i}}} + \frac{\sqrt{1-\beta_2^T}(1-\beta_1)^2}{(1-\beta_1^T)^2}\frac{T}{\sqrt{T(1-\beta_2)}}\sum_{k=1}^{T}\left( \frac{\beta_1^2}{\sqrt{\beta_2}} \right)^{T-k}|g_{k,i}| \\
&\leq \sum_{t=1}^{T-1}\frac{\hat{m}_{t,i}^2}{\sqrt{t\,\hat{v}_{t,i}}} + \frac{T}{\sqrt{T(1-\beta_2)}}\sum_{k=1}^{T}\gamma^{T-k}|g_{k,i}|
\end{aligned}$$

Expanding the rest of the terms in the summation in the same way yields

$$\sum_{t=1}^{T}\frac{\hat{m}_{t,i}^2}{\sqrt{t\,\hat{v}_{t,i}}} \leq \sum_{t=1}^{T}\frac{|g_{t,i}|}{\sqrt{t(1-\beta_2)}}\sum_{j=0}^{T-t}t\gamma^j \leq \sum_{t=1}^{T}\frac{|g_{t,i}|}{\sqrt{t(1-\beta_2)}}\sum_{j=0}^{T}t\gamma^j$$

For $\gamma < 1$, the sum of the arithmetic-geometric series is bounded as $\sum_t t\gamma^t \leq \frac{1}{(1-\gamma)^2}$, which yields

$$\sum_{t=1}^{T}\frac{|g_{t,i}|}{\sqrt{t(1-\beta_2)}}\sum_{j=0}^{T}t\gamma^j \leq \frac{1}{(1-\gamma)^2\sqrt{1-\beta_2}}\sum_{t=1}^{T}\frac{|g_{t,i}|}{\sqrt{t}}$$

Finally, applying Lemma 3 yields

$$\sum_{t=1}^{T}\frac{\hat{m}_{t,i}^2}{\sqrt{t\,\hat{v}_{t,i}}} \leq \frac{2L_\infty}{(1-\gamma)^2\sqrt{1-\beta_2}}\|g_{1:T,i}\|$$

References
[1] Sebastian Ruder. An overview of gradient descent optimization algorithms. Insight Centre for Data Analytics, NUI Galway, 2016.
[2] Euhanna Ghadimi, Hamid Reza Feyzmahdavian, and Mikael Johansson. Global convergence of the Heavy-ball method for convex optimization. arXiv:1412.7457v1, 2014.
[3] Tianbao Yang, Qihang Lin, and Zhe Li. Unified Convergence Analysis of Stochastic Momentum Methods for Convex and Non-convex Optimization. arXiv:1604.03257v2, 2016.
[4] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2004.
[5] John Duchi, Elad Hazan, and Yoram Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 2011.
[6] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[7] Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The Marginal Value of Adaptive Gradient Methods in Machine Learning. arXiv:1705.08292v1, 2017.
[8] Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method. arXiv:1212.5701v1, 2012.
[9] Diederik P. Kingma and Jimmy Lei Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980v9, 2015.