Automatic, Dynamic, and Nearly Optimal Learning Rate Specification by Local Quadratic Approximation
Yingqiu Zhu, School of Statistics, Renmin University of China, Beijing, China
Yu Chen, Guanghua School of Management, Peking University, Beijing, China
Danyang Huang (Corresponding Author), Center for Applied Statistics and School of Statistics, Renmin University of China, Beijing, China
Bo Zhang, Center for Applied Statistics and School of Statistics, Renmin University of China, Beijing, China
Hansheng Wang, Guanghua School of Management, Peking University, Beijing, China
ABSTRACT
In deep learning tasks, the learning rate determines the update step size in each iteration and therefore plays a critical role in gradient-based optimization. However, the appropriate learning rate in practice is typically chosen by subjective judgement. In this work, we propose a novel optimization method based on local quadratic approximation (LQA). In each update step, given the gradient direction, we locally approximate the loss function by a standard quadratic function of the learning rate. Then, we propose an approximation step to obtain a nearly optimal learning rate in a computationally efficient way. The proposed LQA method has three important features. First, the learning rate is automatically determined in each update step. Second, it is dynamically adjusted according to the current loss function value and the parameter estimates. Third, with the gradient direction fixed, the proposed method leads to nearly the greatest reduction in terms of the loss function. Extensive experiments have been conducted to demonstrate the strengths of the proposed LQA method.
CCS CONCEPTS
• Computing methodologies → Neural networks; Batch learning.

KEYWORDS
neural networks, gradient descent, learning rate, machine learning
ACM Reference Format:
Yingqiu Zhu, Yu Chen, Danyang Huang, Bo Zhang, and Hansheng Wang. 2020. Automatic, Dynamic, and Nearly Optimal Learning Rate Specification by Local Quadratic Approximation. In CIKM '20: 29th ACM International Conference on Information and Knowledge Management, October 19–23, 2020, Galway, Ireland. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/1122445.1122456
1 INTRODUCTION
In recent years, the development of deep learning has led to remarkable success in visual recognition [7, 10, 14], speech recognition [8, 29], natural language processing [2, 5], and many other fields. For different learning tasks, researchers have developed different network frameworks, including deep convolutional neural networks [14, 16], recurrent neural networks [6], graph convolutional networks [12] and reinforcement learning [19, 20]. Although the network structures can be totally different, the training methods are typically similar. They are often gradient descent methods, which are developed based on backpropagation [24].
Given a differentiable objective function, gradient descent is a natural and efficient method for optimization. Among various gradient descent methods, the stochastic gradient descent (SGD) method [23] plays a critical role. In the standard SGD method, the first-order gradient of a randomly selected sample is used to iteratively update the parameter estimates of a network. Specifically, the parameter estimates are adjusted with the negative of the random gradient multiplied by a step size. The step size is called the learning rate. Many generalized methods based on the SGD method have been proposed [1, 4, 11, 25, 26]. Most of these extensions specify improved update rules to adjust the direction or the step size. However, [1] pointed out that many hand-designed update rules are designed for circumstances with certain characteristics, such as sparsity or nonconvexity. As a result, rule-based methods might perform well in some cases but poorly in others. Consequently, an optimizer with an automatically adjusted update rule is preferable.

An update rule contains two important components: one is the update direction, and the other is the step size. The learning rate determines the step size, which plays a significant role in optimization. If it is set inappropriately, the parameter estimates could be suboptimal. Empirical experience suggests that a relatively larger learning rate might be preferred in the early stages of the optimization; otherwise, the algorithm might converge very slowly. In contrast, a relatively smaller learning rate should be used in the later stages; otherwise, the objective function cannot be fully optimized. This phenomenon inspires us to design a method to automatically search for an optimal learning rate in each update step during optimization.

To this end, we propose here a novel optimization method based on local quadratic approximation (LQA). It tunes the learning rate in a dynamic, automatic and nearly optimal manner, so that a nearly best step size is obtained in each update step. Intuitively, given a search direction, what should be the best step size? One natural definition is the step size that leads to the greatest reduction in the global loss. Accordingly, the step size itself should be treated as a parameter that needs to be optimized. For this purpose, the proposed method can be decomposed into two important steps: the expansion step and the approximation step. First, in the expansion step, we conduct a Taylor expansion of the loss function around the current parameter estimates. Accordingly, the objective function can be locally approximated by a quadratic function of the learning rate. The learning rate is then treated as a parameter to be optimized, which leads to a nearly optimal determination of the learning rate for this particular update step.

Second, to implement this idea, we need to compute the first- and second-order derivatives of the objective function along the gradient direction. One way to do so is to compute the Hessian matrix of the loss function. However, this solution is computationally expensive, because many complex deep neural networks involve a large number of parameters, which makes the Hessian matrix ultrahigh dimensional. To solve this problem, we propose a novel approximation step. Note that, given a fixed gradient direction, the loss function can be approximated by a standard quadratic function with the learning rate as the only input variable.
For a univariate quadratic function such as this, there are only two unknown coefficients: the linear-term coefficient and the quadratic-term coefficient. As long as these two coefficients are determined, the optimal learning rate can be obtained. To estimate the two unknown coefficients, one can evaluate the objective function at, for example, two different but reasonably small trial learning rates. This yields two equations, which can be solved for the two unknown coefficients of the quadratic approximation. Thereafter, the optimal learning rate can be obtained.
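To make the idea concrete, here is a small worked illustration with hypothetical numbers (not taken from the paper). Suppose the batch loss at the current estimate is f(0) = 1.00, and a small trial rate δ₀ = 0.01 gives a loss of 0.98 after a trial step along the negative gradient (the descent direction) and 1.03 after a trial step along the positive gradient. Under the local quadratic model f(δ) ≈ f(0) − aδ + (b/2)δ², the two coefficients and the implied optimal rate are

a = (1.03 − 0.98) / (2 × 0.01) = 2.5,
b = (1.03 + 0.98 − 2 × 1.00) / 0.01² = 100,
δ* = a / b = 0.025,

so the selected rate is larger than the trial rate and predicts a loss of about 0.97 after the full step. Section 3 formalizes this construction.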
Our contributions: We propose an automatic, dynamic and nearly optimal learning rate tuning algorithm that has the following three important features.
(1) The algorithm is automatic. In other words, it leads to an optimization method with little subjective judgement.
(2) The method is dynamic in the sense that the learning rate used in each update step is different. It is dynamically adjusted according to the current status of the loss function and the parameter estimates. Typically, larger rates are used in the earlier iterations, while smaller rates are used in the later iterations.
(3) The learning rate derived from the proposed method is nearly optimal. For each update step, by the novel quadratic approximation, the learning rate leads to almost the greatest reduction in terms of the loss function. Here, "almost" refers to the fact that the loss function is locally approximated by a quadratic function whose unknown coefficients are numerically estimated. For this particular update step, with the gradient direction fixed, and among all possible learning rates, the one determined by the proposed method results in nearly the greatest reduction in terms of the loss function.
The rest of this article is organized as follows. In Section 2, we review related works on gradient-based optimizers. Section 3 presents the proposed algorithm in detail. In Section 4, we verify the performance of the proposed method through empirical studies on open datasets. Concluding remarks are given in Section 5.
2 RELATED WORK
To optimize a loss function, two important components need to be specified: the update direction and the step size. Ideally, the best update direction is the gradient computed for the loss function based on the whole dataset. For convenience, we refer to it as the global gradient. Since the calculation of the global gradient is computationally expensive, the SGD method [23] uses the gradient estimated from a stochastic subsample in each iteration, which we refer to as a sample gradient. This leads to fairly satisfactory empirical performance. The SGD method has inspired many new optimization methods, most of which enhance performance by improving the estimation of the global gradient direction. A natural improvement is to combine sample gradients from different update steps so that a more reliable estimate of the global gradient direction can be obtained. This idea has led to momentum-based optimization methods, such as those proposed in [3, 15, 25, 27]. In particular, [3] adopted Nesterov's accelerated gradient algorithm [21] to further improve the calculation of the gradient direction.

Other optimization methods focus on the adjustment of the step size. [4] proposed AdaGrad, in which the step size is iteratively decreased according to a prespecified function. However, it still involves a parameter related to the learning rate, which needs to be subjectively determined. Further extensions of AdaGrad have been proposed, such as RMSProp [26] and AdaDelta [30]. In particular, RMSProp introduced a decay factor to adjust the weights of previous sample gradients. [11] proposed the adaptive moment estimation (Adam) method, which combines RMSProp with a momentum-based method. Accordingly, both the step size and the update direction are adjusted in each iteration. However, because the step sizes are adjusted without considering the loss function, the loss reduction obtained in each update step is suboptimal. Thus, the resulting convergence rate can be further improved.

To summarize, most existing optimization methods suffer from one or both of the following two limitations. First, they are not automatic, and human intervention is required. Second, they are suboptimal, because the loss reduction achieved in each update step can be further improved. These pioneering works inspired us to develop a new method for automatic determination of the learning rate. Ideally, the new method should be automatic, with little human intervention. It should be dynamic, so that the learning rate used in each update step is particularly selected. Most importantly, in each update step, the learning rate determined by the new method should be optimal (or nearly optimal) in terms of the loss reduction, given a fixed update direction.
3 THE PROPOSED METHOD
In this section, we first introduce the notation used in this paper and the general formulation of the SGD method. Then, we propose an algorithm based on local quadratic approximation to dynamically search for an optimal learning rate. This results in a new variant of the SGD method.
Assume we have a total of N samples, indexed by 1 ≤ i ≤ N and collected by S = {1, 2, · · · , N}. For each sample, a loss function can be defined as ℓ(X_i; θ), where X_i is the input corresponding to the i-th sample and θ ∈ R^p denotes the parameter. Then the global loss function can be defined as

\ell(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(X_i; \theta) = \frac{1}{|S|} \sum_{i \in S} \ell(X_i; \theta).

Ideally, one should optimize ℓ(θ) by a gradient descent algorithm. Assume there are a total of T iterations, and let θ̂^(t) be the parameter estimate obtained in the t-th iteration. Then the estimate in the next iteration, θ̂^(t+1), is given by

\hat\theta^{(t+1)} = \hat\theta^{(t)} - \delta \nabla \ell\big(\hat\theta^{(t)}\big),

where δ is the learning rate and ∇ℓ(θ̂^(t)) is the gradient of the global loss function ℓ(θ) with respect to θ at θ̂^(t). More specifically, ∇ℓ(θ̂^(t)) = N^{-1} Σ_{i=1}^{N} ∇ℓ(X_i; θ̂^(t)), where ∇ℓ(X_i; θ̂^(t)) is the gradient of the local loss function for the i-th sample.

Unfortunately, such a straightforward implementation is computationally expensive if the sample size N is relatively large, which is particularly true if the dimensionality of θ is also ultrahigh. To alleviate the computational burden, researchers proposed the idea of SGD. The key idea is to randomly partition the whole sample into a number of nonoverlapping batches. For example, we can write S = ∪_{k=1}^{K} S_k, where S_k collects the indices of the samples in the k-th batch. We should have S_{k_1} ∩ S_{k_2} = ∅ for any k_1 ≠ k_2 and |S_k| = n for any 1 ≤ k ≤ K, where n is a fixed batch size. Next, instead of computing the global gradient ∇ℓ(θ̂^(t)), we replace it by an estimate computed on the k-th batch. More specifically, each iteration (e.g., the t-th iteration) is further decomposed into a total of K batch steps. Let θ̂^(t,k) be the estimate obtained during the k-th (1 ≤ k ≤ K) batch step of the t-th iteration. Then we have

\hat\theta^{(t,k+1)} = \hat\theta^{(t,k)} - \frac{\delta}{n} \sum_{i \in S_k} \nabla \ell\big(X_i; \hat\theta^{(t,k)}\big),

where k = 1, · · · , K − 1. In particular, θ̂^(t+1,1) = θ̂^(t,K) − δ n^{-1} Σ_{i∈S_K} ∇ℓ(X_i; θ̂^(t,K)). By doing so, the computational burden is alleviated. However, the tradeoff is that the batch-sample-based gradient estimate could be unstable, which is particularly true if the batch size n is relatively small. To fix this problem, various momentum-based methods have been proposed. The key idea is to record the gradients from previous iterations and integrate them to form a more stable estimate.
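As a minimal illustrative sketch (ours, not code from the paper), one such batch update can be written as follows. Here grad_loss is a hypothetical per-sample gradient helper, and the learning rate delta is held fixed; choosing this quantity automatically is exactly what the proposed method does in the next subsection.

import numpy as np

def sgd_batch_update(theta, X_batch, grad_loss, delta):
    # One mini-batch SGD step: theta <- theta - delta * (mean gradient over the batch S_k).
    # theta     : current parameter estimate, shape (p,)
    # X_batch   : the samples whose indices form the batch S_k
    # grad_loss : hypothetical helper, grad_loss(x_i, theta) -> per-sample gradient, shape (p,)
    # delta     : fixed learning rate
    g = np.mean([grad_loss(x_i, theta) for x_i in X_batch], axis=0)
    return theta - delta * g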
In this work, we assume that for each batch step, the estimate of the gradient direction is given. It can be obtained by different algorithms. For example, it could be the estimate obtained by a standard SGD algorithm, or an estimate that involves rule-based corrections, such as that from a momentum-based method. We focus on how to specify the learning rate in an optimal (or nearly optimal) way.

To this end, we treat the learning rate δ as an unknown parameter. It is remarkable that the optimal learning rate can change dynamically across batch steps. Thus, we use δ_{t,k} to denote the learning rate in the k-th batch step within the t-th iteration. Since the reduction in the loss in this batch step is influenced by the learning rate δ_{t,k}, we express it as a function of the learning rate, ∆ℓ(δ_{t,k}).

To find the optimal value of δ_{t,k}, we investigate the optimization of ∆ℓ(δ_{t,k}) based on a Taylor expansion. For simplicity, we use g_{t,k} = n^{-1} Σ_{i∈S_k} ∇ℓ(X_i; θ̂^(t,k)) to denote the current gradient. Given θ̂^(t,k) and g_{t,k}, the loss reduction can be expressed as

\Delta\ell(\delta_{t,k}) = \frac{1}{n} \sum_{i \in S_k} \Big[ \ell\big(X_i; \hat\theta^{(t,k+1)}\big) - \ell\big(X_i; \hat\theta^{(t,k)}\big) \Big]
= \frac{1}{n} \sum_{i \in S_k} \Big[ \ell\big(X_i; \hat\theta^{(t,k)} - \delta_{t,k} g_{t,k}\big) - \ell\big(X_i; \hat\theta^{(t,k)}\big) \Big].

Then, two estimation steps are conducted to determine an appropriate value for δ_{t,k} in this batch step.

(1) Expansion Step. By a Taylor expansion of ℓ(X_i; θ) around θ̂^(t,k), we have

\ell\big(X_i; \hat\theta^{(t,k)} - \delta_{t,k} g_{t,k}\big) = \ell\big(X_i; \hat\theta^{(t,k)}\big) - \nabla\ell\big(X_i; \hat\theta^{(t,k)}\big)^\top \delta_{t,k} g_{t,k} + \frac{1}{2} \delta_{t,k}^2 \, g_{t,k}^\top \nabla^2\ell\big(X_i; \hat\theta^{(t,k)}\big) g_{t,k} + o\big(\delta_{t,k}^2 g_{t,k}^\top g_{t,k}\big),

where ∇ℓ(X_i; θ̂^(t,k)) and ∇²ℓ(X_i; θ̂^(t,k)) denote the first- and second-order derivatives of the local loss function, respectively. As a result, the reduction is

\Delta\ell(\delta_{t,k}) = -\frac{1}{n} \sum_{i \in S_k} \nabla\ell\big(X_i; \hat\theta^{(t,k)}\big)^\top g_{t,k} \, \delta_{t,k} + \frac{1}{2n} \sum_{i \in S_k} g_{t,k}^\top \nabla^2\ell\big(X_i; \hat\theta^{(t,k)}\big) g_{t,k} \, \delta_{t,k}^2 + o\big(n^{-1} \delta_{t,k}^2 g_{t,k}^\top g_{t,k}\big). (1)

According to (1), ∆ℓ(δ_{t,k}) is approximately a quadratic function of δ_{t,k}. For simplicity, the coefficients of the linear and quadratic terms are denoted as

a_{t,k} = \frac{1}{n} \sum_{i \in S_k} \nabla\ell\big(X_i; \hat\theta^{(t,k)}\big)^\top g_{t,k}, \quad b_{t,k} = \frac{1}{n} \sum_{i \in S_k} g_{t,k}^\top \nabla^2\ell\big(X_i; \hat\theta^{(t,k)}\big) g_{t,k},

respectively. Since the Taylor remainder here is negligible, (1) can be simply written as

\Delta\ell(\delta_{t,k}) \approx - a_{t,k} \delta_{t,k} + \frac{1}{2} b_{t,k} \delta_{t,k}^2. (2)

The greatest loss reduction is obtained by setting the derivative of (2) with respect to δ_{t,k} to zero,

\frac{\partial \Delta\ell(\delta_{t,k})}{\partial \delta_{t,k}} \approx - a_{t,k} + b_{t,k} \delta_{t,k} = 0.

As a result, the optimal learning rate in this batch step can be approximated by

\delta^{*}_{t,k} = (b_{t,k})^{-1} a_{t,k}. (3)

Note that the computation of b_{t,k} involves both the first- and second-order derivatives.
For a general loss function, this calculation can be computationally expensive in real applications. Thus, an approximation step is preferred to improve the computational efficiency.

(2) Approximation Step. To compute the coefficients a_{t,k} and b_{t,k} while avoiding the computation of second-order derivatives, we consider the following approximation method. The basic idea is to build two equations with respect to the two unknown coefficients. Let g_{t,k} be a given estimate of the gradient direction. We then compute

\sum_{i \in S_k} \ell\big(X_i; \hat\theta^{(t,k)} - \delta g_{t,k}\big) = \sum_{i \in S_k} \ell\big(X_i; \hat\theta^{(t,k)}\big) - a_{t,k} \delta n + \frac{1}{2} b_{t,k} \delta^2 n, (4)

\sum_{i \in S_k} \ell\big(X_i; \hat\theta^{(t,k)} + \delta g_{t,k}\big) = \sum_{i \in S_k} \ell\big(X_i; \hat\theta^{(t,k)}\big) + a_{t,k} \delta n + \frac{1}{2} b_{t,k} \delta^2 n, (5)

for a reasonably small trial learning rate δ. A natural choice for δ is δ̂*_{t,k−1} if k > 1, or δ̂*_{t−1,K} if k = 1.
By solving (4) and (5), we obtain

\tilde b_{t,k} = \frac{1}{n \delta^2} \sum_{i \in S_k} \Big[ \ell\big(X_i; \hat\theta^{(t,k)} + \delta g_{t,k}\big) + \ell\big(X_i; \hat\theta^{(t,k)} - \delta g_{t,k}\big) - 2\,\ell\big(X_i; \hat\theta^{(t,k)}\big) \Big], (6)

\tilde a_{t,k} = \frac{1}{2 n \delta} \sum_{i \in S_k} \Big[ \ell\big(X_i; \hat\theta^{(t,k)} + \delta g_{t,k}\big) - \ell\big(X_i; \hat\theta^{(t,k)} - \delta g_{t,k}\big) \Big], (7)

where ã_{t,k} and b̃_{t,k} serve as approximations of a_{t,k} and b_{t,k}, respectively. Substituting these back into (3) gives the approximated optimal learning rate δ̂*_{t,k} = (b̃_{t,k})^{-1} ã_{t,k}. Because δ̂*_{t,k} is (nearly) optimally selected, the reduction in the loss function is nearly optimal for each batch step. As a consequence, the total number of iterations required for convergence can be much reduced, which makes the whole algorithm converge much faster than usual. A minimal code sketch of this estimation step is given below, and Algorithm 1 summarizes the complete procedure in pseudocode.
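The following sketch (ours, not from the paper) illustrates the approximation step for a single batch step; loss_fn and grad_fn are hypothetical per-sample loss and gradient helpers, and delta0 denotes the small trial learning rate δ used in (4) and (5).

import numpy as np

def lqa_learning_rate(loss_fn, X_batch, theta, g, delta0):
    # Approximate the optimal learning rate via equations (6), (7) and (3).
    # loss_fn : hypothetical helper, loss_fn(x_i, theta) -> per-sample loss value
    # X_batch : samples in the current batch S_k
    # theta   : current parameter estimate, shape (p,)
    # g       : current batch gradient g_{t,k}, shape (p,)
    # delta0  : small trial learning rate, e.g. the estimate from the previous batch step
    n = len(X_batch)
    loss_0 = sum(loss_fn(x, theta) for x in X_batch)                       # loss at the current estimate
    loss_plus = sum(loss_fn(x, theta + delta0 * g) for x in X_batch)       # trial step along the positive gradient (uphill)
    loss_minus = sum(loss_fn(x, theta - delta0 * g) for x in X_batch)      # trial step along the negative gradient (descent)
    a_tilde = (loss_plus - loss_minus) / (2.0 * n * delta0)                # equation (7)
    b_tilde = (loss_plus + loss_minus - 2.0 * loss_0) / (n * delta0 ** 2)  # equation (6)
    return a_tilde / b_tilde                                               # equation (3)

def lqa_batch_update(loss_fn, grad_fn, X_batch, theta, delta0):
    # One batch step: estimate the gradient, choose the step size by LQA, then update.
    g = np.mean([grad_fn(x, theta) for x in X_batch], axis=0)
    delta_star = lqa_learning_rate(loss_fn, X_batch, theta, g, delta0)
    return theta - delta_star * g, delta_star

Note that, beyond the loss at the current estimate, only two additional loss evaluations (at the two trial points) are needed and no second-order derivatives are formed, which is the source of the computational savings discussed after Algorithm 1.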
Algorithm 1 Local quadratic approximation (LQA)

Require: T: number of iterations; K: number of batches within one iteration; ℓ(θ): loss function with parameters θ; θ_0: initial estimate for the parameters (e.g., a zero vector); δ: initial (small) trial learning rate.
  t ← 1; θ̂^(1,1) ← θ_0;
  while t ≤ T do
    k ← 1;
    while k ≤ K − 1 do
      Compute the gradient g_{t,k};
      Compute ã_{t,k} and b̃_{t,k} according to (6) and (7);
      δ̂*_{t,k} ← (b̃_{t,k})^{-1} ã_{t,k};
      θ̂^(t,k+1) ← θ̂^(t,k) − δ̂*_{t,k} g_{t,k};
      k ← k + 1;
    end while
    Compute the gradient g_{t,K};
    Compute ã_{t,K} and b̃_{t,K} according to (6) and (7);
    δ̂*_{t,K} ← (b̃_{t,K})^{-1} ã_{t,K};
    θ̂^(t+1,1) ← θ̂^(t,K) − δ̂*_{t,K} g_{t,K};
    t ← t + 1;
  end while
  return θ̂^(T+1,1), the resulting estimate.

It is remarkable that the computational cost of determining the optimal learning rate is negligible. The main cost is due to the calculation of the loss function values (not its derivatives) at two different points. The cost of this step is substantially smaller than that of computing the gradient, which is particularly true if the dimension of the unknown parameter θ is ultrahigh.

4 EXPERIMENTS
In this section, we empirically evaluate the proposed method on different models and compare it with various optimizers under different parameter settings. The details are listed as follows.
Classification Model.
To demonstrate the robustness of the proposed method, we consider three classic models: multinomial logistic regression, the multilayer perceptron (MLP), and deep convolutional neural network (CNN) models.
Competing Optimizers. For comparison purposes, we compare the proposed LQA method with other popular optimizers: the standard SGD method, the SGD method with momentum, the SGD method based on Nesterov's accelerated gradient (NAG), AdaGrad, RMSProp, and Adam. For simplicity, we use "SGD-M" to denote the SGD method with momentum and "SGD-NAG" to denote the SGD method based on the NAG algorithm.
Parameter Settings.
For the competing optimizers, we adopt three different learning rates: δ = 0.1, 0.01, and 0.001.

Performance Measurement.
To gauge the performance of the optimizers, we report the training loss of the different optimizers, defined as the negative log-likelihood. The results of the different optimizers in each iteration are shown in figures for comparison purposes.
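As an illustrative sketch (our own, with hypothetical names), the reported training loss for a classifier can be computed from the predicted class probabilities as follows.

import numpy as np

def negative_log_likelihood(probs, labels):
    # Average negative log-likelihood over the samples (the reported training loss).
    # probs  : predicted class probabilities, shape (n_samples, n_classes)
    # labels : integer class labels, shape (n_samples,)
    eps = 1e-12  # guard against log(0)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + eps))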
We first compare the performance of different optimizers based on multinomial logistic regression, which has a convex objective function. We consider the MNIST dataset [17] for illustration. The dataset consists of a total of 70,000 (28 × 28) grayscale images of handwritten digits, which are divided into 10 classes. Figure 1 shows the training loss of the different optimizers, from which the following conclusions can be drawn.

Learning Rate.
For the competing optimizers, the training loss curves under different learning rates clearly have different shapes, which implies different convergence speeds: the convergence speed is greatly affected by δ. The best δ in this case is 0.01 for the SGD, SGD-M, SGD-NAG and AdaGrad methods, while that for RMSProp and Adam is 0.001. Note that for RMSProp and Adam, the loss may even fail to converge under an inappropriately large δ.

Loss Reduction.
First, the loss curve of LQA remains lower than those of the SGD optimizers during the whole training process. This finding means that LQA converges faster than the SGD optimizers. For example, LQA reduces the loss to 0.256 within the first 10 iterations, whereas the standard SGD optimizer requires about 40 iterations to reach a comparable loss.

MLP models are powerful neural network models and have been widely used in machine learning tasks [22]. They contain multiple fully connected layers with activation functions between those layers. An MLP can approximate arbitrary continuous functions over compact input sets [9]. To investigate the performance of the proposed method in this case, we again consider the MNIST dataset. Following the model setting in [11], the MLP is built with 2 fully connected hidden layers, each of which has 1,000 units, and the ReLU function is adopted as the activation function. Figure 2 shows the performance of the different optimizers. The following conclusions can be drawn.
Learning Rate.
In this case, the best learning rates for the competing methods are quite different: (1) for the standard SGD method, the best learning rate is δ = 0.1; (2) for the SGD-M, SGD-NAG and AdaGrad methods, the best learning rate is 0.01; (3) for RMSProp and Adam, δ = 0.001 is the best. It is remarkable that, even for the same optimizer, different learning rates can lead to different performance when the model changes. Thus, determining the appropriate learning rate in practice may depend on expert experience and subjective judgement. In contrast, the proposed method avoids such effort in choosing δ and gives a comparable and robust performance.

Loss Reduction.
First, compared with the standard SGD optimizers, LQA performs much better, as can be seen from its lower training loss curve. Second, compared with the SGD-M, SGD-NAG, RMSProp and Adam optimizers, the performance of LQA is comparable to their best performances in the early stages (e.g., the first 5 iterations). In the later stages, the LQA method continues to reduce the loss, which makes the training loss curve of LQA lower than those of the other methods. For example, the smallest loss achieved by the Adam optimizer in the 20th iteration is 0.011, while that of the LQA method is 0.002. Third, although AdaGrad converges quickly in the early stages under its best learning rate, LQA achieves a lower loss in the later stages.

CNNs have brought remarkable breakthroughs in computer vision tasks over the past two decades [7, 10, 14] and play a critical role in various industrial applications, such as face recognition [28] and driverless vehicles [18]. In this subsection, we investigate the performance of the LQA method with respect to the training of CNNs. Two classic CNNs are considered: LeNet [17] and ResNet [7]. More specifically, LeNet-5 and ResNet-18 are studied in this paper. The MNIST and CIFAR10 [13] datasets are used to demonstrate the performance. The CIFAR10 dataset contains 60,000 (32 × 32) RGB images, which are divided into 10 classes.
LeNet.
Figure 3 and Figure 4 show the results of the experiments on the MNIST and CIFAR10 datasets, respectively. The following conclusions can be drawn. (1) For both datasets, the loss curves of the LQA method remain lower than those of the standard SGD and AdaGrad optimizers. This finding suggests that LQA converges faster than those optimizers during the whole training process. (2) LQA performs similarly to the SGD-M, SGD-NAG, RMSProp and Adam optimizers in the early stages (e.g., the first 20 iterations). However, in the later stages, the proposed method further reduces the loss and reaches a lower loss than those optimizers after the same number of iterations. (3) For the CIFAR10 dataset, a large δ (e.g., δ = 0.1) may lead to an unstable loss curve for a standard SGD optimizer. Although the loss curve of LQA is unstable in the early stages of training, it becomes smooth in the later stages, because the proposed method automatically and adaptively adjusts the update step size to accelerate training. It is fairly robust.
ResNet.
Figure 5 displays the training loss of ResNet-18 with different optimizers on the CIFAR10 dataset. Accordingly, we draw the following conclusions. First, the LQA method performs similarly to the other optimizers in the early stages of training (e.g., the first 15 iterations), but it converges faster in the later stages. In particular, the proposed method achieves a lower loss than RMSProp and Adam within the same number of iterations. Second, in this case, the loss curves of the SGD optimizers and AdaGrad are quite unstable during the whole training period, whereas the LQA method is much more stable in the later stages of training than in the early stages.
Figure 1: Training loss of multinomial logistic regression on the MNIST dataset.
Figure 2: Training loss of MLP on the MNIST dataset.
Figure 3: Training loss of LeNet-5 on the MNIST dataset.
Figure 4: Training loss of LeNet-5 on the CIFAR10 dataset.
Figure 5: Training loss of ResNet-18 on the CIFAR10 dataset.

5 CONCLUSION
In this work, we propose LQA, a novel approach to determine a nearly optimal learning rate for automatic optimization. Our method has three important features. First, the learning rate is automatically estimated in each update step. Second, it is dynamically adjusted during the whole training process. Third, given the gradient direction, the learning rate leads to nearly the greatest reduction in the loss function. Experiments on openly available datasets demonstrate its effectiveness.

We discuss two interesting topics for future research. First, the optimal learning rate derived by LQA is shared by all dimensions of the parameter estimate. A potential extension is to allow different optimal learning rates for different dimensions. Second, in this paper, we focus on accelerating the training of network models; we do not discuss the overfitting issue or the sparsity of the gradients. To further improve the performance of the proposed method, it is possible to combine dropout or sparsity penalties with LQA.
REFERENCES
[1] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. 2016. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems. 3981–3989.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
[3] Andrew Cotter, Ohad Shamir, Nati Srebro, and Karthik Sridharan. 2011. Better mini-batch algorithms via accelerated gradient methods. In Advances in Neural Information Processing Systems. 1647–1655.
[4] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, Jul (2011), 2121–2159.
[5] Yoav Goldberg and Omer Levy. 2014. word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722 (2014).
[6] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6645–6649.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[8] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29, 6 (2012), 82–97.
[9] Kurt Hornik. 1991. Approximation capabilities of multilayer feedforward networks. Neural Networks 4, 2 (1991), 251–257.
[10] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4700–4708.
[11] Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations 2015.
[12] Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
[13] Alex Krizhevsky and Geoffrey Hinton. 2009. Learning multiple layers of features from tiny images. Technical Report.
[14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[15] Guanghui Lan. 2012. An optimal method for stochastic composite optimization. Mathematical Programming 133, 1-2 (2012), 365–397.
[16] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. 1989. Backpropagation applied to handwritten zip code recognition. Neural Computation 1, 4 (1989), 541–551.
[17] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
[18] Peiliang Li, Xiaozhi Chen, and Shaojie Shen. 2019. Stereo R-CNN based 3D object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7644–7652.
[19] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
[20] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529–533.
[21] Yurii Nesterov. 1983. A method of solving a convex programming problem with convergence rate O(1/k²). In Soviet Mathematics Doklady, Vol. 27. 372–376.
[22] Hassan Ramchoun, Mohammed Amine Janati Idrissi, Youssef Ghanou, and Mohamed Ettaouil. 2016. Multilayer Perceptron: Architecture Optimization and Training. International Journal of Interactive Multimedia and Artificial Intelligence 4, 1 (2016), 26–30.
[23] Herbert Robbins and Sutton Monro. 1951. A stochastic approximation method. The Annals of Mathematical Statistics (1951), 400–407.
[24] David E Rumelhart, Richard Durbin, Richard Golden, and Yves Chauvin. 1995. Backpropagation: The basic theory. Backpropagation: Theory, Architectures and Applications (1995), 1–34.
[25] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 1986. Learning representations by back-propagating errors. Nature 323 (1986), 533–536.
[26] Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5 - RMSProp, COURSERA: Neural Networks for Machine Learning. Technical Report.
[27] Paul Tseng. 1998. An incremental gradient(-projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization 8, 2 (1998), 506–531.
[28] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. 2016. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision. Springer, 499–515.
[29] Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig. 2016. Achieving human parity in conversational speech recognition. arXiv preprint arXiv:1610.05256 (2016).
[30] Matthew D Zeiler. 2012. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701 (2012).