Improving Generalization Performance by Switching from Adam to SGD
Nitish Shirish Keskar, Richard Socher
Salesforce Research, Palo Alto, CA 94301. Correspondence to: Nitish Shirish Keskar <[email protected]>.

Abstract
Despite superior training outcomes, adaptive optimization methods such as Adam, Adagrad or RMSprop have been found to generalize poorly compared to stochastic gradient descent (SGD). These methods tend to perform well in the initial portion of training but are outperformed by SGD at later stages of training. We investigate a hybrid strategy that begins training with an adaptive method and switches to SGD when appropriate. Concretely, we propose SWATS, a simple strategy which Switches from Adam to SGD when a triggering condition is satisfied. The condition we propose relates to the projection of Adam steps on the gradient subspace. By design, the monitoring process for this condition adds very little overhead and does not increase the number of hyperparameters in the optimizer. We report experiments on several standard benchmarks, such as ResNet, SENet, DenseNet and PyramidNet for the CIFAR-10 and CIFAR-100 data sets, ResNet on the Tiny-ImageNet data set, and language modeling with recurrent networks on the PTB and WT2 data sets. The results show that our strategy is capable of closing the generalization gap between SGD and Adam on a majority of the tasks.
1. Introduction
Stochastic gradient descent (SGD) (Robbins & Monro, 1951) has emerged as one of the most used training algorithms for deep neural networks. Despite its simplicity, SGD not only performs well empirically across a variety of applications but also has strong theoretical foundations. These include, but are not limited to, guarantees of saddle point avoidance (Lee et al., 2016), improved generalization (Hardt et al., 2015; Wilson et al., 2017) and interpretations as Bayesian inference (Mandt et al., 2017).

Training neural networks is equivalent to solving the following non-convex optimization problem:

$$\min_{w \in \mathbb{R}^n} f(w),$$

where $f$ is a loss function. The iterations of SGD can be described as

$$w_k = w_{k-1} - \alpha_{k-1} \hat{\nabla} f(w_{k-1}),$$

where $w_k$ denotes the $k$-th iterate, $\alpha_k$ is a (tuned) step size sequence, also called the learning rate, and $\hat{\nabla} f(w_k)$ denotes the stochastic gradient computed at $w_k$. A variant of SGD (SGDM), which uses the inertia of the iterates to accelerate the training process, has also been found to be successful in practice (Sutskever et al., 2013). The iterations of SGDM can be described as

$$v_k = \beta v_{k-1} + \hat{\nabla} f(w_{k-1}), \qquad w_k = w_{k-1} - \alpha_{k-1} v_k,$$

where $\beta \in [0, 1)$ is a momentum parameter and $v_0$ is initialized to $0$.

One disadvantage of SGD is that it scales the gradient uniformly in all directions; this can be particularly detrimental for ill-scaled problems. It also makes the process of tuning the learning rate $\alpha$ circumstantially laborious.

To correct for these shortcomings, several adaptive methods have been proposed which diagonally scale the gradient via estimates of the function's curvature. Examples of such methods include Adam (Kingma & Ba, 2015), Adagrad (Duchi et al., 2011) and RMSprop (Tieleman & Hinton, 2012). These methods can be interpreted as using a vector of learning rates, one for each parameter, that is adapted as the training algorithm progresses. This is in contrast to SGD and SGDM, which use a scalar learning rate uniformly for all parameters.

Adagrad takes steps of the form

$$w_k = w_{k-1} - \alpha_{k-1} \frac{\hat{\nabla} f(w_{k-1})}{\sqrt{v_{k-1}} + \epsilon}, \quad \text{where} \quad v_{k-1} = \sum_{j=1}^{k-1} \hat{\nabla} f(w_j)^2. \tag{1}$$

RMSprop uses the same update rule as (1), but instead of accumulating $v_k$ in a monotonically increasing fashion, it uses an RMS-based approximation:

$$v_{k-1} = \beta v_{k-2} + (1 - \beta) \hat{\nabla} f(w_{k-1})^2.$$

In both Adagrad and RMSprop, the accumulator $v_0$ is initialized to $0$. Owing to the fact that $v_k$ is monotonically increasing in each dimension for Adagrad, the scaling factor for $\hat{\nabla} f(w_{k-1})$ monotonically decreases, leading to slow progress. RMSprop corrects for this behavior by employing an average scale instead of a cumulative one. However, because $v_0$ is initialized to $0$, the initial updates tend to be noisy, given that the scaling estimate is biased by its initialization. Adam rectifies this behavior by employing a bias correction. Further, it uses an exponential moving average for the step in lieu of the raw gradient. Mathematically, the Adam update equation can be represented as

$$w_k = w_{k-1} - \alpha_{k-1} \cdot \frac{\sqrt{1 - \beta_2^k}}{1 - \beta_1^k} \cdot \frac{m_{k-1}}{\sqrt{v_{k-1}} + \epsilon}, \tag{2}$$

where

$$m_{k-1} = \beta_1 m_{k-2} + (1 - \beta_1) \hat{\nabla} f(w_{k-1}), \qquad v_{k-1} = \beta_2 v_{k-2} + (1 - \beta_2) \hat{\nabla} f(w_{k-1})^2. \tag{3}$$
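To make the notation concrete, the following minimal NumPy sketch implements the Adam update (2)-(3); the function name, argument layout and the eps default are our illustrative choices, not the paper's reference code.

    import numpy as np

    def adam_step(w, m, v, grad, k, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        # First- and second-moment estimates, as in equation (3).
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad**2
        # Bias-corrected step, as in equation (2); k is the 1-indexed iteration.
        w = w - alpha * (np.sqrt(1 - beta2**k) / (1 - beta1**k)) * m / (np.sqrt(v) + eps)
        return w, m, v

    # One step on f(w) = 0.5 * ||w||^2, whose stochastic gradient here is w itself.
    w, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
    w, m, v = adam_step(w, m, v, grad=w.copy(), k=1)

Note how, at $k = 1$, the bias correction makes the first step approximately $\alpha \cdot \text{sign}(\text{grad})$, independent of the gradient's scale.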
Adam has been used in many applications owing to its competitive performance and its ability to work well despite minimal tuning (Karpathy, 2017). Recent work, however, highlights the possible inability of adaptive methods to perform on par with SGD when measured by their ability to generalize (Wilson et al., 2017). Furthermore, the authors show that even for simple quadratic problems, adaptive methods find solutions that can be orders of magnitude worse at generalization than those found by SGD(M).

Indeed, for several state-of-the-art results in language modeling and computer vision, the optimizer of choice is SGD (Merity et al., 2017; Loshchilov & Hutter, 2016; He et al., 2015). Interestingly, however, in these and other instances, Adam outperforms SGD in both training and generalization metrics in the initial portion of the training, but then the performance stagnates. This motivates the investigation of a strategy that combines the benefits of Adam, viz. good performance with default hyperparameters and fast initial progress, with the generalization properties of SGD. Given the insights of Wilson et al. (2017), which suggest that the lack of generalization performance of adaptive methods stems from the non-uniform scaling of the gradient, a natural hybrid strategy would begin the training process with Adam and switch to SGD when appropriate. To investigate this further, we propose SWATS, a simple strategy that combines the best of both worlds by Switching from Adam to SGD. The switch is designed to be automatic and one that does not introduce any more hyperparameters. The choice of not adding additional hyperparameters is deliberate, since it allows for a fair comparison between Adam and SWATS. Our experiments on several architectures and data sets suggest that such a strategy is indeed effective.

Several attempts have been made at improving the convergence and generalization performance of Adam. The closest to our proposed approach is (Zhang et al., 2017), in which the authors propose ND-Adam, a variant of Adam which preserves the gradient direction by a nested optimization procedure. This, however, introduces an additional hyperparameter along with the $(\alpha, \beta_1, \beta_2)$ used in Adam. Further, empirically, this adaptation sacrifices the rapid initial progress typically observed for Adam. In Anonymous (2018), the authors investigate Adam and ascribe its poor generalization performance to training issues arising from the non-monotonic nature of the steps. The authors propose a variant of Adam called AMSGrad, which monotonically reduces the step sizes and possesses theoretical convergence guarantees. Despite these guarantees, we empirically found the generalization performance of AMSGrad to be similar to that of Adam on problems where a generalization gap exists between Adam and SGD. We note that, in the context of the hypothesis of Wilson et al. (2017), all of the aforementioned methods would still yield poor generalization, given that the scaling of the gradient is non-uniform.

The idea of switching from an adaptive method to SGD is not novel and has been explored previously in the context of machine translation (Wu et al., 2016) and ImageNet training (Akiba et al., 2017). Wu et al. (2016) use such a mixed strategy for training and tune both the switchover point and the learning rate for SGD after the switch. Akiba et al.
(2017) use a similar strategy but employ a convex combination of RMSprop and SGD steps whose contributions and learning rates are tuned.

In our strategy, the switchover point and the SGD learning rate are both learned as a part of the training process. We monitor a projection of the Adam step on the gradient subspace and use its exponential average as an estimate for the SGD learning rate after the switchover. Further, the switchover is triggered when no change in this monitored quantity is detected. We describe this strategy in detail in Section 2. In Section 3, we describe our experiments comparing Adam, SGD and SWATS on a host of benchmark problems. Finally, in Section 4, we present ideas for future research and concluding remarks. We conclude this section by emphasizing that the goal of this work is less to propose a new training algorithm and more to empirically investigate the viability of hybrid training for improving generalization.
2. SWATS
To investigate the generalization gap between Adam and SGD, let us consider the training of the CIFAR-10 data set (Krizhevsky & Hinton, 2009) on the DenseNet architecture (Iandola et al., 2014). This is an example of an instance where a significant generalization gap exists between Adam and SGD. We plot the performance of Adam and SGD on this task but also consider a variant of Adam which we call Adam-Clip$(p, q)$. Given $(p, q)$ such that $p < q$, the iterates for this variant take the form

$$w_k = w_{k-1} - \text{clip}\left( \frac{\sqrt{1 - \beta_2^k}}{1 - \beta_1^k} \cdot \frac{\alpha_{k-1}}{\sqrt{v_{k-1}} + \epsilon},\; p \cdot \alpha_{\text{sgd}},\; q \cdot \alpha_{\text{sgd}} \right) m_{k-1}.$$

Here, $\alpha_{\text{sgd}}$ is the tuned value of the learning rate for SGD that leads to the best performance for the same task. The function $\text{clip}(x, a, b)$ clips the vector $x$ element-wise such that the output is constrained to be in $[a, b]$. Note that Adam-Clip$(1, 1)$ would correspond to SGD. The network is trained using Adam, SGD and two variants, Adam-Clip$(1, \infty)$ and Adam-Clip$(0, 1)$, with tuned learning rates for 200 epochs, with one learning-rate reduction partway through training. The goal of this experiment is to investigate the effect of constraining the large and small step sizes that Adam implicitly learns, i.e., $\frac{\sqrt{1 - \beta_2^k}}{1 - \beta_1^k} \cdot \frac{\alpha_{k-1}}{\sqrt{v_{k-1}} + \epsilon}$, on the generalization performance of the network. We present the results in Figure 1.

As seen from Figure 1, SGD converges to the expected testing error while Adam stagnates at a noticeably higher error. We note that fine-tuning of the learning rate schedule (primarily the initial value, reduction amount and the timing) did not lead to better performance. Note also the rapid initial progress of Adam relative to SGD. This experiment is in agreement with the experimental observations of Wilson et al. (2017). Interestingly, Adam-Clip$(0, 1)$ has no tangible effect on the final generalization performance, while Adam-Clip$(1, \infty)$ partially closes the generalization gap, achieving a final error between those of SGD and Adam. We observe similar results for several architectures, data sets and modalities whenever a generalization gap exists between SGD and Adam. This stands as evidence that the step sizes learned by Adam could circumstantially be too small for effective convergence. This observation regarding the need to lower-bound the step sizes of Adam is similar to the one made in Anonymous (2018), where the authors devise a one-dimensional example in which infrequent but large gradients are not emphasized sufficiently, causing the non-convergence of Adam.

Given the potential insufficiency of Adam, even when constraining one side of the accumulator, we consider switching to SGD once we have reaped the benefits of Adam's rapid initial progress.
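For concreteness, here is a sketch of the Adam-Clip$(p, q)$ step defined above, reusing the moment recursions from the earlier Adam sketch (again with our own names and defaults; pass q = np.inf for Adam-Clip(1, inf)):

    import numpy as np

    def adam_clip_step(w, m, v, grad, k, p, q, alpha_sgd,
                       alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad**2
        # The per-parameter step sizes that Adam implicitly learns ...
        steps = (np.sqrt(1 - beta2**k) / (1 - beta1**k)) * alpha / (np.sqrt(v) + eps)
        # ... clipped element-wise to [p * alpha_sgd, q * alpha_sgd].
        w = w - np.clip(steps, p * alpha_sgd, q * alpha_sgd) * m
        return w, m, v

Setting $(p, q) = (1, 1)$ forces every coordinate to take the tuned SGD step size, while $(1, \infty)$ only lower-bounds the steps, which is the variant that partially closed the gap in Figure 1.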
Figure 1.
Training the DenseNet architecture on the CIFAR-10 data set with four optimizers: SGD, Adam, Adam-Clip$(1, \infty)$ and Adam-Clip$(0, 1)$. SGD achieves the best testing accuracy, while training with Adam leads to a clear generalization gap. Setting a minimum learning rate for each parameter of Adam partially closes this gap.

This raises two questions: (a) when to switch over from Adam to SGD, and (b) what learning rate to use for SGD after the switch. Assuming that the learning rate of SGD after the switchover is tuned, we found that switching too late does not yield generalization improvements, while switching too early may cause the hybrid optimizer to not benefit from Adam's initial progress. Indeed, as shown in Figure 2, switching after 10 epochs leads to a learning curve very similar to that of SGD, while switching after 80 epochs leads to inferior testing accuracy. To investigate the efficacy of a hybrid strategy whilst ensuring no increase in the number of hyperparameters (a necessity for fair comparison with Adam), we propose SWATS, a strategy that automates the process of switching over by determining both the switchover point and the learning rate of SGD after the switch.

Consider an iterate $w_k$ with a stochastic gradient $g_k$ and a step computed by Adam, $p_k$. For the sake of simplicity, assume that $p_k \neq 0$ and $p_k^T g_k < 0$. This is a common requirement imposed on directions to derive convergence (Nocedal & Wright, 2006). In the case when $\beta_1 = 0$ for Adam, i.e., no first-order exponential averaging is used, this
Figure 2.
Training the DenseNet architecture on the CIFAR-10 data set using Adam and switching to SGD with a tuned learning rate and momentum 0.9 after 10, 40 and 80 epochs; the switchover point is denoted by Sw@ in the figure. Switching early enables the model to achieve testing accuracy comparable to SGD, but switching too late in the training process leads to a generalization gap similar to Adam's.

is trivially true, since

$$p_k = -\underbrace{\frac{\sqrt{1 - \beta_2^{k+1}}}{1 - \beta_1^{k+1}} \cdot \frac{\alpha_k}{\sqrt{v_k} + \epsilon}}_{:= \text{diag}(H_k)} \, g_k,$$

with $H_k \succ 0$, where $\text{diag}(A)$ denotes the vector constructed from the diagonal of $A$. Ordinarily, to train using Adam, we would update the iterate as $w_{k+1} = w_k + p_k$.

To determine a feasible learning rate for SGD, $\gamma_k$, we propose solving the subproblem

$$\text{proj}_{-\gamma_k g_k} \, p_k = p_k,$$

where $\text{proj}_a b$ denotes the orthogonal projection of $a$ onto $b$. This scalar optimization problem can be solved in closed form to yield

$$\gamma_k = \frac{p_k^T p_k}{-p_k^T g_k},$$

since

$$p_k = \text{proj}_{-\gamma_k g_k} \, p_k = \frac{-\gamma_k \, g_k^T p_k}{p_k^T p_k} \, p_k$$

implies the above equality. Geometrically, this can be interpreted as the scaling necessary for the gradient that leads to its projection on the Adam step $p_k$ to be $p_k$ itself; see Figure 3. Note that this is not the same as an orthogonal projection of $p_k$ on $-g_k$. Empirically, we found that an orthogonal projection consistently underestimates the SGD learning rate necessary, leading to much smaller SGD steps. Indeed, the $\ell_2$ norm of an orthogonally projected step will always be less than or equal to that of $p_k$, which is undesirable given our needs. The non-orthogonal projection proposed above does not suffer from this problem, and empirically we found that it estimates the SGD learning rate well. A simple scaling rule of $\gamma_k = \|p_k\| / \|g_k\|$ was also not found to be successful. We attribute this to the fact that a scaling rule of this form ignores the relative importance of the coordinate directions and tends to amplify the importance of directions with a large step $p_k$ but small first-order importance $g_k$, and vice versa.

Note again that if no momentum ($\beta_1 = 0$) is employed in Adam, then necessarily $\gamma_k > 0$, since $H_k \succ 0$. We should mention in passing that, in this case, $\gamma_k$ is equivalent to the reciprocal of the Rayleigh quotient of $H_k^{-1}$ with respect to the vector $p_k$.

Since $\gamma_k$ is a noisy estimate of the scaling needed, we maintain an exponential average initialized at $0$, denoted by $\lambda_k$, such that

$$\lambda_k = \beta_2 \lambda_{k-1} + (1 - \beta_2) \gamma_k.$$

We use the $\beta_2$ of Adam, see (3), as the averaging coefficient, since this reuse avoids another hyperparameter and because the performance is relatively invariant to fine-grained specification of this parameter.

Having answered the question of what learning rate $\lambda_k$ to choose for SGD after the switch, we now discuss when to switch. We propose checking a simple, yet powerful, criterion:

$$\left| \frac{\lambda_k}{1 - \beta_2^k} - \gamma_k \right| < \epsilon, \tag{4}$$

at every iteration with $k > 1$. The condition compares the bias-corrected exponential averaged value and the current value ($\gamma_k$). The bias correction is necessary to prevent the influence of the zero initialization during the initial portion of training. Once this condition is true, we switch over to SGD with learning rate $\Lambda := \lambda_k / (1 - \beta_2^k)$. We also experimented with more complex criteria, including those involving monitoring of gradient norms.
However, we found that this simple un-normalized criterion works well across a variety of different applications.
Illustrating the learning rate for SGD ( γ k ) estimated byour proposed projection given an iterate w k , a stochastic gradient g k and the Adam step p k . In the case when β > , we switch to SGDM with learningrate (1 − β )Λ and momentum parameter β . The (1 − β ) factor is the common momentum correction. Refer toAlgorithm 1 for a unified view of the algorithm. The textin blue denotes operations that are also present in Adam.
3. Numerical Results
To demonstrate the efficacy of our approach, we present numerical experiments comparing the proposed strategy with Adam and SGD. We consider the problems of image classification and language modeling.

For the former, we experiment with four architectures: ResNet-32 (He et al., 2015), DenseNet (Iandola et al., 2014), PyramidNet (Han et al., 2016) and SENet (Hu et al., 2017) on the CIFAR-10 and CIFAR-100 data sets (Krizhevsky & Hinton, 2009). The goal is to classify images into one of 10 classes for CIFAR-10 and 100 classes for CIFAR-100. The data sets contain 50,000 32×32 RGB images in the training set and 10,000 images in the testing set. We choose these architectures given their superior performance on several image classification benchmarking tasks. For a large-scale image classification experiment, we experiment with the Tiny-ImageNet data set on the ResNet-18 architecture (He et al., 2015). This data set is a subset of the ILSVRC 2012 data set (Deng et al., 2009) and contains 200 classes with
500 224×224 RGB images per class in the training set and 50 per class in the validation and testing sets (the data set is available at https://tiny-imagenet.herokuapp.com/). We choose this data set given that it is a good proxy for the performance on the larger ImageNet data set.

Algorithm 1
SWATS
Inputs:
Objective function $f$, initial point $w_0$, learning rate $\alpha = 10^{-3}$, accumulator coefficients $(\beta_1, \beta_2) = (0.9, 0.999)$, $\epsilon = 10^{-9}$, phase = Adam.

Initialize $k \leftarrow 0$, $m_0 \leftarrow 0$, $a_0 \leftarrow 0$, $\lambda_0 \leftarrow 0$
while stopping criterion not met do
  $k \leftarrow k + 1$
  compute stochastic gradient $g_k = \hat{\nabla} f(w_{k-1})$
  if phase = SGD then
    $v_k = \beta_1 v_{k-1} + g_k$
    $w_k = w_{k-1} - (1 - \beta_1) \Lambda v_k$
    continue
  end if
  $m_k = \beta_1 m_{k-1} + (1 - \beta_1) g_k$
  $a_k = \beta_2 a_{k-1} + (1 - \beta_2) g_k^2$
  $p_k = -\alpha \, \frac{\sqrt{1 - \beta_2^k}}{1 - \beta_1^k} \, \frac{m_k}{\sqrt{a_k} + \epsilon}$
  $w_k = w_{k-1} + p_k$
  if $p_k^T g_k \neq 0$ then
    $\gamma_k = p_k^T p_k / (-p_k^T g_k)$
    $\lambda_k = \beta_2 \lambda_{k-1} + (1 - \beta_2) \gamma_k$
    if $k > 1$ and $|\lambda_k / (1 - \beta_2^k) - \gamma_k| < \epsilon$ then
      phase = SGD; $v_k = 0$; $\Lambda = \lambda_k / (1 - \beta_2^k)$
    end if
  else
    $\lambda_k = \lambda_{k-1}$
  end if
end while
return $w_k$

We also present results for word-level language modeling, where the task is to take as input a sequence of words and predict the next word. We choose this task because of its broad importance, the inherent difficulties that arise due to long-term dependencies (Hochreiter & Schmidhuber, 1997), and because it is a proxy for other sequence learning tasks such as machine translation (Bahdanau et al., 2014). We use the Penn Treebank (PTB) (Mikolov et al., 2011) and the larger WikiText-2 (WT-2) (Merity et al., 2016) data sets and experiment with the AWD-LSTM and AWD-QRNN architectures. In the case of SGD, we clip the gradients to a fixed norm, while we perform no such clipping for Adam and SWATS. We found that the performance of SGD deteriorates without clipping, and that of Adam and SWATS deteriorates with it.

The AWD-LSTM architecture uses a multi-layered LSTM network with learned embeddings, while the AWD-QRNN architecture replaces the expensive LSTM layer with the cheaper QRNN layer (Bradbury et al., 2016), which uses convolutions instead of recurrences. The model is regularized with DropConnect (Wan et al., 2013) on the hidden-to-hidden connections, as well as with other strategies such as weight decay, embedding-softmax weight tying, activity regularization and temporal activity regularization. We refer the reader to (Merity et al., 2016) for additional details regarding the data sets, including the sizes of the training, validation and testing sets, the size of the vocabulary, and the source of the data.

For our experiments, we tuned the learning rate of all optimizers and report the best-performing configuration in terms of generalization. The learning rates of Adam and SWATS were chosen from a fixed grid of seven values. For both optimizers, we use the (default) recommended values $(\beta_1, \beta_2) = (0.9, 0.999)$. Note that this implies that, in all cases, we switch from Adam to SGDM with a momentum coefficient of 0.9. For tuning the learning rate of the SGD(M) optimizer, we first coarsely tune the learning rate on a logarithmic scale and then fine-tune it. In all cases, we experiment with and without momentum but do not tune the momentum parameter ($\beta = 0.9$). We found this overall procedure to perform better than a generic grid search or hyperparameter optimization, given the vastly different scales of learning rates needed for different modalities. For instance, SGD with learning rate 0.1 performed best for the DenseNet task on CIFAR-10, but for the PTB language modeling task using the LSTM architecture, a learning rate of 55 for SGD was necessary. Hyperparameters such as batch size, dropout probability and $\ell_2$-norm decay were chosen to match the recommendations of the respective base architectures. We trained all networks for a total of 300 epochs and reduced the learning rate by a fixed factor at three fixed epochs during training (a step-decay schedule; see the sketch below).
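As a concrete illustration of this experimental setup, here is a hedged PyTorch sketch combining a step-decay schedule via MultiStepLR with gradient-norm clipping (used for SGD on the language modeling tasks). The toy model, data, milestones, decay factor and clipping norm are placeholders, since the exact values are not reproduced here.

    import torch

    model = torch.nn.Linear(10, 2)  # placeholder model
    loader = [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(5)]  # fake data
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    # Step decay: multiply the learning rate by gamma at each milestone epoch
    # (milestones and gamma are placeholders, not the paper's values).
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[100, 200, 250], gamma=0.1)

    for epoch in range(300):
        for x, y in loader:
            optimizer.zero_grad()
            loss = torch.nn.functional.cross_entropy(model(x), y)
            loss.backward()
            # Gradient clipping, applied only for SGD on the language modeling
            # tasks (the norm value here is a placeholder).
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.25)
            optimizer.step()
        scheduler.step()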
This scheme was surprisingly powerful at obtaining good performance across the different modalities and architectures. The experiments were coded in PyTorch (pytorch.org) and conducted using job scheduling on NVIDIA Tesla K80 GPUs for roughly 3 weeks.

The experiments comparing SGD, Adam and SWATS on the CIFAR and Tiny-ImageNet data sets are presented in Figures 4 and 5, respectively. The experiments comparing the optimizers on the language modeling tasks are presented in Figure 6. In Table 1, we summarize the metadata concerning our experiments, including the learning rates that achieved the best performance and, in the case of SWATS, the number of epochs before the switch occurred and the learning rate ($\Lambda$) for SGD after the switch. Finally, in Figure 7, we depict the evolution of the estimated SGD learning rate ($\gamma_k$) as the algorithm progresses on two representative tasks.

With respect to the image classification data sets, it is evident that, across different architectures, on all three data sets, Adam fails to find solutions that generalize well despite making good initial progress. This is in agreement with the findings of Wilson et al. (2017). As can be seen from Table 1, the switch from Adam to SGD happens within the first 25 epochs for most CIFAR tasks and early in training for Tiny-ImageNet. Curiously, in the case of the Tiny-ImageNet problem, the switch from Adam to SGD leads to a significant but temporary degradation in performance. Despite the drop in testing accuracy immediately after the switch, the model recovers and achieves a better peak testing accuracy compared to Adam. We observed similar outcomes for several other architectures on this data set.

In the language modeling tasks, Adam outperforms SGD not only in final generalization performance but also in the number of epochs necessary to attain that performance. This is not entirely surprising, given that Merity et al. (2017) required iterate averaging for SGD to achieve state-of-the-art performance despite gradient clipping and learning rate decay rules. In this case, SWATS switches over to SGD, albeit later in the training process, but achieves generalization performance comparable to Adam, as measured by the lowest validation perplexity achieved in the experiment. Again, as in the case of the Tiny-ImageNet experiment (Figure 5), the switch may cause a temporary degradation in performance from which the model is able to recover.

These experiments suggest that it is indeed possible to combine the best of both worlds for these tasks: in all the tasks described, SWATS performs almost as well as the best algorithm among SGD and Adam, and in several cases achieves a good initial decrease in the error metric.

Figure 7 shows that the estimated learning rate for SGD ($\gamma_k$) is noisy but convergent (in mean), and that it converges to a value of similar scale to the value obtained by tuning the SGD optimizer (see Table 1). We emphasize that, other than the learning rate, no other hyperparameters were tuned between the experiments.
4. Discussion and Conclusion
Wilson et al. (2017) pointed to the insufficiency of adaptive methods, such as Adam, Adagrad and RMSProp, at generalizing in a fashion comparable to that of SGD. In the case of a convex quadratic function, the authors demonstrate that adaptive methods provably converge to a point with orders-of-magnitude worse generalization performance than SGD. The authors attribute this generalization gap to the scaling of the per-variable learning rates definitive of adaptive methods, as we explain below.

Nevertheless, adaptive methods are important given their rapid initial progress, relative insensitivity to hyperparameters, and ability to deal with ill-scaled problems. Several recent papers have attempted to explain and improve adaptive methods (Loshchilov & Hutter, 2017; Anonymous, 2018; Zhang et al., 2017). However, given that they retain the adaptivity and non-uniform gradient scaling, they too
[Figure 4 panels: (a) ResNet-32 — CIFAR-10, (b) DenseNet — CIFAR-10, (c) PyramidNet — CIFAR-10, (d) SENet — CIFAR-10, (e) ResNet-32 — CIFAR-100, (f) DenseNet — CIFAR-100, (g) PyramidNet — CIFAR-100, (h) SENet — CIFAR-100]
Figure 4.
Numerical experiments comparing SGD(M), Adam and SWATS with tuned learning rates on the ResNet-32, DenseNet, PyramidNet and SENet architectures on the CIFAR-10 and CIFAR-100 data sets.
Model        Data Set    SGDM   Adam    SWATS    Λ      Switchover Point (epochs)
ResNet-32    CIFAR-10    0.1    0.001   0.001    0.52   1.37
DenseNet     CIFAR-10    0.1    0.001   0.001    0.79   11.54
PyramidNet   CIFAR-10    0.1    0.001   0.0007   0.85   4.94
SENet        CIFAR-10    0.1    0.001   0.001    0.54   24.19
ResNet-32    CIFAR-100   0.3    0.002   0.002    1.22   10.42
DenseNet     CIFAR-100   0.1    0.001   0.001    0.51   11.81
PyramidNet   CIFAR-100   0.1    0.001   0.001    0.76   18.54
SENet        CIFAR-100   0.1    0.001   0.001    1.39   2.04
LSTM         PTB         55†    †       †        †      †

Table 1.
Summarizing the optimal hyperparameters for SGD(M), Adam and SWATS for all experiments and, in the case of SWATS, the value of the estimated learning rate for SGD after the switch and the switchover point in epochs. † denotes that no momentum was employed for SGDM.
Figure 5.
Numerical experiments comparing SGD(M), Adam and SWATS with tuned learning rates on the ResNet-18 architecture on the Tiny-ImageNet data set.

are expected to suffer from similar generalization issues as Adam. Motivated by this observation, we investigate the question of using a hybrid training strategy that starts with an adaptive method and switches to SGD. By design, both the switchover point and the learning rate for SGD after the switch are determined as a part of the algorithm and, as such, require no added tuning effort. We demonstrate the efficacy of this approach on several standard benchmarks, including a host of architectures, on the Penn Treebank, WikiText-2, Tiny-ImageNet, CIFAR-10 and CIFAR-100 data sets. In summary, our results show that the proposed strategy leads to results comparable to SGD while retaining the beneficial properties of Adam, such as hyperparameter insensitivity and rapid initial progress.

The success of our strategy motivates a deeper exploration into the interplay between the dynamics of the optimizer and the generalization performance. Recent theoretical work analyzing generalization for deep learning suggests coupling generalization arguments with the training process (Soudry et al., 2017; Hardt et al., 2015; Zhang et al., 2016; Wilson et al., 2017). The optimizers choose different trajectories in the parameter space and are attracted to different basins of attraction, with vastly different generalization performance. Even for a simple least-squares problem, $\min_w \|Xw - y\|^2$ initialized at $w_0 = 0$, SGD recovers the minimum-norm solution, with its associated margin benefits, whereas adaptive methods do not (a small numerical sketch follows this paragraph). The fundamental reason for this is that SGD ensures that the iterates remain in the column space of $X^T$, and only one optimum exists in that column space, viz. the minimum-norm solution. On the other hand, adaptive methods do not necessarily stay in the column space of $X^T$. Similar arguments can be constructed for logistic regression problems (Soudry et al., 2017), but an analogous treatment for deep networks is, to the best of our knowledge, an open question. We hypothesize that the success of a hybrid strategy such as SWATS suggests that, in the case of deep networks, despite training for only a few epochs before switching to SGD, the model is able to navigate towards a basin with better generalization performance. However, further empirical and theoretical evidence is necessary to buttress this hypothesis, and this is a topic of future research.
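As a small self-contained illustration of this point (our construction, not an experiment from the paper), compare plain gradient descent with an Adagrad-style per-coordinate scaling on a one-equation least-squares problem:

    import numpy as np

    # Underdetermined least squares: one equation, two unknowns.
    X = np.array([[2.0, 1.0]])
    y = np.array([2.0])
    w_min_norm = X.T @ np.linalg.pinv(X @ X.T) @ y   # minimum-norm interpolator

    def grad(w):
        return 2 * X.T @ (X @ w - y)                 # always lies in range(X.T)

    w_gd = np.zeros(2)
    w_ada, v = np.zeros(2), np.zeros(2)
    for _ in range(10000):
        w_gd = w_gd - 0.05 * grad(w_gd)              # plain gradient descent
        g = grad(w_ada)
        v = v + g**2                                 # Adagrad accumulator
        w_ada = w_ada - 0.05 * g / (np.sqrt(v) + 1e-12)

    print(w_min_norm)                    # [0.8 0.4]
    print(w_gd)                          # ~[0.8 0.4]: the minimum-norm solution
    print(w_ada, np.linalg.norm(w_ada))  # ~[0.67 0.67]: interpolates, larger norm

Both runs drive $\|Xw - y\|$ to zero, but the per-coordinate scaling leaves the span of $X^T$ on its very first step and settles on an interpolating solution of larger norm.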
(2017); Loshchilov & Hutter (2017); Anony-mous (2018). We plan to investigate the performance ofthe algorithm obtained by mixing these strategies, suchas monotonic increase guarantees of the second-order mo-ment, cosine-annealing, (cid:96) -norm correction, in the future. References
Akiba, T., Suzuki, S., and Fukuda, K. Extremely large minibatch SGD: Training ResNet-50 on ImageNet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.

Anonymous. On the convergence of Adam and beyond.
International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ryQu7f-RZ.

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[Figure 6 panels: (a) LSTM — PTB, (b) LSTM — WT2, (c) QRNN — PTB, (d) QRNN — WT2]
Figure 6.
Numerical experiments comparing SGD(M), Adam and SWATS with tuned learning rates on the AWD-LSTM and AWD-QRNN architectures on the PTB and WT-2 data sets.
[Figure 7 panels: (a) DenseNet — CIFAR-100, (b) QRNN — PTB]
Figure 7.
Evolution of the estimated SGD learning rate ($\gamma_k$) on two representative tasks.

Bradbury, J., Merity, S., Xiong, C., and Socher, R. Quasi-recurrent neural networks. arXiv preprint arXiv:1611.01576, 2016.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In
CVPR09, 2009.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.

Han, D., Kim, J., and Kim, J. Deep pyramidal residual networks. arXiv preprint arXiv:1610.02915, 2016.

Hardt, M., Recht, B., and Singer, Y. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Hu, J., Shen, L., and Sun, G. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.

Iandola, F., Moskewicz, M., Karayev, S., Girshick, R., Darrell, T., and Keutzer, K. DenseNet: Implementing efficient convnet descriptor pyramids. arXiv preprint arXiv:1404.1869, 2014.

Karpathy, A. A peek at trends in machine learning. https://medium.com/@karpathy/a-peek-at-trends-in-machine-learning-ab8a1085a106, 2017. [Online; accessed 12-Dec-2017].

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR 2015), 2015.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. 2009.

Lee, J., Simchowitz, M., Jordan, M. I., and Recht, B. Gradient descent converges to minimizers. University of California, Berkeley, 1050:16, 2016.

Loshchilov, I. and Hutter, F. SGDR: Stochastic gradient descent with warm restarts. 2016.

Loshchilov, I. and Hutter, F. Fixing weight decay regularization in Adam. ArXiv e-prints, November 2017.

Mandt, S., Hoffman, M. D., and Blei, D. M. Stochastic gradient descent as approximate Bayesian inference. ArXiv e-prints, April 2017.

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.

Merity, S., Keskar, N., and Socher, R. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182, 2017.

Mikolov, T., Kombrink, S., Deoras, A., Burget, L., and Cernocky, J. RNNLM — recurrent neural network language modeling toolkit. In Proc. of the 2011 ASRU Workshop, pp. 196–201, 2011.

Nocedal, J. and Wright, S. Numerical Optimization. Springer Science & Business Media, 2006.

Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407, 1951.

Soudry, D., Hoffer, E., and Srebro, N. The implicit bias of gradient descent on separable data. arXiv preprint arXiv:1710.10345, 2017.

Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pp. 1139–1147, 2013.

Tieleman, T. and Hinton, G. Lecture 6.5 — RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012.

Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. Regularization of neural networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 1058–1066, 2013.

Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., and Recht, B. The marginal value of adaptive gradient methods in machine learning. ArXiv e-prints, May 2017.

Wu, Y., Schuster, M., Chen, Z., Le, Q., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

Zhang, Z., Ma, L., Li, Z., and Wu, C. Normalized direction-preserving Adam. arXiv preprint arXiv:1709.04546, 2017.