The Two Regimes of Deep Network Training
Guillaume Leclerc
Aleksander Madry
February 25, 2020

Abstract
The learning rate schedule has a major impact on the performance of deep learning models. Still, the choice of a schedule is often heuristic. We aim to develop a precise understanding of the effects of different learning rate schedules and the appropriate way to select them. To this end, we isolate two distinct phases of training—the first, which we refer to as the "large-step" regime, exhibits rather poor performance from an optimization point of view but is the primary contributor to model generalization; the second, "small-step" regime exhibits much more "convex-like" optimization behavior but, used in isolation, produces models that generalize poorly. We find that by treating these regimes separately—and specializing our training algorithm to each one of them—we can significantly simplify learning rate schedules.
Finding the right learning rate schedule is critical to obtain the best testing accuracy for a given neural network architecture. As deep learning started gaining popularity, starting with the largest learning rate and gradually decreasing it became the standard practice (Bengio, 2012a). Indeed, today, such a "step schedule" still remains one of the most popular learning rate schedules, and when properly tuned it yields competitive models. More recently, as architectures grew deeper and wider, and as training on massive datasets became the norm, more elaborate schedules emerged (Loshchilov and Hutter, 2017; Smith, 2017; Smith and Nicholay, 2017). While these schedules have shown great practical success, it is still unclear why that is the case. Li and Arora (2019) even showed that, counter-intuitively, an exponentially increasing schedule can also be effective. In light of this, it is time to revisit learning rate schedules and shed some light on why some perform well and others do not.
Our contributions
In this paper, we identify two training regimes: (1) the large-step regime and (2) the small-step regime, which usually correspond to the start and the end of a "step schedule", respectively. In particular, we examine these regimes through the lens of optimization and generalization. We find that:

• In the large-step regime, the loss does not decrease consistently at each epoch and the final loss value obtained after convergence is much higher than when training in the small-step regime. In the latter regime, the reduction of the loss is faster and smoother, and to a large degree matches the intuition drawn from the convex optimization literature.

• In the large-step regime, momentum does not seem to have a discernible benefit. More precisely, we show that we can recover similar loss decrease curves for a wide range of different momentum values as long as we make a corresponding change in the learning rate. In the small-step regime, however, momentum becomes crucial to reaching a good solution quickly.

• Finally, we leverage this understanding to propose a simple two-stage learning rate schedule that achieves state-of-the-art performance on the
CIFAR-10 and
ImageNet datasets. Importantly, in this schedule, each stage uses a different algorithm and hyper-parameters.

Our findings suggest that it might be beneficial to depart from viewing deep network training as a single optimization problem and instead to explore using different algorithms for different stages of that process. In particular, second-order methods such as K-FAC (Martens and Grosse, 2015) might be well suited to the later stage of training.
Given a (differentiable) function f(θ), one of the most popular techniques for minimizing it is the gradient descent method (GD). This method, starting from an initial solution w_0, iteratively updates the solution w_t as:

$$ w_{t+1} = w_t - \eta \nabla f(w_t), \qquad (1) $$

where η > 0 is the learning rate. GD is the most natural and simple continuous optimization scheme. However, there is a host of more advanced variants of it. One of the prominent ones is momentum gradient descent, often referred to as the classic momentum, or heavy ball method (Polyak, 1964). It corresponds to the update rule:

$$ g_0 = 0, \qquad g_{t+1} = \mu g_t + \nabla f(w_t), \qquad w_{t+1} = w_t - \eta g_{t+1}, \qquad (2) $$

where µ is a scalar that controls the momentum accumulation. There are also other variants of momentum dynamics. Most prominently, Nesterov's accelerated gradient (Nesterov, 1983) offers a theoretically optimal convergence rate. However, it tends to have poor behavior in practice due to its brittleness. For that reason, and also because of its immense popularity, we will focus on the above-mentioned classic momentum dynamics instead.
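For concreteness, here is a minimal NumPy sketch of the heavy-ball update rule (2); the quadratic objective at the end is only an illustrative stand-in, not an experiment from this paper.

    import numpy as np

    def heavy_ball(grad_f, w0, lr=0.1, mu=0.9, steps=100):
        # Classic (heavy-ball) momentum, following update rule (2):
        # g_{t+1} = mu * g_t + grad f(w_t);  w_{t+1} = w_t - lr * g_{t+1}
        w = np.array(w0, dtype=float)
        g = np.zeros_like(w)            # momentum buffer, g_0 = 0
        for _ in range(steps):
            g = mu * g + grad_f(w)      # accumulate gradients
            w = w - lr * g              # take the (possibly enlarged) step
        return w

    # Illustrative use: minimize f(w) = 0.5 * ||w||^2, whose gradient is simply w.
    w_final = heavy_ball(lambda w: w, np.ones(10))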
In this work, we will be interested in isolating two learning regimes:

(A) "large-step" regime: corresponds to the highest learning rate that can be used without causing divergence, as per Bengio (2012a).

(B) "small-step" regime: corresponds to the largest learning rate at which the loss is consistently decreasing. (In Smith and Nicholay (2017), the authors propose an experimental procedure to estimate appropriate learning rates.)

In carefully tuned step-wise learning rate schedules, the first and last learning rates usually correspond to the large-step and small-step regimes. Our goal is to characterize and understand how these regimes differ—first from an optimization and then from a generalization perspective. (We could not identify a sharp boundary between these two regimes; learning rates in between the two extremes seem to essentially produce a mixture of the two behaviors.)
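One simple way to locate these two learning rates in practice is a short sweep over candidate values, in the spirit of the range test of Smith (2017). The sketch below is only an assumed illustration of such a probe, not the procedure used in this paper, and the helper train_few_epochs is hypothetical.

    import numpy as np

    def probe_regimes(train_few_epochs, candidate_lrs):
        # `train_few_epochs(lr)` is assumed to return the per-epoch training losses
        # of a short run with that learning rate (hypothetical helper).
        largest_stable, largest_monotone = None, None
        for lr in sorted(candidate_lrs):
            losses = np.asarray(train_few_epochs(lr))
            if np.isfinite(losses).all() and losses[-1] < losses[0]:
                largest_stable = lr            # candidate for the large-step regime (A)
            if np.all(np.diff(losses) <= 0):
                largest_monotone = lr          # candidate for the small-step regime (B)
        return largest_stable, largest_monotone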
Figure 1: Evolution of the training loss for 50 epochs with different momentum values on CIFAR-10 and VGG-13-BN. (Top) regime (A) and (bottom) regime (B).

By examining the evolution of the loss from initialization in Figure 1, we can note three major differences between the two regimes:

1. The best solution is found in the low learning rate regime, even though we performed the same number of 100-times-smaller steps—which corresponds to a much shorter distance traveled from the initialization.
2. In regime (A), the evolution of the loss is very noisy, while in (B), it decreases at almost every epoch.

3. Momentum seems to behave completely differently in the two experiments. At the top of Figure 1, the largest µ value yields the worst solution, whereas at the bottom, the final loss decreases as we increase µ.

These pieces of evidence suggest that regime (A) corresponds to a highly non-convex optimization problem, while the low learning rate regime reflects the intuitions from the convex optimization world. To highlight this distinction we will use momentum (as defined in Section 2).
Momentum. Momentum can provably accelerate gradient descent over functions that are convex, but does not provide any theoretical guarantees when that property does not hold. In order to highlight the different nature of the problems we are solving in each regime, we compare the behavior of momentum when used on a convex function and on a deep neural network under both regimes.

Ideally, we would like the momentum vector to be a signal that: (1) points reliably towards the optimum of our problem, and (2) is strong enough to actually have an impact on the trajectory. To focus on these two key properties, we track the two respective quantities:
1. Alignment: the angle between the momentum and the direction to the optimum,

$$ s_t = \frac{g_t \cdot (w^* - w_t)}{\|g_t\|\,\|w^* - w_t\|}. \qquad (3) $$

2. Scale: the ratio between the magnitude of the momentum vector and that of the gradient,

$$ r_t = \frac{\|g_t\|}{\|\nabla f(w_t)\|}. \qquad (4) $$

(When the optimum w* is not known, as is the case for neural networks, we instead use the solution our algorithm eventually converged to.)

Note that, in order to be helpful in the optimization process, one would expect the direction of the momentum vector to be correlated with the direction towards the optimum (alignment close to 1), and its scale large enough to be significant. Indeed, that is the behavior that provably emerges in the context of convex optimization.
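As a rough illustration of how these two quantities can be tracked, the NumPy sketch below runs heavy-ball momentum on a random positive semi-definite quadratic and records the alignment and scale at every step. The matrix construction and the hyper-parameters are illustrative assumptions, not the exact settings of Figure 2 (those are listed in Appendix C.3).

    import numpy as np

    rng = np.random.default_rng(0)
    d = 50
    B = rng.standard_normal((d, d))
    A = B @ B.T                           # random PSD matrix (illustrative, not the Figure 2 one)
    grad_f = lambda w: 2 * A @ w          # gradient of f(w) = w A w^T for symmetric A
    w_star = np.zeros(d)                  # the optimum of this quadratic is the origin

    w = rng.standard_normal(d)
    g = np.zeros(d)
    eta, mu = 1e-3, 0.9                   # illustrative values, small enough for stability
    alignments, scales = [], []

    for t in range(500):
        grad = grad_f(w)
        g = mu * g + grad                 # momentum buffer, update rule (2)
        step_dir = -g                     # direction the momentum actually pushes the iterate
        to_opt = w_star - w
        # alignment: cosine between the momentum step direction and the direction to the
        # optimum (spirit of Eq. (3); sign conventions may differ from the printed formula)
        alignments.append(step_dir @ to_opt /
                          (np.linalg.norm(step_dir) * np.linalg.norm(to_opt) + 1e-12))
        # scale: ratio of momentum magnitude to gradient magnitude, Eq. (4)
        scales.append(np.linalg.norm(g) / (np.linalg.norm(grad) + 1e-12))
        w = w - eta * g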
Convex function baseline. Figure 2 shows that, in the case of a quadratic convex function, a higher momentum value results in faster convergence. According to the middle plot, the momentum vector is a strong indicator of the direction towards the optimum (the alignment quickly goes to 1). Also, the scale r_t increases with the momentum and eventually converges towards 1/(1 − µ), which is what one would expect when the momentum is indeed accumulating.

Figure 2: Evolution of (top) the value of the function, (middle) the alignment s_t and (bottom) the scale r_t while optimizing a quadratic function f(w) = w A w^T, where A is a random positive semi-definite matrix with a fixed condition number (see Appendix C.3).
Deep learning setting. Now that we have seen that the metrics behave as expected on convex functions, we can measure them (Figure 3) on the experiment presented earlier. In regime (A), the scale r_t oscillates around 1 and the value of the alignment is very low. This means that the momentum vector is nearly orthogonal to the direction of the final solution and never constitutes a strong signal. In regime (B), the momentum vector is able to accumulate more and gives non-negligible information about the direction towards the point we are converging to.

According to Kidambi et al. (2018), momentum might not be able to cope with the noise coming from the stochasticity of SGD. While this is plausible, experiments in Appendix A using full gradients instead of mini-batches show that this noisiness has only minimal impact and that the step size is the most important factor determining the success of momentum.
Figure 3: Evolution of the metrics (top) r_t and (bottom) s_t corresponding to the experiment done in Figure 1.

This leads to the following informal argument: with small step sizes, the trajectory is unable to escape the current basin of attraction. The region is "locally convex" and, as a consequence, allows the momentum vector to point towards the same critical point during the optimization process, thus helping to speed up optimization. On the other hand, high learning rates penalize the effect of momentum. The steps taken at each iteration are large enough to escape the current basin of attraction and enter a different one (therefore optimizing towards a different local minimum). As this happens, the direction approximated by the momentum vector points to different critical points during the course of optimization. Thus, at some iteration, momentum steers the trajectory towards a point that is not reachable anymore. This hypothesis also explains why, in the top plot of Figure 1, we see momentum struggling more than vanilla SGD: the momentum vector, being nearly orthogonal, often completely disagrees with the gradient and slows down convergence.

If our objective is to minimize the loss, training in the small-step regime (B) is simpler and faster. Indeed, as we saw in Figure 1, it reached the same loss value twice as fast. It is therefore natural to ask: why do we even spend time in the high learning rate regime?
In deep learning, the loss is only a surrogate of our real objective: testing accuracy. It turns out that training only in the second regime, while fast, leads to very sharp minimizers. This is a phenomenon similar to what was described in Keskar et al. (2017) in the context of the batch size. The relationship between learning rate and generalization has already been studied in the past (Li et al., 2019; Hoffer et al., 2017; Keskar et al., 2017; Jiang et al., 2020). However, it seems that what truly defines the regime we are in is not the learning rate itself, but the actual step size.

Momentum, as defined in Section 2, increases the size of the step we actually take at each iteration of SGD. As we saw in Section 3.1, it does not seem to be able to speed up the optimization process. However, it is easy to find parameters where increasing momentum improves generalization. In this paper, we demonstrate that in the large-step regime (A), momentum solely boosts the step size. Indeed, assuming that the ratio $\|g_{t+1}\| / \|\nabla f(w_t)\|$ does not fluctuate much during training and can be approximated by a constant, we can simulate the increase in step size implied by momentum just by using a higher learning rate (Figure 4).

Figure 4: Testing accuracies obtained for various learning rates and three different values of µ. VGG-13-BN was trained using SGD.

Figure 4 indicates that the generalization ability is dictated by the size of the steps taken rather than the learning rate itself. (This is the reason why we prefer the terms large-step and small-step regimes: it is possible to take large steps with a small learning rate if momentum is large enough.) For three different momentum intensities, we can observe the same pattern repeating. Reductions in momentum appear to be compensated by increases in the learning rate. The three curves, albeit shifted, are surprisingly similar. They even exhibit the same drop just after their respective optimal learning rate. Additional experiments on other architectures and datasets were performed to rule out the hypothesis that these results are problem specific; results are presented in Figure 11. Also, in Appendix B, we explore in more detail the equivalence between pairs of learning rate and momentum values.
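The step-size equivalence suggested above can be made concrete with a short steady-state calculation (a standard argument added here for clarity, not taken from the original derivation). If the gradient is roughly constant in norm and direction over a window of iterations, the momentum buffer of update rule (2) behaves like a geometric series:

$$ g_t \approx \sum_{k=0}^{t-1} \mu^k \, \nabla f \;\longrightarrow\; \frac{1}{1-\mu}\,\nabla f, \qquad\text{so that}\qquad w_{t+1} - w_t \approx -\frac{\eta}{1-\mu}\,\nabla f(w_t). $$

Under this approximation, SGD with momentum µ and learning rate η takes steps comparable to plain SGD with learning rate η/(1 − µ), which is one way to read the horizontal shifts between the curves of Figure 4.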
Figure 5: Loss curves of the best learning rate for each momentum value in Figure 4.

Finally, Figure 5 shows the evolution of the loss during training for the best performing learning rate of each momentum value considered in Figure 4. It is clear that momentum had no impact here, as the trajectories are oddly similar. There is no evidence that convergence was improved at all. The only difference that we can observe is that each model reached a given loss value at a different time, but this does not seem to be linked to the intensity of momentum. Moreover, they all reach very similar losses at the end of training.

As we have characterized these two very distinct training regimes, it is tempting to experiment with a "stripped down" schedule that consists of two completely different phases; for each one we use an algorithm individually tuned to excel at a particular task. The first one has to be SGD as it provides good generalization to the model. The second can be any algorithm able to minimize the loss quickly. To stay consistent with the previous experiments we pick SGD with momentum here, but we believe that many algorithms would perform similarly or better. In particular, fast algorithms that have been criticized for their poor generalization ability, like K-FAC (Martens and Grosse, 2015) and L-BFGS (Liu and Nocedal, 1989), could be perfect candidates. First, we will appraise the benefits of having radically different momentum values for the two phases. Second, we will evaluate the performance of this two-step approach in comparison to the more elaborate three-phase training schedule.
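As an illustration of what such a two-phase schedule looks like in code, here is a minimal PyTorch-style sketch. The model, data loader, loss function, transition epoch, and all hyper-parameter values are placeholders chosen for illustration, not the tuned values from our experiments (those are listed, where legible, in Appendix C).

    import torch

    def train_two_phase(model, loader, loss_fn, epochs, transition_epoch, phase1, phase2):
        # Phase 1: plain SGD with a large step size, which is what drives generalization.
        opt = torch.optim.SGD(model.parameters(), **phase1)
        for epoch in range(epochs):
            if epoch == transition_epoch:
                # Phase 2: switch to an algorithm tuned to drive the loss down quickly
                # (here SGD with heavy momentum and a much smaller learning rate).
                opt = torch.optim.SGD(model.parameters(), **phase2)
            for x, y in loader:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
        return model

    # Illustrative call with made-up hyper-parameters:
    # train_two_phase(model, loader, torch.nn.CrossEntropyLoss(), epochs=100,
    #                 transition_epoch=50, phase1=dict(lr=0.5, momentum=0.0),
    #                 phase2=dict(lr=0.005, momentum=0.99))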
We believe that, even when researchers do search for the best momentum value, they assume—unlike for the learning rate—that it stays constant over training. For example, in Goyal et al. (2017) and Shallue et al. (2018), a large number of schedules are compared, yet momentum never changes over the course of training. However, as we saw, the two regimes are wildly different. This is why we suggest isolating the two regimes into two tasks and optimizing them individually.

It turns out that, with the appropriate learning rate, using momentum in the first phase has no observable impact on the performance of the models. However, having a larger momentum (again with an appropriate change in learning rate) is beneficial in the second phase, as it increases the final testing accuracy under the same budget. In order to control for the parameters, we trained multiple models and randomly picked the transition epoch, the epoch at which we switch from one algorithm to another. We display the distribution of testing accuracies obtained in Figure 6. On the top plot we see that the two distributions are the same (for a fixed second phase). On the bottom one, however, a more aggressive momentum associated with a smaller learning rate, on average, outperforms the "classic" parameters.

We previously observed that disabling momentum has to be accompanied by a corresponding increase in learning rate. To find such a learning rate we used random search and took the one whose training loss curve matched the baseline as closely as possible for the first 50 epochs (more details about this procedure in Appendix B).

Comparing against the popular three-step schedule, we find that two truly independent phases can perform similarly or better. This suggests that complex schedules are not necessary to train deep neural networks. We evaluate this schedule on two datasets:
CIFAR-10 and
ImageNet (Russakovsky et al., 2015). For the former, we sampled many transition points and took the median over equally sized bins. For the latter, because it is particularly expensive, only a few transition points were hand-picked. For
CIFAR-10, we used the same parameters as in Section 4.1. For
ImageNet, learning rates and momentum values were hand-picked, as optimizing them would have been prohibitively costly. Performance as a function of the transition epoch is shown in Figure 7 and Figure 8. In both cases, our schedule outperforms or matches the three-stage schedule for at least one value of the transition epoch. For
CIFAR-10, we also considered enabling momentum in the first phase (with the appropriate change in learning rate). As our previous experiments would suggest, the two configurations appear equivalent. (Results for different second-phase algorithms are available in the appendix in Figure 13.)

The older and more popular multiple-step learning rate schedules probably originate from the practical recommendations found in Bengio (2012b). Bottou et al. (2018) provide a theoretical argument that supports schedules with decreasing learning rates.
Figure 6: Impact on the distribution of testing accuracies when using different momentum values in the training phases; (top) is for the first phase and (bottom) is for the second.

More recently, Smith (2017) introduced the cyclic learning rate, which consists in a sequence of linear increases and decreases of the learning rate where the high and low values correspond to what we named the large-step and small-step regimes. Soon after, Smith and Nicholay (2017) concluded that a single period of that pattern is sufficient to obtain good performance; the resulting schedule is named "1cycle". Similarly to the cyclic learning rate schedule, Loshchilov and Hutter (2017) present SGDR, a schedule with sudden jumps of the learning rate similar to the restarts found in many gradient-free optimization techniques.

The learning rate is not the only parameter that has been considered to change over time. For example, Smith and Le (2018) and Goyal et al. (2017) both had success varying the batch size. This is aligned with our recommendation that every hyper-parameter should be optimized for each phase of training.

Figure 7: Evolution of the testing accuracy as a function of the transition epoch for our proposed simplified two-step schedule on
CIFAR-10. We present our schedule with and without momentum in the first phase to emphasize its lack of influence on the results.

Figure 8: Evolution of the testing accuracy as a function of the transition epoch for our proposed simplified two-step schedule on ImageNet.
The impact of large learning rates on generalization has received a lot of attention in the past. The predominant hypothesis is that they act as a regularizer (Li et al., 2019; Hoffer et al., 2017). It is believed that they either promote flatter minima (Keskar et al., 2017; Jiang et al., 2020) or increase the amount of noise during training (Mandt et al., 2017; Smith and Le, 2018).
In this paper, we studied the properties of the two regimes of deep network training. The large-step regime is typically found in the early stages of training, while the small-step regime tends to end it.

Our investigation shows that optimization in the large-step regime does not follow the training patterns typically expected in the convex setting: the evolution of the loss is very noisy and we reach a solution far from the optimal one. In this regime, the benefits of momentum are nuanced: it seems that any gain it offers can be compensated by a corresponding increase in learning rate.

The small-step regime seems fundamentally different: we obtain a lower loss, faster and smoothly, but the solutions generalize poorly. In this case, momentum can greatly speed up convergence—as it does in the convex case.

The intensity of momentum and, more generally, the optimization algorithm used are typically considered during hyper-parameter search. However, they are always kept constant over the whole training (cyclic learning rates and the schedule used in Goyal et al. (2017) are examples of exceptions). This restricts the search space drastically because we are unable to tailor them to the different training regimes we encounter. By separating the two regimes into two distinct problems we might be able to obtain better models and/or train them faster. Indeed, we demonstrate that a simple schedule consisting of only two stages, the first one being SGD with no momentum and the second SGD with a larger-than-usual momentum value, can be competitive with state-of-the-art learning rate schedules. This opens up the possibility of developing new training algorithms that are specialized for only one regime. This might also let us leverage second-order methods—usually criticized for their poor generalization performance—in the second phase of training.

References
Yoshua Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade: Second Edition, 2012a.
Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR), 2017.
L. N. Smith. Cyclical learning rates for training neural networks. In Winter Conference on Applications of Computer Vision, 2017.
Leslie N. Smith and Topin Nicholay. Super-convergence: Very fast training of neural networks using large learning rates. ArXiv preprint arXiv:1708.07120, 2017.
Zhiyuan Li and Sanjeev Arora. An exponential learning rate schedule for deep learning, 2019.
James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, 2015.
Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 1989.
B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 1964.
Yu. E. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 1983.
Rahul Kidambi, Praneeth Netrapalli, Prateek Jain, and Sham M. Kakade. On the insufficiency of existing momentum schemes for stochastic optimization. In International Conference on Learning Representations (ICLR), 2018.
Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations (ICLR), 2017.
Yuanzhi Li, Colin Wei, and Tengyu Ma. Towards explaining the regularization effect of initial large learning rate in training neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. In International Conference on Learning Representations, 2020.
Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. ArXiv preprint arXiv:1706.02677, 2017.
Christopher J. Shallue, Jaehoon Lee, Joseph M. Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl. Measuring the effects of data parallelism on neural network training. ArXiv preprint arXiv:1811.03600, 2018.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 2015.
Yoshua Bengio. Practical recommendations for gradient-based training of deep architectures. Lecture Notes in Computer Science, 7700:437–478, 2012b. doi: 10.1007/978-3-642-35289-8-26.
Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 2018.
Samuel L. Smith and Quoc V. Le. A Bayesian perspective on generalization and stochastic gradient descent. In International Conference on Learning Representations, 2018.
Stephan Mandt, Matthew D. Hoffman, and David M. Blei. Stochastic gradient descent as approximate Bayesian inference. The Journal of Machine Learning Research, 18(1), 2017.
Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
Appendices
A Full gradient experiments
In this appendix we reproduce some of the experiments made in Section 3.1 and Section 3.2 using full gradients instead of SGD, to rule out the possibility that the stochasticity is the cause of the inability of momentum to build up. Figure 9 and Figure 10 present results similar to Figure 1 and Figure 4, respectively.
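For reference, here is a minimal sketch of how a full (deterministic) gradient step can be computed in PyTorch by accumulating gradients over the entire dataset before updating; the helper name is illustrative and this is not necessarily the exact implementation used for these experiments.

    import torch

    def full_gradient_step(model, dataset_loader, loss_fn, optimizer):
        # One deterministic gradient step: average the gradient over the whole dataset.
        optimizer.zero_grad()
        n_batches = 0
        for x, y in dataset_loader:          # iterate over the entire training set
            loss_fn(model(x), y).backward()  # gradients accumulate in .grad
            n_batches += 1
        for p in model.parameters():         # turn the accumulated sum into an average
            if p.grad is not None:           # (assumes equal batch sizes)
                p.grad /= n_batches
        optimizer.step()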
B Momentum learning-rate equivalence in the large-step regime
To evaluate in more detail the relationship that ties the learning rate and the momentum together, we designed the following experiment:

1. Train models on CIFAR-10 for 50 epochs with a wide range of learning rates and three different momentum intensities.

2. For each configuration with µ = 0, find the corresponding configurations at the two non-zero momentum intensities that match the training loss curve the best (in norm).

3. Report the corresponding matching learning rate and the distance between the curves in Figure 12.

We see that for any learning rate in the range considered it is possible to find an equivalent learning rate with an almost identical behavior. Moreover, the relation between equivalent learning rates seems to be linear.

C Experiment details
C.1 Shared between all experiments
The details provided in this section are valid for every experiment unless specified otherwise:

• Programming language: Python 3
• Framework: PyTorch 1.0
• Dataset: CIFAR-10 (Krizhevsky, 2009)
• Batch size:
• Weight decay:
• Per-channel normalization: Yes
• Data augmentation:
  1. Random crop
  2. Random horizontal flip

Figure 9: Evolution of the training loss using full gradients with different momentum values on CIFAR-10 and VGG-13-BN.

Figure 10: Testing accuracy after 50 epochs as a function of the learning rate for different values of µ. VGG-13-BN was used, trained using GD and no weight decay.
Figure 11: Testing accuracies obtained for various learning rates and three different values of µ, with models trained using SGD. Models are the same within each row and are, from top to bottom: ResNet18, ResNet50, VGG13, VGG19. The (left) column shows CIFAR-10 and the (right) column CINIC-10.
Figure 12: Best equivalent learning rate (top) and corresponding distance between the loss curves (bottom) for different values of momentum.

C.2 Experiment visible on Figure 1 and Figure 3

• Architecture: VGG-13 (Simonyan and Zisserman, 2015) with extra batch norm layers (Ioffe and Szegedy, 2015)
• Learning rates: η = 0. and η = 0. for the large- and small-step regimes respectively
• Momentum type: Heavy ball (non-Nesterov)
• Momentum intensities: 0., 0. and 0.

C.3 Experiment visible on Figure 2

• Function optimized: f(w) = w A w^T
• Iterations:
• Properties of A: fixed positive semi-definite random matrix with a fixed range of eigenvalues
• Momentum type: Heavy ball (non-Nesterov)
• Momentum intensities: 0, 0., 0. and 0.
• Learning rates: picked to yield the best performance for each momentum value, using a grid search procedure; results of the grid search are visible in Figure 14
• Grid search range: 50 values of η equally spaced in log-scale, and 50 values of 1 − µ equally spaced in log-scale
C.4 Experiment visible on Figure 4 and Figure 5

• Architecture: VGG-13 (Simonyan and Zisserman, 2015) with extra batch norm layers (Ioffe and Szegedy, 2015)
• Momentum type: Heavy ball (non-Nesterov)
• Momentum intensities: 0, 0. and 0.99
• Learning rates: we performed a random search to find the best value for each momentum value, taking 20 samples uniformly in log-scale in a separate range for each of µ = 0, µ = 0. and µ = 0.99

C.5 Experiment visible at the top of Figure 6

• Framework: PyTorch 0.4.1
• Architecture: VGG-13 (Simonyan and Zisserman, 2015) with extra batch norm layers (Ioffe and Szegedy, 2015)
• Batch size:
• Optimizers:
  1. Phase 1 (with momentum): SGD
     (a) Learning rate: 0.
     (b) Momentum: 0.
     (c) Weight decay:
  2. Phase 1 (without momentum): SGD
     (a) Learning rate: 0.
     (b) Momentum: 0
     (c) Weight decay:
  3. Phase 2 (same for the two distributions): SGD with momentum
     (a) Learning rate: 0.
     (b) Momentum: 0.
     (c) Weight decay:

C.6 Experiment visible at the bottom of Figure 6

• Framework: PyTorch 0.4.1
• Architecture: VGG-13 (Simonyan and Zisserman, 2015) with extra batch norm layers (Ioffe and Szegedy, 2015)
• Batch size:
• Optimizers:
  1. Phase 1 (same for the two distributions): SGD
     (a) Learning rate: 0.
     (b) Momentum: 0.
     (c) Weight decay:
  2. Phase 2: SGD
     (a) Learning rate: displayed in the legend
     (b) Momentum: displayed in the legend
     (c) Weight decay:

Figure 13: Distribution of test accuracies with and without momentum in the first phase for different second-phase algorithms: (top) classic, (middle) reduced learning rate, (bottom) AdamW.

Figure 14: Result of the grid search to find the best learning rate for different momentum intensities. Displayed is log f(w) after 10000 iterations.

C.7 Experiment visible on Figure 7

• Framework: PyTorch 0.4.1
• Architecture: VGG-13 (Simonyan and Zisserman, 2015) with extra batch norm layers (Ioffe and Szegedy, 2015)
• Batch size:
• Data augmentation: None
• Optimizers:
  1. Phase 1: SGD
     (a) Learning rate: 0.
     (b) Momentum: 0
     (c) Weight decay:
  2. Phase 2: SGD with momentum
     (a) Learning rate: 0.
     (b) Momentum: 0.
     (c) Weight decay:
• Momentum type: Classic
• Reference testing accuracy: median over multiple training runs that we ran ourselves with the same parameters except for the learning rate schedule; the default three-stage schedule was used with a constant µ = 0.

C.8 Experiment visible on Figure 8

• Framework: PyTorch 0.4.1 + Robustness 1.1
• Architecture: ResNet-18 (He et al., 2016)
• Batch size:
• Data augmentation:
  1. Random crop to size 224
  2. Random horizontal flip
  3. Color jitter
  4. Lighting noise
• Optimizers:
  1. Phase 1: SGD
     (a) Learning rate:
     (b) Momentum: 0
     (c) Weight decay:
  2. Phase 2: SGD with momentum
     (a) Learning rate:
     (b) Momentum: 0.
     (c) Weight decay:
• Momentum type: Classic
• Reference testing accuracy: we used the value available at https://pytorch.org/docs/stable/torchvision/models.html