Learning by Turning: Neural Architecture Aware Optimisation
Yang Liu* [email protected]
Jeremy Bernstein* [email protected]
Markus Meister [email protected]
Yisong Yue [email protected]
*Equal contribution.
Abstract
Descent methods for deep networks are notoriously capricious: they require careful tuning of step size, momentum and weight decay, and which method will work best on a new benchmark is a priori unclear. To address this problem, this paper conducts a combined study of neural architecture and optimisation, leading to a new optimiser called Nero: the neuronal rotator. Nero trains reliably without momentum or weight decay, works in situations where Adam and SGD fail, and requires little to no learning rate tuning. Also, Nero's memory footprint is roughly the square root of that of Adam or LAMB. Nero combines two ideas: (1) projected gradient descent over the space of balanced networks; (2) neuron-specific updates, where the step size sets the angle through which each neuron's hyperplane turns. The paper concludes by discussing how this geometric connection between architecture and optimisation may impact theories of generalisation in deep learning.

Introduction

Deep learning has brought on a new paradigm in computer science, enabling artificial systems to interact with the world at an unprecedented level of complexity. That said, the core technology relies on various heuristic numerical techniques that are sometimes brittle and often require extensive tuning. A major goal of modern research in machine learning is to uncover the principles underlying learning in neural systems, and thus to derive more reliable learning algorithms.

Part of the challenge of this endeavour is that learning in deep networks is an inherently coupled problem. Suppose that training performance is sensitive to a particular detail of the neural architecture—then it is unclear whether that detail affects the expressivity of the architecture, or just the ability of the descent method to train the architecture.

This observation motivates the combined study of architecture and optimisation, and this paper explores several questions at that intersection. First of all:

⟨?⟩ What is the right domain of optimisation for a neural network's weights? Is it $\mathbb{R}^d$, or something more exotic—such as a Cartesian product of hyperspheres?

Typically, optimisation is conducted over $\mathbb{R}^d$, while a careful weight initialisation and a tuned weight decay hyperparameter impose a soft constraint on the optimisation domain. Since normalisation schemes such as batch norm (Ioffe & Szegedy, 2015) render the network invariant to the scale of the weights, weight decay also plays a somewhat subtle second role in modifying the effective learning rate. Hyperparameters with this kind of subtle coupling add to the compounding cost of hyperparameter search.

Furthermore, descent methods such as Adam (Kingma & Ba, 2015) and LAMB (You et al., 2020) use either synapse-specific or layer-specific gradient normalisation. This motivates a second question:

⟨?⟩ At what level of granularity should an optimiser work? Should normalisation occur per-synapse or per-layer—or perhaps, per-neuron?

This paper contends that in deep learning, hyperparameters proliferate because of hidden couplings between optimiser and architecture. By studying the above questions, and distilling the simple rules that govern optimisation and architecture, this paper aims to make deep learning less brittle—and less sensitive to opaque hyperparameters.
Summary of contributions:
1. A new optimiser—Nero: the neuronal rotator. Nero performs per-neuron projected gradient descent, and uses roughly the square root of the memory of Adam or LAMB.

2. Experiments across image classification, image generation, natural language processing and reinforcement learning, in which Nero's out-of-the-box configuration tends to outperform tuned baseline optimisers.

3. Discussion of how the connection between optimisation and architecture relates to generalisation theories, such as PAC-Bayes and norm-based complexity.

Related work
This section reviews relevant work pertaining to both neural architecture design and optimisation in machine learning, and concludes with a bridge to the neuroscience literature.
The importance of wiring constraints for the stable function of engineered neural systems is not a new discovery. One important concept is that of balanced excitation and inhibition. For instance, Rosenblatt (1958) found that balancing the proportion of excitatory and inhibitory synaptic connections made his perceptron more robust to varying input sizes. Another concept relates to the total magnitude of synapse strengths. For example, Rochester et al. (1956) constrained the sum of magnitudes of synapses impinging on a neuron so as to stabilise the process of learning. Similar ideas were explored by von der Malsburg (1973) and Miller & MacKay (1994). These works are early predecessors to this paper's definition of balanced networks given in Section 3.1.

Given the resurgence of neural networks over the last decade, the machine learning community has taken up the mantle of research on neural architecture design. Special weight scalings—such as Xavier init (Glorot & Bengio, 2010) and Kaiming init (He et al., 2015)—have been proposed to stabilise signal transmission through deep networks. These scalings are only imposed at initialisation and are free to wander during training—an issue which may be addressed by tuning a weight decay hyperparameter. More recent approaches—such as batch norm (Ioffe & Szegedy, 2015)—explicitly control activation statistics throughout training by adding extra normalisation layers to the network.

Other recent normalisation techniques lie closer to the work of Rosenblatt (1958) and Rochester et al. (1956). Techniques that involve constraining a neuron's weights to the unit hypersphere include: weight norm (Salimans & Kingma, 2016), decoupled networks (Liu et al., 2017, 2018) and orthogonal parameterised training (Liu et al., 2020). Techniques that also balance excitation and inhibition include centred weight norm (Huang et al., 2017) and weight standardisation (Qiao et al., 2019).
Much classic work in optimisation theory focuses on deriving convergence results for descent methods under assumptions such as convexity (Boyd & Vandenberghe, 2004) and Lipschitz continuity of the gradient (Nesterov, 2004). These simplifying assumptions are often used in the machine learning literature. For instance, Bottou et al. (2018) provide convergence guarantees for stochastic gradient descent (SGD) under each of these assumptions. However, these assumptions do not hold in deep learning (Sun, 2019).

On a related note, SGD is not the algorithm of choice in many deep learning applications, and heuristic methods such as RMSprop (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015) often work better. For instance, Adam often works much better than SGD for training generative adversarial networks (Bernstein et al., 2020a). Yet the theory behind Adam is poorly understood (Reddi et al., 2018).

A more recent line of work has explored optimisation methods that make relative updates to neural network parameters. Optimisers like LARS (You et al., 2017), LAMB (You et al., 2020) and Fromage (Bernstein et al., 2020a) make per-layer relative updates, while Madam (Bernstein et al., 2020b) makes per-synapse relative updates. You et al. (2017) found that these methods stabilise large batch training, while Bernstein et al. (2020a) found that they require little to no learning rate tuning across tasks.

Though these recent methods partially account for the neural architecture—by paying attention to its layered operator structure—they do not rigorously address the optimisation domain. As such, LARS and LAMB require a tunable weight decay hyperparameter, while Fromage and Madam restrict the optimisation to a bounded set of tunable size (i.e. weight clipping). Without this additional tuning, these methods can be unstable—see for instance (Bernstein et al., 2020a, Figure 2) and (Bernstein et al., 2020b, Figure 3).

The discussion in the previous paragraph typifies the machine learning state of the art: optimisation techniques that work well, albeit only after hyperparameter tuning. For instance, LAMB is arguably the state-of-the-art relative optimiser, but it contains in total five tunable hyperparameters. Since—at least naïvely—the cost of hyperparameter search is exponential in the number of hyperparameters, the prospect of fully tuning LAMB is computationally daunting.
Since the brain is a system that must learn stably without hyperparameter do-overs, it is worth looking to neuroscience for inspiration on designing better learning algorithms.

A major swathe of neuroscience research studies mechanisms by which the brain performs homeostatic control. For instance, neuroscientists report a form of homeostasis termed synaptic scaling, where a neuron modulates the strengths of all its synapses to stabilise its firing rate (Turrigiano, 2008). More generally, heterosynaptic plasticity refers to homeostatic mechanisms that modulate the strength of unstimulated synapses (Chistiakova et al., 2015). Shen et al. (2020) review connections to normalisation methods used in machine learning.

These observations inspired this paper to consider implementing homeostatic control via projected gradient descent—leading to the Nero optimiser.
Background Theory
In general, an L-layer neural network $f(\cdot)$ is a composition of L simpler functions $f_1(\cdot), \dots, f_L(\cdot)$:

$f(x) = f_L \circ f_{L-1} \circ \dots \circ f_1(x).$    (forward pass)

Due to this compositionality, any slight ill-conditioning in the simple functions $f_i(\cdot)$ has the potential to compound over layers, making the overall network $f(\cdot)$ very ill-conditioned. Architecture design should aim to prevent this from happening, as will be covered in Section 3.1.

The Jacobian $\partial f / \partial f_l$, which plays a key role in evaluating gradients, also takes the form of a deep product:

$\dfrac{\partial f}{\partial f_l} = \dfrac{\partial f_L}{\partial f_{L-1}} \cdot \dfrac{\partial f_{L-1}}{\partial f_{L-2}} \cdot \dots \cdot \dfrac{\partial f_{l+1}}{\partial f_l}.$    (backward pass)

Therefore, it is also important from the perspective of gradient-based optimisation that compositionality is adequately addressed, as will be covered in Section 3.2.
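To make the compounding effect concrete, here is a small numerical sketch (illustrative only; the width, depth and random-matrix model are arbitrary choices, not taken from the paper) in which each layer's Jacobian is only mildly ill-conditioned on its own, yet the conditioning of the end-to-end product degrades rapidly with depth:

    import numpy as np

    rng = np.random.default_rng(0)
    width, depth = 64, 20

    # Each layer's Jacobian is a random matrix scaled (as in Kaiming/Xavier init)
    # so that it roughly preserves the size of signals passing through it.
    jacobians = [rng.standard_normal((width, width)) / np.sqrt(width) for _ in range(depth)]
    print(f"condition number of a single layer: {np.linalg.cond(jacobians[0]):.1e}")

    J = np.eye(width)
    for l, A in enumerate(jacobians, start=1):
        J = A @ J  # the backward pass multiplies the per-layer Jacobians together
        if l % 5 == 0:
            print(f"condition number after {l:2d} layers: {np.linalg.cond(J):.1e}")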
A common strategy to mitigate the issue of compounding ill-conditioning is to explicitly re-normalise the activations at every network layer. Batch norm (Ioffe & Szegedy, 2015) exemplifies this strategy, and was found to improve the trainability of deep residual networks. Batch norm works by standardising the activations across a batch of inputs at each network layer—that is, it shifts and scales the activations to have mean zero and variance one across a batch.

Although batch norm works well, it adds computational overhead to both the forward and backward pass. To explore how far one can get without explicit re-normalisation, the following definitions are useful:

Definition 1.
A neuron is balanced if its weight vector $w \in \mathbb{R}^d$ satisfies the following constraints:

$\sum_{i=1}^{d} w_i = 0$    (balanced excitation & inhibition)

$\sum_{i=1}^{d} w_i^2 = 1$.    ($\ell_2$ constant sum rule)
Definition 2. A network is balanced if all its constituent neurons are balanced.

As noted by Huang et al. (2017), balanced neurons attain some of the properties of batch norm for free. To see this, consider a linear neuron $y = \sum_i w_i x_i$ with inputs $x_i$ that are uncorrelated with mean $\mu$ and variance $1$. Then the output $y$ is standardised:

$\mathbb{E}[y] = \sum_i w_i\, \mathbb{E}[x_i] = \mu \sum_i w_i = 0$;

$\mathrm{Var}[y] = \sum_i w_i^2\, \mathrm{Var}[x_i] = \sum_i w_i^2 = 1$.

While the assumptions on the inputs $x_i$ are unlikely to hold exactly, under more general conditions the constraints may at least encourage the standardisation of activation statistics through the layers of the network (Brock et al., 2021).
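As a quick numerical check of this calculation (not from the paper; the fan-in, input mean and sample count below are arbitrary), one can project a random weight vector onto the constraint set of Definition 1 and verify that the output of the resulting linear neuron is approximately standardised:

    import numpy as np

    rng = np.random.default_rng(0)
    d, n_samples, mu = 256, 100_000, 3.0

    # Project random weights onto the balanced constraint set:
    # zero mean (balanced excitation & inhibition) and unit l2 norm.
    w = rng.standard_normal(d)
    w = w - w.mean()
    w = w / np.linalg.norm(w)

    # Uncorrelated inputs with mean mu and unit variance.
    x = mu + rng.standard_normal((n_samples, d))
    y = x @ w

    print(f"E[y]   ~ {y.mean():+.3f}  (balanced constraints give 0)")
    print(f"Var[y] ~ {y.var():.3f}   (balanced constraints give 1)")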
Since a network is trained via perturbations to its parameters, it is important to know what size perturbations are appropriate. Consider an L-layer network with weight matrices $W = (W_1, W_2, \dots, W_L)$ and loss function $\mathcal{L}(W)$. For a perturbation $\Delta W = (\Delta W_1, \Delta W_2, \dots, \Delta W_L)$, the following definition establishes a notion of stable step size:

Definition 3. Let $\theta_l$ denote the angle between $\Delta W_l$ and $-\nabla_{W_l}\mathcal{L}(W)$. A descent step is stable if for all $l = 1, \dots, L$:

$\dfrac{\|\nabla_{W_l}\mathcal{L}(W + \Delta W) - \nabla_{W_l}\mathcal{L}(W)\|_F}{\|\nabla_{W_l}\mathcal{L}(W)\|_F} < \cos\theta_l.$    (1)

Or in words: for each layer, the relative change in gradient induced by the perturbation should not exceed the cosine of the angle between the perturbation and the negative gradient.

This definition is useful because Bernstein et al. (2020a) proved that a stable descent step is guaranteed to decrease a continuously differentiable loss function $\mathcal{L}(W)$. Since Inequality 1 is of little use without a model of its left-hand side, Bernstein et al. (2020a) proposed the following model:
Definition 4. The loss function obeys deep relative trust if for all perturbations $\Delta W = (\Delta W_1, \Delta W_2, \dots, \Delta W_L)$:

$\dfrac{\|\nabla_{W_l}\mathcal{L}(W + \Delta W) - \nabla_{W_l}\mathcal{L}(W)\|_F}{\|\nabla_{W_l}\mathcal{L}(W)\|_F} \leq \prod_{k=1}^{L}\left(1 + \dfrac{\|\Delta W_k\|_F}{\|W_k\|_F}\right) - 1.$

While deep relative trust is based on a perturbation analysis of L-layer perceptrons (Bernstein et al., 2020a, Theorem 1), the key idea is that its product structure explicitly models the product structure of the network's backward pass.

The deep relative trust model suggests that a stable descent step should involve small relative perturbations per layer. This motivates the layer-wise family of descent methods (You et al., 2017, 2020).
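As a concrete schematic of such a per-layer relative update (this is a simplified form; LARS and LAMB add further ingredients such as momentum and clipping of the trust ratio):

    \Delta W_l \;=\; -\,\eta\,\frac{\|W_l\|_F}{\|\nabla_{W_l}\mathcal{L}(W)\|_F}\,\nabla_{W_l}\mathcal{L}(W),
    \qquad\text{so that}\qquad
    \frac{\|\Delta W_l\|_F}{\|W_l\|_F} \;=\; \eta .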
Still, it is unclear whether layers are the right base object to consider. Perhaps a more refined analysis would replace the layers appearing in Definition 4 with individual neurons or even synapses. Small relative perturbations per-synapse were explored by Bernstein et al. (2020b) and found to slightly degrade training performance compared to Adam and SGD. But this paper will explore the per-neuron middle ground:

Definition 5. A step of size $\eta > 0$ is said to be per-neuron relative if for any neuron with weights $w \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$, the perturbations $\Delta w \in \mathbb{R}^d$ and $\Delta b \in \mathbb{R}$ satisfy:

$\|\Delta w\| / \|w\| \lesssim \eta \quad \text{and} \quad |\Delta b| / |b| \lesssim \eta.$

A per-neuron relative update is automatically per-layer relative. To see this, consider a weight matrix $W$ whose $N$ rows correspond to $N$ neurons $w^{(1)}, \dots, w^{(N)}$. Then:

$\dfrac{\|\Delta W\|_F}{\|W\|_F} = \sqrt{\dfrac{\sum_{i=1}^{N}\|\Delta w^{(i)}\|^2}{\sum_{i=1}^{N}\|w^{(i)}\|^2}} \lesssim \sqrt{\dfrac{\sum_{i=1}^{N}\eta^2\|w^{(i)}\|^2}{\sum_{i=1}^{N}\|w^{(i)}\|^2}} = \eta.$    (2)
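Equation 2 can also be checked numerically. In the sketch below (the matrix shape and step size are arbitrary), each row of a weight matrix receives a perturbation of relative size exactly η, and the relative Frobenius-norm change of the whole matrix is confirmed to be at most η:

    import numpy as np

    rng = np.random.default_rng(1)
    n_neurons, fan_in, eta = 128, 256, 0.01

    W = rng.standard_normal((n_neurons, fan_in))

    # Give each neuron (row) its own random perturbation of relative size eta.
    dW = rng.standard_normal((n_neurons, fan_in))
    row_scale = eta * np.linalg.norm(W, axis=1) / np.linalg.norm(dW, axis=1)
    dW = dW * row_scale[:, None]

    per_layer = np.linalg.norm(dW) / np.linalg.norm(W)
    print(f"per-layer relative update size = {per_layer:.4f} (bounded by eta = {eta})")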
Nero: the Neuronal Rotator

Following the discussion in Section 3, this paper will consider an optimisation algorithm that makes per-neuron relative updates (Definition 5) constrained to the space of balanced networks (Definition 2).

Since a balanced neuron is constrained to the unit hypersphere, a per-neuron relative update with step size $\eta$ corresponds to a pure rotation of the neuron's weight vector by angle $\approx \eta$. To see this, consider the isosceles triangle formed by $w$ and $w + \Delta w$, with $\|w\| = \|w + \Delta w\| = 1$ and $\|\Delta w\| = \eta$, and let $\theta$ denote the rotation angle between the two unit vectors; then take $\eta$ small.
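For completeness, the small-angle calculation is the following (standard trigonometry, spelled out here rather than quoted from the paper):

    \|\Delta w\| \;=\; 2\sin\tfrac{\theta}{2}
    \quad\Longrightarrow\quad
    \theta \;=\; 2\arcsin\tfrac{\eta}{2} \;=\; \eta + \tfrac{\eta^{3}}{24} + O(\eta^{5}) \;\approx\; \eta
    \quad\text{for small } \eta .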
Nero : the neuronal rotator.Nero’s goal is to reduce the burden of hyperparametertuning by baking architectural information into the opti-miser. More concretely, the anticipated advantages are asfollows:1. Since per-neuron relative updates are automaticallyper-layer relative by Equation 2, they should inheritthe properties of per-layer updates—in particular,stability across batch sizes (You et al., 2017) whileneeding little to no learning rate tuning (Bernsteinet al., 2020a).2. Since balanced networks place hard constraints onthe norm of a neuron’s weights, the need for initiali-sation tuning and weight decay should be removed.3. Gradients are often normalised by running averages,in order to retain relative scale information betweensuccessive minibatch gradients (Tieleman & Hinton,2012). Along with momentum, this is the mainmemory overhead of Adam and LAMB compared tovanilla SGD. Per-neuron running averages consume „ square root the memory of per-synapse runningaverages.4. Since normalisation is local to a neuron, no com-munication is needed between neurons in a layer(unlike for per-layer updates). This makes the op-timiser more distributable—for example, a singlelayer can be split across multiple compute deviceswithout fuss. For the same reason, the Nero updateis biologically plausible.There is one significant difference between Nero andprior work on balanced networks. In centred weight norm(Huang et al., 2017) and weight standardisation (Qiaoet al., 2019), a neuron’s underlying weight representationis an unnormalised vector r w P R d —which is normalised byincluding the following reparameterisation in the neuralarchitecture: normalise p r w q : “ r w ´ T r w ¨ { d } r w ´ T r w ¨ { d } , (3)where denotes the vector of 1s. Algorithm 1
Since the target of automatic differentiation is still the unnormalised vector $\tilde{w}$, overhead is incurred in both the forward and backward pass. Moreover, there is a subtle coupling between the step size in additive optimisers like Adam and the scale of the unnormalised weights $\tilde{w}$ (see Section 5.3).

In contrast, Nero opts to implement balanced networks via projected gradient descent. This is lighter-weight than Equation 3, since duplicate copies of the weights are not needed and the network's backward pass does not involve extra operations. Furthermore, Nero can be used as a drop-in replacement for optimisers like Adam, SGD or LAMB, without the user needing to manually modify the network architecture via the reparameterisation in Equation 3.

Pseudocode for Nero is provided in Algorithm 1. For brevity, the Adam-style bias correction of the running averages is omitted from the pseudocode. But in the Pytorch implementation used in this paper's experiments, the running averages $\bar{g}_w$ and $\bar{g}_b$ are divided by a factor of $\sqrt{1 - \beta^t}$ before the $t$-th update. This corrects for the warmup bias stemming from $\bar{g}_w$ and $\bar{g}_b$ being initialised to zero (Kingma & Ba, 2015).

While the pseudocode in Algorithm 1 is presented for neurons and biases, in the Pytorch implementation the bias update is applied to any parameters lacking a notion of fan-in—including batch norm gains and biases. A typical initialisation scale is $\sigma_b = 1$ for gains, with a smaller value for biases; the Pytorch implementation of Nero used a small nonzero $\sigma_b$ for any bias parameter initialised to zero.

Algorithm 1: Nero optimiser. "Out-of-the-box" hyperparameter defaults are $\eta = 0.01$ and $\beta = 0.999$. The constant $\sigma_b \in \mathbb{R}_+$ refers to the initialisation scale of the biases.

    Input: step size η ∈ (0, 1], averaging constant β ∈ [0, 1)
    repeat
        for each neuron do
            ▹ get weight gradient g_w ∈ ℝⁿ and bias gradient g_b ∈ ℝ
            ▹ update running averages
            ḡ_w² ← β · ḡ_w² + (1 − β) · ‖g_w‖²
            ḡ_b² ← β · ḡ_b² + (1 − β) · g_b²
            ▹ update weights w ∈ ℝⁿ and bias b ∈ ℝ
            w ← w − η · (‖w‖ / ḡ_w) · g_w
            b ← b − η · (σ_b / ḡ_b) · g_b
            ▹ project weights back to the constraint set
            w ← w − (1/n) Σᵢ wᵢ
            w ← w / ‖w‖
        end for
    until converged
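To illustrate how Algorithm 1 maps onto a standard optimiser interface, here is a minimal PyTorch sketch of the update for 2D weight matrices (one neuron per row). It is a simplified reading of Algorithm 1, not the official implementation at github.com/jxbz/nero, which should be consulted for the exact treatment of biases, gains and bias correction:

    import torch

    class NeroSketch(torch.optim.Optimizer):
        """Minimal sketch of Algorithm 1 for 2D weight matrices (one neuron
        per row). Bias parameters, bias correction of the running averages
        and parameters without a fan-in dimension are ignored for brevity."""

        def __init__(self, params, lr=0.01, beta=0.999, eps=1e-8):
            super().__init__(params, dict(lr=lr, beta=beta, eps=eps))

        @torch.no_grad()
        def step(self, closure=None):
            for group in self.param_groups:
                lr, beta, eps = group["lr"], group["beta"], group["eps"]
                for p in group["params"]:
                    if p.grad is None or p.dim() != 2:
                        continue
                    state = self.state[p]
                    if len(state) == 0:
                        # Project the initial weights onto the balanced constraint set.
                        p.sub_(p.mean(dim=1, keepdim=True))
                        p.div_(p.norm(dim=1, keepdim=True) + eps)
                        state["avg_sq"] = torch.zeros(p.shape[0], 1, device=p.device, dtype=p.dtype)
                    g = p.grad
                    # Per-neuron running average of the squared gradient norm.
                    state["avg_sq"].mul_(beta).add_((1 - beta) * g.norm(dim=1, keepdim=True) ** 2)
                    scale = p.norm(dim=1, keepdim=True) / (state["avg_sq"].sqrt() + eps)
                    # Per-neuron relative update, then projection back to the constraint set.
                    p.sub_(lr * scale * g)
                    p.sub_(p.mean(dim=1, keepdim=True))
                    p.div_(p.norm(dim=1, keepdim=True) + eps)

Used this way, the sketch is a drop-in replacement, e.g. optimiser = NeroSketch(model.parameters(), lr=0.01).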
Experiments

This section begins with targeted experiments intended to demonstrate Nero's key properties. Then, in Section 5.6, Nero is benchmarked across a range of popular tasks. In all figures, the mean and range are plotted over three repeats. For Nero, out-of-the-box refers to setting $\eta = 0.01$ and $\beta = 0.999$. More experimental details are given in Appendix A.

To verify that projecting to the space of balanced networks improves the performance of Nero, an ablation experiment was conducted. As can be seen in Figure 1, when training a VGG-11 image classifier on the CIFAR-10 dataset, Nero performed best with both constraints switched on.
Since Bernstein et al. (2020b) found that per-synapse relative updates led to slightly degraded performance, while per-layer relative updates typically perform well (You et al., 2017, 2020; Bernstein et al., 2020a), this section compares per-synapse, per-neuron and per-layer relative updates. In particular, Nero is compared to Madam (per-synapse relative) and LAMB (per-layer relative).

A VGG-11 model was trained on the CIFAR-10 dataset. Without constraints, the three optimisers achieved similar top-1 validation error (Figure 2, top). Constraining to the space of balanced networks (Definition 2) improved both Nero and LAMB, but did not have a significant effect on Madam (Figure 2, bottom). In both configurations, Nero outperformed Madam and LAMB, demonstrating the viability of per-neuron relative updates.

Existing implementations of balanced networks (Definition 2) work via the reparameterisation given in Equation 3 (Huang et al., 2017; Qiao et al., 2019). This leads to an undesired coupling between the learning rate in optimisers like Adam and the scale of the unnormalised $\tilde{w}$ parameters. To verify this, a network with weights normalised by Equation 3 was trained to classify the MNIST dataset. The initial weights $\tilde{w}$ were drawn from $\mathcal{N}(0, \sigma^2)$, and the experiment was repeated for $\sigma = 1$ and $\sigma = 100$. The Adam optimiser was used for training with a fixed learning rate. As can be seen in Figure 3 (left), the training performance was sensitive to the weight scale $\sigma$, despite the fact that a weight normalisation scheme was being used.

The unnecessary scale freedom of reparameterisation can lead to other undesired consequences, such as numerical overflow. Nero completely eliminates this issue by implementing balanced networks via projected gradient descent.
Figure 1.
Ablating the balanced network constraints. A VGG-11 network was trained on CIFAR-10. The legend denotes which of Nero's constraints were active. Mean refers to balanced excitation & inhibition, while norm refers to the $\ell_2$ constant sum rule.
Figure 2.
Comparing per-synapse (Madam), per-neuron (Nero) and per-layer (LAMB) relative updates. A VGG-11 network was trained to classify CIFAR-10. Top: all optimisers without balanced network constraints. Bottom: all optimisers with constraints.
Figure 3.
Left: Training a 5-layer perceptron normalised via reparameterisation (Equation 3) on MNIST. For a fixed Adam learning rate, training is sensitive to the scale $\sigma$ of the raw weights $\tilde{w}$. This motivates the different approach taken by Nero. Right: Using Nero to train a 100-layer perceptron—without batch norm or skip connections—to classify MNIST.

5.4 Nero Trains Deeper Networks

Very deep networks are typically difficult to train without architectural modifications such as residual connections (He et al., 2016) or batch norm (Ioffe & Szegedy, 2015). To test whether Nero enables training very deep models without such modifications, Figure 3 (right) shows the results of training a very deep multilayer perceptron (MLP) on the MNIST dataset. Unlike SGD, Adam and LAMB, Nero could reliably train a 100-layer MLP.
This section compares Nero out-of-the-box to an SGD implementation with tuned learning rate, weight decay and momentum. The comparison was made for training a ResNet-50 image classifier on the ImageNet dataset. As can be seen in Figure 4, SGD with tuned learning rate, momentum, and weight decay outperformed Nero. However, the optimal set of SGD hyperparameters was brittle, and ablating weight decay alone increased the top-1 validation error by 5%.
This section probes the versatility and robustness of Nero by comparing its optimisation and generalisation performance with three popular alternatives—SGD, Adam, and LAMB—across six tasks. The tasks span the domains of computer vision, natural language processing, and reinforcement learning. A wide spectrum of neural architectures was tested—from convolutional networks to transformers.

To make a fair comparison between optimisers, a fair hyperparameter tuning strategy is needed. In this section:

1. Learning rates were tuned over {0.0001, 0.001, 0.01, 0.1, 1.0}.

2. For Adam, LAMB and SGD, the momentum hyperparameter was tuned to achieve good performance on the most complicated benchmark—cGAN training—and then fixed across the rest of the benchmarks. In each case, the best momentum value for cGAN was 0.

3. $\beta$ in Nero and $\beta_2$ in Adam and LAMB were fixed to 0.999 across all experiments, as recommended by Kingma & Ba (2015) and You et al. (2020).

4. Weight decay was not used in any of the experiments.

The results are collated in Table 1. Nero achieved the best validation performance in every experiment—while the runner-up varied across tasks. What's more, the same learning rate of $\eta = 0.01$ was optimal for Nero in five out of six experiments. This means that Nero has strong out-of-the-box performance, since Nero's only other hyperparameter was fixed to $\beta = 0.999$ across all experiments.

The remainder of this section discusses each experiment in turn. Implementation details are given in Appendix A.
Figure 4.
Training a ResNet-50 network to classify the ImageNet dataset. Nero uses its out-of-the-box default hyperparameters $\eta = 0.01$ and $\beta = 0.999$. SGD+wd uses initial learning rate 0.1, momentum 0.9 and weight decay (wd) 0.0001 as tuned by He et al. (2016). SGD is also shown without weight decay.
Figure 5.
Class-conditional GAN training on CIFAR-10. Equal learning rates were used in the generator and discriminator. The Fréchet Inception Distance (Heusel et al., 2017, FID) measures the distance between the sample statistics of real and fake data as represented at a deep layer of a pre-trained image classifier.
Figure 6.
CIFAR-10 classification. Top: performance of a vanilla, convolutional VGG-11 network. Bottom: performance of a batch-normalised, residual ResNet-18 network.

Task | Dataset | Model | Metric
cGAN | CIFAR-10 | BigGAN-like | FID (↓)
classification | CIFAR-10 | VGG-11 | top-1 error (↓)
classification | CIFAR-10 | ResNet-18 | top-1 error (↓)
language modelling | Wikitext-2 | transformer (19 tensors) | perplexity (↓)
translation | WMT16 En–De | transformer (121 tensors) | perplexity (↓)
reinforcement learning | Atari Pong | PPO policy network | reward (↑)

Table 1. Validation results for the best learning rate $\eta$. The best result is shown in bold, while the runner-up is underlined.

Image synthesis with cGAN
Generative Adversarial Network (Goodfellow et al., 2014, GAN) training is perhaps the most challenging optimisation problem tackled in this paper. Good performance has traditionally relied on extensive tuning: different learning rates are often used in the generator and discriminator (Heusel et al., 2017) and training is highly sensitive to momentum (Brock et al., 2019, p. 35). The class-conditional GAN model in this paper is based on the BigGAN architecture (Brock et al., 2019). This is a heterogeneous network involving a variety of building blocks: convolutions, embeddings, fully connected layers, attention layers, conditional batch norm and spectral norm (Miyato et al., 2018). The results are presented in Figure 5.
Image classification
In Section 5.5, Nero out-of-the-box was shown to outperform SGD without weight decay when training ResNet-50 on ImageNet. Due to limited computational resources, the authors of this paper were unable to run the LAMB and Adam baselines on ImageNet. Experiments were run across all baselines on the smaller CIFAR-10 dataset instead. The networks used were the vanilla, convolutional VGG-11 network (Simonyan & Zisserman, 2015) and the batch-normalised, residual ResNet-18 network (He et al., 2016). The results are presented in Figure 6.
Natural language processing
Much recent progress in natural language processing is based on the transformer architecture (Vaswani et al., 2017). Transformers process information via layered, all-to-all comparisons—without recourse to recurrence or convolution. This paper experimented with a smaller transformer (19 tensors) trained on the Wikitext-2 dataset, and a larger transformer (121 tensors) trained on WMT2016 English–German translation. The results are presented in Figures 7 and 8.
Reinforcement learning
Many reinforcement learning algorithms use neural networks to perform function approximation. Proximal Policy Optimization (Schulman et al., 2017, PPO) is one example, and PPO has gained increasing popularity for its simplicity, scalability, and robust performance. This paper experimented with PPO on the Atari Pong video game. The results are presented in Figure 9.

While LAMB failed to train on this task, further investigation revealed that setting LAMB's momentum hyperparameter to 0.9 enabled LAMB to learn. This demonstrates LAMB's sensitivity to hyperparameters.
Figure 7.
Training a language model on the Wikitext-2 dataset. A small transformer network was used, composed of 19 tensors. Nero achieved the best anytime performance.
Figure 8.
Training an English–German translation model on WMT16. A larger transformer network was used, composed of 121 tensors. The optimisers with gradient normalisation—Nero, Adam, and LAMB—performed best in training this model. Training with SGD was unstable and led to significantly worse perplexity.
Figure 9.
Training a policy network to play Pong. Proximal Policy Optimisation (PPO) was used. Pong's reward is bounded between ±21. While investigating LAMB's failure to train the policy network, it was discovered that adjusting the $\beta_1$ momentum hyperparameter from 0 to 0.9 improved LAMB's performance.

Discussion: Rotation and Generalisation
The results in this paper may have a bearing on the generalisation theory of neural systems—an area of research that is still not settled. Consider the following hypothesis:
Hypothesis 1.
Deep learning generalises because SGD is biased towards solutions with small norm.
This hypothesis is well-known, and is alluded to or mentioned explicitly in many papers (Wilson et al., 2017; Zhang et al., 2017; Bansal et al., 2018; Advani et al., 2020).

But in light of the results in Table 1, Hypothesis 1 encounters some basic problems. First, for some tasks—such as the GAN and translation experiments—SGD simply performs very poorly. And second, Nero is able to find generalising solutions even when the norm of the network is constrained. For instance, the VGG-11 network and the Wikitext-2 transformer model have no gain parameters, so, under Nero, the norm of the weights (though not the biases) is fixed and cannot be "adapting to the data complexity".

Then it seems right to consider an alternative theory:
Hypothesis 2.
Deep learning generalises because the space of networks that fit the training data has large measure.
This hypothesis is essentially the PAC-Bayesian generalisation theory (McAllester, 1998; Langford & Seeger, 2001) applied to deep learning. Valle-Perez et al. (2019) have developed this line of work, proving the following result:
Theorem 1 (Realisable PAC-Bayes). First, fix a probability measure $P$ over the weight space $\Omega$ of a classifier. Let $S$ denote a training set of $n$ iid datapoints and let $V_S \subset \Omega$ denote the version space—that is, the subset of classifiers that fit the training data. Consider the population error rate $0 \leq \varepsilon(w) \leq 1$ of weight setting $w \in \Omega$, and its average over the version space $\varepsilon(V_S) := \mathbb{E}_{w \sim P}[\varepsilon(w) \mid w \in V_S]$. Then, for a proportion $1 - \delta$ of random draws of the training set $S$,

$\varepsilon(V_S) \leq \ln\dfrac{1}{1 - \varepsilon(V_S)} \leq \dfrac{\ln\frac{1}{P[V_S]} + \ln\frac{2n}{\delta}}{n - 1}.$    (4)

The intuition is that for a larger measure of solutions $P[V_S]$, less information needs to be extracted from the training data to find just one solution, thus memorisation is less likely.

A simple formula for $P[V_S]$ is possible based on this paper's connection between optimisation and architecture, since the problem is reduced to hyperspherical geometry. Consider a balanced network (Definition 2) composed of $m$ neurons each with fan-in $d$. Then the optimisation domain is isomorphic to the Cartesian product of $m$ hyperspheres:

$\Omega \cong S^{d-1} \times \dots \times S^{d-1}$ ($m$ times),

while $P$ can be fixed to the uniform distribution on $\Omega$. By restricting deep relative trust (Definition 4) or its antecedent (Bernstein et al., 2020a, Theorem 1) to balanced networks, the following definition becomes natural:
Definition 6. A solution that attains zero training error is α-robust if all neurons may be simultaneously and arbitrarily rotated by up to angle α without inducing an error.

Geometrically, an α-robust solution is the product of $m$ hyperspherical caps. If the version space consists of $K$ non-intersecting α-robust solutions, then its measure is:

$P[V_S] = K \cdot P[\mathrm{cap}_{d-1}(\alpha)]^m \geq \dfrac{K}{2^m}\sin^{m(d-1)}\alpha,$    (5)

where $\mathrm{cap}_{d-1}(\alpha)$ denotes an α-cap of $S^{d-1}$, and the inequality follows from (Ball, 1997, Lemma 2.3). Combining Inequality 5 with Inequality 4 yields the following generalisation bound for neural networks:

$\varepsilon(V_S) \leq \dfrac{m\ln 2 + m(d-1)\ln\frac{1}{\sin\alpha} + \ln\frac{2n}{\delta} - \ln K}{n - 1}.$

Focusing on the dominant terms, the bound suggests that the average test error $\varepsilon(V_S)$ over the space of solutions $V_S$ is low when the number of datapoints $n$ exceeds the number of parameters $md$ less the entropy $\ln K$ of the multitude of distinct solutions. The theory has two main implications:

1. In the "over-parameterised" regime $md \gg n$, generalisation can still occur if the number of distinct solutions $K$ is exponential in the number of parameters $md$. In practice, $\ln K$ might be increased relative to $md$ by constraining the architecture based on the symmetries of the data—e.g. using convolutions for image data.

2. All else equal, solutions with larger α-robustness may generalise better. In practice, α might be increased by regularising the training procedure (Foret et al., 2021).

Future work might investigate these ideas more thoroughly.

Conclusion

This paper has proposed the Nero optimiser based on a combined study of optimisation and neural architecture. Nero pairs two ingredients: (1) projected gradient descent over the space of balanced networks; and (2) per-neuron relative updates. Taken together, a Nero update turns each neuron through an angle set by the learning rate.

Nero was found to have strong out-of-the-box performance. In almost all the experiments in this paper—spanning GAN training, image classification, natural language processing and reinforcement learning—Nero trained well using its default hyperparameter settings. The two exceptions were the 100-layer MLP and the WMT16 En–De transformer, for which Nero required a reduced learning rate. Thus Nero has the potential to accelerate deep learning research and development, since the need for time and energy intensive hyperparameter search may be reduced.

References

Advani, M. S., Saxe, A. M., and Sompolinsky, H. High-dimensional dynamics of generalization error in neural networks. Neural Networks, 2020.
Ball, K. An elementary introduction to modern convex geometry. In MSRI Publications, 1997.
Bansal, Y., Advani, M., Cox, D., and Saxe, A. M. Minnorm training: an algorithm for training over-parameterized deep neural networks. arXiv:1806.00730, 2018.
Bernstein, J., Vahdat, A., Yue, Y., and Liu, M.-Y. On the distance between two neural networks and the stability of learning. In Neural Information Processing Systems, 2020a.
Bernstein, J., Zhao, J., Meister, M., Liu, M.-Y., Anandkumar, A., and Yue, Y. Learning compositional functions via multiplicative weight updates. In Neural Information Processing Systems, 2020b.
Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review, 2018.
Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.
Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019.
Brock, A., De, S., and Smith, S. L. Characterizing signal propagation to close the performance gap in unnormalized ResNets. In International Conference on Learning Representations, 2021.
Chistiakova, M., Bannon, N., Chen, J.-Y., Bazhenov, M., and Volgushev, M. Homeostatic role of heterosynaptic plasticity: models and experiments. Frontiers in Computational Neuroscience, 2015.
Foret, P., Kleiner, A., Mobahi, H., and Neyshabur, B. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations, 2021.
Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, 2010.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Neural Information Processing Systems, 2014.
Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677, 2017.
He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In International Conference on Computer Vision, 2015.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, 2016.
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Neural Information Processing Systems, 2017.
Huang, L., Liu, X., Liu, Y., Lang, B., and Tao, D. Centered weight normalization in accelerating training of deep neural networks. In International Conference on Computer Vision, 2017.
Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
Kostrikov, I. Pytorch implementations of reinforcement learning algorithms. github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail, 2018.
Langford, J. and Seeger, M. Bounds for averaging classifiers. Technical report, Carnegie Mellon University, 2001.
Liu, W., Zhang, Y.-M., Li, X., Yu, Z., Dai, B., Zhao, T., and Song, L. Deep hyperspherical learning. In Neural Information Processing Systems, 2017.
Liu, W., Liu, Z., Yu, Z., Dai, B., Lin, R., Wang, Y., Rehg, J. M., and Song, L. Decoupled networks. In Computer Vision and Pattern Recognition, 2018.
Liu, W., Lin, R., Liu, Z., Rehg, J. M., Xiong, L., Weller, A., and Song, L. Orthogonal over-parameterized training. arXiv:2004.04690, 2020.
McAllester, D. A. Some PAC-Bayesian theorems. In Conference on Computational Learning Theory, 1998.
Miller, K. and MacKay, D. The role of constraints in Hebbian learning. Neural Computation, 1994.
Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
Nesterov, Y. Introductory lectures on convex optimization: A basic course. In Applied Optimization, 2004.
Qiao, S., Wang, H., Liu, C., Shen, W., and Yuille, A. Micro-batch training with batch-channel normalization and weight standardization. arXiv:1903.10520, 2019.
Reddi, S. J., Kale, S., and Kumar, S. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.
Rochester, N., Holland, J., Haibt, L., and Duda, W. Tests on a cell assembly theory of the action of the brain, using a large digital computer. Information Theory, 1956.
Rosenblatt, F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 1958.
Salimans, T. and Kingma, D. P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Neural Information Processing Systems, 2016.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv:1707.06347, 2017.
Shen, Y., Wang, J., and Navlakha, S. A correspondence between normalization strategies in artificial and biological neural networks. In From Neuroscience to Artificially Intelligent Systems, 2020.
Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
Sun, R. Optimization for deep learning: theory and algorithms. arXiv:1912.08957, 2019.
Tieleman, T. and Hinton, G. E. Lecture 6.5—RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
Turrigiano, G. The self-tuning neuron: Synaptic scaling of excitatory synapses. Cell, 2008.
Valle-Perez, G., Camargo, C. Q., and Louis, A. A. Deep learning generalizes because the parameter–function map is biased towards simple functions. In International Conference on Learning Representations, 2019.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Neural Information Processing Systems, 2017.
von der Malsburg, C. Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 1973.
Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., and Recht, B. The marginal value of adaptive gradient methods in machine learning. In Neural Information Processing Systems, 2017.
You, Y., Gitman, I., and Ginsburg, B. Scaling SGD batch size to 32K for ImageNet training. Technical Report UCB/EECS-2017-156, University of California, Berkeley, 2017.
You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer, K., and Hsieh, C.-J. Large batch optimization for deep learning: Training BERT in 76 minutes. In International Conference on Learning Representations, 2020.
Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.

Appendix A: Experimental Details
All code is to be found at github.com/jxbz/nero. This appendix records important details of the implementations and their hyperparameters.
MNIST classification
These experiments used a multilayer perceptron (MLP) network. An L-layer architecture consisted of (L − 1) equal-width hidden layers followed by an output layer mapping to the 10 classes. A "scaled relu" nonlinearity was used, defined by $\varphi(x) := \sqrt{2}\cdot\max(0, x)$. The factor of $\sqrt{2}$ was motivated by Kaiming init (He et al., 2015) and was not tuned. The reparameterisation experiment used L = 5 layers and trained for 5 epochs without learning rate decay. The very deep MLP used L = 100 layers and trained for 50 epochs with the learning rate decayed by a factor of 0.9 at the end of every epoch, and with the initial learning rate tuned over a small grid of values. Training took place on an unknown Google Colab GPU. On an NVIDIA Tesla P100 GPU, the 5-layer MLP took about a minute to train and the 100-layer MLP took a matter of minutes.
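A minimal sketch of the MLP building block described above (the hidden width here is a placeholder, since the original layer dimensions are not fully legible in this copy of the paper):

    import torch
    import torch.nn as nn

    class ScaledReLU(nn.Module):
        """phi(x) = sqrt(2) * max(0, x); the sqrt(2) keeps activation variance
        roughly constant across layers, in the spirit of Kaiming init."""
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return 2.0 ** 0.5 * torch.relu(x)

    def make_mlp(n_layers: int, width: int = 784, n_classes: int = 10) -> nn.Sequential:
        # (n_layers - 1) equal-width hidden layers followed by a 10-way output layer.
        blocks = []
        in_dim = 784  # flattened 28x28 MNIST images
        for _ in range(n_layers - 1):
            blocks += [nn.Linear(in_dim, width), ScaledReLU()]
            in_dim = width
        blocks.append(nn.Linear(in_dim, n_classes))
        return nn.Sequential(*blocks)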
CIFAR-10 cGAN

Equal learning rates were used in the generator and discriminator. The initial learning rate was tuned over {0.0001, 0.001, 0.01, 0.1, 1.0} for all optimisers. The networks were trained for 120 epochs, with the learning rate decayed by a factor of 10 at epoch 100. The momentum parameter in SGD and $\beta_1$ in Adam and LAMB were tuned over {0.0, 0.9}. Nero's $\beta$ and $\beta_2$ in Adam and LAMB were set to 0.999 without tuning. Training took around 3 hours on an NVIDIA RTX 2080Ti GPU.
CIFAR-10 classification

All models were trained for 200 epochs, with 5 epochs of linear learning rate warm-up and learning rate decay by a factor of 0.2 at epochs 100, 150 and 180. The initial learning rates were tuned over {0.0001, 0.001, 0.01, 0.1, 1.0}. Training was performed on an NVIDIA RTX 2080Ti GPU. Training time for the VGG-11 network was about an hour, and for ResNet-18 a few hours.

Since the experiments in Figures 1 and 2 were intended to probe the fundamental properties of optimisers rather than their performance under a limited tuning budget, a more fine-grained learning rate search was conducted. Specifically, the learning rates were tuned over {0.01, 0.02, 0.04, 0.06, 0.08, 0.1}. The best top-1 error and the corresponding learning rate $\eta$ were recorded for each combination of optimiser and constraints (fix mean, fix norm).

ImageNet classification
For training with SGD + momentum + weight decay, the initial learning rate was set to 0.1, momentum was set to 0.9 and weight decay was set to 0.0001. These settings follow He et al. (2016). One epoch of linear learning rate warm-up was used, followed by 89 epochs of cosine annealing. The batch size was set to 400 for ResNet-50 to fit the GPU vRAM budget, and this was in the range known to yield good performance (Goyal et al., 2017). This paper's implementation surpassed the target ImageNet top-1 accuracy for ResNet-50 (Goyal et al., 2017; You et al., 2020). The training was distributed over four NVIDIA RTX 2080Ti GPUs.
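One way to realise the warm-up plus cosine annealing schedule described above (a sketch; the helper below is this rewrite's own and should be stepped once per epoch):

    import math
    import torch

    def warmup_cosine(optimiser: torch.optim.Optimizer,
                      warmup_epochs: int = 1,
                      total_epochs: int = 90) -> torch.optim.lr_scheduler.LambdaLR:
        # Linear warm-up for the first epoch(s), cosine annealing to zero afterwards.
        def factor(epoch: int) -> float:
            if epoch < warmup_epochs:
                return (epoch + 1) / warmup_epochs
            progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
            return 0.5 * (1.0 + math.cos(math.pi * progress))
        return torch.optim.lr_scheduler.LambdaLR(optimiser, lr_lambda=factor)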
Wikitext-2 language model

The small transformer model was trained for 20 epochs, with the learning rate decayed by a factor of 0.1 at epoch 10. The initial learning rate was tuned over {0.0001, 0.001, 0.01, 0.1, 1.0}. The batch size was set to 20. Training on an NVIDIA RTX 2080Ti GPU took a matter of minutes.

WMT16 En–De translation
The large transformer model was trained for 100 epochs, with a linear warm-up from epoch 0 to 50, and linear annealing from epoch 50 to 100. The maximum learning rate was tuned over {0.0001, 0.001, 0.01, 0.1, 1.0}. A batch size of 128 was used. Training took about an hour on an NVIDIA RTX 2080Ti GPU.

Reinforcement learning
Hyperparameter settings followed Kostrikov (2018), except for the initial learning rate and the total number of environment steps. The number of steps was fixed to 5 million, and the initial learning rate was tuned over {0.0001, 0.001, 0.01, 0.1, 1.0}. The policy network combined convolutional image feature extractors with dense output layers. Training was performed on an NVIDIA RTX 2080Ti GPU.