Learning by Turning: Neural Architecture Aware Optimisation
Yang Liu* [email protected]
Jeremy Bernstein* [email protected]
Markus Meister [email protected]
Yisong Yue [email protected]
*Equal contribution.
Abstract
Descent methods for deep networks are notoriously capricious: they require careful tuning of step size, momentum and weight decay, and which method will work best on a new benchmark is a priori unclear. To address this problem, this paper conducts a combined study of neural architecture and optimisation, leading to a new optimiser called Nero: the neuronal rotator. Nero trains reliably without momentum or weight decay, works in situations where Adam and SGD fail, and requires little to no learning rate tuning. Also, Nero's memory footprint is roughly the square root of that of Adam or LAMB. Nero combines two ideas: (1) projected gradient descent over the space of balanced networks; (2) neuron-specific updates, where the step size sets the angle through which each neuron's hyperplane turns. The paper concludes by discussing how this geometric connection between architecture and optimisation may impact theories of generalisation in deep learning.

Introduction

Deep learning has brought on a new paradigm in computer science, enabling artificial systems to interact with the world at an unprecedented level of complexity. That said, the core technology relies on various heuristic numerical techniques that are sometimes brittle and often require extensive tuning. A major goal of modern research in machine learning is to uncover the principles underlying learning in neural systems, and thus to derive more reliable learning algorithms.

Part of the challenge of this endeavour is that learning in deep networks is an inherently coupled problem. Suppose that training performance is sensitive to a particular detail of the neural architecture—then it is unclear whether that detail affects the expressivity of the architecture, or just the ability of the descent method to train the architecture.

This observation motivates the combined study of architecture and optimisation, and this paper explores several questions at that intersection. First of all:

⟨?⟩ What is the right domain of optimisation for a neural network's weights? Is it $\mathbb{R}^d$, or something more exotic—such as a Cartesian product of hyperspheres?

Typically, optimisation is conducted over $\mathbb{R}^d$, while a careful weight initialisation and a tuned weight decay hyperparameter impose a soft constraint on the optimisation domain. Since normalisation schemes such as batch norm (Ioffe & Szegedy, 2015) render the network invariant to the scale of the weights, weight decay also plays a somewhat subtle second role in modifying the effective learning rate. Hyperparameters with this kind of subtle coupling add to the compounding cost of hyperparameter search.

Furthermore, descent methods such as Adam (Kingma & Ba, 2015) and LAMB (You et al., 2020) use either synapse-specific or layer-specific gradient normalisation. This motivates a second question:

⟨?⟩ At what level of granularity should an optimiser work? Should normalisation occur per-synapse or per-layer—or perhaps, per-neuron?

This paper contends that in deep learning, hyperparameters proliferate because of hidden couplings between optimiser and architecture. By studying the above questions, and distilling the simple rules that govern optimisation and architecture, this paper aims to make deep learning less brittle—and less sensitive to opaque hyperparameters.
Summary of contributions:
1. A new optimiser—Nero: the neuronal rotator. Nero performs per-neuron projected gradient descent, and uses roughly the square root of the memory of Adam or LAMB.

2. Experiments across image classification, image generation, natural language processing and reinforcement learning, in which Nero's out-of-the-box configuration tends to outperform tuned baseline optimisers.

3. Discussion of how the connection between optimisation and architecture relates to generalisation theories, such as PAC-Bayes and norm-based complexity.

Related work
This section reviews relevant work pertaining to both neural architecture design and optimisation in machine learning, and concludes with a bridge to the neuroscience literature.
The importance of wiring constraints for the stable function of engineered neural systems is not a new discovery. One important concept is that of balanced excitation and inhibition. For instance, Rosenblatt (1958) found that balancing the proportion of excitatory and inhibitory synaptic connections made his perceptron more robust to varying input sizes. Another concept relates to the total magnitude of synapse strengths. For example, Rochester et al. (1956) constrained the sum of magnitudes of synapses impinging on a neuron so as to stabilise the process of learning. Similar ideas were explored by von der Malsburg (1973) and Miller & MacKay (1994). These works are early predecessors to this paper's definition of balanced networks given in Section 3.1.

Given the resurgence of neural networks over the last decade, the machine learning community has taken up the mantle of research on neural architecture design. Special weight scalings—such as Xavier init (Glorot & Bengio, 2010) and Kaiming init (He et al., 2015)—have been proposed to stabilise signal transmission through deep networks. These scalings are only imposed at initialisation and are free to wander during training—an issue which may be addressed by tuning a weight decay hyperparameter. More recent approaches—such as batch norm (Ioffe & Szegedy, 2015)—explicitly control activation statistics throughout training by adding extra normalisation layers to the network.

Other recent normalisation techniques lie closer to the work of Rosenblatt (1958) and Rochester et al. (1956). Techniques that involve constraining a neuron's weights to the unit hypersphere include: weight norm (Salimans & Kingma, 2016), decoupled networks (Liu et al., 2017, 2018) and orthogonal parameterised training (Liu et al., 2020). Techniques that also balance excitation and inhibition include centred weight norm (Huang et al., 2017) and weight standardisation (Qiao et al., 2019).
Much classic work in optimisation theory focuses on deriving convergence results for descent methods under assumptions such as convexity (Boyd & Vandenberghe, 2004) and Lipschitz continuity of the gradient (Nesterov, 2004). These simplifying assumptions are often used in the machine learning literature. For instance, Bottou et al. (2018) provide convergence guarantees for stochastic gradient descent (SGD) under each of these assumptions. However, these assumptions do not hold in deep learning (Sun, 2019).

On a related note, SGD is not the algorithm of choice in many deep learning applications, and heuristic methods such as RMSprop (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015) often work better. For instance, Adam often works much better than SGD for training generative adversarial networks (Bernstein et al., 2020a). Yet the theory behind Adam is poorly understood (Reddi et al., 2018).

A more recent line of work has explored optimisation methods that make relative updates to neural network parameters. Optimisers like LARS (You et al., 2017), LAMB (You et al., 2020) and Fromage (Bernstein et al., 2020a) make per-layer relative updates, while Madam (Bernstein et al., 2020b) makes per-synapse relative updates. You et al. (2017) found that these methods stabilise large batch training, while Bernstein et al. (2020a) found that they require little to no learning rate tuning across tasks.

Though these recent methods partially account for the neural architecture—by paying attention to its layered operator structure—they do not rigorously address the optimisation domain. As such, LARS and LAMB require a tunable weight decay hyperparameter, while Fromage and Madam restrict the optimisation to a bounded set of tunable size (i.e. weight clipping). Without this additional tuning, these methods can be unstable—see for instance (Bernstein et al., 2020a, Figure 2) and (Bernstein et al., 2020b, Figure 3).

The discussion in the previous paragraph typifies the machine learning state of the art: optimisation techniques that work well, albeit only after hyperparameter tuning. For instance, LAMB is arguably the state-of-the-art relative optimiser, but it contains in total five tunable hyperparameters. Since—at least naïvely—the cost of hyperparameter search is exponential in the number of hyperparameters, the prospect of fully tuning LAMB is computationally daunting.
Since the brain is a system that must learn stably without hyperparameter do-overs, it is worth looking to neuroscience for inspiration on designing better learning algorithms.

A major swathe of neuroscience research studies mechanisms by which the brain performs homeostatic control. For instance, neuroscientists report a form of homeostasis termed synaptic scaling, where a neuron modulates the strengths of all its synapses to stabilise its firing rate (Turrigiano, 2008). More generally, heterosynaptic plasticity refers to homeostatic mechanisms that modulate the strength of unstimulated synapses (Chistiakova et al., 2015). Shen et al. (2020) review connections to normalisation methods used in machine learning.

These observations inspired this paper to consider implementing homeostatic control via projected gradient descent—leading to the Nero optimiser.
Background Theory
In general, an L-layer neural network $f(\cdot)$ is a composition of L simpler functions $f_1(\cdot), \dots, f_L(\cdot)$:

$f(x) = f_L \circ f_{L-1} \circ \dots \circ f_1(x).$    (forward pass)

Due to this compositionality, any slight ill-conditioning in the simple functions $f_i(\cdot)$ has the potential to compound over layers, making the overall network $f(\cdot)$ very ill-conditioned. Architecture design should aim to prevent this from happening, as will be covered in Section 3.1.

The Jacobian $\partial f / \partial f_l$, which plays a key role in evaluating gradients, also takes the form of a deep product:

$\dfrac{\partial f}{\partial f_l} = \dfrac{\partial f_L}{\partial f_{L-1}} \cdot \dfrac{\partial f_{L-1}}{\partial f_{L-2}} \cdot \dots \cdot \dfrac{\partial f_{l+1}}{\partial f_l}.$    (backward pass)

Therefore, it is also important from the perspective of gradient-based optimisation that compositionality is adequately addressed, as will be covered in Section 3.2.
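To make the compounding effect concrete, here is a small numerical sketch (illustrative only; the width, depth and random-matrix model are arbitrary choices, not taken from the paper) in which each layer's Jacobian is only mildly ill-conditioned on its own, yet the conditioning of the end-to-end product degrades rapidly with depth:

    import numpy as np

    rng = np.random.default_rng(0)
    width, depth = 64, 20

    # Each layer's Jacobian is a random matrix scaled (as in Kaiming/Xavier init)
    # so that it roughly preserves the size of signals passing through it.
    jacobians = [rng.standard_normal((width, width)) / np.sqrt(width) for _ in range(depth)]
    print(f"condition number of a single layer: {np.linalg.cond(jacobians[0]):.1e}")

    J = np.eye(width)
    for l, A in enumerate(jacobians, start=1):
        J = A @ J  # the backward pass multiplies the per-layer Jacobians together
        if l % 5 == 0:
            print(f"condition number after {l:2d} layers: {np.linalg.cond(J):.1e}")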
A common strategy to mitigate the issue of compounding ill-conditioning is to explicitly re-normalise the activations at every network layer. Batch norm (Ioffe & Szegedy, 2015) exemplifies this strategy, and was found to improve the trainability of deep residual networks. Batch norm works by standardising the activations across a batch of inputs at each network layer—that is, it shifts and scales the activations to have mean zero and variance one across a batch.

Although batch norm works well, it adds computational overhead to both the forward and backward pass. To explore how far one can get without explicit re-normalisation, the following definitions are useful:

Definition 1.
A neuron is balanced if its weight vector $w \in \mathbb{R}^d$ satisfies the following constraints:

$\sum_{i=1}^{d} w_i = 0$    (balanced excitation & inhibition)

$\sum_{i=1}^{d} w_i^2 = 1$.    ($\ell_2$ constant sum rule)
Definition 2. A network is balanced if all its constituent neurons are balanced.

As noted by Huang et al. (2017), balanced neurons attain some of the properties of batch norm for free. To see this, consider a linear neuron $y = \sum_i w_i x_i$ with inputs $x_i$ that are uncorrelated with mean $\mu$ and variance $1$. Then the output $y$ is standardised:

$\mathbb{E}[y] = \sum_i w_i\, \mathbb{E}[x_i] = \mu \sum_i w_i = 0$;

$\mathrm{Var}[y] = \sum_i w_i^2\, \mathrm{Var}[x_i] = \sum_i w_i^2 = 1$.

While the assumptions on the inputs $x_i$ are unlikely to hold exactly, under more general conditions the constraints may at least encourage the standardisation of activation statistics through the layers of the network (Brock et al., 2021).
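As a quick numerical check of this calculation (not from the paper; the fan-in, input mean and sample count below are arbitrary), one can project a random weight vector onto the constraint set of Definition 1 and verify that the output of the resulting linear neuron is approximately standardised:

    import numpy as np

    rng = np.random.default_rng(0)
    d, n_samples, mu = 256, 100_000, 3.0

    # Project random weights onto the balanced constraint set:
    # zero mean (balanced excitation & inhibition) and unit l2 norm.
    w = rng.standard_normal(d)
    w = w - w.mean()
    w = w / np.linalg.norm(w)

    # Uncorrelated inputs with mean mu and unit variance.
    x = mu + rng.standard_normal((n_samples, d))
    y = x @ w

    print(f"E[y]   ~ {y.mean():+.3f}  (balanced constraints give 0)")
    print(f"Var[y] ~ {y.var():.3f}   (balanced constraints give 1)")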
Since a network is trained via perturbations to its parameters, it is important to know what size perturbations are appropriate. Consider an L-layer network with weight matrices $W = (W_1, W_2, \dots, W_L)$ and loss function $\mathcal{L}(W)$. For a perturbation $\Delta W = (\Delta W_1, \Delta W_2, \dots, \Delta W_L)$, the following definition establishes a notion of stable step size:

Definition 3. Let $\theta_l$ denote the angle between $\Delta W_l$ and $-\nabla_{W_l}\mathcal{L}(W)$. A descent step is stable if for all $l = 1, \dots, L$:

$\dfrac{\|\nabla_{W_l}\mathcal{L}(W + \Delta W) - \nabla_{W_l}\mathcal{L}(W)\|_F}{\|\nabla_{W_l}\mathcal{L}(W)\|_F} < \cos\theta_l.$    (1)

Or in words: for each layer, the relative change in gradient induced by the perturbation should not exceed the cosine of the angle between the perturbation and the negative gradient.

This definition is useful because Bernstein et al. (2020a) proved that a stable descent step is guaranteed to decrease a continuously differentiable loss function $\mathcal{L}(W)$. Since Inequality 1 is of little use without a model of its left-hand side, Bernstein et al. (2020a) proposed the following model:
Definition 4. The loss function obeys deep relative trust if for all perturbations $\Delta W = (\Delta W_1, \Delta W_2, \dots, \Delta W_L)$:

$\dfrac{\|\nabla_{W_l}\mathcal{L}(W + \Delta W) - \nabla_{W_l}\mathcal{L}(W)\|_F}{\|\nabla_{W_l}\mathcal{L}(W)\|_F} \leq \prod_{k=1}^{L}\left(1 + \dfrac{\|\Delta W_k\|_F}{\|W_k\|_F}\right) - 1.$

While deep relative trust is based on a perturbation analysis of L-layer perceptrons (Bernstein et al., 2020a, Theorem 1), the key idea is that its product structure explicitly models the product structure of the network's backward pass.

The deep relative trust model suggests that a stable descent step should involve small relative perturbations per layer. This motivates the layer-wise family of descent methods (You et al., 2017, 2020).
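As a concrete schematic of such a per-layer relative update (this is a simplified form; LARS and LAMB add further ingredients such as momentum and clipping of the trust ratio):

    \Delta W_l \;=\; -\,\eta\,\frac{\|W_l\|_F}{\|\nabla_{W_l}\mathcal{L}(W)\|_F}\,\nabla_{W_l}\mathcal{L}(W),
    \qquad\text{so that}\qquad
    \frac{\|\Delta W_l\|_F}{\|W_l\|_F} \;=\; \eta .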
Still, it is unclear whether layers are the right base object to consider. Perhaps a more refined analysis would replace the layers appearing in Definition 4 with individual neurons or even synapses. Small relative perturbations per-synapse were explored by Bernstein et al. (2020b) and found to slightly degrade training performance compared to Adam and SGD. But this paper will explore the per-neuron middle ground:

Definition 5. A step of size $\eta > 0$ is said to be per-neuron relative if for any neuron with weights $w \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$, the perturbations $\Delta w \in \mathbb{R}^d$ and $\Delta b \in \mathbb{R}$ satisfy:

$\|\Delta w\| / \|w\| \lesssim \eta \quad \text{and} \quad |\Delta b| / |b| \lesssim \eta.$

A per-neuron relative update is automatically per-layer relative. To see this, consider a weight matrix $W$ whose $N$ rows correspond to $N$ neurons $w^{(1)}, \dots, w^{(N)}$. Then:

$\dfrac{\|\Delta W\|_F}{\|W\|_F} = \sqrt{\dfrac{\sum_{i=1}^{N}\|\Delta w^{(i)}\|^2}{\sum_{i=1}^{N}\|w^{(i)}\|^2}} \lesssim \sqrt{\dfrac{\sum_{i=1}^{N}\eta^2\|w^{(i)}\|^2}{\sum_{i=1}^{N}\|w^{(i)}\|^2}} = \eta.$    (2)
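Equation 2 can also be checked numerically. In the sketch below (the matrix shape and step size are arbitrary), each row of a weight matrix receives a perturbation of relative size exactly η, and the relative Frobenius-norm change of the whole matrix is confirmed to be at most η:

    import numpy as np

    rng = np.random.default_rng(1)
    n_neurons, fan_in, eta = 128, 256, 0.01

    W = rng.standard_normal((n_neurons, fan_in))

    # Give each neuron (row) its own random perturbation of relative size eta.
    dW = rng.standard_normal((n_neurons, fan_in))
    row_scale = eta * np.linalg.norm(W, axis=1) / np.linalg.norm(dW, axis=1)
    dW = dW * row_scale[:, None]

    per_layer = np.linalg.norm(dW) / np.linalg.norm(W)
    print(f"per-layer relative update size = {per_layer:.4f} (bounded by eta = {eta})")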
Nero: the Neuronal Rotator

Following the discussion in Section 3, this paper will consider an optimisation algorithm that makes per-neuron relative updates (Definition 5) constrained to the space of balanced networks (Definition 2).

Since a balanced neuron is constrained to the unit hypersphere, a per-neuron relative update with step size $\eta$ corresponds to a pure rotation of the neuron's weight vector by angle $\approx \eta$. To see this, consider the isosceles triangle formed by $w$ and $w + \Delta w$, with $\|w\| = \|w + \Delta w\| = 1$ and $\|\Delta w\| = \eta$, and let $\theta$ denote the rotation angle between the two unit vectors; then take $\eta$ small.
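For completeness, the small-angle calculation is the following (standard trigonometry, spelled out here rather than quoted from the paper):

    \|\Delta w\| \;=\; 2\sin\tfrac{\theta}{2}
    \quad\Longrightarrow\quad
    \theta \;=\; 2\arcsin\tfrac{\eta}{2} \;=\; \eta + \tfrac{\eta^{3}}{24} + O(\eta^{5}) \;\approx\; \eta
    \quad\text{for small } \eta .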
Nero : the neuronal rotator.Nero’s goal is to reduce the burden of hyperparametertuning by baking architectural information into the opti-miser. More concretely, the anticipated advantages are asfollows:1. Since per-neuron relative updates are automaticallyper-layer relative by Equation 2, they should inheritthe properties of per-layer updates—in particular,stability across batch sizes (You et al., 2017) whileneeding little to no learning rate tuning (Bernsteinet al., 2020a).2. Since balanced networks place hard constraints onthe norm of a neuron’s weights, the need for initiali-sation tuning and weight decay should be removed.3. Gradients are often normalised by running averages,in order to retain relative scale information betweensuccessive minibatch gradients (Tieleman & Hinton,2012). Along with momentum, this is the mainmemory overhead of Adam and LAMB compared tovanilla SGD. Per-neuron running averages consume „ square root the memory of per-synapse runningaverages.4. Since normalisation is local to a neuron, no com-munication is needed between neurons in a layer(unlike for per-layer updates). This makes the op-timiser more distributable—for example, a singlelayer can be split across multiple compute deviceswithout fuss. For the same reason, the Nero updateis biologically plausible.There is one significant difference between Nero andprior work on balanced networks. In centred weight norm(Huang et al., 2017) and weight standardisation (Qiaoet al., 2019), a neuron’s underlying weight representationis an unnormalised vector r w P R d —which is normalised byincluding the following reparameterisation in the neuralarchitecture: normalise p r w q : “ r w ´ T r w ¨ { d } r w ´ T r w ¨ { d } , (3)where denotes the vector of 1s. Algorithm 1
Since the target of automatic differentiation is still the unnormalised vector $\tilde{w}$, overhead is incurred in both the forward and backward pass. Moreover, there is a subtle coupling between the step size in additive optimisers like Adam and the scale of the unnormalised weights $\tilde{w}$ (see Section 5.3).

In contrast, Nero opts to implement balanced networks via projected gradient descent. This is lighter-weight than Equation 3, since duplicate copies of the weights are not needed and the network's backward pass does not involve extra operations. Furthermore, Nero can be used as a drop-in replacement for optimisers like Adam, SGD or LAMB, without the user needing to manually modify the network architecture via the reparameterisation in Equation 3.

Pseudocode for Nero is provided in Algorithm 1. For brevity, the Adam-style bias correction of the running averages is omitted from the pseudocode. But in the Pytorch implementation used in this paper's experiments, the running averages $\bar{g}_w$ and $\bar{g}_b$ are divided by a factor of $\sqrt{1 - \beta^t}$ before the $t$-th update. This corrects for the warmup bias stemming from $\bar{g}_w$ and $\bar{g}_b$ being initialised to zero (Kingma & Ba, 2015).

While the pseudocode in Algorithm 1 is presented for neurons and biases, in the Pytorch implementation the bias update is applied to any parameters lacking a notion of fan-in—including batch norm gains and biases. A typical initialisation scale is $\sigma_b = 1$ for gains, with a smaller value for biases; the Pytorch implementation of Nero used a small nonzero $\sigma_b$ for any bias parameter initialised to zero.

Algorithm 1: Nero optimiser. "Out-of-the-box" hyperparameter defaults are $\eta = 0.01$ and $\beta = 0.999$. The constant $\sigma_b \in \mathbb{R}_+$ refers to the initialisation scale of the biases.

    Input: step size η ∈ (0, 1], averaging constant β ∈ [0, 1)
    repeat
        for each neuron do
            ▹ get weight gradient g_w ∈ ℝⁿ and bias gradient g_b ∈ ℝ
            ▹ update running averages
            ḡ_w² ← β · ḡ_w² + (1 − β) · ‖g_w‖²
            ḡ_b² ← β · ḡ_b² + (1 − β) · g_b²
            ▹ update weights w ∈ ℝⁿ and bias b ∈ ℝ
            w ← w − η · (‖w‖ / ḡ_w) · g_w
            b ← b − η · (σ_b / ḡ_b) · g_b
            ▹ project weights back to the constraint set
            w ← w − (1/n) Σᵢ wᵢ
            w ← w / ‖w‖
        end for
    until converged
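To illustrate how Algorithm 1 maps onto a standard optimiser interface, here is a minimal PyTorch sketch of the update for 2D weight matrices (one neuron per row). It is a simplified reading of Algorithm 1, not the official implementation at github.com/jxbz/nero, which should be consulted for the exact treatment of biases, gains and bias correction:

    import torch

    class NeroSketch(torch.optim.Optimizer):
        """Minimal sketch of Algorithm 1 for 2D weight matrices (one neuron
        per row). Bias parameters, bias correction of the running averages
        and parameters without a fan-in dimension are ignored for brevity."""

        def __init__(self, params, lr=0.01, beta=0.999, eps=1e-8):
            super().__init__(params, dict(lr=lr, beta=beta, eps=eps))

        @torch.no_grad()
        def step(self, closure=None):
            for group in self.param_groups:
                lr, beta, eps = group["lr"], group["beta"], group["eps"]
                for p in group["params"]:
                    if p.grad is None or p.dim() != 2:
                        continue
                    state = self.state[p]
                    if len(state) == 0:
                        # Project the initial weights onto the balanced constraint set.
                        p.sub_(p.mean(dim=1, keepdim=True))
                        p.div_(p.norm(dim=1, keepdim=True) + eps)
                        state["avg_sq"] = torch.zeros(p.shape[0], 1, device=p.device, dtype=p.dtype)
                    g = p.grad
                    # Per-neuron running average of the squared gradient norm.
                    state["avg_sq"].mul_(beta).add_((1 - beta) * g.norm(dim=1, keepdim=True) ** 2)
                    scale = p.norm(dim=1, keepdim=True) / (state["avg_sq"].sqrt() + eps)
                    # Per-neuron relative update, then projection back to the constraint set.
                    p.sub_(lr * scale * g)
                    p.sub_(p.mean(dim=1, keepdim=True))
                    p.div_(p.norm(dim=1, keepdim=True) + eps)

Used this way, the sketch is a drop-in replacement, e.g. optimiser = NeroSketch(model.parameters(), lr=0.01).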
Experiments

This section begins with targeted experiments intended to demonstrate Nero's key properties. Then, in Section 5.6, Nero is benchmarked across a range of popular tasks. In all figures, the mean and range are plotted over three repeats. For Nero, out-of-the-box refers to setting $\eta = 0.01$ and $\beta = 0.999$. More experimental details are given in Appendix A.

To verify that projecting to the space of balanced networks improves the performance of Nero, an ablation experiment was conducted. As can be seen in Figure 1, when training a VGG-11 image classifier on the CIFAR-10 dataset, Nero performed best with both constraints switched on.
Since Bernstein et al. (2020b) found that per-synapse relative updates led to slightly degraded performance, while per-layer relative updates typically perform well (You et al., 2017, 2020; Bernstein et al., 2020a), this section compares per-synapse, per-neuron and per-layer relative updates. In particular, Nero is compared to Madam (per-synapse relative) and LAMB (per-layer relative).

A VGG-11 model was trained on the CIFAR-10 dataset. Without constraints, the three optimisers achieved similar top-1 validation error (Figure 2, top). Constraining to the space of balanced networks (Definition 2) improved both Nero and LAMB, but did not have a significant effect on Madam (Figure 2, bottom). In both configurations, Nero outperformed Madam and LAMB, demonstrating the viability of per-neuron relative updates.

Existing implementations of balanced networks (Definition 2) work via the reparameterisation given in Equation 3 (Huang et al., 2017; Qiao et al., 2019). This leads to an undesired coupling between the learning rate in optimisers like Adam and the scale of the unnormalised $\tilde{w}$ parameters. To verify this, a network with weights normalised by Equation 3 was trained to classify the MNIST dataset. The initial weights $\tilde{w}$ were drawn from $\mathcal{N}(0, \sigma^2)$, and the experiment was repeated for $\sigma = 1$ and $\sigma = 100$. The Adam optimiser was used for training with a fixed learning rate. As can be seen in Figure 3 (left), the training performance was sensitive to the weight scale $\sigma$, despite the fact that a weight normalisation scheme was being used.

The unnecessary scale freedom of reparameterisation can lead to other undesired consequences, such as numerical overflow. Nero completely eliminates this issue by implementing balanced networks via projected gradient descent.
Figure 1.
Ablating the balanced network constraints. A VGG-11 network was trained on CIFAR-10. The legend denotes which of Nero's constraints were active. Mean refers to balanced excitation & inhibition, while norm refers to the $\ell_2$ constant sum rule.
Figure 2.
Comparing per-synapse (Madam), per-neuron (Nero) and per-layer (LAMB) relative updates. A VGG-11 network was trained to classify CIFAR-10. Top: all optimisers without balanced network constraints. Bottom: all optimisers with constraints.
Figure 3.
Left: Training a 5-layer perceptron normalised via reparameterisation (Equation 3) on MNIST. For a fixed Adam learning rate, training is sensitive to the scale $\sigma$ of the raw weights $\tilde{w}$. This motivates the different approach taken by Nero. Right: Using Nero to train a 100-layer perceptron—without batch norm or skip connections—to classify MNIST.

5.4 Nero Trains Deeper Networks

Very deep networks are typically difficult to train without architectural modifications such as residual connections (He et al., 2016) or batch norm (Ioffe & Szegedy, 2015). To test whether Nero enables training very deep models without such modifications, Figure 3 (right) shows the results of training a very deep multilayer perceptron (MLP) on the MNIST dataset. Unlike SGD, Adam and LAMB, Nero could reliably train a 100-layer MLP.
This section compares Nero out-of-the-box to an SGD implementation with tuned learning rate, weight decay and momentum. The comparison was made for training a ResNet-50 image classifier on the ImageNet dataset. As can be seen in Figure 4, SGD with tuned learning rate, momentum, and weight decay outperformed Nero. However, the optimal set of SGD hyperparameters was brittle, and ablating weight decay alone increased the top-1 validation error by 5%.
This section probes the versatility and robustness of Nero by comparing its optimisation and generalisation performance with three popular alternatives—SGD, Adam, and LAMB—across six tasks. The tasks span the domains of computer vision, natural language processing, and reinforcement learning. A wide spectrum of neural architectures was tested—from convolutional networks to transformers.

To make a fair comparison between optimisers, a fair hyperparameter tuning strategy is needed. In this section:

1. Learning rates were tuned over {0.0001, 0.001, 0.01, 0.1, 1.0}.

2. For Adam, LAMB and SGD, the momentum hyperparameter was tuned to achieve good performance on the most complicated benchmark—cGAN training—and then fixed across the rest of the benchmarks. In each case, the best momentum value for cGAN was 0.

3. $\beta$ in Nero and $\beta_2$ in Adam and LAMB were fixed to 0.999 across all experiments, as recommended by Kingma & Ba (2015) and You et al. (2020).

4. Weight decay was not used in any of the experiments.

The results are collated in Table 1. Nero achieved the best validation performance in every experiment—while the runner-up varied across tasks. What's more, the same learning rate of $\eta = 0.01$ was optimal for Nero in five out of six experiments. This means that Nero has strong out-of-the-box performance, since Nero's only other hyperparameter was fixed to $\beta = 0.999$ across all experiments.

The remainder of this section discusses each experiment in turn. Implementation details are given in Appendix A.
Figure 4.
Training a ResNet-50 network to classify the ImageNet dataset. Nero uses its out-of-the-box default hyperparameters $\eta = 0.01$ and $\beta = 0.999$. SGD+wd uses initial learning rate 0.1, momentum 0.9 and weight decay (wd) 0.0001 as tuned by He et al. (2016). SGD is also shown without weight decay.
Figure 5.
Class-conditional GAN training on CIFAR-10. Equal learning rates were used in the generator and discriminator. The Fréchet Inception Distance (Heusel et al., 2017, FID) measures the distance between the sample statistics of real and fake data as represented at a deep layer of a pre-trained image classifier.
Figure 6.
CIFAR-10 classification. Top: performance of a vanilla, convolutional VGG-11 network. Bottom: performance of a batch-normalised, residual ResNet-18 network.

Task | Dataset | Model | Metric
cGAN | CIFAR-10 | BigGAN-like | FID (↓)
classification | CIFAR-10 | VGG-11 | top-1 error (↓)
classification | CIFAR-10 | ResNet-18 | top-1 error (↓)
language modelling | Wikitext-2 | transformer (19 tensors) | perplexity (↓)
translation | WMT16 En–De | transformer (121 tensors) | perplexity (↓)
reinforcement learning | Atari Pong | PPO policy network | reward (↑)

Table 1. Validation results for the best learning rate $\eta$. The best result is shown in bold, while the runner-up is underlined.

Image synthesis with cGAN
Generative Adversarial Network (Goodfellow et al., 2014, GAN) training is perhaps the most challenging optimisation problem tackled in this paper. Good performance has traditionally relied on extensive tuning: different learning rates are often used in the generator and discriminator (Heusel et al., 2017) and training is highly sensitive to momentum (Brock et al., 2019, p. 35). The class-conditional GAN model in this paper is based on the BigGAN architecture (Brock et al., 2019). This is a heterogeneous network involving a variety of building blocks: convolutions, embeddings, fully connected layers, attention layers, conditional batch norm and spectral norm (Miyato et al., 2018). The results are presented in Figure 5.
Image classification
In Section 5.5, Nero out-of-the-box was shown to outperform SGD without weight decay when training ResNet-50 on ImageNet. Due to limited computational resources, the authors of this paper were unable to run the LAMB and Adam baselines on ImageNet. Experiments were run across all baselines on the smaller CIFAR-10 dataset instead. The networks used were the vanilla, convolutional VGG-11 network (Simonyan & Zisserman, 2015) and the batch-normalised, residual ResNet-18 network (He et al., 2016). The results are presented in Figure 6.
Natural language processing
Much recent progress in natural language processing is based on the transformer architecture (Vaswani et al., 2017). Transformers process information via layered, all-to-all comparisons—without recourse to recurrence or convolution. This paper experimented with a smaller transformer (19 tensors) trained on the Wikitext-2 dataset, and a larger transformer (121 tensors) trained on WMT2016 English–German translation. The results are presented in Figures 7 and 8.
Reinforcement learning
Many reinforcement learning algorithms use neural networks to perform function approximation. Proximal Policy Optimization (Schulman et al., 2017, PPO) is one example, and PPO has gained increasing popularity for its simplicity, scalability, and robust performance. This paper experimented with PPO on the Atari Pong video game. The results are presented in Figure 9.

While LAMB failed to train on this task, further investigation revealed that setting LAMB's momentum hyperparameter to 0.9 enabled LAMB to learn. This demonstrates LAMB's sensitivity to hyperparameters.
Figure 7.
Training a language model on the Wikitext-2 dataset. A small transformer network was used, composed of 19 tensors. Nero achieved the best anytime performance.
Figure 8.
Training an English–German translation model on WMT16. A larger transformer network was used, composed of 121 tensors. The optimisers with gradient normalisation—Nero, Adam, and LAMB—performed best in training this model. Training with SGD was unstable and led to significantly worse perplexity.
Figure 9.
Training a policy network to play Pong. Proximal Policy Optimisation (PPO) was used. Pong's reward is bounded between ±21. While investigating LAMB's failure to train the policy network, it was discovered that adjusting the $\beta_1$ momentum hyperparameter from 0 to 0.9 improved LAMB's performance.

Discussion: Rotation and Generalisation
The results in this paper may have a bearing on the generalisation theory of neural systems—an area of research that is still not settled. Consider the following hypothesis:
Hypothesis 1.
Deep learning generalises because SGD is biased towards solutions with small norm.
This hypothesis is well-known, and is alluded to or mentioned explicitly in many papers (Wilson et al., 2017; Zhang et al., 2017; Bansal et al., 2018; Advani et al., 2020).

But in light of the results in Table 1, Hypothesis 1 encounters some basic problems. First, for some tasks—such as the GAN and translation experiments—SGD simply performs very poorly. And second, Nero is able to find generalising solutions even when the norm of the network is constrained. For instance, the VGG-11 network and the Wikitext-2 transformer model have no gain parameters, so, under Nero, the norm of the weights (though not the biases) is fixed and cannot be "adapting to the data complexity".

Then it seems right to consider an alternative theory:
Hypothesis 2.
Deep learning generalises because the space of networks that fit the training data has large measure.
This hypothesis is essentially the PAC-Bayesian generalisation theory (McAllester, 1998; Langford & Seeger, 2001) applied to deep learning. Valle-Perez et al. (2019) have developed this line of work, proving the following result:
Theorem 1 (Realisable PAC-Bayes). First, fix a probability measure $P$ over the weight space $\Omega$ of a classifier. Let $S$ denote a training set of $n$ iid datapoints and let $V_S \subset \Omega$ denote the version space—that is, the subset of classifiers that fit the training data. Consider the population error rate $0 \leq \varepsilon(w) \leq 1$ of weight setting $w \in \Omega$, and its average over the version space $\varepsilon(V_S) := \mathbb{E}_{w \sim P}[\varepsilon(w) \mid w \in V_S]$. Then, for a proportion $1 - \delta$ of random draws of the training set $S$,

$\varepsilon(V_S) \leq \ln\dfrac{1}{1 - \varepsilon(V_S)} \leq \dfrac{\ln\frac{1}{P[V_S]} + \ln\frac{2n}{\delta}}{n - 1}.$    (4)

The intuition is that for a larger measure of solutions $P[V_S]$, less information needs to be extracted from the training data to find just one solution, thus memorisation is less likely.

A simple formula for $P[V_S]$ is possible based on this paper's connection between optimisation and architecture, since the problem is reduced to hyperspherical geometry. Consider a balanced network (Definition 2) composed of $m$ neurons each with fan-in $d$. Then the optimisation domain is isomorphic to the Cartesian product of $m$ hyperspheres:

$\Omega \cong S^{d-1} \times \dots \times S^{d-1}$ ($m$ times),

while $P$ can be fixed to the uniform distribution on $\Omega$. By restricting deep relative trust (Definition 4) or its antecedent (Bernstein et al., 2020a, Theorem 1) to balanced networks, the following definition becomes natural:
Definition 6. A solution that attains zero training error is α-robust if all neurons may be simultaneously and arbitrarily rotated by up to angle α without inducing an error.

Geometrically, an α-robust solution is the product of $m$ hyperspherical caps. If the version space consists of $K$ non-intersecting α-robust solutions, then its measure is:

$P[V_S] = K \cdot P[\mathrm{cap}_{d-1}(\alpha)]^m \geq \dfrac{K}{2^m}\sin^{m(d-1)}\alpha,$    (5)

where $\mathrm{cap}_{d-1}(\alpha)$ denotes an α-cap of $S^{d-1}$, and the inequality follows from (Ball, 1997, Lemma 2.3). Combining Inequality 5 with Inequality 4 yields the following generalisation bound for neural networks:

$\varepsilon(V_S) \leq \dfrac{m\ln 2 + m(d-1)\ln\frac{1}{\sin\alpha} + \ln\frac{2n}{\delta} - \ln K}{n - 1}.$

Focusing on the dominant terms, the bound suggests that the average test error $\varepsilon(V_S)$ over the space of solutions $V_S$ is low when the number of datapoints $n$ exceeds the number of parameters $md$ less the entropy $\ln K$ of the multitude of distinct solutions. The theory has two main implications:

1. In the "over-parameterised" regime $md \gg n$, generalisation can still occur if the number of distinct solutions $K$ is exponential in the number of parameters $md$. In practice, $\ln K$ might be increased relative to $md$ by constraining the architecture based on the symmetries of the data—e.g. using convolutions for image data.

2. All else equal, solutions with larger α-robustness may generalise better. In practice, α might be increased by regularising the training procedure (Foret et al., 2021).

Future work might investigate these ideas more thoroughly.

Conclusion

This paper has proposed the Nero optimiser based on a combined study of optimisation and neural architecture. Nero pairs two ingredients: (1) projected gradient descent over the space of balanced networks; and (2) per-neuron relative updates. Taken together, a Nero update turns each neuron through an angle set by the learning rate.

Nero was found to have strong out-of-the-box performance. In almost all the experiments in this paper—spanning GAN training, image classification, natural language processing and reinforcement learning—Nero trained well using its default hyperparameter settings. The two exceptions were the 100-layer MLP and the WMT16 En–De transformer, for which Nero required a reduced learning rate. Thus Nero has the potential to accelerate deep learning research and development, since the need for time and energy intensive hyperparameter search may be reduced.

References

Advani, M. S., Saxe, A. M., and Sompolinsky, H. High-dimensional dynamics of generalization error in neural networks. Neural Networks, 2020.
Ball, K. An elementary introduction to modern convex geometry. In MSRI Publications, 1997.
Bansal, Y., Advani, M., Cox, D., and Saxe, A. M. Minnorm training: an algorithm for training over-parameterized deep neural networks. arXiv:1806.00730, 2018.
Bernstein, J., Vahdat, A., Yue, Y., and Liu, M.-Y. On the distance between two neural networks and the stability of learning. In Neural Information Processing Systems, 2020a.
Bernstein, J., Zhao, J., Meister, M., Liu, M.-Y., Anandkumar, A., and Yue, Y. Learning compositional functions via multiplicative weight updates. In Neural Information Processing Systems, 2020b.
Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review, 2018.
Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.
Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019.
Brock, A., De, S., and Smith, S. L. Characterizing signal propagation to close the performance gap in unnormalized ResNets. In International Conference on Learning Representations, 2021.
Chistiakova, M., Bannon, N., Chen, J.-Y., Bazhenov, M., and Volgushev, M. Homeostatic role of heterosynaptic plasticity: models and experiments. Frontiers in Computational Neuroscience, 2015.
Foret, P., Kleiner, A., Mobahi, H., and Neyshabur, B. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations, 2021.
Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, 2010.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Neural Information Processing Systems, 2014.
Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677, 2017.
He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In International Conference on Computer Vision, 2015.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, 2016.
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Neural Information Processing Systems, 2017.
Huang, L., Liu, X., Liu, Y., Lang, B., and Tao, D. Centered weight normalization in accelerating training of deep neural networks. In International Conference on Computer Vision, 2017.
Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
Kostrikov, I. Pytorch implementations of reinforcement learning algorithms. github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail, 2018.
Langford, J. and Seeger, M. Bounds for averaging classifiers. Technical report, Carnegie Mellon University, 2001.
Liu, W., Zhang, Y.-M., Li, X., Yu, Z., Dai, B., Zhao, T., and Song, L. Deep hyperspherical learning. In Neural Information Processing Systems, 2017.
Liu, W., Liu, Z., Yu, Z., Dai, B., Lin, R., Wang, Y., Rehg, J. M., and Song, L. Decoupled networks. In Computer Vision and Pattern Recognition, 2018.
Liu, W., Lin, R., Liu, Z., Rehg, J. M., Xiong, L., Weller, A., and Song, L. Orthogonal over-parameterized training. arXiv:2004.04690, 2020.
McAllester, D. A. Some PAC-Bayesian theorems. In Conference on Computational Learning Theory, 1998.
Miller, K. and MacKay, D. The role of constraints in Hebbian learning. Neural Computation, 1994.
Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
Nesterov, Y. Introductory lectures on convex optimization: A basic course. In Applied Optimization, 2004.
Qiao, S., Wang, H., Liu, C., Shen, W., and Yuille, A. Micro-batch training with batch-channel normalization and weight standardization. arXiv:1903.10520, 2019.
Reddi, S. J., Kale, S., and Kumar, S. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.
Rochester, N., Holland, J., Haibt, L., and Duda, W. Tests on a cell assembly theory of the action of the brain, using a large digital computer. Information Theory, 1956.
Rosenblatt, F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 1958.
Salimans, T. and Kingma, D. P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Neural Information Processing Systems, 2016.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv:1707.06347, 2017.
Shen, Y., Wang, J., and Navlakha, S. A correspondence between normalization strategies in artificial and biological neural networks. In From Neuroscience to Artificially Intelligent Systems, 2020.
Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
Sun, R. Optimization for deep learning: theory and algorithms. arXiv:1912.08957, 2019.
Tieleman, T. and Hinton, G. E. Lecture 6.5—RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
Turrigiano, G. The self-tuning neuron: Synaptic scaling of excitatory synapses. Cell, 2008.
Valle-Perez, G., Camargo, C. Q., and Louis, A. A. Deep learning generalizes because the parameter–function map is biased towards simple functions. In International Conference on Learning Representations, 2019.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Neural Information Processing Systems, 2017.
von der Malsburg, C. Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 1973.
Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., and Recht, B. The marginal value of adaptive gradient methods in machine learning. In Neural Information Processing Systems, 2017.
You, Y., Gitman, I., and Ginsburg, B. Scaling SGD batch size to 32K for ImageNet training. Technical Report UCB/EECS-2017-156, University of California, Berkeley, 2017.
You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer, K., and Hsieh, C.-J. Large batch optimization for deep learning: Training BERT in 76 minutes. In International Conference on Learning Representations, 2020.
Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.

Appendix A: Experimental Details
All code is to be found at github.com/jxbz/nero. This appendix records important details of the implementations and their hyperparameters.
MNIST classification
These experiments used a multilayer perceptron (MLP) network. An L-layer architecture consisted of (L − 1) equal-width hidden layers followed by an output layer mapping to the 10 classes. A "scaled relu" nonlinearity was used, defined by $\varphi(x) := \sqrt{2}\cdot\max(0, x)$. The factor of $\sqrt{2}$ was motivated by Kaiming init (He et al., 2015) and was not tuned. The reparameterisation experiment used L = 5 layers and trained for 5 epochs without learning rate decay. The very deep MLP used L = 100 layers and trained for 50 epochs with the learning rate decayed by a factor of 0.9 at the end of every epoch, and with the initial learning rate tuned over a small grid of values. Training took place on an unknown Google Colab GPU. On an NVIDIA Tesla P100 GPU, the 5-layer MLP took about a minute to train and the 100-layer MLP took a matter of minutes.
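A minimal sketch of the MLP building block described above (the hidden width here is a placeholder, since the original layer dimensions are not fully legible in this copy of the paper):

    import torch
    import torch.nn as nn

    class ScaledReLU(nn.Module):
        """phi(x) = sqrt(2) * max(0, x); the sqrt(2) keeps activation variance
        roughly constant across layers, in the spirit of Kaiming init."""
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return 2.0 ** 0.5 * torch.relu(x)

    def make_mlp(n_layers: int, width: int = 784, n_classes: int = 10) -> nn.Sequential:
        # (n_layers - 1) equal-width hidden layers followed by a 10-way output layer.
        blocks = []
        in_dim = 784  # flattened 28x28 MNIST images
        for _ in range(n_layers - 1):
            blocks += [nn.Linear(in_dim, width), ScaledReLU()]
            in_dim = width
        blocks.append(nn.Linear(in_dim, n_classes))
        return nn.Sequential(*blocks)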
CIFAR-10 cGAN

Equal learning rates were used in the generator and discriminator. The initial learning rate was tuned over {0.0001, 0.001, 0.01, 0.1, 1.0} for all optimisers. The networks were trained for 120 epochs, with the learning rate decayed by a factor of 10 at epoch 100. The momentum parameter in SGD and $\beta_1$ in Adam and LAMB were tuned over {0.0, 0.9}. Nero's $\beta$ and $\beta_2$ in Adam and LAMB were set to 0.999 without tuning. Training took around 3 hours on an NVIDIA RTX 2080Ti GPU.
CIFAR-10 classification

All models were trained for 200 epochs, with 5 epochs of linear learning rate warm-up and learning rate decay by a factor of 0.2 at epochs 100, 150 and 180. The initial learning rates were tuned over {0.0001, 0.001, 0.01, 0.1, 1.0}. Training was performed on an NVIDIA RTX 2080Ti GPU. Training time for the VGG-11 network was about an hour, and for ResNet-18 a few hours.

Since the experiments in Figures 1 and 2 were intended to probe the fundamental properties of optimisers rather than their performance under a limited tuning budget, a more fine-grained learning rate search was conducted. Specifically, the learning rates were tuned over {0.01, 0.02, 0.04, 0.06, 0.08, 0.1}. The best top-1 error and the corresponding learning rate $\eta$ were recorded for each combination of optimiser and constraints (fix mean, fix norm).

ImageNet classification
For training with SGD + momentum + weight decay, the initial learning rate was set to 0.1, momentum was set to 0.9 and weight decay was set to 0.0001. These settings follow He et al. (2016). One epoch of linear learning rate warm-up was used, followed by 89 epochs of cosine annealing. The batch size was set to 400 for ResNet-50 to fit the GPU vRAM budget, and this was in the range known to yield good performance (Goyal et al., 2017). This paper's implementation surpassed the target ImageNet top-1 accuracy for ResNet-50 (Goyal et al., 2017; You et al., 2020). The training was distributed over four NVIDIA RTX 2080Ti GPUs.
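One way to realise the warm-up plus cosine annealing schedule described above (a sketch; the helper below is this rewrite's own and should be stepped once per epoch):

    import math
    import torch

    def warmup_cosine(optimiser: torch.optim.Optimizer,
                      warmup_epochs: int = 1,
                      total_epochs: int = 90) -> torch.optim.lr_scheduler.LambdaLR:
        # Linear warm-up for the first epoch(s), cosine annealing to zero afterwards.
        def factor(epoch: int) -> float:
            if epoch < warmup_epochs:
                return (epoch + 1) / warmup_epochs
            progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
            return 0.5 * (1.0 + math.cos(math.pi * progress))
        return torch.optim.lr_scheduler.LambdaLR(optimiser, lr_lambda=factor)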
Wikitext-2 language model

The small transformer model was trained for 20 epochs, with the learning rate decayed by a factor of 0.1 at epoch 10. The initial learning rate was tuned over {0.0001, 0.001, 0.01, 0.1, 1.0}. The batch size was set to 20. Training on an NVIDIA RTX 2080Ti GPU took a matter of minutes.

WMT16 En–De translation
The large transformer model was trained for 100 epochs, with a linear warm-up from epoch 0 to 50, and linear annealing from epoch 50 to 100. The maximum learning rate was tuned over {0.0001, 0.001, 0.01, 0.1, 1.0}. A batch size of 128 was used. Training took about an hour on an NVIDIA RTX 2080Ti GPU.

Reinforcement learning
Hyperparameter settings followed Kostrikov (2018), except for the initial learning rate and the total number of environment steps. The number of steps was fixed to 5 million, and the initial learning rate was tuned over {0.0001, 0.001, 0.01, 0.1, 1.0}. The policy network combined convolutional image feature extractors with dense output layers. Training was performed on an NVIDIA RTX 2080Ti GPU.