Faster Biological Gradient Descent Learning
Ho Ling Li
School of Psychology, University of Nottingham, Nottingham NG7 2RD, U.K.
September 29, 2020
Abstract
Back-propagation is a popular machine learning algorithm that uses gradient descent in training neural networks for supervised learning, but can be very slow. A number of algorithms have been developed to speed up convergence and improve robustness of the learning. However, they are complicated to implement biologically as they require information from previous updates. Inspired by synaptic competition in biology, we have come up with a simple and local gradient descent optimization algorithm that can reduce training time, with no demand on past details. Our algorithm, named dynamic learning rate (DLR), works similarly to the traditional gradient descent used in back-propagation, except that instead of having a uniform learning rate across all synapses, the learning rate depends on the current neuronal connection weights. Our algorithm is found to speed up learning, particularly for small networks.
Introduction
In the past decades, back-propagation has become the go-to machine learning algorithm for training neural networks. It utilizes stochastic gradient descent (SGD) to minimize a cost function C by adjusting the connection weights $w_{ij}$ with $\Delta w_{ij} = -\eta\,\partial C/\partial w_{ij}$ at each time-step. However, learning with the SGD originally proposed in back-propagation can be very slow. In addition, networks have a higher risk of overfitting when training is slow [1]. Therefore, algorithms have been developed to speed up convergence and improve the robustness of learning. These algorithms fall mainly into two categories: SGD and adaptive learning. Examples of adaptive learning algorithms are Adagrad [2], RMSprop [3], and Adam [4]. They are usually faster than SGD algorithms such as momentum [5] and Nesterov momentum [6]. On the other hand, networks trained with SGD are better at generalization than those trained with adaptive learning [7]. In order to benefit from both the training speed and the generalization capability, several algorithms have been designed to transition from adaptive learning to SGD during training [8, 9]. Regardless of which category these algorithms belong to, they require combining past updates with the current weight update, which makes them complicated to implement biologically.

Inspired by synaptic competition in biology, we have come up with a simple and local gradient descent optimization algorithm that encourages the potentiation of strong synapses and suppresses the growth of weak synapses. This algorithm, named dynamic learning rate (DLR), works similarly to the traditional gradient descent used in back-propagation, except that instead of having a uniform learning rate across all synapses, the learning rate depends on the current connection weights of individual synapses and the norm of the weights of each neuron. It is found to speed up learning, particularly for small networks, with no demand on past information, hence making it biologically plausible.

Results

The design of DLR is based on the idea that synaptic transmission is metabolically expensive, pushing neurons to lower the number of strong synapses to save energy [10, 11, 12, 13]. Unlike the traditional SGD that uses the same learning rate η for all synapses, DLR quickens the rise of strong synapses and speeds up the diversification of the connection strength between neurons by encouraging neurons to form strong connections to a handful of neurons in their neighbouring layers, assigning a higher learning rate $\eta_{ij}$ to synapses with bigger weights $w_{ij}$:

$$\eta_{ij} = \eta \, \frac{|w_{ij}| + \alpha}{\|w_{j}\| + \alpha}, \qquad (1)$$

$$\Delta w_{ij} = -\eta_{ij} \, \frac{\partial C}{\partial w_{ij}}, \qquad (2)$$

where i indexes the post-synaptic neurons and j indexes the pre-synaptic neurons. The parameter α is set in a range of values such that at the beginning of training $\alpha > \|w_{j}\| \gg |w_{ij}|$, so that all synapses have a similar learning rate. As learning progresses, the learning rate of all synapses decreases, but large synapses retain a relatively large learning rate while the learning rate of small synapses becomes small. Here, $\|w_{j}\|$ is a norm over all the post-synaptic weights of a pre-synaptic neuron, leading to each pre-synaptic neuron having strong connections to only a limited number of post-synaptic neurons. However, DLR also works when this term is replaced with $\|w_{i}\|$, hence

$$\eta_{ij} = \eta \, \frac{|w_{ij}| + \alpha}{\|w_{i}\| + \alpha}, \qquad (3)$$

which instead promotes every post-synaptic neuron to form strong connections to a subset of pre-synaptic neurons. Whether Eq. 1 or Eq. 3 performs better depends on the network architecture. Since most networks tend to have decreasing numbers of neurons in deeper layers, Eq. 1 is more applicable in general. We note that the proposed modulation of learning can easily be imagined to occur in biology, as it only requires each neuron to know the connection strength with its own pre- or post-synaptic neurons.
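As a concrete illustration, the sketch below applies Eqs. 1 and 2 to a single weight matrix in NumPy. It is not the authors' implementation: the use of an L1 norm for $\|w_{j}\|$ and the example values of η and α are assumptions made here for illustration only.

```python
import numpy as np

def dlr_update(W, grad, eta=0.1, alpha=1.0):
    """One DLR weight update for a single layer, sketching Eqs. (1)-(2).

    W    : (n_post, n_pre) weight matrix; W[i, j] connects pre-synaptic j
           to post-synaptic i.
    grad : dC/dW, same shape as W.
    eta  : global learning rate (illustrative value).
    alpha: offset from Eq. (1); large early in training so that all
           synapses start with a similar learning rate.
    """
    # ||w_j|| in Eq. (1): norm over each pre-synaptic neuron's outgoing weights.
    # An L1 norm (sum of absolute values) is assumed here.
    col_norm = np.sum(np.abs(W), axis=0, keepdims=True)   # shape (1, n_pre)

    # Synapse-specific learning rate: synapses with bigger |w_ij| retain a
    # relatively larger rate as the norm grows during training.
    eta_ij = eta * (np.abs(W) + alpha) / (col_norm + alpha)

    # Gradient descent step with the per-synapse rate, Eq. (2).
    # Using a row norm (axis=1) over incoming weights gives the Eq. (3) variant.
    return W - eta_ij * grad
```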
Training speed compared to standard methods

To test the performance of DLR, we implement a multi-layer network with one hidden layer, trained with back-propagation, to classify hand-written digits from the MNIST dataset to a benchmark accuracy of . We compare the performance of DLR with the traditional SGD used in back-propagation, Nesterov momentum [6], and the commonly used adaptive learning algorithm Adam [4]. All the algorithms involve one or two parameters except Adam, which involves four parameters: α, β₁, β₂ and ε. From a brief check, training becomes faster after changing α and ε from the default values suggested by Adam's authors Kingma and Ba [4]. Therefore, Adam is optimized with respect to α and ε, while the other three algorithms have all of their parameters optimized. Networks with fewer than hidden units train the fastest with DLR, though Adam is not significantly slower than DLR (Figure 1). As the network size increases, Adam performs the best.
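For reference, a minimal PyTorch sketch of how the three standard optimizers in this comparison could be instantiated is shown below. The layer sizes and hyperparameter values are illustrative placeholders, not the tuned values used in the experiments, and DLR itself would be applied as a custom per-synapse update as in the earlier sketch.

```python
import torch

# Hypothetical one-hidden-layer MNIST classifier with logistic units and no
# biases; the hidden width (100) is a placeholder, not a value from the paper.
model = torch.nn.Sequential(
    torch.nn.Linear(784, 100, bias=False), torch.nn.Sigmoid(),
    torch.nn.Linear(100, 10, bias=False), torch.nn.Sigmoid(),
)

# Reference optimizers compared against DLR. The values shown are common
# defaults; in the experiments each algorithm's parameters (including Adam's
# alpha and epsilon) were tuned, and the model would be re-initialized per run.
optimizers = {
    "SGD": torch.optim.SGD(model.parameters(), lr=0.1),
    "Nesterov": torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, nesterov=True),
    "Adam": torch.optim.Adam(model.parameters(), lr=1e-3, eps=1e-8),
}
```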
Figure 1: Comparisons of training speed between traditional SGD, Nesterov momentum, Adam and DLR. The networks with back-propagation are trained with one of the four algorithms. The network size is varied by adjusting the number of neurons in the hidden layer. The left panel shows the training time of the four algorithms. To illustrate the change in training time from switching from traditional SGD to the other three algorithms, the ratio of their training time to that of traditional SGD is displayed on the right. Networks equipped with DLR reach the designated accuracy the fastest when the networks have fewer than hidden units. The uncertainties are standard deviations across multiple runs.

Robust for small networks

Next, we are interested in knowing whether DLR is more robust for small networks than the other algorithms, i.e. whether DLR allows small networks to reach the designated accuracy that they would otherwise not be able to reach if trained with the other learning algorithms. To test this, we reduce the number of neurons in the hidden layer until the networks are no longer able to converge. Figure 2 shows the minimal network size at which the networks can still satisfy the accuracy requirement. Since the change in network architecture may require different parameter values for the algorithms, we scan the performance of each algorithm across a parameter space. Compared to traditional SGD, Adam, and Nesterov momentum, which on average demand . ± . , . ± . and . ± . hidden neurons respectively, networks trained with DLR only require . ± . neurons to reach the designated accuracy. Our speculated explanation is that, because DLR allows large weights to have high learning rates, when a network is stuck at a local minimum, undesirably big weights can be depressed more quickly, hence releasing the network from that local minimum.
Figure 2: Minimal requirement on network size. By gradually reducing the number of neurons in the hidden layer, network sizes are decreased until the networks fail to reach an accuracy level of . Networks trained with DLR need on average fewer neurons than the other three algorithms. The uncertainties are standard deviations across multiple runs.

Synapse-specific learning rate benefits learning
During training, when DLR is implemented, the learning rate of most synapses drops. To check that the improvement in training speed is not primarily due to a gradual global decrease in learning rate but rather due to the learning rate being synapse-specific, we compare the learning time of networks trained with DLR against networks trained with the average learning rate of DLR. This is achieved by measuring the average learning rate between the input and hidden layers, and between the hidden and output layers, of the networks trained with DLR, and then fitting the average learning rate with $a\,\exp(bt^{1/2} + ct) + d$, where t represents the training time. The fits are implemented in new networks, which have to learn to classify the MNIST dataset with that predetermined learning rate. Networks trained with DLR reach the designated accuracy in (0. ± .) epochs, in contrast to (1. ± .) epochs for networks trained with the average learning rate, showing the criticality of the learning rate being synapse-specific. Figure 3 shows the median of test accuracy over multiple runs, demonstrating how networks with DLR progress faster during training.
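A sketch of how such a predetermined schedule could be produced is given below, assuming the fitted form $a\,\exp(bt^{1/2} + ct) + d$ reconstructed above; the measured values are synthetic placeholders standing in for the layer-averaged learning rates recorded from a network trained with DLR.

```python
import numpy as np
from scipy.optimize import curve_fit

def avg_lr(t, a, b, c, d):
    # Fitted form of the layer-averaged DLR learning rate over training time t.
    return a * np.exp(b * np.sqrt(t) + c * t) + d

# Placeholder measurements; in the actual procedure these would be the average
# learning rates recorded between two layers of a network trained with DLR.
t_measured = np.linspace(0.0, 2.0, 50)                        # epochs
lr_measured = 0.5 * np.exp(-1.5 * np.sqrt(t_measured)) + 0.01

popt, _ = curve_fit(avg_lr, t_measured, lr_measured, p0=(0.5, -1.0, 0.0, 0.0))

# The fitted curve avg_lr(t, *popt) then serves as the predetermined,
# synapse-uniform learning-rate schedule for the control networks.
```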
Figure 3: Median of test accuracy over training time. The median of test accuracy over multiple runs is shown until it reaches the designated accuracy. Networks equipped with DLR finish training within . epoch in the majority of the runs. Networks trained with the average learning rate of DLR take at least . epochs in most of the runs.

Discussion
DLR has demonstrated the possibility of a local and biological gradient descent optimization algorithm that can speed up neural network training with back-propagation. It only requires online information, which may bring the benefit of lower memory usage in large networks compared to algorithms that require storing information from past updates.

It is shown to have training speed comparable to the popular adaptive learning algorithm Adam for networks of small and medium size, i.e. when the parameters in the networks are not redundant. In addition, DLR is found to be more robust than traditional SGD, Nesterov momentum and Adam, as it allows small networks to acquire an accuracy level that they would otherwise not be able to achieve. This shows that DLR can find the solution more efficiently even when the number of weight parameters is constrained. Here, we have conducted the tests on the MNIST dataset with networks that are small compared to the deep networks used to categorize much more complicated images. Therefore, it is uncertain how DLR will perform in deep networks. We wonder whether it is also applicable in those networks, which, even though they have significantly larger sizes, may still suffer from the issue of insufficient weight parameters. On the other hand, DLR may allow the use of smaller deep networks by guaranteeing performance similar to that of larger ones, providing the benefits of less computation time and memory.

In recent years, many algorithms that effectively speed up learning are adaptive learning algorithms, which are found to have poorer generalization capability than SGD [7]. It would be interesting to test whether DLR, as an SGD method, generalizes well while still having training speed comparable to adaptive learning algorithms.
Methods
We use networks with one hidden layer, logistic units without bias, and one-hot encoding at the output. Weights are updated according to the mean-squared-error back-propagation rule without regularization.
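A minimal sketch of this setup is given below; the layer sizes, weight initialization, and learning rate are placeholders rather than the values used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Layer sizes are placeholders; the experiments vary the hidden-layer width.
n_in, n_hid, n_out = 784, 30, 10
W1 = rng.normal(scale=0.1, size=(n_hid, n_in))   # input -> hidden, no bias
W2 = rng.normal(scale=0.1, size=(n_out, n_hid))  # hidden -> output, no bias

def backprop_step(x, y_onehot, eta=0.1):
    """One mean-squared-error back-propagation update for a single example.

    x        : input vector of length n_in
    y_onehot : one-hot target vector of length n_out
    """
    global W1, W2
    h = sigmoid(W1 @ x)                  # hidden activations (logistic units)
    y = sigmoid(W2 @ h)                  # output activations (logistic units)
    # delta terms for C = 0.5 * ||y - y_onehot||^2 with logistic units
    delta_out = (y - y_onehot) * y * (1.0 - y)
    delta_hid = (W2.T @ delta_out) * h * (1.0 - h)
    # plain SGD step; DLR would replace eta with the per-synapse rate of Eq. (1)
    W2 -= eta * np.outer(delta_out, h)
    W1 -= eta * np.outer(delta_hid, x)
```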
Acknowledgments
This project is supported by the Leverhulme Trust with grant number RPG-2017-404. We would also like to thank the University of Nottingham High Performance Computing service for providing computing power for this research.
References

[1] Hardt M, Recht B, Singer Y (2015) Train faster, generalize better: stability of stochastic gradient descent. arXiv:1509.01240.
[2] Duchi J, Hazan E, Singer Y (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research.
[3] Tieleman T, Hinton G (2012) Lecture 6.5 - RMSprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
[4] Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015).
[5] Plaut DC, Nowlan SJ, Hinton GE (1986) Experiments on learning by back propagation. Carnegie Mellon University, Technical Report CMU-CS-86-126.
[6] Nesterov YE (1983) A method for solving the convex programming problem with convergence rate O(1/k^2). Doklady Akademii Nauk SSSR (Russian) [Proceedings of the USSR Academy of Sciences].
[7] Wilson AC, Roelofs R, Stern M, Srebro N, Recht B (2017) The marginal value of adaptive gradient methods in machine learning. arXiv:1705.08292.
[8] Keskar NS, Socher R (2017) Improving generalization performance by switching from Adam to SGD. arXiv:1712.07628.
[9] Luo L, Xiong Y, Liu Y, Sun X (2019) Adaptive gradient methods with dynamic bound of learning rate. International Conference on Learning Representations.
[10] Harris JJ, Jolivet R, Attwell D (2012) Synaptic energy use and supply. Neuron.
[11] Journal of Cerebral Blood Flow & Metabolism.
[12] Neural Computation.