Adapting Resilient Propagation for Deep Learning
Alan Mosca
Department of Computer Science and Information Systems
Birkbeck, University of London
Malet Street, London WC1E 7HX, United Kingdom
Email: [email protected]
George D. Magoulas
Department of Computer Science and Information Systems
Birkbeck, University of London
Malet Street, London WC1E 7HX, United Kingdom
Email: [email protected]
Abstract—The Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.
I. INTRODUCTION
Deep Learning techniques have generated many of the state-of-the-art models [1], [2], [3] that reached impressive results on benchmark datasets like MNIST [4]. Such models are usually trained with variations of the standard Backpropagation method, with stochastic gradient descent (SGD). In the field of shallow neural networks, there have been several developments to training algorithms that have sped up convergence [5], [6]. This paper aims to bridge the gap between the field of Deep Learning and these advanced training methods, by combining Resilient Propagation (Rprop) [5], Dropout [7] and Deep Neural Network Ensembles.
A. Rprop
The Resilient Propagation [5] weight update rule was initially introduced as a possible solution to the "vanishing gradients" problem: as the depth and complexity of an artificial neural network increase, the gradient propagated backwards by the standard SGD backpropagation becomes increasingly smaller, leading to negligible weight updates, which slow down training considerably. Rprop solves this problem by using a fixed update value ∆_ij, which is increased or decreased multiplicatively at each iteration by an asymmetric factor η+ or η− respectively, depending on whether the gradient with respect to w_ij has changed sign between two iterations or not. This "backtracking" allows Rprop to still converge to a local minimum, but the acceleration provided by the multiplicative factor η+ helps it skip over flat regions much more quickly. To avoid double punishment when in the backtracking phase, Rprop artificially forces the gradient product to be 0, so that the following iteration is skipped. An illustration of Rprop can be found in Algorithm 1.
Algorithm 1 Rprop

  η+ = 1.2, η− = 0.5, ∆max = 50, ∆min = 10^−6
  pick ∆_ij(0)
  ∆w_ij(0) = −sgn(∂E(0)/∂w_ij) · ∆_ij(0)
  for all t ∈ [1..T] do
    if ∂E(t)/∂w_ij · ∂E(t−1)/∂w_ij > 0 then
      ∆_ij(t) = min{∆_ij(t−1) · η+, ∆max}
      ∆w_ij(t) = −sgn(∂E(t)/∂w_ij) · ∆_ij(t)
      w_ij(t+1) = w_ij(t) + ∆w_ij(t)
      ∂E(t−1)/∂w_ij = ∂E(t)/∂w_ij
    else if ∂E(t)/∂w_ij · ∂E(t−1)/∂w_ij < 0 then
      ∆_ij(t) = max{∆_ij(t−1) · η−, ∆min}
      ∂E(t−1)/∂w_ij = 0
    else
      ∆w_ij(t) = −sgn(∂E(t)/∂w_ij) · ∆_ij(t)
      w_ij(t+1) = w_ij(t) + ∆w_ij(t)
      ∂E(t−1)/∂w_ij = ∂E(t)/∂w_ij
    end if
  end for
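For concreteness, the per-weight rule of Algorithm 1 can be written as a vectorised update over a whole weight matrix. The following NumPy sketch is our own illustration and assumes the gradients are supplied by an external backpropagation routine; the function name and the in-place handling of the state arrays are illustrative choices, not part of the original algorithm.

  import numpy as np

  def rprop_update(w, grad, prev_grad, delta,
                   eta_plus=1.2, eta_minus=0.5,
                   delta_max=50.0, delta_min=1e-6):
      # One Rprop iteration over a weight matrix (a sketch of Algorithm 1).
      # prev_grad and delta carry per-weight state and are modified in place.
      dw = np.zeros_like(w)
      prod = grad * prev_grad                    # sign of the gradient product

      pos = prod > 0                             # same sign: accelerate and take a step
      delta[pos] = np.minimum(delta[pos] * eta_plus, delta_max)
      dw[pos] = -np.sign(grad[pos]) * delta[pos]

      neg = prod < 0                             # sign flip: decelerate, skip the next adaptation
      delta[neg] = np.maximum(delta[neg] * eta_minus, delta_min)
      prev_grad[neg] = 0.0

      zero = prod == 0                           # includes the artificially zeroed case
      dw[zero] = -np.sign(grad[zero]) * delta[zero]

      prev_grad[pos | zero] = grad[pos | zero]   # store the gradient for the next iteration
      return w + dw, prev_grad, delta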
B. Dropout

Dropout [7] is a regularisation method by which only a random selection of nodes in the network is updated during each training iteration, but at the final evaluation stage the whole network is used. The selection is performed by sampling a dropout mask Dm from a Bernoulli distribution with P(muted_i) = D_r, where P(muted_i) is the probability of node i being muted during the weight update step of backpropagation, and D_r is the dropout rate, which is usually 0.5 for the middle layers, 0.2 or 0 for the input layers, and 0 for the output layer. For convenience this dropout mask is represented as a binary weight matrix D ∈ {0, 1}^{M×N}, covering all the weights in the network, which can be used to multiply the weight-space of the network to obtain what is called a thinned network for the current training iteration, where each weight w_ij is zeroed out based on the probability of its parent node i being muted.
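As an illustration of how such a thinned network can be obtained, the sketch below samples a node-level Bernoulli mask and expands it into the binary weight matrix described above. The function name, the 0.5 default rate and the NumPy random generator are our own illustrative choices, not the implementation used in the paper.

  import numpy as np

  rng = np.random.default_rng(0)

  def dropout_weight_mask(n_in, n_out, drop_rate=0.5):
      # Node i is muted with probability drop_rate; a muted node zeroes
      # every outgoing weight w_ij for the current training iteration.
      kept = rng.random(n_in) >= drop_rate
      return np.repeat(kept[:, None], n_out, axis=1).astype(float)

  W = rng.standard_normal((4, 3))
  Dm = dropout_weight_mask(4, 3)
  thinned_W = W * Dm   # the "thinned network" used for this iteration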
The remainder of this paper is structured as follows:
• In section II we explain why using Dropout causes an incompatibility with Rprop, and propose a modification to solve the issue.
• In section III we show experimental results using the MNIST dataset, first to highlight how Rprop is able to converge much more quickly during the initial epochs, and then use this to speed up the training of a Stacked Ensemble.
• Finally, in section IV, we look at how this work can be extended with further evaluation and development.

II. RPROP AND DROPOUT
In this section we explain the zero-gradient problem, and propose a solution by adapting the Rprop algorithm to be aware of Dropout.
A. The zero-gradient problem
In order to avoid double punishment when there is a change of sign in the gradient, Rprop artificially sets the gradient product associated with weight ij for the next iteration to ∂E(t)/∂w_ij · ∂E(t+1)/∂w_ij = 0. This condition is checked during the following iteration, and if true no updates to the weight w_ij or the learning rate ∆_ij are performed.

Using the zero-valued gradient product as an indication to skip an iteration is acceptable in normal gradient descent, because the only other occurrence of this would be when learning has terminated. When Dropout is introduced, additional events can produce these zero values:
• When neuron i is skipped, the dropout mask for all weights w_ij going to the layer above has a value of 0.
• When neuron j in the layer above is skipped, the gradient propagated back to all the weights w_ij is also 0.
These additional zero-gradient events force additional skipped training iterations and missed learning rate adaptations that slow down training unnecessarily.
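To make these two cases concrete, consider a toy single linear layer; the values and masks below are purely illustrative. The gradient with respect to w_ij vanishes whenever either end of the weight is dropped, even though learning has not terminated:

  import numpy as np

  rng = np.random.default_rng(1)

  x = rng.standard_normal(5)                        # activations of the lower layer
  W = rng.standard_normal((5, 3))
  mask_in = np.array([1, 0, 1, 1, 0], dtype=float)  # neurons i = 1 and i = 4 are muted
  mask_out = np.array([1, 1, 0], dtype=float)       # neuron j = 2 in the layer above is muted

  y = ((x * mask_in) @ W) * mask_out                # thinned forward pass
  dE_dy = rng.standard_normal(3) * mask_out         # backpropagated gradient is zero for muted j

  # dE/dw_ij = (x * mask_in)_i * (dE/dy)_j: rows of muted inputs and
  # columns of muted outputs are exactly zero.
  dE_dW = np.outer(x * mask_in, dE_dy)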
B. Adaptations to Rprop
By making Rprop aware of the dropout mask Dm, we are able to distinguish whether a zero-gradient event occurs as a signal to skip the next weight update or whether it occurs for a different reason, in which case the w and ∆ updates should be allowed. The new version of the Rprop update rule for each weight ij is shown in Algorithm 2. We use t to indicate the current training example, t − 1 for the previous training example, t + 1 for the next training example, and where a value with (0) appears, it is intended to be the initial value. All other notation is the same as used in the original Rprop:
• E(t) is the error function (in this case negative log likelihood)
• ∆_ij(t) is the current update value for the weight at index ij
• ∆w_ij(t) is the current weight update value for index ij
In particular, the check on the dropout mask and the additional zero-gradient condition provide the necessary protection from the additional zero-gradients, and correctly implement the recipe prescribed by Dropout, by completely skipping every weight for which Dm_ij = 0 (which means that neuron j was dropped out and therefore the gradient will necessarily be 0). We expect that this methodology can be extended to other variants of Rprop, such as, but not limited to, iRprop+ [8] and JRprop [6].
Algorithm 2 Rprop adapted for Dropout

  η+ = 1.2, η− = 0.5, ∆max = 50, ∆min = 10^−6
  pick ∆_ij(0)
  ∆w_ij(0) = −sgn(∂E(0)/∂w_ij) · ∆_ij(0)
  for all t ∈ [1..T] do
    if Dm_ij = 0 then
      ∆_ij(t) = ∆_ij(t−1)
      ∆w_ij(t) = 0
    else
      if ∂E(t)/∂w_ij · ∂E(t−1)/∂w_ij > 0 then
        ∆_ij(t) = min{∆_ij(t−1) · η+, ∆max}
        ∆w_ij(t) = −sgn(∂E(t)/∂w_ij) · ∆_ij(t)
        w_ij(t+1) = w_ij(t) + ∆w_ij(t)
        ∂E(t−1)/∂w_ij = ∂E(t)/∂w_ij
      else if ∂E(t)/∂w_ij · ∂E(t−1)/∂w_ij < 0 then
        ∆_ij(t) = max{∆_ij(t−1) · η−, ∆min}
        ∂E(t−1)/∂w_ij = 0
      else if ∂E(t−1)/∂w_ij = 0 then
        ∆w_ij(t) = −sgn(∂E(t)/∂w_ij) · ∆_ij(t)
        w_ij(t+1) = w_ij(t) + ∆w_ij(t)
      else
        ∆_ij(t) = ∆_ij(t−1)
        ∆w_ij(t) = 0
      end if
    end if
  end for
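A vectorised sketch of Algorithm 2 follows, in the same style as the sketch given for Algorithm 1. It is our own illustration; in particular, refreshing the stored gradient in the backtracking-recovery branch is an assumption on our part rather than a line of the listing.

  import numpy as np

  def rprop_dropout_update(w, grad, prev_grad, delta, Dm,
                           eta_plus=1.2, eta_minus=0.5,
                           delta_max=50.0, delta_min=1e-6):
      # One iteration of the Dropout-aware Rprop rule (a sketch of Algorithm 2).
      # prev_grad and delta carry per-weight state and are modified in place.
      dw = np.zeros_like(w)
      active = Dm != 0                      # dropped-out weights keep delta and get no update
      prod = grad * prev_grad

      pos = active & (prod > 0)             # same sign: accelerate and take a step
      delta[pos] = np.minimum(delta[pos] * eta_plus, delta_max)
      dw[pos] = -np.sign(grad[pos]) * delta[pos]
      prev_grad[pos] = grad[pos]

      neg = active & (prod < 0)             # sign flip: decelerate, zero the stored gradient
      delta[neg] = np.maximum(delta[neg] * eta_minus, delta_min)
      prev_grad[neg] = 0.0

      # Zero product caused by the artificially zeroed stored gradient:
      # take a plain step without adapting delta.
      back = active & (prod == 0) & (prev_grad == 0)
      dw[back] = -np.sign(grad[back]) * delta[back]
      prev_grad[back] = grad[back]          # refreshing the cache here is our assumption

      # Any other active weight with a genuinely zero gradient, and every
      # weight with Dm_ij = 0, keeps its delta and receives no weight update.
      return w + dw, prev_grad, delta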
III. EVALUATING ON MNIST
In this section we describe an initial evaluation of performance on the MNIST dataset. For all experiments we use a Deep Neural Network (DNN) with five middle layers, a dropout rate Dr_mid = 0.5 for the middle layers, and no Dropout on the inputs. This dropout rate has been shown to be an optimal choice for the MNIST dataset in [9]. A similar architecture has been used to produce state-of-the-art results [3]; however, the authors used the entire training set for validation, and graphical transformations of said set for training. These added transformations have led to a "virtually infinite" training set size, whereby at every epoch a new training set is generated, with a much larger validation set of the original images. The test set remains the original 10 000 image test set. An explanation of these transformations is provided in [10], which also confirms that:

"The most important practice is getting a training set as large as possible: we expand the training set by adding a new form of distorted data"

We therefore attribute these big improvements to the transformations applied, and have not found it a primary goal to replicate these additional transformations to obtain the state-of-the-art results; instead we focused on utilising the untransformed dataset, using 50 000 images for training, 10 000 for validation and 10 000 for testing. Subsequently, we performed a search using the validation set as an indicator to find the optimal hyperparameters of the modified version of Rprop. We found that the best results were reached with η+ = 0. , η− = 0. , ∆max = 5 and ∆min = 10−. We trained all models up to the maximum number of allowed epochs, and measured the error on the validation set at every epoch, so that it could be used to select the model to be applied to the test set. We also measured the time it took to reach the best validation error, and report its approximate magnitude, to be used as a comparison of orders of magnitude. The results presented are an average over repeated runs.
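The protocol just described (train for a bounded number of epochs, track the validation error every epoch, and keep the best snapshot for the final test run) can be summarised as below. This is only a sketch: model, train_epoch and validation_error are placeholders standing in for the actual Theano implementation, not an interface defined in this paper.

  import copy

  def train_with_validation(model, train_epoch, validation_error, max_epochs):
      # train_epoch(model) performs one training pass of (modified) Rprop;
      # validation_error(model) returns the current validation error.
      best_err, best_params, best_epoch = float("inf"), None, 0
      for epoch in range(max_epochs):
          train_epoch(model)
          err = validation_error(model)
          if err < best_err:
              best_err, best_epoch = err, epoch
              best_params = copy.deepcopy(model.params)  # snapshot used for the test set
      return best_params, best_err, best_epoch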
A. Compared to SGD

From the results in Table I we see that the modified version of Rprop is able to start up much more quickly and reaches an error value that is close to the minimum much sooner. SGD reaches a higher error value, and after a much longer time. Although the overall error improvement is significant, the speed gain from using Rprop is more appealing, because it allows us to save a large number of iterations that could be used for improving the model in different ways. The modified Rprop obtains its best validation error after only 35 epochs, whilst SGD reached its minimum after 1763. An illustration of the early epochs can be seen in Figure 1.
Fig. 1: Validation Error - SGD vs Mod. Rprop
TABLE I: Simulation results

Method      Min Val Err   Epochs   Time      Test Err   1st Epoch
SGD         2.85%         1763     320 min   3.50%      88.65%
Rprop       3.03%         105      25 min    3.53%      12.81%
Mod Rprop   2.57%         35       10 min    3.49%      13.54%
B. Compared to unmodified Rprop
We can see from Figure 2 that the modified version of Rprop has a faster start-up than the unmodified version, and stays below it consistently until it reaches its minimum. Also, the unmodified version does not reach the same final error as the modified version, starts overtraining much sooner, and does not reach a better error than SGD. Table I shows in more detail how the performance of the two methods compares over the early epochs.
Fig. 2: Validation Error - Unmod. vs Mod. Rprop
C. Using Modified Rprop to speed up training of Deep Learning Ensembles
The increase in speed of convergence can make it practical to produce Ensembles of Deep Neural Networks, as the time to train each member DNN is considerably reduced without undertraining the network. We have been able to train these Ensembles in less than 12 hours in total on a single-GPU, single-CPU desktop system (a Nvidia GTX-770 graphics card on a Core i5 processor, programmed with Theano in Python). We have trained different Ensemble types, and we report the final results in Table II. The methods used are Bagging [11] and Stacking [12], with ensembles of up to 10 member DNNs. Each member was trained for a fixed maximum number of epochs. A minimal code sketch of both aggregation schemes is given after Table II.
• Bagging is an ensemble method by which several different training sets are created by random resampling of the original training set, and each of these is used to train a new classifier. The entire set of trained classifiers is usually then aggregated by taking an average or a majority vote to reach a single classification decision.
• Stacking is an ensemble method by which the different classifiers are aggregated using an additional learning algorithm that takes the outputs of these first-space classifiers as its inputs and learns how to reach a better classification result. This additional learning algorithm is called a second-space classifier.
In the case of Stacking, the final second-space classifier was another DNN with two middle layers, of size (200N, N), where N is the number of DNNs in the Ensemble, trained for a maximum number of epochs with the modified Rprop. We used the same original train, validation and test sets for this, and collected the average over repeated runs. The results are still not comparable to what is presented in [3], which is consistent with the observations about the importance of the dataset transformations; however, we note that we are able to improve the error in less time than it took to train a single network with SGD. A Wilcoxon signed ranks test shows that the increase in performance obtained from using the ensembles of size 10, compared to the smaller ensembles, is significant.

TABLE II: Ensemble performance

Method      Size   Test Err   Time
Bagging                       35 min
Bagging     10     2.         128 min
Stacking                      39 min
Stacking    10     2.         145 min
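The following sketch summarises the two aggregation schemes. The member objects and their predict_proba method are placeholders for the trained member DNNs, and the resampling and averaging choices shown are the textbook variants of Bagging and Stacking rather than details taken from our implementation.

  import numpy as np

  rng = np.random.default_rng(0)

  def bagging_resample(X, y):
      # Bagging: draw one bootstrap resample of the training set for a new member.
      idx = rng.integers(0, len(X), size=len(X))
      return X[idx], y[idx]

  def bagging_predict(members, X):
      # Aggregate member DNNs by averaging their class probabilities
      # (a majority vote over predicted labels is the common alternative).
      probs = np.mean([m.predict_proba(X) for m in members], axis=0)
      return probs.argmax(axis=1)

  def stacking_features(members, X):
      # Stacking: the concatenated first-space outputs become the inputs of the
      # second-space classifier, which is then trained like any other DNN.
      return np.concatenate([m.predict_proba(X) for m in members], axis=1)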
IV. CONCLUSIONS AND FUTURE WORK
We have highlighted that many training methods that have been used in shallow learning may be adapted for use in Deep Learning. We have looked at Rprop and how the appearance of zero-gradients during training, as a side effect of Dropout, poses a challenge to learning, and proposed a solution which allows Rprop to train DNNs to a better error while still being much faster than standard SGD backpropagation.

We then showed that this increase in training speed can be used to effectively train an Ensemble of DNNs on a commodity desktop system, and reap the added benefits of Ensemble methods in less time than it would take to train a Deep Neural Network with SGD.

It remains to be assessed in further work whether this improved methodology would lead to a new state-of-the-art error when applying the pre-training and dataset enhancements that have been used in other methods, and how the improvements to Rprop can be ported to its numerous variants.
ACKNOWLEDGEMENT
The authors would like to thank the School of Business, Economics and Informatics, Birkbeck College, University of London, for the grant received to support this research.
REFERENCES

[1] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus, "Regularization of neural networks using dropconnect," in Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013, pp. 1058–1066.
[2] D. Ciresan, U. Meier, and J. Schmidhuber, "Multi-column deep neural networks for image classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Press, 2012, pp. 3642–3649.
[3] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber, "Deep, big, simple neural nets for handwritten digit recognition," Neural Computation, vol. 22, no. 12, pp. 3207–3220, 2010.
[4] Y. LeCun and C. Cortes, "The MNIST database of handwritten digits." [Online]. Available: http://yann.lecun.com/exdb/mnist/
[5] M. Riedmiller and H. Braun, "A direct adaptive method for faster backpropagation learning: The RPROP algorithm," in Proceedings of the IEEE International Conference on Neural Networks. IEEE, 1993, pp. 586–591.
[6] A. D. Anastasiadis, G. D. Magoulas, and M. N. Vrahatis, "New globally convergent training scheme based on the resilient propagation algorithm," Neurocomputing, vol. 64, pp. 253–270, 2005.
[7] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," CoRR, vol. abs/1207.0580, 2012.
[8] C. Igel and M. Hüsken, "Improving the Rprop learning algorithm," in Proceedings of the Second International ICSC Symposium on Neural Computation (NC 2000), vol. 2000. Citeseer, 2000, pp. 115–121.
[9] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[10] P. Y. Simard, D. Steinkraus, and J. C. Platt, "Best practices for convolutional neural networks applied to visual document analysis," 2003. [Online]. Available: http://research.microsoft.com/apps/pubs/default.aspx?id=68920
[11] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
[12] D. H. Wolpert, "Stacked generalization," Neural Networks, vol. 5, no. 2, pp. 241–259, 1992.