Forward Thinking: Building and Training Neural Networks One Layer at a Time
Chris Hettinger, Tanner Christensen, Ben Ehlert, Jeffrey Humpherys, Tyler Jarvis, Sean Wade
Department of Mathematics, Brigham Young University, Provo, Utah 84602
[email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
Abstract
We present a general framework for training deep neural networks without backpropagation. This substantially decreases training time and also allows for construction of deep networks with many sorts of learners, including networks whose layers are defined by functions that are not easily differentiated, like decision trees. The main idea is that layers can be trained one at a time, and once they are trained, the input data are mapped forward through the layer to create a new learning problem. The process is repeated, transforming the data through multiple layers, one at a time, rendering a new data set, which is expected to be better behaved, and on which a final output layer can achieve good performance. We call this forward thinking and demonstrate a proof of concept by achieving state-of-the-art accuracy on the MNIST dataset for convolutional neural networks. We also provide a general mathematical formulation of forward thinking that allows for other types of deep learning problems to be considered.
Introduction

In recent years, deep neural networks (DNNs) have become a dominant force in many supervised learning problems. In several side-by-side comparisons with standardized data sets and well-defined benchmarks, neural networks have bested most and in some cases all other leading machine learning techniques [7, 14]. This is particularly pronounced in image, speech, and natural language recognition problems, where deep learning methods are also consistently beating humans [6].

The main downside of deep learning is the computational complexity of the training algorithms. In particular, it is extremely expensive computationally to use backpropagation to train multiple layers of nonlinear activation functions [15]. Indeed, the computational resources required to fully train a DNN are in many cases orders of magnitude greater than other machine learning methods that perform almost as well on many tasks [9]. In other words, in many cases a great deal of extra work is required to get only slightly better performance.

We present a general framework, which we call forward thinking, for training DNNs without doing backpropagation. This allows a network to be built from scratch as deep as needed in real time. It also allows the use of many different sorts of learners in the layers of the network, including learners that are not easily differentiable, like random forests. The main idea is that layers of learning functions can be trained one at a time, and once trained, the input data can be mapped forward through the layer to create a new learning problem. This process is repeated, transforming the data through multiple layers, and rendering a new data set, which is expected to be better behaved, and on which a final output layer can achieve good performance. This is much faster than traditional backpropagation, and the number of layers can be determined at training time by simply continuing to add and train layers consecutively until performance plateaus.

This greedy approach to deep learning stems from a confluence of ideas that generalize nicely into a single framework. In particular, several recent papers have elements that can be nicely described as special cases of forward thinking; see, for example, net2net [3], cascade correlation [4], network morphism [17, 16], and convolutional representation transfer [11]. Many variants of the ideas behind forward thinking have been proposed in various settings [13, 12, 5, 2, 8, 18]. But the full potential of this idea does not seem to have been completely realized. Specifically, the papers [2] and [8] did some experiments training networks in a greedy fashion and saw poor performance. Others who have used greedy training methods have used them only for pretraining, that is, as a method for initializing networks that are subsequently trained using backpropagation.

However, we show here that forward thinking can be effective as a stand-alone training method. It is, of course, much faster than backpropagation, and yet it can give results that are as accurate as backpropagation. As a proof of concept, we use forward thinking to design and train both a fully-connected deep neural network (DNN) and a convolutional neural network (CNN) and compare their performance on the MNIST dataset against their traditionally trained counterparts. In an "apples-to-apples" comparison against traditionally trained networks, we find roughly equivalent performance in terms of accuracy and significantly reduced training time.
In particular, we were able to get an accuracy of 98.89% with a forward-thinking DNN and a state-of-the-art accuracy of 99.72% with a forward-thinking CNN.

The rest of this paper describes how forward thinking can be used to build (both fully connected and convolutional) feedforward neural networks one layer at a time. In a companion paper, we consider deep random forests and show that many of the ideas presented here carry over to other machine learning techniques [10]. Specifically, in that paper we replace neurons with decision trees and describe a specific implementation, which, as a proof of concept, also achieves very good results on the MNIST dataset. Together these two papers illustrate how the forward thinking framework can be applied to many machine learning methods.

A general framework

In this section, we describe the general mathematical structure of a forward thinking deep network. The main idea is that neurons can be generalized to any type of learner and then, once trained, the input data are mapped forward through the layer to create a new learning problem. This process is repeated, transforming the data through multiple layers, one at a time, and rendering a new data set, which is expected to be better behaved, and on which a final output layer can achieve good performance.
The input layer
The data $D^{(0)} = \{(x^{(0)}_i, y_i)\}_{i=1}^N \subset X^{(0)} \times Y$ are given as the set of input values $x^{(0)}_i$ from a set $X^{(0)}$ and their corresponding outputs $y_i$ in a set $Y$.

In many learning problems, $X^{(0)} \subset \mathbb{R}^d$, which means that there are $d$ real-valued features. If the inputs are images, we can stack them as large vectors where each pixel is a component. In some deep learning problems, each input is a stack of images; for example, color images can be represented as three separate monochromatic images, or three separate channels of the image.

For binary classification problems, the output space can be taken to be $Y = \{-1, 1\}$. For multi-class problems we often set $Y = \{1, 2, \ldots, n\}$.
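To make the notation concrete, here is a minimal sketch (ours, not the paper's) of assembling $D^{(0)}$ for MNIST in Python, assuming the `tensorflow.keras` MNIST loader is available. Each image is flattened into a 784-dimensional vector, so $X^{(0)} \subset \mathbb{R}^{784}$, and the labels take values in $\{0, 1, \ldots, 9\}$.

```python
# Sketch (not from the paper): building the initial data set D^(0) from MNIST.
import numpy as np
from tensorflow.keras.datasets import mnist

(x_train, y_train), _ = mnist.load_data()

# Flatten each 28x28 image into a 784-dimensional vector and rescale pixels to [0, 1].
x0 = x_train.reshape(len(x_train), -1).astype("float32") / 255.0  # inputs x^(0)_i in X^(0)
y = y_train.astype("int64")                                       # labels y_i in Y

# D^(0) is just the collection of (input, label) pairs.
D0 = list(zip(x0, y))
```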
The first hidden layer

Let $C^{(1)} = \{C^{(1)}_1, C^{(1)}_2, \ldots, C^{(1)}_m\}$ be a layer of $m$ learning functions $C^{(1)}_i : X^{(0)} \to Z^{(1)}_i$, for some codomain $Z^{(1)}_i$, with parameters $\theta^{(1)}_i$. These learning functions (or learners) can be regression, classification, or kernel functions, and can be thought of as defining new features, as follows. Let $X^{(1)} = Z^{(1)}_1 \times Z^{(1)}_2 \times \cdots \times Z^{(1)}_m$ and transfer the inputs $\{x^{(0)}_i\}_{i=1}^N \subset X^{(0)}$ to $X^{(1)}$ according to the map
$$x^{(1)}_i = \bigl(C^{(1)}_1(x^{(0)}_i), C^{(1)}_2(x^{(0)}_i), \ldots, C^{(1)}_m(x^{(0)}_i)\bigr) \in X^{(1)}, \qquad i = 1, \ldots, N.$$
This gives a new data set $D^{(1)} = \{(x^{(1)}_i, y_i)\}_{i=1}^N \subset X^{(1)} \times Y$.

In many learning problems $Z^{(1)}_i = [-1, 1]$, in which case the new domain $X^{(1)}$ is a hypercube $[-1, 1]^m$. It is also common to have $Z^{(1)}_i = [0, \infty)$, in which case $X^{(1)}$ is the $m$-dimensional orthant $[0, \infty)^m$.

The goal is to choose $C^{(1)}$ to make the new dataset "more separable," or better behaved, in some sense, than the previous dataset. As we repeat this process iteratively, the data should become increasingly well behaved, so that in the final layer, a single learner can finish the job.

The functions $C^{(1)}$ are trained on the data set $D^{(0)} = \{(x^{(0)}_i, y_i)\}_{i=1}^N$ in some suitable way. In many settings, this could be done by minimizing a loss function for a neural network with a single hidden layer $C^{(1)}$ and a final output layer consisting of just one learner $C' : X^{(1)} \to \mathbb{R}$ with parameters $\theta'$. The loss function could then be of the form
$$L^{(1)}(\Theta^{(1)}, \theta') = \sum_{i=1}^N \ell\bigl(C' \circ C^{(1)}(x_i), y_i\bigr) + r(\Theta^{(1)}, \theta'),$$
where $\ell : \mathbb{R} \times Y \to [0, \infty)$ is a measure of how close $C' \circ C^{(1)}(x_i)$ is to $y_i$ and $r(\Theta^{(1)}, \theta')$ is a regularization term. Of course, the loss function for training this layer need not be of this form, but this would be an obvious choice. If the learners $C^{(1)}$ are regression trees or random forests, these could be trained in the standard way, without the extra learner $C'$.

The key point is that once they are learned, the parameters $\theta^{(1)}_i$ are frozen, but the parameters $\theta'$ are discarded (and the learner $C'$ may either be discarded or retrained at the next iteration). The old data $x^{(0)}_i$ are mapped through the learned layer $C^{(1)}$ to give new data $x^{(1)}_i$, which is then passed to a new hidden layer.
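As a hedged illustration of one possible realization (ours, not the paper's exact code), suppose the learners $C^{(1)}_i$ are ReLU neurons and $C'$ is a temporary softmax layer trained by SGD. The layer width, optimizer, and epoch count below are illustrative assumptions; `x0` and `y` are from the previous sketch. After training, the layer's parameters are frozen and the inputs are mapped forward to form $D^{(1)}$.

```python
# Sketch: train the first hidden layer C^(1) together with a throwaway output learner C',
# then freeze C^(1) and map the data forward to obtain D^(1).
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

m = 150  # number of learners in the layer C^(1) (illustrative)

shallow = Sequential([
    Dense(m, activation="relu", input_shape=(x0.shape[1],)),  # C^(1): X^(0) -> [0, inf)^m
    Dense(10, activation="softmax"),                           # temporary output learner C'
])
shallow.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
shallow.fit(x0, y, epochs=1, batch_size=128)

# Freeze C^(1): keep its parameters theta^(1), discard C'.
W, b = shallow.layers[0].get_weights()

# Push the inputs through the frozen layer to create the new data set D^(1).
x1 = np.maximum(x0 @ W + b, 0.0)   # new inputs x^(1)_i; the labels y_i are unchanged
D1 = list(zip(x1, y))
```

Because the layer is frozen as a fixed function, mapping the data forward is just a matrix multiplication and a ReLU, so each subsequent training problem again involves only a shallow network.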
Additional hidden layers

Let $C^{(k)} = \{C^{(k)}_1, C^{(k)}_2, \ldots, C^{(k)}_{m_k}\}$ be a set (layer) of $m_k$ learning functions $C^{(k)}_i : X^{(k-1)} \to Z^{(k)}_i$. This layer is again trained on the data $D^{(k-1)} = \{(x^{(k-1)}_i, y_i)\}$. This would usually be done in the same manner as the previous layer, but it need not be the same; for example, if the new layer consists of a different kind of learners, then the training method for the new layer might also need to differ.

As with the first layer, the inputs $\{x^{(k-1)}_i\}_{i=1}^N \subset X^{(k-1)} = Z^{(k-1)}_1 \times Z^{(k-1)}_2 \times \cdots \times Z^{(k-1)}_{m_{k-1}}$ are transferred to a new domain $\{x^{(k)}_i\}_{i=1}^N \subset X^{(k)} = Z^{(k)}_1 \times Z^{(k)}_2 \times \cdots \times Z^{(k)}_{m_k}$ according to the map
$$x^{(k)}_i = \bigl(C^{(k)}_1(x^{(k-1)}_i), C^{(k)}_2(x^{(k-1)}_i), \ldots, C^{(k)}_{m_k}(x^{(k-1)}_i)\bigr), \qquad i = 1, \ldots, N.$$
This gives a new data set $D^{(k)} = \{(x^{(k)}_i, y_i)\}_{i=1}^N \subset X^{(k)} \times Y$, and the process is repeated.
Final layer

After passing the data through the last hidden layer to obtain $D^{(n)} = \{(x^{(n)}_i, y_i)\}_{i=1}^N \subset X^{(n)} \times Y$, we train a final layer, which consists of a single learning function $C_F : X^{(n)} \to Y$, to determine the outputs, where $C_F(x^{(n)}_i)$ is expected to be close to $y_i$ for each $i$.
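Since $C_F$ can be any learner and need not be a neuron, one could, for instance, fit an off-the-shelf classifier to the final synthetic features. The sketch below uses scikit-learn's logistic regression purely to illustrate this flexibility; `x_n` is an assumed variable holding the features $x^{(n)}_i$ produced by the last frozen layer, not something defined in the paper.

```python
# Sketch: any standard learner can play the role of the final function C_F on D^(n).
# x_n is assumed to hold the synthetic features x^(n)_i from the last frozen layer.
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000)
clf.fit(x_n, y)              # train C_F : X^(n) -> Y
y_hat = clf.predict(x_n)     # C_F(x^(n)_i), expected to be close to y_i
```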
Building a fully-connected network

We now explain how to implement forward thinking to build a fully-connected DNN. We call the result a Forward Thinking Deep Neural Network, or FTDNN.

As with any neural network, we begin by selecting an output layer and loss function appropriate to the problem at hand. For example, a binary classification problem might call for an output layer consisting of a single neuron with a sigmoid activation function and a cross-entropy loss function.

Then we construct a network with a single hidden layer of appropriate width, using conventional activation functions, such as ReLU. We also randomly initialize the parameters of this network. Tools like weight regularization and dropout can be used during training.

In our setting, it does not pay to train this first single-hidden-layer network too long. Instead of milking this first layer for incremental improvements, one can make bigger improvements by moving on to the next step, adding another layer.

Figure 1: The first three iterations of a fully-connected network built with the forward thinking algorithm. The original data set is represented by an ellipse, fully-connected layers with rectangles, and final (output) layers with triangles. Layers with single blue outlines are trainable, while those with double black outlines have been frozen and thus turned into new data sets.
Once the first network is trained, the weights coming into the first layer are frozen (and stored), and the training inputs $\{x^{(0)}_i\}_{i=1}^N$ are pushed through the resulting layer to give new "synthetic" data $\{x^{(1)}_i\}_{i=1}^N$, which is used to train the next layer. The weights for the output layer are discarded (they will be retrained at each step).

The main advantages of freezing the previously trained weights are (i) speed: adding each new layer amounts to training a shallow network with only one hidden layer; and (ii) resilience to overfitting.

Now insert a new hidden layer between the previously trained layer and the output layer. This layer is trained as a single-hidden-layer network on the new, synthetic data $\{x^{(1)}_i\}_{i=1}^N$ constructed at the previous step. Randomly initialize the parameters of this layer and randomly re-initialize those of the output layer. This will cause a temporary dip in the performance of the network, but it also creates new room for improvement.

The process of freezing old layers and inserting new ones is repeated until additional layers cease to improve performance. This indicates that it's time to stop adding new layers and consider the network complete.

Even though each stage involves training only a shallow network, the layers together form a single deep network. As mentioned before, the main advantage of this method is that we never need to use backpropagation to reach back into the network and train deep parameters. So we avoid the pitfalls of backpropagation, including its high computational cost and its struggle to effectively adjust deep parameters.
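Putting the pieces together, the following sketch shows the whole layer-by-layer loop for a fully-connected network, under the same assumptions as the earlier snippets (Keras for each shallow network, NumPy to push data forward, `x0` and `y` from the first sketch). The widths 150, 100, and 50 match the experiment described below; everything else (optimizer, epoch counts, omission of dropout and regularization) is an illustrative simplification, not the paper's exact configuration.

```python
# Sketch of the full forward-thinking loop for a fully-connected network.
# Each iteration trains only a one-hidden-layer network; the hidden layer is then
# frozen and the data are pushed forward to form the next synthetic data set.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def train_one_layer(x, y, width, epochs=1):
    """Train one hidden layer plus a temporary softmax head; return the frozen weights."""
    net = Sequential([
        Dense(width, activation="relu", input_shape=(x.shape[1],)),
        Dense(10, activation="softmax"),      # output head, discarded after this iteration
    ])
    net.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    net.fit(x, y, epochs=epochs, batch_size=128, verbose=0)
    return net.layers[0].get_weights()        # (W, b) of the hidden layer

x_cur, frozen = x0, []
for width in [150, 100, 50]:                  # in practice: add layers until accuracy plateaus
    W, b = train_one_layer(x_cur, y, width)
    frozen.append((W, b))
    x_cur = np.maximum(x_cur @ W + b, 0.0)    # synthetic data for the next layer

# Finally, train the output layer on the last synthetic data set.
head = Sequential([Dense(10, activation="softmax", input_shape=(x_cur.shape[1],))])
head.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
head.fit(x_cur, y, epochs=5, batch_size=128, verbose=0)
```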
Remark 1
Throughout this paper we use the term backpropagation informally to mean the traditional process of training a DNN by some variant of stochastic gradient descent (SGD), combined with the chain rule and cascading derivatives. In this paper, when we train each of our shallow intermediate networks, we do still use SGD at each stage, but we do not consider this an instance of backpropagation, because there are no long chains of cascading derivatives to calculate.
Experiments with fully-connected networks

We used both forward thinking and traditional backpropagation to train a fully-connected network with four hidden layers of 150, 100, 50, and 10 neurons respectively, applied to the MNIST handwritten digit dataset (we also followed the common practice of augmenting the training set by slightly scaling, rotating, and shifting the images). Test accuracy was comparable between the networks trained using these two methods, but training with forward thinking was significantly faster, as described below.

As explained earlier, training using forward thinking means that we started with a network of only one hidden layer of 150 neurons. After that layer was trained, the data was pushed through the trained layer to produce a new 150-dimensional synthetic data set. Then we trained a new hidden layer of 100 neurons on the new data set, and then repeated the process for a hidden layer of 50 neurons. This network used a 10-neuron softmax output layer and a categorical cross-entropy loss function.

To have a benchmark for forward thinking, we trained a DNN of identical architecture in the conventional way, by optimizing all of the parameters at once with backpropagation. We tuned hyperparameters such as learning rate and regularization constants separately for the forward-thinking-trained DNN and the traditionally trained DNN so as to maximize the performance of each and provide a fair comparison.

Figure 2: A comparison of the training and test accuracy per epoch of a fully-connected neural network trained using forward thinking (thicker, green) and traditional backpropagation (thinner, blue). Notice that (a) forward thinking fit the network quickly and precisely to the training data, while training with backpropagation leveled off at lower accuracy. The brief dips in training accuracy for forward thinking occur when a new layer is added. Also, (b) the final testing accuracy was similar for both methods, with backprop retaining a slight edge, but the time to train each epoch and the overall training time were both much shorter for forward thinking.

The forward thinking network fit itself quickly and precisely to the training data, while training with backpropagation leveled off, as shown in Figure 2(a). This suggests that more can be done to prevent overfitting in forward thinking networks. The forward thinking network suffered dips in training performance when adding new layers, but quickly recovered.

On the same machine, overall training time for forward thinking was about 30% faster than backpropagation. This speedup occurred in spite of the fact that both were trained using libraries optimized for backpropagation. We expect that custom code would increase this speed advantage.

Testing accuracy was similar for both methods, with backpropagation retaining a slight edge, as shown in Figure 2(b). This reinforces the idea that anti-overfitting methods for forward thinking nets could be improved.

Building a convolutional network

We can also build convolutional networks with forward thinking. In this case we start with two hidden layers: one convolutional and one fully-connected. At each subsequent iteration we add a new convolutional layer before the fully connected layer at the end. We freeze the previous convolutional layer at each step but do not freeze the fully connected layer. Convolutional tools such as max pooling can also be used in this process.

Figure 3: The first three iterations of a convolutional network built with the forward thinking algorithm. The original data set is represented by an ellipse, convolutional layers with diamonds, fully-connected layers with rectangles, and final (output) layers with triangles. Layers with single blue outlines are trainable, while those with double black outlines have been frozen and thus turned into new data sets.
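For the convolutional variant, each iteration trains a fresh convolutional layer together with an unfrozen fully-connected head, then freezes the new filters and pushes the images through them. The sketch below is our own hedged illustration of one such iteration, not the paper's code: the filter count, pooling, dropout rates, and the choice to map the data forward through the convolution alone (rather than through the pooling as well) are all illustrative assumptions, and `x_imgs` is assumed to hold the images with shape (N, 28, 28, 1).

```python
# Sketch of one iteration of forward thinking for a convolutional network.
# Train (new conv layer + pooled, fully-connected head), freeze the conv layer,
# and push the image data through the frozen filters to form the next data set.
import numpy as np
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Dropout

def conv_iteration(x_imgs, y, n_filters, epochs=1):
    """x_imgs has shape (N, H, W, C); returns the data mapped through the new frozen filters."""
    net = Sequential([
        Input(shape=x_imgs.shape[1:]),
        Conv2D(n_filters, (3, 3), activation="relu", padding="same"),  # layer to be frozen
        MaxPooling2D((2, 2)),
        Dropout(0.25),                       # illustrative rate
        Flatten(),
        Dense(150, activation="relu"),       # fully-connected layer (not frozen)
        Dropout(0.5),                        # illustrative rate
        Dense(10, activation="softmax"),
    ])
    net.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    net.fit(x_imgs, y, epochs=epochs, batch_size=128, verbose=0)

    # Freeze the convolutional layer and map the images forward through it in batches.
    to_features = Model(inputs=net.input, outputs=net.layers[0].output)
    return to_features.predict(x_imgs, batch_size=256)

# Illustrative usage with small layers (the paper's CNN uses 256, 256, then 128 filters):
x_next = conv_iteration(x_imgs, y, 32)
x_next = conv_iteration(x_next, y, 32)
```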
Experiments with convolutional networks

We used both forward thinking and backpropagation to train a convolutional neural network on the augmented MNIST dataset (augmented, as before, with slightly rotated, shifted, and scaled versions of the original training images).

The underlying architecture of the network consists of two identical layers of 256 3x3 convolutions, with max pooling, followed by a layer of 128 3x3 convolutions, then a fully connected layer of 150 neurons, and then a final 10-class softmax layer (Softmax 10). We trained each network (forward thinking and backpropagation) for 100 epochs (complete passes through the data).

To train this using forward thinking, we first train 256 3x3 convolutions along with a fully-connected layer of 150 ReLU neurons (FC 150) and a final 10-class softmax layer (Softmax 10). For the second iteration, we begin by pushing the data through the 256 convolutions to create a new synthetic dataset. Using this transformed data, we train an identical network: 256 3x3 convolutions followed by FC 150 and Softmax 10. As before, we push the data through the newly learned filters. For both of these first iterations, we train with an aggressive learning rate for only one epoch. With our new dataset (the original data having been passed through both sets of 256 convolutions), we learn a similar network architecture: 128 3x3 Conv, FC 150, Softmax 10. In each of the three iterations, we use a 2x2 max pool and a dropout layer immediately before FC 150. Additionally, we include another dropout layer between FC 150 and Softmax 10.

As shown in Figure 4, the forward-thinking-trained network outperformed the identical CNN architecture trained using backpropagation. Notice that both the train and test accuracy for the forward thinking net quickly attain a level that the backpropagation net never reaches. In fact, our forward thinking CNN achieves near state-of-the-art performance (single classifier) of 99.72% accuracy. At the time of this writing, this was the 5th ranked result according to [1].

Figure 4: A comparison of the training and test accuracy per minute of training time of a convolutional neural network trained using forward thinking (thicker, green) and traditional backpropagation (thinner, blue). We have plotted accuracy against time rather than against epoch, so the two curves do not span the same horizontal length. Notice that both (a) the train and (b) test accuracy for forward thinking quickly attain a level that backpropagation never reaches.

Our experiments were run on a single desktop with an Intel i5-7400 processor and an Nvidia GeForce GTX 1060 3GB GPU. We ran both our forward thinking neural network and the backprop neural network for 100 epochs. Our forward thinking neural network trained at a rate of 24 sec per epoch. Traditional backprop took 53 sec per epoch. However, this doesn't properly illustrate how much faster forward thinking should be. Because the computations above were done using libraries optimized for backpropagation, there was still a lot of unnecessary overhead. Improvements to the implementation should make forward thinking many times faster than backpropagation, and the advantage should grow with the depth of the network.

Related work

Goodfellow et al. outline two methods of transferring knowledge from one network to another that is deeper, wider, or both [3]. Wei et al. transform one network into another with a different architecture and present mathematical formulas for doing so [17, 16].
As in the work of Goodfellow et al., this allows a new network to pick up where a previous one leveled off and improve from there. Oquab et al. take a neural network trained to classify one set of images and transfer the mid-level representations learned by its convolutional layers to a new network to be used on new images [11]. Even when two image sets are quite different, starting the second network off with these representations increases its performance.

Although similar to these knowledge-transfer methods in some ways, forward thinking differs in that instead of training one network with backpropagation, transferring its knowledge to a deeper network, and then retraining the entire new network with backpropagation, forward thinking builds a network one layer at a time, from scratch, and does not retrain previously trained layers. At each stage, we could frame the process of adding a new layer as transferring knowledge from one network to a deeper one, but freezing old layers gives significant benefits in speed and resistance to overfitting.
Cascade correlation was an early neural network algorithm that effectively let neural networks design themselves by adding and training a single neuron at a time [4]. New neurons were added alongside the features of the data set, much as kernels are added on as new features in other machine learning models such as support vector machines. Cascade correlation can be considered as an early indication of the potential of forward thinking.

Rather than train a single neuron at a time, we train layers, and rather than feeding old data to new layers, we only train new layers on the new, synthetic data from the previous layer.
Many others have proposed or used various greedy methods for pretraining deep neural networks [13, 12, 5, 2, 8, 18]. We note that [2] and [8] did some experiments training networks in a greedy fashion similar to forward thinking and saw poor performance. Presumably because of those poor preliminary results, others who have used greedy training methods have used them only for initializing networks that are subsequently trained using backpropagation.
Reproducibility
All of the code used to produce our results is available in our github repository at https://github.com/tkchris93/ForwardThinking.
Acknowledgments
This work was supported in part by the National Science Foundation, Grant Numbers 1323785 and 1564502, and by the Defense Threat Reduction Agency, Grant Number HDRTA1-15-0049.
References

[1] Rodrigo Benenson. What is the class of this image? http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html. Accessed: 2017-05-19.

[2] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 153–160. MIT Press, 2007.

[3] Tianqi Chen, Ian J. Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowledge transfer. CoRR, abs/1511.05641, 2015.

[4] Scott E. Fahlman and Christian Lebiere. The cascade-correlation learning architecture. In Advances in Neural Information Processing Systems 2, pages 524–532. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1990.

[5] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. CoRR, abs/1502.01852, 2015.

[7] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.

[8] Hugo Larochelle, Yoshua Bengio, Jérôme Louradour, and Pascal Lamblin. Exploring strategies for training deep neural networks. J. Mach. Learn. Res., 10:1–40, June 2009.

[9] Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/. Accessed: 2017-05-19.

[10] Kevin Miller, Chris Hettinger, Jeffrey Humpherys, Tyler Jarvis, and David Kartchner. Forward thinking: Building deep random forests. 2017.

[11] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1717–1724, June 2014.

[12] Saikat Roy, Nibaran Das, Mahantapas Kundu, and Mita Nasipuri. Handwritten isolated Bangla compound character recognition: A new benchmark using a novel deep learning approach. Pattern Recognition Letters, 90:15–21, 2017.

[13] Diego Rueda-Plata, Raúl Ramos-Pollán, and Fabio A. González. Supervised Greedy Layer-Wise Training for Deep Convolutional Networks with Small Datasets, pages 275–284. Springer International Publishing, Cham, 2015.

[14] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA, October 2013. Association for Computational Linguistics.

[15] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 1139–1147. JMLR Workshop and Conference Proceedings, May 2013.

[16] Tao Wei, Changhu Wang, and Chang Wen Chen. Modularized morphing of neural networks. CoRR, abs/1701.03281, 2017.

[17] Tao Wei, Changhu Wang, Yong Rui, and Chang Wen Chen. Network morphism. CoRR, abs/1603.01670, 2016.

[18] Ke Wu and Malik Magdon-Ismail. Node-by-node greedy deep learning for interpretable features.