Piecewise Linear Multilayer Perceptrons and Dropout
Ian J. Goodfellow
Université de Montréal
Abstract
We propose a new type of hidden layer for a multilayer perceptron, and demonstrate that it obtains the best reported performance for an MLP on the MNIST dataset.
We propose to use a specific kind of piecewise linear function as the activation function for a multilayer perceptron.

Specifically, suppose that the layer receives as input a vector $x \in \mathbb{R}^D$. The layer then computes the presynaptic output $z = x^\top W + b$, where $W \in \mathbb{R}^{D \times N}$ and $b \in \mathbb{R}^N$ are learnable parameters of the layer. We propose to have each layer produce output via the activation function $h(z)_i = \max_{j \in S_i} z_j$, where $S_i$ is a different non-empty set of indices into $z$ for each $i$.

This function provides several benefits:

• It is similar to the rectified linear units (Glorot et al., 2011), which have already proven useful for many classification tasks.

• Unlike rectifier units, every unit is guaranteed to have some of its parameters receive some training signal at each update step. This is because the inputs $z_j$ are only compared to each other, and not to 0, so one of them is always guaranteed to be the maximal element through which the gradient flows. In the case of rectified linear units, there is only a single element $z_j$ and it is compared against 0; when $0 > z_j$, $z_j$ receives no update signal.

• Max pooling over groups of units allows the features of the network to easily become invariant to some aspects of their input. For example, if a unit $h_i$ pools (takes the max) over $z_1$, $z_2$, and $z_3$, and $z_1$, $z_2$, and $z_3$ respond to the same object in three different positions, then $h_i$ is invariant to these changes in the object's position. A layer consisting only of rectifier units can't take the max over features like this; it can only take their average.

• Max pooling can reduce the total number of parameters in the network. If we pool with non-overlapping receptive fields of size $k$, then $h$ has size $N/k$, and the next layer has its number of weight parameters reduced by a factor of $k$ relative to if we did not use max pooling. This makes the network cheaper to train and evaluate, but also more statistically efficient.

• This kind of piecewise linear function can be seen as letting each unit $h_i$ learn its own activation function. Given large enough sets $S_i$, $h_i$ can implement increasingly complex convex functions of its input. This includes functions that are already used in other MLPs, such as the rectified linear function and absolute value rectification.

We used $S_i = \{5i, 5i+1, \ldots, 5i+4\}$ (indexing from zero) in our experiments. In other words, the activation function consists of max pooling over non-overlapping groups of five consecutive pre-synaptic inputs (a minimal code sketch is given below).

We apply this activation function to the multilayer perceptron trained on MNIST by Hinton et al. (2012). This MLP uses two hidden layers of 1200 units each. In our setup, the presynaptic activation $z$ has size 1200, so the pooled output of each layer has size 240. The rest of our training setup remains unchanged apart from adjustments to the hyperparameters.
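For concreteness, the following is a minimal numpy sketch of the proposed layer with non-overlapping pooling groups of size k. The function name, the random initialization, and the batch size are illustrative assumptions rather than details of our actual training code; only the pooling scheme itself is the method described above.

```python
import numpy as np

def piecewise_linear_layer(x, W, b, k=5):
    """Forward pass of the proposed layer.

    x: (batch, D) input; W: (D, N) weights; b: (N,) biases, with N divisible by k.
    Each output unit h_i is the max over the non-overlapping group
    S_i = {k*i, k*i + 1, ..., k*i + (k - 1)} of presynaptic inputs z_j.
    """
    z = x @ W + b                                   # presynaptic output, shape (batch, N)
    batch, N = z.shape
    return z.reshape(batch, N // k, k).max(axis=2)  # pooled output, shape (batch, N // k)

# Sizes matching the MNIST experiment above: each hidden layer has N = 1200
# presynaptic units pooled in groups of k = 5, so its output has size 240, and
# the next layer's weight matrix is 240 x 1200 rather than 1200 x 1200 -- the
# factor-of-k reduction in weight parameters noted in the list of benefits.
rng = np.random.default_rng(0)
x = rng.standard_normal((32, 784))                  # a batch of 32 flattened MNIST digits
W = 0.01 * rng.standard_normal((784, 1200))
b = np.zeros(1200)
h = piecewise_linear_layer(x, W, b, k=5)
print(h.shape)                                      # (32, 240)
```

Restricting each $S_i$ to consecutive indices is only a convenience of this sketch; any partition of the $N$ presynaptic units into non-empty groups fits the definition above.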
Hinton et al. (2012) report 110 errors on the test set. To our knowledge, this is the best published result on the MNIST dataset for a method that uses neither pretraining nor knowledge of the input geometry.

It is not clear how Hinton et al. (2012) obtained a single test-set number. We train on the first 50,000 training examples, using the last 10,000 as a validation set, and use the misclassification rate on the validation set to determine at what point to stop training. We then record the log likelihood on the first 50,000 examples and continue training, now using the full 60,000-example training set. When the log likelihood of the validation set first exceeds the recorded value of the training-set log likelihood, we stop training the model and evaluate its test-set error (a sketch of this protocol is given below). Using this approach, our trained model made 94 mistakes on the test set. We believe this is the best-ever result on MNIST that does not use pretraining or knowledge of the input geometry.
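For concreteness, the stopping protocol above can be sketched as follows. The helper callables (`train_one_epoch`, `error_rate`, `log_likelihood`) are hypothetical stand-ins for our actual training code, and the patience-based criterion in the first stage is only one possible way to decide when the validation misclassification rate has stopped improving.

```python
def two_stage_training(model, train_one_epoch, error_rate, log_likelihood,
                       first_50k, last_10k, full_60k, patience=100):
    """Sketch of the two-stage stopping protocol described above."""
    # Stage 1: train on the first 50,000 examples; stop once the validation
    # misclassification rate has not improved for `patience` epochs.
    best_err, epochs_since_best = float("inf"), 0
    while epochs_since_best < patience:
        train_one_epoch(model, first_50k)
        err = error_rate(model, last_10k)
        if err < best_err:
            best_err, epochs_since_best = err, 0
        else:
            epochs_since_best += 1

    # Record the log likelihood on the first 50,000 training examples.
    target = log_likelihood(model, first_50k)

    # Stage 2: continue training, now on the full 60,000-example set, until the
    # validation-set log likelihood first exceeds the recorded training-set value.
    while log_likelihood(model, last_10k) < target:
        train_one_epoch(model, full_60k)
    return model  # evaluate test-set error on the returned model
```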
References
Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In