Fast, simple and accurate handwritten digit classification by training shallow neural network classifiers with the 'extreme learning machine' algorithm
Mark D. McDonnell, Migel D. Tissera, Tony Vladusich, André van Schaik, Jonathan Tapson
Computational and Theoretical Neuroscience Laboratory, Institute for Telecommunications Research, University of South Australia, SA 5095, Australia
Biomedical Engineering and Neuroscience Group, The MARCS Institute, The University of Western Sydney, Australia
(Dated: July 23, 2015)

Recent advances in training deep (multi-layer) architectures have inspired a renaissance in neural network use. For example, deep convolutional networks are becoming the default option for difficult tasks on large datasets, such as image and speech recognition. However, here we show that error rates below 1% on the MNIST handwritten digit benchmark can be replicated with shallow non-convolutional neural networks. This is achieved by training such networks using the 'Extreme Learning Machine' (ELM) approach, which also enables a very rapid training time (about 10 minutes). Adding distortions, as is common practice for MNIST, reduces error rates even further. Our methods are also shown to be capable of achieving less than 5.5% error rates on the NORB image database. To achieve these results, we introduce several enhancements to the standard ELM algorithm, which individually and in combination can significantly improve performance. The main innovation is to ensure that each hidden unit operates only on a randomly sized and positioned patch of each image. This form of random 'receptive field' sampling of the input ensures the input weight matrix is sparse, with about 90% of weights equal to zero. Furthermore, combining our methods with a small number of iterations of a single-batch backpropagation method can significantly reduce the number of hidden units required to achieve a particular performance. Our close to state-of-the-art results for MNIST and NORB suggest that the ease of use and accuracy of the ELM algorithm for designing a single-hidden-layer neural network classifier should cause it to be given greater consideration, either as a standalone method for simpler problems, or as the final classification stage in deep neural networks applied to more difficult problems.
I. INTRODUCTION
The current renaissance in the field of neural networks is a direct result of the success of various types of deep network in tackling difficult classification and regression problems on large datasets. It may be said to have been initiated by the development of Convolutional Neural Networks (CNN) by LeCun and colleagues in the late 1990s [1] and to have been given enormous impetus by the work of Hinton and colleagues on Deep Belief Networks (DBN) during the last decade [2]. It would be reasonable to say that deep networks are now considered to be a default option for machine learning on large datasets.

The initial excitement over CNN and DBN methods was triggered by their success on the MNIST handwritten digit recognition problem [1], which was for several years the standard benchmark problem for hard, large-dataset machine learning. A high accuracy on MNIST is regarded as a basic requirement for credibility in a classification algorithm. Both CNN and DBN methods were notable, when first published, for posting the best results up to that respective time on the MNIST problem.

The standardised MNIST database consists of 70,000 images, each of size 28 by 28 greyscale pixels [3]. There is a standard set of 60,000 training images and a standard set of 10,000 test images, and numerous papers report results of new algorithms applied to these 10,000 test images, e.g. [1, 4-9].

In this report, we introduce variations of the Extreme Learning Machine algorithm [10] and report their performance on the MNIST test set. These results are equivalent or superior to the original results achieved by CNN and DBN methods on this problem, and are achieved with significantly lower network and training complexity. This poses the important question as to whether the ELM training algorithm should be a more popular choice for this type of problem, and a more commonplace algorithm as a first step in machine learning.

Table I summarises our results, and shows some comparison points with results obtained by other methods in the past (note that only previous results that do not use data augmentation methods are shown, and only one of our new results is for such a case). Our new results surpass results using earlier deep networks, but recent regularisation methods such as drop connect [6], stochastic pooling [7], dropout [8] and so-called 'deeply supervised networks' [9] have enabled deep convolutional networks to set new state-of-the-art performance for MNIST for the case where no data augmentation is used. Nevertheless, our best result for a much simpler single-hidden-layer neural network classifier trained using the very fast ELM algorithm, and without using data augmentation, is within just 41 errors out of 10,000 test images of the state-of-the-art.

A. The Extreme Learning Machine: Notation and Training
The Extreme Learning Machine (ELM) training algorithm [10] is relevant for a single hidden layer feedforward network (SLFN), similar to a standard neural network. However, there are three key departures from conventional SLFNs. These are (i) that the hidden layer is frequently very much larger than in a neural network trained using backpropagation; (ii) that the weights from the input to the hidden layer neurons are randomly initialised and are fixed thereafter (i.e., they are not trained); and (iii) that the output neurons are linear rather than sigmoidal in response, allowing the output weights to be solved by least squares regression. These attributes have also been combined in learning systems several times previously [11-14].

The standard ELM algorithm can provide very good results in machine learning problems requiring classification or regression (function optimization); in this paper we demonstrate that it provides an accuracy on the MNIST problem superior to prior reported results for similarly-sized SLFN networks [1, 15].

We begin by introducing three parameters that define the dimensions of an ELM used as an N-category classifier: L is the dimension of input vectors, M is the number of hidden layer units, and N is the number of distinct labels for training samples. For the case of classifying P test vectors, it is convenient to define the following matrices:

• X_test, of size L × P, is formed by setting each column to equal a single test vector.
• Y_test, of size N × P, numerically represents the prediction vectors of the classifier.

To map from input vectors to network outputs, two weight matrices are required:

• W_in, of size M × L, is the input weight matrix that maps length-L input vectors to length-M hidden-unit inputs.
• W_out, of size N × M, contains the output weights that project from the M hidden-unit activations to a length-N class prediction vector.

We also introduce matrices to describe inputs and outputs to/from the hidden units:

• D_test := W_in X_test, of size M × P, contains the linear projections of the input vectors that are the inputs to each of the M hidden units. A bias for each hidden unit can be added by expanding the size of the input dimension from L to L + 1, setting the additional input element to always be unity for all training and test data, and including the bias values as an additional column in W_in.
• A_test, of size M × P, contains the hidden-unit activations that occur due to each test vector, and is given by

    A_test := f(D_test),    (1)

where f(.) is shorthand notation for the fact that each element of D_test is nonlinearly converted term-by-term to the corresponding element of A_test. For example, if the hidden unit response is given by the logistic sigmoid function, then

    (A_test)_{i,j} = f((D_test)_{i,j}) = 1 / (1 + exp(-(D_test)_{i,j})).    (2)

Many nonlinear activation functions can be equally effective, such as the rectified linear unit (ReLU) function [16], the absolute value function or the quadratic function. As with standard artificial neural networks, the utility of the nonlinearity is that it introduces hidden-unit responses that represent correlations or 'interactions' between input elements, rather than simple linear combinations of them.

The overall conversion of test data to prediction vectors can be written as

    Y_test = W_out f(W_in X_test).    (3)
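As a concrete illustration of Eqs. (1)-(3), the forward pass amounts to one matrix multiplication, an elementwise nonlinearity, and a second matrix multiplication. The following NumPy sketch is purely illustrative (the function and variable names are ours, and it is not the implementation used for the experiments reported here); it computes prediction vectors for a batch of test vectors stored as columns.

import numpy as np

def elm_forward(W_in, W_out, X):
    """Map L x P input columns to N x P prediction columns, as in Eq. (3).

    W_in  : M x L input weight matrix (fixed, never trained)
    W_out : N x M output weight matrix (solved by least squares regression)
    X     : L x P matrix whose columns are input vectors
    """
    D = W_in @ X                   # M x P linear projections, D = W_in X
    A = 1.0 / (1.0 + np.exp(-D))   # logistic hidden-unit activations, Eq. (2)
    return W_out @ A               # N x P class prediction vectors, Eq. (3)

The predicted class of each input is then simply the row index of the largest entry in the corresponding column of the output.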
We now describe the ELM training algorithm. We introduce K to denote the number of training vectors available. It is convenient to introduce the following matrices that are relevant for training an ELM: X_train, of size L × K, A_train, of size M × K, and Y_train = W_out A_train, of size N × K, defined analogously to X_test, A_test and Y_test above. We also introduce Y_label, of size N × K, which numerically represents the label of each training vector; it is convenient to define this such that each column has a 1 in a single row, and all other entries are zero. The only 1 entry in each column occurs in the row corresponding to the label class of that training vector.

Ideally we seek to find W_out that satisfies

    Y_label = W_out A_train.    (4)

However, the number of unknown variables in W_out is NM, and the number of equations is NK. Although an exact solution potentially exists if M = K, it is usually the case that M < K (i.e., there are many more training samples than hidden units), so that the system is overdetermined. The usual approach, then, is to seek the solution that minimises the mean square error between Y_label and Y_train. This is a standard least squares regression problem, for which the exact solution is W_out = Y_label A_train^T (A_train A_train^T)^{-1}, assuming that the inverse exists (in practice it usually does).

It can also be useful to regularise the problem to reduce overfitting, by ensuring that the weights of W_out do not become large. The standard ridge-regression approach [17] produces the following closed-form solution for the output weights:

    W_out = Y_label A_train^T (A_train A_train^T + c I)^{-1},    (5)

where I is the M × M identity matrix, and c can be optimised using cross-validation techniques. As is discussed in more detail below, we have found QR decomposition [18] to be the most effective method for solving for W_out.
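For reference, a minimal NumPy sketch of the ridge-regression solution of Eq. (5) is given below, written as a linear system (see also Section III F) rather than via an explicit inverse. The function name, the column-wise data layout, and the default value of c are illustrative assumptions; this is not the Matlab implementation used for the experiments in this paper.

import numpy as np

def solve_output_weights(A_train, Y_label, c=1e-6):
    """Solve Y_label A^T = W_out (A A^T + c I) for W_out, i.e. Eq. (5).

    A_train : M x K hidden-unit activations for the K training vectors
    Y_label : N x K one-hot label matrix
    c       : ridge regularisation parameter (placeholder value; choose by cross-validation)
    """
    M = A_train.shape[0]
    G = A_train @ A_train.T + c * np.eye(M)   # M x M, symmetric
    B = Y_label @ A_train.T                   # N x M
    # W_out = B G^{-1}; solve the linear system instead of forming the inverse.
    return np.linalg.solve(G, B.T).T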
II. FASTER AND MORE ACCURATE PERFORMANCE BY SHAPING THE INPUT WEIGHTS NON-RANDOMLY

In the conventional ELM algorithm, the input weights are randomly chosen, typically from a continuous uniform distribution on the interval [-1, 1] [19], but we have found that other distributions, such as bipolar binary values from {-1, 1}, are equally effective.

Beyond such simple randomisation of the input weights, small improvements can be made by ensuring the rows of W_in are as mutually orthogonal as possible [20]. This cannot be achieved exactly unless M ≤ L, but simple random weights typically produce dot products of distinct rows of W_in that are close to zero, albeit not exactly zero, while the dot product of each row with itself is always much larger than zero. In addition, it can be beneficial to normalise the length of each row of W_in, as occurs in the orthogonal case.

In contrast, we can also aim to find weights that, rather than being selected from a random distribution, are instead chosen to be well matched to the statistics of the data, with the hope that this will improve generalisation of the classifier. Ideally we do not want to have to learn these weights, but rather just form the weights as a simple function of the data.

Here we focus primarily on improving the performance of the ELM algorithm by biasing the selection of input layer weights in six different ways, several of which were recently introduced in the literature, and several of which are novel in this paper. These methods are as follows:

1. Select input layer weights that are random, but biased using the training data samples, so that the dot product between weights and training data samples is likely to be large. This is called Computed Input Weights ELM (CIW-ELM) [15].

2. Ensure input weights are constrained to a set of difference vectors of between-class samples in the training data. This is called Constrained ELM (C-ELM) [21].

3. Restrict the weights for each hidden layer neuron to be non-zero only for a small, random rectangular patch of the input visual field; we call this Receptive Field ELM (RF-ELM). Although we believe this method to be new to ELM approaches, it is inspired by other machine learning approaches that aim to mimic cortical neurons that have limited visual receptive fields, such as convolutional neural networks [22].

4. Combine RF-ELM with CIW-ELM, or RF-ELM with C-ELM; we show below that the combination is superior to any of the three methods individually.

5. Pass the results of an RF-CIW-ELM and an RF-C-ELM into a third standard ELM (thus producing a two-layer ELM system); we show below that this gives the best overall performance of all methods considered in this paper.

6. Apply the backpropagation method of [23]. This method highlights that the performance of an ELM can be enhanced by adjusting all input layer weights simultaneously, based on all training data. The output layer weights are maintained in their least-squares optimal state by recalculating them after input layer backpropagation updates. The process of backpropagation updating of the input weights followed by standard ELM recalculation of the output weights can be repeated iteratively until convergence.

RF-ELM and its combinations with CIW-ELM and C-ELM, and the two-layer ELM, are reported here for the first time. We demonstrate below that each of these methods independently improves the performance of the basic ELM algorithm, and in combination they produce results equivalent to many deep networks on the MNIST problem [1, 2]. First, however, we describe each method in detail.

A. Computed Input Weights for ELM
The CIW-ELM approach is motivated by considering the standard backpropagation algorithm [24]. A feature of weight-learning algorithms is that they operate by adding to the weights some proportion of the training samples, or a linear sum or difference of training samples. In other words, apart from a possible random initialization, the weights are constrained to take final values which are drawn from a space defined in terms of linear combinations of the input training data as basis vectors (see Fig. 1). While it has been argued, not without reason, that it is a strength of ELM that it is not thus constrained [25], the use of this basis as a constraint on input weights will bias the ELM network towards a conventional (backpropagation) solution.

The CIW-ELM algorithm is as follows [15] (an illustrative code sketch follows the list):

1. For use in the following steps only, normalize all training data by subtracting the mean over all training points and dimensions and then dividing by the standard deviation.

2. Divide the M hidden layer neurons into N blocks, one for each of the N output classes; for data sets where the numbers of training data samples for each class are equal, the block size is M_n = M/N. We denote the number of training samples per class as K_n, n = 1, ..., N. If the training data sets for each class are not of equal size, the block sizes can be adjusted to be proportional to the data set sizes.

3. For each block, generate a random sign (±1) matrix, R_n, of size M_n × K_n.

4. Multiply R_n by the transpose of the input training data set for that class, X_train,n^T, to produce M_n × L summed inner products, which are the weights for that block of hidden units.

5. Concatenate these N blocks of weights, one per class, into the M × L input weight matrix W_in.

6. Normalize each row of the input weight matrix, W_in, to unity length.

7. Solve for the output weights of the ELM using the standard ELM methods described above.
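The following NumPy sketch follows the seven steps above. The column-wise data layout matches the notation of Section I A, but the function name, argument names and the assumption that M is divisible by N are ours; it is an illustrative sketch, not the Matlab code used for the reported experiments.

import numpy as np

def ciw_input_weights(X_train, labels, M, n_classes, seed=0):
    """Computed Input Weights (CIW-ELM): rows of W_in are random-sign sums of class samples.

    X_train   : L x K training matrix (columns are samples)
    labels    : length-K integer array of class labels in {0, ..., n_classes - 1}
    M         : total number of hidden units (assumed divisible by n_classes here)
    n_classes : number of classes, N
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    # Step 1: normalise over all training points and dimensions (used only below).
    Xn = (X_train - X_train.mean()) / X_train.std()
    M_n = M // n_classes                              # Step 2: equal block sizes
    blocks = []
    for n in range(n_classes):
        X_class = Xn[:, labels == n]                  # L x K_n samples of class n
        K_n = X_class.shape[1]
        R_n = rng.choice([-1.0, 1.0], size=(M_n, K_n))   # Step 3: random-sign matrix
        blocks.append(R_n @ X_class.T)                # Step 4: M_n x L summed inner products
    W_in = np.vstack(blocks)                          # Step 5: concatenate into M x L
    W_in /= np.linalg.norm(W_in, axis=1, keepdims=True)  # Step 6: unit-length rows
    return W_in                                       # Step 7: solve W_out as usual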
B. Constrained Weights for ELM

Recently, Zhu et al. [21] published a method for constraining the input weights of ELM to the set of difference vectors of between-class samples. The difference vectors of between-class samples are the set of vectors connecting samples of one class with samples of a different class, in the sample space (see Fig. 1). In addition, a methodology is proposed for eliminating from this set the vectors of potentially overlapping spaces (effectively, the shorter vectors) and for reducing the use of near-parallel vectors, in order to more uniformly sample the weight space.

The Constrained ELM (C-ELM) algorithm we used is as follows [21] (a code sketch follows the list):

1. Randomly select M distinct pairs of training data such that:
(a) each pair comes from two distinct classes;
(b) the vector length of the difference between the pair is smaller than some constant, ε.

2. Set each row of the M × L input weight matrix W_in to be equal to the difference between each pair of randomly selected training data.

3. Set the bias for each hidden unit equal to the scalar product of the sum of each pair of randomly selected training data and the difference of each pair of randomly selected training data.

4. Normalize each row of the input weight matrix, W_in, and each bias value, by the length of the difference vector of the corresponding pair of randomly selected training data.

5. Solve for the output weights of the ELM using the standard ELM methods described above.
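A NumPy sketch of the above procedure is given below. The rejection-sampling loop and the bias sign and scaling conventions follow the textual description above and may differ by constant factors from the formulation in [21]; the names and the column-wise data layout are ours.

import numpy as np

def celm_input_weights(X_train, labels, M, eps, seed=0):
    """Constrained ELM (C-ELM): rows of W_in are between-class difference vectors.

    X_train : L x K training matrix (columns are samples)
    labels  : length-K integer array of class labels
    M       : number of hidden units
    eps     : maximum allowed length of a difference vector (step 1b);
              it must be large enough that qualifying pairs exist
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    L, K = X_train.shape
    W_in = np.zeros((M, L))
    bias = np.zeros(M)
    m = 0
    while m < M:
        i, j = rng.integers(0, K, size=2)
        if labels[i] == labels[j]:
            continue                              # step 1a: the pair must span two classes
        diff = X_train[:, i] - X_train[:, j]      # step 2: difference vector
        norm = np.linalg.norm(diff)
        if norm == 0.0 or norm >= eps:
            continue                              # step 1b: reject overly long differences
        summ = X_train[:, i] + X_train[:, j]
        W_in[m] = diff / norm                     # step 4: normalised weight row
        bias[m] = np.dot(summ, diff) / norm       # steps 3-4: bias from sum and difference
        m += 1
    return W_in, bias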
C. Receptive Fields for ELM

We have found that a data-blind (unsupervised) manipulation of the input weights improves generalization performance. The approach has the added bonus that the input weight matrix is sparse, with a very high percentage of zero entries, which could be advantageous for hardware implementations, or if sparse matrix storage methods are used in software.

The RF-ELM approach is inspired by neurobiology, and strongly resembles some other machine learning approaches [22]. Biological sensory neurons tend to be tuned with preferred receptive fields so that they receive input only from a subset of the overall input space. The region of responsiveness tends to be contiguous in some pertinent dimension, such as space for the visual and touch systems, and frequency for the auditory system. Interestingly, this contiguity aspect may be lost beyond the earliest neural layers, if features are combined randomly.

In order to loosely mimic this organisation of biological sensory systems, in this paper, where we consider only image classification tasks, for each hidden unit we create a randomly positioned and sized rectangular mask that is smaller than the overall image. These masks ensure that only a small subset of the length-L input data vectors influences any given hidden unit (see Fig. 1).

The algorithm for generating these 'receptive-field' masks is as follows (a code sketch follows the list):

1. Generate a random input weight matrix W (or instead start with a CIW-ELM or C-ELM input weight matrix).

2. For each of the M hidden units, select two pairs of distinct random integers from {1, 2, ..., √L} to form the coordinates of a rectangular mask.

3. If any mask has total area smaller than some value q, discard it and repeat.

4. Set the entries of a √L × √L square matrix that are defined by the two pairs of integers to 1, and all other entries to zero.

5. Flatten each receptive field matrix into a length-L vector in which each entry corresponds to the same pixel as the corresponding entry in the data vectors X_test or X_train.

6. Concatenate the resulting M vectors into a receptive field matrix F of size M × L.

7. Generate the ELM input weight matrix by finding the Hadamard product (term by term multiplication) W_in = F ∘ W.

8. Normalize each row of the input weight matrix, W_in, to unity length.

9. Solve for the output weights of the ELM using the standard ELM methods described above.

We have additionally found it beneficial to exclude pixels from the masks if most or all training images have identical values in those regions. For the MNIST database, this typically means ensuring all masks exclude the first and last 3 rows and the first and last 3 columns. For MNIST we have found that a reasonable value of the minimum mask size is q = 10 pixels, which still permits small masks (for example, 1 × 10).
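The mask generation is easy to express directly. The sketch below is our own illustrative code, assuming square images; it returns the matrix F of step 6, which is then applied as W_in = F ∘ W followed by row normalisation.

import numpy as np

def receptive_field_masks(M, side, q=10, border=3, seed=0):
    """Random rectangular 'receptive field' masks for RF-ELM.

    M      : number of hidden units (one mask per unit)
    side   : image side length, sqrt(L) (e.g. 28 for MNIST)
    q      : minimum mask area in pixels
    border : number of edge rows/columns excluded from masks (3 for MNIST, as above)
    Returns the M x (side*side) binary mask matrix F.
    """
    rng = np.random.default_rng(seed)
    F = np.zeros((M, side * side))
    lo, hi = border, side - border
    for m in range(M):
        while True:
            r = np.sort(rng.integers(lo, hi, size=2))   # row extent of the rectangle
            c = np.sort(rng.integers(lo, hi, size=2))   # column extent of the rectangle
            if (r[1] - r[0] + 1) * (c[1] - c[0] + 1) >= q:
                break                                   # discard masks smaller than q
        mask = np.zeros((side, side))
        mask[r[0]:r[1] + 1, c[0]:c[1] + 1] = 1.0
        F[m] = mask.ravel()                             # flatten to match the data vectors
    return F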
D. Combining RF-ELM with CIW-ELM and C-ELM

All three approaches described so far provide weightings of the pixels for each hidden layer unit. CIW-ELM and C-ELM weight the pixels to bias hidden units towards a larger response for training data from a specific class. The sparse weightings provided by RF-ELM bias hidden units to respond to pixels from specific parts of the image.

We have found that enhanced classification performance can be achieved by combining the shaped weights obtained by either CIW-ELM or C-ELM with the receptive field masks provided by RF-ELM. The algorithm for either RF-CIW-ELM or RF-C-ELM is as follows (a code sketch follows the list):

1. Follow the first 5 steps of the CIW-ELM algorithm or the first 2 steps of the C-ELM algorithm above, to obtain an un-normalized shaped input weight matrix, W_in,s.

2. Follow the first 6 steps of the RF-ELM algorithm to obtain a receptive field matrix, F.

3. Generate the ELM input weight matrix by finding the Hadamard product (term by term multiplication) W_in = F ∘ W_in,s.

4. Normalize each row of the input weight matrix, W_in, to unity length.

5. For RF-C-ELM, produce the biases according to steps 3 and 4 of the C-ELM algorithm, but use the masked difference vectors rather than the unmasked ones.

6. Solve for the output weights of the ELM using the standard ELM methods described above.
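Steps 3 and 4 of this combination amount to a single elementwise product followed by row normalisation, as in the sketch below (illustrative only; for RF-C-ELM the biases of step 5 would additionally be recomputed from the masked difference vectors, which is not shown).

import numpy as np

def masked_input_weights(W_shaped, F):
    """Combine shaped input weights (from CIW-ELM or C-ELM) with receptive-field masks.

    W_shaped : M x L un-normalised shaped input weight matrix, W_in,s
    F        : M x L binary receptive-field mask matrix
    """
    W_in = F * W_shaped                                  # Hadamard product, step 3
    norms = np.linalg.norm(W_in, axis=1, keepdims=True)
    return W_in / np.maximum(norms, 1e-12)               # unit-length rows, step 4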
E. Combining RF-C-ELM with RF-CIW-ELM in a two-layer ELM: RF-CIW-C-ELM

We have found that results obtained with RF-C-ELM and RF-CIW-ELM are similar in terms of error percentage when applied to the MNIST benchmark, but the errors follow different patterns. As such, a combination of the two methods seemed to offer promise. We have combined the two methods using a multiple-layer ELM which consists of an RF-C-ELM network and an RF-CIW-ELM network in parallel, as the first two layers. The outputs of these two networks are then combined using a further ELM network, which can be thought of as an ELM autoencoder, albeit one that has twenty input neurons and ten output neurons; the input neurons are effectively two sets of the same ten labels. The structure is shown in Fig. 2. The two input networks are first trained to completion in the usual way; the autoencoder layer is then trained using the outputs of the input networks as its input, and the correct labels as its outputs. The result of this second-layer network, which is very quick to implement (as it uses a hidden layer of typically only 500-1000 neurons), is significantly better than that of either input network (see Table I). Note that the middlemost layer shown in Fig. 2 consists of linear neurons, and therefore it can be removed by combining its input and output weights into one connecting weight matrix. However, it is computationally disadvantageous to do so, because the number of multiplications would increase.
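The second-stage network is itself just a small standard ELM whose inputs are the stacked prediction vectors of the two trained first-stage networks. A hedged NumPy sketch follows; the hidden-layer size, the uniform random input weights, and the regularisation value are illustrative assumptions rather than the settings used for the reported results.

import numpy as np

def train_combining_elm(Y1_train, Y2_train, Y_label, M2=500, c=1e-6, seed=0):
    """Train the second-stage ELM that merges two parallel ELM outputs (Fig. 2).

    Y1_train, Y2_train : N x K prediction matrices of the two trained first-stage
                         networks evaluated on the training set (N = 10 for MNIST)
    Y_label            : N x K one-hot labels
    M2                 : hidden-layer size of the combining ELM (typically 500-1000 here)
    """
    rng = np.random.default_rng(seed)
    Z = np.vstack([Y1_train, Y2_train])                      # 2N x K combiner inputs
    W_in2 = rng.uniform(-1.0, 1.0, size=(M2, Z.shape[0]))    # standard random input weights
    A2 = 1.0 / (1.0 + np.exp(-W_in2 @ Z))                    # combiner hidden activations
    G = A2 @ A2.T + c * np.eye(M2)
    W_out2 = np.linalg.solve(G, (Y_label @ A2.T).T).T        # ridge-regression output weights
    return W_in2, W_out2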
F. Fine Tuning by Backpropagation
In all of the variations of ELM described in this report, the hidden layer weights are not updated or trained according to the output error, and the output layer weights are solved using least squares regression. This considerably reduces the trainability of the network, as the number of free parameters is restricted to the NM output layer weights. It has been argued that any network in which the total number of weights in the output layer is less than the number of training points will likely be enhanced by using backpropagation to train weights in previous layers [26]. Hence, following the example of deep networks, and specific ELM versions of backpropagation [23], we have experimented with using backpropagation to fine-tune the hidden layer weights. This does re-introduce the possibility of overfitting, but that is a well-understood problem in neural networks and the usual methods for avoiding it apply here. For simplicity, a batch-mode backpropagation was implemented, using the following algorithm (a code sketch follows the list). Note that, as in Eqn. (2), we assume a logistic activation function in the hidden layer neurons, for which the derivative can be expressed as f' = f(1 - f).

1. Construct the ELM and solve for the output layer weights W_out as described above.

2. Perform iterative backpropagation as follows:

(a) Compute the error for the whole training set: E = Y_label - W_out A_train.

(b) Calculate the weight update, as derived by [23]:

    ΔW_in = ξ [(W_out^T E) ∘ (A_train - A_train ∘ A_train)] X_train^T,

where ξ is the learning rate, and ∘ indicates the Hadamard product (elementwise matrix multiplication).

(c) Update the weights: W_in = W_in - ΔW_in.

(d) Re-calculate A_train with the new W_in.

(e) Re-solve for W_out using least squares regression and continue.

(f) Repeat from step (a) for a desired number of iterations or until convergence.

As illustrated in the Results section, this process has shown a robust improvement on all of the SLFN ELM solutions tested here, provided learning rates which maintained stability were used.
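The iteration above can be written compactly in matrix form. The sketch below is illustrative: it assumes logistic hidden units and a ridge-regularised re-solve for W_out at every iteration, and its names, defaults and stopping rule (a fixed number of iterations) are ours rather than those of the Matlab implementation used for the experiments.

import numpy as np

def elm_backprop_finetune(W_in, X_train, Y_label, lr, n_iter=10, c=1e-6):
    """Single-batch fine-tuning of ELM input weights (Section II F).

    W_in    : M x L input weights (e.g. produced by one of the shaping methods)
    X_train : L x K training data
    Y_label : N x K one-hot labels
    lr      : learning rate xi (must be small enough to keep the iteration stable)
    """
    M = W_in.shape[0]

    def solve_out(A):
        G = A @ A.T + c * np.eye(M)
        return np.linalg.solve(G, (Y_label @ A.T).T).T       # least squares / ridge solve

    for _ in range(n_iter):
        A = 1.0 / (1.0 + np.exp(-W_in @ X_train))            # hidden activations, f
        W_out = solve_out(A)                                 # keep output weights optimal
        E = Y_label - W_out @ A                              # error on the whole training set
        # Update from [23]: dW_in = xi * [(W_out^T E) o (A - A o A)] X_train^T,
        # using f' = f (1 - f) for the logistic nonlinearity.
        dW_in = lr * ((W_out.T @ E) * (A - A * A)) @ X_train.T
        W_in = W_in - dW_in                                  # sign convention as in step (c)
    A = 1.0 / (1.0 + np.exp(-W_in @ X_train))
    return W_in, solve_out(A)                                # final re-solve of W_out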
III. RESULTS AND DISCUSSION FOR THE MNIST BENCHMARK

A. SLFN with shaped input-weights

We trained ELMs using each of the six input-weight shaping methods described above, as well as a standard ELM with binary bipolar ({-1, 1}) random input weights plus row normalisation. Following the normalisation of the rows of the input weight matrix to unity, we multiplied the values of the entire input weight matrix by a factor of 2, for all seven methods, as this scaling was found to be close to optimal in most cases.

Our results are shown in Fig. 3. To obtain an indication of the variance resulting from the randomness inherent in each input-weight shaping method, we trained 10 ELMs using each method, and then plotted the ensemble mean as a function of hidden-layer size, M. We also plotted (see markers) the actual error rates for each trained network. It can be seen in Fig. 3A that the error rate decreases approximately linearly with the log of M for small M, before slowing as M approaches about 10^4. Fig. 3B shows the error rate when the actual training data is used. Since our best test results occur when the error rate on the training data is already very small (see Figs 3C and 3D), we conclude that increasing M further than shown here produces overfitting. This can be verified by cross-validation on the training data.

B. ELM with shaped input-weights and backpropagation
We also trained networks using each of the methods described in the previous section, plus 10 iterations of ELM-backpropagation with a fixed learning rate ξ. As is evident from the training-set error rates for large M (Figure 4B), this use of backpropagation is far from optimal, and does not give converged results. Nevertheless, as shown in Figure 4A, in comparison with Figure 3A, these 10 iterations at a fixed learning rate still provide a significant improvement in the error rate for small M. On the other hand, the improvement for M = 12800 is minimal, which is not surprising given that the classification accuracy on the training data without using backpropagation is already well over 99% for this value of M (Figure 3B).

It is likely that we can obtain further enhancements of our error rates by optimising the backpropagation learning rate and increasing the number of iterations used. Moreover, several methods for accelerating convergence when carrying out backpropagation have been described previously [23], and we have not used those methods here. However, the best error rate reported previously for those methods applied to the MNIST benchmark was 1.45%, achieved with 2048 hidden units, and the best error rate for the backpropagation method we used is 3.73% [23]. Our results for larger M improve on both of these figures.

C. Runtime efficiency
In many applications, the time required to train a network is not considered as important as the time required for the network to classify new data. However, there do exist applications in which the statistics of the training data change rapidly, meaning retraining is required, or in which deployment of a trained classifier is required very rapidly after data gathering. For example, in financial, sports, or medical data analysis, deployment of a newly trained classifier can be required within minutes of acquiring data, or retraining may be required periodically, e.g. hourly. Hence, we emphasize in this paper the rapidity of training. The speed of testing is negligible in comparison.

The mean training runtime for each of our methods is shown in Figure 5. The times were obtained using Matlab running on a Macbook Pro with a 3 GHz Intel Core i7 (2 dual cores, for a total of 4 cores), running OS X 10.8.5 with 8 GB of RAM. The times plotted in Figure 5 are the total times for setup and training, excluding the time required to load the MNIST data into memory from files. The version of Matlab we used by default exploits all four CPU cores for matrix multiplication and least squares regression. Note that the differences in runtime between the methods are negligible, which is expected, since the most time-consuming part is the formation of the matrix A_train A_train^T. The time for testing was not included in the shown data; we found, predictably, that it scaled linearly with M, and was about 10 seconds for M = 12800.

Our largest single networks (M = 15000 hidden units) for RF-CIW-ELM or RF-C-ELM individually take on the order of 15 minutes total runtime and achieve ~99% correct classification on MNIST. In comparison, data tabulated previously for backpropagation shows at least 81 minutes is needed in order to achieve 98% accuracy, with a best result of 98.55% in 98 minutes [23]. In contrast, runtimes reported previously for the standard ELM algorithm (28 seconds for 2048 hidden units [23]) are comparable to ours (12 seconds for 1600 hidden units and less than 1 minute for 3200 hidden units). This illustrates that improving the error rate by shaping the input weights as we have done here has substantial benefits for runtime and error rate in comparison to backpropagation.
D. Distorting and pre-processing the training set
Many other approaches to classifying MNIST handwritten digits improve their error rates by preprocessing the images and/or by expanding the size of the training set by applying affine (translation, rotation and shearing) and elastic distortions of the standard set [4, 5, 27]. We have also experimented with distorting the training set to improve on the error rates reported here. For example, with 1- and 2-pixel affine translations, we are able to achieve error rates smaller than 0.8%. When we added random rotations, scalings, shears and elastic distortions, we achieved a best repeatable error rate of 0.62%, and an overall best error rate of 0.57%. However, adding distortions of the training set substantially increases the runtime, for two reasons. First, more training points generally require a larger hidden layer size; for example, when we increase the size of the training set by a factor of 10, we have found that we need a substantially larger M. Second, the runtime is dominated by the large matrix multiplication required to form A_train A_train^T, which grows with both M and the number of training points. At this stage, we have chosen not to systematically continue to improve the way in which we implement distortions so as to approach state-of-the-art MNIST results, but our preliminary results show that ELM training is capable of using such methods to enhance error rate performance, at the expense of a significant increase in runtime, as is expected with other, non-ELM, methods.

E. Results on NORB
We briefly present some results on a second well-known image classification benchmark: the NORB-small database [28]. This database consists of 48,600 stereo greyscale images from five classes; there is a standard set of 24,300 stereo images for training, and a standard set of 24,300 for testing. Each of the images in the two stereo channels for each sample is of size 96 × 96 pixels. Given the large size of each image relative to MNIST images, we preprocessed all images by spatially lowpass filtering with a 9 × 9 kernel and downsampling to 13 × 13 pixel images. We then contrast-normalised each image by subtracting its mean and dividing by its standard deviation.

Figure 6 shows results for the error rate (ten repeats and the ensemble mean are shown) on the test set from application of the RF-C-ELM method, as the number of hidden units increases. We set the minimum receptive field size to 1, and used a small fixed ridge regression parameter, c. Our test results peak at close to 95% correct, which is within 3% of the state-of-the-art [29], and superior to some results for deep convolutional networks [30].
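For concreteness, one possible implementation of this preprocessing for a single 96 × 96 channel is sketched below. The uniform (moving-average) filter and the zoom-based resampling from SciPy are stand-ins chosen by us; the exact lowpass filter and resampling scheme used for the reported NORB results are not specified beyond the kernel and output sizes.

import numpy as np
from scipy.ndimage import uniform_filter, zoom

def preprocess_norb_channel(img, filt_size=9, out_side=13):
    """Lowpass filter, downsample and contrast-normalise one 96 x 96 NORB channel."""
    img = img.astype(np.float64)
    smoothed = uniform_filter(img, size=filt_size)      # crude 9 x 9 spatial lowpass filter
    small = zoom(smoothed, out_side / img.shape[0])     # resample to roughly 13 x 13
    small = small[:out_side, :out_side]                 # guard against rounding in zoom
    return (small - small.mean()) / small.std()         # contrast normalisation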
F. Computationally efficient methods for ELM training: iterative methods for large training sets

In practice, it is generally computationally more efficient (and avoids other potential problems, such as those described in [31]) to avoid explicit calculation of matrix inverses or pseudo-inverses when solving linear sets of equations. The same principle applies when using the ELM training algorithm, and hence it is preferable to avoid explicit calculation of the inverse in Eqn. (5), and instead treat the following as a set of NM linear equations to be solved for NM unknown variables:

    Y_label A_train^T = W_out (A_train A_train^T + c I).    (6)

Fast methods for solving such equations exist, such as the QR decomposition method [18], which we used here. For large M, the memory and computational speed bottleneck in an implementation then becomes the large matrix multiplication, A_train A_train^T. However, there are simple methods that can still enable solution of Eqn. (6) when M is too large for this multiplication to be carried out in one calculation.

For example, when solving Equation (6) by implementation in MATLAB, it is computationally efficient to use the overloaded '\' function, which invokes the QR decomposition method. This approach can be used either for the inverse or the pseudo-inverse, but we have found it faster to solve (6), which requires the inverse rather than the pseudo-inverse. Well-known software packages such as MATLAB (which we used) automatically exploit the multiple CPU cores available in most modern PCs to speed up execution of this algorithm using multithreading. Alternative methods, such as explicitly calculating the pseudo-inverse or a singular value decomposition, are in comparison significantly (sometimes several orders of magnitude) slower.

When using the linear equation solution method, the main component of training runtime for large hidden-layer sizes becomes the large matrix multiplication required to obtain A_train A_train^T. There is clearly much potential for speeding up this simple but time-consuming operation, such as by using GPUs or other hardware acceleration methods.

The above text discusses the standard single-batch approach. There are also online and incremental ELM solutions for real-time and streaming operations and large data sets [26, 32-34]. The use of singular value decomposition offers some additional insight into network structure and further optimization [20]. Here we describe an iterative method that offers advantages in training, in which the output weight matrix need not be calculated more than once.

One potential drawback of following the standard ELM method of solving for the output weights using all training data in one batch is the large amount of memory that is potentially required. For example, with the MNIST training set of 60,000 images and a hidden layer size of M = 10000, the A_train matrix has 6 × 10^8 elements, which for double precision representations requires approximately 4.8 GB of memory. Instead, let us define length-M vectors d_j, j = 1, ..., K, formed from the columns of A_train. Then one of the two key terms in Eqn. (6) can be expressed as

    A_train A_train^T = sum_{j=1}^{K} d_j d_j^T.    (7)

That is, the matrix that describes the correlations between the activations of each hidden unit is just the sum of the outer products of the hidden-unit activations in response to all training data. Similarly, we can simplify the other key term in Eqn. (6) by introducing the length-N vectors y_j, j = 1, ..., K, to represent the K columns of Y_label, and writing

    Y_label A_train^T = sum_{j=1}^{K} y_j d_j^T.    (8)

In this way, the M × M matrix A_train A_train^T and the N × M matrix Y_label A_train^T can be formed from K training points without the need to keep the A_train matrix in memory, and once these are formed, the least squares solution method can be applied. The matrix A_train A_train^T still requires a large amount of memory (M = 12000 requires over 1 GB of RAM), but using this method the number of training points can be greatly expanded and incur only a runtime cost.
In practice, rather than forming the sums from K individual training points, it is more efficient to form batches of subsets of training points and then accumulate the sums; the size of each batch is determined by the maximum RAM available.

It is important to emphasise that, unlike other iterative methods for training ELMs that update the output weights iteratively [26, 32-34], the approach described here only iteratively updates A_train A_train^T.
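A sketch of this batched accumulation is given below (illustrative NumPy; the names are ours): the full A_train is never formed, only the two accumulated matrices of Eqs. (7) and (8), from which W_out is solved once at the end.

import numpy as np

def accumulate_and_solve(batches, W_in, c=1e-6):
    """Solve for W_out without holding the full A_train in memory (Eqs. (6)-(8)).

    batches : iterable of (X_batch, Y_batch) pairs; X_batch is L x k (columns are
              training vectors) and Y_batch is the matching N x k one-hot label matrix
    W_in    : M x L fixed input weight matrix
    """
    M = W_in.shape[0]
    AAt = np.zeros((M, M))          # running sum of d_j d_j^T, Eq. (7)
    YAt = None                      # running sum of y_j d_j^T, Eq. (8)
    for X_batch, Y_batch in batches:
        A = 1.0 / (1.0 + np.exp(-W_in @ X_batch))    # activations for this batch only
        AAt += A @ A.T
        YAt = Y_batch @ A.T if YAt is None else YAt + Y_batch @ A.T
    # Solve Y_label A^T = W_out (A A^T + c I), as in Eq. (6).
    return np.linalg.solve(AAt + c * np.eye(M), YAt.T).T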
IV. CONCLUSIONS

We have shown here that simple SLFNs are capable of achieving the same accuracy as deep belief networks [2] and convolutional neural networks [1] on one of the canonical benchmark problems in deep learning: image classification. The most accurate networks we consider here use a combination of several non-iterative learning methods to define a projection from the input space to a hidden layer. The hidden layer output is then solved simply using least squares regression applied to a single batch of all training data to find the weights for a linear output layer. If extremely high accuracy is required, the outputs of one or more of these SLFNs can be combined using a simple autoencoder stage. The maximum accuracy obtained here is comparable with the best published results for the standard MNIST problem, without augmentation of the dataset by preprocessing, warping, noising/denoising or other non-standard modification. The accuracies achieved by the basic SLFN networks are in some cases equal to or higher than those achieved by the best efforts with deep belief networks, for example.

Moreover, when using the receptive field (RF) method to shape input weights, the resulting input weight matrix becomes highly sparse: using the RF algorithm above, close to 90% of input weights are exactly zero.

We note also that the implementations here were for the most part carried out on standard desktop PCs and required very little computation in comparison with deep networks. It should be highlighted that we have found significant speed increases for training by avoiding explicit calculation of matrix inverses. Moreover, we have shown that it is possible to circumvent memory difficulties that could arise with large training sets, by iteratively calculating the matrix A_train A_train^T, and then still only computing the output weights once. This method could also be used in streaming applications: the matrix A_train A_train^T could be updated with every training sample, but the output weights updated only periodically. In these ways, we can avoid previously identified potential limitations of the ELM training algorithm regarding matrix inversion, discussed in [31] (see also [26, 35]).

The principles implemented in the ELM training algorithm, and in particular the use of single-batch least squares regression in a linear output layer following random projection to nonlinear hidden units, parallel a principled approach to modelling neurobiological function known as the neural engineering framework (NEF) [14]. Recently this framework [14] was utilized in a very large (2.5 million neuron) model of the functioning brain, known as SPAUN [36, 37]. The computational and performance advantages we have demonstrated here could potentially boost the performance of the NEF, as well as, of course, the many other applications of neural networks.

Although deep networks and convolutional networks are now standard for hard problems in image and speech processing, their merits were originally argued almost entirely on the basis of their success in classification problems such as MNIST. The argument was of the form that because no other networks were able to achieve the same accuracy, the unique hierarchy of representation of features offered by deep networks, or the convolutional processing offered by CNNs, must therefore be necessary to achieve these accuracies. However, if there exists a neural network that does not use a hierarchical representation of features, and which can obtain the same accuracy as one that does, then this argument may be a case of confirmation bias. We have shown here that results equivalent to those originally obtained with deep networks and CNNs on MNIST can be obtained with simple single-hidden-layer feedforward networks, in which there is only one layer of nonlinear processing, and that these results can be obtained with very quick implementations. While the intuitive elegance of deep networks is hard to deny, and the economy of structure of multilayer networks over single-layer networks is proven, we would argue that the speed of training and ease of use of ELM-type single-layer networks make them a pragmatic first choice for many real-world machine learning applications.

Acknowledgments
Mark D. McDonnell's contribution was supported by an Australian Research Fellowship from the Australian Research Council (project number DP1093425). André van Schaik's contribution was supported by Australian Research Council Discovery Project DP140103001.
References

[1] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, pp. 2278–2324, 1998.
[2] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, pp. 1527–1554, 2006.
[3] Y. LeCun, C. Cortes, and C. J. C. Burges, "The MNIST database of handwritten digits," accessed August 2014, http://yann.lecun.com/exdb/mnist/.
[4] D. Cireşan, U. Meier, L. M. Gambardella, and J. Schmidhuber, "Deep, big, simple neural nets for handwritten digit recognition," Neural Computation, vol. 22, pp. 3207–3220, 2010.
[5] D. Cireşan, U. Meier, and J. Schmidhuber, "Multi-column deep neural networks for image classification," in Proc. CVPR, 2012, pp. 3642–3649.
[6] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus, "Regularization of neural networks using DropConnect," in Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA; JMLR: W&CP volume 28, 2013.
[7] M. D. Zeiler and R. Fergus, "Stochastic pooling for regularization of deep convolutional neural networks," in Proc. International Conference on Learning Representations, Scottsdale, USA, 2013.
[8] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, "Maxout networks," in Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA; JMLR: W&CP volume 28, 2013.
[9] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, "Deeply-supervised nets," in Deep Learning and Representation Learning Workshop, NIPS, 2014.
[10] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: Theory and applications," Neurocomputing, vol. 70, pp. 489–501, 2006.
[11] W. F. Schmidt, M. A. Kraaijveld, and R. P. W. Duin, "Feed forward neural networks with random weights," in Proc. 11th IAPR Int. Conf. on Pattern Recognition, Volume II, Conf. B: Pattern Recognition Methodology and Systems (ICPR11, The Hague), IEEE Computer Society Press, Los Alamitos, CA, 1992, pp. 1–4.
[12] C. L. P. Chen, "A rapid supervised learning neural network for function interpolation and approximation," IEEE Transactions on Neural Networks, vol. 7, pp. 1220–1230, 1996.
[13] C. Eliasmith and C. H. Anderson, "Developing and applying a toolkit from a general neurocomputational framework," Neurocomputing, vol. 26, pp. 1013–1018, 1999.
[14] C. Eliasmith and C. H. Anderson, Neural Engineering: Computation, Representation, and Dynamics in Neurobiological Systems. MIT Press, Cambridge, MA, 2003.
[15] J. Tapson, P. de Chazal, and A. van Schaik, "Explicit computation of input weights in extreme learning machines," in Proc. ELM2014 Conference, 2014, arXiv:1406.2889.
[16] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML), Haifa, Israel, 2010.
[17] D. W. Marquardt and R. D. Snee, "Ridge regression in practice," The American Statistician, vol. 29, pp. 3–20, 1975.
[18] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C: The Art of Scientific Computing, 2nd ed. Cambridge University Press, Cambridge, UK, 1992.
[19] G.-B. Huang, H. Zhou, X. Ding, and R. Zhang, "Extreme learning machine for regression and multiclass classification," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42, pp. 513–529, 2012.
[20] L. L. C. Kasun, H. Zhou, G.-B. Huang, and C. M. Vong, "Representational learning with extreme learning machine for big data," IEEE Intelligent Systems, vol. 28, pp. 31–34, 2013.
[21] W. Zhu, J. Miao, and L. Qing, "Constrained extreme learning machines: A study on classification cases," 2015, arXiv:1501.06115.
[22] A. Coates, H. Lee, and A. Y. Ng, "An analysis of single-layer networks in unsupervised feature learning," in Proc. 14th International Conference on Artificial Intelligence and Statistics (AISTATS), Fort Lauderdale, FL, USA; JMLR: W&CP volume 15, 2011.
[23] D. Yu and L. Deng, "Efficient and effective algorithms for training single-hidden-layer neural networks," Pattern Recognition Letters, vol. 33, pp. 554–558, 2012.
[24] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533–536, 1986.
[25] G.-B. Huang, "An insight into extreme learning machines: Random neurons, random features and kernels," Cognitive Computation, vol. 6, pp. 376–390, 2014.
[26] B. Widrow, A. Greenblatt, Y. Kim, and D. Park, "The No-Prop algorithm: A new learning algorithm for multilayer neural networks," Neural Networks, vol. 37, pp. 182–188, 2013.
[27] P. Y. Simard, D. Steinkraus, and J. C. Platt, "Best practices for convolutional neural networks applied to visual document analysis," in Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003), 2003.
[28] Y. LeCun, F. J. Huang, and L. Bottou, "Learning methods for generic object recognition with invariance to pose and lighting," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, 2004, pp. 97–104.
[29] D. C. Cireşan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber, "Flexible, high performance convolutional neural networks for image classification," in Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, 2011, pp. 1237–1242.
[30] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, "What is the best multi-stage architecture for object recognition?" in Proc. IEEE 12th International Conference on Computer Vision, 2009.
[31] B. Widrow, "Reply to the comments on the 'No-Prop' algorithm," Neural Networks, vol. 48, p. 204, 2013.
[32] N.-Y. Liang, G.-B. Huang, P. Saratchandran, and N. Sundararajan, "A fast and accurate online sequential learning algorithm for feedforward networks," IEEE Transactions on Neural Networks, vol. 17, pp. 1411–1423, 2006.
[33] J. Tapson and A. van Schaik, "Learning the pseudoinverse solution to network weights," Neural Networks, vol. 45, pp. 94–100, 2013.
[34] A. van Schaik and J. Tapson, "Online and adaptive pseudoinverse solutions for ELM weights," Neurocomputing, vol. 149, pp. 233–238, 2015.
[35] M.-H. Lim, "Comments on the 'No-Prop' algorithm," Neural Networks, vol. 48, pp. 59–60, 2013.
[36] C. Eliasmith, T. C. Stewart, X. Choo, T. Bekolay, T. DeWolf, C. Tang, and D. Rasmussen, "A large-scale model of the functioning brain," Science, vol. 338, pp. 1202–1205, 2012.
[37] T. C. Stewart and C. Eliasmith, "Large-scale synthesis of functional spiking neural circuits," Proceedings of the IEEE, vol. 102, pp. 881–898, 2014.
Figures

FIG. 1: Illustration of the three core methods of shaping ELM input weights. In (a), which is a cartoon of the Computed Input Weights ELM (CIW-ELM) process [15], two classes of input data are indicated by '+' and 'o' symbols. The vectors to the '+' symbols are multiplied by random bipolar binary ({-1, 1}) vectors u1 and u2 to produce a biased random weight vector w1. Similarly, the vectors to the 'o' class are also multiplied by random vectors u1 and u2 to produce a biased random weight vector w2. Note that in practice we would not use the same random binary vectors. In (b), we show the Constrained ELM (C-ELM) process [21]. The black arrows are weight vectors derived by computing the difference of two classes; in this case, the difference between the '+' elements and the 'o' elements. In (c), we illustrate the Receptive Field ELM (RF-ELM) method; the weights for each hidden layer neuron are restricted to being non-zero only for a small random rectangular receptive field in the original image plane.
FIG. 2: Combined two-layer RF-CIW-ELM and RF-C-ELM network. This figure depicts the structure of our multilayer ELM network that combines an RF-CIW-ELM network with an RF-C-ELM network, using what is effectively an autoencoder output stage. Note that the middle linear layer of neurons can be removed by combining the output layer weights of the first networks with the input layer weights of the second; we have not shown this here, in order to clarify the development of the structure.
FIG. 3: Error rates for MNIST images for various SLFN ELM methods with shaped input weights. The first row shows the mean error percentage from 10 different trained networks applied to classify (a) the 10000-point MNIST test data set, and (b) the 60000-point MNIST training data set used to train the networks, for various different sizes of hidden layer, M. Markers show the actual error percentage from each of the 10 networks. Note that the data for the combination RF-CIW-C-ELM method is plotted against the M used in just one of the three parts of the overall network; the total number of hidden units used is actually 2M + 500. Therefore RF-CIW-C-ELM does not outperform the other methods for the same total number of hidden units at small M. However, it can be seen that for large M, RF-CIW-C-ELM produces results below 1% error on the test data set and provides the best error rates overall. The second row illustrates that increasing the number of hidden units above about M = 15000 leads to overfitting, since, as shown in (c), the total number of errors on the test set plateaus, whilst the total number of errors on the training set continues to decrease (shown in (d)). Note that (c) and (d) show results from a single trained network only.
FIG. 4:
ELM-backpropagation error rates for MNIST for various SLFN ELM methods with shaped input weights. Each trace shows the mean error percentage from 10 different trained networks applied to classify (a) the 10000-point MNIST test data set, and (b) the 60000-point MNIST training data set used to train the networks, for various different sizes of hidden layer, M, when ten iterations of backpropagation were also used. Markers show the actual error percentage from each of the 10 networks. In comparison with Figure 3, it can be seen that backpropagation significantly improves the error rate for small M with all methods, but has little impact when M = 12800. The total number of hidden units used for RF-CIW-C-ELM is actually 2M + 500, but each parallel ELM has M hidden units.
FIG. 5: Mean training times for MNIST for various SLFN ELM training methods with shaped input weights. Each trace shows the mean run time from 10 different networks trained on all 60000 MNIST training data points, to achieve the test-data error rates shown in Figure 4. The total times for setup and training are shown, excluding the time to load the MNIST data from files. When backpropagation is applied, the runtime scales approximately linearly with the number of iterations, but each backpropagation iteration is slower than each trace shown here, because both input and output weights are updated in each iteration. The time for testing is not included in the figure, but was approximately 10 seconds for M = 12800.
FIG. 6: Error rates for NORB-small for RF-C-ELM. The error rate on the 24300 stereo-channel NORB-small test images as a function of the number of hidden units, M. The data was preprocessed by downsampling each channel of each image to 13 × 13 pixels, and then contrast normalising. Our best result over all repeats was just above 94% correct, obtained with M = 10000.

Tables
TABLE I:

Grouping          | Method                                        | Error in testing | Reference
Selected Non-ELM  | SLFN, 784-1000-10                             | 4.5%             | [1]
                  | Deep Belief Network                           | 1.25%            | [2]
                  | Deep Conv. Net LeNet-5                        | 0.95%            | [1]
                  | Deep Conv. Net (dropconnect)                  | 0.57%            | [6]
                  | Deep Conv. Net (stochastic pooling)           | 0.47%            | [7]
                  | Deep Conv. Net (maxout units and dropout)     | 0.45%            | [8]
                  | Deep Conv. Net (deeply-supervised)            | 0.39%            | [9]
Past ELM          | ELM, 784-1000-10                              | 6.05%            | [15]
                  | C-ELM                                         | ~5%              | [21]
                  | CIW-ELM, 784-1000-10                          | 3.55%            | [15]
                  | ELM, 784-7840-10                              | 2.75%            | [33]
                  | ELM, 784-unknown-10                           | 2.61%            | [20]
                  | CIW-ELM, 784-7000-10                          | 1.52%            | [15]
                  | ELM+backpropagation, 784-2048-10              | 1.45%            | [23]
                  | Deep ELM, 784-700-15000-10                    | 0.97%            | [20]
ELM & backprop    | RF-(CIW & C)-ELM, 784-(2 × 15000)-20-500-10   | 0.83% (0.87%)    | This report