Analytic Network Learning

Kar-Ann Toh

School of Electrical and Electronic Engineering, Yonsei University, Seoul, Korea 03722
Abstract
Based on the property that solving a system of linear matrix equations via column space and row space projections boils down to an approximation in the least squares error sense, a formulation for learning the weight matrices of a multilayer network is derived. By exploiting the vast number of feasible solutions of these interdependent weight matrices, the learning can be performed analytically, layer by layer, without gradient computation after an initialization. Possible initialization schemes include utilizing the data matrix as initial weights and random initialization. The study is followed by an investigation into the representation capability and the output variance of the learning scheme. An extensive experimentation on synthetic and real-world data sets validates its numerical feasibility.
Keywords:
Multilayer Neural Networks, Least Squares Error, Linear Algebra, Deep Learning.
Introduction

Attributed to its high learning capacity and good prediction capability, the deep neural network has found its advantage in wide areas of science and engineering applications. Such an observation has sparked a surge of investigations into the architectural and learning aspects of deep networks for targeted applications. The main ground for realizing the high learning capacity and predictivity comes from several major advancements in the field, which include the processing platform, the learning regimen, and the availability of data of large size. In terms of the processing platform, the advancement of Graphics Processing Units (GPUs) has facilitated parallel processing of complex network learning within accessible time. Together with the relatively low cost of the hardware, the large number of public high-level open source libraries has enabled a crowdsourcing mode of exploration of learning architectures. Based on such a learning platform, several learning regimens such as the Convolutional Neural Network (CNN or LeNet-5) [1], the AlexNet [2], the GoogLeNet or Inception [3], the Visual Geometry Group Network (VGG Net) [4], the Residual Network (ResNet) [5] and the DenseNet [6] have stretched network learning, in terms of network depth and prediction capability, way beyond the known boundary established by conventional statistical methods.

Without a sufficiently convincing explanation in theory, the advancement of deep learning has been grounded upon 'big' data, powerful machinery and crowd efforts to achieve 'breakthrough' results that were not possible before. Such a swarming phenomenon has pushed forward the demand for hardware as well as middleware, but at the expense of masking the importance of fundamental results available in statistical decision theory. The research scene has arrived at a state of deeming results unacceptable without working directly on, or comparing them with, 'big' data, which implicitly relies on powerful machinery.
Despite the great success in applications, an understanding of the underlying learning mechanism behind the representation and generalization properties gained from network depth remains far-fetched and is becoming imperative. From the perspective of the nonlinearity incurred by the activation functions in each layer, analyzing the learning properties of the network becomes extremely difficult.

In terms of the optimality of network learning, several investigations can be found in the literature. For the two-layer linear network, back in 1988, Baldi and Hornik [7, 8] showed that the network was convex in each of its two weight matrices and every local minimum was a global minimum. In 2012, Baldi and Lu [9] extended the convexity result to deep linear networks while conjecturing that every local minimum was a global minimum. Recently, Kawaguchi [10] showed that the loss surface of deep linear networks was non-convex and non-concave, that every local minimum was a global minimum, that every critical point that was not a global minimum was a saddle point, and characterized the rank of the weight matrices at saddle points. These observations, represented in the form of a Directed Acyclic Graph (DAG), were carried forward to nonlinear networks with fewer unrealistic assumptions than those in [11, 12], which were inspired by the Hamiltonian of the spherical spin-glass model. Subsequently, Lu and Kawaguchi [13] showed that depth in linear networks creates no bad local minima. More recently, Yun et al. [14] provided sufficient conditions for global optimality in deep nonlinear networks when the activation functions satisfy certain invertibility, differentiability and bounding conditions.

From the geometrical view of the loss function, Dinh et al. [15] argued that an appropriate definition and use of minima in terms of their sharpness could help in understanding the generalization behavior. With the understanding that the difficulty of deep search originates from the proliferation of saddle points rather than local minima, Dauphin et al. [16] attempted a saddle-free Newton's method (see e.g., [17]) to search for the minima in deep networks. In [18], a visualization technique based on filter normalization was introduced for studying the loss landscape. By comparing networks with and without skip connections, the study showed that the smoothness of the loss surface depends highly on the network architecture. The landscape of the empirical risk for deep learning with convolutional neural networks was investigated in [19]. By extending the case of linear networks in [20] to the case of nonlinear networks, Poggio et al. [21] showed that deep networks do not usually overfit the classification error for low-noise data sets and that the solution corresponds to margin maximization.

From the view of having a theoretical bound for the estimation error, in 2002, Langford and Caruana [22] investigated generalization in terms of a bound on the true error rate of a distribution over a set of neural networks using the PAC-Bayes bound. This approach sparked off several follow-ups (see e.g., [23, 24, 25]). Recently, Bartlett et al. [26] proposed a margin-based generalization bound which was spectrally normalized. Apart from an empirical investigation of several capacity bounds based on the $\ell_2$, $\ell_1$-path, $\ell_2$-path and spectral norms [27], in [25] a margin-based perturbation bound was combined with PAC-Bayes analysis to derive a generalization bound. Using the gap between the expected risk and the empirical risk, Kawaguchi et al. [28] provided generalization bounds for deep models which do not have explicit dependency on the number of weights, the depth and the input dimensionality. In view of the lack of an analytic structure in learning, many of these analyses hinged upon the Stochastic Gradient Descent (SGD, see e.g., [29, 30, 31]) search towards certain stationary points (e.g., [26, 19, 23, 24]). Moreover, many of the analyses of linear or nonlinear networks treated the network as a Directed Acyclic Graph (DAG) model (see e.g., [20, 24, 28]), which turns into the complicated problem of topological sorting.

In [32, 33], we have established a gradient-free learning approach that results in an analytic learning of layered networks of fully connected structure. This largely simplifies the analytical aspects of network learning. In this work, we further explore analytic learning on networks with receptive fields. Particularly, the contributions of the current study include (1) establishment of an analytic learning framework whose goal is to find a mapping matrix such that the target matrix falls within its range space; in other words, a representative mapping is sought based on the data matrix and its Moore-Penrose inverse; (2) proposal of an analytic learning algorithm which utilizes sparse structural information, with two initialization possibilities showcased; (3) establishment of conditions for network representation on a finite set of data; (4) investigation of the network output variance with respect to the network depth; and (5) an extensive numerical study to validate the proposal.

The article is organized as follows. An introduction to several related concepts is given in Section 2 to pave the way for our development. Particularly, the Moore-Penrose inverse and its relation to least squares error approximation are defined and stated, together with the network structure of interest. Section 3 presents the main results of network learning in analytic form. Here, apart from a random initialization scheme, a deterministic initialization scheme based on the data matrix is introduced. This sheds some light on the underlying learning mechanism. In Section 4, the issue of network representation is examined in view of a finite set of training data. The network generalization is further explored via the output variance of the linear structure. In Section 5, two cases of synthetic data are studied to observe the fitting behavior and the representation property of the proposed learning. In Section 6, the proposed learning is experimented on 42 data sets taken from the public domain to validate its numerical feasibility. Finally, some concluding remarks are given in Section 7.
Preliminaries
Definition 1 (see e.g., [34, 35, 36, 37]) A least squares solution to the system $A\theta = y$, where $A \in \mathbb{R}^{m\times d}$, $\theta \in \mathbb{R}^{d\times 1}$ and $y \in \mathbb{R}^{m\times 1}$, is a vector $\hat{\theta}$ such that
$$\|\hat{e}\| = \|A\hat{\theta} - y\| \leq \|A\theta - y\| = \|e\| \quad (1)$$
for all $\theta$, where $\|\cdot\|$ denotes the $\ell_2$-norm of "$\cdot$".

Definition 2 (see e.g., [36, 34, 35, 37]) If $A$ (square or rectangular) is a matrix of real or complex elements, then there exists a unique matrix $A^{\dagger}$, known as the Moore-Penrose inverse or the pseudoinverse of $A$, such that (i) $AA^{\dagger}A = A$, (ii) $A^{\dagger}AA^{\dagger} = A^{\dagger}$, (iii) $(AA^{\dagger})^{*} = AA^{\dagger}$ and (iv) $(A^{\dagger}A)^{*} = A^{\dagger}A$, where $*$ denotes the conjugate transpose.

Lemma 1 (see e.g., [35, 34, 36, 37]) $\hat{\theta} = A^{\dagger}y$ is the best approximate solution of $A\theta = y$.

If the linear system of equations $A\theta = y$ is over-determined (i.e., $m > d$) and the columns of $A$ are linearly independent, then $A^{\dagger} = (A^{T}A)^{-1}A^{T}$ and $A^{\dagger}A = I$. The solution to the system can be derived by pre-multiplying both sides of $A\theta = y$ by $A^{\dagger}$, giving $A^{\dagger}A\hat{\theta} = A^{\dagger}y$, which leads to the result in Lemma 1 since $A^{\dagger}A = I$. On the other hand, if $A\theta = y$ is an under-determined system (i.e., $m < d$) and the rows of $A$ are linearly independent, then $A^{\dagger} = A^{T}(AA^{T})^{-1}$ and $AA^{\dagger} = I$. Here, the solution can be derived by substituting $\hat{\theta} = A^{\dagger}\hat{\alpha}$ (a projection of the $d$-space onto the $m$-subspace) into the system and then solving for $\hat{\alpha} \in \mathbb{R}^{m}$, which results in $\hat{\alpha} = y$. The result in Lemma 1 is again obtained by substituting $\hat{\alpha} = y$ into $\hat{\theta} = A^{\dagger}\hat{\alpha}$. In practice, the full rank condition may not be easily achieved for data of large dimension without regularization. Even for such cases, the Moore-Penrose inverse in Definition 2 exists and is unique. This means that for any matrix $A$, there is precisely one matrix $A^{\dagger}$ that satisfies the four Penrose equations in Definition 2. In general, if $A$ has a full rank factorization $A = FG$, then $A^{\dagger} = G^{*}(GG^{*})^{-1}(F^{*}F)^{-1}F^{*}$ satisfies the Penrose equations [36].

The above algebraic relationships show that learning of a linear system in the sense of least squares error minimization can be achieved by manipulating its kernel or range space via the Moore-Penrose inverse operation (i.e., multiplying both sides of the equation by the pseudoinverse of the system matrix boils down to an implicit least squares error minimization; the reader is referred to [33] for greater detail regarding the least error and the least norm properties). For linear systems with multiple ($q$) output columns, the following result is observed.

Lemma 2 [32] Solving for $\Theta$ in the system of linear equations of the form
$$A\Theta = Y, \quad A \in \mathbb{R}^{m\times d},\ \Theta \in \mathbb{R}^{d\times q},\ Y \in \mathbb{R}^{m\times q} \quad (2)$$
in the column space (range) of $A$ or in the row space (kernel) of $A$ is equivalent to minimizing the sum of squared errors given by
$$\mathrm{SSE} = \mathrm{trace}\left((A\Theta - Y)^{T}(A\Theta - Y)\right). \quad (3)$$
Moreover, the resultant solution $\hat{\Theta} = A^{\dagger}Y$ is unique with a minimum-norm value in the sense that $\|\hat{\Theta}\| \leq \|\Theta\|$ for all feasible $\Theta$.

Proof: Equation (2) can be re-written as a set of multiple linear systems as follows:
$$A[\theta_1, \cdots, \theta_q] = [y_1, \cdots, y_q]. \quad (4)$$
Since the trace of $(A\Theta - Y)^{T}(A\Theta - Y)$ is equal to the sum of the squared lengths of the error vectors $A\theta_i - y_i$, $i = 1, 2, \ldots, q$, the unique solution $\hat{\Theta} = (A^{T}A)^{-1}A^{T}Y$ in the column space of $A$, or $\hat{\Theta} = A^{T}(AA^{T})^{-1}Y$ in the row space of $A$, not only minimizes this sum, but also minimizes each term in the sum [38]. Moreover, since the column and the row spaces are independent, the sum of the individually minimized norms is also minimum. □

Based on these observations, the inherent least squares error approximation property of algebraic manipulation utilizing the Moore-Penrose inverse shall be exploited in the following section to derive an analytic solution for multilayer network learning that is gradient-free.
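Before moving on, the least squares property of Lemma 2 can be checked numerically. The following minimal NumPy sketch (an illustration added here, not part of the original derivation) verifies that the pseudoinverse solution $\hat{\Theta} = A^{\dagger}Y$ attains an SSE no larger than that of randomly perturbed alternatives.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, q = 50, 8, 3                      # samples, input dim, output dim
A = rng.standard_normal((m, d))         # system matrix
Y = rng.standard_normal((m, q))         # multi-column target

Theta_hat = np.linalg.pinv(A) @ Y       # Moore-Penrose solution of A Theta = Y

def sse(Theta):
    E = A @ Theta - Y
    return np.trace(E.T @ E)            # SSE as in (3)

# Any perturbation of Theta_hat should not reduce the SSE.
for _ in range(5):
    Theta_alt = Theta_hat + 0.1 * rng.standard_normal((d, q))
    assert sse(Theta_hat) <= sse(Theta_alt) + 1e-9
print("SSE at pseudoinverse solution:", sse(Theta_hat))
```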
We consider a multilayer feedforward network of $n$ layers. This network can either be fully connected or be set up with receptive fields (also known as a receptive field network, abbreviated as RFN hereon) as shown in Fig. 1. When a full receptive field is set for all layers, the network is a fully connected one [32, 33]. Unlike conventional networks, the bias term in each layer is excluded except for the inputs in this representation.

Mathematically, the $n$-layer network model of $h_1$-$h_2$-$\cdots$-$h_{n-1}$-$h_n$ structure can be written in matrix form as
$$G = f_n(f_{n-1}(\cdots f_2(f_1(XW_1)W_2)\cdots W_{n-1})W_n), \quad (5)$$
where $X = [\mathbf{1}, X_0] \in \mathbb{R}^{m\times(d+1)}$ is the augmented input data matrix (i.e., input with bias), $W_1 \in \mathbb{R}^{(d+1)\times h_1}$, $W_2 \in \mathbb{R}^{h_1\times h_2}$, $\cdots$, $W_{n-1} \in \mathbb{R}^{h_{n-2}\times h_{n-1}}$, and $W_n \in \mathbb{R}^{h_{n-1}\times q}$ ($h_n = q$ is the output dimension) are the weight matrices without bias at each layer, and $G \in \mathbb{R}^{m\times q}$ is the network output of $q$ dimensions. $f_k$, $k = 1, \cdots, n$, are activation functions which operate elementwise on their matrix domain. When the $k$th layer of the network has a limited receptive field, its weight matrix has sparse elements in each column. For example, if the first layer has a receptive field of 3 (denoted as $r_1$), then, when $h_1 = d$, $W_1$ can be written as a circulant-structured (banded) matrix [39, 40] with at most $r_1 = 3$ consecutive nonzero entries per column:
$$
W_1 \;=\;
\overbrace{\begin{bmatrix}
w_{1,1} & 0 & \cdots & 0\\
w_{2,1} & w_{2,2} & & \vdots\\
w_{3,1} & w_{3,2} & \ddots & 0\\
0 & w_{4,2} & \ddots & w_{d-1,d}\\
\vdots & & \ddots & w_{d,d}\\
0 & \cdots & 0 & w_{d+1,d}
\end{bmatrix}}^{h_1 = d}. \quad (6)
$$
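As an illustration (not from the original paper's code), the following NumPy sketch builds a banded first-layer weight matrix of the kind in (6) with receptive field $r_1 = 3$ and evaluates the forward map (5) for a three-layer network; the softplus activation is used purely as an example of an invertible $f$.

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))          # elementwise activation f

def banded_weight(d, r=3, rng=None):
    """(d+1) x d first-layer weight with at most r consecutive
    nonzero entries per column (receptive field r), cf. (6)."""
    rng = rng or np.random.default_rng(0)
    W = np.zeros((d + 1, d))
    for j in range(d):
        top = min(j, d + 1 - r)         # clamp the band inside the matrix
        W[top:top + r, j] = rng.standard_normal(r)
    return W

rng = np.random.default_rng(1)
m, d, h2, q = 20, 6, 5, 2
X0 = rng.standard_normal((m, d))
X = np.hstack([np.ones((m, 1)), X0])    # augmented input [1, X0]

W1 = banded_weight(d, r=3, rng=rng)     # receptive-field layer, h1 = d
W2 = rng.standard_normal((d, h2))
W3 = rng.standard_normal((h2, q))

# Forward pass of (5) for n = 3 layers.
G = softplus(softplus(softplus(X @ W1) @ W2) @ W3)
print(G.shape)                          # (m, q)
```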
For simplicity, the activation functions in each layer are taken to be the same (i.e., $f_k = f$, $k = 1, \cdots, n$) in our development. Also, unless stated otherwise, all activation functions are taken to operate elementwise on their matrix domain. We shall denote this network with receptive field as one with structure $h_1^{r_1}$-$h_2$-$\cdots$-$h_{n-1}$-$h_n$, which carries the common activation function in each layer as $f(h_1^{r_1})$-$f(h_2)$-$\cdots$-$f(h_{n-1})$-$f(h_n)$.

Figure 1: An $n$-layer feedforward network without inner weight biases.

In some circumstances, we may need to invert the network over the activation function for solution seeking. For such a case, an inversion is performed through a functional inversion. The inverse function is defined as follows.
Definition 3
Consider a function $f$ which maps $x \in \mathbb{R}$ to $y \in \mathbb{R}$, i.e., $y = f(x)$. Then the inverse function for $f$ is such that $f^{-1}(y) = x$.

A good example of an invertible function is the softplus function given by $y = f(x) = \log(1 + e^{x})$ with its inverse given by $f^{-1}(y) = \log(e^{y} - 1)$.

The indexing of $f_k$, $k = 1, \cdots, n$, for each layer of (5) is kept here for clarity, and suppose $f_k = f$ is an invertible function. Then, the network can learn a given target matrix $Y \in \mathbb{R}^{m\times q}$ by putting $G = Y$, where each layer of the network can be inverted (via taking the functional inverse and the pseudoinverse) as follows:
$$
\begin{aligned}
Y &= f_n(f_{n-1}(f_{n-2}(\cdots f_2(f_1(XW_1)W_2)\cdots)W_{n-1})W_n)\\
\Rightarrow\ f_n^{-1}(Y) &= f_{n-1}(f_{n-2}(\cdots f_2(f_1(XW_1)W_2)\cdots)W_{n-1})W_n \quad (7)\\
\Rightarrow\ f_n^{-1}(Y)W_n^{\dagger} &= f_{n-1}(f_{n-2}(\cdots f_2(f_1(XW_1)W_2)\cdots)W_{n-1})\\
\Rightarrow\ f_{n-1}^{-1}(f_n^{-1}(Y)W_n^{\dagger}) &= f_{n-2}(\cdots f_2(f_1(XW_1)W_2)\cdots)W_{n-1} \quad (8)\\
&\;\;\vdots\\
\Rightarrow\ f_2^{-1}(\cdots f_{n-1}^{-1}(f_n^{-1}(Y)W_n^{\dagger})W_{n-1}^{\dagger}\cdots) &= f_1(XW_1)W_2 \quad (9)\\
\Rightarrow\ f_1^{-1}(f_2^{-1}(\cdots f_{n-1}^{-1}(f_n^{-1}(Y)W_n^{\dagger})W_{n-1}^{\dagger}\cdots)W_2^{\dagger}) &= XW_1. \quad (10)
\end{aligned}
$$
Based on (7) to (10), we can express the weight matrices respectively as
$$
\begin{aligned}
W_1 &= X^{\dagger} f_1^{-1}(f_2^{-1}(f_3^{-1}(\cdots f_{n-1}^{-1}(f_n^{-1}(Y)W_n^{\dagger})W_{n-1}^{\dagger}\cdots)W_3^{\dagger})W_2^{\dagger}) \quad (11)\\
W_2 &= [f_1(XW_1)]^{\dagger} f_2^{-1}(f_3^{-1}(\cdots f_{n-1}^{-1}(f_n^{-1}(Y)W_n^{\dagger})W_{n-1}^{\dagger}\cdots)W_3^{\dagger}) \quad (12)\\
&\;\;\vdots\\
W_{n-1} &= [f_{n-2}(\cdots f_2(f_1(XW_1)W_2)\cdots W_{n-2})]^{\dagger} f_{n-1}^{-1}(f_n^{-1}(Y)W_n^{\dagger}) \quad (13)\\
W_n &= [f_{n-1}(f_{n-2}(\cdots f_2(f_1(XW_1)W_2)\cdots)W_{n-1})]^{\dagger} f_n^{-1}(Y). \quad (14)
\end{aligned}
$$
With these derivations, the following result is stated formally.

Theorem 1
Given $m$ data samples of $d$ dimensions which are packed as $X \in \mathbb{R}^{m\times(d+1)}$ with augmentation, and consider a multilayer network of the form in (5) with $f_k = f$, $\forall k = 1, \cdots, n$, being invertible. Then, learning of the network towards the target $Y \in \mathbb{R}^{m\times q}$ in the sense of least squares error approximation boils down to solving the set of equations given by (11) through (14).

Proof: Since the Moore-Penrose inverse exists uniquely for any complex or real matrix and the activation functions are invertible, the results follow directly from the derivations in (7) through (14). The assertion of least squares error approximation comes from the observation in Lemma 2, where each step of manipulating each equation (in (7) through (14)) by taking the pseudoinverse implies a projection onto either the column space or the row space. Mathematically, suppose $A = f_{n-1}(f_{n-2}(\cdots f_2(f_1(XW_1)W_2)\cdots)W_{n-1})$. Then (14) can be rewritten as $W_n = A^{\dagger}f_n^{-1}(Y)$. The least squares approximation property can be easily visualized by substituting (14) into (5) to get $G = f_n(AW_n) = f_n(AA^{\dagger}f_n^{-1}(Y))$. By applying $f_n^{-1}$ to both sides of the equation, we have $f_n^{-1}(G) = AA^{\dagger}f_n^{-1}(Y)$. Considering each column of the equation, given by $f_n^{-1}(g) = AA^{\dagger}f_n^{-1}(y)$, we observe the analogue to $\hat{y} = A\hat{\theta} = AA^{\dagger}y$ of Lemma 1 by putting $g = \hat{y}$, where the invertible transform of the target $f_n^{-1}(y)$ can be recovered uniquely. Hence the proof. □

The above result shows that the least squares solution to the network weights comes in a cross-coupled form where the weight matrices are interdependent. This poses difficulty towards an analytical estimation. Fortunately, due to the high redundancy of the network weights that are constrained within an array structure, solutions to network learning are vastly available (see Section 4.2 for a detailed analysis). The following result is obtained by having the weight matrices $W_k$, $k = 2, \cdots, n$, arbitrarily initialized as non-trivial matrices $R_k$, $k = 2, \cdots, n$ (e.g., a matrix which has zero value for all its elements is considered a trivial matrix).

Corollary 1  Consider $m$ data samples of $d$ dimensions which are packed as $X \in \mathbb{R}^{m\times(d+1)}$ with augmentation. Given non-trivial matrices $R_i$, $i = 2, \cdots, n$, whose dimensions match those of $W_j$, $j = 2, \cdots, n$, in (5), and suppose the activation functions are invertible. Then, learning of the network towards the target $Y \in \mathbb{R}^{m\times q}$ admits the following solution to the least squares error approximation:
$$
\begin{aligned}
W_1 &= X^{\dagger} f_1^{-1}(f_2^{-1}(f_3^{-1}(\cdots f_{n-1}^{-1}(f_n^{-1}(Y)R_n^{\dagger})R_{n-1}^{\dagger}\cdots)R_3^{\dagger})R_2^{\dagger}) \quad (15)\\
W_2 &= [f_1(XW_1)]^{\dagger} f_2^{-1}(f_3^{-1}(\cdots f_{n-1}^{-1}(f_n^{-1}(Y)R_n^{\dagger})R_{n-1}^{\dagger}\cdots)R_3^{\dagger}) \quad (16)\\
&\;\;\vdots\\
W_{n-1} &= [f_{n-2}(\cdots f_2(f_1(XW_1)W_2)\cdots)]^{\dagger} f_{n-1}^{-1}(f_n^{-1}(Y)R_n^{\dagger}) \quad (17)\\
W_n &= [f_{n-1}(f_{n-2}(\cdots f_2(f_1(XW_1)W_2)\cdots)W_{n-1})]^{\dagger} f_n^{-1}(Y). \quad (18)
\end{aligned}
$$
Proof: By initializing the weights $W_j$, $j = 2, \cdots, n$, using the matrices $R_i$, $i = 2, \cdots, n$, the interdependency of the weights can be decoupled and the weights can be estimated sequentially from (15) to (18). Hence the proof. □
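The layer-by-layer estimation of Corollary 1 can be prototyped in a few lines. The sketch below is an illustrative reading of (15)-(18) with softplus as the invertible activation, not the author's reference implementation; the clipping inside the inverse activation is a practical safeguard added for this sketch only.

```python
import numpy as np

def f(z):
    return np.log1p(np.exp(z))                     # softplus, an invertible activation

def finv(z):
    return np.log(np.expm1(np.clip(z, 1e-6, None)))  # inverse softplus; clipped for safety

def learn_corollary1(X, Y, hidden, c=0.1, seed=0):
    """Gradient-free, layer-by-layer learning in the spirit of Corollary 1.
    X: m x (d+1) augmented inputs; Y: m x q targets; hidden: [h1, ..., h_{n-1}]."""
    rng = np.random.default_rng(seed)
    dims = [X.shape[1]] + list(hidden) + [Y.shape[1]]
    n = len(dims) - 1
    # Non-trivial random initializations R_2, ..., R_n, scaled by c.
    R = {k: c * rng.standard_normal((dims[k - 1], dims[k])) for k in range(2, n + 1)}
    # Back-projected layer targets T_n, ..., T_1 as they appear in (15)-(18).
    T = {n: finv(Y)}
    for k in range(n - 1, 0, -1):
        T[k] = finv(T[k + 1] @ np.linalg.pinv(R[k + 1]))
    # Sequential estimation of W_1, ..., W_n, one pseudoinverse per layer.
    W, A = {}, X
    for k in range(1, n + 1):
        W[k] = np.linalg.pinv(A) @ T[k]
        A = f(A @ W[k])
    return W, A                                    # A is the network output f_n(... W_n)

# Tiny usage example.
rng = np.random.default_rng(1)
X = np.hstack([np.ones((30, 1)), rng.standard_normal((30, 4))])
Y = rng.uniform(0.5, 1.5, size=(30, 2))            # positive targets keep finv well defined
W, G = learn_corollary1(X, Y, hidden=[8, 6])
print(np.abs(G - Y).max())                          # training fit error
```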
Corollary 2  Consider $m$ data samples of $d$ dimensions which are packed as $X \in \mathbb{R}^{m\times(d+1)}$ with augmentation. Given non-trivial matrices $R_i$, $i = 1, 2, 3, \cdots, n$, whose dimensions match those of $W_j$, $j = 1, 2, 3, \cdots, n$, in (5), and suppose the activation functions are invertible. Then, learning of the network towards the target $Y \in \mathbb{R}^{m\times q}$ admits the following solution to the least squares error approximation:
$$
\begin{aligned}
W_3 &= [f_2(f_1(XR_1)R_2)]^{\dagger} f_3^{-1}(f_4^{-1}(f_5^{-1}(\cdots f_{n-1}^{-1}(f_n^{-1}(Y)R_n^{\dagger})R_{n-1}^{\dagger}\cdots)R_5^{\dagger})R_4^{\dagger}) \quad (19)\\
W_4 &= [f_3(f_2(f_1(XR_1)R_2)W_3)]^{\dagger} f_4^{-1}(f_5^{-1}(\cdots f_{n-1}^{-1}(f_n^{-1}(Y)R_n^{\dagger})R_{n-1}^{\dagger}\cdots)R_5^{\dagger}) \quad (20)\\
&\;\;\vdots\\
W_{n-1} &= [f_{n-2}(\cdots f_3(f_2(f_1(XR_1)R_2)W_3)\cdots W_{n-2})]^{\dagger} f_{n-1}^{-1}(f_n^{-1}(Y)R_n^{\dagger}) \quad (21)\\
W_1 &= X^{\dagger} f_1^{-1}(f_2^{-1}(f_3^{-1}(\cdots f_{n-1}^{-1}(f_n^{-1}(Y)R_n^{\dagger})W_{n-1}^{\dagger}\cdots)W_3^{\dagger})R_2^{\dagger}) \quad (22)\\
W_2 &= [f_1(XW_1)]^{\dagger} f_2^{-1}(f_3^{-1}(\cdots f_{n-1}^{-1}(f_n^{-1}(Y)R_n^{\dagger})W_{n-1}^{\dagger}\cdots)W_3^{\dagger}) \quad (23)\\
W_n &= [f_{n-1}(f_{n-2}(\cdots f_2(f_1(XW_1)W_2)W_3\cdots)W_{n-1})]^{\dagger} f_n^{-1}(Y). \quad (24)
\end{aligned}
$$
Proof: The order of the learning sequence is immaterial for any of the inner layer weights $W_i$, $i = 1, 2, 3, \cdots, n-1$, but the output layer weight $W_n$ should be estimated last. This is because $W_n$ is the weight which links up all the hidden layer weights and provides a direct least squares error approximation (range space projection) to the target $Y$. In other words, the estimation of $W_n$ in (24) at the output layer is completed when all the hidden weights have been determined. Hence the proof. □

Apart from the above solution based on random initialization, the following result shows an example of non-random initialization. This is particularly useful when the activation functions are non-invertible.

Corollary 3
Given $m$ data samples of $d$ dimensions which are packed as $X \in \mathbb{R}^{m\times(d+1)}$ with augmentation. Then, learning of the network (5) towards the target $Y \in \mathbb{R}^{m\times q}$ admits the following solution to the least squares error approximation:
$$
\begin{aligned}
W_1 &= X^{\dagger} \quad (25)\\
W_2 &= [f_1(XW_1)]^{\dagger} \quad (26)\\
&\;\;\vdots\\
W_{n-1} &= [f_{n-2}(\cdots f_2(f_1(XW_1)W_2)\cdots)]^{\dagger} \quad (27)\\
W_n &= [f_{n-1}(f_{n-2}(\cdots f_2(f_1(XW_1)W_2)\cdots)W_{n-1})]^{\dagger} f_n^{-1}(Y). \quad (28)
\end{aligned}
$$
Proof: Since $R_i$, $i = 2, \cdots, n$, in Corollary 1 are arbitrarily assigned matrices that match the dimensions of the corresponding weight matrices, they, together with $f_i^{-1}$, $i = 1, \cdots, n$, applied on $Y$, are indeed arbitrary, and they can all be chosen as identity matrices as one particular solution. Since the estimation of the output weight matrix $W_n$ in (18) of Corollary 1 does not involve assignment of an arbitrary matrix $R_i$, the transformed target term $f_n^{-1}(Y)$ cannot be replaced by the identity matrix. Hence the proof. □

Remark 1:
The result in Corollary 1 shows that there exists an infinite number of solutions for a network with multiple layers due to the arbitrary weight initialization. However, this does not apply to the single-layer network because it reduces to a system of linear equations. The result in Corollary 2 shows that the sequence of weight estimation need not begin from the first layer. Indeed, the arbitrary weights in Corollary 1 and Corollary 2 are only utilized once, following the sequence of weight matrix estimation. The weight matrix after each step of the estimation sequence is no longer arbitrary because all the previously estimated weights are used in the subsequent pseudoinverse projection towards the least squares error approximation. Corollary 3 shows a particular choice of the arbitrary weight matrix assignment without needing the corresponding activation functions to be invertible. For non-invertible functions, the term $f_n^{-1}(Y)$ in (28) can be replaced by $Y$, which corresponds to a linear output layer in (5). However, such an identity setting results in having the number of network nodes in every hidden layer (i.e., $h_i$, the column size of $W_i$, $i = 1, \cdots, n-1$) being set equal to the sample size $m$. □

The proposed algorithms in Corollary 1 through Corollary 3 are differentiated from that in [8] in several ways. Firstly, the nonlinear activation function was not considered in [8], whereas in our approach nonlinear activations, both invertible and non-invertible, are considered. Secondly, in Corollary 1 through Corollary 3 there is no attempt to alternate the search of weights among the layers as in [8], since such a search leads to convergence issues. Last but not least, we capitalize on the powerful Moore-Penrose inverse for solving the weights. More importantly, the layered network learning solution becomes analytic after arriving at (28) (and so are (18) and (24)), i.e., the problem boils down to finding an input-correlated matrix $f_{n-1}(f_{n-2}(\cdots f_2(f_1(XW_1)W_2)\cdots)W_{n-1})$ that can represent $Y$. This can be seen by considering a network with linear output where $G = f_{n-1}(A_{n-1})W_n$ with $A_{n-1} = f_{n-2}(A_{n-2})W_{n-1}$, $\cdots$, $A_1 = XW_1$. When the learning is based on Corollary 3, where $W_1 = X^{\dagger}$, $W_2 = [f_1(A_1)]^{\dagger}$, $\cdots$, $W_n = [f_{n-1}(A_{n-1})]^{\dagger}Y$, we have the estimated output given by
$$\hat{Y} = f_{n-1}(A_{n-1})[f_{n-1}(A_{n-1})]^{\dagger}Y, \quad (29)$$
where $A_{n-1} = f_{n-2}(A_{n-2})[f_{n-2}(A_{n-2})]^{\dagger}$, $\cdots$, $A_1 = XX^{\dagger}$. This shows that each layer in (29) serves as a composition of projection bases, based on the data matrix or the transformed data matrix (transformed by $f_k$, $k = 1, \cdots, n-1$), to map $Y$ within its range. In a nutshell, our results show that the problem of network learning can be viewed as one of finding a data mapping matrix such that the target matrix falls within its range. An immediate advantage of such a view over the conventional view of error minimization is evident from the gradient-free solutions arrived at in Corollaries 1-3. These results shall be validated by numerical evaluations in the experimental section.
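For completeness, here is a compact sketch (again illustrative, with a linear output layer replacing $f_n^{-1}$ as discussed in Remark 1) of the deterministic initialization of Corollary 3, equations (25)-(28), together with a check of the reproduction property (29) on the training set.

```python
import numpy as np

def f(z):
    return np.log1p(np.exp(z))             # softplus as an example activation

def learn_corollary3(X, Y, n_layers):
    """W_1 = X^+, W_k = [f(A_{k-1})]^+ for the hidden layers, and
    W_n = [f(A_{n-1})]^+ Y for a linear output layer (cf. (25)-(28))."""
    W, A = [], X
    for _ in range(n_layers - 1):
        Wk = np.linalg.pinv(A)              # hidden-layer weight = pseudoinverse
        A = f(A @ Wk)                       # A @ Wk has m columns (hidden width = sample size)
        W.append(Wk)
    Wn = np.linalg.pinv(A) @ Y              # output layer projects Y onto range(A)
    W.append(Wn)
    return W, A @ Wn                        # estimated training output as in (29)

rng = np.random.default_rng(0)
m, d, q = 40, 5, 2
X = np.hstack([np.ones((m, 1)), rng.standard_normal((m, d))])
Y = rng.standard_normal((m, q))
_, Y_hat = learn_corollary3(X, Y, n_layers=2)   # two-layer instance of Corollary 3
print(np.abs(Y_hat - Y).max())              # small when f(A) is representative for Y
```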
Firstly, we show the network representation capability. Then, an analysis of the feasible solution space is presented. Finally, an analysis of the output variance is performed.
For simplicity, we shall use $d$ as the dimension of generic matrices instead of the augmented dimension $d+1$ when no ambiguity arises.

Definition 4  A matrix $A \in \mathbb{R}^{m\times d}$ is said to be representative for $Y \in \mathbb{R}^{m\times q}$ if $AA^{\dagger}Y = Y$, where $m, d, q > 0$.

Certainly, if $A$ has full rank in the sense that $AA^{T}$ is invertible so that $A^{\dagger} = A^{T}(AA^{T})^{-1}$, then $AA^{\dagger} = I$, for which $A$ is representative for $Y$. However, this definition of representation is weaker than the full rank requirement in $(A^{T}A)^{-1}A^{T}$ or $A^{T}(AA^{T})^{-1}$, because $A$ need not have full rank when $Y$ falls within the range of $A$.

Definition 5  A function $f$, which operates elementwise on its matrix domain, is said to be representative if $f(XW_1)$ is representative for $Y \in \mathbb{R}^{m\times q}$ on $X \in \mathbb{R}^{m\times d}$ with distinct rows and $W_1 \in \mathbb{R}^{d\times h_1}$, where $m, d, h_1, q > 0$, i.e., $f$ is a function such that $f(XW_1)f(XW_1)^{\dagger}Y = Y$.

The above definition can accommodate the identity activation function (i.e., $f(x) = x$) in certain circumstances, because $(XW_1)(XW_1)^{\dagger}Y = Y$ when the column rank of the matrix $XW_1$ matches the row size of $Y$. However, such an identity activation does not provide sufficient representation capability for mapping targets beyond the range space of the data matrix. For common activation functions such as the softplus and the tangent functions, it is observed that imposing them on $XW_1$ can improve the column rank conditioning for representation in practice. Several numerical examples are given after the theoretical results to observe the representation capability.

Theorem 2 (Two-layer Network)  Given $m$ distinct samples of input-output data pairs and suppose the activation functions are representative according to Definition 5. Then there exists a feedforward network of two layers with at least $m \times q$ adjustable weights that can represent these data samples of $q > 0$ output dimensions.

Proof: Consider a 2-layer network, with linear activation at the output layer, that takes an augmented set of inputs:
$$G = f_1(XW_1)W_2, \quad (30)$$
where $X \in \mathbb{R}^{m\times(d+1)}$, $W_1 \in \mathbb{R}^{(d+1)\times h_1}$, $W_2 \in \mathbb{R}^{h_1\times q}$ and $G \in \mathbb{R}^{m\times q}$ is the network output of $q > 0$ dimensions. The network output $G$ in (30) can be used to learn the target matrix $Y \in \mathbb{R}^{m\times q}$ by putting
$$Y = f_1(XW_1)W_2. \quad (31)$$
Since $f_1$ is given to be a representative function according to Definition 5, we have $f_1(XW_1)f_1(XW_1)^{\dagger}Y = Y$ such that $Y$ falls within the range space of $f_1(XW_1)$. To ensure that $Y$ falls within the range space of $f_1(XW_1)$, it is sufficient that $W_1$ has at least a column size of $h_1 = m$ in order to match the number of samples in $Y$. Suppose we have a non-trivial setting for all the weight elements in $W_1$ such that not all of its elements are zeros. A good example of such a non-trivial setting is to put $W_1 = X^{\dagger}$ (see Corollary 3). Then, due to the representativeness of $f_1$, the entire weight matrix $W_2 \in \mathbb{R}^{m\times q}$ can be determined uniquely as $f_1(XW_1)^{\dagger}Y$ to represent the target $Y$. In other words, $W_1$ needs only distinct and arbitrary numbers in its $m$ columns of entries to maintain the representation sufficiency of $f_1(XW_1)$, and the elements in $W_2$ are the only adjustable parameters needed for the representation. Hence the proof. □

Remark 2:
The above result shows that the two-layer network is a universal approximator for a finite set of data. Different from most existing results in the literature [41, 42, 43, 44, 45, 46], the minimum number of adjustable hidden weights required here depends on the number of data samples and the output dimension. This result also stretches beyond that of [47] to include nonlinear activation functions. □
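To illustrate the role of the activation in Definition 5 and Theorem 2, the sketch below (illustrative only; a random $W_1$ with $h_1 = m$ columns is used) contrasts the identity activation, whose mapping $XW_1$ cannot leave the $(d+1)$-dimensional range of $X$, with a nonlinear activation applied to the same $XW_1$.

```python
import numpy as np

rng = np.random.default_rng(3)
m, d, q = 25, 4, 2
X = np.hstack([np.ones((m, 1)), rng.standard_normal((m, d))])   # distinct rows
Y = rng.standard_normal((m, q))                                  # arbitrary targets
W1 = rng.standard_normal((d + 1, m))                             # non-trivial, h1 = m

def residual(H):
    """Representation residual ||H H^+ Y - Y|| (zero iff H is representative for Y)."""
    return np.linalg.norm(H @ np.linalg.pinv(H) @ Y - Y)

print(residual(X @ W1))                      # identity activation: rank <= d+1 < m, large residual
print(residual(np.log1p(np.exp(X @ W1))))    # softplus(XW1): typically full rank, residual near zero
```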
Definition 6
A function $f$ is said to be compositional representative if $f(\cdots f(XW_1)\cdots W_{n-1})$ is representative for $Y \in \mathbb{R}^{m\times q}$ on $X \in \mathbb{R}^{m\times d}$ with distinct rows and $W_k \in \mathbb{R}^{h_{k-1}\times h_k}$, $k = 1, \cdots, n-1$, where $h_0 = d+1$ and $m, d, h_k, q > 0$, i.e., $f$ is a function such that $f(\cdots f(XW_1)\cdots W_{n-1})\,f(\cdots f(XW_1)\cdots W_{n-1})^{\dagger}Y = Y$.

Theorem 3 (Multilayer Network)  Given $m$ distinct samples of input-output data pairs and suppose the activation functions are representative according to Definition 6. Then there exists a feedforward network of $n$ layers with at least $m \times q$ adjustable weights that can represent these data samples of $q > 0$ output dimensions.

Proof: Consider a feedforward network with linear activation at the output layer given by
$$G = f_{n-1}(A_{n-1})W_n, \quad (32)$$
where $A_{n-1} = f_{n-2}(A_{n-2})W_{n-1}$, $\cdots$, $A_1 = XW_1$. Learning towards $Y$ using this network is analogous to (31) in Theorem 2, where we require $m \times q$ adjustable weight elements in $W_n$ for the desired representation. The representation property of each $A_i$, $i = 1, \cdots, n-1$, is maintained by non-trivial settings of $W_i$, $i = 1, \cdots, n-1$, when $f$ is compositional representative according to Definition 6. Since $W_i$, $i = 1, \cdots, n-1$, can be arbitrary but non-trivial, the elements of $W_n \in \mathbb{R}^{m\times q}$ are the only adjustable parameters needed in learning the representation. This completes the proof. □

Consider the learning problem of a single-layer network given by $Y = f_1(XW_1)$. Effectively, such a network learning boils down to the linear regression problem, since it can be rewritten as $f_1^{-1}(Y) = XW_1$ when $f_1$ is invertible. Because the pseudoinverse is unique, the solution in the least squares error approximation sense for $W_1$, given by $X^{\dagger}f_1^{-1}(Y)$, is unique.

For the two-layer network learning given by $Y = f_2(f_1(XW_1)W_2)$, the number of feasible solutions increases tremendously. This can be seen from the rewritten systems of linear equations given by $f_2^{-1}(Y) = f_1(XW_1)W_2$ and $f_1^{-1}(f_2^{-1}(Y)W_2^{\dagger}) = XW_1$, where fixing either $W_1$ or $W_2$ arbitrarily determines the other uniquely. The vast possibilities for either of the weight matrices determine the number of feasible solutions.

For the three-layer network learning given by $Y = f_3(f_2(f_1(XW_1)W_2)W_3)$, the feasible solution space increases further. Again, this can be seen from the rewritten systems of linear equations given by $f_3^{-1}(Y) = f_2(f_1(XW_1)W_2)W_3$, $f_2^{-1}(f_3^{-1}(Y)W_3^{\dagger}) = f_1(XW_1)W_2$ and $f_1^{-1}(f_2^{-1}(f_3^{-1}(Y)W_3^{\dagger})W_2^{\dagger}) = XW_1$, where fixing $W_1$ and $W_2$ arbitrarily determines $W_3$ (and so on for other combinations) uniquely. The vast possibilities for the combinations of the weight matrices determine the number of feasible solutions. Based on this analysis, we gather that the number of feasible solutions increases exponentially with the layer depth.

Proposition 1  The number of feasible solutions for a feedforward network with two layers is infinite. This number increases exponentially for each increment of network layer.

Proof: For a two-layer network, the number of feasible solutions is determined by the size of either $W_1$ or $W_2$, which is infinite in the real domain. Let us denote this number by $N$. Then, for a three-layer network, this number increases to $(N \times N)\times C^{3}_{2} = 3N^{2}$ possibilities from the arbitrary choices among the matrices $W_i$, $i = 1, 2, 3$. In general, for an $n$-layer network, we have $N^{n-1}\times C^{n}_{n-1}$ possibilities. Hence the proof. □

For simplicity, consider the linear model $g = x^{T}w$ without the bias term, where $x \in \mathbb{R}^{d\times 1}$, $w \in \mathbb{R}^{d\times 1}$, regressing towards the target $y \in \mathbb{R}$ so that $y = x^{T}w + \epsilon$, where $\epsilon \in \mathbb{R}$. The expected prediction error at an input point $x_0 \in \mathbb{R}^{d\times 1}$ using the squared-error loss is (see e.g., [48, 38]):
$$
\begin{aligned}
\mathrm{Err}(x_0) &= E[(y - \hat{g}(x_0))^2\,|\,x = x_0]\\
&= \sigma^2_{\epsilon} + [E[\hat{g}(x_0)] - g(x_0)]^2 + E[\hat{g}(x_0) - E[\hat{g}(x_0)]]^2\\
&= \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}. \quad (33)
\end{aligned}
$$
To analyze the bias and variance terms of our network, we consider the result in Corollary 3 with only deterministic components in the estimation of weights for simplicity. The estimation Bias$^2$ over $k$ test samples based on (33) is given by
$$
\begin{aligned}
\mathrm{Bias}^2 &= [E[\hat{g}(x_0)] - g(x_0)]^2 = \frac{1}{k}\sum_{i=1}^{k}[E[\hat{g}(x_i)] - g(x_i)]^2 = \frac{1}{k}\sum_{i=1}^{k}\big[E[x_i^{T}\hat{w}_n] - g(x_i)\big]^2\\
&= \frac{1}{k}\sum_{i=1}^{k}\Big[E\big[x_i^{T}[f_{n-1}(f_{n-2}(\cdots f_2(f_1(X\hat{W}_1)\hat{W}_2)\cdots)\hat{W}_{n-1})]^{\dagger}y\big] - g(x_i)\Big]^2. \quad (34)
\end{aligned}
$$
This shows that the estimation bias for unseen samples depends on the difference between the estimated parameters and the unknown 'true' parameters which map the range space of the unseen target. Since the 'true' parameters are unknown and cannot be removed, the bias term cannot be analyzed further.

By considering the linear regression estimate $\hat{w} = (X^{T}X)^{-1}X^{T}y$ based on a set of training data $X \in \mathbb{R}^{m\times d}$ with target given by $y = Xw + \epsilon$ for a certain unknown 'true' parameter $w$ with error $\epsilon \in \mathbb{R}^{m\times 1}$, the Variance term of linear regression can nevertheless be analyzed using (see e.g., [48]):
$$
\begin{aligned}
E[\hat{g}(x_0) - E[\hat{g}(x_0)]]^2 &= E[x_0^{T}\hat{w} - x_0^{T}w]^2 = E[x_0^{T}(X^{T}X)^{-1}X^{T}y - x_0^{T}w]^2\\
&= E[x_0^{T}(X^{T}X)^{-1}X^{T}(Xw + \epsilon) - x_0^{T}w]^2\\
&= E[x_0^{T}w + x_0^{T}(X^{T}X)^{-1}X^{T}\epsilon - x_0^{T}w]^2\\
&= E[x_0^{T}(X^{T}X)^{-1}X^{T}\epsilon]^2 = \|h(x_0)\|^2\sigma^2_{\epsilon}, \quad (35)
\end{aligned}
$$
where $h(x_0) = x_0^{T}(X^{T}X)^{-1}X^{T} = x_0^{T}X^{\dagger}$.

For the single-layer network with a single output and linear activation, the estimation is given by $\hat{w}_1 = X^{\dagger}y$ and the estimated output is given by $\hat{y} = X\hat{w}_1 = XX^{\dagger}y$. This boils down to the case of linear regression where the estimated $\hat{y}$ varies according to $h_1(x_0)\epsilon = x_0^{T}X^{\dagger}\epsilon$ given by (35).

Next, consider the two-layer network with linear output where the estimation is given by
$$\hat{w}_1 = X^{\dagger}, \quad \hat{w}_2 = [f_1(X\hat{w}_1)]^{\dagger}y. \quad (36)$$
Then, the estimated output can be written as
$$\hat{y} = f_1(X\hat{w}_1)\hat{w}_2 = f_1(X\hat{w}_1)[f_1(X\hat{w}_1)]^{\dagger}y = f_1(XX^{\dagger})[f_1(XX^{\dagger})]^{\dagger}y. \quad (37)$$
Since $f_1(X\hat{w}_1)$ is assumed to be representative given the activation function and the data matrix, the target $y$ can be reproduced exactly for the training set. Here, we note that the underlying estimation is based on $\hat{w}_2$, and this causes the output variance vector, $h_2(x_0)$ where $x_0 \in \mathbb{R}^{m}$, to hinge upon the feature dimension of $\hat{w}_2 \in \mathbb{R}^{m}$ in
$$h_2(x_0)\epsilon = x_0^{T}[f_1(X\hat{w}_1)]^{\dagger}\epsilon = x_0^{T}[f_1(XX^{\dagger})]^{\dagger}\epsilon. \quad (38)$$
This is differentiated from the case of linear regression where $x_0 \in \mathbb{R}^{d}$. For applications with a large data set, $m \gg d$, and this renders $(h_2(x_0)\epsilon)^2 \gg (h_1(x_0)\epsilon)^2$.

For the three-layer network, the estimation is given by
$$\hat{w}_1 = X^{\dagger}, \quad \hat{w}_2 = [f_1(X\hat{w}_1)]^{\dagger}, \quad \hat{w}_3 = [f_2(f_1(X\hat{w}_1)\hat{w}_2)]^{\dagger}y, \quad (39)$$
and the estimated output can be written as
$$\hat{y} = f_2(f_1(X\hat{w}_1)\hat{w}_2)\hat{w}_3. \quad (40)$$
The corresponding output variance term for the three-layer network model is
$$h_3(x_0)\epsilon = x_0^{T}[f_2(f_1(X\hat{w}_1)\hat{w}_2)]^{\dagger}\epsilon = x_0^{T}[f_2(f_1(XX^{\dagger})[f_1(XX^{\dagger})]^{\dagger})]^{\dagger}\epsilon. \quad (41)$$
Based on this analysis, we can generalize the output variance from the 1-layer network to the $n$-layer network as
$$h_k(x_0)\epsilon = x_0^{T}H_k^{\dagger}\epsilon, \quad k = 1, 2, \ldots, n, \quad (42)$$
where
$$H_1 = X, \quad (43)$$
$$H_2 = f_1(H_1H_1^{\dagger}), \quad (44)$$
$$H_3 = f_2(H_2H_2^{\dagger}), \quad (45)$$
$$\vdots$$
$$H_n = f_{n-1}(H_{n-1}H_{n-1}^{\dagger}). \quad (46)$$

Proposition 2  If $f_k = f$, $k = 1, \cdots, n-1$, is a function such that $E[(h_2(x_0)\epsilon)^2] > E[(h_3(x_0)\epsilon)^2]$ for all $x_0 \in \mathbb{R}^{m}$ and $\epsilon \in \mathbb{R}^{m}$, then $E[(h_k(x_0)\epsilon)^2] > E[(h_{k+1}(x_0)\epsilon)^2]$, $\forall k = 3, \ldots, n$, in (42).

Proof: From $E[(x_0^{T}[f(XX^{\dagger})]^{\dagger}\epsilon)^2] > E[(x_0^{T}[f(f(XX^{\dagger})[f(XX^{\dagger})]^{\dagger})]^{\dagger}\epsilon)^2]$, we observe that the term $X$ has been replaced by $f(XX^{\dagger})$. Such replacement is recursively performed for each additional layer of the activation function in deeper networks. Since the activation function is such that the assumed inequality holds for all $x_0 \in \mathbb{R}^{m}$, the inequality is maintained for each replacement of the term $X$ within. In other words, the addition of an inner layer shall not change the inequality. Hence the proof. □

Figure 2: The average output variance of 1000 Monte Carlo simulations: the mean (std) values of $(h_k(x_0)\epsilon)^2$, $k = 1, \ldots, n$, plotted against the number of layers in the network.

Conjecture 1  If $f_k = f$, $k = 1, \cdots, n-1$, is a representative function, then $E[(h_k(x_0)\epsilon)^2] > E[(h_{k+1}(x_0)\epsilon)^2]$ for all $k = 2, \ldots, n$ in (42).
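Equations (42)-(46) are straightforward to evaluate numerically. The sketch below is an illustration consistent with the Monte Carlo study described in Remark 3 that follows; the exponential-form activation (with an assumed scaling constant of 0.1), the sampling ranges, and the noise magnitude are choices made for this sketch.

```python
import numpy as np

def f(z):
    return np.exp(0.1 * z)                      # exponential-form activation (0.1 assumed)

rng = np.random.default_rng(0)
m, d, n = 100, 10, 6
X = rng.uniform(-5.0, 5.0, size=(m, d))         # 100 samples of 10 dimensions
eps = rng.uniform(-0.5, 0.5, size=m)            # noise, about one tenth of the input scale
x_d = rng.uniform(-5.0, 5.0, size=d)            # test point for the 1-layer case (in R^d)
x_m = rng.uniform(-5.0, 5.0, size=m)            # test point for deeper layers (in R^m)

# k = 1: h_1(x) eps = x^T X^dagger eps  (linear regression case, H_1 = X)
variances = [float(x_d @ np.linalg.pinv(X) @ eps) ** 2]

# k >= 2: H_{k+1} = f(H_k H_k^dagger) as in (44)-(46)
H = X
for _ in range(2, n + 1):
    H = f(H @ np.linalg.pinv(H))
    variances.append(float(x_m @ np.linalg.pinv(H) @ eps) ** 2)

for k, v in enumerate(variances, start=1):
    print(f"layers = {k}: (h_k(x) eps)^2 = {v:.4g}")
```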
Remark 3: The analysis of output variance shows that the variance of estimation does not diverge for deep networks when the hidden layers are estimated using the data image (i.e., the pseudoinverse of the data matrix). This result is congruent with the observation of good generalization obtained for deep networks in the literature. Numerically, the result in Proposition 2 can be validated by a Monte Carlo simulation as follows. To keep the outputs of deeper networks within a visible changing range, an activation function of the exponential form has been adopted for each layer in this study. The weights are computed deterministically according to (25) through (28). The elements of the input matrix $X \in \mathbb{R}^{100\times 10}$ (100 samples of 10 dimensions) are generated randomly within $[-5, +5]$. The vector $x_0 \in \mathbb{R}^{100\times 1}$ in (42) is also generated randomly within the same range. The learning target vector $y \in \{-1, +1\}^{100}$ contains 100 samples with half of the total samples in each category. The noise vector $\epsilon \in \mathbb{R}^{100\times 1}$ in (42) is taken to be of one-tenth the magnitude of the input elements' range. A total of 1000 Monte Carlo simulations have been conducted under this random setting. Fig. 2 shows the average plot of $(h_k(x_0)\epsilon)^2$ over the 1000 random trials for $k = 1, 2, \cdots, n$. Apart from the case of $k = 1$ (i.e., $(h_1(x_0)\epsilon)^2$ where $x_0 \in \mathbb{R}^{10\times 1}$, which has a different dimension from that of $x_0 \in \mathbb{R}^{100\times 1}$), it can be seen that the value of the output variance $(h_k(x_0)\epsilon)^2$ does not diverge for networks with more than two layers. This result may contribute to explaining the non-overfitting behavior of deep networks. □

In this section, two sets of synthetic data with known function generators are adopted to observe the learning and representation properties of the proposed network. The first data set studies the fitting ability in terms of training data representation with respect to the scaling of the initial weights. The second data set studies the decision boundary of a learned network as well as the number of nodes needed in the hidden layer for representation.

Without loss of generality, a modified softplus function $f(x) = \log(0.8 + e^{x})$ and its inverse given by $f^{-1}(x) = \log(e^{x} - 0.8)$ are adopted in the learning algorithm of Corollary 1 (abbreviated as ANNnet) for all the numerical studies and experiments. The computing platform is an Intel Core i7-6500U CPU at 2.59GHz with 8GB of RAM.
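As a quick sanity check of the activation pair used throughout the experiments, the following sketch (illustrative only) implements the modified softplus and its inverse and verifies that they compose to the identity on a range of inputs.

```python
import numpy as np

def f(x):
    return np.log(0.8 + np.exp(x))      # modified softplus used by ANNnet

def finv(x):
    return np.log(np.exp(x) - 0.8)      # its inverse; requires x > log(0.8)

x = np.linspace(-3.0, 3.0, 7)
print(np.allclose(finv(f(x)), x))       # True: f is invertible on the real line
```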
In this example, we examine the effect of the initial weight magnitudes (based on a scaling factor $c$ which multiplies the initial weight matrices elementwise) and the effect of the number of network layers using a single-dimensional regression function. A total of 8 training target samples $y_i$, $i = 1, 2, \cdots, 8$, are generated from the target function $y = \sin(2x)/(2x)$ using 8 sampled inputs $x_i$. Apart from this set of training samples, another 10 sets (each containing 8 samples) of training targets are generated by adding 20% of noisy variations with respect to the training target range. Our purpose is to observe the mapping capability of the network to learn these 11 curves of different curvatures using different $c$ settings and numbers of network layers. In order to observe the fine fitting points of the network, for each curve, another set of output samples is generated using a higher-resolution input grid $x_j$; this test (observation) set contains 721 samples.

Fig. 3(a) and (b) respectively show the fitting results of a 2-layer ANNnet (of 8-1 structure) at $c = 1$ and $c = 0.1$. The $c$ value is a scaling factor for each of the $R_2$ to $R_n$ matrices in (15)-(18). The result at $c = 1$ (Fig. 3(a) and Table 1) shows a perfect fit for all the training data points but with a high testing (prediction) Sum of Squared Errors (SSE), which indicates a large difference between the output at the test samples and the underlying target (blue) curve. The fitting result shows a 'smoother' curve (with lower fluctuations) at $c = 0.1$ than at $c = 1$. In terms of the deviation of the predicted output from the target function, the SSE for the case of $c = 0.1$ is lower than that of $c = 1$ (see the test SSE values in Table 1). These results demonstrate an over-fitting case for Fig. 3(a) and an under-fitting case for Fig. 3(b) in terms of the training data, where the SSE on the training data is compromised.

The results for a 5-layer ANNnet (of 1-1-1-8-1 structure) at $c = 1$ and at $c = 0.1$ (Fig. 3(c) and (d)) show a smooth fit for either $c$ value. However, this smoother fit with lower fluctuations does not compromise the training SSE values while maintaining a low SSE for the test data (see Table 1). This is different from the 2-layer case, where the smoother fit compromises the fitting of individual data points. This fitting behavior can be observed from the almost-zero training SSE values for both $c$ settings in the 5-layer network. These results show a better generalization capability for the 5-layer network than for the 2-layer network.
Figure 3: Learned outputs of the ANNnet at different scaling $c$ settings for the 2-layer network (panels (a), (b)) and the 5-layer network (panels (c), (d)), plotted against the target function, the training data points and the noisy data points.

Table 1: Sum of Squared Errors (training/test) of the 2-layer and 5-layer ANNnet at $c = 1$ and $c = 0.1$.

The spiral problem is well known in studies of the mapping capability of neural networks. It has been shown to be intrinsically connected to the inside-outside relations [49]. In this example, a 6-arm spiral pattern has been generated with each arm perturbed by random noise (i.e., rotation angle + 0.3 × rand). Each arm contains a total of 500 samples with half being used for training and the remaining half for testing. In other words, the training set and the test set each contain 1500 samples for the 6 arms (1500 = 250 × 6). The modified softplus activation function with $c = 0.1$ has been adopted together with random $R_2, \cdots, R_n$ initialization. The adopted network structure is similar to that of the above except for the number of hidden nodes $h_3$, which is varied from 1250 to 1600 at intervals of 50. In a nutshell, a network structure of 150-250-$h_3$-6 has been adopted with $h_3$ varying within $\{1250, 1300, \cdots, 1550, 1600\}$ in order to observe its effect on network representation. The purpose of choosing $h_3$ in this range is to cover the training sample size (1500) for observing the accuracy of fit.
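To make the data setup reproducible in spirit, the following sketch (an interpretation of the description above; the radius profile and the exact placement of the noise are assumptions) generates a 6-arm spiral pattern with 500 points per arm and the stated rotation-angle perturbation of 0.3 × rand.

```python
import numpy as np

def six_arm_spiral(points_per_arm=500, arms=6, turns=1.5, seed=0):
    """Generate a noisy multi-arm spiral: returns 2-D points and their arm labels."""
    rng = np.random.default_rng(seed)
    X, labels = [], []
    for a in range(arms):
        t = np.linspace(0.0, turns * 2 * np.pi, points_per_arm)    # angle along the arm
        theta = t + a * (2 * np.pi / arms) + 0.3 * rng.random(points_per_arm)
        r = t / (turns * 2 * np.pi)                                 # radius grows along the arm
        X.append(np.column_stack([r * np.cos(theta), r * np.sin(theta)]))
        labels.append(np.full(points_per_arm, a))
    return np.vstack(X), np.concatenate(labels)

X, y = six_arm_spiral()
# Alternate samples into training and test halves (250 per arm each).
X_train, y_train = X[::2], y[::2]
X_test,  y_test  = X[1::2], y[1::2]
print(X_train.shape, X_test.shape)               # (1500, 2) (1500, 2)
```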
ANNnet .(1500) for observing the accuracy of fit. The results in Fig. 5 show zero SSE for h > Number of hidden nodes S u m o f S qua r ed E rr o r s Figure 5: Variation of the training Sum of Squared Errors (SSE) over the numberof hidden nodes at layer h for 10 trails based on different random seeds.Fig. 6(a) and (b) show respectively the SSEs for variation of h and h nodeswhile fixing the other two layers at 150 nodes. These results show a variation of theminimum number of hidden nodes needed for training data representation (in orderto arrive at an almost zero training SSE) towards the input layer.18
Figure 6: Variation of the training Sum of Squared Errors (SSE) over the number of hidden nodes at the two other hidden layers ((a) and (b)) for 10 trials based on different random seeds.

In this section, the proposed learning algorithm is evaluated using real-world data sets taken from the UCI Machine Learning Repository [50]. Our goals are (i) to observe whether the gradient-free learning is numerically feasible for real-world data of small to medium size on a computer with 8 GB of RAM and without a GPU, and, if feasible, how the accuracy compares with that of conventional network learning; (ii) to observe the computational CPU performance; and (iii) to evaluate the effects of network depth and sparseness using 2-, 3- and 5-layer networks on this relatively humble computing platform.
The UCI data sets are selected according to [51] and are summarized in Table 2 together with their pattern attributes. The experimental goals (i) and (iii) are evaluated by observing the prediction accuracy of the algorithm. The accuracy is defined as the percentage of samples being classified correctly. The experimental goal (ii) is observed by recording the training CPU processing time of the compared algorithms on an Intel Core i7-6500U CPU at 2.59GHz with 8 GB of RAM. The training CPU time is measured using Matlab's function cputime, which corresponds to the total computational time from each of the 2 cores in the processor. In other words, the physical time experienced is about 1/2 of this clocked cputime.

(i) Feasibility and prediction accuracy of two-layer networks

In this experiment, the prediction accuracy of ANNnet is recorded and compared with that of the feedforwardnet (abbreviated as FFnet) from the Matlab toolbox [52] based on a fully connected network structure of two layers. The activation function adopted for ANNnet is softplus with random weight initialization, whereas FFnet adopted the default 'tansig' activation function (since softplus does not converge well enough in this case) with the 'trainlm' learning search. The test performance is evaluated using 10 trials of stratified 10-fold cross-validation. The selection of the hidden layer size $h_1$ is conducted within the set {1, 2, 3, 5, 10, 20, 30, 50, 80, 100, 200, 500}. Table 2 summarizes the chosen hidden node sizes of the experimented networks for each data set. Apparently, there seems to be no strong correlation between the choices of hidden node sizes for the two networks of two layers.

Table 2: Summary of UCI [50] data sets and chosen hidden layer sizes ($h_1$) based on cross-validation within the training set. Columns: (i) sample size, (ii) input dimension, (iii) number of classes, (iv) missing values, followed by the chosen $h_1$ for the 2-layer⁺, 2-layer, 3-layer and 5-layer networks.

1. Shuttle-l-control       279(15)     6     2   no     3    3    5    5
2. BUPA-liver-disorder     345         6     2   no     2  500    3    2
3. Monks-1                 124(432)    6     2   no     5    1    2    2
4. Monks-2                 169(432)    6     2   no    10  200  500   80
5. Monks-3                 122(432)    6     2   no     5    1    3    5
6. Pima-diabetes           768         8     2   no     3    2    3   10
7. Tic-tac-toe             958         9     2   no    20   30  500   10
8. Breast-cancer-Wiscn     683(699)    9(10) 2   16    10    3    3    3
9. StatLog-heart           270        13     2   no     2   20    3    3
10. Credit-app             653(690)   15     2   37     1    3    1    5
11. Votes                  435        16     2   yes   10    2    3    3
12. Mushroom               5644(8124) 22     2   attr.

⁺: feedforwardnet from the Matlab toolbox using default settings. $h_1$: the numbers of hidden nodes for the 2-layer, 3-layer, and 5-layer networks are set as $h_1$-$q$, $2h_1$-$h_1$-$q$, and $8h_1$-$4h_1$-$2h_1$-$h_1$-$q$ respectively. Note: Data from the Attitudes Towards Smoking Legislation Survey – Metropolitan Toronto 1988, which was funded by NHRDP (Health and Welfare Canada), were collected by the Institute for Social Research at York University for Dr. Linda Pederson and Dr. Shelley Bull.

Fig. 7 shows the average prediction accuracy recorded from the 10 trials of stratified 10-fold cross-validation tests. The plot verifies (i) the feasibility of gradient-free computation for real-world data, with none of the results running into computational ill-conditioning. Moreover, the results show a comparable average prediction accuracy for ANNnet (with a grand average accuracy of 82.03% on the first 41 data sets) and
FFnet (with a grand average accuracy of 82.36% on the first 41 data sets) across most data sets. The result for the 42nd data set for FFnet was not available due to insufficient memory on our computing platform. As indicated by the shaded regions, these results show a significantly larger fluctuation of prediction accuracies for FFnet (whose results vary within the green region) than for ANNnet (red region) over the 10 trials of 10-fold cross-validation tests.
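For readers who wish to reproduce the evaluation protocol, the following sketch is a generic Python illustration using scikit-learn, not the authors' Matlab pipeline; the classifier and the data are placeholders. It computes the accuracy of repeated stratified 10-fold cross-validation as used for Fig. 7.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression   # placeholder classifier

def cv_accuracy(X, y, n_trials=10, n_splits=10, seed=0):
    """Mean, min and max accuracy over repeated stratified k-fold cross-validation."""
    accs = []
    for t in range(n_trials):
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed + t)
        for train_idx, test_idx in skf.split(X, y):
            clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
            accs.append(np.mean(clf.predict(X[test_idx]) == y[test_idx]))
    return float(np.mean(accs)), float(np.min(accs)), float(np.max(accs))

# Example with synthetic two-class data.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 6))
y = (X[:, 0] + 0.5 * rng.standard_normal(200) > 0).astype(int)
print(cv_accuracy(X, y, n_trials=2))    # (mean, min, max) accuracy
```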
Figure 7: Prediction accuracy based on the average of 10 trials of 10-fold cross-validation. The shaded regions depict the maximum and minimum accuracy bounds, and the mean accuracies for FFnet (green) and ANNnet (red) are respectively indicated by circles and asterisks.

(ii) The training CPU processing time
Fig. 8 shows the training CPU processing times (measured using the cputime function, which accounts for per-core processing time) of FFnet and ANNnet based on the same hidden node size for each data set. These results show at least an order of magnitude (10 times) of speed-up for ANNnet over FFnet in terms of the training time. The maximum observed ratio of training-time speed-up (CPU(FFnet)/CPU(ANNnet)) is considerably higher still. The main reason for the speed-up in training time is the gradient-free analytic training solution. Although the training time can be much faster than that of the conventional search mechanism, it is noted that the proposed method requires a relatively large amount of RAM to compute the pseudoinverse of the entire data matrix.
Figure 8: Comparing the training CPU time of ANNnet with that of Feedforwardnet adopting the same number of hidden nodes.

(iii) The effect of deep layers and sparseness
Effect of deep layers: Fig. 9 shows the average accuracy of ANNnet with 2-layer, 3-layer and 5-layer structures plotted for each data set. The 2-layer network uses a structure of $h_1$-$q$ with hidden layer size $h_1$ and output layer size $q$. The 3-layer network uses a $2h_1$-$h_1$-$q$ structure and the 5-layer network uses an $8h_1$-$4h_1$-$2h_1$-$h_1$-$q$ structure. The size of $h_1$ is determined by a cross-validation search within the training set over $h_1 \in$ {1, 2, 3, 5, 10, 20, 30, 50, 80, 100, 200, 500}. The results in Fig. 9 show that the 5-layer ANNnet outperformed the other two networks on many data sets. The grand average results for the ANNnet of 2 layers, 3 layers and 5 layers are respectively 81.75%, 81.28% and 84.13%. These results appear to support good generalization for the 5-layer network.
Figure 9: The effect of layers on accuracy performance.

Effects of weight scaling and sparseness: When the scaling factor $c$ for random initialization was set at 0.1, the grand average accuracy for ANNnet was slightly improved (82.21% on the first 41 data sets, compared with the results in (i)) for the two-layer network. As for the sparseness setting (receptive field $r_1$), the gross average accuracies are observed to be 82.30% and 83.02% respectively for the fully-connected and the sparsely-connected (with a receptive field of $r_1 = 3$ units in the first layer) two-layer ANNnets. In other words, the fully connected network and the sparse network have structures $h_1$-$h_2$ and $h_1^{r_1}$-$h_2$ respectively. For the three-layer ANNnet, the observed gross average accuracies are 81.82% and 81.93% respectively for the fully-connected ($h_1$-$h_2$-$h_3$) and the sparsely-connected ($h_1^{r_1}$-$h_2$-$h_3$) cases. These results (in Fig. 10) show a more significant impact of the weight scaling than of the receptive field (sparseness) on the generalization performance.
Figure 10: Comparing a non-sparse network (blue bars) with a sparse network (red bars): (a) 2-layer ANNnet, (b) 3-layer ANNnet.

In summary, the three goals of the experiments have been achieved as follows:

• Based on an extensive numerical study utilizing 42 real-world data sets of small to medium size in terms of their dimensions and sample counts, the numerical feasibility of network learning based on the Moore-Penrose inverse is clearly verified. The structures of the networks studied included two, three and five layers, where few serious numerical stability issues were observed.

• The prediction accuracy of the proposed learning is observed to be comparable with that of conventional gradient-based learning. Attributed to the analytic learning formulation, the training processing time is observed to be at least 10 times faster than that of conventional learning. For some data sets, the maximum learning speed-up is considerably higher.

• Based on the study of networks of two, three and five layers, the five-layer network shows more than 2% better prediction accuracy than the other two networks in terms of the grand average accuracy over all data sets.
• The study shows that both the weight scaling and the sparseness of weights affect the generalization performance. This shall be a good topic for further research in the future.
Discussion and Future Works
Capitalizing on the classical learning theory of the Moore-Penrose inverse, an analytic learning formulation for multilayer neural networks has been proposed. Essentially, the gradient-free learning is based on least squares error approximation through Moore-Penrose inverse projections onto the column and row spaces. Based on this projection manipulation, it turns out that the solutions to the layer weights are interdependent. However, thanks to the enormous number of feasible solutions available in the network, an analytic learning can be obtained by initializing the weights layer by layer. Though not limited to randomness, the feasibility of such a solution has been demonstrated with randomly initialized weights in the experiments. Based on these results, the following possibilities for future research are observed.

• The formulation based on the layered full-matrix network structure with elementwise activation functions does not depart from the conventional system of linear equations. Thanks to the inherent least squares approximation property of the Moore-Penrose inverse, such a formulation allows the network solution to be written in analytic form. This treatment in the linear system form not only makes the formulation analytic and transparent but also opens up vast research possibilities towards understanding network learning and optimization.

• An immediate possibility for such research concerns the effect of network depth on prediction generalization. Attributed to the layered matrix structure that hinges upon the system of linear equations, the number of possible alternative solutions to the system has been shown to increase exponentially. This opens up vast possibilities in initializing the network weights towards good generalization.

• Other possibilities include the structural construction, particularly the sparse structure and the scale of the initial weights, which are found to influence the prediction generalization in the numerical experiments.
Conclusion

Capitalizing on the inherent least squares approximation property offered by the Moore-Penrose inverse, a gradient-free approach for solving the learning problem of multilayer neural networks was proposed. The solution obtained from such an approach showed that network learning boils down to solving a set of interdependent weight equations. By initializing the inner weight matrices either with random matrices or with the data matrix, it turned out that the interdependency of the weight equations can be decoupled so that an analytic learning can be achieved. Based on the analytic solution, the network representation and generalization properties with respect to the layer depth were subsequently analyzed. Our numerical experiments on a wide range of data sets of small to medium size not only validated the numerical feasibility, but also demonstrated the generalization capability of the proposed network solution.
References

[1] Y. LeCun and Y. Bengio, "Convolutional networks for images, speech, and time series," The handbook of brain theory and neural networks, pp. 255–258, 1998.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS 2012), 2012, pp. 1097–1105.
[3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.
[4] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014. [Online]. Available: https://arxiv.org/abs/1409.1556
[5] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[6] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4700–4708.
[7] P. Baldi and K. Hornik, "Neural networks and principal component analysis: Learning from examples without local minima," Neural Networks, vol. 2, pp. 53–58, 1989.
[8] P. Baldi, "Linear learning: Landscapes and algorithms," in Advances in Neural Information Processing Systems (NIPS 1988), 1988, pp. 65–72.
[9] P. Baldi and Z. Lu, "Complex-valued autoencoders," Neural Networks, vol. 33, pp. 136–147, 2012.
[10] K. Kawaguchi, "Deep learning without poor local minima," in Proceedings of the 29th International Conference on Neural Information Processing Systems (NIPS 2016), 2016, pp. 586–594.
[11] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun, "The loss surfaces of multilayer networks," in Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, 2015, pp. 192–204.
[12] A. Choromanska, Y. LeCun, and G. B. Arous, "Open problem: The landscape of the loss surfaces of multilayer networks," in Proceedings of The 28th Conference on Learning Theory, 2015, pp. 1756–1760.
[13] H. Lu and K. Kawaguchi, "Depth creates no bad local minima," pp. 1–10, 2017. [Online]. Available: https://arxiv.org/abs/1702.08580
[14] C. Yun, S. Sra, and A. Jadbabaie, "Global optimality conditions for deep neural networks," in Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), Vancouver, BC, Canada, 2018, pp. 1–14.
[15] L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio, "Sharp minima can generalize for deep nets," 2017. [Online]. Available: https://arxiv.org/pdf/1703.04933.pdf
[16] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, "Identifying and attacking the saddle point problem in high-dimensional non-convex optimization," in Advances in Neural Information Processing Systems (NIPS 2014), 2014, pp. 2933–2941.
[17] K. Chen, L. Xu, and H. Chi, "Improved learning algorithms for mixture of experts in multiclass classification," Neural Networks, vol. 12, no. 9, pp. 1229–1252, 1999.
[18] H. Li, Z. Xu, G. Taylor, and T. Goldstein, "Visualizing the loss landscape of neural nets," in Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), Vancouver, BC, Canada, 2018, pp. 1–17.
[19] Q. Liao and T. Poggio, "Theory of deep learning II: Landscape of the empirical risk in deep learning," CBMM Memo No. 066, pp. 1–45, June 2017.
[20] S. Gunasekar, J. D. Lee, D. Soudry, and N. Srebro, "Implicit bias of gradient descent on linear convolutional networks," in Proceedings of the 29th International Conference on Neural Information Processing Systems (NIPS 2016), 2018, pp. 1–22.
[21] T. Poggio, Q. Liao, B. Miranda, A. Banburski, X. Boix, and J. Hidary, "Theory of deep learning IIIb: Generalization in deep networks," CBMM Memo No. 090, pp. 1–37, June 2018.
[22] J. Langford and R. Caruana, "(Not) Bounding the true error," in Advances in Neural Information Processing Systems 14, T. G. Dietterich, S. Becker, and Z. Ghahramani, Eds. MIT Press, 2002, pp. 809–816. [Online]. Available: http://papers.nips.cc/paper/1968-not-bounding-the-true-error.pdf
[23] G. K. Dziugaite and D. M. Roy, "Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data," 2017. [Online]. Available: https://arxiv.org/pdf/1703.11008.pdf
[24] B. Neyshabur, "Implicit regularization in deep learning," Ph.D. dissertation, Toyota Technological Institute at Chicago, 2017.
[25] B. Neyshabur, S. Bhojanapalli, and N. Srebro, "A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks," in Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), Vancouver, BC, Canada, 2018, pp. 1–9.
[26] P. L. Bartlett, D. J. Foster, and M. Telgarsky, "Spectrally-normalized margin bounds for neural networks," in Advances in Neural Information Processing Systems (NIPS 2017), 2017, pp. 6241–6250.
[27] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro, "Exploring generalization in deep learning," in Advances in Neural Information Processing Systems (NIPS 2017), 2017, pp. 5947–5956.
[28] K. Kawaguchi, L. P. Kaelbling, and Y. Bengio, "Generalization in deep learning," 2018. [Online]. Available: https://arxiv.org/pdf/1710.05468.pdf
[29] R. Brunelli, "Training neural nets through stochastic minimization," Neural Networks, vol. 7, no. 9, pp. 1405–1412, 1994.
[30] P. van der Smagt, "Minimisation methods for training feedforward neural networks," Neural Networks, vol. 7, no. 1, pp. 1–11, 1994.
[31] K. Chen, "Deep and modular neural networks," Handbook on Computational Intelligence, pp. 473–494, 2015 (Chapter 28).
[32] K.-A. Toh, "Learning from the kernel and the range space," in Proceedings of the 17th IEEE/ACIS International Conference on Computer and Information Science, Singapore, June 2018, pp. 417–422.
[33] K.-A. Toh, Z. Lin, Z. Li, B. Oh, and L. Sun, "Gradient-free learning based on the kernel and the range space," https://arxiv.org/abs/1810.11581, pp. 1–27, 27th October 2018.
[34] S. L. Campbell and C. D. Meyer, Generalized Inverses of Linear Transformations, (SIAM edition of the work published by Dover Publications, Inc., 1991) ed. Philadelphia, USA: Society for Industrial and Applied Mathematics, 2009.
[35] A. Albert, Regression and the Moore-Penrose Pseudoinverse. New York: Academic Press, Inc., 1972, vol. 94.
[36] A. Ben-Israel and T. N. E. Greville, Generalized Inverses: Theory and Applications, 2nd ed. New York: Springer-Verlag, 2003.
[37] R. MacAusland, "The Moore-Penrose inverse and least squares," April 2014, lecture notes in Advanced Topics in Linear Algebra (MATH 420). [Online]. Available: http://buzzard.ups.edu/courses/2014spring/420projects/math420-UPS-spring-2014-macausland-pseudo-inverse.pdf
[38] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York: John Wiley & Sons, Inc., 2001.
[39] P. J. Davis, Circulant Matrices. New York: Wiley, 1970, ISBN 0471057711.
[40] G. H. Golub and C. F. Van Loan, Matrix Computations. Johns Hopkins, 1996, ISBN 978-0-8018-5414-9.
[41] K.-I. Funahashi, "On the approximate realization of continuous mappings by neural networks," Neural Networks, vol. 2, no. 3, pp. 183–192, 1989.
[42] K. Hornik, M. Stinchcombe, and H. White, "Multi-layer feedforward networks are universal approximators," Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.
[43] G. Cybenko, "Approximations by superpositions of a sigmoidal function," Math. Cont. Signal & Systems, vol. 2, pp. 303–314, 1989.
[44] R. Hecht-Nielsen, "Kolmogorov's mapping neural network existence theorem," in Proceedings of IEEE First International Conference on Neural Networks (ICNN), vol. III, 1987, pp. 11–14.
[45] V. Kůrková, "Kolmogorov's theorem and multilayer neural networks," Neural Networks, vol. 5, no. 3, pp. 501–506, 1992.
[46] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken, "Multilayer feedforward networks with a nonpolynomial activation function can approximate any function," Neural Networks, vol. 6, pp. 861–867, 1993.
[47] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, "Understanding deep learning requires rethinking generalization," in Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), Toulon, France, 2017, pp. 1–15.
[48] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Canada: Springer, 2001.
[49] K. Chen and D. L. Wang, "Perceiving geometric patterns: from spirals to inside/outside relations," IEEE Transactions on Neural Networks, vol. 12, no. 5, pp. 1084–1102, 2001.
[50] M. Lichman, "UCI machine learning repository," 2013. [Online]. Available: http://archive.ics.uci.edu/ml
[51] K.-A. Toh, Q.-L. Tran, and D. Srinivasan, "Benchmarking a reduced multivariate polynomial pattern classifier," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 26, no. 6, pp. 740–755, 2004.
[52] The MathWorks, "Matlab and simulink," in