Ill-Posedness and Optimization Geometry for Nonlinear Neural Network Training
Thomas O'Leary-Roseberry, Omar Ghattas

Oden Institute for Computational Engineering and Sciences, The University of Texas at Austin, Austin, TX. Correspondence to: Thomas O'Leary-Roseberry <[email protected]>.

Abstract
In this work we analyze the role nonlinear activation functions play at stationary points of dense neural network training problems. We consider a generic least squares loss function training formulation. We show that the nonlinear activation functions used in the network construction play a critical role in classifying stationary points of the loss landscape. We show that for shallow dense networks, the nonlinear activation function determines the Hessian nullspace in the vicinity of global minima (if they exist), and therefore determines the ill-posedness of the training problem. Furthermore, for shallow nonlinear networks we show that the zeros of the activation function and its derivatives can lead to spurious local minima, and discuss conditions for strict saddle points. We extend these results to deep dense neural networks, showing that the last activation function plays an important role in classifying stationary points, due to how it appears in the gradient through the chain rule.
1. Introduction
Here, we characterize the optimization geometry of nonlinear least-squares regression problems for generic dense neural networks and analyze the ill-posedness of the training problem. Neural networks are a popular nonlinear functional approximation technique that are successful in data driven approximation regimes. A one-layer neural network can approximate any continuous function on a compact set to a desired accuracy, given enough neurons (Cybenko, 1989; Hornik et al., 1989). Dense neural networks have been shown to be able to approximate polynomials arbitrarily well given enough hidden layers (Schwab & Zech, 2019). While no general functional analytic approximation theory exists for neural networks, they are widely believed to have great approximation power for complicated patterns in data (Poggio & Liao, 2018).

Training a neural network, i.e., determining optimal values of network parameters to fit given data, can be accomplished by solving the nonconvex optimization problem of minimizing a loss function (known as empirical risk minimization). Finding a global minimum is NP-hard, and instead one usually settles for local minimizers (Bertsekas, 1997; Murty & Kabadi, 1987). Here we seek to characterize how nonlinear activation functions affect the least-squares optimization geometry at stationary points. In particular, we wish to characterize the conditions for strict saddle points and spurious local minima. Strict saddle points are stationary points where the Hessian has at least one direction of strictly negative curvature. They do not pose a significant problem for neural network training, since they can be escaped efficiently with first and second order methods (Dauphin et al., 2014; Jin et al., 2017a;b; Nesterov & Polyak, 2006; O'Leary-Roseberry et al., 2019). On the other hand, spurious local minima (where the gradient vanishes but the data misfit is nonzero) are more problematic; escaping from them in a systematic way may require third order information (Anandkumar & Ge, 2016).

We also seek to analyze the rank deficiency of the Hessian of the loss function at global minima (if they exist), in order to characterize the ill-posedness of the nonlinear neural network training problem. Training a neural network is, mathematically, an inverse problem; rank deficiency of the Hessian often makes solution of the inverse problem unstable to perturbations in the data and leads to severe numerical difficulties when using finite precision arithmetic (Hansen, 1998). While early termination of optimization iterations often has a regularizing effect (Hanke, 1995; Engl et al., 1996), and general-purpose regularization operators (such as $\ell_1$ or $\ell_2$) can be invoked, when to terminate the iterations and how to choose the regularization to limit bias in the solution are omnipresent challenges. On the other hand, characterizing the nullspace of the Hessian can provide a basis for developing a principled regularization operator that parsimoniously annihilates this nullspace, as has been recently done for shallow linear neural networks (Zhu et al., 2019).

We consider both shallow and deep dense neural network parametrizations. The dense parametrization is sufficiently general since convolution operations can be represented as cyclic matrices with repeating block structure.
For the sake of brevity, we do not consider affine transformations, but this work can easily be extended to that setting. We begin by analyzing shallow dense nonlinear networks, for which we show that the nonlinear activation function plays a critical role in classifying stationary points. In particular, if the neural network can exactly fit the data, and zero misfit global minima exist, we show how the Hessian nullspace depends on the activation function and its first derivative at these points.

For linear networks, results about local minima, global minima, strict saddle points, and optimal regularization operators have been shown (Baldi & Hornik, 1989; Zhu et al., 2019). The linear network case is a nonlinear matrix factorization problem: given data matrices $X \in \mathbb{R}^{n \times d}$, $Y \in \mathbb{R}^{m \times d}$, one seeks to find $W_2^* \in \mathbb{R}^{m \times r}$, $W_1^* \in \mathbb{R}^{r \times n}$ that minimize

$$\|Y - W_2 W_1 X\|_F^2. \quad (1)$$

When the data matrix $X$ has full row rank, then by the Eckart-Young Theorem the solution is given by the rank $r$ truncated SVD of $Y X^T (XX^T)^{-1}$, which we denote with a subscript $r$:

$$W_2^* W_1^* = [Y X^T (XX^T)^{-1}]_r. \quad (2)$$

The solution is non-unique, since for any invertible matrix $B \in \mathbb{R}^{r \times r}$,

$$(W_2^* B)(B^{-1} W_1^*) = [Y X^T (XX^T)^{-1}]_r \quad (3)$$

is also a solution.
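The following sketch illustrates Equations (2) and (3) numerically. NumPy is used purely for illustration, and the matrix sizes are arbitrary choices rather than anything prescribed above.

```python
import numpy as np

# Sketch of the linear-network solution (2) and its non-uniqueness (3),
# assuming X has full row rank; sizes are illustrative.
rng = np.random.default_rng(2)
n, m, d, r = 4, 3, 20, 2
X = rng.standard_normal((n, d)); Y = rng.standard_normal((m, d))

Z = Y @ X.T @ np.linalg.inv(X @ X.T)            # Y X^T (X X^T)^{-1}
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
Zr = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]      # rank-r truncation [ . ]_r

W2 = U[:, :r] @ np.diag(s[:r])                  # one factorization of Zr
W1 = Vt[:r, :]
B = rng.standard_normal((r, r)) + 3 * np.eye(r) # an (assumed invertible) B
print(np.allclose(W2 @ W1, Zr))
print(np.allclose((W2 @ B) @ (np.linalg.inv(B) @ W1), Zr))  # same product
```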
We show that, in addition to inheriting issues related to the ill-posedness of matrix factorization, the nonlinear activation functions in the nonlinear training problem create additional ill-posedness and non-uniqueness. We show that stationary points not corresponding to zero misfit global minima are determined by the activation function and its first derivative through an orthogonality condition. In contrast to linear networks, for which the existence of spurious local minima depends only on the rank of the training data and the weights, we show that for nonlinear networks, both spurious local minima and strict saddle points exist, and depend on the activation functions, the training data, and the weights. We extend these results to deep dense neural networks, where stationary points can arise from exact reconstruction of the training data by the network, or from an orthogonality condition that involves the activation functions of each layer of the network and their first derivatives.

For nonlinear neural networks, some work exists on analyzing networks with ReLU activation functions; in particular, Safran and Shamir establish conditions for the existence of spurious local minima for two layer ReLU networks (Safran & Shamir, 2017).

Notation. For a given matrix $A \in \mathbb{R}^{m \times n}$, its vectorization $\mathrm{vec}(A) \in \mathbb{R}^{mn}$ is the vector obtained by stacking the columns of $A$ sequentially. Given a vector $z \in \mathbb{R}^m$, its diagonalization $\mathrm{diag}(z) \in \mathbb{R}^{m \times m}$ is the diagonal matrix with entry $ii$ equal to component $i$ of $z$. The diagvec operation is the composition $\mathrm{diagvec}(A) = \mathrm{diag}(\mathrm{vec}(A)) \in \mathbb{R}^{mn \times mn}$; this is sometimes shortened to dvec. The identity matrix in $\mathbb{R}^{d \times d}$ is denoted $I_d$. We use the notation $\nabla_X f(X)$ for derivatives of a function $f$ with respect to a matrix $X$, and $\partial_{\mathrm{vec}(X)} f(\mathrm{vec}(X))$ when expressing derivatives with respect to a vectorized matrix $\mathrm{vec}(X)$: $\partial_{\mathrm{vec}(X)} f(X) = \partial f / \partial \mathrm{vec}(X)$. For matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times q}$, the Kronecker product $A \otimes B \in \mathbb{R}^{pm \times qn}$ is the block matrix

$$A \otimes B = \begin{bmatrix} a_{11} B & \cdots & a_{1n} B \\ \vdots & \ddots & \vdots \\ a_{m1} B & \cdots & a_{mn} B \end{bmatrix}. \quad (4)$$

For matrices $A, B \in \mathbb{R}^{m \times n}$, $A \circ B \in \mathbb{R}^{m \times n}$ is the Hadamard (element-wise) product. For a matrix $A \in \mathbb{R}^{m \times n}$ and a matrix $B \in \mathbb{R}^{n \times k}$, the expression $A \perp B$ means that the rows of $A$ are orthogonal to the columns of $B$, and thus $AB = 0$.

For a differentiable function $F: \mathbb{R}^{d_W} \to \mathbb{R}$ and a parameter $W \in \mathbb{R}^{d_W}$, we say that $W$ is a first order stationary point if $\nabla F(W) = 0$; we say that $W$ is a strict saddle point if the Hessian $\nabla^2 F(W)$ has a negative eigenvalue. We say that $W$ is a local minimum if the eigenvalues of the Hessian $\nabla^2 F(W)$ are all nonnegative. We say that $W$ is a global minimum if $F(W) \leq F(\widetilde{W})$ for all $\widetilde{W} \in \mathbb{R}^{d_W}$.
2. Stationary Points of Shallow Dense Networks
We start by considering a one layer dense neural network training problem. Given training data matrices $X \in \mathbb{R}^{n \times d}$, $Y \in \mathbb{R}^{m \times d}$, the shallow neural network architecture consists of an encoder weight matrix $W_1 \in \mathbb{R}^{r \times n}$, a nonlinear activation function $\sigma$ (which is applied element-wise), and a decoder weight matrix $W_2 \in \mathbb{R}^{m \times r}$. The training problem (empirical risk minimization) may then be stated as

$$\min_{W_1, W_2} F(W_1, W_2) = \frac{1}{2} \sum_{i=1}^{d} \|y_i - W_2 \sigma(W_1 x_i)\|^2_{\ell^2(\mathbb{R}^m)} = \frac{1}{2} \|Y - W_2 \sigma(W_1 X)\|^2_{F(\mathbb{R}^{m \times d})}. \quad (5)$$

We begin by analyzing first order stationary points of the objective function $F$.
Theorem 1. The gradient of the objective function $F$ is given by $\nabla F(W_1, W_2) = [\nabla_{W_1} F(W_1, W_2)^T, \nabla_{W_2} F(W_1, W_2)^T]^T$, where

$$\nabla_{W_2} F(W_1, W_2) = (W_2 \sigma(W_1 X) - Y)\, \sigma(W_1 X)^T \quad (6)$$

$$\nabla_{W_1} F(W_1, W_2) = [\sigma'(W_1 X) \circ (W_2^T (W_2 \sigma(W_1 X) - Y))]\, X^T. \quad (7)$$

First order stationary points are characterized by two main conditions:

1. A global minimum where the misfit is exactly zero: $W_2 \sigma(W_1 X) = Y$. The possibility of this depends on the representation capability of the network and on the data.

2. A stationary point not corresponding to zero misfit, for which $\sigma'(W_1 X) \circ (W_2^T (W_2 \sigma(W_1 X) - Y)) \perp X^T$ and $(W_2 \sigma(W_1 X) - Y) \perp \sigma(W_1 X)^T$.
Proof. The partial derivatives of the objective function $F(W_1, W_2)$ are derived in Lemma 2. At a first order stationary point of the objective function $F$, both partial derivatives must be zero:

$$(W_2 \sigma(W_1 X) - Y)\, \sigma(W_1 X)^T = 0 \quad (8)$$

$$[\sigma'(W_1 X) \circ (W_2^T (W_2 \sigma(W_1 X) - Y))]\, X^T = 0. \quad (9)$$

In the case that $W_2 \sigma(W_1 X) = Y$, both terms are zero, and the corresponding choices of $W_1, W_2$ define a global minimum. This can be seen since $F$ is a nonnegative function, and in this case it is exactly zero.

Stationary points where $W_2 \sigma(W_1 X) \neq Y$ are characterized by orthogonality conditions. If $\nabla_{W_2} F(W_1, W_2) = 0$, then $(W_2 \sigma(W_1 X) - Y) \perp \sigma(W_1 X)^T$; that is, the rows of $W_2 \sigma(W_1 X) - Y$ and the columns of $\sigma(W_1 X)^T$ are pairwise orthogonal. If $\nabla_{W_1} F(W_1, W_2) = 0$, then similarly $[\sigma'(W_1 X) \circ (W_2^T (W_2 \sigma(W_1 X) - Y))] \perp X^T$.
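As a sanity check on Equations (6) and (7), the following sketch compares the closed-form gradients against central finite differences. The tanh activation and the matrix sizes are illustrative assumptions, not choices made above.

```python
import numpy as np

# Finite-difference check of the gradient formulas (6)-(7), assuming tanh.
rng = np.random.default_rng(1)
n, r, m, d = 3, 4, 2, 6
X = rng.standard_normal((n, d)); Y = rng.standard_normal((m, d))
W1 = rng.standard_normal((r, n)); W2 = rng.standard_normal((m, r))
sigma, dsigma = np.tanh, lambda t: 1.0 - np.tanh(t) ** 2

def loss(W1, W2):
    return 0.5 * np.linalg.norm(Y - W2 @ sigma(W1 @ X), "fro") ** 2

# Closed-form gradients from Theorem 1.
R = W2 @ sigma(W1 @ X) - Y                      # misfit matrix
grad_W2 = R @ sigma(W1 @ X).T
grad_W1 = (dsigma(W1 @ X) * (W2.T @ R)) @ X.T

# Central finite differences, entry by entry.
def fd_grad(f, W, eps=1e-6):
    G = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        E = np.zeros_like(W); E[idx] = eps
        G[idx] = (f(W + E) - f(W - E)) / (2 * eps)
    return G

print(np.allclose(grad_W2, fd_grad(lambda W: loss(W1, W), W2), atol=1e-6))
print(np.allclose(grad_W1, fd_grad(lambda W: loss(W, W2), W1), atol=1e-6))
```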
Corollary 1.1. Any $W_1, W_2$ such that $\sigma(W_1 X) = 0$ and $\sigma'(W_1 X) \circ (W_2^T (W_2 \sigma(W_1 X) - Y)) = 0$ correspond to first order stationary points of the objective function $F$. In particular, any $W_1$ for which $\sigma(W_1 X) = \sigma'(W_1 X) = 0$ corresponds to a first order stationary point for all $W_2$.

This result implies that points in parameter space where the activation function and its derivatives are zero can lead to sub-optimal stationary points. Note that if a zero misfit minimum is not possible, there may or may not be an actual global minimum (there will always be a global infimum), but since the misfit is not zero, any such point will still fall into the second category. In what follows we characterize the optimization geometry of the objective function $F$ at global minima, and at degenerate points of the activation function, i.e., points for which $\sigma(W_1 X) = \sigma'(W_1 X) = 0$.
Suppose that for given data $X, Y$, there exist $W_1, W_2, \sigma$ such that $W_2 \sigma(W_1 X) = Y$. As was discussed in Theorem 1, such points correspond to a global minimum. In what follows we characterize the Hessian nullspace at these points, and the corresponding ill-posedness of the training problem.
Theorem 2 (Characterization of the Hessian nullspace at a global minimum).
Given data $X, Y, \sigma$, suppose there exist weight matrices $W_1, W_2$ such that $Y = W_2 \sigma(W_1 X)$. Suppose further that $W_2$ and $\sigma(W_1 X)$ are full rank. Then the Hessian nullspace is characterized by directions $\widehat{W}_1, \widehat{W}_2$ such that

$$\left[\widehat{W}_2 \sigma(W_1 X) + W_2 (\sigma'(W_1 X) \circ \widehat{W}_1 X)\right] \perp \sigma(W_1 X)^T \quad (10)$$

$$\left[[\sigma'(W_1 X) \circ (W_2^T \widehat{W}_2 \sigma(W_1 X))] + [\sigma(W_1 X) \circ (W_2^T W_2 (\sigma'(W_1 X) \circ \widehat{W}_1 X))]\right] \perp X^T. \quad (11)$$

In particular, for any direction $\widehat{W}_1$ such that the directional derivative $\sigma'(W_1 X) \circ \widehat{W}_1 X$ is zero, the pair $(\widehat{W}_1, \widehat{W}_2)$ with

$$\widehat{W}_2 = -W_2 (\sigma'(W_1 X) \circ \widehat{W}_1 X)\, \sigma(W_1 X)^T [\sigma(W_1 X) \sigma(W_1 X)^T]^{-1} \quad (12)$$

is in the nullspace of the Hessian matrix $\nabla^2 F(W_1, W_2)$.
Proof. Since the misfit is zero, the Hessian is exactly the Gauss-Newton Hessian, which is derived in Lemma 3. The matrices $\widehat{W}_1, \widehat{W}_2$ are in the nullspace of the Hessian if

$$\left[\widehat{W}_2 \sigma(W_1 X) + W_2 (\sigma'(W_1 X) \circ \widehat{W}_1 X)\right] \sigma(W_1 X)^T = 0 \quad (13)$$

$$\left[[\sigma'(W_1 X) \circ (W_2^T \widehat{W}_2 \sigma(W_1 X))] + [\sigma(W_1 X) \circ (W_2^T W_2 (\sigma'(W_1 X) \circ \widehat{W}_1 X))]\right] X^T = 0. \quad (14)$$

For this to be the case we need that $[\widehat{W}_2 \sigma(W_1 X) + W_2 (\sigma'(W_1 X) \circ \widehat{W}_1 X)] \perp \sigma(W_1 X)^T$ and $[[\sigma'(W_1 X) \circ (W_2^T \widehat{W}_2 \sigma(W_1 X))] + [\sigma(W_1 X) \circ (W_2^T W_2 (\sigma'(W_1 X) \circ \widehat{W}_1 X))]] \perp X^T$. The Hessian nullspace is fully characterized by points $\widehat{W}_1, \widehat{W}_2$ that satisfy these two orthogonality constraints. One way in which these constraints are satisfied is if

$$\widehat{W}_2 \sigma(W_1 X) = -W_2 (\sigma'(W_1 X) \circ \widehat{W}_1 X) \quad (15)$$

$$[\sigma'(W_1 X) \circ (W_2^T \widehat{W}_2 \sigma(W_1 X))] + [\sigma(W_1 X) \circ (W_2^T W_2 (\sigma'(W_1 X) \circ \widehat{W}_1 X))] = 0. \quad (16)$$

Substituting (15) into (16) we have

$$[-\sigma'(W_1 X) + \sigma(W_1 X)] \circ (W_2^T W_2 (\sigma'(W_1 X) \circ \widehat{W}_1 X)) = 0. \quad (17)$$

The first term is nonzero if $\sigma \neq \exp$, since $\sigma(W_1 X)$ is assumed to be full rank. For the Hadamard product to be zero, the second term must then be zero:

$$W_2^T W_2 (\sigma'(W_1 X) \circ \widehat{W}_1 X) = 0. \quad (18)$$

This is accomplished when $W_2^T W_2 \perp \sigma'(W_1 X) \circ \widehat{W}_1 X$. Since $W_2$ is full rank, this condition reduces to $\sigma'(W_1 X) \circ \widehat{W}_1 X = 0$. Suppose that $\widehat{W}_1$ satisfies this directional derivative constraint; then we can find a corresponding $\widehat{W}_2$ such that $\widehat{W}_1, \widehat{W}_2$ are in the Hessian nullspace from (15):

$$\widehat{W}_2 = -W_2 (\sigma'(W_1 X) \circ \widehat{W}_1 X)\, \sigma(W_1 X)^T [\sigma(W_1 X) \sigma(W_1 X)^T]^{-1}. \quad (19)$$

Note that $\sigma(W_1 X) \sigma(W_1 X)^T \in \mathbb{R}^{r \times r}$ is invertible since $\sigma(W_1 X)$ is assumed to be full rank.

This result shows that the Hessian may have a nontrivial nullspace at zero misfit global minima; in particular, if there are any local directions $\widehat{W}_1$ satisfying the directional derivative constraint $\sigma'(W_1 X) \circ \widehat{W}_1 X = 0$, then the Hessian is guaranteed to have at least one zero eigenvalue. If the Hessian has at least one zero eigenvalue, then the candidate global minimum $W_1, W_2$ is not unique, and instead lies on a manifold of global minima. Global minima are in this case weak minima.

This result is similar to the non-uniqueness of the linear network training problem, Equation (3). However, in this case the linear rank constraints are obfuscated by the nonlinear activation function, and, additionally, the zeros of the activation function lead to more possibility for Hessian rank deficiency and associated ill-posedness.

For weak global minima, regularization schemes that annihilate the Hessian nullspace while leaving the range space unscathed can be used to make the training problem well-posed without biasing the solution. Furthermore, such regularization schemes will accelerate the asymptotic convergence rates of second order methods (Newton convergence deteriorates from quadratic to linear in the presence of singular Hessians), thereby making them even more attractive relative to first order methods.
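The directional derivative condition $\sigma'(W_1 X) \circ \widehat{W}_1 X = 0$ can be illustrated numerically. The sketch below is a toy construction, not an implementation from this paper: it assumes a ReLU activation with one row of $W_1 X$ forced to be strictly negative (a "dead" unit), which deliberately violates the full-rank assumption of Theorem 2 but makes the flat direction easy to see. Along a direction supported on the dead row the loss is locally constant at a zero misfit point, while a generic direction increases the loss.

```python
import numpy as np

# Flat (nullspace) direction at a zero-misfit point, assuming ReLU and a
# "dead" first row of W1 X (all pre-activations strictly negative).
rng = np.random.default_rng(3)
n, r, m, d = 3, 4, 2, 8
X = np.abs(rng.standard_normal((n, d)))                  # positive data
W1 = rng.standard_normal((r, n))
W1[0, :] = -1.0 - np.abs(W1[0, :])                       # dead row 0
W2 = rng.standard_normal((m, r))
relu = lambda t: np.maximum(t, 0.0)
Y = W2 @ relu(W1 @ X)                                    # zero-misfit data

def loss(W1_, W2_):
    return 0.5 * np.linalg.norm(Y - W2_ @ relu(W1_ @ X), "fro") ** 2

# Direction supported on the dead row: relu'(W1 X) o (W1_hat X) = 0, W2_hat = 0.
W1_hat = np.zeros_like(W1); W1_hat[0, :] = rng.standard_normal(n)
t = 1e-3
print(loss(W1 + t * W1_hat, W2))                         # stays exactly 0.0
print(loss(W1 + t * rng.standard_normal(W1.shape), W2) > 0)  # generic: grows
```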
As was shown in Theorem 1 and Corollary 1.1, there are stationary points where the misfit is not zero. In this section we show that these points can be both strict saddle points as well as spurious local minima. Suppose the gradient is zero, but the misfit is nonzero. As was discussed in condition 2 of Theorem 1, such points require orthogonality conditions for the matrices that show up in the gradient. Corollary 1.1 establishes that this is achieved if $\sigma(W_1 X) = \sigma'(W_1 X) = 0$. Many activation functions such as ReLU, sigmoid, softmax, softplus, and tanh have many points satisfying these conditions (or at least approximately satisfying these conditions, i.e., for small $\epsilon > 0$, $\|\sigma'(W_1 X)\|_F, \|\sigma(W_1 X)\|_F \leq \epsilon$). Such stationary points are degenerate due to the activation functions. In what follows we show that while these points are likely to be strict saddles, it is possible that some of them have no directions of negative curvature and are thus spurious local minima.
Theorem 3 (Negative curvature directions at degenerate activation stationary points). Let $W_2$ be arbitrary and suppose that $W_1$ is such that $\sigma'(W_1 X) = \sigma(W_1 X) = 0$. Negative curvature directions of the Hessian at such points are characterized by directions $\widehat{W}_1$ such that

$$\sum_{k=1}^{d} \sum_{i=1}^{r} (\widehat{W}_1 x^{(k)})_i^2 \, \big(\sigma''(W_1 x^{(k)})\big)_i \, \big(W_2^T y^{(k)}\big)_i > 0. \quad (20)$$
Proof. Since $\sigma'(W_1 X) = 0$, all of the terms in the Gauss-Newton Hessian are zero (see Lemma 3). Further, all of the off-diagonal non-Gauss-Newton portions are also zero. In this case the only block of the Hessian that is nonzero is the non-Gauss-Newton $W_1$-$W_1$ block (see Lemma 4). We proceed by analyzing an un-normalized Rayleigh quotient for this block in an arbitrary direction $\widehat{W}_1$. From Equation (53) we can compute the quadratic form:

$$\mathrm{vec}(\widehat{W}_1)^T (\partial_{\mathrm{vec}(W_1)} \partial_{\mathrm{vec}(W_1)} \mathrm{misfit})^T \mathrm{misfit} \, \mathrm{vec}(\widehat{W}_1) = \mathrm{vec}(\widehat{W}_1 X)^T \, \mathrm{dvec}\big((W_2^T(W_2\sigma(W_1X) - Y)) \circ \sigma''(W_1X)\big) \, \mathrm{vec}(\widehat{W}_1 X) = \mathrm{vec}(\widehat{W}_1 X)^T \, \mathrm{vec}\big[(W_2^T(W_2\sigma(W_1X) - Y)) \circ \sigma''(W_1X) \circ \widehat{W}_1 X\big]. \quad (21)$$

Expanding this term in a sum we have

$$\sum_{k=1}^{d} \sum_{i=1}^{r} (\widehat{W}_1 x^{(k)})_i^2 \, \big(\sigma''(W_1 x^{(k)})\big)_i \, \big(W_2^T(W_2\sigma(W_1 x^{(k)}) - y^{(k)})\big)_i. \quad (22)$$

The result follows by noting that $\sigma(W_1 X) = 0$, so that $W_2^T(W_2\sigma(W_1 x^{(k)}) - y^{(k)}) = -W_2^T y^{(k)}$, and the quadratic form (22) is therefore negative exactly when (20) holds.

Directions $\widehat{W}_1$ that satisfy the negative curvature condition (20) are difficult to understand in their full generality, since they depend on $X$, $Y$, and $\sigma''$. We discuss some example sufficient conditions.
Corollary 3.1 (Saddle point with respect to one data pair). Given $W_1, W_2$ and a strictly convex activation function $\sigma$, suppose that $\sigma'(W_1 X) = \sigma(W_1 X) = 0$. Suppose that there is a data pair with $x^{(k)} \neq 0$ such that $W_2^T y^{(k)}$ has at least one positive component. Then $W_1, W_2$ is a strict saddle point.

Proof. If $(W_2^T y^{(k)})_i > 0$ and $x^{(k)}_j \neq 0$, then the direction $\widehat{W}_1$ with $(\widehat{W}_1)_{ij} = 1$ and all other components zero defines a direction of negative curvature:

$$\sum_{i'=1}^{r} (\widehat{W}_1 x^{(k)})_{i'}^2 \, \big(\sigma''(W_1 x^{(k)})\big)_{i'} \, \big(W_2^T y^{(k)}\big)_{i'} = (x^{(k)}_j)^2 \, \big(\sigma''(W_1 x^{(k)})\big)_i \, \big(W_2^T y^{(k)}\big)_i > 0. \quad (23)$$
Corollary 3.2. Given $W_1, W_2$ and a strictly convex activation function $\sigma$ with $\sigma'(W_1 X) = \sigma(W_1 X) = 0$, if all elements of one row of $W_2^T Y$ are positive, then $W_1, W_2$ is a strict saddle point.

Proof. Let the $i$th row of $W_2^T Y$ satisfy this condition. Then any choice of $\widehat{W}_1 \neq 0$ such that all rows other than row $i$ are zero (and whose $i$th row is not orthogonal to all of the data) defines a direction of negative curvature.

These conditions are rather restrictive, but they demonstrate the nature of the existence of negative curvature directions. As was stated before, the most general condition for a strict saddle is the existence of a $\widehat{W}_1$ satisfying Equation (20). We conjecture that such an inequality should not be hard to satisfy, but as it is a nonlinear inequality, finding general conditions for the existence of such a $\widehat{W}_1$ is difficult. We have the following result about how the zeros of the activation function and its derivatives can lead to spurious local minima.
Corollary 3.3. For a given $W_1$, if $\sigma(W_1 X) = \sigma'(W_1 X) = \sigma''(W_1 X) = 0$, then the Hessian at this point is exactly zero, and this point defines a spurious local minimum.

Such points exist for functions like ReLU, sigmoid, softmax, softplus, etc. Any activation function that has large regions where it is zero (or near zero) will have such points. The question is then, how common are they? For the aforementioned functions, the function and its derivatives are zero or near zero when the argument of the function is sufficiently negative. For these functions and a given tolerance $\epsilon > 0$ there exists a constant $C \leq 0$ such that for all $\xi < C$, $\sigma(\xi) \leq \epsilon$, $\sigma'(\xi) \leq \epsilon$ and $\sigma''(\xi) \leq \epsilon$. For ReLU (which is not differentiable at zero) $C = \epsilon = 0$. In one dimension this condition is true for roughly half of the real number line for each of these functions. For the condition to be true for a vector it must be true elementwise. So for the condition

$$\sigma(W_1 x^{(k)}) \leq \epsilon, \quad \sigma'(W_1 x^{(k)}) \leq \epsilon, \quad \sigma''(W_1 x^{(k)}) \leq \epsilon \quad (24)$$

to hold for a given input datum $x^{(k)}$, the encoder matrix must map $x^{(k)}$ into the strictly negative orthant of $\mathbb{R}^r$. The probability of drawing a mean zero Gaussian random vector in $\mathbb{R}^r$ that lies in the strictly negative orthant is $2^{-r}$. Furthermore, for this condition to hold for all of $W_1 X$ it must be true for each column of the matrix $W_1 X$. The probability of drawing a mean zero Gaussian random matrix in $\mathbb{R}^{r \times d}$ such that each column resides in the strictly negative orthant is $2^{-rd}$. In practice the linearly encoded input data matrix $W_1 X$ is unlikely to have the statistical properties of a mean zero Gaussian, but this heuristic demonstrates that these degenerate points may be improbable to encounter. If the Hessian is exactly zero, one needs third order information to move in a descent direction (Anandkumar & Ge, 2016).
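A quick Monte Carlo sketch of the $2^{-rd}$ heuristic follows. The Gaussian model for $W_1 X$ is exactly the assumption stated above, and the sizes are arbitrary.

```python
import numpy as np

# Monte Carlo estimate of the probability that an r x d mean-zero Gaussian
# matrix has every entry negative, compared with the 2^{-rd} heuristic.
rng = np.random.default_rng(4)
r, d, trials = 3, 2, 200_000
samples = rng.standard_normal((trials, r * d))
hits = np.all(samples < 0, axis=1).sum()
print(hits / trials, 2.0 ** (-r * d))   # both approximately 0.0156
```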
3. Extension to Deep Networks
In this section we briefly discuss the general conditions for stationary points of a dense neural network. We consider the following parametrization: the weights for an $N$ layer network are $[W_1, W_2, \cdots, W_N]$, where $W_1 \in \mathbb{R}^{r_1 \times n}$, $W_N \in \mathbb{R}^{m \times r_{N-1}}$, and all other $W_j \in \mathbb{R}^{r_j \times r_{j-1}}$. The activation functions $\sigma_j$ are arbitrary. The network parametrization is

$$W_N \sigma_N(W_{N-1} \sigma_{N-1}(\cdots \sigma_2(W_1 X) \cdots)). \quad (25)$$

We have the following general result about first order stationary points of deep dense neural networks.
Theorem 4 (Stationary points of deep dense neural networks). The blocks of the gradient of the least squares loss function for the deep neural network (Equation (25)) are as follows:

$$\nabla_{W_j} F(W) = \Big[\,\sigma'_{j+1}(W_j\sigma_j(\cdots\sigma_2(W_1X)\cdots)) \circ \Big(W_{j+1}^T\big(\sigma'_{j+2}(W_{j+1}\sigma_{j+1}(\cdots\sigma_2(W_1X)\cdots)) \circ \cdots \circ \big(W_{N-1}^T\big(\sigma'_N(W_{N-1}\sigma_{N-1}(\cdots\sigma_2(W_1X)\cdots)) \circ \big(W_N^T(W_N\sigma_N(W_{N-1}\cdots\sigma_2(W_1X)\cdots) - Y)\big)\big)\big)\cdots\big)\Big)\Big]\;\sigma_j(W_{j-1}\sigma_{j-1}(\cdots\sigma_2(W_1X)\cdots))^T. \quad (26)$$

Here, for $j = N$ the bracketed factor reduces to the misfit $W_N\sigma_N(\cdots) - Y$, and for $j = 1$ the trailing factor is $X^T$. Stationary points of the loss function are characterized by two main cases:

1. The misfit is exactly zero. If such points are possible, then these points correspond to global minima.

2. For each block $j$, the bracketed factor appearing in Equation (26) is orthogonal to the trailing factor:

$$\Big[\,\sigma'_{j+1}(W_j\sigma_j(\cdots\sigma_2(W_1X)\cdots)) \circ \Big(W_{j+1}^T\big(\sigma'_{j+2}(\cdots) \circ \cdots \circ \big(W_N^T(W_N\sigma_N(\cdots) - Y)\big)\cdots\big)\Big)\Big] \perp \sigma_j(W_{j-1}\sigma_{j-1}(\cdots\sigma_2(W_1X)\cdots))^T. \quad (27)$$

This result follows from Lemma 5. There are many different conditions on the weights and activation functions that will satisfy the orthogonality requirement in Equation (27). One specific example is analogous to the condition in Corollary 1.1.
Corollary 4.1. Any weights $[W_1, \ldots, W_{N-1}]$ such that

$$\sigma_N(W_{N-1} \sigma_{N-1}(\cdots \sigma_2(W_1 X) \cdots)) = 0 \quad (28)$$

$$\sigma'_N(W_{N-1} \sigma_{N-1}(\cdots \sigma_2(W_1 X) \cdots)) = 0 \quad (29)$$

correspond to a first order stationary point for any $W_N$.

This is the case since the term that is zero in Equation (28) shows up in the $W_N$ block of the gradient, and the term that is zero in Equation (29) shows up in every other block of the gradient via a Hadamard product, due to the chain rule. Analysis similar to that in Section 2 can be carried out to establish conditions for Hessian rank deficiency at zero misfit minima and corresponding ill-posedness of the training problem in a neighborhood, as well as analysis that may establish conditions for saddle points and spurious local minima. Due to limited space we do not pursue such analyses, but expect similar results. Specifically, the last activation function and its derivatives appear to be critical in understanding the characteristics of stationary points, both their existence and Hessian rank deficiency. If the successive layer mappings prior to the last layer map $W_{N-1} \sigma_{N-1}(\cdots \sigma_2(W_1 X) \cdots)$ into the zero set of the last activation function and its derivatives, then we believe spurious local minima are possible.
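The following sketch illustrates Corollary 4.1 for a hypothetical three-layer network. It assumes the last activation $\sigma_3$ is a ReLU and that the second-layer weights are chosen so that the input to $\sigma_3$ is strictly negative; under that assumption the loss is locally flat in every weight block even though the misfit is nonzero. The construction and sizes are illustrative, not taken from this paper.

```python
import numpy as np

# Corollary 4.1 sketch: last activation (ReLU) and its derivative vanish,
# so the loss is locally constant in all weights, yet the misfit is nonzero.
rng = np.random.default_rng(5)
n, r1, r2, m, d = 3, 4, 4, 2, 6
X = rng.standard_normal((n, d)); Y = rng.standard_normal((m, d))
sig2 = lambda t: 1.0 / (1.0 + np.exp(-t))       # strictly positive outputs
sig3 = lambda t: np.maximum(t, 0.0)             # ReLU: zero on negative inputs

W1 = rng.standard_normal((r1, n))
W2 = -np.abs(rng.standard_normal((r2, r1))) - 1.0   # entries <= -1
W3 = rng.standard_normal((m, r2))

def loss(W1_, W2_, W3_):
    out = W3_ @ sig3(W2_ @ sig2(W1_ @ X))       # W2 sig2(W1 X) is negative,
    return 0.5 * np.linalg.norm(Y - out, "fro") ** 2   # so out is exactly 0

base = loss(W1, W2, W3)
eps = 1e-4
perturbed = loss(W1 + eps * rng.standard_normal(W1.shape),
                 W2 + eps * rng.standard_normal(W2.shape),
                 W3 + eps * rng.standard_normal(W3.shape))
print(base > 0, np.isclose(base, perturbed))    # nonzero misfit, flat loss
```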
4. Conclusion
For dense nonlinear neural networks, we have derived expressions characterizing the nullspace of the Hessian in the vicinity of global minima. These can be used to design regularization operators that target the specific nature of ill-posedness of the training problem. When a candidate stationary point is a strict saddle, appropriately-designed optimization algorithms will escape it eventually (how fast they escape will depend on how negative the most negative eigenvalue of the Hessian is). The analysis in this paper shows that when the gradient is small, it can be due to an accurate approximation of the mapping $X \mapsto Y$, or it can be due to the orthogonality conditions (case 2 of Theorem 1). Spurious local minima can be identified easily, since $\|Y - W_2\sigma(W_1X)\|_F$ will be far from zero. Whether or not such points are strict saddles or local minima is harder to know specifically, since this can depend on many different factors, such as the zeros of the activation function and its derivatives. Such points can be escaped quickly using Gaussian random noise (Jin et al., 2017a). When in the vicinity of a strict saddle point with a negative curvature direction that is large relative to other eigenvalues of the Hessian, randomized methods can be used to identify negative curvature directions and escape the saddle point at a cost of a small number of neural network evaluations (O'Leary-Roseberry et al., 2019).

A. Shallow Dense Neural Network Derivations
A.1. Derivation of gradient
Derivatives are taken in vectorized form. In order to simplify notation we write

$$F(W_1, W_2) = \frac{1}{2}\,\mathrm{misfit}^T\,\mathrm{misfit}, \qquad \mathrm{misfit} = \mathrm{vec}(Y - W_2 \sigma(W_1 X)). \quad (30)$$

In numerator layout, partial differentials with respect to a vectorized matrix $X$ are as follows:

$$\partial_{\mathrm{vec}(X)} \left(\frac{1}{2}\,\mathrm{misfit}^T\,\mathrm{misfit}\right) = (\partial_{\mathrm{vec}(X)}\,\mathrm{misfit})^T\,\mathrm{misfit}. \quad (31)$$

First we have a lemma about the derivative of the activation function with respect to the encoder weight matrix.
Lemma 1. Suppose $W_1 \in \mathbb{R}^{r \times n}$, $X \in \mathbb{R}^{n \times d}$, and $\sigma$ is applied elementwise to the matrix $W_1 X$. Then

$$\partial_{\mathrm{vec}(W_1)}\,\mathrm{vec}(\sigma(W_1 X)) = \mathrm{diagvec}(\sigma'(W_1 X))[X^T \otimes I_r]. \quad (32)$$
Proof. We use the limit definition of the derivative to derive this result. Let $h \in \mathbb{R}^{r \times n}$ be arbitrary. In the limit as $h \to 0 \in \mathbb{R}^{r \times n}$ we have

$$\mathrm{vec}(\sigma((W_1 + h)X) - \sigma(W_1 X)) = \partial_{\mathrm{vec}(W_1)}(\mathrm{vec}(\sigma(W_1 X)))\,\mathrm{vec}(h). \quad (33)$$

Expanding this term to first order in $h$, and noting that $\mathrm{vec}(\sigma(W_1 X)) = \sigma(\mathrm{vec}(W_1 X))$ as well as $\mathrm{vec}(A) \circ \mathrm{vec}(B) = \mathrm{diagvec}(A)\,\mathrm{vec}(B)$, we have

$$\sigma(\mathrm{vec}((W_1 + h)X)) - \sigma(\mathrm{vec}(W_1 X)) = \sigma(\mathrm{vec}(W_1 X + hX)) - \sigma(\mathrm{vec}(W_1 X)) = \mathrm{vec}(\sigma'(W_1 X)) \circ \mathrm{vec}(hX) = \mathrm{diagvec}(\sigma'(W_1 X))\,\mathrm{vec}(hX) = \mathrm{diagvec}(\sigma'(W_1 X))[X^T \otimes I_r]\,\mathrm{vec}(h). \quad (34)$$

The result follows.
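A small directional finite-difference check of the Jacobian identity (32) follows, assuming a tanh activation and column-major vectorization (so that NumPy's kron matches the Kronecker convention of Equation (4)); the sizes are illustrative.

```python
import numpy as np

# Directional finite-difference check of Equation (32).
rng = np.random.default_rng(6)
r, n, d = 3, 4, 5
X = rng.standard_normal((n, d)); W1 = rng.standard_normal((r, n))
sigma, dsigma = np.tanh, lambda t: 1.0 - np.tanh(t) ** 2
vec = lambda A: A.flatten(order="F")            # column stacking

J = np.diag(vec(dsigma(W1 @ X))) @ np.kron(X.T, np.eye(r))   # Eq. (32)

dW = rng.standard_normal(W1.shape)              # arbitrary direction
eps = 1e-6
fd = (vec(sigma((W1 + eps * dW) @ X)) - vec(sigma((W1 - eps * dW) @ X))) / (2 * eps)
print(np.allclose(fd, J @ vec(dW), atol=1e-8))
```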
Now we can derive the gradients of the objective function $F(W_1, W_2)$.

Lemma 2. The gradients of the objective function are given by

$$\nabla_{W_2} F(W_1, W_2) = (W_2\sigma(W_1X) - Y)\,\sigma(W_1X)^T \quad (35)$$

$$\nabla_{W_1} F(W_1, W_2) = [\sigma'(W_1X) \circ (W_2^T(W_2\sigma(W_1X) - Y))]\,X^T. \quad (36)$$
Proof. We derive in vectorized differential form, from which the matrix form derivatives can be extracted. First, for the derivative with respect to $W_2$ we take the matrix partial differential with respect to $W_2$ only:

$$\partial\,\mathrm{misfit} = -\partial\,\mathrm{vec}(W_2\sigma(W_1X)) = -[\sigma(W_1X)^T \otimes I_m]\,\partial\,\mathrm{vec}(W_2). \quad (37)$$

Thus it follows that

$$\partial_{\mathrm{vec}(W_2)}\,\mathrm{misfit} = -[\sigma(W_1X)^T \otimes I_m]. \quad (38)$$

The $\mathrm{vec}(W_2)$ partial derivative is then

$$(\partial_{\mathrm{vec}(W_2)}\,\mathrm{misfit})^T\,\mathrm{misfit} = [\sigma(W_1X) \otimes I_m]\,\mathrm{vec}(W_2\sigma(W_1X) - Y) = \mathrm{vec}((W_2\sigma(W_1X) - Y)\,\sigma(W_1X)^T). \quad (39)$$

We have then that the matrix form partial derivative with respect to $W_2$ is

$$\nabla_{W_2} F(W_1, W_2) = (W_2\sigma(W_1X) - Y)\,\sigma(W_1X)^T. \quad (40)$$

For the partial derivative with respect to $W_1$, again we start with the vectorized differential form:

$$\partial_{\mathrm{vec}(W_1)}\,\mathrm{misfit} = -\partial_{\mathrm{vec}(W_1)}\,\mathrm{vec}(W_2\sigma(W_1X)) = -[I_d \otimes W_2]\,\partial_{\mathrm{vec}(W_1)}\,\sigma(\mathrm{vec}(W_1X)). \quad (41)$$

Applying Lemma 1 we have

$$\partial_{\mathrm{vec}(W_1)}\,\mathrm{misfit} = -[I_d \otimes W_2]\,\mathrm{diagvec}(\sigma'(W_1X))[X^T \otimes I_r]. \quad (42)$$

The $\mathrm{vec}(W_1)$ partial derivative is then

$$(\partial_{\mathrm{vec}(W_1)}\,\mathrm{misfit})^T\,\mathrm{misfit} = [X \otimes I_r]\,\mathrm{diagvec}(\sigma'(W_1X))[I_d \otimes W_2^T]\,\mathrm{vec}(W_2\sigma(W_1X) - Y) = [X \otimes I_r]\,\mathrm{diagvec}(\sigma'(W_1X))\,\mathrm{vec}(W_2^T(W_2\sigma(W_1X) - Y)) = [X \otimes I_r]\,\mathrm{vec}(\sigma'(W_1X) \circ (W_2^T(W_2\sigma(W_1X) - Y))) = \mathrm{vec}([\sigma'(W_1X) \circ (W_2^T(W_2\sigma(W_1X) - Y))]\,X^T). \quad (43)$$

We have then that the matrix form partial derivative with respect to $W_1$ is

$$\nabla_{W_1} F(W_1, W_2) = [\sigma'(W_1X) \circ (W_2^T(W_2\sigma(W_1X) - Y))]\,X^T. \quad (44)$$
We now derive the four blocks of the Hessian matrix. I willproceed again by deriving partial differentials in vectorizedform. In numerator layout we have ∂ vec ( Y ) ∂ vec ( X ) (cid:18) misfit T misfit (cid:19) =( ∂ vec ( X ) ∂ vec ( Y ) misfit ) T misfit + ( ∂ vec ( X ) misfit ) T ( ∂ vec ( Y ) misfit ) (45)The term involving only first partial derivatives of the misfitis the Gauss Newton portion which are already derived insection A.1. Lemma 3.
Lemma 3 (Gauss-Newton portions).

$$(\partial_{\mathrm{vec}(W_2)}\,\mathrm{misfit})^T(\partial_{\mathrm{vec}(W_2)}\,\mathrm{misfit}) = [\sigma(W_1X)\sigma(W_1X)^T \otimes I_m] \quad (46)$$

$$(\partial_{\mathrm{vec}(W_1)}\,\mathrm{misfit})^T(\partial_{\mathrm{vec}(W_2)}\,\mathrm{misfit}) = [X \otimes I_r]\,\mathrm{diagvec}(\sigma'(W_1X))[\sigma(W_1X)^T \otimes W_2^T] \quad (47)$$

$$(\partial_{\mathrm{vec}(W_2)}\,\mathrm{misfit})^T(\partial_{\mathrm{vec}(W_1)}\,\mathrm{misfit}) = [\sigma(W_1X) \otimes W_2]\,\mathrm{diagvec}(\sigma'(W_1X))[X^T \otimes I_r] \quad (48)$$

$$(\partial_{\mathrm{vec}(W_1)}\,\mathrm{misfit})^T(\partial_{\mathrm{vec}(W_1)}\,\mathrm{misfit}) = [X \otimes I_r]\,\mathrm{diagvec}(\sigma'(W_1X))[I_d \otimes W_2^T W_2]\,\mathrm{diagvec}(\sigma'(W_1X))[X^T \otimes I_r] \quad (49)$$

Proof. This result follows from Equations (38) and (42).
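The block formulas (46) and (49) can be checked against a finite-difference Jacobian of the misfit. The sketch below assumes a tanh activation, column-major vectorization, and arbitrary small sizes.

```python
import numpy as np

# Numerical check of the Gauss-Newton blocks (46) and (49), assuming tanh
# and column-major vec(); misfit(W1, W2) = vec(Y - W2 sigma(W1 X)).
rng = np.random.default_rng(0)
n, r, m, d = 3, 4, 2, 5
X = rng.standard_normal((n, d)); Y = rng.standard_normal((m, d))
W1 = rng.standard_normal((r, n)); W2 = rng.standard_normal((m, r))
sigma, dsigma = np.tanh, lambda t: 1.0 - np.tanh(t) ** 2

vec = lambda A: A.flatten(order="F")            # column stacking
dvec = lambda A: np.diag(vec(A))                # diagvec operator

def misfit(W1, W2):
    return vec(Y - W2 @ sigma(W1 @ X))

def jacobian(f, W, eps=1e-6):
    """Central finite-difference Jacobian of f with respect to vec(W)."""
    J = np.zeros((f(W).size, W.size))
    for k in range(W.size):
        dW = np.zeros(W.size); dW[k] = eps
        dW = dW.reshape(W.shape, order="F")
        J[:, k] = (f(W + dW) - f(W - dW)) / (2 * eps)
    return J

J2 = jacobian(lambda W: misfit(W1, W), W2)      # d misfit / d vec(W2)
J1 = jacobian(lambda W: misfit(W, W2), W1)      # d misfit / d vec(W1)

S = sigma(W1 @ X)
GN_22 = np.kron(S @ S.T, np.eye(m))                                   # Eq. (46)
GN_11 = (np.kron(X, np.eye(r)) @ dvec(dsigma(W1 @ X))
         @ np.kron(np.eye(d), W2.T @ W2)
         @ dvec(dsigma(W1 @ X)) @ np.kron(X.T, np.eye(r)))            # Eq. (49)

print(np.allclose(J2.T @ J2, GN_22, atol=1e-6))
print(np.allclose(J1.T @ J1, GN_11, atol=1e-6))
```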
We proceed by deriving the terms involving second partial derivatives of the misfit, by deriving their action on an arbitrary matrix $Z \in \mathbb{R}^{m \times d}$ (the same shape as the misfit matrix). The matrix $K^{(r,d)} \in \mathbb{R}^{dr \times rd}$ is the commutation (perfect shuffle) matrix satisfying the equality $K^{(r,d)}\,\mathrm{vec}(V) = \mathrm{vec}(V^T)$ for $V \in \mathbb{R}^{r \times d}$.
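A one-off sanity check of the commutation matrix identity follows, assuming column-major vectorization; the construction of $K^{(r,d)}$ below is an illustrative one, built column by column from its action on the standard basis.

```python
import numpy as np

# Verify K^{(r,d)} vec(V) = vec(V^T) with column-major vec().
r, d = 3, 4
vec = lambda A: A.flatten(order="F")

K = np.zeros((d * r, r * d))
for k in range(r * d):
    E = np.zeros((r, d))
    E[np.unravel_index(k, (r, d), order="F")] = 1.0   # k-th basis matrix
    K[:, k] = vec(E.T)

V = np.random.default_rng(9).standard_normal((r, d))
print(np.allclose(K @ vec(V), vec(V.T)))
```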
Lemma 4 (Non-Gauss-Newton portions).

$$(\partial_{\mathrm{vec}(W_2)}\,\partial_{\mathrm{vec}(W_2)}\,\mathrm{misfit})^T\,\mathrm{misfit} = 0 \quad (50)$$

$$(\partial_{\mathrm{vec}(W_1)}\,\partial_{\mathrm{vec}(W_2)}\,\mathrm{misfit})^T\,\mathrm{misfit} = [I_r \otimes (W_2\sigma(W_1X) - Y)]\,K^{(r,d)}\,\mathrm{dvec}(\sigma'(W_1X))[X^T \otimes I_r] \quad (51)$$

$$(\partial_{\mathrm{vec}(W_2)}\,\partial_{\mathrm{vec}(W_1)}\,\mathrm{misfit})^T\,\mathrm{misfit} = [X \otimes I_r]\,\mathrm{dvec}(\sigma'(W_1X))[(W_2\sigma(W_1X) - Y)^T \otimes I_r]\,K^{(m,r)} \quad (52)$$

$$(\partial_{\mathrm{vec}(W_1)}\,\partial_{\mathrm{vec}(W_1)}\,\mathrm{misfit})^T\,\mathrm{misfit} = [X \otimes I_r]\,\mathrm{dvec}\big([W_2^T(W_2\sigma(W_1X) - Y)] \circ \sigma''(W_1X)\big)[X^T \otimes I_r] \quad (53)$$
Proof. Equation (50) follows from the fact that $W_2$ appears linearly in the misfit. For the mixed $W_2$-$W_1$ block we have

$$(\partial_{\mathrm{vec}(W_2)}\,\mathrm{misfit})^T\,\mathrm{vec}(Z) = -[\sigma(W_1X) \otimes I_m]\,\mathrm{vec}(Z) = -\mathrm{vec}(Z\sigma(W_1X)^T) = -[I_r \otimes Z]\,\mathrm{vec}(\sigma(W_1X)^T) = -[I_r \otimes Z]\,K^{(r,d)}\,\mathrm{vec}(\sigma(W_1X)). \quad (54)$$

Equation (51) follows from taking a partial differential with respect to the vectorization of $W_1$, applying Lemma 1, and substituting the misfit for $Z$. For the mixed $W_1$-$W_2$ block we have

$$(\partial_{\mathrm{vec}(W_1)}\,\mathrm{misfit})^T\,\mathrm{vec}(Z) = -[X \otimes I_r]\,\mathrm{diagvec}(\sigma'(W_1X))[I_d \otimes W_2^T]\,\mathrm{vec}(Z) = -[X \otimes I_r]\,\mathrm{diagvec}(\sigma'(W_1X))\,\mathrm{vec}(W_2^T Z) = -[X \otimes I_r]\,\mathrm{diagvec}(\sigma'(W_1X))[Z^T \otimes I_r]\,K^{(m,r)}\,\mathrm{vec}(W_2). \quad (55)$$

Equation (52) follows from taking a partial differential with respect to the vectorization of $W_2$ and substituting the misfit for $Z$. Lastly, for the $W_1$-$W_1$ block we have

$$(\partial_{\mathrm{vec}(W_1)}\,\mathrm{misfit})^T\,\mathrm{vec}(Z) = -[X \otimes I_r]\,\mathrm{diagvec}(\sigma'(W_1X))\,\mathrm{vec}(W_2^T Z) = -[X \otimes I_r]\big(\mathrm{vec}(\sigma'(W_1X)) \circ \mathrm{vec}(W_2^T Z)\big) = -[X \otimes I_r]\big(\mathrm{vec}(W_2^T Z) \circ \mathrm{vec}(\sigma'(W_1X))\big) = -[X \otimes I_r]\,\mathrm{diagvec}(W_2^T Z)\,\mathrm{vec}(\sigma'(W_1X)). \quad (56)$$

Equation (53) follows from taking a partial differential with respect to the vectorization of $W_1$, applying Lemma 1, and substituting the misfit in for $Z$.
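The full $W_1$-$W_1$ Hessian block, i.e., the Gauss-Newton part (49) plus the second-order term (53), can be verified against finite differences of the closed-form $W_1$ gradient. The sketch below assumes a tanh activation (for which $\sigma'' = -2\tanh\,(1-\tanh^2)$), column-major vectorization, and illustrative sizes.

```python
import numpy as np

# Check (49) + (53) against a finite-difference Jacobian of the W1 gradient.
rng = np.random.default_rng(8)
n, r, m, d = 3, 4, 2, 5
X = rng.standard_normal((n, d)); Y = rng.standard_normal((m, d))
W1 = rng.standard_normal((r, n)); W2 = rng.standard_normal((m, r))
s = np.tanh
ds = lambda t: 1.0 - np.tanh(t) ** 2
dds = lambda t: -2.0 * np.tanh(t) * (1.0 - np.tanh(t) ** 2)
vec = lambda A: A.flatten(order="F")
dvec = lambda A: np.diag(vec(A))

def grad_W1(W1_):
    R = W2 @ s(W1_ @ X) - Y
    return (ds(W1_ @ X) * (W2.T @ R)) @ X.T

Z = W1 @ X
R = W2 @ s(Z) - Y
GN = (np.kron(X, np.eye(r)) @ dvec(ds(Z)) @ np.kron(np.eye(d), W2.T @ W2)
      @ dvec(ds(Z)) @ np.kron(X.T, np.eye(r)))                         # Eq. (49)
NG = np.kron(X, np.eye(r)) @ dvec((W2.T @ R) * dds(Z)) @ np.kron(X.T, np.eye(r))  # Eq. (53)

H_fd = np.zeros((r * n, r * n))
eps = 1e-6
for k in range(r * n):
    E = np.zeros(r * n); E[k] = eps
    E = E.reshape((r, n), order="F")
    H_fd[:, k] = vec(grad_W1(W1 + E) - grad_W1(W1 - E)) / (2 * eps)
print(np.allclose(GN + NG, H_fd, atol=1e-5))
```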
B. Deep Dense Neural Network Gradient Derivation

The least squares loss function may be stated as

$$F(W_1, W_2, \ldots, W_N) = \frac{1}{2}\,\mathrm{misfit}^T\,\mathrm{misfit}, \qquad \mathrm{misfit} = \mathrm{vec}(Y - W_N\sigma_N(W_{N-1}\sigma_{N-1}(\cdots\sigma_2(W_1X)\cdots))). \quad (57)$$

Numerator layout partial differentials are the same as in Equation (31). The partial derivatives require repeated application of the chain rule and Lemma 1, which can be stated as

$$\partial_{\mathrm{vec}(W_j)}\,\mathrm{vec}(\sigma_{j+2}(W_{j+1}\sigma_{j+1}(W_j\sigma_j(\cdots)))) = \mathrm{dvec}(\sigma'_{j+2}(W_{j+1}\sigma_{j+1}(W_j\sigma_j(\cdots))))\,[I_d \otimes W_{j+1}]\,\partial_{\mathrm{vec}(W_j)}\,\mathrm{vec}(\sigma_{j+1}(W_j\sigma_j(\cdots))). \quad (58)$$
Lemma 5 (Deep neural network gradients).

$$\nabla_{W_j} F(W) = \Big[\,\sigma'_{j+1}(W_j\sigma_j(\cdots\sigma_2(W_1X)\cdots)) \circ \Big(W_{j+1}^T\big(\sigma'_{j+2}(W_{j+1}\sigma_{j+1}(\cdots\sigma_2(W_1X)\cdots)) \circ \cdots \circ \big(W_{N-1}^T\big(\sigma'_N(W_{N-1}\sigma_{N-1}(\cdots\sigma_2(W_1X)\cdots)) \circ \big(W_N^T(W_N\sigma_N(W_{N-1}\cdots\sigma_2(W_1X)\cdots) - Y)\big)\big)\big)\cdots\big)\Big)\Big]\;\sigma_j(W_{j-1}\sigma_{j-1}(\cdots\sigma_2(W_1X)\cdots))^T \quad (59)$$
Proof. By iterative application of the chain rule (Equation (58)) we can derive

$$\partial_{\mathrm{vec}(W_j)}\,\mathrm{misfit} = -[I_d \otimes W_N]\,\mathrm{dvec}(\sigma'_N(W_{N-1}\cdots\sigma_2(W_1X)\cdots)) \cdots [I_d \otimes W_{j+1}]\,\mathrm{dvec}(\sigma'_{j+1}(W_j\cdots\sigma_2(W_1X)\cdots))\,[\sigma_j(W_{j-1}\cdots\sigma_2(W_1X)\cdots)^T \otimes I_{r_j}]. \quad (60)$$

The result then follows from Equation (31) and the properties of Kronecker and Hadamard products used in Appendix A.
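As with the shallow case, Equation (59) can be spot-checked numerically. The sketch below does so for a hypothetical three-layer network with tanh activations and illustrative sizes, comparing against central finite differences of the loss; it is not code from this paper.

```python
import numpy as np

# Finite-difference check of the layer-wise gradient formula (59) for N = 3.
rng = np.random.default_rng(7)
n, r1, r2, m, d = 3, 4, 5, 2, 6
X = rng.standard_normal((n, d)); Y = rng.standard_normal((m, d))
W1 = rng.standard_normal((r1, n)); W2 = rng.standard_normal((r2, r1))
W3 = rng.standard_normal((m, r2))
s, ds = np.tanh, lambda t: 1.0 - np.tanh(t) ** 2    # sigma_2 = sigma_3 = tanh

def loss(W1_, W2_, W3_):
    return 0.5 * np.linalg.norm(Y - W3_ @ s(W2_ @ s(W1_ @ X)), "fro") ** 2

A1 = s(W1 @ X); A2 = s(W2 @ A1)                 # layer activations
R = W3 @ A2 - Y                                 # misfit matrix
grad_W3 = R @ A2.T                                                   # j = N
grad_W2 = (ds(W2 @ A1) * (W3.T @ R)) @ A1.T                          # j = 2
grad_W1 = (ds(W1 @ X) * (W2.T @ (ds(W2 @ A1) * (W3.T @ R)))) @ X.T   # j = 1

def fd_grad(f, W, eps=1e-6):
    G = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        E = np.zeros_like(W); E[idx] = eps
        G[idx] = (f(W + E) - f(W - E)) / (2 * eps)
    return G

print(np.allclose(grad_W3, fd_grad(lambda W: loss(W1, W2, W), W3), atol=1e-6))
print(np.allclose(grad_W2, fd_grad(lambda W: loss(W1, W, W3), W2), atol=1e-6))
print(np.allclose(grad_W1, fd_grad(lambda W: loss(W, W2, W3), W1), atol=1e-6))
```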
References

Anandkumar, A. and Ge, R. Efficient approaches for escaping higher order saddle points in non-convex optimization. In Conference on Learning Theory, pp. 81-102, 2016.

Baldi, P. and Hornik, K. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53-58, 1989.

Bertsekas, D. P. Nonlinear Programming. Journal of the Operational Research Society, 48(3):334-334, 1997.

Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303-314, 1989.

Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pp. 2933-2941, 2014.

Engl, H. W., Hanke, M., and Neubauer, A. Regularization of Inverse Problems, volume 375 of Mathematics and Its Applications. Springer Netherlands, 1996. ISBN 978-0-7923-4157-4.

Hanke, M. Conjugate Gradient Type Methods for Ill-Posed Problems. Pitman Research Notes in Mathematics, Vol. 327. Longman Scientific & Technical, Essex, 1995.

Hansen, P. C. Rank Deficient and Discrete Ill-Posed Problems: Numerical Aspects of Linear Inversion. SIAM, Philadelphia, 1998.

Hornik, K., Stinchcombe, M., and White, H. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359-366, 1989.

Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., and Jordan, M. I. How to escape saddle points efficiently. arXiv preprint arXiv:1703.00887, 2017a.

Jin, C., Netrapalli, P., and Jordan, M. I. Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv preprint arXiv:1711.10456, 2017b.

Murty, K. G. and Kabadi, S. N. Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39(2):117-129, 1987.

Nesterov, Y. and Polyak, B. T. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177-205, 2006.

O'Leary-Roseberry, T., Alger, N., and Ghattas, O. Inexact Newton methods for stochastic non-convex optimization with applications to neural network training. arXiv preprint arXiv:1905.06738, 2019.

Poggio, T. and Liao, Q. Theory I: Deep networks and the curse of dimensionality. Bulletin of the Polish Academy of Sciences: Technical Sciences, 66(6), 2018.

Safran, I. and Shamir, O. Spurious local minima are common in two-layer ReLU neural networks. arXiv preprint arXiv:1712.08968, 2017.

Schwab, C. and Zech, J. Deep learning in high dimension: Neural network expression rates for generalized polynomial chaos expansions in UQ. Analysis and Applications, 17(01):19-55, 2019.

Zhu, Z., Soudry, D., Eldar, Y. C., and Wakin, M. B. The global optimization geometry of shallow linear neural networks, 2019.