Non-attracting Regions of Local Minima in Deep and Wide Neural Networks
Henning Petzka, Cristian Sminchisescu
Lund University, Google Research
Abstract
Understanding the loss surface of neural networks is essential for the design of models with predictable performance and their success in applications. Experimental results suggest that sufficiently deep and wide neural networks are not negatively impacted by suboptimal local minima. Despite recent progress, the reason for this outcome is not fully understood. Could deep networks have very few, if any, suboptimal local optima? Or could all of them be equally good? We provide a construction to show that suboptimal local minima (i.e., non-global ones), even though degenerate, exist for fully connected neural networks with sigmoid activation functions. The local minima obtained by our construction belong to a connected set of local solutions that can be escaped from via a non-increasing path on the loss curve. For extremely wide neural networks of decreasing width after the wide layer, we prove that every suboptimal local minimum belongs to such a connected set. This provides a partial explanation for the successful application of deep neural networks. In addition, we also characterize under what conditions the same construction leads to saddle points instead of local minima for deep neural networks.
I. INTRODUCTION
At the heart of most optimization problems lies the search for the global minimum of a loss function. The common approach to finding a solution is to initialize at random in parameter space and subsequently follow directions of decreasing loss based on local methods. This approach lacks a global progress criterion, which leads to descent into one of the nearest local minima. The common approach of using gradient descent variants on non-convex loss curves of deep neural networks is vulnerable precisely to that problem. Authors pursuing the early approaches to local descent by back-propagating gradients [28] experimentally noticed that suboptimal local minima appeared surprisingly harmless. More recently, for deep neural networks, the earlier observations were further supported by experiments of, e.g., Zhang et al. [43]. Several authors aimed to provide theoretical insight into this behavior. Some, aiming at explanations, rely on simplifying modeling assumptions. Others investigate neural networks under realistic assumptions, but often focus on failure cases only. Recently, Nguyen and Hein [22] provided partial explanations for deep and extremely wide neural networks for a class of activation functions including the commonly used sigmoid. Extreme width is characterized by a "wide" layer that has more neurons than input patterns to learn. For almost every instantiation of parameter values $w$ (i.e., for all but a set of parameter values of measure zero), it is shown that, if the loss function has a local minimum at $w$, then this local minimum must be a global one. This extends results by [12], who required the input layer to be extremely wide. This suggests that for deep and wide neural networks, possibly every local minimum is global. The question of what happens at the null set of parameter values, for which the result does not hold, remained unanswered. Similar observations for shallow neural networks with one hidden layer were made earlier by Poston et al.
[27] show for a neural network with one hidden layer and sigmoid activation function that, if the hidden layer has more nodes than there are training patterns, then the error function (squared sum of prediction losses over the samples) has no suboptimal "local minimum" and "each point is arbitrarily close to a point from which a strictly decreasing path starts, so such a point cannot be separated from a so called good point by a barrier of any positive height" [27]. It was criticized by Sprinkhuizen-Kuyper and Boers [34] that the definition of a local minimum used in the proof of Poston et al. [27] was rather strict and unconventional. In particular, the results do not imply that no suboptimal local minima, defined in the usual way, exist. As a consequence, the notions of attracting and non-attracting regions of local minima were introduced, and the authors prove that non-attracting regions exist by providing an example for the extended XOR problem. The existence of these regions implies that a gradient-based approach descending the loss surface using local information may still not converge to the global minimum. The main objective of this work is to revisit the problem of such non-attracting regions and show that they also exist in deep and extremely wide networks. In particular, a gradient-based approach may get stuck in a suboptimal local minimum also in these networks. Most importantly, the performance of deep and wide neural networks cannot be explained by the analysis of the loss curve alone, without taking proper initialization or the stochasticity of stochastic gradient descent (SGD) into account.

This work was supported in part by the European Research Council Consolidator grant SEED, CNCS-UEFISCDI (PN-III-P4-ID-PCE-2016-0535, PN-III-P4-ID-PCCF-2016-0180), the EU Horizon 2020 grant DE-ENIGMA (688835), and SSF. H. Petzka (email: [email protected]), C. Sminchisescu ([email protected]).

Our observations are not fundamentally negative.
First, the local minima we find are rather degenerate. With proper initialization, a local descent technique is unlikely to get stuck in one of the degenerate, suboptimal local minima. Secondly, the minima reside on a non-attracting region of local minima (see Definition 1). Due to its exploration properties, stochastic gradient descent will eventually be able to escape from such a region [see 38]. It is conceivable that in sufficiently wide and deep networks, except for a null set of parameter values as starting points, there is always a monotonically decreasing path down to the global minimum. This was shown for neural networks with one hidden layer, sigmoid activation function and square loss [27], and we generalize this result to deep neural networks. This implies that in such networks every local minimum belongs to a non-attracting region of local minima. (More precisely, our result holds for all extremely wide neural networks with square loss and a class of activation functions including the sigmoid, where the sequence of dimensions of hidden layers is non-increasing between the extremely wide layer and the output layer, i.e., the network architecture has no bottleneck layer of strictly lower dimension than both its neighboring layers.) Our proof of the existence of suboptimal local minima even in extremely wide and deep networks is based on a construction of local minima in shallow neural networks given by Fukumizu and Amari [11]. By relying on a careful computation, we are able to characterize when this construction is applicable to deep neural networks. Interestingly, in deeper layers, the construction rarely seems to lead to local minima, but more often to saddle points. The argument that saddle points rather than suboptimal local minima are the main problem in deep networks has been raised before [8], but a theoretical justification [7] uses strong assumptions that do not exactly hold in neural networks.
Here, we provide the first analytical argument, under realistic assumptions on the neural network structure, describing when certain critical points (i.e., points with gradient zero) of the training loss lead to saddle points in deeper networks. In summary, our results contain the following insight: there exist non-attracting regions of local minima and, in particular, suboptimal local minima in the loss surface of arbitrarily wide neural networks for a class of analytic activation functions including the sigmoid function. The minima can be of finite type or exist only in the limit as some parameters converge to infinity. This disproves a conjecture made by Nguyen and Hein [22] stating that for the therein studied extremely wide neural networks all local minima are globally optimal. Non-attracting regions of local minima, however, allow for non-increasing paths to the global minimum by first following degenerate directions of the local minimum. In sufficiently wide neural networks with no bottleneck layer, all local minima belong to non-attracting regions of local minima. The extremely wide neural networks considered have zero loss at global minima. Naturally, training for zero global loss is not desirable in practice, nor is the use of fully connected extremely wide deep neural networks necessary. The results of this paper are of theoretical importance. To be able to understand the complex learning behavior of deep neural networks in practice, it is a necessity to understand the networks with the most fundamental structure. In this regard, our results offer a new understanding of the multidimensional loss surface of deep neural networks and their learning behavior. That a proper initialization largely improves training performance is well known; see, e.g., Wessels and Barnard [39].
II. RELATED WORK
We discuss related work on suboptimal minima of the loss surface. In addition, we refer the reader to the overview article by Vidal et al. [36] for a discussion of the non-convexity in neural network training. It is known that learning the parameters of neural networks is, in general, a hard problem. Blum and Rivest [4] prove NP-completeness for a specific neural network. It has also been shown that local minima and other critical points exist in the loss function of neural network training [2, 11, 25, 31, 34, 40, 42]. The understanding of these critical points has led to significant improvements in neural network training. This includes weight initialization techniques [39], improved backpropagation algorithms to avoid saturation effects in neurons [37], entirely new activation functions, and the use of second order information [21, 1]. That suboptimal local minima must become rather degenerate if the neural network becomes sufficiently large was observed for networks with one hidden layer by Poston et al. [27]. Extending work by Gori and Tesi [12], Nguyen and Hein [22, 23] generalized this result to deeper networks containing an extremely wide hidden layer. Our contribution can be considered a continuation of this work. To explain the persuasive performance of deep neural networks, Dauphin et al. [8] experimentally show that there is a similarity in the behavior of critical points of the neural network's loss function with theoretical properties of critical points found for Gaussian fields on high-dimensional spaces [6]. Choromanska et al. [7] supply a theoretical connection, but they also require strong (arguably unrealistic) assumptions on the network structure. The results imply that (under their assumptions on the deep network) the loss at a local minimum must be close to the loss of the global minimum with high probability. In this line of research, Sagun et al.
[30] experimentally show a similarity between spin glass models and the loss curve of neural networks. There is a growing number of papers considering the existence of suboptimal local minima for ReLU and Leaky ReLU networks, where the space becomes combinatorial in terms of a positive activation, compared to a stalled (or weak) signal. Yun et al. [42] prove the existence of bad local minima in ReLU networks for generic data sets by tuning the weight parameters in such a way that all neurons are active and the network becomes locally linear. The existence of bad local minima had previously been shown under stronger assumptions by Du et al. [9], Zhou and Liang [44] and Swirszcz et al. [35], who construct data sets that allow them to find suboptimal local minima in overparameterized networks. For the hinge loss, Laurent and Brecht [16] study one-hidden-layer networks and show that Leaky-ReLU networks don't have bad local minima, while ReLU networks do. Conditions for ReLU networks characterizing when no bad local minima exist, or how to eliminate them, are discussed by Liang et al. [17, 18]. Soudry and Hoffer [33] probabilistically compare the volume of regions (for a specific measure) containing bad local and global minima in the limit, as the number of data points goes to infinity. For networks with one hidden layer and ReLU activation function, Freeman and Bruna [10] quantify the amount of hill-climbing necessary to go from one point in the parameter space to another and find that for increasing overparameterization, all level sets become connected. Instead of analyzing local minima, Xie et al. [41] consider regions where the derivative of the loss is small for two-layer ReLU networks.
Soudry and Carmon [32] consider leaky ReLU activation functions to find, similarly to the result of Nguyen and Hein [22], that for almost every combination of activation patterns in two consecutive mildly wide layers, a local minimum has global optimality. To gain better insight into theoretical aspects, some papers consider linear networks, where the activation function is the identity. The classic result by Baldi and Hornik [3] shows that linear two-layer neural networks have a unique global minimum and all other critical values are saddle points. Kawaguchi [15], Lu and Kawaguchi [20] and Yun et al. [42] discuss generalizations of the results by Baldi and Hornik [3] to deep linear networks, and Laurent and Brecht [16] finally show that for linear networks with no bottleneck layer, all minima are global. The existence of non-increasing paths on the loss curve down to the global minimum is studied by Poston et al. [27] for extremely wide two-layer neural networks with sigmoid activation functions. Nguyen et al. [24] generalize this and show the existence of such paths for a special type of architecture having as many skip connections to the output as there are input patterns to learn. For ReLU networks, Safran and Shamir [29] show that, if one starts at a sufficiently high initialization loss, then there is a strictly decreasing path of parameters into the global minimum. Haeffele and Vidal [13] consider a specific class of ReLU networks with regularization, give a sufficient condition that a local minimum is globally optimal, and show that a non-increasing path down to the global minimum exists. Finally, worth mentioning is the study of Liao and Poggio [19], who use polynomial approximations to argue, by relying on Bezout's theorem, that the loss function should have many local minima with zero empirical loss.
Why deep networks perform better than shallow ones is also investigated by Poggio et al. [26] by considering a class of compositional functions. Also relevant is the observation by Brady et al. [5] showing that, if the global minimum is not of zero loss, then a perfect predictor may have a larger loss in training than one producing worse classification results.

III. MAIN RESULTS
We consider neural network functions with fully connected layers of size $n_l$, $0 \leq l \leq L$, given by
$$f(x) = w_L\,\sigma\big(w_{L-1}\,\sigma(\cdots\sigma(w_2\,\sigma(w_1 x + w_1^0) + w_2^0)\cdots) + w_{L-1}^0\big) + w_L^0,$$
where $w_l \in \mathbb{R}^{n_l \times n_{l-1}}$ denotes the weight matrix of the $l$-th layer, $1 \leq l \leq L$, $w_l^0$ the bias terms, and $\sigma$ a nonlinear activation function. The neural network function is denoted by $f$, and we notationally suppress its dependence on parameters. We assume the activation function $\sigma$ to belong to the class of strictly monotonically increasing, analytic, bounded functions on $\mathbb{R}$ with image an interval $(c,d)$ such that $0 \in [c,d]$, a class denoted by $\mathcal{A}$. As prominent examples, the sigmoid activation function $\sigma(t) = \frac{1}{1+\exp(-t)}$ and $\sigma(t) = \tanh(t)$ lie in $\mathcal{A}$. We assume no activation function at the output layer. All the networks considered in this paper are regression networks mapping into the real numbers $\mathbb{R}$, i.e., $n_L = 1$ and $w_L \in \mathbb{R}^{1 \times n_{L-1}}$. We train on a finite data set $(x_\alpha, y_\alpha)_{1 \leq \alpha \leq N}$ of size $N$ with input patterns $x_\alpha \in \mathbb{R}^{n_0}$ and desired target values $y_\alpha \in \mathbb{R}$. We suppose throughout that the input patterns are pairwise different. We aim to minimize the squared loss $L = \sum_{\alpha=1}^N (f(x_\alpha) - y_\alpha)^2$. Further, $M$ denotes the total number of parameters and $w \in \mathbb{R}^M$ denotes the collection of all $w_l$ and $w_l^0$. The dependence of the neural network function $f$ on $w$ translates into a dependence $L = L(w)$ of the loss function on the parameters $w$. Due to the assumptions on $\sigma$, $L(w)$ is twice continuously differentiable. The goal of training a neural network consists of minimizing $L(w)$ over $w$. There is a unique value $L_0$ denoting the infimum of the neural network's loss (most often $L_0 = 0$ in our examples). Any set of weight parameters $w^\bullet$ that satisfies $L(w^\bullet) = L_0$ is called a global minimum. Due to its non-convexity, the loss function $L(w)$ of a neural network is in general known to potentially suffer from local minima (precise definition of a local minimum below).
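For concreteness, the network function and squared loss above can be sketched in a few lines of NumPy. This is only an illustration of the definitions, not the authors' code; layer sizes, parameters, and data are placeholders:

```python
import numpy as np

def sigmoid(t):
    # A prominent member of the class A: strictly increasing, analytic, bounded.
    return 1.0 / (1.0 + np.exp(-t))

def forward(x, weights, biases):
    """f(x) = w_L sigma(w_{L-1} sigma(... sigma(w_1 x + w_1^0) ...) + w_{L-1}^0) + w_L^0,
    a fully connected regression network with no activation at the output."""
    a = x
    for w, b in zip(weights[:-1], biases[:-1]):
        a = sigmoid(w @ a + b)
    return (weights[-1] @ a + biases[-1]).item()  # scalar output (n_L = 1)

def squared_loss(weights, biases, X, Y):
    # L(w) = sum_alpha (f(x_alpha) - y_alpha)^2 over the finite data set.
    return sum((forward(x, weights, biases) - y) ** 2 for x, y in zip(X, Y))
```

Training would then minimize `squared_loss` over the entries of `weights` and `biases`, which together form the parameter vector $w \in \mathbb{R}^M$.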
We will study the existence of suboptimal local minima in the sense that a local minimum $w^*$ is suboptimal if its loss $L(w^*)$ is strictly larger than $L_0$. We refer to deep neural networks as networks with more than one hidden layer. Further, we refer to extremely wide neural networks as the type of networks considered in other theoretical work [12, 27, 22, 23] with one hidden layer containing at least as many neurons as input patterns (i.e., $n_l \geq N$ for some $1 \leq l < L$ in our notation).

A. A Special Kind of Local Minimum
The standard definition of a local minimum, which is also used here, is a point $w^*$ such that $w^*$ has a neighborhood $U$ with $L(w) \geq L(w^*)$ for all $w \in U$. Since local minima do not need to be isolated (i.e., $L(w) > L(w^*)$ for all $w \in U \setminus \{w^*\}$), two types of connected regions of local minima may be distinguished. In the following definition, a continuous path is a continuous map $w_\Gamma : [0,1] \to \mathbb{R}^M$ assigning to each $t \in [0,1]$ a choice of parameter values $w_\Gamma(t)$ with loss $L(w_\Gamma(t))$. We call the path non-increasing in $L$ if $L(w_\Gamma(t)) \leq L(w_\Gamma(s))$ for all $t \geq s$. A non-increasing path $w_\Gamma(t)$ decreases the loss maximally if it cannot be extended as a non-increasing path to a parameter setting of lower loss; formally, if there exists no non-increasing path $\tilde{w}_\Gamma(t)$ such that for each $t$ in $[0,1]$ there is $s$ in $[0,1]$ with $w_\Gamma(t) = \tilde{w}_\Gamma(s)$ and such that $L(\tilde{w}_\Gamma(1)) < L(w_\Gamma(1))$.

Definition 1 ([34]). Let $L : \mathbb{R}^M \to \mathbb{R}$ be a differentiable function. Suppose $R$ is a maximal connected subset of parameter values $w \in \mathbb{R}^M$ such that every $w \in R$ is a local minimum of $L$ with value $L(w) = c$.
• $R$ is called an attracting region of local minima if there is a neighborhood $U$ of $R$ such that every continuous path $w_\Gamma(t)$ which is non-increasing in $L$, which starts at some $w_\Gamma(0) = w \in U$, and which decreases the loss maximally, ends in $R$.
• $R$ is called a non-attracting region of local minima if every neighborhood $U$ of $R$ contains a point from where a continuous path $w_\Gamma(t)$ exists that is non-increasing in $L$ and ends in a point $w_\Gamma(1)$ with $L(w_\Gamma(1)) < c$.

Fig. 1. Left: A non-attracting region of local minima given by $R = \{(x,y) \mid x = 0,\ y \in (-1,1)\}$, illustrated by the function $f(x,y) = x^2(1-y)$. Right: An attracting region of local minima at the same region $R$ for comparison. (These examples do not exactly appear in neural networks considered in this paper, but are of similar nature.)

Attracting regions of local minima $R$ are called attracting, as decreasing paths starting in a neighborhood of $R$ eventually end up in $R$. Our notion differs from the one of Sprinkhuizen-Kuyper and Boers [34] by considering non-increasing paths instead of strictly decreasing ones [see also 14]. Despite its non-attractive nature, a non-attracting region $R$ of local minima may be harmful for a gradient descent approach. A path of greatest descent can end in a local minimum on $R$. However, no point $z$ on $R$ needs to have a neighborhood of attraction in the sense that following the path of greatest descent from a point in a neighborhood of $z$ will lead back to $z$. (The path can lead to a different local minimum on $R$ close by, or reach points with strictly smaller values than $c$.) A rough illustration of a non-attracting region of local minima is depicted in Fig. 1. Such non-attracting regions of local minima are considered for neural networks with one hidden layer by Fukumizu and Amari [11] and Wei et al. [38] under the name of singularities. Their regions of local minima are characterized by singularities in the parameter space leading to a loss value strictly larger than the global loss. The dynamics around such a region are investigated by Wei et al. [38]. Non-attracting regions of local minima do not only exist for shallow two-layer neural networks, but also for deep and arbitrarily wide networks. A construction of such regions is shown in Section V-C, proving the following result.
Theorem 2.
There exist deep and extremely wide fully connected neural networks with sigmoid activation function such that the squared loss function of a finite data set has a non-attracting region of local minima (at finite parameter values).
Corollary 3.
Any attempt to show for fully connected deep neural networks that a gradient descent technique will always lead to a global minimum, based only on a description of the loss curve, will fail if it does not take into consideration properties of the learning procedure (such as the stochasticity of stochastic gradient descent), properties of a suitable initialization technique, or assumptions on the data set.
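The distinction in Definition 1 can be made concrete on the toy surface of Fig. 1 (left). The sketch below assumes the surface has the form $f(x,y) = x^2(1-y)$ (our reading of the figure; it is an illustration, not a neural network loss): plain gradient descent stalls on the non-attracting region $R = \{x = 0\}$ when started inside the basin, while the same dynamics escape to arbitrarily negative values when started just past $y = 1$:

```python
import numpy as np

def f(v):
    # Assumed form of the Fig. 1 (left) surface: f(x, y) = x^2 (1 - y).
    x, y = v
    return x**2 * (1.0 - y)

def grad_f(v):
    # Gradient of f: (2x(1 - y), -x^2).
    x, y = v
    return np.array([2.0 * x * (1.0 - y), -x**2])

def gradient_descent(v0, lr=0.1, steps=5000):
    v = np.array(v0, dtype=float)
    for _ in range(steps):
        v = v - lr * grad_f(v)
    return v

# Started inside the basin (y < 1), descent stalls on R = {x = 0}:
v_stuck = gradient_descent([0.5, 0.0])
# Started just past y = 1, the same dynamics reach arbitrarily negative loss:
v_escaped = gradient_descent([0.5, 1.5], steps=20)
```

No point of $R$ attracts a whole neighborhood, yet a pure descent method started on the wrong side still terminates on $R$, which is exactly the failure mode the corollary describes.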
On the positive side, we point out that a stochastic method such as stochastic gradient descent has a good chance to escape a non-attracting region of local minima due to noise. With infinite time at hand and sufficient exploration, the region can be escaped from with high probability [see 38, for a more detailed discussion]. In Section V-A we will further characterize when the method used to construct examples of non-attracting regions of local minima is applicable. This characterization limits us to the construction of extremely degenerate examples. We argue why assuring the necessary assumptions for the construction becomes difficult for wider and deeper networks, and why it is natural to expect a lower suboptimal loss (where the suboptimal minima are less "bad") the less degenerate the constructed minima are and the more parameters a neural network possesses. A different type of non-attracting region of local minima is considered for the 2-3-1 XOR network by Sprinkhuizen-Kuyper and Boers [34], where the region of local minima (of higher loss than the global loss) resides at points in parameter space with some coordinates being infinite. For this, we consider the extended parameter space, where parameters can take on the values $\pm\infty$. The standard topology on this space considers open neighborhoods of $\infty$ defined by sets $\{v \mid v > a\}$ for some $a$. A local minimum at infinity then satisfies, by definition, that for sufficiently large values of a parameter, the loss is higher at finite values than in the limit as the parameter tends to infinity. In particular, a gradient descent approach may lead to diverging parameters in that case. However, a different non-increasing path down to the global minimum always exists for the network.

(While one might be tempted to term regions of local minima "generalized saddle points", we note that, under the usual mathematical definition, they do consist of a set of local minima.)
It can be shown that such generalized local minima at infinity also exist for deep neural networks. (Our proof uses similar ideas as the proof for the 2-3-1 XOR network by Sprinkhuizen-Kuyper and Boers [34, Section III], but needs additional arguments due to a more general setting. The proof can be found in Appendix E.)

Theorem 4.
Let $L$ denote the squared loss of a fully connected regression neural network with sigmoid activation functions, having at least one hidden layer and each hidden layer containing at least two neurons. Then, for almost every finite data set, the loss function $L$ possesses a generalized local minimum in the extended parameter space with some coordinates being infinite. The generalized local minimum is suboptimal whenever the data set and neural network are such that a constant function is not an optimal solution.

B. Non-increasing Path to a Global Minimum

By definition, all points belonging to a non-attracting region of local minima $R$ are local minima with the same loss value. Further, being non-attracting means that every neighborhood of $R$ contains points from where a non-increasing path to a value less than the value of the region exists. The question therefore arises under what conditions there is such a non-increasing path all the way down to a global minimum from almost everywhere in parameter space. The measure-theoretic term almost everywhere here refers to the Lebesgue measure, i.e., a condition holds almost everywhere when it holds for all points except for a set of Lebesgue measure zero. If the last hidden layer is the extremely wide layer having more neurons than input patterns (for example, consider an extremely wide two-layer neural network), then it indeed holds true that non-increasing paths to the global minimum exist from almost everywhere in parameter space by the results of Nguyen and Hein [22] (and Gori and Tesi [12], Poston et al. [27]). We show the same conclusion to hold for extremely wide deep neural networks whenever the sequence of hidden dimensions is non-increasing, $n_{l+1} \leq n_l$, for all layers following the wide layer.

Theorem 5.
Consider a fully connected regression neural network with activation function in the class $\mathcal{A}$ (as defined in the beginning of Section III) equipped with the squared loss function for a finite data set. Assume that a hidden layer contains more neurons than the number of input patterns and that the sequence of dimensions of all subsequent layers is non-increasing. Then, for each set of parameters $w$ and all $\epsilon > 0$, there is $w'$ such that $\|w - w'\| < \epsilon$ and such that a path, non-increasing in loss, from $w'$ to a global minimum (where $f(x_\alpha) = y_\alpha$ for each $\alpha$) exists.

Corollary 6.
Consider an extremely wide, fully connected regression neural network with non-increasing hidden dimensions following the wide layer, activation function in the class $\mathcal{A}$, and trained to minimize the squared loss over a finite data set. Then all suboptimal local minima are contained in a non-attracting region of local minima.

The rest of the paper contains the arguments leading to the given results and an experimental construction of local minima in a deep and wide network.

IV. NOTATION
We fix additional notation aside from the problem definition in Section III. For input $x_\alpha$ we denote the pattern vector of values at all neurons at layer $l$ before activation by $n(l; x_\alpha)$ and after activation by $\mathrm{act}(l; x_\alpha)$. In general, we will denote column vectors of size $n$ with coefficients $z_i$ by $[z_i]_{1 \leq i \leq n}$ or simply $[z_i]_i$, and matrices with entries $a_{i,j}$ at position $(i,j)$ by $[a_{i,j}]_{i,j}$. The neuron value pattern $n(l; x)$ is then a vector of size $n_l$ denoted by $n(l; x) = [n(l,k; x)]_{1 \leq k \leq n_l}$, and the activation pattern $\mathrm{act}(l; x) = [\mathrm{act}(l,k; x)]_{1 \leq k \leq n_l}$. For a fixed data point $x_\alpha$, we will further denote the squared loss on it by $\ell_\alpha$. The loss $\ell_\alpha$ can be considered as a function of the neuron values $n(l,k; x_\alpha)$, so that we consider partial derivatives of the loss $\ell_\alpha$ with respect to infinitesimal changes of the neuron values $n(l,k; x_\alpha)$. For the convenience of the reader, a tabular summary of all notation is provided in Appendix A.

Fig. 2. Embedding a smaller two-layer neural network into a larger one. Weights of the larger network are defined by the weights of the smaller network and the embedding map $\gamma_\lambda$. Numbers in hidden nodes (circles) denote the index of a neuron in the form (layer, neuron index), with a negative index for the added neuron. Rectangles correspond to bias terms.

V. CONSTRUCTION OF LOCAL MINIMA
We recall the construction of suboptimal local minima given by Fukumizu and Amari [11] and extend it to deep networks. Once we have fixed a layer $l$, we denote the parameters of the incoming linear transformation by $[u_{p,i}]_{p,i}$, so that $u_{p,i}$ denotes the contribution of neuron $i$ in layer $l-1$ to neuron $p$ in layer $l$, and the parameters of the outgoing linear transformation by $[v_{s,q}]_{s,q}$, where $v_{s,q}$ denotes the contribution of neuron $q$ in layer $l$ to neuron $s$ in layer $l+1$. For weights of the output layer (into a single neuron), we write $w_{\bullet,j}$ instead of $w_{1,j}$. For the construction of critical points (i.e., points with gradient zero), we add one additional neuron $n(l,-1;x)$ to a hidden layer $l$. (Negative indices are unused for neurons, which allows us to add a neuron with this index.) A function $\gamma^r_\lambda$ describes the mapping from the parameters of the original network to the parameters after adding a neuron $n(l,-1;x)$. For a chosen neuron with index $r$ in layer $l$ of the smaller network, $\gamma^r_\lambda$ is determined by incoming weights $u_{-1,i}$ into $n(l,-1;x)$, outgoing weights $v_{s,-1}$ of $n(l,-1;x)$, and a change of the outgoing weights $v_{s,r}$ of $n(l,r;x)$. Sorting the network parameters in a convenient way, the embedding of the smaller network into the larger one is given, for any $\lambda \in \mathbb{R}$, by a function $\gamma^r_\lambda$ mapping parameters $([u_{r,i}]_i, [v_{s,r}]_s, \bar{w})$ of the smaller network to parameters $([u_{-1,i}]_i, [v_{s,-1}]_s, [u_{r,i}]_i, [v_{s,r}]_s, \bar{w})$ of the larger network, defined by
$$\gamma^r_\lambda([u_{r,i}]_i, [v_{s,r}]_s, \bar{w}) := ([u_{r,i}]_i, [\lambda \cdot v_{s,r}]_s, [u_{r,i}]_i, [(1-\lambda) \cdot v_{s,r}]_s, \bar{w}).$$
Here $\bar{w}$ denotes the collection of all remaining network parameters, i.e., all $[u_{p,i}]_i$, $[v_{s,q}]_s$ for $p, q \notin \{-1, r\}$ and all parameters from linear transformations of layers with index smaller than $l$ or larger than $l+1$, if existent. A visualization of $\gamma_\lambda$ is shown in Fig. 2.

Important fact:
For the network functions $\varphi, f$ of the smaller and larger network at parameters $([u^*_{r,i}]_i, [v^*_{s,r}]_s, \bar{w}^*)$ and $\gamma^r_\lambda([u^*_{r,i}]_i, [v^*_{s,r}]_s, \bar{w}^*)$, respectively, we have $\varphi(x) = f(x)$ for all $x$. More generally, the activation values of all neurons in the smaller network agree with the activation values of the corresponding neurons in the larger network, i.e., $n_\varphi(l,k;x) = n(l,k;x)$ and $\mathrm{act}_\varphi(l,k;x) = \mathrm{act}(l,k;x)$ for all $l, x$ and $k \geq 1$.

A. Characterization of Critical Points Constructed Hierarchically by $\gamma$

Using some $\gamma^r_\lambda$ to embed a smaller deep neural network into a larger one with one additional neuron, it has been shown that critical points get mapped to critical points.

Theorem 7 (Nitta [25]). Consider two neural networks as in Section III, which differ by one neuron in layer $l$, with index $n(l,-1;x)$ in the larger network. If parameter choices $([u^*_{r,i}]_i, [v^*_{s,r}]_s, \bar{w}^*)$ determine a critical point for the squared loss over a finite data set in the smaller network then, for each $\lambda \in \mathbb{R}$, $\gamma^r_\lambda([u^*_{r,i}]_i, [v^*_{s,r}]_s, \bar{w}^*)$ determines a critical point in the larger network.

As a consequence, whenever an embedding of a local minimum with $\gamma^r_\lambda$ into a larger network does not lead to a local minimum, then it leads to a saddle point instead, i.e., a critical point where the Hessian has both strictly positive and strictly negative eigenvalues. (There are no local maxima in the networks we consider, since the loss function is convex with respect to the parameters of the last layer.) For shallow neural networks with one hidden layer, it was characterized when a critical point leads to a local minimum.

Theorem 8 (Fukumizu and Amari [11]). Consider two neural networks as in Section III with only one hidden layer, which differ by one neuron in the hidden layer, with index $n(1,-1;x)$ in the larger network.
Assume that parameters ([u^*_{r,i}]_i, v^*_{•,r}, \bar{w}^*) determine an isolated local minimum for the squared loss over a finite data set in the smaller neural network and that λ ∉ {0, 1}. Then γ^r_λ([u^*_{r,i}]_i, v^*_{•,r}, \bar{w}^*) determines a local minimum in the larger network if the matrix [B^r_{i,j}]_{i,j} given by

    B^r_{i,j} = Σ_α (f(x_α) − y_α) · v^*_{•,r} · σ''(n(1, r; x_α)) · x_{α,i} · x_{α,j}

is positive definite and 0 < λ < 1, or if [B^r_{i,j}]_{i,j} is negative definite and λ < 0 or λ > 1. (Here, we denote the k-th input dimension of input x_α by x_{α,k}.)

We extend the previous theorem to a characterization in the case of deep neural networks. We note that a similar computation has been previously performed for neural networks with two hidden layers by Mizutani and Dreyfus [21].
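Both the embedding γ^r_λ and the test matrix [B^r_{i,j}]_{i,j} of Theorem 8 are straightforward to compute numerically. The following sketch (a hypothetical 2-3-1 sigmoid network with random weights; all names and sizes are our own illustrative choices, not taken from the paper's released code) checks that γ_λ leaves the network function unchanged and assembles the B matrix from the residuals:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_dd(z):  # second derivative of the sigmoid
    s = sigmoid(z)
    return s * (1.0 - s) * (1.0 - 2.0 * s)

rng = np.random.default_rng(0)
N, d, h = 15, 2, 3                     # samples, input dim, hidden units
X = rng.normal(size=(N, d))
y = rng.normal(size=N)
U = rng.normal(size=(h, d))            # incoming weights [u_{p,i}]
v = rng.normal(size=h)                 # outgoing weights [v_{•,q}]

def forward(U, v, x):
    return float(v @ sigmoid(U @ x))

def gamma(U, v, r, lam):
    """Split unit r: the new copy gets lam * v_r, unit r keeps (1 - lam) * v_r."""
    U_big = np.vstack([U[r:r + 1], U])          # copy of unit r's incoming weights
    v_big = np.concatenate([[lam * v[r]], v])   # new copy's outgoing weight
    v_big[1 + r] *= (1.0 - lam)                 # original unit keeps the rest
    return U_big, v_big

# gamma_lambda preserves the network function for every lambda.
U2, v2 = gamma(U, v, r=1, lam=0.3)
same = all(np.isclose(forward(U, v, x), forward(U2, v2, x)) for x in X)

# B^r_{i,j} = sum_a (f(x_a) - y_a) * v_{•,r} * sigma''(n(1,r;x_a)) * x_{a,i} * x_{a,j}
r = 1
n_r = X @ U[r]                                  # pre-activations n(1, r; x_alpha)
f = np.array([forward(U, v, x) for x in X])
coeff = (f - y) * v[r] * sigmoid_dd(n_r)
B = (X * coeff[:, None]).T @ X                  # (d x d), symmetric by construction
```

Checking the eigenvalues of `B` for a definite sign then determines whether the split produces a local minimum or a saddle point, per Theorem 8.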
Theorem 9.
Consider two (possibly deep) neural networks as in Section III, which differ by one neuron in layer l with index n(l, −1; x) in the larger network. Assume that the parameter choices ([u^*_{r,i}]_i, [v^*_{s,r}]_s, \bar{w}^*) determine an isolated local minimum for the squared loss over a finite data set in the smaller network. If the matrix [B^r_{i,j}]_{i,j} defined by

    B^r_{i,j} := Σ_α Σ_k ∂ℓ_α/∂n(l+1, k; x_α) · v^*_{k,r} · σ''(n(l, r; x_α)) · act(l−1, i; x_α) · act(l−1, j; x_α)    (1)

is either
• positive definite and λ ∈ I := (0, 1), or
• negative definite and λ ∈ I := (−∞, 0) ∪ (1, ∞),
then { γ^r_λ([u^*_{r,i}]_i, [v^*_{s,r}]_s, \bar{w}^*) | λ ∈ I } determines a non-attracting region of local minima in the larger network if and only if

    D^{r,s}_i := Σ_α ∂ℓ_α/∂n(l+1, s; x_α) · σ'(n(l, r; x_α)) · act(l−1, i; x_α)    (2)

is zero, D^{r,s}_i = 0, for all i, s.

Remark 10.
In the case of a neural network with only one hidden layer as considered in Theorem 8, the partial derivative ∂ℓ_α/∂n(l+1, s; x_α) reduces to the residual (f(x_α) − y_α), and the matrix [B^r_{i,j}]_{i,j} in Equation 1 reduces to the matrix [B^r_{i,j}]_{i,j} in Theorem 8. The condition that D^{r,s}_i = 0 for all i, s does hold for shallow neural networks with one hidden layer, as we show in Proposition 15 (i). This proves Theorem 9 to be consistent with Theorem 8.

Remark 11.
The assumption of starting from an isolated local minimum can be relaxed by special consideration of the degenerate directions (the eigenvectors of the Hessian with eigenvalue zero), which we will take advantage of.
The theorem follows from a careful computation of the Hessian of the loss function L(w) (i.e., we calculate the matrix of second-order derivatives of the loss function with respect to the network parameters), characterizing when it is positive (or negative) semidefinite, and checking that the loss function does not change along directions that correspond to an eigenvector of the Hessian with eigenvalue 0. We state the outcome of the computation of the loss Hessian in Lemma 12 and refer the reader interested in a full proof of Theorem 9 to Appendix D.

Lemma 12.
Consider two (possibly deep) neural networks as in Section III, which differ by one neuron in layer l with index n(l, −1; x) in the larger network. Fix 1 ≤ r ≤ n_l. Assume that the parameter choices ([u^*_{r,i}]_i, [v^*_{s,r}]_s, \bar{w}^*) determine a critical point in the smaller network. Let L denote the loss function of the larger network and ℓ the loss function of the smaller network. Let α ≠ −β ∈ R such that λ = β/(α + β). With respect to the basis of the parameter space of the larger network given by

    ([u_{−1,i} + u_{r,i}]_i, [v_{s,−1} + v_{s,r}]_s, \bar{w}, [α·u_{−1,i} − β·u_{r,i}]_i, [v_{s,−1} − v_{s,r}]_s),

the Hessian of the loss L at γ^r_λ([u^*_{r,i}]_i, [v^*_{s,r}]_s, \bar{w}^*) is given by

    ( [∂²ℓ/∂u_{r,i}∂u_{r,j}]_{i,j}        [∂²ℓ/∂u_{r,i}∂v_{s,r}]_{i,s}        [∂²ℓ/∂\bar{w}∂u_{r,i}]_{i,\bar{w}}          0                          0                        )
    ( [∂²ℓ/∂u_{r,i}∂v_{s,r}]_{s,i}        [∂²ℓ/∂v_{s,r}∂v_{t,r}]_{s,t}        [∂²ℓ/∂\bar{w}∂v_{s,r}]_{s,\bar{w}}          (α−β)[D^{r,s}_i]_{s,i}     0                        )
    ( [∂²ℓ/∂\bar{w}∂u_{r,i}]_{\bar{w},i}  [∂²ℓ/∂\bar{w}∂v_{s,r}]_{\bar{w},s}  [∂²ℓ/∂\bar{w}∂\bar{w}']_{\bar{w},\bar{w}'}  0                          0                        )
    ( 0                                   (α−β)[D^{r,s}_i]_{i,s}              0                                           αβ[B^r_{i,j}]_{i,j}        (α+β)[D^{r,s}_i]_{i,s}   )
    ( 0                                   0                                   0                                           (α+β)[D^{r,s}_i]_{s,i}     0                        )

B. Shallow Networks with a Single Hidden Layer
For the construction of suboptimal local minima in extremely wide two-layer networks, we begin by following the experiments of Fukumizu and Amari [11], which prove the existence of suboptimal local minima in (non-wide) two-layer neural networks.

Consider a neural network of size 1—2—1. We use the corresponding network function f to construct a data set (x_α, y_α)_{α=1}^N by randomly choosing x_α and letting y_α = f(x_α). By construction, we know that a neural network of size 1—2—1 can perfectly fit the data set with zero error.

Consider now a smaller network of size 1—1—1 having too little expressive power for a global fit of all data points. We find parameters [u^*_{1,1}, v^*_{•,1}] where the loss function of the neural network is in an isolated local minimum with non-zero loss. For this small example, the required positive definiteness of [B_{i,j}]_{i,j} from Equation 1 for a use of γ_λ with λ ∈ (0, 1) reduces to checking a real number B_{1,1} for positivity. An empirical example is easily found, and we assume the positivity condition to hold true. We can now apply γ_λ and Theorem 8 to find parameters for a neural network of size 1—2—1 that determine a suboptimal local minimum. This concludes the construction of Fukumizu and Amari [11]. The obtained network now serves as a base for a proof by induction to show that suboptimal local minima also exist in arbitrarily wide neural networks.

Theorem 13.
There is an extremely wide two-layer neural network with sigmoid activation functions and arbitrarily many neurons in the hidden layer that has a non-attracting region of suboptimal local minima.

Proof.
Having already established the existence of parameters for a (small) neural network leading to a suboptimal local minimum, it suffices to note that iteratively adding neurons using Theorem 8 is possible. Iteratively at step t, we add a neuron n(1, −t; x) (negatively indexed) to the network by an application of γ_λ with the same λ ∈ (0, 1), using the same neuron with index 1 in each iteration. The corresponding matrix from Equation 1 remains a positive number

    B^{1,(t)}_{1,1} = Σ_α (f(x_α) − y_α) · (1 − λ)^{t−1} · v^*_{•,1} · σ''(n(1, 1; x_α)) · x²_{α,1}.

(We use here that neither f(x_α) nor n(1, 1; x_α) ever change during this construction and that the outgoing weight from the hidden neuron with index 1 changes by multiplication with (1 − λ).) The idea is to apply Theorem 8 to guarantee that adding neurons as above always leads to a suboptimal minimum with nonzero loss for the network for λ ∈ (0, 1). The only problem is that the addition of previous neurons creates the possibility of reparameterizations that keep the loss unchanged, i.e., the assumption in Theorem 8 of having an isolated minimum is violated. However, the computation of Lemma 12 is still valid, and to ensure a local minimum from a positive semi-definite Hessian it is only necessary to additionally make sure that no reduction is possible along directions defined by the kernel of the Hessian, i.e., the space of eigenvectors of the Hessian with eigenvalue zero. Since all these eigenvectors correspond to reparameterizations that cannot be influenced by the additional weights in and out of newly added neurons, it can be seen that the loss is indeed locally constant in all directions defined by the kernel of the Hessian. In this case, positive semi-definiteness is sufficient to ensure a local minimum. In particular, we may add an arbitrary number of neurons to the hidden layer and make the network extremely wide.
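A quick sanity check of the bookkeeping in this proof (with illustrative numbers of our own choosing): splitting the same hidden unit t times with a fixed λ leaves the total outgoing weight, and hence the network function, unchanged, while the repeatedly split unit's weight shrinks by the factor (1 − λ)^t that reappears in B^{1,(t)}_{1,1}:

```python
lam, v0 = 0.3, 1.5        # illustrative split parameter and initial weight
v = v0                    # outgoing weight of the repeatedly split unit
new_units = []            # outgoing weights of the added copies
for t in range(4):        # split the same unit four times
    new_units.append(lam * v)   # the new neuron receives lam * v
    v *= (1.0 - lam)            # the split unit keeps (1 - lam) * v

total = v + sum(new_units)      # total contribution is preserved exactly
shrink = v / v0                 # equals (1 - lam)**4 after four splits
```

Since `total` equals `v0` at every step, all network outputs f(x_α), and therefore the loss, stay fixed throughout the induction.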
Further, a continuous change of the λ belonging to the last added neuron via γ_λ to a value outside of [0, 1] does not change the network function, but leads to a saddle point (as we introduce a negative factor αβ < 0 ⇔ λ ∉ (0, 1) into the calculation of the Hessian at the position of αβ·B^{1,(t)}_{1,1}, see Lemma 12). Hence, we have found a non-attracting region of suboptimal minima.

Remark 14.
Since we started the construction from a network of size 1—1—1, our constructed example is extremely degenerate: the suboptimal local minima of the wide network have identical incoming weight vectors for each hidden neuron. Obviously, the suboptimality of this parameter setting is easily discovered by inspection of the parameters. Also, with proper initialization, the chance of landing in this local minimum is vanishingly small. However, one may also start the construction from a more complex network with several hidden neurons. In this case, when adding a few more neurons using γ_λ, it is much harder to detect the suboptimality of the parameters from visual inspection.

C. Deep Neural Networks

According to Theorem 9, for deep networks there is a second condition for the construction of local minima using the map γ^r_λ. Next to positive definiteness of the matrix B^r_{i,j} for some r, we additionally require that D^{r,s}_i = 0 for all i, s and the same r. We consider sufficient conditions for D^{r,s}_i = 0.

Proposition 15.
Suppose we have constructed a critical point of the squared loss of a neural network by starting from a local minimum of a smaller network and by adding a neuron into layer l with index n(l, −1; x) by application of the map γ^r_λ to a neuron n(l, r; x). Suppose further that for the outgoing weights v^*_{s,r} of n(l, r; x) we have Σ_s v^*_{s,r} ≠ 0, and suppose that D^{r,s}_i is defined as in Equation 2. Then D^{r,s}_i = 0 if one of the following holds.

(i) The layer l is the last hidden layer. (This condition includes the case l = 1 indexing the hidden layer in a two-layer network.)
(ii) For all t, t', α, we have ∂ℓ_α/∂n(l+1, t; x_α) = ∂ℓ_α/∂n(l+1, t'; x_α).
(iii) For each α and each t, ∂ℓ_α/∂n(l+1, t; x_α) = 0. (This condition holds in the case of the weight infinity attractors in the proof of Theorem 4 for l+1 the second-to-last layer. It also holds in a global minimum.)

The proof of the proposition is contained in Appendix F, and we apply it to construct a non-attracting region of suboptimal local minima in a deep and wide neural network.
Proof. (Theorem 2)
For simplicity of the presentation, we show the existence of local minima for a three-layer neural network, but the same construction naturally generalizes to deeper networks. We begin with a regression network of size 2—n_1—n_2—1 for input dimension n_0 = 2, hidden layers of dimension n_1, n_2, and the output layer mapping into R. We use this network to construct a finite data set and train a network of size 2—1—1—1 with one neuron in each hidden layer on this data set to find an isolated local minimum.

In Equations 1 and 2 we have suppressed the choice of the layer to simplify the notation. To distinguish layers here, we write [B^r_{i,j}(1)]_{i,j}, [D^{r,s}_i(1)]_{i,s} and [B^r_{i,j}(2)]_{i,j}, [D^{r,s}_i(2)]_{i,s} for the matrices of the first and second hidden layer respectively. We assume that the local minimum satisfies that [B_{i,j}(1)]_{i,j} is positive definite and B_{1,1}(2) > 0. Existence of such an example is easily verified empirically, and we provide an example in the following section.

Starting with the first hidden layer, the condition of Proposition 15 (ii) is trivially satisfied as there is only one neuron in the second layer. Hence D(1) = 0 and we can iteratively add arbitrarily many neurons to the first hidden layer by repetitively applying γ_λ on n(1, 1; x) and Theorem 9 (with λ ∈ (0, 1)). (Precisely as in the iterative application of Theorem 8 in the proof of Theorem 13, we can see that in this iterative process the assumption of starting from an isolated local minimum can be relaxed.) The relevant matrices are given at each step t by [B^{1,(t)}_{i,j}(1)]_{i,j} = (1 − λ)^{t−1}[B^{1,(1)}_{i,j}(1)]_{i,j}, which is positive definite, and D(1) does not change and remains zero. We can therefore find a local minimum in a network of size 2—m_1—1—1 for any m_1.

For the second layer, the matrix B(2) := [B_{i,j}(2)]_{i,j} equals the constant matrix of size (m_1 × m_1) with entry the positive number B_{1,1}(2) calculated before adding neurons to the first layer, and is positive semidefinite.
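The semidefiniteness claim is immediate: a constant matrix with positive entry b equals b times the rank-one matrix 1·1^T, so it has one positive eigenvalue b·m_1 and the rest zero. A quick numerical confirmation (toy size m_1 = 4 and an illustrative entry of our choosing):

```python
import numpy as np

b = 0.7                      # illustrative positive value of B_{1,1}(2)
m1 = 4
B2 = b * np.ones((m1, m1))   # constant matrix = b * outer(ones, ones)

eigs = np.linalg.eigvalsh(B2)   # ascending eigenvalues
# spectrum: one positive eigenvalue b * m1, the rest zero,
# so B2 is positive semidefinite but not positive definite
psd = bool(eigs[0] >= -1e-12)
```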
Now, Proposition 15 (i) applies to show that we can apply Theorem 9 to iteratively add arbitrarily many neurons to the second hidden layer using γ_λ on n(2, 1; x). When adding the t-th neuron to the second layer, we have [B^{1,(t)}_{i,j}(2)]_{i,j} = (1 − λ)^{t−1} B(2), which is positive semidefinite. Since the matrix B(2) is only positive semidefinite and not positive definite, we again need to give special consideration to the newly added degenerate directions: the first layer contains identical neurons from enlarging that layer. The new degenerate directions are consequently given by increasing a weight connecting a new neuron in the second layer with one of these identical neurons in the first layer and decreasing the weight from the second identical neuron such that their contributions cancel out. This defines another reparameterization that leaves the loss constant and does not interfere with any of the other reparameterizations. This excludes the possibility of loss reduction along degenerate directions, and positive semi-definiteness is sufficient to guarantee a local minimum.

This leads to a local minimum in a network of size 2—m_1—m_2—1. If m_1 ≥ n_1, m_2 ≥ n_2, then we know the local minimum to be suboptimal by construction. Adding sufficiently many neurons, we can construct the network to be extremely wide. A continuous change of the parameter λ ∈ (0, 1) used for adding the t-th neuron changes the sign of the corresponding B-matrix for negative λ, which offers a direction in weight space for further reduction of the loss. This shows the existence of a non-attracting region in a deep and wide neural network, proving Theorem 2.

[Fig. 3. A non-attracting region of local minima in a deep and wide neural network. (a) Local minimum; top: loss evolution for 5000 random directions; bottom: minimum over sampled directions. (b) Path along a degenerate direction to a saddle point. (c) Saddle point with the same loss value; top: loss evolution for 5000 random directions; bottom: minimum over sampled directions. (d) Error evolution along an analytically known direction of descent at the saddle point.]

D. Experiment for Deep Neural Networks
We empirically validate the construction of a suboptimal local minimum in a deep and extremely wide neural network as given in the proof of Theorem 2. We start by considering a three-layer network of size 2—5—5—1, i.e., we have two input dimensions, one output dimension and hidden layers of five neurons. We use its network function f to create a data set of 20 samples (x_α, f(x_α)); hence we know that a network of size 2—5—5—1 can attain zero loss.

We initialize a new neural network of size 2—1—1—1 and train it until convergence to find a local minimum of nonzero total loss (sum over the data points). We check for positive definiteness of the matrix B_{i,j}(1) (both eigenvalues positive) and positivity of B_{1,1}(2). Following the proof of Theorem 2, we add twenty neurons to both hidden layers to construct a local minimum in an extremely wide network of size 2—21—21—1. The local minimum must be suboptimal by construction of the data set. Experimentally, we show not only that indeed we end up with a suboptimal minimum, but also that it belongs to a non-attracting region of local minima. (The accompanying code can be found at https://github.com/petzkahe/nonattracting_regions_of_local_minima.git.)

Fig. 3 shows results of this construction. The plot in (a) shows the loss in the neighborhood of the local minimum in parameter space. The top image shows the loss curve in 5000 randomly generated directions, the bottom displays the minimal loss over all these directions. Evidently, we were not able to find a direction in parameter space that allows us to reduce the loss. The plot in (b) shows the change of loss along one of the degenerate directions that leads to a saddle point, which is shown in plot (c). Again, the top image shows the loss curve in 5000 randomly generated directions, and the bottom displays the minimum loss over all these directions.
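The probing procedure behind Fig. 3 can be approximated in a few lines: sample random unit directions around a candidate point, evaluate the loss along each, and record the minimum found. This is our own generic sketch (not the authors' released code), demonstrated on a convex toy loss where, by construction, no sampled direction can decrease the loss:

```python
import numpy as np

def probe(loss_fn, w0, n_dirs=5000, radius=0.1, n_steps=20, seed=0):
    """Minimum loss found along random unit directions around w0."""
    rng = np.random.default_rng(seed)
    ts = np.linspace(0.0, radius, n_steps + 1)[1:]   # step sizes t > 0
    best = np.inf
    for _ in range(n_dirs):
        d = rng.normal(size=w0.shape)
        d /= np.linalg.norm(d)                       # random unit direction
        for t in ts:
            best = min(best, loss_fn(w0 + t * d))
    return best

# Convex toy loss with its minimum at 0: probing cannot beat the base loss.
loss = lambda w: float(np.sum(w ** 2)) + 1.0
base = loss(np.zeros(4))
best = probe(loss, np.zeros(4), n_dirs=200)
```

At a genuine local minimum of a network loss, `best` stays at or above the base loss (Fig. 3 (a)); at a saddle point, some sampled or analytically known direction yields `best` below it (Fig. 3 (c) and (d)).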
Most random directions in parameter space lead to an increase in loss, but the minimum loss shows the existence of directions that lead to a reduction of loss. We know a direction for loss reduction at this saddle point analytically from Lemma 12. The plot in (d) shows a significant reduction in loss along this analytically known direction. Being able to reach a saddle point from a local minimum by a path of non-increasing loss shows that we have found a non-attracting region of local minima.

E. A Discussion of Limitations and of the Loss of Non-attracting Regions of Suboptimal Minima
We theoretically proved the possibility of suboptimal local minima in deep and wide networks and empirically validated their existence. Due to the degeneracy of the constructed examples, the following questions on consequences in practice remain unanswered: (i) how frequent are suboptimal local minima with high loss, and (ii) how degenerate must they be to have high loss?

Suppose we aim to find a suboptimal local minimum using the above construction in a network consisting of L layers with hidden dimensions n_l, and we want the local minimum to be less degenerate than the above examples while still having considerably high loss. We therefore want to start from a local minimum in a smaller network with hidden dimensions ν_i and add only a few neurons to each of the layers. We want each ν_i small enough to find a (non-degenerate) local minimum in the smaller network with large loss, but not so small that the addition of several neurons renders the local minimum degenerate. For the construction to work, we need to find in each layer l a neuron with index r_l such that the n_{l−1}-many eigenvalues of the corresponding B-matrix are either all positive or all negative, so that the matrix is either positive definite or negative definite (determining a suitable choice for λ according to Theorem 9). In addition, the same neuron must satisfy D^{r_l,s}_i = 0 for all i, s, adding n_{l−1} · n_{l+1}-many conditions. To find such an example is difficult both theoretically (the sufficient conditions from Proposition 15 for a vanishing D-matrix are rather strong) and empirically. Whenever any of the necessary conditions is violated, we cannot use the above construction to find a local minimum in a larger network. In other words, whenever we find a local minimum in the smaller network such that no neuron exists that satisfies all the necessary conditions (positive definiteness of B and zero D-matrix), then the above construction leads to a saddle point.
While these saddle points can be close to being a local minimum (in the sense that slight perturbations of the parameters yield a higher loss in most cases), there is at least one direction in parameter space with negative curvature of the loss surface. It is therefore conceivable that suboptimal local minima with high loss are extremely rare in practical applications with sufficiently large networks, which is in line with empirical observations. From this perspective, our results suggest that for sufficiently wide deep networks almost all local minima are global, but the existence of degenerate local minima makes a general statement impossible.

VI. PROVING THE EXISTENCE OF A NON-INCREASING PATH TO THE GLOBAL MINIMUM
In the previous section we showed the existence of non-attracting regions of local minima, and Theorem 4 showed the existence of generalized local minima at infinity that can cause divergent parameters during loss reduction. These types of local minima do not rule out the possibility of non-increasing paths to the global minimum from almost everywhere in parameter space. In this section, we sketch the proof of Theorem 5 in the form of several lemmas, where, up to the basic assumptions on the neural network structure as in Section III (with activation function in A), the assumption of each lemma is given by the conclusion of the previous ones. A full proof can be found in Appendix G.

[Fig. 4. Non-increasing paths to the global minimum exist from almost everywhere; visualization of proof ideas. Theorem 5: visualization of the considered neural network architecture. Lemma 16: an ε-change guarantees linear independence. Lemmas 17 and 18: realizing paths of activation vectors by paths in parameter space. Lemmas 19 and 18: realizing paths of activation vectors for decreasing hidden dimensions.]

We consider vectors that we call activation vectors, different from the activation pattern vectors act(l; x) from above. The activation vector at neuron k in layer l is denoted by a^l_k and defined by all values at the given neuron for samples x_α:

    a^l_k := [act(l, k; x_α)]_α.

In other words, while we fix l and x for the activation pattern vectors act(l; x) and let k run over its possible values, we fix l and k for the activation vectors a^l_k and let x run over the samples x_α in the data set. We denote by a^l the matrix [a^l_k]_k = [act(l, k; x_α)]_{k,α} of size (n_l × N) containing the activation values of all neurons and samples at layer l. Similarly, we denote by n^l the matrix n^l = [n(l, k; x_α)]_{k,α} of size (n_l × N) containing the pre-activation neuron values for all neurons and samples at layer l.

The first step of the proof is to use the freedom given by ε > 0 to change the starting point in parameter space so that the activation vectors a^{l*}_k of the extremely wide layer l* span the entire space R^N.

Lemma 16. [22, Corollary 4.5] For each choice of parameters w and all ε > 0 there is w' such that
(i) ||w − w'|| < ε,
(ii) the activation vectors a^{l*}_k of the extremely wide layer l* (containing more neurons than the number of training samples N) at parameters w' satisfy span_k a^{l*}_k = R^N, and
(iii) the weight matrices (w')^l have full rank for all l > l* + 1.
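Both the span condition of Lemma 16 and the constructive step it enables are easy to observe numerically. In the sketch below (toy sizes of our own choosing), random weights give an activation matrix of full rank N, and a right pseudoinverse then realizes any straight pre-activation path in the following layer by a linear path of that layer's weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
N, d, width, m = 8, 3, 12, 4          # width > N: the "extremely wide" layer
X = rng.normal(size=(N, d))           # N training inputs
W_wide = rng.normal(size=(width, d))  # generic weights (a full-measure event)
A = sigmoid(W_wide @ X.T)             # rows are the activation vectors a^{l*}_k

rank = np.linalg.matrix_rank(A)       # == N for generic weights (Lemma 16)

# With rank(A) = N we have pinv(A) @ A = I_N, so any pre-activation path
# in the next layer is realized by a linear weight path.
W0 = rng.normal(size=(m, width))      # current weights of layer l* + 1
n0 = W0 @ A                           # current pre-activations n^{l*+1}
n1 = rng.normal(size=(m, N))          # endpoint of a desired path
A_pinv = np.linalg.pinv(A)

def W_path(t):
    """Weight path realizing the straight path (1 - t) * n0 + t * n1."""
    return W0 + t * (n1 - n0) @ A_pinv

t = 0.37
ok = np.allclose(W_path(t) @ A, (1 - t) * n0 + t * n1)
```

This pseudoinverse construction is exactly the mechanism behind the realizability statements that follow: once the wide layer's activations span R^N, the next layer's pre-activations can be steered along any continuous path.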
The second step of the proof is to guarantee that we can then induce any continuous change of activation vectors in layer l* + 1 by suitable paths in parameter space, changing only the weights of the same layer. The following two lemmas ensure exactly that. We first consider pre-activation values and then consider the application of the activation function. We slightly abuse notation in the statements when adding a vector to a matrix, which shall mean the addition of the vector to all columns of the matrix.

Lemma 17.
Assume that in the extremely wide layer l* the activation vectors at a set of parameters w satisfy span_k [a^{l*}_k] = R^N. Then, for any continuous path n^{l*+1}_Γ : [0, 1] → R^{n_{l*+1} × N} with starting point n^{l*+1}_Γ(0) = n^{l*+1} = w^{l*+1} · a^{l*} + w^{l*+1}_0, there is a continuous path of parameters w^{l*+1}_Γ : [0, 1] → R^{n_{l*+1} × n_{l*}} of the (l* + 1)-th layer with w^{l*+1}_Γ(0) = w^{l*+1} and such that n^{l*+1}_Γ(t) = w^{l*+1}_Γ(t) · a^{l*} + w^{l*+1}_0.

Lemma 18.
For all continuous paths a_Γ(t) in Im(σ)^{n × N}, i.e., the (n × N)-fold copy of the image of an activation function σ ∈ A, there is a continuous path n_Γ(t) in R^{n × N} such that a_Γ(t) = σ(n_Γ(t)) for all t.

With activation values a^l depending on parameters w^ι of previous layers with index ι ≤ l, we denote this functional dependence by a^l(w). We say that a continuous path a^l_Γ : [0, 1] → Im(σ)^{n_l × N} of activation values in the l-th layer is realized by a path of parameters w_Γ(t) if the path w_Γ(t) induces a change of activation values at layer l according to the desired path a^l_Γ, i.e., a^l(w_Γ(t)) = a^l_Γ(t). Using this terminology, Lemmas 17 and 18 show that in a layer following an extremely wide one, any continuous path of activation values can be realized by a path of parameters of the same layer.

The third step guarantees that, as long as the sequence of dimensions of subsequent hidden layers never increases, realizability of arbitrary paths in layer l implies realizability of arbitrary paths in layer l + 1 for both activation and pre-activation values.

Lemma 19.
Assume that n_{l+1} ≤ n_l and that the weight matrix w^{l+1} ∈ R^{n_{l+1} × n_l} has full rank n_{l+1}. Let a^l = a^l(w) and n^{l+1} = n^{l+1}(w) be the matrices of activation values at the l-th layer and pre-activations at the (l+1)-th layer for parameters w, respectively. Then, for any continuous path n^{l+1}_Γ : [0, 1] → R^{n_{l+1} × N} with n^{l+1}_Γ(0) = n^{l+1} = w^{l+1} · a^l + w^{l+1}_0, there are continuous paths of (full-rank) weight matrices w^{l+1}_Γ : [0, 1] → R^{n_{l+1} × n_l} and bias w^{l+1}_{Γ,0} : [0, 1] → R^{n_{l+1}} in the (l+1)-th layer and a continuous path a^l_Γ : [0, 1] → Im(σ)^{n_l × N} of activation values in the l-th layer, such that w^{l+1}_Γ(0) = w^{l+1}, w^{l+1}_{Γ,0}(0) = w^{l+1}_0, a^l_Γ(0) = a^l and, for all t ∈ [0, 1],

    n^{l+1}_Γ(t) = w^{l+1}_Γ(t) · a^l_Γ(t) + w^{l+1}_{Γ,0}(t).

Combining Lemmas 18 and 19, we can realize any continuous path of activation values a^{l+1}_Γ(t) in layer l + 1 by a path of parameters if arbitrary paths of activation values a^l_Γ(t) can be realized in the previous layer l. By Lemma 17, we can indeed realize arbitrary paths in the layer following the extremely wide layer. Hence, by induction over the layers, we find that any path at the output is realizable. In the following result, we denote the dependence of the network function's output on its parameters w and the training sample x_α by f(w; x_α).

Lemma 20.
Assume a neural network structure as above with activation vectors a^{l*}_k of the extremely wide hidden layer spanning R^N, hidden dimensions n_{l+1} ≤ n_l for all l > l* and weight matrices w^{l+1} ∈ R^{n_{l+1} × n_l} of full rank n_{l+1} for all l > l*. Then for any continuous path f_Γ : [0, 1] → R^N with f_Γ(0) = [f(w; x_α)]_α there is a continuous path w_Γ(t) from the current weights w_Γ(0) = w that realizes f_Γ(t) as the output of the neural network function, f_Γ(t) = [f(w_Γ(t); x_α)]_α.

With fixed z_α = f(w; x_α), the prediction for the current weights, let z = [z_α]_α denote the vector of predictions, and let y = [y_α]_α denote the vector of all target values. An obvious path of decreasing loss at the output layer is then given by f_Γ(t) = z + t · (y − z), inducing the loss

    L = ||z + t · (y − z) − y||² = (1 − t)² ||y − z||².

This concludes the proof of Theorem 5 by applying Lemma 20 to this choice of f_Γ(t), after a possible change of the starting parameters w to arbitrarily close parameters w' using Lemma 16.

VII. CONCLUSION
We have proved the existence of suboptimal local minima for regression neural networks with sigmoid activation functions of arbitrary width. We established that these local minima live in a special region of the loss surface called a non-attracting region, and showed that a non-increasing path to a configuration with lower loss than that of the region can always be found. For sufficiently wide neural networks with decreasing hidden layer dimensions after the extremely wide layer, all local minima belong to such a region. We generalized a procedure for finding such regions in shallow networks, introduced by Fukumizu and Amari [11], to deep networks and described conditions for the construction to work. The necessary conditions become hard to satisfy in wider and deeper networks and, if they fail, the construction leads to saddle points instead. The appearance of an additional condition when extending Fukumizu and Amari [11]'s construction to deeper networks suggests that local minima are rare and degenerate in deep networks, but their existence shows that no general statement about all local minima being global can be made.

APPENDIX
A. General Notation

[x_α]_α ∈ R^n : column vector with entries x_α ∈ R
[x_{i,j}]_{i,j} ∈ R^{n × n'} : matrix with entry x_{i,j} at position (i, j)
Im(f) ⊆ R : image of a function f
C^n(X, Y) : n-times continuously differentiable functions from X to Y
N ∈ N : number of data samples in the training set
x_α ∈ R^{n_0} : training sample input
y_α ∈ R : target output for sample x_α
A ⊆ C(R) : class of real-analytic, strictly monotonically increasing, bounded (activation) functions such that the closure of the image contains zero
σ ∈ C(R, R) : a nonlinear activation function in class A
f ∈ C(R^{n_0}, R) : neural network function
l, 1 ≤ l ≤ L : index of a layer
L ∈ N : number of layers excluding the input layer; l = 0 is the input layer, l = L the output layer
n_l ∈ N : number of neurons in layer l
M = Σ_{l=1}^{L} (n_l · n_{l−1}) : number of all network parameters
k, 1 ≤ k ≤ n_l : index of a neuron in layer l
w^l ∈ R^{n_l × n_{l−1}} : weight matrix of the l-th layer
w ∈ R^M : collection of all w^l
w^l_{i,j} ∈ R : the weight from neuron j of layer l−1 to neuron i of layer l
w^L_{•,j} ∈ R : the weight from neuron j of layer L−1 to the output
w_Γ ∈ C([0, 1], R^M) : a path in parameter space
L, ℓ ∈ R_+ : squared loss over training samples
ℓ_α ∈ R_+ : the squared loss for data sample x_α
n(l, k; x) ∈ R : value at neuron k in layer l before activation for input pattern x
n(l; x) ∈ R^{n_l} : neuron pattern at layer l before activation for input pattern x
act(l, k; x) ∈ Im(σ) : activation at neuron k in layer l for input x
act(l; x) ∈ Im(σ)^{n_l} : activation pattern at layer l for input x

B. Notation Section V
In Section V, where we fix a layer l, we additionally use the following notation.

[u_{p,i}]_{p,i} ∈ R^{n_l × n_{l−1}} : weights of the given layer l
[v_{s,q}]_{s,q} ∈ R^{n_{l+1} × n_l} : weights of the layer l+1
r ∈ {1, 2, ..., n_l} : the index of the neuron of layer l that we use for the addition of one additional neuron
M ∈ N : = Σ_{t=1}^{L} (n_t · n_{t−1}), the number of weights in the smaller neural network
M' ∈ N : = M + n_{l−1} + n_{l+1}, the number of weights in the larger neural network
\bar{w} ∈ R^{M − n_{l−1} − n_{l+1}} : all weights except u_{r,i} and v_{s,r}
γ^r_λ ∈ C(R^M, R^{M'}) : the map defined in Section V to add a neuron in layer l using the neuron with index r in layer l
B^r_{i,j} ∈ R : = Σ_α Σ_k ∂ℓ_α/∂n(l+1, k; x_α) · v^*_{k,r} · σ''(n(l, r; x_α)) · act(l−1, i; x_α) · act(l−1, j; x_α)
D^{r,s}_i ∈ R : = Σ_α ∂ℓ_α/∂n(l+1, s; x_α) · σ'(n(l, r; x_α)) · act(l−1, i; x_α)
B = [B^r_{i,j}]_{i,j} ∈ R^{n_{l−1} × n_{l−1}} : matrix that needs to be positive or negative definite for a local minimum
D = [D^{r,s}_i]_{i,s} ∈ R^{n_{l−1} × n_{l+1}} : matrix that needs to be 0 for a local minimum

C. Notation Section VI
In Section VI, we additionally use the following notation.

a^l_k ∈ Im(σ)^N : activation vector at neuron k in layer l, given by a^l_k = [act(l, k; x_α)]_α
a^l ∈ Im(σ)^{n_l × N} : matrix of activations in layer l, given by a^l = [a^l_k]_k
a^l(w) ∈ Im(σ)^{n_l × N} : activations at layer l as a function of the parameters w
w_Γ ∈ C([0, 1], R^M) : a path in parameter space
w^l_Γ ∈ C([0, 1], R^{n_l × n_{l−1}}) : a path in parameter space at layer l
n^l ∈ R^{n_l × N} : matrix of pre-activation values in layer l, given by n^l = [n(l, k; x_α)]_{k,α}
n^l_Γ ∈ C([0, 1], R^{n_l × N}) : a path of pre-activation values in layer l
a^l_Γ ∈ C([0, 1], R^{n_l × N}) : a path of activation values in layer l
a^l_{Γ,j} ∈ C([0, 1], R^N) : a path of activation vectors at neuron j in layer l
f(w; x_α) ∈ R : network as a function of parameters w and sample x_α
f_Γ ∈ C([0, 1], R^N) : a path of outputs over training samples

D. Proofs for the Construction of Local Minima
Here we prove Theorem 9, which follows from two lemmas, the first being Lemma 12, which contains the computation of the Hessian of the cost function L of the larger network at parameters γ^r_λ([u^*_{r,i}]_i, [v^*_{s,r}]_s, \bar{w}^*) with respect to a suitable basis.
The proof only requires a tedious, but not complicated, calculation (using the relation αλ − β(1 − λ) = 0 multiple times). To keep the argumentation streamlined, we moved all the necessary calculations into the supplementary material.

The second lemma determines when matrices of the form calculated in Lemma 12 are positive semidefinite.

Lemma B.1.
Let a, b, c, d, e, f, g, h, x be matrices of appropriate sizes.

(a) A matrix of the form

[ a    b    c    0 ]
[ b^T  d    e    0 ]
[ c^T  e^T  f    0 ]
[ 0    0    0    x ]

is positive semidefinite if and only if both x and the matrix

[ a    b    c ]
[ b^T  d    e ]
[ c^T  e^T  f ]

are positive semidefinite.

(b) A matrix x of the form

x = [ g    h ]
    [ h^T  0 ]

is positive semidefinite if and only if g is positive semidefinite and h = 0.

Proof. (a) By definition, a matrix A is positive semidefinite if and only if z^T A z ≥ 0 for all z. Note now that

(z_1, z_2, z_3, z_4) [ a b c 0 ; b^T d e 0 ; c^T e^T f 0 ; 0 0 0 x ] (z_1, z_2, z_3, z_4)^T = (z_1, z_2, z_3) [ a b c ; b^T d e ; c^T e^T f ] (z_1, z_2, z_3)^T + z_4^T x z_4.

(b) It is clear that the matrix x is positive semidefinite for g positive semidefinite and h = 0. To show the converse, first note that if g is not positive semidefinite and z is such that z^T g z < 0, then

(z^T, 0) [ g h ; h^T 0 ] (z, 0)^T = z^T g z < 0.

It therefore remains to show that h = 0 is also a necessary condition. Assume h ≠ 0 and find z such that hz ≠ 0. Then for any λ ∈ R we have

((hz)^T, −λ z^T) [ g h ; h^T 0 ] (hz, −λz)^T = (hz)^T g (hz) − 2λ (hz)^T h z = (hz)^T g (hz) − 2λ ||hz||².

For sufficiently large λ, the last term is negative.

In addition, to deduce local minima from positive semidefiniteness, one needs to explain away all degenerate directions, i.e., we need to show that the loss function actually does not change in the direction of eigenvectors of the Hessian with eigenvalue 0. Otherwise a higher derivative in such a direction could be nonzero and potentially lead to a saddle point.

Proof of Theorem 9.
In Lemma 12, we calculated the Hessian of L with respect to a suitable basis at the critical point γ^r_λ([u*_{r,i}]_i, [v*_{s,r}]_s, w̄*). If the matrix [D^{r,s}_i]_{i,s} is nonzero, then by Lemma B.1(b) the Hessian is not positive semidefinite, hence none of the critical points are local minima.

If, on the other hand, the matrix [D^{r,s}_i]_{i,s} is zero, then by Lemma B.1(a+b) the Hessian is positive semidefinite, since the matrix

[ [∂²ℓ/∂u_{r,i}∂u_{r,j}]_{i,j}   [∂²ℓ/∂u_{r,i}∂v_{s,r}]_{i,s}   [∂²ℓ/∂w̄∂u_{r,i}]_{i,w̄} ]
[ [∂²ℓ/∂u_{r,i}∂v_{s,r}]_{s,i}   [∂²ℓ/∂v_{s,r}∂v_{t,r}]_{s,t}   [∂²ℓ/∂w̄∂v_{s,r}]_{s,w̄} ]
[ [∂²ℓ/∂w̄∂u_{r,i}]_{w̄,i}       [∂²ℓ/∂w̄∂v_{s,r}]_{w̄,s}       [∂²ℓ/∂w̄∂w̄′]_{w̄,w̄′} ]

is positive definite by assumption (isolated minimum), and αβ[B^r_{i,j}]_{i,j} is positive definite if λ ∈ (0,1) (equivalently, αβ > 0) and [B^r_{i,j}]_{i,j} is positive definite, or if λ < 0 or λ > 1 (equivalently, αβ < 0) and [B^r_{i,j}]_{i,j} is negative definite. In each case we can alter λ to values leading to saddle points without changing the network function or the loss. Therefore, the critical points can only be saddle points or local minima on a non-attracting region of local minima.

To determine whether the critical points in question lead to local minima when [D^{r,s}_i]_{i,s} = 0, it is insufficient to only prove the Hessian to be positive semidefinite (in contrast to (strict) positive definiteness); we also need to consider directions for which the second order information is insufficient. We know that the loss is at a minimum with respect to all coordinates except for the degenerate directions defined by a change of [v_{s,−1} − v_{s,r}]_s that keeps [v_{s,−1} + v_{s,r}]_s constant. However, the network function f(x) is constant along [v_{s,−1} − v_{s,r}]_s (keeping [v_{s,−1} + v_{s,r}]_s constant) at the critical point where u_{−1,i} = u_{r,i} for all i. Hence, no higher order information leads to saddle points and it follows that the critical point lies on a region of local minima.

E. Local Minima at Infinity in Neural Networks
In this section we prove Theorem 4, showing the existence of local minima at infinity in neural networks.
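Before the formal proof, the phenomenon can be seen in a toy example (entirely illustrative and not taken from the paper): for a single sigmoid output fitting the constant target 1, the squared loss (σ(u) − 1)² is strictly decreasing in the bias u, so its infimum is approached only as u → ∞, i.e., the minimum lies "at infinity".

```python
import numpy as np

# Toy illustration (not from the paper): fitting target 1 with a single
# sigmoid output sigma(u), the loss (sigma(u) - 1)^2 strictly decreases in
# the bias u, so the minimum is attained only in the limit u -> infinity.
def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

u_grid = np.linspace(-5.0, 15.0, 100)
loss = (sigmoid(u_grid) - 1.0) ** 2
# loss is strictly decreasing over the grid and stays positive
```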
Proof. (Theorem 4)
We will show that, if all bias terms u_{i,0} of the last hidden layer are sufficiently large, then there are parameters u_{i,k} for k ≠ 0 and parameters v_{•,i} of the output layer such that the minimal loss is achieved at u_{i,0} = ∞ for all i.

We note that, if u_{i,0} = ∞ for all i, all neurons of the last hidden layer are fully active for all samples, i.e., act(L−1, i; x_α) = 1 for all i. Therefore, in this case f(x_α) = Σ_i v_{•,i} for all α. A constant function f(x_α) = Σ_i v_{•,i} = c minimizes the loss ½ Σ_α (c − y_α)² uniquely for c := (1/N) Σ_{α=1}^N y_α. We will assume that the v_{•,i} are chosen such that Σ_i v_{•,i} = c does hold. That is, for fully active hidden neurons at the last hidden layer, the v_{•,i} are chosen to minimize the loss.

We write f(x_α) = c + ε_α. Then

L = ½ Σ_α (f(x_α) − y_α)² = ½ Σ_α (c + ε_α − y_α)² = ½ Σ_α (ε_α + (c − y_α))²
  = ½ Σ_α (c − y_α)² + ½ Σ_α ε_α² + Σ_α ε_α (c − y_α),

where the first term is the loss at u_{i,0} = ∞ for all i, the second term is nonnegative, and we denote the third term by (∗). The idea is now to ensure that (∗) ≥ 0 for sufficiently large u_{i,0} and in a neighborhood of the v_{•,i} chosen as above. Then the loss L is larger than at infinity, and any such point in parameter space with u_{i,0} = ∞ and v_{•,i} with Σ_i v_{•,i} = c is a local minimum.

To study the behavior at u_{i,0} = ∞, we consider p_i = exp(−u_{i,0}). Note that lim_{u_{i,0}→∞} p_i = 0. We have

f(x_α) = Σ_i v_{•,i} σ(u_{i,0} + Σ_k u_{i,k} act(L−2, k; x_α)) = Σ_i v_{•,i} · 1/(1 + p_i · exp(−Σ_k u_{i,k} act(L−2, k; x_α))).

Now for p_i close to 0 we can use the Taylor expansion of g^α_i(p_i) := 1/(1 + p_i exp(a^α_i)), with a^α_i := −Σ_k u_{i,k} act(L−2, k; x_α), to get g^α_i(p_i) = 1 − exp(a^α_i) p_i + O(|p_i|²). Therefore

f(x_α) = c − Σ_i v_{•,i} p_i exp(−Σ_k u_{i,k} act(L−2, k; x_α)) + O(p_i²)

and we find that ε_α = −Σ_i v_{•,i} p_i exp(−Σ_k u_{i,k} act(L−2, k; x_α)) + O(p_i²).

Recalling that we aim to ensure

(∗) = Σ_α ε_α (c − y_α) ≥ 0,

we consider

Σ_α ε_α (c − y_α) = −Σ_α (c − y_α) (Σ_i v_{•,i} p_i exp(−Σ_k u_{i,k} act(L−2, k; x_α))) + O(p_i²)
= −Σ_i v_{•,i} p_i Σ_α (c − y_α) exp(−Σ_k u_{i,k} act(L−2, k; x_α)) + O(p_i²).

We are still able to choose the parameters u_{i,k} for k ≠ 0, the parameters of previous layers, and the v_{•,i} subject to Σ_i v_{•,i} = c. If now v_{•,i} > 0 whenever

Σ_α (c − y_α) exp(−Σ_k u_{i,k} act(L−2, k; x_α)) < 0,

and v_{•,i} < 0 whenever

Σ_α (c − y_α) exp(−Σ_k u_{i,k} act(L−2, k; x_α)) > 0,

then the term (∗) is strictly positive, hence the overall loss is larger than the loss at p_i = 0 for sufficiently small p_i and in a neighborhood of the v_{•,i}. The only obstruction we have to get around is the case where we would need all v_{•,i} to be of the opposite sign of c (in other words, Σ_α (c − y_α) exp(−Σ_k u_{i,k} act(L−2, k; x_α)) has the same sign as c for every i), conflicting with Σ_i v_{•,i} = c. To avoid this case, we impose the mild condition that Σ_α (c − y_α) act(L−2, r; x_α) ≠ 0 for some r, which can be arranged to hold for almost every data set by fixing all parameters of layers with index smaller than L − 1.
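The first-order Taylor step used above can be checked numerically. A minimal sketch (the value of the exponent a is illustrative; it stands for −Σ_k u_{i,k} act(L−2, k; x_α)):

```python
import numpy as np

# Numerical check of the expansion 1/(1 + p*exp(a)) = 1 - exp(a)*p + O(p^2),
# used above with p = exp(-u_{i,0}) -> 0; the value of a is illustrative.
a = 0.7
errors = []
for p in [1e-2, 1e-3, 1e-4]:
    exact = 1.0 / (1.0 + p * np.exp(a))
    first_order = 1.0 - np.exp(a) * p
    errors.append(abs(exact - first_order))
# the remainder shrinks roughly quadratically as p decreases
```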
By Lemma 2 below (applied with d_α = (c − y_α) and a^r_α = act(L−2, r; x_α)), we can find weights u^>_k such that Σ_α (c − y_α) exp(−Σ_k u^>_k act(L−2, k; x_α)) > 0 and weights u^<_k such that Σ_α (c − y_α) exp(−Σ_k u^<_k act(L−2, k; x_α)) < 0. Hence, for each i the sign of the critical sum can be chosen freely, and the obstruction above does not occur.

Lemma 2. Suppose that m is a positive integer, m ≥ 1, and for α = 1, …, N and r = 1, …, m we have numbers d_α, a^r_α in R such that

Σ_{α=1}^N d_α = 0, and Σ_α d_α a^r_α ≠ 0 for some r.

Then there are u^>_1, …, u^>_m such that Σ_α d_α exp(−Σ_k u^>_k a^k_α) > 0 and u^<_1, …, u^<_m such that Σ_α d_α exp(−Σ_k u^<_k a^k_α) < 0.

Proof. Consider the function

φ(u_1, u_2, …, u_m) := Σ_α d_α exp(−Σ_k u_k a^k_α).

We have φ(0, 0, …, 0) = Σ_α d_α = 0. Further,

∂φ/∂u_r |_{(0,0,…,0)} = −Σ_α d_α a^r_α.

By assumption, there is r such that the last term is nonzero. Hence, moving along coordinate r, we can choose w^> = (0, …, 0, w_r, 0, …, 0) such that φ(w^>) is positive, and we can likewise choose w^< such that φ(w^<) is negative.

F. Construction of Local Minima in Deep Networks

Proof. (Proposition 15) The fact that property (i) suffices uses that ∂ℓ_α/∂n(l+1, •; x_α) reduces to (f(x_α) − y_α). Then, considering a regression network as before, our assumption says that v*_{•,r} ≠ 0, hence its reciprocal can be factored out of the sum in Equation 2.
Denoting the incoming weights into n(l, r; x) by u_{r,i} as before, this leads to

D^{r,•}_i = (1/v*_{•,r}) · Σ_α (f(x_α) − y_α) · v*_{•,r} · σ′(n(l, r; x_α)) · act(l−1, i; x_α) = (1/v*_{•,r}) · ∂ℓ/∂u_{r,i} = 0.

In the case of (ii), ∂ℓ_α/∂n(l+1, s; x_α) = ∂ℓ_α/∂n(l+1, t; x_α) for all s, t, and we can factor out the reciprocal of Σ_t v*_{t,r} ≠ 0 in Equation 2 to obtain for each i, s

D^{r,s}_i := Σ_α ∂ℓ_α/∂n(l+1, s; x_α) · σ′(n(l, r; x_α)) · act(l−1, i; x_α)
= (1/Σ_t v*_{t,r}) · Σ_α Σ_t v*_{t,r} · ∂ℓ_α/∂n(l+1, t; x_α) · σ′(n(l, r; x_α)) · act(l−1, i; x_α)
= (1/Σ_t v*_{t,r}) · ∂ℓ/∂u_{r,i} = 0.

Part (iii) is evident since in this case every summand in Equation 2 is zero.

G. Proofs for the Non-increasing Path to a Global Minimum

In this section we discuss how, in extremely wide neural networks with a non-increasing sequence of hidden layer dimensions after the extremely wide layer, a path to the global minimum that is non-increasing in loss can be found from almost everywhere in parameter space.

Theorem 5. Consider a fully connected regression neural network with activation function in the class A (as defined at the beginning of Section III), equipped with the squared loss function for a finite data set. Assume that a hidden layer contains more neurons than the number of input patterns and that the sequence of dimensions of all subsequent layers is non-increasing. Then, for each set of parameters w and all ε > 0, there is w′ such that ||w − w′|| < ε and such that a path, non-increasing in loss, from w′ to a global minimum (where f(x_α) = y_α for each α) exists.
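The proof below ends by realizing the straight-line output path z + t·(y − z) toward the targets. As a quick numerical check (with arbitrary illustrative vectors z and y), the induced squared loss along this path equals ½(1 − t)²||y − z||² and is non-increasing in t:

```python
import numpy as np

# Check (illustrative z, y): along the output path z + t*(y - z), the squared
# loss is 0.5*(1-t)^2*||y-z||^2, hence non-increasing on [0, 1].
rng = np.random.default_rng(0)
z = rng.standard_normal(5)   # current predictions
y = rng.standard_normal(5)   # targets

ts = np.linspace(0.0, 1.0, 11)
losses = np.array([0.5 * np.sum((z + t * (y - z) - y) ** 2) for t in ts])
closed_form = 0.5 * (1.0 - ts) ** 2 * np.sum((y - z) ** 2)
```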
The first step of the proof is to use the freedom given by ε > 0 to change the starting point in parameter space so that the activation vectors a^{l*}_k of the extremely wide layer l* span the entire space R^N.

Lemma 16. [22, Corollary 4.5] For each choice of parameters w and all ε > 0 there is w′ such that (i) ||w − w′|| < ε, (ii) the activation vectors a^{l*}_k of the extremely wide layer l* (containing more neurons than the number of training samples N) at parameters w′ satisfy span_k a^{l*}_k = R^N, and (iii) the weight matrices (w′)^l have full rank for all l > l* + 1.

The second step of the proof is to guarantee that we can then induce any continuous change of activation vectors in layer l* + 1 by suitable paths in parameter space changing only the weights of that layer. The following two lemmas ensure exactly that. We first consider pre-activation values and then consider the application of the activation function. We slightly abuse notation in the statements when adding a vector to a matrix, which shall mean the addition of the vector to all columns of the matrix.

Lemma 17. Assume that in the extremely wide layer l* the activation vectors at a set of parameters w satisfy span_k [a^{l*}_k] = R^N. Then, for any continuous path n^{l*+1}_Γ : [0,1] → R^{n_{l*+1} × N} with starting point n^{l*+1}_Γ(0) = n^{l*+1} = w^{l*+1} · a^{l*} + w^{l*+1}_0, there is a continuous path of parameters w^{l*+1}_Γ : [0,1] → R^{n_{l*+1} × n_{l*}} of the (l*+1)-th layer with w^{l*+1}_Γ(0) = w^{l*+1} and such that n^{l*+1}_Γ(t) = w^{l*+1}_Γ(t) · a^{l*} + w^{l*+1}_0.

Proof. We write n^{l*+1}_Γ(t) = n^{l*+1} + ñ_Γ(t) with ñ_Γ(0) = 0. We will find w̃_Γ(t) such that ñ_Γ(t) = w̃_Γ(t) · a^{l*} with w̃_Γ(0) = 0.
Then w^{l*+1}_Γ(t) := w^{l*+1} + w̃_Γ(t) does the job.

Since, by assumption, a^{l*} = [act(l*, k; x_α)]_{k,α} has full rank, we can find an invertible submatrix Ā ∈ R^{N × N} of a^{l*}. Then we can define a continuous path ω in R^{n_{l*+1} × N} given by ω(t) := ñ_Γ(t) · Ā^{−1}, which satisfies ω(t) · Ā = ñ_Γ(t) and ω(0) = 0. Extending ω(t) to a path w̃_Γ(t) in R^{n_{l*+1} × n_{l*}} by zero columns at the positions corresponding to the rows of a^{l*} missing in Ā gives w̃_Γ(t) · a^{l*} = ñ_Γ(t) and w̃_Γ(0) = 0, as desired.

Lemma 18. For all continuous paths a_Γ(t) in Im(σ)^{n × N}, i.e., the (n × N)-fold copy of the image of an activation function σ ∈ A, there is a continuous path n_Γ(t) in R^{n × N} such that a_Γ(t) = σ(n_Γ(t)) for all t.

Proof. Since σ : R^{n × N} → Im(σ)^{n × N} is invertible with a continuous inverse, take n_Γ(t) = σ^{−1}(a_Γ(t)).

With activation values a^l depending on the parameters w^ι of previous layers with index ι ≤ l, we denote this functional dependence by a^l(w). We say that a continuous path a^l_Γ : [0,1] → Im(σ)^{n_l × N} of activation values in the l-th layer is realized by a path of parameters w_Γ(t) if the path w_Γ(t) induces a change of activation values at layer l according to the desired path a^l_Γ, i.e., a^l(w_Γ(t)) = a^l_Γ(t). Using this terminology, Lemmas 17 and 18 show that in a layer following an extremely wide one, any continuous path of activation values can be realized by a path of parameters of the same layer.

The third step guarantees that, as long as the sequence of dimensions of subsequent hidden layers never increases, realizability of arbitrary paths in layer l implies realizability of arbitrary paths in layer l + 1, for both activation and pre-activation values.

Lemma 19. Assume that n_{l+1} ≤ n_l and that the weight matrix w^{l+1} ∈ R^{n_{l+1} × n_l} has full rank n_{l+1}.
Let a^l = a^l(w) and n^{l+1} = n^{l+1}(w) be the matrices of activation values at the l-th layer and pre-activations at the (l+1)-th layer for parameters w, respectively. Then, for any continuous path n^{l+1}_Γ : [0,1] → R^{n_{l+1} × N} with n^{l+1}_Γ(0) = n^{l+1} = w^{l+1} · a^l + w^{l+1}_0, there are continuous paths of (full-rank) weight matrices w^{l+1}_Γ : [0,1] → R^{n_{l+1} × n_l} and bias w^{l+1}_{Γ,0} : [0,1] → R^{n_{l+1}} in the (l+1)-th layer and a continuous path a^l_Γ : [0,1] → Im(σ)^{n_l × N} of activation values in the l-th layer, such that w^{l+1}_Γ(0) = w^{l+1}, w^{l+1}_{Γ,0}(0) = w^{l+1}_0, a^l_Γ(0) = a^l and, for all t ∈ [0,1],

n^{l+1}_Γ(t) = w^{l+1}_Γ(t) · a^l_Γ(t) + w^{l+1}_{Γ,0}(t).

Proof. Since the matrix w^{l+1} has full rank, it contains an invertible submatrix W ∈ R^{n_{l+1} × n_{l+1}}. Since we can permute the indices of neurons in each layer, we can assume without loss of generality that this submatrix consists of the first n_{l+1} columns of w^{l+1}.

For some suitable paths λ(t) ∈ R_{>0} and δ(t) ∈ R^{n_{l+1}}, which will be chosen later, we define

ñ_Γ(t) := n^{l+1}_Γ(t) − n^{l+1}_Γ(0),
w^{l+1}_Γ(t) := λ(t) · w^{l+1},
w^{l+1}_{Γ,0}(t) := w^{l+1}_0 − δ(t),
a^l_Γ(t) := (1/λ(t)) · ( a^l + [ W^{−1} ñ_Γ(t) ; 0_{n_l − n_{l+1}} ] + [ W^{−1} δ(t) ; 0_{n_l − n_{l+1}} ] ).

We then have ñ_Γ(0) = 0, and we choose λ(0) = 1 and δ(0) = 0. This implies that at t = 0 we have a^l_Γ(0) = a^l ∈ Im(σ)^{n_l × N}. Further,

w^{l+1}_Γ(t) · a^l_Γ(t) + w^{l+1}_{Γ,0}(t) = λ(t) · w^{l+1} · a^l_Γ(t) + w^{l+1}_0 − δ(t)
= w^{l+1} a^l + w^{l+1}_0 + w^{l+1} ( [W^{−1} ñ_Γ(t); 0_{n_l − n_{l+1}}] + [W^{−1} δ(t); 0_{n_l − n_{l+1}}] ) − δ(t)    (using w^{l+1} = [W ∗])
= n^{l+1}_Γ(0) + ñ_Γ(t) + δ(t) − δ(t) = n^{l+1}_Γ(t),

as desired.
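The algebra just carried out can be mirrored numerically. The sketch below (all sizes, paths, and the point t are illustrative) builds w^{l+1}_Γ(t) = λ(t)·w^{l+1}, the shifted bias, and the compensating a^l_Γ(t), and checks that their combination reproduces n^{l+1}_Γ(t); for simplicity the constraint a^l_Γ(t) ∈ Im(σ) is not enforced here, which is what the choice of λ(t) and δ(t) handles in the proof.

```python
import numpy as np

# Numerical sketch of the identity in the proof of Lemma 19 (illustrative
# sizes; one arbitrary point t on an arbitrary target path n_Gamma).
rng = np.random.default_rng(2)
n_l, n_lp1, N = 4, 2, 3
w = rng.standard_normal((n_lp1, n_l))   # full rank n_lp1 (generically)
W = w[:, :n_lp1]                        # invertible leading submatrix
a = rng.standard_normal((n_l, N))       # activations of layer l
b = rng.standard_normal((n_lp1, 1))     # bias, added to all columns
n0 = w @ a + b                          # n_Gamma(0)

t, lam = 0.6, 1.5                       # one point on the path; lam = lambda(t)
delta = rng.standard_normal((n_lp1, 1))
n_t = n0 + t * rng.standard_normal((n_lp1, N))   # target path value n_Gamma(t)
n_tilde = n_t - n0

pad = np.zeros((n_l - n_lp1, N))
pad_d = np.zeros((n_l - n_lp1, 1))
Winv = np.linalg.inv(W)
a_t = (a + np.vstack([Winv @ n_tilde, pad]) + np.vstack([Winv @ delta, pad_d])) / lam

# lam*w applied to a_t, plus the shifted bias, recovers n_Gamma(t) exactly
lhs = (lam * w) @ a_t + (b - delta)
```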
Note that with δ(t) = 0 and λ(t) = 1 for all t, we would obtain suitable paths with a^l_Γ(t) ∈ R^{n_l × N}, but due to the activations in the previous layer we must require that a^l_Γ(t) ∈ Im(σ)^{n_l × N}. Here we use the full freedom of choosing δ(t) and λ(t) to ensure this. In the case that 0 ∈ (c, d) = Im(σ), it suffices to fix δ(t) = 0 and to always choose λ(t) > 0 sufficiently large such that a^l_Γ(t) ∈ (c, d)^{n_l × N}. In the case that 0 lies on the boundary of the interval [c, d], we also need to choose δ(t) to guarantee the correct sign in each component of a^l_{Γ,k}(t); i.e., if c = 0, then choose δ(t) such that each entry of W^{−1} δ(t) is sufficiently large to guarantee that

( a^l + [W^{−1} ñ_Γ(t); 0_{n_l − n_{l+1}}] + [W^{−1} δ(t); 0_{n_l − n_{l+1}}] ) ∈ R^{n_l × N}_{>0}.

Combining Lemmas 18 and 19, we can realize any path of activation values a^{l+1}_Γ(t) in layer l + 1 by a path of parameters if arbitrary paths of activation values a^l_Γ(t) can be realized in the previous layer l. By Lemmas 17 and 18, arbitrary paths of activation values can be realized in the layer following the extremely wide layer. Hence, by induction over the layers, we find that any path at the output is realizable. In the following result, we denote the dependence of the network function on its parameters w and the training sample x_α by f(w; x_α).

Lemma 20. Assume a neural network structure as above with activation vectors a^{l*}_k of the extremely wide hidden layer spanning R^N, hidden dimensions n_{l+1} ≤ n_l for all l > l*, and weight matrices w^{l+1} ∈ R^{n_{l+1} × n_l} of full rank n_{l+1} for all l > l*. Then for any continuous path f_Γ : [0,1] → R^N with f_Γ(0) = [f(w; x_α)]_α there is a continuous path w_Γ(t) from the current weights w_Γ(0) = w that realizes f_Γ(t) as the output of the neural network function, f_Γ(t) = [f(w_Γ(t); x_α)]_α.

Proof.
As outlined before the statement of the lemma, the proof only requires a composition of the previous lemmas. We first show by induction over l that, for each l* < l ≤ L − 1, every continuous path a^l_Γ(t) ∈ C([0,1], R^{n_l × N}) can be realized by a continuous change of parameters of previous layers. That is, for all l* < l ≤ L − 1 and all continuous paths a^l_Γ(t) ∈ C([0,1], R^{n_l × N}) starting at the activation values in layer l for parameters w, i.e., a^l_Γ(0) = a^l(w), there is a continuous path of parameters w_Γ(t) with w_Γ(0) = w such that the activation values a^l as a function of w_Γ(t) satisfy a^l(w_Γ(t)) = a^l_Γ(t).

The base case for the induction, l = l* + 1, holds true by combining Lemmas 17 and 18. The induction step follows by combining Lemmas 18 and 19. This guarantees all necessary assumptions to also apply Lemma 19 to the last layer, showing that any path f_Γ(t) can be realized at the output.

Proof. (Theorem 5) Let w be a given set of parameters of the neural network and ε > 0. Applying Lemma 16, we find w′ such that (i) ||w − w′|| < ε, (ii) the activation vectors a^{l*}_k of the extremely wide layer l* (containing more neurons than the number of training samples N) at parameters w′ satisfy span_k a^{l*}_k = R^N, and (iii) the weight matrices (w′)^l have full rank for all l > l* + 1.

Parts (ii) and (iii), together with the assumption on the architecture of the network, guarantee the assumptions of Lemma 20 for w′, so that for any continuous path f_Γ : [0,1] → R^N with f_Γ(0) = [f(w′; x_α)]_α there is a continuous path w′_Γ(t) with w′_Γ(0) = w′ and f_Γ(t) = [f(w′_Γ(t); x_α)]_α.
So we only need to specify a desired path at the output, which we can then realize by a continuous change of the parameters of the neural network. With fixed z_α = f(w′; x_α), the prediction for the current weights, let z = [z_α]_α denote the vector of predictions, and let y = [y_α]_α denote the vector of all target values. An obvious path of decreasing loss at the output layer is then given by

t ∈ [0,1] ↦ z + t · (y − z),

inducing the loss

L = ½ ||z + t · (y − z) − y||² = ½ (1 − t)² ||y − z||².

REFERENCES

[1] Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
[2] Peter Auer, Mark Herbster, and Manfred K. Warmuth. Exponentially many local minima for single neurons. In Advances in Neural Information Processing Systems 8, NIPS 1995, pages 316–322, 1995.
[3] Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, 1989.
[4] Avrim L. Blum and Ronald L. Rivest. Training a 3-node neural network is NP-complete. Neural Networks, 5(1):117–127, 1992.
[5] Martin Lee Brady, Raghu Raghavan, and Joseph Slawny. Back propagation fails to separate where perceptrons succeed. IEEE Transactions on Circuits and Systems, 36(5):665–674, 1989.
[6] Alan J. Bray and David S. Dean. The statistics of critical points of Gaussian fields on large-dimensional spaces. Physical Review Letters, 98:150201, 2007.
[7] Anna Choromanska, Michaël Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2015, 2015.
[8] Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization.
In Advances in Neural Information Processing Systems 27, NIPS 2014, pages 2933–2941, 2014.
[9] Simon S. Du, Jason D. Lee, Yuandong Tian, Aarti Singh, and Barnabas Poczos. Gradient descent learns one-hidden-layer CNN: Don't be afraid of spurious local minima. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, pages 1338–1347, 2018.
[10] C. Daniel Freeman and Joan Bruna. Topology and geometry of half-rectified network optimization. arXiv e-prints, arXiv:1611.01540, 2017.
[11] Kenji Fukumizu and Shun-ichi Amari. Local minima and plateaus in hierarchical structures of multilayer perceptrons. Neural Networks, 13(3):317–327, 2000.
[12] Marco Gori and Alberto Tesi. On the problem of local minima in backpropagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(1):76–86, 1992.
[13] Benjamin D. Haeffele and Rene Vidal. Global optimality in neural network training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, pages 4390–4398, 2017.
[14] Leonard G. C. Hamey. XOR has no local minima: A case study in neural network error surface analysis. Neural Networks, 11(4):669–681, 1998.
[15] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems 29, NIPS 2016, pages 586–594, 2016.
[16] Thomas Laurent and James von Brecht. The multilinear structure of ReLU networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, pages 2914–2922, 2018.
[17] Shiyu Liang, Ruoyu Sun, Jason D. Lee, and Rayadurgam Srikant. Adding one neuron can eliminate all bad local minima. In Advances in Neural Information Processing Systems 31, NIPS 2018, pages 4355–4365, 2018.
[18] Shiyu Liang, Ruoyu Sun, Yixuan Li, and Rayadurgam Srikant. Understanding the loss surface of neural networks for binary classification. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, pages 2840–2849, 2018.
[19] Qianli Liao and Tomaso Poggio. Theory of deep learning II: Landscape of the empirical risk in deep learning.
arXiv e-prints, arXiv:1703.09833, 2017.
[20] Haihao Lu and Kenji Kawaguchi. Depth creates no bad local minima. arXiv e-prints, arXiv:1702.08580, 2017.
[21] Eiji Mizutani and Stuart Dreyfus. An analysis on negative curvature induced by singularity in multi-layer neural-network learning. In Advances in Neural Information Processing Systems 23, NIPS 2010, pages 1669–1677, 2010.
[22] Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, 2017.
[23] Quynh Nguyen and Matthias Hein. Optimization landscape and expressivity of deep CNNs. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, pages 3727–3736, 2018.
[24] Quynh Nguyen, Mahesh C. Mukkamala, and Matthias Hein. On the loss landscape of a class of deep neural networks with no bad local valleys. In Proceedings of the 7th International Conference on Learning Representations, ICLR 2019, 2019.
[25] Tohru Nitta. Resolution of singularities introduced by hierarchical structure in deep neural networks. IEEE Transactions on Neural Networks and Learning Systems, 28(10):2282–2293, 2017.
[26] Tomaso Poggio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, and Qianli Liao. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review. International Journal of Automation and Computing, 14(5):503–519, 2017.
[27] Timothy Poston, Chung-Nim Lee, YoungJu Choie, and Yonghoon Kwon. Local minima and back propagation. In IJCNN-91-Seattle International Joint Conference on Neural Networks, volume ii, pages 173–176, 1991.
[28] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, pages 318–362. MIT Press, Cambridge, MA, USA, 1986.
[29] Itay Safran and Ohad Shamir.
On the quality of the initial basin in overspecified neural networks. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, pages 774–782, 2016.
[30] Levent Sagun, V. Ugur Güney, Gérard Ben Arous, and Yann LeCun. Exploration on high dimensional landscapes. arXiv e-prints, arXiv:1412.6615, 2015.
[31] Cristian Sminchisescu and Bill Triggs. Building roadmaps of minima and transitions in visual models. International Journal of Computer Vision, 61(1), 2005.
[32] Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv e-prints, arXiv:1605.08361, 2016.
[33] Daniel Soudry and Elad Hoffer. Exponentially vanishing sub-optimal local minima in multilayer neural networks. arXiv e-prints, arXiv:1702.05777, 2017.
[34] Ida G. Sprinkhuizen-Kuyper and Egbert J. W. Boers. A local minimum for the 2-3-1 XOR network. IEEE Transactions on Neural Networks, 10(4):968–971, 1999.
[35] Grzegorz Swirszcz, Wojciech Marian Czarnecki, and Razvan Pascanu. Local minima in training of neural networks. arXiv e-prints, arXiv:1611.06310, 2016.
[36] Rene Vidal, Joan Bruna, Raja Giryes, and Stefano Soatto. Mathematics of deep learning. arXiv e-prints, arXiv:1712.04741, 2017.
[37] Xu Gang Wang, Zheng Tang, Hiroki Tamura, Masahiro Ishii, and Wei Dong Sun. An improved backpropagation algorithm to avoid the local minima problem. Neurocomputing, 56:455–460, 2004.
[38] Haikun Wei, Jun Zhang, Florent Cousseau, Tomoko Ozeki, and Shun-ichi Amari. Dynamics of learning near singularities in layered networks. Neural Computation, 20(3):813–843, 2008.
[39] Lodewyk F.A. Wessels and Etienne Barnard. Avoiding false local minima by proper initialization of connections. IEEE Transactions on Neural Networks, 3(6):899–905, 1992.
[40] Lodewyk F.A. Wessels, Etienne Barnard, and Eugene van Rooyen.
The physical correlates of local minima. In International Neural Network Conference, pages 985–985, 1990.
[41] Bo Xie, Yingyu Liang, and Le Song. Diversity leads to generalization in neural networks. arXiv e-prints, arXiv:1611.03131, 2016.
[42] Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. Global optimality conditions for deep neural networks. arXiv e-prints, arXiv:1707.02444, 2017.
[43] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv e-prints, arXiv:1611.03530, 2017.
[44] Yi Zhou and Yingbin Liang. Critical points of neural networks: Analytical forms and landscape properties. In Proceedings of the 6th International Conference on Learning Representations, ICLR 2018, 2018.

H. Calculations for Lemma 12

For the calculations we may assume without loss of generality that r = 1. (If we want to consider a different n(l, r; x) and its corresponding γ^r_λ, then this can be achieved by a reordering of the indices of neurons.)

We let ϕ denote the network function of the smaller neural network and f the network function of the larger network after adding one neuron according to the map γ_λ. To distinguish the parameters of f and ϕ, we write w^ϕ for the parameters of the network before the embedding. This gives, for all i, s and all m ≥ 2:

u_{−1,i} = u^ϕ_{1,i},  u_{1,i} = u^ϕ_{1,i},  v_{s,−1} = λ v^ϕ_{s,1},  v_{s,1} = (1 − λ) v^ϕ_{s,1},  u_{m,i} = u^ϕ_{m,i},  v_{s,m} = v^ϕ_{s,m},  w̄ = w̄^ϕ.

We do the same for neuron vectors and activation vectors. Using that f can be considered a composition of functions from consecutive layers, we denote the function from n(k; x) to the output by h_{•,k}(x), and we specify by the use of an upper script ϕ when the function h_{•,l+1} belongs to the network before the embedding. Key to the computation is the fact that all derivatives of f can be naturally written as derivatives of ϕ.
Concretely, implied by the embedding, all values at neurons n(l, i; x) and their activation values act(l, i; x) remain unchanged, i.e., we have for all m ≥ 2 and all l̃ ≠ l that

act(l, −1; x) = act^ϕ(l, 1; x),  act(l, m; x) = act^ϕ(l, m; x),  act(l̃, m; x) = act^ϕ(l̃, m; x),
n(l, −1; x) = n^ϕ(l, 1; x),  n(l, m; x) = n^ϕ(l, m; x),  n(l̃, m; x) = n^ϕ(l̃, m; x).

1) First order derivatives of the network functions f and ϕ. For the function f we have the following partial derivatives:

∂f(x)/∂u_{p,i} = Σ_k ∂h_{•,l+1}(n(l+1; x))/∂n(l+1, k; x) · v_{k,p} · σ′(n(l, p; x)) · act(l−1, i; x)

and

∂f(x)/∂v_{s,q} = ∂h_{•,l+1}(n(l+1; x))/∂n(l+1, s; x) · act(l, q; x).

The analogous equations hold for ϕ.

2) Relating first order derivatives of the network functions f and ϕ. Therefore, at ([u_{1,i}]_i, [v_{s,1}]_s, w̄) and γ_λ([u_{1,i}]_i, [v_{s,1}]_s, w̄) respectively, we get for k = −1, 1 that

∂f(x)/∂u_{−1,i} = λ · ∂ϕ(x)/∂u^ϕ_{1,i},  ∂f(x)/∂u_{1,i} = (1 − λ) · ∂ϕ(x)/∂u^ϕ_{1,i},  ∂f(x)/∂v_{s,k} = ∂ϕ(x)/∂v^ϕ_{s,1},

and for k ≥ 2 we get that

∂f(x)/∂u_{k,i} = ∂ϕ(x)/∂u^ϕ_{k,i},  ∂f(x)/∂v_{s,k} = ∂ϕ(x)/∂v^ϕ_{s,k}.

3) Second order derivatives of the network functions f and ϕ.
For the second derivatives we get (with δ(a, a) = 1 and δ(a, b) = 0 for a ≠ b)

∂²f(x)/∂u_{p,i}∂u_{q,j}
= Σ_m Σ_k ∂²h_{•,l+1}(n(l+1; x))/∂n(l+1, m; x)∂n(l+1, k; x) · v_{m,q} · σ′(n(l, q; x)) · act(l−1, j; x) · v_{k,p} · σ′(n(l, p; x)) · act(l−1, i; x)
+ δ(p, q) Σ_k ∂h_{•,l+1}(n(l+1; x))/∂n(l+1, k; x) · v_{k,p} · σ″(n(l, p; x)) · act(l−1, i; x) · act(l−1, j; x)

and

∂²f(x)/∂v_{s,p}∂v_{t,q} = ∂²h_{•,l+1}(n(l+1; x))/∂n(l+1, s; x)∂n(l+1, t; x) · act(l, p; x) · act(l, q; x)

and

∂²f(x)/∂u_{p,i}∂v_{s,q}
= Σ_k ∂²h_{•,l+1}(n(l+1; x))/∂n(l+1, s; x)∂n(l+1, k; x) · act(l, q; x) · v_{k,p} · σ′(n(l, p; x)) · act(l−1, i; x)
+ δ(q, p) · ∂h_{•,l+1}(n(l+1; x))/∂n(l+1, s; x) · σ′(n(l, p; x)) · act(l−1, i; x).

For a parameter w closer to the input than [u_{p,i}]_{p,i}, [v_{s,q}]_{s,q}, we have

∂²f(x)/∂u_{p,i}∂w
= Σ_m Σ_k ∂²h_{•,l+1}(n(l+1; x))/∂n(l+1, k; x)∂n(l+1, m; x) · ∂n(l+1, m; x)/∂w · v_{k,p} · σ′(n(l, p; x)) · act(l−1, i; x)
+ Σ_k ∂h_{•,l+1}(n(l+1; x))/∂n(l+1, k; x) · v_{k,p} · σ″(n(l, p; x)) · ∂n(l, p; x)/∂w · act(l−1, i; x)
+ Σ_k ∂h_{•,l+1}(n(l+1; x))/∂n(l+1, k; x) · v_{k,p} · σ′(n(l, p; x)) · ∂act(l−1, i; x)/∂w

and

∂²f(x)/∂v_{s,q}∂w
= Σ_m ∂²h_{•,l+1}(n(l+1; x))/∂n(l+1, s; x)∂n(l+1, m; x) · ∂n(l+1, m; x)/∂w · act(l, q; x)
+ ∂h_{•,l+1}(n(l+1; x))/∂n(l+1, s; x) · ∂act(l, q; x)/∂w.

For a parameter w closer to the output than [u_{p,i}]_{p,i}, [v_{s,q}]_{s,q}, we have

∂²f(x)/∂u_{p,i}∂w = Σ_k ∂²h_{•,l+1}(n(l+1; x))/∂n(l+1, k; x)∂w · v_{k,p} · σ′(n(l, p; x)) · act(l−1, i; x).

4) Relating second order derivatives of the network functions f and ϕ. To relate the second derivatives of f at γ_λ([u_{1,i}]_i, [v_{s,1}]_s, w̄) to the second derivatives of ϕ at ([u_{1,i}]_i, [v_{s,1}]_s, w̄), we define

A^{p,q}_{i,j}(x) := Σ_m Σ_k ∂²h^ϕ_{•,l+1}(n^ϕ(l+1; x))/∂n^ϕ(l+1, m; x)∂n^ϕ(l+1, k; x) · v^ϕ_{m,q} · σ′(n^ϕ(l, q; x)) · act^ϕ(l−1, j; x) · v^ϕ_{k,p} · σ′(n^ϕ(l, p; x)) · act^ϕ(l−1, i; x)

B^p_{i,j}(x) := Σ_k ∂h^ϕ_{•,l+1}(n^ϕ(l+1; x))/∂n^ϕ(l+1, k; x) · v^ϕ_{k,p} · σ″(n^ϕ(l, p; x)) · act^ϕ(l−1, i; x) · act^ϕ(l−1, j; x)

C^{p,s}_{i,q}(x) := Σ_k ∂²h^ϕ_{•,l+1}(n^ϕ(l+1; x))/∂n^ϕ(l+1, s; x)∂n^ϕ(l+1, k; x) · act^ϕ(l, q; x) · v^ϕ_{k,p} · σ′(n^ϕ(l, p; x)) · act^ϕ(l−1, i; x)

D^{p,s}_i(x) := ∂h^ϕ_{•,l+1}(n^ϕ(l+1; x))/∂n^ϕ(l+1, s; x) · σ′(n^ϕ(l, p; x)) · act^ϕ(l−1, i; x)

E^{s,t}_{p,q}(x) := ∂²h^ϕ_{•,l+1}(n^ϕ(l+1; x))/∂n^ϕ(l+1, s; x)∂n^ϕ(l+1, t; x) · act^ϕ(l, p; x) · act^ϕ(l, q; x)

Then for all i, j, p, q, s, t, we have ∂²ϕ(x)/∂u^ϕ_{p,i}∂u^ϕ_{q,j} = A^{p,q}_{i,j}
( x ) + δ ( q, p ) B pi,j ( x ) ∂ ϕ ( x ) ∂u ϕp,i ∂v ϕs,q = C p,si,q ( x ) + δ ( q, p ) D p,si ( x ) ∂ ϕ ( x ) ∂v s,p ∂v t,q = E s,tp,q ( x ) For f we get for p, q ∈ {− , } and all i, j, s, t∂ f ( x ) ∂u − ,i ∂u − ,j = λ A , i,j ( x ) + λB i,j ( x ) ∂ f ( x ) ∂u ,i ∂u ,j = (1 − λ ) A , i,j ( x ) + (1 − λ ) B i,j ( x ) ∂ f ( x ) ∂u − ,i ∂u ,j = ∂ f ( x ) ∂u ,i ∂u − ,j = λ (1 − λ ) · A , i,j ( x ) ∂ f ( x ) ∂u − ,i ∂v s, − = λC ,si, ( x ) + D ,si ( x ) ∂ f ( x ) ∂u ,i ∂v s, = (1 − λ ) C ,si, ( x ) + D ,si ( x ) ∂ f ( x ) ∂u − ,i ∂v s, = λ · C ,si, ( x ) = λ · ∂ ϕ ( x ) ∂u ϕ ,i ∂v ϕs, ∂ f ( x ) ∂u ,i ∂v s, − = (1 − λ ) · C ,si, ( x ) = (1 − λ ) · ∂ ϕ ( x ) ∂u ϕ ,i ∂v ϕs, ∂ f ( x ) ∂v s,p ∂v t,q = E s,t , ( x ) = ∂ ϕ ( x ) ∂v ϕs, ∂v ϕt, and for q ≥ and p ∈ {− , } and all i, j, s, t∂ f ( x ) ∂u − ,i ∂u q,j = λA ,qi,j ( x ) = λ · ∂ ϕ ( x ) ∂u ϕ ,i ∂u ϕq,j ∂ f ( x ) ∂u ,i ∂u q,j = (1 − λ ) A ,qi,j ( x ) = (1 − λ ) · ∂ ϕ ( x ) ∂u ϕ ,i ∂u ϕq,j ∂ f ( x ) ∂u − ,i ∂v s,q = λC ,si,q ( x ) = λ · ∂ ϕ ( x ) ∂u ϕ ,i ∂v ϕs,q ∂ f ( x ) ∂u ,i ∂v s,q = (1 − λ ) C ,si,q ( x ) = (1 − λ ) ∂ ϕ ( x ) ∂u ϕ ,i ∂v ϕs,q ∂ f ( x ) ∂u q,i ∂v s,p = C q,si, ( x ) = ∂ ϕ ( x ) ∂u ϕq,i ∂v ϕs, ∂ f ( x ) ∂v s,p ∂v t,q = E s,t ,q ( x ) = ∂ ϕ ( x ) ∂v ϕs, ∂v ϕt,q and for p, q ≥ and all i, j, s, t ∂ f ( x ) ∂u p,i ∂u q,j = A p,qi,j ( x ) + δ ( q, p ) B pi,j ( x ) = ∂ ϕ ( x ) ∂u ϕp,i ∂u ϕq,j ∂ f ( x ) ∂u p,i ∂v s,q = C p,si,q ( x ) + δ ( q, p ) D p,si ( x ) = ∂ ϕ ( x ) ∂u ϕp,i ∂v ϕs,q ∂ f ( x ) ∂v s,p ∂v t,q = E s,tp,q ( x ) = ∂ ϕ ( x ) ∂v ϕs,p ∂v ϕt,q 5) Derivatives of L and (cid:96) Let A p,qi,j := (cid:88) α ( ϕ ( x α ) − y α ) · A p,qi,j ( x α ) B pi,j := (cid:88) α ( ϕ ( x α ) − y α ) · B pi,j ( x α ) C p,si,q := (cid:88) α ( ϕ ( x α ) − y α ) · C p,si,q ( x α ) D p,si := (cid:88) α ( ϕ ( x α ) − y α ) · D p,si ( x α ) E s,tp,q := (cid:88) α ( ϕ ( x α ) − y α ) · E s,tp,q ( x α ) and A p,qi,j := (cid:88) α (cid:32) ∂ϕ ( x α ) ∂u ϕp,i (cid:33) · (cid:32) ∂ϕ ( x α ) ∂u ϕq,j (cid:33) C p,si,q := (cid:88) 
α (cid:32) ∂ϕ ( x α ) ∂u ϕp,i (cid:33) · (cid:18) ∂ϕ ( x α ) ∂v ϕs,q (cid:19) E s,tp,q := (cid:88) α (cid:18) ∂ϕ ( x α ) ∂v ϕs,p (cid:19) · (cid:32) ∂ϕ ( x α ) ∂v ϕt,q (cid:33) Now we have everything together to compute the derivatives of the loss. For the first derivative of the loss, wehave for any variables w, r that ∂ L ∂w = (cid:88) α ( f ( x α ) − y α ) · ∂f ( x α ) ∂w and ∂(cid:96)∂w ϕ = (cid:88) α ( ϕ ( x α ) − y α ) · ∂ϕ ( x α ) ∂w ϕ . From this it follows immediately that if ∂(cid:96)∂w ϕ ( w ϕ ) = 0 , then ∂ L ∂w ( γ λ ( w ϕ )) = 0 for all λ [cf. 11, 25].For the second derivative we get ∂ L ∂w∂r = (cid:88) α ( f ( x α ) − y α ) · ∂ f ( x α ) ∂w∂r + (cid:88) α (cid:18) ∂f ( x α ) ∂w (cid:19) · (cid:18) ∂f ( x α ) ∂r (cid:19) This leads to the following equations for (cid:96)∂ (cid:96)∂u ϕp,i ∂u ϕq,j = A , p,q + δ ( p, q ) B pi,j + A p,qi,j ∂ (cid:96)∂u ϕp,i ∂v ϕs,q = C p,si,q + δ ( p, q ) D p,si + C p,si,q ∂ (cid:96)∂v ϕs,p ∂v t,q = E s,tp,q + E s,tp,q For L we get for p, q ∈ {− , } and all i, j, s, t at γ λ ([ u ,i ] i , [ v s, ] s , ¯w ) ∂ L ∂u − ,i ∂u − ,j = λ A , i,j + λB i,j + λ A , i,j ∂ L ∂u ,i ∂u ,j = (1 − λ ) A , i,j + (1 − λ ) B i,j + (1 − λ ) A , i,j ∂ L ∂u − ,i ∂u ,j = λ (1 − λ ) A , i,j + λ (1 − λ ) A , i,j ∂ L ∂u − ,i ∂v s, − = λC ,si, + D ,si + λ C ,si, ∂ L ∂u ,i ∂v s, = (1 − λ ) C ,si, + D ,si + (1 − λ ) C ,si, ∂ L ∂u − ,i ∂v s, = λC ,si, + λ C ,si, ∂ L ∂u ,i ∂v s, − = (1 − λ ) C ,si, + (1 − λ ) C ,si, ∂ L ∂v s,p ∂v t,q = E s,t , + E s,t , and for q ≥ and p ∈ {− , } and all i, j, s, t∂ L ∂u − ,i ∂u q,j = λA ,qi,j + λ A , i,j ∂ L ∂u ,i ∂u q,j = (1 − λ ) A ,qi,j + (1 − λ ) A ,qi,j ∂ L ∂u − ,i ∂v s,q = λC ,si,q + λ C ,si,q ∂ L ∂u ,i ∂v s,q = (1 − λ ) C ,si,q + (1 − λ ) C ,si,q ∂ L ∂u q,i ∂v s,p = C q,si,p + C q,si,p ∂ L ∂v s,p ∂v t,q = E s,t ,q + E s,t ,q and for p, q ≥ and all i, j, s, t ∂ L ∂u p,i ∂u q,j = A p,qi,j + δ ( q, p ) B pi,j ( x ) + A p,qi,j = ∂ (cid:96)∂u ϕp,i ∂u ϕq,j ∂ L ∂u p,i ∂v s,q = C p,si,q + δ ( q, p ) D p,si + C p,si,q = 
∂ (cid:96)∂u ϕp,i ∂v ϕs,q ∂ L ∂v s,p ∂v t,q = E s,tp,q + E s,tp,q = ∂ (cid:96)∂v ϕs,p ∂v ϕt,q 6) Change of basis Choose any real numbers α (cid:54) = − β such that λ = βα + β (equivalently αλ − β (1 − λ ) = 0 ) and set µ − ,i = u − ,i + u ,i µ ,i = α · u − ,i − β · u ,i ν s, − = v s, − + v s, ν s, = v s, − − v s, .Then at γ λ ([ u ,i ] i , [ v s, ] s , ¯w ) , ∂ L ∂µ − ,i ∂µ − ,j = (cid:18) ∂∂u − ,i + ∂∂u ,i (cid:19) (cid:18) ∂ L ( x ) ∂u − ,j + ∂ L ( x ) ∂u ,j (cid:19) = ∂ L ( x ) ∂u − ,i ∂u − ,j + ∂ L ( x ) ∂u − ,i ∂u ,j + ∂ L ( x ) ∂u ,i ∂u − ,j + ∂ L ( x ) ∂u ,i ∂u ,j = (cid:16) λ A , i,j + λB i.j + λ A , i,j (cid:17) + (cid:16) λ (1 − λ ) A , i,j + λ (1 − λ ) A , i,j (cid:17) + (cid:16) λ (1 − λ ) A , i,j + λ (1 − λ ) A , i,j (cid:17) + (cid:16) (1 − λ ) A , i,j + (1 − λ ) B i.j + (1 − λ ) A , i,j (cid:17) = A , i,j + B i.j + A , i,j ∂ L ∂µ ,i ∂µ ,j = (cid:18) α ∂∂u − ,i − β ∂∂u ,i (cid:19) (cid:18) α ∂ L ( x ) ∂u − ,j − β ∂ L ( x ) ∂u ,j (cid:19) = α ∂ L ( x ) ∂u − ,i ∂u − ,j − αβ ∂ L ( x ) ∂u − ,i ∂u ,j − αβ ∂ L ( x ) ∂u ,i ∂u − ,j + β ∂ L ( x ) ∂u ,i ∂u ,j = α (cid:16) λ A , i,j + λB i.j + λ A , i,j (cid:17) − αβ (cid:16) λ (1 − λ ) A , i,j + λ (1 − λ ) A , i,j (cid:17) − αβ (cid:16) λ (1 − λ ) A , i,j + λ (1 − λ ) A , i,j (cid:17) + β (cid:16) (1 − λ ) A , i,j + (1 − λ ) B i.j + (1 − λ ) A , i,j (cid:17) = αβB i.j ∂ L ∂µ − ,i ∂µ ,j = (cid:18) ∂∂u − ,i + ∂∂u ,i (cid:19) (cid:18) α ∂ L ( x ) ∂u − ,j − β ∂ L ( x ) ∂u ,j (cid:19) = α ∂ L ( x ) ∂u − ,i ∂u − ,j − β ∂ L ( x ) ∂u − ,i ∂u ,j + α ∂ L ( x ) ∂u ,i ∂u − ,j − β ∂ L ( x ) ∂u ,i ∂u ,j = α (cid:16) λ A , i,j + λB i.j + λ A , i,j (cid:17) − β (cid:16) λ (1 − λ ) A , i,j + λ (1 − λ ) A , i,j (cid:17) + α (cid:16) λ (1 − λ ) A , i,j + λ (1 − λ ) A , i,j (cid:17) − β (cid:16) (1 − λ ) A , i,j + (1 − λ ) B i.j + (1 − λ ) A , i,j (cid:17) = 0 ∂ L ∂ν s, − ∂ν t, − = (cid:18) ∂∂v s, − + ∂∂v s, (cid:19) (cid:18) ∂ L ( x ) ∂v t, − + ∂ L ( x ) ∂v t, (cid:19) = ∂ L ( x ) ∂v s, − ∂v t, − + ∂ L ( x ) ∂v s, − ∂v t, + ∂ L ( 
x ) ∂v s, ∂v t, − + ∂ L ( x ) ∂v s, ∂v t, = (cid:16) E s,t , + E s,t , (cid:17) + (cid:16) E s,t , + E s,t , (cid:17) + (cid:16) E s,t , + E s,t , (cid:17) + (cid:16) E s,t , + E s,t , (cid:17) = 4 E s,t , + 4 E s,t , ∂ L ∂ν s, ∂ν t, = (cid:18) ∂∂v s, − − ∂∂v s, (cid:19) (cid:18) ∂ L ( x ) ∂v t, − − ∂ L ( x ) ∂v t, (cid:19) = ∂ L ( x ) ∂v s, − ∂v t, − − ∂ L ( x ) ∂v s, − ∂v t, − ∂ L ( x ) ∂v s, ∂v t, − + ∂ L ( x ) ∂v s, ∂v t, = (cid:16) E s,t , + E s,t , (cid:17) − (cid:16) E s,t , + E s,t , (cid:17) − (cid:16) E s,t , + E s,t , (cid:17) + (cid:16) E s,t , + E s,t , (cid:17) = 0 ∂ L ∂ν s, − ∂ν t, = (cid:18) ∂∂v s, − + ∂∂v s, (cid:19) (cid:18) ∂ L ( x ) ∂v t, − − ∂ L ( x ) ∂v t, (cid:19) = ∂ L ( x ) ∂v s, − ∂v t, − − ∂ L ( x ) ∂v s, − ∂v t, + ∂ L ( x ) ∂v s, ∂v t, − − ∂ L ( x ) ∂v s, ∂v t, = (cid:16) E s,t , + E s,t , (cid:17) − (cid:16) E s,t , + E s,t , (cid:17) + (cid:16) E s,t , + E s,t , (cid:17) − (cid:16) E s,t , + E s,t , (cid:17) = 0 ∂ L ∂µ − ,i ∂ν s, − = (cid:18) ∂∂u − ,i + ∂∂u ,i (cid:19) (cid:18) ∂ L ( x ) ∂v s, − + ∂ L ( x ) ∂v s, (cid:19) = ∂ L ( x ) ∂u − ,i ∂v s, − + ∂ L ( x ) ∂u − ,i ∂v s, + ∂ L ( x ) ∂u ,i ∂v s, − + ∂ L ( x ) ∂u ,i ∂v s, = (cid:16) λC ,si, + D ,si + λ C ,si, (cid:17) + (cid:16) λC ,si, + λ C ,si, (cid:17) + (cid:16) (1 − λ ) C ,si, + (1 − λ ) C ,si, (cid:17) + (cid:16) (1 − λ ) C ,si, + D ,si + (1 − λ ) C ,si, (cid:17) = 2 C ,si, + 2 D ,si + 2 C ,si, ∂ L ∂µ − ,i ∂ν s, = (cid:18) ∂∂u − ,i + ∂∂u ,i (cid:19) (cid:18) ∂ L ( x ) ∂v s, − − ∂ L ( x ) ∂v s, (cid:19) = ∂ L ( x ) ∂u − ,i ∂v s, − − ∂ L ( x ) ∂u − ,i ∂v s, + ∂ L ( x ) ∂u ,i ∂v s, − − ∂ L ( x ) ∂u ,i ∂v s, = (cid:16) λC ,si, + D ,si + λ C ,si, (cid:17) − (cid:16) λC ,si, + λ C ,si, (cid:17) + (cid:16) (1 − λ ) C ,si, + (1 − λ ) C ,si, (cid:17) − (cid:16) (1 − λ ) C ,si, + D ,si + (1 − λ ) C ,si, (cid:17) = 0 ∂ L ∂µ ,i ∂ν s, − = (cid:18) α ∂∂u − ,i − β ∂∂u ,i (cid:19) (cid:18) ∂ L ( x ) ∂v s, − + ∂ L ( x ) ∂v s, (cid:19) = α ∂ L ( x ) ∂u − ,i ∂v s, − + α ∂ L ( x ) ∂u − ,i ∂v s, − 
β ∂ L ( x ) ∂u ,i ∂v s, − − β ∂ L ( x ) ∂u ,i ∂v s, = α (cid:16) λC ,si, + D ,si + λ C ,si, (cid:17) + α (cid:16) λC ,si, + λ C ,si, (cid:17) − β (cid:16) (1 − λ ) C ,si, + (1 − λ ) C ,si, (cid:17) − β (cid:16) (1 − λ ) C ,si, + D ,si + (1 − λ ) C ,si, (cid:17) = ( α − β ) D ,si ∂ L ∂µ ,i ∂ν s, = (cid:18) α ∂∂u − ,i − β ∂∂u ,i (cid:19) (cid:18) ∂ L ( x ) ∂v s, − − ∂ L ( x ) ∂v s, (cid:19) = α ∂ L ( x ) ∂u − ,i ∂v s, − − α ∂ L ( x ) ∂u − ,i ∂v s, − β ∂ L ( x ) ∂u ,i ∂v s, − + β ∂ L ( x ) ∂u ,i ∂v s, = α (cid:16) λC ,si, + D ,si + λ C ,si, (cid:17) − α (cid:16) λC ,si, + λ C ,si, (cid:17) − β (cid:16) (1 − λ ) C ,si, + (1 − λ ) C ,si, (cid:17) + β (cid:16) (1 − λ ) C ,si, + D ,si + (1 − λ ) C ,si, (cid:17) = ( α + β ) D ,si For q ≥ and p ∈ {− , } ∂ L ∂µ − ,i ∂u q,j = (cid:18) ∂∂u − ,i + ∂∂u ,i (cid:19) (cid:18) ∂ L ( x ) ∂u q,j (cid:19) = λA ,qi,j + λ A , i,j + (1 − λ ) A ,qi,j + (1 − λ ) A ,qi,j = A ,qi,j + A ,qi,j ∂ L ∂µ ,i ∂u q,j = (cid:18) α ∂∂u − ,i − β ∂∂u ,i (cid:19) (cid:18) ∂ L ( x ) ∂u q,j (cid:19) = α ( λA ,qi,j + λ A , i,j ) + β ((1 − λ ) A ,qi,j + (1 − λ ) A ,qi,j )= 0 ∂ L ∂µ − ,i ∂v s,q = (cid:18) ∂∂u − ,i + ∂∂u ,i (cid:19) (cid:18) ∂ L ( x ) ∂v s,q (cid:19) = λC ,si,q + λ C ,si,q + (1 − λ ) C ,si,q + (1 − λ ) C ,si,q = C ,si,q + C ,si,q ∂ L ∂µ ,i ∂v s,q = (cid:18) α ∂∂u − ,i − β ∂∂u ,i (cid:19) (cid:18) ∂ L ( x ) ∂v s,q (cid:19) = α ( λC ,si,q + λ C ,si,q ) − β ((1 − λ ) C ,si,q + (1 − λ ) C ,si,q )= 0 ∂ L ∂ν s, − ∂u q,i = (cid:18) ∂∂v s, − + ∂∂v s, (cid:19) (cid:18) ∂ L ( x ) ∂u q,i (cid:19) = C q,si,p + C q,si,p + C q,si,p + C q,si,p = 2 C q,si,p + 2 C q,si,p ∂ L ∂ν s, ∂u q,i = (cid:18) ∂∂v s, − − ∂∂v s, (cid:19) (cid:18) ∂ L ( x ) ∂u q,i (cid:19) = C q,si,p + C q,si,p − C q,si,p − C q,si,p = 0 ∂ L ∂ν s, − ∂v t,q = (cid:18) ∂∂v s, − + ∂∂v s, (cid:19) (cid:18) ∂ L ( x ) ∂v t,q (cid:19) = E s,t ,q + E s,t ,q + E s,t ,q + E s,t ,q = 2 E s,t ,q + 2 E s,t ,q ∂ L ∂ν s, ∂v t,q = (cid:18) ∂∂v s, − − ∂∂v s, (cid:19) (cid:18) ∂ L ( x ) ∂v t,q (cid:19) = E s,t 
,q + E s,t ,q − E s,t ,q − E s,t ,q = 0 We also need to consider the second derivative with respect to the other variables of ¯w . If w is closer to theoutput than [ u p,i ] p,i , [ v s,q ] s,q belonging to layer γ where γ > l + 1 , then we get ∂ L ∂w∂µ − ,i = ∂∂w (cid:18) ∂ L ∂u − ,i + ∂ L ∂u ,i (cid:19) = (cid:88) α ∂f ( x α ) ∂w (cid:32)(cid:88) k ∂h • ,l +1 ( n ( l + 1; x α )) ∂ n ( l + 1 , k ; x α ) · v k, − · σ (cid:48) ( n ( l, − x α )) · act ( l − , i ; x α ) (cid:33) + (cid:88) α ∂f ( x α ) ∂w (cid:32)(cid:88) k ∂h • ,l +1 ( n ( l + 1; x α )) ∂ n ( l + 1 , k ; x α ) · v k, · σ (cid:48) ( n ( l, x α )) · act ( l − , i ; x α ) (cid:33) = (cid:88) α ( f ( x α ) − y α ) · (cid:88) k ∂ h • ,l +1 ( n ( l + 1; x α )) ∂w∂ n ( l + 1 , k ; x α ) · v k, − · σ (cid:48) ( n ( l, − x α )) · act ( l − , i ; x α )+ (cid:88) α ( f ( x α ) − y α ) · (cid:88) k ∂ h • ,l +1 ( n ( l + 1; x α )) ∂w∂ n ( l + 1 , k ; x α ) · v k, · σ (cid:48) ( n ( l, x α )) · act ( l − , i ; x α )= ∂ (cid:96)∂w ϕ ∂u ϕ ,i and ∂ L ∂w∂µ − ,i = ∂∂w (cid:18) α ∂ L ∂u − ,i − β ∂ L ∂u ,i (cid:19) = (cid:88) α ∂f ( x α ) ∂w · (cid:32)(cid:88) k ∂h • ,l +1 ( n ( l + 1; x α )) ∂ n ( l + 1 , k ; x α ) · αv k, − · σ (cid:48) ( n ( l, − x α )) · act ( l − , i ; x α ) (cid:33) − (cid:88) α ∂f ( x α ) ∂w · (cid:32)(cid:88) k ∂h • ,l +1 ( n ( l + 1; x α )) ∂ n ( l + 1 , k ; x α ) · βv k, · σ (cid:48) ( n ( l, x α )) · act ( l − , i ; x α ) (cid:33) + (cid:88) α ( f ( x α ) − y α ) · (cid:88) k ∂ h • ,l +1 ( n ( l + 1; x α )) ∂w∂ n ( l + 1 , k ; x α ) · αv k, − · σ (cid:48) ( n ( l, − x α )) · act ( l − , i ; x α ) − (cid:88) α ( f ( x α ) − y α ) · (cid:88) k ∂ h • ,l +1 ( n ( l + 1; x α )) ∂w∂ n ( l + 1 , k ; x α ) · βv k, · σ (cid:48) ( n ( l, x α )) · act ( l − , i ; x α )= 0 and ∂ L ∂w∂ν s, − = ∂∂w (cid:18) ∂ L ∂v s, − + ∂ L ∂v s, (cid:19) = (cid:88) α ∂f ( x α ) ∂w · (cid:18) ∂h • ,l +1 ( n ( l + 1; x α )) ∂ n ( l + 1 , s ; x α ) · act ( l, − x α )) (cid:19) + (cid:88) α ∂f ( x α ) ∂w · (cid:18) ∂h • ,l +1 
( n ( l + 1; x α )) ∂ n ( l + 1 , s ; x α ) · act ( l, x α ) (cid:19) + (cid:88) α ( f ( x α ) − y α ) · ∂ h • ,l +1 ( n ( l + 1; x α )) ∂w∂ n ( l + 1 , s ; x α ) · act ( l, − x α )+ (cid:88) α ( f ( x α ) − y α ) · ∂ h • ,l +1 ( n ( l + 1; x α )) ∂w∂ n ( l + 1 , s ; x α ) · act ( l, x α )= 2 · ∂ (cid:96)∂w ϕ ∂v ϕs, and ∂ L ∂w∂ν s, = ∂∂w (cid:18) ∂ L ∂v s, − − ∂ L ∂v s, (cid:19) = (cid:88) α ∂f ( x α ) ∂w · (cid:18) ∂h • ,l +1 ( n ( l + 1; x α )) ∂ n ( l + 1 , s ; x α ) · act ( l, − x α )) (cid:19) − (cid:88) α ∂f ( x α ) ∂w · (cid:18) ∂h • ,l +1 ( n ( l + 1; x α )) ∂ n ( l + 1 , s ; x α ) · act ( l, x α ) (cid:19) + (cid:88) α ( f ( x α ) − y α ) · ∂ h • ,l +1 ( n ( l + 1; x α )) ∂w∂ n ( l + 1 , s ; x α ) · act ( l, − x α ) − (cid:88) α ( f ( x α ) − y α ) · ∂ h • ,l +1 ( n ( l + 1; x α )) ∂w∂ n ( l + 1 , s ; x α ) · act ( l, x α )= 0 If w is closer to the input than [ u p,i ] p,i , [ v s,q ] s,q connecting neuron j of layer γ − with neuron r of layer γ where γ < l , then we get ∂ L ∂µ − ,i ∂w = (cid:18) ∂∂u − ,i + ∂∂u ,i (cid:19) (cid:18) ∂ L ∂w (cid:19) = (cid:88) α ( f ( x α ) − y α ) · (cid:18) ∂∂u − ,i + ∂∂u ,i (cid:19) (cid:18) ∂h • ,γ ( n ( γ ; x α )) ∂ n ( γ, r ; x α ) · act ( γ − , j ; x α ) (cid:19) + (cid:88) α (cid:18) ∂f ( x α ) ∂u − ,i + ∂f ( x α ) ∂u ,i (cid:19) · (cid:18) ∂h • ,γ ( n ( γ ; x α )) ∂ n ( γ, r ; x α ) · act ( γ − , j ; x α ) (cid:19) = (cid:88) α ( f ( x α ) − y α ) · (cid:88) k ∂ h • ,γ ( n ( γ ; x α )) ∂ n ( l + 1 , k ; x α ) ∂ n ( γ, r ; x α ) · v k, − · σ (cid:48) ( n ( l, − x α )) · act ( l − , i ; x α ) · act ( γ − , j ; x α )+ (cid:88) α ( f ( x α ) − y α ) · (cid:88) k ∂ h • ,γ ( n ( γ ; x α )) ∂ n ( l + 1 , k ; x α ) ∂ n ( γ, r ; x α ) · v k, · σ (cid:48) ( n ( l, x α )) · act ( l − , i ; x α ) · act ( γ − , j ; x α )+ (cid:88) α (cid:32) ∂ϕ ( x α ) ∂u ϕ ,i (cid:33) · (cid:18) ∂h • ,γ ( n ( γ ; x α )) ∂ n ( γ, r ; x α ) · act ( γ − , j ; x α ) (cid:19) = ∂ (cid:96)∂w ϕ ∂u ϕ ,i and ∂ L ∂µ − ,i ∂w = (cid:18) α ∂∂u − ,i − β ∂∂u 
,i (cid:19) (cid:18) ∂ L ∂w (cid:19) = (cid:88) α ( f ( x α ) − y α ) · (cid:18) α ∂∂u − ,i − β ∂∂u − ,i (cid:19) (cid:18) ∂h • ,γ ( n ( γ ; x α )) ∂ n ( γ, r ; x α ) · act ( γ − , j ; x α ) (cid:19) + (cid:88) α (cid:18) α ∂f ( x α ) ∂u − ,i − β ∂f ( x α ) ∂u − ,i (cid:19) · (cid:18) ∂h • ,γ ( n ( γ ; x α )) ∂ n ( γ, r ; x α ) · act ( γ − , j ; x α ) (cid:19) = (cid:88) α ( f ( x α ) − y α ) · (cid:88) k ∂ h • ,γ ( n ( γ ; x α )) ∂ n ( l + 1 , k ; x α ) ∂ n ( γ, r ; x α ) · v k, − · σ (cid:48) ( n ( l, − x α )) · act ( l − , i ; x α ) · act ( γ − , j ; x α ) − (cid:88) α ( f ( x α ) − y α ) · (cid:88) k ∂ h • ,γ ( n ( γ ; x α )) ∂ n ( l + 1 , k ; x α ) ∂ n ( γ, r ; x α ) · v k, · σ (cid:48) ( n ( l, x α )) · act ( l − , i ; x α ) · act ( γ − , j ; x α )+ (cid:88) α (cid:32) ( αλ − β (1 − λ )) ∂ϕ ( x α ) ∂u ϕ ,i (cid:33) · (cid:18) ∂h • ,γ ( n ( γ ; x α )) ∂ n ( γ, r ; x α ) · act ( γ − , j ; x α ) (cid:19) = 0 and ∂ L ∂ν s, − ∂w = (cid:18) ∂∂v s, − + ∂∂v s, (cid:19) (cid:18) ∂ L ∂w (cid:19) = (cid:88) α ( f ( x α ) − y α ) · (cid:18) ∂∂v s, − + ∂∂v s, (cid:19) (cid:18) ∂h • ,γ ( n ( γ ; x α )) ∂ n ( γ, r ; x α ) · act ( γ − , j ; x α ) (cid:19) + (cid:88) α (cid:18) ∂f ( x α ) ∂v s, − + ∂f ( x α ) ∂v s, (cid:19) · (cid:18) ∂h • ,γ ( n ( γ ; x α )) ∂ n ( γ, r ; x α ) · act ( γ − , j ; x α ) (cid:19) = (cid:88) α ( f ( x α ) − y α ) · ∂ h • ,γ ( n ( γ ; x α )) ∂ n ( l + 1 , s ; x α ) ∂ n ( γ, r ; x α ) · act ( l, − x α )) · act ( γ − , j ; x α )+ (cid:88) α ( f ( x α ) − y α ) · ∂ h • ,γ ( n ( γ ; x α )) ∂ n ( l + 1 , s ; x α ) ∂ n ( γ, r ; x α ) · act ( l, x α )) · act ( γ − , j ; x α )+ (cid:88) α (cid:32) ∂ϕ ( x α ) ∂v ϕs, + ∂ϕ ( x α ) ∂v ϕs, (cid:33) · (cid:18) ∂h • ,γ ( n ( γ ; x α )) ∂ n ( γ, r ; x α ) · act ( γ − , j ; x α ) (cid:19) = 2 · ∂ (cid:96)∂w ϕ ∂v ϕs, and ∂ L ∂ν s, ∂w = (cid:18) ∂∂v s, − − ∂∂v s, (cid:19) (cid:18) ∂ L ∂w (cid:19) = (cid:88) α ( f ( x α ) − y α ) · (cid:18) ∂∂v s, − − ∂∂v s, (cid:19) (cid:18) ∂h • ,γ ( n ( γ ; x α )) ∂ n ( γ, r ; x α 
) · act ( γ − , j ; x α ) (cid:19) + (cid:88) α (cid:18) ∂f ( x α ) ∂v s, − − ∂f ( x α ) ∂v s, (cid:19) · (cid:18) ∂h • ,γ ( n ( γ ; x α )) ∂ n ( γ, r ; x α ) · act ( γ − , j ; x α ) (cid:19) = (cid:88) α ( f ( x α ) − y α ) · ∂ h • ,γ ( n ( γ ; x α )) ∂ n ( l + 1 , s ; x α ) ∂ n ( γ, r ; x α ) · act ( l, − x α )) · act ( γ − , j ; x α ) − (cid:88) α ( f ( x α ) − y α ) · ∂ h • ,γ ( n ( γ ; x α )) ∂ n ( l + 1 , s ; x α ) ∂ n ( γ, r ; x α ) · act ( l, x α )) · act ( γ − , j ; x α )= 0 Finally, if w and w (cid:48) are parameters different from [ u p,i ] p,i , [ v s,q ] s,q , µ and ν , then ∂ L ∂w∂w (cid:48) = ∂ (cid:96)∂w ϕ ∂w (cid:48) ϕ 7) The Hessian Putting things together, the matrix for the second derivative of L with respect to µ − , ν − , ¯w , µ , ν , where ¯w stands for the collection of all other parameters, at γ λ ([ u ∗ ,i ] i , [ v ∗ s, ] s , ¯w ∗ ) is given by: [ ∂ (cid:96)∂u ,i ∂u ,j ] i,j ∂ (cid:96)∂u ,i ∂v s, ] i,s [ ∂ (cid:96)∂ ¯w ∂u ,i ] i, ¯w ∂ (cid:96)∂u ,i ∂v s, ] s,i ∂ (cid:96)∂v s, ∂v t, ] s,t ∂ (cid:96)∂ ¯w ∂v s, ] s, ¯w ( α − β )[ D ,si ] s,i ∂ (cid:96)∂ ¯w ∂u ,i ] ¯w ,i ∂ (cid:96)∂ ¯w ∂v s, ] ¯w ,s [ ∂ (cid:96)∂ ¯w ∂ ¯w (cid:48) ] ¯w , ¯w (cid:48) α − β )[ D ,si ] i,s αβ [ B i,j ] i,j ( α + β )[ D ,si ] i,s α + β )[ D ,si ] s,i0
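The computations above rest on the fact that the embedding $\gamma_\lambda$ leaves the network function unchanged: duplicating a hidden neuron with identical incoming weights and splitting its outgoing weights as $\lambda v$ and $(1-\lambda)v$ reproduces $\varphi$ exactly. A minimal numerical sketch of this invariance (the toy network, its sizes, and all variable names are illustrative, not from the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Toy network phi: one hidden layer of 3 sigmoid neurons, linear output, no biases.
U = rng.normal(size=(3, 4))   # incoming weights u_{p,i}
v = rng.normal(size=3)        # outgoing weights v_{s,p} (single output)

def phi(x):
    return v @ sigmoid(U @ x)

# gamma_lambda: duplicate hidden neuron 0 with identical incoming weights and
# outgoing weights split as lambda*v_0 and (1-lambda)*v_0.
def f_split(x, lam):
    U2 = np.vstack([U[0], U])                              # neuron "-1" copies neuron 0
    v2 = np.concatenate([[lam * v[0], (1 - lam) * v[0]], v[1:]])
    return v2 @ sigmoid(U2 @ x)

x = rng.normal(size=4)
for lam in [0.0, 0.3, 0.5, 1.0]:
    assert abs(phi(x) - f_split(x, lam)) < 1e-12
print("f(gamma_lambda(w)) == phi(w) for every lambda tested")
```

Because the two copies see the same pre-activation, their outgoing contributions sum back to $v_0\,\sigma(n_0)$ for any $\lambda$, which is why the loss is constant along the whole line of parameters $\gamma_\lambda$.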
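The scaling relations of part 4, e.g. $\partial^2 f/\partial u_{-1,i}\partial u_{q,j} = \lambda\cdot\partial^2\varphi/\partial u^\varphi_{0,i}\partial u^\varphi_{q,j}$ for $q \ge 1$, can be spot-checked with finite differences. A hedged sketch under toy assumptions (a single sigmoid output neuron plays the role of $h_{\bullet,l+1}$ so that mixed second derivatives across hidden neurons are nonzero; the loose tolerance absorbs the finite-difference error):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
U = rng.normal(size=(3, 4))   # incoming weights of the hidden layer of phi
v = rng.normal(size=3)        # weights into the sigmoid output neuron
x = rng.normal(size=4)
lam = 0.3

def net(Um, vm):
    # sigmoid output so that h_{.,l+1} is nonlinear
    return sigmoid(vm @ sigmoid(Um @ x))

# split network: hidden neuron 0 duplicated (rows 0 and 1), outgoing weights
# split as lam*v_0 and (1-lam)*v_0; rows 2,3 are phi's neurons 1,2.
U2 = np.vstack([U[0], U])
v2 = np.concatenate([[lam * v[0], (1 - lam) * v[0]], v[1:]])

def mixed_fd(fun, W, idx1, idx2, h=1e-4):
    # central finite difference for the mixed partial d^2 fun / dW[idx1] dW[idx2]
    out = 0.0
    for s1 in (+1, -1):
        for s2 in (+1, -1):
            Wp = W.copy()
            Wp[idx1] += s1 * h
            Wp[idx2] += s2 * h
            out += s1 * s2 * fun(Wp)
    return out / (4 * h * h)

i, j = 2, 3
# d^2 f / du_{-1,i} du_{q,j}, with q = phi-neuron 1 (row 2 of the split net)
d2_f = mixed_fd(lambda W: net(W, v2), U2, (0, i), (2, j))
# lam * d^2 phi / du_{0,i} du_{1,j}
d2_phi = mixed_fd(lambda W: net(W, v), U, (0, i), (1, j))
assert abs(d2_f - lam * d2_phi) < 1e-5 * max(1.0, abs(d2_phi))
print("second-derivative scaling by lambda verified numerically")
```

The same finite-difference harness can be pointed at any of the other entries in the table of part 4, e.g. the $(1-\lambda)$-scaled derivative with respect to $u_{0,i}$.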
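The cancellations behind the change of basis in part 6 are pure algebra in $\alpha$, $\beta$, $\lambda$ under the constraint $\lambda = \beta/(\alpha+\beta)$, so they can be checked exactly with rational arithmetic. A small sketch treating $A^{0,0}_{i,j}$, $B^{0}_{i,j}$ and $\mathcal{A}^{0,0}_{i,j}$ as free scalars:

```python
from fractions import Fraction as F
import random

random.seed(0)
rand = lambda: F(random.randint(1, 20), random.randint(1, 20))

for _ in range(100):
    alpha, beta = rand(), rand()
    lam = beta / (alpha + beta)          # equivalently alpha*lam - beta*(1-lam) == 0
    A, B, Ac = rand(), rand(), rand()    # stand-ins for A^{0,0}, B^0, script-A

    # second derivatives of L in the (u_{-1}, u_0) coordinates at gamma_lambda
    d_mm = lam**2 * A + lam * B + lam**2 * Ac          # d2L/du_{-1} du_{-1}
    d_pp = (1-lam)**2 * A + (1-lam) * B + (1-lam)**2 * Ac  # d2L/du_0 du_0
    d_mp = lam * (1-lam) * (A + Ac)                    # d2L/du_{-1} du_0

    # change of basis mu_{-1} = u_{-1} + u_0, mu_0 = alpha*u_{-1} - beta*u_0
    assert d_mm + 2*d_mp + d_pp == A + B + Ac                                  # mu_{-1}, mu_{-1}
    assert alpha**2*d_mm - 2*alpha*beta*d_mp + beta**2*d_pp == alpha*beta*B    # mu_0, mu_0
    assert alpha*d_mm + (alpha - beta)*d_mp - beta*d_pp == 0                   # mu_{-1}, mu_0
print("change-of-basis identities hold exactly")
```

Using `Fraction` rather than floats makes the three identities exact equalities, mirroring the symbolic computation in the text.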