Elimination of All Bad Local Minima in Deep Learning
Kenji Kawaguchi Leslie Pack Kaelbling
MIT MIT
Abstract
In this paper, we theoretically prove that adding one special neuron per output unit eliminates all suboptimal local minima of any deep neural network, for multi-class classification, binary classification, and regression with an arbitrary loss function, under practical assumptions. At every local minimum of any deep neural network with these added neurons, the set of parameters of the original neural network (without added neurons) is guaranteed to be a global minimum of the original neural network. The effects of the added neurons are proven to automatically vanish at every local minimum. Moreover, we provide a novel theoretical characterization of a failure mode of eliminating suboptimal local minima via an additional theorem and several examples. This paper also introduces a novel proof technique based on the perturbable gradient basis (PGB) necessary condition of local minima, which provides new insight into the elimination of local minima and is applicable to analyze various models and transformations of objective functions beyond the elimination of local minima.
Deep neural networks have achieved significant practical success in the fields of computer vision, machine learning, and artificial intelligence. However, theoretical understanding of deep neural networks is scarce relative to their empirical success. One of the major difficulties in theoretically understanding deep neural networks lies in the non-convexity and high-dimensionality of the objective functions used to train
the networks. Because of the non-convexity and high-dimensionality, it is often unclear whether a deep neural network will be guaranteed to have a desired property after training, instead of becoming stuck around an arbitrarily poor local minimum. Indeed, it is NP-hard to find a global minimum of a general non-convex function (Murty and Kabadi, 1987), and of non-convex objective functions used to train certain types of neural networks (Blum and Rivest, 1992). In the past, such theoretical concerns were considered one of the reasons to prefer classical machine learning models (with or without a kernel approach) that require only convex optimization. Given their recent empirical success, a question remains whether practical deep neural networks can be theoretically guaranteed to avoid poor local minima.

There have been numerous recent studies that have advanced theoretical understanding regarding the optimization of deep neural networks with significant over-parameterization (Nguyen and Hein, 2017, 2018; Allen-Zhu et al., 2018; Du et al., 2018; Zou et al., 2018) and model simplification (Choromanska et al., 2015; Kawaguchi, 2016; Hardt and Ma, 2017; Bartlett et al., 2019; Du and Hu, 2019). For shallow networks with a single hidden layer, there have been many positive results, yet often with strong assumptions, for example, requiring the use of significant over-parameterization, simplification, or Gaussian inputs (Andoni et al., 2014; Sedghi and Anandkumar, 2014; Soltanolkotabi, 2017; Brutzkus and Globerson, 2017; Ge et al., 2017; Soudry and Hoffer, 2017; Goel and Klivans, 2017; Zhong et al., 2017; Li and Yuan, 2017; Du and Lee, 2018).

Instead of using these strong assumptions, adding one neuron to a neural network with a single output unit was recently shown to be capable of eliminating all suboptimal local minima (i.e., all local minima that are not global minima) for binary classification with a special type of smoothed hinge loss function (Liang et al., 2018). However, the restriction to a neural network with a single output unit and a special loss function for binary classification makes it inapplicable to many common and important deep learning problems. Both removing the restriction to networks with a single output unit and removing restrictions on loss functions are seen as important open problems in different but related theoretical work on local minima of neural networks (Shamir, 2018; Laurent and Brecht, 2018).

In this paper, we state and prove a novel and significantly stronger theorem: adding one neuron per output unit can eliminate all suboptimal local minima of any deep neural network with an arbitrary loss function for multi-class classification, binary classification, and regression.
This paper also introduces a novel proof technique based on the perturbable gradient basis (PGB) condition, which provides new insight into the elimination of local minima and can be used to analyze new models and transformations of objective functions beyond the elimination of local minima. This paper analyzes the problem in the regime without significant over-parameterization or model simplification (except adding the few extra neurons).

While analyzing the properties of local minima in this regime with no strong assumption is a potentially important step in theory, it does not immediately guarantee the efficient solution of general neural network optimization. We explain this phenomenon in another key contribution of this paper, which is a novel characterization of a failure mode of the removal of bad local optima, in terms of its effect on gradient-based optimization methods.

The optimization problem for the elimination of local minima is defined in Section 2.1. Our theoretical results on the elimination of local minima are presented in Section 2.2 for arbitrary datasets, and in Section 2.3 for realizable datasets. We discuss these results in terms of non-vacuousness, consistency, and the effect of over-parameterization in Section 2.4.
Let $x \in \mathcal{X} \subseteq \mathbb{R}^{d_x}$ and $y \in \mathcal{Y} \subseteq \mathbb{R}^{d_y}$ be an input vector and a target vector, respectively. Define $((x_i, y_i))_{i=1}^{m}$ as a training dataset of size $m$. Given an input $x$ and parameter $\theta$, let $f(x; \theta) \in \mathbb{R}^{d_y}$ be the pre-activation output of the last layer of any arbitrary deep neural network with any structure (e.g., any convolutional neural network with any depth and any width, with or without skip connections). That is, there is no assumption with regard to $f$ except that $f(x; \theta) \in \mathbb{R}^{d_y}$. We consider the following standard objective function $L$ to train an arbitrary neural network $f$:

$$L(\theta) = \frac{1}{m} \sum_{i=1}^{m} \ell(f(x_i; \theta), y_i),$$

where $\ell: \mathbb{R}^{d_y} \times \mathcal{Y} \to \mathbb{R}$ is an arbitrary loss criterion such as cross entropy loss, smoothed hinge loss, or squared loss. We then define an auxiliary objective function $\tilde{L}$:

$$\tilde{L}(\tilde{\theta}) = \frac{1}{m} \sum_{i=1}^{m} \ell(f(x_i; \theta) + g(x_i; a, b, W), y_i) + \lambda \|a\|_2^2,$$

where $\lambda > 0$, $\tilde{\theta} = (\theta, a, b, W)$, $a, b \in \mathbb{R}^{d_y}$, $W = [w_1\ w_2\ \cdots\ w_{d_y}] \in \mathbb{R}^{d_x \times d_y}$ with $w_k \in \mathbb{R}^{d_x}$, and $g(x; a, b, W)_k = a_k \exp(w_k^\top x + b_k)$ for all $k \in \{1, \dots, d_y\}$.

We also define a modified neural network $\tilde{f}$ as

$$\tilde{f}(x; \tilde{\theta}) = f(x; \theta) + g(x; a, b, W),$$

which is equivalent to adding one neuron $g(x; a, b, W)_k$ per each output unit $f(x; \theta)_k$ of the original neural network. Because $\tilde{L}(\tilde{\theta}) = \frac{1}{m} \sum_{i=1}^{m} \ell(\tilde{f}(x_i; \tilde{\theta}), y_i) + \lambda \|a\|_2^2$, the auxiliary objective function $\tilde{L}$ is the standard objective function $L$ with the modified neural network $\tilde{f}$ and a regularizer on $a$.

Under only a mild assumption (Assumption 1), Theorem 1 states that at every local minimum $(\theta, a, b, W)$ of the modified objective function $\tilde{L}$, the parameter vector $\theta$ achieves a global minimum of the original objective function $L$, and the modified neural network $\tilde{f}$ automatically becomes the original neural network $f$. The proof of Theorem 1 is presented in Section 4 and Appendix B.

Assumption 1. (Use of common loss criteria)
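For concreteness, the following is a minimal PyTorch-style sketch of the modified network $\tilde{f} = f + g$ and the auxiliary objective $\tilde{L}$. It is our own illustration (the class and function names are not from the paper), and it assumes the supplied loss function returns per-example values.

```python
import torch
import torch.nn as nn

class AddedNeuronWrapper(nn.Module):
    """Wraps an arbitrary network f and adds one neuron per output unit:
    g(x; a, b, W)_k = a_k * exp(w_k^T x + b_k), so f_tilde(x) = f(x) + g(x)."""
    def __init__(self, f: nn.Module, d_x: int, d_y: int):
        super().__init__()
        self.f = f
        self.a = nn.Parameter(torch.zeros(d_y))       # a in R^{d_y}
        self.b = nn.Parameter(torch.zeros(d_y))       # b in R^{d_y}
        self.W = nn.Parameter(torch.zeros(d_x, d_y))  # columns w_1, ..., w_{d_y}

    def g(self, x):                                   # x: (batch, d_x)
        return self.a * torch.exp(x @ self.W + self.b)

    def forward(self, x):                             # f_tilde(x; theta, a, b, W)
        return self.f(x) + self.g(x)

def auxiliary_objective(model, per_example_loss, x, y, lam):
    """L_tilde = (1/m) sum_i loss(f_tilde(x_i), y_i) + lam * ||a||_2^2."""
    return per_example_loss(model(x), y).mean() + lam * (model.a ** 2).sum()
```

At any local minimum of $\tilde{L}$, Theorem 1 below guarantees that the added parameters satisfy $a = 0$, so the wrapper reduces to the original network $f$.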
For any $i \in \{1, \dots, m\}$, the function $\ell_{y_i}: q \mapsto \ell(q, y_i) \in \mathbb{R}_{\geq 0}$ is differentiable and convex (e.g., the squared loss, cross entropy loss, and polynomial hinge loss satisfy this assumption).

Theorem 1.
Let Assumption 1 hold. Then, at every local minimum $(\theta, a, b, W)$ of $\tilde{L}$, the following statements hold:

(i) $\theta$ is a global minimum of $L$, and

(ii) $\tilde{f}(x; \theta, a, b, W) = f(x; \theta)$ for all $x \in \mathbb{R}^{d_x}$, and $\tilde{L}(\theta, a, b, W) = L(\theta)$.

Assumption 1 is satisfied by simply using a common loss criterion, including the squared loss $\ell(q, y) = \|q - y\|_2^2$ or $\ell(q, y) = (1 - yq)^2$ (the latter with $d_y = 1$), cross entropy loss $\ell(q, y) = -\sum_{k=1}^{d_y} y_k \log \frac{\exp(q_k)}{\sum_{k'} \exp(q_{k'})}$, or smoothed hinge loss $\ell(q, y) = (\max\{0, 1 - yq\})^p$ with $p \geq 2$ (again with $d_y = 1$); these criteria are also written out in the short code sketch below. Although the objective function $L: \theta \mapsto L(\theta)$ used to train a neural network is non-convex in $\theta$, the loss criterion $\ell_{y_i}: q \mapsto \ell(q, y_i)$ is usually convex in $q$. Therefore, Theorem 1 is directly applicable to most common deep learning tasks in practice.

Theorem 2 makes a statement similar to Theorem 1 under a weaker assumption on the loss criterion (Assumption 2) but with an additional assumption on the training dataset (Assumption 3). The proof of Theorem 2 is presented in Appendix B.
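The loss criteria discussed above, written as functions of the prediction $q$ so that their convexity and differentiability in $q$ (Assumption 1) is explicit. This is our own sketch: $y$ is a one-hot vector for cross entropy, $y \in \{-1, +1\}$ for the two scalar losses, and the exponent $p$ of the smoothed hinge loss is an arbitrary default.

```python
import torch

def squared_loss(q, y):
    # l(q, y) = ||q - y||_2^2
    return ((q - y) ** 2).sum(dim=-1)

def cross_entropy_loss(q, y_onehot):
    # l(q, y) = -sum_k y_k * log( exp(q_k) / sum_{k'} exp(q_{k'}) )
    log_softmax = q - torch.logsumexp(q, dim=-1, keepdim=True)
    return -(y_onehot * log_softmax).sum(dim=-1)

def smoothed_hinge_loss(q, y, p=3):
    # l(q, y) = (max{0, 1 - y*q})^p with d_y = 1
    return torch.clamp(1.0 - y * q, min=0.0) ** p
```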
Assumption 2. (On the loss)
For any $i \in \{1, \dots, m\}$, the function $\ell_{y_i}: q \mapsto \ell(q, y_i)$ is differentiable, and $q \in \mathbb{R}^{d_y}$ is a global minimum of $\ell_{y_i}$ if $\nabla \ell_{y_i}(q) = 0$.

Assumption 3. (On the label consistency)
There exists a function $f^*$ such that $f^*(x_i) = y_i$ for all $i \in \{1, \dots, m\}$.

Theorem 2.
Let Assumptions 2 and 3 hold. Then, at every local minimum $(\theta, a, b, W)$ of $\tilde{L}$, the following statements hold:

(i) $\theta$ is a global minimum of $L$,

(ii) $\tilde{f}(x; \theta, a, b, W) = f(x; \theta)$ for all $x \in \mathbb{R}^{d_x}$, and $\tilde{L}(\theta, a, b, W) = L(\theta)$, and

(iii) $f(x_i; \theta)$ is a global minimum of $\ell_{y_i}: q \mapsto \ell(q, y_i)$ for all $i \in \{1, \dots, m\}$.

Assumption 2 is weaker than Assumption 1 in the sense that the former is implied by the latter but not vice versa. However, as discussed above, Assumption 1 already accommodates most common loss criteria. Assumption 3 is automatically satisfied if a target $y$ given an input $x$ is not random, but the non-randomness is not necessary to satisfy Assumption 3. Even if the targets are generated at random, as long as all $x_1, x_2, \dots, x_m$ are distinct (i.e., $x_i \neq x_j$ for all $i \neq j$), Assumption 3 is satisfied.

Therefore, although Theorem 2 might be less applicable in practice when compared to Theorem 1, Theorem 2 can still be applied to many common deep learning tasks with the additional guarantee stated in Theorem 2 (iii). By using an appropriate loss criterion for classification, Theorem 2 (iii) implies that the trained neural network $f(\cdot; \theta)$ at every local minimum correctly classifies all training data points.

Theorems 1 and 2 are both non-vacuous and consistent with pathological cases. For the consistency, Theorems 1 and 2 (vacuously) hold true if there is no local minimum of $\tilde{L}$, for example, with a pathological case of $\ell(q, y_i) = q - y_i$.

For non-vacuousness, there exists a local minimum of $\tilde{L}$ if there exists a global minimum $\theta$ of $L$ such that $f(x_i; \theta)$ achieves a global minimum of each $f(x_i; \theta) \mapsto \ell(f(x_i; \theta), y_i)$ for $i \in \{1, \dots, m\}$ (this is because, given such a $\theta$, any point with $a = 0$ is a local minimum of $\tilde{L}$). Therefore, the existence of a local minimum of $\tilde{L}$ can be guaranteed by a weak degree of over-parameterization that ensures the existence of a global minimum $\theta$ of each $f(x_i; \theta) \mapsto \ell(f(x_i; \theta), y_i)$ only for a given training dataset (rather than for all datasets). This is in contrast to previous papers that require significant over-parameterization to ensure interpolation of all datasets and to make the corresponding neural tangent kernel approximately unchanged during training (Nguyen and Hein, 2017, 2018; Allen-Zhu et al., 2018; Du et al., 2018; Zou et al., 2018). Our paper does not require these conditions and allows the neural tangent kernel to change significantly during training. Because of this difference, our paper only needs $\tilde{\Omega}(1)$ parameters, whereas state-of-the-art previous papers require numbers of parameters that grow with the depth $H$ and the sample size $n$ (Zou and Gu, 2019) or with $n$ (Kawaguchi and Huang, 2019).

Because a local minimum does not need to be a strict local minimum (i.e., a local minimum with a strictly smaller value than others in a neighborhood), there are many other cases where there exists a local minimum of $\tilde{L}$: e.g., Example 5 in Section 6.2 also illustrates a situation where there exists a local minimum of $\tilde{L}$ without the above condition of the existence of a sample-wise global minimum of $L$ or weak over-parameterization.
In this section, we introduce a more general result beyond elimination of local minima. Namely, we prove the perturbable gradient basis (PGB) necessary condition of local minima, which directly implies the elimination result as a special case. Beyond the specific transformation of the objective function $\tilde{L}$, the PGB necessary condition of local minima can be applied to other objective functions with various transformations and models.

Theorem 3. (PGB necessary condition of local minima). Define the objective function $Q$ by

$$Q(z) = \frac{1}{m} \sum_{i=1}^{m} \left[ Q_i(\phi_i(z)) + R_i(\varphi_i(z)) \right] \quad (1)$$

where for all $i \in \{1, \dots, m\}$, the functions $Q_i: q \in \mathbb{R}^{d_\phi} \mapsto Q_i(q) \in \mathbb{R}_{\geq 0}$ and $R_i: q \in \mathbb{R}^{d_\varphi} \mapsto R_i(q) \in \mathbb{R}_{\geq 0}$ are differentiable and convex, and $\phi_i$ and $\varphi_i$ are differentiable. Assume that there exists a function $h: \mathbb{R}^{d_z} \to \mathbb{R}^{d_z}$ and a real number $\rho \neq 0$ such that for all $i \in \{1, \dots, m\}$ and all $z \in \mathbb{R}^{d_z}$, $\phi_i(z) = \sum_{k=1}^{d_z} h(z)_k \partial_k \phi_i(z)$ and $\varphi_i(z) = \rho \sum_{k=1}^{d_z} h(z)_k \partial_k \varphi_i(z)$. Then, for any local minimum $z \in \mathbb{R}^{d_z}$ of $Q$, the following holds: there exists $\epsilon_1 > 0$ such that for any $\epsilon \in [0, \epsilon_1)$,

$$Q(z) \leq \inf_{S \subseteq_{\mathrm{fin}} V[z, \epsilon],\ \alpha \in \mathbb{R}^{d_z \times |S|}} \tilde{Q}_{\epsilon, z}(\alpha, S) + \frac{\rho - 1}{\rho m} \sum_{i=1}^{m} \partial R_i(\varphi_i(z))\, \varphi_i(z),$$

where

$$\tilde{Q}_{\epsilon, z}(\alpha, S) = \frac{1}{m} \sum_{i=1}^{m} \left[ Q_i(\phi_i^z(\alpha, \epsilon, S)) + R_i(\varphi_i^z(\alpha, \epsilon, S)) \right],$$

$$\phi_i^z(\alpha, \epsilon, S) = \sum_{k=1}^{d_z} \sum_{j=1}^{|S|} \alpha_{k,j}\, \partial_k \phi_i(z + \epsilon S_j), \quad \text{and} \quad \varphi_i^z(\alpha, \epsilon, S) = \sum_{k=1}^{d_z} \sum_{j=1}^{|S|} \alpha_{k,j}\, \partial_k \varphi_i(z + \epsilon S_j).$$

Here, $S \subseteq_{\mathrm{fin}} S'$ denotes a finite subset $S$ of a set $S'$, and $V[z, \epsilon]$ is the set of all vectors $v \in \mathbb{R}^{d_z}$ such that $\|v\|_2 \leq 1$, $\phi_i(z + \epsilon v) = \phi_i(z)$, and $\varphi_i(z + \epsilon v) = \varphi_i(z)$ for all $i \in \{1, \dots, m\}$. Furthermore, if $\rho = 1$, this statement holds with equality as $Q(z) = \inf_{S \subseteq_{\mathrm{fin}} V[z, \epsilon],\ \alpha \in \mathbb{R}^{d_z \times |S|}} \tilde{Q}_{\epsilon, z}(\alpha, S)$.

The PGB necessary condition of local minima states that whether a given objective $Q$ is non-convex or convex, if $z$ is a local minimum of the original (potentially non-convex) objective $Q$, then $z$ must achieve the global minimum value of the transformed objective $\tilde{Q}_{\epsilon, z}(\alpha, S)$ for any sufficiently small $\epsilon$. Here, the transformed objective $\tilde{Q}_{\epsilon, z}$ is the original objective $Q$ except that the original functions $\phi_i(z)$ and $\varphi_i(z)$ are replaced by the perturbable gradient basis (PGB) functions $\phi_i^z(\alpha, \epsilon, S)$ and $\varphi_i^z(\alpha, \epsilon, S)$. In other words, all local minima with the original functions $\phi_i(z)$ and $\varphi_i(z)$ achieve the global minimum values of the PGB functions $\phi_i^z(\alpha, \epsilon, S)$ and $\varphi_i^z(\alpha, \epsilon, S)$.

Here, the original objective $Q$ is a non-convex function in general because both $\phi_i$ and $\varphi_i$ can be non-convex functions. If $\phi_i$ and $\varphi_i$ are both linear functions, then we have $\phi_i(\alpha) = \phi_i^z(\alpha, 0, \{0\})$ and $\varphi_i(\alpha) = \varphi_i^z(\alpha, 0, \{0\})$ (for all $\alpha$ and $z$), and hence the PGB necessary condition of local minima recovers the following known statement: every local minimum of the original $Q$ is a global minimum of the original $Q$.

The PGB condition of local minima can also be understood based on the following geometric interpretation. For the geometric interpretation, we consider two spaces, the parameter space $\mathbb{R}^{d_z}$ and the output space $\mathbb{R}^{m(d_\phi + d_\varphi)}$, and the map from the parameter space to the output space, which is defined by $\Phi: z \in \mathbb{R}^{d_z} \mapsto (\phi_1(z)^\top, \varphi_1(z)^\top, \dots, \phi_m(z)^\top, \varphi_m(z)^\top)^\top \in \mathbb{R}^{m(d_\phi + d_\varphi)}$.
Then, in the output space, we can intuitively consider the "tangent" space $T_{\Phi(z)} = \mathrm{span}(\{\partial_1 \Phi(z), \dots, \partial_{d_z} \Phi(z)\}) + \{\Phi(z)\}$, where the sum of the two sets represents the Minkowski sum of the sets. Then, given an $\epsilon$ ($\leq \epsilon_1$), the span of the set of all vectors of the "tangent" spaces $T_{\Phi(z + \epsilon v)}$ at all perturbed points $z + \epsilon v$, defined by $\tilde{T}_{\Phi(z + \epsilon v)} = \mathrm{span}(\{f \in \mathbb{R}^{m(d_\phi + d_\varphi)}: (\exists v \in V[z, \epsilon])[f \in T_{\Phi(z + \epsilon v)}]\})$, is exactly equal to the space of the outputs of the PGB functions. Therefore, from the geometric viewpoint, the PGB necessary condition of local minima states that the output $\Phi(z)$ at any local minimum $z$ is globally optimal in the span of the "tangent" spaces $\tilde{T}_{\Phi(z)}$. The PGB condition of local minima translates the local optimality in the parameter space $\mathbb{R}^{d_z}$ into the global optimality in the span of the "tangent" spaces in the output space $\mathbb{R}^{m(d_\phi + d_\varphi)}$. (In special cases, e.g., when $\mathrm{rank}(\partial \Phi(z))$ is constant in a neighborhood of $z$, $T_{\Phi(z)}$ is indeed a tangent space of a local manifold embedded in the output space. In general, $T_{\Phi(z)}$ and $\tilde{T}_{\Phi(z)}$ are affine subspaces of the output space $\mathbb{R}^{m(d_\phi + d_\varphi)}$.)

The PGB necessary condition of local minima is an extension of theorem 2 in a previous study (Kawaguchi et al., 2019) to the problem with a regularization term and a general transformation of the objective function. Accordingly, beyond the elimination of local minima, the PGB necessary condition of local minima entails the previously proven statements of no bad local minima for deep linear neural networks (Laurent and Brecht, 2018) and deep nonlinear residual neural networks (Kawaguchi and Bengio, 2019) (since the PGB necessary condition of local minima is strictly more general than theorem 2 by Kawaguchi et al. 2019, which was shown to entail those statements). With our extension, the PGB necessary condition of local minima can now be used to study the effects of various transformations of objective functions and regularization terms, including the elimination of local minima, as shown in the next section.

In this section, we introduce a novel and concise proof of Theorem 1 based on the PGB necessary condition of local minima, which shows that all suboptimal local minima can be eliminated because the global minimum value of the PGB model is indeed the global minimum value of the original model. From the geometric viewpoint, this is because the PGB model is shown to be expressive in that the span of the "tangent" spaces $T_{\Phi(z + \epsilon v)}$ contains the output space $\mathbb{R}^{m d_\phi} \times \{0\}$ where $0 \in \mathbb{R}^{m d_\varphi}$. In Appendix B, we also provide an alternative proof of Theorem 1 without the PGB necessary condition, which is intended to be more detailed with elementary facts but less elegant than the following proof via the PGB necessary condition.

Proof of Theorem 1.
Let $\theta$ be fixed. Let $(a, b, W)$ be a local minimum of $\tilde{L}|_\theta(a, b, W) := \tilde{L}(\theta, a, b, W)$. Let $\bar{\theta} \in \mathbb{R}^{d_{\bar{\theta}}}$ be the vector containing $(a, b, W)$, defined by $\bar{\theta} = (a^\top, b^\top, \mathrm{vec}(W)^\top)^\top$. We apply the PGB necessary condition of local minima by setting $Q(\bar{\theta}) = \tilde{L}|_\theta(a, b, W)$ with $\phi_i(\bar{\theta}) = g(x_i; a, b, W)$, $\varphi_i(\bar{\theta}) = (a_1^2, \dots, a_{d_y}^2)^\top$, $Q_i(q) = \ell(f(x_i; \theta) + q, y_i)$, and $R_i(q) = \lambda \sum_{j=1}^{d_y} q_j$. Then, for all $i \in \{1, \dots, m\}$, the functions $Q_i$ and $R_i$ are differentiable and convex, and the functions $\phi_i$ and $\varphi_i$ are differentiable. Furthermore, we can rewrite $\phi_i(\bar{\theta}) = \sum_{k=1}^{d_y} a_k \partial_{a_k} \phi_i(\bar{\theta})$ and $\varphi_i(\bar{\theta}) = \rho \sum_{k=1}^{d_y} a_k \partial_{a_k} \varphi_i(\bar{\theta})$ with $\rho = 1/2$ for all $i \in \{1, \dots, m\}$ and all $\bar{\theta} \in \mathbb{R}^{d_{\bar{\theta}}}$. These satisfy the assumptions of the PGB necessary condition of local minima of $Q(\bar{\theta}) = \tilde{L}|_\theta(a, b, W)$.

From the PGB necessary condition of local minima of $Q(\bar{\theta}) = \tilde{L}|_\theta(a, b, W)$, there exists $\epsilon_1 > 0$ such that for any $\epsilon \in [0, \epsilon_1)$,

$$Q(\bar{\theta}) - \frac{\rho - 1}{\rho m} \sum_{i=1}^{m} \partial R_i(\varphi_i(\bar{\theta}))\, \varphi_i(\bar{\theta}) \leq \inf_{S \subseteq_{\mathrm{fin}} V[\bar{\theta}, \epsilon],\ \alpha \in \mathbb{R}^{d_{\bar{\theta}} \times |S|}} \tilde{Q}_{\epsilon, \bar{\theta}}(\alpha, S) \leq \inf_{S \subseteq_{\mathrm{fin}} \bar{V}[\bar{\theta}, \epsilon],\ \alpha \in \mathbb{R}^{d_{\bar{\theta}} \times |S|}} \tilde{Q}_{\epsilon, \bar{\theta}}(\alpha, S), \quad (2)$$

where $V[\bar{\theta}, \epsilon]$ is the set of all vectors $v \in \mathbb{R}^{d_{\bar{\theta}}}$ such that $\|v\|_2 \leq 1$, $\phi_i(\bar{\theta} + \epsilon v) = \phi_i(\bar{\theta})$, and $\varphi_i(\bar{\theta} + \epsilon v) = \varphi_i(\bar{\theta})$ for all $i \in \{1, \dots, m\}$. Here, the subset $\bar{V}[\bar{\theta}, \epsilon] \subset V[\bar{\theta}, \epsilon]$ is defined by $\bar{V}[\bar{\theta}, \epsilon] = \{(a^\top, b^\top, \mathrm{vec}(W)^\top)^\top : (a^\top, b^\top, \mathrm{vec}(W)^\top)^\top \in V[\bar{\theta}, \epsilon],\ a = 0\}$.

Since $(a, b, W)$ is a local minimum and hence the partial derivatives with respect to $(a, b)$ are zeros, we have that for all $k \in \{1, 2, \dots, d_y\}$,

$$a_k \frac{\partial \tilde{L}(\theta, a, b, W)}{\partial a_k} = \frac{1}{m} \sum_{i=1}^{m} (\nabla \ell_{y_i}(f(x_i; \theta) + g(x_i; a, b, W)))_k\, a_k \exp(w_k^\top x_i + b_k) + 2\lambda a_k^2 = \frac{\partial \tilde{L}(\theta, a, b, W)}{\partial b_k} + 2\lambda a_k^2 = 2\lambda a_k^2 = 0,$$

which implies that $a_k = 0$ for all $k \in \{1, 2, \dots, d_y\}$, since $2\lambda \neq 0$. Since $a = 0$, this proves statement (ii), and we have $L(\theta) = \tilde{L}|_\theta(a, b, W)$, $(\partial_{a_k} \varphi_i(\bar{\theta} + \epsilon v))_k = 2 a_k = 0$ for all $v \in \bar{V}[\bar{\theta}, \epsilon]$, and $\frac{\rho - 1}{\rho m} \sum_{i=1}^{m} \partial R_i(\varphi_i(\bar{\theta}))\, \varphi_i(\bar{\theta}) = 0$. Since $L(\theta) = \tilde{L}|_\theta(a, b, W) = Q(\bar{\theta}) = Q(\bar{\theta}) - \frac{\rho - 1}{\rho m} \sum_{i=1}^{m} \partial R_i(\varphi_i(\bar{\theta}))\, \varphi_i(\bar{\theta})$ with (2) and $\partial_{a_k} \varphi_i(\bar{\theta} + \epsilon v) = 0$ (for all $v \in \bar{V}[\bar{\theta}, \epsilon]$), we have that for any $\theta'$,

$$L(\theta) - L(\theta') \leq \inf_{S \subseteq_{\mathrm{fin}} \bar{V}[\bar{\theta}, \epsilon],\ \alpha \in \mathbb{R}^{d_{\bar{\theta}} \times |S|}} \tilde{Q}_{\epsilon, \bar{\theta}}(\alpha, S) - L(\theta') = \inf_{S \subseteq_{\mathrm{fin}} \bar{V}[\bar{\theta}, \epsilon],\ \alpha \in \mathbb{R}^{d_{\bar{\theta}} \times |S|}} \frac{1}{m} \sum_{i=1}^{m} \left[ Q_i(\phi_i^{\bar{\theta}}(\alpha, \epsilon, S)) - \ell(f(x_i; \theta'), y_i) \right] \leq 0, \quad (3)$$

where the last inequality is to be shown to hold in the following.

Since $\phi_i^{\bar{\theta}}(\alpha, \epsilon, S)$ can differ for different indices $i$ only through different inputs $x_i$, we can rewrite $\phi_{x_i}^{\bar{\theta}}(\alpha, \epsilon, S) = \phi_i^{\bar{\theta}}(\alpha, \epsilon, S)$. We then rearrange the sum in the last line of (3):

$$\frac{1}{m} \sum_{i=1}^{m} \left[ Q_i(\phi_{x_i}^{\bar{\theta}}(\alpha, \epsilon, S)) - \ell(f(x_i; \theta'), y_i) \right] = \frac{1}{m} \sum_{j=1}^{m'} \sum_{i \in \mathcal{I}_j} \left[ Q_i(\phi_{\bar{x}_j}^{\bar{\theta}}(\alpha, \epsilon, S)) - \ell(f(\bar{x}_j; \theta'), y_i) \right],$$

where $\{\mathcal{I}_1, \dots, \mathcal{I}_{m'}\}$ is a partition of the set $\{1, \dots, m\}$ such that for any $i \in \mathcal{I}_j$ and $i' \in \mathcal{I}_{j'}$, $x_i = x_{i'}$ if $j = j'$, and $x_i \neq x_{i'}$ if $j \neq j'$ (that is, $\mathcal{I}_1 \cup \cdots \cup \mathcal{I}_{m'} = \{1, \dots, m\}$, $\mathcal{I}_j \cap \mathcal{I}_{j'} = \emptyset$ for all $j \neq j'$, and $\mathcal{I}_j \neq \emptyset$ for all $j \in \{1, \dots, m'\}$). Here, we write $\bar{x}_j := x_i$ with a representative $i \in \mathcal{I}_j$.

Let $S_t \in \mathbb{R}^{d_{\bar{\theta}}}$ be the vector containing $(\hat{a}^{(t)}, \hat{b}^{(t)}, \hat{W}^{(t)})$. Since $a = 0$, we have $\phi_i(\bar{\theta} + \epsilon S_t) = \phi_i(\bar{\theta})$ for all vectors $S_t$ containing any $(\hat{a}^{(t)}, \hat{b}^{(t)}, \hat{W}^{(t)})$ with $\hat{a}^{(t)} = 0$. In other words, for any finite $|S|$ and any $((\hat{b}^{(t)}, \hat{W}^{(t)}))_{t=1}^{|S|}$ with $\|S_t\|_2 \leq 1$, there exists $S \subseteq_{\mathrm{fin}} \bar{V}[\bar{\theta}, \epsilon]$ such that $\phi_{\bar{x}_j}^{\bar{\theta}}(\alpha, \epsilon, S)_k = \exp(w_k^\top \bar{x}_j + b_k) \sum_{t=1}^{|S|} \alpha_k^{(t)} \exp(\epsilon ((\hat{w}_k^{(t)})^\top \bar{x}_j + \hat{b}_k^{(t)}))$ (by letting $\hat{a}^{(t)} = 0$ for all $t$). Therefore, given the $m'$ distinct input points $\bar{x}_1, \dots, \bar{x}_{m'}$, fix $((\hat{b}^{(t)}, \hat{W}^{(t)}))_{t=1}^{|S|}$ such that the rank of the matrix $M \in \mathbb{R}^{m' \times |S|}$ with entries $M_{j,t} = \exp((\hat{w}_k^{(t)})^\top \bar{x}_j + \hat{b}_k^{(t)})$ is $m'$ with a sufficiently large finite $|S|$. Then, by letting $M(\epsilon) \in \mathbb{R}^{m' \times |S|}$ be the matrix with entries $M_{j,t}(\epsilon) = \exp(\epsilon ((\hat{w}_k^{(t)})^\top \bar{x}_j + \hat{b}_k^{(t)}))$, the function $\psi(\epsilon) = \det(M(\epsilon) M(\epsilon)^\top)$ is an analytic function of $\epsilon$. Since $\psi$ is analytic and $\psi(1) \neq 0$, either the zeros of $\psi$ are isolated or $\psi(\epsilon) \neq 0$ for all $\epsilon$. In both cases, there exists $\epsilon \in [0, \epsilon_1)$ such that $\psi(\epsilon) \neq 0$. Fix an $\epsilon \in [0, \epsilon_1)$ with $\psi(\epsilon) \neq 0$. Then, since $M(\epsilon)$ has rank $m'$, for any $\theta'$, there exists $\alpha$ such that for all $j \in \{1, \dots, m'\}$,

$$\phi_{\bar{x}_j}^{\bar{\theta}}(\alpha, \epsilon, S) = f(\bar{x}_j; \theta') - f(\bar{x}_j; \theta).$$

Therefore, the last inequality in (3) holds, and hence for any $\theta'$, $L(\theta) \leq L(\theta')$, which proves statement (i).

Our theoretical results in the previous sections have shown that, for a wide range of deep learning tasks, all suboptimal local minima can be removed by adding one neuron per output unit. This might be surprising given the fact that dealing with suboptimal local minima in general is known to be challenging in theory. However, for the worst case scenario, the following theorem illuminates a novel failure mode for the elimination of suboptimal local minima. Theorem 4 holds true for the previous results (Liang et al., 2018) (as we discuss further in Section 7) and hence provides a novel failure mode for elimination of local minima in general, apart from our Theorems 1 and 2. Our result in this paper is the first result that points out this type of failure mode for elimination of local minima. The proof of Theorem 4 is presented in Appendix C.
Theorem 4.
Let Assumption 1 hold, or let Assumptions 2 and 3 hold. Then, for any $\theta$, if $\theta$ is not a global minimum of $L$, there is no local minimum $(a, b, W) \in \mathbb{R}^{d_y} \times \mathbb{R}^{d_y} \times \mathbb{R}^{d_x \times d_y}$ of $\tilde{L}|_\theta(a, b, W) := \tilde{L}(\theta, a, b, W)$. Furthermore, there exists a tuple $(\ell, f, \{(x_i, y_i)\}_{i=1}^{m})$ and a suboptimal stationary point $\theta$ of $L$ such that $\frac{\partial \tilde{L}(\theta, a, b, W)}{\partial \theta} = 0$ for all $(a, b, W) \in \mathbb{R}^{d_y} \times \mathbb{R}^{d_y} \times \mathbb{R}^{d_x \times d_y}$.

Therefore, on the one hand, Theorems 1 and 2 state that if an algorithm can find a local minimum of $\tilde{L}$, then it can find a global minimum of $L$ (via a local minimum of $\tilde{L}$). On the other hand, Theorem 4 suggests that if an algorithm moves toward a local minimum of $\tilde{L}$ by simply following (negative) gradient directions, then either it moves toward a global minimum $\theta$ of $L$ or the norm of $(a, b, W)$ approaches infinity. By monitoring the norm of $(a, b, W)$, we can detect this failure mode; a minimal monitoring sketch is given before Example 1 below. This suggests a hybrid approach of local and global optimization algorithms beyond a pure local gradient-based method, with a mechanism to monitor the increase in the norm of $(a, b, W)$.

The previous sections show that one can eliminate suboptimal local minima, and there remains a detectable failure mode for gradient-based optimization methods after elimination. In this section, we provide numerical and analytical examples to illustrate these phenomena. Through these examples, we show that using $\tilde{L}$ instead of $L$ can help training deep neural networks in 'good' cases while it may not help in 'bad' cases.

Figure 1 illustrates the novel failure mode proven by Theorem 4. Note that, although Figure 1 only displays $\tilde{L}$ in bounded subspaces, there is no local minimum of $\tilde{L}|_\theta(a, b, W)$ with a fixed suboptimal $\theta$. The setting used for plotting Figure 1 is summarized in Example 1, where the dataset consists of only one sample $(x_1, y_1)$.
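Before turning to the examples, the following minimal sketch illustrates the hybrid strategy suggested above: run gradient descent on $\tilde{L}$ while monitoring the norm of $(a, b, W)$, and flag the failure mode when that norm keeps growing. This is our own illustration; the optimizer, learning rate, and divergence threshold are arbitrary choices, and AddedNeuronWrapper refers to the earlier sketch.

```python
import torch

def descend_and_monitor(model, per_example_loss, x, y, lam=1e-3, lr=1e-2,
                        steps=10_000, norm_threshold=1e3):
    """Gradient descent on L_tilde while monitoring ||(a, b, W)||_2."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    added_norm = 0.0
    for step in range(steps):
        opt.zero_grad()
        loss = per_example_loss(model(x), y).mean() + lam * (model.a ** 2).sum()
        loss.backward()
        opt.step()
        added_norm = torch.cat([model.a.flatten(), model.b.flatten(),
                                model.W.flatten()]).norm().item()
        if added_norm > norm_threshold:
            # Theorem 4: with a fixed suboptimal theta, L_tilde|_theta has no
            # local minimum, so the added parameters can only diverge; switch
            # to (or restart with) a more global search at this point.
            return step, added_norm, "failure mode detected"
    return steps, added_norm, "no divergence detected"
```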
Let m = 1, d y = 1, x = 0, and y = − L ( θ ) = (cid:96) ( f ( x ; θ ) , y ) = (max(0 , − y f ( x ; θ )) . Let f ( x ; θ ) = 5( − . e − ∗ ( θ − . − . e − ∗ ( θ − . + 0 .
5) for a simple illustration. Be-cause x = 0, we can think of this function as a modelwith an extra parameter θ (cid:48) , the effect of which dis-appears as θ (cid:48) x = 0 (e.g., f ( x ; θ ) = ¯ f ( x ; θ, θ (cid:48) ) =5( − . e − ∗ ( θ (cid:48) x + θ − . − . e − ∗ ( θ (cid:48) x + θ − . + 0 . S ⊆ R d requires a lower semi-continuity of the objective function ˜ L and the existenceof a q ∈ S for which the set { q (cid:48) ∈ S : ˜ L ( q (cid:48) ) ≤ ˜ L ( q ) } is compact (e.g., see Bertsekas 1999 for more discus-sion on the existence of optimal solutions). In theabove example, for the function ˜ L | θ ( a, b, W ) with afixed suboptimal θ , the former condition of lower semi-continuity is satisfied, whereas the latter condition ofcompactness is not.While Figures 1 provides one of the ‘bad-case’ exam-ples, Appendix D and the following section providesome of the ‘good-case’ examples where using ˜ L in-stead of L helps optimization of L . To further understand the properties of eliminatingsuboptimal local minima in an analytical manner, thissection presents several analytical examples. Example2 shows a general case where using ˜ L instead of L helpsa gradient-based optimization method. For the simpleillustration of the failure mode, Example 3 uses a singledata point and squared loss. Example 4 is the versionof Examples 3 with two data points and shows thatthe value of ˜ L can also approach a suboptimal value.Finally, Example 5 illustrates the existence of a localminimum of ˜ L via only the existence of a standardglobal minimum θ of L . In Appendix E, Examples 6and 7 show the same phenomena as those in Examples3 and 4 with a smoothed hinge loss. Example 2.
Figure 1: Illustration of the failure mode suggested by Theorem 4. (a) the objective function $L$; (b) the modified function $\tilde{L}$; (c) negative gradient directions of $\tilde{L}$. In sub-figure (a), the original objective function $L$ has a suboptimal local minimum in addition to its global minimum. In sub-figures (b) and (c), it can be observed that even with the modified objective function $\tilde{L}$, if $\theta$ is initially near the suboptimal local minimum, following the negative gradient directions keeps $\theta$ near that point while $b \to \infty$. In sub-figure (c), the arrows represent the negative normalized gradient vectors at each point. In sub-figures (b) and (c), the function $\tilde{L}$ is plotted along the coordinates $(\theta, b)$ by setting the other parameters to be solutions $(a^*, W^*)$ of the objective $\min_{a, W} \tilde{L}|_{\theta, b}(a, W) = \tilde{L}(\theta, a, b, W)$ at each given point $(\theta, b)$.

Example 2. Let $A[\theta] = \frac{1}{m} [(\frac{\partial f(x_1; \theta)}{\partial \theta})^\top, \dots, (\frac{\partial f(x_m; \theta)}{\partial \theta})^\top] \in \mathbb{R}^{d_\theta \times (m d_y)}$ be a matrix, and $r[\varphi] = [\nabla \ell_{y_1}(\varphi(x_1))^\top, \dots, \nabla \ell_{y_m}(\varphi(x_m))^\top]^\top \in \mathbb{R}^{m d_y}$ be a column vector given a function $\varphi: \mathbb{R}^{d_x} \to \mathbb{R}^{d_y}$. The modified objective $\tilde{L}$ helps a gradient-based method by creating extra decreasing directions as $r[f(\cdot; \theta) + g(\cdot; a, b, W)] \notin \mathrm{Null}(A[\theta])$ even when $r[f(\cdot; \theta)] \in \mathrm{Null}(A[\theta])$. This also helps optimization when $r[f(\cdot; \theta)]$ is approximately in $\mathrm{Null}(A[\theta])$ while $r[f(\cdot; \theta) + g(\cdot; a, b, W)]$ is not.
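A small numerical sketch of Example 2 (our own construction; the linear toy model $f(x; \theta) = \theta^\top x$, the random data, and all constants are hypothetical choices): it builds $A[\theta]$ and $r[\cdot]$ explicitly for the squared loss and checks that $\partial L(\theta)/\partial \theta = (A[\theta]\, r[f(\cdot; \theta)])^\top$, so adding $g$ changes the $\theta$-gradient exactly when it moves $r$ out of $\mathrm{Null}(A[\theta])$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d_x = 5, 3
X = rng.normal(size=(m, d_x))               # rows are the inputs x_i
y = rng.normal(size=m)
theta = rng.normal(size=d_x)                # toy model f(x; theta) = theta^T x (d_y = 1)
a, b, w = 0.1, 0.0, rng.normal(size=d_x)    # the added neuron g(x) = a * exp(w^T x + b)

f = X @ theta
g = a * np.exp(X @ w + b)

A = X.T / m                  # A[theta]: column i is (1/m) * df(x_i)/dtheta = x_i / m
r_f = 2.0 * (f - y)          # r[f]: gradients of the squared loss at f(x_i; theta)
r_fg = 2.0 * (f + g - y)     # r[f + g]: the same gradients after adding the neuron

grad_L = A @ r_f             # dL/dtheta = (A[theta] r[f])^T
grad_L_tilde = A @ r_fg      # theta-part of dL_tilde/dtheta

print(np.allclose(grad_L, (2.0 / m) * X.T @ (X @ theta - y)))   # sanity check: True
print(np.linalg.norm(grad_L), np.linalg.norm(grad_L_tilde))
# The theta-gradient of L_tilde differs from that of L exactly when adding g moves
# r out of Null(A[theta]); in the construction used for Theorem 4 (A[theta] = 0),
# both gradients remain zero for every (a, b, W).
```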
Example 3. Let $m = 1$ and $d_y = 1$. In addition, let $L(\theta) = \ell(f(x_1; \theta), y_1) = (f(x_1; \theta) - y_1)^2$. Accordingly, $\tilde{L}(\theta, a, b, W) = (f(x_1; \theta) + a \exp(w^\top x_1 + b) - y_1)^2 + \lambda a^2$. Let $\theta$ be a non-global minimum of $L$, so that $f(x_1; \theta) \neq y_1$. In particular, let us first consider the case of $f(x_1; \theta) = 2$ and $y_1 = 1$. Then, $L(\theta) = 1$ and $\tilde{L}(\theta, a, b, W) = 1 + 2a \exp(w^\top x_1 + b) + a^2 \exp(2 w^\top x_1 + 2b) + \lambda a^2$. If $(a, b, W)$ is a local minimum, from the stationary point conditions $\frac{\partial \tilde{L}(\theta, a, b, W)}{\partial a} = 0$ and $\frac{\partial \tilde{L}(\theta, a, b, W)}{\partial b} = 0$, we must have $a = 0$, yielding $\tilde{L}(\theta, a, b, W) = 1$. However, a point with $a = 0$ is not a local minimum (with finite $(b, w)$), because with $a < 0$ and $|a| > 0$ sufficiently small, $\tilde{L}(\theta, a, b, W) = 1 - 2|a| \exp(w^\top x_1 + b) + |a|^2 (\exp(2 w^\top x_1 + 2b) + \lambda) < 1$. Hence, there is no local minimum $(a, b, W) \in \mathbb{R} \times \mathbb{R} \times \mathbb{R}^{d_x}$ of $\tilde{L}|_\theta$. Indeed, if we set $a = -\exp(-1/\epsilon)$ and $b = 1/\epsilon - w^\top x_1$, then $\tilde{L}(\theta, a, b, W) = \lambda \exp(-2/\epsilon) \to 0$ as $\epsilon \to 0$, and hence as $a \to 0^-$ and $b \to \infty$. This illustrates the case in which $(a, b)$ does not attain a solution in $\mathbb{R} \times \mathbb{R}$. The identical conclusion holds with the general case of $f(x_1; \theta) \neq y_1$ by following the same steps of reasoning.
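A quick numerical check of Example 3 (a sketch; the choices of $d_x$, $x_1$, $w$, and $\lambda$ are arbitrary): along the path $a = -\exp(-1/\epsilon)$, $b = 1/\epsilon - w^\top x_1$, the value of $\tilde{L}|_\theta$ decreases toward $0$ while $a \to 0^-$ and $b \to \infty$, so the infimum is not attained at any finite $(a, b, W)$.

```python
import numpy as np

rng = np.random.default_rng(1)
d_x = 4
x1 = rng.normal(size=d_x)
w = rng.normal(size=d_x)
lam = 1e-2
f_theta, y1 = 2.0, 1.0                      # the suboptimal theta of Example 3

def L_tilde(a, b):
    return (f_theta + a * np.exp(w @ x1 + b) - y1) ** 2 + lam * a ** 2

for eps in [1.0, 0.5, 0.2, 0.1, 0.05]:
    a = -np.exp(-1.0 / eps)
    b = 1.0 / eps - w @ x1
    print(f"eps={eps:5.2f}  a={a: .3e}  b={b: .2f}  L_tilde={L_tilde(a, b):.3e}")
# The printed values decrease toward 0, but only as a -> 0^- and b -> infinity,
# so no finite (a, b, W) attains the infimum of L_tilde with this fixed theta.
```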
Example 4. Let $m = 2$ and $d_y = 1$. In addition, $L(\theta) = (f(x_1; \theta) - y_1)^2 + (f(x_2; \theta) - y_2)^2$. Let us consider the case of $f(x_1; \theta) = f(x_2; \theta) = 0$, $y_1 = 1$, and $y_2 = -1$. Then, $L(\theta) = 2$. If $(a, b, W)$ is a local minimum, we must have $a = 0$ similarly to Example 3, yielding $\tilde{L}(\theta, a, b, W) = 2$. On the other hand,

$$\tilde{L}(\theta, a, b, W) = 2 - 2a (\exp(w^\top x_1 + b) - \exp(w^\top x_2 + b)) + \phi(a), \quad (4)$$

where $\phi(a) = a^2 \exp(2 w^\top x_1 + 2b) + a^2 \exp(2 w^\top x_2 + 2b) + \lambda a^2$. Note that, with a sufficiently small $|a| > 0$, $\phi(a)$ becomes negligible. Let $x_1 \neq x_2$. In this case, our $\theta$ with $f(x_1; \theta) = f(x_2; \theta) = 0$ is not a global minimum. Then, a point with $a = 0$ can be shown not to be a local minimum as follows. If $\exp(w^\top x_1 + b) > \exp(w^\top x_2 + b)$, with $a > 0$ and $|a|$ being sufficiently small, $\tilde{L}(\theta, a, b, W) < 2$. If $\exp(w^\top x_1 + b) < \exp(w^\top x_2 + b)$, with $a < 0$ and $|a|$ being sufficiently small, $\tilde{L}(\theta, a, b, W) < 2$. If $\exp(w^\top x_1 + b) = \exp(w^\top x_2 + b)$, since $x_1 \neq x_2$, we can perturb $w$ with an arbitrarily small magnitude to make $\exp(w^\top x_1 + b) \neq \exp(w^\top x_2 + b)$, and hence we can yield the above cases. Thus, a point with $a = 0$ is not a local minimum. Therefore, there is no local minimum $(a, b, W)$ of $\tilde{L}|_\theta$. Indeed, since $x_1 \neq x_2$, if we set $a = \exp(-1/\epsilon)$, $b = 1/\epsilon - w^\top x_1$, and $w = -\epsilon^{-1}(x_2 - x_1)$, then $\tilde{L}(\theta, a, b, W) = (\exp(-\|x_2 - x_1\|_2^2 / \epsilon) + 1)^2 + \lambda \exp(-2/\epsilon) \to 1$ as $\epsilon \to 0$, and hence as $a \to 0$, $b \to \infty$, and $\|w\|_2 \to \infty$, illustrating the case in which $(a, b, W)$ does not attain a solution in $\mathbb{R} \times \mathbb{R} \times \mathbb{R}^{d_x}$.
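The analogous check for Example 4 (again a sketch with arbitrary $x_1 \neq x_2$ and $\lambda$): along the path $a = \exp(-1/\epsilon)$, $b = 1/\epsilon - w^\top x_1$, $w = -(x_2 - x_1)/\epsilon$, the value of $\tilde{L}|_\theta$ approaches the suboptimal value $1$ while $\|w\|_2$ diverges, compared with $\tilde{L}(\theta, 0, b, W) = 2$ at $a = 0$.

```python
import numpy as np

rng = np.random.default_rng(2)
d_x = 4
x1, x2 = rng.normal(size=d_x), rng.normal(size=d_x)   # x1 != x2 almost surely
y1, y2 = 1.0, -1.0                                     # f(x1;theta) = f(x2;theta) = 0
lam = 1e-2

def L_tilde(a, b, w):
    g1 = a * np.exp(w @ x1 + b)
    g2 = a * np.exp(w @ x2 + b)
    return (g1 - y1) ** 2 + (g2 - y2) ** 2 + lam * a ** 2

for eps in [1.0, 0.5, 0.2, 0.1]:
    w = -(x2 - x1) / eps
    a = np.exp(-1.0 / eps)
    b = 1.0 / eps - w @ x1
    print(f"eps={eps:4.1f}  ||w||={np.linalg.norm(w):7.2f}  L_tilde={L_tilde(a, b, w):.4f}")
# L_tilde approaches 1 along this divergent path, which improves on the value 2 at
# a = 0 but is still a suboptimal value of the overall objective.
```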
Example 5. Consider the exact same example as in Example 4, with the exception that $x_1 = x_2$. In this case, a $\theta$ with $f(x_1; \theta) = f(x_2; \theta) = 0$ is a global minimum, unlike in Example 4. A point with $a = 0$ is indeed a local minimum of $\tilde{L}$, which can be seen in Equation (4) where for all $a$, $2 - 2a (\exp(w^\top x_1 + b) - \exp(w^\top x_2 + b)) + \phi(a) = 2 + \phi(a) \geq 2$.

There have been many analyses regarding the optimization of deep neural networks with significant over-parameterization (Nguyen and Hein, 2017; Allen-Zhu et al., 2018; Du et al., 2018; Zou et al., 2018) and model simplification (Choromanska et al., 2015; Kawaguchi, 2016; Hardt and Ma, 2017; Bartlett et al., 2019; Du and Hu, 2019). In contrast, this paper studies the problem in the regime without significant over-parameterization or model simplification.

Our results are also different when compared to the previous study by Liang et al. (2018). First, we have solved the open problem left in the previous study; i.e., our theoretical results are applicable to practical settings and neural networks with multiple output units and common loss functions. The results in the previous study are applicable to a neural network with a single output unit for binary classification with particular smoothed hinge loss functions that are not used in common practice. In particular, the previous results are not applicable to multi-class classification or regression with any loss criteria, or binary classification with standard loss criteria (e.g., cross entropy loss and smoothed hinge loss without twice differentiability).

Second, we proved and demonstrated the failure mode of eliminating local minima as a key contribution via Theorem 4 as well as numerical and analytical examples. The failure mode proven in Theorem 4 also holds true for the results in the previous study (this is because the assumptions of Theorem 4 are implied by the assumptions used in the previous study, and the construction of the tuple $(\ell, f, \{(x_i, y_i)\}_{i=1}^{m}, \theta)$ in the proof also accommodates the setting in the previous study). Indeed, Examples 1, 6, and 7 as well as Figure 1 directly illustrate the failure mode of the results in the previous paper. The previous study does not discuss any possible failure mode of eliminating local minima and, in fact, states that good neural networks are "just one neuron away" from bad neural networks (with suboptimal local optima). Our Theorem 4 together with analytical examples proves that such "good" neural networks with an added neuron are still subject to their own failure mode, opening up the need for future research.

Third, this paper has introduced a novel and concise proof based on the PGB necessary condition as well as a longer but more elementary proof. Our proof introduced new insight into why we can eliminate suboptimal local minima; i.e., the global minimum value of the perturbable gradient basis of an added network is indeed the global minimum value of $L(\theta)$ (see Section 4 for more details). Beyond the elimination of local minima and the particular modification $\tilde{L}$, the PGB condition can be used to analyze other models and modifications, and it might be helpful to design new modifications of the objective functions.

In addition to the use of the PGB necessary condition, our proofs also differ from the previous proofs because the scope and the assumptions of the results are different.
Indeed, the analyses of one-dimensional output with $y \in \{-1, +1\}$ (the previous study) and high-dimensional output with $y \in \mathbb{R}^{d_y}$ (this paper) are naturally different. For example, when the matrix $f(X; \theta) = [f(x_1; \theta), \dots, f(x_m; \theta)]$ is rank-deficient, all outputs must simply be zero as $f(X; \theta) = 0$ in the previous study, whereas $f(X; \theta)$ can be any one of infinitely many non-zero (rank-deficient) matrices in this paper. Unlike the previous study, our proofs also cannot invoke properties of discrete points and second-order Taylor expansions because we do not assume $y \in \{-1, +1\}$ (together with the particular smoothed hinge loss) and twice differentiability.

In this paper, we proved that if an algorithm finds a local minimum of a modified objective function $\tilde{L}$, then it immediately recovers a global minimum of the original objective function $L$ of an arbitrary deep neural network. However, Theorem 4 together with analytical examples showed that if an algorithm simply follows negative gradient directions toward a local minimum of $\tilde{L}$, either it moves toward a global minimum $\theta$ of $L$ or the norm of $(a, b, W)$ approaches infinity. This suggested a hybrid approach of local and global optimization algorithms, with a mechanism to monitor the norm of $(a, b, W)$.

From a theoretical viewpoint, we have shown a reduction of the problem of getting stuck around an arbitrarily poor local minimum to the detectable problem of the divergence of the norm. This proven reduction might be useful as a future proof technique in the theoretical literature and as a foundation of a future algorithm in practice.

In summary, this paper has advanced theoretical understanding of the properties of the optimization landscape in the regime that has not been studied well by previous research with significant over-parameterization or model simplification. Beyond the elimination of local minima, this paper has introduced the proof technique based on the PGB necessary condition of local minima that can be used to study general machine learning models and transformations of objective functions.

Acknowledgements
We gratefully acknowledge support from NSF grants 1523767 and 1723381, AFOSR grant FA9550-17-1-0165, ONR grant N00014-18-1-2847, Honda Research, and the MIT-Sensetime Alliance on AI.
References
Allen-Zhu, Z., Li, Y., and Song, Z. (2018). A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962.
Andoni, A., Panigrahy, R., Valiant, G., and Zhang, L. (2014). Learning polynomials with neural networks. In International Conference on Machine Learning, pages 1908-1916.
Bartlett, P. L., Helmbold, D. P., and Long, P. M. (2019). Gradient descent with identity initialization efficiently learns positive-definite linear transformations by deep residual networks. Neural Computation, 31(3):477-502.
Bertsekas, D. P. (1999). Nonlinear Programming. Athena Scientific, Belmont.
Blum, A. L. and Rivest, R. L. (1992). Training a 3-node neural network is NP-complete. Neural Networks, 5(1):117-127.
Brescia, M. (1994). Semeion handwritten digit data set. Semeion Research Center of Sciences of Communication.
Brutzkus, A. and Globerson, A. (2017). Globally optimal gradient descent for a convnet with gaussian inputs. In International Conference on Machine Learning, pages 605-614.
Choromanska, A., Henaff, M., Mathieu, M., Ben Arous, G., and LeCun, Y. (2015). The loss surfaces of multilayer networks. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pages 192-204.
Clanuwat, T., Bober-Irizar, M., Kitamoto, A., Lamb, A., Yamamoto, K., and Ha, D. (2018). Deep learning for classical japanese literature. arXiv preprint arXiv:1812.01718.
Du, S. S. and Hu, W. (2019). Width provably matters in optimization for deep linear neural networks. arXiv preprint arXiv:1901.08572.
Du, S. S. and Lee, J. D. (2018). On the power of over-parametrization in neural networks with quadratic activation. arXiv preprint arXiv:1803.01206.
Du, S. S., Lee, J. D., Li, H., Wang, L., and Zhai, X. (2018). Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804.
Gasca, M. and Sauer, T. (2000). Polynomial interpolation in several variables. Advances in Computational Mathematics, 12(4):377.
Ge, R., Lee, J. D., and Ma, T. (2017). Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501.
Goel, S. and Klivans, A. (2017). Learning depth-three neural networks in polynomial time. arXiv preprint arXiv:1709.06010.
Hardt, M. and Ma, T. (2017). Identity matters in deep learning. In International Conference on Learning Representations.
Kawaguchi, K. (2016). Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586-594.
Kawaguchi, K. and Bengio, Y. (2019). Depth with nonlinearity creates no bad local minima in ResNets. Neural Networks, 118:167-174.
Kawaguchi, K. and Huang, J. (2019). Gradient descent finds global minima for generalizable deep neural networks of practical sizes. In Allerton Conference on Communication, Control, and Computing. IEEE.
Kawaguchi, K., Huang, J., and Kaelbling, L. P. (2019). Every local minimum value is the global minimum value of induced model in nonconvex machine learning. Neural Computation, 31(12):2293-2323.
Laurent, T. and Brecht, J. (2018). Deep linear networks with arbitrary loss: All local minima are global. In International Conference on Machine Learning, pages 2908-2913.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324.
Li, Y. and Yuan, Y. (2017). Convergence analysis of two-layer neural networks with relu activation. In Advances in Neural Information Processing Systems, pages 597-607.
Liang, S., Sun, R., Lee, J. D., and Srikant, R. (2018). Adding one neuron can eliminate all bad local minima. In Advances in Neural Information Processing Systems.
Murty, K. G. and Kabadi, S. N. (1987). Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39(2):117-129.
Nguyen, Q. and Hein, M. (2017). The loss surface of deep and wide neural networks. In International Conference on Machine Learning, pages 2603-2612.
Nguyen, Q. and Hein, M. (2018). Optimization landscape and expressivity of deep cnns. In International Conference on Machine Learning, pages 3727-3736.
Sedghi, H. and Anandkumar, A. (2014). Provable methods for training neural networks with sparse connectivity. arXiv preprint arXiv:1412.2693.
Shamir, O. (2018). Are ResNets provably better than linear predictors? In Advances in Neural Information Processing Systems.
Soltanolkotabi, M. (2017). Learning relus via gradient descent. In Advances in Neural Information Processing Systems, pages 2007-2017.
Soudry, D. and Hoffer, E. (2017). Exponentially vanishing sub-optimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777.
Zhang, X., Ling, C., and Qi, L. (2012). The best rank-1 approximation of a symmetric tensor and related spherical optimization problems. SIAM Journal on Matrix Analysis and Applications, 33(3):806-821.
Zhong, K., Song, Z., Jain, P., Bartlett, P. L., and Dhillon, I. S. (2017). Recovery guarantees for one-hidden-layer neural networks. In International Conference on Machine Learning, pages 4140-4149.
Zou, D., Cao, Y., Zhou, D., and Gu, Q. (2018). Stochastic gradient descent optimizes over-parameterized deep relu networks. arXiv preprint arXiv:1811.08888.
Zou, D. and Gu, Q. (2019). An improved analysis of training over-parameterized deep neural networks. arXiv preprint arXiv:1906.04688.

Appendix
A Proofs of Theorem 3
Let $z \in \mathbb{R}^{d_z}$ be an arbitrary local minimum of $Q$. From the convexity and differentiability of $Q_i$ and $R_i$, we have that

$$\frac{1}{m} \sum_{i=1}^{m} Q_i(\phi_i^z(\alpha, \epsilon, S)) + R_i(\varphi_i^z(\alpha, \epsilon, S)) \quad (5)$$

$$\geq \frac{1}{m} \sum_{i=1}^{m} \left[ Q_i(\phi_i(z)) + R_i(\varphi_i(z)) + \partial Q_i(\phi_i(z)) (\phi_i^z(\alpha, \epsilon, S) - \phi_i(z)) + \partial R_i(\varphi_i(z)) (\varphi_i^z(\alpha, \epsilon, S) - \varphi_i(z)) \right]$$

$$= Q(z) + \frac{1}{m} \sum_{i=1}^{m} \left[ \partial Q_i(\phi_i(z)) \phi_i^z(\alpha, \epsilon, S) + \partial R_i(\varphi_i(z)) \varphi_i^z(\alpha, \epsilon, S) - \partial Q_i(\phi_i(z)) \phi_i(z) - \partial R_i(\varphi_i(z)) \varphi_i(z) \right].$$

Since $z$ is a local minimum of $Q$, by the definition of a local minimum, there exists $\epsilon_1 > 0$ such that $Q(z) \leq Q(z')$ for all $z' \in B(z, \epsilon_1)$. Then, for any $\epsilon \in [0, \epsilon_1/2)$ and any $v \in V[z, \epsilon]$, the vector $(z + \epsilon v)$ is also a local minimum because

$$Q(z + \epsilon v) = Q(z) \leq Q(z'), \quad \text{for all } z' \in B(z + \epsilon v, \epsilon_1/2) \subseteq B(z, \epsilon_1),$$

where the set inclusion follows from the triangle inequality. This satisfies the definition of a local minimum for $(z + \epsilon v)$. Since the composition and the sums of differentiable functions are differentiable, the vector $(z + \epsilon v)$ is a differentiable local minimum. Therefore, from the first-order necessary condition of differentiable local minima, there exists $\epsilon_2 > 0$ such that for any $\epsilon \in [0, \epsilon_2)$, any $v \in V[z, \epsilon]$, and any $k \in \{1, \dots, d_z\}$,

$$\partial_k Q(z + \epsilon v) = \frac{1}{m} \sum_{i=1}^{m} \partial Q_i(\phi_i(z)) \partial_k \phi_i(z + \epsilon v) + \partial R_i(\varphi_i(z)) \partial_k \varphi_i(z + \epsilon v) = 0, \quad (6)$$

where we used the fact that $\phi_i(z) = \phi_i(z + \epsilon v)$ and $\varphi_i(z) = \varphi_i(z + \epsilon v)$ for any $v \in V[z, \epsilon]$. From (6), there exists $\epsilon_2 > 0$ such that for any $\epsilon \in [0, \epsilon_2)$, any $S \subseteq_{\mathrm{fin}} V[z, \epsilon]$, and any $\alpha \in \mathbb{R}^{d_z \times |S|}$,

$$\frac{1}{m} \sum_{i=1}^{m} \partial Q_i(\phi_i(z)) \phi_i^z(\alpha, \epsilon, S) + \partial R_i(\varphi_i(z)) \varphi_i^z(\alpha, \epsilon, S) \quad (7)$$

$$= \sum_{k=1}^{d_z} \sum_{j=1}^{|S|} \alpha_{k,j} \left( \frac{1}{m} \sum_{i=1}^{m} \partial Q_i(\phi_i(z)) \partial_k \phi_i(z + \epsilon S_j) + \partial R_i(\varphi_i(z)) \partial_k \varphi_i(z + \epsilon S_j) \right) = 0,$$

where the second line follows the definition of $\phi_i^z(\alpha, \epsilon, S)$ and $\varphi_i^z(\alpha, \epsilon, S)$, and the last equality follows (6). Furthermore,

$$\frac{1}{m} \sum_{i=1}^{m} \partial Q_i(\phi_i(z)) \phi_i(z) + \partial R_i(\varphi_i(z)) \varphi_i(z) \pm (1/\rho) \partial R_i(\varphi_i(z)) \varphi_i(z) \quad (8)$$

$$= \sum_{k=1}^{d_z} h(z)_k \left( \frac{1}{m} \sum_{i=1}^{m} \partial Q_i(\phi_i(z)) \partial_k \phi_i(z) + \partial R_i(\varphi_i(z)) \partial_k \varphi_i(z) \right) + (1 - 1/\rho) \frac{1}{m} \sum_{i=1}^{m} \partial R_i(\varphi_i(z)) \varphi_i(z)$$

$$= (1 - 1/\rho) \frac{1}{m} \sum_{i=1}^{m} \partial R_i(\varphi_i(z)) \varphi_i(z),$$

where the second line follows the assumption of the existence of a function $h$ for writing $\phi_i(z)$ and $\varphi_i(z)$, and the last line follows (6).

Substituting (7) and (8) into (5), there exists $\epsilon_2 > 0$ such that for any $\epsilon \in [0, \epsilon_2)$, any $S \subseteq_{\mathrm{fin}} V[z, \epsilon]$, and any $\alpha \in \mathbb{R}^{d_z \times |S|}$,

$$\frac{1}{m} \sum_{i=1}^{m} Q_i(\phi_i^z(\alpha, \epsilon, S)) + R_i(\varphi_i^z(\alpha, \epsilon, S)) \geq Q(z) - (1 - 1/\rho) \frac{1}{m} \sum_{i=1}^{m} \partial R_i(\varphi_i(z)) \varphi_i(z).$$

This proves the main statement of the theorem. In the case of $\rho = 1$, this shows that, on the one hand, there exists $\epsilon_2 > 0$ such that for any $\epsilon \in [0, \epsilon_2)$, $Q(z) \leq \inf\{\frac{1}{m} \sum_{i=1}^{m} Q_i(\phi_i^z(\alpha, \epsilon, S)) + R_i(\varphi_i^z(\alpha, \epsilon, S)) : S \subseteq_{\mathrm{fin}} V[z, \epsilon],\ \alpha \in \mathbb{R}^{d_z \times |S|}\}$. On the other hand, since $\phi_i(z) = \sum_{k=1}^{d_z} h(z)_k \partial_k \phi_i(z)$ and $\varphi_i(z) = \rho \sum_{k=1}^{d_z} h(z)_k \partial_k \varphi_i(z)$ with $\rho = 1$, we have that $Q(z) \geq \inf\{\frac{1}{m} \sum_{i=1}^{m} Q_i(\phi_i^z(\alpha, \epsilon, S)) + R_i(\varphi_i^z(\alpha, \epsilon, S)) : S \subseteq_{\mathrm{fin}} V[z, \epsilon],\ \alpha \in \mathbb{R}^{d_z \times |S|}\}$. Combining these yields the desired statement for the equality in the case of $\rho = 1$.

B Proofs of eliminating local minima
A high-level idea behind the proofs of Theorems 1 and 2 in this section (instead of the proof via the PGB necessary condition) follows the idea utilized by Kawaguchi (2016) for deep linear networks. That is, we first obtain possible candidate local minima $\tilde{\theta}$ via the first-order necessary condition (i.e., $\{(\theta, a, b, W): a = 0\}$), and then consider small perturbations of those candidate local minima. From the definition of local minima, the value at a possible local minimum $\tilde{\theta}$ must be less than or equal to the value at any sufficiently small perturbation of the given local minimum $\tilde{\theta}$. This condition imposes strong constraints on those candidate local minima, and turns out to be sufficient to prove the desired result with appropriate perturbations and rearrangements, together with the interpolation result with polynomials or simply based on linear algebra (i.e., we can interpolate $m'$ points via a polynomial as the corresponding matrix has rank $m'$).

In all the proofs of Theorems 1 and 2 (including the proof with the PGB necessary condition), we let $\theta$ be arbitrary so that we can prove the failure mode of eliminating the suboptimal local minima in the next section (Theorem 4) by reusing these proofs. Let $\ell_y(q) = \ell(q, y)$, and let $\nabla \ell_y(\varphi(q)) = (\nabla \ell_y)(\varphi(q))$ be the gradient $\nabla \ell_y$ evaluated at an output $\varphi(q)$ of a function $\varphi$.

B.1 Proof of Theorem 1 without the PGB necessary condition
Proof of Theorem 1.
Let $\theta$ be fixed. Let $(a, b, W)$ be a local minimum of $\tilde{L}|_\theta(a, b, W) := \tilde{L}(\theta, a, b, W)$. Let $\tilde{L}|_{(\theta, W)}(a, b) = \tilde{L}(\theta, a, b, W)$. Since $\ell_y: q \mapsto \ell(q, y)$ is assumed to be differentiable, $\tilde{L}|_{(\theta, W)}$ is also differentiable (since a sum of differentiable functions is differentiable, and a composition of differentiable functions is differentiable). From the definition of a stationary point of the differentiable function $\tilde{L}|_{(\theta, W)}$, for all $k \in \{1, 2, \dots, d_y\}$,

$$a_k \frac{\partial \tilde{L}(\theta, a, b, W)}{\partial a_k} = \frac{1}{m} \sum_{i=1}^{m} (\nabla \ell_{y_i}(f(x_i; \theta) + g(x_i; a, b, W)))_k\, a_k \exp(w_k^\top x_i + b_k) + 2\lambda a_k^2 = \frac{\partial \tilde{L}(\theta, a, b, W)}{\partial b_k} + 2\lambda a_k^2 = 2\lambda a_k^2 = 0,$$

which implies that $a_k = 0$ for all $k \in \{1, 2, \dots, d_y\}$, since $2\lambda \neq 0$. Therefore, we have that

$$a = 0. \quad (9)$$

This yields $g(x; a, b, W) = 0$ and $\tilde{L}(\theta, a, b, W) = L(\theta)$.

We now consider perturbations of a local minimum $(a, b, W)$ of $\tilde{L}|_\theta$ with $a = 0$. Note that, among other equivalent definitions, a function $h: \mathbb{R}^d \to \mathbb{R}$ is said to be differentiable at $q \in \mathbb{R}^d$ if there exist a vector $\nabla h(q)$ and a function $\varphi(q; \cdot)$ (with its domain being a deleted neighborhood of the origin $0 \in \mathbb{R}^d$) such that $\lim_{\Delta q \to 0} \varphi(q; \Delta q) = 0$, and

$$h(q + \Delta q) = h(q) + \nabla h(q)^\top \Delta q + \|\Delta q\|_2\, \varphi(q; \Delta q),$$

for any non-zero vector $\Delta q \in \mathbb{R}^d$ that is sufficiently close to $0 \in \mathbb{R}^d$ (e.g., see the fundamental increment lemma and the definition of differentiability for multivariable functions). Thus, with sufficiently small perturbations $\Delta a \in \mathbb{R}^{d_y}$ and $\Delta W = [\Delta w_1\ \Delta w_2\ \dots\ \Delta w_{d_y}] \in \mathbb{R}^{d_x \times d_y}$, there exists a function $\varphi$ such that

$$\tilde{L}(\theta, a + \Delta a, b, W + \Delta W) = \frac{1}{m} \sum_{i=1}^{m} \ell_{y_i}(f(x_i; \theta) + \Delta g_i) + \lambda \|\Delta a\|_2^2 = \frac{1}{m} \sum_{i=1}^{m} \left[ \ell_{y_i}(f(x_i; \theta)) + \nabla \ell_{y_i}(f(x_i; \theta))^\top \Delta g_i + \|\Delta g_i\|_2\, \varphi(f(x_i; \theta); \Delta g_i) \right] + \lambda \|\Delta a\|_2^2,$$

where $\lim_{\Delta q \to 0} \varphi(f(x_i; \theta); \Delta q) = 0$ and $\Delta g_i = g(x_i; \Delta a, b, W + \Delta W)$. Here, the last equality follows the definition of the differentiability of $\ell_{y_i}$, since $g(x_i; \Delta a, b, W + \Delta W)_k = \Delta a_k \exp(w_k^\top x_i + \Delta w_k^\top x_i + b_k)$ is arbitrarily small with sufficiently small $\Delta a_k$ and $\Delta w_k$.

Combining the above two equations, since $(a, b, W)$ is a local minimum, we have that, for any sufficiently small $\Delta a$ and $\Delta W$,

$$\tilde{L}(\theta, a + \Delta a, b, W + \Delta W) - \tilde{L}(\theta, a, b, W) = \frac{1}{m} \sum_{i=1}^{m} \nabla \ell_{y_i}(f(x_i; \theta))^\top \Delta g_i + \frac{1}{m} \sum_{i=1}^{m} \|\Delta g_i\|_2\, \varphi(f(x_i; \theta); \Delta g_i) + \lambda \|\Delta a\|_2^2 \geq 0.$$

Rearranging with $\Delta a = \epsilon v$ such that $\epsilon > 0$ and $\|v\|_2 = 1$, and with $\Delta \tilde{g}_i = g(x_i; v, b, W + \Delta W)$,

$$\frac{\epsilon}{m} \sum_{i=1}^{m} \nabla \ell_{y_i}(f(x_i; \theta))^\top \Delta \tilde{g}_i \geq -\frac{\epsilon}{m} \sum_{i=1}^{m} \|\Delta \tilde{g}_i\|_2\, \varphi(f(x_i; \theta); \epsilon \Delta \tilde{g}_i) - \lambda \epsilon^2 \|v\|_2^2,$$

since $\Delta g_i = \epsilon \Delta \tilde{g}_i$. With $\epsilon > 0$, this implies that

$$\frac{1}{m} \sum_{i=1}^{m} \nabla \ell_{y_i}(f(x_i; \theta))^\top \Delta \tilde{g}_i \geq -\frac{1}{m} \sum_{i=1}^{m} \|\Delta \tilde{g}_i\|_2\, \varphi(f(x_i; \theta); \epsilon \Delta \tilde{g}_i) - \lambda \epsilon \|v\|_2^2.$$

Since $\varphi(f(x_i; \theta); \epsilon \Delta \tilde{g}_i) \to 0$ and $\lambda \epsilon \|v\|_2^2 \to 0$ as $\epsilon \to 0$ (with $\epsilon \neq 0$),

$$\sum_{i=1}^{m} \nabla \ell_{y_i}(f(x_i; \theta))^\top g(x_i; v, b, W + \Delta W) \geq 0.$$

For any $k \in \{1, 2, \dots, d_y\}$, by setting $v_{k'} = 0$ for all $k' \neq k$, we have that

$$v_k \sum_{i=1}^{m} (\nabla \ell_{y_i}(f(x_i; \theta)))_k \exp(w_k^\top x_i + \Delta w_k^\top x_i + b_k) \geq 0,$$

for any $v_k \in \mathbb{R}$ such that $|v_k| = 1$. With $v_k \in \{-1, 1\}$,

$$\sum_{i=1}^{m} (\nabla \ell_{y_i}(f(x_i; \theta)))_k \exp(w_k^\top x_i + b_k) \exp(\Delta w_k^\top x_i) = 0.$$

By setting $\Delta w_k = \bar{\epsilon}_k u_k$ such that $\bar{\epsilon}_k > 0$ and $\|u_k\|_2 = 1$,

$$\sum_{t=0}^{\infty} \frac{\bar{\epsilon}_k^t}{t!} \sum_{i=1}^{m} (\nabla \ell_{y_i}(f(x_i; \theta)))_k \exp(w_k^\top x_i + b_k) (u_k^\top x_i)^t = 0,$$

since $\exp(q) = \lim_{T \to \infty} \sum_{t=0}^{T} \frac{q^t}{t!}$ and a finite sum of limits of convergent sequences is the limit of the finite sum. Rewriting this using $z_t = \sum_{i=1}^{m} (\nabla \ell_{y_i}(f(x_i; \theta)))_k \exp(w_k^\top x_i + b_k) (u_k^\top x_i)^t$,

$$\lim_{T \to \infty} \sum_{t=0}^{T} \frac{\bar{\epsilon}_k^t}{t!} z_t = 0. \quad (10)$$

We now show that $z_p = 0$ for all $p \in \mathbb{N}$ by induction. Consider the base case with $p = 0$. Equation (10) implies that

$$\lim_{T \to \infty} \left( z_0 + \sum_{t=1}^{T} \frac{\bar{\epsilon}_k^t}{t!} z_t \right) = z_0 + \lim_{T \to \infty} \sum_{t=1}^{T} \frac{\bar{\epsilon}_k^t}{t!} z_t = 0,$$

since $\lim_{T \to \infty} \sum_{t=1}^{T} \frac{\bar{\epsilon}_k^t}{t!} z_t$ exists (which follows from the fact that $\lim_{T \to \infty} \sum_{t=0}^{T} \frac{\bar{\epsilon}_k^t}{t!} z_t = 0$ exists). Here, $\lim_{T \to \infty} \sum_{t=1}^{T} \frac{\bar{\epsilon}_k^t}{t!} z_t \to 0$ as $\bar{\epsilon}_k \to 0$, and hence $z_0 = 0$. Consider the inductive step with the inductive hypothesis that $z_t = 0$ for all $t \leq p - 1$. Similarly to the base case, Equation (10) implies

$$\sum_{t=0}^{p-1} \frac{\bar{\epsilon}_k^t}{t!} z_t + \frac{\bar{\epsilon}_k^p}{p!} z_p + \lim_{T \to \infty} \sum_{t=p+1}^{T} \frac{\bar{\epsilon}_k^t}{t!} z_t = 0.$$

Multiplying $p! / \bar{\epsilon}_k^p$ on both sides, since $\sum_{t=0}^{p-1} \frac{\bar{\epsilon}_k^t}{t!} z_t = 0$ from the inductive hypothesis,

$$z_p + \lim_{T \to \infty} \sum_{t=p+1}^{T} \frac{\bar{\epsilon}_k^{t-p} p!}{t!} z_t = 0.$$

Since $\lim_{T \to \infty} \sum_{t=p+1}^{T} \frac{\bar{\epsilon}_k^{t-p} p!}{t!} z_t \to 0$ as $\bar{\epsilon}_k \to 0$, we have that $z_p = 0$, which finishes the induction. Therefore, for any $k \in \{1, 2, \dots, d_y\}$ and any $p \in \mathbb{N}$,

$$\sum_{i=1}^{m} (\nabla \ell_{y_i}(f(x_i; \theta)))_k \exp(w_k^\top x_i + b_k) (u_k^\top x_i)^p = 0. \quad (11)$$

Let $x \otimes x'$ be the tensor product of the vectors $x$ and $x'$, and $x^{\otimes p} = x \otimes \cdots \otimes x$ where $x$ appears $p$ times. For a $p$-th order tensor $M \in \mathbb{R}^{d \times \cdots \times d}$ and $p$ vectors $u^{(1)}, u^{(2)}, \dots, u^{(p)} \in \mathbb{R}^d$, define

$$M(u^{(1)}, u^{(2)}, \dots, u^{(p)}) = \sum_{1 \leq i_1, \dots, i_p \leq d} M_{i_1 \cdots i_p} u^{(1)}_{i_1} \cdots u^{(p)}_{i_p}.$$

Let $\xi_{i,k} = (\nabla \ell_{y_i}(f(x_i; \theta)))_k \exp(w_k^\top x_i + b_k)$. Then, for any $k \in \{1, 2, \dots, d_y\}$ and any $p \in \mathbb{N}$,

$$\max_{u^{(1)}, \dots, u^{(p)}: \|u^{(1)}\|_2 = \cdots = \|u^{(p)}\|_2 = 1} \left( \sum_{i=1}^{m} \xi_{i,k} x_i^{\otimes p} \right) (u^{(1)}, \dots, u^{(p)}) = \max_{u: \|u\|_2 = 1} \left( \sum_{i=1}^{m} \xi_{i,k} x_i^{\otimes p} \right) (u, u, \dots, u) = \max_{u: \|u\|_2 = 1} \sum_{i=1}^{m} \xi_{i,k} (u^\top x_i)^p = 0,$$

where the first equality follows theorem 2.1 in (Zhang et al., 2012), and the last equality follows Equation (11). This implies that

$$\sum_{i=1}^{m} (\nabla \ell_{y_i}(f(x_i; \theta)))_k \exp(w_k^\top x_i + b_k)\, \mathrm{vec}(x_i^{\otimes p}) = 0 \in \mathbb{R}^{d_x^p}. \quad (12)$$

Using Equation (12), we now prove statement (i). For any $\theta'$, there exist $p$ and $u_{t,k}$ (for $t = 0, \dots, p$ and $k = 1, \dots, d_y$) such that

$$m(L(\theta') - L(\theta)) \geq \sum_{i=1}^{m} \nabla \ell_{y_i}(f(x_i; \theta))^\top (f(x_i; \theta') - f(x_i; \theta)) = \sum_{j=1}^{m'} \sum_{i \in \mathcal{I}_j} \nabla \ell_{y_i}(f(x_i; \theta))^\top (f(x_i; \theta') - f(x_i; \theta))$$

$$= \sum_{j=1}^{m'} \sum_{k=1}^{d_y} (f(\bar{x}_j; \theta') - f(\bar{x}_j; \theta))_k \sum_{i \in \mathcal{I}_j} (\nabla \ell_{y_i}(f(x_i; \theta)))_k = \sum_{t=0}^{p} \sum_{k=1}^{d_y} u_{t,k}^\top \sum_{i=1}^{m} (\nabla \ell_{y_i}(f(x_i; \theta)))_k \exp(w_k^\top x_i + b_k)\, \mathrm{vec}(x_i^{\otimes t}) = 0,$$

where the first inequality follows from the assumption that $\ell_{y_i}$ is convex and differentiable, the second equality follows from the fact that $\bar{x}_j = x_i$ for all $i \in \mathcal{I}_j$, and the third equality uses $(f(\bar{x}_j; \theta') - f(\bar{x}_j; \theta))_k = \exp(w_k^\top \bar{x}_j + b_k) \sum_{t=0}^{p} u_{t,k}^\top \mathrm{vec}(\bar{x}_j^{\otimes t})$. The last equality follows from Equation (12). The third equality follows from the fact that the vector $\mathrm{vec}(x_i^{\otimes t})$ contains all monomials in $x_i$ of degree $t$, and the $m'$ input points $\bar{x}_1, \dots, \bar{x}_{m'}$ are distinct, which allows the basic existence (and construction) result of a polynomial interpolation of the finite $m'$ points; i.e., with $p$ sufficiently large ($p = m' - 1$ suffices), for each $k$, there exist $u_{t,k}$ such that $\sum_{t=0}^{p} u_{t,k}^\top \mathrm{vec}(\bar{x}_j^{\otimes t}) = q_{j,k}$ for any $q_{j,k} \in \mathbb{R}$ for all $j \in \{1, \dots, m'\}$ (e.g., see equation (1.9) in Gasca and Sauer 2000), in particular, including $q_{j,k} = (f(\bar{x}_j; \theta') - f(\bar{x}_j; \theta))_k \exp(-w_k^\top \bar{x}_j - b_k)$.

Therefore, we have that, for any $\theta'$, $L(\theta') \geq L(\theta)$, which proves statement (i). Statement (ii) directly follows from Equation (9).

B.2 Proof of Theorem 2
B.2 Proof of Theorem 2

Proof of Theorem 2.
Let $\theta$ be fixed. Let $(a, b, W)$ be a local minimum of $\tilde L|_\theta(a, b, W) := \tilde L(\theta, a, b, W)$. Then, for any $k \in \{1, 2, \ldots, d_y\}$, there exist $p$ and $u_{t,k}$ (for $t = 0, \ldots, p$) such that
$$\sum_{i=1}^{m} \big(\nabla \ell_{y_i}(f(x_i;\theta))\big)_k^2 = \sum_{j=1}^{m'} |\mathcal{I}_j|\, \big(\nabla \ell_{f^*(\bar x_j)}(f(\bar x_j;\theta))\big)_k^2 = \sum_{t=0}^{p} u_{t,k}^\top \sum_{i=1}^{m} \big(\nabla \ell_{y_i}(f(x_i;\theta))\big)_k \exp(w_k^\top x_i + b_k)\, \mathrm{vec}(x_i^{\otimes t}) = 0,$$
where the first equality utilizes Assumption 3. The second equality follows from the fact that, since the $m'$ input points $\bar x_1, \ldots, \bar x_{m'}$ are distinct, with $p$ sufficiently large ($p = m' - 1$ suffices) there exist $u_{t,k}$ for $t = 0, \ldots, p$ such that $\sum_{t=0}^{p} u_{t,k}^\top \mathrm{vec}(\bar x_j^{\otimes t}) = \big(\nabla \ell_{f^*(\bar x_j)}(f(\bar x_j;\theta))\big)_k \exp(-w_k^\top \bar x_j - b_k)$ for all $j \in \{1, \ldots, m'\}$ (similarly to the proof of Theorem 1). The third equality follows from Equation (12). Here, Equation (12) still holds since it is obtained in the proof of Theorem 1 under only the assumption that the function $\ell_{y_i} : q \mapsto \ell(q, y_i)$ is differentiable for any $i \in \{1, \ldots, m\}$, which is still satisfied by Assumption 2. Since the leftmost expression is a sum of squares and equals zero, every summand vanishes. This implies that for all $i \in \{1, \ldots, m\}$, $\nabla \ell_{y_i}(f(x_i;\theta)) = 0$, which proves statement (iii) because of Assumption 2. Statement (i) directly follows from statement (iii). Statement (ii) directly follows from Equation (9).
C Proof of Theorem 4

The proofs of Theorems 1 and 2 (including the proof via the PGB necessary condition) are designed such that the proof of Theorem 4 is simple, as shown below. Given a function $\varphi$ with values $\varphi(q) \in \mathbb{R}^{d}$ and a vector $v \in \mathbb{R}^{d'}$, let $\frac{\partial \varphi(q)}{\partial v}$ denote the $d \times d'$ matrix with entries $\big(\frac{\partial \varphi(q)}{\partial v}\big)_{i,j} = \frac{\partial (\varphi(q))_i}{\partial v_j}$.

Proof of Theorem 4.
Let Assumption 1 hold (instead of Assumptions 2 and 3). In both versions of our proofs of Theorem 1, $\theta$ was arbitrary and $(a, b, W)$ was an arbitrary local minimum of $\tilde L|_\theta(a, b, W) := \tilde L(\theta, a, b, W)$. Thus, the same proof shows that, for any $\theta$, at every local minimum $(a, b, W) \in \mathbb{R}^{d_y} \times \mathbb{R}^{d_y} \times \mathbb{R}^{d_x \times d_y}$ of $\tilde L|_\theta$, $\theta$ is a global minimum of $L$. Thus, based on the logical equivalence ($p \rightarrow q \equiv \neg q \rightarrow \neg p$), if $\theta$ is not a global minimum of $L$, then there is no local minimum $(a, b, W) \in \mathbb{R}^{d_y} \times \mathbb{R}^{d_y} \times \mathbb{R}^{d_x \times d_y}$ of $\tilde L|_\theta$, proving the first statement in the case of using Assumption 1. Instead of Assumption 1, if Assumptions 2 and 3 hold, then the exact same proof as above (with Theorem 1 replaced by Theorem 2) proves the first statement.

Example 1 with the square loss or the smoothed hinge loss suffices to prove the second statement. However, to obtain better theoretical insight, let us consider a more general construction of the desired tuples $(\ell, f, \{(x_i, y_i)\}_{i=1}^{m})$ to prove the second statement. Let $\theta \in \mathbb{R}^{d_\theta}$. In addition, let
$$A[\theta] = \frac{1}{m}\left[\Big(\frac{\partial f(x_1;\theta)}{\partial \theta}\Big)^\top \cdots \Big(\frac{\partial f(x_m;\theta)}{\partial \theta}\Big)^\top\right] \in \mathbb{R}^{d_\theta \times (m d_y)}$$
be a matrix, and let
$$r[\varphi] = \big[\nabla \ell_{y_1}(\varphi(x_1))^\top \cdots \nabla \ell_{y_m}(\varphi(x_m))^\top\big]^\top \in \mathbb{R}^{m d_y}$$
be a column vector given a function $\varphi : \mathbb{R}^{d_x} \rightarrow \mathbb{R}^{d_y}$. Then,
$$\frac{\partial L(\theta)}{\partial \theta} = \frac{1}{m}\sum_{i=1}^{m} \nabla \ell_{y_i}(f(x_i;\theta))^\top \frac{\partial f(x_i;\theta)}{\partial \theta} = \big(A[\theta]\, r[f(\cdot\,;\theta)]\big)^\top, \quad \text{and} \quad \frac{\partial \tilde L(\theta, a, b, W)}{\partial \theta} = \big(A[\theta]\, r[f(\cdot\,;\theta) + g(\cdot\,;a, b, W)]\big)^\top.$$
Here, the equality $A[\theta]\, r[f(\cdot\,;\theta)] = 0$ is equivalent to $r[f(\cdot\,;\theta)] \in \mathrm{Null}(A[\theta])$, where $\mathrm{Null}(A[\theta])$ is the null space of the matrix $A[\theta]$. Therefore, any tuple $(\ell, f, \{(x_i, y_i)\}_{i=1}^{m})$ such that $r[f(\cdot\,;\theta)] \in \mathrm{Null}(A[\theta]) \Rightarrow r[f(\cdot\,;\theta) + g(\cdot\,;a, b, W)] \in \mathrm{Null}(A[\theta])$ at a suboptimal $\theta$ suffices to provide a proof for the second statement. An (infinite) set of tuples $(\ell, f, \{(x_i, y_i)\}_{i=1}^{m})$ such that there exists a suboptimal $\theta$ of $L$ with $A[\theta] = 0$ (e.g., Example 1) satisfies this condition, which proves the second statement.
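As an aside, the factorization $\partial L(\theta)/\partial\theta = (A[\theta]\, r[f(\cdot\,;\theta)])^\top$ is easy to verify numerically. The snippet below does so for a made-up one-output toy model; the model, data, and sizes are arbitrary assumptions chosen only for illustration. In the same setup, a point with $A[\theta] = 0$ would clearly make both $\partial L/\partial\theta$ and $\partial\tilde L/\partial\theta$ vanish for any added-neuron parameters, which is the failure mode used above.

```python
import numpy as np

rng = np.random.default_rng(1)
d_x, m = 2, 5                                  # arbitrary toy sizes (d_y = 1)
X = rng.standard_normal((m, d_x))
y = rng.standard_normal(m)
theta = rng.standard_normal(d_x)

f = lambda th: np.tanh(X @ th)                 # toy model f(x_i; theta), one output unit
L = lambda th: np.mean((f(th) - y) ** 2)       # squared loss, ell_y(q) = (q - y)^2

# A[theta]: (1/m) * horizontally stacked (d f(x_i)/d theta)^T;  r: stacked loss gradients.
J = (1 - np.tanh(X @ theta) ** 2)[:, None] * X  # rows are d f(x_i; theta)/d theta
A = J.T / m                                     # shape (d_theta, m * d_y)
r = 2 * (f(theta) - y)                          # shape (m * d_y,)

# Finite-difference gradient of L for comparison with (A[theta] r)^T.
eps = 1e-6
fd = np.array([(L(theta + eps * e) - L(theta - eps * e)) / (2 * eps)
               for e in np.eye(d_x)])
print("analytic (A r):", A @ r)
print("finite diff   :", fd)                    # the two should agree closely
```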
D Additional numerical examples for good cases

Regarding the use of $\tilde L$ instead of $L$, we show the failure mode and 'bad-case' scenarios in Section 6 and Appendix E. Accordingly, for balance, this section considers some 'good-case' scenarios where using $\tilde L$ instead of $L$ helps the optimization of $L$. Figure 3 shows the histograms of training loss values after training with original networks $f$ minimizing $L$, and modified networks $\tilde f$ minimizing $\tilde L$ with and without the failure mode detector based on Theorems 1, 2 and 4. We used a simple failure mode detector, which automatically restarted the optimizer from a random point during training when $\|a\| + \|b\| + \|W\| \ge 7$.
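The snippet below sketches this kind of monitor in a toy setting. It is a minimal sketch only: the one-parameter model, the data, the learning rate, the regularization constant, and the choice of which parameters are re-drawn on a restart are all assumptions made for illustration, whereas the experiments reported in this section used a convolutional network with the threshold stated above.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y, lam = 1.0, -1.0, 1e-2                       # one toy data point (assumed values)
f = lambda th, x: np.tanh(th * x)                 # toy "network" with a single parameter

def L_tilde(p):
    """Modified objective: loss of f(x) + a*exp(w*x + b), plus lambda * a^2."""
    th, a, b, w = p
    out = f(th, x) + a * np.exp(w * x + b)
    return (out - y) ** 2 + lam * a ** 2

def num_grad(fun, p, eps=1e-6):
    return np.array([(fun(p + eps * e) - fun(p - eps * e)) / (2 * eps)
                     for e in np.eye(len(p))])

p = np.array([0.5, 0.0, 0.0, 0.0])                # parameters (theta, a, b, w)
lr, threshold = 0.01, 7.0
for step in range(2000):
    p -= lr * num_grad(L_tilde, p)
    # Failure-mode monitor: if the added-neuron parameters blow up, restart them randomly.
    if np.abs(p[1]) + np.abs(p[2]) + np.abs(p[3]) >= threshold:
        p[1:] = 0.1 * rng.standard_normal(3)
print("final objective:", L_tilde(p))
```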
The histograms were plotted with the results of 1000 random trials for the Semeion dataset and of 100 random trials for the KMNIST dataset, for each method. Semeion (Brescia, 1994) is a dataset of handwritten digits and KMNIST (Clanuwat et al., 2018) is a dataset of Japanese letters. We used the exact same experimental settings for both the original networks $f$ and the modified networks $\tilde f$ with and without the failure mode detector. We used a standard variant of LeNet (LeCun et al., 1998) with ReLU activations: two convolutional layers with 64 filters of size 5 × 5.

Figure 3 (panels (a) Semeion and (b) KMNIST): Histogram of loss values after training with original networks $f$ minimizing $L$ (original), modified networks $\tilde f$ minimizing $\tilde L$ (elimination of local minima), and modified networks $\tilde f$ minimizing $\tilde L$ with the failure mode detector (elimination & failure mode monitor). The plotted training loss values are the values of the standard training objective $L$ for both the original networks $f$ (minimizing $L$) and the modified networks $\tilde f$ (minimizing $\tilde L$) with and without the failure mode detector. The elimination of local minima helped a gradient-based method for Semeion, and did not help it much for KMNIST. For KMNIST, the novel failure mode of the elimination was detected by monitoring the norms of $(a, b, W)$ to restart and search a better subspace.
E Additional numerical and analytical examples to illustrate the failure mode

Figure 4 illustrates the novel failure mode proven by Theorem 4. The setting used for plotting Figure 4 is exactly the same as that in Figure 1 (i.e., Example 1), except that $\ell(f(x;\theta), y) = (f(x;\theta) - y)^2$ and $y = f(x;\, 0.\ldots)$.

Figure 4 (panels (a) original objective function $L$, (b) modified objective function $\tilde L$, (c) negative gradient directions of $\tilde L$): Illustration of the failure mode suggested by Theorem 4 with the squared loss. Qualitatively identical behavior to that in Figure 1 can be observed.

Examples 6 and 7 illustrate the same phenomena as those in Examples 3 and 4 with a smoothed hinge loss instead of the squared loss.
Example 6.
Let $m = 1$ and $d_y = 1$. In addition, $L(\theta) = \ell(f(x;\theta), y) = \max(0,\, 1 - y f(x;\theta))^3$. Accordingly, $\tilde L(\theta, a, b, W) = \max(0,\, 1 - y f(x;\theta) - y a \exp(w^\top x + b))^3 + \lambda a^2$. Let $\theta$ be a non-global minimum of $L$ such that $f(x;\theta) \neq y$; in particular, set $f(x;\theta) = -y = 1$. Then, $L(\theta) = 8$. If $(a, b, W)$ is a local minimum, we must have $a = 0$ similarly to Example 3, yielding $\tilde L(\theta, a, b, W) = 8$. However, a point with $a = 0$ is not a local minimum, since with $a < 0$ of sufficiently small magnitude, $\tilde L(\theta, a, b, W) = \big(2 - |a| \exp(w^\top x + b)\big)^3 + \lambda a^2 < 8$. Hence, there is no local minimum $(a, b, W) \in \mathbb{R} \times \mathbb{R} \times \mathbb{R}^{d_x}$ of $\tilde L|_\theta$. Indeed, if we set $a = -2\exp(-1/\epsilon)$ and $b = 1/\epsilon - w^\top x$, then $\tilde L(\theta, a, b, W) = 4\lambda \exp(-2/\epsilon) \to 0$ as $\epsilon \to 0^+$,
and hence the infimum of $\tilde L|_\theta$ is approached only as $a \to 0^-$ and $b \to \infty$. This illustrates the case in which $(a, b)$ does not attain a solution in $\mathbb{R} \times \mathbb{R}$. The identical conclusion holds in the general case of $f(x;\theta) \neq y$ by the same logic.
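A quick numerical check of this limit is given below; it is a minimal sketch in which the values of $x$, $w$, and $\lambda$, as well as the grid of $\epsilon$ values, are arbitrary assumptions made only for illustration.

```python
import numpy as np

# Example 6 setup: y = -1, f(x; theta) = 1, smoothed hinge loss max(0, 1 - y*q)^3.
x, y, f_theta, lam, w = 0.7, -1.0, 1.0, 1e-2, 0.3     # arbitrary toy values

def L_tilde(a, b):
    q = f_theta + a * np.exp(w * x + b)                # modified network output
    return max(0.0, 1.0 - y * q) ** 3 + lam * a ** 2

for eps in [0.2, 0.1, 0.05, 0.02]:
    a = -2.0 * np.exp(-1.0 / eps)                      # a -> 0 from below
    b = 1.0 / eps - w * x                              # b -> infinity
    print(f"eps={eps:5.2f}  a={a: .3e}  b={b:7.2f}  L_tilde={L_tilde(a, b):.3e}")
# The printed objective values decrease toward 0, while L(theta) = 8 at a = 0.
```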
Example 7.

Let $m = 2$ and $d_y = 1$. In addition, $L(\theta) = \max(0,\, 1 - y_1 f(x_1;\theta))^3 + \max(0,\, 1 - y_2 f(x_2;\theta))^3$. Moreover, let $x_1 \neq x_2$. Finally, let $f(x_1;\theta) = -1$, $f(x_2;\theta) = 1$, $y_1 = 1$, and $y_2 = -1$.
If $(a, b, W)$ is a local minimum, we must have $a = 0$ similarly to Example 3, yielding $\tilde L(\theta, a, b, W) = 16$. However, a point with $a = 0$ is not a local minimum, which follows from the perturbations of $(a, W)$ in the same manner as in Example 4. Therefore, there is no local minimum $(a, b, W)$ of $\tilde L|_\theta$. Indeed, if we set $a = -2\exp(-1/\epsilon)$, $b = 1/\epsilon - w^\top x_2$, and $w = -\epsilon^{-1}(x_1 - x_2)$, then
$$\tilde L(\theta, a, b, W) = \big(2 + 2\exp(-\|x_1 - x_2\|^2/\epsilon)\big)^3 + 4\lambda \exp(-2/\epsilon) \to 8 \;(< 16) \quad \text{as } \epsilon \to 0^+,$$
and hence the infimum of $\tilde L|_\theta$ is approached only as $a \to 0^-$, $b \to \infty$ and $\|w\| \to \infty$, illustrating the case in which $(a, b, W)$ does not attain a solution in $\mathbb{R} \times \mathbb{R} \times \mathbb{R}^{d_x}$.
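As with Example 6, this limit can be checked numerically; the sketch below uses arbitrary assumed values for the distinct points $x_1$, $x_2$ and the constant $\lambda$, chosen only for illustration.

```python
import numpy as np

# Example 7 setup: f(x1; theta) = -1 with y1 = 1, and f(x2; theta) = 1 with y2 = -1.
x1, x2 = np.array([1.0, 0.0]), np.array([-1.0, 0.5])   # arbitrary distinct points
y1, y2, f1, f2, lam = 1.0, -1.0, -1.0, 1.0, 1e-2

def L_tilde(a, b, w):
    q1 = f1 + a * np.exp(w @ x1 + b)                    # modified output at x1
    q2 = f2 + a * np.exp(w @ x2 + b)                    # modified output at x2
    return (max(0.0, 1 - y1 * q1) ** 3
            + max(0.0, 1 - y2 * q2) ** 3
            + lam * a ** 2)

for eps in [0.5, 0.2, 0.1, 0.05]:
    w = -(x1 - x2) / eps
    a = -2.0 * np.exp(-1.0 / eps)
    b = 1.0 / eps - w @ x2
    print(f"eps={eps:5.2f}  L_tilde={L_tilde(a, b, w):.6f}")
# The values approach 8, strictly below L_tilde = 16 at a = 0, yet 8 is never attained.
```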