Noise-induced degeneration in online learning
Yuzuru Sato
RIES / Department of Mathematics, Hokkaido University, Kita 20 Nishi 10, Kita-ku, Sapporo 001-0020, Japan
London Mathematical Laboratory, 8 Margravine Gardens, London W6 8RH, UK
Daiji Tsutsui
Department of Mathematics, Osaka University, Toyonaka, Osaka 560-0043, Japan
Akio Fujiwara
Department of Mathematics, Osaka University, Toyonaka, Osaka 560-0043, Japan
Abstract
In order to elucidate the plateau phenomena caused by the vanishing gradient, we herein analyse the stability of stochastic gradient descent dynamics near degenerated subspaces in a multi-layer perceptron. We show that, in the Fukumizu-Amari model, attracting regions exist in the degenerated subspace, and a novel type of strong plateau phenomenon emerges as a noise-induced phenomenon, which makes learning much slower than the deterministic gradient descent dynamics. The noise-induced degeneration observed herein is expected to be found in a broad class of online learning in perceptrons.
1. Machine learning as a random dynamical system
Dynamics of learning is characterised as (i) non-autonomous dynamics driven by uncertain input sequences from the external world, and (ii) multi-scale dynamics which consists of slow memory dynamics and fast system dynamics. When the uncertain input sequences are modelled by stochastic processes, the dynamics of learning is described by a random dynamical system. The random dynamical systems approach, in contrast to a traditional Fokker-Planck approach in statistical physics, permits the study not only of stationary distributions and global statistics, but also of the pathwise structure of nonlinear stochastic dynamics. Quantitative properties in machine learning can be discussed with stability and bifurcation analysis.

We start with a simple model of learning given by a gradient dynamics driven by the Ornstein-Uhlenbeck process,

dθ = −η ∇_θ l(x; θ) dt,    dx = −γx dt + κ dW_t,    (1)

where θ = (w₁, ..., w_n) ∈ R^n, x ∈ R, η, γ, κ > 0, and W_t is the Wiener process. The slope of the potential l(x; θ) with the coefficient η defines the gradient. The stationary distribution of x is given as N(0, σ²), where σ² = κ²/(2γ), when x(0) = 0. The parameter γ is the decay rate of the Ornstein-Uhlenbeck process. Gradient dynamics with external force under Langevin noise has been studied in statistical and nonlinear physics (e.g. [7]). When γ is large, our model is reduced to

dθ = −η ∇_θ l(x; θ) dt,    (2)

where x is an i.i.d. random variable subject to ρ(x) = N(0, σ²). The gradient descent dynamics is given by the average dynamics of Eq. (2) with the averaged potential E_x[l(x; θ)]:

dθ = −η ∇_θ E_x[l(x; θ)] dt,    (3)

where E_x[·] denotes the ensemble average over x. In studies on machine learning, the parameter η is known as the learning rate, the external input x as the training data, and the averaged potential E_x[l(x; θ)] as the loss function that is to be minimised. The approximation in Eq. (3) corresponds to the assumption that a large set of training data x, i.i.d. sampled from ρ(x), is given to the system at once, yielding the exact averaged potential E_x[l(x; θ)]. For a finite data set the dynamics is given as Eq. (2) and is called a stochastic gradient descent dynamics. The loss function l(x; θ) is typically given by a function norm ||f(x; θ) − f(x; θ*)||, where f(x; θ*) is the optimal function. We adopt a multi-layer perceptron, a class of feed-forward neural networks, as the parametric model f(x; θ). It is known that a perceptron with a single hidden layer is a universal function approximator [6].

It is also known that, in many cases, a perceptron with gradient descent learning exhibits slow relaxation to the optimum, known as a plateau phenomenon (see Fig. 1), which makes the dynamics of learning stagnant. The plateau phenomenon is a chain of slow dynamics near attracting regions, during which the loss function is reduced by an extremely small amount per unit time. Each plateau or trapping is caused by a saddle set or a Milnor attractor [9] in a degenerated subspace f(x; θ_d) of Eq. (3), resulting in slow convergence to the optimal attractors [8, 5, 18].

Learning in perceptrons with a finite batch size S, in particular with S = 1, is called online learning. Online learning is modelled by a stochastic gradient descent dynamics, which, however, is not well understood theoretically. Recently, plateau phenomena in online learning have been studied, adopting an averaged stochastic gradient descent dynamics as an approximation,

dθ̄ = −η E_x[∇_θ̄ l(x; θ̄)] dt,    (4)

where θ̄ is the average of θ at each time t [1]. However, such an approximation is not fully valid in online learning [17]. In order to elucidate the underlying mechanism of the plateau phenomena in perceptrons, we here analyse plateau phenomena in stochastic gradient descent dynamics from the viewpoint of random dynamical systems theory. We show that, in the Fukumizu-Amari model, attracting regions exist in multiply degenerated subspaces, and a strong plateau phenomenon emerges as a noise-induced phenomenon.
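The reduction from Eq. (1) to Eq. (2) can be checked numerically. The following sketch (not part of the original paper) integrates Eq. (1) with an Euler-Maruyama step for the Ornstein-Uhlenbeck input and a forward-Euler step for the parameter, using a toy quadratic potential l(x; θ) = ½(θ − x)² as a placeholder for the perceptron loss introduced below; all names, step sizes, and parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(eta=0.1, gamma=10.0, dt=1e-3, T=50.0, sigma=1.0):
    """Euler-Maruyama integration of Eq. (1) with the toy loss
    l(x; theta) = 0.5 * (theta - x)**2, so grad_theta l = theta - x.
    kappa is chosen so that the OU input has stationary variance
    kappa**2 / (2 * gamma) = sigma**2; for large gamma the input
    decorrelates quickly and the dynamics approaches Eq. (2)."""
    kappa = sigma * np.sqrt(2.0 * gamma)
    theta, x = 1.0, 0.0
    for _ in range(int(T / dt)):
        theta += -eta * (theta - x) * dt                                    # d theta
        x += -gamma * x * dt + kappa * np.sqrt(dt) * rng.standard_normal()  # d x (OU)
    return theta

# slow vs. fast input decay; both relax towards the minimiser of the averaged potential
print(simulate(gamma=10.0), simulate(gamma=100.0))
```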
2. Stochastic gradient descent in perceptrons
The minimal model of multi-layer perceptrons which exhibits the plateau phenomena is a three-layer perceptron with gradient descent dynamics [8].
Figure 1: A schematic view of the plateau phenomena (left) and stagnant dynamics near the degenerated subspace f(x; θ_d) (right). The dynamics of learning slows down due to the vanishing gradient and eventually escapes to the optimal function f(x; θ*).
Figure 2: The three-layer perceptron: the nodes are activation functions given by tanh(·), and each edge indicates a linear superposition with parameters (w₁, w₂, v₁, v₂). The output y is a function of the input x and the parameters (w₁, w₂, v₁, v₂).

The equation of motion of the stochastic gradient descent learning (2) can be equivalently given as the following discrete-time random map:

θ(t+1) = θ(t) − η ∇_θ l(x; θ(t)),    (t = 1, 2, ...)    (5)

where

θ = (w₁, w₂, v₁, v₂) ∈ Θ,    (6)
l(x; θ) = ½ (f(x; θ) − T(x))²,    (7)
f(x; θ) = v₁ tanh(w₁x) + v₂ tanh(w₂x).    (8)

Here, Θ is a domain of R⁴ called the parameter space, x is an i.i.d. random variable subject to ρ(x), η ∈ [0, 1] is the learning rate, and T(x) is the target function to be learnt. The Fukumizu-Amari model [8] is given by Eq. (5) with a target function

T(x) = 2 tanh(x) − tanh(4x),    (9)

and the probability distribution of the training data

ρ(x) = N(0, σ²).    (10)

The optimal function f(x; θ*) with parameters θ*, which minimise l(x; θ), is given by

θ* = (1, 4, 2, −1), (−1, 4, −2, −1), (1, −4, 2, 1), (−1, −4, −2, 1), (4, 1, −1, 2), (4, −1, −1, −2), (−4, 1, 1, 2), (−4, −1, 1, −2).    (11)

The plateau phenomenon is caused by neutral stability in function space. Typically, it emerges near the degenerated subspace in the parameter space Θ. Given Eqs. (5)-(8), the following degenerated subspaces

θ_d = (w, w, v₁, 2v − v₁), (w, −w, v₁, v₁ − 2v), (w₁, w, 0, 2v), (w, w₂, 2v, 0)    (12)

define a class of degenerated functions

f(x; θ_d) = 2v tanh(wx).    (13)

In this paper, we focus on the interplay between the degenerated subspaces M_w = {θ | w₁ = w₂ = w} and M_wv = {θ | w₁ = w₂ = w, v₁ = v₂ = v}. In the degenerated subspace M_w, the effective degrees of freedom of Eq. (5) decrease to 3. Although we still have the free variable v₁ − v₂ in M_w, it does not contribute to a better function approximation. Furthermore, when a multiple degeneration to M_wv occurs, the effective degrees of freedom decrease to 2, and the dynamics cannot even return to the full parameter space Θ.

In many cases, the dynamics of learning stays near the degenerated subspace for a very long time. This trapping phenomenon is caused by neutral stability with vanishing gradients. Although the dynamics may eventually escape towards the global optimum by fluctuations, the dynamics of learning shows extremely slow convergence to the optimum. In some cases, the residual time near these degenerated subspaces follows a power law and the system shows anomalous statistics, which is similar to intermittency in random dynamical systems [14]. In online learning, in addition to this neutrally stable trapping, "stronger" trapping based on multiple degeneration may occur as a noise-induced phenomenon [10, 16, 11]. We call this type of degeneration noise-induced degeneration in online learning.
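For concreteness, the random map (5) with Eqs. (6)-(10) can be implemented directly. The following is a minimal sketch under our own naming conventions; the function names, the initial-condition range, and the random seed are illustrative choices, not taken from the paper.

```python
import numpy as np

def T(x):
    """Target function, Eq. (9)."""
    return 2.0 * np.tanh(x) - np.tanh(4.0 * x)

def f(x, theta):
    """Three-layer perceptron output, Eq. (8); theta = (w1, w2, v1, v2)."""
    w1, w2, v1, v2 = theta
    return v1 * np.tanh(w1 * x) + v2 * np.tanh(w2 * x)

def sgd_step(theta, x, eta=0.1):
    """One step of the random map (5): theta <- theta - eta * grad_theta l(x; theta)."""
    w1, w2, v1, v2 = theta
    h = f(x, theta) - T(x)                        # residual f - T
    grad = np.array([
        x * v1 * h / np.cosh(w1 * x) ** 2,        # dl/dw1
        x * v2 * h / np.cosh(w2 * x) ** 2,        # dl/dw2
        np.tanh(w1 * x) * h,                      # dl/dv1
        np.tanh(w2 * x) * h,                      # dl/dv2
    ])
    return theta - eta * grad

rng = np.random.default_rng(1)
sigma = 1.0
theta = rng.uniform(-1.0, 1.0, size=4)            # hypothetical initial condition
for t in range(50000):
    theta = sgd_step(theta, sigma * rng.standard_normal())
print(theta, abs(theta[0] - theta[1]), abs(theta[2] - theta[3]))
```

The last two numbers are the distances to the degenerated subspaces M_w and (together with the first) M_wv introduced above.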
3. Strong plateau phenomena in multiply degenerated subspace
We focus on the dynamics of the following Fukumizu-Amari model:

w₁(t+1) = w₁(t) − ηx v₁(t) h(x; θ, T) / cosh²(w₁(t)x),    (14)
w₂(t+1) = w₂(t) − ηx v₂(t) h(x; θ, T) / cosh²(w₂(t)x),    (15)
v₁(t+1) = v₁(t) − η tanh(w₁(t)x) h(x; θ, T),    (16)
v₂(t+1) = v₂(t) − η tanh(w₂(t)x) h(x; θ, T),    (17)

where

h(x; θ, T) = v₁ tanh(w₁x) + v₂ tanh(w₂x) − T(x).    (18)

The learning rate is fixed to η = 0.1. The target function is given as T(x) = 2 tanh(x) − tanh(4x). The fluctuation σ of the training data is a control parameter. Our numerical experiments suggest that, for a broad region of large σ, a positive measure set of initial conditions, and a finite time, there exists attracting dynamics from the full space Θ to a degenerated subspace

M_w = {θ | w₁ = w₂ = w}.    (19)

Furthermore, in some cases with large σ, we observe attracting dynamics from the degenerated subspace M_w to the multiply degenerated subspace

M_wv = {θ | w₁ = w₂ = w, v₁ = v₂ = v}.    (20)

The attraction to M_w is caused by a local minimum of the averaged potential E_x[l(x; θ)]. Due to the valley formed by steep gradients of the averaged potential, the gradient dynamics is dominant compared with the stochastic effects and approaches a neighbourhood of M_w quickly (see Appendix A).

To investigate local dynamics near the degenerated subspace M_w, one can introduce the following coordinate system

p = (w₁ + w₂)/2,    q = (v₁ + v₂)/2,    r = (w₁ − w₂)/2,    s = (v₁ − v₂)/2,    (21)

and the corresponding transformed model

p(t+1) − p(t) = −(ηxh/2) [ (q(t)+s(t))/cosh²((p(t)+r(t))x) + (q(t)−s(t))/cosh²((p(t)−r(t))x) ],    (22)
q(t+1) − q(t) = −(ηh/2) [ tanh((p(t)+r(t))x) + tanh((p(t)−r(t))x) ],    (23)
r(t+1) − r(t) = −(ηxh/2) [ (q(t)+s(t))/cosh²((p(t)+r(t))x) − (q(t)−s(t))/cosh²((p(t)−r(t))x) ],    (24)
s(t+1) − s(t) = −(ηh/2) [ tanh((p(t)+r(t))x) − tanh((p(t)−r(t))x) ],    (25)

where

h(x; θ, T) = (q+s) tanh((p+r)x) + (q−s) tanh((p−r)x) − T(x).    (26)

We focus on the dynamics near M_w which approaches M_wv, keeping weak synchronisation near M_w. Assuming r ≃ 0, we have

s(t+1) = −ηrx h̃ / cosh²(px) + [1 − 2ηr²x²/cosh⁴(px)] s(t) + O(r³),    (27)

where

h̃ = 2q tanh(px) − T(x).    (28)

Integration of the right-hand side of (27) over x yields

s(t+1) ≃ −C₀ r + [1 − C₁ r²] s(t),    (29)

where both C₀ and C₁ are nonzero constants due to the Gaussian integral of even functions. Note that the constant C₁ is bounded. For instance, when p > 1/2 and η = 0.1, we have 2ηx²/cosh⁴(px) < 1, and hence 0 < C₁ < 1. Assuming further that the quasi-stationary density near r = 0 is symmetric and has the average ⟨r⟩ = 0 and the variance ⟨r²⟩ ≡ κ² < 1, we integrate the right-hand side of (29) over the quasi-stationary distribution for r, to obtain

s(t+1) ≃ [1 − κ²C₁] s(t),    (30)

where 0 < κ²C₁ < 1. Thus s is contracting on average and approaches 0. If s is exactly 0, we observe total synchronisation with w₁ = w₂ and v₁ = v₂ on the degenerated subspace M_wv. As a dynamical phenomenon, we observe (i) global attraction to a neighbourhood of M_w, and (ii) local attraction from the neighbourhood of M_w to a neighbourhood of M_wv.

The multiple degeneration from M_w to M_wv is a characteristic behaviour of the stochastic gradient descent dynamics. In the case of the deterministic gradient descent dynamics, the dynamics near M_w is almost neutral in the direction of s because r converges to 0 exponentially fast. To the contrary, in the stochastic gradient descent dynamics, the dynamics of s can be contracting because r fluctuates around 0. This phenomenon is comparable with noise-induced synchronisation in random dynamical systems. A typical example of synchronisation is given by uncoupled phase oscillators driven by common noise (see Appendix D). It is known that the Lyapunov exponent of the stochastic phase oscillator is negative while that of the deterministic dynamics is zero [16, 13]. In the following, we show that there exists yet another noise-induced trapping in the multiply degenerated subspace M_wv.

The equation of motion in M_wv is given by the following two-dimensional random map of w = w₁ = w₂ and v = v₁ = v₂:

w(t+1) = w(t) − ηx v(t) [2v(t) tanh(w(t)x) − T(x)] / cosh²(w(t)x),    (31)
v(t+1) = v(t) − η tanh(w(t)x) [2v(t) tanh(w(t)x) − T(x)],    (32)

or equivalently

( w(t+1) − w(t), v(t+1) − v(t) )ᵀ = η g(x; w, v, T(x)),    (33)

where

g(x; w, v, T) = −[2v tanh(wx) − T(x)] ( vx/cosh²(wx), tanh(wx) )ᵀ.    (34)

The Jacobian matrix of g is given by

J(x; w, v) = [ −2vx²(T(x) tanh(wx) − 3v tanh²(wx) + v)/cosh²(wx)    x(T(x) − 4v tanh(wx))/cosh²(wx) ]
             [ x(T(x) − 4v tanh(wx))/cosh²(wx)                       −2 tanh²(wx)                    ],    (35)

or equivalently

J(x; w, v) = −2 u uᵀ − (2v tanh(wx) − T(x)) K,    u = ( vx/cosh²(wx), tanh(wx) )ᵀ,
K = [ −2vx² tanh(wx)/cosh²(wx)    x/cosh²(wx) ]
    [ x/cosh²(wx)                  0           ].    (36)

Let the eigenvalues of J at a point (w, v) be µ₋(x; w, v) and µ₊(x; w, v) with µ₋ ≤ µ₊ (see Appendix B). Since

det K = −( x/cosh²(wx) )² < 0    (37)

for x ≠ 0, the second term of Eq. (36) has both a positive and a negative eigenvalue if 2v tanh(wx) − T(x) ≠ 0. The first term of Eq. (36) is negative semidefinite. Thus, µ₋(x; w, v) is negative whenever 2v tanh(wx) − T(x) ≠ 0. Therefore, for sufficiently small η, the point (w, v) is attracting when µ₊(x; w, v) is non-positive; otherwise, it is a saddle. In this way, the dynamics near (w, v) on M_wv is characterised by the sign of µ₊(x; w, v) as long as η is small.
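The attracting/saddle criterion based on the sign of µ₊(x; w, v) can be evaluated by Monte Carlo sampling of x ∼ N(0, σ²). The sketch below is ours and uses the Jacobian (35) as reconstructed above; the sample size and seed are arbitrary. The quantity it estimates is the probability π(w, v) of Eq. (38) below.

```python
import numpy as np

def T(x):
    return 2.0 * np.tanh(x) - np.tanh(4.0 * x)

def jacobian(x, w, v):
    """Jacobian (35) of g(x; w, v) on M_wv (symmetric 2x2 matrix)."""
    t, c2 = np.tanh(w * x), np.cosh(w * x) ** 2
    a = -2.0 * v * x**2 * (T(x) * t - 3.0 * v * t**2 + v) / c2
    b = x * (T(x) - 4.0 * v * t) / c2
    d = -2.0 * t**2
    return np.array([[a, b], [b, d]])

def pi_attract(w, v, sigma=1.0, n=20000, seed=0):
    """Monte Carlo estimate of Prob_x[ mu_plus(x; w, v) <= 0 ] with x ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    xs = sigma * rng.standard_normal(n)
    count = 0
    for x in xs:
        mu_plus = np.linalg.eigvalsh(jacobian(x, w, v))[-1]  # largest eigenvalue
        count += mu_plus <= 0.0
    return count / n

# large vs. small fluctuation at a fixed point of M_wv
print(pi_attract(0.5, 0.5, sigma=1.0), pi_attract(0.5, 0.5, sigma=0.1))
```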
We investigate the dynamics near (w, v) = (1/2, 1/2) by computing the eigenvalues µ±(x; 1/2, 1/2) of J(x; w, v) explicitly. We see from Fig. 3 (left) that the dynamics is attracted to the point (w, v) = (1/2, 1/2) when the fluctuation σ is sufficiently large. Put differently, the large fluctuation may "stabilise" the dynamics near (w, v) in M_wv. As a result, the residual time near M_wv is extended and a stronger plateau phenomenon is observed.

Figure 3: (Left) The eigenvalues µ₊(x; 1/2, 1/2) (red) and µ₋(x; 1/2, 1/2) (blue) of J(x; 1/2, 1/2), as well as the distribution ρ(x), depicted as functions of x. The parameters are set as T(x) = 2 tanh(x) − tanh(4x), η = 0.1, and σ = 0.1, 1. When the fluctuation σ is small, x is frequently sampled near zero, and the point (w, v) = (1/2, 1/2) is a saddle point; otherwise, it is an attracting point. (Right) The probability π(w, v) = Prob[µ₊(x; w, v) ≤ 0] for σ = 1, plotted on M_wv. The red curve C indicates the valley formed by steep gradients of the averaged potential (see Appendix A).

Fig. 3 (right) shows the numerically computed probability distribution

π(w, v) = Prob[ µ₊(x; w, v) ≤ 0 ],    (38)

with σ = 1, where higher values of π(w, v) correspond to darker tones. The red curve C indicates the approximated one-dimensional valley formed by steep gradients of the averaged potential. In this case, C includes a local minimum of the averaged potential. The dark grey region, where π(w, v) is close to 1, corresponds to the attracting region in M_wv. To the contrary, with the smaller fluctuation σ = 0.1, the grey attracting region in Fig. 3 (right) disappears because most points in M_wv are saddle points. Thus, there exists another noise-induced phenomenon, i.e., the emergence of an attracting region on M_wv due to the large fluctuation σ, resulting in a substantial extension of the escape time from M_wv.

Figure 4: Finite-time pullback attractors (see Appendix C) with the pullback times τ = 1000, τ = 10000, τ = 30000, and τ = 100000 in the full space Θ. Parameters are set as T(x) = 2 tanh(x) − tanh(4x), η = 0.1, and σ = 0.1 (left) and σ = 1 (right). The red and blue dots represent paths of (w₁, w₂) and (v₁, v₂), respectively, starting from different initial conditions. Both dynamics are plotted together in each panel. The grey points correspond to the optimal attractors θ*. The synchronisation manifolds w₁ = w₂ and v₁ = v₂ are depicted as a single line. A typical noise realisation {x_t} is fixed and the dynamics is developed with 10 different initial conditions θ(0). When σ = 1, the trapping dynamics near M_wv is observed in the attracting region indicated by a dashed circle.

The global dynamics of stochastic gradient descent in Θ is shown in Fig. 4. It depicts converging dynamics to pullback attractors [12] (see Appendix C) in the Fukumizu-Amari model with T(x) = 2 tanh(x) − tanh(4x), η = 0.1, and σ = 0.1, 1. The red and blue dots represent paths of (w₁, w₂) and (v₁, v₂), respectively, starting from different initial conditions. Both dynamics are plotted together in each panel. The grey points correspond to the optimal attractors θ*. The synchronisation manifolds w₁ = w₂ and v₁ = v₂ are depicted as a single line. In the case of σ = 1, clear noise-induced degeneration, i.e. |w₁ − w₂| → 0 near M_w at τ = 10000, followed by |v₁ − v₂| → 0 near M_wv at τ = 30000, is observed. Due to this type of noise-induced degeneration, a substantial number of sample paths stay near the attracting region in M_wv for an extremely long time, which is shown in a dashed circle at τ = 100000. In sum, noise-induced degeneration and plateau phenomena emerge through the following three processes:

1. Global attraction to a neighbourhood of M_w.
2. Local attraction from the neighbourhood of M_w to a neighbourhood of M_wv.
3. Long-term trapping near attracting regions in M_wv.
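These three processes can be observed in a direct simulation of the random map (5). The following sketch is ours, with illustrative seed, initial condition, duration, and sampling interval; it tracks |w₁ − w₂| and |v₁ − v₂| along a single sample path for σ = 0.1 and σ = 1, and with the larger fluctuation both distances are expected to contract, reproducing the two-stage degeneration qualitatively.

```python
import numpy as np

def T(x):
    return 2.0 * np.tanh(x) - np.tanh(4.0 * x)

def step(theta, x, eta=0.1):
    """One step of the random map (5) for the Fukumizu-Amari model."""
    w1, w2, v1, v2 = theta
    h = v1 * np.tanh(w1 * x) + v2 * np.tanh(w2 * x) - T(x)
    grad = np.array([x * v1 * h / np.cosh(w1 * x) ** 2,
                     x * v2 * h / np.cosh(w2 * x) ** 2,
                     np.tanh(w1 * x) * h,
                     np.tanh(w2 * x) * h])
    return theta - eta * grad

def degeneration_trace(sigma, tau=30000, seed=2):
    """Distances to M_w and M_wv along one sample path, sampled every 1000 steps."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(-1.0, 1.0, size=4)        # hypothetical initial condition
    dw, dv = [], []
    for t in range(tau):
        theta = step(theta, sigma * rng.standard_normal())
        if t % 1000 == 0:
            dw.append(abs(theta[0] - theta[1]))   # |w1 - w2|: distance to M_w
            dv.append(abs(theta[2] - theta[3]))   # |v1 - v2|: extra distance to M_wv
    return np.array(dw), np.array(dv)

for sigma in (0.1, 1.0):
    dw, dv = degeneration_trace(sigma)
    print(sigma, dw[-1], dv[-1])
```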
4. Noise-induced phenomena in online learning
If there exists an attractor A on a subspace M, and A is not an attractor in the full space Θ, it is called a relative attractor in Θ [15]. Thus, if there exists an attractor on M_wv, it is a relative attractor. If there is no random compact invariant set in a random dynamical system, all random attractors are random Milnor attractors [2]. Our conjecture is that both the relative attractors (or attracting regions) in the degenerated subspace M_wv and the optimal attractors in the full space Θ are random Milnor attractors. The dynamics of learning may come and go between these attractors on a long time scale.

In general, non-optimal stable random attractors may exist in stochastic gradient descent learning with small η, because T(x), f(x; θ), and ∇_θ l(x; θ) are bounded for any ρ(x). In these cases, if an initial point θ(0) does not belong to the effective basins of the optimal attractors, the attracting region near θ_d may become a stable random attractor and the orbits can stay there for an arbitrarily long time. Further studies on the stability of attractors and bifurcations on the degenerated subspaces will be reported elsewhere.

In conclusion, in the Fukumizu-Amari model, there exist characteristic fluctuation sizes of the training data with which the dynamics shows strong plateau phenomena. Starting from an initial point in the full space, the dynamics of learning is attracted to a degenerated subspace by the gradient, and then is attracted to a multiply degenerated subspace, which we call noise-induced degeneration. When the fluctuation size is large, an attracting region emerges in the multiply degenerated subspace. Although this is finite-time trapping and the dynamics eventually escapes to the optimal attractor, the residual time near the attracting region in the multiply degenerated subspace can be extremely long. This phenomenon is expected to be observed in a broad class of online learning because of the universality of the presented noise-induced phenomena. Our approach would shed new light on various problems in machine learning from the viewpoint of random dynamical systems theory.
5. Acknowledgments
YS is supported by the external fellowship of London Mathematical Laboratory and the Grant-in-Aid for Scientific Research (C) No. 18K03441, JSPS. The authors are supported by the Grant-in-Aid for Scientific Research (B) No. 17H02861, JSPS.
Appendix A. Local minima in the averaged dynamics
In this paper, we have treated tanh(·) as the activation function. However, in order to analyse the averaged potential, we herein use the function h(x) := erf(x/√2) = √(2/π) ∫₀ˣ e^{−t²/2} dt as an approximation of tanh(·) to understand the qualitative behaviour of the dynamics. According to [3], for a network

f(x; θ) = v₁ h(w₁x) + v₂ h(w₂x),    (A.1)

and a target function which is denoted as

T(x) = ν₁ h(ω₁x) + ν₂ h(ω₂x),    (A.2)

for some values ν₁, ν₂, ω₁, ω₂ ∈ R, the averaged potential is given by

L(θ) = E_x[ ½ (f(x; θ) − T(x))² ]
     = (1/π) Σ_{i,j=1}² v_i v_j Φ(w_i, w_j) − (2/π) Σ_{i,a=1}² v_i ν_a Φ(w_i, ω_a) + const,    (A.3)

where

Φ(ζ₁, ζ₂) := arcsin( σ²ζ₁ζ₂ / (√(1 + σ²ζ₁²) √(1 + σ²ζ₂²)) ).    (A.4)

When w₁ ≠ 0 and w₂ ≠ 0, the function L(θ) is quadratic in (v₁, v₂), and thus has a minimiser. In particular, when w₁ = w₂ = w is fixed and v₁ = v₂ = v,

L*(w, v) := L(θ) = (4/π) ( v² Φ(w, w) − v Σ_{a=1}² ν_a Φ(w, ω_a) ) + const    (A.5)

takes its minimum at

v*(w; σ) = Σ_a ν_a Φ(w, ω_a) / ( 2 Φ(w, w) ).    (A.6)

Hence, a minimiser of L* lies in the one-dimensional valley {(w, v*(w; σ)) | w ∈ R}. For (ω₁, ω₂, ν₁, ν₂) = (1, 4, 2, −1) and σ = 1, in particular, we have the red curve C in Fig. 3 (right) as

v*(w; 1) = [ 2 arcsin( w / (√2 √(1 + w²)) ) − arcsin( 4w / (√17 √(1 + w²)) ) ] / [ 2 arcsin( w² / (1 + w²) ) ].    (A.7)
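The valley curve C of Fig. 3 (right) can be evaluated numerically from Eqs. (A.4) and (A.6). The sketch below is ours; the grid of w values is arbitrary, and (ω₁, ω₂, ν₁, ν₂) = (1, 4, 2, −1) follows Eq. (A.2) for the target (9).

```python
import numpy as np

def Phi(z1, z2, sigma=1.0):
    """Eq. (A.4)."""
    s2 = sigma ** 2
    return np.arcsin(s2 * z1 * z2 / (np.sqrt(1.0 + s2 * z1**2) * np.sqrt(1.0 + s2 * z2**2)))

def v_star(w, sigma=1.0, omega=(1.0, 4.0), nu=(2.0, -1.0)):
    """Eq. (A.6): bottom of the valley C of the erf-approximated averaged potential."""
    num = sum(n * Phi(w, o, sigma) for n, o in zip(nu, omega))
    return num / (2.0 * Phi(w, w, sigma))

ws = np.linspace(0.1, 3.0, 30)          # avoid w = 0, where Phi(w, w) vanishes
print(np.array([v_star(w) for w in ws]))  # samples of the red curve C
```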
Appendix B. Eigenvalues of the Jacobian of the dynamics in the multiply degenerated subspace

The Jacobian of g on the multiply degenerated subspace M_wv is given by

J = [ −2vx²(T(x) tanh(wx) − 3v tanh²(wx) + v)/cosh²(wx)    x(T(x) − 4v tanh(wx))/cosh²(wx) ]
    [ x(T(x) − 4v tanh(wx))/cosh²(wx)                       −2 tanh²(wx)                    ].    (B.1)

Since J is symmetric, its eigenvalues are

µ±(x; w, v) = ½ (J₁₁ + J₂₂) ± √( ¼ (J₁₁ − J₂₂)² + J₁₂² ),    (B.2)

where J₁₁, J₁₂, and J₂₂ denote the entries of (B.1).
Appendix C. Pullback attractors in random dynamical systems

Let θ act on the probability space of noise realisations Ω, and let θ_t ω be the path taken at time t by the noise realisation ω ∈ Ω. The random dynamical system is represented by the pair (θ, φ), where φ denotes the dynamics in the state space X, driven by a noise realisation θ_t ω. The pullback attractor A(t, ω) of a random dynamical system is defined as a random invariant set of X that satisfies

lim_{τ→∞} dist( φ(τ, θ_{t−τ} ω) B, A(t, ω) ) = 0,    (C.1)

for any bounded set B ⊂ X, where dist(C, D) denotes the Hausdorff distance between two subsets C and D of X [4].

We call the following τ-pullback image of B the finite-time pullback attractor or τ-pullback attractor:

Ã^B_τ(t, ω) = φ(τ, θ_{t−τ} ω) B,    (C.2)

where τ is called the pullback time. For a given τ, the set Ã^B_τ(t, ω) represents a finite space-time structure, which may include transient orbits and densities. Each invariant set in Fig. 4 is a finite-time pullback attractor with pullback time τ.
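The τ-pullback image (C.2) amounts to pushing a set B of initial conditions forward along one fixed noise realisation. The following generic sketch is ours; the scalar contracting map stands in for the learning dynamics only as an illustration.

```python
import numpy as np

def pullback_image(phi, B, noise_path):
    """tau-pullback image of Eq. (C.2): push every point of the set B forward
    along one fixed noise realisation; tau is the length of noise_path."""
    out = []
    for b in B:
        state = np.array(b, dtype=float)
        for x in noise_path:
            state = phi(state, x)
        out.append(state)
    return np.array(out)

# usage sketch: a scalar contracting map driven by common noise
rng = np.random.default_rng(3)
phi = lambda s, x: 0.9 * s + 0.1 * np.tanh(x)
B = [[-1.0], [0.0], [1.0]]
print(pullback_image(phi, B, rng.standard_normal(1000)))  # all points collapse onto one path
```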
Appendix D. Noise-induced synchronisation

A stochastic phase oscillator is given by

dφ = ω dt + sin φ ∘ dW_t,    (D.1)

in Stratonovich form, where ω is a constant, φ ∈ (0, 2π] is the phase on the circle, and W_t is the Wiener process with dW_t ∼ N(0, σ²dt). The linearisation along a fixed solution φ is given by

dψ = cos φ · ψ ∘ dW_t.    (D.2)

Let r = log|ψ|; then we have

dr = cos φ ∘ dW_t,    (D.3)

or, equivalently in Ito form,

dr = −(σ²/2) sin²φ dt + cos φ dW_t.    (D.4)

Thus, the Lyapunov exponent λ of (D.1) is given by

λ = lim_{T→∞} r(T)/T = −lim_{T→∞} (1/T) ∫₀ᵀ (σ²/2) sin²φ dt.    (D.5)

Assuming that the fluctuation σ is small, the dynamics is ergodic, and the invariant density is approximately uniform on the circle, we have

λ ≃ −(1/2π) ∫₀^{2π} (σ²/2) sin²φ dφ = −σ²/4 < 0.    (D.6)
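The estimate (D.6) can be checked numerically. The sketch below is ours; step size, integration time, and seed are arbitrary. It integrates (D.1) with a Heun predictor-corrector step, consistent with the Stratonovich interpretation, accumulates r = log|ψ| via (D.3), and compares r(T)/T with the small-noise prediction −σ²/4.

```python
import numpy as np

def lyapunov_estimate(omega=1.0, sigma=0.2, dt=1e-3, T=2000.0, seed=4):
    """Rough estimate of the Lyapunov exponent of the stochastic phase oscillator (D.1)."""
    rng = np.random.default_rng(seed)
    phi, r = 0.0, 0.0
    for _ in range(int(T / dt)):
        dW = sigma * np.sqrt(dt) * rng.standard_normal()   # dW ~ N(0, sigma^2 dt)
        # Heun step for d phi = omega dt + sin(phi) o dW
        pred = phi + omega * dt + np.sin(phi) * dW
        phi_new = phi + omega * dt + 0.5 * (np.sin(phi) + np.sin(pred)) * dW
        # accumulate dr = cos(phi) o dW, Eq. (D.3), with the same midpoint rule
        r += 0.5 * (np.cos(phi) + np.cos(pred)) * dW
        phi = phi_new
    return r / T

sigma = 0.2
print(lyapunov_estimate(sigma=sigma), -sigma**2 / 4.0)  # numerical estimate vs. (D.6)
```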
References

[1] Shun-ichi Amari, Tomoko Ozeki, Ryo Karakida, Yuki Yoshida, and Masato Okada. Dynamics of learning in MLP: Natural gradient and singularity revisited. Neural Computation, 30(1):1–33, 2018.
[2] Peter Ashwin. Minimal attractors and bifurcations of random dynamical systems. Proceedings of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, 455(1987):2615–2634, 1999.
[3] Michael Biehl and Holm Schwarze. Learning by online gradient descent. Journal of Physics A, 28:643–656, 1995.
[4] Mickaël D. Chekroun, Eric Simonnet, and Michael Ghil. Stochastic climate dynamics: Random attractors and time-dependent invariant measures. Physica D: Nonlinear Phenomena, 240(21):1685–1700, 2011.
[5] Florent Cousseau, Tomoko Ozeki, and Shun-ichi Amari. Dynamics of learning in multilayer perceptrons near singularities. IEEE Transactions on Neural Networks, 19:1313–1328, 2008.
[6] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
[7] Davide Faranda, Yuzuru Sato, Brice Saint-Michel, Cecile Wiertel, Vincent Padilla, Bérengère Dubrulle, and François Daviaud. Stochastic chaos in a turbulent swirling flow. Physical Review Letters, 119(1):014502, 2017.
[8] Kenji Fukumizu and Shun-ichi Amari. Local minima and plateaus in hierarchical structures of multilayer perceptrons. Neural Networks, 13(3):317–327, 2000.
[9] John Milnor. On the concept of attractor. In The Theory of Chaotic Attractors, pages 243–264. Springer, 1985.
[10] A. S. Pikovskii. Synchronization and stochastization of array of self-excited oscillators by external noise. Radiophysics and Quantum Electronics, 27(5):390–395, 1984.
[11] Y. Sato, T. S. Doan, N. T. The, and H. T. Tuan. An analytical proof for synchronization of stochastic phase oscillator. arXiv preprint arXiv:1801.02761, 2018.
[12] Yuzuru Sato, Mickaël D. Chekroun, and Michael Ghil. Convergence rate of snapshot attractors to random strange attractors. Submitted, 2020.
[13] Yuzuru Sato, Thai Son Doan, Jeroen S. W. Lamb, and Martin Rasmussen. Dynamical characterization of stochastic bifurcations in a random logistic map. arXiv preprint arXiv:1811.03994, 2018.
[14] Yuzuru Sato and Rainer Klages. Anomalous diffusion in random dynamical systems. Physical Review Letters, 122(17):174101, 2019.
[15] Joseph D. Skufca, James A. Yorke, and Bruno Eckhardt. Edge of chaos in a parallel shear flow. Physical Review Letters, 96(17):174101, 2006.
[16] Junnosuke Teramae and Dan Tanaka. Robustness of the noise-induced phase synchronization in a general class of limit cycle oscillators. Physical Review Letters, 93(20):204103, 2004.
[17] Daiji Tsutsui. Center manifold analysis of plateau phenomena caused by degeneration of three-layer perceptron. Neural Computation, 32:683–710, 2020.
[18] Haikun Wei, Jun Zhang, Florent Cousseau, Tomoko Ozeki, and Shun-ichi Amari. Dynamics of learning near singularities in layered networks.