True Asymptotic Natural Gradient Optimization
Yann Ollivier
Abstract
We introduce a simple algorithm, True Asymptotic Natural Gradient Optimization (TANGO), that converges to a true natural gradient descent in the limit of small learning rates, without explicit Fisher matrix estimation.

For quadratic models the algorithm is also an instance of averaged stochastic gradient, where the parameter is a moving average of a "fast", constant-rate gradient descent. TANGO appears as a particular de-linearization of averaged SGD, and is sometimes quite different on non-quadratic models. This further connects averaged SGD and natural gradient, both of which are arguably optimal asymptotically.

In large dimension, small learning rates will be required to approximate the natural gradient well. Still, this shows it is possible to get arbitrarily close to exact natural gradient descent with a lightweight algorithm.
Let p_θ(y|x) be a probabilistic model for predicting output values y from inputs x (x = ∅ for unsupervised learning). Consider the associated log-loss

    ℓ(y|x) := − ln p_θ(y|x)    (1)

Given a dataset D of pairs (x, y), we optimize the average log-loss over θ via a momentum-like gradient descent.

Definition 1 (TANGO).
Let δt_k be a sequence of learning rates and let γ > 0. Set v_0 = 0. Iterate the following:

• Select a sample (x_k, y_k) at random in the dataset D.

• Generate a pseudo-sample ỹ_k for input x_k according to the predictions of the current model, ỹ_k ∼ p_θ(ỹ_k | x_k) (or just ỹ_k = y_k for the "outer product" variant). Compute the gradients

    g_k ← ∂ℓ(y_k|x_k)/∂θ,    g̃_k ← ∂ℓ(ỹ_k|x_k)/∂θ    (2)

• Update the velocity and parameter via

    v_k = (1 − δt_{k−1}) v_{k−1} + γ g_k − γ (1 − δt_{k−1}) (v_{k−1}^⊤ g̃_k) g̃_k    (3)
    θ_k = θ_{k−1} − δt_k v_k    (4)

TANGO is built to approximate Amari's natural gradient descent, namely, a gradient descent preconditioned by the inverse of the Fisher information matrix of the probabilistic model p_θ (see definitions below). The natural gradient arguably provides asymptotically optimal estimates of the parameter θ [Ama98]. However, its use is unrealistic for large-dimensional models due to the computational cost of storing and inverting the Fisher matrix, hence the need for approximations. One of its key features is its invariance to any change of variable in the parameter θ (contrary to simple gradient descent). The natural gradient is also a special case of the extended Kalman filter from estimation theory [Oll17], under mild conditions.

In TANGO, δt/γ should be small for a good natural gradient approximation.

For stability of the update (3) of v, γ should be taken small enough; but a small γ brings slower convergence to the natural gradient. A conservative, theoretically safe choice is setting γ = 1 / max_k ‖g̃_k‖², using the largest norm of g̃ seen so far. This may produce a too small γ if gradients are unbounded. If the gradients follow a Gaussian distribution (with any covariance matrix), then γ = 1 / (3 E‖g̃‖²) is theoretically safe; the average can be estimated on past gradients. In general, γ ≤ E‖g̃‖² / E‖g̃‖⁴ is a necessary but not sufficient condition; this may be used as a starting point. (See the discussion after Theorem 5.)

TANGO enjoys the following properties:

1. TANGO converges to an exact natural gradient trajectory when the learning rate δt tends to 0 with γ fixed, namely, to the trajectory of the ordinary differential equation dθ/dt = − J(θ)^{−1} E[∂ℓ/∂θ], with J the Fisher matrix at θ (Theorem 3).

2. For δt = 1, TANGO is an ordinary gradient descent with constant learning rate γ.

3. For quadratic losses, TANGO is an instance of averaged stochastic gradient descent with additional noise (Proposition 2): a "fast" stochastic gradient descent with constant learning rate is performed, and the algorithm returns a moving average of this trajectory (updated by a factor δt_k at each step). However, for non-quadratic losses, TANGO can greatly differ from averaged SGD (Fig. 1).

Thus, TANGO smoothly interpolates between ordinary and natural gradient descent when the learning rate decreases.

To illustrate the convergence to the natural gradient in an informal way, take δt = 0. Then θ does not move, and the average of g is the gradient of the expected loss at θ. Then the average of v over time converges to (E g̃g̃^⊤)^{−1} E g, the exact natural gradient direction at θ. Indeed, this is the only fixed point of (3) in expectation. Actually, (3) is a way of solving for (E g̃g̃^⊤) v = E g by stochastic gradient descent on v. The Fisher matrix J is E g̃g̃^⊤ by definition.
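To make the updates (2)–(4) concrete, here is a minimal sketch of the TANGO loop in NumPy-style Python. The model interface (grad_loss, sample_output), the data format, and the use of a constant δt are illustrative assumptions, not part of the paper.

import numpy as np

def tango(theta0, data, grad_loss, sample_output, gamma, dt, n_steps, rng=None):
    """Minimal TANGO sketch (Definition 1), with a constant learning rate dt.

    grad_loss(theta, x, y)  -> gradient of -log p_theta(y|x) w.r.t. theta
    sample_output(theta, x) -> pseudo-sample y~ drawn from p_theta(.|x)
    gamma, dt               -> the constants gamma and delta-t of the algorithm
    """
    rng = rng or np.random.default_rng()
    theta = np.array(theta0, dtype=float)
    v = np.zeros_like(theta)                      # velocity, v_0 = 0
    for _ in range(n_steps):
        x, y = data[rng.integers(len(data))]      # random sample from the dataset
        y_tilde = sample_output(theta, x)         # pseudo-sample from the model
        g = grad_loss(theta, x, y)                # gradient on the true sample
        g_tilde = grad_loss(theta, x, y_tilde)    # gradient on the pseudo-sample
        # velocity update (3): note the rank-one term (v . g~) g~
        v = (1 - dt) * v + gamma * g - gamma * (1 - dt) * (v @ g_tilde) * g_tilde
        theta = theta - dt * v                    # parameter update (4)
    return theta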
Figure 1: Learning a Gaussian model N(µ, σ²) with unknown µ and σ, via gradient descent on (µ, ln σ). The initial point is N(0, 1) and the data are drawn from a Gaussian centered at µ = 10. Up to a rescaling of µ, the Fisher metric of this Gaussian family is the hyperbolic metric of the half-plane (µ, σ), whose geodesics are circles, so that the true natural gradient starts by increasing the variance so that µ moves faster. Plotted are trajectories of Euclidean stochastic gradient descent with a small constant learning rate, and of TANGO and averaged SGD with small constant γ and δt, together with the true natural gradient trajectory.
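The setup of Figure 1 is easy to reproduce; the sketch below defines the log-loss gradients and pseudo-sampling of this toy Gaussian model in the (µ, ln σ) parameterization and feeds them to the tango() sketch given after Definition 1. The hyperparameter values are placeholders chosen for illustration, not those used for the figure.

import numpy as np

_rng = np.random.default_rng(0)

# Toy model of Figure 1: Gaussian N(mu, sigma^2), parameterized as theta = (mu, log sigma).
def grad_loss(theta, x, y):
    """Gradient of -log p_theta(y) with respect to (mu, log sigma); x is unused."""
    mu, log_sigma = theta
    var = np.exp(2.0 * log_sigma)
    d_mu = (mu - y) / var                      # from (y - mu)^2 / (2 sigma^2)
    d_log_sigma = 1.0 - (y - mu) ** 2 / var    # from log sigma + (y - mu)^2 / (2 sigma^2)
    return np.array([d_mu, d_log_sigma])

def sample_output(theta, x):
    """Pseudo-sample y~ drawn from the current model N(mu, sigma^2)."""
    mu, log_sigma = theta
    return mu + np.exp(log_sigma) * _rng.normal()

# Illustrative setup: data centered at mu = 10, model initialized at N(0, 1).
data = [(None, 10.0 + _rng.normal()) for _ in range(1000)]
theta = tango([0.0, 0.0], data, grad_loss, sample_output,
              gamma=1e-2, dt=1e-3, n_steps=20_000)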
Acknowledgments.

The author would like to thank Léon Bottou, Guillaume Charpiat, Fabrice Debbasch, Aaron Defazio, Gaétan Marceau-Caron and Corentin Tallec for helpful discussions and comments around these ideas.
Related work.
Three different lines of work lead to TANGO-like algorithms.

Averaged SGD [PJ92, Rup88] uses a "fast" gradient descent with a large learning rate γ (here on the variable v), with an averaging operation on top (here by accumulation into θ). For linear problems γ can be kept constant. Averaged SGD achieves the asymptotically optimal Cramér–Rao bound involving the inverse Fisher matrix, although "no explicit Hessian inversion has been performed" [MB11, PJ92]. TANGO may clarify how the implicit Hessian or Fisher matrix inversion occurs. Later work on averaged SGD focussed on non-asymptotic behavior (especially, forgetting of the starting point), on somewhat dimension-independent bounds, and on larger γ for linear models [MB11, BM13, Mar14, DB15, DFB16]. A constant, large γ provides the most benefits; yet for nonlinear models, averaged SGD with constant γ leads to biases, hence the need for TANGO. Our analysis of the dynamics of v in TANGO and in Theorem 5 below follows this line of work.

Previous work on approximating the natural gradient for large-dimensional models, such as TONGA and others [LMB07, Oll15, MG15, DSP+15, MCO16], did not provide an arbitrarily good approximation to the Fisher matrix, as it relied on structural matrix approximations (diagonal, block-diagonal, diagonal plus small-rank...). An exception is [DPCB13] for Boltzmann machines, directly transposed from the Hessian-free Newton method of [Mar10, MS11, MS12]: at each step, a large number of auxiliary conjugate gradient steps are performed to solve for Fisher matrix inversion, before the main update of the parameter occurs. From this viewpoint, TANGO performs the main gradient descent on θ and the auxiliary gradient descent at the same time. For quasi-Newton methods in the convex case, auxiliary gradient descents to approximate the inverse Hessian have been suggested several times; see [ABH16, Mar10, MS11, MS12] and the references therein. Second-order methods for neural networks have a long history, see e.g. [LBOM98].

Third, "two-timescale" algorithms in reinforcement learning use updates reminiscent of TANGO, where the "fast" timescale is used to approximate a value function over a linear basis via a least squares method, and the "slow" timescale is used to adapt the parameters of a policy. For instance, the main results of [Tad04] or [KB17] deal with convergence of updates generalizing (3)–(4). However, these results crucially assume that both δt and γ tend to 0. This would be too slow in our setting. A constant γ can be used in TANGO (and in averaged SGD for linear least squares) thanks to the linearity of the update of v, but this requires a finer analysis of noise.

Discussion and shortcomings.
Critical to TANGO is the choice of the parameter γ: the larger γ is, the faster the trajectory will resemble a natural gradient (as v converges faster to (E g̃g̃^⊤)^{−1} E g). However, if γ is too large, the update for v is numerically unstable. For averaged SGD on quadratic losses, the choice of γ is theoretically well understood [DB15], but the situation is less clear for non-quadratic losses. We provide some general guidelines below.

The algorithmic interest of using TANGO with respect to direct Fisher matrix computation is not clear. Indeed, for δt = 0, the update equation (3) on v actually solves v = (E g̃g̃^⊤)^{−1} E g by stochastic gradient descent on v. The speed of convergence is heavily dimension-dependent, a priori. Similar Hessian-free Newton algorithms that rely on an auxiliary gradient descent to invert the Hessian, e.g., [Mar10], need a large number of auxiliary gradient iterations. In this case, the interest of TANGO may be its ease of implementation. (Technically the natural gradient is not a second-order method, as the Fisher matrix represents a Riemannian metric tensor rather than a Hessian of the loss. It can be computed from squared gradients, and the natural gradient is well-defined even if the loss is flat or concave. The Fisher matrix coincides with the Hessian of the loss function only asymptotically at a local minimum, provided the data follow the model.)

However, the dimension-dependence may be less severe in practice: for instance, if the samples are organized in a few clusters with correlated gradients, the gradients g̃ observed in each of the clusters may already provide an interesting low-rank approximation of the Fisher matrix E g̃g̃^⊤. In such a situation, v may converge reasonably fast to an approximate natural gradient direction.
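The guidelines stated after Definition 1 can be implemented as running statistics of past pseudo-gradients; the small helper below is a sketch of this (the class and method names are illustrative, not from the paper).

import numpy as np

class GammaEstimator:
    """Running estimate of a 'safe' gamma from past pseudo-gradients g~.

    Two heuristics mentioned after Definition 1:
    - conservative: gamma = 1 / max_k ||g~_k||^2 ;
    - Gaussian-gradient rule: gamma = 1 / (3 E||g~||^2), with the expectation
      replaced by a running average over past gradients.
    """
    def __init__(self):
        self.max_sq_norm = 0.0
        self.mean_sq_norm = 0.0
        self.count = 0

    def update(self, g_tilde):
        sq = float(np.dot(g_tilde, g_tilde))
        self.max_sq_norm = max(self.max_sq_norm, sq)
        self.count += 1
        self.mean_sq_norm += (sq - self.mean_sq_norm) / self.count

    def gamma_conservative(self):
        return 1.0 / self.max_sq_norm if self.max_sq_norm > 0 else 1.0

    def gamma_gaussian(self):
        return 1.0 / (3.0 * self.mean_sq_norm) if self.mean_sq_norm > 0 else 1.0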
Implementation remarks: minibatches, preconditioned TANGO.

If g̃ is computed as the average over a minibatch of size B, namely g̃ = (1/B) Σ_{i=1}^B g̃_i with g̃_i the gradient corresponding to output sample ỹ_i in the minibatch, then the equation for v has to be modified to

    v_k = (1 − δt_{k−1}) v_{k−1} + γ g_k − γ B (1 − δt_{k−1}) (v_{k−1}^⊤ g̃_k) g̃_k    (5)

because the expectation of g̃g̃^⊤ is then 1/B times the Fisher matrix.

Preconditioned TANGO (e.g., à la RMSProp) can be obtained by choosing a positive definite matrix C and iterating

    v_k = (1 − δt_{k−1}) v_{k−1} + γ C g_k − γ (1 − δt_{k−1}) (v_{k−1}^⊤ g̃_k) C g̃_k    (6)
    θ_k = θ_{k−1} − δt_k v_k    (7)

(This is TANGO on the variable C^{−1/2} θ.) The matrix C may help to improve conditioning of the gradients and of the matrix C E g̃g̃^⊤. Choices of C may include RMSProp (the entrywise reciprocal of the root-mean-square average of gradients) or the inverse of the diagonal Fisher matrix, C^{−1} = diag(E g̃^{⊙2}). These options will require different adjustments for γ.

Quadratic output losses can be seen as the log-loss of a probabilistic model, ℓ(y|x) = ‖y − f_θ(x)‖² / (2σ²), for any value of σ. However, σ² should be set to the actual mean square error on the outputs for the natural gradient descent to work best. The choice of σ affects both the scaling of the gradients g and g̃, and the sampling of pseudo-samples ỹ, whose law is N(f_θ(x), σ²).
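The combined effect of (5) and (6) on the velocity update fits in a few lines; the sketch below assumes a diagonal preconditioner C given entrywise (an RMSProp-style choice is one option), and is illustrative only.

import numpy as np

def tango_velocity_update(v, g, g_tilde, gamma, dt_prev, batch_size=1, C_diag=None):
    """One velocity update combining (5) and (6).

    v, g, g_tilde : current velocity, data gradient, pseudo-gradient (averaged over
                    a minibatch of `batch_size` pseudo-samples)
    C_diag        : optional diagonal preconditioner C (entrywise), e.g. an
                    RMSProp-style 1/sqrt(mean g^2); None means C = Id.
    """
    if C_diag is None:
        C_diag = np.ones_like(g)
    keep = 1.0 - dt_prev
    # rank-one term: gamma * B * (1 - dt) * (v . g~) * C g~ ; the factor B compensates
    # the 1/B factor in E[g~ g~^T] when g~ is a minibatch average
    correction = gamma * batch_size * keep * (v @ g_tilde) * (C_diag * g_tilde)
    return keep * v + gamma * (C_diag * g) - correction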
TANGO as an instance of averaged SGD for quadratic losses.

Averaged SGD maintains a fast-moving parameter with constant learning rate, and returns a moving average of the fast trajectory. It is known to have excellent asymptotic properties for quadratic models.

For quadratic losses, TANGO can be rewritten as a form of averaged SGD, despite TANGO only using gradients evaluated at the "slow" parameter θ. This is specific to gradients being a linear function of θ. Thus TANGO can be considered as a non-linearization of averaged SGD, written using gradients at θ only. Even for simple nonlinear models, the difference can be substantial (Fig. 1). For nonlinear models, averaged SGD with a fixed learning rate γ can have a bias of size comparable to γ, even with small δt. TANGO does not exhibit such a bias.

(A bias of size γ is easy to see on the following example: define a loss ℓ(x) = |x| for |x| > γ/2, and extend it in an arbitrary way on the interval [−γ/2, γ/2]. A gradient descent with learning rate γ, initialized at a multiple of γ/2, makes jumps of size exactly γ and never visits the interior of [−γ/2, γ/2]; it is therefore insensitive to the values of the loss on [−γ/2, γ/2] and to the location of the minimum. Thus averaged SGD can have a bias of size ≈ γ, whatever δt.)
Proposition 2.
Assume that for each sample (x, y), the log-loss ℓ(y|x) is a quadratic function of θ whose Hessian does not depend on y (e.g., linear regression ℓ(y|x) = ‖y − θ^⊤x‖²).

Then TANGO is identical to the following trajectory averaging algorithm:

    θ^fast_k = θ^fast_{k−1} − γ ∂ℓ(y_k|x_k)/∂θ^fast_{k−1} + γ ξ_k    (8)
    θ_k = (1 − δt_k) θ_{k−1} + δt_k θ^fast_k    (9)

where ξ_k is some centered random variable whose law depends on θ^fast_{k−1} and θ_{k−1}. The identification with TANGO is via v_k = θ_{k−1} − θ^fast_k.

The proof (Appendix A) is mostly by direct algebraic manipulations. For quadratic losses, the gradients are a linear function of the parameter, so that the derivative at point θ^fast can be rewritten as the derivative at point θ plus a Hessian term; for quadratic losses, the Hessian is equal to the Fisher metric.

The additional noise ξ_k is multiplicative in v. This is standard for linear regression [DFB16]: indeed, in linear regression, the gradient from sample (x, y) is −yx + xx^⊤θ, and its expectation is −E(yx) + E(xx^⊤)θ, so that the gradient noise has a multiplicative component (xx^⊤ − E(xx^⊤))θ. (Treatments of gradient descent often assume additive noise instead; see the discussion in [DFB16].)

Replacing the TANGO update of θ in (4) with θ_k = θ_{k−1} − v_k would make TANGO equivalent to an accelerated gradient method with additional noise for quadratic functions.
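Proposition 2 can be checked numerically: running TANGO and the averaging form (8)–(9) with the noise (73) on the same random stream gives the same slow trajectory. The sketch below does this for linear regression with loss (y − θ·x)²/2; all hyperparameter values are illustrative.

import numpy as np

rng = np.random.default_rng(0)
d, n_steps, gamma, dt = 3, 500, 0.05, 0.01
theta_star = rng.normal(size=d)
xs = rng.normal(size=(n_steps, d))
ys = xs @ theta_star + rng.normal(size=n_steps)   # regression data
eps = rng.normal(size=n_steps)                    # shared noise for the pseudo-samples

def run_tango(theta0):
    theta, v = theta0.copy(), np.zeros(d)
    for k in range(n_steps):
        x, y = xs[k], ys[k]
        y_tilde = x @ theta + eps[k]              # pseudo-sample from N(theta.x, 1)
        g = (theta @ x - y) * x                   # gradient at the slow parameter
        g_t = (theta @ x - y_tilde) * x
        v = (1 - dt) * v + gamma * g - gamma * (1 - dt) * (v @ g_t) * g_t   # (3)
        theta = theta - dt * v                                              # (4)
    return theta

def run_averaged_form(theta0):
    theta, theta_fast, v = theta0.copy(), theta0.copy(), np.zeros(d)
    for k in range(n_steps):
        x, y = xs[k], ys[k]
        y_tilde = x @ theta + eps[k]
        g_t = (theta @ x - y_tilde) * x
        # noise (73): xi = (1 - dt) (g~ g~^T - H) v, with sample Hessian H = x x^T
        xi = (1 - dt) * ((v @ g_t) * g_t - (v @ x) * x)
        g_fast = (theta_fast @ x - y) * x         # gradient at the fast parameter
        theta_fast = theta_fast - gamma * g_fast + gamma * xi               # (8)
        v = theta - theta_fast                    # identification v_k = theta_{k-1} - theta_fast_k
        theta = (1 - dt) * theta + dt * theta_fast                          # (9)
    return theta

assert np.allclose(run_tango(np.zeros(d)), run_averaged_form(np.zeros(d)))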
Convergence of TANGO to the natural gradient.

Let the Fisher matrix of the model be

    J(θ) := E g̃g̃^⊤ = E_{(x,y)∈D} E_{ỹ∼p_θ(ỹ|x)} ( ∂ℓ(ỹ|x)/∂θ )^{⊗2}    (10)

where, for a vector v, v^{⊗2} is the outer product vv^⊤.

The stochastic natural gradient descent on θ, with learning rate δt, using the exact Fisher matrix J(θ), is

    θ_{t+δt} = θ_t − δt J(θ_t)^{−1} ∂ℓ(y_k|x_k)/∂θ_t    (11)

where at each step (x_k, y_k) is a random sample from the dataset D. In the limit of small learning rates δt → 0, it converges to a "true" continuous-time natural gradient descent trajectory, driven by the differential equation

    dθ_t/dt = − J(θ_t)^{−1} E_{(x,y)∈D} ∂ℓ(y|x)/∂θ_t    (12)

Theorem 3.
Make the following regularity assumptions: the second moment of the gradients g is bounded over θ; the fourth moment of the gradients g̃ is bounded over θ; the lowest eigenvalue of the Fisher matrix J(θ), as a function of θ, is bounded away from 0; the Fisher matrix is a C¹ function of θ with bounded first derivatives.

Let θ_T be the value of the exact natural gradient (12) at time T. Assume that the parameter γ in TANGO is smaller than some constant that depends on the moments of the gradients and the eigenvalues of the Fisher matrix. Then the value of θ obtained after T/δt iterations of TANGO converges in probability to θ_T, when δt → 0.

The probability in this theorem refers to the random choice of the samples x_k, y_k and ỹ_k in TANGO.

Theorem 3 will be obtained as a corollary of the more general Theorem 5, which also provides quantitative versions of the choice of γ in TANGO.

To illustrate a key idea of the proof, we start with a simpler, noise-free situation.

Proposition 4.
Consider the iteration

    v_k = v_{k−1} + γ F(θ_{k−1}) − γ A(θ_{k−1}) v_{k−1}    (13)
    θ_k = θ_{k−1} − δt v_k    (14)

initialized at v_0 = 0, where F is a vector field on θ and A is a field of symmetric positive definite matrices. Assume that F and A are C¹ with bounded derivatives. Let λ_min := inf_θ min eigenvalues(A(θ)) and λ_max := sup_θ max eigenvalues(A(θ)), and assume λ_min > 0 and λ_max < ∞. Fix γ smaller than 1/λ_max.

Then when δt → 0, the value θ of this system after T/δt iterations converges to the value at time T of the ordinary differential equation with preconditioning A^{−1},

    dθ_t/dt = − A(θ_t)^{−1} F(θ_t)    (15)

initialized at the same θ_0. More precisely, θ_{T/δt} − θ_T = O(δt).

Proof. We first deal with the case of constant A(θ) ≡ A.

First, note that the sum of the contributions of v_k to all future updates of θ is δt Σ_{j≥0} (Id − γA)^j v_k = δt γ^{−1} A^{−1} v_k. This suggests setting

    z_{k+1} := θ_k − δt γ^{−1} A^{−1} v_{k+1}    (16)

which contains "θ_k plus all the known future updates from the terms F(θ_j), j ≤ k, that are already present in v_{k+1}". Substituting for θ_k and v_{k+1} in z_{k+1}, one finds that the update for z is

    z_{k+1} = θ_{k−1} − δt v_k − δt γ^{−1} A^{−1} (v_k + γ F(θ_k) − γ A v_k)    (17)
            = z_k − δt A^{−1} F(θ_k)    (18)

which only involves the new contribution from F(θ_k), and not v. Moreover,

    z_k = θ_{k−1} − δt γ^{−1} A^{−1} v_k = θ_k + δt v_k − δt γ^{−1} A^{−1} v_k = θ_k + O(δt ‖v_k‖)    (19)

since A^{−1} is bounded (its largest eigenvalue is 1/λ_min).

Now, the update for v_k is (1 − γλ_min)-contracting, because the condition γ < 1/λ_max implies that the eigenvalues of γA lie between γλ_min and 1. Since λ_min > 0 and F is bounded, it is easy to show by induction that ‖v_k‖ ≤ (sup ‖F‖)/λ_min, so that v is bounded. Therefore, z_k = θ_k + O(δt). Then, given the regularity assumptions on F, one has F(θ_k) = F(z_k) + O(δt) and

    z_{k+1} = z_k − δt A^{−1} F(z_k) + O(δt²)    (20)

since A^{−1} is bounded. This does not involve v any more.

But this update for z_k is just a Euler numerical scheme for the differential equation ż = − A^{−1} F(z). So by the standard theory of approximation of ordinary differential equations, when δt → 0, z_{T/δt} converges to the solution at time T of this equation, within an error O(δt). Since θ_k − z_k is O(δt) as well, we get the same conclusion for θ.

For the case of variable A, set

    z_{k+1} := θ_k − δt γ^{−1} A(θ_k)^{−1} v_{k+1}    (21)

and substituting for θ_k and v_{k+1} in this definition, one finds

    z_{k+1} = θ_{k−1} − δt γ^{−1} A(θ_k)^{−1} v_k − δt A(θ_k)^{−1} F(θ_k)    (22)
            = z_k + δt γ^{−1} (A(θ_{k−1})^{−1} − A(θ_k)^{−1}) v_k − δt A(θ_k)^{−1} F(θ_k)    (23)

Now, under our eigenvalue assumptions, A^{−1} is bounded. Since A has bounded derivatives, so does A^{−1}, thanks to ∂_θ A^{−1} = − A^{−1} (∂_θ A) A^{−1}. Therefore we can apply a Taylor expansion of A^{−1}, so that

    A(θ_{k−1})^{−1} − A(θ_k)^{−1} = O(θ_{k−1} − θ_k) = O(δt ‖v_k‖)    (24)

so that

    z_{k+1} = z_k − δt A(θ_k)^{−1} F(θ_k) + O(δt² ‖v_k‖²)    (25)

after which the proof proceeds as for the case of constant A, namely: z_k − θ_k is O(δt ‖v_k‖), so that

    z_{k+1} = z_k − δt A(z_k)^{−1} F(z_k) + O(δt² ‖v_k‖ + δt² ‖v_k‖²)    (26)

and v_k is bounded by induction. So the update for z_k is a Euler numerical scheme for the differential equation ż = − A(z)^{−1} F(z), which ends the proof.
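As a quick numerical sanity check of Proposition 4 (not from the paper), the sketch below runs iteration (13)–(14) with a fixed matrix A and the linear field F(θ) = θ, and compares the result at time T with the exact solution of the preconditioned ODE (15); the error decreases as δt decreases.

import numpy as np

A = np.array([[2.0, 0.5], [0.5, 1.0]])          # symmetric positive definite
gamma = 0.2                                     # below 1 / lambda_max(A)
theta0 = np.array([1.0, -1.0])
T = 1.0

# Exact ODE solution theta_T = exp(-A^{-1} T) theta_0, via eigendecomposition of A^{-1}.
w, U = np.linalg.eigh(np.linalg.inv(A))
theta_exact = U @ (np.exp(-w * T) * (U.T @ theta0))

for dt in (1e-1, 1e-2, 1e-3):
    theta, v = theta0.copy(), np.zeros(2)
    for _ in range(int(T / dt)):
        v = v + gamma * theta - gamma * (A @ v)     # (13) with F(theta) = theta
        theta = theta - dt * v                      # (14)
    print(dt, np.linalg.norm(theta - theta_exact))  # error shrinks as dt decreases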
We now turn to the stochastic version of Proposition 4. This provides a generalization of Theorem 3: Theorem 3 is a corollary of Theorem 5, using F̂_k = g_k and Â_k = (1 − δt) g̃_k g̃_k^⊤ + (δt/γ) Id.

For numerical simulations of stochastic differential equations, the usual rate of convergence is O(√δt) rather than O(δt) [KP92].
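The identification of the TANGO update (3) with the abstract iteration of Theorem 5 below is a short computation; it is spelled out here for convenience, with a constant δt for readability:

    v_k = (1 − δt) v_{k−1} + γ g_k − γ (1 − δt) (v_{k−1}^⊤ g̃_k) g̃_k
        = v_{k−1} + γ F̂_k − γ Â_k v_{k−1},    with  F̂_k := g_k  and  Â_k := (1 − δt) g̃_k g̃_k^⊤ + (δt/γ) Id.

In particular A(θ_{k−1}) = E_{k−1} Â_k = (1 − δt) J(θ_{k−1}) + (δt/γ) Id, which tends to the Fisher matrix J(θ_{k−1}) when δt → 0.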
Theorem 5.

Consider the iteration

    v_k = v_{k−1} + γ F̂_k − γ Â_k v_{k−1}    (27)
    θ_k = θ_{k−1} − δt v_k    (28)

initialized at v_0 = 0, where F̂_k is a vector-valued random variable and Â_k is a symmetric-matrix-valued random variable.

Let F_k be the sigma-algebra generated by all variables up to time k, and abbreviate E_k for E[· | F_k]. Let

    F_k := E_{k−1} F̂_k,    A_k := E_{k−1} Â_k    (29)

and assume that these depend on θ_{k−1} only, namely, that there exist functions F(θ) and A(θ) such that

    F_k = F(θ_{k−1}),    A_k = A(θ_{k−1})    (30)

Assume that the functions F and A are C¹ with bounded derivatives. Let λ := inf_θ min eigenvalues(A(θ)), and assume λ > 0.

Assume the following variance control: for some σ² > 0 and R² > 0,

    E_{k−1} ‖F̂_k‖² ≤ σ²,    E_{k−1}[ Â_k^⊤ Â_k ] ⪯ R² A_k    (31)

where A ⪯ B means that B − A is positive semidefinite. Fix 0 < γ ≤ 1/R².

Then when δt → 0, the value θ of this system after T/δt iterations converges in probability to the value at time T of the ordinary differential equation with preconditioning A^{−1},

    dθ_t/dt = − A(θ_t)^{−1} F(θ_t)    (32)

initialized at the same θ_0. More precisely, for any ε > 0, with probability > 1 − ε one has θ_{T/δt} − θ_T = O(√δt), where the constant in O(·) depends on ε, T, λ, γ, σ, R, and the derivatives of F(θ) and A(θ). The bounds are uniform for T in compact intervals.

The variance assumption on Â directly controls the maximum possible value of γ via γ ≤ 1/R², and, consequently, the speed of convergence of v to A^{−1}F. This assumption appears in [BM13, DB15, DFB16] for Â = g̃g̃^⊤, where the value of R² for typical cases is discussed.

With Â = g̃g̃^⊤, the variance assumption on Â is always satisfied with R² = sup ‖g̃‖² if g̃ is bounded. It is also satisfied with R² = E‖g̃‖⁴ / λ, without bounded gradients. (Indeed, first, one has E[Â^⊤Â] = E(‖g̃‖² g̃g̃^⊤) ⪯ (sup ‖g̃‖²) E g̃g̃^⊤; second, for any vector u, one has u^⊤ E[g̃g̃^⊤ g̃g̃^⊤] u = E[u^⊤ g̃ g̃^⊤ g̃ g̃^⊤ u] ≤ E[‖u‖² ‖g̃‖⁴] = ‖u‖² E‖g̃‖⁴, while u^⊤ A u is at least λ ‖u‖².) If the distribution of g̃ has bounded kurtosis κ in every direction, then the assumption is satisfied with R² = κ E‖g̃‖² [DFB16]; in particular, for Gaussian g̃, with any covariance matrix, the assumption is satisfied with R² = 3 E‖g̃‖². All these quantities can be estimated based on past values of g̃.

Theorem 5 would still be valid with additional centered noise on θ and additional o(δt) terms on θ; for simplicity we did not include them, as they are not needed for TANGO.
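As a minimal check of the kurtosis-based value of R² (added here as a worked example, in dimension one, where Â_k = g̃_k², A = E g̃², and the pseudo-gradient g̃ has mean zero):

    E[Â²] = E[g̃⁴] ≤ κ (E[g̃²])² = (κ E[g̃²]) · A,    so the assumption holds with R² = κ E[g̃²],

and κ = 3 for a centered Gaussian g̃, recovering R² = 3 E‖g̃‖² in this case.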
Lemma 6.

Under the assumptions of Theorem 5, the largest eigenvalue of A(θ) is at most R². The operator (Id − γA(θ)) is (1 − γλ)-contracting. Moreover, θ ↦ A^{−1}(θ) exists, is bounded, and is C¹ with bounded derivatives. The same holds for θ ↦ A^{−1}(θ) F(θ).

Proof. First, for any vector u, one has ‖Au‖² = ‖E[Âu]‖² ≤ E‖Âu‖² = E[u^⊤ Â^⊤ Â u] ≤ R² u^⊤ A u. Taking u an eigenvector associated with the largest eigenvalue λ_max of A shows that λ_max ≤ R². Next, the eigenvalues of A lie between λ and R², so that the eigenvalues of γA lie between γλ and 1. So the eigenvalues of Id − γA lie between 0 and 1 − γλ.

(TANGO uses Â = (1 − δt) g̃g̃^⊤ + (δt/γ) Id rather than Â = g̃g̃^⊤. Actually it is enough to check the variance assumption with g̃g̃^⊤: indeed one checks that if g̃g̃^⊤ satisfies the assumption with some R², then (1 − δt) g̃g̃^⊤ + (δt/γ) Id satisfies it with max(R², 1/γ), and that γ ≤ 1/R² implies γ ≤ 1/max(R², 1/γ).)

Since A is symmetric and its smallest eigenvalue is λ > 0, it is invertible with its inverse bounded by 1/λ. Thanks to ∂_θ A^{−1} = − A^{−1} (∂_θ A) A^{−1}, the derivatives of A^{−1} are bounded.

Lemma 7.
Under the notation and assumptions of Theorem 5, for any k,

    E‖v_k‖² ≤ 4σ²/λ²    (33)

Up to the factor 4, this is optimal: indeed, when F̂ and Â have a distribution independent of k, the fixed point of v in expectation is v = A^{−1} E F̂, whose square norm is (E F̂)^⊤ A^{−2} E F̂, which is ‖E F̂‖²/λ² if E F̂ lies in the direction of the eigenvalue λ.

Proof. The proof is a variant of arguments appearing in [BM13]; in our case A is not constant, F̂_k is not centered, Â_k is not rank-one, and we do not use the norm associated with A on the left-hand side. Let

    w_k := (Id − γ Â_k) v_{k−1}    (34)

so that v_k = w_k + γ F̂_k. Consequently,

    ‖v_k‖² = ‖w_k‖² + ‖γ F̂_k‖² + 2γ w_k · F̂_k ≤ (1 + α) ‖w_k‖² + (1 + 1/α) ‖γ F̂_k‖²    (35)

for any α > 0, thanks to 2ab = 2(√α a)(b/√α) ≤ α a² + b²/α for any α > 0, a, b ∈ ℝ. Now

    ‖w_k‖² = ‖v_{k−1}‖² − γ v_{k−1}^⊤ (Â_k + Â_k^⊤) v_{k−1} + γ² v_{k−1}^⊤ Â_k^⊤ Â_k v_{k−1}    (36)

Take expectations conditionally to F_{k−1}. Using E_{k−1}[Â_k^⊤ Â_k] ⪯ R² A_k we find

    E_{k−1} ‖w_k‖² ≤ ‖v_{k−1}‖² − γ (2 − γR²) v_{k−1}^⊤ A_k v_{k−1}    (37)

By the assumptions, γR² ≤ 1 and v_{k−1}^⊤ A_k v_{k−1} ≥ λ ‖v_{k−1}‖². Thus

    E_{k−1} ‖w_k‖² ≤ (1 − γλ) ‖v_{k−1}‖²    (38)

Taking 1 + α = (1 − γλ/2)/(1 − γλ), we find

    E_{k−1} ‖v_k‖² ≤ (1 − γλ/2) ‖v_{k−1}‖² + (1 + 1/α) γ² σ²    (39)
                  ≤ (1 − γλ/2) ‖v_{k−1}‖² + [(1 − γλ/2)/(γλ/2)] γ² σ²    (40)

Taking unconditional expectations, we obtain

    E ‖v_k‖² ≤ (1 − γλ/2) E ‖v_{k−1}‖² + [(1 − γλ/2)/(γλ/2)] γ² σ²    (41)

and by induction, starting at v_0 = 0, this implies

    E ‖v_k‖² ≤ [(1 − γλ/2)/(γλ/2)²] γ² σ² ≤ 4σ²/λ²    (42)

Corollary 8. Under the notation and assumptions of Theorem 5, for any n, for any ε > 0, with probability > 1 − ε, one has

    sup_{k ≤ n} ‖v_k‖ ≤ (2σ/λ) √(n/ε)    (43)

Proof. This follows from Lemma 7 by the Markov inequality and a union bound.

The next two lemmas result from standard martingale arguments; the detailed proofs are given in the Appendix.
Lemma 9.
Under the notation and assumptions of Theorem 5, let ξ be the noise on F̂,

    ξ_k := F̂_k − F_k    (44)

Let (M_k) be any sequence of operators such that M_k is F_{k−1}-measurable and ‖M_k‖_op ≤ Λ almost surely. Then

    E Σ_{j=1}^n ‖M_j ξ_j‖² ≤ n Λ² σ²    (45)

and moreover, for any n, for any ε > 0, with probability > 1 − ε, for any k ≤ n one has

    ‖ Σ_{j=k}^n M_j ξ_j ‖ ≤ 2 √( n Λ² σ² / ε )    (46)

Lemma 10.
Under the notation and assumptions of Theorem 5, set

    ζ_k := (Â_k − A_k) v_{k−1}    (47)

Let (M_k) be any sequence of operators such that M_k is F_{k−1}-measurable and ‖M_k‖_op ≤ Λ almost surely. Let λ_max := sup_θ max eigenvalues(A(θ)), which is finite by Lemma 6. Then

    E Σ_{j=1}^n ‖M_j ζ_j‖² ≤ 4 n R² λ_max Λ² σ² / λ²    (48)

and moreover, for any n, for any ε > 0, with probability > 1 − ε, for any k ≤ n,

    ‖ Σ_{j=k}^n M_j ζ_j ‖ ≤ 4 √( n R² λ_max Λ² σ² / (ε λ²) )    (49)

Proof of Theorem 5.
Let n := T/δt be the number of discrete steps corresponding to continuous time T. All the constants implied in the O(·) notation below depend on T and on the assumptions of the theorem (R, γ, λ, etc.), and we study the dependency on δt.

Similarly to Proposition 4, set

    z_k := θ_{k−1} − δt γ^{−1} B_k v_k    (50)

where B_k is a matrix to be defined later (equal to A^{−1} in the case of constant A). Informally, z contains θ plus the future updates to be made to θ based on the current value of v.

Substituting θ_{k−1} = θ_{k−2} − δt v_{k−1} and v_k = v_{k−1} + γ F̂_k − γ A_k v_{k−1} − γ ζ_k into the definition of z_k, one finds

    z_k = θ_{k−2} − δt v_{k−1} − δt γ^{−1} B_k ( v_{k−1} + γ F̂_k − γ A_k v_{k−1} − γ ζ_k )    (51)
        = θ_{k−2} − δt B_k (F̂_k − ζ_k) − δt ( Id + γ^{−1} B_k − B_k A_k ) v_{k−1}    (52)
        = z_{k−1} − δt B_k (F̂_k − ζ_k) − δt ( Id − B_k A_k + γ^{−1} (B_k − B_{k−1}) ) v_{k−1}    (53)

Now define B_k in order to cancel the v_{k−1} term, namely

    B_{k−1} := B_k + γ (Id − B_k A_k)    (54)

initialized with B_n := A_n^{−1}. (If A is constant, then B ≡ A^{−1}.) Then δt γ^{−1} B_k v_k represents all the future updates to θ stemming from the current value v_k. With this choice, the update for z is

    z_k = z_{k−1} − δt B_k (F̂_k − ζ_k) = z_{k−1} − δt B_k (F_k + ξ_k − ζ_k)    (55)

Remove the noise by defining

    y_k := z_k − δt Σ_{j=k+1}^n B_j (ξ_j − ζ_j)    (56)

so that

    y_k = y_{k−1} − δt B_k F_k    (57)

Assume for now that B_k = A^{−1}(θ_{k−1}) + O(√δt). Then

    y_k = y_{k−1} − δt A^{−1}(θ_{k−1}) F(θ_{k−1}) + O(δt^{3/2})    (58)

Since A^{−1}F is Lipschitz (Lemma 6), we have

    y_k = y_{k−1} − δt A^{−1}(y_{k−1}) F(y_{k−1}) + O(δt ‖y_{k−1} − θ_{k−1}‖) + O(δt^{3/2})    (59)

If we prove that y_{k−1} − θ_{k−1} = O(√δt), then we find

    y_k = y_{k−1} − δt A^{−1}(y_{k−1}) F(y_{k−1}) + O(δt^{3/2})    (60)

so that y_k is a Euler numerical scheme for the differential equation ẏ = − A^{−1}(y) F(y), and thus converges to the natural gradient trajectory up to O(√δt), uniformly on the time interval [0; T]. Since we assumed that θ_k − y_k = O(√δt), this holds for θ_k as well.

We still have to prove the two assumptions that y_{k−1} − θ_{k−1} = O(√δt) and that B_k = A^{−1}(θ_{k−1}) + O(√δt).

Lemma 11. Define B_{k−1} := B_k + γ (Id − B_k A_k), initialized with B_n := A_n^{−1}. Then for any ε > 0, with probability > 1 − ε, one has sup_k ‖B_k − A_k^{−1}‖_op = O(√δt).

Proof of Lemma 11. With this definition one has

    B_{k−1} − A_{k−1}^{−1} = (B_k − A_k^{−1})(Id − γ A_k) + A_k^{−1} − A_{k−1}^{−1}    (61)

by a direct computation. Now A_k^{−1} − A_{k−1}^{−1} = A^{−1}(θ_{k−1}) − A^{−1}(θ_{k−2}) = O(θ_{k−1} − θ_{k−2}), because A^{−1} is Lipschitz. Moreover θ_{k−1} = θ_{k−2} − δt v_{k−1}. So A_k^{−1} − A_{k−1}^{−1} = O(δt ‖v_{k−1}‖). Thanks to Corollary 8, with probability > 1 − ε, sup_k ‖v_{k−1}‖ = O(√n) = O(1/√δt), so that A_k^{−1} − A_{k−1}^{−1} is O(√δt), uniformly in k.

Now, the operator (Id − γA_k) is (1 − γλ)-contracting. Therefore,

    ‖B_{k−1} − A_{k−1}^{−1}‖_op ≤ (1 − γλ) ‖B_k − A_k^{−1}‖_op + O(√δt)    (62)

and B_n − A_n^{−1} is 0, so by induction, ‖B_{k−1} − A_{k−1}^{−1}‖_op = O(√δt), uniformly in k.

Back to the proof of Theorem 5. To prove that y_k − θ_k = O(√δt), let us first prove that y_k − z_k = O(√δt). We have

    z_k − y_k = δt Σ_{j=k+1}^n B_j (ξ_j − ζ_j)    (63)

Thanks to Lemma 11, this rewrites as

    z_k − y_k = δt Σ_{j=k+1}^n A_j^{−1} (ξ_j − ζ_j) + O( δt^{3/2} Σ_{j=k+1}^n (‖ξ_j‖ + ‖ζ_j‖) )    (64)

For the first term, note that A_j^{−1} = A^{−1}(θ_{j−1}) is F_{j−1}-measurable (while B_j is not, because it depends on θ_k for k > j). By Lemmas 9 and 10, Σ A_j^{−1} ξ_j and Σ A_j^{−1} ζ_j are both O(√n) = O(√(1/δt)) with high probability. So the first term of z_k − y_k is O(√δt).

For the second term,

    Σ_{j=k+1}^n ‖ξ_j‖ ≤ Σ_{j=1}^n ‖ξ_j‖ ≤ √n √( Σ_{j=1}^n ‖ξ_j‖² )    (65)

by Cauchy–Schwarz. By Lemma 9, E Σ ‖ξ_j‖² is O(n). So with probability > 1 − ε, thanks to the Markov inequality, √(Σ ‖ξ_j‖²) is O(√n), where the constant in O(·) depends on ε. Therefore, Σ_{j=k}^n ‖ξ_j‖ is O(n) = O(1/δt). The same argument applies to ζ, thanks to Lemma 10. Therefore, z_k − y_k is O(√δt).

Finally, z_k − θ_k is O(δt ‖v_k‖), which is O(δt √n) = O(√δt) by Corollary 8. Therefore y_k − θ_k is O(√δt) as well.

A  Additional proofs
Proof of Proposition 2.
Start with the algorithm in Proposition 2, with any noise ξ_k. Under the update (9) for θ_k one has

    θ_k − θ^fast_k = (1 − δt_k)(θ_{k−1} − θ^fast_k)    (66)

Now set

    v_k := θ_{k−1} − θ^fast_k    (67)

so that the update for θ_k is θ_k = θ_{k−1} − δt_k θ_{k−1} + δt_k θ^fast_k = θ_{k−1} − δt_k v_k by construction. To determine the update for v, remove θ_{k−1} from the update of θ^fast_k:

    θ^fast_k − θ_{k−1} = θ^fast_{k−1} − θ_{k−1} − γ g^fast_k + γ ξ_k    (68)

where we abbreviate g^fast_k := ∂ℓ(y_k|x_k)/∂θ^fast_{k−1}, the gradient of the loss at θ^fast_{k−1}.

Let H_k be the Hessian of the loss on the k-th example with respect to the parameter. Since losses are quadratic, the gradient of the loss is a linear function of the parameter:

    g^fast_k = g_k + H_k (θ^fast_{k−1} − θ_{k−1})    (69)

where g_k := ∂ℓ(y_k|x_k)/∂θ_{k−1} is the gradient of the loss at θ_{k−1}. Thus (68) rewrites as

    v_k = − θ^fast_{k−1} + θ_{k−1} + γ g_k + γ H_k (θ^fast_{k−1} − θ_{k−1}) − γ ξ_k    (70)

and thanks to (66),

    θ_{k−1} − θ^fast_{k−1} = (1 − δt_{k−1}) v_{k−1}    (71)

so the above rewrites as

    v_k = (1 − δt_{k−1}) v_{k−1} + γ g_k − γ (1 − δt_{k−1}) H_k v_{k−1} − γ ξ_k    (72)

If we set

    ξ_k := (1 − δt_{k−1}) (g̃_k g̃_k^⊤ − H_k) v_{k−1}    (73)

then this is identical to TANGO. However, we still have to prove that such a ξ_k is a centered noise, namely, E ξ_k = 0. This will be the case if

    H_k = E g̃_k g̃_k^⊤    (74)

where the expectation is with respect to the choice of the random output ỹ_k given x_k. From the double definition of the Fisher matrix of a probabilistic model, we know that

    E_{ỹ∼p_θ(ỹ|x)} [ (∂ℓ(ỹ|x)/∂θ)(∂ℓ(ỹ|x)/∂θ)^⊤ ] = E_{ỹ∼p_θ(ỹ|x)} [ ∂²ℓ(ỹ|x)/∂θ² ]    (75)

Since we have assumed that this Hessian does not depend on ỹ, it is equal to H_k. Thus TANGO rewrites as averaged SGD with a particular model of noise on the fast parameter.

Proof of Lemma 9.
This is a standard martingale argument. By the variance assumption on F̂_k, one has E_{k−1}‖ξ_k‖² ≤ σ². Likewise, E_{k−1}‖M_k ξ_k‖² ≤ Λ² σ². This proves the first claim.

Moreover, since E_{k−1} ξ_k = 0 and M_k is F_{k−1}-measurable, E_{k−1} M_k ξ_k = 0, namely, the M_k ξ_k are martingale increments. Setting X_k := ‖Σ_{j=1}^k M_j ξ_j‖², we find E_k X_{k+1} = X_k + 2 E_k[(M_{k+1} ξ_{k+1}) · Σ_{j=1}^k M_j ξ_j] + E_k ‖M_{k+1} ξ_{k+1}‖² = X_k + E_k ‖M_{k+1} ξ_{k+1}‖². Consequently, E X_n ≤ n Λ² σ². Moreover, E_k X_{k+1} ≥ X_k, so that X_k is a submartingale. Therefore, by Doob's martingale inequality, with probability > 1 − ε,

    sup_{k ≤ n} X_k ≤ E X_n / ε ≤ n Λ² σ² / ε    (76)

Finally, Σ_{j=k}^n M_j ξ_j = Σ_{j=1}^n M_j ξ_j − Σ_{j=1}^{k−1} M_j ξ_j, hence the conclusion by the triangle inequality.

Proof of Lemma 10. The argument is similar to the preceding lemma, together with the bound on E‖v_k‖² from Lemma 7. Conditionally to F_{k−1} one has E_{k−1}‖ζ_k‖² = E_{k−1}[v_{k−1}^⊤ (Â_k − A_k)^⊤ (Â_k − A_k) v_{k−1}] = E_{k−1}[v_{k−1}^⊤ Â_k^⊤ Â_k v_{k−1}] − v_{k−1}^⊤ A_k^⊤ A_k v_{k−1} ≤ R² v_{k−1}^⊤ A_k v_{k−1} ≤ R² λ_max ‖v_{k−1}‖². Therefore, E‖ζ_k‖² ≤ R² λ_max E‖v_{k−1}‖² ≤ 4 R² λ_max σ² / λ² by Lemma 7. The operators M_k introduce an additional factor Λ². Consequently, E Σ_{k=1}^n ‖M_k ζ_k‖² ≤ 4 n R² λ_max Λ² σ² / λ². The rest of the proof is identical to Lemma 9.

References
[ABH16] Naman Agarwal, Brian Bullins, and Elad Hazan. Second order stochastic optimization in linear time. arXiv preprint arXiv:1602.03943, 2016.

[Ama98] Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10:251–276, February 1998.

[BM13] Francis Bach and Eric Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems, pages 773–781, 2013.

[DB15] Alexandre Défossez and Francis Bach. Averaged least-mean-squares: Bias-variance trade-offs and optimal sampling distributions. In Artificial Intelligence and Statistics, pages 205–213, 2015.

[DFB16] Aymeric Dieuleveut, Nicolas Flammarion, and Francis Bach. Harder, better, faster, stronger convergence rates for least-squares regression. arXiv preprint arXiv:1602.05419, 2016.

[DPCB13] Guillaume Desjardins, Razvan Pascanu, Aaron Courville, and Yoshua Bengio. Metric-free natural gradient for joint-training of Boltzmann machines. arXiv preprint arXiv:1301.3545, 2013.

[DSP+15] Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, et al. Natural neural networks. In Advances in Neural Information Processing Systems, pages 2071–2079, 2015.

[KB17] Prasenjit Karmakar and Shalabh Bhatnagar. Two time-scale stochastic approximation with controlled Markov noise and off-policy temporal-difference learning. Mathematics of Operations Research, 2017.

[KP92] Peter E. Kloeden and Eckhard Platen. Numerical Solution of Stochastic Differential Equations, volume 23 of Applications of Mathematics (New York). Springer-Verlag, Berlin, 1992.

[LBOM98] Yann Le Cun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks, Tricks of the Trade, Lecture Notes in Computer Science LNCS 1524. Springer Verlag, 1998.

[LMB07] Nicolas Le Roux, Pierre-Antoine Manzagol, and Yoshua Bengio. Topmoumoute online natural gradient algorithm. In Advances in Neural Information Processing Systems 20 (NIPS 2007), pages 849–856, 2007.

[Mar10] James Martens. Deep learning via Hessian-free optimization. In Johannes Fürnkranz and Thorsten Joachims, editors, Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 735–742. Omnipress, 2010.

[Mar14] James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.

[MB11] Éric Moulines and Francis R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011.

[MCO16] Gaétan Marceau-Caron and Yann Ollivier. Practical Riemannian neural networks. arXiv preprint arXiv:1602.08007, 2016.

[MG15] James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pages 2408–2417, 2015.

[MS11] James Martens and Ilya Sutskever. Learning recurrent neural networks with Hessian-free optimization. In ICML, pages 1033–1040, 2011.

[MS12] James Martens and Ilya Sutskever. Training deep and recurrent neural networks with Hessian-free optimization. In Grégoire Montavon, Geneviève B. Orr, and Klaus-Robert Müller, editors, Neural Networks: Tricks of the Trade, volume 7700 of Lecture Notes in Computer Science, pages 479–535. Springer, 2012.

[Oll15] Yann Ollivier. Riemannian metrics for neural networks I: feedforward networks. Information and Inference, 4(2):108–153, 2015.

[Oll17] Yann Ollivier. Online natural gradient as a Kalman filter. arXiv preprint arXiv:1703.00209, 2017.

[PJ92] Boris T. Polyak and Anatoli B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

[Rup88] David Ruppert. Efficient estimations from a slowly convergent Robbins–Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988.

[Tad04] Vladislav B. Tadić. Almost sure convergence of two time-scale stochastic approximation algorithms. In