An efficient Averaged Stochastic Gauss-Newton algorithm for estimating parameters of nonlinear regression models
Peggy Cénac (∗), Antoine Godichon-Baggioni (∗∗), Bruno Portier (∗∗∗)
(∗) Institut de Mathématiques de Bourgogne, Université de Bourgogne
(∗∗) Laboratoire de Probabilités, Statistique et Modélisation, Sorbonne Université
(∗∗∗) Laboratoire de Mathématiques de l'INSA, INSA Rouen-Normandie
Abstract
Nonlinear regression models are a standard tool for modeling real phenomena, with several applications in machine learning, ecology, econometrics... Estimating the parameters of the model has garnered a lot of attention during many years. We focus here on a recursive method for estimating the parameters of nonlinear regressions. Indeed, these kinds of methods, the most famous of which are probably the stochastic gradient algorithm and its averaged version, enable one to deal efficiently with massive data arriving sequentially. Nevertheless, they can be, in practice, very sensitive to the case where the eigenvalues of the Hessian of the functional we would like to minimize are at different scales. To avoid this problem, we first introduce an online Stochastic Gauss-Newton algorithm. In order to improve the behavior of the estimates in case of a bad initialization, we also introduce a new Averaged Stochastic Gauss-Newton algorithm and prove its asymptotic efficiency.
1 Introduction

We consider in this paper nonlinear regression models of the form
\[ Y_n = f(X_n, \theta) + \epsilon_n, \qquad n \in \mathbb{N}, \]
where the observations $(X_n, Y_n)_n$ are independent random vectors in $\mathbb{R}^p \times \mathbb{R}$, and $(\epsilon_n)_n$ are independent, identically distributed, zero-mean, non-observable random variables. Moreover, $f : \mathbb{R}^p \times \mathbb{R}^q \to \mathbb{R}$ and $\theta \in \mathbb{R}^q$ is the unknown parameter to estimate.

Nonlinear regression models are a standard tool for modeling real phenomena with a complex dynamic. They have a wide range of applications including machine learning (Bach (2014)), ecology (Komori et al. (2016)), econometrics (Varian (2014)), pharmacokinetics (Bates and Watts (1988)), epidemiology (Suárez et al. (2017)) or biology (Giurcăneanu et al. (2005)). Most of the time, the parameter $\theta$ is estimated with the least squares method, i.e., it is estimated by
\[ \widehat\theta_n = \arg\min_{h \in \mathbb{R}^q} \sum_{j=1}^n \left(Y_j - f(X_j, h)\right)^2. \]
Many authors have studied the asymptotic behavior of the least squares estimator, under various assumptions and with different methods. For example, Jennrich (1969) considers the case where $\theta$ belongs to a compact set and $f$ is a continuous function with a unique extremum. Wu (1981) considers similar local assumptions. Lai (1994); Skouras (2000); van de Geer (1990); van de Geer and Wegkamp (1996); Yao (2000) generalized the consistency results. Under a stronger assumption on the errors $\epsilon$, namely that the errors are uniformly subgaussian, van de Geer (1990) obtained sharper stochastic bounds using empirical process methods. Under second moment assumptions on the errors and regularity assumptions (Lipschitz conditions) on $f$, Pollard and Radchenko (2006) established weak consistency and a central limit theorem. Yang et al. (2016) consider a compact set for $\theta$ under strong regularity assumptions; their algorithm reaches a stationary point, without certainty that it is the one we are looking for, and they built hypothesis tests and confidence intervals. Finally, let us also mention Yao (2000), which considers the case of stable nonlinear autoregressive models and establishes the strong consistency of the least squares estimator.

However, in practice, the calculation of the least squares estimator is not explicit in most cases and therefore requires the implementation of a deterministic approximation algorithm. The second order Gauss-Newton algorithm is generally used (or sometimes the Levenberg-Marquardt algorithm, to avoid inversion problems; see for example Bates and Watts (1988)). These algorithms are therefore not adapted to the case where the data are acquired sequentially and at high frequency. In such a situation, stochastic algorithms offer an interesting alternative. One example is the stochastic gradient algorithm, defined by
\[ \theta_n = \theta_{n-1} + \gamma_n \nabla_\theta f(X_n, \theta_{n-1})\left(Y_n - f(X_n, \theta_{n-1})\right), \]
where $(\gamma_n)$ is a sequence of positive real numbers decreasing towards 0. Thanks to its recursive nature, this algorithm does not require storing all the data and can be updated automatically when the data arrive sequentially. It is thus adapted to very large datasets arriving at high speed. We refer to Robbins and Monro (1951) and, for its averaged version, to Polyak and Juditsky (1992) and Ruppert (1988). However, even if it is often used in practice, as in the case of neural networks, the use of stochastic gradient algorithms can lead to unsatisfying results. Indeed, for such models, it amounts to taking the same step sequence for each coordinate. Nevertheless, as explained in Bercu et al. (2020), if the Hessian of the function we would like to minimize has eigenvalues at different scales, this kind of "uniform" step sequence is not adapted.
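To fix ideas, the stochastic gradient recursion above can be sketched in a few lines of Python. This is only an illustration, assuming the toy model $f(x,\theta)=\theta_1(1-\exp(-\theta_2 x))$ used in the simulation study of Section 4; the step constants $c_\alpha = 1$ and $\alpha = 0.75$ are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, theta):         # toy model of Section 4: f(x, theta) = theta_1 (1 - exp(-theta_2 x))
    return theta[0] * (1.0 - np.exp(-theta[1] * x))

def grad_f(x, theta):    # gradient of f with respect to theta
    e = np.exp(-theta[1] * x)
    return np.array([1.0 - e, theta[0] * x * e])

theta_star = np.array([21.0, 12.0])
theta = theta_star + rng.standard_normal(2)        # initialization near the target
for n in range(1, 10_000 + 1):
    x = rng.uniform(0.0, 1.0)
    y = f(x, theta_star) + rng.standard_normal()   # Y_n = f(X_n, theta) + eps_n
    gamma = 1.0 * n ** (-0.75)                     # step sequence gamma_n = c_alpha n^(-alpha)
    theta = theta + gamma * grad_f(x, theta) * (y - f(x, theta))
```

Note that the same scalar step $\gamma_n$ multiplies both coordinates of the gradient: the recursion cannot adapt to directions in which the Hessian has small eigenvalues, which is precisely the issue the Gauss-Newton strategies below address.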
In this paper, we propose an alternative strategy to stochastic gradient algorithms, in the spirit of the Gauss-Newton algorithm. It is defined by:
\begin{align}
\varphi_n &= \nabla_\theta f(X_n, \theta_{n-1}) \tag{1}\\
S_n^{-1} &= S_{n-1}^{-1} - \left(1 + \varphi_n^T S_{n-1}^{-1}\varphi_n\right)^{-1} S_{n-1}^{-1}\varphi_n\varphi_n^T S_{n-1}^{-1} \tag{2}\\
\theta_n &= \theta_{n-1} + S_n^{-1}\varphi_n\left(Y_n - f(X_n, \theta_{n-1})\right) \tag{3}
\end{align}
where the initial value $\theta_0$ can be arbitrarily chosen and $S_0$ is a positive definite deterministic matrix, typically $S_0 = I_q$, where $I_q$ denotes the identity matrix of order $q$. Remark that, thanks to Riccati's formula (see (Duflo, 1997, p. 96)), also called Sherman-Morrison's formula (Sherman and Morrison (1950)), $S_n^{-1}$ is the inverse of the matrix $S_n$ defined by
\[ S_n = S_0 + \sum_{j=1}^n \varphi_j\varphi_j^T. \]
When the function $f$ is linear, of the form $f(x, \theta) = \theta^T x$, algorithm (1)-(3) rewrites as the standard recursive least squares estimator (see Duflo (1997)) defined by:
\[ S_n^{-1} = S_{n-1}^{-1} - \left(1 + X_n^T S_{n-1}^{-1}X_n\right)^{-1} S_{n-1}^{-1}X_nX_n^T S_{n-1}^{-1}, \qquad \theta_n = \theta_{n-1} + S_n^{-1}X_n\left(Y_n - \theta_{n-1}^TX_n\right). \]
This algorithm can be considered as a stochastic Newton algorithm, since the matrix $n^{-1}S_n$ is an estimate of the Hessian matrix of the least squares criterion.
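The rank-one inversion underlying (2) is cheap: each update of $S_n^{-1}$ costs $O(q^2)$ operations instead of the $O(q^3)$ of a full inversion. One pass of (1)-(3) could be transcribed in Python as follows; this is a sketch under our own naming conventions, where `f` and `grad_f` stand for $f$ and $\nabla_\theta f$.

```python
import numpy as np

def sgn_step(theta, S_inv, x, y, f, grad_f):
    """One pass of the recursion (1)-(3): rank-one Riccati update of S_n^{-1},
    then a Newton-type correction of theta."""
    phi = grad_f(x, theta)                                         # (1)
    S_phi = S_inv @ phi
    S_inv = S_inv - np.outer(S_phi, S_phi) / (1.0 + phi @ S_phi)   # (2), Sherman-Morrison
    theta = theta + (S_inv @ phi) * (y - f(x, theta))              # (3)
    return theta, S_inv
```

The identity used in the second update is exactly Sherman and Morrison's formula: $(S+\varphi\varphi^T)^{-1} = S^{-1} - \frac{S^{-1}\varphi\varphi^TS^{-1}}{1+\varphi^TS^{-1}\varphi}$.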
To the best of our knowledge, and apart from the least squares estimate mentioned above, second order stochastic algorithms are hardly ever used and studied, since they often require the inversion of a matrix at each step, which can be very expensive in terms of computation time. To overcome this problem, some authors (see for instance Mokhtari and Ribeiro (2014); Lucchi et al. (2015); Byrd et al. (2016)) use the BFGS (for Broyden-Fletcher-Goldfarb-Shanno) algorithm, which is based on the recursive estimation of a matrix whose behavior is close to the one of the inverse of the Hessian matrix. Nevertheless, this last estimate needs a regularization of the objective function, leading to unsatisfactory estimation of the unknown parameter. In a recent paper dedicated to the estimation of parameters in logistic regression models (Bercu et al., 2020), the authors propose a truncated Stochastic Newton algorithm. This truncation opens the way for online stochastic Newton algorithms without the necessity to penalize the objective function.

In the same spirit as this work, and to relax the assumptions on the function $f$, we in fact consider a modified version of the stochastic Gauss-Newton algorithm defined by (3), which enables us to obtain the asymptotic efficiency of the estimates under a larger set of assumptions. In addition, we introduce the following new Averaged Stochastic Gauss-Newton algorithm (ASN for short), defined by
\begin{align}
\varphi_n &= \nabla_h f(X_n, \theta_{n-1}) \tag{4}\\
S_n^{-1} &= S_{n-1}^{-1} - \left(1+\varphi_n^T S_{n-1}^{-1}\varphi_n\right)^{-1} S_{n-1}^{-1}\varphi_n\varphi_n^T S_{n-1}^{-1} \tag{5}\\
\theta_n &= \theta_{n-1} + n^{\beta} S_n^{-1}\varphi_n\left(Y_n - f(X_n, \theta_{n-1})\right) \tag{6}\\
\overline\theta_n &= \overline\theta_{n-1} + \frac{1}{n+1}\left(\theta_n - \overline\theta_{n-1}\right) \tag{7}
\end{align}
where $\beta \in (0, 1/2)$ and $\overline\theta_0 = \theta_0$. The multiplication by $n^\beta$ before the term $S_n^{-1}$ in (6) allows the algorithm to move quickly, which reduces the sensitivity to a bad initialization. The averaging step allows one to maintain an optimal asymptotic behavior. Indeed, under assumptions, we first give the rate of convergence of the estimates, before proving their asymptotic efficiency.

The paper is organized as follows. The framework and the algorithms are introduced in Section 2. In Section 3, we give the almost sure rates of convergence of the estimates and establish their asymptotic normality. A simulation study illustrating the interest of averaging is presented in Section 4. Proofs are postponed to Section 5, while some general results on almost sure rates of convergence for martingales, used in the proofs, are given in Section 6.

2 Framework and algorithms

Let us consider the nonlinear regression model of the form
\[ Y_n = f(X_n, \theta) + \tilde\epsilon_n, \]
where $(X_n, Y_n, \tilde\epsilon_n)_{n\ge1}$ is a sequence of independent and identically distributed random vectors in $\mathbb{R}^p\times\mathbb{R}\times\mathbb{R}$. Furthermore, for all $n$, $\tilde\epsilon_n$ is independent of $X_n$ and is a zero-mean random variable. In addition, the function $f$ is assumed to be almost surely twice differentiable with respect to the second variable. Under certain assumptions, $\theta$ is a local minimizer of the functional $G:\mathbb{R}^q\to\mathbb{R}_+$ defined for all $h\in\mathbb{R}^q$ by
\[ G(h) = \frac12\,\mathbb{E}\left[(Y - f(X,h))^2\right] =: \frac12\,\mathbb{E}\left[g(X,Y,h)\right], \]
where $(X, Y, \tilde\epsilon)$ has the same distribution as $(X_1, Y_1, \tilde\epsilon_1)$. Suppose from now on that the following assumptions are fulfilled:

(H1a) There is a positive constant $C$ such that for all $h\in\mathbb{R}^q$, $\mathbb{E}\left[\|\nabla_h g(X,Y,h)\|^2\right] \le C$.

(H1b)
There is a positive constant $C''$ such that for all $h\in\mathbb{R}^q$, $\mathbb{E}\left[\|\nabla_h f(X,h)\|^4\right] \le C''$.

(H1c) The matrix $L(h)$, defined for all $h\in\mathbb{R}^q$ by $L(h) = \mathbb{E}\left[\nabla_h f(X,h)\nabla_h f(X,h)^T\right]$, is positive definite at $\theta$.

Assumption (H1a) first ensures that the functional $G$ is Fréchet differentiable for all $h\in\mathbb{R}^q$. Moreover, since $\tilde\epsilon$ is independent of $X$ and zero-mean,
\[ \nabla G(h) = \mathbb{E}\left[(f(X,h) - f(X,\theta))\nabla_h f(X,h)\right]. \]
Then, $\nabla G(\theta) = 0$.
Assumption (H1b) will be crucial to control the possible divergence of the estimates of $L(\theta)$, as well as to give their rate of convergence. Finally, remark that thanks to assumption (H1c), $L(\theta)$ is invertible.

In order to estimate $\theta$, we propose in this work a new approach. Instead of using directly a Stochastic Newton algorithm based on an estimate of the Hessian of the functional $G$, we substitute this estimate with an estimate of $L(\theta)$, imitating thereby the Gauss-Newton algorithm. This leads to an algorithm of the form
\[ \theta_{n+1} = \theta_n + \frac{1}{n+1}\,\overline L_n^{-1}\left(Y_{n+1} - f(X_{n+1},\theta_n)\right)\nabla_h f(X_{n+1},\theta_n), \]
where
\[ \overline L_n := \frac1n\sum_{i=1}^n \nabla_h f(X_i,\theta_{i-1})\nabla_h f(X_i,\theta_{i-1})^T \]
is a natural recursive estimate of $L(\theta)$. Remark that supposing that the functional $G$ is twice differentiable leads to
\[ \nabla^2 G(h) = \mathbb{E}\left[\nabla_h f(X,h)\nabla_h f(X,h)^T\right] - \mathbb{E}\left[(f(X,\theta)-f(X,h))\nabla_h^2 f(X,h)\right], \]
and in particular, $H := \nabla^2 G(\theta) = L(\theta)$. Then, $\overline L_n$ is also an estimate of $H$. The proposed algorithm can thus be considered as a Stochastic Newton algorithm, but it does not require an explicit formula for the Hessian of $G$. Furthermore, the interest of considering $\overline L_n$ as an estimate of $H$ is that we can update the inverse of $\overline L_n$ recursively, thanks to Riccati's formula. To obtain the convergence of such an algorithm, it should be possible to state the following assumption:

(H*)
There is a positive constant $c$ such that for all $h\in\mathbb{R}^q$, $\lambda_{\min}(L(h)) \ge c$.

Nevertheless, this assumption considerably limits the family of functions $f$ we can consider. In order to free ourselves from the restrictive assumption (H*), we propose to estimate $\theta$ with the following Gauss-Newton algorithm.

Definition 2.1 (Stochastic Gauss-Newton algorithm). Let $(Z_n)_{n\ge1}$ be a sequence of independent random vectors with, for all $n\ge1$, $Z_n\sim\mathcal N(0, I_q)$. The Gauss-Newton algorithm is defined recursively for all $n\ge0$ by
\begin{align}
\tilde\Phi_{n+1} &= \nabla_h f\left(X_{n+1},\tilde\theta_n\right) \nonumber\\
\tilde\theta_{n+1} &= \tilde\theta_n + \tilde H_n^{-1}\tilde\Phi_{n+1}\left(Y_{n+1}-f\left(X_{n+1},\tilde\theta_n\right)\right) \tag{8}\\
\tilde H_{n+1/2}^{-1} &= \tilde H_n^{-1} - \left(1+\frac{c_{\tilde\beta}}{(n+1)^{\tilde\beta}}\,Z_{n+1}^T\tilde H_n^{-1}Z_{n+1}\right)^{-1}\frac{c_{\tilde\beta}}{(n+1)^{\tilde\beta}}\,\tilde H_n^{-1}Z_{n+1}Z_{n+1}^T\tilde H_n^{-1} \nonumber\\
\tilde H_{n+1}^{-1} &= \tilde H_{n+1/2}^{-1} - \left(1+\tilde\Phi_{n+1}^T\tilde H_{n+1/2}^{-1}\tilde\Phi_{n+1}\right)^{-1}\tilde H_{n+1/2}^{-1}\tilde\Phi_{n+1}\tilde\Phi_{n+1}^T\tilde H_{n+1/2}^{-1} \nonumber
\end{align}
with $\tilde\theta_0$ bounded, $\tilde H_0^{-1}$ symmetric and positive definite, $c_{\tilde\beta}\ge0$ and $\tilde\beta\in(0,1/2)$.
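As an illustration, one iteration of Definition 2.1 could be transcribed as follows in Python. This is only a sketch under our own naming conventions; rescaling $Z_{n+1}$ by $\sqrt{c_{\tilde\beta}(n+1)^{-\tilde\beta}}$ turns the weighted rank-one update into a standard Sherman-Morrison step.

```python
import numpy as np

rng = np.random.default_rng(1)

def sgn_regularized_step(theta, H_inv, x, y, n, f, grad_f, c_beta=1.0, beta=0.25):
    """One iteration of Definition 2.1 (illustrative sketch): theta and H_inv
    play the roles of tilde(theta)_n and tilde(H)_n^{-1}."""
    phi = grad_f(x, theta)                                # Phi_{n+1}
    theta = theta + (H_inv @ phi) * (y - f(x, theta))     # equation (8), with the current H_n^{-1}
    # first Riccati step: invert H_n + c_beta (n+1)^(-beta) Z Z^T via the rescaled z
    z = np.sqrt(c_beta * (n + 1) ** (-beta)) * rng.standard_normal(theta.size)
    Hz = H_inv @ z
    H_inv = H_inv - np.outer(Hz, Hz) / (1.0 + z @ Hz)
    # second Riccati step: invert H_{n+1/2} + Phi Phi^T
    Hphi = H_inv @ phi
    H_inv = H_inv - np.outer(Hphi, Hphi) / (1.0 + phi @ Hphi)
    return theta, H_inv
```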
Note that, compared with the algorithm based on $\overline L_n$ above, the matrix $(n+1)\overline L_n$ has been replaced by the matrix $\tilde H_n$ defined by:
\[ \tilde H_n = \tilde H_0 + \sum_{i=1}^n\tilde\Phi_i\tilde\Phi_i^T + \sum_{i=1}^n\frac{c_{\tilde\beta}}{i^{\tilde\beta}}Z_iZ_i^T. \]
The matrix $\tilde H_n^{-1}$ is iteratively computed thanks to Riccati's inversion formula (see (Duflo, 1997, p. 96)), which is applied twice: first to recursively invert the matrix $\tilde H_{n+1/2} = \tilde H_n + c_{\tilde\beta}(n+1)^{-\tilde\beta}Z_{n+1}Z_{n+1}^T$, then the matrix $\tilde H_{n+1} = \tilde H_{n+1/2} + \tilde\Phi_{n+1}\tilde\Phi_{n+1}^T$. In fact, introducing this additional term enables one to ensure, taking $c_{\tilde\beta} > 0$, that
\[ \lambda_{\max}\left(\tilde H_n^{-1}\right) = O\left(n^{\tilde\beta-1}\right)\quad a.s. \]
Therefore, it enables one to control the possible divergence of the estimates of the inverse of the Hessian and to obtain convergence results without assuming (H*). Anyway, if assumption (H*) is verified, one can take $c_{\tilde\beta} = 0$.

Nevertheless, in the case of stochastic gradient descents, it is well known in practice that a step of order $1/n$ can lead to insufficient results with a bad initialization. In order to overcome this problem, we propose an Averaged Stochastic Gauss-Newton algorithm, which consists in modifying equation (8) by introducing a step sequence of the form $\gamma_n = c_\alpha n^{-\alpha}$ (with $\alpha\in(1/2,1)$). Finally, in order to ensure the asymptotic efficiency, we add an averaging step.

Definition 2.2 (Averaged Stochastic Gauss-Newton algorithm). The Averaged Stochastic Gauss-Newton algorithm is recursively defined for all $n\ge0$ by
\begin{align}
\Phi_{n+1} &= \nabla_h f\left(X_{n+1},\overline\theta_n\right) \nonumber\\
\theta_{n+1} &= \theta_n + \gamma_{n+1}\,\overline S_n^{-1}\left(Y_{n+1}-f(X_{n+1},\theta_n)\right)\nabla_h f(X_{n+1},\theta_n) \tag{9}\\
\overline\theta_{n+1} &= \frac{n+1}{n+2}\,\overline\theta_n + \frac{1}{n+2}\,\theta_{n+1} \tag{10}\\
S_{n+1/2}^{-1} &= S_n^{-1} - \left(1+\frac{c_\beta}{(n+1)^\beta}Z_{n+1}^TS_n^{-1}Z_{n+1}\right)^{-1}\frac{c_\beta}{(n+1)^\beta}\,S_n^{-1}Z_{n+1}Z_{n+1}^TS_n^{-1} \nonumber\\
S_{n+1}^{-1} &= S_{n+1/2}^{-1} - \left(1+\Phi_{n+1}^TS_{n+1/2}^{-1}\Phi_{n+1}\right)^{-1}S_{n+1/2}^{-1}\Phi_{n+1}\Phi_{n+1}^TS_{n+1/2}^{-1} \nonumber
\end{align}
where $\overline\theta_0 = \theta_0$ is bounded, $S_0$ is symmetric and positive definite, $\gamma_n = c_\alpha n^{-\alpha}$ with $c_\alpha > 0$, $c_\beta \ge 0$, $\alpha\in(1/2,1)$, $\beta\in(0,\alpha-1/2)$, and $\overline S_n^{-1} = (n+1)S_n^{-1}$.

Let us note that despite the modification of the algorithm, Riccati's formula still holds. Finally, remark that if assumption (H*) is satisfied, one can take $c_\beta = 0$.
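The whole procedure of Definition 2.2 fits in a short loop. The following Python sketch, again under our own naming conventions, runs the averaged algorithm on data simulated from the model of Section 4; the constants $c_\alpha$, $\alpha$, $c_\beta$, $\beta$ below are illustrative choices satisfying $\alpha\in(1/2,1)$ and $\beta\in(0,\alpha-1/2)$.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x, theta):
    return theta[0] * (1.0 - np.exp(-theta[1] * x))

def grad_f(x, theta):
    e = np.exp(-theta[1] * x)
    return np.array([1.0 - e, theta[0] * x * e])

def asgn(n_iter, theta0, c_alpha=1.0, alpha=0.75, c_beta=1.0, beta=0.2,
         theta_star=(21.0, 12.0)):
    """Averaged Stochastic Gauss-Newton (Definition 2.2) on simulated data."""
    theta_star = np.asarray(theta_star)
    theta = np.array(theta0, dtype=float)          # theta_n
    theta_bar = theta.copy()                       # bar(theta)_n, with bar(theta)_0 = theta_0
    S_inv = np.eye(2)                              # S_0^{-1} = I_q
    for n in range(n_iter):
        x = rng.uniform()
        y = f(x, theta_star) + rng.standard_normal()
        phi = grad_f(x, theta_bar)                 # Phi_{n+1}, evaluated at the averaged iterate
        g = grad_f(x, theta)                       # gradient at theta_n, used in equation (9)
        gamma = c_alpha * (n + 1) ** (-alpha)
        # equation (9): step gamma_{n+1} bar(S)_n^{-1}, with bar(S)_n^{-1} = (n+1) S_n^{-1}
        theta = theta + gamma * (n + 1) * (S_inv @ g) * (y - f(x, theta))
        theta_bar = theta_bar + (theta - theta_bar) / (n + 2)   # equation (10)
        # two Riccati steps updating S_n^{-1}
        z = np.sqrt(c_beta * (n + 1) ** (-beta)) * rng.standard_normal(2)
        Sz = S_inv @ z
        S_inv = S_inv - np.outer(Sz, Sz) / (1.0 + z @ Sz)
        Sphi = S_inv @ phi
        S_inv = S_inv - np.outer(Sphi, Sphi) / (1.0 + phi @ Sphi)
    return theta_bar

theta_hat = asgn(10_000, theta0=[20.0, 11.0])
```

In practice, as in the simulation study of Section 4, a projection step may be added to avoid numerical problems with very bad initializations.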
We now suppose that the following additional assumptions are fulfilled:

(H2) The functional $G$ is twice Fréchet differentiable on $\mathbb{R}^q$ and there is a positive constant $C'$ such that for all $h\in\mathbb{R}^q$, $\left\|\nabla^2G(h)\right\|_{op} \le C'$.

Note that this is a usual assumption in stochastic convex optimization (see for instance (Kushner and Yin, 2003)), and especially for studying the convergence of stochastic algorithms (Bach, 2014; Godichon-Baggioni, 2019; Gadat and Panloup, 2017).

(H3) The function $L$ is continuous at $\theta$.

Assumption (H3) is used for the consistency of the estimates of $L(\theta)$, which enables one to give the almost sure rates of convergence of the stochastic Gauss-Newton estimates and of their averaged version.

Let us now make some additional assumptions on the Hessian of the function we would like to minimize:

(H4a) The functional $h\mapsto\nabla^2G(h)$ is continuous on a neighborhood of $\theta$.

(H4b) The functional $h\mapsto\nabla^2G(h)$ is $C_G$-Lipschitz on a neighborhood of $\theta$.

Assumption (H4a) is useful for establishing the rate of convergence of the stochastic Gauss-Newton algorithms given by (8) and (9), while assumption (H4b) enables one to give the rate of convergence of the averaged estimates. Clearly, (H4b) implies (H4a). Note that, as in the case of assumption (H2), these last assumptions are crucial to obtain almost sure rates of convergence of stochastic gradient estimates (see for instance (Pelletier, 1998; Godichon-Baggioni, 2019)).

3 Convergence results

We focus here on the convergence of the Averaged Stochastic Gauss-Newton algorithm, since the proofs are more unusual than the ones for the non-averaged version. Indeed, these last ones are quite close to the proofs in Bercu et al. (2020).
The first theorem deals with the almost sure convergence of a subsequence of $\nabla G(\theta_n)$.

Theorem 3.1.
Under assumptions (H1) and (H2), with $c_\beta > 0$, $G(\theta_n)$ converges almost surely to a finite random variable and there is a subsequence $(\theta_{\varphi_n})$ such that
\[ \left\|\nabla G\left(\theta_{\varphi_n}\right)\right\| \xrightarrow[n\to+\infty]{a.s.} 0. \]

Remark that if $G$ is convex, or if we project the algorithm on a convex subspace where $G$ is convex, the previous theorem leads to the almost sure convergence of the estimates. In order to stay as general as possible, let us now introduce the event
\[ \Gamma_\theta = \left\{\omega\in\Omega,\ \theta_n(\omega)\xrightarrow[n\to+\infty]{}\theta\right\}. \]
It is not unusual to introduce this kind of event for studying the convergence of stochastic algorithms without loss of generality: see for instance (Pelletier, 1998, 2000). Many criteria can ensure that $\mathbb P[\Gamma_\theta] = 1$,
i.e., that $\theta_n$ converges almost surely to $\theta$ (see Duflo (1997), Kushner and Yin (2003) for stochastic gradient descents, and Bercu et al. (2020) for an example of a stochastic Newton algorithm). The following corollary gives the almost sure convergence of the estimates of $L(\theta)$.

Corollary 3.1.
Under assumptions (H1) to (H3), on $\Gamma_\theta$, the following almost sure convergences hold:
\[ \overline S_n \xrightarrow[n\to+\infty]{a.s.} H \quad\text{and}\quad \overline S_n^{-1} \xrightarrow[n\to+\infty]{a.s.} H^{-1}. \]
In order to get the rates of convergence of the estimates $(\theta_n)$, let us first introduce a new assumption:

(H5)
There are positive constants $\eta$, $C_\eta$, with $\eta > (1-\alpha)/\alpha$, such that for all $h\in\mathbb{R}^q$, $\mathbb{E}\left[\|\nabla_h g(X,Y,h)\|^{2+2\eta}\right] \le C_\eta$.

Theorem 3.2.
Assume that assumptions (H1) to (H4a) and (H5) hold. Then, on $\Gamma_\theta$,
\[ \|\theta_n-\theta\|^2 = O\left(\frac{\ln n}{n^\alpha}\right)\quad a.s. \tag{11} \]
Besides, adding assumption (H4b), assuming that the function $L$ is $C_f$-Lipschitz on a neighborhood of $\theta$, and that the function $h\mapsto\mathbb{E}\left[\nabla_hg(X,Y,h)\nabla_hg(X,Y,h)^T\right]$ is $C_g$-Lipschitz on a neighborhood of $\theta$, then on $\Gamma_\theta$,
\[ \sqrt{\frac{n^\alpha}{c_\alpha}}\,(\theta_n-\theta) \xrightarrow[n\to+\infty]{\mathcal L} \mathcal N\left(0,\ \frac{\sigma^2}{2}L(\theta)^{-1}\right). \tag{12} \]
These are quite usual results for a step sequence of order $n^{-\alpha}$. Indeed, as expected, we have a "loss" in the rate of convergence of $(\theta_n)$, but the averaging step enables us to recover an asymptotically optimal behavior, which is given by the following theorem.

Theorem 3.3.
Assuming (H1) to (H4a) together with (H5), on $\Gamma_\theta$,
\[ \left\|\overline\theta_n-\theta\right\|^2 = O\left(\frac{\ln n}{n}\right)\quad a.s. \]
Moreover, suppose that the function $h\mapsto\mathbb{E}\left[\nabla_hg(X,Y,h)\nabla_hg(X,Y,h)^T\right]$ is continuous at $\theta$; then on $\Gamma_\theta$,
\[ \sqrt n\left(\overline\theta_n-\theta\right) \xrightarrow[n\to+\infty]{\mathcal L} \mathcal N\left(0,\ \sigma^2L(\theta)^{-1}\right). \]

Corollary 3.2.
Assuming (H1) to (H5) and assuming that the functional $L$ is $C_f$-Lipschitz on a neighborhood of $\theta$, then on $\Gamma_\theta$, for all $\delta > 0$,
\[ \left\|\overline S_n-H\right\|_F = O\left(\max\left\{\frac{c_\beta}{n^\beta},\ \sqrt{\frac{(\ln n)^{1+\delta}}{n}}\right\}\right)\ a.s., \qquad \left\|\overline S_n^{-1}-H^{-1}\right\|_F = O\left(\max\left\{\frac{c_\beta}{n^\beta},\ \sqrt{\frac{(\ln n)^{1+\delta}}{n}}\right\}\right)\ a.s. \]
Recall here that if (H*) holds, then one can take $c_\beta = 0$, so that the rates are of order $n^{-1/2}$ (up to a log term).

Theorem 3.4.
Assume that assumptions (H1) to (H5) hold. Then, on
\[ \tilde\Gamma_\theta := \left\{\omega\in\Omega,\ \tilde\theta_n(\omega)\xrightarrow[n\to+\infty]{a.s.}\theta\right\}, \]
\[ \left\|\tilde\theta_n-\theta\right\|^2 = O\left(\frac{\ln n}{n}\right)\ a.s. \]
Moreover, suppose that the functional $h\mapsto\mathbb{E}\left[\nabla_hg(X,Y,h)\nabla_hg(X,Y,h)^T\right]$ is continuous at $\theta$; then on $\tilde\Gamma_\theta$,
\[ \sqrt n\left(\tilde\theta_n-\theta\right)\xrightarrow[n\to+\infty]{\mathcal L}\mathcal N\left(0,\ \sigma^2L(\theta)^{-1}\right). \]

Estimation of $\sigma^2$. We now focus on the estimation of the variance $\sigma^2$ of the errors. First, consider the sequence of predictors $(\hat Y_n)_{n\ge1}$ of $Y_n$, defined for all $n\ge1$ by $\hat Y_n = f\left(X_n,\overline\theta_{n-1}\right)$. Then, one can estimate $\sigma^2$ by the recursive estimate $\hat\sigma_n^2$ defined for all $n\ge1$ by
\[ \hat\sigma_n^2 = \frac1n\sum_{k=1}^n\left(\hat Y_k - Y_k\right)^2. \]
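Since $\hat\sigma_n^2$ is an empirical mean of the squared prediction errors, it can be updated online at no extra cost. A one-line Python sketch (the function name is ours):

```python
def update_sigma2(sigma2, y_pred, y, n):
    # sigma2_n = sigma2_{n-1} + ((hat(Y)_n - Y_n)^2 - sigma2_{n-1}) / n
    return sigma2 + ((y_pred - y) ** 2 - sigma2) / n
```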
Corollary 3.3. Assume that assumptions (H1) to (H5) hold and that $\tilde\epsilon$ admits a 4th-order moment. Then, on $\Gamma_\theta$,
\[ \left|\hat\sigma_n^2-\sigma^2\right| = O\left(\sqrt{\frac{\ln n}{n}}\right)\ a.s., \quad\text{and}\quad \sqrt n\left(\hat\sigma_n^2-\sigma^2\right)\xrightarrow[n\to+\infty]{\mathcal L}\mathcal N\left(0,\ \mathbb V\left[\tilde\epsilon^2\right]\right). \]

4 Simulation study

In this section, we present a short simulation study to illustrate the convergence results given in Section 3. Simulations were carried out using standard statistical software.
Consider the following model:
\[ Y = \theta_1\left(1-\exp(-\theta_2X)\right) + \tilde\epsilon, \]
with $\theta = (\theta_1,\theta_2)^T = (21, 12)^T$, $X\sim\mathcal U([0,1])$, and $\tilde\epsilon\sim\mathcal N(0,1)$. Admittedly, this model is very simple. However, it allows us to explore the behaviour of the different algorithms presented in this paper, by comparing our methods with the stochastic gradient algorithm and its averaged version, and by looking at the influence of the step sequence on the estimates. In this special case, the matrix $L(\theta) = \nabla^2G(\theta)$ is equal to:
\[ L(\theta) = \mathbb E\begin{pmatrix} (1-\exp(-\theta_2X))^2 & \theta_1X(1-\exp(-\theta_2X))\exp(-\theta_2X)\\ \theta_1X(1-\exp(-\theta_2X))\exp(-\theta_2X) & \theta_1^2X^2\exp(-2\theta_2X)\end{pmatrix}. \]
Then, taking $\theta = (21,12)^T$, the eigenvalues of $L(21,12)$ are approximately equal (in decreasing order) to 0.889 and 0.049.
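These eigenvalues are easy to check numerically. A possible Monte Carlo verification in Python (the sample size $10^6$ is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)
theta = np.array([21.0, 12.0])
x = rng.uniform(size=1_000_000)              # X ~ U([0, 1])
e = np.exp(-theta[1] * x)
g = np.stack([1.0 - e, theta[0] * x * e])    # nabla_h f(X, theta), one column per draw
L = g @ g.T / x.size                         # Monte Carlo estimate of E[grad f grad f^T]
print(np.linalg.eigvalsh(L))                 # approximately [0.049, 0.889]
```

The two eigenvalues differ by more than one order of magnitude, which is precisely the situation in which a single step sequence common to both coordinates is expected to behave poorly.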
To avoid some computational problems, we consider projected versions of the different algorithms: more precisely, each algorithm is projected on the ball of center $(21,12)^T$ and of radius 12. For each algorithm, we calculate the mean squared error based on 100 independent samples of size $n = 10\,000$, with initialization $\theta_0 = \theta + U$, where $U$ follows a uniform law on the unit sphere of $\mathbb R^2$.

We can see in Tables 1 and 2 that Gauss-Newton methods perform globally better than gradient descents. This is certainly due to the fact that the eigenvalues of $L(\theta)$ are at quite different scales, so that the step sequence in gradient descent is less adapted to each direction. Remark that we could have been less fair play, considering an example where the eigenvalues of $L(\theta)$ would have been at even more different scales. Nevertheless, unsurprisingly, the estimates seem to be quite sensitive to the choices of the parameters $c_\alpha$ and $\alpha$.

Table 1: Mean squared errors of Stochastic Newton estimates defined by (9) (on the left) and of its averaged version defined by (10) (on the right), for different parameters $\alpha$ and $c_\alpha$, for $n = 10\,000$.

Table 2: Mean squared errors of Stochastic Gradient estimates with step $c_\alpha n^{-\alpha}$ (on the left) and of its averaged version (on the right), for different parameters $\alpha$ and $c_\alpha$, for $n = 10\,000$.

To complete this study, let us now examine the behaviour of the mean squared error as a function of the sample size for the standard Stochastic Gauss-Newton algorithm ("SN") and its averaged version ("ASN"), for the Stochastic Gauss-Newton algorithm defined by (9) ("ND"), and for the stochastic gradient algorithm ("SGD") as well as its averaged version ("ASGD"). The stochastic Gauss-Newton algorithms have been computed with $c_\beta = 0$ and $S_0 = I_2$. For the averaged stochastic Newton algorithm, a step sequence of the form $c_\alpha n^{-\alpha}$ has been used, and the algorithms are initialized taking $\theta_0 = \theta + rZ$, where $Z$ follows a uniform law on the unit sphere of $\mathbb R^2$ and $r = 1, 5, 12$.
Figure 1: Evolution of the mean squared error with respect to the sample size with, from left to right, $r = 1, 5, 12$.

One can see in Figure 1 that stochastic Gauss-Newton algorithms perform better than gradient descents and their averaged versions for a quite good initialization. As explained before, this is due to the fact that stochastic Gauss-Newton methods enable the step sequence to adapt to each direction. Furthermore, one can see that with a quite good initialization, the Stochastic Gauss-Newton algorithm and its averaged version give similar results. Nevertheless, when we deal with a bad initialization, taking a step sequence of the form $c_\alpha n^{-\alpha}$ enables the estimates to move faster, so that the averaged Gauss-Newton estimates give better results compared to the non-averaged version. Finally, in Figure 2, one can see that Gauss-Newton methods globally perform better than gradient methods, but there are some bad trajectories due to bad initialization.
Figure 2: Mean squared errors for the different methods, for $n = 10\,000$.

Influence of $c_\beta$ and $\beta$. We now focus on the influence of $c_\beta$ and $\beta$ on the behaviour of the estimates. All the results are calculated from 100 independent samples of size $n$. In Table 3, we have chosen $c_\alpha = 1$, and we initialize the algorithms taking $\theta_0 = \theta + U$. Choosing small $c_\beta$ leads to good results, which, without surprise, suggests that the use of the term $\sum_{k=1}^nc_\beta k^{-\beta}Z_kZ_k^T$ is only theoretical, but has no use in practice.

Table 3: Mean squared errors of, from left to right, Stochastic Newton estimates with step $c_\alpha n^{-\alpha}$ defined by (9), its averaged version defined by (10), and Stochastic Newton estimates defined by (8), for different values of $\beta$ and $c_\beta$.

Asymptotic normality. We now illustrate the asymptotic normality of the estimates. In order to consider all the components of the parameter $\theta$, we shall in fact examine the two following central limit theorems:
\[ \widetilde C_n := \left(\tilde\theta_n-\theta\right)^T\tilde H_n\left(\tilde\theta_n-\theta\right)\xrightarrow[n\to+\infty]{\mathcal L}\chi^2(2) \quad\text{and}\quad \overline C_n := \left(\overline\theta_n-\theta\right)^TS_n\left(\overline\theta_n-\theta\right)\xrightarrow[n\to+\infty]{\mathcal L}\chi^2(2). \]
These two results are straightforward applications of Theorems 3.3 and 3.4, taking $\sigma^2 = 1$. Using the kernel method, we estimate the probability density function (pdf for short) of $\widetilde C_n$ and $\overline C_n$, which we compare to the pdf of a chi-squared distribution with 2 degrees of freedom (df). Estimates $\tilde\theta_n$ and $\overline\theta_n$ are computed taking $\theta_0 = \theta + U$ and a step sequence of the form $c_\alpha n^{-\alpha}$. We obtain p-values equal to 0.1259 for $\widetilde C_n$ and 0.33 for $\overline C_n$. Therefore, the Gaussian approximation provided by the central limit theorems given in Theorems 3.3 and 3.4 is pretty good, even for quite moderate sample sizes.
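In practice, such a diagnostic can be computed directly from the outputs of the algorithms. The sketch below uses hypothetical variable names: `theta_bar` and `S` are assumed to be the final averaged estimate and the matrix $S_n$ accumulated by one run with $\sigma^2 = 1$.

```python
import numpy as np

def chi2_statistic(theta_bar, S, theta_star):
    """bar(C)_n = (bar(theta)_n - theta)^T S_n (bar(theta)_n - theta)."""
    d = np.asarray(theta_bar) - np.asarray(theta_star)
    return d @ S @ d

# Over independent replications, collect the statistics in c_values and compare
# with the chi^2(2) distribution, e.g. with a goodness-of-fit test such as
# scipy.stats.kstest(c_values, scipy.stats.chi2(df=2).cdf).
```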
Figure 3: Density function of a chi-squared distribution with 2 df (in black) and kernel-based density estimates of $\widetilde C_n$ (in green) and $\overline C_n$ (in red).

5 Proofs

Lemma 5.1.
Assuming (H1) and (H2) and taking $c_\beta > 0$, it follows that
\[ \lambda_{\max}\left(S_n^{-1}\right) = O\left(n^{\beta-1}\right)\ a.s.\quad\text{and}\quad\lambda_{\max}(S_n) = O(n)\ a.s.

Proof of Lemma 5.1.
First, note that
\[ \lambda_{\min}(S_n) \ge \lambda_{\min}\left(\sum_{i=1}^n\frac{c_\beta}{i^\beta}Z_iZ_i^T\right), \]
and one can check that
\[ \frac{1}{\sum_{i=1}^nc_\beta i^{-\beta}}\sum_{i=1}^n\frac{c_\beta}{i^\beta}Z_iZ_i^T\xrightarrow[n\to+\infty]{a.s.}I_q. \]
Since $\sum_{i=1}^nc_\beta i^{-\beta}\sim\frac{c_\beta}{1-\beta}n^{1-\beta}$, it follows that
\[ \lambda_{\max}\left(S_n^{-1}\right) = O\left(n^{\beta-1}\right)\ a.s. \]
Let us now give a bound for $\lambda_{\max}(S_n)$. First, remark that $S_n$ can be written as
\[ S_n = S_0 + \sum_{i=1}^n\mathbb E\left[\Phi_i\Phi_i^T|\mathcal F_{i-1}\right] + \sum_{i=1}^n\Xi_i + \sum_{i=1}^n\frac{c_\beta}{i^\beta}Z_iZ_i^T, \tag{5.1} \]
where $\Xi_i := \Phi_i\Phi_i^T-\mathbb E[\Phi_i\Phi_i^T|\mathcal F_{i-1}]$ and $(\mathcal F_i)$ is the $\sigma$-algebra generated by the sample, i.e., $\mathcal F_i := \sigma((X_1,Y_1),\ldots,(X_i,Y_i))$. Thanks to assumption (H1b),
\[ \mathbb E\left[\left\|\Phi_i\Phi_i^T\right\|_F^2\,|\,\mathcal F_{i-1}\right] \le C'', \]
and applying Theorem 6.2, it follows that for all $\delta>0$,
\[ \left\|\sum_{i=1}^n\Xi_i\right\|_F = o\left(\sqrt{n(\ln n)^{1+\delta}}\right)\ a.s., \]
where $\|\cdot\|_F$ is the Frobenius norm. Then,
\[ \|S_n\|_F = O\left(\max\left(\|S_0\|_F,\ C''n,\ \sqrt{n(\ln n)^{1+\delta}},\ n^{1-\beta}\right)\right)\ a.s., \]
which concludes the proof.

Proof of Theorem 3.1.
With the help of a Taylor expansion, it comes
\[ G(\theta_{n+1}) = G(\theta_n) + \nabla G(\theta_n)^T(\theta_{n+1}-\theta_n) + (\theta_{n+1}-\theta_n)^T\int_0^1\nabla^2G\left(\theta_n+t(\theta_{n+1}-\theta_n)\right)(1-t)\,dt\,(\theta_{n+1}-\theta_n). \]
Then, assumption (H2) yields
\[ G(\theta_{n+1}) \le G(\theta_n) + \nabla G(\theta_n)^T(\theta_{n+1}-\theta_n) + \frac{C'}{2}\|\theta_{n+1}-\theta_n\|^2. \]
Replacing $\theta_{n+1}$, and since $\theta_{n+1}-\theta_n = -\frac{\gamma_{n+1}}{2}\overline S_n^{-1}\nabla_hg(X_{n+1},Y_{n+1},\theta_n)$,
\begin{align*}
G(\theta_{n+1}) &\le G(\theta_n) - \frac{\gamma_{n+1}}{2}\nabla G(\theta_n)^T\overline S_n^{-1}\nabla_hg(X_{n+1},Y_{n+1},\theta_n) + \frac{C'}{8}\gamma_{n+1}^2\left\|\overline S_n^{-1}\nabla_hg(X_{n+1},Y_{n+1},\theta_n)\right\|^2\\
&\le G(\theta_n) - \frac{\gamma_{n+1}}{2}\nabla G(\theta_n)^T\overline S_n^{-1}\nabla_hg(X_{n+1},Y_{n+1},\theta_n) + \frac{C'}{8}\gamma_{n+1}^2\left(\lambda_{\max}\left(\overline S_n^{-1}\right)\right)^2\left\|\nabla_hg(X_{n+1},Y_{n+1},\theta_n)\right\|^2.
\end{align*}
Assumption (H1a) leads to
\begin{align*}
\mathbb E\left[G(\theta_{n+1})|\mathcal F_n\right] &\le G(\theta_n) - \gamma_{n+1}\nabla G(\theta_n)^T\overline S_n^{-1}\nabla G(\theta_n) + \frac{CC'}{8}\gamma_{n+1}^2\left(\lambda_{\max}\left(\overline S_n^{-1}\right)\right)^2\\
&\le G(\theta_n) - \gamma_{n+1}\lambda_{\min}\left(\overline S_n^{-1}\right)\|\nabla G(\theta_n)\|^2 + \frac{CC'}{8}\gamma_{n+1}^2\left(\lambda_{\max}\left(\overline S_n^{-1}\right)\right)^2.
\end{align*}
Thanks to Lemma 5.1, and since $\beta < \alpha-1/2$,
\[ \sum_{n\ge1}\gamma_{n+1}^2\left(\lambda_{\max}\left(\overline S_n^{-1}\right)\right)^2 < +\infty\ a.s., \]
so that, applying the Robbins-Siegmund theorem (see Duflo (1997) for instance), $G(\theta_n)$ converges almost surely to a finite random variable and
\[ \sum_{n\ge1}\gamma_{n+1}\lambda_{\min}\left(\overline S_n^{-1}\right)\|\nabla G(\theta_n)\|^2 = \sum_{n\ge1}\gamma_{n+1}\lambda_{\max}\left(\overline S_n\right)^{-1}\|\nabla G(\theta_n)\|^2 < +\infty\ a.s. \]
Lemma 5.1 implies $\sum_{n\ge1}\gamma_{n+1}\lambda_{\max}\left(\overline S_n\right)^{-1} = +\infty$ almost surely, so that there is almost surely a subsequence $(\theta_{\varphi_n})$ such that $\left\|\nabla G\left(\theta_{\varphi_n}\right)\right\|$ converges to 0.

Proof of Corollary 3.1.
Let us give the convergence of each term in decomposition (5.1). Since $\beta < 1/2$,
\[ \frac1n\left\|\sum_{i=1}^n\Xi_i\right\|_F\xrightarrow[n\to+\infty]{a.s.}0 \quad\text{and}\quad \frac1n\left\|\sum_{i=1}^n\frac{c_\beta}{i^\beta}Z_iZ_i^T\right\|_F\xrightarrow[n\to+\infty]{a.s.}0. \]
Finally, since, on $\Gamma_\theta$, $\theta_n$ and hence $\overline\theta_n$ converge almost surely to $\theta$, assumption (H3) together with the Toeplitz lemma yields
\[ \frac1n\sum_{i=1}^n\mathbb E\left[\Phi_i\Phi_i^T|\mathcal F_{i-1}\right]\xrightarrow[n\to+\infty]{a.s.}\mathbb E\left[\nabla_hf(X,\theta)\nabla_hf(X,\theta)^T\right] = L(\theta). \]

Decomposition of the estimates. First, $\theta_{n+1}-\theta$ can be written as
\[ \theta_{n+1}-\theta = \theta_n-\theta-\gamma_{n+1}\overline S_n^{-1}\nabla G(\theta_n) + \gamma_{n+1}\overline S_n^{-1}\xi_{n+1}, \tag{13} \]
with $\xi_{n+1} := \nabla G(\theta_n)-\frac12\nabla_hg(X_{n+1},Y_{n+1},\theta_n)$. Remark that $(\xi_n)$ is a sequence of martingale differences adapted to the filtration $(\mathcal F_n)$. Moreover, linearizing the gradient and writing $H = \nabla^2G(\theta)$, it comes
\begin{align}
\theta_{n+1}-\theta &= \theta_n-\theta-\gamma_{n+1}\overline S_n^{-1}H(\theta_n-\theta) + \gamma_{n+1}\overline S_n^{-1}\xi_{n+1} - \gamma_{n+1}\overline S_n^{-1}\delta_n \nonumber\\
&= (1-\gamma_{n+1})(\theta_n-\theta) + \gamma_{n+1}\left(H^{-1}-\overline S_n^{-1}\right)H(\theta_n-\theta) + \gamma_{n+1}\overline S_n^{-1}\xi_{n+1} - \gamma_{n+1}\overline S_n^{-1}\delta_n, \tag{14}
\end{align}
where $\delta_n = \nabla G(\theta_n)-H(\theta_n-\theta)$ is the remainder term in the Taylor expansion of the gradient. By induction, one can check that for all $n\ge1$,
\[ \theta_n-\theta = \beta_{n,0}(\theta_0-\theta) + \sum_{k=0}^{n-1}\beta_{n,k+1}\gamma_{k+1}\left(H^{-1}-\overline S_k^{-1}\right)H(\theta_k-\theta) - \sum_{k=0}^{n-1}\beta_{n,k+1}\gamma_{k+1}\overline S_k^{-1}\delta_k + \sum_{k=0}^{n-1}\beta_{n,k+1}\gamma_{k+1}\overline S_k^{-1}\xi_{k+1}, \tag{5.2} \]
with, for all $k\le n$,
\[ \beta_{n,k} = \prod_{j=k+1}^n(1-\gamma_j)\quad\text{and}\quad\beta_{n,n} = 1. \]
Moreover, one can check that
\[ \left\|\sum_{k=0}^{n-1}\beta_{n,k+1}\gamma_{k+1}\overline S_k^{-1}\xi_{k+1}\right\|^2 = O\left(\frac{\ln n}{n^\alpha}\right)\ a.s. \tag{15} \]
We can now prove Theorem 3.2.

Proof of equation (11).
The aim is to give the rate of convergence of each term of decomposition (5.2). The rate of the martingale term is given by equation (15). Remark that there is a rank $n_\alpha$ such that for all $n\ge n_\alpha$, we have $\gamma_n\le1$, so that, for all $n\ge n_\alpha$,
\[ \|\beta_{n,0}(\theta_0-\theta)\| \le \|\theta_0-\theta\|\prod_{i=0}^{n_\alpha-1}|1-\gamma_{i+1}|\prod_{i=n_\alpha}^{n-1}(1-\gamma_{i+1}) \le \|\theta_0-\theta\|\prod_{i=0}^{n_\alpha-1}|1-\gamma_{i+1}|\exp\left(-\sum_{i=n_\alpha}^{n-1}\gamma_{i+1}\right). \]
Since $\alpha < 1$, this term converges to 0 at an exponential rate, and more precisely
\[ \|\beta_{n,0}(\theta_0-\theta)\| = O\left(\exp\left(-\frac{c_\alpha}{1-\alpha}n^{1-\alpha}\right)\right)\ a.s. \]
Let us denote
\[ \Delta_n := \sum_{k=0}^{n-1}\beta_{n,k+1}\gamma_{k+1}\left(H^{-1}-\overline S_k^{-1}\right)H(\theta_k-\theta) - \sum_{k=0}^{n-1}\beta_{n,k+1}\gamma_{k+1}\overline S_k^{-1}\delta_k. \]
The aim is thus to prove that this term is negligible. First, remark that the Taylor expansion of the gradient yields
\[ \delta_n = \int_0^1\left(\nabla^2G(\theta+t(\theta_n-\theta))-H\right)(\theta_n-\theta)\,dt, \]
and thanks to assumption (H4a), since $\theta_n$ converges almost surely to $\theta$ and by dominated convergence,
\[ \|\delta_n\| \le \|\theta_n-\theta\|\int_0^1\left\|\nabla^2G(\theta+t(\theta_n-\theta))-H\right\|_{op}dt = o(\|\theta_n-\theta\|)\ a.s. \]
In a similar way, since $\overline S_n^{-1}$ converges almost surely to $H^{-1}$,
\[ \left\|\left(H^{-1}-\overline S_n^{-1}\right)H(\theta_n-\theta)\right\| = o(\|\theta_n-\theta\|)\ a.s. \]
Moreover, since
\[ \Delta_{n+1} = (1-\gamma_{n+1})\Delta_n + \gamma_{n+1}\left(H^{-1}-\overline S_n^{-1}\right)H(\theta_n-\theta) - \gamma_{n+1}\overline S_n^{-1}\delta_n, \]
we have
\[ \|\Delta_{n+1}\| \le |1-\gamma_{n+1}|\|\Delta_n\| + o\left(n^{-\alpha}\|\theta_n-\theta\|\right) \le |1-\gamma_{n+1}|\|\Delta_n\| + o\left(n^{-\alpha}\left\|\Delta_n+\sum_{k=0}^{n-1}\beta_{n,k+1}\gamma_{k+1}\overline S_k^{-1}\xi_{k+1}+\beta_{n,0}(\theta_0-\theta)\right\|\right)\ a.s. \]
Then, since $\beta_{n,0}(\theta_0-\theta)$ converges to 0 at an exponential rate, and thanks to equation (15), there are almost surely a rank $n_0$ and positive constants $c_0$, $C_0$ such that for all $n\ge n_0$,
\[ \|\Delta_{n+1}\| \le \left(1-c_0n^{-\alpha}\right)\|\Delta_n\| + C_0n^{-\alpha}\sqrt{\frac{\ln n}{n^\alpha}}, \]
and applying a stabilization lemma (see Duflo (1997) for instance),
\[ \|\Delta_n\| = O\left(\sqrt{\frac{\ln n}{n^\alpha}}\right)\ a.s., \]
which concludes the proof.

Proof of equation (12). In order to get the asymptotic normality, we will apply the Central Limit Theorem of Jakubowski (1988) to the martingale term in decomposition (5.2), and prove that the other terms of this decomposition are negligible. The first term on the right-hand side of (5.2) converges to 0 at an exponential rate, as seen above. Moreover, applying Theorem 3.2 and Remark 5.1, as well as Lemma E.2 in Cardot and Godichon-Baggioni (2015), one can check that for all $\delta>0$,
\[ \left\|\sum_{k=0}^{n-1}\beta_{n,k+1}\gamma_{k+1}\left(H^{-1}-\overline S_k^{-1}\right)H(\theta_k-\theta)\right\| = o\left(\frac{(\ln n)^{1+\delta}}{n^{\alpha/2+\beta}}\right)\ a.s., \]
and this term is thus negligible. In the same way,
\[ \left\|\sum_{k=0}^{n-1}\beta_{n,k+1}\gamma_{k+1}\overline S_k^{-1}\delta_k\right\| = o\left(\frac{(\ln n)^{1+\delta}}{n^{\alpha}}\right)\ a.s. \]
Note that the martingale term can be written as
\[ \sum_{k=0}^{n-1}\beta_{n,k+1}\gamma_{k+1}\overline S_k^{-1}\xi_{k+1} = \sum_{k=0}^{n-1}\beta_{n,k+1}\gamma_{k+1}\left(\overline S_k^{-1}-H^{-1}\right)\xi_{k+1} + \sum_{k=0}^{n-1}\beta_{n,k+1}\gamma_{k+1}H^{-1}\xi_{k+1}. \]
Applying Theorem 6.1 and Remark 5.1,
\[ \left\|\sum_{k=0}^{n-1}\beta_{n,k+1}\gamma_{k+1}\left(\overline S_k^{-1}-H^{-1}\right)\xi_{k+1}\right\| = O\left(\frac{\ln n}{n^{\alpha/2+\beta}}\right)\ a.s. \]
Finally, let us now prove that the martingale term satisfies the assumptions of the Central Limit Theorem of Jakubowski (1988), i.e., that for all $\nu >$
0,
\[ \lim_{n\to+\infty}\mathbb P\left[\sup_{0\le k\le n}\sqrt{\frac{n^\alpha}{c_\alpha}}\left\|\beta_{n+1,k+1}\gamma_{k+1}H^{-1}\xi_{k+1}\right\|>\nu\right] = 0 \tag{5.3} \]
and
\[ \frac{n^\alpha}{c_\alpha}\sum_{k=0}^n\beta_{n+1,k+1}^2\gamma_{k+1}^2H^{-1}\xi_{k+1}\xi_{k+1}^TH^{-1}\xrightarrow[n\to+\infty]{a.s.}\frac{\sigma^2}{2}H^{-1}. \tag{5.4} \]

Proof of (5.3):
Thanks to assumption (H5), one can check that for all $n\ge0$,
\[ \mathbb E\left[\|\xi_{n+1}\|^{2+2\eta}|\mathcal F_n\right] \le 2^{1+2\eta}C_\eta. \]
Then, applying Markov's inequality,
\begin{align*}
\mathbb P\left[\sup_{0\le k\le n}\sqrt{\tfrac{n^\alpha}{c_\alpha}}\left\|\beta_{n+1,k+1}\gamma_{k+1}H^{-1}\xi_{k+1}\right\|>\nu\right] &\le \sum_{k=0}^n\mathbb P\left[\sqrt{\tfrac{n^\alpha}{c_\alpha}}\left\|\beta_{n+1,k+1}\gamma_{k+1}H^{-1}\xi_{k+1}\right\|>\nu\right]\\
&\le \sum_{k=0}^n\mathbb P\left[\|\xi_{k+1}\|>\frac{\sqrt{c_\alpha}\,\nu}{n^{\alpha/2}\,|\beta_{n+1,k+1}|\,\gamma_{k+1}\,\|H^{-1}\|_{op}}\right]\\
&\le \frac{c_\alpha^{1+\eta}\|H^{-1}\|_{op}^{2+2\eta}}{\nu^{2+2\eta}}\,2^{1+2\eta}C_\eta\,n^{\alpha(1+\eta)}\sum_{k=0}^n|\beta_{n+1,k+1}|^{2+2\eta}(k+1)^{-\alpha(2+2\eta)}.
\end{align*}
Moreover, note that there is a rank $n_\alpha$ such that for all $j\ge n_\alpha$ we have $c_\alpha j^{-\alpha}\le1$. For the sake of readability of the proof (otherwise, one can split the sum into two parts, as in the proof of Lemma 3.1 in Cardot et al. (2017)), we consider from now on that $n_\alpha = 1$. Then
\[ |\beta_{n+1,k+1}| = \prod_{i=k+2}^{n+1}\left(1-c_\alpha i^{-\alpha}\right) \le \exp\left(-c_\alpha\sum_{i=k+2}^{n+1}i^{-\alpha}\right). \]
Applying Lemma E.2 in Cardot and Godichon-Baggioni (2015), it comes
\[ \sum_{k=0}^n|\beta_{n+1,k+1}|^{2+2\eta}(k+1)^{-\alpha(2+2\eta)} = O\left(n^{-\alpha(1+2\eta)}\right), \]
and so
\[ \lim_{n\to+\infty}\frac{c_\alpha^{1+\eta}\|H^{-1}\|_{op}^{2+2\eta}}{\nu^{2+2\eta}}2^{1+2\eta}C_\eta\,n^{\alpha(1+\eta)}\sum_{k=0}^n|\beta_{n+1,k+1}|^{2+2\eta}(k+1)^{-\alpha(2+2\eta)} = 0. \]

Proof of (5.4): First, note that
\[ \sum_{k=0}^n\beta_{n+1,k+1}^2\gamma_{k+1}^2H^{-1}\xi_{k+1}\xi_{k+1}^TH^{-1} = \sum_{k=0}^n\beta_{n+1,k+1}^2\gamma_{k+1}^2H^{-1}\mathbb E\left[\xi_{k+1}\xi_{k+1}^T|\mathcal F_k\right]H^{-1} + \sum_{k=0}^n\beta_{n+1,k+1}^2\gamma_{k+1}^2H^{-1}\Xi_{k+1}H^{-1}, \]
with $\Xi_{k+1} = \xi_{k+1}\xi_{k+1}^T-\mathbb E[\xi_{k+1}\xi_{k+1}^T|\mathcal F_k]$. $(\Xi_k)$ is a sequence of martingale differences adapted to the filtration $(\mathcal F_k)$, and applying Theorem 6.1, it comes
\[ \frac{n^\alpha}{c_\alpha}\left\|\sum_{k=0}^n\beta_{n+1,k+1}^2\gamma_{k+1}^2H^{-1}\Xi_{k+1}H^{-1}\right\|\xrightarrow[n\to+\infty]{a.s.}0. \]
Moreover, note that
\[ \sum_{k=0}^n\beta_{n+1,k+1}^2\gamma_{k+1}^2H^{-1}\mathbb E\left[\xi_{k+1}\xi_{k+1}^T|\mathcal F_k\right]H^{-1} = -\sum_{k=0}^n\beta_{n+1,k+1}^2\gamma_{k+1}^2H^{-1}\nabla G(\theta_k)\nabla G(\theta_k)^TH^{-1} + \frac14\sum_{k=0}^n\beta_{n+1,k+1}^2\gamma_{k+1}^2H^{-1}\mathbb E\left[\nabla_hg(X_{k+1},Y_{k+1},\theta_k)\nabla_hg(X_{k+1},Y_{k+1},\theta_k)^T|\mathcal F_k\right]H^{-1}. \]
Applying Theorem 3.2 and Lemma E.2 in Cardot and Godichon-Baggioni (2015), since the gradient of $G$ is Lipschitz, the first term on the right-hand side, rescaled by $n^\alpha/c_\alpha$, converges almost surely to 0. Moreover, noting
\[ R_k = \mathbb E\left[\nabla_hg(X_{k+1},Y_{k+1},\theta_k)\nabla_hg(X_{k+1},Y_{k+1},\theta_k)^T|\mathcal F_k\right] - \mathbb E\left[\nabla_hg(X_{k+1},Y_{k+1},\theta)\nabla_hg(X_{k+1},Y_{k+1},\theta)^T|\mathcal F_k\right], \]
and applying Theorem 3.2 together with the Lipschitz assumption on $h\mapsto\mathbb E[\nabla_hg(X,Y,h)\nabla_hg(X,Y,h)^T]$ and Lemma E.2 in Cardot and Godichon-Baggioni (2015), the term $\frac14\sum_k\beta_{n+1,k+1}^2\gamma_{k+1}^2H^{-1}R_kH^{-1}$, rescaled by $n^\alpha/c_\alpha$, also converges almost surely to 0. Furthermore, since $\frac14\mathbb E\left[\nabla_hg(X_{k+1},Y_{k+1},\theta)\nabla_hg(X_{k+1},Y_{k+1},\theta)^T|\mathcal F_k\right] = \sigma^2H$, applying a convergence lemma from Godichon-Baggioni (2019),
\[ \frac{n^\alpha}{c_\alpha}\sum_{k=0}^n\beta_{n+1,k+1}^2\gamma_{k+1}^2\,H^{-1}\left(\sigma^2H\right)H^{-1}\xrightarrow[n\to+\infty]{}\frac{\sigma^2}{2}H^{-1}, \]
which gives (5.4).

Decompositions of the averaged estimates. First, note that
\[ \overline\theta_n = \frac{1}{n+1}\sum_{k=0}^n\theta_k. \]
Second, one can write decomposition (13) as
\[ \theta_n-\theta = H^{-1}\overline S_n\,\frac{(\theta_n-\theta)-(\theta_{n+1}-\theta)}{\gamma_{n+1}} + H^{-1}\xi_{n+1} - H^{-1}\delta_n. \]
Then, summing these equalities and dividing by $n+$
1, it comes
\[ \overline\theta_n-\theta = \frac{H^{-1}}{n+1}\sum_{k=0}^n\overline S_k\,\frac{(\theta_k-\theta)-(\theta_{k+1}-\theta)}{\gamma_{k+1}} + \frac{H^{-1}}{n+1}\sum_{k=0}^n\xi_{k+1} - \frac{H^{-1}}{n+1}\sum_{k=0}^n\delta_k. \tag{5.5} \]

Proof of Theorem 3.3.
Let us give the rate of convergence of each term on the right-hand side of equality (5.5). First, in order to apply a law of large numbers and a central limit theorem for martingales, let us calculate
\[ \lim_{n\to\infty}\frac{1}{n+1}\sum_{k=0}^n\mathbb E\left[\xi_{k+1}\xi_{k+1}^T|\mathcal F_k\right]. \]
By definition of $(\xi_n)$, it comes
\[ \frac{1}{n+1}\sum_{k=0}^n\mathbb E\left[\xi_{k+1}\xi_{k+1}^T|\mathcal F_k\right] = \frac{1}{4(n+1)}\sum_{k=0}^n\mathbb E\left[\nabla_hg(X_{k+1},Y_{k+1},\theta_k)\nabla_hg(X_{k+1},Y_{k+1},\theta_k)^T|\mathcal F_k\right] - \frac{1}{n+1}\sum_{k=0}^n\nabla G(\theta_k)\nabla G(\theta_k)^T. \]
By continuity and since $\theta_k$ converges almost surely to $\theta$, the Toeplitz lemma implies
\[ \frac{1}{n+1}\sum_{k=0}^n\mathbb E\left[\xi_{k+1}\xi_{k+1}^T|\mathcal F_k\right]\xrightarrow[n\to+\infty]{a.s.}\frac14\mathbb E\left[\nabla_hg(X,Y,\theta)\nabla_hg(X,Y,\theta)^T\right]. \]
Furthermore,
\[ \frac14\mathbb E\left[\nabla_hg(X,Y,\theta)\nabla_hg(X,Y,\theta)^T\right] = \mathbb E\left[\tilde\epsilon^2\nabla_hf(X,\theta)\nabla_hf(X,\theta)^T\right] = \sigma^2H. \]
Finally, since $\mathbb E\left[\|\xi_{n+1}\|^{2+2\eta}|\mathcal F_n\right]\le2^{1+2\eta}C_\eta$, applying a law of large numbers for martingales (see Duflo (1997)),
\[ \frac{1}{(n+1)^2}\left\|H^{-1}\sum_{k=0}^n\xi_{k+1}\right\|^2 = O\left(\frac{\ln n}{n}\right)\ a.s. \]
Moreover, applying a central limit theorem for martingales (see Duflo (1997)),
\[ \frac{1}{\sqrt n}H^{-1}\sum_{k=0}^n\xi_{k+1}\xrightarrow[n\to+\infty]{\mathcal L}\mathcal N\left(0,\sigma^2H^{-1}\right). \]
Let us now prove that the other terms on the right-hand side of equality (5.5) are negligible. Thanks to assumption (H4b) and since $\theta_n$ converges almost surely to $\theta$,
\[ \|\delta_n\| \le \|\theta_n-\theta\|\int_0^1\left\|\nabla^2G(\theta+t(\theta_n-\theta))-H\right\|_{op}dt = O\left(\|\theta_n-\theta\|^2\right)\ a.s. \]
Then, applying Theorem 3.2, for all $\delta>0$,
\[ \frac{1}{n+1}\left\|H^{-1}\sum_{k=0}^n\delta_k\right\| \le \frac{\|H^{-1}\|_{op}}{n+1}\sum_{k=0}^n\|\delta_k\| = o\left(\frac{(\ln n)^{1+\delta}}{n^{\alpha}}\right)\ a.s., \]
which is negligible since $\alpha > 1/2$. Finally, an Abel summation gives
\[ \frac{1}{n+1}\sum_{k=0}^n\overline S_k\,\frac{(\theta_k-\theta)-(\theta_{k+1}-\theta)}{\gamma_{k+1}} = -\frac{\overline S_n(\theta_{n+1}-\theta)}{(n+1)\gamma_{n+1}} + \frac{\overline S_0(\theta_0-\theta)}{(n+1)\gamma_1} + \frac{1}{n+1}\sum_{k=1}^{n}\left(\gamma_{k+1}^{-1}\overline S_k-\gamma_k^{-1}\overline S_{k-1}\right)(\theta_k-\theta). \tag{5.6} \]
Let us now give the rates of convergence of each term on the right-hand side of equality (5.6). First, applying Theorem 3.2 and since $\overline S_n$ converges almost surely to $H$,
\[ \left\|\frac{\overline S_n(\theta_{n+1}-\theta)}{(n+1)\gamma_{n+1}}\right\| = O\left(\frac{\sqrt{\ln n}}{n^{1-\alpha/2}}\right)\ a.s. \quad\text{and}\quad \left\|\frac{\overline S_0(\theta_0-\theta)}{(n+1)\gamma_1}\right\| = O\left(\frac1n\right)\ a.s., \]
and these terms are negligible since $\alpha < 1$. Furthermore, since
\[ \overline S_k = \overline S_{k-1} - \frac{1}{k+1}\overline S_{k-1} + \frac{1}{k+1}\left(\Phi_k\Phi_k^T+\frac{c_\beta}{k^\beta}Z_kZ_k^T\right), \]
it comes
\[ (*) := \frac{1}{n+1}\sum_{k=1}^{n}\left(\gamma_{k+1}^{-1}\overline S_k-\gamma_k^{-1}\overline S_{k-1}\right)(\theta_k-\theta) = \frac{1}{n+1}\sum_{k=1}^{n}\left(\gamma_{k+1}^{-1}-\gamma_k^{-1}\right)\overline S_{k-1}(\theta_k-\theta) - \frac{1}{n+1}\sum_{k=1}^{n}\frac{\gamma_{k+1}^{-1}}{k+1}\,\overline S_{k-1}(\theta_k-\theta) + \underbrace{\frac{1}{n+1}\sum_{k=1}^{n}\frac{\gamma_{k+1}^{-1}}{k+1}\left(\Phi_k\Phi_k^T+\frac{c_\beta}{k^\beta}Z_kZ_k^T\right)(\theta_k-\theta)}_{=:(**)}. \]
For the first term on the right-hand side of the previous equality, since $\left|\gamma_{k+1}^{-1}-\gamma_k^{-1}\right|\le\alpha c_\alpha^{-1}k^{\alpha-1}$ and since $\overline S_{k-1}$ converges almost surely to $H$, applying Theorem 3.2, for all $\delta>0$,
\[ \frac{1}{n+1}\left\|\sum_{k=1}^{n}\left(\gamma_{k+1}^{-1}-\gamma_k^{-1}\right)\overline S_{k-1}(\theta_k-\theta)\right\| = o\left(\frac{(\ln n)^{1+\delta}}{n^{1-\alpha/2}}\right)\ a.s., \]
which is negligible since $\alpha < 1$. Furthermore, since $\overline S_{k-1}$ converges almost surely to $H$, one can check that
\[ \frac{1}{n+1}\left\|\sum_{k=1}^{n}\frac{\gamma_{k+1}^{-1}}{k+1}\,\overline S_{k-1}(\theta_k-\theta)\right\| = o\left(\frac{(\ln n)^{1+\delta}}{n^{1-\alpha/2}}\right)\ a.s. \]
Let us now give the rate of convergence of $(**)$. To this aim, let us consider $\delta > 0$ and the events
\[ \Omega_k = \left\{\|\theta_k-\theta\|\le\frac{(\ln k)^{1+\delta}}{k^{\alpha/2}}\right\}. \]
Since $\delta > 0$, and thanks to Theorem 3.2, $\|\theta_n-\theta\|\,n^{\alpha/2}(\ln n)^{-(1+\delta)}$ converges almost surely to 0, so that $\mathbf 1_{\Omega_k^C}$ converges almost surely to 0. Furthermore,
\[ (**) = \underbrace{\frac{1}{n+1}\sum_{k=1}^{n}\frac{\gamma_{k+1}^{-1}}{k+1}\left(\Phi_k\Phi_k^T+\frac{c_\beta}{k^\beta}Z_kZ_k^T\right)(\theta_k-\theta)\mathbf 1_{\Omega_k}}_{=:(***)} + \frac{1}{n+1}\sum_{k=1}^{n}\frac{\gamma_{k+1}^{-1}}{k+1}\left(\Phi_k\Phi_k^T+\frac{c_\beta}{k^\beta}Z_kZ_k^T\right)(\theta_k-\theta)\mathbf 1_{\Omega_k^C}. \]
Since $\mathbf 1_{\Omega_k^C}$ converges almost surely to 0,
\[ \sum_{k\ge1}\frac{\gamma_{k+1}^{-1}}{k+1}\left\|\Phi_k\Phi_k^T+\frac{c_\beta}{k^\beta}Z_kZ_k^T\right\|_{op}\|\theta_k-\theta\|\,\mathbf 1_{\Omega_k^C} < +\infty\ a.s., \]
so that
\[ \frac{1}{n+1}\left\|\sum_{k=1}^{n}\frac{\gamma_{k+1}^{-1}}{k+1}\left(\Phi_k\Phi_k^T+\frac{c_\beta}{k^\beta}Z_kZ_k^T\right)(\theta_k-\theta)\mathbf 1_{\Omega_k^C}\right\| = O\left(\frac1n\right)\ a.s. \]
Moreover,
\[ (***) \le \frac{1}{n+1}\sum_{k=1}^{n}\frac{\gamma_{k+1}^{-1}}{k+1}\left\|\Phi_k\Phi_k^T+\frac{c_\beta}{k^\beta}Z_kZ_k^T\right\|_{op}\frac{(\ln k)^{1+\delta}}{k^{\alpha/2}}. \]
One can consider the sequence of martingale differences $(\Xi_k')$ adapted to the filtration $(\mathcal F_k)$, defined for all $k\ge1$ by
\[ \Xi_k' = \left\|\Phi_k\Phi_k^T+\frac{c_\beta}{k^\beta}Z_kZ_k^T\right\|_{op} - \mathbb E\left[\left\|\Phi_k\Phi_k^T+\frac{c_\beta}{k^\beta}Z_kZ_k^T\right\|_{op}\Big|\mathcal F_{k-1}\right]. \]
Then,
\[ (***) \le \frac{1}{n+1}\sum_{k=1}^{n}\frac{\gamma_{k+1}^{-1}(\ln k)^{1+\delta}}{(k+1)k^{\alpha/2}}\,\mathbb E\left[\left\|\Phi_k\Phi_k^T+\frac{c_\beta}{k^\beta}Z_kZ_k^T\right\|_{op}\Big|\mathcal F_{k-1}\right] + \frac{1}{n+1}\sum_{k=1}^{n}\frac{\gamma_{k+1}^{-1}(\ln k)^{1+\delta}}{(k+1)k^{\alpha/2}}\,\Xi_k'. \]
Then, thanks to assumption (H1b) and the Toeplitz lemma,
\[ \frac{1}{n+1}\sum_{k=1}^{n}\frac{\gamma_{k+1}^{-1}(\ln k)^{1+\delta}}{(k+1)k^{\alpha/2}}\,\mathbb E\left[\left\|\Phi_k\Phi_k^T+\frac{c_\beta}{k^\beta}Z_kZ_k^T\right\|_{op}\Big|\mathcal F_{k-1}\right] = o\left(\frac{(\ln n)^{1+\delta}}{n^{1-\alpha/2}}\right)\ a.s. \]
Furthermore, since $\alpha < 1$, with assumption (H1b), Theorem 6.2 leads to
\[ \left\|\frac{1}{n+1}\sum_{k=1}^{n}\frac{\gamma_{k+1}^{-1}(\ln k)^{1+\delta}}{(k+1)k^{\alpha/2}}\,\Xi_k'\right\| = O\left(\frac1n\right)\ a.s. \]
Combining these bounds, all the terms in (5.5) except the martingale one are $o\left(n^{-1/2}\right)$ almost surely, which gives both the announced almost sure rate and the asymptotic normality, since $H = L(\theta)$.

Proof of Corollary 3.2.
The aim is to give the rate of convergence of each term of decomposition (5.1). Note that the rate of the martingale term was given in the proof of Lemma 5.1, while, proceeding as there, we have
\[ \frac1n\left\|\sum_{k=1}^n\frac{c_\beta}{k^\beta}Z_kZ_k^T\right\|_F = O\left(\frac{1}{n^\beta}\right)\ a.s. \]
Finally, since the functional $h\mapsto\mathbb E\left[\nabla_hf(X,h)\nabla_hf(X,h)^T\right]$ is $C_f$-Lipschitz on a neighborhood of $\theta$, and since $\overline\theta_n$ converges almost surely to $\theta$, applying Theorem 3.3 as well as the Toeplitz lemma, for all $\delta>0$,
\[ \frac1n\left\|\sum_{k=0}^{n-1}\left(\mathbb E\left[\Phi_{k+1}\Phi_{k+1}^T|\mathcal F_k\right]-H\right)\right\|_F = O\left(\frac{C_f}{n}\sum_{k=0}^{n-1}\left\|\overline\theta_k-\theta\right\|\right)\ a.s. = o\left(\sqrt{\frac{(\ln n)^{1+\delta}}{n}}\right)\ a.s., \]
which concludes the proof.

Remark 5.1. Note that, to prove equality (12) without knowing the rate of convergence of $\overline\theta_n$, it is necessary to have a first rate of convergence of $\overline S_n$. For that purpose, we study the asymptotic behaviour of
\[ \frac1n\left\|\sum_{k=0}^{n-1}\left(\mathbb E\left[\Phi_{k+1}\Phi_{k+1}^T|\mathcal F_k\right]-H\right)\right\|_F. \]
Equality (11) yields, for all $\delta>0$,
\[ \left\|\overline\theta_n-\theta\right\| \le \frac{1}{n+1}\sum_{k=0}^n\|\theta_k-\theta\| = O\left(\frac{(\ln n)^{(1+\delta)/2}}{n^{\alpha/2}}\right)\ a.s., \]
so that, since the functional $h\mapsto\mathbb E\left[\nabla_hf(X,h)\nabla_hf(X,h)^T\right]$ is $C_f$-Lipschitz on a neighborhood of $\theta$,
\[ \frac1n\left\|\sum_{k=0}^{n-1}\left(\mathbb E\left[\Phi_{k+1}\Phi_{k+1}^T|\mathcal F_k\right]-H\right)\right\|_F = o\left(\sqrt{\frac{(\ln n)^{1+\delta}}{n^{\alpha}}}\right)\ a.s., \]
and then, since $\beta < \alpha-1/2$,
\[ \left\|\overline S_n-H\right\|_F = O\left(\max\left\{\frac{c_\beta}{n^\beta},\ \sqrt{\frac{(\ln n)^{1+\delta}}{n^{\alpha}}}\right\}\right)\ a.s. \]

Only the main lines of the following proof are given, since it is a mix between the proof of Theorem 3.3 and the ones in Bercu et al. (2020).
Proof of Theorem 3.4.
Let us denote $\overline H_n^{-1} = (n+1)\tilde H_n^{-1}$. First, remark that, as in the proof of Corollary 3.2, one can check that on $\tilde\Gamma_\theta$,
\[ \overline H_n\xrightarrow[n\to+\infty]{a.s.}H \quad\text{and}\quad \overline H_n^{-1}\xrightarrow[n\to+\infty]{a.s.}H^{-1}. \]
Furthermore, decomposition (13) can be rewritten as
\[ \tilde\theta_{n+1}-\theta = \tilde\theta_n-\theta-\frac{1}{n+1}\overline H_n^{-1}\nabla G\left(\tilde\theta_n\right) + \frac{1}{n+1}\overline H_n^{-1}\tilde\xi_{n+1}, \]
where $\tilde\xi_{n+1} := \nabla G\left(\tilde\theta_n\right)-\frac12\nabla_hg\left(X_{n+1},Y_{n+1},\tilde\theta_n\right)$. Then, $\left(\tilde\xi_n\right)$ is a sequence of martingale differences adapted to the filtration $(\mathcal F_n)$. Linearizing the gradient, decomposition (14) can be rewritten as
\[ \tilde\theta_{n+1}-\theta = \frac{n}{n+1}\left(\tilde\theta_n-\theta\right) + \frac{1}{n+1}\left(H^{-1}-\overline H_n^{-1}\right)H\left(\tilde\theta_n-\theta\right) + \frac{1}{n+1}\overline H_n^{-1}\tilde\xi_{n+1} - \frac{1}{n+1}\overline H_n^{-1}\tilde\delta_n, \]
where $\tilde\delta_n = \nabla G\left(\tilde\theta_n\right)-H\left(\tilde\theta_n-\theta\right)$ is the remainder term in the Taylor expansion of the gradient. Then, by induction, it comes
\[ \tilde\theta_n-\theta = \frac1n\left(\tilde\theta_0-\theta\right) + \frac1n\sum_{k=0}^{n-1}\left(H^{-1}-\overline H_k^{-1}\right)H\left(\tilde\theta_k-\theta\right) - \frac1n\sum_{k=0}^{n-1}\overline H_k^{-1}\tilde\delta_k + \frac1n\sum_{k=0}^{n-1}\overline H_k^{-1}\tilde\xi_{k+1}. \tag{16} \]
Since $\overline H_k^{-1}$ converges almost surely to $H^{-1}$, on $\tilde\Gamma_\theta$ one can check that
\[ \frac1n\sum_{k=0}^{n-1}\overline H_k^{-1}\tilde\xi_{k+1}\tilde\xi_{k+1}^T\overline H_k^{-1}\xrightarrow[n\to+\infty]{a.s.}\sigma^2H^{-1}. \]
Thanks to assumption (H5), applying a law of large numbers for martingales, one can check that
\[ \left\|\frac1n\sum_{k=0}^{n-1}\overline H_k^{-1}\tilde\xi_{k+1}\right\|^2 = O\left(\frac{\ln n}{n}\right)\ a.s. \tag{17} \]
In the same way, applying a central limit theorem for martingales, it comes
\[ \frac{1}{\sqrt n}\sum_{k=0}^{n-1}\overline H_k^{-1}\tilde\xi_{k+1}\xrightarrow[n\to+\infty]{\mathcal L}\mathcal N\left(0,\sigma^2H^{-1}\right). \]
Let us now prove that the other terms in decomposition (16) are negligible. First, clearly,
\[ \frac1n\left\|\tilde\theta_0-\theta\right\| = O\left(\frac1n\right)\ a.s. \tag{18} \]
Furthermore, let us denote
\[ \tilde\Delta_n = \frac1n\sum_{k=0}^{n-1}\left(H^{-1}-\overline H_k^{-1}\right)H\left(\tilde\theta_k-\theta\right) - \frac1n\sum_{k=0}^{n-1}\overline H_k^{-1}\tilde\delta_k. \]
As in the proof of Theorem 3.2, one can verify that
\[ \left\|\left(H^{-1}-\overline H_n^{-1}\right)H\left(\tilde\theta_n-\theta\right)\right\| = o\left(\left\|\tilde\theta_n-\theta\right\|\right)\ a.s. \quad\text{and}\quad \left\|\tilde\delta_n\right\| = o\left(\left\|\tilde\theta_n-\theta\right\|\right)\ a.s. \]
Then,
\[ \left\|\tilde\Delta_{n+1}\right\| \le \frac{n}{n+1}\left\|\tilde\Delta_n\right\| + o\left(\frac{1}{n+1}\left\|\tilde\theta_n-\theta\right\|\right)\ a.s. \]
As in the proof of Theorem 6.2 in Bercu et al. (2020) (see equations (6.24) to (6.32)), it comes
\[ \left\|\tilde\Delta_n\right\|^2 = O\left(\frac{\ln n}{n}\right)\ a.s. \tag{19} \]
Then, thanks to equalities (17), (18) and (19), it comes
\[ \left\|\tilde\theta_n-\theta\right\|^2 = O\left(\frac{\ln n}{n}\right)\ a.s. \tag{20} \]
In order to get the asymptotic normality of $\tilde\theta_n$, let us now give the rate of convergence of each term on the right-hand side of decomposition (16).
First, since $\left\|\tilde\delta_n\right\| = O\left(\left\|\tilde\theta_n-\theta\right\|^2\right)$ a.s. and since $\overline H_n^{-1}$ converges almost surely to $H^{-1}$, thanks to equality (20), for all $\delta>0$,
\[ \frac1n\left\|\sum_{k=0}^{n-1}\overline H_k^{-1}\tilde\delta_k\right\| = o\left(\frac{(\ln n)^{1+\delta}}{n}\right)\ a.s., \]
which is thus negligible. Furthermore, as in the proof of Corollary 3.2, one can check that for all $\delta>0$,
\[ \left\|\overline H_n^{-1}-H^{-1}\right\| = O\left(\max\left\{\frac{c_{\tilde\beta}}{n^{\tilde\beta}},\ \sqrt{\frac{(\ln n)^{1+\delta}}{n}}\right\}\right)\ a.s. \]
Then, for all $\delta>0$,
\[ \frac1n\left\|\sum_{k=0}^{n-1}\left(\overline H_k^{-1}-H^{-1}\right)H\left(\tilde\theta_k-\theta\right)\right\| = o\left(\max\left\{\frac{c_{\tilde\beta}(\ln n)^{1+\delta}}{n^{\tilde\beta+1/2}},\ \frac{(\ln n)^{1+\delta}}{n}\right\}\right)\ a.s., \]
and this term is also negligible, which concludes the proof.

Proof of Corollary 3.3. First, note that
\[ \left(\hat Y_k-Y_k\right)^2-\sigma^2 = \left(f\left(X_k,\overline\theta_{k-1}\right)-f(X_k,\theta)\right)^2 - 2\tilde\epsilon_k\left(f\left(X_k,\overline\theta_{k-1}\right)-f(X_k,\theta)\right) + \left(\tilde\epsilon_k^2-\sigma^2\right). \]
Thanks to a Taylor expansion, there are $U_0,\ldots,U_{n-1}\in\mathbb R^q$ such that
\[ R_1 := \frac1n\sum_{k=1}^n\left(f\left(X_k,\overline\theta_{k-1}\right)-f(X_k,\theta)\right)^2 \le \frac1n\sum_{k=1}^n\left\|\nabla_hf(X_k,U_{k-1})\right\|^2\left\|\overline\theta_{k-1}-\theta\right\|^2. \]
Let us consider the filtration $\left(\mathcal F_n^{(1)}\right)$ defined by $\mathcal F_n^{(1)} = \sigma\left((X_k,\tilde\epsilon_{k-1}),\ k = 1,\ldots,n\right)$.
Then, considering
\[ \xi_k' = \mathbb E\left[\left\|\nabla_hf(X_k,U_{k-1})\right\|^2\Big|\mathcal F_{k-1}^{(1)}\right] - \left\|\nabla_hf(X_k,U_{k-1})\right\|^2, \]
it comes, thanks to assumption (H1b),
\[ R_1 \le \frac1n\sum_{k=1}^n\sqrt{C''}\left\|\overline\theta_{k-1}-\theta\right\|^2 - \frac1n\sum_{k=1}^n\xi_k'\left\|\overline\theta_{k-1}-\theta\right\|^2. \]
Then, applying the Toeplitz lemma and Theorem 3.3 to the first term on the right-hand side of the previous inequality, one can check that for all $\delta>0$,
\[ \frac1n\sum_{k=1}^n\sqrt{C''}\left\|\overline\theta_{k-1}-\theta\right\|^2 = o\left(\frac{(\ln n)^{2+\delta}}{n}\right)\ a.s. \]
Furthermore, since $(\xi_n')$ is a sequence of martingale differences with bounded second-order moments, with the help of Theorems 3.3 and 6.2, it comes
\[ \frac1n\sum_{k=1}^n\xi_k'\left\|\overline\theta_{k-1}-\theta\right\|^2 = O\left(\frac1n\right)\ a.s., \]
so that
\[ R_1 = o\left(\frac{(\ln n)^{2+\delta}}{n}\right)\ a.s. \]
Furthermore, let us consider
\[ R_2 = \frac1n\sum_{k=1}^n\tilde\epsilon_k\left(f\left(X_k,\overline\theta_{k-1}\right)-f(X_k,\theta)\right). \]
This is a sum of martingale differences, and thanks to Theorems 3.3 and 6.2, one can check that
\[ |R_2| = o\left(\frac{(\ln n)^{2+\delta}}{n}\right)\ a.s. \]
Then, considering that $\tilde\epsilon_n^2-\sigma^2 = \tilde\epsilon_n^2-\mathbb E\left[\tilde\epsilon_n^2|\mathcal F_{n-1}^{(1)}\right]$, one can apply a central limit theorem for martingales to get the asymptotic normality, and a strong law of large numbers for martingales to get the almost sure rate of convergence.

6 General results on almost sure rates of convergence for martingales with step sequences of the form $(n^{-\alpha})_n$

Theorem 6.1.
Let $\mathcal H$ be a separable Hilbert space and let us consider
\[ M_{n+1} = \sum_{k=0}^n\beta_{n,k}\gamma_kR_k\xi_{k+1}, \]
where $(\xi_n)$ is an $\mathcal H$-valued sequence of martingale differences adapted to a filtration $(\mathcal F_n)$ such that
\[ \mathbb E\left[\|\xi_{n+1}\|^2|\mathcal F_n\right] \le C + R_n'\ a.s., \qquad \sum_{n\ge1}\gamma_n\mathbb E\left[\|\xi_{n+1}\|^2\mathbf 1_{\left\{\|\xi_{n+1}\|\ge\gamma_n^{-1/2}(\ln n)^{-1}\right\}}\Big|\mathcal F_n\right] < +\infty\ a.s., \tag{21} \]
where $C\ge0$ and $(R_n')_n$ converges almost surely to 0;
• $\gamma_n = cn^{-\alpha}$ with $c>0$ and $\alpha\in(1/2,1)$;
• $(R_n)$ is a sequence of operators on $\mathcal H$ such that, for a deterministic sequence $(v_n)$, $\|R_n\|_{op} = o(v_n)$ a.s. and $v_n = (\ln n)^an^{-b}$ with $a,b\ge0$;
• for all $n\ge0$ and $0\le k\le n$,
\[ \beta_{n,k} = \prod_{j=k+1}^n\left(I_{\mathcal H}-\gamma_j\Gamma\right)\quad\text{and}\quad\beta_{n,n} = I_{\mathcal H}, \]
where $\Gamma$ is a symmetric operator on $\mathcal H$ such that $0 < \lambda_{\min}(\Gamma)\le\lambda_{\max}(\Gamma) < +\infty$.
Then,
\[ \|M_{n+1}\| = O\left(v_n\sqrt{\gamma_n\ln n}\right)\ a.s. \]

Remark 6.1.
Note that equation (21) holds as soon as there are $\eta > (1-\alpha)/\alpha$ and a positive constant $C_\eta$ such that
\[ \mathbb E\left[\|\xi_{n+1}\|^{2+2\eta}|\mathcal F_n\right] \le C_\eta + R_{\eta,n}, \]
with $(R_{\eta,n})$ converging almost surely to 0.

Remark 6.2.
The previous theorem remains true when considering a sequence $(R_n)$ satisfying that there are a positive constant $C_R$ and a rank $n_R$ such that for all $n\ge n_R$, $\|R_n\|_{op}\le C_Rv_n$.

Proof. Let us consider the events
\begin{align*}
A_n &= \left\{\|R_n\|_{op} > v_n\ \text{or}\ R_n' > C\right\},\\
B_{n+1} &= \left\{\|R_n\|_{op}\le v_n,\ R_n'\le C,\ \|\xi_{n+1}\|\le\delta_n\right\},\\
C_{n+1} &= \left\{\|R_n\|_{op}\le v_n,\ R_n'\le C,\ \|\xi_{n+1}\| > \delta_n\right\},
\end{align*}
with $\delta_n = \gamma_n^{-1/2}(\ln n)^{-1}$. One can remark that $\mathbf 1_{A_n^c} = \mathbf 1_{B_{n+1}}+\mathbf 1_{C_{n+1}}$. Then, one can write $M_{n+1}$ as
\begin{align*}
M_{n+1} &= \sum_{k=0}^n\beta_{n,k}\gamma_kR_k\xi_{k+1}\mathbf 1_{A_k} + \sum_{k=0}^n\beta_{n,k}\gamma_kR_k\left(\xi_{k+1}\mathbf 1_{B_{k+1}}-\mathbb E\left[\xi_{k+1}\mathbf 1_{B_{k+1}}|\mathcal F_k\right]\right)\\
&\quad + \sum_{k=0}^n\beta_{n,k}\gamma_kR_k\left(\xi_{k+1}\mathbf 1_{C_{k+1}}-\mathbb E\left[\xi_{k+1}\mathbf 1_{C_{k+1}}|\mathcal F_k\right]\right).
\end{align*}
Let us now give the rates of convergence of these three terms.

Bounding $M_{n+1}^{(1)} := \sum_{k=0}^n\beta_{n,k}\gamma_kR_k\xi_{k+1}\mathbf 1_{A_k}$: Remark that there is a rank $n_0$ such that for all $n\ge n_0$, $\left\|I_{\mathcal H}-\gamma_n\Gamma\right\|_{op}\le(1-\lambda_{\min}\gamma_n)$. Furthermore,
\[ M_{n+1}^{(1)} = \left(I_{\mathcal H}-\gamma_n\Gamma\right)M_n^{(1)} + \gamma_nR_n\xi_{n+1}\mathbf 1_{A_n}. \]
Then, for all $n\ge n_0$,
\[ \mathbb E\left[\left\|M_{n+1}^{(1)}\right\||\mathcal F_n\right] \le (1-\lambda_{\min}\gamma_n)\left\|M_n^{(1)}\right\| + \gamma_n\|R_n\|_{op}\sqrt{C+R_n'}\,\mathbf 1_{A_n}. \]
Considering $V_{n+1} = \prod_{k=0}^n(1+\lambda_{\min}\gamma_k)\left\|M_{n+1}^{(1)}\right\|$, it comes
\[ \mathbb E\left[V_{n+1}|\mathcal F_n\right] \le \left(1+\lambda_{\min}^2\gamma_n^2\right)V_n + \prod_{k=0}^n(1+\lambda_{\min}\gamma_k)\,\gamma_n\|R_n\|_{op}\sqrt{C+R_n'}\,\mathbf 1_{A_n}. \]
Moreover, $\mathbf 1_{A_n}$ converges almost surely to 0, so that
\[ \sum_{n\ge n_0}\prod_{k=0}^n(1+\lambda_{\min}\gamma_k)\,\gamma_n\|R_n\|_{op}\sqrt{C+R_n'}\,\mathbf 1_{A_n} < +\infty\ a.s., \]
and applying the Robbins-Siegmund theorem, $(V_n)$ converges almost surely to a finite random variable, i.e.,
\[ \left\|M_{n+1}^{(1)}\right\| = O\left(\prod_{k=0}^n(1+\lambda_{\min}\gamma_k)^{-1}\right)\ a.s., \]
which converges to 0 exponentially fast.

Bounding $M_{n+1}^{(2)} := \sum_{k=0}^n\beta_{n,k}\gamma_kR_k\left(\xi_{k+1}\mathbf 1_{B_{k+1}}-\mathbb E\left[\xi_{k+1}\mathbf 1_{B_{k+1}}|\mathcal F_k\right]\right)$: Let us denote $\Xi_{k+1} = R_k\left(\xi_{k+1}\mathbf 1_{B_{k+1}}-\mathbb E\left[\xi_{k+1}\mathbf 1_{B_{k+1}}|\mathcal F_k\right]\right)$. Remark that $(\Xi_n)$ is a sequence of martingale differences adapted to the filtration $(\mathcal F_n)$. As in Pinelis (1994) (proofs of Theorems 3.1 and 3.2), let $\lambda>0$, $t\in[$
Bounding $M_{n+1,2} := \sum_{k=1}^{n}\beta_{n,k}\gamma_k\mathcal{R}_k\Big(\xi_{k+1}\mathbf{1}_{B_{k+1}} - \mathbb{E}\big[\xi_{k+1}\mathbf{1}_{B_{k+1}}|\mathcal{F}_k\big]\Big)$. Let us denote $\Xi_{k+1} = \mathcal{R}_k\Big(\xi_{k+1}\mathbf{1}_{B_{k+1}} - \mathbb{E}\big[\xi_{k+1}\mathbf{1}_{B_{k+1}}|\mathcal{F}_k\big]\Big)$ and remark that $(\Xi_n)$ is a sequence of martingale differences adapted to the filtration $(\mathcal{F}_n)$. As in Pinelis (1994) (proofs of Theorems 3.1 and 3.2), for $\lambda > 0$, $t \in [0,1]$ and $j \le n$, let
$$ \varphi(t) = \mathbb{E}\left[\cosh\left(\lambda\left\|\sum_{k=1}^{j-1}\beta_{n,k}\gamma_k\Xi_{k+1} + t\,\beta_{n,j}\gamma_j\Xi_{j+1}\right\|\right)\Bigg|\,\mathcal{F}_j\right]. $$
One can check that $\varphi'(0) = 0$ and
$$ \varphi''(t) \le \lambda^2\big\|\beta_{n,j}\gamma_j\Xi_{j+1}\big\|^2\,e^{\lambda t\|\beta_{n,j}\gamma_j\Xi_{j+1}\|}\cosh\left(\lambda\left\|\sum_{k=1}^{j-1}\beta_{n,k}\gamma_k\Xi_{k+1}\right\|\right), $$
so that, by a Taylor expansion with integral remainder,
$$ \mathbb{E}\left[\cosh\left(\lambda\left\|\sum_{k=1}^{j}\beta_{n,k}\gamma_k\Xi_{k+1}\right\|\right)\Bigg|\,\mathcal{F}_j\right] = \varphi(1) = \varphi(0) + \int_0^1(1-t)\,\varphi''(t)\,dt \le \big(1+e_{j,n}\big)\cosh\left(\lambda\left\|\sum_{k=1}^{j-1}\beta_{n,k}\gamma_k\Xi_{k+1}\right\|\right), $$
with
$$ e_{j,n} = \mathbb{E}\Big[e^{\lambda\|\beta_{n,j}\gamma_j\Xi_{j+1}\|} - 1 - \lambda\big\|\beta_{n,j}\gamma_j\Xi_{j+1}\big\|\,\Big|\,\mathcal{F}_j\Big], $$
which is well defined since $\Xi_{j+1}$ is almost surely bounded. Moreover, considering
$$ G_{n+1} = \frac{\cosh\big(\lambda\big\|\sum_{k=1}^{n}\beta_{n,k}\gamma_k\Xi_{k+1}\big\|\big)}{\prod_{j=1}^{n}\big(1+e_{j,n}\big)}, $$
the previous inequality yields $\mathbb{E}\big[G_{n+1}|\mathcal{F}_n\big] \le G_n$, so that $\mathbb{E}\big[G_{n+1}\big] \le 1$.
For all $r > 0$,
$$ \mathbb{P}\big[\|M_{n+1,2}\| \ge r\big] = \mathbb{P}\left[G_{n+1} \ge \frac{\cosh(\lambda r)}{\prod_{j=1}^{n}(1+e_{j,n})}\right] \le \mathbb{P}\left[G_{n+1} \ge \frac{e^{\lambda r}}{2\prod_{j=1}^{n}(1+e_{j,n})}\right]. $$
Furthermore, let $\widetilde{\xi}_{j+1} = \xi_{j+1}\mathbf{1}_{B_{j+1}} - \mathbb{E}\big[\xi_{j+1}\mathbf{1}_{B_{j+1}}|\mathcal{F}_j\big]$ and remark that $\mathbb{E}\big[\|\widetilde{\xi}_{j+1}\|^2|\mathcal{F}_j\big] \le 2C$. Then, recalling that $\delta_n = \gamma_n^{-1/2}(\ln n)^{-1}$, and since for all $k \ge 2$,
$$ \mathbb{E}\big[\|\widetilde{\xi}_{j+1}\|^k|\mathcal{F}_j\big] \le 2^{k-1}\delta_j^{k-2}\,\mathbb{E}\big[\|\xi_{j+1}\|^2\mathbf{1}_{B_{j+1}}|\mathcal{F}_j\big] \le 2^k C\,\delta_j^{k-2}, $$
it follows, using $\gamma_j\delta_j = \sqrt{\gamma_j}(\ln j)^{-1}$, that
$$ e_{j,n} = \sum_{k=2}^{\infty}\frac{\lambda^k}{k!}\,\mathbb{E}\big[\|\beta_{n,j}\gamma_j\Xi_{j+1}\|^k\,\big|\,\mathcal{F}_j\big] \le \sum_{k=2}^{\infty}\frac{\lambda^k}{k!}\,\|\beta_{n,j}\|_{op}^k\,\gamma_j^k v_j^k\,\mathbb{E}\big[\|\widetilde{\xi}_{j+1}\|^k|\mathcal{F}_j\big] \le 4C\lambda^2\|\beta_{n,j}\|_{op}^2\gamma_j^2v_j^2\exp\left(2\lambda\|\beta_{n,j}\|_{op}v_j\frac{\sqrt{\gamma_j}}{\ln j}\right). $$
Applying Markov's inequality to $G_{n+1}$ and using $1+x \le e^x$,
$$ \mathbb{P}\big[\|M_{n+1,2}\| \ge r\big] \le 2\exp\left(-\lambda r + 4C\lambda^2\sum_{j=1}^{n}\|\beta_{n,j}\|_{op}^2\gamma_j^2v_j^2\exp\left(2\lambda\|\beta_{n,j}\|_{op}v_j\frac{\sqrt{\gamma_j}}{\ln j}\right)\right). $$
Take $\lambda = \gamma_n^{-1/2}v_n^{-1}\sqrt{\ln n}$ and let $C_\beta = \sup_{n}\|\beta_{n,0}\|_{op}$. For $n$ large enough (i.e. such that $\gamma_{n/2}\lambda_{\max}(\Gamma) \le 1$) and for all $j \le n/2$,
$$ \|\beta_{n,j}\|_{op} \le C_\beta\exp\big(-c\,\lambda_{\min}(n/2)^{1-\alpha}\big), $$
so that for all $j \le n/2$,
$$ \lambda\|\beta_{n,j}\|_{op}\gamma_jv_j \le C_\beta\exp\big(-c\,\lambda_{\min}(n/2)^{1-\alpha}\big)\sqrt{n^{b+\alpha}\ln n}\ \xrightarrow[n\to+\infty]{}\ 0, $$
while for all $j \ge n/2$,
$$ \lambda\|\beta_{n,j}\|_{op}v_j\frac{\sqrt{\gamma_j}}{\ln j} \le C_\beta\,2^{b+\alpha+1}. $$
Then, there is a positive constant $C''$ such that for all $n$ and all $j \le n$,
$$ \exp\left(2\lambda\|\beta_{n,j}\|_{op}v_j\frac{\sqrt{\gamma_j}}{\ln j}\right) \le C''. $$
Finally, one can easily check (see Lemma E.2 in Cardot and Godichon-Baggioni (2015)) that
$$ \sum_{j=1}^{n}\|\beta_{n,j}\|_{op}^2\gamma_j^2(\ln j)^{2a}j^{2b} = O\left(\frac{(\ln n)^{2a}n^{2b}}{n^{\alpha}}\right) = O\big(\gamma_nv_n^2\big), $$
so that there is a positive constant $C'''$ such that
$$ \mathbb{P}\big[\|M_{n+1,2}\| \ge r\big] \le 2\exp\left(-r\,v_n^{-1}\gamma_n^{-1/2}\sqrt{\ln n} + C'''\ln n\right). $$
Then, taking $r = (2+C''')\,v_n\sqrt{\gamma_n\ln n}$, it follows that
$$ \mathbb{P}\Big[\|M_{n+1,2}\| \ge (2+C''')\,v_n\sqrt{\gamma_n\ln n}\Big] \le 2\exp(-2\ln n) = \frac{2}{n^2}, $$
and applying Borel-Cantelli's lemma,
$$ \|M_{n+1,2}\| = O\big(v_n\sqrt{\gamma_n\ln n}\big) \quad a.s. $$
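The choice of $\lambda$ and $r$ is exactly what balances the two terms in the exponent; writing out the arithmetic:
$$ -\lambda r + C'''\ln n = -(2+C''')\underbrace{\gamma_n^{-1/2}v_n^{-1}\sqrt{\ln n}\cdot v_n\sqrt{\gamma_n\ln n}}_{=\,\ln n} + C'''\ln n = -2\ln n, $$
so the bounds $2/n^2$ are summable and Borel-Cantelli indeed applies.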
Bounding $M_{n+1,3} := \sum_{k=1}^{n}\beta_{n,k}\gamma_k\mathcal{R}_k\Big(\xi_{k+1}\mathbf{1}_{C_{k+1}} - \mathbb{E}\big[\xi_{k+1}\mathbf{1}_{C_{k+1}}|\mathcal{F}_k\big]\Big)$. Let us denote $\widetilde{\xi}_{k+1} = \xi_{k+1}\mathbf{1}_{C_{k+1}} - \mathbb{E}\big[\xi_{k+1}\mathbf{1}_{C_{k+1}}|\mathcal{F}_k\big]$ and remark that for $n \ge n_0$,
$$ \mathbb{E}\big[\|M_{n+1,3}\|^2|\mathcal{F}_n\big] \le (1-\lambda_{\min}\gamma_n)^2\|M_{n,3}\|^2 + \gamma_n^2v_n^2\,\mathbb{E}\big[\|\widetilde{\xi}_{n+1}\|^2|\mathcal{F}_n\big] \le (1-\lambda_{\min}\gamma_n)^2\|M_{n,3}\|^2 + \gamma_n^2v_n^2\,\mathbb{E}\Big[\|\xi_{n+1}\|^2\mathbf{1}_{\{\|\xi_{n+1}\|\ge\delta_n\}}\Big|\mathcal{F}_n\Big]. $$
Let $V'_n = \gamma_n^{-1}v_n^{-2}\|M_{n,3}\|^2$. There are a rank $n_2$ and a positive constant $c_0$ such that for all $n \ge n_2$,
$$ \mathbb{E}\big[V'_{n+1}|\mathcal{F}_n\big] \le (1-c_0\gamma_n)V'_n + O\Big(\gamma_n\,\mathbb{E}\Big[\|\xi_{n+1}\|^2\mathbf{1}_{\{\|\xi_{n+1}\|\ge\delta_n\}}\Big|\mathcal{F}_n\Big]\Big) \quad a.s. $$
Applying the Robbins-Siegmund Theorem together with equation (21), it follows that
$$ \|M_{n+1,3}\|^2 = O\big(\gamma_nv_n^2\big) \quad a.s., $$
which concludes the proof. $\square$
Theorem 6.2. Let $H$ be a separable Hilbert space and let
$$ M_n = \frac{1}{n}\sum_{k=1}^{n}\mathcal{R}_k\xi_{k+1}, $$
where
• $(\xi_n)$ is an $H$-valued martingale differences sequence adapted to a filtration $(\mathcal{F}_n)$ verifying $\mathbb{E}\big[\|\xi_{n+1}\|^2|\mathcal{F}_n\big] \le C + R_n$, where $(R_n)_n$ converges almost surely to $0$;
• $(\mathcal{R}_n)$ is a sequence of operators on $H$ such that, for a deterministic sequence $(v_n)$, $\|\mathcal{R}_n\|_{op} = O(v_n)$ a.s., and there are $c \ge 0$ and a sequence $(a_n)$ converging to $0$ such that
$$ \frac{v_n}{v_{n+1}} = 1 + \frac{c}{n} + \frac{a_n}{n} + o\left(\frac{a_n}{n}\right). $$
Then, for all $\delta > 0$,
• if $\sum_{n\ge1}v_n^2 < +\infty$ a.s., $\ \|M_n\|^2 = O\left(\frac{1}{n^2}\right)$ a.s.;
• if $c < 1/2$, $\ \|M_n\|^2 = o\left(\frac{v_n^2(\ln n)^{1+\delta}}{n}\right)$ a.s.;
• if $\sum_{n\ge1}\frac{a_n}{n} < +\infty$ and $1/2 \le c \le 1$, $\ \|M_n\|^2 = o\big(n^{2c-2}v_n^2(\ln n)^{1+\delta}\big)$ a.s.;
• if $\sum_{n\ge1}\frac{a_n}{n} = +\infty$ and $1/2 \le c < 1$, for all $a < 2-2c$, $\ \|M_n\|^2 = o\big(n^{-a}v_n^2\big)$ a.s.

Proof. If $\sum_{n\ge1}v_n^2 < +\infty$, let us consider $W_n = n^2\|M_n\|^2$. We have
$$ \mathbb{E}\big[W_{n+1}|\mathcal{F}_n\big] \le W_n + \|\mathcal{R}_n\|_{op}^2\,\mathbb{E}\big[\|\xi_{n+1}\|^2|\mathcal{F}_n\big] \le W_n + O\big(v_{n+1}^2\big) \quad a.s., $$
and applying the Robbins-Siegmund Theorem, $\|M_n\|^2 = O\left(\frac{1}{n^2}\right)$ a.s.
Let us now consider $a \le 1$ and
$$ V_{n+1,a} = \frac{(n+1)^a}{v_{n+1}^2\big(\ln(n+1)\big)^{1+\delta}}\,\|M_{n+1}\|^2. $$
Then,
$$ \mathbb{E}\big[V_{n+1,a}|\mathcal{F}_n\big] \le \left(\frac{n}{n+1}\right)^{2-a}\frac{v_n^2}{v_{n+1}^2}\,V_{n,a} + O\left(\frac{1}{n^{2-a}(\ln n)^{1+\delta}}\right) = \left(1 - \frac{2-a-2c}{n} + \frac{2a_n}{n} + O\left(\frac{1}{n^2}\right)\right)V_{n,a} + O\left(\frac{1}{n^{2-a}(\ln n)^{1+\delta}}\right) \quad a.s. $$
Applying the Robbins-Siegmund theorem,
• if $c < 1/2$, one can take $a = 1$ and
$$ \|M_n\|^2 = o\left(\frac{v_n^2(\ln n)^{1+\delta}}{n}\right) \quad a.s.; $$
• if $\sum_{n\ge1}\frac{a_n}{n} < +\infty$ and $1/2 \le c \le 1$, one can take $a = 2-2c$ and
$$ \|M_n\|^2 = o\big(n^{2c-2}v_n^2(\ln n)^{1+\delta}\big) \quad a.s.; $$
• if $\sum_{n\ge1}\frac{a_n}{n} = +\infty$ and $1/2 \le c < 1$, for all $a < 2-2c$,
$$ \|M_n\|^2 = o\big(n^{-a}v_n^2(\ln n)^{1+\delta}\big) \quad a.s. \qquad \square $$
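Purely as an illustrative numerical sketch of the second bullet (the choice $v_n = n^{-1/4}$, hence $c = 1/4 < 1/2$, and all constants below are assumptions, not taken from the paper):

import numpy as np

rng = np.random.default_rng(2)
n_max = 10**6
k = np.arange(1, n_max + 1)
v = k ** (-0.25)                            # v_k = k^{-1/4}: v_k / v_{k+1} = 1 + (1/4)/k + o(1/k), so c = 1/4
S = np.cumsum(v * rng.normal(size=n_max))   # partial sums of R_k xi_{k+1} (scalar case H = R)
M = S / k                                   # M_n = (1/n) * sum_{k<=n} R_k xi_{k+1}
for n in (10**4, 10**5, 10**6):
    envelope = v[n - 1] * np.sqrt(np.log(n) ** 1.5 / n)  # v_n sqrt((ln n)^{1+delta} / n), delta = 1/2
    print(n, abs(M[n - 1]) / envelope)

The printed ratios should drift toward $0$, in agreement with the $o(\cdot)$ statement; with $v_n \equiv 1$ one would instead recover the usual $\sqrt{\ln\ln n / n}$ behaviour of an averaged martingale.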
References

Bach, F. (2014). Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. The Journal of Machine Learning Research, 15(1):595–627.

Bates, D. M. and Watts, D. G. (1988). Nonlinear Regression Analysis and Its Applications. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. John Wiley & Sons, Inc., New York.

Bercu, B., Godichon, A., and Portier, B. (2020). An efficient stochastic Newton algorithm for parameter estimation in logistic regressions. SIAM Journal on Control and Optimization, 58(1):348–367.

Byrd, R. H., Hansen, S. L., Nocedal, J., and Singer, Y. (2016). A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031.

Cardot, H., Cénac, P., and Godichon-Baggioni, A. (2017). Online estimation of the geometric median in Hilbert spaces: Nonasymptotic confidence balls. The Annals of Statistics, 45(2):591–614.

Cardot, H. and Godichon-Baggioni, A. (2015). Fast estimation of the median covariation matrix with application to online robust principal components analysis. TEST, pages 1–20.

Duflo, M. (1997). Random Iterative Models, volume 34 of Applications of Mathematics (New York). Springer-Verlag, Berlin. Translated from the 1990 French original by Stephen S. Wilson and revised by the author.

Gadat, S. and Panloup, F. (2017). Optimal non-asymptotic bound of the Ruppert-Polyak averaging without strong convexity. arXiv preprint arXiv:1709.03342.

Giurcăneanu, C. D., Tăbuş, I., and Astola, J. (2005). Clustering time series gene expression data based on sum-of-exponentials fitting. EURASIP Journal on Applied Signal Processing, (8):1159–1173.

Godichon-Baggioni, A. (2019). Online estimation of the asymptotic variance for averaged stochastic gradient algorithms. Journal of Statistical Planning and Inference, 203:1–19.

Godichon-Baggioni, A. (2019). Lp and almost sure rates of convergence of averaged stochastic gradient algorithms: locally strongly convex objective. ESAIM: Probability and Statistics, 23:841–873.

Jakubowski, A. (1988). Tightness criteria for random measures with application to the principle of conditioning in Hilbert spaces. Probability and Mathematical Statistics, 9(1):95–114.

Jennrich, R. I. (1969). Asymptotic properties of non-linear least squares estimators. The Annals of Mathematical Statistics, 40:633–643.

Komori, O., Eguchi, S., Ikeda, S., Okamura, H., Ichinokawa, M., and Nakayama, S. (2016). An asymmetric logistic regression model for ecological data. Methods in Ecology and Evolution, 7(2):249–260.

Kushner, H. J. and Yin, G. G. (2003). Stochastic Approximation and Recursive Algorithms and Applications, volume 35 of Applications of Mathematics (New York). Springer-Verlag, New York, second edition. Stochastic Modelling and Applied Probability.

Lai, T. L. (1994). Asymptotic properties of nonlinear least squares estimates in stochastic regression models. The Annals of Statistics, 22(4):1917–1930.

Lucchi, A., McWilliams, B., and Hofmann, T. (2015). A variance reduced stochastic Newton method. arXiv preprint arXiv:1503.08316.

Mokhtari, A. and Ribeiro, A. (2014). RES: Regularized stochastic BFGS algorithm. IEEE Transactions on Signal Processing, 62(23):6089–6104.

Pelletier, M. (1998). On the almost sure asymptotic behaviour of stochastic algorithms. Stochastic Processes and their Applications, 78(2):217–244.

Pelletier, M. (2000). Asymptotic almost sure efficiency of averaged stochastic algorithms. SIAM Journal on Control and Optimization, 39(1):49–72.

Pinelis, I. (1994). Optimum bounds for the distributions of martingales in Banach spaces. The Annals of Probability, 22:1679–1706.

Pollard, D. and Radchenko, P. (2006). Nonlinear least-squares estimation. Journal of Multivariate Analysis, 97(2):548–562.

Polyak, B. and Juditsky, A. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30:838–855.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407.

Ruppert, D. (1988). Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering.

Sherman, J. and Morrison, W. J. (1950). Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. The Annals of Mathematical Statistics, 21:124–127.

Skouras, K. (2000). Strong consistency in nonlinear stochastic regression models. The Annals of Statistics, 28(3):871–879.

Suárez, E., Pérez, C. M., Rivera, R., and Martínez, M. N. (2017). Applications of Regression Models in Epidemiology. John Wiley & Sons.

van de Geer, S. (1990). Estimating a regression function. The Annals of Statistics, 18(2):907–924.

van de Geer, S. and Wegkamp, M. (1996). Consistency for the least squares estimator in nonparametric regression. The Annals of Statistics, 24(6):2513–2523.

Varian, H. R. (2014). Big data: New tricks for econometrics. Journal of Economic Perspectives, 28(2):3–28.

Wu, C.-F. (1981). Asymptotic theory of nonlinear least squares estimation. The Annals of Statistics, 9(3):501–513.

Yang, Z., Wang, Z., Liu, H., Eldar, Y., and Zhang, T. (2016). Sparse nonlinear regression: Parameter estimation under nonconvexity. In Balcan, M. F. and Weinberger, K. Q., editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2472–2481, New York, New York, USA. PMLR.

Yao, J.-F. (2000). On least squares estimation for stable nonlinear AR processes.