General multilevel adaptations for stochastic approximation algorithms
STEFFEN DEREICH AND THOMAS MÜLLER-GRONBACH
Abstract.
In this article we present and analyse new multilevel adaptations of stochastic approximation algorithms for the computation of a zero of a function $f\colon D\to\mathbb{R}^d$ defined on a convex domain $D\subset\mathbb{R}^d$, which is given as a parameterised family of expectations. Our approach is universal in the sense that, having multilevel implementations for a particular application at hand, it is straightforward to implement the corresponding stochastic approximation algorithm. Moreover, previous research on multilevel Monte Carlo can be incorporated in a natural way. This is due to the fact that the analysis of the error and the computational cost of our method is based on similar assumptions as used in Giles [7] for the computation of a single expectation. Additionally, we essentially only require that $f$ satisfies a classical contraction property from stochastic approximation theory. Under these assumptions we establish error bounds in $p$-th mean for our multilevel Robbins-Monro and Polyak-Ruppert schemes that decay in the computational time as fast as the classical error bounds for multilevel Monte Carlo approximations of single expectations known from Giles [7].

1. Introduction
Let $D\subset\mathbb{R}^d$ be closed and convex and let $U$ be a random variable on a probability space $(\Omega,\mathcal F,\mathbb P)$ with values in a set $\mathcal U$ equipped with some $\sigma$-field. We study the problem of computing zeros of functions $f\colon D\to\mathbb{R}^d$ of the form
\[
f(\theta)=\mathbb{E}[F(\theta,U)],
\]
where $F\colon D\times\mathcal U\to\mathbb{R}^d$ is a product measurable function such that all expectations $\mathbb{E}[F(\theta,U)]$ are well-defined. In this article we focus on the case where the random variables $F(\theta,U)$ cannot be simulated directly, so that one has to work with appropriate approximations in numerical simulations. For example, one may think of $U$ being a Brownian motion and of $F(\theta,U)$ being the payoff of an option, where $\theta$ is a parameter affecting the payoff and/or the dynamics of the price process. Alternatively, $F(\theta,U)$ might be the value of a PDE at certain positions, with $U$ representing random coefficients and $\theta$ a parameter of the equation.

In previous years the multilevel paradigm introduced by Heinrich [8] and Giles [7] has proved to be a very efficient tool in the numerical computation of expectations. Frikha [5] has recently shown that the efficiency of the multilevel paradigm prevails when combined with stochastic approximation algorithms. In the present paper we take a different approach than the one introduced by the latter author. Instead of employing a sequence of coupled Robbins-Monro algorithms to construct a multilevel estimate of a zero of $f$, we basically propose a single Robbins-Monro algorithm that uses in the $(n+1)$-th step a multilevel estimate of $\mathbb{E}[F(\theta_n,U)]$ with a complexity that is adapted to the actual state $\theta_n$ of the system and increases in the number of steps.

Mathematics Subject Classification.
Primary 62L20; Secondary 60J10, 65C05.
Key words and phrases.
Stochastic approximation; Monte Carlo; multilevel.
Our approach is universal in the sense that, having multilevel implementations for a particular application at hand, it is straightforward to implement the corresponding stochastic approximation algorithm. Moreover, previous research on multilevel Monte Carlo can be incorporated in a natural way. This is due to the fact that the analysis of the error and the computational cost of our method is based on similar assumptions on the biases, the $p$-th central moments and the simulation cost of the underlying approximations of $F(\theta,U)$ as used in Giles [7]; see Assumptions C.1 and C.2 in Section 3. Additionally, we require that $f$ satisfies a classical contraction property from stochastic approximation theory: there exist $L>0$ and a zero $\theta^*$ of $f$ such that for all $\theta\in D$,
\[
\langle f(\theta),\theta-\theta^*\rangle\le -L\|\theta-\theta^*\|^2,
\]
where $\langle\cdot,\cdot\rangle$ denotes an inner product on $\mathbb{R}^d$. Moreover, $f$ has to satisfy a linear growth condition relative to the zero $\theta^*$; see Assumption A.1 and Remark 2.1 in Section 2. Note that the contraction property implies that the zero $\theta^*$ is unique. Theorem 3.1 asserts that under these assumptions the maximum $p$-th mean error $\sup_{k\ge n}\mathbb{E}[\|\theta_k-\theta^*\|^p]$ of our properly tuned multilevel Robbins-Monro scheme $(\theta_n)_{n\in\mathbb N}$ satisfies the same upper bounds in terms of the computational time needed to compute $\theta_n$ as the bounds obtained in Giles [7] for the multilevel computation of a single expectation.

In general, the design of this algorithm requires knowledge of the constant $L$ in the contraction property of $f$. To bypass this problem without loss of efficiency one may work with a Polyak-Ruppert average of our algorithm.
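The core mechanism just described can be sketched in a few lines of Python. Everything below is an illustrative toy, not the paper's scheme: `F_level` plays the role of a hypothetical level-$l$ approximation of $F(\theta,U)$ with geometrically decaying bias, the estimator is the standard multilevel Monte Carlo telescoping sum with coupled levels, and the depth of the estimator grows with the step index.

```python
import random

def F_level(theta, u, l):
    # Hypothetical level-l approximation of F(theta, u); here the exact map
    # f(theta) = E[F(theta, U)] is -(theta - 1) with zero theta* = 1, and the
    # level-l bias 2^{-l} mimics a discretisation error.
    return -(theta - 1.0) + 2.0 ** (-l) + u

def mlmc_estimate(theta, L, M, rng):
    # Multilevel telescoping sum E[F_L] = E[F_0] + sum_l E[F_l - F_{l-1}],
    # with the two levels of each difference coupled through the same u.
    total = 0.0
    for l in range(L + 1):
        acc = 0.0
        for _ in range(M[l]):
            u = rng.gauss(0.0, 1.0)
            if l == 0:
                acc += F_level(theta, u, 0)
            else:
                acc += F_level(theta, u, l) - F_level(theta, u, l - 1)
        total += acc / M[l]
    return total

def multilevel_robbins_monro(theta0, n_steps, rng):
    # A single Robbins-Monro recursion whose n-th step uses a multilevel
    # estimate of f(theta) with complexity increasing in n (toy rule).
    theta = theta0
    for n in range(1, n_steps + 1):
        L = max(1, n.bit_length() // 2)          # depth grows with n
        M = [2 ** (L - l) for l in range(L + 1)]  # per-level sample sizes
        theta += (1.0 / n) * mlmc_estimate(theta, L, M, rng)
    return theta

theta_hat = multilevel_robbins_monro(0.0, 2000, random.Random(1))
```

With these toy choices the iterates settle near the zero $\theta^*=1$; in the schemes analysed below, the depth and sample-size rules are chosen to balance bias, variance and computational cost.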
Theorem 3.2 states that under Assumptions C.1 and C.2 on the approximations of $F(\theta,U)$ and Assumption B.1 on $f$, which is slightly stronger than condition A.1, a properly tuned multilevel Polyak-Ruppert average $(\bar\theta_n)_{n\in\mathbb N}$ achieves, for $q<p$, the same upper bounds in the relation of the $q$-th mean error $\mathbb{E}[\|\bar\theta_n-\theta^*\|^q]$ and the corresponding computational time as the previously introduced multilevel Robbins-Monro method.

We briefly outline the content of the paper. The multilevel algorithms and the respective complexity theorems are presented in Section 3 for the case where $D=\mathbb{R}^d$. General closed convex domains $D$ are covered in Section 4. We add that Sections 3 and 4 are self-contained, and a reader interested in the multilevel schemes only can immediately start reading in Section 3. The error analysis of the multilevel stochastic approximation algorithms is based on new estimates of the $p$-th mean error of Robbins-Monro and Polyak-Ruppert algorithms. These results are presented in Section 2. As a technical tool we employ a modified Burkholder-Davis-Gundy inequality, which is established in the appendix and might be of interest in itself; see Theorem 5.1.

We add that formally all results of the following sections remain true when replacing $(\mathbb{R}^d,\langle\cdot,\cdot\rangle)$ by an arbitrary separable Hilbert space. However, in that case the definition (61) of the computational cost of a multilevel algorithm might not be appropriate in general.

2. New error estimates for stochastic approximation algorithms
Since the pioneering work of Robbins and Monro [21] in 1951 a large body of research has been devoted to the analysis of stochastic approximation algorithms, with a strong focus on pathwise and weak convergence properties. In particular, laws of the iterated logarithm and central limit theorems have been established that allow one to optimise the parameters of the schemes with respect to the almost sure and weak convergence rates and the size of the limiting covariance. See e.g. [2, 3, 6, 9, 10, 13, 14, 15, 17, 18, 19, 20, 22, 23] for results and further references as well as the survey articles and monographs [1, 4, 11, 12, 16, 23]. Less attention has been paid to an error control in $L^p$-norm for arbitrary orders $p\ge 2$. We provide such estimates for the
Robbins-Monro approximation and the Polyak-Ruppert averaging introduced by Ruppert [23] and Polyak [20] under mild conditions on the ingredients of these schemes. These estimates build the basis for the error analysis of the multilevel schemes introduced in Section 3.

Throughout this section we fix $p\in[2,\infty)$, a probability space $(\Omega,\mathcal F,\mathbb P)$ equipped with a filtration $(\mathcal F_n)_{n\in\mathbb N}$, and a scalar product $\langle\cdot,\cdot\rangle$ on $\mathbb{R}^d$ with induced norm $\|\cdot\|$. Furthermore, we fix a measurable function $f\colon\mathbb{R}^d\to\mathbb{R}^d$ that has a unique zero $\theta^*\in\mathbb{R}^d$. We consider an adapted $\mathbb{R}^d$-valued dynamical system $(\theta_n)_{n\in\mathbb N}$ iteratively defined by
\[
(1)\qquad \theta_n=\theta_{n-1}+\gamma_n\bigl(f(\theta_{n-1})+\varepsilon_nR_n+\sigma_nD_n\bigr)\quad\text{for }n\in\mathbb{N},
\]
where $\theta_0\in\mathbb{R}^d$ is a fixed deterministic starting value,
(I) $(R_n)_{n\in\mathbb N}$ is a previsible process, the remainder/bias,
(II) $(D_n)_{n\in\mathbb N}$ is a sequence of martingale differences,
(III) $(\gamma_n)_{n\in\mathbb N}$ is a sequence of positive reals tending to zero, and $(\varepsilon_n)_{n\in\mathbb N}$ and $(\sigma_n)_{n\in\mathbb N}$ are sequences of non-negative real numbers.

Estimates for the Robbins-Monro algorithm.
Our goal is to quantify the speed of convergence of the sequence $(\theta_n)_{n\in\mathbb N}$ to $\theta^*$ in the $p$-th mean sense in terms of the step-sizes $\gamma_n$, the bias-levels $\varepsilon_n$ and the noise-levels $\sigma_n$. To this end we employ the following set of assumptions in addition to (I)-(III).

A.1 (Assumptions on $f$ and $\theta^*$). There exist $L,L'\in(0,\infty)$ such that for all $\theta\in\mathbb{R}^d$
(i) $\langle\theta-\theta^*,f(\theta)\rangle\le -L\|\theta-\theta^*\|^2$ and
(ii) $\langle\theta-\theta^*,f(\theta)\rangle\le -L'\|f(\theta)\|^2$.

A.2 (Assumptions on $(R_n)_{n\in\mathbb N}$ and $(D_n)_{n\in\mathbb N}$). It holds
(i) $\sup_{n\in\mathbb N}\operatorname{esssup}\|R_n\|<\infty$ and
(ii) $\sup_{n\in\mathbb N}\mathbb{E}[\|D_n\|^p]<\infty$.

Remark 2.1 (Discussion of Assumption A.1). We briefly discuss A.1(i) and A.1(ii). Let $\theta\in\mathbb{R}^d$ and $c_1,c_2,c_2',\gamma\in(0,\infty)$, and consider the conditions
(i) $\langle\theta-\theta^*,f(\theta)\rangle\le -c_1\|\theta-\theta^*\|^2$,
(ii) $\langle\theta-\theta^*,f(\theta)\rangle\le -c_2\|f(\theta)\|^2$,
(ii') $\|f(\theta)\|\le c_2'\|\theta-\theta^*\|$,
($*$) $\|\theta-\theta^*+\gamma f(\theta)\|^2\le\|\theta-\theta^*\|^2\bigl(1-\gamma c_1(2-\gamma/c_2)\bigr)$.
By the Cauchy-Schwarz inequality we have
\[
(2)\qquad f \text{ satisfies (ii)}\;\Rightarrow\;f \text{ satisfies (ii') for every } c_2'\ge 1/c_2,
\]
and the choice $f(\theta)=\theta$ shows that the reverse implication is not valid in general. However, it is easy to check that
\[
f \text{ satisfies (i) and (ii')}\;\Rightarrow\;f \text{ satisfies (ii) for any } c_2\le c_1/(c_2')^2.
\]
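Since the algebra behind the contraction property ($*$) is easy to get wrong, the following snippet (an illustration added here, not part of the paper) checks numerically that (i) and (ii) imply ($*$) for the one-dimensional toy map $f(\theta)=-A(\theta-\theta^*)$, which satisfies (i) with $c_1=A$ and (ii) with $c_2=1/A$.

```python
import random

# Toy check that (i) and (ii) imply the contraction (*) for gamma <= 2*c2,
# for f(theta) = -A*(theta - theta_star) in dimension d = 1.
A, theta_star = 3.0, 0.7
c1, c2 = A, 1.0 / A

rng = random.Random(3)
ok = True
for _ in range(1000):
    theta = rng.uniform(-10.0, 10.0)
    gamma = rng.uniform(0.0, 2.0 * c2)
    x = theta - theta_star
    lhs = (x + gamma * (-A * x)) ** 2                       # |theta - theta* + gamma f(theta)|^2
    rhs = x * x * (1.0 - gamma * c1 * (2.0 - gamma / c2))   # right-hand side of (*)
    ok = ok and lhs <= rhs + 1e-9
assert ok
```

For this linear map the two sides of ($*$) agree exactly, so the assertion holds up to floating-point tolerance.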
Thus, in the presence of condition A.1(i), condition A.1(ii) is equivalent to a linear growth condition on the function $f$ relative to the zero $\theta^*$.

Finally, conditions (i) and (ii) jointly imply the contraction property ($*$), which is crucial for the analysis of the Robbins-Monro scheme. We have
\[
(3)\qquad f \text{ satisfies (i) and (ii)}\;\Rightarrow\;f \text{ satisfies } (*) \text{ for every } \gamma\le 2c_2.
\]
In fact, let $\gamma\le 2c_2$ and use (ii) and then (i) to conclude that
\[
\|\theta-\theta^*+\gamma f(\theta)\|^2=\|\theta-\theta^*\|^2+2\gamma\langle\theta-\theta^*,f(\theta)\rangle+\gamma^2\|f(\theta)\|^2
\le\|\theta-\theta^*\|^2+\langle\theta-\theta^*,f(\theta)\rangle\,(2\gamma-\gamma^2/c_2)
\le\|\theta-\theta^*\|^2-c_1\|\theta-\theta^*\|^2(2\gamma-\gamma^2/c_2).
\]
In the following we put, for $r\in(0,\infty)$ and $n,k\in\mathbb{N}$ with $n\ge k$,
\[
(4)\qquad \tau_{k,n}(r)=\prod_{j=k+1}^n(1-\gamma_jr),\qquad
e_{k,n}(r)=\max_{j=k,\dots,n}\varepsilon_j\,\tau_{j,n}(r),\qquad
s_{k,n}(r)=\Bigl(\sum_{j=k}^n\gamma_j^2\sigma_j^2\,(\tau_{j,n}(r))^2\Bigr)^{1/2}.
\]
First we provide $p$-th mean error estimates in terms of the quantities introduced in (4).

Proposition 2.2.
Assume that (I)-(III), A.1 and A.2 are satisfied. Then for every $r\in(0,L)$ there exist $n_0\in\mathbb{N}$ and $\kappa\in(0,\infty)$ such that for all $n\ge k_0\ge n_0$ we have $\tau_{k_0,n}(r)\in(0,1]$ and
\[
(5)\qquad \mathbb{E}\bigl[\|\theta_n-\theta^*\|^p\bigr]^{1/p}\le\kappa\bigl(\tau_{k_0,n}(r)\,\mathbb{E}[\|\theta_{k_0}-\theta^*\|^p]^{1/p}+e_{k_0,n}(r)+s_{k_0,n}(r)\bigr).
\]

Proof.
Without loss of generality we may assume that $\theta^*=0$. By Assumption A.2 there exists $\kappa_1\in(0,\infty)$ such that for all $n\in\mathbb{N}$,
\[
(6)\qquad \|R_n\|\le\kappa_1\ \text{a.s.}
\]
and
\[
(7)\qquad \mathbb{E}[\|D_n\|^p]\le\kappa_1.
\]
Note further that (2) in Remark 2.1 implies that the dynamical system (1) satisfies $\|\theta_n\|\le(1+\gamma_n/L')\|\theta_{n-1}\|+\gamma_n\varepsilon_n\|R_n\|+\gamma_n\sigma_n\|D_n\|$ for every $n\in\mathbb{N}$. With Assumption A.2 we conclude that $\theta_n\in L^p(\Omega,\mathcal F,\mathbb P)$ for every $n\in\mathbb{N}$.

Let $r\in(0,L)$. Since $\lim_{n\to\infty}\gamma_n=0$ we may choose $n_0\in\mathbb{N}$ such that $1-\gamma_nL>0$ and $1-\gamma_n/(2L')\ge(r+L)/(2L)$ for all $n\ge n_0$. Using (3) in Remark 2.1 we obtain that for all $\theta\in\mathbb{R}^d$ and all $n\ge n_0$,
\[
(8)\qquad \|\theta+\gamma_nf(\theta)\|\le\bigl(1-\gamma_nL(1-\gamma_n/(2L'))\bigr)\|\theta\|\le\bigl(1-\gamma_n(r+L)/2\bigr)\|\theta\|.
\]
In the following we write $\tau_{k,n}$, $e_{k,n}$ and $s_{k,n}$ in place of $\tau_{k,n}(r)$, $e_{k,n}(r)$ and $s_{k,n}(r)$, respectively. Let $k_0\ge n_0$ and put
\[
(9)\qquad \zeta_n=\frac{\theta_n}{\tau_{k_0,n}},\qquad
\xi_n=\frac{\theta_{n-1}+\gamma_n(f(\theta_{n-1})+\varepsilon_nR_n)}{\tau_{k_0,n}},\qquad
M_n=\zeta_{k_0}+\sum_{k=k_0+1}^n\frac{\gamma_k\sigma_kD_k}{\tau_{k_0,k}}
\]
for $n\ge k_0$. Then $(\zeta_n)_{n\ge k_0}$ is adapted, $(\xi_n)_{n>k_0}$ is previsible, $(M_n)_{n\ge k_0}$ is a martingale, and for all $n>k_0$ we have
\[
(10)\qquad \zeta_n=\xi_n+\Delta M_n.
\]
Below we show that there exists a constant $\kappa_2\in(0,\infty)$, which only depends on $L$, $r$ and $\kappa_1$, such that a.s. for all $n>k_0$,
\[
(11)\qquad \|\xi_n\|\le\|\zeta_{n-1}\|\vee\kappa_2\,\frac{\varepsilon_n}{\tau_{k_0,n}}
\]
and
\[
(12)\qquad \mathbb{E}\bigl[[M]_n^{p/2}\bigr]^{1/p}\le\mathbb{E}[\|\theta_{k_0}\|^p]^{1/p}+\kappa_1^{1/p}\,\frac{s_{k_0,n}}{\tau_{k_0,n}}.
\]
Observing (10) and (11) we may apply the BDG inequality, see Theorem 5.1, to the processes $(\zeta_n)_{n\ge k_0}$, $(\xi_n)_{n>k_0}$ and $(M_n)_{n\ge k_0}$ to obtain for $n\ge k_0$ that
\[
(13)\qquad \mathbb{E}\bigl[\max_{k_0\le k\le n}\|\zeta_k\|^p\bigr]\le\kappa_3\Bigl(\mathbb{E}\bigl[[M]_n^{p/2}\bigr]+\Bigl(\kappa_2\,\frac{e_{k_0,n}}{\tau_{k_0,n}}\Bigr)^p\Bigr),
\]
where the constant $\kappa_3\in(0,\infty)$ only depends on $p$.
Using (12) we conclude that
\[
\mathbb{E}\bigl[\|\theta_n\|^p\bigr]=\tau_{k_0,n}^p\,\mathbb{E}\bigl[\|\zeta_n\|^p\bigr]\le 2^{p/2}\kappa_3\bigl(\tau_{k_0,n}^p\,\mathbb{E}[\|\theta_{k_0}\|^p]+\kappa_1^{p/2}s_{k_0,n}^p+\kappa_2^p\,e_{k_0,n}^p\bigr),
\]
which completes the proof of the proposition up to the justification of (11) and (12).

For the proof of (11) we use (6) and (8) to obtain that a.s. for $n>k_0$,
\[
\|\xi_n\|\le\Bigl\|\frac{\theta_{n-1}+\gamma_nf(\theta_{n-1})}{(1-\gamma_nr)\,\tau_{k_0,n-1}}\Bigr\|+\frac{\gamma_n\varepsilon_n}{\tau_{k_0,n}}\|R_n\|
\le\frac{1-\gamma_n(r+L)/2}{1-\gamma_nr}\,\|\zeta_{n-1}\|+\kappa_1\,\frac{\gamma_n\varepsilon_n}{\tau_{k_0,n}}
\le\Bigl(1-\gamma_n\,\frac{L-r}{2}\Bigr)\|\zeta_{n-1}\|+\kappa_1\,\frac{\gamma_n\varepsilon_n}{\tau_{k_0,n}},
\]
where the last inequality follows from the fact that $(1-a)/(1-b)\le 1-a+b$ for $0\le b\le a\le 1$. Hence, if $\frac{L-r}{2}\,\|\zeta_{n-1}\|\ge\kappa_1\varepsilon_n/\tau_{k_0,n}$ then $\|\xi_n\|\le\|\zeta_{n-1}\|$, while in the case $\frac{L-r}{2}\,\|\zeta_{n-1}\|<\kappa_1\varepsilon_n/\tau_{k_0,n}$,
\[
\|\xi_n\|\le\frac{2\kappa_1}{L-r}\,\frac{\varepsilon_n}{\tau_{k_0,n}}.
\]
Thus (11) holds for any $\kappa_2\ge 2\kappa_1/(L-r)$.

It remains to show (12). Using (7) and the Minkowski inequality in $L^{p/2}$ we get
\[
(14)\qquad \mathbb{E}\bigl[[M]_n^{p/2}\bigr]^{2/p}
=\mathbb{E}\Bigl[\Bigl(\|\theta_{k_0}\|^2+\sum_{k=k_0+1}^n\|\Delta M_k\|^2\Bigr)^{p/2}\Bigr]^{2/p}
\le\mathbb{E}[\|\theta_{k_0}\|^p]^{2/p}+\sum_{k=k_0+1}^n\frac{\gamma_k^2\sigma_k^2}{\tau_{k_0,k}^2}\bigl(\mathbb{E}[\|D_k\|^p]\bigr)^{2/p}
\le\Bigl(\mathbb{E}[\|\theta_{k_0}\|^p]^{1/p}+\kappa_1^{1/p}\,\frac{s_{k_0,n}}{\tau_{k_0,n}}\Bigr)^2.
\]
Hence (12) holds, which completes the proof. $\square$

Remark 2.3.
The proof of the $p$-th mean error estimate (5) in Proposition 2.2 for the times $n\ge k_0\ge n_0$ makes use of the recursion (1) for $n$ strictly larger than $k_0$ only. Hence, if $m\in\mathbb{N}$ and $(\tilde\theta_n)_{n\ge m}$ is the dynamical system given by the recursion (1) with an arbitrary random starting value $\tilde\theta_m\in L^p(\Omega,\mathcal F_m,\mathbb P)$, then estimate (5) is valid for $\tilde\theta_n$ in place of $\theta_n$ with the same constant $\kappa$ for all $n\ge k_0\ge\max(n_0,m)$.

The following theorem provides an estimate for the $p$-th mean error of $\theta_n$ in terms of the product $v_n=\sqrt{\gamma_n}\,\sigma_n$.
It requires the following additional assumptions on the step-sizes $\gamma_n$, the bias-levels $\varepsilon_n$ and the noise-levels $\sigma_n$.

A.3 (Assumptions on $(\gamma_n)_{n\in\mathbb N}$, $(\varepsilon_n)_{n\in\mathbb N}$ and $(\sigma_n)_{n\in\mathbb N}$). We have $v_n>0$ for all $n\in\mathbb{N}$. Furthermore, with $L$ according to A.1(i),
(i) $\limsup_{n\to\infty}\varepsilon_n/v_n<\infty$, and
(ii) $\limsup_{n\to\infty}\dfrac{1}{\gamma_n}\Bigl(1-\dfrac{v_n}{v_{n-1}}\Bigr)<L$.

Theorem 2.4 (Robbins-Monro approximation). Assume that conditions (I)-(III), A.1, A.2 and A.3 are satisfied. Then there exists $\kappa\in(0,\infty)$ such that for all $n\in\mathbb{N}$,
\[
\mathbb{E}\bigl[\|\theta_n-\theta^*\|^p\bigr]^{1/p}\le\kappa\,v_n.
\]

Proof.
Below we show that there exist $r\in(0,L)$, $\kappa_0\in(0,\infty)$ and $n_0\in\mathbb{N}$ such that for all $n\ge n_0$ we have $\gamma_nr<1$ and
\[
(15)\qquad \tau_{n_0,n}(r)+e_{n_0,n}(r)+s_{n_0,n}(r)\le\kappa_0\,v_n.
\]
Then, by choosing $n_1\in\mathbb{N}$ and $\kappa\in(0,\infty)$ according to Proposition 2.2 and taking $k_0=\max(n_0,n_1)$, we have for $n\ge k_0$ that $\tau_{k_0,n}(r)=\tau_{n_0,n}(r)/\tau_{n_0,k_0}(r)$, $e_{k_0,n}(r)\le e_{n_0,n}(r)$ and $s_{k_0,n}(r)\le s_{n_0,n}(r)$, and therefore for all $n\ge k_0$,
\[
\mathbb{E}\bigl[\|\theta_n-\theta^*\|^p\bigr]^{1/p}\le\kappa\bigl(\mathbb{E}[\|\theta_{k_0}-\theta^*\|^p]^{1/p}/\tau_{n_0,k_0}(r)+1\bigr)\bigl(\tau_{n_0,n}(r)+e_{n_0,n}(r)+s_{n_0,n}(r)\bigr),
\]
which finishes the proof of the theorem, up to the justification of (15).

By Assumption A.3 there exist $r_0\in(0,L)$, $\kappa_1\in(0,\infty)$ and $n_0\in\mathbb{N}$ such that for all $n>n_0$,
\[
(16)\qquad \frac{v_{n-1}}{v_n}\le\frac{1}{1-\gamma_nr_0}
\]
as well as
\[
(17)\qquad \varepsilon_n\le\kappa_1v_n.
\]
Take $r\in(r_0,L)$ and assume without loss of generality that $1-\gamma_nr>0$ for all $n\ge n_0$. In the following we write $\tau_{k,n}$, $e_{k,n}$ and $s_{k,n}$ in place of $\tau_{k,n}(r)$, $e_{k,n}(r)$ and $s_{k,n}(r)$, respectively. It follows from (16) and $r>r_0$ that the sequence $(v_n/\tau_{n_0,n})_{n\ge n_0}$ is increasing and therefore, for all $n\ge n_0$,
\[
(18)\qquad \tau_{n_0,n}=v_n\,\frac{\tau_{n_0,n}}{v_n}\le\frac{v_n}{v_{n_0}}.
\]
Furthermore, observing (17) we also have for all $n\ge n_0$,
\[
(19)\qquad e_{n_0,n}\le\kappa_1\max_{j=n_0,\dots,n}v_j\,\tau_{j,n}=\kappa_1\,\tau_{n_0,n}\max_{j=n_0,\dots,n}\frac{v_j}{\tau_{n_0,j}}=\kappa_1v_n.
\]
Put $\varphi(n)=s_{n_0,n}^2/v_n^2$ for $n\ge n_0$. Observing (16) we obtain that for $n>n_0$,
\[
\varphi(n)=\frac{v_{n-1}^2}{v_n^2}(1-\gamma_nr)^2\,\varphi(n-1)+\gamma_n
\le\frac{(1-\gamma_nr)^2}{(1-\gamma_nr_0)^2}\,\varphi(n-1)+\gamma_n
=\Bigl(1-\gamma_n\,\frac{r-r_0}{1-\gamma_nr_0}\Bigr)^2\varphi(n-1)+\gamma_n
\le\bigl(1-\gamma_n(r-r_0)\bigr)\varphi(n-1)+\gamma_n.
\]
This entails that
\[
\varphi(n)-\frac{1}{r-r_0}\le\bigl(1-\gamma_n(r-r_0)\bigr)\Bigl(\varphi(n-1)-\frac{1}{r-r_0}\Bigr),
\]
so that $\varphi(n)\le\varphi(n-1)\vee 1/(r-r_0)$. Hence, by induction, for all $n\ge n_0$, $\varphi(n)\le\varphi(n_0)\vee 1/(r-r_0)=\gamma_{n_0}\vee 1/(r-r_0)$, so that
\[
(20)\qquad s_{n_0,n}\le\bigl(\gamma_{n_0}\vee 1/(r-r_0)\bigr)^{1/2}\,v_n.
\]
Combining (18) to (20) yields (15). $\square$
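The bound (15) just established can be probed numerically. The snippet below (an illustration added here, using the $\ell^2$-type reading of $s_{k,n}(r)$ adopted in (4) above, which is a reconstruction) evaluates $\tau$, $e$ and $s$ for the polynomial choices $\gamma_n=1/n$, $\sigma_n=n^{-1/4}$, $\varepsilon_n=v_n$ and one admissible $r$, and confirms that their sum stays within a constant multiple of $v_n$.

```python
import math

# Sanity check of (15): gamma_n = 1/n, sigma_n = n**-0.25, eps_n = v_n,
# so that A.3 holds with r0 = 3/4; take r = 0.8 in (r0, L) for some L > 0.8.
gamma = lambda n: 1.0 / n
sigma = lambda n: n ** -0.25
v = lambda n: math.sqrt(gamma(n)) * sigma(n)   # v_n = sqrt(gamma_n)*sigma_n
eps = v

r, k0, N = 0.8, 5, 1000

def tau(k, n):
    # tau_{k,n}(r) = prod_{j=k+1}^{n} (1 - gamma_j * r)
    out = 1.0
    for j in range(k + 1, n + 1):
        out *= 1.0 - gamma(j) * r
    return out

e_val = max(eps(j) * tau(j, N) for j in range(k0, N + 1))
s_val = math.sqrt(sum((gamma(j) * sigma(j) * tau(j, N)) ** 2
                      for j in range(k0, N + 1)))
ratio = (tau(k0, N) + e_val + s_val) / v(N)    # bounded, as claimed in (15)
```

For these parameters the ratio stays around a single-digit constant, consistent with $\tau_{n_0,n}+e_{n_0,n}+s_{n_0,n}\le\kappa_0 v_n$.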
As a particular consequence of Theorem 2.4 we obtain the following result for polynomial step-sizes $\gamma_n$ and noise-levels $\sigma_n$.

Corollary 2.5 (Polynomial step-sizes and noise-levels). Assume that conditions (I)-(III), A.1 and A.2 are satisfied and choose $L$ according to A.1(i). Take $\gamma_1,\sigma_1\in(0,\infty)$, $r_1\in(0,1]$ and $r_2\in\mathbb{R}$ with $r_1<1$ or $\bigl(r_1=1$ and $\gamma_1>(1+2r_2)/(2L)\bigr)$, and let for all $n\in\mathbb{N}$,
\[
\gamma_n=\gamma_1n^{-r_1},\qquad \sigma_n=\sigma_1n^{-r_2}.
\]
Assume further that $\limsup_{n\to\infty}n^{(r_1+2r_2)/2}\,\varepsilon_n<\infty$. Then there exists a constant $\kappa\in(0,\infty)$ such that for all $n\in\mathbb{N}$,
\[
(21)\qquad \mathbb{E}\bigl[\|\theta_n-\theta^*\|^p\bigr]^{1/p}\le\kappa\,n^{-(r_1+2r_2)/2}.
\]

Proof.
We first verify that Assumption A.3 is satisfied. By definition of $\gamma_n$ and $\sigma_n$ we have
\[
v_n=\sqrt{\gamma_1}\,\sigma_1\,n^{-(r_1+2r_2)/2}.
\]
Thus, A.3(i) is satisfied due to the assumption on the sequence $(\varepsilon_n)_{n\in\mathbb N}$. Moreover, it is easy to see that
\[
\lim_{n\to\infty}\frac{1}{\gamma_n}\Bigl(1-\frac{v_n}{v_{n-1}}\Bigr)=\begin{cases}0,&\text{if }r_1<1,\\[1ex] \dfrac{1+2r_2}{2\gamma_1},&\text{if }r_1=1,\end{cases}
\]
and therefore A.3(ii) is satisfied as well. Since conditions (I)-(III), A.1 and A.2 are part of the corollary, we may apply Theorem 2.4 to obtain the claimed error estimate. $\square$

Remark 2.6 (Exponential decay of noise-levels). Assumption A.3(ii) may also be satisfied in the case that the noise-levels $\sigma_n$ have a superpolynomial decay. For instance, if
\[
\gamma_n=a_1n^{-r_1},\qquad \sigma_n=a_2n^{-r_2}\exp(-a_3n^{r_3})
\]
for all $n\in\mathbb{N}$, where $a_1,a_2,a_3\in(0,\infty)$, $r_2\in\mathbb{R}$ and $r_1,r_3\in(0,1]$, then
\[
\lim_{n\to\infty}\frac{1}{\gamma_n}\Bigl(1-\frac{v_n}{v_{n-1}}\Bigr)=\begin{cases}0,&\text{if }r_3<1-r_1,\\[1ex] \dfrac{a_3r_3}{a_1},&\text{if }r_3=1-r_1,\\[1ex] \infty,&\text{if }r_3>1-r_1.\end{cases}
\]
On the other hand, if the noise-levels $\sigma_n$ are decreasing with exponential decay and the step-sizes $\gamma_n$ are monotonically decreasing, then Assumption A.3(ii) is typically not satisfied. In fact, if $\gamma_n\ge\gamma_{n+1}$ for $n\ge n_0$, $\lim_{n\to\infty}\gamma_n=0$ and $\limsup_{n\to\infty}\sigma_{n+1}/\sigma_n<1$, then $\lim_{n\to\infty}\gamma_n^{-1}(1-v_n/v_{n-1})=\infty$.

The case of an exponential decay of the noise-levels $\sigma_n$ can be treated by applying Proposition 2.2. Assume that conditions (I)-(III), A.1 and A.2 are satisfied. Assume further that there exist $r\in(0,L)$ and $c\in(0,\infty)$ such that for all $n\in\mathbb{N}$,
(a) $\sigma_n\le c\exp(-rn)$ and
(b) $\varepsilon_n\le c\exp\bigl(-r\sum_{k=1}^n\gamma_k\bigr)$.
Then there exists $\kappa\in(0,\infty)$ such that for all $n\in\mathbb{N}$,
\[
(22)\qquad \mathbb{E}\bigl[\|\theta_n-\theta^*\|^p\bigr]^{1/p}\le\kappa\exp\Bigl(-r\sum_{k=1}^n\gamma_k\Bigr).
\]

Proof of (22). Since $\lim_{n\to\infty}\gamma_n=0$ and $1-x\le\exp(-x)$ for all $x\in[0,1]$, we have $1-\gamma_nr\le\exp(-r\gamma_n)$ for $n$ sufficiently large. Hence there exists $n_0\in\mathbb{N}$ such that for all $n\ge j\ge n_0$,
\[
(23)\qquad \tau_{j,n}(r)\le\exp\Bigl(-r\sum_{k=j+1}^n\gamma_k\Bigr).
\]
Using (23) as well as Assumption (b) we get for all $n\ge j\ge n_0$,
\[
(24)\qquad e_{j,n}(r)\le(1+c)\exp\Bigl(-r\sum_{k=1}^n\gamma_k\Bigr).
\]
Choosing $n_0$ large enough we may also assume that $\gamma_n\le 1/2$ for all $n\ge n_0$. Employing (23) and Assumption (a) we then conclude that for all $n\ge j\ge n_0$,
\[
(25)\qquad s_{j,n}^2(r)=\sum_{k=j}^n\gamma_k^2\sigma_k^2(\tau_{k,n}(r))^2
\le\sum_{k=j}^n(1+c)^2\exp\Bigl(-2rk-2r\sum_{\ell=k+1}^n\gamma_\ell\Bigr)
=\exp\Bigl(-2r\sum_{\ell=j+1}^n\gamma_\ell\Bigr)\sum_{k=j}^n(1+c)^2\exp\Bigl(-2rk+2r\sum_{\ell=j+1}^k\gamma_\ell\Bigr)
\le\exp\Bigl(-2r\sum_{\ell=j+1}^n\gamma_\ell\Bigr)\sum_{k=j}^n(1+c)^2\exp(-rk)
\le\exp\Bigl(-2r\sum_{\ell=j+1}^n\gamma_\ell\Bigr)\,\frac{(1+c)^2}{1-\exp(-r)}.
\]
Combining (23) to (25) with Proposition 2.2 completes the proof of (22). $\square$
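To illustrate Theorem 2.4 and Corollary 2.5 empirically, the following sketch simulates recursion (1) directly with illustrative toy ingredients (a linear $f$ with zero $\theta^*=1$, bounded bias terms, Gaussian martingale differences and polynomial step-size and noise sequences); it is a demonstration added here, not the paper's algorithm.

```python
import random

# Direct transcription of recursion (1) with toy ingredients:
# f(theta) = -2*(theta - 1) (A.1 holds with theta* = 1 and L = 2), |R_n| <= 1,
# Gaussian martingale differences D_n, and gamma_n = 1/n, eps_n = n**-0.75,
# sigma_n = n**-0.25 as in Corollary 2.5 with r1 = 1, r2 = 1/4 and
# gamma_1 = 1 > (1 + 2*r2)/(2*L) = 0.375.
def run_system(n_steps, rng):
    theta = 0.0                             # deterministic start theta_0
    for n in range(1, n_steps + 1):
        gamma_n = 1.0 / n
        eps_n = n ** -0.75
        sigma_n = n ** -0.25
        R_n = 1.0                           # bounded bias term, A.2(i)
        D_n = rng.gauss(0.0, 1.0)           # martingale difference, (II)
        theta += gamma_n * (-2.0 * (theta - 1.0) + eps_n * R_n + sigma_n * D_n)
    return theta

theta_n = run_system(5000, random.Random(2))
# Theorem 2.4 suggests an error of order v_n = sqrt(gamma_n)*sigma_n = n**-0.75.
```

With these parameters the final iterate lies close to $\theta^*=1$, in line with the rate $n^{-(r_1+2r_2)/2}=n^{-3/4}$ from (21).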
So far we have proved error estimates for the single random variables $\theta_n$. In the following theorem we establish error estimates which allow one to control the quality of approximation for the whole sequence $(\theta_k)_{k\ge k_0}$ starting from some time $k_0$. To this end we employ the following assumption A.4, which is stronger than condition A.3.

A.4 (Assumptions on $(\gamma_n)_{n\in\mathbb N}$, $(\varepsilon_n)_{n\in\mathbb N}$ and $(\sigma_n)_{n\in\mathbb N}$). We have $v_n>0$ for all $n\in\mathbb{N}$. Furthermore, with $L$ according to A.1(i), there exist $c_1,c_2,\eta_1\in(0,\infty)$ as well as $\eta_2\in(0,1]$ such that $\eta_1>(1-\eta_2)/p$ and
(i) $\limsup_{n\to\infty}\varepsilon_n/v_n<\infty$,
(ii) $\limsup_{n\to\infty}\dfrac{1}{\gamma_n}\Bigl(1-\dfrac{v_n}{v_{n-1}}\Bigr)<L$ and $v_n\le c_1n^{-\eta_1}$ for all but finitely many $n\in\mathbb{N}$,
(iii) $\gamma_n\le c_2n^{-\eta_2}$ for all but finitely many $n\in\mathbb{N}$.

Theorem 2.7 (Robbins-Monro approximation). Assume that conditions (I)-(III), A.1, A.2 and A.4 are satisfied and let
\[
\eta^*=\eta_1-(1-\eta_2)/p.
\]
Then for all $\eta\in(0,\eta^*)$ there exist a constant $\kappa\in(0,\infty)$ and $n_0\in\mathbb{N}$ such that for all $k_0\ge n_0$,
\[
(26)\qquad \mathbb{E}\Bigl[\sup_{k\ge k_0}k^{p\eta}\|\theta_k-\theta^*\|^p\Bigr]^{1/p}\le\kappa\,k_0^{-(\eta^*-\eta)}.
\]

Proof.
Clearly, we may assume that $\theta^*=0$. Fix $\eta\in(0,\eta^*)$. We again use the quantities introduced in (4). Since Assumption A.4 is stronger than Assumption A.3, we see from the proof of Theorem 2.4 that there exist $r\in(0,L)$, $\kappa_0\in(0,\infty)$ and $n_0\in\mathbb{N}$ such that for all $n\ge n_0$ we have $\gamma_nr<1$ and
\[
(27)\qquad \tau_{n_0,n}(r)+e_{n_0,n}(r)+s_{n_0,n}(r)\le\kappa_0\,v_n,
\]
cf. (15). By A.4(ii) and A.4(iii) we may further assume that for all $n\ge n_0$,
\[
(28)\qquad v_n\le c_1/n^{\eta_1}\quad\text{and}\quad\gamma_n<\min(1,1/(2r)).
\]
Fix $k_0\ge n_0$ and define a strictly increasing sequence $(k_\ell)_{\ell\in\mathbb N}$ in $\mathbb{N}$ by
\[
k_\ell=\min\Bigl\{m\ge k_0:\sum_{k=k_0+1}^m\gamma_k\ge\ell\Bigr\}.
\]
Observing the upper bound for $\gamma_n$ in (28) it is then easy to see that for all $\ell\in\mathbb{N}$,
\[
(29)\qquad \sum_{k=k_{\ell-1}+1}^{k_\ell}\gamma_k\le 2.
\]
In the following we write $\tau_{k,n}$, $e_{k,n}$ and $s_{k,n}$ in place of $\tau_{k,n}(r)$, $e_{k,n}(r)$ and $s_{k,n}(r)$, respectively. We estimate the decay of the sequence $(\tau_{k_0,k_\ell})_{\ell\in\mathbb N}$. Let $\ell\in\mathbb{N}$. Using (28), the fact that $1-x\ge\exp(-2x)$ for all $x\in[0,1/2]$ and the fact that $1-x\le\exp(-x)$ for all $x\in[0,1]$, we obtain
\[
(30)\qquad \tau_{k_0,k_{\ell-1}}=\tau_{k_0,k_\ell}\prod_{k=k_{\ell-1}+1}^{k_\ell}(1-\gamma_kr)^{-1}\le\tau_{k_0,k_\ell}\prod_{k=k_{\ell-1}+1}^{k_\ell}\exp(2r\gamma_k)\le\tau_{k_0,k_\ell}\exp(4r)
\]
as well as
\[
(31)\qquad \tau_{k_0,k_\ell}\le\prod_{k=k_0+1}^{k_\ell}\exp(-r\gamma_k)\le\exp(-r\ell).
\]
Next, we establish a lower bound for the growth of the sequence $(k_\ell)_{\ell\in\mathbb N}$, namely
\[
(32)\qquad k_\ell\ge K_\ell \text{ for all } \ell\in\mathbb{N},\quad\text{where}\quad
K_\ell=\begin{cases}\bigl(\ell(1-\eta_2)/c_2+k_0^{1-\eta_2}\bigr)^{1/(1-\eta_2)},&\text{if }\eta_2<1,\\[1ex] k_0\exp(\ell/c_2),&\text{if }\eta_2=1.\end{cases}
\]
In fact, by A.4(iii) we get
\[
\ell\le\sum_{k=k_0+1}^{k_\ell}\gamma_k\le\sum_{k=k_0+1}^{k_\ell}c_2k^{-\eta_2}\le c_2\int_{k_0}^{k_\ell}x^{-\eta_2}\,dx
=\begin{cases}\dfrac{c_2}{1-\eta_2}\bigl(k_\ell^{1-\eta_2}-k_0^{1-\eta_2}\bigr),&\text{if }\eta_2<1,\\[1ex] c_2\ln\bigl(k_\ell/k_0\bigr),&\text{if }\eta_2=1,\end{cases}
\]
which yields (32). We are ready to establish the claimed estimate in $p$-th mean (26).
Similarly to the proof of Proposition 2.2 we consider the process $(\zeta_n)_{n\ge n_0}$ and the martingale $(M_n)_{n\ge n_0}$ given by (9), where $k_0$ is replaced by $n_0$. As in the proof of Proposition 2.2 we obtain the maximum estimate in $p$-th mean (13) for the process $(\zeta_n)_{n\ge n_0}$ and the estimate (12) for the quadratic variation process $([M]_n)_{n\ge n_0}$. Combining these two estimates we see that for sufficiently large $n_0$ there exists a constant $\kappa_1\in(0,\infty)$ such that for every $n\ge n_0$ we have
\[
\mathbb{E}\bigl[\max_{n_0\le k\le n}\|\zeta_k\|^p\bigr]\le\kappa_1\Bigl(\mathbb{E}\bigl[\|\theta_{n_0}\|^p\bigr]+\frac{s_{n_0,n}^p+e_{n_0,n}^p}{\tau_{n_0,n}^p}\Bigr).
\]
Using the latter inequality as well as (30), Theorem 2.4 and (27) we may thus conclude that there exists a constant $\kappa_2\in(0,\infty)$, which may depend on $n_0$ but not on $k_0$, such that for every $\ell\in\mathbb{N}$ we have
\[
\mathbb{E}\bigl[\max_{k=k_{\ell-1}+1,\dots,k_\ell}\|\theta_k\|^p\bigr]
\le\tau_{n_0,k_{\ell-1}}^p\,\mathbb{E}\bigl[\max_{k=k_{\ell-1}+1,\dots,k_\ell}\|\zeta_k\|^p\bigr]
\le\kappa_1\exp(4rp)\,\tau_{n_0,k_\ell}^p\Bigl(\mathbb{E}\bigl[\|\theta_{n_0}\|^p\bigr]+\frac{s_{n_0,k_\ell}^p+e_{n_0,k_\ell}^p}{\tau_{n_0,k_\ell}^p}\Bigr)
\le\kappa_2\bigl(\tau_{n_0,k_\ell}^p+s_{n_0,k_\ell}^p+e_{n_0,k_\ell}^p\bigr)
\le\kappa_2\kappa_0^p\,v_{k_\ell}^p.
\]
Hence, there exists a constant $\kappa_3\in(0,\infty)$ that does not depend on $k_0$ such that
\[
(33)\qquad \mathbb{E}\bigl[\sup_{k>k_0}k^{p\eta}\|\theta_k\|^p\bigr]\le\sum_{\ell\in\mathbb N}\mathbb{E}\bigl[\max_{k=k_{\ell-1}+1,\dots,k_\ell}k_\ell^{p\eta}\|\theta_k\|^p\bigr]\le\kappa_3\sum_{\ell\in\mathbb N}k_\ell^{p\eta}v_{k_\ell}^p.
\]
Using (28), the fact that $p(\eta_1-\eta)>1-\eta_2$ due to the choice of $\eta$, and the lower bound in (32), we obtain
\[
(34)\qquad \sum_{\ell\in\mathbb N}k_\ell^{p\eta}v_{k_\ell}^p\le c_1^p\sum_{\ell\in\mathbb N}k_\ell^{-p(\eta_1-\eta)}
\le c_1^p\cdot\begin{cases}\displaystyle\sum_{\ell\in\mathbb N}\bigl(\ell(1-\eta_2)/c_2+k_0^{1-\eta_2}\bigr)^{-\frac{p(\eta_1-\eta)}{1-\eta_2}},&\text{if }\eta_2<1,\\[2ex] \displaystyle\sum_{\ell\in\mathbb N}\bigl(k_0\exp(\ell/c_2)\bigr)^{-p(\eta_1-\eta)},&\text{if }\eta_2=1,\end{cases}
\le\kappa_4\,k_0^{-p(\eta_1-\eta)+1-\eta_2}
\]
with a constant $\kappa_4\in(0,\infty)$ that does not depend on $k_0$. Combining (33) with (34) yields the claimed maximum estimate in $p$-th mean. $\square$

In analogy to Corollary 2.5 we obtain the following result for polynomial step-sizes $\gamma_n$ and noise-levels $\sigma_n$.

Corollary 2.8 (Polynomial step-sizes and noise-levels). Assume that conditions (I)-(III), A.1 and A.2 are satisfied and choose $L$ according to A.1(i). Take $\gamma_1,\sigma_1\in(0,\infty)$, $r_1\in(0,1]$ and $r_2\in(-r_1/2,\infty)$ with
(a) $r_1<1$ or $\bigl(r_1=1$ and $\gamma_1>(1+2r_2)/(2L)\bigr)$,
(b) $r_1+2r_2>2(1-r_1)/p$,
and let for all $n\in\mathbb{N}$,
\[
\gamma_n=\gamma_1n^{-r_1},\qquad \sigma_n=\sigma_1n^{-r_2}.
\]
Assume further that $\limsup_{n\to\infty}n^{(r_1+2r_2)/2}\,\varepsilon_n<\infty$. Then for all $\eta\in\bigl(0,\,r_1/2+r_2-(1-r_1)/p\bigr)$ there exists a constant $\kappa\in(0,\infty)$ such that for all $k_0\in\mathbb{N}$,
\[
(35)\qquad \mathbb{E}\Bigl[\sup_{k\ge k_0}k^{p\eta}\|\theta_k-\theta^*\|^p\Bigr]^{1/p}\le\kappa\,k_0^{-\bigl(r_1/2+r_2-(1-r_1)/p-\eta\bigr)}.
\]

Proof.
We first verify Assumption A.4. By definition of $\gamma_n$ and $\sigma_n$ we have
\[
v_n=\sqrt{\gamma_1}\,\sigma_1\,n^{-(r_1+2r_2)/2}.
\]
Thus, A.4(i) is satisfied due to the assumption on the sequence $(\varepsilon_n)_{n\in\mathbb N}$, and the first part of A.4(ii) is satisfied due to Assumption (a); see the proof of Corollary 2.5. Observing Assumption (b) it is obvious that the second part of A.4(ii) and Assumption A.4(iii) are satisfied with
\[
\eta_1=(r_1+2r_2)/2,\qquad \eta_2=r_1,\qquad c_1=\sqrt{\gamma_1}\,\sigma_1,\qquad c_2=\gamma_1.
\]
Since conditions (I)-(III), A.1 and A.2 are part of the corollary, we may apply Theorem 2.7 to obtain the claimed error estimate. $\square$
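The weighted averages analysed in the next subsection (cf. (36) below) can be computed in a single pass over the iterates rather than by storing the whole trajectory. The following snippet is a small self-contained illustration with an arbitrary placeholder trajectory and weight sequence.

```python
# Running-update form of a weighted average: with bbar_n = b_1 + ... + b_n,
# thetabar_n = thetabar_{n-1} + (b_n / bbar_n) * (theta_n - thetabar_{n-1}).
thetas = [1.0 / k for k in range(1, 101)]   # placeholder iterates theta_k
b = list(range(1, 101))                     # illustrative weights b_k = k

# Direct evaluation of the weighted average.
direct = sum(bk * tk for bk, tk in zip(b, thetas)) / sum(b)

# Equivalent one-pass running update.
running, bbar_n = 0.0, 0.0
for bk, tk in zip(b, thetas):
    bbar_n += bk
    running += (bk / bbar_n) * (tk - running)

assert abs(direct - running) < 1e-12
```

The update follows from $\bar\theta_n=(\bar b_{n-1}\bar\theta_{n-1}+b_n\theta_n)/\bar b_n$, so both computations agree up to floating-point rounding.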
Estimates for the Polyak-Ruppert algorithm.
Now we turn to the analysis of Polyak-Ruppert averaging. For $n\in\mathbb{N}$ we let
\[
(36)\qquad \bar\theta_n=\frac{1}{\bar b_n}\sum_{k=1}^nb_k\theta_k,
\]
where $(b_k)_{k\in\mathbb N}$ is a fixed sequence of strictly positive reals and $\bar b_n=\sum_{k=1}^nb_k$. We estimate the speed of convergence of $(\bar\theta_n)_{n\in\mathbb N}$ to $\theta^*$ in $p$-th mean in terms of the sequence $(\bar v_n)_{n\in\mathbb N}$ given by
\[
\bar v_n=\frac{v_n}{\sqrt{n\gamma_n}}=\frac{\sigma_n}{\sqrt n}.
\]
To this end we will replace the set of assumptions A.1, A.2 and A.3 by the following set of assumptions B.1, B.2 and B.3. Note that B.2 coincides with A.2, while B.1 is stronger than A.1 and B.3 is stronger than A.3; see Remark 2.9.

B.1 (Assumptions on $f$ and $\theta^*$). There exist $L,L',L'',\lambda\in(0,\infty)$ and a matrix $H\in\mathbb{R}^{d\times d}$ such that for all $\theta\in\mathbb{R}^d$
(i) $\langle\theta-\theta^*,f(\theta)\rangle\le -L\|\theta-\theta^*\|^2$,
(ii) $\langle\theta-\theta^*,f(\theta)\rangle\le -L'\|f(\theta)\|^2$ and
(iii) $\|f(\theta)-H(\theta-\theta^*)\|\le L''\|\theta-\theta^*\|^{1+\lambda}$.

B.2 (Assumptions on $(R_n)_{n\in\mathbb N}$ and $(D_n)_{n\in\mathbb N}$). It holds
(i) $\sup_{n\in\mathbb N}\operatorname{esssup}\|R_n\|<\infty$ and
(ii) $\sup_{n\in\mathbb N}\mathbb{E}[\|D_n\|^p]<\infty$.

B.3 (Assumptions on $(\gamma_n)_{n\in\mathbb N}$, $(\varepsilon_n)_{n\in\mathbb N}$, $(\sigma_n)_{n\in\mathbb N}$ and $(b_n)_{n\in\mathbb N}$). We have $\sigma_n>0$ for all $n\in\mathbb{N}$. The sequence $(\gamma_n)_{n\in\mathbb N}$ is decreasing and the sequences $(n\gamma_n)_{n\in\mathbb N}$ and $(b_n\sigma_n)_{n\in\mathbb N}$ are increasing. Moreover, with $L$ and $\lambda$ according to B.1 there exist $\nu,c_0,c_1,c_2\in[0,\infty)$ with $c_2>(\nu+1)/L$ such that
(i) $\limsup_{n\to\infty}\varepsilon_n/\bar v_n<\infty$,
(ii) $\limsup_{n\to\infty}\dfrac{1}{\gamma_n}\Bigl(1-\dfrac{\bar v_n}{\bar v_{n-1}}\Bigr)<L$ and $v_n^{1+\lambda}\le c_0\bar v_n$ for all but finitely many $n\in\mathbb{N}$,
(iii) $\gamma_n\ge c_2/n$ for all but finitely many $n\in\mathbb{N}$,
(iv) $b_m\le c_1b_n\bigl(\tfrac{m}{n}\bigr)^\nu$ for all $m\ge n\ge 1$ and $(b_n)_{n\in\mathbb N}$ has at most polynomial growth.

Remark 2.9 (Discussion of assumptions B.1 and B.3). We first show that Assumption B.3 implies Assumption A.3.
Since $(n\gamma_n)_{n\in\mathbb N}$ is increasing we have $\varepsilon_n/v_n=\varepsilon_n/(\sqrt{n\gamma_n}\,\bar v_n)\le\varepsilon_n/(\sqrt{\gamma_1}\,\bar v_n)$ for every $n\in\mathbb{N}$, which proves that B.3 implies A.3(i). Furthermore,
\[
\frac{v_n-v_{n+1}}{v_n}=1-\frac{\sqrt{(n+1)\gamma_{n+1}}}{\sqrt{n\gamma_n}}\,\frac{\bar v_{n+1}}{\bar v_n}\le 1-\frac{\bar v_{n+1}}{\bar v_n}
\]
for every $n\in\mathbb{N}$, which proves that B.3 implies A.3(ii).

We add that, due to the presence of Assumption B.1(ii), it is sufficient to require that $f$ satisfies the inequality in B.1(iii) on some open ball around $\theta^*$. In fact, let $D\subset\mathbb{R}^d$ and $\delta\in(0,\infty)$ be such that $B(\theta^*,\delta)=\{\theta\in\mathbb{R}^d:\|\theta-\theta^*\|<\delta\}\subset D$. Let $c_2,c_3,c_3',\lambda\in(0,\infty)$ and $H\in\mathbb{R}^{d\times d}$, and consider the conditions
(ii) $\forall\,\theta\in D:\ \langle\theta-\theta^*,f(\theta)\rangle\le -c_2\|f(\theta)\|^2$,
(iii) $\forall\,\theta\in D:\ \|f(\theta)-H(\theta-\theta^*)\|\le c_3\|\theta-\theta^*\|^{1+\lambda}$,
(iii') $\forall\,\theta\in B(\theta^*,\delta):\ \|f(\theta)-H(\theta-\theta^*)\|\le c_3'\|\theta-\theta^*\|^{1+\lambda}$.
Then
\[
(37)\qquad f \text{ satisfies (ii) and (iii')}\;\Rightarrow\;\|H\|\le 1/c_2 \text{ and } f \text{ satisfies (iii) for every } c_3\ge\max\bigl(c_3',\,2/(c_2\delta^\lambda)\bigr),
\]
where $\|H\|$ denotes the induced matrix norm of $H$. For a proof of (37) we first note that (iii') implies that $H\theta=\lim_{0<\varepsilon\to 0}\varepsilon^{-1}f(\theta^*+\varepsilon\theta)$ for every $\theta\in\mathbb{R}^d$. Using (2) we conclude that $\|H\|\le 1/c_2$. For $\theta\in D\setminus B(\theta^*,\delta)$ we have $\|\theta-\theta^*\|\ge\delta$. Observing the latter fact and using (2) again we conclude
\[
\|f(\theta)-H(\theta-\theta^*)\|\le\|f(\theta)\|+\|H(\theta-\theta^*)\|\le\frac{2}{c_2}\,\|\theta-\theta^*\|\le\frac{2}{c_2\delta^\lambda}\,\|\theta-\theta^*\|^{1+\lambda}.
\]
As an immediate consequence of (37) with the choice $\delta\le 1$ we obtain: if $f$ satisfies B.1(ii),(iii), then $f$ satisfies B.1(iii) for every $\lambda'\in[0,\lambda]$ with $L''$ replaced by $\max\bigl(L'',2/(L'\delta^{\lambda'})\bigr)$.

Theorem 2.10 (Polyak-Ruppert approximation). Assume that conditions (I)-(III) and B.1-B.3 are satisfied.
Put $q=p/(1+\lambda)$ with $\lambda$ according to B.1. Then there exists $\kappa\in(0,\infty)$ such that the Polyak-Ruppert algorithm (36) satisfies for all $n\in\mathbb{N}$,
\[
\mathbb{E}\bigl[\|\bar\theta_n-\theta^*\|^q\bigr]^{1/q}\le\kappa\,\bar v_n.
\]

For the proof of Theorem 2.10 we follow the approach of the classical paper [20] by first comparing the dynamical system $(\theta_n)_{n\ge 0}$ with a linearised version $(y_n)_{n\ge 0}$ given by $y_0=\theta_0$ and
\[
y_n=y_{n-1}+\gamma_n\bigl(H(y_{n-1}-\theta^*)+\sigma_nD_n\bigr)\quad\text{for }n\in\mathbb{N}.
\]

Lemma 2.11.
Assume that conditions (I)-(III) and B.1-B.3 are satisfied. Put q = p/(1 + λ) with λ according to B.1. Then there exists κ ∈ (0, ∞) such that for all n ∈ N
E[‖θ_n − y_n‖^q]^{1/q} ≤ κ v̄_n.
Proof.
Without loss of generality we may assume that θ ∗ = 0.Using B.3(i),(ii) we see that there exist r ∈ (0 , L ), n ∈ N and κ ∈ (0 , ∞ ) such that for all n ≥ n we have(38) ε n ≤ κ ¯ v n and(39) ¯ v n − ¯ v n ≤ − γ n r . Since lim n →∞ γ n = 0 we may assume that γ n ≤ / (2 r ) for all n ≥ n .By Remark 2.9 conditions A.1-A.3 are satisfied and we may apply Theorem 2.4 to obtain theexistence of κ ∈ (0 , ∞ ) such that for all n ∈ N ,(40) E [ (cid:107) θ n (cid:107) p ] /p ≤ κ v n . Furthermore, estimate (8) in the proof of Proposition 2.2 is valid, i.e., there exists n ∈ N suchthat for all n ≥ n and all θ ∈ R d ,(41) (cid:107) θ + γ n f ( θ ) (cid:107) ≤ (1 − γ n ( r + L ) / (cid:107) θ (cid:107) . By Assumption B.1(iii) we have(42) Hθ = lim ε ↓ ε − f ( εθ )for every θ ∈ R d . Using (41) we may therefore conclude that for all n ≥ n and all θ ∈ R d ,(43) (cid:107) θ + γ n Hθ (cid:107) ≤ (1 − γ n ( r + L ) / (cid:107) θ (cid:107) . Let n = max( n , n ). For n ≥ n we put z n = θ n − y n and δ n = E (cid:2) (cid:107) z n (cid:107) q (cid:3) /q ¯ v n . Let n > n . Using (43), Assumptions B.1(iii), B.2(i) and (38) we see that there exists κ ∈ (0 , ∞ )such that (cid:107) z n (cid:107) = (cid:107) z n − + γ n (cid:0) Hz n − + f ( θ n − ) − Hθ n − + ε n R n (cid:1) (cid:107)≤ (cid:107) z n − + γ n Hz n − (cid:107) + γ n (cid:107) f ( θ n − ) − Hθ n − (cid:107) + γ n ε n (cid:107) R n (cid:107)≤ (1 − γ n ( r + L ) / (cid:107) z n − (cid:107) + γ n L (cid:48)(cid:48) (cid:107) θ n − (cid:107) λ + κ γ n ¯ v n a.s. , and employing (39), (40), B.3(ii) and the fact that γ n ≤ / (2 r ) we conclude that(44) δ n ≤ (1 − γ n ( r + L ) /
2) ¯ v n − ¯ v n δ n − + L (cid:48)(cid:48) γ n ¯ v n − ¯ v n E (cid:2) (cid:107) θ n − (cid:107) p (cid:3) /q ¯ v n − + κ γ n ≤ − γ n ( r + L ) / − γ n r δ n − + L (cid:48)(cid:48) κ λ c − γ n r γ n + κ γ n ≤ (1 − γ n ( L − r ) / δ n − + (cid:0) L (cid:48)(cid:48) κ λ c + κ (cid:1) γ n . Put κ = ( L − r ) / > κ = 2 L (cid:48)(cid:48) κ λ c + κ . By (44) we have for n ≥ n that δ n ≤ (1 − κ γ n ) δ n − + κ γ n or, equivalently, δ n κ − κ ≤ (1 − κ γ n ) (cid:16) δ n − κ − κ (cid:17) , which yields, δ n κ − κ ≤ (cid:0) δ n κ − κ (cid:1) exp (cid:16) − κ n (cid:88) k = n +1 γ k (cid:17) . Since (cid:80) k ∈ N γ k = ∞ , due to Assumption B.3(iii), we conclude thatlim sup n →∞ ( δ n /κ − /κ ) ≤ , which finishes the proof. (cid:3) Proof of Theorem 2.10.
Without loss of generality we may assume that θ ∗ = 0.For all n ∈ N we have(45) ¯ θ n = 1¯ b n n (cid:88) k =1 b k ( θ k − y k ) + 1¯ b n n (cid:88) k =1 b k y k . We separately analyze the two terms on the right hand side of (45).By Assumption B.3(iv) it follows that(46) ¯ b n ≥ b n c n (cid:88) k =1 (cid:16) kn (cid:17) ν ≥ κ nb n , where κ = ( c ( ν + 1)) − .Employing Lemma 2.11 as well as (46) and the fact that ( b k σ k ) k ∈ N is increasing we see thatthere exists κ ∈ (0 , ∞ ) such that for all n ∈ N ,(47) E (cid:104)(cid:13)(cid:13)(cid:13) b n n (cid:88) k =1 b k ( θ k − y k ) (cid:13)(cid:13)(cid:13) q (cid:105) /q ≤ κ n b n n (cid:88) k =1 b k ¯ v k = κ n b n n (cid:88) k =1 b k σ k √ k ≤ κ σ n n n (cid:88) k =1 √ k ≤ κ ¯ v n , where we used (cid:80) nk =1 / √ k ≤ √ n in the latter step.Next, put b = 0 and let Υ k,n = n (cid:89) (cid:96) = k +1 ( I d + γ (cid:96) H ) , where I d ∈ R d × d is the identity matrix, as well as¯Υ k,n = n (cid:88) m = k b m Υ k,m ULTILEVEL STOCHASTIC APPROXIMATION 15 for 0 ≤ k ≤ n . Then, for all n ∈ N , y n = Υ ,n θ + n (cid:88) k =1 γ k σ k Υ k,n D k and n (cid:88) k =1 b k y k = ¯Υ ,n θ + n (cid:88) k =1 γ k σ k ¯Υ k,n D k . Using the Burkholder-Davis-Gundy inequality we obtain that there exists a constant κ ∈ (0 , ∞ )such that for all n ∈ N , E (cid:20)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) k =1 b k y k (cid:13)(cid:13)(cid:13)(cid:13) q (cid:21) /q ≤ κ (cid:18) (cid:107) ¯Υ ,n (cid:107) (cid:107) θ (cid:107) + E (cid:104)(cid:16) n (cid:88) k =1 γ k σ k (cid:107) ¯Υ k,n (cid:107) (cid:107) D k (cid:107) (cid:17) q/ (cid:105) /q (cid:19) . 
(48)By Assumption B.2(ii) there exists a constant κ ∈ (0 , ∞ ) such that for all n ∈ N ,(49) E (cid:104)(cid:16) n (cid:88) k =1 γ k σ k (cid:107) ¯Υ k,n (cid:107) (cid:107) D k (cid:107) (cid:17) q/ (cid:105) /q ≤ (cid:16) n (cid:88) k =1 γ k σ k (cid:107) ¯Υ k,n (cid:107) E (cid:2) (cid:107) D k (cid:107) q (cid:3) /q (cid:17) / ≤ κ (cid:16) n (cid:88) k =1 γ k σ k (cid:107) ¯Υ k,n (cid:107) (cid:17) / . We proceed with estimating the norms (cid:107) ¯Υ k,n (cid:107) . Since c > ( ν + 1) /L we can fix r ∈ (( ν +1) /c , L ) and proceed as in the proof of Lemma 2.11 to conclude that there exists n ∈ N suchthat for all n ≥ n (cid:107) I d + γ n H (cid:107) ≤ − γ n ( r + L ) / , see (43). The latter fact and the assumption that the sequence ( nγ n ) n ∈ N is increasing jointlyimply that for n ≥ k ≥ n − (cid:107) Υ k,n (cid:107) ≤ n (cid:89) (cid:96) = k +1 (cid:16) − γ (cid:96) ( r + L )2 (cid:17) ≤ n (cid:89) (cid:96) = k +1 (cid:16) − γ k k ( r + L )2 (cid:96) (cid:17) ≤ exp (cid:16) − ( r + L ) γ k k n (cid:88) (cid:96) = k +1 (cid:96) (cid:17) ≤ (cid:16) k + 1 n + 1 (cid:17) ( r + L ) γ k k/ , where we used that 1 − z ≤ e − z for all z ∈ R . Employing the latter estimate as well as AssumptionB.3(iv) we get that for n ≥ k ≥ n − (cid:107) ¯Υ k,n (cid:107) ≤ n (cid:88) (cid:96) = k b (cid:96) (cid:107) Υ k,(cid:96) (cid:107) ≤ c b k n (cid:88) (cid:96) = k (cid:16) (cid:96)k (cid:17) ν (cid:16) k + 1 (cid:96) + 1 (cid:17) ( r + L ) γ k k/ ≤ c b k ν n (cid:88) (cid:96) = k (cid:16) k + 1 (cid:96) + 1 (cid:17) ( r + L ) γ k k/ − ν . Put β k = ( r + L ) γ k k/ − ν and note that by the choice of r and by B.3(iii) one has for k largeenough that β k = ( L − r ) γ k k/ rγ k k − ν > ( L − r ) γ k k/ . Choosing n large enough we therefore conclude that there exists κ ∈ (0 , ∞ ) such that for n ≥ k ≥ n − n (cid:88) (cid:96) = k (cid:16) k + 1 (cid:96) + 1 (cid:17) ( r + L ) γ k k/ − ν ≤ k + 1) β k (cid:90) ∞ k +1 t − β k dt = 1 + k + 1 β k − ≤ κ γ k . 
In combination with (50) we see that there exists κ ∈ (0 , ∞ ) such that for all n ≥ k ≥ n − (cid:107) ¯Υ k,n (cid:107) ≤ κ b k γ k . For 0 ≤ k < n − ≤ n we have¯Υ k,n = n − (cid:88) (cid:96) = k b (cid:96) Υ k,(cid:96) + ¯Υ n − ,n Υ k,n − and, observing (51), we may thus conclude that for 0 ≤ k < n − ≤ n ,(52) (cid:107) ¯Υ k,n (cid:107) ≤ (cid:16) max ≤ j ≤ (cid:96) ≤ n − (cid:107) Υ j,(cid:96) (cid:107) (cid:17) (cid:16) n − (cid:88) (cid:96) =1 b (cid:96) + κ b n − γ n − (cid:17) . Using (51) as well as (52) and the fact that the sequence ( b n σ n ) n ∈ N is increasing we concludethat there exists κ ∈ (0 , ∞ ) such that for all n ∈ N ,(53) (cid:18) n (cid:88) k =1 γ k σ k (cid:107) ¯Υ k,n (cid:107) (cid:19) / ≤ κ √ n b n σ n . Combining (48), (49) and (53), employing again that the sequence ( b n σ n ) n ∈ N is increasingand observing (46) we see that there exists a constant κ ∈ (0 , ∞ ) such that for all n ∈ N ,(54) E (cid:20)(cid:13)(cid:13)(cid:13)(cid:13) b n n (cid:88) k =1 b k y k (cid:13)(cid:13)(cid:13)(cid:13) q (cid:21) /q ≤ κ ¯ b n √ n b n σ n ≤ κ κ ¯ v n Combining (47) with (54) completes the proof of the theorem. (cid:3)
We consider the particular case of polynomial step-sizes γ_n, noise-levels σ_n and weights b_n.

Corollary 2.12 (Polynomial step-sizes, noise-levels and weights). Assume that conditions (I)-(III), B.1 and B.2 are satisfied and let q ∈ [p/(1 + λ), p) with λ according to B.1(iii). Take γ_0, σ_0, b_0 ∈ (0, ∞), r_1 ∈ (0, 1), r_2 ∈ (−r_1, ∞) and r_3 ∈ [r_2/2, ∞) with ((r_1 + r_2)/2)(p/q − 1) ≥ (1 + r_2)/2, and let for all n ∈ N,
γ_n = γ_0 n^{−r_1}, σ_n = σ_0 n^{−r_2/2}, b_n = b_0 n^{r_3}.
Assume further that lim sup_{n→∞} n^{(1+r_2)/2} ε_n < ∞. Then there exists a constant κ ∈ (0, ∞) such that for all n ∈ N,
E[‖θ̄_n − θ*‖^q]^{1/q} ≤ κ n^{−(r_2+1)/2}.

Proof. Conditions (I)-(III), B.1 and B.2 are part of the corollary. Further note that by Remark 2.9 condition B.1(iii) remains true when replacing λ by λ' = p/q − 1 ∈ (0, λ]. We verify that condition B.3 holds as well with λ' in place of λ. Then the corollary is a consequence of Theorem 2.10 and the fact that v̄_n = σ_n/√n = σ_0 n^{−(r_2+1)/2}.
Since r_1 ∈ (0, 1) it is clear that (γ_n)_{n∈N} is decreasing and (nγ_n)_{n∈N} is increasing. Furthermore, (b_n σ_n)_{n∈N} = (σ_0 b_0 n^{r_3 − r_2/2})_{n∈N} is increasing since r_3 ≥ r_2/2. B.3(i) is satisfied due to the assumption on (ε_n)_{n∈N}. Condition B.3(ii) holds with a positive constant c since v_n^{λ'} = (√γ_n σ_n)^{λ'} = (√γ_0 σ_0)^{p/q−1} n^{−((r_1+r_2)/2)(p/q−1)} and ((r_1 + r_2)/2)(p/q − 1) ≥ (r_2 + 1)/2 by assumption. Condition B.3(iii) is satisfied for any c ∈ (0, ∞) since r_1 < 1, and thus condition B.3(iv) is satisfied with ν = r_3. □

Multilevel stochastic approximation
Throughout this section we fix p ∈ [2, ∞), a probability space (Ω, F, P), a scalar product ⟨·,·⟩ on R^d with induced norm ‖·‖, a non-empty set U equipped with some σ-field, a random variable U : Ω → U and a product-measurable function F : R^d × U → R^d, such that F(θ, U) is integrable for every θ ∈ R^d. We consider the function f : R^d → R^d given by
(55) f(θ) = E[F(θ, U)]
and we assume that f has a unique zero θ* ∈ R^d.
Our goal is to compute θ* by means of stochastic approximation algorithms based on the multilevel Monte Carlo approach. To this end we suppose that we are given a hierarchical scheme F_1, F_2, … : R^d × U → R^d of suitable product-measurable approximations to F, such that F_k(θ, U) is integrable and F_k(θ, U) − F_{k−1}(θ, U) can be simulated for all θ ∈ R^d and k ∈ N, where F_0 = 0.
To each random vector F_k(θ, U) − F_{k−1}(θ, U) we assign a positive number C_k ∈ (0, ∞), which depends only on the level k and may represent a deterministic worst case upper bound of the average computational cost or average runtime needed to compute a single simulation of F_k(θ, U) − F_{k−1}(θ, U). As announced in the introduction we impose assumptions on the approximations F_k and the cost bounds C_k that are similar in spirit to the classical multilevel Monte Carlo setting, see [7].
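To make the hierarchical setting concrete, the following sketch realises a toy hierarchy: F_k(θ, U) is an Euler-type discretisation with M^k steps, and the increment F_k(θ, U) − F_{k−1}(θ, U) is simulated from the same Brownian increments by summing fine increments into coarse ones. The model, the refinement factor and all function names are our own illustrative assumptions, not part of the paper.

```python
import numpy as np

M, T = 4, 1.0  # refinement factor and time horizon (illustrative choices)

def brownian_increments(rng, k):
    """Simulate the part of U needed on level k: M**k Brownian increments on [0, T]."""
    n = M ** k
    return rng.normal(0.0, np.sqrt(T / n), size=n)

def F_level(theta, dW):
    """Level approximation F_k(theta, U): Euler scheme for dS = theta * S dW, value S_T.
    Purely illustrative stand-in for the abstract F_k."""
    S = 1.0
    for dw in dW:
        S *= 1.0 + theta * dw
    return S

def level_increment(theta, rng, k):
    """Simulate F_k(theta, U) - F_{k-1}(theta, U) with coupled randomness:
    the coarse path reuses the fine increments, summed in groups of M (F_0 = 0)."""
    dW_fine = brownian_increments(rng, k)
    fine = F_level(theta, dW_fine)
    if k == 1:
        return fine
    dW_coarse = dW_fine.reshape(-1, M).sum(axis=1)
    return fine - F_level(theta, dW_coarse)

d = level_increment(0.2, np.random.default_rng(0), 3)
```

Because fine and coarse paths share one Brownian path, the level increments have small variance, which is exactly what Assumption C.1(i) below quantifies.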
C.1 (Assumptions on (F_k)_{k∈N} and (C_k)_{k∈N}). There exist measurable functions Γ_1, Γ_2 : R^d → (0, ∞) and constants M ∈ (1, ∞) and K, α, β ∈ (0, ∞) with α ≥ β such that for all k ∈ N and all θ ∈ R^d
(i) E[‖F_k(θ, U) − F_{k−1}(θ, U) − E[F_k(θ, U) − F_{k−1}(θ, U)]‖^p]^{1/p} ≤ Γ_1(θ) M^{−kβ},
(ii) ‖E[F_k(θ, U) − F(θ, U)]‖ ≤ Γ_2(θ) M^{−kα}, and
(iii) C_k ≤ K M^k.

We combine the Robbins-Monro algorithm with the classical multilevel approach taken in [7]. The proposed method uses in each Robbins-Monro step a multilevel estimate with a complexity that is adapted to the actual state of the system and increases in time.
The algorithm is specified by the parameters Γ_1, Γ_2, M, α, β from Assumption C.1, an initial vector θ_0 ∈ R^d,
(i) a sequence of step-sizes (γ_n)_{n∈N} ⊂ (0, ∞) tending to zero,
(ii) a sequence of bias-levels (ε_n)_{n∈N} ⊂ (0, ∞), and
(iii) a sequence of noise-levels (σ_n)_{n∈N} ⊂ (0, ∞).
The maximal level m_n(θ) and the number of iterations N_{n,k}(θ) on level k ∈ {1, …, m_n(θ)} that are used by the multilevel estimator in the n-th Robbins-Monro step depend on θ ∈ R^d and are determined in the following way. We take
(56) m_n(θ) = 1 ∨ ⌈(1/α) log_M(Γ_2(θ)/ε_n)⌉ ∈ N,
i.e. m_n(θ) is the smallest m ∈ N such that Γ_2(θ) M^{−αm} ≤ ε_n holds true for the bias bound in Assumption C.1(ii). Furthermore,
(57) N_{n,k}(θ) = ⌈κ_n(θ) M^{−k(β+1/2)}⌉,
where
(58) κ_n(θ) = (Γ_1(θ)/σ_n)^2 M^{m_n(θ)(1/2−β)_+} if β ≠ 1/2, and κ_n(θ) = (Γ_1(θ)/σ_n)^2 m_n(θ) if β = 1/2.
Take a sequence (U_{n,k,ℓ})_{n,k,ℓ∈N} of independent copies of U. We use
(59) Z_n(θ) = Σ_{k=1}^{m_n(θ)} (1/N_{n,k}(θ)) Σ_{ℓ=1}^{N_{n,k}(θ)} (F_k(θ, U_{n,k,ℓ}) − F_{k−1}(θ, U_{n,k,ℓ}))
as a multilevel approximation of f(θ) in the n-th Robbins-Monro step, and we study the sequence of Robbins-Monro approximations (θ_n)_{n∈N} given by
(60) θ_n = θ_{n−1} + γ_n Z_n(θ_{n−1}).
We measure the computational cost of θ_n by the quantity
(61) cost_n = E[Σ_{j=1}^n Σ_{k=1}^{m_j(θ_{j−1})} N_{j,k}(θ_{j−1}) C_k].
That means we take the mean computational cost for simulating the random vectors F_k(θ, U_{j,k,ℓ}) − F_{k−1}(θ, U_{j,k,ℓ}) for the first n iterations into account and we ignore the cost of the involved arithmetical operations. Note, however, that the number of arithmetical operations needed to compute θ_n is essentially proportional to Σ_{j=1}^n Σ_{k=1}^{m_j(θ_{j−1})} N_{j,k}(θ_{j−1}), and the average of the latter quantity is captured by cost_n under the weak assumption that inf_k C_k > 0. Note that cost_n depends on the parameters Γ_1, Γ_2, M, α, β, θ_0, (γ_k)_{k=1,…,n}, (ε_k)_{k=1,…,n} and (σ_k)_{k=1,…,n}, which determine the algorithm (θ_k)_{k∈N} up to time n. For ease of notation we do not explicitly indicate this dependence in the notation cost_n.
To obtain upper bounds of cost_n we need the following additional assumption C.2 on the functions Γ_1, Γ_2, which implies that both the variance estimate in C.1(i) and the bias estimate in C.1(ii) are at most of polynomial growth in θ ∈ R^d with exponents related to the parameters α, β and p.

C.2 (Assumption on Γ_1, Γ_2). With α, β, Γ_1, Γ_2 according to Assumption C.1 there exist K_0 ∈ (0, ∞) and β_0 ∈ [0, min(β, 1/2)) if β ≠ 1/2, β_0 ∈ [0, 1/2) if β = 1/2, such that for all θ ∈ R^d
(62) Γ_1(θ) ≤ K_0 (1 + ‖θ‖)^{β_0 p} and Γ_2(θ) ≤ K_0 (1 + ‖θ‖)^{αp}.

We are now in the position to state the central complexity theorem on the multilevel Robbins-Monro algorithm.
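Before stating it, here is a minimal sketch of one step of the method: the level choice (56), the replication numbers (57)-(58) and the estimator (59) feeding the update (60). The simulator `diff` for the increments F_k − F_{k−1} and all parameter values are placeholders, and the formulas follow our reading of the reconstructed (56)-(58), so this is a hedged illustration rather than a reference implementation.

```python
import math

def max_level(theta, eps_n, Gamma2, M, alpha):
    """(56): smallest m >= 1 with Gamma2(theta) * M**(-alpha*m) <= eps_n."""
    return max(1, math.ceil(math.log(Gamma2(theta) / eps_n, M) / alpha))

def replications(theta, k, m, sigma_n, Gamma1, M, beta):
    """(57)-(58): N_{n,k} = ceil(kappa_n * M**(-k*(beta + 1/2)))."""
    if beta == 0.5:
        kappa = (Gamma1(theta) / sigma_n) ** 2 * m
    else:
        kappa = (Gamma1(theta) / sigma_n) ** 2 * M ** (m * max(0.5 - beta, 0.0))
    return math.ceil(kappa * M ** (-k * (beta + 0.5)))

def ml_estimate(theta, eps_n, sigma_n, diff, Gamma1, Gamma2, M, alpha, beta, rng):
    """(59): multilevel estimate of f(theta) from level increments diff(theta, k, rng)."""
    m = max_level(theta, eps_n, Gamma2, M, alpha)
    z = 0.0
    for k in range(1, m + 1):
        N = replications(theta, k, m, sigma_n, Gamma1, M, beta)
        z += sum(diff(theta, k, rng) for _ in range(N)) / N
    return z

def rm_step(theta, n, gamma0, eps0, sigma0, rho, r, **kw):
    """(60) with the polynomial schedules of Theorem 3.1 (our reading)."""
    gamma_n = gamma0 / n
    return theta + gamma_n * ml_estimate(theta, eps0 * n ** (-rho),
                                         sigma0 * n ** (-r / 2), **kw)
```

With an exact, noise-free toy simulator `diff` (e.g. one returning −θ on level 1 and 0 on higher levels) the recursion contracts to the zero of f(θ) = −θ; in applications `diff` is the coupled level simulation.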
Theorem 3.1 (Multilevel Robbins-Monro approximation). Suppose that Assumption A.1 is satisfied for the function f given by (55) and that Assumptions C.1 and C.2 are satisfied. Take L ∈ (0, ∞) according to A.1, take Γ_1, Γ_2, M, α, β according to C.1 and C.2 and let θ_0 ∈ R^d. Take r ∈ (−1, ∞), γ_0 ∈ ((1 + r)/(2L), ∞), σ_0, ε_0 ∈ (0, ∞), and let ρ = (1 + r)/2 and for all n ∈ N,
γ_n = γ_0/n, σ_n = σ_0 n^{−r/2}, ε_n = ε_0 n^{−ρ}.
Then for all η ∈ (0, ρ) there exists κ ∈ (0, ∞) such that for all n ∈ N,
E[sup_{k≥n} k^{ηp} ‖θ_k − θ*‖^p]^{1/p} ≤ κ n^{−(ρ−η)}.
In particular, for all δ ∈ (0, ∞) we have lim_{n→∞} n^{ρ−δ} ‖θ_n − θ*‖ = 0 almost surely. If additionally α > β ∧ 1/2 and r > (β ∧ 1/2)/(α − β ∧ 1/2) then there exists κ' ∈ (0, ∞) such that for all n ∈ N,
cost_n ≤ κ' n^{2ρ} if β > 1/2, cost_n ≤ κ' n^{2ρ} (ln(n + 1))^2 if β = 1/2, and cost_n ≤ κ' n^{2ρ(1 + (1/2−β)_+/α)} if β < 1/2.

The implementation of the multilevel Robbins-Monro approximation from Theorem 3.1 requires the knowledge of a positive lower bound for the parameter L from Assumption A.1. This difficulty is overcome by applying the Polyak-Ruppert averaging methodology. That means we consider the approximations
(63) θ̄_n = (1/b̄_n) Σ_{k=1}^n b_k θ_k,
where (θ_n)_{n∈N} is the multilevel Robbins-Monro scheme specified by (60), (b_k)_{k∈N} is a sequence of positive reals and b̄_n = Σ_{k=1}^n b_k for n ∈ N, see Section 2.
Note that the cost to compute θ̄_n differs from the cost to compute θ_n at most by a deterministic factor, which does not depend on n. Therefore we again measure the computational cost for the computation of θ̄_n by the quantity cost_n given by (61).
We state the second complexity theorem, which concerns Polyak-Ruppert averaging.

Theorem 3.2 (Multilevel Polyak-Ruppert approximation). Suppose that Assumption B.1 is satisfied for the function f given by (55) and that Assumptions C.1 and C.2 are satisfied. Take λ ∈ (0, ∞) according to B.1, take Γ_1, Γ_2, M, α, β according to C.1 and C.2 and let θ_0 ∈ R^d. Let q ∈ [p/(1 + λ), p). Take γ_0, σ_0, ε_0, b_0 ∈ (0, ∞), r_1 ∈ (0, 1), r_2 ∈ (−r_1, ∞) and r_3 ∈ [r_2/2, ∞) with ((r_1 + r_2)/2)(p/q − 1) ≥ (1 + r_2)/2, and let ρ = (1 + r_2)/2 and for all n ∈ N,
γ_n = γ_0 n^{−r_1}, ε_n = ε_0 n^{−ρ}, σ_n = σ_0 n^{−r_2/2}, b_n = b_0 n^{r_3}.
Then there exists κ ∈ (0, ∞) such that for all n ∈ N
E[‖θ̄_n − θ*‖^q]^{1/q} ≤ κ n^{−ρ}.
If additionally α > β ∧ 1/2 and r_2 ≥ (β ∧ 1/2)/(α − β ∧ 1/2) then there exists κ' ∈ (0, ∞) such that for all n ∈ N
cost_n ≤ κ' n^{2ρ} if β > 1/2, cost_n ≤ κ' n^{2ρ} (ln(n + 1))^2 if β = 1/2, and cost_n ≤ κ' n^{2ρ(1 + (1/2−β)_+/α)} if β < 1/2.

Remark 3.3.
Assume the setting of Theorem 3.1 or 3.2 and let e_n = E[‖θ_n − θ*‖^p]^{1/p} or e_n = E[‖θ̄_n − θ*‖^q]^{1/q}, respectively. Then there exists κ ∈ (0, ∞) such that for every n ∈ N,
(64) cost_n ≤ κ e_n^{−2} if β > 1/2, cost_n ≤ κ e_n^{−2} (ln(1 + e_n^{−1}))^2 if β = 1/2, and cost_n ≤ κ e_n^{−2−(1−2β)/α} if β < 1/2.
Note that these bounds for the computational cost in terms of the error coincide with the respective bounds for the multilevel computation of a single expectation presented in [7].
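As a small helper illustrating (64) (in our reconstructed reading of the exponents), the asymptotic cost grows like a power of the inverse error; the following sketch returns that power, with the β = 1/2 boundary case carrying an additional squared-logarithmic factor not reflected in the exponent.

```python
def cost_exponent(alpha, beta):
    """Exponent g such that cost_n is of order e_n**(-g), up to log factors:
    g = 2 for beta >= 1/2, and g = 2 + (1 - 2*beta)/alpha for beta < 1/2
    (our reading of (64); beta = 1/2 has an extra (ln(1 + 1/e_n))**2 factor)."""
    if beta >= 0.5:
        return 2.0
    return 2.0 + (1.0 - 2.0 * beta) / alpha

assert cost_exponent(1.0, 1.0) == 2.0   # canonical Monte Carlo-type rate
assert cost_exponent(1.0, 0.25) == 2.5  # slow variance decay costs extra
```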
Remark 3.4.
The multilevel stochastic approximation algorithms analysed in Theorems 3.1 and 3.2 are based on evaluations of the increments F_k − F_{k−1}. Consider, more generally, a sequence of measurable mappings P_k : R^d × U → R^d, k ∈ N, such that for all k ∈ N,
E[P_k(θ, U)] = E[F_k(θ, U) − F_{k−1}(θ, U)]
and C_k is a worst case cost bound for simulating P_k(θ, U). Then Theorems 3.1 and 3.2 are still valid for the algorithm obtained by using P_k as a substitute for the increment F_k − F_{k−1} in (59) if Assumption C.1(i) is satisfied with P_k in place of F_k − F_{k−1}.
The proofs of Theorems 3.1 and 3.2 are based on the following proposition, which shows that under Assumptions C.1(i),(ii) the scheme (60) can be represented as a Robbins-Monro scheme of the general form (1) studied in Section 2. It further provides an estimate of the computational cost (61) based on Assumptions C.1(iii) and C.2 only.

Proposition 3.5. (i)
Suppose that Assumptions C.1(i),(ii) are satisfied. Let F_n denote the σ-field generated by the variables U_{m,k,ℓ} with m, k, ℓ ∈ N and m ≤ n, and let F_0 denote the trivial σ-field. The scheme (θ_n)_{n∈N} given by (60) satisfies
θ_n = θ_{n−1} + γ_n (f(θ_{n−1}) + ε_n R_n + σ_n D_n)
for every n ∈ N, where (R_n)_{n∈N} is a previsible process with respect to the filtration (F_n)_{n∈N}, (D_n)_{n∈N} is a sequence of martingale differences with respect to (F_n)_{n∈N}, and (R_n)_{n∈N} and (D_n)_{n∈N} satisfy Assumption A.2. (ii) Suppose that Assumptions C.1(iii) and C.2 are satisfied. Then there exists a constant κ ∈ (0, ∞) such that for all n ∈ N the computational cost (61) of θ_n given by (60) satisfies
(65) cost_n ≤ κ max_{k=0,…,n−1} E[(1 + ‖θ_k‖)^p] Σ_{j=1}^n (ε_j^{−1/α} + σ_j^{−2} ε_j^{−(1−2β)_+/α}) if β ≠ 1/2, and cost_n ≤ κ max_{k=0,…,n−1} E[(1 + ‖θ_k‖)^p] Σ_{j=1}^n (ε_j^{−1/α} + σ_j^{−2} (log_M(1/ε_j))^2) if β = 1/2.
Proof.
We first prove statement (i) of the proposition. Put f n ( θ ) = E [ F n ( θ, U )] and let P n,k,(cid:96) ( θ ) = F k ( θ, U n,k,(cid:96) ) − F k − ( θ, U n,k,(cid:96) )for n, k, (cid:96) ∈ N and θ ∈ R d . By Assumptions C.1(i),(ii) we have(66) E [ (cid:107) P n,k,(cid:96) ( θ ) − E [ P n,k,(cid:96) ( θ )] (cid:107) p ] /p ≤ Γ ( θ ) M − kβ and (cid:107) f k ( θ ) − f ( θ ) (cid:107) ≤ Γ ( θ ) M − kα for all n, k, (cid:96) ∈ N and θ ∈ R d .By (66) and the definition (56) of m n ( θ ) we get for all n ∈ N and θ ∈ R d that (cid:107) E [ Z n ( θ )] − f ( θ ) (cid:107) = (cid:107) E [ F m n ( θ ) ( θ, U )] − f ( θ ) (cid:107) ≤ Γ ( θ ) M − m n ( θ ) α ≤ ε n . (67)Furthermore, by the Burkholder-Davis-Gundy inequality, the triangle inequality on the L p/ -space, (66) and the definition (57) of N n,k ( θ ) there exists c ∈ (0 , ∞ ), which only depends on p ,such that E [ (cid:107) Z n ( θ ) − E [ Z n ( θ )] (cid:107) p ] /p = E (cid:104)(cid:13)(cid:13)(cid:13) m n ( θ ) (cid:88) k =1 N n,k ( θ ) (cid:88) (cid:96) =1 N n,k ( θ ) (cid:0) P n,k,(cid:96) ( θ ) − E [ P n,k,(cid:96) ( θ )] (cid:1)(cid:13)(cid:13)(cid:13) p (cid:105) /p ≤ c E (cid:104)(cid:16) m n ( θ ) (cid:88) k =1 N n,k ( θ ) (cid:88) (cid:96) =1 N n,k ( θ ) (cid:13)(cid:13) P n,k,(cid:96) ( θ ) − E [ P n,k,(cid:96) ( θ )] (cid:13)(cid:13) (cid:17) p/ (cid:105) /p ≤ c m n ( θ ) (cid:88) k =1 N n,k ( θ ) N n,k ( θ ) (cid:88) (cid:96) =1 E (cid:2)(cid:13)(cid:13) P n,k,(cid:96) ( θ ) − E [ P n,k,(cid:96) ( θ )] (cid:13)(cid:13) p (cid:3) /p ≤ c Γ ( θ ) m n ( θ ) (cid:88) k =1 N n,k ( θ ) M − βk ≤ c Γ ( θ ) κ n ( θ ) m n ( θ ) (cid:88) k =1 M k (1 / − β ) . Recalling the definition of κ n ( θ ), see (58), we conclude that there exists c ∈ (0 , ∞ ) such thatfor all n ∈ N and θ ∈ R d (68) E [ (cid:107) Z n ( θ ) − E [ Z n ( θ )] (cid:107) p ] /p ≤ c σ n . 
With
R_n := (1/ε_n) E[Z_n(θ_{n−1}) − f(θ_{n−1}) | F_{n−1}]
and
D_n := (1/σ_n) (Z_n(θ_{n−1}) − f(θ_{n−1}) − E[Z_n(θ_{n−1}) − f(θ_{n−1}) | F_{n−1}])
we obtain that θ_n = θ_{n−1} + γ_n (f(θ_{n−1}) + ε_n R_n + σ_n D_n). We verify Assumption A.2. The process (R_n)_{n∈N} is predictable and using the independence of (U_{n,k,ℓ})_{k,ℓ∈N} and F_{n−1} we conclude with (67) that sup_{n∈N} ‖R_n‖ ≤
1. By the latter independence it further follows that( D n ) n ∈ N is a sequence of martingale differences, which satisfies sup n ∈ N E [ (cid:107) D n (cid:107) p ] ≤ c p/ as aconsequence of (68). This completes the proof of statement (i). We turn to the proof of statement (ii). Let j ∈ N and θ ∈ R d . Using Assumption C.1(iii), weconclude that there exists c ∈ (0 , ∞ ), which only depends on K , M and β such that(69) m j ( θ ) (cid:88) k =1 N j,k ( θ ) C k ≤ m j ( θ ) (cid:88) k =1 (cid:0) κ j ( θ ) M − k ( β +1 / (cid:1) KM k ≤ K − M − M m j ( θ ) + Kκ j ( θ ) (cid:40) M mj ( θ )(1 / − β )+ − M −| / − β | , if β (cid:54) = 1 / ,m j ( θ ) , if β = 1 / ≤ c M m j ( θ ) + c Γ ( θ ) σ j (cid:40) M m j ( θ )(1 − β ) + , if β (cid:54) = 1 / , ( m j ( θ )) , if β = 1 / . Furthermore, (56) yields that(70) m j ( θ ) ≤ α − (log M (Γ ( θ )) + log M ( ε − j )) + 1 and M m j ( θ ) ≤ M ε − /αj (Γ ( θ )) /α . Combining (69) with (70) and employing Assumption C.2 we see that there exists c ∈ (0 , ∞ ),which only depends on K , K , M , β and α , such that(71) m j ( θ ) (cid:88) k =1 N j,k ( θ ) C k ≤ c ε − /αj (1 + (cid:107) θ (cid:107) ) p + c σ − j (1 + (cid:107) θ (cid:107) ) β p (cid:40) ε − (1 − β ) + /αj (1 + (cid:107) θ (cid:107) ) (1 − β ) + p , if β (cid:54) = 1 / , (log M ( ε − j (1 + (cid:107) θ (cid:107) ) αp )) , if β = 1 / . Suppose that β (cid:54) = 1 /
2. Then (71) implies
Σ_{k=1}^{m_j(θ)} N_{j,k}(θ) C_k ≤ c_2 (1 + ‖θ‖)^p (ε_j^{−1/α} + σ_j^{−2} ε_j^{−(1−2β)_+/α}),
which finishes the proof for the case β ≠ 1/2. In the case β = 1/2 we use that, since β_0 < 1/2, there exists c_3 ∈ (0, ∞), which does not depend on θ, such that
(log_M(1 + ‖θ‖^{αp}))^2 ≤ c_3 (1 + ‖θ‖)^{p(1−2β_0)}.
One completes the proof of statement (ii) by combining the latter estimate with (71). □
We turn to the proof of Theorem 3.1.
Proof of Theorem 3.1.
The error estimate follows by Corollary 2.8 since Assumption A.1 is part of the theorem and Assumptions (I)-(III) and A.2 are satisfied by Proposition 3.5(i). It remains to prove the cost estimate. The error estimate implies that sup_{n∈N} E[(1 + ‖θ_n‖)^p] < ∞. Employing Proposition 3.5(ii) we thus see that there exists c_1 ∈ (0, ∞) such that for every n ∈ N
(72) cost_n ≤ c_1 Σ_{j=1}^n (j^{ρ/α} + j^{2ρ(1+(1/2−β)_+/α)−1}) if β ≠ 1/2, and cost_n ≤ c_1 Σ_{j=1}^n (j^{ρ/α} + j^{2ρ−1} (log_M(j + 1))^2) if β = 1/2.
Hence there exists c_2 ∈ (0, ∞) such that for every n ∈ N
(73) cost_n ≤ c_2 (n^{ρ/α+1} + n^{2ρ(1+(1/2−β)_+/α)}) if β ≠ 1/2, and cost_n ≤ c_2 (n^{ρ/α+1} + n^{2ρ} (log_M(n + 1))^2) if β = 1/2.
If β ≥ 1/2 we use that α > 1/2 and r > 1/(2α − 1), which implies that ρ/α < r = 2ρ − 1 and therefore
n^{ρ/α+1} ≤ n^{2ρ} = n^{2ρ(1+(1/2−β)_+/α)}.
If β < 1/2 we use that α > β and r > β/(α − β), which implies that 2ρ(α − β)/α > 1 and therefore
n^{ρ/α+1} ≤ n^{ρ/α + 2ρ(α−β)/α} = n^{2ρ(1+(1/2−β)_+/α)}.
This completes the proof. □
We proceed with the proof of Theorem 3.2.
Proof of Theorem 3.2.
The error estimate follows with Corollary 2.12 since Assumption B.1 ispart of the theorem and Assumptions (I)-(III) and B.2 hold by Proposition 3.5(i). The costestimate in the theorem is proved in the same way as the cost estimate in Theorem 3.1. Oneonly observes that sup n ∈ N E [(1 + (cid:107) θ n (cid:107) ) p ] < ∞ is valid since the assumptions in Corollary 2.12are stronger than the assumptions in Corollary 2.5. (cid:3) General convex closed domains
In this section we extend the results of Sections 2 and 3 to convex domains. In the following D denotes a convex and closed subset of R^d and f : D → R^d is a function with a unique zero θ* ∈ D. We start with the Robbins-Monro scheme.
Let pr_D : R^d → D denote the orthogonal projection on D with respect to the given inner product ⟨·,·⟩ on R^d and define the dynamical system (θ_n)_{n∈N} by the recursion
(74) θ_n = pr_D(θ_{n−1} + γ_n (f(θ_{n−1}) + ε_n R_n + σ_n D_n))
in place of (1), where θ_0 ∈ D is a deterministic starting value in D. Then the following fact follows by a straightforward modification in the proofs of Proposition 2.2 and Theorem 2.7 using the contraction property of pr_D.

Extension 4.1.
Proposition 2.2, Theorem 2.4, Corollary 2.5, Theorem 2.7, Corollary 2.8 andthe statement on the system (1) in Remark 2.6 remain valid for the system (74) in place of (1) if R d is replaced by D in Assumption A.1. Analogously, we extend Theorem 3.1 in Section 3 on the multilevel Robbins-Monro approxi-mation to the case where the mappings
F, F , F , . . . are defined on D × U with D being a closedand convex subset of R d and f : D → R d , θ (cid:55)→ E [ F ( θ, U )]has a unique zero θ ∗ ∈ D . In this case we proceed analogously to Extension 4.1 and employ theprojected multilevel Robbins-Monro scheme(75) θ n = pr D (cid:0) θ n − + γ n Z n ( θ n − ) (cid:1) with θ ∈ D and Z n given by (59), in place of the multilevel scheme (60).Note that if pr D can be evaluated on R d with constant cost then, up to a constant dependingon D only, the computational cost of the projected approximation θ n is still bounded by thequantity cost n given by (72) since the computation of θ n requires n evaluations of pr D andcost n ≥ C n .Employing Proposition 3.5 as well as Extension 4.1 one easily gets the following result. Extension 4.2.
Theorem 3.1 remains valid for the scheme (75) in place of (60) if R d is replacedby D in Assumptions A.1, C.1 and C.2. Next we consider the Polyak-Ruppert scheme. In this case we additionally suppose that D contains an open ball B ( θ ∗ , δ ) = { θ ∈ R d : (cid:107) θ − θ ∗ (cid:107) < δ } around the unique zero θ ∗ ∈ D and weextend the function f on R d : for c ∈ (0 , ∞ ) define(76) f c : R d → R d , x (cid:55)→ − c ( x − pr D ( x )) + f (pr D ( x )) . The following lemma shows that property B.1 is preserved for appropriately chosen c > Lemma 4.3.
Let δ > and suppose that B ( θ ∗ , δ ) ⊂ D and that f : D → R d satisfies B.1(i) toB.1(iii) on D . Take L, L (cid:48) , L (cid:48)(cid:48) , λ, H according to B.1, let c ∈ (1 / (2 L (cid:48) ) , ∞ ) and put r c = 1 − Lc (cid:0) − cL (cid:48) ) ∈ [0 , . Then f c satisfies B.1(i) to B.1(iii) on R d with (77) L c = c (cid:0) − √ r c (cid:1) , L (cid:48) c = L c L (cid:48) ) + 2 c , L (cid:48)(cid:48) c = (cid:0) c + L (cid:48) (cid:1) δ λ + L (cid:48)(cid:48) in place of L , L (cid:48) and L (cid:48)(cid:48) , respectively.Proof. Using (3) with c = L , c = L (cid:48) and γ = 1 /c it follows that r c ∈ [0 , D we have for any θ ∈ R d that (cid:104) θ − θ ∗ , f c ( θ ) (cid:105) = (cid:104) θ − θ ∗ , − c ( θ − pr D ( θ )) + f (pr D ( θ )) (cid:105) = − c (cid:107) θ − θ ∗ (cid:107) + (cid:104) θ − θ ∗ , − c ( θ ∗ − pr D ( θ )) + f (pr D ( θ )) (cid:105)≤ − c (cid:107) θ − θ ∗ (cid:107) + c (cid:107) θ − θ ∗ (cid:107) (cid:107) pr D ( θ ) − θ ∗ + c f (pr D ( θ )) (cid:107)≤ − c (cid:107) θ − θ ∗ (cid:107) + c √ r c (cid:107) θ − θ ∗ (cid:107) (cid:107) pr D ( θ ) − θ ∗ (cid:107)≤ − c (1 − √ r c ) (cid:107) θ − θ ∗ (cid:107) , which shows that f c satisfies B.1(i) on R d with L c in place of L .Using the latter estimate, (2) with c (cid:48) = 1 /L (cid:48) and the Lipschitz continuity of pr D we get forany θ ∈ R d that (cid:107) f c ( θ ) (cid:107) + L (cid:48) c (cid:104) θ − θ ∗ , f c ( θ ) (cid:105) ≤ (cid:107) − c ( θ − pr D ( θ )) + f (pr D ( θ )) (cid:107) − L c L (cid:48) c (cid:107) θ − θ ∗ (cid:107) ≤ c (cid:107) θ − pr D ( θ ) (cid:107) + 2 (cid:107) f (pr D ( θ )) (cid:107) − L c ¯ L (cid:48) c (cid:107) θ − θ ∗ (cid:107) ≤ c (cid:107) θ − pr D ( θ ) (cid:107) + L (cid:48) ) (cid:107) pr D ( θ ) − θ ∗ (cid:107) − L c L (cid:48) c (cid:107) θ − θ ∗ (cid:107) ≤ (2 c + L (cid:48) ) − L c L (cid:48) c ) (cid:107) θ − θ ∗ (cid:107) = 0 , which shows that f c satisfies B.1(ii) on R d with L (cid:48) c in place of L (cid:48) .Finally, let θ ∈ R d \ D , which implies 
that (cid:107) θ − θ ∗ (cid:107) ≥ δ . Using the latter fact and the projectionproperty and the contractivity of pr D we get (cid:107) f c ( θ ) − H ( θ − θ ∗ ) (cid:107) = (cid:107) − c ( θ − pr D ( θ )) + f (pr D ( θ )) − H (pr D ( θ ) − θ ∗ ) − H ( θ − pr D ( θ )) (cid:107)≤ (cid:107) ( c I d + H )( θ − pr D ( θ )) (cid:107) + (cid:107) f (pr D ( θ )) − H (pr D ( θ ) − θ ∗ ) (cid:107)≤ (cid:107) c I d + H (cid:107)(cid:107) θ − pr D ( θ ) (cid:107) + L (cid:48)(cid:48) (cid:107) pr D ( θ ) − θ ∗ (cid:107) λ ≤ ( c + (cid:107) H (cid:107) ) (cid:107) θ − θ ∗ (cid:107) + L (cid:48)(cid:48) (cid:107) θ − θ ∗ (cid:107) λ ≤ (cid:0) ( c + (cid:107) H (cid:107) ) δ λ + L (cid:48)(cid:48) (cid:1) (cid:107) θ − θ ∗ (cid:107) λ . Observing that (cid:107) H (cid:107) ≤ /L (cid:48) , see (37), completes the proof of the lemma. (cid:3) ULTILEVEL STOCHASTIC APPROXIMATION 25
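For simple domains both ingredients of this section are cheap to implement: the projection pr_D and the extension f_c from (76). A minimal sketch for a box domain follows (the drift, the constants and the schedules are our own illustrative choices, and the bias term ε_n R_n of (74) is omitted for brevity):

```python
import numpy as np

def proj_box(x, lo, hi):
    """Orthogonal projection pr_D onto the box D = [lo, hi]^d (Euclidean norm)."""
    return np.clip(x, lo, hi)

def f_c(x, f, lo, hi, c=1.0):
    """(76): extension f_c(x) = -c (x - pr_D(x)) + f(pr_D(x)) of f beyond D."""
    p = proj_box(x, lo, hi)
    return -c * (x - p) + f(p)

def projected_rm(f, theta0, n_steps, lo, hi, gamma0=1.0, sigma0=0.1, seed=0):
    """(74): theta_n = pr_D(theta_{n-1} + gamma_n (f(theta_{n-1}) + sigma_n D_n)),
    with i.i.d. standard normal noise D_n and gamma_n = gamma0/n, sigma_n = sigma0/sqrt(n)."""
    rng = np.random.default_rng(seed)
    theta = proj_box(np.asarray(theta0, dtype=float), lo, hi)
    for n in range(1, n_steps + 1):
        noise = sigma0 * n ** -0.5 * rng.standard_normal(theta.shape)
        theta = proj_box(theta + (gamma0 / n) * (f(theta) + noise), lo, hi)
    return theta

# Toy drift with zero 0.3 inside D = [0, 1]:
root = projected_rm(lambda t: -(t - 0.3), [0.9], 4000, 0.0, 1.0)
```

The projection keeps every iterate in D, while f_c lets the unprojected Polyak-Ruppert analysis apply on all of R^d.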
Replacing f by f c in (1) we obtain the dynamical system(78) θ c,n = θ c,n − + γ n (cid:0) f c ( θ c,n − ) + ε n R n + σ n D n (cid:1) , for n ∈ N , where θ c, ∈ D is a deterministic starting value in D .Employing Lemma 4.3 we immediately arrive at the following fact. Extension 4.4.
Assume that B ( θ ∗ , δ ) ⊂ D for some δ ∈ (0 , ∞ ) . Then Corollary 2.12 remainsvalid for the modified Polyak-Ruppert algorithm (79) ¯ θ c,n = 1¯ b n n (cid:88) k =1 b k θ c,k , n ∈ N , in place of the scheme (36) , if R d is replaced by D in Assumption B.1 and c ∈ (1 / (2 L (cid:48) ) , ∞ ) with L (cid:48) according to B.1(ii).Moreover, Theorem 2.10 remains valid for the scheme (79) as well if, additionally, AssumptionB.3 is satisfied with L c given by (77) in place of L . Similar to Extension 4.4 we can extend Theorem 3.2 on the multilevel Polyak-Ruppert av-eraging. To this end we define for c ∈ (0 , ∞ ) extensions F c , F c, , F c, , . . . : R d × U → R d of themappings F, F , F . . . : D × U → R d by taking G c : R d × U → R d , ( θ, u ) (cid:55)→ − c ( θ − pr D ( θ )) + G (pr D ( θ ) , u )for G ∈ { F, F , F , . . . } . Note that E [ F c ( θ, U )] = f c ( θ ) and f ( θ ∗ ) = f c ( θ ∗ ) = 0 with f c givenby (76).Clearly, if the mappings F, F , F , . . . satisfy C.1(i),(ii) on D then the mappings F c , F c, , F c, , . . . satisfy C.1(i),(ii) on R d with Γ ◦ pr D , Γ ◦ pr D in place of Γ , Γ . Furthermore, if Γ , Γ satisfyAssumption C.2 on D then Γ ◦ pr D , Γ ◦ pr D satisfy Assumption C.2 on R d , since we have (cid:107) pr D ( θ ) (cid:107) ≤ (cid:107) θ (cid:107) + (cid:107) pr D (0) (cid:107) for every θ ∈ R d .We thus take Z c,n ( θ ) = m n (pr D ( θ )) (cid:88) k =1 N n,k (pr D ( θ )) N n,k (pr D ( θ )) (cid:88) (cid:96) =1 (cid:0) F c,k ( θ, U n,k,(cid:96) ) − F c,k − ( θ, U n,k,(cid:96) ) (cid:1) = − c ( θ − pr D ( θ )) + Z n (pr D ( θ )) , with m n , N n,k and Z n given by (56),(57) and (59), respectively, as a multilevel approximationof f c ( θ ) in the n -th Robbins-Monro step, and we use the multilevel scheme(80) θ c,n = θ c,n − + γ n Z c,n ( θ c,n − )for Polyak-Ruppert averaging.Employing Lemma 4.3 we get the following result. Extension 4.5.
Assume that B(θ*, δ) ⊂ D for some δ ∈ (0, ∞). Then Theorem 3.2 remains valid for the modified Polyak-Ruppert algorithm

(81) θ̄_{c,n} = (1/b̄_n) Σ_{k=1}^{n} b_k θ_{c,k}, n ∈ N,

with (θ_{c,n})_{n∈N} given by (80) in place of the scheme (36), if R^d is replaced by D in Assumptions B.1, C.1, C.2 and c ∈ (1/(2L′), ∞) with L′ according to B.1(ii).

Numerical Experiments
We illustrate the application of our multilevel methods in the simple case of computing the volatility in a Black-Scholes model based on the price of a European call.

Fix T, µ, s_0, K ∈ (0, ∞) and let W denote a one-dimensional Brownian motion on [0, T]. For every θ ∈ (0, ∞) we use S^θ to denote the geometric Brownian motion on [0, T] with initial value s_0, trend µ and volatility θ, i.e.

(82) S^θ_0 = s_0, dS^θ_t = µ S^θ_t dt + θ S^θ_t dW_t, t ∈ [0, T].

In a Black-Scholes model with fixed interest rate µ the fair price of a European call with maturity T, strike K and underlying geometric Brownian motion with volatility θ is given by p(θ) = E[C(θ, W)], where

C(θ, W) = exp(−µT)(S^θ_T − K)_+,

and according to the Black-Scholes formula p satisfies

p(θ) = s_0 Φ( (ln(s_0/K) + (µ + θ²/2)T) / (θ√T) ) − exp(−µT) K Φ( (ln(s_0/K) + (µ − θ²/2)T) / (θ√T) ),

where Φ denotes the standard normal distribution function. Fix ϑ_1 < ϑ_2 as well as θ* ∈ [ϑ_1, ϑ_2]. Our computational goal is to approximate θ* based on the knowledge of ϑ_1, ϑ_2 and the value of the price p(θ*). Within the framework of Sections 3 and 4 we take d = 1, D = [ϑ_1, ϑ_2], U = W and

F(θ, W) = p(θ*) − C(θ, W), θ ∈ D.

Moreover, we approximate F(θ, W) by employing equidistant Milstein schemes: for M, k ∈ N with M ≥ 2 and θ ∈ D we define

F_{M,k}(θ, W) = p(θ*) − exp(−µT)(Ŝ^θ_{M^k,T} − K)_+,

where Ŝ^θ_{M^k,T} denotes the Milstein approximation of S^θ_T based on M^k equidistant steps, i.e.

Ŝ^θ_{M^k,T} = s_0 Π_{ℓ=1}^{M^k} ( 1 + µT/M^k + θ Δ_ℓW + (θ²/2)((Δ_ℓW)² − T/M^k) )

with Δ_ℓW = W(ℓT/M^k) − W((ℓ−1)T/M^k). We briefly check the validity of Assumptions B.1, C.1 and C.2. Clearly, the mapping f = E[F(·, W)]: D → R satisfies

f(θ) = p(θ*) − p(θ), θ ∈ D.
Note that p is twice differentiable with respect to θ on (0, ∞) with

(83) ∂p/∂θ(θ) = s_0 √T φ( (ln(s_0/K) + (µ + θ²/2)T) / (θ√T) ),
∂²p/∂θ²(θ) = [ (ln(s_0/K) + (µ + θ²/2)T)(ln(s_0/K) + (µ − θ²/2)T) / (θ³ T) ] ∂p/∂θ(θ),

where φ denotes the density of the standard normal distribution. Let

g(θ) = (ln(s_0/K) + (µ + θ²/2)T) / (θ√T)

and put u = 2(ln(s_0/K) + µT)/T as well as

z* = s_0 √T · { min( φ(g(ϑ_1)), φ(g(ϑ_2)) ), if u ∉ (ϑ_1², ϑ_2²),
min( φ(max(g(ϑ_1), g(ϑ_2))), φ(g(√u)) ), if u ∈ (ϑ_1², ϑ_2²).
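The case distinction in z* reflects whether the minimiser √u of g lies inside [ϑ_1, ϑ_2]. A brute-force cross-check of z* = s_0√T min_{θ∈[ϑ_1,ϑ_2]} φ(g(θ)) can be sketched as follows; the parameter defaults (s_0 = 10, K = 11, T = 2, µ = 0.05 and the interval [0.1, 0.5]) are illustrative choices of ours.

```python
import math

def g(theta, s0=10.0, K=11.0, T=2.0, mu=0.05):
    # the argument of phi in (83)
    return (math.log(s0 / K) + (mu + theta ** 2 / 2) * T) / (theta * math.sqrt(T))

def phi(x):
    # standard normal density
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def z_star_grid(lo, hi, s0=10.0, K=11.0, T=2.0, mu=0.05, grid=100_000):
    """Grid evaluation of z* = s0*sqrt(T) * min over [lo, hi] of phi(g(theta))."""
    thetas = (lo + (hi - lo) * i / grid for i in range(grid + 1))
    return s0 * math.sqrt(T) * min(phi(g(t, s0, K, T, mu)) for t in thetas)
```

The grid minimum is always bounded by s_0√T φ(0), and for parameter ranges with u outside (ϑ_1², ϑ_2²) it is attained at an endpoint, matching the first case of the definition.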
Using (83) it is then straightforward to verify that f satisfies Assumption B.1 on D with parameters

(84) L = s_0 √T min_{θ ∈ [ϑ_1, ϑ_2]} φ(g(θ)) = z*

and

(85) L′ = s_0 √T / √(2π), H = −∂p/∂θ(θ*), L″ = max_{θ ∈ [ϑ_1, ϑ_2]} |∂²p/∂θ²(θ)|, λ = 1.

As is well known there exists a constant c_1(T, µ, ϑ_1, ϑ_2) ∈ (0, ∞), which depends only on T, µ, ϑ_1, ϑ_2, such that

sup_{θ ∈ D} E[ |Ŝ^θ_{M^k,T} − S^θ_T|² ]^{1/2} ≤ c_1(T, µ, ϑ_1, ϑ_2) M^{−k}.

Since |F_{M,k}(θ, W) − F(θ, W)| ≤ |Ŝ^θ_{M^k,T} − S^θ_T| we conclude that Assumption C.1 is satisfied on D with parameters

α = β = 1, Γ_1 = Γ_2 = c_2(T, µ, ϑ_1, ϑ_2, M)

for some constant c_2(T, µ, ϑ_1, ϑ_2, M) ∈ (1, ∞), which depends only on T, µ, ϑ_1, ϑ_2, M. Consequently, Assumption C.2 is satisfied on D as well.

First, we consider the projected multilevel Robbins-Monro scheme (75) with step-size γ_n, noise-level σ_n and bias-level ε_n given by

γ_n = 1/(Ln), σ_n = √(c_2(T, µ, ϑ_1, ϑ_2, M)/n), ε_n = c_2(T, µ, ϑ_1, ϑ_2, M)/n.

Note that the constant c_2(T, µ, ϑ_1, ϑ_2, M) does not need to be known in order to implement the scheme. We have

θ_n = proj_{[ϑ_1, ϑ_2]}( θ_{n−1} + (1/(Ln)) Z_n(θ_{n−1}) ),

where θ_0 ∈ [ϑ_1, ϑ_2] and for all θ ∈ D

(86) Z_n(θ) = Σ_{k=1}^{1 ∨ ⌈log_M(n)⌉} (1/⌈n M^{−3k/2}⌉) Σ_{ℓ=1}^{⌈n M^{−3k/2}⌉} ( F_{M,k}(θ, W_{n,k,ℓ}) − F_{M,k−1}(θ, W_{n,k,ℓ}) )

with independent copies W_{n,k,ℓ} of W. Then by Extension 4.2 of Theorem 3.1 there exists κ ∈ (0, ∞) such that for every n ∈ N,

(87) E[ |θ_n − θ*|² ]^{1/2} ≤ κ n^{−1}, cost_n ≤ κ n².

In the following we use the model parameters

(88) s_0 = 10, T = 2, µ = 0. , K = 11, ϑ_1 = 0. , ϑ_2 = 0. , θ* = 0.

and

(89) M = 4, θ_0 = 0. .

Figure 1 shows the log-log plot of a trajectory of the error process (|θ_n − θ*|)_{n∈N}. Figure 2 shows the log-log plot of Monte Carlo estimates of the root mean squared error of θ_n and the corresponding average computational times for n = 1, …, 100 based on N = 200 replications. Both plots are in accordance with the theoretical bounds in (87).
Figure 1. Multilevel Robbins-Monro: error trajectory for n = 1, 2, ….
Figure 2. Multilevel Robbins-Monro: estimated root mean squared error and average computational time for n = 1, …, 100.
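A compact sketch of the projected multilevel Robbins-Monro iteration may look as follows. All concrete constants here (model parameters, the level and replication numbers, and the choice L ≈ z*) are our own illustrative reading of the scheme, not the authors' exact configuration: the coupled coarse path reuses the fine Brownian increments aggregated in blocks of M, and the level-1 term carries no coarse correction, so that the expected level sum telescopes to the finest-level approximation of f(θ) = p(θ*) − p(θ).

```python
import math
import random

def _milstein(theta, dW, T, s0, mu):
    h = T / len(dW)
    s = s0
    for d in dW:
        s *= 1.0 + mu * h + theta * d + (theta ** 2 / 2) * (d * d - h)
    return s

def _payoff(sT, K, mu, T):
    return math.exp(-mu * T) * max(sT - K, 0.0)

def Z(theta, n, p_star, M=4, T=2.0, s0=10.0, mu=0.05, K=11.0):
    """Multilevel estimate of f(theta) = p(theta*) - p(theta): roughly
    log_M(n) levels with about n*M^(-3k/2) coupled samples on level k."""
    m_n = max(1, math.ceil(math.log(n) / math.log(M)))
    est = 0.0
    for k in range(1, m_n + 1):
        N = max(1, math.ceil(n * M ** (-1.5 * k)))
        acc = 0.0
        for _ in range(N):
            fine = M ** k
            dW = [random.gauss(0.0, math.sqrt(T / fine)) for _ in range(fine)]
            F_fine = p_star - _payoff(_milstein(theta, dW, T, s0, mu), K, mu, T)
            if k == 1:
                acc += F_fine  # base level: no coarse correction
            else:
                # coarse path driven by the same Brownian motion, M-fold coarser
                dWc = [sum(dW[j * M:(j + 1) * M]) for j in range(M ** (k - 1))]
                F_coarse = p_star - _payoff(_milstein(theta, dWc, T, s0, mu), K, mu, T)
                acc += F_fine - F_coarse
        est += acc / N
    return est

def multilevel_robbins_monro(p_star, steps, lo=0.1, hi=0.5, L=5.0, theta0=0.45):
    """Projected iteration theta_n = proj(theta_{n-1} + Z_n(theta_{n-1})/(L*n))."""
    theta = theta0
    for n in range(1, steps + 1):
        theta = min(hi, max(lo, theta + Z(theta, n, p_star) / (L * n)))
    return theta
```

With the target price p(θ*) supplied by the Black-Scholes formula, the iterate drifts toward θ* while the per-step simulation effort grows with n, mirroring the adaptive complexity of the scheme.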
Next, we consider the multilevel Polyak-Ruppert averaging (81) with step-size γ_n, noise-level σ_n, bias-level ε_n, weight b_n and extension parameter c given by

γ_n = 1/n^{3/4}, σ_n = √(c_2(T, µ, ϑ_1, ϑ_2, M)/n), ε_n = c_2(T, µ, ϑ_1, ϑ_2, M)/n, b_n = n², c = 1/L′.

Thus

θ̄_{c,n} = (6/(n(n+1)(2n+1))) Σ_{k=1}^{n} k² θ_{c,k},

where

θ_{c,n} = θ_{c,n−1} + (1/n^{3/4}) ( −(1/L′)(θ_{c,n−1} − proj_{[ϑ_1,ϑ_2]}(θ_{c,n−1})) + Z_n(proj_{[ϑ_1,ϑ_2]}(θ_{c,n−1})) ),

with Z_n given by (86) and a deterministic θ_{c,0} ∈ [ϑ_1, ϑ_2]. Then by Extension 4.5 of Theorem 3.2 there exists for every q ∈ [1, 2) a constant κ ∈ (0, ∞) such that for every n ∈ N,

(90) E[ |θ̄_{c,n} − θ*|^q ]^{1/q} ≤ κ n^{−1}, cost_n ≤ κ n².

We choose the parameters s_0, T, µ, K, ϑ_1, ϑ_2, M, θ_{c,0} = θ_0 as in (88) and (89). Figure 3 shows the log-log plot of a trajectory of the error process (|θ̄_{c,n} − θ*|)_{n∈N} until n = 500.

Figure 3. Multilevel Polyak-Ruppert: error trajectory for n = 1, …, 500.

Figure 4 shows the log-log plot of Monte Carlo estimates of the root mean squared error of θ̄_{c,n} and the corresponding average computational times for n = 1, …, 100 based on N = 200 replications. As for the multilevel Robbins-Monro scheme, both plots are in accordance with the theoretical bounds in (90).
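The averaging step itself is cheap to illustrate in isolation. The sketch below replaces the multilevel estimator Z_n by a generic noisy oracle with 1/√n-scale noise and 1/n-scale bias (illustrative constants of ours, as is the step-size exponent 3/4) and applies the extension term −c(θ − proj_D(θ)) together with the weights b_n = n².

```python
import math
import random

def noisy_f(theta, n, theta_star=0.3, slope=5.0):
    # stand-in for Z_n: mean roughly f(theta) = -slope*(theta - theta_star),
    # noise level ~ sqrt(c/n) and bias level ~ c/n; purely illustrative numbers
    return (-slope * (theta - theta_star)
            + random.gauss(0.0, math.sqrt(5.0 / n)) + 1.0 / n)

def polyak_ruppert(steps, lo=0.1, hi=0.5, c=2.0, theta0=0.45):
    """Averaged iterate: raw update with gamma_n = n^(-3/4), then the weighted
    average 6/(n(n+1)(2n+1)) * sum_k k^2 * theta_k."""
    theta = theta0
    num = den = 0.0
    for n in range(1, steps + 1):
        proj = min(hi, max(lo, theta))
        theta += n ** (-0.75) * (-c * (theta - proj) + noisy_f(proj, n))
        num += n * n * theta
        den += n * n  # den = n(n+1)(2n+1)/6 after n terms
    return num / den
```

The quadratic weights concentrate the average on late, well-converged iterates, which is what drives the improved rate of the averaged scheme relative to the raw iteration.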
Figure 4. Multilevel Polyak-Ruppert: estimated root mean squared error and average computational time for n = 1, …, 100.

Appendix
Let (Ω, F, P) be a probability space endowed with a filtration (F_n)_{n∈N_0} and let ‖·‖ denote a Hilbert space norm on R^d. In this section we provide p-th mean estimates for an adapted d-dimensional dynamical system (ζ_n)_{n∈N_0} with the property that for each n ∈ N, ζ_n is a zero-mean perturbation of a previsible proposal ξ_n being comparable in size to ζ_{n−1}. More formally, we assume that there exist a previsible d-dimensional process (ξ_n)_{n∈N}, a d-dimensional martingale (M_n)_{n∈N_0} with M_0 = ζ_0 and a constant c ≥ 0 such that for every n ∈ N,

(91) ζ_n = ξ_n + ΔM_n, ‖ξ_n‖ ≤ ‖ζ_{n−1}‖ ∨ c,

where ΔM_n = M_n − M_{n−1}. Note that necessarily ξ_n = E[ζ_n | F_{n−1}].

Theorem 5.1.
Assume that (ζ_n)_{n∈N_0} is an adapted d-dimensional process which satisfies (91), and let p ∈ [1, ∞). Then there exists a constant κ ∈ (0, ∞), which only depends on p, such that for every n ∈ N,

E[ max_{0≤k≤n} ‖ζ_k‖^p ] ≤ κ ( E[ [M]_n^{p/2} ] + c^p ),

where

[M]_n = Σ_{k=1}^{n} ‖ΔM_k‖² + ‖M_0‖².

Proof.
Fix p ∈ [1, ∞). We first consider the case where c = 0. Recall that by the BDG inequality there exists a constant κ̄ > 0, depending only on p, such that for every d-dimensional martingale (M_n)_{n∈N_0},

E[ max_{0≤k≤n} ‖M_k‖^p ] ≤ κ̄ E[ [M]_n^{p/2} ].
We fix a time horizon T ∈ N and prove the statement of the theorem with κ = κ̄ by induction: we say that the statement holds up to time t ∈ {0, …, T}, if for every d-dimensional adapted process (ζ_n)_{n∈N_0}, for every d-dimensional previsible process (ξ_n)_{n∈N} and for every d-dimensional martingale (M_n)_{n∈N_0} with ζ_0 = M_0,

(C_t) ‖ξ_n‖ ≤ ‖ζ_{n−1}‖, if 1 ≤ n ≤ t,
ζ_n = ξ_n + ΔM_n, if 1 ≤ n ≤ t,
ζ_n = ζ_{n−1} + ΔM_n, if n > t,

one has

E[ max_{0≤n≤T} ‖ζ_n‖^p ] ≤ κ̄ E[ [M]_T^{p/2} ].

Clearly, the statement is satisfied up to time 0 as a consequence of the BDG inequality. Next, suppose that the statement is satisfied up to time t ∈ {0, …, T−1}. Let (ζ_n)_{n∈N_0} be a d-dimensional adapted process, (ξ_n)_{n∈N} be a d-dimensional previsible process and (M_n)_{n∈N_0} be a d-dimensional martingale satisfying property (C_{t+1}). Consider any F_t-measurable random orthonormal transformation U on (R^d, ‖·‖) and put

ζ^U_n = { ζ_n, if n ≤ t; ζ_t + U(M_n − M_t), if n > t }

as well as

M^U_n = { M_n, if n ≤ t; M_t + U(M_n − M_t), if n > t }.

Then it is easy to check that (M^U_n)_{n∈N_0} is a martingale with [M^U]_n = [M]_n for all n ∈ N_0. Furthermore, (ζ^U_n)_{n∈N_0} is adapted and the triple (ζ^U, ξ, M^U) satisfies property (C_t). Hence, by the induction hypothesis,

(92) E[ max_{0≤n≤T} ‖ζ^U_n‖^p ] ≤ κ̄ E[ [M^U]_T^{p/2} ] = κ̄ E[ [M]_T^{p/2} ].

Note that for any such random orthonormal transformation U, the norm of the random variable ζ^U_n is the same as the norm of the variable ζ̄^U_n given by

ζ̄^U_n = { ζ_n, if n ≤ t; U*ζ_t + M_n − M_t, if n > t },

whence

(93) E[ max_{0≤n≤T} ‖ζ̄^U_n‖^p ] = E[ max_{0≤n≤T} ‖ζ^U_n‖^p ].
Clearly, we can choose an F_t-measurable random orthonormal transformation U on (R^d, ‖·‖) such that

U*ζ_t = (‖ζ_t‖ / ‖ξ_{t+1}‖) ξ_{t+1}

holds on {ξ_{t+1} ≠ 0}. Let

α = ((‖ξ_{t+1}‖ + ‖ζ_t‖) / (2‖ζ_t‖)) · 1_{{ζ_t ≠ 0}}.

Then α is F_t-measurable and takes values in [0, 1] since ‖ξ_{t+1}‖ ≤ ‖ζ_t‖. Moreover, we have

ξ_{t+1} = α U*ζ_t + (1 − α)(−U)*ζ_t

so that by property (C_{t+1}) of the triple (ζ, ξ, M),

ζ_n = ξ_{t+1} + M_n − M_t = α ζ̄^U_n + (1 − α) ζ̄^{−U}_n

for n = t+1, …, T. Note that ζ_n = ζ̄^U_n = ζ̄^{−U}_n for n = 0, …, t. By convexity of ‖·‖^p we thus obtain

max_{0≤n≤T} ‖ζ_n‖^p = max_{0≤n≤T} ‖α ζ̄^U_n + (1 − α) ζ̄^{−U}_n‖^p ≤ α max_{0≤n≤T} ‖ζ̄^U_n‖^p + (1 − α) max_{0≤n≤T} ‖ζ̄^{−U}_n‖^p.

Hence

E[ max_{0≤n≤T} ‖ζ_n‖^p | F_t ] ≤ α E[ max_{0≤n≤T} ‖ζ̄^U_n‖^p | F_t ] + (1 − α) E[ max_{0≤n≤T} ‖ζ̄^{−U}_n‖^p | F_t ] ≤ E[ max_{0≤n≤T} ‖ζ̄^{U′}_n‖^p | F_t ],

where U′ is the F_t-measurable random orthonormal transformation given by

U′(ω) = { U(ω), if ω ∈ { E[max_{0≤n≤T} ‖ζ̄^U_n‖^p | F_t] ≥ E[max_{0≤n≤T} ‖ζ̄^{−U}_n‖^p | F_t] }; −U(ω), otherwise }.

Applying (92) and (93) with U = U′ finishes the induction step.

Next, we consider the case of c > 0. Suppose that ζ, ξ and M are as stated in the theorem. For n ∈ N we put ξ̃_n = (1 − c/‖ξ_n‖)_+ · ξ_n and ζ̃_n = ξ̃_n + ΔM_n. Furthermore, let ζ̃_0 = ζ_0 = M_0. We will show that the triple (ζ̃, ξ̃, M) satisfies (91) with c = 0. Clearly, (ζ̃_n)_{n∈N_0} is adapted and (ξ̃_n)_{n∈N} is previsible. Moreover, one has for n ∈ N on {‖ξ_n‖ ≥ c} that

‖ξ̃_n‖ = ‖ξ_n‖ − c ≤ ‖ζ_{n−1}‖ − c = ‖ζ̃_{n−1} + ξ_{n−1} − ξ̃_{n−1}‖ − c ≤ ‖ζ̃_{n−1}‖ + ‖ξ_{n−1} − ξ̃_{n−1}‖ − c ≤ ‖ζ̃_{n−1}‖

and on {‖ξ_n‖ < c} that ‖ξ̃_n‖ = 0 ≤ ‖ζ̃_{n−1}‖. We may thus apply Theorem 5.1 with c = 0 to obtain that for every n ∈ N,

E[ max_{0≤k≤n} ‖ζ̃_k‖^p ] ≤ κ̄ E[ [M]_n^{p/2} ].

Since for every n ∈ N,

‖ζ_n‖^p = ‖ζ̃_n + ξ_n − ξ̃_n‖^p ≤ 2^p ( ‖ζ̃_n‖^p + c^p ),

we conclude that

E[ max_{0≤k≤n} ‖ζ_k‖^p ] ≤ 2^p ( κ̄ E[ [M]_n^{p/2} ] + c^p ) ≤ 2^p (κ̄ ∨ 1) · ( E[ [M]_n^{p/2} ] + c^p ),

which completes the proof. □

References

[1] A. Benveniste, M. Métivier, and P. Priouret.
Adaptive Algorithms and Stochastic Approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag, Berlin, 1990.
[2] J. Dippon and J. Renz. Weighted means in stochastic approximation of minima. SIAM J. Control Optim., 35(5):1811–1827, 1997.
[3] K. Djeddour, A. Mokkadem, and M. Pelletier. On the recursive estimation of the location and of the size of the mode of a probability density. Serdica Math. J., 34(3):651–688, 2008.
[4] M. Duflo. Algorithmes stochastiques, volume 23 of Mathématiques & Applications (Berlin). Springer-Verlag, Berlin, 1996.
[5] N. Frikha. Multi-level stochastic approximation algorithms. Ann. Appl. Probab., 26:933–985, 2016.
[6] V. F. Gaposhkin and T. P. Krasulina. On the law of the iterated logarithm in stochastic approximation processes. Theory Probab. Appl., 19(4):844–850, 1974.
[7] M. B. Giles. Multilevel Monte Carlo path simulation. Oper. Res., 56(3):607–617, 2008.
[8] S. Heinrich. Multilevel Monte Carlo methods. In Large-Scale Scientific Computing, pages 58–67. Springer, Berlin, Heidelberg, 2001.
[9] V. R. Konda and J. N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic approximation. Ann. Appl. Probab., 14(2):796–819, 2004.
[10] H. J. Kushner and J. Yang. Stochastic approximation with averaging of the iterates: optimal asymptotic rate of convergence for general processes. SIAM J. Control Optim., 31(4):1045–1062, 1993.
[11] H. J. Kushner and G. G. Yin. Stochastic Approximation and Recursive Algorithms and Applications, volume 35 of Applications of Mathematics (New York). Springer-Verlag, New York, second edition, 2003. Stochastic Modelling and Applied Probability.
[12] T. L. Lai. Stochastic approximation. Ann. Statist., 31(2):391–406, 2003. Dedicated to the memory of Herbert E. Robbins.
[13] T. L. Lai and H. Robbins. Limit theorems for weighted sums and stochastic approximation processes. Proc. Nat. Acad. Sci. U.S.A., 75, 1978.
[14] A. Le Breton and A. Novikov. Averaging for estimating covariances in stochastic approximation. Math. Methods Statist., 3(3):244–266, 1994.
[15] A. Le Breton and A. Novikov. Some results about averaging in stochastic approximation. Metrika, 42(3-4):153–171, 1995. Second International Conference on Mathematical Statistics (Smolenice Castle, 1994).
[16] L. Ljung, G. Pflug, and H. Walk. Stochastic Approximation and Optimization of Random Systems, volume 17 of DMV Seminar. Birkhäuser Verlag, Basel, 1992.
[17] A. Mokkadem and M. Pelletier. A generalization of the averaging procedure: the use of two-time-scale algorithms. SIAM J. Control Optim., 49(4):1523–1543, 2011.
[18] M. Pelletier. On the almost sure asymptotic behaviour of stochastic algorithms. Stochastic Process. Appl., 78(2):217–244, 1998.
[19] M. Pelletier. Weak convergence rates for stochastic approximation with application to multiple targets and simulated annealing. Ann. Appl. Probab., 8(1):10–44, 1998.
[20] B. T. Polyak. A new method of stochastic approximation type. Automat. Remote Control, 51(7):937–946, 1990.
[21] H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statist., 22:400–407, 1951.
[22] D. Ruppert. Almost sure approximations to the Robbins-Monro and Kiefer-Wolfowitz processes with dependent noise. Ann. Probab., 10, 1982.
[23] D. Ruppert. Stochastic approximation. In Handbook of Sequential Analysis, volume 118 of Statist. Textbooks Monogr., pages 503–529. Dekker, New York, 1991.
Steffen Dereich, Institut für Mathematische Statistik, Fachbereich 10: Mathematik und Informatik, Westfälische Wilhelms-Universität Münster, Orléans-Ring 10, 48149 Münster, Germany
E-mail address: [email protected]

Thomas Müller-Gronbach, Fakultät für Informatik und Mathematik, Universität Passau, Innstraße 33, 94032 Passau, Germany
E-mail address: