Geom-SPIDER-EM: Faster Variance Reduced Stochastic Expectation Maximization for Nonconvex Finite-Sum Optimization
Gersende Fort⋄, Eric Moulines⋆, Hoi-To Wai†
⋄ Institut de Mathématiques de Toulouse; Université de Toulouse; CNRS UPS, F-31062 Toulouse Cedex, France
⋆ Centre de Mathématiques Appliquées; École Polytechnique; 91128 Palaiseau Cedex, France
† Department of SEEM; The Chinese University of Hong Kong; Shatin, Hong Kong

Part of this work is funded by the Fondation Simone and Cino Del Duca under the program OpSiMorE.
ABSTRACT
The Expectation Maximization (EM) algorithm is a key reference for inference in latent variable models; unfortunately, its computational cost is prohibitive in the large scale learning setting. In this paper, we propose an extension of the Stochastic Path-Integrated Differential EstimatoR EM (SPIDER-EM) and derive complexity bounds for this novel algorithm, designed to solve smooth nonconvex finite-sum optimization problems. We show that it reaches the same state of the art complexity bounds as SPIDER-EM; and provide conditions for a linear rate of convergence. Numerical results support our findings.
Index Terms — Large scale learning, Latent variable analysis, Expectation Maximization, Stochastic nonconvex optimization, Variance reduction.
1. INTRODUCTION
Intelligent processing of large data sets and efficient learning of high-dimensional models require new optimization algorithms, designed to be robust in the era of big data and complex models (see e.g. [1, 2, 3]). This paper is concerned with stochastic optimization of a nonconvex finite-sum smooth objective function
$$ \operatorname{Argmin}_{\theta \in \Theta} F(\theta), \qquad F(\theta) \stackrel{\text{def}}{=} \frac{1}{n} \sum_{i=1}^n L_i(\theta) + R(\theta), \tag{1} $$

when $\Theta \subseteq \mathbb{R}^d$ and $F$ cannot be explicitly evaluated (nor its gradient). Many statistical learning problems can be cast into this framework, where $n$ is the number of observations or examples, $L_i$ is a loss function associated to example $i$ (most often a negated log-likelihood), and $R$ is a penalty term promoting sparsity, regularity, etc. Intractability of $F(\theta)$ might come from two sources. The first, referred to as the large scale learning setting, is that the number $n$ is very large, so that computations involving a sum over $n$ terms should be either simply avoided or sparingly used during the run of the optimization algorithm (see e.g. [4] for an introduction to the bridge between large scale learning and stochastic approximation; see [5, 6] for applications to the training of deep neural networks for signal and image processing; more generally, empirical risk minimization in machine learning is an instance of (1)). The second is due to the presence of latent variables: for any $i$, the function $L_i$ is defined through a (high-dimensional) integral over latent variables. Such a latent variable context is classical in statistical modeling: for example as a tool for inference in mixture models [7], for the definition of mixed models capturing variability among examples [8], or for modeling hidden and/or missing variables (see e.g. applications in text modeling through latent Dirichlet allocation [9], in audio source separation [10, 11], in hyper-spectral imaging [12]).

In this contribution, we address the two levels of intractability in the case $L_i$ is of the form

$$ L_i(\theta) \stackrel{\text{def}}{=} -\log \int_{\mathsf{Z}} h_i(z) \exp\left( \langle s_i(z), \phi(\theta) \rangle - \psi(\theta) \right) \mu(\mathrm{d}z). \tag{2} $$

This setting in particular covers the case when $\sum_{i=1}^n L_i(\theta)$ is the negated log-likelihood of the observations $(Y_1, \dots, Y_n)$, the pairs observation/latent variable $\{(Y_i, Z_i), i \leq n\}$ are independent, and the distribution of the complete data $(Y_i, Z_i)$, given for $(y_i, z)$ by $h_i(z) \exp(\langle s_i(z), \phi(\theta) \rangle - \psi(\theta))\, \mu(\mathrm{d}z)$, is from the curved exponential family. Gaussian mixture models are typical examples, as well as mixtures of distributions from the curved exponential family.

In the framework (1)-(2), a Majorize-Minimization approach through the Expectation-Maximization (EM) algorithm [13] is standard; unfortunately, the computational cost of the batch EM can be prohibitive in the large scale learning setting. Different strategies were proposed to address this issue [14, 15, 16, 17, 18]: they combine mini-batch processing, Stochastic Approximation (SA) techniques (see e.g. [19, 20]) and variance reduction methods. The first contribution of this paper is a novel algorithm, the generalized Stochastic Path-Integrated Differential EstimatoR EM (g-SPIDER-EM), which is among the variance reduced stochastic EM methods for nonconvex finite-sum optimization of the form (1)-(2); the generalizations allow a reduced computational cost without altering the convergence properties.
The second contribution is the proof of complexity bounds, that is, the number of parameter updates (M-step) and the number of conditional expectation evaluations (E-step) required to reach $\epsilon$-approximate stationary points; these bounds are derived for a specific form of g-SPIDER-EM: we show that its complexity bounds are the same as those of SPIDER-EM, bounds which are state of the art and improve on all previous ones. Linear convergence rates are proved under a Polyak-Łojasiewicz condition. Finally, numerical results support our findings and provide insights on how to implement g-SPIDER-EM in order to inherit the properties of SPIDER-EM while reducing the computational cost.
Notations
For two vectors $a, b \in \mathbb{R}^q$, $\langle a, b \rangle$ is the scalar product, and $\|\cdot\|$ the associated norm. For a matrix $A$, $A^T$ is its transpose. For a positive integer $n$, set $[n]^\star \stackrel{\text{def}}{=} \{1, \dots, n\}$ and $[n] \stackrel{\text{def}}{=} \{0, \dots, n\}$. $\nabla f$ denotes the gradient of a differentiable function $f$. The minimum of $a$ and $b$ is denoted by $a \wedge b$. Finally, we use the standard big $O$ notation to leave out constants.

2. EM-BASED METHODS IN THE EXPECTATION SPACE

We begin by formulating the model assumptions:
A1. $\Theta \subseteq \mathbb{R}^d$ is a convex set. $(\mathsf{Z}, \mathcal{Z})$ is a measurable space and $\mu$ is a $\sigma$-finite positive measure on $\mathsf{Z}$. The functions $R: \Theta \to \mathbb{R}$, $\phi: \Theta \to \mathbb{R}^q$, $\psi: \Theta \to \mathbb{R}$, $s_i: \mathsf{Z} \to \mathbb{R}^q$, $h_i: \mathsf{Z} \to \mathbb{R}_+$ for all $i \in [n]^\star$ are measurable. For any $\theta \in \Theta$ and $i \in [n]^\star$, $|L_i(\theta)| < \infty$.

For any $\theta \in \Theta$ and $i \in [n]^\star$, define the posterior density of the latent variable $Z_i$ given the observation $Y_i$:

$$ p_i(z; \theta) \stackrel{\text{def}}{=} h_i(z) \exp\left( \langle s_i(z), \phi(\theta) \rangle - \psi(\theta) + L_i(\theta) \right); \tag{3} $$

note that the dependence upon $y_i$ is through the index $i$ in the above. Set

$$ \bar{s}_i(\theta) \stackrel{\text{def}}{=} \int_{\mathsf{Z}} s_i(z)\, p_i(z; \theta)\, \mu(\mathrm{d}z), \qquad \bar{s}(\theta) \stackrel{\text{def}}{=} \frac{1}{n} \sum_{i=1}^n \bar{s}_i(\theta). \tag{4} $$
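As a concrete illustration of (2)-(4) (a worked toy example of ours; the paper only names Gaussian mixtures as an instance), consider a mixture of $g$ univariate Gaussians with unit variances and equal weights $1/g$, with $\mu$ the counting measure on $\mathsf{Z} = \{1, \dots, g\}$ and $\theta = (\mu_1, \dots, \mu_g)$ the component means. The complete-data likelihood $p(y_i, z; \theta) = g^{-1} (2\pi)^{-1/2} \exp(-(y_i - \mu_z)^2/2)$ factorizes as in (2) with

$$ h_i(z) = \frac{e^{-y_i^2/2}}{g\sqrt{2\pi}}, \qquad s_i(z) = \left( \mathbb{1}_{\{z=1\}}, \dots, \mathbb{1}_{\{z=g\}},\ \mathbb{1}_{\{z=1\}} y_i, \dots, \mathbb{1}_{\{z=g\}} y_i \right), $$
$$ \phi(\theta) = \left( -\mu_1^2/2, \dots, -\mu_g^2/2,\ \mu_1, \dots, \mu_g \right), \qquad \psi(\theta) = 0. $$

Then (3) is the posterior responsibility $r_{ij}(\theta) \propto \exp(y_i \mu_j - \mu_j^2/2)$ of component $j$ for example $i$, and (4) reads $\bar{s}_i(\theta) = (r_{i1}, \dots, r_{ig},\ r_{i1} y_i, \dots, r_{ig} y_i)$.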
A2. The expectations $\bar{s}_i(\theta)$ are well defined for all $\theta \in \Theta$ and $i \in [n]^\star$. For any $s \in \mathbb{R}^q$, $\operatorname{Argmin}_{\theta \in \Theta} \left( \psi(\theta) - \langle s, \phi(\theta) \rangle + R(\theta) \right)$ is a (non empty) singleton denoted by $\{T(s)\}$.

EM is an iterative algorithm: given a current value $\tau_k \in \Theta$, the next value is $\tau_{k+1} \leftarrow T \circ \bar{s}(\tau_k)$. It combines an expectation step, which boils down to the computation of $\bar{s}(\tau_k)$, the average of the conditional expectations $\bar{s}_i(\tau_k)$ of $s_i(Z_i)$ under $p_i(\cdot; \tau_k)$; and a maximization step, which corresponds to the computation of the map $T$. Equivalently, by using $T$, which maps $\mathbb{R}^q$ to $\Theta$, it can be described in the expectation space (see [21]): given the current value $\bar{s}^k \in \bar{s}(\Theta)$, the next value is $\bar{s}^{k+1} \leftarrow \bar{s} \circ T(\bar{s}^k)$.

In this paper, we see EM as an iterative algorithm operating in the expectation space. In that case, the fixed points of the EM operator $\bar{s} \circ T$ are the roots of the function

$$ h(s) \stackrel{\text{def}}{=} \bar{s} \circ T(s) - s. \tag{5} $$

EM possesses a Lyapunov function: in the parameter space, it is the objective function $F$ where, by definition of the EM sequence, it holds $F(\tau_{k+1}) \leq F(\tau_k)$; in the expectation space, it is $W \stackrel{\text{def}}{=} F \circ T$, and $W(\bar{s}^{k+1}) \leq W(\bar{s}^k)$ holds. In order to derive complexity bounds, regularity assumptions are required on $W$:
A3. The functions $\phi$, $\psi$ and $R$ are continuously differentiable on $\Theta_v$, where $\Theta_v$ is a neighborhood of $\Theta$. $T$ is continuously differentiable on $\mathbb{R}^q$. The function $F$ is continuously differentiable on $\Theta_v$ and, for any $\theta \in \Theta$, $\nabla F(\theta) = -\nabla \phi(\theta)^T \bar{s}(\theta) + \nabla \psi(\theta) + \nabla R(\theta)$. For any $s \in \mathbb{R}^q$, $B(s) \stackrel{\text{def}}{=} \nabla(\phi \circ T)(s)$ is a symmetric $q \times q$ matrix and there exist $0 < v_{\min} \leq v_{\max} < \infty$ such that, for all $s \in \mathbb{R}^q$, the spectrum of $B(s)$ is in $[v_{\min}, v_{\max}]$. For any $i \in [n]^\star$, $\bar{s}_i \circ T$ is globally Lipschitz on $\mathbb{R}^q$ with constant $L_i$. The function $s \mapsto \nabla(F \circ T)(s) = B(s)\left( \bar{s} \circ T(s) - s \right)$ is globally Lipschitz on $\mathbb{R}^q$ with constant $L_{\dot{W}}$.

A3 implies that $W$ has a globally Lipschitz gradient and $\nabla W(s) = -B(s) h(s)$ for some positive definite matrix $B(s)$ (see e.g. [21, Lemma 2]; see also [22, Propositions 1 and 2]). Note that this implies that $\nabla W(s_\star) = 0$ iff $h(s_\star) = 0$. Unfortunately, in the large scale learning setting (when $n \gg 1$), EM can not be easily applied since each iteration involves $n$ conditional expectation (CE) evaluations through $\bar{s} = n^{-1} \sum_{i=1}^n \bar{s}_i$.
Incremental EM techniques have been proposed to address this issue: the most straightforward approach amounts to using a SA scheme with mean field $h$. Upon noting that $h(s) = \mathbb{E}[\bar{s}_I \circ T(s)] - s$, where $I$ is a uniform random variable (r.v.) on $[n]^\star$, the fixed points of the EM operator $\bar{s} \circ T$ are those of the SA scheme

$$ \widehat{S}^{k+1} = \widehat{S}^k + \gamma_{k+1} \left( \frac{1}{b} \sum_{i \in \mathcal{B}_{k+1}} \bar{s}_i \circ T(\widehat{S}^k) - \widehat{S}^k \right), \tag{6} $$

where $\{\gamma_k, k \geq 1\}$ is a deterministic positive step size sequence, and $\mathcal{B}_{k+1}$ is a mini-batch of size $b$ sampled in $[n]^\star$ independently from the past of the algorithm. This forms the basis of Online-EM proposed by [15] (see also [23]). Variance reduced versions were also proposed and studied: Incremental EM (i-EM) [14, 24], Stochastic EM with variance reduction (sEM-vr) [16], Fast Incremental EM (FIEM) [17, 22] and, more recently, Stochastic Path-Integrated Differential EstimatoR EM (SPIDER-EM) [18].

As shown in [22, Section 2.3], these algorithms can be seen as a combination of SA with control variates: upon noting that $h(s) = h(s) + \mathbb{E}[U]$ for any r.v. $U$ such that $\mathbb{E}[U] = 0$, control variates within SA procedures replace (6) with

$$ \widehat{S}^{k+1} = \widehat{S}^k + \gamma_{k+1} \left( \frac{1}{b} \sum_{i \in \mathcal{B}_{k+1}} \bar{s}_i \circ T(\widehat{S}^k) + U^{k+1} - \widehat{S}^k \right), $$

for a choice of $U^{k+1}$ such that the new algorithm has better properties (for example, in terms of complexity - see the end of Section 3). Lastly, we remark that A1-A3 are common assumptions satisfied by many statistical models such as the Gaussian Mixture Model; see [18] for a rigorous justification of these assumptions.
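For illustration, here is a minimal runnable sketch (ours, not the paper's code) of the Online-EM recursion (6) on the two-component case ($g = 2$) of the toy mixture worked out in Section 2; the function names `sbar_T` and `online_em` are ours, and the step-size schedule is an arbitrary choice of decaying positive step sizes.

```python
import numpy as np

def sbar_T(s, y):
    """Map s -> sbar_i∘T(s), one row per example, for a toy model: a
    two-component 1D Gaussian mixture with equal weights 1/2 and unit
    variances.  s = (s1, s2, s3) collects the mean responsibility of
    component 1 and the two weighted first moments."""
    mu1, mu2 = s[1] / s[0], s[2] / (1.0 - s[0])        # M-step map T(s)
    # E-step: posterior probability that y_i belongs to component 1
    logit = y * (mu1 - mu2) - 0.5 * (mu1 ** 2 - mu2 ** 2)
    r1 = 1.0 / (1.0 + np.exp(-logit))
    return np.stack([r1, r1 * y, (1.0 - r1) * y], axis=-1)

def online_em(y, n_iter=2000, b=32, rng=np.random.default_rng(0)):
    """SA scheme (6): mini-batch approximation of sbar∘T, no variance reduction."""
    S = np.array([0.5, -1.0, 1.0])                     # initial statistic
    for k in range(n_iter):
        gamma = (k + 10.0) ** (-0.6)                   # decaying positive step size
        idx = rng.integers(0, len(y), size=b)          # mini-batch B_{k+1}
        S = S + gamma * (sbar_T(S, y[idx]).mean(axis=0) - S)
    return S

# usage: the fixed points of (6) are the roots of h; true means are (-2, 2)
rng = np.random.default_rng(1)
y = np.where(rng.random(10000) < 0.5, -2.0, 2.0) + rng.standard_normal(10000)
S = online_em(y)
print("fitted means:", S[1] / S[0], S[2] / (1.0 - S[0]))
```

The map `sbar_T` (one E-step on a mini-batch followed by the M-step map $T$) is the only model-specific ingredient; the variance reduced schemes below reuse it unchanged.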
3. THE GEOM-SPIDER-EM ALGORITHM

Data: $k_{\mathrm{out}} \in \mathbb{N}^\star$; $\widehat{S}_{\mathrm{init}} \in \mathbb{R}^q$; $\xi_t \in \mathbb{N}^\star$ for $t \in [k_{\mathrm{out}}]^\star$; $\gamma_{t,0} \geq 0$, $\gamma_{t,k} > 0$ for $t \in [k_{\mathrm{out}}]^\star$, $k \in [\xi_t]^\star$.
Result: the g-SPIDER-EM sequence $\{\widehat{S}^{t,k}\}$
1: $\widehat{S}^{1,0} = \widehat{S}^{1,-1} = \widehat{S}_{\mathrm{init}}$;
2: $S^{1,0} = \bar{s} \circ T(\widehat{S}^{1,-1}) + \mathcal{E}_1$;
3: for $t = 1, \dots, k_{\mathrm{out}}$ do
4:   for $k = 0, \dots, \xi_t - 1$ do
5:     Sample a mini-batch $\mathcal{B}_{t,k+1}$ of size $b$ in $[n]^\star$;
6:     $S^{t,k+1} = S^{t,k} + b^{-1} \sum_{i \in \mathcal{B}_{t,k+1}} \left( \bar{s}_i \circ T(\widehat{S}^{t,k}) - \bar{s}_i \circ T(\widehat{S}^{t,k-1}) \right)$;
7:     $\widehat{S}^{t,k+1} = \widehat{S}^{t,k} + \gamma_{t,k+1} \left( S^{t,k+1} - \widehat{S}^{t,k} \right)$;
8:   $\widehat{S}^{t+1,-1} = \widehat{S}^{t,\xi_t}$;
9:   $S^{t+1,0} = \bar{s} \circ T(\widehat{S}^{t+1,-1}) + \mathcal{E}_{t+1}$;
10:  $\widehat{S}^{t+1,0} = \widehat{S}^{t+1,-1} + \gamma_{t+1,0} \left( S^{t+1,0} - \widehat{S}^{t+1,-1} \right)$

Algorithm 1: The g-SPIDER-EM algorithm. The $\mathcal{E}_t$'s are introduced as a perturbation to the computation of $\bar{s} \circ T(\widehat{S}^{t,-1})$; they can be null.

The algorithm generalized Stochastic Path-Integrated Differential EstimatoR Expectation Maximization (g-SPIDER-EM), described by Algorithm 1, uses a new strategy when defining the approximation of $\bar{s} \circ T(s)$ at each iteration. It is composed of nested loops: $k_{\mathrm{out}}$ outer loops, each of them formed with a possibly random number of inner loops. Within the $t$-th outer loop, g-SPIDER-EM mimics the identity $\bar{s} \circ T(\widehat{S}^{t,k}) = \bar{s} \circ T(\widehat{S}^{t,k-1}) + \{\bar{s} \circ T(\widehat{S}^{t,k}) - \bar{s} \circ T(\widehat{S}^{t,k-1})\}$. More precisely, at iteration $k+1$, the approximation $S^{t,k+1}$ of the full sum $\bar{s} \circ T(\widehat{S}^{t,k})$ is the sum of the current approximation $S^{t,k}$ and of a Monte Carlo approximation of the difference (see Lines 5, 6 in Algorithm 1); the examples $i$ in $\mathcal{B}_{t,k+1}$ used in the approximation of $\bar{s} \circ T(\widehat{S}^{t,k})$ and those used for the approximation of $\bar{s} \circ T(\widehat{S}^{t,k-1})$ are the same - which makes the approximations correlated and favors a variance reduction when plugged in the SA update (Line 7). $\mathcal{B}_{t,k+1}$ is sampled with or without replacement; even when $\mathcal{B}_{t,k+1}$ collects independent examples sampled uniformly in $[n]^\star$, we have

$$ \mathbb{E}\left[ S^{t,k+1} | \mathcal{F}_{t,k} \right] - \bar{s} \circ T(\widehat{S}^{t,k}) = S^{t,k} - \bar{s} \circ T(\widehat{S}^{t,k-1}), $$

where $\mathcal{F}_{t,k}$ is the sigma-field collecting the randomness up to the end of the outer loop $t$ and inner loop $k$: the approximation $S^{t,k+1}$ of $\bar{s} \circ T(\widehat{S}^{t,k})$ is biased - a property which makes the theoretical analysis of the algorithm challenging. This approximation is reset (see Lines 2, 9) at the end of an outer loop: in the "standard" SPIDER-EM, $S^{t,0} = \bar{s} \circ T(\widehat{S}^{t,-1})$ is computed, but this "refresh" can be only partial, by computing an update on a (large) batch $\tilde{\mathcal{B}}_{t,0}$ (of size $\tilde{b}_t$) of observations: $S^{t,0} = \tilde{b}_t^{-1} \sum_{i \in \tilde{\mathcal{B}}_{t,0}} \bar{s}_i \circ T(\widehat{S}^{t,-1})$. Such a reset starts a so-called epoch (see Line 3). The number of inner loops $\xi_t$ at epoch $t$ can be deterministic; or random, such as a uniform distribution on $[k_{\mathrm{in}}]^\star$ or a geometric distribution, drawn prior to the run of the algorithm.

Comparing g-SPIDER-EM with SPIDER-EM [18], we notice that the former allows a perturbation $\mathcal{E}_t$ when initializing $S^{t,0}$. This is important for computational cost reduction. Moreover, g-SPIDER-EM considers epochs with time-varying lengths $\xi_t$, which covers situations when the length is random and chosen independently of the other sources of randomness (the errors $\mathcal{E}_t$, the batches $\mathcal{B}_{t,k+1}$). Hereafter, we provide an original analysis of an instance of g-SPIDER-EM, namely Geom-SPIDER-EM, which corresponds to the case $\xi_t \leftarrow \Xi_t$, $\Xi_t$ being a geometric r.v. on $\mathbb{N}^\star$ with success probability $1-\rho_t \in (0,1)$: $\mathbb{P}(\Xi_t = k) = (1-\rho_t)\, \rho_t^{k-1}$ for $k \geq 1$ (hereafter, we will write $\Xi_t \sim \mathcal{G}^\star(1-\rho_t)$). Since $\Xi_t$ is also the first-success distribution in a sequence of independent Bernoulli trials, the geometric length could be replaced with: (i) at each iteration $k$ of epoch $t$, sample a Bernoulli r.v. with probability of success $1-\rho_t$; (ii) when the coin comes up heads, start a new epoch (see [25, 26] for similar ideas on stochastic gradient algorithms). A sketch of this scheme is given below.
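Below is a minimal sketch of Algorithm 1 (ours, under the assumptions of the paper): `sbar_T` is any per-example statistics map with the signature of the previous snippet, $\xi_t$ is drawn geometrically as in Geom-SPIDER-EM, and the refresh is partial whenever `b_tilde < n` (so that $\mathcal{E}_t \neq 0$). All identifiers are ours.

```python
import numpy as np

def geom_spider_em(sbar_T, y, S_init, k_out=20, b=64, rho=0.9, gamma=0.1,
                   gamma0=0.0, b_tilde=None, rng=np.random.default_rng(0)):
    """Sketch of Algorithm 1 with geometric epoch lengths (Geom-SPIDER-EM).
    sbar_T(s, y_batch) returns the per-example statistics sbar_i∘T(s);
    b_tilde < n gives a partial refresh of S^{t,0}, i.e. E_t != 0."""
    n = len(y)
    b_tilde = n if b_tilde is None else b_tilde
    S_hat_prev = S_hat = np.asarray(S_init, dtype=float)   # hat S^{1,-1} = hat S^{1,0}
    ref = rng.choice(n, size=b_tilde, replace=False)       # Line 2 (full sum iff b_tilde = n)
    S = sbar_T(S_hat_prev, y[ref]).mean(axis=0)            # S^{1,0}
    for t in range(1, k_out + 1):                          # Line 3: epochs
        xi_t = rng.geometric(1.0 - rho)                    # Xi_t ~ G*(1-rho) on {1,2,...}
        for _ in range(xi_t):                              # Line 4: inner loops
            idx = rng.integers(0, n, size=b)               # Line 5: mini-batch B_{t,k+1}
            # Line 6: same examples i for both points -> correlated, variance reduced
            S = S + (sbar_T(S_hat, y[idx]) - sbar_T(S_hat_prev, y[idx])).mean(axis=0)
            # Line 7: hat S^{t,k+1} = hat S^{t,k} + gamma (S^{t,k+1} - hat S^{t,k})
            S_hat_prev, S_hat = S_hat, S_hat + gamma * (S - S_hat)
        S_hat_prev = S_hat                                 # Line 8: hat S^{t+1,-1}
        ref = rng.choice(n, size=b_tilde, replace=False)   # Line 9: refresh S^{t+1,0}
        S = sbar_T(S_hat_prev, y[ref]).mean(axis=0)
        S_hat = S_hat_prev + gamma0 * (S - S_hat_prev)     # Line 10
    return S_hat

# usage, continuing the Online-EM sketch: geom_spider_em(sbar_T, y, online_em(y))
```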
Let us now establish complexity bounds for Geom-SPIDER-EM. We analyze a randomized terminating iteration $\Xi^\star$ [27] and discuss how to choose $k_{\mathrm{out}}$, $b$ and $\xi_1, \dots, \xi_{k_{\mathrm{out}}}$ as functions of the number $n$ of observations and an accuracy $\epsilon > 0$ in order to reach $\epsilon$-approximate stationarity, i.e. $\mathbb{E}[\|h(\widehat{S}^{\Xi^\star})\|^2] \leq \epsilon$. To this end, we endow the probability space $(\Omega, \mathcal{A}, \mathbb{P})$ with the sigma-fields $\mathcal{F}_{1,0} = \sigma(\mathcal{E}_1)$, $\mathcal{F}_{t,0} = \sigma(\mathcal{F}_{t-1,\xi_{t-1}}, \mathcal{E}_t)$ for $t \geq 2$, and $\mathcal{F}_{t,k+1} \stackrel{\text{def}}{=} \sigma(\mathcal{F}_{t,k}, \mathcal{B}_{t,k+1})$ for $t \in [k_{\mathrm{out}}]^\star$, $k \in [\xi_t - 1]$. For a r.v. $\Xi_t \sim \mathcal{G}^\star(1-\rho_t)$, set $\mathbb{E}_t[\phi(\Xi_t) | \mathcal{F}_{t,0}] \stackrel{\text{def}}{=} (1-\rho_t) \sum_{k \geq 1} \rho_t^{k-1} \mathbb{E}[\phi(k) | \mathcal{F}_{t,0}]$ for any bounded measurable function $\phi$.
Theorem 1. Assume A1 to A3. For any $t \in [k_{\mathrm{out}}]^\star$, let $\rho_t \in (0,1)$ and $\Xi_t \sim \mathcal{G}^\star(1-\rho_t)$. Run Algorithm 1 with $\gamma_{t,k+1} = \gamma_t > 0$ and $\xi_t \leftarrow \Xi_t$ for any $t \in [k_{\mathrm{out}}]^\star$, $k \geq 0$. Then, for any $t \in [k_{\mathrm{out}}]^\star$,

$$ \frac{v_{\min} \gamma_t}{2(1-\rho_t)} \mathbb{E}_t\left[ \|h(\widehat{S}^{t,\Xi_t-1})\|^2 \,|\, \mathcal{F}_{t,0} \right] \leq W(\widehat{S}^{t,0}) - \mathbb{E}_t\left[ W(\widehat{S}^{t,\Xi_t}) \,|\, \mathcal{F}_{t,0} \right] + \frac{v_{\max} \gamma_t}{2(1-\rho_t)} \|\mathcal{E}_t\|^2 + \frac{v_{\max} \gamma_t \gamma_{t,0}^2}{2(1-\rho_t)} \frac{L^2}{b} \|\Delta \widehat{S}^{t,0}\|^2 + N_t\, \mathbb{E}_t\left[ \|\Delta \widehat{S}^{t,\Xi_t}\|^2 \,|\, \mathcal{F}_{t,0} \right], $$

where $\Delta \widehat{S}^{t,\xi} \stackrel{\text{def}}{=} S^{t,\xi} - \widehat{S}^{t,\xi-1}$, $L^2 \stackrel{\text{def}}{=} n^{-1} \sum_{i=1}^n L_i^2$, and

$$ N_t \stackrel{\text{def}}{=} -\frac{\gamma_t}{2(1-\rho_t)} \left( v_{\min} - \gamma_t L_{\dot{W}} - v_{\max} L^2 \frac{\rho_t \gamma_t^2}{(1-\rho_t)\, b} \right). $$

Theorem 1 is the key result from which our conclusions are drawn; its proof is adapted from [18, Section 8] (also see [28]).
Let us discuss the rate of convergence and the complexity of Geom-SPIDER-EM in the following case: for any $t \in [k_{\mathrm{out}}]^\star$, the mean number of inner loops is $(1-\rho_t)^{-1} = k_{\mathrm{in}}$, $\gamma_{t,0} = 0$ and $\gamma_t = \alpha/L$ for $\alpha > 0$ satisfying

$$ v_{\min} - \alpha \frac{L_{\dot{W}}}{L} - \alpha^2 v_{\max} \frac{k_{\mathrm{in}}}{b} \left( 1 - \frac{1}{k_{\mathrm{in}}} \right) > 0. $$
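This condition is exactly $N_t < 0$ in Theorem 1 written out for these choices (a one-line check of ours): substituting $\gamma_t = \alpha/L$ and $\rho_t/(1-\rho_t) = k_{\mathrm{in}} - 1$ into the bracket of $N_t$ gives

$$ v_{\min} - \gamma_t L_{\dot{W}} - v_{\max} L^2 \frac{\rho_t \gamma_t^2}{(1-\rho_t)\, b} = v_{\min} - \alpha \frac{L_{\dot{W}}}{L} - \alpha^2 v_{\max} \frac{k_{\mathrm{in}} - 1}{b}, $$

and $k_{\mathrm{in}} - 1 = k_{\mathrm{in}} (1 - 1/k_{\mathrm{in}})$, so the last term coincides with the one displayed above.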
Linear rate. When $\Xi \sim \mathcal{G}^\star(1-\rho)$, we have

$$ \rho\, \mathbb{E}[D_\Xi] \leq \rho\, \mathbb{E}[D_\Xi] + (1-\rho) D_0 = \mathbb{E}[D_{\Xi-1}] \tag{7} $$

for any positive sequence $\{D_k, k \geq 0\}$. Hence Theorem 1 implies

$$ \mathbb{E}_t\left[ \|h(\widehat{S}^{t,\Xi_t})\|^2 \,|\, \mathcal{F}_{t,0} \right] \leq \frac{2L}{v_{\min} \alpha (k_{\mathrm{in}} - 1)} \left( W(\widehat{S}^{t,0}) - \min W \right) + \frac{v_{\max}}{v_{\min}} \frac{k_{\mathrm{in}}}{k_{\mathrm{in}} - 1} \|\mathcal{E}_t\|^2. \tag{8} $$

Hence, when $\|\mathcal{E}_t\| = 0$ and $W$ satisfies a Polyak-Łojasiewicz condition [29], i.e.

$$ \exists \tau > 0,\ \forall s \in \mathbb{R}^q, \qquad W(s) - \min W \leq \tau \|\nabla W(s)\|^2, \tag{9} $$

then (8) yields, using $\|\nabla W(s)\|^2 = \|B(s) h(s)\|^2 \leq v_{\max}^2 \|h(s)\|^2$,

$$ H_t \stackrel{\text{def}}{=} \mathbb{E}_t\left[ \|h(\widehat{S}^{t,\Xi_t})\|^2 \,|\, \mathcal{F}_{t,0} \right] \leq \frac{2 L \tau v_{\max}^2}{v_{\min} \alpha (k_{\mathrm{in}} - 1)} \|h(\widehat{S}^{t-1,\xi_{t-1}})\|^2, $$

thus establishing a linear rate of the algorithm along the path $\{\widehat{S}^{t,\Xi_t}, t \in [k_{\mathrm{out}}]^\star\}$ as soon as $k_{\mathrm{in}}$ is large enough:

$$ \mathbb{E}[H_t] \leq \left( \frac{2 L \tau v_{\max}^2}{v_{\min} \alpha (k_{\mathrm{in}} - 1)} \right)^t \|h(\widehat{S}_{\mathrm{init}})\|^2. $$

Even if the Polyak-Łojasiewicz condition (9) is quite restrictive, the above discussion gives the intuition of the lock-in phenomenon which often happens at convergence: a linear rate of convergence is observed when the path is trapped in a neighborhood of its limiting point, which may be the consequence of the fact that, locally, the Polyak-Łojasiewicz condition holds (see Fig. 1 in Section 4).
Complexity for $\epsilon$-approximate stationarity. From Theorem 1 (take $\|\mathcal{E}_t\| = 0$, as in the full-refresh case), Eq. (7) and $\widehat{S}^{t,\Xi_t} = \widehat{S}^{t+1,0}$ (here $\gamma_{t,0} = 0$), it holds

$$ \frac{v_{\min} \alpha (k_{\mathrm{in}} - 1)}{2L} \mathbb{E}[H_t] \leq \mathbb{E}\left[ W(\widehat{S}^{t,0}) - W(\widehat{S}^{t+1,0}) \right]. $$

Therefore,

$$ \frac{1}{k_{\mathrm{out}}} \sum_{t=1}^{k_{\mathrm{out}}} \mathbb{E}[H_t] \leq \frac{2L \left( W(\widehat{S}_{\mathrm{init}}) - \min W \right)}{v_{\min} \alpha (k_{\mathrm{in}} - 1)\, k_{\mathrm{out}}}. \tag{10} $$

Eq. (10) establishes that, in order to obtain an $\epsilon$-approximate stationary point, it is sufficient to stop the algorithm at the end of epoch $T$, where $T$ is sampled uniformly in $[k_{\mathrm{out}}]^\star$ with $k_{\mathrm{out}} = O(\epsilon^{-1} L / k_{\mathrm{in}})$, and to return $\widehat{S}^{T,\xi_T}$. To do so, the mean number of conditional expectation evaluations is $K_{\mathrm{CE}} \stackrel{\text{def}}{=} n + n k_{\mathrm{out}} + 2 b k_{\mathrm{in}} k_{\mathrm{out}}$, and the mean number of optimization steps is $K_{\mathrm{Opt}} \stackrel{\text{def}}{=} k_{\mathrm{out}} + k_{\mathrm{in}} k_{\mathrm{out}}$. By choosing $k_{\mathrm{in}} = O(\sqrt{n})$ and $b = O(\sqrt{n})$, we have $K_{\mathrm{CE}} = O(L \sqrt{n}\, \epsilon^{-1})$ and $K_{\mathrm{Opt}} = O(L \epsilon^{-1})$. Similar randomized terminating strategies were proposed in the literature: their optimal complexity in terms of conditional expectation evaluations is $O(\epsilon^{-2})$ for Online-EM [15], $O(\epsilon^{-1} n)$ for i-EM [14], $O(\epsilon^{-1} n^{2/3})$ for sEM-vr [16, 17], $O(\{\epsilon^{-1} n^{2/3}\} \wedge \{\epsilon^{-3/2} \sqrt{n}\})$ for FIEM [17, 22] and $O(\epsilon^{-1} \sqrt{n})$ for SPIDER-EM - see [18, Section 6] for a comparison of the complexities $K_{\mathrm{CE}}$ and $K_{\mathrm{Opt}}$ of these incremental EM algorithms. Hence, Geom-SPIDER-EM has the same complexity bounds as SPIDER-EM, and they are optimal among the class of incremental EM algorithms.
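As a back-of-the-envelope illustration of these counters (our sketch; the Lipschitz constant $L$ is treated as a known input and set to 1), the tuning $b = \sqrt{n}$, $k_{\mathrm{in}} = \sqrt{n}/2$, $k_{\mathrm{out}} = \lceil L/(\epsilon k_{\mathrm{in}}) \rceil$ gives:

```python
import math

def geom_spider_em_costs(n, eps, L=1.0):
    """Mean counters of Section 3 under b = sqrt(n), k_in = sqrt(n)/2,
    k_out = ceil(L / (eps * k_in)); L plays the role of the Lipschitz
    constant of Section 2 and is set to 1 for illustration."""
    b = math.isqrt(n)                              # mini-batch size, O(sqrt(n))
    k_in = max(1, b // 2)                          # mean number of inner loops
    k_out = math.ceil(L / (eps * k_in))            # number of epochs
    K_CE = n + n * k_out + 2 * b * k_in * k_out    # conditional expectation evaluations
    K_Opt = k_out + k_in * k_out                   # parameter updates (M-steps)
    return K_CE, K_Opt

# n = 6e4 (MNIST) and eps = 1e-3: K_CE = O(L sqrt(n)/eps), K_Opt = O(L/eps)
print(geom_spider_em_costs(60000, 1e-3))
```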
4. NUMERICAL ILLUSTRATION
We perform experiments on the MNIST data set, which consists of $n = 6 \times 10^4$ images of handwritten digits, each of $28 \times 28$ pixels. We pre-process the data as detailed in [22, Section 5]: uninformative pixels are removed from each image, and then a principal component analysis is applied to further reduce the dimension; we keep the leading principal components of each observation. The learning problem consists in fitting a Gaussian mixture model with $g = 12$ components: $\theta$ collects the weights of the mixture, the expectations of the components (i.e. $g$ vectors in the reduced space) and a full covariance matrix, common to all components; here, $R = 0$ (no penalty term). All the algorithms start from the same initial value $\widehat{S}_{\mathrm{init}} = \bar{s}(\theta_{\mathrm{init}})$, and their first two epochs are Online-EM. The first epoch with a variance reduction technique is therefore epoch 3, where the plots in Fig. 1 start.

The proposed Geom-SPIDER-EM is run with a constant step size $\gamma_{t,k} = \gamma$ (and $\gamma_{t,0} = 0$); $k_{\mathrm{out}} = 148$ epochs (which are preceded by the two epochs of Online-EM); a mini-batch size $b = \sqrt{n}$. Different strategies are considered for the initialization $S^{t,0}$ and the parameter of the geometric r.v. $\Xi_t$. In full-geom, $k_{\mathrm{in}} = \sqrt{n}/2$, so that the mean total number of conditional expectation evaluations per outer loop is $2 b k_{\mathrm{in}} = n$; and $\mathcal{E}_t = 0$, which means that $S^{t,0}$ requires the computation of the full sum $\bar{s}$ over $n$ terms. In half-geom, $k_{\mathrm{in}}$ is defined as in full-geom, but for all $t \in [k_{\mathrm{out}}]^\star$, $S^{t,0} = (2/n) \sum_{i \in \tilde{\mathcal{B}}_{t,0}} \bar{s}_i \circ T(\widehat{S}^{t,-1})$, where $\tilde{\mathcal{B}}_{t,0}$ is of cardinality $n/2$; therefore $\mathcal{E}_t \neq 0$. In quad-geom, a quadratic growth is considered both for the mean of the geometric random variables and for the size of the mini-batch when computing $S^{t,0} = \tilde{b}_t^{-1} \sum_{i \in \tilde{\mathcal{B}}_{t,0}} \bar{s}_i \circ T(\widehat{S}^{t,-1})$: the refresh batch size $\tilde{b}_t$ grows quadratically with $t$ (proportionally to $20\, t^2$, floored at a fixed fraction of $n$ and capped at $n$), and $\mathbb{E}[\xi_t] = \tilde{b}_t / (2b)$. The g-SPIDER-EM with a constant number of inner loops $\xi_t = k_{\mathrm{in}} = n/(2b)$ is also run for comparison; the same strategies for $S^{t,0}$ as above are considered (they correspond to full-ctt, half-ctt and quad-ctt on the plots). Finally, in order to illustrate the benefit of variance reduction, a pure Online-EM is run for the same number of epochs, one epoch corresponding to $\sqrt{n}$ updates of the statistics $\widehat{S}$, each of them requiring a mini-batch $\mathcal{B}_{k+1}$ of size $\sqrt{n}$ (see Eq. (6)).

The algorithms are compared through an estimated quantile of $\|h(\widehat{S}^{t,\Xi_t})\|^2$ over independent realizations: it is plotted versus the number of epochs $t$ in Fig. 1 and versus the number of conditional expectation (CE) evaluations in Fig. 2. They are also compared through the objective function $F$ along the path; the mean value over the independent paths is displayed versus the number of CE evaluations in Fig. 3.

[Fig. 1. Quantile of $\|h(\widehat{S}^{t,\Xi_t})\|^2$ vs the number of epochs.]

[Fig. 2. Quantile of $\|h(\widehat{S}^{t,\Xi_t})\|^2$ vs the number of CE evaluations.]

We first observe that Online-EM has a poor convergence rate, thus justifying the interest of variance reduction techniques, as shown in Fig. 1. Having a persistent bias along iterations when defining $S^{t,0}$, i.e. considering $\tilde{b}_t = n/2$ and therefore $\mathcal{E}_t \neq 0$, is also a bad strategy, as seen in Figs. 1 and 2 for half-ctt and half-geom. For the four other SPIDER-EM strategies, we observe a linear convergence rate in Figs. 1 and 2. The best strategy, both in terms of CE evaluations and in terms of efficiency given a number of epochs, is quad-ctt: a constant and deterministic number of inner loops $\xi_t$, combined with an increasing accuracy when computing $S^{t,0}$; hence, during the first iterations, it is better to reduce the computational cost of the algorithm by considering $\tilde{b}_t \ll n$. When $\mathcal{E}_t = 0$ (i.e. $\tilde{b}_t = n$, so the computational cost of $S^{t,0}$ is maximal), it is possible to reduce the total CE computational cost of the algorithm by considering a random number of inner loops (see full-geom and full-ctt in Figs. 1 and 2). Finally, the strategy which consists in increasing both $\tilde{b}_t$ and the number of inner loops does not look like the best one (see quad-ctt and quad-geom in Fig. 1 to Fig. 3).

[Fig. 3. (Left) $-F$ vs the number of CE evaluations, early phase. (Right) $-F$ vs the number of CE evaluations, final phase.]

5. REFERENCES
[1] K. Slavakis, G. B. Giannakis, and G. Mateos, “Modeling and optimization for big data analytics: (statistical) learning tools for our era of data deluge,” IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 18–31, 2014.
[2] P. Bühlmann, P. Drineas, M. Kane, and M. van der Laan, Handbook of Big Data, Chapman & Hall/CRC Handbooks of Modern Statistical Methods. CRC Press, 2016.
[3] W. Härdle, H. H. S. Lu, and X. Shen, Handbook of Big Data Analytics, Springer, 2018.
[4] K. Slavakis, S. Kim, G. Mateos, and G. B. Giannakis, “Stochastic Approximation vis-a-vis Online Learning for Big Data Analytics [Lecture Notes],” IEEE Signal Processing Magazine, vol. 31, no. 6, pp. 124–129, 2014.
[5] Y. LeCun, B. H. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel, “Handwritten digit recognition with a back-propagation network,” in Advances in Neural Information Processing Systems 2, D. S. Touretzky, Ed., pp. 396–404. Morgan-Kaufmann, 1990.
[6] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[7] G. J. McLachlan and D. Peel, Finite Mixture Models, vol. 299 of Probability and Statistics - Applied Probability and Statistics Section, Wiley, New York, 2000.
[8] J. Jiang, Linear and Generalized Linear Mixed Models and Their Applications, Springer Series in Statistics. Springer, Dordrecht, 2007.
[9] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet Allocation,” J. Mach. Learn. Res., vol. 3, pp. 993–1022, 2003.
[10] D. Kounades-Bastian, L. Girin, X. Alameda-Pineda, S. Gannot, and R. Horaud, “A Variational EM Algorithm for the Separation of Time-Varying Convolutive Audio Mixtures,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 8, pp. 1408–1423, 2016.
[11] K. Weisberg, S. Gannot, and O. Schwartz, “An online multiple-speaker DOA tracking using the Cappé-Moulines recursive expectation-maximization algorithm,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 656–660.
[12] B. Lin, X. Tao, S. Li, L. Dong, and J. Lu, “Variational Bayesian image fusion based on combined sparse representations,” 2016, pp. 1432–1436.
[13] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm,”
Journal of the Royal Statistical Society: Series B (Methodological), vol. 39, no. 1, pp. 1–22, 1977.
[14] R. M. Neal and G. E. Hinton, A View of the EM Algorithm that Justifies Incremental, Sparse, and other Variants, pp. 355–368, Springer Netherlands, Dordrecht, 1998.
[15] O. Cappé and E. Moulines, “On-line Expectation Maximization algorithm for latent data models,” J. Roy. Stat. Soc. B Met., vol. 71, no. 3, pp. 593–613, 2009.
[16] J. Chen, J. Zhu, Y. W. Teh, and T. Zhang, “Stochastic Expectation Maximization with Variance Reduction,” in Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., pp. 7967–7977. 2018.
[17] B. Karimi, H.-T. Wai, E. Moulines, and M. Lavielle, “On the Global Convergence of (Fast) Incremental Expectation Maximization Methods,” in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds., pp. 2837–2847. Curran Associates, Inc., 2019.
[18] G. Fort, E. Moulines, and H.-T. Wai, “A Stochastic Path-Integrated Differential EstimatoR Expectation Maximization Algorithm,” in Advances in Neural Information Processing Systems 33. Curran Associates, Inc., 2020.
[19] A. Benveniste, P. Priouret, and M. Métivier, Adaptive Algorithms and Stochastic Approximations, Springer-Verlag, Berlin, Heidelberg, 1990.
[20] V. S. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint, Cambridge University Press, Cambridge; Hindustan Book Agency, New Delhi, 2008.
[21] B. Delyon, M. Lavielle, and E. Moulines, “Convergence of a Stochastic Approximation version of the EM algorithm,” Ann. Statist., vol. 27, no. 1, pp. 94–128, 1999.
[22] G. Fort, P. Gach, and E. Moulines, “Fast Incremental Expectation Maximization for non-convex finite-sum optimization: non asymptotic convergence bounds,” Tech. Rep., HAL-02617725v1, 2020.
[23] P. Liang and D. Klein, “Online EM for unsupervised models,” in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2009, pp. 611–619.
[24] A. Gunawardana and W. Byrne, “Convergence theorems for generalized alternating minimization procedures,” J. Mach. Learn. Res., vol. 6, pp. 2049–2073, 2005.
[25] Z. Li, H. Bao, X. Zhang, and P. Richtárik, “PAGE: A Simple and Optimal Probabilistic Gradient Estimator for Nonconvex Optimization,” Tech. Rep., arXiv 2008.10898, 2020.
[26] S. Horvath, L. Lei, P. Richtarik, and M. I. Jordan, “Adaptivity of Stochastic Gradient Methods for Nonconvex Optimization,” Tech. Rep., arXiv 2002.05359, 2020.
[27] S. Ghadimi and G. Lan, “Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming,” SIAM J. Optimiz., vol. 23, no. 4, pp. 2341–2368, 2013.
[28] G. Fort, E. Moulines, and H.-T. Wai, “GEOM-SPIDER-EM: Faster Variance Reduced Stochastic Expectation Maximization for Nonconvex Finite-Sum Optimization,” Tech. Rep., 2020, supplementary material, available at https://perso.math.univ-toulouse.fr/gfort/publications-2/technical-report/.
[29] H. Karimi, J. Nutini, and M. Schmidt, “Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2016, pp. 795–811.

Supplementary material of the paper “GEOM-SPIDER-EM: Faster Variance Reduced Stochastic Expectation Maximization for Nonconvex Finite-Sum Optimization”
6. PROOF OF THEOREM 1
Let $\{\mathcal{E}_t, t \in [k_{\mathrm{out}}]^\star\}$ and $\{\mathcal{B}_{t,k+1}, t \in [k_{\mathrm{out}}]^\star, k \in [\xi_t - 1]\}$ be random variables defined on the probability space $(\Omega, \mathcal{A}, \mathbb{P})$. Define the filtrations $\mathcal{F}_{1,0} = \sigma(\mathcal{E}_1)$, $\mathcal{F}_{t,0} = \sigma(\mathcal{F}_{t-1,\xi_{t-1}}, \mathcal{E}_t)$ for $t \geq 2$, and $\mathcal{F}_{t,k+1} \stackrel{\text{def}}{=} \sigma(\mathcal{F}_{t,k}, \mathcal{B}_{t,k+1})$ for $t \in [k_{\mathrm{out}}]^\star$, $k \in [\xi_t - 1]$. For $\rho_t \in (0,1)$, set

$$ \mathbb{E}_t[\phi(\Xi_t) | \mathcal{F}_{t,0}] \stackrel{\text{def}}{=} (1-\rho_t) \sum_{k \geq 1} \rho_t^{k-1} \mathbb{E}[\phi(k) | \mathcal{F}_{t,0}] $$

for any measurable positive function $\phi$. We restate the assumptions of the main text.

A1. $\Theta \subseteq \mathbb{R}^d$ is a convex set. $(\mathsf{Z}, \mathcal{Z})$ is a measurable space and $\mu$ is a $\sigma$-finite positive measure on $\mathsf{Z}$. The functions $R: \Theta \to \mathbb{R}$, $\phi: \Theta \to \mathbb{R}^q$, $\psi: \Theta \to \mathbb{R}$, $s_i: \mathsf{Z} \to \mathbb{R}^q$, $h_i: \mathsf{Z} \to \mathbb{R}_+$ for all $i \in [n]^\star$ are measurable. For any $\theta \in \Theta$ and $i \in [n]^\star$, $|L_i(\theta)| < \infty$.

A2. The expectations $\bar{s}_i(\theta)$ are well defined for all $\theta \in \Theta$ and $i \in [n]^\star$. For any $s \in \mathbb{R}^q$, $\operatorname{Argmin}_{\theta \in \Theta} (\psi(\theta) - \langle s, \phi(\theta) \rangle + R(\theta))$ is a (non empty) singleton denoted by $\{T(s)\}$.

A3. The functions $\phi$, $\psi$ and $R$ are continuously differentiable on $\Theta_v$, where $\Theta_v$ is a neighborhood of $\Theta$. $T$ is continuously differentiable on $\mathbb{R}^q$. The function $F$ is continuously differentiable on $\Theta_v$ and, for any $\theta \in \Theta$, $\nabla F(\theta) = -\nabla \phi(\theta)^T \bar{s}(\theta) + \nabla \psi(\theta) + \nabla R(\theta)$. For any $s \in \mathbb{R}^q$, $B(s) \stackrel{\text{def}}{=} \nabla(\phi \circ T)(s)$ is a symmetric $q \times q$ matrix and there exist $0 < v_{\min} \leq v_{\max} < \infty$ such that, for all $s \in \mathbb{R}^q$, the spectrum of $B(s)$ is in $[v_{\min}, v_{\max}]$. For any $i \in [n]^\star$, $\bar{s}_i \circ T$ is globally Lipschitz on $\mathbb{R}^q$ with constant $L_i$. The function $s \mapsto \nabla(F \circ T)(s) = B(s)(\bar{s} \circ T(s) - s)$ is globally Lipschitz on $\mathbb{R}^q$ with constant $L_{\dot{W}}$.
Lemma 2. Let $\rho \in (0,1)$ and $\{D_k, k \geq 0\}$ be real numbers such that $\sum_{k \geq 0} \rho^k |D_k| < \infty$. Let $\xi \sim \mathcal{G}^\star(1-\rho)$. Then

$$ \mathbb{E}[D_{\xi-1}] = \rho\, \mathbb{E}[D_\xi] + (1-\rho) D_0 = \mathbb{E}[D_\xi] + (1-\rho)\left( D_0 - \mathbb{E}[D_\xi] \right). $$

Proof. By definition of $\xi$,

$$ \mathbb{E}[D_\xi] = (1-\rho) \sum_{k \geq 1} \rho^{k-1} D_k = \rho^{-1}(1-\rho) \sum_{k \geq 1} \rho^k D_k = \rho^{-1}(1-\rho) \sum_{k \geq 0} \rho^k D_k - \rho^{-1}(1-\rho) D_0 $$
$$ = \rho^{-1}(1-\rho) \sum_{k \geq 1} \rho^{k-1} D_{k-1} - \rho^{-1}(1-\rho) D_0 = \rho^{-1} \mathbb{E}[D_{\xi-1}] - \rho^{-1}(1-\rho) D_0. $$

This yields $\rho\, \mathbb{E}[D_\xi] = \mathbb{E}[D_{\xi-1}] - (1-\rho) D_0$ and concludes the proof.
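A quick Monte Carlo sanity check of Lemma 2 (our snippet; the sequence $D_k = \cos k$ is an arbitrary bounded choice):

```python
import numpy as np

# Monte Carlo check of Lemma 2 with D_k = cos(k)
rng = np.random.default_rng(0)
rho = 0.8
D = lambda k: np.cos(k)
xi = rng.geometric(1.0 - rho, size=10**6)      # xi ~ G*(1-rho), support {1,2,...}
lhs = D(xi - 1).mean()                         # E[D_{xi-1}]
rhs = rho * D(xi).mean() + (1.0 - rho) * D(0)  # rho E[D_xi] + (1-rho) D_0
print(lhs, rhs)                                # agree up to Monte Carlo error
```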
Lemma 3. For any $t \in [k_{\mathrm{out}}]^\star$, $k \in [\xi_t]^\star$, $\mathcal{B}_{t,k}$ and $\mathcal{F}_{t,k-1}$ are independent. In addition, for any $s \in \mathbb{R}^q$, $b^{-1} \mathbb{E}\left[ \sum_{i \in \mathcal{B}_{t,k}} \bar{s}_i \circ T(s) \right] = \bar{s} \circ T(s)$. Finally, assume that for any $i \in [n]^\star$, $\bar{s}_i \circ T$ is globally Lipschitz with constant $L_i$. Then for any $s, s' \in \mathbb{R}^q$,

$$ \mathbb{E}\left\| \frac{1}{b} \sum_{i \in \mathcal{B}_{t,k}} \left\{ \bar{s}_i \circ T(s) - \bar{s}_i \circ T(s') \right\} - \bar{s} \circ T(s) + \bar{s} \circ T(s') \right\|^2 \leq \frac{1}{b} \left( L^2 \|s - s'\|^2 - \|\bar{s} \circ T(s) - \bar{s} \circ T(s')\|^2 \right), $$

where $L^2 \stackrel{\text{def}}{=} n^{-1} \sum_{i=1}^n L_i^2$.

Proof. See [18, Lemma 4]; the proof holds true when $\mathcal{B}_{t,k}$ is sampled with or without replacement.
Proposition 4. For any $t \in [k_{\mathrm{out}}]^\star$, $k \in [\xi_t - 1]$,

$$ \mathbb{E}\left[ S^{t,k+1} | \mathcal{F}_{t,k} \right] - \bar{s} \circ T(\widehat{S}^{t,k}) = S^{t,k} - \bar{s} \circ T(\widehat{S}^{t,k-1}), \qquad \mathbb{E}\left[ S^{t,k+1} - \bar{s} \circ T(\widehat{S}^{t,k}) \,|\, \mathcal{F}_{t,0} \right] = \mathcal{E}_t. $$

Proof. Let $t \in [k_{\mathrm{out}}]^\star$, $k \in [\xi_t - 1]$. By Lemma 3,

$$ \mathbb{E}\left[ S^{t,k+1} | \mathcal{F}_{t,k} \right] = S^{t,k} + \bar{s} \circ T(\widehat{S}^{t,k}) - \bar{s} \circ T(\widehat{S}^{t,k-1}). $$

By definition of $S^{t,0}$ and of the filtrations, $S^{t,0} - \bar{s} \circ T(\widehat{S}^{t,-1}) = \mathcal{E}_t \in \mathcal{F}_{t,0}$. The proof follows by induction on $k$.

Proposition 5. Assume that for any $i \in [n]^\star$, $\bar{s}_i \circ T$ is globally Lipschitz with constant $L_i$. For any $t \in [k_{\mathrm{out}}]^\star$, $k \in [\xi_t - 1]$,

$$ \mathbb{E}\left[ \|S^{t,k+1} - \mathbb{E}[S^{t,k+1} | \mathcal{F}_{t,k}]\|^2 \,|\, \mathcal{F}_{t,k} \right] \leq \frac{1}{b} \left( L^2 \|\widehat{S}^{t,k} - \widehat{S}^{t,k-1}\|^2 - \|\bar{s} \circ T(\widehat{S}^{t,k}) - \bar{s} \circ T(\widehat{S}^{t,k-1})\|^2 \right) \leq \frac{L^2}{b} \gamma_{t,k}^2 \|S^{t,k} - \widehat{S}^{t,k-1}\|^2, $$

where $L^2 = n^{-1} \sum_{i=1}^n L_i^2$.

Proof. Let $t \in [k_{\mathrm{out}}]^\star$, $k \in [\xi_t - 1]$. By Lemma 3, Proposition 4, and the definitions of $S^{t,k+1}$ and of the filtration $\mathcal{F}_{t,k}$,

$$ S^{t,k+1} - \mathbb{E}[S^{t,k+1} | \mathcal{F}_{t,k}] = S^{t,k+1} - \bar{s} \circ T(\widehat{S}^{t,k}) - S^{t,k} + \bar{s} \circ T(\widehat{S}^{t,k-1}) = \frac{1}{b} \sum_{i \in \mathcal{B}_{t,k+1}} \left\{ \bar{s}_i \circ T(\widehat{S}^{t,k}) - \bar{s}_i \circ T(\widehat{S}^{t,k-1}) \right\} - \bar{s} \circ T(\widehat{S}^{t,k}) + \bar{s} \circ T(\widehat{S}^{t,k-1}). $$

We then conclude by Lemma 3 for the first inequality, and by using the definition of $\widehat{S}^{t,k}$ (namely $\widehat{S}^{t,k} - \widehat{S}^{t,k-1} = \gamma_{t,k} (S^{t,k} - \widehat{S}^{t,k-1})$) for the second one.

Proposition 6. Assume that for any $i \in [n]^\star$, $\bar{s}_i \circ T$ is globally Lipschitz with constant $L_i$. For any $t \in [k_{\mathrm{out}}]^\star$, $k \in [\xi_t - 1]$,

$$ \mathbb{E}\left[ \|S^{t,k+1} - \bar{s} \circ T(\widehat{S}^{t,k})\|^2 \,|\, \mathcal{F}_{t,k} \right] \leq \frac{L^2}{b} \gamma_{t,k}^2 \|S^{t,k} - \widehat{S}^{t,k-1}\|^2 + \|S^{t,k} - \bar{s} \circ T(\widehat{S}^{t,k-1})\|^2, $$

where $L^2 = n^{-1} \sum_{i=1}^n L_i^2$.

Proof. By definition of the conditional expectation, we have for any square-integrable r.v. $U$ and any measurable function $\phi(V)$,

$$ \mathbb{E}\left[ \|U - \phi(V)\|^2 | V \right] = \mathbb{E}\left[ \|U - \mathbb{E}[U|V]\|^2 | V \right] + \|\mathbb{E}[U|V] - \phi(V)\|^2. $$

The proof follows from this equality and Propositions 4 and 5.
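The bias identity of Proposition 4 is easy to check numerically (our snippet; the arrays `new` and `old` stand for arbitrary values of the per-example statistics $\bar{s}_i \circ T(\widehat{S}^{t,k})$ and $\bar{s}_i \circ T(\widehat{S}^{t,k-1})$):

```python
import numpy as np

# Check: E[S^{t,k+1} | F_{t,k}] - sbar∘T(Shat^{t,k}) = S^{t,k} - sbar∘T(Shat^{t,k-1})
rng = np.random.default_rng(0)
n, b, q, m = 200, 16, 5, 20000
new = rng.standard_normal((n, q))    # stands for sbar_i∘T(Shat^{t,k}), i in [n]*
old = rng.standard_normal((n, q))    # stands for sbar_i∘T(Shat^{t,k-1})
S_tk = rng.standard_normal(q)        # current approximation S^{t,k}

acc = np.zeros(q)
for _ in range(m):                   # estimate E[S^{t,k+1} | F_{t,k}]
    idx = rng.integers(0, n, size=b)
    acc += S_tk + (new[idx] - old[idx]).mean(axis=0)
lhs = acc / m - new.mean(axis=0)
rhs = S_tk - old.mean(axis=0)
print(np.abs(lhs - rhs).max())       # small, up to Monte Carlo error
```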
Corollary 7.
Assume that for any $i \in [n]^\star$, $\bar{s}_i \circ T$ is globally Lipschitz with constant $L_i$. For any $t \in [k_{\mathrm{out}}]^\star$, let $\rho_t \in (0,1)$ and $\Xi_t \sim \mathcal{G}^\star(1-\rho_t)$. Then for any $t \in [k_{\mathrm{out}}]^\star$,

$$ \mathbb{E}_t\left[ (\gamma_{t,\Xi_t} - \rho_t \gamma_{t,\Xi_t+1}) \|S^{t,\Xi_t} - \bar{s} \circ T(\widehat{S}^{t,\Xi_t-1})\|^2 \,|\, \mathcal{F}_{t,0} \right] \leq \frac{L^2 \rho_t}{b} \mathbb{E}_t\left[ \gamma_{t,\Xi_t+1} \gamma_{t,\Xi_t}^2 \|S^{t,\Xi_t} - \widehat{S}^{t,\Xi_t-1}\|^2 \,|\, \mathcal{F}_{t,0} \right] + \frac{L^2 (1-\rho_t)}{b} \gamma_{t,1} \gamma_{t,0}^2 \|S^{t,0} - \widehat{S}^{t,-1}\|^2 + (1-\rho_t) \gamma_{t,1} \|\mathcal{E}_t\|^2, $$

where $L^2 = n^{-1} \sum_{i=1}^n L_i^2$.

Proof. Let $t \in [k_{\mathrm{out}}]^\star$ and $k \in [\xi_t - 1]$. From Proposition 6 and since $\mathcal{F}_{t,0} \subseteq \mathcal{F}_{t,k}$ for $k \in [\xi_t - 1]$, we have

$$ \mathbb{E}\left[ \|S^{t,k+1} - \bar{s} \circ T(\widehat{S}^{t,k})\|^2 \,|\, \mathcal{F}_{t,0} \right] \leq \mathbb{E}\left[ \|S^{t,k} - \bar{s} \circ T(\widehat{S}^{t,k-1})\|^2 + \frac{L^2}{b} \gamma_{t,k}^2 \|S^{t,k} - \widehat{S}^{t,k-1}\|^2 \,\Big|\, \mathcal{F}_{t,0} \right]. $$

Multiply by $\gamma_{t,k+1}$ and apply with $k = \xi_t - 1$:

$$ \mathbb{E}\left[ \gamma_{t,\xi_t} \|S^{t,\xi_t} - \bar{s} \circ T(\widehat{S}^{t,\xi_t-1})\|^2 \,|\, \mathcal{F}_{t,0} \right] \leq \mathbb{E}\left[ \gamma_{t,\xi_t} \|S^{t,\xi_t-1} - \bar{s} \circ T(\widehat{S}^{t,\xi_t-2})\|^2 \,|\, \mathcal{F}_{t,0} \right] + \frac{L^2}{b} \mathbb{E}\left[ \gamma_{t,\xi_t} \gamma_{t,\xi_t-1}^2 \|S^{t,\xi_t-1} - \widehat{S}^{t,\xi_t-2}\|^2 \,|\, \mathcal{F}_{t,0} \right]. $$

This implies

$$ \mathbb{E}_t\left[ \gamma_{t,\Xi_t} \|S^{t,\Xi_t} - \bar{s} \circ T(\widehat{S}^{t,\Xi_t-1})\|^2 \,|\, \mathcal{F}_{t,0} \right] \leq \mathbb{E}_t\left[ \gamma_{t,\Xi_t} \|S^{t,\Xi_t-1} - \bar{s} \circ T(\widehat{S}^{t,\Xi_t-2})\|^2 \,|\, \mathcal{F}_{t,0} \right] + \frac{L^2}{b} \mathbb{E}_t\left[ \gamma_{t,\Xi_t} \gamma_{t,\Xi_t-1}^2 \|S^{t,\Xi_t-1} - \widehat{S}^{t,\Xi_t-2}\|^2 \,|\, \mathcal{F}_{t,0} \right]. $$

By Lemma 2, we have

$$ \mathbb{E}_t\left[ \gamma_{t,\Xi_t} \|S^{t,\Xi_t-1} - \bar{s} \circ T(\widehat{S}^{t,\Xi_t-2})\|^2 \,|\, \mathcal{F}_{t,0} \right] = \rho_t \mathbb{E}_t\left[ \gamma_{t,\Xi_t+1} \|S^{t,\Xi_t} - \bar{s} \circ T(\widehat{S}^{t,\Xi_t-1})\|^2 \,|\, \mathcal{F}_{t,0} \right] + (1-\rho_t) \gamma_{t,1} \mathbb{E}\left[ \|S^{t,0} - \bar{s} \circ T(\widehat{S}^{t,-1})\|^2 \,|\, \mathcal{F}_{t,0} \right]; $$

by definition of $S^{t,0}$ and $\mathcal{F}_{t,0}$, the last expectation is equal to $\|\mathcal{E}_t\|^2$. By Lemma 2 again,

$$ \mathbb{E}_t\left[ \gamma_{t,\Xi_t} \gamma_{t,\Xi_t-1}^2 \|S^{t,\Xi_t-1} - \widehat{S}^{t,\Xi_t-2}\|^2 \,|\, \mathcal{F}_{t,0} \right] = \rho_t \mathbb{E}_t\left[ \gamma_{t,\Xi_t+1} \gamma_{t,\Xi_t}^2 \|S^{t,\Xi_t} - \widehat{S}^{t,\Xi_t-1}\|^2 \,|\, \mathcal{F}_{t,0} \right] + (1-\rho_t) \gamma_{t,1} \gamma_{t,0}^2 \|S^{t,0} - \widehat{S}^{t,-1}\|^2. $$

This concludes the proof.
Lemma 8.
For any h, s, S ∈ R q and any q × q symmetric matrix B , it holds − h Bh, S i = − h BS, S i − h
Bh, h i + h B { h − S } , h − S i . Proposition 9.
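Lemma 8 is a purely algebraic identity; a short numerical check (ours):

```python
import numpy as np

# Numerical check of the identity of Lemma 8 for a random symmetric B
rng = np.random.default_rng(0)
q = 4
A = rng.standard_normal((q, q))
B = A + A.T                                       # symmetric q x q matrix
h, S = rng.standard_normal(q), rng.standard_normal(q)
lhs = -h @ B @ S
rhs = -0.5 * (S @ B @ S) - 0.5 * (h @ B @ h) + 0.5 * ((h - S) @ B @ (h - S))
print(np.isclose(lhs, rhs))                       # True
```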
Proposition 9. Assume A1 to A3. For any $t \in [k_{\mathrm{out}}]^\star$ and $k \in [\xi_t - 1]$,

$$ \mathbb{E}\left[ W(\widehat{S}^{t,k+1}) \,|\, \mathcal{F}_{t,0} \right] + \frac{v_{\min} \gamma_{t,k+1}}{2} \mathbb{E}\left[ \|h(\widehat{S}^{t,k})\|^2 \,|\, \mathcal{F}_{t,0} \right] \leq \mathbb{E}\left[ W(\widehat{S}^{t,k}) \,|\, \mathcal{F}_{t,0} \right] + \frac{v_{\max} \gamma_{t,k+1}}{2} \mathbb{E}\left[ \|S^{t,k+1} - \bar{s} \circ T(\widehat{S}^{t,k})\|^2 \,|\, \mathcal{F}_{t,0} \right] - \frac{\gamma_{t,k+1}}{2} \left( v_{\min} - \gamma_{t,k+1} L_{\dot{W}} \right) \mathbb{E}\left[ \|S^{t,k+1} - \widehat{S}^{t,k}\|^2 \,|\, \mathcal{F}_{t,0} \right]. $$

Proof. Since $W$ is continuously differentiable with $L_{\dot{W}}$-Lipschitz gradient, for any $s, s' \in \mathbb{R}^q$,

$$ W(s') - W(s) \leq \langle \nabla W(s), s' - s \rangle + \frac{L_{\dot{W}}}{2} \|s' - s\|^2. $$

Set $s' = s + \gamma S$ where $\gamma > 0$ and $S \in \mathbb{R}^q$. Since $\nabla W(s) = -B(s) h(s)$ and $B(s)$ is symmetric, apply Lemma 8 with $h \leftarrow h(s)$, $B \leftarrow B(s)$ and $S = (s' - s)/\gamma$; this yields

$$ W(s + \gamma S) - W(s) \leq -\frac{\gamma}{2} \langle B(s) S, S \rangle - \frac{\gamma}{2} \langle B(s) h(s), h(s) \rangle + \frac{\gamma}{2} \langle B(s)\{h(s) - S\}, h(s) - S \rangle + \frac{L_{\dot{W}} \gamma^2}{2} \|S\|^2. $$

Since $v_{\min} \|a\|^2 \leq \langle B(s) a, a \rangle \leq v_{\max} \|a\|^2$ for any $a \in \mathbb{R}^q$, we have

$$ W(s + \gamma S) - W(s) \leq -\frac{\gamma}{2} v_{\min} \|S\|^2 - \frac{\gamma}{2} v_{\min} \|h(s)\|^2 + \frac{\gamma}{2} v_{\max} \|h(s) - S\|^2 + \frac{L_{\dot{W}} \gamma^2}{2} \|S\|^2. $$

Applying this inequality with $s \leftarrow \widehat{S}^{t,k}$, $\gamma \leftarrow \gamma_{t,k+1}$, $S \leftarrow S^{t,k+1} - \widehat{S}^{t,k}$ (which yields $s + \gamma S = \widehat{S}^{t,k+1}$ and $h(s) - S = \bar{s} \circ T(\widehat{S}^{t,k}) - S^{t,k+1}$), and then taking the conditional expectation, yields the result.
Proposition 10. Assume A1 to A3. For any $t \in [k_{\mathrm{out}}]^\star$,

$$ W(\widehat{S}^{t+1,0}) - W(\widehat{S}^{t+1,-1}) \leq -\frac{\gamma_{t+1,0}}{2} v_{\min} \|h(\widehat{S}^{t+1,-1})\|^2 + \frac{v_{\max} \gamma_{t+1,0}}{2} \|\mathcal{E}_{t+1}\|^2 - \frac{\gamma_{t+1,0}}{2} \left( v_{\min} - \gamma_{t+1,0} L_{\dot{W}} \right) \|S^{t+1,0} - \widehat{S}^{t+1,-1}\|^2. $$

Proof. As in the proof of Proposition 9, we write for any $s, s' \in \mathbb{R}^q$, $W(s') - W(s) \leq \langle \nabla W(s), s' - s \rangle + \frac{L_{\dot{W}}}{2} \|s' - s\|^2$. With Lemma 8, this yields, when $s' = s + \gamma S$ for $\gamma > 0$ and $S \in \mathbb{R}^q$,

$$ W(s + \gamma S) - W(s) \leq -\frac{\gamma}{2} \left( v_{\min} - \gamma L_{\dot{W}} \right) \|S\|^2 - \frac{\gamma}{2} v_{\min} \|h(s)\|^2 + \frac{\gamma}{2} v_{\max} \|h(s) - S\|^2. $$

Apply this inequality with $\gamma \leftarrow \gamma_{t+1,0}$, $s \leftarrow \widehat{S}^{t+1,-1}$, $S \leftarrow S^{t+1,0} - \widehat{S}^{t+1,-1}$ and $h \leftarrow h(s)$ (which yields $s + \gamma S = \widehat{S}^{t+1,0}$ and $h(s) - S = -\mathcal{E}_{t+1}$).

Theorem 11. Assume A1 to A3. For any $t \in [k_{\mathrm{out}}]^\star$, let $\rho_t \in (0,1)$ and $\Xi_t \sim \mathcal{G}^\star(1-\rho_t)$. Finally, choose $\gamma_{t,k+1} = \gamma_t > 0$ for any $k \geq 0$. Then for any $t \in [k_{\mathrm{out}}]^\star$,

$$ \frac{v_{\min} \gamma_t}{2(1-\rho_t)} \mathbb{E}_t\left[ \|h(\widehat{S}^{t,\Xi_t-1})\|^2 \,|\, \mathcal{F}_{t,0} \right] \leq W(\widehat{S}^{t,0}) - \mathbb{E}_t\left[ W(\widehat{S}^{t,\Xi_t}) \,|\, \mathcal{F}_{t,0} \right] + \frac{v_{\max}}{2(1-\rho_t)} \frac{L^2}{b} \gamma_t \gamma_{t,0}^2 \|S^{t,0} - \widehat{S}^{t,-1}\|^2 + \frac{v_{\max}}{2(1-\rho_t)} \gamma_t \|\mathcal{E}_t\|^2 - \frac{\gamma_t}{2(1-\rho_t)} \left( v_{\min} - \gamma_t L_{\dot{W}} - v_{\max} L^2 \frac{\rho_t \gamma_t^2}{(1-\rho_t)\, b} \right) \mathbb{E}_t\left[ \|S^{t,\Xi_t} - \widehat{S}^{t,\Xi_t-1}\|^2 \,|\, \mathcal{F}_{t,0} \right]. $$

Proof.
Apply Proposition 9 with $k \leftarrow \xi_t - 1$ and then set $\xi_t \leftarrow \Xi_t$; this yields

$$ \mathbb{E}_t\left[ W(\widehat{S}^{t,\Xi_t}) \,|\, \mathcal{F}_{t,0} \right] + \frac{v_{\min}}{2} \mathbb{E}_t\left[ \gamma_{t,\Xi_t} \|h(\widehat{S}^{t,\Xi_t-1})\|^2 \,|\, \mathcal{F}_{t,0} \right] \leq \mathbb{E}_t\left[ W(\widehat{S}^{t,\Xi_t-1}) \,|\, \mathcal{F}_{t,0} \right] + \frac{v_{\max}}{2} \mathbb{E}_t\left[ \gamma_{t,\Xi_t} \|S^{t,\Xi_t} - \bar{s} \circ T(\widehat{S}^{t,\Xi_t-1})\|^2 \,|\, \mathcal{F}_{t,0} \right] - \frac{1}{2} \mathbb{E}_t\left[ \gamma_{t,\Xi_t} \left( v_{\min} - \gamma_{t,\Xi_t} L_{\dot{W}} \right) \|S^{t,\Xi_t} - \widehat{S}^{t,\Xi_t-1}\|^2 \,|\, \mathcal{F}_{t,0} \right]. $$

Since $\Xi_t \geq 1$ and $\gamma_{t,k} = \gamma_t$ for any $k \geq 1$, the step sizes above can be replaced by $\gamma_t$. By Lemma 2, it holds

$$ \mathbb{E}_t\left[ W(\widehat{S}^{t,\Xi_t}) \,|\, \mathcal{F}_{t,0} \right] = \mathbb{E}_t\left[ W(\widehat{S}^{t,\Xi_t-1}) \,|\, \mathcal{F}_{t,0} \right] + (1-\rho_t) \left( \mathbb{E}_t\left[ W(\widehat{S}^{t,\Xi_t}) \,|\, \mathcal{F}_{t,0} \right] - W(\widehat{S}^{t,0}) \right). $$

Furthermore, by Corollary 7 applied with $\gamma_{t,\Xi_t} = \gamma_{t,\Xi_t+1} = \gamma_t$,

$$ (1-\rho_t) \gamma_t \mathbb{E}_t\left[ \|S^{t,\Xi_t} - \bar{s} \circ T(\widehat{S}^{t,\Xi_t-1})\|^2 \,|\, \mathcal{F}_{t,0} \right] \leq \frac{L^2 \rho_t}{b} \gamma_t^3 \mathbb{E}_t\left[ \|S^{t,\Xi_t} - \widehat{S}^{t,\Xi_t-1}\|^2 \,|\, \mathcal{F}_{t,0} \right] + \frac{L^2 (1-\rho_t)}{b} \gamma_t \gamma_{t,0}^2 \|S^{t,0} - \widehat{S}^{t,-1}\|^2 + (1-\rho_t) \gamma_t \|\mathcal{E}_t\|^2. $$

Therefore,

$$ \frac{v_{\min} \gamma_t}{2} \mathbb{E}_t\left[ \|h(\widehat{S}^{t,\Xi_t-1})\|^2 \,|\, \mathcal{F}_{t,0} \right] \leq (1-\rho_t) W(\widehat{S}^{t,0}) - (1-\rho_t) \mathbb{E}_t\left[ W(\widehat{S}^{t,\Xi_t}) \,|\, \mathcal{F}_{t,0} \right] + \frac{v_{\max} L^2 \rho_t}{2(1-\rho_t)\, b} \gamma_t^3 \mathbb{E}_t\left[ \|S^{t,\Xi_t} - \widehat{S}^{t,\Xi_t-1}\|^2 \,|\, \mathcal{F}_{t,0} \right] + \frac{v_{\max} L^2}{2b} \gamma_t \gamma_{t,0}^2 \|S^{t,0} - \widehat{S}^{t,-1}\|^2 + \frac{v_{\max} \gamma_t}{2} \|\mathcal{E}_t\|^2 - \frac{\gamma_t}{2} \left( v_{\min} - \gamma_t L_{\dot{W}} \right) \mathbb{E}_t\left[ \|S^{t,\Xi_t} - \widehat{S}^{t,\Xi_t-1}\|^2 \,|\, \mathcal{F}_{t,0} \right]. $$

Dividing by $(1-\rho_t)$ concludes the proof.
Corollary 12 (of Theorem 11). For any $t \in [k_{\mathrm{out}}]^\star$,

$$ \left( \frac{\gamma_t \rho_t}{2(1-\rho_t)} + \frac{\gamma_{t+1,0}}{2} \right) v_{\min} \mathbb{E}_t\left[ \|h(\widehat{S}^{t,\Xi_t})\|^2 \,|\, \mathcal{F}_{t,0} \right] \leq W(\widehat{S}^{t,0}) - \mathbb{E}\left[ W(\widehat{S}^{t+1,0}) \,|\, \mathcal{F}_{t,0} \right] - \frac{\gamma_{t+1,0}}{2} \left( v_{\min} - \gamma_{t+1,0} L_{\dot{W}} \right) \mathbb{E}\left[ \|S^{t+1,0} - \widehat{S}^{t+1,-1}\|^2 \,|\, \mathcal{F}_{t,0} \right] + \frac{v_{\max}}{2(1-\rho_t)} \frac{L^2}{b} \gamma_t \gamma_{t,0}^2 \|S^{t,0} - \widehat{S}^{t,-1}\|^2 + \frac{v_{\max}}{2(1-\rho_t)} \gamma_t \|\mathcal{E}_t\|^2 + \frac{v_{\max} \gamma_{t+1,0}}{2} \mathbb{E}\left[ \|\mathcal{E}_{t+1}\|^2 \,|\, \mathcal{F}_{t,0} \right]. $$

Proof.
Let $t \in [k_{\mathrm{out}}]^\star$. By Proposition 10, since $\widehat{S}^{t,\xi_t} = \widehat{S}^{t+1,-1}$, we have

$$ -\mathbb{E}\left[ W(\widehat{S}^{t,\xi_t}) \,|\, \mathcal{F}_{t,0} \right] \leq -\mathbb{E}\left[ W(\widehat{S}^{t+1,0}) \,|\, \mathcal{F}_{t,0} \right] - \frac{\gamma_{t+1,0}}{2} v_{\min} \mathbb{E}\left[ \|h(\widehat{S}^{t,\xi_t})\|^2 \,|\, \mathcal{F}_{t,0} \right] + \frac{v_{\max} \gamma_{t+1,0}}{2} \mathbb{E}\left[ \|\mathcal{E}_{t+1}\|^2 \,|\, \mathcal{F}_{t,0} \right] - \frac{\gamma_{t+1,0}}{2} \left( v_{\min} - \gamma_{t+1,0} L_{\dot{W}} \right) \mathbb{E}\left[ \|S^{t+1,0} - \widehat{S}^{t+1,-1}\|^2 \,|\, \mathcal{F}_{t,0} \right]. $$

The previous inequality remains true when $\mathbb{E}[W(\widehat{S}^{t,\xi_t}) \,|\, \mathcal{F}_{t,0}]$ is replaced with $\mathbb{E}_t[W(\widehat{S}^{t,\Xi_t}) \,|\, \mathcal{F}_{t,0}]$; and $\mathbb{E}[\|h(\widehat{S}^{t,\xi_t})\|^2 \,|\, \mathcal{F}_{t,0}] = \mathbb{E}_t[\|h(\widehat{S}^{t,\Xi_t})\|^2 \,|\, \mathcal{F}_{t,0}]$. The proof follows from Theorem 11 and the bound (see Lemma 2)

$$ \mathbb{E}_t\left[ \|h(\widehat{S}^{t,\Xi_t-1})\|^2 \,|\, \mathcal{F}_{t,0} \right] \geq \rho_t \mathbb{E}_t\left[ \|h(\widehat{S}^{t,\Xi_t})\|^2 \,|\, \mathcal{F}_{t,0} \right]. $$