Unbiased Simulation for Optimizing Stochastic Function Compositions
Jose Blanchet, Donald Goldfarb, Garud Iyengar, Fengpei Li, Chaoxu Zhou
Abstract
In this paper, we introduce unbiased gradient simulation algorithms for solving convex optimization problems with stochastic function compositions. We show that the unbiased gradient generated by the algorithm has finite variance and finite expected computation cost. We then combine the unbiased gradient simulation with two variance-reduced algorithms (namely SVRG and SCSG) and show that the resulting optimization algorithms exhibit satisfactory convergence properties. Specifically, in the SVRG case, the algorithm with simulated gradients converges linearly to the optimum in expectation, and almost surely under strong convexity. Finally, for the numerical experiments, we apply the algorithms to two important cases of stochastic function composition optimization: maximizing Cox's partial likelihood and training conditional random fields.
1 Introduction

In machine learning, we often encounter the following optimization problem. Let $f_1, \ldots, f_n$ be a sequence of functions from $\mathbb{R}^d$ to $\mathbb{R}$. Our goal is to find an approximate solution of the following optimization problem, also known as the empirical risk minimization (ERM) problem,
\[ \min_x F(x) \triangleq \frac{1}{n}\sum_{i=1}^n f_i(x). \tag{1} \]
The standard method of stochastic gradient descent (SGD) can be described by the following update rule for $t = 1, 2, \ldots$:
\[ x^{(t)} = x^{(t-1)} - \lambda_t \nabla f_{v_t}(x^{(t-1)}), \qquad \mathbb{E}[x^{(t)} \mid x^{(t-1)}] = x^{(t-1)} - \frac{\lambda_t}{n}\sum_{i=1}^n \nabla f_i(x^{(t-1)}), \tag{2} \]
where $v_t$ follows the uniform distribution on $\{1, 2, \ldots, n\}$. SGD and its variance-reduced variants, including SVRG, have been shown to be powerful tools for solving the ERM problem when $n$ is large and computing the full gradient is computationally intensive. However, most of these algorithms implicitly assume that the gradient of each member function $f_i(\cdot)$, $i = 1, \ldots, n$, is easy to obtain. This assumption fails to hold in stochastic composition optimization problems [1] of the form
\[ \min_{x\in\mathbb{R}^p} F(x) \triangleq \frac{1}{n}\sum_{i=1}^n f_i\Big(\frac{1}{m}\sum_{j=1}^m g_j(x)\Big). \tag{3} \]
Problems of this form arise in many areas, such as reinforcement learning, risk-averse learning, graphical models, econometrics, and survival analysis. The current algorithms used to solve this problem are based on biased stochastic gradient oracles, and their convergence rates are either unsatisfactory compared to generic stochastic optimization algorithms or heavily dependent on the numbers of component functions $m$ and $n$. To overcome these drawbacks, we introduce a pair of variance-reduced algorithms that involve the simulation of unbiased stochastic gradients via Multilevel Monte Carlo. The contribution of this paper is two-fold. First, we introduce unbiased gradient simulation algorithms for solving the stochastic composition optimization problem.
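To see concretely why problems like (3) are harder than (1), note that plugging an inner-sample average into a nonlinear outer function produces a biased estimate. The following small Python experiment, with the hypothetical (illustrative, not from this paper) choices $f(y) = y^2$ and $g_w(x) = x + w$ with $w \sim N(0,1)$, shows that the plug-in estimator $f(\bar g_m(x))$ overestimates the target $f(\mathbb{E}_w g_w(x)) = x^2$ by roughly $1/m$:

```python
import random
import statistics

random.seed(0)

def naive_value(x, m):
    # Plug-in estimator f(g_bar_m(x)) with f(y) = y^2 and g_w(x) = x + w,
    # w ~ N(0, 1); its expectation is x^2 + 1/m, not the target x^2.
    g_bar = statistics.fmean(x + random.gauss(0.0, 1.0) for _ in range(m))
    return g_bar ** 2

x = 2.0
target = x ** 2  # f(E_w g_w(x)) = x^2 = 4
for m in (1, 100):
    est = statistics.fmean(naive_value(x, m) for _ in range(20000))
    print(m, est - target)  # empirical bias, close to 1/m
```

The bias decays only at rate $1/m$, so any fixed inner sample size leaves a systematic error; removing this bias entirely, at finite expected cost, is what the multilevel simulation introduced in this paper accomplishes.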
With an unbiased gradient simulation procedure, the stochastic composition optimization problem can be reduced to a generic stochastic optimization problem. We also construct an unbiased gradient simulation algorithm that takes advantage of the finite-sum structure, and we show that the computational cost of the unbiased gradient simulation algorithms is independent of the dimension of the objective function. Secondly, we apply our algorithms to maximize the partial likelihood function in Cox's model, whose computational issues have not been fully addressed in the machine learning literature so far. Specifically, when the sample size is large, solving this problem is known to be computationally intensive because of the cumulative sum structure, involving all data in the risk set, that is present in every component function. Our unbiased gradient simulation algorithms provide an efficient way to collapse this cumulative sum structure, and the variance-reduced gradient methods can further boost the rate of convergence.
In the stochastic composition optimization literature, all existing algorithms are based on biased stochastic gradients. [1] first proposed a generic algorithm for solving (4) with a convergence rate of $O(k^{-1/4})$ for convex objectives and $O(k^{-2/3})$ for strongly convex objectives. This result was improved to $O(k^{-4/5})$ for strongly convex objectives by [2]. Recently, [3] further improved the convergence rate to $O(\rho^{K/(m+n+\kappa)})$ for the finite-sum problem (5) by utilizing a stochastic variance reduced gradient (SVRG) scheme. In this paper, by contrast, we propose an unbiased gradient simulation method that combines recent developments in [4, 5]. In particular, we employ the method proposed in [5], which incorporates a bias-removal randomization scheme into the Multilevel Monte Carlo method proposed in [4]. We then make use of the SVRG algorithm [6], which greatly reduces variance for ERM problems and achieves linear convergence. SVRG has been extended and improved in many works, including but not limited to [7], [8], [9], [10], [11], [12]. SAG [13] and SAGA [14] are two examples of incremental gradient methods that achieve linear convergence.

In section 2, we give some concrete examples that can be formulated as (4) and (5). In section 3, we describe our unbiased gradient simulation algorithms for the stochastic problem (4) and the finite-sum problem (5); based on these two algorithms, we then present the optimization algorithms for both problems. In section 4, we first show that the gradients generated by the simulation algorithms are unbiased, have finite second moments, and have finite expected computation cost. We then show that our variance-reduced algorithms converge linearly to an $\epsilon$-approximate solution in expectation for both problems. In section 5, we implement our algorithms for maximizing Cox's partial likelihood and present our numerical results. We conclude with remarks on possible future work.
Throughout this paper, we consider numerical solutions of the stochastic optimization problem
\[ \min_{x\in\mathbb{R}^p} F(x) \triangleq \mathbb{E}_v f_v(\mathbb{E}_w g_w(x)). \tag{4} \]
Note that the following two problems can be considered as special cases of (4): the finite-sum problem
\[ \min_{x\in\mathbb{R}^p} F_n(x) \triangleq \frac{1}{n}\sum_{i=1}^n f_i\Big(\frac{1}{m_i}\sum_{j=1}^{m_i} g_j(x)\Big), \tag{5} \]
and the mixed problem
\[ \min_{x\in\mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n f_i(\mathbb{E}_w g(x,w)). \tag{6} \]
We will also discuss numerical algorithms for these two special cases. We assume that $f_v : \mathbb{R}^d \to \mathbb{R}$ is $\mu$-strongly convex with $L$-Lipschitz gradients for each component $v$, and that $g_w : \mathbb{R}^p \to \mathbb{R}^d$ for each component $w$. The gradient (with respect to $x$) of each member function $f_v(\cdot)$ for the stochastic problem is $\{\mathbb{E}_w \nabla g_w(x)\}^\intercal \nabla f_v\{\mathbb{E}_w g_w(x)\}$, so that
\[ \nabla F(x) = \{\mathbb{E}_w \nabla g_w(x)\}^\intercal\, \mathbb{E}_v\{\nabla f_v(\mathbb{E}_w g_w(x))\}, \tag{7} \]
where
\[ \nabla g_w(x) = \begin{pmatrix} \frac{\partial [g_w]_1}{\partial [x]_1}(x) & \cdots & \frac{\partial [g_w]_1}{\partial [x]_p}(x) \\ \vdots & \ddots & \vdots \\ \frac{\partial [g_w]_d}{\partial [x]_1}(x) & \cdots & \frac{\partial [g_w]_d}{\partial [x]_p}(x) \end{pmatrix}, \qquad g_w(x) = ([g_w]_1(x), [g_w]_2(x), \cdots, [g_w]_d(x))^\top. \]
We now present the algorithms that simulate unbiased gradients for the stochastic problem (4), the finite-sum problem (5), and the mixed problem (6). They can be considered as variants of [5], which is based on a multilevel randomization technique. The first algorithm we propose simulates an unbiased gradient for problems (4) and (6) while fixing a component $v$ of $f_v(\mathbb{E}_w g(x,w))$. The base level $n_0$ of the estimator can be raised to reduce variance. We first introduce some notation.

Definition 1.
Fix $x \in \mathbb{R}^p$. We define $S(x) = \nabla g_w(x) \in \mathbb{R}^{d\times p}$, $T(x) = g_w(x) \in \mathbb{R}^d$ and $Z(x) = \nabla^2 g_w(x) \in \mathbb{R}^{d\times p\times p}$, where $w$ is random. Specifically, sampling i.i.d. $\{w_i\}_{i\ge 1}$ from the distribution of $w$, we define $S_i(x) = \nabla g_{w_i}(x)$, $T_i(x) = g_{w_i}(x)$ and $Z_i(x) = \nabla^2 g_{w_i}(x)$. Also, we write $\bar S_n(x) = \frac{1}{n}\sum_{i=1}^n S_i(x)$, $\tilde S_n(x) = \frac{1}{n}\sum_{i=n+1}^{2n} S_i(x)$, and similarly for $\bar T_n(x), \tilde T_n(x), \bar Z_n(x), \tilde Z_n(x)$. It follows that, for any $n$,
\[ \bar S_{2n}(x) = \tfrac12\big(\bar S_n(x) + \tilde S_n(x)\big), \quad \bar T_{2n}(x) = \tfrac12\big(\bar T_n(x) + \tilde T_n(x)\big), \quad \bar Z_{2n}(x) = \tfrac12\big(\bar Z_n(x) + \tilde Z_n(x)\big). \tag{8} \]

Algorithm 1
UnbiasedGradient($x, v$)
Input: $x \in \mathbb{R}^p$, $v \in \{1, \ldots, n\}$, base level of estimator $n_0 \ge 0$, rate parameter $1 < \gamma < 2$.
Output: $W(x,v) \in \mathbb{R}^p$, an unbiased estimate of the gradient of $f_v(\mathbb{E}_w g(x,w))$ at point $x$ and component $v$.
Sample $N$ from the geometric distribution with success probability $1-p$, where $p = 0.5^\gamma$.
Sample i.i.d. $\{w_i\}_{1\le i\le 2^{N+n_0+1}}$ from the distribution of $w$ and obtain $\{S_i(x), T_i(x)\}_{1\le i\le 2^{N+n_0+1}}$.
Set $Y_1 = [\bar S_{2^{N+n_0+1}}(x)]^\intercal \cdot \nabla f_v(\bar T_{2^{N+n_0+1}}(x))$.
Set $Y_2 = [\bar S_{2^{N+n_0}}(x)]^\intercal \cdot \nabla f_v(\bar T_{2^{N+n_0}}(x))$.
Set $Y_3 = [\tilde S_{2^{N+n_0}}(x)]^\intercal \cdot \nabla f_v(\tilde T_{2^{N+n_0}}(x))$.
Set $Y_0 = [\bar S_{2^{n_0}}(x)]^\intercal \cdot \nabla f_v(\bar T_{2^{n_0}}(x))$.
Set $W(x,v) = \dfrac{Y_1 - 0.5\cdot(Y_2 + Y_3)}{\tilde p_N} + Y_0$, where $\tilde p_N = (1-p)\cdot p^N$.
Output: $W(x,v)$.

We shall prove in section 4 that Algorithm 1 outputs an unbiased estimate of the gradient of $f_v(\mathbb{E}_w g(x,w))$ for fixed $v$. It follows that if we sample $v \sim v$, then $W(x,v)$ is an unbiased estimate of the gradient of $\mathbb{E}_v f(\mathbb{E}_w g(x,w), v)$. Algorithm 1 is presented here in its most general form, which can be applied to unbiased gradient simulation for all three problems (4), (5) and (6). We also present another algorithm below, tailored for the finite-sum problem (5), where $\mathbb{E}_v f(\mathbb{E}_w g(x,w), v)$ can be written as $\frac{1}{n}\sum_{i=1}^n f_i(\frac{1}{m_i}\sum_{j=1}^{m_i} g_j(x))$. The key change in Algorithm 2 is to truncate the geometric random variable to account for the case where the first algorithm would require more samples than the overall data size. We discuss the details of these algorithms in section 4.

Remark: In Algorithm 2, we truncate the geometric random variable $N$ to the range $\{0, \ldots, n_1 - n_0\}$ and adjust its probability mass function at $n$ from $p^n(1-p)$ to $\tilde p_n = p^n(1-p)(1-p^{n_1-n_0+1})^{-1}$ to account for the truncation.

We now present our algorithms to solve problems (4), (5) and (6). They are based on the unbiased gradient simulation algorithms just introduced, as well as the control variate method for variance reduction.
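As a concrete illustration, Algorithm 1 can be sketched in Python for a one-dimensional toy model. Here we take the hypothetical $g_w(x) = x + w$ with $w \sim N(0,1)$, so every $S_i(x) = 1$ and only the inner averages $T$ need to be simulated; the outer gradient $\nabla f_v$ is passed in as a function. All names are ours, and this is a sketch of the randomized multilevel construction, not the authors' implementation:

```python
import random

def unbiased_gradient(x, grad_f, sample_w, n0=2, gamma=1.5, rng=random):
    """Sketch of Algorithm 1 for a scalar toy model g_w(x) = x + w, so that
    S_i(x) = g'_w(x) = 1 and only the inner averages T must be simulated."""
    p = 0.5 ** gamma                      # success probability of N is 1 - p
    N = 0
    while rng.random() < p:               # N ~ Geometric(1 - p) on {0, 1, ...}
        N += 1
    m = 2 ** (N + n0 + 1)
    w = [sample_w(rng) for _ in range(m)]
    T_full = x + sum(w) / m                      # \bar T at level 2^{N+n0+1}
    T_head = x + sum(w[: m // 2]) / (m // 2)     # \bar T, first half
    T_tail = x + sum(w[m // 2:]) / (m // 2)      # \tilde T, second half
    T_base = x + sum(w[: 2 ** n0]) / (2 ** n0)   # \bar T at base level 2^{n0}
    p_N = (1 - p) * p ** N                       # \tilde p_N
    return (grad_f(T_full) - 0.5 * (grad_f(T_head) + grad_f(T_tail))) / p_N \
        + grad_f(T_base)

rng = random.Random(0)
draws = [unbiased_gradient(2.0, lambda y: 3.0 * y * y,
                           lambda r: r.gauss(0.0, 1.0), rng=rng)
         for _ in range(50000)]
print(sum(draws) / len(draws))  # close to d/dx (E[x + w])^3 = 3 x^2 = 12
```

With $f(y) = y^3$ the naive plug-in estimator would carry an $O(1/m)$ bias, while averages of `unbiased_gradient` draws concentrate around the exact value $3x^2$; each draw has finite expected cost because $\gamma > 1$ and finite variance because $\gamma < 2$.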
In [15], [16], the control variate method is used to generate variance-reduced stochastic gradients for solving $\min_{x\in\mathbb{R}^p} \mathbb{E}_v f(x, v)$. For example, for a function of the form $F(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$, a variance-reduced stochastic gradient at point $x$ with respect to a reference point $\tilde x$ is defined as $\nabla_x f(x,v) - \nabla_x f(\tilde x, v) + \mathbb{E}_v \nabla_x f(\tilde x, v)$, where $v$ is sampled from the distribution of $v$. In contrast to SGD, where the stochastic gradient is simply $\nabla_x f(x,v)$, the variance-reduced algorithms use a constant step size and converge linearly to the optimum in the presence of strong convexity.

Algorithm 2 Unbiased Estimator of the Gradient for Finite-Sum Problems using Multilevel Monte Carlo
Input: $x \in \mathbb{R}^p$, $v \in \{1, \ldots, n\}$, base level of estimator $n_0 \ge 0$, rate parameter $1 < \gamma < 2$.
Output: $W(x,v) \in \mathbb{R}^p$, an unbiased estimator of the gradient of $\frac{1}{n}\sum_{i=1}^n f_i(\frac{1}{m_i}\sum_{j=1}^{m_i} g_j(x))$ in (5) at point $x$.
Sample $N$ from the geometric distribution with success probability $1-p$, where $p = 0.5^\gamma$.
Set $n_1 = \lfloor \log_2(m_v)\rfloor$, $N = N \bmod (n_1 - n_0 + 1)$ and $\tilde p_N = p^N(1-p)(1-p^{n_1-n_0+1})^{-1}$.
if $n_0 \ge n_1$ then
Set $W(x,v) = \{\frac{1}{m_v}\sum_{j=1}^{m_v} [\nabla g_j(x)]^\intercal\}\nabla f_v\{\frac{1}{m_v}\sum_{j=1}^{m_v} g_j(x)\}$
else if $N = n_1 - n_0$ then
Uniformly sample with replacement $\{w_i\}_{1\le i\le 2^{n_1}}$ from $\{1, \ldots, m_v\}$.
Set $Y_1 = \{\frac{1}{m_v}\sum_{j=1}^{m_v}[\nabla g_j(x)]^\intercal\}\nabla f_v\{\frac{1}{m_v}\sum_{j=1}^{m_v} g_j(x)\}$.
Set $Y_2 = \{\frac{1}{2^{n_1}}\sum_{i=1}^{2^{n_1}}[\nabla g_{w_i}(x)]^\intercal\}\nabla f_v\{\frac{1}{2^{n_1}}\sum_{i=1}^{2^{n_1}} g_{w_i}(x)\}$.
Set $Y_0 = [\bar S_{2^{n_0}}(x)]^\intercal \cdot \nabla f_v(\bar T_{2^{n_0}}(x))$.
Set $W(x,v) = \dfrac{Y_1 - Y_2}{\tilde p_N} + Y_0$;
else
Uniformly sample with replacement $\{w_i\}_{1\le i\le 2^{N+n_0+1}}$ from $\{1, \ldots, m_v\}$.
Set $Y_1 = [\bar S_{2^{N+n_0+1}}(x)]^\intercal \cdot \nabla f_v(\bar T_{2^{N+n_0+1}}(x))$.
Set $Y_2 = [\bar S_{2^{N+n_0}}(x)]^\intercal \cdot \nabla f_v(\bar T_{2^{N+n_0}}(x))$.
Set $Y_3 = [\tilde S_{2^{N+n_0}}(x)]^\intercal \cdot \nabla f_v(\tilde T_{2^{N+n_0}}(x))$.
Set $Y_0 = [\bar S_{2^{n_0}}(x)]^\intercal \cdot \nabla f_v(\bar T_{2^{n_0}}(x))$.
Set $W(x,v) = \dfrac{Y_1 - 0.5\cdot(Y_2+Y_3)}{\tilde p_N} + Y_0$
end if
Output: $W(x,v)$.

We adopt these variance reduction techniques in the current setting of optimizing function compositions with simulated unbiased gradients. Specifically, we simulate the unbiased gradients at $x$ and $\tilde x$ simultaneously, using the same set of data, to control the variance. We summarize the details of generating the variance-reduced gradient in Algorithm 3. The procedure in Algorithm 3 is based on the setting of Algorithm 1 for ease of presentation; it can be modified to suit the improved Algorithm 2 as well.

Algorithm 3
SimulatedGradient($x, \tilde x, g(\tilde x), v$)
Input: $x \in \mathbb{R}^p$, $v \in \Omega_v$, reference point $\tilde x \in \mathbb{R}^p$, reference gradient at point $\tilde x$ denoted by $g(\tilde x) \in \mathbb{R}^p$, base level of estimator $n_0 \ge 0$, rate parameter $1 < \gamma < 2$.
Output: $W \in \mathbb{R}^p$, a variance-reduced unbiased estimator of the gradient of $\mathbb{E}_v f(\mathbb{E}_w g(x,w), v)$ at point $x$.
Sample $N$ from the geometric distribution with success probability $1-p$, where $p = 0.5^\gamma$, and let $\tilde p_N = (1-p)\cdot p^N$.
Sample i.i.d. $\{w_i\}_{1\le i\le 2^{N+n_0+1}}$ from the distribution of $w$ and obtain $\{S_i(x), T_i(x)\}_{1\le i\le 2^{N+n_0+1}}$.
Set $Y_1(x) = [\bar S_{2^{N+n_0+1}}(x)]^\intercal \cdot \nabla f_v(\bar T_{2^{N+n_0+1}}(x))$. Set $Y_1(\tilde x) = [\bar S_{2^{N+n_0+1}}(\tilde x)]^\intercal \cdot \nabla f_v(\bar T_{2^{N+n_0+1}}(\tilde x))$.
Set $Y_2(x) = [\bar S_{2^{N+n_0}}(x)]^\intercal \cdot \nabla f_v(\bar T_{2^{N+n_0}}(x))$. Set $Y_2(\tilde x) = [\bar S_{2^{N+n_0}}(\tilde x)]^\intercal \cdot \nabla f_v(\bar T_{2^{N+n_0}}(\tilde x))$.
Set $Y_3(x) = [\tilde S_{2^{N+n_0}}(x)]^\intercal \cdot \nabla f_v(\tilde T_{2^{N+n_0}}(x))$. Set $Y_3(\tilde x) = [\tilde S_{2^{N+n_0}}(\tilde x)]^\intercal \cdot \nabla f_v(\tilde T_{2^{N+n_0}}(\tilde x))$.
Set $Y_0(x) = [\bar S_{2^{n_0}}(x)]^\intercal \cdot \nabla f_v(\bar T_{2^{n_0}}(x))$. Set $Y_0(\tilde x) = [\bar S_{2^{n_0}}(\tilde x)]^\intercal \cdot \nabla f_v(\bar T_{2^{n_0}}(\tilde x))$.
Set $W(x,v) = \dfrac{Y_1(x) - 0.5\cdot\{Y_2(x)+Y_3(x)\}}{\tilde p_N} + Y_0(x)$. Set $W(\tilde x,v) = \dfrac{Y_1(\tilde x) - 0.5\cdot\{Y_2(\tilde x)+Y_3(\tilde x)\}}{\tilde p_N} + Y_0(\tilde x)$.
Set $W = W(x,v) - W(\tilde x,v) + g(\tilde x)$.
Output: $W$.

In the above algorithm, the reference gradient $g(\tilde x)$ can be either the full gradient $\nabla F(\tilde x)$ or an estimate of the full gradient $\nabla F(\tilde x)$. Specifically, when it is efficient to compute full gradients of the objective function for problem (4), we propose the Simulated Variance Reduced Gradient Descent method for solving this problem.

Algorithm 4 Simulated Variance Reduced Gradient Descent (Simulated SVRG)
Inputs:
Number of epochs $T$, number of steps in each epoch $M$, step size $\lambda$, and initial point $\tilde x_0$.
for $s = 1, 2, \ldots, T$ do
$\tilde h = \nabla F(\tilde x_{s-1})$
$x_0 = \tilde x_{s-1}$
for $t = 1, 2, \ldots, M$ do
Sample $v_t$ from the distribution of $v$.
Set $\nu_t = \mathrm{SimulatedGradient}(x_{t-1}, \tilde x_{s-1}, \tilde h, v_t)$.
Update $x_t = x_{t-1} - \lambda \nu_t$.
end for
option I: Output $\tilde x_s = x_M$
option II: Output $\tilde x_s = x_t$ for a randomly chosen $t \in \{0, \ldots, M-1\}$
end for

However, when the full gradient $\nabla F(\tilde x)$ of the objective function (4) cannot be computed efficiently, we estimate the full gradient $\nabla F(\tilde x)$ by sampling unbiased gradients within a batch of indices and taking the average. We summarize the details in the following Stochastically Controlled Simulated Gradient method.

Algorithm 5
Stochastically Controlled Simulated Gradient Descent(Simulated SCSG)
Inputs:
Number of epochs $T$, number of steps in each epoch $M$, batch size $B$, sample size $K$, step size $\lambda$, and initial point $\tilde x_0$.
for $s = 1, 2, \ldots, T$ do
$\tilde x = \tilde x_{s-1}$, $x_0 = \tilde x_{s-1}$
Uniformly sample a batch $\mathcal{I}_s \subset \Omega_v$ according to the distribution of $v$ with $|\mathcal{I}_s| = B$
for $k = 1, 2, \ldots, K$ do
Generate $h_k(\tilde x) = \frac{1}{B}\sum_{v_i \in \mathcal{I}_s} \mathrm{UnbiasedGradient}(\tilde x, v_i)$
end for
Set $\tilde h(\tilde x) = \frac{1}{K}\sum_{i=1}^K h_i(\tilde x)$
for $t = 1, 2, \ldots, M$ do
Sample $v_t$ from the distribution of $v$.
Set $\nu_t = \mathrm{SimulatedGradient}(x_{t-1}, \tilde x_{s-1}, \tilde h(\tilde x), v_t)$.
Update $x_t = x_{t-1} - \lambda \nu_t$.
end for
option I: Output $\tilde x_s = x_M$
option II: Output $\tilde x_s = x_t$ for a randomly chosen $t \in \{0, \ldots, M-1\}$
end for

We will prove the convergence properties of Algorithms 4 and 5 in section 4.
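For intuition, the outer/inner structure of Algorithm 4 can be sketched in Python. For illustration only, we replace the simulated gradient oracle with exact component gradients on a toy strongly convex ERM problem (all names are ours), so the sketch isolates the control-variate update $\nu_t = \nabla f_{v_t}(x_{t-1}) - \nabla f_{v_t}(\tilde x_{s-1}) + \tilde h$ and the epoch structure:

```python
import random

def svrg(grads, full_grad, x0, lam=0.1, epochs=30, M=50, rng=random):
    """Skeleton of Algorithm 4: the outer loop recomputes a reference
    gradient; the inner loop takes variance-reduced steps with it."""
    x_ref = x0
    n = len(grads)
    for _ in range(epochs):
        h_ref = full_grad(x_ref)          # \tilde h = grad F(x~_{s-1})
        x = x_ref                         # x_0 = x~_{s-1}
        for _ in range(M):
            v = rng.randrange(n)          # sample component v_t
            nu = grads[v](x) - grads[v](x_ref) + h_ref
            x -= lam * nu                 # x_t = x_{t-1} - lambda nu_t
        x_ref = x                         # option I: x~_s = x_M
    return x_ref

# toy strongly convex ERM: F(x) = (1/n) sum_i (x - a_i)^2 / 2, minimum at mean(a)
a = [0.5, 1.5, 2.0, 4.0]
grads = [lambda x, ai=ai: x - ai for ai in a]
full_grad = lambda x: sum(x - ai for ai in a) / len(a)
x_star = sum(a) / len(a)
x_hat = svrg(grads, full_grad, 10.0, rng=random.Random(0))
```

With a constant step size, the iterate contracts toward the minimizer in every epoch, which is the linear-convergence behavior that the analysis in section 4 establishes for the simulated-gradient version under strong convexity.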
We now present some important examples of stochastic composition optimization problems.
2.1 Conditional random fields

Conditional random fields (CRFs) [17] are a popular probabilistic model used for structured prediction. They have been used in a number of natural language processing problems, including part-of-speech tagging [17], noun-phrase chunking [18, 19], named entity recognition [20], and image segmentation tasks in computer vision [21]. Given an observation $x \in \mathcal{X}$, the conditional probability of a structured outcome $y \in \mathcal{Y}$ is given by
\[ p(y \mid x; \theta) = \frac{\exp\{\theta^\top F(x,y)\}}{\sum_{y' \in \mathcal{Y}} \exp\{\theta^\top F(x,y')\}}, \tag{9} \]
where $\theta \in \mathbb{R}^p$ is the parameter to be estimated and $F(x,y) \in \mathbb{R}^p$ is a pre-specified feature function depending on the underlying structure of $\mathcal{Y}$. Based on a set of training data $\{(x_i, y_i), i = 1, \ldots, n\}$, the parameter $\theta$ can be estimated by maximizing the log-likelihood function
\[ \max_{\theta\in\mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \log p(y_i \mid x_i; \theta). \tag{10} \]
The key difficulty in computing the objective function value or its gradient lies in the exponential cardinality of $\mathcal{Y}$. When the underlying structure of $\mathcal{Y}$ is a linear chain or a tree, both the objective function value and its gradient can be computed efficiently by dynamic programming (the Viterbi algorithm [22]). In this case, a number of methods can be used to solve (10): deterministic methods such as the iterative scaling algorithm in [17] and L-BFGS [19], and stochastic methods such as SGD in [23] and SAG in [24]. However, the computational issues of CRFs have not been fully addressed when the underlying structure is more complex. In our setting, we can formulate (10) as a composition optimization problem as in (4) by noticing that (10) is equivalent to
\[ \min_\theta \frac{1}{n}\sum_{i=1}^n \Big( \log\Big[\sum_{y'\in\mathcal{Y}} \exp\{\theta^\top F(x_i, y')\}\Big] - \theta^\top F(x_i, y_i) \Big), \]
whose gradient can be written as
\[ \frac{1}{n}\sum_{i=1}^n \bigg( \frac{\sum_{y'\in\mathcal{Y}} \exp\{\theta^\top F(x_i,y')\}\, F(x_i,y')}{\sum_{y'\in\mathcal{Y}} \exp\{\theta^\top F(x_i,y')\}} - F(x_i,y_i) \bigg). \]
Note that this problem is equivalent to
\[ \min_\theta \frac{1}{n}\sum_{i=1}^n \Big( \log\Big[\frac{1}{|\mathcal{Y}|}\sum_{y'\in\mathcal{Y}} \exp\{\theta^\top F(x_i,y')\}\Big] - \theta^\top F(x_i,y_i) + \log|\mathcal{Y}| \Big). \]
Therefore we can view it as a function composition, with the inner function an average over $\mathcal{Y}$, and apply our optimization algorithms to solve this problem.
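For a label set small enough to enumerate, the per-example CRF gradient above is simply the model expectation of the features minus the observed features. A minimal Python sketch (the feature vectors and names are ours and purely illustrative):

```python
import math

def crf_neg_log_lik_grad(theta, feats, y_obs):
    """Gradient of -log p(y_obs | x; theta) for a CRF whose label set is
    small enough to enumerate: E_{p(y'|x;theta)}[F(x, y')] - F(x, y_obs)."""
    scores = [sum(t * f for t, f in zip(theta, F)) for F in feats]
    mx = max(scores)                        # stabilize the softmax
    ws = [math.exp(s - mx) for s in scores]
    Z = sum(ws)
    expect = [sum(w * F[k] for w, F in zip(ws, feats)) / Z
              for k in range(len(theta))]
    return [e - o for e, o in zip(expect, feats[y_obs])]

# toy example: three labels, two features (values are illustrative only)
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta = [0.3, -0.2]
grad = crf_neg_log_lik_grad(theta, feats, y_obs=2)
```

When $|\mathcal{Y}|$ is exponentially large, the sum defining the expectation cannot be enumerated; viewing it as the inner average of a composition, as above, is what lets the unbiased simulation algorithms sample it instead.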
2.2 Cox's partial likelihood

Cox's partial likelihood [25, 26] is a widely used model in survival analysis for censored data. The model assumes
\[ \lambda(t \mid X) = \lambda_0(t)\exp(\beta^\top X), \]
where $\lambda(t \mid X)$ is the hazard function for an individual with covariates $X \in \mathbb{R}^p$ and coefficients $\beta \in \mathbb{R}^p$, and $\lambda_0(t)$ is the baseline hazard function. In the model, let $(X_i, Y_i, \Delta_i)_{1\le i\le n}$ be i.i.d. observations, where $X_i \in \mathbb{R}^p$ are the covariates, $Y_i = \min(T_i, C_i)$ and $\Delta_i = I\{Y_i = T_i\}$, with $T_i$ the true lifetime and $C_i$ the censoring time, independent of $T_i$. For a particular observation $i$, its risk set is defined to be the index set $\{j : Y_j \ge Y_i\}$. The goal is to maximize the partial likelihood, which can be written as the following composition optimization problem as in (4):
\[ \min_{\beta\in\mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \Delta_i\Big[-X_i^\top\beta + \log\Big\{\sum_{j=1}^n I(Y_j \ge Y_i)\exp(X_j^\top\beta)\Big\}\Big], \tag{11} \]
and the gradient of this objective function is
\[ \frac{1}{n}\sum_{i=1}^n \Delta_i\Big[-X_i + \frac{\sum_{j=1}^n I(Y_j \ge Y_i)\exp(X_j^\top\beta)\,X_j}{\sum_{j=1}^n I(Y_j \ge Y_i)\exp(X_j^\top\beta)}\Big]. \tag{12} \]
Note that this problem is equivalent to
\[ \min_{\beta\in\mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \Delta_i\Big[-X_i^\top\beta + \log\Big\{\frac{1}{n}\sum_{j=1}^n I(Y_j \ge Y_i)\exp(X_j^\top\beta)\Big\}\Big]. \]
Now we can view this problem as a composition of functions and apply the proposed algorithms to solve it.

2.3 Solving the Expectation-Maximization (EM) subproblem without posterior sampling
The Expectation-Maximization (EM) algorithm [27] is an iterative procedure for obtaining an MLE of a statistical model in the presence of latent variables (or random effects). Given the observed data $x$, latent data $z$, parameters $\theta \in \mathbb{R}^p$ to be estimated, and likelihood function $L(\theta; x, z) = p(x, z \mid \theta)$, the EM algorithm iteratively performs the following two steps:
• E-step: update $Q(\theta \mid \theta^{(t)}) = \int \log L(\theta; x, z)\, p(z \mid x, \theta^{(t)})\, dz$;
• M-step: $\mathrm{maximize}_{\theta\in\mathbb{R}^p}\; Q(\theta \mid \theta^{(t)})$.
When the latent variable $z$ is high dimensional, due to the difficulty of numerical integration in the E-step, the two steps are combined into a stochastic optimization problem:
\[ \min_{\theta\in\mathbb{R}^p} -\int \log L(\theta; x, z)\, p(z \mid x, \theta^{(t)})\, dz. \tag{13} \]
This problem can be solved by sampling from the posterior distribution and applying a stochastic gradient descent algorithm. However, the Markov chain Monte Carlo (MCMC) algorithms used for posterior sampling can be slow and inaccurate in high-dimensional cases. Therefore, we rewrite the objective function as
\[ -\int \log L(\theta; x, z)\, p(z \mid x, \theta^{(t)})\, dz = -\int \log L(\theta; x, z)\, \frac{p(x \mid z, \theta^{(t)})}{p(x \mid \theta^{(t)})}\, p(z)\, dz = -\int \log L(\theta; x, z)\, \frac{p(x \mid z, \theta^{(t)})}{\int p(x \mid z, \theta^{(t)})\, p(z)\, dz}\, p(z)\, dz, \tag{14} \]
and treat it as minimizing a function composition using the proposed algorithms.

In this section we present the analysis of our algorithms applied to problem (4), the case where one is solving $\min_{x\in\mathbb{R}^p} F(x) \triangleq \mathbb{E}_v f\{\mathbb{E}_w g(x,w), v\}$. The cases (5) and (6) can be analyzed similarly.

Assumption 1. Given the initial point $\tilde x_0 \in \mathbb{R}^p$, there exists a compact set $\mathcal{D}$ such that the sequence of iterates $\{x_k\}_{k\ge 0}$ produced by the algorithms is contained in $\mathcal{D}$.

Assumption 2.
Inside the compact set $\mathcal{D}$, each $f_v(\cdot)$ in the objective function of (4) is $\mu$-strongly convex with $L$-Lipschitz continuous gradients.

Assumption 3.
Inside the compact set $\mathcal{D}$, each $f_v(\cdot)$ is twice continuously differentiable with $L$-Lipschitz continuous second derivatives, and each $g_w(\cdot)$ is twice continuously differentiable.

Remark: Assumption 1 is reasonable for the deterministic SVRG and SCSG algorithms. In the Simulated SVRG and SCSG algorithms, where we use simulated gradients, we can still justify it with small adjustments. For example, if we switch from Simulated SVRG and SCSG to their deterministic counterparts whenever the output $\tilde x_s$ of the algorithm lies outside some compact set $\mathcal{D}_0$, then the convergence results for the algorithms are unaffected, while we may find an appropriate $\mathcal{D} \supseteq \mathcal{D}_0$ where Assumption 1 holds. In practice, by making $\mathcal{D}_0$ large enough, the adjustment will not be necessary.

Definition 2.
We define the supports of the distributions of $v$ and $w$ to be $\Omega_v$ and $\Omega_w$. Let $\mathcal{G} = \{y \in \mathbb{R}^d \mid y = g_w(x),\ x \in \mathcal{D},\ w \in \Omega_w\}$, $\mathcal{H} = \{y \in \mathbb{R}^{d\times p} \mid y = \nabla g_w(x),\ x \in \mathcal{D},\ w \in \Omega_w\}$ and $\mathcal{J} = \{z \in \mathbb{R}^{d\times p\times p} \mid z = \nabla^2 g_w(x),\ x \in \mathcal{D},\ w \in \Omega_w\}$. We define $l_f = \sup_{y\in\mathcal{G}} \sup_{v\in\Omega_v} \sup_{0\le i\le 2} |f_v^{(i)}(y)|$ and $l_g = \sup_{x\in\mathcal{D}} \sup_{w\in\Omega_w} \sup_{0\le j\le 2} |g_w^{(j)}(x)|$, where the upper indices in $f^{(i)}$ and $g^{(j)}$ denote the order of the derivative (they are in fact partial derivatives), and the norm $|\cdot|$ is taken to be the maximum among the partial derivatives. Finally, we set $l_\mathcal{D} = \max\{l_f, l_g, L, 1\}$, so that the norm of any component of the partial derivative functions $f_v^{(i)}(y), g_w^{(j)}(x)$ is bounded by $l_\mathcal{D}$, with Lipschitz constant $l_\mathcal{D}$ for its gradient, for any $x \in \mathcal{D}$, $y \in \mathcal{G}$, $v \in \Omega_v$, $w \in \Omega_w$ and $0 \le i \le 2$, $0 \le j \le 2$. As a consequence of Assumption 1 and Assumption 3, we have $l_\mathcal{D} < \infty$.

Before we proceed to the proofs, we introduce some technical lemmas.

Lemma 1.
Let $f : \mathbb{R}^d \to \mathbb{R}$ be a continuously differentiable function with $L$-Lipschitz continuous gradients. Then
\[ |f(y) - f(x) - \langle \nabla f(x), y - x\rangle| \le \frac{L}{2}\|y - x\|^2. \]

Proof.
Let $g(x) = \frac{L}{2}x^\top x - f(x)$. Since $f(\cdot)$ has $L$-Lipschitz continuous gradients, i.e., $\|\nabla f(y) - \nabla f(x)\| \le L\|y - x\|$, we have $(\nabla f(y) - \nabla f(x))^\top (y-x) \le L\|y-x\|^2$ by Cauchy-Schwarz. This implies
\[ (\nabla g(y) - \nabla g(x))^\top (y-x) = L\|y-x\|^2 - (\nabla f(y) - \nabla f(x))^\top (y-x) \ge 0, \]
which shows that $g(\cdot)$ is convex. The convexity of $g(\cdot)$ implies
\[ 0 \le g(y) - g(x) - \nabla g(x)^\top(y-x) = \frac{L}{2}\|y-x\|^2 - \big(f(y) - f(x) - \langle\nabla f(x), y-x\rangle\big). \]
So we have $f(y) - f(x) - \langle\nabla f(x), y-x\rangle \le \frac{L}{2}\|y-x\|^2$. Since $-f(\cdot)$ also has $L$-Lipschitz continuous gradients, we can substitute $-f(\cdot)$ for $f(\cdot)$ above and deduce that $|f(y) - f(x) - \langle\nabla f(x), y-x\rangle| \le \frac{L}{2}\|y-x\|^2$.

Lemma 2.
Let $\{f_i(\cdot)\}_{1\le i\le n} : \mathbb{R}^d \to \mathbb{R}$ be bounded Lipschitz functions. Then $\sum_{i=1}^n f_i(\cdot)$, $\prod_{i=1}^n f_i(\cdot)$ and $[f_1(\cdot), \ldots, f_n(\cdot)]^\intercal$ are also Lipschitz functions.

Proof.
Suppose first that $|f_i(\cdot)| \le 1$ and $|f_i(x_1) - f_i(x_2)| \le L\|x_1 - x_2\|$ for any $x_1, x_2 \in \mathbb{R}^d$ and $1 \le i \le n$. Then
\[ \Big|\Big(\sum_{i=1}^n f_i\Big)(x_1) - \Big(\sum_{i=1}^n f_i\Big)(x_2)\Big| \le nL\|x_1 - x_2\|. \]
On the other hand,
\[ |(f_1 \cdot f_2)(x_1) - (f_1 \cdot f_2)(x_2)| \le |f_1(x_1)f_2(x_1) - f_1(x_1)f_2(x_2)| + |f_1(x_1)f_2(x_2) - f_1(x_2)f_2(x_2)| \le |f_1(x_1)||f_2(x_1) - f_2(x_2)| + |f_2(x_2)||f_1(x_1) - f_1(x_2)| \le (L + L)\|x_1 - x_2\|. \]
Since $|f_1 \cdot f_2| \le 1$, it follows by induction that $\prod_{i=1}^n f_i(\cdot)$ is Lipschitz continuous with constant $nL$. For the general case where $|f_i(\cdot)| \le M$, applying the argument to each $f_i(\cdot)/M$, the Lipschitz constant of $\prod_{i=1}^n f_i(\cdot)$ becomes $nM^{n-1}L$. Finally, considering the function $[f_1(x), \ldots, f_n(x)]^\intercal : \mathbb{R}^d \to \mathbb{R}^n$,
\[ \|[f_1(x_1), \ldots, f_n(x_1)]^\intercal - [f_1(x_2), \ldots, f_n(x_2)]^\intercal\| = \sqrt{\sum_{j=1}^n (f_j(x_1) - f_j(x_2))^2} \le \sqrt{n}\,L\|x_1 - x_2\|. \]

Lemma 3.
Given a sequence of real numbers $a_i$, $1 \le i \le N$, and a positive integer $N$, we have
\[ \Big|\sum_{i=1}^N a_i\Big|^p \le N^{p-1}\sum_{i=1}^N |a_i|^p. \tag{15} \]

In the following section, we present some properties of the output $W(x,v)$ of Algorithm 1. We first prove the unbiasedness of $W(x,v)$.

Proposition 1.
For any $x \in \mathcal{D}$, if we sample $v \sim v$, then $W(x,v)$ is an unbiased estimate of $\nabla \mathbb{E}_v f_v\{\mathbb{E}_w g_w(x)\}$.

Proof. Fix $v$ and $x \in \mathcal{D}$. We will show that the output $W(x,v)$ is an unbiased estimate of $\nabla f_v\{\mathbb{E}_w g_w(x)\}$. According to Algorithm 1, we have
\begin{align*}
\mathbb{E} W(x,v) &= \sum_{n=0}^\infty \mathbb{E}[W(x,v) \mid N = n]\cdot P(N = n) = \sum_{n=0}^\infty \frac{\mathbb{E}[Y_1 - 0.5(Y_2 + Y_3) \mid N = n]}{\tilde p_n}\cdot \tilde p_n + \mathbb{E} Y_0 \\
&= \sum_{n=0}^\infty \mathbb{E}\Big( [\bar S_{2^{n+n_0+1}}(x)]^\intercal \nabla f_v(\bar T_{2^{n+n_0+1}}(x)) - 0.5\big([\bar S_{2^{n+n_0}}(x)]^\intercal \nabla f_v(\bar T_{2^{n+n_0}}(x)) + [\tilde S_{2^{n+n_0}}(x)]^\intercal \nabla f_v(\tilde T_{2^{n+n_0}}(x))\big)\Big) + \mathbb{E}\big[[\bar S_{2^{n_0}}(x)]^\intercal \nabla f_v(\bar T_{2^{n_0}}(x))\big] \\
&= \sum_{n=0}^\infty \Big( \mathbb{E}\big[[\bar S_{2^{n+n_0+1}}(x)]^\intercal \nabla f_v(\bar T_{2^{n+n_0+1}}(x))\big] - \mathbb{E}\big[[\bar S_{2^{n+n_0}}(x)]^\intercal \nabla f_v(\bar T_{2^{n+n_0}}(x))\big]\Big) + \mathbb{E}\big[[\bar S_{2^{n_0}}(x)]^\intercal \nabla f_v(\bar T_{2^{n_0}}(x))\big] \\
&= \lim_{n\to\infty} \mathbb{E}\big([\bar S_{2^{n+n_0}}(x)]^\intercal \nabla f_v(\bar T_{2^{n+n_0}}(x))\big) = \mathbb{E}[S(x)]^\intercal \cdot \nabla f_v(\mathbb{E}[T(x)]) = [\mathbb{E}_w \nabla g_w(x)]^\intercal \cdot \nabla f_v(\mathbb{E}_w g_w(x)) = \nabla\big(f_v(\mathbb{E}_w g_w(x))\big),
\end{align*}
where the third equality follows from the fact that $\mathbb{E}[[\bar S_{2^{n+n_0}}(x)]^\intercal \nabla f_v(\bar T_{2^{n+n_0}}(x))] = \mathbb{E}[[\tilde S_{2^{n+n_0}}(x)]^\intercal \nabla f_v(\tilde T_{2^{n+n_0}}(x))]$ for any $n$, so the sum telescopes. The next line follows from the continuity of $\nabla f_v(\cdot)$, Assumptions 2-3, and the bounded convergence theorem. Finally, taking the expectation with respect to $v$, we see that $\mathbb{E}_{v\sim v} W(x,v) = \mathbb{E}_v \nabla(f_v(\mathbb{E}_w g_w(x))) = \nabla(\mathbb{E}_v f_v(\mathbb{E}_w g_w(x)))$.

We now proceed to show that $W(x,v)$ has finite variance and finite expected generation cost.

Proposition 2.
For any $x \in \mathcal{D}$, if we sample $v \sim v$, then $W(x,v)$ has finite variance.

Proof.
Fix $v \in \Omega_v$ and $x \in \mathcal{D}$. We will show that $\mathbb{E}\|W(x,v)\|^2 < \infty$. We have
\[ \mathbb{E}\|W(x,v)\|^2 = \sum_{n=0}^\infty \mathbb{E}[\|W(x,v)\|^2 \mid N=n]\cdot P(N=n) \le 2\,\mathbb{E}\|Y_0\|^2 + 2\sum_{n=0}^\infty \frac{\mathbb{E}[\|Y_1 - 0.5(Y_2+Y_3)\|^2 \mid N=n]}{\tilde p_n^2}\cdot \tilde p_n, \tag{16} \]
where the inequality follows from equation (15). To proceed with (16), notice that $2\mathbb{E}\|Y_0\|^2$ can be bounded by $2p\, d^2 l_\mathcal{D}^4$, since $\|Y_0\| = \|[\bar S_{2^{n_0}}]^\intercal \cdot \nabla f_v(\bar T_{2^{n_0}})\| \le \sqrt{p}\cdot d\cdot l_\mathcal{D}^2$ by the definition of $l_\mathcal{D}$.

To bound the second term on the right-hand side of (16), we define the following function: for $x \in \mathcal{H} \subseteq \mathbb{R}^{d\times p}$ and $y \in \mathcal{G} \subseteq \mathbb{R}^d$, define $G : \mathcal{H}\times\mathcal{G} \to \mathbb{R}^p$ by $G(x,y) \triangleq x^\intercal\cdot\nabla f_v(y)$. It is straightforward to compute each component of the gradient $\nabla G(x,y)$:
\[ \frac{\partial [G]_i}{\partial [x]_{kj}}(x,y) = \delta_{ij}\cdot[\nabla f_v(y)]_k = \delta_{ij}\cdot\frac{\partial f_v}{\partial [y]_k}(y) \quad\text{and}\quad \frac{\partial [G]_i}{\partial [y]_h}(x,y) = \sum_{k=1}^d [x]_{ki}\cdot\frac{\partial [\nabla f_v]_k}{\partial [y]_h}(y) = \sum_{k=1}^d [x]_{ki}\cdot\frac{\partial^2 f_v}{\partial [y]_k \partial [y]_h}(y), \]
where $1 \le i, j \le p$, $1 \le k, h \le d$, and $\delta_{ij}$ is the Kronecker delta ($\delta_{ij} = 1$ when $i = j$; $\delta_{ij} = 0$ otherwise). It follows from Assumption 1, Assumption 3 and Lemma 2 that each component of $\nabla G(x,y)$ is Lipschitz continuous with constant $d\, l_\mathcal{D}^2$, and $\nabla[G(x,y)]_i$ is Lipschitz continuous with constant $\sqrt{dp + d}\cdot d\, l_\mathcal{D}^2$. Thus, if we let
\[ R(x, x_0, y, y_0) \triangleq G(x,y) - G(x_0,y_0) - \nabla G(x_0,y_0)\cdot \mathrm{vec}\begin{pmatrix} x - x_0 \\ y - y_0 \end{pmatrix}, \]
then, using Lemma 1, we have
\[ \|R(x,x_0,y,y_0)\| \le \sum_{i=1}^p \Big\|[G(x,y)]_i - [G(x_0,y_0)]_i - \nabla[G(x_0,y_0)]_i\cdot\begin{pmatrix} x - x_0 \\ y - y_0\end{pmatrix}\Big\| \le \frac{p\,\sqrt{dp+d}\cdot d\, l_\mathcal{D}^2}{2}\Big(\sum_{k=1}^d\sum_{j=1}^p ([x]_{kj} - [x_0]_{kj})^2 + \sum_{h=1}^d ([y]_h - [y_0]_h)^2\Big), \tag{17} \]
for any $x, x_0 \in \mathcal{H}$ and $y, y_0 \in \mathcal{G}$. Now we can bound the second term in (16). Expanding each of the three averages as $G(\bar S, \bar T) = G(\mathbb{E}S(x), \mathbb{E}T(x)) + \nabla G(\mathbb{E}S(x), \mathbb{E}T(x))\cdot \mathrm{vec}\big(\begin{smallmatrix}\bar S - \mathbb{E}S(x)\\ \bar T - \mathbb{E}T(x)\end{smallmatrix}\big) + R$ and using equation (8), the zeroth- and first-order terms cancel and only the remainders survive:
\begin{align*}
\sum_{n=0}^\infty \frac{\mathbb{E}[\|Y_1 - 0.5(Y_2+Y_3)\|^2 \mid N=n]}{\tilde p_n} &= \sum_{n=0}^\infty \frac{1}{\tilde p_n}\,\mathbb{E}\big\| R(\bar S_{2^{n+n_0+1}}(x), \mathbb{E}S(x), \bar T_{2^{n+n_0+1}}(x), \mathbb{E}T(x)) \\
&\qquad - 0.5\,R(\bar S_{2^{n+n_0}}(x), \mathbb{E}S(x), \bar T_{2^{n+n_0}}(x), \mathbb{E}T(x)) - 0.5\,R(\tilde S_{2^{n+n_0}}(x), \mathbb{E}S(x), \tilde T_{2^{n+n_0}}(x), \mathbb{E}T(x))\big\|^2 \\
&\le \sum_{n=0}^\infty \frac{1}{\tilde p_n}\cdot\frac{3\,p^2(dp+d)\,d^2 l_\mathcal{D}^4}{4}(d+dp)\Big( \sum_{h=1}^d \Big[\mathbb{E}([\bar T_{2^{n+n_0+1}}(x)]_h - [\mathbb{E}T(x)]_h)^4 \\
&\qquad + \tfrac14\mathbb{E}([\bar T_{2^{n+n_0}}(x)]_h - [\mathbb{E}T(x)]_h)^4 + \tfrac14\mathbb{E}([\tilde T_{2^{n+n_0}}(x)]_h - [\mathbb{E}T(x)]_h)^4\Big] \\
&\qquad + \sum_{k=1}^d\sum_{j=1}^p \Big[\mathbb{E}([\bar S_{2^{n+n_0+1}}(x)]_{kj} - [\mathbb{E}S(x)]_{kj})^4 + \tfrac14\mathbb{E}([\bar S_{2^{n+n_0}}(x)]_{kj} - [\mathbb{E}S(x)]_{kj})^4 \\
&\qquad + \tfrac14\mathbb{E}([\tilde S_{2^{n+n_0}}(x)]_{kj} - [\mathbb{E}S(x)]_{kj})^4\Big]\Big), \tag{18}
\end{align*}
where the last inequality follows from equations (15) and (17). To proceed with (18), we notice that $\{[S_i(x)]_{kj}, [T_i(x)]_h\}_{i\ge 1}$ are i.i.d. samples with finite fourth central moments ($\max\{\mathbb{E}([S_1(x)]_{kj} - [\mathbb{E}S(x)]_{kj})^4,\ \mathbb{E}([T_1(x)]_h - [\mathbb{E}T(x)]_h)^4\} \le l_\mathcal{D}^4$) for any $1 \le j \le p$, $1 \le k, h \le d$.
Thus we can use the Cauchy-Schwarz inequality and equation (15) to derive
\begin{align*}
\mathbb{E}([\bar S_n(x)]_{kj} - [\mathbb{E}S(x)]_{kj})^4 &= \frac{\sum_{i=1}^n \mathbb{E}([S_i(x)]_{kj} - [\mathbb{E}S(x)]_{kj})^4}{n^4} + \frac{6\sum_{i=1}^n\sum_{l=i+1}^n \mathbb{E}\big[([S_i(x)]_{kj} - [\mathbb{E}S(x)]_{kj})^2([S_l(x)]_{kj} - [\mathbb{E}S(x)]_{kj})^2\big]}{n^4} \\
&\le \frac{\sum_{i=1}^n \mathbb{E}([S_i(x)]_{kj} - [\mathbb{E}S(x)]_{kj})^4}{n^4} + \frac{6\sum_{i=1}^n\sum_{l=i+1}^n \sqrt{\mathbb{E}([S_i(x)]_{kj} - [\mathbb{E}S(x)]_{kj})^4}\,\sqrt{\mathbb{E}([S_l(x)]_{kj} - [\mathbb{E}S(x)]_{kj})^4}}{n^4} \\
&\le l_\mathcal{D}^4\Big(\frac{1}{n^3} + \frac{3(n-1)}{n^3}\Big) \le \frac{3\,l_\mathcal{D}^4}{n^2}, \tag{19}
\end{align*}
for any $1 \le j \le p$, $1 \le k \le d$ and $n \ge 1$. The same result holds for $\mathbb{E}([\bar T_n(x)]_h - [\mathbb{E}T(x)]_h)^4$, where $1 \le h \le d$. Now, using (19) with sample sizes $2^{n+n_0+1}$ and $2^{n+n_0}$, each bracket in (18) is at most $\tfrac94\, l_\mathcal{D}^4\, 4^{-(n+n_0)}$, so we continue from (18) to get
\[ \frac{\mathbb{E}[\|Y_1 - 0.5(Y_2+Y_3)\|^2 \mid N=n]}{\tilde p_n^2}\cdot\tilde p_n \le \frac{1}{\tilde p_n}\cdot\frac{27}{16}\, p^2 d^2 (dp+d)(d+dp)\, l_\mathcal{D}^8\cdot 4^{-(n+n_0)}. \tag{20} \]
Define $C'_\mathcal{D} \triangleq 2p\,d^2 l_\mathcal{D}^4 + \frac{27}{8}\cdot 4^{-n_0}\, p^2 d^2 (dp+d)(d+dp)\, l_\mathcal{D}^8\cdot(1-0.5^\gamma)^{-1}(1-2^{\gamma-2})^{-1}$. Notice that $0 < C'_\mathcal{D} < \infty$ because $\gamma < 2$. Now (16) becomes
\[ \mathbb{E}\|W(x,v)\|^2 \le 2p\,d^2 l_\mathcal{D}^4 + 2\cdot\frac{27}{16}\cdot 4^{-n_0}\, p^2 d^2 (dp+d)(d+dp)\, l_\mathcal{D}^8\cdot\Big(\sum_{n=0}^\infty \frac{4^{-n}}{\tilde p_n}\Big) = C'_\mathcal{D}, \tag{21} \]
following from the fact that $\tilde p_n = (1-0.5^\gamma)\cdot 0.5^{\gamma n}$ and the definition of $C'_\mathcal{D}$. It is worth noting that the convergence of the series above relies on $\gamma < 2$, since $\sum_{n=0}^\infty 4^{-n}/\tilde p_n = (1-0.5^\gamma)^{-1}\sum_{n=0}^\infty 2^{(\gamma-2)n} = (1-0.5^\gamma)^{-1}(1-2^{\gamma-2})^{-1}$. Finally, considering the case where we sample $v \sim v$, we have
\[ \mathrm{Var}(W(x,v)) \le \mathbb{E}_{v\sim v}\|W(x,v)\|^2 = \mathbb{E}\big[\mathbb{E}[\|W(x,v)\|^2 \mid v]\big] \le \mathbb{E}[C'_\mathcal{D} \mid v] = C'_\mathcal{D} < \infty. \]

Proposition 3.
For any $x \in \mathcal{D}$ and $v \in \Omega_v$, the number of random variables one needs to generate (the simulation cost) in order to construct $W(x, v)$ has finite expectation. Proof.
Fix $v \in \Omega_v$ and $x \in \mathcal{D}$, and denote by $\mathrm{cost}_W$ the number of random variables one needs to generate in order to construct $W(x, v)$. In Algorithm 1, we generate one geometric random variable $N$ and $2^{N + n_0 + 1}$ samples $w_i \sim w$. Thus we have $\mathrm{cost}_W = 1 + 2^{N + n_0 + 1}$. Taking expectation w.r.t. $N$, we conclude
$$\mathbb{E}[\mathrm{cost}_W] = \mathbb{E}\big[\mathbb{E}[\mathrm{cost}_W \mid N]\big] = \sum_{n=0}^{\infty} \mathbb{E}[\mathrm{cost}_W \mid N = n]\cdot p(n) = \sum_{n=0}^{\infty} (1 + 2^{n + n_0 + 1})\cdot(1 - 0.5^{\gamma})\cdot 0.5^{\gamma n} = 1 + 2^{n_0 + 1}\cdot(1 - 0.5^{\gamma})\cdot(1 - 2^{1 - \gamma})^{-1} < \infty,$$
where the convergence of the series above relies on $\gamma > 1$. In the following section, we present a property of the output $W = W(x, v) - W(\tilde x, v) + g(\tilde x)$ from Algorithm 3 that is important in the proofs of the convergence rates of Algorithms 4 and 5. We want to show that there exists $C_D < \infty$ such that for any $v \in \Omega_v$ and $x, \tilde x \in \mathcal{D}$, we have $\mathbb{E}\|W(x, v) - W(\tilde x, v)\|^2 \le C_D \|x - \tilde x\|^2$. In order to do so, we first introduce a couple of lemmas. Lemma 4 (Azuma-Hoeffding). Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables such that $|X_i| \le L$ for all $1 \le i \le n$. Then for any $t >$
0, we have
$$P\Big(\Big|\sum_{i=1}^{n} X_i - n\,\mathbb{E}[X]\Big| > t\Big) \le 2\cdot\exp\Big(-\frac{t^2}{2 n L^2}\Big), \quad (22)$$
which implies
$$P\big(|\bar X_n - \mathbb{E}[X]| > t\big) \le 2\cdot\exp\Big(-\frac{n t^2}{2 L^2}\Big) \quad (23)$$
for any $t > 0$. Lemma 5.
Define $\mathrm{diam}(\mathcal{D}) \triangleq \sup\{\|x - \tilde x\|_{\infty} \mid x, \tilde x \in \mathcal{D}\}$ and $C''_D \triangleq 2^{-0.5p}\, l_D^2\, (2\cdot\mathrm{diam}(\mathcal{D}))^p + 16\, p\, l_D^2$. Then
$$\mathbb{E}\Big[\sup_{\zeta \in \mathcal{D}} \big|[\bar S_n(\zeta)]_{kj} - [\mathbb{E}S(\zeta)]_{kj}\big|^2\Big] \le C''_D\,\Big(\frac{\log(4n)}{n}\Big), \quad (24)$$
$$\mathbb{E}\Big[\sup_{\zeta \in \mathcal{D}} \big|[\bar T_n(\zeta)]_{h} - [\mathbb{E}T(\zeta)]_{h}\big|^2\Big] \le C''_D\,\Big(\frac{\log(4n)}{n}\Big), \quad (25)$$
$$\mathbb{E}\Big[\sup_{\zeta \in \mathcal{D}} \big|[\bar Z_n(\zeta)]_{kij} - [\mathbb{E}Z(\zeta)]_{kij}\big|^2\Big] \le C''_D\,\Big(\frac{\log(4n)}{n}\Big), \quad (26)$$
for any $n \ge$
1, 1 ≤ k, h ≤ d and 1 ≤ i, j ≤ p . Proof.
Fix 1 ≤ k ≤ d and 1 ≤ j ≤ p , we prove equation (24). It follows from Definition 2 that for any w ∈ Ω w ,sup x, ˜ x ∈D , ≤ j ≤ p ≤ k ≤ d {| ∂ [ g w ] k ∂ [ x ] j ( x ) − ∂ [ g w ] k ∂ [ x ] j (˜ x ) |} ≤ l D k x − ˜ x k ≤ l D · √ p k x − ˜ x k ∞ . (27)11t also follows from Definition 2 that diam ( D ) < ∞ . Consequently, we can find a set Γ ⊂ R p with | Γ | ≤ (cid:0) · diam ( D ) ǫ (cid:1) p such that for any ζ ∈ D , there exists γ ∈ Γ with k ζ − γ k ∞ ≤ ǫ . It then follows from (27) that: | [ ¯ S n ] kj ( ζ ) − [ E S ( ζ )] kj | = | n n X i =1 ∂ [ g w i ] k ∂ [ x ] j ( ζ ) − E w ∂ [ g w ] k ∂ [ x ] j ( ζ ) |≤| n n X i =1 ∂ [ g w i ] k ∂ [ x ] j ( γ ) − E w ∂ [ g w ] k ∂ [ x ] j ( γ ) | + 2 ǫ √ pl D = | [ ¯ S n ] kj ( γ ) − [ E S ( γ )] kj | + 2 ǫ √ pl D . Thus, fix δ > < ǫ < min { diam ( D ) , δ √ pl D } , we can denote the elements in Γ by { γ , ..., γ | Γ | } and write: P (cid:26) sup ζ ∈D | [ ¯ S n ] kj ( ζ ) − [ E S ( ζ )] kj | ≥ δ (cid:27) ≤ P (cid:26) max ≤ i ≤| Γ | | [ ¯ S n ] kj ( γ i ) − [ E S ( γ i )] kj | + 2 ǫ √ pl D ≥ δ (cid:27) ≤ | Γ | X i =1 P (cid:26) | [ ¯ S n ] kj ( γ i ) − [ E S ( γ i )] kj | ≥ δ − ǫ √ pl D (cid:27) ≤ | Γ | X i =1 · exp {− n ( δ − ǫ √ pl D ) l D } ≤ (cid:0) · diam ( D ) ǫ (cid:1) p · exp {− n ( δ − ǫ √ pl D ) l D } , (28)where the third line follows from Lemma 4, the Azuma-Hoeffding inequality. Now we prove (24): E (cid:20) sup ζ ∈D | [ ¯ S n ] kj ( ζ ) − [ E S ( ζ )] kj | (cid:21) ≤ (2 l D ) P { sup ζ ∈D | [ ¯ S n ] kj ( ζ ) − [ E S ( ζ )] kj | ≥ δ } + δ P { sup ζ ∈D | [ ¯ S n ] kj ( ζ ) − [ E S ( ζ )] kj | < δ }≤ (2 l D ) (cid:0) · diam ( D ) ǫ (cid:1) p · exp {− n ( δ − ǫ √ pl D ) l D } + δ = (2 l D ) (2 · diam ( D )) p · exp {− n ( δ − ǫ √ pl D ) l D + p · log ( 1 ǫ ) } + δ , (29)where the second line follows from (28). Now take δ = √ pl D · √ log (2 n +2) √ n and ǫ = √ n . Without loss of generality,we assume diam ( D ) > < ǫ < min { diam ( D ) , δ √ pl D } . 
Then (29) becomes: E (cid:20) sup ζ ∈D | [ ¯ S n ] kj ( ζ ) − [ E S ( ζ )] kj | (cid:21) ≤ (2 l D ) (2 · diam ( D )) p · exp {− p ( p log (2 n + 2) − + p · log ( √ n ) } + 4 · p l D ( log (2 n + 2) n ) ≤ (2 l D ) (2 · diam ( D )) p · exp {− p log (2 n + 2)) + p · log ( √ n ) } + 4 · p l D ( log (2 n + 2) n ) ≤ (2) − . p ( l D ) (2 · diam ( D )) p · n + 2) p + 4 · p l D ( log (2 n + 2) n ) ≤ C ′′D (cid:0) log (2 n + 2) n (cid:1) ≤ C ′′D (cid:0) log (4 n ) n (cid:1) (30)where C ′′D , (2) − . p ( l D ) (2 · diam ( D )) p +4 · p l D ( log (2 n +2) n ) . The second inequality follows from that ( x − ≥ x when x ≥
2. The third inequality follows from $-2p\cdot\log(2n + 2) + p\cdot\log(2n) \le -p\cdot\log(2n + 2)$. The last two inequalities follow from $p \ge 1$ and $n \ge 1$. Notice that $g_w(x)$ and its second-order derivatives are also $l_D$-Lipschitz continuous for all $w \in \Omega_w$, according to Assumption 3 and Definition 2. Thus (27) becomes
$$\sup_{x, \tilde x \in \mathcal{D},\; 1 \le k \le d} \big\{|[g_w]_k(x) - [g_w]_k(\tilde x)|\big\} \le l_D\cdot\sqrt{p}\,\|x - \tilde x\|_{\infty},$$
$$\sup_{x, \tilde x \in \mathcal{D},\; 1 \le i, j \le p,\; 1 \le k \le d} \Big\{\Big|\frac{\partial^2 [g_w]_k}{\partial [x]_i\,\partial [x]_j}(x) - \frac{\partial^2 [g_w]_k}{\partial [x]_i\,\partial [x]_j}(\tilde x)\Big|\Big\} \le l_D\cdot\sqrt{p}\,\|x - \tilde x\|_{\infty},$$
Theorem 1.
There exists a constant $C_D < \infty$ such that for any $v \in \Omega_v$ and $x, \tilde x \in \mathcal{D}$, the $W(x, v)$ and $W(\tilde x, v)$ from the variance-reduced unbiased gradient $W = W(x, v) - W(\tilde x, v) + g(\tilde x)$ in Algorithm 3 satisfy
$$\mathbb{E}\|W(x, v) - W(\tilde x, v)\|^2 \le C_D \|x - \tilde x\|^2. \quad (31)$$
Proof.
Fix v ∈ Ω v and x, ˜ x ∈ D , we have W ( x, v ) − W (˜ x, v ) = 1˜ p N (cid:18) Y ( x ) − Y (˜ x ) − . · (cid:16) Y ( x ) − Y (˜ x ) + Y ( x ) − Y (˜ x ) (cid:17)(cid:19) + Y ( x ) − Y (˜ x )and it follows that E k W ( x, v ) − W (˜ x, v ) k = ∞ X n =0 E [ k W ( x, v ) − W (˜ x, v ) k | N = n ] · ˜ p n = ∞ X n =0 p X i =1 E [([ W ( x, v )] i − [ W (˜ x, v )] i ) | N = n ] · ˜ p n ≤ p X i =1 E ([ Y ( x )] i − [ Y (˜ x )] i ) + ∞ X n =0 p X i =1 p n E [ (cid:18) [ Y ( x )] i − [ Y (˜ x )] i − . · (cid:16) [ Y ( x )] i − [ Y (˜ x )] i + [ Y ( x )] i − [ Y (˜ x )] i (cid:17)(cid:19) | N = n ]= p X i =1 E ( ∇ [ Y ( ρ i )] ⊺ i ( x − ˜ x )) + ∞ X n =0 p X i =1 p n · E [ (cid:18) ∇ (cid:16) [ Y ( ξ i )] i − . · ([ Y ( ξ i )] i + [ Y ( ξ i )] i ) (cid:17) ⊺ ( x − ˜ x ) (cid:19) | N = n ]= p X i =1 k x − ˜ x k · E k∇ [ Y ( ρ i )] i k + ∞ X n =0 p X i =1 k x − ˜ x k ˜ p n · E [ k∇ (cid:16) [ Y ( ξ i )] i − . · ([ Y ( ξ i )] i + [ Y ( ξ i )] i ) (cid:17) k | N = n ](32)The last two lines follows from Mean Value Theorem where ξ i , ρ i , ≤ i ≤ p lie somewhere between x and ˜ x . Toproceed with equation (32), fix N = n where n ≥
0, notice we can write Y ( ξ i ) = [ ¯ S n + n ( ξ i )] ⊺ ·∇ f v ( ¯ T n + n ( ξ i )).Thus we have [ Y ( ξ i )] i = d X k =1 [ ¯ S n + n ( ξ i )] ki · [ ∇ f v ( ¯ T n + n ( ξ i ))] k = d X k =1 ( 12 n + n +1 · n + n X t =1 ∂ [ g w t ] k ∂ [ x ] i ( ξ i )) · ∂f v ∂ [ y ] k ( 12 n + n +1 · n + n X t =1 g w t ( ξ i )) , and it follows from the chain rule that, for 1 ≤ j ≤ p ,[ ∇ [ Y ( ξ i )] i ] j = d X k =1 (cid:18) ( 12 n + n +1 · n + n X t =1 ∂ [ g w t ] k ∂ [ x ] i · ∂ [ x ] j ( ξ i )) · ∂f v ∂ [ y ] k ( 12 n + n +1 · n + n X t =1 g w t ( ξ i ))+ ( 12 n + n +1 · n + n X t =1 ∂ [ g w t ] k ∂ [ x ] i ( ξ i )) · (cid:16) d X h =1 ∂f v ∂ [ y ] k · ∂ [ y ] h ( 12 n + n +1 · n + n X t =1 g w t ( ξ i )) · ( 12 n + n +1 · n + n X t =1 ∂ [ g w t ] h ∂ [ x ] j ( ξ i )) (cid:17)(cid:19) = d X k =1 (cid:18) [ ¯ Z n + n ( ξ i )] kij · ∂f v ∂ [ y ] k ( ¯ T n + n ( ξ i )) + [ ¯ S n + n ( ξ i )] ki · (cid:16) d X h =1 ∂f v ∂ [ y ] k · ∂ [ y ] h ( ¯ T n + n ( ξ i )) · [ ¯ S n + n ( ξ i )] hj (cid:17)(cid:19) , (33)where the last line follows from Definition 1.It follows from the definition of l D that for any ξ i ∈ D and N = n , | [ ∇ [ Y ( ξ i )] i ] j | ≤ dl D (1 + dl D ) and k∇ [ Y ( ξ i )] i k ≤ pd l D (1 + dl D ) . Following a similar analysis, we also have k∇ [ Y ( ρ i )] i k ≤ pd l D (1 + dl D ) , so the first term of (32) satisfies:2 k x − ˜ x k · E k∇ [ Y ( ρ i )] i k ≤ pd l D (1 + dl D ) · k x − ˜ x k . 
(34)13o bound the second term in (32), we define the following function: for x ∈ H ⊆ R d × p , y ∈ G ⊆ R d , z ∈ J ⊆ R d × p × p and each 1 ≤ i, j ≤ p , define G : H × G × J → R by: G ij ( x, y, z ) , d X k =1 (cid:18) [ z ] kij · ∂f v ∂ [ y ] k ( y ) + [ x ] ki · (cid:16) d X h =1 ∂f v ∂ [ y ] k · ∂ [ y ] h ( y ) · [ x ] hj (cid:17)(cid:19) , (35)It is straightforward to see that for any realization of N , [ ∇ [ Y ( x )] i ] j = G ij ( ¯ S N + n ( x ) , ¯ T N + n ( x ) , ¯ Z N + n ( x )).We can compute each component of the gradient ∇ G ( x, y, z ) ∈ R ( d × p ) × d × ( d × p × p ) as ∂G ij ∂ [ x ] k ′ j ′ = δ ij ′ · d X h =1 ∂f v ∂ [ y ] k ′ · ∂ [ y ] h ( y ) · [ x ] hj + δ jj ′ · d X k =1 ∂f v ∂ [ y ] k · ∂ [ y ] k ′ ( y ) · [ x ] ki ∂G ij ∂ [ y ] h ′ = d X k =1 (cid:18) [ z ] kij · ∂f v ∂ [ y ] k ∂ [ y ] h ′ ( y ) + [ x ] ki · (cid:16) d X h =1 ∂f v ∂ [ y ] k ∂ [ y ] h ∂ [ y ] h ′ ( y ) · [ x ] hj (cid:17)(cid:19) ∂G ij ∂ [ z ] k ′ i ′ j ′ = δ ii ′ · δ jj ′ · ∂f v ∂ [ y ] k ′ ( y ) (36)where 1 ≤ i ′ , j ′ ≤ p ,1 ≤ k ′ , h ′ ≤ d and δ ij is the Kronecker delta. It follows from Assumptions 1, Assumption 3and Lemma (2) that for any 1 ≤ i, j ≤ p , each component of the ∇ G ij ( x, y, z ) is Lipschitz continous with constant2 dl D (1 + dl D ) and ∇ G ij ( x, y, z ) is Lipschitz continous with constant 2 p dp + dp + d · dl D (1 + dl D ). 
Thus if wedefine R ij x, x y, y z, z , G ij ( x, y, z ) − G ij ( x , y , z ) − ∇ G ij ( x , y , z ) · x − x y − y z − z for 1 ≤ i, j ≤ p , using Lemma 1, | R ij x, x y, y z, z | = | G ( x, y, z ) − G ( x , y , z ) − ∇ G ( x , y , z ) · x − x y − y z − z | ≤ p dp + dp + d · dl D (1 + dl D )( d X k ′ =1 p X j ′ =1 ([ x ] k ′ j ′ − [ x ] k ′ j ′ ) + d X h ′ =1 ([ y ] h ′ − [ y ] h ′ ) + d X k ′ =1 p X i ′ ,j ′ =1 ([ z ] k ′ i ′ j ′ − [ z ] k ′ i ′ j ′ ) ) ≤ pd l D ( d X k ′ =1 p X j ′ =1 ([ x ] k ′ j ′ − [ x ] k ′ j ′ ) + d X h ′ =1 ([ y ] h ′ − [ y ] h ′ ) + d X k ′ =1 p X i ′ ,j ′ =1 ([ z ] k ′ i ′ j ′ − [ z ] k ′ i ′ j ′ ) ) (37)14or any x, x ∈ H , y, y ∈ G and z, z ∈ J . To bound the second term in (32), for any n ≥ ≤ i ≤ p , E [ k∇ (cid:16) [ Y ( ξ i )] i − . · ([ Y ( ξ i )] i + [ Y ( ξ i )] i ) (cid:17) k | N = n ]= p X j =1 E [ (cid:0) [ ∇ [ Y ( ξ i )] i ] j − . · ([ ∇ [ Y ( ξ i )] i ] j + [ ∇ [ Y ( ξ i )] i ] j ) (cid:1) | N = n ]= p X j =1 E h(cid:16) G ij ( ¯ S N + n ( ξ i ) , ¯ T N + n ( ξ i ) , ¯ Z N + n ( ξ i )) − . · G ij ( ¯ S N + n ( ξ i ) , ¯ T N + n ( ξ i ) , ¯ Z N + n ( ξ i )) − . 
· G ij ( ˜ S N + n ( ξ i ) , ˜ T N + n ( ξ i ) , ˜ Z N + n ( ξ i )) (cid:17) | N = n i = p X j =1 E h(cid:16) G ij ( E S ( ξ i ) , E T ( ξ i ) , E Z ( ξ i )) + ∇ G ij ( E S ( ξ i ) , E T ( ξ i ) , E Z ( ξ i )) · ¯ S N + n ( ξ i ) − E S ( ξ i )¯ T N + n ( ξ i ) − E T ( ξ i )¯ Z N + n ( ξ i ) − E Z ( ξ i ) − G ij ( E S ( ξ i ) , E T ( ξ i ) , E Z ( ξ i )) − ∇ G ij ( E S ( ξ i ) , E T ( ξ i ) , E Z ( ξ i )) · ¯ S N + n ( ξ i )+ ˜ S N + n ( ξ i )2 − E S ( ξ i ) ¯ T N + n ( ξ i )+ ˜ T N + n ( ξ i )2 − E T ( ξ i ) ¯ Z N + n ( ξ i )+ ˜ Z N + n ( ξ i )2 − E Z ( ξ i ) + R ij ¯ S N + n ( ξ i ) , E S ( ξ i )¯ T N + n ( ξ i ) , E T ( ξ i )¯ Z N + n ( ξ i ) , E Z ( ξ i ) − R ij ¯ S N + n ( ξ i ) , E S ( ξ i )¯ T N + n ( ξ i ) , E T ( ξ i )¯ Z N + n ( ξ i ) , E Z ( ξ i ) − R ij ˜ S N + n ( ξ i ) , E S ( ξ i )˜ T N + n ( ξ i ) , E T ( ξ i )˜ Z N + n ( ξ i ) , E Z ( ξ i ) (cid:17) | N = n i = p X j =1 E h ( R ij ¯ S N + n ( ξ i ) , E S ( ξ i )¯ T N + n ( ξ i ) , E T ( ξ i )¯ Z N + n ( ξ i ) , E Z ( ξ i ) − R ij ¯ S N + n ( ξ i ) , E S ( ξ i )¯ T N + n ( ξ i ) , E T ( ξ i )¯ Z N + n ( ξ i ) , E Z ( ξ i ) − R ij ˜ S N + n ( ξ i ) , E S ( ξ i )˜ T N + n ( ξ i ) , E T ( ξ i )˜ Z N + n ( ξ i ) , E Z ( ξ i ) ) | N = n i ≤ p X j =1 E h R ij ¯ S N + n ( ξ i ) , E S ( ξ i )¯ T N + n ( ξ i ) , E T ( ξ i )¯ Z N + n ( ξ i ) , E Z ( ξ i ) + R ij ¯ S N + n ( ξ i ) , E S ( ξ i )¯ T N + n ( ξ i ) , E T ( ξ i )¯ Z N + n ( ξ i ) , E Z ( ξ i ) + R ij ˜ S N + n ( ξ i ) , E S ( ξ i )˜ T N + n ( ξ i ) , E T ( ξ i )˜ Z N + n ( ξ i ) , E Z ( ξ i ) | N = n i ≤ p X j =1 p d l D ( d + dp + dp ) k x − ˜ x k · E h d X h ′ =1 T N + n ( ξ i )] h ′ − [ E T ( ξ i )] h ′ ) + ([ ¯ T N + n ( ξ i )] h ′ − [ E T ( ξ i )] h ′ ) + ([ ˜ T N + n ( ξ i )] h ′ − [ E T ( ξ i )] h ′ ) + X ≤ k ′ ≤ d ≤ i ′ ≤ p ≤ j ′ ≤ p Z N + n ( ξ i )] k ′ i ′ j ′ − [ E Z ( ξ i )] k ′ i ′ j ′ ) + ([ ¯ Z N + n ( ξ i )] k ′ i ′ j ′ − [ E Z ( ξ i )] k ′ i ′ i ′ j ′ ) + ([ ˜ Z N + n ( ξ i )] k ′ i ′ j ′ − [ E Z ( ξ i )] k ′ i ′ j ′ ) + X ≤ k ′ ≤ d 
≤ j ′ ≤ p S N + n ( ξ i )] k ′ j ′ − [ E S ( ξ i )] k ′ j ′ ) + ([ ¯ S N + n ( ξ i )] k ′ j ′ − [ E S ( ξ i )] k ′ j ′ ) + ([ ˜ S N + n ( ξ i )] k ′ j ′ − [ E S ( ξ i )] k ′ j ′ ) | N = n i , (38)where the last two inequality follows from equation (15) and (37). The equality above it follows from equation (8).Continuing on (38), it follows from Lemma 5 that E [ k∇ (cid:16) [ Y ( ξ i )] i − . · ([ Y ( ξ i )] i + [ Y ( ξ i )] i ) (cid:17) k | N = n ] ≤ p d l D ( d + dp + dp ) C ′′D · ( log · ( n + n + 3) n + n ) · k x − ˜ x k ≤ p d l D C ′′D · ( log · ( n + n + 3) n + n ) · k x − ˜ x k (39)Combine (34) and (39). Let C D , p d l D (1+ dl D ) + p d l D C ′′D · ( log (1 − . γ )2 n ∞ P n =0 ( n + n + 3) · ( γ − n . Notice C D < ∞ n ≥ γ <
2. Now (32) becomes
$$\mathbb{E}\|W(x, v) - W(\tilde x, v)\|^2 \le \Big(\sum_{i=1}^{p} 2\, p\, d^4 l_D^4 (1 + d l_D)^2 + \sum_{i=1}^{p}\sum_{n=0}^{\infty} 2\, p^2 d^4 l_D^4\, C''_D\cdot\Big(\frac{\log\big(4(n + n_0 + 3)\big)}{2^{\,n + n_0}}\Big)^{1/2}\,\tilde p_n\Big)\cdot\|x - \tilde x\|^2 = C_D\cdot\|x - \tilde x\|^2,$$
which completes the proof. In this section we prove the convergence property of Algorithm 4. Notice that the constant $C_D$ is defined in Theorem 1 and $\mu$ is the strong convexity coefficient. Lemma 6.
Let $F: \mathbb{R}^p \to \mathbb{R}$ be a convex function with $L$-Lipschitz gradient and denote by $x^\star = \arg\min_{x \in \mathbb{R}^p} F(x)$ the global minimizer of $F(\cdot)$. Then for any $x \in \mathbb{R}^p$,
$$\frac{1}{2L}\|\nabla F(x)\|^2 \le F(x) - F(x^\star).$$
Proof.
Let $F_x(y) = F(x) + \nabla F(x)^{\top}(y - x) + \frac{L}{2}\|y - x\|^2$. Since $F(\cdot)$ has $L$-Lipschitz gradient, we have $F(y) \le F_x(y)$ for all $y \in \mathbb{R}^p$. It then follows that $F(x^\star) \le \min_{y \in \mathbb{R}^p} F_x(y)$. It is straightforward to compute the global minimizer $y^\star$ of the quadratic function $F_x(y)$ to be $y^\star = x - \frac{1}{L}\nabla F(x)$, so we have
$$F(x^\star) \le F_x(y^\star) = F(x) - \frac{1}{2L}\|\nabla F(x)\|^2.$$
Theorem 2.
Consider the Simulated SVRG Algorithm 4 with option II. Let $\lambda$ be sufficiently small and $M$ sufficiently large so that
$$\alpha = \frac{1}{\mu\big(1 - \frac{4}{\mu} C_D \lambda\big)\lambda M} + \frac{\big(\frac{4}{\mu} C_D + 2L\big)\lambda}{1 - \frac{4}{\mu} C_D \lambda} < 1. \quad (40)$$
Then under Assumptions 1-3, we have geometric convergence in expectation for the Simulated SVRG:
$$\mathbb{E}[F(\tilde x_s)] \le F(x^\star) + \alpha^s\,[F(\tilde x_0) - F(x^\star)].$$
Proof.
It follows from Lemma 6 that
$$\|\nabla F(x) - \nabla F(x^\star)\|^2 = \|\nabla F(x)\|^2 \le 2L\,[F(x) - F(x^\star)]. \quad (41)$$
Now, conditioning on $x_{t-1}$, we can take expectation with respect to $v_t \in \Omega_v$ to obtain
$$\begin{aligned}
\mathbb{E}[\|\nu_t\|^2 \mid x_{t-1}] &\le 2\,\mathbb{E}[\|W(x_{t-1}, v_t) - W(\tilde x, v_t)\|^2 \mid x_{t-1}] + 2\,\|\nabla F(\tilde x)\|^2\\
&\le 2 C_D \|x_{t-1} - \tilde x\|^2 + 4L\,[F(\tilde x) - F(x^\star)]\\
&\le 4 C_D \big(\|x_{t-1} - x^\star\|^2 + \|\tilde x - x^\star\|^2\big) + 4L\,[F(\tilde x) - F(x^\star)]\\
&\le \tfrac{8}{\mu} C_D\,[F(x_{t-1}) - F(x^\star)] + \big(\tfrac{8}{\mu} C_D + 4L\big)\,[F(\tilde x) - F(x^\star)],
\end{aligned} \quad (42)$$
where the second inequality follows from Theorem 1 and equation (41), and the last inequality follows from the strong convexity of $F(\cdot)$. Thus,
$$\begin{aligned}
\mathbb{E}[\|x_t - x^\star\|^2 \mid x_{t-1}] &= \|x_{t-1} - x^\star\|^2 - 2\lambda (x_{t-1} - x^\star)^{\top}\mathbb{E}[\nu_t \mid x_{t-1}] + \lambda^2\,\mathbb{E}[\|\nu_t\|^2 \mid x_{t-1}]\\
&\le \|x_{t-1} - x^\star\|^2 - 2\lambda (x_{t-1} - x^\star)^{\top}\nabla F(x_{t-1}) + \tfrac{8}{\mu} C_D \lambda^2\,[F(x_{t-1}) - F(x^\star)] + \big(\tfrac{8}{\mu} C_D + 4L\big)\lambda^2\,[F(\tilde x) - F(x^\star)]\\
&\le \|x_{t-1} - x^\star\|^2 - 2\lambda\,[F(x_{t-1}) - F(x^\star)] + \tfrac{8}{\mu} C_D \lambda^2\,[F(x_{t-1}) - F(x^\star)] + \big(\tfrac{8}{\mu} C_D + 4L\big)\lambda^2\,[F(\tilde x) - F(x^\star)]\\
&= \|x_{t-1} - x^\star\|^2 - 2\lambda\big(1 - \tfrac{4}{\mu} C_D\lambda\big)\,[F(x_{t-1}) - F(x^\star)] + \big(\tfrac{8}{\mu} C_D + 4L\big)\lambda^2\,[F(\tilde x) - F(x^\star)],
\end{aligned} \quad (43)$$
where the third line follows from the unbiasedness of the simulated gradient and the fourth line follows from the convexity of $F(\cdot)$. Now we consider a fixed stage $s$, so that $\tilde x = \tilde x_{s-1}$ and $\tilde x_s$ is selected uniformly after all $M$ updates are completed.
Summing the previous inequality over $t = 1, \ldots, M$, taking expectations, and using option II at stage $s$, we obtain
$$\begin{aligned}
\mathbb{E}[\|x_M - x^\star\|^2] + 2\lambda\big(1 - \tfrac{4}{\mu} C_D\lambda\big) M\,\mathbb{E}[F(\tilde x_s) - F(x^\star)] &\le \mathbb{E}[\|x_0 - x^\star\|^2] + \big(\tfrac{8}{\mu} C_D + 4L\big)\lambda^2 M\,\mathbb{E}[F(\tilde x) - F(x^\star)]\\
&= \mathbb{E}[\|\tilde x - x^\star\|^2] + \big(\tfrac{8}{\mu} C_D + 4L\big)\lambda^2 M\,\mathbb{E}[F(\tilde x) - F(x^\star)]\\
&\le \tfrac{2}{\mu}\,\mathbb{E}[F(\tilde x) - F(x^\star)] + \big(\tfrac{8}{\mu} C_D + 4L\big)\lambda^2 M\,\mathbb{E}[F(\tilde x) - F(x^\star)]\\
&= \Big(\tfrac{2}{\mu} + \big(\tfrac{8}{\mu} C_D + 4L\big)\lambda^2 M\Big)\,\mathbb{E}[F(\tilde x) - F(x^\star)]. \quad (44)
\end{aligned}$$
Thus we obtain
$$\mathbb{E}[F(\tilde x_s) - F(x^\star)] \le \Big[\frac{1}{\mu\big(1 - \frac{4}{\mu} C_D\lambda\big)\lambda M} + \frac{\big(\frac{4}{\mu} C_D + 2L\big)\lambda}{1 - \frac{4}{\mu} C_D\lambda}\Big]\,\mathbb{E}[F(\tilde x_{s-1}) - F(x^\star)]. \quad (45)$$
This implies that $\mathbb{E}[F(\tilde x_s) - F(x^\star)] \le \alpha^s\cdot\mathbb{E}[F(\tilde x_0) - F(x^\star)]$. The conclusion follows. Corollary 1.
Let $\{\tilde x_s\}_{s \ge 0}$ be the sequence of outputs from each epoch of the Simulated SVRG algorithm. Then, with probability 1, $\tilde x_s$ converges exponentially fast to $x^\star$. Proof.
It follows from Theorem 2 that we can find $0 < \alpha < 1$ such that $\mathbb{E}[F(\tilde x_s)] \le F(x^\star) + \alpha^s\,[F(\tilde x_0) - F(x^\star)]$. Pick any $\alpha < \rho <$
1. Define the sets $A_s = \{F(\tilde x_s) - F(x^\star) > \rho^s\}$ in the probability space. By Markov's inequality, we have $P(A_s) \le \big(\frac{\alpha}{\rho}\big)^s\cdot\mathbb{E}[F(\tilde x_0) - F(x^\star)]$, which implies that $\sum_{s \ge 0} P(A_s) < \infty$. It then follows from the Borel-Cantelli lemma that
$$P(A_s \text{ occurs infinitely often}) = P\Big(\limsup_{s \to \infty} A_s\Big) = P\Big(\bigcap_{t=0}^{\infty}\bigcup_{s=t}^{\infty} A_s\Big) = \inf_{t \ge 0} P\Big(\bigcup_{s=t}^{\infty} A_s\Big) \le \inf_{t \ge 0} \sum_{s \ge t} P(A_s) = 0. \quad (46)$$
Thus, with probability 1, $F(\tilde x_s) - F(x^\star) < \rho^s$ for $s$ large enough, which implies $\|\tilde x_s - x^\star\|^2 \le \frac{2}{\mu}\rho^s$. In this section we prove the convergence property of Algorithm 5.
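To make the randomized multilevel construction concrete before turning to Lemma 7, the following sketch simulates the averaged-halves estimator of $\nabla\big(f(\mathbb{E}_w g_w(x))\big)$ with a geometric level $N$ and base level $n_0$. This is a toy illustration under our own assumptions, not the paper's code: the composition $f(y) = e^y$, $g_w(x) = x + w$ with $w \sim N(0,1)$, and all function names are ours. We take $\gamma = 1.5$ so that both the variance condition ($\gamma < 2$, Proposition 2) and the finite-cost condition ($\gamma > 1$, Proposition 3) hold.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Toy composition (our assumption): F(x) = f(E[g_w(x)]) with f(y) = exp(y)
# and g_w(x) = x + w, w ~ N(0, 1).  Then E[g_w(x)] = x, dg/dx = 1, and
# F'(x) = exp(x).  A plug-in gradient f'(sample mean of m draws) is biased
# for every finite m; the multilevel telescoping term removes the bias.
GAMMA = 1.5   # needs 1 < gamma < 2: expected cost finite (>1), variance finite (<2)
N0 = 2        # base level

def unbiased_gradient(x):
    """One draw of the randomized multilevel (averaged-halves) estimator."""
    q = 0.5 ** GAMMA
    n = rng.geometric(1 - q) - 1          # P(N = n) = (1 - q) * q**n
    p_n = (1 - q) * q ** n
    w = rng.standard_normal(2 ** (n + N0 + 1))
    odd, even = w[0::2], w[1::2]          # split the samples into two halves
    full = math.exp(x + w.mean())                 # level n + 1 plug-in
    half = 0.5 * (math.exp(x + odd.mean()) + math.exp(x + even.mean()))
    base = math.exp(x + w[: 2 ** N0].mean())      # base-level term
    return (full - half) / p_n + base             # telescoping sum, unbiased

x = 1.5
est = np.mean([unbiased_gradient(x) for _ in range(20000)])
# est is close to F'(x) = exp(1.5), with no systematic bias.
```

Averaging many independent draws recovers $F'(x) = e^x$ even though every fixed inner sample size gives a biased plug-in estimate; the per-draw simulation cost is $1 + 2^{N + n_0 + 1}$, whose expectation is finite exactly because $\gamma > 1$.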
Lemma 7.
Fix $x \in \mathcal{D}$ and $K, B \ge 1$. We sample a batch $\mathcal{I} \subset \Omega_v$ with $|\mathcal{I}| = B$ following the distribution of $v$ and independently generate $h_k(x) = \frac{1}{B}\sum_{v_i \in \mathcal{I}} \mathrm{UnbiasedGradient}(x, v_i)$ for $1 \le k \le K$. Let $C'_D$ be the constant in the proof of Proposition 2 such that $\mathbb{E}\|W(x, v)\|^2 \le C'_D$ for arbitrary $v \in \Omega_v$. Define $\tilde h(x) = \frac{1}{K}\sum_{i=1}^{K} h_i(x)$. Then we have
$$\mathbb{E}[\tilde h(x)] = \nabla F(x) \quad\text{and}\quad \mathrm{Var}[\tilde h(x)] \le \frac{C'_D}{KB} + 4\, p\, d^2 l_D^4\Big(\frac{1}{K} + \frac{1}{B}\Big), \quad (47)$$
so $\mathrm{Var}[\tilde h(x)]$ can be made arbitrarily small for any $x \in \mathcal{D}$ by making $K$ and $B$ sufficiently large. Proof.
First we have E [˜ h ( x )] = E [ h ( x )] = E [ E [ h ( x ) |I ]] = 1 B E I [ E [ X v i ∈I UnibasedGradient( x, v i ) |I ]]= 1 B E I [ X v i ∈I ∇ ( f v i ( E w g w ( x )))] = ∇ F ( x ) . v ∈ Ω v , denote W i = UnbiasedGradient( x, v i ), h v = ∇ ( f v ( E w g w ( x ))) and h ( I ) = E [ h ( x ) |I ] = B P v i ∈I h v i , we have V ar [˜ h ( x )] = E [ V ar [˜ h ( x ) |I ]] + V ar [ E [˜ h ( x ) |I ]]= 1 K E [ V ar [ h ( x ) |I ]] + V ar I [ h ( I )]= 1 K E (cid:2) E [( h ( x ) − h ( I )) ⊺ ( h ( x ) − h ( I )) |I ] (cid:3) + 1 B V ar v [ h v ]= 1 K · B E (cid:2) E [( B X i =1 W i − h v i + h v i − h ( I )) ⊺ ( B X i =1 W i − h v i + h v i − h ( I )) |I ] (cid:3) + 1 B V ar v [ h v ]= 1 K · B E (cid:2) E [ B X i =1 k W i − h v i k + B X i =1 B X j =1 ( h v i − h ( I )) ⊺ ( h v j − h ( I )) |I ] (cid:3) + 1 B V ar v [ h v ] ≤ C ′D KB + 4 pd l D ( 1 K + 1 B )where the last inequality follow from the definition of C ′D and fact that each component of h v is bounded by dl D forany v ∈ Ω v , according to the definition of l D and h v . The equality above it follows from the independence betweenthe W i ’s given I . Theorem 3.
Consider the Simulated SCSG Algorithm 5 with option II, and suppose the setting of Theorem 2 holds. Fix $\epsilon > 0$. Let $\lambda$ be small and $M$ sufficiently large so that
$$\alpha = \frac{2}{\mu\big(1 - \frac{4}{\mu} C_D\lambda\big)\lambda M} + \frac{\big(\frac{4}{\mu} C_D + 8L\big)\lambda}{1 - \frac{4}{\mu} C_D\lambda} < 1, \quad (48)$$
while making one of $K$ and $B$ large enough so that
$$\frac{4\big(\lambda + \frac{1}{2\mu}\big)}{1 - \frac{4}{\mu} C_D\lambda}\cdot \mathrm{Var}[\tilde h] < \epsilon. \quad (49)$$
Then we have the following result for the Simulated SCSG:
$$\mathbb{E}[F(\tilde x_s) - F(x^\star)] \le \alpha^s\cdot\mathbb{E}[F(\tilde x_0) - F(x^\star)] + \frac{1}{1 - \alpha}\cdot\epsilon. \quad (50)$$
Proof.
Conditioning on x t − , we can take expectation with respect to v t ∈ Ω v to obtain E [ k ν t k | x t − ] ≤ E [ k W ( x t − , v t ) − W (˜ x, v t ) k | x t − ] + 4 k∇ F (˜ x ) k + 4 k ˜ h (˜ x ) − ∇ F (˜ x ) k ≤ C D k x t − − ˜ x k + 8 L [ F (˜ x ) − F ( x ⋆ )] + 4 k ˜ h (˜ x ) − ∇ F (˜ x ) k ≤ C D ( k x t − − x ⋆ k + k ˜ x − x ⋆ k ) + 8 L [ F (˜ x ) − F ( x ⋆ )] + 4 k ˜ h (˜ x ) − ∇ F (˜ x ) k ≤ µ C D · [ F ( x t − ) − F ( x ⋆ )] + ( 8 µ C D + 8 L ) · [ F (˜ x ) − F ( x ⋆ )] + 4 k ˜ h (˜ x ) − ∇ F (˜ x ) k . (51)where the second inequality follows from Theorem 1 and equation (41). The last inequality follows from the strongconvexity of F ( · ). Now following (51), we can write E [ k x t − x ⋆ k | x t − ]= k x t − − x ⋆ k − λ ( x t − − x ⋆ ) ⊺ E [ ν t | x t − ] + λ E [ k ν t k | x t − ] ≤k x t − − x ⋆ k − λ ( x t − − x ⋆ ) ⊺ ( ∇ F ( x t − ) − ∇ F (˜ x ) + ˜ h (˜ x ))+ 8 µ C D λ · [ F ( x t − ) − F ( x ⋆ )] + ( 8 µ C D + 8 L ) λ · [ F (˜ x ) − F ( x ⋆ )] + 4 λ k ˜ h (˜ x ) − ∇ F (˜ x ) k ≤k x t − − x ⋆ k − λ [ F ( x t − ) − F ( x ⋆ )] + 2 λ ( x t − − x ⋆ ) ⊺ (˜ h (˜ x ) − ∇ F (˜ x ))+ 8 µ C D λ · [ F ( x t − ) − F ( x ⋆ )] + ( 8 µ C D + 8 L ) λ · [ F (˜ x ) − F ( x ⋆ )] + 4 λ k ˜ h (˜ x ) − ∇ F (˜ x ) k = k x t − − x ⋆ k − λ (1 − µ C D λ )[ F ( x t − ) − F ( x ⋆ )] + ( 8 µ C D + 8 L ) λ [ F (˜ x ) − F ( x ⋆ )]+ 4 λ k ˜ h (˜ x ) − ∇ F (˜ x ) k + 2 λ ( x t − − x ⋆ ) ⊺ (˜ h (˜ x ) − ∇ F (˜ x )) , (52)18here the third line follows from the convexity of F ( · ). Now we consider a fixed stage s , so that ˜ x = ˜ x s − and ˜ x s is selected uniformly after all M updates are completed. 
Summing over the previous inequality over t = 1 , ..., M ,taking expectation and use options II at stage s , we obtain E [ k x M − x ⋆ k ] + 2 λ (1 − µ C D λ ) M E [ F (˜ x s ) − F ( x ⋆ )] ≤ E [ k x − x ⋆ k ] + ( 8 µ C D + 8 L ) λ M E [ F (˜ x ) − F ( x ⋆ )] + 4 λ M k ˜ h (˜ x ) − ∇ F (˜ x ) k + 2 λM E [(˜ x s − x ⋆ ) ⊺ (˜ h (˜ x ) − ∇ F (˜ x ))] ≤ E [ k x − x ⋆ k ] + ( 8 µ C D + 8 L ) λ M E [ F (˜ x ) − F ( x ⋆ )] + 4 λM ( λ + 12 µ ) k ˜ h (˜ x ) − ∇ F (˜ x ) k + µ λM E [ k ˜ x s − x ⋆ k ] ≤ E [ k x − x ⋆ k ] + ( 8 µ C D + 8 L ) λ M E [ F (˜ x ) − F ( x ⋆ )] + 4 λM ( λ + 12 µ ) k ˜ h (˜ x ) − ∇ F (˜ x ) k + λM E [ F (˜ x s ) − F ( x ⋆ )] , (53)where the second inequality follows from 2 a ⊺ b ≤ β k a k + β k b k while β = µ . The last inequality follows fromstrong convexity of F ( · ). Finally, taking expectation over the randomness of ˜ h (˜ x ), we have λ (1 − µ C D λ ) M E [ F (˜ x s ) − F ( x ⋆ )] ≤ E [ k ˜ x − x ⋆ k ] + ( 8 µ C D + 8 L ) λ M E [ F (˜ x ) − F ( x ⋆ )] + 4 λM ( λ + 12 µ ) V ar [˜ h (˜ x )] ≤ µ E [ F (˜ x ) − F ( x ⋆ )] + ( 8 µ C D + 8 L ) λ M E [ F (˜ x ) − F ( x ⋆ )] + 4 λM ( λ + 12 µ ) V ar [˜ h (˜ x )]=( 2 µ + ( 8 µ C D + 8 L ) λ M ) E [ F (˜ x ) − F ( x ⋆ )] + 4 λM ( λ + 12 µ ) V ar [˜ h (˜ x )] (54)Thus we obtain E [ F (˜ x s ) − F ( x ⋆ )] ≤ (cid:20) µ (1 − µ C D λ ) λM + ( µ C D + 8 L ) λ − µ C D λ (cid:21) E [ F (˜ x s − ) − F ( x ⋆ )] + 4( λ + µ )1 − µ C D λ V ar [˜ h (˜ x )] ≤ α · E [ F (˜ x s − ) − F ( x ⋆ )] + ǫ (55)This implies that E [ F (˜ x s ) − F ( x ⋆ )] ≤ α s · E [ F (˜ x ) − F ( x ⋆ )] + ǫ − α . The conclusion follows. Corollary 2.
Let $\{\tilde x_s\}_{s \ge 0}$ be the sequence of outputs from each epoch of the Simulated SCSG algorithm and define $\tilde y_s = \min_{t \le s}\{F(\tilde x_t) - F(x^\star)\}$ for $s \ge 0$. Then, with probability 1, $\inf_{s \ge 0} \tilde y_s \le \frac{\epsilon}{1 - \alpha}$. Proof.
It follows from Theorem 3 that we can find 0 < α < E [ F (˜ x s ) − F ( x ⋆ )] ≤ α s E [ F (˜ x ) − F ( x ⋆ )] + ǫ − α .We also have sup x ∈D { F ( x ) − F ( x ⋆ ) } ≤ l D from the definition of l D . It follows that for any ˜ x ∈ D , we have that E [ F (˜ x s ) − F ( x ⋆ ) | ˜ x ] ≤ α s · l D + ǫ − α . For any ρ >
0, pick N large enough that δ = ( α N · l D + ǫ − α )( ǫ − α + ρ ) − < P (˜ y N ≥ ǫ − α + ρ ) ≤ P ( F (˜ x N ) − F ( x ) ≥ ǫ − α + ρ ) ≤ E [ F (˜ x ) − F ( x )]( ǫ − α + ρ ) − ≤ δ. However, if we denote X N to be the distribution of ˜ x N conditioning on ˜ y N ≥ ǫ − α + ρ , then it follows from the19arkov Property that P (˜ y N ≥ ǫ − α + ρ ) = P (˜ y N ≥ ǫ − α + ρ | ˜ y N ≥ ǫ − α + ρ ) P (˜ y N ≥ ǫ − α + ρ )= P ( min N +1 ≤ s ≤ N { F (˜ x s ) − F ( x ⋆ ) } ≥ ǫ − α + ρ | ˜ y N ≥ ǫ − α + ρ ) P (˜ y N ≥ ǫ − α + ρ )=( P ˜ x N ∼X N P ( min N +1 ≤ s ≤ N { F (˜ x s ) − F ( x ⋆ ) } ≥ ǫ − α + ρ | ˜ x N )) · P (˜ y N ≥ ǫ − α + ρ ) ≤ ( P ˜ x N ∼X N P ( F (˜ x N ) − F ( x ⋆ ) ≥ ǫ − α + ρ | ˜ x N )) · δ ≤ ( P ˜ x N ∼X N E [ F (˜ x N ) − F ( x ⋆ ) | ˜ x N ]) · ( ǫ − α + ρ ) − · δ =( P ˜ x ∼X N E [ F (˜ x N ) − F ( x ⋆ ) | ˜ x ]) · ( ǫ − α + ρ ) − · δ ≤ P ˜ x ∼X N ( α N · l D + ǫ − α ) · ( ǫ − α + ρ ) − · δ ≤ δ Continue on, we can prove that P (˜ y kN ≥ ǫ − α + ρ ) ≤ δ k . Thus if we define the set A ρ = { inf s ≥ ˜ y s ≥ ǫ − α + ρ } and A = { inf s ≥ ˜ y s > ǫ − α } in probability space, we have P ( A ρ ) = P (inf s ≥ ˜ y s ≥ ǫ − α + ρ ) ≤ P (˜ y kN ≥ ǫ − α + ρ ) ≤ δ k , (56)for any k ≥
1. Since δ <
1, we have $P(A_\rho) = 0$ for any $\rho > 0$, and $P(A) = P\big(\bigcup_{n \ge 1} A_{1/n}\big) \le \sum_{n=1}^{\infty} P(A_{1/n}) = 0$. So, with probability 1, $\inf_{s \ge 0} \tilde y_s \le \frac{\epsilon}{1 - \alpha}$.

We implemented the two algorithms on minimizing a regularized Cox's negative partial log-likelihood and compared their performance with the Compositional-SVRG-1 algorithm in [3] and the gradient descent algorithm:
$$\min_{\beta \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^{n} \Delta_i\Big[-X_i^{\top}\beta + \log\Big\{\sum_{j=1}^{n} I(Y_j \ge Y_i)\exp(X_j^{\top}\beta)\Big\}\Big] + \frac{1}{2}\|\beta\|^2. \quad (57)$$
As in the setting of the Examples, $(X_i, Y_i, \Delta_i)$, $i = 1, \ldots, n$, are i.i.d. observations, where $X_i \in \mathbb{R}^p$ is the feature vector, $Y_i = \min(T_i, C_i)$ and $\Delta_i = I\{Y_i = T_i\}$, $T_i$ is the true lifetime, and $C_i$ is the censoring time, which is independent of $T_i$. It is easy to see that each component function is strongly convex and has Lipschitz continuous gradients. Our numerical results are based on simulated data, with the following settings. We set $n = 10$, $p = 10$ and let every entry of $X$ follow an i.i.d. standard normal distribution. $T$ is generated by a standard exponential baseline hazard function, and $C$ is independent of $T$ with a censoring rate of around 30%.

In the Simulated SVRG algorithm, we set the step size to be $\lambda = 0.01$, the number of iterations in the inner loop to be $M = 100$, and the base level to be $n_0 = 0$, whereas in the Simulated SCSG algorithm, we set the step size to be $\lambda = 0.$, $M = 100$, the batch size to be $B = 100$, the number of simulations to be $K = 50$, and the base level to be $n_0 = 2$. Accordingly, in the Compositional-SVRG-1 algorithm, we set the step size to be $\lambda = 0.$, $M = 100$, and the batch size to be $B = 500$, whereas in the gradient descent algorithm, we set the step size to be $\lambda = 0.$ However, this is not the case for the Compositional-SVRG-1 algorithm. Also, as expected, we can see that Algorithm 5, the Simulated SCSG, does not perform as well as Algorithm 4 or Compositional-SVRG-1, since it does not involve full gradient evaluations. An interesting finding is that the gradient descent algorithm has the worst performance in terms of iteration complexity.

[Figure: Comparisons of the algorithms on minimizing the negative partial likelihood; x-axis: number of iterations, y-axis: log(f(x_k) - f(x*)). Legend: Algorithm 4, Algorithm 5, Comp-SVRG-1, GD.]
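The composition structure of (57) can be seen directly in code. The sketch below is our own minimal illustration, not the paper's implementation: the data sizes, the variable names, and the helper `cox_objective` are all assumptions. It builds a small synthetic survival dataset with roughly 30% censoring and evaluates the regularized negative partial log-likelihood, whose inner risk-set sum is exactly the quantity that the inner samples approximate in the composition problem.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic survival data (illustrative sizes, our assumption).  With
# T ~ Exp(rate 1) and C ~ Exp(rate 1/2), P(censored) = 0.5 / 1.5 = 1/3,
# roughly matching the ~30% censoring rate used in the experiment.
n, p = 200, 5
X = rng.standard_normal((n, p))
T = rng.exponential(scale=1.0, size=n)   # lifetimes, unit-rate baseline hazard
C = rng.exponential(scale=2.0, size=n)   # independent censoring times
Y = np.minimum(T, C)
Delta = (T <= C).astype(float)           # 1 if the event was observed

def cox_objective(beta):
    """Regularized negative partial log-likelihood, as in (57).

    For each observed event i, the risk set {j : Y_j >= Y_i} enters only
    through log sum_j I(Y_j >= Y_i) exp(X_j' beta): the inner sum that
    gives (57) its stochastic-composition structure.
    """
    eta = X @ beta                                      # linear predictors
    at_risk = (Y[None, :] >= Y[:, None]).astype(float)  # risk-set indicator matrix
    log_risk_sum = np.log(at_risk @ np.exp(eta))
    return np.mean(Delta * (-eta + log_risk_sum)) + 0.5 * beta @ beta

print(cox_objective(np.zeros(p)))   # positive: mean of log risk-set sizes
```

Since the risk set of observation $i$ always contains $i$ itself, each bracketed term in (57) is nonnegative, so the objective is bounded below by the regularizer $\frac{1}{2}\|\beta\|^2$; this is the strong convexity noted above.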
We also implemented the proposed algorithms on the optical character recognition (OCR) dataset to train conditional random fields, as in [28]. In contrast to the Cox's partial likelihood problem, the full gradient of a CRF can be efficiently computed by a dynamic programming method (the Viterbi algorithm [22]), as mentioned in the Examples. We compare the proposed algorithms with gradient descent. In the Simulated SVRG algorithm, we set the step size to be $\lambda = 0.$, $M = 200$, and the base level to be $n_0 = 0$. In the Simulated SCSG algorithm, we set the step size to be $\lambda = 0.$, $M = 200$, the batch size to be $B = 100$, the number of simulations to be $K = 10$, and the base level to be $n_0 = 2$. Finally, in the gradient descent algorithm, we set the step size to be $\lambda = 0.$

[Figure: Comparisons of the algorithms on the OCR dataset; x-axis: number of iterations, y-axis: log(f(x_k) - f(x*)). Legend: Algorithm 4, Algorithm 5, GD.]

References

[1] Mengdi Wang, Ethan X. Fang, and Han Liu. Stochastic compositional gradient descent: Algorithms for minimizing compositions of expected-value functions.
Mathematical Programming, 161(1-2):419–449, 2017.

[2] Mengdi Wang and Ji Liu. Accelerating stochastic composition optimization. In Advances in Neural Information Processing Systems, pages 1714–1722, 2016.

[3] Xiangru Lian, Mengdi Wang, and Ji Liu. Finite-sum composition optimization via variance reduced gradient descent. In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 1159–1167, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR.

[4] Chang-Han Rhee and Peter W. Glynn. Unbiased estimation with square root convergence for SDE models. Operations Research, 63(5):1026–1043, 2015.

[5] Jose H. Blanchet and Peter W. Glynn. Unbiased Monte Carlo for optimization and functions of expectations via multi-level randomization. In Proceedings of the 2015 Winter Simulation Conference, pages 3656–3667. IEEE Press, 2015.

[6] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[7] Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

[8] Zeyuan Allen-Zhu and Yang Yuan. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. Technical report, arXiv preprint, 2016.

[9] Reza Harikandeh, Mohamed Osama Ahmed, Alim Virani, Mark Schmidt, Jakub Konečný, and Scott Sallinen. StopWasting my gradients: Practical SVRG. In Advances in Neural Information Processing Systems, pages 2251–2259, 2015.

[10] Lihua Lei and Michael Jordan. Less than a single pass: Stochastically controlled stochastic gradient. In Artificial Intelligence and Statistics, pages 148–156, 2017.

[11] Pinghua Gong and Jieping Ye. Linear convergence of variance-reduced stochastic gradient without strong convexity. arXiv preprint arXiv:1406.1102, 2014.

[12] Atsushi Nitanda. Stochastic proximal gradient descent with acceleration techniques. In Advances in Neural Information Processing Systems, pages 1574–1582, 2014.

[13] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, pages 1–30.

[14] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.

[15] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[16] Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. Competing with the empirical risk minimizer in a single pass. In Conference on Learning Theory, pages 728–763, 2015.

[17] John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. 2001.

[18] Charles Sutton, Andrew McCallum, and Khashayar Rohanimanesh. Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. Journal of Machine Learning Research, 8(Mar):693–723, 2007.

[19] Fei Sha and Fernando Pereira. Shallow parsing with conditional random fields. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 134–141. Association for Computational Linguistics, 2003.

[20] Andrew McCallum and Wei Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, pages 188–191. Association for Computational Linguistics, 2003.

[21] Sebastian Nowozin, Christoph H. Lampert, et al. Structured learning and prediction in computer vision. Foundations and Trends in Computer Graphics and Vision, 6(3–4):185–365, 2011.

[22] G. D. Forney. The Viterbi algorithm. Proceedings of the IEEE, 61(3):268–278, March 1973.

[23] S. V. N. Vishwanathan, Nicol N. Schraudolph, Mark W. Schmidt, and Kevin P. Murphy. Accelerated training of conditional random fields with stochastic gradient methods. In Proceedings of the 23rd International Conference on Machine Learning, pages 969–976. ACM, 2006.

[24] Mark Schmidt, Reza Babanezhad, Mohamed Ahmed, Aaron Defazio, Ann Clifton, and Anoop Sarkar. Non-uniform stochastic average gradient method for training conditional random fields. In Artificial Intelligence and Statistics, pages 819–828, 2015.

[25] David R. Cox et al. Regression models and life tables (with discussion). Journal of the Royal Statistical Society, 34:187–220, 1972.

[26] David R. Cox. Partial likelihood. Biometrika, 62(2):269–276, 1975.

[27] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), pages 1–38, 1977.

[28] Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. In