An Improved Convergence Analysis of Stochastic Variance-Reduced Policy Gradient
Pan Xu
Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095

Felicia Gao
Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095

Quanquan Gu
Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095

* To appear in the proceedings of the 35th International Conference on Uncertainty in Artificial Intelligence.
Abstract
We revisit the stochastic variance-reduced policy gradient (SVRPG) method proposed by Papini et al. (2018) for reinforcement learning. We provide an improved convergence analysis of SVRPG and show that it can find an $\epsilon$-approximate stationary point of the performance function within $O(1/\epsilon^{5/3})$ trajectories. This sample complexity improves upon the best known result $O(1/\epsilon^2)$ by a factor of $O(1/\epsilon^{1/3})$. At the core of our analysis are (i) a tighter upper bound on the variance of the importance sampling weights, where we prove that the variance can be controlled by the parameter distance between different policies; and (ii) a fine-grained analysis of the epoch length and batch size parameters, by which we can significantly reduce the number of trajectories required in each iteration of SVRPG. We also empirically demonstrate the effectiveness of our theoretical claims about batch sizes on reinforcement learning benchmark tasks.

1 Introduction

Reinforcement learning (RL) is a sequential decision process that learns the best actions to solve a task by repeated, direct interaction with the environment (Sutton & Barto, 2018). In detail, an RL agent starts at one state and sequentially takes an action according to a certain policy, observes the resulting reward signal, and lastly evaluates and improves its policy before it transits to the next state. A policy tells the agent which action to take at each state; a good policy is therefore critically important in an RL problem. Recently, policy gradient methods (Sutton et al., 2000) have achieved impressive successes in many challenging deep reinforcement learning applications (Kakade, 2002; Schulman et al., 2015). These methods directly optimize the performance function $J(\theta)$ (formally defined later) over a class of policies parameterized by some model parameter $\theta$. In particular, policy gradient methods seek to find the best policy $\pi_\theta$ that maximizes the expected return of the agent. They are generally more effective in high-dimensional action spaces and enjoy the additional flexibility of stochasticity, compared with deterministic value-function based methods such as Q-learning and SARSA (Sutton et al., 2000).

In many RL applications, the performance function $J(\theta)$ is non-concave, and the goal is to find a stationary point $\theta^*$ such that $\|\nabla J(\theta^*)\|_2 = 0$ using gradient based algorithms. Due to the specialty of reinforcement learning, the objective function $J(\theta)$ is calculated based on cumulative rewards arriving in a sequential way, which makes it impossible to calculate the full gradient directly. Therefore, most algorithms such as REINFORCE (Williams, 1992) and GPOMDP (Baxter & Bartlett, 2001) need to actively sample trajectories to approximate the gradient $\nabla J(\theta)$.
This resembles stochastic gradient (SG) based algorithms in stochastic optimization (Robbins & Monro, 1951), which require $O(1/\epsilon^2)$ trajectories to obtain $\mathbb{E}[\|\nabla J(\theta)\|_2^2] \le \epsilon$. Due to the large variance of the stochastic gradient, the convergence of SG based methods can be rather sample inefficient when the required precision $\epsilon$ is very small.

To mitigate the negative effect of large variance on the convergence of SG methods, a large class of stochastic variance-reduced gradient (SVRG) algorithms were proposed for both convex (Johnson & Zhang, 2013; Xiao & Zhang, 2014; Harikandeh et al., 2015; Nguyen et al., 2017) and nonconvex (Allen-Zhu & Hazan, 2016; Reddi et al., 2016; Lei et al., 2017; Li & Li, 2018; Fang et al., 2018; Zhou et al., 2018) objective functions. SVRG provably achieves faster convergence in terms of the total number of stochastic gradient evaluations. These variance-reduced algorithms have since been applied to reinforcement learning in policy evaluation (Du et al., 2017), trust-region policy optimization (Xu et al., 2017) and policy gradient (Papini et al., 2018). In particular, Papini et al. (2018) recently proposed a stochastic variance-reduced policy gradient (SVRPG) algorithm that marries SVRG to policy gradient for reinforcement learning. The algorithm saves on sample computation and improves the performance of vanilla policy gradient methods based on SG. However, from a theoretical perspective, the authors only showed that SVRPG converges to a stationary point with $\mathbb{E}[\|\nabla J(\theta)\|_2^2] \le \epsilon$ within $O(1/\epsilon^2)$ stochastic gradient evaluations (trajectory samples), which in fact only matches the sample complexity of SG based policy gradient methods. This leaves open the important question:

Can SVRPG be provably better than SG based policy gradient methods?
We answer this question affirmatively and fill this gap between theory and practice in this paper. Specifically, we provide a sharp convergence analysis of SVRPG and show that it only requires $O(1/\epsilon^{5/3})$ stochastic gradient evaluations in order to converge to a stationary point $\theta$ of the performance function, i.e., $\mathbb{E}[\|\nabla J(\theta)\|_2^2] \le \epsilon$. This sample complexity of SVRPG is strictly lower than that of SG based policy gradient methods by a factor of $O(1/\epsilon^{1/3})$. By the same argument, our result is also better than the sample complexity proved in Papini et al. (2018) by a factor of $O(1/\epsilon^{1/3})$. The key ideas in our theoretical analysis are twofold: (i) we prove a key lemma that controls the variance of the importance weights introduced in SVRPG to deal with the non-stationarity of the sample distribution in reinforcement learning, which helps offset the additional variance introduced by importance sampling; and (ii) we provide a refined proof of the convergence of SVRPG and carefully investigate the trade-off between the convergence rate and the computational efficiency of SG methods, which enables us to choose a smaller batch size to reduce the sample complexity while maintaining the convergence rate. In addition, we demonstrate the advantage of SVRPG over GPOMDP and validate our theoretical results on the Cartpole and Mountain Car problems.

Notation
In this paper, scalars, vectors and matrices are denoted by lower case, lower case bold face, and upper case bold face letters respectively. We use $\|v\|_2$ and $\|A\|_2$ to denote the $\ell_2$-norm of a vector $v \in \mathbb{R}^d$ and the spectral norm of a matrix $A \in \mathbb{R}^{d \times d}$ respectively. We write $a_n = O(b_n)$ if $a_n \le C b_n$ for some constant $0 < C < +\infty$. For $\alpha > 0$, the Rényi divergence (Rényi, 1961) between distributions $P$ and $Q$ is
$$D_\alpha(P \,\|\, Q) = \frac{1}{\alpha - 1} \log_2 \int_x P(x) \bigg(\frac{P(x)}{Q(x)}\bigg)^{\alpha - 1} \mathrm{d}x,$$
which is non-negative for all $\alpha > 0$. The exponentiated Rényi divergence is defined as $d_\alpha(P \,\|\, Q) = 2^{D_\alpha(P \| Q)}$.

2 Related Work

In this section, we review additional relevant work that is not discussed in the introduction.

Deep RL models (Mnih et al., 2015) have been popular in solving complex problems such as robot locomotion, playing grandmaster-level Go, and safe autonomous driving (Levine et al., 2015; Silver et al., 2016; Shalev-Shwartz et al., 2016). Policy gradient (Sutton et al., 2000) is one of the most effective algorithms, where the policy is usually approximated by linear functions or nonlinear functions such as neural networks, and can be either stochastic or deterministic (Silver et al., 2014). One major drawback of traditional policy gradient methods such as REINFORCE (Williams, 1992), GPOMDP (Baxter & Bartlett, 2001) and TRPO (Schulman et al., 2015) is the large variance in the estimation of the gradient (Sehnke et al., 2010), which leads to poor convergence in practice. One way of reducing the variance in gradient estimation is to introduce various baselines as control variates (Weaver & Tao, 2001; Greensmith et al., 2004; Peters & Schaal, 2008; Gu et al., 2017; Tucker et al., 2018). Pirotta et al. (2013) proposed to use an adaptive step size to offset the effect of the variance of the policy. Papini et al. (2017) further studied the adaptive batch size used to approximate the gradient and proposed to jointly optimize the adaptive step size and batch size. Reducing the variance of policy gradient by importance sampling has also been extensively studied (Liu, 2008; Cortes et al., 2010). Metelli et al. (2018) reduced the variance caused by importance sampling by deriving a surrogate objective with a Rényi penalty.
3 Preliminaries

In this section, we introduce the preliminaries on reinforcement learning and policy gradient.
Markov Decision Process:
We will model the reinforcement learning task as a discrete-time Markov Decision Process (MDP) $M = \{\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma, \rho\}$, where $\mathcal{S}$ is the state space and $\mathcal{A}$ is the action space. $\mathcal{P}(s'|s,a)$ defines the probability that the agent transits to state $s'$ when taking action $a$ in state $s$. The reward function $\mathcal{R}(s,a): \mathcal{S} \times \mathcal{A} \mapsto [0, R]$ gives the reward after the agent takes action $a$ at state $s$, for some constant $R > 0$, and $\gamma \in (0,1)$ is the discount factor. $\rho$ is the initial state distribution. The probability that the agent chooses action $a$ at state $s$ is modeled by its policy $\pi(a|s)$. Following any stationary policy, the agent can observe and collect a trajectory $\tau = \{s_0, a_0, s_1, a_1, \ldots, s_{H-1}, a_{H-1}, s_H\}$, which is a sequence of state-action pairs, where $H$ is the trajectory horizon. Along with the state-action pairs, the agent also observes the cumulative discounted reward
$$\mathcal{R}(\tau) = \sum_{h=0}^{H-1} \gamma^h \mathcal{R}(s_h, a_h). \quad (3.1)$$
Policy Gradients:

Suppose that the policy $\pi$ is parameterized by an unknown parameter $\theta \in \mathbb{R}^d$ and denoted by $\pi_\theta$. We denote the distribution induced by policy $\pi_\theta$ as $p(\tau|\pi_\theta)$, also referred to as $p(\tau|\theta)$ for simplicity. Then
$$p(\tau|\theta) = \rho(s_0) \prod_{h=0}^{H-1} \pi_\theta(a_h|s_h) \mathcal{P}(s_{h+1}|s_h, a_h). \quad (3.2)$$
To measure the performance of a given policy $\pi_\theta$, we define the expected total reward under this policy as $J(\theta) = \mathbb{E}_{\tau \sim p(\cdot|\theta)}[\mathcal{R}(\tau) \,|\, M]$. Taking the gradient of $J(\theta)$ with respect to $\theta$ gives
$$\nabla_\theta J(\theta) = \int_\tau \mathcal{R}(\tau) \nabla_\theta p(\tau|\theta) \mathrm{d}\tau = \int_\tau \mathcal{R}(\tau) \frac{\nabla_\theta p(\tau|\theta)}{p(\tau|\theta)} p(\tau|\theta) \mathrm{d}\tau = \mathbb{E}_{\tau \sim p(\cdot|\theta)}\big[\nabla_\theta \log p(\tau|\theta) \mathcal{R}(\tau) \,|\, M\big]. \quad (3.3)$$
We can update the policy by running gradient ascent based algorithms on $\theta$. However, it is impossible to calculate the full gradient in reinforcement learning. In particular, policy gradient samples a batch of trajectories $\{\tau_i\}_{i=1}^N$ to approximate the full gradient in (3.3). At the $k$-th iteration, the policy is then updated by
$$\theta_{k+1} = \theta_k + \eta \widehat{\nabla}_N J(\theta_k), \quad (3.4)$$
where $\eta > 0$ is the step size and the estimated gradient $\widehat{\nabla}_N J(\theta_k)$ is an approximation of (3.3) based on trajectories $\{\tau_i\}_{i=1}^N$, defined as
$$\widehat{\nabla}_N J(\theta) = \frac{1}{N} \sum_{i=1}^N \nabla_\theta \log p(\tau_i|\theta) \mathcal{R}(\tau_i).$$
According to (3.2), $\nabla_\theta \log p(\tau_i|\theta)$ is independent of the transition probability $\mathcal{P}$. Therefore, combining this with (3.1) yields
$$\widehat{\nabla}_N J(\theta) = \frac{1}{N} \sum_{i=1}^N \bigg[\sum_{h=0}^{H-1} \nabla_\theta \log \pi_\theta(a_h^i|s_h^i)\bigg] \bigg[\sum_{h=0}^{H-1} \gamma^h \mathcal{R}(s_h^i, a_h^i)\bigg] =: \frac{1}{N} \sum_{i=1}^N g(\tau_i|\theta),$$
where $\tau_i = \{s_0^i, a_0^i, s_1^i, a_1^i, \ldots, s_{H-1}^i, a_{H-1}^i, s_H^i\}$ for $i = 1, \ldots, N$ are sampled from policy $\pi_\theta$, and $g(\tau_i|\theta)$ is the unbiased gradient estimator based on sample $\tau_i$. Based on this estimator, we can obtain the most well-known gradient estimators for policy gradient, such as REINFORCE (Williams, 1992) and GPOMDP (Baxter & Bartlett, 2001). In particular, the REINFORCE estimator introduces an additional term $b$ as a constant baseline:
$$g(\tau_i|\theta) = \bigg[\sum_{h=0}^{H-1} \nabla_\theta \log \pi_\theta(a_h^i|s_h^i)\bigg] \bigg[\sum_{h=0}^{H-1} \gamma^h \mathcal{R}(s_h^i, a_h^i) - b\bigg]. \quad (3.5)$$
GPOMDP refines REINFORCE based on the fact that the current action does not affect previous rewards:
$$g(\tau_i|\theta) = \sum_{h=0}^{H-1} \bigg(\sum_{t=0}^{h} \nabla_\theta \log \pi_\theta(a_t^i|s_t^i)\bigg) \big(\gamma^h \mathcal{R}(s_h^i, a_h^i) - b_h\big). \quad (3.6)$$
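For concreteness, the following sketch evaluates the two estimators (3.5) and (3.6) for a single trajectory, assuming the per-step score vectors $\nabla_\theta \log \pi_\theta(a_h|s_h)$ and rewards have already been collected (e.g., from an autodiff framework); the array layout is our own assumption, not prescribed by the paper:

```python
import numpy as np

def reinforce_estimator(grad_log_pi, rewards, gamma, b=0.0):
    """REINFORCE estimator (3.5) for one trajectory.

    grad_log_pi: (H, d) array, row h is grad_theta log pi_theta(a_h | s_h).
    rewards:     (H,) array, entry h is R(s_h, a_h).
    Returns g(tau | theta), a (d,) vector.
    """
    H = len(rewards)
    total_return = np.sum(gamma ** np.arange(H) * rewards)  # sum_h gamma^h R(s_h, a_h)
    return grad_log_pi.sum(axis=0) * (total_return - b)

def gpomdp_estimator(grad_log_pi, rewards, gamma, b=0.0):
    """GPOMDP estimator (3.6): the reward at step h is weighted only by the
    scores of actions taken up to step h (constant per-step baseline b here)."""
    H, d = grad_log_pi.shape
    cum_scores = np.cumsum(grad_log_pi, axis=0)           # sum_{t <= h} grad log pi
    discounted = (gamma ** np.arange(H)) * rewards - b    # gamma^h R(s_h, a_h) - b_h
    return (cum_scores * discounted[:, None]).sum(axis=0)
```

Both estimators are unbiased for $\nabla J(\theta)$; GPOMDP typically has lower variance because future actions receive no credit for past rewards.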
4 Algorithm

In each iteration of the gradient ascent update (3.4), policy gradient methods need to sample a batch of trajectories to estimate the expected gradient. This subsampling introduces a high variance and undermines the convergence speed of the algorithm. Inspired by the success of stochastic variance-reduced gradient (SVRG) techniques in stochastic optimization (Johnson & Zhang, 2013; Reddi et al., 2016; Allen-Zhu & Hazan, 2016), Papini et al. (2018) proposed a stochastic variance-reduced policy gradient (SVRPG) method, which is displayed in Algorithm 1.

SVRPG consists of multiple epochs. At the beginning of the $s$-th epoch, it treats the current policy as a reference point, denoted by $\widetilde{\theta}^s = \theta_0^{s+1}$. It then computes a gradient estimator $\mu^s = 1/N \sum_{i=1}^N g(\tau_i|\widetilde{\theta}^s)$ based on $N$ trajectories $\{\tau_i\}_{i=1}^N$ sampled from the current policy, where $g(\tau_i|\widetilde{\theta}^s)$ is the REINFORCE or GPOMDP estimator. At the $t$-th iteration within the $s$-th epoch, SVRPG samples $B$ trajectories $\{\tau_j\}_{j=1}^B$ based on the current policy $\theta_t^{s+1}$. It then updates the policy based on the following semi-stochastic gradient:
$$v_t^{s+1} = \frac{1}{B}\sum_{j=1}^B g(\tau_j|\theta_t^{s+1}) + \mu^s - \frac{1}{B}\sum_{j=1}^B \omega(\tau_j|\widetilde{\theta}^s, \theta_t^{s+1})\, g(\tau_j|\widetilde{\theta}^s), \quad (4.1)$$
where the last two terms serve as a correction to the subsampled gradient estimator, which reduces the variance and improves the convergence rate of Algorithm 1. It is worth noting that the semi-stochastic gradient in (4.1) differs from the common one used in SVRG due to the additional term $\omega(\tau|\widetilde{\theta}^s, \theta_t^{s+1}) = p(\tau|\widetilde{\theta}^s)/p(\tau|\theta_t^{s+1})$, which is called the importance sampling weight from $p(\tau|\theta_t^{s+1})$ to $p(\tau|\widetilde{\theta}^s)$. This term is important in reinforcement learning due to the non-stationarity of the distribution of $\tau$. Specifically, $\{\tau_i\}_{i=1}^N$ are sampled from $\widetilde{\theta}^s$, while $\{\tau_j\}_{j=1}^B$ are sampled from $\theta_t^{s+1}$. Nevertheless, we have
$$\mathbb{E}_{\pi_{\theta_t^{s+1}}}\big[\omega(\cdot|\widetilde{\theta}^s, \theta_t^{s+1})\, g(\cdot|\widetilde{\theta}^s)\big] = \mathbb{E}_{\pi_{\widetilde{\theta}^s}}\big[g(\cdot|\widetilde{\theta}^s)\big],$$
which ensures that the correction term has mean zero and thus $v_t^{s+1}$ is an unbiased gradient estimator.
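This identity is just a change of measure, and it can be checked numerically. In the toy sketch below, two one-dimensional Gaussians stand in for the trajectory distributions $p(\cdot|\widetilde{\theta}^s)$ and $p(\cdot|\theta_t^{s+1})$, and an arbitrary test function stands in for $g$; all concrete values are our illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_old, mu_cur, sigma = 0.0, 0.7, 1.0   # stand-ins for theta_tilde and theta_t

def density(x, mu):                      # N(mu, sigma^2) density
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

g = lambda x: np.sin(x) + x ** 2         # arbitrary test statistic standing in for g(tau)

x_cur = rng.normal(mu_cur, sigma, 1_000_000)          # samples from the current policy
w = density(x_cur, mu_old) / density(x_cur, mu_cur)   # importance weights omega
x_old = rng.normal(mu_old, sigma, 1_000_000)          # samples from the old policy

print(np.mean(w * g(x_cur)))   # ~ 1.0 : reweighted expectation under the current policy
print(np.mean(g(x_old)))       # ~ 1.0 : direct expectation under the old policy
```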
Algorithm 1 SVRPG
Input: number of epochs $S$, epoch size $m$, step size $\eta$, batch size $N$, mini-batch size $B$, gradient estimator $g$, initial parameter $\theta_m^0 := \widetilde{\theta}^0 := \theta_0$
for $s = 0, \ldots, S-1$ do
    $\theta_0^{s+1} = \widetilde{\theta}^s = \theta_m^s$
    Sample $N$ trajectories $\{\tau_i\}$ from $p(\cdot\,|\,\widetilde{\theta}^s)$
    $\mu^s = \widehat{\nabla}_N J(\widetilde{\theta}^s) := \frac{1}{N}\sum_{i=1}^N g(\tau_i|\widetilde{\theta}^s)$
    for $t = 0, \ldots, m-1$ do
        Sample $B$ trajectories $\{\tau_j\}$ from $p(\cdot\,|\,\theta_t^{s+1})$
        $v_t^{s+1} = \mu^s + \frac{1}{B}\sum_{j=1}^B \big(g(\tau_j|\theta_t^{s+1}) - \omega(\tau_j|\widetilde{\theta}^s, \theta_t^{s+1})\, g(\tau_j|\widetilde{\theta}^s)\big)$
        $\theta_{t+1}^{s+1} = \theta_t^{s+1} + \eta\, v_t^{s+1}$
    end for
end for
return $\theta_{\mathrm{out}}$: picked uniformly from $\{\theta_t^{s+1}\}$ for $t = 0, \ldots, m-1$; $s = 0, \ldots, S-1$
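To make the control flow of Algorithm 1 concrete, here is a compact sketch in Python. The sampler `sample_trajectories(theta, n)`, the estimator `g(tau, theta)` (REINFORCE or GPOMDP as above) and the trajectory log-density `log_prob(tau, theta)` are hypothetical interfaces, not part of the paper:

```python
import numpy as np

def svrpg(theta0, S, m, eta, N, B, sample_trajectories, g, log_prob):
    """Sketch of Algorithm 1 (SVRPG)."""
    theta, iterates = theta0.copy(), []
    for s in range(S):
        theta_snap = theta.copy()                       # reference point tilde{theta}^s
        snap_trajs = sample_trajectories(theta_snap, N)
        mu = np.mean([g(tau, theta_snap) for tau in snap_trajs], axis=0)
        for t in range(m):
            iterates.append(theta.copy())
            trajs = sample_trajectories(theta, B)
            corr = np.mean(
                [g(tau, theta)
                 - np.exp(log_prob(tau, theta_snap) - log_prob(tau, theta))  # omega
                 * g(tau, theta_snap)
                 for tau in trajs], axis=0)
            v = mu + corr                               # semi-stochastic gradient (4.1)
            theta = theta + eta * v                     # gradient ascent step
    return iterates[np.random.randint(len(iterates))]   # theta_out picked uniformly
```

Note that the importance weight is computed in log space for numerical stability.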
5 Main Theory

In this section, we provide a sharp analysis of Algorithm 1. We first lay down the following common assumption on the log-density of the policy function.

Assumption 5.1.
Let $\pi_\theta(a|s)$ be the policy of an agent at state $s$. There exist constants $G, M > 0$ such that the log-density of the policy function satisfies
$$\|\nabla_\theta \log \pi_\theta(a|s)\|_2 \le G, \qquad \|\nabla_\theta^2 \log \pi_\theta(a|s)\|_2 \le M,$$
for all $a \in \mathcal{A}$ and $s \in \mathcal{S}$.

In many real-world problems, we require the policy parameterization to change smoothly over time instead of drastically. Assumption 5.1 is an important condition in nonconvex optimization (Reddi et al., 2016; Allen-Zhu & Hazan, 2016), as it guarantees the smoothness of the objective function $J(\theta)$. Our assumption is slightly different from that in Papini et al. (2018), which assumes that $\partial \log \pi_\theta(a|s) / \partial \theta_i$ and $\partial^2 \log \pi_\theta(a|s) / \partial \theta_i \partial \theta_j$ are upper bounded elementwise. It can be easily verified that our Assumption 5.1 is milder than theirs. It should also be noted that although in reinforcement learning we make the assumptions on the parameterized policy, there is no difference in imposing the smoothness assumption on the performance function $J(\theta)$ directly.
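As a concrete instance, a Gaussian policy $\pi_\theta(a|s) = \mathcal{N}(\theta^\top \phi(s), \widetilde{\sigma}^2)$ has $\nabla_\theta \log \pi_\theta(a|s) = (a - \theta^\top \phi(s))\phi(s)/\widetilde{\sigma}^2$ and $\nabla_\theta^2 \log \pi_\theta(a|s) = -\phi(s)\phi(s)^\top/\widetilde{\sigma}^2$, so Assumption 5.1 holds once features, actions and parameters are restricted to bounded sets. The sketch below checks the bounds numerically under hypothetical bounds on $\|\phi\|_2$, $|a|$ and $\|\theta\|_2$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, sigma_t = 4, 0.5                            # feature dimension and fixed policy std
phi_max, a_max, theta_max = 1.0, 2.0, 3.0      # hypothetical bounds on ||phi||, |a|, ||theta||

G = (a_max + theta_max * phi_max) * phi_max / sigma_t ** 2  # bound on the score norm
M = phi_max ** 2 / sigma_t ** 2                             # bound on the Hessian spectral norm

for _ in range(10_000):
    phi = rng.normal(size=d)
    phi *= rng.uniform(0, phi_max) / np.linalg.norm(phi)     # random feature, ||phi|| <= phi_max
    theta = rng.normal(size=d)
    theta *= rng.uniform(0, theta_max) / np.linalg.norm(theta)
    a = rng.uniform(-a_max, a_max)
    grad = (a - theta @ phi) * phi / sigma_t ** 2            # grad_theta log pi
    hess = -np.outer(phi, phi) / sigma_t ** 2                # grad_theta^2 log pi
    assert np.linalg.norm(grad) <= G + 1e-9                  # Assumption 5.1, first bound
    assert np.linalg.norm(hess, 2) <= M + 1e-9               # Assumption 5.1, second bound
```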
In fact, Assumption 5.1 implies the following proposition on $J(\theta)$.

Proposition 5.2.

Under Assumption 5.1, $J(\theta)$ is $L$-smooth with $L = HR(M + HG^2)/(1 - \gamma)$. In addition, let $g(\tau|\theta)$ be the REINFORCE or GPOMDP gradient estimator. Then for all $\theta_1, \theta_2 \in \mathbb{R}^d$ it holds that $\|g(\tau|\theta_1) - g(\tau|\theta_2)\|_2 \le L_g \|\theta_1 - \theta_2\|_2$, and $\|g(\tau|\theta)\|_2 \le C_g$ for all $\theta \in \mathbb{R}^d$, where $L_g = HM(R + |b|)/(1 - \gamma)$, $C_g = HG(R + |b|)/(1 - \gamma)$ and $b$ is the baseline reward.
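Since the constants in Proposition 5.2 are explicit, the step-size cap $\eta \le 1/(4L)$ used in Theorem 5.5 below can be computed directly; a small helper, with illustrative (hypothetical) parameter values:

```python
def smoothness_constants(H, R, G, M, gamma, b=0.0):
    """Constants from Proposition 5.2."""
    L   = H * R * (M + H * G ** 2) / (1 - gamma)   # smoothness constant of J
    L_g = H * M * (R + abs(b)) / (1 - gamma)       # Lipschitz constant of the estimator g
    C_g = H * G * (R + abs(b)) / (1 - gamma)       # uniform bound on ||g||
    return L, L_g, C_g

# illustrative (hypothetical) problem parameters
L, L_g, C_g = smoothness_constants(H=100, R=1.0, G=2.0, M=4.0, gamma=0.99)
eta_max = 1 / (4 * L)   # largest step size permitted by Theorem 5.5
```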
The next assumption requires that the variance of the gradient estimator is bounded.

Assumption 5.3.

There exists a constant $\sigma > 0$ such that $\mathrm{Var}(g(\tau|\theta)) \le \sigma^2$ for all policies $\pi_\theta$.

The above assumption is widely made in stochastic optimization. It can be easily verified for Gaussian policies with the REINFORCE estimator (Zhao et al., 2011; Pirotta et al., 2013; Papini et al., 2018). The following assumption is needed due to the non-stationarity of the sample distribution, and is also made in Papini et al. (2018).
Assumption 5.4.
There is a constant $W < \infty$ such that for each pair of policies encountered in Algorithm 1,
$$\mathrm{Var}\big(\omega(\tau|\theta_1, \theta_2)\big) \le W, \qquad \forall\, \theta_1, \theta_2 \in \mathbb{R}^d,\ \tau \sim p(\cdot|\theta_2).$$

We now present our convergence result for SVRPG.
Theorem 5.5.
Under Assumptions 5.1, 5.3 and 5.4, suppose that in Algorithm 1 the step size satisfies $\eta \le 1/(4L)$ and the epoch length $m$ and mini-batch size $B$ satisfy
$$\frac{B}{m^2} \ge \frac{3(C_\omega C_g^2 + L_g^2)}{L^2},$$
where $C_\omega = H(2HG^2 + M)(W + 1)$, and $L_g$, $C_g$ and $L$ are defined in Proposition 5.2. Then the output of Algorithm 1 satisfies
$$\mathbb{E}\big[\|\nabla J(\theta_{\mathrm{out}})\|_2^2\big] \le \frac{8(J(\theta^*) - J(\theta_0))}{\eta S m} + \frac{6\sigma^2}{N},$$
where $\theta^*$ is a maximizer of $J(\theta)$.

Remark 5.6.
Let $T = Sm$ be the total number of iterations Algorithm 1 needs to achieve $\mathbb{E}[\|\nabla J(\theta_{\mathrm{out}})\|_2^2] \le \epsilon$. The first term on the right hand side in Theorem 5.5 gives an $O(1/T)$ convergence rate, which matches that of Papini et al. (2018) and the results in nonconvex optimization (Allen-Zhu & Hazan, 2016; Reddi et al., 2016).

Table 1: Comparison of the sample complexity required to achieve $\mathbb{E}[\|\nabla J(\theta)\|_2^2] \le \epsilon$.

METHOD                          COMPLEXITY
SG                              $O(1/\epsilon^2)$
SVRPG (Papini et al., 2018)     $O(1/\epsilon^2)$
SVRPG (this paper)              $O(1/\epsilon^{5/3})$

The second term $O(1/N)$ comes from the full gradient approximation at the beginning of each epoch in Algorithm 1. Compared with the result in Papini et al. (2018), Theorem 5.5 does not have an additional $O(1/B)$ term, which is offset by our careful analysis of the variance of the importance weights. This also enables us to choose a much smaller batch size $B$ in the inner loops of Algorithm 1 and leads to a lower sample complexity. Based on Theorem 5.5, we can calculate the total number of trajectory samples Algorithm 1 requires to achieve $\epsilon$-precision.

Corollary 5.7.
Under the same conditions as in Theorem 5.5, let $\epsilon > 0$. If we set $\eta = 1/(4L)$, $N = O(1/\epsilon)$, $B = O(1/\epsilon^{2/3})$ and $m = \sqrt{B}$, then Algorithm 1 needs $O(1/\epsilon^{5/3})$ trajectories in order to achieve $\mathbb{E}[\|\nabla J(\theta_{\mathrm{out}})\|_2^2] \le \epsilon$.
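As a sanity check on these rates (with all big-O constants set to 1 for illustration), the total trajectory count $T_g = SN + SmB$ indeed scales as $\epsilon^{-5/3}$ under the prescribed parameter choices:

```python
import numpy as np

for eps in [1e-1, 1e-2, 1e-3, 1e-4]:
    N = 1 / eps                 # outer batch size, O(1/eps)
    B = N ** (2 / 3)            # mini-batch size, O(1/eps^{2/3})
    m = np.sqrt(B)              # epoch length m = sqrt(B)
    S = (1 / eps) / m           # number of epochs, from Sm = O(1/eps)
    T_g = S * N + S * m * B     # total trajectories sampled
    print(f"eps={eps:.0e}  T_g={T_g:.3e}  T_g * eps^(5/3) = {T_g * eps ** (5 / 3):.2f}")
# the last column is constant (2.00), i.e. T_g = O(1/eps^{5/3})
```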
Remark 5.8.

In Theorem 4.4 of Papini et al. (2018), the authors showed that the sample complexity of SVRPG is $O((B + N/m)/\epsilon)$. In order to make the gradient small enough, they essentially require $B, N = O(1/\epsilon)$, which leads to $O(1/\epsilon^2)$ sample complexity. In sharp contrast, our Corollary 5.7 shows that SVRPG only needs $O(1/\epsilon^{5/3})$ trajectories to achieve $\mathbb{E}[\|\nabla J(\theta)\|_2^2] \le \epsilon$, which is strictly lower than the sample complexity proved in Papini et al. (2018). Table 1 presents a direct comparison of the sample complexities of the different methods, where SG represents vanilla stochastic gradient based methods such as REINFORCE and GPOMDP. It can be seen from Table 1 that our analysis yields the lowest complexity.

6 Proof of the Main Theory

In this section, we prove our main theoretical results.
Before we provide the proof of Theorem 5.5, we first lay down the following key lemma, which controls the variance of the importance sampling weights $\omega(\tau|\widetilde{\theta}^s, \theta_t^{s+1})$.

Lemma 6.1.
Let $\omega(\tau|\widetilde{\theta}^s, \theta_t^{s+1}) = p(\tau|\widetilde{\theta}^s)/p(\tau|\theta_t^{s+1})$. Under Assumptions 5.1 and 5.4, it holds that
$$\mathrm{Var}\big(\omega(\tau|\widetilde{\theta}^s, \theta_t^{s+1})\big) \le C_\omega \|\widetilde{\theta}^s - \theta_t^{s+1}\|_2^2,$$
where $C_\omega = H(2HG^2 + M)(W + 1)$.

Lemma 6.1 shows that the variance of the importance weight is proportional to the squared distance between the behavioral and the target policies. Note that this upper bound could be trivial, given Assumption 5.4, when the distance is large. However, Lemma 6.1 provides a fine-grained control of the variance when the behavioral and target policies are sufficiently close.
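The lemma can be made concrete in a Gaussian example (our illustration, not from the paper): for $P = \mathcal{N}(\theta_1, \widetilde{\sigma}^2)$ and $Q = \mathcal{N}(\theta_2, \widetilde{\sigma}^2)$, a direct computation gives $\mathbb{E}_Q[\omega^2] = \exp((\theta_1 - \theta_2)^2/\widetilde{\sigma}^2)$, so $\mathrm{Var}(\omega) = \exp(\Delta^2/\widetilde{\sigma}^2) - 1 \approx \Delta^2/\widetilde{\sigma}^2$ for small parameter distance $\Delta$, exactly the quadratic dependence the lemma describes:

```python
import numpy as np

sigma = 1.0
for delta in [0.5, 0.1, 0.02]:                       # parameter distance |theta1 - theta2|
    var_omega = np.exp(delta ** 2 / sigma ** 2) - 1  # Var(omega) = d_2(P||Q) - 1, closed form
    print(f"delta={delta:<5} Var(omega)={var_omega:.6f}  ratio={var_omega / delta ** 2:.4f}")
# the ratio Var(omega)/delta^2 tends to 1/sigma^2: the variance of the
# importance weight is controlled by the squared parameter distance
```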
Now we are ready to present the proof of our main theorem, which is inspired by that of Li & Li (2018).

Proof of Theorem 5.5.

By Proposition 5.2, $J(\theta)$ is $L$-smooth, which leads to
$$\begin{aligned}
J(\theta_{t+1}^{s+1}) &\ge J(\theta_t^{s+1}) + \big\langle \nabla J(\theta_t^{s+1}),\, \theta_{t+1}^{s+1} - \theta_t^{s+1} \big\rangle - \frac{L}{2}\|\theta_{t+1}^{s+1} - \theta_t^{s+1}\|_2^2 \\
&= J(\theta_t^{s+1}) + \big\langle \nabla J(\theta_t^{s+1}) - v_t^{s+1},\, \eta v_t^{s+1} \big\rangle + \eta\|v_t^{s+1}\|_2^2 - \frac{L}{2}\|\theta_{t+1}^{s+1} - \theta_t^{s+1}\|_2^2 \\
&\ge J(\theta_t^{s+1}) - \frac{\eta}{2}\|\nabla J(\theta_t^{s+1}) - v_t^{s+1}\|_2^2 + \frac{\eta}{2}\|v_t^{s+1}\|_2^2 - \frac{L}{2}\|\theta_{t+1}^{s+1} - \theta_t^{s+1}\|_2^2 \\
&\ge J(\theta_t^{s+1}) - \frac{3\eta}{4}\|\nabla J(\theta_t^{s+1}) - v_t^{s+1}\|_2^2 + \bigg[\frac{1}{4\eta} - \frac{L}{2}\bigg]\|\theta_{t+1}^{s+1} - \theta_t^{s+1}\|_2^2 + \frac{\eta}{8}\|\nabla J(\theta_t^{s+1})\|_2^2, \quad (6.1)
\end{aligned}$$
where the second inequality holds due to Young's inequality and the last inequality comes from the fact that $\|\nabla J(\theta_t^{s+1})\|_2^2 \le 2\|v_t^{s+1}\|_2^2 + 2\|\nabla J(\theta_t^{s+1}) - v_t^{s+1}\|_2^2$.

Let $\mathbb{E}_{N,B}$ denote the expectation only over the randomness of the sampled trajectories $\{\tau_i\}_{i=1}^N$ and $\{\tau_j\}_{j=1}^B$. Then
$$\begin{aligned}
\mathbb{E}_{N,B}\|\nabla J(\theta_t^{s+1}) - v_t^{s+1}\|_2^2
&= \mathbb{E}_{N,B}\bigg\|\nabla J(\theta_t^{s+1}) - \mu^s + \frac{1}{B}\sum_{j=1}^B \big(\omega(\tau_j|\widetilde{\theta}^s, \theta_t^{s+1})\, g(\tau_j|\widetilde{\theta}^s) - g(\tau_j|\theta_t^{s+1})\big)\bigg\|_2^2 \\
&= \mathbb{E}_{N,B}\bigg\|\nabla J(\theta_t^{s+1}) - \nabla J(\widetilde{\theta}^s) + \frac{1}{B}\sum_{j=1}^B \big(\omega(\tau_j|\widetilde{\theta}^s, \theta_t^{s+1})\, g(\tau_j|\widetilde{\theta}^s) - g(\tau_j|\theta_t^{s+1})\big)\bigg\|_2^2 \\
&\qquad + \mathbb{E}_{N,B}\bigg\|\nabla J(\widetilde{\theta}^s) - \frac{1}{N}\sum_{i=1}^N g(\tau_i|\widetilde{\theta}^s)\bigg\|_2^2 \quad (6.2) \\
&= \frac{1}{B^2}\sum_{j=1}^B \mathbb{E}_{N,B}\big\|\nabla J(\theta_t^{s+1}) - \nabla J(\widetilde{\theta}^s) + \omega(\tau_j|\widetilde{\theta}^s, \theta_t^{s+1})\, g(\tau_j|\widetilde{\theta}^s) - g(\tau_j|\theta_t^{s+1})\big\|_2^2 \\
&\qquad + \frac{1}{N^2}\sum_{i=1}^N \mathbb{E}_{N,B}\big\|\nabla J(\widetilde{\theta}^s) - g(\tau_i|\widetilde{\theta}^s)\big\|_2^2 \quad (6.3) \\
&\le \frac{1}{B^2}\sum_{j=1}^B \mathbb{E}_{N,B}\big\|\omega(\tau_j|\widetilde{\theta}^s, \theta_t^{s+1})\, g(\tau_j|\widetilde{\theta}^s) - g(\tau_j|\theta_t^{s+1})\big\|_2^2 + \frac{\sigma^2}{N}, \quad (6.4)
\end{aligned}$$
where (6.2) holds due to the independence between the trajectories $\{\tau_i\}_{i=1}^N$ and $\{\tau_j\}_{j=1}^B$, (6.3) follows from $\mathbb{E}\|x_1 + \ldots + x_n\|_2^2 = \mathbb{E}\|x_1\|_2^2 + \ldots + \mathbb{E}\|x_n\|_2^2$ for independent zero-mean variables $x_1, \ldots, x_n$, and (6.4) follows from Assumption 5.3 and the fact that $\mathbb{E}\|x - \mathbb{E}x\|_2^2 \le \mathbb{E}\|x\|_2^2$.

Note that we have
$$\begin{aligned}
\mathbb{E}_{N,B}\big\|\omega(\tau_j|\widetilde{\theta}^s, \theta_t^{s+1})\, g(\tau_j|\widetilde{\theta}^s) - g(\tau_j|\theta_t^{s+1})\big\|_2^2
&\le 2\,\mathbb{E}_{N,B}\big\|\big(\omega(\tau_j|\widetilde{\theta}^s, \theta_t^{s+1}) - 1\big) g(\tau_j|\widetilde{\theta}^s)\big\|_2^2 + 2\,\mathbb{E}_{N,B}\big\|g(\tau_j|\widetilde{\theta}^s) - g(\tau_j|\theta_t^{s+1})\big\|_2^2 \\
&\le 2C_g^2\,\mathbb{E}_{N,B}\big|\omega(\tau_j|\widetilde{\theta}^s, \theta_t^{s+1}) - 1\big|^2 + 2L_g^2\big\|\widetilde{\theta}^s - \theta_t^{s+1}\big\|_2^2, \quad (6.5)
\end{aligned}$$
where the second inequality comes from Proposition 5.2. By Lemma 6.1, we have
$$\mathbb{E}_{N,B}\big|\omega(\tau_j|\widetilde{\theta}^s, \theta_t^{s+1}) - 1\big|^2 = \mathrm{Var}\big(\omega(\tau_j|\widetilde{\theta}^s, \theta_t^{s+1})\big) \le C_\omega\big\|\theta_t^{s+1} - \widetilde{\theta}^s\big\|_2^2, \quad (6.6)$$
where $C_\omega = H(2HG^2 + M)(W + 1)$. Substituting the results in (6.4), (6.5) and (6.6) into (6.1) yields
$$\begin{aligned}
\mathbb{E}_{N,B}\big[J(\theta_{t+1}^{s+1})\big] &\ge \mathbb{E}_{N,B}\big[J(\theta_t^{s+1})\big] + \frac{\eta}{8}\mathbb{E}_{N,B}\big[\|\nabla J(\theta_t^{s+1})\|_2^2\big] + \bigg[\frac{1}{4\eta} - \frac{L}{2}\bigg]\mathbb{E}_{N,B}\big[\|\theta_{t+1}^{s+1} - \theta_t^{s+1}\|_2^2\big] \\
&\qquad - \frac{3\eta\sigma^2}{4N} - \frac{3\eta(C_\omega C_g^2 + L_g^2)}{2B}\mathbb{E}_{N,B}\big[\|\theta_t^{s+1} - \widetilde{\theta}^s\|_2^2\big]. \quad (6.7)
\end{aligned}$$
For ease of notation, we denote
$$\Psi = \frac{3(C_\omega C_g^2 + L_g^2)}{2B}. \quad (6.8)$$
By Young's inequality (the Peter-Paul inequality),
$$\|\theta_{t+1}^{s+1} - \widetilde{\theta}^s\|_2^2 \le (1 + \alpha)\|\theta_{t+1}^{s+1} - \theta_t^{s+1}\|_2^2 + (1 + 1/\alpha)\|\theta_t^{s+1} - \widetilde{\theta}^s\|_2^2$$
holds for any $\alpha > 0$. For $\eta \le 1/(2L)$ (so that $1/(4\eta) - L/2 \ge 0$), combining the above inequality with (6.7) and (6.8) yields
$$\begin{aligned}
\mathbb{E}_{N,B}\big[J(\theta_{t+1}^{s+1})\big] &\ge \mathbb{E}_{N,B}\big[J(\theta_t^{s+1})\big] + \frac{\eta}{8}\mathbb{E}_{N,B}\big[\|\nabla J(\theta_t^{s+1})\|_2^2\big] - \frac{3\eta\sigma^2}{4N} + \frac{1}{1+\alpha}\bigg[\frac{1}{4\eta} - \frac{L}{2}\bigg]\mathbb{E}_{N,B}\big[\|\theta_{t+1}^{s+1} - \widetilde{\theta}^s\|_2^2\big] \\
&\qquad - \bigg[\eta\Psi + \frac{1}{\alpha}\bigg[\frac{1}{4\eta} - \frac{L}{2}\bigg]\bigg]\mathbb{E}_{N,B}\big[\|\theta_t^{s+1} - \widetilde{\theta}^s\|_2^2\big].
\end{aligned}$$
Now we set $\alpha = 2t + 1$ and sum the above inequality over $t = 0, \ldots, m-1$. Note that $\theta_0^{s+1} = \widetilde{\theta}^s$ and $\theta_m^{s+1} = \widetilde{\theta}^{s+1}$. We are able to obtain
$$\begin{aligned}
\mathbb{E}_N\big[J(\widetilde{\theta}^{s+1})\big] &\ge \mathbb{E}_N\big[J(\widetilde{\theta}^s)\big] + \frac{\eta}{8}\sum_{t=0}^{m-1}\mathbb{E}_N\big[\|\nabla J(\theta_t^{s+1})\|_2^2\big] - \frac{3m\eta\sigma^2}{4N} \\
&\qquad + \sum_{t=0}^{m-1}\frac{1/(4\eta) - L/2}{2(t+1)}\mathbb{E}_N\big[\|\theta_{t+1}^{s+1} - \widetilde{\theta}^s\|_2^2\big] - \sum_{t=0}^{m-1}\bigg[\eta\Psi + \frac{1/(4\eta) - L/2}{2t+1}\bigg]\mathbb{E}_N\big[\|\theta_t^{s+1} - \widetilde{\theta}^s\|_2^2\big] \\
&= \mathbb{E}_N\big[J(\widetilde{\theta}^s)\big] + \frac{\eta}{8}\sum_{t=0}^{m-1}\mathbb{E}_N\big[\|\nabla J(\theta_t^{s+1})\|_2^2\big] - \frac{3m\eta\sigma^2}{4N} \\
&\qquad + \sum_{t=1}^{m-1}\bigg[\frac{1/(4\eta) - L/2}{2t(2t+1)} - \eta\Psi\bigg]\mathbb{E}_N\big[\|\theta_t^{s+1} - \widetilde{\theta}^s\|_2^2\big] + \frac{1/(4\eta) - L/2}{2m}\mathbb{E}_N\big[\|\theta_m^{s+1} - \widetilde{\theta}^s\|_2^2\big], \quad (6.9)
\end{aligned}$$
where the equality follows by reindexing the first sum and using $\theta_0^{s+1} = \widetilde{\theta}^s$. Recall the definition of $\Psi$ in (6.8). If we set the step size $\eta$, the mini-batch size $B$ and the epoch length $m$ to satisfy
$$\eta \le \frac{1}{4L}, \qquad \frac{B}{m^2} \ge \frac{3(C_\omega C_g^2 + L_g^2)}{L^2}, \quad (6.10)$$
then the last two terms in (6.9) are non-negative, and (6.9) leads to
$$\mathbb{E}_N\big[J(\widetilde{\theta}^{s+1})\big] \ge \mathbb{E}_N\big[J(\widetilde{\theta}^s)\big] - \frac{3m\eta\sigma^2}{4N} + \frac{\eta}{8}\sum_{t=0}^{m-1}\mathbb{E}_N\big[\|\nabla J(\theta_t^{s+1})\|_2^2\big].$$
Telescoping the above inequality over $s$ yields
$$\frac{\eta}{8}\sum_{s=0}^{S-1}\sum_{t=0}^{m-1}\mathbb{E}\big[\|\nabla J(\theta_t^{s+1})\|_2^2\big] \le \mathbb{E}\big[J(\widetilde{\theta}^S)\big] - \mathbb{E}\big[J(\widetilde{\theta}^0)\big] + \frac{3Sm\eta\sigma^2}{4N},$$
which immediately implies
$$\mathbb{E}\big[\|\nabla J(\theta_{\mathrm{out}})\|_2^2\big] \le \frac{8\big(\mathbb{E}[J(\widetilde{\theta}^S)] - \mathbb{E}[J(\widetilde{\theta}^0)]\big)}{\eta S m} + \frac{6\sigma^2}{N} \le \frac{8(J(\theta^*) - J(\theta_0))}{\eta S m} + \frac{6\sigma^2}{N}.$$
This completes the proof.
Proof of Corollary 5.7.
By Theorem 5.5, in order to ensure $\mathbb{E}[\|\nabla J(\theta_{\mathrm{out}})\|_2^2] \le \epsilon$, it suffices to ensure
$$\frac{8(J(\theta^*) - J(\theta_0))}{\eta S m} = \frac{\epsilon}{2}, \qquad \frac{6\sigma^2}{N} = \frac{\epsilon}{2},$$
which implies $Sm = O(1/\epsilon)$ and $N = O(1/\epsilon)$. Note that we have set $m = O(\sqrt{B})$. The total number of stochastic gradient evaluations $T_g$ we need is
$$T_g = SN + SmB = O\bigg(\frac{N}{\sqrt{B}\,\epsilon} + \frac{B}{\epsilon}\bigg) = O\bigg(\frac{1}{\epsilon^{5/3}}\bigg),$$
where we set $B = N^{2/3} = 1/\epsilon^{2/3}$.

6.1 Proofs of Technical Lemmas

In this subsection, we provide the proofs of the technical lemmas used in the proof of the main theory. We first prove the smoothness of $J(\theta)$.

Proof of Proposition 5.2.
Recall from (3.3) that
$$\nabla J(\theta) = \int_\tau \mathcal{R}(\tau) \nabla_\theta p(\tau|\theta) \mathrm{d}\tau,$$
which directly implies that the Hessian matrix is
$$\nabla^2 J(\theta) = \int_\tau \mathcal{R}(\tau) \nabla_\theta^2 p(\tau|\theta) \mathrm{d}\tau. \quad (6.11)$$
Note that the Hessian of the log-density function is
$$\nabla_\theta^2 \log p(\tau|\theta) = -p(\tau|\theta)^{-2}\nabla_\theta p(\tau|\theta)\nabla_\theta p(\tau|\theta)^\top + p(\tau|\theta)^{-1}\nabla_\theta^2 p(\tau|\theta). \quad (6.12)$$
Substituting (6.12) into (6.11) yields
$$\nabla^2 J(\theta) = \int_\tau p(\tau|\theta) \mathcal{R}(\tau)\big[\nabla_\theta^2 \log p(\tau|\theta) + \nabla_\theta \log p(\tau|\theta)\nabla_\theta \log p(\tau|\theta)^\top\big]\mathrm{d}\tau.$$
Therefore, we have
$$\|\nabla^2 J(\theta)\|_2 \le \int_\tau p(\tau|\theta)\mathcal{R}(\tau)\big[\|\nabla_\theta^2 \log p(\tau|\theta)\|_2 + \|\nabla_\theta \log p(\tau|\theta)\|_2^2\big]\mathrm{d}\tau \le \int_\tau p(\tau|\theta)\mathcal{R}(\tau)(HM + H^2G^2)\mathrm{d}\tau. \quad (6.13)$$
By (3.1), for any $\tau$ it holds that
$$\mathcal{R}(\tau) \le \frac{R(1 - \gamma^H)}{1 - \gamma} \le \frac{R}{1 - \gamma}.$$
Combining this with (6.13) yields $\|\nabla^2 J(\theta)\|_2 \le RH(M + HG^2)/(1 - \gamma)$, which means $J(\theta)$ is $L$-smooth with $L = RH(M + HG^2)/(1 - \gamma)$. Recall the REINFORCE estimator in (3.5):
$$g(\tau|\theta) = \bigg[\sum_{t=0}^{H-1}\nabla_\theta \log \pi_\theta(a_t|s_t)\bigg]\bigg[\sum_{t=0}^{H-1}\gamma^t \mathcal{R}(s_t, a_t) - b\bigg],$$
where $b$ is a constant baseline reward. Then we have
$$\|\nabla_\theta g(\tau|\theta)\|_2 \le \bigg[\sum_{t=0}^{H-1}\big\|\nabla_\theta^2 \log \pi_\theta(a_t|s_t)\big\|_2\bigg]\frac{R + |b|}{1 - \gamma} \le \frac{HM(R + |b|)}{1 - \gamma}.$$
Similarly, we have
$$\|g(\tau|\theta)\|_2 \le HG\bigg[\frac{R(1 - \gamma^H)}{1 - \gamma} + |b|\bigg] \le \frac{HG(R + |b|)}{1 - \gamma}.$$
The proof for the GPOMDP estimator is similar and we omit it for simplicity. This completes the proof.

The analysis of Lemma 6.1 relies on the following important properties of importance sampling weights.
Lemma 6.2 (Lemma 1 in Cortes et al. (2010)).

Let $\omega(x) = P(x)/Q(x)$ be the importance weight for distributions $P$ and $Q$. Then the following identities hold:
$$\mathbb{E}[\omega] = 1, \qquad \mathbb{E}[\omega^2] = d_2(P \,\|\, Q),$$
where $d_2(P \,\|\, Q) = 2^{D_2(P \| Q)}$ and $D_2(P \,\|\, Q)$ is the Rényi divergence between distributions $P$ and $Q$. Note that this immediately implies $\mathrm{Var}(\omega) = d_2(P \,\|\, Q) - 1$.

Proof of Lemma 6.1.

According to Lemma 6.2, we have
$$\mathrm{Var}\big(\omega(\tau|\widetilde{\theta}^s, \theta_t^{s+1})\big) = d_2\big(p(\tau|\widetilde{\theta}^s)\,\|\,p(\tau|\theta_t^{s+1})\big) - 1.$$
In the rest of this proof, we denote $\theta_1 = \widetilde{\theta}^s$ and $\theta_2 = \theta_t^{s+1}$ to simplify the notation. By definition, we have
$$d_2(p(\tau|\theta_1)\,\|\,p(\tau|\theta_2)) = \int_\tau p(\tau|\theta_1)\frac{p(\tau|\theta_1)}{p(\tau|\theta_2)}\mathrm{d}\tau = \int_\tau p(\tau|\theta_1)^2 p(\tau|\theta_2)^{-1}\mathrm{d}\tau.$$
For any fixed $\theta_2 \in \mathbb{R}^d$, computing the gradient of $d_2(p(\tau|\theta)\,\|\,p(\tau|\theta_2))$ with respect to $\theta$ yields
$$\nabla_\theta\, d_2(p(\tau|\theta)\,\|\,p(\tau|\theta_2)) = 2\int_\tau p(\tau|\theta)\nabla_\theta p(\tau|\theta)\, p(\tau|\theta_2)^{-1}\mathrm{d}\tau,$$
which implies that if we set $\theta = \theta_2$, we obtain
$$\nabla_\theta\, d_2(p(\tau|\theta)\,\|\,p(\tau|\theta_2))\big|_{\theta = \theta_2} = 2\int_\tau \nabla_\theta p(\tau|\theta)\mathrm{d}\tau\Big|_{\theta = \theta_2} = 0.$$
Hence, applying the mean value theorem, we have
$$d_2(p(\tau|\theta_1)\,\|\,p(\tau|\theta_2)) = 1 + \frac{1}{2}(\theta_1 - \theta_2)^\top \nabla_\theta^2\, d_2(p(\tau|\bar{\theta})\,\|\,p(\tau|\theta_2))(\theta_1 - \theta_2), \quad (6.14)$$
where $\bar{\theta} = t\theta_1 + (1 - t)\theta_2$ for some $t \in [0, 1]$. Next, we compute the Hessian matrix. For any fixed $\theta_2$, we have
$$\begin{aligned}
\nabla_\theta^2\, d_2(p(\tau|\theta)\,\|\,p(\tau|\theta_2)) &= 2\int_\tau \nabla_\theta p(\tau|\theta)\nabla_\theta p(\tau|\theta)^\top p(\tau|\theta_2)^{-1}\mathrm{d}\tau + 2\int_\tau \nabla_\theta^2 p(\tau|\theta)\, p(\tau|\theta)\, p(\tau|\theta_2)^{-1}\mathrm{d}\tau \\
&= 2\int_\tau \nabla_\theta \log p(\tau|\theta)\nabla_\theta \log p(\tau|\theta)^\top \frac{p(\tau|\theta)^2}{p(\tau|\theta_2)}\mathrm{d}\tau + 2\int_\tau \nabla_\theta^2 p(\tau|\theta)\, p(\tau|\theta)\, p(\tau|\theta_2)^{-1}\mathrm{d}\tau. \quad (6.15)
\end{aligned}$$
Recall the Hessian of the log-density function in (6.12). Substituting (6.12) into (6.15) yields
$$\begin{aligned}
\big\|\nabla_\theta^2\, d_2(p(\tau|\theta)\,\|\,p(\tau|\theta_2))\big\|_2 &= \bigg\|4\int_\tau \nabla_\theta \log p(\tau|\theta)\nabla_\theta \log p(\tau|\theta)^\top \frac{p(\tau|\theta)^2}{p(\tau|\theta_2)}\mathrm{d}\tau + 2\int_\tau \nabla_\theta^2 \log p(\tau|\theta)\frac{p(\tau|\theta)^2}{p(\tau|\theta_2)}\mathrm{d}\tau\bigg\|_2 \\
&\le \int_\tau \frac{p(\tau|\theta)^2}{p(\tau|\theta_2)}\big(2\|\nabla_\theta^2 \log p(\tau|\theta)\|_2 + 4\|\nabla_\theta \log p(\tau|\theta)\|_2^2\big)\mathrm{d}\tau \\
&\le (4H^2G^2 + 2HM)\,\mathbb{E}\big[\omega(\tau|\theta, \theta_2)^2\big] \le 2H(2HG^2 + M)(W + 1),
\end{aligned}$$
where the second inequality comes from Assumption 5.1 and the last inequality is due to Assumption 5.4 and Lemma 6.2. Therefore, by (6.14) we have
$$\mathrm{Var}\big(\omega(\tau|\widetilde{\theta}^s, \theta_t^{s+1})\big) = d_2\big(p(\tau|\widetilde{\theta}^s)\,\|\,p(\tau|\theta_t^{s+1})\big) - 1 \le C_\omega\|\widetilde{\theta}^s - \theta_t^{s+1}\|_2^2,$$
where $C_\omega = H(2HG^2 + M)(W + 1)$. This completes the proof.
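Both identities in Lemma 6.2 are easy to verify by simulation; the sketch below compares Monte Carlo estimates of $\mathbb{E}[\omega]$ and $\mathbb{E}[\omega^2]$ for a pair of Gaussians (our illustrative choice) against $1$ and the closed-form $d_2(P\|Q) = \exp((\mu_P - \mu_Q)^2/\sigma^2)$:

```python
import numpy as np

rng = np.random.default_rng(2)
mu_p, mu_q, sigma, n = 0.3, 0.0, 1.0, 2_000_000

x = rng.normal(mu_q, sigma, n)                                      # samples from Q
w = np.exp(((x - mu_q) ** 2 - (x - mu_p) ** 2) / (2 * sigma ** 2))  # omega = P(x)/Q(x)

print(w.mean())                                  # ~ 1.0000 : E[omega] = 1
print((w ** 2).mean())                           # ~ 1.0942 : E[omega^2] = d_2(P||Q)
print(np.exp((mu_p - mu_q) ** 2 / sigma ** 2))   # closed form d_2 = exp(0.09) = 1.0942
```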
7 Experiments

In this section, we conduct experiments on reinforcement learning benchmark tasks, i.e., the Cartpole and (continuous) Mountain Car environments (Brockman et al., 2016), to evaluate the performance of Algorithm 1. We measure the performance of an algorithm in terms of the total number of sample trajectories it needs to achieve a certain reward. We compare SVRPG with vanilla stochastic gradient based algorithms: the REINFORCE (Williams, 1992) and GPOMDP (Baxter & Bartlett, 2001) algorithms. Recall that at each iteration of Algorithm 1, we also need to choose a stochastic gradient estimator to approximate the full gradient based on sampled trajectories. Since the performance of GPOMDP is always comparable to or better than that of REINFORCE, we only report the results of SVRPG with the GPOMDP estimator.

We follow the practical suggestions provided in Papini et al. (2018) to improve the performance, including (1) performing one initial gradient update immediately after sampling the $N$ trajectories in the outer loop; (2) using adaptive step sizes; and (3) using an adaptive epoch length (terminate the inner loop early if the step size used in the inner loop is smaller than that used in the outer loop). Following Papini et al. (2018), we use the following Gaussian policy with a fixed standard deviation $\widetilde{\sigma}$:
$$\pi_\theta(a|s) = \frac{1}{\sqrt{2\pi\widetilde{\sigma}^2}}\exp\bigg(-\frac{(\theta^\top\phi(s) - a)^2}{2\widetilde{\sigma}^2}\bigg),$$
where $\phi: \mathcal{S} \mapsto \mathbb{R}^d$ is a bounded feature map. Under the Gaussian policy, it is easy to verify that Assumptions 5.1 and 5.3 are satisfied with parameters depending on $\phi$, $\widetilde{\sigma}$ and the upper bound of the action $a$ for all $a \in \mathcal{A}$.
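A minimal implementation of this policy (with a hypothetical feature map `phi`; the experiments below use small neural networks instead) exposes exactly the pieces the algorithm needs, namely actions, scores and log-densities:

```python
import numpy as np

class GaussianPolicy:
    """pi_theta(a|s) = N(theta^T phi(s), sigma_t^2) with fixed std sigma_t."""

    def __init__(self, theta, phi, sigma_t, rng=None):
        self.theta, self.phi, self.sigma_t = theta, phi, sigma_t
        self.rng = rng or np.random.default_rng()

    def act(self, s):
        return self.rng.normal(self.theta @ self.phi(s), self.sigma_t)

    def grad_log_prob(self, s, a):
        # grad_theta log pi_theta(a|s) = (a - theta^T phi(s)) phi(s) / sigma_t^2
        f = self.phi(s)
        return (a - self.theta @ f) * f / self.sigma_t ** 2

    def log_prob(self, s, a):
        mean = self.theta @ self.phi(s)
        return (-0.5 * np.log(2 * np.pi * self.sigma_t ** 2)
                - (a - mean) ** 2 / (2 * self.sigma_t ** 2))
```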
Cartpole Setup:

The neural network policy for the Cartpole environment has one hidden layer with the tanh activation function.¹ In the comparison between REINFORCE, GPOMDP, and SVRPG, we use a separate learning rate $\eta$ for each method. Based on our theoretical analysis, we chose $N = 25$ and $B = 10$ for SVRPG. For the fairest comparison between the algorithms, we also set the batch size of the vanilla gradient methods to $N = 10$.

We also test the effectiveness of different mini-batch sizes within each epoch of SVRPG to validate our theoretical claims in Corollary 5.7. We fix $N = 25$ and vary the mini-batch size over $B = [5, 10, 20]$. As the mini-batch size increases, we also scale the learning rate proportionally to the respective mini-batch size.

¹We thank Papini et al. (2018) for their implementations of GPOMDP and SVRPG, as well as Duan et al. (2016) for their implementations in the rllab library.

Figure 1: The average reward of different algorithms in the Cartpole and Mountain Car environments. (a) Cartpole; (b) Mountain Car.

Figure 2: The average reward of SVRPG with different mini-batch sizes $B$. (a) Cartpole; (b) Mountain Car.
Mountain Car Setup:

The neural network for the Mountain Car environment also contains one hidden layer with the tanh activation. In the comparison among REINFORCE, GPOMDP and SVRPG, we set $N = 100$ and $B = 20$ for SVRPG and set the batch size to $N = 20$ for the vanilla gradient methods, again with a separate learning rate $\eta$ for each method. Similar to the experiments on Cartpole, we also investigate the effect of different mini-batch sizes on SVRPG for Mountain Car. We conduct experiments on SVRPG by setting $N = 100$ and $B = [10, 20, 50]$, with correspondingly scaled learning rates.
Experimental Results:

Figures 1(a) and 1(b) respectively show the performance of the different algorithms on the Cartpole and Mountain Car environments. All results are averaged over repeated runs, and the shaded area is a confidence interval corresponding to the standard deviation over the different runs. It can be seen that all the methods solved the Cartpole environment (with average reward close to 1000). The SVRPG algorithm outperforms the other two by gaining higher rewards with fewer sample trajectories. SVRPG also beats the other methods in solving the Mountain Car environment, while the REINFORCE algorithm fails to solve Mountain Car due to its high variance.

Figures 2(a) and 2(b) show the effect of different mini-batch sizes $B$ on SVRPG. Note that the outer loop batch sizes for Cartpole and Mountain Car are $N = 25$ and $N = 100$ respectively. It can be seen that SVRPG achieves the best performance with $B = 10$ for Cartpole and $B = 20$ for Mountain Car, which is well aligned with our theory. In particular, with a small mini-batch size, SVRPG acts similarly to the vanilla stochastic gradient based algorithms, which need fewer trajectories in each iteration but converge slowly and require more trajectories in total. Conversely, a large mini-batch pushes SVRPG to converge in fewer iterations, but requires more trajectories in total.

8 Conclusions

We revisited the SVRPG algorithm (Papini et al., 2018) and derived a sharp convergence analysis of SVRPG which achieves the state-of-the-art sample complexity. We provided a detailed discussion of, and guidance on, the choice of batch sizes and epoch length based on our improved analysis, so that the total number of samples can be significantly reduced. We also empirically validated the theoretical results on common reinforcement learning tasks. As a future direction, it would be interesting to see whether a better sample complexity can be obtained for policy gradient algorithms.
Acknowledgements
We would like to thank the anonymous reviewers for their helpful comments. This research was sponsored in part by the National Science Foundation awards IIS-1904183 and IIS-1906169. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.
References
Allen-Zhu, Z. and Hazan, E. Variance reduction for faster non-convex optimization. In International Conference on Machine Learning, pp. 699–707, 2016.

Baxter, J. and Bartlett, P. L. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym, 2016.

Cortes, C., Mansour, Y., and Mohri, M. Learning bounds for importance weighting. In Advances in Neural Information Processing Systems, pp. 442–450, 2010.

Du, S. S., Chen, J., Li, L., Xiao, L., and Zhou, D. Stochastic variance reduction methods for policy evaluation. In Proceedings of the 34th International Conference on Machine Learning, pp. 1049–1058, 2017.

Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338, 2016.

Fang, C., Li, C. J., Lin, Z., and Zhang, T. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pp. 686–696, 2018.

Greensmith, E., Bartlett, P. L., and Baxter, J. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.

Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., and Levine, S. Q-prop: Sample-efficient policy gradient with an off-policy critic. In International Conference on Learning Representations, 2017.

Harikandeh, R., Ahmed, M. O., Virani, A., Schmidt, M., Konečný, J., and Sallinen, S. Stop wasting my gradients: Practical SVRG. In Advances in Neural Information Processing Systems, pp. 2251–2259, 2015.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315–323, 2013.

Kakade, S. M. A natural policy gradient. In Advances in Neural Information Processing Systems, pp. 1531–1538, 2002.

Lei, L., Ju, C., Chen, J., and Jordan, M. I. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems, pp. 2348–2358, 2017.

Levine, S., Wagener, N., and Abbeel, P. Learning contact-rich manipulation skills with guided policy search. In IEEE International Conference on Robotics and Automation, pp. 156–163. IEEE, 2015.

Li, Z. and Li, J. A simple proximal stochastic gradient method for nonsmooth nonconvex optimization. In Advances in Neural Information Processing Systems, pp. 5569–5579, 2018.

Liu, J. S. Monte Carlo strategies in scientific computing. Springer Science & Business Media, 2008.

Metelli, A. M., Papini, M., Faccio, F., and Restelli, M. Policy optimization via importance sampling. In Advances in Neural Information Processing Systems, pp. 5447–5459, 2018.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Nguyen, L. M., Liu, J., Scheinberg, K., and Takáč, M. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning, pp. 2613–2621, 2017.

Papini, M., Pirotta, M., and Restelli, M. Adaptive batch size for safe policy gradients. In Advances in Neural Information Processing Systems, pp. 3591–3600, 2017.

Papini, M., Binaghi, D., Canonaco, G., Pirotta, M., and Restelli, M. Stochastic variance-reduced policy gradient. In International Conference on Machine Learning, pp. 4023–4032, 2018.

Peters, J. and Schaal, S. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008.

Pirotta, M., Restelli, M., and Bascetta, L. Adaptive step-size for policy gradient methods. In Advances in Neural Information Processing Systems, pp. 1394–1402, 2013.

Reddi, S. J., Hefny, A., Sra, S., Poczos, B., and Smola, A. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pp. 314–323, 2016.

Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. The Regents of the University of California, 1961.

Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407, 1951.

Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, volume 37, pp. 1889–1897, 2015.

Sehnke, F., Osendorfer, C., Rückstieß, T., Graves, A., Peters, J., and Schmidhuber, J. Parameter-exploring policy gradients. Neural Networks, 23(4):551–559, 2010.

Shalev-Shwartz, S., Shammah, S., and Shashua, A. Safe, multi-agent, reinforcement learning for autonomous driving. CoRR, abs/1610.03295, 2016. URL http://arxiv.org/abs/1610.03295.

Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In International Conference on Machine Learning, 2014.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T. P., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–489, 2016.

Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT Press, 2018.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.

Tucker, G., Bhupatiraju, S., Gu, S., Turner, R., Ghahramani, Z., and Levine, S. The mirage of action-dependent baselines in reinforcement learning. In International Conference on Machine Learning, pp. 5022–5031, 2018.

Weaver, L. and Tao, N. The optimal reward baseline for gradient-based reinforcement learning. In Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, pp. 538–545. Morgan Kaufmann Publishers Inc., 2001.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

Xiao, L. and Zhang, T. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

Xu, T., Liu, Q., and Peng, J. Stochastic variance reduction for policy gradient estimation. CoRR, abs/1710.06034, 2017. URL http://arxiv.org/abs/1710.06034.

Zhao, T., Hachiya, H., Niu, G., and Sugiyama, M. Analysis and improvement of policy gradient estimation. In Advances in Neural Information Processing Systems, pp. 262–270, 2011.

Zhou, D., Xu, P., and Gu, Q. Stochastic nested variance reduced gradient descent for nonconvex optimization. In Advances in Neural Information Processing Systems, 2018.