Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning
Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, Bernhard Schölkopf, Sergey Levine
Shixiang Gu
University of Cambridge, Max Planck Institute [email protected]
Timothy Lillicrap
DeepMind [email protected]
Zoubin Ghahramani
University of Cambridge, Uber AI Labs [email protected]
Richard E. Turner
University of Cambridge [email protected]
Bernhard Schölkopf
Max Planck Institute [email protected]
Sergey Levine
UC Berkeley [email protected]
Abstract
Off-policy model-free deep reinforcement learning methods using previously collected data can improve sample efficiency over on-policy policy gradient techniques. On the other hand, on-policy algorithms are often more stable and easier to use. This paper examines, both theoretically and empirically, approaches to merging on- and off-policy updates for deep reinforcement learning. Theoretical results show that off-policy updates with a value function estimator can be interpolated with on-policy policy gradient updates whilst still satisfying performance bounds. Our analysis uses control variate methods to produce a family of policy gradient algorithms, with several recently proposed algorithms being special cases of this family. We then provide an empirical comparison of these techniques with the remaining algorithmic details fixed, and show how different mixing of off-policy gradient estimates with on-policy samples contributes to improvements in empirical performance. The final algorithm provides a generalization and unification of existing deep policy gradient techniques, has theoretical guarantees on the bias introduced by off-policy updates, and improves on state-of-the-art model-free deep RL methods on a number of OpenAI Gym continuous control benchmarks.
1 Introduction

Reinforcement learning (RL) studies how an agent that interacts sequentially with an environment can learn from rewards to improve its behavior and optimize long-term returns. Recent research has demonstrated that deep networks can be successfully combined with RL techniques to solve difficult control problems. Some of these include robotic control (Schulman et al., 2016; Lillicrap et al., 2016; Levine et al., 2016), computer games (Mnih et al., 2015), and board games (Silver et al., 2016).

One of the simplest ways to learn a neural network policy is to collect a batch of behavior wherein the policy is used to act in the world, and then compute and apply a policy gradient update from this data. This is referred to as on-policy learning because all of the updates are made using data that was collected from the trajectory distribution induced by the current policy of the agent. It is straightforward to compute unbiased on-policy gradients, and practical on-policy gradient algorithms tend to be stable and relatively easy to use. A major drawback of such methods is that they tend to be data inefficient, because they only look at each data point once. Off-policy algorithms based on Q-learning and actor-critic learning (Sutton et al., 1999) have also proven to be an effective approach to deep reinforcement learning, such as in (Mnih et al., 2015) and (Lillicrap et al., 2016). Such methods reuse samples by storing them in a replay buffer and train a value function or Q-function with off-policy updates. This improves data efficiency, but often at a cost in stability and ease of use.

Both on- and off-policy learning techniques have been shown to have advantages. While most recent research has worked with either on-policy or off-policy algorithms, a few recent methods have sought to make use of both on- and off-policy data for learning (Gu et al., 2017; Wang et al., 2017; O'Donoghue et al., 2017). Such algorithms hope to gain advantages from both modes of learning, whilst avoiding their limitations. Broadly speaking, there have been two basic approaches in recently proposed algorithms that make use of both on- and off-policy data and updates. The first approach is to mix some ratio of on- and off-policy gradients or update steps in order to update a policy, as in the ACER and PGQ algorithms (Wang et al., 2017; O'Donoghue et al., 2017). In this case, there are no theoretical bounds on the error induced by incorporating off-policy updates. In the second approach, an off-policy Q critic is trained but is used as a control variate to reduce on-policy gradient variance, as in the Q-Prop algorithm (Gu et al., 2017). This case does not introduce additional bias into the gradient estimator, but the policy updates do not use off-policy data.

We seek to unify these two approaches using the method of control variates. We introduce a parameterized family of policy gradient methods that interpolate between on-policy and off-policy learning. Such methods are in general biased, but we show that the bias can be bounded. We show that a number of recent methods (Gu et al., 2017; Wang et al., 2017; O'Donoghue et al., 2017) can be viewed as special cases of this more general family. Furthermore, our empirical results show that in most cases, a mix of policy gradient and actor-critic updates achieves the best results, demonstrating the value of considering interpolated policy gradients.
2 Background

A key component of our interpolated policy gradient method is the use of control variates to mix likelihood ratio gradients with deterministic gradient estimates obtained explicitly from a state-action critic. In this section, we summarize both likelihood ratio and deterministic gradient methods, as well as how control variates can be used to combine these two approaches.
At time $t$, the RL agent in state $s_t$ takes action $a_t$ according to its policy $\pi(a_t|s_t)$, the state transitions to $s_{t+1}$, and the agent receives a reward $r(s_t, a_t)$. For a parametrized policy $\pi_\theta$, the objective is to maximize the $\gamma$-discounted cumulative future return $J(\theta) = J(\pi) = \mathbb{E}_{s_0, a_0, \dots \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$. Monte Carlo policy gradient methods, such as REINFORCE (Williams, 1992) and TRPO (Schulman et al., 2015), use the likelihood ratio policy gradient of the RL objective,

$$\nabla_\theta J(\theta) = \mathbb{E}_{\rho_\pi, \pi}[\nabla_\theta \log \pi_\theta(a_t|s_t)(\hat{Q}(s_t, a_t) - b(s_t))] = \mathbb{E}_{\rho_\pi, \pi}[\nabla_\theta \log \pi_\theta(a_t|s_t)\, \hat{A}(s_t, a_t)], \quad (1)$$

where $\hat{Q}(s_t, a_t) = \sum_{t'=t}^{\infty} \gamma^{t'-t} r(s_{t'}, a_{t'})$ is the Monte Carlo estimate of the "critic" $Q^\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \dots \sim \pi \,|\, s_t, a_t}[\hat{Q}(s_t, a_t)]$, and $\rho_\pi(s) = \sum_{t=0}^{\infty} \gamma^t p(s_t = s)$ are the unnormalized state visitation frequencies, while $b(s_t)$ is known as the baseline and serves to reduce the variance of the gradient estimate (Williams, 1992). If the baseline estimates the value function $V^\pi(s_t) = \mathbb{E}_{a_t \sim \pi(\cdot|s_t)}[Q^\pi(s_t, a_t)]$, then $\hat{A}(s_t, a_t) = \hat{Q}(s_t, a_t) - V^\pi(s_t)$ is an estimate of the advantage function $A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$. Likelihood ratio policy gradient methods use unbiased gradient estimates (except for the technicality detailed by Thomas (2014)), but they often suffer from high variance and are sample-intensive.
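As a concrete illustration of Eq. 1 (not from the paper), the following NumPy sketch estimates the likelihood ratio gradient with a constant baseline for a toy one-dimensional linear-Gaussian policy; the environment, policy class, and all names are illustrative assumptions, and a fitted value function would replace the constant baseline in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.99

def sample_episode(theta, T=50):
    """Roll out a toy linear-Gaussian policy a ~ N(theta * s, 1) on a 1-D system."""
    s, states, actions, rewards = 1.0, [], [], []
    for _ in range(T):
        a = theta * s + rng.normal()
        r = -(s ** 2 + 0.1 * a ** 2)            # illustrative quadratic cost
        states.append(s); actions.append(a); rewards.append(r)
        s = 0.9 * s + 0.1 * a + 0.01 * rng.normal()
    return np.array(states), np.array(actions), np.array(rewards)

def reinforce_gradient(theta, n_episodes=100):
    """Monte Carlo likelihood-ratio gradient with a constant baseline (Eq. 1)."""
    grads = []
    for _ in range(n_episodes):
        s, a, r = sample_episode(theta)
        # Q_hat(s_t, a_t): discounted return-to-go from each time step.
        q_hat = np.array([np.sum(r[t:] * gamma ** np.arange(len(r) - t))
                          for t in range(len(r))])
        baseline = q_hat.mean()                 # crude baseline b(s_t)
        score = (a - theta * s) * s             # grad_theta log pi(a|s) for this Gaussian
        grads.append(np.sum(score * (q_hat - baseline)))
    return np.mean(grads)

print(reinforce_gradient(theta=0.0))
```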
Policy gradient methods with function approximation (Sutton et al., 1999), or actor-critic methods, are a family of policy gradient methods that first estimate the critic, or the value, of the policy by $Q_w \approx Q^\pi$, and then greedily optimize the policy $\pi_\theta$ with respect to $Q_w$. While it is not necessary for such algorithms to be off-policy, we primarily analyze the off-policy variants, such as (Riedmiller, 2005; Degris et al., 2012; Heess et al., 2015; Lillicrap et al., 2016). For example, DDPG (Lillicrap et al., 2016), which optimizes a continuous deterministic policy $\pi_\theta(a_t|s_t) = \delta(a_t = \mu_\theta(s_t))$, can be summarized by the following update equations, where $Q_{w'}$ denotes the target Q network (Lillicrap et al., 2016) and $\beta$ denotes some off-policy distribution, e.g. from experience replay:

$$w \leftarrow \arg\min_w \mathbb{E}_\beta[(Q_w(s_t, a_t) - y_t)^2], \qquad y_t = r(s_t, a_t) + \gamma Q_{w'}(s_{t+1}, \mu_\theta(s_{t+1})),$$
$$\theta \leftarrow \arg\max_\theta \mathbb{E}_\beta[Q_w(s_t, \mu_\theta(s_t))]. \quad (2)$$

This provides the following deterministic policy gradient through the critic:

$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\rho_\beta}[\nabla_\theta Q_w(s_t, \mu_\theta(s_t))]. \quad (3)$$

This policy gradient is generally biased due to the imperfect estimator $Q_w$ and off-policy state sampling from $\beta$. Off-policy actor-critic algorithms therefore allow training the policy on off-policy samples, at the cost of introducing potentially unbounded bias into the gradient estimate. This usually makes off-policy algorithms less stable during learning, compared to on-policy algorithms using a large batch size for each update (Duan et al., 2016; Gu et al., 2017).

β        ν          CV    Examples
π        0          No    REINFORCE (Williams, 1992), TRPO (Schulman et al., 2015)
π        0          Yes   Q-Prop (Gu et al., 2017)
β ≠ π    0 < ν < 1  No    ≈ PGQ (O'Donoghue et al., 2017), ≈ ACER (Wang et al., 2017)
β ≠ π    1          –     off-policy deterministic actor-critic, e.g. DDPG (Lillicrap et al., 2016)

Table 1: Prior policy gradient method objectives as special cases of IPG.

The control variates method (Ross, 2006) is a general technique for reducing the variance of a Monte Carlo estimator by exploiting a correlated variable for which more information, such as an analytical expectation, is known. General control variates for RL include state-action baselines; an example is an off-policy fitted critic $Q_w$. Q-Prop (Gu et al., 2017), for example, uses $\tilde{Q}_w$, the first-order Taylor expansion of $Q_w$, as the control variate, and showed improvements in the stability and sample efficiency of policy gradient methods. Here $\mu_\theta$ corresponds to the mean of the stochastic policy $\pi_\theta$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\rho_\pi, \pi}[\nabla_\theta \log \pi_\theta(a_t|s_t)(\hat{Q}(s_t, a_t) - \tilde{Q}_w(s_t, a_t))] + \mathbb{E}_{\rho_\pi}[\nabla_\theta Q_w(s_t, \mu_\theta(s_t))]. \quad (4)$$

This gradient estimator combines both the likelihood ratio and deterministic policy gradients of Eq. 1 and 3. It has lower-variance, more stable gradient estimates and enables more sample-efficient learning. However, one limitation of Q-Prop is that it uses only on-policy samples for estimating the policy gradient. This ensures that the Q-Prop estimator remains unbiased, but limits the use of off-policy samples for further variance reduction.

3 Interpolated Policy Gradient

Our proposed approach, interpolated policy gradient (IPG), mixes the likelihood ratio gradient with $\hat{Q}$, which provides unbiased but high-variance gradient estimates, and the deterministic gradient through an off-policy fitted critic $Q_w$, which provides low-variance but biased gradients. IPG directly interpolates the two terms from Eq. 1 and 3:

$$\nabla_\theta J(\theta) \approx (1 - \nu)\, \mathbb{E}_{\rho_\pi, \pi}[\nabla_\theta \log \pi_\theta(a_t|s_t)\, \hat{A}(s_t, a_t)] + \nu\, \mathbb{E}_{\rho_\beta}[\nabla_\theta \bar{Q}^\pi_w(s_t)], \quad (5)$$

where we generalize the deterministic policy gradient through the critic as $\nabla_\theta \bar{Q}^\pi_w(s_t) = \nabla_\theta \mathbb{E}_\pi[Q_w(s_t, \cdot)]$. This generalization makes our analysis applicable to more general forms of critic-based control variates, as discussed in the Appendix. This gradient estimator is biased from two sources: off-policy state sampling from $\rho_\beta$, and inaccuracies in the critic $Q_w$. However, as we show in Section 4, we can bound the biases in all cases, and in some cases the algorithm still guarantees monotonic convergence as in Kakade & Langford (2002); Schulman et al. (2015).
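The following sketch (a minimal illustration, not the paper's implementation) shows how Eq. 5 mixes the two estimators for a scalar Gaussian policy with mean $\mu_\theta(s) = \theta s$ and unit variance: the likelihood ratio term is averaged over an on-policy batch, while the critic term $\nabla_\theta \bar{Q}^\pi_w$ is estimated on states drawn from $\beta$ using the reparameterization $a = \mu_\theta(s) + \epsilon$ and a finite-difference derivative of the critic. The critic `q_w`, the batch format, and the value ν = 0.2 are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def ipg_gradient(theta, on_policy_batch, off_policy_states, q_w, nu=0.2):
    """Interpolated policy gradient (Eq. 5) for a 1-D Gaussian policy
    a ~ N(theta * s, 1).

    on_policy_batch   : list of (s, a, A_hat) tuples sampled from pi_theta
    off_policy_states : states drawn from the replay distribution beta
    q_w               : callable critic Q_w(s, a), assumed smooth in a
    """
    # (1 - nu) * E_{rho_pi, pi}[ grad log pi(a|s) * A_hat ]
    lr_term = np.mean([(a - theta * s) * s * a_hat for s, a, a_hat in on_policy_batch])

    # nu * E_{rho_beta}[ grad_theta E_pi[Q_w(s, .)] ], via reparameterization
    # a = theta*s + eps and a finite-difference estimate of dQ/da.
    eps, h = rng.normal(size=len(off_policy_states)), 1e-4
    def dq_da(s, a):
        return (q_w(s, a + h) - q_w(s, a - h)) / (2 * h)
    critic_term = np.mean([dq_da(s, theta * s + e) * s
                           for s, e in zip(off_policy_states, eps)])

    return (1 - nu) * lr_term + nu * critic_term

# Toy usage with an arbitrary quadratic critic and random data.
q_w = lambda s, a: -(a - s) ** 2
batch = [(s, 0.5 * s + rng.normal(), rng.normal()) for s in rng.normal(size=32)]
print(ipg_gradient(0.5, batch, rng.normal(size=64), q_w, nu=0.2))
```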
While IPG includes $\nu$ to trade off bias and variance directly, it contains a likelihood ratio gradient term, for which we can introduce a control variate (CV) (Ross, 2006) to further reduce the estimator variance. The expression for IPG with the control variate is below, where $A^\pi_w(s_t, a_t) = Q_w(s_t, a_t) - \bar{Q}^\pi_w(s_t)$:

$$\begin{aligned} \nabla_\theta J(\theta) &\approx (1 - \nu)\, \mathbb{E}_{\rho_\pi, \pi}[\nabla_\theta \log \pi_\theta(a_t|s_t)\, \hat{A}(s_t, a_t)] + \nu\, \mathbb{E}_{\rho_\beta}[\nabla_\theta \bar{Q}^\pi_w(s_t)] \\ &= (1 - \nu)\, \mathbb{E}_{\rho_\pi, \pi}[\nabla_\theta \log \pi_\theta(a_t|s_t)(\hat{A}(s_t, a_t) - A^\pi_w(s_t, a_t))] + (1 - \nu)\, \mathbb{E}_{\rho_\pi}[\nabla_\theta \bar{Q}^\pi_w(s_t)] + \nu\, \mathbb{E}_{\rho_\beta}[\nabla_\theta \bar{Q}^\pi_w(s_t)] \\ &\approx (1 - \nu)\, \mathbb{E}_{\rho_\pi, \pi}[\nabla_\theta \log \pi_\theta(a_t|s_t)(\hat{A}(s_t, a_t) - A^\pi_w(s_t, a_t))] + \mathbb{E}_{\rho_\beta}[\nabla_\theta \bar{Q}^\pi_w(s_t)]. \end{aligned} \quad (6)$$

The first approximation indicates the biased approximation from IPG, while the second indicates replacing $\rho_\pi$ in the control variate correction term with $\rho_\beta$ and merging it with the last term. The second approximation is a design decision and introduces additional bias when $\beta \neq \pi$, but it simplifies the expression so that it can be analyzed more easily, and the additional benefit of variance reduction from the control variate can still outweigh this extra bias. The biases are analyzed in Section 4. The likelihood ratio gradient term is now proportional to the residual between the on- and off-policy advantage estimates, $\hat{A}(s_t, a_t) - A^\pi_w(s_t, a_t)$, and therefore we call this term the residual likelihood ratio gradient. Intuitively, if the off-policy critic estimate is accurate, this term has a low magnitude and the overall variance of the estimator is reduced.

Crucially, IPG allows interpolating a rich set of prior deep policy gradient methods using only three parameters: $\beta$, $\nu$, and the use of the control variate (CV). The connection is summarized in Table 1 and the algorithm is presented in Algorithm 1. Importantly, a wide range of prior work has only explored limiting cases of the spectrum, e.g. $\nu = 0$ or $\nu = 1$, with or without the control variate. Our work provides a thorough theoretical analysis of the biases, and in some cases performance guarantees, for each of the methods in this spectrum, and empirically demonstrates that the best-performing algorithms often lie in the middle of the spectrum.
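The reason the control variate can be subtracted inside the likelihood ratio term and added back as a deterministic term is the standard identity $\mathbb{E}_\pi[\nabla_\theta \log \pi_\theta(a|s)\, A^\pi_w(s,a)] = \nabla_\theta \bar{Q}^\pi_w(s)$. The snippet below is a small numerical check of this identity (not from the paper) for a single state, a scalar Gaussian policy, and a toy quadratic critic, all of which are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Check, for one state: E_pi[ grad_mu log pi(a) * A_w(a) ] == d/d_mu E_pi[ Q_w(a) ],
# so subtracting A_w from A_hat and adding back the critic term leaves the
# gradient unchanged in expectation. Q_w and the policy below are toy choices.
mu, sigma, c = 0.3, 0.5, 1.2
q_w = lambda a: -(a - c) ** 2                     # toy critic
a = mu + sigma * rng.normal(size=2_000_000)       # samples from pi = N(mu, sigma^2)

a_w = q_w(a) - np.mean(q_w(a))                    # A_w(a) = Q_w(a) - E_pi[Q_w]
score = (a - mu) / sigma ** 2                     # grad_mu log pi(a)

lr_estimate = np.mean(score * a_w)                # score-function form of the gradient
analytic = -2 * (mu - c)                          # d/d_mu E[-(a - c)^2] = -2(mu - c)
print(lr_estimate, analytic)                      # the two should roughly agree
```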
Algorithm 1 Interpolated Policy Gradient

Input: β, ν, useCV
1: Initialize w for critic Q_w, θ for stochastic policy π_θ, and replay buffer R ← ∅
2: repeat
3:   Roll out π_θ for E episodes, T time steps each, to collect a batch of data B = {s, a, r}^{T,E}, and add it to R
4:   Fit Q_w using R and π_θ, and fit the baseline V_φ(s_t) using B
5:   Compute Monte Carlo advantage estimates Â_{t,e} using B and V_φ
6:   if useCV then
7:     Compute critic-based advantage estimates Ā_{t,e} using B, Q_w and π_θ
8:     Compute and center the learning signals l_{t,e} = Â_{t,e} − Ā_{t,e} and set b = 1
9:   else
10:     Center the learning signals l_{t,e} = Â_{t,e} and set b = ν
11:   end if
12:   Multiply l_{t,e} by (1 − ν)
13:   Sample D = {s_m}_{m=1}^M from R and/or B according to β
14:   Compute ∇_θ J(θ) ≈ (1/(ET)) Σ_e Σ_t ∇_θ log π_θ(a_{t,e}|s_{t,e}) l_{t,e} + (b/M) Σ_m ∇_θ Q̄^π_w(s_m)
15:   Update policy π_θ using ∇_θ J(θ)
16: until π_θ converges
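A minimal sketch of lines 5–12 of Algorithm 1 is shown below (not from the paper), assuming the advantage estimates have already been computed and are plain arrays; the helper name and toy numbers are illustrative only. It makes concrete how useCV switches between the residual signal with b = 1 and the plain Monte Carlo signal with b = ν.

```python
import numpy as np

def ipg_learning_signals(a_hat, a_bar, nu, use_cv):
    """Lines 5-12 of Algorithm 1: build the centred learning signals l and the
    coefficient b that scales the critic-gradient term. a_hat are Monte Carlo
    advantage estimates from the on-policy batch; a_bar are the critic-based
    estimates; both are plain arrays here for illustration."""
    if use_cv:
        signal, b = a_hat - a_bar, 1.0          # residual likelihood-ratio signal
    else:
        signal, b = a_hat, nu
    signal = signal - signal.mean()             # centring
    return (1.0 - nu) * signal, b

# Toy illustration: an accurate critic shrinks the residual signal, so most of
# the gradient comes from the low-variance critic term (scaled by b).
a_hat = np.array([1.0, -0.5, 2.0, 0.3])
a_bar = a_hat + 0.05 * np.random.default_rng(3).normal(size=4)
print(ipg_learning_signals(a_hat, a_bar, nu=0.2, use_cv=True))
print(ipg_learning_signals(a_hat, a_bar, nu=0.2, use_cv=False))
```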
ν = 1: Actor-Critic Methods

Before presenting our theoretical analysis, an important special case to discuss is ν = 1, which corresponds to a deterministic actor-critic method. Advantages of this special case are that the policy can be deterministic and that learning can be done completely off-policy, since it does not require estimating the on-policy Monte Carlo critic $\hat{Q}$. Prior work such as DDPG (Lillicrap et al., 2016) and related Q-learning methods have proposed aggressive off-policy exploration strategies to exploit these properties of the algorithm. In this work, we compare alternatives such as using on-policy exploration and a stochastic policy with classical DDPG algorithm designs, and show that in some domains off-policy exploration can significantly deteriorate performance. Theoretically, we confirm this empirical observation by showing that the bias from off-policy sampling in β increases monotonically with the total variation or KL divergence between β and π. Both the empirical and theoretical results indicate that well-designed actor-critic methods with an on-policy exploration strategy can be a more reliable alternative to off-policy exploration.

4 Theoretical Analysis

In this section, we present a theoretical analysis of the bias in the interpolated policy gradient. This is crucial, since understanding the biases of the methods can improve our intuition about their performance and make it easier to design new algorithms in the future. Because IPG includes many prior methods as special cases, our analysis also applies to those methods and other intermediate cases. We first analyze a special case and then derive results for general IPG. All proofs are in the Appendix.

4.1 β ≠ π, ν = 0: Policy Gradient with Control Variate and Off-Policy Sampling

This section provides an analysis of the special case of IPG with β ≠ π, ν = 0, and the control variate. Plugging in to Eq. 6, we get an expression similar to Q-Prop in Eq. 4,

$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\rho_\pi, \pi}[\nabla_\theta \log \pi_\theta(a_t|s_t)(\hat{A}(s_t, a_t) - A^\pi_w(s_t, a_t))] + \mathbb{E}_{\rho_\beta}[\nabla_\theta \bar{Q}^\pi_w(s_t)], \quad (7)$$

except that it also supports utilizing off-policy data for updating the policy. To analyze the bias of this gradient expression, we first introduce $\tilde{J}(\pi, \tilde{\pi})$, a local approximation to $J(\pi)$, which has been used in prior theoretical work (Kakade & Langford, 2002; Schulman et al., 2015). The derivation and the bias from this approximation are discussed in the proof of Theorem 1 in the Appendix:

$$J(\pi) = J(\tilde{\pi}) + \mathbb{E}_{\rho_\pi, \pi}[A^{\tilde{\pi}}(s_t, a_t)] \approx J(\tilde{\pi}) + \mathbb{E}_{\rho_{\tilde{\pi}}, \pi}[A^{\tilde{\pi}}(s_t, a_t)] = \tilde{J}(\pi, \tilde{\pi}). \quad (8)$$

Note that $J(\pi) = \tilde{J}(\pi, \tilde{\pi} = \pi)$ and $\nabla_\pi J(\pi) = \nabla_\pi \tilde{J}(\pi, \tilde{\pi} = \pi)$. In practice, $\tilde{\pi}$ corresponds to the policy $\pi_k$ at iteration $k$ and $\pi$ corresponds to the next policy $\pi_{k+1}$ after the parameter update, so this approximation is often sufficiently good. Next, we write the approximate objective for Eq. 7,

$$\tilde{J}_{\beta, \nu=0, CV}(\pi, \tilde{\pi}) \triangleq J(\tilde{\pi}) + \mathbb{E}_{\rho_{\tilde{\pi}}, \pi}[A^{\tilde{\pi}}(s_t, a_t) - A^{\tilde{\pi}}_w(s_t, a_t)] + \mathbb{E}_{\rho_\beta}[\bar{A}^{\pi, \tilde{\pi}}_w(s_t)] \approx \tilde{J}(\pi, \tilde{\pi}),$$
$$\bar{A}^{\pi, \tilde{\pi}}_w(s_t) = \mathbb{E}_\pi[A^{\tilde{\pi}}_w(s_t, \cdot)] = \mathbb{E}_\pi[Q_w(s_t, \cdot)] - \mathbb{E}_{\tilde{\pi}}[Q_w(s_t, \cdot)]. \quad (9)$$

Note that $\tilde{J}_{\beta, \nu=0, CV}(\pi, \tilde{\pi} = \pi) = \tilde{J}(\pi, \tilde{\pi} = \pi) = J(\pi)$, and $\nabla_\pi \tilde{J}_{\beta, \nu=0, CV}(\pi, \tilde{\pi} = \pi)$ equals Eq. 7. We can bound the absolute error between $\tilde{J}_{\beta, \nu=0, CV}(\pi, \tilde{\pi})$ and $J(\pi)$ with the following theorem, where $D^{\max}_{KL}(\pi_i, \pi_j) = \max_s D_{KL}(\pi_i(\cdot|s), \pi_j(\cdot|s))$ is the maximum KL divergence between $\pi_i$ and $\pi_j$.
Theorem 1. If $\epsilon = \max_s |\bar{A}^{\pi, \tilde{\pi}}_w(s)|$ and $\zeta = \max_s |\bar{A}^{\pi, \tilde{\pi}}(s)|$, then

$$\left\| J(\pi) - \tilde{J}_{\beta, \nu=0, CV}(\pi, \tilde{\pi}) \right\| \le \frac{2\gamma}{(1-\gamma)^2}\left( \epsilon \sqrt{D^{\max}_{KL}(\tilde{\pi}, \beta)} + \zeta \sqrt{D^{\max}_{KL}(\pi, \tilde{\pi})} \right).$$

Theorem 1 contains two terms: the second term confirms that $\tilde{J}_{\beta, \nu=0, CV}$ is a local approximation that deviates from $J(\pi)$ as $\pi$ deviates from $\tilde{\pi}$, and the first term bounds the bias from off-policy sampling using the KL divergence between the policies $\tilde{\pi}$ and $\beta$. This means that the algorithm fits well with policy gradient methods that constrain the KL divergence per policy update, such as covariant policy gradient (Bagnell & Schneider, 2003), natural policy gradient (Kakade & Langford, 2002), REPS (Peters et al., 2010), and trust-region policy optimization (TRPO) (Schulman et al., 2015).

Some forms of on-policy policy gradient methods have theoretical guarantees on monotonic convergence (Kakade & Langford, 2002; Schulman et al., 2015). Such guarantees often correspond to stable empirical performance on challenging problems, even when some of the constraints are relaxed in practice (Schulman et al., 2015; Duan et al., 2016; Gu et al., 2017). We can show that Algorithm 2, which is a variant of IPG, guarantees monotonic convergence. The proof is provided in the Appendix. Algorithm 2 is often impractical to implement; however, IPG with trust-region updates and β ≠ π, ν = 0, CV = true approximates this monotonic algorithm, similar to how TRPO is an approximation to the theoretically monotonic algorithm proposed by Schulman et al. (2015).
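For intuition (not from the paper), the snippet below evaluates the right-hand side of Theorem 1 for illustrative values of ε, ζ, and γ, showing how the off-policy divergence $D^{\max}_{KL}(\tilde{\pi}, \beta)$ inflates the bias bound relative to the trust-region term $D^{\max}_{KL}(\pi, \tilde{\pi})$; all numbers are arbitrary choices.

```python
import numpy as np

def theorem1_bound(eps, zeta, gamma, kl_beta, kl_pi):
    """Right-hand side of Theorem 1 for given constants; all inputs illustrative."""
    c = 2 * gamma / (1 - gamma) ** 2
    return c * (eps * np.sqrt(kl_beta) + zeta * np.sqrt(kl_pi))

# With a trust-region step of D_KL^max(pi, pi_tilde) = 0.01, sweep how
# off-policy the sampling distribution beta is allowed to be:
for kl_beta in [0.0, 0.01, 0.1, 1.0]:
    print(kl_beta, theorem1_bound(eps=0.05, zeta=0.05, gamma=0.99,
                                  kl_beta=kl_beta, kl_pi=0.01))
```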
Algorithm 2 Policy iteration with non-decreasing returns J(π) and bounded off-policy sampling

1: Initialize policy π_0 and critic Q_w
2: repeat
3:   Compute all advantage values A^{π_i}(s, a), and choose any off-policy distribution β_i
4:   Update the critic Q_w using any method (no requirement on its performance)
5:   Solve the constrained optimization problem:
6:     π_{i+1} ← arg max_π  J̃_{β_i, ν=0, CV}(π, π_i) − C ( ζ √(D^max_KL(π, π_i)) + ε √(D^max_KL(π_i, β_i)) )
7:     subject to Σ_a π(a|s) = 1  ∀s
8:     where C = 2γ/(1 − γ)², ζ = max_s |Ā^{π,π_i}(s)|, ε = max_s |Ā^{π,π_i}_w(s)|
9: until π_i converges

We can establish bias bounds for the general IPG algorithm, with and without the control variate, using Theorem 2. The additional term that contributes to the bias in the general case is δ, which represents the error between the advantage estimated by the off-policy critic and the true $A^{\tilde{\pi}}$ values.

Theorem 2. If $\delta = \max_{s,a} |A^{\tilde{\pi}}(s, a) - A^{\tilde{\pi}}_w(s, a)|$, $\epsilon = \max_s |\bar{A}^{\pi, \tilde{\pi}}_w(s)|$, $\zeta = \max_s |\bar{A}^{\pi, \tilde{\pi}}(s)|$, and

$$\tilde{J}_{\beta,\nu}(\pi, \tilde{\pi}) \triangleq J(\tilde{\pi}) + (1-\nu)\, \mathbb{E}_{\rho_{\tilde{\pi}}, \pi}[\hat{A}^{\tilde{\pi}}] + \nu\, \mathbb{E}_{\rho_\beta}[\bar{A}^{\pi, \tilde{\pi}}_w],$$
$$\tilde{J}_{\beta,\nu,CV}(\pi, \tilde{\pi}) \triangleq J(\tilde{\pi}) + (1-\nu)\, \mathbb{E}_{\rho_{\tilde{\pi}}, \pi}[\hat{A}^{\tilde{\pi}} - A^{\tilde{\pi}}_w] + \mathbb{E}_{\rho_\beta}[\bar{A}^{\pi, \tilde{\pi}}_w],$$

then

$$\left\| J(\pi) - \tilde{J}_{\beta,\nu}(\pi, \tilde{\pi}) \right\| \le \frac{\nu\delta}{1-\gamma} + \frac{2\gamma}{(1-\gamma)^2}\left( \nu\epsilon \sqrt{D^{\max}_{KL}(\tilde{\pi}, \beta)} + \zeta \sqrt{D^{\max}_{KL}(\pi, \tilde{\pi})} \right),$$
$$\left\| J(\pi) - \tilde{J}_{\beta,\nu,CV}(\pi, \tilde{\pi}) \right\| \le \frac{\nu\delta}{1-\gamma} + \frac{2\gamma}{(1-\gamma)^2}\left( \epsilon \sqrt{D^{\max}_{KL}(\tilde{\pi}, \beta)} + \zeta \sqrt{D^{\max}_{KL}(\pi, \tilde{\pi})} \right).$$

This bound shows that the bias from directly mixing in the deterministic policy gradient through ν comes from two terms: how well the critic $Q_w$ approximates $Q^\pi$, and how close the off-policy sampling distribution is to the actor policy. We also show that the bias introduced is proportional to ν, while the variance of the high-variance likelihood ratio gradient term is proportional to (1 − ν), so ν allows directly trading off bias and variance. Theorem 2 bounds the bias across the full spectrum of IPG methods; this lets us analyze how biases arise and interact, and helps us design better algorithms.

5 Related Work

An overarching aim of this paper is to help unify on-policy and off-policy policy gradient algorithms into a single conceptual framework. Our analysis examines how Q-Prop (Gu et al., 2017), PGQ (O'Donoghue et al., 2017), and ACER (Wang et al., 2017), which are all recent works that combine on-policy with off-policy learning, are connected to each other (see Table 1). IPG with 0 < ν < 1 and without the control variate relates closely to PGQ and ACER, but differs in the details: PGQ mixes in the Q-learning Bellman error objective, and ACER mixes parameter update steps rather than directly mixing gradients. Both PGQ and ACER also come with numerous additional design details that make fair comparisons with methods like TRPO and Q-Prop difficult. We instead focus on the three minimal variables of IPG and explore their settings in relation to the closely related TRPO and Q-Prop methods, in order to theoretically and empirically understand in which situations we might expect gains from mixing on- and off-policy gradients.

Aside from these more recent works, the use of off-policy samples with policy gradients has been a popular direction of research (Peshkin & Shelton, 2002; Jie & Abbeel, 2010; Degris et al., 2012; Levine & Koltun, 2013). Most of these methods rely on variants of importance sampling (IS) to correct for bias. The use of importance sampling ensures unbiased estimates, but at the cost of considerable variance, as quantified by the effective sample size (ESS) measure used by Jie & Abbeel (2010). Ignoring importance weights produces bias but, as shown in our analysis, this bias can be bounded. Therefore, our IPG estimators have higher bias as the sampling distribution deviates from the policy, while IS methods have higher variance.
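This trade-off can be seen in a toy one-dimensional example (an illustration, not from the paper): as the behavior distribution β drifts away from π, the importance-weighted estimate stays unbiased but its effective sample size collapses, while the unweighted estimate stays low-variance but becomes biased. All distributions and the test function below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
mu_pi, mu_beta = 0.0, 1.5          # pi = N(0, 1), behaviour distribution beta = N(1.5, 1)
f = lambda a: a ** 2               # any test function; E_pi[f] = 1 here

a_beta = mu_beta + rng.normal(size=n)                      # samples from beta
log_w = -0.5 * ((a_beta - mu_pi) ** 2 - (a_beta - mu_beta) ** 2)
w = np.exp(log_w)                                          # importance weights pi / beta

is_estimate = np.mean(w * f(a_beta))                       # unbiased but high variance
plain_estimate = np.mean(f(a_beta))                        # biased (ignores the weights)
ess = w.sum() ** 2 / np.sum(w ** 2)                        # effective sample size

print(f"IS estimate {is_estimate:.2f}, unweighted {plain_estimate:.2f}, "
      f"true 1.00, ESS {ess:.0f} of {n}")
```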
Among these importance sampling methods, Levine & Koltun (2013) evaluate on tasks that are the most similar to ours, but their focus is on using importance sampling to include demonstrations, rather than to speed up learning from scratch.

Figure 1: (a) IPG with ν = 0 and the control variate: IPG-ν=0 vs Q-Prop on HalfCheetah-v1, with batch size 5000. IPG-β-rand30000, which uses 30000 random samples from the replay as samples from β, outperforms Q-Prop in terms of learning speed. (b) IPG with ν = 1: IPG-ν=1 vs other algorithms on Ant-v1. In this domain, IPG-ν=1 with on-policy exploration significantly outperforms DDPG and IPG-ν=1-OU, which use a heuristic OU (Ornstein–Uhlenbeck) process noise exploration strategy, and marginally outperforms Q-Prop.

Lastly, there are many methods that combine on- and off-policy data for policy evaluation (Precup, 2000; Mahmood et al., 2014; Munos et al., 2016), mostly through variants of importance sampling. Combining our methods with more sophisticated policy evaluation methods will likely lead to further improvements, as done in (Degris et al., 2012). A more detailed analysis of the effect of importance sampling on bias and variance is left to future work, where some of the relevant work includes Precup (2000); Jie & Abbeel (2010); Mahmood et al. (2014); Jiang & Li (2016); Thomas & Brunskill (2016).

6 Experiments

In this section, we empirically show that the three parameters of IPG can interpolate different behaviors and often achieve superior performance over prior methods that are limiting cases of this approach. Crucially, all methods share the same algorithmic structure as Algorithm 1, and we hold the rest of the experimental details fixed. All experiments were performed on MuJoCo domains in OpenAI Gym (Todorov et al., 2012; Brockman et al., 2016), with results presented as the average over three random seeds. Additional experimental details are provided in the Appendix.

β ≠ π, ν = 0, with the control variate

We evaluate the performance of the special case of IPG discussed in Section 4.1. This case is of particular interest, since we can derive monotonic convergence results for a variant of this method under certain conditions, despite the presence of off-policy updates. Figure 1a shows the performance on the HalfCheetah-v1 domain, when the policy update batch size is 5000 transitions (i.e. 5 episodes). "last" and "rand" indicate whether β samples from the most recent transitions or uniformly from the experience replay. "last05000" would be equivalent to Q-Prop given ν = 0. Comparing the "IPG-β-rand05000" and "Q-Prop" curves, we observe that by drawing the same number of samples randomly from the replay buffer for estimating the critic gradient, instead of using the on-policy samples, we get faster convergence. If we sample batches of size 30000 from the replay buffer, the performance improves further. However, as seen in the "IPG-β-last30000" curve, if we instead use the 30000 most recent samples, the performance degrades. One possible explanation is that, while using random samples from the replay increases the bound on the bias according to Theorem 1, it also decorrelates the samples within the batch, providing more stable gradients. This is the original motivation for experience replay in the DQN method (Mnih et al., 2015), and we have shown that such decorrelated off-policy samples can similarly produce gains for policy gradient algorithms.
See Table 2 for results on other domains. The results for this variant of IPG demonstrate that random sampling from the replay provides further improvement on top of Q-Prop, a strong baseline for sample efficiency and stability. Note that these replay buffer samples are different from standard off-policy samples in DDPG or DQN algorithms, which often use aggressive heuristic exploration strategies. The samples used by IPG are drawn from prior policies that follow a conservative trust-region update, resulting in greater regularity but less exploration. In the next section, we show that in some cases, ensuring that the off-policy samples are not too off-policy is essential for good performance.

              HalfCheetah-v1      Ant-v1            Walker2d-v1       Humanoid-v1
              β = π   β ≠ π       β = π   β ≠ π     β = π   β ≠ π     β = π   β ≠ π
IPG-ν=0.2     3356    3458        –       –         –       –         –       –
IPG-cv-ν=0.2  –       –           –       –         –       –         –       –
IPG-ν=1       2962    –           –       –         –       –         –       –
Q-Prop        4178    –           –       –         –       –         –       –
TRPO          2889    N.A.        1520    N.A.      1487    N.A.      615     N.A.

Table 2: Comparisons on all domains with mini-batch size 10000 for Humanoid and 5000 otherwise. We compare the maximum of average test rewards in the first 10000 episodes (Humanoid requires more steps to fully converge; see the Appendix for learning curves). Results outperforming Q-Prop (or IPG-cv-ν=0 with β = π) are boldface. The two columns per domain show results with on-policy (β = π) and off-policy (β ≠ π) samples for estimating the deterministic policy gradient.

β = π, ν = 1

In this section, we empirically evaluate another special case of IPG, where β = π, indicating on-policy sampling, and ν = 1, which reduces to a trust-region, on-policy variant of a deterministic actor-critic method. Although this algorithm performs actor-critic updates, the use of a trust region makes it more similar to TRPO or Q-Prop than to DDPG.

Results for all domains are shown in Table 2. Figure 1b shows the learning curves on Ant-v1. Although IPG-ν=1 methods can be off-policy, the policy is updated every 5000 samples to keep it consistent with the other IPG methods, while DDPG updates the policy on every step in the environment and makes other design choices (Lillicrap et al., 2016). We see that, in this domain, standard DDPG becomes stuck with a mean reward around 1000, while IPG-ν=1 improves monotonically, achieving a significantly better result. To investigate why this large discrepancy arises, we also ran IPG-ν=1 with the same OU process exploration noise as DDPG, and observed a large degradation in performance. This provides empirical support for Theorem 2. It is illuminating to contrast this result with the previous experiment, where the off-policy samples did not adversely alter the results. In the previous experiments, the samples came from Gaussian policies updated with trust regions; the difference between π and β was therefore approximately bounded by the trust regions. In the experiment with Brownian noise, the behavior policy uses temporally correlated noise with potentially unbounded KL divergence from the learned Gaussian policy. In this case, the off-policy samples result in excessive bias, wiping out the variance reduction benefits of off-policy sampling. In general, we observed that for the harder Ant-v1 and Walker2d-v1 domains, on-policy exploration is more effective, even when doing off-policy state sampling from a replay buffer. This result suggests the following lesson for designing off-policy actor-critic methods: for domains where exploration is difficult, it may be more effective to use on-policy exploration with bounded policy updates than to design heuristic exploration rules such as OU process noise, due to the resulting reduction in bias.

Table 2 shows the results for experiments where we compare IPG methods with varying values of ν; additional results are provided in the Appendix. β ≠ π indicates that the method uses off-policy samples from the replay buffer, with the same batch size as the on-policy batch for a fair comparison. We ran sweeps over four values of ν and found that ν = 0.2 consistently produced better performance than Q-Prop, TRPO, or prior actor-critic methods. This is consistent with the results in PGQ (O'Donoghue et al., 2017) and ACER (Wang et al., 2017), which found that a similarly small value of their equivalent mixing coefficient performed best on their benchmarks.
Importantly, we compared all methods with the same algorithm designs (exploration, policy, etc.), since Q-Prop and TRPO are IPG-ν=0 with and without the control variate. IPG-ν=1 is a novel variant of the actor-critic method that differs from DDPG (Lillicrap et al., 2016) and SVG(0) (Heess et al., 2015) due to the use of a trust region. The results in Table 2 suggest that, in most cases, the best-performing algorithm is one that interpolates between the policy gradient and actor-critic variants, with intermediate values of ν.

Discussion
In this paper, we introduced interpolated policy gradient methods, a family of policy gradient algorithms that allow mixing off-policy learning with on-policy learning while satisfying performance bounds. This family of algorithms unifies and interpolates between the on-policy likelihood ratio policy gradient and the off-policy deterministic policy gradient, and includes a number of prior works as approximate limiting cases. Empirical results confirm that, in many cases, interpolated gradients have improved sample efficiency and stability over prior state-of-the-art methods, and the theoretical results provide intuition for analyzing the cases in which the different methods perform well or poorly. Our hope is that this detailed analysis of interpolated gradient methods can not only provide for more effective algorithms in practice, but also give useful insight for future algorithm design.
Acknowledgements
This work is supported by generous sponsorship from the Cambridge-Tübingen PhD Fellowship, NSERC, and a Google Faculty Award.
References
Bagnell, J Andrew and Schneider, Jeff. Covariant policy search. IJCAI, 2003.

Brockman, Greg, Cheung, Vicki, Pettersson, Ludwig, Schneider, Jonas, Schulman, John, Tang, Jie, and Zaremba, Wojciech. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Degris, Thomas, White, Martha, and Sutton, Richard S. Off-policy actor-critic. arXiv preprint arXiv:1205.4839, 2012.

Duan, Yan, Chen, Xi, Houthooft, Rein, Schulman, John, and Abbeel, Pieter. Benchmarking deep reinforcement learning for continuous control. International Conference on Machine Learning (ICML), 2016.

Gu, Shixiang, Lillicrap, Timothy, Sutskever, Ilya, and Levine, Sergey. Continuous deep Q-learning with model-based acceleration. International Conference on Machine Learning (ICML), 2016.

Gu, Shixiang, Lillicrap, Timothy, Ghahramani, Zoubin, Turner, Richard E, and Levine, Sergey. Q-Prop: Sample-efficient policy gradient with an off-policy critic. ICLR, 2017.

Heess, Nicolas, Wayne, Gregory, Silver, David, Lillicrap, Tim, Erez, Tom, and Tassa, Yuval. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pp. 2944–2952, 2015.

Hunter, David R and Lange, Kenneth. A tutorial on MM algorithms. The American Statistician, 58(1):30–37, 2004.

Jiang, Nan and Li, Lihong. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 652–661, 2016.

Jie, Tang and Abbeel, Pieter. On a connection between importance sampling and the likelihood ratio policy gradient. In Advances in Neural Information Processing Systems, pp. 1000–1008, 2010.

Kahn, Gregory, Zhang, Tianhao, Levine, Sergey, and Abbeel, Pieter. PLATO: Policy learning using adaptive trajectory optimization. arXiv preprint arXiv:1603.00622, 2016.

Kakade, Sham and Langford, John. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning (ICML), volume 2, pp. 267–274, 2002.

Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, Diederik P and Welling, Max. Auto-encoding variational Bayes. ICLR, 2014.

Levine, Sergey and Koltun, Vladlen. Guided policy search. In International Conference on Machine Learning (ICML), pp. 1–9, 2013.

Levine, Sergey, Finn, Chelsea, Darrell, Trevor, and Abbeel, Pieter. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016.

Lillicrap, Timothy P, Hunt, Jonathan J, Pritzel, Alexander, Heess, Nicolas, Erez, Tom, Tassa, Yuval, Silver, David, and Wierstra, Daan. Continuous control with deep reinforcement learning. ICLR, 2016.

Mahmood, A Rupam, van Hasselt, Hado P, and Sutton, Richard S. Weighted importance sampling for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems, pp. 3014–3022, 2014.

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Munos, Rémi, Stepleton, Tom, Harutyunyan, Anna, and Bellemare, Marc G. Safe and efficient off-policy reinforcement learning. arXiv preprint arXiv:1606.02647, 2016.

O'Donoghue, Brendan, Munos, Remi, Kavukcuoglu, Koray, and Mnih, Volodymyr. PGQ: Combining policy gradient and Q-learning. ICLR, 2017.

Peshkin, Leonid and Shelton, Christian R. Learning from scarce experience. In Proceedings of the Nineteenth International Conference on Machine Learning, 2002.

Peters, Jan, Mülling, Katharina, and Altun, Yasemin. Relative entropy policy search. In AAAI, Atlanta, 2010.

Precup, Doina. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, pp. 80, 2000.

Riedmiller, Martin. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pp. 317–328. Springer, 2005.

Ross, Sheldon M. Simulation. Burlington, MA: Elsevier, 2006.

Ross, Stéphane, Gordon, Geoffrey J, and Bagnell, Drew. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, volume 1, pp. 6, 2011.

Schulman, John, Levine, Sergey, Abbeel, Pieter, Jordan, Michael I, and Moritz, Philipp. Trust region policy optimization. In ICML, pp. 1889–1897, 2015.

Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. High-dimensional continuous control using generalized advantage estimation. International Conference on Learning Representations (ICLR), 2016.

Silver, David, Lever, Guy, Heess, Nicolas, Degris, Thomas, Wierstra, Daan, and Riedmiller, Martin. Deterministic policy gradient algorithms. In International Conference on Machine Learning (ICML), 2014.

Silver, David, Huang, Aja, Maddison, Chris J, Guez, Arthur, Sifre, Laurent, Van Den Driessche, George, Schrittwieser, Julian, Antonoglou, Ioannis, Panneershelvam, Veda, Lanctot, Marc, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Sutton, Richard S, McAllester, David A, Singh, Satinder P, Mansour, Yishay, et al. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS), volume 99, pp. 1057–1063, 1999.

Thomas, Philip. Bias in natural actor-critic algorithms. In ICML, pp. 441–448, 2014.

Thomas, Philip and Brunskill, Emma. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 2139–2148, 2016.

Todorov, Emanuel, Erez, Tom, and Tassa, Yuval. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5026–5033. IEEE, 2012.

Wang, Ziyu, Bapst, Victor, Heess, Nicolas, Mnih, Volodymyr, Munos, Remi, Kavukcuoglu, Koray, and de Freitas, Nando. Sample efficient actor-critic with experience replay. ICLR, 2017.

Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
In the main paper, we introduced the approximate objective $\tilde{J}(\pi, \tilde{\pi})$ to $J(\pi)$ in our theoretical analysis. In this section, we discuss the motivation behind this choice, referencing prior work (Kakade & Langford, 2002; Schulman et al., 2015).

First, the expected return $J(\pi)$ of a policy $\pi$ can be written as the sum of the expected return $J(\tilde{\pi})$ of another policy $\tilde{\pi}$ and the expected advantage between the two policies, where $A^{\tilde{\pi}}(s_t, a_t)$ is the advantage function of policy $\tilde{\pi}$:

$$J(\pi) = J(\tilde{\pi}) + \mathbb{E}_{\rho_\pi, \pi}[A^{\tilde{\pi}}(s_t, a_t)].$$

For the proof, see Lemma 1 in (Schulman et al., 2015). This expression is still not tractable to analyze because of the dependency of the unnormalized state sampling distribution $\rho_\pi$ on $\pi$. Kakade & Langford (2002) and Schulman et al. (2015) thus introduce a local approximation by replacing $\rho_\pi$ with $\rho_{\tilde{\pi}}$:

$$J(\pi) \approx J(\tilde{\pi}) + \mathbb{E}_{\rho_{\tilde{\pi}}, \pi}[A^{\tilde{\pi}}(s_t, a_t)] \triangleq \tilde{J}(\pi, \tilde{\pi}).$$

We can show that $J(\pi) = \tilde{J}(\pi, \tilde{\pi} = \pi)$ and $\nabla_\pi J(\pi) = \nabla_\pi \tilde{J}(\pi, \tilde{\pi} = \pi)$, meaning that $J(\pi)$ and $\tilde{J}(\pi, \tilde{\pi})$ match up to first order. Schulman et al. (2015) then use this property, in combination with minorization-maximization (Hunter & Lange, 2004), to derive a monotonic convergence proof for a variant of the policy iteration algorithm. We also use this property to prove the monotonic convergence property of Algorithm 2, but we further derive the following lemma.

Lemma 3. If $\zeta = \max_s |\bar{A}^{\pi, \tilde{\pi}}(s)|$, then

$$\left\| J(\pi) - \tilde{J}(\pi, \tilde{\pi}) \right\| \le \frac{2\zeta\gamma}{(1-\gamma)^2} D^{\max}_{TV}(\tilde{\pi}, \pi) \le \frac{2\zeta\gamma}{(1-\gamma)^2} \sqrt{D^{\max}_{KL}(\tilde{\pi}, \pi)}.$$
Proof. We define $\rho^\pi_t(s_t)$ as the marginal state distribution at time $t$, assuming that the agent follows policy $\pi$ from the initial state distribution at time $t = 0$. Note that, from the definition of $\rho_\pi$, $\rho_\pi(s) = \sum_{t=0}^{\infty} \gamma^t \rho^\pi_t(s_t = s)$. We can use the following lemma from Kahn et al. (2016), which is adapted from Ross et al. (2011) and Schulman et al. (2015).

Lemma 4 (Kahn et al., 2016).

$$\left\| \rho^\pi_t - \rho^\beta_t \right\| \le 2t\, D^{\max}_{TV}(\pi, \beta) \le 2t \sqrt{D^{\max}_{KL}(\pi, \beta)}. \quad (10)$$

The full proof is below, where $\bar{A}^{\pi, \tilde{\pi}}(s) = \mathbb{E}_\pi[A^{\tilde{\pi}}(s_t, a_t)]$ and $A^{\tilde{\pi}}(s_t, a_t)$ is the advantage function of $\tilde{\pi}$:

$$\begin{aligned}
\left\| J(\pi) - \tilde{J}(\pi, \tilde{\pi}) \right\| &= \left\| \mathbb{E}_{\rho_{\tilde{\pi}}}[\bar{A}^{\pi, \tilde{\pi}}(s)] - \mathbb{E}_{\rho_\pi}[\bar{A}^{\pi, \tilde{\pi}}(s)] \right\| \\
&\le \sum_{t=0}^{\infty} \gamma^t \left\| \mathbb{E}_{\rho^{\tilde{\pi}}_t}[\bar{A}^{\pi, \tilde{\pi}}(s)] - \mathbb{E}_{\rho^\pi_t}[\bar{A}^{\pi, \tilde{\pi}}(s)] \right\| \\
&\le \zeta \sum_{t=0}^{\infty} \gamma^t \left\| \rho^{\tilde{\pi}}_t - \rho^\pi_t \right\| \\
&\le 2\zeta \left( \sum_{t=0}^{\infty} \gamma^t t \right) D^{\max}_{TV}(\tilde{\pi}, \pi) = \frac{2\zeta\gamma}{(1-\gamma)^2} D^{\max}_{TV}(\tilde{\pi}, \pi) \le \frac{2\zeta\gamma}{(1-\gamma)^2} \sqrt{D^{\max}_{KL}(\pi, \tilde{\pi})}. \quad (11)
\end{aligned}$$

This lemma is crucial in our theoretical analysis, as it allows us to tractably bound the biases of the full spectrum of IPG objectives $\tilde{J}_{\beta, \nu, CV}(\pi, \tilde{\pi})$ against $J(\pi)$.
Proof for Theorem 1

Proof. We first prove the bound for $\| \tilde{J}(\pi, \tilde{\pi}) - \tilde{J}_{\beta, \nu=0, CV}(\pi, \tilde{\pi}) \|$. Using Lemma 4, the bound is given below, with a derivation similar to that of Lemma 3:

$$\begin{aligned}
\left\| \tilde{J}(\pi, \tilde{\pi}) - \tilde{J}_{\beta, \nu=0, CV}(\pi, \tilde{\pi}) \right\|
&= \left\| J(\tilde{\pi}) + \mathbb{E}_{\rho_{\tilde{\pi}}, \pi}[A^{\tilde{\pi}}(s_t, a_t)] - J(\tilde{\pi}) - \mathbb{E}_{\rho_{\tilde{\pi}}, \pi}[A^{\tilde{\pi}}(s_t, a_t) - A^{\tilde{\pi}}_w(s_t, a_t)] - \mathbb{E}_{\rho_\beta}[\bar{A}^{\pi, \tilde{\pi}}_w(s_t)] \right\| \\
&= \left\| \mathbb{E}_{\rho_{\tilde{\pi}}}[\bar{A}^{\pi, \tilde{\pi}}_w(s)] - \mathbb{E}_{\rho_\beta}[\bar{A}^{\pi, \tilde{\pi}}_w(s)] \right\| \\
&\le \sum_{t=0}^{\infty} \gamma^t \left\| \mathbb{E}_{\rho^{\tilde{\pi}}_t}[\bar{A}^{\pi, \tilde{\pi}}_w(s)] - \mathbb{E}_{\rho^\beta_t}[\bar{A}^{\pi, \tilde{\pi}}_w(s)] \right\| \\
&\le \epsilon \sum_{t=0}^{\infty} \gamma^t \left\| \rho^{\tilde{\pi}}_t - \rho^\beta_t \right\| \\
&\le 2\epsilon \left( \sum_{t=0}^{\infty} \gamma^t t \right) D^{\max}_{TV}(\tilde{\pi}, \beta) = \frac{2\epsilon\gamma}{(1-\gamma)^2} D^{\max}_{TV}(\tilde{\pi}, \beta) \le \frac{2\epsilon\gamma}{(1-\gamma)^2} \sqrt{D^{\max}_{KL}(\tilde{\pi}, \beta)}. \quad (12)
\end{aligned}$$

Given this bound, we can directly derive the bound for $\| J(\pi) - \tilde{J}_{\beta, \nu=0, CV}(\pi, \tilde{\pi}) \|$ by combining with Lemma 3:

$$\begin{aligned}
\left\| J(\pi) - \tilde{J}_{\beta, \nu=0, CV}(\pi, \tilde{\pi}) \right\|
&= \left\| J(\pi) - \tilde{J}(\pi, \tilde{\pi}) + \tilde{J}(\pi, \tilde{\pi}) - \tilde{J}_{\beta, \nu=0, CV}(\pi, \tilde{\pi}) \right\| \\
&\le \left\| \tilde{J}(\pi, \tilde{\pi}) - \tilde{J}_{\beta, \nu=0, CV}(\pi, \tilde{\pi}) \right\| + \left\| J(\pi) - \tilde{J}(\pi, \tilde{\pi}) \right\| \\
&\le \frac{2\gamma}{(1-\gamma)^2} \left( \epsilon \sqrt{D^{\max}_{KL}(\tilde{\pi}, \beta)} + \zeta \sqrt{D^{\max}_{KL}(\pi, \tilde{\pi})} \right). \quad (13)
\end{aligned}$$

Proof for Monotonic Convergence in Algorithm 2
We prove that the algorithm guarantees monotonic improvement by first introducing the following corollary.
Corollary 1.

$$J(\pi) \ge M(\pi, \tilde{\pi}) \ge M_{\beta, \nu=0, CV}(\pi, \tilde{\pi}), \qquad J(\tilde{\pi}) = M(\tilde{\pi}, \tilde{\pi}) = M_{\beta, \nu=0, CV}(\tilde{\pi}, \tilde{\pi}), \quad (14)$$

where

$$M(\pi, \tilde{\pi}) = \tilde{J}(\pi, \tilde{\pi}) - C\zeta \sqrt{D^{\max}_{KL}(\pi, \tilde{\pi})},$$
$$M_{\beta, \nu=0, CV}(\pi, \tilde{\pi}) = \tilde{J}_{\beta, \nu=0, CV}(\pi, \tilde{\pi}) - C\left( \zeta \sqrt{D^{\max}_{KL}(\pi, \tilde{\pi})} + \epsilon \sqrt{D^{\max}_{KL}(\tilde{\pi}, \beta)} \right),$$
$$C = \frac{2\gamma}{(1-\gamma)^2}, \qquad \zeta = \max_s |\bar{A}^{\pi, \tilde{\pi}}(s)|, \qquad \epsilon = \max_s |\bar{A}^{\pi, \tilde{\pi}}_w(s)|.$$
Proof. This follows from Theorem 1 in the main text and Theorem 1 in Schulman et al. (2015). $J(\tilde{\pi}) = M_{\beta, \nu=0, CV}(\tilde{\pi}, \tilde{\pi})$ since $\zeta = \epsilon = 0$ when $\pi = \tilde{\pi}$.

Given Corollary 1, we use minorization-maximization (MM) (Hunter & Lange, 2004) to derive Algorithm 2 in the main text, a policy iteration algorithm that allows using off-policy samples while guaranteeing monotonic improvement in $J(\pi)$. MM implies that by maximizing the lower bound, or minorizer, of the objective at each iteration, the algorithm guarantees monotonic improvement: $J(\pi_{i+1}) \ge M_{\beta_i, \nu=0, CV}(\pi_{i+1}, \pi_i) \ge M_{\beta_i, \nu=0, CV}(\pi_i, \pi_i) = J(\pi_i)$, where $\pi_{i+1} \leftarrow \arg\max_\pi M_{\beta_i, \nu=0, CV}(\pi, \pi_i)$. Importantly, the algorithm guarantees monotonic improvement regardless of the off-policy distribution $\beta_i$ or the performance of the critic $Q_w$. This result is a step toward achieving an off-policy policy gradient with the convergence guarantees of on-policy algorithms.

We compare our theoretical algorithm with Algorithm 1 in Schulman et al. (2015), which guarantees monotonic improvement for a general on-policy policy gradient algorithm. The main difference is the additional term $-C\epsilon\sqrt{D^{\max}_{KL}(\tilde{\pi}, \beta)}$ in the lower bound. $D^{\max}_{KL}(\tilde{\pi}, \beta)$ is constant with respect to $\pi$, while $\epsilon = 0$ if $\pi = \tilde{\pi}$ and $\epsilon \ge 0$ otherwise. This suggests that as $\beta$ becomes more off-policy, the gap between the lower bound and the true objective widens, proportionally to $\sqrt{D^{\max}_{KL}(\tilde{\pi}, \beta)}$. This may make each majorization step end very close to where it started, i.e. $\pi_{i+1}$ very close to $\pi_i$, and slow down learning. This again points to a trade-off that arises as off-policy samples are used. We also note that Schulman et al. (2015) apply an additional bound on the advantage term, in terms of $\epsilon' = \max_{s,a} |A^{\tilde{\pi}}(s, a)|$, to remove the dependency on $\pi$; in our case, we cannot apply such a bound on $\zeta$, since the inequality in Theorem 1 would still be satisfied but the equality at $\pi = \tilde{\pi}$ would be violated, and the algorithm would then no longer guarantee monotonic improvement.
10 Proof for Theorem 2
We follow the same procedure as in the proof of Theorem 1: we first derive bounds between $\tilde{J}(\pi, \tilde{\pi})$ and the other local objectives, and then combine the results with Lemma 3.

To begin, we first derive the bound for the special case where $\nu = 1$. Setting $\nu = 1$ removes the likelihood ratio policy gradient term and gives the following gradient expression,

$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\rho_\beta}[\nabla_\theta \bar{Q}^\pi_w(s_t)]. \quad (15)$$

This is an off-policy actor-critic algorithm, and is closely connected to DDPG (Lillicrap et al., 2016), except that it does not use a target policy network, and its use of a stochastic policy enables on-policy exploration, trust-region policy updates, and no heuristic additive exploration noise.

We can introduce the following bound on the local objective $\tilde{J}_{\beta, \nu=1}(\pi, \tilde{\pi})$, whose policy gradient equals Eq. 15 at $\pi = \tilde{\pi}$, similarly to the proof of Theorem 1 in the main text.

Corollary 2. If $\delta = \max_{s,a} |A^{\tilde{\pi}}(s, a) - A^{\tilde{\pi}}_w(s, a)|$, $\epsilon = \max_s |\bar{A}^{\pi, \tilde{\pi}}_w(s)|$, and

$$\tilde{J}_{\beta, \nu=1}(\pi, \tilde{\pi}) = J(\tilde{\pi}) + \mathbb{E}_{\rho_\beta}[\bar{A}^{\pi, \tilde{\pi}}_w(s_t)], \quad (16)$$

then

$$\left\| \tilde{J}(\pi, \tilde{\pi}) - \tilde{J}_{\beta, \nu=1}(\pi, \tilde{\pi}) \right\| \le \frac{\delta}{1-\gamma} + \frac{2\epsilon\gamma}{(1-\gamma)^2} \sqrt{D^{\max}_{KL}(\tilde{\pi}, \beta)}. \quad (17)$$
Proof. We note that

$$\begin{aligned}
\left\| \tilde{J}(\pi, \tilde{\pi}) - \tilde{J}_{\beta, \nu=1}(\pi, \tilde{\pi}) \right\|
&= \left\| \mathbb{E}_{\rho_{\tilde{\pi}}, \pi}[A^{\tilde{\pi}}(s_t, a_t) - A^{\tilde{\pi}}_w(s_t, a_t)] + \tilde{J}(\pi, \tilde{\pi}) - \tilde{J}_{\beta, \nu=0, CV}(\pi, \tilde{\pi}) \right\| \\
&\le \left\| \mathbb{E}_{\rho_{\tilde{\pi}}, \pi}[A^{\tilde{\pi}}(s_t, a_t) - A^{\tilde{\pi}}_w(s_t, a_t)] \right\| + \left\| \tilde{J}(\pi, \tilde{\pi}) - \tilde{J}_{\beta, \nu=0, CV}(\pi, \tilde{\pi}) \right\| \\
&\le \sum_{t=0}^{\infty} \gamma^t \left\| \mathbb{E}_{\rho^{\tilde{\pi}}_t, \pi}[A^{\tilde{\pi}}(s_t, a_t) - A^{\tilde{\pi}}_w(s_t, a_t)] \right\| + \left\| \tilde{J}(\pi, \tilde{\pi}) - \tilde{J}_{\beta, \nu=0, CV}(\pi, \tilde{\pi}) \right\| \\
&\le \delta \sum_{t=0}^{\infty} \gamma^t + \left\| \tilde{J}(\pi, \tilde{\pi}) - \tilde{J}_{\beta, \nu=0, CV}(\pi, \tilde{\pi}) \right\| \\
&= \frac{\delta}{1-\gamma} + \left\| \tilde{J}(\pi, \tilde{\pi}) - \tilde{J}_{\beta, \nu=0, CV}(\pi, \tilde{\pi}) \right\| \\
&\le \frac{\delta}{1-\gamma} + \frac{2\epsilon\gamma}{(1-\gamma)^2} \sqrt{D^{\max}_{KL}(\tilde{\pi}, \beta)}, \quad (18)
\end{aligned}$$

where the last step uses the bound from the proof of Theorem 1.

Given Corollary 2 and Theorem 1, we are ready to prove the two bounds in Theorem 2.

Proof.

$$\begin{aligned}
\left\| \tilde{J}(\pi, \tilde{\pi}) - \tilde{J}_{\beta, \nu}(\pi, \tilde{\pi}) \right\|
&= \left\| J(\tilde{\pi}) + \mathbb{E}_{\rho_{\tilde{\pi}}, \pi}[A^{\tilde{\pi}}(s_t, a_t)] - J(\tilde{\pi}) - (1-\nu)\, \mathbb{E}_{\rho_{\tilde{\pi}}, \pi}[A^{\tilde{\pi}}(s_t, a_t)] - \nu\, \mathbb{E}_{\rho_\beta}[\bar{A}^{\pi, \tilde{\pi}}_w(s_t)] \right\| \\
&= \nu \left\| \mathbb{E}_{\rho_{\tilde{\pi}}, \pi}[A^{\tilde{\pi}}(s_t, a_t)] - \mathbb{E}_{\rho_\beta}[\bar{A}^{\pi, \tilde{\pi}}_w(s_t)] \right\| \\
&= \nu \left\| \mathbb{E}_{\rho_{\tilde{\pi}}, \pi}[A^{\tilde{\pi}}(s_t, a_t)] - \mathbb{E}_{\rho_{\tilde{\pi}}}[\bar{A}^{\pi, \tilde{\pi}}_w(s_t)] + \mathbb{E}_{\rho_{\tilde{\pi}}}[\bar{A}^{\pi, \tilde{\pi}}_w(s_t)] - \mathbb{E}_{\rho_\beta}[\bar{A}^{\pi, \tilde{\pi}}_w(s_t)] \right\| \\
&\le \nu \left\| \mathbb{E}_{\rho_{\tilde{\pi}}, \pi}[A^{\tilde{\pi}}(s_t, a_t) - A^{\tilde{\pi}}_w(s_t, a_t)] \right\| + \nu \left\| \mathbb{E}_{\rho_{\tilde{\pi}}}[\bar{A}^{\pi, \tilde{\pi}}_w(s_t)] - \mathbb{E}_{\rho_\beta}[\bar{A}^{\pi, \tilde{\pi}}_w(s_t)] \right\| \\
&\le \frac{\nu\delta}{1-\gamma} + \frac{2\nu\epsilon\gamma}{(1-\gamma)^2} \sqrt{D^{\max}_{KL}(\tilde{\pi}, \beta)},
\end{aligned}$$

$$\begin{aligned}
\left\| \tilde{J}(\pi, \tilde{\pi}) - \tilde{J}_{\beta, \nu, CV}(\pi, \tilde{\pi}) \right\|
&= \left\| J(\tilde{\pi}) + \mathbb{E}_{\rho_{\tilde{\pi}}, \pi}[A^{\tilde{\pi}}(s_t, a_t)] - J(\tilde{\pi}) - (1-\nu)\, \mathbb{E}_{\rho_{\tilde{\pi}}, \pi}[A^{\tilde{\pi}}(s_t, a_t) - A^{\tilde{\pi}}_w(s_t, a_t)] - \mathbb{E}_{\rho_\beta}[\bar{A}^{\pi, \tilde{\pi}}_w(s_t)] \right\| \\
&= \left\| \nu\left( \mathbb{E}_{\rho_{\tilde{\pi}}, \pi}[A^{\tilde{\pi}}(s_t, a_t)] - \mathbb{E}_{\rho_{\tilde{\pi}}}[\bar{A}^{\pi, \tilde{\pi}}_w(s_t)] \right) + \mathbb{E}_{\rho_{\tilde{\pi}}}[\bar{A}^{\pi, \tilde{\pi}}_w(s_t)] - \mathbb{E}_{\rho_\beta}[\bar{A}^{\pi, \tilde{\pi}}_w(s_t)] \right\| \\
&\le \nu \left\| \mathbb{E}_{\rho_{\tilde{\pi}}, \pi}[A^{\tilde{\pi}}(s_t, a_t)] - \mathbb{E}_{\rho_{\tilde{\pi}}}[\bar{A}^{\pi, \tilde{\pi}}_w(s_t)] \right\| + \left\| \mathbb{E}_{\rho_{\tilde{\pi}}}[\bar{A}^{\pi, \tilde{\pi}}_w(s_t)] - \mathbb{E}_{\rho_\beta}[\bar{A}^{\pi, \tilde{\pi}}_w(s_t)] \right\| \\
&\le \frac{\nu\delta}{1-\gamma} + \frac{2\epsilon\gamma}{(1-\gamma)^2} \sqrt{D^{\max}_{KL}(\tilde{\pi}, \beta)}. \quad (19)
\end{aligned}$$

We combine these bounds with Lemma 3 to conclude the proof.
11 Control Variates for Policy Gradient
In this section, we describe control variate choices for policy gradient methods other than the first-order Taylor expansion presented in Q-Prop (Gu et al., 2017).
11.1 Reparameterized Control Variates

If the action space is continuous and the policy is a simple distribution such as a Gaussian, one option is to use the full $Q_w$ as the control variate and use Monte Carlo sampling to estimate its expectation with respect to the policy. To reduce the variance of this correction term when estimating the policy gradient, the reparameterization trick (Kingma & Welling, 2014) can be employed. For a Gaussian policy $\pi_\theta(a_t|s_t) = \mathcal{N}(\mu_\theta(s_t), \sigma_\theta(s_t))$,

$$\bar{Q}^\pi_w(s_t) = \mathbb{E}_\pi[Q_w(s_t, a_t)] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}[Q_w(s_t, \mu_\theta(s_t) + \epsilon\,\sigma_\theta(s_t))] \approx \frac{1}{m} \sum_{i=1}^{m} Q_w(s_t, \mu_\theta(s_t) + \epsilon_i\, \sigma_\theta(s_t)). \quad (20)$$

For discrete actions, let $\pi_\theta(s_t) \in \mathbb{R}^k$ denote a probability vector over $k$ discrete actions, and $Q_w(s_t) \in \mathbb{R}^k$ denote the action-value function for the $k$ actions, as in DQN (Mnih et al., 2015). Then,

$$\bar{Q}^\pi_w(s_t) = \pi_\theta(s_t)^T Q_w(s_t). \quad (21)$$
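A minimal sketch of Eq. 20 and Eq. 21 follows (the shapes, toy critic, and sample size m are illustrative assumptions): the Gaussian case averages the critic over reparameterized action samples, and the discrete case is an exact dot product.

```python
import numpy as np

rng = np.random.default_rng(5)

def q_bar_gaussian(q_w, mu, sigma, m=16):
    """Monte Carlo estimate of Q_bar_w(s) = E_pi[Q_w(s, .)] for a Gaussian policy
    via the reparameterization a = mu + eps * sigma (Eq. 20). q_w is a critic for
    one fixed state s, taking an action vector; mu, sigma are that state's policy
    mean and standard deviation."""
    eps = rng.normal(size=(m,) + np.shape(mu))
    return np.mean([q_w(mu + e * sigma) for e in eps])

def q_bar_discrete(pi_probs, q_values):
    """Exact expectation for a discrete policy (Eq. 21): pi(s)^T Q_w(s)."""
    return np.dot(pi_probs, q_values)

# Example with a toy quadratic critic over a 2-D action.
q_w = lambda a: -np.sum((a - 1.0) ** 2)
print(q_bar_gaussian(q_w, mu=np.zeros(2), sigma=0.5 * np.ones(2)))
print(q_bar_discrete(np.array([0.2, 0.5, 0.3]), np.array([1.0, 0.0, -1.0])))
```

In a full implementation, the reparameterized samples would be differentiated through $\mu_\theta$ and $\sigma_\theta$ to obtain $\nabla_\theta \bar{Q}^\pi_w$; the sketch only computes the value.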
For continuous control, it is also possible to use a more general critic. If the policy is locally Gaussian, i.e. $\pi_\theta(a_t|s_t) = \mathcal{N}(\mu_\theta(s_t), \Sigma_\theta(s_t))$, then the quadratic $Q_w$ from the Normalized Advantage Function (NAF) (Gu et al., 2016) can be used directly,

$$Q_w(s_t, a_t) = A_w(s_t, a_t) + V_w(s_t), \qquad A_w(s_t, a_t) = -\frac{1}{2}(a_t - \mu_w(s_t))^T P_w(s_t)(a_t - \mu_w(s_t)). \quad (22)$$

The expected critic value under the Gaussian policy then has the closed form

$$\bar{Q}^\pi_w(s_t) = V_w(s_t) - \frac{1}{2}\mathrm{Tr}(P_w(s_t)\Sigma_\theta(s_t)) - \frac{1}{2}(\mu_\theta(s_t) - \mu_w(s_t))^T P_w(s_t)(\mu_\theta(s_t) - \mu_w(s_t)). \quad (23)$$
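A small sketch of Eq. 23 follows (with illustrative dimensions and parameter values), verified against a Monte Carlo average of the quadratic critic under the Gaussian policy.

```python
import numpy as np

def q_bar_naf(v, p_matrix, mu_w, mu_theta, sigma_theta):
    """Closed-form E_pi[Q_w(s, .)] for a quadratic NAF critic and a Gaussian
    policy N(mu_theta, Sigma_theta) at one state (Eq. 23). v, p_matrix, mu_w come
    from the critic; mu_theta, sigma_theta from the policy."""
    diff = mu_theta - mu_w
    return v - 0.5 * np.trace(p_matrix @ sigma_theta) - 0.5 * diff @ p_matrix @ diff

# Sanity check against a Monte Carlo average of the quadratic critic.
rng = np.random.default_rng(6)
d = 3
p = np.eye(d) * 2.0
mu_w, mu_theta = np.zeros(d), 0.3 * np.ones(d)
sigma_theta = 0.1 * np.eye(d)
actions = rng.multivariate_normal(mu_theta, sigma_theta, size=200_000)
q = lambda a: 1.0 - 0.5 * np.einsum('...i,ij,...j->...', a - mu_w, p, a - mu_w)
print(q_bar_naf(1.0, p, mu_w, mu_theta, sigma_theta), q(actions).mean())
```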
12 Supplementary Experimental Details
GAE (Schulman et al., 2016) is used for $\hat{A}$ estimation. The trust-region update from TRPO is used as the policy optimizer (Schulman et al., 2015). The standard Q-fitting routine from DDPG (Lillicrap et al., 2016) is used for fitting $Q_w$, where $Q_w$ is trained with batch size 64, using an experience replay buffer and a target network with soft updates. ADAM (Kingma & Ba, 2014) is used as the optimizer for $Q_w$. The policy network parametrizes a Gaussian policy $\pi_\theta(a_t|s_t) = \mathcal{N}(\mu_\theta(s_t), \Sigma_\theta)$, where $\mu_\theta$ is a two-hidden-layer neural network with tanh hidden nonlinearities and a linear output, and $\Sigma_\theta$ is a diagonal, state-independent variance. For DDPG, the policy network is deterministic and additionally has a tanh activation at the output layer. The critic $Q_w$ is a two-hidden-layer neural network with ReLU activations. We use the first-order Taylor expansion of $Q_w$ as the control variate, as in Q-Prop (Gu et al., 2017), except for the ν = 1 cases, where we use the reparametrized control variate described in Section 11.1. For IPG methods with control variates, we further explored the standard and conservative variants, the technique proposed in Q-Prop (Gu et al., 2017), and the Taylor expansion variant with the reparametrized variant discussed in Section 11.1.

The trust-region step size for the policy update is fixed to one value for HalfCheetah-v1 and Humanoid-v1 and another for Ant-v1 and Walker2d-v1, while the ADAM learning rate for the critic update is fixed to one value for HalfCheetah-v1, Ant-v1, and Humanoid-v1 and another for Walker2d-v1. These two hyperparameters are found by first running TRPO and DDPG on each domain and picking the settings that give the best performance for each domain. They are then fixed throughout the experiments to ensure fair comparisons.

Figure 2: IPG-ν=0.2 with β ≠ π and the control variate vs Q-Prop and TRPO on Humanoid-v1 with batch size 10000 in the first 10000 episodes. IPG-ν=0.2, differing only by the small ν = 0.2 multiplier, outperforms Q-Prop. All these methods exhibit stable, monotonic policy improvement. The experiment is cut at 10000 episodes due to the heavy compute requirements of Q-Prop and IPG methods, mostly from fitting the off-policy critic.

For all IPG algorithms, we use the first-order Taylor expansion control variate, as in Q-Prop, while for ν = 1 we use the reparameterized control variate from Section 11.1 with Monte Carlo sample size m = 1. The first-order Taylor expansion control variate cannot be used with ν = 1 directly, since it does not provide a gradient for training the variance term of the policy. The plots in the main text present the mean returns as solid lines, with scatter plots of all runs in the background to visualize variability.

12.2 Additional Plot