Supervised Policy Update for Deep Reinforcement Learning
Quan Vuong
University of California, San Diego [email protected]
Yiming Zhang
New York University [email protected]
Keith Ross
New York University / New York University Shanghai [email protected]

ABSTRACT
We propose a new sample-efficient methodology, called Supervised Policy Update (SPU), for deep reinforcement learning. Starting with data generated by the current policy, SPU formulates and solves a constrained optimization problem in the non-parameterized proximal policy space. Using supervised regression, it then converts the optimal non-parameterized policy to a parameterized policy, from which it draws new samples. The methodology is general in that it applies to both discrete and continuous action spaces, and can handle a wide variety of proximity constraints for the non-parameterized optimization problem. We show how the Natural Policy Gradient and Trust Region Policy Optimization (NPG/TRPO) problems, and the Proximal Policy Optimization (PPO) problem, can be addressed by this methodology. The SPU implementation is much simpler than TRPO. In terms of sample efficiency, our extensive experiments show SPU outperforms TRPO in Mujoco simulated robotic tasks and outperforms PPO in Atari video game tasks.
1 INTRODUCTION
The policy gradient problem in deep reinforcement learning (DRL) can be defined as seeking a parameterized policy with high expected reward. An issue with policy gradient methods is poor sample efficiency (Kakade, 2003; Schulman et al., 2015a; Wang et al., 2016b; Wu et al., 2017; Schulman et al., 2017). In algorithms such as REINFORCE (Williams, 1992), new samples are needed for every gradient step. When generating samples is expensive (such as in robotic environments), sample efficiency is of central concern. The sample efficiency of an algorithm is defined to be the number of calls to the environment required to attain a specified performance level (Kakade, 2003). Thus, given the current policy and a fixed number of trajectories (samples) generated, the goal of the sample efficiency problem is to construct a new policy with the highest performance improvement possible. To do so, it is desirable to limit the search to policies that are close to the original policy π_{θ_k} (Kakade, 2002; Schulman et al., 2015a; Wu et al., 2017; Achiam et al., 2017; Schulman et al., 2017; Tangkaratt et al., 2018). Intuitively, if the candidate new policy π_θ is far from the original policy π_{θ_k}, it may not perform better than the original policy because too much emphasis is being placed on the relatively small batch of new data generated by π_{θ_k}, and not enough emphasis is being placed on the relatively large amount of data and effort previously used to construct π_{θ_k}.

This guideline of limiting the search to nearby policies seems reasonable in principle, but it requires a distance η(π_θ, π_{θ_k}) between the current policy π_{θ_k} and the candidate new policy π_θ, and then an attempt to solve the constrained optimization problem:

    maximize_θ   Ĵ(π_θ | π_{θ_k}, new data)        (1)
    subject to   η(π_θ, π_{θ_k}) ≤ δ                (2)

where Ĵ(π_θ | π_{θ_k}, new data) is an estimate of J(π_θ), the performance of policy π_θ, based on the previous policy π_{θ_k} and the batch of fresh data generated by π_{θ_k}. The objective (1) attempts to maximize the performance of the updated policy, and the constraint (2) ensures that the updated policy is not too far from the policy π_{θ_k} that was used to generate the data. Several recent papers (Kakade, 2002; Schulman et al., 2015a; 2017; Tangkaratt et al., 2018) belong to the framework (1)-(2).

We propose a new methodology, called Supervised Policy Update (SPU), for this sample efficiency problem. The methodology is general in that it applies to both discrete and continuous action spaces, and can address a wide variety of constraint types for (2). Starting with data generated by the current policy, SPU optimizes over a proximal policy space to find an optimal non-parameterized policy. It then solves a supervised regression problem to convert the non-parameterized policy to a parameterized policy, from which it draws new samples. We develop a general methodology for finding an optimal policy in the non-parameterized policy space, and then illustrate the methodology for three different definitions of proximity.

Our work also strikes the right balance between performance and simplicity. The implementation is only slightly more involved than PPO (Schulman et al., 2017). Simplicity in RL algorithms has its own merits. This is especially useful when RL algorithms are used to solve problems outside of traditional RL testbeds, which is becoming a trend (Zoph & Le, 2016; Tan et al., 2018).
We also show how the Natural Policy Gradient and Trust Region Policy Optimization (NPG/TRPO) problems and the Proximal Policy Optimization (PPO) problem can be addressed by this methodology. While SPU is substantially simpler than NPG/TRPO in terms of mathematics and implementation, our extensive experiments show that SPU is more sample efficient than TRPO in Mujoco simulated robotic tasks and than PPO in Atari video game tasks.

Off-policy RL algorithms generally achieve better sample efficiency than on-policy algorithms (Haarnoja et al., 2018). However, the performance of an on-policy algorithm can usually be substantially improved by incorporating off-policy training (Mnih et al., 2015; Wang et al., 2016a). Our paper focuses on igniting interest in separating the search for the optimal policy into a two-step process: finding the optimal non-parameterized policy, and then parameterizing this optimal policy. We also wanted to deeply understand the on-policy case before adding off-policy training. We thus compare with algorithms operating under the same algorithmic constraints, one of which is being on-policy. We leave the extension to off-policy training to future work. We do not claim state-of-the-art results.

2 PRELIMINARIES
We consider a Markov Decision Process (MDP) with state space S, action space A, and reward function r(s, a), s ∈ S, a ∈ A. Let π = {π(a|s) : s ∈ S, a ∈ A} denote a policy, let Π be the set of all policies, and let the expected discounted reward be:

    J(π) ≜ E_{τ∼π} [ Σ_{t=0}^∞ γ^t r(s_t, a_t) ]        (3)

where γ ∈ (0, 1) is a discount factor and τ = (s_0, a_0, s_1, ...) is a sample trajectory. Let A^π(s, a) be the advantage function for policy π (Levine, 2017). Deep reinforcement learning considers a set of parameterized policies Π_DL = {π_θ | θ ∈ Θ} ⊂ Π, where each policy is parameterized by a neural network called the policy network. In this paper, we will consider optimizing over the parameterized policies in Π_DL as well as over the non-parameterized policies in Π. For concreteness, we assume that the state and action spaces are finite. However, our methodology also applies to continuous state and action spaces, as shown in the Appendix.

One popular approach to maximizing J(π_θ) over Π_DL is to apply stochastic gradient ascent. The gradient of J(π_θ) evaluated at a specific θ = θ_k can be shown to be (Williams, 1992):

    ∇_θ J(π_{θ_k}) = E_{τ∼π_{θ_k}} [ Σ_{t=0}^∞ γ^t ∇_θ log π_{θ_k}(a_t|s_t) A^{π_{θ_k}}(s_t, a_t) ]        (4)

We can approximate (4) by sampling N trajectories of length T from π_{θ_k}:

    ∇_θ J(π_{θ_k}) ≈ (1/N) Σ_{i=1}^N Σ_{t=0}^{T−1} ∇_θ log π_{θ_k}(a_{it}|s_{it}) A^{π_{θ_k}}(s_{it}, a_{it}) ≜ g_k        (5)

Additionally, define d^π(s) ≜ (1 − γ) Σ_{t=0}^∞ γ^t P^π(s_t = s) as the future state probability distribution for policy π, and denote π(·|s) for the probability distribution over the action space A when in state s and using policy π. Further denote D_KL(π ‖ π_{θ_k})[s] for the KL divergence from π(·|s) to π_{θ_k}(·|s), and denote the following as the "aggregated KL divergence":

    D̄_KL(π ‖ π_{θ_k}) = E_{s∼d^{π_{θ_k}}} [ D_KL(π ‖ π_{θ_k})[s] ]        (6)
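As a concrete illustration of (6), the following is a minimal sketch (our illustration, not the authors' released code) of a Monte-Carlo estimate of the aggregated KL divergence for a discrete action space. The array shapes and the helper name are assumptions made for the example.

```python
import numpy as np

def aggregated_kl(pi: np.ndarray, pi_k: np.ndarray, tiny: float = 1e-12) -> float:
    """Monte-Carlo estimate of E_{s ~ d^{pi_theta_k}}[ D_KL(pi || pi_theta_k)[s] ].

    pi, pi_k: arrays of shape (N, K) holding the action probabilities of the two
    policies at N states visited under pi_theta_k, so that the empirical state
    distribution approximates d^{pi_theta_k}."""
    per_state_kl = np.sum(pi * (np.log(pi + tiny) - np.log(pi_k + tiny)), axis=1)
    return float(per_state_kl.mean())

# Example with two states and three actions.
pi_k = np.array([[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]])
pi_new = np.array([[0.4, 0.4, 0.2], [0.2, 0.5, 0.3]])
print(aggregated_kl(pi_new, pi_k))
```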
2.1 SURROGATE OBJECTIVES FOR THE SAMPLE EFFICIENCY PROBLEM

For the sample efficiency problem, the objective J(π_θ) is typically approximated using samples generated from π_{θ_k} (Schulman et al., 2015a; Achiam et al., 2017; Schulman et al., 2017). Two different approaches are typically used to approximate J(π_θ) − J(π_{θ_k}). We can make a first-order approximation of J(π_θ) around θ_k (Kakade, 2002; Peters & Schaal, 2008a;b; Schulman et al., 2015a):

    J(π_θ) − J(π_{θ_k}) ≈ (θ − θ_k)^T ∇_θ J(π_{θ_k}) ≈ (θ − θ_k)^T g_k        (7)

where g_k is the sample estimate (5). The second approach is to approximate the state distribution d^{π_θ}(s) with d^{π_{θ_k}}(s) (Achiam et al., 2017; Schulman et al., 2017; Achiam, 2017):

    J(π) − J(π_{θ_k}) ≈ L_{π_{θ_k}}(π) ≜ (1/(1−γ)) E_{s∼d^{π_{θ_k}}} E_{a∼π_{θ_k}(·|s)} [ (π(a|s)/π_{θ_k}(a|s)) A^{π_{θ_k}}(s, a) ]        (8)

There is a well-known bound for the approximation (8) (Kakade & Langford, 2002; Achiam et al., 2017). Furthermore, the approximation L_{π_{θ_k}}(π_θ) matches J(π_θ) − J(π_{θ_k}) to first order with respect to the parameter θ (Achiam et al., 2017).
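For completeness, here is a hedged sketch of the sample estimate of (8): the importance-weighted advantage averaged over (state, action) pairs collected under π_{θ_k}. The constant 1/(1−γ) is commonly dropped in implementations; names and shapes are illustrative.

```python
import numpy as np

def surrogate_objective(logp_new: np.ndarray, logp_old: np.ndarray,
                        advantages: np.ndarray) -> float:
    """Sample estimate of E[ (pi(a|s)/pi_theta_k(a|s)) * A^{pi_theta_k}(s,a) ]
    over (s, a) pairs collected under pi_theta_k."""
    ratio = np.exp(logp_new - logp_old)   # pi(a_i|s_i) / pi_theta_k(a_i|s_i)
    return float(np.mean(ratio * advantages))

# Example with three sampled (state, action) pairs.
print(surrogate_objective(np.log([0.4, 0.2, 0.7]), np.log([0.3, 0.3, 0.6]),
                          np.array([1.0, -0.5, 0.2])))
```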
3 RELATED WORK

Natural gradient (Amari, 1998) was first introduced to policy gradient by Kakade (2002) and then further developed in (Peters & Schaal, 2008a;b; Achiam, 2017; Schulman et al., 2015a); we refer to this line of work collectively as NPG/TRPO. Algorithmically, NPG/TRPO finds the gradient update by solving the sample efficiency problem (1)-(2) with η(π_θ, π_{θ_k}) = D̄_KL(π_θ ‖ π_{θ_k}), i.e., using the aggregated KL divergence for the policy proximity constraint (2). NPG/TRPO addresses this problem in the parameter space θ ∈ Θ. First, it approximates J(π_θ) with the first-order approximation (7) and D̄_KL(π_θ ‖ π_{θ_k}) with a second-order approximation. Second, it uses samples from π_{θ_k} to form estimates of these two approximations. Third, using these estimates (which are functions of θ), it solves for the optimal θ*. The optimal θ* is a function of g_k and of h_k, the sample average of the Hessian evaluated at θ_k. TRPO also limits the magnitude of the update to ensure D̄_KL(π_θ ‖ π_{θ_k}) ≤ δ (i.e., ensuring the sampled estimate of the aggregated KL constraint is met without the second-order approximation).

SPU takes a very different approach by first (i) posing and solving the optimization problem in the non-parameterized policy space, and then (ii) solving a supervised regression problem to find a parameterized policy that is near the optimal non-parameterized policy. A recent paper, Guided Actor Critic (GAC), independently proposed a similar decomposition (Tangkaratt et al., 2018). However, GAC is much more restricted in that it considers only one specific constraint criterion (aggregated reverse-KL divergence) and applies only to continuous action spaces. Furthermore, GAC incurs significantly higher computational complexity: at every update, it minimizes the dual function to obtain the dual variables using SLSQP. MPO also independently proposes a similar decomposition (Abdolmaleki et al., 2018). MPO uses much more complex machinery, namely Expectation Maximization, to address the DRL problem. However, MPO has only demonstrated preliminary results on problems with discrete actions, whereas our approach naturally applies to problems with either discrete or continuous actions. In both GAC and MPO, working in the non-parameterized space is a by-product of applying the main ideas in those papers to DRL. Our paper demonstrates that the decomposition alone is a general and useful technique for solving constrained policy optimization.

Clipped-PPO (Schulman et al., 2017) takes a very different approach from TRPO. At each iteration, PPO makes many gradient steps while only using the data from π_{θ_k}. Without the clipping, PPO maximizes the approximation (8). The clipping is analogous to the constraint (2) in that it has the goal of keeping π_θ close to π_{θ_k}. Indeed, the clipping keeps π_θ(a_t|s_t) from becoming either much larger than (1 + ε)π_{θ_k}(a_t|s_t) or much smaller than (1 − ε)π_{θ_k}(a_t|s_t). Thus, although the clipped PPO objective does not squarely fit into the optimization framework (1)-(2), it is quite similar in spirit. We note that the PPO paper considers adding the KL penalty to the objective function, whose gradient is similar to ours.
However, this form of gradient was demonstrated to be inferior to Clipped-PPO. To the best of our knowledge, ours is the first work to demonstrate that this form of gradient outperforms Clipped-PPO.

Actor-Critic using Kronecker-Factored Trust Region (ACKTR) (Wu et al., 2017) proposed using Kronecker-factored approximate curvature (K-FAC) to update both the policy gradient and critic terms, giving a more computationally efficient method of calculating the natural gradients. ACER (Wang et al., 2016a) exploits past episodes, linearizes the KL divergence constraint, and maintains an average policy network to enforce the KL divergence constraint. In future work, it would be of interest to extend the SPU methodology to handle past episodes. In contrast to bounding the KL divergence on the action distribution as we have done in this work, Relative Entropy Policy Search bounds the joint distribution of state and action and was only demonstrated to work on small problems (Peters et al., 2010).

4 THE SPU FRAMEWORK
The SPU methodology has two steps. In the first step, for a given constraint criterion η(π, π_{θ_k}) ≤ δ, we find the optimal solution to the non-parameterized problem:

    maximize_{π∈Π}   L_{π_{θ_k}}(π)                 (9)
    subject to       η(π, π_{θ_k}) ≤ δ              (10)

Note that π is not restricted to the set of parameterized policies Π_DL. As commonly done, we use the approximation (8) as the objective. However, unlike PPO/TRPO, we are not approximating the constraint (2). We will show below that the optimal solution π* for the non-parameterized problem (9)-(10) can be determined nearly in closed form for many natural constraint criteria η(π, π_{θ_k}) ≤ δ. In the second step, we attempt to find a policy π_θ in the parameterized space Π_DL that is close to the target policy π*. Concretely, to advance from θ_k to θ_{k+1}, we perform the following steps:

(i) We first sample N trajectories using policy π_{θ_k}, giving sample data (s_i, a_i, A_i), i = 1, ..., m. Here A_i is an estimate of the advantage value A^{π_{θ_k}}(s_i, a_i). (For simplicity, we index the samples with i rather than with (i, t) corresponding to the t-th sample in the i-th trajectory.)

(ii) For each s_i, we define the target distribution π* to be the optimal solution to the constrained optimization problem (9)-(10) for a specific constraint η.

(iii) We then fit the policy network π_θ to the target distributions π*(·|s_i), i = 1, ..., m. Specifically, to find θ_{k+1}, we minimize the following supervised loss function:

    L(θ) = (1/m) Σ_{i=1}^m D_KL(π_θ ‖ π*)[s_i]        (11)

For this step, we initialize with the weights of π_{θ_k}. We minimize the loss function L(θ) with stochastic gradient descent methods. The resulting θ becomes our θ_{k+1}.
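The following is a self-contained toy illustration (our sketch, not the authors' code) of steps (i)-(iii): compute a non-parameterized target distribution at each sampled state and then fit the policy network to those targets by minimizing the KL loss (11). For simplicity the target is the unconstrained exponentiated-advantage policy of Section 5.1, the data is synthetic, and advantages are assumed known for every action; in practice only the sampled action's advantage is estimated, which is why the algorithm works with the gradient form derived in Section 5.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_states, num_actions, state_dim = 32, 4, 8

policy = nn.Linear(state_dim, num_actions)        # logits of pi_theta
old_policy = nn.Linear(state_dim, num_actions)    # frozen pi_theta_k
old_policy.load_state_dict(policy.state_dict())

# Step (i): pretend these came from trajectories sampled under pi_theta_k.
states = torch.randn(num_states, state_dim)
advantages = torch.randn(num_states, num_actions)  # toy: A known for every action

# Step (ii): per-state targets pi*(.|s_i) proportional to pi_theta_k(.|s_i) * exp(A/lambda).
lam = 1.0
with torch.no_grad():
    old_probs = torch.softmax(old_policy(states), dim=-1)
    targets = old_probs * torch.exp(advantages / lam)
    targets = targets / targets.sum(dim=-1, keepdim=True)

# Step (iii): supervised regression, minimizing (1/m) sum_i D_KL(pi_theta || pi*)[s_i].
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
for epoch in range(200):
    logp = torch.log_softmax(policy(states), dim=-1)
    p = logp.exp()
    kl = (p * (logp - torch.log(targets))).sum(dim=-1).mean()
    opt.zero_grad()
    kl.backward()
    opt.step()
print("final KL to targets:", float(kl))
```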
5 SPU APPLIED TO SPECIFIC PROXIMITY CRITERIA

To illustrate the SPU methodology, for three different but natural types of proximity constraints, we solve the corresponding non-parameterized optimization problem and derive the resulting gradient for the SPU supervised learning problem. We also demonstrate that different constraints lead to very different but intuitive forms of the gradient update.

5.1 FORWARD AGGREGATE AND DISAGGREGATE KL CONSTRAINTS
We first consider constraint criteria of the form:

    maximize_{π∈Π}   Σ_s d^{π_{θ_k}}(s) E_{a∼π(·|s)}[A^{π_{θ_k}}(s, a)]        (12)
    subject to       Σ_s d^{π_{θ_k}}(s) D_KL(π ‖ π_{θ_k})[s] ≤ δ               (13)
                     D_KL(π ‖ π_{θ_k})[s] ≤ ε   for all s                       (14)

Note that this problem is equivalent to maximizing L_{π_{θ_k}}(π) subject to the constraints (13) and (14). We refer to (13) as the "aggregated KL constraint" and to (14) as the "disaggregated KL constraint". These two constraints taken together restrict π from deviating too much from π_{θ_k}. We shall refer to (12)-(14) as the forward-KL non-parameterized optimization problem.

Note that this problem without the disaggregated constraints is analogous to the TRPO problem. The TRPO paper actually prefers enforcing the disaggregated constraint to enforcing the aggregated constraint. However, for mathematical convenience, they worked with the aggregated constraint: "While it is motivated by the theory, this problem is impractical to solve due to the large number of constraints. Instead, we can use a heuristic approximation which considers the average KL divergence" (Schulman et al., 2015a). The SPU framework allows us to solve the optimization problem with the disaggregated constraints exactly. Experimentally, we compared against TRPO in a controlled experimental setting, e.g. using the same advantage estimation scheme. Since we clearly outperform TRPO, we argue that SPU's two-step procedure has significant potential.

For each λ > 0, define:

    π_λ(a|s) = (π_{θ_k}(a|s) / Z_λ(s)) e^{A^{π_{θ_k}}(s,a)/λ}

where Z_λ(s) is the normalization term. Note that π_λ(a|s) is a function of λ. Further, for each s, let λ_s be such that D_KL(π_{λ_s} ‖ π_{θ_k})[s] = ε. Also let Γ_λ = {s : D_KL(π_λ ‖ π_{θ_k})[s] ≤ ε}.
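A minimal numpy sketch (our illustration, with hypothetical helper names) of the target policy π_λ(a|s) just defined, for a discrete action space, together with a simple bisection that finds the per-state λ_s at which D_KL(π_{λ_s} ‖ π_{θ_k})[s] = ε. The same kind of line search can be used for the aggregated multiplier λ in Theorem 1 below.

```python
import numpy as np

def pi_lambda(pi_k_s: np.ndarray, adv_s: np.ndarray, lam: float) -> np.ndarray:
    """Exponentiated-advantage target at one state s:
    pi_lambda(a|s) = pi_theta_k(a|s) * exp(A(s,a)/lambda) / Z_lambda(s)."""
    unnorm = pi_k_s * np.exp(adv_s / lam)
    return unnorm / unnorm.sum()

def forward_kl(p, q, tiny=1e-12):
    return float(np.sum(p * (np.log(p + tiny) - np.log(q + tiny))))

def lambda_for_state(pi_k_s, adv_s, epsilon, lo=1e-3, hi=1e3, iters=50):
    """Bisection for lambda_s with D_KL(pi_{lambda_s} || pi_theta_k)[s] ~= epsilon
    (the KL shrinks as lambda grows, so the search is monotone)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if forward_kl(pi_lambda(pi_k_s, adv_s, mid), pi_k_s) > epsilon:
            lo = mid            # KL too large: increase lambda
        else:
            hi = mid
    return hi

pi_k_s = np.array([0.5, 0.3, 0.2])
adv_s = np.array([1.0, -0.5, 0.2])
lam_s = lambda_for_state(pi_k_s, adv_s, epsilon=0.05)
print(lam_s, pi_lambda(pi_k_s, adv_s, lam_s))
```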
Theorem 1. The optimal solution to the problem (12)-(14) is given by:

    π̃_λ(a|s) = { π_λ(a|s)      if s ∈ Γ_λ
                 π_{λ_s}(a|s)   if s ∉ Γ_λ        (15)

where λ is chosen so that Σ_s d^{π_{θ_k}}(s) D_KL(π̃_λ ‖ π_{θ_k})[s] = δ (Proof in subsection A.1).

Equation (15) provides the structure of the optimal non-parameterized policy. As part of the SPU framework, we then seek a parameterized policy π_θ that is close to π̃_λ(a|s), that is, one that minimizes the loss function (11). For each sampled state s_i, a straightforward calculation shows (Appendix B):

    ∇_θ D_KL(π_θ ‖ π̃_λ)[s_i] = ∇_θ D_KL(π_θ ‖ π_{θ_k})[s_i] − (1/λ̃_{s_i}) E_{a∼π_{θ_k}(·|s_i)}[ (∇_θ π_θ(a|s_i) / π_{θ_k}(a|s_i)) A^{π_{θ_k}}(s_i, a) ]        (16)

where λ̃_{s_i} = λ for s_i ∈ Γ_λ and λ̃_{s_i} = λ_{s_i} for s_i ∉ Γ_λ. We estimate the expectation in (16) with the sampled action a_i and approximate A^{π_{θ_k}}(s_i, a_i) by A_i (obtained from the critic network), giving:

    ∇_θ D_KL(π_θ ‖ π̃_λ)[s_i] ≈ ∇_θ D_KL(π_θ ‖ π_{θ_k})[s_i] − (1/λ̃_{s_i}) (∇_θ π_θ(a_i|s_i) / π_{θ_k}(a_i|s_i)) A_i        (17)

To simplify the algorithm, we slightly modify (17). We replace the hyper-parameter δ with the hyper-parameter λ and tune λ rather than δ. Further, we set λ̃_{s_i} = λ for all s_i in (17) and introduce per-state acceptance to enforce the disaggregated constraints, giving the approximate gradient:

    ∇_θ D_KL(π_θ ‖ π̃_λ) ≈ (1/m) Σ_{i=1}^m [ ∇_θ D_KL(π_θ ‖ π_{θ_k})[s_i] − (1/λ) (∇_θ π_θ(a_i|s_i) / π_{θ_k}(a_i|s_i)) A_i ] · 1{ D_KL(π_θ ‖ π_{θ_k})[s_i] ≤ ε }        (18)

We make the approximation that the disaggregated constraints are only enforced on the states in the sampled trajectories. We use (18) as our gradient for supervised training of the policy network. Equation (18) has an intuitive interpretation: the gradient represents a trade-off between the approximate performance of π_θ (as captured by (1/λ)(∇_θ π_θ(a_i|s_i)/π_{θ_k}(a_i|s_i)) A_i) and how far π_θ diverges from π_{θ_k} (as captured by ∇_θ D_KL(π_θ ‖ π_{θ_k})[s_i]). For the stopping criterion, we train until (1/m) Σ_i D_KL(π_θ ‖ π_{θ_k})[s_i] ≈ δ.
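Below is a hedged PyTorch sketch of a loss whose gradient matches the approximate gradient (18): the per-state KL penalty minus an advantage term weighted by the likelihood ratio, masked by the per-state acceptance indicator. The coefficient adv_coef plays the role of 1/λ in (18) and is treated directly as a hyper-parameter; all names are our own.

```python
import torch

def forward_kl_spu_loss(logits_new, logits_old, actions, advantages, adv_coef, epsilon):
    """logits_new: (m, K) from pi_theta; logits_old: (m, K) from pi_theta_k;
    actions: (m,) LongTensor of sampled a_i; advantages: (m,) tensor of A_i.
    adv_coef corresponds to 1/lambda in (18)."""
    logp_new = torch.log_softmax(logits_new, dim=-1)
    logp_old = torch.log_softmax(logits_old, dim=-1).detach()
    p_new = logp_new.exp()

    # D_KL(pi_theta || pi_theta_k)[s_i] at each sampled state
    kl_per_state = (p_new * (logp_new - logp_old)).sum(dim=-1)

    # importance ratio pi_theta(a_i|s_i) / pi_theta_k(a_i|s_i)
    idx = actions.unsqueeze(-1)
    ratio = (logp_new.gather(-1, idx) - logp_old.gather(-1, idx)).exp().squeeze(-1)

    # per-state acceptance: states violating D_KL(pi_theta || pi_theta_k)[s_i] <= epsilon
    # do not contribute to the update
    accept = (kl_per_state.detach() <= epsilon).float()

    per_state = kl_per_state - adv_coef * ratio * advantages
    return (accept * per_state).mean()

# Toy usage with random data (5 sampled states, 4 actions).
logits_new = torch.randn(5, 4, requires_grad=True)
logits_old = torch.randn(5, 4)
loss = forward_kl_spu_loss(logits_new, logits_old,
                           torch.randint(0, 4, (5,)), torch.randn(5), 1.0, 0.05)
loss.backward()
```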
5.2 BACKWARD KL CONSTRAINT
In a similar manner, we can derive the structure of the optimal policy when using the reverse KL divergence as the constraint. For simplicity, we provide the result for when there are only disaggregated constraints. We seek the non-parameterized optimal policy by solving:

    maximize_{π∈Π}   Σ_s d^{π_{θ_k}}(s) E_{a∼π(·|s)}[A^{π_{θ_k}}(s, a)]        (19)
    subject to       D_KL(π_{θ_k} ‖ π)[s] ≤ ε   for all s                      (20)
Theorem 2. The optimal solution to the problem (19)-(20) is given by:

    π*(a|s) = π_{θ_k}(a|s) · λ(s) / (λ′(s) − A^{π_{θ_k}}(s, a))        (21)

where λ(s) > 0 and λ′(s) > max_a A^{π_{θ_k}}(s, a) (Proof in subsection A.2).

Note that the structure of the optimal policy with the backward KL constraint is quite different from that with the forward KL constraint. A straightforward calculation shows (Appendix B):

    ∇_θ D_KL(π_θ ‖ π*)[s] = ∇_θ D_KL(π_θ ‖ π_{θ_k})[s] − E_{a∼π_{θ_k}}[ (∇_θ π_θ(a|s) / π_{θ_k}(a|s)) log( 1 / (λ′(s) − A^{π_{θ_k}}(s, a)) ) ]        (22)

Equation (22) has an intuitive interpretation. It increases the probability of action a if A^{π_{θ_k}}(s, a) > λ′(s) − 1 and decreases the probability of action a if A^{π_{θ_k}}(s, a) < λ′(s) − 1. (22) also tries to keep π_θ close to π_{θ_k} by minimizing their KL divergence.
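A minimal numpy sketch of the backward-KL target (21): for a fixed λ′(s), λ(s) is determined by normalization, and λ′(s) can then be found by a bisection on the reverse KL, assuming (as the structure of the problem suggests) that the reverse KL decreases monotonically in λ′(s). Helper names are illustrative assumptions.

```python
import numpy as np

def backward_kl_target(pi_k_s: np.ndarray, adv_s: np.ndarray, lam_prime: float) -> np.ndarray:
    """pi*(.|s) proportional to pi_theta_k(.|s) / (lambda'(s) - A(s,.));
    the normalization constant plays the role of lambda(s) in (21)."""
    unnorm = pi_k_s / (lam_prime - adv_s)
    return unnorm / unnorm.sum()

def reverse_kl(pi_k_s, pi_star_s, tiny=1e-12):
    return float(np.sum(pi_k_s * (np.log(pi_k_s + tiny) - np.log(pi_star_s + tiny))))

def solve_lambda_prime(pi_k_s, adv_s, epsilon, iters=60):
    """Bisection for lambda'(s) such that D_KL(pi_theta_k || pi*)[s] ~= epsilon."""
    lo = adv_s.max() + 1e-6     # KL blows up as lambda' -> max_a A(s,a)
    hi = adv_s.max() + 1e6      # KL -> 0 as lambda' -> infinity
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if reverse_kl(pi_k_s, backward_kl_target(pi_k_s, adv_s, mid)) > epsilon:
            lo = mid            # constraint violated: move to larger lambda'
        else:
            hi = mid
    return hi

pi_k_s = np.array([0.5, 0.3, 0.2])
adv_s = np.array([1.0, -0.5, 0.2])
lp = solve_lambda_prime(pi_k_s, adv_s, epsilon=0.05)
print(lp, backward_kl_target(pi_k_s, adv_s, lp))
```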
5.3 L∞ CONSTRAINT

In this section we show how a PPO-like objective can be formulated in the context of SPU. Recall from Section 3 that the clipping in PPO can be seen as an attempt at keeping π_θ(a_i|s_i) from becoming either much larger than (1 + ε)π_{θ_k}(a_i|s_i) or much smaller than (1 − ε)π_{θ_k}(a_i|s_i) for i = 1, ..., m. In this subsection, we consider the constraint function

    η(π, π_{θ_k}) = max_{i=1,...,m} |π(a_i|s_i) − π_{θ_k}(a_i|s_i)| / π_{θ_k}(a_i|s_i)        (23)

which leads us to the following optimization problem:

    maximize_{π(a_1|s_1),...,π(a_m|s_m)}   Σ_{i=1}^m A^{π_{θ_k}}(s_i, a_i) π(a_i|s_i) / π_{θ_k}(a_i|s_i)        (24)
    subject to   |π(a_i|s_i) − π_{θ_k}(a_i|s_i)| / π_{θ_k}(a_i|s_i) ≤ ε,   i = 1, ..., m                        (25)
                 Σ_{i=1}^m ( (π(a_i|s_i) − π_{θ_k}(a_i|s_i)) / π_{θ_k}(a_i|s_i) )² ≤ δ                          (26)

Note that here we are using a variation of the SPU methodology described in Section 4: we first create estimates of the expectations in the objective and constraints and then solve the optimization problem (rather than first solving the optimization problem and then taking samples, as done for Theorems 1 and 2). Note also that we have included an aggregated constraint (26) in addition to the PPO-like constraint (25), which further ensures that the updated policy is close to π_{θ_k}.
Theorem 3. The optimal solution to the optimization problem (24)-(26) is given by:

    π*(a_i|s_i) = { π_{θ_k}(a_i|s_i)(1 + min{λA_i, ε})    if A_i ≥ 0
                    π_{θ_k}(a_i|s_i)(1 + max{λA_i, −ε})   if A_i < 0        (27)

for some λ > 0, where A_i ≜ A^{π_{θ_k}}(s_i, a_i) (Proof in subsection A.3).

To simplify the algorithm, we treat λ as a hyper-parameter rather than δ. After solving for π*, we seek a parameterized policy π_θ that is close to π* by minimizing their mean squared error over sampled states and actions, i.e. by updating θ in the negative direction of ∇_θ Σ_i (π_θ(a_i|s_i) − π*(a_i|s_i))². This loss is used for supervised training instead of the KL because we take estimates before forming the optimization problem; the optimal values for the decision variables therefore do not completely characterize a distribution. We refer to this approach as SPU with the L∞ constraint.

Although we consider three classes of proximity constraint, there may be yet another class that leads to even better performance. The methodology allows researchers to explore other proximity constraints in the future.
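A minimal numpy sketch of the L∞ target (27); the parameterized policy is then fit by minimizing the mean squared error to these targets, as described above. Names are illustrative assumptions.

```python
import numpy as np

def linf_target(pi_k_probs: np.ndarray, advantages: np.ndarray,
                lam: float, eps: float) -> np.ndarray:
    """pi_k_probs[i] = pi_theta_k(a_i|s_i); returns the target pi*(a_i|s_i) from (27):
    the sampled action's probability is moved up or down by the clipped,
    advantage-scaled step."""
    step = np.where(advantages >= 0.0,
                    np.minimum(lam * advantages, eps),
                    np.maximum(lam * advantages, -eps))
    return pi_k_probs * (1.0 + step)

# Example; the policy network would then be regressed onto these targets by
# minimizing sum_i (pi_theta(a_i|s_i) - pi*(a_i|s_i))^2.
print(linf_target(np.array([0.2, 0.5]), np.array([2.0, -1.0]), lam=0.05, eps=0.1))
```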
6 EXPERIMENTAL RESULTS

Extensive experimental results demonstrate that SPU outperforms recent state-of-the-art methods in environments with continuous or discrete action spaces. We provide ablation studies to show the importance of the different algorithmic components, and a sensitivity analysis to show that SPU's performance is relatively insensitive to hyper-parameter choices. We use two definitions to conclude that algorithm A is more sample efficient than algorithm B: (i) A takes fewer environment interactions to achieve a pre-defined performance threshold (Kakade, 2003); (ii) the averaged final performance of A is higher than that of B given the same number of environment interactions (Schulman et al., 2017). Implementation details are provided in Appendix D.

6.1 RESULTS ON MUJOCO
The Mujoco (Todorov et al., 2012) simulated robotics environments provided by OpenAI Gym (Brockman et al., 2016) have become a popular benchmark for control problems with continuous action spaces. In terms of final performance averaged over all ten available Mujoco environments and ten different seeds in each, both SPU with the L∞ constraint (Section 5.3) and SPU with the forward-KL constraints (Section 5.1) outperform TRPO (the forward-KL variant by 27%; see Table 1). Since the forward-KL approach is our best performing approach, we focus subsequent analysis on it and hereafter refer to it as SPU. SPU also outperforms PPO. Figure 1 illustrates the performance of SPU versus TRPO and PPO.

To ensure that SPU is not only better than TRPO in terms of performance gain early during training, we further train both policies for 3 million timesteps. Again here, SPU outperforms TRPO. Figure 3 in the Appendix illustrates the performance for each environment. Code for the Mujoco experiments is at https://github.com/quanvuong/Supervised_Policy_Update.

6.2 ABLATION STUDIES FOR MUJOCO
The indicator variable in (18) enforces the disaggregated constraint; we refer to it as per-state acceptance. Removing this component is equivalent to removing the indicator variable. We refer to using (1/m) Σ_i D_KL(π_θ ‖ π_{θ_k})[s_i] to determine the number of training epochs as dynamic stopping. Without this component, the number of training epochs is a fixed hyper-parameter. We also tried removing ∇_θ D_KL(π_θ ‖ π_{θ_k})[s_i] from the gradient update step in (18). Table 1 illustrates the contribution of the different components of SPU to the overall performance. The third row shows that the term ∇_θ D_KL(π_θ ‖ π_{θ_k})[s_i] makes a crucially important contribution to SPU. Furthermore, per-state acceptance and dynamic stopping are both also important for obtaining high performance, with the former playing a more central role. When a component is removed, the hyper-parameters are retuned to ensure that the best possible performance is obtained with the alternative (simpler) algorithm.

Figure 1: SPU versus TRPO and PPO on 10 Mujoco environments over 1 million timesteps. The x-axis indicates timesteps. The y-axis indicates the average episode reward of the last 100 episodes.

Table 1: Ablation study for SPU

Approach                   Percentage better than TRPO    Performance vs. original algorithm
Original algorithm         27%                            0%
No grad KL                 4%                             -85%
No dynamic stopping        24%                            -11%
No per-state acceptance    9%                             -67%

6.3 SENSITIVITY ANALYSIS ON MUJOCO
To demonstrate the practicality of SPU, we show that its high performance is insensitive to hyper-parameter choice. One way to show this is as follows: for each SPU hyper-parameter, select a reasonably large interval, randomly sample the value of the hyper-parameter from this interval, and then compare SPU (using the randomly chosen hyper-parameter values) with TRPO. We sampled 100 SPU hyper-parameter vectors (each vector including δ, ε, λ), and for each one determined the relative performance with respect to TRPO. We found that for all 100 random hyper-parameter value samples, SPU performed better than TRPO, and that a large fraction of the samples outperformed TRPO by a substantial margin; the full CDF is given in Figure 4 in the Appendix. We can conclude that SPU's superior performance is largely insensitive to hyper-parameter values.

6.4 RESULTS ON ATARI
Rajeswaran et al. (2017) and Mania et al. (2018) demonstrate that neural networks are not needed to obtain high performance in many Mujoco environments. To conclusively evaluate SPU, we compare it against PPO on the Arcade Learning Environment (Bellemare et al., 2012) exposed through OpenAI Gym (Brockman et al., 2016). Using the same network architecture and hyper-parameters, we learn to play 60 Atari games from raw pixels and rewards. This is highly challenging because of the diversity in the games and the high dimensionality of the observations.

Here, we compare SPU against PPO because PPO outperforms TRPO in Mujoco. Averaged over 60 Atari environments and 20 seeds, SPU is better than PPO in terms of averaged final performance. Figure 2 provides a high-level overview of the results. The dots in the shaded area represent environments where the two performances are roughly similar. The dots to the right of the shaded area represent environments where SPU is more sample efficient than PPO. We can draw two conclusions: (i) in 36 environments, SPU and PPO perform roughly the same, SPU clearly outperforms PPO in 15 environments, and PPO clearly outperforms SPU in 9; (ii) in those 15+9 environments, the extent to which SPU outperforms PPO is much larger than the extent to which PPO outperforms SPU. Figure 5, Figure 6 and Figure 7 in the Appendix illustrate the performance of SPU vs PPO throughout training. SPU's strong results in both the Mujoco and Atari domains demonstrate its performance and generality.

Figure 2: High-level overview of results on Atari

ACKNOWLEDGEMENTS
We would like to acknowledge the extremely helpful support of the NYU Shanghai High Performance Computing administrator Zhiguo Qi. We are also grateful to OpenAI for open-sourcing their baselines code.

REFERENCES
Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. 2018. URL https://arxiv.org/abs/1806.06920.

Joshua Achiam. Advanced policy gradient methods. http://rll.berkeley.edu/deeprlcourse/f17docs/lecture_13_advanced_pg.pdf, 2017.

Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In International Conference on Machine Learning, pp. 22-31, 2017.

Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276, 1998.

Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. CoRR, abs/1207.4708, 2012. URL http://arxiv.org/abs/1207.4708.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. CoRR, abs/1606.01540, 2016. URL http://arxiv.org/abs/1606.01540.

Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. OpenAI baselines. https://github.com/openai/baselines, 2017.

Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. CoRR, abs/1604.06778, 2016. URL http://arxiv.org/abs/1604.06778.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR, abs/1801.01290, 2018. URL http://arxiv.org/abs/1801.01290.

Jan Peters, Katharina Mülling, and Yasemin Altun. Relative entropy policy search. 2010.

Sham Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, University of London, London, England, 2003.

Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pp. 267-274, 2002.

Sham M. Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, pp. 1531-1538, 2002.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.

Sergey Levine. UC Berkeley CS294 deep reinforcement learning lecture notes. http://rail.eecs.berkeley.edu/deeprlcourse-fa17/index.html, 2017.

Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.

Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V. Le. MnasNet: Platform-aware neural architecture search for mobile. 2018. URL https://arxiv.org/abs/1807.11626.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518, 2015. URL http://dx.doi.org/10.1038/nature14236.

Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180-1190, 2008a.

Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682-697, 2008b.

Aravind Rajeswaran, Kendall Lowrey, Emanuel Todorov, and Sham Kakade. Towards generalization and simplicity in continuous control. CoRR, abs/1703.02660, 2017. URL http://arxiv.org/abs/1703.02660.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889-1897, 2015a.

John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. CoRR, abs/1506.02438, 2015b. URL http://arxiv.org/abs/1506.02438.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Voot Tangkaratt, Abbas Abdolmaleki, and Masashi Sugiyama. Guide actor-critic for continuous control. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=BJk59JZ0b.

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. 2012. URL https://ieeexplore.ieee.org/abstract/document/6386109/authors.

Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Rémi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. CoRR, abs/1611.01224, 2016a. URL http://arxiv.org/abs/1611.01224.

Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016b.

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pp. 5-32. Springer, 1992.

Yuhuai Wu, Elman Mansimov, Roger B. Grosse, Shun Liao, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in Neural Information Processing Systems, pp. 5285-5294, 2017.

Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. CoRR, abs/1611.01578, 2016. URL http://arxiv.org/abs/1611.01578.
Appendices
A PROOFS FOR NON-PARAMETERIZED OPTIMIZATION PROBLEMS

A.1 FORWARD KL AGGREGATED AND DISAGGREGATED CONSTRAINTS
We first show that (12)-(14) is a convex optimization problem. To this end, first note that the objective (12) is a linear function of the decision variables π = {π(a|s) : s ∈ S, a ∈ A}. The LHS of (14) can be rewritten as Σ_{a∈A} π(a|s) log π(a|s) − Σ_{a∈A} π(a|s) log π_{θ_k}(a|s). The second term is a linear function of π. The first term is a convex function since the second derivative of each summand is always positive. The LHS of (14) is thus a convex function. By extension, the LHS of (13) is also a convex function since it is a nonnegative weighted sum of convex functions. The problem (12)-(14) is thus a convex optimization problem. According to Slater's constraint qualification, strong duality holds since π_{θ_k} is a feasible solution to (12)-(14) for which the inequalities hold strictly.

We can therefore solve (12)-(14) by solving the related Lagrangian problem. For a fixed λ consider:

    maximize_{π∈Π}   Σ_s d^{π_{θ_k}}(s) { E_{a∼π(·|s)}[A^{π_{θ_k}}(s, a)] − λ D_KL(π ‖ π_{θ_k})[s] }        (28)
    subject to       D_KL(π ‖ π_{θ_k})[s] ≤ ε   for all s                                                   (29)

The above problem decomposes into separate problems, one for each state s:

    maximize_{π(·|s)}   E_{a∼π(·|s)}[A^{π_{θ_k}}(s, a)] − λ D_KL(π ‖ π_{θ_k})[s]        (30)
    subject to          D_KL(π ‖ π_{θ_k})[s] ≤ ε                                         (31)

Further consider the unconstrained problem (30) without the constraint (31):

    maximize_{π(·|s)}   Σ_{a=1}^K π(a|s) [ A^{π_{θ_k}}(s, a) − λ log(π(a|s)/π_{θ_k}(a|s)) ]        (32)
    subject to          Σ_{a=1}^K π(a|s) = 1                                                        (33)
                        π(a|s) ≥ 0,   a = 1, ..., K                                                  (34)

A simple Lagrange-multiplier argument shows that the optimal solution to (32)-(34) is given by:

    π_λ(a|s) = (π_{θ_k}(a|s) / Z_λ(s)) e^{A^{π_{θ_k}}(s,a)/λ}

where Z_λ(s) is defined so that π_λ(·|s) is a valid distribution. Now returning to the decomposed constrained problem (30)-(31), there are two cases to consider. The first case is when D_KL(π_λ ‖ π_{θ_k})[s] ≤ ε. In this case, the optimal solution to (30)-(31) is π_λ(a|s). The second case is when D_KL(π_λ ‖ π_{θ_k})[s] > ε. In this case the optimum is π_λ(a|s) with λ replaced by λ_s, where λ_s is the solution to D_KL(π_{λ_s} ‖ π_{θ_k})[s] = ε. Thus, an optimal solution to (30)-(31) is given by:

    π̃_λ(a|s) = { (π_{θ_k}(a|s) / Z_λ(s)) e^{A^{π_{θ_k}}(s,a)/λ}          if s ∈ Γ_λ
                 (π_{θ_k}(a|s) / Z_{λ_s}(s)) e^{A^{π_{θ_k}}(s,a)/λ_s}      if s ∉ Γ_λ        (35)

where Γ_λ = {s : D_KL(π_λ ‖ π_{θ_k})[s] ≤ ε}.

To find the Lagrange multiplier λ, we can then do a line search to find the λ that satisfies:

    Σ_s d^{π_{θ_k}}(s) D_KL(π̃_λ ‖ π_{θ_k})[s] = δ        (36)    □
A.2 BACKWARD KL CONSTRAINT
The problem (19)-(20) decomposes into separate problems, one for each state s ∈ S:

    maximize_{π(·|s)}   E_{a∼π_{θ_k}(·|s)}[ (π(a|s)/π_{θ_k}(a|s)) A^{π_{θ_k}}(s, a) ]        (37)
    subject to          E_{a∼π_{θ_k}(·|s)}[ log(π_{θ_k}(a|s)/π(a|s)) ] ≤ ε                   (38)

After some algebra, we see that the above optimization problem is equivalent to:

    maximize_{π(·|s)}   Σ_{a=1}^K A^{π_{θ_k}}(s, a) π(a|s)        (39)
    subject to          −Σ_{a=1}^K π_{θ_k}(a|s) log π(a|s) ≤ ε′   (40)
                        Σ_{a=1}^K π(a|s) = 1                      (41)
                        π(a|s) ≥ 0,   a = 1, ..., K               (42)

where ε′ = ε + entropy(π_{θ_k}(·|s)). (39)-(42) is a convex optimization problem for which Slater's condition holds. Strong duality thus holds for the problem (39)-(42). Applying standard Lagrange multiplier arguments, it is easily seen that the solution to (39)-(42) is

    π*(a|s) = π_{θ_k}(a|s) · λ(s) / (λ′(s) − A^{π_{θ_k}}(s, a))

where λ(s) and λ′(s) are constants chosen such that the disaggregated KL constraint is binding and the sum of the probabilities equals 1. It is easily seen that λ(s) > 0 and λ′(s) > max_a A^{π_{θ_k}}(s, a).   □

A.3 L∞ CONSTRAINT
The problem (24)-(26) is equivalent to:

    maximize_{π(a_1|s_1),...,π(a_m|s_m)}   Σ_{i=1}^m A^{π_{θ_k}}(s_i, a_i) π(a_i|s_i) / π_{θ_k}(a_i|s_i)        (43)
    subject to   −ε ≤ (π(a_i|s_i) − π_{θ_k}(a_i|s_i)) / π_{θ_k}(a_i|s_i) ≤ ε,   i = 1, ..., m                    (44)
                 Σ_{i=1}^m ( (π(a_i|s_i) − π_{θ_k}(a_i|s_i)) / π_{θ_k}(a_i|s_i) )² ≤ δ                           (45)

This problem is clearly convex. π_{θ_k}(a_i|s_i), i = 1, ..., m is a feasible solution for which the inequality constraints hold strictly. Strong duality thus holds according to Slater's constraint qualification. To solve (43)-(45), we can therefore solve the related Lagrangian problem for fixed λ:

    maximize_{π(a_1|s_1),...,π(a_m|s_m)}   Σ_{i=1}^m [ A^{π_{θ_k}}(s_i, a_i) π(a_i|s_i) / π_{θ_k}(a_i|s_i) − λ ( (π(a_i|s_i) − π_{θ_k}(a_i|s_i)) / π_{θ_k}(a_i|s_i) )² ]        (46)
    subject to   −ε ≤ (π(a_i|s_i) − π_{θ_k}(a_i|s_i)) / π_{θ_k}(a_i|s_i) ≤ ε,   i = 1, ..., m        (47)

which is separable and decomposes into m separate problems, one for each s_i:

    maximize_{π(a_i|s_i)}   A^{π_{θ_k}}(s_i, a_i) π(a_i|s_i) / π_{θ_k}(a_i|s_i) − λ ( (π(a_i|s_i) − π_{θ_k}(a_i|s_i)) / π_{θ_k}(a_i|s_i) )²        (48)
    subject to              −ε ≤ (π(a_i|s_i) − π_{θ_k}(a_i|s_i)) / π_{θ_k}(a_i|s_i) ≤ ε        (49)

The solution to the unconstrained problem (48) without the constraint (49) is:

    π*(a_i|s_i) = π_{θ_k}(a_i|s_i) ( 1 + A^{π_{θ_k}}(s_i, a_i) / (2λ) )

Now consider the constrained problem (48)-(49). If A^{π_{θ_k}}(s_i, a_i) ≥ 0 and π*(a_i|s_i) > π_{θ_k}(a_i|s_i)(1 + ε), the optimal solution is π_{θ_k}(a_i|s_i)(1 + ε). Similarly, if A^{π_{θ_k}}(s_i, a_i) < 0 and π*(a_i|s_i) < π_{θ_k}(a_i|s_i)(1 − ε), the optimal solution is π_{θ_k}(a_i|s_i)(1 − ε). Rearranging the terms gives Theorem 3. To obtain λ, we can perform a line search over λ so that the constraint (45) is binding.   □
B DERIVATIONS OF THE GRADIENT OF THE LOSS FUNCTION FOR SPU
Let CE stand for cross entropy.

B.1 FORWARD-KL

CE(π_θ ‖ π̃_{λ̃_s})[s]
  = −Σ_a π_θ(a|s) log π̃_{λ̃_s}(a|s)                                              (expanding the definition of cross entropy)
  = −Σ_a π_θ(a|s) log[ (π_{θ_k}(a|s)/Z_{λ̃_s}(s)) e^{A^{π_{θ_k}}(s,a)/λ̃_s} ]       (expanding the definition of π̃_{λ̃_s})
  = −Σ_a π_θ(a|s) log(π_{θ_k}(a|s)/Z_{λ̃_s}(s)) − Σ_a π_θ(a|s) A^{π_{θ_k}}(s,a)/λ̃_s   (log of a product is the sum of logs)
  = −Σ_a π_θ(a|s) log π_{θ_k}(a|s) + Σ_a π_θ(a|s) log Z_{λ̃_s}(s) − (1/λ̃_s) Σ_a π_{θ_k}(a|s) (π_θ(a|s)/π_{θ_k}(a|s)) A^{π_{θ_k}}(s,a)
  = CE(π_θ ‖ π_{θ_k})[s] + log Z_{λ̃_s}(s) − (1/λ̃_s) E_{a∼π_{θ_k}(·|s)}[ (π_θ(a|s)/π_{θ_k}(a|s)) A^{π_{θ_k}}(s,a) ]

⇒ ∇_θ CE(π_θ ‖ π̃_{λ̃_s})[s] = ∇_θ CE(π_θ ‖ π_{θ_k})[s] − (1/λ̃_s) E_{a∼π_{θ_k}(·|s)}[ (∇_θ π_θ(a|s)/π_{θ_k}(a|s)) A^{π_{θ_k}}(s,a) ]   (taking the gradient on both sides)

⇒ ∇_θ D_KL(π_θ ‖ π̃_{λ̃_s})[s] = ∇_θ D_KL(π_θ ‖ π_{θ_k})[s] − (1/λ̃_s) E_{a∼π_{θ_k}(·|s)}[ (∇_θ π_θ(a|s)/π_{θ_k}(a|s)) A^{π_{θ_k}}(s,a) ]   (adding the gradient of the entropy of π_θ on both sides and collapsing the sum of the gradients of the cross entropy and the entropy into the gradient of the KL)

B.2 REVERSE-KL

CE(π_θ ‖ π*)[s]
  = −Σ_a π_θ(a|s) log π*(a|s)                                                       (expanding the definition of cross entropy)
  = −Σ_a π_θ(a|s) log[ π_{θ_k}(a|s) λ(s) / (λ′(s) − A^{π_{θ_k}}(s,a)) ]              (expanding the definition of π*)
  = −Σ_a π_θ(a|s) log π_{θ_k}(a|s) − Σ_a π_θ(a|s) log λ(s) + Σ_a π_θ(a|s) log(λ′(s) − A^{π_{θ_k}}(s,a))
  = CE(π_θ ‖ π_{θ_k})[s] − log λ(s) + E_{a∼π_{θ_k}}[ (π_θ(a|s)/π_{θ_k}(a|s)) log(λ′(s) − A^{π_{θ_k}}(s,a)) ]
  = CE(π_θ ‖ π_{θ_k})[s] − log λ(s) − E_{a∼π_{θ_k}}[ (π_θ(a|s)/π_{θ_k}(a|s)) log( 1/(λ′(s) − A^{π_{θ_k}}(s,a)) ) ]

⇒ ∇_θ CE(π_θ ‖ π*)[s] = ∇_θ CE(π_θ ‖ π_{θ_k})[s] − E_{a∼π_{θ_k}}[ (∇_θ π_θ(a|s)/π_{θ_k}(a|s)) log( 1/(λ′(s) − A^{π_{θ_k}}(s,a)) ) ]   (taking the gradient on both sides)

⇒ ∇_θ D_KL(π_θ ‖ π*)[s] = ∇_θ D_KL(π_θ ‖ π_{θ_k})[s] − E_{a∼π_{θ_k}}[ (∇_θ π_θ(a|s)/π_{θ_k}(a|s)) log( 1/(λ′(s) − A^{π_{θ_k}}(s,a)) ) ]   (adding the gradient of the entropy of π_θ on both sides and collapsing the sum of the gradients of the cross entropy and the entropy into the gradient of the KL)
C EXTENSION TO CONTINUOUS STATE AND ACTION SPACES
The methodology developed in the body of this paper also applies to continuous state and action spaces. In this section, we outline the modifications that are necessary for the continuous case. We first modify the definition of d^π(s) by replacing P^π(s_t = s) with (d/ds) P^π(s_t ≤ s), so that d^π(s) becomes a density function over the state space. With this modification, the definition of D̄_KL(π ‖ π_k) and the approximation (8) are unchanged. The SPU framework described in Section 4 is also unchanged.

Consider now the non-parameterized optimization problem with aggregated and disaggregated constraints (12)-(14), but with continuous state and action spaces:

    maximize_{π∈Π}   ∫ d^{π_{θ_k}}(s) E_{a∼π(·|s)}[A^{π_{θ_k}}(s, a)] ds        (50)
    subject to       ∫ d^{π_{θ_k}}(s) D_KL(π ‖ π_{θ_k})[s] ds ≤ δ               (51)
                     D_KL(π ‖ π_{θ_k})[s] ≤ ε   for all s                        (52)

Theorem 1 holds, although its proof needs to be slightly modified as follows. It is straightforward to show that (50)-(52) remains a convex optimization problem. We can therefore solve (50)-(52) by solving the Lagrangian (28)-(29) with the sum replaced by an integral. This problem again decomposes into separate problems for each s ∈ S, giving exactly the same equations (30)-(31). The proof then proceeds as in the remainder of the proof of Theorem 1. Theorems 2 and 3 are also unchanged for continuous action spaces. Their proofs require slight modifications, as in the proof of Theorem 1.
D IMPLEMENTATION DETAILS AND HYPERPARAMETERS

D.1 MUJOCO
As in (Schulman et al., 2017), for the Mujoco environments the policy is parameterized by a fully-connected feed-forward neural network with two hidden layers, each with 64 units and tanh nonlinearities. The policy outputs the mean of a Gaussian distribution with state-independent variable standard deviations, following (Schulman et al., 2015a; Duan et al., 2016). The action dimensions are assumed to be independent. The probability of an action is given by the multivariate Gaussian probability density function. The baseline used in the advantage value calculation is parameterized by a similarly sized neural network, trained to minimize the MSE between the sampled states' TD(λ) returns and their predicted values. For both the policy and baseline networks, SPU and TRPO use the same architecture. To calculate the advantage values, we use Generalized Advantage Estimation (Schulman et al., 2015b). States are normalized by subtracting the running mean and dividing by the running standard deviation before being fed to any neural network. The advantage values are normalized by subtracting the batch mean and dividing by the batch standard deviation before being used for the policy update. The TRPO result is obtained by running the TRPO implementation provided by OpenAI (Dhariwal et al., 2017), commit 3cc7df060800a45890908045b79821a13c4babdb. At every iteration, SPU collects 2048 samples before updating the policy and the baseline network. For both networks, gradient descent is performed using Adam (Kingma & Ba, 2014) with step size … and minibatch size …; the step size is linearly annealed to 0 over the course of training. γ and λ for GAE (Schulman et al., 2015b) are set to … and … respectively. For SPU, δ, ε, λ and the maximum number of epochs per iteration are set to …, …, … and … respectively. Training is performed for 1 million timesteps for both SPU and PPO. In the sensitivity analysis, the ranges of values for the hyper-parameters δ, ε, λ and the maximum number of epochs are […, …], […, …], […, …] and [5, …] respectively.
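The following is a minimal, hedged sketch of the two preprocessing computations mentioned above: Generalized Advantage Estimation and batch normalization of the advantage values. The default parameter values and episode-termination handling are illustrative simplifications, not the paper's settings.

```python
import numpy as np

def gae(rewards: np.ndarray, values: np.ndarray, last_value: float,
        gamma: float = 0.99, lam: float = 0.95) -> np.ndarray:
    """Generalized Advantage Estimation for one trajectory.
    rewards: (T,) r_t; values: (T,) V(s_t) from the baseline network;
    last_value: V(s_T). gamma/lam defaults are illustrative only."""
    T = len(rewards)
    adv = np.zeros(T)
    next_value, running = last_value, 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        running = delta + gamma * lam * running
        adv[t] = running
        next_value = values[t]
    return adv

def normalize(adv: np.ndarray) -> np.ndarray:
    """Batch normalization of the advantages, as described above."""
    return (adv - adv.mean()) / (adv.std() + 1e-8)
```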
D.2 ATARI

Unless otherwise mentioned, the hyper-parameter values are the same as in subsection D.1. The policy is parameterized by a convolutional neural network with the same architecture as described in Mnih et al. (2015). The output of the network is passed through a relu, linear and softmax layer, in that order, to give the action distribution. The output of the network is also passed through a different linear layer to give the baseline value. States are normalized by dividing by 255 before being fed into any network. The PPO result is obtained by running the PPO implementation provided by OpenAI (Dhariwal et al., 2017), commit 3cc7df060800a45890908045b79821a13c4babdb. Eight different processes run in parallel to collect timesteps. At every iteration, each process collects 256 samples before updating the policy and the baseline network. Each process calculates its own update to the network's parameters, and the updates are averaged over all processes before being used to update the network's parameters. Gradient descent is performed using Adam (Kingma & Ba, 2014). In each process, random number generators are initialized with a different seed according to the formula process_seed = experiment_seed + 10000 * process_rank. Training is performed for 10 million timesteps for both SPU and PPO. For SPU, δ, ε, λ and the maximum number of epochs per iteration are set to …, δ/…, … and … respectively.
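The per-process seeding rule above can be written as a small helper; this is an illustrative restatement, not the released code.

```python
def process_seed(experiment_seed: int, process_rank: int) -> int:
    """Seed for the random number generators in one worker process."""
    return experiment_seed + 10000 * process_rank
```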
E ALGORITHMIC DESCRIPTION FOR SPU
Algorithm 1: Algorithmic description of forward-KL non-parameterized SPU

Require: A neural net π_θ that parameterizes the policy.
Require: A neural net V_φ that approximates V^{π_θ}.
Require: General hyperparameters: γ, β (advantage estimation using GAE), α (learning rate), N (number of trajectories per iteration), T (size of each trajectory), M (size of training minibatch).
Require: Algorithm-specific hyperparameters: δ (aggregated KL constraint), ε (disaggregated constraint), λ, ζ (maximum number of epochs).

for k = 1, 2, ... do
    Under policy π_{θ_k}, sample N trajectories, each of size T: (s_{it}, a_{it}, r_{it}, s_{i(t+1)}), i = 1, ..., N, t = 1, ..., T
    Using any advantage value estimation scheme, estimate A_{it}, i = 1, ..., N, t = 1, ..., T
    θ ← θ_k, φ ← φ_k
    for ζ epochs do
        Sample M samples from the N trajectories, giving {s_1, a_1, A_1, ..., s_M, a_M, A_M}
        L(φ) = (1/M) Σ_m (V_targ(s_m) − V_φ(s_m))²
        φ ← φ − α ∇_φ L(φ)
        ∇_θ L(θ) = (1/M) Σ_m [ ∇_θ D_KL(π_θ ‖ π_{θ_k})[s_m] − (1/λ) (∇_θ π_θ(a_m|s_m) / π_{θ_k}(a_m|s_m)) A_m ] · 1{ D_KL(π_θ ‖ π_{θ_k})[s_m] ≤ ε }
        θ ← θ − α ∇_θ L(θ)
        if (1/M) Σ_m D_KL(π_θ ‖ π_{θ_k})[s_m] > δ then
            break out of the inner for loop
    θ_{k+1} ← θ
    φ_{k+1} ← φ
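Below is a hedged, self-contained Python sketch of the inner loop of Algorithm 1 on synthetic data: a critic regression step, a policy step using the masked forward-KL gradient of the main text (with adv_coef playing the role of 1/λ), and dynamic stopping once the sampled aggregated KL exceeds δ. Network sizes, the synthetic data, and all hyper-parameter values are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
m, state_dim, num_actions = 256, 8, 4
delta, eps, adv_coef, zeta, lr = 0.05, 0.01, 1.0, 30, 1e-2

policy = nn.Linear(state_dim, num_actions)
old_policy = nn.Linear(state_dim, num_actions)
old_policy.load_state_dict(policy.state_dict())
value_fn = nn.Linear(state_dim, 1)
pi_opt = torch.optim.Adam(policy.parameters(), lr=lr)
v_opt = torch.optim.Adam(value_fn.parameters(), lr=lr)

# Synthetic stand-ins for one iteration's minibatch data (s_m, a_m, A_m, V_targ(s_m)).
states = torch.randn(m, state_dim)
actions = torch.randint(0, num_actions, (m,))
advantages = torch.randn(m)
v_targets = torch.randn(m)

for epoch in range(zeta):
    # Critic: L(phi) = (1/M) sum_m (V_targ(s_m) - V_phi(s_m))^2
    v_loss = ((v_targets - value_fn(states).squeeze(-1)) ** 2).mean()
    v_opt.zero_grad(); v_loss.backward(); v_opt.step()

    # Policy: forward-KL loss with per-state acceptance
    logp = torch.log_softmax(policy(states), dim=-1)
    logp_old = torch.log_softmax(old_policy(states), dim=-1).detach()
    kl = (logp.exp() * (logp - logp_old)).sum(-1)
    ratio = (logp.gather(-1, actions[:, None]) - logp_old.gather(-1, actions[:, None])).exp().squeeze(-1)
    accept = (kl.detach() <= eps).float()
    pi_loss = (accept * (kl - adv_coef * ratio * advantages)).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()

    # Dynamic stopping: halt once the sampled aggregated KL exceeds delta
    if float(kl.mean()) > delta:
        break
```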
F EXPERIMENTAL RESULTS

F.1 RESULTS ON MUJOCO FOR 3 MILLION TIMESTEPS
TRPO and SPU were trained for 1 million timesteps to obtain the results in Section 6. To ensure that SPU is not only better than TRPO in terms of performance gain early during training, we further train both policies for 3 million timesteps. Again here, SPU outperforms TRPO. Figure 3 illustrates the performance on each environment.

Figure 3: Performance of SPU versus TRPO on 10 Mujoco environments over 3 million timesteps. The x-axis indicates timesteps. The y-axis indicates the average episode reward of the last 100 episodes.
F.2 SENSITIVITY ANALYSIS CDF FOR MUJOCO
When values for the SPU hyper-parameters are randomly sampled, as explained in subsection 6.3, the percentage improvement of SPU over TRPO becomes a random variable. Figure 4 illustrates the CDF of this random variable.

Figure 4: Sensitivity analysis for SPU

F.3 A