Supervised Policy Update for Deep Reinforcement Learning
Quan Vuong
University of California, San Diego [email protected]
Yiming Zhang
New York University [email protected]
Keith Ross
New York University / New York University Shanghai [email protected]

ABSTRACT
We propose a new sample-efficient methodology, called Supervised Policy Update (SPU), for deep reinforcement learning. Starting with data generated by the current policy, SPU formulates and solves a constrained optimization problem in the non-parameterized proximal policy space. Using supervised regression, it then converts the optimal non-parameterized policy to a parameterized policy, from which it draws new samples. The methodology is general in that it applies to both discrete and continuous action spaces, and can handle a wide variety of proximity constraints for the non-parameterized optimization problem. We show how the Natural Policy Gradient and Trust Region Policy Optimization (NPG/TRPO) problems, and the Proximal Policy Optimization (PPO) problem, can be addressed by this methodology. The SPU implementation is much simpler than TRPO. In terms of sample efficiency, our extensive experiments show SPU outperforms TRPO in Mujoco simulated robotic tasks and outperforms PPO in Atari video game tasks.
1 INTRODUCTION
The policy gradient problem in deep reinforcement learning (DRL) can be defined as seeking a parameterized policy with high expected reward. An issue with policy gradient methods is poor sample efficiency (Kakade, 2003; Schulman et al., 2015a; Wang et al., 2016b; Wu et al., 2017; Schulman et al., 2017). In algorithms such as REINFORCE (Williams, 1992), new samples are needed for every gradient step. When generating samples is expensive (such as in robotic environments), sample efficiency is of central concern. The sample efficiency of an algorithm is defined to be the number of calls to the environment required to attain a specified performance level (Kakade, 2003). Thus, given the current policy and a fixed number of trajectories (samples) generated, the goal of the sample efficiency problem is to construct a new policy with the highest performance improvement possible. To do so, it is desirable to limit the search to policies that are close to the original policy π_{θ_k} (Kakade, 2002; Schulman et al., 2015a; Wu et al., 2017; Achiam et al., 2017; Schulman et al., 2017; Tangkaratt et al., 2018). Intuitively, if the candidate new policy π_θ is far from the original policy π_{θ_k}, it may not perform better than the original policy because too much emphasis is being placed on the relatively small batch of new data generated by π_{θ_k}, and not enough emphasis is being placed on the relatively large amount of data and effort previously used to construct π_{θ_k}.

This guideline of limiting the search to nearby policies seems reasonable in principle, but it requires a distance η(π_θ, π_{θ_k}) between the current policy π_{θ_k} and the candidate new policy π_θ, and then an attempt to solve the constrained optimization problem:

    maximize_θ   Ĵ(π_θ | π_{θ_k}, new data)        (1)
    subject to   η(π_θ, π_{θ_k}) ≤ δ                (2)

where Ĵ(π_θ | π_{θ_k}, new data) is an estimate of J(π_θ), the performance of policy π_θ, based on the previous policy π_{θ_k} and the batch of fresh data generated by π_{θ_k}. The objective (1) attempts to maximize the performance of the updated policy, and the constraint (2) ensures that the updated policy is not too far from the policy π_{θ_k} that was used to generate the data. Several recent papers (Kakade, 2002; Schulman et al., 2015a; 2017; Tangkaratt et al., 2018) belong to the framework (1)-(2).

We propose a new methodology, called Supervised Policy Update (SPU), for this sample efficiency problem. The methodology is general in that it applies to both discrete and continuous action spaces, and can address a wide variety of constraint types for (2). Starting with data generated by the current policy, SPU optimizes over a proximal policy space to find an optimal non-parameterized policy. It then solves a supervised regression problem to convert the non-parameterized policy to a parameterized policy, from which it draws new samples. We develop a general methodology for finding an optimal policy in the non-parameterized policy space, and then illustrate the methodology for three different definitions of proximity.

Our work also strikes the right balance between performance and simplicity. The implementation is only slightly more involved than PPO (Schulman et al., 2017). Simplicity in RL algorithms has its own merits. This is especially useful when RL algorithms are used to solve problems outside of traditional RL testbeds, which is becoming a trend (Zoph & Le, 2016; Tan et al., 2018).
We also show how the Natural Policy Gradient and Trust Region Policy Optimization (NPG/TRPO) problems and the Proximal Policy Optimization (PPO) problem can be addressed by this methodology. While SPU is substantially simpler than NPG/TRPO in terms of mathematics and implementation, our extensive experiments show that SPU is more sample efficient than TRPO in Mujoco simulated robotic tasks and than PPO in Atari video game tasks.

Off-policy RL algorithms generally achieve better sample efficiency than on-policy algorithms (Haarnoja et al., 2018). However, the performance of an on-policy algorithm can usually be substantially improved by incorporating off-policy training (Mnih et al., 2015; Wang et al., 2016a). Our paper focuses on igniting interest in separating the search for the optimal policy into a two-step process: finding the optimal non-parameterized policy, and then parameterizing this optimal policy. We also wanted to deeply understand the on-policy case before adding off-policy training. We thus compare with algorithms operating under the same algorithmic constraints, one of which is being on-policy. We leave the extension to off-policy training to future work. We do not claim state-of-the-art results.

2 PRELIMINARIES
We consider a Markov Decision Process (MDP) with state space S, action space A, and reward function r(s, a), s ∈ S, a ∈ A. Let π = {π(a|s) : s ∈ S, a ∈ A} denote a policy, let Π be the set of all policies, and let the expected discounted reward be:

    J(π) ≜ E_{τ∼π} [ Σ_{t=0}^∞ γ^t r(s_t, a_t) ]        (3)

where γ ∈ (0, 1) is a discount factor and τ = (s_0, a_0, s_1, ...) is a sample trajectory. Let A^π(s, a) be the advantage function for policy π (Levine, 2017). Deep reinforcement learning considers a set of parameterized policies Π_DL = {π_θ | θ ∈ Θ} ⊂ Π, where each policy is parameterized by a neural network called the policy network. In this paper, we will consider optimizing over the parameterized policies in Π_DL as well as over the non-parameterized policies in Π. For concreteness, we assume that the state and action spaces are finite. However, our methodology also applies to continuous state and action spaces, as shown in the Appendix.

One popular approach to maximizing J(π_θ) over Π_DL is to apply stochastic gradient ascent. The gradient of J(π_θ) evaluated at a specific θ = θ_k can be shown to be (Williams, 1992):

    ∇_θ J(π_{θ_k}) = E_{τ∼π_{θ_k}} [ Σ_{t=0}^∞ γ^t ∇_θ log π_{θ_k}(a_t|s_t) A^{π_{θ_k}}(s_t, a_t) ]        (4)

We can approximate (4) by sampling N trajectories of length T from π_{θ_k}:

    ∇_θ J(π_{θ_k}) ≈ (1/N) Σ_{i=1}^N Σ_{t=0}^{T−1} ∇_θ log π_{θ_k}(a_{it}|s_{it}) A^{π_{θ_k}}(s_{it}, a_{it}) ≜ g_k        (5)

Additionally, define d^π(s) ≜ (1 − γ) Σ_{t=0}^∞ γ^t P^π(s_t = s) as the future state probability distribution for policy π, and denote π(·|s) for the probability distribution over the action space A when in state s and using policy π. Further denote D_KL(π ‖ π_{θ_k})[s] for the KL divergence from π(·|s) to π_{θ_k}(·|s), and denote the following as the "aggregated KL divergence":

    D̄_KL(π ‖ π_{θ_k}) = E_{s∼d^{π_{θ_k}}} [ D_KL(π ‖ π_{θ_k})[s] ]        (6)
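As a concrete illustration of (6), the following is a minimal sketch (our illustration, not the authors' released code) of a Monte-Carlo estimate of the aggregated KL divergence for a discrete action space. The array shapes and the helper name are assumptions made for the example.

```python
import numpy as np

def aggregated_kl(pi: np.ndarray, pi_k: np.ndarray, tiny: float = 1e-12) -> float:
    """Monte-Carlo estimate of E_{s ~ d^{pi_theta_k}}[ D_KL(pi || pi_theta_k)[s] ].

    pi, pi_k: arrays of shape (N, K) holding the action probabilities of the two
    policies at N states visited under pi_theta_k, so that the empirical state
    distribution approximates d^{pi_theta_k}."""
    per_state_kl = np.sum(pi * (np.log(pi + tiny) - np.log(pi_k + tiny)), axis=1)
    return float(per_state_kl.mean())

# Example with two states and three actions.
pi_k = np.array([[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]])
pi_new = np.array([[0.4, 0.4, 0.2], [0.2, 0.5, 0.3]])
print(aggregated_kl(pi_new, pi_k))
```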
2.1 SURROGATE OBJECTIVES FOR THE SAMPLE EFFICIENCY PROBLEM

For the sample efficiency problem, the objective J(π_θ) is typically approximated using samples generated from π_{θ_k} (Schulman et al., 2015a; Achiam et al., 2017; Schulman et al., 2017). Two different approaches are typically used to approximate J(π_θ) − J(π_{θ_k}). We can make a first-order approximation of J(π_θ) around θ_k (Kakade, 2002; Peters & Schaal, 2008a;b; Schulman et al., 2015a):

    J(π_θ) − J(π_{θ_k}) ≈ (θ − θ_k)^T ∇_θ J(π_{θ_k}) ≈ (θ − θ_k)^T g_k        (7)

where g_k is the sample estimate (5). The second approach is to approximate the state distribution d^{π_θ}(s) with d^{π_{θ_k}}(s) (Achiam et al., 2017; Schulman et al., 2017; Achiam, 2017):

    J(π) − J(π_{θ_k}) ≈ L_{π_{θ_k}}(π) ≜ (1/(1−γ)) E_{s∼d^{π_{θ_k}}} E_{a∼π_{θ_k}(·|s)} [ (π(a|s)/π_{θ_k}(a|s)) A^{π_{θ_k}}(s, a) ]        (8)

There is a well-known bound for the approximation (8) (Kakade & Langford, 2002; Achiam et al., 2017). Furthermore, the approximation L_{π_{θ_k}}(π_θ) matches J(π_θ) − J(π_{θ_k}) to first order with respect to the parameter θ (Achiam et al., 2017).
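For completeness, here is a hedged sketch of the sample estimate of (8): the importance-weighted advantage averaged over (state, action) pairs collected under π_{θ_k}. The constant 1/(1−γ) is commonly dropped in implementations; names and shapes are illustrative.

```python
import numpy as np

def surrogate_objective(logp_new: np.ndarray, logp_old: np.ndarray,
                        advantages: np.ndarray) -> float:
    """Sample estimate of E[ (pi(a|s)/pi_theta_k(a|s)) * A^{pi_theta_k}(s,a) ]
    over (s, a) pairs collected under pi_theta_k."""
    ratio = np.exp(logp_new - logp_old)   # pi(a_i|s_i) / pi_theta_k(a_i|s_i)
    return float(np.mean(ratio * advantages))

# Example with three sampled (state, action) pairs.
print(surrogate_objective(np.log([0.4, 0.2, 0.7]), np.log([0.3, 0.3, 0.6]),
                          np.array([1.0, -0.5, 0.2])))
```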
3 RELATED WORK

Natural gradient (Amari, 1998) was first introduced to policy gradient by Kakade (2002) and then further developed in (Peters & Schaal, 2008a;b; Achiam, 2017; Schulman et al., 2015a); we refer to this line of work collectively as NPG/TRPO. Algorithmically, NPG/TRPO finds the gradient update by solving the sample efficiency problem (1)-(2) with η(π_θ, π_{θ_k}) = D̄_KL(π_θ ‖ π_{θ_k}), i.e., using the aggregated KL divergence for the policy proximity constraint (2). NPG/TRPO addresses this problem in the parameter space θ ∈ Θ. First, it approximates J(π_θ) with the first-order approximation (7) and D̄_KL(π_θ ‖ π_{θ_k}) with a second-order approximation. Second, it uses samples from π_{θ_k} to form estimates of these two approximations. Third, using these estimates (which are functions of θ), it solves for the optimal θ*. The optimal θ* is a function of g_k and of h_k, the sample average of the Hessian evaluated at θ_k. TRPO also limits the magnitude of the update to ensure D̄_KL(π_θ ‖ π_{θ_k}) ≤ δ (i.e., ensuring the sampled estimate of the aggregated KL constraint is met without the second-order approximation).

SPU takes a very different approach by first (i) posing and solving the optimization problem in the non-parameterized policy space, and then (ii) solving a supervised regression problem to find a parameterized policy that is near the optimal non-parameterized policy. A recent paper, Guided Actor Critic (GAC), independently proposed a similar decomposition (Tangkaratt et al., 2018). However, GAC is much more restricted in that it considers only one specific constraint criterion (aggregated reverse-KL divergence) and applies only to continuous action spaces. Furthermore, GAC incurs significantly higher computational complexity: at every update, it minimizes the dual function to obtain the dual variables using SLSQP. MPO also independently proposes a similar decomposition (Abdolmaleki et al., 2018). MPO uses much more complex machinery, namely Expectation Maximization, to address the DRL problem. However, MPO has only demonstrated preliminary results on problems with discrete actions, whereas our approach naturally applies to problems with either discrete or continuous actions. In both GAC and MPO, working in the non-parameterized space is a by-product of applying the main ideas in those papers to DRL. Our paper demonstrates that the decomposition alone is a general and useful technique for solving constrained policy optimization.

Clipped-PPO (Schulman et al., 2017) takes a very different approach from TRPO. At each iteration, PPO makes many gradient steps while only using the data from π_{θ_k}. Without the clipping, PPO maximizes the approximation (8). The clipping is analogous to the constraint (2) in that it has the goal of keeping π_θ close to π_{θ_k}. Indeed, the clipping keeps π_θ(a_t|s_t) from becoming either much larger than (1 + ε)π_{θ_k}(a_t|s_t) or much smaller than (1 − ε)π_{θ_k}(a_t|s_t). Thus, although the clipped PPO objective does not squarely fit into the optimization framework (1)-(2), it is quite similar in spirit. We note that the PPO paper considers adding the KL penalty to the objective function, whose gradient is similar to ours.
However, this form of gradient was demonstrated to be inferior to Clipped-PPO. To the best of our knowledge, ours is the first work to demonstrate that this form of gradient outperforms Clipped-PPO.

Actor-Critic using Kronecker-Factored Trust Region (ACKTR) (Wu et al., 2017) proposed using Kronecker-factored approximate curvature (K-FAC) to update both the policy gradient and critic terms, giving a more computationally efficient method of calculating the natural gradients. ACER (Wang et al., 2016a) exploits past episodes, linearizes the KL divergence constraint, and maintains an average policy network to enforce the KL divergence constraint. In future work, it would be of interest to extend the SPU methodology to handle past episodes. In contrast to bounding the KL divergence on the action distribution as we have done in this work, Relative Entropy Policy Search bounds the joint distribution of state and action and was only demonstrated to work on small problems (Peters et al., 2010).

4 THE SPU FRAMEWORK
The SPU methodology has two steps. In the first step, for a given constraint criterion η(π, π_{θ_k}) ≤ δ, we find the optimal solution to the non-parameterized problem:

    maximize_{π∈Π}   L_{π_{θ_k}}(π)                 (9)
    subject to       η(π, π_{θ_k}) ≤ δ              (10)

Note that π is not restricted to the set of parameterized policies Π_DL. As commonly done, we use the approximation (8) as the objective. However, unlike PPO/TRPO, we are not approximating the constraint (2). We will show below that the optimal solution π* for the non-parameterized problem (9)-(10) can be determined nearly in closed form for many natural constraint criteria η(π, π_{θ_k}) ≤ δ. In the second step, we attempt to find a policy π_θ in the parameterized space Π_DL that is close to the target policy π*. Concretely, to advance from θ_k to θ_{k+1}, we perform the following steps:

(i) We first sample N trajectories using policy π_{θ_k}, giving sample data (s_i, a_i, A_i), i = 1, ..., m. Here A_i is an estimate of the advantage value A^{π_{θ_k}}(s_i, a_i). (For simplicity, we index the samples with i rather than with (i, t) corresponding to the t-th sample in the i-th trajectory.)

(ii) For each s_i, we define the target distribution π* to be the optimal solution to the constrained optimization problem (9)-(10) for a specific constraint η.

(iii) We then fit the policy network π_θ to the target distributions π*(·|s_i), i = 1, ..., m. Specifically, to find θ_{k+1}, we minimize the following supervised loss function:

    L(θ) = (1/m) Σ_{i=1}^m D_KL(π_θ ‖ π*)[s_i]        (11)

For this step, we initialize with the weights of π_{θ_k}. We minimize the loss function L(θ) with stochastic gradient descent methods. The resulting θ becomes our θ_{k+1}.
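The following is a self-contained toy illustration (our sketch, not the authors' code) of steps (i)-(iii): compute a non-parameterized target distribution at each sampled state and then fit the policy network to those targets by minimizing the KL loss (11). For simplicity the target is the unconstrained exponentiated-advantage policy of Section 5.1, the data is synthetic, and advantages are assumed known for every action; in practice only the sampled action's advantage is estimated, which is why the algorithm works with the gradient form derived in Section 5.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_states, num_actions, state_dim = 32, 4, 8

policy = nn.Linear(state_dim, num_actions)        # logits of pi_theta
old_policy = nn.Linear(state_dim, num_actions)    # frozen pi_theta_k
old_policy.load_state_dict(policy.state_dict())

# Step (i): pretend these came from trajectories sampled under pi_theta_k.
states = torch.randn(num_states, state_dim)
advantages = torch.randn(num_states, num_actions)  # toy: A known for every action

# Step (ii): per-state targets pi*(.|s_i) proportional to pi_theta_k(.|s_i) * exp(A/lambda).
lam = 1.0
with torch.no_grad():
    old_probs = torch.softmax(old_policy(states), dim=-1)
    targets = old_probs * torch.exp(advantages / lam)
    targets = targets / targets.sum(dim=-1, keepdim=True)

# Step (iii): supervised regression, minimizing (1/m) sum_i D_KL(pi_theta || pi*)[s_i].
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
for epoch in range(200):
    logp = torch.log_softmax(policy(states), dim=-1)
    p = logp.exp()
    kl = (p * (logp - torch.log(targets))).sum(dim=-1).mean()
    opt.zero_grad()
    kl.backward()
    opt.step()
print("final KL to targets:", float(kl))
```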
5 SPU APPLIED TO SPECIFIC PROXIMITY CRITERIA

To illustrate the SPU methodology, for three different but natural types of proximity constraints, we solve the corresponding non-parameterized optimization problem and derive the resulting gradient for the SPU supervised learning problem. We also demonstrate that different constraints lead to very different but intuitive forms of the gradient update.

5.1 FORWARD AGGREGATE AND DISAGGREGATE KL CONSTRAINTS
We first consider constraint criteria of the form:

    maximize_{π∈Π}   Σ_s d^{π_{θ_k}}(s) E_{a∼π(·|s)}[A^{π_{θ_k}}(s, a)]        (12)
    subject to       Σ_s d^{π_{θ_k}}(s) D_KL(π ‖ π_{θ_k})[s] ≤ δ               (13)
                     D_KL(π ‖ π_{θ_k})[s] ≤ ε   for all s                       (14)

Note that this problem is equivalent to maximizing L_{π_{θ_k}}(π) subject to the constraints (13) and (14). We refer to (13) as the "aggregated KL constraint" and to (14) as the "disaggregated KL constraint". These two constraints taken together restrict π from deviating too much from π_{θ_k}. We shall refer to (12)-(14) as the forward-KL non-parameterized optimization problem.

Note that this problem without the disaggregated constraints is analogous to the TRPO problem. The TRPO paper actually prefers enforcing the disaggregated constraint to enforcing the aggregated constraint. However, for mathematical convenience, they worked with the aggregated constraint: "While it is motivated by the theory, this problem is impractical to solve due to the large number of constraints. Instead, we can use a heuristic approximation which considers the average KL divergence" (Schulman et al., 2015a). The SPU framework allows us to solve the optimization problem with the disaggregated constraints exactly. Experimentally, we compared against TRPO in a controlled experimental setting, e.g. using the same advantage estimation scheme. Since we clearly outperform TRPO, we argue that SPU's two-step procedure has significant potential.

For each λ > 0, define:

    π_λ(a|s) = (π_{θ_k}(a|s) / Z_λ(s)) e^{A^{π_{θ_k}}(s,a)/λ}

where Z_λ(s) is the normalization term. Note that π_λ(a|s) is a function of λ. Further, for each s, let λ_s be such that D_KL(π_{λ_s} ‖ π_{θ_k})[s] = ε. Also let Γ_λ = {s : D_KL(π_λ ‖ π_{θ_k})[s] ≤ ε}.
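A minimal numpy sketch (our illustration, with hypothetical helper names) of the target policy π_λ(a|s) just defined, for a discrete action space, together with a simple bisection that finds the per-state λ_s at which D_KL(π_{λ_s} ‖ π_{θ_k})[s] = ε. The same kind of line search can be used for the aggregated multiplier λ in Theorem 1 below.

```python
import numpy as np

def pi_lambda(pi_k_s: np.ndarray, adv_s: np.ndarray, lam: float) -> np.ndarray:
    """Exponentiated-advantage target at one state s:
    pi_lambda(a|s) = pi_theta_k(a|s) * exp(A(s,a)/lambda) / Z_lambda(s)."""
    unnorm = pi_k_s * np.exp(adv_s / lam)
    return unnorm / unnorm.sum()

def forward_kl(p, q, tiny=1e-12):
    return float(np.sum(p * (np.log(p + tiny) - np.log(q + tiny))))

def lambda_for_state(pi_k_s, adv_s, epsilon, lo=1e-3, hi=1e3, iters=50):
    """Bisection for lambda_s with D_KL(pi_{lambda_s} || pi_theta_k)[s] ~= epsilon
    (the KL shrinks as lambda grows, so the search is monotone)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if forward_kl(pi_lambda(pi_k_s, adv_s, mid), pi_k_s) > epsilon:
            lo = mid            # KL too large: increase lambda
        else:
            hi = mid
    return hi

pi_k_s = np.array([0.5, 0.3, 0.2])
adv_s = np.array([1.0, -0.5, 0.2])
lam_s = lambda_for_state(pi_k_s, adv_s, epsilon=0.05)
print(lam_s, pi_lambda(pi_k_s, adv_s, lam_s))
```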
Theorem 1. The optimal solution to the problem (12)-(14) is given by:

    π̃_λ(a|s) = { π_λ(a|s)      if s ∈ Γ_λ
                 π_{λ_s}(a|s)   if s ∉ Γ_λ        (15)

where λ is chosen so that Σ_s d^{π_{θ_k}}(s) D_KL(π̃_λ ‖ π_{θ_k})[s] = δ (Proof in subsection A.1).

Equation (15) provides the structure of the optimal non-parameterized policy. As part of the SPU framework, we then seek a parameterized policy π_θ that is close to π̃_λ(a|s), that is, one that minimizes the loss function (11). For each sampled state s_i, a straightforward calculation shows (Appendix B):

    ∇_θ D_KL(π_θ ‖ π̃_λ)[s_i] = ∇_θ D_KL(π_θ ‖ π_{θ_k})[s_i] − (1/λ̃_{s_i}) E_{a∼π_{θ_k}(·|s_i)}[ (∇_θ π_θ(a|s_i) / π_{θ_k}(a|s_i)) A^{π_{θ_k}}(s_i, a) ]        (16)

where λ̃_{s_i} = λ for s_i ∈ Γ_λ and λ̃_{s_i} = λ_{s_i} for s_i ∉ Γ_λ. We estimate the expectation in (16) with the sampled action a_i and approximate A^{π_{θ_k}}(s_i, a_i) by A_i (obtained from the critic network), giving:

    ∇_θ D_KL(π_θ ‖ π̃_λ)[s_i] ≈ ∇_θ D_KL(π_θ ‖ π_{θ_k})[s_i] − (1/λ̃_{s_i}) (∇_θ π_θ(a_i|s_i) / π_{θ_k}(a_i|s_i)) A_i        (17)

To simplify the algorithm, we slightly modify (17). We replace the hyper-parameter δ with the hyper-parameter λ and tune λ rather than δ. Further, we set λ̃_{s_i} = λ for all s_i in (17) and introduce per-state acceptance to enforce the disaggregated constraints, giving the approximate gradient:

    ∇_θ D_KL(π_θ ‖ π̃_λ) ≈ (1/m) Σ_{i=1}^m [ ∇_θ D_KL(π_θ ‖ π_{θ_k})[s_i] − (1/λ) (∇_θ π_θ(a_i|s_i) / π_{θ_k}(a_i|s_i)) A_i ] · 1{ D_KL(π_θ ‖ π_{θ_k})[s_i] ≤ ε }        (18)

We make the approximation that the disaggregated constraints are only enforced on the states in the sampled trajectories. We use (18) as our gradient for supervised training of the policy network. Equation (18) has an intuitive interpretation: the gradient represents a trade-off between the approximate performance of π_θ (as captured by (1/λ)(∇_θ π_θ(a_i|s_i)/π_{θ_k}(a_i|s_i)) A_i) and how far π_θ diverges from π_{θ_k} (as captured by ∇_θ D_KL(π_θ ‖ π_{θ_k})[s_i]). For the stopping criterion, we train until (1/m) Σ_i D_KL(π_θ ‖ π_{θ_k})[s_i] ≈ δ.
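Below is a hedged PyTorch sketch of a loss whose gradient matches the approximate gradient (18): the per-state KL penalty minus an advantage term weighted by the likelihood ratio, masked by the per-state acceptance indicator. The coefficient adv_coef plays the role of 1/λ in (18) and is treated directly as a hyper-parameter; all names are our own.

```python
import torch

def forward_kl_spu_loss(logits_new, logits_old, actions, advantages, adv_coef, epsilon):
    """logits_new: (m, K) from pi_theta; logits_old: (m, K) from pi_theta_k;
    actions: (m,) LongTensor of sampled a_i; advantages: (m,) tensor of A_i.
    adv_coef corresponds to 1/lambda in (18)."""
    logp_new = torch.log_softmax(logits_new, dim=-1)
    logp_old = torch.log_softmax(logits_old, dim=-1).detach()
    p_new = logp_new.exp()

    # D_KL(pi_theta || pi_theta_k)[s_i] at each sampled state
    kl_per_state = (p_new * (logp_new - logp_old)).sum(dim=-1)

    # importance ratio pi_theta(a_i|s_i) / pi_theta_k(a_i|s_i)
    idx = actions.unsqueeze(-1)
    ratio = (logp_new.gather(-1, idx) - logp_old.gather(-1, idx)).exp().squeeze(-1)

    # per-state acceptance: states violating D_KL(pi_theta || pi_theta_k)[s_i] <= epsilon
    # do not contribute to the update
    accept = (kl_per_state.detach() <= epsilon).float()

    per_state = kl_per_state - adv_coef * ratio * advantages
    return (accept * per_state).mean()

# Toy usage with random data (5 sampled states, 4 actions).
logits_new = torch.randn(5, 4, requires_grad=True)
logits_old = torch.randn(5, 4)
loss = forward_kl_spu_loss(logits_new, logits_old,
                           torch.randint(0, 4, (5,)), torch.randn(5), 1.0, 0.05)
loss.backward()
```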
5.2 BACKWARD KL CONSTRAINT
In a similar manner, we can derive the structure of the optimal policy when using the reverse KL divergence as the constraint. For simplicity, we provide the result for when there are only disaggregated constraints. We seek the non-parameterized optimal policy by solving:

    maximize_{π∈Π}   Σ_s d^{π_{θ_k}}(s) E_{a∼π(·|s)}[A^{π_{θ_k}}(s, a)]        (19)
    subject to       D_KL(π_{θ_k} ‖ π)[s] ≤ ε   for all s                      (20)
Theorem 2. The optimal solution to the problem (19)-(20) is given by:

    π*(a|s) = π_{θ_k}(a|s) · λ(s) / (λ′(s) − A^{π_{θ_k}}(s, a))        (21)

where λ(s) > 0 and λ′(s) > max_a A^{π_{θ_k}}(s, a) (Proof in subsection A.2).

Note that the structure of the optimal policy with the backward KL constraint is quite different from that with the forward KL constraint. A straightforward calculation shows (Appendix B):

    ∇_θ D_KL(π_θ ‖ π*)[s] = ∇_θ D_KL(π_θ ‖ π_{θ_k})[s] − E_{a∼π_{θ_k}}[ (∇_θ π_θ(a|s) / π_{θ_k}(a|s)) log( 1 / (λ′(s) − A^{π_{θ_k}}(s, a)) ) ]        (22)

Equation (22) has an intuitive interpretation. It increases the probability of action a if A^{π_{θ_k}}(s, a) > λ′(s) − 1 and decreases the probability of action a if A^{π_{θ_k}}(s, a) < λ′(s) − 1. (22) also tries to keep π_θ close to π_{θ_k} by minimizing their KL divergence.
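A minimal numpy sketch of the backward-KL target (21): for a fixed λ′(s), λ(s) is determined by normalization, and λ′(s) can then be found by a bisection on the reverse KL, assuming (as the structure of the problem suggests) that the reverse KL decreases monotonically in λ′(s). Helper names are illustrative assumptions.

```python
import numpy as np

def backward_kl_target(pi_k_s: np.ndarray, adv_s: np.ndarray, lam_prime: float) -> np.ndarray:
    """pi*(.|s) proportional to pi_theta_k(.|s) / (lambda'(s) - A(s,.));
    the normalization constant plays the role of lambda(s) in (21)."""
    unnorm = pi_k_s / (lam_prime - adv_s)
    return unnorm / unnorm.sum()

def reverse_kl(pi_k_s, pi_star_s, tiny=1e-12):
    return float(np.sum(pi_k_s * (np.log(pi_k_s + tiny) - np.log(pi_star_s + tiny))))

def solve_lambda_prime(pi_k_s, adv_s, epsilon, iters=60):
    """Bisection for lambda'(s) such that D_KL(pi_theta_k || pi*)[s] ~= epsilon."""
    lo = adv_s.max() + 1e-6     # KL blows up as lambda' -> max_a A(s,a)
    hi = adv_s.max() + 1e6      # KL -> 0 as lambda' -> infinity
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if reverse_kl(pi_k_s, backward_kl_target(pi_k_s, adv_s, mid)) > epsilon:
            lo = mid            # constraint violated: move to larger lambda'
        else:
            hi = mid
    return hi

pi_k_s = np.array([0.5, 0.3, 0.2])
adv_s = np.array([1.0, -0.5, 0.2])
lp = solve_lambda_prime(pi_k_s, adv_s, epsilon=0.05)
print(lp, backward_kl_target(pi_k_s, adv_s, lp))
```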
5.3 L∞ CONSTRAINT

In this section we show how a PPO-like objective can be formulated in the context of SPU. Recall from Section 3 that the clipping in PPO can be seen as an attempt at keeping π_θ(a_i|s_i) from becoming either much larger than (1 + ε)π_{θ_k}(a_i|s_i) or much smaller than (1 − ε)π_{θ_k}(a_i|s_i) for i = 1, ..., m. In this subsection, we consider the constraint function

    η(π, π_{θ_k}) = max_{i=1,...,m} |π(a_i|s_i) − π_{θ_k}(a_i|s_i)| / π_{θ_k}(a_i|s_i)        (23)

which leads us to the following optimization problem:

    maximize_{π(a_1|s_1),...,π(a_m|s_m)}   Σ_{i=1}^m A^{π_{θ_k}}(s_i, a_i) π(a_i|s_i) / π_{θ_k}(a_i|s_i)        (24)
    subject to   |π(a_i|s_i) − π_{θ_k}(a_i|s_i)| / π_{θ_k}(a_i|s_i) ≤ ε,   i = 1, ..., m                        (25)
                 Σ_{i=1}^m ( (π(a_i|s_i) − π_{θ_k}(a_i|s_i)) / π_{θ_k}(a_i|s_i) )² ≤ δ                          (26)

Note that here we are using a variation of the SPU methodology described in Section 4: we first create estimates of the expectations in the objective and constraints and then solve the optimization problem (rather than first solving the optimization problem and then taking samples, as done for Theorems 1 and 2). Note also that we have included an aggregated constraint (26) in addition to the PPO-like constraint (25), which further ensures that the updated policy is close to π_{θ_k}.
Theorem 3. The optimal solution to the optimization problem (24)-(26) is given by:

    π*(a_i|s_i) = { π_{θ_k}(a_i|s_i)(1 + min{λA_i, ε})    if A_i ≥ 0
                    π_{θ_k}(a_i|s_i)(1 + max{λA_i, −ε})   if A_i < 0        (27)

for some λ > 0, where A_i ≜ A^{π_{θ_k}}(s_i, a_i) (Proof in subsection A.3).

To simplify the algorithm, we treat λ as a hyper-parameter rather than δ. After solving for π*, we seek a parameterized policy π_θ that is close to π* by minimizing their mean squared error over sampled states and actions, i.e. by updating θ in the negative direction of ∇_θ Σ_i (π_θ(a_i|s_i) − π*(a_i|s_i))². This loss is used for supervised training instead of the KL because we take estimates before forming the optimization problem; the optimal values for the decision variables therefore do not completely characterize a distribution. We refer to this approach as SPU with the L∞ constraint.

Although we consider three classes of proximity constraint, there may be yet another class that leads to even better performance. The methodology allows researchers to explore other proximity constraints in the future.
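A minimal numpy sketch of the L∞ target (27); the parameterized policy is then fit by minimizing the mean squared error to these targets, as described above. Names are illustrative assumptions.

```python
import numpy as np

def linf_target(pi_k_probs: np.ndarray, advantages: np.ndarray,
                lam: float, eps: float) -> np.ndarray:
    """pi_k_probs[i] = pi_theta_k(a_i|s_i); returns the target pi*(a_i|s_i) from (27):
    the sampled action's probability is moved up or down by the clipped,
    advantage-scaled step."""
    step = np.where(advantages >= 0.0,
                    np.minimum(lam * advantages, eps),
                    np.maximum(lam * advantages, -eps))
    return pi_k_probs * (1.0 + step)

# Example; the policy network would then be regressed onto these targets by
# minimizing sum_i (pi_theta(a_i|s_i) - pi*(a_i|s_i))^2.
print(linf_target(np.array([0.2, 0.5]), np.array([2.0, -1.0]), lam=0.05, eps=0.1))
```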
6 EXPERIMENTAL RESULTS

Extensive experimental results demonstrate that SPU outperforms recent state-of-the-art methods in environments with continuous or discrete action spaces. We provide ablation studies to show the importance of the different algorithmic components, and a sensitivity analysis to show that SPU's performance is relatively insensitive to hyper-parameter choices. We use two definitions to conclude that algorithm A is more sample efficient than algorithm B: (i) A takes fewer environment interactions to achieve a pre-defined performance threshold (Kakade, 2003); (ii) the averaged final performance of A is higher than that of B given the same number of environment interactions (Schulman et al., 2017). Implementation details are provided in Appendix D.

6.1 RESULTS ON MUJOCO
The Mujoco (Todorov et al., 2012) simulated robotics environments provided by OpenAI Gym (Brockman et al., 2016) have become a popular benchmark for control problems with continuous action spaces. In terms of final performance averaged over all ten available Mujoco environments and ten different seeds in each, both SPU with the L∞ constraint (Section 5.3) and SPU with the forward-KL constraints (Section 5.1) outperform TRPO (the forward-KL variant by 27%; see Table 1). Since the forward-KL approach is our best performing approach, we focus subsequent analysis on it and hereafter refer to it as SPU. SPU also outperforms PPO. Figure 1 illustrates the performance of SPU versus TRPO and PPO.

To ensure that SPU is not only better than TRPO in terms of performance gain early during training, we further train both policies for 3 million timesteps. Again here, SPU outperforms TRPO. Figure 3 in the Appendix illustrates the performance for each environment. Code for the Mujoco experiments is at https://github.com/quanvuong/Supervised_Policy_Update.

6.2 ABLATION STUDIES FOR MUJOCO
The indicator variable in (18) enforces the disaggregated constraint; we refer to it as per-state acceptance. Removing this component is equivalent to removing the indicator variable. We refer to using (1/m) Σ_i D_KL(π_θ ‖ π_{θ_k})[s_i] to determine the number of training epochs as dynamic stopping. Without this component, the number of training epochs is a fixed hyper-parameter. We also tried removing ∇_θ D_KL(π_θ ‖ π_{θ_k})[s_i] from the gradient update step in (18). Table 1 illustrates the contribution of the different components of SPU to the overall performance. The third row shows that the term ∇_θ D_KL(π_θ ‖ π_{θ_k})[s_i] makes a crucially important contribution to SPU. Furthermore, per-state acceptance and dynamic stopping are both also important for obtaining high performance, with the former playing a more central role. When a component is removed, the hyper-parameters are retuned to ensure that the best possible performance is obtained with the alternative (simpler) algorithm.

Figure 1: SPU versus TRPO and PPO on 10 Mujoco environments over 1 million timesteps. The x-axis indicates timesteps. The y-axis indicates the average episode reward of the last 100 episodes.

Table 1: Ablation study for SPU

Approach                   Percentage better than TRPO    Performance vs. original algorithm
Original algorithm         27%                            0%
No grad KL                 4%                             -85%
No dynamic stopping        24%                            -11%
No per-state acceptance    9%                             -67%

6.3 SENSITIVITY ANALYSIS ON MUJOCO
To demonstrate the practicality of SPU, we show that its high performance is insensitive to hyper-parameter choice. One way to show this is as follows: for each SPU hyper-parameter, select a reasonably large interval, randomly sample the value of the hyper-parameter from this interval, and then compare SPU (using the randomly chosen hyper-parameter values) with TRPO. We sampled 100 SPU hyper-parameter vectors (each vector including δ, ε, λ), and for each one determined the relative performance with respect to TRPO. We found that for all 100 random hyper-parameter value samples, SPU performed better than TRPO, and that a large fraction of the samples outperformed TRPO by a substantial margin; the full CDF is given in Figure 4 in the Appendix. We can conclude that SPU's superior performance is largely insensitive to hyper-parameter values.

6.4 RESULTS ON ATARI
Rajeswaran et al. (2017) and Mania et al. (2018) demonstrate that neural networks are not needed to obtain high performance in many Mujoco environments. To conclusively evaluate SPU, we compare it against PPO on the Arcade Learning Environment (Bellemare et al., 2012) exposed through OpenAI Gym (Brockman et al., 2016). Using the same network architecture and hyper-parameters, we learn to play 60 Atari games from raw pixels and rewards. This is highly challenging because of the diversity in the games and the high dimensionality of the observations.

Here, we compare SPU against PPO because PPO outperforms TRPO in Mujoco. Averaged over 60 Atari environments and 20 seeds, SPU is better than PPO in terms of averaged final performance. Figure 2 provides a high-level overview of the results. The dots in the shaded area represent environments where the two performances are roughly similar. The dots to the right of the shaded area represent environments where SPU is more sample efficient than PPO. We can draw two conclusions: (i) in 36 environments, SPU and PPO perform roughly the same, SPU clearly outperforms PPO in 15 environments, and PPO clearly outperforms SPU in 9; (ii) in those 15+9 environments, the extent to which SPU outperforms PPO is much larger than the extent to which PPO outperforms SPU. Figure 5, Figure 6 and Figure 7 in the Appendix illustrate the performance of SPU vs PPO throughout training. SPU's strong results in both the Mujoco and Atari domains demonstrate its performance and generality.

Figure 2: High-level overview of results on Atari

ACKNOWLEDGEMENTS
We would like to acknowledge the extremely helpful support of the NYU Shanghai High Performance Computing administrator Zhiguo Qi. We are also grateful to OpenAI for open-sourcing their baselines code.

REFERENCES
Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. 2018. URL https://arxiv.org/abs/1806.06920.

Joshua Achiam. Advanced policy gradient methods. http://rll.berkeley.edu/deeprlcourse/f17docs/lecture_13_advanced_pg.pdf, 2017.

Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In International Conference on Machine Learning, pp. 22-31, 2017.

Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276, 1998.

Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. CoRR, abs/1207.4708, 2012. URL http://arxiv.org/abs/1207.4708.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. CoRR, abs/1606.01540, 2016. URL http://arxiv.org/abs/1606.01540.

Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. OpenAI baselines. https://github.com/openai/baselines, 2017.

Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. CoRR, abs/1604.06778, 2016. URL http://arxiv.org/abs/1604.06778.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR, abs/1801.01290, 2018. URL http://arxiv.org/abs/1801.01290.

Jan Peters, Katharina Mülling, and Yasemin Altun. Relative entropy policy search. 2010.

Sham Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, University of London, London, England, 2003.

Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pp. 267-274, 2002.

Sham M. Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, pp. 1531-1538, 2002.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.

Sergey Levine. UC Berkeley CS294 deep reinforcement learning lecture notes. http://rail.eecs.berkeley.edu/deeprlcourse-fa17/index.html, 2017.

Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.

Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V. Le. MnasNet: Platform-aware neural architecture search for mobile. 2018. URL https://arxiv.org/abs/1807.11626.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518, 2015. URL http://dx.doi.org/10.1038/nature14236.

Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180-1190, 2008a.

Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682-697, 2008b.

Aravind Rajeswaran, Kendall Lowrey, Emanuel Todorov, and Sham Kakade. Towards generalization and simplicity in continuous control. CoRR, abs/1703.02660, 2017. URL http://arxiv.org/abs/1703.02660.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889-1897, 2015a.

John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. CoRR, abs/1506.02438, 2015b. URL http://arxiv.org/abs/1506.02438.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Voot Tangkaratt, Abbas Abdolmaleki, and Masashi Sugiyama. Guide actor-critic for continuous control. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=BJk59JZ0b.

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. 2012. URL https://ieeexplore.ieee.org/abstract/document/6386109/authors.

Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Rémi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. CoRR, abs/1611.01224, 2016a. URL http://arxiv.org/abs/1611.01224.

Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016b.

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pp. 5-32. Springer, 1992.

Yuhuai Wu, Elman Mansimov, Roger B. Grosse, Shun Liao, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in Neural Information Processing Systems, pp. 5285-5294, 2017.

Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. CoRR, abs/1611.01578, 2016. URL http://arxiv.org/abs/1611.01578.
Appendices
A PROOFS FOR NON-PARAMETERIZED OPTIMIZATION PROBLEMS

A.1 FORWARD KL AGGREGATED AND DISAGGREGATED CONSTRAINTS
We first show that (12)-(14) is a convex optimization problem. To this end, first note that the objective (12) is a linear function of the decision variables π = {π(a|s) : s ∈ S, a ∈ A}. The LHS of (14) can be rewritten as Σ_{a∈A} π(a|s) log π(a|s) − Σ_{a∈A} π(a|s) log π_{θ_k}(a|s). The second term is a linear function of π. The first term is a convex function since the second derivative of each summand is always positive. The LHS of (14) is thus a convex function. By extension, the LHS of (13) is also a convex function since it is a nonnegative weighted sum of convex functions. The problem (12)-(14) is thus a convex optimization problem. According to Slater's constraint qualification, strong duality holds since π_{θ_k} is a feasible solution to (12)-(14) for which the inequalities hold strictly.

We can therefore solve (12)-(14) by solving the related Lagrangian problem. For a fixed λ consider:

    maximize_{π∈Π}   Σ_s d^{π_{θ_k}}(s) { E_{a∼π(·|s)}[A^{π_{θ_k}}(s, a)] − λ D_KL(π ‖ π_{θ_k})[s] }        (28)
    subject to       D_KL(π ‖ π_{θ_k})[s] ≤ ε   for all s                                                   (29)

The above problem decomposes into separate problems, one for each state s:

    maximize_{π(·|s)}   E_{a∼π(·|s)}[A^{π_{θ_k}}(s, a)] − λ D_KL(π ‖ π_{θ_k})[s]        (30)
    subject to          D_KL(π ‖ π_{θ_k})[s] ≤ ε                                         (31)

Further consider the unconstrained problem (30) without the constraint (31):

    maximize_{π(·|s)}   Σ_{a=1}^K π(a|s) [ A^{π_{θ_k}}(s, a) − λ log(π(a|s)/π_{θ_k}(a|s)) ]        (32)
    subject to          Σ_{a=1}^K π(a|s) = 1                                                        (33)
                        π(a|s) ≥ 0,   a = 1, ..., K                                                  (34)

A simple Lagrange-multiplier argument shows that the optimal solution to (32)-(34) is given by:

    π_λ(a|s) = (π_{θ_k}(a|s) / Z_λ(s)) e^{A^{π_{θ_k}}(s,a)/λ}

where Z_λ(s) is defined so that π_λ(·|s) is a valid distribution. Now returning to the decomposed constrained problem (30)-(31), there are two cases to consider. The first case is when D_KL(π_λ ‖ π_{θ_k})[s] ≤ ε. In this case, the optimal solution to (30)-(31) is π_λ(a|s). The second case is when D_KL(π_λ ‖ π_{θ_k})[s] > ε. In this case the optimum is π_λ(a|s) with λ replaced by λ_s, where λ_s is the solution to D_KL(π_{λ_s} ‖ π_{θ_k})[s] = ε. Thus, an optimal solution to (30)-(31) is given by:

    π̃_λ(a|s) = { (π_{θ_k}(a|s) / Z_λ(s)) e^{A^{π_{θ_k}}(s,a)/λ}          if s ∈ Γ_λ
                 (π_{θ_k}(a|s) / Z_{λ_s}(s)) e^{A^{π_{θ_k}}(s,a)/λ_s}      if s ∉ Γ_λ        (35)

where Γ_λ = {s : D_KL(π_λ ‖ π_{θ_k})[s] ≤ ε}.

To find the Lagrange multiplier λ, we can then do a line search to find the λ that satisfies:

    Σ_s d^{π_{θ_k}}(s) D_KL(π̃_λ ‖ π_{θ_k})[s] = δ        (36)    □
A.2 BACKWARD KL CONSTRAINT
The problem (19)-(20) decomposes into separate problems, one for each state s ∈ S:

    maximize_{π(·|s)}   E_{a∼π_{θ_k}(·|s)}[ (π(a|s)/π_{θ_k}(a|s)) A^{π_{θ_k}}(s, a) ]        (37)
    subject to          E_{a∼π_{θ_k}(·|s)}[ log(π_{θ_k}(a|s)/π(a|s)) ] ≤ ε                   (38)

After some algebra, we see that the above optimization problem is equivalent to:

    maximize_{π(·|s)}   Σ_{a=1}^K A^{π_{θ_k}}(s, a) π(a|s)        (39)
    subject to          −Σ_{a=1}^K π_{θ_k}(a|s) log π(a|s) ≤ ε′   (40)
                        Σ_{a=1}^K π(a|s) = 1                      (41)
                        π(a|s) ≥ 0,   a = 1, ..., K               (42)

where ε′ = ε + entropy(π_{θ_k}(·|s)). (39)-(42) is a convex optimization problem for which Slater's condition holds. Strong duality thus holds for the problem (39)-(42). Applying standard Lagrange multiplier arguments, it is easily seen that the solution to (39)-(42) is

    π*(a|s) = π_{θ_k}(a|s) · λ(s) / (λ′(s) − A^{π_{θ_k}}(s, a))

where λ(s) and λ′(s) are constants chosen such that the disaggregated KL constraint is binding and the sum of the probabilities equals 1. It is easily seen that λ(s) > 0 and λ′(s) > max_a A^{π_{θ_k}}(s, a).   □

A.3 L∞ CONSTRAINT
The problem (24)-(26) is equivalent to:

    maximize_{π(a_1|s_1),...,π(a_m|s_m)}   Σ_{i=1}^m A^{π_{θ_k}}(s_i, a_i) π(a_i|s_i) / π_{θ_k}(a_i|s_i)        (43)
    subject to   −ε ≤ (π(a_i|s_i) − π_{θ_k}(a_i|s_i)) / π_{θ_k}(a_i|s_i) ≤ ε,   i = 1, ..., m                    (44)
                 Σ_{i=1}^m ( (π(a_i|s_i) − π_{θ_k}(a_i|s_i)) / π_{θ_k}(a_i|s_i) )² ≤ δ                           (45)

This problem is clearly convex. π_{θ_k}(a_i|s_i), i = 1, ..., m is a feasible solution for which the inequality constraints hold strictly. Strong duality thus holds according to Slater's constraint qualification. To solve (43)-(45), we can therefore solve the related Lagrangian problem for fixed λ:

    maximize_{π(a_1|s_1),...,π(a_m|s_m)}   Σ_{i=1}^m [ A^{π_{θ_k}}(s_i, a_i) π(a_i|s_i) / π_{θ_k}(a_i|s_i) − λ ( (π(a_i|s_i) − π_{θ_k}(a_i|s_i)) / π_{θ_k}(a_i|s_i) )² ]        (46)
    subject to   −ε ≤ (π(a_i|s_i) − π_{θ_k}(a_i|s_i)) / π_{θ_k}(a_i|s_i) ≤ ε,   i = 1, ..., m        (47)

which is separable and decomposes into m separate problems, one for each s_i:

    maximize_{π(a_i|s_i)}   A^{π_{θ_k}}(s_i, a_i) π(a_i|s_i) / π_{θ_k}(a_i|s_i) − λ ( (π(a_i|s_i) − π_{θ_k}(a_i|s_i)) / π_{θ_k}(a_i|s_i) )²        (48)
    subject to              −ε ≤ (π(a_i|s_i) − π_{θ_k}(a_i|s_i)) / π_{θ_k}(a_i|s_i) ≤ ε        (49)

The solution to the unconstrained problem (48) without the constraint (49) is:

    π*(a_i|s_i) = π_{θ_k}(a_i|s_i) ( 1 + A^{π_{θ_k}}(s_i, a_i) / (2λ) )

Now consider the constrained problem (48)-(49). If A^{π_{θ_k}}(s_i, a_i) ≥ 0 and π*(a_i|s_i) > π_{θ_k}(a_i|s_i)(1 + ε), the optimal solution is π_{θ_k}(a_i|s_i)(1 + ε). Similarly, if A^{π_{θ_k}}(s_i, a_i) < 0 and π*(a_i|s_i) < π_{θ_k}(a_i|s_i)(1 − ε), the optimal solution is π_{θ_k}(a_i|s_i)(1 − ε). Rearranging the terms gives Theorem 3. To obtain λ, we can perform a line search over λ so that the constraint (45) is binding.   □
B DERIVATIONS OF THE GRADIENT OF THE LOSS FUNCTION FOR SPU
Let CE stand for cross entropy.

B.1 FORWARD-KL

CE(π_θ ‖ π̃_{λ̃_s})[s]
  = −Σ_a π_θ(a|s) log π̃_{λ̃_s}(a|s)                                              (expanding the definition of cross entropy)
  = −Σ_a π_θ(a|s) log[ (π_{θ_k}(a|s)/Z_{λ̃_s}(s)) e^{A^{π_{θ_k}}(s,a)/λ̃_s} ]       (expanding the definition of π̃_{λ̃_s})
  = −Σ_a π_θ(a|s) log(π_{θ_k}(a|s)/Z_{λ̃_s}(s)) − Σ_a π_θ(a|s) A^{π_{θ_k}}(s,a)/λ̃_s   (log of a product is the sum of logs)
  = −Σ_a π_θ(a|s) log π_{θ_k}(a|s) + Σ_a π_θ(a|s) log Z_{λ̃_s}(s) − (1/λ̃_s) Σ_a π_{θ_k}(a|s) (π_θ(a|s)/π_{θ_k}(a|s)) A^{π_{θ_k}}(s,a)
  = CE(π_θ ‖ π_{θ_k})[s] + log Z_{λ̃_s}(s) − (1/λ̃_s) E_{a∼π_{θ_k}(·|s)}[ (π_θ(a|s)/π_{θ_k}(a|s)) A^{π_{θ_k}}(s,a) ]

⇒ ∇_θ CE(π_θ ‖ π̃_{λ̃_s})[s] = ∇_θ CE(π_θ ‖ π_{θ_k})[s] − (1/λ̃_s) E_{a∼π_{θ_k}(·|s)}[ (∇_θ π_θ(a|s)/π_{θ_k}(a|s)) A^{π_{θ_k}}(s,a) ]   (taking the gradient on both sides)

⇒ ∇_θ D_KL(π_θ ‖ π̃_{λ̃_s})[s] = ∇_θ D_KL(π_θ ‖ π_{θ_k})[s] − (1/λ̃_s) E_{a∼π_{θ_k}(·|s)}[ (∇_θ π_θ(a|s)/π_{θ_k}(a|s)) A^{π_{θ_k}}(s,a) ]   (adding the gradient of the entropy of π_θ on both sides and collapsing the sum of the gradients of the cross entropy and the entropy into the gradient of the KL)

B.2 REVERSE-KL

CE(π_θ ‖ π*)[s]
  = −Σ_a π_θ(a|s) log π*(a|s)                                                       (expanding the definition of cross entropy)
  = −Σ_a π_θ(a|s) log[ π_{θ_k}(a|s) λ(s) / (λ′(s) − A^{π_{θ_k}}(s,a)) ]              (expanding the definition of π*)
  = −Σ_a π_θ(a|s) log π_{θ_k}(a|s) − Σ_a π_θ(a|s) log λ(s) + Σ_a π_θ(a|s) log(λ′(s) − A^{π_{θ_k}}(s,a))
  = CE(π_θ ‖ π_{θ_k})[s] − log λ(s) + E_{a∼π_{θ_k}}[ (π_θ(a|s)/π_{θ_k}(a|s)) log(λ′(s) − A^{π_{θ_k}}(s,a)) ]
  = CE(π_θ ‖ π_{θ_k})[s] − log λ(s) − E_{a∼π_{θ_k}}[ (π_θ(a|s)/π_{θ_k}(a|s)) log( 1/(λ′(s) − A^{π_{θ_k}}(s,a)) ) ]

⇒ ∇_θ CE(π_θ ‖ π*)[s] = ∇_θ CE(π_θ ‖ π_{θ_k})[s] − E_{a∼π_{θ_k}}[ (∇_θ π_θ(a|s)/π_{θ_k}(a|s)) log( 1/(λ′(s) − A^{π_{θ_k}}(s,a)) ) ]   (taking the gradient on both sides)

⇒ ∇_θ D_KL(π_θ ‖ π*)[s] = ∇_θ D_KL(π_θ ‖ π_{θ_k})[s] − E_{a∼π_{θ_k}}[ (∇_θ π_θ(a|s)/π_{θ_k}(a|s)) log( 1/(λ′(s) − A^{π_{θ_k}}(s,a)) ) ]   (adding the gradient of the entropy of π_θ on both sides and collapsing the sum of the gradients of the cross entropy and the entropy into the gradient of the KL)
C EXTENSION TO CONTINUOUS STATE AND ACTION SPACES
The methodology developed in the body of this paper also applies to continuous state and action spaces. In this section, we outline the modifications that are necessary for the continuous case. We first modify the definition of d^π(s) by replacing P^π(s_t = s) with (d/ds) P^π(s_t ≤ s), so that d^π(s) becomes a density function over the state space. With this modification, the definition of D̄_KL(π ‖ π_k) and the approximation (8) are unchanged. The SPU framework described in Section 4 is also unchanged.

Consider now the non-parameterized optimization problem with aggregated and disaggregated constraints (12)-(14), but with continuous state and action spaces:

    maximize_{π∈Π}   ∫ d^{π_{θ_k}}(s) E_{a∼π(·|s)}[A^{π_{θ_k}}(s, a)] ds        (50)
    subject to       ∫ d^{π_{θ_k}}(s) D_KL(π ‖ π_{θ_k})[s] ds ≤ δ               (51)
                     D_KL(π ‖ π_{θ_k})[s] ≤ ε   for all s                        (52)

Theorem 1 holds, although its proof needs to be slightly modified as follows. It is straightforward to show that (50)-(52) remains a convex optimization problem. We can therefore solve (50)-(52) by solving the Lagrangian (28)-(29) with the sum replaced by an integral. This problem again decomposes into separate problems for each s ∈ S, giving exactly the same equations (30)-(31). The proof then proceeds as in the remainder of the proof of Theorem 1. Theorems 2 and 3 are also unchanged for continuous action spaces. Their proofs require slight modifications, as in the proof of Theorem 1.
D IMPLEMENTATION DETAILS AND HYPERPARAMETERS

D.1 MUJOCO
As in (Schulman et al., 2017), for the Mujoco environments the policy is parameterized by a fully-connected feed-forward neural network with two hidden layers, each with 64 units and tanh nonlinearities. The policy outputs the mean of a Gaussian distribution with state-independent variable standard deviations, following (Schulman et al., 2015a; Duan et al., 2016). The action dimensions are assumed to be independent. The probability of an action is given by the multivariate Gaussian probability density function. The baseline used in the advantage value calculation is parameterized by a similarly sized neural network, trained to minimize the MSE between the sampled states' TD(λ) returns and their predicted values. For both the policy and baseline networks, SPU and TRPO use the same architecture. To calculate the advantage values, we use Generalized Advantage Estimation (Schulman et al., 2015b). States are normalized by subtracting the running mean and dividing by the running standard deviation before being fed to any neural network. The advantage values are normalized by subtracting the batch mean and dividing by the batch standard deviation before being used for the policy update. The TRPO result is obtained by running the TRPO implementation provided by OpenAI (Dhariwal et al., 2017), commit 3cc7df060800a45890908045b79821a13c4babdb. At every iteration, SPU collects 2048 samples before updating the policy and the baseline network. For both networks, gradient descent is performed using Adam (Kingma & Ba, 2014) with step size … and minibatch size …; the step size is linearly annealed to 0 over the course of training. γ and λ for GAE (Schulman et al., 2015b) are set to … and … respectively. For SPU, δ, ε, λ and the maximum number of epochs per iteration are set to …, …, … and … respectively. Training is performed for 1 million timesteps for both SPU and PPO. In the sensitivity analysis, the ranges of values for the hyper-parameters δ, ε, λ and the maximum number of epochs are […, …], […, …], […, …] and [5, …] respectively.
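The following is a minimal, hedged sketch of the two preprocessing computations mentioned above: Generalized Advantage Estimation and batch normalization of the advantage values. The default parameter values and episode-termination handling are illustrative simplifications, not the paper's settings.

```python
import numpy as np

def gae(rewards: np.ndarray, values: np.ndarray, last_value: float,
        gamma: float = 0.99, lam: float = 0.95) -> np.ndarray:
    """Generalized Advantage Estimation for one trajectory.
    rewards: (T,) r_t; values: (T,) V(s_t) from the baseline network;
    last_value: V(s_T). gamma/lam defaults are illustrative only."""
    T = len(rewards)
    adv = np.zeros(T)
    next_value, running = last_value, 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        running = delta + gamma * lam * running
        adv[t] = running
        next_value = values[t]
    return adv

def normalize(adv: np.ndarray) -> np.ndarray:
    """Batch normalization of the advantages, as described above."""
    return (adv - adv.mean()) / (adv.std() + 1e-8)
```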
D.2 ATARI

Unless otherwise mentioned, the hyper-parameter values are the same as in subsection D.1. The policy is parameterized by a convolutional neural network with the same architecture as described in Mnih et al. (2015). The output of the network is passed through a relu, linear and softmax layer, in that order, to give the action distribution. The output of the network is also passed through a different linear layer to give the baseline value. States are normalized by dividing by 255 before being fed into any network. The PPO result is obtained by running the PPO implementation provided by OpenAI (Dhariwal et al., 2017), commit 3cc7df060800a45890908045b79821a13c4babdb. Eight different processes run in parallel to collect timesteps. At every iteration, each process collects 256 samples before updating the policy and the baseline network. Each process calculates its own update to the network's parameters, and the updates are averaged over all processes before being used to update the network's parameters. Gradient descent is performed using Adam (Kingma & Ba, 2014). In each process, random number generators are initialized with a different seed according to the formula process_seed = experiment_seed + 10000 * process_rank. Training is performed for 10 million timesteps for both SPU and PPO. For SPU, δ, ε, λ and the maximum number of epochs per iteration are set to …, δ/…, … and … respectively.
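The per-process seeding rule above can be written as a small helper; this is an illustrative restatement, not the released code.

```python
def process_seed(experiment_seed: int, process_rank: int) -> int:
    """Seed for the random number generators in one worker process."""
    return experiment_seed + 10000 * process_rank
```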
E ALGORITHMIC DESCRIPTION FOR SPU
Algorithm 1: Algorithmic description of forward-KL non-parameterized SPU

Require: A neural net π_θ that parameterizes the policy.
Require: A neural net V_φ that approximates V^{π_θ}.
Require: General hyperparameters: γ, β (advantage estimation using GAE), α (learning rate), N (number of trajectories per iteration), T (size of each trajectory), M (size of training minibatch).
Require: Algorithm-specific hyperparameters: δ (aggregated KL constraint), ε (disaggregated constraint), λ, ζ (maximum number of epochs).

for k = 1, 2, ... do
    Under policy π_{θ_k}, sample N trajectories, each of size T: (s_{it}, a_{it}, r_{it}, s_{i(t+1)}), i = 1, ..., N, t = 1, ..., T
    Using any advantage value estimation scheme, estimate A_{it}, i = 1, ..., N, t = 1, ..., T
    θ ← θ_k, φ ← φ_k
    for ζ epochs do
        Sample M samples from the N trajectories, giving {s_1, a_1, A_1, ..., s_M, a_M, A_M}
        L(φ) = (1/M) Σ_m (V_targ(s_m) − V_φ(s_m))²
        φ ← φ − α ∇_φ L(φ)
        ∇_θ L(θ) = (1/M) Σ_m [ ∇_θ D_KL(π_θ ‖ π_{θ_k})[s_m] − (1/λ) (∇_θ π_θ(a_m|s_m) / π_{θ_k}(a_m|s_m)) A_m ] · 1{ D_KL(π_θ ‖ π_{θ_k})[s_m] ≤ ε }
        θ ← θ − α ∇_θ L(θ)
        if (1/M) Σ_m D_KL(π_θ ‖ π_{θ_k})[s_m] > δ then
            break out of the inner for loop
    θ_{k+1} ← θ
    φ_{k+1} ← φ
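Below is a hedged, self-contained Python sketch of the inner loop of Algorithm 1 on synthetic data: a critic regression step, a policy step using the masked forward-KL gradient of the main text (with adv_coef playing the role of 1/λ), and dynamic stopping once the sampled aggregated KL exceeds δ. Network sizes, the synthetic data, and all hyper-parameter values are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
m, state_dim, num_actions = 256, 8, 4
delta, eps, adv_coef, zeta, lr = 0.05, 0.01, 1.0, 30, 1e-2

policy = nn.Linear(state_dim, num_actions)
old_policy = nn.Linear(state_dim, num_actions)
old_policy.load_state_dict(policy.state_dict())
value_fn = nn.Linear(state_dim, 1)
pi_opt = torch.optim.Adam(policy.parameters(), lr=lr)
v_opt = torch.optim.Adam(value_fn.parameters(), lr=lr)

# Synthetic stand-ins for one iteration's minibatch data (s_m, a_m, A_m, V_targ(s_m)).
states = torch.randn(m, state_dim)
actions = torch.randint(0, num_actions, (m,))
advantages = torch.randn(m)
v_targets = torch.randn(m)

for epoch in range(zeta):
    # Critic: L(phi) = (1/M) sum_m (V_targ(s_m) - V_phi(s_m))^2
    v_loss = ((v_targets - value_fn(states).squeeze(-1)) ** 2).mean()
    v_opt.zero_grad(); v_loss.backward(); v_opt.step()

    # Policy: forward-KL loss with per-state acceptance
    logp = torch.log_softmax(policy(states), dim=-1)
    logp_old = torch.log_softmax(old_policy(states), dim=-1).detach()
    kl = (logp.exp() * (logp - logp_old)).sum(-1)
    ratio = (logp.gather(-1, actions[:, None]) - logp_old.gather(-1, actions[:, None])).exp().squeeze(-1)
    accept = (kl.detach() <= eps).float()
    pi_loss = (accept * (kl - adv_coef * ratio * advantages)).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()

    # Dynamic stopping: halt once the sampled aggregated KL exceeds delta
    if float(kl.mean()) > delta:
        break
```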
F EXPERIMENTAL RESULTS

F.1 RESULTS ON MUJOCO FOR 3 MILLION TIMESTEPS
TRPO and SPU were trained for 1 million timesteps to obtain the results in Section 6. To ensure that SPU is not only better than TRPO in terms of performance gain early during training, we further train both policies for 3 million timesteps. Again here, SPU outperforms TRPO. Figure 3 illustrates the performance on each environment.

Figure 3: Performance of SPU versus TRPO on 10 Mujoco environments over 3 million timesteps. The x-axis indicates timesteps. The y-axis indicates the average episode reward of the last 100 episodes.
F.2 SENSITIVITY ANALYSIS CDF FOR MUJOCO
When values for the SPU hyper-parameters are randomly sampled, as explained in subsection 6.3, the percentage improvement of SPU over TRPO becomes a random variable. Figure 4 illustrates the CDF of this random variable.

Figure 4: Sensitivity analysis for SPU

F.3 A