A Logarithmic Barrier Method For Proximal Policy Optimization
Cheng Zeng∗ and Hongming Zhang
Academy of Mathematics and Systems Science, Chinese Academy of Sciences
University of Chinese Academy of Sciences
Institute of Automation, Chinese Academy of Sciences
[email protected]
∗ Contact Author
Abstract
Proximal policy optimization (PPO) [Schulman et al., 2017] has been proposed as a first-order optimization method for reinforcement learning. We should notice that an exterior penalty method is used in it. Often, the minimizers of the exterior penalty functions approach feasibility only in the limit, as the penalty parameter grows increasingly large; this may result in a low level of sampling efficiency. Our method, which we call proximal policy optimization with barrier method (PPO-B), keeps almost all the advantages of PPO, such as easy implementation and good generalization. Specifically, a new surrogate objective based on an interior penalty method is proposed to avoid the defects arising from the exterior penalty method. We conclude that PPO-B outperforms PPO in terms of sampling efficiency, since PPO-B achieved clearly better performance than PPO on the Atari and Mujoco environments.
Introduction

Reinforcement learning is a computational approach to understanding and automating goal-directed learning and decision making. It is distinguished from other computational approaches by its emphasis on learning by an agent from direct interaction with its environment, without requiring exemplary supervision or complete models of the environment. The integration of reinforcement learning and neural networks has a long history [Sutton and Barto, 1998; Bertsekas et al., 1996; Schmidhuber, 2015]. With the recent exciting achievements of deep learning [Lecun et al., 2015; Heaton, 2017], benefiting from big data, new algorithmic techniques and powerful computation, we have been witnessing the renaissance of reinforcement learning, especially the combination of reinforcement learning and deep neural networks, i.e., deep reinforcement learning. Deep reinforcement learning methods have shown tremendous success in a large variety of tasks, such as Atari [Mnih et al., 2013], continuous control [Lillicrap et al., 2015;
Schulman et al., 2015] and even Go at the human grandmaster level [Silver et al., 2016; Silver et al., 2017]. Policy gradient methods [Williams, 1992] are an important family of methods in model-free reinforcement learning.

Policy gradient methods use a neural network to describe the relationship between a state and an action, or a distribution over actions. We can write down the expression of the expected return under the policy π, differentiate it, and express the derivative as a mathematical expectation, which yields the gradient estimate under the current policy. We can then use a Monte Carlo approach to estimate the policy gradient with the data obtained from the interaction between the current policy and the environment, after which gradient ascent is performed on the parameters of the neural network. Once the parameters are finally determined, we have in effect determined an optimal policy.

There are some problems with policy gradient methods: for instance, they may converge to a local optimum, and sample inefficiency is a major issue. Moreover, when the optimization is done without any constraints, the parameters may be updated on a large scale and eventually lead to divergence. To avert this problem, TRPO places a constraint on the KL divergence between the action distributions before and after an update, and proves that under this constraint the optimal solution of the constrained problem ensures that the expected return is increasing. However, TRPO is very complex in implementation and computation. The PPO algorithm was proposed to avoid this: it transforms a constrained optimization problem into an unconstrained one. It is not only easy to implement, but also performs fairly well in experiments. In a sense, PPO is one of the best algorithms among policy gradient methods.

In the original PPO, two objective functions are proposed. One penalizes the KL divergence between the two distributions; the other is a well-designed pessimistic clipped surrogate objective. Experimental results showed that the PPO algorithm with the "clipped" surrogate objective was better in more games. Although PPO with the KLD penalty does not perform as well in experiments, it provides some enlightenment for our follow-up work. From the point of view of optimization, PPO with a penalty on the KL divergence is actually an exterior penalty method: it constantly penalizes a larger KL divergence, forcing the iterates to converge to the constraint region.

The exterior penalty method has some drawbacks. Though every step of the iteration penalizes solutions that are out of scope, it still cannot guarantee that every update lies in the feasible region, in other words, within the range that the inequality constraint controls. When outside the feasible region, the gradient cannot be estimated reliably. Thus, in order to avoid this situation, we propose to use a barrier function to constrain the search set of every update. Barrier functions are a family of functions that tend to infinity as the iterate approaches the edge of the constrained region, which forces every update to stay in the feasible region. The collected data then help us estimate the gradient accurately, so that the parameters can be updated more effectively. We conducted complete experiments in the Atari and Mujoco environments and achieved very competitive results relative to PPO.

The rest of this paper is organized as follows: first, we elaborate the background of this paper and summarize previous work. Next, we introduce the logarithmic barrier method.
Then a new surrogate objective will be proposed to improve sampling efficiency. Finally, we will show the experimental results and analyze them.

Background

Policy gradient methods are an important family of reinforcement learning algorithms. They assume that the policy π_θ maps a state to an action (or a distribution over actions); θ is the parameter of the policy, and a fixed θ determines a unique policy. In policy gradient methods, the expected cumulative reward is written as a function of the policy, that is,

J^{PG}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_\theta}\left[ \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \right], (1)

where \hat{A}_t is an estimator of the advantage function at timestep t. We take a fixed θ to interact with the environment, and the collected data are used to update θ. In this way, the gradient estimator has the form

\hat{g} = \mathbb{E}_{(s_t, a_t) \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \right]. (2)

The core idea of the policy gradient algorithm is to interact with the environment using the policy and to estimate the gradient by the Monte Carlo method. In this way we obtain a new policy, interact with the environment using the new policy, use the resulting data to update the parameters, and repeat; this cycle improves the expected return of the policy.

If the algorithm is on-policy, the data derived from π_θ can only be used once: as soon as θ is updated, the previously collected data become useless. This is wasteful; it is appealing to perform multiple steps of optimization on the loss using the same trajectory. In this case, the more effective way is an off-policy approach, which requires importance sampling. TRPO gives the objective function that we need to optimize:

J^{\theta'}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta'}}\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta'}(a_t \mid s_t)} A^{\theta'}(s_t, a_t) \right], (3)

where A^{\theta'}(s_t, a_t) is the advantage function under the parameter θ'. TRPO proves that if this objective is maximized at every iteration, the expected return is guaranteed to improve monotonically. Due to the properties of importance sampling, if we want to use the data more efficiently, it is necessary to keep the KL divergence between the old and new action distributions from growing too large. So what we want to solve is the optimization problem of Equation (3) under the constraint KL(π_{θ'}(·|s), π_θ(·|s)) < δ. Therefore, the constrained optimization problem proposed by TRPO is

\max_\theta \; J^{\theta'}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta'}}\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta'}(a_t \mid s_t)} A^{\theta'}(s_t, a_t) \right] \quad \text{s.t. } KL(\pi_{\theta'}(\cdot \mid s), \pi_\theta(\cdot \mid s)) < \delta. (4)

Many techniques are used in TRPO to deal with this complex constrained optimization problem, such as making a linear approximation to the objective and a quadratic approximation to the constraint, which increases the difficulty of computation and implementation.

PPO has been proposed to retain the reliability and stability of TRPO with the goals of simpler implementation, better generalization and better empirical sample complexity. The PPO algorithm puts forward two ways to simplify TRPO's constrained optimization problem. The first is to use a penalty method instead of the constraint, shown in Equation (5); in essence, it is an exterior penalty method. For the selection of the penalty parameter, PPO proposes an adaptive way to adjust the penalty coefficient β:

J^{KLPEN}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta'}}\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta'}(a_t \mid s_t)} A^{\theta'}(s_t, a_t) \right] - \beta\, KL(\pi_{\theta'}(\cdot \mid s), \pi_\theta(\cdot \mid s)). (5)
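To make Equation (5) concrete, here is a minimal NumPy sketch of the penalized surrogate for a discrete action space. The array names (probs_new, probs_old, actions, adv) and the averaging of the KL term over the sampled states are our illustrative choices, not notation from the paper.

```python
import numpy as np

def kl_penalized_objective(probs_new, probs_old, actions, adv, beta=1.0):
    """Monte Carlo estimate of the KL-penalized surrogate of Equation (5).

    probs_new, probs_old: (T, n_actions) action probabilities under the
    current policy pi_theta and the data-collecting policy pi_theta';
    actions: (T,) indices of the sampled actions; adv: (T,) advantages.
    """
    idx = np.arange(len(actions))
    ratio = probs_new[idx, actions] / probs_old[idx, actions]  # importance weights
    # KL(pi_theta' || pi_theta) averaged over the visited states
    kl = np.sum(probs_old * np.log(probs_old / probs_new), axis=1).mean()
    return np.mean(ratio * adv) - beta * kl
```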
The second way is relatively ingenious: it replaces the original constrained problem with a "clipped" surrogate objective,

J^{CLIP}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta'}}\left[ \min\left( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta'}(a_t \mid s_t)} A^{\theta'}(s_t, a_t),\; \mathrm{clip}\left( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta'}(a_t \mid s_t)},\, 1-\epsilon,\, 1+\epsilon \right) A^{\theta'}(s_t, a_t) \right) \right]. (6)

From the experimental results, PPO-clip is the better of the two. For the first method, the exterior penalty function is used to solve the problem. However, the exterior penalty function method causes some problems: it cannot guarantee that the two distributions stay strictly within the constraint region at each update. Therefore, this paper considers the use of the barrier method to ensure that the parameters remain strictly within the constraint region at each update. According to our experimental results, our method is better than PPO-clip and achieves state-of-the-art performance among policy gradient methods. We will elaborate this method in detail in the following sections.
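For comparison, a sketch of the clipped surrogate of Equation (6); eps plays the role of ε (0.2 is a common default, not a value from this paper), and the pessimistic min is what removes the incentive to move the ratio far from 1.

```python
import numpy as np

def clipped_objective(ratio, adv, eps=0.2):
    """PPO's pessimistic clipped surrogate of Equation (6).

    ratio: (T,) importance weights pi_theta/pi_theta' of the sampled
    actions; adv: (T,) advantage estimates; eps: clipping parameter.
    """
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.mean(np.minimum(unclipped, clipped))  # pessimistic bound
```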
The Logarithmic Barrier Method

The first method in PPO is sometimes known as an exterior penalty method, because the penalty term for a constraint is nonzero only when the variables are infeasible with respect to that constraint. Often, the minimizers of the penalty functions are infeasible with respect to the original problem, and approach feasibility only in the limit as the penalty parameter grows increasingly large.

Since the exterior penalty method in PPO does not guarantee that the KL divergence satisfies the constraint, an interior penalty method is proposed here to avoid this. The interior penalty method is also called the barrier method. We introduce the concept of barrier functions through a generalized inequality-constrained optimization problem. Consider the problem

\min_x f(x) \quad \text{subject to } c_i(x) \ge 0, \; i \in \mathcal{I}, (7)

whose strictly feasible region is defined by

\mathcal{F} \equiv \{ x \in \mathbb{R}^n \mid c_i(x) > 0 \text{ for all } i \in \mathcal{I} \}; (8)

we assume that \mathcal{F} is nonempty for the purposes of this discussion. Barrier functions for this problem have the properties that (a) they are smooth inside \mathcal{F}; (b) they are infinite everywhere except in \mathcal{F}; (c) their value approaches ∞ as x approaches the boundary of \mathcal{F}.

The most commonly used barrier function is the logarithmic barrier function, which for the constraint set c_i(x) \ge 0, i \in \mathcal{I}, has the form

-\sum_{i \in \mathcal{I}} \log c_i(x), (9)

where \log(\cdot) denotes the natural logarithm. For the inequality-constrained optimization problem, the combined objective/barrier function is given by

P(x; \mu) = f(x) - \mu \sum_{i \in \mathcal{I}} \log c_i(x), (10)

where µ is referred to here as the barrier parameter. From now on, we refer to P(x;µ) itself as the "logarithmic barrier function" for Equation (7), or simply the "log-barrier function" for short.

Consider the following problem in a single variable x:

\min_x x \quad \text{subject to } x - 1 \ge 0, \; 2 - x \ge 0, (11)

for which we have

P(x; \mu) = x - \mu \log(x - 1) - \mu \log(2 - x). (12)

We graph this function for different values of µ in Figure 1. Naturally, for small values of µ, the function P(x;µ) is close to the objective f over most of the feasible set; it approaches ∞ only in narrow "boundary layers." (In Figure 1, the curve P(x; 0.01) is almost indistinguishable from f(x) at the resolution of our plot, though this function also approaches ∞ when x is very close to the endpoints 1 and 2.) Also, it is clear that as µ ↓ 0, the minimizer x(µ) of P(x;µ) approaches the solution x* = 1 of the constrained problem.

Figure 1: Plot of P(x;µ) for µ = 1, 0.4, 0.1 and 0.01.

Since the minimizer x(µ) of P(x;µ) lies in the strictly feasible set \mathcal{F}, we can in principle search for it using unconstrained minimization algorithms. Unfortunately, the minimizer x(µ) becomes more and more difficult to find as µ ↓ 0, so we should choose a suitable µ, not one as small as possible.

M. Wright [Wright and Holt, 1985] proved the effectiveness of the log-barrier method in the case of convex functions.

Theorem 1
Suppose that f and −c_i, i ∈ \mathcal{I}, in Equations (7) and (8) are all convex functions, and that the strictly feasible region \mathcal{F} defined by Equation (8) is nonempty. Assume that the solution set \mathcal{M} is nonempty and bounded. Then for any µ > 0, P(x;µ) is convex in \mathcal{F} and attains a minimizer x(µ) (not necessarily unique) on \mathcal{F}. Any local minimizer x(µ) is also a global minimizer of P(x;µ).

In fact, other functions can be used as barrier functions. We compared the experimental results of several barrier functions and finally chose the log function as our barrier function.
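As a quick numerical check of the behaviour of x(µ) in the one-dimensional example of Equation (12), the following sketch minimizes P(x;µ) by grid search over the strictly feasible interval (1, 2); the grid resolution is an arbitrary choice.

```python
import numpy as np

def P(x, mu):
    """Log-barrier function of Equation (12) for f(x) = x on [1, 2]."""
    return x - mu * np.log(x - 1.0) - mu * np.log(2.0 - x)

xs = np.linspace(1.0 + 1e-6, 2.0 - 1e-6, 200_000)  # strictly feasible grid
for mu in [1.0, 0.4, 0.1, 0.01]:
    x_min = xs[np.argmin(P(xs, mu))]
    print(f"mu = {mu:4.2f}   x(mu) = {x_min:.4f}")  # x(mu) -> x* = 1 as mu -> 0
```

For this particular example the minimizer can also be found in closed form by setting P'(x;µ) = 0, giving x(µ) = (3 + 2µ − √(1 + 4µ²))/2, which indeed tends to 1 as µ ↓ 0.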
Proximal Policy Optimization with Barrier Method

In this section, we apply the logarithmic barrier method to solve the constrained optimization problem of Equation (4), which will improve the sampling efficiency of PPO.

The essence of PPO with the penalized objective is an exterior penalty method. The constraint is

KL(\pi_{\theta'}(\cdot \mid s), \pi_\theta(\cdot \mid s)) < \delta, (13)

so we penalize the KL divergence with a coefficient β, which is chosen in an adaptive way. When the iterate leaves the constraint region, the penalization is increased, forcing it to move back towards the feasible area. Inside the feasible area, the difference between the penalized and original objectives is small enough to be neglected, so the two have similar optimal solutions. Unlike the exterior penalty function, we use the interior penalty method to solve this problem. The essence of the exterior penalty method is to approach the optimal solution of the constrained problem from outside the feasible region, while the interior penalty method approaches it from inside, so every update satisfies the constraint strictly. This makes the method more suitable for solving inequality-constrained optimization problems. Specifically, we use the logarithmic barrier function introduced in the previous section.

The barrier method is also called the interior penalty function method. The minimum point of the barrier function is a strictly feasible point, that is, a point satisfying the inequality constraints. When the sequence of minimizers of the barrier function approaches the boundary of the feasible region from within, the barrier function tends to infinity, preventing the iterates from falling out of the feasible region. At the same time, for a suitable µ, the extremum of the objective function with the barrier term is close to that of the original function, so we only need to solve an unconstrained optimization problem.

Compared with the penalty function in PPO, the objective function with a barrier term is more interpretable, and, by the properties of the barrier function, it strictly ensures that the difference between the distributions of two successive iterations is not too big, so that the samples can be fully exploited. The experimental results in the following sections confirm this idea. Applied to the problem we need to deal with, the objective becomes

J^{KLBAR}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta'}}\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta'}(a_t \mid s_t)} A^{\theta'}(s_t, a_t) \right] + \mu \ln\left[ \delta - KL(\pi_\theta(\cdot \mid s_t), \pi_{\theta'}(\cdot \mid s_t)) \right]. (14)

In practice, however, the distance determined by the KL divergence is not very robust. Inspired by PPO-clip, we instead use the distance between π_θ(a_t|s_t) and π_{θ'}(a_t|s_t) to limit the difference between the two distributions. There are many ways to measure this distance; we conducted several sets of controlled trials on several games, and finally chose a distance characterized by (\sqrt{\pi_\theta(a_t \mid s_t)} - \sqrt{\pi_{\theta'}(a_t \mid s_t)})^2. It can easily be proved that this distance is less than the angular distance between the two action distributions. In this way, the objective function we need to optimize becomes

J^{ADBAR}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta'}}\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta'}(a_t \mid s_t)} A^{\theta'}(s_t, a_t) \right] + \mu \ln\left[ \delta - \left( \sqrt{\pi_\theta(a_t \mid s_t)} - \sqrt{\pi_{\theta'}(a_t \mid s_t)} \right)^2 \right]. (15)

We solve the whole problem in the framework of A2C [Mnih et al., 2016].
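A minimal sketch of Equation (15) for a discrete action space, reading the barrier term as applied per sample inside the expectation (one natural interpretation of the formula); the variable names are illustrative.

```python
import numpy as np

def barrier_objective(p_new, p_old, adv, mu=1.0, delta=0.5):
    """Barrier surrogate of Equation (15).

    p_new, p_old: (T,) probabilities pi_theta(a_t|s_t) and pi_theta'(a_t|s_t)
    of the sampled actions; adv: (T,) advantage estimates.
    """
    ratio = p_new / p_old
    dist = (np.sqrt(p_new) - np.sqrt(p_old)) ** 2  # Hellinger-style distance
    # The log barrier is finite only while dist < delta, so following this
    # objective's gradient keeps every update strictly inside the constraint.
    return np.mean(ratio * adv + mu * np.log(delta - dist))
```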
It should be noted that our method introduces two extra hyperparameters, µ and δ, but fortunately experiments show that it is sufficient to simply choose fixed coefficients (µ = 1 and δ = 0.5) and optimize the penalized objective of Equation (15) with SGD.

The pseudo code is shown in Algorithm 1. It is almost the same as the pseudo code of the PPO algorithm, except that the objective function is replaced by Equation (15).
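As a self-contained toy illustration of this loop (Algorithm 1 below), the following sketch runs it on a two-armed bandit with a softmax policy, using central finite differences in place of SGD; the bandit, batch size, step size and seed are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def J(theta, theta_old, actions, adv, mu=1.0, delta=0.5):
    """Barrier surrogate of Equation (15) for a softmax policy over two arms."""
    p_new = softmax(theta)[actions]
    p_old = softmax(theta_old)[actions]
    dist = (np.sqrt(p_new) - np.sqrt(p_old)) ** 2
    if np.any(dist >= delta):          # outside the barrier: objective undefined
        return -np.inf
    return np.mean(p_new / p_old * adv + mu * np.log(delta - dist))

theta = np.zeros(2)                    # two-armed bandit: arm 0 pays 1, arm 1 pays 0
for iteration in range(50):            # outer loop of Algorithm 1
    theta_old = theta.copy()
    actions = rng.choice(2, size=64, p=softmax(theta_old))  # run pi_theta_old
    rewards = (actions == 0).astype(float)
    adv = rewards - rewards.mean()     # crude advantage estimate
    for epoch in range(4):             # K epochs of ascent on the same batch
        grad = np.zeros_like(theta)    # finite differences stand in for SGD
        for i in range(2):
            e = np.zeros(2); e[i] = 1e-5
            grad[i] = (J(theta + e, theta_old, actions, adv)
                       - J(theta - e, theta_old, actions, adv)) / 2e-5
        theta += 0.1 * grad
print(softmax(theta))                  # probability mass shifts to the rewarding arm
```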
Algorithm 1
PPO-B, Actor-Critic Style
Input: max iterations L, number of actors N, epochs K
for iteration = 1, 2, ..., L do
  for actor = 1, 2, ..., N do
    Run policy π_θold for T time steps
    Compute advantage estimates Â_1, ..., Â_T
  end for
  for epoch = 1, 2, ..., K do
    Optimize the loss objective of Equation (15) w.r.t. θ with mini-batch size M ≤ NT
  end for
  Update θ_old ← θ
end for

Experiments

In this section, we experimentally compare PPO-B and PPO on the 49 version-4 benchmark Atari game playing tasks provided by OpenAI Gym [Brockman et al., 2016] and the 7 version-2 benchmark control tasks provided by the robotics RL environment of PyBullet [Plappert et al., 2018]. We focus on a detailed quantitative comparison with PPO to check whether PPO-B improves sampling efficiency and performance.

In our setting, the same policy network architecture given in [Mnih et al., 2015] for the Atari game playing tasks and the same network architecture given in [Schulman et al., 2017; Duan et al., 2016] for the benchmark control tasks are adopted for both algorithms. We also use the same number of training steps and the same amount of game frames (40M for Atari games and 10M for Mujoco). Meanwhile, we strictly follow the hyperparameter settings used in [Schulman et al., 2017] for both PPO-B and PPO, and initialize parameters using the same strategy as [Schulman et al., 2017]. The only exception is the number of actors, which is set to 8 in [Schulman et al., 2017] but equals 16 in our experiments for both tasks. In addition to the hyperparameters used in PPO, PPO-B requires two extra hyperparameters, µ and δ. In our experiments, we tested several different settings for µ and δ and chose the parameters (µ = 1 and δ = 0.5) that performed best in the 7 Mujoco environments.

For searching over hyperparameters for our algorithm, we used a computationally cheap benchmark proposed by [Schulman et al., 2017] to test the algorithms on. It consists of 7 simulated robotics tasks implemented in OpenAI Gym, which use the Mujoco environment, with only 1 million time steps for each environment. Each algorithm was run on all 7 environments, with 3 random seeds on each. We scored each run of the algorithm by computing the average total reward of the last 100 episodes. We shifted and scaled the scores for each environment so that the random policy gave a score of 0 and the best result was set to 1, and averaged over 21 runs to produce a single score for each algorithm setting. The parameters (µ = 1 and δ = 0.5) performed best in the 7 Mujoco environments, so we used these two parameters in our experiment settings.

First, the performance of both algorithms is examined based on the learning curves presented in Figure 2 and Figure 3. Afterwards, we compare the sample efficiency of PPO-B and PPO using the performance scores summarized in Table 1 and Table 2.

We compared PPO-B with the PPO algorithm in the Atari environment. It is noteworthy that PPO used the best parameters from its original experiments; the specific parameters are shown in Table 3 and Table 4. PPO-B does not modify any of the parameters of PPO and only adjusts the newly introduced parameters. In fact, this setting is beneficial to PPO. We conducted experiments on 49 Atari games, with three runs in each game environment. The final results are summarized in Table 1.

Table 1: Number of games "won" by each algorithm on Atari
We present the learning curves of the two algorithms on the 49 Atari games in Figure 3. As can be clearly seen in these figures, PPO-B outperforms PPO on 34 out of the 49 games. On some games, such as DemonAttack and Gopher, PPO-B performed notably better than PPO. On some games, such as BreakOut and KungFuMaster, the two algorithms exhibited similar performance at the start, but PPO-B managed to achieve better performance towards the end of the learning process. On other games, such as Kangaroo and Gopher, the performance differences can be witnessed shortly after learning starts.

To compare PPO-B and PPO in terms of their sample efficiency, we adopt the two scoring metrics introduced in [Schulman et al., 2017]: (1) the average reward per episode over the entire training period, which measures fast learning, and (2) the average reward per episode over the last 100 episodes of training, which measures final performance. As evidenced in Table 1 and Table 2, PPO-B is clearly more sample efficient than PPO on 34 out of 49 Atari games in the metric of average reward per episode over the last 100 episodes of training.
We also configure PPO-B and PPO in Mujoco, which is a continuous environment. The two algorithms have the same parameters as those in the previous section. For each Mujoco environment, we ran three experiments with each algorithm. The experimental results are shown below; we can see that our algorithm is superior to or competitive with PPO.
Table 2: Number of games "won" by each algorithm on Mujoco
In Table 7 we present the mean reward over the last 100 episodes of training. Notably, PPO-B outperforms PPO on HalfCheetah, Hopper, InvertedDoublePendulum, InvertedPendulum and Walker2d. In the Swimmer and Reacher environments, however, we observe a different result (in Table 2), where PPO outperforms PPO-B.
Conclusion

We have proposed an improved PPO algorithm, which uses a barrier method instead of an exterior penalty method and achieves good results in the Atari and Mujoco environments. PPO-B makes full use of the advantages of the barrier method, increasing the sampling efficiency of each actor, while keeping PPO's advantages of simple implementation and good generalization. It achieves better performance than the PPO algorithm and can also provide some insight for future work.
Acknowledgements
The authors are grateful to Ruoyu Wang, Minghui Qin and others at AMSS for insightful comments.
References

[Bertsekas et al., 1996] D. P. Bertsekas, J. N. Tsitsiklis, and A. Volgenant. Neuro-dynamic programming. Encyclopedia of Optimization, 27(6):1687–1692, 1996.

[Brockman et al., 2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.

[Duan et al., 2016] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338, 2016.

[Heaton, 2017] Jeff Heaton. Ian Goodfellow, Yoshua Bengio, and Aaron Courville: Deep learning. Genetic Programming and Evolvable Machines, 19(1-2):1–3, 2017.

[Lecun et al., 2015] Y. Lecun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436, 2015.

[Lillicrap et al., 2015] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[Mnih et al., 2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. Computer Science, 2013.

[Mnih et al., 2015] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, and G. Ostrovski. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[Mnih et al., 2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.

[Plappert et al., 2018] Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter Welinder, Vikash Kumar, and Wojciech Zaremba. Multi-goal reinforcement learning: Challenging robotics environments and request for research, 2018.

[Schmidhuber, 2015] Jurgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.

[Schulman et al., 2015] John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. Computer Science, pages 1889–1897, 2015.

[Schulman et al., 2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[Silver et al., 2016] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, and M. Lanctot. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[Silver et al., 2017] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, and A. Bolton. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.

[Sutton and Barto, 1998] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[Williams, 1992] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

[Wright and Holt, 1985] S. J. Wright and J. N. Holt. An inexact Levenberg–Marquardt method for large sparse nonlinear least squares. Journal of the Australian Mathematical Society, 26(4):387–403, 1985.
A Hyperparameters
HYPER-PARAMETER          VALUE
HORIZON (T)              128
ADAM STEP-SIZE           2.5 × 10⁻⁴ × α
NUM. EPOCHS              3
MINI-BATCH SIZE          32 × 16
DISCOUNT (γ)             0.99
GAE PARAMETER (λ)        0.95
NUMBER OF ACTORS         16
CLIPPING PARAMETER (ε)   0.1 × α
VF COEFF.                1
ENTROPY COEFF.           0.01

Table 3: PPO's hyper-parameters for Atari games.
HYPER-PARAMETER                  VALUE
HORIZON (T)                      128
ADAM STEP-SIZE                   2.5 × 10⁻⁴ × α
NUM. EPOCHS                      3
MINI-BATCH SIZE                  32 × 16
DISCOUNT (γ)                     0.99
GAE PARAMETER (λ)                0.95
NUMBER OF ACTORS                 16
VF COEFF.                        1
ENTROPY COEFF.                   0.01
BARRIER FUNCTION PARAMETER (µ)   1
BARRIER FUNCTION PARAMETER (δ)   0.5

Table 4: PPO-B's hyper-parameters for Atari games.
HYPER-PARAMETER       VALUE
HORIZON (T)           2048
ADAM STEP-SIZE        3 × 10⁻⁴
NUM. EPOCHS           10
MINI-BATCH SIZE       64
DISCOUNT (γ)          0.99
GAE PARAMETER (λ)     0.95

Table 5: PPO's hyper-parameters for Mujoco.
HYPER-PARAMETER                  VALUE
HORIZON (T)                      2048
ADAM STEP-SIZE                   3 × 10⁻⁴
NUM. EPOCHS                      10
MINI-BATCH SIZE                  64
DISCOUNT (γ)                     0.99
GAE PARAMETER (λ)                0.95
BARRIER FUNCTION PARAMETER (µ)   1
BARRIER FUNCTION PARAMETER (δ)   0.5

Table 6: PPO-B's hyper-parameters for Mujoco.
B Performance on Mujoco Games
Figure 2: Comparison of several algorithms on several MuJoCo environments
                          PPO-B     PPO
HalfCheetah               4784.27   2252.16
Hopper                    2968.56   2187.83
InvertedDoublePendulum    8562.62   8377.86
InvertedPendulum          999.43    905.88
Reacher                   -6.31     -4.51
Swimmer                   85.28     113.57
Walker2d                  4201.24   3794.44
Table 7: Mean final scores (last 100 episodes) of PPO-B and PPO on Mujoco
C Performance on More Atari Games
Figure 3: Comparison of PPO-B and PPO on all 49 Atari games included in OpenAI Gym
Table 8: Mean final scores (last 100 episodes) of PPO-B and PPO on each of the 49 Atari games.