An Adaptive Clipping Approach for Proximal Policy Optimization
Gang Chen
School of Engineering and Computer Science, Victoria University of Wellington, Wellington, New Zealand
[email protected]
Yiming Peng
School of Engineering and Computer Science, Victoria University of Wellington, Wellington, New Zealand
[email protected]
Mengjie Zhang
School of Engineering and Computer Science, Victoria University of Wellington, Wellington, New Zealand
[email protected]
Abstract
Very recently, proximal policy optimization (PPO) algorithms have been proposed as first-order optimization methods for effective reinforcement learning. While PPO is inspired by the same learning theory that justifies trust region policy optimization (TRPO), PPO substantially simplifies algorithm design and improves data efficiency by performing multiple epochs of clipped policy optimization from sampled data. Although clipping in PPO stands for an important new mechanism for efficient and reliable policy updates, it may fail to adaptively improve learning performance in accordance with the importance of each sampled state. To address this issue, a new surrogate learning objective featuring an adaptive clipping mechanism is proposed in this paper, enabling us to develop a new algorithm, known as PPO-λ. PPO-λ optimizes policies repeatedly based on a theoretical target for adaptive policy improvement. Meanwhile, destructively large policy updates can be effectively prevented through both clipping and adaptive control of a hyperparameter λ in PPO-λ, ensuring high learning reliability. PPO-λ enjoys the same simple and efficient design as PPO. Empirically, on several Atari game playing tasks and benchmark control tasks, PPO-λ also achieved clearly better performance than PPO.

Driven by the explosive interest from the research community and industry, reinforcement learning (RL) technologies are being advanced at an accelerating pace [Barreto et al., 2017; Nachum et al., 2017; Tang et al., 2017; Haarnoja et al., 2017; Vezhnevets et al., 2017]. In particular, numerous innovative RL algorithms have been proposed in recent years for training deep neural networks (DNNs) that can solve difficult decision making problems modelled as Markov Decision Processes (MDPs) [Arulkumaran et al., 2017; Wang et al., 2015; Duan et al., 2016; Mnih et al., 2015]. Among them, many cutting-edge RL algorithms feature the use of policy search (PS) techniques, where action-selection policies are represented explicitly as DNNs rather than implicitly derived from separate value functions. Through direct optimization of policy parameters in DNNs, these algorithms are expected to achieve near-optimal learning performance in extremely large learning environments [Mnih et al., 2016; Hausknecht and Stone, 2016; Gu et al., 2017; Schulman et al., 2015b; Schulman et al., 2015a; Lillicrap et al., 2015; Wang et al., 2016; Wu et al., 2017].

Trust region policy optimization (TRPO) is a prominent and highly effective algorithm for PS [Schulman et al., 2015a]. It has the goal of optimizing a surrogate learning objective subject to predefined behavioral constraints (see Subsection 2.2 for more information). To tackle this learning problem, both a linear approximation of the learning objective and a quadratic approximation of the constraint are utilized to jointly guide policy updates, resulting in relatively high computation complexity. Meanwhile, TRPO is not compatible with NN architectures that share parameters between the policy and the value function [Schulman et al., 2017]. It is also not suitable for performing multiple epochs of minibatch policy updates, which is vital for high sample efficiency.

In order to address the above difficulties, Schulman et al. very recently developed new PS algorithms called proximal policy optimization (PPO) [Schulman et al., 2017]. By employing only first-order optimization methods, PPO is much simpler to implement and more efficient to operate than TRPO.
Moreover, PPO allows repeated policy optimization based on previously sampled data to reduce sample complexity. Thanks to this clever design, on many benchmark RL tasks including robotic locomotion and Atari game playing, PPO significantly outperformed TRPO and several other state-of-the-art learning algorithms, including A2C [Mnih et al., 2016], Vanilla PG [Sutton et al., 2000; Bhatnagar et al., 2009] and CEM [Szita and Lőrincz, 2006]. It also performed competitively with ACER, a currently leading algorithm in RL research [Wang et al., 2016].

A simple clipping mechanism plays a vital role for reliable PS in PPO. In fact, clipping, as an efficient and effective technique for policy learning under behavioral restriction, is general in nature and can be widely used in many cutting-edge RL algorithms. However, the clipped probability ratio used by PPO in its surrogate learning objective may allow less important states to receive more policy updates than desirable. This is because policy updates at more important states often vanish early during repeated policy optimization whenever the corresponding probability ratios shoot beyond a given clipping threshold. Moreover, the updating scale at any sampled state remains fixed even when the policy behavior has been changed significantly at that state, potentially affecting learning performance too.

In view of the above issues, this paper aims to propose a new adaptive approach for clipped policy optimization. For this purpose, using the learning theory originally introduced by TRPO in [Schulman et al., 2015a], we will consider a learning problem at the level of each individual state. After transforming the problem into a Lagrangian and identifying its stationary point, the target for adaptive policy improvement can be obtained. Guided by this target, we will propose a new surrogate learning objective featuring an adaptive clipping mechanism controlled by a hyperparameter λ, enabling us to develop a new RL algorithm called PPO-λ.

PPO-λ is equally simple and efficient as PPO. We will investigate the performance of PPO-λ through experiments. Since PPO-λ finds its root in PPO, which has already been demonstrated to outperform many state-of-the-art algorithms [Schulman et al., 2017], we will focus on comparing PPO-λ with PPO in this paper. Particularly, as evidenced on several Atari game playing tasks and benchmark control tasks in Section 4, PPO-λ can achieve clearly better performance and sample efficiency than PPO. Our results suggest that, in order to solve many difficult RL problems, PPO-λ may be considered as a useful alternative to PPO.

Our discussion of the research background begins with a definition of the RL problem. We then briefly introduce TRPO and its supporting learning theory. Afterwards, the surrogate learning objective for PPO will be presented and analyzed.
The environment for RL is often described as an MDP with a set of states s ∈ S and a set of actions a ∈ A [Sutton and Barto, 1998]. At each time step t, an RL agent observes its current state s_t and performs a selected action a_t, resulting in a transition to a new state s_{t+1}. Meanwhile, a scalar reward r(s_t, a_t) is produced as environmental feedback. Starting from an arbitrary initial state s at time t = 0, the goal for RL is to maximize the expected long-term cumulative reward defined below:

$$V(s) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, s_0 = s\Big] \qquad (1)$$

where γ is a discount factor. To fulfill this goal, an agent must identify a suitable policy π for action selection. Specifically, π_θ(s_t, a_t) is often treated as a parametric function that dictates the probability of choosing any action a_t at every possible state s_t, controlled by parameters θ. Subsequently, V(s) in (1) becomes a function of θ since an agent's behavior depends completely on its policy π_θ. Hence the problem for RL is to learn the optimal policy parameters θ*:

$$\theta^* = \arg\max_{\theta} V^{\pi_{\theta}} \qquad (2)$$

where V^{π_θ} measures the expected performance of using policy π_θ and is defined as

$$V^{\pi_{\theta}} = \sum_s \rho(s)\, V^{\pi_{\theta}}(s)$$

with ρ(s) representing the probability for the RL agent to start interacting with its environment at state s. V^{π_θ}(s) stands for the expected cumulative reward obtainable upon following policy π_θ from state s.

Instead of maximizing V^{π_θ} directly, TRPO is designed to optimize a performance lower bound. To find it, we need to define the state-action value function Q^{π_θ} and the advantage function A^{π_θ} with respect to any policy π_θ:

$$Q^{\pi_{\theta}}(s, a) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, s_0 = s,\ a_0 = a\Big] \qquad (3)$$

$$A^{\pi_{\theta}}(s, a) = Q^{\pi_{\theta}}(s, a) - V^{\pi_{\theta}}(s) \qquad (4)$$

where a_t ∼ π_θ(s_t, ·) for t ≥ 1. Further define the function L_{π_θ}(π_{θ'}) as

$$L_{\pi_{\theta}}(\pi_{\theta'}) = V^{\pi_{\theta}} + \sum_s \rho_{\pi_{\theta}}(s) \sum_a \pi_{\theta'}(s, a)\, A^{\pi_{\theta}}(s, a) \qquad (5)$$

Here ρ_{π_θ}(s) refers to the discounted frequency of visiting every state s while obeying policy π_θ [Sutton et al., 2000]. Consider any two policies π_{θ_old} and π_{θ_new}. Assume that π_{θ_old} is the policy utilized to sample state transition data and π_{θ_new} stands for the policy obtained through optimization based on the sampled data. Then the following performance bound holds [Schulman et al., 2015a]:

$$V^{\pi_{\theta_{new}}} \ge L_{\pi_{\theta_{old}}}(\pi_{\theta_{new}}) - \frac{4\epsilon\gamma}{(1-\gamma)^2}\,\alpha^2 \qquad (6)$$

with ε being a non-negative constant and

$$\alpha = D^*_{TV} = \max_s D_{TV}\big(\pi_{\theta_{old}}(s,\cdot), \pi_{\theta_{new}}(s,\cdot)\big) = \max_s \frac{1}{2}\sum_a \big|\pi_{\theta_{old}}(s,a) - \pi_{\theta_{new}}(s,a)\big|$$

where D_TV is the total variation divergence. Provided that π_{θ_old} and π_{θ_new} are reasonably close, steady policy improvement can be guaranteed during RL based on the performance bound in (6). Since D_TV(π_{θ_old}, π_{θ_new})² ≤ D_KL(π_{θ_old}, π_{θ_new}), where D_KL stands for the KL divergence [Pollard, 2000], during each learning iteration TRPO is designed to solve the following constrained optimization problem:

$$\max_{\theta_{new}} L_{\pi_{\theta_{old}}}(\pi_{\theta_{new}}) \quad \text{s.t.} \quad \bar{D}^{\pi_{\theta_{old}}}_{KL}(\pi_{\theta_{old}}, \pi_{\theta_{new}}) \le \delta \qquad (7)$$

where D̄^{π_θold}_KL is the average KL divergence over all states to be visited upon following policy π_{θ_old}. It heuristically replaces the maximum D*_KL so that an approximate solution of (7) can be obtained efficiently in TRPO. However, it is to be noted that D*_KL should actually serve as the constraint in theory so as to guarantee continued policy improvement according to (6).
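To make the per-state quantities in (5) and (7) concrete, the following minimal Python sketch evaluates the surrogate term Σ_a π_{θ_new}(s, a) A^{π_θold}(s, a) and the KL divergence that TRPO constrains, for a single state with three discrete actions. The probabilities and advantage values are arbitrary illustrative numbers, not taken from the paper.

```python
import numpy as np

pi_old = np.array([0.5, 0.3, 0.2])        # pi_theta_old(s, .) at one sampled state
pi_new = np.array([0.6, 0.25, 0.15])      # a candidate updated policy pi_theta_new(s, .)
advantages = np.array([0.8, -0.2, -0.5])  # A^{pi_theta_old}(s, a) for each action

# Per-state contribution to the surrogate objective L_{pi_old}(pi_new) in (5).
surrogate_gain = np.sum(pi_new * advantages)

# KL divergence D_KL(pi_old(s, .), pi_new(s, .)) that (7) keeps below delta on average.
kl = np.sum(pi_old * np.log(pi_old / pi_new))
print(surrogate_gain, kl)
```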
2.3 Proximal Policy Optimization through a Clipped Surrogate Learning Objective

Because TRPO uses D̄_KL in (7) to guide its policy update, the updated policy may exhibit strong behavioral deviations at some sampled states. In practice, it can be more desirable to clip policy optimization in response to large changes of the probability ratio τ_t(π_{θ_new}) = π_{θ_new}(s_t, a_t) / π_{θ_old}(s_t, a_t). It is straightforward to see that τ_t applies to each state individually. For any state s_t, guarded by a given threshold δ on probability ratios, policy updates at s_t can be penalized through the clipping function below:

$$C(s_t) = \begin{cases} (1+\delta)\, A^{\pi_{\theta_{old}}}_t & A^{\pi_{\theta_{old}}}_t > 0,\ \tau_t > 1+\delta \\ (1-\delta)\, A^{\pi_{\theta_{old}}}_t & A^{\pi_{\theta_{old}}}_t < 0,\ \tau_t < 1-\delta \\ \tau_t(\pi_{\theta_{new}})\, A^{\pi_{\theta_{old}}}_t & \text{otherwise} \end{cases} \qquad (8)$$

Here, for simplicity, we use A^{π_θold}_t to denote A^{π_θold}(s_t, a_t). Upon optimizing C(s_t) in (8) as the surrogate learning objective, penalties apply as long as either of the first two conditions in (8) is satisfied, in which case |τ_t − 1| > δ and the policy behavioral changes at s_t are hence considered too large. Building on C(s_t), PPO can effectively prevent destructively large policy updates and has been shown empirically to achieve impressive learning performance [Schulman et al., 2017].

It is not difficult to see that A^{π_θold}_t often varies significantly at different sampled states s_t. In particular, states with comparatively large values of |A^{π_θold}_t| are deemed important, since policy updates at those states could potentially lead to high performance improvements. However, through repeated updating of policy parameters along the direction of ∂C(s_t)/∂θ_new, τ_t(π_{θ_new}) at some important states s_t can easily go beyond the region (1 − δ, 1 + δ), and hence either of the first two cases in (8) may be satisfied. In such a situation, ∂C(s_t)/∂θ_new = 0 and the policy update at state s_t vanishes. In comparison, policy updates at less important states are less likely to vanish.

Due to the above reason, after multiple epochs of policy updates in a learning iteration (see the PPO algorithm in [Schulman et al., 2017]), less important states tend to have been updated more often than important states. As a consequence, the scale of policy updates may not properly reflect the importance of each sampled state. Meanwhile, we note that, whenever a policy update does not vanish, ∂C(s_t)/∂θ_new is determined by A^{π_θold}_t, which remains fixed during a full learning iteration. In other words, the scale of policy updates stays at the same level across multiple updating epochs. This may not be desirable either since, when π_{θ_new} deviates significantly from π_{θ_old}, further policy updates often lead to undesirable outcomes. Ideally, with every new epoch of successive policy updates, we expect the updating scale to be adaptively reduced.

Motivated by our analysis of C(s_t) and PPO, this paper aims to show that, by employing a new adaptive clipping method for constrained policy optimization, we can effectively improve PS performance and sample efficiency.
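The clipping behavior analyzed above can be reproduced with a few lines of PyTorch-style code. The sketch below uses the standard min-based form of the PPO objective, which is equivalent to the case analysis in (8), and confirms that the gradient vanishes once τ_t leaves the clipping region for a positive advantage; all tensor names are illustrative assumptions.

```python
import torch

def ppo_clipped_surrogate(log_prob_new, log_prob_old, advantages, delta=0.2):
    """Per-state clipped surrogate C(s_t) from Eq. (8), averaged over a minibatch."""
    ratio = torch.exp(log_prob_new - log_prob_old)                 # tau_t(pi_theta_new)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - delta, 1.0 + delta) * advantages
    return torch.min(unclipped, clipped).mean()

# Toy check: once tau_t moves past 1 + delta for a positive advantage,
# the surrogate no longer depends on theta_new and its gradient is zero.
log_prob_new = torch.tensor([0.5], requires_grad=True)             # stands in for pi_theta_new
log_prob_old = torch.tensor([0.0])                                 # pi_theta_old
advantage = torch.tensor([1.0])                                    # A_t > 0
loss = -ppo_clipped_surrogate(log_prob_new, log_prob_old, advantage, delta=0.2)
loss.backward()
print(log_prob_new.grad)                                           # tensor([0.]) -> vanished update
```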
While developing the new adaptive approach for clipped policy updating, we wish to achieve three objectives: (O1) similar to PPO, policy behavioral changes should be controlled at the level of individual states; (O2) in comparison to important states, the policy being learned is expected to exhibit smaller changes at less important states; and (O3) the policy updating scale should be reduced adaptively with each successive epoch in a learning iteration. Driven by these three objectives, we decide to study a learning problem slightly different from that considered by TRPO. For this purpose, we note that, although D_KL(π_{θ_old}, π_{θ_new}) ≠ D_KL(π_{θ_new}, π_{θ_old}), it is straightforward to see that D_KL(π_{θ_new}, π_{θ_old}) ≥ D_TV(π_{θ_new}, π_{θ_old})² = D_TV(π_{θ_old}, π_{θ_new})². Therefore, the performance bound in (6) remains valid with α² replaced by

$$D^*_{KL} = \max_s D_{KL}\big(\pi_{\theta_{new}}(s, \cdot), \pi_{\theta_{old}}(s, \cdot)\big)$$

Hence, if the maximum KL divergence across all states can be bounded properly, we can guarantee policy improvement in each learning iteration. In view of this and objective O1, it is sensible to concentrate on local policy optimization at every sampled state first and then combine the policy updates from all sampled states together. At any specific state s, maximizing L_{π_θold}(π_{θ_new}) in (5) is equivalent to maximizing Σ_a π_{θ_new}(s, a) A^{π_θold}(s, a). As a consequence, the learning problem for PPO-λ at state s can be defined as

$$\max_{\theta_{new}} \sum_a \pi_{\theta_{new}}(s, a)\, A^{\pi_{\theta_{old}}}(s, a) \quad \text{s.t.} \quad D_{KL}\big(\pi_{\theta_{new}}(s, \cdot), \pi_{\theta_{old}}(s, \cdot)\big) \le \delta \qquad (9)$$

Different from TRPO, which considers D_KL(π_{θ_old}, π_{θ_new}) as its optimization constraint, we consider D_KL(π_{θ_new}, π_{θ_old}) in (9) instead. The Lagrangian of our learning problem is

$$\mathcal{L} = \sum_a \pi_{\theta_{new}}(s, a)\, A^{\pi_{\theta_{old}}}(s, a) - \lambda \left( \sum_a \pi_{\theta_{new}}(s, a) \log \frac{\pi_{\theta_{new}}(s, a)}{\pi_{\theta_{old}}(s, a)} - \delta \right) \qquad (10)$$

where the hyperparameter λ stands for the Lagrange multiplier. It controls the penalty due to the violation of the local policy optimization constraint. According to [Gelfand et al., 2000], we can identify the stationary point of the Lagrangian by solving the following Euler-Lagrange equation:

$$\frac{\partial \mathcal{L}}{\partial \pi_{\theta_{new}}(s, a)} = A^{\pi_{\theta_{old}}}(s, a) - \lambda - \lambda \log \pi_{\theta_{new}}(s, a) + \lambda \log \pi_{\theta_{old}}(s, a) = 0 \qquad (11)$$

Under the conditions that

$$\forall s \in S: \quad \sum_a \pi_{\theta_{old}}(s, a) = 1, \quad \sum_a \pi_{\theta_{new}}(s, a) = 1$$

the equation in (11) can be solved immediately. As a matter of fact, it can be easily verified that

$$\pi^*_{\theta_{new}}(s, a) \propto \pi_{\theta_{old}}(s, a)\, e^{A^{\pi_{\theta_{old}}}(s, a) / \lambda} \qquad (12)$$

where π*_{θ_new} represents the target for adaptive policy improvement in PPO-λ. To guide the updating of π_{θ_new} towards the corresponding learning target π*_{θ_new} in (12), we can quantify the difference between π_{θ_new} and π*_{θ_new} through the KL divergence

$$D_{KL}\big(\pi_{\theta_{new}}(s, \cdot), \pi^*_{\theta_{new}}(s, \cdot)\big) = \sum_a \pi_{\theta_{new}}(s, a) \log \frac{\pi_{\theta_{new}}(s, a)}{\pi^*_{\theta_{new}}(s, a)} \qquad (13)$$
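As an illustration of the adaptive target in (12), the short sketch below computes π*_{θ_new}(s, ·) for a discrete action distribution by reweighting π_{θ_old} with exp(A/λ) and renormalizing; the function name and the example numbers are our own illustrative choices.

```python
import numpy as np

def adaptive_target(pi_old, advantages, lam):
    """Eq. (12): pi* proportional to pi_old * exp(A / lambda), renormalized over actions."""
    unnormalized = pi_old * np.exp(advantages / lam)
    return unnormalized / unnormalized.sum()

pi_old = np.array([0.25, 0.25, 0.5])          # current behavior policy at one state
advantages = np.array([1.0, -0.5, 0.0])       # A^{pi_theta_old}(s, a) for each action

# A larger lambda keeps the target close to pi_old (small behavioral change),
# while a smaller lambda moves it aggressively towards high-advantage actions.
print(adaptive_target(pi_old, advantages, lam=5.0))
print(adaptive_target(pi_old, advantages, lam=0.5))
```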
For the purpose of minimizing D_KL(π_{θ_new}, π*_{θ_new}) in (13) through updating θ_new, the derivative below can be utilized by the PPO-λ algorithm:

$$\frac{\partial D_{KL}(\pi_{\theta_{new}}, \pi^*_{\theta_{new}})}{\partial \theta_{new}} = \sum_a \frac{\partial \pi_{\theta_{new}}(s, a)}{\partial \theta_{new}} \log \frac{\pi_{\theta_{new}}(s, a)}{\pi^*_{\theta_{new}}(s, a)} \qquad (14)$$

Because the summation over all actions a ∈ A in (14) can be cumbersome to perform in practice, especially when the RL problem allows a huge collection of alternative actions, we choose to adopt the same importance sampling technique used by TRPO, such that

$$\frac{\partial D^t_{KL}}{\partial \theta_{new}} \approx \frac{1}{\pi_{\theta_{old}}(s_t, a_t)} \frac{\partial \pi_{\theta_{new}}(s_t, a_t)}{\partial \theta_{new}} \log \frac{\pi_{\theta_{new}}(s_t, a_t)}{\pi^*_{\theta_{new}}(s_t, a_t)} = \frac{\partial \tau_t}{\partial \theta_{new}} \log \frac{\pi_{\theta_{new}}(s_t, a_t)}{\pi^*_{\theta_{new}}(s_t, a_t)} \qquad (15)$$

where D^t_KL refers to the KL divergence between π_{θ_new} and π*_{θ_new} at time t with sampled state s_t and action a_t. It is interesting to note that the same technique is also utilized in PPO, which updates θ_new at any time t (ignoring value function and entropy related updates for the moment) in the direction of

$$\frac{\partial \tau_t}{\partial \theta_{new}} A^{\pi_{\theta_{old}}}(s_t, a_t) \qquad (16)$$

Notice that, in the first updating epoch of every learning iteration (see Algorithm 1), θ_new = θ_old, so substituting (12) into (15) gives

$$\frac{\partial D^t_{KL}}{\partial \theta_{new}} \approx -\frac{1}{\lambda} \frac{\partial \tau_t}{\partial \theta_{new}} A^{\pi_{\theta_{old}}}(s_t, a_t) \qquad (17)$$

In view of (16) and (17), to ensure that the updating step size of PPO-λ is at the same level as PPO, it is natural for us to update θ_new according to λ ∂D^t_KL/∂θ_new in PPO-λ. Meanwhile, we must penalize large policy updates through clipping. Therefore, we propose to adopt a new surrogate learning objective at any time t as

$$\hat{D}^t_{KL} = \begin{cases} (1+\delta) \log \dfrac{\pi^t_{\theta_{new}}}{\pi^{*,t}_{\theta_{new}}} & A^{\pi_{\theta_{old}}}_t > 0,\ \tau_t > 1+\delta \\[2mm] (1-\delta) \log \dfrac{\pi^t_{\theta_{new}}}{\pi^{*,t}_{\theta_{new}}} & A^{\pi_{\theta_{old}}}_t < 0,\ \tau_t < 1-\delta \\[2mm] \tau_t(\pi_{\theta_{new}}) \log \dfrac{\pi^t_{\theta_{new}}}{\pi^{*,t}_{\theta_{new}}} & \text{otherwise} \end{cases} \qquad (18)$$

In comparison to (8), we can spot two advantages of using (18). First, at a less important state s_t, π^{*,t}_{θ_new}/π^t_{θ_old} will fall within (1 − δ, 1 + δ) for a large enough λ in (12). Consequently, repeated updates at s_t will not push π^t_{θ_new} towards the clipping boundary and are expected to produce relatively small policy behavioral changes. For this reason, objective O2 is achieved. Furthermore, after each updating epoch, we can expect π^t_{θ_new} to become closer to π^{*,t}_{θ_new}, thereby reducing the absolute value of log(π^t_{θ_new}/π^{*,t}_{θ_new}) in (18). Accordingly, policy updates will adaptively reduce their scale with successive updating epochs. Objective O3 is hence realized.

While training an NN architecture that shares parameters between the policy and the value function through PPO-λ, our surrogate learning objective must take into account a value function error term and an entropy bonus term [Schulman et al., 2017]. As a consequence, at any time t, θ_new can be updated according to the learning rule below:

$$\theta_{new} \leftarrow \theta_{new} - \beta \lambda \frac{\partial \hat{D}^t_{KL}}{\partial \theta_{new}} - \beta c_1 \frac{\partial E_V(s_t)}{\partial \theta_{new}} + \beta c_2 \frac{\partial S(s_t)}{\partial \theta_{new}} \qquad (19)$$

where c_1 and c_2 are two fixed coefficients and β is the learning rate. E_V(s_t) stands for the TD error of the value function at state s_t. S(s_t) represents the entropy of the currently learned policy π_{θ_new} at s_t. The learning rule in (19) requires estimation of the advantage function A^{π_θold}_t. For this purpose, identical to PPO, PPO-λ employs the generalized advantage estimation technique introduced in [Schulman et al., 2015b].
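The surrogate in (18) and the policy portion of the update rule (19) admit a compact implementation. The sketch below is one literal reading of the equations under the simplifying assumption, made by us, that the per-state normalizer of π*_{θ_new} in (12) can be dropped; it is not the authors' code, and every name in it is illustrative.

```python
import torch

def ppo_lambda_surrogate(log_prob_new, log_prob_old, advantages, lam=1.0, delta=0.2):
    """A literal reading of Eq. (18), averaged over a minibatch of (s_t, a_t) samples.

    log(pi_new / pi*_new) follows Eq. (12): log pi*_new = log pi_old + A/lam - log Z(s).
    The per-state normalizer log Z(s) is omitted here for simplicity; this is our own
    approximation rather than something stated in the paper.
    """
    ratio = torch.exp(log_prob_new - log_prob_old)                     # tau_t(pi_theta_new)
    log_new_over_target = log_prob_new - log_prob_old - advantages / lam
    upper = (advantages > 0) & (ratio > 1.0 + delta)                   # first case of (18)
    lower = (advantages < 0) & (ratio < 1.0 - delta)                   # second case of (18)
    coeff = torch.where(upper, torch.full_like(ratio, 1.0 + delta),
                        torch.where(lower, torch.full_like(ratio, 1.0 - delta), ratio))
    return (coeff * log_new_over_target).mean()

# Policy part of the update rule (19): take a minibatch SGD step of size beta on
# lam * ppo_lambda_surrogate(...); the value-error and entropy terms of (19) would be
# added to this loss in a full agent.
```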
In practice, based on the learning rule in (19), we update the policy parameters θ_new with minibatch SGD over multiple time steps of data. This is directly achieved by using the Adam algorithm [Kingma and Ba, 2014]. In the meantime, we should prevent large behavioral deviations of π_{θ_new} from π_{θ_old} for reliable learning. Particularly, following the technique adopted by the PPO implementation in OpenAI baselines, the threshold on probability ratios δ is linearly decremented from its initial value δ_1 to 0 as learning approaches its completion. Accordingly, the learning rate β also decrements linearly from its initial rate to 0. In line with this linear decrement strategy, λ must be adjusted in every learning iteration to continually support objective O2. Notice that, in the PPO implementation, the estimated advantage function values are normalized according to (20) at the beginning of each learning iteration [Schulman et al., 2017]:

$$\tilde{A}^{\pi_{\theta_{old}}} = \frac{A^{\pi_{\theta_{old}}} - \bar{A}^{\pi_{\theta_{old}}}}{\sigma_{A^{\pi_{\theta_{old}}}}} \qquad (20)$$

where Ā^{π_θold} is the mean of the advantage function values in a minibatch and σ_{A^{π_θold}} gives the corresponding standard deviation. We assume that the normalized advantage function values in the minibatch follow the standard normal distribution. Subsequently, to ensure that the same proportion of sampled states have their policy improvement target staying on the clipping boundary across all learning iterations, λ is determined as

$$\lambda_n = \lambda_1 \frac{\log(\delta_1 + 1)}{\log(\delta_n + 1)} \qquad (21)$$

where λ_n and δ_n stand respectively for the λ and δ to be used in the n-th learning iteration. Our discussion above is summarized in Algorithm 1. It should be clear now that PPO-λ achieves all three objectives highlighted at the beginning of this section. In comparison to PPO, PPO-λ is equally simple in design. It can be implemented easily by applying only minor changes to the PPO implementation. As for operational efficiency, PPO and PPO-λ are indistinguishable too.
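The annealing logic just described can be sketched as follows; the linear schedule shape and the default values (delta_init, lambda_init, total_iterations) are illustrative assumptions consistent with the linear decrement strategy and Eq. (21), not settings taken from the paper.

```python
import numpy as np

def normalized_advantages(advantages):
    """Eq. (20): standardize advantage estimates within a minibatch (small epsilon for safety)."""
    return (advantages - advantages.mean()) / (advantages.std() + 1e-8)

def clip_threshold(n, total_iterations, delta_init=0.2):
    """Linear decrement of the probability-ratio threshold delta towards 0."""
    return delta_init * (1.0 - n / total_iterations)

def lambda_schedule(n, total_iterations, lambda_init=1.0, delta_init=0.2):
    """Eq. (21): rescale lambda so the same proportion of states reach the clipping boundary."""
    delta_n = clip_threshold(n, total_iterations, delta_init)
    return lambda_init * np.log(delta_init + 1.0) / np.log(delta_n + 1.0)

# Example: lambda grows as delta shrinks over the course of training.
for n in [0, 2500, 4999]:
    print(n, clip_threshold(n, 5000), lambda_schedule(n, 5000))
```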
Algorithm 1 PPO-λ

Require: an MDP ⟨S, A, P, R, γ⟩; n: the number of learning iterations; p: the number of actors; T: the horizon; K: the number of epochs; M: the minibatch size; λ: the Lagrange multiplier.

1. Repeat for n learning iterations:
2.   Repeat for each of the p actors:
3.     Perform a rollout with π_{θ_old} for T time steps.
4.     Compute the advantage estimates Â_1, ..., Â_T.
5.   Compute λ according to (21).
6.   Repeat for K epochs:
7.     Perform a minibatch SGD update of θ_new based on (19) and M samples.
8.     Repeat step 7 until p × T samples have been used.
9.   θ_old ← θ_new.
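A compact structural rendering of Algorithm 1 is given below. The rollout and update routines are stubs standing in for the components described earlier (GAE-based advantage estimation and the minibatch SGD step on (19)), and all names and loop sizes are illustrative assumptions.

```python
import numpy as np

def collect_rollout(horizon):
    """Placeholder for running pi_theta_old in the environment for T time steps."""
    states = np.random.randn(horizon, 4)
    advantages = np.random.randn(horizon)          # stands in for GAE estimates
    old_log_probs = np.random.randn(horizon)
    return states, advantages, old_log_probs

def minibatch_update(rollouts, sample_ids, lam):
    """Placeholder for one SGD step on Eq. (19) with M samples indexed by sample_ids."""
    pass

def lambda_schedule(n, total_iters, lambda_init=1.0, delta_init=0.2):
    delta_n = delta_init * (1.0 - n / total_iters)
    return lambda_init * np.log(delta_init + 1.0) / np.log(delta_n + 1.0)

num_iterations, num_actors, horizon, epochs, minibatch_size = 3, 8, 128, 4, 128

for n in range(num_iterations):                                      # step 1
    rollouts = [collect_rollout(horizon) for _ in range(num_actors)]  # steps 2-4
    lam = lambda_schedule(n, num_iterations)                          # step 5
    sample_ids = np.random.permutation(num_actors * horizon)
    for _ in range(epochs):                                           # step 6
        for start in range(0, len(sample_ids), minibatch_size):       # steps 7-8
            minibatch_update(rollouts, sample_ids[start:start + minibatch_size], lam)
    # step 9: theta_old <- theta_new would copy the updated policy parameters here.
```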
In this section, we experimentally compare PPO-λ and PPO on six benchmark Atari game playing tasks provided by OpenAI Gym [Bellemare et al., 2015; Brockman et al., 2016] and four benchmark control tasks provided by the robotics RL environment of PyBullet [Coumans and Bai, 2018]. First of all, the performance of both algorithms is examined based on the learning curves presented in Figure 3 and Figure 4. Afterwards, we compare the sample efficiency of PPO-λ and PPO using the performance scores summarized in Table 1 and Table 2.

In all our experiments, both algorithms adopt the same policy network architecture given in [Mnih et al., 2015] for the Atari game playing tasks and the same network architecture given in [Schulman et al., 2017; Duan et al., 2016] for the benchmark control tasks. Meanwhile, we follow strictly the hyperparameter settings used in [Schulman et al., 2017] for both PPO-λ and PPO. The only exception is the minibatch size, which is set to 256 in [Schulman et al., 2017] but equals 128 in our experiments for the Atari game playing tasks. This is because 8 actors are employed by PPO-λ and PPO while learning to play Atari games, and each actor samples 128 time steps of data for every learning iteration. When the minibatch size is 128, one minibatch SGD update of the policy network can be performed by one actor, enabling easy parallelization.

In addition to the hyperparameters used in PPO, PPO-λ requires an extra hyperparameter λ. In our experiments, the initial value of λ is fixed without any parameter tuning. We also tested several different settings for λ, ranging from 0.1 to 5.0, but did not observe significant changes in performance. Based on the initial value, the actual value of λ to be utilized in each learning iteration is further determined by (21).

We obey primarily the default settings for all Atari games which are made available in the latest distribution of OpenAI baselines on GitHub. However, two minor changes have been introduced by us and will be briefly explained here. At each time step, following the DeepMind setting [Mnih et al., 2015], four consecutive frames of an Atari game must be combined together into one frame that serves subsequently as the input to the policy network. In OpenAI baselines, the combination is achieved by taking the maximum value of every pixel over the sampled frames. The snapshots of the first five sampled steps of the Pong game under this setting are presented in Figure 1. Clearly, when the control paddle is moving fast, the visible size of the paddle in the combined frame will be larger than its actual size. As a result, an agent playing this game will not be able to know for sure that its observed paddle can catch an approaching ball. To address this issue, in our experiments, the value of each pixel in the combined frame is obtained by taking the average of the same pixel over the sampled frames. The snapshots of the Pong game under this new setting can be found in Figure 2. Using Figure 2, the agent can easily determine which part of the visible paddle can catch a ball with certainty (i.e., the part of the paddle with the highest brightness).

Figure 1: The snapshots of the first five steps taken in the Pong game when using the max pixel values of sampled frames.

Figure 2: The snapshots of the first five steps taken in the Pong game when using the mean pixel values of sampled frames.
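The two frame-combination rules contrasted in Figures 1 and 2 can be expressed in a couple of numpy operations; the frame shape below is an illustrative assumption.

```python
import numpy as np

# Four consecutive grayscale game frames (84x84 is a common preprocessing size,
# used here only for illustration).
frames = np.random.randint(0, 256, size=(4, 84, 84), dtype=np.uint8)

# Default OpenAI baselines rule described above: per-pixel maximum over the sampled
# frames, which enlarges fast-moving objects such as the Pong paddle.
combined_max = frames.max(axis=0)

# Alternative rule discussed above: per-pixel mean, so faster-moving parts of an
# object appear dimmer and its true extent stays visible.
combined_mean = frames.mean(axis=0).astype(np.uint8)
```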
We found that, in OpenAI baselines, every Atari game by default offers clipped rewards, i.e., the reward of performing any action is either -1, 0, or +1. Since advantage function values are normalized in both PPO-λ and PPO, further clipping the rewards may not be necessary. For this reason, we do not clip rewards in any of our experiments.

Figure 3: Average total rewards per episode obtained by PPO-λ and PPO on six Atari games, i.e. Enduro, BankHeist, Boxing, Freeway, Pong, and Seaquest.

Figure 4: Average total rewards per episode obtained by PPO-λ and PPO on four benchmark control tasks, i.e. Hopper, Humanoid, Inverted Double Pendulum, and Walker2D.

Table 1: Sample efficiency scores (fast learning and final performance) obtained by PPO-λ and PPO on six Atari games.
Table 2: Sample efficiency scores (fast learning and final performance) obtained by PPO-λ and PPO on four benchmark control tasks.

To compare the performance of PPO-λ and PPO, we present the learning curves of the two algorithms on six Atari games in Figure 3 and on four benchmark control tasks in Figure 4. As can be clearly seen in these figures, PPO-λ outperforms PPO on five out of the six games (i.e., BankHeist, Boxing, Freeway, Pong and Seaquest) and on two out of the four benchmark control tasks (i.e., Humanoid and Walker2D). On the Enduro game, PPO-λ performed equally well as PPO. On some games such as Seaquest, the two algorithms exhibited similar performance within the first 5000 learning iterations; however, PPO-λ managed to achieve better performance towards the end of the learning process. On other games such as Boxing, Freeway and Pong, the performance differences can be witnessed shortly after learning started. It is also interesting to note that PPO-λ did not perform clearly worse than PPO on any benchmark control task.

To compare PPO-λ and PPO in terms of their sample efficiency, we adopt the two scoring metrics introduced in [Schulman et al., 2017]: (1) the average reward per episode over the entire training period, which measures fast learning, and (2) the average reward per episode over the last 10 training iterations, which measures final performance. As evidenced in Table 1 and Table 2, PPO-λ is clearly more sample efficient than PPO on five out of the six Atari games (with Enduro as the exception) and on two out of the four benchmark control tasks, based on their respective scores.

Motivated by the usefulness of the simple clipping mechanism in PPO for efficient and effective RL, this paper aimed to develop a new adaptive clipping method to further improve the performance of state-of-the-art deep RL algorithms. Particularly, through analyzing the clipped surrogate learning objective adopted by PPO, we found that repeated policy updates in PPO may not adaptively improve learning performance in accordance with the importance of each sampled state. To address this issue, we introduced a new constrained policy learning problem at the level of individual states. Based on the solution to this problem, we proposed a new theoretical target for adaptive policy update, which enabled us to develop the new PPO-λ algorithm. PPO-λ is equally simple and efficient in design as PPO. By controlling a hyperparameter λ, we can effectively adapt policy updates according to the significance of each sampled state and therefore enhance learning reliability. Moreover, our empirical study on six Atari games and four benchmark control tasks also showed that PPO-λ can achieve better performance and higher sample efficiency than PPO in practice.

In the future, it will be interesting to study the effectiveness of PPO-λ on many real-world applications, including resource scheduling in large computer networks. It will also be interesting to explore the potential use of our adaptive clipping mechanism in A3C, ACKTR and other cutting-edge RL algorithms. Meanwhile, more effort must be spent to better understand the relationship between λ and learning performance, which may give rise to more sample efficient algorithms.

References

[Arulkumaran et al., 2017] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath. A brief survey of deep reinforcement learning. CoRR, abs/1708.05866, 2017.
[Barreto et al., 2017] A. Barreto, W. Dabney, R. Munos, J. Hunt, T. Schaul, D. Silver, and H. P. van Hasselt. Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems, pages 4058–4068, 2017.

[Bellemare et al., 2015] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The Arcade Learning Environment: an evaluation platform for general agents (extended abstract). In IJCAI, 2015.

[Bhatnagar et al., 2009] S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee. Natural actor-critic algorithms. Automatica, 45(11):2471–2482, 2009.

[Brockman et al., 2016] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. arXiv preprint, June 2016.

[Coumans and Bai, 2018] E. Coumans and Y. Bai. PyBullet, a Python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2018.

[Duan et al., 2016] Y. Duan, X. Chen, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. arXiv preprint, 2016.

[Gelfand et al., 2000] I. M. Gelfand, R. A. Silverman, et al. Calculus of Variations. Courier Corporation, 2000.

[Gu et al., 2017] S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, B. Schölkopf, and S. Levine. Interpolated policy gradient: merging on-policy and off-policy gradient estimation for deep reinforcement learning. arXiv preprint arXiv:1706.00387, 2017.

[Haarnoja et al., 2017] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.

[Hausknecht and Stone, 2016] M. Hausknecht and P. Stone. On-policy vs. off-policy updates for deep reinforcement learning. In Deep Reinforcement Learning: Frontiers and Challenges, IJCAI 2016 Workshop, 2016.

[Kingma and Ba, 2014] D. Kingma and J. Ba. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[Lillicrap et al., 2015] T. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[Mnih et al., 2015] V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[Mnih et al., 2016] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.

[Nachum et al., 2017] O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pages 2772–2782, 2017.

[Pollard, 2000] D. Pollard. Asymptopia: An Exposition of Statistical Asymptotic Theory. 2000.

[Schulman et al., 2015a] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1889–1897, 2015.

[Schulman et al., 2015b] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.

[Schulman et al., 2017] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[Sutton and Barto, 1998] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[Sutton et al., 2000] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12 (NIPS 1999), pages 1057–1063. MIT Press, 2000.

[Szita and Lőrincz, 2006] I. Szita and A. Lőrincz. Learning Tetris using the noisy cross-entropy method. Neural Computation, 18(12), 2006.

[Tang et al., 2017] H. Tang, R. Houthooft, D. Foote, A. Stooke, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel. #Exploration: a study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 2750–2759, 2017.

[Vezhnevets et al., 2017] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.

[Wang et al., 2015] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.

[Wang et al., 2016] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.

[Wu et al., 2017] Y. Wu, E. Mansimov, R. B. Grosse, S. Liao, and J. Ba. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in Neural Information Processing Systems, 2017.