Towards Combining On-Off-Policy Methods for Real-World Applications
Kai-Chun Hu, Chen-Huan Pi, Ting Han Wei, I-Chen Wu, Stone Cheng, Yi-Wei Dai, Wei-Yuan Ye
Department of Applied Mathematics, National Chiao Tung University
Department of Mechanical Engineering, National Chiao Tung University
Department of Computer Science, National Chiao Tung University

April 25, 2019
Abstract
In this paper, we point out a fundamental property of the objective in reinforcement learning, with which we can reformulate the policy gradient objective into a perceptron-like loss function, removing the need to distinguish between on- and off-policy training. Namely, we posit that it is sufficient to only update a policy π for cases that satisfy the condition A(π/µ − 1) ≤ 0, where A is the advantage and µ is another policy. Furthermore, we show via theoretic derivation that a perceptron-like loss function matches the clipped surrogate objective for PPO. With our new formulation, the policies π and µ can be arbitrarily far apart in theory, effectively enabling off-policy training. To examine our derivations, we combine the on-policy PPO clipped surrogate (which we show to be equivalent to one instance of the new reformulation) with the off-policy IMPALA method. We first verify the combined method on the OpenAI Gym pendulum toy problem. Next, we use our method to train a quadrotor position controller in a simulator. Our trained policy is efficient and lightweight enough to run on a low-cost micro-controller at a minimum update rate of 500 Hz. For the quadrotor, we show two experiments to verify our method and demonstrate performance: 1) hovering at a fixed position, and 2) tracking along a specific trajectory. In preliminary trials, we are also able to apply the method to a real-world quadrotor.

1 Introduction

Reinforcement learning (RL), and more recently deep reinforcement learning (DRL), has helped put machine learning in the spotlight in recent years [1, 2, 3]. Of the many approaches to solving RL and DRL problems, policy gradient methods are a family of model-free algorithms that are widely used and well established [4, 5]. Methods that fall into this category seek to maximize some performance objective with respect to the parameters of a parameterized policy, typically through gradient ascent.

Despite the rigorous theory behind policy gradients in RL, there are a number of practical issues involved with policy gradients in DRL [6, 7]. Within the scope of this paper, we focus our analysis on two major DRL improvements for policy gradients: the first being Trust Region Policy Optimization (TRPO) [8], and the second being Proximal Policy Optimization (PPO) [9].

TRPO builds upon previous work by first establishing that, by optimizing a surrogate function, the parameterized policy is also guaranteed to improve. Next, a trust region is used to confine updates so that the step sizes can be large enough for rapid convergence, but small enough that each update is guaranteed to monotonically improve on the current policy. Building on the success of TRPO, PPO aims to improve data sampling efficiency while also simplifying the computationally intensive TRPO with a clipped surrogate function. Both TRPO and PPO are discussed in more detail in subsection 2.2.

While TRPO and PPO are used widely and successfully [10, 11, 12, 13, 14], a recent paper by Ilyas et al. [7] investigates how well these DRL methods match the theories for fundamental policy gradient methods. Namely, it is not entirely clear why, in modern problems with high-dimensional (or even continuous) state-action spaces, a relatively small sample of trajectories can often be sufficient to train strong policies, especially when policy gradient derivations are based on finite Markov decision processes.
We summarize Ilyas et al.'s findings in section 3.

Inspired by Ilyas et al.'s work, we devise a novel update method, resulting in a more simplified perspective on PPO. This simplification allows us to treat PPO as an off-policy method, so that it can be combined with IMPALA [15], another off-policy method. We first verify our combined method with the simple OpenAI Gym pendulum problem [16], which we describe in detail in subsection 4.1. Next, we use our method to train a quadrotor controller in a simulator. Since our method is a simplification of PPO, the trained policy is efficient and lightweight enough to run on a low-cost micro-controller at a minimum update rate of 500 Hz. We were able to train policies for two tasks: hovering at a fixed position, and tracking along a specified trajectory. Quadrotor experiment results will be discussed in section 5. Preliminary tests show that the resulting model, with a few modifications, can be applied to a real-world quadrotor.
In this section, we list the notation for the terms we will use in this paper in subsection 2.1. Our notation is based mostly on two sources [4, 8]. We then briefly review the relevant background knowledge for our proposed method in subsection 2.2.
The standard reinforcement learning framework consists of a learning agent interacting with a Markov decision process (MDP), which can be defined by the tuple (S, A, P^a_{ss'}, R^a_s, ρ_0, γ), where S is the state space, A is the action space, P^a_{ss'} = Pr(s_{t+1} = s' | s_t = s, a_t = a) is the transition probability density of the environment, R^a_s = Pr(r_{t+1} | s_t = s, a_t = a), ∀ s, s' ∈ S, a ∈ A, is the reward distribution, ρ_0 : S → R+ is the distribution of the initial state s_0, and γ ∈ (0, 1) is the discount factor. The agent chooses an action at each time step according to the policy π(a|s; θ) = Pr(a|s; θ). In the modern deep reinforcement learning approach, the policy is represented by a neural network and its weights θ, where θ ∈ R^{N_θ}. We will refer to the policy simply as π(a|s) for the remainder of this paper.

Let η be a function that measures the expected discounted reward with respect to π, formulated as

η(π) = E[ Σ_{t≥1} γ^{t−1} r_t | s_0 ∼ ρ_0, π ].   (1)

We also use the standard definitions of the state value function V^π(s), the state-action value function Q^π(s, a), and the advantage function A^π(s, a), written as follows:

V^π(s) = E[ Σ_{l≥0} γ^l r_{t+l} | s_t = s, π ]   (2)
Q^π(s, a) = E[ r_t + γ V^π(s_{t+1}) | s_t = s, a_t = a, π ]   (3)
A^π(s, a) = Q^π(s, a) − V^π(s)   (4)

The following is the expected return for an agent following the policy π, expressed in terms of the advantage for another policy µ [8]:

η(π) = η(µ) + Σ_s ρ_π(s) Σ_a π(a|s) A^µ(s, a),   (5)

where ρ_π(s) = Σ_{t≥0} γ^t Pr(s_t = s | π) is the discounted weighting of states encountered starting at s_0 ∼ ρ_0(s), then following the policy π at each time step.

2.2 Background

The policy gradient idea was first proposed by Sutton et al. [4] as an approach where the policy can be expressed as a function approximator π(a|s), which is different from previous value-based RL theory in that it is independent of the various forms of value functions. Consequently, the policy can be improved by updating strictly according to the gradient of the expected reward with respect to the policy's parameters, as follows:

θ ← θ + α ∂η(π)/∂θ.   (6)

The so-called policy gradient theorem, derived from Equation 1, states that the gradient of the expected reward can be estimated using a state-action value (Equation 7) or advantage function (Equation 8) approximator:

∂η(π)/∂θ = Σ_{s∈S} ρ_π(s) Σ_{a∈A} (∂π(a|s)/∂θ) Q^π(s, a)   (7)

∂η(π)/∂θ = Σ_{s∈S} ρ_π(s) Σ_{a∈A} (∂π(a|s)/∂θ) (Q^π(s, a) − V^π(s))
         = Σ_{s∈S} ρ_π(s) Σ_{a∈A} (∂π(a|s)/∂θ) A^π(s, a)   (8)

Since the proof for the theorem is covered in detail in Sutton's original policy gradient paper [4], we will not repeat it here. We refer to these two formulations as the on-policy policy gradient in this paper, because the gradient direction ∂π/∂θ is weighted by the two terms ρ_π(s) and Q^π(s, a), where the current policy π is necessary in order to arrive at an acceptable estimation of ρ_π through sampling.
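To make Equations 2-4 and the sampled form of Equation 8 concrete, the following sketch (our own illustration, not code from this paper; the tabular softmax parameterization and the use of Monte-Carlo returns as the advantage estimate are assumptions) accumulates a likelihood-ratio estimate of the policy gradient from a single trajectory collected by following π.

import numpy as np

def softmax_policy(theta, s):
    # pi(a|s) for a tabular softmax parameterization theta[s, a]
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

def policy_gradient_estimate(theta, trajectory, values, gamma=0.99):
    # Sample-based estimate of Equation 8 using the log-derivative trick,
    # with V(s) as a baseline so that A(s, a) ~ G_t - V(s).
    # trajectory: list of (s, a, r) tuples collected by following pi.
    grad = np.zeros_like(theta)
    returns, G = [], 0.0
    for (_, _, r) in reversed(trajectory):   # discounted return G_t, backwards
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for (s, a, _), G_t in zip(trajectory, returns):
        pi = softmax_policy(theta, s)
        advantage = G_t - values[s]
        dlogpi = -pi                          # d log pi(a|s) / d theta[s, :]
        dlogpi[a] += 1.0
        grad[s] += dlogpi * advantage
    return grad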
The first off-policy policy gradient method was proposed by Degris et al. [17]. In the context of off-policy learning, the original policy π is referred to as the target policy, whereas another policy µ, referred to as the behavior policy, is used to collect data for training. Ideally, we want to optimize according to Equation 5; however, ρ_π(s) cannot be easily evaluated in an off-policy setting. Instead, we consider an approximate gradient defined by

ĝ = Σ_{s∈S} ρ_µ(s) Σ_{a∈A} (∂π(a|s)/∂θ) Q^π(s, a).   (9)

Degris et al. [17] state that the policy improvement theorem guarantees that an update along this approximate gradient's direction is not descending. However, regardless of whether an off-policy or an on-policy method is used, it can be quite difficult to apply classical optimization techniques, since any estimation of the true value of η(π) can be extremely expensive. For example, when performing, say, a line search [18] to determine the update step size, η(π) needs to be evaluated many times.

To address this problem, Schulman et al. [8] investigated using the trust region method to obtain the largest update step size while guaranteeing gradient ascent. The basic idea of the resulting TRPO method is to define a linear approximation of the main objective:

L_µ(π) = η(µ) + Σ_{s∈S} ρ_µ(s) Σ_{a∈A} π(a|s) A^µ(s, a).   (10)

Theoretically, the estimation error can be bounded by β D^max_KL(µ, π), where

β = 4 max|A^µ| γ / (1 − γ)².   (11)

We can then say that if

Σ_{s∈S} ρ_µ(s) Σ_{a∈A} π(a|s) A^µ(s, a) − β D^max_KL(µ, π) ≥ 0,   (12)

then

η(π) ≥ η(µ).   (13)

That is, the expected reward for the target policy is higher than that of the behavior policy. Therefore, by maximizing the left-hand term in Equation 12, we are guaranteed to improve the true objective. Note that in the original derivations by Schulman et al. [8], µ and π need to be sufficiently similar for this guarantee to hold. From this perspective, while we do refer to policies π and µ, TRPO is in practice an on-policy method, where π can be thought of as an incremental improvement over µ after one policy gradient update.

However, it is difficult to directly optimize Equation 12, so the following heuristic approximation is used instead:

maximize L_µ(π)   s.t.   D̄^ρ_KL(µ, π) ≤ ε,   (14)

where D̄^ρ_KL(µ, π) = E_{s∼µ}[ D_KL(µ ‖ π) ].

The next algorithm, PPO, keeps the first-order approximation of η(π) but proposes a new objective:

L^CLIP(π) = min[ (π(a|s)/µ(a|s)) Â, clip(π(a|s)/µ(a|s), 1 − ε, 1 + ε) Â ],   (15)

where Â is an estimate of A^µ; e.g., generalized advantage estimation (GAE) [19] is often used in practice. By replacing the TRPO surrogate function (Equation 12) with the clipped function (Equation 15), we are essentially excluding samples where π(a|s)/µ(a|s) ≥ 1 + ε for positive values of Â, or π(a|s)/µ(a|s) ≤ 1 − ε for negative values of Â. This is a lower bound to the actual expected reward, so intuitively, by optimizing Equation 15, we are in turn improving the policy. At the same time, Equation 15 is less expensive to optimize than the TRPO surrogate function.
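As a concrete reading of Equation 15, the following NumPy sketch (our own illustration; the variable names and batch layout are assumptions) evaluates the clipped surrogate for a batch of transitions collected under µ.

import numpy as np

def ppo_clipped_surrogate(pi_prob, mu_prob, advantage, epsilon=0.2):
    # L^CLIP of Equation 15 for a batch of (s, a) pairs.
    # pi_prob:   pi(a|s) under the target (current) policy
    # mu_prob:   mu(a|s) under the behavior policy that collected the data
    # advantage: estimate of A^mu(s, a), e.g. from GAE
    ratio = pi_prob / mu_prob
    clipped_ratio = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # element-wise minimum of the unclipped and clipped terms
    surrogate = np.minimum(ratio * advantage, clipped_ratio * advantage)
    return surrogate.mean()   # PPO ascends the mean surrogate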
3 Reformulation of the Policy Gradient Objective

Recently, Ilyas et al. [7] performed a series of experiments and mathematical analyses to investigate how well current deep policy gradient methods, particularly PPO, adhere to the theoretical basis for policy gradients. What they found, in short, was the following:

• The gradient estimates for deep policy gradient methods tend to be poorly correlated with the true gradients, and also with each other between updates. Also, as training progresses or as the task becomes more complex, the gradient estimate decreases in quality.

• Value networks also tend to produce inaccurate predictions of the true value function. They confirm that by using value predictions as a baseline, the variance of the advantage (and in turn the gradient) does decrease. However, compared with no baseline, the improvement is marginal.

• The optimization landscape that is produced by deep policy gradient methods often does not match the true reward landscape.

• The authors list the above findings as potential reasons why trust region methods require sufficiently similar policies to work. However, since these factors are not formally considered in the theory for trust regions, it can be difficult to understand why deep policy gradients work, or to attribute causes when they do not.

While these findings seem critical of deep policy gradients, the fact is that, empirically, these methods are effective in training working policies. To better understand why policy gradients work in the current setting, we wish to take a closer look.
First, to analyze the ideal update scheme for TRPO, we assume the function is arbitrarily perturbed after each batch of training data, say π(a|s) = µ(a|s) + δ(a, s). Since π and µ are normalized probability density functions, the following condition holds automatically:

Σ_{a∈A} δ(a, s) = 0.

By substituting this definition of π into Equation 12, we obtain

Σ_{s∈S} ρ_µ(s) Σ_{a∈A} δ(a, s) A^µ(s, a) − β D^max_KL(µ, µ + δ),

since Σ_{a∈A} µ(a|s) A^µ(s, a) = 0. In order to simplify the calculation, we consider the lower bound of this formula as follows:

Σ_{s∈S} [ ρ_µ(s) Σ_{a∈A} δ(a, s) A^µ(s, a) − β D_KL(µ ‖ µ + δ) ].

For a small δ on a given s, we expand D_KL(µ ‖ µ + δ) around δ = 0, which leads to

D_KL(µ ‖ µ + δ) = Σ_{a∈A} µ [ln µ − ln(µ + δ)]
                = Σ_{a∈A} µ [ −δ/µ + δ²/(2µ²) + O(δ³) ]
                ≈ Σ_{a∈A} δ²/(2µ).

Hence, we obtain

Σ_{s∈S} [ ρ_µ(s) Σ_{a∈A} δ(a, s) A^µ(s, a) − β D_KL(µ ‖ µ + δ) ]
  ≈ Σ_{s∈S, a∈A} [ ρ_µ(s) A^µ(s, a) δ(a, s) − β δ(a, s)²/(2µ) ],

where the first term is linear and the second is quadratic with respect to δ. This relation shows that non-decreasing updates generally exist, with the corresponding optimal update

δ(a, s) = ρ_µ(s) A^µ(s, a) µ(a|s) / β.   (16)

From the perspective of policy gradients, a single-transition policy update at a given state s and action a is

δ(a, s) = π(a|s) − µ(a|s)
        = A^µ(s, a) (∂µ(a|s)/∂θ) ∆θ + O(∆θ²)
        ≈ A^µ(s, a) (∂µ(a|s)/∂θ) ∆θ,   (17)

where ∆θ = α ∂µ/∂θ and α is the learning rate. Both Equations 16 and 17 indicate that the sign of the update, disregarding the amount, is A^µ/|A^µ|; i.e., 1 for positive A^µ and −1 for negative A^µ. In other words, during updates, we encourage actions with positive advantages and discourage actions with negative advantages.

The estimation of |ρ_µ(s) A^µ(s, a)| when following the function approximation approach is extremely important. This term effectively balances the effects of all the encouraging/discouraging for every state-action pair within a mini-batch, and overall provides limitations to the function approximator. However, as mentioned earlier in this section, the findings by Ilyas et al. [7] show that modern policy gradient methods can effectively handle DRL problems despite not being able to truly evaluate the value of |ρ_µ(s) A^µ(s, a)|. To shed light on why it is not necessary to have a highly accurate evaluation of this term, we need to examine the relation between η(π) and η(µ).

We first modify Equation 5 by swapping π and µ:

η(µ) = η(π) + Σ_{s∈S} ρ_µ(s) Σ_{a∈A} µ(a|s) A^π(s, a).   (18)

By performing algebraic manipulations on Equations 5 and 18 respectively, we obtain

η(π) − η(µ) = Σ_{s∈S} ρ_π(s) Σ_{a∈A} π(a|s) A^µ(s, a)   (19)
η(π) − η(µ) = − Σ_{s∈S} ρ_µ(s) Σ_{a∈A} µ(a|s) A^π(s, a)   (20)

We then add

− Σ_{s∈S} ρ_π(s) Σ_{a∈A} µ(a|s) A^µ(s, a)

to the right-hand side of Equation 19, and

Σ_{s∈S} ρ_µ(s) Σ_{a∈A} π(a|s) A^π(s, a)

to Equation 20. These two extra terms are zero (since the expected advantage of a policy under its own action distribution is zero), therefore we do not need to balance the equations. Thus, we obtain the following:

η(π) − η(µ) = Σ_{s∈S} ρ_π(s) Σ_{a∈A} [π(a|s) − µ(a|s)] A^µ(s, a)   (21)
η(π) − η(µ) = Σ_{s∈S} ρ_µ(s) Σ_{a∈A} [π(a|s) − µ(a|s)] A^π(s, a)   (22)

The values of ρ·(s) are non-negative, therefore these two relations indicate that if one of the following inequalities holds for all (s, a), then η(π) ≥ η(µ) (i.e.,
the expected reward for π dominates that of µ):

1. (π(a|s) − µ(a|s)) A^µ(s, a) ≥ 0
2. (π(a|s)/µ(a|s) − 1) A^µ(s, a) ≥ 0
3. (π(a|s) − µ(a|s)) A^π(s, a) ≥ 0
4. (π(a|s)/µ(a|s) − 1) A^π(s, a) ≥ 0

We refer to these four conditions as dominating inequalities. Note that, first, by following these conditions, we are not optimizing the policies, but rather only guaranteeing in this case that π is better than µ. Second, the implication of this result is that we do not require the estimation of the advantage (A^π or A^µ) to be highly accurate; we only need the correct sign. Finally, when we say this derivation depends on having observed all (s, a), this requirement is on par with the original policy gradient theory. To enforce any one of the above inequalities, we can use a perceptron-like approach [20], e.g.,

maximize_θ  min[ (π(a|s)/µ(a|s) − 1) A^µ(s, a), ξ ],   (23)

where ξ is a user-defined margin. It is apparent that this form is equivalent to the PPO clipped surrogate function. We show this by observation. We can enumerate three different cases:

1. A^µ > 0 and the clipped term is selected, i.e., L^CLIP(π) = (1 + ε) A^µ. This implies that (π(a|s)/µ(a|s)) A^µ > (1 + ε) A^µ. This inequality is equivalent to the case

   (π(a|s)/µ(a|s) − 1) A^µ > ε |A^µ|.   (24)

2. A^µ < 0 and the clipped term is selected, i.e., L^CLIP(π) = (1 − ε) A^µ. This implies that (π(a|s)/µ(a|s)) A^µ > (1 − ε) A^µ. This inequality is equivalent to case 1 (Equation 24).

3. L^CLIP(π) = (π(a|s)/µ(a|s)) A^µ. The update will be

   θ ← θ + α (A^µ/µ(a|s)) ∂π(a|s)/∂θ.

In short, it is possible to encourage/discourage actions without data distribution and trust region assumptions, where the rule conforms to a perceptron-like objective with ξ = ε |A^µ|. It is also possible to gain an intuition for this new objective: for a specific action a_t at state s_t, π will be adjusted according to the experience gained through following the policy µ. More specifically, a positive A^µ(s_t, a_t) implies that a_t is believed to be a good action for µ; in case 3, π is not yet as "smart" as µ, and so it should be adjusted so that a_t is encouraged. For the clipped cases (e.g., case 1), π is already better in that it will choose a good action more frequently than µ, so no adjustment will be made.

On the other hand, with function approximators, it can be exceedingly difficult to satisfy the large number of conditions listed above. It is equivalent to asking a function with parameters θ ∈ R^{N_θ} to satisfy |S × A| conditions. However, this can be thought of as a limitation that depends on the capacity of the neural network, and our conjecture is that DRL problems can be solved by only considering one of the four conditions listed above.
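Equation 23 amounts to only a few lines of code. The sketch below (our own illustration; the variable names are assumptions) uses the margin ξ = ε|A^µ| discussed above, under which the objective stops contributing gradient for exactly the samples that the PPO clipped surrogate would clip.

import numpy as np

def perceptron_like_objective(pi_prob, mu_prob, advantage, epsilon=0.2):
    # Equation 23 with the margin xi = epsilon * |A|.
    # (pi/mu - 1) * A only contributes gradient while it is below the margin;
    # once a sample satisfies the dominating inequality by the margin, the
    # min(.., xi) term is constant, mirroring the clipped cases of L^CLIP.
    ratio = pi_prob / mu_prob
    margin = epsilon * np.abs(advantage)
    objective = np.minimum((ratio - 1.0) * advantage, margin)
    return objective.mean()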
Lastly, by reformulating Equation 22, we can recover the on/off-policy policy gradient formula:

Σ_{s∈S} ρ_µ(s) Σ_{a∈A} [π(a|s) − µ(a|s)] A^π(s, a)
  = Σ_{s∈S} ρ_µ(s) Σ_{a∈A} [π(a|s) − µ(a|s)] Q^π(s, a),   (25)

since V^π(s) Σ_a [π(a|s) − µ(a|s)] = 0. We then take the derivative with respect to θ to obtain

∂η(π)/∂θ = Σ_{s∈S} ρ_µ(s) Σ_{a∈A} (∂π(a|s)/∂θ) Q^π(s, a)
         + Σ_{s∈S} ρ_µ(s) Σ_{a∈A} [π(a|s) − µ(a|s)] ∂Q^π(s, a)/∂θ.   (26)

The first term is exactly the same as the off-policy policy gradient proposed by Degris et al. [17]. We conclude that the off-policy policy gradient is indeed an approximation, where the accuracy is determined by the difference between π and µ. Recovering the on-policy policy gradient is simply a matter of letting µ approach π.

With its clipped surrogate function, PPO is effectively a "nearly on-policy" algorithm in that it does not follow a policy's sample distribution exactly. With the reformulation of the RL objective, we can try to use PPO as an off-policy method. To do this, we use a replay buffer. In an off-policy setting, we are not able to access the estimation V^µ. Therefore, we choose to use the fourth dominating condition

(π(a|s)/µ(a|s) − 1) A^π ≥ ε |A^π|.

However, the estimation of A^π is still difficult, so we turn to V-trace, a state-value estimation technique used in IMPALA [15], for value function training, where all parameters, e.g., c̄ and ρ̄, are set to 1. We define A^trace to be

A^trace_t = A_t + γ min(1, π_{t+1}/µ_{t+1}) A^trace_{t+1},   (27)

where A_t = r_t + γ V_{t+1} − V_t, and V_t is the estimate from the current value network, similar to the settings published by Rolnick et al. [21]. Let V^trace be defined from A^trace accordingly.

Algorithm 1 Learner
  wait for learner thread lock
  repeat
    sample T ∈ B
    tmpA = 0
    for i = n − 1 down to 0 do
      A^trace_i = (r_i + γ V_{i+1} − V_i) + γ · tmpA
      tmpA = min(1, π_i/µ_i) · A^trace_i
      V^trace_i = V_i + tmpA
    end for
    θ_a ← θ_a + α ∂L_policy/∂θ
    θ_v ← θ_v − α ∂L_value/∂θ
  until thread terminates
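The inner loop of Algorithm 1 is a backward recursion over one sampled trajectory. A minimal sketch (our own illustration, with c̄ = ρ̄ = 1 as stated above; the array layout is an assumption):

import numpy as np

def compute_traces(rewards, values, pi_prob, mu_prob, gamma=0.99):
    # A^trace and V^trace of Equation 27, computed backwards over a trajectory.
    # rewards: r_0 .. r_{n-1}
    # values:  V_0 .. V_n from the current value network (length n + 1)
    # pi_prob: pi(a_i|s_i) under the current policy
    # mu_prob: mu(a_i|s_i) stored in the replay buffer
    n = len(rewards)
    a_trace = np.zeros(n)
    v_trace = np.zeros(n)
    tmp_a = 0.0
    for i in reversed(range(n)):
        td = rewards[i] + gamma * values[i + 1] - values[i]   # A_i
        a_trace[i] = td + gamma * tmp_a
        tmp_a = min(1.0, pi_prob[i] / mu_prob[i]) * a_trace[i]
        v_trace[i] = values[i] + tmp_a
    return a_trace, v_trace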
Algorithm 2 Runner
  Input: B replay buffer, learner thread, M̃ maximum number of trajectories, n trajectory length
  Initialize B, learner thread lock, policy weights θ_a, value function weights θ_v
  for j = 0 to M̃ − 1 do
    if j > 0 then release learner thread lock end if
    sample T_j = (s_i, a_i, π(a_i|s_i), r_i, s_{i+1})_j ∼ π
    B = B ∪ T_j
  end for
  Kill learner thread

Combined together, the objective is the following:

L_policy = Σ_{(s,a)∈T} min[ (π(a|s)/µ(a|s) − 1) A^trace, ε |A^trace| ]
L_value = (1/|T|) Σ_{(s,a)∈T} (V(s) − V^trace)².   (28)

In all of our experiments (including the quadrotor experiments in the next section), a mini-batch is set to a fixed number of time steps. The corresponding pseudo-code is given in Algorithm 1. It is worth mentioning that the parameters are shared in the same memory location during the entire training process. Algorithm 2 performs the trajectory collection, running concurrently with the learner thread.
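As a rough sketch of how Algorithms 1 and 2 interact (our own illustration; the buffer size, locking discipline, and the two placeholder functions are assumptions, and the gradient updates themselves are elided), the runner fills a shared replay buffer while the learner repeatedly samples from it:

import random
import threading
from collections import deque

replay_buffer = deque(maxlen=1000)       # B: shared buffer of trajectories
buffer_lock = threading.Lock()
learner_may_start = threading.Event()    # the "learner thread lock" of Algorithms 1 and 2
stop_learner = threading.Event()

def collect_trajectory():
    # placeholder: roll out the current policy pi for n steps
    return [("s", "a", 1.0, 0.0, "s'")]

def update_policy_and_value(trajectory):
    # placeholder: compute A^trace / V^trace and apply the updates of Algorithm 1
    pass

def learner():
    learner_may_start.wait()             # wait until the first trajectory exists
    while not stop_learner.is_set():
        with buffer_lock:
            trajectory = random.choice(replay_buffer)   # sample T from B
        update_policy_and_value(trajectory)

def runner(max_trajectories):
    for j in range(max_trajectories):
        trajectory = collect_trajectory()
        with buffer_lock:
            replay_buffer.append(trajectory)            # B = B U T_j
        if j == 0:
            learner_may_start.set()                     # release the learner
    stop_learner.set()                                  # kill the learner thread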
Figure 1: Learning curves for the pendulum problem, comparing our method and PPO, each at a high and a low learning rate.

To verify the algorithm, we first apply our method to the OpenAI Gym pendulum environment. As shown in Figure 1, compared to a minimally tuned PPO, our combined method is able to achieve slightly better performance, even though the reformulated objective is nearly equivalent to PPO with slightly relaxed criteria. The pendulum task is time-unlimited, but in the experiment each episode is a finite number of steps. To address this potential terminal state definition issue, we perform partial-episode bootstrapping [22]. We believe the combined on/off-policy method should perform better with proper hyperparameter selection. Since the pendulum experiment is merely a simple test to verify our hypothesis, we leave this as future work.
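Partial-episode bootstrapping [22] only changes how the value target is formed when an episode is cut off by the step limit rather than by a true terminal state. A minimal sketch (our own illustration, not the authors' code):

def td_target(reward, next_value, terminal, timed_out, gamma=0.99):
    # One-step value target with partial-episode bootstrapping.
    # terminal:  the environment reached a true terminal state
    # timed_out: the episode ended only because the step limit was reached
    if terminal and not timed_out:
        return reward                      # no future value at a real terminal state
    # bootstrap through time-limit cutoffs as if the episode had continued
    return reward + gamma * next_value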
Next, we apply the combined on/off-policy method to train a model-free quadrotor control agent. For the scope of this paper, we simply discuss training and testing the DRL agent in a simulator, though preliminary experiments show that the learned model can be applied to a real-world scenario.

To train and verify our algorithm in a simulated environment, we construct a quadrotor simulator written in Python, based on the dynamic model outlined by Bangura et al. [23]. We use the simplest dynamic model, which only calculates the effects of gravity and the forces generated by the motors in our simulator.
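The simulated dynamics are deliberately minimal. The sketch below (our own simplified illustration, not the simulator's actual code; the mass value and the state layout are assumptions, and attitude dynamics are omitted) integrates a point mass under gravity and the total motor thrust applied along the body z-axis:

import numpy as np

GRAVITY = np.array([0.0, 0.0, -9.81])   # m/s^2
MASS = 0.5                               # kg, assumed value

def step(position, velocity, body_z, motor_forces, dt=0.01):
    # One Euler integration step of the simplified model: only gravity and
    # the four motor forces (acting along the body z-axis, given here in the
    # world frame) are considered.
    acceleration = GRAVITY + motor_forces.sum() * body_z / MASS
    velocity = velocity + acceleration * dt
    position = position + velocity * dt
    return position, velocity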
In this experiment, we focus on learning two tasks: 1) position control, and 2) tracking a specific trajectory. For the first task, the DRL agent needs to learn to stabilize the quadrotor and hover at a defined position. For the second task, the DRL agent needs to fly along a circular trajectory with a constant speed.
Figure 2: Learning curve for optimizing the hovering policy in two training trials.

During training, the target location is set to be the origin of the space, where the goal for the quadrotor agent is to reach and stabilize at the origin. The quadrotor's initial state (position, velocity, angle, and angular velocity) is randomized.

To increase the flexibility of the controller, we add an offset force F_o on each motor during the training process, as described in Equation 29. The sum of the offset force and the policy network output F_π is used as the control output on the quadrotor:

F = F_o + F_π,   F_o = (1/4) m g.   (29)

By using this method, the action chosen by the learned policy only needs to adjust the motor thrust around a reference action, rather than exploring a wide action space. Empirically, this method results in a much faster training process than without a reference action.

With the aforementioned method, the two tasks were evaluated in our experiments, where each episode consists of at most 200 steps, and each step takes 0.01 second in the simulation environment. The experiments for the two tasks are described in the subsequent subsections respectively.

The basic function of the quadrotor controller is to hover at a specific position. The reward function is designed as

reward = − [w_1 w_2 w_3]^T [‖q_e‖ ‖p_e‖ ‖a‖],

where q_e, p_e, and a are the angle error, position error, and action output, and w_1 to w_3 are the weights of the errors. The reward is normalized for faster convergence.

Based on the algorithm described above, we train our model with 10 collected episodes (at most 200 steps in one episode). Two training runs are collected and the training curves are shown in Figure 2.
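In code, the reference-action offset of Equation 29 and the hovering reward above amount to a few lines (our own illustration; the mass and the weight values are assumptions):

import numpy as np

MASS = 0.5    # kg, assumed
G = 9.81      # m/s^2

def motor_commands(policy_output):
    # Equation 29: each motor command is the hover offset plus the policy output.
    f_offset = MASS * G / 4.0          # F_o = (1/4) m g per motor
    return f_offset + policy_output    # F = F_o + F_pi

def hover_reward(angle_error, position_error, action, w=(1.0, 1.0, 1.0)):
    # reward = -[w1 w2 w3] . [||q_e||, ||p_e||, ||a||]
    terms = np.array([np.linalg.norm(angle_error),
                      np.linalg.norm(position_error),
                      np.linalg.norm(action)])
    return -float(np.dot(np.array(w), terms))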
Figure 3: The quadrotor's x, y, and z distance from the origin [0, 0, 0] in a hovering trial, starting from a randomized initial position and velocity.
Figure 4: Learning curve for optimizing the action policy in two training trials for the tracking task.

Using our algorithm, it takes only 10 million time steps to train a policy such that the quadrotor can steadily hover at the designated position. In contrast, in the work by Hwangbo et al. [24], which is the first to use RL for quadrotor control to our knowledge, the quadrotor is still unstable after 100 million steps, and appears to achieve a similar level of stability at 2150 million steps (based on their video at https://youtu.be/zIi4yHYJdJY?t=111 and the network training section of their paper: they used one million time steps per iteration, and the best policy was trained with 2150 iterations). In addition, Figure 3 shows the results for a randomly initialized trial.
Figure 5: The quadrotor starts from 5 different initial conditions and follows the designed trajectory.
In addition to hovering at a designated point as a regulation problem, trajectory tracking is another common scenario in quadrotor operation. We create a circular path on the xy-plane and train the agent to fly on the circle with constant speed, starting from a random initial condition. The reward function is designed as

reward = − [w_1 w_2 w_3]^T [‖d‖ ‖χ‖ ‖a‖],

where w_1 to w_3 are the weights of the errors, d is the distance to the circular trajectory, and χ encodes the direction in which the quadrotor is flying on the trajectory. To ensure the quadrotor flies in the desired direction on the path, we take the cross product of the position and velocity on the xy-plane, which we call χ, written as

χ = x v_y − y v_x − r_des v_des,

where r_des and v_des are the desired radius and quadrotor velocity on the trajectory.

In our experiment, the radius of the circle is set to be 1 meter and the velocity is 2 m/s. Two training curves are shown in Figure 4. The controller learned to fly in a circle after around 1.0 million training episodes. Figure 5 shows five different trials with five different initial states, where the quadrotor is able to fly on our designated 1 m radius circular trajectory. With the constant velocity requirement, a simple harmonic motion on the x/y-axes can be observed in Figure 6. To our knowledge, this is the first work to train a quadrotor to track along a specific trajectory using RL.

Figure 6: Quadrotor velocity on the x and y axes, with maximum speed 2 m/s.

6 Conclusion

In this paper, we point out a fundamental property of the objective, as described in subsection 3.2. Through derivation, we can reformulate the policy gradient objective into a perceptron-like loss function, following what we refer to as the four dominating inequalities. Since the reformulation is based on Equation 5, a general formula with no restrictions on the KL-divergence of π and µ, training that follows this new objective can be either on- or off-policy.

More specifically, we posit that it is sufficient to only update a policy π for cases that satisfy the condition A(π/µ − 1) ≤ 0. Furthermore, we show via theoretic derivation that a specific instance of our perceptron-like loss function matches the clipped surrogate objective for PPO in each update.

To examine our derivations, we combine PPO with IMPALA. The combined method is demonstrated to work well in a quadrotor simulator. Preliminary trials also show that the trained policy can be used on a real-world quadrotor. It is somewhat surprising that our trained agent for the hovering task can be applied in the real world without any sim-to-real techniques (e.g., randomization or accurate physical models [14]). In the simulator, our method was able to hover steadily within 10 million time steps. In previous work by Hwangbo et al. [24], the first to use RL on quadrotor control to our knowledge, the quadrotor is still unstable after 100 million steps. Additionally, to our knowledge, this paper is the first work to train a quadrotor to track along a specific trajectory using RL.

Lastly, our derivations could shed light on one of the questions posed by Ilyas et al. [7], i.e., why modern policy gradient methods work despite the inaccurate state-value function predictors and policy gradient directions.

References
[1] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

[2] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.

[3] OpenAI. OpenAI Five. https://blog.openai.com/openai-five/, 2018.

[4] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.

[5] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, 2016.

[6] Peter Henderson, Joshua Romoff, and Joelle Pineau. Where did my optimum go?: An empirical analysis of gradient descent optimization in policy gradient methods. CoRR, abs/1810.02525, 2018.

[7] Andrew Ilyas, Logan Engstrom, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. Are deep policy gradient algorithms truly policy gradient algorithms? CoRR, abs/1811.02553, 2018.

[8] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.

[9] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.

[10] Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. VIME: Variational information maximizing exploration. In NIPS, 2016.

[11] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In NIPS, 2016.

[12] Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. In NIPS, 2017.

[13] Sean Morrison, Alex Fisher, and Fabio Zambetta. Towards intelligent aircraft through deep reinforcement learning. In S. Watkins, editor, pages 328–335, Melbourne, Australia, November 2018.

[14] Jie Tan, Tingnan Zhang, Erwin Coumans, Atil Iscen, Yunfei Bai, Danijar Hafner, Steven Bohez, and Vincent Vanhoucke. Sim-to-real: Learning agile locomotion for quadruped robots. CoRR, abs/1804.10332, 2018.

[15] Lasse Espeholt, Hubert Soyer, Rémi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In ICML, 2018.

[16] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. CoRR, abs/1606.01540, 2016.

[17] Thomas Degris, Martha White, and Richard S. Sutton. Off-policy actor-critic. CoRR, abs/1205.4839, 2012.

[18] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, New York, NY, USA, second edition, 2006.

[19] John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. CoRR, abs/1506.02438, 2015.

[20] Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.

[21] David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy P. Lillicrap, and Greg Wayne. Experience replay for continual learning. CoRR, abs/1811.11682, 2018.

[22] Fabio Pardo, Arash Tavakoli, Vitaly Levdik, and Petar Kormushev. Time limits in reinforcement learning. In ICML, 2018.

[23] Moses Bangura, Robert Mahony, et al. Nonlinear dynamic modeling for high performance control of a quadrotor. In Australasian Conference on Robotics and Automation, pages 1–10, 2012.

[24] Jemin Hwangbo, Inkyu Sa, Roland Siegwart, and Marco Hutter. Control of a quadrotor with reinforcement learning. IEEE Robotics and Automation Letters, 2017.