OPAC: Opportunistic Actor-Critic
Srinjoy Roy, Saptam Bakshi, and Tamal Maharaj
Department of Computer Science, Ramakrishna Mission Vivekananda Educational and Research Institute (RKMVERI), West Bengal, India

Abstract
Actor-critic methods, a family of model-free reinforcement learning (RL) algorithms, have achieved state-of-the-art performance in many real-world continuous-control domains. Despite this success, wide-scale deployment of these models remains distant, chiefly because of inefficient exploration and sub-optimal policies. Soft Actor-Critic (SAC) and Twin Delayed Deep Deterministic Policy Gradient (TD3), two cutting-edge algorithms of this kind, suffer from these issues. SAC effectively addresses the problems of sample complexity and convergence brittleness with respect to hyper-parameters, and thus outperforms all state-of-the-art algorithms, including TD3, in harder tasks, whereas TD3 produces moderate results in all environments. SAC suffers from inefficient exploration owing to the Gaussian nature of its policy, which causes borderline performance in simpler tasks. In this paper, we introduce Opportunistic Actor-Critic (OPAC), a novel model-free deep RL algorithm that employs a better exploration policy and has lower variance. OPAC combines some of the most powerful features of TD3 and SAC and aims to optimize a stochastic policy in an off-policy way. For calculating the target Q-values, OPAC uses three critics instead of two and, based on the complexity of the environment, opportunistically chooses how the target Q-value is computed from the critics' evaluations. We have systematically evaluated the algorithm on MuJoCo environments, where it achieves state-of-the-art performance and outperforms, or at least equals, the performance of TD3 and SAC.
Introduction

Model-free deep reinforcement learning (RL) algorithms have been successfully applied to a series of challenging domains ranging from games [10, 17] to robotic control [5, 6]. The combination of reinforcement learning with powerful function approximators, such as neural networks, has given rise to deep reinforcement learning. In recent years, deep RL has proved to be highly effective in a wide range of decision-making and control tasks. However, the application of model-free deep RL to such tasks is complicated by two major challenges: sample complexity and convergence brittleness. Cutting-edge deep RL algorithms like Twin Delayed Deep Deterministic Policy Gradient (TD3) [4] and Soft Actor-Critic (SAC) [7] have shown promising results on challenging control tasks. TD3 learns efficiently from past samples using an experience replay memory, and it effectively addresses the overestimation bias that occurs in traditional actor-critic methods. But it suffers from sensitivity to hyper-parameters and, as a result, requires a lot of tuning to converge. To combat this convergence brittleness of TD3, the maximum entropy RL framework [23] was incorporated in SAC.

The reason for the failure of Deep Deterministic Policy Gradient (DDPG) [9] based algorithms is the dramatic overestimation of Q-values [4]. TD3 addresses this issue through three techniques: clipped double Q-learning, delayed policy updates, and target policy smoothing. It is an off-policy, Q-learning based algorithm that trains a deterministic policy. On the other hand, SAC trains a stochastic
policy and explores in an on-policy way. The gap between DDPG-style approaches and stochastic policy optimization was bridged by SAC.

The use of target networks is illustrated in the TD3 algorithm, where their role in stabilizing the training process is evident. TD3 follows a pessimistic approach while evaluating the deterministic policy: the clipped double Q-learning technique takes the minimum of two target Q-values for updating the parameters of the critic models. Policy updates and updates of the target network parameters are performed less frequently than updates of the model network parameters, ensuring that the error in the value network is reduced to a certain extent before a policy update is introduced. Exploration is facilitated in TD3 by adding noise to the target policy to avoid overestimation. SAC also employs target networks, but only for the critics. Since it explores in an on-policy way, SAC does not use a target network for the actor, which is the policy itself. The inherent stochastic nature of the policy enables exploration in SAC; it is analogous to target policy smoothing in TD3. Entropy regularization is one of the key features of SAC, where the policy is trained to maximize a trade-off between the expected return over time and the entropy. The term entropy in this context refers to a measure of randomness in the policy. As already mentioned, TD3 and SAC both employ a pessimistic approach while calculating the Mean Squared Bellman Error (MSBE) by taking the minimum of two Q-values. Optimistic Actor-Critic (OAC) [3], another recent algorithm, takes an optimistic approach instead and was shown to substantially improve the quality of exploration.

In this context, we introduce Opportunistic Actor-Critic (OPAC), a model-free deep RL algorithm that incorporates some of the novel features of TD3 and SAC, such as the use of target networks, target policy smoothing, and the entropy maximization framework [23, 6, 19, 21, 13]. To introduce the idea of voting, an additional critic is used to fine-tune the value updates. The driving idea behind OPAC is to retain the benefits of TD3 and SAC and combine them under a single roof, along with an extra critic, to form a link between stochastic policy optimization and off-policy exploration. We demonstrate via experimental results that having three critics instead of two improves the quality of the policy, which in turn yields a higher average reward over time. SAC was shown to outperform TD3 and other model-free deep RL algorithms (such as Proximal Policy Optimization (PPO) [16] and Trust Region Policy Optimization (TRPO) [15]) in challenging tasks like the Humanoid-v2 environment in MuJoCo. In this paper, we show that OPAC outperforms both TD3 and SAC in terms of average return, in challenging as well as simple control tasks, with a few exceptions where it works on par with them. Since TD3 and SAC are currently two of the best model-free deep RL algorithms, we limit our comparison of OPAC's performance to TD3 and SAC only.
Preliminaries

We first discuss the principal concepts of reinforcement learning and maximum entropy reinforcement learning. These discussions introduce the mathematical notation that will be used, and heavily referred to, in the later sections.
Markov Decision Processes (MDPs) are defined by the tuple $(\mathcal{S}, \mathcal{A}, p, r)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $p$ represents the state transition probabilities, and $r$ represents the reward function. $\mathcal{S}$ and $\mathcal{A}$ are assumed to be continuous, and the state transition probability $p : \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to [0, \infty)$ represents the probability density of the next state $s_{t+1} \in \mathcal{S}$ given the current state $s_t \in \mathcal{S}$ and action $a_t \in \mathcal{A}$. The goal in an MDP is to find an optimal "policy" for the decision maker: a function $\pi$ that specifies the action $\pi(s)$ that the decision maker chooses when in state $s$.

Reinforcement learning (RL) considers the paradigm of an agent interacting with its environment to learn reward-maximizing behavior. The agent in RL can be thought of as the decision maker in an MDP, and the environment as the setting on which the MDP is defined. Thus, a standard reinforcement learning framework is defined as a policy search in an MDP. The standard reinforcement learning objective is the expected sum of rewards,
$$\sum_{t=0}^{\infty} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) \right].$$
The goal is to learn a policy $\pi(a_t \mid s_t)$ that maximizes this objective. In other words, we are trying to learn the optimal policy $\pi^*_\theta(a_t \mid s_t)$ with parameters $\theta$.
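As a concrete illustration of this agent-environment loop, the following is a minimal sketch using the OpenAI Gym API that the experiments in this paper also rely on; the environment name, the uniform stand-in policy, and the episode budget are illustrative assumptions, not part of the paper:

```python
import gym

# Minimal agent-environment loop: sample actions from a (here uniform,
# i.e., untrained) policy and accumulate the undiscounted return.
env = gym.make("Hopper-v2")  # any continuous-control task works here

num_episodes = 5
for episode in range(num_episodes):
    s = env.reset()
    done, episode_return = False, 0.0
    while not done:
        a = env.action_space.sample()   # stand-in for pi(a_t | s_t)
        s, r, done, _ = env.step(a)     # environment transition p and reward r
        episode_return += r             # sum of rewards along the trajectory
    print(f"episode {episode}: return = {episode_return:.2f}")
```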
Maximum Entropy Reinforcement Learning

The maximum entropy objective [22] generalizes the standard RL objective by augmenting it with an entropy term, such that the optimal policy additionally aims to maximize its entropy at each visited state:
$$\pi^* = \arg\max_\pi \sum_{t=0}^{\infty} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha \, \mathcal{H}(\pi(\cdot \mid s_t)) \right],$$
where $\alpha$ is the temperature parameter that determines the relative importance of the entropy term versus the reward, and thus controls the stochasticity of the optimal policy. Entropy is a measure of the unpredictability of a random variable. Let $x$ be a random variable with probability mass or density function $\mathcal{P}$. The entropy $\mathcal{H}$ of $x$ is computed from its distribution $\mathcal{P}$ according to
$$\mathcal{H}(\mathcal{P}) = \mathbb{E}_{x \sim \mathcal{P}} \left[ -\log \mathcal{P}(x) \right].$$
The maximum entropy framework has many conceptual and practical advantages. Firstly, the policy is given an incentive to explore more widely while rejecting actions that are sub-optimal. Secondly, the policy can capture multiple modes of near-optimal behavior: in scenarios where more than one action seems equally good, the policy will assign equal probabilities to those actions. It has been observed to considerably improve learning speed over state-of-the-art methods that optimize the standard RL objective.

Soft policy iteration is a general algorithm for determining optimal policies under the maximum entropy framework. It alternates between policy evaluation and policy improvement steps. It was introduced and fully derived in the SAC paper [7]. We revisit the lemmas and the soft policy iteration theorem, but we skip their proofs since those can be found in the aforementioned paper.

Soft policy iteration was shown to converge to an optimal policy within a set of policies. In the policy evaluation step, the value of the policy $\pi$ is computed according to the maximum entropy reinforcement learning objective. $Q : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the soft Q-value function [14, 12], whose value can be computed iteratively by repeatedly applying a modified Bellman backup operator $\mathcal{T}^\pi$ defined by
$$\mathcal{T}^\pi Q(s_t, a_t) \triangleq r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim p} \left[ V(s_{t+1}) \right], \tag{1}$$
where
$$V(s_t) = \mathbb{E}_{a_t \sim \pi} \left[ Q(s_t, a_t) - \alpha \log \pi(a_t \mid s_t) \right] \tag{2}$$
is the soft state value function. The soft Q-function of any policy $\pi$ is obtained by repeatedly applying $\mathcal{T}^\pi$, as formalized in Lemma 3.1.

Lemma 3.1 (Soft Policy Evaluation). Consider the soft Bellman backup operator $\mathcal{T}^\pi$ in Equation 1 and a mapping $Q^0 : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ with $|\mathcal{A}| < \infty$, and define $Q^{k+1} = \mathcal{T}^\pi Q^k$. Then the sequence $Q^k$ will converge to the soft Q-function of $\pi$ as $k \to \infty$.

In the policy improvement step, for each state, the policy is updated according to
$$\pi_{\text{new}} = \arg\min_{\pi' \in \Pi} D_{\mathrm{KL}}\!\left( \pi'(\cdot \mid s_t) \,\Big\|\, \frac{\exp\!\left( \frac{1}{\alpha} Q^{\pi_{\text{old}}}(s_t, \cdot) \right)}{Z^{\pi_{\text{old}}}(s_t)} \right), \tag{3}$$
where $\pi_{\text{new}}$ corresponds to the updated policy and is updated towards the exponential of the new soft Q-function. $\Pi$ is a set of policies belonging to the parameterized family of Gaussian distributions. The information projection defined in terms of the Kullback-Leibler divergence is used to project the improved policy into the desired set of policies so as to satisfy the constraint $\pi' \in \Pi$. $Z^{\pi_{\text{old}}}(s_t)$ is the partition function that normalizes the distribution; it can be ignored because it does not contribute to the gradient with respect to $\pi_{\text{new}}$. Lemma 3.2 formalizes that the new projected policy has a higher value than the old policy.

Lemma 3.2 (Soft Policy Improvement). Let $\pi_{\text{old}} \in \Pi$ and let $\pi_{\text{new}}$ be the optimizer of the minimization problem defined in Equation 3. Then $Q^{\pi_{\text{new}}}(s_t, a_t) \geq Q^{\pi_{\text{old}}}(s_t, a_t)$ for all $(s_t, a_t) \in \mathcal{S} \times \mathcal{A}$ with $|\mathcal{A}| < \infty$.
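Before moving on, the backup of Equations 1 and 2 can be made concrete with a small tabular sketch of soft policy evaluation; the toy MDP, its transition table, and all constants below are illustrative assumptions, and by Lemma 3.1 the iteration converges to the soft Q-function of the fixed policy:

```python
import numpy as np

# Tabular soft policy evaluation: repeatedly apply the soft Bellman
# backup T^pi (Eq. 1) using the soft state value V (Eq. 2).
n_s, n_a = 2, 2                              # toy MDP: 2 states, 2 actions
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # P[s, a, s'] transition probabilities
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])       # r(s, a)
pi = np.full((n_s, n_a), 0.5)                # fixed stochastic policy pi(a|s)
gamma, alpha = 0.99, 0.2                     # discount and temperature

Q = np.zeros((n_s, n_a))
for _ in range(1000):
    # V(s) = E_{a~pi}[Q(s,a) - alpha * log pi(a|s)]              (Eq. 2)
    V = (pi * (Q - alpha * np.log(pi))).sum(axis=1)
    # Q(s,a) <- r(s,a) + gamma * E_{s'~p}[V(s')]                 (Eq. 1)
    Q = R + gamma * (P @ V)
print("soft Q-values:\n", Q)
```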
Theorem 3.3 establishes that the soft policy iteration algorithm converges to the optimal maximum entropy policy by alternating between soft policy evaluation and soft policy improvement steps.
Theorem 3.3 (Soft Policy Iteration). Repeated application of soft policy evaluation and soft policy improvement from any $\pi \in \Pi$ converges to a policy $\pi^*$ such that $Q^{\pi^*}(s_t, a_t) \geq Q^{\pi}(s_t, a_t)$ for all $\pi \in \Pi$ and $(s_t, a_t) \in \mathcal{S} \times \mathcal{A}$ with $|\mathcal{A}| < \infty$.

OPAC

Soft policy iteration was derived in a tabular setting. To extend it to continuous state-action domains, the soft Q-function and the policy both have to be approximated by deep neural networks. Instead of alternating between soft policy evaluation and soft policy improvement up to convergence, we alternate between optimizing the soft Q-function and the policy network by stochastic gradient descent. This is how we construct the OPAC algorithm. Let $Q_\phi(s_t, a_t)$ and $\pi_\theta(a_t \mid s_t)$ be the soft Q-function and a tractable policy, with parameters $\phi$ and $\theta$ respectively. The parameters of the soft Q-function can be trained to minimize the soft Bellman residual error,
$$J_Q(\phi) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}} \left[ \tfrac{1}{2} \left( Q_\phi(s_t, a_t) - \left( r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim p} \left[ V_{\phi_{\text{target}}}(s_{t+1}) \right] \right) \right)^2 \right].$$
Substituting the value function parameters as in Equation 2 and optimizing by stochastic gradient descent, we have
$$\nabla_\phi J_Q(\phi) = \nabla_\phi Q_\phi(s_t, a_t) \left( Q_\phi(s_t, a_t) - \left( r(s_t, a_t) + \gamma \left( Q_{\phi_{\text{target}}}(s_{t+1}, a_{t+1}) - \alpha \log \pi_{\theta_{\text{target}}}(a_{t+1} \mid s_{t+1}) \right) \right) \right), \tag{4}$$
where $\phi_{\text{target}}$ denotes the parameters of the target Q-function networks, an important tool for stabilizing training [11]. The parameters of the policy network can be learned directly by minimizing the KL-divergence in Equation 3,
$$J_\pi(\theta) = \mathbb{E}_{s_t \sim \mathcal{D}} \left[ \mathbb{E}_{a_t \sim \pi_\theta} \left[ \alpha \log \pi_\theta(a_t \mid s_t) - Q_\phi(s_t, a_t) \right] \right]. \tag{5}$$
We need to compute $\nabla_\theta \, \mathbb{E}_{a_t \sim \pi_\theta} \left[ \alpha \log \pi_\theta(a_t \mid s_t) - Q_\phi(s_t, a_t) \right]$. $Q_\phi(s_t, a_t)$ does not directly depend on $\theta$, so no gradient of $Q_\phi(s_t, a_t)$ can be computed with respect to $\theta$. Instead, we can write the action $a_t$ as
$$a_t = \mu_\theta(s_t) + \epsilon_t \, \sigma_\theta(s_t),$$
where $\epsilon_t \sim \mathcal{N}(0, 1)$. Instead of sampling $a_t \sim \pi_\theta(\cdot \mid s_t)$, we now sample $\epsilon_t \sim \mathcal{N}(0, 1)$. Therefore, we can write $Q_\phi(s_t, a_t) = Q_\phi(s_t, \mu_\theta(s_t) + \epsilon_t \, \sigma_\theta(s_t))$. Thus a gradient with respect to $\theta$ appears, leading to smaller variance. We set $a_t = f_\theta(\epsilon_t, s_t) = \mu_\theta(s_t) + \epsilon_t \, \sigma_\theta(s_t)$. Now we have
$$J_\pi(\theta) = \mathbb{E}_{s_t \sim \mathcal{D}} \left[ \mathbb{E}_{\epsilon_t \sim \mathcal{N}(0, 1)} \left[ \alpha \log \pi_\theta(f_\theta(\epsilon_t, s_t) \mid s_t) - Q_\phi(s_t, f_\theta(\epsilon_t, s_t)) \right] \right],$$
whose gradient with respect to $\theta$ can be obtained by
$$\nabla_\theta J_\pi(\theta) = \nabla_\theta \, \alpha \log \pi_\theta(a_t \mid s_t) + \left( \nabla_{a_t} \alpha \log \pi_\theta(a_t \mid s_t) - \nabla_{a_t} Q_\phi(s_t, a_t) \right) \nabla_\theta f_\theta(\epsilon_t, s_t).$$
Finally, we have all the necessary update rules for OPAC. The process described in this section is very similar to that of SAC, especially in the use of the reparameterization trick. However, in practice it has been observed that learning the policy parameters by the above equation yields inferior results. Instead, we learn the policy parameters by
$$J_\pi(\theta) = \mathbb{E}_{s_t \sim \mathcal{D}} \left[ \mathbb{E}_{a_t \sim \pi_\theta} \left[ \alpha \log \pi_\theta(a_t \mid s_t) - Q_{\phi_1}(s_t, a_t) \right] \right]. \tag{6}$$
Note that $Q_\phi(s_t, a_t)$ has become $Q_{\phi_1}(s_t, a_t)$: we consider only the output of the first Q-network. The $J_\pi(\theta)$ in Equation 6 can be optimized by stochastic gradient descent using the same reparameterization trick as for Equation 5. This modification is inspired by the policy update rule of TD3, and the main reason for modifying the policy update rule of Equation 5 to that of Equation 6 is strictly practical. We look at it more deeply in an upcoming section, where we present the OPAC algorithm.
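A minimal PyTorch-style sketch of the reparameterized policy objective in Equation 6 follows. The network sizes, clamping bounds, and all names are illustrative assumptions, and the tanh squashing used in practical implementations is omitted for brevity:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    """pi_theta(a|s): outputs mu_theta(s) and sigma_theta(s)."""
    def __init__(self, s_dim, a_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(s_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, a_dim)
        self.log_std = nn.Linear(hidden, a_dim)

    def forward(self, s):
        h = self.body(s)
        std = self.log_std(h).clamp(-20, 2).exp()
        dist = Normal(self.mu(h), std)
        # Reparameterization: a = mu + eps * sigma with eps ~ N(0, 1),
        # so gradients flow from Q through the sampled action to theta.
        a = dist.rsample()
        log_prob = dist.log_prob(a).sum(-1, keepdim=True)
        return a, log_prob

def policy_loss(policy, q1, states, alpha):
    """J_pi(theta) of Eq. 6, using only the first critic Q_{phi_1} (q1)."""
    actions, log_pi = policy(states)
    return (alpha * log_pi - q1(states, actions)).mean()
```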
Automating the Entropy Adjustment

In the previous section, we constructed an off-policy algorithm for OPAC for a particular temperature, i.e., a fixed value of $\alpha$. Figuring out an optimal temperature is, in practice, a complicated task: the entropy can vary unpredictably both across tasks and during training as the policy improves. We borrow the strategy that SAC uses to automatically adjust the entropy temperature $\alpha$.

The standard maximum entropy learning problem for OPAC can be reformulated as a constrained optimization problem: while maximizing the expected return, the policy should satisfy a minimum entropy constraint,
$$\max_{\pi_0, \ldots, \pi_T} \mathbb{E} \left[ \sum_{t=0}^{T} r(s_t, a_t) \right] \quad \text{s.t.} \quad \mathcal{H}(\pi_t) \geq \mathcal{H}_0 \;\; \forall t,$$
where $\mathcal{H}_0$ is a predefined minimum policy entropy threshold. The expected return $\mathbb{E}[\sum_{t=0}^{T} r(s_t, a_t)]$ can be decomposed into a sum of rewards at all time steps, which allows a dynamic programming strategy. Since the policy $\pi_t$ at time $t$ has no effect on the policy at the earlier time step $\pi_{t-1}$, we can maximize the return at different steps backward in time:
$$\max_{\pi_0} \left( \mathbb{E}[r(s_0, a_0)] + \max_{\pi_1} \left( \mathbb{E}[\ldots] + \max_{\pi_T} \mathbb{E}[r(s_T, a_T)] \right) \right),$$
where we consider $\gamma = 1$. So we start the optimization from the last time step $T$: maximize $\mathbb{E}_{(s_T, a_T) \sim \rho_\pi}[r(s_T, a_T)]$ such that $\mathcal{H}(\pi_T) - \mathcal{H}_0 \geq 0$. Firstly, let us define the following functions:
$$h(\pi_T) = \mathcal{H}(\pi_T) - \mathcal{H}_0 = \mathbb{E}_{(s_T, a_T) \sim \rho_\pi} \left[ -\log \pi_T(a_T \mid s_T) \right] - \mathcal{H}_0,$$
$$f(\pi_T) = \begin{cases} \mathbb{E}_{(s_T, a_T) \sim \rho_\pi} \left[ r(s_T, a_T) \right], & \text{if } h(\pi_T) \geq 0, \\ -\infty, & \text{otherwise.} \end{cases}$$
Then the optimization problem becomes: maximize $f(\pi_T)$ subject to $h(\pi_T) \geq 0$. We form the Lagrangian with the dual variable $\alpha_T \geq 0$ as
$$L(\pi_T, \alpha_T) = f(\pi_T) + \alpha_T h(\pi_T).$$
We skip the part where $L(\pi_T, \alpha_T)$ is minimized with respect to $\alpha_T$ given a particular value of $\pi_T$, because a similar derivation is already given in [7]. We can therefore conclude that we will have equations of the form
$$\alpha^*_{T-1} = \arg\min_{\alpha_{T-1} \geq 0} \mathbb{E}_{(s_{T-1}, a_{T-1}) \sim \rho_{\pi^*}} \left[ \alpha_{T-1} \mathcal{H}(\pi^*_{T-1}) - \alpha_{T-1} \mathcal{H}_0 \right]$$
and
$$\alpha^*_T = \arg\min_{\alpha_T \geq 0} \mathbb{E}_{(s_T, a_T) \sim \rho_{\pi^*}} \left[ \alpha_T \mathcal{H}(\pi^*_T) - \alpha_T \mathcal{H}_0 \right],$$
where $\alpha^*_T$ corresponds to the optimal temperature at the last time step $T$. The equation for updating $\alpha^*_{T-1}$ has the same form as the equation for updating $\alpha^*_T$. By repeating this process, we can learn the optimal temperature parameter at every step by minimizing the objective function
$$J(\alpha) = \mathbb{E}_{a_t \sim \pi_t} \left[ -\alpha \log \pi_t(a_t \mid s_t) - \alpha \mathcal{H}_0 \right]. \tag{7}$$
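A minimal sketch of the resulting temperature update of Equation 7, in the same PyTorch style as above; optimizing $\log \alpha$ instead of $\alpha$ is a common stability choice and an assumption here, as are the learning rate and the heuristic target entropy:

```python
import torch

# Learn the temperature alpha by minimizing J(alpha) of Eq. 7.
a_dim = 6                                  # action dimension (illustrative)
target_entropy = -float(a_dim)             # common heuristic: H_0 = -|A|
log_alpha = torch.zeros(1, requires_grad=True)
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(log_pi):
    """log_pi: log pi_t(a_t|s_t) for a batch of actions sampled from the policy."""
    alpha = log_alpha.exp()
    # J(alpha) = E[-alpha * log pi(a|s) - alpha * H_0]
    loss = -(alpha * (log_pi.detach() + target_entropy)).mean()
    alpha_opt.zero_grad()
    loss.backward()
    alpha_opt.step()
    return alpha.detach()
```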
Convergence of Clipped Triple Q-Learning

Unlike TD3 and SAC, our OPAC algorithm uses clipped triple Q-learning instead of clipped double Q-learning. In practice, we consider two strategies: the mean value of the smaller two critics, and the median value of all three critics. We now establish a proof of convergence for clipped triple Q-learning; the convergence of the mean and median strategies follows automatically from this proof.

We first include a lemma due to [18], which we use in the convergence proof of triple Q-learning. It originally appears as a proposition in [1], which was further generalized into this lemma. The proof of triple Q-learning is similar to the proofs of double Q-learning [8] and clipped double Q-learning [4].
Lemma 6.1. Consider a stochastic process $(\zeta_t, \Delta_t, F_t)$, $t \geq 0$, where $\zeta_t, \Delta_t, F_t : X \to \mathbb{R}$ satisfy the equation
$$\Delta_{t+1}(x_t) = (1 - \zeta_t(x_t)) \Delta_t(x_t) + \zeta_t(x_t) F_t(x_t),$$
where $x_t \in X$ and $t = 0, 1, 2, \ldots$. Let $P_t$ be a sequence of increasing $\sigma$-fields such that $\zeta_0$ and $\Delta_0$ are $P_0$-measurable and $\zeta_t$, $\Delta_t$, and $F_{t-1}$ are $P_t$-measurable, $t = 1, 2, \ldots$. Assume that the following hold:
1. The set $X$ is finite.
2. $\zeta_t(x_t) \in [0, 1]$, $\sum_t \zeta_t(x_t) = \infty$, $\sum_t (\zeta_t(x_t))^2 < \infty$ with probability 1, and $\forall x \neq x_t : \zeta_t(x) = 0$.
3. $\| \mathbb{E}[F_t \mid P_t] \| \leq \kappa \| \Delta_t \| + c_t$, where $\kappa \in [0, 1)$ and $c_t$ converges to 0 with probability 1.
4. $\mathrm{Var}[F_t \mid P_t] \leq K (1 + \kappa \| \Delta_t \|)^2$, where $K$ is some constant.
Here $\| \cdot \|$ denotes the maximum norm. Then $\Delta_t$ converges to 0 with probability 1.

For a finite MDP setting, we maintain three tabular estimates of the value function, $Q^A$, $Q^B$, and $Q^C$, and at each time step we update all of them.

Theorem 6.2 (Clipped Triple Q-learning). Given the following conditions:
1. Each state-action pair is sampled an infinite number of times.
2. The MDP is finite.
3. $\gamma \in [0, 1)$.
4. Q-values are stored in a lookup table.
5. $Q^A$, $Q^B$, and $Q^C$ receive an infinite number of updates.
6. The learning rates satisfy $\alpha_t(s, a) \in [0, 1]$, $\sum_t \alpha_t(s, a) = \infty$, $\sum_t (\alpha_t(s, a))^2 < \infty$ with probability 1, and $\alpha_t(s, a) = 0$, $\forall (s, a) \neq (s_t, a_t)$.
7. $\mathrm{Var}[r(s, a)] < \infty$, $\forall (s, a)$.
Then clipped triple Q-learning will converge to the optimal action value function $Q^*$, as defined by the Bellman optimality equation, with probability 1.

Proof. We apply Lemma 6.1 with $P_t = \{ Q_0^A, Q_0^B, Q_0^C, s_0, a_0, \alpha_0, r_1, s_1, \ldots, s_t, a_t \}$, $X = \mathcal{S} \times \mathcal{A}$, $\zeta_t = \alpha_t$. Consider a target mapping $g : Q_t^A \times Q_t^B \times Q_t^C \mapsto q$, $q \in \mathbb{R}$. Also, without loss of generality, let $\Delta_t = Q_t^A - Q^*$. Conditions 1 and 4 of Lemma 6.1 hold by conditions 2 and 4 of the theorem, respectively. Lemma condition 2 holds by theorem condition 6 together with our choice $\zeta_t = \alpha_t$.

Defining $a^* = \arg\max_a Q^A(s_{t+1}, a)$, we have
$$\begin{aligned} \Delta_{t+1}(s_t, a_t) &= (1 - \alpha_t(s_t, a_t)) \left( Q_t^A(s_t, a_t) - Q^*(s_t, a_t) \right) \\ &\quad + \alpha_t(s_t, a_t) \left( r_t + \gamma \, g\!\left( Q_t^A(s_{t+1}, a^*), Q_t^B(s_{t+1}, a^*), Q_t^C(s_{t+1}, a^*) \right) - Q^*(s_t, a_t) \right) \\ &= (1 - \alpha_t(s_t, a_t)) \Delta_t(s_t, a_t) + \alpha_t(s_t, a_t) F_t(s_t, a_t), \end{aligned}$$
where $F_t(s_t, a_t)$ is defined as
$$\begin{aligned} F_t(s_t, a_t) &= r_t + \gamma \, g\!\left( Q_t^A(s_{t+1}, a^*), Q_t^B(s_{t+1}, a^*), Q_t^C(s_{t+1}, a^*) \right) - Q^*(s_t, a_t) \\ &= r_t + \gamma Q_t^A(s_{t+1}, a^*) - Q^*(s_t, a_t) + \gamma \, g\!\left( Q_t^A(s_{t+1}, a^*), Q_t^B(s_{t+1}, a^*), Q_t^C(s_{t+1}, a^*) \right) - \gamma Q_t^A(s_{t+1}, a^*) \\ &= F_t^Q(s_t, a_t) + c_t, \end{aligned} \tag{8}$$
where
$$F_t^Q(s_t, a_t) = r_t + \gamma Q_t^A(s_{t+1}, a^*) - Q^*(s_t, a_t)$$
and
$$c_t = \gamma \, g\!\left( Q_t^A(s_{t+1}, a^*), Q_t^B(s_{t+1}, a^*), Q_t^C(s_{t+1}, a^*) \right) - \gamma Q_t^A(s_{t+1}, a^*).$$
$F_t^Q$ denotes the value of $F_t$ under standard Q-learning. $\| \mathbb{E}[F_t^Q \mid P_t] \| \leq \gamma \| \Delta_t \|$ is known to hold because the Bellman operator is a contraction mapping. This implies that condition 3 of Lemma 6.1 holds if we can show that $c_t$ converges to 0 with probability 1. Let
$$y = r_t + \gamma \, g\!\left( Q_t^A(s_{t+1}, a^*), Q_t^B(s_{t+1}, a^*), Q_t^C(s_{t+1}, a^*) \right),$$
$$\Delta_t^{BA} = Q_t^B(s_t, a_t) - Q_t^A(s_t, a_t), \quad \Delta_t^{BC} = Q_t^B(s_t, a_t) - Q_t^C(s_t, a_t).$$
Then $c_t$ converges to 0 if both $\Delta_t^{BA}$ and $\Delta_t^{BC}$ converge to 0 with probability 1. Now,
$$\begin{aligned} \Delta_{t+1}^{BA}(s_t, a_t) &\stackrel{\text{def}}{=} Q_{t+1}^B(s_t, a_t) - Q_{t+1}^A(s_t, a_t) \\ &= \left[ Q_t^B(s_t, a_t) + \alpha_t(s_t, a_t) \left( y - Q_t^B(s_t, a_t) \right) \right] - \left[ Q_t^A(s_t, a_t) + \alpha_t(s_t, a_t) \left( y - Q_t^A(s_t, a_t) \right) \right] \\ &= \left( Q_t^B(s_t, a_t) - Q_t^A(s_t, a_t) \right) - \alpha_t(s_t, a_t) \left( Q_t^B(s_t, a_t) - Q_t^A(s_t, a_t) \right) \\ &= (1 - \alpha_t(s_t, a_t)) \Delta_t^{BA}(s_t, a_t). \end{aligned}$$
Clearly, $\Delta_t^{BA}$ converges to 0. Using similar arguments, it can be shown that $\Delta_t^{BC}$ converges to 0. These imply that condition 3 of Lemma 6.1 is fulfilled, so $Q^A(s_t, a_t)$ converges to $Q^*(s_t, a_t)$. Similarly, it can be shown that $Q^B(s_t, a_t)$ and $Q^C(s_t, a_t)$ converge to the optimal action value function by choosing $\Delta_t = Q_t^B - Q^*$ and $\Delta_t = Q_t^C - Q^*$, respectively.
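To illustrate the tabular update that the theorem analyzes, here is a minimal sketch of clipped triple Q-learning with the two target mappings $g$ used in this paper; the toy sizes and constants are illustrative assumptions:

```python
import numpy as np

n_s, n_a = 5, 3                       # toy state/action counts (illustrative)
QA, QB, QC = (np.zeros((n_s, n_a)) for _ in range(3))
gamma, lr = 0.99, 0.1

def g(qa, qb, qc, strategy="mean"):
    """Target mapping g: mean of the smaller two values, or median of all three."""
    vals = sorted([qa, qb, qc])
    return 0.5 * (vals[0] + vals[1]) if strategy == "mean" else vals[1]

def update(s, a, r, s_next, strategy="mean"):
    # a* = argmax_a Q^A(s', a); all three estimates share this action.
    a_star = int(np.argmax(QA[s_next]))
    y = r + gamma * g(QA[s_next, a_star], QB[s_next, a_star],
                      QC[s_next, a_star], strategy)
    for Q in (QA, QB, QC):            # every estimate moves toward the shared target
        Q[s, a] += lr * (y - Q[s, a])
```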
The final algorithm for OPAC is listed in Algorithm 1. It makes use of three soft Q-functions, i.e., critics, to reduce the positive bias in the policy improvement step that is known to degrade the performance of value-based methods [8, 4]. Since each of the three soft Q-functions has parameters $\phi_i$, $i = 1, 2, 3$, we train them independently to optimize $J_Q(\phi_i)$. We then use two strategies, the mean value of the smaller two critics and the median value of all three critics, for computing the stochastic gradient in Equation 4 and the policy gradient in Equation 6. It is important to note that the policy gradient is computed by gradient ascent once every two iterations, while the gradients for the soft Q-functions are computed by stochastic gradient descent in every iteration. Algorithm 1 makes use of two variables, "mean" and "median", which store the mean value and the median value according to the strategies mentioned above; the algorithm uses either "mean" or "median" in a single instance.

The entropy temperature $\alpha$ is learned automatically by minimizing the objective function in Equation 7. There are target networks for the policy (i.e., the actor) and for the three soft Q-functions (i.e., the critics). In short, there are eight deep neural networks in our algorithm: one each for the actor model and the actor target, and three each for the critic models and the critic targets. The target networks are updated by Polyak averaging once every two iterations. Gaussian noise is added to the actions $a'$ played by the actor target for target policy smoothing. This added Gaussian noise can also be termed exploration noise, and it is clipped in the algorithm to keep the target close to the original action.

Algorithm 1: OPAC
Input: Initial policy parameters $\theta$; Q-function parameters $\phi_1, \phi_2, \phi_3$; an empty replay buffer $\mathcal{D}$.
Set target parameters equal to main parameters: $\theta_{\text{target}} \leftarrow \theta$, $\phi_{\text{target},1} \leftarrow \phi_1$, $\phi_{\text{target},2} \leftarrow \phi_2$, $\phi_{\text{target},3} \leftarrow \phi_3$.
Populate the replay buffer $\mathcal{D}$.
repeat
    If $s'$ is a terminal state, reset the environment.
    if it's time to update then
        for $j$ in range(number of updates) do
            Sample a batch of transitions $B = \{(s, a, r, s', d)\}$ from $\mathcal{D}$.
            Compute target actions $a'(s') = \mathrm{clip}(\pi_{\theta_{\text{target}}}(s') + \mathrm{clip}(\epsilon, -c, c), a_{\text{low}}, a_{\text{high}})$, where $\epsilon \sim \mathcal{N}(0, \sigma)$.
            Compute the shared Q-target $y(r, s', d) = r + \gamma (1 - d) (\text{mean/median} - \alpha \log \pi_{\theta_{\text{target}}}(a' \mid s'))$.
            Update the Q-functions (critic models) using $\nabla_{\phi_i} \frac{1}{|B|} \sum_{(s, a, r, s', d) \in B} (Q_{\phi_i}(s, a) - y(r, s', d))^2$, for $i = 1, 2, 3$.
            if $j \bmod \text{policy delay} = 0$ then
                Update the policy by $\nabla_\theta \frac{1}{|B|} \sum_{s \in B} (Q_{\phi_1}(s, \pi_\theta(s)) - \alpha \log \pi_\theta(\pi_\theta(s) \mid s))$.
                Update the target networks and adjust the temperature $\alpha$ (for $i = 1, 2, 3$):
                    $\theta_{\text{target}} \leftarrow \tau \theta_{\text{target}} + (1 - \tau) \theta$
                    $\phi_{\text{target},i} \leftarrow \tau \phi_{\text{target},i} + (1 - \tau) \phi_i$
                    $\alpha \leftarrow \alpha - \lambda \nabla_\alpha J(\alpha)$
            end
        end
    end
until convergence
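A sketch of the shared Q-target step from Algorithm 1, in the same PyTorch style as before; the mean/median switch mirrors the two strategies, and tensor shapes and names are illustrative assumptions:

```python
import torch

def shared_q_target(r, s_next, d, a_next, log_pi_next,
                    target_critics, gamma, alpha, strategy="mean"):
    """y(r, s', d) = r + gamma * (1 - d) * (mean/median - alpha * log pi(a'|s'))."""
    # Stack the three target critics' evaluations: shape (3, batch, 1).
    qs = torch.stack([q(s_next, a_next) for q in target_critics])
    if strategy == "mean":
        # Mean of the smaller two Q-values = (sum - max) / 2.
        agg = (qs.sum(dim=0) - qs.max(dim=0).values) / 2.0
    else:
        # Median of the three Q-values.
        agg = qs.median(dim=0).values
    return r + gamma * (1.0 - d) * (agg - alpha * log_pi_next)
```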
Experiments

We have selected six environments, namely Ant-v2, HalfCheetah-v2, Hopper-v2, Humanoid-v2, InvertedPendulum-v2, and Walker2d-v2, for comparing the performance of OPAC with SAC and TD3. All the algorithms have been tested on MuJoCo continuous control tasks [20] interfaced through OpenAI Gym [2].

Figure 1: Learning curves on the MuJoCo continuous control environments (a) Ant-v2, (b) HalfCheetah-v2, (c) Hopper-v2, (d) Humanoid-v2, (e) InvertedPendulum-v2, and (f) Walker2d-v2. OPAC outperforms TD3 and SAC in most of the environments. The shaded region corresponds to one standard deviation.

The algorithms were run in Hopper-v2 and InvertedPendulum-v2 for one million time steps, and in Ant-v2, HalfCheetah-v2, and Walker2d-v2 for three million time steps. Humanoid-v2, the hardest and most challenging environment of all, required 10 million time steps. Figure 1 shows the total average return of evaluation rollouts during training. We train five different instances of each algorithm with the seed values 0, 200, 872, 2359, and 6574, and then plot the results by averaging over the five trials. This has been done for the sake of reliability and to make the results reproducible.

The algorithms act with a purely exploratory policy for the first 10,000 time steps. Policy evaluation is performed after every 5,000 time steps, and each evaluation step is performed over 20 episodes. The evaluation reports the mean of the cumulative reward generated in each of the 20 episodes, without discount and without any noise (starting from the start state of the environment as dictated by the seed value). The solid curves in Figure 1 correspond to the mean, and the shaded regions to one standard deviation, of the returns over the five trials. For OPAC, we include both versions, using the mean of the smaller Q-values (in red) and the median of all the Q-values (in magenta). Table 1 compares the maximum average reward obtained over the five trials of SAC, TD3, and the two variants of OPAC. The curves have been smoothed using a simple moving average as needed.

Environment         | SAC         | TD3 | OPAC (mean of the smaller 2 Q-values) | OPAC (median of the 3 Q-values)
Ant-v2              | 5384.97 ± … | …   | …                                     | …
HalfCheetah-v2      | …           | …   | …                                     | …
Hopper-v2           | 3200.85 ± … | …   | …                                     | …
Humanoid-v2         | 8676.14 ± … | …   | …                                     | …
InvertedPendulum-v2 | …           | …   | …                                     | …
Walker2d-v2         | 5969.10 ± … | …   | …                                     | …

Table 1: The maximum average return over 5 trials. The ± corresponds to one standard deviation. The highest reward in an environment has been boldfaced.

Conclusion

In this paper, we presented Opportunistic Actor-Critic (OPAC), an off-policy maximum entropy deep reinforcement learning algorithm that retains the benefits of both TD3 and SAC and also explores better owing to its use of three critics.

Our theoretical results use the soft policy iteration and automatic entropy adjustment concepts derived in [7], which were already shown to converge. We introduced the theory of clipped triple Q-learning and established its proof of convergence. Combining these results, we formulated a practical opportunistic actor-critic algorithm that can be used to train deep neural network policies in continuous state-action spaces. The model is opportunistic in both action selection and Q-updates. We empirically showed that it equals or exceeds the performance of both TD3 and SAC without any environment-specific hyper-parameter tuning. Our experiments clearly indicate that OPAC is robust and sample-efficient on easy as well as challenging tasks. As shown in Figure 1, it also has lower variance in its learning curves than SAC and TD3. Because of the simplicity of its design, OPAC can be incorporated in part into any other actor-critic algorithm.
References

[1] Bertsekas, D. P. 2000. Dynamic Programming and Optimal Control. Athena Scientific, 2nd edition. ISBN 1886529094.
[2] Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym. CoRR abs/1606.01540. URL http://arxiv.org/abs/1606.01540.
[3] Ciosek, K.; Vuong, Q.; Loftin, R.; and Hofmann, K. 2019. Better Exploration with Optimistic Actor-Critic. CoRR abs/1910.12807. URL http://arxiv.org/abs/1910.12807.
[4] Fujimoto, S.; van Hoof, H.; and Meger, D. 2018. Addressing Function Approximation Error in Actor-Critic Methods. CoRR abs/1802.09477. URL http://arxiv.org/abs/1802.09477.
[5] Gu, S.; Holly, E.; Lillicrap, T. P.; and Levine, S. 2016. Deep Reinforcement Learning for Robotic Manipulation. CoRR abs/1610.00633. URL http://arxiv.org/abs/1610.00633.
[6] Haarnoja, T.; Pong, V.; Zhou, A.; Dalal, M.; Abbeel, P.; and Levine, S. 2018. Composable Deep Reinforcement Learning for Robotic Manipulation. CoRR abs/1803.06773. URL http://arxiv.org/abs/1803.06773.
[7] Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; and Levine, S. 2018. Soft Actor-Critic Algorithms and Applications. CoRR abs/1812.05905. URL http://arxiv.org/abs/1812.05905.
[8] Hasselt, H. V. 2010. Double Q-learning. In Lafferty, J. D.; Williams, C. K. I.; Shawe-Taylor, J.; Zemel, R. S.; and Culotta, A., eds., Advances in Neural Information Processing Systems 23, 2613–2621. Curran Associates, Inc. URL http://papers.nips.cc/paper/3964-double-q-learning.pdf.
[9] Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2016. Continuous control with deep reinforcement learning. In Bengio, Y.; and LeCun, Y., eds., ICLR. URL http://dblp.uni-trier.de/db/conf/iclr/iclr2016.html.
[10] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing Atari with Deep Reinforcement Learning. CoRR abs/1312.5602. URL http://arxiv.org/abs/1312.5602.
[11] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; and Hassabis, D. 2015. Human-level control through deep reinforcement learning. Nature 518(7540): 529–533.
[12] Nachum, O.; Norouzi, M.; Xu, K.; and Schuurmans, D. 2017. Bridging the Gap Between Value and Policy Based Reinforcement Learning. CoRR abs/1702.08892. URL http://arxiv.org/abs/1702.08892.
[13] Rawlik, K.; Toussaint, M.; and Vijayakumar, S. 2013. On Stochastic Optimal Control and Reinforcement Learning by Approximate Inference (Extended Abstract). In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, IJCAI '13, 3052–3056. AAAI Press. ISBN 9781577356332.
[14] Schulman, J.; Abbeel, P.; and Chen, X. 2017. Equivalence Between Policy Gradients and Soft Q-Learning. CoRR abs/1704.06440. URL http://arxiv.org/abs/1704.06440.
[15] Schulman, J.; Levine, S.; Moritz, P.; Jordan, M. I.; and Abbeel, P. 2015. Trust Region Policy Optimization. CoRR abs/1502.05477. URL http://arxiv.org/abs/1502.05477.
[16] Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal Policy Optimization Algorithms. CoRR abs/1707.06347. URL http://arxiv.org/abs/1707.06347.
[17] Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; Dieleman, S.; Grewe, D.; Nham, J.; Kalchbrenner, N.; Sutskever, I.; Lillicrap, T.; Leach, M.; Kavukcuoglu, K.; Graepel, T.; and Hassabis, D. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529(7587): 484–489.
[18] Singh, S.; Jaakkola, T.; Littman, M. L.; and Szepesvári, C. 2000. Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms. Mach. Learn. 38(3): 287–308.
[19] Todorov, E. 2008. General duality between optimal control and estimation. In 2008 47th IEEE Conference on Decision and Control, 4286–4292.
[20] Todorov, E.; Erez, T.; and Tassa, Y. 2012. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 5026–5033.
[21] Toussaint, M. 2009. Robot Trajectory Optimization Using Approximate Inference. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, 1049–1056. New York, NY, USA: Association for Computing Machinery. ISBN 9781605585161. doi:10.1145/1553374.1553508. URL https://doi.org/10.1145/1553374.1553508.
[22] Ziebart, B. D. 2010. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. Ph.D. thesis, Carnegie Mellon University, USA.
[23] Ziebart, B. D.; Maas, A. L.; Bagnell, J. A.; and Dey, A. K. 2008. Maximum Entropy Inverse Reinforcement Learning. In Fox, D.; and Gomes, C. P., eds., Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, 1433–1438. AAAI Press.