Soft Actor-Critic for Discrete Action Settings
A Preprint
Petros Christodoulou
Imperial College London
October 21, 2019

Abstract
Soft Actor-Critic is a state-of-the-art reinforcement learning algorithm for continuous action settings that is not applicable to discrete action settings. Many important settings involve discrete actions, however, and so here we derive an alternative version of the Soft Actor-Critic algorithm that is applicable to discrete action settings. We then show that, even without any hyperparameter tuning, it is competitive with the tuned model-free state-of-the-art on a selection of games from the Atari suite.

Reinforcement Learning (RL) has famously made great progress in recent years, successfully being applied to settings such as board games [Silver et al., 2017], video games [Mnih et al., 2015] and robot tasks [OpenAI et al., 2018]. However, widespread adoption of RL in real-world domains has remained slow, primarily because of its poor sample efficiency, which Wu et al. (2017) see as a "dominant concern in RL".

Haarnoja et al. (2018) provide the Soft Actor-Critic (SAC) algorithm, which helps deal with this concern in continuous action settings. It has achieved model-free state-of-the-art sample efficiency in multiple challenging continuous control domains. Many domains, however, involve discrete rather than continuous actions, and in these environments SAC is not currently applicable. This paper derives a version of SAC that is applicable to discrete action domains and then shows that it is competitive with the model-free state-of-the-art for discrete action domains in terms of sample efficiency on a selection of games from the Atari [Bellemare et al., 2013] suite.

We proceed as follows: first we explain the derivation of Soft Actor-Critic for continuous action settings found in Haarnoja et al. (2018) and Haarnoja et al. (2019), then we derive and explain the changes required to create a discrete action version of the algorithm, and finally we test the discrete action algorithm on the Atari suite.
Soft Actor-Critic [Haarnoja et al., 2018] attempts to find a policy that maximises the maximum entropy objective:

$$\pi^* = \operatorname*{argmax}_{\pi} \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \tau_\pi} \left[ \gamma^t \left( r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t)) \right) \right] \tag{1}$$

where $\pi$ is a policy, $\pi^*$ is the optimal policy, $T$ is the number of timesteps, $r : S \times A \to \mathbb{R}$ is the reward function, $\gamma \in [0, 1]$ is the discount rate, $s_t \in S$ is the state at timestep $t$, $a_t \in A$ is the action at timestep $t$, $\tau_\pi$ is the distribution of trajectories induced by policy $\pi$, $\alpha$ determines the relative importance of the entropy term versus the reward and is called the temperature parameter, and $\mathcal{H}(\pi(\cdot \mid s_t))$ is the entropy of the policy $\pi$ at state $s_t$, calculated as $\mathcal{H}(\pi(\cdot \mid s_t)) = \mathbb{E}_{a \sim \pi(\cdot \mid s_t)}[-\log \pi(a \mid s_t)]$.

To maximise the objective the authors use soft policy iteration, which is a method of alternating between policy evaluation and policy improvement within the maximum entropy framework.

The policy evaluation step involves computing the value of policy $\pi$. To do this they first define the soft state value function as:

$$V(s_t) := \mathbb{E}_{a_t \sim \pi} \left[ Q(s_t, a_t) - \alpha \log(\pi(a_t \mid s_t)) \right] \tag{2}$$

They then prove that in a tabular setting (i.e. when the state space is discrete) we can obtain the soft Q-function by starting from a randomly initialised function $Q : S \times A \to \mathbb{R}$ and repeatedly applying the modified Bellman backup operator $T^\pi$ given by:

$$T^\pi Q(s_t, a_t) := r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p(s_t, a_t)} \left[ V(s_{t+1}) \right] \tag{3}$$

where $p : S \times A \to S$ gives the distribution over the next state given the current state and action.

In the continuous state (instead of tabular) setting they explain that we instead first parameterise the soft Q-function $Q_\theta(s_t, a_t)$ using a neural network with parameters $\theta$.
Then we train the soft Q-function to minimise the soft Bellman residual:

$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim D} \left[ \frac{1}{2} \left( Q_\theta(s_t, a_t) - \left( r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p(s_t, a_t)} \left[ V_{\bar{\theta}}(s_{t+1}) \right] \right) \right)^2 \right] \tag{4}$$

where $D$ is a replay buffer of past experiences and $V_{\bar{\theta}}(s_{t+1})$ is estimated using a target network for $Q$ and a Monte Carlo estimate of (2) after sampling experiences from the replay buffer.

The policy improvement step then involves updating the policy in a direction that maximises the rewards it will achieve. To do this they use the soft Q-function calculated in the policy evaluation step to guide changes to the policy. Specifically, they update the policy towards the exponential of the new soft Q-function. Because they also want the policy to be tractable, however, they restrict the possible policies to a parameterised family of distributions (e.g. Gaussian). To account for this, after updating the policy towards the exponential of the soft Q-function they then project it back into the space of acceptable policies using the information projection defined in terms of Kullback-Leibler divergence. So overall the policy improvement step is given by:

$$\pi_{new} = \operatorname*{argmin}_{\pi \in \Pi} D_{KL} \left( \pi(\cdot \mid s_t) \,\middle\|\, \frac{\exp\left( \frac{1}{\alpha} Q^{\pi_{old}}(s_t, \cdot) \right)}{Z^{\pi_{old}}(s_t)} \right) \tag{5}$$

They note that the partition function $Z^{\pi_{old}}(s_t)$ is intractable but does not contribute to the gradient with respect to the new policy and so it can be ignored.

In the continuous state setting they parameterise the policy $\pi_\phi(a_t \mid s_t)$ using a neural network with parameters $\phi$ that outputs a mean and covariance that is then used to define a Gaussian policy.
The policy parameters are then learned by minimizing the expected KL-divergence (5) after multiplying by the temperature parameter $\alpha$ and ignoring the partition function $Z^{\pi_{old}}(s_t)$ as it does not impact the gradient:

$$J_\pi(\phi) = \mathbb{E}_{s_t \sim D} \left[ \mathbb{E}_{a_t \sim \pi_\phi} \left[ \alpha \log(\pi_\phi(a_t \mid s_t)) - Q_\theta(s_t, a_t) \right] \right] \tag{6}$$

This involves taking an expectation over the policy's output distribution, which means errors cannot be backpropagated in the normal way. To deal with this they use the reparameterisation trick [Kingma and Welling, 2013]: instead of using the output of the policy network to form a stochastic action distribution directly, they combine its output with an input noise vector sampled from a spherical Gaussian. For example, in the one-dimensional case our network outputs a mean $m$ and standard deviation $s$. We could randomly sample our action directly, $a \sim \mathcal{N}(m, s^2)$, but then we could not backpropagate the errors through this operation. So instead we do $a = m + s\epsilon$ where $\epsilon \sim \mathcal{N}(0, 1)$, which allows us to backpropagate as normal. To signify that they are reparameterising the policy in this way they write:

$$a_t = f_\phi(\epsilon_t; s_t) \tag{7}$$

where $\epsilon_t \sim \mathcal{N}(0, I)$. The new policy objective then becomes:

$$J_\pi(\phi) = \mathbb{E}_{s_t \sim D, \epsilon_t \sim \mathcal{N}} \left[ \alpha \log(\pi_\phi(f_\phi(\epsilon_t; s_t) \mid s_t)) - Q_\theta(s_t, f_\phi(\epsilon_t; s_t)) \right] \tag{8}$$

where $\pi_\phi$ is now defined implicitly in terms of $f_\phi$. They then go on to prove that in the tabular setting, alternating between policy evaluation and policy improvement as above will converge to the optimal policy.

Haarnoja et al. (2019) also provide an optional way of learning the temperature parameter so that we do not need to set it as a hyperparameter. They provide a long derivation for the temperature objective value; however, because the details are not strictly relevant for our derivation of the discrete action version of SAC we do not repeat it here.
The final objective they get to for the temperature parameter is however relevant and given by:

$$J(\alpha) = \mathbb{E}_{a_t \sim \pi_t} \left[ -\alpha \left( \log \pi_t(a_t \mid s_t) + \bar{H} \right) \right] \tag{9}$$

where $\bar{H}$ is a constant vector equal to the hyperparameter representing the target entropy. They are unable to minimise this expression directly because of the expectation operator and so instead they minimise a Monte Carlo estimate of it after sampling experiences from the replay buffer.

Lastly, in practice the authors maintain two separately trained soft Q-networks and then use the minimum of their two outputs to be the soft Q-network output in the above objectives. They do this because Fujimoto, van Hoof, and Meger (2018) showed that it helps combat state-value overestimation.

We now derive a discrete action version of the above SAC algorithm. The first thing to note is that all the critical steps involved in deriving the objectives above hold whether the actions are continuous or discrete. All that changes is that $\pi_\phi(a_t \mid s_t)$ now outputs a probability instead of a density. Therefore the three objective functions $J_Q(\theta)$ (4), $J_\pi(\phi)$ (6) and $J(\alpha)$ (9) still hold. We must however make five important changes to the process of optimising these objective functions:

i) It is now more efficient to have the soft Q-function output the Q-value of each possible action rather than simply the action provided as an input, i.e. our Q-function moves from $Q : S \times A \to \mathbb{R}$ to $Q : S \to \mathbb{R}^{|A|}$. This was not possible before when there were infinitely many possible actions we could take.

ii) There is now no need for our policy to output the mean and covariance of our action distribution; instead it can directly output our action distribution.
The policy therefore changes from $\pi : S \to \mathbb{R}^{2|A|}$ to $\pi : S \to [0, 1]^{|A|}$ where now we are applying a softmax function in the final layer of the policy to ensure it outputs a valid probability distribution.

iii) Before, in order to minimise the soft Q-function cost $J_Q(\theta)$ (4) we had to plug in our sampled actions from the replay buffer to form a Monte Carlo estimate of the soft state-value function (2). This was because estimating the soft state-value function in (2) involved taking an expectation over the action distribution. However, now, because our action set is discrete we can fully recover the action distribution and so there is no need to form a Monte Carlo estimate and instead we can calculate the expectation directly. This change should reduce the variance involved in our estimate of the objective $J_Q(\theta)$ (4). This means that we change our soft state-value calculation equation from (2) to:

$$V(s_t) := \pi(s_t)^T \left[ Q(s_t) - \alpha \log(\pi(s_t)) \right] \tag{10}$$

iv) Similarly, we can make the same change to our calculation of the temperature loss to also reduce the variance of that estimate. The temperature objective changes from (9) to:

$$J(\alpha) = \pi_t(s_t)^T \left[ -\alpha \left( \log(\pi_t(s_t)) + \bar{H} \right) \right] \tag{11}$$

v) Before, to minimise $J_\pi(\phi)$ (6) we had to use the reparameterisation trick to allow gradients to pass through the expectations operator. However, now that our policy outputs the exact action distribution we are able to calculate the expectation directly.
Therefore there is no need for the reparameterisation trick and the new objective for the policy changes from (8) to:

$$J_\pi(\phi) = \mathbb{E}_{s_t \sim D} \left[ \pi_t(s_t)^T \left[ \alpha \log(\pi_\phi(s_t)) - Q_\theta(s_t) \right] \right] \tag{12}$$

Combining all these changes, our algorithm for SAC with discrete actions (SAC-Discrete) is given by Algorithm 1.

Algorithm 1 Soft Actor-Critic with Discrete Actions (SAC-Discrete)

Initialise $Q_{\theta_1} : S \to \mathbb{R}^{|A|}$, $Q_{\theta_2} : S \to \mathbb{R}^{|A|}$, $\pi_\phi : S \to [0, 1]^{|A|}$    ▷ Initialise local networks
Initialise $\bar{Q}_{\theta_1} : S \to \mathbb{R}^{|A|}$, $\bar{Q}_{\theta_2} : S \to \mathbb{R}^{|A|}$    ▷ Initialise target networks
$\bar{\theta}_1 \leftarrow \theta_1$, $\bar{\theta}_2 \leftarrow \theta_2$    ▷ Equalise target and local network weights
$D \leftarrow \emptyset$    ▷ Initialize an empty replay buffer
for each iteration do
    for each environment step do
        $a_t \sim \pi_\phi(a_t \mid s_t)$    ▷ Sample action from the policy
        $s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)$    ▷ Sample transition from the environment
        $D \leftarrow D \cup \{(s_t, a_t, r(s_t, a_t), s_{t+1})\}$    ▷ Store the transition in the replay buffer
    for each gradient step do
        $\theta_i \leftarrow \theta_i - \lambda_Q \hat{\nabla}_{\theta_i} J_Q(\theta_i)$ for $i \in \{1, 2\}$    ▷ Update the Q-function parameters
        $\phi \leftarrow \phi - \lambda_\pi \hat{\nabla}_\phi J_\pi(\phi)$    ▷ Update policy weights
        $\alpha \leftarrow \alpha - \lambda \hat{\nabla}_\alpha J(\alpha)$    ▷ Update temperature
        $\bar{Q}_i \leftarrow \tau Q_i + (1 - \tau) \bar{Q}_i$ for $i \in \{1, 2\}$    ▷ Update target network weights
Output $\theta_1$, $\theta_2$, $\phi$    ▷ Optimized parameters
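To make the discrete-action changes concrete, the following is a minimal NumPy sketch of the quantities in equations (10), (11) and (12), together with the target used inside the soft Bellman residual (4). It is an illustration under stated assumptions rather than the implementation used for the experiments: the function names, the small constant added inside the logarithm, the `done` flag that masks terminal transitions, and the averaging over a batch are all assumptions introduced here, and no gradients are computed.

```python
import numpy as np

EPS = 1e-8  # assumed constant to avoid log(0); not part of the derivation


def soft_state_value(probs, q_values, alpha):
    """Equation (10): V(s) = pi(s)^T [Q(s) - alpha * log pi(s)], per state."""
    log_probs = np.log(probs + EPS)
    return np.sum(probs * (q_values - alpha * log_probs), axis=-1)


def q_target(reward, done, next_probs, next_q1, next_q2, alpha, gamma):
    """Target inside the soft Bellman residual (4).

    next_q1 and next_q2 are the two target-network outputs; their
    elementwise minimum is used, as described in the text. The `done`
    mask is an assumption for handling episode ends.
    """
    next_q = np.minimum(next_q1, next_q2)
    next_v = soft_state_value(next_probs, next_q, alpha)
    return reward + gamma * (1.0 - done) * next_v


def policy_loss(probs, q1, q2, alpha):
    """Equation (12): exact expectation over actions, averaged over the batch."""
    log_probs = np.log(probs + EPS)
    q_min = np.minimum(q1, q2)
    return np.mean(np.sum(probs * (alpha * log_probs - q_min), axis=-1))


def temperature_loss(probs, alpha, target_entropy):
    """Equation (11): exact expectation over actions, averaged over the batch."""
    log_probs = np.log(probs + EPS)
    per_state = np.sum(probs * (-alpha * (log_probs + target_entropy)), axis=-1)
    return np.mean(per_state)


# Example with a batch of 2 states and 3 actions (values are arbitrary).
probs = np.array([[0.2, 0.5, 0.3], [0.6, 0.3, 0.1]])
q1 = np.array([[1.0, 2.0, 0.5], [0.0, 1.5, 1.0]])
q2 = np.array([[1.2, 1.8, 0.4], [0.1, 1.4, 1.1]])
targets = q_target(np.array([1.0, 0.0]), np.array([0.0, 1.0]),
                   probs, q1, q2, alpha=0.2, gamma=0.99)
```

Because every function sums over the full action distribution, no action sampling or reparameterisation is needed, which is the point of changes iii) to v). In a real training loop these quantities would be computed with an automatic differentiation library so that the gradient steps of Algorithm 1 can be taken.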
[Figure 1: horizontal bar chart titled "SAC-Discrete vs. Rainbow Relative Performance"; x-axis: Relative Performance. Bars from top to bottom: Freeway +4330%, MsPacman +90%, Enduro +47%, BattleZone +30%, Qbert +19%, Space Invaders +19%, Beam Rider +18%, Assault +17%, JamesBond +11%, Seaquest +3%, Asterix -5%, Kangaroo -24%, Alien -25%, Road Runner -42%, Frostbite -58%, Amidar -62%, Crazy Climber -71%, Breakout -78%, UpNDown -81%, Pong -99%.]
Figure 1: Comparing SAC-Discrete to Rainbow for 20 Atari games. The graph shows the average relative performance of SAC-Discrete over Rainbow over 5 random seeds where evaluation scores are calculated at the end of 100,000 steps of training. Note that no hyperparameter tuning was done to support the SAC scores compared to the Rainbow scores, which benefited from substantial hyperparameter tuning.
To test the effectiveness of SAC-Discrete we run it for 100,000 steps on 20 Atari games for 5 random seeds each and compare its results with Rainbow, which is a state-of-the-art model-free algorithm in terms of sample efficiency. The games vary significantly and were chosen a priori, and so we believe the results on these 20 games are a good estimate for relative performance on the whole Atari suite of 49 games. We chose to run the algorithm for 100,000 steps because we are most interested in sample efficiency and Kaiser et al. (2019) demonstrated that Rainbow can make significant progress on Atari games within 100,000 steps.

For SAC-Discrete we did no hyperparameter tuning and instead used a mixture of the hyperparameters found in Haarnoja et al. (2019) and Kaiser et al. (2019). The hyperparameters can be found in Appendix B. The Rainbow results we compare to come from Kaiser et al. (2019) and, as they explain, were the result of a significant amount of hyperparameter tuning. Therefore we are comparing the tuned Rainbow algorithm to our untuned SAC algorithm and so it is highly likely the relative performance of SAC could be improved if we spent time tuning its hyperparameters.

We find that SAC-Discrete achieves a better score than Rainbow in 10 out of 20 games with a median performance of -1%, maximum performance of +4330% and minimum of -99%. Figure 1 summarises the results and Appendix A provides them in a table. Overall, we therefore consider the SAC-Discrete algorithm as roughly competitive with the model-free state-of-the-art on the Atari suite in terms of sample efficiency.
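The summary statistics quoted above can be checked against the per-game percentages shown in Figure 1. The short sketch below does this arithmetic; it assumes that the game names and percentages in the figure appear in the same order, and the variable names are illustrative only.

```python
import statistics

# Relative performance of SAC-Discrete over Rainbow, in percent, as read
# from Figure 1 (assumes names and values are listed in matching order).
relative_performance = {
    "Freeway": 4330, "MsPacman": 90, "Enduro": 47, "BattleZone": 30,
    "Qbert": 19, "Space Invaders": 19, "Beam Rider": 18, "Assault": 17,
    "JamesBond": 11, "Seaquest": 3, "Asterix": -5, "Kangaroo": -24,
    "Alien": -25, "Road Runner": -42, "Frostbite": -58, "Amidar": -62,
    "Crazy Climber": -71, "Breakout": -78, "UpNDown": -81, "Pong": -99,
}

values = list(relative_performance.values())
games_better = sum(v > 0 for v in values)   # games where SAC-Discrete beats Rainbow
median_perf = statistics.median(values)     # mean of the two middle values (-5 and +3)
best, worst = max(values), min(values)

print(f"better in {games_better} of {len(values)} games")
print(f"median {median_perf:+.0f}%, max {best:+d}%, min {worst:+d}%")
```

With an even number of games the median is the average of the tenth and eleventh ranked values, which is why a median of -1% is consistent with SAC-Discrete winning exactly half of the games.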
The original Soft Actor-Critic algorithm achieved state-of-the-art results on numerous continuous action settings but was not applicable to discrete action settings. To correct this we have derived a version of the algorithm called SAC-Discrete that is applicable to discrete action settings and have shown that it performs competitively with the model-free state-of-the-art on the Atari suite even without any hyperparameter tuning. We provide a Python implementation of the algorithm at the project's GitHub repository.

References
Bellemare, M., Y. Naddaf, J. Veness, and M. Bowling (2013). "The Arcade Learning Environment: An Evaluation Platform for General Agents". In: Journal of Artificial Intelligence Research.

Castro, P., S. Moitra, C. Gelada, S. Kumar, and M. Bellemare (2018). "Dopamine: A Research Framework for Deep Reinforcement Learning". In: arXiv preprint.

Fujimoto, S., H. van Hoof, and D. Meger (2018). "Addressing Function Approximation Error in Actor-Critic Methods". In: arXiv preprint.

Haarnoja, T., A. Zhou, P. Abbeel, and S. Levine (2018). "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor". In: International Conference on Learning Representations.

Haarnoja, T., A. Zhou, K. Hartikainen, G. Tucker, J. Tan, S. Ha, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, and S. Levine (2019). "Soft Actor-Critic Algorithms and Applications". In: arXiv preprint.

Kaiser, L., M. Babaeizadeh, P. Milos, B. Osinski, R. Campbell, K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine, A. Mohiuddin, R. Sepassi, G. Tucker, and H. Michalewski (2019). "Model Based Reinforcement Learning for Atari". In: arXiv preprint.

Kingma, D. and M. Welling (2013). "Auto-Encoding Variational Bayes". In: International Conference on Learning Representations.

Mnih, V., K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. Bellemare, A. Graves, M. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015). "Human-Level Control Through Deep Reinforcement Learning". In: Nature.

OpenAI, Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Józefowicz, Bob McGrew, Jakub W. Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, Jonas Schneider, Szymon Sidor, Josh Tobin, Peter Welinder, Lilian Weng, and Wojciech Zaremba (2018). "Learning Dexterous In-Hand Manipulation". In: ArXiv abs/1808.00177.

Silver, D., J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. Driessche, T. Graepel, and D. Hassabis (2017). "Mastering the Game of Go Without Human Knowledge". In: Nature.

Wu, Y., E. Mansimov, S. Liao, R. Grosse, and J. Ba (2017). "Scalable Trust-Region Method for Deep Reinforcement Learning Using Kronecker-Factored Approximation". In: Advances in Neural Information Processing Systems.

For two games (Enduro and SpaceInvaders) Kaiser et al. (2019) provide no results for Rainbow and so for these two games only we ran the Rainbow algorithm ourselves. We used the Dopamine [Castro et al., 2018] codebase to do this (as Kaiser et al. (2019) also did) along with the same (tuned) hyperparameters used in Kaiser et al. (2019). We share the code used to do this in the colaboratory notebook: https://colab.research.google.com/drive/11prfRfM5qrMsfUXV6cGY868HtwDphxaF

https://github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch

Appendix

A SAC and Rainbow Atari Results

Table 1: SAC and Rainbow results on 20 Atari games. The mean SAC result of 5 random seeds is shown with the standard deviation in brackets. As a benchmark we also provide a column indicating the score an agent would get if it acted purely randomly. The Rainbow results come from Kaiser et al., 2019.
Game             Random    Rainbow    SAC
Freeway          0.0       0.1        … (9.…)
MsPacman         235.2     364.3      … (141.…)
Enduro           0.0       0.53       … (0.…)
BattleZone       2895.0    3363.5     … (1163.…)
Qbert            166.1     235.6      … (124.…)
Space Invaders   148.0     135.1      … (17.…)
Beam Rider       372.1     365.6      … (44.…)
Assault          233.7     300.3      … (40.…)
JamesBond        29.2      61.7       … (26.…)
Seaquest         61.1      206.3      … (59.…)
Asterix          248.8     285.7      … (33.…)
Kangaroo         42.0      38.7       … (55.…)
Alien            184.8     290.6      … (43.…)
Road Runner      0.0       524.1      … (557.…)
Frostbite        74.0      140.1      … (16.…)
Amidar           11.8      20.8       … (5.…)
Crazy Climber    7339.5    12558.3    … (600.…)
Breakout         0.9       3.3        … (0.…)
UpNDown          488.4     1346.3     … (176.…)
Pong             -20.4     -19.5      -… (0.…)

B SAC-Discrete Hyperparameters
Table 2: Hyperparameters used for SAC-Discrete results