Multi-Pass Q-Networks for Deep Reinforcement Learning with Parameterised Action Spaces
Craig J. Bester, Steven D. James (University of the Witwatersrand, Johannesburg) and George D. Konidaris (Brown University, Providence RI)
Abstract
Parameterised actions in reinforcement learning are composed of discrete actions with continuous action-parameters. This provides a framework for solving complex domains that require combining high-level actions with flexible control. The recent P-DQN algorithm extends deep Q-networks to learn over such action spaces. However, it treats all action-parameters as a single joint input to the Q-network, invalidating its theoretical foundations. We analyse the issues with this approach and propose a novel method—multi-pass deep Q-networks, or MP-DQN—to address them. We empirically demonstrate that MP-DQN significantly outperforms P-DQN and other previous algorithms in terms of data efficiency and converged policy performance on the Platform, Robot Soccer Goal, and Half Field Offense domains.
Introduction

Reinforcement learning (RL), and deep RL in particular, have demonstrated remarkable success in solving tasks that require either discrete actions, such as Atari [Mnih et al., 2015], or continuous actions, such as robot control [Schulman et al., 2015; Lillicrap et al., 2016]. Reinforcement learning with parameterised actions [Masson et al., 2016] that combine discrete actions with continuous action-parameters has recently emerged as an additional setting of interest, allowing agents to learn flexible behaviour in tasks such as 2D robot soccer [Hausknecht and Stone, 2016a; Hussein et al., 2018], simulated human-robot interaction [Khamassi et al., 2017], and terrain-adaptive bipedal and quadrupedal locomotion [Peng et al., 2016].

There are two main approaches to learning with parameterised actions: alternate between optimising the discrete actions and continuous action-parameters separately [Masson et al., 2016; Khamassi et al., 2017], or collapse the parameterised action space into a continuous one [Hausknecht and Stone, 2016a]. Both of these approaches fail to fully exploit the structure present in parameterised action problems. The former does not share information between the action and action-parameter policies, while the latter does not take into account which action-parameter is associated with which action, or even which discrete action is executed by the agent. More recently, Xiong et al. [2018] introduced P-DQN, a method for learning behaviours directly in the parameterised action space. This leverages the distinct nature of the action space and is the current state-of-the-art algorithm on 2D robot soccer and King of Glory, a multiplayer online battle arena game. However, the formulation of the approach is flawed due to the dependence of the discrete action values on all action-parameters, not only those associated with each action. In this paper, we show how the above issue leads to suboptimal decision-making. We then introduce a novel multi-pass method to separate action-parameters, and demonstrate that the resulting algorithm—MP-DQN—outperforms existing methods on the Platform, Robot Soccer Goal, and Half Field Offense domains.
Background

Parameterised action spaces [Masson et al., 2016] consist of a set of discrete actions, $\mathcal{A}_d = [K] = \{k_1, k_2, \ldots, k_K\}$, where each $k$ has a corresponding continuous action-parameter $x_k \in \mathcal{X}_k \subseteq \mathbb{R}^{m_k}$ with dimensionality $m_k$. This can be written as

$$\mathcal{A} = \bigcup_{k \in [K]} \left\{ a_k = (k, x_k) \mid x_k \in \mathcal{X}_k \right\}. \quad (1)$$

We consider environments modelled as a Parameterised Action Markov Decision Process (PAMDP) [Masson et al., 2016]. For a PAMDP $M = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$: $\mathcal{S}$ is the set of all states, $\mathcal{A}$ is the parameterised action space, $P(s' \mid s, k, x_k)$ is the Markov state transition probability function, $R(s, k, x_k, s')$ is the reward function, and $\gamma \in [0, 1)$ is the future reward discount factor. An action policy $\pi : \mathcal{S} \to \mathcal{A}$ maps states to actions, typically with the aim of maximising Q-values $Q(s, a)$, which give the expected discounted return of executing action $a$ in state $s$ and following the current policy thereafter.

The Q-PAMDP algorithm [Masson et al., 2016] alternates between learning a discrete action policy with fixed action-parameters using Sarsa($\lambda$) [Sutton and Barto, 1998] with the Fourier basis [Konidaris et al., 2011], and optimising the continuous action-parameters using episodic Natural Actor Critic (eNAC) [Peters and Schaal, 2008] while the discrete action policy is kept fixed. Hausknecht and Stone [2016a] apply artificial neural networks and the Deep Deterministic Policy Gradients (DDPG) algorithm [Lillicrap et al., 2016] to parameterised action spaces by treating both the discrete actions and their action-parameters as a joint continuous action vector. This can be seen as relaxing the parameterised action space (Equation 1) into a continuous one:

$$\mathcal{A} = \left\{ (f_1, \ldots, f_K, x_1, \ldots, x_K) \mid f_k \in \mathbb{R},\ x_k \in \mathcal{X}_k\ \forall k \in [K] \right\}, \quad (2)$$

where $f_1, f_2, \ldots, f_K$ are continuous values in $[-1, 1]$. An $\epsilon$-greedy or softmax policy is then used to select discrete actions. However, not only does this fail to exploit the disjoint nature of different parameterised actions, but optimising over the joint action and action-parameter space can result in premature convergence to suboptimal policies, as occurred in experiments by Masson et al. [2016]. We henceforth refer to the algorithm used by Hausknecht and Stone as PA-DDPG.

Figure 1: The P-DQN network architecture [Xiong et al., 2018]. Note that the joint action-parameter vector x is fed into the Q-network.

Unlike previous approaches, Xiong et al. [2018] introduce a method that operates in the parameterised action space directly by combining DQN and DDPG. Their P-DQN algorithm achieves state-of-the-art performance using a Q-network to approximate Q-values used for discrete action selection, in addition to providing critic gradients for an actor network that determines the continuous action-parameter values for all actions. By framing the problem as a PAMDP directly, rather than alternating between discrete and continuous action MDPs as with Q-PAMDP, or using a joint continuous action MDP as with PA-DDPG, P-DQN necessitates a change to the Bellman equation to incorporate continuous action-parameters:

$$Q(s, k, x_k) = \mathbb{E}_{r, s'}\left[ r + \gamma \max_{k'} \sup_{x_{k'} \in \mathcal{X}_{k'}} Q(s', k', x_{k'}) \;\middle|\; s, k, x_k \right]. \quad (3)$$
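Before moving on, a short Python sketch of the parameterised action structure in Equation (1): a discrete index paired with its own continuous parameter vector. The class, the bounds, and the three-action example are illustrative assumptions and are not taken from any of the benchmark domains.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ParameterisedAction:
    """An element a_k = (k, x_k) of the action space in Equation (1)."""
    k: int               # discrete action index in [K]
    x_k: np.ndarray      # continuous action-parameter, x_k in X_k, a subset of R^{m_k}

# Hypothetical parameter bounds for K = 3 actions with 1-, 1- and 2-dimensional
# action-parameters; the shapes and ranges are purely for illustration.
PARAM_BOUNDS = [
    (np.array([-1.0]), np.array([1.0])),
    (np.array([-1.0]), np.array([1.0])),
    (np.array([-1.0, -1.0]), np.array([1.0, 1.0])),
]

def sample_action(rng: np.random.Generator) -> ParameterisedAction:
    """Uniformly sample a parameterised action: first pick k, then draw x_k from X_k."""
    k = int(rng.integers(len(PARAM_BOUNDS)))
    low, high = PARAM_BOUNDS[k]
    return ParameterisedAction(k=k, x_k=rng.uniform(low, high))
```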
To avoid the computationally intractable calculation of the supremum over $\mathcal{X}_{k'}$, Xiong et al. [2018] state that when the Q function is fixed, one can view $\operatorname{argsup}_{x_k \in \mathcal{X}_k} Q(s, k, x_k)$ as a function $x_k^Q : \mathcal{S} \to \mathcal{X}_k$ for any state $s \in \mathcal{S}$ and $k \in [K]$. This allows the Bellman equation to be rewritten as:

$$Q(s, k, x_k) = \mathbb{E}_{r, s'}\left[ r + \gamma \max_{k'} Q\!\left(s', k', x_{k'}^Q(s')\right) \;\middle|\; s, k, x_k \right]. \quad (4)$$

P-DQN uses a deep neural network with parameters $\theta_Q$ to represent $Q(s, k, x_k; \theta_Q)$, and a second deterministic actor network with parameters $\theta_x$ to represent the action-parameter policy $x_k(s; \theta_x) : \mathcal{S} \to \mathcal{X}_k$, an approximation of $x_k^Q(s)$. With this formulation it is easy to apply the standard DQN approach of minimising the mean-squared Bellman error to update the Q-network using minibatches sampled from replay memory $D$ [Mnih et al., 2015], replacing $a$ with $(k, x_k)$:

$$L_Q(\theta_Q) = \mathbb{E}_{(s, k, x_k, r, s') \sim D}\left[ \left( y - Q(s, k, x_k; \theta_Q) \right)^2 \right], \quad (5)$$

where $y = r + \gamma \max_{k' \in [K]} Q(s', k', x_{k'}(s'; \theta_x); \theta_Q)$ is the update target derived from Equation (4). Then, the loss for the actor network in P-DQN is given by the negative sum of Q-values:

$$L_x(\theta_x) = \mathbb{E}_{s \sim D}\left[ -\sum_{k=1}^{K} Q\!\left(s, k, x_k(s; \theta_x); \theta_Q\right) \right]. \quad (6)$$

Although this choice of loss function was not motivated by Xiong et al. [2018], it resembles the deterministic policy gradient loss used by PA-DDPG, where a scalar critic value is used over all action-parameters [Hausknecht and Stone, 2016a]. During updates, the estimated Q-values are backpropagated through the critic to the actor, producing gradients indicating how the action-parameters should be updated to increase the Q-values.
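As a rough sketch of how these updates could be implemented, the PyTorch code below defines a joint-input Q-network and a deterministic actor, and computes the losses of Equations (5) and (6) from a sampled minibatch. The layer sizes, tensor shapes, and terminal-state handling are our own assumptions rather than the exact implementation details of P-DQN.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    """Q(s, ., x; theta_Q): state and joint action-parameter vector in, K Q-values out."""
    def __init__(self, state_dim, param_dim, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + param_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions))

    def forward(self, state, action_params):
        return self.net(torch.cat([state, action_params], dim=-1))

class ParamActor(nn.Module):
    """x(s; theta_x): deterministic action-parameters for all K actions at once."""
    def __init__(self, state_dim, param_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, param_dim), nn.Tanh())  # outputs bounded to [-1, 1]

    def forward(self, state):
        return self.net(state)

def pdqn_losses(batch, q_net, q_target, actor, actor_target, gamma=0.99):
    """Q-network loss (Equation 5) and actor loss (Equation 6) for one minibatch."""
    s, k, x, r, s_next, done = batch  # x is the stored joint action-parameter vector

    # Target y = r + gamma * max_k' Q(s', k', x_k'(s'; theta_x); theta_Q)  (Equation 4)
    with torch.no_grad():
        next_q = q_target(s_next, actor_target(s_next)).max(dim=1).values
        y = r + gamma * (1.0 - done) * next_q

    # Equation (5): mean-squared Bellman error on the Q-value of the sampled action k
    q_pred = q_net(s, x).gather(1, k.long().unsqueeze(1)).squeeze(1)
    loss_q = F.mse_loss(q_pred, y)

    # Equation (6): negative sum of Q-values under the current action-parameter policy
    loss_actor = -q_net(s, actor(s)).sum(dim=1).mean()
    return loss_q, loss_actor
```

In practice loss_q would be minimised with respect to the Q-network parameters and loss_actor with respect to the actor parameters, each with its own optimiser.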
Multi-Pass Q-Networks

The P-DQN architecture inputs the joint action-parameter vector over all actions to the Q-network, as illustrated in Figure 1. This was pointed out by Xiong et al. [2018], but they did not discuss it further. While this may seem like an inconsequential implementation detail, it changes the formulation of the Bellman equation used for parameterised actions (Equation 4), since each Q-value is a function of the joint action-parameter vector $x = (x_1, \ldots, x_K)$, rather than only the action-parameter $x_k$ corresponding to the associated action:

$$Q(s, k, x) = \mathbb{E}_{r, s'}\left[ r + \gamma \max_{k'} Q\!\left(s', k', x^Q(s')\right) \;\middle|\; s, k, x \right]. \quad (7)$$

This in turn affects both the updates to the Q-values and the action-parameters. Firstly, we consider the effect on the action-parameter loss, specifically that each Q-value produces gradients for all action-parameters. Consider for demonstration purposes the action-parameter loss (Equation 6) over a single sample with state $s$:

$$L_x(\theta_x) = -\sum_{k=1}^{K} Q\!\left(s, k, x(s; \theta_x); \theta_Q\right). \quad (8)$$

The policy gradient is then given by:

$$\nabla_{\theta_x} L_x(\theta_x) = -\sum_{k=1}^{K} \nabla_x Q\!\left(s, k, x(s; \theta_x); \theta_Q\right) \nabla_{\theta_x} x(s; \theta_x). \quad (9)$$
Expanding the gradients with respect to the action-parameters gives

$$\nabla_x Q = \left( \frac{\partial Q_1}{\partial x_1} + \frac{\partial Q_2}{\partial x_1} + \cdots + \frac{\partial Q_K}{\partial x_1},\ \cdots,\ \frac{\partial Q_1}{\partial x_K} + \cdots + \frac{\partial Q_K}{\partial x_K} \right), \quad (10)$$

where $Q_k = Q(s, k, x(s; \theta_x); \theta_Q)$. Theoretically, if each Q-value were a function of just $x_k$ as the P-DQN formulation intended, then $\partial Q_k / \partial x_j = 0\ \forall k, j \in [K], j \neq k$, and $\nabla_x Q$ simplifies to:

$$\nabla_x Q = \left( \frac{\partial Q_1}{\partial x_1}, \frac{\partial Q_2}{\partial x_2}, \cdots, \frac{\partial Q_K}{\partial x_K} \right). \quad (11)$$

However this is not the case in P-DQN, so the gradients with respect to other action-parameters $\partial Q_k / \partial x_j$ are not zero in general. This is a problem because each Q-value is updated only when its corresponding action is sampled, as per Equation 5, and thus has no information on what effect other action-parameters $x_j, j \neq k$ have on transitions or how they should be updated to maximise the expected return. They therefore produce what we term false gradients. This effect may be mitigated by the summation over all Q-values in the action-parameter loss, since the gradients from each Q-value are summed and averaged over a minibatch.

The dependence of Q-values on all action-parameters also negatively affects the discrete action policy. Specifically, updating the continuous action-parameter policy of any action perturbs the Q-values of all actions, not just the one associated with that action-parameter. This can lead to the relative ordering of Q-values changing, which in turn can result in suboptimal greedy action selection. We demonstrate a situation where this occurs on the Platform domain in Figure 2.

Figure 2: Example of dependence on unrelated action-parameters affecting discrete action selection on the Platform domain. (a) Agent in the Platform domain. (b) Predicted Q-values versus the leap action-parameter after 10 000, 20 000, and 80 000 episodes; vertical lines indicate the leap value actually chosen. Three parameterised actions are available: run, hop, and leap. In a particular state (a), the optimal action is to run forward to be able to traverse a gap, while choosing to leap would cause the agent to fall and die. The Q-value of the leap action should change with its action-parameter, but (b) shows that varying the leap action-parameter while the others are kept fixed changes the Q-values predicted by P-DQN for all actions; ideally, the run and hop Q-values should remain constant since their action-parameters are fixed. Near the start of training, this can alter the discrete policy such that a suboptimal action is chosen. After 80 000 episodes, P-DQN correctly learns to choose the optimal action regardless of the unrelated leap action-parameter, although the other Q-values still vary.

The naïve solution to the problem of joint action-parameter inputs in P-DQN would be to split the Q-network into separate networks for each discrete action. Then, one can input only the state and relevant action-parameter $x_k$ to the network corresponding to $Q_k$. However, this drastically increases the computational and space complexity of the algorithm due to the duplication of network parameters for each action. Furthermore, the loss of the shared feature representation between Q-values may be detrimental.

We therefore consider an alternative approach that does not involve architectural changes to the network structure of P-DQN. While separating the action-parameters in a single forward pass of a single Q-network with fully connected layers is impossible, we can do so with multiple passes. We perform a forward pass once per action $k$, with the state $s$ and action-parameter vector $x e_k$ as input, where $e_k$ is the standard basis vector for dimension $k$. Thus $x e_k = (0, \ldots, 0, x_k, 0, \ldots, 0)$ is the joint action-parameter vector where each $x_j, j \neq k$ is set to zero. This causes all false gradients to be zero, $\partial Q_k / \partial x_j = 0$, and completely negates the impact of the network weights for unassociated action-parameters $x_j$ from the input layer, making $Q_k$ depend only on $x_k$. That is,

$$Q(s, k, x e_k) \approx Q(s, k, x_k). \quad (12)$$

Both problems are therefore addressed without introducing any additional neural network parameters. We refer to this as the multi-pass Q-network method, or MP-DQN.

A total of $K$ forward passes are required to predict all Q-values instead of one. However, we can make use of the parallel minibatch processing capabilities of artificial neural networks, provided by libraries such as PyTorch and Tensorflow, to perform this in a single parallel pass, or multi-pass. A multi-pass with $K$ actions is processed in the same manner as a minibatch of size $K$:

$$\begin{pmatrix} Q(s, \cdot, x e_1; \theta_Q) \\ \vdots \\ Q(s, \cdot, x e_K; \theta_Q) \end{pmatrix} = \begin{pmatrix} Q_{11} & Q_{12} & \cdots & Q_{1K} \\ \vdots & \vdots & \ddots & \vdots \\ Q_{K1} & Q_{K2} & \cdots & Q_{KK} \end{pmatrix}, \quad (13)$$

where $Q_{ij}$ is the Q-value for action $j$ generated on the $i$th pass, in which only $x_i$ is non-zero. Only the diagonal elements $Q_{ii}$ are valid and used in the final output, $Q_i \leftarrow Q_{ii}$. This process is illustrated in Figure 3.

Figure 3: Illustration of the multi-pass Q-network architecture.
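The multi-pass in Equation (13) maps naturally onto batched network evaluation. The sketch below, written against the joint-input QNetwork sketched earlier, duplicates the state K times, zeroes out every action-parameter except those of the action being scored, and keeps only the diagonal of the resulting K x K matrix. The slice bookkeeping and variable names are our own assumptions.

```python
import torch

def multi_pass_q(q_net, state, action_params, param_slices):
    """Compute Q(s, k, x e_k) for all k with one batched multi-pass (Equation 13).

    state:         tensor of shape (state_dim,)
    action_params: joint action-parameter vector x of shape (param_dim,)
    param_slices:  list of K slices mapping each action k to its entries of x
    """
    K = len(param_slices)
    # Duplicate the state K times, as if processing a minibatch of size K.
    states = state.unsqueeze(0).expand(K, -1)
    # Build x e_k for each pass via a constant binary mask, so gradients can
    # still flow back to the actor through the surviving entries.
    mask = torch.zeros(K, action_params.shape[0])
    for k, sl in enumerate(param_slices):
        mask[k, sl] = 1.0
    masked = mask * action_params.unsqueeze(0)
    # One parallel pass yields a K x K matrix; entry (i, j) is Q_ij from Equation (13).
    q_matrix = q_net(states, masked)
    # Only the diagonal entries Q_ii are valid outputs: Q_i <- Q_ii.
    return q_matrix.diagonal()
```

During action selection the diagonal is used in place of the single-pass Q-values; during updates the same construction is applied to each state in the minibatch, so the effective batch seen by the Q-network grows by a factor of K.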
Compared to separate Q-networks, our multi-pass technique introduces a relatively minor amount of overhead during forward passes. Although minibatches for updates are similarly duplicated $K$ times, backward passes to accumulate gradients are not duplicated, since only the diagonal elements $Q_{ii}$ are used in the loss function. The computational complexity of this overhead scales linearly with the number of actions and minibatch size during updates. Unlike separate Q-networks, and even when a larger Q-network with more hidden layers and neurons is used, if the number of actions does not change then the overhead of multi-passes is the same as with a smaller Q-network, provided the minibatch is of a reasonable size and can be processed in parallel.

Experiments

We compare the original P-DQN algorithm with a single Q-network against our proposed multi-pass Q-network (MP-DQN), as well as against separate Q-networks (SP-DQN). We also compare against Q-PAMDP and PA-DDPG, the former state-of-the-art approaches on their respective domains. We are unable to use King of Glory as a benchmark domain as it is closed-source and proprietary.

Similar to Mnih et al. [2015] and Hausknecht and Stone [2016a], we add target networks to P-DQN to compute the update targets $y$ for stability. Soft updates (Polyak averaging) are used for the target networks. Adam [Kingma and Ba, 2014] is used to optimise the neural network parameters for P-DQN and PA-DDPG. Layer weights are initialised following the strategy of He et al. [2015] with rectified linear unit (ReLU) activation functions. We employ the inverting gradients approach to bound action-parameters for both algorithms, as Hausknecht and Stone [2016a] claim PA-DDPG is unable to learn without it on Half Field Offense. Action-parameters are scaled to $[-1, 1]$, as we found this increased performance for all algorithms.

We perform a hyperparameter grid search for Platform and Robot Soccer Goal over: the network learning rates $\alpha_Q, \alpha_x$ (five values, each a negative power of ten, s.t. $\alpha_x \leq \alpha_Q$); Polyak averaging factors $\tau_Q, \tau_x$ s.t. $\tau_x \leq \tau_Q$; the minibatch size $B$; and the number of hidden layers and neurons, covering one- and two-hidden-layer configurations with a first layer of 128 or 256 neurons. The hidden layers are kept symmetric between the actor and critic networks as in previous works. Each combination is tested over several random runs for P-DQN and PA-DDPG separately on each domain. The same hyperparameters are used for P-DQN, SP-DQN, and MP-DQN.

To keep the comparison with PA-DDPG fair, we do not use dueling networks [Wang et al., 2016] nor asynchronous parallel workers as Xiong et al. [2018] used for P-DQN. For each algorithm and domain, we train a set of agents with unique random seeds and evaluate each of them without exploration for a further series of episodes. Our experiments are implemented in Python using PyTorch [Paszke et al., 2017] and OpenAI Gym [Brockman et al., 2016], and run on the following hardware: Intel Core i7-7700, 16GB DRAM, NVidia GTX 1060 GPU. Complete source code is available online at https://github.com/cycraig/MP-DQN.
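The inverting gradients approach mentioned above can be sketched as follows; this is our reading of the rule from Hausknecht and Stone [2016a], and the helper function, bounds, and tensor shapes are illustrative assumptions rather than code from either paper.

```python
import torch

def invert_gradients(grad, params, p_min=-1.0, p_max=1.0):
    """Rescale (and, outside the bounds, invert) dQ/dx before it reaches the actor.

    Gradients that would increase a parameter are scaled by the headroom remaining
    below p_max; gradients that would decrease it are scaled by the headroom above
    p_min. If a parameter has drifted outside its bounds, the corresponding factor
    becomes negative, so the gradient is inverted and pushes it back inside.
    `grad` is the ascent direction dQ/dx and `params` the current action-parameters.
    """
    span = p_max - p_min
    increasing = grad > 0
    scale = torch.where(increasing, (p_max - params) / span, (params - p_min) / span)
    return grad * scale
```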
Platform

The Platform domain [Masson et al., 2016] has three actions—run, hop, and leap—each with a continuous action-parameter to control horizontal displacement. The agent has to hop over enemies and leap across gaps between platforms to reach the goal state. The agent dies if it touches an enemy or falls into a gap. The state space gives the position and velocity of the agent and the local enemy, along with features of the current platform such as its length. We train agents on this domain for
80 000 episodes, using the same hyperparameters for Q-PAMDP as Masson et al. [2016], except that we reduce the eNAC learning rate ($\alpha_{\mathrm{eNAC}}$) and the exploration noise variance ($\sigma$) to account for the scaled action-parameters. For P-DQN, shallow networks with one hidden layer (128) were found to perform best, together with tuned values of $\alpha_Q$, $\alpha_x$, $\tau_Q$, $\tau_x$, and a minibatch size of $B = 128$. PA-DDPG uses two hidden layers with 256 neurons in the first, tuned $\alpha_Q$, $\alpha_\mu$, $\tau_Q$, $\tau_\mu$, and $B = 32$. A replay memory size of
10 000 samples is used for both algorithms, update gradients are clipped, and a fixed discount factor $\gamma$ is used throughout. We introduce a passthrough layer to the actor networks of P-DQN and PA-DDPG to initialise their action-parameter policies to the same linear combination of state variables that Masson et al. [2016] use to initialise the Q-PAMDP policy. The weights of the passthrough layer are kept fixed to avoid instability; this does not reduce the range of action-parameters available, as the output of the actor network compensates before inverting gradients are applied. We use an $\epsilon$-greedy discrete action policy with additive Ornstein-Uhlenbeck noise for action-parameter exploration, similar to Lillicrap et al. [2016], which we found gives slightly better performance than Gaussian noise.
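For reference, a minimal sketch of additive Ornstein-Uhlenbeck exploration noise of the kind used here; the theta and sigma values shown are common defaults in DDPG-style implementations and stand in for the tuned values, which are not reproduced here.

```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise added to the actor's action-parameters during exploration."""
    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.state = np.full(size, mu, dtype=np.float64)

    def reset(self):
        """Reset the process at the start of each episode."""
        self.state[:] = self.mu

    def sample(self):
        # dx = theta * (mu - x) + sigma * N(0, I): the process drifts back towards mu,
        # giving smoother exploration than independent Gaussian noise.
        dx = self.theta * (self.mu - self.state) + self.sigma * np.random.randn(*self.state.shape)
        self.state = self.state + dx
        return self.state.copy()
```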
Robot Soccer Goal

The Robot Soccer Goal domain [Masson et al., 2016] is a simplification of RoboCup 2D [Kitano et al., 1997] in which an agent has to score a goal past a keeper that tries to intercept the ball. The three parameterised actions—kick-to, shoot-goal-left, and shoot-goal-right—are all related to kicking the ball, which the agent automatically approaches between actions until close enough to kick again. The state space consists of continuous features describing the position, velocity, and orientation of the agent and keeper, and the ball's position and distance to the keeper and goal.

Training consisted of 100 000 episodes, using the same hyperparameters for Q-PAMDP as Masson et al. [2016] except for reduced values of $\alpha_{\mathrm{eNAC}}$ and $\sigma$. P-DQN uses a single hidden layer (256) with $B = 128$; two hidden layers with 128 neurons in the first are used for PA-DDPG with $B = 64$, and the learning rates and Polyak factors of both algorithms were again tuned by grid search. Both algorithms use a replay memory size of
20 000 samples, clip update gradients, and use the same action-parameter policy initialisation as Q-PAMDP with additive Ornstein-Uhlenbeck noise.

Half Field Offense

The third and final domain, Half Field Offense (HFO) [Hausknecht and Stone, 2016a], is also the most complex. It has a much larger set of continuous state features and three parameterised actions available: dash, turn, and kick. Unlike Robot Soccer Goal, the agent must first learn to approach the ball and then kick it into the goal, although there is no keeper in this task. We use
30 000 episodes for training on HFO. This is more than the
20 000 episodes (or several million transitions) used by Hausknecht and Stone [2016a] and Xiong et al. [2018], so that ample opportunity is given for the algorithms to converge in order to fairly evaluate the final policy performance. We use the same network structures as previous works, with three hidden layers (256 neurons in the first) for P-DQN and four hidden layers (1024 neurons in the first) for PA-DDPG. The leaky ReLU activation function with negative slope 0.01 is used on HFO because of these deeper networks.

Xiong et al. [2018] use asynchronous parallel workers for $n$-step returns on HFO. For fair comparison, and due to the lack of sufficient hardware, we instead use mixed $n$-step return targets [Hausknecht and Stone, 2016b] with a mixing ratio of $\beta = 0.25$ for both P-DQN and PA-DDPG, as this technique does not require multiple workers. The $\beta$ value was selected after a search over $\beta \in \{0, 0.25, 0.5, 0.75, 1\}$. We otherwise use the same hyperparameters as Hausknecht and Stone [2016b], apart from the network learning rates $\alpha_Q$ and $\alpha_x$ for P-DQN and $\alpha_Q$ and $\alpha_\mu$ for PA-DDPG. In the absence of an initial action-parameter policy, we use the same $\epsilon$-greedy with uniform random action-parameter exploration strategy as the original authors. In general, we kept as many factors consistent between the two algorithms as possible for a fair comparison.
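A hedged sketch of the mixed target computation follows; it reflects our reading of Hausknecht and Stone [2016b], blending the off-policy one-step Q-learning target with an on-policy n-step return under the mixing ratio beta. The function name, argument shapes, and the omission of terminal-state handling are all simplifying assumptions.

```python
import torch

def mixed_nstep_target(rewards, q_next1, q_nextn, gamma=0.99, beta=0.25):
    """Blend a one-step Q-learning target with an on-policy n-step return.

    rewards: tensor of the n rewards observed after (s, a) along the sampled trajectory
    q_next1: bootstrap value max_k' Q(s_{t+1}, k', x_k'(s_{t+1})) from the target networks
    q_nextn: the corresponding bootstrap value at s_{t+n}
    beta:    mixing ratio; beta = 0 recovers the usual off-policy one-step target
    """
    n = rewards.shape[0]
    one_step = rewards[0] + gamma * q_next1
    discounts = gamma ** torch.arange(n, dtype=rewards.dtype)
    n_step = torch.sum(discounts * rewards) + (gamma ** n) * q_nextn
    return (1.0 - beta) * one_step + beta * n_step
```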
We select a subset of the most relevant state features for Q-PAMDP to avoid intractable Fourier basis calculations. These features include: player orientation, stamina, proximity to ball, ball angle, ball-kickable, goal centre position, and goal centre proximity. Even with this reduced selection, we found that only a low-order Fourier basis could be used. We use an adaptive step-size [Dabney and Barto, 2012] for Sarsa($\lambda$), together with a fixed eNAC learning rate. The Q-PAMDP agent initially learns with Sarsa($\lambda$) alone before alternating between batches of $\kappa = 50$ eNAC updates and periods of discrete action re-exploration.

Results

The resulting learning curves of MP-DQN, SP-DQN, P-DQN, PA-DDPG, and Q-PAMDP on the three parameterised action benchmark domains are shown in Figure 4, with mean evaluation scores detailed in Table 1.

Figure 4: Learning curves on Platform (a), Robot Soccer Goal (b), and Half Field Offense (c). The running average scores—episodic return for Platform and HFO, and goal scoring probability for Robot Soccer Goal—are smoothed over a window of episodes and include random exploration; higher is better. Shaded areas represent standard error of the running averages over the different agents. The oscillating behaviour of Q-PAMDP on Platform is a result of re-exploration between the alternating optimisation steps. MP-DQN clearly performs best overall.

Table 1: Mean evaluation scores over random runs for each algorithm (Q-PAMDP, PA-DDPG, P-DQN, SP-DQN, and MP-DQN), averaged over evaluation episodes after training with no random exploration: return on Platform, goal-scoring probability on Robot Soccer Goal, and goal-scoring probability and average steps to goal on Half Field Offense. Q-PAMDP, PA-DDPG, and P-DQN achieve mean returns of 0.789, 0.284, and 0.964 respectively on Platform. We include previously published results from Hausknecht and Stone [2016a] and Xiong et al. [2018] on HFO, where they score goals with probability 0.923 and 0.989 respectively, although these are not directly comparable with ours: we use a longer training period and have a much larger sample size of agents, and asynchronous P-DQN uses parallel workers to implement n-step returns rather than the mixing strategy we use.

Our results show that MP-DQN learns significantly faster than baseline P-DQN with joint action-parameter inputs and achieves the highest mean evaluation scores across all three domains. SP-DQN similarly shows better performance than P-DQN on Platform and Robot Soccer Goal, but to a slightly lesser extent than MP-DQN. Notably, SP-DQN exhibits fast initial learning on HFO but plateaus at a lower performance level than P-DQN. This is likely due to the aforementioned lack of a shared feature representation between the separate Q-networks and the duplicate network parameters, which require more updates to optimise.

In general, we observe that P-DQN and its variants outperform Q-PAMDP on Platform and Robot Soccer Goal, while PA-DDPG consistently converges prematurely to suboptimal policies. Wei et al. [2018] observe similar behaviour for PA-DDPG on Platform. This highlights the problem with updating the action and action-parameter policies simultaneously, and was also observed when using eNAC for direct policy search on Platform [Masson et al., 2016]. On HFO, Q-PAMDP fails to learn to score any goals—likely due to its reduced feature space and use of linear function approximation rather than neural networks. Unexpectedly, baseline P-DQN appears to learn slower than PA-DDPG on HFO. This suggests that the dueling networks and asynchronous parallel workers used by Xiong et al. [2018] were major factors improving P-DQN in their comparisons.

Related Work

Many recent deep RL approaches follow the strategy of collapsing the parameterised action space into a continuous one. Hussein et al. [2018] present a deep imitation learning approach for scoring goals on HFO using long short-term memory networks with a joint action and action-parameter policy. Agarwal [2018] introduces skills for multi-goal parameterised action space environments to achieve multiple related goals; they demonstrate success on robotic manipulation tasks by combining PA-DDPG with hindsight experience replay and their skill library.

One can alternatively view parameterised actions as a two-level hierarchy: Klimek et al. [2017] use this approach to learn a reach-and-grip task using a single network to represent a distribution over macro (discrete) actions and their lower-level action-parameters.
The work most relevant to this paper is by Wei et al. [2018], who introduce a parameterised action version of TRPO (PATRPO). They also take a hierarchical approach, but instead condition the action-parameter policy on the discrete action chosen to avoid predicting all action-parameters at once. While their preliminary results show the method achieves good performance on Platform, we omit comparison with PATRPO as it fails to learn to score goals on HFO.
Conclusion

We identified a significant problem with the P-DQN algorithm for parameterised action spaces: the dependence of its Q-values on all action-parameters causes false gradients and can lead to suboptimal action selection. We introduced a new algorithm, MP-DQN, with separate action-parameter inputs, which demonstrated superior performance over P-DQN and the former state-of-the-art techniques Q-PAMDP and PA-DDPG. We also found that PA-DDPG was unstable and converged to suboptimal policies on some domains. Our results suggest that future approaches should leverage the disjoint nature of parameterised action spaces and avoid simultaneous optimisation of the policies for discrete actions and continuous action-parameters.

Acknowledgments
This work is based on the research supported in part by the National Research Foundation of South Africa (Grant Number: 113737).
References

[Agarwal, 2018] Arpit Agarwal. Deep reinforcement learning with skill library: Exploring with temporal abstractions and coarse approximate dynamics models. Master's thesis, Carnegie Mellon University, Pittsburgh, PA, July 2018.

[Brockman et al., 2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

[Dabney and Barto, 2012] William Dabney and Andrew G. Barto. Adaptive step-size for online temporal difference learning. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.

[Hausknecht and Stone, 2016a] Matthew Hausknecht and Peter Stone. Deep reinforcement learning in parameterized action space. In Proceedings of the International Conference on Learning Representations, 2016.

[Hausknecht and Stone, 2016b] Matthew Hausknecht and Peter Stone. On-policy vs. off-policy updates for deep reinforcement learning. In Deep Reinforcement Learning: Frontiers and Challenges, IJCAI Workshop, July 2016.

[He et al., 2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.

[Hussein et al., 2018] Ahmed Hussein, Eyad Elyan, and Chrisina Jayne. Deep imitation learning with memory for Robocup soccer simulation. In Proceedings of the International Conference on Engineering Applications of Neural Networks, pages 31–43. Springer, 2018.

[Khamassi et al., 2017] Mehdi Khamassi, George Velentzas, Theodore Tsitsimis, and Costas Tzafestas. Active exploration and parameterized reinforcement learning applied to a simulated human-robot interaction task. In Proceedings of the First IEEE International Conference on Robotic Computing, pages 28–35. IEEE, 2017.

[Kingma and Ba, 2014] Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[Kitano et al., 1997] Hiroaki Kitano, Minoru Asada, Yasuo Kuniyoshi, Itsuki Noda, Eiichi Osawa, and Hitoshi Matsubara. RoboCup: A challenge problem for AI. AI Magazine, 18:73–85, 1997.

[Klimek et al., 2017] Maciej Klimek, Henryk Michalewski, and Piotr Miłoś. Hierarchical reinforcement learning with parameters. In Conference on Robot Learning, pages 301–313, 2017.

[Konidaris et al., 2011] George D. Konidaris, Sarah Osentoski, and Philip S. Thomas. Value function approximation in reinforcement learning using the Fourier basis. In Proceedings of the Twenty-Fifth Conference on Artificial Intelligence, pages 380–385, August 2011.

[Lillicrap et al., 2016] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations, 2016.

[Masson et al., 2016] Warwick Masson, Pravesh Ranchod, and George Konidaris. Reinforcement learning with parameterized actions. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 1934–1940, 2016.

[Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[Paszke et al., 2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS 2017 Autodiff Workshop: The Future of Gradient-based Machine Learning Software and Techniques, 2017.

[Peng et al., 2016] Xue Bin Peng, Glen Berseth, and Michiel Van de Panne. Terrain-adaptive locomotion skills using deep reinforcement learning. ACM Transactions on Graphics, 35(4):81:1–81:12, 2016.

[Peters and Schaal, 2008] Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180–1190, 2008.

[Schulman et al., 2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, volume 37, pages 1889–1897, 2015.

[Sutton and Barto, 1998] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA, 1998.

[Wang et al., 2016] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pages 1995–2003, 2016.

[Wei et al., 2018] Ermo Wei, Drew Wicke, and Sean Luke. Hierarchical approaches for reinforcement learning in parameterized action space. 2018.

[Xiong et al., 2018] Jiechao Xiong, Qing Wang, Zhuoran Yang, Peng Sun, Lei Han, Yang Zheng, Haobo Fu, Tong Zhang, Ji Liu, and Han Liu. Parametrized deep Q-networks learning: Reinforcement learning with discrete-continuous hybrid action space. arXiv preprint arXiv:1810.06394, 2018.