Effective Exploration for Deep Reinforcement Learning via Bootstrapped Q-Ensembles under Tsallis Entropy Regularization
Gang Chen*, Yiming Peng†, and Mengjie Zhang‡
School of Engineering and Computer Science, Victoria University of Wellington, New Zealand
*[email protected]  †[email protected]  ‡[email protected]
September 6, 2019
Abstract
Recently, deep reinforcement learning (DRL) has achieved outstanding success on solving many difficult and large-scale RL problems. However, the high sample cost required for effective learning often makes DRL unaffordable in resource-limited applications. With the aim of improving sample efficiency and learning performance, we develop a new DRL algorithm in this paper that seamlessly integrates entropy-induced and bootstrap-induced techniques for efficient and deep exploration of the learning environment. Specifically, a general form of Tsallis entropy regularizer is utilized to drive entropy-induced exploration based on efficient approximation of optimal action-selection policies. Different from many existing works that rely on action dithering strategies for exploration, our algorithm efficiently explores actions with clear exploration value. Meanwhile, by employing an ensemble of Q-networks under varied Tsallis entropy regularization, the diversity of the ensemble can be further enhanced to enable effective bootstrap-induced exploration. Experiments on Atari game playing tasks clearly demonstrate that our new algorithm can achieve more efficient and effective exploration for DRL, in comparison to recently proposed exploration methods including Bootstrapped Deep Q-Network and UCB Q-Ensemble.
Keywords— Reinforcement Learning, Deep Learning, Q-Ensemble, Tsallis Entropy
1 Introduction

In recent years, deep reinforcement learning (DRL) has been extensively and successfully utilized by computer systems to autonomously learn to solve many challenging problems such as robotics control [LHP+15, SLA+15, HGS16] and road traffic management [LLW16]. However, in order to achieve its learning goals, an RL agent must often use a huge amount of sampled data to train its deep neural networks (DNNs). Since data sampling is realized through direct trial-and-error interactions with the learning environment, the high sample cost usually makes DRL unaffordable in resource-limited applications [WMG+17, WBH+16, CPZ18a].

In order to improve sample efficiency, an RL agent must carefully manage its exploration of the learning environment. Osband et al. recently proposed the idea of "deep exploration" to emphasize the requirement for the agent to learn effectively within a reasonable time frame by considering not only the immediate benefits of taking any action but also the long-term impact of the action on future learning, thereby properly synthesizing efficient exploration with effective generalization [OR16, ORW14, ORWR17]. Guided by this requirement, Bootstrapped Deep Q-Network (Bootstrapped DQN) has been proposed lately to drive deep and efficient exploration [OBPR16].

Bootstrapped DQN was inspired by the posterior sampling method for RL with near-optimal regret bounds [ORR13]. However, instead of sampling and solving numerous Markov Decision Processes (MDPs), Bootstrapped DQN approximates a posterior model over optimal Q-functions (also known as the state-action value functions) at much more affordable computation cost. It is shown to easily outperform action dithering strategies for exploration such as ε-greedy or softmax action sampling techniques [Kak03, Str07]. For this purpose, an ensemble of randomly initialized Q-networks (or Q-functions) is maintained consistently during RL. Empirical results showed that effective deep exploration can be achieved in practice by randomly choosing one of the Q-networks to guide multi-step interactions with the learning environment.

Besides Bootstrapped DQN, the UCB Q-Ensemble method proposed in [CSAS17] also relies on learning concurrently an ensemble of Q-networks. However, it adopts an approximated upper-confidence bound over Q-values produced by these Q-networks to steer exploration. Although highly competitive performance has been witnessed on Atari game playing tasks, theoretical studies suggest that precise calculation of such confidence bounds can be computationally intractable [RR13, RR14].

Similar to Bootstrapped DQN and UCB Q-Ensemble, we employ an ensemble of Q-networks to achieve deep exploration. However, without relying on actions with either the highest Q-values or upper-confidence bounds for exploration, we generalize action selection by studying policies under entropy regularization. This generalization enables us to develop a new form of optimal stochastic policies, thereby relieving the dependency on randomly initialized Q-networks as the main source of randomness for deep exploration [CSAS17, OBPR16].

In the literature, Shannon entropy is frequently utilized to regularize action selection, giving rise to optimal policies that exhibit softmax action-selection behaviors [OMKM17, NNXS17, HTAL17, SAC17]. While softmax distributions naturally bring stochasticity to deep exploration, they are prone to assigning non-negligible probability mass to actions with negligible exploration value [LCO17, NCG18].

Tsallis entropy is an important extension of Shannon entropy [PP93, Tsa94]. A special case of Tsallis entropy has been studied in [LCO17] to tackle sparse MDP problems. When applied to DRL, Tsallis entropy allows an RL agent to concentrate on exploring actions that deserve further exploration. For this reason, Tsallis entropy regularization is deemed a key mechanism for efficient deep exploration in this paper. Moreover, without being restricted to any specific setting of Tsallis entropy as in [LCO17, NCG18], a general form of Tsallis entropy will be studied for the first time in the literature to guide DRL.

Based on computationally efficient approximation of optimal policies under general Tsallis entropy regularization, a new deep exploration algorithm involving an ensemble of deep Q-networks will be further developed in this paper. The newly proposed algorithm will be called the Bootstrapped Q-Ensemble under Tsallis Entropy Regularization (BQETR) algorithm. Each Q-network in BQETR adopts a different setting of the Tsallis entropy regularizer in order to achieve high ensemble diversity. Meanwhile, the regularization coefficient is kept the same for all Q-networks and will be gradually reduced to 0 as the RL agent gains increasingly more experience from its learning environment. In this way, we expect to seamlessly integrate entropy-induced exploration with bootstrap-induced exploration for effective RL.

Empirical studies have been performed on benchmark Atari game playing tasks. Our experiment results clearly show that BQETR achieves significantly better sample efficiency and performance than Bootstrapped DQN and UCB Q-Ensemble. We therefore believe that BQETR is an effective and efficient method for deep exploration and RL.
2 Related Works

Huge efforts have been devoted to developing efficient and effective exploration strategies for RL. In particular, provably efficient exploration techniques have been studied based on the idea of Bayes optimal policies [GMP+15] and clearly revealed the importance of multi-step exploration [KS02]. Further studies along this line also demonstrated the inefficiency of ε-greedy and softmax exploration techniques on large RL problems [BT02, SLW+06, AO07, DB15]. In view of the fact that many existing DRL algorithms rely on such simple methods for exploration [MKS+15, HGS16], developing new exploration methods for effective DRL has great value both in theory and in practice.

Among all the exploration techniques proposed so far, a notable series of research works clearly highlighted the advantages of exploration through randomized value functions [ORR13, ORW14, OR16, ORWR17]. Specifically, the randomized least-squares value iteration (RLSVI) algorithms proposed in [ORWR17] extended traditional least-squares value iteration methods through randomly sampling statistically plausible value functions. However, efficient sampling often requires value functions to be linear with respect to their parameters and may not be suitable for DRL [ORW14, ORWR17]. To cope with this issue, Bootstrapped DQN has been developed recently in [OBPR16] to approximately sample value functions modeled as DNNs.

Besides bootstrapping, to achieve entropy-induced exploration, the training of action-selection policies is often reshaped by a Shannon entropy regularizer, resulting in various soft-Q style algorithms for DRL [NNXS17, OMKM17, HTAL17]. For example, Nachum et al. developed an off-policy algorithm based on a multi-step consistency equation for entropy-regularized RL [NNXS17]. Haarnoja et al. conducted research on soft-Q learning in high-dimensional action spaces [HTAL17]. In [OMKM17, SAC17], policy gradient training is shown to be equivalent to soft-Q learning. This insightful understanding enables an RL agent to combine policy gradient with Q-learning for effective sample reuse [OMKM17]. Different from these research works, efficient entropy-induced exploration is realized in this paper through Tsallis entropy regularization [LCO17, NCG18].

This paper is similar to [LCO17, NCG18] since they all leverage Tsallis entropy to regularize policy optimization. However, different from these research works, which studied only a specific setting of Tsallis entropy, we will consider general forms of Tsallis entropy so as to maintain strong diversity in a Q-ensemble. This also allows us to treat Shannon entropy regularization as a special case of our research. Moreover, the integrated use of Tsallis entropy and a bootstrapping mechanism for deep exploration further sets this paper apart from most of the previous works.

3 Entropy Regularized Q-Learning
In this section, we will introduce the RL problem first, followed by a quick review of Q-learning and policy gradient learning techniques. Afterwards, a new deep Q-learning algorithm under Tsallis entropy regularization will be developed. The Bellman residue of the newly proposed algorithm will also be analyzed under an extreme circumstance.
3.1 Problem Definition

This paper studies general RL problems that can be described by an MDP with an arbitrary set of states $s \in S$ and a finite set of actions $a \in A$ [OMKM17]. Such problems appear frequently in the literature, including robotics control and video game playing [MKS+15, HGS16, CPZ18b]. At each time step $t$, an RL agent observes its environment and determines its current state $s_t$. It subsequently selects and performs an action $a_t$, driving the environment to move to its next state $s_{t+1}$ with a probability $P(s_t, a_t, s_{t+1})$ which is unknown to the agent. Meanwhile, a scalar reward $r(s_t, a_t)$ is provided as the immediate feedback of performing action $a_t$. Starting from any initial state $s_1$, the agent is required to perform a long (sometimes infinite) sequence of actions in order to obtain the discounted total return defined below:

$$J(\pi) = \mathbb{E}_{s_1, \pi}\left( \sum_{t=0}^{\infty} \gamma^t r_t \right) \quad (1)$$

where the expectation in (1) is conditional on the initial state $s_1$ and $\pi$. Here $\pi$ refers to a stochastic action-selection policy that the RL agent follows to determine the probability $\pi(s_t, a)$ of performing any action $a$ in any state $s_t$. Obviously, $\sum_{a \in A} \pi(s_t, a) = 1$ is an important constraint for $\pi$ to be well-defined. Moreover, $\gamma$ takes its value in $[0, 1)$ and serves as a discount factor for the RHS of (1) to be meaningful. With an MDP described as above, the ultimate goal of RL is hence to identify the optimal policy $\pi^*$ that maximizes $J(\pi^*)$, i.e.,

$$\pi^* = \arg\max_{\pi} J(\pi) \quad (2)$$

In an effort to learn $\pi^*$, an RL agent may choose to first learn the Q-function with respect to some non-optimal policy $\pi$, as defined below:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi, s_t, a_t}\left( \sum_{k=0}^{\infty} \gamma^k r_{k+t} \right) \quad (3)$$

Given $Q^{\pi}$ in (3), the value of state $s_t$ under policy $\pi$ is determined further as

$$V^{\pi}(s_t) = \mathbb{E}_{s_t, \pi}\left( \sum_{k=0}^{\infty} \gamma^k r_{k+t} \right) = \sum_{a \in A} \pi(s_t, a)\, Q^{\pi}(s_t, a) \quad (4)$$

To ease discussion, we denote the value of any state $s$ under the optimal policy $\pi^*$ as $V^*(s)$. Accordingly, $Q^*(s, a)$ represents the maximum Q-value achievable as a result of performing action $a$ in state $s$.
3.2 Q-Learning and Policy Gradient Learning

In value function based methods for DRL, the Q-function (or V-function) is represented as a DNN with numerous parameters. Through updating these parameters, we can bring the Q-value outputs from such deep Q-networks as close to the fixed point of the Bellman equation as possible. Two versions of the Bellman equation are typically studied in the literature. Each version is associated with a different *Bellman operator*, i.e., $\mathcal{T}^{\pi}$ and $\mathcal{T}^*$, as defined below:

$$\mathcal{T}^{\pi} Q^{\pi}(s,a) = r(s,a) + \gamma \sum_{s'} P(s,a,s') \sum_{b \in A} \pi(s',b)\, Q^{\pi}(s',b) \quad (5)$$

$$\mathcal{T}^* Q^*(s,a) = r(s,a) + \gamma \sum_{s'} P(s,a,s') \max_{b \in A} Q^*(s',b) \quad (6)$$

Clearly, $\mathcal{T}^{\pi}$ in (5) applies to $Q^{\pi}$, which is the fixed point of the Bellman equation $\mathcal{T}^{\pi} Q(s,a) = Q(s,a)$. Likewise, $\mathcal{T}^*$ in (6) applies to $Q^*$, which is the fixed point of the Bellman equation $\mathcal{T}^* Q(s,a) = Q(s,a)$. It is well-known that both $\mathcal{T}^{\pi}$ and $\mathcal{T}^*$ are $\gamma$-contraction mappings in the sup-norm and are suitable to drive value function learning [Ber95]. Specifically, in DQN [MKS+15], an empirical estimation $\tilde{\mathcal{T}}^*$ of $\mathcal{T}^*$ based on a batch of previously sampled state transition data $B = \{(s,a,s',r(s,a))\}$ is utilized to update the Q-network parameters in the direction of minimizing the loss function below:

$$L^*(\theta) = \sum_{(s,a,s',r) \in B} \left( Q_{\theta}(s,a) - \tilde{\mathcal{T}}^*(s,a,s',r) \right)^2 \quad (7)$$

To avoid overestimation in the original design of DQN, Double DQN is typically used in practice by employing a separate target Q-network to calculate $\tilde{\mathcal{T}}^*$ in (7) [HGS16]. The equation below shows more details:

$$\tilde{\mathcal{T}}^*(s,a,s',r) = r(s,a) + \gamma\, Q_{\theta^-}\left(s', \arg\max_{b \in A} Q_{\theta}(s',b)\right) \quad (8)$$

where in (7) and (8), $Q_{\theta}$ refers to the Q-network parameterized by $\theta$ and $Q_{\theta^-}$ stands for the target Q-network parameterized by $\theta^-$. Clearly, by minimizing $L^*$ in (7), $Q_{\theta}$ aims to approximate $Q^*$. A similar loss function has also been defined for $Q_{\theta}$ to precisely estimate $Q^{\pi}$ with respect to an arbitrary policy $\pi$.
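As an illustration of how (7) and (8) interact, the sketch below computes Double DQN targets and the squared Bellman error for a batch. It is a minimal NumPy rendering under stated assumptions: the two Q-value arrays stand in for the outputs of the online network $Q_{\theta}$ and the target network $Q_{\theta^-}$ on next states, and terminal-state handling is omitted.

```python
import numpy as np

def double_dqn_targets(rewards, q_online_next, q_target_next, gamma):
    """Targets from (8): r + gamma * Q_target(s', argmax_b Q_online(s', b))."""
    greedy = np.argmax(q_online_next, axis=1)              # action chosen by the online network
    batch = np.arange(len(rewards))
    return rewards + gamma * q_target_next[batch, greedy]  # evaluated by the target network

def bellman_loss(q_pred, targets):
    """Loss from (7): squared Bellman error summed over the batch."""
    return float(np.sum((q_pred - targets) ** 2))

# Hypothetical batch of two transitions with three actions:
targets = double_dqn_targets(np.array([1.0, 0.0]),
                             np.array([[0.1, 0.9, 0.3], [0.2, 0.5, 0.4]]),
                             np.array([[0.0, 0.7, 0.2], [0.1, 0.6, 0.3]]),
                             gamma=0.99)
print(bellman_loss(np.array([1.5, 0.4]), targets))
```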
Under the general actor-critic framework for RL, assume that a stochastic policy $\pi$ is implemented as a DNN parameterized by $\omega$. According to the policy gradient theorem [SMSM00], $\omega$ should be updated in the direction of

$$\frac{\partial J(\pi)}{\partial \omega} = \mathbb{E}_{(s,a)\sim\pi}\left( Q^{\pi}(s,a)\, \frac{\partial \log \pi(s,a)}{\partial \omega} \right) \quad (9)$$

Here the expectation is taken over all possible state-action pairs $(s,a)$ with probability $d^{\pi}(s)\pi(s,a)$, where $d^{\pi}(s)$ gives the discounted distribution of states defined in [SMSM00]. While training policy networks based on (9), in order to prevent a policy from converging too fast and therefore leaving no opportunity for future exploration, it is a common practice to introduce an extra *entropy regularizer*. As a consequence, the policy network parameters $\omega$ can be updated according to

$$\Delta\omega \propto \mathbb{E}_{(s,a)\sim\pi}\left( Q^{\pi}_{\theta}(s,a)\, \frac{\partial \log \pi(s,a)}{\partial \omega} + \alpha\, \frac{\partial H^{\pi}(s)}{\partial \omega} \right) \quad (10)$$

where $H^{\pi}$ denotes the entropy of policy $\pi$ and $\alpha > 0$ is the *entropy regularization coefficient*. Previously, Shannon entropy as defined below was frequently utilized for regularized policy training:

$$H^{\pi}_1(s) = -\sum_{a \in A} \pi(s,a) \log \pi(s,a) \quad (11)$$

Subject to entropy regularization in (11) with coefficient $\alpha$ in (10), it can be shown that the optimal policy $\pi^*_{\alpha}$ and the corresponding optimal Q-function $Q^{\pi^*_{\alpha}}$ obey the following equation [SAC17]:

$$\pi^*_{\alpha}(s,a) = \frac{\exp\left(Q^{\pi^*_{\alpha}}(s,a)/\alpha\right)}{\sum_{b \in A} \exp\left(Q^{\pi^*_{\alpha}}(s,b)/\alpha\right)} \quad (12)$$

3.3 Deep Q-Learning under Tsallis Entropy Regularization

In this paper, instead of using the softmax distribution in (12) to guide entropy-induced exploration, which can exhibit poor efficiency in practice [ORWR17], we decide to study the Tsallis entropy based regularizer defined below:

$$H^{\pi}_q(s) = \frac{1}{q-1}\left( 1 - \sum_{a \in A} \pi(s,a)^q \right) \quad (13)$$

It can be shown that $\lim_{q \to 1} H^{\pi}_q(s) = H^{\pi}_1(s)$ for any $s \in S$. We can hence consider soft-Q learning as demonstrated by (12) as a special case of our new Q-learning method. In [LCO17, NCG18], a specific setting of (13) with $q = 2$ has been studied to derive a fixed-form representation of the optimal policy. In this paper, on the other hand, we are interested in the general form of Tsallis entropy in (13) with $q > 1$.
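The sketch below evaluates (11) and (13) numerically and illustrates the limiting behavior $\lim_{q\to 1} H^{\pi}_q = H^{\pi}_1$; the example distribution is hypothetical.

```python
import numpy as np

def tsallis_entropy(pi, q):
    """Tsallis entropy from (13): (1 - sum_a pi(a)^q) / (q - 1), for q > 1."""
    return (1.0 - np.sum(pi ** q)) / (q - 1.0)

def shannon_entropy(pi):
    """Shannon entropy from (11): -sum_a pi(a) * log pi(a)."""
    return -np.sum(pi * np.log(pi))

pi = np.array([0.7, 0.2, 0.1])
print(tsallis_entropy(pi, q=1.0001))  # approaches the Shannon entropy as q -> 1
print(shannon_entropy(pi))            # ~0.8018
print(tsallis_entropy(pi, q=2.0))     # 1 - sum_a pi(a)^2 = 0.46
```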
Analogous to the analysis presented in [OMKM17], we can represent the RHS of (10) as $f(\omega)$. Meanwhile, let $g^{\pi}(s) = \sum_{a \in A} \pi(s,a)$. Clearly, when $\omega$ in (10) reaches its fixed point (or optimum), no further updating of $\omega$ in the direction of $f(\omega)$ is possible without violating the constraint that $g^{\pi}(s) = 1$ for any $s \in S$. This means that, with the optimal policy parameters $\omega^*$, $f(\omega^*)$ belongs to the span of the vectors $\left\{\frac{\partial g^{\pi}(s)}{\partial \omega}\right\}$, i.e.,

$$f(\omega^*) = \sum_{s \in S} \lambda(s) \left.\frac{\partial g^{\pi}(s)}{\partial \omega}\right|_{\omega=\omega^*} \quad (14)$$

where for every state $s$, the Lagrange multiplier $\lambda(s)$ in (14) ensures that $g^{\pi}(s) = 1$. Meanwhile, we can determine $\frac{\partial H^{\pi}_q(s)}{\partial \omega}$ as

$$\frac{\partial H^{\pi}_q(s)}{\partial \omega} = -\frac{q}{q-1} \sum_{a \in A} \pi(s,a)^{q-1}\, \frac{\partial \pi(s,a)}{\partial \omega} = -\frac{q}{q-1} \sum_{a \in A} \pi(s,a)^q\, \frac{\partial \log \pi(s,a)}{\partial \omega} \quad (15)$$

By substituting (15) into (10) and also taking into account (14), the optimality condition for policy parameters $\omega^*$ becomes

$$\mathbb{E}_{(s,a)\sim\pi}\left( \left( Q^{\pi}(s,a) - \frac{\alpha q}{q-1}\,\pi(s,a)^{q-1} - c(s) \right) \frac{\partial \log \pi(s,a)}{\partial \omega} \right) = 0 \quad (16)$$

where $c(s)$ stands for $\lambda(s)$ in (14) adjusted according to the discounted distribution of state $s$. To solve the equation in (16), similar to [OMKM17], it is eligible to consider each state $s$ separately. Particularly, in any state $s$ and $\forall a \in A$, we have

$$Q^{\pi}(s,a) - \frac{\alpha q}{q-1}\,\pi(s,a)^{q-1} - c(s) = 0 \quad \text{or} \quad \frac{\partial \pi(s,a)}{\partial \omega} = 0 \quad (17)$$

It is straightforward to verify the solution below of (17):

$$\pi^*_{\alpha}(s,a) = \sqrt[q-1]{\max\left( \frac{q-1}{q}\left( \frac{Q^{\pi^*_{\alpha}}(s,a)}{\alpha} - \frac{c(s)}{\alpha} \right),\; 0 \right)} \quad (18)$$

with $\pi^*_{\alpha}$ representing the optimal policy of the entropy-regularized policy gradient learning problem described in (10). $Q^{\pi^*_{\alpha}}$ stands for the respective Q-function for policy $\pi^*_{\alpha}$. Meanwhile, $c(s)$ in (18) ensures that the condition $g^{\pi}(s) = 1$ holds consistently. Notice that for a certain action $a$, it is possible for $Q^{\pi^*_{\alpha}}(s,a) - c(s) < 0$. For such an action, the validity of (18) is ensured by letting $\pi^*(s,a) = 0$, and therefore $\frac{\partial \pi^*(s,a)}{\partial \omega} = 0$. In other words, only a portion of the actions in $A$ may be explored in any state $s$, thereby encouraging efficient exploration. Particularly, when $q = 2$, it can be shown that [LCO17]

$$c(s) = \alpha\, \frac{\sum_{a \in S(s)} \frac{Q^{\pi^*_{\alpha}}(s,a)}{\alpha} - q}{\|S(s)\|} \quad (19)$$

with $S(s)$ representing the set of actions with a non-zero chance of exploration in state $s$, as determined below:

$$S(s) = \left\{ a_i \;\middle|\; q + i\,\frac{Q^{\pi^*_{\alpha}}(s,a_i)}{\alpha} > \sum_{j=1}^{i} \frac{Q^{\pi^*_{\alpha}}(s,a_j)}{\alpha} \right\} \quad (20)$$

where $a_i$ denotes the action with the $i$-th highest Q-value in state $s$. If $q \neq 2$, closed-form representations of $c(s)$ and $S(s)$ may not exist. Therefore, in order to estimate $\pi^*_{\alpha}$, we have developed two efficient approximation techniques in our Q-learning algorithm. Specifically, note that whenever $\pi^*_{\alpha}(s,a) > 0$ for any state $s$ and action $a$, $Q^{\pi^*_{\alpha}}(s,a) - c(s) > 0$. For such an action $a$, we can establish a first-order approximation of $\pi^*_{\alpha}$ in (18) based on

$$\pi^*_{\alpha}(s,a) \approx 1 + \frac{1}{q-1}\left( \frac{q-1}{q}\left( \frac{Q^{\pi^*_{\alpha}}(s,a)}{\alpha} - \frac{c(s)}{\alpha} \right) - 1 \right) + o\left( \frac{q-1}{q}\left( \frac{Q^{\pi^*_{\alpha}}(s,a)}{\alpha} - \frac{c(s)}{\alpha} \right) - 1 \right) \quad (21)$$

Now, applying the constraint in (22) over all actions belonging to $S(s)$,

$$\sum_{a \in S(s)} \pi^*_{\alpha}(s,a) = 1 \quad (22)$$

we can obtain the result

$$c(s) \approx \alpha\, \frac{\sum_{a \in S(s)} \frac{Q^{\pi^*_{\alpha}}(s,a)}{\alpha} - q}{\|S(s)\|} + \alpha\left( q - \frac{q}{q-1} \right) \quad (23)$$

Clearly, when $q = 2$, $c(s)$ as approximated in (23) is identical to $c(s)$ in (19). Meanwhile, we can check the condition $\frac{Q^{\pi^*_{\alpha}}(s,a)}{\alpha} > \frac{c(s)}{\alpha}$ whenever $a \in S(s)$. Apparently, only actions associated with high Q-values in state $s$ have the chance to be performed by an RL agent. Suppose that $\{a_1, \ldots, a_m\}$ are the actions with the $m$ highest Q-values. For $S(s)$ to contain all these actions, we must make sure that

$$\frac{Q^{\pi^*_{\alpha}}(s,a)}{\alpha} > \frac{\sum_{i=1}^{m} \frac{Q^{\pi^*_{\alpha}}(s,a_i)}{\alpha} - q}{m} + \left( q - \frac{q}{q-1} \right) \quad (24)$$

Therefore,

$$m\,\frac{Q^{\pi^*_{\alpha}}(s,a)}{\alpha} + q > \sum_{i=1}^{m} \frac{Q^{\pi^*_{\alpha}}(s,a_i)}{\alpha} + m\left( q - \frac{q}{q-1} \right) \quad (25)$$

Based on (25), $S(s)$ can now be estimated immediately as in (26):

$$S(s) \approx \left\{ a_i \;\middle|\; q + i\,\frac{Q^{\pi^*_{\alpha}}(s,a_i)}{\alpha} > \sum_{j=1}^{i} \frac{Q^{\pi^*_{\alpha}}(s,a_j)}{\alpha} + i\left( q - \frac{q}{q-1} \right) \right\} \quad (26)$$
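A minimal sketch of this first approximation technique is given below: it estimates $S(s)$ via (26), $c(s)$ via (23), and then evaluates the sparse policy (18). The function assumes the membership condition selects a prefix of the ranked actions, and it renormalizes at the end to absorb residual approximation error; for $q = 2$ it reduces to the exact expressions (19) and (20). The Q-values in the example are hypothetical.

```python
import numpy as np

def tsallis_sparse_policy(q_values, alpha, q):
    """Approximate optimal policy of (18) using (23) and (26); exact when q = 2."""
    z = np.asarray(q_values, dtype=float) / alpha      # Q-values scaled by alpha
    zs = np.sort(z)[::-1]                              # ranked in descending order
    kappa = q - q / (q - 1.0)                          # extra term in (23)/(26); 0 when q = 2
    i = np.arange(1, len(zs) + 1)
    member = q + i * zs > np.cumsum(zs) + i * kappa    # membership test of (26)
    k = int(member.sum())                              # |S(s)|, assumed to form a prefix
    c_over_alpha = (zs[:k].sum() - q) / k + kappa      # c(s)/alpha from (23)
    base = np.maximum((q - 1.0) / q * (z - c_over_alpha), 0.0)
    pi = base ** (1.0 / (q - 1.0))                     # sparse policy of (18)
    return pi / pi.sum()

print(tsallis_sparse_policy([3.0, 2.5, 0.1, -1.0], alpha=0.5, q=2.0))  # [0.75 0.25 0. 0.]
print(tsallis_sparse_policy([3.0, 2.5, 0.1, -1.0], alpha=0.5, q=1.5))
```

Actions whose Q-values fall below the threshold receive exactly zero probability, in contrast to the softmax policy in (12).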
Due to the inherent error involved in approximating $c(s)$ and $S(s)$ through (23) and (26), respectively, we do not know for sure which actions can be safely ignored for future exploration. To avoid missing any potentially valuable actions, the second technique to approximate $\pi^*_{\alpha}$ is to use the softplus function [DBB+01] as a smooth implementation of $\max(\cdot, 0)$ in (18). Specifically,

$$\pi^*_{\alpha}(s,a) \propto \sqrt[q-1]{\delta\left( \frac{Q^{\pi^*_{\alpha}}(s,a)}{\alpha} - \frac{c(s)}{\alpha} \right)} \quad (27)$$

where the softplus function $\delta$ is defined as $\delta(x) = \log(1 + \exp(x))$. Based on (27), we can learn a Q-network parameterized by $\theta$ with the aim of minimizing the loss function $L^{\pi^*_{\alpha}}(\theta)$. $L^{\pi^*_{\alpha}}(\theta)$ is calculated based on (7), where the Bellman operator $\mathcal{T}^*$ is replaced by $\mathcal{T}^{\pi^*_{\alpha}}$. Driven by this idea, we have developed an algorithm for Q-learning under Tsallis entropy regularization, as summarized in Algorithm 1.

Algorithm 1: An Algorithm for Deep Q-Learning under Tsallis Entropy Regularization

Input: a Q-network; the value of q for the Tsallis entropy regularizer; α, the initial regularization coefficient; and a replay buffer B that stores past state-transition samples for training.
for each problem episode do:
    Obtain initial state s_1 from the environment
    for t = 1, ... until the end of the episode do:
        Sample action a_t according to (27)
        Perform a_t
        Add (s_t, a_t, s_{t+1}, r_t) to B
        if the learning interval is reached do:
            Sample a mini-batch from B
            Update the Q-network to minimize L^{π*_α}(θ) on the mini-batch
            Reduce α linearly by ∆α until 0
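A sketch of sampling actions according to the smooth policy (27), as in the action-sampling step of Algorithm 1, is shown below. The softplus is written in a numerically stable form; the Q-values are hypothetical, and $c$ is assumed to come from the approximation in (23).

```python
import numpy as np

def softplus(x):
    """Numerically stable softplus: log(1 + exp(x))."""
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def softplus_policy(q_values, c, alpha, q):
    """Smooth policy of (27): pi(s, a) proportional to softplus((Q - c)/alpha)^(1/(q-1))."""
    pi = softplus((np.asarray(q_values, dtype=float) - c) / alpha) ** (1.0 / (q - 1.0))
    return pi / pi.sum()

rng = np.random.default_rng(0)
pi = softplus_policy([3.0, 2.5, 0.1, -1.0], c=2.25, alpha=0.5, q=2.0)
action = rng.choice(len(pi), p=pi)  # action sampling step of Algorithm 1
```

Unlike the hard truncation in (18), every action keeps a small but non-zero probability, so no potentially valuable action is ruled out by approximation error.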
3.4 Bellman Residue of Entropy Regularized Q-Learning

This subsection analyzes the Bellman residue to show that our Q-learning algorithm will not suffer from any performance degradation despite using the approximated $\pi^*_{\alpha}$ in (27). We will particularly consider one extreme circumstance, when $\alpha \to 0$. With $\alpha$ approaching 0, it is straightforward to verify that only the action that produces the highest Q-value in any state $s$, i.e., $a_1$, satisfies the condition in (26). Therefore $c(s) = \alpha\left( \frac{Q^{\pi^*_{\alpha}}(s,a_1)}{\alpha} - \frac{q}{q-1} \right)$. According to (27), $\pi^*_{\alpha}(s,a_1) \propto \delta\left(\frac{q}{q-1}\right) > 0$. On the other hand, for any action $a \neq a_1$,

$$\lim_{\alpha \to 0} \sqrt[q-1]{\delta\left( \frac{Q^{\pi^*_{\alpha}}(s,a)}{\alpha} - \frac{c(s)}{\alpha} \right)} = \lim_{\alpha \to 0} \sqrt[q-1]{\delta\left( \frac{Q^{\pi^*_{\alpha}}(s,a) - Q^{\pi^*_{\alpha}}(s,a_1)}{\alpha} + \frac{q}{q-1} \right)} = 0 \quad (28)$$

We hence can conclude that, when $\alpha \to 0$, $\pi^*_{\alpha}(s,a_1) \to 1$. Consequently, $\pi^*_{\alpha}$ degenerates to the value-maximization policy. Based on this understanding, we can further prove Theorem 1 below (see Appendix for the proof).

Theorem 1. For the Q-learning problem under Tsallis entropy regularization, suppose that $\pi^*_{\alpha}$ is the approximated optimal policy for the problem as defined in (27) and $\alpha$ is the corresponding regularization coefficient; then for any state $s \in S$ and any action $a \in A$, the Bellman residue $|\mathcal{T}^* Q^{\pi^*_{\alpha}} - Q^{\pi^*_{\alpha}}|$ satisfies the following property:

$$\lim_{\alpha \to 0} \left| \mathcal{T}^* Q^{\pi^*_{\alpha}}(s,a) - Q^{\pi^*_{\alpha}}(s,a) \right| = 0$$

Because the Bellman residue converges to 0 with decreasing $\alpha$, when $\alpha$ is sufficiently small, the performance of Algorithm 1 is as good as DQN and Double DQN. For this reason, Algorithm 1 employs a linear schedule to constantly decrease $\alpha$, thereby gradually reducing entropy-induced exploration till $\alpha = 0$ and learning converges.

4 Deep Exploration via Bootstrapped Q-Ensembles

Using Algorithm 1 alone is insufficient to realize deep and effective exploration. Following the deep exploration principle studied in [ORW14, ORWR17], we expand an MDP $M_q$ in this paper with the Tsallis entropy regularizer in (13) under the settings of $1 < q < q_{\max} < \infty$. In other words, the immediate reward of performing any action in state $s$ by following policy $\pi$ is extended with a new term that depends on the Tsallis entropy of $\pi$ in state $s$. Given the experiences obtained so far by an RL agent through direct interactions with its learning environment, denoted as $B$, a posterior model over $M_q$ can be established in theory and represented as $P(M_q \mid B)$.

Deep exploration requires an RL agent to randomly sample one MDP $M_q$ from $P(M_q \mid B)$ and subsequently utilize $Q^{\pi^*_{\alpha,M_q}}$ and $\pi^*_{\alpha,M_q}$ to control future interactions with its learning environment in the next problem episode. Each problem episode starts from an initial state $s_1$ and ends whenever a final state is reached. $Q^{\pi^*_{\alpha,M_q}}$ and $\pi^*_{\alpha,M_q}$ stand respectively for the optimal Q-function and the optimal policy with respect to an MDP $M_q$ under the specific settings of $q$ and $\alpha$. Apparently, this deep exploration method has the aim of optimizing the posterior learning performance of an RL agent based on its past experiences, as described below:

$$J^*(B) = \mathbb{E}_{M_q \sim P(M_q \mid B)}\, \mathbb{E}_{(s,a)\sim\pi^*_{\alpha,M_q}}\left( r(s,a) + \alpha\, H^{\pi^*_{\alpha,M_q}}_q(s) \right) \quad (29)$$

On large-scale RL problems, it is difficult to keep track of $P(M_q \mid B)$ as well as to determine the optimal policy for every possible $M_q$. Inspired by [OBPR16], we decide to efficiently approximate the deep exploration process through bootstrapping. This is implemented in the BQETR algorithm (see Algorithm 2) by maintaining an ensemble of entropy regularized Q-networks. All Q-networks share the same regularization coefficient $\alpha$. Meanwhile, different Q-networks follow different settings of $q$ so as to enhance the diversity of the ensemble, which is essential for effective deep exploration.

Apparently, no change to $q$ is required for any Q-network in the ensemble during RL. Since the agent's past experiences $B$ will not affect the posterior distribution over $q$, each Q-network in the ensemble can be sampled equally likely for the next episode of deep exploration. This simple technique enables us to seamlessly integrate entropy-induced exploration with bootstrap-induced exploration and lays the foundation of the BQETR algorithm. Because $\alpha$ is decremented each time by a very small step $\Delta\alpha$ in Algorithm 1 and Algorithm 2, its change will not affect the effectiveness of the bootstrapping mechanism.
Algorithm 2: The Bootstrapped Q-Ensemble under Tsallis Entropy Regularization (BQETR) Algorithm

Input: an ensemble of K Q-networks {Q_k}_{k=1}^K; a list of q values {q_k}_{k=1}^K for the Tsallis entropy regularizers; α, the initial regularization coefficient; a replay buffer B that stores past state-transition samples for training; and a masking distribution M.
for each problem episode do:
    Choose the i-th Q-network in {Q_k}_{k=1}^K randomly
    Obtain initial state s_1 from the environment
    for t = 1, ... until the end of the episode do:
        Use Q_i, q_i and α to sample action a_t according to (27)
        Perform a_t
        Sample bootstrap mask m_t ∼ M
        Add (s_t, a_t, s_{t+1}, r_t, m_t) to B
        if the learning interval is reached do:
            Follow Algorithm 1 to train all K Q-networks
            Reduce α linearly by ∆α until 0
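The control flow of Algorithm 2 can be sketched as follows. This is a structural illustration only: `make_q_network`, `sample_action`, `train_fn`, and the `env` interface are hypothetical placeholders, and a Bernoulli masking distribution is assumed for M.

```python
import random

class BQETRAgent:
    """Skeleton of the BQETR loop: K Q-networks, distinct entropic indices q_k,
    a shared (linearly decaying) alpha, and bootstrap masks over transitions."""

    def __init__(self, make_q_network, q_values, alpha, delta_alpha, mask_prob=0.5):
        self.nets = [make_q_network() for _ in q_values]   # one Q-network per q_k
        self.q_list = list(q_values)                       # varied Tsallis settings
        self.alpha = alpha
        self.delta_alpha = delta_alpha
        self.mask_prob = mask_prob
        self.buffer = []                                   # replay buffer B

    def run_episode(self, env, sample_action):
        k = random.randrange(len(self.nets))               # pick one head per episode
        s, done = env.reset(), False
        while not done:
            a = sample_action(self.nets[k], s, self.q_list[k], self.alpha)  # via (27)
            s_next, r, done = env.step(a)
            mask = [random.random() < self.mask_prob for _ in self.nets]    # m_t ~ M
            self.buffer.append((s, a, s_next, r, mask))
            s = s_next

    def learn(self, train_fn, batch_size=32):
        sample = random.sample(self.buffer, min(batch_size, len(self.buffer)))
        for k, net in enumerate(self.nets):                # each head sees only its masked data
            train_fn(net, [t for t in sample if t[4][k]], self.q_list[k], self.alpha)
        self.alpha = max(0.0, self.alpha - self.delta_alpha)  # reduce alpha linearly until 0
```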
Figure 1: Average total return per episode obtained by BQETR, Bootstrapped DQN and UCB Q-Ensemble on five Atari game playing tasks, including Bowling, Boxing, Enduro, Freeway, and Pong.
5 Experiments

In this section, the learning performance of BQETR is compared to the performance achievable through Bootstrapped DQN and UCB Q-Ensemble on commonly studied Atari game playing tasks [HGS16, MKS+15, SLA+15, OBPR16, CSAS17]. Based on the performance results, the sample complexity of the three algorithms is further analyzed to demonstrate that BQETR can improve sample efficiency through deep and effective exploration of its learning environment.

In this paper, we specifically consider five video games simulated by the Arcade Learning Environment [BNVB15] as benchmark problems: Bowling, Boxing, Enduro, Freeway and Pong. These problems require an RL agent to handle high-dimensional state spaces (i.e., an agent must be able to process direct video input provided by the games) and are highly difficult to solve, even for expert human game players. As a result, to achieve reasonable learning performance, an RL agent must play numerous rounds of each benchmark game. Hence they are suitable problems to reveal the difference in sample efficiency upon using various exploration methods for DRL.

We implement all algorithms in the experiments based on the high-quality implementation of Double DQN provided by OpenAI baselines [DHK+17]. Two key hyperparameters of BQETR are the regularization coefficient α and the entropic index q for Tsallis entropy. Without spending substantial effort in fine-tuning these parameters, α is set to 0.5 (other settings ranging from 1.0 to 0.1 do not seem to produce noticeable differences in performance). After each learning interval, α is decremented by a small fixed step ∆α till 0. The Q-ensemble maintained by all algorithms contains 10 individual Q-networks, and the value of q in BQETR is set to a different constant for each Q-network. Moreover, every pixel input to a Q-network is obtained by averaging the same pixel over four consecutive frames of the game video. On each game playing task, we have run every algorithm for only 3M frames by using commodity desktop computers (no GPUs). This enables us to examine the effectiveness of all algorithms under limited computation resources and sample budget.
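As a small illustration of the setup above, the sketch below shows the pixel averaging over four consecutive frames and the linear decay of α; the 84x84 frame size and the decrement per learning interval are hypothetical free parameters here.

```python
import numpy as np

def averaged_frame(frames):
    """Average each pixel over four consecutive game frames."""
    assert len(frames) == 4
    return np.stack(frames).mean(axis=0)

def decayed_alpha(alpha0, delta_alpha, num_intervals):
    """Linear schedule: reduce alpha by delta_alpha per learning interval until 0."""
    return max(0.0, alpha0 - delta_alpha * num_intervals)

frames = [np.random.randint(0, 256, size=(84, 84)) for _ in range(4)]  # hypothetical frames
obs = averaged_frame(frames)
print(obs.shape, decayed_alpha(0.5, 1e-5, 10000))  # (84, 84) 0.4
```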
Figure 1 depicts the learning performance (i.e., the average total return per episode) of the three algorithms in our experiments. To cover at least 3M frames, the performance across 1000 learning episodes has been presented in the figure, except for Enduro. This is because each episode in Enduro includes more frames; therefore, 3M frames were reached after playing just 300 episodes of Enduro.

As evidenced in Figure 1, BQETR outperformed Bootstrapped DQN and UCB Q-Ensemble on all benchmark problems. Particularly on Bowling, Boxing and Enduro, BQETR achieved significantly higher performance than the competing algorithms. In the meantime, BQETR also managed to solve Freeway and Pong clearly faster than the other algorithms. Based on the experiment results, we believe that BQETR is an effective algorithm for DRL thanks to its integrated use of both entropy-induced and bootstrap-induced exploration techniques.

To analyze sample efficiency, we adopt the performance metrics of fast learning and final performance introduced in [SWD+17].
Scoring Metric     | Algorithm        | Bowling | Boxing | Enduro | Freeway | Pong
Fast Learning      | BQETR            |         |        |        |         |
                   | Bootstrapped DQN | 7.63 ±  |        |        |         |
                   | UCB Q-Ensemble   |         |        |        |         |
Final Performance  | BQETR            |         |        |        |         |
                   | Bootstrapped DQN | 7.97 ±  |        |        |         |
                   | UCB Q-Ensemble   |         |        |        |         |

Table 1: Scoring metrics of fast learning and final performance obtained by BQETR, Bootstrapped DQN and UCB Q-Ensemble on five Atari games.
6 Conclusions

In this paper, we studied entropy-induced environment exploration via deep Q-learning under general Tsallis entropy regularization. Through this study, we developed, for the first time in the literature, new approximation techniques to address entropy regularized RL problems. Bellman residue analysis subsequently showed that our approximation techniques will not affect the final performance achievable through Q-learning. Driven by the goal of deep exploration, we further developed a bootstrapped Q-learning algorithm involving an ensemble of Q-networks. Every Q-network is controlled by a Tsallis entropy regularizer under a different setting of q so as to achieve high ensemble diversity and effective deep exploration.

Looking into the future, it is interesting to explore the possibility of extending our Q-learning algorithm to tackle RL problems with high-dimensional and continuous action spaces. Meanwhile, it is also interesting to explore the benefits of our Q-learning algorithm on more problem domains. Due to the limited computation resources available to this research, we could not conduct large-scale experimental studies in this paper. However, our experiment results have clearly shown that our new algorithm is both effective and sample efficient.

Appendix
This appendix presents the proof of Theorem 1, i.e., $|\mathcal{T}^* Q^{\pi^*_{\alpha}} - Q^{\pi^*_{\alpha}}| \to 0$ as the regularization coefficient $\alpha$ decreases all the way to 0. Specifically, for any state-action pair $(s,a)$, we can derive the following inequalities:

$$\begin{aligned}
0 &\leq \left| \mathcal{T}^* Q^{\pi^*_{\alpha}}(s,a) - \mathcal{T}_{\alpha} Q^{\pi^*_{\alpha}}(s,a) \right| \\
&\leq \mathbb{E}_{s' \sim P(s,a,s')} \left| \max_b Q^{\pi^*_{\alpha}}(s',b) - \sum_b \pi^*_{\alpha}(s',b)\, Q^{\pi^*_{\alpha}}(s',b) \right| \\
&\leq \mathbb{E}_{s' \sim P(s,a,s')} \left( \sum_b \left| \mathbb{I}^{s'}_{b=b_1} - \pi^*_{\alpha}(s',b) \right| \cdot \left| Q^{\pi^*_{\alpha}}(s',b) \right| \right) \\
&\leq \mathbb{E}_{s' \sim P(s,a,s')} \left( \sum_b \left| \mathbb{I}^{s'}_{b=b_1} - \pi^*_{\alpha}(s',b) \right| \sum_b \left| Q^{\pi^*_{\alpha}}(s',b) \right| \right)
\end{aligned} \quad (30)$$

where $\mathbb{I}^{s'}_{b=b_1}$ refers to the policy that selects action $b_1$ in state $s'$ with probability 1 and $b_1$ is the action with the highest Q-value in state $s'$. Assume without loss of generality that the absolute Q-value with respect to any state and any action can never exceed $\bar{Q}$. Then, from (30), we have

$$\left| \mathcal{T}^* Q^{\pi^*_{\alpha}}(s,a) - \mathcal{T}_{\alpha} Q^{\pi^*_{\alpha}}(s,a) \right| \leq \|A\| \cdot \bar{Q} \cdot \mathbb{E}_{s' \sim P(s,a,s')}\left( D_{TV}\left( \mathbb{I}^{s'}_{\cdot=b_1} \,\middle\|\, \pi^*_{\alpha}(s',\cdot) \right) \right) \quad (31)$$

with $D_{TV}(x\|y) = \sum_i |x_i - y_i|$ representing the total variation divergence between any two discrete probability distributions $x$ and $y$. Due to the fact that $D_{TV}(x\|y) \leq \sqrt{2\, D_{KL}(x\|y)}$, where $D_{KL}$ is the standard KL divergence [Pol], we can further obtain the inequality below from (31):

$$\left| \mathcal{T}^* Q^{\pi^*_{\alpha}}(s,a) - \mathcal{T}_{\alpha} Q^{\pi^*_{\alpha}}(s,a) \right| \leq \|A\| \cdot \bar{Q} \cdot \mathbb{E}_{s' \sim P(s,a,s')} \sqrt{2\, D_{KL}\left( \mathbb{I}^{s'}_{\cdot=b_1} \,\middle\|\, \pi^*_{\alpha}(s',\cdot) \right)} = \|A\| \cdot \bar{Q} \cdot \mathbb{E}_{s' \sim P(s,a,s')} \sqrt{-2 \log \pi^*_{\alpha}(s',b_1)} \quad (32)$$

Consequently,

$$0 \leq \lim_{\alpha \to 0} \left| \mathcal{T}^* Q^{\pi^*_{\alpha}}(s,a) - \mathcal{T}_{\alpha} Q^{\pi^*_{\alpha}}(s,a) \right| \leq \|A\| \cdot \bar{Q} \cdot \mathbb{E}_{s' \sim P(s,a,s')} \lim_{\alpha \to 0} \sqrt{-2 \log \pi^*_{\alpha}(s',b_1)} = 0 \quad (33)$$

Using (33), it can be further shown that

$$0 \leq \lim_{\alpha \to 0} \left| \mathcal{T}^* Q^{\pi^*_{\alpha}}(s,a) - Q^{\pi^*_{\alpha}}(s,a) \right| \leq \lim_{\alpha \to 0} \left| \mathcal{T}^* Q^{\pi^*_{\alpha}}(s,a) - \mathcal{T}_{\alpha} Q^{\pi^*_{\alpha}}(s,a) \right| + \lim_{\alpha \to 0} \left| \mathcal{T}_{\alpha} Q^{\pi^*_{\alpha}}(s,a) - Q^{\pi^*_{\alpha}}(s,a) \right| = 0 \quad (34)$$

Notice that when $\alpha \to 0$, the error involved in approximating $\pi^*_{\alpha}$ in (18) is negligible since the action that produces the highest Q-value will be selected with probability 1. As a result, $\mathcal{T}_{\alpha} Q^{\pi^*_{\alpha}}(s,a) - Q^{\pi^*_{\alpha}}(s,a) = 0$ for any state $s$ and action $a$.
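As a quick numerical sanity check of the total variation bound used in (31)-(32), the following sketch verifies $\sum_i |x_i - y_i| \leq \sqrt{2\, D_{KL}(x\|y)}$ on randomly sampled distributions (Pinsker's inequality, under the total variation convention adopted above):

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(1000):
    x = rng.dirichlet(np.ones(4))       # random discrete distributions
    y = rng.dirichlet(np.ones(4))
    tv = np.abs(x - y).sum()            # D_TV as defined after (31)
    kl = np.sum(x * np.log(x / y))      # standard KL divergence
    assert tv <= np.sqrt(2.0 * kl) + 1e-12
print("Pinsker bound held on all sampled distribution pairs.")
```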
References

[AO07] P. Auer and R. Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in Neural Information Processing Systems, pages 49–56, 2007.
[Ber95] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA, 1995.
[BNVB15] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI'15, pages 4148–4152. AAAI Press, 2015.
[BT02] R. I. Brafman and M. Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
[CPZ18a] G. Chen, Y. Peng, and M. Zhang. An adaptive clipping approach for proximal policy optimization. arXiv preprint arXiv:1804.06461, 2018.
[CPZ18b] G. Chen, Y. Peng, and M. Zhang. Constrained expectation-maximization methods for effective reinforcement learning. In International Joint Conference on Neural Networks, 2018.
[CSAS17] R. Y. Chen, S. Sidor, P. Abbeel, and J. Schulman. UCB exploration via Q-ensembles. arXiv preprint arXiv:1706.01502, 2017.
[DB15] C. Dann and E. Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pages 2818–2826, 2015.
[DBB+01] C. Dugas, Y. Bengio, F. Bélisle, C. Nadeau, and R. Garcia. Incorporating second-order functional knowledge for better option pricing. In Advances in Neural Information Processing Systems, pages 472–478, 2001.
[DHK+17] P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu. OpenAI baselines. https://github.com/openai/baselines, 2017.
[GMP+15] M. Ghavamzadeh, S. Mannor, J. Pineau, A. Tamar, et al. Bayesian reinforcement learning: A survey. Foundations and Trends in Machine Learning, 8(5-6):359–483, 2015.
[HGS16] H. van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, pages 2094–2100, 2016.
[HTAL17] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.
[Kak03] S. Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, University College London, 2003.
[KS02] M. Kearns and S. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.
[LCO17] K. Lee, S. Choi, and S. Oh. Sparse Markov decision processes with causal sparse Tsallis entropy regularization for reinforcement learning. arXiv preprint arXiv:1709.06293, 2017.
[LHP+15] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[LLW16] L. Li, Y. Lv, and F. Y. Wang. Traffic signal timing via deep reinforcement learning. IEEE/CAA Journal of Automatica Sinica, 3(3):247–254, 2016.
[MKS+15] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
[NCG18] O. Nachum, Y. Chow, and M. Ghavamzadeh. Path consistency learning in Tsallis entropy regularized MDPs. arXiv preprint arXiv:1802.03501, 2018.
[NNXS17] O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pages 2772–2782, 2017.
[OBPR16] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pages 4026–4034, 2016.
[OMKM17] B. O'Donoghue, R. Munos, K. Kavukcuoglu, and V. Mnih. Combining policy gradient and Q-learning. arXiv preprint arXiv:1611.01626, 2017.
[OR16] I. Osband and B. Van Roy. Why is posterior sampling better than optimism for reinforcement learning. arXiv preprint arXiv:1607.00215, 2016.
[ORR13] I. Osband, D. Russo, and B. Van Roy. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.
[ORW14] I. Osband, B. Van Roy, and Z. Wen. Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635, 2014.
[ORWR17] I. Osband, D. Russo, Z. Wen, and B. Van Roy. Deep exploration via randomized value functions. arXiv preprint arXiv:1703.07608, 2017.
[Pol] D. Pollard. Asymptopia: An Exposition of Statistical Asymptotic Theory. 2000.
[PP93] A. R. Plastino and A. Plastino. Tsallis' entropy, Ehrenfest theorem and information theory. Physics Letters A, 177(3):177–179, 1993.
[RR13] D. Russo and B. Van Roy. Eluder dimension and the sample complexity of optimistic exploration. In Advances in Neural Information Processing Systems, pages 2256–2264, 2013.
[RR14] D. Russo and B. Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
[SAC17] J. Schulman, P. Abbeel, and X. Chen. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440, 2017.
[SLA+15] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
[SLW+06] A. L. Strehl, L. Li, E. Wiewiora, J. Langford, and M. L. Littman. PAC model-free reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 881–888. ACM, 2006.
[SMSM00] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.
[Str07] A. L. Strehl. Probably Approximately Correct (PAC) Exploration in Reinforcement Learning. PhD thesis, Rutgers University, New Brunswick, 2007.
[SWD+17] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. CoRR, 2017.
[Tsa94] C. Tsallis. Nonextensive physics: a possible connection between generalized statistical mechanics and quantum groups. Physics Letters A, 195(5-6):329–334, 1994.
[WBH+16] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.
[WMG+17] Y. Wu, E. Mansimov, R. B. Grosse, S. Liao, and J. Ba. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in Neural Information Processing Systems, pages 5279–5288, 2017.