Cross Learning in Deep Q-Networks
Xing Wang, Alexander Vinel∗
Department of Industrial Engineering, Auburn University, Auburn, AL 36849, USA
∗Corresponding author: [email protected]
Abstract
In this work, we propose a novel cross Q-learning algorithm aimed at alleviating the well-known overestimation problem in value-based reinforcement learning methods, particularly in deep Q-networks, where the overestimation is exaggerated by function approximation errors. Our algorithm builds on double Q-learning: we maintain a set of parallel models and estimate the target Q-value using a randomly selected network, which reduces both the overestimation bias and the variance. We provide empirical evidence of the advantages of our method on benchmark environments; the experimental results demonstrate significant improvements in reducing the overestimation bias and stabilizing training, which in turn lead to better derived policies.
Overestimation has been identified as one of the most severe problems in value-based reinforcement learning (RL) algorithms such as Q-learning [1]: the maximization over value estimates induces a consistent positive bias, and the error of the estimates accumulates through the nature of temporal difference (TD) learning. In the function approximation setting, such as deep Q-networks (DQN), value overestimation is even more severe, given the noise induced by the inaccuracy of the approximation. As a result, DQN training tends to suffer from instability and variability in the estimated Q-values, and the policies derived from the overestimated Q-values tend to be suboptimal and often diverge.

To overcome this issue, double Q-learning [2] has become a standard approach for training DQNs. The main purpose of double Q-learning is to avoid overestimation of the target Q-value by introducing a negative bias from the double estimates. The usual way to realize it in DQN is to maintain a target network, a copy of the policy DQN that is either frozen for a period of time or softly updated with an exponential moving average; the target network is then used to estimate the TD target. This may alleviate the issue; however, double DQN still often suffers from overestimation in practice, partially because the policy and target estimates of the Q-values are usually too similar, while the noise from high variance is propagated through the network and an occasional large reward can produce large overestimation in the future. Another approach sometimes proposed is to impose a bias-correction term on the Q-learning estimates [3]; however, the correction term is complicated to derive for deep networks, where the finiteness of the state space no longer holds. A more recent modification of double DQN favors underestimation and clips the Q-value estimates [4], that is, it always chooses the minimum of the estimated targets over the two networks. Clipped double Q-learning is used on the critics in actor-critic methods for the deterministic policy gradient, referred to as TD3 (twin delayed deep deterministic policy gradient), and has shown state-of-the-art results on multiple tasks. However, the intentionally engineered underestimation lacks rigorous theoretical guidance; in addition, it may induce bias in the other direction, e.g., the underestimation can also accumulate through TD learning and lead to suboptimal policies. Further, excessive underestimation can naturally lead to slower convergence.

Another direction to alleviate overestimation is to reduce the variance during training. For example, [5] uses the average of the learned Q-value estimates from multiple networks, which is designed to help reduce the target approximation error. There also exist various variance reduction techniques [6, 7, 8, 9] that focus on the general non-convex optimization procedure for accelerating stochastic gradient descent, or their direct application to DQNs [10], in which the agent obtains smaller approximate gradient errors. Reducing the variance can effectively stabilize the DQN training procedure, and overestimation alleviation can be seen as a by-product.
However, these are indirect methods for overestimation control, and the positive bias due to the max operator in the TD update is not directly taken care of.

To address these concerns, we propose a cross DQN algorithm, which can be seen as a direct extension of an earlier variant of double DQN, but is more flexible. In cross DQN, we maintain more than two networks and update them one at a time based on the estimate from another, randomly selected network. As mentioned above, averaged DQN [5] calculates the average of K estimated Q-values, with the primary purpose of overall variance reduction. For all K networks, each step of TD updates as well as action selection is based on combining the K estimates. Consequently, the networks are tangled together and cannot be implemented with parallel simulation. In bootstrapped DQN [11], one of the K networks (or heads) is bootstrapped for each action selection step during training, aiming at encouraging exploration early on. Thus the simulation is not independent among networks, while the TD updates are totally independent within each network, each using its own estimate of the Q-values as in standard (double) DQN. [12] investigates more general applications of traditional ensemble reinforcement learning on policies, i.e., majority voting, rank voting, Boltzmann addition, etc., to combine the different policies derived from multiple networks, which they call target ensembles, in addition to averaged DQN, which they call the temporal ensemble. All of the above-mentioned work that maintains multiple networks has achieved better performance by addressing different issues through particular settings. Our method focuses on the variation of the TD updates, in which the target Q-values are estimated with a bootstrapped network when calculating the gradients, with the direct goal of reducing overestimation. Each of the K networks performs its own TD updates, while maintaining flexibility in action selection: the networks can either interact with the environment independently, or through any other ensemble strategy. The detailed implementation options are discussed in Section 3.

In supervised learning, ensemble strategies such as bagging, boosting, stacking, and hierarchical mixtures of experts are commonly applied to achieve better performance by simultaneously learning and combining multiple models. All of the above-mentioned algorithms that maintain multiple models, including ours, can be seen as special cases of general ensemble DQNs. But our method has a deeper root in resampling and model selection. By bootstrapping another model to assess the values of the current model, we introduce model bias for in-sample estimates, but reduce the variance of out-of-sample estimates (i.e., the squares of the out-of-sample bias); in other words, the trained model generalizes better and overfitting is alleviated. For squared errors, this can be expressed as the well-known bias-variance trade-off: MSE = Irreducible Error + Bias² + Variance. In value-based reinforcement learning, the model easily overfits due to overestimation (which is caused by the max operator) during learning.
Cross Q-learning introduces underestimation bias and further reduces the variance, thus improving the generalization of the trained model.

Like [4], our work can be naturally extended to state-of-the-art actor-critic methods in continuous action spaces, such as the deep deterministic policy gradient [13], in which the critic network(s) are learned to give an estimate of the Q-value for the actor network to update its gradient and derive policies. Usually multiple critic networks are applied; however, rather than accumulating their learned gradients (either synchronously or asynchronously [14]) and optionally sharing network layers, no other information is shared among the critics. The extension of our method allows the critics to share their value estimates and utilize those of others, which leads to more accurate estimation for each critic and thus can improve the performance of these models. Similar to these actor-critic algorithms, our work can easily be implemented for parallel training, and the exchange of information among networks could take place either synchronously or asynchronously, like the accumulation of gradients, as there is always a tradeoff between synchronous and asynchronous updates.

The rest of this paper is organized as follows. In Section 2, we review the basics of value-based RL and go through some recent related research. In Section A, we formally define the estimators for the maximum expected values, along with their theoretical properties. The convergence of our cross estimator is shown in Section B. Section 3 illustrates in detail our cross DQN algorithm, directly derived from double DQN. We show empirical results in Section 4. Finally, Section 5 draws conclusions and discusses future work.
A natural abstraction for many sequential decision-making problems is to model the system as a
Markov Decision Process (MDP) [15], in which the agent interacts with the environment over a sequence of discrete time steps. It is often represented as a 5-tuple M = ⟨S, A, T, R, γ⟩, where S is a set of states; A is a set of actions that can be taken; T : S × A → P_S is the transition function, satisfying ∫_{s'∈S} T(s' | s, a) = 1, which denotes the (stationary) probability distribution over S of reaching a new state s' after taking action a in state s; R is the reward function, which can take the form R : S → ℝ, R : S × A → ℝ, or R : S × A × S → ℝ; and γ ∈ [0, 1) is the discount factor.

A policy π : S → P_A defines the conditional probability distribution of choosing each action while in state s. For an MDP, once a stationary policy is fixed, the distribution of the reward sequence is determined. Thus, to evaluate a policy π, it is natural to define the action value function under π as the expected cumulative discounted reward obtained by taking action a in state s and following π thereafter:
$$Q^{\pi}(s,a) \equiv \mathbb{E}_{\pi}\Big[\sum_{\tau=0}^{\infty}\gamma^{\tau}R_{t+\tau}\;\Big|\;S_t=s,\,A_t=a\Big] = R(s,a) + \gamma\int_{s'}T(s'\mid s,a)\,Q^{\pi}(s',\pi(s')).\qquad(1)$$
The goal of solving an MDP is to find an optimal policy π* that maximizes the expected cumulative discounted reward in all states. The corresponding optimal action values satisfy Q*(s, a) = max_π Q^π(s, a), and Banach's fixed-point theorem ensures the existence and uniqueness of the fixed-point solution of the Bellman optimality equations [15]:
$$Q^{*}(s,a) = R(s,a) + \gamma\int_{s'}T(s'\mid s,a)\,\max_{a'}Q^{*}(s',a')\qquad(2)$$
from which a deterministic optimal policy can be derived by acting greedily with respect to Q*, i.e., π* = argmax_{a∈A} Q*(s, a).

In reinforcement learning problems, the agent must interact with the environment to learn information about the transition and reward functions while trying to produce an optimal policy. At each time step t, the agent senses some representation of the current state s, selects an action a, then receives an immediate reward r from the environment and finds itself in a new state s'. The experience tuple ⟨s, a, r, s'⟩ summarizes the observed transition for a single step. Based on the experiences gathered through interacting with the environment, the agent can either learn the MDP model first, by approximating the transition probabilities and reward functions, and then plan in the MDP to obtain an optimal policy (the model-based approach); or, without learning the model, directly learn the optimal value functions from which the optimal policy is derived (the model-free approach).

As a model-free approach, Q-learning [16] updates one-step bootstrapped estimates of the Q-values from the experience samples over time steps.
The update rule upon observing ⟨s, a, r, s'⟩ is
$$Q(s,a) \leftarrow Q(s,a) + \alpha\big(r + \gamma\max_{a'}Q(s',a') - Q(s,a)\big)\qquad(3)$$
in which α is the learning rate; r + γ max_{a'} Q(s', a') serves as the update target of the Q-value, which can be seen as a sample of the expected one-step look-ahead estimate for the state-action pair (s, a), based on the maximum estimated value over the next state s'; and the last term Q(s, a) is simply the current estimate. The difference δ = r + γ max_{a'} Q(s', a') − Q(s, a) is referred to as the temporal difference (TD) error, or Bellman error. Note that one can bootstrap more than one step when estimating the target, often by using eligibility traces as in TD(λ) [17]. Q-learning is guaranteed to converge to the optimal values in probability as long as each action is executed in each state infinitely often, s' is sampled following the distribution T(s, a, s'), r is sampled with mean R(s, a) and bounded variance, and α decays appropriately.

For environments with large state spaces, the Q-values are often represented by a function of state-action pairs rather than in tabular form, i.e., Q_θ(s, a) = f(s, a | θ), where θ is a parameter vector. We consider Q-learning with function approximation in this paper. To update the parameter vector θ, first-order gradient methods are usually applied to minimize the mean squared error (MSE) loss: θ ← θ + αδ∇_θQ_θ. However, with function approximation, the convergence guarantee can no longer be established in general. Neural networks, while attractive as powerful function approximators, were well known to be unstable and even to diverge when applied to reinforcement learning, until the deep Q-network (DQN) [18] was introduced with great success; several important modifications were made there, which we review after the sketch below.
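As an illustration of the tabular update in Equation (3), the following minimal sketch (not from the paper) assumes a small discrete environment exposed through a hypothetical `env` object whose `reset()` returns a state index and whose `step(a)` returns `(next_state, reward, done)`:

```python
import numpy as np

def tabular_q_learning(env, n_states, n_actions, episodes=500,
                       alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning following the update rule in Equation (3)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # TD target uses the maximum estimated value over the next state
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            # Q(s,a) <- Q(s,a) + alpha * (target - Q(s,a))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```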
Experience replay [19] was used to address the non-stationarity of the data: samples (i.e., experiences) are stored and mixed in a replay memory, and during training a batch of experiences is randomly sampled each time and gradient descent is performed on the sampled batch. This way the temporal correlations are alleviated. In addition, a separate target network, a copy of the learned network parameters θ, is employed; this copy is frozen for a period of time and only updated periodically (its parameters are denoted θ⁻), and it is used to calculate the TD error, with the aim of improving stability.

A variety of extensions and generalizations have been proposed and shown successful in the literature. Overestimation due to the max operator in Q-learning may significantly hurt performance. To reduce the overestimation error, double DQN (DDQN) [2] decouples the action selection from the estimation of the target: the maximizing action is chosen according to the original network (Q_θ), and its value is evaluated using the other one (Q_θ⁻ from the target network), i.e., Q_θ(s, a) ← r + γQ_θ⁻(s', argmax_a Q_θ(s', a)). The procedure of double DQN is shown in Algorithm 1.
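Before turning to Algorithm 1, the decoupled target computation above can be sketched as follows. This is an illustrative fragment only (not the authors' code), assuming PyTorch tensors and two hypothetical networks `q_net` (the current network Q_θ) and `target_net` (Q_θ⁻), each mapping a batch of states to per-action Q-values:

```python
import torch

def double_dqn_targets(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    """y = r + gamma * Q_target(s', argmax_a Q_current(s', a)) for a batch.
    `dones` is assumed to be a 0/1 float tensor marking terminal transitions."""
    with torch.no_grad():
        # action selection with the current network
        next_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        # value evaluation with the target network
        next_values = target_net(next_states).gather(1, next_actions).squeeze(1)
        return rewards + gamma * (1.0 - dones) * next_values
```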
Algorithm 1
Double DQN
1: Initialize policy network Q_θ and target network Q_θ⁻ with random parameters.
2: Initialize replay buffer B.
3: for each episode until end of learning do
4:    Initialize state s
5:    for step t = 1, · · · until s is a terminal state of the episode do
6:        Select action a_t = argmax_a Q_θ(s, a) with exploration
7:        Take action a_t, observe reward r and next state s'
8:        Store experience tuple ⟨s, a_t, r, s'⟩ into B
9:        Sample a mini-batch of experiences from B
10:       for all sampled experiences in the mini-batch do
11:           To train network Q_θ, compute a' = argmax_a Q_θ(s', a)
12:           Estimate the TD target with the target network: y = r + γQ_θ⁻(s', a')
13:           Backpropagate the TD error δ = y − Q_θ(s, a_t) through Q_θ, update θ with learning rate α
14:       end for
15:       s ← s'
16:       Update target network θ⁻ ← θ at a fixed frequency
17:    end for
18: end for

[20] proposed the dueling network architecture, in which the lower layers of a deep neural network are shared and followed by two streams of fully-connected layers that represent two separate estimators, one for the state value function V(s) and the other for the associated state-dependent action advantage function A(s, a). The two outputs are then combined to estimate the action value Q(s, a):
$$Q(s,a) = V(s) + A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'}A(s,a')\qquad(4)$$
Note that here the average of the advantage values across all possible actions is used to achieve better stability, instead of the max operator in the other form proposed in [20], i.e.,
$$Q(s,a) = V(s) + A(s,a) - \max_{a'}A(s,a')\qquad(5)$$
The dueling factoring often leads to faster convergence and better policy evaluation, especially in the presence of similar-valued actions. The use of advantage values is more robust to noise, since it emphasizes the gaps between the Q-values of different actions in the same state; these gaps are usually tiny, so a small amount of noise may reorder the actions. In addition, the subtraction of an action-irrelevant baseline in Equation (4) also effectively reduces variance, which helps stabilize learning and is therefore the form used more often. The shared feature-learning module also generalizes learning across actions: more frequent updating of the value stream V leads to more efficient learning of state values, in contrast to single-stream DQNs, where only one of the action values is updated while the others remain untouched.

The main purpose of Bootstrapped DQN [11] is to provide efficient "deep" exploration inspired by
Thompson sampling, or probability matching in Bayesian reinforcement learning [21]; instead of maintaining a distribution over possible values and performing an intractable exact posterior update, it takes a single sample from the posterior. Bootstrapped DQN maintains a Q-ensemble, represented by a multi-head deep neural network, in order to parameterize a set of K ∈ ℕ₊ different Q-value functions. The lower layers are shared by the K "heads", and each head represents an independent estimate of the action value Q_k(s, a | θ_k). For each episode during training, Bootstrapped DQN picks a single head uniformly at random and follows the greedy policy with respect to the selected Q-value estimates, i.e., a_t = argmax_a Q_k(s_t, a), until the end of the episode.

Bootstrapped DQN diversifies the Q-estimates and improves exploration through independent initialization of the K heads, as well as through the fact that each head is trained with different experience samples. The K heads can be trained together with the help of so-called bootstrap masks m_τ^k, which decide whether the k-th head should be trained, i.e., the transition experience τ updates Q_k only if m_τ^k is nonzero. In addition, bootstrapped DQN adapts double DQN in order to avoid overestimation, i.e., the estimates of the TD targets are calculated using the target network Q_{θ_k⁻}. The loss backpropagated to the k-th head is then
$$L(\theta_k) = \mathbb{E}_{\tau}\Big[m_{\tau}^{k}\big(r + \gamma Q_k(s',a'\mid\theta_k^{-}) - Q_k(s,a\mid\theta_k)\big)^{2}\Big]\quad\text{where}\quad a' = \operatorname*{argmax}_a Q_k(s',a\mid\theta_k)\qquad(6)$$
Note that the gradients should be further aggregated and normalized when updating the lower layers of the network.

In this section, we elaborate on our proposed cross Q-learning method and its variants. Cross DQN serves as an extension to the double DQN algorithm [2], which has been the default setting for most state-of-the-art DQN training.

Double DQN was proposed with the aim of reducing overestimation bias, where the target network is simply a delayed copy of the current network. Note that the original vanilla DQN also uses two networks; the purpose of periodically freezing and updating the target network is to stabilize learning. Specifically, in vanilla DQN, the target network is used to evaluate both the action and the value, i.e.,
$$y \leftarrow r + \gamma Q_{\theta'}(s', a'^{*})\quad\text{where}\quad a'^{*} = \operatorname*{argmax}_{a'} Q_{\theta'}(s', a')\qquad(7)$$
On the other hand, in double DQN, the current network is used to evaluate the actions and select a', while the target network is used to evaluate the value, so that action selection is decoupled from the estimation of the target:
$$y \leftarrow r + \gamma Q_{\theta'}(s', a'^{*})\quad\text{where}\quad a'^{*} = \operatorname*{argmax}_{a'} Q_{\theta}(s', a')\qquad(8)$$
In practice, however, it is common that little improvement can be gained by using double DQN, since the current and target networks are usually too similar, owing to the slowly changing parameters of neural network models under SGD optimization. Nor can we set the target-update period too long, otherwise the derived policy would not exhibit learning progress. As a result, double DQN does not entirely eliminate the overestimation bias. In Section 4, we further show experimentally that the elimination of overestimation in double DQN is neither effective nor sufficient.

Instead of maintaining only two separate networks, in our cross Q-learning we use a set of K models for estimating Q-values and selecting actions.
When updating each network's parameters, we calculate its TD target Q-value using one of the other K − 1 models. More specifically, let the network whose parameters we are about to adjust be the current network (θ_i), and randomly pick another network to be the target network (θ_j, with j drawn uniformly from {1, · · · , K} \ {i}). To compute the target Q-value, we use the current network to evaluate the actions and select a' in the next state s', while the value is evaluated by the target network, i.e.,
$$y \leftarrow r + \gamma Q_{\theta_j}(s', a'^{*})\quad\text{where}\quad a'^{*} = \operatorname*{argmax}_{a'} Q_{\theta_i}(s', a')\qquad(9)$$

Algorithm 2
Cross-Learning DQN
1: Initialize K ∈ ℕ₊ different Q-functions Q(s, a | θ_k) with random parameters θ_k, for k = 1, · · · , K.
2: Initialize replay buffer B.
3: for each episode until end of learning do
4:    Initialize state s
5:    for step t = 1, · · · until s is a terminal state of the episode do
6:        Select action a_t according to Q with exploration, e.g., a_t = MajorityVote{argmax_a Q_k(s, a)}_{k=1}^{K}
7:        Take action a_t, observe reward r and next state s'
8:        Store experience tuple ⟨s, a_t, r, s'⟩ into B
9:        Sample a mini-batch of experiences from B
10:       for all sampled experiences in the mini-batch do
11:           To train network Q_i, compute a' = argmax_a Q_i(s', a | θ_i)
12:           Randomly pick another network Q_j and estimate the TD target y = r + γQ_j(s', a' | θ_j)
13:           Backpropagate the TD error δ = y − Q_i(s, a | θ_i) through Q_i, update θ_i with learning rate α_t
14:       end for
15:       s ← s'
16:    end for
17: end for

In implementation, we have flexibility and various options for how to utilize the K different Q-networks. There are always tradeoffs among the different choices, which we need to consider in order to pick the one that best meets our goal. For example, we can have different designs of the neural network architecture. A natural choice for having K independent models is to maintain a list of separate neural networks with the same architecture; the differences between their outputs (i.e., K streams of Q-values derived from the same (s, a)-pair as input) come from the different random parameter initialization of each model, and also from the different data each model is trained on, i.e., for each step of backpropagation, each model randomly samples its own mini-batch of experiences and performs SGD optimization on it. Maintaining K copies of the model implies that not only is the storage K times as large as for a single network, but the forward propagation also takes K times the amount of computation. Instead, we can use a shared network design for the K models, in which the models share their weights except for the last layer, which consists of K value-function heads from which the value functions Q_k(s, a | θ_k) are derived; the weights of the last layer are generally different. We then have far fewer parameters in total to train, and the computational burden is greatly alleviated. Moreover, as recent deep learning research reveals, the first few layers of a neural network are mainly about representation learning; the shared layers provide the same features for computing Q, which can be seen as online transfer of learned knowledge among models. Note that in the shared setting, in order to avoid premature learning and suboptimal convergence, the gradients of the network except for the last layer are usually normalized by 1/K, but this also results in slower learning early on. On the other hand, the separate models are simpler, provide more variability in the Q-values, and are more stable during training. In addition, when we train the networks on a distributed system, the separate networks do not depend on each other's weights and thus can be learned independently, which requires much less information exchange; this can be a huge advantage for distributed learning. The comparison of the separate and shared network architecture designs is shown in Figure 1.
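As a concrete illustration of the per-sample update in Algorithm 2 (Equation (9)) under the separate-network design, the following sketch assumes K PyTorch networks held in a list `nets` with matching optimizers `opts` (hypothetical names), with `dones` encoded as 0/1 floats and `actions` as integer indices:

```python
import random
import torch
import torch.nn.functional as F

def cross_dqn_update(nets, opts, batch, gamma=0.99):
    """One cross Q-learning step (Equation (9)): train a randomly chosen
    network i, bootstrapping the target value from another network j != i."""
    states, actions, rewards, next_states, dones = batch
    i = random.randrange(len(nets))                               # network to train
    j = random.choice([k for k in range(len(nets)) if k != i])    # target network
    with torch.no_grad():
        # action selection with the current network i ...
        a_next = nets[i](next_states).argmax(dim=1, keepdim=True)
        # ... value evaluation with the randomly picked network j
        q_next = nets[j](next_states).gather(1, a_next).squeeze(1)
        y = rewards + gamma * (1.0 - dones) * q_next
    q = nets[i](states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q, y)
    opts[i].zero_grad()
    loss.backward()
    opts[i].step()
    return loss.item()
```

With the shared design, `nets` would instead correspond to the K heads of a single network, and the gradients of the shared layers would additionally be scaled by 1/K, as noted above.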
Figure 1: Separate and shared network architectures. (a) Separate network design; (b) shared network design (shared layers with K value heads).

With K different models (or heads), each of which could derive a possibly different policy, there is no doubt that during the test phase we should take advantage of ensembles, for instance by choosing the action with the majority of votes across the outputs. However, we can make choices about how to combine action selections into a single policy during training. With ensemble action selection such as majority voting, the derived policy is often superior to any individual one and thus greatly reduces the variance during training, as we show experimentally in Section 4. This in turn refines exploitation, results in a great reduction of the variance of the Q-values, and speeds up learning. Note that to deal with the exploration-exploitation dilemma, an ε-greedy strategy is still needed to encourage exploration. On the other hand, we may also randomly pick a single network from the K models and act as it suggests during training. This falls into the paradigm of Bootstrapped DQN [11], which encourages exploration at the cost of slower early learning (see Section 4), but may learn a better policy later thanks to the additional exploration. Another advantage of bootstrapped action selection is that it slightly reduces the computational burden: instead of forward-passing and computing all K of the Q-values for action selection, we compute only one of them. The bootstrapped version of cross DQN is presented in Algorithm 3.

Algorithm 3
Bootstrapped Cross DQN
1: Initialize K ∈ ℕ₊ different Q-functions Q(s, a | θ_k) with random parameters θ_k, for k = 1, · · · , K.
2: Initialize replay buffer B.
3: for each episode until end of learning do
4:    Initialize state s
5:    Randomly pick a network Q_k to act, where k ∈ {1, · · · , K}
6:    for step t = 1, · · · until s is a terminal state of the episode do
7:        Select action a_t = argmax_{a'} Q_k(s, a') with exploration
8:        Take action a_t, observe reward r and next state s'
9:        Store experience tuple ⟨s, a_t, r, s'⟩ into B
10:       Sample a mini-batch of experiences from B
11:       for all sampled experiences in the mini-batch do
12:           To train network Q_i, compute a' = argmax_{a'} Q_i(s', a' | θ_i)
13:           Randomly pick another network Q_j and estimate the TD target y = r + γQ_j(s', a' | θ_j)
14:           Backpropagate the TD error δ = y − Q_i(s, a_t | θ_i) through Q_i, update θ_i with learning rate α_t
15:       end for
16:       s ← s'
17:    end for
18: end for

Another choice we can make is the training frequency. In our cross DQN setting, when backpropagation occurs, we can either choose to train a single network (e.g., the single model that provides the action selection), or each of the K networks can independently sample a mini-batch of experiences and perform SGD optimization. The latter increases sample efficiency and speeds up learning, while the former reduces the computational burden, since the number of backpropagations (the most computationally expensive operation) remains the same as for a single DQN. In addition, with the former setting, our cross Q-learning does not require maintaining frozen copies of the networks as targets. Experimentally, we found that freezing targets has hardly any effect on the stabilization of learning, and only costs doubled memory for model storage. This is due to two reasons. First, we bootstrap a model that is different from the current one; with multiple networks (K ≥ 2), the variety of models ensures differences in parameter initialization as well as differences in the mini-batch data each model learns from, which in turn ensures the independence of the Q-value estimates. Second, with less frequent updates of each network, the bootstrapped target Q-value also changes less, which further helps stabilize learning.

In this paper, we conducted experiments on two classical control problems, CartPole and LunarLander, for extended tests. We selected these testbeds with the aim of covering different challenges, especially in terms of complexity. Both environments are interfaced through the OpenAI Gym environment [22], unless specified otherwise. The neural networks have a number of hyperparameters; the combinatorial space of hyperparameters is too large for an exhaustive search, so we performed limited tuning. For each component, we started from the same settings as in [23] in order to make comparisons with state-of-the-art results.
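For completeness, the majority-vote action selection used in Algorithm 2 (and in the evaluation protocol described below) can be sketched as follows; this helper is illustrative only, with hypothetical names:

```python
from collections import Counter
import torch

def majority_vote_action(nets, state):
    """Action chosen by the most networks for a single state (ties broken arbitrarily)."""
    with torch.no_grad():
        votes = [int(net(state.unsqueeze(0)).argmax(dim=1)) for net in nets]
    return Counter(votes).most_common(1)[0][0]
```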
The CartPole task, also known as the inverted pendulum, has a pole (pendulum) attached by an un-actuated joint to a cart (the pivot point). The pendulum starts upright at the center of a 2D track but is unstable, since the center of gravity is above the pivot point. The goal of this task is to keep the pole balanced and prevent it from falling over by applying an appropriate force to the pivot point; the force moves the cart along a frictionless track of finite length 4.8 units. An immediate reward of +1 is provided for every timestep that the pole remains upright, and the maximum cumulative reward in an episode is capped at 200. An episode also ends when the pole is slanted beyond a fixed threshold angle from vertical, or when the cart moves out of the track [24]. At each timestep, the agent is provided with the current state s ∈ ℝ⁴, whose components represent cart position, cart velocity, pole angle, and pole angular velocity, respectively. A unit force from either the left or the right can be applied, so the actions are discrete with a ∈ {−1, +1}.

As in [23], we approximate the Q-values using a neural network with two fully-connected hidden layers (consisting of 64 and 32 neurons, respectively). We train each of the neural networks for 1000 episodes (a little less than 200000 steps in total), with a FIFO memory for experience replay. A target network is updated every 500 steps to further stabilize learning. The adaptive moment estimation (Adam) optimizer is used to train the network, since it is in general less sensitive to the choice of learning rate than other stochastic gradient descent algorithms [25]. The optimization is performed on mini-batches of size 32, sampled uniformly from the experience replay. The discount factor γ is set to 0.99, and an ε-greedy policy is used for choosing actions throughout the interaction with the environment, starting with ε = 1 and annealing to a small final value over the first 10000 steps.

After every 20 training episodes, we conduct a performance test that plays 10 full episodes using the greedy policy derived deterministically from the current network. For models with K > 1, majority voting is used for action selection, regardless of whether a bootstrapped Q-value head is used during training. The cumulative reward of each test episode is used for comparisons among the different models. Moreover, in order to compare the Q-value estimates among the models, every 20 training episodes we randomly sample a batch of historical (s, a)-pairs from the replay buffer and compute their Q-values using the current network; more than one thousand samples ensure that their mean is reasonably representative of the Q-values under the current model.

4.1.2 Analysis of Cross Q-learning Effects

We compared our cross Q-learning algorithms with vanilla DQN and double DQN. Note that vanilla DQN uses single estimators, double DQN uses double estimators, and our cross DQN uses cross estimators. K = 5 and K = 10 are used for the cross DQNs. Figure 2(a) illustrates the training history of episodic total rewards of the four models. With a single network (vanilla and double DQN), the agent starts to learn early on with fewer samples; in particular, double Q-learning helps the single network learn even faster. However, the learned models are not stable.
With cross Q-learning, the networks learn more slowly at the beginning; in particular, cross DQN with K = 10 starts to learn even later than cross DQN with K = 5. Once the cross DQNs start to learn, however, the performance improvement is substantial: not only are the total rewards higher, the learning is also much more stable. After 300 episodes, the training total rewards converge to 200 for the K = 10 cross DQN, with little variation (due to ε exploration). The K = 5 cross DQN has more variation, but it also seems to converge after 900 episodes, whereas vanilla DQN and double DQN deteriorate easily and have much larger variations.

The performance improvement can be seen more clearly in Figure 2(b). After 300 episodes of training, the policies derived from the K = 10 cross DQN become more and more stable, and the variance of the test total rewards drops to zero close to the end of training. Cross DQN with K = 5 deteriorates after 500 episodes of training, but later it also learns a stable policy that achieves total rewards of 200 with tiny variance. In contrast, the policies derived from vanilla DQN and double DQN only reach scores roughly half those of the cross DQNs, and with large variances. The policy derived from double DQN seems slightly better than that from vanilla DQN, but the improvement is not as significant as that of cross Q-learning.

Furthermore, part of the reason for the slower start of cross DQN is our learning setting, in which we only perform SGD optimization on one of the networks (or heads) at a time. In other words, we reduce the learning frequency of each network (or head) to 1/K to alleviate the computational effort, at the cost of a slower start. If we increase the learning frequency (i.e., backpropagate for each of the K networks/heads every time), learning should be faster.

We also plot the average Q-values over 1024 bootstrapped (s, a)-pairs, as shown in Figure 2(c). We observe that at the beginning of learning, vanilla DQN has the highest Q-value estimates, which is evidence of overestimation. The estimates from double DQN are lower, but only by a limited amount; we therefore say that double Q-learning does not solve the overestimation problem completely. The cross DQNs have considerably smaller estimates at the beginning; in particular, as K gets larger, the Q-value estimates become even lower. Overestimation is clearly an obstacle to effective learning: later in training, the estimated Q-values from the cross DQNs are substantially higher than those from vanilla or double DQN, because the cross DQNs have derived better policies and obtained higher rewards. The Q-value estimates from the cross DQNs start to converge after the derived policies stabilize. At the end of training, the estimated Q-values from the four different models are at about the same level; however, note that the estimates from vanilla and double DQN continue increasing, and their derived policies are not stable and yield lower rewards. Our cross Q-learning algorithm addresses the overestimation problem better.

As the cross learning architecture shares the same input-output interface as standard DQN, we can recycle many recent advances in DQN research. We mentioned one variant in Section 3, namely that cross DQN can be combined with Bootstrapped DQN for action selection during training, while in Section 4.1.2 our experiments for cross DQN are based on majority voting over the K different Q-functions. Furthermore, it is convenient to combine the dueling architecture with each of the K networks.
The goal of dueling DQN is to reduce the variance of the Q-value estimates by subtracting a baseline and emphasizing the advantages among different actions, which accelerates learning effectively. That variance reduction is performed on a single network's estimates, while our cross Q-learning reduces variance from a different perspective: for each network, the target values are calculated with other models by bootstrapping from multiple Q-values, which introduces some bias. Due to the bias-variance tradeoff, however, the variance of our estimates decreases, and the overall error becomes smaller. In addition, the maximum operator induces overestimation bias, while the cross estimator tends to introduce bias in the other direction, thus greatly alleviating the overestimation problem.
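For reference, the dueling aggregation of Equation (4), which can be attached to each of the K networks or heads, can be sketched as follows. This is an illustrative PyTorch module with hypothetical layer sizes, not the exact architecture used in the experiments:

```python
import torch.nn as nn

class DuelingHead(nn.Module):
    """Combines V(s) and A(s,a) as in Equation (4): Q = V + A - mean_a A."""
    def __init__(self, in_features, n_actions, hidden=64):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(in_features, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(nn.Linear(in_features, hidden), nn.ReLU(),
                                       nn.Linear(hidden, n_actions))

    def forward(self, features):
        v = self.value(features)                      # shape (batch, 1)
        a = self.advantage(features)                  # shape (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)    # Equation (4)
```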
Figure 2: Comparison of vanilla DQN, double DQN, and cross DQNs with K = 5 and K = 10 on CartPole. (a) Learning curves. (b) Model test performance; every 20 training episodes, 10 full test episodes were conducted. (c) Mean of Q-value estimates on CartPole; every 20 training episodes, 1024 (s, a)-pairs were bootstrapped.

Figure 3 and Figure 4 illustrate the training and testing performance of cross DQN with different architectures, for the cases K = 5 and K = 10, respectively. We can see that the dueling architecture speeds up early learning effectively, without in general hurting model performance later on. On the other hand, Bootstrapped DQN slows learning at the beginning, especially when K is large, since early on the selected actions vary considerably among networks. For example, the K = 10 cross DQN with bootstrap converges around 400 episodes, while the other cross learning agents converge before 200 episodes. But after some initial learning, bootstrapped action selection does not hurt the model; in fact, it might help learning in more complicated tasks because of the additional early exploration. At the least, using bootstrapped action selection helps our cross DQN agent make faster action selections during training and slightly reduces the computational burden, since instead of calculating all K Q-values, we calculate only one of them. Moreover, by comparing the learning curves of bootstrapped cross DQNs with different K, we can conclude that it is primarily our cross Q-learning, rather than the policy ensemble, that greatly reduces the variance, as with K = 10 the variations are much smaller than with K = 5; policy ensembling further reduces the variance, and during the testing phase our agent clearly benefits from the ensemble of multiple models. By naturally combining cross Q-learning with the dueling and bootstrapped DQN, our model aggregates the merits of all three perspectives.
Figure 3: Comparison of cross DQNs with K = 5 on CartPole: cross DQN with ensemble voting, with dueling DQN and voting, with bootstrapped DQN, and with both dueling and bootstrapped DQN. (a) Learning curves. (b) Model test performance; every 20 training episodes, 10 full test episodes were conducted.

Figure 4: Comparison of cross DQNs with K = 10 on CartPole: cross DQN, with dueling DQN, with bootstrapped DQN, and with both dueling and bootstrapped DQN. (a) Learning curves. (b) Model test performance; every 20 training episodes, 10 full test episodes were conducted.

The task of LunarLander in Box2D [26] is to land the spaceship smoothly between the flags. At each step, the agent is provided with the current state s ∈ ℝ⁸ of the lander, in which 6 of the dimensions are continuous and the other 2 are discrete dummy variables, and the agent is allowed to take one of 4 possible actions (i.e., the action space is discrete): fire the left, right, or down throttle, giving the lander a force in the opposite direction, or do nothing. At the end of each step, the agent receives a reward and moves to a new state s'. An episode finishes if the lander rests on the ground at zero speed (receiving an additional reward of +100), hits the ground and crashes (receiving an additional negative reward), flies outside the screen, or reaches the maximum of 1000 time steps. The agent aims for a successful landing, defined as reaching the landing pad (between the two flags) centered on the ground at zero speed, for which it receives an additional reward, while landing outside the pad incurs a penalty.

We built each network with two fully-connected hidden layers, consisting of 128 and 64 neurons, respectively. We train each of the neural networks for 10000 episodes on the LunarLander task, with a much larger replay buffer. The target network is updated every 1000 steps for vanilla and double DQN, and the Adam optimizer with a batch size of 64 is used to train all models. The discount factor γ is again 0.99, and the exploration rate ε is annealed to a small final value over the first 100000 steps. As before, Q-values for 1024 bootstrapped (s, a)-pairs are evaluated and 10 performance-test episodes with the current policy are conducted every 20 training episodes.

In Figure 5, we compare our cross Q-learning algorithms with vanilla DQN and double DQN. Despite slower learning in the first few hundred episodes, due to our experimental design of the learning frequencies, the cross DQNs learn much better and more stable policies, while vanilla and double DQN show large variances in both the learning curves and the performance tests. Figure 5(c) clearly shows that from the beginning, vanilla DQN optimistically picks up the occasional large rewards that are due to the high variance, and produces great overestimation. Double DQN slightly alleviates the problem, but cannot avoid the overestimation effectively.
The derived policies from these two networks are then neither optimal nor stable. As learning goes on, the estimated Q-values from both vanilla and double DQN explode, with the result that the derived policies are no better than random actions. On the other hand, the cross DQNs have much lower Q-value estimates at the beginning, and the estimates from the model with K = 10 are even lower than those from the model with K = 5.

After 1000 episodes, the estimates continue growing until convergence, and they converge to about the same level. The derived policies are very stable, with total rewards close to 300 and little variance. Note that double DQN has lower Q-value estimates than the cross DQNs after 8000 episodes of training; the reason is that the corresponding policies from double DQN are much worse, not that double DQN addresses overestimation better.

Comparing Figure 6 and Figure 7, K = 5 seems to work even better than K = 10 most of the time. Especially for the K = 10 bootstrapped cross DQN, both the learning curve and the test scores are lower than for the other cross DQN models. This indicates that larger K is not always better: the cross estimator induces underestimation bias, and too much underestimation may also hide the truly better actions and thus hurt model performance. In fact, the K = 10 cross DQN might have too much underestimation at the beginning, which slows down the learning process significantly. Overall, however, the K = 10 bootstrapped cross learning with the dueling architecture performs best among all models, including all K = 5 cross DQNs. DQN architectures are complicated, and the aggregated effect may significantly change the performance of a particular model. Generally speaking, our cross DQNs favor underestimation, which should be much better than overestimation when no unbiased estimate can be achieved, since underestimation does not tend to propagate much during training, as lower-valued actions are avoided by the greedy action selection mechanism. The bias-variance tradeoff tells us that the overall error can be reduced when the variance of our estimates is greatly decreased by introducing a slight negative bias, and this in turn leads to better model performance.

Note that the derived policies from the cross DQNs are much more stable in general and hard to deteriorate. There are at least two reasons for this phenomenon. First, cross Q-learning effectively addresses the overestimation problem, so premature policies are less likely to be derived from cross DQN. In addition, we always ensemble policies using methods such as majority voting at test time, which is in general superior and has a stabilizing effect on action selection. The improved stability comes from the larger barrier to altering the decision boundaries, and we can care much less about early termination as an additional hyperparameter during training. This is yet another advantage of using multiple networks as in cross DQN.

Figure 5: Comparison of vanilla DQN, double DQN, and cross DQNs with K = 5 and K = 10 on LunarLander. (a) Learning curves. (b) Model test performance; every 20 training episodes, 10 full test episodes were conducted. (c) Mean of Q-value estimates on LunarLander; every 20 training episodes, 1024 (s, a)-pairs were bootstrapped.

In this paper, we have presented the cross Q-learning algorithm, an extension to DQN that effectively reduces overestimation, stabilizes training, and improves performance. Cross DQN is a simple extension that can easily be integrated with other algorithmic improvements such as the dueling network and bootstrapped DQN, leading to dramatic performance enhancement. We have shown in theory and demonstrated in several experiments on classical control problems that the proposed scheme is superior in reducing overestimation and leads to better derived policies, compared to widely used approaches such as double DQN. Cross learning favors underestimation, and the introduced negative bias can greatly help variance reduction; we analyze this effect from the point of view of the well-known bias-variance tradeoff. However, this also indicates that it is not the case that the larger the K, the better the model performance in cross DQN. Nevertheless, DQN models tolerate underestimation much better than overestimation, as lower-valued actions can be avoided by the greedy action selection mechanism.

It is noted that the computational complexity of cross DQN is generally higher than that of single-network DQNs. We can, however, greatly reduce the complexity given the flexibility provided by our model. In addition, ensemble policies from multiple networks help stabilize the decision space, which can optionally be utilized to stabilize learning and is definitely useful during testing.

As future work, we would apply cross learning to state-of-the-art actor-critic methods in continuous control, to further reduce the overestimation and stabilize those algorithms. Also, analysis from statistical learning theory could help us derive more advanced cross learning strategies; for instance, better bootstrap estimates may be obtained by mimicking K-fold cross validation [27], or from a Bayesian perspective [28].

Figure 6: Comparison of cross DQNs with K = 5 on LunarLander: cross DQN, with dueling DQN, with bootstrapped DQN, and with both dueling and bootstrapped DQN. (a) Learning curves. (b) Model test performance; every 20 training episodes, 10 full test episodes were conducted.

Figure 7: Comparison of cross DQNs with K = 10 on LunarLander: cross DQN, with dueling DQN, with bootstrapped DQN, and with both dueling and bootstrapped DQN. (a) Learning curves. (b) Model test performance; every 20 training episodes, 10 full test episodes were conducted.

Moreover, it is worth noting that in each step of Q-learning (and more generally, of value-based RL), we utilize Q-values in several different places. Now that a set of K different Q-functions is available, we can make different choices about which one to use in each place.
We call these generalized cross learning in DQNs, and some existing work falls into particular subclasses of our generalized method. The first place where Q-values are utilized is when the agent makes the decision to choose an action a_t at time step t upon observing s_t. We can pick a random Q-function for action selection, which is exactly what bootstrapped DQN [11] does; in this sense, bootstrapped DQN is a special case of our generalized cross DQN. The next place is at the TD update, when the target Q-values need to be evaluated for choosing the next action a', which might not be executed but is used to evaluate the current target Q-value and realize the max operator; recall that in Q-learning we use the maximum estimator. Finally, after picking the next action a', its value is evaluated, and again we have a choice of which Q-function to use. In the version of cross DQN presented in this work, which is directly derived from double DQN, we decoupled the selection and evaluation of the next action a': the current network selects a' while another, randomly picked network evaluates it. One could also do the opposite in certain circumstances, i.e., select a' with a bootstrapped network and evaluate it with the current network, which in general shifts the bias-variance tradeoff in the other direction. One can further analyze and experiment with other generalized cross Q-learning variants.

References

[1] Sebastian Thrun and Anton Schwartz. Issues in using function approximation for reinforcement learning.
[2] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In
AAAI, pages 2094–2100, 2016.
[3] Donghun Lee, Boris Defourny, and Warren B Powell. Bias-corrected Q-learning to control max-operator bias in Q-learning. In , pages 93–99. IEEE, 2013.
[4] Scott Fujimoto, Herke Van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.
[5] Oron Anschel, Nir Baram, and Nahum Shimkin. Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 176–185. JMLR.org, 2017.
[6] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pages 315–323, 2013.
[7] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in neural information processing systems, pages 1646–1654, 2014.
[8] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.
[9] Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. The Journal of Machine Learning Research, 18(1):8194–8244, 2017.
[10] Zengqiang Chen, Beibei Qin, Mingwei Sun, and Qinglin Sun. Q-learning-based parameters adaptive algorithm for active disturbance rejection control and its application to ship course control. Neurocomputing, 2019.
[11] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in neural information processing systems, pages 4026–4034, 2016.
[12] Xi-liang Chen, Lei Cao, Chen-xi Li, Zhi-xiong Xu, and Jun Lai. Ensemble network architecture for deep reinforcement learning. Mathematical Problems in Engineering, 2018.
[13] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[14] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In
International conference on machine learning, pages 1928–1937, 2016.
[15] ML Puterman. Markov decision processes. John Wiley & Sons, New Jersey, 1994.
[16] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.
[17] Richard S Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
[18] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning.
Nature, 518(7540):529–533, 2015.
[19] Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 8(3-4):293–321, 1992.
[20] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.
[21] Malcolm Strens. A Bayesian framework for reinforcement learning.
[22] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
[23] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. OpenAI Baselines. https://github.com/openai/baselines, 2017.
[24] Andrew G Barto, Richard S Sutton, and Charles W Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE transactions on systems, man, and cybernetics, 5:834–846, 1983.
[25] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[26] Erin Catto. Box2D: A 2D physics engine for games, 2011.
[27] Hado Van Hasselt. Estimating the maximum expected value: an analysis of (nested) cross validation and the maximum sample average. arXiv preprint arXiv:1302.7175, 2013.
[28] Carlo D'Eramo, Marcello Restelli, and Alessandro Nuara. Estimating maximum expected value through Gaussian approximation. In International Conference on Machine Learning, pages 1032–1040, 2016.
[29] Satinder Singh, Tommi Jaakkola, Michael L Littman, and Csaba Szepesvári. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine learning, 38(3):287–308, 2000.
[30] Hado V Hasselt. Double Q-learning. In Advances in Neural Information Processing Systems, pages 2613–2621, 2010.
A Estimating the Maximum Expected Values
For Q-learning, the action is selected according to the estimated target Q-values. This is an instance of the more general maximum expected value estimation problem, which is posed as follows. Consider a set of |A| random variables Q = {Q_{a_1}, · · · , Q_{a_{|A|}}}; we are interested in finding the maximum expected value among the set of variables, which is defined as max_a µ_a = max_a E[Q_a], where each E[Q_a] is usually estimated from samples. Let Ω_a denote the sample space for estimating Q_a, for a ∈ A, and further assume that the samples in Ω_a are i.i.d. The sample mean µ̂_a = (1/|Ω_a|) Σ_{x∈Ω_a} x is then an unbiased estimator of E[Q_a].

Let f_a : ℝ → ℝ be the probability density function (PDF) of the variable Q_a, and F_a(x) = ∫_{−∞}^{x} f_a(u) du its cumulative distribution function (CDF). The maximum expected value is then
$$\max_a \mathbb{E}[Q_a] = \max_a \int_{-\infty}^{\infty} x f_a(x)\,dx.\qquad(10)$$

A.1 (Single) Maximum Estimator
The most straightforward way to approximate max_a E[Q_a] is to take the maximum of the sample means, i.e., max_a E[Q_a] ≈ max_a µ̂_a. Note that the sample means µ̂_a are unbiased estimates of the true means; thus max_a µ̂_a is an unbiased estimate of E[max_a µ̂_a] = ∫_{−∞}^{∞} x f_max(x) dx, but it is a biased estimate of max_a E[Q_a]. Considering its CDF, F_{max}(x) = P{max_a µ̂_a ≤ x} = Π_a P{µ̂_a ≤ x} = Π_a F_{µ̂_a}(x), we can write
$$\mathbb{E}\big[\max_a \hat{\mu}_a\big] = \int_{-\infty}^{\infty} x\,\frac{d}{dx}\prod_a F_{\hat{\mu}_a}(x)\,dx = \sum_{a}\int_{-\infty}^{\infty} x f_{\hat{\mu}_a}(x)\prod_{a'\neq a}F_{\hat{\mu}_{a'}}(x)\,dx.\qquad(11)$$
Comparing equations (10) and (11), clearly max_a E[Q_a] and E[max_a µ̂_a] are not equivalent. Moreover, the product term Π_{a'≠a} F_{µ̂_{a'}}(x) in the integral introduces positive bias: since CDFs are monotonically increasing, the integrand shifts weight toward larger values of x as more product terms are included. Therefore, the expected value of the single maximum estimator is an overestimate of the maximum expected value.

A.2 Double Estimator
Consider the case where we use two sets of estimators, µ̂^A = {µ̂^A_{a_1}, · · · , µ̂^A_{a_{|A|}}} and µ̂^B = {µ̂^B_{a_1}, · · · , µ̂^B_{a_{|A|}}}, in which each µ̂^A_a is estimated from a set of samples independent of the one used to estimate µ̂^B_a, i.e., µ̂^A_a = (1/|Ω^A_a|) Σ_{x∈Ω^A_a} x, µ̂^B_a = (1/|Ω^B_a|) Σ_{x∈Ω^B_a} x, and Ω^A_a ∩ Ω^B_a = ∅. For all a, both µ̂^A_a and µ̂^B_a are unbiased estimators of E[Q_a], assuming all the samples in both sets are drawn independently from the population. That means E[µ̂^A_a] = E[Q_a] for all a, including a*_B = argmax_a µ̂^B_a, the action that maximizes the sample mean µ̂^B. Therefore, µ̂^A_{a*_B} can be used to estimate max_a E[µ̂^A_a] as well as max_a E[Q_a], i.e., max_a E[Q_a] = max_a E[µ̂^A_a] ≈ µ̂^A_{a*_B}. The same argument holds in the opposite direction, considering the best action over Ω^A and the sample mean µ̂^B_{a*_A}. The selection of a* means that all other a give lower estimates, i.e., P(a = a*) = Π_{a'≠a} P(µ̂^A_{a'} < µ̂^A_a). Let f^A_a and F^A_a be the PDF and CDF of µ̂^A_a, respectively. Then
$$P(a = a^{*}) = \int_{-\infty}^{\infty} P(\hat{\mu}^{A}_{a} = x)\prod_{a'\neq a}P(\hat{\mu}^{A}_{a'} < x)\,dx = \int_{-\infty}^{\infty} f^{A}_{a}(x)\prod_{a'\neq a}F^{A}_{a'}(x)\,dx.$$
The expected value of the double estimator is a weighted sum of the sample means' expected values in one sample space, weighted by the probability of each sample mean being the maximum in the other sample space, i.e.,
$$\sum_{a}P(a = a^{*})\,\mathbb{E}[\hat{\mu}^{B}_{a}] = \sum_{a}\mathbb{E}[\hat{\mu}^{B}_{a}]\int_{-\infty}^{\infty} f^{A}_{a}(x)\prod_{a'\neq a}F^{A}_{a'}(x)\,dx.$$
The double estimator has negative bias: since the weights P(a = a*) are probabilities that are positive and sum to 1, the maximum expected value serves as an upper bound for the weighted sum, as some weight may be given to variables whose expected value is less than the maximum.

A.3 Cross Estimator
We can easily extend the double estimator to a more general case: instead of using two sets of estimators, suppose we have K independent sets of estimators µ̂^1, · · · , µ̂^K. We call this the cross estimator; the double estimator can be seen as a special case of the more general cross estimator. An argument similar to that for the double estimator applies here: for any two estimators µ̂^i and µ̂^j,
$$\max_a \mathbb{E}[Q_a] = \max_a \mathbb{E}[\hat{\mu}^{i}_{a}] \approx \hat{\mu}^{i}_{a^{*}_{j}}.$$
The cross estimator is thus a convex combination of the sample means,
$$\sum_{a}P(a = a^{*})\,\mathbb{E}[\hat{\mu}^{i}_{a}] = \sum_{a}\mathbb{E}[\hat{\mu}^{i}_{a}]\int_{-\infty}^{\infty} f^{j}_{a}(x)\prod_{a'\neq a}F^{j}_{a'}(x)\,dx,$$
and therefore also underestimates the maximum expected value.

Theorem 1. [27] There does not exist an unbiased estimator for the maximum expected values.
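A small Monte Carlo illustration (not taken from the paper) makes the sign of these biases concrete: with several variables whose true means are all zero, the single maximum estimator is positive on average, while the double and cross estimators are negative on average.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_samples, n_trials, K = 10, 20, 10_000, 5
single, double, cross = [], [], []
for _ in range(n_trials):
    # all true means are 0, so the true max_a E[Q_a] is 0
    samples = rng.normal(0.0, 1.0, size=(n_actions, n_samples))
    single.append(samples.mean(axis=1).max())
    # double estimator: select with one half of the samples, evaluate with the other
    mu_a = samples[:, :n_samples // 2].mean(axis=1)
    mu_b = samples[:, n_samples // 2:].mean(axis=1)
    double.append(mu_b[np.argmax(mu_a)])
    # cross estimator: K disjoint folds, select with fold i, evaluate with fold j != i
    folds = samples.reshape(n_actions, K, n_samples // K).mean(axis=2)
    i, j = rng.choice(K, size=2, replace=False)
    cross.append(folds[np.argmax(folds[:, i]), j])
# typically prints a clearly positive value, then two clearly negative values
print(np.mean(single), np.mean(double), np.mean(cross))
```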
B Convergence in the Limit
In this section, we first present a lemma from [29] establishing the convergence of SARSA, and then use it to prove the convergence of cross Q-learning. Note that this part borrows heavily from the proof of convergence of double Q-learning [30], but covers a more general case.
Lemma 2. [29] Consider a stochastic process (α_t, Δ_t, F_t), t ≥ 0, where α_t, Δ_t, F_t : X → ℝ satisfy the equation
$$\Delta_{t+1}(x) = (1 - \alpha_t(x))\,\Delta_t(x) + \alpha_t(x)\,F_t(x), \quad x \in X,\; t = 0, 1, 2, \cdots$$
Let P_t be a sequence of increasing σ-fields such that α_0 and Δ_0 are P_0-measurable and α_t, Δ_t and F_{t−1} are P_t-measurable, for t = 1, 2, · · ·. Then Δ_t converges to zero with probability one (w.p.1) if the following hold:
1. the set X is finite;
2. 0 ≤ α_t(x) ≤ 1, Σ_t α_t(x) = ∞, and Σ_t α_t²(x) < ∞ w.p.1;
3. ‖E[F_t | P_t]‖ ≤ κ‖Δ_t‖ + c_t, where κ ∈ [0, 1) and c_t converges to zero w.p.1;
4. Var(F_t | P_t) ≤ K(1 + ‖Δ_t‖)², where K is a constant;
in which ‖·‖ denotes the maximum norm.

Theorem 3.
In a given ergodic MDP, the set of K Q-value functions Q^1, Q^2, · · · , Q^K, as updated by cross Q-learning, will converge to the optimal value function Q* with probability 1 if the following conditions hold:
1. The MDP is finite, i.e., |S × A| < ∞.
2. γ ∈ [0, 1).
3. The Q-values are stored in a lookup table.
4. Each state-action pair is visited infinitely often.
5. Each Q^k receives an infinite number of updates, for all k = 1, · · · , K.
6. 0 ≤ α_t(s, a) ≤ 1, Σ_t α_t(s, a) = ∞, and Σ_t α_t²(s, a) < ∞ w.p.1; moreover, α_t(s, a) = 0 for all (s, a) ≠ (s_t, a_t).
7. Var(R(s, a)) < ∞ for all (s, a).

Proof.