Finite-Time Analysis for Double Q-learning
Huaqing Xiong, Lin Zhao, Yingbin Liang, and Wei Zhang∗

The Ohio State University; National University of Singapore; Southern University of Science and Technology; Peng Cheng Laboratory

{xiong.309, liang.889}@osu.edu; [email protected]; [email protected]

Abstract
Although Q-learning is one of the most successful algorithms for finding the best action-value function (and thus the optimal policy) in reinforcement learning, its implementation often suffers from large overestimation of Q-function values incurred by random sampling. The double Q-learning algorithm proposed in Hasselt (2010) overcomes such an overestimation issue by randomly switching the update between two Q-estimators, and has thus gained significant popularity in practice. However, the theoretical understanding of double Q-learning is rather limited. So far only the asymptotic convergence has been established, which does not characterize how fast the algorithm converges. In this paper, we provide the first non-asymptotic (i.e., finite-time) analysis for double Q-learning. We show that both synchronous and asynchronous double Q-learning are guaranteed to converge to an $\epsilon$-accurate neighborhood of the global optimum by taking $\tilde{\Omega}\big(\big(\frac{1}{(1-\gamma)^4\epsilon^2}\big)^{\frac{1}{\omega}} + \big(\frac{1}{1-\gamma}\big)^{\frac{1}{1-\omega}}\big)$ iterations, where $\omega \in (0,1)$ is the decay parameter of the learning rate, and $\gamma$ is the discount factor. Our analysis develops novel techniques to derive finite-time bounds on the difference between two inter-connected stochastic processes, which is new to the literature of stochastic approximation.

Introduction

Q-learning is one of the most successful classes of reinforcement learning (RL) algorithms, which aims at finding the optimal action-value function or Q-function (and thus the associated optimal policy) via off-policy data samples. The Q-learning algorithm was first proposed by Watkins and Dayan (1992), and since then, it has been widely used in various applications including robotics (Tai and Liu, 2016), autonomous driving (Okuyama et al., 2018), and video games (Mnih et al., 2015), to name a few. Theoretical performance of Q-learning has also been intensively explored. The asymptotic convergence has been established in Tsitsiklis (1994); Jaakkola et al.
(1994); Borkar and Meyn (2000); Melo (2001); Lee and He (2019). The non-asymptotic (i.e., finite-time) convergence rate of Q-learning was first obtained in Szepesvári (1998), and has been further studied in Even-Dar and Mansour (2003); Shah and Xie (2018); Wainwright (2019); Beck and Srikant (2012); Chen et al. (2020) for synchronous Q-learning and in Even-Dar and Mansour (2003); Qu and Wierman (2020) for asynchronous Q-learning.

One major weakness of Q-learning arises in practice due to the large overestimation of the action-value function (Hasselt, 2010; Hasselt et al., 2016). Practical implementation of Q-learning involves using the maximum sampled
Q-function to estimate the maximum expected
Q-function (where the expectation is taken over the randomness of the reward). Such an estimation often yields a large positive bias error (Hasselt, 2010), and causes Q-learning to perform rather poorly. To address this issue, double Q-learning was proposed in Hasselt (2010), which keeps two Q-estimators (i.e., estimators for Q-functions), one for estimating the maximum Q-function value and the other one for the update, and continuously changes the roles of the two Q-estimators in a random manner. It was shown in Hasselt (2010) that such an algorithm effectively overcomes the overestimation issue of vanilla Q-learning. In Hasselt et al. (2016), double Q-learning was further demonstrated to substantially improve the performance of Q-learning with deep neural networks (DQNs) for playing Atari 2600 games. It inspired many variants (Zhang et al., 2017; Abed-alguni and Ottom, 2018), received a lot of applications (Zhang et al., 2018a,b), and has become one of the most common techniques for applying Q-learning type algorithms (Hessel et al., 2018).

Despite its tremendous empirical success and popularity in practice, theoretical understanding of double Q-learning is rather limited. Only the asymptotic convergence was provided in Hasselt (2010); Weng et al. (2020c). There has been no non-asymptotic result on how fast double Q-learning converges. From the technical standpoint, such finite-time analysis for double Q-learning does not follow readily from those for vanilla Q-learning, because it involves two randomly updated Q-estimators, and the coupling between these two random paths significantly complicates the analysis. This goes much beyond the existing techniques for analyzing vanilla Q-learning, which handle the random update of a single Q-estimator.

∗Corresponding author.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.
Thus, the goal of this paper is to develop new finite-time analysis techniques that handle the inter-connected two random path updates in double Q-learning and provide the convergence rate.

The main contribution of this paper lies in providing the first finite-time analysis for double Q-learning with both the synchronous and asynchronous implementations.

• We show that synchronous double Q-learning with a learning rate $\alpha_t = 1/t^\omega$ (where $\omega \in (0,1)$) attains an $\epsilon$-accurate global optimum with at least the probability of $1-\delta$ by taking $\Omega\big(\big(\frac{1}{(1-\gamma)^4\epsilon^2} \ln \frac{|S||A|}{(1-\gamma)^4\epsilon^2\delta}\big)^{\frac{1}{\omega}} + \big(\frac{1}{1-\gamma} \ln \frac{1}{(1-\gamma)^2\epsilon}\big)^{\frac{1}{1-\omega}}\big)$ iterations, where $\gamma \in (0,1)$ is the discount factor, and $|S|$ and $|A|$ are the sizes of the state space and action space, respectively.

• We further show that under the same accuracy and high probability requirements, asynchronous double Q-learning takes $\Omega\big(\big(\frac{L}{(1-\gamma)^4\epsilon^2} \ln \frac{|S||A|L}{(1-\gamma)^4\epsilon^2\delta}\big)^{\frac{1}{\omega}} + \big(\frac{L}{1-\gamma} \ln \frac{1}{(1-\gamma)^2\epsilon}\big)^{\frac{1}{1-\omega}}\big)$ iterations, where $L$ is the covering number specified by the exploration strategy.

Our results corroborate the design goal of double Q-learning, which opts for better accuracy by making less aggressive progress during the execution in order to avoid overestimation. Specifically, our results imply that in the high accuracy regime, double Q-learning achieves the same convergence rate as vanilla Q-learning in terms of the order-level dependence on $\epsilon$, which further indicates that the high accuracy design of double Q-learning dominates the less aggressive progress in such a regime.
In the low-accuracy regime, which is not what double Q-learning is designed for, the cautious progress of double Q-learning yields a slightly weaker convergence rate than Q-learning in terms of the dependence on $1-\gamma$.

From the technical standpoint, our proof develops new techniques beyond the existing finite-time analysis of vanilla Q-learning with a single random iteration path. More specifically, we model the double Q-learning algorithm as two alternating stochastic approximation (SA) problems, where one SA captures the error propagation between the two Q-estimators, and the other captures the error dynamics between the Q-estimator and the global optimum. For the first SA, we develop new techniques to provide the finite-time bounds on the two inter-related stochastic iterations of Q-functions. Then we develop new tools to bound the convergence of Bernoulli-controlled stochastic iterations of the second SA conditioned on the first SA.

Related Work

Due to the rapidly growing literature on Q-learning, we review only the theoretical results that are highly relevant to our work.

Q-learning was first proposed in Watkins and Dayan (1992) under finite state-action space. Its asymptotic convergence has been established in Tsitsiklis (1994); Jaakkola et al. (1994); Borkar and Meyn (2000); Melo (2001) through studying various general SA algorithms that include Q-learning as a special case. Along this line, Lee and He (2019) characterized Q-learning as a switched linear system and applied the results of Borkar and Meyn (2000) to show the asymptotic convergence, which was also extended to other Q-learning variants. Another line of research focuses on the finite-time analysis of Q-learning, which can capture the convergence rate. Such non-asymptotic results were first obtained in Szepesvári (1998). A more comprehensive work (Even-Dar and Mansour, 2003) provided finite-time results for both synchronous and asynchronous Q-learning.
Both Szepesvári (1998) and Even-Dar and Mansour (2003) showed that with linear learning rates, the convergence rate of Q-learning can be exponentially slow as a function of $1/(1-\gamma)$. To handle this, the so-called rescaled linear learning rate was introduced to avoid such an exponential dependence in synchronous Q-learning (Wainwright, 2019; Chen et al., 2020) and asynchronous Q-learning (Qu and Wierman, 2020). The finite-time convergence of Q-learning was also analyzed with constant step sizes (Beck and Srikant, 2012; Chen et al., 2020; Li et al., 2020). Moreover, the polynomial learning rate, which is also the focus of this work, was investigated for both synchronous (Even-Dar and Mansour, 2003; Wainwright, 2019) and asynchronous Q-learning (Even-Dar and Mansour, 2003). In addition, it is worth mentioning that Shah and Xie (2018) applied the nearest neighbor approach to handle MDPs with infinite state space.

In contrast to the above extensive studies of vanilla Q-learning, theoretical understanding of double Q-learning is limited. The only theoretical guarantee was on the asymptotic convergence provided by Hasselt (2010); Weng et al. (2020c), which do not provide the non-asymptotic (i.e., finite-time) analysis on how fast double Q-learning converges. This paper provides the first finite-time analysis for double Q-learning.

The vanilla Q-learning algorithm has also been studied for the function approximation case, i.e., the case in which the Q-function is approximated by a class of parameterized functions. In contrast to the tabular case, even with linear function approximation, Q-learning has been shown not to converge in general (Baird, 1995). Strong assumptions are typically imposed to guarantee the convergence of Q-learning with function approximation (Bertsekas and Tsitsiklis, 1996; Zou et al., 2019; Chen et al., 2019; Du et al., 2019; Xu and Gu, 2019; Cai et al., 2019; Weng et al., 2020a,b).
Regarding double Q-learning, it remains open how to design double Q-learning algorithms under function approximation and under what conditions they have theoretically guaranteed convergence.

In this section, we introduce the Q-learning and double Q-learning algorithms.
We consider a $\gamma$-discounted Markov decision process (MDP) with a finite state space $S$ and a finite action space $A$. The transition probability of the MDP is given by $P: S \times A \times S \to [0,1]$; that is, $P(\cdot|s,a)$ denotes the probability distribution of the next state given the current state $s$ and action $a$. We consider a random reward function $R_t$ at time $t$ drawn from a fixed distribution $\phi: S \times A \times S \mapsto \mathbb{R}$, where $\mathbb{E}\{R_t(s,a,s')\} = R_{sa}^{s'}$ and $s'$ denotes the next state starting from $(s,a)$. In addition, we assume $|R_t| \le R_{\max}$. A policy $\pi := \pi(\cdot|s)$ characterizes the conditional probability distribution over the action space $A$ given each state $s \in S$.

The action-value function (i.e., Q-function) $Q^\pi \in \mathbb{R}^{|S| \times |A|}$ for a given policy $\pi$ is defined as

$Q^\pi(s,a) := \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t R_t(s_t, \pi(s_t), s_{t+1}) \,\Big|\, s_0 = s,\, a_0 = a\Big] = \mathbb{E}_{s' \sim P(\cdot|s,a),\, a' \sim \pi(\cdot|s')}\big[R_{sa}^{s'} + \gamma Q^\pi(s',a')\big], \quad (1)$

where $\gamma \in (0,1)$ is the discount factor. Q-learning aims to find the Q-function of an optimal policy $\pi^*$ that maximizes the accumulated reward. The existence of such a $\pi^*$ has been proved in the classical MDP theory (Bertsekas and Tsitsiklis, 1996). The corresponding optimal Q-function, denoted as $Q^*$, is known as the unique fixed point of the Bellman operator $\mathcal{T}$ given by

$\mathcal{T}Q(s,a) = \mathbb{E}_{s' \sim P(\cdot|s,a)}\Big[R_{sa}^{s'} + \gamma \max_{a' \in U(s')} Q(s',a')\Big], \quad (2)$

where $U(s') \subset A$ is the admissible set of actions at state $s'$. It can be shown that the Bellman operator $\mathcal{T}$ is $\gamma$-contractive in the supremum norm $\|Q\| := \max_{s,a} |Q(s,a)|$, i.e., it satisfies

$\|\mathcal{T}Q - \mathcal{T}Q'\| \le \gamma \|Q - Q'\|. \quad (3)$
The goal of Q-learning is to find $Q^*$, which further yields $\pi^*(s) = \arg\max_{a \in U(s)} Q^*(s,a)$. In practice, however, exact evaluation of the Bellman operator (2) is usually infeasible due to the lack of knowledge of the transition kernel of the MDP and the randomness of the reward. Instead, Q-learning draws random samples to estimate the Bellman operator and iteratively learns $Q^*$ as

$Q_{t+1}(s,a) = (1-\alpha_t(s,a)) Q_t(s,a) + \alpha_t(s,a)\Big(R_t(s,a,s') + \gamma \max_{a' \in U(s')} Q_t(s',a')\Big), \quad (4)$

where $R_t$ is the sampled reward, $s'$ is sampled by the transition probability given $(s,a)$, and $\alpha_t(s,a) \in (0,1)$ denotes the learning rate.

Although Q-learning is a commonly used RL algorithm to find the optimal policy, it can suffer from overestimation in practice (Smith and Winkler, 2006). To overcome this issue, Hasselt (2010) proposed double Q-learning, given in Algorithm 1.
Algorithm 1 Synchronous Double Q-learning (Hasselt, 2010)

Input: Initial $Q^A$, $Q^B$.
for $t = 1, 2, \ldots, T$ do
    Assign learning rate $\alpha_t$.
    Randomly choose either UPDATE(A) or UPDATE(B), each with probability 0.5.
    for each $(s,a)$ do
        Observe $s' \sim P(\cdot|s,a)$, and sample $R_t(s,a,s')$.
        if UPDATE(A) then
            Obtain $a^* = \arg\max_{a'} Q_t^A(s',a')$
            $Q_{t+1}^A(s,a) = Q_t^A(s,a) + \alpha_t(s,a)\big(R_t(s,a,s') + \gamma Q_t^B(s',a^*) - Q_t^A(s,a)\big)$
        else if UPDATE(B) then
            Obtain $b^* = \arg\max_{b'} Q_t^B(s',b')$
            $Q_{t+1}^B(s,a) = Q_t^B(s,a) + \alpha_t(s,a)\big(R_t(s,a,s') + \gamma Q_t^A(s',b^*) - Q_t^B(s,a)\big)$
        end if
    end for
end for
Output: $Q_T^A$ (or $Q_T^B$).

Double Q-learning maintains two Q-estimators (i.e., Q-tables): $Q^A$ and $Q^B$. At each iteration of Algorithm 1, one Q-table is randomly chosen to be updated. This chosen Q-table generates a greedy optimal action, and the other Q-table is used to estimate the corresponding Bellman operator for updating the chosen table. Specifically, if $Q^A$ is chosen to be updated, we use $Q^A$ to obtain the optimal action $a^*$ and then estimate the corresponding Bellman operator using $Q^B$. As shown in Hasselt (2010), $\mathbb{E}[Q^B(s',a^*)]$ is likely smaller than $\max_a \mathbb{E}[Q^A(s',a)]$, where the expectation is taken over the randomness of the reward for the same state-action pair. In this way, the two-estimator framework of double Q-learning can effectively reduce the overestimation.
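The per-iteration update of Algorithm 1 can be sketched in a few lines of Python. This is a minimal tabular illustration, not the authors' implementation; the toy one-state MDP at the end (actions with deterministic rewards 1 and 0, $\gamma = 0.5$) has $Q^*(0,0) = 1/(1-\gamma) = 2$ and $Q^*(0,1) = \gamma Q^*(0,0) = 1$.

```python
import numpy as np

def sync_double_q_step(QA, QB, alpha, gamma, sample_next, sample_reward, rng):
    """One iteration of synchronous double Q-learning (Algorithm 1): every
    (s, a) pair of a randomly chosen table is updated, using the greedy action
    of that table but the value estimate of the other table."""
    if rng.random() < 0.5:
        Q_upd, Q_other = QA, QB                  # UPDATE(A)
    else:
        Q_upd, Q_other = QB, QA                  # UPDATE(B)
    n_states, n_actions = Q_upd.shape
    for s in range(n_states):
        for a in range(n_actions):
            s2 = sample_next(s, a, rng)          # s' ~ P(.|s, a)
            r = sample_reward(s, a, s2, rng)     # R_t(s, a, s')
            a_star = int(np.argmax(Q_upd[s2]))   # greedy action from chosen table
            Q_upd[s, a] += alpha * (r + gamma * Q_other[s2, a_star] - Q_upd[s, a])

# Toy check: one state, two actions, deterministic rewards (1, 0), gamma = 0.5.
rng = np.random.default_rng(0)
gamma = 0.5
QA, QB = np.zeros((1, 2)), np.zeros((1, 2))
for t in range(1, 5001):
    sync_double_q_step(QA, QB, 1.0 / t ** 0.8, gamma,
                       lambda s, a, g: 0,
                       lambda s, a, s2, g: 1.0 if a == 0 else 0.0, rng)
```

With the polynomial learning rate $\alpha_t = 1/t^{0.8}$, both tables approach the optimal values $(2, 1)$ on this toy problem.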
In this paper, we study the finite-time convergence rate of double Q-learning in two different settings: synchronous and asynchronous implementations. For synchronous double Q-learning (as shown in Algorithm 1), all the state-action pairs of the chosen Q-estimator are visited simultaneously at each iteration. For the asynchronous case, only one state-action pair is updated in the chosen Q-table. Specifically, in the latter case, we sample a trajectory $\{(s_t, a_t, R_t, i_t)\}_{t=0}^{\infty}$ under a certain exploration strategy, where $i_t \in \{A, B\}$ denotes the index of the chosen Q-table at time $t$. Then the two Q-tables are updated based on the following rule:

$Q_{t+1}^i(s,a) = Q_t^i(s,a)$, if $(s,a) \neq (s_t,a_t)$ or $i \neq i_t$; otherwise,

$Q_{t+1}^i(s,a) = (1-\alpha_t(s,a)) Q_t^i(s,a) + \alpha_t(s,a)\big(R_t(s,a,s') + \gamma Q_t^{i^c}\big(s', \arg\max_{a' \in U(s')} Q_t^i(s',a')\big)\big)$,

where $i^c = \{A,B\} \setminus \{i\}$.

We next provide the boundedness property of the Q-estimators and the errors in the following lemma, which is typically necessary for the finite-time analysis.

Lemma 1.
For either synchronous or asynchronous double Q-learning, let $Q_t^i(s,a)$ be the value of either Q-table corresponding to a state-action pair $(s,a)$ at iteration $t$. Suppose $\|Q_0^i\| \le \frac{R_{\max}}{1-\gamma}$. Then we have $\|Q_t^i\| \le \frac{R_{\max}}{1-\gamma}$ and $\|Q_t^i - Q^*\| \le V_{\max}$ for all $t \ge 0$, where $V_{\max} := \frac{2R_{\max}}{1-\gamma}$.

Lemma 1 can be proved by induction arguments using the triangle inequality and the uniform boundedness of the reward function, as shown in Appendix A.
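Lemma 1 is easy to check empirically. The sketch below (an illustration on a randomly generated MDP with assumed parameters, not part of the paper's proof) runs asynchronous-style double Q-updates with $|R_t| \le R_{\max}$ and tables initialized at zero, and verifies that both tables stay within $R_{\max}/(1-\gamma)$ along the whole run.

```python
import numpy as np

# Empirical sanity check of the first bound in Lemma 1 on a random MDP:
# with |R_t| <= R_max and Q-tables initialized at 0, both estimators remain
# bounded by R_max / (1 - gamma) at every iteration.
rng = np.random.default_rng(0)
nS, nA, gamma, R_max = 4, 3, 0.9, 1.0
P = rng.dirichlet(np.ones(nS), size=(nS, nA))        # transition kernel P(.|s,a)
QA, QB = np.zeros((nS, nA)), np.zeros((nS, nA))
bound = R_max / (1 - gamma)

for t in range(1, 20001):
    alpha = 1.0 / t ** 0.7
    s, a = int(rng.integers(nS)), int(rng.integers(nA))
    s2 = int(rng.choice(nS, p=P[s, a]))
    r = rng.uniform(-R_max, R_max)                   # random bounded reward
    if rng.random() < 0.5:                           # UPDATE(A)
        a_star = int(np.argmax(QA[s2]))
        QA[s, a] += alpha * (r + gamma * QB[s2, a_star] - QA[s, a])
    else:                                            # UPDATE(B)
        b_star = int(np.argmax(QB[s2]))
        QB[s, a] += alpha * (r + gamma * QA[s2, b_star] - QB[s, a])
    assert max(np.abs(QA).max(), np.abs(QB).max()) <= bound + 1e-9
```

The in-loop assertion mirrors the induction step of the proof: if both tables are within the bound, a convex combination of a bounded value and $R_{\max} + \gamma R_{\max}/(1-\gamma) = R_{\max}/(1-\gamma)$ stays within it.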
We present our finite-time analysis for synchronous and asynchronous double Q-learning in this section, followed by a sketch of the proof for the synchronous case, which captures our main techniques. The detailed proofs of all the results are provided in the Supplementary Materials.
Since the update of the two Q-estimators is symmetric, we can characterize the convergence rate of either Q-estimator, e.g., $Q^A$, to the global optimum $Q^*$. To this end, we first derive two important properties of double Q-learning that are crucial to our finite-time convergence analysis.

The first property captures the stochastic error $\|Q_t^B - Q_t^A\|$ between the two Q-estimators. Since double Q-learning updates alternatingly between these two estimators, such an error process must decay to zero in order for double Q-learning to converge. Furthermore, how fast such an error converges determines the overall convergence rate of double Q-learning. The following proposition (which is an informal restatement of Proposition 1 in Appendix B.1) shows that such an error process can be block-wisely bounded by an exponentially decreasing sequence $G_q = (1-\xi)^q V_{\max}$ for $q = 0, 1, 2, \ldots$ and some $\xi \in (0,1)$. Conceptually, as illustrated in Figure 1, such an error process is upper-bounded by the blue-colored piece-wise linear curve.

Proposition 1. (Informal) Consider synchronous double Q-learning under a polynomial learning rate $\alpha_t = 1/t^\omega$ with $\omega \in (0,1)$. We divide the time horizon into blocks $[\hat\tau_q, \hat\tau_{q+1})$ for $q \ge 0$, where $\hat\tau_0 = 0$ and $\hat\tau_{q+1} = \hat\tau_q + c\hat\tau_q^\omega$ with some $c > 0$. Fix $\hat\epsilon > 0$. Then for any $n$ such that $G_n \ge \hat\epsilon$ and under certain conditions on $\hat\tau_1$ (see Appendix B.1), we have

$\mathbb{P}\big[\forall q \in [0,n], \forall t \in [\hat\tau_{q+1}, \hat\tau_{q+2}), \|Q_t^B - Q_t^A\| \le G_{q+1}\big] \ge 1 - c_1 n \exp\Big(-\frac{c_2 \hat\tau_1^{\omega} \hat\epsilon^2}{V^2}\Big),$

where the positive constants $c_1$ and $c_2$ are specified in Appendix B.1.

Proposition 1 implies that the two Q-estimators approach each other asymptotically, but does not necessarily imply that they converge to the optimal action-value function $Q^*$.
Then the next proposition (which is an informal restatement of Proposition 2 in Appendix B.2) shows that as long as the high probability event in Proposition 1 holds, the error process $\|Q_t^A - Q^*\|$ between either Q-estimator (say $Q^A$) and the optimal Q-function can be block-wisely bounded by an exponentially decreasing sequence $D_k = (1-\beta)^k V_{\max}\sigma$ for $k = 0, 1, 2, \ldots$ and $\beta \in (0,1)$. Conceptually, as illustrated in Figure 1, such an error process is upper-bounded by the yellow-colored piece-wise linear curve.

Figure 1: Illustration of the sequence $\{G_q\}_{q \ge 0}$ as a block-wise upper bound on $\|Q_t^B - Q_t^A\|$, and the sequence $\{D_k\}_{k \ge 0}$ as a block-wise upper bound on $\|Q_t^A - Q^*\|$ conditioned on the first upper bound event.

Proposition 2. (Informal) Consider synchronous double Q-learning using a polynomial learning rate $\alpha_t = 1/t^\omega$ with $\omega \in (0,1)$. We divide the time horizon into blocks $[\tau_k, \tau_{k+1})$ for $k \ge 0$, where $\tau_0 = 0$ and $\tau_{k+1} = \tau_k + c\tau_k^\omega$ with some $c > 0$. Fix $\tilde\epsilon > 0$. Then for any $m$ such that $D_m \ge \tilde\epsilon$ and under certain conditions on $\tau_1$ (see Appendix B.2), we have

$\mathbb{P}\big[\forall k \in [0,m], \forall t \in [\tau_{k+1}, \tau_{k+2}), \|Q_t^A - Q^*\| \le D_{k+1} \,\big|\, E, F\big] \ge 1 - c_3 m \exp\Big(-\frac{c_4 \tau_1^{\omega} \tilde\epsilon^2}{V^2}\Big),$

where $E$ and $F$ denote certain events defined in (12) and (13) in Appendix B.2, and the positive constants $c_3$ and $c_4$ are specified in Appendix B.2.

As illustrated in Figure 1, the two block sequences $\{\hat\tau_q\}_{q \ge 0}$ in Proposition 1 and $\{\tau_k\}_{k \ge 0}$ in Proposition 2 can be chosen to coincide with each other. Then combining the above two properties followed by further mathematical arguments yields the following main theorem, which characterizes the convergence rate of double Q-learning.
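The block construction shared by Propositions 1 and 2 can be made concrete as follows. The constants $\xi$, $\beta$, $\sigma$, and $c$ below are arbitrary illustrative choices, not the values derived in Appendix B.

```python
# Illustrative construction of the coinciding block boundaries and the two
# exponentially decreasing envelopes from Propositions 1 and 2.
# xi, beta, sigma, c are placeholders, not the constants of the paper.
def blocks(V_max=10.0, xi=0.1, beta=0.05, sigma=0.5, c=2.0, omega=0.8, n=6):
    tau = [1.0]
    for q in range(n):
        tau.append(tau[-1] + c * tau[-1] ** omega)        # tau_{q+1} = tau_q + c * tau_q^omega
    G = [(1 - xi) ** q * V_max for q in range(n)]          # envelope for ||Q^B_t - Q^A_t||
    D = [(1 - beta) ** k * V_max * sigma for k in range(n)]  # envelope for ||Q^A_t - Q*||
    return tau, G, D

tau, G, D = blocks()
assert all(g0 > g1 for g0, g1 in zip(G, G[1:]))            # both envelopes strictly decrease
assert all(d0 > d1 for d0, d1 in zip(D, D[1:]))
assert all(t1 > t0 for t0, t1 in zip(tau, tau[1:]))        # block lengths grow like tau_q^omega
```

Note that the blocks lengthen polynomially while the envelopes shrink geometrically, which is what drives the $(\cdot)^{1/\omega}$ and $(\cdot)^{1/(1-\omega)}$ terms in the final iteration complexity.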
We will provide a proof sketch for Theorem 1 in Section 3.3, which explains the main steps to obtain the supporting properties in Propositions 1 and 2 and how they further yield the main theorem.

Theorem 1.
Fix $\epsilon > 0$ and $\gamma \in (1/3, 1)$. Consider synchronous double Q-learning using a polynomial learning rate $\alpha_t = 1/t^\omega$ with $\omega \in (0,1)$. Let $Q_T^A(s,a)$ be the value of $Q^A$ for a state-action pair $(s,a)$ at time $T$. Then we have $\mathbb{P}(\|Q_T^A - Q^*\| \le \epsilon) \ge 1-\delta$, given that

$T = \Omega\Bigg(\bigg(\frac{V^2}{(1-\gamma)^2\epsilon^2} \ln \frac{|S||A| V^2}{(1-\gamma)^2\epsilon^2\delta}\bigg)^{\frac{1}{\omega}} + \bigg(\frac{1}{1-\gamma} \ln \frac{V_{\max}}{(1-\gamma)\epsilon}\bigg)^{\frac{1}{1-\omega}}\Bigg), \quad (5)$

where $V_{\max} = \frac{2R_{\max}}{1-\gamma}$.

Theorem 1 provides the finite-time convergence guarantee in a high probability sense for synchronous double Q-learning. Specifically, double Q-learning attains an $\epsilon$-accurate optimal Q-function with high probability with at most $\Omega\big(\big(\frac{1}{(1-\gamma)^4\epsilon^2} \ln \frac{1}{(1-\gamma)^4\epsilon^2}\big)^{\frac{1}{\omega}} + \big(\frac{1}{1-\gamma} \ln \frac{1}{(1-\gamma)\epsilon}\big)^{\frac{1}{1-\omega}}\big)$ iterations. Such a result can be further understood by considering the following two regimes. In the high accuracy regime, in which $\epsilon \ll 1-\gamma$, the dependence on $\epsilon$ dominates, and the time complexity is given by $\Omega\big(\big(\frac{1}{\epsilon^2} \ln \frac{1}{\epsilon^2}\big)^{\frac{1}{\omega}} + \big(\ln \frac{1}{\epsilon}\big)^{\frac{1}{1-\omega}}\big)$, which is optimized as $\omega$ approaches 1. In the low accuracy regime, in which $\epsilon \gg 1-\gamma$, the dependence on $1-\gamma$ dominates, and the time complexity can be optimized at $\omega = 4/5$, which yields $T = \tilde\Omega\big(\frac{1}{(1-\gamma)^5\epsilon^{5/2}} + \frac{1}{(1-\gamma)^5}\big) = \tilde\Omega\big(\frac{1}{(1-\gamma)^5\epsilon^{5/2}}\big)$.

Furthermore, Theorem 1 corroborates the design effectiveness of double Q-learning, which overcomes the overestimation issue and hence achieves better accuracy by making less aggressive progress in each update.
Specifically, comparison of Theorem 1 with the time complexity bounds of vanilla synchronous Q-learning under a polynomial learning rate in Even-Dar and Mansour (2003) and Wainwright (2019) indicates that in the high accuracy regime, double Q-learning achieves the same convergence rate as vanilla Q-learning in terms of the order-level dependence on $\epsilon$. Clearly, the design of double Q-learning for high accuracy dominates the performance. In the low-accuracy regime (which is not what double Q-learning is designed for), double Q-learning achieves a slightly weaker convergence rate than vanilla Q-learning in Even-Dar and Mansour (2003); Wainwright (2019) in terms of the dependence on $1-\gamma$, because its nature of less aggressive progress dominates the performance.

In this subsection, we study asynchronous double Q-learning and provide its finite-time convergence result.

Differently from synchronous double Q-learning, in which all state-action pairs are visited for each update of the chosen Q-estimator, asynchronous double Q-learning visits only one state-action pair for each update of the chosen Q-estimator. Therefore, we make the following standard assumption on the exploration strategy (Even-Dar and Mansour, 2003):
Assumption 1. (Covering number) There exists a covering number $L$ such that in any $L$ consecutive updates of either the $Q^A$ or the $Q^B$ estimator, all the state-action pairs of the chosen Q-estimator are visited at least once.

The above conditions on the exploration are usually necessary for the finite-time analysis of asynchronous Q-learning. The same assumption has been taken in Even-Dar and Mansour (2003). Qu and Wierman (2020) proposed a mixing time condition which is in the same spirit.

Assumption 1 essentially requires the sampling strategy to have good visitation coverage over all state-action pairs. Specifically, Assumption 1 guarantees that any $L$ consecutive updates of $Q^A$ visit each state-action pair of $Q^A$ at least once, and the same holds for $Q^B$. Since $2L$ iterations of asynchronous double Q-learning must make at least $L$ updates for either $Q^A$ or $Q^B$, Assumption 1 further implies that any state-action pair $(s,a)$ must be visited at least once during $2L$ iterations of the algorithm. In fact, our analysis allows certain relaxation of Assumption 1 by only requiring each state-action pair to be visited during an interval with a certain probability. In such a case, we can also derive a finite-time bound by additionally dealing with a conditional probability.

Next, we provide the finite-time result for asynchronous double Q-learning in the following theorem.

Theorem 2.
Fix $\epsilon > 0$ and $\gamma \in (1/3, 1)$. Consider asynchronous double Q-learning under a polynomial learning rate $\alpha_t = 1/t^\omega$ with $\omega \in (0,1)$. Suppose Assumption 1 holds. Let $Q_T^A(s,a)$ be the value of $Q^A$ for a state-action pair $(s,a)$ at time $T$. Then we have $\mathbb{P}(\|Q_T^A - Q^*\| \le \epsilon) \ge 1-\delta$, given that

$T = \Omega\Bigg(\bigg(\frac{L V^2}{(1-\gamma)^2\epsilon^2} \ln \frac{|S||A| L V^2}{(1-\gamma)^2\epsilon^2\delta}\bigg)^{\frac{1}{\omega}} + \bigg(\frac{L}{1-\gamma} \ln \frac{\gamma V_{\max}}{(1-\gamma)\epsilon}\bigg)^{\frac{1}{1-\omega}}\Bigg). \quad (6)$

Comparison of Theorems 1 and 2 indicates that the finite-time result of asynchronous double Q-learning matches that of synchronous double Q-learning in the order dependence on $1-\gamma$ and $\epsilon$. The difference lies in the extra dependence on the covering number $L$ in Theorem 2. Since synchronous double Q-learning visits all state-action pairs (i.e., takes $|S||A|$ sample updates) at each iteration, whereas asynchronous double Q-learning visits only one state-action pair (i.e., takes only one sample update) at each iteration, a more reasonable comparison between the two should be in terms of the overall sample complexity. In this sense, the synchronous and asynchronous double Q-learning algorithms have sample complexities of $|S||A| T$ (where $T$ is given in (5)) and $T$ (where $T$ is given in (6)), respectively. Since in general $L \gg |S||A|$, synchronous double Q-learning is more efficient than asynchronous double Q-learning in terms of the overall sample complexity.
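To make the asynchronous update rule and the covering number of Assumption 1 concrete, the sketch below (a toy experiment with assumed parameters, not the paper's setup) runs asynchronous double Q-learning along a single trajectory under a uniformly random behavior policy and records an empirical covering number: the largest gap, counted in per-table updates, between consecutive visits to any state-action pair of a table.

```python
import numpy as np

# Asynchronous double Q-learning on a small randomly generated MDP, with an
# empirical analogue L_hat of the covering number L in Assumption 1.
rng = np.random.default_rng(1)
nS, nA, gamma = 3, 2, 0.8
P = rng.dirichlet(np.ones(nS), size=(nS, nA))     # transition kernel P(.|s, a)
R_mean = rng.uniform(0.0, 1.0, size=(nS, nA))     # mean rewards (illustrative)
Q = {"A": np.zeros((nS, nA)), "B": np.zeros((nS, nA))}
last_visit = {"A": {}, "B": {}}                   # per-table visit bookkeeping
upd_count = {"A": 0, "B": 0}
L_hat, s = 0, 0
for t in range(1, 50001):
    a = int(rng.integers(nA))                     # uniform exploration
    s2 = int(rng.choice(nS, p=P[s, a]))
    r = R_mean[s, a] + rng.normal(scale=0.1)      # noisy reward sample
    i = "A" if rng.random() < 0.5 else "B"        # table i_t chosen at time t
    j = "B" if i == "A" else "A"
    a_star = int(np.argmax(Q[i][s2]))             # greedy action from chosen table
    Q[i][s, a] += (1.0 / t ** 0.7) * (r + gamma * Q[j][s2, a_star] - Q[i][s, a])
    upd_count[i] += 1                             # track covering gaps per table
    L_hat = max(L_hat, upd_count[i] - last_visit[i].get((s, a), 0))
    last_visit[i][(s, a)] = upd_count[i]
    s = s2
print("empirical covering number estimate:", L_hat)
```

Even on this tiny MDP the measured gap exceeds $|S||A|$, illustrating why $L \gg |S||A|$ in general and hence why the synchronous variant wins the sample-complexity comparison above.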
To this end, our proof includes: (a) Part I, which analyzes the stochastic error propagation between the two Q-estimators $\|Q_t^B - Q_t^A\|$; (b) Part II, which analyzes the error dynamics between one Q-estimator and the optimum $\|Q_t^A - Q^*\|$ conditioned on the error event in Part I; and (c) Part III, which bounds the unconditional error $\|Q_t^A - Q^*\|$. We describe each of the three parts in more detail below.

Part I: Bounding $\|Q_t^B - Q_t^A\|$ (see Proposition 1). The main idea is to upper bound $\|Q_t^B - Q_t^A\|$ block-wisely by a decreasing sequence $\{G_q\}_{q \ge 0}$ with high probability, where each block $q$ (with $q \ge 0$) is defined by $t \in [\hat\tau_q, \hat\tau_{q+1})$. The proof consists of the following four steps.

Step 1 (see Lemma 2): We characterize the dynamics of $u_t^{BA}(s,a) := Q_t^B(s,a) - Q_t^A(s,a)$ as an SA algorithm as follows:

$u_{t+1}^{BA}(s,a) = (1-\alpha_t) u_t^{BA}(s,a) + \alpha_t (h_t(s,a) + z_t(s,a)),$

where $h_t$ is a contractive mapping of $u_t^{BA}$, and $z_t$ is a martingale difference sequence.

Step 2 (see Lemma 3): We derive lower and upper bounds on $u_t^{BA}$ via two sequences $X_{t;\hat\tau_q}$ and $Z_{t;\hat\tau_q}$ as follows:

$-X_{t;\hat\tau_q}(s,a) + Z_{t;\hat\tau_q}(s,a) \le u_t^{BA}(s,a) \le X_{t;\hat\tau_q}(s,a) + Z_{t;\hat\tau_q}(s,a),$

for any $t \ge \hat\tau_q$, state-action pair $(s,a) \in S \times A$, and $q \ge 0$, where $X_{t;\hat\tau_q}$ is deterministic and driven by $G_q$, and $Z_{t;\hat\tau_q}$ is stochastic and driven by the martingale difference sequence $z_t$.

Step 3 (see Lemma 5 and Lemma 6): We block-wisely bound $u_t^{BA}(s,a)$ using induction arguments; namely, we prove that $\|u_t^{BA}\| \le G_q$ for $t \in [\hat\tau_q, \hat\tau_{q+1})$ holds for all $q \ge 0$. By induction, we first observe that for $q = 0$, $\|u_t^{BA}\| \le G_0$ holds.
Given any state-action pair $(s,a)$, we assume that $|u_t^{BA}(s,a)| \le G_q$ holds for $t \in [\hat\tau_q, \hat\tau_{q+1})$. Then we show that $|u_t^{BA}(s,a)| \le G_{q+1}$ holds for $t \in [\hat\tau_{q+1}, \hat\tau_{q+2})$, which follows by bounding $X_{t;\hat\tau_q}$ and $Z_{t;\hat\tau_q}$ separately in Lemma 5 and Lemma 6, respectively.

Step 4 (see Appendix B.1.4): We apply the union bound (Lemma 8) to obtain the block-wise bound for all state-action pairs and all blocks.
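The contraction-plus-martingale structure in Steps 1-3 can be visualized with a one-dimensional toy analogue. Here $h(u) = \gamma' u$ with $\gamma' < 1$ stands in for the contractive mapping $h_t$, and bounded uniform noise stands in for the martingale differences $z_t$; the constants are illustrative and do not reproduce those of the proof.

```python
import numpy as np

# Toy 1-D analogue of the Step 1 recursion:
#   u_{t+1} = (1 - alpha_t) u_t + alpha_t (h(u_t) + z_t),
# with h(u) = gamma_c * u a contraction toward 0 and z_t i.i.d. bounded noise.
rng = np.random.default_rng(2)
gamma_c, omega = 0.6, 0.8
u = 10.0                                  # initial gap, playing the role of V_max
traj = [abs(u)]
for t in range(1, 50001):
    alpha = 1.0 / t ** omega              # polynomial learning rate
    z = rng.uniform(-1.0, 1.0)            # stand-in for the martingale difference
    u = (1 - alpha) * u + alpha * (gamma_c * u + z)
    traj.append(abs(u))
# |u_t| decays from its initial value toward a small noise floor; a block-wise
# exponentially decreasing envelope like G_q can be laid over such a path.
```

The deterministic part of the path plays the role of $X_{t;\hat\tau_q}$ and the accumulated noise that of $Z_{t;\hat\tau_q}$ in Step 2.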
Part II: Conditionally bounding $\|Q_t^A - Q^*\|$ (see Proposition 2). We upper bound $\|Q_t^A - Q^*\|$ block-wisely by a decreasing sequence $\{D_k\}_{k \ge 0}$, conditioned on the following two events:

Event $E$: $\|u_t^{BA}\|$ is upper bounded properly (see (12) in Appendix B.2), and

Event $F$: there are sufficient updates of $Q_t^A$ in each block (see (13) in Appendix B.2).

The proof of Proposition 2 consists of the following four steps.

Step 1 (see Lemma 10): We design a special relationship (illustrated in Figure 1) between the block-wise bounds $\{G_q\}_{q \ge 0}$ and $\{D_k\}_{k \ge 0}$ and their block separations.

Step 2 (see Lemma 11): We characterize the dynamics of the iteration residual $r_t(s,a) := Q_t^A(s,a) - Q^*(s,a)$ as an SA algorithm as follows: when $Q^A$ is chosen to be updated at iteration $t$,

$r_{t+1}(s,a) = (1-\alpha_t) r_t(s,a) + \alpha_t (\mathcal{T}Q_t^A(s,a) - Q^*(s,a)) + \alpha_t w_t(s,a) + \alpha_t \gamma u_t^{BA}(s', a^*),$

where $w_t(s,a)$ is the error between the Bellman operator and the sample-based empirical estimator, and is thus a martingale difference sequence, and $u_t^{BA}$ has been defined in Part I.
Step 3 (see Lemma 12): We provide upper and lower bounds on $r_t$ via two sequences $Y_{t;\tau_k}$ and $W_{t;\tau_k}$ as follows:

$-Y_{t;\tau_k}(s,a) + W_{t;\tau_k}(s,a) \le r_t(s,a) \le Y_{t;\tau_k}(s,a) + W_{t;\tau_k}(s,a),$

for all $t \ge \tau_k$, all state-action pairs $(s,a) \in S \times A$, and all $k \ge 0$, where $Y_{t;\tau_k}$ is deterministic and driven by $D_k$, and $W_{t;\tau_k}$ is stochastic and driven by the martingale difference sequence $w_t$. In particular, if $Q_t^A$ is not updated at some iteration, then the sequences $Y_{t;\tau_k}$ and $W_{t;\tau_k}$ assume the same values as at the previous iteration.

Step 4 (see Lemma 13, Lemma 14, and Appendix B.2.4): Similarly to Steps 3 and 4 in Part I, we conditionally bound $\|r_t\| \le D_k$ for $t \in [\tau_k, \tau_{k+1})$ and $k \ge 0$ via bounding $Y_{t;\tau_k}$ and $W_{t;\tau_k}$ and further taking the union bound.

Part III: Bounding $\|Q_t^A - Q^*\|$ (see Appendix B.3). We combine the results in the first two parts, and provide a high probability bound on $\|r_t\|$ with further probabilistic arguments, which exploit the high probability bounds on $\mathbb{P}(E)$ in Proposition 1 and $\mathbb{P}(F)$ in Lemma 15.

Conclusion
In this paper, we provide the first finite-time results for double Q-learning, which characterize how fast double Q-learning converges under both synchronous and asynchronous implementations. For the synchronous case, we show that it achieves an $\epsilon$-accurate optimal Q-function with at least the probability of $1-\delta$ by taking $\Omega\big(\big(\frac{1}{(1-\gamma)^4\epsilon^2} \ln \frac{|S||A|}{(1-\gamma)^4\epsilon^2\delta}\big)^{\frac{1}{\omega}} + \big(\frac{1}{1-\gamma} \ln \frac{1}{(1-\gamma)^2\epsilon}\big)^{\frac{1}{1-\omega}}\big)$ iterations. A similar scaling order on $1-\gamma$ and $\epsilon$ also applies for asynchronous double Q-learning, but with extra dependence on the covering number. We develop new techniques to bound the error between two correlated stochastic processes, which can be of independent interest.

Acknowledgements
The work was supported in part by the U.S. National Science Foundation under Grant CCF-1761506 and by the startup fund of the Southern University of Science and Technology (SUSTech), China.
Broader Impact
Reinforcement learning has achieved great success in areas such as robotics and game playing, and has thus attracted broad interest and a growing range of potential real-world applications. Double Q-learning is a commonly used technique in deep reinforcement learning to improve the stability and speed of deep Q-learning. In this paper, we provide a fundamental analysis of the convergence rate of double Q-learning, which theoretically justifies its empirical success in practice. Such a theory also offers practitioners desirable performance guarantees for further developing this technique into various transferable technologies.
References
Abed-alguni, B. H. and Ottom, M. A. (2018). Double delayed Q-learning. International Journal of Artificial Intelligence, 16(2):41–59.
Azuma, K. (1967). Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, Second Series, 19(3):357–367.
Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30–37. Elsevier.
Beck, C. L. and Srikant, R. (2012). Error bounds for constant step-size Q-learning. Systems & Control Letters, 61(12):1203–1208.
Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming, volume 5. Athena Scientific.
Borkar, V. S. and Meyn, S. P. (2000). The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469.
Cai, Q., Yang, Z., Lee, J. D., and Wang, Z. (2019). Neural temporal-difference learning converges to global optima. In Advances in Neural Information Processing Systems (NeurIPS), pages 11312–11322.
Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K. (2020). Finite-sample analysis of stochastic approximation using smooth convex envelopes. arXiv preprint arXiv:2002.00874.
Chen, Z., Zhang, S., Doan, T. T., Maguluri, S. T., and Clarke, J.-P. (2019). Finite-time analysis of Q-learning with linear function approximation. arXiv preprint arXiv:1905.11425.
Du, S. S., Luo, Y., Wang, R., and Zhang, H. (2019). Provably efficient Q-learning with function approximation via distribution shift error checking oracle. In Advances in Neural Information Processing Systems (NeurIPS), pages 8058–8068.
Even-Dar, E. and Mansour, Y. (2003). Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1–25.
Hasselt, H. V. (2010). Double Q-learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 2613–2621.
Hasselt, H. v., Guez, A., and Silver, D. (2016). Deep reinforcement learning with double Q-learning. In Proc. AAAI Conference on Artificial Intelligence (AAAI).
Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. (2018). Rainbow: Combining improvements in deep reinforcement learning. In Proc. AAAI Conference on Artificial Intelligence (AAAI).
Jaakkola, T., Jordan, M. I., and Singh, S. P. (1994). Convergence of stochastic iterative dynamic programming algorithms. In Advances in Neural Information Processing Systems (NeurIPS), pages 703–710.
Lee, D. and He, N. (2019). A unified switching system perspective and ODE analysis of Q-learning algorithms. arXiv preprint arXiv:1912.02270.
Li, G., Wei, Y., Chi, Y., Gu, Y., and Chen, Y. (2020). Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. arXiv preprint arXiv:2006.03041.
Melo, F. S. (2001). Convergence of Q-learning: A simple proof. Institute of Systems and Robotics, Tech. Rep., pages 1–4.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.
Okuyama, T., Gonsalves, T., and Upadhay, J. (2018). Autonomous driving system based on deep Q-learning. In Proc. IEEE International Conference on Intelligent Autonomous Systems (ICoIAS), pages 201–205.
Qu, G. and Wierman, A. (2020). Finite-time analysis of asynchronous stochastic approximation and Q-learning. arXiv preprint arXiv:2002.00260.
Shah, D. and Xie, Q. (2018). Q-learning with nearest neighbors. In Advances in Neural Information Processing Systems (NeurIPS), pages 3111–3121.
Smith, J. E. and Winkler, R. L. (2006). The optimizer's curse: Skepticism and postdecision surprise in decision analysis. Management Science, 52(3):311–322.
Szepesvári, C. (1998). The asymptotic convergence-rate of Q-learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 1064–1070.
Tai, L. and Liu, M. (2016). A robot exploration strategy based on Q-learning network. In Proc. IEEE International Conference on Real-time Computing and Robotics (RCAR), pages 57–62.
Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202.
Wainwright, M. J. (2019). Stochastic approximation with cone-contractive operators: Sharp $\ell_\infty$-bounds for Q-learning. arXiv preprint arXiv:1905.06265.
Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4):279–292.
Weng, B., Xiong, H., Liang, Y., and Zhang, W. (2020a). Analysis of Q-learning with adaptation and momentum restart for gradient descent. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20), pages 3051–3057.
Weng, B., Xiong, H., Zhao, L., Liang, Y., and Zhang, W. (2020b). Momentum Q-learning with finite-sample convergence guarantee. arXiv preprint arXiv:2007.15418.
Weng, W., Gupta, H., He, N., Ying, L., and Srikant, R. (2020c). Provably-efficient double Q-learning. arXiv preprint arXiv:2007.05034.
Xu, P. and Gu, Q. (2019). A finite-time analysis of Q-learning with neural network function approximation. arXiv preprint arXiv:1912.04511.
Zhang, Q., Lin, M., Yang, L. T., Chen, Z., Khan, S. U., and Li, P. (2018a). A double deep Q-learning model for energy-efficient edge scheduling. IEEE Transactions on Services Computing, 12(5):739–749.
Zhang, Y., Sun, P., Yin, Y., Lin, L., and Wang, X. (2018b). Human-like autonomous vehicle speed control by deep reinforcement learning with double Q-learning. In Proc. IEEE Intelligent Vehicles Symposium (IV), pages 1251–1256.
Zhang, Z., Pan, Z., and Kochenderfer, M. J. (2017). Weighted double Q-learning. In International Joint Conference on Artificial Intelligence, pages 3455–3461.
Zou, S., Xu, T., and Liang, Y. (2019). Finite-sample analysis for SARSA with linear function approximation. In Advances in Neural Information Processing Systems (NeurIPS), pages 8665–8675.

Supplementary Materials
A Proof of Lemma 1
We prove Lemma 1 by induction. First, it is easy to guarantee that the initial case is satisfied, i.e., $\|Q^A_0\| \le \frac{R_{\max}}{1-\gamma} = V_{\max}$ and $\|Q^B_0\| \le V_{\max}$. (In practice we usually initialize the algorithm as $Q^A_0 = Q^B_0 = 0$.) Next, we assume that $\|Q^A_t\| \le V_{\max}$ and $\|Q^B_t\| \le V_{\max}$. It remains to show that such conditions still hold for $t+1$. Observe that
$$\big|Q^A_{t+1}(s,a)\big| = \Big|(1-\alpha_t)Q^A_t(s,a) + \alpha_t\Big(R_t + \gamma Q^B_t\big(s', \arg\max_{a'\in U(s')} Q^A_t(s',a')\big)\Big)\Big| \le (1-\alpha_t)\|Q^A_t\| + \alpha_t|R_t| + \alpha_t\gamma\|Q^B_t\| \le (1-\alpha_t)\frac{R_{\max}}{1-\gamma} + \alpha_t R_{\max} + \alpha_t\gamma\frac{R_{\max}}{1-\gamma} = \frac{R_{\max}}{1-\gamma} = V_{\max}.$$
Similarly, we can show $|Q^B_{t+1}(s,a)| \le V_{\max}$. Thus we complete the proof.

B Proof of Theorem 1
In this appendix, we provide a detailed proof of Theorem 1. Our proof includes: (a) Part I, which analyzes the stochastic error propagation between the two Q-estimators $\|Q^B_t - Q^A_t\|$; (b) Part II, which analyzes the error dynamics between one Q-estimator and the optimum $\|Q^A_t - Q^*\|$, conditioned on the event in Part I; and (c) Part III, which bounds the unconditional error $\|Q^A_t - Q^*\|$. We describe each of the three parts in more detail below.

B.1 Part I: Bounding $\|Q^B_t - Q^A_t\|$

The main idea is to upper bound $\|Q^B_t - Q^A_t\|$ by a decreasing sequence $\{G_q\}_{q\ge 0}$ block-wisely with high probability, where each block or epoch $q$ (with $q\ge 0$) is defined by $t\in[\hat\tau_q, \hat\tau_{q+1})$.

Proposition 1.
Fix $\epsilon > 0$, $\kappa\in(0,1)$, $\sigma\in(0,1)$ and $\Delta\in(0, e-2)$. Consider synchronous double Q-learning using a polynomial learning rate $\alpha_t = \frac{1}{t^\omega}$ with $\omega\in(0,1)$. Let $G_q = (1-\xi)^q G_0$ with $G_0 = V_{\max}$ and $\xi = \frac{1-\gamma}{4}$. Let $\hat\tau_{q+1} = \hat\tau_q + \frac{2c}{\kappa}\hat\tau_q^\omega$ for $q\ge 1$, with $c \ge \frac{\ln(2+\Delta) + 1/\hat\tau_1^\omega}{1 - \ln(2+\Delta) - 1/\hat\tau_1^\omega}$ and $\hat\tau_1$, the finishing time of the first epoch, satisfying
$$\hat\tau_1 \ge \max\left\{\left(\frac{2}{1-\ln(2+\Delta)}\right)^{\frac{1}{\omega}},\; \big(4a\ln(2a)\big)^{\frac{1}{\omega}}\right\}, \quad\text{where } a := \frac{32c(c+\kappa)V_{\max}^2}{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\sigma^2\xi^2\epsilon^2}.$$
Then for any $n$ such that $G_n \ge \sigma\epsilon$, we have
$$P\big[\forall q\in[0,n], \forall t\in[\hat\tau_{q+1},\hat\tau_{q+2}),\; \|Q^B_t - Q^A_t\| \le G_{q+1}\big] \ge 1 - \frac{4c(n+1)}{\kappa}\Big(1+\frac{2c}{\kappa}\Big)|\mathcal S||\mathcal A|\exp\left(-\frac{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\xi^2\sigma^2\epsilon^2\,\hat\tau_1^\omega}{64c(c+\kappa)V_{\max}^2}\right).$$

The proof of Proposition 1 consists of the following four steps.

B.1.1 Step 1: Characterizing the dynamics of $Q^B_t(s,a) - Q^A_t(s,a)$

We first characterize the dynamics of $u^{BA}_t(s,a) := Q^B_t(s,a) - Q^A_t(s,a)$ as a stochastic approximation (SA) algorithm in this step.

Lemma 2.
Consider double Q-learning in Algorithm 1. Then we have
$$u^{BA}_{t+1}(s,a) = (1-\alpha_t)u^{BA}_t(s,a) + \alpha_t F_t(s,a), \quad\text{where}\quad F_t(s,a) = \begin{cases} Q^B_t(s,a) - R_t - \gamma Q^B_t(s_{t+1}, a^*), & \text{w.p. } 1/2,\\ R_t + \gamma Q^A_t(s_{t+1}, b^*) - Q^A_t(s,a), & \text{w.p. } 1/2.\end{cases}$$
In addition, $F_t$ satisfies $\|\mathbb E[F_t|\mathcal F_t]\| \le \frac{1+\gamma}{2}\|u^{BA}_t\|$.

Proof.
Algorithm 1 indicates that at each time, either $Q^A$ or $Q^B$ is updated with equal probability. When updating $Q^A$ at time $t$, for each $(s,a)$ we have
$$u^{BA}_{t+1}(s,a) = Q^B_{t+1}(s,a) - Q^A_{t+1}(s,a) = Q^B_t(s,a) - \big(Q^A_t(s,a) + \alpha_t(R_t + \gamma Q^B_t(s_{t+1},a^*) - Q^A_t(s,a))\big) = (1-\alpha_t)u^{BA}_t(s,a) + \alpha_t\big(Q^B_t(s,a) - R_t - \gamma Q^B_t(s_{t+1},a^*)\big).$$
Similarly, when updating $Q^B$, we have
$$u^{BA}_{t+1}(s,a) = \big(Q^B_t(s,a) + \alpha_t(R_t + \gamma Q^A_t(s_{t+1},b^*) - Q^B_t(s,a))\big) - Q^A_t(s,a) = (1-\alpha_t)u^{BA}_t(s,a) + \alpha_t\big(R_t + \gamma Q^A_t(s_{t+1},b^*) - Q^A_t(s,a)\big).$$
Therefore, we can rewrite the dynamics of $u^{BA}_t$ as $u^{BA}_{t+1}(s,a) = (1-\alpha_t)u^{BA}_t(s,a) + \alpha_t F_t(s,a)$, where $F_t(s,a)$ takes each of the two values above with probability $1/2$. Thus, we have
$$\mathbb E[F_t(s,a)|\mathcal F_t] = \frac12\Big(Q^B_t(s,a) - \mathbb E_{s_{t+1}}\big[R_t + \gamma Q^B_t(s_{t+1},a^*)\big]\Big) + \frac12\Big(\mathbb E_{s_{t+1}}\big[R_t + \gamma Q^A_t(s_{t+1},b^*)\big] - Q^A_t(s,a)\Big) = \frac12 u^{BA}_t(s,a) + \frac{\gamma}{2}\mathbb E_{s_{t+1}}\big[Q^A_t(s_{t+1},b^*) - Q^B_t(s_{t+1},a^*)\big]. \quad (7)$$
Next, we bound $\mathbb E_{s_{t+1}}[Q^A_t(s_{t+1},b^*) - Q^B_t(s_{t+1},a^*)]$. First, consider the case when $\mathbb E_{s_{t+1}}Q^A_t(s_{t+1},b^*) \ge \mathbb E_{s_{t+1}}Q^B_t(s_{t+1},a^*)$. Then we have
$$\Big|\mathbb E_{s_{t+1}}\big[Q^A_t(s_{t+1},b^*) - Q^B_t(s_{t+1},a^*)\big]\Big| = \mathbb E_{s_{t+1}}\big[Q^A_t(s_{t+1},b^*) - Q^B_t(s_{t+1},a^*)\big] \overset{(i)}{\le} \mathbb E_{s_{t+1}}\big[Q^A_t(s_{t+1},a^*) - Q^B_t(s_{t+1},a^*)\big] \le \|u^{BA}_t\|,$$
where (i) follows from the definition of $a^*$ in Algorithm 1. Similarly, if $\mathbb E_{s_{t+1}}Q^A_t(s_{t+1},b^*) < \mathbb E_{s_{t+1}}Q^B_t(s_{t+1},a^*)$, we have
$$\Big|\mathbb E_{s_{t+1}}\big[Q^A_t(s_{t+1},b^*) - Q^B_t(s_{t+1},a^*)\big]\Big| = \mathbb E_{s_{t+1}}\big[Q^B_t(s_{t+1},a^*) - Q^A_t(s_{t+1},b^*)\big] \overset{(i)}{\le} \mathbb E_{s_{t+1}}\big[Q^B_t(s_{t+1},b^*) - Q^A_t(s_{t+1},b^*)\big] \le \|u^{BA}_t\|,$$
where (i) follows from the definition of $b^*$. Thus we can conclude that $\big|\mathbb E_{s_{t+1}}[Q^A_t(s_{t+1},b^*) - Q^B_t(s_{t+1},a^*)]\big| \le \|u^{BA}_t\|$. Then, we continue to bound (7), and obtain
$$\big|\mathbb E[F_t(s,a)|\mathcal F_t]\big| \le \frac12\|u^{BA}_t\| + \frac{\gamma}{2}\Big|\mathbb E_{s_{t+1}}\big[Q^A_t(s_{t+1},b^*) - Q^B_t(s_{t+1},a^*)\big]\Big| \le \frac{1+\gamma}{2}\|u^{BA}_t\|,$$
for all $(s,a)$ pairs. Hence, $\|\mathbb E[F_t|\mathcal F_t]\| \le \frac{1+\gamma}{2}\|u^{BA}_t\|$.

Applying Lemma 2, we write the dynamics of $u^{BA}_t(s,a)$ in the form of a classical SA algorithm driven by a martingale difference sequence as follows:
$$u^{BA}_{t+1}(s,a) = (1-\alpha_t)u^{BA}_t(s,a) + \alpha_t F_t(s,a) = (1-\alpha_t)u^{BA}_t(s,a) + \alpha_t\big(h_t(s,a) + z_t(s,a)\big),$$
where $h_t(s,a) = \mathbb E[F_t(s,a)|\mathcal F_t]$ and $z_t(s,a) = F_t(s,a) - \mathbb E[F_t(s,a)|\mathcal F_t]$.
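The conditional-mean bound of Lemma 2 holds deterministically for any pair of Q-tables, so it can be checked numerically. Below is a minimal sketch (not from the paper; the random MDP, its size, and all function names are our own illustrative choices) that computes $\mathbb E[F_t(s,a)|\mathcal F_t]$ exactly and compares it against $\frac{1+\gamma}{2}\|u^{BA}_t\|$:

```python
import numpy as np

def mean_F_bound_check(n_states=4, n_actions=3, gamma=0.9, seed=0):
    """Exactly compute E[F_t(s,a) | F_t] for random Q-tables and verify
    |E[F_t(s,a)]| <= (1+gamma)/2 * ||Q^B - Q^A||_inf, as in Lemma 2."""
    rng = np.random.default_rng(seed)
    QA = rng.uniform(0, 10, (n_states, n_actions))
    QB = rng.uniform(0, 10, (n_states, n_actions))
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s,a] over s'
    R = rng.uniform(0, 1, (n_states, n_actions))   # deterministic reward r(s,a)

    a_star = QA.argmax(axis=1)   # a*(s') maximizes Q^A(s', .)
    b_star = QB.argmax(axis=1)   # b*(s') maximizes Q^B(s', .)
    u = QB - QA                  # u^{BA}

    worst = 0.0
    for s in range(n_states):
        for a in range(n_actions):
            # average over the fair coin and over s' ~ P[s, a]
            EB = P[s, a] @ QB[np.arange(n_states), a_star]   # E[Q^B(s', a*)]
            EA = P[s, a] @ QA[np.arange(n_states), b_star]   # E[Q^A(s', b*)]
            mean_F = 0.5 * (QB[s, a] - R[s, a] - gamma * EB) \
                   + 0.5 * (R[s, a] + gamma * EA - QA[s, a])
            worst = max(worst, abs(mean_F))
    bound = (1 + gamma) / 2 * np.abs(u).max()
    return worst, bound
```

For every random draw, `worst` stays below `bound`, mirroring the two-case argument above (the pointwise inequality $|Q^A(s',b^*) - Q^B(s',a^*)| \le \|u^{BA}\|$ holds for each $s'$ separately).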
Then, we obtain $\mathbb E[z_t(s,a)|\mathcal F_t] = 0$ and $\|h_t\| \le \frac{1+\gamma}{2}\|u^{BA}_t\|$ following from Lemma 2. We define $u^*(s,a) = 0$, and treat $h_t$ as an operator over $u^{BA}_t$. Then $h_t$ has a contraction property:
$$\|h_t - u^*\| \le \gamma'\|u^{BA}_t - u^*\|, \quad (8)$$
where $\gamma' = \frac{1+\gamma}{2}\in(0,1)$. Based on this SA formulation, we bound $u^{BA}_t(s,a)$ block-wisely in the next step.

B.1.2 Step 2: Constructing sandwich bounds on $u^{BA}_t$
We derive lower and upper bounds on $u^{BA}_t$ via two sequences $X_{t;\hat\tau_q}$ and $Z_{t;\hat\tau_q}$ in the following lemma.

Lemma 3.
Let $\hat\tau_q$ be such that $\|u^{BA}_t\| \le G_q$ for all $t\ge\hat\tau_q$. Define $Z_{t;\hat\tau_q}(s,a)$ and $X_{t;\hat\tau_q}(s,a)$ as
$$Z_{t+1;\hat\tau_q}(s,a) = (1-\alpha_t)Z_{t;\hat\tau_q}(s,a) + \alpha_t z_t(s,a), \quad\text{with } Z_{\hat\tau_q;\hat\tau_q}(s,a) = 0;$$
$$X_{t+1;\hat\tau_q}(s,a) = (1-\alpha_t)X_{t;\hat\tau_q}(s,a) + \alpha_t\gamma' G_q, \quad\text{with } X_{\hat\tau_q;\hat\tau_q}(s,a) = G_q,\; \gamma' = \frac{1+\gamma}{2}.$$
Then for any $t\ge\hat\tau_q$ and state-action pair $(s,a)$, we have
$$-X_{t;\hat\tau_q}(s,a) + Z_{t;\hat\tau_q}(s,a) \le u^{BA}_t(s,a) \le X_{t;\hat\tau_q}(s,a) + Z_{t;\hat\tau_q}(s,a).$$

Proof.
We proceed by induction. For the initial condition $t=\hat\tau_q$, $\|u^{BA}_{\hat\tau_q}\| \le G_q$ implies $-G_q \le u^{BA}_{\hat\tau_q} \le G_q$. We assume the sandwich bound holds for time $t$. It remains to check that the bound also holds for $t+1$. At time $t+1$, we have
$$u^{BA}_{t+1}(s,a) = (1-\alpha_t)u^{BA}_t(s,a) + \alpha_t\big(h_t(s,a) + z_t(s,a)\big) \le (1-\alpha_t)\big(X_{t;\hat\tau_q}(s,a) + Z_{t;\hat\tau_q}(s,a)\big) + \alpha_t\big(h_t(s,a) + z_t(s,a)\big) \overset{(i)}{\le} \big[(1-\alpha_t)X_{t;\hat\tau_q}(s,a) + \alpha_t\gamma'\|u^{BA}_t\|\big] + \big[(1-\alpha_t)Z_{t;\hat\tau_q}(s,a) + \alpha_t z_t(s,a)\big] \le \big[(1-\alpha_t)X_{t;\hat\tau_q}(s,a) + \alpha_t\gamma' G_q\big] + \big[(1-\alpha_t)Z_{t;\hat\tau_q}(s,a) + \alpha_t z_t(s,a)\big] = X_{t+1;\hat\tau_q}(s,a) + Z_{t+1;\hat\tau_q}(s,a),$$
where (i) follows from Lemma 2. Similarly, we can bound the other direction as
$$u^{BA}_{t+1}(s,a) \ge (1-\alpha_t)\big(-X_{t;\hat\tau_q}(s,a) + Z_{t;\hat\tau_q}(s,a)\big) + \alpha_t\big(h_t(s,a) + z_t(s,a)\big) \ge \big[-(1-\alpha_t)X_{t;\hat\tau_q}(s,a) - \alpha_t\gamma' G_q\big] + \big[(1-\alpha_t)Z_{t;\hat\tau_q}(s,a) + \alpha_t z_t(s,a)\big] = -X_{t+1;\hat\tau_q}(s,a) + Z_{t+1;\hat\tau_q}(s,a).$$

B.1.3 Step 3: Bounding $X_{t;\hat\tau_q}$ and $Z_{t;\hat\tau_q}$ for block $q+1$

We bound $X_{t;\hat\tau_q}$ and $Z_{t;\hat\tau_q}$ in Lemma 5 and Lemma 6 below, respectively. Before that, we first introduce the following technical lemma, which will be useful in the proof of Lemma 5.

Lemma 4.
Fix $\omega\in(0,1)$. Let $0 < t_1 < t_2$. Then we have
$$\prod_{i=t_1}^{t_2}\Big(1-\frac{1}{i^\omega}\Big) \le \exp\Big(-\frac{t_2-t_1}{t_2^\omega}\Big).$$

Proof.
Since $\ln(1-x) \le -x$ for any $x\in(0,1)$, we have
$$\ln\left[\prod_{i=t_1}^{t_2}\Big(1-\frac{1}{i^\omega}\Big)\right] \le -\sum_{i=t_1}^{t_2}\frac{1}{i^\omega} \le -\int_{t_1}^{t_2} t^{-\omega}\,dt = -\frac{t_2^{1-\omega} - t_1^{1-\omega}}{1-\omega}.$$
Thus, for fixed $\omega\in(0,1)$ and $0 < t_1 < t_2$, we have
$$\prod_{i=t_1}^{t_2}\Big(1-\frac{1}{i^\omega}\Big) \le \exp\Big(-\frac{t_2^{1-\omega} - t_1^{1-\omega}}{1-\omega}\Big).$$
Define $f(t) := t^{1-\omega}$, and observe that $f$ is an increasing concave function. Then we have
$$t_2^{1-\omega} - t_1^{1-\omega} \ge f'(t_2)(t_2-t_1) = (1-\omega)t_2^{-\omega}(t_2-t_1),$$
which immediately implies the result.

We now derive a bound for $X_{t;\hat\tau_q}$.

Lemma 5.
Fix $\kappa\in(0,1)$ and $\Delta\in(0,e-2)$. Let $\{G_q\}$ be defined in Proposition 1. Consider synchronous double Q-learning using a polynomial learning rate $\alpha_t = \frac{1}{t^\omega}$ with $\omega\in(0,1)$. Suppose that $X_{t;\hat\tau_q}(s,a) \le G_q$ for any $t\ge\hat\tau_q$. Then for any $t\in[\hat\tau_{q+1},\hat\tau_{q+2})$, given $\hat\tau_{q+1} = \hat\tau_q + \frac{2c}{\kappa}\hat\tau_q^\omega$ with $\hat\tau_1 \ge \big(\frac{2}{1-\ln(2+\Delta)}\big)^{\frac1\omega}$ and $c \ge \frac{\ln(2+\Delta)+1/\hat\tau_1^\omega}{1-\ln(2+\Delta)-1/\hat\tau_1^\omega}$, we have
$$X_{t;\hat\tau_q}(s,a) \le \Big(\gamma' + \frac{2}{2+\Delta}\xi\Big)G_q.$$

Proof. Observe that $X_{\hat\tau_q;\hat\tau_q}(s,a) = G_q = \gamma' G_q + (1-\gamma')G_q := \gamma' G_q + \rho_{\hat\tau_q}$. We can rewrite the dynamics of $X_{t;\hat\tau_q}(s,a)$ as
$$X_{t+1;\hat\tau_q}(s,a) = (1-\alpha_t)X_{t;\hat\tau_q}(s,a) + \alpha_t\gamma' G_q = \gamma' G_q + (1-\alpha_t)\rho_t,$$
where $\rho_{t+1} = (1-\alpha_t)\rho_t$. By the definition of $\rho_t$, we obtain
$$\rho_t = (1-\gamma')G_q\prod_{i=\hat\tau_q}^{t-1}\Big(1-\frac{1}{i^\omega}\Big) \overset{(i)}{\le} (1-\gamma')G_q\prod_{i=\hat\tau_q}^{\hat\tau_{q+1}-1}\Big(1-\frac{1}{i^\omega}\Big) \overset{(ii)}{\le} (1-\gamma')G_q\exp\Big(-\frac{\hat\tau_{q+1}-1-\hat\tau_q}{(\hat\tau_{q+1}-1)^\omega}\Big) \le (1-\gamma')G_q\exp\Big(-\frac{\frac{2c}{\kappa}\hat\tau_q^\omega - 1}{\hat\tau_{q+1}^\omega}\Big) = (1-\gamma')G_q\exp\Big(-\frac{2c}{\kappa}\Big(\frac{\hat\tau_q}{\hat\tau_{q+1}}\Big)^\omega + \frac{1}{\hat\tau_{q+1}^\omega}\Big) \overset{(iii)}{\le} (1-\gamma')G_q\exp\Big(-\frac{2c}{\kappa+2c} + \frac{1}{\hat\tau_1^\omega}\Big) \overset{(iv)}{\le} (1-\gamma')G_q\exp\Big(-\frac{c}{c+1} + \frac{1}{\hat\tau_1^\omega}\Big),$$
where (i) follows because $\alpha_i$ is decreasing and $t\ge\hat\tau_{q+1}$, (ii) follows from Lemma 4, (iii) follows because $\hat\tau_q \ge \hat\tau_1$ and
$$\Big(\frac{\hat\tau_q}{\hat\tau_{q+1}}\Big)^\omega \ge \frac{\hat\tau_q}{\hat\tau_{q+1}} = \frac{\hat\tau_q}{\hat\tau_q + \frac{2c}{\kappa}\hat\tau_q^\omega} \ge \frac{\kappa}{\kappa+2c},$$
and (iv) follows because $\kappa \le 1$ implies $\frac{2c}{\kappa+2c} \ge \frac{c}{c+1}$. Next, observing the conditions $\hat\tau_1^\omega \ge \frac{2}{1-\ln(2+\Delta)}$ and $c \ge \frac{\ln(2+\Delta)+1/\hat\tau_1^\omega}{1-\ln(2+\Delta)-1/\hat\tau_1^\omega}$, we have $\frac{c}{c+1} - \frac{1}{\hat\tau_1^\omega} \ge \ln(2+\Delta)$. Thus we have $\rho_t \le \frac{1-\gamma'}{2+\Delta}G_q$. Finally, we finish the proof by further observing that $1-\gamma' = 2\xi$.

Since we have bounded $X_{t;\hat\tau_q}(s,a)$ by $\big(\gamma' + \frac{2}{2+\Delta}\xi\big)G_q$ for all $t\ge\hat\tau_{q+1}$, it remains to bound $Z_{t;\hat\tau_q}(s,a)$ by $\frac{\Delta}{2+\Delta}\xi G_q$ for block $q+1$, which will further yield $\|u^{BA}_t\| \le (\gamma'+\xi)G_q = (1-\xi)G_q = G_{q+1}$ for any $t\in[\hat\tau_{q+1},\hat\tau_{q+2})$, as desired. Differently from $X_{t;\hat\tau_q}(s,a)$, which is a deterministic monotonic sequence, $Z_{t;\hat\tau_q}(s,a)$ is stochastic. We need to capture the probability that a bound on $Z_{t;\hat\tau_q}(s,a)$ holds for block $q+1$. To this end, we introduce the family of sequences $\{Z^l_{t;\hat\tau_q}(s,a)\}$ given by
$$Z^l_{t;\hat\tau_q}(s,a) = \sum_{i=\hat\tau_q}^{\hat\tau_q+l}\alpha_i\prod_{j=i+1}^{t-1}(1-\alpha_j)\,z_i(s,a) := \sum_{i=\hat\tau_q}^{\hat\tau_q+l}\phi^i_{q,t-1}\,z_i(s,a), \quad (9)$$
where $\phi^i_{q,t-1} = \alpha_i\prod_{j=i+1}^{t-1}(1-\alpha_j)$. By the definition of $Z_{t;\hat\tau_q}(s,a)$, one can check that $Z_{t;\hat\tau_q}(s,a) = Z^{t-1-\hat\tau_q}_{t;\hat\tau_q}(s,a)$. Thus we have
$$Z_{t;\hat\tau_q}(s,a) = \sum_{l=1}^{t-1-\hat\tau_q}\big(Z^l_{t;\hat\tau_q}(s,a) - Z^{l-1}_{t;\hat\tau_q}(s,a)\big) + Z^0_{t;\hat\tau_q}(s,a). \quad (10)$$
In the following lemma, we capture an important property of $Z^l_{t;\hat\tau_q}(s,a)$ defined in (9).

Lemma 6.
For any $t\in[\hat\tau_{q+1},\hat\tau_{q+2})$ and $1\le l\le t-1-\hat\tau_q$, $Z^l_{t;\hat\tau_q}(s,a)$ is a martingale sequence in $l$ and satisfies
$$\big|Z^l_{t;\hat\tau_q}(s,a) - Z^{l-1}_{t;\hat\tau_q}(s,a)\big| \le \frac{2V_{\max}}{\hat\tau_q^\omega}. \quad (11)$$

Proof. To show the martingale property, we observe that
$$\mathbb E\big[Z^l_{t;\hat\tau_q}(s,a) - Z^{l-1}_{t;\hat\tau_q}(s,a)\,\big|\,\mathcal F_{\hat\tau_q+l-1}\big] = \mathbb E\big[\phi^{\hat\tau_q+l}_{q,t-1}\,z_{\hat\tau_q+l}(s,a)\,\big|\,\mathcal F_{\hat\tau_q+l-1}\big] = \phi^{\hat\tau_q+l}_{q,t-1}\,\mathbb E\big[z_{\hat\tau_q+l}(s,a)\,\big|\,\mathcal F_{\hat\tau_q+l-1}\big] = 0,$$
where the last equation follows from the definition of $z_t(s,a)$. In addition, based on the definition of $\phi^i_{q,t-1}$ in (9), which requires $i\ge\hat\tau_q$, we have
$$\phi^i_{q,t-1} = \alpha_i\prod_{j=i+1}^{t-1}(1-\alpha_j) \le \alpha_i \le \frac{1}{\hat\tau_q^\omega}.$$
Further, since $|F_t| \le \frac{R_{\max}}{1-\gamma} = V_{\max}$, we obtain $|z_t(s,a)| = |F_t - \mathbb E[F_t|\mathcal F_t]| \le 2V_{\max}$. Thus
$$\big|Z^l_{t;\hat\tau_q}(s,a) - Z^{l-1}_{t;\hat\tau_q}(s,a)\big| = \phi^{\hat\tau_q+l}_{q,t-1}\,\big|z_{\hat\tau_q+l}(s,a)\big| \le \frac{2V_{\max}}{\hat\tau_q^\omega}.$$

Lemma 6 guarantees that $Z^l_{t;\hat\tau_q}(s,a)$ is a martingale sequence, which allows us to apply the following Azuma's inequality.

Lemma 7. (Azuma, 1967) Let $X_0, X_1, \ldots, X_n$ be a martingale sequence such that for each $1\le k\le n$, $|X_k - X_{k-1}| \le c_k$, where $c_k$ is a constant that may depend on $k$. Then for all $n\ge 1$ and any $\epsilon > 0$,
$$P\big[|X_n - X_0| > \epsilon\big] \le 2\exp\Big(-\frac{\epsilon^2}{2\sum_{k=1}^n c_k^2}\Big).$$
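As a quick numerical sanity check of Lemma 7 (purely illustrative; the martingale, sample sizes, and names below are our own choices), one can compare the empirical tail of a $\pm1$ random walk, for which every increment bound is $c_k = 1$, with Azuma's bound:

```python
import numpy as np

def azuma_demo(n=200, eps=30.0, trials=20000, seed=0):
    """Empirically check Azuma's inequality for a +/-1 random walk:
    P(|X_n - X_0| > eps) <= 2 exp(-eps^2 / (2 n)), since each |X_k - X_{k-1}| = 1."""
    rng = np.random.default_rng(seed)
    steps = rng.choice([-1.0, 1.0], size=(trials, n))  # martingale increments
    X_n = steps.sum(axis=1)                            # X_n with X_0 = 0
    empirical = np.mean(np.abs(X_n) > eps)             # empirical tail probability
    bound = 2.0 * np.exp(-eps**2 / (2.0 * n))          # Azuma's bound
    return empirical, bound
```

With these parameters the bound is roughly $0.21$ while the empirical tail is far smaller, consistent with Azuma's inequality being a worst-case bound.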
By Azuma's inequality and the relationship between $Z_{t;\hat\tau_q}(s,a)$ and $Z^l_{t;\hat\tau_q}(s,a)$ in (9) and (10), we obtain
$$P\big[|Z_{t;\hat\tau_q}(s,a)| > \hat\epsilon \,\big|\, t\in[\hat\tau_{q+1},\hat\tau_{q+2})\big] \le 2\exp\left(-\frac{\hat\epsilon^2}{2\sum_{l=1}^{t-1-\hat\tau_q}\big(\frac{2V_{\max}}{\hat\tau_q^\omega}\big)^2 + 2\big(\frac{2V_{\max}}{\hat\tau_q^\omega}\big)^2}\right) \overset{(i)}{\le} 2\exp\left(-\frac{\hat\epsilon^2\,\hat\tau_q^{2\omega}}{8(t-\hat\tau_q)V_{\max}^2}\right) \le 2\exp\left(-\frac{\hat\epsilon^2\,\hat\tau_q^{2\omega}}{8(\hat\tau_{q+2}-\hat\tau_q)V_{\max}^2}\right) \overset{(ii)}{\le} 2\exp\left(-\frac{\kappa^2\hat\epsilon^2\,\hat\tau_q^{\omega}}{32c(c+\kappa)V_{\max}^2}\right),$$
where (i) follows from Lemma 6 (each of the $t-\hat\tau_q$ squared increments is at most $4V_{\max}^2/\hat\tau_q^{2\omega}$), and (ii) follows because
$$\hat\tau_{q+2}-\hat\tau_q = \frac{2c}{\kappa}\hat\tau_{q+1}^\omega + \frac{2c}{\kappa}\hat\tau_q^\omega \le \frac{2c}{\kappa}\Big(1+\frac{2c}{\kappa}\Big)\hat\tau_q^\omega + \frac{2c}{\kappa}\hat\tau_q^\omega \le \frac{2c}{\kappa}\Big(2+\frac{2c}{\kappa}\Big)\hat\tau_q^\omega = \frac{4c(c+\kappa)}{\kappa^2}\hat\tau_q^\omega.$$

B.1.4 Step 4: Unionizing all blocks and state-action pairs
Now we are ready to prove Proposition 1 by taking a union of probabilities over all blocks and state-action pairs. Before that, we introduce the following two preliminary lemmas, which will be used multiple times in the sequel.
Lemma 8.
Let $\{X_i\}_{i\in\mathcal I}$ be a set of random variables. Fix $\epsilon > 0$. If for any $i\in\mathcal I$ we have $P(X_i \le \epsilon) \ge 1-\delta$, then
$$P\big(\forall i\in\mathcal I,\; X_i \le \epsilon\big) \ge 1 - |\mathcal I|\delta.$$

Proof. By the union bound, we have
$$P\big(\forall i\in\mathcal I,\; X_i \le \epsilon\big) = 1 - P\Big(\bigcup_{i\in\mathcal I}\{X_i > \epsilon\}\Big) \ge 1 - \sum_{i\in\mathcal I}P(X_i > \epsilon) \ge 1 - |\mathcal I|\delta.$$

Lemma 9.
Fix positive constants $a, b$ with $2ab \ge 1$. If $\tau \ge 4ab\ln(2ab)$, then
$$\tau^b\exp\Big(-\frac{\tau}{a}\Big) \le \exp\Big(-\frac{\tau}{2a}\Big).$$

Proof.
Let $c = 2ab$. It suffices to show that $c\ln\tau \le \tau$, since this implies $b\ln\tau \le \frac{\tau}{2a}$, i.e., $\tau^b \le \exp\big(\frac{\tau}{2a}\big)$, which in turn implies the lemma. If $\tau \le c^2$, we have $c\ln\tau \le c\ln c^2 = 2c\ln c \le \tau$, where the last inequality follows from the assumption $\tau \ge 4ab\ln(2ab) = 2c\ln c$. If $\tau > c^2$, we have $c\ln\tau \le \sqrt\tau\ln\tau \le \sqrt\tau\cdot\sqrt\tau = \tau$, where the last inequality follows from $\ln x \le \sqrt x$ for all $x > 0$.

Proof of Proposition 1
Based on the results obtained above, we are ready to prove Proposition 1. Applying Lemma 8, we have
$$P\Big[\forall(s,a),\forall q\in[0,n],\forall t\in[\hat\tau_{q+1},\hat\tau_{q+2}),\; |Z_{t;\hat\tau_q}(s,a)| \le \frac{\Delta}{2+\Delta}\xi G_q\Big] \ge 1 - \sum_{q=0}^n |\mathcal S||\mathcal A|(\hat\tau_{q+2}-\hat\tau_{q+1})\cdot P\Big[|Z_{t;\hat\tau_q}(s,a)| > \frac{\Delta}{2+\Delta}\xi G_q \,\Big|\, t\in[\hat\tau_{q+1},\hat\tau_{q+2})\Big]$$
$$\ge 1 - \sum_{q=0}^n |\mathcal S||\mathcal A|\,\frac{2c}{\kappa}\hat\tau_{q+1}^\omega\cdot 2\exp\left(-\frac{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\xi^2 G_q^2\,\hat\tau_q^\omega}{32c(c+\kappa)V_{\max}^2}\right) \overset{(i)}{\ge} 1 - \frac{4c}{\kappa}\Big(1+\frac{2c}{\kappa}\Big)\sum_{q=0}^n |\mathcal S||\mathcal A|\,\hat\tau_q^\omega\exp\left(-\frac{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\xi^2\sigma^2\epsilon^2\,\hat\tau_q^\omega}{32c(c+\kappa)V_{\max}^2}\right)$$
$$\overset{(ii)}{\ge} 1 - \frac{4c}{\kappa}\Big(1+\frac{2c}{\kappa}\Big)\sum_{q=0}^n |\mathcal S||\mathcal A|\exp\left(-\frac{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\xi^2\sigma^2\epsilon^2\,\hat\tau_q^\omega}{64c(c+\kappa)V_{\max}^2}\right) \overset{(iii)}{\ge} 1 - \frac{4c(n+1)}{\kappa}\Big(1+\frac{2c}{\kappa}\Big)|\mathcal S||\mathcal A|\exp\left(-\frac{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\xi^2\sigma^2\epsilon^2\,\hat\tau_1^\omega}{64c(c+\kappa)V_{\max}^2}\right),$$
where (i) follows because $G_q \ge G_n \ge \sigma\epsilon$ and $\hat\tau_{q+1}^\omega \le (1+\frac{2c}{\kappa})\hat\tau_q^\omega$, (ii) follows from Lemma 9 with $b=1$, $a = \frac{32c(c+\kappa)V_{\max}^2}{\kappa^2(\frac{\Delta}{2+\Delta})^2\xi^2\sigma^2\epsilon^2}$ and the observation $\hat\tau_q^\omega \ge \hat\tau_1^\omega \ge 4a\ln(2a)$, and (iii) follows because $\hat\tau_q \ge \hat\tau_1$. Finally, we complete the proof of Proposition 1 by observing that $X_{t;\hat\tau_q}$ is a deterministic sequence bounded as in Lemma 5, and thus
$$P\big[\forall q\in[0,n],\forall t\in[\hat\tau_{q+1},\hat\tau_{q+2}),\;\|Q^B_t - Q^A_t\| \le G_{q+1}\big] \ge P\Big[\forall(s,a),\forall q\in[0,n],\forall t\in[\hat\tau_{q+1},\hat\tau_{q+2}),\;|Z_{t;\hat\tau_q}(s,a)| \le \frac{\Delta}{2+\Delta}\xi G_q\Big].$$
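The behavior that Part I quantifies can be observed empirically. The following simulation is an illustrative sketch only, not part of the analysis: the MDP, its size, and all constants are arbitrary choices of ours. It runs synchronous double Q-learning on a small random MDP and reports the sup-norm gap between the two Q-estimators as well as the error of $Q^A$ against the true $Q^*$:

```python
import numpy as np

def double_q_sync(T=20000, omega=0.8, gamma=0.5, nS=3, nA=2, seed=0):
    """Synchronous double Q-learning on a small random MDP (illustrative).
    Returns (||Q^B - Q^A||_inf, ||Q^A - Q*||_inf) after T iterations."""
    rng = np.random.default_rng(seed)
    P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] = distribution over s'
    R = rng.uniform(0.5, 1.0, (nS, nA))             # deterministic rewards, R_max = 1
    Qstar = np.zeros((nS, nA))                      # ground truth via Q-value iteration
    for _ in range(200):
        Qstar = R + gamma * P @ Qstar.max(axis=1)
    QA = np.zeros((nS, nA))
    QB = np.zeros((nS, nA))
    for t in range(1, T + 1):
        alpha = 1.0 / t ** omega                    # polynomial learning rate
        # synchronous sampling: one next state for every (s, a) pair
        sp = np.array([[rng.choice(nS, p=P[s, a]) for a in range(nA)]
                       for s in range(nS)])
        if rng.random() < 0.5:                      # update Q^A, evaluate with Q^B
            astar = QA[sp].argmax(axis=-1)          # a* = argmax_a Q^A(s', a)
            QA += alpha * (R + gamma * QB[sp, astar] - QA)
        else:                                       # update Q^B, evaluate with Q^A
            bstar = QB[sp].argmax(axis=-1)
            QB += alpha * (R + gamma * QA[sp, bstar] - QB)
    return np.abs(QB - QA).max(), np.abs(QA - Qstar).max()
```

In runs like this, both quantities shrink as $t$ grows, matching the qualitative picture of the block-wise bounds $\{G_q\}$ (for $\|Q^B_t - Q^A_t\|$) and $\{D_k\}$ (for $\|Q^A_t - Q^*\|$).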
B.2 Part II: Conditionally bounding $\|Q^A_t - Q^*\|$

In this part, we upper bound $\|Q^A_t - Q^*\|$ by a decreasing sequence $\{D_k\}_{k\ge 0}$ block-wisely, conditioned on the following two events: fixing a positive integer $m$, we define
$$E := \big\{\forall k\in[0,m],\forall t\in[\tau_{k+1},\tau_{k+2}),\; \|Q^B_t - Q^A_t\| \le \sigma D_{k+1}\big\}, \quad (12)$$
$$F := \big\{\forall k\in[1,m+1],\; I^A_k \ge c\tau_k^\omega\big\}, \quad (13)$$
where $I^A_k$ denotes the number of iterations updating $Q^A$ at epoch $k$, $\tau_{k+1}$ is the starting iteration index of the $(k+1)$th block, and $\omega$ is the decay parameter of the polynomial learning rate. Roughly, Event $E$ requires that the difference between the two Q-estimators be bounded appropriately, and Event $F$ requires that $Q^A$ be sufficiently updated in each block.

Proposition 2.
Fix $\epsilon > 0$, $\kappa\in(\ln 2, 1)$ and $\Delta\in(0, e^\kappa - 2)$. Consider synchronous double Q-learning under a polynomial learning rate $\alpha_t = \frac{1}{t^\omega}$ with $\omega\in(0,1)$. Let $\{G_q\}_{q\ge 0}$, $\{\hat\tau_q\}_{q\ge 0}$ be defined in Proposition 1. Define $D_k = (1-\beta)^k\frac{V_{\max}}{\sigma}$ with $\beta = \frac{1-\gamma(1+\sigma)}{2}$ and $\sigma = \frac{1-\gamma}{2\gamma}$. Let $\tau_k = \hat\tau_k$ for $k\ge 0$. Suppose that $c \ge \frac{\kappa(\ln(2+\Delta)+1/\tau_1^\omega)}{2(\kappa-\ln(2+\Delta)-1/\tau_1^\omega)}$ and that $\tau_1$, the finishing time of the first block, satisfies
$$\tau_1 \ge \max\left\{\Big(\frac{2}{\kappa-\ln(2+\Delta)}\Big)^{\frac1\omega},\; \big(4a'\ln(2a')\big)^{\frac1\omega}\right\}, \quad\text{where } a' := \frac{32c(c+\kappa)V_{\max}^2}{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\beta^2\epsilon^2}.$$
Then for any $m$ such that $D_m \ge \epsilon$, we have
$$P\big[\forall k\in[0,m],\forall t\in[\tau_{k+1},\tau_{k+2}),\; \|Q^A_t - Q^*\| \le D_{k+1}\,\big|\,E,F\big] \ge 1 - \frac{4c(m+1)}{\kappa}\Big(1+\frac{2c}{\kappa}\Big)|\mathcal S||\mathcal A|\exp\left(-\frac{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\beta^2\epsilon^2\,\tau_1^\omega}{64c(c+\kappa)V_{\max}^2}\right),$$
where the events $E$ and $F$ are defined in (12) and (13), respectively.
The proof of Proposition 2 consists of the following four steps.
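The block construction underlying these steps can be made concrete with a short computation. The constants $c$, $\kappa$, $\omega$, and the first-block length below are arbitrary illustrative choices of ours, not the values required by Proposition 2:

```python
import math

def block_schedule(eps=0.05, gamma=0.5, Rmax=1.0, omega=0.8, c=5.0, kappa=0.9):
    """Iterate the block recursion tau_{k+1} = tau_k + (2c/kappa) * tau_k**omega
    together with the decreasing bounds D_k = (1-beta)^k * Vmax/sigma, until
    D_m <= eps.  Returns the number of blocks m and the final iteration index."""
    Vmax = Rmax / (1 - gamma)
    sigma = (1 - gamma) / (2 * gamma)
    beta = (1 - gamma * (1 + sigma)) / 2           # equals (1 - gamma) / 4
    D = Vmax / sigma                               # D_0
    tau = 100.0                                    # assumed finishing time of block 1
    k = 0
    while D > eps:
        tau += (2 * c / kappa) * tau ** omega      # next block boundary
        D *= (1 - beta)                            # next block-wise bound
        k += 1
    return k, tau
```

With $\gamma = 0.5$ this gives $\beta = 0.125$ and $D_0 = 4$, so $m = \lceil\ln(D_0/\epsilon)/\ln\frac{1}{1-\beta}\rceil = 33$ blocks suffice to drive the bound below $\epsilon = 0.05$; the geometric decay of $D_k$ versus the polynomial growth of the block lengths is exactly the trade-off the proof balances.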
B.2.1 Step 1: Designing $\{D_k\}_{k\ge 0}$

The following lemma establishes the relationship (illustrated in Figure 1) between the block-wise bounds $\{G_q\}_{q\ge 0}$ and $\{D_k\}_{k\ge 0}$ and their block separations, such that Event $E$ occurs with high probability as a result of Proposition 1.

Lemma 10.
Let $\{G_q\}$ be defined in Proposition 1, and let $D_k = (1-\beta)^k\frac{V_{\max}}{\sigma}$ with $\beta = \frac{1-\gamma(1+\sigma)}{2}$ and $\sigma = \frac{1-\gamma}{2\gamma}$. Then we have
$$P\big[\forall q\in[0,m],\forall t\in[\hat\tau_{q+1},\hat\tau_{q+2}),\; \|Q^B_t - Q^A_t\| \le G_{q+1}\big] \le P\big[\forall k\in[0,m],\forall t\in[\tau_{k+1},\tau_{k+2}),\; \|Q^B_t - Q^A_t\| \le \sigma D_{k+1}\big],$$
given that $\tau_k = \hat\tau_k$.

Proof. Based on our choice of $\sigma$, we have $\gamma(1+\sigma) = \gamma + \frac{1-\gamma}{2} = \frac{1+\gamma}{2}$, and hence
$$\beta = \frac{1-\gamma(1+\sigma)}{2} = \frac{1-\gamma}{4} = \xi.$$
Thus the decay rate of $D_k$ is the same as that of $G_q$. Further considering $G_0 = \sigma D_0$, we can make the sequence $\{\sigma D_k\}$ an upper bound of $\{G_q\}$ at all times, as long as we set the same starting point and ending point for each epoch; hence the event on the left-hand side implies the event on the right-hand side.

In Lemma 10, we make $G_k = \sigma D_k$ at every block $k$ and $\xi = \beta = \frac{1-\gamma}{4}$ by careful design of $\sigma$. In fact, one can choose any value of $\sigma\in(0,\frac{1-\gamma}{\gamma})$ and design a corresponding relationship between $\tau_k$ and $\hat\tau_k$, as long as the sequence $\{\sigma D_k\}$ upper bounds $\{G_q\}$ at all times. For simplicity of presentation, we keep the design in Lemma 10.

B.2.2 Step 2: Characterizing the dynamics of $Q^A_t(s,a) - Q^*(s,a)$

We characterize the dynamics of the iteration residual $r_t(s,a) := Q^A_t(s,a) - Q^*(s,a)$ as an SA algorithm in Lemma 11 below. Since not all iterations contribute to the error propagation, due to the random update between the two Q-estimators, we introduce the following notation to label the valid iterations.

Definition 1.
We define $T^A$ as the collection of iterations updating $Q^A$. In addition, we denote by $T^A(t_1,t_2)$ the set of iterations updating $Q^A$ between time $t_1$ and $t_2$; that is,
$$T^A(t_1,t_2) = \big\{t : t\in[t_1,t_2] \text{ and } t\in T^A\big\}.$$
Correspondingly, the number of iterations updating $Q^A$ between time $t_1$ and $t_2$ is the cardinality of $T^A(t_1,t_2)$, denoted by $|T^A(t_1,t_2)|$.

Lemma 11.
Consider double Q-learning in Algorithm 1. Then we have
$$r_{t+1}(s,a) = \begin{cases} r_t(s,a), & t\notin T^A;\\ (1-\alpha_t)r_t(s,a) + \alpha_t\big(\mathcal T Q^A_t(s,a) - Q^*(s,a)\big) + \alpha_t w_t(s,a) + \alpha_t\gamma u^{BA}_t(s',a^*), & t\in T^A,\end{cases}$$
where $w_t(s,a) = \mathcal T_t Q^A_t(s,a) - \mathcal T Q^A_t(s,a)$ and $u^{BA}_t(s,a) = Q^B_t(s,a) - Q^A_t(s,a)$.

Proof. Following from Algorithm 1, for $t\in T^A$ we have
$$Q^A_{t+1}(s,a) = Q^A_t(s,a) + \alpha_t\big(R_t + \gamma Q^B_t(s',a^*) - Q^A_t(s,a)\big) = (1-\alpha_t)Q^A_t(s,a) + \alpha_t\big(R_t + \gamma Q^A_t(s',a^*)\big) + \alpha_t\big(\gamma Q^B_t(s',a^*) - \gamma Q^A_t(s',a^*)\big) \overset{(i)}{=} (1-\alpha_t)Q^A_t(s,a) + \alpha_t\big(\mathcal T_t Q^A_t(s,a) + \gamma u^{BA}_t(s',a^*)\big) = (1-\alpha_t)Q^A_t(s,a) + \alpha_t\mathcal T Q^A_t(s,a) + \alpha_t\big(\mathcal T_t Q^A_t(s,a) - \mathcal T Q^A_t(s,a)\big) + \alpha_t\gamma u^{BA}_t(s',a^*) = (1-\alpha_t)Q^A_t(s,a) + \alpha_t\mathcal T Q^A_t(s,a) + \alpha_t w_t(s,a) + \alpha_t\gamma u^{BA}_t(s',a^*),$$
where (i) follows because we denote $\mathcal T_t Q^A_t(s,a) = R_t + \gamma Q^A_t(s',a^*)$. By subtracting $Q^*(s,a)$ from both sides, we complete the proof.

B.2.3 Step 3: Constructing sandwich bounds on $r_t(s,a)$

We provide upper and lower bounds on $r_t$ by constructing two sequences $Y_{t;\tau_k}$ and $W_{t;\tau_k}$ in the following lemma.

Lemma 12.
Let $\tau_k$ be such that $\|r_t\|\le D_k$ for all $t\ge\tau_k$. Suppose that we have $\|u^{BA}_t\|\le\sigma D_k$ with $\sigma=\frac{1-\gamma}{2\gamma}$ for all $t\ge\tau_k$. Define $W_{t;\tau_k}(s,a)$ as
\[
W_{t+1;\tau_k}(s,a)=\begin{cases} W_{t;\tau_k}(s,a), & t\notin T^A;\\ (1-\alpha_t)W_{t;\tau_k}(s,a)+\alpha_t w_t(s,a), & t\in T^A,\end{cases}
\]
where $W_{\tau_k;\tau_k}(s,a)=0$, and define $Y_{t;\tau_k}(s,a)$ as
\[
Y_{t+1;\tau_k}(s,a)=\begin{cases} Y_{t;\tau_k}(s,a), & t\notin T^A;\\ (1-\alpha_t)Y_{t;\tau_k}(s,a)+\alpha_t\gamma''D_k, & t\in T^A,\end{cases}
\]
where $Y_{\tau_k;\tau_k}(s,a)=D_k$ and $\gamma''=\gamma(1+\sigma)$. Then for any $t\ge\tau_k$ and any state-action pair $(s,a)$, we have
\[
-Y_{t;\tau_k}(s,a)+W_{t;\tau_k}(s,a)\le r_t(s,a)\le Y_{t;\tau_k}(s,a)+W_{t;\tau_k}(s,a).
\]

Proof. We proceed by induction. For the initial step $t=\tau_k$, we have $\|r_{\tau_k}\|\le D_k$, and thus it holds that $-D_k\le r_{\tau_k}(s,a)\le D_k$. Suppose the sandwich bound holds at time $t\ge\tau_k$. It remains to check that the bound holds at $t+1$.

If $t\notin T^A$, then $r_{t+1}(s,a)=r_t(s,a)$, $W_{t+1;\tau_k}(s,a)=W_{t;\tau_k}(s,a)$, and $Y_{t+1;\tau_k}(s,a)=Y_{t;\tau_k}(s,a)$. Thus the sandwich bound still holds.

If $t\in T^A$, we have
\begin{align*}
r_{t+1}(s,a)&=(1-\alpha_t)r_t(s,a)+\alpha_t\big(\mathcal{T}Q^A_t(s,a)-Q^*(s,a)\big)+\alpha_t w_t(s,a)+\alpha_t\gamma u^{BA}_t(s',a^*)\\
&\le(1-\alpha_t)\big(Y_{t;\tau_k}(s,a)+W_{t;\tau_k}(s,a)\big)+\alpha_t\big\|\mathcal{T}Q^A_t-Q^*\big\|+\alpha_t w_t(s,a)+\alpha_t\gamma\big\|u^{BA}_t\big\|\\
&\overset{\text{(i)}}{\le}(1-\alpha_t)\big(Y_{t;\tau_k}(s,a)+W_{t;\tau_k}(s,a)\big)+\alpha_t\gamma\|r_t\|+\alpha_t w_t(s,a)+\alpha_t\gamma\big\|u^{BA}_t\big\|\\
&\overset{\text{(ii)}}{\le}(1-\alpha_t)Y_{t;\tau_k}(s,a)+\alpha_t\gamma(1+\sigma)D_k+(1-\alpha_t)W_{t;\tau_k}(s,a)+\alpha_t w_t(s,a)\\
&\le Y_{t+1;\tau_k}(s,a)+W_{t+1;\tau_k}(s,a),
\end{align*}
where (i) follows from the contraction property of the Bellman operator, and (ii) follows from the condition $\|u^{BA}_t\|\le\sigma D_k$.

Similarly, we can bound the other direction as
\begin{align*}
r_{t+1}(s,a)&=(1-\alpha_t)r_t(s,a)+\alpha_t\big(\mathcal{T}Q^A_t(s,a)-Q^*(s,a)\big)+\alpha_t w_t(s,a)+\alpha_t\gamma u^{BA}_t(s',a^*)\\
&\ge(1-\alpha_t)\big(-Y_{t;\tau_k}(s,a)+W_{t;\tau_k}(s,a)\big)-\alpha_t\big\|\mathcal{T}Q^A_t-Q^*\big\|+\alpha_t w_t(s,a)-\alpha_t\gamma\big\|u^{BA}_t\big\|\\
&\ge(1-\alpha_t)\big(-Y_{t;\tau_k}(s,a)+W_{t;\tau_k}(s,a)\big)-\alpha_t\gamma\|r_t\|+\alpha_t w_t(s,a)-\alpha_t\gamma\big\|u^{BA}_t\big\|\\
&\ge-(1-\alpha_t)Y_{t;\tau_k}(s,a)-\alpha_t\gamma(1+\sigma)D_k+(1-\alpha_t)W_{t;\tau_k}(s,a)+\alpha_t w_t(s,a)\\
&\ge-Y_{t+1;\tau_k}(s,a)+W_{t+1;\tau_k}(s,a).
\end{align*}

B.2.4 Step 4: Bounding $Y_{t;\tau_k}(s,a)$ and $W_{t;\tau_k}(s,a)$ for epoch $k+1$

Similarly to Steps 3 and 4 in Part I, we conditionally bound $\|r_t\|\le D_k$ for $t\in[\tau_k,\tau_{k+1})$ and $k=0,1,2,\dots$ by induction arguments followed by the union bound. We first bound $Y_{t;\tau_k}(s,a)$ and $W_{t;\tau_k}(s,a)$ in Lemma 13 and Lemma 14, respectively.

Lemma 13.
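The induction in Lemma 12 can be illustrated by a scalar simulation. This sketch uses made-up constants ($\gamma=0.8$, $D_k=1$, and a noise range small enough that $\|r_t\|\le D_k$ is preserved); it drives the three recursions jointly and checks the sandwich bound at every step:

```python
import random

random.seed(1)
gamma = 0.8
sigma = (1 - gamma) / (2 * gamma)         # sigma = (1-gamma)/(2*gamma)
D = 1.0                                   # D_k, with |r_t| <= D assumed at tau_k
gpp = gamma * (1 + sigma)                 # gamma'' = gamma(1 + sigma)

r, Y, W = random.uniform(-D, D), D, 0.0
for t in range(1, 500):
    alpha = 1.0 / t ** 0.7                # polynomial learning rate, omega = 0.7
    if random.random() < 0.5:             # t in T^A: Q^A is updated this step
        bellman = random.uniform(-gamma * abs(r), gamma * abs(r))  # |T Q^A - Q*| <= gamma*|r_t|
        w = random.uniform(-0.05, 0.05)   # |w_t| <= (1 - gamma'')*D/2 keeps |r_t| <= D
        u = random.uniform(-sigma * D, sigma * D)                  # |u_t^{BA}| <= sigma*D
        r = (1 - alpha) * r + alpha * bellman + alpha * w + alpha * gamma * u
        Y = (1 - alpha) * Y + alpha * gpp * D
        W = (1 - alpha) * W + alpha * w
    assert -Y + W - 1e-9 <= r <= Y + W + 1e-9
print("sandwich bound holds")
```

The noise range is chosen so that the standing assumption $\|r_t\|\le D_k$ of the lemma is never violated during the run; with a larger range the premise, and hence the bound, need not hold.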
Fix $\kappa\in(\ln 2,1)$ and $\Delta\in(0,e^\kappa-2)$. Let $\{D_k\}$ be defined in Lemma 10. Consider synchronous double Q-learning using a polynomial learning rate $\alpha_t=\frac{1}{t^\omega}$ with $\omega\in(0,1)$. Suppose that $Y_{t;\tau_k}(s,a)\le D_k$ for any $t\ge\tau_k$. At block $k$, we assume that there are at least $c\tau_k^\omega$ iterations updating $Q^A$, i.e., $|T^A(\tau_k,\tau_{k+1})|\ge c\tau_k^\omega$. Then for any $t\in[\tau_{k+1},\tau_{k+2})$, we have
\[
Y_{t;\tau_k}(s,a)\le\Big(\gamma''+\frac{2}{2+\Delta}\beta\Big)D_k.
\]
Proof.
Since we have defined $\tau_k=\hat\tau_k$ in Lemma 10, we have $\tau_{k+1}=\tau_k+\frac{2c}{\kappa}\tau_k^\omega$. Observe that $Y_{\tau_k;\tau_k}(s,a)=D_k=\gamma''D_k+(1-\gamma'')D_k:=\gamma''D_k+\rho_{\tau_k}$. We can rewrite the dynamics of $Y_{t;\tau_k}(s,a)$ as
\[
Y_{t+1;\tau_k}(s,a)=\begin{cases} Y_{t;\tau_k}(s,a), & t\notin T^A;\\ (1-\alpha_t)Y_{t;\tau_k}(s,a)+\alpha_t\gamma''D_k=\gamma''D_k+(1-\alpha_t)\rho_t, & t\in T^A,\end{cases}
\]
where $\rho_{t+1}=(1-\alpha_t)\rho_t$ for $t\in T^A$. By the definition of $\rho_t$, we obtain
\begin{align*}
\rho_t&=\rho_{\tau_k}\prod_{i\in T^A(\tau_k,t-1)}(1-\alpha_i)=(1-\gamma'')D_k\prod_{i\in T^A(\tau_k,t-1)}(1-\alpha_i)=(1-\gamma'')D_k\prod_{i\in T^A(\tau_k,t-1)}\Big(1-\frac{1}{i^\omega}\Big)\\
&\overset{\text{(i)}}{\le}(1-\gamma'')D_k\prod_{i\in T^A(\tau_k,\tau_{k+1}-1)}\Big(1-\frac{1}{i^\omega}\Big)\qquad(14)\\
&\overset{\text{(ii)}}{\le}(1-\gamma'')D_k\prod_{i=\tau_{k+1}-c\tau_k^\omega}^{\tau_{k+1}-1}\Big(1-\frac{1}{i^\omega}\Big)\\
&\overset{\text{(iii)}}{\le}(1-\gamma'')D_k\exp\Big(-\frac{c\tau_k^\omega-1}{(\tau_{k+1}-1)^\omega}\Big)\le(1-\gamma'')D_k\exp\Big(-\frac{c\tau_k^\omega-1}{\tau_{k+1}^\omega}\Big)\\
&=(1-\gamma'')D_k\exp\Big(-c\Big(\frac{\tau_k}{\tau_{k+1}}\Big)^\omega+\frac{1}{\tau_{k+1}^\omega}\Big)\overset{\text{(iv)}}{\le}(1-\gamma'')D_k\exp\Bigg(-\frac{c}{1+\frac{2c}{\kappa}}+\frac{1}{\hat\tau_1^\omega}\Bigg),
\end{align*}
where (i) follows because $\alpha_i<1$ and $t\ge\tau_{k+1}$, (ii) follows because $|T^A(\tau_k,\tau_{k+1}-1)|\ge c\tau_k^\omega$, where $T^A(t_1,t_2)$ and $|T^A(t_1,t_2)|$ are defined in Definition 1, (iii) follows from Lemma 9, and (iv) holds because $\tau_{k+1}\ge\hat\tau_1$ and
\[
\Big(\frac{\tau_k}{\tau_{k+1}}\Big)^\omega\ge\frac{\tau_k}{\tau_{k+1}}=\frac{\tau_k}{\tau_k+\frac{2c}{\kappa}\tau_k^\omega}\ge\frac{1}{1+\frac{2c}{\kappa}}.
\]
Next we check the value of the power $-\frac{c}{1+2c/\kappa}+\frac{1}{\hat\tau_1^\omega}$. Since $\kappa\in(\ln 2,1)$ and $\Delta\in(0,e^\kappa-2)$, we have $\ln(2+\Delta)\in(0,\kappa)$. Further, observing $\hat\tau_1^\omega>\frac{1}{\kappa-\ln(2+\Delta)}$, we obtain $\ln(2+\Delta)+\frac{1}{\hat\tau_1^\omega}\in(0,\kappa)$. Last, since
\[
c\ge\frac{\kappa\big(\ln(2+\Delta)+1/\hat\tau_1^\omega\big)}{2\big(\kappa-\ln(2+\Delta)-1/\hat\tau_1^\omega\big)},
\]
we have $-\frac{c}{1+2c/\kappa}+\frac{1}{\hat\tau_1^\omega}\le-\ln(2+\Delta)$. Thus, we have $\rho_t\le\frac{1}{2+\Delta}(1-\gamma'')D_k$. Finally, we finish the proof by further observing that $1-\gamma''=2\beta$.

It remains to bound $|W_{t;\tau_k}(s,a)|\le\big(\frac{\Delta}{2+\Delta}\big)\beta D_k$ for $t\in[\tau_{k+1},\tau_{k+2})$. Combining the bounds of $Y_{t;\tau_k}$ and $W_{t;\tau_k}$ yields $(\gamma''+\beta)D_k=(1-\beta)D_k=D_{k+1}$. Since $W_{t;\tau_k}$ is stochastic, we need to derive the probability for the bound to hold. To this end, we first rewrite the dynamics of $W_{t;\tau_k}$ defined in Lemma 12 as
\[
W_{t;\tau_k}(s,a)=\sum_{i\in T^A(\tau_k,t-1)}\alpha_i\prod_{j\in T^A(i+1,t-1)}(1-\alpha_j)\,w_i(s,a).
\]
Next, we introduce a new sequence $\{W^l_{t;\tau_k}(s,a)\}$ as
\[
W^l_{t;\tau_k}(s,a)=\sum_{i\in T^A(\tau_k,\tau_k+l)}\alpha_i\prod_{j\in T^A(i+1,t-1)}(1-\alpha_j)\,w_i(s,a).
\]
Thus we have $W_{t;\tau_k}(s,a)=W^{t-1-\tau_k}_{t;\tau_k}(s,a)$. Then we have the following lemma.

Lemma 14.
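Step (iii) above rests on the elementary bound $\prod_i(1-1/i^\omega)\le\exp(-\sum_i 1/i^\omega)$, which follows from $1-x\le e^{-x}$ applied term by term. A quick numerical check with arbitrary index ranges and exponents:

```python
import math

def prod_bound(start, end, omega):
    """Compare prod_{i=start}^{end} (1 - 1/i^omega) against exp(-sum 1/i^omega)."""
    prod, s = 1.0, 0.0
    for i in range(start, end + 1):
        prod *= 1.0 - 1.0 / i ** omega
        s += 1.0 / i ** omega
    return prod, math.exp(-s)

for omega in (0.3, 0.5, 0.8):
    p, e = prod_bound(10, 200, omega)
    assert p <= e                          # 1 - x <= exp(-x) term by term
print("product bound verified")
```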
For any $t\in[\tau_{k+1},\tau_{k+2}]$ and $1\le l\le t-\tau_k-1$, $\{W^l_{t;\tau_k}(s,a)\}$ is a martingale sequence and satisfies
\[
\big|W^l_{t;\tau_k}(s,a)-W^{l-1}_{t;\tau_k}(s,a)\big|\le\frac{V_{\max}}{\tau_k^\omega}.
\]
Proof.
Observe that
\[
W^l_{t;\tau_k}(s,a)-W^{l-1}_{t;\tau_k}(s,a)=\begin{cases} 0, & \tau_k+l-1\notin T^A;\\ \alpha_{\tau_k+l}\prod_{j\in T^A(\tau_k+l+1,t-1)}(1-\alpha_j)\,w_{\tau_k+l}(s,a), & \tau_k+l-1\in T^A.\end{cases}
\]
Since $\mathbb{E}[w_t|\mathcal{F}_{t-1}]=0$, we have $\mathbb{E}\big[W^l_{t;\tau_k}(s,a)-W^{l-1}_{t;\tau_k}(s,a)\,\big|\,\mathcal{F}_{\tau_k+l-1}\big]=0$. Hence $\{W^l_{t;\tau_k}(s,a)\}$ is a martingale sequence. In addition, since $l\ge1$ and $\alpha_t\in(0,1)$, we have
\[
\alpha_{\tau_k+l}\prod_{j\in T^A(\tau_k+l+1,t-1)}(1-\alpha_j)\le\alpha_{\tau_k+l}\le\alpha_{\tau_k}=\frac{1}{\tau_k^\omega}.
\]
Further, we obtain $|w_t(s,a)|=|\mathcal{T}_tQ^A_t(s,a)-\mathcal{T}Q^A_t(s,a)|\le V_{\max}$. Thus
\[
\big|W^l_{t;\tau_k}(s,a)-W^{l-1}_{t;\tau_k}(s,a)\big|\le\alpha_{\tau_k+l}\,|w_{\tau_k+l}(s,a)|\le\frac{V_{\max}}{\tau_k^\omega}.
\]

Next, we bound $W_{t;\tau_k}(s,a)$. Fix $\tilde\epsilon>0$. Then for any $t\in[\tau_{k+1},\tau_{k+2})$, we have
\begin{align*}
&P\big[|W_{t;\tau_k}(s,a)|>\tilde\epsilon\,\big|\,t\in[\tau_{k+1},\tau_{k+2}),E,F\big]\\
&\overset{\text{(i)}}{\le}2\exp\Bigg(-\frac{\tilde\epsilon^2}{2\sum_{l:\tau_k+l-1\in T^A(\tau_k,t-1)}\big(W^l_{t;\tau_k}(s,a)-W^{l-1}_{t;\tau_k}(s,a)\big)^2+2\big(W^{\min(T^A(\tau_k,t-1))}_{t;\tau_k}(s,a)\big)^2}\Bigg)\\
&\overset{\text{(ii)}}{\le}2\exp\Big(-\frac{\tilde\epsilon^2\tau_k^{2\omega}}{2(|T^A(\tau_k,t-1)|+1)V_{\max}^2}\Big)\overset{\text{(iii)}}{\le}2\exp\Big(-\frac{\tilde\epsilon^2\tau_k^{2\omega}}{2(t+1-\tau_k)V_{\max}^2}\Big)\\
&\le2\exp\Big(-\frac{\tilde\epsilon^2\tau_k^{2\omega}}{2(\tau_{k+2}-\tau_k)V_{\max}^2}\Big)\overset{\text{(iv)}}{\le}2\exp\Bigg(-\frac{\kappa^2\tilde\epsilon^2\tau_k^\omega}{8c(c+\kappa)V_{\max}^2}\Bigg),
\end{align*}
where (i) follows from Lemma 7, (ii) follows from Lemma 14, (iii) follows because $|T^A(t_1,t_2)|\le t_2-t_1+1$, and (iv) holds because
\[
\tau_{k+2}-\tau_k=\frac{2c}{\kappa}\tau_{k+1}^\omega+\frac{2c}{\kappa}\tau_k^\omega=\frac{2c}{\kappa}\Big(\tau_k+\frac{2c}{\kappa}\tau_k^\omega\Big)^\omega+\frac{2c}{\kappa}\tau_k^\omega\le\frac{4c(c+\kappa)}{\kappa^2}\tau_k^\omega.
\]

Proof of Proposition 2
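The Azuma-Hoeffding inequality used above states that a martingale $M_n$ with increments bounded by $d$ satisfies $P(|M_n|>\tilde\epsilon)\le2\exp\big(-\tilde\epsilon^2/(2nd^2)\big)$. The Monte Carlo sketch below, with toy parameters rather than the paper's, compares the empirical tail of a $\pm d$ random walk against this bound:

```python
import math
import random

random.seed(2)
n, d, eps, trials = 200, 0.1, 3.0, 5000
exceed = 0
for _ in range(trials):
    m = 0.0
    for _ in range(n):
        m += random.choice((-d, d))       # martingale with |increments| <= d
    if abs(m) > eps:
        exceed += 1
empirical = exceed / trials
azuma = 2 * math.exp(-eps ** 2 / (2 * n * d ** 2))
assert empirical <= azuma
print(f"empirical {empirical:.4f} <= Azuma bound {azuma:.4f}")
```

In the proof the increments are not identically distributed, but Azuma's inequality only requires the boundedness checked in Lemma 14.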
Now we bound $\|r_t\|$ by combining the bounds of $Y_{t;\tau_k}$ and $W_{t;\tau_k}$. Applying the union bound in Lemma 8 yields
\begin{align*}
&P\Big[\forall(s,a),\forall k\in[0,m],\forall t\in[\tau_{k+1},\tau_{k+2}),\;|W_{t;\tau_k}(s,a)|\le\frac{\Delta}{2+\Delta}\beta D_k\,\Big|\,E,F\Big]\\
&\ge1-\sum_{k=0}^m|S||A|(\tau_{k+2}-\tau_{k+1})\cdot P\Big[|W_{t;\tau_k}(s,a)|>\frac{\Delta}{2+\Delta}\beta D_k\,\Big|\,t\in[\tau_{k+1},\tau_{k+2}),E,F\Big]\\
&\ge1-\sum_{k=0}^m|S||A|\frac{2c}{\kappa}\tau_{k+1}^\omega\cdot2\exp\Bigg(-\frac{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\beta^2D_k^2\tau_k^\omega}{8c(c+\kappa)V_{\max}^2}\Bigg)\\
&\ge1-\sum_{k=0}^m|S||A|\frac{2c}{\kappa}\Big(1+\frac{2c}{\kappa}\Big)\tau_k^\omega\cdot2\exp\Bigg(-\frac{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\beta^2D_k^2\tau_k^\omega}{8c(c+\kappa)V_{\max}^2}\Bigg)\\
&\overset{\text{(i)}}{\ge}1-\sum_{k=0}^m|S||A|\frac{2c}{\kappa}\Big(1+\frac{2c}{\kappa}\Big)\tau_k^\omega\cdot2\exp\Bigg(-\frac{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\beta^2\epsilon^2\tau_k^\omega}{8c(c+\kappa)V_{\max}^2}\Bigg)\qquad(15)\\
&\overset{\text{(ii)}}{\ge}1-\frac{2c}{\kappa}\Big(1+\frac{2c}{\kappa}\Big)\sum_{k=0}^m|S||A|\exp\Bigg(-\frac{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\beta^2\epsilon^2\tau_k^\omega}{16c(c+\kappa)V_{\max}^2}\Bigg)\\
&\ge1-\frac{2c(m+1)}{\kappa}\Big(1+\frac{2c}{\kappa}\Big)|S||A|\exp\Bigg(-\frac{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\beta^2\epsilon^2\hat\tau_1^\omega}{16c(c+\kappa)V_{\max}^2}\Bigg),
\end{align*}
where (i) follows because $D_k\ge D_m\ge\epsilon$, and (ii) follows from Lemma 9 by substituting $a=\frac{8c(c+\kappa)V_{\max}^2}{\kappa^2(\frac{\Delta}{2+\Delta})^2\beta^2\epsilon^2}$, $b=1$, and observing that
\[
\tau_k^\omega\ge\hat\tau_1^\omega\ge\frac{16c(c+\kappa)V_{\max}^2}{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\beta^2\epsilon^2}\ln\frac{8c(c+\kappa)V_{\max}^2}{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\beta^2\epsilon^2}=2ab\ln ab.
\]
Note that $Y_{t;\tau_k}(s,a)$ is deterministic. We complete this proof by observing that
\begin{align*}
&P\big[\forall k\in[0,m],\forall t\in[\tau_{k+1},\tau_{k+2}),\;\|Q^A_t-Q^*\|\le D_{k+1}\,\big|\,E,F\big]\\
&\ge P\Big[\forall(s,a),\forall k\in[0,m],\forall t\in[\tau_{k+1},\tau_{k+2}),\;|W_{t;\tau_k}(s,a)|\le\frac{\Delta}{2+\Delta}\beta D_k\,\Big|\,E,F\Big].
\end{align*}

B.3 Part III: Bounding $\|Q^A_t-Q^*\|$

We combine the results in the first two parts, and provide a high-probability bound on $\|r_t\|$ with further probabilistic arguments, which exploit the high-probability bounds on $P(E)$ in Proposition 1 and on $P(F)$ in the following lemma.

Lemma 15.
Let the sequence $\tau_k$ be the same as given in Lemma 10, i.e., $\tau_{k+1}=\tau_k+\frac{2c}{\kappa}\tau_k^\omega$ for $k\ge1$. Then we have
\[
P\big[\forall k\in[1,m],\;I^A_k\ge c\tau_k^\omega\big]\ge1-m\exp\Big(-\frac{(1-\kappa)^2c\tau_1^\omega}{\kappa}\Big),
\]
where $I^A_k$ denotes the number of iterations updating $Q^A$ at epoch $k$.

Proof. The event of updating $Q^A$ is a binomial random variable. To be specific, at iteration $t$ we define
\[
J^A_t=\begin{cases}1, & \text{updating }Q^A;\\ 0, & \text{updating }Q^B.\end{cases}
\]
Clearly, the events are independent across iterations. Therefore, for a given epoch $[\tau_k,\tau_{k+1})$, $I^A_k=\sum_{t=\tau_k}^{\tau_{k+1}-1}J^A_t$ is a binomial random variable with distribution $\text{Binomial}(\tau_{k+1}-\tau_k,0.5)$.

In the following, we use the tail bound of a binomial random variable: if $X\sim\text{Binomial}(n,p)$, then by Hoeffding's inequality we have $P(X\le x)\le\exp\big(-\frac{2(np-x)^2}{n}\big)$ for $x<np$, which implies $P(X\le\kappa np)\le\exp\big(-2np^2(1-\kappa)^2\big)$ for any fixed $\kappa\in(0,1)$.

If $k=0$, $I^A_0\sim\text{Binomial}(\tau_1,0.5)$. Thus the tail bound yields
\[
P\Big[I^A_0\le\kappa\cdot\frac{\tau_1}{2}\Big]\le\exp\Big(-\frac{(1-\kappa)^2\tau_1}{2}\Big).
\]
If $k\ge1$, since $\tau_{k+1}-\tau_k=\frac{2c}{\kappa}\tau_k^\omega$, we have $I^A_k\sim\text{Binomial}\big(\frac{2c}{\kappa}\tau_k^\omega,0.5\big)$. Thus the tail bound of a binomial random variable gives
\[
P\Big[I^A_k\le\kappa\cdot\frac{c}{\kappa}\tau_k^\omega\Big]\le\exp\Big(-\frac{(1-\kappa)^2c\tau_k^\omega}{\kappa}\Big).
\]
Then by the union bound, we have
\begin{align*}
P\big[\forall k\in[1,m],\;I^A_k\ge c\tau_k^\omega\big]&=P\Big[\forall k\in[1,m],\;I^A_k\ge\kappa\cdot\frac{c}{\kappa}\tau_k^\omega\Big]\\
&\ge1-\sum_{k=1}^m\exp\Big(-\frac{(1-\kappa)^2c\tau_k^\omega}{\kappa}\Big)\ge1-m\exp\Big(-\frac{(1-\kappa)^2c\tau_1^\omega}{\kappa}\Big).
\end{align*}
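The Hoeffding tail bound invoked in the proof, $P(X\le\kappa np)\le\exp(-2np^2(1-\kappa)^2)$ for $X\sim\text{Binomial}(n,p)$, can be checked against the exact CDF. A small sketch with hypothetical parameters ($n=400$, $p=1/2$, $\kappa=0.8$):

```python
import math

def binom_cdf(n, k):
    """Exact P(X <= k) for X ~ Binomial(n, 1/2)."""
    total = sum(math.comb(n, i) for i in range(0, k + 1))
    return total / 2 ** n

n, p, kappa = 400, 0.5, 0.8
x = int(kappa * n * p)                     # threshold kappa * n * p
exact = binom_cdf(n, x)
hoeffding = math.exp(-2 * n * p ** 2 * (1 - kappa) ** 2)
assert exact <= hoeffding
print(f"exact tail {exact:.3e} <= Hoeffding {hoeffding:.3e}")
```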
We further give the following Lemma 16 and Lemma 17 before proving Theorem 1. Lemma 16 characterizes the number of blocks needed to achieve $\epsilon$-accuracy given $D_k$ defined in Lemma 10.

Lemma 16.
Let $D_{k+1}=(1-\beta)D_k$ with $\beta=\frac{1-\gamma}{4}$ and $D_0=\frac{2\gamma V_{\max}}{1-\gamma}$. Then for $m\ge\frac{4}{1-\gamma}\ln\frac{2\gamma V_{\max}}{\epsilon(1-\gamma)}$, we have $D_m\le\epsilon$.

Proof. By the definition of $D_k$, we have $D_k=(1-\beta)^kD_0$. Then we obtain
\[
D_k\le\epsilon\iff(1-\beta)^kD_0\le\epsilon\iff\Big(\frac{1}{1-\beta}\Big)^k\ge\frac{D_0}{\epsilon}\iff k\ge\frac{\ln(D_0/\epsilon)}{\ln(1/(1-\beta))}.
\]
Further, observe that $x\le\ln\frac{1}{1-x}$ for $x\in(0,1)$. Thus it suffices to take
\[
k\ge\frac{1}{\beta}\ln\frac{D_0}{\epsilon}=\frac{4}{1-\gamma}\ln\frac{2\gamma V_{\max}}{\epsilon(1-\gamma)}.
\]

From the above lemma, it suffices to find the starting time of epoch $m^*=\big\lceil\frac{4}{1-\gamma}\ln\frac{2\gamma V_{\max}}{\epsilon(1-\gamma)}\big\rceil$. The next lemma is useful for calculating the total number of iterations given the initial epoch length and the number of epochs.

Lemma 17. (Even-Dar and Mansour, 2003, Lemma 32) Consider a sequence $\{x_k\}$ satisfying
\[
x_{k+1}=x_k+cx_k^\omega=x_0+\sum_{i=0}^{k}cx_i^\omega.
\]
Then for any constant $\omega\in(0,1)$, we have
\[
x_k=O\Big(\big(x_0^{1-\omega}+ck\big)^{\frac{1}{1-\omega}}\Big)=O\Big(x_0+(ck)^{\frac{1}{1-\omega}}\Big).
\]

Proof of Theorem 1
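The growth rate in Lemma 17 can be verified numerically: iterating $x_{k+1}=x_k+cx_k^\omega$ and comparing against the envelope $(x_0^{1-\omega}+c(1-\omega)k)^{1/(1-\omega)}$, which follows from $(1+z)^{1-\omega}\le1+(1-\omega)z$ applied to $x_{k+1}^{1-\omega}$. The constants below are hypothetical:

```python
def epoch_length(x0, c, omega, k):
    """Iterate x_{k+1} = x_k + c * x_k^omega for k steps (Lemma 17 recursion)."""
    x = float(x0)
    for _ in range(k):
        x += c * x ** omega
    return x

x0, c, omega, k = 10.0, 2.0, 0.6, 1000
x_k = epoch_length(x0, c, omega, k)

# upper envelope: x_k^{1-omega} increases by at most c(1-omega) per step
upper = (x0 ** (1 - omega) + c * (1 - omega) * k) ** (1 / (1 - omega))
assert x_k <= upper                        # consistent with the O-bound of Lemma 17
assert x_k >= x0 + c * k * x0 ** omega     # each step adds at least c * x0^omega
print(f"x_k = {x_k:.1f}, envelope = {upper:.1f}")
```

This is what turns the number of epochs $m^*$ into the total iteration count $T$ in the proof of Theorem 1 below.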
Now we are ready to prove Theorem 1 based on the results obtained so far. Let $m^*=\big\lceil\frac{4}{1-\gamma}\ln\frac{2\gamma V_{\max}}{\epsilon(1-\gamma)}\big\rceil$; then $G_{m^*-1}\ge\sigma\epsilon$ and $D_{m^*-1}\ge\epsilon$. Thus we obtain
\begin{align*}
&P\big(\big\|Q^A_{\tau_{m^*}}-Q^*\big\|\le\epsilon\big)\\
&\ge P\big[\forall k\in[0,m^*-1],\forall t\in[\tau_{k+1},\tau_{k+2}),\;\|Q^A_t-Q^*\|\le D_{k+1}\big]\\
&=P\big[\forall k\in[0,m^*-1],\forall t\in[\tau_{k+1},\tau_{k+2}),\;\|Q^A_t-Q^*\|\le D_{k+1}\,\big|\,E,F\big]\cdot P(E\cap F)\\
&\ge P\big[\forall k\in[0,m^*-1],\forall t\in[\tau_{k+1},\tau_{k+2}),\;\|Q^A_t-Q^*\|\le D_{k+1}\,\big|\,E,F\big]\cdot\big(P(E)+P(F)-1\big)\\
&\overset{\text{(i)}}{\ge}P\big[\forall k\in[0,m^*-1],\forall t\in[\tau_{k+1},\tau_{k+2}),\;\|Q^A_t-Q^*\|\le D_{k+1}\,\big|\,E,F\big]\\
&\qquad\cdot\big(P\big[\forall q\in[0,m^*-1],\forall t\in[\hat\tau_{q+1},\hat\tau_{q+2}),\;\|Q^B_t-Q^A_t\|\le G_{q+1}\big]+P(F)-1\big)\\
&\overset{\text{(ii)}}{\ge}\Bigg(1-\frac{2cm^*}{\kappa}\Big(1+\frac{2c}{\kappa}\Big)|S||A|\exp\Bigg(-\frac{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\beta^2\epsilon^2\hat\tau_1^\omega}{16c(c+\kappa)V_{\max}^2}\Bigg)\Bigg)\\
&\qquad\cdot\Bigg(1-\frac{2cm^*}{\kappa}\Big(1+\frac{2c}{\kappa}\Big)|S||A|\exp\Bigg(-\frac{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\xi^2\sigma^2\epsilon^2\hat\tau_1^\omega}{16c(c+\kappa)V_{\max}^2}\Bigg)-m^*\exp\Big(-\frac{(1-\kappa)^2c\hat\tau_1^\omega}{\kappa}\Big)\Bigg)\\
&\overset{\text{(iii)}}{\ge}1-\frac{6cm^*}{\kappa}\Big(1+\frac{2c}{\kappa}\Big)|S||A|\exp\Bigg(-\frac{\kappa^2(1-\kappa)^2\big(\frac{\Delta}{2+\Delta}\big)^2\xi^2\sigma^2\epsilon^2\hat\tau_1^\omega}{16c(c+\kappa)V_{\max}^2}\Bigg),
\end{align*}
where (i) follows from Lemma 10, (ii) follows from Propositions 1 and 2 and Lemma 15, and (iii) holds due to the facts that
\[
\frac{2cm^*}{\kappa}\Big(1+\frac{2c}{\kappa}\Big)|S||A|=\max\Big\{\frac{2cm^*}{\kappa}\Big(1+\frac{2c}{\kappa}\Big)|S||A|,\;m^*\Big\}
\]
and
\[
\frac{\kappa^2(1-\kappa)^2\big(\frac{\Delta}{2+\Delta}\big)^2\xi^2\sigma^2\epsilon^2\hat\tau_1^\omega}{16c(c+\kappa)V_{\max}^2}\le\min\Bigg\{\frac{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\beta^2\epsilon^2\hat\tau_1^\omega}{16c(c+\kappa)V_{\max}^2},\;\frac{(1-\kappa)^2c\hat\tau_1^\omega}{\kappa},\;\frac{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\xi^2\sigma^2\epsilon^2\hat\tau_1^\omega}{16c(c+\kappa)V_{\max}^2}\Bigg\}.
\]
By setting
\[
1-\frac{6cm^*}{\kappa}\Big(1+\frac{2c}{\kappa}\Big)|S||A|\exp\Bigg(-\frac{\kappa^2(1-\kappa)^2\big(\frac{\Delta}{2+\Delta}\big)^2\xi^2\sigma^2\epsilon^2\hat\tau_1^\omega}{16c(c+\kappa)V_{\max}^2}\Bigg)\ge1-\delta,
\]
we obtain
\[
\hat\tau_1\ge\Bigg(\frac{16c(c+\kappa)V_{\max}^2}{\kappa^2(1-\kappa)^2\big(\frac{\Delta}{2+\Delta}\big)^2\xi^2\sigma^2\epsilon^2}\ln\frac{12cm^*|S||A|(2c+\kappa)}{\kappa^2\delta}\Bigg)^{\frac{1}{\omega}}.
\]
Considering the conditions on $\hat\tau_1$ in Proposition 1 and Proposition 2, we choose
\[
\hat\tau_1=\Theta\Bigg(\bigg(\frac{V_{\max}^2}{(1-\gamma)^4\epsilon^2}\ln\frac{m^*|S||A|V_{\max}^2}{(1-\gamma)^4\epsilon^2\delta}\bigg)^{\frac{1}{\omega}}\Bigg).
\]
Finally, applying the number of epochs $m^*=\big\lceil\frac{4}{1-\gamma}\ln\frac{2\gamma V_{\max}}{\epsilon(1-\gamma)}\big\rceil$ and Lemma 17, we conclude that it suffices to let
\begin{align*}
T&=\Omega\Bigg(\bigg(\frac{V_{\max}^2}{(1-\gamma)^4\epsilon^2}\ln\frac{m^*|S||A|V_{\max}^2}{(1-\gamma)^4\epsilon^2\delta}\bigg)^{\frac{1}{\omega}}+\bigg(\frac{2c}{\kappa}\cdot\frac{4}{1-\gamma}\ln\frac{2\gamma V_{\max}}{(1-\gamma)\epsilon}\bigg)^{\frac{1}{1-\omega}}\Bigg)\\
&=\Omega\Bigg(\bigg(\frac{V_{\max}^2}{(1-\gamma)^4\epsilon^2}\ln\frac{|S||A|V_{\max}^2\ln\frac{V_{\max}}{(1-\gamma)\epsilon}}{(1-\gamma)^4\epsilon^2\delta}\bigg)^{\frac{1}{\omega}}+\bigg(\frac{1}{1-\gamma}\ln\frac{V_{\max}}{(1-\gamma)\epsilon}\bigg)^{\frac{1}{1-\omega}}\Bigg)\\
&=\Omega\Bigg(\bigg(\frac{V_{\max}^2}{(1-\gamma)^4\epsilon^2}\ln\frac{|S||A|V_{\max}^2}{(1-\gamma)^4\epsilon^2\delta}\bigg)^{\frac{1}{\omega}}+\bigg(\frac{1}{1-\gamma}\ln\frac{V_{\max}}{(1-\gamma)\epsilon}\bigg)^{\frac{1}{1-\omega}}\Bigg)
\end{align*}
to attain an $\epsilon$-accurate Q-estimator.

C Proof of Theorem 2
The main idea of this proof is similar to that of Theorem 1, with additional efforts to characterize the effects of asynchronous sampling. The proof also consists of three parts: (a) Part I, which analyzes the stochastic error propagation between the two Q-estimators $\|Q^B_t-Q^A_t\|$; (b) Part II, which analyzes the error dynamics between one Q-estimator and the optimum $\|Q^A_t-Q^*\|$ conditioned on the error event in Part I; and (c) Part III, which bounds the unconditional error $\|Q^A_t-Q^*\|$.

To proceed with the proof, we first introduce the following notion of valid iterations for any fixed state-action pair $(s,a)$.

Definition 2. We define $T(s,a)$ as the collection of iterations at which a state-action pair $(s,a)$ is used to update the Q-function $Q^A$ or $Q^B$, and $T^A(s,a)$ as the collection of iterations specifically updating $Q^A(s,a)$. In addition, we denote by $T(s,a,t_1,t_2)$ and $T^A(s,a,t_1,t_2)$ the sets of iterations updating $(s,a)$ and $Q^A(s,a)$, respectively, between time $t_1$ and $t_2$. That is,
\begin{align*}
T(s,a,t_1,t_2)&=\{t:t\in[t_1,t_2]\text{ and }t\in T(s,a)\},\\
T^A(s,a,t_1,t_2)&=\{t:t\in[t_1,t_2]\text{ and }t\in T^A(s,a)\}.
\end{align*}
Correspondingly, the number of iterations updating $(s,a)$ between time $t_1$ and $t_2$ equals the cardinality of $T(s,a,t_1,t_2)$, denoted as $|T(s,a,t_1,t_2)|$. Similarly, the number of iterations updating $Q^A(s,a)$ between time $t_1$ and $t_2$ is denoted as $|T^A(s,a,t_1,t_2)|$.

Given Assumption 1, we can obtain some properties of the quantities defined above.
Lemma 18.
It always holds that $|T(s,a,t_1,t_2)|\le t_2-t_1+1$ and $|T^A(s,a,t_1,t_2)|\le t_2-t_1+1$. In addition, suppose that Assumption 1 holds. Then we have $|T(s,a,t,t+2kL-1)|\ge k$ for any $t\ge0$.

Proof. In any $2L$ consecutive running iterations of Algorithm 1, either $Q^A$ or $Q^B$ is updated at least $L$ times. Then, following from Assumption 1, $(s,a)$ is visited at least once in every $2L$ running iterations of Algorithm 1, which immediately implies this lemma.

Now we proceed with the proof in three parts.

C.1 Part I: Bounding $\|Q^B_t-Q^A_t\|$

We upper bound $\|Q^B_t-Q^A_t\|$ block-wise using a decreasing sequence $\{G_q\}_{q\ge0}$ as defined in Proposition 3 below.

Proposition 3.
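The counting argument behind Lemma 18, namely that if $(s,a)$ is visited at least once in every $2L$ consecutive iterations then any window of $2kL$ iterations contains at least $k$ visits, can be checked on a randomly generated visit schedule (toy values of $L$ and $k$):

```python
import random

random.seed(3)
L, k = 5, 7
horizon = 10 * k * L

# build a visit schedule in which (s,a) is visited at least once every 2L steps
visits = set()
t = 0
while t < horizon:
    gap = random.randint(1, 2 * L)        # next visit occurs within 2L iterations
    t += gap
    visits.add(t)

# Lemma 18: any window [t0, t0 + 2kL - 1] contains at least k visits
for t0 in range(2 * L, horizon - 2 * k * L):
    window = [v for v in visits if t0 <= v <= t0 + 2 * k * L - 1]
    assert len(window) >= k
print("coverage count verified")
```

The check works because a window of length $2kL$ splits into $k$ disjoint stretches of $2L$ iterations, each of which contains at least one visit.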
Fix $\epsilon>0$, $\kappa\in(\ln 2,1)$, and $\Delta\in(0,e^\kappa-2)$. Consider asynchronous double Q-learning using a polynomial learning rate $\alpha_t=\frac{1}{t^\omega}$ with $\omega\in(0,1)$. Suppose that Assumption 1 holds. Let $G_q=(1-\xi)^qG_0$ with $G_0=V_{\max}$ and $\xi=\frac{1-\gamma}{4}$. Let $\hat\tau_{q+1}=\hat\tau_q+\frac{2cL}{\kappa}\hat\tau_q^\omega$ for $q\ge1$, with
\[
c\ge\frac{L\kappa\big(\ln(2+\Delta)+1/\hat\tau_1^\omega\big)}{2\big(\kappa-\ln(2+\Delta)-1/\hat\tau_1^\omega\big)}
\]
and with $\hat\tau_1$, the finishing time of the first block, satisfying
\[
\hat\tau_1\ge\max\Bigg\{\Big(\frac{1}{\kappa-\ln(2+\Delta)}\Big)^{\frac{1}{\omega}},\;\Bigg(\frac{16cL(cL+\kappa)V_{\max}^2}{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\xi^2\sigma^2\epsilon^2}\ln\frac{8cL(cL+\kappa)V_{\max}^2}{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\xi^2\sigma^2\epsilon^2}\Bigg)^{\frac{1}{\omega}}\Bigg\}.
\]
Then for any $n$ such that $G_n\ge\sigma\epsilon$, we have
\[
P\big[\forall q\in[0,n],\forall t\in[\hat\tau_{q+1},\hat\tau_{q+2}),\;\|Q^B_t-Q^A_t\|\le G_{q+1}\big]\ge1-\frac{2cL(n+1)}{\kappa}\Big(1+\frac{2cL}{\kappa}\Big)|S||A|\exp\Bigg(-\frac{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\xi^2\sigma^2\epsilon^2\hat\tau_1^\omega}{16cL(cL+\kappa)V_{\max}^2}\Bigg).
\]
The proof of Proposition 3 consists of the following steps. Since the main idea is similar to that of the proof of Proposition 1, we focus on pointing out the differences. We continue to use the notation $u^{BA}_t(s,a):=Q^B_t(s,a)-Q^A_t(s,a)$.

Step 1: Characterizing the dynamics of $u^{BA}_t$
First, we observe that when $(s,a)$ is visited at time $t$, i.e., $t\in T(s,a)$, Lemmas 2 and 3 still apply. Otherwise, $u^{BA}_t$ is not updated. Thus, we have
\[
u^{BA}_{t+1}(s,a)=\begin{cases}u^{BA}_t(s,a), & t\notin T(s,a);\\ (1-\alpha_t)u^{BA}_t(s,a)+\alpha_tF_t(s,a), & t\in T(s,a),\end{cases}
\]
where $F_t$ satisfies $\|\mathbb{E}[F_t|\mathcal{F}_t]\|\le\gamma\|u^{BA}_t\|$. For $t\in T(s,a)$, we rewrite the dynamics of $u^{BA}_t(s,a)$ as
\[
u^{BA}_{t+1}(s,a)=(1-\alpha_t)u^{BA}_t(s,a)+\alpha_tF_t=(1-\alpha_t)u^{BA}_t(s,a)+\alpha_t\big(h_t(s,a)+z_t(s,a)\big),
\]
where $h_t(s,a)=\mathbb{E}[F_t(s,a)|\mathcal{F}_t]$ and $z_t(s,a)=F_t(s,a)-\mathbb{E}[F_t(s,a)|\mathcal{F}_t]$.

In the following steps, we proceed with the proof of Proposition 3 by induction. Given $G_q$ defined in Proposition 3, $\|u^{BA}_t\|\le G_0$ holds for all $t$, and thus it holds for $t\in[0,\hat\tau_1]$. Now suppose $\hat\tau_q$ satisfies that $\|u^{BA}_t\|\le G_q$ for any $t\ge\hat\tau_q$. We will show that there exists $\hat\tau_{q+1}=\hat\tau_q+\frac{2cL}{\kappa}\hat\tau_q^\omega$ such that $\|u^{BA}_t\|\le G_{q+1}$ for any $t\ge\hat\tau_{q+1}$.

Step 2: Constructing sandwich bounds
We first observe that the following sandwich bound still holds for all $t\ge\hat\tau_q$:
\[
-X_{t;\hat\tau_q}(s,a)+Z_{t;\hat\tau_q}(s,a)\le u^{BA}_t(s,a)\le X_{t;\hat\tau_q}(s,a)+Z_{t;\hat\tau_q}(s,a),
\]
where $Z_{t;\hat\tau_q}(s,a)$ is defined as
\[
Z_{t+1;\hat\tau_q}(s,a)=\begin{cases}Z_{t;\hat\tau_q}(s,a), & t\notin T(s,a);\\ (1-\alpha_t)Z_{t;\hat\tau_q}(s,a)+\alpha_tz_t(s,a), & t\in T(s,a),\end{cases}
\]
with the initial condition $Z_{\hat\tau_q;\hat\tau_q}(s,a)=0$, and $X_{t;\hat\tau_q}(s,a)$ is defined as
\[
X_{t+1;\hat\tau_q}(s,a)=\begin{cases}X_{t;\hat\tau_q}(s,a), & t\notin T(s,a);\\ (1-\alpha_t)X_{t;\hat\tau_q}(s,a)+\alpha_t\gamma'G_q, & t\in T(s,a),\end{cases}
\]
with $X_{\hat\tau_q;\hat\tau_q}(s,a)=G_q$ and $\gamma'=\frac{1+\gamma}{2}$.

This claim can be shown by induction. The bound clearly holds for the initial case $t=\hat\tau_q$. Assume that it holds at iteration $t$. If $t\in T(s,a)$, the proof is the same as that of Lemma 3. If $t\notin T(s,a)$, all three sequences remain unchanged from time $t$ to time $t+1$, so the sandwich bound still holds. Thus we conclude the claim.

Step 3: Bounding $X_{t;\hat\tau_q}(s,a)$

Next, we bound the deterministic sequence $X_{t;\hat\tau_q}(s,a)$. Observe that $X_{t;\hat\tau_q}(s,a)\le G_q$ for any $t\ge\hat\tau_q$. We will next show that $X_{t;\hat\tau_q}(s,a)\le\big(\gamma'+\frac{2}{2+\Delta}\xi\big)G_q$ for any $t\in[\hat\tau_{q+1},\hat\tau_{q+2})$, where $\hat\tau_{q+1}=\hat\tau_q+\frac{2cL}{\kappa}\hat\tau_q^\omega$.

Similarly to the proof of Lemma 5, we rewrite $X_{\hat\tau_q;\hat\tau_q}(s,a)=G_q=\gamma'G_q+(1-\gamma')G_q:=\gamma'G_q+\rho_{\hat\tau_q}$. However, in this case the dynamics of $X_{t;\hat\tau_q}(s,a)$ is different, and is given by
\[
X_{t+1;\hat\tau_q}(s,a)=\begin{cases}X_{t;\hat\tau_q}(s,a), & t\notin T(s,a);\\ (1-\alpha_t)X_{t;\hat\tau_q}(s,a)+\alpha_t\gamma'G_q=\gamma'G_q+(1-\alpha_t)\rho_t, & t\in T(s,a),\end{cases}
\]
where $\rho_{t+1}=(1-\alpha_t)\rho_t$ for $t\in T(s,a)$.
By the definition of $\rho_t$, we obtain
\begin{align*}
\rho_t&=\rho_{\hat\tau_q}\prod_{i\in T(s,a,\hat\tau_q,t-1)}(1-\alpha_i)=(1-\gamma')G_q\prod_{i\in T(s,a,\hat\tau_q,t-1)}(1-\alpha_i)\\
&\le(1-\gamma')G_q\prod_{i\in T(s,a,\hat\tau_q,\hat\tau_{q+1}-1)}\Big(1-\frac{1}{i^\omega}\Big)\le(1-\gamma')G_q\prod_{i=\hat\tau_{q+1}-|T(s,a,\hat\tau_q,\hat\tau_{q+1}-1)|}^{\hat\tau_{q+1}-1}\Big(1-\frac{1}{i^\omega}\Big)\\
&\overset{\text{(i)}}{\le}(1-\gamma')G_q\prod_{i=\hat\tau_{q+1}-\frac{c}{\kappa}\hat\tau_q^\omega}^{\hat\tau_{q+1}-1}\Big(1-\frac{1}{i^\omega}\Big)\overset{\text{(ii)}}{\le}(1-\gamma')G_q\exp\Big(-\frac{\frac{c}{\kappa}\hat\tau_q^\omega-1}{(\hat\tau_{q+1}-1)^\omega}\Big)\\
&\le(1-\gamma')G_q\exp\Big(-\frac{\frac{c}{\kappa}\hat\tau_q^\omega-1}{\hat\tau_{q+1}^\omega}\Big)=(1-\gamma')G_q\exp\Big(-\frac{c}{\kappa}\Big(\frac{\hat\tau_q}{\hat\tau_{q+1}}\Big)^\omega+\frac{1}{\hat\tau_{q+1}^\omega}\Big)\\
&\overset{\text{(iii)}}{\le}(1-\gamma')G_q\exp\Bigg(-\frac{c/\kappa}{1+\frac{2cL}{\kappa}}+\frac{1}{\hat\tau_1^\omega}\Bigg),
\end{align*}
where (i) follows from Lemma 18, (ii) follows from Lemma 9, and (iii) holds because $\hat\tau_q\ge\hat\tau_1$ and
\[
\Big(\frac{\hat\tau_q}{\hat\tau_{q+1}}\Big)^\omega\ge\frac{\hat\tau_q}{\hat\tau_{q+1}}=\frac{\hat\tau_q}{\hat\tau_q+\frac{2cL}{\kappa}\hat\tau_q^\omega}\ge\frac{1}{1+\frac{2cL}{\kappa}}.
\]
Since $\kappa\in(\ln 2,1)$ and $\Delta\in(0,e^\kappa-2)$, we have $\ln(2+\Delta)\in(0,\kappa)$. Further, observing $\hat\tau_1^\omega>\frac{1}{\kappa-\ln(2+\Delta)}$, we obtain $\ln(2+\Delta)+\frac{1}{\hat\tau_1^\omega}\in(0,\kappa)$. Last, since $c\ge\frac{L\kappa(\ln(2+\Delta)+1/\hat\tau_1^\omega)}{2(\kappa-\ln(2+\Delta)-1/\hat\tau_1^\omega)}$, we have
\[
-\frac{c/\kappa}{1+\frac{2cL}{\kappa}}+\frac{1}{\hat\tau_1^\omega}\le-\ln(2+\Delta).
\]
Finally, combining the above observations with the fact $1-\gamma'=2\xi$, we conclude that for any $t\ge\hat\tau_{q+1}=\hat\tau_q+\frac{2cL}{\kappa}\hat\tau_q^\omega$,
\[
X_{t;\hat\tau_q}(s,a)\le\Big(\gamma'+\frac{2}{2+\Delta}\xi\Big)G_q.
\]

Step 4: Bounding $Z_{t;\hat\tau_q}(s,a)$

It remains to bound the stochastic sequence $Z_{t;\hat\tau_q}(s,a)$ by $\frac{\Delta}{2+\Delta}\xi G_q$ at epoch $q+1$. We define an auxiliary sequence $\{Z^l_{t;\hat\tau_q}(s,a)\}$ (which is different from that in (9)) as
\[
Z^l_{t;\hat\tau_q}(s,a)=\sum_{i\in T(s,a,\hat\tau_q,\hat\tau_q+l)}\alpha_i\prod_{j\in T(s,a,i+1,t-1)}(1-\alpha_j)\,z_i(s,a).
\]
Following the same arguments as the proof of Lemma 6, we conclude that $\{Z^l_{t;\hat\tau_q}(s,a)\}$ is a martingale sequence and satisfies
\[
\big|Z^l_{t;\hat\tau_q}(s,a)-Z^{l-1}_{t;\hat\tau_q}(s,a)\big|=\alpha_{\hat\tau_q+l}\,|z_{\hat\tau_q+l}(s,a)|\le\frac{V_{\max}}{\hat\tau_q^\omega}.
\]
In addition, note that
\[
Z_{t;\hat\tau_q}(s,a)=Z_{t;\hat\tau_q}(s,a)-Z_{\hat\tau_q;\hat\tau_q}(s,a)=\sum_{l:\hat\tau_q+l-1\in T(s,a,\hat\tau_q,t-1)}\big(Z^l_{t;\hat\tau_q}(s,a)-Z^{l-1}_{t;\hat\tau_q}(s,a)\big)+Z^{\min(T(s,a,\hat\tau_q,t-1))}_{t;\hat\tau_q}(s,a).
\]
Then we apply Azuma's inequality in Lemma 7 and obtain
\begin{align*}
&P\big[|Z_{t;\hat\tau_q}(s,a)|>\tilde\epsilon\,\big|\,t\in[\hat\tau_{q+1},\hat\tau_{q+2})\big]\\
&\le2\exp\Bigg(-\frac{\tilde\epsilon^2}{2\sum_{l:\hat\tau_q+l-1\in T(s,a,\hat\tau_q,t-1)}\big(Z^l_{t;\hat\tau_q}(s,a)-Z^{l-1}_{t;\hat\tau_q}(s,a)\big)^2+2\big(Z^{\min(T(s,a,\hat\tau_q,t-1))}_{t;\hat\tau_q}(s,a)\big)^2}\Bigg)\\
&\le2\exp\Big(-\frac{\tilde\epsilon^2\hat\tau_q^{2\omega}}{2(|T(s,a,\hat\tau_q,t-1)|+1)V_{\max}^2}\Big)\overset{\text{(i)}}{\le}2\exp\Big(-\frac{\tilde\epsilon^2\hat\tau_q^{2\omega}}{2(t-\hat\tau_q)V_{\max}^2}\Big)\\
&\le2\exp\Big(-\frac{\tilde\epsilon^2\hat\tau_q^{2\omega}}{2(\hat\tau_{q+2}-\hat\tau_q)V_{\max}^2}\Big)=2\exp\Bigg(-\frac{\tilde\epsilon^2\hat\tau_q^{2\omega}}{2\big(\frac{2cL}{\kappa}\hat\tau_{q+1}^\omega+\frac{2cL}{\kappa}\hat\tau_q^\omega\big)V_{\max}^2}\Bigg)\\
&=2\exp\Bigg(-\frac{\tilde\epsilon^2\hat\tau_q^{2\omega}}{2\big(\frac{2cL}{\kappa}\big(\hat\tau_q+\frac{2cL}{\kappa}\hat\tau_q^\omega\big)^\omega+\frac{2cL}{\kappa}\hat\tau_q^\omega\big)V_{\max}^2}\Bigg)\le2\exp\Bigg(-\frac{\kappa^2\tilde\epsilon^2\hat\tau_q^\omega}{8cL(cL+\kappa)V_{\max}^2}\Bigg),
\end{align*}
where (i) follows from Lemma 18.

Step 5: Taking the union over all blocks

\begin{align*}
&P\big[\forall q\in[0,n],\forall t\in[\hat\tau_{q+1},\hat\tau_{q+2}),\;\|Q^B_t-Q^A_t\|\le G_{q+1}\big]\\
&\ge P\Big[\forall(s,a),\forall q\in[0,n],\forall t\in[\hat\tau_{q+1},\hat\tau_{q+2}),\;|Z_{t;\hat\tau_q}(s,a)|\le\frac{\Delta}{2+\Delta}\xi G_q\Big]\\
&\ge1-\sum_{q=0}^n|S||A|(\hat\tau_{q+2}-\hat\tau_{q+1})\cdot P\Big[|Z_{t;\hat\tau_q}(s,a)|>\frac{\Delta}{2+\Delta}\xi G_q\,\Big|\,t\in[\hat\tau_{q+1},\hat\tau_{q+2})\Big]\\
&\ge1-\sum_{q=0}^n|S||A|\frac{2cL}{\kappa}\hat\tau_{q+1}^\omega\cdot2\exp\Bigg(-\frac{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\xi^2G_q^2\hat\tau_q^\omega}{8cL(cL+\kappa)V_{\max}^2}\Bigg)\\
&\ge1-\sum_{q=0}^n|S||A|\frac{2cL}{\kappa}\Big(1+\frac{2cL}{\kappa}\Big)\hat\tau_q^\omega\cdot2\exp\Bigg(-\frac{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\xi^2G_q^2\hat\tau_q^\omega}{8cL(cL+\kappa)V_{\max}^2}\Bigg)\\
&\overset{\text{(i)}}{\ge}1-\sum_{q=0}^n|S||A|\frac{2cL}{\kappa}\Big(1+\frac{2cL}{\kappa}\Big)\hat\tau_q^\omega\cdot2\exp\Bigg(-\frac{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\xi^2\sigma^2\epsilon^2\hat\tau_q^\omega}{8cL(cL+\kappa)V_{\max}^2}\Bigg)\\
&\overset{\text{(ii)}}{\ge}1-\frac{2cL}{\kappa}\Big(1+\frac{2cL}{\kappa}\Big)\sum_{q=0}^n|S||A|\exp\Bigg(-\frac{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\xi^2\sigma^2\epsilon^2\hat\tau_q^\omega}{16cL(cL+\kappa)V_{\max}^2}\Bigg)\\
&\overset{\text{(iii)}}{\ge}1-\frac{2cL(n+1)}{\kappa}\Big(1+\frac{2cL}{\kappa}\Big)|S||A|\exp\Bigg(-\frac{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\xi^2\sigma^2\epsilon^2\hat\tau_1^\omega}{16cL(cL+\kappa)V_{\max}^2}\Bigg),
\end{align*}
where (i) follows from $G_q\ge G_n\ge\sigma\epsilon$, (ii) follows from Lemma 9 by substituting $a=\frac{8cL(cL+\kappa)V_{\max}^2}{\kappa^2(\frac{\Delta}{2+\Delta})^2\xi^2\sigma^2\epsilon^2}$, $b=1$, and observing that
\[
\hat\tau_q^\omega\ge\hat\tau_1^\omega\ge\frac{16cL(cL+\kappa)V_{\max}^2}{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\xi^2\sigma^2\epsilon^2}\ln\frac{8cL(cL+\kappa)V_{\max}^2}{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\xi^2\sigma^2\epsilon^2}=2ab\ln ab,
\]
and (iii) follows from $\hat\tau_q\ge\hat\tau_1$.

C.2 Part II: Conditionally bounding $\|Q^A_t-Q^*\|$

We upper bound $\|Q^A_t-Q^*\|$ block-wise by a decreasing sequence $\{D_k\}_{k\ge0}$ conditioned on the following two events: for a fixed positive integer $m$,
\begin{align*}
G&=\big\{\forall(s,a),\forall k\in[0,m],\forall t\in[\tau_{k+1},\tau_{k+2}),\;\|Q^B_t-Q^A_t\|\le\sigma D_{k+1}\big\},\qquad(16)\\
H&=\big\{\forall k\in[1,m+1],\;I^A_k\ge cL\tau_k^\omega\big\},\qquad(17)
\end{align*}
where $I^A_k$ denotes the number of iterations updating $Q^A$ at epoch $k$, $\tau_k$ is the starting iteration index of the $(k+1)$-th block, and $\omega$ is the parameter of the polynomial learning rate. Roughly, event $G$ requires that the difference between the two Q-estimators be bounded appropriately, and event $H$ requires that $Q^A$ be sufficiently updated in each epoch. Again, we will design $\{D_k\}_{k\ge0}$ so that the occurrence of event $G$ is implied by the event that $\|u^{BA}_t\|$ is bounded by $\{G_q\}_{q\ge0}$ (see Lemma 19 below). A lower bound on the probability of event $H$ is characterized in Lemma 20 in Part III.

Proposition 4.
Fix $\epsilon>0$, $\kappa\in(\ln 2,1)$, and $\Delta\in(0,e^\kappa-2)$. Consider asynchronous double Q-learning using a polynomial learning rate $\alpha_t=\frac{1}{t^\omega}$ with $\omega\in(0,1)$. Let $\{G_q\}$ and $\{\hat\tau_q\}$ be as defined in Proposition 3. Define $D_k=(1-\beta)^k\frac{V_{\max}}{\sigma}$ with $\beta=\frac{1-\gamma(1+\sigma)}{2}$ and $\sigma=\frac{1-\gamma}{2\gamma}$. Let $\tau_k=\hat\tau_k$ for $k\ge0$. Suppose that $c\ge\frac{L\kappa(\ln(2+\Delta)+1/\tau_1^\omega)}{2(\kappa-\ln(2+\Delta)-1/\tau_1^\omega)}$ and that $\tau_1=\hat\tau_1$, the finishing time of the first epoch, satisfies
\[
\tau_1\ge\max\Bigg\{\Big(\frac{1}{\kappa-\ln(2+\Delta)}\Big)^{\frac{1}{\omega}},\;\Bigg(\frac{16cL(cL+\kappa)V_{\max}^2}{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\beta^2\epsilon^2}\ln\frac{8cL(cL+\kappa)V_{\max}^2}{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\beta^2\epsilon^2}\Bigg)^{\frac{1}{\omega}}\Bigg\}.
\]
Then for any $m$ such that $D_m\ge\epsilon$, we have
\[
P\big[\forall k\in[0,m],\forall t\in[\tau_{k+1},\tau_{k+2}),\;\|Q^A_t-Q^*\|\le D_{k+1}\,\big|\,G,H\big]\ge1-\frac{2cL(m+1)}{\kappa}\Big(1+\frac{2cL}{\kappa}\Big)|S||A|\exp\Bigg(-\frac{\kappa^2\big(\frac{\Delta}{2+\Delta}\big)^2\beta^2\epsilon^2\tau_1^\omega}{16cL(cL+\kappa)V_{\max}^2}\Bigg).
\]
Recall that in the proof of Proposition 2, $Q^A$ is not updated at every iteration, and thus we introduced the notations $T^A$ and $T^A(t_1,t_2)$ in Definition 1 to capture the convergence of the error $\|Q^A-Q^*\|$. In this proof, the only difference is that when $Q^A$ is chosen to be updated, only one $(s,a)$-pair is visited. Therefore, the proof of Proposition 4 is similar to that of Proposition 2, where most of the arguments simply substitute $T^A$ and $T^A(t_1,t_2)$ in the proof of Proposition 2 by $T^A(s,a)$ and $T^A(s,a,t_1,t_2)$ in Definition 2, respectively. Certain bounds are affected by such substitutions. In the following, we proceed with the proof of Proposition 4 in five steps, focusing on the differences from the proof of Proposition 2. More details can be found in Appendix B.2.

Step 1: Coupling $\{D_k\}_{k\ge0}$ and $\{G_q\}_{q\ge0}$

We establish the relationship between $\{D_k\}_{k\ge0}$ and $\{G_q\}_{q\ge0}$ in the same way as in Lemma 10. For convenience of reference, we restate Lemma 10 as follows.

Lemma 19.
Let $\{G_q\}$ be defined in Proposition 3, and let $D_k=(1-\beta)^k\frac{V_{\max}}{\sigma}$ with $\beta=\frac{1-\gamma(1+\sigma)}{2}$ and $\sigma=\frac{1-\gamma}{2\gamma}$. Then, given that $\tau_k=\hat\tau_k$, we have
\begin{align*}
&P\big[\forall(s,a),\forall q\in[0,m],\forall t\in[\hat\tau_{q+1},\hat\tau_{q+2}),\;\|Q^B_t-Q^A_t\|\le G_{q+1}\big]\\
&\le P\big[\forall(s,a),\forall k\in[0,m],\forall t\in[\tau_{k+1},\tau_{k+2}),\;\|Q^B_t-Q^A_t\|\le\sigma D_{k+1}\big].
\end{align*}

Step 2: Constructing sandwich bounds
Let $r_t(s,a)=Q^A_t(s,a)-Q^*(s,a)$, and let $\tau_k$ be such that $\|r_t\|\le D_k$ for all $t\ge\tau_k$. The requirement of event $G$ yields
\[
-Y_{t;\tau_k}(s,a)+W_{t;\tau_k}(s,a)\le r_t(s,a)\le Y_{t;\tau_k}(s,a)+W_{t;\tau_k}(s,a),
\]
where $W_{t;\tau_k}(s,a)$ is defined as
\[
W_{t+1;\tau_k}(s,a)=\begin{cases}W_{t;\tau_k}(s,a), & t\notin T^A(s,a);\\ (1-\alpha_t)W_{t;\tau_k}(s,a)+\alpha_tw_t(s,a), & t\in T^A(s,a),\end{cases}
\]
with $w_t(s,a)=\mathcal{T}_tQ^A_t(s,a)-\mathcal{T}Q^A_t(s,a)$ and $W_{\tau_k;\tau_k}(s,a)=0$, and $Y_{t;\tau_k}(s,a)$ is given by
\[
Y_{t+1;\tau_k}(s,a)=\begin{cases}Y_{t;\tau_k}(s,a), & t\notin T^A(s,a);\\ (1-\alpha_t)Y_{t;\tau_k}(s,a)+\alpha_t\gamma''D_k, & t\in T^A(s,a),\end{cases}
\]
with $Y_{\tau_k;\tau_k}(s,a)=D_k$ and $\gamma''=\gamma(1+\sigma)$.

Step 3: Bounding $Y_{t;\tau_k}(s,a)$

Next, we bound $Y_{t;\tau_k}(s,a)$. Observe that $Y_{t;\tau_k}(s,a)\le D_k$ for any $t\ge\tau_k$. We will bound $Y_{t;\tau_k}(s,a)$ by $\big(\gamma''+\frac{2}{2+\Delta}\beta\big)D_k$ for block $k+1$. We use a similar representation of $Y_{t;\tau_k}(s,a)$ as in the proof of Lemma 13, which is given by
\[
Y_{t+1;\tau_k}(s,a)=\begin{cases}Y_{t;\tau_k}(s,a), & t\notin T^A(s,a);\\ (1-\alpha_t)Y_{t;\tau_k}(s,a)+\alpha_t\gamma''D_k=\gamma''D_k+(1-\alpha_t)\rho_t, & t\in T^A(s,a),\end{cases}
\]
where $\rho_{t+1}=(1-\alpha_t)\rho_t$ for $t\in T^A(s,a)$.
By the definition of $\rho_t$, we obtain
\begin{align*}
\rho_t&=\rho_{\tau_k}\prod_{i\in T^A(s,a,\tau_k,t-1)}(1-\alpha_i)=(1-\gamma'')D_k\prod_{i\in T^A(s,a,\tau_k,t-1)}\Big(1-\frac{1}{i^\omega}\Big)\\
&\overset{\text{(i)}}{\le}(1-\gamma'')D_k\prod_{i\in T^A(s,a,\tau_k,\tau_{k+1}-1)}\Big(1-\frac{1}{i^\omega}\Big)\overset{\text{(ii)}}{\le}(1-\gamma'')D_k\prod_{i=\tau_{k+1}-c\tau_k^\omega}^{\tau_{k+1}-1}\Big(1-\frac{1}{i^\omega}\Big)\\
&\overset{\text{(iii)}}{\le}(1-\gamma'')D_k\exp\Big(-\frac{c\tau_k^\omega-1}{(\tau_{k+1}-1)^\omega}\Big)\le(1-\gamma'')D_k\exp\Big(-\frac{c\tau_k^\omega-1}{\tau_{k+1}^\omega}\Big)\\
&=(1-\gamma'')D_k\exp\Big(-c\Big(\frac{\tau_k}{\tau_{k+1}}\Big)^\omega+\frac{1}{\tau_{k+1}^\omega}\Big)\overset{\text{(iv)}}{\le}(1-\gamma'')D_k\exp\Bigg(-\frac{c}{1+\frac{2cL}{\kappa}}+\frac{1}{\tau_1^\omega}\Bigg),
\end{align*}
where (i) follows because $\alpha_i<1$ and $t\ge\tau_{k+1}$, (ii) follows from Lemma 18 and the requirement of event $H$, (iii) follows from Lemma 9, and (iv) holds because $\tau_{k+1}\ge\tau_1$ and
\[
\Big(\frac{\tau_k}{\tau_{k+1}}\Big)^\omega\ge\frac{\tau_k}{\tau_{k+1}}=\frac{\tau_k}{\tau_k+\frac{2cL}{\kappa}\tau_k^\omega}\ge\frac{1}{1+\frac{2cL}{\kappa}}.
\]
Since $\kappa\in(\ln 2,1)$ and $\Delta\in(0,e^\kappa-2)$, we have $\ln(2+\Delta)\in(0,\kappa)$. Further, observing $\tau_1^\omega>\frac{1}{\kappa-\ln(2+\Delta)}$, we obtain $\ln(2+\Delta)+\frac{1}{\tau_1^\omega}\in(0,\kappa)$. Last, since $c\ge\frac{L\kappa(\ln(2+\Delta)+1/\tau_1^\omega)}{2(\kappa-\ln(2+\Delta)-1/\tau_1^\omega)}$, we have $-\frac{c}{1+2cL/\kappa}+\frac{1}{\tau_1^\omega}\le-\ln(2+\Delta)$. Then, we have $\rho_t\le\frac{1}{2+\Delta}(1-\gamma'')D_k$. Thus we conclude that for any $t\in[\tau_{k+1},\tau_{k+2}]$,
\[
Y_{t;\tau_k}(s,a)\le\Big(\gamma''+\frac{2}{2+\Delta}\beta\Big)D_k.
\]

Step 4: Bounding $W_{t;\tau_k}(s,a)$

It remains to bound $|W_{t;\tau_k}(s,a)|\le\big(\frac{\Delta}{2+\Delta}\big)\beta D_k$ for $t\in[\tau_{k+1},\tau_{k+2})$. Similarly to Appendix B.2.4, we define a new sequence $\{W^l_{t;\tau_k}(s,a)\}$ as
\[
W^l_{t;\tau_k}(s,a)=\sum_{i\in T^A(s,a,\tau_k,\tau_k+l)}\alpha_i\prod_{j\in T^A(s,a,i+1,t-1)}(1-\alpha_j)\,w_i(s,a).
\]
The same arguments as in the proof of Lemma 14 yield $|W^l_{t;\tau_k}(s,a)-W^{l-1}_{t;\tau_k}(s,a)|\le\frac{V_{\max}}{\tau_k^\omega}$. If we fix $\tilde\epsilon>0$, then for any $t\in[\tau_{k+1},\tau_{k+2})$ we have
\begin{align*}
&P\big[|W_{t;\tau_k}(s,a)|>\tilde\epsilon\,\big|\,t\in[\tau_{k+1},\tau_{k+2}),G,H\big]\\
&\le2\exp\Bigg(-\frac{\tilde\epsilon^2}{2\sum_{l:\tau_k+l-1\in T^A(s,a,\tau_k,t-1)}\big(W^l_{t;\tau_k}(s,a)-W^{l-1}_{t;\tau_k}(s,a)\big)^2+2\big(W^{\min(T^A(s,a,\tau_k,t-1))}_{t;\tau_k}(s,a)\big)^2}\Bigg)\\
&\le2\exp\Big(-\frac{\tilde\epsilon^2\tau_k^{2\omega}}{2(|T^A(s,a,\tau_k,t-1)|+1)V_{\max}^2}\Big)\overset{\text{(i)}}{\le}2\exp\Big(-\frac{\tilde\epsilon^2\tau_k^{2\omega}}{2(t-\tau_k)V_{\max}^2}\Big)\\
&\le2\exp\Big(-\frac{\tilde\epsilon^2\tau_k^{2\omega}}{2(\tau_{k+2}-\tau_k)V_{\max}^2}\Big)\overset{\text{(ii)}}{\le}2\exp\Bigg(-\frac{\kappa^2\tilde\epsilon^2\tau_k^\omega}{8cL(cL+\kappa)V_{\max}^2}\Bigg),
\end{align*}
where (i) follows because $|T^A(s,a,t_1,t_2)|\le t_2-t_1+1$, and (ii) holds because
\[
\tau_{k+2}-\tau_k=\frac{2cL}{\kappa}\tau_{k+1}^\omega+\frac{2cL}{\kappa}\tau_k^\omega=\frac{2cL}{\kappa}\Big(\tau_k+\frac{2cL}{\kappa}\tau_k^\omega\Big)^\omega+\frac{2cL}{\kappa}\tau_k^\omega\le\frac{4cL(cL+\kappa)}{\kappa^2}\tau_k^\omega.
\]

Step 5: Taking the union over all blocks
Applying the union bound in Lemma 8, we obtain
\begin{align*}
&P\left[\forall k\in[0,m],\forall t\in[\tau_{k+1},\tau_{k+2}),\left\|Q^A_t-Q^*\right\|\le D_{k+1}\,\middle|\,G,H\right]\\
&\ge P\left[\forall(s,a),\forall k\in[0,m],\forall t\in[\tau_{k+1},\tau_{k+2}),|W_{t;\tau_k}(s,a)|\le\frac{\Delta}{2+\Delta}\beta D_k\,\middle|\,G,H\right]\\
&\ge 1-\sum_{k=0}^m|\mathcal S||\mathcal A|(\tau_{k+2}-\tau_{k+1})\cdot P\left[|W_{t;\tau_k}(s,a)|>\frac{\Delta}{2+\Delta}\beta D_k\,\middle|\,t\in[\tau_{k+1},\tau_{k+2}),G,H\right]\\
&\ge 1-\sum_{k=0}^m|\mathcal S||\mathcal A|\frac{2cL}{\kappa}\tau_{k+1}^\omega\cdot 2\exp\left(-\frac{\kappa^2\left(\frac{\Delta}{2+\Delta}\right)^2\beta^2D_k^2\tau_k^\omega}{8cL(cL+\kappa)V_{\max}^2}\right)\\
&\ge 1-\sum_{k=0}^m|\mathcal S||\mathcal A|\frac{2cL}{\kappa}\left(\frac{2cL+\kappa}{\kappa}\right)\tau_k^\omega\cdot 2\exp\left(-\frac{\kappa^2\left(\frac{\Delta}{2+\Delta}\right)^2\beta^2D_k^2\tau_k^\omega}{8cL(cL+\kappa)V_{\max}^2}\right)\\
&\overset{(i)}{\ge} 1-\sum_{k=0}^m|\mathcal S||\mathcal A|\frac{2cL}{\kappa}\left(\frac{2cL+\kappa}{\kappa}\right)\tau_k^\omega\cdot 2\exp\left(-\frac{\kappa^2\left(\frac{\Delta}{2+\Delta}\right)^2\beta^2\epsilon^2\tau_k^\omega}{8cL(cL+\kappa)V_{\max}^2}\right)\\
&\overset{(ii)}{\ge} 1-\frac{4cL}{\kappa}\left(\frac{2cL+\kappa}{\kappa}\right)\sum_{k=0}^m|\mathcal S||\mathcal A|\exp\left(-\frac{\kappa^2\left(\frac{\Delta}{2+\Delta}\right)^2\beta^2\epsilon^2\tau_k^\omega}{16cL(cL+\kappa)V_{\max}^2}\right)\\
&\ge 1-\frac{4cL(m+1)}{\kappa}\left(\frac{2cL+\kappa}{\kappa}\right)|\mathcal S||\mathcal A|\exp\left(-\frac{\kappa^2\left(\frac{\Delta}{2+\Delta}\right)^2\beta^2\epsilon^2\hat\tau^\omega}{16cL(cL+\kappa)V_{\max}^2}\right),
\end{align*}
where (i) follows because $D_k\ge D_m\ge\epsilon$, and (ii) follows from Lemma 9 by substituting $a=\frac{8cL(cL+\kappa)V_{\max}^2}{\kappa^2\left(\frac{\Delta}{2+\Delta}\right)^2\beta^2\epsilon^2}$, $b=1$ and observing that
\[
\tau_k^\omega\ge\hat\tau^\omega\ge\frac{16cL(cL+\kappa)V_{\max}^2}{\kappa^2\left(\frac{\Delta}{2+\Delta}\right)^2\beta^2\epsilon^2}\ln\frac{8cL(cL+\kappa)V_{\max}^2}{\kappa^2\left(\frac{\Delta}{2+\Delta}\right)^2\beta^2\epsilon^2}=2ab\ln ab.
\]

C.3 Part III: Bound $\left\|Q^A_t-Q^*\right\|$

In order to obtain the unconditional high-probability bound on $\left\|Q^A_t-Q^*\right\|$, we first characterize a lower bound on the probability of Event $H$. Note that the probability of Event $G$ is lower bounded in Proposition 3.

Lemma 20.
Let the sequence $\tau_k$ be the same as given in Lemma 19, i.e., $\tau_{k+1}=\tau_k+\frac{2cL}{\kappa}\tau_k^\omega$ for $k\ge 1$. Define $I^A_k$ as the number of iterations updating $Q^A$ at epoch $k$. Then we have
\[
P\left[\forall k\in[1,m],\,I^A_k\ge cL\tau_k^\omega\right]\ge 1-m\exp\left(-\frac{(1-\kappa)^2cL\hat\tau^\omega}{2\kappa}\right).
\]

Proof.
We use the same idea as in the proof of Lemma 15. Since we only focus on the blocks with $k\ge 1$, we have $I^A_k\sim\mathrm{Binomial}\left(\frac{2cL}{\kappa}\tau_k^\omega,\frac{1}{2}\right)$ in such a case. Thus the tail bound of a binomial random variable gives
\[
P\left[I^A_k\le\kappa\cdot\frac{cL}{\kappa}\tau_k^\omega\right]\le\exp\left(-\frac{(1-\kappa)^2cL\tau_k^\omega}{2\kappa}\right).
\]
Then by the union bound, we have
\[
P\left[\forall k\in[1,m],\,I^A_k\ge cL\tau_k^\omega\right]=P\left[\forall k\in[1,m],\,I^A_k\ge\kappa\cdot\frac{cL}{\kappa}\tau_k^\omega\right]\ge 1-\sum_{k=1}^m\exp\left(-\frac{(1-\kappa)^2cL\tau_k^\omega}{2\kappa}\right)\ge 1-m\exp\left(-\frac{(1-\kappa)^2cL\hat\tau^\omega}{2\kappa}\right).
\]

Following from Lemma 16, it suffices to determine the starting time of epoch $m^*=\left\lceil\frac{1}{1-\gamma}\ln\frac{\gamma V_{\max}}{\epsilon(1-\gamma)}\right\rceil$. This can be done by using Lemma 17 once $\hat\tau$ is determined.

Now we are ready to prove the main result of Theorem 2. By the definition of $m^*$, we know $D_{m^*-1}\ge\epsilon$ and $G_{m^*-1}\ge\sigma\epsilon$. Then we obtain
\begin{align*}
&P\left(\left\|Q^A_{\tau_{m^*}}-Q^*\right\|\le\epsilon\right)\\
&\ge P\left[\forall k\in[0,m^*-1],\forall t\in[\tau_{k+1},\tau_{k+2}),\left\|Q^A_t-Q^*\right\|\le D_{k+1}\right]\\
&= P\left[\forall k\in[0,m^*-1],\forall t\in[\tau_{k+1},\tau_{k+2}),\left\|Q^A_t-Q^*\right\|\le D_{k+1}\,\middle|\,G,H\right]\cdot P(G\cap H)\\
&\ge P\left[\forall k\in[0,m^*-1],\forall t\in[\tau_{k+1},\tau_{k+2}),\left\|Q^A_t-Q^*\right\|\le D_{k+1}\,\middle|\,G,H\right]\cdot\left(P(G)+P(H)-1\right)\\
&\overset{(i)}{\ge} P\left[\forall k\in[0,m^*-1],\forall t\in[\tau_{k+1},\tau_{k+2}),\left\|Q^A_t-Q^*\right\|\le D_{k+1}\,\middle|\,G,H\right]\\
&\quad\cdot\left(P\left[\forall q\in[0,m^*-1],\forall t\in[\hat\tau_{q+1},\hat\tau_{q+2}),\left\|Q^B_t-Q^A_t\right\|\le G_{q+1}\right]+P(H)-1\right)\\
&\overset{(ii)}{\ge}\left[1-\frac{4cLm^*}{\kappa}\left(\frac{2cL+\kappa}{\kappa}\right)|\mathcal S||\mathcal A|\exp\left(-\frac{\kappa^2\left(\frac{\Delta}{2+\Delta}\right)^2\beta^2\epsilon^2\hat\tau^\omega}{16cL(cL+\kappa)V_{\max}^2}\right)\right]\\
&\quad\cdot\left[1-\frac{4cLm^*}{\kappa}\left(\frac{2cL+\kappa}{\kappa}\right)|\mathcal S||\mathcal A|\exp\left(-\frac{\kappa^2\left(\frac{\Delta}{2+\Delta}\right)^2\xi^2\sigma^2\epsilon^2\hat\tau^\omega}{16cL(cL+\kappa)V_{\max}^2}\right)-m^*\exp\left(-\frac{(1-\kappa)^2cL\hat\tau^\omega}{2\kappa}\right)\right]\\
&\ge 1-\frac{4cLm^*}{\kappa}\left(\frac{2cL+\kappa}{\kappa}\right)|\mathcal S||\mathcal A|\exp\left(-\frac{\kappa^2\left(\frac{\Delta}{2+\Delta}\right)^2\beta^2\epsilon^2\hat\tau^\omega}{16cL(cL+\kappa)V_{\max}^2}\right)\\
&\quad-\frac{4cLm^*}{\kappa}\left(\frac{2cL+\kappa}{\kappa}\right)|\mathcal S||\mathcal A|\exp\left(-\frac{\kappa^2\left(\frac{\Delta}{2+\Delta}\right)^2\xi^2\sigma^2\epsilon^2\hat\tau^\omega}{16cL(cL+\kappa)V_{\max}^2}\right)-m^*\exp\left(-\frac{(1-\kappa)^2cL\hat\tau^\omega}{2\kappa}\right)\\
&\overset{(iii)}{\ge} 1-\frac{12cLm^*}{\kappa}\left(\frac{2cL+\kappa}{\kappa}\right)|\mathcal S||\mathcal A|\exp\left(-\frac{\kappa(1-\kappa)\left(\frac{\Delta}{2+\Delta}\right)^2\xi^2\sigma^2\epsilon^2\hat\tau^\omega}{16cL(cL+\kappa)V_{\max}^2}\right),
\end{align*}
where (iii) follows because
\[
\frac{4cLm^*}{\kappa}\left(\frac{2cL+\kappa}{\kappa}\right)|\mathcal S||\mathcal A|=\max\left\{\frac{4cLm^*}{\kappa}\left(\frac{2cL+\kappa}{\kappa}\right)|\mathcal S||\mathcal A|,\,m^*\right\}
\]
and
\[
\frac{\kappa(1-\kappa)\left(\frac{\Delta}{2+\Delta}\right)^2\xi^2\sigma^2\epsilon^2\hat\tau^\omega}{16cL(cL+\kappa)V_{\max}^2}\le\min\left\{\frac{\kappa^2\left(\frac{\Delta}{2+\Delta}\right)^2\beta^2\epsilon^2\hat\tau^\omega}{16cL(cL+\kappa)V_{\max}^2},\,\frac{(1-\kappa)^2cL\hat\tau^\omega}{2\kappa},\,\frac{\kappa^2\left(\frac{\Delta}{2+\Delta}\right)^2\xi^2\sigma^2\epsilon^2\hat\tau^\omega}{16cL(cL+\kappa)V_{\max}^2}\right\}.
\]
By setting
\[
1-\frac{12cLm^*}{\kappa}\left(\frac{2cL+\kappa}{\kappa}\right)|\mathcal S||\mathcal A|\exp\left(-\frac{\kappa(1-\kappa)\left(\frac{\Delta}{2+\Delta}\right)^2\xi^2\sigma^2\epsilon^2\hat\tau^\omega}{16cL(cL+\kappa)V_{\max}^2}\right)\ge 1-\delta,
\]
we obtain
\[
\hat\tau\ge\left(\frac{16cL(cL+\kappa)V_{\max}^2}{\kappa(1-\kappa)\left(\frac{\Delta}{2+\Delta}\right)^2\xi^2\sigma^2\epsilon^2}\ln\frac{12m^*|\mathcal S||\mathcal A|cL(2cL+\kappa)}{\kappa^2\delta}\right)^{\frac{1}{\omega}}.
\]
Combining with the requirements on $\hat\tau$ in Propositions 3 and 4, we can choose
\[
\hat\tau=\Theta\left(\left(\frac{L^2V_{\max}^2}{(1-\gamma)^4\epsilon^2}\ln\frac{m^*|\mathcal S||\mathcal A|L^2V_{\max}^2}{(1-\gamma)^4\epsilon^2\delta}\right)^{\frac{1}{\omega}}\right).
\]
Finally, applying $m^*=\left\lceil\frac{1}{1-\gamma}\ln\frac{\gamma V_{\max}}{\epsilon(1-\gamma)}\right\rceil$ and Lemma 17, we conclude that it suffices to let
\begin{align*}
T&=\Omega\left(\left(\frac{L^2V_{\max}^2}{(1-\gamma)^4\epsilon^2}\ln\frac{m^*|\mathcal S||\mathcal A|L^2V_{\max}^2}{(1-\gamma)^4\epsilon^2\delta}\right)^{\frac{1}{\omega}}+\left(\frac{2cL}{\kappa(1-\gamma)}\ln\frac{\gamma V_{\max}}{(1-\gamma)\epsilon}\right)^{\frac{1}{1-\omega}}\right)\\
&=\Omega\left(\left(\frac{L^2V_{\max}^2}{(1-\gamma)^4\epsilon^2}\ln\frac{|\mathcal S||\mathcal A|L^2V_{\max}^2\ln\frac{\gamma V_{\max}}{\epsilon(1-\gamma)}}{(1-\gamma)^4\epsilon^2\delta}\right)^{\frac{1}{\omega}}+\left(\frac{2cL}{\kappa(1-\gamma)}\ln\frac{\gamma V_{\max}}{(1-\gamma)\epsilon}\right)^{\frac{1}{1-\omega}}\right)\\
&=\Omega\left(\left(\frac{L^2V_{\max}^2}{(1-\gamma)^4\epsilon^2}\ln\frac{|\mathcal S||\mathcal A|L^2V_{\max}^2}{(1-\gamma)^4\epsilon^2\delta}\right)^{\frac{1}{\omega}}+\left(\frac{L}{1-\gamma}\ln\frac{\gamma V_{\max}}{(1-\gamma)\epsilon}\right)^{\frac{1}{1-\omega}}\right)
\end{align*}
iterations to attain an $\epsilon$-accurate Q-estimator.
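As a quick numerical sanity check (illustration only, not part of the proof), the following snippet verifies two elementary estimates of the kind used above: the product-to-exponential step behind the bound on $\rho_t$, and the two-block gap bound $\tau_{k+2}-\tau_k\le\frac{4cL(cL+\kappa)}{\kappa^2}\tau_k^\omega$ under the recursion $\tau_{k+1}=\tau_k+\frac{2cL}{\kappa}\tau_k^\omega$. The values of $\omega$, $c$, $L$, $\kappa$, and $\tau_0$ below are arbitrary sample choices, not constants prescribed by the analysis.

```python
import math

# Numerical sanity check (illustration only, not part of the proof).
# (1) prod_{i=a}^{b} (1 - 1/i^w) <= exp(-(b - a + 1)/b^w): each factor
#     satisfies 1 - 1/i^w <= exp(-1/i^w) <= exp(-1/b^w) for i <= b.
# (2) with tau_{k+1} = tau_k + (2cL/kappa) * tau_k^w and tau_k >= 1:
#     tau_{k+2} - tau_k <= 4cL(cL + kappa)/kappa^2 * tau_k^w.
# w, c, L, kappa, tau0 are arbitrary sample values (assumptions).

w = 0.8

def lhs_rhs(a, b):
    """Return the product and its exponential upper bound."""
    prod = 1.0
    for i in range(a, b + 1):
        prod *= 1.0 - 1.0 / i**w
    return prod, math.exp(-(b - a + 1) / b**w)

prod, bound = lhs_rhs(100, 200)
assert prod <= bound  # product-to-exponential estimate holds

c, L, kappa = 1.0, 4.0, 0.8
tau0 = 50.0
tau1 = tau0 + (2 * c * L / kappa) * tau0**w
tau2 = tau1 + (2 * c * L / kappa) * tau1**w
gap = tau2 - tau0
cap = 4 * c * L * (c * L + kappa) / kappa**2 * tau0**w
assert gap <= cap  # two-block gap stays below the stated cap
```

Both assertions rely only on $1-x\le e^{-x}$ and $(1+x)^\omega\le 1+x$ for $\omega\in(0,1)$, $x\ge 0$, so they hold for any admissible choice of the sample parameters.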