Finite-Time Error Analysis of Asynchronous Q-Learning with Discrete-Time Switching System Models
Donghwan Lee
Abstract—This paper develops a novel framework for analyzing the convergence of the Q-learning algorithm via its connections to dynamical systems. We prove that asynchronous Q-learning with a constant step-size can be naturally formulated as a discrete-time stochastic switched linear system. Moreover, the evolution of the Q-learning estimation error is over- and underestimated by the trajectories of two simpler dynamical systems. Based on these comparison systems, a new finite-time analysis of Q-learning is given together with a finite-time error bound. The approach offers intuitive insights into the analysis of Q-learning grounded in control-theoretic frameworks and, by filling the gap between the two domains in a synergistic way, can potentially facilitate further progress in each field.
I. INTRODUCTION
Q-learning [1] is one of the most popular and successful reinforcement learning algorithms. Its convergence has been studied extensively, including asymptotic convergence [2]–[5] and finite-time convergence analysis [6]–[12], with significant advances made recently on the finite-time side. The main goal of this paper is to present a novel framework for the finite-time analysis of asynchronous Q-learning with constant step-sizes using connections to discrete-time switching systems [13]. Based on control system theoretic arguments, we provide a new finite-time analysis of Q-learning.

In particular, we consider a discounted Markov decision process with finite state and action spaces and prove that asynchronous Q-learning with a constant step-size can be formulated as a discrete-time switched affine system. This allows us to transform the convergence analysis into a stability analysis of the switched system. The stability analysis is nontrivial, however, due to the affine term. The main breakthrough in our analysis lies in developing upper and lower comparison systems whose trajectories over- and underestimate the original system's trajectory. The lower comparison system is a stochastic linear system, while the upper comparison system is a stochastic switched linear system [13], both of which are simpler than the original system. The convergence of Q-learning is established by proving the convergence of the two bounding systems. The analysis yields a new error bound on the iterates of asynchronous Q-learning with a constant step-size. Beyond the finite-time analysis, the present work offers a novel framework for studying asynchronous Q-learning from a discrete-time dynamical system perspective, which may be more intuitive than previous approaches. For example, the proposed analysis can reflect the overestimation phenomenon in Q-learning caused by the maximization bias [14].
D. Lee is with the Department of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, 34141, South Korea ([email protected]).

We expect that, by filling the gap between the two domains in a natural and synergistic way, this approach can enrich both fields and open up new opportunities for the design of new reinforcement learning algorithms and for the development of analyses of various Q-learning algorithms, such as double Q-learning [14], averaging Q-learning [5], speedy Q-learning [10], and multi-agent Q-learning [15]. In this respect, we view our proposed analysis as a complement to, rather than a replacement for, existing analysis techniques. Moreover, our main purpose is to provide new insights and frameworks that deepen our understanding of Q-learning via connections to discrete-time dynamical systems, rather than to improve existing convergence rates. The analysis applies to both the i.i.d. sampling case and the Markovian sampling case without modification.
Related work:
Q-learning was first introduced in [1]. Classical studies on the convergence of Q-learning mostly focused on asymptotic analysis of asynchronous Q-learning [2], [3] and synchronous Q-learning [4] via connections to infinity-norm contractive operators. The first finite-time analysis of Q-learning was developed in [9] in an i.i.d. sampling setting. Subsequently, [12] analyzed a batch version of synchronous Q-learning, called phased Q-learning, with finite-time bounds. Another finite-time analysis, under a single-trajectory Markovian sampling setting, was conducted in [11] for both synchronous and asynchronous Q-learning with polynomial and linear step-sizes. [10] proposed a variant of synchronous Q-learning called speedy Q-learning, which adds a momentum term and achieves an accelerated learning rate. Many advances have been made recently in finite-time analysis: [16] provided finite-time bounds for general synchronous stochastic approximation and applied them to synchronous Q-learning. In [17], a finite-time convergence rate of a general asynchronous stochastic approximation scheme was derived under the Markovian setting and applied to asynchronous Q-learning. Subsequently, [6] obtained a sharper bound under similar assumptions, and [18] provided a Lyapunov-method-based analysis for general stochastic approximation and reinforcement learning. Compared to the aforementioned work, we consider asynchronous Q-learning with a constant step-size. Similar lines of work are [7] and [5]. [7] studied a finite-time analysis of synchronous Q-learning with a constant step-size, proving that it converges exponentially up to a constant error bound, and also developed a finite-time bound for asynchronous Q-learning under a finite covering time assumption. Recently, [18] established convergence rates of Q-learning with a constant step-size, Markovian trajectories, and a mixing time assumption. [5] developed a unified switching system perspective and an asymptotic convergence analysis of Q-learning algorithms. The main difference is that [5] applied the O.D.E. analysis [4] with continuous-time switched system models and established asymptotic convergence with diminishing step-sizes, whereas the proposed approach analyzes the algorithm directly in the discrete-time domain. The transition from the continuous domain to the discrete domain makes substantial differences in the analysis.

The overall paper is organized as follows. Section II provides preliminary discussions, including basics of Markov decision processes, switching system theory, Q-learning, and useful definitions and notation used throughout the paper. Section III provides the main results, including the switched system models of Q-learning, the upper and lower comparison systems, and the convergence results. Numerical simulations are given in Section IV to illustrate and demonstrate the proposed analysis, and Section V concludes the paper.

II. PRELIMINARIES
A. Markov decision problem
We consider the infinite-horizon discounted Markov decision problem (MDP), where the agent sequentially takes actions to maximize cumulative discounted rewards. In a Markov decision process with state-space $\mathcal{S} := \{1, 2, \ldots, |\mathcal{S}|\}$ and action-space $\mathcal{A} := \{1, 2, \ldots, |\mathcal{A}|\}$, the decision maker selects an action $a \in \mathcal{A}$ at the current state $s$; the state then transits to $s'$ with probability $P(s, a, s')$, and the transition incurs a reward $r(s, a, s')$, where $P(s, a, s')$ is the state transition probability from the current state $s \in \mathcal{S}$ to the next state $s' \in \mathcal{S}$ under action $a \in \mathcal{A}$, and $r(s, a, s')$ is the reward function. The reward can also be a random variable conditioned on $(s, a, s')$, but for convenience we treat it as a deterministic function. Moreover, we write $r(s_k, a_k, s_{k+1}) =: r_k$ for $k \in \{0, 1, \ldots\}$.

A deterministic policy, $\pi : \mathcal{S} \to \mathcal{A}$, maps a state $s \in \mathcal{S}$ to an action $\pi(s) \in \mathcal{A}$. The Markov decision problem is to find a deterministic optimal policy, $\pi^*$, that maximizes the cumulative discounted rewards over the infinite time horizon, i.e.,

  $\pi^* := \arg\max_{\pi \in \Theta} \mathbb{E}\left[\left.\sum_{k=0}^{\infty} \gamma^k r_k \,\right|\, \pi\right]$,

where $\gamma \in [0, 1)$ is the discount factor, $\Theta$ is the set of all admissible deterministic policies, $(s_0, a_0, s_1, a_1, \ldots)$ is a state-action trajectory generated by the Markov chain under policy $\pi$, and $\mathbb{E}[\cdot \mid \pi]$ is the expectation conditioned on the policy $\pi$. The Q-function under policy $\pi$ is defined as

  $Q^{\pi}(s, a) = \mathbb{E}\left[\left.\sum_{k=0}^{\infty} \gamma^k r_k \,\right|\, s_0 = s, a_0 = a, \pi\right], \quad s \in \mathcal{S},\ a \in \mathcal{A}$,

and the corresponding optimal Q-function is defined as $Q^*(s, a) = Q^{\pi^*}(s, a)$ for all $s \in \mathcal{S}$, $a \in \mathcal{A}$. Once $Q^*$ is known, an optimal policy can be retrieved by $\pi^*(s) = \arg\max_{a \in \mathcal{A}} Q^*(s, a)$. We assume that the MDP is ergodic so that the stationary state distribution exists and the Markov decision problem is well posed.
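To make the preceding definitions concrete, the following minimal sketch computes $Q^*$ by Q-value iteration on a small randomly generated MDP. All numerical values (the sizes, the seed, and the use of an expected reward table $r(s,a)$ in place of $r(s,a,s')$) are illustrative assumptions, not quantities taken from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9                      # assumed sizes and discount factor

Pt = rng.random((nS, nA, nS))                  # transition tensor Pt[s, a, s']
Pt /= Pt.sum(axis=2, keepdims=True)            # each P(s, a, .) is a distribution
r = rng.uniform(-1, 1, (nS, nA))               # expected reward table r(s, a)

# Q-value iteration: Q(s,a) <- r(s,a) + gamma * sum_s' P(s,a,s') max_a' Q(s',a')
Q = np.zeros((nS, nA))
for _ in range(1000):
    Q = r + gamma * Pt @ Q.max(axis=1)
pi_star = Q.argmax(axis=1)                     # greedy (optimal) policy pi*(s)
print("Q* =\n", Q, "\npi* =", pi_star)
```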
B. Switching system theory

Since a switching system is a special form of nonlinear system, we first consider the nonlinear system

  $x_{k+1} = f(x_k), \quad x_0 = z \in \mathbb{R}^n, \quad k \in \{0, 1, \ldots\}$,   (1)

where $x_k \in \mathbb{R}^n$ is the state and $f : \mathbb{R}^n \to \mathbb{R}^n$ is a nonlinear mapping. An important concept in dealing with nonlinear systems is the equilibrium point. A point $x = x^*$ in the state space is said to be an equilibrium point of (1) if, whenever the state of the system starts at $x^*$, it remains at $x^*$ [19]; for (1), the equilibrium points are the solutions of $f(x) = x$. The equilibrium point $x^*$ is said to be globally asymptotically stable if, for any initial state $x_0 \in \mathbb{R}^n$, $x_k \to x^*$ as $k \to \infty$.

Next, consider a particular nonlinear system, the switched linear system,

  $x_{k+1} = A_{\sigma_k} x_k, \quad x_0 = z \in \mathbb{R}^n, \quad k \in \{0, 1, \ldots\}$,   (2)

where $x_k \in \mathbb{R}^n$ is the state, $\sigma \in \mathcal{M} := \{1, 2, \ldots, M\}$ is called the mode, $\sigma_k \in \mathcal{M}$ is called the switching signal, and $\{A_{\sigma}, \sigma \in \mathcal{M}\}$ are called the subsystem matrices. The switching signal can be either arbitrary or controlled by the user under a certain switching policy. In particular, a state-feedback switching policy is denoted by $\sigma_k = \sigma(x_k)$.
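The following sketch, under assumed values, simulates (2) when every subsystem matrix is a contraction in the $\infty$-norm; the trajectory then decays under an arbitrary switching signal, which is precisely the mechanism exploited for Q-learning later in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, M = 4, 3                                    # assumed state dimension and modes

# Scale each random subsystem matrix so that ||A_i||_inf <= 0.9 < 1; then
# x_{k+1} = A_{sigma_k} x_k contracts in the infinity-norm for ANY switching.
A = [rng.uniform(-1, 1, (n, n)) for _ in range(M)]
A = [0.9 * Ai / np.abs(Ai).sum(axis=1).max() for Ai in A]

x = rng.standard_normal(n)
for k in range(100):
    sigma = rng.integers(M)                    # arbitrary switching signal sigma_k
    x = A[sigma] @ x
print("||x_100||_inf =", np.abs(x).max())      # essentially zero
```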
C. Revisiting Q-learning

We now briefly review standard Q-learning and its convergence. Recall that standard Q-learning updates

  $Q_{k+1}(s_k, a_k) = Q_k(s_k, a_k) + \alpha_k(s_k, a_k)\left\{ r_k + \gamma \max_{u \in \mathcal{A}} Q_k(s_{k+1}, u) - Q_k(s_k, a_k) \right\}$,

where $0 \le \alpha_k(s, a) \le 1$ is the learning rate associated with the state-action pair $(s, a)$ at iteration $k$; this value is zero whenever $(s, a) \neq (s_k, a_k)$. If

  $\sum_{k=0}^{\infty} \alpha_k(s, a) = \infty, \qquad \sum_{k=0}^{\infty} \alpha_k(s, a)^2 < \infty$,

and every state-action pair is visited infinitely often, then the iterate is guaranteed to converge to $Q^*$ with probability one. Note that the state-action pairs can be visited arbitrarily, which is more general than stochastic visiting rules.

In this paper, we analyze the convergence for the following two cases in a unified framework:
1) $\{(s_k, a_k, s_k')\}_{k=0}^{\infty}$ is a sequence of i.i.d. random variables: $(s_k, a_k)$ is sampled from an arbitrary probability distribution $d(s, a)$, $s \in \mathcal{S}$, $a \in \mathcal{A}$, over the state-action pairs, and $s_k' \sim P(\cdot \mid s_k, a_k)$.
2) $\{(s_k, a_k)\}_{k=0}^{\infty}$ is a single Markov chain under a behavior policy $\beta$, where the behavior policy is the policy by which the RL agent actually behaves to collect experiences. In this case, we assume that the initial state distribution is already the stationary distribution, $\lim_{k \to \infty} \mathbb{P}[s_k = s \mid \beta] = p_{\infty}(s)$, $s \in \mathcal{S}$, so that the state-action distribution at each time is identically given by $d(s, a) = p_{\infty}(s)\beta(a \mid s)$, $(s, a) \in \mathcal{S} \times \mathcal{A}$.

D. Assumptions and definitions

Assumptions that will be applied throughout the paper are summarized below.
Assumption 1. $d(s, a) > 0$ holds for all $s \in \mathcal{S}$, $a \in \mathcal{A}$.

Assumption 1 guarantees that every state-action pair is visited infinitely often, which provides sufficient exploration.
Assumption 2. The step-size is a constant $\alpha \in (0, 1)$.

Assumption 3. The reward is bounded as follows: $\max_{(s,a,s') \in \mathcal{S} \times \mathcal{A} \times \mathcal{S}} |r(s, a, s')| =: R_{\max} = 1$.

Assumption 4. The initial Q-function iterate $Q_0$ satisfies $\|Q_0\|_{\infty} \le 1$.

Assumption 3 is required to ensure the boundedness of the Q-learning iterates. We set $R_{\max} = 1$ in Assumption 3 and impose Assumption 4 only for simplicity of the analysis. Some quantities will be used frequently in this paper, so we define notation for them for convenience.

Definition 1.
1) Maximum state-action visit probability: $d_{\max} := \max_{(s,a) \in \mathcal{S} \times \mathcal{A}} d(s, a) \in (0, 1)$.
2) Minimum state-action visit probability: $d_{\min} := \min_{(s,a) \in \mathcal{S} \times \mathcal{A}} d(s, a) \in (0, 1)$.
3) Exponential decay rate: $\rho := 1 - \alpha d_{\min}(1 - \gamma)$. Under Assumption 2, the decay rate satisfies $\rho \in (0, 1)$.

Throughout the paper, we use the following compact notation:

  $P := \begin{bmatrix} P_1 \\ \vdots \\ P_{|\mathcal{A}|} \end{bmatrix} \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}| \times |\mathcal{S}|}, \quad R := \begin{bmatrix} R_1 \\ \vdots \\ R_{|\mathcal{A}|} \end{bmatrix} \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}, \quad Q := \begin{bmatrix} Q(\cdot, 1) \\ \vdots \\ Q(\cdot, |\mathcal{A}|) \end{bmatrix} \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$,

  $D_a := \mathrm{diag}(d(1, a), \ldots, d(|\mathcal{S}|, a)) \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{S}|}, \quad D := \mathrm{diag}(D_1, \ldots, D_{|\mathcal{A}|}) \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}| \times |\mathcal{S}||\mathcal{A}|}$,

where $P_a = P(\cdot, a, \cdot) \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{S}|}$, $Q(\cdot, a) \in \mathbb{R}^{|\mathcal{S}|}$ for $a \in \mathcal{A}$, and $R_a(s) := \mathbb{E}[r(s, a, s') \mid s, a]$. In this notation, the Q-function is encoded as a single vector $Q \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$, which enumerates $Q(s, a)$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$. The single value $Q(s, a)$ can be extracted from $Q$ as $Q(s, a) = (e_a \otimes e_s)^T Q$, where $e_s \in \mathbb{R}^{|\mathcal{S}|}$ and $e_a \in \mathbb{R}^{|\mathcal{A}|}$ are the $s$-th and $a$-th basis vectors (all components are $0$ except for the $s$-th and $a$-th components, respectively, which are $1$). Note also that, under Assumption 1, $D$ is a nonsingular diagonal matrix with strictly positive diagonal entries.

For any stochastic policy, $\pi : \mathcal{S} \to \Delta_{|\mathcal{A}|}$, where $\Delta_{|\mathcal{A}|}$ is the set of all probability distributions over $\mathcal{A}$, we define the corresponding action transition matrix as

  $\Pi_{\pi} := \begin{bmatrix} \pi(1)^T \otimes e_1^T \\ \pi(2)^T \otimes e_2^T \\ \vdots \\ \pi(|\mathcal{S}|)^T \otimes e_{|\mathcal{S}|}^T \end{bmatrix} \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{S}||\mathcal{A}|}$,   (3)

where $e_s \in \mathbb{R}^{|\mathcal{S}|}$. Then it is well known that $P\Pi_{\pi} \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}| \times |\mathcal{S}||\mathcal{A}|}$ is the transition probability matrix of the state-action pair under policy $\pi$. For a deterministic policy, $\pi : \mathcal{S} \to \mathcal{A}$, the stochastic policy above is replaced with the corresponding one-hot encoding vector $\vec{\pi}(s) := e_{\pi(s)} \in \Delta_{|\mathcal{A}|}$, where $e_a \in \mathbb{R}^{|\mathcal{A}|}$, and the corresponding action transition matrix is identical to (3) with $\pi$ replaced by $\vec{\pi}$. For any given $Q \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$, denote the greedy policy with respect to $Q$ by $\pi_Q(s) := \arg\max_{a \in \mathcal{A}} Q(s, a) \in \mathcal{A}$, and use the shorthand $\Pi_{\pi_Q} =: \Pi_Q$.
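The sketch below assembles this notation for a small random MDP (sizes and seed are assumptions) and checks that $P\Pi_{\pi}$ is indeed row-stochastic, i.e., a valid transition matrix over state-action pairs.

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA = 3, 2                                          # assumed sizes
Pt = rng.random((nS, nA, nS)); Pt /= Pt.sum(axis=2, keepdims=True)
d = rng.random((nS, nA)); d /= d.sum()                 # positive d(s, a)

# Block layout: the entry of a stacked vector for (s, a) sits at index a*|S| + s,
# matching Q(s, a) = (e_a kron e_s)^T Q.
P = np.vstack([Pt[:, a, :] for a in range(nA)])              # (|S||A|, |S|)
D = np.diag(np.concatenate([d[:, a] for a in range(nA)]))    # diagonal D

def Pi(pi):
    """Action transition matrix Pi_pi in R^{|S| x |S||A|} for deterministic pi."""
    M = np.zeros((nS, nS * nA))
    for s in range(nS):
        M[s, pi[s] * nS + s] = 1.0
    return M

pi = rng.integers(nA, size=nS)                 # an arbitrary deterministic policy
PPi = P @ Pi(pi)                               # state-action transition matrix
print(np.allclose(PPi.sum(axis=1), 1.0))       # True: P * Pi_pi is row-stochastic
```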
The boundedness of the Q-learning iterates [20] plays an important role in our analysis.

Lemma 1 (Boundedness of Q-learning iterates [20]). If the step-size is less than one, then the Q-learning iterates satisfy

  $\|Q_k\|_{\infty} \le \frac{\max\{R_{\max},\, \max_{(s,a) \in \mathcal{S} \times \mathcal{A}} Q_0(s, a)\}}{1 - \gamma} =: Q_{\max}, \quad \forall k \ge 0$.

From Assumption 3 and Assumption 4, we immediately see that $Q_{\max} = \frac{1}{1 - \gamma}$.

III. CONVERGENCE OF Q-LEARNING FROM SWITCHING SYSTEM THEORY
In this section, we study a switching system model of Q-learning and prove its convergence based on switching system analysis. We consider the version of Q-learning given in Algorithm 1 for the i.i.d. sampling case and in Algorithm 2 for the Markovian trajectory case. Compared to the original Q-learning, the step-size $\alpha$ here does not depend on the state-action pair and is constant. We will prove that Algorithm 1 and Algorithm 2 converge to the optimal $Q^*$ with high probability. Both cases can be analyzed in a unified way; therefore, we only treat the i.i.d. case in detail.

Algorithm 1 Q-learning with a constant step-size (i.i.d. case)
  Initialize $Q_0 \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ randomly.
  for iteration $k = 0, 1, \ldots$ do
    Sample $(s_k, a_k) \sim d$
    Sample $s_k' \sim P(s_k, a_k, \cdot)$ and $r_k = r(s_k, a_k, s_k')$
    Update $Q_{k+1}(s_k, a_k) = Q_k(s_k, a_k) + \alpha\{r_k + \gamma \max_{u \in \mathcal{A}} Q_k(s_k', u) - Q_k(s_k, a_k)\}$
  end for

Algorithm 2 Q-learning with a constant step-size (Markovian case)
  Initialize $Q_0 \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ randomly.
  Sample $s_0 \sim p_{\infty}$
  for iteration $k = 0, 1, \ldots$ do
    Sample $a_k \sim \beta(\cdot \mid s_k)$
    Sample $s_{k+1} \sim P(s_k, a_k, \cdot)$ and $r_k = r(s_k, a_k, s_{k+1})$
    Update $Q_{k+1}(s_k, a_k) = Q_k(s_k, a_k) + \alpha\{r_k + \gamma \max_{u \in \mathcal{A}} Q_k(s_{k+1}, u) - Q_k(s_k, a_k)\}$
  end for
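A minimal Python sketch of Algorithm 1 follows; the representation of the Q-function as an $|\mathcal{S}| \times |\mathcal{A}|$ table and the simplification of the reward to a per-pair table $r(s,a)$ are assumptions made for illustration.

```python
import numpy as np

def q_learning_iid(Pt, r, gamma, d, alpha, num_iters, rng):
    """Algorithm 1: sample (s,a) ~ d, then s' ~ P(s,a,.), constant step-size."""
    nS, nA = r.shape
    Q = rng.uniform(-1.0, 1.0, (nS, nA))        # random init with ||Q_0||_inf <= 1
    p = d.ravel()                               # flattened sampling distribution
    for _ in range(num_iters):
        idx = rng.choice(nS * nA, p=p)          # (s_k, a_k) ~ d
        s, a = divmod(idx, nA)
        s2 = rng.choice(nS, p=Pt[s, a])         # s'_k ~ P(s_k, a_k, .)
        Q[s, a] += alpha * (r[s, a] + gamma * Q[s2].max() - Q[s, a])
    return Q
```

For instance, `q_learning_iid(Pt, r, 0.9, d, 0.05, 200000, np.random.default_rng(0))` returns an iterate close to $Q^*$ up to the constant error term analyzed below (here `Pt`, `r`, and `d` are the assumed MDP objects from the earlier sketches).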
A. Q-learning as a switched linear system

Using the notation introduced above, the update in Algorithm 1 can be rewritten as

  $Q_{k+1} = Q_k + \alpha\left\{ (e_{a_k} \otimes e_{s_k}) r_k + \gamma (e_{a_k} \otimes e_{s_k})(e_{s_k'})^T \max_{a \in \mathcal{A}} Q_k(\cdot, a) - (e_{a_k} \otimes e_{s_k})(e_{a_k} \otimes e_{s_k})^T Q_k \right\}$,

where $e_s \in \mathbb{R}^{|\mathcal{S}|}$ and $e_a \in \mathbb{R}^{|\mathcal{A}|}$ are the $s$-th and $a$-th basis vectors, respectively. The above update can be further expressed as

  $Q_{k+1} = Q_k + \alpha\{DR + \gamma D P \Pi_{Q_k} Q_k - D Q_k + w_k\}$,   (4)

where

  $w_k = (e_{a_k} \otimes e_{s_k}) r_k + \gamma (e_{a_k} \otimes e_{s_k})(e_{s_k'})^T \Pi_{Q_k} Q_k - (e_{a_k} \otimes e_{s_k})(e_{a_k} \otimes e_{s_k})^T Q_k - (DR + \gamma D P \Pi_{Q_k} Q_k - D Q_k)$,

and $(s_k, a_k, r_k, s_k')$ is the sample at the $k$-th time-step. Recall the definitions $\pi_Q(s) := \arg\max_{a \in \mathcal{A}} Q(s, a)$ and $\Pi_{\pi_Q} =: \Pi_Q$. Using the optimal Bellman equation $(\gamma D P \Pi_{Q^*} - D) Q^* + DR = 0$, (4) can be rewritten as

  $Q_{k+1} - Q^* = \underbrace{\{I + \alpha(\gamma D P \Pi_{Q_k} - D)\}}_{=: A_{Q_k}} (Q_k - Q^*) + \underbrace{\alpha \gamma D P (\Pi_{Q_k} - \Pi_{Q^*}) Q^*}_{=: b_{Q_k}} + \alpha w_k$,   (5)

which is a switched affine system with stochastic noise. More precisely, (5) can be represented as the stochastic switched affine system

  $Q_{k+1} - Q^* = A_{Q_k}(Q_k - Q^*) + b_{Q_k} + \alpha w_k$,   (6)

where $A_{Q_k}$ and $b_{Q_k}$ switch within the sets $\{I + \alpha(\gamma D P \Pi_{\pi} - D) : \pi \in \Theta\}$ and $\{\alpha \gamma D P (\Pi_{\pi} - \Pi_{\pi^*}) Q^* : \pi \in \Theta\}$, respectively. Therefore, the convergence of Q-learning reduces to proving the stability of the above switching system. The main obstacle in proving stability is the presence of the affine and stochastic terms; without these additional terms, stability can be proved easily as follows.
Lemma 2. Define $A_Q := I + \alpha(\gamma D P \Pi_Q - D)$. Then,

  $\|A_Q\|_{\infty} \le \rho = 1 - \alpha d_{\min}(1 - \gamma), \quad \forall Q \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$.
Proof. Note that

  $\|A_Q\|_{\infty} = \max_{i \in \{1, \ldots, |\mathcal{S}||\mathcal{A}|\}} \sum_{j \in \{1, \ldots, |\mathcal{S}||\mathcal{A}|\}} \left| [I - \alpha D + \alpha \gamma D P \Pi_Q]_{ij} \right|$

and

  $\sum_j \left| [I - \alpha D + \alpha \gamma D P \Pi_Q]_{ij} \right| = [I - \alpha D]_{ii} + \sum_j [\alpha \gamma D P \Pi_Q]_{ij} = 1 - \alpha [D]_{ii} + \alpha \gamma [D]_{ii} \sum_j [P \Pi_Q]_{ij} = 1 - \alpha [D]_{ii} + \alpha \gamma [D]_{ii} = 1 + \alpha [D]_{ii} (\gamma - 1)$,

where the first equality uses the fact that $A_Q$ is a nonnegative matrix, and the third uses that $P \Pi_Q$ is row-stochastic. Taking the maximum over $i$ yields

  $\|A_Q\|_{\infty} = \max_i \{1 + \alpha [D]_{ii}(\gamma - 1)\} = 1 - \alpha \min_{(s,a) \in \mathcal{S} \times \mathcal{A}} d(s, a)(1 - \gamma) = \rho$,

which completes the proof.
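A quick numerical check of Lemma 2 on a random instance (all values are assumptions): the $\infty$-norm of $A_Q$ for the greedy policy of an arbitrary $Q$ matches $\rho$ exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
nS, nA, gamma, alpha = 4, 3, 0.9, 0.1          # illustrative sizes and step-size
Pt = rng.random((nS, nA, nS)); Pt /= Pt.sum(axis=2, keepdims=True)
d = rng.random((nS, nA)); d /= d.sum()         # positive visit distribution d(s,a)

P = np.vstack([Pt[:, a, :] for a in range(nA)])              # stacked (|S||A|,|S|)
D = np.diag(np.concatenate([d[:, a] for a in range(nA)]))    # diagonal D

Q = rng.standard_normal(nS * nA)               # an arbitrary Q vector
pi_Q = Q.reshape(nA, nS).argmax(axis=0)        # greedy policy (block layout)
Pi = np.zeros((nS, nS * nA))                   # action transition matrix Pi_Q
for s in range(nS):
    Pi[s, pi_Q[s] * nS + s] = 1.0

A_Q = np.eye(nS * nA) + alpha * (gamma * D @ P @ Pi - D)
print(np.abs(A_Q).sum(axis=1).max())           # ||A_Q||_inf
print(1 - alpha * d.min() * (1 - gamma))       # rho: the two numbers coincide
```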
Proposition 1. The switching system

  $Q_{k+1} - Q^* = \underbrace{\{I + \alpha(\gamma D P \Pi_{H_k} - D)\}}_{=: A_{H_k}} (Q_k - Q^*), \quad Q_0 - Q^* \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$,

is exponentially stable for arbitrary $H_k \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$, $k \ge 0$.

Proof. Using the $\infty$-norm as a Lyapunov function, we have

  $\|Q_{k+1} - Q^*\|_{\infty} = \|A_{H_k}(Q_k - Q^*)\|_{\infty} \le \|A_{H_k}\|_{\infty} \|Q_k - Q^*\|_{\infty} \le \rho \|Q_k - Q^*\|_{\infty}, \quad k \ge 0$.

Iterating the above inequality yields $\|Q_k - Q^*\|_{\infty} \le \rho^k \|Q_0 - Q^*\|_{\infty}$ for all $k \ge 0$, proving the exponential stability of the switching system.

However, due to the affine term and the stochastic noise in the original form (6), the convergence proof of Q-learning is not trivial. The main idea of our analysis is to use the special structure of the Q-learning algorithm to bound the original Q-learning iterates from above and below. The overall idea of the proposed analysis is shown in Figure 1 and will be detailed in the next section.

Finally, based on Lemma 1, the variance of $w_k$, $\mathbb{E}[w_k^T w_k]$, can be bounded as follows.
Lemma 3. The variance of $w_k$ is bounded as

  $\mathbb{E}[w_k^T w_k] \le \frac{36 |\mathcal{S}||\mathcal{A}|}{(1 - \gamma)^2} =: W$.
Proof. To obtain a bound on the variance of $w_k$, we first bound $w_k$ itself:

  $\|w_k\|_{\infty} \le \|(e_{a_k} \otimes e_{s_k}) r_k - DR\|_{\infty} + \|\gamma (e_{a_k} \otimes e_{s_k})(e_{s_k'})^T \Pi_{Q_k} - \gamma D P \Pi_{Q_k}\|_{\infty} \|Q_k\|_{\infty} + \|(e_{a_k} \otimes e_{s_k})(e_{a_k} \otimes e_{s_k})^T - D\|_{\infty} \|Q_k\|_{\infty}$
  $\le 2 R_{\max} + \gamma \|(e_{a_k} \otimes e_{s_k})(e_{s_k'})^T - D P\|_{\infty} \|\Pi_{Q_k}\|_{\infty} Q_{\max} + 2 Q_{\max} \le 2 R_{\max} + 2\gamma Q_{\max} + 2 Q_{\max} \le \frac{6}{1 - \gamma}$,

where the last inequality follows from Assumption 3, Assumption 4, and Lemma 1, using $R_{\max} = 1 \le \frac{1}{1-\gamma}$ and $Q_{\max} = \frac{1}{1-\gamma}$. Using this bound, we conclude

  $\mathbb{E}[w_k^T w_k] = \mathbb{E}[\|w_k\|_2^2] \le \mathbb{E}[|\mathcal{S}||\mathcal{A}| \|w_k\|_{\infty}^2] \le \frac{36 |\mathcal{S}||\mathcal{A}|}{(1 - \gamma)^2}$.

This completes the proof.
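The following Monte-Carlo sketch (assumed MDP values; the reward is simplified to depend only on $(s,a)$) checks empirically that $w_k$ is conditionally zero-mean given $Q_k$ and that its second moment is finite, which is what the bound $W$ controls.

```python
import numpy as np

rng = np.random.default_rng(7)
nS, nA, gamma = 3, 2, 0.9
Pt = rng.random((nS, nA, nS)); Pt /= Pt.sum(axis=2, keepdims=True)
r = rng.uniform(-1, 1, (nS, nA))                 # bounded rewards, R_max = 1
d = np.full((nS, nA), 1.0 / (nS * nA))           # uniform d(s, a)
Q = rng.uniform(-1, 1, (nS, nA))                 # a fixed iterate Q_k

# Conditional mean of the sampled update direction: DR + gamma*D*P*Pi_Q*Q - D*Q
mean_dir = d * (r + gamma * Pt @ Q.max(axis=1) - Q)

n, acc, sq = 100000, np.zeros((nS, nA)), 0.0
for _ in range(n):
    s, a = rng.integers(nS), rng.integers(nA)    # (s_k, a_k) ~ d
    s2 = rng.choice(nS, p=Pt[s, a])              # s'_k ~ P(s_k, a_k, .)
    upd = np.zeros((nS, nA))
    upd[s, a] = r[s, a] + gamma * Q[s2].max() - Q[s, a]
    w = upd - mean_dir                           # the noise w_k from (4)
    acc += w; sq += float((w * w).sum())
print("max |empirical E[w]| :", np.abs(acc / n).max())   # ~ 0 (zero mean)
print("empirical E[w^T w]   :", sq / n)                  # finite, below W
```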
B. Lower comparison system
Consider the stochastic linear system

  $Q_{k+1}^L - Q^* = \{I + \alpha(\gamma D P \Pi_{Q^*} - D)\}(Q_k^L - Q^*) + \alpha w_k, \quad Q_0^L - Q^* \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$,   (7)

which we call the lower comparison system; the stochastic noise $w_k$ is shared with the original system (5). Its main property is that if $Q_0^L - Q^* \le Q_0 - Q^*$ initially, then $Q_k^L - Q^* \le Q_k - Q^*$ for all $k \ge 0$ (all inequalities between vectors are elementwise), as illustrated by the sketch below and formalized in Proposition 2.
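The sketch runs Q-learning and the recursion (7) in parallel on an assumed random MDP, feeding both with the same realized noise $w_k$, and checks that the elementwise ordering holds at every step.

```python
import numpy as np

rng = np.random.default_rng(4)
nS, nA, gamma, alpha = 3, 2, 0.9, 0.05
Pt = rng.random((nS, nA, nS)); Pt /= Pt.sum(axis=2, keepdims=True)
r = rng.uniform(-1, 1, (nS, nA))
d = np.full((nS, nA), 1.0 / (nS * nA))          # uniform d(s, a)

Qstar = np.zeros((nS, nA))
for _ in range(2000):                           # Q-value iteration for Q*
    Qstar = r + gamma * Pt @ Qstar.max(axis=1)
a_star = Qstar.argmax(axis=1)                   # optimal policy pi*

Q = np.ones((nS, nA))                           # Q-learning iterate Q_k
QL = Q - 1.0                                    # so QL_0 - Q* <= Q_0 - Q*
ok = True
for k in range(20000):
    s, a = rng.integers(nS), rng.integers(nA)   # (s_k, a_k) ~ d (uniform)
    s2 = rng.choice(nS, p=Pt[s, a])             # s'_k ~ P(s_k, a_k, .)
    upd = np.zeros((nS, nA))
    upd[s, a] = r[s, a] + gamma * Q[s2].max() - Q[s, a]
    w = upd - d * (r + gamma * Pt @ Q.max(axis=1) - Q)  # noise w_k, shared below
    Q = Q + alpha * upd                         # original Q-learning update
    E = QL - Qstar                              # lower system (7): frozen Pi_{Q*}
    QL = QL + alpha * (d * (gamma * Pt @ E[np.arange(nS), a_star] - E) + w)
    ok &= bool(np.all(QL - Qstar <= Q - Qstar + 1e-9))
print("elementwise ordering held at every step:", ok)
```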
Proposition 2. Suppose $Q_0^L - Q^* \le Q_0 - Q^*$. Then, $Q_k^L - Q^* \le Q_k - Q^*$ for all $k \ge 0$.

Proof. The proof proceeds by induction. Suppose $Q_k^L - Q^* \le Q_k - Q^*$ for some $k \ge 0$. Then,

  $Q_{k+1} - Q^* = (Q_k - Q^*) + \alpha D\{\gamma P \Pi_{Q_k} Q_k - \gamma P \Pi_{Q^*} Q^* - Q_k + Q^*\} + \alpha w_k$
  $= \{I + \alpha(\gamma D P \Pi_{Q^*} - D)\}(Q_k - Q^*) + \alpha \gamma D P (\Pi_{Q_k} - \Pi_{Q^*}) Q_k + \alpha w_k$
  $\ge \{I + \alpha(\gamma D P \Pi_{Q^*} - D)\}(Q_k - Q^*) + \alpha w_k$
  $\ge \{I + \alpha(\gamma D P \Pi_{Q^*} - D)\}(Q_k^L - Q^*) + \alpha w_k = Q_{k+1}^L - Q^*$,

where the first inequality uses $D P (\Pi_{Q_k} - \Pi_{Q^*}) Q_k \ge D P (\Pi_{Q^*} - \Pi_{Q^*}) Q_k = 0$, and the second uses the induction hypothesis $Q_k^L - Q^* \le Q_k - Q^*$ together with the fact that $I + \alpha(\gamma D P \Pi_{Q^*} - D)$ is a nonnegative matrix. The proof is completed by induction.

Since the lower comparison system is a stochastic linear system, its stability can be established through its mean dynamics

  $\mathbb{E}[Q_{k+1}^L - Q^*] = A \, \mathbb{E}[Q_k^L - Q^*], \qquad A := I + \alpha(\gamma D P \Pi_{Q^*} - D)$,

which is simply a linear system. By Proposition 1, one easily obtains the following result.
Proposition 3. The lower comparison system's state satisfies

  $\|\mathbb{E}[Q_k^L - Q^*]\|_{\infty} \le \rho^k \|Q_0^L - Q^*\|_{\infty}, \quad \forall k \ge 0$.   (8)

From this result, we conclude that $A$ is Schur, i.e., the magnitudes of all its eigenvalues are strictly less than one. From Lyapunov theory for linear systems, there then exist a positive definite matrix $P \succ 0$ and $\beta \in (0, 1)$ such that $A^T P A \preceq \beta P$. It is not immediately clear how $\beta \in (0, 1)$ can be determined; an important fact is that $\beta$ can be taken as $\beta = (\rho + \epsilon)^2$ for arbitrary $\epsilon > 0$. Since $\rho \in (0, 1)$, we can choose a sufficiently small $\epsilon > 0$ such that $(\rho + \epsilon)^2 \in (0, 1)$.
Proposition 4. For any $\epsilon > 0$ such that $\rho + \epsilon \in (0, 1)$, there exists a corresponding positive definite matrix $P \succ 0$ such that

  $A^T P A \preceq (\rho + \epsilon)^2 P, \qquad \lambda_{\min}(P) \ge 1, \qquad \lambda_{\max}(P) \le \frac{|\mathcal{S}||\mathcal{A}|}{1 - \left(\frac{\rho}{\rho + \epsilon}\right)^2}$.
Proof. We first show that such a $P$ is given by

  $P = \sum_{k=0}^{\infty} \left(\frac{1}{\rho + \epsilon}\right)^{2k} (A^k)^T A^k$.   (9)

Noting that

  $(\rho + \epsilon)^{-2} A^T P A + I = (\rho + \epsilon)^{-2} A^T \left( \sum_{k=0}^{\infty} (\rho + \epsilon)^{-2k} (A^k)^T A^k \right) A + I = P$,

we have $(\rho + \epsilon)^{-2} A^T P A \preceq (\rho + \epsilon)^{-2} A^T P A + I = P$, which is the desired matrix inequality. Next, it remains to show that $P$ exists by proving the boundedness of the series. Since $\rho + \epsilon \in (0, 1)$ and $\|A^k\|_{\infty} \le \rho^k$,

  $\|P\|_{\infty} \le \|I\|_{\infty} + (\rho + \epsilon)^{-2} \|A\|_{\infty}^2 + (\rho + \epsilon)^{-4} \|A^2\|_{\infty}^2 + \cdots \le 1 + \left(\frac{\rho}{\rho + \epsilon}\right)^2 + \left(\frac{\rho}{\rho + \epsilon}\right)^4 + \cdots = \frac{1}{1 - \left(\frac{\rho}{\rho + \epsilon}\right)^2}$.

Finally, we prove the bounds on the extreme eigenvalues. From the definition (9), $P \succeq I$, and hence $\lambda_{\min}(P) \ge 1$. On the other hand,

  $\lambda_{\max}(P) \le \lambda_{\max}(I) + (\rho + \epsilon)^{-2} \lambda_{\max}(A^T A) + (\rho + \epsilon)^{-4} \lambda_{\max}((A^2)^T A^2) + \cdots = 1 + (\rho + \epsilon)^{-2} \|A\|_2^2 + (\rho + \epsilon)^{-4} \|A^2\|_2^2 + \cdots \le 1 + |\mathcal{S}||\mathcal{A}| \left(\frac{\rho}{\rho + \epsilon}\right)^2 + |\mathcal{S}||\mathcal{A}| \left(\frac{\rho}{\rho + \epsilon}\right)^4 + \cdots \le \frac{|\mathcal{S}||\mathcal{A}|}{1 - \left(\frac{\rho}{\rho + \epsilon}\right)^2}$,

where we used $\|M\|_2^2 \le |\mathcal{S}||\mathcal{A}| \|M\|_{\infty}^2$. The proof is completed.

Fig. 1. Overall idea of the proposed analysis.

Using the Lyapunov matrix in Proposition 4, we can obtain a bound on $\mathbb{E}[\|Q_k^L - Q^*\|_{\infty}]$.

Proposition 5. For all $k \ge 0$,

  $\mathbb{E}[\|Q_k^L - Q^*\|_{\infty}] \le 2\left(\frac{1 + \rho}{2}\right)^k |\mathcal{S}|^{3/2} |\mathcal{A}|^{3/2} \sqrt{\frac{1}{\alpha d_{\min}(1 - \gamma)}}\, \|Q_0^L - Q^*\|_{\infty} + 12 |\mathcal{S}||\mathcal{A}| \sqrt{\frac{\alpha}{d_{\min}(1 - \gamma)^3}} =: \theta(k)$.
Proof. Define the Lyapunov function $V(x) = x^T P x$ with $P$ as in Proposition 4. Then,

  $\mathbb{E}[V(Q_{k+1}^L - Q^*)] = \mathbb{E}[(A(Q_k^L - Q^*) + \alpha w_k)^T P (A(Q_k^L - Q^*) + \alpha w_k)] = \mathbb{E}[V(A(Q_k^L - Q^*))] + \alpha^2 \mathbb{E}[w_k^T w_k] \le (\rho + \epsilon)^2 \mathbb{E}[V(Q_k^L - Q^*)] + \alpha^2 W$,

where $\epsilon > 0$ is such that $\rho + \epsilon < 1$, the cross term $2\alpha \mathbb{E}[w_k^T P A (Q_k^L - Q^*)]$ vanishes because $w_k$ is zero-mean conditioned on $Q_k^L$, and the last inequality follows from Proposition 4 and Lemma 3. Iterating the last inequality leads to

  $\mathbb{E}[V(Q_k^L - Q^*)] \le (\rho + \epsilon)^{2k} V(Q_0^L - Q^*) + \frac{\alpha^2 W}{1 - (\rho + \epsilon)^2}$.

Moreover, using $\lambda_{\min}(P)\|x\|_2^2 \le V(x) \le \lambda_{\max}(P)\|x\|_2^2$ and taking square roots yields

  $\underbrace{\sqrt{\lambda_{\min}(P)}\, \mathbb{E}[\|Q_k^L - Q^*\|_2]}_{\text{LHS}} \le \underbrace{\sqrt{(\rho + \epsilon)^{2k} \lambda_{\max}(P) \|Q_0^L - Q^*\|_2^2 + \frac{\alpha^2 W}{1 - (\rho + \epsilon)^2}}}_{\text{RHS}}$.

Using the subadditivity of the square root on the right-hand side, we have

  $\text{RHS} \le (\rho + \epsilon)^k \sqrt{\lambda_{\max}(P)}\, \|Q_0^L - Q^*\|_2 + \sqrt{\frac{\alpha^2 W}{1 - (\rho + \epsilon)^2}} \le (\rho + \epsilon)^k \sqrt{\lambda_{\max}(P) |\mathcal{S}||\mathcal{A}|}\, \|Q_0^L - Q^*\|_{\infty} + \sqrt{\frac{\alpha^2 W}{1 - (\rho + \epsilon)^2}}$,

while the concavity of the square root and Jensen's inequality give

  $\text{LHS} \ge \sqrt{\lambda_{\min}(P)}\, \mathbb{E}[\|Q_k^L - Q^*\|_2] \ge \sqrt{\lambda_{\min}(P)}\, \frac{1}{\sqrt{|\mathcal{S}||\mathcal{A}|}}\, \mathbb{E}[\|Q_k^L - Q^*\|_{\infty}]$.

Combining the last three inequalities and simplifying, then applying the bounds on $\lambda_{\min}(P)$ and $\lambda_{\max}(P)$ from Proposition 4, we obtain

  $\mathbb{E}[\|Q_k^L - Q^*\|_{\infty}] \le |\mathcal{S}|^{3/2} |\mathcal{A}|^{3/2} (\rho + \epsilon)^k \underbrace{\sqrt{\frac{1}{1 - \left(\frac{\rho}{\rho + \epsilon}\right)^2}}}_{=: \Phi_1} \|Q_0^L - Q^*\|_{\infty} + |\mathcal{S}|^{1/2} |\mathcal{A}|^{1/2} \underbrace{\sqrt{\frac{\alpha^2 W}{1 - (\rho + \epsilon)^2}}}_{=: \Phi_2}$.   (10)

Now choose $\epsilon = \frac{1 - \rho}{2}$, so that $\rho + \epsilon = \frac{1 + \rho}{2} < 1$. Then $\Phi_1$ simplifies to

  $\Phi_1 = \frac{1 + \rho}{\sqrt{(1 - \rho)(1 + 3\rho)}} \le \frac{2}{\sqrt{1 - \rho}} = 2\sqrt{\frac{1}{\alpha d_{\min}(1 - \gamma)}}$,

where the inequality uses $(1 + \rho)^2 \le 4(1 + 3\rho)$ for $\rho \in (0, 1)$, and the last equality uses $1 - \rho = \alpha d_{\min}(1 - \gamma)$. Similarly, $\Phi_2$ is bounded as

  $\Phi_2 = \sqrt{\frac{\alpha^2 W}{1 - \left(\frac{1 + \rho}{2}\right)^2}} \le \sqrt{\frac{2\alpha^2 W}{1 - \rho}} = \sqrt{\frac{2\alpha W}{d_{\min}(1 - \gamma)}} \le 12\sqrt{\frac{\alpha |\mathcal{S}||\mathcal{A}|}{d_{\min}(1 - \gamma)^3}}$,

where the first inequality uses $1 - \left(\frac{1 + \rho}{2}\right)^2 = \frac{(1 - \rho)(3 + \rho)}{4} \ge \frac{1 - \rho}{2}$, and the last uses the variance bound $W = \frac{36|\mathcal{S}||\mathcal{A}|}{(1 - \gamma)^2}$ from Lemma 3. Combining these bounds with (10) completes the proof.

For any $\alpha \in (0, 1)$, $\lim_{k \to \infty} \theta(k) = 12|\mathcal{S}||\mathcal{A}| \sqrt{\frac{\alpha}{d_{\min}(1 - \gamma)^3}}$. Moreover, we can prove that $\mathbb{E}[Q_k(s, a)] \ge Q^*(s, a) - \theta(k)$ holds. This implies that $Q_k$ tends to overestimate $Q^*$ as $k \to \infty$ up to the error bound $12|\mathcal{S}||\mathcal{A}| \sqrt{\frac{\alpha}{d_{\min}(1 - \gamma)^3}}$.
Proposition 6. At any $k \ge 0$, $\mathbb{E}[Q_k(s, a)] \ge Q^*(s, a) - \theta(k)$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$.

Proof. Noting that $Q_k - Q_k^L \ge 0$ by Proposition 2 and rearranging, we have $Q_k \ge Q^* - (Q^* - Q_k^L)$. Multiplying by $e_i^T$ from the left leads to

  $e_i^T Q_k \ge e_i^T Q^* - e_i^T (Q^* - Q_k^L) \ge e_i^T Q^* - |e_i^T (Q^* - Q_k^L)| \ge e_i^T Q^* - \|Q^* - Q_k^L\|_{\infty}$.

Taking the expectation and applying the bound in Proposition 5 yields the conclusion.
C. Upper comparison system
Consider the stochastic switched linear system

  $Q_{k+1}^U - Q^* = \underbrace{\{I + \alpha(\gamma D P \Pi_{Q_k} - D)\}}_{=: A_{Q_k}} (Q_k^U - Q^*) + \alpha w_k, \quad Q_0^U - Q^* \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$,   (11)

which we call the upper comparison system; the stochastic noise $w_k$ is again shared with the original system. Note that the system matrix $A_{Q_k}$ switches according to the evolution of $Q_k$, which in turn depends probabilistically on $Q_k^U$. Therefore, taking the expectation on both sides does not separate $A_{Q_k}$ from the state $Q_k^U - Q^*$, in contrast to the lower comparison system; this makes the stability of the upper comparison system much harder to analyze. Similar to the lower comparison system, its main property is that if $Q_0^U - Q^* \ge Q_0 - Q^*$ initially, then $Q_k^U - Q^* \ge Q_k - Q^*$ for all $k \ge 0$.
Proposition 7. Suppose $Q_0^U - Q^* \ge Q_0 - Q^*$. Then, $Q_k^U - Q^* \ge Q_k - Q^*$ for all $k \ge 0$.

Proof. The proof proceeds by induction. Suppose $Q_k^U - Q^* \ge Q_k - Q^*$ for some $k \ge 0$. Then,

  $Q_{k+1} - Q^* = (Q_k - Q^*) + \alpha D\{\gamma P \Pi_{Q_k} Q_k - \gamma P \Pi_{Q^*} Q^* - Q_k + Q^*\} + \alpha w_k$
  $= \{I + \alpha(\gamma D P \Pi_{Q_k} - D)\}(Q_k - Q^*) + \alpha D(\gamma P \Pi_{Q_k} Q^* - \gamma P \Pi_{Q^*} Q^*) + \alpha w_k$
  $\le \{I + \alpha(\gamma D P \Pi_{Q_k} - D)\}(Q_k - Q^*) + \alpha w_k$
  $\le \{I + \alpha(\gamma D P \Pi_{Q_k} - D)\}(Q_k^U - Q^*) + \alpha w_k = Q_{k+1}^U - Q^*$,

where the first inequality uses $D(\gamma P \Pi_{Q_k} Q^* - \gamma P \Pi_{Q^*} Q^*) \le D(\gamma P \Pi_{Q^*} Q^* - \gamma P \Pi_{Q^*} Q^*) = 0$, and the second uses the induction hypothesis $Q_k^U - Q^* \ge Q_k - Q^*$ together with the fact that $I + \alpha(\gamma D P \Pi_{Q_k} - D)$ is a nonnegative matrix. The proof is completed by induction.

In summary, as the three systems' trajectories evolve, the upper system's trajectory bounds the original system's trajectory from above, while the lower system's trajectory bounds it from below. In contrast to the lower comparison system, however, the convergence of the upper comparison system is hard to prove directly because its system matrix depends probabilistically on the state. To avoid this difficulty, we first subtract the lower comparison system from the upper comparison system to obtain an error system:

  $Q_{k+1}^U - Q_{k+1}^L = \{I + \alpha(\gamma D P \Pi_{Q_k} - D)\}(Q_k^U - Q^*) - \{I + \alpha(\gamma D P \Pi_{Q^*} - D)\}(Q_k^L - Q^*) = \{I + \alpha(\gamma D P \Pi_{Q_k} - D)\}(Q_k^U - Q_k^L) + \alpha \gamma D P (\Pi_{Q_k} - \Pi_{Q^*})(Q_k^L - Q^*)$,

where the stochastic noise $\alpha w_k$ cancels out. In summary, we have the error system

  $Q_{k+1}^U - Q_{k+1}^L = \underbrace{\{I + \alpha(\gamma D P \Pi_{Q_k} - D)\}}_{=: A_{Q_k}} (Q_k^U - Q_k^L) + \underbrace{\alpha \gamma D P (\Pi_{Q_k} - \Pi_{Q^*})}_{=: B_{Q_k}} (Q_k^L - Q^*)$,   (12)

where $(A_{Q_k}, B_{Q_k})$ switches according to the external signal $Q_k$, $Q_k^L - Q^*$ can be viewed as an external disturbance, and $Q_k^U - Q_k^L \ge 0$ holds for all $k \ge 0$. The overall scheme is as follows: if we can prove the stability of the error system, i.e., $Q_k^U - Q_k^L \to 0$ as $k \to \infty$, then since $Q_k^L \to Q^*$, we also have $Q_k^U \to Q^*$, as the sketch below illustrates.
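The following self-contained sketch (assumed MDP values) simulates the original iterate together with both comparison systems driven by the same noise, confirming the sandwich relation $Q_k^L \le Q_k \le Q_k^U$ and the decay of the error $Q_k^U - Q_k^L$.

```python
import numpy as np

rng = np.random.default_rng(5)
nS, nA, gamma, alpha = 3, 2, 0.9, 0.05
Pt = rng.random((nS, nA, nS)); Pt /= Pt.sum(axis=2, keepdims=True)
r = rng.uniform(-1, 1, (nS, nA))
d = np.full((nS, nA), 1.0 / (nS * nA))

Qstar = np.zeros((nS, nA))
for _ in range(2000):                     # Q-value iteration for Q*
    Qstar = r + gamma * Pt @ Qstar.max(axis=1)
a_star = Qstar.argmax(axis=1)

Q = np.zeros((nS, nA))
QL, QU = Q - 1.0, Q + 1.0                 # Q^L_0 <= Q_0 <= Q^U_0 elementwise
for k in range(20000):
    s, a = rng.integers(nS), rng.integers(nA)
    s2 = rng.choice(nS, p=Pt[s, a])
    upd = np.zeros((nS, nA))
    upd[s, a] = r[s, a] + gamma * Q[s2].max() - Q[s, a]
    w = upd - d * (r + gamma * Pt @ Q.max(axis=1) - Q)   # shared noise w_k
    a_Q = Q.argmax(axis=1)                # greedy policy of Q_k drives A_{Q_k}
    Q = Q + alpha * upd
    EL, EU = QL - Qstar, QU - Qstar
    QL = QL + alpha * (d * (gamma * Pt @ EL[np.arange(nS), a_star] - EL) + w)
    QU = QU + alpha * (d * (gamma * Pt @ EU[np.arange(nS), a_Q] - EU) + w)
print("sandwich QL <= Q <= QU holds:",
      bool(np.all(QL <= Q + 1e-9)) and bool(np.all(Q <= QU + 1e-9)))
print("final error ||QU - QL||_inf:", np.abs(QU - QL).max())
```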
D. Stability of the error system

In this subsection, we prove the stability of the error system derived above. The analysis is divided into two phases, as depicted in Figure 1. In the first phase, the lower comparison system converges sufficiently close to $Q^*$ after $N_1$ iterations. The algorithm then enters the second phase, where the upper comparison system tends to converge toward the lower comparison system. As can be seen in (12), once $Q_k^L$ is sufficiently close to $Q^*$, the second term becomes negligible and the first term dominates the overall evolution of the state, so Lemma 2 can be used to prove convergence. With this idea, we obtain the following result.
Proposition 8. After $N_1$ steps of the first phase and $N_2$ steps of the second phase, the expected error is bounded as

  $\mathbb{E}[\|Q_{N_1 + N_2} - Q^*\|_{\infty}] \le \left(\frac{1 + \rho}{2}\right)^{N_2} \frac{2}{1 - \gamma} + \left(\frac{1 + \rho}{2}\right)^{N_1} \frac{24\, d_{\max} |\mathcal{S}|^{3/2} |\mathcal{A}|^{3/2}}{d_{\min}^{3/2} (1 - \gamma)^{5/2}} \alpha^{-1/2} + \frac{72\, d_{\max} |\mathcal{S}||\mathcal{A}|}{d_{\min}^{3/2} (1 - \gamma)^{5/2}} \alpha^{1/2}$.   (13)
Proof. After the first $N_1$ time steps, taking the norm on both sides of the error system (12) gives, for all $i \ge 0$,

  $\|Q_{N_1 + i + 1}^U - Q_{N_1 + i + 1}^L\|_{\infty} \le \|I + \alpha(\gamma D P \Pi_{Q_{N_1 + i}} - D)\|_{\infty} \|Q_{N_1 + i}^U - Q_{N_1 + i}^L\|_{\infty} + \|\alpha \gamma D P (\Pi_{Q_{N_1 + i}} - \Pi_{Q^*})\|_{\infty} \|Q_{N_1 + i}^L - Q^*\|_{\infty} \le \frac{1 + \rho}{2} \|Q_{N_1 + i}^U - Q_{N_1 + i}^L\|_{\infty} + \|\alpha \gamma D P (\Pi_{Q_{N_1 + i}} - \Pi_{Q^*})\|_{\infty} \|Q_{N_1 + i}^L - Q^*\|_{\infty}$,

using $\|A_Q\|_{\infty} \le \rho \le \frac{1 + \rho}{2}$ from Lemma 2. Using the bound $\|D P (\Pi_{Q_{N_1 + i}} - \Pi_{Q^*})\|_{\infty} \le d_{\max} \|P(\Pi_{Q_{N_1 + i}} - \Pi_{Q^*})\|_{\infty} \le 2 d_{\max}$ and taking expectations yields

  $\mathbb{E}[\|Q_{N_1 + i + 1}^U - Q_{N_1 + i + 1}^L\|_{\infty}] \le \frac{1 + \rho}{2} \mathbb{E}[\|Q_{N_1 + i}^U - Q_{N_1 + i}^L\|_{\infty}] + 2\alpha \gamma d_{\max} \mathbb{E}[\|Q_{N_1 + i}^L - Q^*\|_{\infty}], \quad i \ge 0$.

Iterating this inequality $N_2$ times leads to

  $\mathbb{E}[\|Q_{N_1 + N_2}^U - Q_{N_1 + N_2}^L\|_{\infty}] \le \left(\frac{1 + \rho}{2}\right)^{N_2} \mathbb{E}[\|Q_{N_1}^U - Q_{N_1}^L\|_{\infty}] + \frac{2\alpha \gamma d_{\max}}{1 - \frac{1 + \rho}{2}} \max_{0 \le i \le N_2 - 1} \mathbb{E}[\|Q_{N_1 + i}^L - Q^*\|_{\infty}]$.

To obtain a bound in terms of $\|Q_{N_1 + N_2} - Q^*\|_{\infty}$, we note the following:
1) Since $0 \le Q_{N_1 + N_2} - Q_{N_1 + N_2}^L \le Q_{N_1 + N_2}^U - Q_{N_1 + N_2}^L$, we have $\mathbb{E}[\|Q_{N_1 + N_2} - Q_{N_1 + N_2}^L\|_{\infty}] \le \mathbb{E}[\|Q_{N_1 + N_2}^U - Q_{N_1 + N_2}^L\|_{\infty}]$, and the reverse triangle inequality gives $\mathbb{E}[\|Q_{N_1 + N_2} - Q_{N_1 + N_2}^L\|_{\infty}] \ge \mathbb{E}[\|Q_{N_1 + N_2} - Q^*\|_{\infty}] - \mathbb{E}[\|Q^* - Q_{N_1 + N_2}^L\|_{\infty}]$.
2) On the right-hand side, $\mathbb{E}[\|Q_{N_1}^U - Q_{N_1}^L\|_{\infty}] \le \mathbb{E}[\|Q_{N_1}^U - Q^*\|_{\infty}] + \mathbb{E}[\|Q^* - Q_{N_1}^L\|_{\infty}]$.
Combining the two results, we have

  $\mathbb{E}[\|Q_{N_1 + N_2} - Q^*\|_{\infty}] \le \left(\frac{1 + \rho}{2}\right)^{N_2} \mathbb{E}[\|Q_{N_1}^U - Q^*\|_{\infty}] + \left(\frac{1 + \rho}{2}\right)^{N_2} \mathbb{E}[\|Q^* - Q_{N_1}^L\|_{\infty}] + \mathbb{E}[\|Q_{N_1 + N_2}^L - Q^*\|_{\infty}] + \frac{2\alpha \gamma d_{\max}}{1 - \frac{1 + \rho}{2}} \max_{0 \le i \le N_2 - 1} \mathbb{E}[\|Q_{N_1 + i}^L - Q^*\|_{\infty}]$.   (14)

For the last term with the max operator, the bound in Proposition 5 gives $\max_{0 \le i \le N_2 - 1} \mathbb{E}[\|Q_{N_1 + i}^L - Q^*\|_{\infty}] \le \theta(N_1)$. Applying Proposition 5 again to (14) and using $1 - \frac{1 + \rho}{2} = \frac{\alpha d_{\min}(1 - \gamma)}{2}$, we conclude

  $\mathbb{E}[\|Q_{N_1 + N_2} - Q^*\|_{\infty}] \le \left(\frac{1 + \rho}{2}\right)^{N_2} \mathbb{E}[\|Q_{N_1}^U - Q^*\|_{\infty}] + \left(\frac{1 + \rho}{2}\right)^{N_2} \theta(N_1) + \theta(N_1 + N_2) + \frac{4\gamma d_{\max}}{d_{\min}(1 - \gamma)} \theta(N_1)$.   (15)

From Lemma 1, $-Q_{\max} \le Q_k \le Q_{\max}$ holds for all $k \ge 0$. Since $Q_{N_1}^U$ can be chosen arbitrarily subject to $Q_{N_1}^U \ge Q_{N_1}$, we may simply set $Q_{N_1}^U = Q_{\max}\mathbf{1}$, so that $\mathbb{E}[\|Q_{N_1}^U - Q^*\|_{\infty}] \le 2Q_{\max} = \frac{2}{1 - \gamma}$. Similarly, since $Q_0^L$ can be chosen arbitrarily under the constraint $Q_0^L \le Q_0$, we may set $Q_0^L = Q_0$, so that $\|Q_0^L - Q^*\|_{\infty} \le \frac{2}{1 - \gamma}$ by Assumption 4. Applying $Q_{\max} = \frac{1}{1 - \gamma}$ and $\|Q_0\|_{\infty} \le 1$, the triangle inequality, and simplifications, we obtain the desired conclusion.
Remark 1. Given the expected error bound on the Q-learning iterates in Proposition 8, we can obtain a high-probability bound for an $\varepsilon$-optimal solution, i.e., $\|Q_T - Q^*\|_{\infty} < \varepsilon$, using the Markov inequality. In particular, we obtain an upper bound on the sample/iteration complexity: to find an $\varepsilon$-optimal solution with probability at least $1 - \delta$, we need at most

  $\tilde{\mathcal{O}}\left(\frac{d_{\max}^2 |\mathcal{S}|^2 |\mathcal{A}|^2}{\delta^2 \varepsilon^2 d_{\min}^4 (1 - \gamma)^6}\right)$

samples. If the state-action pair is sampled uniformly over $\mathcal{S} \times \mathcal{A}$, then $d(s, a) = \frac{1}{|\mathcal{S}||\mathcal{A}|}$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$ and $d_{\min} = d_{\max} = \frac{1}{|\mathcal{S}||\mathcal{A}|}$; in this case, the sample complexity becomes $\tilde{\mathcal{O}}\left(\frac{|\mathcal{S}|^4 |\mathcal{A}|^4}{\delta^2 \varepsilon^2 (1 - \gamma)^6}\right)$. The proof is given in the Appendix.

Overall, the proposed analysis based on switching system models proves that Q-learning with a constant step-size converges exponentially up to a constant error term. Moreover, it reflects the well-known overestimation phenomenon in Q-learning caused by the maximization bias [14]. This observation is formalized in the following result.
Proposition 9. At any $k = N_1 + N_2 \ge 0$, where $N_1, N_2 \ge 0$, we have

  $Q^*(s, a) - \theta(N_1 + N_2) \le \mathbb{E}[Q_{N_1 + N_2}(s, a)] \le Q^*(s, a) + \xi(N_1, N_2) + \theta(N_1 + N_2)$

for all $(s, a) \in \mathcal{S} \times \mathcal{A}$, where

  $\xi(N_1, N_2) := \left(\frac{1 + \rho}{2}\right)^{N_2} \mathbb{E}[\|Q_{N_1}^U - Q^*\|_{\infty}] + \left(\frac{1 + \rho}{2}\right)^{N_2} \theta(N_1) + \frac{4\gamma d_{\max}}{d_{\min}(1 - \gamma)} \theta(N_1)$.
Proof. We start with (15) in the proof of Proposition 8. The inequality (15) implies $\mathbb{E}[e_i^T Q_{N_1 + N_2}] \le e_i^T Q^* + \xi(N_1, N_2) + \theta(N_1 + N_2)$ for all $i$. Combining it with Proposition 6 concludes the proof.

Roughly speaking, the lower system is a linear system, and its state usually converges faster than that of the upper system, which is a more complex switching system. This asymmetry is reflected in Proposition 9: during learning, the expected current estimate $\mathbb{E}[Q_k(s, a)]$ is bounded from below by $Q^*(s, a) - \theta(N_1 + N_2)$, while it is bounded from above by $Q^*(s, a) + \xi(N_1, N_2) + \theta(N_1 + N_2)$, which carries the additional gap $\xi(N_1, N_2)$. Although this does not ensure that $Q_k$ always overestimates $Q^*$, it provides an intuitive explanation of the overestimation phenomenon.

We also expect the proposed approach to offer new insights into the analysis of Q-learning based on control system theoretic arguments. The approach shows that Q-learning's structure allows it to be modeled as a stochastic switched linear system, whose structure in turn allows us to construct over- and underestimating comparison systems; from this analysis, Q-learning's convergence can be established in a simpler way. In the next section, simple numerical simulations illustrate and demonstrate the proposed analysis.

IV. EXAMPLE
Consider an MDP with $\mathcal{S} = \{1, 2\}$, $\mathcal{A} = \{1, 2\}$, discount factor $\gamma = 0.9$, transition matrices $P_1, P_2 \in \mathbb{R}^{2 \times 2}$, and a reward function with $r(2, 1) = 1$ and $r(1, 2) = 2$, while $r(1, 1)$ and $r(2, 2)$ are negative. Actions are sampled from a fixed behavior policy $\beta(\cdot \mid s)$, and states are sampled i.i.d. from a fixed distribution over $\mathcal{S}$, so that Assumption 1 holds.

Simulated trajectories of the switching system model of Q-learning (black solid lines) with a small constant step-size $\alpha$, of the lower comparison system (blue solid lines), and of the upper comparison system (red solid lines) are depicted in Figure 2. The figure demonstrates that the state of the switching system model of Q-learning, $Q_k - Q^*$, is underestimated by the lower comparison system's state $Q_k^L - Q^*$ and overestimated by the upper comparison system's state $Q_k^U - Q^*$. The evolution of the error between the upper and lower comparison systems is depicted in Figure 3; it demonstrates that the error $Q_k^U - Q_k^L$ converges to the origin. The simulation study empirically confirms the bounding relations predicted by the theory.

Under the same conditions, simulation results with the larger step-size $\alpha = 0.9$ are given in Figure 4 and Figure 5. Figure 4 shows that the evolutions of $Q_k - Q^*$ (black solid lines), the lower comparison system $Q_k^L - Q^*$ (blue solid lines), and the upper comparison system $Q_k^U - Q^*$ (red solid lines) are less stable with the larger step-size. Figure 5 likewise shows large fluctuations of the error $Q_k^U - Q_k^L$ with $\alpha = 0.9$.
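A reproduction sketch of this experiment is given below. Since only $\gamma$, $|\mathcal{S}| = |\mathcal{A}| = 2$, and part of the reward function are specified above, the transition matrices, the remaining rewards, the behavior policy, the state distribution, and the step-size in the code are stand-in assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
gamma, alpha, T = 0.9, 0.2, 20000
P1 = np.array([[0.7, 0.3], [0.4, 0.6]])      # P(., a=1, .)  (assumed values)
P2 = np.array([[0.2, 0.8], [0.9, 0.1]])      # P(., a=2, .)  (assumed values)
Pt = np.stack([P1, P2], axis=1)              # Pt[s, a, s']
r = np.array([[-1.0, 2.0],                   # r(1,1) < 0 (assumed -1), r(1,2) = 2
              [ 1.0, -1.0]])                 # r(2,1) = 1, r(2,2) < 0 (assumed -1)
beta = np.array([[0.3, 0.7], [0.6, 0.4]])    # behavior policy beta(a|s) (assumed)
ps = np.array([0.5, 0.5])                    # state distribution (assumed)
d = ps[:, None] * beta                       # d(s, a) = P[s] * beta(a|s)

Qstar = np.zeros((2, 2))
for _ in range(3000):                        # Q-value iteration for Q*
    Qstar = r + gamma * Pt @ Qstar.max(axis=1)
a_star = Qstar.argmax(axis=1)

Q = np.zeros((2, 2)); QL, QU = Q - 1.0, Q + 1.0
ok_lo = ok_hi = True
for k in range(T):
    s = rng.choice(2, p=ps); a = rng.choice(2, p=beta[s])
    s2 = rng.choice(2, p=Pt[s, a])
    upd = np.zeros((2, 2)); upd[s, a] = r[s, a] + gamma * Q[s2].max() - Q[s, a]
    w = upd - d * (r + gamma * Pt @ Q.max(axis=1) - Q)   # shared noise w_k
    a_Q = Q.argmax(axis=1)
    Q = Q + alpha * upd
    EL, EU = QL - Qstar, QU - Qstar
    QL = QL + alpha * (d * (gamma * Pt @ EL[np.arange(2), a_star] - EL) + w)
    QU = QU + alpha * (d * (gamma * Pt @ EU[np.arange(2), a_Q] - EU) + w)
    ok_lo &= bool(np.all(QL <= Q + 1e-9)); ok_hi &= bool(np.all(Q <= QU + 1e-9))
print("under/overestimation held:", ok_lo, ok_hi)
print("final ||QU - QL||_inf:", np.abs(QU - QL).max())
```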
V. CONCLUSION

In this paper, we have developed a novel framework for analyzing the convergence of the Q-learning algorithm by using its connections to dynamical systems. We have considered asynchronous Q-learning with a constant step-size and proved that its evolution can be naturally characterized by the trajectories of discrete-time stochastic switched linear systems, from which a new finite-time analysis of Q-learning has been derived. Moreover, by filling the gap between the two domains in a natural and synergistic way, the framework offers novel insights into Q-learning and can potentially enrich both fields. We expect that the proposed framework provides unified analysis methods for other variants of Q-learning, such as double Q-learning [14], averaging Q-learning [5], speedy Q-learning [10], and multi-agent Q-learning [15], which are potential future topics. The analysis applies to both the i.i.d. setting and the Markovian setting when the initial state distribution is already stationary under a behavior policy; generalizing it to an arbitrary initial state distribution under a mixing time assumption may require substantial further work and remains a potential future topic.
REFERENCES

[1] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[2] J. N. Tsitsiklis, "Asynchronous stochastic approximation and Q-learning," Machine Learning, vol. 16, no. 3, pp. 185–202, 1994.
[3] T. Jaakkola, M. I. Jordan, and S. P. Singh, "Convergence of stochastic iterative dynamic programming algorithms," in Advances in Neural Information Processing Systems, 1994, pp. 703–710.
[4] V. S. Borkar and S. P. Meyn, "The ODE method for convergence of stochastic approximation and reinforcement learning," SIAM Journal on Control and Optimization, vol. 38, no. 2, pp. 447–469, 2000.
[5] D. Lee and N. He, "A unified switching system perspective and convergence analysis of Q-learning algorithms," in Advances in Neural Information Processing Systems, 2020.
[6] G. Li, Y. Wei, Y. Chi, Y. Gu, and Y. Chen, "Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction," arXiv preprint arXiv:2006.03041, 2020.
[7] C. L. Beck and R. Srikant, "Error bounds for constant step-size Q-learning," Systems & Control Letters, vol. 61, no. 12, pp. 1203–1208, 2012.
[8] C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan, "Is Q-learning provably efficient?" in Advances in Neural Information Processing Systems, 2018, pp. 4863–4873.
[9] C. Szepesvári, "The asymptotic convergence-rate of Q-learning," in Advances in Neural Information Processing Systems, 1998, pp. 1064–1070.
[10] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. J. Kappen, "Speedy Q-learning," in Proceedings of the 24th International Conference on Neural Information Processing Systems, 2011, pp. 2411–2419.
[11] E. Even-Dar and Y. Mansour, "Learning rates for Q-learning," Journal of Machine Learning Research, vol. 5, no. Dec, pp. 1–25, 2003.
[12] M. J. Kearns and S. P. Singh, "Finite-sample convergence rates for Q-learning and indirect algorithms," in Advances in Neural Information Processing Systems, 1999, pp. 996–1002.
[13] H. Lin and P. J. Antsaklis, "Stability and stabilizability of switched linear systems: A survey of recent results," IEEE Transactions on Automatic Control, vol. 54, no. 2, pp. 308–322, 2009.
[14] H. V. Hasselt, "Double Q-learning," in Advances in Neural Information Processing Systems, 2010, pp. 2613–2621.
[15] S. Kar, J. M. Moura, and H. V. Poor, "QD-learning: A collaborative distributed strategy for multi-agent reinforcement learning through consensus + innovations," IEEE Transactions on Signal Processing, vol. 61, no. 7, pp. 1848–1862, 2013.
[16] M. J. Wainwright, "Stochastic approximation with cone-contractive operators: Sharp ℓ∞-bounds for Q-learning," arXiv preprint arXiv:1905.06265, 2019.
[17] G. Qu and A. Wierman, "Finite-time analysis of asynchronous stochastic approximation and Q-learning," arXiv preprint arXiv:2002.00260, 2020.
[18] Z. Chen, S. T. Maguluri, S. Shakkottai, and K. Shanmugam, "A Lyapunov theory for finite-sample guarantees of asynchronous Q-learning and TD-learning variants," arXiv preprint arXiv:2102.01567, 2021.
[19] H. K. Khalil, Nonlinear Systems. Upper Saddle River, NJ: Prentice-Hall, 2002.
[20] A. Gosavi, "Boundedness of iterates in Q-learning," Systems & Control Letters, vol. 55, no. 4, pp. 347–349, 2006.

APPENDIX
SAMPLE COMPLEXITY
Proposition 10 (Sample complexity). To achieve $\|Q_T - Q^*\|_{\infty} < \varepsilon$ with probability at least $1 - \delta$, the number of samples/iterations needed is at most

  $\tilde{\mathcal{O}}\left(\frac{d_{\max}^2 |\mathcal{S}|^2 |\mathcal{A}|^2}{\delta^2 \varepsilon^2 d_{\min}^4 (1 - \gamma)^6}\right)$.
Proof. For convenience, we first overestimate the right-hand side of (13) as

  $\mathbb{E}[\|Q_{N_1 + N_2} - Q^*\|_{\infty}] \le \left\{\left(\frac{1 + \rho}{2}\right)^{N_1} + \left(\frac{1 + \rho}{2}\right)^{N_2}\right\} \frac{24\, d_{\max} |\mathcal{S}|^{3/2} |\mathcal{A}|^{3/2}}{d_{\min}^{3/2} (1 - \gamma)^{5/2}} \alpha^{-1/2} + \frac{72\, d_{\max} |\mathcal{S}||\mathcal{A}|}{d_{\min}^{3/2} (1 - \gamma)^{5/2}} \alpha^{1/2} =: C$,

where the first term of (13) has been absorbed using $\frac{2}{1 - \gamma} \le \frac{24\, d_{\max} |\mathcal{S}|^{3/2}|\mathcal{A}|^{3/2}}{d_{\min}^{3/2}(1 - \gamma)^{5/2}} \alpha^{-1/2}$, and $N_1$ and $N_2$ are the total iteration numbers of phase one and phase two, respectively. Applying the Markov inequality $\mathbb{P}[\|Q_{N_1 + N_2} - Q^*\|_{\infty} \ge \varepsilon] \le \frac{C}{\varepsilon}$, we conclude that $\|Q_{N_1 + N_2} - Q^*\|_{\infty} < \varepsilon$ with probability at least $1 - \delta$, i.e., $\mathbb{P}[\|Q_{N_1 + N_2} - Q^*\|_{\infty} < \varepsilon] \ge 1 - \delta$, where $\delta = C / \varepsilon$ and $N_1$, $N_2$, and $\alpha$ are appropriately chosen so that $\delta \in (0, 1)$. Next, set $N_1 = N_2 = N$, so that the total number of iterations is $T = N_1 + N_2 = 2N$. To satisfy $\|Q_T - Q^*\|_{\infty} < \varepsilon$ with probability at least $1 - \delta$, it suffices to have

  $\delta \ge \underbrace{\left(\frac{1 + \rho}{2}\right)^{N} \frac{48\, d_{\max} |\mathcal{S}|^{3/2} |\mathcal{A}|^{3/2}}{\varepsilon\, d_{\min}^{3/2} (1 - \gamma)^{5/2}} \alpha^{-1/2}}_{=: \Phi_1} + \underbrace{\frac{72\, d_{\max} |\mathcal{S}||\mathcal{A}|}{\varepsilon\, d_{\min}^{3/2} (1 - \gamma)^{5/2}} \alpha^{1/2}}_{=: \Phi_2}$,

which is achieved if $\delta/2 \ge \Phi_1$ and $\delta/2 \ge \Phi_2$. The first inequality is satisfied if

  $\ln\left\{\frac{96\, d_{\max} |\mathcal{S}|^{3/2} |\mathcal{A}|^{3/2}}{\varepsilon \delta\, d_{\min}^{3/2} (1 - \gamma)^{5/2}} \alpha^{-1/2}\right\} \Big/ \ln\left(\frac{2}{1 + \rho}\right) \le N$,   (16)

and the second inequality holds if $\alpha^{1/2} \le \frac{\delta \varepsilon\, d_{\min}^{3/2} (1 - \gamma)^{5/2}}{144\, d_{\max} |\mathcal{S}||\mathcal{A}|}$. We simply set $\alpha^{1/2}$ to this value and plug it into (16). Ignoring the logarithmic term and using $\ln\left(\frac{2}{1 + \rho}\right) \ge 1 - \frac{1 + \rho}{2} = \frac{\alpha d_{\min}(1 - \gamma)}{2}$ (which follows from $\ln x \ge 1 - \frac{1}{x}$) to upper-bound the inverse of the logarithm, we obtain

  $N = \tilde{\mathcal{O}}\left(\frac{1}{\alpha d_{\min}(1 - \gamma)}\right) = \tilde{\mathcal{O}}\left(\frac{d_{\max}^2 |\mathcal{S}|^2 |\mathcal{A}|^2}{\delta^2 \varepsilon^2 d_{\min}^4 (1 - \gamma)^6}\right)$,

which is the desired conclusion.

Fig. 2. Evolution of $Q_k - Q^*$ (black solid lines), the lower comparison system $Q_k^L - Q^*$ (blue solid lines), and the upper comparison system $Q_k^U - Q^*$ (red solid lines) for each state-action pair, with a small constant step-size $\alpha$.

Fig. 3. Evolution of the error $Q_k^U - Q_k^L$ (black solid lines) for each state-action pair, with a small constant step-size $\alpha$.
Fig. 4. Evolution of $Q_k - Q^*$ (black solid lines), the lower comparison system $Q_k^L - Q^*$ (blue solid lines), and the upper comparison system $Q_k^U - Q^*$ (red solid lines) for each state-action pair, with step-size $\alpha = 0.9$.
Fig. 5. Evolution of the error $Q_k^U - Q_k^L$ (black solid lines) for each state-action pair, with step-size $\alpha = 0.9$.