Two Timescale Stochastic Approximation with Controlled Markov Noise and Off-policy Temporal Difference Learning
Prasenjit Karmakar and Shalabh Bhatnagar

April 15, 2019
Abstract
We present for the first time an asymptotic convergence analysis of two time-scale stochastic approximation driven by 'controlled' Markov noise. In particular, both the faster and slower recursions have non-additive controlled Markov noise components in addition to martingale difference noise. We analyze the asymptotic behavior of our framework by relating it to limiting differential inclusions in both time-scales that are defined in terms of the ergodic occupation measures associated with the controlled Markov processes. Finally, we present a solution to the off-policy convergence problem for temporal difference learning with linear function approximation, using our results.
Introduction

Stochastic approximation algorithms are sequential non-parametric methods for finding a zero or minimum of a function when only noisy observations of the function values are available. Two time-scale stochastic approximation algorithms represent one of the most general subclasses of stochastic approximation methods. These algorithms consist of two coupled recursions updated with different step sizes (one considerably smaller than the other), which in turn facilitates their convergence.

Two time-scale stochastic approximation algorithms [19] have been applied successfully to several complex problems arising in reinforcement learning, signal processing, and admission control in communication networks. There are many reinforcement learning applications (precisely those where the value function is parameterized) in which non-additive Markov noise is present in one or both iterates, requiring the current two time-scale framework to be extended to include Markov noise (for example, [13, p. 5] mentions that in order to generalize the analysis to Markov noise, the theory of two time-scale stochastic approximation needs to include the latter).

Here we present a more general framework of two time-scale stochastic approximation with "controlled" Markov noise, i.e., the noise is not simply Markov; rather, it is driven by the iterates and an additional control process as well. We analyze the asymptotic behaviour of our framework by relating it to limiting differential inclusions in both timescales that are defined in terms of the ergodic occupation measures associated with the controlled Markov processes. Next, using these results for the special case of our framework where the random processes are irreducible Markov chains, we present a solution to the off-policy convergence problem for temporal difference learning with linear function approximation. While the off-policy convergence problem for reinforcement learning (RL) with linear function approximation is one of the most interesting problems in the area, very few solutions are available in the current literature. One such work [4] shows the convergence of the least squares temporal difference learning algorithm with eligibility traces (LSTD($\lambda$)) as well as of the TD($\lambda$) algorithm. While the LSTD methods are not feasible when the dimension of the feature vector is large, off-policy TD($\lambda$) is shown to converge only when the eligibility function $\lambda \in [0, 1]$ is very close to 1. Another recent work [5] proves weak convergence of several emphatic temporal difference learning algorithms, which are also designed to solve the off-policy convergence problem. In [11, 12, 3] the gradient temporal difference learning (GTD) algorithms were proposed to solve this problem. However, the authors make the assumption that the data is available in the "off-policy" setting (i.e., the off-policy issue is incorporated into the data rather than into the algorithm), whereas, in reality, one has only the "on-policy" Markov trajectory corresponding to a given behaviour policy and is interested in designing an online learning algorithm. We use one of the algorithms from [3], called TDC with "importance-weighting", which takes the "on-policy" data as input, and show its convergence using the results we develop. Our convergence analysis can also be extended to the same algorithm with eligibility traces for a sufficiently large range of values of $\lambda$. Our results can further be used to provide a convergence analysis for reinforcement learning algorithms such as those in [6], for which convergence proofs have not been provided.

To the best of our knowledge, the related works [14, 16, 17, 15] analyze two time-scale stochastic approximation algorithms with iterate-dependent non-additive Markov noise. In all of them, the Markov noise in the recursion is handled using the classic Poisson equation based approach of [1, 10], which has been applied to the asymptotic analysis of many algorithms used in machine learning, system identification, signal processing, image analysis and automatic control. We show, however, that our method also works when there is an additional control process and when the underlying Markov process has non-unique stationary distributions. Further, the application we consider does not require strong assumptions such as aperiodicity of the underlying Markov chain, which is a sufficient condition when the Poisson equation based approach is used [2, 14]. Additionally, our assumptions differ considerably from those made in the literature mentioned above; we give a detailed comparison in Section 2.2.

The organization of the paper is as follows: Section 2 formally defines the problem and provides background and assumptions. Section 3 presents the main results. Section 4 discusses how one of the assumptions of Section 2 can be relaxed. Section 5 presents an application of our results to the off-policy convergence problem for temporal difference learning with linear function approximation. Finally, we conclude by providing some future research directions.

In the following we describe the preliminaries and notation used in our proofs. Most of the definitions and notation are from [9, 21, 7].
Definition and Notation
Let $F$ denote a set-valued function mapping each point $\theta \in \mathbb{R}^m$ to a set $F(\theta) \subset \mathbb{R}^m$. $F$ is called a Marchaud map if the following hold:

(i) $F$ is upper-semicontinuous in the sense that if $\theta_n \to \theta$ and $w_n \to w$ with $w_n \in F(\theta_n)$ for all $n \geq 1$, then $w \in F(\theta)$. In other words, the graph of $F$, defined as $\{(\theta, w) : w \in F(\theta)\}$, is closed.

(ii) $F(\theta)$ is a non-empty compact convex subset of $\mathbb{R}^m$ for all $\theta \in \mathbb{R}^m$.

(iii) There exists $c > 0$ such that for all $\theta \in \mathbb{R}^m$,
$$\sup_{z \in F(\theta)} \|z\| \leq c(1 + \|\theta\|),$$
where $\|\cdot\|$ denotes any norm on $\mathbb{R}^m$.

A solution of the differential inclusion (d.i.)
$$\dot{\theta}(t) \in F(\theta(t)) \quad (1)$$
with initial point $\theta_0 \in \mathbb{R}^m$ is an absolutely continuous (on compacts) mapping $\theta : \mathbb{R} \to \mathbb{R}^m$ such that $\theta(0) = \theta_0$ and $\dot{\theta}(t) \in F(\theta(t))$ for almost every $t \in \mathbb{R}$. If $F$ is a Marchaud map, it is well known that (1) has solutions (possibly non-unique) through every initial point. The differential inclusion (1) induces a set-valued dynamical system $\{\Phi_t\}_{t \in \mathbb{R}}$ defined by
$$\Phi_t(\theta_0) = \{\theta(t) : \theta(\cdot) \text{ is a solution to (1) with } \theta(0) = \theta_0\}.$$

Consider the autonomous ordinary differential equation (o.d.e.)
$$\dot{\theta}(t) = h(\theta(t)), \quad (2)$$
where $h$ is Lipschitz continuous. One can write (2) in the format of (1) by taking $F(\theta) = \{h(\theta)\}$. It is well known that (2) is well-posed, i.e., it has a unique solution for every initial point. Hence the set-valued dynamical system induced by the o.d.e., or flow, is $\{\Phi_t\}_{t \in \mathbb{R}}$ with $\Phi_t(\theta_0) = \{\theta(t)\}$, where $\theta(\cdot)$ is the solution to (2) with $\theta(0) = \theta_0$. It is also well known that $\Phi_t(\cdot)$ is a continuous function for all $t \in \mathbb{R}$.

A set $A \subset \mathbb{R}^m$ is said to be invariant (for $F$) if for all $\theta_0 \in A$ there exists a solution $\theta(\cdot)$ of (1) with $\theta(0) = \theta_0$ such that $\theta(\mathbb{R}) \subset A$.

Given a set $A \subset \mathbb{R}^m$ and $\theta'', w'' \in A$, we write $\theta'' \hookrightarrow_A w''$ if for every $\epsilon > 0$ and $T > 0$ there exist $n \in \mathbb{N}$, solutions $\theta_1(\cdot), \ldots, \theta_n(\cdot)$ to (1) and real numbers $t_1, t_2, \ldots, t_n$ greater than $T$ such that

(i) $\theta_i(s) \in A$ for all $0 \leq s \leq t_i$ and for all $i = 1, \ldots, n$,

(ii) $\|\theta_i(t_i) - \theta_{i+1}(0)\| \leq \epsilon$ for all $i = 1, \ldots, n-1$,

(iii) $\|\theta_1(0) - \theta''\| \leq \epsilon$ and $\|\theta_n(t_n) - w''\| \leq \epsilon$.

The sequence $(\theta_1(\cdot), \ldots, \theta_n(\cdot))$ is called an $(\epsilon, T)$ chain (in $A$ from $\theta''$ to $w''$) for $F$. A set $A \subset \mathbb{R}^m$ is said to be internally chain transitive provided that $A$ is compact and $\theta'' \hookrightarrow_A w''$ for all $\theta'', w'' \in A$. It can be proved that in this case $A$ is an invariant set.

A compact invariant set $A$ is called an attractor for $\Phi$ provided that there is a neighbourhood $U$ of $A$ (i.e., for the induced topology) with the property that $d(\Phi_t(\theta''), A) \to 0$ as $t \to \infty$ uniformly in $\theta'' \in U$. Here $d(X, Y) = \sup_{\theta'' \in X} \inf_{w'' \in Y} \|\theta'' - w''\|$ for $X, Y \subset \mathbb{R}^m$. Such a $U$ is called a fundamental neighbourhood of the attractor $A$. An attractor of a well-posed o.d.e. is an attractor for the set-valued dynamical system induced by the o.d.e.

The set
$$\omega_\Phi(\theta'') = \bigcap_{t \geq 0} \overline{\Phi_{[t, \infty)}(\theta'')}$$
is called the $\omega$-limit set of a point $\theta'' \in \mathbb{R}^m$. If $A$ is a set, then $B(A) = \{\theta'' \in \mathbb{R}^m : \omega_\Phi(\theta'') \subset A\}$ denotes its basin of attraction. A global attractor for $\Phi$ is an attractor $A$ whose basin of attraction consists of all of $\mathbb{R}^m$.
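As a concrete toy illustration of these notions (our own example; it does not appear in [9, 21, 7]), consider on $\mathbb{R}$ the set-valued map obtained by convexifying the discontinuous vector field $-\mathrm{sign}(\theta)$:
$$F(\theta) = \begin{cases} \{-1\}, & \theta > 0, \\ [-1, 1], & \theta = 0, \\ \{+1\}, & \theta < 0. \end{cases}$$
Its graph is closed, its values are non-empty, compact and convex, and $\sup_{z \in F(\theta)} \|z\| \leq 1 \leq 1 + \|\theta\|$, so $F$ is a Marchaud map with $c = 1$. Through every initial point $\theta_0$ the d.i. $\dot{\theta}(t) \in F(\theta(t))$ admits the solution $\theta(t) = \mathrm{sign}(\theta_0)\max(|\theta_0| - t, 0)$, which reaches $0$ in finite time and stays there; $\{0\}$ is a global attractor, and it is also the only internally chain transitive set, consistent with Lemma 2.1 below.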
The following lemma will be useful in our proofs; see [9] for a proof.

Lemma 2.1. Suppose $\Phi$ has a global attractor $A$. Then every internally chain transitive set lies in $A$.

We also require another result, which will be useful in applying our results to the RL application we mentioned. Before stating it we recall some definitions from Appendix 11.2.3 of [21]: A point $\theta^* \in \mathbb{R}^m$ is called Lyapunov stable for the o.d.e. (2) if for all $\epsilon > 0$ there exists a $\delta > 0$ such that every trajectory starting in the $\delta$-neighbourhood of $\theta^*$ remains in its $\epsilon$-neighbourhood. $\theta^*$ is called globally asymptotically stable if $\theta^*$ is Lyapunov stable and all trajectories of the o.d.e. converge to it.

Lemma 2.2.
Consider the autonomous o.d.e. $\dot{\theta}(t) = h(\theta(t))$, where $h$ is Lipschitz continuous. Let $\theta^*$ be globally asymptotically stable. Then $\theta^*$ is the global attractor of the o.d.e.

Proof 1.
We refer the reader to Lemma 1 of [21, Chapter 3] for a proof.
We end this subsection with notation that will be used frequently in the convergence statements of the following sections.
Definition 2.1.
For a function $\theta(\cdot)$ defined on $[0, \infty)$, the notation "$\theta(t) \to A$ as $t \to \infty$" means that $\bigcap_{t \geq 0} \overline{\{\theta(s) : s \geq t\}} \subset A$. A similar definition applies for a sequence $\{\theta_n\}$.

Problem Definition
Our goal is to perform an asymptotic analysis of the following coupled recursions:
$$\theta_{n+1} = \theta_n + a(n)\left[h(\theta_n, w_n, Z^{(1)}_n) + M^{(1)}_{n+1}\right], \quad (3)$$
$$w_{n+1} = w_n + b(n)\left[g(\theta_n, w_n, Z^{(2)}_n) + M^{(2)}_{n+1}\right], \quad (4)$$
where $\theta_n \in \mathbb{R}^d$, $w_n \in \mathbb{R}^k$, $n \geq 0$, and $\{Z^{(i)}_n\}$, $\{M^{(i)}_n\}$, $i = 1, 2$, are random processes described below. We make the following assumptions:

(A1) $\{Z^{(i)}_n\}$ takes values in a compact metric space $S^{(i)}$, $i = 1, 2$. Additionally, the processes $\{Z^{(i)}_n\}$, $i = 1, 2$, are controlled Markov processes, controlled by the iterate sequences $\{\theta_m\}$, $\{w_m\}$ and a random process $\{A^{(i)}_n\}$ taking values in a compact metric space $U^{(i)}$, respectively, with their individual dynamics specified by
$$P(Z^{(i)}_{n+1} \in B^{(i)} \mid Z^{(i)}_m, A^{(i)}_m, \theta_m, w_m, m \leq n) = \int_{B^{(i)}} p^{(i)}(dy \mid Z^{(i)}_n, A^{(i)}_n, \theta_n, w_n), \quad n \geq 0,$$
for $B^{(i)}$ Borel in $S^{(i)}$, $i = 1, 2$, respectively.
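To make the two time-scale structure concrete, the following minimal Python sketch simulates recursions (3)-(4) for a toy problem. The functions playing the roles of $h$ and $g$, the noise chain, and the step sizes are illustrative placeholders of our own choosing (they are not part of the framework above); the essential point is that $b(n)$ dominates $a(n)$, so $\{w_n\}$ equilibrates on the faster time-scale while $\{\theta_n\}$ moves slowly.

```python
import numpy as np

# Minimal sketch of the coupled recursions (3)-(4), with illustrative
# choices (ours): h(theta, w, z) = -w, g(theta, w, z) = theta - w + noise(z),
# where z evolves as a two-state Markov chain standing in for the (here
# uncontrolled) Markov noise, plus i.i.d. martingale-difference noise.
rng = np.random.default_rng(0)

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])       # transition matrix of the noise chain {0, 1}
theta, w, z = 5.0, -5.0, 0

for n in range(200000):
    a = 1.0 / (n + 1)             # slower step size a(n)
    b = 1.0 / (n + 1) ** (2 / 3)  # faster step size b(n); a(n)/b(n) -> 0
    z = rng.choice(2, p=P[z])     # Markov noise placeholder
    noise = 0.1 * (z - 0.5)
    M1, M2 = rng.normal(scale=0.1, size=2)
    theta += a * (-w + M1)                 # slow iterate, cf. (3)
    w += b * (theta - w + noise + M2)      # fast iterate, cf. (4)

# On the fast timescale w tracks lambda(theta) = theta + c (c the stationary
# mean of the noise term); on the slow timescale theta then converges to -c.
print(theta, w)
```

The step sizes $a(n) = 1/(n+1)$ and $b(n) = 1/(n+1)^{2/3}$ used here satisfy the standard two time-scale conditions stated in (A4) below.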
Remark 1. In this context one should note that [1, 10] require the Markov process to take values in a normed Polish space.

Remark 2. In [20] it is assumed that the state space where the controlled Markov process takes values is Polish. This space is then compactified using the fact that a Polish space can be homeomorphically embedded into a dense subset of a compact metric space. The vector field $h(\cdot, \cdot) : \mathbb{R}^d \times S \to \mathbb{R}^d$ is considered bounded when the first component lies in a compact set. This would, however, require a continuous extension of $h' : \mathbb{R}^d \times \phi(S) \to \mathbb{R}^d$, defined by $h'(x, s') = h(x, \phi^{-1}(s'))$, to $\mathbb{R}^d \times \overline{\phi(S)}$. Here $\phi(\cdot)$ is the homeomorphism defined by $\phi(s) = (\rho(s, s_1), \rho(s, s_2), \ldots) \in [0, \infty)^\infty$, where $\{s_i\}$ is a countable dense subset and $\rho$ the metric of the Polish space. A sufficient condition for the above is that $h'$ be uniformly continuous [22, Ex. 13, p. 99]. However, this is hard to verify. This is the main motivation for us to take the range of the Markov process to be compact for our problem. There are also other reasons for taking a compact state space, which will become clear in the proofs of this section and the next.

(A2) $h : \mathbb{R}^{d+k} \times S^{(1)} \to \mathbb{R}^d$ is jointly continuous as well as Lipschitz in its first two arguments uniformly w.r.t. the third. The latter condition means that for all $z^{(1)} \in S^{(1)}$,
$$\|h(\theta, w, z^{(1)}) - h(\theta', w', z^{(1)})\| \leq L^{(1)}(\|\theta - \theta'\| + \|w - w'\|).$$
The same holds for $g$, with Lipschitz constant $L^{(2)}$. Note that the Lipschitz constant $L^{(i)}$ does not depend on $z^{(i)}$ for $i = 1, 2$.
Remark 3. We later relax the uniformity of the Lipschitz constant w.r.t. the Markov process state space by imposing suitable moment assumptions on the Markov process.

(A3) $\{M^{(i)}_n\}$, $i = 1, 2$, are martingale difference sequences w.r.t. the increasing $\sigma$-fields
$$\mathcal{F}_n = \sigma(\theta_m, w_m, M^{(i)}_m, Z^{(i)}_m, m \leq n, i = 1, 2), \quad n \geq 0,$$
satisfying $E[\|M^{(i)}_{n+1}\|^2 \mid \mathcal{F}_n] \leq K(1 + \|\theta_n\| + \|w_n\|)^2$, $i = 1, 2$, for $n \geq 0$ and a given constant $K > 0$.

(A4) The step sizes $\{a(n)\}$, $\{b(n)\}$ are positive scalars satisfying
$$\sum_n a(n) = \sum_n b(n) = \infty, \quad \sum_n \left(a(n)^2 + b(n)^2\right) < \infty, \quad \frac{a(n)}{b(n)} \to 0.$$
Moreover, $a(n), b(n), n \geq 0$, are eventually non-increasing. For example, $a(n) = 1/(n+1)$ and $b(n) = 1/(n+1)^{2/3}$ satisfy all of these conditions.

Clearly, $p^{(i)}$, $i = 1, 2$, take values in the space of probability measures on the respective state spaces. Here we mention the definitions and main results on the space $\mathcal{P}(S)$ of probability measures on a compact metric space $S$ that we use in our proofs (details can be found in Chapter 2 of [18]). We denote the metric by $d$; it is defined as
$$d(\mu, \nu) = \sum_j 2^{-j} \left| \int f_j \, d\mu - \int f_j \, d\nu \right|, \quad \mu, \nu \in \mathcal{P}(S),$$
where $\{f_j\}$ is countable dense in the unit ball of $C(S)$. Then the following are equivalent:

(i) $d(\mu_n, \mu) \to 0$,

(ii) for all bounded $f$ in $C(S)$, $\int_S f \, d\mu_n \to \int_S f \, d\mu$, (5)

(iii) for all $f$ bounded and uniformly continuous, $\int_S f \, d\mu_n \to \int_S f \, d\mu$.

Hence we see that $d(\mu_n, \mu) \to 0$ if and only if $\int_S f_j \, d\mu_n \to \int_S f_j \, d\mu$ for all $j$. Any such sequence of functions $\{f_j\}$ is called a convergence determining class in $\mathcal{P}(S)$. We also sometimes denote $d(\mu_n, \mu) \to 0$ by $\mu_n \Rightarrow \mu$.

Also, we recall the characterization of relative compactness in $\mathcal{P}(S)$, which relies on the notion of tightness.
A set $A \subset \mathcal{P}(S)$ is tight if for any $\epsilon > 0$ there exists a compact $K_\epsilon \subset S$ such that $\mu(K_\epsilon) > 1 - \epsilon$ for all $\mu \in A$. Clearly, if $S$ is compact then any $A \subset \mathcal{P}(S)$ is tight. By Prohorov's theorem, $A \subset \mathcal{P}(S)$ is relatively compact if and only if it is tight.

With the above definitions we assume the following:

(A5)
The map $S^{(i)} \times U^{(i)} \times \mathbb{R}^{d+k} \ni (z^{(i)}, a^{(i)}, \theta, w) \to p^{(i)}(dy \mid z^{(i)}, a^{(i)}, \theta, w) \in \mathcal{P}(S^{(i)})$ is continuous.

Remark 4. (A5) is much simpler than the assumptions on the $n$-step transition kernel in [1, Part II, Chap. 2, Theorem 6]. Additionally, unlike [20, p. 140, line 13], we do not require the extra assumption that the continuity in the $\theta$ variable of $p(dy \mid z, a, \theta)$ be uniform on compacts w.r.t. the other variables.

For $\theta_n \equiv \theta$, $w_n \equiv w$ for all $n$, with a fixed deterministic $(\theta, w) \in \mathbb{R}^{d+k}$, and under any stationary randomized control $\pi^{(i)}$, it follows from Lemma 2.1 and Lemma 3.1 of [20] that the time-homogeneous Markov processes $Z^{(i)}_n$, $i = 1, 2$, have (possibly non-unique) stationary distributions $\Psi^{(i)}_{\theta, w, \pi^{(i)}}$, $i = 1, 2$.
Now, it is well known that the ergodic occupation measure defined as
$$\Psi^{(i)}_{\theta, w, \pi^{(i)}}(dz, da) := \Psi^{(i)}_{\theta, w, \pi^{(i)}}(dz)\, \pi^{(i)}(z, da) \in \mathcal{P}(S^{(i)} \times U^{(i)})$$
satisfies the following:
$$\int_{S^{(i)}} f^{(i)}(z)\, \Psi^{(i)}_{\theta, w, \pi^{(i)}}(dz, U^{(i)}) = \int_{S^{(i)} \times U^{(i)}} \int_{S^{(i)}} f^{(i)}(y)\, p^{(i)}(dy \mid z, a, \theta, w)\, \Psi^{(i)}_{\theta, w, \pi^{(i)}}(dz, da) \quad (6)$$
for $f^{(i)} \in C_b(S^{(i)})$.

We denote by $D^{(i)}(\theta, w)$, $i = 1, 2$, the set of all such ergodic occupation measures for the prescribed $\theta$ and $w$. In the following we prove some properties of the map $(\theta, w) \to D^{(i)}(\theta, w)$.
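For orientation, in the special case needed for the application in Section 5, where $S^{(i)}$ is finite, there is no control process, and the chain is irreducible, relation (6) reduces to the familiar stationarity condition (this reduction is our own unpacking of (6) under those assumptions):
$$\sum_{z \in S} f(z)\,\nu(z) = \sum_{z \in S}\sum_{y \in S} f(y)\, p(y \mid z)\, \nu(z) \;\;\forall f \;\Longleftrightarrow\; \nu^\top P = \nu^\top,$$
where $P = (p(y \mid z))_{z,y}$ is the transition matrix; the unique stationary distribution $\nu$ then makes $D^{(i)}(\theta, w)$ a singleton.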
Lemma 2.3. For all $(\theta, w)$, $D^{(i)}(\theta, w)$ is convex and compact.

Proof 2.
The proof trivially follows from (A1), (A5) and (6).

Lemma 2.4.
The map $(\theta, w) \to D^{(i)}(\theta, w)$ is upper-semicontinuous.

Proof 3.
Let $\theta_n \to \theta$, $w_n \to w$ and $\Psi^{(i)}_n \Rightarrow \Psi^{(i)} \in \mathcal{P}(S^{(i)} \times U^{(i)})$ with $\Psi^{(i)}_n \in D^{(i)}(\theta_n, w_n)$. Let
$$g^{(i)}_n(z, a) = \int_{S^{(i)}} f^{(i)}(y)\, p^{(i)}(dy \mid z, a, \theta_n, w_n), \qquad g^{(i)}(z, a) = \int_{S^{(i)}} f^{(i)}(y)\, p^{(i)}(dy \mid z, a, \theta, w).$$
From (6) we get that
$$\int_{S^{(i)}} f^{(i)}(z)\, \Psi^{(i)}(dz, U^{(i)}) = \lim_{n \to \infty} \int_{S^{(i)}} f^{(i)}(z)\, \Psi^{(i)}_n(dz, U^{(i)}) = \lim_{n \to \infty} \int_{S^{(i)} \times U^{(i)}} g^{(i)}_n(z, a)\, \Psi^{(i)}_n(dz, da).$$
Now, $p^{(i)}(dy \mid z, a, \theta_n, w_n) \Rightarrow p^{(i)}(dy \mid z, a, \theta, w)$ implies $g^{(i)}_n(\cdot, \cdot) \to g^{(i)}(\cdot, \cdot)$ pointwise. We prove that the convergence is in fact uniform. It is enough to prove that this sequence of functions is equicontinuous; together with pointwise convergence this implies uniform convergence on compacts [22, p. 168, Ex. 16]. This is also a place where (A1) is used.

Define $g' : S^{(i)} \times U^{(i)} \times \mathbb{R}^{d+k} \to \mathbb{R}$ by $g'(z', a', \theta', w') = \int_{S^{(i)}} f^{(i)}(y)\, p^{(i)}(dy \mid z', a', \theta', w')$. Then $g'$ is continuous. Let $A = S^{(i)} \times U^{(i)} \times (\{\theta_n\} \cup \theta) \times (\{w_n\} \cup w)$. Then $A$ is compact and $g'|_A$ is uniformly continuous. This implies that for all $\epsilon > 0$ there exists $\delta > 0$ such that if $\rho'(s_1, s_2) < \delta$, $\mu'(a_1, a_2) < \delta$, $\|\theta_1 - \theta_2\| < \delta$ and $\|w_1 - w_2\| < \delta$, then $|g'(s_1, a_1, \theta_1, w_1) - g'(s_2, a_2, \theta_2, w_2)| < \epsilon$, where $s_1, s_2 \in S^{(i)}$, $a_1, a_2 \in U^{(i)}$, $\theta_1, \theta_2 \in (\{\theta_n\} \cup \theta)$, $w_1, w_2 \in (\{w_n\} \cup w)$, and $\rho'$ and $\mu'$ denote the metrics in $S^{(i)}$ and $U^{(i)}$, respectively. Now, using this same $\delta$ for the $\{g^{(i)}_n(\cdot, \cdot)\}$, we get, for all $n$ and for $\rho'(z_1, z_2) < \delta$, $\mu'(a_1, a_2) < \delta$:
$$|g^{(i)}_n(z_1, a_1) - g^{(i)}_n(z_2, a_2)| = |g'(z_1, a_1, \theta_n, w_n) - g'(z_2, a_2, \theta_n, w_n)| < \epsilon.$$
Hence $\{g^{(i)}_n(\cdot, \cdot)\}$ is equicontinuous. For large $n$, $\sup_{(z,a) \in S^{(i)} \times U^{(i)}} |g^{(i)}_n(z, a) - g^{(i)}(z, a)| < \epsilon/2$ because of the uniform convergence of $\{g^{(i)}_n(\cdot, \cdot)\}$, hence $\int_{S^{(i)} \times U^{(i)}} |g^{(i)}_n(z, a) - g^{(i)}(z, a)|\, \Psi^{(i)}_n(dz, da) < \epsilon/2$. Now (for $n$ large),
$$\left| \int g^{(i)}_n \, d\Psi^{(i)}_n - \int g^{(i)} \, d\Psi^{(i)} \right| \leq \int |g^{(i)}_n - g^{(i)}|\, d\Psi^{(i)}_n + \left| \int g^{(i)} \, d\Psi^{(i)}_n - \int g^{(i)} \, d\Psi^{(i)} \right| < \epsilon/2 + \epsilon/2 = \epsilon, \quad (7)$$
where the last inequality uses $\Psi^{(i)}_n \Rightarrow \Psi^{(i)}$. Hence from (7) we get
$$\int_{S^{(i)}} f^{(i)}(z)\, \Psi^{(i)}(dz, U^{(i)}) = \int_{S^{(i)} \times U^{(i)}} \int_{S^{(i)}} f^{(i)}(y)\, p^{(i)}(dy \mid z, a, \theta, w)\, \Psi^{(i)}(dz, da),$$
proving that the map is upper-semicontinuous.

Define $\tilde{g}(\theta, w, \nu) = \int g(\theta, w, z)\, \nu(dz, U^{(2)})$ for $\nu \in \mathcal{P}(S^{(2)} \times U^{(2)})$ and $\hat{g}(\theta, w) = \{\tilde{g}(\theta, w, \nu) : \nu \in D^{(2)}(\theta, w)\}$.

Lemma 2.5. $\hat{g}(\cdot, \cdot)$ is a Marchaud map.

Proof 4. (i) Convexity and compactness follow trivially from the same properties of the map $(\theta, w) \to D^{(2)}(\theta, w)$.

(ii)
$$\|\tilde{g}(\theta, w, \nu)\| \leq \int \|g(\theta, w, z)\|\, \nu(dz, U^{(2)}) \leq \int \left(L^{(2)}(\|\theta\| + \|w\|) + \|g(0, 0, z)\|\right) \nu(dz, U^{(2)}) \leq K_g(1 + \|\theta\| + \|w\|),$$
where $K_g = \max\left(L^{(2)}, \sup_{z \in S^{(2)}} \|g(0, 0, z)\|\right) < \infty$ by the compactness of $S^{(2)}$ and the continuity of $g$.
The above bound holds for all $\tilde{g}(\theta, w, \nu) \in \hat{g}(\theta, w)$, $\nu \in D^{(2)}(\theta, w)$.

(iii) Let $(\theta_n, w_n) \to (\theta, w)$, $\tilde{g}(\theta_n, w_n, \nu_n) \to m$, $\nu_n \in D^{(2)}(\theta_n, w_n)$. Now, $\{\nu_n\}$ is tight, hence has a convergent subsequence $\{\nu_{n_k}\}$ with limit $\nu$. Then, using arguments similar to the proof of Lemma 2.4, one can show that $m = \tilde{g}(\theta, w, \nu)$, whereas $\nu \in D^{(2)}(\theta, w)$ follows directly from the upper-semicontinuity of the map $(\theta, w) \to D^{(2)}(\theta, w)$ for all $\theta$.

Note that the map $\hat{h}(\cdot, \cdot)$ can be defined similarly and shown to be a Marchaud map using exactly the same technique.

Other assumptions needed for two time-scale convergence analysis
We now list the other assumptions required for the two time-scale convergence analysis:

(A6) For all $\theta \in \mathbb{R}^d$, the differential inclusion
$$\dot{w}(t) \in \hat{g}(\theta, w(t)) \quad (8)$$
has a singleton global attractor $\lambda(\theta)$, where $\lambda : \mathbb{R}^d \to \mathbb{R}^k$ is a Lipschitz map with constant $K$. Additionally, there exists a continuous function $V : \mathbb{R}^{d+k} \to [0, \infty)$ satisfying the hypothesis of Corollary 3.28 of [9] with $\Lambda = \{(\theta, \lambda(\theta)) : \theta \in \mathbb{R}^d\}$. This is the most important assumption, as it links the fast and slow iterates.

(A7) Stability of the iterates: $\sup_n (\|\theta_n\| + \|w_n\|) < \infty$ a.s.

Define time instants $t(0) = 0$, $t(n) = \sum_{m=0}^{n-1} a(m)$, $n \geq 1$. Let $\bar{\theta}(t), t \geq 0$, be defined by $\bar{\theta}(t(n)) = \theta_n$, $n \geq 0$, with linear interpolation on each interval $[t(n), t(n+1))$, i.e.,
$$\bar{\theta}(t) = \theta_n + (\theta_{n+1} - \theta_n)\frac{t - t(n)}{t(n+1) - t(n)}, \quad t \in [t(n), t(n+1)).$$
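As a computational aside (our own illustration, not part of the framework), the interpolated trajectory $\bar{\theta}(\cdot)$ used throughout the analysis is straightforward to realize in code; the helper below, with made-up iterates and step sizes, evaluates it at arbitrary times:

```python
import numpy as np

def interpolated_trajectory(iterates, steps):
    """Return t -> theta_bar(t): piecewise-linear interpolation of the
    iterates at time instants t(0) = 0, t(n) = a(0) + ... + a(n-1)."""
    t = np.concatenate(([0.0], np.cumsum(steps[:-1])))  # t(0), ..., t(N-1)

    def theta_bar(s):
        n = np.searchsorted(t, s, side="right") - 1     # interval holding s
        n = min(max(n, 0), len(t) - 2)
        frac = (s - t[n]) / (t[n + 1] - t[n])
        return iterates[n] + (iterates[n + 1] - iterates[n]) * frac

    return theta_bar

# Illustrative use with made-up iterates and a(n) = 1/(n+1):
a = 1.0 / np.arange(1, 51)
theta = np.cumsum(np.random.default_rng(1).normal(size=50))
bar = interpolated_trajectory(theta, a)
print(bar(0.0), bar(1.7))
```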
The following theorem is our main result:
Theorem 2.6 (Slower timescale result). Under assumptions (A1)-(A7),
$$(\theta_n, w_n) \to \bigcup_{\theta^* \in A} (\theta^*, \lambda(\theta^*)) \text{ a.s. as } n \to \infty,$$
where $A = \bigcap_{t \geq 0} \overline{\{\bar{\theta}(s) : s \geq t\}}$ is almost surely an internally chain transitive set of the differential inclusion
$$\dot{\theta}(t) \in \hat{h}(\theta(t)), \quad (9)$$
where $\hat{h}(\theta) = \{\tilde{h}(\theta, \lambda(\theta), \nu) : \nu \in D^{(1)}(\theta, \lambda(\theta))\}$. We refer to (8) and (9) as the faster and slower d.i., corresponding to the faster and slower recursions, respectively.

Corollary 1. Under the additional assumption that the inclusion $\dot{\theta}(t) \in \hat{h}(\theta(t))$ has a global attractor set $A$,
$$(\theta_n, w_n) \to \bigcup_{\theta^* \in A} (\theta^*, \lambda(\theta^*)) \text{ a.s. as } n \to \infty.$$
Remark 5. In the case where the set $D^{(2)}(\theta, w)$ is a singleton, we can relax (A6) to allow local attractors as well. The relaxed assumption is:

(A6)' The function $\hat{g}(\theta, w) = \int g(\theta, w, z)\, \Gamma^{(2)}_{\theta, w}(dz)$ is Lipschitz continuous, where $\Gamma^{(2)}_{\theta, w}$ is the only element of $D^{(2)}(\theta, w)$. Further, for all $\theta \in \mathbb{R}^d$, the o.d.e.
$$\dot{w}(t) = \hat{g}(\theta, w(t)) \quad (10)$$
has an asymptotically stable equilibrium $\lambda(\theta)$ with domain of attraction $G_\theta$, where $\lambda : \mathbb{R}^d \to \mathbb{R}^k$ is a Lipschitz map with constant $K$. Also, assume that $\bigcap_\theta G_\theta$ is non-empty. Moreover, the function $V' : G \to [0, \infty)$ defined by $V'(\theta, w) = V_\theta(w)$ is continuously differentiable, where $V_\theta(\cdot)$ is the Lyapunov function (for the definition see [21, Chapter 11.2.3]) for the o.d.e. (10) with $\lambda(\theta)$ as its attractor, and $G = \bigcup_{\theta \in \mathbb{R}^d} \{\{\theta\} \times G_\theta\}$. This extra condition is needed so that the set $\mathrm{graph}(\lambda) := \{(\theta, \lambda(\theta)) : \theta \in \mathbb{R}^d\}$ becomes an asymptotically stable set of the coupled o.d.e.
$$\dot{w}(t) = \hat{g}(\theta(t), w(t)), \quad \dot{\theta}(t) = 0.$$

Note that (A6)' allows multiple attractors (at least one of them has to be a point; the others can be sets) for the faster o.d.e. for every $\theta$. The statement of Theorem 2.6 is then modified as follows:

Theorem 2.7 (Slower timescale result when $\lambda(\theta)$ is a local attractor). Under assumptions (A1)-(A5), (A6)' and (A7), on the event "$\{w_n\}$ belongs to a compact subset $B$ (depending on the sample point) of $\bigcap_{\theta \in \mathbb{R}^d} G_\theta$ eventually",
$$(\theta_n, w_n) \to \bigcup_{\theta^* \in A} (\theta^*, \lambda(\theta^*)) \text{ a.s. as } n \to \infty.$$

The requirement on $\{w_n\}$ is much stronger than in the usual local attractor statement of the Kushner-Clark lemma [10, Section II.C], which requires the iterates to enter a compact set in the domain of attraction of the local attractor infinitely often only. The reason for imposing this stronger assumption is that $\mathrm{graph}(\lambda)$ is not a subset of any compact set in $\mathbb{R}^{d+k}$, and hence the usual tracking-lemma kind of arguments do not go through directly. One has to relate the limit set of the coupled iterate $(\theta_n, w_n)$ to $\mathrm{graph}(\lambda)$ (see the proof of Lemma 3.6). We present the proofs of our main results in the next section.
Proof of the Main Results

We first discuss an extension of the single time-scale controlled Markov noise framework of [20] under our assumptions, which we then use to prove our main results. Note that the results of [20] assume that the state space of the controlled Markov process is Polish, which may impose additional conditions that are hard to verify. In this section, other than proving our two time-scale results, we prove many of the results in [20] (which were only stated there), assuming the state space to be compact.

We begin by describing the intuition behind the proof techniques in [20]. The space $C([0, \infty); \mathbb{R}^d)$ of continuous functions from $[0, \infty)$ to $\mathbb{R}^d$ is topologized with the coarsest topology such that the map taking any $f \in C([0, \infty); \mathbb{R}^d)$ to its restriction to $[0, T]$, viewed as an element of the space $C([0, T]; \mathbb{R}^d)$, is continuous for all $T > 0$.
In other words, $f_n \to f$ in this space iff $f_n|_{[0,T]} \to f|_{[0,T]}$ for all $T > 0$. The other notations used below are the same as those in [20, 21]; we recall a few for easy reference. Consider the single time-scale stochastic approximation recursion with controlled Markov noise:
$$x_{n+1} = x_n + a(n)\left[h(x_n, Y_n) + M_{n+1}\right]. \quad (11)$$
Define time instants $t(0) = 0$, $t(n) = \sum_{m=0}^{n-1} a(m)$, $n \geq 1$.
Let $\bar{x}(t), t \geq 0$, be defined by $\bar{x}(t(n)) = x_n$, $n \geq 0$, with linear interpolation on each interval $[t(n), t(n+1))$, i.e.,
$$\bar{x}(t) = x_n + (x_{n+1} - x_n)\frac{t - t(n)}{t(n+1) - t(n)}, \quad t \in [t(n), t(n+1)).$$
Now define $\tilde{h}(x, \nu) = \int h(x, z)\, \nu(dz, U)$ for $\nu \in \mathcal{P}(S \times U)$. Let $\mu(t), t \geq 0$, be defined by $\mu(t) = \delta_{(Y_n, Z_n)}$ for $t \in [t(n), t(n+1))$, $n \geq 0$,
where $\delta_{(y,a)}$ is the Dirac measure corresponding to $(y, a)$. Consider the non-autonomous o.d.e.
$$\dot{x}(t) = \tilde{h}(x(t), \mu(t)). \quad (12)$$
Let $x^s(t), t \geq s$, denote the solution to (12) with $x^s(s) = \bar{x}(s)$, for $s \geq 0$.
Note that $x^s(t), t \in [s, s+T]$, and $x^s(t), t \geq s$, can be viewed as elements of $C([0, T]; \mathbb{R}^d)$ and $C([0, \infty); \mathbb{R}^d)$, respectively. With this abuse of notation, it is easy to see that $\{x^s(\cdot)|_{[s, s+T]}, s \geq 0\}$ is a pointwise bounded and equicontinuous family of functions in $C([0, T]; \mathbb{R}^d)$ for all $T > 0$; hence, by the Arzelà-Ascoli theorem, for any $s(n) \uparrow \infty$, $\{\bar{x}(s(n) + \cdot)|_{[s(n), s(n)+T]}, n \geq 0\}$ has a limit point in $C([0, T]; \mathbb{R}^d)$ for all $T > 0$.
With the above topology for $C([0, \infty); \mathbb{R}^d)$, $\{x^s(\cdot), s \geq 0\}$ is also relatively compact in $C([0, \infty); \mathbb{R}^d)$, and for all $s(n) \uparrow \infty$, $\{\bar{x}(s(n) + \cdot), n \geq 0\}$ has a limit point in $C([0, \infty); \mathbb{R}^d)$. One can write from (11) the following:
$$\bar{x}(u(n) + t) = \bar{x}(u(n)) + \int_0^t h(\bar{x}(u(n) + \tau), \nu(u(n) + \tau))\, d\tau + W^n(t),$$
where $u(n) \uparrow \infty$, $\bar{x}(u(n) + \cdot) \to \tilde{x}(\cdot)$, $\nu(t) = (Y_n, Z_n)$ for $t \in [t(n), t(n+1))$, $n \geq 0$, $W^n(t) = W(t + u(n)) - W(u(n))$, with $W(t) = W_n + (W_{n+1} - W_n)\frac{t - t(n)}{t(n+1) - t(n)}$ for $t \in [t(n), t(n+1))$ and $W_n = \sum_{k=0}^{n-1} a(k) M_{k+1}$, $n \geq 0$.
From here one cannot directly take limits on both sides, as finding limit points of $\nu(s + \cdot)$ as $s \to \infty$ is not meaningful. Now, $h(x, y) = \int h(x, z)\, \delta_{(y,a)}(dz \times U)$. Hence, by defining $\tilde{h}(x, \rho) = \int h(x, z)\, \rho(dz)$ and $\mu(t) = \delta_{\nu(t)}$, one can write the above as
$$\bar{x}(u(n) + t) = \bar{x}(u(n)) + \int_0^t \tilde{h}(\bar{x}(u(n) + \tau), \mu(u(n) + \tau))\, d\tau + W^n(t). \quad (13)$$
The advantage is that the space $\mathcal{U}$ of measurable functions from $[0, \infty)$ to $\mathcal{P}(S \times U)$ is compact metrizable, so subsequential limits exist. Note that $\mu(\cdot)$ is not a deterministic member of $\mathcal{U}$; rather, we need to fix a sample point, i.e., $\mu(\cdot, \omega) \in \mathcal{U}$. For ease of understanding, we abuse the terminology and talk about the limit points $\tilde{\mu}(\cdot)$ of $\mu(s + \cdot)$.

From (13) one can infer that the limit $\tilde{x}(\cdot)$ of $\bar{x}(u(n) + \cdot)$ satisfies the o.d.e. $\dot{x}(t) = \tilde{h}(x(t), \mu(t))$ with $\mu(\cdot)$ replaced by $\tilde{\mu}(\cdot)$. Here each $\tilde{\mu}(t), t \in \mathbb{R}$, in $\tilde{\mu}(\cdot)$ is generated through a different limiting process, each associated with the compact metrizable space $\mathcal{U}_t$ of measurable functions from $[0, t]$ to $\mathcal{P}(S \times U)$. This is problematic if we want to further explore the process $\tilde{\mu}(\cdot)$ and convert the non-autonomous o.d.e. into an autonomous one.

Hence the main result is proved using an auxiliary lemma [20, Lemma 2.3] in addition to the tracking lemma (Lemma 2.2 of [20]). Let $u(n(k)) \uparrow \infty$ be such that $\bar{x}(u(n(k)) + \cdot) \to \tilde{x}(\cdot)$ and $\mu(u(n(k)) + \cdot) \to \tilde{\mu}(\cdot)$; then using Lemma 2.2 of [20] one can show that $x^{u(n(k))}(\cdot) \to \tilde{x}(\cdot)$. The auxiliary lemma then shows that the o.d.e. trajectory $x^{u(n(k))}(\cdot)$ associated with $\mu(u(n(k)) + \cdot)$ tracks (in the limit) the o.d.e. trajectory associated with $\tilde{\mu}(\cdot)$. Hence Lemma 2.3 of [20] links the two limiting processes $\tilde{x}(\cdot)$ and $\tilde{\mu}(\cdot)$ in some sense. Note that Lemma 2.3 of [20] involves only the o.d.e. trajectories, not the interpolated trajectory of the algorithm.

Consider the iteration
$$\theta_{n+1} = \theta_n + a(n)\left[h(\theta_n, Y_n) + \epsilon_n + M_{n+1}\right], \quad (14)$$
where $\epsilon_n \to 0$ as $n \to \infty$, $\{Y_n\}$ is the controlled Markov process driven by $\{\theta_n\}$, and $M_{n+1}, n \geq 0$, is a martingale difference sequence. Let $\bar{\theta}(t), t \geq 0$, be defined by $\bar{\theta}(t(n)) = \theta_n$, $n \geq 0$,
with linear interpolation on each interval $[t(n), t(n+1))$. Also, let $\theta^s(t), t \geq s$, denote the solution to (12) with $\theta^s(s) = \bar{\theta}(s)$, for $s \geq 0$.

Lemma 3.1.
For any $T > 0$,
$$\sup_{t \in [s, s+T]} \|\bar{\theta}(t) - \theta^s(t)\| \to 0 \text{ a.s. as } s \to \infty.$$

Proof 5.
The proof follows from Lemma 2.2 of [20] and Remark 3 thereof (p. 144).
Now, $\mu$ can be viewed as a random variable taking values in $\mathcal{U}$, the space of measurable functions from $[0, \infty)$ to $\mathcal{P}(S \times U)$. This space is topologized with the coarsest topology such that the map
$$\nu(\cdot) \in \mathcal{U} \to \int_0^T g(t) \int f \, d\nu(t)\, dt \in \mathbb{R}$$
is continuous for all $f \in C(S)$, $T > 0$, $g \in L_2[0, T]$. Note that $\mathcal{U}$ is compact metrizable.

Lemma 3.2.
Almost surely, every limit point of $(\mu(s + \cdot), \bar{\theta}(s + \cdot))$ as $s \to \infty$ is of the form $(\tilde{\mu}(\cdot), \tilde{\theta}(\cdot))$, where $\tilde{\mu}(\cdot)$ satisfies $\tilde{\mu}(t) \in D(\tilde{\theta}(t))$ for a.e. $t$.

Proof 6.
Suppose that $u(n) \uparrow \infty$, $\mu(u(n) + \cdot) \to \tilde{\mu}(\cdot)$ and $\bar{\theta}(u(n) + \cdot) \to \tilde{\theta}(\cdot)$. Let $\{f_i\}$ be countable dense in the unit ball of $C(S)$, hence a separating class, i.e., for all $i$, $\int f_i \, d\mu = \int f_i \, d\nu$ implies $\mu = \nu$. For each $i$,
$$\zeta^i_n = \sum_{m=1}^{n-1} a(m)\left(f_i(Y_{m+1}) - \int f_i(y)\, p(dy \mid Y_m, Z_m, \theta_m)\right), \quad n \geq 1,$$
is a zero-mean martingale w.r.t. $\mathcal{F}_n = \sigma(\theta_m, Y_m, Z_m, m \leq n)$. Moreover, it is a square-integrable martingale, since the $f_i$'s are bounded and each $\zeta^i_n$ is a finite sum. Its quadratic variation process
$$A_n = \sum_{m=0}^{n-1} a(m)^2\, E\left[\left(f_i(Y_{m+1}) - \int f_i(y)\, p(dy \mid Y_m, Z_m, \theta_m)\right)^2 \Big| \mathcal{F}_m\right] + E[(\zeta^i_0)^2]$$
is almost surely convergent. By the martingale convergence theorem, $\zeta^i_n, n \geq 0$, converges a.s. for all $i$. As before, let $\tau(n, t) = \min\{m \geq n : t(m) \geq t(n) + t\}$ for $t \geq 0$, $n \geq 0$. Then as $n \to \infty$,
$$\sum_{m=n}^{\tau(n,t)} a(m)\left(f_i(Y_{m+1}) - \int f_i(y)\, p(dy \mid Y_m, Z_m, \theta_m)\right) \to 0 \text{ a.s.}$$
for $t > 0$. By our choice of $\{f_i\}$ and the fact that $\{a(n)\}$ is an eventually non-increasing sequence (the latter property is used only here and in Lemma 3.9), we have
$$\sum_{m=n}^{\tau(n,t)} (a(m) - a(m+1))\, f_i(Y_{m+1}) \to 0 \text{ a.s.}$$
From the foregoing,
$$\sum_{m=n}^{\tau(n,t)} \left(a(m+1)\, f_i(Y_{m+1}) - a(m) \int f_i(y)\, p(dy \mid Y_m, Z_m, \theta_m)\right) \to 0 \text{ a.s.}$$
for all $t > 0$, which implies
$$\sum_{m=n}^{\tau(n,t)} a(m)\left(f_i(Y_m) - \int f_i(y)\, p(dy \mid Y_m, Z_m, \theta_m)\right) \to 0 \text{ a.s.}$$
for all $t > 0$, due to the fact that $a(n) \to 0$ and the $f_i(\cdot)$ are bounded. This implies
$$\int_{t(n)}^{t(n)+t} \left(\int \left(f_i(z) - \int f_i(y)\, p(dy \mid z, a, \hat{\theta}(s))\right) \mu(s, dz\, da)\right) ds \to 0 \text{ a.s.},$$
which in turn implies
$$\int_{u(n)}^{u(n)+t} \left(\int \left(f_i(z) - \int f_i(y)\, p(dy \mid z, a, \hat{\theta}(s))\right) \mu(s, dz\, da)\right) ds \to 0 \text{ a.s.}$$
(this is true because $a(n) \to 0$ and $f_i(\cdot)$ is bounded), where $\hat{\theta}(s) = \theta_n$ for $s \in [t(n), t(n+1))$, $n \geq 0$. Now, one can claim from the above that
$$\int_{u(n)}^{u(n)+t} \left(\int \left(f_i(z) - \int f_i(y)\, p(dy \mid z, a, \bar{\theta}(s))\right) \mu(s, dz\, da)\right) ds \to 0 \text{ a.s.}$$
This is due to the fact that the map $S \times U \times \mathbb{R}^d \ni (z, a, \theta) \to \int f(y)\, p(dy \mid z, a, \theta)$ is continuous, and hence uniformly continuous on the compact set $A = S \times U \times M$, where $M$ is the compact set such that $\theta_n \in M$ for all $n$. Here we also use the fact that $\|\bar{\theta}(s) - \theta_m\| = \|h(\theta_m, Y_m) + \epsilon_m + M_{m+1}\|(s - t(m)) \to 0$ for $s \in [t(m), t(m+1))$, as the first two terms inside the norm on the R.H.S. are bounded. The above convergence is equivalent to
$$\int_0^t \left(\int \left(f_i(z) - \int f_i(y)\, p(dy \mid z, a, \bar{\theta}(s + u(n)))\right) \mu(s + u(n), dz\, da)\right) ds \to 0 \text{ a.s.}$$
Fix a sample point in the probability one set on which the convergence above holds for all $i$. Then the convergence above leads to
$$\int_0^t \int \left(f_i(z) - \int f_i(y)\, p(dy \mid z, a, \tilde{\theta}(s))\right) \tilde{\mu}(s, dz\, da)\, ds = 0 \quad \forall i. \quad (15)$$
Here we use one part of the proof of Lemma 2.3 of [20]: if $\mu_n(\cdot) \to \mu_\infty(\cdot)$ in $\mathcal{U}$, then for any $t > 0$,
$$\int_0^t \int \tilde{f}(s, z, a)\, \mu_n(s, dz\, da)\, ds - \int_0^t \int \tilde{f}(s, z, a)\, \mu_\infty(s, dz\, da)\, ds \to 0$$
for all $\tilde{f} \in C([0, t] \times S \times U)$, together with the fact that $\tilde{f}_n(s, z, a) = \int f_i(y)\, p(dy \mid z, a, \bar{\theta}(s + u(n)))$ converges uniformly to $\tilde{f}(s, z, a) = \int f_i(y)\, p(dy \mid z, a, \tilde{\theta}(s))$. To prove the latter, define $g : C([0, t]) \times [0, t] \times S \times U \to \mathbb{R}$ by $g(\theta(\cdot), s, z, a) = \int f_i(y)\, p(dy \mid z, a, \theta(s))$. To see that $g$ is continuous we need to check that if $\theta_n(\cdot) \to \theta(\cdot)$ uniformly and $s(n) \to s$, then $\theta_n(s(n)) \to \theta(s)$. This is because
$$\|\theta_n(s(n)) - \theta(s)\| \leq \|\theta_n(s(n)) - \theta(s(n))\| + \|\theta(s(n)) - \theta(s)\|,$$
where the first and second terms go to zero due to the uniform convergence of $\theta_n(\cdot), n \geq 0$, and the continuity of $\theta(\cdot)$, respectively. Let $A = \{\bar{\theta}(u(n) + \cdot)|_{[u(n), u(n)+t]}, n \geq 0\} \cup \tilde{\theta}(\cdot)|_{[0,t]}$. $A$ is compact, as it is the union of a sequence of functions and their limit. So $g|_{A \times [0,t] \times S \times U}$ is uniformly continuous. Then, using the same arguments as in Lemma 2.4, we can show equicontinuity of $\{\tilde{f}_n(\cdot, \cdot)\}$, which results in uniform convergence and thereby (15). An application of Lebesgue's theorem in conjunction with (15) shows that
$$\int \left(f_i(z) - \int f_i(y)\, p(dy \mid z, a, \tilde{\theta}(t))\right) \tilde{\mu}(t, dz\, da) = 0 \quad \forall i$$
for a.e. $t$. By our choice of $\{f_i\}$, this leads to
$$\tilde{\mu}(t, dy \times U) = \int p(dy \mid z, a, \tilde{\theta}(t))\, \tilde{\mu}(t, dz\, da) \text{ a.e. } t.$$
The conclusion follows by disintegrating such a measure as the product of the marginal on $S$ and the regular conditional law on $U$ ([20, p. 140]).

Remark 6.
Note that the above invariant distribution does not come "naturally"; rather, it arises from the assumption made to match the natural timescale intuition for the controlled Markov noise component, i.e., the slower iterate should see the average effect of the Markov component.
The proof of the following lemma is, in this case, unchanged from its original version, so we merely state it for completeness and refer the reader to Lemma 2.3 of [20] for its proof.
Lemma 3.3.
Let $\mu_n(\cdot) \to \mu_\infty(\cdot)$ in $\mathcal{U}$. Let $\theta_n(\cdot), n = 1, 2, \ldots, \infty$, denote solutions to (12) corresponding to the case where $\mu(\cdot)$ is replaced by $\mu_n(\cdot)$, for $n = 1, 2, \ldots, \infty$. Suppose $\theta_n(0) \to \theta_\infty(0)$. Then
$$\lim_{n \to \infty} \sup_{t \in [0, T]} \|\theta_n(t) - \theta_\infty(t)\| = 0$$
for every $T > 0$.

Lemma 3.4.
Almost surely, $\{\theta_n\}$ converges to an internally chain transitive set of the differential inclusion
$$\dot{\theta}(t) \in \hat{h}(\theta(t)), \quad (16)$$
where $\hat{h}(\theta) = \{\tilde{h}(\theta, \nu) : \nu \in D(\theta)\}$.

Proof 7.
Lemma 3.3 shows that every limit point $(\tilde{\mu}(\cdot), \tilde{\theta}(\cdot))$ of $(\mu(s + \cdot), \bar{\theta}(s + \cdot))$ as $s \to \infty$ is such that $\tilde{\theta}(\cdot)$ satisfies (12) with $\mu(\cdot) = \tilde{\mu}(\cdot)$. Hence $\tilde{\theta}(\cdot)$ is absolutely continuous. Moreover, using Lemma 3.2, one can see that it satisfies (16) for a.e. $t$, hence is a solution to the differential inclusion (16). The claim follows.

Lemma 3.5 (Faster timescale result). $(\theta_n, w_n) \to \{(\theta, \lambda(\theta)) : \theta \in \mathbb{R}^d\}$ a.s.

Proof 8.
We first rewrite (3) as
$$\theta_{n+1} = \theta_n + b(n)\left[\epsilon_n + M^{(3)}_{n+1}\right],$$
where $\epsilon_n = \frac{a(n)}{b(n)} h(\theta_n, w_n, Z^{(1)}_n) \to 0$ as $n \to \infty$ a.s. and $M^{(3)}_{n+1} = \frac{a(n)}{b(n)} M^{(1)}_{n+1}$ for $n \geq 0$. Let $\alpha_n = (\theta_n, w_n)$, $\alpha = (\theta, w) \in \mathbb{R}^{d+k}$, $G(\alpha, z) = (0, g(\alpha, z))$, $\epsilon'_n = (\epsilon_n, 0)$ and $M^{(4)}_{n+1} = (M^{(3)}_{n+1}, M^{(2)}_{n+1})$. Then one can write (3) and (4) in the framework of (14) as
$$\alpha_{n+1} = \alpha_n + b(n)\left[G(\alpha_n, Z^{(2)}_n) + \epsilon'_n + M^{(4)}_{n+1}\right], \quad (17)$$
with $\epsilon'_n \to 0$ as $n \to \infty$. By Lemma 3.4, $\alpha_n, n \geq 0$, converges almost surely to an internally chain transitive set of the differential inclusion $\dot{\alpha}(t) \in \hat{G}(\alpha(t))$, where $\hat{G}(\alpha) = \{\tilde{G}(\alpha, \nu) : \nu \in D^{(2)}(\theta, w)\}$ with $\tilde{G}(\alpha, \nu) = (0, \tilde{g}(\theta, w, \nu))$. In other words, $(\theta_n, w_n), n \geq 0$, converges to an internally chain transitive set of the differential inclusion
$$\dot{w}(t) \in \hat{g}(\theta(t), w(t)), \quad \dot{\theta}(t) = 0.$$
The rest follows from the second part of (A6).

Remark 7. Under the conditions mentioned in Remark 5, the above faster timescale result is modified as follows:
Lemma 3.6 (Faster timescale result when $\lambda(\theta)$ is a local attractor). Under assumptions (A1)-(A5), (A6)' and (A7), on the event "$\{w_n\}$ belongs to a compact subset $B$ (depending on the sample point) of $\bigcap_{\theta \in \mathbb{R}^d} G_\theta$ eventually", $(\theta_n, w_n) \to \{(\theta, \lambda(\theta)) : \theta \in \mathbb{R}^d\}$ a.s.

Proof 9.
Fix a sample point $\omega$. The proof follows from these observations:

1. continuity of the flow of the coupled o.d.e. around the initial point;

2. $\sup_n \|\theta_n\| = M < \infty$;

3. the fact that the set $\mathrm{graph}(\lambda)$ is Lyapunov stable ($V'(\cdot)$, as mentioned in (A6)', is a Lyapunov function for this set); and

4. the fact that $\bigcap_{t \geq 0} \overline{\{\bar{\alpha}(s) : s \geq t\}}$ is an internally chain transitive set of the coupled o.d.e.
$$\dot{w}(t) = \hat{g}(\theta(t), w(t)), \quad \dot{\theta}(t) = 0, \quad (18)$$
where $\bar{\alpha}(\cdot)$ is the interpolated trajectory of the coupled iterate $\{\alpha_n\}$.

As $\{\theta : \|\theta\| \leq M\} \times B \subset \bigcup_{\theta \in \mathbb{R}^d} \{\{\theta\} \times G_\theta\}$, the first three observations show that for all $\epsilon > 0$ there exists a $T_\epsilon > 0$ such that any o.d.e. trajectory of (18) starting in the compact set $\{\theta : \|\theta\| \leq M\} \times B$ reaches the $\epsilon$-neighbourhood of $\mathrm{graph}(\lambda)$ after time $T_\epsilon$. Further,
$$\bigcap_{t \geq 0} \overline{\{\bar{\alpha}(s) : s \geq t\}} \subset \{\theta : \|\theta\| \leq M\} \times B.$$
Then one can use the last observation, choosing $T > T_\epsilon$, to show the required convergence to the set $\mathrm{graph}(\lambda)$.

Remark 8.
One interesting question in this context is whether one can extend the single timescale local attractor convergence statements to the two timescale setting under verifiable conditions. More specifically, if there is a global attractor $A$ for $\dot{\theta}(t) \in \hat{h}(\theta(t))$, can one provide verifiable conditions under which
$$P\left[(\theta_n, w_n) \to \bigcup_{\theta \in A} (\theta, \lambda(\theta))\right] > 0?$$
Here $\lambda(\theta)$ is a local attractor as mentioned in (A6)'. There are two ways in which this could possibly be attempted:

1. Use Theorem 2.7, where we show that on the event that $\{w_n\}$ belongs to a compact subset $B$ (depending on the sample point) of $\bigcap_{\theta \in \mathbb{R}^d} G_\theta$ "eventually", $(\theta_n, w_n) \to \bigcup_{\theta^* \in A} (\theta^*, \lambda(\theta^*))$ a.s. as $n \to \infty$, which is an extension of the Kushner-Clark lemma to the two timescale case. The task would then be to impose verifiable assumptions under which $P(\{w_n\}$ belongs to a compact subset $B$ (depending on the sample point) of $\bigcap_{\theta \in \mathbb{R}^d} G_\theta$ "eventually"$) > 0$. In a stochastic approximation scenario it is not immediately clear how one could impose verifiable assumptions making such a probabilistic statement true.

2. The second approach would be to extend the analysis of [8, 9] to the two timescale case. In our opinion this is very hard, as that analysis is based on the notion of attractor introduced by Benaim et al., whereas the coupled o.d.e. (18), which tracks the coupled iterate $(\theta_n, w_n)$ (so that the interpolated trajectory of the coupled iterate is an asymptotic pseudo-trajectory [8] for (18)), has no attractor. The reason is that one cannot obtain a fundamental neighbourhood for sets like $\bigcup_{\theta \in A} (\theta, \lambda(\theta))$, as the $\theta$ component remains constant for any trajectory of the above coupled o.d.e.

Thus it is not immediately clear how this question can be addressed, and this will be an interesting future direction.

By Lemma 3.5, $\|w_n - \lambda(\theta_n)\| \to 0$ a.s., i.e., $\{w_n\}$ asymptotically tracks $\{\lambda(\theta_n)\}$ a.s. Now, consider the non-autonomous o.d.e.
$$\dot{\theta}(t) = \tilde{h}(\theta(t), \lambda(\theta(t)), \mu(t)), \quad (19)$$
where $\mu(t) = \delta_{(Z^{(1)}_n, A^{(1)}_n)}$ for $t \in [t(n), t(n+1))$, $n \geq 0$, and $\tilde{h}(\theta, w, \nu) = \int h(\theta, w, z)\, \nu(dz, U^{(1)})$. Let $\theta^s(t), t \geq s$, denote the solution to (19) with $\theta^s(s) = \bar{\theta}(s)$, for $s \geq 0$. Then
Lemma 3.7.
For any $T > 0$,
$$\sup_{t \in [s, s+T]} \|\bar{\theta}(t) - \theta^s(t)\| \to 0 \text{ a.s.}$$

Proof 10.
The slower recursion corresponds to
$$\theta_{n+1} = \theta_n + a(n)\left[h(\theta_n, w_n, Z^{(1)}_n) + M^{(1)}_{n+1}\right].$$
Let $t(n+m) \in [t(n), t(n) + T]$ and let $[t] = \max\{t(k) : t(k) \leq t\}$. Then by construction,
$$\bar{\theta}(t(n+m)) = \bar{\theta}(t(n)) + \sum_{k=0}^{m-1} a(n+k)\, h(\bar{\theta}(t(n+k)), w_{n+k}, Z^{(1)}_{n+k}) + \delta_{n, n+m},$$
which can be rewritten as
$$\bar{\theta}(t(n)) + \sum_{k=0}^{m-1} a(n+k)\, h(\bar{\theta}(t(n+k)), \lambda(\bar{\theta}(t(n+k))), Z^{(1)}_{n+k}) + \sum_{k=0}^{m-1} a(n+k)\left(h(\bar{\theta}(t(n+k)), w_{n+k}, Z^{(1)}_{n+k}) - h(\bar{\theta}(t(n+k)), \lambda(\theta_{n+k}), Z^{(1)}_{n+k})\right) + \delta_{n, n+m},$$
where $\delta_{n, n+m} = \zeta_{n+m} - \zeta_n$ with $\zeta_n = \sum_{m=0}^{n-1} a(m) M^{(1)}_{m+1}$, $n \geq 0$. Moreover,
$$\theta^{t(n)}(t(n+m)) = \bar{\theta}(t(n)) + \int_{t(n)}^{t(n+m)} \tilde{h}(\theta^{t(n)}(t), \lambda(\theta^{t(n)}(t)), \mu(t))\, dt = \bar{\theta}(t(n)) + \sum_{k=0}^{m-1} a(n+k)\, h(\theta^{t(n)}(t(n+k)), \lambda(\theta^{t(n)}(t(n+k))), Z^{(1)}_{n+k}) + \int_{t(n)}^{t(n+m)} \left(h(\theta^{t(n)}(t), \lambda(\theta^{t(n)}(t)), \mu(t)) - h(\theta^{t(n)}([t]), \lambda(\theta^{t(n)}([t])), \mu([t]))\right) dt.$$
Let $t(n) \leq t \leq t(n+m)$. Now, if $0 \leq k \leq m-1$ and $t \in (t(n+k), t(n+k+1)]$,
$$\|\theta^{t(n)}(t)\| \leq \|\bar{\theta}(t(n))\| + \left\|\int_{t(n)}^t \tilde{h}(\theta^{t(n)}(\tau), \lambda(\theta^{t(n)}(\tau)), \mu(\tau))\, d\tau\right\| \leq \|\theta_n\| + \sum_{l=0}^{k-1} \int_{t(n+l)}^{t(n+l+1)} \left(\|h(0, 0, Z^{(1)}_{n+l})\| + L^{(1)}(\|\lambda(0)\| + (K+1)\|\theta^{t(n)}(\tau)\|)\right) d\tau + \int_{t(n+k)}^t \left(\|h(0, 0, Z^{(1)}_{n+k})\| + L^{(1)}(\|\lambda(0)\| + (K+1)\|\theta^{t(n)}(\tau)\|)\right) d\tau \leq C_0 + (M_0 + L^{(1)}\|\lambda(0)\|) T + L^{(1)}(K+1) \int_{t(n)}^t \|\theta^{t(n)}(\tau)\|\, d\tau,$$
where $C_0 = \sup_n \|\theta_n\| < \infty$ and $M_0 = \sup_{z \in S^{(1)}} \|h(0, 0, z)\|$. By Gronwall's inequality, it follows that
$$\|\theta^{t(n)}(t)\| \leq \left(C_0 + (M_0 + L^{(1)}\|\lambda(0)\|) T\right) e^{L^{(1)}(K+1) T}.$$
Also,
$$\|\theta^{t(n)}(t) - \theta^{t(n)}(t(n+k))\| \leq \int_{t(n+k)}^t \|h(\theta^{t(n)}(s), \lambda(\theta^{t(n)}(s)), Z^{(1)}_{n+k})\|\, ds \leq \left(\|h(0, 0, Z^{(1)}_{n+k})\| + L^{(1)}\|\lambda(0)\|\right)(t - t(n+k)) + L^{(1)}(K+1) \int_{t(n+k)}^t \|\theta^{t(n)}(s)\|\, ds \leq C_T\, a(n+k),$$
where $C_T = (M_0 + L^{(1)}\|\lambda(0)\|) + L^{(1)}(K+1)\left(C_0 + (M_0 + L^{(1)}\|\lambda(0)\|) T\right) e^{L^{(1)}(K+1) T}$. Thus,
$$\left\|\int_{t(n)}^{t(n+m)} \left(h(\theta^{t(n)}(t), \lambda(\theta^{t(n)}(t)), \mu(t)) - h(\theta^{t(n)}([t]), \lambda(\theta^{t(n)}([t])), \mu([t]))\right) dt\right\| \leq \sum_{k=0}^{m-1} \int_{t(n+k)}^{t(n+k+1)} \|h(\theta^{t(n)}(t), \lambda(\theta^{t(n)}(t)), Z^{(1)}_{n+k}) - h(\theta^{t(n)}([t]), \lambda(\theta^{t(n)}([t])), Z^{(1)}_{n+k})\|\, dt \leq L \sum_{k=0}^{m-1} \int_{t(n+k)}^{t(n+k+1)} \|\theta^{t(n)}(t) - \theta^{t(n)}(t(n+k))\|\, dt \leq C_T L \sum_{k=0}^{m-1} a(n+k)^2 \leq C_T L \sum_{k=0}^{\infty} a(n+k)^2 \to 0 \text{ as } n \to \infty,$$
where $L = L^{(1)}(K+1)$. Hence
$$\|\bar{\theta}(t(n+m)) - \theta^{t(n)}(t(n+m))\| \leq L \sum_{k=0}^{m-1} a(n+k)\, \|\bar{\theta}(t(n+k)) - \theta^{t(n)}(t(n+k))\| + C_T L \sum_{k=0}^{\infty} a(n+k)^2 + \sup_{k \geq 0} \|\delta_{n, n+k}\| + L^{(1)} \sum_{k=0}^{m-1} a(n+k)\, \|w_{n+k} - \lambda(\theta_{n+k})\| \leq L \sum_{k=0}^{m-1} a(n+k)\, \|\bar{\theta}(t(n+k)) - \theta^{t(n)}(t(n+k))\| + C_T L \sum_{k=0}^{\infty} a(n+k)^2 + \sup_{k \geq 0} \|\delta_{n, n+k}\| + L^{(1)} T \sup_{k \geq 0} \|w_{n+k} - \lambda(\theta_{n+k})\| \text{ a.s.}$$
Define
$$K_{T,n} = C_T L \sum_{k=0}^{\infty} a(n+k)^2 + \sup_{k \geq 0} \|\delta_{n, n+k}\| + L^{(1)} T \sup_{k \geq 0} \|w_{n+k} - \lambda(\theta_{n+k})\|.$$
Note that $K_{T,n} \to 0$ a.s. The remainder of the proof follows in exactly the same manner as the tracking lemma; see Lemma 1, Chapter 2 of [21].

Lemma 3.8.
Suppose $\mu_n(\cdot) \to \mu_\infty(\cdot)$ in $\mathcal{U}^{(1)}$. Let $\theta_n(\cdot), n = 1, 2, \ldots, \infty$, denote solutions to (19) corresponding to the case where $\mu(\cdot)$ is replaced by $\mu_n(\cdot)$, for $n = 1, 2, \ldots, \infty$. Suppose $\theta_n(0) \to \theta_\infty(0)$. Then
$$\lim_{n \to \infty} \sup_{t \in [0, T]} \|\theta_n(t) - \theta_\infty(t)\| = 0$$
for every $T > 0$.

Proof 11.
It is shown in Lemma 2.3 of [20] that
$$\int_0^t \int \tilde{f}(s, z)\, \mu_n(s, dz)\, ds - \int_0^t \int \tilde{f}(s, z)\, \mu_\infty(s, dz)\, ds \to 0$$
for any $\tilde{f} \in C([0, T] \times S)$. Using this, one can see that
$$\left\|\int_0^t \left(\tilde{h}(\theta_\infty(s), \lambda(\theta_\infty(s)), \mu_n(s)) - \tilde{h}(\theta_\infty(s), \lambda(\theta_\infty(s)), \mu_\infty(s))\right) ds\right\| \to 0.$$
This follows because $\lambda$ is continuous and $h$ is jointly continuous in its arguments. As a function of $t$, the integral on the left is equicontinuous and pointwise bounded. By the Arzelà-Ascoli theorem, this convergence must in fact be uniform for $t$ in a compact set. Now, for $t > 0$,
$$\|\theta_n(t) - \theta_\infty(t)\| \leq \|\theta_n(0) - \theta_\infty(0)\| + \int_0^t \|\tilde{h}(\theta_n(s), \lambda(\theta_n(s)), \mu_n(s)) - \tilde{h}(\theta_\infty(s), \lambda(\theta_\infty(s)), \mu_n(s))\|\, ds + \int_0^t \|\tilde{h}(\theta_\infty(s), \lambda(\theta_\infty(s)), \mu_n(s)) - \tilde{h}(\theta_\infty(s), \lambda(\theta_\infty(s)), \mu_\infty(s))\|\, ds.$$
Now, using the fact that $\lambda$ is Lipschitz with constant $K$, the remaining part of the proof follows in the same manner as Lemma 2.3 of [20].

Note that Lemma 3.8 shows that every limit point $(\tilde{\mu}(\cdot), \tilde{\theta}(\cdot))$ of $(\mu(s + \cdot), \bar{\theta}(s + \cdot))$ as $s \to \infty$ is such that $\tilde{\theta}(\cdot)$ satisfies (19) with $\mu(\cdot) = \tilde{\mu}(\cdot)$.

Lemma 3.9.
Almost surely, every limit point of $(\mu(s + \cdot), \bar{\theta}(s + \cdot))$ as $s \to \infty$ is of the form $(\tilde{\mu}(\cdot), \tilde{\theta}(\cdot))$, where $\tilde{\mu}(\cdot)$ satisfies $\tilde{\mu}(t) \in D^{(1)}(\tilde{\theta}(t), \lambda(\tilde{\theta}(t)))$ for a.e. $t$.

Proof 12.
Suppose that $u(n) \uparrow \infty$, $\mu(u(n) + \cdot) \to \tilde{\mu}(\cdot)$ and $\bar{\theta}(u(n) + \cdot) \to \tilde{\theta}(\cdot)$. Let $\{f_i\}$ be countable dense in the unit ball of $C(S^{(1)})$, hence a separating class, i.e., for all $i$, $\int f_i \, d\mu = \int f_i \, d\nu$ implies $\mu = \nu$. For each $i$,
$$\zeta^i_n = \sum_{m=1}^{n-1} a(m)\left(f_i(Z^{(1)}_{m+1}) - \int f_i(y)\, p(dy \mid Z^{(1)}_m, A^{(1)}_m, \theta_m, w_m)\right)$$
is a zero-mean martingale w.r.t. $\mathcal{F}_n = \sigma(\theta_m, w_m, Z^{(1)}_m, A^{(1)}_m, m \leq n)$, $n \geq 1$. Moreover, it is a square-integrable martingale, since the $f_i$'s are bounded and each $\zeta^i_n$ is a finite sum. Its quadratic variation process
$$A_n = \sum_{m=0}^{n-1} a(m)^2\, E\left[\left(f_i(Z^{(1)}_{m+1}) - \int f_i(y)\, p(dy \mid Z^{(1)}_m, A^{(1)}_m, \theta_m, w_m)\right)^2 \Big| \mathcal{F}_m\right] + E[(\zeta^i_0)^2]$$
is almost surely convergent. By the martingale convergence theorem, $\{\zeta^i_n\}$ converges a.s. Let $\tau(n, t) = \min\{m \geq n : t(m) \geq t(n) + t\}$ for $t \geq 0$, $n \geq 0$. Then as $n \to \infty$,
$$\sum_{m=n}^{\tau(n,t)} a(m)\left(f_i(Z^{(1)}_{m+1}) - \int f_i(y)\, p(dy \mid Z^{(1)}_m, A^{(1)}_m, \theta_m, w_m)\right) \to 0 \text{ a.s.}$$
for $t > 0$. By our choice of $\{f_i\}$ and the fact that $\{a(n)\}$ is eventually non-increasing,
$$\sum_{m=n}^{\tau(n,t)} (a(m) - a(m+1))\, f_i(Z^{(1)}_{m+1}) \to 0 \text{ a.s.}$$
Thus,
$$\sum_{m=n}^{\tau(n,t)} a(m)\left(f_i(Z^{(1)}_m) - \int f_i(y)\, p(dy \mid Z^{(1)}_m, A^{(1)}_m, \theta_m, w_m)\right) \to 0 \text{ a.s.},$$
which implies
$$\int_{t(n)}^{t(n)+t} \left(\int \left(f_i(z) - \int f_i(y)\, p(dy \mid z, a, \hat{\theta}(s), \hat{w}(s))\right) \mu(s, dz\, da)\right) ds \to 0 \text{ a.s.}$$
Recall that $u(n)$ can be a general sequence other than $t(n)$. Therefore
$$\int_{u(n)}^{u(n)+t} \left(\int \left(f_i(z) - \int f_i(y)\, p(dy \mid z, a, \hat{\theta}(s), \hat{w}(s))\right) \mu(s, dz\, da)\right) ds \to 0 \text{ a.s.}$$
(this follows from the fact that $a(n) \to 0$ and the $f_i$'s are bounded), where $\hat{\theta}(s) = \theta_n$ and $\hat{w}(s) = w_n$ for $s \in [t(n), t(n+1))$, $n \geq 0$. Now, one can claim from the above that
$$\int_{u(n)}^{u(n)+t} \left(\int \left(f_i(z) - \int f_i(y)\, p(dy \mid z, a, \bar{\theta}(s), \lambda(\bar{\theta}(s)))\right) \mu(s, dz\, da)\right) ds \to 0 \text{ a.s.}$$
This is due to the fact that the map $S^{(1)} \times U^{(1)} \times \mathbb{R}^{d+k} \ni (z, a, \theta, w) \to \int f_i(y)\, p(dy \mid z, a, \theta, w)$ is continuous, and hence uniformly continuous on the compact set $A = S^{(1)} \times U^{(1)} \times M \times M_1$, where $M$ is the compact set such that $\theta_n \in M$ for all $n$ and $M_1 = \{w : \|w\| \leq \max(\sup_n \|w_n\|, K')\}$, with $K'$ the bound for the compact set $\lambda(M)$. Here we also use the fact that $\|w_m - \lambda(\bar{\theta}(s))\| \to 0$ for $s \in [t(m), t(m+1))$, as $\lambda$ is Lipschitz and $\|w_m - \lambda(\theta_m)\| \to 0$. The above convergence is equivalent to
$$\int_0^t \left(\int \left(f_i(z) - \int f_i(y)\, p(dy \mid z, a, \bar{\theta}(s + u(n)), \lambda(\bar{\theta}(s + u(n))))\right) \mu(s + u(n), dz\, da)\right) ds \to 0 \text{ a.s.}$$
Fix a sample point in the probability one set on which the convergence above holds for all $i$. Then the convergence above leads to
$$\int_0^t \int \left(f_i(z) - \int f_i(y)\, p(dy \mid z, a, \tilde{\theta}(s), \lambda(\tilde{\theta}(s)))\right) \tilde{\mu}(s, dz\, da)\, ds = 0 \quad \forall i. \quad (20)$$
To show this, we use one part of the proof of Lemma 2.3 of [20]: if $\mu_n(\cdot) \to \mu_\infty(\cdot)$ in $\mathcal{U}$, then for any $t$,
$$\int_0^t \int \tilde{f}(s, z, a)\, \mu_n(s, dz\, da)\, ds - \int_0^t \int \tilde{f}(s, z, a)\, \mu_\infty(s, dz\, da)\, ds \to 0$$
for all $\tilde{f} \in C([0, t] \times S^{(1)} \times U^{(1)})$. In addition, we use the fact that $\tilde{f}_n(s, z, a) = \int f_i(y)\, p(dy \mid z, a, \bar{\theta}(s + u(n)), \lambda(\bar{\theta}(s + u(n))))$ converges uniformly to $\tilde{f}(s, z, a) = \int f_i(y)\, p(dy \mid z, a, \tilde{\theta}(s), \lambda(\tilde{\theta}(s)))$. To prove this, define $g : C([0, t]) \times [0, t] \times S^{(1)} \times U^{(1)} \to \mathbb{R}$ by $g(\theta(\cdot), s, z, a) = \int f_i(y)\, p(dy \mid z, a, \theta(s), \lambda(\theta(s)))$. Let $A' = \{\bar{\theta}(u(n) + \cdot)|_{[u(n), u(n)+t]}, n \geq 0\} \cup \tilde{\theta}(\cdot)|_{[0,t]}$.
Using the same argument as in Lemma 3.2 and (A6), i.e., that $\lambda$ is Lipschitz (the latter helps to claim that if $\theta_n(\cdot) \to \theta(\cdot)$ uniformly then $\lambda(\theta_n(\cdot)) \to \lambda(\theta(\cdot))$ uniformly), it can be seen that $g$ is continuous. Then $A'$ is compact, as it is the union of a sequence of functions and its limit. So $g|_{A' \times [0,t] \times S^{(1)} \times U^{(1)}}$ is uniformly continuous. Then a similar argument as in Lemma 2.4 shows equicontinuity of $\{\tilde{f}_n(\cdot, \cdot)\}$, which results in uniform convergence and thereby (20). An application of Lebesgue's theorem in conjunction with (20) shows that
$$\int \left(f_i(z) - \int f_i(y)\, p(dy \mid z, a, \tilde{\theta}(t), \lambda(\tilde{\theta}(t)))\right) \tilde{\mu}(t, dz\, da) = 0 \quad \forall i$$
for a.e. $t$. By our choice of $\{f_i\}$, this leads to
$$\tilde{\mu}(t, dy \times U^{(1)}) = \int p(dy \mid z, a, \tilde{\theta}(t), \lambda(\tilde{\theta}(t)))\, \tilde{\mu}(t, dz\, da) \text{ a.e. } t.$$

Lemma 3.8 shows that every limit point $(\tilde{\mu}(\cdot), \tilde{\theta}(\cdot))$ of $(\mu(s + \cdot), \bar{\theta}(s + \cdot))$ as $s \to \infty$ is such that $\tilde{\theta}(\cdot)$ satisfies (19) with $\mu(\cdot) = \tilde{\mu}(\cdot)$. Hence $\tilde{\theta}(\cdot)$ is absolutely continuous. Moreover, using Lemma 3.9, one can see that it satisfies (9) for a.e. $t$, hence is a solution to the differential inclusion (9).

Proof 13 (Proof of Theorems 2.6 and 2.7). From the previous three lemmas it is easy to see that $A = \bigcap_{t \geq 0} \overline{\{\bar{\theta}(s) : s \geq t\}}$ is almost surely an internally chain transitive set of (9).

Proof 14 (Proof of Corollary 1). Follows directly from Theorem 2.6 and Lemma 2.1.
We now discuss relaxing the uniformity, w.r.t. the state of the controlled Markov process, of the Lipschitz constant of the vector field. The modified assumption is:

(A2)' $h : \mathbb{R}^{d+k} \times S^{(1)} \to \mathbb{R}^d$ is jointly continuous as well as Lipschitz in its first two arguments with the third argument fixed, the Lipschitz constant being a function of the third argument. The latter condition means that for all $z^{(1)} \in S^{(1)}$,
$$\|h(\theta, w, z^{(1)}) - h(\theta', w', z^{(1)})\| \leq L^{(1)}(z^{(1)})(\|\theta - \theta'\| + \|w - w'\|).$$
A similar condition holds for $g$, where the Lipschitz constant is $L^{(2)} : S^{(2)} \to \mathbb{R}^+$.

Note that this allows $L^{(i)}(\cdot)$ to be an unbounded measurable function, possibly discontinuous in view of (A1). The straightforward way to handle this is to additionally assume the following:

(A8) $\sup_n L^{(i)}(Z^{(i)}_n) < \infty$ a.s.,

while still allowing $L^{(i)}(\cdot)$ to be an unbounded function. As all our proofs in Section 3 are carried out for every sample point of a probability one set, the proofs then go through. In the following we give such an example for the case where the Markov process is uncontrolled. It is enough to consider examples with locally compact $S^{(i)}$ (because we can then take the standard one-point compactification and define $L^{(i)}$ arbitrarily at the extra point).

Let $S^{(i)} = \mathbb{Z}$ and let $Z^{(i)}_n, n \geq 0$, be the random walk on $\mathbb{Z}$ starting at 0, with transition probabilities $p(n, n+1) = p$ and $p(n, n-1) = 1 - p$.
We assume $1/2 < p < 1$. Let $L^{(i)}(n) = \left(\frac{1-p}{p}\right)^n$. Note that $Z^{(i)}_n, n \geq 0$, is a transient Markov chain and $Z^{(i)}_n \to +\infty$ a.s. From this it follows that $\inf_n Z^{(i)}_n > -\infty$ a.s., and thus $\sup_n L^{(i)}(Z^{(i)}_n) < \infty$ almost surely. It follows that $(L^{(i)}(Z^{(i)}_n))_{n \in \mathbb{N}}$ is a bounded sequence with probability 1, but this bound is clearly not deterministic, since there is a non-zero probability that the sample path reaches large negative values.
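A quick simulation (our own, for illustration only) makes this concrete: along each sample path the sequence $L^{(i)}(Z^{(i)}_n)$ stays bounded, but the bound varies from path to path.

```python
import numpy as np

# Simulate the transient random walk example: p(n, n+1) = p > 1/2,
# L(n) = ((1 - p) / p) ** n, which blows up as n -> -infinity.
rng = np.random.default_rng(0)
p = 0.7
ratio = (1 - p) / p

for path in range(5):
    z, min_z = 0, 0
    for _ in range(10000):
        z += 1 if rng.random() < p else -1
        min_z = min(min_z, z)
    # sup_n L(Z_n) on this path is attained at the lowest state visited.
    print(f"path {path}: min state = {min_z}, sup_n L(Z_n) = {ratio ** min_z:.2f}")
# Each bound is finite (the walk drifts to +infinity), but it is random.
```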
However, in the following we discuss the idea of using moment assumptions to analyze the convergence of the single timescale controlled Markov noise framework of [20]. We show that the iterates (14) (with $\epsilon_n = 0$) converge to an internally chain transitive set of the o.d.e. (12). For this we prove Lemma 3.1 under the following assumptions. For all $T > 0$:

(S1) The controlled Markov process $Y_n$ as described in [20] takes values in a compact metric space.

(S2) For all $n \geq 0$, $0 < a(n) \leq 1$, $\sum_n a(n) = \infty$, $\sum_n a(n)^2 < \infty$, and $a(n+1) \leq a(n)$ for all $n \geq 0$.

(S3) $h : \mathbb{R}^d \times S \to \mathbb{R}^d$ is Lipschitz in its first argument w.r.t. the second, i.e., for all $z \in S$,
$$\|h(\theta, z) - h(\theta', z)\| \leq L(z)\, \|\theta - \theta'\|.$$

(S4) Let $\phi(n, T) = \max(m : a(n) + a(n+1) + \cdots + a(n+m) \leq T)$, with the bound depending on $T$. Then
$$\sup_n E\left[\left(\sup_{0 \leq m \leq \phi(n,T)} L(Y_{n+m})\right)^4\right] < \infty.$$

(S5) $\sup_n E\left[e^{4\sum_{m=0}^{\phi(n,T)} a(n+m)\, L(Y_{n+m})}\right] < \infty$.

Note that (S4) and (S5) are trivially satisfied when $L(z) = L$ for all $z \in S$, i.e., in the case of Section 2.

Remark 9.
As long as one can prove Lemma 3.1 for all small $T > 0$, it will hold for all $T > 0$; thus one can combine (S4) and (S5) into the following single assumption:
$$\sup_n E\left[e^{4T \sup_{0 \leq m \leq \phi(n,T)} L(Y_{n+m})}\right] < \infty.$$
As an instance where such an assumption can be verified, consider the Markov process of [10, Eqn. (3.4)] defined by
$$Y_{n+1} = A(\theta_n) Y_n + B(\theta_n) W_{n+1},$$
where $A(\theta), B(\theta), \theta \in \mathbb{R}^d$, are $k \times k$ matrices and $(W_n)_{n \geq 0}$ are independent and identically distributed $\mathbb{R}^k$-valued random variables. Assume that the following conditions hold for all $x, y \in S$:

(a) $L(Y_n)$ is a non-decreasing sequence.

(b) For $r > 0$, $R > 0$,
$$\sup_{\|\theta\| \leq R} e^{r L(A(\theta) x + B(\theta) y)} \leq L_R\, \alpha_R^r\, e^{r L(x)} + M_R\, e^{C_R L(y)}$$
for some $C_R, M_R, L_R > 0$ and $\alpha_R < 1$.

Then
$$E\left[e^{r L(Y_n)} \mid Y_{n-1} = x, \theta_{n-1} = \theta\right] \leq \int e^{r L(A(\theta) x + B(\theta) y)}\, \mu_n(dy) \leq L_R\, \alpha_R^r\, e^{r L(x)} + M_R\, E\left[e^{C_R L(W_n)}\right] = L_R\, \alpha_R^r\, e^{r L(x)} + K_R,$$
with $K_R = M_R E[e^{C_R L(W_1)}]$ (this follows from the fact that the $W_n$ are i.i.d., assuming $E[e^{C_R L(W_1)}] < \infty$). Choosing a large value of $r$, one can ensure $\beta_R = L_R \alpha_R^r < 1$, so that
$$E\left[e^{r L(Y_n)} \mid Y_{n-1} = x, \theta_{n-1} = \theta\right] \leq \beta_R\, e^{r L(x)} + K_R.$$
Using the above, for large $r$,
$$E\left[e^{r L(Y_n)}\right] = E\left[E\left[e^{r L(Y_n)} \mid Y_{n-1}, \theta_{n-1}\right]\right] \leq \beta_R\, E\left[e^{r L(Y_{n-1})}\right] + K_R,$$
which shows that $\sup_n E[e^{r L(Y_n)}] < \infty$. Choosing $r > 4T$ then gives $\sup_n E[e^{4T L(Y_n)}] < \infty$. Note that this is a much weaker assumption than (A8).

(S6) The noise sequence $M_n, n \geq 0$, satisfies
$$\sup_n E\left[\left(\sum_{m=0}^{\phi(n,T)} \|M_{n+m+1}\|\right)^4\right] < \infty.$$

(S7) $\sup_n \|\theta_n\| < \infty$.

With the above assumptions we prove the following tracking lemma:

Lemma 4.1.
For any $T > 0$,
$$\sup_{t \in [s, s+T]} \|\bar{\theta}(t) - \theta^s(t)\| \to 0 \text{ a.s.}$$

Proof 15.
Let $t(n) \leq t \leq t(n+m)$. Now, if $0 \leq k \leq m-1$ and $t \in (t(n+k), t(n+k+1)]$,
$$\|\theta^{t(n)}(t)\| \leq \|\bar{\theta}(t(n))\| + \left\|\int_{t(n)}^t \tilde{h}(\theta^{t(n)}(\tau), \mu(\tau))\, d\tau\right\| \leq \|\theta_n\| + \sum_{l=0}^{k-1} \int_{t(n+l)}^{t(n+l+1)} \left(\|h(0, Y_{n+l})\| + L(Y_{n+l})\, \|\theta^{t(n)}(\tau)\|\right) d\tau + \int_{t(n+k)}^t \left(\|h(0, Y_{n+k})\| + L(Y_{n+k})\, \|\theta^{t(n)}(\tau)\|\right) d\tau \leq C_0 + M_0 T + \int_{t(n)}^t L(Y(\tau))\, \|\theta^{t(n)}(\tau)\|\, d\tau,$$
where $Y(\tau) = Y_n$ if $\tau \in [t(n), t(n+1))$, $C_0 = \sup_n \|\theta_n\| < \infty$ (by (S7)) and $M_0 = \sup_{z \in S} \|h(0, z)\| < \infty$ (by (S1)). It then follows from an application of the Gronwall inequality that
$$\|\theta^{t(n)}(t)\| \leq C\, e^{\int_{t(n)}^t L(Y(\tau))\, d\tau} \quad \text{a.e. } t,$$
where $C = C_0 + M_0 T$. Next,
$$\|\theta^{t(n)}(t) - \theta^{t(n)}(t(n+k))\| \leq \int_{t(n+k)}^t \|h(\theta^{t(n)}(s), Y_{n+k})\|\, ds \leq \|h(0, Y_{n+k})\|(t - t(n+k)) + L(Y_{n+k}) \int_{t(n+k)}^t \|\theta^{t(n)}(s)\|\, ds \leq M_0\, a(n+k) + C L(Y_{n+k}) \int_{t(n+k)}^t e^{\int_{t(n)}^s L(Y(\tau))\, d\tau}\, ds.$$
Then
$$\left\|\int_{t(n)}^{t(n+m)} \left(h(\theta^{t(n)}(t), \mu(t)) - h(\theta^{t(n)}([t]), \mu([t]))\right) dt\right\| \leq \sum_{k=0}^{m-1} \int_{t(n+k)}^{t(n+k+1)} \|h(\theta^{t(n)}(t), Y_{n+k}) - h(\theta^{t(n)}([t]), Y_{n+k})\|\, dt \leq \sum_{k=0}^{m-1} L(Y_{n+k}) \int_{t(n+k)}^{t(n+k+1)} \|\theta^{t(n)}(t) - \theta^{t(n)}(t(n+k))\|\, dt \leq \sum_{k=0}^{m-1} c_k,$$
where $c_k = L(Y_{n+k})\, a(n+k)^2 \left[M_0 + C L(Y_{n+k})\, e^{\sum_{i=0}^{k} a(n+i) L(Y_{n+i})}\right]$. Hence
$$\|\bar{\theta}(t(n+m)) - \theta^{t(n)}(t(n+m))\| \leq \sum_{k=0}^{m-1} L(Y_{n+k})\, a(n+k)\, \|\bar{\theta}(t(n+k)) - \theta^{t(n)}(t(n+k))\| + \sum_{k=0}^{m-1} c_k + \|\delta_{n, n+m}\|,$$
where $\delta_{n, n+m} = \sum_{k=n}^{n+m-1} a(k) M_{k+1}$. Therefore, using the discrete Gronwall inequality, we get
$$\|\bar{\theta}(t(n+m)) - \theta^{t(n)}(t(n+m))\| \leq r(m, n)\, e^{\sum_{k=0}^{m-1} a(n+k) L(Y_{n+k})},$$
where $r(m, n) = \sum_{k=0}^{m-1} (c_k + a(n+k)\, \|M_{n+k+1}\|)$. Now, for $t \in [t(n+m), t(n+m+1)]$ and some $\lambda \in [0, 1]$,
$$\|\theta^{t(n)}(t) - \bar{\theta}(t)\| \leq (1-\lambda)\, \|\theta^{t(n)}(t(n+m+1)) - \bar{\theta}(t(n+m+1))\| + \lambda\, \|\theta^{t(n)}(t(n+m)) - \bar{\theta}(t(n+m))\| + \max(\lambda, 1-\lambda) \int_{t(n+m)}^{t(n+m+1)} \|\tilde{h}(\theta^{t(n)}(s), \mu(s))\|\, ds \leq r(m+1, n)\, e^{\sum_{k=0}^{m} a(n+k) L(Y_{n+k})} + a(n+m)\left[M_0 + C L(Y_{n+m})\, e^{\sum_{k=0}^{m} a(n+k) L(Y_{n+k})}\right].$$
Therefore
$$\rho(n, T) := \sup_{t \in [t(n), t(n)+T]} \|\theta^{t(n)}(t) - \bar{\theta}(t)\| \leq r(\phi(n,T)+1, n)\, e^{\sum_{k=0}^{\phi(n,T)} a(n+k) L(Y_{n+k})} + a(n)\left[M_0 + C \sup_{0 \leq m \leq \phi(n,T)} L(Y_{n+m})\, e^{\sum_{k=0}^{\phi(n,T)} a(n+k) L(Y_{n+k})}\right].$$
To prove the a.s. convergence of the quantity on the left-hand side to zero as $n \to \infty$, we have, using the Cauchy-Schwarz inequality,
$$\sum_{n=1}^\infty E[\rho(n, T)^2] \leq 4 K_T \sum_{n=1}^\infty \left(E\left[r(\phi(n,T)+1, n)^4\right]\right)^{1/2} + 4 M_0^2 \sum_{n=0}^\infty a(n)^2 + 4 C^2 \sum_{n=1}^\infty a(n)^2\, E\left[\left(\sup_{0 \leq m \leq \phi(n,T)} L(Y_{n+m})\right)^2 e^{2\sum_{k=0}^{\phi(n,T)} a(n+k) L(Y_{n+k})}\right],$$
where $K_T = \sqrt{\sup_n E\left[e^{4\sum_{k=0}^{\phi(n,T)} a(n+k) L(Y_{n+k})}\right]}$, which depends only on $T$ by (S5). The third term on the R.H.S. is clearly finite from the assumptions (S4) and (S5), together with another application of the Cauchy-Schwarz inequality. We now analyze the first term:
$$\sum_{n=1}^\infty \left(E\left[r(\phi(n,T)+1, n)^4\right]\right)^{1/2} \leq 2\sqrt{2} \sum_{n=1}^\infty \left(E\left[\left(\sum_{k=0}^{\phi(n,T)} c_k\right)^4\right]\right)^{1/2} + 2\sqrt{2} \sum_{n=1}^\infty \left(E\left[\left(\sum_{k=0}^{\phi(n,T)} a(n+k)\, \|M_{n+k+1}\|\right)^4\right]\right)^{1/2}. \quad (21)$$
Next we analyze the first term on the R.H.S. of (21), again using the Cauchy-Schwarz inequality:
$$\sum_{n=1}^\infty \left(E\left[\left(\sum_{k=0}^{\phi(n,T)} c_k\right)^4\right]\right)^{1/2} \leq 2\sqrt{2}\, M_0^2 \sum_{n=1}^\infty (\phi(n,T)+1)^2\, a(n)^4 \left(E\left[\left(\sup_{0 \leq k \leq \phi(n,T)} L(Y_{n+k})\right)^4\right]\right)^{1/2} + 2\sqrt{2}\, C^2 \sum_{n=1}^\infty (\phi(n,T)+1)^2\, a(n)^4 \left(E\left[\left(\sup_{0 \leq k \leq \phi(n,T)} L(Y_{n+k})\right)^8 e^{4\sum_{i=0}^{\phi(n,T)} a(n+i) L(Y_{n+i})}\right]\right)^{1/2}.$$
Therefore the R.H.S. will be finite if we can show that $\sum_{n=1}^{\infty} \phi(n,T)^2 a(n)^4$ is finite. For the common step-size sequence $a(n) = \frac{1}{n}$ one has $\phi(n,T) = O(n)$, so the above series clearly converges. One can make the series converge for all $a(n) = \frac{1}{n^k}$ with $\frac{1}{2} < k \le 1$ by imposing assumptions on higher moments in (S4) and (S5).

In the above we have used the following inequality repeatedly for non-negative random variables $X$ and $Y$:
$$\sqrt{E\big[(X+Y)^n\big]} \le 2^{n-1}\Big[\sqrt{E[X^n]} + \sqrt{E[Y^n]}\Big], \quad n \in \mathbb{N}.$$
Now,
$$\sum_{n=1}^{\infty} E\Big[\Big(\sum_{k=0}^{\phi(n,T)} a(n+k)\|M_{n+k+1}\|\Big)^4\Big]^{1/2} \le \sum_{n=1}^{\infty} a(n)^2\, E\Big[\Big(\sum_{k=0}^{\phi(n,T)} \|M_{n+k+1}\|\Big)^4\Big]^{1/2},$$
which is finite under assumption (S5) and the fact that the $a(n)$ are non-increasing.

5 Off-policy temporal difference learning with linear function approximation

In this section, we present an application of our results in the setting of off-policy temporal difference learning with linear function approximation. In this framework, we need to estimate the value function for a target policy $\pi$ given the continuing evolution of the underlying MDP (with finite state and action spaces $S$ and $A$ respectively, specified by expected reward $r(\cdot,\cdot,\cdot)$ and transition probability kernel $p(\cdot|\cdot,\cdot)$) under a behaviour policy $\pi_b$ with $\pi \ne \pi_b$. The authors of [11, 12, 3] have proposed two approaches to solve the problem:

(i) Sub-sampling: In this approach, the transitions which are relevant to the deterministic target policy are kept and the rest of the data is discarded from the given "on-policy" trajectory. We use the triplet $(S, R, S')$ to represent (current state, reward, next state). One then has "off-policy" data $(X'_n, R_n, W_n)$, $n \ge 0$, with $E[R_n \mid X'_n = s, W_n = s'] = r(s,a,s')$ and $P(W_n = s' \mid X'_n = s) = p(s'|s,a)$ with $\pi(s) = a$, $\pi$ being the target policy, and $\{X'_n\}$, $n \ge 0$, forming a (non-i.i.d.) random process. The analyses of the above works, however, assume off-policy data $(X'_n, R_n, W_n)$, $n \ge 0$, where $\{X'_n\}$ are i.i.d., $E[R_n \mid X'_n = s, W_n = s'] = r(s,a,s')$ and $P(W_n = s' \mid X'_n = s) = p(s'|s,a)$ with $\pi(s) = a$, $\pi$ being the deterministic target policy. Additionally, the distribution of $\{X'_n\}$ is assumed to be sampled according to the stationary distribution of the Markov chain corresponding to the behaviour policy. However, such data cannot be generated by sub-sampling given only the "on-policy" trajectory: a Markov chain sampled at increasing stopping times cannot be i.i.d.

(ii) Importance-weighting: Here every transition of the "on-policy" trajectory is retained, and the algorithm's increments are re-weighted by the likelihood ratio of the target and behaviour policies.

In the following, we show how gradient temporal-difference learning along with importance-weighting can be used to solve the off-policy convergence problem stated above for TD when only the "on-policy" trajectory is available.
5.1 Problem Definition

Suppose we are given an on-policy trajectory $(X_n, A_n, R_n, X_{n+1})$, $n \ge 0$, where $\{X_n\}$ is a time-homogeneous irreducible Markov chain with unique stationary distribution $\nu$, generated from a behaviour policy $\pi_b \ne \pi$. Here the quadruplet $(S, A, R, S')$ represents (current state, action, reward, next state). Also, assume that $\pi_b(a|s) > 0$ for all $s \in S$, $a \in A$. We need to find the solution $\theta^*$ of the following:
$$0 = \sum_{s,a,s'} \nu(s)\,\pi(a|s)\,p(s'|s,a)\,\delta(\theta; s,a,s')\,\phi(s) = E\big[\rho_{X,A_n}\,\delta_{X,R_n,X_{n+1}}(\theta)\,\phi(X)\big] = b - A\theta, \quad (22)$$
where

(i) $\theta \in \mathbb{R}^d$ is the parameter of the value function,
(ii) $\phi : S \to \mathbb{R}^d$ is a vector of state features,
(iii) $X \sim \nu$,
(iv) $0 < \gamma < 1$,
(v) $E[R_n \mid X_n = s, X_{n+1} = s'] = \sum_{a \in A} \pi_b(a|s)\, r(s,a,s')$,
(vi) $P(X_{n+1} = s' \mid X_n = s) = \sum_{a \in A} \pi_b(a|s)\, p(s'|s,a)$,
(vii) $\delta(\theta; s,a,s') = r(s,a,s') + \gamma\theta^T\phi(s') - \theta^T\phi(s)$ is the temporal difference term with expected reward,
(viii) $\rho_{X,A_n} = \frac{\pi(A_n|X)}{\pi_b(A_n|X)}$,
(ix) $\delta_{X,R_n,X_{n+1}}(\theta) = R_n + \gamma\theta^T\phi(X_{n+1}) - \theta^T\phi(X)$ is the online temporal difference,
(x) $A = E\big[\rho_{X,A_n}\,\phi(X)(\phi(X) - \gamma\phi(X_{n+1}))^T\big]$,
(xi) $b = E\big[\rho_{X,A_n}\, R_n\, \phi(X)\big]$.

Hence the desired approximate value function under the target policy $\pi$ is $V^*_\pi = \theta^{*T}\phi$. Let $V_\theta = \theta^T\phi$. It is well known ([3]) that $\theta^*$ satisfies the projected fixed point equation $V_\theta = \Pi_{\mathcal{G},\nu} T^\pi V_\theta$, where
$$\Pi_{\mathcal{G},\nu}\hat{V} = \arg\min_{f \in \mathcal{G}} \|\hat{V} - f\|_\nu, \quad \mathcal{G} = \{V_\theta \mid \theta \in \mathbb{R}^d\},$$
and the Bellman operator is
$$T^\pi V_\theta(i) = \sum_{j \in S}\sum_{a \in A} \pi(a|i)\, p(j|i,a)\,\big[\gamma V_\theta(j) + r(i,a,j)\big].$$
Therefore, to find $\theta^*$, the idea is to minimize the mean square projected Bellman error $J(\theta) = \|V_\theta - \Pi_{\mathcal{G},\nu} T^\pi V_\theta\|_\nu^2$ using stochastic gradient descent. It can be shown that the expression for the gradient contains a product of multiple expectations. Such a framework can be modelled by two time-scale stochastic approximation, where one iterate stores the quasi-stationary estimates of some of the expectations and the other iterate is used for sampling.
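When the model quantities are available, the fixed point in (22) can of course be computed in closed form as $\theta^* = A^{-1}b$. The following is a minimal sketch on a randomly generated toy MDP (all names, sizes and model quantities are illustrative, not from the paper):

```python
import numpy as np

# Minimal sketch: solve (22) in closed form, theta* = A^{-1} b,
# for a randomly generated toy MDP. All quantities are illustrative.
rng = np.random.default_rng(0)
nS, nA, d, gamma = 5, 2, 3, 0.9

P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # p(s'|s,a)
R = rng.standard_normal((nS, nA, nS))           # r(s,a,s')
phi = rng.standard_normal((nS, d))              # feature map phi(s)
pi_b = rng.dirichlet(np.ones(nA), size=nS)      # behaviour policy pi_b(a|s)
pi = rng.dirichlet(np.ones(nA), size=nS)        # target policy pi(a|s)

# Stationary distribution nu of the behaviour-policy chain P_b(s'|s).
P_b = np.einsum('sa,sax->sx', pi_b, P)
evals, evecs = np.linalg.eig(P_b.T)
nu = np.real(evecs[:, np.argmax(np.real(evals))])
nu = nu / nu.sum()

# Assemble A = E[rho phi(X)(phi(X) - gamma phi(X_{n+1}))^T] and
# b = E[rho R_n phi(X)]; the importance weight rho = pi/pi_b cancels
# the pi_b factor, leaving the weights nu(s) pi(a|s) p(s'|s,a) of (22).
A = np.zeros((d, d))
b = np.zeros(d)
for s in range(nS):
    for a in range(nA):
        for s2 in range(nS):
            w = nu[s] * pi[s, a] * P[s, a, s2]
            A += w * np.outer(phi[s], phi[s] - gamma * phi[s2])
            b += w * R[s, a, s2] * phi[s]

theta_star = np.linalg.solve(A, b)      # TD(0) solution of (22)
print(np.allclose(A @ theta_star, b))   # sanity check: b - A theta* = 0
```

The point of the algorithm below, of course, is to find $\theta^*$ online, without access to $p$, $r$ or $\nu$.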
5.2 The TDC Algorithm with importance-weighting

We consider the TDC (Temporal Difference with Correction) algorithm with importance-weighting from Sections 4.2 and 5.2 of [3]. The gradient in this case can be shown to satisfy
$$-\nabla J(\theta) = E\big[\rho_{X,A_n}\,\delta_{X,R_n,X_{n+1}}(\theta)\,\phi(X)\big] - \gamma E\big[\rho_{X,A_n}\,\phi(X_{n+1})\,\phi(X)^T\big]\, w(\theta),$$
$$w(\theta) = E\big[\phi(X)\phi(X)^T\big]^{-1} E\big[\rho_{X,A_n}\,\delta_{X,R_n,X_{n+1}}(\theta)\,\phi(X)\big].$$
Define $\phi_n = \phi(X_n)$, $\phi'_n = \phi(X_{n+1})$, $\delta_n(\theta) = \delta_{X_n,R_n,X_{n+1}}(\theta)$ and $\rho_n = \rho_{X_n,A_n}$. Therefore the associated iterations in this algorithm are:
$$\theta_{n+1} = \theta_n + a(n)\,\rho_n\big[\delta_n(\theta_n)\phi_n - \gamma\phi'_n\phi_n^T w_n\big],$$
$$w_{n+1} = w_n + b(n)\big[(\rho_n\delta_n(\theta_n) - \phi_n^T w_n)\phi_n\big], \quad (23)$$
with $\{a(n)\}$, $\{b(n)\}$ satisfying (A4).
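For illustration, here is a minimal simulation sketch of the iterations (23) run on the "on-policy" stream generated by $\pi_b$. It reuses the toy quantities from the previous sketch; the step sizes below are one admissible example, not prescribed by the algorithm:

```python
import numpy as np

def tdc(P, R, phi, pi, pi_b, gamma, n_iter=500_000, seed=1):
    """Sketch of TDC with importance-weighting, cf. (23)."""
    nS, nA, _ = P.shape
    d = phi.shape[1]
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)
    w = np.zeros(d)
    s = 0
    for n in range(1, n_iter + 1):
        a_n = 1.0 / n          # slower step size a(n)
        b_n = 1.0 / n ** 0.6   # faster step size b(n); a(n)/b(n) -> 0
        a = rng.choice(nA, p=pi_b[s])     # action from the behaviour policy
        s2 = rng.choice(nS, p=P[s, a])    # next state from the kernel
        r = R[s, a, s2]                   # (toy: deterministic reward)
        rho = pi[s, a] / pi_b[s, a]       # importance weight rho_n
        delta = r + gamma * phi[s2] @ theta - phi[s] @ theta  # delta_n(theta_n)
        # (23): theta moves on the slower timescale, w on the faster one
        theta = theta + a_n * rho * (delta * phi[s]
                                     - gamma * (phi[s] @ w) * phi[s2])
        w = w + b_n * ((rho * delta - phi[s] @ w) * phi[s])
        s = s2
    return theta

# For the toy model above, the returned iterate should approach theta_star
# from the previous sketch, in line with Theorem 5.1 below.
```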
5.3 Convergence Proof

Theorem 5.1 (Convergence of TDC with importance-weighting). Consider the iterations (23) of TDC. Assume the following:

(i) $\{a(n)\}$, $\{b(n)\}$ satisfy (A4).
(ii) $\{(X_n, R_n, X_{n+1}), n \ge 0\}$ is such that $\{X_n\}$ is a time-homogeneous finite state irreducible Markov chain generated from the behaviour policy $\pi_b$ with unique stationary distribution $\nu$; $E[R_n \mid X_n = s, X_{n+1} = s'] = \sum_{a \in A}\pi_b(a|s)\,r(s,a,s')$ and $P(X_{n+1} = s' \mid X_n = s) = \sum_{a \in A}\pi_b(a|s)\,p(s'|s,a)$, where $\pi_b$ is the behaviour policy and $\pi \ne \pi_b$. Also, $E[R_n^2 \mid X_n, X_{n+1}] < \infty$ for all $n$ almost surely.
(iii) $C = E[\phi(X)\phi(X)^T]$ and $A = E[\rho_{X,A_n}\,\phi(X)(\phi(X) - \gamma\phi(X_{n+1}))^T]$ are non-singular, where $X \sim \nu$.
(iv) $\pi_b(a|s) > 0$ for all $s \in S$, $a \in A$.
(v) $\sup_n(\|\theta_n\| + \|w_n\|) < \infty$ w.p. 1.

Then the parameter vector $\theta_n$ converges with probability one as $n \to \infty$ to the TD(0) solution $\theta^*$ of (22).
Proof 16. The iterations (23) can be cast into the framework of Section 2.2 with

(i) $Z^{(i)}_n = X_n$, $i = 1, 2$,
(ii) $h(\theta, w, z) = E\big[\rho_n(\delta_n(\theta)\phi_n - \gamma\phi'_n\phi_n^T w) \mid X_n = z, \theta_n = \theta, w_n = w\big]$,
(iii) $g(\theta, w, z) = E\big[(\rho_n\delta_n(\theta) - \phi_n^T w)\phi_n \mid X_n = z, \theta_n = \theta, w_n = w\big]$,
(iv) $M^{(1)}_{n+1} = \rho_n(\delta_n(\theta_n)\phi_n - \gamma\phi'_n\phi_n^T w_n) - E\big[\rho_n(\delta_n(\theta_n)\phi_n - \gamma\phi'_n\phi_n^T w_n) \mid X_n, \theta_n, w_n\big]$,
(v) $M^{(2)}_{n+1} = (\rho_n\delta_n(\theta_n) - \phi_n^T w_n)\phi_n - E\big[(\rho_n\delta_n(\theta_n) - \phi_n^T w_n)\phi_n \mid X_n, \theta_n, w_n\big]$,
(vi) $\mathcal{F}_n = \sigma(\theta_m, w_m, X_m, R_{m-1}, A_{m-1}, m \le n)$, $n \ge 0$.

Note that in (ii) and (iii) we can define $h$ and $g$ independent of $n$ due to the time-homogeneity of $\{X_n\}$. Now, we verify the assumptions (A1)-(A7) (mentioned in Sections 2.2 and 2.3) for our application:

(i) (A1): $Z^{(i)}_n$, $i = 1, 2$, takes values in a compact metric space for all $n$, as $\{X_n\}$ is a finite state Markov chain.

(ii) (A5): Continuity of the transition kernel follows trivially from the fact that we have a finite state MDP.
Remark 10. In fact, we do not have to verify this assumption for the special case where the Markov chain is uncontrolled and has a unique stationary distribution. The reason is that in such a case (A5) is used only in the proof of Lemma 2.3; however, if the Markov chain has a unique stationary distribution, Lemma 2.3 follows trivially.

(iii) (A2):
(a)
$$\|h(\theta, w, z) - h(\theta', w', z)\| = \big\|E\big[\rho_n(\theta - \theta')^T(\gamma\phi(X_{n+1}) - \phi(X_n))\,\phi(X_n) - \gamma\rho_n\phi(X_{n+1})\phi(X_n)^T(w - w') \mid X_n = z\big]\big\| \le L\big(2M^2\|\theta - \theta'\| + M^2\|w - w'\|\big),$$
where $M = \max_{s \in S}\|\phi(s)\|$ with $S$ being the state space of the MDP, and $L = \max_{(s,a) \in S \times A} \frac{\pi(a|s)}{\pi_b(a|s)}$. Hence $h$ is Lipschitz continuous in the first two arguments uniformly w.r.t. the third. In the last inequality above, we use the Cauchy-Schwarz inequality.
(b) As with the case of $h$, $g$ can be shown to be Lipschitz continuous in the first two arguments uniformly w.r.t. the third.
(c) Joint continuity of $h$ and $g$ follows from (iii)(a) and (b) respectively, as well as the finiteness of $S$.

(iv) (A3): Clearly, $\{M^{(i)}_{n+1}\}$, $i = 1, 2$, are martingale difference sequences w.r.t. the increasing $\sigma$-fields $\mathcal{F}_n$. Note that
$$E\big[\|M^{(i)}_{n+1}\|^2 \mid \mathcal{F}_n\big] \le K\big(1 + \|\theta_n\|^2 + \|w_n\|^2\big) \ \text{a.s.}, \ n \ge 0,$$
since $E[R_n^2 \mid X_n, X_{n+1}] < \infty$ for all $n$ almost surely and $S$ is finite.

(v) (A4): This follows from condition (i) in the statement of Theorem 5.1.

Now, one can see that the faster o.d.e. becomes
$$\dot{w}(t) = E\big[\rho_{X,A_n}\delta_{X,R_n,X_{n+1}}(\theta)\phi(X)\big] - E\big[\phi(X)\phi(X)^T\big]\,w(t).$$
Clearly, $C^{-1}E[\rho_{X,A_n}\delta_{X,R_n,X_{n+1}}(\theta)\phi(X)]$ is the globally asymptotically stable equilibrium of this o.d.e. Moreover, $V'(\theta, w) = \|Cw - E[\rho_{X,A_n}\delta_{X,R_n,X_{n+1}}(\theta)\phi(X)]\|^2$ is continuously differentiable. Additionally, $\lambda(\theta) = C^{-1}E[\rho_{X,A_n}\delta_{X,R_n,X_{n+1}}(\theta)\phi(X)]$ is Lipschitz continuous in $\theta$, verifying (A6)'. For the slower o.d.e., the global attractor is $A^{-1}E[\rho_{X,A_n}R_n\phi(X)] = A^{-1}b = \theta^*$, verifying the additional assumption in Corollary 1; the attractor set here is a singleton. Also, (A7) is (v) in the statement of Theorem 5.1. Therefore the assumptions (A1)-(A5), (A6)', (A7) are verified, and the result follows from Corollary 1.
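To spell out the (standard) Lyapunov computation behind the (A6)' verification above, write $d(\theta) = E[\rho_{X,A_n}\delta_{X,R_n,X_{n+1}}(\theta)\phi(X)]$ as a shorthand. Along trajectories of the faster o.d.e. $\dot{w}(t) = d(\theta) - Cw(t)$,
$$\frac{d}{dt}V'(\theta, w(t)) = 2\big(Cw(t) - d(\theta)\big)^T C\,\dot{w}(t) = -2\big(Cw(t) - d(\theta)\big)^T C\,\big(Cw(t) - d(\theta)\big) < 0 \quad \text{whenever } Cw(t) \ne d(\theta),$$
since $C = E[\phi(X)\phi(X)^T]$ is symmetric positive semi-definite and non-singular by assumption (iii), hence positive definite. Thus $V'$ strictly decreases until $w = C^{-1}d(\theta) = \lambda(\theta)$, which is therefore the globally asymptotically stable equilibrium.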
Remark 11. The reason for using the two time-scale framework for the TDC algorithm is to make sure that the limiting o.d.e.s have globally asymptotically stable equilibria.
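As a concrete illustration (assuming (A4) is the usual two time-scale step-size requirement of Section 2.2), one admissible choice is
$$a(n) = \frac{1}{n+1}, \qquad b(n) = \frac{1}{(n+1)^{2/3}},$$
for which $\sum_n a(n) = \sum_n b(n) = \infty$, $\sum_n \big(a(n)^2 + b(n)^2\big) < \infty$ and $\frac{a(n)}{b(n)} \to 0$, so that $\{w_n\}$ moves on the faster timescale and sees $\{\theta_n\}$ as quasi-static.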
Remark 12.
Because the gradient is a product of two expectations, the scheme is a "pseudo"-gradient descent, which helps to find the global minimum here.
Remark 13.
Here we assume the stability of the iterates (23). Certain sufficient conditions for the stability of single timescale stochastic recursions with controlled Markov noise have been sketched in [21, p. 75, Theorem 9]; these subsequently need to be extended to the case of two time-scale recursions. Another way to ensure boundedness of the iterates is to use a projection operator. However, projection may introduce spurious fixed points on the boundary of the projection region, and finding the globally asymptotically stable equilibrium of a projected o.d.e. is hard. Therefore we do not use projection in our algorithm.
Remark 14.
Convergence analysis for TDC with importance-weighting along with eligibility traces (cf. [3, p. 74], where it is called GTD($\lambda$)) can be done similarly using our results. The main advantage is that our analysis works for all $\lambda < \frac{1}{L\gamma}$ (where $\lambda \in [0,1]$ is the eligibility function), whereas the analysis in [4] is shown to work only for $\lambda$ very close to 1.
Remark 15. One can analyze this algorithm when the state space is infinite by imposing assumptions on $\phi$ as well as on the target and behaviour policies.

6 Conclusions

We presented a general framework for two time-scale stochastic approximation with controlled Markov noise. Moreover, using a special case of our results, namely when the random process is a finite state irreducible time-homogeneous Markov chain (hence has a unique stationary distribution) and is uncontrolled (i.e., does not depend on the iterates), we provided a rigorous proof of convergence for an off-policy temporal difference learning algorithm with linear function approximation, under the assumption that only the "on-policy" trajectory for a behaviour policy is available; the proof also extends to eligibility traces for a sufficiently large range of $\lambda$. To our knowledge, this has not been done previously.

Acknowledgments

The authors want to thank Csaba Szepesvári for some useful discussions on the literature of off-policy learning. Our work was partly supported by the Robert Bosch Centre for Cyber-Physical Systems, Indian Institute of Science, Bangalore.
References

[1] A. Benveniste, M. Métivier, and P. Priouret. Adaptive Algorithms and Stochastic Approximations. Springer-Verlag, Berlin-New York, 1990.
[2] D. J. Ma, A. M. Makowski, and A. Shwartz. Stochastic approximations for finite state Markov chains. Stochastic Processes and their Applications, 35:27–45, 1990.
[3] H. R. Maei. Gradient Temporal-Difference Learning Algorithms. PhD thesis, University of Alberta, 2011.
[4] H. Yu. Least squares temporal difference methods: an analysis under general conditions. SIAM Journal on Control and Optimization, 50(6):3310–3343, 2012.
[5] H. Yu. Weak convergence properties of constrained emphatic temporal-difference learning with constant and slowly diminishing stepsize. Journal of Machine Learning Research, 17(220):1–58, 2016.
[6] I. Menache, S. Mannor, and N. Shimkin. Basis function adaptation in temporal difference reinforcement learning. Annals of Operations Research, 134:215–238, 2005.
[7] J.-P. Aubin and A. Cellina. Differential Inclusions: Set-Valued Maps and Viability Theory. Springer, 1984.
[8] M. Benaïm. Dynamics of stochastic approximation algorithms. Séminaire de Probabilités, pages 1–68, 1999.
[9] M. Benaïm, J. Hofbauer, and S. Sorin. Stochastic approximations and differential inclusions. SIAM Journal on Control and Optimization, 44(1):328–348, 2005.
[10] M. Métivier and P. Priouret. Applications of a Kushner-Clark lemma to general classes of stochastic algorithms. IEEE Transactions on Information Theory, 30(2):140–151, 1984.
[11] R. S. Sutton, H. R. Maei, and C. Szepesvári. A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In Advances in Neural Information Processing Systems, Vancouver, B.C., Canada, 2008.
[12] R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In International Conference on Machine Learning, Montreal, Canada, 2009.
[13] T. Degris, M. White, and R. S. Sutton. Off-policy actor-critic. In International Conference on Machine Learning, Scotland, UK, 2012.
[14] V. B. Tadić. Almost sure convergence of two time-scale stochastic approximation algorithms. In Proceedings of the American Control Conference, Boston, 2004.
[15] V. B. Tadić. Convergence and convergence rate of stochastic gradient search in the case of multiple and non-isolated extrema. Stochastic Processes and their Applications, 125:1715–1755, 2015.
[16] V. R. Konda and J. N. Tsitsiklis. Linear stochastic approximation driven by slowly varying Markov chains. Systems and Control Letters, 50:95–102, 2003.
[17] V. R. Konda and J. N. Tsitsiklis. On actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4):1143–1166, 2003.
[18] V. S. Borkar. Probability Theory: An Advanced Course. Springer, 1995.
[19] V. S. Borkar. Stochastic approximation with two time scales. Systems and Control Letters, 29(5):291–294, 1997.
[20] V. S. Borkar. Stochastic approximation with 'controlled Markov' noise. Systems and Control Letters, 55(2):139–145, 2006.
[21] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, 2008.
[22] W. Rudin.