Two Timescale Stochastic Approximation with Controlled Markov Noise and Off-policy Temporal Difference Learning
Prasenjit Karmakar and Shalabh Bhatnagar

April 15, 2019
Abstract
We present for the first time an asymptotic convergence analysis of two time-scale stochastic approximation driven by 'controlled' Markov noise. In particular, both the faster and slower recursions have non-additive controlled Markov noise components in addition to martingale difference noise. We analyze the asymptotic behavior of our framework by relating it to limiting differential inclusions in both time-scales that are defined in terms of the ergodic occupation measures associated with the controlled Markov processes. Finally, we present a solution to the off-policy convergence problem for temporal difference learning with linear function approximation, using our results.
Introduction

Stochastic approximation algorithms are sequential non-parametric methods for finding a zero or minimum of a function when only noisy observations of the function values are available. Two time-scale stochastic approximation algorithms represent one of the most general subclasses of stochastic approximation methods. These algorithms consist of two coupled recursions updated with different step sizes (one considerably smaller than the other), which in turn facilitates their convergence.

Two time-scale stochastic approximation algorithms [19] have been applied successfully to several complex problems arising in reinforcement learning, signal processing, and admission control in communication networks. There are many reinforcement learning applications (precisely those where the value function is parameterized) in which non-additive Markov noise is present in one or both iterates, requiring the current two time-scale framework to be extended to include Markov noise (for example, [13, p. 5] mentions that in order to generalize the analysis to Markov noise, the theory of two time-scale stochastic approximation needs to include the latter).

Here we present a more general framework of two time-scale stochastic approximation with "controlled" Markov noise, i.e., the noise is not simply Markov; rather, it is driven by the iterates and an additional control process as well. We analyze the asymptotic behaviour of our framework by relating it to limiting differential inclusions in both timescales that are defined in terms of the ergodic occupation measures associated with the controlled Markov processes. Next, using these results for the special case of our framework where the random processes are irreducible Markov chains, we present a solution to the off-policy convergence problem for temporal difference learning with linear function approximation. While the off-policy convergence problem for reinforcement learning (RL) with linear function approximation is one of the most interesting problems in the area, very few solutions are available in the current literature. One such work [4] shows the convergence of the least squares temporal difference learning algorithm with eligibility traces (LSTD($\lambda$)) as well as of the TD($\lambda$) algorithm. While the LSTD methods are not feasible when the dimension of the feature vector is large, off-policy TD($\lambda$) is shown to converge only when the eligibility function $\lambda \in [0, 1]$ is very close to 1. Another recent work [5] proves weak convergence of several emphatic temporal difference learning algorithms, which are also designed to solve the off-policy convergence problem. In [11, 12, 3] the gradient temporal difference learning (GTD) algorithms were proposed to solve this problem. However, the authors make the assumption that the data is available in the "off-policy" setting (i.e., the off-policy issue is incorporated into the data rather than into the algorithm), whereas, in reality, one has only the "on-policy" Markov trajectory corresponding to a given behaviour policy and is interested in designing an online learning algorithm. We use one of the algorithms from [3], called TDC with "importance-weighting", which takes the "on-policy" data as input, and show its convergence using the results we develop. Our convergence analysis can also be extended to the same algorithm with eligibility traces for a sufficiently large range of values of $\lambda$. Our results can further be used to provide a convergence analysis for reinforcement learning algorithms such as those in [6], for which convergence proofs have not been provided.

To the best of our knowledge, the related works [14, 16, 17, 15] analyze two time-scale stochastic approximation algorithms with iterate-dependent non-additive Markov noise. In all of them, the Markov noise in the recursion is handled using the classic Poisson equation based approach of [1, 10], which has been applied to the asymptotic analysis of many algorithms used in machine learning, system identification, signal processing, image analysis and automatic control. We show, however, that our method also works when there is an additional control process and when the underlying Markov process has non-unique stationary distributions. Further, the application we consider does not require strong assumptions such as aperiodicity of the underlying Markov chain, which is a sufficient condition when the Poisson equation based approach is used [2, 14]. Additionally, our assumptions differ considerably from those made in the literature mentioned above; we give a detailed comparison in Section 2.2.

The organization of the paper is as follows: Section 2 formally defines the problem and provides background and assumptions. Section 3 presents the main results. Section 4 discusses how one of the assumptions of Section 2 can be relaxed. Section 5 presents an application of our results to the off-policy convergence problem for temporal difference learning with linear function approximation. Finally, we conclude by providing some future research directions.

In the following we describe the preliminaries and notation used in our proofs. Most of the definitions and notation are from [9, 21, 7].
Definition and Notation
Let $F$ denote a set-valued function mapping each point $\theta \in \mathbb{R}^m$ to a set $F(\theta) \subset \mathbb{R}^m$. $F$ is called a Marchaud map if the following hold:

(i) $F$ is upper-semicontinuous in the sense that if $\theta_n \to \theta$ and $w_n \to w$ with $w_n \in F(\theta_n)$ for all $n \geq 1$, then $w \in F(\theta)$. In other words, the graph of $F$, defined as $\{(\theta, w) : w \in F(\theta)\}$, is closed.

(ii) $F(\theta)$ is a non-empty compact convex subset of $\mathbb{R}^m$ for all $\theta \in \mathbb{R}^m$.

(iii) There exists $c > 0$ such that for all $\theta \in \mathbb{R}^m$,
$$\sup_{z \in F(\theta)} \|z\| \leq c(1 + \|\theta\|),$$
where $\|\cdot\|$ denotes any norm on $\mathbb{R}^m$.

A solution of the differential inclusion (d.i.)
$$\dot{\theta}(t) \in F(\theta(t)) \quad (1)$$
with initial point $\theta_0 \in \mathbb{R}^m$ is an absolutely continuous (on compacts) mapping $\theta : \mathbb{R} \to \mathbb{R}^m$ such that $\theta(0) = \theta_0$ and $\dot{\theta}(t) \in F(\theta(t))$ for almost every $t \in \mathbb{R}$. If $F$ is a Marchaud map, it is well known that (1) has solutions (possibly non-unique) through every initial point. The differential inclusion (1) induces a set-valued dynamical system $\{\Phi_t\}_{t \in \mathbb{R}}$ defined by
$$\Phi_t(\theta_0) = \{\theta(t) : \theta(\cdot) \text{ is a solution to (1) with } \theta(0) = \theta_0\}.$$

Consider the autonomous ordinary differential equation (o.d.e.)
$$\dot{\theta}(t) = h(\theta(t)), \quad (2)$$
where $h$ is Lipschitz continuous. One can write (2) in the format of (1) by taking $F(\theta) = \{h(\theta)\}$. It is well known that (2) is well-posed, i.e., it has a unique solution for every initial point. Hence the set-valued dynamical system induced by the o.d.e., or flow, is $\{\Phi_t\}_{t \in \mathbb{R}}$ with $\Phi_t(\theta_0) = \{\theta(t)\}$, where $\theta(\cdot)$ is the solution to (2) with $\theta(0) = \theta_0$. It is also well known that $\Phi_t(\cdot)$ is a continuous function for all $t \in \mathbb{R}$.

A set $A \subset \mathbb{R}^m$ is said to be invariant (for $F$) if for all $\theta_0 \in A$ there exists a solution $\theta(\cdot)$ of (1) with $\theta(0) = \theta_0$ such that $\theta(\mathbb{R}) \subset A$.

Given a set $A \subset \mathbb{R}^m$ and $\theta'', w'' \in A$, we write $\theta'' \hookrightarrow_A w''$ if for every $\epsilon > 0$ and $T > 0$ there exist $n \in \mathbb{N}$, solutions $\theta_1(\cdot), \ldots, \theta_n(\cdot)$ to (1) and real numbers $t_1, t_2, \ldots, t_n$ greater than $T$ such that

(i) $\theta_i(s) \in A$ for all $0 \leq s \leq t_i$ and for all $i = 1, \ldots, n$,

(ii) $\|\theta_i(t_i) - \theta_{i+1}(0)\| \leq \epsilon$ for all $i = 1, \ldots, n-1$,

(iii) $\|\theta_1(0) - \theta''\| \leq \epsilon$ and $\|\theta_n(t_n) - w''\| \leq \epsilon$.

The sequence $(\theta_1(\cdot), \ldots, \theta_n(\cdot))$ is called an $(\epsilon, T)$ chain (in $A$ from $\theta''$ to $w''$) for $F$. A set $A \subset \mathbb{R}^m$ is said to be internally chain transitive provided that $A$ is compact and $\theta'' \hookrightarrow_A w''$ for all $\theta'', w'' \in A$. It can be proved that in this case $A$ is an invariant set.

A compact invariant set $A$ is called an attractor for $\Phi$ provided that there is a neighbourhood $U$ of $A$ (i.e., for the induced topology) with the property that $d(\Phi_t(\theta''), A) \to 0$ as $t \to \infty$ uniformly in $\theta'' \in U$. Here $d(X, Y) = \sup_{\theta'' \in X} \inf_{w'' \in Y} \|\theta'' - w''\|$ for $X, Y \subset \mathbb{R}^m$. Such a $U$ is called a fundamental neighbourhood of the attractor $A$. An attractor of a well-posed o.d.e. is an attractor for the set-valued dynamical system induced by the o.d.e.

The set
$$\omega_\Phi(\theta'') = \bigcap_{t \geq 0} \overline{\Phi_{[t, \infty)}(\theta'')}$$
is called the $\omega$-limit set of a point $\theta'' \in \mathbb{R}^m$. If $A$ is a set, then $B(A) = \{\theta'' \in \mathbb{R}^m : \omega_\Phi(\theta'') \subset A\}$ denotes its basin of attraction. A global attractor for $\Phi$ is an attractor $A$ whose basin of attraction consists of all of $\mathbb{R}^m$.
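As a concrete toy illustration of these notions (our own example; it does not appear in [9, 21, 7]), consider on $\mathbb{R}$ the set-valued map obtained by convexifying the discontinuous vector field $-\mathrm{sign}(\theta)$:
$$F(\theta) = \begin{cases} \{-1\}, & \theta > 0, \\ [-1, 1], & \theta = 0, \\ \{+1\}, & \theta < 0. \end{cases}$$
Its graph is closed, its values are non-empty, compact and convex, and $\sup_{z \in F(\theta)} \|z\| \leq 1 \leq 1 + \|\theta\|$, so $F$ is a Marchaud map with $c = 1$. Through every initial point $\theta_0$ the d.i. $\dot{\theta}(t) \in F(\theta(t))$ admits the solution $\theta(t) = \mathrm{sign}(\theta_0)\max(|\theta_0| - t, 0)$, which reaches $0$ in finite time and stays there; $\{0\}$ is a global attractor, and it is also the only internally chain transitive set, consistent with Lemma 2.1 below.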
The following lemma will be useful in our proofs; see [9] for a proof.

Lemma 2.1. Suppose $\Phi$ has a global attractor $A$. Then every internally chain transitive set lies in $A$.

We also require another result, which will be useful in applying our results to the RL application we mentioned. Before stating it we recall some definitions from Appendix 11.2.3 of [21]: A point $\theta^* \in \mathbb{R}^m$ is called Lyapunov stable for the o.d.e. (2) if for all $\epsilon > 0$ there exists a $\delta > 0$ such that every trajectory starting in the $\delta$-neighbourhood of $\theta^*$ remains in its $\epsilon$-neighbourhood. $\theta^*$ is called globally asymptotically stable if $\theta^*$ is Lyapunov stable and all trajectories of the o.d.e. converge to it.

Lemma 2.2.
Consider the autonomous o.d.e. $\dot{\theta}(t) = h(\theta(t))$, where $h$ is Lipschitz continuous. Let $\theta^*$ be globally asymptotically stable. Then $\theta^*$ is the global attractor of the o.d.e.

Proof 1.
We refer the reader to Lemma 1 of [21, Chapter 3] for a proof.
We end this subsection with notation that will be used frequently in the convergence statements of the following sections.
Definition 2.1.
For a function $\theta(\cdot)$ defined on $[0, \infty)$, the notation "$\theta(t) \to A$ as $t \to \infty$" means that $\bigcap_{t \geq 0} \overline{\{\theta(s) : s \geq t\}} \subset A$. A similar definition applies for a sequence $\{\theta_n\}$.

Problem Definition
Our goal is to perform an asymptotic analysis of the following coupled recursions:
$$\theta_{n+1} = \theta_n + a(n)\left[h(\theta_n, w_n, Z^{(1)}_n) + M^{(1)}_{n+1}\right], \quad (3)$$
$$w_{n+1} = w_n + b(n)\left[g(\theta_n, w_n, Z^{(2)}_n) + M^{(2)}_{n+1}\right], \quad (4)$$
where $\theta_n \in \mathbb{R}^d$, $w_n \in \mathbb{R}^k$, $n \geq 0$, and $\{Z^{(i)}_n\}$, $\{M^{(i)}_n\}$, $i = 1, 2$, are random processes described below. We make the following assumptions:

(A1) $\{Z^{(i)}_n\}$ takes values in a compact metric space $S^{(i)}$, $i = 1, 2$. Additionally, the processes $\{Z^{(i)}_n\}$, $i = 1, 2$, are controlled Markov processes, controlled by the iterate sequences $\{\theta_m\}$, $\{w_m\}$ and a random process $\{A^{(i)}_n\}$ taking values in a compact metric space $U^{(i)}$, respectively, with their individual dynamics specified by
$$P(Z^{(i)}_{n+1} \in B^{(i)} \mid Z^{(i)}_m, A^{(i)}_m, \theta_m, w_m, m \leq n) = \int_{B^{(i)}} p^{(i)}(dy \mid Z^{(i)}_n, A^{(i)}_n, \theta_n, w_n), \quad n \geq 0,$$
for $B^{(i)}$ Borel in $S^{(i)}$, $i = 1, 2$, respectively.
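To make the two time-scale structure concrete, the following minimal Python sketch simulates recursions (3)-(4) for a toy problem. The functions playing the roles of $h$ and $g$, the noise chain, and the step sizes are illustrative placeholders of our own choosing (they are not part of the framework above); the essential point is that $b(n)$ dominates $a(n)$, so $\{w_n\}$ equilibrates on the faster time-scale while $\{\theta_n\}$ moves slowly.

```python
import numpy as np

# Minimal sketch of the coupled recursions (3)-(4), with illustrative
# choices (ours): h(theta, w, z) = -w, g(theta, w, z) = theta - w + noise(z),
# where z evolves as a two-state Markov chain standing in for the (here
# uncontrolled) Markov noise, plus i.i.d. martingale-difference noise.
rng = np.random.default_rng(0)

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])       # transition matrix of the noise chain {0, 1}
theta, w, z = 5.0, -5.0, 0

for n in range(200000):
    a = 1.0 / (n + 1)             # slower step size a(n)
    b = 1.0 / (n + 1) ** (2 / 3)  # faster step size b(n); a(n)/b(n) -> 0
    z = rng.choice(2, p=P[z])     # Markov noise placeholder
    noise = 0.1 * (z - 0.5)
    M1, M2 = rng.normal(scale=0.1, size=2)
    theta += a * (-w + M1)                 # slow iterate, cf. (3)
    w += b * (theta - w + noise + M2)      # fast iterate, cf. (4)

# On the fast timescale w tracks lambda(theta) = theta + c (c the stationary
# mean of the noise term); on the slow timescale theta then converges to -c.
print(theta, w)
```

The step sizes $a(n) = 1/(n+1)$ and $b(n) = 1/(n+1)^{2/3}$ used here satisfy the standard two time-scale conditions stated in (A4) below.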
Remark 1. In this context one should note that [1, 10] require the Markov process to take values in a normed Polish space.

Remark 2. In [20] it is assumed that the state space where the controlled Markov process takes values is Polish. This space is then compactified using the fact that a Polish space can be homeomorphically embedded into a dense subset of a compact metric space. The vector field $h(\cdot, \cdot) : \mathbb{R}^d \times S \to \mathbb{R}^d$ is considered bounded when the first component lies in a compact set. This would, however, require a continuous extension of $h' : \mathbb{R}^d \times \phi(S) \to \mathbb{R}^d$, defined by $h'(x, s') = h(x, \phi^{-1}(s'))$, to $\mathbb{R}^d \times \overline{\phi(S)}$. Here $\phi(\cdot)$ is the homeomorphism defined by $\phi(s) = (\rho(s, s_1), \rho(s, s_2), \ldots) \in [0, \infty)^\infty$, where $\{s_i\}$ is a countable dense subset and $\rho$ the metric of the Polish space. A sufficient condition for the above is that $h'$ be uniformly continuous [22, Ex. 13, p. 99]. However, this is hard to verify. This is the main motivation for us to take the range of the Markov process to be compact for our problem. There are also other reasons for taking a compact state space, which will become clear in the proofs of this section and the next.

(A2) $h : \mathbb{R}^{d+k} \times S^{(1)} \to \mathbb{R}^d$ is jointly continuous as well as Lipschitz in its first two arguments uniformly w.r.t. the third. The latter condition means that for all $z^{(1)} \in S^{(1)}$,
$$\|h(\theta, w, z^{(1)}) - h(\theta', w', z^{(1)})\| \leq L^{(1)}(\|\theta - \theta'\| + \|w - w'\|).$$
The same holds for $g$, with Lipschitz constant $L^{(2)}$. Note that the Lipschitz constant $L^{(i)}$ does not depend on $z^{(i)}$ for $i = 1, 2$.
Remark 3. We later relax the uniformity of the Lipschitz constant w.r.t. the Markov process state space by imposing suitable moment assumptions on the Markov process.

(A3) $\{M^{(i)}_n\}$, $i = 1, 2$, are martingale difference sequences w.r.t. the increasing $\sigma$-fields
$$\mathcal{F}_n = \sigma(\theta_m, w_m, M^{(i)}_m, Z^{(i)}_m, m \leq n, i = 1, 2), \quad n \geq 0,$$
satisfying $E[\|M^{(i)}_{n+1}\|^2 \mid \mathcal{F}_n] \leq K(1 + \|\theta_n\| + \|w_n\|)^2$, $i = 1, 2$, for $n \geq 0$ and a given constant $K > 0$.

(A4) The step sizes $\{a(n)\}$, $\{b(n)\}$ are positive scalars satisfying
$$\sum_n a(n) = \sum_n b(n) = \infty, \quad \sum_n \left(a(n)^2 + b(n)^2\right) < \infty, \quad \frac{a(n)}{b(n)} \to 0.$$
Moreover, $a(n), b(n), n \geq 0$, are eventually non-increasing. For example, $a(n) = 1/(n+1)$ and $b(n) = 1/(n+1)^{2/3}$ satisfy all of these conditions.

Clearly, $p^{(i)}$, $i = 1, 2$, take values in the space of probability measures on the respective state spaces. Here we mention the definitions and main results on the space $\mathcal{P}(S)$ of probability measures on a compact metric space $S$ that we use in our proofs (details can be found in Chapter 2 of [18]). We denote the metric by $d$; it is defined as
$$d(\mu, \nu) = \sum_j 2^{-j} \left| \int f_j \, d\mu - \int f_j \, d\nu \right|, \quad \mu, \nu \in \mathcal{P}(S),$$
where $\{f_j\}$ is countable dense in the unit ball of $C(S)$. Then the following are equivalent:

(i) $d(\mu_n, \mu) \to 0$,

(ii) for all bounded $f$ in $C(S)$, $\int_S f \, d\mu_n \to \int_S f \, d\mu$, (5)

(iii) for all $f$ bounded and uniformly continuous, $\int_S f \, d\mu_n \to \int_S f \, d\mu$.

Hence we see that $d(\mu_n, \mu) \to 0$ if and only if $\int_S f_j \, d\mu_n \to \int_S f_j \, d\mu$ for all $j$. Any such sequence of functions $\{f_j\}$ is called a convergence determining class in $\mathcal{P}(S)$. We also sometimes denote $d(\mu_n, \mu) \to 0$ by $\mu_n \Rightarrow \mu$.

Also, we recall the characterization of relative compactness in $\mathcal{P}(S)$, which relies on the notion of tightness.
A set $A \subset \mathcal{P}(S)$ is tight if for any $\epsilon > 0$ there exists a compact $K_\epsilon \subset S$ such that $\mu(K_\epsilon) > 1 - \epsilon$ for all $\mu \in A$. Clearly, if $S$ is compact then any $A \subset \mathcal{P}(S)$ is tight. By Prohorov's theorem, $A \subset \mathcal{P}(S)$ is relatively compact if and only if it is tight.

With the above definitions we assume the following:

(A5)
The map $S^{(i)} \times U^{(i)} \times \mathbb{R}^{d+k} \ni (z^{(i)}, a^{(i)}, \theta, w) \to p^{(i)}(dy \mid z^{(i)}, a^{(i)}, \theta, w) \in \mathcal{P}(S^{(i)})$ is continuous.

Remark 4. (A5) is much simpler than the assumptions on the $n$-step transition kernel in [1, Part II, Chap. 2, Theorem 6]. Additionally, unlike [20, p. 140, line 13], we do not require the extra assumption that the continuity in the $\theta$ variable of $p(dy \mid z, a, \theta)$ be uniform on compacts w.r.t. the other variables.

For $\theta_n \equiv \theta$, $w_n \equiv w$ for all $n$, with a fixed deterministic $(\theta, w) \in \mathbb{R}^{d+k}$, and under any stationary randomized control $\pi^{(i)}$, it follows from Lemma 2.1 and Lemma 3.1 of [20] that the time-homogeneous Markov processes $Z^{(i)}_n$, $i = 1, 2$, have (possibly non-unique) stationary distributions $\Psi^{(i)}_{\theta, w, \pi^{(i)}}$, $i = 1, 2$.
Now, it is well known that the ergodic occupation measure defined as
$$\Psi^{(i)}_{\theta, w, \pi^{(i)}}(dz, da) := \Psi^{(i)}_{\theta, w, \pi^{(i)}}(dz)\, \pi^{(i)}(z, da) \in \mathcal{P}(S^{(i)} \times U^{(i)})$$
satisfies the following:
$$\int_{S^{(i)}} f^{(i)}(z)\, \Psi^{(i)}_{\theta, w, \pi^{(i)}}(dz, U^{(i)}) = \int_{S^{(i)} \times U^{(i)}} \int_{S^{(i)}} f^{(i)}(y)\, p^{(i)}(dy \mid z, a, \theta, w)\, \Psi^{(i)}_{\theta, w, \pi^{(i)}}(dz, da) \quad (6)$$
for $f^{(i)} \in C_b(S^{(i)})$.

We denote by $D^{(i)}(\theta, w)$, $i = 1, 2$, the set of all such ergodic occupation measures for the prescribed $\theta$ and $w$. In the following we prove some properties of the map $(\theta, w) \to D^{(i)}(\theta, w)$.
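For orientation, in the special case needed for the application in Section 5, where $S^{(i)}$ is finite, there is no control process, and the chain is irreducible, relation (6) reduces to the familiar stationarity condition (this reduction is our own unpacking of (6) under those assumptions):
$$\sum_{z \in S} f(z)\,\nu(z) = \sum_{z \in S}\sum_{y \in S} f(y)\, p(y \mid z)\, \nu(z) \;\;\forall f \;\Longleftrightarrow\; \nu^\top P = \nu^\top,$$
where $P = (p(y \mid z))_{z,y}$ is the transition matrix; the unique stationary distribution $\nu$ then makes $D^{(i)}(\theta, w)$ a singleton.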
Lemma 2.3. For all $(\theta, w)$, $D^{(i)}(\theta, w)$ is convex and compact.

Proof 2.
The proof trivially follows from (A1), (A5) and (6).

Lemma 2.4.
The map $(\theta, w) \to D^{(i)}(\theta, w)$ is upper-semicontinuous.

Proof 3.
Let $\theta_n \to \theta$, $w_n \to w$ and $\Psi^{(i)}_n \Rightarrow \Psi^{(i)} \in \mathcal{P}(S^{(i)} \times U^{(i)})$ with $\Psi^{(i)}_n \in D^{(i)}(\theta_n, w_n)$. Let
$$g^{(i)}_n(z, a) = \int_{S^{(i)}} f^{(i)}(y)\, p^{(i)}(dy \mid z, a, \theta_n, w_n), \qquad g^{(i)}(z, a) = \int_{S^{(i)}} f^{(i)}(y)\, p^{(i)}(dy \mid z, a, \theta, w).$$
From (6) we get that
$$\int_{S^{(i)}} f^{(i)}(z)\, \Psi^{(i)}(dz, U^{(i)}) = \lim_{n \to \infty} \int_{S^{(i)}} f^{(i)}(z)\, \Psi^{(i)}_n(dz, U^{(i)}) = \lim_{n \to \infty} \int_{S^{(i)} \times U^{(i)}} g^{(i)}_n(z, a)\, \Psi^{(i)}_n(dz, da).$$
Now, $p^{(i)}(dy \mid z, a, \theta_n, w_n) \Rightarrow p^{(i)}(dy \mid z, a, \theta, w)$ implies $g^{(i)}_n(\cdot, \cdot) \to g^{(i)}(\cdot, \cdot)$ pointwise. We prove that the convergence is in fact uniform. It is enough to prove that this sequence of functions is equicontinuous; together with pointwise convergence this implies uniform convergence on compacts [22, p. 168, Ex. 16]. This is also a place where (A1) is used.

Define $g' : S^{(i)} \times U^{(i)} \times \mathbb{R}^{d+k} \to \mathbb{R}$ by $g'(z', a', \theta', w') = \int_{S^{(i)}} f^{(i)}(y)\, p^{(i)}(dy \mid z', a', \theta', w')$. Then $g'$ is continuous. Let $A = S^{(i)} \times U^{(i)} \times (\{\theta_n\} \cup \theta) \times (\{w_n\} \cup w)$. Then $A$ is compact and $g'|_A$ is uniformly continuous. This implies that for all $\epsilon > 0$ there exists $\delta > 0$ such that if $\rho'(s_1, s_2) < \delta$, $\mu'(a_1, a_2) < \delta$, $\|\theta_1 - \theta_2\| < \delta$ and $\|w_1 - w_2\| < \delta$, then $|g'(s_1, a_1, \theta_1, w_1) - g'(s_2, a_2, \theta_2, w_2)| < \epsilon$, where $s_1, s_2 \in S^{(i)}$, $a_1, a_2 \in U^{(i)}$, $\theta_1, \theta_2 \in (\{\theta_n\} \cup \theta)$, $w_1, w_2 \in (\{w_n\} \cup w)$, and $\rho'$ and $\mu'$ denote the metrics in $S^{(i)}$ and $U^{(i)}$, respectively. Now, using this same $\delta$ for the $\{g^{(i)}_n(\cdot, \cdot)\}$, we get, for all $n$ and for $\rho'(z_1, z_2) < \delta$, $\mu'(a_1, a_2) < \delta$:
$$|g^{(i)}_n(z_1, a_1) - g^{(i)}_n(z_2, a_2)| = |g'(z_1, a_1, \theta_n, w_n) - g'(z_2, a_2, \theta_n, w_n)| < \epsilon.$$
Hence $\{g^{(i)}_n(\cdot, \cdot)\}$ is equicontinuous. For large $n$, $\sup_{(z,a) \in S^{(i)} \times U^{(i)}} |g^{(i)}_n(z, a) - g^{(i)}(z, a)| < \epsilon/2$ because of the uniform convergence of $\{g^{(i)}_n(\cdot, \cdot)\}$, hence $\int_{S^{(i)} \times U^{(i)}} |g^{(i)}_n(z, a) - g^{(i)}(z, a)|\, \Psi^{(i)}_n(dz, da) < \epsilon/2$. Now (for $n$ large),
$$\left| \int g^{(i)}_n \, d\Psi^{(i)}_n - \int g^{(i)} \, d\Psi^{(i)} \right| \leq \int |g^{(i)}_n - g^{(i)}|\, d\Psi^{(i)}_n + \left| \int g^{(i)} \, d\Psi^{(i)}_n - \int g^{(i)} \, d\Psi^{(i)} \right| < \epsilon/2 + \epsilon/2 = \epsilon, \quad (7)$$
where the last inequality uses $\Psi^{(i)}_n \Rightarrow \Psi^{(i)}$. Hence from (7) we get
$$\int_{S^{(i)}} f^{(i)}(z)\, \Psi^{(i)}(dz, U^{(i)}) = \int_{S^{(i)} \times U^{(i)}} \int_{S^{(i)}} f^{(i)}(y)\, p^{(i)}(dy \mid z, a, \theta, w)\, \Psi^{(i)}(dz, da),$$
proving that the map is upper-semicontinuous.

Define $\tilde{g}(\theta, w, \nu) = \int g(\theta, w, z)\, \nu(dz, U^{(2)})$ for $\nu \in \mathcal{P}(S^{(2)} \times U^{(2)})$ and $\hat{g}(\theta, w) = \{\tilde{g}(\theta, w, \nu) : \nu \in D^{(2)}(\theta, w)\}$.

Lemma 2.5. $\hat{g}(\cdot, \cdot)$ is a Marchaud map.

Proof 4. (i) Convexity and compactness follow trivially from the same properties of the map $(\theta, w) \to D^{(2)}(\theta, w)$.

(ii)
$$\|\tilde{g}(\theta, w, \nu)\| \leq \int \|g(\theta, w, z)\|\, \nu(dz, U^{(2)}) \leq \int \left(L^{(2)}(\|\theta\| + \|w\|) + \|g(0, 0, z)\|\right) \nu(dz, U^{(2)}) \leq K_g(1 + \|\theta\| + \|w\|),$$
where $K_g = \max\left(L^{(2)}, \sup_{z \in S^{(2)}} \|g(0, 0, z)\|\right) < \infty$ by the compactness of $S^{(2)}$ and the continuity of $g$.
The above bound holds for all $\tilde{g}(\theta, w, \nu) \in \hat{g}(\theta, w)$, $\nu \in D^{(2)}(\theta, w)$.

(iii) Let $(\theta_n, w_n) \to (\theta, w)$, $\tilde{g}(\theta_n, w_n, \nu_n) \to m$, $\nu_n \in D^{(2)}(\theta_n, w_n)$. Now, $\{\nu_n\}$ is tight, hence has a convergent subsequence $\{\nu_{n_k}\}$ with limit $\nu$. Then, using arguments similar to the proof of Lemma 2.4, one can show that $m = \tilde{g}(\theta, w, \nu)$, whereas $\nu \in D^{(2)}(\theta, w)$ follows directly from the upper-semicontinuity of the map $(\theta, w) \to D^{(2)}(\theta, w)$ for all $\theta$.

Note that the map $\hat{h}(\cdot, \cdot)$ can be defined similarly and shown to be a Marchaud map using exactly the same technique.

Other assumptions needed for two time-scale convergence analysis
We now list the other assumptions required for the two time-scale convergence analysis:

(A6) For all $\theta \in \mathbb{R}^d$, the differential inclusion
$$\dot{w}(t) \in \hat{g}(\theta, w(t)) \quad (8)$$
has a singleton global attractor $\lambda(\theta)$, where $\lambda : \mathbb{R}^d \to \mathbb{R}^k$ is a Lipschitz map with constant $K$. Additionally, there exists a continuous function $V : \mathbb{R}^{d+k} \to [0, \infty)$ satisfying the hypothesis of Corollary 3.28 of [9] with $\Lambda = \{(\theta, \lambda(\theta)) : \theta \in \mathbb{R}^d\}$. This is the most important assumption, as it links the fast and slow iterates.

(A7) Stability of the iterates: $\sup_n (\|\theta_n\| + \|w_n\|) < \infty$ a.s.

Define time instants $t(0) = 0$, $t(n) = \sum_{m=0}^{n-1} a(m)$, $n \geq 1$. Let $\bar{\theta}(t), t \geq 0$, be defined by $\bar{\theta}(t(n)) = \theta_n$, $n \geq 0$, with linear interpolation on each interval $[t(n), t(n+1))$, i.e.,
$$\bar{\theta}(t) = \theta_n + (\theta_{n+1} - \theta_n)\frac{t - t(n)}{t(n+1) - t(n)}, \quad t \in [t(n), t(n+1)).$$
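As a computational aside (our own illustration, not part of the framework), the interpolated trajectory $\bar{\theta}(\cdot)$ used throughout the analysis is straightforward to realize in code; the helper below, with made-up iterates and step sizes, evaluates it at arbitrary times:

```python
import numpy as np

def interpolated_trajectory(iterates, steps):
    """Return t -> theta_bar(t): piecewise-linear interpolation of the
    iterates at time instants t(0) = 0, t(n) = a(0) + ... + a(n-1)."""
    t = np.concatenate(([0.0], np.cumsum(steps[:-1])))  # t(0), ..., t(N-1)

    def theta_bar(s):
        n = np.searchsorted(t, s, side="right") - 1     # interval holding s
        n = min(max(n, 0), len(t) - 2)
        frac = (s - t[n]) / (t[n + 1] - t[n])
        return iterates[n] + (iterates[n + 1] - iterates[n]) * frac

    return theta_bar

# Illustrative use with made-up iterates and a(n) = 1/(n+1):
a = 1.0 / np.arange(1, 51)
theta = np.cumsum(np.random.default_rng(1).normal(size=50))
bar = interpolated_trajectory(theta, a)
print(bar(0.0), bar(1.7))
```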
The following theorem is our main result:
Theorem 2.6 (Slower timescale result). Under assumptions (A1)-(A7),
$$(\theta_n, w_n) \to \bigcup_{\theta^* \in A} (\theta^*, \lambda(\theta^*)) \text{ a.s. as } n \to \infty,$$
where $A = \bigcap_{t \geq 0} \overline{\{\bar{\theta}(s) : s \geq t\}}$ is almost surely an internally chain transitive set of the differential inclusion
$$\dot{\theta}(t) \in \hat{h}(\theta(t)), \quad (9)$$
where $\hat{h}(\theta) = \{\tilde{h}(\theta, \lambda(\theta), \nu) : \nu \in D^{(1)}(\theta, \lambda(\theta))\}$. We refer to (8) and (9) as the faster and slower d.i., corresponding to the faster and slower recursions, respectively.

Corollary 1. Under the additional assumption that the inclusion $\dot{\theta}(t) \in \hat{h}(\theta(t))$ has a global attractor set $A$,
$$(\theta_n, w_n) \to \bigcup_{\theta^* \in A} (\theta^*, \lambda(\theta^*)) \text{ a.s. as } n \to \infty.$$
Remark 5. In the case where the set $D^{(2)}(\theta, w)$ is a singleton, we can relax (A6) to allow local attractors as well. The relaxed assumption is:

(A6)' The function $\hat{g}(\theta, w) = \int g(\theta, w, z)\, \Gamma^{(2)}_{\theta, w}(dz)$ is Lipschitz continuous, where $\Gamma^{(2)}_{\theta, w}$ is the only element of $D^{(2)}(\theta, w)$. Further, for all $\theta \in \mathbb{R}^d$, the o.d.e.
$$\dot{w}(t) = \hat{g}(\theta, w(t)) \quad (10)$$
has an asymptotically stable equilibrium $\lambda(\theta)$ with domain of attraction $G_\theta$, where $\lambda : \mathbb{R}^d \to \mathbb{R}^k$ is a Lipschitz map with constant $K$. Also, assume that $\bigcap_\theta G_\theta$ is non-empty. Moreover, the function $V' : G \to [0, \infty)$ defined by $V'(\theta, w) = V_\theta(w)$ is continuously differentiable, where $V_\theta(\cdot)$ is the Lyapunov function (for the definition see [21, Chapter 11.2.3]) for the o.d.e. (10) with $\lambda(\theta)$ as its attractor, and $G = \bigcup_{\theta \in \mathbb{R}^d} \{\{\theta\} \times G_\theta\}$. This extra condition is needed so that the set $\mathrm{graph}(\lambda) := \{(\theta, \lambda(\theta)) : \theta \in \mathbb{R}^d\}$ becomes an asymptotically stable set of the coupled o.d.e.
$$\dot{w}(t) = \hat{g}(\theta(t), w(t)), \quad \dot{\theta}(t) = 0.$$

Note that (A6)' allows multiple attractors (at least one of them has to be a point; the others can be sets) for the faster o.d.e. for every $\theta$. The statement of Theorem 2.6 is then modified as follows:

Theorem 2.7 (Slower timescale result when $\lambda(\theta)$ is a local attractor). Under assumptions (A1)-(A5), (A6)' and (A7), on the event "$\{w_n\}$ belongs to a compact subset $B$ (depending on the sample point) of $\bigcap_{\theta \in \mathbb{R}^d} G_\theta$ eventually",
$$(\theta_n, w_n) \to \bigcup_{\theta^* \in A} (\theta^*, \lambda(\theta^*)) \text{ a.s. as } n \to \infty.$$

The requirement on $\{w_n\}$ is much stronger than in the usual local attractor statement of the Kushner-Clark lemma [10, Section II.C], which requires the iterates to enter a compact set in the domain of attraction of the local attractor infinitely often only. The reason for imposing this stronger assumption is that $\mathrm{graph}(\lambda)$ is not a subset of any compact set in $\mathbb{R}^{d+k}$, and hence the usual tracking-lemma kind of arguments do not go through directly. One has to relate the limit set of the coupled iterate $(\theta_n, w_n)$ to $\mathrm{graph}(\lambda)$ (see the proof of Lemma 3.6). We present the proofs of our main results in the next section.
Proof of the Main Results

We first discuss an extension of the single time-scale controlled Markov noise framework of [20] under our assumptions, which we then use to prove our main results. Note that the results of [20] assume that the state space of the controlled Markov process is Polish, which may impose additional conditions that are hard to verify. In this section, other than proving our two time-scale results, we prove many of the results in [20] (which were only stated there), assuming the state space to be compact.

We begin by describing the intuition behind the proof techniques in [20]. The space $C([0, \infty); \mathbb{R}^d)$ of continuous functions from $[0, \infty)$ to $\mathbb{R}^d$ is topologized with the coarsest topology such that the map taking any $f \in C([0, \infty); \mathbb{R}^d)$ to its restriction to $[0, T]$, viewed as an element of the space $C([0, T]; \mathbb{R}^d)$, is continuous for all $T > 0$.
In other words, $f_n \to f$ in this space iff $f_n|_{[0,T]} \to f|_{[0,T]}$ for all $T > 0$. The other notations used below are the same as those in [20, 21]; we recall a few for easy reference. Consider the single time-scale stochastic approximation recursion with controlled Markov noise:
$$x_{n+1} = x_n + a(n)\left[h(x_n, Y_n) + M_{n+1}\right]. \quad (11)$$
Define time instants $t(0) = 0$, $t(n) = \sum_{m=0}^{n-1} a(m)$, $n \geq 1$.
Let $\bar{x}(t), t \geq 0$, be defined by $\bar{x}(t(n)) = x_n$, $n \geq 0$, with linear interpolation on each interval $[t(n), t(n+1))$, i.e.,
$$\bar{x}(t) = x_n + (x_{n+1} - x_n)\frac{t - t(n)}{t(n+1) - t(n)}, \quad t \in [t(n), t(n+1)).$$
Now define $\tilde{h}(x, \nu) = \int h(x, z)\, \nu(dz, U)$ for $\nu \in \mathcal{P}(S \times U)$. Let $\mu(t), t \geq 0$, be defined by $\mu(t) = \delta_{(Y_n, Z_n)}$ for $t \in [t(n), t(n+1))$, $n \geq 0$,
where $\delta_{(y,a)}$ is the Dirac measure corresponding to $(y, a)$. Consider the non-autonomous o.d.e.
$$\dot{x}(t) = \tilde{h}(x(t), \mu(t)). \quad (12)$$
Let $x^s(t), t \geq s$, denote the solution to (12) with $x^s(s) = \bar{x}(s)$, for $s \geq 0$.
Note that $x^s(t), t \in [s, s+T]$, and $x^s(t), t \geq s$, can be viewed as elements of $C([0, T]; \mathbb{R}^d)$ and $C([0, \infty); \mathbb{R}^d)$, respectively. With this abuse of notation, it is easy to see that $\{x^s(\cdot)|_{[s, s+T]}, s \geq 0\}$ is a pointwise bounded and equicontinuous family of functions in $C([0, T]; \mathbb{R}^d)$ for all $T > 0$; hence, by the Arzelà-Ascoli theorem, for any $s(n) \uparrow \infty$, $\{\bar{x}(s(n) + \cdot)|_{[s(n), s(n)+T]}, n \geq 0\}$ has a limit point in $C([0, T]; \mathbb{R}^d)$ for all $T > 0$.
With the above topology for $C([0, \infty); \mathbb{R}^d)$, $\{x^s(\cdot), s \geq 0\}$ is also relatively compact in $C([0, \infty); \mathbb{R}^d)$, and for all $s(n) \uparrow \infty$, $\{\bar{x}(s(n) + \cdot), n \geq 0\}$ has a limit point in $C([0, \infty); \mathbb{R}^d)$. One can write from (11) the following:
$$\bar{x}(u(n) + t) = \bar{x}(u(n)) + \int_0^t h(\bar{x}(u(n) + \tau), \nu(u(n) + \tau))\, d\tau + W^n(t),$$
where $u(n) \uparrow \infty$, $\bar{x}(u(n) + \cdot) \to \tilde{x}(\cdot)$, $\nu(t) = (Y_n, Z_n)$ for $t \in [t(n), t(n+1))$, $n \geq 0$, $W^n(t) = W(t + u(n)) - W(u(n))$, with $W(t) = W_n + (W_{n+1} - W_n)\frac{t - t(n)}{t(n+1) - t(n)}$ for $t \in [t(n), t(n+1))$ and $W_n = \sum_{k=0}^{n-1} a(k) M_{k+1}$, $n \geq 0$.
From here one cannot directly take limits on both sides, as finding limit points of $\nu(s + \cdot)$ as $s \to \infty$ is not meaningful. Now, $h(x, y) = \int h(x, z)\, \delta_{(y,a)}(dz \times U)$. Hence, by defining $\tilde{h}(x, \rho) = \int h(x, z)\, \rho(dz)$ and $\mu(t) = \delta_{\nu(t)}$, one can write the above as
$$\bar{x}(u(n) + t) = \bar{x}(u(n)) + \int_0^t \tilde{h}(\bar{x}(u(n) + \tau), \mu(u(n) + \tau))\, d\tau + W^n(t). \quad (13)$$
The advantage is that the space $\mathcal{U}$ of measurable functions from $[0, \infty)$ to $\mathcal{P}(S \times U)$ is compact metrizable, so subsequential limits exist. Note that $\mu(\cdot)$ is not a deterministic member of $\mathcal{U}$; rather, we need to fix a sample point, i.e., $\mu(\cdot, \omega) \in \mathcal{U}$. For ease of understanding, we abuse the terminology and talk about the limit points $\tilde{\mu}(\cdot)$ of $\mu(s + \cdot)$.

From (13) one can infer that the limit $\tilde{x}(\cdot)$ of $\bar{x}(u(n) + \cdot)$ satisfies the o.d.e. $\dot{x}(t) = \tilde{h}(x(t), \mu(t))$ with $\mu(\cdot)$ replaced by $\tilde{\mu}(\cdot)$. Here each $\tilde{\mu}(t), t \in \mathbb{R}$, in $\tilde{\mu}(\cdot)$ is generated through a different limiting process, each associated with the compact metrizable space $\mathcal{U}_t$ of measurable functions from $[0, t]$ to $\mathcal{P}(S \times U)$. This is problematic if we want to further explore the process $\tilde{\mu}(\cdot)$ and convert the non-autonomous o.d.e. into an autonomous one.

Hence the main result is proved using an auxiliary lemma [20, Lemma 2.3] in addition to the tracking lemma (Lemma 2.2 of [20]). Let $u(n(k)) \uparrow \infty$ be such that $\bar{x}(u(n(k)) + \cdot) \to \tilde{x}(\cdot)$ and $\mu(u(n(k)) + \cdot) \to \tilde{\mu}(\cdot)$; then using Lemma 2.2 of [20] one can show that $x^{u(n(k))}(\cdot) \to \tilde{x}(\cdot)$. The auxiliary lemma then shows that the o.d.e. trajectory $x^{u(n(k))}(\cdot)$ associated with $\mu(u(n(k)) + \cdot)$ tracks (in the limit) the o.d.e. trajectory associated with $\tilde{\mu}(\cdot)$. Hence Lemma 2.3 of [20] links the two limiting processes $\tilde{x}(\cdot)$ and $\tilde{\mu}(\cdot)$ in some sense. Note that Lemma 2.3 of [20] involves only the o.d.e. trajectories, not the interpolated trajectory of the algorithm.

Consider the iteration
$$\theta_{n+1} = \theta_n + a(n)\left[h(\theta_n, Y_n) + \epsilon_n + M_{n+1}\right], \quad (14)$$
where $\epsilon_n \to 0$ as $n \to \infty$, $\{Y_n\}$ is the controlled Markov process driven by $\{\theta_n\}$, and $M_{n+1}, n \geq 0$, is a martingale difference sequence. Let $\bar{\theta}(t), t \geq 0$, be defined by $\bar{\theta}(t(n)) = \theta_n$, $n \geq 0$,
with linear interpolation on each interval $[t(n), t(n+1))$. Also, let $\theta^s(t), t \geq s$, denote the solution to (12) with $\theta^s(s) = \bar{\theta}(s)$, for $s \geq 0$.

Lemma 3.1.
For any $T > 0$,
$$\sup_{t \in [s, s+T]} \|\bar{\theta}(t) - \theta^s(t)\| \to 0 \text{ a.s. as } s \to \infty.$$

Proof 5.
The proof follows from Lemma 2.2 of [20] and Remark 3 thereof (p. 144).
Now, $\mu$ can be viewed as a random variable taking values in $\mathcal{U}$, the space of measurable functions from $[0, \infty)$ to $\mathcal{P}(S \times U)$. This space is topologized with the coarsest topology such that the map
$$\nu(\cdot) \in \mathcal{U} \to \int_0^T g(t) \int f \, d\nu(t)\, dt \in \mathbb{R}$$
is continuous for all $f \in C(S)$, $T > 0$, $g \in L_2[0, T]$. Note that $\mathcal{U}$ is compact metrizable.

Lemma 3.2.
Almost surely, every limit point of $(\mu(s + \cdot), \bar{\theta}(s + \cdot))$ as $s \to \infty$ is of the form $(\tilde{\mu}(\cdot), \tilde{\theta}(\cdot))$, where $\tilde{\mu}(\cdot)$ satisfies $\tilde{\mu}(t) \in D(\tilde{\theta}(t))$ for a.e. $t$.

Proof 6.
Suppose that $u(n) \uparrow \infty$, $\mu(u(n) + \cdot) \to \tilde{\mu}(\cdot)$ and $\bar{\theta}(u(n) + \cdot) \to \tilde{\theta}(\cdot)$. Let $\{f_i\}$ be countable dense in the unit ball of $C(S)$, hence a separating class, i.e., for all $i$, $\int f_i \, d\mu = \int f_i \, d\nu$ implies $\mu = \nu$. For each $i$,
$$\zeta^i_n = \sum_{m=1}^{n-1} a(m)\left(f_i(Y_{m+1}) - \int f_i(y)\, p(dy \mid Y_m, Z_m, \theta_m)\right), \quad n \geq 1,$$
is a zero-mean martingale w.r.t. $\mathcal{F}_n = \sigma(\theta_m, Y_m, Z_m, m \leq n)$. Moreover, it is a square-integrable martingale, since the $f_i$'s are bounded and each $\zeta^i_n$ is a finite sum. Its quadratic variation process
$$A_n = \sum_{m=0}^{n-1} a(m)^2\, E\left[\left(f_i(Y_{m+1}) - \int f_i(y)\, p(dy \mid Y_m, Z_m, \theta_m)\right)^2 \Big| \mathcal{F}_m\right] + E[(\zeta^i_0)^2]$$
is almost surely convergent. By the martingale convergence theorem, $\zeta^i_n, n \geq 0$, converges a.s. for all $i$. As before, let $\tau(n, t) = \min\{m \geq n : t(m) \geq t(n) + t\}$ for $t \geq 0$, $n \geq 0$. Then as $n \to \infty$,
$$\sum_{m=n}^{\tau(n,t)} a(m)\left(f_i(Y_{m+1}) - \int f_i(y)\, p(dy \mid Y_m, Z_m, \theta_m)\right) \to 0 \text{ a.s.}$$
for $t > 0$. By our choice of $\{f_i\}$ and the fact that $\{a(n)\}$ is an eventually non-increasing sequence (the latter property is used only here and in Lemma 3.9), we have
$$\sum_{m=n}^{\tau(n,t)} (a(m) - a(m+1))\, f_i(Y_{m+1}) \to 0 \text{ a.s.}$$
From the foregoing,
$$\sum_{m=n}^{\tau(n,t)} \left(a(m+1)\, f_i(Y_{m+1}) - a(m) \int f_i(y)\, p(dy \mid Y_m, Z_m, \theta_m)\right) \to 0 \text{ a.s.}$$
for all $t > 0$, which implies
$$\sum_{m=n}^{\tau(n,t)} a(m)\left(f_i(Y_m) - \int f_i(y)\, p(dy \mid Y_m, Z_m, \theta_m)\right) \to 0 \text{ a.s.}$$
for all $t > 0$, due to the fact that $a(n) \to 0$ and the $f_i(\cdot)$ are bounded. This implies
$$\int_{t(n)}^{t(n)+t} \left(\int \left(f_i(z) - \int f_i(y)\, p(dy \mid z, a, \hat{\theta}(s))\right) \mu(s, dz\, da)\right) ds \to 0 \text{ a.s.},$$
which in turn implies
$$\int_{u(n)}^{u(n)+t} \left(\int \left(f_i(z) - \int f_i(y)\, p(dy \mid z, a, \hat{\theta}(s))\right) \mu(s, dz\, da)\right) ds \to 0 \text{ a.s.}$$
(this is true because $a(n) \to 0$ and $f_i(\cdot)$ is bounded), where $\hat{\theta}(s) = \theta_n$ for $s \in [t(n), t(n+1))$, $n \geq 0$. Now, one can claim from the above that
$$\int_{u(n)}^{u(n)+t} \left(\int \left(f_i(z) - \int f_i(y)\, p(dy \mid z, a, \bar{\theta}(s))\right) \mu(s, dz\, da)\right) ds \to 0 \text{ a.s.}$$
This is due to the fact that the map $S \times U \times \mathbb{R}^d \ni (z, a, \theta) \to \int f(y)\, p(dy \mid z, a, \theta)$ is continuous, and hence uniformly continuous on the compact set $A = S \times U \times M$, where $M$ is the compact set such that $\theta_n \in M$ for all $n$. Here we also use the fact that $\|\bar{\theta}(s) - \theta_m\| = \|h(\theta_m, Y_m) + \epsilon_m + M_{m+1}\|(s - t(m)) \to 0$ for $s \in [t(m), t(m+1))$, as the first two terms inside the norm on the R.H.S. are bounded. The above convergence is equivalent to
$$\int_0^t \left(\int \left(f_i(z) - \int f_i(y)\, p(dy \mid z, a, \bar{\theta}(s + u(n)))\right) \mu(s + u(n), dz\, da)\right) ds \to 0 \text{ a.s.}$$
Fix a sample point in the probability one set on which the convergence above holds for all $i$. Then the convergence above leads to
$$\int_0^t \int \left(f_i(z) - \int f_i(y)\, p(dy \mid z, a, \tilde{\theta}(s))\right) \tilde{\mu}(s, dz\, da)\, ds = 0 \quad \forall i. \quad (15)$$
Here we use one part of the proof of Lemma 2.3 of [20]: if $\mu_n(\cdot) \to \mu_\infty(\cdot)$ in $\mathcal{U}$, then for any $t > 0$,
$$\int_0^t \int \tilde{f}(s, z, a)\, \mu_n(s, dz\, da)\, ds - \int_0^t \int \tilde{f}(s, z, a)\, \mu_\infty(s, dz\, da)\, ds \to 0$$
for all $\tilde{f} \in C([0, t] \times S \times U)$, together with the fact that $\tilde{f}_n(s, z, a) = \int f_i(y)\, p(dy \mid z, a, \bar{\theta}(s + u(n)))$ converges uniformly to $\tilde{f}(s, z, a) = \int f_i(y)\, p(dy \mid z, a, \tilde{\theta}(s))$. To prove the latter, define $g : C([0, t]) \times [0, t] \times S \times U \to \mathbb{R}$ by $g(\theta(\cdot), s, z, a) = \int f_i(y)\, p(dy \mid z, a, \theta(s))$. To see that $g$ is continuous we need to check that if $\theta_n(\cdot) \to \theta(\cdot)$ uniformly and $s(n) \to s$, then $\theta_n(s(n)) \to \theta(s)$. This is because
$$\|\theta_n(s(n)) - \theta(s)\| \leq \|\theta_n(s(n)) - \theta(s(n))\| + \|\theta(s(n)) - \theta(s)\|,$$
where the first and second terms go to zero due to the uniform convergence of $\theta_n(\cdot), n \geq 0$, and the continuity of $\theta(\cdot)$, respectively. Let $A = \{\bar{\theta}(u(n) + \cdot)|_{[u(n), u(n)+t]}, n \geq 0\} \cup \tilde{\theta}(\cdot)|_{[0,t]}$. $A$ is compact, as it is the union of a sequence of functions and their limit. So $g|_{A \times [0,t] \times S \times U}$ is uniformly continuous. Then, using the same arguments as in Lemma 2.4, we can show equicontinuity of $\{\tilde{f}_n(\cdot, \cdot)\}$, which results in uniform convergence and thereby (15). An application of Lebesgue's theorem in conjunction with (15) shows that
$$\int \left(f_i(z) - \int f_i(y)\, p(dy \mid z, a, \tilde{\theta}(t))\right) \tilde{\mu}(t, dz\, da) = 0 \quad \forall i$$
for a.e. $t$. By our choice of $\{f_i\}$, this leads to
$$\tilde{\mu}(t, dy \times U) = \int p(dy \mid z, a, \tilde{\theta}(t))\, \tilde{\mu}(t, dz\, da) \text{ a.e. } t.$$
The conclusion follows by disintegrating such a measure as the product of the marginal on $S$ and the regular conditional law on $U$ ([20, p. 140]).

Remark 6.
Note that the above invariant distribution does not come "naturally"; rather, it arises from the assumption made to match the natural timescale intuition for the controlled Markov noise component, i.e., the slower iterate should see the average effect of the Markov component.
The proof of the following lemma is, in this case, unchanged from its original version, so we merely state it for completeness and refer the reader to Lemma 2.3 of [20] for its proof.
Lemma 3.3.
Let $\mu_n(\cdot) \to \mu_\infty(\cdot)$ in $\mathcal{U}$. Let $\theta_n(\cdot), n = 1, 2, \ldots, \infty$, denote solutions to (12) corresponding to the case where $\mu(\cdot)$ is replaced by $\mu_n(\cdot)$, for $n = 1, 2, \ldots, \infty$. Suppose $\theta_n(0) \to \theta_\infty(0)$. Then
$$\lim_{n \to \infty} \sup_{t \in [0, T]} \|\theta_n(t) - \theta_\infty(t)\| = 0$$
for every $T > 0$.

Lemma 3.4.
Almost surely, $\{\theta_n\}$ converges to an internally chain transitive set of the differential inclusion
$$\dot{\theta}(t) \in \hat{h}(\theta(t)), \quad (16)$$
where $\hat{h}(\theta) = \{\tilde{h}(\theta, \nu) : \nu \in D(\theta)\}$.

Proof 7.
Lemma 3.3 shows that every limit point $(\tilde{\mu}(\cdot), \tilde{\theta}(\cdot))$ of $(\mu(s + \cdot), \bar{\theta}(s + \cdot))$ as $s \to \infty$ is such that $\tilde{\theta}(\cdot)$ satisfies (12) with $\mu(\cdot) = \tilde{\mu}(\cdot)$. Hence $\tilde{\theta}(\cdot)$ is absolutely continuous. Moreover, using Lemma 3.2, one can see that it satisfies (16) for a.e. $t$, hence is a solution to the differential inclusion (16). The claim follows.

Lemma 3.5 (Faster timescale result). $(\theta_n, w_n) \to \{(\theta, \lambda(\theta)) : \theta \in \mathbb{R}^d\}$ a.s.

Proof 8.
We first rewrite (3) as
$$\theta_{n+1} = \theta_n + b(n)\left[\epsilon_n + M^{(3)}_{n+1}\right],$$
where $\epsilon_n = \frac{a(n)}{b(n)} h(\theta_n, w_n, Z^{(1)}_n) \to 0$ as $n \to \infty$ a.s. and $M^{(3)}_{n+1} = \frac{a(n)}{b(n)} M^{(1)}_{n+1}$ for $n \geq 0$. Let $\alpha_n = (\theta_n, w_n)$, $\alpha = (\theta, w) \in \mathbb{R}^{d+k}$, $G(\alpha, z) = (0, g(\alpha, z))$, $\epsilon'_n = (\epsilon_n, 0)$ and $M^{(4)}_{n+1} = (M^{(3)}_{n+1}, M^{(2)}_{n+1})$. Then one can write (3) and (4) in the framework of (14) as
$$\alpha_{n+1} = \alpha_n + b(n)\left[G(\alpha_n, Z^{(2)}_n) + \epsilon'_n + M^{(4)}_{n+1}\right], \quad (17)$$
with $\epsilon'_n \to 0$ as $n \to \infty$. By Lemma 3.4, $\alpha_n, n \geq 0$, converges almost surely to an internally chain transitive set of the differential inclusion $\dot{\alpha}(t) \in \hat{G}(\alpha(t))$, where $\hat{G}(\alpha) = \{\tilde{G}(\alpha, \nu) : \nu \in D^{(2)}(\theta, w)\}$ with $\tilde{G}(\alpha, \nu) = (0, \tilde{g}(\theta, w, \nu))$. In other words, $(\theta_n, w_n), n \geq 0$, converges to an internally chain transitive set of the differential inclusion
$$\dot{w}(t) \in \hat{g}(\theta(t), w(t)), \quad \dot{\theta}(t) = 0.$$
The rest follows from the second part of (A6).

Remark 7. Under the conditions mentioned in Remark 5, the above faster timescale result is modified as follows:
Lemma 3.6 (Faster timescale result when $\lambda(\theta)$ is a local attractor). Under assumptions (A1)-(A5), (A6)' and (A7), on the event "$\{w_n\}$ belongs to a compact subset $B$ (depending on the sample point) of $\bigcap_{\theta \in \mathbb{R}^d} G_\theta$ eventually", $(\theta_n, w_n) \to \{(\theta, \lambda(\theta)) : \theta \in \mathbb{R}^d\}$ a.s.

Proof 9.
Fix a sample point $\omega$. The proof follows from these observations:

1. continuity of the flow of the coupled o.d.e. around the initial point;

2. $\sup_n \|\theta_n\| = M < \infty$;

3. the fact that the set $\mathrm{graph}(\lambda)$ is Lyapunov stable ($V'(\cdot)$, as mentioned in (A6)', is a Lyapunov function for this set); and

4. the fact that $\bigcap_{t \geq 0} \overline{\{\bar{\alpha}(s) : s \geq t\}}$ is an internally chain transitive set of the coupled o.d.e.
$$\dot{w}(t) = \hat{g}(\theta(t), w(t)), \quad \dot{\theta}(t) = 0, \quad (18)$$
where $\bar{\alpha}(\cdot)$ is the interpolated trajectory of the coupled iterate $\{\alpha_n\}$.

As $\{\theta : \|\theta\| \leq M\} \times B \subset \bigcup_{\theta \in \mathbb{R}^d} \{\{\theta\} \times G_\theta\}$, the first three observations show that for all $\epsilon > 0$ there exists a $T_\epsilon > 0$ such that any o.d.e. trajectory of (18) starting in the compact set $\{\theta : \|\theta\| \leq M\} \times B$ reaches the $\epsilon$-neighbourhood of $\mathrm{graph}(\lambda)$ after time $T_\epsilon$. Further,
$$\bigcap_{t \geq 0} \overline{\{\bar{\alpha}(s) : s \geq t\}} \subset \{\theta : \|\theta\| \leq M\} \times B.$$
Then one can use the last observation, choosing $T > T_\epsilon$, to show the required convergence to the set $\mathrm{graph}(\lambda)$.

Remark 8.
One interesting question in this context is whether one can extend the single timescale local attractor convergence statements to the two timescale setting under verifiable conditions. More specifically, if there is a global attractor $A$ for $\dot{\theta}(t) \in \hat{h}(\theta(t))$, can one provide verifiable conditions under which
$$P\left[(\theta_n, w_n) \to \bigcup_{\theta \in A} (\theta, \lambda(\theta))\right] > 0?$$
Here $\lambda(\theta)$ is a local attractor as mentioned in (A6)'. There are two ways in which this could possibly be attempted:

1. Use Theorem 2.7, where we show that on the event that $\{w_n\}$ belongs to a compact subset $B$ (depending on the sample point) of $\bigcap_{\theta \in \mathbb{R}^d} G_\theta$ "eventually", $(\theta_n, w_n) \to \bigcup_{\theta^* \in A} (\theta^*, \lambda(\theta^*))$ a.s. as $n \to \infty$, which is an extension of the Kushner-Clark lemma to the two timescale case. The task would then be to impose verifiable assumptions under which $P(\{w_n\}$ belongs to a compact subset $B$ (depending on the sample point) of $\bigcap_{\theta \in \mathbb{R}^d} G_\theta$ "eventually"$) > 0$. In a stochastic approximation scenario it is not immediately clear how one could impose verifiable assumptions making such a probabilistic statement true.

2. The second approach would be to extend the analysis of [8, 9] to the two timescale case. In our opinion this is very hard, as that analysis is based on the notion of attractor introduced by Benaim et al., whereas the coupled o.d.e. (18), which tracks the coupled iterate $(\theta_n, w_n)$ (so that the interpolated trajectory of the coupled iterate is an asymptotic pseudo-trajectory [8] for (18)), has no attractor. The reason is that one cannot obtain a fundamental neighbourhood for sets like $\bigcup_{\theta \in A} (\theta, \lambda(\theta))$, as the $\theta$ component remains constant for any trajectory of the above coupled o.d.e.

Thus it is not immediately clear how this question can be addressed, and this will be an interesting future direction.

By Lemma 3.5, $\|w_n - \lambda(\theta_n)\| \to 0$ a.s., i.e., $\{w_n\}$ asymptotically tracks $\{\lambda(\theta_n)\}$ a.s. Now, consider the non-autonomous o.d.e.
$$\dot{\theta}(t) = \tilde{h}(\theta(t), \lambda(\theta(t)), \mu(t)), \quad (19)$$
where $\mu(t) = \delta_{(Z^{(1)}_n, A^{(1)}_n)}$ for $t \in [t(n), t(n+1))$, $n \geq 0$, and $\tilde{h}(\theta, w, \nu) = \int h(\theta, w, z)\, \nu(dz, U^{(1)})$. Let $\theta^s(t), t \geq s$, denote the solution to (19) with $\theta^s(s) = \bar{\theta}(s)$, for $s \geq 0$. Then
Lemma 3.7.
For any $T > 0$,
$$\sup_{t \in [s, s+T]} \|\bar{\theta}(t) - \theta^s(t)\| \to 0 \text{ a.s.}$$

Proof 10.
The slower recursion corresponds to
$$\theta_{n+1} = \theta_n + a(n)\left[h(\theta_n, w_n, Z^{(1)}_n) + M^{(1)}_{n+1}\right].$$
Let $t(n+m) \in [t(n), t(n) + T]$ and let $[t] = \max\{t(k) : t(k) \leq t\}$. Then by construction,
$$\bar{\theta}(t(n+m)) = \bar{\theta}(t(n)) + \sum_{k=0}^{m-1} a(n+k)\, h(\bar{\theta}(t(n+k)), w_{n+k}, Z^{(1)}_{n+k}) + \delta_{n, n+m},$$
which can be rewritten as
$$\bar{\theta}(t(n)) + \sum_{k=0}^{m-1} a(n+k)\, h(\bar{\theta}(t(n+k)), \lambda(\bar{\theta}(t(n+k))), Z^{(1)}_{n+k}) + \sum_{k=0}^{m-1} a(n+k)\left(h(\bar{\theta}(t(n+k)), w_{n+k}, Z^{(1)}_{n+k}) - h(\bar{\theta}(t(n+k)), \lambda(\theta_{n+k}), Z^{(1)}_{n+k})\right) + \delta_{n, n+m},$$
where $\delta_{n, n+m} = \zeta_{n+m} - \zeta_n$ with $\zeta_n = \sum_{m=0}^{n-1} a(m) M^{(1)}_{m+1}$, $n \geq 0$. Moreover,
$$\theta^{t(n)}(t(n+m)) = \bar{\theta}(t(n)) + \int_{t(n)}^{t(n+m)} \tilde{h}(\theta^{t(n)}(t), \lambda(\theta^{t(n)}(t)), \mu(t))\, dt = \bar{\theta}(t(n)) + \sum_{k=0}^{m-1} a(n+k)\, h(\theta^{t(n)}(t(n+k)), \lambda(\theta^{t(n)}(t(n+k))), Z^{(1)}_{n+k}) + \int_{t(n)}^{t(n+m)} \left(h(\theta^{t(n)}(t), \lambda(\theta^{t(n)}(t)), \mu(t)) - h(\theta^{t(n)}([t]), \lambda(\theta^{t(n)}([t])), \mu([t]))\right) dt.$$
Let $t(n) \leq t \leq t(n+m)$. Now, if $0 \leq k \leq m-1$ and $t \in (t(n+k), t(n+k+1)]$,
$$\|\theta^{t(n)}(t)\| \leq \|\bar{\theta}(t(n))\| + \left\|\int_{t(n)}^t \tilde{h}(\theta^{t(n)}(\tau), \lambda(\theta^{t(n)}(\tau)), \mu(\tau))\, d\tau\right\| \leq \|\theta_n\| + \sum_{l=0}^{k-1} \int_{t(n+l)}^{t(n+l+1)} \left(\|h(0, 0, Z^{(1)}_{n+l})\| + L^{(1)}(\|\lambda(0)\| + (K+1)\|\theta^{t(n)}(\tau)\|)\right) d\tau + \int_{t(n+k)}^t \left(\|h(0, 0, Z^{(1)}_{n+k})\| + L^{(1)}(\|\lambda(0)\| + (K+1)\|\theta^{t(n)}(\tau)\|)\right) d\tau \leq C_0 + (M_0 + L^{(1)}\|\lambda(0)\|) T + L^{(1)}(K+1) \int_{t(n)}^t \|\theta^{t(n)}(\tau)\|\, d\tau,$$
where $C_0 = \sup_n \|\theta_n\| < \infty$ and $M_0 = \sup_{z \in S^{(1)}} \|h(0, 0, z)\|$. By Gronwall's inequality, it follows that
$$\|\theta^{t(n)}(t)\| \leq \left(C_0 + (M_0 + L^{(1)}\|\lambda(0)\|) T\right) e^{L^{(1)}(K+1) T}.$$
Also,
$$\|\theta^{t(n)}(t) - \theta^{t(n)}(t(n+k))\| \leq \int_{t(n+k)}^t \|h(\theta^{t(n)}(s), \lambda(\theta^{t(n)}(s)), Z^{(1)}_{n+k})\|\, ds \leq \left(\|h(0, 0, Z^{(1)}_{n+k})\| + L^{(1)}\|\lambda(0)\|\right)(t - t(n+k)) + L^{(1)}(K+1) \int_{t(n+k)}^t \|\theta^{t(n)}(s)\|\, ds \leq C_T\, a(n+k),$$
where $C_T = (M_0 + L^{(1)}\|\lambda(0)\|) + L^{(1)}(K+1)\left(C_0 + (M_0 + L^{(1)}\|\lambda(0)\|) T\right) e^{L^{(1)}(K+1) T}$. Thus,
$$\left\|\int_{t(n)}^{t(n+m)} \left(h(\theta^{t(n)}(t), \lambda(\theta^{t(n)}(t)), \mu(t)) - h(\theta^{t(n)}([t]), \lambda(\theta^{t(n)}([t])), \mu([t]))\right) dt\right\| \leq \sum_{k=0}^{m-1} \int_{t(n+k)}^{t(n+k+1)} \|h(\theta^{t(n)}(t), \lambda(\theta^{t(n)}(t)), Z^{(1)}_{n+k}) - h(\theta^{t(n)}([t]), \lambda(\theta^{t(n)}([t])), Z^{(1)}_{n+k})\|\, dt \leq L \sum_{k=0}^{m-1} \int_{t(n+k)}^{t(n+k+1)} \|\theta^{t(n)}(t) - \theta^{t(n)}(t(n+k))\|\, dt \leq C_T L \sum_{k=0}^{m-1} a(n+k)^2 \leq C_T L \sum_{k=0}^{\infty} a(n+k)^2 \to 0 \text{ as } n \to \infty,$$
where $L = L^{(1)}(K+1)$. Hence
$$\|\bar{\theta}(t(n+m)) - \theta^{t(n)}(t(n+m))\| \leq L \sum_{k=0}^{m-1} a(n+k)\, \|\bar{\theta}(t(n+k)) - \theta^{t(n)}(t(n+k))\| + C_T L \sum_{k=0}^{\infty} a(n+k)^2 + \sup_{k \geq 0} \|\delta_{n, n+k}\| + L^{(1)} \sum_{k=0}^{m-1} a(n+k)\, \|w_{n+k} - \lambda(\theta_{n+k})\| \leq L \sum_{k=0}^{m-1} a(n+k)\, \|\bar{\theta}(t(n+k)) - \theta^{t(n)}(t(n+k))\| + C_T L \sum_{k=0}^{\infty} a(n+k)^2 + \sup_{k \geq 0} \|\delta_{n, n+k}\| + L^{(1)} T \sup_{k \geq 0} \|w_{n+k} - \lambda(\theta_{n+k})\| \text{ a.s.}$$
Define
$$K_{T,n} = C_T L \sum_{k=0}^{\infty} a(n+k)^2 + \sup_{k \geq 0} \|\delta_{n, n+k}\| + L^{(1)} T \sup_{k \geq 0} \|w_{n+k} - \lambda(\theta_{n+k})\|.$$
Note that $K_{T,n} \to 0$ a.s. The remainder of the proof follows in exactly the same manner as the tracking lemma; see Lemma 1, Chapter 2 of [21].

Lemma 3.8.
Suppose $\mu_n(\cdot) \to \mu_\infty(\cdot)$ in $\mathcal{U}^{(1)}$. Let $\theta_n(\cdot), n = 1, 2, \ldots, \infty$, denote solutions to (19) corresponding to the case where $\mu(\cdot)$ is replaced by $\mu_n(\cdot)$, for $n = 1, 2, \ldots, \infty$. Suppose $\theta_n(0) \to \theta_\infty(0)$. Then
$$\lim_{n \to \infty} \sup_{t \in [0, T]} \|\theta_n(t) - \theta_\infty(t)\| = 0$$
for every $T > 0$.

Proof 11.
It is shown in Lemma 2.3 of [20] that
$$\int_0^t \int \tilde{f}(s, z)\, \mu_n(s, dz)\, ds - \int_0^t \int \tilde{f}(s, z)\, \mu_\infty(s, dz)\, ds \to 0$$
for any $\tilde{f} \in C([0, T] \times S)$. Using this, one can see that
$$\left\|\int_0^t \left(\tilde{h}(\theta_\infty(s), \lambda(\theta_\infty(s)), \mu_n(s)) - \tilde{h}(\theta_\infty(s), \lambda(\theta_\infty(s)), \mu_\infty(s))\right) ds\right\| \to 0.$$
This follows because $\lambda$ is continuous and $h$ is jointly continuous in its arguments. As a function of $t$, the integral on the left is equicontinuous and pointwise bounded. By the Arzelà-Ascoli theorem, this convergence must in fact be uniform for $t$ in a compact set. Now, for $t > 0$,
$$\|\theta_n(t) - \theta_\infty(t)\| \leq \|\theta_n(0) - \theta_\infty(0)\| + \int_0^t \|\tilde{h}(\theta_n(s), \lambda(\theta_n(s)), \mu_n(s)) - \tilde{h}(\theta_\infty(s), \lambda(\theta_\infty(s)), \mu_n(s))\|\, ds + \int_0^t \|\tilde{h}(\theta_\infty(s), \lambda(\theta_\infty(s)), \mu_n(s)) - \tilde{h}(\theta_\infty(s), \lambda(\theta_\infty(s)), \mu_\infty(s))\|\, ds.$$
Now, using the fact that $\lambda$ is Lipschitz with constant $K$, the remaining part of the proof follows in the same manner as Lemma 2.3 of [20].

Note that Lemma 3.8 shows that every limit point $(\tilde{\mu}(\cdot), \tilde{\theta}(\cdot))$ of $(\mu(s + \cdot), \bar{\theta}(s + \cdot))$ as $s \to \infty$ is such that $\tilde{\theta}(\cdot)$ satisfies (19) with $\mu(\cdot) = \tilde{\mu}(\cdot)$.

Lemma 3.9.
Almost surely, every limit point of $(\mu(s + \cdot), \bar{\theta}(s + \cdot))$ as $s \to \infty$ is of the form $(\tilde{\mu}(\cdot), \tilde{\theta}(\cdot))$, where $\tilde{\mu}(\cdot)$ satisfies $\tilde{\mu}(t) \in D^{(1)}(\tilde{\theta}(t), \lambda(\tilde{\theta}(t)))$ for a.e. $t$.

Proof 12.
Suppose that $u(n) \uparrow \infty$, $\mu(u(n) + \cdot) \to \tilde{\mu}(\cdot)$ and $\bar{\theta}(u(n) + \cdot) \to \tilde{\theta}(\cdot)$. Let $\{f_i\}$ be countable dense in the unit ball of $C(S^{(1)})$, hence a separating class, i.e., for all $i$, $\int f_i \, d\mu = \int f_i \, d\nu$ implies $\mu = \nu$. For each $i$,
$$\zeta^i_n = \sum_{m=1}^{n-1} a(m)\left(f_i(Z^{(1)}_{m+1}) - \int f_i(y)\, p(dy \mid Z^{(1)}_m, A^{(1)}_m, \theta_m, w_m)\right)$$
is a zero-mean martingale w.r.t. $\mathcal{F}_n = \sigma(\theta_m, w_m, Z^{(1)}_m, A^{(1)}_m, m \leq n)$, $n \geq 1$. Moreover, it is a square-integrable martingale, since the $f_i$'s are bounded and each $\zeta^i_n$ is a finite sum. Its quadratic variation process
$$A_n = \sum_{m=0}^{n-1} a(m)^2\, E\left[\left(f_i(Z^{(1)}_{m+1}) - \int f_i(y)\, p(dy \mid Z^{(1)}_m, A^{(1)}_m, \theta_m, w_m)\right)^2 \Big| \mathcal{F}_m\right] + E[(\zeta^i_0)^2]$$
is almost surely convergent. By the martingale convergence theorem, $\{\zeta^i_n\}$ converges a.s. Let $\tau(n, t) = \min\{m \geq n : t(m) \geq t(n) + t\}$ for $t \geq 0$, $n \geq 0$. Then as $n \to \infty$,
$$\sum_{m=n}^{\tau(n,t)} a(m)\left(f_i(Z^{(1)}_{m+1}) - \int f_i(y)\, p(dy \mid Z^{(1)}_m, A^{(1)}_m, \theta_m, w_m)\right) \to 0 \text{ a.s.}$$
for $t > 0$. By our choice of $\{f_i\}$ and the fact that $\{a(n)\}$ is eventually non-increasing,
$$\sum_{m=n}^{\tau(n,t)} (a(m) - a(m+1))\, f_i(Z^{(1)}_{m+1}) \to 0 \text{ a.s.}$$
Thus,
$$\sum_{m=n}^{\tau(n,t)} a(m)\left(f_i(Z^{(1)}_m) - \int f_i(y)\, p(dy \mid Z^{(1)}_m, A^{(1)}_m, \theta_m, w_m)\right) \to 0 \text{ a.s.},$$
which implies
$$\int_{t(n)}^{t(n)+t} \left(\int \left(f_i(z) - \int f_i(y)\, p(dy \mid z, a, \hat{\theta}(s), \hat{w}(s))\right) \mu(s, dz\, da)\right) ds \to 0 \text{ a.s.}$$
Recall that $u(n)$ can be a general sequence other than $t(n)$. Therefore
$$\int_{u(n)}^{u(n)+t} \left(\int \left(f_i(z) - \int f_i(y)\, p(dy \mid z, a, \hat{\theta}(s), \hat{w}(s))\right) \mu(s, dz\, da)\right) ds \to 0 \text{ a.s.}$$
(this follows from the fact that $a(n) \to 0$ and the $f_i$'s are bounded), where $\hat{\theta}(s) = \theta_n$ and $\hat{w}(s) = w_n$ for $s \in [t(n), t(n+1))$, $n \geq 0$. Now, one can claim from the above that
$$\int_{u(n)}^{u(n)+t} \left(\int \left(f_i(z) - \int f_i(y)\, p(dy \mid z, a, \bar{\theta}(s), \lambda(\bar{\theta}(s)))\right) \mu(s, dz\, da)\right) ds \to 0 \text{ a.s.}$$
This is due to the fact that the map $S^{(1)} \times U^{(1)} \times \mathbb{R}^{d+k} \ni (z, a, \theta, w) \to \int f_i(y)\, p(dy \mid z, a, \theta, w)$ is continuous, and hence uniformly continuous on the compact set $A = S^{(1)} \times U^{(1)} \times M \times M_1$, where $M$ is the compact set such that $\theta_n \in M$ for all $n$ and $M_1 = \{w : \|w\| \leq \max(\sup_n \|w_n\|, K')\}$, with $K'$ the bound for the compact set $\lambda(M)$. Here we also use the fact that $\|w_m - \lambda(\bar{\theta}(s))\| \to 0$ for $s \in [t(m), t(m+1))$, as $\lambda$ is Lipschitz and $\|w_m - \lambda(\theta_m)\| \to 0$. The above convergence is equivalent to
$$\int_0^t \left(\int \left(f_i(z) - \int f_i(y)\, p(dy \mid z, a, \bar{\theta}(s + u(n)), \lambda(\bar{\theta}(s + u(n))))\right) \mu(s + u(n), dz\, da)\right) ds \to 0 \text{ a.s.}$$
Fix a sample point in the probability one set on which the convergence above holds for all $i$. Then the convergence above leads to
$$\int_0^t \int \left(f_i(z) - \int f_i(y)\, p(dy \mid z, a, \tilde{\theta}(s), \lambda(\tilde{\theta}(s)))\right) \tilde{\mu}(s, dz\, da)\, ds = 0 \quad \forall i. \quad (20)$$
To show this, we use one part of the proof of Lemma 2.3 of [20]: if $\mu_n(\cdot) \to \mu_\infty(\cdot)$ in $\mathcal{U}$, then for any $t$,
$$\int_0^t \int \tilde{f}(s, z, a)\, \mu_n(s, dz\, da)\, ds - \int_0^t \int \tilde{f}(s, z, a)\, \mu_\infty(s, dz\, da)\, ds \to 0$$
for all $\tilde{f} \in C([0, t] \times S^{(1)} \times U^{(1)})$. In addition, we use the fact that $\tilde{f}_n(s, z, a) = \int f_i(y)\, p(dy \mid z, a, \bar{\theta}(s + u(n)), \lambda(\bar{\theta}(s + u(n))))$ converges uniformly to $\tilde{f}(s, z, a) = \int f_i(y)\, p(dy \mid z, a, \tilde{\theta}(s), \lambda(\tilde{\theta}(s)))$. To prove this, define $g : C([0, t]) \times [0, t] \times S^{(1)} \times U^{(1)} \to \mathbb{R}$ by $g(\theta(\cdot), s, z, a) = \int f_i(y)\, p(dy \mid z, a, \theta(s), \lambda(\theta(s)))$. Let $A' = \{\bar{\theta}(u(n) + \cdot)|_{[u(n), u(n)+t]}, n \geq 0\} \cup \tilde{\theta}(\cdot)|_{[0,t]}$.
Using the same argument as in Lemma 3.2 and (A6), i.e., that $\lambda$ is Lipschitz (the latter helps to claim that if $\theta_n(\cdot) \to \theta(\cdot)$ uniformly then $\lambda(\theta_n(\cdot)) \to \lambda(\theta(\cdot))$ uniformly), it can be seen that $g$ is continuous. Then $A'$ is compact, as it is the union of a sequence of functions and its limit. So $g|_{A' \times [0,t] \times S^{(1)} \times U^{(1)}}$ is uniformly continuous. Then a similar argument as in Lemma 2.4 shows equicontinuity of $\{\tilde{f}_n(\cdot, \cdot)\}$, which results in uniform convergence and thereby (20). An application of Lebesgue's theorem in conjunction with (20) shows that
$$\int \left(f_i(z) - \int f_i(y)\, p(dy \mid z, a, \tilde{\theta}(t), \lambda(\tilde{\theta}(t)))\right) \tilde{\mu}(t, dz\, da) = 0 \quad \forall i$$
for a.e. $t$. By our choice of $\{f_i\}$, this leads to
$$\tilde{\mu}(t, dy \times U^{(1)}) = \int p(dy \mid z, a, \tilde{\theta}(t), \lambda(\tilde{\theta}(t)))\, \tilde{\mu}(t, dz\, da) \text{ a.e. } t.$$

Lemma 3.8 shows that every limit point $(\tilde{\mu}(\cdot), \tilde{\theta}(\cdot))$ of $(\mu(s + \cdot), \bar{\theta}(s + \cdot))$ as $s \to \infty$ is such that $\tilde{\theta}(\cdot)$ satisfies (19) with $\mu(\cdot) = \tilde{\mu}(\cdot)$. Hence $\tilde{\theta}(\cdot)$ is absolutely continuous. Moreover, using Lemma 3.9, one can see that it satisfies (9) for a.e. $t$, hence is a solution to the differential inclusion (9).

Proof 13 (Proof of Theorems 2.6 and 2.7). From the previous three lemmas it is easy to see that $A = \bigcap_{t \geq 0} \overline{\{\bar{\theta}(s) : s \geq t\}}$ is almost surely an internally chain transitive set of (9).

Proof 14 (Proof of Corollary 1). Follows directly from Theorem 2.6 and Lemma 2.1.
We now discuss relaxing the uniformity, w.r.t. the state of the controlled Markov process, of the Lipschitz constant of the vector field. The modified assumption is:

(A2)' $h : \mathbb{R}^{d+k} \times S^{(1)} \to \mathbb{R}^d$ is jointly continuous as well as Lipschitz in its first two arguments with the third argument fixed, the Lipschitz constant being a function of the third argument. The latter condition means that for all $z^{(1)} \in S^{(1)}$,
$$\|h(\theta, w, z^{(1)}) - h(\theta', w', z^{(1)})\| \leq L^{(1)}(z^{(1)})(\|\theta - \theta'\| + \|w - w'\|).$$
A similar condition holds for $g$, where the Lipschitz constant is $L^{(2)} : S^{(2)} \to \mathbb{R}^+$.

Note that this allows $L^{(i)}(\cdot)$ to be an unbounded measurable function, possibly discontinuous in view of (A1). The straightforward way to handle this is to additionally assume the following:

(A8) $\sup_n L^{(i)}(Z^{(i)}_n) < \infty$ a.s.,

while still allowing $L^{(i)}(\cdot)$ to be an unbounded function. As all our proofs in Section 3 are carried out for every sample point of a probability one set, the proofs then go through. In the following we give such an example for the case where the Markov process is uncontrolled. It is enough to consider examples with locally compact $S^{(i)}$ (because we can then take the standard one-point compactification and define $L^{(i)}$ arbitrarily at the extra point).

Let $S^{(i)} = \mathbb{Z}$ and let $Z^{(i)}_n, n \geq 0$, be the random walk on $\mathbb{Z}$ starting at 0, with transition probabilities $p(n, n+1) = p$ and $p(n, n-1) = 1 - p$.
We assume $1/2 < p < 1$. Let $L^{(i)}(n) = \left(\frac{1-p}{p}\right)^n$. Note that $Z^{(i)}_n, n \geq 0$, is a transient Markov chain and $Z^{(i)}_n \to +\infty$ a.s. From this it follows that $\inf_n Z^{(i)}_n > -\infty$ a.s., and thus $\sup_n L^{(i)}(Z^{(i)}_n) < \infty$ almost surely. It follows that $(L^{(i)}(Z^{(i)}_n))_{n \in \mathbb{N}}$ is a bounded sequence with probability 1, but this bound is clearly not deterministic, since there is a non-zero probability that the sample path reaches large negative values.
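A quick simulation (our own, for illustration only) makes this concrete: along each sample path the sequence $L^{(i)}(Z^{(i)}_n)$ stays bounded, but the bound varies from path to path.

```python
import numpy as np

# Simulate the transient random walk example: p(n, n+1) = p > 1/2,
# L(n) = ((1 - p) / p) ** n, which blows up as n -> -infinity.
rng = np.random.default_rng(0)
p = 0.7
ratio = (1 - p) / p

for path in range(5):
    z, min_z = 0, 0
    for _ in range(10000):
        z += 1 if rng.random() < p else -1
        min_z = min(min_z, z)
    # sup_n L(Z_n) on this path is attained at the lowest state visited.
    print(f"path {path}: min state = {min_z}, sup_n L(Z_n) = {ratio ** min_z:.2f}")
# Each bound is finite (the walk drifts to +infinity), but it is random.
```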
However, in the following we discuss the idea of using moment assumptions to analyze the convergence of the single timescale controlled Markov noise framework of [20]. We show that the iterates (14) (with $\epsilon_n = 0$) converge to an internally chain transitive set of the o.d.e. (12). For this we prove Lemma 3.1 under the following assumptions. For all $T > 0$:

(S1) The controlled Markov process $Y_n$ as described in [20] takes values in a compact metric space.

(S2) For all $n \geq 0$, $0 < a(n) \leq 1$, $\sum_n a(n) = \infty$, $\sum_n a(n)^2 < \infty$, and $a(n+1) \leq a(n)$ for all $n \geq 0$.

(S3) $h : \mathbb{R}^d \times S \to \mathbb{R}^d$ is Lipschitz in its first argument w.r.t. the second, i.e., for all $z \in S$,
$$\|h(\theta, z) - h(\theta', z)\| \leq L(z)\, \|\theta - \theta'\|.$$

(S4) Let $\phi(n, T) = \max(m : a(n) + a(n+1) + \cdots + a(n+m) \leq T)$, with the bound depending on $T$. Then
$$\sup_n E\left[\left(\sup_{0 \leq m \leq \phi(n,T)} L(Y_{n+m})\right)^4\right] < \infty.$$

(S5) $\sup_n E\left[e^{4\sum_{m=0}^{\phi(n,T)} a(n+m)\, L(Y_{n+m})}\right] < \infty$.

Note that (S4) and (S5) are trivially satisfied when $L(z) = L$ for all $z \in S$, i.e., in the case of Section 2.

Remark 9.
As long as one can prove Lemma 3.1 for all small $T > 0$, it will hold for all $T > 0$; thus one can combine (S4) and (S5) into the following single assumption:
$$\sup_n E\left[e^{4T \sup_{0 \leq m \leq \phi(n,T)} L(Y_{n+m})}\right] < \infty.$$
As an instance where such an assumption can be verified, consider the Markov process of [10, Eqn. (3.4)] defined by
$$Y_{n+1} = A(\theta_n) Y_n + B(\theta_n) W_{n+1},$$
where $A(\theta), B(\theta), \theta \in \mathbb{R}^d$, are $k \times k$ matrices and $(W_n)_{n \geq 0}$ are independent and identically distributed $\mathbb{R}^k$-valued random variables. Assume that the following conditions hold for all $x, y \in S$:

(a) $L(Y_n)$ is a non-decreasing sequence.

(b) For $r > 0$, $R > 0$,
$$\sup_{\|\theta\| \leq R} e^{r L(A(\theta) x + B(\theta) y)} \leq L_R\, \alpha_R^r\, e^{r L(x)} + M_R\, e^{C_R L(y)}$$
for some $C_R, M_R, L_R > 0$ and $\alpha_R < 1$.

Then
$$E\left[e^{r L(Y_n)} \mid Y_{n-1} = x, \theta_{n-1} = \theta\right] \leq \int e^{r L(A(\theta) x + B(\theta) y)}\, \mu_n(dy) \leq L_R\, \alpha_R^r\, e^{r L(x)} + M_R\, E\left[e^{C_R L(W_n)}\right] = L_R\, \alpha_R^r\, e^{r L(x)} + K_R,$$
with $K_R = M_R E[e^{C_R L(W_1)}]$ (this follows from the fact that the $W_n$ are i.i.d., assuming $E[e^{C_R L(W_1)}] < \infty$). Choosing a large value of $r$, one can ensure $\beta_R = L_R \alpha_R^r < 1$, so that
$$E\left[e^{r L(Y_n)} \mid Y_{n-1} = x, \theta_{n-1} = \theta\right] \leq \beta_R\, e^{r L(x)} + K_R.$$
Using the above, for large $r$,
$$E\left[e^{r L(Y_n)}\right] = E\left[E\left[e^{r L(Y_n)} \mid Y_{n-1}, \theta_{n-1}\right]\right] \leq \beta_R\, E\left[e^{r L(Y_{n-1})}\right] + K_R,$$
which shows that $\sup_n E[e^{r L(Y_n)}] < \infty$. Choosing $r > 4T$ then gives $\sup_n E[e^{4T L(Y_n)}] < \infty$. Note that this is a much weaker assumption than (A8).

(S6) The noise sequence $M_n, n \geq 0$, satisfies
$$\sup_n E\left[\left(\sum_{m=0}^{\phi(n,T)} \|M_{n+m+1}\|\right)^4\right] < \infty.$$

(S7) $\sup_n \|\theta_n\| < \infty$.

With the above assumptions we prove the following tracking lemma:

Lemma 4.1.
For any $T > 0$,
$$\sup_{t \in [s, s+T]} \|\bar{\theta}(t) - \theta^s(t)\| \to 0 \text{ a.s.}$$

Proof 15.
Let $t(n) \leq t \leq t(n+m)$. Now, if $0 \leq k \leq m-1$ and $t \in (t(n+k), t(n+k+1)]$,
$$\|\theta^{t(n)}(t)\| \leq \|\bar{\theta}(t(n))\| + \left\|\int_{t(n)}^t \tilde{h}(\theta^{t(n)}(\tau), \mu(\tau))\, d\tau\right\| \leq \|\theta_n\| + \sum_{l=0}^{k-1} \int_{t(n+l)}^{t(n+l+1)} \left(\|h(0, Y_{n+l})\| + L(Y_{n+l})\, \|\theta^{t(n)}(\tau)\|\right) d\tau + \int_{t(n+k)}^t \left(\|h(0, Y_{n+k})\| + L(Y_{n+k})\, \|\theta^{t(n)}(\tau)\|\right) d\tau \leq C_0 + M_0 T + \int_{t(n)}^t L(Y(\tau))\, \|\theta^{t(n)}(\tau)\|\, d\tau,$$
where $Y(\tau) = Y_n$ if $\tau \in [t(n), t(n+1))$, $C_0 = \sup_n \|\theta_n\| < \infty$ (by (S7)) and $M_0 = \sup_{z \in S} \|h(0, z)\| < \infty$ (by (S1)). It then follows from an application of the Gronwall inequality that
$$\|\theta^{t(n)}(t)\| \leq C\, e^{\int_{t(n)}^t L(Y(\tau))\, d\tau} \quad \text{a.e. } t,$$
where $C = C_0 + M_0 T$. Next,
$$\|\theta^{t(n)}(t) - \theta^{t(n)}(t(n+k))\| \leq \int_{t(n+k)}^t \|h(\theta^{t(n)}(s), Y_{n+k})\|\, ds \leq \|h(0, Y_{n+k})\|(t - t(n+k)) + L(Y_{n+k}) \int_{t(n+k)}^t \|\theta^{t(n)}(s)\|\, ds \leq M_0\, a(n+k) + C L(Y_{n+k}) \int_{t(n+k)}^t e^{\int_{t(n)}^s L(Y(\tau))\, d\tau}\, ds.$$
Then
$$\left\|\int_{t(n)}^{t(n+m)} \left(h(\theta^{t(n)}(t), \mu(t)) - h(\theta^{t(n)}([t]), \mu([t]))\right) dt\right\| \leq \sum_{k=0}^{m-1} \int_{t(n+k)}^{t(n+k+1)} \|h(\theta^{t(n)}(t), Y_{n+k}) - h(\theta^{t(n)}([t]), Y_{n+k})\|\, dt \leq \sum_{k=0}^{m-1} L(Y_{n+k}) \int_{t(n+k)}^{t(n+k+1)} \|\theta^{t(n)}(t) - \theta^{t(n)}(t(n+k))\|\, dt \leq \sum_{k=0}^{m-1} c_k,$$
where $c_k = L(Y_{n+k})\, a(n+k)^2 \left[M_0 + C L(Y_{n+k})\, e^{\sum_{i=0}^{k} a(n+i) L(Y_{n+i})}\right]$. Hence
$$\|\bar{\theta}(t(n+m)) - \theta^{t(n)}(t(n+m))\| \leq \sum_{k=0}^{m-1} L(Y_{n+k})\, a(n+k)\, \|\bar{\theta}(t(n+k)) - \theta^{t(n)}(t(n+k))\| + \sum_{k=0}^{m-1} c_k + \|\delta_{n, n+m}\|,$$
where $\delta_{n, n+m} = \sum_{k=n}^{n+m-1} a(k) M_{k+1}$. Therefore, using the discrete Gronwall inequality, we get
$$\|\bar{\theta}(t(n+m)) - \theta^{t(n)}(t(n+m))\| \leq r(m, n)\, e^{\sum_{k=0}^{m-1} a(n+k) L(Y_{n+k})},$$
where $r(m, n) = \sum_{k=0}^{m-1} (c_k + a(n+k)\, \|M_{n+k+1}\|)$. Now, for $t \in [t(n+m), t(n+m+1)]$ and some $\lambda \in [0, 1]$,
$$\|\theta^{t(n)}(t) - \bar{\theta}(t)\| \leq (1-\lambda)\, \|\theta^{t(n)}(t(n+m+1)) - \bar{\theta}(t(n+m+1))\| + \lambda\, \|\theta^{t(n)}(t(n+m)) - \bar{\theta}(t(n+m))\| + \max(\lambda, 1-\lambda) \int_{t(n+m)}^{t(n+m+1)} \|\tilde{h}(\theta^{t(n)}(s), \mu(s))\|\, ds \leq r(m+1, n)\, e^{\sum_{k=0}^{m} a(n+k) L(Y_{n+k})} + a(n+m)\left[M_0 + C L(Y_{n+m})\, e^{\sum_{k=0}^{m} a(n+k) L(Y_{n+k})}\right].$$
Therefore
$$\rho(n, T) := \sup_{t \in [t(n), t(n)+T]} \|\theta^{t(n)}(t) - \bar{\theta}(t)\| \leq r(\phi(n,T)+1, n)\, e^{\sum_{k=0}^{\phi(n,T)} a(n+k) L(Y_{n+k})} + a(n)\left[M_0 + C \sup_{0 \leq m \leq \phi(n,T)} L(Y_{n+m})\, e^{\sum_{k=0}^{\phi(n,T)} a(n+k) L(Y_{n+k})}\right].$$
To prove the a.s. convergence of the quantity on the left-hand side to zero as $n \to \infty$, we have, using the Cauchy-Schwarz inequality,
$$\sum_{n=1}^\infty E[\rho(n, T)^2] \leq 4 K_T \sum_{n=1}^\infty \left(E\left[r(\phi(n,T)+1, n)^4\right]\right)^{1/2} + 4 M_0^2 \sum_{n=0}^\infty a(n)^2 + 4 C^2 \sum_{n=1}^\infty a(n)^2\, E\left[\left(\sup_{0 \leq m \leq \phi(n,T)} L(Y_{n+m})\right)^2 e^{2\sum_{k=0}^{\phi(n,T)} a(n+k) L(Y_{n+k})}\right],$$
where $K_T = \sqrt{\sup_n E\left[e^{4\sum_{k=0}^{\phi(n,T)} a(n+k) L(Y_{n+k})}\right]}$, which depends only on $T$ by (S5). The third term on the R.H.S. is clearly finite from the assumptions (S4) and (S5), together with another application of the Cauchy-Schwarz inequality. We now analyze the first term:
$$\sum_{n=1}^\infty \left(E\left[r(\phi(n,T)+1, n)^4\right]\right)^{1/2} \leq 2\sqrt{2} \sum_{n=1}^\infty \left(E\left[\left(\sum_{k=0}^{\phi(n,T)} c_k\right)^4\right]\right)^{1/2} + 2\sqrt{2} \sum_{n=1}^\infty \left(E\left[\left(\sum_{k=0}^{\phi(n,T)} a(n+k)\, \|M_{n+k+1}\|\right)^4\right]\right)^{1/2}. \quad (21)$$
Next we analyze the first term on the R.H.S. of (21), again using the Cauchy-Schwarz inequality:
$$\sum_{n=1}^\infty \left(E\left[\left(\sum_{k=0}^{\phi(n,T)} c_k\right)^4\right]\right)^{1/2} \leq 2\sqrt{2}\, M_0^2 \sum_{n=1}^\infty (\phi(n,T)+1)^2\, a(n)^4 \left(E\left[\left(\sup_{0 \leq k \leq \phi(n,T)} L(Y_{n+k})\right)^4\right]\right)^{1/2} + 2\sqrt{2}\, C^2 \sum_{n=1}^\infty (\phi(n,T)+1)^2\, a(n)^4 \left(E\left[\left(\sup_{0 \leq k \leq \phi(n,T)} L(Y_{n+k})\right)^8 e^{4\sum_{i=0}^{\phi(n,T)} a(n+i) L(Y_{n+i})}\right]\right)^{1/2}.$$
Therefore the R.H.S. will be finite if we can show that $\sum_{n=1}^{\infty} \phi(n,T)^2 a(n)^4$ is finite. For the common step-size sequence $a(n) = \frac{1}{n}$ one has $\phi(n,T) = O(n)$, so the above series clearly converges. One can make the series converge for all $a(n) = \frac{1}{n^k}$ with $\frac{1}{2} < k \le 1$ by imposing assumptions on higher moments in (S4) and (S5).

In the above we have used the following inequality repeatedly for non-negative random variables $X$ and $Y$:
$$\sqrt{E\big[(X+Y)^n\big]} \le 2^{n-1}\Big[\sqrt{E[X^n]} + \sqrt{E[Y^n]}\Big], \quad n \in \mathbb{N}.$$
Now,
$$\sum_{n=1}^{\infty} E\Big[\Big(\sum_{k=0}^{\phi(n,T)} a(n+k)\|M_{n+k+1}\|\Big)^4\Big]^{1/2} \le \sum_{n=1}^{\infty} a(n)^2\, E\Big[\Big(\sum_{k=0}^{\phi(n,T)} \|M_{n+k+1}\|\Big)^4\Big]^{1/2},$$
which is finite under assumption (S5) and the fact that the $a(n)$ are non-increasing.

5 Off-policy temporal difference learning with linear function approximation

In this section, we present an application of our results in the setting of off-policy temporal difference learning with linear function approximation. In this framework, we need to estimate the value function for a target policy $\pi$ given the continuing evolution of the underlying MDP (with finite state and action spaces $S$ and $A$ respectively, specified by expected reward $r(\cdot,\cdot,\cdot)$ and transition probability kernel $p(\cdot|\cdot,\cdot)$) under a behaviour policy $\pi_b$ with $\pi \ne \pi_b$. The authors of [11, 12, 3] have proposed two approaches to solve the problem:

(i) Sub-sampling: In this approach, the transitions which are relevant to the deterministic target policy are kept and the rest of the data is discarded from the given "on-policy" trajectory. We use the triplet $(S, R, S')$ to represent (current state, reward, next state). One then has "off-policy" data $(X'_n, R_n, W_n)$, $n \ge 0$, with $E[R_n \mid X'_n = s, W_n = s'] = r(s,a,s')$ and $P(W_n = s' \mid X'_n = s) = p(s'|s,a)$ with $\pi(s) = a$, $\pi$ being the target policy, and $\{X'_n\}$, $n \ge 0$, forming a (non-i.i.d.) random process. The analyses of the above works, however, assume off-policy data $(X'_n, R_n, W_n)$, $n \ge 0$, where $\{X'_n\}$ are i.i.d., $E[R_n \mid X'_n = s, W_n = s'] = r(s,a,s')$ and $P(W_n = s' \mid X'_n = s) = p(s'|s,a)$ with $\pi(s) = a$, $\pi$ being the deterministic target policy. Additionally, the distribution of $\{X'_n\}$ is assumed to be sampled according to the stationary distribution of the Markov chain corresponding to the behaviour policy. However, such data cannot be generated by sub-sampling given only the "on-policy" trajectory: a Markov chain sampled at increasing stopping times cannot be i.i.d.

(ii) Importance-weighting: Here every transition of the "on-policy" trajectory is retained, and the algorithm's increments are re-weighted by the likelihood ratio of the target and behaviour policies.

In the following, we show how gradient temporal-difference learning along with importance-weighting can be used to solve the off-policy convergence problem stated above for TD when only the "on-policy" trajectory is available.
5.1 Problem Definition

Suppose we are given an on-policy trajectory $(X_n, A_n, R_n, X_{n+1})$, $n \ge 0$, where $\{X_n\}$ is a time-homogeneous irreducible Markov chain with unique stationary distribution $\nu$, generated from a behaviour policy $\pi_b \ne \pi$. Here the quadruplet $(S, A, R, S')$ represents (current state, action, reward, next state). Also, assume that $\pi_b(a|s) > 0$ for all $s \in S$, $a \in A$. We need to find the solution $\theta^*$ of the following:
$$0 = \sum_{s,a,s'} \nu(s)\,\pi(a|s)\,p(s'|s,a)\,\delta(\theta; s,a,s')\,\phi(s) = E\big[\rho_{X,A_n}\,\delta_{X,R_n,X_{n+1}}(\theta)\,\phi(X)\big] = b - A\theta, \quad (22)$$
where

(i) $\theta \in \mathbb{R}^d$ is the parameter of the value function,
(ii) $\phi : S \to \mathbb{R}^d$ is a vector of state features,
(iii) $X \sim \nu$,
(iv) $0 < \gamma < 1$,
(v) $E[R_n \mid X_n = s, X_{n+1} = s'] = \sum_{a \in A} \pi_b(a|s)\, r(s,a,s')$,
(vi) $P(X_{n+1} = s' \mid X_n = s) = \sum_{a \in A} \pi_b(a|s)\, p(s'|s,a)$,
(vii) $\delta(\theta; s,a,s') = r(s,a,s') + \gamma\theta^T\phi(s') - \theta^T\phi(s)$ is the temporal difference term with expected reward,
(viii) $\rho_{X,A_n} = \frac{\pi(A_n|X)}{\pi_b(A_n|X)}$,
(ix) $\delta_{X,R_n,X_{n+1}}(\theta) = R_n + \gamma\theta^T\phi(X_{n+1}) - \theta^T\phi(X)$ is the online temporal difference,
(x) $A = E\big[\rho_{X,A_n}\,\phi(X)(\phi(X) - \gamma\phi(X_{n+1}))^T\big]$,
(xi) $b = E\big[\rho_{X,A_n}\, R_n\, \phi(X)\big]$.

Hence the desired approximate value function under the target policy $\pi$ is $V^*_\pi = \theta^{*T}\phi$. Let $V_\theta = \theta^T\phi$. It is well known ([3]) that $\theta^*$ satisfies the projected fixed point equation $V_\theta = \Pi_{\mathcal{G},\nu} T^\pi V_\theta$, where
$$\Pi_{\mathcal{G},\nu}\hat{V} = \arg\min_{f \in \mathcal{G}} \|\hat{V} - f\|_\nu, \quad \mathcal{G} = \{V_\theta \mid \theta \in \mathbb{R}^d\},$$
and the Bellman operator is
$$T^\pi V_\theta(i) = \sum_{j \in S}\sum_{a \in A} \pi(a|i)\, p(j|i,a)\,\big[\gamma V_\theta(j) + r(i,a,j)\big].$$
Therefore, to find $\theta^*$, the idea is to minimize the mean square projected Bellman error $J(\theta) = \|V_\theta - \Pi_{\mathcal{G},\nu} T^\pi V_\theta\|_\nu^2$ using stochastic gradient descent. It can be shown that the expression for the gradient contains a product of multiple expectations. Such a framework can be modelled by two time-scale stochastic approximation, where one iterate stores the quasi-stationary estimates of some of the expectations and the other iterate is used for sampling.
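When the model quantities are available, the fixed point in (22) can of course be computed in closed form as $\theta^* = A^{-1}b$. The following is a minimal sketch on a randomly generated toy MDP (all names, sizes and model quantities are illustrative, not from the paper):

```python
import numpy as np

# Minimal sketch: solve (22) in closed form, theta* = A^{-1} b,
# for a randomly generated toy MDP. All quantities are illustrative.
rng = np.random.default_rng(0)
nS, nA, d, gamma = 5, 2, 3, 0.9

P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # p(s'|s,a)
R = rng.standard_normal((nS, nA, nS))           # r(s,a,s')
phi = rng.standard_normal((nS, d))              # feature map phi(s)
pi_b = rng.dirichlet(np.ones(nA), size=nS)      # behaviour policy pi_b(a|s)
pi = rng.dirichlet(np.ones(nA), size=nS)        # target policy pi(a|s)

# Stationary distribution nu of the behaviour-policy chain P_b(s'|s).
P_b = np.einsum('sa,sax->sx', pi_b, P)
evals, evecs = np.linalg.eig(P_b.T)
nu = np.real(evecs[:, np.argmax(np.real(evals))])
nu = nu / nu.sum()

# Assemble A = E[rho phi(X)(phi(X) - gamma phi(X_{n+1}))^T] and
# b = E[rho R_n phi(X)]; the importance weight rho = pi/pi_b cancels
# the pi_b factor, leaving the weights nu(s) pi(a|s) p(s'|s,a) of (22).
A = np.zeros((d, d))
b = np.zeros(d)
for s in range(nS):
    for a in range(nA):
        for s2 in range(nS):
            w = nu[s] * pi[s, a] * P[s, a, s2]
            A += w * np.outer(phi[s], phi[s] - gamma * phi[s2])
            b += w * R[s, a, s2] * phi[s]

theta_star = np.linalg.solve(A, b)      # TD(0) solution of (22)
print(np.allclose(A @ theta_star, b))   # sanity check: b - A theta* = 0
```

The point of the algorithm below, of course, is to find $\theta^*$ online, without access to $p$, $r$ or $\nu$.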
5.2 The TDC Algorithm with importance-weighting

We consider the TDC (Temporal Difference with Correction) algorithm with importance-weighting from Sections 4.2 and 5.2 of [3]. The gradient in this case can be shown to satisfy
$$-\nabla J(\theta) = E\big[\rho_{X,A_n}\,\delta_{X,R_n,X_{n+1}}(\theta)\,\phi(X)\big] - \gamma E\big[\rho_{X,A_n}\,\phi(X_{n+1})\,\phi(X)^T\big]\, w(\theta),$$
$$w(\theta) = E\big[\phi(X)\phi(X)^T\big]^{-1} E\big[\rho_{X,A_n}\,\delta_{X,R_n,X_{n+1}}(\theta)\,\phi(X)\big].$$
Define $\phi_n = \phi(X_n)$, $\phi'_n = \phi(X_{n+1})$, $\delta_n(\theta) = \delta_{X_n,R_n,X_{n+1}}(\theta)$ and $\rho_n = \rho_{X_n,A_n}$. Therefore the associated iterations in this algorithm are:
$$\theta_{n+1} = \theta_n + a(n)\,\rho_n\big[\delta_n(\theta_n)\phi_n - \gamma\phi'_n\phi_n^T w_n\big],$$
$$w_{n+1} = w_n + b(n)\big[(\rho_n\delta_n(\theta_n) - \phi_n^T w_n)\phi_n\big], \quad (23)$$
with $\{a(n)\}$, $\{b(n)\}$ satisfying (A4).
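For illustration, here is a minimal simulation sketch of the iterations (23) run on the "on-policy" stream generated by $\pi_b$. It reuses the toy quantities from the previous sketch; the step sizes below are one admissible example, not prescribed by the algorithm:

```python
import numpy as np

def tdc(P, R, phi, pi, pi_b, gamma, n_iter=500_000, seed=1):
    """Sketch of TDC with importance-weighting, cf. (23)."""
    nS, nA, _ = P.shape
    d = phi.shape[1]
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)
    w = np.zeros(d)
    s = 0
    for n in range(1, n_iter + 1):
        a_n = 1.0 / n          # slower step size a(n)
        b_n = 1.0 / n ** 0.6   # faster step size b(n); a(n)/b(n) -> 0
        a = rng.choice(nA, p=pi_b[s])     # action from the behaviour policy
        s2 = rng.choice(nS, p=P[s, a])    # next state from the kernel
        r = R[s, a, s2]                   # (toy: deterministic reward)
        rho = pi[s, a] / pi_b[s, a]       # importance weight rho_n
        delta = r + gamma * phi[s2] @ theta - phi[s] @ theta  # delta_n(theta_n)
        # (23): theta moves on the slower timescale, w on the faster one
        theta = theta + a_n * rho * (delta * phi[s]
                                     - gamma * (phi[s] @ w) * phi[s2])
        w = w + b_n * ((rho * delta - phi[s] @ w) * phi[s])
        s = s2
    return theta

# For the toy model above, the returned iterate should approach theta_star
# from the previous sketch, in line with Theorem 5.1 below.
```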
5.3 Convergence Proof

Theorem 5.1 (Convergence of TDC with importance-weighting). Consider the iterations (23) of TDC. Assume the following:

(i) $\{a(n)\}$, $\{b(n)\}$ satisfy (A4).
(ii) $\{(X_n, R_n, X_{n+1}), n \ge 0\}$ is such that $\{X_n\}$ is a time-homogeneous finite state irreducible Markov chain generated from the behaviour policy $\pi_b$ with unique stationary distribution $\nu$; $E[R_n \mid X_n = s, X_{n+1} = s'] = \sum_{a \in A}\pi_b(a|s)\,r(s,a,s')$ and $P(X_{n+1} = s' \mid X_n = s) = \sum_{a \in A}\pi_b(a|s)\,p(s'|s,a)$, where $\pi_b$ is the behaviour policy and $\pi \ne \pi_b$. Also, $E[R_n^2 \mid X_n, X_{n+1}] < \infty$ for all $n$ almost surely.
(iii) $C = E[\phi(X)\phi(X)^T]$ and $A = E[\rho_{X,A_n}\,\phi(X)(\phi(X) - \gamma\phi(X_{n+1}))^T]$ are non-singular, where $X \sim \nu$.
(iv) $\pi_b(a|s) > 0$ for all $s \in S$, $a \in A$.
(v) $\sup_n(\|\theta_n\| + \|w_n\|) < \infty$ w.p. 1.

Then the parameter vector $\theta_n$ converges with probability one as $n \to \infty$ to the TD(0) solution $\theta^*$ of (22).
Proof 16. The iterations (23) can be cast into the framework of Section 2.2 with

(i) $Z^{(i)}_n = X_n$, $i = 1, 2$,
(ii) $h(\theta, w, z) = E\big[\rho_n(\delta_n(\theta)\phi_n - \gamma\phi'_n\phi_n^T w) \mid X_n = z, \theta_n = \theta, w_n = w\big]$,
(iii) $g(\theta, w, z) = E\big[(\rho_n\delta_n(\theta) - \phi_n^T w)\phi_n \mid X_n = z, \theta_n = \theta, w_n = w\big]$,
(iv) $M^{(1)}_{n+1} = \rho_n(\delta_n(\theta_n)\phi_n - \gamma\phi'_n\phi_n^T w_n) - E\big[\rho_n(\delta_n(\theta_n)\phi_n - \gamma\phi'_n\phi_n^T w_n) \mid X_n, \theta_n, w_n\big]$,
(v) $M^{(2)}_{n+1} = (\rho_n\delta_n(\theta_n) - \phi_n^T w_n)\phi_n - E\big[(\rho_n\delta_n(\theta_n) - \phi_n^T w_n)\phi_n \mid X_n, \theta_n, w_n\big]$,
(vi) $\mathcal{F}_n = \sigma(\theta_m, w_m, X_m, R_{m-1}, A_{m-1}, m \le n)$, $n \ge 0$.

Note that in (ii) and (iii) we can define $h$ and $g$ independent of $n$ due to the time-homogeneity of $\{X_n\}$. Now, we verify the assumptions (A1)-(A7) (mentioned in Sections 2.2 and 2.3) for our application:

(i) (A1): $Z^{(i)}_n$, $i = 1, 2$, takes values in a compact metric space for all $n$, as $\{X_n\}$ is a finite state Markov chain.

(ii) (A5): Continuity of the transition kernel follows trivially from the fact that we have a finite state MDP.
Remark 10. In fact, we do not have to verify this assumption for the special case where the Markov chain is uncontrolled and has a unique stationary distribution. The reason is that in such a case (A5) is used only in the proof of Lemma 2.3; however, if the Markov chain has a unique stationary distribution, Lemma 2.3 follows trivially.

(iii) (A2):
(a)
$$\|h(\theta, w, z) - h(\theta', w', z)\| = \big\|E\big[\rho_n(\theta - \theta')^T(\gamma\phi(X_{n+1}) - \phi(X_n))\,\phi(X_n) - \gamma\rho_n\phi(X_{n+1})\phi(X_n)^T(w - w') \mid X_n = z\big]\big\| \le L\big(2M^2\|\theta - \theta'\| + M^2\|w - w'\|\big),$$
where $M = \max_{s \in S}\|\phi(s)\|$ with $S$ being the state space of the MDP, and $L = \max_{(s,a) \in S \times A} \frac{\pi(a|s)}{\pi_b(a|s)}$. Hence $h$ is Lipschitz continuous in the first two arguments uniformly w.r.t. the third. In the last inequality above, we use the Cauchy-Schwarz inequality.
(b) As with the case of $h$, $g$ can be shown to be Lipschitz continuous in the first two arguments uniformly w.r.t. the third.
(c) Joint continuity of $h$ and $g$ follows from (iii)(a) and (b) respectively, as well as the finiteness of $S$.

(iv) (A3): Clearly, $\{M^{(i)}_{n+1}\}$, $i = 1, 2$, are martingale difference sequences w.r.t. the increasing $\sigma$-fields $\mathcal{F}_n$. Note that
$$E\big[\|M^{(i)}_{n+1}\|^2 \mid \mathcal{F}_n\big] \le K\big(1 + \|\theta_n\|^2 + \|w_n\|^2\big) \ \text{a.s.}, \ n \ge 0,$$
since $E[R_n^2 \mid X_n, X_{n+1}] < \infty$ for all $n$ almost surely and $S$ is finite.

(v) (A4): This follows from condition (i) in the statement of Theorem 5.1.

Now, one can see that the faster o.d.e. becomes
$$\dot{w}(t) = E\big[\rho_{X,A_n}\delta_{X,R_n,X_{n+1}}(\theta)\phi(X)\big] - E\big[\phi(X)\phi(X)^T\big]\,w(t).$$
Clearly, $C^{-1}E[\rho_{X,A_n}\delta_{X,R_n,X_{n+1}}(\theta)\phi(X)]$ is the globally asymptotically stable equilibrium of this o.d.e. Moreover, $V'(\theta, w) = \|Cw - E[\rho_{X,A_n}\delta_{X,R_n,X_{n+1}}(\theta)\phi(X)]\|^2$ is continuously differentiable. Additionally, $\lambda(\theta) = C^{-1}E[\rho_{X,A_n}\delta_{X,R_n,X_{n+1}}(\theta)\phi(X)]$ is Lipschitz continuous in $\theta$, verifying (A6)'. For the slower o.d.e., the global attractor is $A^{-1}E[\rho_{X,A_n}R_n\phi(X)] = A^{-1}b = \theta^*$, verifying the additional assumption in Corollary 1; the attractor set here is a singleton. Also, (A7) is (v) in the statement of Theorem 5.1. Therefore the assumptions (A1)-(A5), (A6)', (A7) are verified, and the result follows from Corollary 1.
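To spell out the (standard) Lyapunov computation behind the (A6)' verification above, write $d(\theta) = E[\rho_{X,A_n}\delta_{X,R_n,X_{n+1}}(\theta)\phi(X)]$ as a shorthand. Along trajectories of the faster o.d.e. $\dot{w}(t) = d(\theta) - Cw(t)$,
$$\frac{d}{dt}V'(\theta, w(t)) = 2\big(Cw(t) - d(\theta)\big)^T C\,\dot{w}(t) = -2\big(Cw(t) - d(\theta)\big)^T C\,\big(Cw(t) - d(\theta)\big) < 0 \quad \text{whenever } Cw(t) \ne d(\theta),$$
since $C = E[\phi(X)\phi(X)^T]$ is symmetric positive semi-definite and non-singular by assumption (iii), hence positive definite. Thus $V'$ strictly decreases until $w = C^{-1}d(\theta) = \lambda(\theta)$, which is therefore the globally asymptotically stable equilibrium.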
Remark 11. The reason for using the two time-scale framework for the TDC algorithm is to make sure that the limiting o.d.e.s have globally asymptotically stable equilibria.
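As a concrete illustration (assuming (A4) is the usual two time-scale step-size requirement of Section 2.2), one admissible choice is
$$a(n) = \frac{1}{n+1}, \qquad b(n) = \frac{1}{(n+1)^{2/3}},$$
for which $\sum_n a(n) = \sum_n b(n) = \infty$, $\sum_n \big(a(n)^2 + b(n)^2\big) < \infty$ and $\frac{a(n)}{b(n)} \to 0$, so that $\{w_n\}$ moves on the faster timescale and sees $\{\theta_n\}$ as quasi-static.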
Remark 12.
Because the gradient is a product of two expectations, the scheme is a "pseudo"-gradient descent, which helps to find the global minimum here.
Remark 13.
Here we assume the stability of the iterates (23). Certain sufficient conditions for the stability of single timescale stochastic recursions with controlled Markov noise have been sketched in [21, p. 75, Theorem 9]; these subsequently need to be extended to the case of two time-scale recursions. Another way to ensure boundedness of the iterates is to use a projection operator. However, projection may introduce spurious fixed points on the boundary of the projection region, and finding the globally asymptotically stable equilibrium of a projected o.d.e. is hard. Therefore we do not use projection in our algorithm.
Remark 14.
Convergence analysis for TDC with importance-weighting along with eligibility traces (cf. [3, p. 74], where it is called GTD($\lambda$)) can be done similarly using our results. The main advantage is that our analysis works for all $\lambda < \frac{1}{L\gamma}$ (where $\lambda \in [0,1]$ is the eligibility function), whereas the analysis in [4] is shown to work only for $\lambda$ very close to 1.
Remark 15. One can analyze this algorithm when the state space is infinite by imposing assumptions on $\phi$ as well as on the target and behaviour policies.

6 Conclusions

We presented a general framework for two time-scale stochastic approximation with controlled Markov noise. Moreover, using a special case of our results, namely when the random process is a finite state irreducible time-homogeneous Markov chain (hence has a unique stationary distribution) and is uncontrolled (i.e., does not depend on the iterates), we provided a rigorous proof of convergence for an off-policy temporal difference learning algorithm with linear function approximation, under the assumption that only the "on-policy" trajectory for a behaviour policy is available; the proof also extends to eligibility traces for a sufficiently large range of $\lambda$. To our knowledge, this has not been done previously.

Acknowledgments

The authors want to thank Csaba Szepesvári for some useful discussions on the literature of off-policy learning. Our work was partly supported by the Robert Bosch Centre for Cyber-Physical Systems, Indian Institute of Science, Bangalore.
References

[1] A. Benveniste, M. Métivier, and P. Priouret. Adaptive Algorithms and Stochastic Approximations. Springer-Verlag, Berlin-New York, 1990.
[2] D. J. Ma, A. M. Makowski, and A. Shwartz. Stochastic approximations for finite state Markov chains. Stochastic Processes and their Applications, 35:27–45, 1990.
[3] H. R. Maei. Gradient Temporal-Difference Learning Algorithms. PhD thesis, University of Alberta, 2011.
[4] H. Yu. Least squares temporal difference methods: an analysis under general conditions. SIAM Journal on Control and Optimization, 50(6):3310–3343, 2012.
[5] H. Yu. Weak convergence properties of constrained emphatic temporal-difference learning with constant and slowly diminishing stepsize. Journal of Machine Learning Research, 17(220):1–58, 2016.
[6] I. Menache, S. Mannor, and N. Shimkin. Basis function adaptation in temporal difference reinforcement learning. Annals of Operations Research, 134:215–238, 2005.
[7] J.-P. Aubin and A. Cellina. Differential Inclusions: Set-Valued Maps and Viability Theory. Springer, 1984.
[8] M. Benaïm. Dynamics of stochastic approximation algorithms. Séminaire de Probabilités, pages 1–68, 1999.
[9] M. Benaïm, J. Hofbauer, and S. Sorin. Stochastic approximations and differential inclusions. SIAM Journal on Control and Optimization, 44(1):328–348, 2005.
[10] M. Métivier and P. Priouret. Applications of a Kushner-Clark lemma to general classes of stochastic algorithms. IEEE Transactions on Information Theory, 30(2):140–151, 1984.
[11] R. S. Sutton, H. R. Maei, and C. Szepesvári. A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In Advances in Neural Information Processing Systems, Vancouver, B.C., Canada, 2008.
[12] R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In International Conference on Machine Learning, Montreal, Canada, 2009.
[13] T. Degris, M. White, and R. S. Sutton. Off-policy actor-critic. In International Conference on Machine Learning, Scotland, UK, 2012.
[14] V. B. Tadić. Almost sure convergence of two time-scale stochastic approximation algorithms. In Proceedings of the American Control Conference, Boston, 2004.
[15] V. B. Tadić. Convergence and convergence rate of stochastic gradient search in the case of multiple and non-isolated extrema. Stochastic Processes and their Applications, 125:1715–1755, 2015.
[16] V. R. Konda and J. N. Tsitsiklis. Linear stochastic approximation driven by slowly varying Markov chains. Systems and Control Letters, 50:95–102, 2003.
[17] V. R. Konda and J. N. Tsitsiklis. On actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4):1143–1166, 2003.
[18] V. S. Borkar. Probability Theory: An Advanced Course. Springer, 1995.
[19] V. S. Borkar. Stochastic approximation with two time scales. Systems and Control Letters, 29(5):291–294, 1997.
[20] V. S. Borkar. Stochastic approximation with 'controlled Markov' noise. Systems and Control Letters, 55(2):139–145, 2006.
[21] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, 2008.
[22] W. Rudin.