Distributed Reinforcement Learning in Multi-Agent Networked Systems
Yiheng Lin
Tsinghua University, Beijing, China [email protected]
Guannan Qu
Caltech, Pasadena, CA [email protected]
Longbo Huang
Tsinghua University, Beijing, China [email protected]
Adam Wierman
Caltech, Pasadena, CA [email protected]
Abstract
We study distributed reinforcement learning (RL) for a network of agents. The objective is to find localized policies that maximize the (discounted) global reward. In general, scalability is a challenge in this setting because the size of the global state/action space can be exponential in the number of agents. Scalable algorithms are only known in cases where dependencies are local, e.g., between neighbors. In this work, we propose a Scalable Actor Critic framework that applies in settings where the dependencies are non-local and provide a finite-time error bound that shows how the convergence rate depends on the depth of the dependencies in the network. Additionally, as a byproduct of our analysis, we obtain novel finite-time convergence results for a general stochastic approximation scheme and for temporal difference learning with state aggregation that apply beyond the setting of RL in networked systems.
Multi-Agent Reinforcement Learning (MARL) has achieved impressive performance in a wide array of applications including multi-player game play [26, 34], multi-robot systems [11], and autonomous driving [20]. In comparison to single-agent reinforcement learning (RL), MARL poses many challenges, chief of which is scalability [46]. Even if each agent's local state/action spaces are small, the size of the global state/action space can be large, potentially exponentially large in the number of agents, which renders many RL algorithms, such as Q-learning, inapplicable.

A promising approach for addressing the scalability challenge that has received attention in recent years is to exploit application-specific structures, e.g., [14, 30, 32]. A particularly important example of such a structure is a networked structure, e.g., applications in multi-agent networked systems such as social networks [5, 22], communication networks [41, 49], queueing networks [29], and smart transportation networks [48]. In these networked systems, it is often possible to exploit local dependency structures [1, 12, 13, 27], i.e., the fact that agents only interact with neighboring agents in the network. This sort of local dependence structure often leads to scalable, distributed algorithms for optimization and control [1, 12, 27], and has proven effective for designing scalable and distributed MARL algorithms, e.g., [30, 32].

However, many real-world networked systems are inherently non-local and involve complex dependencies on other agents beyond neighbors in the network. For example, in the context of wireless networks, each node can send packets to other nodes within a fixed transmission range. However, the interference range, in which other nodes can interfere with the transmission, can be larger than the transmission range [43]. As a result, due to potential collisions, the local reward of each agent not only depends on its own local state/action, but also depends on the actions of other nodes within the interference range, which may be more than one hop away. In addition, a node may be able to observe other nodes' local states before picking its local action [28]. Although one can always localize the dependence model with some loss of structural knowledge, this usually leads to reduced performance. Beyond wireless networks, similar non-local dependencies exist in epidemics [25], social networks [5, 22], and smart transportation networks [48].

A challenging open question in MARL is to understand how to obtain algorithms that remain scalable in settings where dependencies extend beyond purely local ones, as considered in [30, 32], to neighborhoods, clusters, or beyond. However, it is not immediately clear whether such a goal is obtainable. It is clear that hardness results take hold when the dependencies are too general [19]. Further, positive results to this point rely on the concept of exponential decay [12, 30], meaning the agents' impact on each other decays exponentially in their graph distance. The use of this property relies on the fact that the dependencies are purely local, and it is not clear whether it can still be exploited when the interactions are more general.

Contributions.
In this paper, we define a class of dependency structures spanning from local dependence (which includes prior work [32]) to global interaction, where every agent is allowed to depend on all other agents. Key to our approach is that the class of dependencies we consider leads to an exponential decay property (Definition 3.1). This property enables the design of an efficient and scalable algorithm. Specifically, we propose a Scalable Actor Critic algorithm that can provably learn a near-optimal decision policy in a scalable manner (Theorem 3.3). Our analysis of the algorithm reveals a trade-off between the "depth" of the dependency structure and efficiency: as deeper interactions are modeled, the efficiency of the proposed method degrades gracefully. This is to be expected, since when the agents are allowed to interact globally, the problem degenerates to single-agent tabular Q-learning with an exponentially large state space, which is known to be intractable because the sample complexity is polynomial in the size of the state/action space [10, 19]. Further, we illustrate the effectiveness of the proposed approach via a wireless communication example with non-local dependence. Due to space constraints, this example is presented in Appendix A.

The key technical result underlying our contribution is a finite-time analysis of a general stochastic approximation scheme featuring infinity-norm contraction and state aggregation (Theorem 2.1). We apply this result to networked MARL, using the local neighborhood of each agent to provide state aggregation. Importantly, the result applies more broadly beyond MARL. Specifically, we show that it yields a finite-time bound on Temporal Difference (TD) learning with state aggregation (Theorem 2.2). To the best of our knowledge, the resulting bound is the first finite-time guarantee in the infinity norm for TD learning with state aggregation. In addition, the SA result yields a finite-time bound on asynchronous Q-learning with state aggregation as well, which is of independent interest. We defer the discussion to Appendix H due to space constraints.

Related literature.
MARL has received considerable attention in recent years; see [46] for a survey. The line of work most relevant to the current paper focuses on cooperative MARL. In the cooperative setting, agents decide on local actions but share a common global state, and they cooperate in order to maximize a global reward. Notable examples of this approach include [4, 8] and the references therein. In contrast, we study a situation where each agent has its own state that it acts upon. Despite the differences, like our situation, cooperative MARL problems still face scalability issues since the joint-action space is exponentially large. A variety of methods have been proposed to deal with this, including independent learners [6, 24], where each agent employs a single-agent RL policy. Alternatively, one can approximate a large Q-table via linear function approximation [47] or neural networks [23]. Such methods can reduce computational complexity significantly, but it is unclear whether the performance loss caused by the function approximation is small. In contrast to both of these approaches, our technique both reduces computational demands and guarantees small performance loss.

More broadly, this paper contributes to a growing literature that uses exponential decay to derive scalable algorithms for learning in networked systems. The specific form of exponential decay we use is related to the idea of "correlation decay" studied in [12, 13], though their focus is on solving static combinatorial optimization problems whereas ours is on learning policies in dynamic environments. Most related to the current paper is [32], which shows the exponential decay property in a restricted networked MARL model with purely local dependencies. In contrast, we show the exponential decay property holds for a general form of non-local dependencies.

The technical work in this paper contributes to the analysis of stochastic approximation (SA), which has received considerable attention over the past decade [9, 36, 44, 45]. Our work is most related to [31], which uses an asynchronous nonlinear SA to study the finite-time convergence rate of asynchronous Q-learning on a single trajectory. Beyond [31], there are many other works that use SA schemes to study TD learning and Q-learning, e.g., [15, 36, 42]. The finite-time error bound for TD learning with state aggregation in our work is most related to the asymptotic convergence limit given in [39] and the application of an SA scheme to asynchronous Q-learning in [31]. Beyond these papers, other related work in the broader area of RL with state aggregation includes [7, 17, 18, 21, 35]. We add to this literature with a novel finite-time convergence bound for a general SA scheme with state aggregation and the first finite-time error bound in the infinity norm for TD learning with state aggregation.

In this section, we present the key technical innovation underlying our results on MARL: a new finite-time analysis of a general asynchronous stochastic approximation (SA) scheme. This analysis and its application to TD learning with state aggregation underlie our approach for MARL in networked systems (presented in Section 3). This SA scheme is of interest more broadly, e.g., in the setting of asynchronous Q-learning with state aggregation (see Appendix H).

Consider a finite-state Markov chain whose state space is given by N = {1, 2, · · · , n}. We use {i_t | t = 0, 1, · · · } to denote the sequence of states visited by this Markov chain.
Our focus is on the following asynchronous stochastic approximation (SA) scheme, which is studied in [33, 38, 42]. Let the parameter be x ∈ R^N, and let F : R^N → R^N be a γ-contraction in the infinity norm. The update rule of the SA scheme is given by

x_{i_t}(t + 1) = x_{i_t}(t) + α_t ( F_{i_t}(x(t)) − x_{i_t}(t) + w(t) ),
x_j(t + 1) = x_j(t) for j ≠ i_t, j ∈ N,     (1)

where w(t) is a noise sequence. It is shown in [31] that the parameter x(t) converges to the unique fixed point of F at the rate of O(1/√t).

While general, in many cases, including networked MARL, we do not wish to calculate an entry of the parameter x for every state in N; instead, we wish to calculate "aggregated entries." Specifically, at each time step, after i_t is generated, we use a surjection h to decide which dimension of the parameter x should be updated. This technique, referred to as state aggregation, is one of the easiest-to-deploy schemes for state space compression in the RL literature [16, 35]. For the generalized SA scheme, our objective is to specify the limit of convergence as well as to obtain a finite-time error bound.

Formally, to define the generalization of (1), let N = {1, · · · , n} be the state space of {i_t} and let M = {1, · · · , m} (m ≤ n) be the abstract state space. The surjection h : N → M is used to convert every state in N to its abstraction in M. Given a parameter x ∈ R^M and a function F : R^N → R^N, we consider the generalized SA scheme that updates x(t) ∈ R^M starting from x(0) = 0:

x_{h(i_t)}(t + 1) = x_{h(i_t)}(t) + α_t ( F_{i_t}(Φ x(t)) − x_{h(i_t)}(t) + w(t) ),
x_j(t + 1) = x_j(t) for j ≠ h(i_t), j ∈ M,     (2)

where the feature matrix Φ ∈ R^{N×M} is defined as

Φ_{ij} = 1 if h(i) = j, and Φ_{ij} = 0 otherwise, ∀ i ∈ N, j ∈ M.     (3)

In order to state our main result characterizing the convergence of (2), we must first state a few definitions and assumptions. First, we define the weighted infinity norm as in [31], except that we extend its definition so as to define the contraction of the function F. The reason we use the weighted infinity norm as opposed to the standard infinity norm is that its generality can be used in certain settings of undiscounted RL, as shown in [2, 38].

Definition 2.1 (Weighted Infinity Norm). Given a positive vector v = [v_1, · · · , v_m]^⊤ ∈ R^M, we define ‖x‖_v := sup_{i∈M} |x_i|/v_i for all x ∈ R^M, and ‖x‖_v := sup_{i∈N} |x_i|/v_{h(i)} for all x ∈ R^N.

Next, we state our assumption on the mixing rate of the Markov chain {i_t}, which is common in the literature [36, 40]. It holds for any finite-state Markov chain that is aperiodic and irreducible [3].

Assumption 2.1 (Stationary Distribution and Geometric Mixing Rate). {i_t} is an aperiodic and irreducible Markov chain on state space N with stationary distribution d = (d_1, d_2, · · · , d_n). Let d′_j = ∑_{i ∈ h^{−1}(j)} d_i and σ′ = inf_{j∈M} d′_j. There exist positive constants K_1, K_2 which satisfy K_2 ≥ 1 and, ∀ j ∈ N, ∀ t ≥ 0,

sup_{S ⊆ N} | ∑_{i∈S} d_i − ∑_{i∈S} P(i_t = i | i_0 = j) | ≤ K_1 exp(−t/K_2).

Our next assumption ensures contraction of F and is identical to that of [31]. It is also standard, e.g., [38, 42], and ensures that F has a unique fixed point y*.

Assumption 2.2 (Contraction).
Operator F is a γ-contraction in ‖·‖_v, i.e., for any x, y ∈ R^N, ‖F(x) − F(y)‖_v ≤ γ‖x − y‖_v. Further, there exists some constant C > 0 such that ‖F(x)‖_v ≤ γ‖x‖_v + C for all x ∈ R^N.

In Assumption 2.2, notice that the first sentence directly implies the second, since

‖F(x)‖_v ≤ ‖F(x) − F(y*)‖_v + ‖F(y*)‖_v ≤ γ‖x − y*‖_v + ‖y*‖_v ≤ γ‖x‖_v + (1 + γ)‖y*‖_v,

where y* ∈ R^N is the unique fixed point of F. Further, while Assumption 2.2 implies that F has a unique fixed point y*, we do not expect our stochastic approximation scheme to converge to it. Instead, we show that the convergence is to the unique x* that solves Π F(Φ x*) = x*, where Π = (Φ^⊤ D Φ)^{−1} Φ^⊤ D. Here D = diag(d_1, d_2, · · · , d_n) denotes the steady-state probabilities of the process i_t. The point x* is well defined because the operator Π F(Φ ·), which defines a mapping from R^M to R^M, is also a contraction in ‖·‖_v. We state and prove this as Proposition B.1 in Appendix B.

Our last assumption is on the noise sequence w(t). It is also standard, e.g., [31, 33].

Assumption 2.3 (Martingale Difference Sequence). w(t) is F_{t+1}-measurable and satisfies E[w(t) | F_t] = 0. Further, |w(t)| ≤ w̄ almost surely for some constant w̄.

We are now ready to state our finite-time convergence result for stochastic approximation.
Theorem 2.1.
Suppose Assumptions 2.1, 2.2, and 2.3 hold. Further, assume there exists a constant x̄ ≥ ‖x*‖_v such that ‖x(t)‖_v ≤ x̄ almost surely for all t. Let the step size be α_t = H/(t + t_0) with t_0 = max(4H, K_2 log T) and H ≥ 1/(σ′(1 − γ)). Let x* be the unique solution of the equation Π F(Φ x*) = x* and define C_ε = 2K_1(2x̄ + C)(1 + 2K_2 + 4H). Then, with probability at least 1 − δ,

‖x(T) − x*‖_v ≤ C_a/√(T + t_0) + C′_a/(T + t_0) = Õ(1/√T),

where

C_a = (4H/(1 − γ)) (x̄ + 2C + w̄/v̲) √(K_2 log T) · √(log T + log log T + log(mK_1/δ)),
C′_a = (4/(1 − γ)) max( K_1 (x̄ + C + w̄/v̲) H log T/σ′ + C_ε, x̄ (2K_2 log T + t_0) ),

and v̲ := inf_{i∈M} v_i.

A proof of Theorem 2.1 can be found in Appendix C. Compared with Theorem 4 in [31], Theorem 2.1 holds for a more general SA scheme in which state aggregation is used to reduce the dimension of the parameter x. At the expense of this generality, Theorem 2.1 requires a stronger but standard assumption (Assumption 2.1) on the mixing rate of the Markov chain {i_t}. Note that the assumption on x̄ follows from Assumptions 2.2 and 2.3; we show this in Proposition D.1 in Appendix D.

2.2 TD Learning with State Aggregation

Before applying Theorem 2.1 in the network setting, we first illustrate its importance via a simpler application to the case of TD learning with state aggregation. As discussed in the introduction, the state aggregation method has been studied in many previous works in the RL literature and has broad applications [7, 17, 18, 21, 35]. The result in this subsection is also extremely useful in the analysis of networked MARL, since the exponential decay property (Definition 3.1) provides a natural state aggregation tool (see Corollary 3.2).

In TD learning with state aggregation [35, 39], given that the sequence of states visited by the Markov chain is {i_t}, the update rule of TD(0) is given by

θ_{h(i_t)}(t + 1) = (1 − α_t) θ_{h(i_t)}(t) + α_t [ r_t + γ θ_{h(i_{t+1})}(t) ],
θ_j(t + 1) = θ_j(t) for j ≠ h(i_t), j ∈ M,     (4)

where h : N → M is a surjection that maps each state in N to an abstract state in M, and r_t is the reward at time step t.

Take F to be the Bellman policy operator, i.e., the i'th dimension of F is given by F_i(V) = E_{i′ ∼ P(·|i)} [ r(i, i′) + γ V_{i′} ], ∀ i ∈ N, for V ∈ R^N. The value function (vector) V* is defined as V*_i = E[ ∑_{t=0}^∞ γ^t r(i_t, i_{t+1}) | i_0 = i ], i ∈ N [39]. By defining the feature matrix Φ as in (3) and the noise sequence as

w(t) = r_t + γ θ_{h(i_{t+1})}(t) − E_{i′ ∼ P(·|i_t)} [ r(i_t, i′) + γ θ_{h(i′)}(t) ],

we can rewrite the update rule (4) of TD(0) in the form of the SA scheme (2). Therefore, we can apply Theorem 2.1 to obtain a finite-time error bound for TD learning with state aggregation.

Theorem 2.2.
Suppose Assumption 2.1 holds for the Markov chain {i_t} and the stage reward r_t is upper bounded by r̄ almost surely. Further assume that if h(i) = h(i′) for i, i′ ∈ N, then |V*_i − V*_{i′}| ≤ ζ for a constant ζ. Consider TD(0) with step size α_t = H/(t + t_0), where t_0 = max(4H, K_2 log T) and H ≥ 1/(σ′(1 − γ)). Then, with probability at least 1 − δ,

‖Φ θ(T) − V*‖_∞ ≤ C_a/√(T + t_0) + C′_a/(T + t_0) + ζ/(1 − γ),

where

C_a = (40H r̄/(1 − γ)²) √(K_2 log T) · √(log T + log log T + log(mK_1/δ)),
C′_a = (8 r̄/(1 − γ)²) max( K_1 H log T/σ′ + 4K_1(1 + 2K_2 + 4H), K_2 log T + t_0 ).

The proof of Theorem 2.2 can be found in Appendix F. To the best of our knowledge, Theorem 2.2 provides the first finite-time error bound in the infinity norm for TD(0) with state aggregation.
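To make the update rule (4) concrete, the following self-contained Python sketch runs tabular TD(0) with state aggregation on a small synthetic Markov chain. The chain, reward table, aggregation map h, and the step-size constants H and t_0 are toy choices of our own for illustration; they are not taken from the paper.

```python
import numpy as np

def td0_state_aggregation(P, r, h, m, gamma, T, H=2.0, t0=10.0, seed=0):
    """Tabular TD(0) with state aggregation, following update rule (4).

    P     : (n, n) transition matrix of the Markov chain {i_t}
    r     : (n, n) rewards, r[i, j] collected on the transition i -> j
    h     : length-n integer array; h[i] is the abstract state of state i
    m     : number of abstract states
    gamma : discount factor
    T     : number of TD updates
    """
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    theta = np.zeros(m)                 # one entry per abstract state
    i = rng.integers(n)                 # initial state i_0
    for t in range(T):
        alpha = H / (t + t0)            # step size alpha_t = H / (t + t_0)
        j = rng.choice(n, p=P[i])       # sample i_{t+1} ~ P(. | i_t)
        target = r[i, j] + gamma * theta[h[j]]
        theta[h[i]] = (1 - alpha) * theta[h[i]] + alpha * target
        i = j
    return theta

# Toy example: 6 states aggregated into 3 abstract states.
if __name__ == "__main__":
    n, m, gamma = 6, 3, 0.9
    rng = np.random.default_rng(1)
    P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
    r = rng.random((n, n))
    h = np.array([0, 0, 1, 1, 2, 2])    # surjection h: N -> M
    print(td0_state_aggregation(P, r, h, m, gamma, T=50_000))
```

Because the update only touches the aggregated entry theta[h[i]], the sketch is exactly an instance of the generalized SA scheme (2) with F taken to be the Bellman policy operator.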
We now present our main results, which focus on MARL. These theoretical results apply the analysisin the previous section to a networked system.
We consider a network of agents associated with an undirected graph G = (N, E), where N = {1, 2, · · · , n} denotes the set of agents and E ⊆ N × N denotes the set of edges. The graph distance between two agents is defined as the number of edges on the shortest path that connects them. Each agent i is associated with a local state s_i ∈ S_i and a local action a_i ∈ A_i, where S_i and A_i are finite sets. The global state/action is defined as the combination of all local states/actions, i.e., s = (s_1, · · · , s_n) ∈ S := S_1 × · · · × S_n and a = (a_1, · · · , a_n) ∈ A := A_1 × · · · × A_n. We use N_i^κ to denote the κ-hop neighborhood of agent i, i.e., the agents whose graph distance to i is less than or equal to κ, including i itself.

The networked system contains three key components: transition dependence, policy dependence, and reward dependence, which are parameterized by (α_1, α_2, β_1, β_2) and are described below. Note that the model in [32], in which all dependencies are purely local, can be viewed as a special case of our model.

• Transition Dependence (α_1, α_2). The local state s_i(t + 1) of agent i depends only on the states of its α_1-hop neighborhood and the actions of its α_2-hop neighborhood at time step t. Further, the global transition probability decomposes as the product of local transition probabilities: P(s(t + 1) | s(t), a(t)) = ∏_{i=1}^n P( s_i(t + 1) | s_{N_i^{α_1}}(t), a_{N_i^{α_2}}(t) ).

• Policy Dependence (β_1). Each agent i adopts a stochastic localized policy ζ_i^{θ_i} parameterized by θ_i. We assume each agent takes its action a_i(t) based on the states of its β_1-hop neighborhood, i.e., a_i(t) is independently drawn from ζ_i^{θ_i}( · | s_{N_i^{β_1}}(t) ). The parameter β_1 may depend on how much timely information each node can obtain when deciding its action. Therefore, the global stochastic policy decomposes as ζ^θ(a | s) = ∏_{i=1}^n ζ_i^{θ_i}( a_i | s_{N_i^{β_1}} ), where θ is the tuple of all local parameters.

• Reward Dependence (β_2). The stage local reward associated with each agent i is a function of the states and actions of i's β_2-hop neighborhood. The global stage reward is defined as r(s, a) = (1/n) ∑_{i=1}^n r_i( s_{N_i^{β_2}}, a_{N_i^{β_2}} ). Without loss of generality, assume r_i is upper bounded by 1.

Starting from some initial distribution π_0 of the global state, the objective of the networked MARL algorithm is to maximize the discounted global reward, i.e.,

J(θ) = E_{s ∼ π_0} E_{a(t) ∼ ζ^θ(·|s(t))} [ ∑_{t=0}^∞ γ^t r(s(t), a(t)) | s(0) = s ].     (5)

Define π_t^θ as the distribution of s(t) under policy θ given that s(0) ∼ π_0. A well-known result [37] is that the gradient of the objective can be computed by

∇J(θ) = (1/(1 − γ)) E_{s ∼ π^θ, a ∼ ζ^θ(·|s)} [ Q^θ(s, a) ∇ log ζ^θ(a | s) ],     (6)

where π^θ(s) = (1 − γ) ∑_{t=0}^∞ γ^t π_t^θ(s) is the discounted state visitation distribution. Evaluating the Q-function Q^θ(s, a) plays a key role in approximating ∇J(θ). Given the networked structure, we define the local Q-function for agent i as the discounted local reward, i.e.,

Q_i^θ(s, a) = E_{a(t) ∼ ζ^θ(·|s(t))} [ ∑_{t=0}^∞ γ^t r_i(t) | s(0) = s, a(0) = a ],

where we use r_i(t) to denote the local reward of agent i at time step t.
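The κ-hop neighborhoods N_i^κ introduced above index everything that follows: the dependence parameters, the truncated Q-estimates, and the state aggregation used later in this section. As a small illustration, here is a self-contained Python sketch that computes N_i^κ by breadth-first search on a toy adjacency-list graph of our own choosing; it is illustrative only and not code from the paper.

```python
from collections import deque

def k_hop_neighborhood(adj, i, kappa):
    """Return N_i^kappa: all agents within graph distance kappa of agent i
    (including i itself), for an undirected graph given as an adjacency list."""
    dist = {i: 0}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        if dist[u] == kappa:            # do not expand beyond kappa hops
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return sorted(dist)

# Example: a 5-agent line graph 0 - 1 - 2 - 3 - 4.
if __name__ == "__main__":
    adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
    print(k_hop_neighborhood(adj, 2, 1))   # [1, 2, 3]
    print(k_hop_neighborhood(adj, 0, 2))   # [0, 1, 2]
```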
Using local Q-functions, we can decompose the global Q-function as Q^θ(s, a) = (1/n) ∑_{i=1}^n Q_i^θ(s, a), which allows each node to evaluate its local Q-function separately.

Exponential decay is a powerful property that allows for the design of scalable, distributed algorithms for optimization and control in a variety of settings, e.g., [12, 13, 30, 32]. Broadly, it characterizes the fact that, in many contexts, the effect of an agent j on another agent i diminishes as their graph distance increases. In the networked MARL model we consider, we formalize an exponential decay property in the definition below, where N_{-i}^κ := N \ N_i^κ.

Definition 3.1.
The (c, ρ)-exponential decay property holds if, for any localized policy θ, for any i ∈ N, s_{N_i^κ} ∈ S_{N_i^κ}, s_{N_{-i}^κ}, s′_{N_{-i}^κ} ∈ S_{N_{-i}^κ}, a_{N_i^κ} ∈ A_{N_i^κ}, a_{N_{-i}^κ}, a′_{N_{-i}^κ} ∈ A_{N_{-i}^κ}, the local Q-function Q_i^θ satisfies

| Q_i^θ(s_{N_i^κ}, s_{N_{-i}^κ}, a_{N_i^κ}, a_{N_{-i}^κ}) − Q_i^θ(s_{N_i^κ}, s′_{N_{-i}^κ}, a_{N_i^κ}, a′_{N_{-i}^κ}) | ≤ c ρ^{κ+1}.     (7)

In the context of MARL, an exponential decay property was first used in [32], which considers purely local dependencies. Here, we show that an exponential decay property holds for more general dependency structures. Notice that obtaining this property is more challenging in our problem setting than in the local context of [32], because the agents can affect others not only through transition dependence but also through policy and reward dependence.

Lemma 3.1.
Let ξ = max(α_1, α_2 + β_1). Then the (c, ρ)-exponential decay property holds with ρ = γ^{1/ξ} and a constant c that depends only on γ, β_1, β_2, and ξ.

A proof of Lemma 3.1 can be found in Appendix I. From Lemma 3.1, we see that ρ = γ^{1/ξ} becomes larger as the dependence parameters (α_1, α_2, β_1, β_2) increase. Hence the bound in (7) becomes looser as the dependency becomes deeper.

Algorithm 1 Scalable Actor Critic
1: for m = 0, 1, 2, · · · do
2:   Sample the initial global state s(0) ∼ π_0.
3:   Each node i takes action a_i(0) ∼ ζ_i^{θ_i(m)}(· | s_{N_i^{β_1}}(0)) to obtain the global state s(1).
4:   Each node i records s_{N_i^κ}(0), a_{N_i^κ}(0), r_i(0) and initializes Q̂_i^0 to the all-zero vector.
5:   for t = 1, · · · , T do
6:     Each node i takes action a_i(t) ∼ ζ_i^{θ_i(m)}(· | s_{N_i^{β_1}}(t)) to obtain the global state s(t + 1).
7:     Each node i updates the local estimate Q̂_i with step size α_{t−1} = H/(t − 1 + t_0):
         Q̂_i^t( s_{N_i^κ}(t−1), a_{N_i^κ}(t−1) ) = (1 − α_{t−1}) Q̂_i^{t−1}( s_{N_i^κ}(t−1), a_{N_i^κ}(t−1) ) + α_{t−1}( r_i(t−1) + γ Q̂_i^{t−1}( s_{N_i^κ}(t), a_{N_i^κ}(t) ) ),
         Q̂_i^t( s_{N_i^κ}, a_{N_i^κ} ) = Q̂_i^{t−1}( s_{N_i^κ}, a_{N_i^κ} ) for ( s_{N_i^κ}, a_{N_i^κ} ) ≠ ( s_{N_i^κ}(t−1), a_{N_i^κ}(t−1) ).
8:   Each node i approximates ∇_{θ_i} J(θ) by
         ĝ_i(m) = ∑_{t=0}^{T} γ^t (1/n) ∑_{j ∈ N_i^κ} Q̂_j^T( s_{N_j^κ}(t), a_{N_j^κ}(t) ) ∇_{θ_i} log ζ_i^{θ_i(m)}( a_i(t) | s_{N_i^{β_1}}(t) ).
9:   Each node i conducts gradient ascent by θ_i(m + 1) = θ_i(m) + η_m ĝ_i(m).

We now present a novel Scalable Actor Critic algorithm (Algorithm 1) for the networked MARL problem, which exploits the exponential decay result in the previous section and generalizes the approach in [32]. The Critic part (from line 2 to line 7) uses the local trajectory {(s_{N_i^κ}(t), a_{N_i^κ}(t), r_i(t)) | t = 0, 1, · · · , T} to evaluate the local Q-functions under parameter θ(m). The Actor part (from line 8 to line 9) computes the estimated partial derivative using the estimated local Q-functions, and uses this partial derivative to update the local parameter θ_i. Compared with the Scalable Actor Critic algorithm proposed in [32], Algorithm 1 extends the policy/reward dependency structure from completely local to the β_1-/β_2-hop neighborhood, which adds considerable complexity.

Algorithm 1 is highly scalable. Each agent i only needs to query and store information within its κ-hop neighborhood during the learning process; a minimal code sketch of the Critic update is given below. The parameter κ can be set to balance accuracy and complexity. Specifically, as κ increases, the error bound becomes tighter at the expense of increased computation, communication, and space complexity.

We now present our main result, a finite-time error bound for the Scalable Actor Critic algorithm (Algorithm 1) that holds under general (non-local) dependencies. To that end, we first describe the assumptions needed for our result. The first assumption is the exponential decay property, which we have shown holds for general system parameters in Section 3.2.
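As the forward reference above indicates, here is a minimal Python sketch of the per-agent Critic update in Algorithm 1, maintaining Q̂_i as a dictionary keyed by the truncated pairs (s_{N_i^κ}, a_{N_i^κ}). The data layout (hashable tuples for neighborhood observations), the helper name, and the reward-indexing convention in the comments are our own illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict

def critic_update(Q_i, trajectory, gamma, H=2.0, t0=10.0):
    """Truncated local Q-estimate for agent i, in the spirit of the Critic
    update (line 7) of Algorithm 1.

    Q_i        : dict mapping (s_Nk, a_Nk) -> current estimate (missing keys read as 0)
    trajectory : list of (s_Nk, a_Nk, r_i) tuples recorded by agent i for t = 0, ..., T,
                 where s_Nk and a_Nk are hashable tuples over the kappa-hop neighborhood
    """
    for t in range(1, len(trajectory)):
        alpha = H / (t - 1 + t0)                        # step size alpha_{t-1}
        s_prev, a_prev, r_prev = trajectory[t - 1]      # pair and reward recorded at t-1
        s_cur, a_cur, _ = trajectory[t]
        target = r_prev + gamma * Q_i[(s_cur, a_cur)]   # bootstrap on the newest pair
        Q_i[(s_prev, a_prev)] = (1 - alpha) * Q_i[(s_prev, a_prev)] + alpha * target
    return Q_i

# Usage: Q_i = critic_update(defaultdict(float), trajectory, gamma=0.99)
```

Note that the table grows only with the number of distinct (s_{N_i^κ}, a_{N_i^κ}) pairs, which is why the per-agent memory and computation depend on the κ-hop neighborhood rather than on the global state/action space.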
Assumption 3.1. The (c, ρ)-exponential decay property holds for some ρ < 1.

Our second assumption concerns the Markov chain formed by the global state-action pair (s, a) under a fixed policy parameter θ; it is a standard assumption for finite-time convergence results in RL, e.g., [3, 31, 36].

Assumption 3.2.
Under any fixed policy θ, {z(t) := (s(t), a(t))} is an aperiodic and irreducible Markov chain on state space Z := S × A with a unique stationary distribution d^θ = (d^θ_z, z ∈ Z), which satisfies d^θ_z > 0, ∀ z ∈ Z. Define d^θ(z′) = ∑_{z ∈ Z : z_{N_i^κ} = z′} d^θ(z) and σ′ := inf_{z′ ∈ Z_{N_i^κ}} d^θ(z′). There exist positive constants K_1, K_2 which satisfy K_2 ≥ 1 and, ∀ z′ ∈ Z, ∀ t ≥ 0,

sup_{K ⊆ Z} | ∑_{z ∈ K} d^θ_z − ∑_{z ∈ K} P(z(t) = z | z(0) = z′) | ≤ K_1 exp(−t/K_2).

Recall that in TD learning with state aggregation (Section 2.2), we defined a surjection h that maps a state to an abstract state. To obtain a good approximate equivalence, we need to find a good h: if two states are mapped to the same abstract state, their value functions are required to be close (Theorem 2.2). In the context of networked MARL, the exponential decay property (Definition 3.1) provides a natural mapping h for state aggregation. To see this, for each agent i, let h map the global state/action to the local states/actions in agent i's κ-hop neighborhood, i.e., h(s, a) = (s_{N_i^κ}, a_{N_i^κ}). The exponential decay property guarantees that if h(s, a) = h(s′, a′), the difference in their Q-functions is upper bounded by c ρ^{κ+1}, which vanishes as κ increases. This idea leads to the following corollary, obtained by applying Theorem 2.2 to the networked MARL system.

Corollary 3.2.
Suppose Assumptions 3.1 and 3.2 hold. Let the step size be α_t = H/(t + t_0) with t_0 = max(4H, K_2 log T) and H ≥ 1/(σ′(1 − γ)). Then, inside outer-loop iteration m, for each i ∈ N, with probability at least 1 − δ, we have

sup_{(s,a) ∈ S×A} | Q_i^{θ(m)}(s, a) − Q̂_i^T(s_{N_i^κ}, a_{N_i^κ}) | ≤ C_a/√(T + t_0) + C′_a/(T + t_0) + c ρ^{κ+1}/(1 − γ),

where

C_a = (40H/(1 − γ)²) √(K_2 log T) · √(log T + log log T + log(f(κ)K_1/δ)),
C′_a = (8/(1 − γ)²) max( K_1 H log T/σ′ + 4K_1(1 + 2K_2 + 4H), K_2 log T + t_0 ).

The result in the literature most closely related to the above is Theorem 7 in [32]. In comparison, Corollary 3.2 applies to more general, potentially non-local, dependencies and also improves the constant term by a factor of 1/(1 − γ).

To analyze the Actor part of Algorithm 1, we make the following additional boundedness and Lipschitz continuity assumptions on the gradients. These are standard assumptions in the literature.

Assumption 3.3.
For any i, a_i, s_{N_i^{β_1}}, and θ_i, we assume ‖∇_{θ_i} log ζ_i^{θ_i}(a_i | s_{N_i^{β_1}})‖ ≤ L_i. Therefore, ‖∇_θ log ζ^θ(a | s)‖ ≤ L := √(∑_{i=1}^n L_i²). We further assume that ∇J(θ) is L′-Lipschitz continuous in θ.

Intuitively, to analyze the Actor part of the algorithm, we show that if every agent i has learned a good approximation of its local Q-function in the Critic part of Algorithm 1, then the Actor part can obtain a good approximation of a stationary point of the objective function. This is possible because the quality of the estimated policy gradient depends on the quality of the estimates of the Q-functions. We state this result below and defer its proof to Appendix J.

Theorem 3.3.
Suppose the inner-loop length T is sufficiently large such that T + 1 ≥ log_γ(c(1 − ρ)) + (κ + 1) log_γ ρ and that, with probability at least 1 − δ,

sup_{m ≤ M−1} sup_{i ∈ N} sup_{(s,a) ∈ S×A} | Q_i^{θ(m)}(s, a) − Q̂_i^T(s_{N_i^κ}, a_{N_i^κ}) | ≤ ι c ρ^{κ+1}/(1 − γ),

where ι is a positive constant. Suppose the actor step size satisfies η_m = η/√(m + 1) with η ≤ 1/L′. Define

C_M = 1/(η(1 − γ)) + (L/(1 − γ)) √(log M · log(1/δ)) + (L′L/(1 − γ)) η log M.

Then, with probability at least 1 − δ,

( ∑_{m=0}^{M−1} η_m ‖∇J(θ(m))‖ ) / ( ∑_{m=0}^{M−1} η_m ) ≤ C_M/√(M + 1) + 2(2 + ι) L c ρ^{κ+1}/(1 − γ).     (8)

Notice that, when T is sufficiently large, the assumptions in Theorem 3.3 can be satisfied by applying the union bound to the conclusion of Corollary 3.2. Define ε_κ := L c ρ^{κ+1}/(1 − γ). By combining Theorem 3.3 with Corollary 3.2, we see that Algorithm 1 finds an O(ε_κ)-approximation of a stationary point. This improves on [32] by a factor of 1/(1 − γ), despite the more general setting.

As for complexity, to reach an O(ε_κ)-approximate stationary point, the number of required iterations of the outer loop is M ≥ Ω̃( ε_κ^{-2} poly(L, L′, 1/(1 − γ)) ) and the number of required iterations of the inner loop is T ≥ Ω̃( ε_κ^{-2} poly(1/σ′, K_2, 1/(1 − γ)) ). Compared with [32], our result removes f(κ) from the polynomial term of the inner loop despite the more general setting. In conclusion, we show that we can learn a near-optimal localized policy in a scalable manner even under a much more general dependence structure than [32].

Finally, we illustrate the effectiveness of the Scalable Actor Critic algorithm (Algorithm 1) via a wireless communication example with non-local dependence. The details of the problem setting and simulation results are deferred to Appendix A due to space constraints.

Broader Impact

This paper contributes to the foundations of a growing literature that seeks to develop approaches for applying multi-agent reinforcement learning to the control of networked systems. The theoretical contributions here can potentially lead to improved RL-based algorithms with convergence guarantees, which can in turn be applied to improve the adaptive control of various socio-technical networked systems such as traffic systems, communication systems, and energy systems. However, it is important to be cautious when considering applications of the proposed algorithm in its current form. Results in this paper focus only on overall efficiency; the issue of fairness has not been considered. We see no ethical concerns related to this paper.
References [1] B. Bamieh, F. Paganini, and M. A. Dahleh. Distributed control of spatially invariant systems.
IEEE Transactions on automatic control , 47(7):1091–1107, 2002.[2] D. P. Bertsekas.
Dynamic Programming and Optimal Control, Vol. II . Athena Scientific, 3rdedition, 2007.[3] P. Bremaud.
Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues . Texts inApplied Mathematics. Springer New York, 2013.[4] L. Bu, R. Babu, B. De Schutter, et al. A comprehensive survey of multiagent reinforcementlearning.
IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications andReviews) , 38(2):156–172, 2008.[5] D. Chakrabarti, Y. Wang, C. Wang, J. Leskovec, and C. Faloutsos. Epidemic thresholds in realnetworks.
ACM Transactions on Information and System Security (TISSEC) , 10(4):1, 2008.[6] C. Claus and C. Boutilier. The dynamics of reinforcement learning in cooperative multiagentsystems.
AAAI/IAAI , 1998:746–752, 1998.[7] C. Dann, N. Jiang, A. Krishnamurthy, A. Agarwal, J. Langford, and R. E. Schapire. On oracle-efficient pac rl with rich observations. In
Advances in Neural Information Processing Systems ,pages 1422–1432, 2018.[8] T. Doan, S. Maguluri, and J. Romberg. Finite-time analysis of distributed TD(0) with lin-ear function approximation on multi-agent reinforcement learning. In K. Chaudhuri andR. Salakhutdinov, editors,
Proceedings of the 36th International Conference on Machine Learn-ing , volume 97 of
Proceedings of Machine Learning Research , pages 1626–1635, Long Beach,California, USA, 09–15 Jun 2019. PMLR.[9] T. T. Doan. Finite-time analysis and restarting scheme for linear two-time-scale stochasticapproximation, 2019.[10] K. Dong, Y. Wang, X. Chen, and L. Wang. Q-learning with ucb exploration is sample efficientfor infinite-horizon mdp. arXiv preprint arXiv:1901.09311 , 2019.[11] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcementlearning for continuous control. In
International Conference on Machine Learning , pages 1329–1338, 2016.[12] D. Gamarnik. Correlation decay method for decision, optimization, and inference in large-scalenetworks. In
Theory Driven by Influential Applications , pages 108–121. INFORMS, 2013.[13] D. Gamarnik, D. A. Goldberg, and T. Weber. Correlation decay in random decision networks.
Mathematics of Operations Research, 39(2):229–261, 2014. [14] H. Gu, X. Guo, X. Wei, and R. Xu. Q-learning for mean-field controls, 2020. [15] D. Lee and N. He. A unified switching system perspective and O.D.E. analysis of Q-learning algorithms.
ArXiv , abs/1912.02270, 2019.[16] N. Jiang. Notes on state abstractions. http://nanjiang.web.engr.illinois.edu/files/cs598/note4.pdf , 2018.[17] N. Jiang, A. Kulesza, and S. Singh. Abstraction selection in model-based reinforcement learning.In
International Conference on Machine Learning, pages 179–188, 2015. [18] N. K. Jong and P. Stone. State abstraction discovery from irrelevant state variables. In
IJCAI ,volume 8, pages 752–757, 2005.[19] T. Lattimore and M. Hutter. Pac bounds for discounted mdps. In N. H. Bshouty, G. Stoltz,N. Vayatis, and T. Zeugmann, editors,
Algorithmic Learning Theory , pages 320–334, Berlin,Heidelberg, 2012. Springer Berlin Heidelberg.[20] D. Li, D. Zhao, Q. Zhang, and Y. Chen. Reinforcement learning and deep learning based lateralcontrol for autonomous driving [application notes].
IEEE Computational Intelligence Magazine ,14(2):83–98, 2019.[21] L. Li, T. J. Walsh, and M. L. Littman. Towards a unified theory of state abstraction for MDPs.In
ISAIM , 2006.[22] M. Llas, P. M. Gleiser, J. M. López, and A. Díaz-Guilera. Nonequilibrium phase transitionin a model for the propagation of innovations among economic agents.
Physical Review E ,68(6):066101, 2003.[23] R. Lowe, Y. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch. Multi-agent actor-critic formixed cooperative-competitive environments. In
Advances in Neural Information ProcessingSystems , pages 6379–6390, 2017.[24] L. Matignon, G. J. Laurent, and N. Le Fort-Piat. Independent reinforcement learners in coopera-tive markov games: a survey regarding coordination problems.
The Knowledge EngineeringReview , 27(1):1–31, 2012.[25] W. Mei, S. Mohagheghi, S. Zampieri, and F. Bullo. On the dynamics of deterministic epidemicpropagation over networks.
Annual Reviews in Control , 44:116–128, 2017.[26] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves,M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep rein-forcement learning.
Nature , 518(7540):529, 2015.[27] N. Motee and A. Jadbabaie. Optimal control of spatially distributed systems.
IEEE Transactionson Automatic Control , 53(7):1616–1629, 2008.[28] M. J. Neely. Optimal backpressure routing for wireless networks with multi-receiver diversity.In , pages 18–25, 2006.[29] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of optimal queuing network control.
Mathematics of Operations Research , 24(2):293–305, 1999.[30] G. Qu and N. Li. Exploiting fast decaying and locality in multi-agent mdp with tree dependencestructure. arXiv preprint arXiv:1909.06900 , 2019.[31] G. Qu and A. Wierman. Finite-time analysis of asynchronous stochastic approximation and q -learning. arXiv preprint arXiv:2002.00260 , 2020.[32] G. Qu, A. Wierman, and N. Li. Scalable reinforcement learning of localized policies formulti-agent networked systems. arXiv preprint arXiv:1912.02906 , 2019.[33] D. Shah and Q. Xie. Q-learning with nearest neighbors. In Advances in Neural InformationProcessing Systems , pages 3111–3121, 2018.[34] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser,I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep neuralnetworks and tree search. nature , 529(7587):484, 2016.[35] S. P. Singh, T. Jaakkola, and M. I. Jordan. Reinforcement learning with soft state aggregation.In
Advances in Neural Information Processing Systems, pages 361–368, 1995. [36] R. Srikant and L. Ying. Finite-time error bounds for linear stochastic approximation and TD learning. pages 2803–2830, 2019. [37] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In
Proceedings of the 12th International Conferenceon Neural Information Processing Systems , NIPS’99, page 1057–1063, Cambridge, MA, USA,1999. MIT Press.[38] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning.
Machine learning ,16(3):185–202, 1994. 1039] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with functionapproximation.
IEEE Transactions on Automatic Control, 42(5):674–690, 1997. [40] J. N. Tsitsiklis and B. Van Roy. Analysis of temporal-difference learning with function approximation. In
Advances in neural information processing systems , pages 1075–1081, 1997.[41] W. Vogels, R. van Renesse, and K. Birman. The power of epidemics: Robust communicationfor large-scale distributed systems.
SIGCOMM Comput. Commun. Rev., 33(1):131–135, Jan. 2003. [42] M. J. Wainwright. Stochastic approximation with cone-contractive operators: Sharp ℓ∞-bounds for Q-learning. arXiv preprint arXiv:1905.06265, 2019. [43] S. Wang, V. Venkateswaran, and X. Zhang. Fundamental analysis of full-duplex gains in wireless networks. IEEE/ACM Transactions on Networking, 25(3):1401–1416, 2017. [44] Y. Wu, W. Zhang, P. Xu, and Q. Gu. A finite time analysis of two time-scale actor critic methods, 2020. [45] T. Xu, S. Zou, and Y. Liang. Two time-scale off-policy TD learning: Non-asymptotic analysis over Markovian samples. In
Advances in Neural Information Processing Systems 32 , pages10634–10644. Curran Associates, Inc., 2019.[46] K. Zhang, Z. Yang, and T. Ba¸sar. Multi-agent reinforcement learning: A selective overview oftheories and algorithms. arXiv preprint arXiv:1911.10635 , 2019.[47] K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Ba¸sar. Fully decentralized multi-agent reinforcementlearning with networked agents. arXiv preprint arXiv:1802.08757 , 2018.[48] R. Zhang and M. Pavone. Control of robotic mobility-on-demand systems: a queueing-theoretical perspective.
The International Journal of Robotics Research , 35(1-3):186–203,2016.[49] A. Zocca. Temporal starvation in multi-channel csma networks: an analytical framework.
Queueing Systems, 91(3-4):241–263, 2019.

Figure 1: Setup of user nodes and access points.
Appendices
A Application in Wireless Networks
We consider a wireless network with multiple access points, as shown in Fig. 1, where a set of user nodes, denoted by U = {u_1, u_2, · · · , u_n}, shares a set of access points Y = {y_1, y_2, · · · , y_m} [49]. Each access point y_i is associated with a probability p_i of successful transmission. Each user node u_i only has access to a subset Y_i ⊆ Y of the access points. Typically, this available set is determined by each user node's physical connections to the access points. To apply the networked MARL model, we identify the set of user nodes U with the set of agents N in Section 3. The underlying graph G = (N, E) is defined as the conflict graph, i.e., edge (u_i, u_j) ∈ E if and only if Y_i ∩ Y_j ≠ ∅.

At each time step t, each user u_i receives a packet with initial life span d with probability q. Each user maintains a queue to cache the packets it receives. At each time step, if a packet is successfully sent to an access point, it is removed from the queue; otherwise, its life span decreases by 1. A packet is discarded from the queue immediately if its remaining life span is 0. At each time step t, a user node u_i can choose to send one of the packets in its queue to one of the access points y_{i,t} ∈ Y_i. If no other user node sends packets to access point y_{i,t} at time step t, the packet from user i is delivered successfully with probability p_i; otherwise, the sending action fails. A user u_i receives a local reward of r_{i,t} = 1 immediately after successfully sending a packet at time step t, and receives r_{i,t} = 0 otherwise. Our objective is to find a policy that maximizes the global discounted reward under a discount factor 0 ≤ γ < 1:

E[ ∑_{i=1}^n ∑_{t=0}^∞ γ^t r_{i,t} ].

Before applying the Scalable Actor Critic algorithm we proposed, we first need to define the local state/action and specify the dependence parameters. Since each packet has a life span of d, and each user node receives at most one packet at a time step, we use a d-tuple s_i = (e_1, e_2, · · · , e_d) ∈ S_i := {0, 1}^d to denote the local state of user node i. Specifically, e_j indicates whether user node u_i has a packet with remaining life span j in its queue. A local action of user node u_i is a 2-tuple (l, y), which means sending the packet with remaining life span l ∈ {1, 2, · · · , d} to an access point y ∈ Y_i. Note that we also define an empty action that does nothing at all; if a user node performs an action (l, y) when there is no packet with life span l in its queue, we view this as an empty action. In this setting, the next local state of user node u_i depends on the current local states/actions in its 1-hop neighborhood (α_1 = α_2 = 1). We assume each user node can choose its action only based on its current local state (β_1 = 0). Due to potential collisions, the local reward of user u_i also depends on the states/actions in its 1-hop neighborhood (β_2 = 1). As a result, the dependence parameters in this setting are (α_1, α_2, β_1, β_2) = (1, 1, 0, 1), for which the results of [32] do not apply.

The detailed setting we use is as follows. We consider the setting where the user nodes are located in h × w grids (see Fig. 1). There are c user nodes in each grid, and each user can send packets to an access point on the corner of its grid. We set the initial life span d = 2, the arrival probability q = 0. , and the discount factor γ = 0. . The successful transmission probability p_i for each access point y_i is sampled uniformly at random from [0, 1]. We run the Scalable Actor Critic algorithm with parameter κ = 1 to learn a localized stochastic policy in two cases: (h, w, c) = (5, , ) (see Fig. 2) and (h, w, c) = (3, , ) (see Fig. 3). For comparison, we use a benchmark based on the localized ALOHA protocol. Specifically, the benchmark policy works as follows: at time step t, each user node u_i takes the empty action with a certain probability p′; otherwise, it sends the packet with the minimum remaining life span to a random access point in Y_i, with probability proportional to the successful transmission probability of that access point and inversely proportional to the number of users sharing it. In Fig. 2 and Fig. 3, we have tuned the parameter p′ to the value with the highest discounted reward.

Figure 2: Discounted reward in the training process (one user per grid).

Figure 3: Discounted reward in the training process (multiple users per grid).

As shown in Fig. 2 and Fig. 3, starting from the initial policy that chooses a local action uniformly at random, the Scalable Actor Critic algorithm with parameter κ = 1 learns a policy that performs better than the benchmark. As a remark, the benchmark policy requires the set {p_i}_{1 ≤ i ≤ m}, the probabilities of successful transmission, as input. Moreover, in the benchmark policy, the probability of performing an empty action needs to be tuned manually. In contrast, the Scalable Actor Critic algorithm can learn a better policy without these specific inputs by interacting with the system.

B γ-contraction of Operator Π F(Φ ·)

To show that the equation Π F(Φ x) = x has a unique solution x*, by the Banach–Caccioppoli fixed-point theorem, it suffices to show that the operator Π F(Φ ·) is a γ-contraction in ‖·‖_v.

Proposition B.1.
If Assumption 2.2 holds, the operator Π F(Φ ·) is a contraction in ‖·‖_v, i.e., for any x, y ∈ R^M, ‖Π F(Φ x) − Π F(Φ y)‖_v ≤ γ ‖x − y‖_v.

To prove this proposition, we first show that both the operator Π and the operator Φ are non-expansive in ‖·‖_v before combining them with F.

Proof of Proposition B.1.
We first show that operator Π is non-expansive in (cid:107)·(cid:107) v , i.e. for any x, y ∈ R N , we have (cid:107) Π x − Π y (cid:107) v ≤ (cid:107) x − y (cid:107) v . (9)Since Π is a linear operator, it suffices to show that for any x ∈ R N , (cid:107) Π x (cid:107) v ≤ (cid:107) x (cid:107) v . The ALOHA protocol was proposed in: L. Roberts. ALOHA packet system with and without slots and cap-ture. ACM SIGCOMM Computer Communication Review, 5:28–42, 04 1975. doi: 10.1145/1024916.1024920. ∀ j ∈ M , h − ( j ) := { i ∈ N | h ( i ) = j } . Using this notation, the j th element of vector Π x is given by (Π x ) j = 1 (cid:80) i ∈ h − ( j ) d i (Φ (cid:124) Dx ) j = 1 (cid:80) i ∈ h − ( j ) d i · (cid:88) i ∈ h − ( j ) d i x i . Hence we see that (cid:12)(cid:12)(cid:12) (Π x ) j (cid:12)(cid:12)(cid:12) v j ≤ (cid:80) i ∈ h − ( j ) d i · (cid:88) i ∈ h − ( j ) d i | x i | v j ≤ sup i ∈ h − ( j ) | x i | v j . (10)By taking sup j on both sides of (10), we see that (cid:107) Π x (cid:107) v = sup j ∈M (cid:12)(cid:12)(cid:12) (Π x ) j (cid:12)(cid:12)(cid:12) v j ≤ sup j ∈M sup i ∈ h − ( j ) | x i | v j = sup i ∈N | x i | v h ( i ) = (cid:107) x (cid:107) v , (11)where we use the definition of (cid:107)·(cid:107) v on R N in the last equation. Hence we have shown that Π isnon-expansive in (cid:107)·(cid:107) v (inequality (9)).We can also show that for any x, y ∈ R M , we have (cid:107) Φ x − Φ y (cid:107) v = (cid:107) x − y (cid:107) v . (12)Since Φ is a linear operator, we only need to show that for any x ∈ R M , (cid:107) Φ x (cid:107) v = (cid:107) x (cid:107) v .Since (Φ x ) i = x h ( i ) , ∀ i ∈ N , by the definition of (cid:107)·(cid:107) v on R N , we see that (cid:107) Φ x (cid:107) v = sup i ∈N | (Φ x ) i | v h ( i ) = sup i ∈N (cid:12)(cid:12) x h ( i ) (cid:12)(cid:12) v h ( i ) = sup j ∈M | x j | v j = (cid:107) x (cid:107) v . Hence we have shown that Φ is non-expansive in (cid:107)·(cid:107) v (equation (12)).Therefore, for any x, y ∈ R M , we have (cid:107) Π F (Φ x ) − Π F (Φ y ) (cid:107) v ≤ (cid:107) F (Φ x ) − F (Φ y ) (cid:107) v (13a) ≤ γ (cid:107) Φ x − Φ y (cid:107) v (13b) = γ (cid:107) x − y (cid:107) v , (13c)where we use (9) in (13a); Assumption 2.2 in (13b); (12) in (13c). C Proof of Theorem 2.1
The proof of Theorem 2.1 follows an outline similar to the proof of Theorem 4 in [31]. Specifically, we show an upper bound for ‖x(t) − x*‖_v by induction on the time step t. To do so, we divide the proof into three steps: in Step 1, we manipulate the update rule (2) so that it can be written as a recursion in the sequence ‖x(t) − x*‖_v (see Lemma C.1); in Step 2, we bound the effect of the noise terms in the recursive form obtained in Step 1; in Step 3, we combine the first two steps to finish the induction.

One of the main proof techniques used in [31] is to consider D_t = E[ e_{i_t} e_{i_t}^⊤ | F_{t−τ} ], the distribution of i_t conditioned on F_{t−τ}, in the coefficients of the recursive relationship for the sequence ‖x(t) − x*‖_v. However, this approach does not work in the more general setting we consider, because x* may not be a fixed point of the operator (Φ^⊤ D_t Φ)^{−1} Φ^⊤ D_t F(Φ ·). As a result, we cannot decompose ‖x(t) − x*‖_v recursively if we use D_t in the coefficients. To overcome this difficulty, we use D = diag(d_1, · · · , d_n), the stationary distribution of i_t, in the coefficients of the recursive relationship (Lemma C.1).

Now we begin the technical part of our proof.

Step 1: Decomposition of Error.
Let D t = E e i t e (cid:124) i t | F t − τ , where τ is a parameter that we willtune later. Then D t is a F t − τ -measurable n -by- n diagonal random matrix, with its i ’th entry being d t,i = P ( i t = i | F t − τ ) . Recall that D = diag ( d , · · · , d n ) , where d is the stationary distribution ofthe Markov Chain { i t } . 14otice that for all i ∈ N , we have e h ( i ) = Φ (cid:124) e i . We can rewrite the update rule as x ( t + 1) = x ( t ) + α t [ e (cid:124) i t F (Φ x ( t )) − e (cid:124) h ( i t ) x ( t ) + w ( t )] e h ( i t ) = x ( t ) + α t [ e h ( i t ) e (cid:124) i t F (Φ x ( t )) − e h ( i t ) e (cid:124) h ( i t ) x ( t ) + w ( t ) e h ( i t ) ]= x ( t ) + α t Φ (cid:124) (cid:2) e i t e (cid:124) i t ( F (Φ x ( t )) − Φ x ( t )) + w ( t ) e i t (cid:3) (14a) = x ( t ) + α t [Φ (cid:124) DF (Φ x ( t )) − Φ (cid:124) D Φ x ( t )]+ α t Φ (cid:124) (cid:2) ( e i t e (cid:124) i t − D ) ( F (Φ x ( t )) − Φ x ( t )) + w ( t ) e i t (cid:3) = x ( t ) + α t [Φ (cid:124) DF (Φ x ( t )) − Φ (cid:124) D Φ x ( t )]+ α t Φ (cid:124) (cid:2) ( e i t e (cid:124) i t − D ) ( F (Φ x ( t − τ )) − Φ x ( t − τ )) + w ( t ) e i t (cid:3) + α t Φ (cid:124) ( e i t e (cid:124) i t − D ) [ F (Φ x ( t )) − F (Φ x ( t − τ )) − Φ( x ( t ) − x ( t − τ ))]= ( I − α t Φ (cid:124) D Φ) x ( t ) + α t Φ (cid:124) DF (Φ x ( t )) + α t ( (cid:15) ( t ) + ψ ( t )) , (14b)where in (14a), we use e h ( i t ) = Φ (cid:124) e i t . Additionally, in (14b), we define (cid:15) ( t ) = Φ (cid:124) (cid:2) ( e i t e (cid:124) i t − D ) ( F (Φ x ( t − τ )) − Φ x ( t − τ )) + w ( t ) e i t (cid:3) and ψ ( t ) = Φ (cid:124) ( e i t e (cid:124) i t − D ) [ F (Φ x ( t )) − F (Φ x ( t − τ )) − Φ( x ( t ) − x ( t − τ ))] . We further decompose (cid:15) ( t ) as (cid:15) ( t ) = (cid:15) ( t ) + (cid:15) ( t ) , where (cid:15) ( t ) and (cid:15) ( t ) are defined as (cid:15) ( t ) = Φ (cid:124) (cid:2) ( e i t e (cid:124) i t − D t ) ( F (Φ x ( t − τ )) − Φ x ( t − τ )) + w ( t ) e i t (cid:3) and (cid:15) ( t ) = Φ (cid:124) ( D t − D ) ( F (Φ x ( t − τ )) − Φ x ( t − τ )) . We see that condition on F t − τ , the expected value of (cid:15) ( t ) is zero, i.e. E (cid:15) ( t ) | F t − τ = Φ (cid:124) E (cid:2) ( e i t e (cid:124) i t − D t ) | F t − τ (cid:3) [ F (Φ x ( t − τ )) − Φ x ( t − τ )] + Φ (cid:124) E [ E [ w ( t ) | F t ] e i t | F t − τ ]= 0 . Recall that matrix Π is defined as Π = (Φ (cid:124) D Φ) − Φ (cid:124) D. By expanding (14) recursively, we obtain that x ( t + 1) = t (cid:89) k = τ ( I − α k Φ (cid:124) D Φ) x ( τ ) + t (cid:88) k = τ α k (cid:32) t (cid:89) l = k +1 ( I − α l Φ (cid:124) D Φ) (cid:33) Φ (cid:124) DF (Φ x ( k ))+ t (cid:88) k = τ α k (cid:32) t (cid:89) l = k +1 ( I − α l Φ (cid:124) D Φ) (cid:33) ( (cid:15) ( k ) + ψ ( k ))= ˜ B τ − ,t x ( τ ) + t (cid:88) k = τ B k,t Π F (Φ x ( k )) + t (cid:88) k = τ α k ˜ B k,t ( (cid:15) ( k ) + ψ ( k )) , (15)where B k,t = α k (Φ (cid:124) D Φ) (cid:81) tl = k +1 ( I − α l Φ (cid:124) D Φ) and ˜ B k,t = (cid:81) tl = k +1 ( I − α l Φ (cid:124) D Φ) . For notation simplicity, we define D (cid:48) = Φ (cid:124) D Φ ∈ R M×M . Notice that D (cid:48) is a diagonal matrix in R M×M with the j ’th entry d (cid:48) j = (cid:80) j ∈ h − ( i ) d i . Clearly, B k,t and ˜ B k,t are m -by- m diagonal matrices,with the i ’th diagonal entry given by b k,t,i and ˜ b k,t,i , where b k,t,i = α k d (cid:48) i (cid:81) tl = k +1 (1 − α l d (cid:48) i ) and ˜ b k,t,i = (cid:81) tl = k +1 (1 − α l d (cid:48) i ) . 
Therefore, for any i ∈ M , we have ˜ b τ − ,t,i + t (cid:88) k = τ b k,t,i = 1 . (16)15lso, by the definition of σ (cid:48) , we have that for any i , almost surely b k,t,i ≤ β k,t := α k t (cid:89) l = k +1 (1 − α l σ (cid:48) ) , ˜ b k,t,i ≤ ˜ β k,t = t (cid:89) l = k +1 (1 − α l σ (cid:48) ) , where σ (cid:48) = min { d (cid:48) , · · · , d (cid:48) m } . Recall that x ∗ is the unique solution of the equation Π F (Φ x ∗ ) = x ∗ . Lemma C.1 shows that we canexpand the error term (cid:107) x ( t ) − x ∗ (cid:107) v recursively. Lemma C.1.
Let a t = (cid:107) x ( t ) − x ∗ (cid:107) v , we have almost surely, a t +1 ≤ ˜ β τ − ,t a τ + γ sup i ∈M t (cid:88) k = τ b k,t,i a k + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) t (cid:88) k = τ α k ˜ B k,t (cid:15) ( k ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) v + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) t (cid:88) k = τ α k ˜ B k,t ψ ( k ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) v . Proof of Lemma C.1.
By (15) and the triangle inequality of (cid:107)·(cid:107) v , we have (cid:107) x ( t + 1) − x ∗ (cid:107) v ≤ sup i ∈M v i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ˜ b τ − ,t,i x i ( τ ) + t (cid:88) k = τ b k,t,i (Π F (Φ x ( k ))) i − x ∗ i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) t (cid:88) k = τ α k ˜ B k,t (cid:15) ( k ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) v + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) t (cid:88) k = τ α k ˜ B k,t ψ ( k ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) v . (17)We also see that for each i ∈ M , v i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ˜ b τ − ,t,i x i ( τ ) + t (cid:88) k = τ b k,t,i (Π F (Φ x ( k ))) i − x ∗ i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ ˜ b τ − ,t,i v i | x i ( τ ) − x ∗ i | + t (cid:88) k = τ b k,t,i v i | (Π F (Φ x ( k ))) i − x ∗ i | (18a) ≤ ˜ b τ − ,t,i (cid:107) x ( τ ) − x ∗ (cid:107) v + t (cid:88) k = τ b k,t,i (cid:107) (Π F (Φ x ( k ))) − x ∗ (cid:107) v ≤ ˜ b τ − ,t,i (cid:107) x ( τ ) − x ∗ (cid:107) v + γ t (cid:88) k = τ b k,t,i (cid:107) x ( k ) − x ∗ (cid:107) v , (18b)where in (18a), we use (16) which says ˜ b τ − ,t,i + (cid:80) tk = τ b k,t,i = 1 holds for all i ∈ M ; in (18b), weuse Proposition B.1, which says Π F (Φ · ) is γ -contraction in (cid:107)·(cid:107) v with fixed point x ∗ . Therefore, by substituting (18) into (17), we obtain that a t +1 ≤ ˜ β τ − ,t a τ + γ sup i ∈M t (cid:88) k = τ b k,t,i a k + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) t (cid:88) k = τ α k ˜ B k,t (cid:15) ( k ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) v + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) t (cid:88) k = τ α k ˜ B k,t ψ ( k ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) v . Step 2: Bounding (cid:13)(cid:13)(cid:13)(cid:80) tk = τ α k ˜ B k,t (cid:15) ( k ) (cid:13)(cid:13)(cid:13) v and (cid:13)(cid:13)(cid:13)(cid:80) tk = τ α k ˜ B k,t ψ ( k ) (cid:13)(cid:13)(cid:13) v . We start with a bound on each individual (cid:15) ( k ) , (cid:15) ( k ) , and ψ ( k ) in Lemma C.2. Lemma C.2.
The following bounds hold almost surely.1. (cid:107) (cid:15) ( t ) (cid:107) v ≤ x + 2 C + ¯ wv := ¯ (cid:15). (cid:107) (cid:15) ( t ) (cid:107) v ≤ (2¯ x + C ) · K exp( − τ /K ) . (cid:107) ψ ( t ) (cid:107) v ≤ (cid:16) x + C + ¯ wv (cid:17) (cid:80) tk = t − τ +1 α k − . roof of Lemma C.2. By the definition of (cid:107)·(cid:107) v in R M and its extension to R N , the induced matrixnorm of (cid:107)·(cid:107) for a matrix A = [ a ij ] i ∈M ,j ∈N is given by (cid:107) A (cid:107) v = sup i ∈M (cid:80) j ∈N v h ( j ) v i | a ij | . Recallthat the i ’th entry of the diagonal matrix D t is given by d t,i = P ( i t = i | F t − τ ) . Hence we have that (cid:13)(cid:13) Φ (cid:124) ( e i t e (cid:124) i t − D t ) (cid:13)(cid:13) v = sup j ∈M (cid:88) i ∈N h ( i ) = j ) · | i = i t ) − d t,i | ≤ . (19)Therefore, we can upper bound (cid:107) (cid:15) ( t ) (cid:107) v by (cid:107) (cid:15) ( t ) (cid:107) v = (cid:13)(cid:13) Φ (cid:124) (cid:2) ( e i t e (cid:124) i t − D t ) ( F (Φ x ( t − τ )) − Φ x ( t − τ )) + w ( t ) e i t (cid:3)(cid:13)(cid:13) v ≤ (cid:13)(cid:13) Φ (cid:124) ( e i t e (cid:124) i t − D t ) (cid:13)(cid:13) v (cid:107) F (Φ x ( t − τ )) − Φ x ( t − τ ) (cid:107) v + | w ( t ) | (cid:107) Φ (cid:124) e i t (cid:107) v ≤ (cid:107) F (Φ x ( t − τ )) − Φ x ( t − τ ) (cid:107) v + | w ( t ) | (cid:107) Φ (cid:124) e i t (cid:107) v (20a) ≤ (cid:107) F (Φ x ( t − τ )) (cid:107) v + 2 (cid:107) x ( t − τ ) (cid:107) v + ¯ wv (20b) ≤ x + 2 C + ¯ wv , (20c)where we use (19) in (20a); the triangle inequality, the definition of ¯ v , and Assumption 2.3 in (20b);Assumption 2.2 in (20c).For (cid:107) (cid:15) ( t ) (cid:107) v , recall that (cid:107) (cid:15) ( t ) (cid:107) v = (cid:107) Φ (cid:124) ( D t − D ) ( F (Φ x ( t − τ )) − Φ x ( t − τ )) (cid:107) v = sup j ∈M v j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:88) i ∈N h ( i ) = j )( d t,i − d i ) ( F (Φ x ( t − τ )) − Φ x ( t − τ )) i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = sup j ∈M v j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) i ∈ H ( j ) ( d t,i − d i ) ( F (Φ x ( t − τ )) − Φ x ( t − τ )) i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . (21)By Assumption 2.1, we have that sup S⊆N (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:88) i ∈S d i − (cid:88) i ∈S d t,i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ K exp( − τ /K ) . (22)Our objective is to bound the following term in (21) for all j ∈ M : (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) i ∈ h − ( j ) ( d t,i − d i ) ( F (Φ x ( t − τ )) − Φ x ( t − τ )) i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . Let M j := sup i ∈ h − ( j ) | ( F (Φ x ( t − τ )) − Φ x ( t − τ )) i | . Define function g : [ − M j , M j ] N → R as g ( y ) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) i ∈ h − ( j ) ( d t,i − d i ) y i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . Suppose y max ∈ arg max y g ( y ) . We know that for i ∈ h − ( j ) , ( y max ) i is either M j or − M j if d t,i − d i (cid:54) = 0 . Let S j := { i ∈ h − ( j ) | ( y max ) i = M j } and S (cid:48) j := { i ∈ h − ( j ) | ( y max ) i = − M j } . 
Therefore, we see that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) i ∈ h − ( j ) ( d t,i − d i ) ( F (Φ x ( t − τ )) − Φ x ( t − τ )) i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ max y ∈ [ − M j ,M j ] N g ( y ) (23a) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) i ∈ S j ( d t,i − d i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) M j + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) i ∈ S (cid:48) j ( d t,i − d i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) M j K exp( − τ /K ) M j . (23b)where we use the definition of function g in (23a); we use (22) in (23b).Substituting (23) into (21) gives that (cid:107) (cid:15) ( t ) (cid:107) v ≤ (cid:107) F (Φ x ( t − τ )) − Φ x ( t − τ ) (cid:107) v · K exp( − τ /K ) ≤ ( (cid:107) F (Φ x ( t − τ )) (cid:107) v + (cid:107) Φ x ( t − τ ) (cid:107) v ) · K exp( − τ /K ) (24a) ≤ (2¯ x + C ) · K exp( − τ /K ) , (24b)where we use the triangle inequality in (24a); we use Assumption 2.2 in (24b).As for (cid:107) ψ ( t ) (cid:107) v , we have the following bound (cid:107) ψ ( t ) (cid:107) v = (cid:13)(cid:13) Φ (cid:124) ( e i t e (cid:124) i t − D ) ( F (Φ x ( t )) − F (Φ x ( t − τ ))) − Φ (cid:124) ( e i t e (cid:124) i t − D )Φ ( x ( t ) − x ( t − τ )) (cid:13)(cid:13) v ≤ (cid:13)(cid:13) Φ (cid:124) ( e i t e (cid:124) i t − D ) ( F (Φ x ( t )) − F (Φ x ( t − τ ))) (cid:13)(cid:13) v + (cid:13)(cid:13) Φ (cid:124) ( e i t e (cid:124) i t − D )Φ ( x ( t ) − x ( t − τ )) (cid:13)(cid:13) v ≤ (cid:13)(cid:13) Φ (cid:124) ( e i t e (cid:124) i t − D ) (cid:13)(cid:13) v · (cid:107) ( F (Φ x ( t )) − F (Φ x ( t − τ ))) (cid:107) v + (cid:13)(cid:13) Φ (cid:124) ( e i t e (cid:124) i t − D )Φ (cid:13)(cid:13) v · (cid:107) ( x ( t ) − x ( t − τ )) (cid:107) v . (25)Notice that (cid:13)(cid:13) Φ (cid:124) ( e i t e (cid:124) i t − D )Φ (cid:13)(cid:13) v = (cid:13)(cid:13)(cid:13) e h ( i t ) e (cid:124) h ( i t ) − D (cid:48) (cid:13)(cid:13)(cid:13) v = sup j ∈M (cid:12)(cid:12) h ( i t ) = j ) − d (cid:48) j (cid:12)(cid:12) ≤ . Substituting this into (25) and use (19), we obtain that (cid:107) ψ ( t ) (cid:107) v ≤ (cid:107) F (Φ x ( t )) − F (Φ x ( t − τ )) (cid:107) v + (cid:107) x ( t ) − x ( t − τ ) (cid:107) v ≤ (cid:107) x ( t ) − x ( t − τ ) (cid:107) v ≤ t (cid:88) k = t − τ +1 (cid:107) x ( k ) − x ( k − (cid:107) v . (26)By the update rule of x and Assumption 2.2, we have that (cid:107) x ( t ) − x ( t − (cid:107) v ≤ α t − (cid:18) (cid:107) F (Φ x ( t − (cid:107) v + (cid:107) x ( t − (cid:107) v + ¯ wv (cid:19) ≤ α t − (cid:18) x + C + ¯ wv (cid:19) . (27)Substituting (27) into (26), we obtain that (cid:107) ψ ( t ) (cid:107) v ≤ (cid:18) x + C + ¯ wv (cid:19) t (cid:88) k = t − τ +1 α k − . Lemma C.3. If α t = Ht + t , where H > σ (cid:48) and t ≥ max(4 H, τ ) , then β k,t , ˜ β k,t satisfies thefollowing1. β k,t ≤ Hk + t (cid:16) k +1+ t t +1+ t (cid:17) σ (cid:48) H , ˜ β k,t ≤ (cid:16) k +1+ t t +1+ t (cid:17) σ (cid:48) H . (cid:80) tk =1 β k,t ≤ Hσ (cid:48) t +1+ t . (cid:80) tk = τ β k,t (cid:80) kl = k − τ +1 α l − ≤ Hτσ (cid:48) t +1+ t . Proof of Lemma C.3.
To show Lemma C.3, we only need to substitute $\sigma'$ for $\sigma$ in the proof of [31][Lemma 10].

Lemma C.4. The following inequality holds almost surely:
$$\left\|\sum_{k=\tau}^{t}\alpha_k \tilde B_{k,t}\,\psi(k)\right\|_v \le 4\left(2\bar x + C + \frac{\bar w}{\underline v}\right)\frac{H\tau}{\sigma'}\cdot\frac{1}{t+1+t_0} =: \frac{C_\psi}{t+1+t_0}.$$

Proof of Lemma C.4.
We have that (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) t (cid:88) k = τ α k ˜ B k,t ψ ( k ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) v ≤ t (cid:88) k = τ α k (cid:13)(cid:13)(cid:13) ˜ B k,t (cid:13)(cid:13)(cid:13) v (cid:107) ψ ( k ) (cid:107) v ≤ (cid:18) x + C + ¯ wv (cid:19) t (cid:88) k = τ β k,t k (cid:88) l = k − τ +1 α l − (28a) ≤ (cid:16) x + C + ¯ wv (cid:17) Hτσ (cid:48) t + 1 + t , (28b)where we use Lemma C.2 in (28a); Lemma C.3 in (28b). Lemma C.5.
For each $t$, with probability at least $1-\delta$, we have
$$\left\|\sum_{k=\tau}^{t}\alpha_k \tilde B_{k,t}\,\epsilon_1(k)\right\|_v \le \frac{H\bar\epsilon}{t+t_0}\sqrt{2\tau t\log\left(\frac{2\tau m}{\delta}\right)}.$$
To show Lemma C.5, we need to use Lemma C.6, which is Lemma 13 in [31].
Lemma C.6.
Let $X_t$ be an $\mathcal F_t$-adapted stochastic process which satisfies $\mathbb E\left[X_t \mid \mathcal F_{t-\tau}\right] = 0$. Further, suppose $|X_t| \le \bar X_t$ almost surely. Then, with probability at least $1-\delta$, we have
$$\left|\sum_{k=0}^{t} X_k\right| \le \sqrt{2\tau \sum_{k=0}^{t} \bar X_k^2 \log\left(\frac{2\tau}{\delta}\right)}.$$

Proof of Lemma C.5.
Recall that (cid:80) k = τ α k ˜ B k,t (cid:15) ( k ) is a random vector in R M , with its i ’th entry t (cid:88) k = τ α k ( (cid:15) ) i ( k ) t (cid:89) l = k +1 (1 − α l d (cid:48) i ) . Since step sizes { α l } are deterministic, we see that E (cid:34) α k ( (cid:15) ) i ( k ) t (cid:89) l = k +1 (1 − α l d (cid:48) i ) | F k − τ (cid:35) = α k t (cid:89) l = k +1 (1 − α l d (cid:48) i ) E [( (cid:15) ) i ( k ) | F k − τ ] = 0 . Notice that α k t (cid:89) l = k +1 (1 − α l d (cid:48) i ) = Hk + t t (cid:89) l = k +1 (cid:18) − Hd (cid:48) i l + t (cid:19) (29a) ≤ Hk + t t (cid:89) l = k +1 (cid:18) − l + t (cid:19) (29b) ≤ Hk + t t (cid:89) l = k +1 (cid:18) − l + t (cid:19) ≤ Ht + t , where we use α l = Hl + t in (29a); we use H > σ (cid:48) in (29b).19y the definition of ¯ (cid:15) , we also see that | ( (cid:15) ) i ( k ) | ≤ v i ¯ (cid:15). Therefore, by Lemma C.6, we obtain that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) t (cid:88) k = τ α k ( (cid:15) ) i ( k ) t (cid:89) l = k +1 (1 − α l d (cid:48) i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ Hv i ¯ (cid:15)t + t (cid:115) τ t log (cid:18) τδ (cid:19) holds with probability at least − δ . By union bound, we see that with probability at least − δ , (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) t (cid:88) k = τ α k ˜ B k,t (cid:15) ( k ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) v ≤ H ¯ (cid:15)t + t (cid:115) τ t log (cid:18) τ mδ (cid:19) . Lemma C.7.
If we set $\tau$ to be an integer such that $\tau \ge 2K_2\max(\log t, 1)$, we have that
$$\left\|\sum_{k=\tau}^{t}\alpha_k \tilde B_{k,t}\,\epsilon_2(k)\right\|_v \le \frac{C_{\epsilon_2}}{t+t_0+1},$$
where $t_0 = \max(\tau, 4H)$ and $C_{\epsilon_2} = (2\bar x + C)\cdot 2K_1(1+2K_2+4H)$.

Proof of Lemma C.7.
Since K ≥ , the bound is trivial when t = 1 . We consider the case when t ≥ below.Since α k ˜ B k,t is a diagonal matrix and its entries are positive and less than , we have that (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) t (cid:88) k = τ α k ˜ B k,t (cid:15) ( k ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) v ≤ t (cid:88) k = τ (cid:13)(cid:13)(cid:13) α k ˜ B k,t (cid:13)(cid:13)(cid:13) v · (cid:107) (cid:15) ( k ) (cid:107) v ≤ t (cid:107) (cid:15) ( k ) (cid:107) v (30a) ≤ t (2¯ x + C ) · K exp( − τ /K ) . (30b)where we use (cid:13)(cid:13)(cid:13) α k ˜ B k,t (cid:13)(cid:13)(cid:13) v ≤ in (30a); Lemma C.2 in (30b).To show Lemma C.7, we only need to show t (2¯ x + C ) · K ( t + τ + 4 H ) exp( − τ /K ) ≤ C (cid:15) (31)holds for all τ ≥ K log t because t + t + 1 ≤ t + τ + 4 H. To study how the left hand side of (31) changes with τ , we define function g ( τ ) = ( τ + t + 4 H ) exp( − τ /K ) . Notice that we view τ as real number in function g , so we can get the derivative of g : g (cid:48) ( τ ) = exp( − τ /K ) K ( K − t − H − τ ) . Therefore, when τ ≥ K log t , we always have g (cid:48) ( τ ) < . Hence we obtain that g ( τ ) ≤ g (2 K log t ) = 2 K log t + t + 4 Ht ≤ K + 4 Ht (32)holds for all τ ≥ K log t. Substituting (32) into (31) finishes the proof.
Step 3: Bounding the error sequence.
Based on the recursive relationship we derived in LemmaC.1 and the bounds we obtained in Step 2, we want to show that, with probability − δ , a t ≤ C a √ t + t + C (cid:48) a t + t , (33)20olds for all τ ≤ t ≤ T , where C a = 2 H ¯ (cid:15) − γ (cid:115) τ log (cid:18) τ mTδ (cid:19) , C (cid:48) a = 41 − γ max ( C ψ + C (cid:15) , x ( τ + t )) . Notice that C a and C (cid:48) a are independent of t but may dependent on T . We set τ = 2 K log T. By applying union bound to Lemma C.5, we see that with probability at least − δ , for any t ≤ T , (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) t (cid:88) k = τ α k ˜ B k,t (cid:15) ( k ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) v ≤ C (cid:15) √ t + 1 + t , where C (cid:15) = H ¯ (cid:15) (cid:113) τ log (cid:0) τmTδ (cid:1) .Therefore, we get with probability − δ , (34) holds for all τ ≤ t ≤ T : a t +1 ≤ ˜ β τ − ,t a τ + γ sup i ∈M t (cid:88) k = τ b k,t,i a k + C (cid:15) √ t + 1 + t + C ψ + C (cid:15) t + 1 + t . (34)We now condition on (34) to show (33) by induction. (33) is true for t = τ , as C (cid:48) a τ + t ≥ − γ ¯ x ≥ a τ ,where we have used a τ = (cid:107) x ( τ ) − x ∗ (cid:107) v ≤ (cid:107) x ( τ ) (cid:107) v + (cid:107) x ∗ (cid:107) v ≤ x. Then, assuming (33) is true forup to k ≤ t . By (34), we have that a t +1 ≤ ˜ β τ − ,t a τ + γ sup i ∈M t (cid:88) k = τ b k,t,i (cid:20) C a √ k + t + C (cid:48) a k + t (cid:21) + C (cid:15) √ t + 1 + t + C ψ + C (cid:15) t + 1 + t ≤ ˜ β τ − ,t a τ + γC a sup i ∈M t (cid:88) k = τ b k,t,i √ k + t + γC (cid:48) a sup i ∈M t (cid:88) k = τ k + t b k,t,i + C (cid:15) √ t + 1 + t + C ψ + C (cid:15) t + 1 + t . (35)We use the following auxiliary lemma to handle the second and the third term in (35). Lemma C.8. If σ (cid:48) H (1 − √ γ ) ≥ , t ≥ , and α ≤ , then, for any i ∈ N , and any < ω ≤ ,we have t (cid:88) k = τ b k,t,i k + t ) ω ≤ √ γ ( t + 1 + t ) ω . Proof of Lemma C.8.
Recall that α k = Hk + t , and b k,t,i = α k d (cid:48) i (cid:81) tl = k +1 (1 − α l d (cid:48) i ) , where d (cid:48) i ≥ σ (cid:48) .Define e t = (cid:80) tk = τ b k,t,i k + t ) ω . We use induction on t to show that e t ≤ √ γ ( t +1+ t ) ω . The statement is clearly true for t = τ . Assume it is true for t − . Notice that e t = t − (cid:88) k = τ b k,t,i k + t ) ω + b t,t,i t + t ) ω = (1 − α t d (cid:48) i ) t − (cid:88) k = τ b k,t − ,i k + t ) ω + α t d (cid:48) i t + t ) ω (36a) = (1 − α t d (cid:48) i ) e t − + α t d (cid:48) i t + t ) ω ≤ (1 − α t d (cid:48) i ) 1 √ γ ( t + t ) ω + α t d (cid:48) i t + t ) ω (36b) = [1 − α t d (cid:48) i (1 − √ γ )] 1 √ γ ( t + t ) ω , b t,t,i = α t d (cid:48) i in (36a); we use the induction assumption in (36b).Plugging in α t = Ht + t , we see that e t ≤ (cid:20) − σ (cid:48) Ht + t (1 − √ γ ) (cid:21) √ γ ( t + t ) ω (37a) = (cid:20) − σ (cid:48) Ht + t (1 − √ γ ) (cid:21) (cid:18) t + t (cid:19) ω √ γ ( t + 1 + t ) ω ≤ (cid:18) − t + t (cid:19) (cid:18) t + t (cid:19) ω √ γ ( t + 1 + t ) ω (37b) ≤ (cid:18) − t + t (cid:19) (cid:18) t + t (cid:19) √ γ ( t + 1 + t ) ω (37c) ≤ √ γ ( t + 1 + t ) ω , where we use d (cid:48) i ≥ σ (cid:48) in (37a); we use the assumption that σ (cid:48) H (1 − √ γ ) ≥ in (37b); we use < ω ≤ in (37c).Applying Lemma C.8 to (35), we see that a t +1 ≤ ˜ β τ − ,t a τ + √ γC a √ t + 1 + t + √ γC (cid:48) a t + 1 + t + C (cid:15) √ t + 1 + t + ( C ψ + C (cid:15) ) 1 t + 1 + t (38a) ≤ (cid:18) √ γC a √ t + 1 + t + C (cid:15) √ t + 1 + t (cid:19) + (cid:32) √ γC (cid:48) a t + 1 + t + ( C ψ + C (cid:15) ) 1 t + 1 + t + (cid:18) τ + t t + 1 + t (cid:19) σ (cid:48) H a τ (cid:33) , (38b)where we use Lemma C.8 in (38a); we use the bound on ˜ β τ − ,t in Lemma C.3 in (38b).To bound the two terms in (38b), we define χ t := √ γC a √ t + 1 + t + C (cid:15) √ t + 1 + t and χ (cid:48) t = √ γC (cid:48) a t + 1 + t + ( C ψ + C (cid:15) ) 1 t + 1 + t + (cid:18) τ + t t + 1 + t (cid:19) σ (cid:48) H a τ . To finish the induction, it suffices to show that χ t ≤ C a √ t +1+ t and χ (cid:48) t ≤ C (cid:48) a t +1+ t . To see this χ t √ t + 1 + t C a = √ γ + C (cid:15) C a , χ (cid:48) t t + 1 + t C (cid:48) a = √ γ + C ψ + C (cid:15) C (cid:48) a + a τ ( τ + t ) C (cid:48) a (cid:18) τ + t t + 1 + t (cid:19) σ (cid:48) H − . It suffices to show that C (cid:15) C a ≤ − √ γ , C ψ + C (cid:15) C (cid:48) a ≤ −√ γ , and a τ ( τ + t ) C (cid:48) a ≤ −√ γ . Recall that C a = 2 H ¯ (cid:15) − γ (cid:115) τ log (cid:18) τ mTδ (cid:19) , C (cid:48) a = 41 − γ max ( C ψ + C (cid:15) , x ( τ + t )) , and C (cid:15) = H ¯ (cid:15) (cid:115) τ log (cid:18) τ mTδ (cid:19) . Using that a τ ≤ x , one can check that C a and C (cid:48) a satisfy the above three inequalities.22 Parameter Upper Bound
Proposition D.1.
Suppose Assumptions 2.2 and 2.3 hold. Then for all $t$,
$$\|x(t)\|_v \le \frac{1}{1-\gamma}\left((1+\gamma)\|y^*\|_v + \frac{\bar w}{\underline v}\right)$$
holds almost surely, where $y^* \in \mathbb R^N$ is the stationary point of $F$.

Proof of Proposition D.1. By Assumption 2.2, we have that for all $x \in \mathbb R^M$,
$$\|F(\Phi x)\|_v \le \|F(\Phi x) - F(y^*)\|_v + \|F(y^*)\|_v \quad (39a)$$
$$\le \gamma\|\Phi x - y^*\|_v + \|y^*\|_v \quad (39b)$$
$$\le \gamma\|x\|_v + (1+\gamma)\|y^*\|_v, \quad (39c)$$
where we use the triangle inequality in (39a) and (39c); we use Assumption 2.2 in (39b).

Let $\bar x = \frac{1}{1-\gamma}\left((1+\gamma)\|y^*\|_v + \frac{\bar w}{\underline v}\right)$. We prove $\|x(t)\|_v \le \bar x$ by induction on $t$. Since we initialize $x(0)$ to be $0$, the statement is true for $t = 0$.

Suppose the statement is true for $t$. By the update rule of $x$, we see that
$$\frac{1}{v_{h(i_t)}}\left|x_{h(i_t)}(t+1)\right| \le (1-\alpha_t)\frac{1}{v_{h(i_t)}}\left|x_{h(i_t)}(t)\right| + \alpha_t\left(\frac{1}{v_{h(i_t)}}\left|F_{i_t}(\Phi x(t))\right| + \frac{1}{v_{h(i_t)}}|w(t)|\right)$$
$$\le (1-\alpha_t)\|x(t)\|_v + \alpha_t\left(\|F(\Phi x(t))\|_v + \frac{\bar w}{\underline v}\right) \quad (40a)$$
$$\le (1-\alpha_t)\|x(t)\|_v + \alpha_t\left(\gamma\|x(t)\|_v + (1+\gamma)\|y^*\|_v + \frac{\bar w}{\underline v}\right) \quad (40b)$$
$$\le (1-\alpha_t)\bar x + \alpha_t\left(\gamma\bar x + (1+\gamma)\|y^*\|_v + \frac{\bar w}{\underline v}\right) \quad (40c)$$
$$= \bar x,$$
where we use Assumption 2.3 in (40a); (39) in (40b); the induction assumption in (40c).

For $j \ne h(i_t)$, $j \in \mathcal M$, we have that
$$\frac{1}{v_j}|x_j(t+1)| = \frac{1}{v_j}|x_j(t)| \le \|x(t)\|_v \le \bar x. \quad (41)$$
Combining (40) and (41), we see that the statement also holds for $t+1$. Hence we have shown $\|x(t)\|_v \le \bar x$ by induction.

E Asymptotic Convergence of TD Learning with State Aggregation
Our asymptotic convergence result for TD learning with state aggregation builds upon the asymptotic convergence result for TD learning with linear function approximation shown in [39]. For completeness, we first present the main result of [39] in Theorem E.1. In order to do this, we must first state a few definitions and assumptions made in [39].

We use $\phi(i) \in \mathbb R^m$ to denote the feature vector associated with state $i \in \mathcal N$. The feature matrix $\Phi$ is an $n$-by-$m$ matrix whose $i$'th row is $\phi(i)^\intercal$. Starting from $\theta(0) = 0$, the TD($\lambda$) algorithm keeps updating $\theta$ and $\psi$ by the update rule
$$\theta(t+1) = \theta(t) + \alpha_t d_t \psi_t, \qquad \psi_{t+1} = \gamma\lambda\psi_t + \phi(i_{t+1}),$$
where $d_t = r(i_t, i_{t+1}) + \gamma\phi(i_{t+1})^\intercal\theta(t) - \phi(i_t)^\intercal\theta(t)$ is the temporal difference, and $\psi_t$ is named the eligibility vector in [39] and satisfies $\psi_0 = \phi(i_0)$.

Recall that $D = \mathrm{diag}(d_1, d_2, \cdots, d_n)$, where $d$ is the stationary distribution of the Markov chain $\{i_t\}$. For vectors $x, y \in \mathbb R^n$, we define the inner product $\langle x, y\rangle_D = x^\intercal D y$. The induced norm of this inner product is $\|\cdot\|_D = \sqrt{\langle\cdot,\cdot\rangle_D}$. Let $L(\mathcal N, D)$ denote the set of vectors $V \in \mathbb R^n$ such that $\|V\|_D$ is finite.

Recall that we define $\Pi = (\Phi^\intercal D\Phi)^{-1}\Phi^\intercal D$. As shown in [39], the projection matrix that projects an arbitrary vector in $\mathbb R^n$ onto the set $\{\Phi\theta \mid \theta \in \mathbb R^m\}$ is given by $\Phi\Pi$, i.e. for any $V \in L(\mathcal N, D)$, we have
$$\Phi\Pi V = \arg\min_{\bar V \in \{\Phi\theta \mid \theta\in\mathbb R^m\}}\left\|V - \bar V\right\|_D.$$
Notice that our definition of the matrix $\Pi$ is slightly different from [39] because we want to be consistent with Section 2.1.

To characterize the dynamics of the TD($\lambda$) algorithm, [39] defines the operator $T^{(\lambda)}: L(\mathcal N, D) \to L(\mathcal N, D)$ as follows: for all $V \in \mathbb R^n$, the $i$'th dimension of $T^{(\lambda)}V$ is defined as
$$\left(T^{(\lambda)}V\right)_i = \begin{cases}(1-\lambda)\sum_{m=0}^{\infty}\lambda^m\,\mathbb E\left[\sum_{t=0}^{m}\gamma^t r(i_t, i_{t+1}) + \gamma^{m+1}V_{i_{m+1}} \,\middle|\, i_0 = i\right] & \text{if } \lambda < 1,\\[4pt] \mathbb E\left[\sum_{t=0}^{\infty}\gamma^t r(i_t, i_{t+1}) \,\middle|\, i_0 = i\right] & \text{if } \lambda = 1.\end{cases}$$
If $V$ is an approximation of the value function $V^*$, then $T^{(\lambda)}V$ can be viewed as an improved approximation of $V^*$. Notice that when $\lambda = 0$, $T^{(\lambda)}$ is identical to the Bellman operator.
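To make the iteration above concrete, here is a minimal Python sketch of TD($\lambda$) with linear function approximation on a small synthetic chain. The chain, rewards, features, and step-size schedule are illustrative placeholders (not objects from the paper); $d_t$ is the temporal difference defined above.

import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 3                       # number of states and number of features (illustrative sizes)
gamma, lam = 0.9, 0.5             # discount factor and the lambda parameter
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # synthetic transition matrix
r = rng.random((n, n))            # synthetic rewards r(i, i')
Phi = rng.random((n, m))          # synthetic feature matrix, full column rank with probability one

theta = np.zeros(m)
i = 0
psi = Phi[i].copy()               # eligibility vector, psi_0 = phi(i_0)
for t in range(50000):
    alpha = 1.0 / (t + 100)                                  # step sizes satisfying Assumption E.3
    i_next = rng.choice(n, p=P[i])                           # sample i_{t+1}
    d = r[i, i_next] + gamma * Phi[i_next] @ theta - Phi[i] @ theta   # temporal difference d_t
    theta = theta + alpha * d * psi                          # theta(t+1) = theta(t) + alpha_t d_t psi_t
    psi = gamma * lam * psi + Phi[i_next]                    # psi_{t+1} = gamma*lambda*psi_t + phi(i_{t+1})
    i = i_next
print(theta)   # approaches the limit theta* characterized by Theorem E.1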
Formally, [39] makes four assumptions for their main result (Theorem E.1). We omit the third assumption ([39][Assumption 3]) in our summary because it must hold when the state space $\mathcal N$ is finite.

The first assumption ([39][Assumption 1]) concerns the stationary distribution and the reward function of the Markov chain $\{i_t\}$. It must hold when Assumption 2.1 holds and every stage reward $r_t$ is upper bounded by $\bar r$, as assumed in Theorem 2.2.

Assumption E.1. The transition probabilities and the reward function satisfy the following two conditions:
1. The Markov chain $\{i_t\}$ is irreducible and aperiodic. Furthermore, there is a unique distribution $d$ that satisfies $d^\intercal P = d^\intercal$ with $d_i > 0$ for all $i \in \mathcal N$. Let $\mathbb E_0$ stand for expectation with respect to this distribution.
2. The reward function $r(i_t, i_{t+1})$ satisfies $\mathbb E_0\left[r^2(i_t, i_{t+1})\right] < \infty$.

The second assumption ([39][Assumption 2]) concerns the feature vectors and the feature matrix. It must hold when $\Phi$ is defined as in (3).

Assumption E.2.
The following two conditions hold for $\Phi$:
1. The matrix $\Phi$ has full column rank; that is, the $m$ columns (named basis functions in [39]) $\{\phi_k \mid k = 1, \cdots, m\}$ are linearly independent.
2. For every $k$, the basis function $\phi_k$ satisfies $\mathbb E_0\left[\phi_k^2(i_t)\right] < \infty$.

The third assumption ([39][Assumption 4]) concerns the learning step sizes. It must hold if the learning step sizes are as defined in Theorem 2.2.
Assumption E.3.
The step sizes $\alpha_t$ are positive, nonincreasing, and chosen prior to execution of the algorithm. Furthermore, they satisfy $\sum_{t=0}^{\infty}\alpha_t = \infty$ and $\sum_{t=0}^{\infty}\alpha_t^2 < \infty$.

Now we are ready to present the main asymptotic convergence result given in [39].
Theorem E.1.
Under Assumptions E.1, E.2, and E.3, the following hold.
1. The value function $V^*$ is in $L(\mathcal N, D)$.
2. For any $\lambda \in [0, 1]$, the TD($\lambda$) algorithm with linear function approximation converges with probability one.
3. The limit of convergence $\theta^*$ is the unique solution of the equation $\Pi T^{(\lambda)}(\Phi\theta^*) = \theta^*$.
4. Furthermore, $\theta^*$ satisfies
$$\|\Phi\theta^* - V^*\|_D \le \frac{1-\lambda\gamma}{1-\gamma}\,\|\Phi\Pi V^* - V^*\|_D. \qquad (42)$$

Notice that (42) is not exactly the result we want to obtain. Specifically, we want both sides of (42) to be in $\|\cdot\|_\infty$ instead of $\|\cdot\|_D$. Although this kind of result is not obtainable for general TD learning with linear function approximation, we can leverage the special assumptions for state aggregation, which are summarized below:

Assumption E.4. $h: \mathcal N \to \mathcal M$ is a surjective function from set $\mathcal N$ to $\mathcal M$. The feature matrix $\Phi$ is as defined in (3), i.e. the feature vector associated with state $i \in \mathcal N$ is given by
$$\phi_k(i) = \begin{cases} 1 & \text{if } k = h(i),\\ 0 & \text{otherwise},\end{cases} \qquad \forall k \in \mathcal M.$$
Further, if $h(i) = h(i')$ for $i, i' \in \mathcal N$, we have $|V^*(i) - V^*(i')| \le \zeta$ for a fixed positive constant $\zeta$.
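For intuition about Assumption E.4, the following sketch builds the aggregation feature matrix $\Phi$ from a map $h$ and evaluates the projection $\Phi\Pi V$; the map $h$, the distribution $d$, and the vector $V$ below are illustrative placeholders. On each cluster $h^{-1}(j)$, $(\Pi V)_j$ is the $d$-weighted average of $V$, which is the observation used later in the proof of Lemma E.5.

import numpy as np

n, m = 6, 2
h = np.array([0, 0, 0, 1, 1, 1])                      # illustrative aggregation map h: N -> M
Phi = np.zeros((n, m)); Phi[np.arange(n), h] = 1.0    # phi_k(i) = 1 iff k = h(i), as in (3)
d = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1])          # illustrative stationary distribution
D = np.diag(d)
Pi = np.linalg.inv(Phi.T @ D @ Phi) @ Phi.T @ D       # Pi = (Phi^T D Phi)^{-1} Phi^T D

V = np.array([1.0, 1.2, 0.9, 3.0, 3.1, 2.8])          # a vector that varies little within clusters
print(Pi @ V)          # per-cluster d-weighted averages of V
print(Phi @ Pi @ V)    # lies between the min and max of V on each cluster (cf. Lemma E.5)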
Under Assumption E.4, we can show the asymptotic error bound in the infinity norm, as desired:

Theorem E.2.
Under Assumptions E.1, E.2, and E.3, if Assumption E.4 also holds, the limit of convergence $\theta^*$ of the TD($\lambda$) algorithm satisfies
$$\|\Phi\theta^* - V^*\|_\infty \le \frac{1-\lambda\gamma}{1-\gamma}\,\|\Phi\Pi V^* - V^*\|_\infty \le \frac{1-\lambda\gamma}{1-\gamma}\,\zeta.$$
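As a concrete instance of this bound, obtained by direct substitution, setting $\lambda = 0$ recovers the TD(0) case used in the proof of Theorem 2.2, where the bound reads
$$\|\Phi\theta^* - V^*\|_\infty \le \frac{\zeta}{1-\gamma},$$
which is the asymptotic error term appearing in (50) in Appendix F.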
To show Theorem E.2, we need to prove several auxiliary lemmas first.

Lemma E.3.
Under Assumption E.1, for any $V \in L(\mathcal N, D)$, we have $\|PV\|_\infty \le \|V\|_\infty$.

Proof of Lemma E.3. This lemma holds because the transition matrix $P$ is non-expansive in the infinity norm: each row of $P$ is a probability distribution, so every entry of $PV$ is a convex combination of the entries of $V$.

Lemma E.4.
Under Assumption E.1, for any $V, \bar V \in L(\mathcal N, D)$, we have
$$\left\|T^{(\lambda)}V - T^{(\lambda)}\bar V\right\|_\infty \le \frac{\gamma(1-\lambda)}{1-\gamma\lambda}\,\left\|V - \bar V\right\|_\infty.$$

Proof of Lemma E.4.
By the definition of $T^{(\lambda)}$, we have that
$$\left\|T^{(\lambda)}V - T^{(\lambda)}\bar V\right\|_\infty = \left\|(1-\lambda)\sum_{m=0}^{\infty}\lambda^m(\gamma P)^{m+1}\left(V - \bar V\right)\right\|_\infty \le (1-\lambda)\sum_{m=0}^{\infty}\lambda^m\gamma^{m+1}\left\|V - \bar V\right\|_\infty \quad (43a)$$
$$= \frac{\gamma(1-\lambda)}{1-\gamma\lambda}\,\left\|V - \bar V\right\|_\infty,$$
where inequality (43a) holds because $\|V - \bar V\|_\infty < \infty$, so we can apply Lemma E.3.
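The contraction factor in Lemma E.4 can also be checked numerically. The sketch below evaluates $T^{(\lambda)}$ through the closed form $T^{(\lambda)}V = (I-\lambda\gamma P)^{-1}\bar r + (1-\lambda)\gamma P(I-\lambda\gamma P)^{-1}V$ for $\lambda < 1$, with $\bar r_i = \sum_{i'} P_{ii'}\, r(i, i')$; this closed form is a rearrangement of the definition of $T^{(\lambda)}$ given earlier, and the chain and rewards below are synthetic placeholders.

import numpy as np

rng = np.random.default_rng(1)
n, gamma, lam = 5, 0.9, 0.4
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # synthetic transition matrix
r = rng.random((n, n))                                      # synthetic rewards r(i, i')
rbar = (P * r).sum(axis=1)                                  # expected one-step rewards

def T_lam(V):
    A = np.linalg.inv(np.eye(n) - lam * gamma * P)
    return A @ rbar + (1 - lam) * gamma * P @ A @ V

V1, V2 = rng.random(n), rng.random(n)
lhs = np.max(np.abs(T_lam(V1) - T_lam(V2)))
rhs = gamma * (1 - lam) / (1 - gamma * lam) * np.max(np.abs(V1 - V2))
print(lhs <= rhs + 1e-12)    # True: the sup-norm contraction of Lemma E.4 holds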
Lemma E.5. Under Assumptions E.1 and E.4, we have
$$\|\Phi\Pi V^* - V^*\|_\infty \le \zeta \qquad (44)$$
and, for any $V \in L(\mathcal N, D)$,
$$\|\Phi\Pi V\|_\infty \le \|V\|_\infty. \qquad (45)$$

Proof of Lemma E.5.
For j ∈ M , we use h − ( j ) ⊆ N to denote all the elements in N whose featureis e j , i.e. h − ( j ) = { i | i ∈ N , h ( i ) = j } . Since h is surjection, h − ( j ) (cid:54) = ∅ , ∀ j ∈ M . Since ΦΠ isthe projection matrix that projects a vector in R n to the set { Φ θ | θ ∈ R m } , we have Π V = arg min θ ∈ R m (cid:88) j ∈M (cid:88) i ∈ h − ( j ) d i ( V i − θ j ) . Hence the optimal θ j must be in the range (cid:2) min i ∈ h − ( j ) V i , max i ∈ h − ( j ) V i (cid:3) . Therefore, we see that | (ΦΠ V ) i | = (cid:12)(cid:12) (Π V ) h ( i ) (cid:12)(cid:12) ≤ max i (cid:48) ∈ h − ( h ( i )) | V i (cid:48) | , | (ΦΠ V ) i − V i | ≤ max (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) min i (cid:48) ∈ h − ( h ( i )) V i (cid:48) − V i (cid:12)(cid:12)(cid:12)(cid:12) , (cid:12)(cid:12)(cid:12)(cid:12) max i (cid:48) ∈ h − ( h ( i )) V i (cid:48) − V i (cid:12)(cid:12)(cid:12)(cid:12)(cid:19) . (46)holds for all z ∈ Z . Let V = V ∗ and use Assumption E.4 in (46) gives (44).Now we come back to the proof of Theorem E.2.Notice that (cid:107) Φ θ ∗ − V ∗ (cid:107) ∞ ≤ (cid:107) Φ θ ∗ − ΦΠ V ∗ (cid:107) ∞ + (cid:107) ΦΠ V ∗ − V ∗ (cid:107) ∞ (47a) = (cid:13)(cid:13)(cid:13) ΦΠ T ( λ ) (Φ θ ∗ ) − ΦΠ V ∗ (cid:13)(cid:13)(cid:13) ∞ + (cid:107) ΦΠ V ∗ − V ∗ (cid:107) ∞ (47b) ≤ (cid:13)(cid:13)(cid:13) T ( λ ) (Φ θ ∗ ) − V ∗ (cid:13)(cid:13)(cid:13) ∞ + (cid:107) ΦΠ V ∗ − V ∗ (cid:107) ∞ (47c) ≤ γ (1 − λ )1 − γλ (cid:107) Φ θ ∗ − V ∗ (cid:107) ∞ + (cid:107) ΦΠ V ∗ − V ∗ (cid:107) ∞ , (47d)where we use the triangle inequality in (47a); Theorem E.1 in (47b); Lemma E.5 in (47c); LemmaE.4 in (47d).Therefore, we obtain that (cid:107) Φ θ ∗ − V ∗ (cid:107) ∞ ≤ (1 − λγ )1 − γ (cid:107) Π V ∗ − V ∗ (cid:107) ∞ ≤ (1 − λγ )1 − γ ζ, where we use Lemma E.5 in the second inequality. F Proof of Theorem 2.2
Before presenting the proof of Theorem 2.2, we first show two upper bounds that are needed in theassumptions of Theorem 2.1. We defer the proof of this result to Appendix G.
Proposition F.1.
Under the same assumptions as Theorem 2.2, we have (cid:107) θ ( t ) (cid:107) ∞ ≤ ¯ θ := ¯ r − γ holdsfor all t almost surely and (cid:107) θ ∗ (cid:107) ∞ ≤ ¯ θ . | w ( t ) | ≤ ¯ w := r − γ also holds for all t almost surely. Now we come back to the proof of Theorem 2.2. Recall that we define F as the Bellman PolicyOperator and the noise sequence w ( t ) as w ( t ) = r t + γθ h ( i t +1 ) ( t ) − E i (cid:48) ∼ P ( ·| i t ) (cid:2) r ( i t , i (cid:48) ) + γθ h ( i (cid:48) ) ( t ) (cid:3) . Let θ ∗ be the unique solution of the equation Π F (Φ θ ∗ ) = θ ∗ . By the triangle inequality, we have that (cid:107) Φ · θ ( T ) − V ∗ (cid:107) ∞ ≤ (cid:107) Φ · θ ( T ) − Φ · θ ∗ (cid:107) ∞ + (cid:107) Φ · θ ∗ − V ∗ (cid:107) ∞ ≤ (cid:107) θ ( T ) − θ ∗ (cid:107) ∞ + (cid:107) Φ · θ ∗ − V ∗ (cid:107) ∞ . (48)We first bound the first term of (48) by Theorem 2.1. To do this, we first rewrite the update rule ofTD learning with state aggregation (4) in the form of the SA update rule (2): θ h ( i t ) ( t + 1) = θ h ( i t ) ( t ) + α t (cid:0) F i t (Φ θ ( t )) − θ h ( i t ) ( t ) + w ( t ) (cid:1) ,θ j ( t + 1) = θ j ( t ) for j (cid:54) = h ( i t ) , j ∈ M . Now we verify all the assumptions of Theorem 2.1. Assumption 2.1 is assumed to be satisfied in thebody of Theorem 2.2. As for Assumption 2.2, F is γ -contraction in the infinity norm because it is theBellman operator, and we can set C = r − γ so that C ≥ (1 + γ ) (cid:107) y ∗ (cid:107) ∞ (see the discussion belowAssumption 2.2). As for Assumption 2.3, by the definition of noise sequence w ( t ) , we see that E [ w ( t ) | F t ] = E (cid:2) r t + γθ h ( i t +1 ) ( t ) − E i (cid:48) ∼ P ( ·| i t ) (cid:2) r ( i t , i (cid:48) ) + γθ h ( i (cid:48) ) ( t ) (cid:3) | F t (cid:3) = E (cid:2) r t + γθ h ( i t +1 ) ( t ) | F t (cid:3) − E i (cid:48) ∼ P ( ·| i t ) (cid:2) r ( i t , i (cid:48) ) + γθ h ( i (cid:48) ) ( t ) (cid:3) . In addition, we can set ¯ w = r − γ according to Proposition F.1. Finally, we can set ¯ θ = ¯ r − γ accordingto Proposition F.1.Therefore, by Theorem 2.1, we see that (cid:107) θ ( T ) − θ ∗ (cid:107) ∞ ≤ C a √ T + t + C (cid:48) a T + t , where (49) C a = 40 H ¯ r (1 − γ ) (cid:112) K log T · (cid:115) log T + log log T + log (cid:18) mK δ (cid:19) ,C (cid:48) a = 8¯ r (1 − γ ) max (cid:18) K H log Tσ (cid:48) + 4 K (1 + 2 K + 4 H ) , K log T + t (cid:19) . As for the second term of (48), by Theorem E.2, we have that (cid:107) Φ · θ ∗ − V ∗ (cid:107) ∞ ≤ ζ − γ . (50)Substituting (49) and (50) into (48) finishes the proof. G Proof of Proposition F.1
We show (cid:107) θ ( t ) (cid:107) ∞ ≤ ¯ r − γ by induction on t . The statement holds for t = 0 because we initialize θ (0) = . Suppose the statement holds for t . By the induction assumption, we see that θ h ( i t ) ( t + 1) = (1 − α t ) θ h ( i t ) ( t ) + α t (cid:2) r t + γθ h ( i t +1 ) ( t ) (cid:3) ≤ (1 − α t ) (cid:107) θ ( t ) (cid:107) ∞ + α t [ r t + γ (cid:107) θ ( t ) (cid:107) ∞ ] ≤ (1 − α t ) ¯ r − γ + α t (cid:20) r t + γ · ¯ r − γ (cid:21) ≤ ¯ r − γ . For j (cid:54) = h ( i t ) , j ∈ M , we have that θ j ( t + 1) = θ j ( t ) ≤ (cid:107) θ ( t ) (cid:107) ∞ ≤ ¯ r − γ . Hence the statement also holds for t + 1 . Therefore, we have showed (cid:107) θ ( t ) (cid:107) ∞ ≤ ¯ r − γ by induction.By Theorem E.1, we know θ ∗ = lim t →∞ θ ( t ) . Since we have already shown that (cid:107) θ ( t ) (cid:107) ∞ ≤ ¯ r − γ holds for all t , we must have (cid:107) θ ∗ (cid:107) ∞ ≤ ¯ r − γ .Using (cid:107) θ ( t ) (cid:107) ∞ ≤ ¯ r − γ , we see that | w ( t ) | ≤ | r t | + γ (cid:12)(cid:12) θ h ( i t +1 ) ( t ) (cid:12)(cid:12) − (cid:12)(cid:12) E i (cid:48) ∼ P ( ·| i t ) (cid:2) r ( i t , i (cid:48) ) + γθ h ( i (cid:48) ) ( t ) (cid:3)(cid:12)(cid:12) ≤ r + 2 γ ¯ θ = 2¯ r − γ . H Application of the SA Scheme to Q-learning with State Aggregation
The $Q$-learning with state aggregation setting we study is a generalization of the tabular setting studied in [31]. Specifically, we consider a Markov Decision Process (MDP) with a finite state space $\mathcal S$ and finite action space $\mathcal A$. Suppose the transition probability is given by $\mathbb P(s_{t+1} = s' \mid s_t = s, a_t = a) = P(s' \mid s, a)$, and the stage reward at time step $t$ is a random variable $r_t$ with its expectation given by $r_{s_t, a_t}$. Under a stochastic policy $\pi$, the $Q$ table $Q^\pi \in \mathbb R^{\mathcal S\times\mathcal A}$ is defined as
$$Q^\pi_{s,a} = \mathbb E^\pi\left[\sum_{t=0}^{\infty}\gamma^t r_t \,\middle|\, (s_0, a_0) = (s, a)\right],$$
where $0 \le \gamma < 1$ is the discount factor.

Similar to [31], we assume the trajectory $\{(s_t, a_t, r_t)\}_{t=0}^{\infty}$ is sampled by implementing a fixed behavioral stochastic policy $\pi$. In $Q$-learning with state aggregation, we suppose $h: \mathcal S\times\mathcal A \to \mathcal M$ is a surjection that maps each state-action pair to its abstraction in set $\mathcal M$. The update rule for $Q$-learning with state aggregation is given by
$$\theta_{h(s_t, a_t)}(t+1) = (1-\alpha_t)\,\theta_{h(s_t, a_t)}(t) + \alpha_t\left[r_t + \gamma\max_{a\in\mathcal A}\theta_{h(s_{t+1}, a)}(t)\right], \qquad \theta_j(t+1) = \theta_j(t) \text{ for } j \ne h(s_t, a_t). \qquad (51)$$
As a remark, some previous works only compress the state space $\mathcal S$ to an abstract state space $\mathcal S'$ by a surjection $\phi: \mathcal S \to \mathcal S'$ and do not compress the action space [16]. In contrast, our definition of the mapping $h$ is more general because we can let $h(s,a) := (\phi(s), a)$ for all $s \in \mathcal S$, $a \in \mathcal A$.

We define the function $F$ as the Bellman Optimality Operator, i.e.
$$F_{s,a}(Q) = r_{s,a} + \gamma\,\mathbb E_{s'\sim P(\cdot\mid s,a)}\max_{a'\in\mathcal A} Q_{s',a'}.$$
We can rewrite the update rule (51) as
$$\theta_{h(s_t, a_t)}(t+1) = \theta_{h(s_t, a_t)}(t) + \alpha_t\left[F_{s_t, a_t}(\Phi\theta(t)) - \theta_{h(s_t, a_t)}(t) + w(t)\right], \qquad \theta_j(t+1) = \theta_j(t) \text{ for } j \ne h(s_t, a_t),$$
where
$$w(t) = r_t + \gamma\max_{a\in\mathcal A}\theta_{h(s_{t+1}, a)}(t) - F_{s_t, a_t}(\Phi\theta(t)) = (r_t - r_{s_t, a_t}) + \gamma\left[\max_{a\in\mathcal A}\theta_{h(s_{t+1}, a)}(t) - \mathbb E_{s'\sim P(\cdot\mid s_t, a_t)}\max_{a'\in\mathcal A}\theta_{h(s', a')}(t)\right].$$
Hence we have $\mathbb E[w(t)\mid\mathcal F_t] = 0$.
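The following minimal sketch runs the update rule (51) on a small synthetic MDP. The transition kernel, rewards, behavioral policy, aggregation map $h$, and step-size constants are illustrative placeholders chosen only to show the shape of the iteration.

import numpy as np

rng = np.random.default_rng(2)
nS, nA, M, gamma = 4, 2, 3, 0.9
H, t0 = 10.0, 100                                                  # illustrative step-size constants
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)    # synthetic P(s'|s,a)
r_mean = rng.random((nS, nA))                                      # synthetic expected stage rewards
h = rng.integers(0, M, size=(nS, nA)); h.flat[:M] = np.arange(M)   # synthetic surjection h: S x A -> M

theta = np.zeros(M)
s = 0
for t in range(50000):
    a = rng.integers(nA)                                     # fixed uniform behavioral policy
    s_next = rng.choice(nS, p=P[s, a])
    r_t = r_mean[s, a] + 0.1 * rng.standard_normal()         # noisy stage reward with mean r_{s,a}
    alpha = H / (t + t0)
    target = r_t + gamma * theta[h[s_next]].max()            # r_t + gamma * max_a theta_{h(s_{t+1}, a)}
    theta[h[s, a]] = (1 - alpha) * theta[h[s, a]] + alpha * target   # update (51); other entries unchanged
    s = s_next
print(theta)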
In order to apply Theorem 2.1, we need the following assumption on the induced Markov chain of the behavioral policy $\pi$:

Assumption H.1. The following conditions hold:
1. For each time step $t$, the stage reward $r_t$ satisfies $|r_t| \le \bar r$ almost surely.
2. Under the behavioral policy $\pi$, the induced Markov chain $(s_t, a_t)$ with state space $\mathcal S\times\mathcal A$ satisfies Assumption 2.1 with stationary distribution $d$ and parameters $\sigma'$, $K_1$, $K_2$.

Define $\theta^*$ as the unique solution of the equation $\theta = \Pi F(\Phi\theta)$. Under Assumption H.1, one can easily show that $\|\theta^*\|_\infty \le \frac{\bar r}{1-\gamma}$. Further, using a similar approach to the proof of Proposition B.1, we also see that
$$\|\theta(t)\|_\infty \le \bar\theta := \frac{\bar r}{1-\gamma}, \qquad |w(t)| \le \bar w := \frac{2\bar r}{1-\gamma}$$
hold for all $t$ almost surely. Hence we have the following corollary of Theorem 2.1:

Corollary H.1.
Under Assumption H.1, suppose the step size of $Q$-learning with state aggregation is given by $\alpha_t = \frac{H}{t+t_0}$, where $t_0 = \max(4H,\, 2K_2\log T)$ and $H \ge \frac{2}{\sigma'(1-\gamma)}$. Then, with probability at least $1-\delta$,
$$\|\theta(T) - \theta^*\|_\infty \le \frac{C_a}{\sqrt{T+t_0}} + \frac{C'_a}{T+t_0},$$
where
$$C_a = \frac{40 H\bar r}{(1-\gamma)^2}\sqrt{K_2\log T}\cdot\sqrt{\log T + \log\log T + \log\left(\frac{mK_1}{\delta}\right)}, \qquad C'_a = \frac{8\bar r}{(1-\gamma)^2}\max\left(\frac{K_2 H\log T}{\sigma'} + 4K_1(1+2K_2+4H),\; 2K_2\log T + t_0\right).$$

I Proof of Lemma 3.1
For notation simplicity, we define s = ( s N κi , s N κ − i ) , s (cid:48) = ( s N κi , s (cid:48) N κ − i ) , and a = ( a N κi , a N κ − i ) , a (cid:48) =( a N κi , a (cid:48) N κ − i ) . Let π t,i be the distribution of ( s N β i ( t ) , a N β i ( t )) condition on ( s (0) , a (0)) = ( s, a ) under policy θ , and let π (cid:48) t,i be the distribution of ( s N β i ( t ) , a N β i ( t )) condition on ( s (0) , a (0)) =( s (cid:48) , a (cid:48) ) under policy θ .To bound the difference between Q θi ( s, a ) and Q θi ( s, a ) , we study the range of the local states andlocal actions affected by s N κ − i (0) at time step t .Recall that ξ = max( α , α + β ) . At time step , the states s N κ − i (0) can affect the actions a N κ − β − i (0) . Hence the states affected at time step is s N κ − ξ − i (1) . Therefore, at time step t , the rangeof the local states affected by (cid:16) s N κ − i (0) , a N κ − i (0) (cid:17) is s N κ − tξ − i ; the range of the local actions affectedby (cid:16) s N κ − i (0) , a N κ − i (0) (cid:17) is a N κ − β − tξ − i .If the states and actions in N β i are not affected by (cid:16) s N κ − i (0) , a N κ − i (0) (cid:17) at time step t , we must have π t,i = π (cid:48) t,i . By the discussion above, we know (cid:16) s N β i , a N β i (cid:17) is not affected by (cid:16) s N κ − i (0) , a N κ − i (0) (cid:17) at time step t if κ − β − tξ ≥ β . Let m := (cid:100) κ +1 − β − β ξ (cid:101) . When t < m , we must have π t,i = π (cid:48) t,i .For notation simplicity, we use z ( t ) to denote (cid:16) s N β i ( t ) , a N β i ( t ) (cid:17) . We see that (cid:12)(cid:12) Q θi ( s, a ) − Q θi ( s (cid:48) , a (cid:48) ) (cid:12)(cid:12) ≤ ∞ (cid:88) t =0 (cid:12)(cid:12)(cid:12) γ t E z ( t ) ∼ π t,i r i ( z ( t )) − γ t E z ( t ) ∼ π (cid:48) t,i r i ( z ( t )) (cid:12)(cid:12)(cid:12) = ∞ (cid:88) t = m (cid:12)(cid:12)(cid:12) γ t E z ( t ) ∼ π t,i r i ( z ( t )) − γ t E z ( t ) ∼ π (cid:48) t,i r i ( z ( t )) (cid:12)(cid:12)(cid:12) ≤ ∞ (cid:88) t = m γ t ¯ r ≤ ¯ r − γ γ m ≤ ¯ r (1 − γ ) γ ( β + β ) /ξ γ κ +1 ξ . J Proof of Theorem 3.3
While [32, Theorem 5] studies the error bound of Scalable Actor Critic as a whole, we want to decouple the effects of the inner loop and the outer loop in Theorem 3.3. Our proof of Theorem 3.3 uses similar techniques to the proof in [32], but we extend the analysis to a more general dependence model.

According to Algorithm 1, at iteration $m$, agent $i$ performs gradient ascent by $\theta_i(m+1) = \theta_i(m) + \eta_m\,\hat g_i(m)$, with step size $\eta_m = \frac{\eta}{\sqrt{m+1}}$. The approximate local gradient $\hat g_i(m)$ is given by
$$\hat g_i(m) = \sum_{t=0}^{T}\gamma^t\,\frac{1}{n}\sum_{j\in N^\kappa_i}\hat Q^{m,T}_j\!\left(s_{N^\kappa_j}(t), a_{N^\kappa_j}(t)\right)\nabla_{\theta_i}\log\zeta^{\theta_i(m)}_i\!\left(a_i(t)\mid s_{N^\beta_i}(t)\right).$$
By the policy gradient theorem, the exact gradient can be written as
$$\nabla_{\theta_i}J(\theta(m)) = \sum_{t=0}^{\infty}\mathbb E_{s\sim\pi^{\theta(m)}_t,\,a\sim\zeta^{\theta(m)}(\cdot\mid s)}\,\gamma^t\, Q^{\theta(m)}(s, a)\,\nabla_{\theta_i}\log\zeta^{\theta_i(m)}_i\!\left(a_i\mid s_{N^\beta_i}\right),$$
where we use $\pi^\theta_t$ to denote the distribution of the global state $s(t)$ under the fixed policy $\theta$.

To bound $\|\hat g(m) - \nabla_\theta J(\theta(m))\|$, we define intermediate quantities $g(m)$ and $h(m)$ whose $i$'th components are given by
$$g_i(m) = \sum_{t=0}^{T}\gamma^t\,\frac{1}{n}\sum_{j\in N^\kappa_i} Q^{\theta(m)}_j(s(t), a(t))\,\nabla_{\theta_i}\log\zeta^{\theta_i(m)}_i\!\left(a_i(t)\mid s_{N^\beta_i}(t)\right),$$
$$h_i(m) = \sum_{t=0}^{T}\mathbb E_{s\sim\pi^{\theta(m)}_t,\,a\sim\zeta^{\theta(m)}(\cdot\mid s)}\,\gamma^t\,\frac{1}{n}\sum_{j\in N^\kappa_i} Q^{\theta(m)}_j(s, a)\,\nabla_{\theta_i}\log\zeta^{\theta_i(m)}_i\!\left(a_i\mid s_{N^\beta_i}\right).$$
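To make the structure of the estimator explicit, here is a schematic Python sketch of how agent $i$ would assemble $\hat g_i(m)$ from one sampled trajectory. The callables Qhat and grad_log_zeta, and the numbers in the usage example, are hypothetical stand-ins for the truncated critic estimates and the local score function that Algorithm 1 would supply.

import numpy as np

def local_gradient_estimate(T, gamma, n, kappa_neighborhood_i, Qhat, grad_log_zeta):
    # Schematic version of
    #   ghat_i(m) = sum_{t=0}^T gamma^t (1/n) sum_{j in N_i^kappa}
    #               Qhat_j^{m,T}(s_{N_j^kappa}(t), a_{N_j^kappa}(t)) * grad_{theta_i} log zeta_i(a_i(t) | s_{N_i^beta}(t)).
    # Qhat(j, t) returns the truncated critic estimate for agent j at time t of the sampled trajectory,
    # and grad_log_zeta(t) returns the local score vector at time t.
    g_i = 0.0
    for t in range(T + 1):
        critic_sum = sum(Qhat(j, t) for j in kappa_neighborhood_i) / n
        g_i = g_i + (gamma ** t) * critic_sum * grad_log_zeta(t)
    return g_i

# usage with dummy stand-ins: two neighbors, constant critic estimates, a 2-dimensional score
g = local_gradient_estimate(T=5, gamma=0.9, n=3, kappa_neighborhood_i=[0, 1],
                            Qhat=lambda j, t: 1.0, grad_log_zeta=lambda t: np.ones(2))
print(g)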
Lemma J.1. We have, almost surely, for all $m \le M$,
$$\max\left(\|\hat g(m)\|,\; \|g(m)\|,\; \|h(m)\|,\; \|\nabla J(\theta(m))\|\right) \le \frac{L}{(1-\gamma)^2}.$$
To show Lemma J.1, we only need to replace $\zeta^{\theta_i(m)}_i(a_i(t)\mid s_i(t))$ by $\zeta^{\theta_i(m)}_i\!\left(a_i(t)\mid s_{N^\beta_i}(t)\right)$ in the proof of [32, Lemma 17].

Notice that $\hat g(m) - \nabla J(\theta(m)) = e_1(m) + e_2(m) + e_3(m)$, where
$$e_1(m) := \hat g(m) - g(m), \qquad e_2(m) := g(m) - h(m), \qquad e_3(m) := h(m) - \nabla J(\theta(m)).$$
To bound $\|\hat g(m) - \nabla J(\theta(m))\|$, we only need to bound $e_1(m)$, $e_2(m)$, and $e_3(m)$ separately.

Lemma J.2.
With probability at least $1-\frac{\delta}{2}$, we have
$$\sup_{0\le m\le M-1}\left\|e_1(m)\right\| \le \frac{\iota c L\rho^{\kappa+1}}{(1-\gamma)^2}.$$

Proof of Lemma J.2.
By the assumption that sup m ≤ M − sup i ∈N sup ( s,a ) ∈S×A (cid:12)(cid:12)(cid:12) Q θ ( m ) i ( s, a ) − ˆ Q T ( s N κi , a N κi ) (cid:12)(cid:12)(cid:12) ≤ ιcρ κ +1 − γ , we have for all m ≤ M − and i ∈ N , (cid:107) ˆ g i ( m ) − g i ( m ) (cid:107)≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) T (cid:88) t =0 γ t n (cid:88) j ∈ N κi (cid:104) ˆ Q m,Tj (cid:16) s N κj ( t ) , a N κj ( t ) (cid:17) − Q θ ( m ) j ( s ( t ) , a ( t )) (cid:105) ∇ θ i log ζ θ i ( m ) i (cid:16) a i ( t ) | s N β i ( t ) (cid:17)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ T (cid:88) t =0 γ t n (cid:88) j ∈ N κi (cid:12)(cid:12)(cid:12) ˆ Q m,Tj (cid:16) s N κj ( t ) , a N κj ( t ) (cid:17) − Q θ ( m ) j ( s ( t ) , a ( t )) (cid:12)(cid:12)(cid:12) (cid:13)(cid:13)(cid:13) ∇ θ i log ζ θ i ( m ) i (cid:16) a i ( t ) | s N β i ( t ) (cid:17)(cid:13)(cid:13)(cid:13) ≤ T (cid:88) t =0 γ t ιcρ κ +1 − γ L i < ιcL i ρ κ +1 (1 − γ ) . Combining all n dimensions finishes the proof. 30 emma J.3. With probability at least − δ , we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) M − (cid:88) m =0 η m (cid:104)∇ J ( θ ( m )) , e ( m ) (cid:105) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ L (1 − γ ) (cid:118)(cid:117)(cid:117)(cid:116) M − (cid:88) m =0 η m log 4 δ . To show Lemma J.3, we only need to replace ζ θ i ( m ) i ( a i ( t ) | s i ( t )) by ζ θ i ( m ) i (cid:16) a i ( t ) | s N β i ( t ) (cid:17) inthe proof of [32, Lemma 19]. Lemma J.4.
When T + 1 ≥ log γ ( c (1 − ρ )) + ( κ + 1) log γ ρ , we have almost surely (cid:13)(cid:13) e ( m ) (cid:13)(cid:13) ≤ cLρ κ +1 − γ . To show Lemma J.4, we only need to replace ζ θ i ( m ) i ( a i ( t ) | s i ( t )) by ζ θ i ( m ) i (cid:16) a i ( t ) | s N β i ( t ) (cid:17) inthe proof of [32, Lemma 20].Now we come back to the proof of Theorem 3.3. Using the identical steps with the proof of [32,Theorem 5], we can obtain that (equation (44) in [32]) M − (cid:88) m =0 η m (cid:107)∇ J ( θ ( m )) (cid:107) ≤ J ( θ ( m )) − J ( θ (0)) − M − (cid:88) m =0 η m (cid:15) m, + M − (cid:88) m =0 η m (cid:15) m, + M − (cid:88) m =0 η m (cid:15) m, , (52)where (cid:15) m, = (cid:104)∇ J ( θ ( m )) , e ( m ) (cid:105) ,(cid:15) m, = (cid:107)∇ J ( θ ( m )) (cid:107) ( (cid:13)(cid:13) e ( m ) (cid:13)(cid:13) + (cid:13)(cid:13) e ( m ) (cid:13)(cid:13) ) ,(cid:15) m, = 2 L (cid:48) ( (cid:13)(cid:13) e ( m ) (cid:13)(cid:13) + (cid:13)(cid:13) e ( m ) (cid:13)(cid:13) + (cid:13)(cid:13) e ( m ) (cid:13)(cid:13) ) . By Lemma J.3, we have with probability at least − δ , (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) M − (cid:88) m =0 η m (cid:15) m, (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ L (1 − γ ) (cid:118)(cid:117)(cid:117)(cid:116) M − (cid:88) m =0 η m log 4 δ . (53)By Lemma J.2 and Lemma J.4, we have with probability at least − δ , sup m ≤ M − (cid:15) m, ≤ L (1 − γ ) (cid:18) sup m ≤ M − (cid:13)(cid:13) e ( m ) (cid:13)(cid:13) + sup m ≤ M − (cid:13)(cid:13) e ( m ) (cid:13)(cid:13)(cid:19) ≤ (2 + ι ) L cρ κ +1 (1 − γ ) . (54)By Lemma J.1, we have almost surely max( (cid:13)(cid:13) e ( m ) (cid:13)(cid:13) , (cid:13)(cid:13) e ( m ) (cid:13)(cid:13) , (cid:13)(cid:13) e ( m ) (cid:13)(cid:13) ) ≤ L (1 − γ ) , and hencealmost surely sup m ≤ M − (cid:15) m, = 2 L (cid:48) (cid:16)(cid:13)(cid:13) e ( m ) (cid:13)(cid:13) + (cid:13)(cid:13) e ( m ) (cid:13)(cid:13) + (cid:13)(cid:13) e ( m ) (cid:13)(cid:13) (cid:17) ≤ L (cid:48) L (1 − γ ) . (55)By union bound, (53), (54), and (55) hold simultaneously with probability − δ . Combining themwith (52) gives (cid:80) M − m =0 η m (cid:107)∇ J ( θ ( m )) (cid:107) (cid:80) M − m =0 η m ≤ ( J ( θ ( M )) − J ( θ (0))) + (cid:12)(cid:12)(cid:12)(cid:80) M − m =0 η m (cid:15) m, (cid:12)(cid:12)(cid:12) + sup m ≤ M − (cid:15) m, (cid:80) M − m =0 η m (cid:80) M − m =0 η m + 2 sup m ≤ M − (cid:15) m, ..