Local Stochastic Approximation: A Unified View of Federated Learning and Distributed Multi-Task Reinforcement Learning Algorithms
Thinh T. Doan∗

Abstract
Motivated by broad applications in reinforcement learning and federated learning, we study local stochastic approximation over a network of agents, where their goal is to find the root of an operator composed of the local operators at the agents. Our focus is to characterize the finite-time performance of this method when the data at each agent are generated from Markov processes, and hence are dependent. In particular, we provide explicit convergence rates of local stochastic approximation for both constant and time-varying step sizes. Our results show that these rates are within a logarithmic factor of the ones under independent data. We then illustrate the applications of these results to different interesting problems in multi-task reinforcement learning and federated learning.
1 Introduction

In this paper, we study local stochastic approximation (SA), a distributed variant of the classic SA originally introduced by Robbins and Monro [1] for solving root-finding problems under corrupted measurements of an operator (or function). We consider a setting where a group of agents communicate indirectly through a centralized coordinator. The goal of the agents is to find the root of an operator that is composed of the local operators at the agents. To solve this problem, each agent iteratively runs a number of local SA steps based on its own data, and the resulting iterates are then averaged at the centralized coordinator. This algorithm, presented in detail in Section 2, is relatively simple and efficient for solving problems that require a large amount of data distributed across different agents.

We are motivated by the broad applications of local SA to problems in federated learning [2, 3] and multi-task reinforcement learning [4-6]. These two areas share a common framework, where multiple agents (clients or workers) collaboratively solve a machine learning or reinforcement learning problem under the coordination of a centralized server [2-6]. Instead of sharing the data collected at the local devices/environments with the server, the agents run local updates of their models/policies based on their own data, whose results are then aggregated at the server with the goal of optimizing the global learning objective. In these contexts, local SA can be used to formulate the popular algorithms studied in these two areas, as shown in Section 4.

Our focus in this paper is on the theoretical aspects of the finite-time performance of local stochastic approximation. Our goal is to characterize the convergence rate of this method when the data at the agents are heterogeneous and dependent. In particular, we consider the case where the data at each agent are generated from a Markov process, as often considered in the context of multi-task reinforcement learning.
Our setting generalizes the existing works in the literature, where the local data are assumed i.i.d. Under fairly standard assumptions, our main contribution is to show that the convergence rates of local SA are within a logarithmic factor of the comparable bounds for independent data. As illustrated in Section 4, the results in this paper provide a unified view of the finite-time bounds of federated learning and multi-task reinforcement learning algorithms under different settings.

∗ Thinh T. Doan is with the School of Electrical and Computer Engineering, Georgia Institute of Technology, GA, 30332, USA. Email: [email protected]
1.1 Related work

Stochastic approximation is one of the most efficient and widely used methods for solving stochastic optimization problems in many areas, including machine learning [7] and reinforcement learning [8, 9]. The asymptotic convergence of SA under Markov randomness is often established using the ordinary differential equation (ODE) method [10, 11], which shows that under the right conditions the noise effects eventually average out and the SA iterates asymptotically follow a stable ODE. On the other hand, the convergence rates of SA have mostly been considered in the context of stochastic gradient descent (SGD) under i.i.d. samples [7, 12]. The finite-time convergence of SGD under Markov randomness has been studied in [13, 14] and the references therein. In the context of reinforcement learning, such results have been studied in [15-19] for linear SA and in a recent work [20] for nonlinear SA.

The local SA method considered in this paper has recently received much interest in the context of federated learning under the name of local SGD; see for example [21-27]. Finite-time bounds of local SGD in these works are derived when the local data at each agent are sampled i.i.d. In contrast, our focus is on the setting where the local data at each agent are sampled from a Markov process, and therefore are dependent. We note that the popular distributed SGD in machine learning shares the same communication structure as local SGD. However, while in local SGD the agents have heterogeneous objectives and only share their iterates with the centralized coordinator, in distributed SGD the agents compute stochastic gradients of a global objective and send them to the centralized coordinator.

Finally, distributed/local stochastic approximation has also found broad applications in the context of (multi-task) reinforcement learning [4-6, 28-30], which is the main motivation of this paper. In these applications, since reinforcement learning is often modeled as a Markov process, the noise in local SA is Markovian. However, there is a lack of understanding of its finite-time performance. Our results in this paper, therefore, help to fill this gap.
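Before turning to the formal setup, the following minimal, self-contained Python sketch illustrates the local-update-then-average scheme described in the introduction. The quadratic local operators, the two-state Markov chains, and all numerical values below are hypothetical choices made for this illustration only; they are not part of the paper's setting.

```python
import numpy as np

def local_sa(F_list, sample_next, theta0, alpha, H, K, rng):
    """Minimal sketch of local SA: H local steps per agent, then averaging."""
    N = len(F_list)
    theta_bar = np.array(theta0, dtype=float)
    states = [0] * N                         # current state of each agent's Markov chain
    for _ in range(K):
        local = []
        for i in range(N):
            th = theta_bar.copy()            # each agent starts from the averaged iterate
            for _ in range(H):               # H local SA steps on agent i's own data
                th = th - alpha * F_list[i](th, states[i])
                states[i] = sample_next(states[i], rng)
            local.append(th)
        theta_bar = np.mean(local, axis=0)   # coordinator averages the local iterates
    return theta_bar

# Hypothetical example: F_i(theta; x) = theta - b_i[x], so each F_i is strongly
# monotone, and the samples X_i^k follow a sticky two-state Markov chain whose
# stationary distribution is uniform over {0, 1}.
b = [np.array([0.0, 2.0]), np.array([1.0, 3.0])]      # per-agent, per-state targets
F_list = [lambda th, x, bi=bi: th - bi[x] for bi in b]

def sample_next(x, rng):
    return x if rng.random() < 0.8 else 1 - x         # stays in place w.p. 0.8

rng = np.random.default_rng(0)
theta = local_sa(F_list, sample_next, [0.0], alpha=0.02, H=5, K=2000, rng=rng)
# Root of sum_i E[F_i]: 2*theta - (mean(b_1) + mean(b_2)) = 0, i.e. theta* = 1.5.
```

Even though each agent's data are Markovian rather than i.i.d., the averaged iterate settles near the root of the sum of the expected local operators, which is the behavior quantified by the rates derived below.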
2 Local stochastic approximation

We consider a distributed learning framework, where a group of $N$ agents communicate indirectly through a centralized coordinator. Associated with each agent $i$ is a local operator $F_i: \mathbb{R}^d \to \mathbb{R}^d$. The goal of the agents is to find the solution $\theta^*$ satisfying
\[ F(\theta^*) \triangleq \sum_{i=1}^{N} F_i(\theta^*) = 0, \quad (1) \]
where each $F_i: \mathbb{R}^d \to \mathbb{R}^d$ is given as
\[ F_i(\theta) \triangleq \mathbb{E}_{\pi_i}[F_i(\theta; X_i)] = \sum_{X_i \in \mathcal{X}_i} \pi_i(X_i) F_i(\theta; X_i). \quad (2) \]
Here $\mathcal{X}_i$ is a statistical sample space with probability distribution $\pi_i$ at agent $i$. We assume that each agent $i$ has access to the operator $F_i$ only through its samples $\{F_i(\cdot\,; X_i^k)\}$, where $\{X_i^k\}$ is a sequence of samples of the random variable $X_i$. We are interested in the case where each sequence $\{X_i^k\}$ is generated from an ergodic Markov process whose stationary distribution is $\pi_i$. Moreover, the sequences $\{X_i^k\}$ are independent across $i$. For solving problem (1), our focus is to study the local stochastic approximation method, formally stated in Algorithm 1. In this algorithm, we implicitly assume that there is an oracle that returns to agent $i$ the value $F_i(\theta; X_i)$ for a given $\theta$ and $X_i$.

Algorithm 1:
Local stochastic approximation
Initialization: Each agent $i$ initializes $\theta_i^0 \in \mathbb{R}^d$, a sequence of step sizes $\{\alpha_k\}$, and a positive integer $H$. The centralized coordinator initializes $\bar{\theta}_0 = \frac{1}{N}\sum_{i=1}^{N}\theta_i^0$.
for $k = 0, 1, 2, \ldots$ do
    for each agent $i$ do
        1) Receive $\bar{\theta}_k$ sent by the centralized coordinator
        2) Set $\theta_i^{k,0} = \bar{\theta}_k$
        for $t = 0, 1, \ldots, H-1$ do
            \[ \theta_i^{k,t+1} = \theta_i^{k,t} - \alpha_{k+t} F_i(\theta_i^{k,t}; X_i^{k+t}). \quad (3) \]
        end
    end
    The centralized coordinator receives $\theta_i^{k,H}$ from each agent $i$ and implements
    \[ \bar{\theta}_{k+1} = \frac{1}{N}\sum_{i=1}^{N}\theta_i^{k,H}. \quad (4) \]
end

Algorithm 1 is relatively simple to implement, and can be explained as follows. In this algorithm, each agent $i$ maintains a copy $\theta_i$ of the solution $\theta^*$, and the centralized coordinator maintains $\bar{\theta}$ to estimate the average of the $\theta_i$. At any iteration $k \ge 0$, each agent $i$ first receives $\bar{\theta}_k$ from the centralized coordinator and initializes its iterate $\theta_i^{k,0} = \bar{\theta}_k$. Here $\theta_i^{k,t}$ denotes the iterate at iteration $k$ and local time $t \in [0, H-1]$. Agent $i$ then runs $H$ local stochastic approximation steps using the step sizes $\alpha_{k+t}$ and its local data $\{X_i^{k+t}\}$. After $H$ local steps, the agents send their new local iterates $\theta_i^{k,H}$ to the centralized coordinator, which updates $\bar{\theta}_{k+1}$ by taking the average of these local values.

3 Main results

In this section, we study the finite-time performance of Algorithm 1 when each operator $F_i$ is strongly monotone. We provide an upper bound for the convergence rate of the mean square error $\mathbb{E}[\|\bar{\theta}_k - \theta^*\|^2]$ to zero for both constant and time-varying step sizes. In particular, under a constant step size $\alpha_k = \alpha$ and for $k \gtrsim \log(1/\alpha)$, this convergence occurs at a rate
\[ \mathbb{E}[\|\bar{\theta}_{k+1} - \theta^*\|^2] \le \Big(1 - \frac{H\mu\alpha}{N}\Big)^{k+1-\tau(\alpha)} \mathbb{E}[\|\bar{\theta}_{\tau(\alpha)} - \theta^*\|^2] + \mathcal{O}\big(CNHLB\,\alpha\log(1/\alpha)\big), \]
where $\tau(\alpha)$ denotes the mixing time of the underlying Markov chains. On the other hand, under time-varying step sizes $\alpha_k \sim 1/(k+1)$ this rate becomes
\[ \mathbb{E}[\|\bar{\theta}_{k+1} - \theta^*\|^2] \le \mathcal{O}\Big(\frac{\mathbb{E}[\|\bar{\theta}_{K^*} - \theta^*\|^2]}{k^2}\Big) + \mathcal{O}\Big(\frac{CHLB\log(k)}{k}\Big), \quad \forall\, k \ge K^*, \]
for some positive integer $K^*$ depending on the problem parameters $C, L, B, \mu$, which we will define shortly. First, our rates scale linearly with the number of local steps $H$. As expected, when $H$ goes to $\infty$, each agent only runs its local SA without communicating with the centralized coordinator; in this case, one would expect each agent to find only the root of its own operator. Second, one can view the constant $B$ as the variance of the noise. Finally, our rates match the ones of local SGD under i.i.d. noise [27], except for a log factor reflecting the Markovian randomness.

We start our analysis by introducing the following technical assumptions and notation. Given a constant $\alpha > 0$, we denote by $\tau_i(\alpha)$ the mixing time associated with the Markov chain $\{X_i^k\}$, i.e., the smallest integer such that
\[ \|P_i^k(X_i, \cdot) - \pi_i\|_{TV} \le \alpha, \quad \forall\, k \ge \tau_i(\alpha),\ \forall\, X_i \in \mathcal{X}_i,\ \forall\, i \in [N], \quad (5) \]
where $\|\cdot\|_{TV}$ is the total variation distance and $P_i^k(X_i, \cdot)$ is the distribution of $X_i^k$ when the chain starts from $X_i$. The mixing time represents the time for $X_i^k$ to get close to the stationary distribution $\pi_i$. In addition, we denote $\tau(\alpha) = \max_i \tau_i(\alpha)$. For our results in this section, we consider the following assumptions; for simplicity, we assume they hold throughout the rest of this paper.

Assumption 1.
The Markov chain $\{X_i^k\}$ is ergodic (irreducible and aperiodic) with finite state space $\mathcal{X}_i$.

Assumption 2.
The mappings $F_i(\cdot)$ and $F_i(\cdot\,; X_i)$ are Lipschitz continuous in $\theta$ (almost surely), i.e., there exists a positive constant $L$ such that for all $X_i \in \mathcal{X}_i$ and $i \in [N]$
\[ \|F_i(\theta) - F_i(\omega)\| \le L\|\theta - \omega\| \quad \text{and} \quad \|F_i(\theta; X_i) - F_i(\omega; X_i)\| \le L\|\theta - \omega\|, \quad \forall\, \theta, \omega \in \mathbb{R}^d. \quad (6) \]

Assumption 3.
There exists a positive constant $\mu$ such that
\[ (F_i(\theta) - F_i(\omega))^T(\theta - \omega) \ge \mu\|\theta - \omega\|^2, \quad \forall\, \theta, \omega \in \mathbb{R}^d,\ \forall\, i \in [N]. \quad (7) \]

Assumption 1 implies that the Markov chain $\{X_i^k\}$ has a geometric mixing time, which depends on the second largest eigenvalue of its transition probability matrix [31]. This assumption holds in various applications, e.g., in incremental optimization [32] and in reinforcement learning problems modeled by Markov decision processes with a finite number of states and actions [33]. In addition, Assumption 1 is used in the existing literature to study the finite-time performance of SA and its distributed variants under Markov randomness; see [13, 16, 19, 20, 34, 35] and the references therein. Assumption 2 is often used for local SGD methods in federated learning [27]; it also holds in the context of reinforcement learning considered in Section 4. Assumption 3 implies that each $F_i$ is strongly monotone (strongly convex in the context of SGD).

The following result is a consequence of Assumptions 1 and 2, shown in [20]. This lemma states that each Markov chain $X_i$ has a geometric mixing time, which translates to the operator $F_i$ through the Lipschitz condition.

Lemma 1.
There exists a constant $C > 0$ such that, given $\alpha > 0$, we have for all $i \in [N]$: $\tau_i(\alpha) \le C\log(1/\alpha)$ and
\[ \big\|\mathbb{E}\big[F_i(\theta; X_i^k) \,\big|\, X_i^0 = X_i\big] - F_i(\theta)\big\| \le \alpha\,(\|\theta\| + 1), \quad \forall\, \theta,\ \forall\, k \ge \tau_i(\alpha). \quad (8) \]

Finally, since $\mathcal{X}_i$ is finite for all $i \in [N]$, Assumption 2 also gives the following result.

Lemma 2.
Let $B = \max_i \max_{X_i \in \mathcal{X}_i} \|F_i(0; X_i)\|$. Then we have for all $\theta \in \mathbb{R}^d$
\[ \|F_i(\theta)\| \le B(\|\theta\| + 1) \quad \text{and} \quad \|F_i(\theta; X_i)\| \le B(\|\theta\| + 1), \quad \forall\, X_i \in \mathcal{X}_i,\ \forall\, i \in [N]. \quad (9) \]

3.1 Constant step sizes

In this subsection, we derive the rate of Algorithm 1 under constant step sizes, that is, $\alpha_k = \alpha$ for all $k \ge 0$. Note that by Lemma 1, $\tau_i(\alpha) \le C\log(1/\alpha)$ for a given constant $\alpha > 0$. Thus, $\lim_{\alpha \to 0} \tau_i(\alpha)\alpha = 0$ for all $i \in [N]$. This implies that there exists a sufficiently small positive $\alpha$ such that
\[ \alpha\tau(\alpha) \le \min\Big\{\frac{\log 2}{2BH},\ \frac{\mu}{8N(19B^2H + 9 + 57LBH)},\ \frac{N}{H\mu}\Big\}, \quad (10) \]
where we recall that $\tau(\alpha) = \max_i \tau_i(\alpha)$. The results in this section are established under condition (10). Under constant step sizes, we have from (3) for all $k \ge 0$
\[ \theta_i^{k,H} = \theta_i^{k,0} - \alpha\sum_{t=0}^{H-1} F_i(\theta_i^{k,t}; X_i^{k+t}) = \bar{\theta}_k - \alpha\sum_{t=0}^{H-1} F_i(\theta_i^{k,t}; X_i^{k+t}), \quad (11) \]
which implies
\[ \bar{\theta}_{k+1} = \bar{\theta}_k - \frac{\alpha}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1} F_i(\theta_i^{k,t}; X_i^{k+t}). \quad (12) \]
To derive the finite-time bound of Algorithm 1, we require the following three technical lemmas. For ease of exposition, their proofs are presented in the Appendix. The first lemma upper bounds the norm of $\theta_i$ by the norm of $\bar{\theta}$.

Lemma 3.
Let $\{\bar{\theta}_k\}$ and $\{\theta_i^{k,t}\}$, for all $k \ge 0$ and $t \in [1, H]$, be generated by Algorithm 1. In addition, let the step size $\alpha$ satisfy (10). Then the following relations hold for all $k \ge 0$ and $t \in [0, H]$
\[ \|\theta_i^{k,t}\| \le (1 + 2BH\alpha)\|\bar{\theta}_k\| + 2BH\alpha \le 2\|\bar{\theta}_k\| + 1, \quad (13) \]
\[ \|\theta_i^{k,t} - \bar{\theta}_k\| \le 2BH\alpha\|\bar{\theta}_k\| + 2BH\alpha. \quad (14) \]

Our next lemma provides an upper bound for the quantity $\|\bar{\theta}_k - \bar{\theta}_{k-\tau(\alpha)}\|$.

Lemma 4.
Let all the conditions in Lemma 3 hold. Then the following relations hold for all $k \ge \tau(\alpha)$ and $t \in [0, H]$
\[ \|\bar{\theta}_k - \bar{\theta}_{k-\tau(\alpha)}\| \le 6BH\alpha\tau(\alpha)\|\bar{\theta}_k\| + 12BH\alpha\tau(\alpha) \le \|\bar{\theta}_k\| + 2, \quad (15) \]
\[ \|\bar{\theta}_k - \bar{\theta}_{k-\tau(\alpha)}\|^2 \le 72B^2H^2\alpha^2\tau^2(\alpha)\|\bar{\theta}_k\|^2 + 288B^2H^2\alpha^2\tau^2(\alpha) \le \|\bar{\theta}_k\|^2 + 8. \quad (16) \]

Finally, we present an upper bound for the bias caused by the Markovian noise.

Lemma 5.
Let all the conditions in Lemma 3 hold. Then we have
\[ -\sum_{i=1}^{N}\sum_{t=0}^{H-1}\mathbb{E}\big[\big\langle \bar{\theta}_k - \theta^*,\ F_i(\theta_i^{k,t}; X_i^{k+t}) - F_i(\bar{\theta}_k)\big\rangle\big] \le 36NH\alpha\,\mathbb{E}[\|\bar{\theta}_k - \theta^*\|^2] + 36NH\alpha(1 + \|\theta^*\|^2) \]
\[ \qquad + 12(19L + 6B)NBH^2\alpha\tau(\alpha)\,\mathbb{E}[\|\bar{\theta}_k - \theta^*\|^2] + 12(19L + 6B)NBH^2(1 + \|\theta^*\|^2)\,\alpha\tau(\alpha). \quad (17) \]

Our first main result is presented in the following theorem, where we derive the rate of Algorithm 1 under constant step sizes.

Theorem 1. Let $\{\bar{\theta}_k\}$ and $\{\theta_i^{k,t}\}$, for all $k \ge 0$ and $t \in [1, H]$, be generated by Algorithm 1. In addition, let the step size $\alpha$ satisfy (10). Then we have
\[ \mathbb{E}[\|\bar{\theta}_{k+1} - \theta^*\|^2] \le \Big(1 - \frac{H\mu\alpha}{N}\Big)^{k+1-\tau(\alpha)}\mathbb{E}[\|\bar{\theta}_{\tau(\alpha)} - \theta^*\|^2] + \frac{8NC\big(B^2H + 9 + 3(19L + 6B)BH\big)(\|\theta^*\|^2 + 1)}{\mu}\,\alpha\log(1/\alpha). \quad (18) \]

Remark 1.
Under constant step sizes, the rate in (18) shows that the mean square error generated by Algorithm 1 decays exponentially to a ball around the origin. As $\alpha$ decays to zero, this error also goes to zero. Second, our rate differs from the one using i.i.d. data only by a log factor, see for example [27]; this reflects the impact of Markovian randomness through the mixing time $\tau(\alpha)$. Third, our upper bound scales linearly with the number of local steps $H$.

Proof. By (12) we consider
\[ \|\bar{\theta}_{k+1} - \theta^*\|^2 = \Big\|\bar{\theta}_k - \theta^* - \frac{\alpha}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}F_i(\theta_i^{k,t}; X_i^{k+t})\Big\|^2 \]
\[ = \|\bar{\theta}_k - \theta^*\|^2 + \Big\|\frac{\alpha}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}F_i(\theta_i^{k,t}; X_i^{k+t})\Big\|^2 - \frac{2\alpha}{N}\Big\langle \bar{\theta}_k - \theta^*,\ \sum_{i=1}^{N}\sum_{t=0}^{H-1}F_i(\theta_i^{k,t}; X_i^{k+t})\Big\rangle \]
\[ = \|\bar{\theta}_k - \theta^*\|^2 + \Big\|\frac{\alpha}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}F_i(\theta_i^{k,t}; X_i^{k+t})\Big\|^2 - \frac{2H\alpha}{N}\big\langle \bar{\theta}_k - \theta^*,\ F(\bar{\theta}_k)\big\rangle - \frac{2\alpha}{N}\Big\langle \bar{\theta}_k - \theta^*,\ \sum_{i=1}^{N}\sum_{t=0}^{H-1}\big(F_i(\theta_i^{k,t}; X_i^{k+t}) - F_i(\bar{\theta}_k)\big)\Big\rangle. \quad (19) \]
First, using (9) and (13) we bound the second term on the right-hand side of (19):
\[ \Big\|\frac{\alpha}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}F_i(\theta_i^{k,t}; X_i^{k+t})\Big\|^2 \le \frac{\alpha^2}{N^2}\Big(\sum_{i=1}^{N}\sum_{t=0}^{H-1}B(\|\theta_i^{k,t}\| + 1)\Big)^2 \le \frac{\alpha^2}{N^2}\Big(\sum_{i=1}^{N}\sum_{t=0}^{H-1}2B(\|\bar{\theta}_k\| + 1)\Big)^2 \]
\[ \le 4B^2H^2\alpha^2(\|\bar{\theta}_k\| + 1)^2 \le 8B^2H^2\alpha^2\|\bar{\theta}_k - \theta^*\|^2 + 8B^2H^2(\|\theta^*\| + 1)^2\alpha^2. \quad (20) \]
Second, by Assumption 3 we have
\[ -\frac{2H\alpha}{N}\big\langle \bar{\theta}_k - \theta^*,\ F(\bar{\theta}_k)\big\rangle \le -\frac{2H\mu\alpha}{N}\|\bar{\theta}_k - \theta^*\|^2. \quad (21) \]
Thus, taking the expectation on both sides of (19) and using (17), (20), and (21) yields
\[ \mathbb{E}[\|\bar{\theta}_{k+1} - \theta^*\|^2] \le \Big(1 - \frac{2H\mu\alpha}{N}\Big)\mathbb{E}[\|\bar{\theta}_k - \theta^*\|^2] + 8B^2H^2\alpha^2\,\mathbb{E}[\|\bar{\theta}_k - \theta^*\|^2] + 8B^2H^2(\|\theta^*\| + 1)^2\alpha^2 \]
\[ \qquad + 72H\alpha^2\,\mathbb{E}[\|\bar{\theta}_k - \theta^*\|^2] + 72H\alpha^2(1 + \|\theta^*\|^2) + 24(19L + 6B)BH^2\alpha^2\tau(\alpha)\,\mathbb{E}[\|\bar{\theta}_k - \theta^*\|^2] + 24(19L + 6B)BH^2(1 + \|\theta^*\|^2)\alpha^2\tau(\alpha) \]
\[ \le \Big(1 - \frac{2H\mu\alpha}{N}\Big)\mathbb{E}[\|\bar{\theta}_k - \theta^*\|^2] + 8H\big(B^2H + 9 + 3(19L + 6B)BH\big)\tau(\alpha)\alpha^2\,\mathbb{E}[\|\bar{\theta}_k - \theta^*\|^2] + 8H\big(B^2H + 9 + 3(19L + 6B)BH\big)(\|\theta^*\|^2 + 1)\tau(\alpha)\alpha^2. \]
Since $\alpha$ satisfies (10) and $\tau(\alpha) \le C\log(1/\alpha)$ by Lemma 1, the second term on the right-hand side can be absorbed into the contraction factor, which yields, for all $k \ge \tau(\alpha)$,
\[ \mathbb{E}[\|\bar{\theta}_{k+1} - \theta^*\|^2] \le \Big(1 - \frac{H\mu\alpha}{N}\Big)\mathbb{E}[\|\bar{\theta}_k - \theta^*\|^2] + 8H\big(B^2H + 9 + 3(19L + 6B)BH\big)(\|\theta^*\|^2 + 1)\tau(\alpha)\alpha^2 \]
\[ \le \Big(1 - \frac{H\mu\alpha}{N}\Big)^{k+1-\tau(\alpha)}\mathbb{E}[\|\bar{\theta}_{\tau(\alpha)} - \theta^*\|^2] + 8H\big(B^2H + 9 + 3(19L + 6B)BH\big)(\|\theta^*\|^2 + 1)\tau(\alpha)\alpha^2\sum_{t=\tau(\alpha)}^{k}\Big(1 - \frac{H\mu\alpha}{N}\Big)^{k-t} \]
\[ \le \Big(1 - \frac{H\mu\alpha}{N}\Big)^{k+1-\tau(\alpha)}\mathbb{E}[\|\bar{\theta}_{\tau(\alpha)} - \theta^*\|^2] + \frac{8NC\big(B^2H + 9 + 3(19L + 6B)BH\big)(\|\theta^*\|^2 + 1)}{\mu}\,\alpha\log(1/\alpha), \]
which is (18).

3.2 Time-varying step sizes

In this subsection, we derive the finite-time bound of Algorithm 1 under time-varying step sizes $\alpha_k$, where $\{\alpha_k\}$ is nonnegative, decreasing, and satisfies $\lim_{k\to\infty}\alpha_k = 0$. Thus, by Lemma 1 we have $\lim_{k\to\infty}\tau_i(\alpha_k)\alpha_k = 0$ for all $i \in [N]$. This implies that there exists a positive integer $K^*$ such that for all $k \ge K^*$
\[ \sum_{t=k-\tau(\alpha_k)}^{k}\alpha_t \le \alpha_{k-\tau(\alpha_k)}\tau(\alpha_k) \le \min\Big\{\frac{\log 2}{2BH},\ \frac{\mu}{8N(19B^2H + 9 + 57LBH)},\ \alpha\Big\}, \quad (22) \]
where we recall that $\tau(\alpha_k) = \max_i \tau_i(\alpha_k)$. For convenience, we denote
\[ \alpha_{k;\tau(\alpha_k)} = \sum_{t=k-\tau(\alpha_k)}^{k}\alpha_t. \]
\quad (23)

Under the step sizes $\{\alpha_k\}$, we have from (3) for all $k \ge 0$
\[ \theta_i^{k,H} = \theta_i^{k,0} - \sum_{t=0}^{H-1}\alpha_{k+t}F_i(\theta_i^{k,t}; X_i^{k+t}) = \bar{\theta}_k - \sum_{t=0}^{H-1}\alpha_{k+t}F_i(\theta_i^{k,t}; X_i^{k+t}), \quad (24) \]
which implies
\[ \bar{\theta}_{k+1} = \bar{\theta}_k - \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}F_i(\theta_i^{k,t}; X_i^{k+t}). \quad (25) \]
Similar to the case of constant step sizes, we first present three technical lemmas used in our main result in Theorem 2 below. For ease of exposition, their proofs are presented in the Appendix. The first lemma upper bounds the norm of $\theta_i$ by the norm of $\bar{\theta}$.

Lemma 6.
Let $\{\bar{\theta}_k\}$ and $\{\theta_i^{k,t}\}$, for all $k \ge 0$ and $t \in [1, H]$, be generated by Algorithm 1. In addition, let the step sizes $\{\alpha_k\}$ satisfy (22). Then the following relations hold for all $k \ge 0$ and $t \in [0, H]$
\[ \|\theta_i^{k,t}\| \le (1 + 2BH\alpha_k)\|\bar{\theta}_k\| + 2BH\alpha_k \le 2\|\bar{\theta}_k\| + 1, \quad (26) \]
\[ \|\theta_i^{k,t} - \bar{\theta}_k\| \le 2BH\alpha_k\|\bar{\theta}_k\| + 2BH\alpha_k. \quad (27) \]

Our next lemma provides an upper bound for the quantity $\|\bar{\theta}_k - \bar{\theta}_{k-\tau(\alpha_k)}\|$.

Lemma 7.
Let all the conditions in Lemma 6 hold. Then the following relations hold for all $k \ge K^*$ and $t \in [0, H]$
\[ \|\bar{\theta}_k - \bar{\theta}_{k-\tau(\alpha_k)}\| \le 6BH\alpha_{k;\tau(\alpha_k)}\|\bar{\theta}_k\| + 12BH\alpha_{k;\tau(\alpha_k)} \le \|\bar{\theta}_k\| + 2, \quad (28) \]
\[ \|\bar{\theta}_k - \bar{\theta}_{k-\tau(\alpha_k)}\|^2 \le 72B^2H^2\alpha_{k;\tau(\alpha_k)}^2\|\bar{\theta}_k\|^2 + 288B^2H^2\alpha_{k;\tau(\alpha_k)}^2 \le \|\bar{\theta}_k\|^2 + 8. \quad (29) \]

Finally, we present an upper bound for the bias caused by the Markovian noise.

Lemma 8.
Let all the conditions in Lemma 6 hold. Then we have
\[ -\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\,\mathbb{E}\big[\big\langle \bar{\theta}_k - \theta^*,\ F_i(\theta_i^{k,t}; X_i^{k+t}) - F_i(\bar{\theta}_k)\big\rangle\big] \le 36NH\alpha_k^2\,\mathbb{E}[\|\bar{\theta}_k - \theta^*\|^2] + 36NH\alpha_k^2(1 + \|\theta^*\|^2) \]
\[ \qquad + 12(19L + 6B)NBH^2\alpha_k\alpha_{k;\tau(\alpha_k)}\,\mathbb{E}[\|\bar{\theta}_k - \theta^*\|^2] + 12(19L + 6B)NBH^2(1 + \|\theta^*\|^2)\,\alpha_k\alpha_{k;\tau(\alpha_k)}. \quad (30) \]

The second main result in this paper is presented in the following theorem, where we study the rate of Algorithm 1 under time-varying step sizes.

Theorem 2.
Let $\{\bar{\theta}_k\}$ and $\{\theta_i^{k,t}\}$, for all $k \ge 0$ and $t \in [1, H]$, be generated by Algorithm 1. In addition, let the step sizes $\alpha_k = \alpha/(k+1)$ satisfy (22), where $\alpha = 2N/(H\mu)$. Then we have for all $k \ge K^*$
\[ \mathbb{E}[\|\bar{\theta}_{k+1} - \theta^*\|^2] \le \frac{(K^*)^2\,\mathbb{E}[\|\bar{\theta}_{K^*} - \theta^*\|^2]}{(k+1)^2} + \frac{16\alpha^2CH\big(B^2H + 9 + 3(19L + 6B)BH\big)(\|\theta^*\|^2 + 1)\log\big(\frac{k+1}{\alpha}\big)}{k+1}. \quad (31) \]

Remark 2.
Here, we have the same observations as in Theorem 1, except that the rate is now sublinear due to the time-varying step sizes. On the other hand, the mean square error decays to zero instead of to a neighborhood of the origin.

Proof. By (25) we consider
\[ \|\bar{\theta}_{k+1} - \theta^*\|^2 = \Big\|\bar{\theta}_k - \theta^* - \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}F_i(\theta_i^{k,t}; X_i^{k+t})\Big\|^2 \]
\[ = \|\bar{\theta}_k - \theta^*\|^2 + \Big\|\frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}F_i(\theta_i^{k,t}; X_i^{k+t})\Big\|^2 - \frac{2}{N}\Big\langle \bar{\theta}_k - \theta^*,\ \sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}F_i(\theta_i^{k,t}; X_i^{k+t})\Big\rangle \]
\[ = \|\bar{\theta}_k - \theta^*\|^2 + \Big\|\frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}F_i(\theta_i^{k,t}; X_i^{k+t})\Big\|^2 - \frac{2}{N}\big\langle \bar{\theta}_k - \theta^*,\ F(\bar{\theta}_k)\big\rangle\sum_{t=0}^{H-1}\alpha_{k+t} - \frac{2}{N}\Big\langle \bar{\theta}_k - \theta^*,\ \sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}\big(F_i(\theta_i^{k,t}; X_i^{k+t}) - F_i(\bar{\theta}_k)\big)\Big\rangle. \quad (32) \]
First, using (9) and (26) we bound the second term on the right-hand side of (32):
\[ \Big\|\frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{H-1}\alpha_{k+t}F_i(\theta_i^{k,t}; X_i^{k+t})\Big\|^2 \le \frac{\alpha_k^2}{N^2}\Big(\sum_{i=1}^{N}\sum_{t=0}^{H-1}B(\|\theta_i^{k,t}\| + 1)\Big)^2 \le 4B^2H^2\alpha_k^2(\|\bar{\theta}_k\| + 1)^2 \le 8B^2H^2\alpha_k^2\|\bar{\theta}_k - \theta^*\|^2 + 8B^2H^2(\|\theta^*\| + 1)^2\alpha_k^2. \quad (33) \]
Second, by Assumption 3 we have
\[ -\frac{2}{N}\big\langle \bar{\theta}_k - \theta^*,\ F(\bar{\theta}_k)\big\rangle\sum_{t=0}^{H-1}\alpha_{k+t} \le -\frac{2\mu}{N}\sum_{t=0}^{H-1}\alpha_{k+t}\,\|\bar{\theta}_k - \theta^*\|^2 \le -\frac{H\mu\alpha_k}{N}\|\bar{\theta}_k - \theta^*\|^2, \quad (34) \]
where the last inequality uses $\sum_{t=0}^{H-1}\alpha_{k+t} \ge \frac{H}{2}\alpha_k$, which holds for all $k \ge H$. Thus, taking the expectation on both sides of (32) and using (30), (33), and (34) yields
\[ \mathbb{E}[\|\bar{\theta}_{k+1} - \theta^*\|^2] \le \Big(1 - \frac{H\mu\alpha_k}{N}\Big)\mathbb{E}[\|\bar{\theta}_k - \theta^*\|^2] + 8B^2H^2\alpha_k^2\,\mathbb{E}[\|\bar{\theta}_k - \theta^*\|^2] + 8B^2H^2(\|\theta^*\| + 1)^2\alpha_k^2 + 72H\alpha_k^2\,\mathbb{E}[\|\bar{\theta}_k - \theta^*\|^2] + 72H\alpha_k^2(1 + \|\theta^*\|^2) \]
\[ \qquad + 24(19L + 6B)BH^2\alpha_k\alpha_{k;\tau(\alpha_k)}\,\mathbb{E}[\|\bar{\theta}_k - \theta^*\|^2] + 24(19L + 6B)BH^2(1 + \|\theta^*\|^2)\alpha_k\alpha_{k;\tau(\alpha_k)} \]
\[ \le \Big(1 - \frac{H\mu\alpha_k}{N}\Big)\mathbb{E}[\|\bar{\theta}_k - \theta^*\|^2] + 8H\big(B^2H + 9 + 3(19L + 6B)BH\big)\alpha_k\alpha_{k;\tau(\alpha_k)}\,\mathbb{E}[\|\bar{\theta}_k - \theta^*\|^2] + 8H\big(B^2H + 9 + 3(19L + 6B)BH\big)(\|\theta^*\|^2 + 1)\alpha_k\alpha_{k;\tau(\alpha_k)}. \quad (35) \]
Recall that $\alpha_k = \alpha/(k+1)$ satisfies (22), where $\alpha = 2N/(H\mu)$. Then
\[ (k+1)^2\Big(1 - \frac{H\mu\alpha_k}{N}\Big) = (k+1)(k-1) \le k^2. \]
In addition, by Lemma 1 and (22) we have
\[ (k+1)\,\alpha_{k;\tau(\alpha_k)} \le \frac{\alpha\tau(\alpha_k)(k+1)}{k+1-\tau(\alpha_k)} \le 2\alpha\tau(\alpha_k) \le 2C\alpha\log\Big(\frac{k+1}{\alpha}\Big). \]
Thus, multiplying both sides of (35) by $(k+1)^2$ and using (22) to absorb the term involving $\alpha_k\alpha_{k;\tau(\alpha_k)}\,\mathbb{E}[\|\bar{\theta}_k - \theta^*\|^2]$ into the contraction factor yields
\[ (k+1)^2\,\mathbb{E}[\|\bar{\theta}_{k+1} - \theta^*\|^2] \le k^2\,\mathbb{E}[\|\bar{\theta}_k - \theta^*\|^2] + 16\alpha^2CH\big(B^2H + 9 + 3(19L + 6B)BH\big)(\|\theta^*\|^2 + 1)\log\Big(\frac{k+1}{\alpha}\Big) \]
\[ \le (K^*)^2\,\mathbb{E}[\|\bar{\theta}_{K^*} - \theta^*\|^2] + 16\alpha^2CH\big(B^2H + 9 + 3(19L + 6B)BH\big)(\|\theta^*\|^2 + 1)\sum_{t=K^*}^{k}\log\Big(\frac{t+1}{\alpha}\Big) \]
\[ \le (K^*)^2\,\mathbb{E}[\|\bar{\theta}_{K^*} - \theta^*\|^2] + 16\alpha^2CH\big(B^2H + 9 + 3(19L + 6B)BH\big)(\|\theta^*\|^2 + 1)\,k\log\Big(\frac{k+1}{\alpha}\Big). \]
Dividing both sides by $(k+1)^2$ yields (31).

4 Motivating applications
In this section, we consider three concrete applications in federated learning [2, 3] and multi-task reinforcement learning [36], which can be formulated as problem (1). Thus, these problems can be solved by Algorithm 1, and one can use our results to provide their theoretical guarantees.
4.1 Federated learning

In federated learning, multiple agents (clients or workers) collaboratively solve a machine learning problem under the coordination of a centralized server [2, 3]. Instead of sharing their data with the server, the agents run local updates of their models (parameters) based on their data, whose results are aggregated at the server with the goal of optimizing the global learning objective. Such an approach has gained much interest recently due to its efficiency in data processing, system privacy, and operating costs.

A central problem in federated learning is the distributed (or federated) optimization problem, where the goal is to solve
\[ \operatorname*{minimize}_{\theta\in\mathbb{R}^d}\ G(\theta) \triangleq \frac{1}{N}\sum_{i=1}^{N}G_i(\theta). \quad (36) \]
Here each $G_i(\theta) = \mathbb{E}_{X_i\sim\pi_i}[G_i(\theta, X_i)]$ is the loss function and $\pi_i$ is the distribution of the data located at agent $i$. For $i \ne j$, $\pi_i$ and $\pi_j$ can be very different, which is referred to as data heterogeneity across the agents. The most popular method for solving (36) is the so-called local stochastic gradient descent (SGD) [21-27], which can be viewed as a variant of Algorithm 1. In particular, let $F_i(\theta) = \nabla G_i(\theta)$ and let $F_i(\theta, X_i)$ be its stochastic (sub)gradient $\nabla G_i(\theta, X_i)$. Then at each iteration $k \ge 0$, each agent $i$ initializes $\theta_i^{k,0} = \bar{\theta}_k$ and runs $H$ steps of local SGD to update $\theta_i^{k,t}$. These values are then aggregated by the server to update $\bar{\theta}_{k+1}$, i.e.,
\[ \text{Agent } i:\quad \theta_i^{k,t+1} = \theta_i^{k,t} - \alpha_{k+t}\nabla G_i(\theta_i^{k,t}, X_i^{k+t}), \quad \theta_i^{k,0} = \bar{\theta}_k,\ t \in [0, H-1], \]
\[ \text{Server:}\quad \bar{\theta}_{k+1} = \frac{1}{N}\sum_{i=1}^{N}\theta_i^{k,H}. \]
In the federated optimization literature, it is often assumed that the samples $\{X_i^k\}$ are drawn i.i.d. from $\pi_i$ and that the resulting stochastic gradients are unbiased, i.e., $\nabla G_i(\theta) = \mathbb{E}[\nabla G_i(\theta, X_i^k)]$. In addition, the variance of these samples is assumed to be bounded. These assumptions are a special case of the ones considered in this paper.
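As a small illustration of the local SGD scheme above, the following sketch solves a hypothetical scalar instance of (36) with quadratic losses $G_i(\theta) = \mathbb{E}_{x\sim\pi_i}[\frac{1}{2}(\theta - x)^2]$, so that $\nabla G_i(\theta, x) = \theta - x$. The Gaussian data distributions, the step-size schedule, and all constants are assumptions made for this example only.

```python
import numpy as np

# Hypothetical instance of (36): two agents with quadratic losses
# G_i(theta) = E_{x ~ pi_i}[0.5 * (theta - x)^2], so grad G_i(theta, x) = theta - x.
rng = np.random.default_rng(1)
data = [rng.normal(-1.0, 1.0, size=5000),    # agent 1's local data (heterogeneous pi_1)
        rng.normal(3.0, 1.0, size=5000)]     # agent 2's local data (heterogeneous pi_2)

N, H, K = 2, 10, 400
theta_bar = 0.0
for k in range(K):
    alpha = 1.0 / (k + 1)                    # time-varying step size
    local = []
    for i in range(N):
        th = theta_bar                        # initialize from the server iterate
        for t in range(H):
            x = rng.choice(data[i])           # i.i.d. sample from agent i's data
            th -= alpha * (th - x)            # one local SGD step on grad G_i(., x)
        local.append(th)
    theta_bar = sum(local) / N                # server averages the local models
# The minimizer of G = (G_1 + G_2)/2 is the midpoint of the local means, here 1.0.
```

Despite the heterogeneous local objectives, the averaged iterate approaches the minimizer of the global loss as the step size decays, which is exactly the regime covered by Theorem 2.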
4.2 Multi-task reinforcement learning

We consider a multi-task reinforcement learning (MTRL) problem over a network of $N$ agents operating in $N$ different environments modeled by Markov decision processes (MDPs). Here, each environment represents a task assigned to an agent. We assume that the agents can communicate directly with a centralized coordinator. This distributed MTRL problem over a multi-agent system can be characterized mathematically by $N$ different MDPs as follows.

Let $\mathcal{M}_i = (\mathcal{S}_i, \mathcal{A}_i, \mathcal{P}_i, \mathcal{R}_i, \gamma_i)$ be a 5-tuple representing the discounted-reward MDP at agent $i$, where $\mathcal{S}_i$, $\mathcal{A}_i$, and $\mathcal{P}_i$ are the set of states, the set of actions, and the transition probability matrices, respectively. In addition, $\mathcal{R}_i$ is the reward function and $\gamma_i \in (0, 1)$ is the discount factor. Note that the sets of states and actions at the agents can (partially) overlap with each other, and we denote $\mathcal{S} = \cup_i\mathcal{S}_i$ and $\mathcal{A} = \cup_i\mathcal{A}_i$. We focus on randomized stationary policies (RSPs), where a policy $\pi$ assigns to each $s \in \mathcal{S}$ a probability distribution $\pi(\cdot|s)$ over $\mathcal{A}$.

Given a policy $\pi$, let $V_i^\pi$ be the value function associated with the $i$-th environment,
\[ V_i^\pi(s) = \mathbb{E}\Big[\sum_{k=0}^{\infty}\gamma_i^k\,\mathcal{R}_i(s_i^k, a_i^k)\,\Big|\, s_i^0 = s,\ a_i^k \sim \pi(\cdot|s_i^k)\Big]. \]
Similarly, we denote by $Q_i^\pi$ the $Q$-function in the $i$-th environment,
\[ Q_i^\pi(s, a) = \mathbb{E}\Big[\sum_{k=0}^{\infty}\gamma_i^k\,\mathcal{R}_i(s_i^k, a_i^k)\,\Big|\, s_i^0 = s,\ a_i^0 = a\Big]. \]
Let $\rho_i$ be an initial state distribution over $\mathcal{S}_i$, and with some abuse of notation denote the long-term reward associated with this distribution as $V_i^\pi(\rho_i) = \mathbb{E}_{s\sim\rho_i}[V_i^\pi(s)]$. The goal of the agents is to cooperatively find a policy $\pi^*$ that maximizes the total accumulated discounted reward
\[ \max_\pi\ V(\pi; \rho) \triangleq \sum_{i=1}^{N}V_i^\pi(\rho_i), \qquad \rho = (\rho_1, \ldots, \rho_N). \quad (37) \]
Treating the environments as independent RL problems would produce different policies $\pi_i^*$, each maximizing its respective $V_i^\pi$. The goal of MTRL is instead to find a single $\pi^*$ that balances the performance across all environments. In the following, we present two fundamental problems in this area, which can be formulated as problem (2). As a consequence, we can apply Algorithm 1 to solve these problems in a distributed manner.

TD($\lambda$) with linear function approximation

One of the most fundamental problems in RL is the policy evaluation problem, where the goal is to estimate the value function $V^\pi$ associated with a stationary policy $\pi$. This problem arises as a subproblem in RL policy search methods, including policy iteration and actor-critic methods. Our focus here is the multi-task variant of policy evaluation; that is, we want to estimate the sum of the value functions $V_i^\pi$ of a stationary policy $\pi$. In addition, we study this problem when the state space $\mathcal{S}$ is very large, motivating the use of function approximation. We consider a linear approximation $\tilde{V}_i^\theta$ of $V_i^\pi$ parameterized by a weight vector $\theta \in \mathbb{R}^L$ and given as
\[ \tilde{V}_i^\theta(s) = \sum_{\ell=1}^{L}\theta_\ell\,\phi_{i,\ell}(s), \quad \forall\, s \in \mathcal{S}, \]
for a given set of $L$ basis functions $\phi_{i,\ell}: \mathcal{S} \to \mathbb{R}$, $\ell \in \{1, \ldots, L\}$; some examples of how to choose these functions can be found in [37]. Here we assume that $\phi_{i,\ell}(s) = 0$ if $s \notin \mathcal{S}_i$, implying $\tilde{V}_i^\theta(s) = 0$ for such $s$. We are interested in the case $L \ll M = |\mathcal{S}|$. Our goal is to find a $\theta^*$ that provides a good approximation of the sum of the value functions at the agents, i.e.,
\[ \sum_{i=1}^{N}V_i^\pi \approx \sum_{i=1}^{N}\Phi_i\theta^*, \]
where $\Phi_i \in \mathbb{R}^{M\times L}$ is the feature matrix whose $s$-th row is the feature vector of agent $i$, $\phi_i(s) = (\phi_{i,1}(s), \ldots, \phi_{i,L}(s))^T \in \mathbb{R}^L$.

Distributed TD($\lambda$). For solving this problem, we consider a distributed variant of TD($\lambda$), originally studied in [38] and analyzed explicitly in [15, 16, 39]. For simplicity, we consider the case $\lambda = 0$; the case $\lambda \in (0, 1]$ can be handled in a similar manner.
In particular, each agent $i$ maintains an estimate $\theta_i$ of $\theta^*$ and the centralized coordinator maintains $\bar{\theta}$, the average of these $\theta_i$. At each iteration $k \ge 0$, each agent $i$ initializes $\theta_i^{k,0} = \bar{\theta}_k$ and runs $H$ steps of local TD(0) to update $\theta_i^{k,t}$. These values are then aggregated by the server to update $\bar{\theta}_{k+1}$, i.e., set $\theta_i^{k,0} = \bar{\theta}_k$ and for all $t \in [0, H-1]$
\[ \text{Agent } i:\quad \theta_i^{k,t+1} = \theta_i^{k,t} + \alpha_{k+t}\Big(R_i^{k+t} + \gamma_i\,\phi_i(s_i^{k+t+1})^T\theta_i^{k,t} - \phi_i(s_i^{k+t})^T\theta_i^{k,t}\Big)\phi_i(s_i^{k+t}), \]
\[ \text{Server:}\quad \bar{\theta}_{k+1} = \frac{1}{N}\sum_{i=1}^{N}\theta_i^{k,H}, \quad (38) \]
where $R_i^k = \mathcal{R}_i(s_i^k, a_i^k)$ and $\{s_i^k, s_i^{k+1}, R_i^k\}$ is the data tuple observed at time $k$ at agent $i$. Let $\{X_i^k\} = \{(s_i^k, s_i^{k+1}, a_i^k)\}$ be a Markov chain. The update above can be viewed as a local stochastic approximation for finding the root of a linear operator. Indeed, let $A_i(X_i^k)$ and $b_i(X_i^k)$ be defined as
\[ A_i(X_i^k) = \phi_i(s_i^k)\big(\gamma_i\phi_i(s_i^{k+1}) - \phi_i(s_i^k)\big)^T, \qquad b_i(X_i^k) = R_i^k\,\phi_i(s_i^k). \]
Moreover, let $\pi_i$ be the stationary distribution of the underlying Markov chain $\{X_i^k\}$ and
\[ A_i = \mathbb{E}_{\pi_i}\big[A_i(X_i^k)\big], \qquad b_i = \mathbb{E}_{\pi_i}\big[b_i(X_i^k)\big]. \]
Thus, if we set $F_i(\theta; X_i^k) = -A_i(X_i^k)\theta - b_i(X_i^k)$, then (38) is a variant of Algorithm 1 where each $F_i$ is linear in $\theta$. In this case, local TD(0) seeks to find $\theta^*$ satisfying
\[ \sum_{i=1}^{N}\big(A_i\theta^* + b_i\big) = 0. \]
To establish the convergence of (38), the following conditions are assumed in the literature [15, 16, 39].
Assumption 4. The instantaneous rewards at the agents are uniformly bounded, i.e., there exists a constant $R$ such that $|R_i(s, s')| \leq R$ for all $s, s' \in \mathcal{S}$ and $i \in [N]$.

Assumption 5. For each $i \in [N]$, the feature vectors $\{\phi_{i,\ell}\}$, $\ell \in \{1, \ldots, L\}$, are linearly independent, i.e., the matrix $\Phi_i$ has full column rank. In addition, we assume that all feature vectors $\phi_i(s)$ are uniformly bounded, i.e., $\|\phi_i(s)\| \leq 1$.

Assumption 6. Each Markov chain $\{X_i^k\}$ is irreducible and aperiodic.

Under Assumptions 4–6, one can verify that Assumptions 1–3 hold [39]. For example, under these assumptions each $A_i$ is a negative definite matrix, i.e., $x^T A_i x < 0$ for all $x \neq 0$.

Distributed Q-learning with linear function approximation

In this section, we consider a distributed variant of the classic Q-learning method [40] for solving problem (37). Similar to the case of TD($\lambda$), we focus on the linear function approximation $\tilde{Q}_i^{\theta}$ of $Q_i$, parameterized by a weight vector $\theta \in \mathbb{R}^L$ and defined as
\[
\tilde{Q}_i^{\theta}(s, a) = \sum_{\ell=1}^{L} \theta_{\ell}\, \phi_{i,\ell}(s, a), \qquad \forall (s, a) \in \mathcal{S} \times \mathcal{A},
\]
for a given set of $L$ basis vectors $\phi_{i,\ell} : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$, $\ell \in \{1, \ldots, L\}$. We assume again that $\phi_{i,\ell}(s, a) = 0$ if either $s \notin \mathcal{S}_i$ or $a \notin \mathcal{A}_i$, implying $\tilde{Q}_i^{\theta}(s, a) = 0$. Let $\Phi_i \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}| \times L}$ be the feature matrix whose $(s,a)$-th row is $\phi_i(s, a) = (\phi_{i,1}(s, a), \ldots, \phi_{i,L}(s, a))^T$. The goal of distributed Q-learning is then to solve
\[
\sum_{i=1}^{N} \Phi_i \theta = \sum_{i=1}^{N} \Pi\, \mathcal{T}_i [\Phi_i \theta],
\]
where $\Pi$ denotes the projection onto the linear subspace spanned by the feature vectors and $\mathcal{T}_i$ is the Bellman operator associated with the Q-function at environment $i$; see for example [20].

For solving this problem, we consider a distributed variant of Q-learning [20, 41, 42]. In particular, each agent $i$ maintains an estimate $\theta_i$ of $\theta^*$ and the centralized coordinator maintains $\bar{\theta}$, the average of these $\theta_i$. At each iteration $k \geq 0$, each agent $i$ initializes $\theta_i^{k,0} = \bar{\theta}^k$ and runs $H$ steps of local Q-learning to update $\theta_i^{k,t}$. These values are then aggregated by the server to update $\bar{\theta}^{k+1}$, i.e., set $\theta_i^{k,0} = \bar{\theta}^k$ and for all $t \in [0, H-1]$,
\[
\text{Agent } i:\quad \theta_i^{k,t+1} = \theta_i^{k,t} + \alpha_{k+t} \Big( R_i^{k+t} + \gamma \max_{a'} \phi_i(s_i^{k+t+1}, a')^T \theta_i^{k,t} - \phi_i(s_i^{k+t}, a_i^{k+t})^T \theta_i^{k,t} \Big) \phi_i(s_i^{k+t}, a_i^{k+t}),
\]
\[
\text{Server:}\quad \bar{\theta}^{k+1} = \frac{1}{N} \sum_{i=1}^{N} \theta_i^{k,H}, \qquad (39)
\]
where $R_i^k = R_i(s_i^k, a_i^k)$ and $\{s_i^k, s_i^{k+1}, a_i^k\}$ is the sample trajectory generated by some behavior policy $\sigma_i$ at agent $i$. Let $X_i^k = \{s_i^k, a_i^k, s_i^{k+1}\}$ be a Markov chain.
We denote the nonlinear mapping $F_i$ as
\[
F_i(\theta; X^k) = \phi_i(s^k, a^k) \Big[ R(s^k, a^k) + \gamma \max_{a} \phi_i(s^{k+1}, a)^T \theta - \phi_i(s^k, a^k)^T \theta \Big],
\]
and $F_i(\theta) = \mathbb{E}_{\pi_i}\big[ F_i(\theta; X^k) \big]$, where $\pi_i$ is the stationary distribution of $\{X_i^k\}$. The goal of (39) is then to find $\theta^*$ such that
\[
\sum_{i=1}^{N} F_i(\theta^*) = 0.
\]
Under proper assumptions, see for example [20, Theorem 1], one can verify that Assumptions 1–3 hold. Thus, we can apply our results in Theorems 1 and 2 to derive finite-time performance bounds for distributed Q-learning in (39).
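Analogously to the TD(0) case, the local Q-learning scheme (39) can be sketched as follows. This is a minimal simulation under illustrative assumptions (a uniform behavior policy and stand-in MDP data `P_list`, `R_list`, `Phi_list`, where `Phi_list[i][s, a]` is the feature vector of the pair $(s, a)$), not the paper's implementation.

```python
import numpy as np

def local_q_learning(P_list, R_list, Phi_list, gamma=0.9, alpha=0.05, H=5, K=200, seed=0):
    """Minimal sketch of (39): each agent runs H local Q-learning steps with a
    uniform behavior policy, then the server averages the local iterates."""
    rng = np.random.default_rng(seed)
    N = len(P_list)
    n_s, n_a, L = Phi_list[0].shape
    theta_bar = np.zeros(L)                          # server iterate \bar{theta}^k
    states = [rng.integers(n_s) for _ in range(N)]
    for _ in range(K):
        local = []
        for i in range(N):
            theta, s = theta_bar.copy(), states[i]   # theta_i^{k,0} = \bar{theta}^k
            for _ in range(H):
                a = rng.integers(n_a)                # uniform behavior policy sigma_i
                s_next = rng.choice(n_s, p=P_list[i][s, a])
                q_next = max(Phi_list[i][s_next, b] @ theta for b in range(n_a))
                delta = R_list[i][s, a] + gamma * q_next - Phi_list[i][s, a] @ theta
                theta = theta + alpha * delta * Phi_list[i][s, a]  # local Q-step
                s = s_next
            states[i] = s
            local.append(theta)
        theta_bar = np.mean(local, axis=0)           # server averaging
    return theta_bar
```

As a sanity check of the design: with constant features $\phi_i(s,a) \equiv 1$ and unit rewards, each local step reduces to $\theta \leftarrow \theta + \alpha\big(1 - (1-\gamma)\theta\big)$, so the averaged iterate approaches $1/(1-\gamma)$.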
This paper studies local stochastic approximation over a network of agents, where the data at each agent are generated from a Markov process. Our main contribution is a finite-time bound on the convergence of the mean square error generated by the algorithm to zero. Our results generalize the existing literature, where the data at the agents are assumed to be i.i.d., an assumption under which the existing analysis cannot be applied to some algorithms in multi-task reinforcement learning over multi-agent systems.

References

[1] H. Robbins and S. Monro, “A stochastic approximation method,”
The Annals of Mathematical Statistics, vol. 22, no. 3, pp. 400–407, 1951. [2] P. Kairouz et al., “Advances and open problems in federated learning,” available at: https://arxiv.org/abs/1912.04977, 2020. [3] T. Li, A. Sahu, A. Talwalkar, and V. Smith, “Federated learning: Challenges, methods, and future directions,” available at: https://arxiv.org/abs/1908.07873, 2020. [4] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in
International Conference on Machine Learning, 2016, pp. 1928–1937. [5] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning et al., “IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures,” arXiv preprint arXiv:1802.01561, 2018. [6] M. Hessel, H. Soyer, L. Espeholt, W. Czarnecki, S. Schmitt, and H. van Hasselt, “Multi-task deep reinforcement learning with PopArt,” in
Proceedings of the AAAI Conference on Artificial Intelligence ,vol. 33, 2019, pp. 3796–3803.[7] L. Bottou, F. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,”
SIAM Review, vol. 60, no. 2, pp. 223–311, 2018. [8] R. S. Sutton and A. G. Barto,
Reinforcement Learning: An Introduction , 2nd ed. MIT Press, 2018.[9] D. Bertsekas and J. Tsitsiklis,
Neuro-Dynamic Programming , 2nd ed. Athena Scientific, Belmont,MA, 1999.[10] V. Borkar,
Stochastic Approximation: A Dynamical Systems Viewpoint . Cambridge University Press,2008.[11] H. Kushner and G. Yin,
Stochastic Approximation Algorithms and Applications . Springer, NY, 2003.[12] G. Lan,
First-order and Stochastic Optimization Methods for Machine Learning . Springer-Nature,2020.[13] T. Sun, Y. Sun, and W. Yin, “On markov chain gradient descent,” in
Proceedings of the 32nd International Conference on Neural Information Processing Systems, ser. NIPS’18, 2018, pp. 9918–9927. [14] T. T. Doan, L. M. Nguyen, N. H. Pham, and J. Romberg, “Convergence rates of accelerated Markov gradient descent with applications in reinforcement learning,” available at: https://arxiv.org/abs/2002.02873, 2020. [15] J. Bhandari, D. Russo, and R. Singal, “A finite time analysis of temporal difference learning with linear function approximation,” in
COLT , 2018.[16] R. Srikant and L. Ying, “Finite-time error bounds for linear stochastic approximation and TD learning,”in
COLT , 2019.[17] B. Hu and U. Syed, “Characterizing the exact behaviors of temporal difference learning algorithmsusing markov jump linear system theory,” in
Advances in Neural Information Processing Systems 32, 2019. [18] B. Karimi, B. Miasojedow, E. Moulines, and H. Wai, “Non-asymptotic analysis of biased stochastic approximation scheme,” in
Conference on Learning Theory, COLT 2019, 25–28 June 2019, Phoenix, AZ, USA, 2019, pp. 1944–1974. [19] T. T. Doan, “Finite-time analysis and restarting scheme for linear two-time-scale stochastic approximation,” available at: https://arxiv.org/abs/1912.10583, 2019. [20] C. Z. Chen, S. Zhang, T. T. Doan, S. T. Maguluri, and J.-P. Clarke, “Performance of Q-learning with linear function approximation: Stability and finite-time analysis,” available at: https://arxiv.org/abs/1905.11425, 2019. [21] T. T. Doan, J. Lubars, C. L. Beck, and R. Srikant, “Convergence rate of distributed random projections,” in , vol. 51, no. 23, 2018, pp. 373–378. [22] S. U. Stich, “Local SGD converges fast and communicates little,” in
International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=S1g2JnRcFX [23] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in
Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, vol. 54, 2017, pp. 1273–1282. [24] B. Woodworth, K. Patel, S. Stich, Z. Dai, B. Bullins, B. McMahan, O. Shamir, and N. Srebro, “Is local SGD better than minibatch SGD?” available at: https://arxiv.org/abs/2002.07839, 2020. [25] F. Haddadpour, M. M. Kamani, M. Mahdavi, and V. Cadambe, “Local SGD with periodic averaging: Tighter analysis and adaptive synchronization,” in
Advances in Neural Information Processing Systems 32, 2019, pp. 11082–11094. [26] S. P. Karimireddy, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich, and A. T. Suresh, “SCAFFOLD: Stochastic controlled averaging for on-device federated learning,” 2019. [Online]. Available: https://arxiv.org/abs/1910.06378 [27] A. Khaled, K. Mishchenko, and P. Richtárik, “Tighter theory for local SGD on identical and heterogeneous data,” in the 23rd International Conference on Artificial Intelligence and Statistics, 2020. [28] L. T. Liu, U. Dogan, and K. Hofmann, “Decoding multitask DQN in the world of Minecraft,” in
The 13th European Workshop on Reinforcement Learning (EWRL), 2016. [29] T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, “Gradient surgery for multi-task learning,” available at: https://arxiv.org/abs/2001.06782, 2020. [30] A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. Maria, V. Panneershelvam, M. Suleyman, C. Beattie, S. Petersen, S. Legg, V. Mnih, K. Kavukcuoglu, and D. Silver, “Massively parallel methods for deep reinforcement learning,” 07 2015. [31] D. A. Levin, Y. Peres, and E. L. Wilmer,
Markov Chains and Mixing Times. American Mathematical Society, 2006. [32] S. Ram, A. Nedić, and V. V. Veeravalli, “Incremental stochastic subgradient algorithms for convex optimization,”
SIAM Journal on Optimization, vol. 20, no. 2, pp. 691–717, 2009. [33] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., “Mastering the game of Go without human knowledge,”
Nature, vol. 550, no. 7676, pp. 354–359, 2017. [34] Y. Wu, W. Zhang, P. Xu, and Q. Gu, “A finite time analysis of two time-scale actor critic methods,” arXiv preprint arXiv:2005.01350, 2020. [35] T. T. Doan, S. T. Maguluri, and J. Romberg, “Finite-time performance of distributed temporal difference learning with linear function approximation,” available at: https://arxiv.org/abs/1907.12530, 2019. [36] S. Zeng, A. Anwar, T. Doan, J. Romberg, and A. Raychowdhury, “A decentralized policy gradient approach to multi-task reinforcement learning,” available at: https://arxiv.org/abs/2006.04338, 2020. [37] G. Konidaris, S. Osentoski, and P. Thomas, “Value function approximation in reinforcement learning using the Fourier basis,” in
Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence ,2011, p. 380–385.[38] R. S. Sutton, “Learning to predict by the methods of temporal differences,”
Machine Learning, vol. 3, no. 1, pp. 9–44, Aug 1988. [39] J. N. Tsitsiklis and B. V. Roy, “An analysis of temporal-difference learning with function approximation,”
IEEE Transactions on Automatic Control , vol. 42, no. 5, pp. 674–690, 1997.[40] C. Watkins and P. Dayan, “Q-learning,”
Machine Learning, vol. 8, pp. 279–292, 05 1992. [41] F. S. Melo, S. P. Meyn, and M. I. Ribeiro, “An analysis of reinforcement learning with function approximation,” in
Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 664–671. [42] D. Lee and N. He, “A unified switching system perspective and O.D.E. analysis of Q-learning algorithms,” available at: https://arxiv.org/abs/1912.02270, 2019.
A Proofs of Lemmas 3–8
A.1 Proof of Lemma 3
We first show (13). Indeed, by (3) and (9) we have, for any $t \in [0, H-1]$,
\[
\|\theta_i^{k,t+1} - \theta_i^{k,t}\| = \alpha \|F_i(\theta_i^{k,t}; X_i^{k+t})\| \le B\alpha \big( \|\theta_i^{k,t}\| + 1 \big), \qquad (40)
\]
which gives
\[
\|\theta_i^{k,t+1}\| \le (1 + B\alpha) \|\theta_i^{k,t}\| + B\alpha \le (1 + B\alpha)^{t+1} \|\theta_i^{k,0}\| + B\alpha \sum_{u=0}^{t} (1 + B\alpha)^{t-u}.
\]
Using the relation $1 + x \le e^{x}$ for all $x \ge 0$ in the preceding equation gives (13), i.e.,
\[
\|\theta_i^{k,t+1}\| \le e^{B\alpha(t+1)} \|\theta_i^{k,0}\| + B\alpha t\, e^{B\alpha t} \le e^{B\alpha H} \|\theta_i^{k,0}\| + BH\alpha\, e^{B\alpha H} \le 2\|\bar{\theta}^{k}\| + 2BH\alpha \le 2\|\bar{\theta}^{k}\| + 1,
\]
where the second inequality is due to (10), i.e., $BH\alpha \le \log(2)/\tau(\alpha) \le \log(2)$, and we recall that $\theta_i^{k,0} = \bar{\theta}^{k}$. Next, using (40) and (13) we obtain, for all $t \in [0, H-1]$,
\[
\|\theta_i^{k,t+1} - \theta_i^{k,t}\| \le 2B\alpha \|\bar{\theta}^{k}\| + 2B^2 H \alpha^2 + B\alpha, \qquad (41)
\]
which implies (14), i.e., for all $t \in [1, H]$ we have
\[
\|\theta_i^{k,t} - \bar{\theta}^{k}\| = \Big\| \sum_{u=0}^{t-1} \big( \theta_i^{k,u+1} - \theta_i^{k,u} \big) \Big\| \le \sum_{u=0}^{t-1} \|\theta_i^{k,u+1} - \theta_i^{k,u}\| \le 2B\alpha t \|\bar{\theta}^{k}\| + 2B^2 H \alpha^2 t + B\alpha t \le 2BH\alpha \|\bar{\theta}^{k}\| + 2B^2 H^2 \alpha^2 + BH\alpha \le 2BH\alpha \|\bar{\theta}^{k}\| + 2BH\alpha,
\]
where the last inequality is due to (10).
A.2 Proof of Lemma 4
We first show (15). Using (12) and (9) we have
\[
\|\bar{\theta}^{k+1}\| - \|\bar{\theta}^{k}\| \le \|\bar{\theta}^{k+1} - \bar{\theta}^{k}\| \le \frac{\alpha}{N} \sum_{i=1}^{N} \sum_{t=0}^{H-1} \|F_i(\theta_i^{k,t}; X_i^{k+t})\| \le \frac{\alpha}{N} \sum_{i=1}^{N} \sum_{t=0}^{H-1} B \big( \|\theta_i^{k,t}\| + 1 \big) \le BH\alpha \big( \|\bar{\theta}^{k}\| + 1 \big) + \frac{\alpha}{N} \sum_{i=1}^{N} \sum_{t=0}^{H-1} B \|\theta_i^{k,t} - \bar{\theta}^{k}\|,
\]
which, when using (14) and $2BH\alpha \le \log(2)$ (from (10)), gives
\[
\|\bar{\theta}^{k+1}\| - \|\bar{\theta}^{k}\| \le \|\bar{\theta}^{k+1} - \bar{\theta}^{k}\| \le BH\alpha \big( \|\bar{\theta}^{k}\| + 1 \big) + BH\alpha \big( 2BH\alpha \|\bar{\theta}^{k}\| + 2BH\alpha \big) \le 2BH\alpha \|\bar{\theta}^{k}\| + 2BH\alpha. \qquad (42)
\]
The preceding relation yields
\[
\|\bar{\theta}^{k+1}\| \le (1 + 2BH\alpha) \|\bar{\theta}^{k}\| + 2BH\alpha.
\]
Using the relation $1 + x \le e^{x}$ for all $x \ge 0$, the relation above gives, for all $t \in [k - \tau(\alpha), k]$,
\[
\|\bar{\theta}^{t}\| \le (1 + 2BH\alpha)^{t-k+\tau(\alpha)} \|\bar{\theta}^{k-\tau(\alpha)}\| + 2BH \sum_{u=k-\tau(\alpha)}^{t-1} \alpha (1 + 2BH\alpha)^{t-u-1} \le e^{2BH\alpha\tau(\alpha)} \|\bar{\theta}^{k-\tau(\alpha)}\| + 2BH\alpha\tau(\alpha)\, e^{2BH\alpha\tau(\alpha)} \le 2 \|\bar{\theta}^{k-\tau(\alpha)}\| + 4BH\alpha\tau(\alpha),
\]
where the last inequality is due to (10), i.e., $2BH\tau(\alpha)\alpha \le \log(2)$. Using the preceding relation we have from (42)
\[
\|\bar{\theta}^{k} - \bar{\theta}^{k-\tau(\alpha)}\| \le \sum_{t=k-\tau(\alpha)}^{k-1} \|\bar{\theta}^{t+1} - \bar{\theta}^{t}\| \le \sum_{t=k-\tau(\alpha)}^{k-1} 2BH\alpha \big( \|\bar{\theta}^{t}\| + 1 \big) \le 4BH\alpha\tau(\alpha) \|\bar{\theta}^{k-\tau(\alpha)}\| + 4BH\alpha\tau(\alpha),
\]
where the last inequality is due to (10). Using the preceding inequality together with the triangle inequality $\|\bar{\theta}^{k-\tau(\alpha)}\| \le \|\bar{\theta}^{k} - \bar{\theta}^{k-\tau(\alpha)}\| + \|\bar{\theta}^{k}\|$ and using (10) to absorb the resulting $\|\bar{\theta}^{k} - \bar{\theta}^{k-\tau(\alpha)}\|$ term, rearranging yields (15):
\[
\|\bar{\theta}^{k} - \bar{\theta}^{k-\tau(\alpha)}\| \le 12BH\alpha\tau(\alpha) \|\bar{\theta}^{k}\| + 12BH\alpha\tau(\alpha) \le \|\bar{\theta}^{k}\| + 2.
\]
Taking squares on both sides of the preceding relation and using the Cauchy–Schwarz inequality yields (16):
\[
\|\bar{\theta}^{k} - \bar{\theta}^{k-\tau(\alpha)}\|^2 \le 288 B^2 H^2 \alpha^2 \tau^2(\alpha) \|\bar{\theta}^{k}\|^2 + 288 B^2 H^2 \alpha^2 \tau^2(\alpha) \le 2\|\bar{\theta}^{k}\|^2 + 8.
\]

A.3 Proof of Lemma 5
Consider
\[
- \sum_{i=1}^{N} \sum_{t=0}^{H-1} \big\langle \bar{\theta}^{k} - \theta^{*},\, F_i(\theta_i^{k,t}; X_i^{k+t}) - F_i(\bar{\theta}^{k}) \big\rangle
= - \sum_{i=1}^{N} \sum_{t=0}^{H-1} \big\langle \bar{\theta}^{k} - \bar{\theta}^{k-\tau(\alpha)},\, F_i(\theta_i^{k,t}; X_i^{k+t}) - F_i(\bar{\theta}^{k}) \big\rangle
\]
\[
\quad - \sum_{i=1}^{N} \sum_{t=0}^{H-1} \big\langle \bar{\theta}^{k-\tau(\alpha)} - \theta^{*},\, F_i(\theta_i^{k-\tau(\alpha),t}; X_i^{k+t}) - F_i(\theta_i^{k-\tau(\alpha),t}) \big\rangle
- \sum_{i=1}^{N} \sum_{t=0}^{H-1} \big\langle \bar{\theta}^{k-\tau(\alpha)} - \theta^{*},\, F_i(\theta_i^{k,t}; X_i^{k+t}) - F_i(\theta_i^{k-\tau(\alpha),t}; X_i^{k+t}) \big\rangle
\]
\[
\quad - \sum_{i=1}^{N} \sum_{t=0}^{H-1} \big\langle \bar{\theta}^{k-\tau(\alpha)} - \theta^{*},\, F_i(\theta_i^{k-\tau(\alpha),t}) - F_i(\bar{\theta}^{k-\tau(\alpha)}) \big\rangle
- \sum_{i=1}^{N} \sum_{t=0}^{H-1} \big\langle \bar{\theta}^{k-\tau(\alpha)} - \theta^{*},\, F_i(\bar{\theta}^{k-\tau(\alpha)}) - F_i(\bar{\theta}^{k}) \big\rangle. \qquad (43)
\]
We first consider the second term on the right-hand side of (43). Let $\mathcal{F}^{k}$ be the set containing all the information generated by Algorithm 1 up to time $k$. Then, using (8) we have
\[
- \sum_{i=1}^{N} \sum_{t=0}^{H-1} \mathbb{E}\Big[ \big\langle \bar{\theta}^{k-\tau(\alpha)} - \theta^{*},\, F_i(\theta_i^{k-\tau(\alpha),t}; X_i^{k+t}) - F_i(\theta_i^{k-\tau(\alpha),t}) \big\rangle \,\Big|\, \mathcal{F}^{k+t-\tau(\alpha)} \Big]
\le \sum_{i=1}^{N} \sum_{t=0}^{H-1} \alpha \big\| \bar{\theta}^{k-\tau(\alpha)} - \theta^{*} \big\| \Big( \big\| \theta_i^{k-\tau(\alpha),t} \big\| + 1 \Big),
\]
which, using (13) and (15) to bound $\|\theta_i^{k-\tau(\alpha),t}\|$ and $\|\bar{\theta}^{k-\tau(\alpha)}\|$ in terms of $\|\bar{\theta}^{k}\|$, and then $\|\bar{\theta}^{k}\| \le \|\bar{\theta}^{k} - \theta^{*}\| + \|\theta^{*}\|$, gives
\[
- \sum_{i=1}^{N} \sum_{t=0}^{H-1} \mathbb{E}\big[ \langle \cdot, \cdot \rangle \mid \mathcal{F}^{k+t-\tau(\alpha)} \big]
\le 18 N H \alpha \big( \|\bar{\theta}^{k} - \theta^{*}\| + 1 + \|\theta^{*}\| \big)^2
\le 36 N H \alpha \|\bar{\theta}^{k} - \theta^{*}\|^2 + 36 N H \alpha \big( 1 + \|\theta^{*}\| \big)^2, \qquad (44)
\]
where the last inequality is due to the Cauchy–Schwarz inequality. Next, we consider the third term on the right-hand side of (43). Indeed, using (6) we have
\[
- \sum_{i=1}^{N} \sum_{t=0}^{H-1} \big\langle \bar{\theta}^{k-\tau(\alpha)} - \theta^{*},\, F_i(\theta_i^{k,t}; X_i^{k+t}) - F_i(\theta_i^{k-\tau(\alpha),t}; X_i^{k+t}) \big\rangle
\le L \sum_{i=1}^{N} \sum_{t=0}^{H-1} \big\| \bar{\theta}^{k-\tau(\alpha)} - \theta^{*} \big\| \Big( \big\| \theta_i^{k,t} - \bar{\theta}^{k} \big\| + \big\| \bar{\theta}^{k} - \bar{\theta}^{k-\tau(\alpha)} \big\| + \big\| \bar{\theta}^{k-\tau(\alpha)} - \theta_i^{k-\tau(\alpha),t} \big\| \Big). \qquad (45)
\]
Similarly, using (6) for the last two terms on the right-hand side of (43) and adding the result to (45), we obtain
\[
\le L \sum_{i=1}^{N} \sum_{t=0}^{H-1} \big\| \bar{\theta}^{k-\tau(\alpha)} - \theta^{*} \big\| \Big( \big\| \theta_i^{k,t} - \bar{\theta}^{k} \big\| + 2 \big\| \theta_i^{k-\tau(\alpha),t} - \bar{\theta}^{k-\tau(\alpha)} \big\| + 2 \big\| \bar{\theta}^{k} - \bar{\theta}^{k-\tau(\alpha)} \big\| \Big),
\]
which, after applying (14) and (15) to the terms inside the parentheses, using $\|\bar{\theta}^{k-\tau(\alpha)} - \theta^{*}\| \le \|\bar{\theta}^{k} - \bar{\theta}^{k-\tau(\alpha)}\| + \|\bar{\theta}^{k} - \theta^{*}\|$, and using (10), gives
\[
\le 114 L B N H^2 \alpha \tau(\alpha) \big( \|\bar{\theta}^{k} - \theta^{*}\| + \|\theta^{*}\| + 1 \big)^2
\le 228 L B N H^2 \alpha \tau(\alpha) \|\bar{\theta}^{k} - \theta^{*}\|^2 + 228 L B N H^2 \alpha \tau(\alpha) \big( \|\theta^{*}\| + 1 \big)^2. \qquad (46)
\]
Finally, we consider the first term on the right-hand side of (43). Using (13), (15), and (9) we have
\[
- \sum_{i=1}^{N} \sum_{t=0}^{H-1} \big\langle \bar{\theta}^{k} - \bar{\theta}^{k-\tau(\alpha)},\, F_i(\theta_i^{k,t}; X_i^{k+t}) - F_i(\bar{\theta}^{k}) \big\rangle
\le B \sum_{i=1}^{N} \sum_{t=0}^{H-1} \big\| \bar{\theta}^{k} - \bar{\theta}^{k-\tau(\alpha)} \big\| \Big( \big\| \theta_i^{k,t} \big\| + \big\| \bar{\theta}^{k} \big\| + 2 \Big)
\le 36 N B^2 H^2 \alpha \tau(\alpha) \big( \|\bar{\theta}^{k} - \theta^{*}\| + \|\theta^{*}\| + 1 \big)^2
\]
\[
\le 72 N B^2 H^2 \alpha \tau(\alpha) \|\bar{\theta}^{k} - \theta^{*}\|^2 + 72 N B^2 H^2 \big( 1 + \|\theta^{*}\| \big)^2 \alpha \tau(\alpha), \qquad (47)
\]
where in the second inequality we use (10) to have $2BH\alpha \le 1$. Finally, taking the expectation on both sides of (43) and using (44), (46), and (47) yields (17), i.e.,
\[
- \sum_{i=1}^{N} \sum_{t=0}^{H-1} \mathbb{E}\Big[ \big\langle \bar{\theta}^{k} - \theta^{*},\, F_i(\theta_i^{k,t}; X_i^{k+t}) - F_i(\bar{\theta}^{k}) \big\rangle \Big]
\le 36 N H \alpha\, \mathbb{E}\big[ \|\bar{\theta}^{k} - \theta^{*}\|^2 \big] + 36 N H \alpha \big( 1 + \|\theta^{*}\| \big)^2
\]
\[
\quad + 12 (19L + 6B) N B H^2 \alpha \tau(\alpha)\, \mathbb{E}\big[ \|\bar{\theta}^{k} - \theta^{*}\|^2 \big] + 12 (19L + 6B) N B H^2 \big( 1 + \|\theta^{*}\| \big)^2 \alpha \tau(\alpha),
\]
where we use $228LB + 72B^2 = 12B(19L + 6B)$.

A.4 Proof of Lemma 6
We first show (26). Indeed, by (3) and (9) we have, for any $t \in [0, H-1]$,
\[
\|\theta_i^{k,t+1} - \theta_i^{k,t}\| = \alpha_{k+t} \|F_i(\theta_i^{k,t}; X_i^{k+t})\| \le B\alpha_{k+t} \big( \|\theta_i^{k,t}\| + 1 \big), \qquad (48)
\]
which gives
\[
\|\theta_i^{k,t+1}\| \le (1 + B\alpha_{k+t}) \|\theta_i^{k,t}\| + B\alpha_{k+t} \le \prod_{u=0}^{t} (1 + B\alpha_{k+u}) \|\theta_i^{k,0}\| + B \sum_{u=0}^{t} \alpha_{k+u} \prod_{\ell=u+1}^{t} (1 + B\alpha_{k+\ell}).
\]
Using the relation $1 + x \le e^{x}$ for all $x \ge 0$ in the preceding equation, and since $\alpha_k$ is decreasing, we obtain (26), i.e., for $t \in [0, H-1]$,
\[
\|\theta_i^{k,t+1}\| \le \exp\Big\{ B \sum_{u=0}^{t} \alpha_{k+u} \Big\} \|\theta_i^{k,0}\| + B \sum_{u=0}^{t} \alpha_{k+u} \exp\Big\{ B \sum_{\ell=u+1}^{t} \alpha_{k+\ell} \Big\} \le \exp\{BH\alpha_k\} \|\theta_i^{k,0}\| + BH\alpha_k \exp\{BH\alpha_k\} \le 2\|\bar{\theta}^{k}\| + 2BH\alpha_k,
\]
where the second inequality is due to (22), i.e., $BH\alpha_k \le \log(2)$, and we recall that $\theta_i^{k,0} = \bar{\theta}^{k}$. Next, using (48) and (26), and since $\alpha_k$ is decreasing, we obtain for all $t \in [0, H-1]$
\[
\|\theta_i^{k,t+1} - \theta_i^{k,t}\| \le B\alpha_k \big( \|\theta_i^{k,t}\| + 1 \big) \le 2B\alpha_k \|\bar{\theta}^{k}\| + 2B^2 H \alpha_k^2 + B\alpha_k, \qquad (49)
\]
which implies (27), i.e., for all $t \in [1, H]$ we have
\[
\|\theta_i^{k,t} - \bar{\theta}^{k}\| = \Big\| \sum_{u=0}^{t-1} \big( \theta_i^{k,u+1} - \theta_i^{k,u} \big) \Big\| \le \sum_{u=0}^{t-1} \|\theta_i^{k,u+1} - \theta_i^{k,u}\| \le 2B\alpha_k t \|\bar{\theta}^{k}\| + 2B^2 H \alpha_k^2 t + B\alpha_k t \le 2BH\alpha_k \|\bar{\theta}^{k}\| + 2B^2 H^2 \alpha_k^2 + BH\alpha_k \le 2BH\alpha_k \|\bar{\theta}^{k}\| + 2BH\alpha_k,
\]
where the last inequality is due to (22).

A.5 Proof of Lemma 7
We first show (28). Using (25) and (9) we have
\[
\|\bar{\theta}^{k+1}\| - \|\bar{\theta}^{k}\| \le \|\bar{\theta}^{k+1} - \bar{\theta}^{k}\| \le \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{H-1} \alpha_{k+t} \|F_i(\theta_i^{k,t}; X_i^{k+t})\| \le \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{H-1} B\alpha_{k+t} \big( \|\theta_i^{k,t}\| + 1 \big) \le BH\alpha_k \big( \|\bar{\theta}^{k}\| + 1 \big) + \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{H-1} B\alpha_{k+t} \|\theta_i^{k,t} - \bar{\theta}^{k}\|,
\]
which, when using (27) and $2BH\alpha_k \le \log(2)$ (from (22)), gives
\[
\|\bar{\theta}^{k+1}\| - \|\bar{\theta}^{k}\| \le \|\bar{\theta}^{k+1} - \bar{\theta}^{k}\| \le BH\alpha_k \big( \|\bar{\theta}^{k}\| + 1 \big) + BH\alpha_k \big( 2BH\alpha_k \|\bar{\theta}^{k}\| + 2BH\alpha_k \big) \le 2BH\alpha_k \|\bar{\theta}^{k}\| + 2BH\alpha_k. \qquad (50)
\]
The preceding relation yields
\[
\|\bar{\theta}^{k+1}\| \le (1 + 2BH\alpha_k) \|\bar{\theta}^{k}\| + 2BH\alpha_k.
\]
Using the relation $1 + x \le e^{x}$ for all $x \ge 0$, the relation above gives, for all $t \in [k - \tau(\alpha_k), k-1]$,
\[
\|\bar{\theta}^{t+1}\| \le \prod_{u=k-\tau(\alpha_k)}^{t} (1 + 2BH\alpha_u) \|\bar{\theta}^{k-\tau(\alpha_k)}\| + 2BH \sum_{u=k-\tau(\alpha_k)}^{t} \alpha_u \prod_{\ell=u+1}^{t} (1 + 2BH\alpha_\ell) \le \exp\Big\{ 2BH \sum_{u=k-\tau(\alpha_k)}^{t} \alpha_u \Big\} \|\bar{\theta}^{k-\tau(\alpha_k)}\| + 2BH \sum_{u=k-\tau(\alpha_k)}^{t} \alpha_u \exp\Big\{ 2BH \sum_{\ell=u+1}^{t} \alpha_\ell \Big\} \le 2\|\bar{\theta}^{k-\tau(\alpha_k)}\| + 4BH \sum_{u=k-\tau(\alpha_k)}^{t} \alpha_u,
\]
where the last inequality is due to (22), i.e., $2BH\alpha_{k;\tau(\alpha_k)} \le \log(2)$, with $\alpha_{k;\tau(\alpha_k)} = \sum_{t=k-\tau(\alpha_k)}^{k-1} \alpha_t$. Using the preceding relation we have from (50)
\[
\|\bar{\theta}^{k} - \bar{\theta}^{k-\tau(\alpha_k)}\| \le \sum_{t=k-\tau(\alpha_k)}^{k-1} \|\bar{\theta}^{t+1} - \bar{\theta}^{t}\| \le \sum_{t=k-\tau(\alpha_k)}^{k-1} 2BH\alpha_t \big( \|\bar{\theta}^{t}\| + 1 \big) \le 4BH\alpha_{k;\tau(\alpha_k)} \|\bar{\theta}^{k-\tau(\alpha_k)}\| + 4BH\alpha_{k;\tau(\alpha_k)},
\]
where the last inequality is due to (22). Using the preceding inequality together with the triangle inequality $\|\bar{\theta}^{k-\tau(\alpha_k)}\| \le \|\bar{\theta}^{k} - \bar{\theta}^{k-\tau(\alpha_k)}\| + \|\bar{\theta}^{k}\|$ and using (22) to absorb the resulting $\|\bar{\theta}^{k} - \bar{\theta}^{k-\tau(\alpha_k)}\|$ term, rearranging yields (28):
\[
\|\bar{\theta}^{k} - \bar{\theta}^{k-\tau(\alpha_k)}\| \le 12BH\alpha_{k;\tau(\alpha_k)} \|\bar{\theta}^{k}\| + 12BH\alpha_{k;\tau(\alpha_k)} \le \|\bar{\theta}^{k}\| + 2.
\]
Taking squares on both sides of the preceding relation and using the Cauchy–Schwarz inequality yields (29):
\[
\|\bar{\theta}^{k} - \bar{\theta}^{k-\tau(\alpha_k)}\|^2 \le 288 B^2 H^2 \alpha_{k;\tau(\alpha_k)}^2 \|\bar{\theta}^{k}\|^2 + 288 B^2 H^2 \alpha_{k;\tau(\alpha_k)}^2 \le 2\|\bar{\theta}^{k}\|^2 + 8.
\]

A.6 Proof of Lemma 8

Consider
\[
- \sum_{i=1}^{N} \sum_{t=0}^{H-1} \alpha_{k+t} \big\langle \bar{\theta}^{k} - \theta^{*},\, F_i(\theta_i^{k,t}; X_i^{k+t}) - F_i(\bar{\theta}^{k}) \big\rangle
= - \sum_{i=1}^{N} \sum_{t=0}^{H-1} \alpha_{k+t} \big\langle \bar{\theta}^{k} - \bar{\theta}^{k-\tau(\alpha_k)},\, F_i(\theta_i^{k,t}; X_i^{k+t}) - F_i(\bar{\theta}^{k}) \big\rangle
\]
\[
\quad - \sum_{i=1}^{N} \sum_{t=0}^{H-1} \alpha_{k+t} \big\langle \bar{\theta}^{k-\tau(\alpha_k)} - \theta^{*},\, F_i(\theta_i^{k-\tau(\alpha_k),t}; X_i^{k+t}) - F_i(\theta_i^{k-\tau(\alpha_k),t}) \big\rangle
- \sum_{i=1}^{N} \sum_{t=0}^{H-1} \alpha_{k+t} \big\langle \bar{\theta}^{k-\tau(\alpha_k)} - \theta^{*},\, F_i(\theta_i^{k,t}; X_i^{k+t}) - F_i(\theta_i^{k-\tau(\alpha_k),t}; X_i^{k+t}) \big\rangle
\]
\[
\quad - \sum_{i=1}^{N} \sum_{t=0}^{H-1} \alpha_{k+t} \big\langle \bar{\theta}^{k-\tau(\alpha_k)} - \theta^{*},\, F_i(\theta_i^{k-\tau(\alpha_k),t}) - F_i(\bar{\theta}^{k-\tau(\alpha_k)}) \big\rangle
- \sum_{i=1}^{N} \sum_{t=0}^{H-1} \alpha_{k+t} \big\langle \bar{\theta}^{k-\tau(\alpha_k)} - \theta^{*},\, F_i(\bar{\theta}^{k-\tau(\alpha_k)}) - F_i(\bar{\theta}^{k}) \big\rangle. \qquad (51)
\]
We first consider the second term on the right-hand side of (51). Let $\mathcal{F}^{k}$ be the set containing all the information generated by Algorithm 1 up to time $k$.
Then, using (8) we have
\[
- \sum_{i=1}^{N} \sum_{t=0}^{H-1} \alpha_{k+t} \mathbb{E}\Big[ \big\langle \bar{\theta}^{k-\tau(\alpha_k)} - \theta^{*},\, F_i(\theta_i^{k-\tau(\alpha_k),t}; X_i^{k+t}) - F_i(\theta_i^{k-\tau(\alpha_k),t}) \big\rangle \,\Big|\, \mathcal{F}^{k+t-\tau(\alpha_k)} \Big]
\le \sum_{i=1}^{N} \sum_{t=0}^{H-1} \alpha_{k+t} \alpha_k \big\| \bar{\theta}^{k-\tau(\alpha_k)} - \theta^{*} \big\| \Big( \big\| \theta_i^{k-\tau(\alpha_k),t} \big\| + 1 \Big),
\]
which, using (26) and (28) to bound $\|\theta_i^{k-\tau(\alpha_k),t}\|$ and $\|\bar{\theta}^{k-\tau(\alpha_k)}\|$ in terms of $\|\bar{\theta}^{k}\|$, and then $\|\bar{\theta}^{k}\| \le \|\bar{\theta}^{k} - \theta^{*}\| + \|\theta^{*}\|$, gives
\[
- \sum_{i=1}^{N} \sum_{t=0}^{H-1} \alpha_{k+t} \mathbb{E}\big[ \langle \cdot, \cdot \rangle \mid \mathcal{F}^{k+t-\tau(\alpha_k)} \big]
\le 18 N H \alpha_k^2 \big( \|\bar{\theta}^{k} - \theta^{*}\| + 1 + \|\theta^{*}\| \big)^2
\le 36 N H \alpha_k^2 \|\bar{\theta}^{k} - \theta^{*}\|^2 + 36 N H \alpha_k^2 \big( 1 + \|\theta^{*}\| \big)^2, \qquad (52)
\]
where the last inequality is due to the Cauchy–Schwarz inequality. Next, we consider the third term on the right-hand side of (51). Indeed, using (6) we have
\[
- \sum_{i=1}^{N} \sum_{t=0}^{H-1} \alpha_{k+t} \big\langle \bar{\theta}^{k-\tau(\alpha_k)} - \theta^{*},\, F_i(\theta_i^{k,t}; X_i^{k+t}) - F_i(\theta_i^{k-\tau(\alpha_k),t}; X_i^{k+t}) \big\rangle
\le L \sum_{i=1}^{N} \sum_{t=0}^{H-1} \alpha_{k+t} \big\| \bar{\theta}^{k-\tau(\alpha_k)} - \theta^{*} \big\| \Big( \big\| \theta_i^{k,t} - \bar{\theta}^{k} \big\| + \big\| \bar{\theta}^{k} - \bar{\theta}^{k-\tau(\alpha_k)} \big\| + \big\| \bar{\theta}^{k-\tau(\alpha_k)} - \theta_i^{k-\tau(\alpha_k),t} \big\| \Big). \qquad (53)
\]
Similarly, using (6) for the last two terms on the right-hand side of (51) and adding the result to (53), we obtain
\[
\le L \sum_{i=1}^{N} \sum_{t=0}^{H-1} \alpha_{k+t} \big\| \bar{\theta}^{k-\tau(\alpha_k)} - \theta^{*} \big\| \Big( \big\| \theta_i^{k,t} - \bar{\theta}^{k} \big\| + 2 \big\| \theta_i^{k-\tau(\alpha_k),t} - \bar{\theta}^{k-\tau(\alpha_k)} \big\| + 2 \big\| \bar{\theta}^{k} - \bar{\theta}^{k-\tau(\alpha_k)} \big\| \Big),
\]
which, after applying (27) and (28) to the terms inside the parentheses, using $\|\bar{\theta}^{k-\tau(\alpha_k)} - \theta^{*}\| \le \|\bar{\theta}^{k} - \bar{\theta}^{k-\tau(\alpha_k)}\| + \|\bar{\theta}^{k} - \theta^{*}\|$, and using (22), gives
\[
\le 114 L N B H^2 \alpha_k \alpha_{k;\tau(\alpha_k)} \big( \|\bar{\theta}^{k} - \theta^{*}\| + 1 + \|\theta^{*}\| \big)^2
\le 228 L N B H^2 \alpha_k \alpha_{k;\tau(\alpha_k)} \|\bar{\theta}^{k} - \theta^{*}\|^2 + 228 L N B H^2 \alpha_k \alpha_{k;\tau(\alpha_k)} \big( \|\theta^{*}\| + 1 \big)^2. \qquad (54)
\]
Finally, we consider the first term on the right-hand side of (51). Using (26), (28), and (9) we have
\[
- \sum_{i=1}^{N} \sum_{t=0}^{H-1} \alpha_{k+t} \big\langle \bar{\theta}^{k} - \bar{\theta}^{k-\tau(\alpha_k)},\, F_i(\theta_i^{k,t}; X_i^{k+t}) - F_i(\bar{\theta}^{k}) \big\rangle
\le B \sum_{i=1}^{N} \sum_{t=0}^{H-1} \alpha_{k+t} \big\| \bar{\theta}^{k} - \bar{\theta}^{k-\tau(\alpha_k)} \big\| \Big( \big\| \theta_i^{k,t} \big\| + \big\| \bar{\theta}^{k} \big\| + 2 \Big)
\le 36 N B^2 H^2 \alpha_k \alpha_{k;\tau(\alpha_k)} \big( \|\bar{\theta}^{k} - \theta^{*}\| + \|\theta^{*}\| + 1 \big)^2
\]
\[
\le 72 N B^2 H^2 \alpha_k \alpha_{k;\tau(\alpha_k)} \|\bar{\theta}^{k} - \theta^{*}\|^2 + 72 N B^2 H^2 \big( 1 + \|\theta^{*}\| \big)^2 \alpha_k \alpha_{k;\tau(\alpha_k)}, \qquad (55)
\]
where in the second inequality we use (22) to have $2BH\alpha_k \le 1$. Finally, taking the expectation on both sides of (51) and using (52), (54), and (55) yields (30), i.e.,
\[
- \sum_{i=1}^{N} \sum_{t=0}^{H-1} \alpha_{k+t} \mathbb{E}\Big[ \big\langle \bar{\theta}^{k} - \theta^{*},\, F_i(\theta_i^{k,t}; X_i^{k+t}) - F_i(\bar{\theta}^{k}) \big\rangle \Big]
\le 36 N H \alpha_k^2\, \mathbb{E}\big[ \|\bar{\theta}^{k} - \theta^{*}\|^2 \big] + 36 N H \alpha_k^2 \big( 1 + \|\theta^{*}\| \big)^2
\]
\[
\quad + 12 (19L + 6B) N B H^2 \alpha_k \alpha_{k;\tau(\alpha_k)}\, \mathbb{E}\big[ \|\bar{\theta}^{k} - \theta^{*}\|^2 \big] + 12 (19L + 6B) N B H^2 \big( 1 + \|\theta^{*}\| \big)^2 \alpha_k \alpha_{k;\tau(\alpha_k)},
\]
where we use $228LB + 72B^2 = 12B(19L + 6B)$.