Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation
Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, Rahul Jain
University of Southern California
Abstract
We develop several new algorithms for learning Markov Decision Processes in an infinite-horizon average-reward setting with linear function approximation. Using the optimism principle and assuming that the MDP has a linear structure, we first propose a computationally inefficient algorithm with optimal $\widetilde{O}(\sqrt{T})$ regret and another computationally efficient variant with $\widetilde{O}(T^{3/4})$ regret, where $T$ is the number of interactions. Next, taking inspiration from adversarial linear bandits, we develop yet another efficient algorithm with $\widetilde{O}(\sqrt{T})$ regret under a different set of assumptions, improving the best existing result by Hao et al. [16] with $\widetilde{O}(T^{3/4})$ regret. Moreover, we draw a connection between this algorithm and the Natural Policy Gradient algorithm proposed by Kakade [22], and show that our analysis improves the sample complexity bound recently given by Agarwal et al. [4].

1 Introduction

Reinforcement learning with value function approximation has gained significant empirical success in many applications. However, the theoretical understanding of these methods is still quite limited. Recently, some progress has been made for Markov Decision Processes (MDPs) with a transition kernel and a reward function that are both linear in a fixed state-action feature representation (or more generally with a value function that is linear in such a feature representation). For example, Jin et al. [21] develop an optimistic variant of the Least-Squares Value Iteration (LSVI) algorithm [7, 29] for the finite-horizon episodic setting with regret $\widetilde{O}(\sqrt{d^3H^3T})$, where $d$ is the dimension of the features, $H$ is the episode length, and $T$ is the number of interactions. Importantly, the bound has no dependence on the number of states or actions.

However, the understanding of function approximation for the infinite-horizon average-reward setting, even under the aforementioned linear conditions, remains underexplored. Compared to the finite-horizon setting, the infinite-horizon model is often a better fit for real-world problems, such as server operation optimization or stock market decision making, which last for a long time or essentially never end. On the other hand, compared to the discounted-reward model, maximizing the long-term average reward has the advantage that the transient behavior of the learner does not matter. Indeed, the infinite-horizon average-reward setting for the tabular case (that is, no function approximation) is a heavily studied topic in the literature. Several recent works start to investigate function approximation for this setting, albeit under strong assumptions [2, 3, 16].

Motivated by this fact, in this work we significantly expand the understanding of learning MDPs in the infinite-horizon average-reward setting with linear function approximation. We develop three new algorithms, each with different pros and cons. Our first two algorithms provably ensure low regret for MDPs with linear transition and reward, which are the first for this setting to the best of our knowledge. More specifically, the first algorithm, Fixed-point OPtimization with Optimism (FOPO), is based on the principle of "optimism in the face of uncertainty" applied in a novel way. FOPO aims to find a weight vector (parametrizing the estimated value function) that maximizes the average reward under a fixed-point constraint akin to the LSVI update involving the observed data and an optimistic term.
The constraint is non-convex and we do not know of a way to solve it efficiently. FOPO also relies on a lazy update schedule similar to [1] for stochastic linear bandits, which serves only to save computation in their work but is critical for our regret guarantee. We prove that FOPO enjoys $\widetilde{O}(\sqrt{d^3T})$ regret with high probability, which is optimal in $T$. (Section 3.1)

Our second algorithm OLSVI.FH addresses the computational inefficiency of FOPO at the price of a larger regret. Specifically, it combines two ideas: 1) solving an infinite-horizon problem via an artificially constructed finite-horizon problem, which is new as far as we know, and 2) the optimistic LSVI algorithm of [21] for the finite-horizon setting. OLSVI.FH can be implemented efficiently and is shown to achieve $\widetilde{O}((dT)^{3/4})$ regret. (Section 3.2)

Our third algorithm MDP-EXP2 is inspired by the classic MDP-E algorithm, which achieves $\widetilde{O}(\sqrt{T})$ regret (ignoring dependence on other parameters) for the tabular case under an ergodic assumption. We generalize the idea and apply a particular adversarial linear bandit algorithm known as EXP2, improving the best existing $\widetilde{O}(T^{3/4})$ regret to $\widetilde{O}(\sqrt{T})$. (Section 4) In Appendix E, we also describe the connection of this algorithm with the Natural Policy Gradient algorithm proposed by Kakade [22], whose sample complexity bound was recently formalized by Agarwal et al. [4]. We argue that under the setting considered in Section 4, their analysis translates to a sub-optimal regret bound, and that our improvement over theirs comes from the way we construct the gradient estimates.

Related work.
For the tabular case with finite state and action spaces in the infinite-horizon average-reward setting, the works [6, 20] are among the first to develop algorithms with provable sublinear regret. Over the years, numerous improvements have been proposed; see for example [28, 14, 33, 15, 38, 35]. In particular, the recent work [35] develops two model-free algorithms for this problem. We refer the reader to [35, Table 1] for comparisons of existing algorithms. In the linear setting, the work [10] assumes access to a certain sampling oracle and provides a sample complexity guarantee polynomial in $1/\epsilon$. However, since the oracle assumption is rather strong, it is not clear how to extend their algorithm to the online setting.

The works [2, 3, 16] are among the first to consider the infinite-horizon average-reward setting with function approximation and provable regret guarantees in the online setting. Their results all depend on uniformly mixing and uniformly excited feature conditions. As mentioned, under the same assumptions, our MDP-EXP2 with $\widetilde{O}(\sqrt{T})$ regret improves the best existing result by Hao et al. [16] with $\widetilde{O}(T^{3/4})$ regret. Moreover, our other two algorithms ensure low regret for linear MDPs without these extra assumptions, which did not appear before.

Provable function approximation has gained growing research interest in other settings as well (finite-horizon or discounted-reward); see recent works [36, 21, 37, 12, 34] for example. In particular, our FOPO algorithm shares some similarity with the algorithm of Zanette et al. [37], which also relies on solving an optimization problem under a constraint akin to LSVI, with no efficient implementation. Adversarial linear bandit is also known as bandit linear optimization; the EXP2 algorithm that we build on was developed in this line of research [11, 9, 8].

2 Preliminaries

We consider infinite-horizon average-reward Markov Decision Processes (MDPs) described by $(\mathcal{X}, \mathcal{A}, r, p)$, where $\mathcal{X}$ is a Borel state space with a possibly infinite number of elements, $\mathcal{A}$ is a finite action set, $r : \mathcal{X}\times\mathcal{A}\to[-1,1]$ is the (unknown) reward function, and $p(\cdot|x,a)$ is the (unknown) transition kernel induced by $x, a$, satisfying $\int_{\mathcal{X}} p(\mathrm{d}x'|x,a) = 1$ (following integral notation from [19]).

The learning protocol is as follows. A learner interacts with the MDP through $T$ steps, starting from an arbitrary initial state $x_1\in\mathcal{X}$. At each step $t$, the learner decides an action $a_t$, and then observes the reward $r(x_t,a_t)$ as well as the next state $x_{t+1}$, which is a sample drawn from $p(\cdot|x_t,a_t)$. The goal of the learner is to be competitive against any fixed stationary policy. Specifically, a stationary policy is a mapping $\pi : \mathcal{X}\to\Delta_{\mathcal{A}}$, with $\pi(a|x)$ specifying the probability of selecting action $a$ at state $x$. The long-term average reward of a stationary policy $\pi$ starting from state $x\in\mathcal{X}$ is naturally defined as
$$J^\pi(x) \triangleq \liminf_{T\to\infty}\frac{1}{T}\,\mathbb{E}\left[\sum_{t=1}^{T} r(x_t,a_t)\,\middle|\, x_1 = x,\ \forall t\ge 1,\ a_t\sim\pi(\cdot|x_t),\ x_{t+1}\sim p(\cdot|x_t,a_t)\right].$$
The performance measure of the learner, known as regret, is then defined as $\mathrm{Reg}_T := \max_\pi\sum_{t=1}^T \big(J^\pi(x_1) - r(x_t,a_t)\big)$, which is the difference between the total reward of the best stationary policy and that of the learner.

However, in contrast to the finite-horizon episodic setting where ensuring sublinear regret is always possible, it is known that in our setting a necessary condition is that the optimal policy has a long-term average reward that is independent of the initial state [6].
To this end, throughout the paper we only consider a broad subclass of MDPs where a certain form of Bellman optimality equation holds [19]:

Assumption 1 (Bellman optimality equation). There exist $J^*\in\mathbb{R}$ and bounded measurable functions $v^* : \mathcal{X}\to\mathbb{R}$ and $q^* : \mathcal{X}\times\mathcal{A}\to\mathbb{R}$ such that the following holds for all $x\in\mathcal{X}$ and $a\in\mathcal{A}$:
$$J^* + q^*(x,a) = r(x,a) + \mathbb{E}_{x'\sim p(\cdot|x,a)}[v^*(x')] \quad\text{and}\quad v^*(x) = \max_{a\in\mathcal{A}} q^*(x,a). \tag{1}$$

Indeed, under this assumption, the claim is that a policy $\pi^*$ that deterministically selects an action from $\mathrm{argmax}_a\, q^*(x,a)$ at each state $x$ is the optimal policy, with $J^{\pi^*}(x) = J^*$ for all $x$. To see this, note that for any policy $\pi$, using the Bellman optimality equation we have
$$J^\pi(x) = \liminf_{T\to\infty}\frac{1}{T}\,\mathbb{E}\left[\sum_{t=1}^T\bigg(J^* + \sum_{a\in\mathcal{A}} q^*(x_t,a)\,\pi(a|x_t) - v^*(x_{t+1})\bigg)\right] \le \liminf_{T\to\infty}\frac{1}{T}\,\mathbb{E}\left[\sum_{t=1}^T\big(J^* + v^*(x_t) - v^*(x_{t+1})\big)\right] = J^*,$$
with equality attained by $\pi^*$, proving the claim. Consequently, under Assumption 1 we simply write the regret as $\mathrm{Reg}_T := \sum_{t=1}^T\big(J^* - r(x_t,a_t)\big)$.

All existing works on regret minimization for infinite-horizon average-reward MDPs make this assumption, either explicitly or through even stronger assumptions which imply it. In the tabular case with a finite state space, weakly communicating MDPs form the broadest class for which regret minimization has been studied in the literature, and they are known to satisfy Assumption 1 (see [30]). More generally, Assumption 1 holds under many other common conditions; see [19, Section 3.3].

Note that $v^*(x)$ and $q^*(x,a)$ quantify the relative advantage of starting with $x$ and starting with $(x,a)$ respectively and then acting optimally in the MDP. Therefore, $v^*$ is sometimes called the state bias function and $q^*$ the state-action bias function.

For a bounded function $v : \mathcal{X}\to\mathbb{R}$, we define its span as $\mathrm{sp}(v) \triangleq \sup_{x,x'\in\mathcal{X}} |v(x) - v(x')|$. Notice that if $(v^*, q^*)$ is a solution of Eq. (1), then a translated version $(v^* - c, q^* - c)$ for any constant $c$ is also a solution. In the remainder of the paper, we let $(v^*, q^*)$ be an arbitrary solution pair of Eq. (1) with a small span, in the sense that $\mathrm{sp}(v^*)\le\mathrm{sp}(v')$ for any other solution $(v', q')$. We also assume without loss of generality that $|v^*(x)|\le\mathrm{sp}(v^*)$ for any $x$, because we can perform the above translation and center the values of $v^*$ around zero.
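To make Assumption 1 concrete, the following minimal Python sketch (ours, not part of the paper) runs relative value iteration on a hypothetical two-state, two-action MDP and numerically recovers a solution $(J^*, v^*)$ of Eq. (1); all transition probabilities and rewards are made up purely for illustration.

```python
import numpy as np

# Toy 2-state, 2-action MDP (hypothetical numbers, for illustration only).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],   # p[a, x] = next-state distribution
              [[0.5, 0.5], [0.7, 0.3]]])
r = np.array([[0.5, -0.2], [1.0, 0.1]])   # r[x, a] in [-1, 1]

def bellman(v):
    # (Tv)(x) = max_a { r(x, a) + E_{x' ~ p(.|x, a)} [v(x')] }
    return np.array([max(r[x, a] + p[a, x] @ v for a in range(2))
                     for x in range(2)])

# Relative value iteration: re-center each iterate at a reference state,
# exploiting the translation-invariance of Eq. (1), so the iterates converge.
v = np.zeros(2)
for _ in range(5000):
    tv = bellman(v)
    J, v = tv[0], tv - tv[0]

# Verify Eq. (1) numerically: J* + v*(x) = max_a { r(x,a) + E[v*(x')] }.
print("J* ~", round(J, 6))
print("Bellman residual:", np.max(np.abs(J + v - bellman(v))))
```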
3 Optimism-Based Algorithms for Linear MDPs

In this section, we present two optimism-based algorithms with sublinear regret, under only one extra assumption that the MDP is linear (also known as a low-rank MDP). We emphasize that earlier works for the infinite-horizon average-reward setting with linear structure all require extra strong assumptions [2, 3, 16]. Specifically, a linear MDP has a transition kernel and a reward function both linear in some state-action feature representation, formally summarized as:

Assumption 2 (Linear MDP). There exist a known $d$-dimensional feature mapping $\Phi : \mathcal{X}\times\mathcal{A}\to\mathbb{R}^d$, $d$ unknown measures $\mu = (\mu_1,\mu_2,\dots,\mu_d)$ over $\mathcal{X}$, and an unknown vector $\theta\in\mathbb{R}^d$ such that for all $x, x'\in\mathcal{X}$ and $a\in\mathcal{A}$,
$$p(x'|x,a) = \Phi(x,a)^\top\mu(x'), \qquad r(x,a) = \Phi(x,a)^\top\theta.$$
Without loss of generality, we further assume that for all $x\in\mathcal{X}$ and $a\in\mathcal{A}$, $\|\Phi(x,a)\|\le\sqrt{2}$, the first coordinate of $\Phi(x,a)$ is fixed to $1$, and that $\|\mu(\mathcal{X})\|\le\sqrt{d}$ and $\|\theta\|\le\sqrt{d}$, where we use $\mu(\mathcal{X})$ to denote the vector $(\mu_1(\mathcal{X}),\dots,\mu_d(\mathcal{X}))$ and $\mu_i(\mathcal{X}) \triangleq \int_{\mathcal{X}}\mathrm{d}\mu_i(x)$ is the total measure of $\mathcal{X}$ under $\mu_i$. (All norms are 2-norms.)

In [21], the same assumption is made except for a different rescaling: $\|\Phi(x,a)\|\le 1$, $\|\mu(\mathcal{X})\|\le\sqrt{d}$, and $\|\theta\|\le\sqrt{d}$. The reason that this is without loss of generality is not justified in [21], and for completeness we prove it in Appendix A. With this scaling, one can clearly augment the feature $\Phi(x,a)$ with a constant coordinate of value $1$ and augment $\mu(x)$ and $\theta$ with a constant coordinate of value $0$, such that the linear structure is preserved while the scaling specified in Assumption 2 holds.
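As a simple sanity check (ours, not from the paper): any tabular MDP satisfies Assumption 2 with one-hot features over state-action pairs and $d = |\mathcal{X}||\mathcal{A}|$, where the rows of the transition matrix play the role of $\mu$ and the reward table plays the role of $\theta$. The sketch below verifies this on random, purely illustrative numbers.

```python
import numpy as np

nX, nA = 3, 2
d = nX * nA
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nX), size=(nX, nA))   # P[x, a] = p(.|x, a)
R = rng.uniform(-1, 1, size=(nX, nA))           # rewards in [-1, 1]

def phi(x, a):                                  # one-hot feature map
    e = np.zeros(d); e[x * nA + a] = 1.0
    return e

mu = P.reshape(d, nX)        # i-th row: the measure mu for the i-th (x, a)
theta = R.reshape(d)

for x in range(nX):
    for a in range(nA):
        assert np.allclose(phi(x, a) @ mu, P[x, a])    # p(.|x,a) = Phi^T mu
        assert np.isclose(phi(x, a) @ theta, R[x, a])  # r(x,a)   = Phi^T theta
print("one-hot features realize Assumption 2 with d =", d)
```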
Under Assumption 2, one can show that the state-action bias function $q^*$ is in fact also linear in the features.

Lemma 1. Under Assumption 1 and Assumption 2, there exists a fixed weight vector $w^*\in\mathbb{R}^d$ such that $q^*(x,a) = \Phi(x,a)^\top w^*$ for all $x\in\mathcal{X}$ and $a\in\mathcal{A}$, and furthermore, $\|w^*\| \le (2+\mathrm{sp}(v^*))\sqrt{d}$.

Based on this lemma, a natural idea emerges: at time $t$, build an estimator $w_t$ of $w^*$ using observed data, then act according to the estimated long-term reward of each action given by $\Phi(x_t,a)^\top w_t$. While the idea is intuitive, how to construct the estimator and, perhaps more importantly, how to incorporate the optimism principle, well known to be important for learning with partial information, are highly non-trivial. In the next two subsections, we describe two different ways of doing so, leading to our two algorithms FOPO and OLSVI.FH.

3.1 The FOPO Algorithm

We present our first algorithm FOPO, which is computationally inefficient but achieves regret $\widetilde{O}(\mathrm{sp}(v^*)\sqrt{d^3T})$. This is optimal in $T$, since even in the tabular case a regret of order $\sqrt{T}$ is unimprovable [20]. See Algorithm 1 for the complete pseudocode.
Algorithm 1 Fixed-point OPtimization with Optimism (FOPO)

Parameters: $0 < \delta < 1$, $\beta = 20(2+\mathrm{sp}(v^*))\,d\sqrt{\log(T/\delta)}$, $\lambda = 1$
Initialize: $\Lambda_1 = \lambda I$, where $I\in\mathbb{R}^{d\times d}$ is the identity matrix
for $t = 1,\dots,T$ do
  if $t = 1$ or $\det(\Lambda_t)\ge 2\det(\Lambda_{s_{t-1}})$ then
    Set $s_t = t$  ▷ $s_t$ records the most recent update
    Let $w_t$ be the solution of the following optimization problem:
      $\max_{w_t, b_t\in\mathbb{R}^d,\, J_t\in\mathbb{R}}\ J_t$ subject to
      $w_t = \Lambda_t^{-1}\sum_{\tau=1}^{t-1}\Phi(x_\tau,a_\tau)\big(r(x_\tau,a_\tau) - J_t + v_t(x_{\tau+1})\big) + b_t$  (2)
      $q_t(x,a) = \Phi(x,a)^\top w_t$, $\quad v_t(x) = \max_a q_t(x,a)$
      $\|b_t\|_{\Lambda_t}\le\beta$, $\quad \|w_t\|\le(2+\mathrm{sp}(v^*))\sqrt{d}$
  else
    $(w_t, J_t, b_t, v_t, q_t, s_t) = (w_{t-1}, J_{t-1}, b_{t-1}, v_{t-1}, q_{t-1}, s_{t-1})$
  Play $a_t = \mathrm{argmax}_a\, q_t(x_t,a)$, observe $r(x_t,a_t)$ and $x_{t+1}$
  Update $\Lambda_{t+1} = \Lambda_t + \Phi(x_t,a_t)\Phi(x_t,a_t)^\top$

As mentioned, the key part lies in how the estimator $w_t$ is constructed. In Algorithm 1, this is done by solving an optimization problem over certain constraints. To understand the first constraint, Eq. (2), recall that $q^*(x,a) = \Phi(x,a)^\top w^*$ satisfies the Bellman optimality equation:
$$\Phi(x,a)^\top w^* = r(x,a) - J^* + \int_{\mathcal{X}} v^*(x')\,p(\mathrm{d}x'|x,a) = r(x,a) - J^* + \int_{\mathcal{X}}\Big(\max_{a'}\Phi(x',a')^\top w^*\Big)\,p(\mathrm{d}x'|x,a).$$
While $p$ and $r$ are unknown, we do observe samples $x_1,\dots,x_{t-1}$ and $r(x_1,a_1),\dots,r(x_{t-1},a_{t-1})$. If for a moment we assume $J^*$ was known, then it is natural to try to find $w_t$ such that
$$\Phi(x_\tau,a_\tau)^\top w_t \approx r(x_\tau,a_\tau) - J^* + \max_{a'}\Phi(x_{\tau+1},a')^\top w_t, \qquad \forall\,\tau = 1,\dots,t-1. \tag{3}$$
In common variants of the Least-Squares Value Iteration (LSVI) update, the $w_t$ on the right-hand side of Eq. (3) would be replaced with another already-computed weight vector $w'_t$, either from the last iteration (i.e., $w_{t-1}$) or from the next layer in the case of episodic MDPs. Then solving a least-squares problem with regularization $\lambda\|w_t\|^2$ gives the natural estimate
$$w_t = \Lambda_t^{-1}\sum_{\tau=1}^{t-1}\Phi(x_\tau,a_\tau)\Big(r(x_\tau,a_\tau) - J^* + \max_{a'}\Phi(x_{\tau+1},a')^\top w'_t\Big), \quad\text{where }\ \Lambda_t = \lambda I + \sum_{\tau=1}^{t-1}\Phi(x_\tau,a_\tau)\Phi(x_\tau,a_\tau)^\top$$
is the regularized empirical covariance matrix. FOPO instead keeps the same $w_t$ on both sides, which leads to the fixed-point constraint Eq. (2); since $J^*$ is in fact unknown, it is treated as a variable $J_t$ to be maximized (implementing optimism), and the slack vector $b_t$ with $\|b_t\|_{\Lambda_t}\le\beta$ accounts for the estimation error. FOPO also adopts a lazy update schedule similar to [1] for linear bandits. However, while they use this lazy update only to save computation, here we use it to make sure that $w_t$ does not change too often, which is critical for our regret analysis.

We point out that the closest existing algorithm we are aware of is the one from a recent work [37] for the finite-horizon setting. Just like theirs, our algorithm does not admit an efficient implementation due to the complicated nature of the optimization problem.
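The following Python sketch (ours) illustrates the structure of FOPO's update rather than a faithful solver: for a fixed candidate $J$ it simply iterates the map defined by constraint Eq. (2) with $b_t = 0$ (a heuristic with no convergence guarantee), and the lazy update is triggered by the determinant-doubling test. The actual optimization over $(w_t, b_t, J_t)$ is non-convex and is precisely what we do not know how to solve efficiently; an outer search over a grid of $J$ values picking the largest feasible one would play the role of the maximization.

```python
import numpy as np

def fopo_heuristic(feats, rewards, next_feats, lam, J, n_iter=500):
    """feats: (t-1, d) rows Phi(x_tau, a_tau); rewards: (t-1,) observed rewards;
    next_feats: (t-1, A, d) rows Phi(x_{tau+1}, a) for every action a."""
    d = feats.shape[1]
    Lam_inv = np.linalg.inv(lam * np.eye(d) + feats.T @ feats)
    w = np.zeros(d)
    for _ in range(n_iter):
        v_next = (next_feats @ w).max(axis=1)            # v_t(x_{tau+1})
        w = Lam_inv @ (feats.T @ (rewards - J + v_next)) # Eq. (2) with b_t = 0
    return w

def needs_update(Lam, det_at_last_update):
    # lazy schedule: recompute (w_t, J_t) only when det(Lambda_t) has doubled
    return np.linalg.det(Lam) >= 2 * det_at_last_update
```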
However, it can be shown that the constraint set is non-empty, with $(w_t, b_t, J_t) = (w^*, b, J^*)$ for some $b$ being a feasible solution (with high probability). This fact also immediately implies that $J_t$ is indeed an optimistic estimator of $J^*$ in the following sense:

Lemma 2. With probability at least $1-\delta$, Algorithm 1 ensures $J_t\ge J^*$ for all $t$.

With the help of this lemma, we prove the following regret bound of FOPO with optimal (in $T$) rate.

Theorem 3. Under Assumptions 1 and 2, FOPO guarantees with probability at least $1-\delta$: $\mathrm{Reg}_T = O\big(\mathrm{sp}(v^*)\sqrt{d^3T}\log(T/\delta)\big)$.

3.2 The OLSVI.FH Algorithm

Next, we present another optimism-based algorithm which can be implemented efficiently, albeit with a suboptimal regret guarantee. The high-level idea is still based on LSVI. However, since we do not know how to efficiently solve a fixed-point problem as in Algorithm 1, we "open the loop" by solving a finite-horizon problem instead. More specifically, we divide the $T$ rounds into $T/H$ episodes, each with $H$ rounds, and run a finite-horizon optimistic LSVI algorithm over the episodes as in [21]. The resulting algorithm is shown in Algorithm 2.

Algorithm 2 OLSVI.FH

Parameters: $0 < \delta < 1$, $\lambda = 1$, $H = \max\big\{\sqrt{\mathrm{sp}(v^*)}\,T^{1/4}/d^{3/4},\ (\mathrm{sp}(v^*)T/d)^{1/3}\big\}$, $\beta = 40dH\sqrt{\log(T/\delta)}$
Initialization: $\Lambda_1 = \lambda I$, where $I\in\mathbb{R}^{d\times d}$ is the identity matrix
Define: $x_{kh} = x_t$ and $a_{kh} = a_t$ for $t = (k-1)H + h$
1: for $k = 1,\dots,T/H$ do
2:   Define $V_{k,H+1}(x) = 0$ for all $x$
3:   for $h = H,\dots,1$ do
4:     Compute $w_{kh} = \Lambda_k^{-1}\sum_{k'=1}^{k-1}\sum_{h'=1}^{H}\Phi(x_{k'h'},a_{k'h'})\big(r(x_{k'h'},a_{k'h'}) + V_{k,h+1}(x_{k'h'+1})\big)$
5:     Define $\widehat{Q}_{kh}(x,a) = w_{kh}\cdot\Phi(x,a) + \beta\sqrt{\Phi(x,a)^\top\Lambda_k^{-1}\Phi(x,a)}$
6:     Define $Q_{kh}(x,a) = \min\{\widehat{Q}_{kh}(x,a),\,H\}$ and $V_{kh}(x) = \max_a Q_{kh}(x,a)$
7:   for $h = 1,\dots,H$ do
8:     Play $a_{kh} = \mathrm{argmax}_a\, Q_{kh}(x_{kh},a)$ and observe $x_{kh+1}$ and $r(x_{kh},a_{kh})$
9:   Update $\Lambda_{k+1} = \Lambda_k + \sum_{h=1}^{H}\Phi(x_{kh},a_{kh})\Phi(x_{kh},a_{kh})^\top$

For simplicity, we replace the time index $t$ with a combination of an episode index $k$ and a step index $h$ within the episode. This gives the relation $t = (k-1)H + h$, and $(x_t,a_t)$ is written as $(x_{kh},a_{kh})$. At the beginning of each episode $k$, the learner computes a set of Q-function parameters $w_{k1},\dots,w_{kH}$ by backward calculation using all historical data (Line 3 to Line 6). Note that Line 4 is now simply an assignment step (as opposed to a fixed-point problem), since $V_{k,h+1}$ has already been computed when we are at step $h$. In Line 5, we introduce optimism by incorporating a bonus term $\beta\|\Phi(x,a)\|_{\Lambda_k^{-1}}$ into the definition of $\widehat{Q}_{kh}(x,a)$, and hence $Q_{kh}(x,a)$. Then in step $h$ of episode $k$, the learner simply follows the greedy choice suggested by $Q_{kh}(x_{kh},\cdot)$ (Line 8).

Note that Algorithm 2 is slightly different from the version in [21]: they maintain a different covariance matrix $\Lambda_{kh}$ separately for each step $h$, but we only maintain a single $\Lambda_k$ for all $h$. Similarly, their $w_{kh}$ is computed using only data related to step $h$ from all previous episodes, while ours is computed using all previous data. This is because in our problem, the steps within an episode share the same transition and reward functions, and consequently they can be learned jointly, which eventually reduces the sample complexity.

Clearly, this reduction ensures that the learner has low regret against the best policy for the finite-horizon problem that we create. However, since our original problem is about the average reward over an infinite horizon, we need to argue that the best finite-horizon policy also performs well under the infinite-horizon criterion. Indeed, we show that the sub-optimality gap of the best finite-horizon policy is bounded by a quantity governed by $\mathrm{sp}(v^*)/H$, which is intuitive since the larger $H$ is, the smaller the gap becomes (see Lemma 13).

In our analysis, for a fixed episode we define $\pi = (\pi_1,\dots,\pi_H)$ as the finite-horizon policy (i.e., a length-$H$ sequence of policies), where each $\pi_h$ is a mapping $\mathcal{X}\to\Delta_{\mathcal{A}}$. For any such finite-horizon policy $\pi$, we define $Q^\pi_h(x,a)$ and $V^\pi_h(x)$ as the value functions for the finite-horizon problem we create, which satisfy $V^\pi_{H+1}(x) = 0$ and, for $h = H,\dots,1$,
$$Q^\pi_h(x,a) = r(x,a) + \mathbb{E}_{x'\sim p(\cdot|x,a)}\big[V^\pi_{h+1}(x')\big], \qquad V^\pi_h(x) = \mathbb{E}_{a\sim\pi_h(\cdot|x)}\,Q^\pi_h(x,a). \tag{4}$$
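In contrast to FOPO, the backward pass of Algorithm 2 is a sequence of plain least-squares assignments. The sketch below (ours, assuming a finite action list and a feature map `phi`) mirrors Lines 3 to 6; note how it re-evaluates $V_{k,h+1}$ on every past state, which is the source of the super-linear computational cost discussed in Section 4.

```python
import numpy as np

def olsvi_backward_pass(phi, actions, history, Lam, beta, H):
    """history: list of (x, a, r, x_next) tuples from all previous episodes."""
    d = Lam.shape[0]
    Lam_inv = np.linalg.inv(Lam)

    def V_next(x, w):       # V_{k,h+1}(x); w is w_{k,h+1} (None when h = H)
        if w is None:
            return 0.0
        # Q_kh(x,a) = min{ w . Phi(x,a) + beta * ||Phi(x,a)||_{Lam^{-1}}, H }
        return max(min(w @ phi(x, a)
                       + beta * np.sqrt(phi(x, a) @ Lam_inv @ phi(x, a)), H)
                   for a in actions)

    ws, w = {}, None
    for h in range(H, 0, -1):             # backward pass: h = H, ..., 1
        target = np.zeros(d)
        for (x, a, r, x2) in history:     # V_{k,h+1} hits every past state
            target += phi(x, a) * (r + V_next(x2, w))
        w = Lam_inv @ target              # Line 4: a plain assignment
        ws[h] = w
    return ws  # play greedily w.r.t. Q_kh built from ws[h] (Line 8)
```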
The analysis of the algorithm relies on the following key lemma, which shows that $Q_{kh}(x,a)$ upper bounds $Q^\pi_h(x,a)$ for any $\pi$.

Lemma 4. With probability at least $1-\delta$, Algorithm 2 ensures, for any finite-horizon policy $\pi$,
$$0 \le Q_{kh}(x,a) - Q^\pi_h(x,a) \le \mathbb{E}_{x'\sim p(\cdot|x,a)}\big[V_{k,h+1}(x') - V^\pi_{h+1}(x')\big] + 2\beta\|\Phi(x,a)\|_{\Lambda_k^{-1}}$$
for all $x, a, k, h$.

With the help of Lemma 4, we prove the final regret bound of OLSVI.FH stated in the next theorem (proof deferred to the appendix).

Theorem 5. Under Assumptions 1 and 2, OLSVI.FH guarantees with probability at least $1-\delta$: $\mathrm{Reg}_T = \widetilde{O}\big(\sqrt{\mathrm{sp}(v^*)}\,(dT)^{3/4} + \mathrm{sp}(v^*)\sqrt{dT}\big)$.

Note that although our bound is suboptimal, OLSVI.FH is the first efficient algorithm with sublinear regret for this setting under only Assumptions 1 and 2.

4 The MDP-EXP2 Algorithm

There are two disadvantages of the optimism-based algorithms introduced in the last section. First, they require the transition kernel and reward function to be both linear in the features (Assumption 2), which is restrictive and might not hold especially when $d$ is small. Second, even the polynomial-time algorithm OLSVI.FH is computationally intensive, because in Line 4 of the algorithm, $V_{k,h+1}$ is applied to all previous states, and every evaluation of $V_{k,h+1}$ requires computing $\|\Phi(x,a)\|_{\Lambda_k^{-1}}$. Since this is done for every $k$, the total computational cost of the algorithm is super-linear in $T$. In fact, all existing optimism-based algorithms with linear function approximation suffer from the same issue [36, 21, 37].

To this end, we propose yet another algorithm based on very different ideas. It is computationally less intensive and enjoys $\widetilde{O}(\sqrt{T})$ regret, albeit under a different (and non-comparable) set of assumptions compared to those of Section 3. Note that these are the same assumptions made in [2, 16]. Below, we start by stating these assumptions, followed by the description of our algorithm.

The first assumption we make is that the MDP is uniformly mixing.

Assumption 3 (Uniform Mixing). There exists a constant $t_{\mathrm{mix}}\ge 1$ such that for any policy $\pi$ and any distributions $\nu_1, \nu_2\in\Delta_{\mathcal{X}}$ over the state space,
$$\|P_\pi\nu_1 - P_\pi\nu_2\|_{\mathrm{TV}} \le e^{-1/t_{\mathrm{mix}}}\|\nu_1 - \nu_2\|_{\mathrm{TV}},$$
where $(P_\pi\nu)(x') = \int_{\mathcal{X}}\sum_{a\in\mathcal{A}}\pi(a|x)\,p(x'|x,a)\,\mathrm{d}\nu(x)$ and $\|\cdot\|_{\mathrm{TV}}$ is the total variation.

Under this uniform mixing assumption, we are able to define the stationary state distribution of a policy $\pi$ as $\nu_\pi = (P_\pi)^\infty\nu$ for an arbitrary initial distribution $\nu$.
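The next sketch (ours, on a hypothetical two-state chain) illustrates Assumption 3: it measures the one-step total-variation contraction of $P_\pi$ between the two point masses (for two states this is exactly the worst case), from which the smallest valid $t_{\mathrm{mix}}$ can be read off, and computes $\nu_\pi = (P_\pi)^\infty\nu$ by power iteration.

```python
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],  # P[a, x] = p(.|x, a), made-up numbers
              [[0.5, 0.5], [0.7, 0.3]]])
pi = np.array([[0.3, 0.7], [0.6, 0.4]])  # pi[x, a]

# (P_pi nu)(x') = int sum_a pi(a|x) p(x'|x, a) dnu(x), as a stochastic matrix
P_pi = np.einsum('xa,axy->xy', pi, P)

nu1, nu2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
gamma = np.abs(nu1 @ P_pi - nu2 @ P_pi).sum() / np.abs(nu1 - nu2).sum()
# smallest t_mix satisfying gamma <= exp(-1/t_mix) is -1/log(gamma)
print("TV contraction:", gamma, "-> t_mix can be taken as", -1.0 / np.log(gamma))

nu = nu1                                 # nu_pi = (P_pi)^infty nu
for _ in range(1000):
    nu = nu @ P_pi
print("stationary nu_pi:", nu, " fixed point:", np.allclose(nu, nu @ P_pi))
```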
Moreover, we now not only have the Bellman optimality equation (1) (that is, Assumption 3 implies Assumption 1), but also a Bellman equation for every policy $\pi$, as shown in the following lemma.

Lemma 6. Suppose Assumption 3 holds. For any $\pi$, its long-term average reward $J^\pi(x)$ is independent of the initial state $x$, and is thus denoted by $J^\pi$. Also, the following Bellman equation holds:
$$J^\pi + q^\pi(x,a) = r(x,a) + \mathbb{E}_{x'\sim p(\cdot|x,a)}[v^\pi(x')] \quad\text{and}\quad v^\pi(x) = \sum_{a\in\mathcal{A}}\pi(a|x)\,q^\pi(x,a)$$
for some measurable functions $v^\pi : \mathcal{X}\to[-t_{\mathrm{mix}}, t_{\mathrm{mix}}]$ and $q^\pi : \mathcal{X}\times\mathcal{A}\to[-t_{\mathrm{mix}}, t_{\mathrm{mix}}]$ with $\int_{\mathcal{X}} v^\pi(x)\,\mathrm{d}\nu_\pi(x) = 0$.

On the other hand, with this assumption (stronger than Assumption 1), we can replace Assumption 2 (linear MDP) with the following weaker one, which only requires the bias function $q^\pi$ to be linear. (In Lemma 14 in the appendix, we show that this is indeed weaker than the linear MDP assumption.)

Assumption 4 (Linear bias function). There exists a known $d$-dimensional feature mapping $\Phi : \mathcal{X}\times\mathcal{A}\to\mathbb{R}^d$ such that for every policy $\pi$, $q^\pi(x,a)$ can be written as $\Phi(x,a)^\top w^\pi$ for some weight vector $w^\pi\in\mathbb{R}^d$. Again, without loss of generality (justified in Appendix A), we assume that for all $x, a$, $\|\Phi(x,a)\|\le\sqrt{2}$ holds, the first coordinate of $\Phi(x,a)$ is fixed to $1$, and for all $\pi$, $\|w^\pi\|\le t_{\mathrm{mix}}\sqrt{d}$.

The last assumption we make is uniformly excited features, which intuitively guarantees that every policy is explorative in the feature space.

Assumption 5 (Uniformly excited features). There exists $\sigma > 0$ such that for any $\pi$,
$$\lambda_{\min}\left(\int_{\mathcal{X}}\Big(\sum_a \pi(a|x)\,\Phi(x,a)\Phi(x,a)^\top\Big)\mathrm{d}\nu_\pi(x)\right) \ge \sigma,$$
where $\lambda_{\min}$ denotes the smallest eigenvalue.

This assumption is needed due to the nature of our algorithm, which only performs a local search over the parameters. It can potentially be weakened by combining our algorithm with the idea of Abbasi-Yadkori et al. [3] (details omitted).

Algorithm. We are now ready to present our MDP-EXP2 algorithm; see Algorithm 3 for the pseudocode. MDP-EXP2 divides the $T$ steps into $T/B$ epochs of equal length $B = \widetilde{O}(dt_{\mathrm{mix}}/\sigma)$. In each epoch $k$, the algorithm executes a fixed policy $\pi_k$ (explained later) and collects $B/(2N)$ disjoint trajectories, each of length $N = \widetilde{O}(t_{\mathrm{mix}})$. Between every two consecutive trajectories, there is a window of length $N$ in which the algorithm does not collect any samples, so that the correlation between samples from different trajectories is reduced. See Figure 1 in the appendix for an illustration.

Algorithm 3 MDP-EXP2

Parameters: $N = 8t_{\mathrm{mix}}\log T$, $B = 32N\log(dT)\,\sigma^{-1}$, $\eta = \min\big\{\sqrt{1/(Tt_{\mathrm{mix}})},\ \sigma/(24N)\big\}$
1: for $k = 1,\dots,T/B$ do  ▷ $k$ indexes an epoch
2:   Define policy $\pi_k$ such that $\pi_k(a|x)\propto\exp\big(\eta\sum_{j=1}^{k-1}\Phi(x,a)^\top w_j\big)$ for every $x\in\mathcal{X}$
3:   for $t = (k-1)B+1,\dots,kB$ do
4:     Play $a_t\sim\pi_k(\cdot|x_t)$, observe $r(x_t,a_t)$ and $x_{t+1}$  ▷ execute $\pi_k$ in the entire epoch
5:   for $m = 1,\dots,B/(2N)$ do  ▷ $m$ indexes a trajectory
6:     Define $\tau_{k,m} = (k-1)B + 2N(m-1) + N + 1$  ▷ first step of the $m$-th trajectory
7:     Compute $R_{k,m} = \sum_{t=\tau_{k,m}}^{\tau_{k,m}+N-1} r(x_t,a_t)$  ▷ total reward of the $m$-th trajectory
8:   Compute  ▷ $\lambda_{\min}$ denotes the minimum eigenvalue
$$M_k = \sum_{m=1}^{B/(2N)}\sum_a \pi_k(a|x_{\tau_{k,m}})\,\Phi(x_{\tau_{k,m}},a)\Phi(x_{\tau_{k,m}},a)^\top, \qquad w_k = \begin{cases} M_k^{-1}\sum_{m=1}^{B/(2N)}\Phi(x_{\tau_{k,m}},a_{\tau_{k,m}})\,R_{k,m}, & \text{if }\lambda_{\min}(M_k)\ge\frac{B\sigma}{4N},\\[2pt] 0, & \text{else.}\end{cases}$$
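A compact sketch (ours) of one epoch of Algorithm 3: the policy is evaluated on the fly from the running sum of past estimators, trajectories of length $N$ are separated by length-$N$ gaps, and the estimator $w_k$ is zeroed out when $M_k$ is poorly conditioned. The MDP itself is abstracted by a hypothetical `step(x, a)` function returning a reward and the next state; between epochs, the caller would accumulate `w_sum += w_k`.

```python
import numpy as np

def softmax_policy(x, actions, phi, w_sum, eta):
    # pi_k(a|x) proportional to exp(eta * Phi(x,a)^T sum_{j<k} w_j)  (Line 2)
    logits = eta * np.array([phi(x, a) @ w_sum for a in actions])
    p = np.exp(logits - logits.max())
    return p / p.sum()

def mdp_exp2_epoch(x, actions, phi, step, w_sum, eta, B, N, sigma, d):
    """One epoch of MDP-EXP2; B must be a multiple of 2N. Returns (w_k, x)."""
    M = np.zeros((d, d))     # design matrix over trajectory start states
    g = np.zeros(d)          # accumulates Phi(x_tau, a_tau) * R_{k,m}
    R, f_start = 0.0, None
    for t in range(B):
        p = softmax_policy(x, actions, phi, w_sum, eta)
        i = np.random.choice(len(actions), p=p)
        pos = t % (2 * N)    # first N steps of each 2N-block: burn-in gap
        if pos == N:         # first step of the m-th trajectory
            f_start = phi(x, actions[i])
            M += sum(pa * np.outer(phi(x, a), phi(x, a))
                     for pa, a in zip(p, actions))
            R = 0.0
        r, x = step(x, actions[i])   # hypothetical environment interface
        if pos >= N:
            R += r
        if pos == 2 * N - 1:
            g += f_start * R
    # w_k estimates w^{pi_k} + N*J^{pi_k}*e_1, zeroed if M is ill-conditioned
    if np.linalg.eigvalsh(M).min() >= B * sigma / (4 * N):
        return np.linalg.solve(M, g), x
    return np.zeros(d), x
```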
In the analysis, we show that the expected total reward of a trajectory is roughly $q^\pi(x_\tau,a_\tau) + NJ^\pi$ (Lemma 15), where $\pi$ is the policy used to collect that trajectory and $\tau$ is the first step of the trajectory. By Assumption 4 we have $q^\pi(x_\tau,a_\tau) + NJ^\pi = \Phi(x_\tau,a_\tau)^\top(w^\pi + NJ^\pi e_1)$, where $e_1 = (1,0,\dots,0)$. This observation allows us to draw a connection between this problem and adversarial linear bandits. To see this, first note that the regret is roughly $B\sum_{k=1}^{T/B}(J^* - J^{\pi_k})$. By the standard value difference lemma [23, Lemma 5.2.1], we have
$$\sum_{k=1}^{T/B}(J^* - J^{\pi_k}) = \int_{\mathcal{X}}\left(\sum_{k=1}^{T/B}\sum_a\big(\pi^*(a|x) - \pi_k(a|x)\big)\,q^{\pi_k}(x,a)\right)\mathrm{d}\nu_{\pi^*}(x),$$
where, according to the previous observation and the fact $\sum_a(\pi^*(a|x) - \pi_k(a|x))\,NJ^{\pi_k} = 0$, the term in the parentheses with respect to a fixed state $x$ can be further written as $\sum_{k=1}^{T/B}\sum_a(\pi^*(a|x) - \pi_k(a|x))\,\Phi(x,a)^\top(w^{\pi_k} + NJ^{\pi_k}e_1)$. This is exactly the regret of a standard online learning problem over a set of actions $\{\Phi(x,a)\}_{a\in\mathcal{A}}$ with linear reward functions parameterized by the weight vector $(w^{\pi_k} + NJ^{\pi_k}e_1)$ at step $k$. Moreover, since we do not observe this weight but have access to the reward of a trajectory whose mean is roughly $\Phi(x,a)^\top(w^{\pi_k} + NJ^{\pi_k}e_1)$ as mentioned, we are in the so-called bandit setting. In fact, since the weight can generally change arbitrarily over time (because $\pi_k$ is changing), this is an adversarial linear bandit problem.

With this connection in mind, the idea behind MDP-EXP2 is to run an instance of EXP2 for every state simultaneously: in each epoch $k$, the algorithm constructs an estimator $w_k$ of the reward vector $w^{\pi_k} + NJ^{\pi_k}e_1$. The construction mostly follows the idea of EXP2, with the only difference being the way of controlling the variance — instead of the explicit exploration used by the original EXP2, here the uniformly excited features condition already makes every policy sufficiently explorative (to make sure that $\|w_k\|$ is not too large, we also set it to $0$ if $\lambda_{\min}(M_k)$ is too small). Finally, with these estimators, the policy for epoch $k$ is computed by a standard exponential weight update rule (see Line 2).

We emphasize that MDP-EXP2 only needs to store the estimators $w_j$ and calculate $\pi_k(\cdot|x_t)$ on the fly for each $x_t$, which is even more efficient than the optimism-based algorithms. It also enjoys a favorable regret guarantee of order $\widetilde{O}(\sqrt{T})$, as shown below. Once again, the best existing result under the same set of assumptions is $\widetilde{O}(T^{3/4})$ from [16].

Theorem 7. Under Assumptions 3, 4, and 5, MDP-EXP2 ensures $\mathbb{E}[\mathrm{Reg}_T] = \widetilde{O}\big(\sqrt{t_{\mathrm{mix}}T}/\sigma\big)$.

Note that while the bound in Theorem 7 seemingly does not depend on $d$, the dependence is in fact implicit, because $\sigma = O(1/d)$ always holds by the definition of $\sigma$ (see Remark 1 in the appendix). We provide a proof of this fact along with the proof of Theorem 7 in the appendix.

Connections to Natural Policy Gradient. Finally, we remark that although MDP-EXP2 originates from the adversarial linear bandit algorithm EXP2, it is closely related to the (in fact much earlier) reinforcement learning algorithm Natural Policy Gradient (NPG) [22] under softmax parameterization. The connection between softmax-parameterized NPG and the exponential weight update was formalized in a recent work by Agarwal et al. [4]. In Appendix E, we first restate this connection, and then compare the implementation details of MDP-EXP2 and NPG.

5 Conclusion

In this work, we provide three new algorithms for learning infinite-horizon average-reward MDPs with linear function approximation, significantly extending and improving previous works. One key open question is how to achieve the optimal $\widetilde{O}(\sqrt{T})$ regret efficiently under the linear MDP assumption. In Appendix E, we also discuss another open question, related to weakening Assumption 5 while maintaining a similar regret bound.

References

[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
[2] Yasin Abbasi-Yadkori, Peter Bartlett, Kush Bhatia, Nevena Lazic, Csaba Szepesvari, and Gellért Weisz. Politex: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning, pages 3692–3702, 2019.
[3] Yasin Abbasi-Yadkori, Nevena Lazic, Csaba Szepesvari, and Gellert Weisz. Exploration-enhanced Politex. arXiv preprint arXiv:1908.10479, 2019.
[4] Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. Optimality and approximation with policy gradient methods in Markov decision processes. In Conference on Learning Theory, 2020.
[5] Keith Ball. An elementary introduction to modern convex geometry. Flavors of Geometry, 31:1–58, 1997.
[6] Peter L Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35–42. AUAI Press, 2009.
[7] Steven J Bradtke and Andrew G Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33–57, 1996.
[8] Sébastien Bubeck, Nicolo Cesa-Bianchi, and Sham M Kakade. Towards minimax policies for online linear optimization with bandit feedback. In Conference on Learning Theory, 2012.
[9] Nicolo Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.
[10] Yichen Chen, Lihong Li, and Mengdi Wang. Scalable bilinear π learning using state and action features. In International Conference on Machine Learning, pages 834–843, 2018.
[11] Varsha Dani, Sham M Kakade, and Thomas P Hayes. The price of bandit information for online optimization. In Advances in Neural Information Processing Systems, 2008.
[12] Kefan Dong, Jian Peng, Yining Wang, and Yuan Zhou. √n-regret for learning in Markov decision processes with function approximation and low Bellman rank. In Conference on Learning Theory, 2020.
[13] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pages 23–37. Springer, 1995.
[14] Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In International Conference on Machine Learning, pages 1573–1581, 2018.
[15] Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Improved analysis of UCRL2 with empirical Bernstein inequality. arXiv preprint arXiv:2007.05456, 2020.
[16] Botao Hao, Nevena Lazic, Yasin Abbasi-Yadkori, Pooria Joulani, and Csaba Szepesvari. Provably efficient adaptive approximate policy iteration. arXiv preprint arXiv:2002.03069, 2020.
[17] Nick Harvey. Matrix Chernoff bounds. Lecture notes.
[18] Elad Hazan and Zohar Karnin. Volumetric spanners: An efficient exploration basis for learning. Journal of Machine Learning Research, 17(119):1–34, 2016.
[19] Onésimo Hernández-Lerma. Adaptive Markov Control Processes, volume 79. Springer Science & Business Media, 2012.
[20] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
[21] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, 2020.
[22] Sham M Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, pages 1531–1538, 2002.
[23] Sham Machandranath Kakade. On the sample complexity of reinforcement learning.
PhD thesis, University College London, 2003.
[24] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
[25] Gergely Neu and Julia Olkhovskaya. Online learning in MDPs with linear function approximation and bandit feedback. arXiv preprint arXiv:2007.01612, 2020.
[26] Gergely Neu, András György, Csaba Szepesvári, and András Antos. Online Markov decision processes under bandit feedback. IEEE Transactions on Automatic Control, 59:676–691, 2013.
[27] Gergely Neu, Anders Jonsson, and Vicenç Gómez. A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798, 2017.
[28] Ronald Ortner. Regret bounds for reinforcement learning via Markov chain concentration. Journal of Artificial Intelligence Research, 67:115–128, 2020.
[29] Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. In International Conference on Machine Learning, pages 2377–2386, 2016.
[30] Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
[31] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
[32] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[33] Mohammad Sadegh Talebi and Odalric-Ambrym Maillard. Variance-aware regret bounds for undiscounted reinforcement learning in MDPs. In Algorithmic Learning Theory, pages 770–805, 2018.
[34] Ruosong Wang, Ruslan Salakhutdinov, and Lin F Yang. Provably efficient reinforcement learning with general value function approximation. arXiv preprint arXiv:2005.10804, 2020.
[35] Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, Hiteshi Sharma, and Rahul Jain. Model-free reinforcement learning in infinite-horizon average-reward Markov decision processes. In International Conference on Machine Learning, 2020.
[36] Lin F Yang and Mengdi Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. In International Conference on Machine Learning, 2020.
[37] Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, and Emma Brunskill. Learning near optimal policies with low inherent Bellman error. In International Conference on Machine Learning, 2020.
[38] Zihan Zhang and Xiangyang Ji. Regret minimization for reinforcement learning by evaluating the optimal bias function. In Advances in Neural Information Processing Systems, 2019.

A Auxiliary Lemmas Related to Assumption 2 and Assumption 4

In this section, we justify the scaling conditions in Assumption 2 and Assumption 4, showing that they are indeed without loss of generality as long as one transforms and normalizes the features in some way beforehand.

Lemma 8. Let $\Phi = \{\Phi(x,a) : x\in\mathcal{X}, a\in\mathcal{A}\}\subset\mathbb{R}^d$ be a feature set with rank $d$. Then there exists an invertible linear transformation $v\mapsto Av$ with $A\in\mathbb{R}^{d\times d}$ such that for any function $F : \mathcal{X}\times\mathcal{A}\to\mathbb{R}$ defined by $F(x,a) = \Phi(x,a)^\top z$ for some $z\in\mathbb{R}^d$, we have $\|A\Phi(x,a)\|\le 1$ and $\|A^{-1}z\|\le\sqrt{d}\,F_{\max}$, where $F_{\max}\triangleq\sup_{x,a}|F(x,a)|$.
This lemma implies that if we use the transformed feature $\Phi'(x,a) = A\Phi(x,a)$ with $\|\Phi'(x,a)\|\le 1$, then any function $F(x,a) = \Phi(x,a)^\top z$ can be equivalently written as $F(x,a) = \Phi'(x,a)^\top z'$ with $z' = A^{-1}z$ and $\|z'\|\le\sqrt{d}\,F_{\max}$. Therefore, taking $z$ to be $\mu(\mathcal{X})$ or $\theta$ for Assumption 2, or $w^\pi$ for Assumption 4, with the corresponding $F(x,a)$ being $\int_{\mathcal{X}}p(\mathrm{d}x'|x,a)$, $r(x,a)$, and $q^\pi(x,a)$, and $F_{\max}$ being $1$, $1$, and $t_{\mathrm{mix}}$ (Lemma 6) respectively, justifies the scaling stated in these assumptions. Notice that the transformation $A$ depends only on the feature set $\Phi$, not on $F$ or $z$. Thus we can perform this transformation as long as we know the feature map. This is similar to the standard preprocessing step of feature normalization in machine learning.

Proof of Lemma 8. Define $-\Phi = \{-\Phi(x,a) : x\in\mathcal{X}, a\in\mathcal{A}\}$ and $K(\Phi) = \Phi\cup-\Phi$. We first argue that for any bounded feature set $\Phi\subset\mathbb{R}^d$, there exists an invertible linear transformation $v\mapsto Av$ with $A\in\mathbb{R}^{d\times d}$ such that the minimum volume enclosing ellipsoid (MVEE) of the transformed feature set $K(A\Phi)$, where $A\Phi\triangleq\{A\Phi(x,a) : x\in\mathcal{X}, a\in\mathcal{A}\}$, is the unit sphere. This can be seen as follows: notice that $K(\Phi)$ is always symmetric around the origin, and so is its MVEE. Suppose that the MVEE of $K(\Phi)$ is $\{u\in\mathbb{R}^d : u^\top Bu = 1\}$ for some invertible $B$ (otherwise $\Phi$ is not full-rank). Then if we pick $A = B^{1/2}$, the MVEE of $K(A\Phi)$ is the unit sphere.

Now consider the new feature $\Phi'(x,a)\triangleq A\Phi(x,a)$, where the MVEE of $K(\Phi')$ is the unit sphere (which implies $\|\Phi'(x,a)\|\le 1$). Defining $z' = A^{-1}z$, we have $\Phi'(x,a)^\top z' = \Phi(x,a)^\top z = F(x,a)$. Below, we show that $\|z'\|\le\sqrt{d}\,F_{\max}$.

By Lemma 9 below, there exists a subset $M = \{u_1,\dots,u_m\}\subseteq K(\Phi')$ that lies on the unit sphere, and non-negative weights $c_1,\dots,c_m$, such that
$$\sum_{i=1}^m c_iu_iu_i^\top = I_d.$$
Taking the trace on both sides, we get $\sum_{i=1}^m c_i = d$. Note that we have $F(x,a) = \Phi'(x,a)^\top z'$ for all $x, a$. Applying this to the elements of $M$, and using the fact that $|F(x,a)|\le F_{\max}$, we get
$$dF_{\max}^2 = \sum_{i=1}^m c_iF_{\max}^2 \ge \sum_{i=1}^m c_i(u_i^\top z')^2 = z'^\top\Big(\sum_{i=1}^m c_iu_iu_i^\top\Big)z' = \|z'\|^2,$$
which implies $\|z'\|\le\sqrt{d}\,F_{\max}$ and finishes the proof.

Lemma 9 ([18, Theorem 6], [5]). Let $K$ be a symmetric set such that its MVEE is the unit sphere. Then there exist $m\le d(d+1)/2$ contact points of $K$ and the sphere, $u_1,\dots,u_m$, and non-negative weights $c_1,\dots,c_m$ such that $\sum_i c_iu_i = 0$ and $\sum_i c_iu_iu_i^\top = I_d$.
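For a finite feature set, the transformation $A$ of Lemma 8 can be computed explicitly. The sketch below (ours, a numerical approximation rather than part of the paper) uses the classic Frank-Wolfe/Khachiyan iteration for the origin-centered MVEE of the symmetrized set — equivalently, a D-optimal design problem — and returns $A = B^{1/2}$.

```python
import numpy as np

def normalizing_transform(U, n_iter=2000):
    """U: (n, d) matrix of features u_i. Returns A with max_i ||A u_i|| ~ 1."""
    n, d = U.shape
    p = np.ones(n) / n                       # design weights over the points
    for _ in range(n_iter):
        S = U.T @ (U * p[:, None])           # sum_i p_i u_i u_i^T
        lev = np.einsum('ij,jk,ik->i', U, np.linalg.inv(S), U)  # leverages
        j = np.argmax(lev)                   # most violated point
        step = (lev[j] / d - 1) / (lev[j] - 1)   # Frank-Wolfe step size
        p = (1 - step) * p
        p[j] += step
    S = U.T @ (U * p[:, None])
    # At optimality the MVEE of the symmetric hull is {u: u^T (dS)^{-1} u <= 1};
    # with B = (dS)^{-1}, the symmetric map A = B^{1/2} sends it to the ball.
    B = np.linalg.inv(d * S)
    evals, evecs = np.linalg.eigh(B)
    return evecs @ np.diag(np.sqrt(evals)) @ evecs.T

# Usage: U = np.random.randn(50, 3); A = normalizing_transform(U)
# then np.linalg.norm(U @ A.T, axis=1).max() is approximately 1.
```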
B Auxiliary Lemmas for Self-normalized Processes

In this section, we provide some useful lemmas related to the concentration of self-normalized processes. The first two are taken directly from [21, Appendix D.2].

Lemma 10 (Concentration of Self-Normalized Processes). Let $\{\varepsilon_t\}_{t=1}^\infty$ be a real-valued stochastic process with corresponding filtration $\{\mathcal{F}_t\}_{t=0}^\infty$. Let $\varepsilon_t\,|\,\mathcal{F}_{t-1}$ be zero-mean and $\sigma$-subgaussian, that is, $\mathbb{E}[\varepsilon_t|\mathcal{F}_{t-1}] = 0$ and $\mathbb{E}[e^{\lambda\varepsilon_t}|\mathcal{F}_{t-1}]\le e^{\lambda^2\sigma^2/2}$ for all $\lambda\in\mathbb{R}$. Let $\{\phi_t\}_{t=0}^\infty$ be an $\mathbb{R}^d$-valued stochastic process where $\phi_t\in\mathcal{F}_{t-1}$. Assume that $\Lambda_0$ is a $d\times d$ positive definite matrix, and let $\Lambda_t = \Lambda_0 + \sum_{s=1}^{t-1}\phi_s\phi_s^\top$. Then for any $\delta > 0$, with probability at least $1-\delta$, we have for all $t > 0$,
$$\Big\|\sum_{s=1}^{t-1}\phi_s\varepsilon_s\Big\|^2_{\Lambda_t^{-1}} \le 2\sigma^2\log\left[\frac{\det(\Lambda_t)^{1/2}\det(\Lambda_0)^{-1/2}}{\delta}\right].$$

Lemma 11. Let $\{x_t\}_{t=1}^\infty$ be a stochastic process on state space $\mathcal{X}$ with corresponding filtration $\{\mathcal{F}_t\}_{t=0}^\infty$, let $\{\phi_t\}_{t=0}^\infty$ be an $\mathbb{R}^d$-valued stochastic process where $\phi_t\in\mathcal{F}_{t-1}$ and $\|\phi_t\|\le 1$, let $\Lambda_t = \lambda I + \sum_{s=1}^{t-1}\phi_s\phi_s^\top$, and let $\mathcal{V}\subseteq\mathbb{R}^{\mathcal{X}}$ be an arbitrary set of functions defined on $\mathcal{X}$, with $N_\varepsilon$ being its $\varepsilon$-covering number with respect to $\mathrm{dist}(v,v') = \sup_x|v(x) - v'(x)|$ for some fixed $\varepsilon > 0$. Then for any $\delta > 0$, with probability at least $1-\delta$, for all $t > 0$ and any $v\in\mathcal{V}$ such that $\sup_x|v(x)|\le H$, we have
$$\Big\|\sum_{s=1}^{t-1}\phi_s\big(v(x_s) - \mathbb{E}[v(x_s)|\mathcal{F}_{s-1}]\big)\Big\|^2_{\Lambda_t^{-1}} \le 4H^2\left[\frac{d}{2}\log\Big(\frac{t+\lambda}{\lambda}\Big) + \log\frac{N_\varepsilon}{\delta}\right] + \frac{8t^2\varepsilon^2}{\lambda}.$$

Lemma 12. Let $\mathcal{V}$ be a class of mappings from $\mathcal{X}$ to $\mathbb{R}$ parametrized by $\alpha = (\alpha_1,\alpha_2,\dots,\alpha_P)\in\mathbb{R}^P$ with $\alpha_i\in[-B,B]$ for all $i$. Suppose that for any $v\in\mathcal{V}$ (parameterized by $\alpha$) and $v'\in\mathcal{V}$ (parameterized by $\alpha'$), the following holds:
$$\sup_{x\in\mathcal{X}}|v(x) - v'(x)| \le L\sum_{i=1}^P|\alpha_i - \alpha'_i|.$$
Let $N_\varepsilon$ be the $\varepsilon$-covering number of $\mathcal{V}$ with respect to the distance $\mathrm{dist}(v,v') = \sup_{x\in\mathcal{X}}|v(x) - v'(x)|$. Then
$$\log N_\varepsilon \le P\log\Big(\frac{2BLP}{\varepsilon}\Big).$$

Proof. If $\alpha$ and $\alpha'$ are such that $|\alpha_i - \alpha'_i|\le\frac{\varepsilon}{LP}$ for all $i$, then we have
$$\mathrm{dist}(v,v') = \sup_{x\in\mathcal{X}}|v(x) - v'(x)| \le L\sum_{i=1}^P|\alpha_i - \alpha'_i| \le \varepsilon.$$
Therefore, the following set constitutes an $\varepsilon$-cover of $\mathcal{V}$:
$$\Big\{\alpha\in\mathbb{R}^P : \alpha_i = \tfrac{k\varepsilon}{LP}\text{ for some }k\in\mathbb{Z}\Big\}\cap[-B,B]^P.$$
The number of elements in this set is upper bounded by $\big(\frac{2BLP}{\varepsilon}\big)^P$.

C Omitted Analysis in Section 3

Proof of Lemma 1. By the two assumptions, we have (with $e_1 = (1,0,\dots,0)$)
$$q^*(x,a) = r(x,a) - J^* + \mathbb{E}_{x'\sim p(\cdot|x,a)}[v^*(x')] = \Phi(x,a)^\top\theta - J^*\Phi(x,a)^\top e_1 + \Phi(x,a)^\top\int_{\mathcal{X}}v^*(x')\,\mathrm{d}\mu(x') = \Phi(x,a)^\top\Big(\theta - J^*e_1 + \int_{\mathcal{X}}v^*(x')\,\mathrm{d}\mu(x')\Big).$$
Therefore, we can define $w^* = \theta - J^*e_1 + \int_{\mathcal{X}}v^*(x')\,\mathrm{d}\mu(x')$, proving the first claim. Furthermore,
$$\|w^*\| \le \|\theta\| + 1 + \sup_{x'\in\mathcal{X}}|v^*(x')|\cdot\|\mu(\mathcal{X})\| \le \sqrt{d} + 1 + \mathrm{sp}(v^*)\sqrt{d} \le (2+\mathrm{sp}(v^*))\sqrt{d},$$
which proves the second claim.

C.1 Omitted Analysis in Section 3.1

Proof of Lemma 2. It suffices to show that with probability at least $1-\delta$, $(w^*, b, J^*)$ for some $b$ is a feasible solution of the optimization problem (since $J_t$, being the optimal value, is then at least $J^*$). To show this, first note that
$$w^* = \Lambda_t^{-1}\sum_{\tau=1}^{t-1}\Phi(x_\tau,a_\tau)\Phi(x_\tau,a_\tau)^\top w^* + \lambda\Lambda_t^{-1}w^* \qquad\text{(definition of $\Lambda_t$)}$$
$$= \Lambda_t^{-1}\sum_{\tau=1}^{t-1}\Phi(x_\tau,a_\tau)\big(r(x_\tau,a_\tau) - J^* + \mathbb{E}_{x'\sim p(\cdot|x_\tau,a_\tau)}v^*(x')\big) + \lambda\Lambda_t^{-1}w^* \qquad\text{(by $q^*(x_\tau,a_\tau) = \Phi(x_\tau,a_\tau)^\top w^*$ and Eq. (1))}$$
$$= \Lambda_t^{-1}\sum_{\tau=1}^{t-1}\Phi(x_\tau,a_\tau)\big(r(x_\tau,a_\tau) - J^* + v^*(x_{\tau+1})\big) + \lambda\Lambda_t^{-1}w^* + \epsilon^*_t,$$
where
$$\epsilon^*_t = \Lambda_t^{-1}\sum_{\tau=1}^{t-1}\Phi(x_\tau,a_\tau)\big(\mathbb{E}_{x'\sim p(\cdot|x_\tau,a_\tau)}v^*(x') - v^*(x_{\tau+1})\big).$$
Using Lemma 10 with $\varepsilon_\tau = \mathbb{E}_{x'\sim p(\cdot|x_\tau,a_\tau)}v^*(x') - v^*(x_{\tau+1})$ and $\phi_\tau = \Phi(x_\tau,a_\tau)$, we have with probability at least $1-\delta$ (note that given the past, $\varepsilon_\tau$ is zero-mean and in the range $[-\mathrm{sp}(v^*), \mathrm{sp}(v^*)]$, thus $\mathrm{sp}(v^*)$-subgaussian),
$$\|\epsilon^*_t\|_{\Lambda_t} = \Big\|\sum_{\tau=1}^{t-1}\phi_\tau\varepsilon_\tau\Big\|_{\Lambda_t^{-1}} \le \sqrt{2}\,\mathrm{sp}(v^*)\sqrt{\log\frac{\det(\Lambda_t)^{1/2}\det(\Lambda_1)^{-1/2}}{\delta}} \le \sqrt{2}\,\mathrm{sp}(v^*)\sqrt{\log\frac{(1+\frac{2T}{\lambda d})^{d/2}}{\delta}} \le \frac{\beta}{2},$$
where we use the fact
$$\det(\Lambda_t) \le \Big(\frac{\mathrm{tr}(\Lambda_t)}{d}\Big)^d = \bigg(\frac{\lambda d + \sum_{\tau=1}^{t-1}\|\phi_\tau\|^2}{d}\bigg)^d \le \Big(\frac{\lambda d + 2T}{d}\Big)^d$$
and the definition of $\beta$. Also,
$$\lambda\|\Lambda_t^{-1}w^*\|_{\Lambda_t} = \lambda\|w^*\|_{\Lambda_t^{-1}} \le \sqrt{\lambda}\,\|w^*\| \le (2+\mathrm{sp}(v^*))\sqrt{\lambda d} \le \frac{\beta}{2}$$
(Lemma 1). Defining $b = \lambda\Lambda_t^{-1}w^* + \epsilon^*_t$, we have thus proven that $\|b\|_{\Lambda_t}\le\beta$ holds with probability at least $1-\delta$, which shows that $(w^*, b, J^*)$ is a feasible solution of the optimization problem and finishes the proof.

Proof of Theorem 3. Without loss of generality, we assume $\mathrm{sp}(v^*)\le\sqrt{T}$, $d\le\sqrt{T}$, and that $T$ is larger than some absolute constant (otherwise the bound is vacuous). Fix $t$ and let $s = s_t$. Define
$$\epsilon_s = \Lambda_s^{-1}\sum_{\tau=1}^{s-1}\Phi(x_\tau,a_\tau)\big(v_s(x_{\tau+1}) - \mathbb{E}_{x'\sim p(\cdot|x_\tau,a_\tau)}v_s(x')\big).$$
Using the identity
$$w^* = \Lambda_s^{-1}\sum_{\tau=1}^{s-1}\Phi(x_\tau,a_\tau)\Phi(x_\tau,a_\tau)^\top w^* + \lambda\Lambda_s^{-1}w^* = \Lambda_s^{-1}\sum_{\tau=1}^{s-1}\Phi(x_\tau,a_\tau)\big(r(x_\tau,a_\tau) - J^* + \mathbb{E}_{x'\sim p(\cdot|x_\tau,a_\tau)}v^*(x')\big) + \lambda\Lambda_s^{-1}w^*$$
and the constraint Eq. (2) satisfied by $w_s$, we have
$$w_s - w^* = \Lambda_s^{-1}\sum_{\tau=1}^{s-1}\Phi(x_\tau,a_\tau)\big(r(x_\tau,a_\tau) - J_s + v_s(x_{\tau+1})\big) + b_s - \Lambda_s^{-1}\sum_{\tau=1}^{s-1}\Phi(x_\tau,a_\tau)\big(r(x_\tau,a_\tau) - J^* + \mathbb{E}_{x'\sim p(\cdot|x_\tau,a_\tau)}v^*(x')\big) - \lambda\Lambda_s^{-1}w^*$$
$$= \Lambda_s^{-1}\sum_{\tau=1}^{s-1}\Phi(x_\tau,a_\tau)\big(J^* - J_s + \mathbb{E}_{x'\sim p(\cdot|x_\tau,a_\tau)}[v_s(x') - v^*(x')]\big) + \epsilon_s + b_s - \lambda\Lambda_s^{-1}w^*$$
$$= \Lambda_s^{-1}\sum_{\tau=1}^{s-1}\Phi(x_\tau,a_\tau)\Phi(x_\tau,a_\tau)^\top\Big(J^*e_1 - J_se_1 + \int_{\mathcal{X}}(v_s(x') - v^*(x'))\,\mathrm{d}\mu(x')\Big) + \epsilon_s + b_s - \lambda\Lambda_s^{-1}w^*$$
$$= J^*e_1 - J_se_1 + \int_{\mathcal{X}}(v_s(x') - v^*(x'))\,\mathrm{d}\mu(x') + \epsilon_s + b_s - \lambda\Lambda_s^{-1}\Big(J^*e_1 - J_se_1 + \int_{\mathcal{X}}(v_s(x') - v^*(x'))\,\mathrm{d}\mu(x')\Big) - \lambda\Lambda_s^{-1}w^*.$$
Therefore,
$$q_s(x_t,a_t) - q^*(x_t,a_t) = \Phi(x_t,a_t)^\top(w_s - w^*) \le (J^* - J_s) + \mathbb{E}_{x'\sim p(\cdot|x_t,a_t)}[v_s(x') - v^*(x')] + \Phi(x_t,a_t)^\top\big(\epsilon_s + b_s + \lambda\Lambda_s^{-1}u_s\big), \tag{5}$$
where $u_s \triangleq -\big(J^*e_1 - J_se_1 + \int_{\mathcal{X}}(v_s(x') - v^*(x'))\,\mathrm{d}\mu(x')\big) - w^*$. Next, under the event $J^*\le J_s$, which holds with probability at least $1-\delta$ (Lemma 2), we continue with
$$q_s(x_t,a_t) - q^*(x_t,a_t) \le \mathbb{E}_{x'\sim p(\cdot|x_t,a_t)}[v_s(x') - v^*(x')] + \Phi(x_t,a_t)^\top\big(\epsilon_s + b_s + \lambda\Lambda_s^{-1}u_s\big)$$
$$\le \mathbb{E}_{x'\sim p(\cdot|x_t,a_t)}[v_s(x') - v^*(x')] + \|\Phi(x_t,a_t)\|_{\Lambda_s^{-1}}\,\|\epsilon_s + b_s + \lambda\Lambda_s^{-1}u_s\|_{\Lambda_s}$$
$$\le \mathbb{E}_{x'\sim p(\cdot|x_t,a_t)}[v_s(x') - v^*(x')] + 2\|\Phi(x_t,a_t)\|_{\Lambda_t^{-1}}\,\|\epsilon_s + b_s + \lambda\Lambda_s^{-1}u_s\|_{\Lambda_s}, \tag{6}$$
where the second inequality uses Hölder's inequality and the last one uses the fact $\Lambda_s\preceq\Lambda_t\preceq 2\Lambda_s$, which holds by the lazy update schedule of the algorithm.

By the algorithm, $\|b_s\|_{\Lambda_s}\le\beta$. To bound $\|\epsilon_s\|_{\Lambda_s}$, we use Lemma 11 and Lemma 12: define $\varepsilon_\tau = v_s(x_{\tau+1}) - \mathbb{E}_{x'\sim p(\cdot|x_\tau,a_\tau)}v_s(x')$ and $\phi_\tau = \frac{1}{\sqrt{2}}\Phi(x_\tau,a_\tau)$. With Lemma 11 and the fact $|v_s(x)|\le\sqrt{2}\,\|w_s\|\le\sqrt{2}(2+\mathrm{sp}(v^*))\sqrt{d}$, we have with probability at least $1-\delta$, for all $s$:
$$\|\epsilon_s\|_{\Lambda_s} = \sqrt{2}\,\Big\|\sum_{\tau=1}^{s-1}\phi_\tau\varepsilon_\tau\Big\|_{\Lambda_s^{-1}} \le 4(2+\mathrm{sp}(v^*))\sqrt{d}\,\sqrt{\frac{d}{2}\log\frac{s+\lambda}{\lambda} + \log\frac{N_\varepsilon}{\delta}} + 4\sqrt{\frac{s^2\varepsilon^2}{\lambda}},$$
where $\varepsilon = \frac{1}{T}$ and $N_\varepsilon$ is the $\varepsilon$-covering number of the function class containing $v_s$, which can be bounded with the help of Lemma 12 (with $\alpha = w_s$, $P = d$, $B = (2+\mathrm{sp}(v^*))\sqrt{d}$, and $L = \sqrt{2}$) as
$$\log N_\varepsilon \le d\log\Big(2(2+\mathrm{sp}(v^*))\sqrt{d}\cdot\sqrt{2}\,dT\Big) \le 4d\log T$$
(using the conditions stated at the beginning of the proof). Therefore, we have
$$\|\epsilon_s\|_{\Lambda_s} \le 4(2+\mathrm{sp}(v^*))\sqrt{d}\,\sqrt{5d\log T + \log(1/\delta)} + 4 = O(\beta) \tag{7}$$
for all $s$ with probability at least $1-\delta$. Next, we bound $\|\lambda\Lambda_s^{-1}u_s\|_{\Lambda_s}$ as
$$\|\lambda\Lambda_s^{-1}u_s\|_{\Lambda_s} = \lambda\|u_s\|_{\Lambda_s^{-1}} \le \sqrt{\lambda}\,\|u_s\| \le O\big(1 + (2+\mathrm{sp}(v^*))d\big) = O(\beta), \tag{8}$$
where in the second inequality we use the condition $\|\mu(\mathcal{X})\|\le\sqrt{d}$ from Assumption 2 to bound $\big\|\int_{\mathcal{X}}(v_s(x') - v^*(x'))\,\mathrm{d}\mu(x')\big\|$ by $\sup_{x\in\mathcal{X}}|v_s(x) - v^*(x)|\cdot\|\mu(\mathcal{X})\| = O\big((2+\mathrm{sp}(v^*))d\big)$. Put together, the above shows $\|\epsilon_s + b_s + \lambda\Lambda_s^{-1}u_s\|_{\Lambda_s} = O(\beta)$.

Continuing from Eq. (6) and summing over $t$, we have with probability at least $1-\delta$,
$$\sum_{t=1}^T\big(q_{s_t}(x_t,a_t) - q^*(x_t,a_t)\big) \le \sum_{t=1}^T\mathbb{E}_{x'\sim p(\cdot|x_t,a_t)}[v_{s_t}(x') - v^*(x')] + O\bigg(\beta\sum_{t=1}^T\|\Phi(x_t,a_t)\|_{\Lambda_t^{-1}}\bigg)$$
$$= \sum_{t=1}^T\mathbb{E}_{x'\sim p(\cdot|x_t,a_t)}[v_{s_t}(x') - v^*(x')] + O\bigg(\beta\sqrt{T}\sqrt{\sum_{t=1}^T\|\Phi(x_t,a_t)\|^2_{\Lambda_t^{-1}}}\bigg) \qquad\text{(Cauchy-Schwarz inequality)}$$
$$= \sum_{t=1}^T\mathbb{E}_{x'\sim p(\cdot|x_t,a_t)}[v_{s_t}(x') - v^*(x')] + O\Big(\beta\sqrt{dT\log T}\Big),$$
where the last equality is by [21, Lemma D.2] together with the facts $\det(\Lambda_1) = \lambda^d$ and $\det(\Lambda_{T+1})\le\big(\frac{1}{d}\mathrm{trace}(\Lambda_{T+1})\big)^d\le(\lambda+2T)^d$. Rearranging the last inequality, we get
$$\sum_{t=1}^T\big(\mathbb{E}_{x'\sim p(\cdot|x_t,a_t)}[v^*(x')] - q^*(x_t,a_t)\big) \le \sum_{t=1}^T\big(\mathbb{E}_{x'\sim p(\cdot|x_t,a_t)}[v_{s_t}(x')] - q_{s_t}(x_t,a_t)\big) + O\big(\beta\sqrt{dT\log T}\big) = \sum_{t=1}^T\big(\mathbb{E}_{x'\sim p(\cdot|x_t,a_t)}[v_{s_t}(x')] - v_{s_t}(x_t)\big) + O\big(\beta\sqrt{dT\log T}\big),$$
where the last step is by the choice of $a_t$. Next, notice that every time the algorithm updates (i.e., $s_t\ne s_{t-1}$), it holds that $\det(\Lambda_t) = \det(\Lambda_{s_t})\ge 2\det(\Lambda_{s_{t-1}})$. Since $\det(\Lambda_{T+1})/\det(\Lambda_1)\le\big(\frac{\lambda+2T}{\lambda}\big)^d$, this cannot happen more than $\log_2\big(\frac{\lambda+2T}{\lambda}\big)^d = O(d\log T)$ times. Using this fact and the range of $v_t$, we continue with
$$\sum_{t=1}^T\big(\mathbb{E}_{x'\sim p(\cdot|x_t,a_t)}[v^*(x')] - q^*(x_t,a_t)\big) \le \sum_{t=1}^T\big(\mathbb{E}_{x'\sim p(\cdot|x_t,a_t)}[v_{s_{t+1}}(x')] - v_{s_t}(x_t)\big) + O\big(\beta\sqrt{dT\log T} + \beta d\log T\big)$$
$$= \sum_{t=1}^T\big(\mathbb{E}_{x'\sim p(\cdot|x_t,a_t)}[v_{s_{t+1}}(x')] - v_{s_{t+1}}(x_{t+1})\big) + O\big(\beta\sqrt{dT\log T} + \beta d\log T\big) = O\big(\beta\sqrt{dT\log T} + \beta d\log T\big), \tag{9}$$
where the last step holds with probability at least $1-\delta$ by Azuma's inequality. Finally, note that the regret can be written as
$$\mathrm{Reg}_T = \sum_{t=1}^T(J^* - r(x_t,a_t)) = \sum_{t=1}^T\big(\mathbb{E}_{x'\sim p(\cdot|x_t,a_t)}[v^*(x')] - q^*(x_t,a_t)\big) = O\big(\beta\sqrt{dT\log T} + \beta d\log T\big)$$
by the Bellman optimality equation, which finishes the proof (combining all the high-probability statements with a union bound, the last bound holds with probability at least $1-O(\delta)$).

C.2 Omitted Analysis in Section 3.2

Proof of Lemma 4. By Assumption 2 and the Bellman equation for the finite-horizon problem (Eq. (4)), we have that for any finite-horizon policy $\pi$ and any $h\le H$,
$$Q^\pi_h(x,a) = r(x,a) + \mathbb{E}_{x'\sim p(\cdot|x,a)}\big[V^\pi_{h+1}(x')\big] = \Phi(x,a)^\top\theta + \Phi(x,a)^\top\int_{\mathcal{X}}V^\pi_{h+1}(x')\,\mathrm{d}\mu(x') = \Phi(x,a)^\top\Big(\theta + \int_{\mathcal{X}}V^\pi_{h+1}(x')\,\mathrm{d}\mu(x')\Big).$$
Define $w^\pi_h = \theta + \int_{\mathcal{X}}V^\pi_{h+1}(x')\,\mathrm{d}\mu(x')$. Then we have $Q^\pi_h(x,a) = \Phi(x,a)^\top w^\pi_h$ with $\|w^\pi_h\|\le\|\theta\| + (H-h)\|\mu(\mathcal{X})\|\le\sqrt{d} + \sqrt{d}(H-h)\le\sqrt{d}\,H$.

We now rewrite $w_{kh} - w^\pi_h$ as follows.
For simplicity, we write $x'\sim(k',h')$ for $x'\sim p(\cdot|x_{k'h'},a_{k'h'})$, $\Phi_{k'h'}$ for $\Phi(x_{k'h'},a_{k'h'})$, and $r_{k'h'}$ for $r(x_{k'h'},a_{k'h'})$. Then
$$w_{kh} - w^\pi_h = \Lambda_k^{-1}\sum_{k'=1}^{k-1}\sum_{h'=1}^H\Phi_{k'h'}\big[r_{k'h'} + V_{k,h+1}(x_{k'h'+1})\big] - \Lambda_k^{-1}\Big(\lambda I + \sum_{k'=1}^{k-1}\sum_{h'=1}^H\Phi_{k'h'}\Phi_{k'h'}^\top\Big)w^\pi_h$$
$$= \Lambda_k^{-1}\sum_{k'=1}^{k-1}\sum_{h'=1}^H\Phi_{k'h'}\big[r_{k'h'} + V_{k,h+1}(x_{k'h'+1})\big] - \Lambda_k^{-1}\sum_{k'=1}^{k-1}\sum_{h'=1}^H\Phi_{k'h'}\big[r_{k'h'} + \mathbb{E}_{x'\sim(k',h')}V^\pi_{h+1}(x')\big] - \lambda\Lambda_k^{-1}w^\pi_h$$
(using $Q^\pi_h(x,a) = \Phi(x,a)^\top w^\pi_h$ and the Bellman equation)
$$= \Lambda_k^{-1}\sum_{k'=1}^{k-1}\sum_{h'=1}^H\Phi_{k'h'}\big[V_{k,h+1}(x_{k'h'+1}) - \mathbb{E}_{x'\sim(k',h')}V^\pi_{h+1}(x')\big] - \lambda\Lambda_k^{-1}w^\pi_h$$
$$= \Lambda_k^{-1}\sum_{k'=1}^{k-1}\sum_{h'=1}^H\Phi_{k'h'}\big[\mathbb{E}_{x'\sim(k',h')}V_{k,h+1}(x') - \mathbb{E}_{x'\sim(k',h')}V^\pi_{h+1}(x')\big] + \epsilon_{kh} - \lambda\Lambda_k^{-1}w^\pi_h$$
(defining $\epsilon_{kh} = \Lambda_k^{-1}\sum_{k'=1}^{k-1}\sum_{h'=1}^H\Phi_{k'h'}\big[V_{k,h+1}(x_{k'h'+1}) - \mathbb{E}_{x'\sim(k',h')}V_{k,h+1}(x')\big]$)
$$= \Lambda_k^{-1}\sum_{k'=1}^{k-1}\sum_{h'=1}^H\Phi_{k'h'}\Phi_{k'h'}^\top\Big[\int_{\mathcal{X}}\big(V_{k,h+1}(x') - V^\pi_{h+1}(x')\big)\mathrm{d}\mu(x')\Big] + \epsilon_{kh} - \lambda\Lambda_k^{-1}w^\pi_h$$
$$= \big(I - \lambda\Lambda_k^{-1}\big)\Big[\int_{\mathcal{X}}\big(V_{k,h+1}(x') - V^\pi_{h+1}(x')\big)\mathrm{d}\mu(x')\Big] + \epsilon_{kh} - \lambda\Lambda_k^{-1}w^\pi_h$$
$$= \int_{\mathcal{X}}\big(V_{k,h+1}(x') - V^\pi_{h+1}(x')\big)\mathrm{d}\mu(x') + \epsilon_{kh} - \lambda\Lambda_k^{-1}\Big[\int_{\mathcal{X}}\big(V_{k,h+1}(x') - V^\pi_{h+1}(x')\big)\mathrm{d}\mu(x')\Big] - \lambda\Lambda_k^{-1}w^\pi_h.$$
Therefore,
$$\widehat{Q}_{kh}(x,a) - Q^\pi_h(x,a) = \Phi(x,a)^\top(w_{kh} - w^\pi_h) + \beta\sqrt{\Phi(x,a)^\top\Lambda_k^{-1}\Phi(x,a)}$$
$$= \mathbb{E}_{x'\sim p(\cdot|x,a)}\big[V_{k,h+1}(x') - V^\pi_{h+1}(x')\big] + \underbrace{\Phi(x,a)^\top\epsilon_{kh}}_{\mathrm{term}_1} + \beta\|\Phi(x,a)\|_{\Lambda_k^{-1}} \underbrace{{}-\lambda\Phi(x,a)^\top\Lambda_k^{-1}\Big[\int_{\mathcal{X}}\big(V_{k,h+1}(x') - V^\pi_{h+1}(x')\big)\mathrm{d}\mu(x')\Big]}_{\mathrm{term}_2} \underbrace{{}-\lambda\Phi(x,a)^\top\Lambda_k^{-1}w^\pi_h}_{\mathrm{term}_3}. \tag{10}$$
Below we bound the magnitudes of $\mathrm{term}_1$, $\mathrm{term}_2$, and $\mathrm{term}_3$ respectively.
Below we bound the magnitudes of $\text{term}_1$, $\text{term}_2$, and $\text{term}_3$ respectively.

For $\text{term}_1$, we use Lemma 11 and Lemma 12: define $\varepsilon^{k'}_{h'} = V^k_{h+1}(x^{k'}_{h'+1}) - \mathbb{E}_{x'\sim(k',h')}\left[V^k_{h+1}(x')\right]$ and $\phi^{k'}_{h'} = \frac{1}{\sqrt{2}}\Phi^{k'}_{h'}$. By Lemma 11, we have
\begin{align*}
\|\epsilon^k_h\|_{\Lambda_k} = \sqrt{2}\left\|\Lambda_k^{-1}\sum_{k'=1}^{k-1}\sum_{h'=1}^{H}\phi^{k'}_{h'}\varepsilon^{k'}_{h'}\right\|_{\Lambda_k} = \sqrt{2}\left\|\sum_{k'=1}^{k-1}\sum_{h'=1}^{H}\phi^{k'}_{h'}\varepsilon^{k'}_{h'}\right\|_{\Lambda_k^{-1}} \le \sqrt{2}H\sqrt{\frac{d}{2}\log\frac{T+\lambda}{\lambda} + \log\frac{N_\varepsilon}{\delta}} + \sqrt{2}\times\sqrt{\frac{8T^2\varepsilon^2}{\lambda}} \tag{11}
\end{align*}
for all $k$ and $h$ with probability at least $1-\delta$, where $N_\varepsilon$ is the size of an $\varepsilon$-cover of the function class that $V^k_{h+1}(\cdot)$ lies in. Notice that for all $k$ and $h$, $V^k_{h+1}(\cdot)$ can be expressed in the following form:
\[
V^k_{h+1}(x) = \min\left\{\max_a w^\top\Phi(x,a) + \beta\sqrt{\Phi(x,a)^\top\Gamma\Phi(x,a)},\; H\right\}
\]
for some positive definite $\Gamma\in\mathbb{R}^{d\times d}$ with $\frac{1}{\lambda} \ge \lambda_{\max}(\Gamma) \ge \lambda_{\min}(\Gamma) \ge \frac{1}{\lambda+2T} = \frac{1}{1+2T}$ and some $w\in\mathbb{R}^d$ with $\|w\| \le \lambda_{\max}(\Gamma)\times T\times\sup_{x,a}\left(\|\Phi(x,a)\|H\right) \le \sqrt{2}TH$. Therefore, we can write the class of functions that $V^k_{h+1}(\cdot)$ lies in as follows:
\[
\mathcal{V} = \left\{V(x) = \min\left\{\max_a w^\top\Phi(x,a) + \beta\sqrt{\Phi(x,a)^\top\Gamma\Phi(x,a)},\; H\right\} :\; w\in\mathbb{R}^d,\ \|w\|\le\sqrt{2}TH;\ \Gamma\in\mathbb{R}^{d\times d},\ \frac{1}{1+2T}\le\lambda_{\min}(\Gamma)\le\lambda_{\max}(\Gamma)\le 1\right\}.
\]
Now we apply Lemma 12 to $\mathcal{V}$, with the following choices of parameters: $\alpha = (w,\Gamma)$, $P = d+d^2$, $\varepsilon = \frac{1}{T}$, $B = \sqrt{2}TH$, and $L = \beta\sqrt{2(1+2T)}$, which is given by the following calculation: for any $\Delta w = \epsilon e_i$,
\[
\left|(w+\Delta w)^\top\Phi(x,a) - w^\top\Phi(x,a)\right| = |\epsilon|\left|e_i^\top\Phi(x,a)\right| \le |\epsilon|\|\Phi(x,a)\| \le \sqrt{2}|\epsilon|,
\]
and for any $\Delta\Gamma = \epsilon e_ie_j^\top$,
\begin{align*}
\left|\beta\sqrt{\Phi(x,a)^\top(\Gamma+\Delta\Gamma)\Phi(x,a)} - \beta\sqrt{\Phi(x,a)^\top\Gamma\Phi(x,a)}\right|
&\le |\epsilon|\beta\frac{\left|\Phi(x,a)^\top e_ie_j^\top\Phi(x,a)\right|}{\sqrt{\Phi(x,a)^\top\Gamma\Phi(x,a)}} \tag*{($\sqrt{u+v}-\sqrt{u}\le\frac{|v|}{\sqrt{u}}$)}\\
&\le |\epsilon|\beta\frac{\Phi(x,a)^\top\left(e_ie_i^\top + e_je_j^\top\right)\Phi(x,a)}{2\sqrt{\Phi(x,a)^\top\Gamma\Phi(x,a)}} \\
&\le |\epsilon|\beta\frac{\Phi(x,a)^\top\Phi(x,a)}{\sqrt{\Phi(x,a)^\top\Gamma\Phi(x,a)}} \le \sqrt{2}|\epsilon|\beta\sqrt{\frac{1}{\lambda_{\min}(\Gamma)}} \le |\epsilon|\beta\sqrt{2(1+2T)}.
\end{align*}
Lemma 12 then implies
\[
\log N_\varepsilon \le (d+d^2)\log\left(\frac{2\times\sqrt{2}TH\times\beta\sqrt{2(1+2T)}\times(d+d^2)}{T^{-1}}\right) \le 20d^2\log T,
\]
where in the last step we use the definition of $\beta$ and also assume without loss of generality that $\text{sp}(v^*)\le\sqrt{T}$, $d\le\sqrt{T}$, and $T\ge 4$ (since otherwise the regret bound is vacuous). Then by Eq. (11) we have with probability $1-\delta$, for all $k$ and $h$,
\[
\|\epsilon^k_h\|_{\Lambda_k} \le \sqrt{2}H\sqrt{\frac{d}{2}\log(T+1) + \log\frac{1}{\delta} + 20d^2\log T} + 4 = O\left(dH\sqrt{\log(T/\delta)}\right) \le \frac{\beta}{2},
\]
where the last step is by the definition of $\beta$, and therefore $|\text{term}_1| \le \|\Phi(x,a)\|_{\Lambda_k^{-1}}\|\epsilon^k_h\|_{\Lambda_k} \le \frac{\beta}{2}\|\Phi(x,a)\|_{\Lambda_k^{-1}}$.
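Before continuing to $\text{term}_2$ and $\text{term}_3$, here is a literal evaluation of a member of the class $\mathcal{V}$ above, which makes visible that $(w,\Gamma)$ are the only parameters the covering argument needs (the finite action set and array shapes are our own assumptions):

```python
import numpy as np

def value(x_feats, w, Gamma, beta, H):
    """Evaluate V(x) = min{ max_a w^T Phi(x,a) + beta * sqrt(Phi^T Gamma Phi), H }.

    x_feats: (A, d) array whose rows are Phi(x, a) for each action a;
    Gamma must be positive definite so the square root is well defined.
    """
    bonus = beta * np.sqrt(np.einsum("ad,de,ae->a", x_feats, Gamma, x_feats))
    return min((x_feats @ w + bonus).max(), H)
```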
Furthermore,
\begin{align*}
|\text{term}_2| &\le \|\Phi(x,a)\|_{\Lambda_k^{-1}}\left\|\lambda\int_{\mathcal{X}}\left(V^k_{h+1}(x') - V^\pi_{h+1}(x')\right)\mathrm{d}\mu(x')\right\|_{\Lambda_k^{-1}} \tag*{(Cauchy-Schwarz inequality)}\\
&\le \|\Phi(x,a)\|_{\Lambda_k^{-1}}\left\|\sqrt{\lambda}\int_{\mathcal{X}}\left(V^k_{h+1}(x') - V^\pi_{h+1}(x')\right)\mathrm{d}\mu(x')\right\| \tag*{($\lambda_{\min}(\Lambda_k)\ge\lambda$)}\\
&\le \sqrt{\lambda}\,\|\Phi(x,a)\|_{\Lambda_k^{-1}}\times H\sqrt{d} \tag*{($\|\mu(\mathcal{X})\|\le\sqrt{d}$ by Assumption 2)}\\
&\le \frac{\beta}{4}\|\Phi(x,a)\|_{\Lambda_k^{-1}}, \tag*{(using $\lambda=1$ and the definition of $\beta$)}
\end{align*}
and
\begin{align*}
|\text{term}_3| &\le \|\Phi(x,a)\|_{\Lambda_k^{-1}}\|\lambda w^\pi_h\|_{\Lambda_k^{-1}} \tag*{(Cauchy-Schwarz inequality)}\\
&\le \|\Phi(x,a)\|_{\Lambda_k^{-1}}\left\|\sqrt{\lambda}w^\pi_h\right\| \tag*{($\lambda_{\min}(\Lambda_k)\ge\lambda$)}\\
&\le \frac{\beta}{4}\|\Phi(x,a)\|_{\Lambda_k^{-1}}. \tag*{($\|w^\pi_h\|\le\sqrt{d}H$ and $\lambda=1$)}
\end{align*}
Therefore, $|\text{term}_1| + |\text{term}_2| + |\text{term}_3| \le \beta\|\Phi(x,a)\|_{\Lambda_k^{-1}}$ for all $k$ and $h$ with probability at least $1-\delta$. Then by Eq. (10), we have
\[
\widehat{Q}^k_h(x,a) - Q^\pi_h(x,a) \le \mathbb{E}_{x'\sim p(\cdot|x,a)}\left[V^k_{h+1}(x') - V^\pi_{h+1}(x')\right] + 2\beta\|\Phi(x,a)\|_{\Lambda_k^{-1}},
\]
proving one inequality in the lemma statement (since $Q^k_h(x,a) \le \widehat{Q}^k_h(x,a)$). To prove the other inequality, note that Eq. (10) together with $|\text{term}_1| + |\text{term}_2| + |\text{term}_3| \le \beta\|\Phi(x,a)\|_{\Lambda_k^{-1}}$ also implies
\begin{align*}
\widehat{Q}^k_h(x,a) - Q^\pi_h(x,a) \ge \mathbb{E}_{x'\sim p(\cdot|x,a)}\left[V^k_{h+1}(x') - V^\pi_{h+1}(x')\right]. \tag{12}
\end{align*}
Now we fix $k$ and use induction on $h$ to prove $Q^k_h(x,a) \ge Q^\pi_h(x,a)$. The base case $h=H$ is clear due to Eq. (12) and the facts $V^k_{H+1}(x) = V^\pi_{H+1}(x) = 0$ and $Q^k_H(x,a) - Q^\pi_H(x,a) = \min\{\widehat{Q}^k_H(x,a), H\} - Q^\pi_H(x,a) \ge 0$. Next, assume $Q^k_{h+1}(x,a) \ge Q^\pi_{h+1}(x,a)$ for all $x$ and $a$. Then $V^k_{h+1}(x) = \max_a Q^k_{h+1}(x,a) \ge \max_a Q^\pi_{h+1}(x,a) \ge V^\pi_{h+1}(x)$. Using Eq. (12) we have $\widehat{Q}^k_h(x,a) - Q^\pi_h(x,a) \ge 0$, which again implies $Q^k_h(x,a) = \min\{\widehat{Q}^k_h(x,a), H\} \ge Q^\pi_h(x,a)$. This finishes the induction and proves the other inequality in the lemma statement.

Proof of Theorem 5. Let $\pi^k = (\pi^k_1,\ldots,\pi^k_H)$ be the finite-horizon policy that our algorithm executes for episode $k$, that is, $\pi^k_h(a|x) = \mathbb{1}\left[a = \operatorname{argmax}_{a'} Q^k_h(x,a')\right]$ (breaking ties arbitrarily). Also let $\bar{\pi}^*$ be the optimal finite-horizon policy with value functions $Q^*_h(x,a) = \max_\pi Q^\pi_h(x,a)$ and $V^*_h(x) = \max_a Q^*_h(x,a)$. We first decompose the regret as
\begin{align*}
\text{Reg}_T = \sum_{t=1}^T\left(J^* - r(x_t,a_t)\right) = \underbrace{\sum_{k=1}^{T/H}\left(HJ^* - V^*_1(x^k_1)\right)}_{\text{term}_1} + \underbrace{\sum_{k=1}^{T/H}\left(V^*_1(x^k_1) - V^{\pi^k}_1(x^k_1)\right)}_{\text{term}_2} + \underbrace{\sum_{k=1}^{T/H}\left(V^{\pi^k}_1(x^k_1) - \sum_{h=1}^H r(x^k_h,a^k_h)\right)}_{\text{term}_3}. \tag{13}
\end{align*}
In Lemma 13 (stated after this proof), we connect the optimal reward of the infinite-horizon setting and the finite-horizon setting and show that $\text{term}_1 \le \frac{T\,\text{sp}(v^*)}{H}$.
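Schematically, Eq. (13) reflects the reduction that Algorithm 2 performs: the $T$ interaction steps are chopped into $T/H$ back-to-back length-$H$ episodes, with a fresh finite-horizon policy per episode. A minimal driver loop for this reduction is sketched below; the `env.step`/`env.reset`/`make_policy` interfaces are our own placeholders, purely illustrative:

```python
def run_episodic_reduction(env, make_policy, T, H):
    """Split a T-step average-reward interaction into T/H episodes of length H.

    The "initial state" of episode k is simply wherever the process happens
    to be; nothing is reset in the environment itself.
    env.step(a) -> (x, r) and make_policy(k) -> [pi_1, ..., pi_H] are
    placeholder interfaces (assumptions for this sketch).
    """
    total_reward, x = 0.0, env.reset()
    for k in range(T // H):
        pi = make_policy(k)            # finite-horizon policy for episode k
        for h in range(H):
            a = pi[h](x)
            x, r = env.step(a)
            total_reward += r
    return total_reward
```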
Notice that conditioned on the history before episode $k$, $V^{\pi^k}_1(x^k_1)$ is the expectation of $\sum_{h=1}^H r(x^k_h,a^k_h)$. Therefore, $\text{term}_3$ is the sum of a martingale difference sequence, which can be upper bounded by $O\left(H\sqrt{\frac{T}{H}\log(1/\delta)}\right) = O\left(\sqrt{HT\log(1/\delta)}\right)$ with probability at least $1-\delta$ (via Azuma's inequality).

Finally, we deal with $\text{term}_2$. Below we assume that the high-probability event in Lemma 4 holds. Then for all $k,h$:
\begin{align*}
Q^k_h(x^k_h,a^k_h) - Q^{\pi^k}_h(x^k_h,a^k_h) &\le \mathbb{E}_{x'\sim(k,h)}\left[V^k_{h+1}(x') - V^{\pi^k}_{h+1}(x')\right] + 2\beta\|\Phi(x^k_h,a^k_h)\|_{\Lambda_k^{-1}} \\
&= V^k_{h+1}(x^k_{h+1}) - V^{\pi^k}_{h+1}(x^k_{h+1}) + 2\beta\|\Phi(x^k_h,a^k_h)\|_{\Lambda_k^{-1}} + e^k_h \\
&= Q^k_{h+1}(x^k_{h+1},a^k_{h+1}) - Q^{\pi^k}_{h+1}(x^k_{h+1},a^k_{h+1}) + 2\beta\|\Phi(x^k_h,a^k_h)\|_{\Lambda_k^{-1}} + e^k_h,
\end{align*}
where in the first equality we define $e^k_h = \mathbb{E}_{x'\sim(k,h)}\left[V^k_{h+1}(x') - V^{\pi^k}_{h+1}(x')\right] - \left(V^k_{h+1}(x^k_{h+1}) - V^{\pi^k}_{h+1}(x^k_{h+1})\right)$, which has zero mean, and in the second equality we use the facts $V^k_{h+1}(x^k_{h+1}) = Q^k_{h+1}(x^k_{h+1},a^k_{h+1})$ and $V^{\pi^k}_{h+1}(x^k_{h+1}) = Q^{\pi^k}_{h+1}(x^k_{h+1},a^k_{h+1})$. Repeating the same argument and using $V^k_{H+1}(\cdot) = V^{\pi^k}_{H+1}(\cdot) = 0$, we arrive at
\[
Q^k_1(x^k_1,a^k_1) - Q^{\pi^k}_1(x^k_1,a^k_1) \le \sum_{h=1}^H\left(2\beta\|\Phi(x^k_h,a^k_h)\|_{\Lambda_k^{-1}} + e^k_h\right).
\]
Further using that $V^*_1(x^k_1) = \max_a Q^*_1(x^k_1,a) \le \max_a Q^k_1(x^k_1,a) = Q^k_1(x^k_1,a^k_1)$ (the inequality is by Lemma 4) and that $V^{\pi^k}_1(x^k_1) = Q^{\pi^k}_1(x^k_1,a^k_1)$, we have shown
\[
\text{term}_2 \le \sum_{k=1}^{T/H}\sum_{h=1}^H\left(2\beta\|\Phi(x^k_h,a^k_h)\|_{\Lambda_k^{-1}} + e^k_h\right).
\]
The term $\sum_{k=1}^{T/H}\sum_{h=1}^H e^k_h$ is again the sum of a martingale difference sequence with each term's magnitude bounded by $2H$, and therefore is bounded by $O\left(H\sqrt{T\log(1/\delta)}\right)$ with probability at least $1-\delta$ using Azuma's inequality. For the term $\sum_{k=1}^{T/H}\sum_{h=1}^H 2\beta\|\Phi(x^k_h,a^k_h)\|_{\Lambda_k^{-1}}$, we first decompose it into two parts:
\[
\sum_{k:\,\det(\Lambda_{k+1})\le 2\det(\Lambda_k)}\;\sum_{h=1}^H 2\beta\|\Phi(x^k_h,a^k_h)\|_{\Lambda_k^{-1}} \;+\; \sum_{k:\,\det(\Lambda_{k+1})> 2\det(\Lambda_k)}\;\sum_{h=1}^H 2\beta\|\Phi(x^k_h,a^k_h)\|_{\Lambda_k^{-1}}.
\]
By [1, Lemma 12], $\det(\Lambda_{k+1})\le 2\det(\Lambda_k)$ implies $\Lambda_{k+1}\preceq 2\Lambda_k$ and thus $\Lambda_k^{-1}\preceq 2\Lambda_{k+1}^{-1}$. Therefore, the first part is upper bounded by $\sqrt{2}\sum_{k,h}2\beta\|\Phi(x^k_h,a^k_h)\|_{\Lambda_{k+1}^{-1}} \le 2\sqrt{2}\beta\sqrt{T}\sqrt{\sum_{k,h}\|\Phi(x^k_h,a^k_h)\|^2_{\Lambda_{k+1}^{-1}}}$, by the Cauchy-Schwarz inequality. Further invoking [21, Lemma D.2], we upper bound the last expression by
\[
O\left(\beta\sqrt{T}\sqrt{\log\frac{\det(\Lambda_{T/H+1})}{\det(\Lambda_1)}}\right) = O\left(\beta\sqrt{T}\sqrt{\log\left(\frac{\lambda+2T}{\lambda}\right)^d}\right) = O\left(\beta\sqrt{dT\log T}\right).
\]
For the second part, notice that since the event $\det(\Lambda_{k+1}) > 2\det(\Lambda_k)$ cannot happen more than $O\left(\log\frac{\det(\Lambda_{T/H+1})}{\det(\Lambda_1)}\right) = O(d\log T)$ times, this part is upper bounded by $O(\beta dH\log T)$.

To conclude, we have shown that $\text{term}_2 = O\left(\beta\sqrt{dT\log T} + \beta dH\log T + H\sqrt{T\log(1/\delta)}\right)$ holds with probability at least $1-2\delta$. Combining all the bounds with Eq. (13), we have
\begin{align*}
\text{Reg}_T = \sum_{t=1}^T\left(J^* - r(x_t,a_t)\right) &= O\left(\frac{T\,\text{sp}(v^*)}{H} + \beta\sqrt{dT\log T} + \beta dH\log T + H\sqrt{T\log(1/\delta)}\right) \\
&= \widetilde{O}\left(\frac{T\,\text{sp}(v^*)}{H} + d^{3/2}H\sqrt{T} + d^2H^2\right) \tag*{(plug in the value of $\beta$)}
\end{align*}
with probability at least $1-3\delta$. Picking the optimal $H$ (the one specified in Algorithm 2), we get that $\text{Reg}_T = \widetilde{O}\left(\sqrt{\text{sp}(v^*)}\,(dT)^{3/4} + \text{sp}(v^*)\sqrt{dT}\right)$.
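The choice of $H$ can also be checked numerically; the sketch below (ours, with toy constants) balances the two dominant terms $\frac{T\,\text{sp}(v^*)}{H}$ and $d^{3/2}H\sqrt{T}$ and confirms they coincide at $H \propto \sqrt{\text{sp}(v^*)}\,T^{1/4}d^{-3/4}$:

```python
import numpy as np

T, d, sp = 1e8, 10, 5.0
H_star = np.sqrt(sp) * T**0.25 / d**0.75   # equalizes T*sp/H and d^1.5 * H * sqrt(T)
term_a = T * sp / H_star
term_b = d**1.5 * H_star * np.sqrt(T)
print(H_star, term_a, term_b)              # the two terms coincide
assert np.isclose(term_a, term_b)
```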
Lemma 13. For any $x$, $|HJ^* - V^*_1(x)| \le \text{sp}(v^*)$.

Proof. Let $\pi^*$ be the optimal policy of the infinite-horizon setting, and $(\pi_1,\ldots,\pi_H)$ be the optimal policy of the finite-horizon setting. Without loss of generality assume that both of them are deterministic policies. By the Bellman equation and the optimality of $\pi^*$, we have
\begin{align*}
v^*(x) &= \max_a\left(r(x,a) - J^* + \mathbb{E}_{x'\sim p(\cdot|x,a)}v^*(x')\right) \tag{14}\\
&= r(x,\pi^*(x)) - J^* + \mathbb{E}_{x'\sim p(\cdot|x,\pi^*(x))}v^*(x'). \tag{15}
\end{align*}
For any $x$, consider a state sequence $x_1 = x, x_2, \ldots, x_H$ generated by $\pi^*$. By the suboptimality of $\pi^*$ in the finite-horizon setting,
\begin{align*}
V^*_1(x) &\ge \mathbb{E}\left[\sum_{h=1}^H r(x_h,\pi^*(x_h)) \,\Big|\, x_1 = x,\, \pi^*\right] \\
&= \mathbb{E}\left[\sum_{h=1}^H\left(J^* + v^*(x_h) - \mathbb{E}_{x'\sim p(\cdot|x_h,\pi^*(x_h))}[v^*(x')]\right) \,\Big|\, x_1 = x,\, \pi^*\right] \tag*{(by Eq. (15))}\\
&= \mathbb{E}\left[\sum_{h=1}^H\left(J^* + v^*(x_h) - v^*(x_{h+1})\right) \,\Big|\, x_1 = x,\, \pi^*\right] \\
&= HJ^* + \mathbb{E}\left[v^*(x_1) - v^*(x_{H+1}) \,\Big|\, x_1 = x,\, \pi^*\right] \ge HJ^* - \text{sp}(v^*).
\end{align*}
On the other hand, for a state sequence $x_1 = x, x_2, \ldots, x_H$ generated by $(\pi_1,\ldots,\pi_H)$:
\begin{align*}
V^*_1(x) &= \mathbb{E}\left[\sum_{h=1}^H r(x_h,\pi_h(x_h)) \,\Big|\, x_1 = x,\, \{\pi_i\}_{i=1}^H\right] \\
&\le \mathbb{E}\left[\sum_{h=1}^H\left(J^* + v^*(x_h) - \mathbb{E}_{x'\sim p(\cdot|x_h,\pi_h(x_h))}[v^*(x')]\right) \,\Big|\, x_1 = x,\, \{\pi_i\}_{i=1}^H\right] \tag*{(by Eq. (14))}\\
&= \mathbb{E}\left[\sum_{h=1}^H\left(J^* + v^*(x_h) - v^*(x_{h+1})\right) \,\Big|\, x_1 = x,\, \{\pi_i\}_{i=1}^H\right] \\
&= HJ^* + \mathbb{E}\left[v^*(x_1) - v^*(x_{H+1}) \,\Big|\, x_1 = x,\, \{\pi_i\}_{i=1}^H\right] \le HJ^* + \text{sp}(v^*).
\end{align*}
Combining the two directions finishes the proof.

D Omitted Analysis in Section 4

Figure 1: An illustration of the data collection process of MDP-EXP2 (the trajectory start times $\tau_{k,1},\ldots,\tau_{k,4}$ and the length-$N$ reward sums $R_{k,1},\ldots,R_{k,4}$ within one epoch). In the figure, we show how the algorithm collects trajectories of length $N$ (the red intervals) in an epoch with length $B = 8N$.

Figure 1 is an illustration of the data collection scheme of MDP-EXP2; a schematic of the trajectory schedule is sketched below.
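Under our reading of Figure 1 (one length-$N$ trajectory recorded every $2N$ steps, hence $B/(2N)$ trajectories per epoch; the exact offsets are a schematic assumption, not taken from Algorithm 3 verbatim):

```python
def collection_schedule(B, N):
    """Start times tau_{k,m}, relative to the epoch start, of the length-N
    trajectories that MDP-EXP2 records: one every 2N steps, giving B/(2N)
    trajectories per epoch (B = 8N yields the four shown in Figure 1).
    """
    assert B % (2 * N) == 0
    return [2 * N * m for m in range(B // (2 * N))]

print(collection_schedule(B=8 * 3, N=3))   # [0, 6, 12, 18]
```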
Below, we first provide the proof for Lemma 6.

Proof of Lemma 6. Denote $\mathbb{E}[\,\cdot\,|\,x_1=x,\ a_t\sim\pi(\cdot|x_t),\ x_{t+1}\sim p(\cdot|x_t,a_t)\ \text{for all}\ t\ge 1]$ by $\mathbb{E}[\,\cdot\,|\,x_1=x,\pi]$. For any two initial states $u,u'\in\mathcal{X}$, let $\delta_u$ and $\delta_{u'}$ be the Dirac measures with respect to $u$ and $u'$. Writing $P^\pi$ as $P$ for simplicity, we have for any time $t$,
\begin{align*}
\left|\mathbb{E}[r(x_t,a_t)|x_1=u,\pi] - \mathbb{E}[r(x_t,a_t)|x_1=u',\pi]\right| &= \left|\int_{\mathcal{X}}\sum_{a\in\mathcal{A}}\pi(a|x)r(x,a)\,\mathrm{d}P^{t-1}\delta_u(x) - \int_{\mathcal{X}}\sum_{a\in\mathcal{A}}\pi(a|x)r(x,a)\,\mathrm{d}P^{t-1}\delta_{u'}(x)\right| \\
&\le 2\left\|P^{t-1}\delta_u - P^{t-1}\delta_{u'}\right\|_{\text{TV}} \\
&\le 2e^{-\frac{t-1}{t_{\text{mix}}}}\left\|\delta_u - \delta_{u'}\right\|_{\text{TV}} \tag*{(Assumption 3)}\\
&\le 2e^{-\frac{t-1}{t_{\text{mix}}}}. \tag{16}
\end{align*}
Therefore, by the definition of $J^\pi(u)$ in Section 2, we have
\[
|J^\pi(u) - J^\pi(u')| \le \lim_{T\to\infty}\frac{1}{T}\sum_{t=1}^T 2e^{-\frac{t-1}{t_{\text{mix}}}} = 0,
\]
proving that $J^\pi(u)$ is a fixed value independent of the initial state $u$ and can thus be denoted as $J^\pi$.

Next, define the following two quantities:
\begin{align*}
v^\pi_T(x) &= \mathbb{E}\left[\sum_{t=1}^T\left(r(x_t,a_t) - J^\pi\right) \,\Big|\, x_1 = x,\, \pi\right], \\
q^\pi_T(x,a) &= \mathbb{E}\left[\sum_{t=1}^T\left(r(x_t,a_t) - J^\pi\right) \,\Big|\, (x_1,a_1) = (x,a),\ x_t\sim p(\cdot|x_{t-1},a_{t-1}),\ a_t\sim\pi(\cdot|x_t)\ \text{for}\ t\ge 2\right]. \tag{17}
\end{align*}
We will show that $v^\pi(x) \triangleq \lim_{T\to\infty}v^\pi_T(x)$ and $q^\pi(x,a) \triangleq \lim_{T\to\infty}q^\pi_T(x,a)$ satisfy the conditions stated in Lemma 6. First we argue that they do exist. Note that $J^\pi$ can be written as $\int_{\mathcal{X}}\sum_a r(x,a)\pi(a|x)\,\mathrm{d}\nu^\pi(x)$, where $\nu^\pi$ is the stationary distribution under $\pi$. Therefore, for any $T$, we have
\begin{align*}
\left|\mathbb{E}\left[r(x_{T+1},a_{T+1}) - J^\pi \,\Big|\, x_1 = x,\, \pi\right]\right| &= \left|\int_{\mathcal{X}}\sum_{a\in\mathcal{A}}\pi(a|x')r(x',a)\,\mathrm{d}P^{T}\delta_x(x') - \int_{\mathcal{X}}\sum_{a\in\mathcal{A}}\pi(a|x)r(x,a)\,\mathrm{d}\nu^\pi(x)\right| \\
&\le 2\left\|P^{T}\delta_x - \nu^\pi\right\|_{\text{TV}} = 2\left\|P^{T}\delta_x - P^{T}\nu^\pi\right\|_{\text{TV}} \tag*{(by the definition of $\nu^\pi$)}\\
&\le 2e^{-\frac{T}{t_{\text{mix}}}}\left\|\delta_x - \nu^\pi\right\|_{\text{TV}} \tag*{(by Assumption 3)}\\
&\le 2e^{-\frac{T}{t_{\text{mix}}}}, \tag{18}
\end{align*}
and thus $|v^\pi_T(x) - v^\pi_{T+1}(x)| = \left|\mathbb{E}\left[r(x_{T+1},a_{T+1}) - J^\pi \,\big|\, x_1=x,\,\pi\right]\right| \le 2e^{-\frac{T}{t_{\text{mix}}}}$, which decays geometrically in $T$ and implies that $v^\pi(x) = \lim_{T\to\infty}v^\pi_T(x)$ exists. On the other hand, by the definition we have
\[
q^\pi_T(x,a) = r(x,a) - J^\pi + \mathbb{E}_{x'\sim p(\cdot|x,a)}v^\pi_{T-1}(x'),
\]
and taking the limit on both sides shows that $q^\pi(x,a) = \lim_{T\to\infty}q^\pi_T(x,a)$ exists and satisfies the Bellman equation in the lemma statement: $q^\pi(x,a) = r(x,a) - J^\pi + \mathbb{E}_{x'\sim p(\cdot|x,a)}v^\pi(x')$.

Finally, Eq. (18) also shows that
\[
|v^\pi_T(x)| \le \sum_{t=1}^T 2e^{-\frac{t-1}{t_{\text{mix}}}} \le \frac{2}{1-e^{-\frac{1}{t_{\text{mix}}}}} \le \frac{2}{1-\left(1-\frac{1}{2t_{\text{mix}}}\right)} = 4t_{\text{mix}}
\]
(using $e^{-x}\le 1-\frac{x}{2}$ for $x\in[0,1]$ and $t_{\text{mix}}\ge 1$), and thus the range of $v^\pi$ is $[-4t_{\text{mix}}, 4t_{\text{mix}}]$, while the range of $q^\pi$ is $[-6t_{\text{mix}}, 6t_{\text{mix}}]$ since $|q^\pi(x,a)| \le |r(x,a)| + |J^\pi| + \sup_{x'}|v^\pi(x')| \le 2 + 4t_{\text{mix}} \le 6t_{\text{mix}}$. The last statement $\int_{\mathcal{X}}v^\pi(x)\,\mathrm{d}\nu^\pi(x) = 0$ in the lemma is also clear since $\int_{\mathcal{X}}v^\pi_T(x)\,\mathrm{d}\nu^\pi(x) = 0$ for all $T$ by the equality $J^\pi = \int_{\mathcal{X}}\sum_a r(x,a)\pi(a|x)\,\mathrm{d}\nu^\pi(x)$ and the fact that $x_2,\ldots,x_T$ all have marginal distribution $\nu^\pi$ when $x_1 = x$ is drawn from $\nu^\pi$.
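The geometric-series estimate $\frac{2}{1-e^{-1/t_{\text{mix}}}} \le 4t_{\text{mix}}$ used above is easy to verify numerically; a quick check (ours):

```python
import numpy as np

for t_mix in [1, 2, 5, 50]:
    series = 2.0 / (1.0 - np.exp(-1.0 / t_mix))   # sum_{t>=1} 2 e^{-(t-1)/t_mix}
    assert series <= 4 * t_mix, (t_mix, series)
    print(t_mix, round(series, 3), 4 * t_mix)
```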
In Section 4, we mention that Assumption 4 is weaker than Assumption 2 when Assumption 3 holds. Below we provide a proof for this statement.

Lemma 14. Under Assumption 3, Assumption 2 implies Assumption 4.

Proof. Since Assumption 3 holds, by Lemma 6, we have
\begin{align*}
q^\pi(x,a) &= r(x,a) - J^\pi + \mathbb{E}_{x'\sim p(\cdot|x,a)}v^\pi(x') \\
&= \Phi(x,a)^\top\theta - J^\pi\Phi(x,a)^\top e + \Phi(x,a)^\top\int_{\mathcal{X}}v^\pi(x')\,\mathrm{d}\mu(x') \tag*{(Assumption 2)}\\
&= \Phi(x,a)^\top\left(\theta - J^\pi e + \int_{\mathcal{X}}v^\pi(x')\,\mathrm{d}\mu(x')\right).
\end{align*}
Defining $w^\pi$ to be $\theta - J^\pi e + \int_{\mathcal{X}}v^\pi(x')\,\mathrm{d}\mu(x')$ and noting that $\|w^\pi\| \le \|\theta\| + 1 + \left(\max_{x\in\mathcal{X}}|v^\pi(x)|\right)\|\mu(\mathcal{X})\| \le \sqrt{d} + 1 + 4t_{\text{mix}}\sqrt{d} \le 6t_{\text{mix}}\sqrt{d}$ finishes the proof.

D.1 Proof of Theorem 7

To prove Theorem 7, we first show a couple of useful lemmas.

Lemma 15. Let $k$ be any number in $\{1,2,\ldots,\frac{T}{B}\}$ and $m$ be any number in $\{1,2,\ldots,\frac{B}{2N}\}$. Let $\mathbb{E}[\,\cdot\,|\,\tau_{k,m}]$ denote the expectation conditioned on $(x_{\tau_{k,m}}, a_{\tau_{k,m}})$ and all history before time $\tau_{k,m}$ (recall the definitions of $\tau_{k,m}$ and $R_{k,m}$ in Algorithm 3). Then we have
\[
\left|\mathbb{E}[R_{k,m}|\tau_{k,m}] - \left(q^{\pi_k}(x_{\tau_{k,m}}, a_{\tau_{k,m}}) + NJ^{\pi_k}\right)\right| \le \frac{2}{T}.
\]
Proof. Recalling the definition of $q^{\pi_k}_N$ in Eq. (17), we have
\begin{align*}
\mathbb{E}[R_{k,m}|\tau_{k,m}] &= \mathbb{E}\left[\sum_{t=1}^N r(x_t,a_t) \,\Big|\, (x_1,a_1) = (x_{\tau_{k,m}}, a_{\tau_{k,m}}),\ x_t\sim p(\cdot|x_{t-1},a_{t-1}),\ a_t\sim\pi_k(\cdot|x_t)\ \text{for}\ t\ge 2\right] \\
&= q^{\pi_k}_N(x_{\tau_{k,m}}, a_{\tau_{k,m}}) + NJ^{\pi_k}. \tag{19}
\end{align*}
Then we bound the difference between $q^\pi_N(x,a)$ and $q^\pi(x,a)$ (which is $\lim_{N\to\infty}q^\pi_N(x,a)$, as shown in the proof of Lemma 6) for any $\pi, x, a$:
\begin{align*}
\left|q^\pi_N(x,a) - q^\pi(x,a)\right| &= \left|\mathbb{E}\left[\sum_{t=N+1}^{\infty}\left(r(x_t,a_t) - J^\pi\right) \,\Big|\, (x_1,a_1) = (x,a),\ x_t\sim p(\cdot|x_{t-1},a_{t-1}),\ a_t\sim\pi(\cdot|x_t)\ \text{for}\ t\ge 2\right]\right| \\
&\le \sum_{t=N+1}^{\infty}2e^{-\frac{t-1}{t_{\text{mix}}}} \le \frac{2e^{-\frac{N}{t_{\text{mix}}}}}{1-e^{-\frac{1}{t_{\text{mix}}}}} \le 4t_{\text{mix}}e^{-\frac{N}{t_{\text{mix}}}}. \tag*{(Eq. (18))}
\end{align*}
Recall that $N = 8t_{\text{mix}}\log T$, and without loss of generality we assume $t_{\text{mix}} \le \sqrt{T}$ (otherwise the regret bound is vacuous). Thus we can bound the last expression by $\frac{4t_{\text{mix}}}{T^8} \le \frac{2}{T}$. Combining this with Eq. (19) finishes the proof.
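The next lemma controls the bias of the per-epoch estimator $w_k$. For reference, here is a schematic of how $w_k$ is formed from one epoch's data, under our own toy array layout (not the paper's code); the eigenvalue test plays the role of the indicator $I_k$ in the analysis:

```python
import numpy as np

def epoch_estimator(all_phis, probs, taken, returns, B, N, sigma):
    """Schematic of MDP-EXP2's per-epoch weight estimate (our toy layout).

    all_phis: (m, A, d) features Phi(x_{tau_{k,m}}, a) for every action a
    probs:    (m, A) current policy pi_k(a | x_{tau_{k,m}})
    taken:    (m,) indices of the actions actually played at tau_{k,m}
    returns:  (m,) length-N reward sums R_{k,m}
    Builds M_k = sum_m sum_a pi_k(a|x) Phi Phi^T (expectation over actions)
    and returns w_k = M_k^{-1} sum_m Phi(x, a_taken) R_{k,m}, zeroed out when
    lambda_min(M_k) falls below the B*sigma/(4N) threshold (indicator I_k).
    """
    M = np.einsum("ma,mad,mae->de", probs, all_phis, all_phis)
    if np.linalg.eigvalsh(M).min() < B * sigma / (4 * N):
        return np.zeros(all_phis.shape[-1])
    b = np.einsum("md,m->d", all_phis[np.arange(len(taken)), taken], returns)
    return np.linalg.solve(M, b)
```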
Lemma 16. Let $\mathbb{E}_k[\cdot]$ denote the expectation conditioned on all history before epoch $k$. Then
\[
\left\|\mathbb{E}_k[w_k] - \left(w^{\pi_k} + NJ^{\pi_k}e\right)\right\| \le \frac{4}{T}.
\]
Proof. Let $I_k = \mathbb{1}\left[\lambda_{\min}(M_k) \ge \frac{B\sigma}{4N}\right]$. We proceed as follows:
\begin{align*}
\mathbb{E}_k[w_k] &= \mathbb{E}_k\left[I_k M_k^{-1}\sum_{m=1}^{B/(2N)}\Phi(x_{\tau_{k,m}}, a_{\tau_{k,m}})R_{k,m}\right] \tag*{(definition of $w_k$)}\\
&= \mathbb{E}_k\left[I_k M_k^{-1}\sum_{m=1}^{B/(2N)}\Phi(x_{\tau_{k,m}}, a_{\tau_{k,m}})\,\mathbb{E}_k[R_{k,m}|x_{\tau_{k,m}}, a_{\tau_{k,m}}]\right] \tag*{(taking expectation for $R_{k,m}$ conditioned on $(x_{\tau_{k,m}}, a_{\tau_{k,m}})$)}\\
&= \mathbb{E}_k\left[I_k M_k^{-1}\sum_{m=1}^{B/(2N)}\Phi(x_{\tau_{k,m}}, a_{\tau_{k,m}})\left(q^{\pi_k}(x_{\tau_{k,m}}, a_{\tau_{k,m}}) + NJ^{\pi_k}\right)\right] + \mathbb{E}_k\left[I_k M_k^{-1}\sum_{m=1}^{B/(2N)}\Phi(x_{\tau_{k,m}}, a_{\tau_{k,m}})\,\epsilon_k(x_{\tau_{k,m}}, a_{\tau_{k,m}})\right] \\
&\qquad\text{(define $\epsilon_k(x_{\tau_{k,m}}, a_{\tau_{k,m}}) = \mathbb{E}_k[R_{k,m}|x_{\tau_{k,m}}, a_{\tau_{k,m}}] - \left(q^{\pi_k}(x_{\tau_{k,m}}, a_{\tau_{k,m}}) + NJ^{\pi_k}\right)$)}\\
&= \mathbb{E}_k\left[I_k M_k^{-1}\sum_{m=1}^{B/(2N)}\Phi(x_{\tau_{k,m}}, a_{\tau_{k,m}})\Phi(x_{\tau_{k,m}}, a_{\tau_{k,m}})^\top\left(w^{\pi_k} + NJ^{\pi_k}e\right)\right] + \mathbb{E}_k\left[I_k M_k^{-1}\sum_{m=1}^{B/(2N)}\Phi(x_{\tau_{k,m}}, a_{\tau_{k,m}})\,\epsilon_k(x_{\tau_{k,m}}, a_{\tau_{k,m}})\right] \tag*{(by Assumption 4)}\\
&= \mathbb{E}_k\left[I_k M_k^{-1}\sum_{m=1}^{B/(2N)}\sum_a \pi_k(a|x_{\tau_{k,m}})\Phi(x_{\tau_{k,m}}, a)\Phi(x_{\tau_{k,m}}, a)^\top\left(w^{\pi_k} + NJ^{\pi_k}e\right)\right] + \epsilon_2 \\
&\qquad\text{(taking expectation for $a_{\tau_{k,m}}$ conditioned on $x_{\tau_{k,m}}$; define $\epsilon_2 = \mathbb{E}_k\big[I_k M_k^{-1}\sum_{m=1}^{B/(2N)}\sum_a \pi_k(a|x_{\tau_{k,m}})\Phi(x_{\tau_{k,m}}, a)\,\epsilon_k(x_{\tau_{k,m}}, a)\big]$)}\\
&= \mathbb{E}_k\left[I_k\left(w^{\pi_k} + NJ^{\pi_k}e\right)\right] + \epsilon_2 = w^{\pi_k} + NJ^{\pi_k}e - \mathbb{E}_k\left[(1-I_k)\left(w^{\pi_k} + NJ^{\pi_k}e\right)\right] + \epsilon_2.
\end{align*}
By Lemma 15, we have $|\epsilon_k(x_{\tau_{k,m}}, a)| \le \frac{2}{T}$, and since $\Phi(x,a)^\top e = 1$ and $M_k = \sum_{m=1}^{B/(2N)}\sum_a \pi_k(a|x_{\tau_{k,m}})\Phi(x_{\tau_{k,m}}, a)\Phi(x_{\tau_{k,m}}, a)^\top$, we get
\[
\|\epsilon_2\| = \left\|\mathbb{E}_k\left[I_k M_k^{-1}\sum_{m=1}^{B/(2N)}\sum_a \pi_k(a|x_{\tau_{k,m}})\Phi(x_{\tau_{k,m}}, a)\Phi(x_{\tau_{k,m}}, a)^\top e\,\epsilon_k(x_{\tau_{k,m}}, a)\right]\right\| \le \frac{2}{T}\,\mathbb{E}_k\left[\left\|I_k M_k^{-1}M_k e\right\|\right] \le \frac{2}{T}.
\]
On the other hand, we also have
\[
\left\|\mathbb{E}_k\left[(1-I_k)\left(w^{\pi_k} + NJ^{\pi_k}e\right)\right]\right\| \le \mathbb{E}_k[1-I_k]\left(6t_{\text{mix}}\sqrt{d} + N\right) \le \frac{2\left(6t_{\text{mix}}\sqrt{d} + N\right)}{T^2},
\]
where the last step is by Lemma 17 (stated after this proof). Finally, combining everything proves
\[
\left\|\mathbb{E}_k[w_k] - \left(w^{\pi_k} + NJ^{\pi_k}e\right)\right\| \le \frac{2}{T} + \frac{2\left(6t_{\text{mix}}\sqrt{d} + N\right)}{T^2} \le \frac{4}{T},
\]
where we use that $6t_{\text{mix}}\sqrt{d} + N = 6t_{\text{mix}}\sqrt{d} + 8t_{\text{mix}}\log T$ is at most $T$ (otherwise the regret bound is vacuous).

Lemma 17. For any $k\in\{1,\ldots,T/B\}$, conditioning on the history before epoch $k$, we have with probability at least $1-\frac{2}{T^2}$, $\lambda_{\min}(M_k) \ge \frac{B\sigma}{4N}$.

Proof. We consider a fixed $k$. Notice that since $N$ is larger than $t_{\text{mix}}\log T$, the state distribution at $\tau_{k,m}$, conditioned on all trajectories collected before (which all happen before $\tau_{k,m} - N$), is close to the stationary distribution $\nu^{\pi_k}$. For the purpose of analysis, we consider an imaginary world where all history before epoch $k$ remains the same as in the real world, but in epoch $k$, at time $t = \tau_{k,m}$ for all $m = 1,2,\ldots$, the state distribution is reset according to the stationary distribution, i.e., $x_{\tau_{k,m}}\sim\nu^{\pi_k}$; for other rounds, the state follows the transition driven by $\pi_k$, the same as in the real world. We denote the expectation (given the history before epoch $k$) in the imaginary world by $\mathbb{E}'_k[\cdot]$.

For simplicity, define $y_m = x_{\tau_{k,m}}$, $z_m = \{a_{\tau_{k,m}}, R_{k,m}\}$, and $m^* = \frac{B}{2N}$. Note that $M_k$ is a function of $\{y_m\}_{m=1}^{m^*}$ and that $(y_{i-1}, z_{i-1}) \to y_i \to z_i$ forms a Markov chain. Therefore, by writing $M_k = M_k(y_1,\ldots,y_{m^*})$ and considering any function $f$ of $M_k$, we have
\[
\mathbb{E}_k[f(M_k)] = \int f(M_k(y_1,\ldots,y_{m^*}))\,\mathrm{d}q(y_1)\,\mathrm{d}q(z_1|y_1)\,\mathrm{d}q(y_2|y_1,z_1)\,\mathrm{d}q(z_2|y_2)\cdots\mathrm{d}q(y_{m^*}|y_{m^*-1},z_{m^*-1})\,\mathrm{d}q(z_{m^*}|y_{m^*})
\]
and
\[
\mathbb{E}'_k[f(M_k)] = \int f(M_k(y_1,\ldots,y_{m^*}))\,\mathrm{d}q'(y_1)\,\mathrm{d}q(z_1|y_1)\,\mathrm{d}q'(y_2)\,\mathrm{d}q(z_2|y_2)\cdots\mathrm{d}q'(y_{m^*})\,\mathrm{d}q(z_{m^*}|y_{m^*}),
\]
where $q$ and $q'$ denote the probability measures in the real and the imaginary worlds respectively (conditioned on the history before epoch $k$). Note that by our construction, in the imaginary world $y_i$ is independent of $(y_1,z_1,\ldots,y_{i-1},z_{i-1})$, while $z_i|y_i$ follows the same distribution as in the real world. By the uniform-mixing assumption, we have $\left\|q'(y_m) - q(y_m|y_{m-1},z_{m-1})\right\|_{\text{TV}} \le 2e^{-\frac{N}{t_{\text{mix}}}} \le \frac{2}{T^8}$, implying that
\begin{align*}
\left|\mathbb{E}_k[f(M_k)] - \mathbb{E}'_k[f(M_k)]\right| \le \frac{2}{T^8}\times\frac{B}{2N}\times f_{\max} \le \frac{f_{\max}}{T^2}, \tag{20}
\end{align*}
where $f_{\max}$ is the maximum magnitude of $f(\cdot)$. Picking $f(M) = \mathbb{1}\left[\lambda_{\min}(M) \le \frac{B\sigma}{4N}\right]$ (with $f_{\max}=1$ clearly), we have shown that
\[
\Pr_k\left[\lambda_{\min}(M_k) \le \frac{B\sigma}{4N}\right] \le \Pr{}'_k\left[\lambda_{\min}(M_k) \le \frac{B\sigma}{4N}\right] + \frac{1}{T^2}.
\]
It remains to bound $\Pr'_k\left[\lambda_{\min}(M_k) \le \frac{B\sigma}{4N}\right]$.
Notice that
\[
\mathbb{E}'_k[M_k] = \frac{B}{2N}\times\int_{\mathcal{X}}\sum_a \pi_k(a|x)\Phi(x,a)\Phi(x,a)^\top\,\mathrm{d}\nu^{\pi_k}(x) \succeq \frac{B}{2N}\times\sigma I
\]
by Assumption 5. Using standard matrix concentration results (specifically, Lemma 18 with $\delta = \frac{1}{2}$, $n = \frac{B}{2N} = \frac{32}{\sigma}\log(dT)$, $X_m = \sum_a \pi_k(a|x_{\tau_{k,m}})\Phi(x_{\tau_{k,m}},a)\Phi(x_{\tau_{k,m}},a)^\top$, $R = 2$, and $r = \frac{B\sigma}{2N} = 32\log(dT)$), we get
\[
\Pr{}'_k\left[\lambda_{\min}(M_k) \le \frac{1}{2}\times\frac{B\sigma}{2N}\right] \le d\cdot\exp\left(-\frac{\frac{1}{4}\times 32\log(dT)}{2\times 2}\right) = d\cdot\exp\left(-2\log(dT)\right) \le \frac{1}{T^2}.
\]
In other words, we have shown
\[
\Pr_k\left[\lambda_{\min}(M_k) \le \frac{B\sigma}{4N}\right] \le \frac{1}{T^2} + \frac{1}{T^2} = \frac{2}{T^2},
\]
which completes the proof.

Lemma 18 (Theorem 2 in [17]). Let $X_1,\ldots,X_n$ be independent, random, symmetric, real matrices of size $d\times d$ with $0 \preceq X_m \preceq RI$ for all $m$. Suppose $rI \preceq \mathbb{E}\left[\sum_{m=1}^n X_m\right]$ for some $r > 0$. Then for all $\delta\in[0,1)$, one has
\[
\Pr\left[\lambda_{\min}\left(\sum_{m=1}^n X_m\right) \le (1-\delta)r\right] \le d\cdot e^{-\delta^2 r/(2R)}.
\]
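A small Monte-Carlo illustration of Lemma 18 (ours): unit-norm rows give bounded rank-one summands with $R = 1$ and $\lambda_{\min}(\mathbb{E}[S]) = n/d$, and the event $\lambda_{\min}(S) \le \frac{1}{2}\cdot\frac{n}{d}$ essentially never fires.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, trials = 4, 400, 200
failures = 0
for _ in range(trials):
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit rows: 0 <= x x^T <= I, so R = 1
    S = X.T @ X                                    # sum of n rank-one PSD matrices
    r = n / d                                      # lambda_min(E[S]) = n/d for sphere rows
    if np.linalg.eigvalsh(S).min() <= r / 2:       # the (1 - delta) r event, delta = 1/2
        failures += 1
print(f"empirical failure rate over {trials} trials: {failures / trials:.3f}")
```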
Lemma 19. With $\eta \le \frac{\sigma}{24N}$, MDP-EXP2 guarantees for all $x$:
\[
\mathbb{E}\left[\sum_{k=1}^{T/B}\sum_a\left(\pi^*(a|x) - \pi_k(a|x)\right)q^{\pi_k}(x,a)\right] \le O\left(\frac{\ln|\mathcal{A}|}{\eta} + \frac{\eta TN^2}{B\sigma}\right).
\]
Proof. Note that by the definition of $w_k$ we have
\begin{align*}
\left|w_k^\top\Phi(x,a)\right| \le \sqrt{2}\|w_k\| \le \sqrt{2}\times\frac{4N}{B\sigma}\times\frac{B}{2N}\times\sqrt{2}N = \frac{4N}{\sigma}, \tag{21}
\end{align*}
and thus $\eta\left|w_k^\top\Phi(x,a)\right| \le 1$ by our choice of $\eta$. Therefore, using the standard regret bound of exponential weight (see e.g., [8, Theorem 1]), we have
\begin{align*}
\sum_{k=1}^{T/B}\sum_a\left(\pi^*(a|x) - \pi_k(a|x)\right)\left(w_k^\top\Phi(x,a)\right) \le O\left(\frac{\ln|\mathcal{A}|}{\eta} + \eta\sum_{k=1}^{T/B}\sum_a \pi_k(a|x)\left(w_k^\top\Phi(x,a)\right)^2\right). \tag{22}
\end{align*}
Taking expectation, the left-hand side becomes
\begin{align*}
\mathbb{E}\left[\sum_{k=1}^{T/B}\sum_a\left(\pi^*(a|x) - \pi_k(a|x)\right)\left(w_k^\top\Phi(x,a)\right)\right]
&= \mathbb{E}\left[\sum_{k=1}^{T/B}\sum_a\left(\pi^*(a|x) - \pi_k(a|x)\right)\left(\left(w^{\pi_k} + NJ^{\pi_k}e\right)^\top\Phi(x,a)\right)\right] - O(1) \tag*{(Lemma 16)}\\
&= \mathbb{E}\left[\sum_{k=1}^{T/B}\sum_a\left(\pi^*(a|x) - \pi_k(a|x)\right)\left({w^{\pi_k}}^\top\Phi(x,a) + NJ^{\pi_k}\right)\right] - O(1) \\
&= \mathbb{E}\left[\sum_{k=1}^{T/B}\sum_a\left(\pi^*(a|x) - \pi_k(a|x)\right){w^{\pi_k}}^\top\Phi(x,a)\right] - O(1) \\
&= \mathbb{E}\left[\sum_{k=1}^{T/B}\sum_a\left(\pi^*(a|x) - \pi_k(a|x)\right)q^{\pi_k}(x,a)\right] - O(1), \tag*{(Assumption 4)}
\end{align*}
where the third equality uses $\sum_a\left(\pi^*(a|x) - \pi_k(a|x)\right)NJ^{\pi_k} = 0$. To bound the expectation of the right-hand side of Eq. (22), we focus on the key term $\mathbb{E}_k\left[\sum_a \pi_k(a|x)\left(w_k^\top\Phi(x,a)\right)^2\right]$ ($\mathbb{E}_k$ denotes the expectation conditioned on the history before epoch $k$) and use the same argument as in the proof of Lemma 17, via the help of an imaginary world where everything is the same as in the real world except that the first state of each trajectory, $x_{\tau_{k,m}}$ for $m = 1,2,\ldots,\frac{B}{2N}$, is reset according to the stationary distribution $\nu^{\pi_k}$ ($\mathbb{E}'_k$ denotes the conditional expectation in this imaginary world). By the exact same argument (cf. Eq. (20)), we have
\[
\mathbb{E}_k\left[\sum_a \pi_k(a|x)\left(w_k^\top\Phi(x,a)\right)^2\right] \le \mathbb{E}'_k\left[\sum_a \pi_k(a|x)\left(w_k^\top\Phi(x,a)\right)^2\right] + \frac{1}{T^2}\times\left(\frac{4N}{\sigma}\right)^2,
\]
where the last term uses the bound on $\left(w_k^\top\Phi(x,a)\right)^2$ derived earlier in Eq. (21). It remains to bound $\mathbb{E}'_k\left[\sum_a \pi_k(a|x)\left(w_k^\top\Phi(x,a)\right)^2\right]$, which we proceed to do as follows, with $I_k = \mathbb{1}\left[\lambda_{\min}(M_k)\ge\frac{B\sigma}{4N}\right]$:
\begin{align*}
\mathbb{E}'_k\left[\sum_a \pi_k(a|x)\left(w_k^\top\Phi(x,a)\right)^2\right] &= \mathbb{E}'_k\left[\sum_a \pi_k(a|x)\left(\Phi(x,a)^\top M_k^{-1}\sum_{m=1}^{B/(2N)}\Phi(x_{\tau_{k,m}},a_{\tau_{k,m}})R_{k,m}\right)^2 I_k\right] \\
&\le N^2\,\mathbb{E}'_k\left[\sum_a \pi_k(a|x)\,\Phi(x,a)^\top M_k^{-1}\left(\sum_{m=1}^{B/(2N)}\Phi(x_{\tau_{k,m}},a_{\tau_{k,m}})\right)\left(\sum_{m=1}^{B/(2N)}\Phi(x_{\tau_{k,m}},a_{\tau_{k,m}})\right)^\top M_k^{-1}\Phi(x,a)\, I_k\right] \tag*{($R_{k,m}\le N$)}\\
&\le \frac{BN}{2}\,\mathbb{E}'_k\left[\sum_a \pi_k(a|x)\,\Phi(x,a)^\top M_k^{-1}\sum_{m=1}^{B/(2N)}\Phi(x_{\tau_{k,m}},a_{\tau_{k,m}})\Phi(x_{\tau_{k,m}},a_{\tau_{k,m}})^\top M_k^{-1}\Phi(x,a)\, I_k\right] \tag*{(Cauchy-Schwarz inequality)}\\
&= \frac{BN}{2}\,\mathbb{E}'_k\left[\sum_a \pi_k(a|x)\,\Phi(x,a)^\top M_k^{-1}\sum_{m=1}^{B/(2N)}\sum_{a'}\pi_k(a'|x_{\tau_{k,m}})\Phi(x_{\tau_{k,m}},a')\Phi(x_{\tau_{k,m}},a')^\top M_k^{-1}\Phi(x,a)\, I_k\right] \\
&\qquad\text{(taking expectation for $a_{\tau_{k,m}}$ conditioned on $x_{\tau_{k,m}}$)}\\
&= \frac{BN}{2}\,\mathbb{E}'_k\left[\sum_a \pi_k(a|x)\,\Phi(x,a)^\top M_k^{-1}\Phi(x,a)\, I_k\right] \le O\left(\frac{BN}{2}\times\frac{4N}{B\sigma}\right) \tag*{(definition of $I_k$)}\\
&= O\left(\frac{N^2}{\sigma}\right).
\end{align*}
Combining everything shows
\[
\mathbb{E}\left[\sum_{k=1}^{T/B}\sum_a\left(\pi^*(a|x) - \pi_k(a|x)\right)q^{\pi_k}(x,a)\right] \le O\left(\frac{\ln|\mathcal{A}|}{\eta} + \eta\,\frac{T}{B}\left(\frac{N^2}{\sigma} + \frac{N^2}{T^2\sigma^2}\right)\right) \le O\left(\frac{\ln|\mathcal{A}|}{\eta} + \frac{\eta TN^2}{B\sigma}\right),
\]
which finishes the proof.
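The update analyzed in Lemma 19 is a per-state exponential-weights step on the linear estimates $\Phi(x,a)^\top w_{k-1}$. A minimal, numerically stable sketch (our own array shapes, purely illustrative):

```python
import numpy as np

def exp_weights_update(pi_prev, phis, w, eta):
    """pi_k(a|x) proportional to pi_{k-1}(a|x) * exp(eta * Phi(x,a)^T w_{k-1}).

    pi_prev: (A,) previous policy at a state x (strictly positive, as the
             multiplicative update keeps it); phis: (A, d) features.
    The analysis takes eta <= sigma / (24 N).
    """
    logits = np.log(pi_prev) + eta * (phis @ w)
    logits -= logits.max()            # stabilization; cancels in normalization
    pi = np.exp(logits)
    return pi / pi.sum()
```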
We are now ready to prove Theorem 7.

Proof of Theorem 7. First, decompose the regret as:
\[
\text{Reg}_T = \mathbb{E}\left[\sum_{t=1}^T\left(J^* - r(x_t,a_t)\right)\right] = \mathbb{E}\left[\sum_{k=1}^{T/B}B\left(J^* - J^{\pi_k}\right)\right] + \mathbb{E}\left[\sum_{k=1}^{T/B}\sum_{t=(k-1)B+1}^{kB}\left(J^{\pi_k} - r(x_t,a_t)\right)\right].
\]
For the first term, we have
\[
\mathbb{E}\left[\sum_{k=1}^{T/B}B\left(J^* - J^{\pi_k}\right)\right] = \mathbb{E}\left[\sum_{k=1}^{T/B}B\int_{\mathcal{X}}\sum_a\left(\pi^*(a|x) - \pi_k(a|x)\right)q^{\pi_k}(x,a)\,\mathrm{d}\nu^{\pi^*}(x)\right] = O\left(\frac{B\ln|\mathcal{A}|}{\eta} + \frac{\eta TN^2}{\sigma}\right). \tag*{(by Lemma 19)}
\]
For the second term, we first consider a specific $k$:
\begin{align*}
\mathbb{E}_k\left[\sum_{t=(k-1)B+1}^{kB}\left(J^{\pi_k} - r(x_t,a_t)\right)\right] &= \mathbb{E}_k\left[\sum_{t=(k-1)B+1}^{kB}\left(\mathbb{E}_{x'\sim p(\cdot|x_t,a_t)}[v^{\pi_k}(x')] - q^{\pi_k}(x_t,a_t)\right)\right] \tag*{(Bellman equation)}\\
&= \mathbb{E}_k\left[\sum_{t=(k-1)B+1}^{kB}\left(v^{\pi_k}(x_{t+1}) - v^{\pi_k}(x_t)\right)\right] \tag*{(using $a_t\sim\pi_k(\cdot|x_t)$ and $v^{\pi_k}(x) = \sum_a \pi_k(a|x)q^{\pi_k}(x,a)$)}\\
&= \mathbb{E}_k\left[v^{\pi_k}(x_{kB+1}) - v^{\pi_k}(x_{(k-1)B+1})\right].
\end{align*}
Therefore,
\begin{align*}
\mathbb{E}\left[\sum_{k=1}^{T/B}\sum_{t=(k-1)B+1}^{kB}\left(J^{\pi_k} - r(x_t,a_t)\right)\right] \le \mathbb{E}\left[\sum_{k=1}^{T/B}\left(v^{\pi_k}(x_{kB+1}) - v^{\pi_k}(x_{(k-1)B+1})\right)\right] \le \mathbb{E}\left[\sum_{k=2}^{T/B}\left(v^{\pi_{k-1}}(x_{(k-1)B+1}) - v^{\pi_k}(x_{(k-1)B+1})\right)\right] + O(t_{\text{mix}}). \tag{23}
\end{align*}
We bound the last summation using the fact that $\pi_k$ and $\pi_{k-1}$ are close. Indeed, by the update rule of the algorithm, we have
\begin{align*}
\pi_k(a|x) - \pi_{k-1}(a|x) &= \frac{\pi_{k-1}(a|x)e^{\eta\Phi(x,a)^\top w_{k-1}}}{\sum_{b\in\mathcal{A}}\pi_{k-1}(b|x)e^{\eta\Phi(x,b)^\top w_{k-1}}} - \pi_{k-1}(a|x) \\
&\le \frac{\pi_{k-1}(a|x)e^{\eta\Phi(x,a)^\top w_{k-1}}}{\sum_{b\in\mathcal{A}}\pi_{k-1}(b|x)e^{-\eta\max_b|\Phi(x,b)^\top w_{k-1}|}} - \pi_{k-1}(a|x) \\
&\le \pi_{k-1}(a|x)\left(e^{2\eta\max_b|\Phi(x,b)^\top w_{k-1}|} - 1\right).
\end{align*}
Recall that in the proof of Lemma 19, we have shown $\eta\max_b|\Phi(x,b)^\top w_{k-1}| \le \frac{1}{6}$ as long as $\eta \le \frac{\sigma}{24N}$. Combining this with the fact $e^x \le 1+2x$ for $x\in[0,1]$, we have
\[
e^{2\eta\max_b|\Phi(x,b)^\top w_{k-1}|} - 1 \le 4\eta\max_b\left|\Phi(x,b)^\top w_{k-1}\right| = O\left(\eta\times\frac{N}{\sigma}\right),
\]
where the last step is by Eq. (21). This shows $\pi_k(a|x) - \pi_{k-1}(a|x) \le O\left(\frac{\eta N}{\sigma}\pi_{k-1}(a|x)\right)$. Similarly, $\pi_{k-1}(a|x) - \pi_k(a|x) = O\left(\frac{\eta N}{\sigma}\pi_{k-1}(a|x)\right)$ as well. By the same argument as [35, Lemma 7] (summarized in Lemma 20 for completeness), this implies
\[
\left|v^{\pi_k}(x) - v^{\pi_{k-1}}(x)\right| \le O\left(\frac{\eta N^2}{\sigma} + \frac{1}{T}\right)
\]
for all $x$. Continuing from Eq. (23), we arrive at
\[
\mathbb{E}\left[\sum_{k=1}^{T/B}\sum_{t=(k-1)B+1}^{kB}\left(J^{\pi_k} - r(x_t,a_t)\right)\right] = O\left(\frac{\eta TN^2}{B\sigma} + t_{\text{mix}}\right).
\]
Combining everything, we have shown
\begin{align*}
\text{Reg}_T = O\left(\frac{B\ln|\mathcal{A}|}{\eta} + \frac{\eta TN^2}{\sigma} + \frac{\eta TN^2}{B\sigma} + t_{\text{mix}}\right)
= \widetilde{O}\left(\frac{t_{\text{mix}}}{\sigma\eta} + \frac{\eta T t_{\text{mix}}^2}{\sigma}\right) \tag*{(definitions of $N$ and $B$)}
= \widetilde{O}\left(\frac{1}{\sigma}\sqrt{t_{\text{mix}}^3\, T}\right), \tag*{(by the choice of $\eta$ specified in Algorithm 3)}
\end{align*}
which finishes the proof.

Lemma 20. If $\pi'$ and $\pi$ satisfy $|\pi'(a|x) - \pi(a|x)| \le O(\beta\pi(a|x))$ for all $x,a$ and some $\beta > 0$, and $N \ge t_{\text{mix}}\log T$, then $|v^{\pi'}(x) - v^\pi(x)| \le O\left(\beta N + \frac{1}{T}\right)$.

Proof. See the proof of [35, Lemma 7].

Remark 1. Notice that by the definition of $\sigma$,
\begin{align*}
\sigma &\le \lambda_{\min}\left(\int_{\mathcal{X}}\left(\sum_a \pi(a|x)\Phi(x,a)\Phi(x,a)^\top\right)\mathrm{d}\nu^\pi(x)\right) \\
&\le \frac{1}{d}\operatorname{trace}\left[\int_{\mathcal{X}}\left(\sum_a \pi(a|x)\Phi(x,a)\Phi(x,a)^\top\right)\mathrm{d}\nu^\pi(x)\right] \\
&\le \frac{1}{d}\int_{\mathcal{X}}\left(\sum_a \pi(a|x)\|\Phi(x,a)\|^2\right)\mathrm{d}\nu^\pi(x) \tag*{($\operatorname{trace}[\Phi(x,a)\Phi(x,a)^\top] = \|\Phi(x,a)\|^2$)}\\
&\le \frac{2}{d}, \tag*{($\|\Phi(x,a)\|^2 \le 2$ by Assumption 2)}
\end{align*}
which implies $\frac{1}{\sigma} \ge \frac{d}{2}$. Therefore, the regret bound in Theorem 7 has an implicit $\Omega(d)$ dependence.

E Connection between Natural Policy Gradient and MDP-EXP2

The connection between the exponential weight algorithm [13] and the classic natural policy gradient (NPG) algorithm [22] under softmax parameterization has been discussed in [4]. Further connections between exponential weight algorithms and several relative-entropy-regularized policy optimization algorithms (e.g., TRPO [31], A3C [24], PPO [32]) are also drawn in [27]. In this section, we review these connections, and argue that, because of the different way of constructing the policy gradient estimator, our MDP-EXP2 achieves a better guarantee.

E.1 Equivalence between NPG with softmax parameterization and exponential weight updates

We first restate [4, Lemma 5.1], which shows that NPG with softmax parameterization is equivalent to exponential weight updates:

Lemma 21 (Lemma 5.1 of [4]). Let $\pi_\theta(a|x) = \frac{\exp(\Phi(x,a)^\top\theta)}{\sum_b\exp(\Phi(x,b)^\top\theta)}$. Also, let $\nu_\theta$ be the stationary distribution under policy $\pi_\theta$, and $A^\pi(x,a)$ be the advantage function under policy $\pi$ defined as $A^\pi(x,a) = q^\pi(x,a) - v^\pi(x)$. Then the update $\theta_{\text{new}} = \theta + \eta F_\theta^\dagger g_\theta$ with
\begin{align*}
F_\theta &= \mathbb{E}_{x\sim\nu_{\pi_\theta}}\mathbb{E}_{a\sim\pi_\theta(\cdot|x)}\left[\nabla_\theta\log\pi_\theta(a|x)\nabla_\theta\log\pi_\theta(a|x)^\top\right], \\
g_\theta &= \mathbb{E}_{x\sim\nu_{\pi_\theta}}\mathbb{E}_{a\sim\pi_\theta(\cdot|x)}\left[\nabla_\theta\log\pi_\theta(a|x)A^{\pi_\theta}(x,a)\right]
\end{align*}
implies
\[
\pi_{\theta_{\text{new}}}(a|x) = \pi_\theta(a|x)\frac{\exp\left(\eta A^{\pi_\theta}(x,a)\right)}{Z_\theta(x)},
\]
where $Z_\theta(x)$ is a normalization factor that ensures $\sum_a\pi_{\theta_{\text{new}}}(a|x) = 1$.
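A compact rendering of the two equivalent views in Lemma 21, softmax parameterization on one side and a multiplicative-weights step on the other (toy per-state arrays, our own setup):

```python
import numpy as np

def softmax_policy(theta, phis):
    """pi_theta(a|x) = exp(Phi(x,a)^T theta) / sum_b exp(Phi(x,b)^T theta)."""
    z = phis @ theta       # (A,) logits
    z -= z.max()           # stabilization
    p = np.exp(z)
    return p / p.sum()

def npg_step_via_exp_weights(pi, advantages, eta):
    """Policy-space form of the NPG update in Lemma 21:
    pi_new(a|x) proportional to pi(a|x) * exp(eta * A(x, a))."""
    p = pi * np.exp(eta * (advantages - advantages.max()))
    return p / p.sum()
```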
To see this connection, notice that the update direction $w = F_\theta^\dagger g_\theta$ is the solution of
\begin{align*}
\min_{w\in\mathbb{R}^d}\;\mathbb{E}_{x\sim\nu_{\pi_\theta}}\mathbb{E}_{a\sim\pi_\theta(\cdot|x)}\left[\left(w^\top\nabla_\theta\log\pi_\theta(a|x) - A^{\pi_\theta}(x,a)\right)^2\right], \tag{24}
\end{align*}
and also by definition
\[
\pi_{\theta_{\text{new}}}(a|x) = \frac{\exp(\Phi(x,a)^\top\theta_{\text{new}})}{\sum_b\exp(\Phi(x,b)^\top\theta_{\text{new}})} \propto \pi_\theta(a|x)\exp\left(\eta\Phi(x,a)^\top F_\theta^\dagger g_\theta\right) = \pi_\theta(a|x)\exp\left(\eta\Phi(x,a)^\top w\right) \propto \pi_\theta(a|x)\exp\left(\eta\nabla_\theta\log\pi_\theta(a|x)^\top w\right).
\]
Therefore, if $w$ achieves a value of zero in Eq. (24), we will have $\pi_{\theta_{\text{new}}}(a|x) \propto \pi_\theta(a|x)\exp\left(\eta A^{\pi_\theta}(x,a)\right)$. The proof of [4] handles the general case where the minimum of Eq. (24) is not necessarily zero. Notice that $\pi_\theta(a|x)\exp\left(\eta A^{\pi_\theta}(x,a)\right)$ is further proportional to $\pi_\theta(a|x)\exp\left(\eta q^{\pi_\theta}(x,a)\right)$, which is consistent with the intuition of our algorithm explained in Section 4.

E.2 Comparison between the NPG in [4] and MDP-EXP2

While the general formulations of the NPG in [4] and MDP-EXP2 are equivalent, the two differ in how they estimate $A^{\pi_\theta}(x,a)$ (or $q^{\pi_\theta}(x,a)$) when the learner does not have access to their true values and has to estimate them from sampling. We argue that under the setting considered in Section 4, our algorithm and analysis achieve the near-optimal regret of order $\widetilde{O}(\sqrt{T})$, while theirs only obtains sub-optimal regret.

In MDP-EXP2, we construct a nearly unbiased estimator of the $w$ satisfying $q^{\pi_\theta}(x,a) + NJ^{\pi_\theta} = w^\top\Phi(x,a)$ (which exists under Assumption 4), and feed it to the exponential weight algorithm. The way we do it is similar to how EXP2 constructs loss estimators for adversarial linear bandits. In MDP-EXP2, to construct each estimator (denoted as $w_k$ there), the learner collects $\frac{B}{2N} = \widetilde{O}\left(\frac{1}{\sigma}\right)$ trajectories, with $\sigma$ defined in Assumption 5, and then aggregates them through a form of importance weighting introduced by $M_k^{-1}$. With this construction, $w_k^\top\Phi(x,a)$ has negligible bias (by Lemma 16) compared to $w^\top\Phi(x,a)$, while having variance upper bounded by a constant related to $\sigma$ (see the proof of Lemma 19).

On the other hand, the estimator used in [4] is an approximate solution of Eq. (24). Under the same assumptions as Assumption 4 and Assumption 5, they use stochastic gradient descent to solve Eq. (24), and obtain an estimator $\widehat{w}$ that makes $\widehat{w}^\top\nabla_\theta\log\pi_\theta(a|x)$ $\epsilon$-close to $w^\top\nabla_\theta\log\pi_\theta(a|x)$. To obtain such a $\widehat{w}$, they need to sample $O(\epsilon^{-2})$ trajectories.

Comparing the two approaches, we see that to obtain a single estimator $\widehat{w}$ for the update direction $w = F_\theta^\dagger g_\theta$ in Lemma 21, MDP-EXP2 uses only a handful of trajectories to get a cheap estimator with potentially high variance, while [4] obtains an $\epsilon$-accurate one with low variance using $O(\epsilon^{-2})$ trajectories. The advantage of the former is that each estimator is cheaper to get, and the effect of the high variance can be amortized over iterations. As shown in Theorem 7, MDP-EXP2 achieves an $\widetilde{O}(\sqrt{T})$ regret bound. On the other hand, to get an $\epsilon$-optimal policy, [4] needs to use $O(\epsilon^{-2})$ trajectories per iteration of policy update, and perform $O(\epsilon^{-2})$ iterations of policy updates, leading to a total sample complexity bound of $O(\epsilon^{-4})$. This translates to a regret bound of $O(T^{3/4})$ in our setting at best. In fact, since the algorithms by [2] and [16] are also based on exponential weight, they can also be regarded as variants of NPG.
However, the estimators they construct suffer from the same issue described above, and can only get $O(T^{3/4})$ or $O(T^{2/3})$ regret. We remark that the version of NPG by [4] can also learn the optimal policy, with a worse sample complexity of $O(\epsilon^{-6})$, under a weaker assumption compared to Assumption 5 (one which replaces $\sigma$ with the relative condition number $\kappa$).