Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation
Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, Rahul Jain
LLearning Infinite-horizon Average-reward MDPswith Linear Function Approximation
Chen-Yu Wei
University of Southern California [email protected]
Mehdi Jafarnia-Jahromi
University of Southern California [email protected]
Haipeng Luo
University of Southern California [email protected]
Rahul Jain
University of Southern California [email protected]
Abstract
We develop several new algorithms for learning Markov Decision Processes in aninfinite-horizon average-reward setting with linear function approximation. Usingthe optimism principle and assuming that the MDP has a linear structure, wefirst propose a computationally inefficient algorithm with optimal (cid:101) O ( √ T ) regretand another computationally efficient variant with (cid:101) O ( T ) regret, where T is thenumber of interactions. Next, taking inspiration from adversarial linear bandits,we develop yet another efficient algorithm with (cid:101) O ( √ T ) regret under a different setof assumptions, improving the best existing result by Hao et al. [16] with (cid:101) O ( T ) regret. Moreover, we draw a connection between this algorithm and the NaturalPolicy Gradient algorithm proposed by Kakade [22], and show that our analysisimproves the sample complexity bound recently given by Agarwal et al. [4]. Reinforcement learning with value function approximation has gained significant empirical successin many applications. However, the theoretical understanding of these methods is still quite limited.Recently, some progress has been made for Markov Decision Processes (MDPs) with a transitionkernel and a reward function that are both linear in a fixed state-action feature representation (or moregenerally with a value function that is linear in such a feature representation). For example, Jin et al.[21] develop an optimistic variant of the Least-squares Value Iteration (LSVI) algorithm [7, 29] forthe finite-horizon episodic setting with regret (cid:101) O ( √ d T ) , where d is the dimension of the featuresand T is the number of interactions. Importantly, the bound has no dependence on the number ofstates or actions.However, the understanding of function approximation for the infinite-horizon average-reward setting,even under the aforementioned linear conditions, remains underexplored. Compared to the finite-horizon setting, the infinite-horizon model is often a better fit for real-world problems such as serveroperation optimization or stock market decision making which last for a long time or essentially neverend. On the other hand, compared to the discounted-reward model, maximizing the long-term averagereward also has its advantage in the sense that the transient behavior of the learner does not reallymatter for the latter case. Indeed, the infinite-horizon average-reward setting for the tabular case (thatis, no function approximation) is a heavily-studied topic in the literature. Several recent works startto investigate function approximation for this setting, albeit under strong assumptions [2, 3, 16]. Preprint. Under review. a r X i v : . [ c s . L G ] J u l otivated by this fact, in this work we significantly expand the understanding of learning MDPs inthe infinite-horizon average-reward setting with linear function approximation. We develop threenew algorithms, each with different pros and cons. Our first two algorithms provably ensure lowregret for MDPs with linear transition and reward, which are the first for this setting to the best of ourknowledge. More specifically, the first algorithm Fixed-point OPtimization with Optimism (FOPO)is based on the principle of “optimism in the face of uncertainty” applied in a novel way. FOPOaims to find a weight vector (parametrizing the estimated value function) that maximizes the averagereward under a fixed-point constraint akin to the LSVI update involving the observed data and anoptimistic term. The constraint is non-convex and we do not know of a way to efficiently solve it.FOPO also relies on a lazy update schedule similar to [1] for stochastic linear bandits, which is onlyfor the purpose of saving computation in their work but critical for our regret guarantee. We provethat FOPO enjoys (cid:101) O ( √ d T ) regret with high probability, which is optimal in T . (Section 2)Our second algorithm OLSVI.FH addresses the computational inefficiency issue of FOPO withthe price of having larger regret. Specifically, it combines two ideas: 1) solving an infinite-horizonproblem via an artificially constructed finite-horizon problem, which is new as far as we know, and 2)the optimistic LSVI algorithm of [21] for the finite-horizon setting. OLSVI.FH can be implementedefficiently and is shown to achieve (cid:101) O (( dT ) ) regret. (Section 3)Our third algorithm MDP-E XP (cid:101) O ( √ T ) regret (ignoring dependence on other parameters) for thetabular case under an ergodic assumption. We generalize the idea and apply a particular adversariallinear bandit algorithm known as E XP (cid:101) O ( T ) to (cid:101) O ( √ T ) . In Appendix E, we also describe the connection of this algorithm with the Natural PolicyGradient algorithm proposed by Kakade [22], whose sample complexity bound is recently formalizedby Agarwal et al. [4]. We argue that under the setting considered in Section 4, their analysis translatesto a sub-optimal regret bound of (cid:101) O ( T ) , and that our improvement over theirs comes from the waywe construct the gradient estimates. Related work.
For the tabular case with finite state and action space in the infinite-horizon average-reward setting, the works [6, 20] are among the first to develop algorithms with provable sublinearregret. Over the years, numerous improvements have been proposed, see for example [28, 14, 33, 15,38, 35]. In particular, the recent work of [35] develops two model-free algorithms for this problem.We refer the reader to [35, Table 1] for comparisons of existing algorithms. As mentioned, ouralgorithm MDP-E XP (cid:101) O (cid:0) /(cid:15) (cid:1) . However, since the oracle assumption is rather strong, it is not clear how to extend theiralgorithm to the online setting.The works of [2, 3, 16] are among the first to consider the infinite-horizon average-reward settingwith function approximation and provable regret guarantees in the online setting. Their results alldepend on some uniformly mixing and uniformly excited feature conditions. As mentioned, under thesame assumption, our MDP-E XP (cid:101) O ( √ T ) regret improves the best existing resultby Hao et al. [16] with (cid:101) O ( T ) regret. Moreover, our other two algorithms ensure low regret forlinear MDPs without these extra assumptions, which do not appear before.Provable function approximation has gained growing research interest in other settings as well (finite-horizon or discounted-reward). See recent works [36, 21, 37, 12, 34] for example. In particular, our2OPO algorithm shares some similarity with the algorithm of Zanette et al. [37], which also relies onsolving an optimization problem under a constraint akin to LSVI, with no efficient implementation.Adversarial linear bandit is also known as bandit linear optimization. The E XP XP IN E XP We consider infinite-horizon average-reward Markov Decision Processes (MDPs) described by ( X , A , r, p ) where X is a Borel state space with possibly infinite number of elements, A is afinite action set, r : X × A → [ − , is the (unknown) reward function, and p ( ·| x, a ) is the(unknown) transition kernel induced by x, a , satisfying (cid:82) X p (d x (cid:48) | x, a ) = 1 (following integralnotation from [19]).The learning protocol is as follows. A learner interacts with the MDP through T steps, starting froman arbitrary initial state x ∈ X . At each step t , the learner decides an action a t , and then observesthe reward r ( x t , a t ) as well as the next state x t +1 which is a sample drawn from p ( ·| x t , a t ) . The goalof the learner is to be competitive against any fixed stationary policy. Specifically, a stationary policyis a mapping π : X → ∆ A with π ( a | x ) specifying the probability of selecting action a at state x .The long-term average reward of a stationary policy π starting from state x ∈ X is naturally definedas: J π ( x ) (cid:44) lim inf T →∞ T E (cid:34) T (cid:88) t =1 r ( x t , a t ) (cid:12)(cid:12)(cid:12) x = x, ∀ t ≥ , a t ∼ π ( ·| x t ) , x t +1 ∼ p ( ·| x t , a t ) (cid:35) . The performance measure of the learner, known as regret, is then defined as Reg T :=max π (cid:80) Tt =1 ( J π ( x ) − r ( x t , a t )) , which is the difference between the total rewards of the beststationary policy and that of the learner.However, in contrast to the finite-horizon episodic setting where ensuring sublinear regret is alwayspossible, it is known that in our setting a necessary condition is that the optimal policy has a long-termaverage reward that is independent of the initial state [6]. To this end, throughout the paper we onlyconsider a broad subclass of MDPs where a certain form of Bellman optimality equation holds [19]: Assumption 1 (Bellman optimality equation) . There exist J ∗ ∈ R and bounded measurable functions v ∗ : X → R and q ∗ : X × A → R such that the following holds for all x ∈ X and a ∈ A : J ∗ + q ∗ ( x, a ) = r ( x, a ) + E x (cid:48) ∼ p ( ·| x,a ) [ v ∗ ( x (cid:48) )] and v ∗ ( x ) = max a ∈A q ∗ ( x, a ) . (1)Indeed, under this assumption, the claim is that a policy π ∗ that deterministically selects an actionfrom argmax a q ∗ ( x, a ) at each state x is the optimal policy, with J π ∗ ( x ) = J ∗ for all x . To see this,note that for any policy π , using the Bellman optimality equation we have J π ( x ) = lim inf T →∞ T E (cid:34) T (cid:88) t =1 (cid:32) J ∗ + (cid:88) a ∈A q ∗ ( x t , a ) · π ( a | x t ) − v ∗ ( x t +1 ) (cid:33)(cid:35) ≤ lim inf T →∞ T E (cid:34) T (cid:88) t =1 ( J ∗ + v ∗ ( x t ) − v ∗ ( x t +1 )) (cid:35) = J ∗ , with equality attained by π ∗ , proving the claim. Consequently, under Assumption 1 we simply writethe regret as Reg T := (cid:80) Tt =1 ( J ∗ − r ( x t , a t )) .All existing works on regret minimization for infinite-horizon average-reward MDPs make thisassumption, either explicitly or through even stronger assumptions which imply this one. In thetabular case with a finite state space, weakly communicating MDPs is the broadest class to study3egret minimization in the literature, and is known to satisfy Assumption 1 (see [30]). More generally,Assumption 1 holds under many other common conditions; see [19, Section 3.3].Note that v ∗ ( x ) and q ∗ ( x, a ) quantify the relative advantage of starting with x and starting with ( x, a ) respectively and then acting optimally in the MDP. Therefore, v ∗ is sometimes called the statebias function and q ∗ is called the state-action bias function .For a bounded function v : X → R , we define its span as sp( v ) (cid:44) sup x,x (cid:48) ∈X | v ( x ) − v ( x (cid:48) ) | . Noticethat if ( v ∗ , q ∗ ) is a solution of Eq. (1), then a translated version ( v ∗ − c, q ∗ − c ) for any constant c isalso a solution. In the remaining of the paper, we let ( v ∗ , q ∗ ) be an arbitrary solution pair of Eq. (1)with a small span sp( v ∗ ) in the sense that sp( v ∗ ) ≤ v (cid:48) ) for any other solution ( v (cid:48) , q (cid:48) ) . We alsoassume without loss of generality | v ∗ ( x ) | ≤ sp( v ∗ ) for any x because we can perform the abovetranslation and center the values of v ∗ around zero. In this section, we present two optimism-based algorithms with sublinear regret, under only oneextra assumption that the MDP is linear (also known as low-rank MDPs). We emphasize thatearlier works for linear MDPs in the finite-horizon average-reward setting all require extra strongassumptions [2, 3, 16].Specifically, a linear MDP has a transition kernel and a reward function both linear in some state-actionfeature representation, formally summarized as:
Assumption 2 (Linear MDP) . There exist a known d -dimensional feature mapping Φ :
X × A → R d , d unknown measures µ = ( µ , µ , . . . , µ d ) over X , and an unknown vector θ ∈ R d such that for all x, x (cid:48) ∈ X and a ∈ A , p ( x (cid:48) | x, a ) = Φ( x, a ) (cid:62) µ ( x (cid:48) ) , r ( x, a ) = Φ( x, a ) (cid:62) θ . Without loss of generality, we further assume that for all x ∈ X and a ∈ A , (cid:107) Φ( x, a ) (cid:107) ≤ √ , thefirst coordinate of Φ( x, a ) is fixed to , and that (cid:107) µ ( X ) (cid:107) ≤ √ d , (cid:107) θ (cid:107) ≤ √ d , where we use µ ( X ) todenote the vector ( µ ( X ) , . . . , µ d ( X )) and µ i ( X ) (cid:44) (cid:82) X d µ i ( x ) is the total measure of X under µ i .(All norms are 2-norm.) In [21], the same assumption is made except for a different rescaling: (cid:107) Φ( x, a ) (cid:107) ≤ , (cid:107) µ ( X ) (cid:107) ≤ √ d ,and (cid:107) θ (cid:107) ≤ √ d . The reason that this is without loss of generality is not justified in [21], and forcompleteness we prove this in Appendix A. With this scaling, clearly one can augment the feature Φ( x, a ) with a constant coordinate of value and augment µ ( x ) and θ with a constant coordinate ofvalue , such that the linear structure is preserved while the scaling specified in Assumption 2 holds.Under Assumption 2, one can show that the state-action bias function q ∗ is in fact also linear in thefeatures. Lemma 1.
Under Assumption 1 and Assumption 2, there exists a fixed weight vector w ∗ ∈ R d suchthat q ∗ ( x, a ) = Φ( x, a ) (cid:62) w ∗ for all x ∈ X and a ∈ A , and furthermore, (cid:107) w ∗ (cid:107) ≤ (2 + sp( v ∗ )) √ d . Based on this lemma, a natural idea emerges: at time t , build an estimator w t of w ∗ using observeddata, then act according to the estimated long-term reward of each action given by Φ( x t , a ) (cid:62) w t .While the idea is intuitive, how to construct the estimator and, perhaps more importantly, how toincorporate the optimism principle well known to be important for learning with partial information,are highly non-trivial. In the next two subsections, we describe two different ways of doing so,leading to our two algorithms FOPO and OLSVI.FH. We present our first algorithm FOPO which is computationally inefficient but achieves regret (cid:101) O (sp( v ∗ ) √ d T ) . This is optimal in T since even in the tabular case O ( √ T ) is unimprovable [20].See Algorithm 1 for the complete pseudocode.As mentioned, the key part lies in how the estimator w t is constructed. In Algorithm 1, this is done bysolving an optimization problem over certain constraints. To understand the first constraint Eq. (2),4 lgorithm 1 Fixed-point OPtimization with Optimism (FOPO)
Parameters : < δ < , β = 20(2 + sp( v ∗ )) d (cid:112) log( T /δ ) , λ = 1 Initialize : Λ = λI where I ∈ R d × d is the identity matrix for t = 1 , . . . , T do if t = 1 or det(Λ t ) ≥ s t − ) then Set s t = t (cid:66) s t records the most recent update Let w t be the solution of the following optimization problem: max w t ,b t ∈ R d ,J t ∈ R J t s.t. w t = Λ − t t − (cid:88) τ =1 Φ( x τ , a τ ) ( r ( x τ , a τ ) − J t + v t ( x τ +1 )) + b t (2) q t ( x, a ) = Φ( x, a ) (cid:62) w t , v t ( x ) = max a q t ( x, a ) (cid:107) b t (cid:107) Λ t ≤ β, (cid:107) w t (cid:107) ≤ (2 + sp( v ∗ )) √ d else ( w t , J t , b t , v t , q t , s t ) = ( w t − , J t − , b t − , v t − , q t − , s t − ) Play a t = argmax a q t ( x t , a ) , observe r ( x t , a t ) and x t +1 Update Λ t +1 = Λ t + Φ( x t , a t )Φ( x t , a t ) (cid:62) recall that q ∗ ( x, a ) = Φ( x, a ) (cid:62) w ∗ satisfies the Bellman optimality equation: Φ( x, a ) (cid:62) w ∗ = r ( x, a ) − J ∗ + (cid:90) X v ∗ ( x (cid:48) ) p (d x (cid:48) | x, a )= r ( x, a ) − J ∗ + (cid:90) X (cid:16) max a (cid:48) Φ( x (cid:48) , a (cid:48) ) (cid:62) w ∗ (cid:17) p (d x (cid:48) | x, a ) . While p and r are unknown, we do observe samples x , . . . , x t − and r ( x , a ) , . . . , r ( x t − , a t − ) .If for a moment we assume J ∗ was known, then it is natural to try to find w t such that Φ( x τ , a τ ) (cid:62) w t ≈ r ( x τ , a τ ) − J ∗ + max a (cid:48) Φ( x τ +1 , a (cid:48) ) (cid:62) w t , ∀ τ = 1 , . . . , t − . (3)In common variants of Least-squares Value Iteration (LSVI) update, the w t on the right hand side ofEq. (3) would be replaced with another already computed weight vector w (cid:48) t that is either from the lastiteration (i.e, w t − ) or from the next layer in the case of episodic MDPs. Then solving a least-squaresproblem with regularization λ (cid:107) w t (cid:107) gives a natural estimate w t = Λ − t t − (cid:88) τ =1 Φ( x τ , a τ ) (cid:16) r ( x τ , a τ ) − J ∗ + max a (cid:48) Φ( x τ +1 , a (cid:48) ) (cid:62) w (cid:48) t (cid:17) where Λ t = λI + (cid:80) τ Parameters : < δ < , λ = 1 , H = max (cid:26) √ sp( v ∗ ) T / d / , (cid:16) sp( v ∗ ) Td (cid:17) / (cid:27) , β =40 dH (cid:112) log( T /δ ) Initialization : Λ = λI where I ∈ R d × d is the identity matrix Define : x kh = x t and a kh = a t , for t = ( k − H + h for k = 1 , . . . , T / H do Define V kH +1 ( x ) = 0 for all x . for h = H, . . . , do Compute w kh = Λ − k (cid:80) k − k (cid:48) =1 (cid:80) Hh (cid:48) =1 Φ( x k (cid:48) h (cid:48) , a k (cid:48) h (cid:48) ) (cid:16) r ( x k (cid:48) h (cid:48) , a k (cid:48) h (cid:48) ) + V kh +1 ( x k (cid:48) h (cid:48) +1 ) (cid:17) Define (cid:98) Q kh ( x, a ) = w kh · Φ( x, a ) + β (cid:113) Φ( x, a ) (cid:62) Λ − k Φ( x, a ) Define Q kh ( x, a ) = min (cid:110) (cid:98) Q kh ( x, a ) , H (cid:111) and V kh ( x ) = max a Q kh ( x, a ) for h = 1 , . . . , H do Play a kh = argmax a Q kh ( x kh , a ) and observe x kh and r ( x kh , a kh ) Update Λ k +1 = Λ k + (cid:80) Hh =1 Φ( x kh , a kh )Φ( x kh , a kh ) (cid:62) linear bandits. However, while they use this lazy update only to save computation, here we use it tomake sure that w t does not change too often, which is critical for our regret analysis.We point out that the closest existing algorithm we are aware of is the one from a recent work [37]for the finite-horizon setting. Just like theirs, our algorithm also does not admit an efficient imple-mentation due to the complicated nature of the optimization problem. However, it can be shown thatthe constraint set is non-empty with ( w t , b t , J t ) = ( w ∗ , b, J ∗ ) for some b being a feasible solution(with high probability). This fact also immediately implies that J t is indeed an optimistic estimatorof J ∗ in the following sense: Lemma 2. With probability at least − δ , Algorithm 1 ensures J t ≥ J ∗ for all t . With the help of this lemma, we prove the following regret bound of FOPO with optimal (in T ) rate. Theorem 3. Under Assumptions 1 and 2, FOPO guarantees with probability at least − δ : Reg T = O (cid:16) sp( v ∗ ) log( T /δ ) √ d T (cid:17) . Next, we present another optimism-based algorithm which can be implemented efficiently, albeitwith a suboptimal regret guarantee. The high-level idea is still based on LSVI. However, since we donot know how to efficiently solve a fixed-point problem as in Algorithm 1, we “open the loop” bysolving a finite-horizon problem instead. More specifically, we divide the T rounds into T / H episodeseach with H rounds, and run a finite-horizon optimistic LSVI algorithm over the episodes as in [21].The resulted algorithm is shown in Algorithm 2. For simplicity, we replace the time index t witha combination of an episode index k and a step index h within the episode. This gives the relation t = ( k − H + h , and ( x t , a t ) is written as ( x kh , a kh ) . At the beginning of each episode k , thelearner computes a set of Q-function parameters w k , . . . , w kH by backward calculation using allhistorical data (Line 3 to Line 6). Note that Line 4 is now simply an assignment step (as opposedto a fixed-point problem) since V kh +1 is computed already when in step h . In Line 5, we introduceoptimism by incorporating a bonus term β (cid:107) Φ( x, a ) (cid:107) Λ − k into the definition of (cid:98) Q kh ( x, a ) , and hence Q kh ( x, a ) . Then in step h of episode k , the learner simply follows the greedy choice suggested by Q kh ( x kh , · ) (Line 8).Note that Algorithm 2 is slightly different from the version in [21]: they maintain a differentcovariance matrix Λ kh separately for each step h , but we only maintain a single Λ k for all h . Similarly,their w kh is computed using only data related to step h from all previous episodes, while ours iscomputed using all previous data. This is because in our problem, the steps within an episode6hare the same transition and reward functions, and consequently they can be learned jointly, whicheventually reduces the sample complexity.Clearly, this reduction ensures that the learner has low regret against the best policy for the finite-horizon problem that we create. However, since our original problem is about average-reward overinfinite horizon, we need to argue that the best finite-horizon policy also performs well under theinfinite-horizon criteria. Indeed, we show that the sub-optimality gap of the best finite-horizon policyis bounded by some quantity governed by sp( v ∗ ) /H , which is intuitive since the larger H is, thesmaller the gap becomes (see Lemma 13).In our analysis, for a fixed episode we define π = ( π , . . . , π H ) as the finite-horizon policy (i.e., alength- H sequence of policies), where each π h is a mapping X → ∆ A . For any such finite-horizonpolicy π , we define Q πh ( x, a ) and V πh ( x ) as the value functions for the finite-horizon problem wecreate, which satisfy: V πH +1 ( x ) = 0 and for h = H, . . . , , Q πh ( x, a ) = r ( x, a ) + E x (cid:48) ∼ p ( ·| x,a ) [ V πh +1 ( x (cid:48) )] , V πh ( x ) = E a ∼ π h ( ·| x ) Q πh ( x, a ) . (4)The analysis of the algorithm relies on the following key lemma, which shows that Q kh ( x, a ) upperbounds Q πh ( x, a ) for any π . Lemma 4. With probability at least − δ , Algorithm 2 ensures for any finite-horizon policy π that ≤ Q kh ( x, a ) − Q πh ( x, a ) ≤ E x (cid:48) ∼ p ( ·| x,a ) (cid:2) V kh +1 ( x (cid:48) ) − V πh +1 ( x (cid:48) ) (cid:3) + 2 β (cid:107) Φ( x, a ) (cid:107) Λ − k for all x, a, k, h . With the help of Lemma 4, we prove the final regret bound of OLSVI.FH stated in the next theorem(proof deferred to the appendix). Theorem 5. Under Assumptions 1 and 2, OLSVI.FH guarantees with probability at least − δ : Reg T = (cid:101) O (cid:16)(cid:112) sp( v ∗ )( dT ) + (sp( v ∗ ) dT ) (cid:17) . Note that although our bound is suboptimal, OLSVI.FH is the first efficient algorithm with sublinearregret for this setting under only Assumptions 1 and 2. XP There are two disadvantages of the optimism-based algorithms introduced in the last section. First,they require the transition kernel and reward function to be both linear in the feature (Assumption 2),which is restrictive and might not hold especially when d is small. Second, even for the polynomial-time algorithm OLSVI.FH, it is still computationally intensive because in Line 4 of the algorithm, V kh +1 is applied to all previous states, and every evaluation of V kh +1 requires computing (cid:107) Φ( x, a ) (cid:107) Λ k .Since this is done for every k , the total computational cost of the algorithm is super-linear in T .In fact, all existing optimism-based algorithms with linear function approximation suffer the sameissue [36, 21, 37].To this end, we propose yet another algorithm based on very different ideas. It is computationally lessintensive and it enjoys (cid:101) O ( √ T ) regret, albeit under a different (and non-comparable) set of assumptionscompared to those in Section 3. Note that these are the same assumptions made in [2, 16]. Below, westart with stating these assumptions, followed by the description of our algorithm.The first assumption we make is that the MDP is uniformly mixing. Assumption 3 (Uniform Mixing) . There exists a constant t mix ≥ such that for any policy π , andany distributions ν , ν ∈ ∆ X over the state space, (cid:107) P π ν − P π ν (cid:107) TV ≤ e − /t mix (cid:107) ν − ν (cid:107) TV , where ( P π ν )( x (cid:48) ) = (cid:82) X (cid:80) a ∈A π ( a | x ) p ( x (cid:48) | x, a )d ν ( x ) and (cid:107) · (cid:107) TV is the total variation. Under this uniform mixing assumption, we are able to define the stationary state distribution under apolicy π as ν π = ( P π ) ∞ ν for an arbitrary initial distribution ν . Also, now we not only have theBellman optimality equation (1) (that is, Assumption 3 implies Assumption 1), but also a Bellmanequation for every policy π , as shown in the following lemma.7 lgorithm 3 MDP-E XP Parameter : N = 8 t mix log T , B = 32 N log( dT ) σ − , η = min (cid:110)(cid:112) / ( T t mix ) , σ/ (24 N ) (cid:111) . for k = 1 , . . . , T / B do (cid:66) k indexes an epoch Define policy π k such that π k ( a | x ) ∝ exp (cid:16) η (cid:80) k − j =1 Φ( x, a ) (cid:62) w j (cid:17) for every x ∈ X for t = ( k − B + 1 , . . . , kB do Play a t ∼ π k ( ·| x t ) , observe r t ( x t , a t ) and x t +1 (cid:66) Execute π k in the entire epoch for m = 1 , . . . , B / N do (cid:66) m indexes a trajectory Define τ k,m = ( k − B + 2 N ( m − 1) + N + 1 (cid:66) first step of the m -th trajectory Compute R k,m = (cid:80) τ k,m + N − t = τ k,m r ( x t , a t ) (cid:66) total reward of the m -th trajectory Compute (cid:66) λ min denotes the minimum eigenvalue M k = B N (cid:88) m =1 (cid:88) a π k ( a | x τ k,m )Φ( x τ k,m , a )Φ( x τ k,m , a ) (cid:62) ,w k = (cid:40) M − k (cid:80) B N m =1 Φ( x τ k,m , a τ k,m ) R k,m , if λ min ( M k ) ≥ Bσ N , else. Lemma 6. Suppose Assumption 3 holds. For any π , its long-term average reward J π ( x ) is indepen-dent of the initial state x , thus denoted as J π . Also, the following Bellman equation holds: J π + q π ( x, a ) = r ( x, a ) + E x (cid:48) ∼ p ( ·| x,a ) [ v π ( x (cid:48) )] and v π ( x ) = (cid:88) a ∈A π ( a | x ) q π ( x, a ) for some measurable functions v π : X → [ − t mix , t mix ] and q π : X × A → [ − t mix , t mix ] with (cid:82) X v π ( x )d ν π ( x ) = 0 . On the other hand, with this assumption (stronger than Assumption 1), we can replace Assumption 2(linear MDP) with the following weaker one that only requires the bias function q π to be linear. (InLemma 14 in the appendix, we show that this is indeed weaker than the linear MDP assumption.) Assumption 4 (Linear bias function) . There exists a known d -dimensional feature mapping Φ : X × A → R d such that for every policy π , q π ( x, a ) can be written as Φ( x, a ) (cid:62) w π for someweight vector w π ∈ R d . Again, without loss of generality (justified in Appendix A), we assumethat for all x, a , (cid:107) Φ( x, a ) (cid:107) ≤ √ holds, the first coordinate of Φ( x, a ) is fixed to , and for all π , (cid:107) w π (cid:107) ≤ t mix √ d . The last assumption we make is uniformly excited features, which intuitively guarantees that everypolicy is explorative in the feature space. Assumption 5 (Uniformly excited features) . There exists σ > such that for any π , λ min (cid:32)(cid:90) X (cid:32)(cid:88) a π ( a | x )Φ( x, a )Φ( x, a ) (cid:62) (cid:33) d ν π ( x ) (cid:33) ≥ σ, where λ min denotes the smallest eigenvalue. This assumption is needed due to the nature of our algorithm that only performs local search of theparameters. It can potentially be weakened if we combine our algorithm with the idea of Abbasi-Yadkori et al. [3] (details omitted). Algorithm. We are now ready to present our MDP-E XP XP XP epochs of equal length B = (cid:101) O ( dt mix /σ ) . In each epoch k , thealgorithm executes a fixed policy π k (explained later), and collects B N disjoint trajectories, each of8ength N = (cid:101) O ( t mix ) . Between every two consecutive trajectories, there is a window of length N inwhich the algorithm does not collect any samples, so that the correlation of samples from differenttrajectories is reduced. See Figure 1 in the appendix for an illustration.In the analysis, we show that the expected total reward of a trajectory is roughly q π ( x τ , a τ ) + N J π (Lemma 15), where π is the policy used to collect that trajectory and τ is the first step of the trajectory.By Assumption 4 we have q π ( x τ , a τ ) + N J π = Φ( x τ , a τ ) (cid:62) ( w π + N J π e ) . This observationallows us to draw a connection between this problem and adversarial linear bandits. To see this, firstnote that the regret is roughly B (cid:80) T/Bk =1 ( J ∗ − J π k ) . By the standard value difference lemma [23,Lemma 5.2.1], we have T/B (cid:88) k =1 ( J ∗ − J π k ) = (cid:90) X T/B (cid:88) k =1 (cid:88) a ( π ∗ ( a | x ) − π k ( a | x )) q π k ( x, a ) d ν π ∗ ( x ) where according to the previous observation and the fact (cid:80) a ( π ∗ ( a | x ) − π k ( a | x )) N J π k =0 , the term in the parentheses with respect to a fixed state x can be further written as (cid:80) T/Bk =1 (cid:80) a ( π ∗ ( a | x ) − π k ( a | x )) Φ( x, a ) (cid:62) ( w π k + N J π k e ) . This is exactly the regret of a standardonline learning problem over a set of actions { Φ( x, a ) } a ∈A with linear reward functions parameter-ized by a weight vector ( w π k + N J π k e ) at step k . Moreover, since we do not observe this weightbut have access to the reward of a trajectory whose mean is roughly Φ( x, a ) (cid:62) ( w π k + N J π k e ) asmentioned, we are in the so-called bandit setting. In fact, since the weight can generally changearbitrarily over time (because π k is changing), this is an adversarial linear bandit problem.With this connection in mind, the idea behind MDP-E XP XP k the algorithm constructsan estimator w k for the reward vector w π k + N J π k e . The construction mostly follows the idea ofE XP 2, with the only difference being the way of controlling the variance — in the original E XP (cid:107) w k (cid:107) isnot too large, we also set it to if λ min ( M k ) is too small). Finally, with these estimators, the policyfor epoch k is computed by a standard exponential weight update rule (see Line 2).We emphasize that MDP-E XP XP w k and calculate π ( ·| x t ) on the fly for each x t , which is even more efficient than optimism-based algorithms. It also enjoys a favorable regretguarantee of order (cid:101) O ( √ T ) , as shown below. Once again, the best existing result under the same setof assumptions is (cid:101) O ( T / ) from [16]. Theorem 7. Under Assumptions 3, 4, 5, MDP-E XP ensures E [ Reg T ] = (cid:101) O (cid:16) σ (cid:112) t mix T (cid:17) . Note that while the bound in Theorem 7 seemingly does not depend on d , the dependence is in factimplicit because σ = Ω( d ) always holds by the definition of σ (see Remark 1 in the appendix). Weprovide a proof for this fact along with the proof of Theorem 7 in the appendix. Connections to Natural Policy Gradient. Finally, we remark that although MDP-E XP XP 2, it is related to the (in fact much earlier) reinforcement learningalgorithm Natural Policy Gradient (NPG) [22] under softmax parameterization. The connectionbetween softmax-parameterized NPG and the exponential weight update was formalized in a recentwork by Agarwal et al. [4]. In Appendix E, we first restate the connection. Then we compare theimplementation details of MDP-E XP XP In this work, we provide three new algorithms for learning infinite-horizon average-reward MDPs withlinear function approximation, significantly extending and improving previous works. One key openquestion is how to achieve the optimal (cid:101) O ( √ T ) regret efficiently under the linear MDP assumption.In Appendix E, we also discuss another open question related to weakening Assumption 5 whilemaintaining a similar regret bound. 9 eferences [1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linearstochastic bandits. In Advances in Neural Information Processing Systems , pages 2312–2320,2011.[2] Yasin Abbasi-Yadkori, Peter Bartlett, Kush Bhatia, Nevena Lazic, Csaba Szepesvari, and GellértWeisz. Politex: Regret bounds for policy iteration using expert prediction. In InternationalConference on Machine Learning , pages 3692–3702, 2019.[3] Yasin Abbasi-Yadkori, Nevena Lazic, Csaba Szepesvari, and Gellert Weisz. Exploration-enhanced politex. arXiv preprint arXiv:1908.10479 , 2019.[4] Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. Optimality and approxi-mation with policy gradient methods in markov decision processes. In Conference on LearningTheory , 2020.[5] Keith Ball et al. An elementary introduction to modern convex geometry.[6] Peter L Bartlett and Ambuj Tewari. Regal: A regularization based algorithm for reinforcementlearning in weakly communicating mdps. In Proceedings of the Twenty-Fifth Conference onUncertainty in Artificial Intelligence , pages 35–42. AUAI Press, 2009.[7] Steven J Bradtke and Andrew G Barto. Linear least-squares algorithms for temporal differencelearning. Machine learning , 22(1-3):33–57, 1996.[8] Sébastien Bubeck, Nicolo Cesa-Bianchi, and Sham M Kakade. Towards minimax policies foronline linear optimization with bandit feedback. In Conference on Learning Theory , pages41–1, 2012.[9] Nicolo Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. Journal of Computer andSystem Sciences , 78(5):1404–1422, 2012.[10] Yichen Chen, Lihong Li, and Mengdi Wang. Scalable bilinear π learning using state and actionfeatures. In International Conference on Machine Learning , pages 834–843, 2018.[11] Varsha Dani, Sham M Kakade, and Thomas P Hayes. The price of bandit information for onlineoptimization. In Advances in Neural Information Processing Systems , 2008.[12] Kefan Dong, Jian Peng, Yining Wang, and Yuan Zhou. √ n -regret for learning in markovdecision processes with function approximation and low bellman rank. In Conference onLearning Theory , 2020.[13] Yoav Freund and Robert E Schapire. A desicion-theoretic generalization of on-line learning andan application to boosting. In European conference on computational learning theory , pages23–37. Springer, 1995.[14] Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In International Conference onMachine Learning , pages 1573–1581, 2018.[15] Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Improved analysis of ucrl2 with empiricalbernstein inequality. arXiv preprint arXiv:2007.05456 , 2020.[16] Botao Hao, Nevena Lazic, Yasin Abbasi-Yadkori, Pooria Joulani, and Csaba Szepesvari. Prov-ably efficient adaptive approximate policy iteration. arXiv preprint arXiv:2002.03069 , 2020.[17] Nick Harvey. Matrix chernoff bounds. In .[18] Elad Hazan and Zohar Karnin. Volumetric spanners: An efficient exploration basis for learning. Journal of Machine Learning Research , 17(119):1–34, 2016.[19] Onésimo Hernández-Lerma. Adaptive Markov control processes , volume 79. Springer Science& Business Media, 2012. 1020] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcementlearning. Journal of Machine Learning Research , 11(Apr):1563–1600, 2010.[21] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcementlearning with linear function approximation. In Conference on Learning Theory , 2020.[22] Sham M Kakade. A natural policy gradient. In Advances in neural information processingsystems , pages 1531–1538, 2002.[23] Sham Machandranath Kakade. On the sample complexity of reinforcement learning . PhD thesis,University College London, 2003.[24] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lilli-crap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deepreinforcement learning. In International conference on machine learning , pages 1928–1937,2016.[25] Gergely Neu and Julia Olkhovskaya. Online learning in mdps with linear function approximationand bandit feedback. arXiv preprint arXiv:2007.01612 , 2020.[26] Gergely Neu, András György, Csaba Szepesvári, and András Antos. Online markov decisionprocesses under bandit feedback. IEEE Transactions on Automatic Control , 59:676–691, 2013.[27] Gergely Neu, Anders Jonsson, and Vicenç Gómez. A unified view of entropy-regularizedmarkov decision processes. arXiv preprint arXiv:1705.07798 , 2017.[28] Ronald Ortner. Regret bounds for reinforcement learning via markov chain concentration. Journal of Artificial Intelligence Research , 67:115–128, 2020.[29] Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomizedvalue functions. In International Conference on Machine Learning , pages 2377–2386, 2016.[30] Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming .John Wiley & Sons, 2014.[31] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust regionpolicy optimization. In International conference on machine learning , pages 1889–1897, 2015.[32] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximalpolicy optimization algorithms. arXiv preprint arXiv:1707.06347 , 2017.[33] Mohammad Sadegh Talebi and Odalric-Ambrym Maillard. Variance-aware regret bounds forundiscounted reinforcement learning in mdps. In Algorithmic Learning Theory , pages 770–805,2018.[34] Ruosong Wang, Ruslan Salakhutdinov, and Lin F Yang. Provably efficient reinforcementlearning with general value function approximation. arXiv preprint arXiv:2005.10804 , 2020.[35] Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, Hiteshi Sharma, and Rahul Jain. Model-free reinforcement learning in infinite-horizon average-reward markov decision processes. In International Conference on Machine Learning , 2020.[36] Lin F Yang and Mengdi Wang. Reinforcement leaning in feature space: Matrix bandit, kernels,and regret bound. In International Conference on Machine Learning , 2020.[37] Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, and Emma Brunskill. Learningnear optimal policies with low inherent bellman error. In International Conference on MachineLearning , 2020.[38] Zihan Zhang and Xiangyang Ji. Regret minimization for reinforcement learning by evaluatingthe optimal bias function. In Advances in Neural Information Processing Systems , 2019.11 Auxiliary Lemmas Related to Assumption 2 and Assumption 4 In this section, we provide justification for the scaling assumption made in Assumption 2 andAssumption 4, showing that they are indeed without loss of generality as long as one transforms andnormalizes the features in some way beforehand. Lemma 8. Let Φ = { Φ( x, a ) : x ∈ X , a ∈ A} ⊂ R d be a feature set with rank d . Then thereexists an invertible linear transformation v → Av with A ∈ R d × d such that for any function F : X × A → R defined by F ( x, a ) = Φ( x, a ) (cid:62) z, for some z ∈ R d , we have (cid:107) A Φ( x, a ) (cid:107) ≤ and (cid:107) A − z (cid:107) ≤ √ dF max where F max (cid:44) sup x,a | F ( x, a ) | . This lemma implies that if we use the transformed feature Φ (cid:48) ( x, a ) = A Φ( x, a ) with (cid:107) Φ (cid:48) ( x, a ) (cid:107) ≤ ,then any function F ( x, a ) = Φ( x, a ) (cid:62) z can be equivalently written as F ( x, a ) = Φ (cid:48) ( x, a ) (cid:62) z (cid:48) with z (cid:48) = A − z and (cid:107) z (cid:48) (cid:107) ≤ √ dF max . Therefore, taking z to be µ ( X ) or θ for Assumption 2, or w π forAssumption 4, with the corresponding F ( x, a ) being (cid:82) X p ( x (cid:48) | x, a )d x (cid:48) , r ( x, a ) , and q π ( x, a ) , and F max being , , and t mix (Lemma 6) respectively, justifies the scaling stated in these assumptions.Notice that the transformation A only depends on the feature set Φ , but not F or z . Thus we canperform this transformation as long as we know the feature map. This is similar to a standardpreprocessing step of feature normalizing in machine learning. Proof of Lemma 8. Define − Φ = {− Φ( x, a ) : x ∈ X , a ∈ A} and K (Φ) = Φ ∪ − Φ . We first arguethat for any bounded feature set Φ ⊂ R d , there exists an invertible linear transformation v → Av with A ∈ R d × d such that the minimum volume enclosing ellipsoid (MVEE) of the transformed featureset K ( A Φ) where A Φ (cid:44) { A Φ( x, a ) : x ∈ X , a ∈ A} is the unit sphere. This can be seen by thefollowing: notice that K (Φ) is always symmetric around the origin, and so is its MVEE. Suppose thatthe MVEE of K (Φ) is { u ∈ R d : u (cid:62) Bu = 1 } for some invertible B (otherwise Φ is not full-rank).Then if we pick A = B , the MVEE of K ( A Φ) will be the unit sphere.Now consider this new feature Φ (cid:48) ( x, a ) (cid:44) A Φ( x, a ) with the MVEE of K (Φ (cid:48) ) being the unit sphere(which implies (cid:107) Φ (cid:48) ( x, a ) (cid:107) ≤ ). Defining z (cid:48) = A − z , we have Φ (cid:48) ( x, a ) (cid:62) z (cid:48) = Φ( x, a ) (cid:62) z = F ( x, a ) .Below, we show that (cid:107) z (cid:48) (cid:107) ≤ √ dF max .By Lemma 9 below, there exists a subset M = { u , . . . , u m } ⊆ K (Φ (cid:48) ) that lie on the unit sphere,and non-negative weights c , . . . , c m , such that m (cid:88) i =1 c i u i u (cid:62) i = I d . Taking trace on both sides, we get (cid:80) mi =1 c i = d .Note that we have F ( x, a ) = Φ (cid:48) ( x, a ) (cid:62) z (cid:48) for all x, a . Specially, applying this to the elements in M ,and using the fact that | F ( x, a ) | ≤ F max , we get dF = m (cid:88) i =1 c i F ≥ m (cid:88) i =1 c i ( u (cid:62) i z (cid:48) ) = z (cid:48)(cid:62) (cid:32) m (cid:88) i =1 c i u i u (cid:62) i (cid:33) z (cid:48) = (cid:107) z (cid:48) (cid:107) , which implies (cid:107) z (cid:48) (cid:107) ≤ √ dF max and finishes the proof. Lemma 9. ([18, Theorem 6], [5]) Let K be a symmetric set such that its MVEE is the unit sphere.Then there exist m ≤ d ( d + 1) / − contact points of K and the sphere u , . . . u m and non-negativeweights c , . . . , c m such that (cid:80) i c i u i = 0 and (cid:80) i c i u i u (cid:62) i = I d . B Auxiliary Lemmas for Self-normalized Processes In this section, we provide some useful lemmas related to the concentration of self-normalizedprocesses . The first two are taken directly from [21, Appendix D.2].12 emma 10 (Concentration of Self-Normalized Processes) . Let { ε t } ∞ t =1 be a real-valued stochasticprocess with corresponding filtration {F t } ∞ t =0 . Let ε t |F t − be zero-mean and σ -subgaussian, that is, E [ ε t |F t − ] = 0 and E [ e λε t |F t − ] ≤ e λ σ / for all λ ∈ R .Let { φ t } ∞ t =0 be an R d -valued stochastic process where φ t ∈ F t − . Assume that Λ is a d × d positivedefinite matrix, and let Λ t = Λ + (cid:80) t − s =1 φ s φ (cid:62) s . Then for any δ > , with probability at least − δ ,we have for all t > , (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) t − (cid:88) s =1 φ s ε s (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) − t ≤ σ log (cid:20) det(Λ t ) / det(Λ ) − / δ (cid:21) . Lemma 11. Let { x t } ∞ t =1 be a stochastic process on state space X with corresponding filtration {F t } ∞ t =0 , { φ t } ∞ t =0 be an R d -valued stochastic process where φ t ∈ F t − and (cid:107) φ t (cid:107) ≤ , Λ t = λI + (cid:80) t − s =1 φ s φ (cid:62) s , and V ⊆ R X be an arbitrary set of functions defined on X , with N ε being its ε -covering number with respect to dist ( v, v (cid:48) ) = sup x | v ( x ) − v ( x (cid:48) ) | for some fixed ε > . Then forany δ > , with probability at least − δ , for all t > and any v ∈ V so that sup x | v ( x ) | ≤ H , wehave (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) t − (cid:88) s =1 φ s (cid:16) v ( x s ) − E [ v ( x s ) |F t − ] (cid:17)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) − t ≤ H (cid:20) d (cid:18) t + λλ (cid:19) + log N ε δ (cid:21) + 8 t ε λ . Lemma 12. Let V be a class of mappings from X to R parametrized by α = ( α , α , . . . , α P ) ∈ R P with α i ∈ [ − B, B ] for all i . Suppose that for any v ∈ V (parameterized by α ) and v (cid:48) ∈ V (parameterized by α (cid:48) ), the following holds: sup x ∈X | v ( x ) − v (cid:48) ( x ) | ≤ L (cid:107) α − α (cid:48) (cid:107) . Let N ε be be the ε -covering number of V with respect to the distance dist ( v, v (cid:48) ) = sup x ∈X | v ( x ) − v ( x (cid:48) ) | . Then log N ε ≤ P log (cid:18) BLPε (cid:19) . Proof. If α and α (cid:48) are such that | α i − α (cid:48) i | ≤ εLP for all i , then we havedist ( v, v (cid:48) ) = sup x ∈X | v ( x ) − v (cid:48) ( x ) | ≤ L × P (cid:88) i =1 | α i − α (cid:48) i | ≤ ε. Therefore, the following set constitutes an ε -cover for V : (cid:26) α ∈ R P : α i = kεLP for some k ∈ Z (cid:27) ∩ [ − B, B ] P The number of elements in this sets is upper bounded by (cid:0) BLPε (cid:1) P . C Omitted Analysis in Section 3 Proof of Lemma 1. By the two assumptions, we have (with e = (1 , , . . . , ) q ∗ ( x, a ) = r ( x, a ) − J ∗ + E x (cid:48) ∼ p ( ·| x,a ) [ v ∗ ( x (cid:48) )]= Φ( x, a ) (cid:62) θ − J ∗ Φ( x, a ) (cid:62) e + Φ( x, a ) (cid:62) (cid:90) X v ∗ ( x (cid:48) )d µ ( x (cid:48) )= Φ( x, a ) (cid:62) (cid:18) θ − J ∗ e + (cid:90) X v ∗ ( x (cid:48) )d µ ( x (cid:48) ) (cid:19) . Therefore, we can define w ∗ = θ − J ∗ e + (cid:82) X v ∗ ( x (cid:48) )d µ ( x (cid:48) ) , proving the first claim. Furthermore, (cid:107) w ∗ (cid:107) ≤ (cid:107) θ (cid:107) + 1 + sup x (cid:48) ∈X | v ∗ ( x (cid:48) ) | × (cid:107) µ ( X ) (cid:107) ≤ √ d + 1 + sp( v ∗ ) × √ d ≤ (2 + sp( v ∗ )) √ d, which proves the second claim. 13 .1 Omitted Analysis in Section 3.1 Proof of Lemma 2. It suffices to show that with probability at least − δ , ( w ∗ , b, J ∗ ) for some b is afeasible solution of the optimization problem (since J t is the optimal solution). To show this, firstnote that w ∗ = Λ − t t − (cid:88) τ =1 Φ( x τ , a τ )Φ( x τ , a τ ) w ∗ + λ Λ − t w ∗ (definition of Λ t ) = Λ − t t − (cid:88) τ =1 Φ( x τ , a τ ) (cid:0) r ( x τ , a τ ) − J ∗ + E x (cid:48) ∼ p ( ·| x τ ,a τ ) v ∗ ( x (cid:48) ) (cid:1) + λ Λ − t w ∗ ( q ∗ ( x τ , a τ ) = Φ( x τ , a τ ) w ∗ and Eq. (1)) = Λ − t t − (cid:88) τ =1 Φ( x τ , a τ ) ( r ( x τ , a τ ) − J ∗ + v ∗ ( x τ +1 )) + λ Λ − t w ∗ + (cid:15) ∗ t , where (cid:15) ∗ t = Λ − t t − (cid:88) τ =1 Φ( x τ , a τ ) (cid:0) E x (cid:48) ∼ p ( ·| x τ ,a τ ) v ∗ ( x (cid:48) ) − v ∗ ( x τ +1 ) (cid:1) . Using Lemma 10 with ε τ = E x (cid:48) ∼ p ( ·| x τ ,a τ ) v ∗ ( x (cid:48) ) − v ∗ ( x τ +1 ) and φ τ = Φ( x τ , a τ ) , we have withprobability at least − δ (note that given the past ε τ is zero-mean and in the range [ − sp( v ∗ ) , sp( v ∗ )] thus sp( v ∗ ) -subgaussian), (cid:107) (cid:15) ∗ t (cid:107) Λ t = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) t − (cid:88) τ =1 φ τ ε τ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Λ − t ≤ √ v ∗ ) (cid:114) log det(Λ t ) / / det(Λ ) / δ ≤ √ v ∗ ) (cid:115) log (1 + Tλd ) d/ δ ≤ β , where we use the fact det(Λ t ) ≤ (cid:18) TR (Λ t ) d (cid:19) d = (cid:32) λd + (cid:80) t − τ =1 (cid:107) φ τ (cid:107) d (cid:33) d ≤ (cid:18) λd + 2 Td (cid:19) d and the definition of β . Also, λ (cid:107) Λ − t w ∗ (cid:107) Λ t = λ (cid:107) w ∗ (cid:107) Λ − t ≤ √ λ (cid:107) w ∗ (cid:107) ≤ (2 + sp( v ∗ )) √ λd ≤ β (Lemma 1). Define b = λ Λ − t w ∗ + (cid:15) ∗ t , we have thus proven that (cid:107) b (cid:107) Λ t ≤ β holds with probabilityat least − δ . which proves that ( w ∗ , b, J ∗ ) is a solution of the optimization problem, finishing theproof. Proof of Theorem 3. Without loss of generality, we assume sp( v ∗ ) ≤ √ T , d ≤ √ T , and T ≥ (otherwise the bound is vacuous). Fix t and let s = s t . Define (cid:15) s = Λ − s s − (cid:88) τ =1 Φ( x τ , a τ ) (cid:0) v s ( x τ +1 ) − E x (cid:48) ∼ p ( ·| x τ ,a τ ) v s ( x (cid:48) ) (cid:1) . Using the identity w ∗ = Λ − s s − (cid:88) τ =1 Φ( x τ , a τ )Φ( x τ , a τ ) (cid:62) w ∗ + λ Λ − s w ∗ = Λ − s s − (cid:88) τ =1 Φ( x τ , a τ ) (cid:0) r ( x τ , a τ ) − J ∗ + E x (cid:48) ∼ p ( ·| x τ ,a τ ) v ∗ ( x (cid:48) ) (cid:1) + λ Λ − s w ∗ , w s , we have w s − w ∗ = Λ − s s − (cid:88) τ =1 Φ( x τ , a τ ) ( r ( x τ , a τ ) − J s + v s ( x τ +1 )) + b s − Λ − s s − (cid:88) τ =1 Φ( x τ , a τ ) (cid:0) r ( x τ , a τ ) − J ∗ + E x (cid:48) ∼ p ( ·| x τ ,a τ ) v ∗ ( x (cid:48) ) (cid:1) − λ Λ − s w ∗ = Λ − s s − (cid:88) τ =1 Φ( x τ , a τ ) (cid:0) J ∗ − J s + E x (cid:48) ∼ p ( ·| x τ ,a τ ) [ v s ( x (cid:48) ) − v ∗ ( x (cid:48) )] (cid:1) + (cid:15) s + b s − λ Λ − s w ∗ = Λ − s s − (cid:88) τ =1 Φ( x τ , a τ )Φ( x τ , a τ ) (cid:62) (cid:18) J ∗ e − J s e + (cid:90) X ( v s ( x (cid:48) ) − v ∗ ( x (cid:48) )) d µ ( x (cid:48) ) (cid:19) + (cid:15) s + b s − λ Λ − s w ∗ = J ∗ e − J s e + (cid:90) X ( v s ( x (cid:48) ) − v ∗ ( x (cid:48) )) d µ ( x (cid:48) ) + (cid:15) s + b s − λ Λ − s (cid:18) J ∗ e − J s e + (cid:90) X ( v s ( x (cid:48) ) − v ∗ ( x (cid:48) )) d µ ( x (cid:48) ) (cid:19) − λ Λ − s w ∗ . Therefore, q s ( x t , a t ) − q ∗ ( x t , a t ) = Φ( x t , a t ) (cid:62) ( w s − w ∗ ) ≤ ( J ∗ − J s ) + E x (cid:48) ∼ p ( ·| x t ,a t ) [ v s ( x (cid:48) ) − v ∗ ( x (cid:48) )] + Φ( x t , a t ) (cid:62) ( (cid:15) s + b s + λ Λ − s u s ) , where u s (cid:44) − (cid:0) J ∗ e − J s e + (cid:82) X ( v s ( x (cid:48) ) − v ∗ ( x (cid:48) )) d µ ( x (cid:48) ) (cid:1) − w ∗ .Next, under the event J ∗ ≤ J s which holds with probability at least − δ (Lemma 2), we continuewith q s ( x t , a t ) − q ∗ ( x t , a t ) (5) ≤ E x (cid:48) ∼ p ( ·| x t ,a t ) [ v s ( x (cid:48) ) − v ∗ ( x (cid:48) )] + Φ( x t , a t ) (cid:62) ( (cid:15) s + b s + λ Λ − s u s ) ≤ E x (cid:48) ∼ p ( ·| x t ,a t ) [ v s ( x (cid:48) ) − v ∗ ( x (cid:48) )] + (cid:107) Φ( x t , a t ) (cid:107) Λ − s (cid:107) (cid:15) s + b s + λ Λ − s u s (cid:107) Λ s ≤ E x (cid:48) ∼ p ( ·| x t ,a t ) [ v s ( x (cid:48) ) − v ∗ ( x (cid:48) )] + 2 (cid:107) Φ( x t , a t ) (cid:107) Λ − t (cid:107) (cid:15) s + b s + λ Λ − s u s (cid:107) Λ s , (6)where the second inequality uses Hölder’s inequality and the last one uses the fact Λ s (cid:22) Λ t (cid:22) s according to the lazy update schedule of the algorithm.By the algorithm, (cid:107) b s (cid:107) Λ s ≤ β . To bound (cid:107) (cid:15) s (cid:107) Λ s , we use Lemma 11 and Lemma 12: Define ε τ = v s ( x τ +1 ) − E x (cid:48) ∼ p ( ·| x τ ,a τ ) v s ( x (cid:48) ) and φ τ = √ Φ( x τ , a τ ) . With Lemma 11 and the fact | v s ( x ) | ≤ √ (cid:107) w s (cid:107) ≤ (2 + sp( v ∗ )) √ d , we have that with probability at least − δ , for all s : (cid:107) (cid:15) s (cid:107) Λ s = √ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) s − (cid:88) τ =1 φ τ ε τ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Λ − s ≤ v ∗ )) √ d (cid:114) d s + λλ + log N ε δ + 4 (cid:114) s ε λ , where ε = T and N ε is the ε -cover for the function class of v s , which can be bounded with the helpof Lemma 12 (with α = w s , P = d , B = (2 + sp( v ∗ )) √ d , and L = √ ) by log N ε ≤ d log 2(2 + sp( v ∗ )) √ d × √ dT − ≤ d log T (using the conditions stated at the beginning of the proof). Therefore, we have (cid:107) (cid:15) s (cid:107) Λ s ≤ v ∗ )) √ d (cid:112) d log T + log(1 /δ ) + 4 = O ( β ) , (7)for all s with probability at least − δ . Next, we bound (cid:107) λ Λ − s u s (cid:107) Λ s as: (cid:107) λ Λ − s u s (cid:107) Λ s = λ (cid:107) u s (cid:107) Λ − s ≤ √ λ (cid:107) u s (cid:107) ≤ O (1 + (2 + sp( v ∗ )) d ) = O ( β ) , (8)15here in the second inequality we use the condition (cid:107) µ ( X ) (cid:107) ≤ √ d in Assumption 2 to bound (cid:107) (cid:82) X ( v s ( x (cid:48) ) − v ∗ ( x (cid:48) )) d µ ( x (cid:48) ) (cid:107) as sup x ∈X | v s ( x ) − v ∗ ( x ) |(cid:107) µ ( X ) (cid:107) = O ((2 + sp( v ∗ )) d ) . Put to-gether, the above shows (cid:107) (cid:15) s + b s + λ Λ − s u s (cid:107) Λ s = O ( β ) .Continuing with Eq. (6) and summing over t , we have that with probability at least − δ , T (cid:88) t =1 ( q s t ( x t , a t ) − q ∗ ( x t , a t )) ≤ T (cid:88) t =1 E x (cid:48) ∼ p ( ·| x t ,a t ) [ v s t ( x (cid:48) ) − v ∗ ( x (cid:48) )] + O (cid:32) β T (cid:88) t =1 (cid:107) Φ( x t , a t ) (cid:107) Λ − t (cid:33) = T (cid:88) t =1 E x (cid:48) ∼ p ( ·| x t ,a t ) [ v s t ( x (cid:48) ) − v ∗ ( x (cid:48) )] + O β √ T (cid:118)(cid:117)(cid:117)(cid:116) T (cid:88) t =1 (cid:107) Φ( x t , a t ) (cid:107) − t (Cauchy-Schwarz inequality) = T (cid:88) t =1 E x (cid:48) ∼ p ( ·| x t ,a t ) [ v s t ( x (cid:48) ) − v ∗ ( x (cid:48) )] + O (cid:16) β (cid:112) dT log T (cid:17) , where the last equality is by [21, Lemma D.2] with the facts that det(Λ ) = λ d and det(Λ T +1 ) ≤ (cid:0) d trace (Λ T +1 ) (cid:1) d ≤ ( λ + 2 T ) d . Rearranging the last inequality we get T (cid:88) t =1 (cid:0) E x (cid:48) ∼ p ( ·| x t ,a t ) [ v ∗ ( x (cid:48) )] − q ∗ ( x t , a t ) (cid:1) ≤ T (cid:88) t =1 (cid:0) E x (cid:48) ∼ p ( ·| x t ,a t ) [ v s t ( x (cid:48) )] − q s t ( x t , a t ) (cid:1) + O (cid:16) β (cid:112) dT log T (cid:17) = T (cid:88) t =1 (cid:0) E x (cid:48) ∼ p ( ·| x t ,a t ) [ v s t ( x (cid:48) )] − v s t ( x t ) (cid:1) + O (cid:16) β (cid:112) dT log T (cid:17) where the last line is by the choice of a t . Next, notice that every time the algorithm updates (i.e. s t (cid:54) = s t − ), it holds that det(Λ t ) = det(Λ s t ) ≥ s t − ) . Since det(Λ T +1 ) / det(Λ ) ≤ (cid:0) λ +2 Tλ (cid:1) d ,this cannot happen more than log (cid:0) λ +2 Tλ (cid:1) d = O ( d log T ) times. Using this fact and the range of v t ,we continue with T (cid:88) t =1 (cid:0) E x (cid:48) ∼ p ( ·| x t ,a t ) [ v ∗ ( x (cid:48) )] − q ∗ ( x t , a t ) (cid:1) ≤ T (cid:88) t =1 (cid:0) E x (cid:48) ∼ p ( ·| x t ,a t ) [ v s t +1 ( x (cid:48) )] − v s t ( x t ) (cid:1) + O (cid:16) β (cid:112) dT log T + βd log T (cid:17) = T (cid:88) t =1 (cid:0) E x (cid:48) ∼ p ( ·| x t ,a t ) [ v s t +1 ( x (cid:48) )] − v s t +1 ( x t +1 ) (cid:1) + O (cid:16) β (cid:112) dT log T + βd log T (cid:17) = O (cid:16) β (cid:112) dT log T + βd log T (cid:17) , (9)where the last step holds with probability at least − δ by Azuma’s inequality. Finally, note that theregret can be written asReg T = T (cid:88) t =1 ( J ∗ − r ( x t , a t )) = T (cid:88) t =1 (cid:0) E x (cid:48) ∼ p ( ·| x t ,a t ) [ v ∗ ( x (cid:48) )] − q ∗ ( x t , a t ) (cid:1) = O (cid:16) β (cid:112) dT log T + βd log T (cid:17) . by the Bellman optimality equation, which finishes the proof (combining all the high probabilitystatements with a union bound, the last bound holds with probability at least − δ ).16 .2 Omitted Analysis in Section 3.2 Proof of Lemma 4. By Assumption 2 and the Bellman equation for the finite-horizon problem (Eq.(4)), we have that for any finite-horizon policy π and any h < H , Q πh ( x, a ) = r ( x, a ) + E x (cid:48) ∼ p ( ·| x,a ) (cid:2) V πh +1 ( x (cid:48) ) (cid:3) = Φ( x, a ) (cid:62) θ + Φ( x, a ) (cid:62) (cid:90) X V πh +1 ( x (cid:48) )d µ ( x (cid:48) )= Φ( x, a ) (cid:62) (cid:18) θ + (cid:90) X V πh +1 ( x (cid:48) )d µ ( x (cid:48) ) (cid:19) . Define w πh = θ + (cid:82) X V πh +1 ( x (cid:48) )d µ ( x (cid:48) ) . Then we have Q πh ( x, a ) = Φ( x, a ) (cid:62) w πh with (cid:107) w πh (cid:107) ≤(cid:107) θ (cid:107) + ( H − h ) (cid:107) µ ( X ) (cid:107) ≤ √ d + √ d ( H − h ) ≤ √ dH .We now rewrite w kh − w πh as follow. For simplicity, we denote x ∼ p ( ·| x k (cid:48) h (cid:48) , a k (cid:48) h (cid:48) ) as x ∼ ( k (cid:48) , h (cid:48) ) , Φ( x k (cid:48) h (cid:48) , a k (cid:48) h (cid:48) ) as Φ k (cid:48) h (cid:48) , and r ( x k (cid:48) h (cid:48) , a k (cid:48) h (cid:48) ) as r k (cid:48) h (cid:48) w kh − w πh = Λ − k k − (cid:88) k (cid:48) =1 H (cid:88) h (cid:48) =1 Φ k (cid:48) h (cid:48) (cid:104) r k (cid:48) h (cid:48) + V kh +1 ( x k (cid:48) h (cid:48) +1 ) (cid:105) − Λ − k (cid:32) λI + k − (cid:88) k (cid:48) =1 H (cid:88) h (cid:48) =1 Φ k (cid:48) h (cid:48) Φ k (cid:48) h (cid:48) (cid:62) (cid:33) w πh = Λ − k k − (cid:88) k (cid:48) =1 H (cid:88) h (cid:48) =1 Φ k (cid:48) h (cid:48) (cid:104) r k (cid:48) h (cid:48) + V kh +1 ( x k (cid:48) h (cid:48) +1 ) (cid:105) − Λ − k k − (cid:88) k (cid:48) =1 H (cid:88) h (cid:48) =1 Φ k (cid:48) h (cid:48) (cid:104) r k (cid:48) h (cid:48) + E x (cid:48) ∼ ( k (cid:48) ,h (cid:48) ) [ V πh +1 ( x (cid:48) )] (cid:105) − λ Λ − k w πh (using Q πh ( x, a ) = Φ( x, a ) (cid:62) w πh and the Bellman equation) = Λ − k k − (cid:88) k (cid:48) =1 H (cid:88) h (cid:48) =1 Φ k (cid:48) h (cid:48) (cid:104) V kh +1 ( x k (cid:48) h (cid:48) +1 ) − E x (cid:48) ∼ ( k (cid:48) ,h (cid:48) ) V πh +1 ( x (cid:48) ) (cid:105) − λ Λ − k w πh = Λ − k k − (cid:88) k (cid:48) =1 H (cid:88) h (cid:48) =1 Φ k (cid:48) h (cid:48) (cid:2) E x (cid:48) ∼ ( k (cid:48) ,h (cid:48) ) V kh +1 ( x (cid:48) ) − E x (cid:48) ∼ ( k (cid:48) ,h (cid:48) ) V πh +1 ( x (cid:48) ) (cid:3) + (cid:15) kh − λ Λ − k w πh (define (cid:15) kh = Λ − k (cid:80) k − k (cid:48) =1 (cid:80) Hh (cid:48) =1 Φ k (cid:48) h (cid:48) (cid:104) V kh +1 ( x k (cid:48) h (cid:48) +1 ) − E x (cid:48) ∼ ( k (cid:48) ,h (cid:48) ) (cid:2) V kh +1 ( x (cid:48) ) (cid:3)(cid:105) ) = Λ − k k − (cid:88) k (cid:48) =1 H (cid:88) h (cid:48) =1 Φ k (cid:48) h (cid:48) Φ k (cid:48) h (cid:48) (cid:62) (cid:20)(cid:90) X ( V kh +1 ( x (cid:48) ) − V πh +1 ( x (cid:48) ))d µ ( x (cid:48) ) (cid:21) + (cid:15) kh − λ Λ − k w πh = (cid:0) I − λ Λ − k (cid:1) (cid:20)(cid:90) X ( V kh +1 ( x (cid:48) ) − V πh +1 ( x (cid:48) ))d µ ( x (cid:48) ) (cid:21) + (cid:15) kh − λ Λ − k w πh = (cid:90) X ( V kh +1 ( x (cid:48) ) − V πh +1 ( x (cid:48) ))d µ ( x (cid:48) ) + (cid:15) kh − λ Λ − k (cid:20)(cid:90) X ( V kh +1 ( x (cid:48) ) − V πh +1 ( x (cid:48) ))d µ ( x (cid:48) ) (cid:21) − λ Λ − k w πh . (cid:98) Q kh ( x, a ) − Q πh ( x, a )= Φ( x, a ) (cid:62) ( w kh − w πh ) + β (cid:113) Φ( x, a ) (cid:62) Λ − k Φ( x, a )= Φ( x, a ) (cid:62) (cid:90) X ( V kh +1 ( x (cid:48) ) − V πh +1 ( x (cid:48) ))d µ ( x (cid:48) ) + Φ( x, a ) (cid:62) (cid:15) kh + β (cid:107) Φ( x, a ) (cid:107) Λ − k − λ Φ( x, a ) (cid:62) Λ − k (cid:20)(cid:90) X ( V kh +1 ( x (cid:48) ) − V πh +1 ( x (cid:48) ))d µ ( x (cid:48) ) (cid:21) − λ Φ( x, a ) (cid:62) Λ − k w πh = E x (cid:48) ∼ p ( ·| x,a ) (cid:2) V kh +1 ( x (cid:48) ) − V πh +1 ( x (cid:48) ) (cid:3) + Φ( x, a ) (cid:62) (cid:15) kh (cid:124) (cid:123)(cid:122) (cid:125) term + β (cid:107) Φ( x, a ) (cid:107) Λ − k − λ Φ( x, a ) (cid:62) Λ − k (cid:20)(cid:90) X ( V kh +1 ( x (cid:48) ) − V πh +1 ( x (cid:48) ))d µ ( x (cid:48) ) (cid:21)(cid:124) (cid:123)(cid:122) (cid:125) term − λ Φ( x, a ) (cid:62) Λ − k w πh (cid:124) (cid:123)(cid:122) (cid:125) term . (10)Below we bound the manitudes of term , term , term respectively. For term , we use Lemma 11and Lemma 12: define ε k (cid:48) h (cid:48) = V kh +1 ( x k (cid:48) h (cid:48) +1 ) − E x (cid:48) ∼ ( k (cid:48) ,h (cid:48) ) (cid:2) V kh +1 ( x (cid:48) ) (cid:3) , φ k (cid:48) h (cid:48) = √ Φ k (cid:48) h (cid:48) . By Lemma 11,we have (cid:107) (cid:15) kh (cid:107) Λ k = √ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Λ − k k − (cid:88) k (cid:48) =1 H (cid:88) h (cid:48) =1 φ k (cid:48) h (cid:48) ε k (cid:48) h (cid:48) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Λ k = √ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) k − (cid:88) k (cid:48) =1 H (cid:88) h (cid:48) =1 φ k (cid:48) h (cid:48) ε k (cid:48) h (cid:48) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Λ − k ≤ √ H (cid:114) d T + λλ + log N ε δ + √ × (cid:114) t ε λ , (11)for all k and h with probability at least − δ , where N ε is the ε -cover of the function class that V kh +1 ( · ) lies in. Notice that all t , V kh +1 ( · ) can be expressed as the following: V kh +1 ( x ) = min (cid:26) max a w (cid:62) Φ( x, a ) + β (cid:113) Φ( x, a ) (cid:62) ΓΦ( x, a ) , H (cid:27) for some positive definite Γ ∈ R d × d with λ ≥ λ max (Γ) ≥ λ min (Γ) ≥ λ +2 T = T and some w ∈ R d with (cid:107) w (cid:107) ≤ λ max (Γ) × T × sup x,a,x (cid:48) ( (cid:107) Φ( x, a ) (cid:107) H ) ≤ √ T H . Therefore, we can writethe class of functions that V kh +1 ( · ) lies in as follows: V = (cid:26) V ( x ) = min (cid:26) max a w (cid:62) Φ( x, a ) + β (cid:113) Φ( x, a ) (cid:62) ΓΦ( x, a ) , H (cid:27) : w ∈ R d : (cid:107) w (cid:107) ≤ √ T H, Γ ∈ R d × d : 11 + 2 T ≤ λ min (Γ) ≤ λ max (Γ) ≤ (cid:27) . Now we apply Lemma 12 to V , with the following choices of parameters: α = ( w, Γ) , P = d + d , ε = T , B = √ T H , and L = β (cid:112) T ) which is given by the following calculation: for any ∆ w = (cid:15) e i , | (cid:15) | (cid:12)(cid:12) ( w + ∆ w ) (cid:62) Φ( x, a ) − w (cid:62) Φ( x, a ) (cid:12)(cid:12) = | e (cid:62) i Φ( x, a ) | ≤ (cid:107) Φ( x, a ) (cid:107) ≤ √ , ∆Γ = (cid:15) e i e (cid:62) j , | (cid:15) | (cid:12)(cid:12)(cid:12)(cid:12) β (cid:113) Φ( x, a ) (cid:62) (Γ + ∆Γ)Φ( x, a ) − β (cid:113) Φ( x, a ) (cid:62) ΓΦ( x, a ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ β (cid:12)(cid:12) Φ( x, a ) (cid:62) e i e (cid:62) j Φ( x, a ) (cid:12)(cid:12)(cid:112) Φ( x, a ) (cid:62) ΓΦ( x, a ) ( √ u + v − √ u ≤ | v |√ u ) ≤ β Φ( x, a ) (cid:62) (cid:0) e i e (cid:62) i + e j e (cid:62) j (cid:1) Φ( x, a ) (cid:112) Φ( x, a ) (cid:62) ΓΦ( x, a ) ≤ β Φ( x, a ) (cid:62) Φ( x, a ) (cid:112) Φ( x, a ) (cid:62) ΓΦ( x, a ) ≤ √ β (cid:115) λ min (Γ) ≤ β (cid:112) T ) . Lemma 12 then implies: log N ε ≤ ( d + d ) log 2 × √ T H × β (cid:112) T ) × ( d + d ) T − ≤ d log T, where in the last step we use the definition of β and also assume without loss of generality that sp( v ∗ ) ≤ √ T , d ≤ √ T , and T ≥ (since otherwise the regret bound is vacuous). Then by Eq. (11)we have with probability − δ , for all k and h , (cid:107) (cid:15) kh (cid:107) Λ k ≤ √ H (cid:114) d T + 11 + log 1 δ + 20 d log T + 4 ≤ dH (cid:112) log( T /δ ) = β , and therefore, | term | ≤ (cid:107) Φ( x, a ) (cid:107) Λ − k (cid:107) (cid:15) kh (cid:107) Λ k ≤ β (cid:107) Φ( x, a ) (cid:107) Λ − k . Furthermore, | term | ≤ (cid:107) Φ( x, a ) (cid:107) Λ − k (cid:13)(cid:13)(cid:13)(cid:13) λ (cid:90) X ( V kh +1 ( x (cid:48) ) − V πh +1 ( x (cid:48) ))d µ ( x (cid:48) ) (cid:13)(cid:13)(cid:13)(cid:13) Λ − k (Cauchy-Schwarz inequality) ≤ (cid:107) Φ( x, a ) (cid:107) Λ − k (cid:13)(cid:13)(cid:13)(cid:13) √ λ (cid:90) X ( V kh +1 ( x (cid:48) ) − V πh +1 ( x (cid:48) ))d µ ( x (cid:48) ) (cid:13)(cid:13)(cid:13)(cid:13) ( λ min (Λ k ) ≥ λ ) ≤ √ λ (cid:107) Φ( x, a ) (cid:107) Λ − k × H √ d ( (cid:107) µ ( X ) (cid:107) ≤ √ d by Assumption 2) ≤ β (cid:107) Φ( x, a ) (cid:107) Λ − k , (using λ = 1 )and | term | ≤ (cid:107) Φ( x, a ) (cid:107) Λ − k (cid:107) λw πh (cid:107) Λ − k (Cauchy-Schwarz inequality) ≤ (cid:107) Φ( x, a ) (cid:107) Λ − k (cid:13)(cid:13)(cid:13) √ λw πh (cid:13)(cid:13)(cid:13) ( λ min (Λ k ) ≥ λ ) ≤ β (cid:107) Φ( x, a ) (cid:107) Λ − k . ( (cid:107) w πh (cid:107) ≤ √ dH and λ = 1 )Therefore, | term | + | term | + | term | ≤ β (cid:107) Φ( x, a ) (cid:107) Λ − k for all k and h with probability at least − δ . Then by Eq. (10), we have (cid:98) Q kh ( x, a ) − Q πh ( x, a ) ≤ E x (cid:48) ∼ p ( ·| x,a ) [ V kh +1 ( x (cid:48) ) − V πh +1 ( x (cid:48) )] + 2 β (cid:107) Φ( x, a ) (cid:107) Λ − k , proving one inequality in the lemma statement (since Q kh ( x, a ) ≤ (cid:98) Q kh ( x, a ) ). To prove the otherinequality, note that Eq. (10) together with | term | + | term | + | term | ≤ β (cid:107) Φ( x, a ) (cid:107) Λ − k alsoimplies (cid:98) Q kh ( x, a ) − Q πh ( x, a ) ≥ E x (cid:48) ∼ p ( ·| x,a ) [ V kh +1 ( x (cid:48) ) − V πh +1 ( x (cid:48) )] . (12)19ow we fix k and use induction on h to prove Q kh ( x, a ) ≥ Q πh ( x, a ) . The base case h = H is clear due to Eq. (12) and the facts V kH +1 ( x ) = V πH +1 ( x ) = 0 and Q kH ( x, a ) − Q πH ( x, a ) =min { (cid:98) Q kH ( x, a ) , H } − Q πH ( x, a ) ≥ . Next assume Q kh +1 ( x, a ) ≥ Q πh +1 ( x, a ) for all x and a .Then V kh +1 ( x ) = max a Q kh +1 ( x, a ) ≥ max a Q πh +1 ( x, a ) ≥ V πh +1 ( x ) . Using Eq. (12) we have (cid:98) Q kh ( x, a ) − Q πh ( x, a ) ≥ , which again implies Q kh ( x, a ) = min { (cid:98) Q kh ( x, a ) , H } ≥ Q πh ( x, a ) . Thisfinishes the induction and proves the other inequality in the lemma statement. Proof of Theorem 5. Let π k = ( π k , . . . , π kH ) be the finite-horizon policy that our algorithm executesfor episode k , that is, π kh ( a | x ) = [ a = argmax a (cid:48) Q kh ( x, a (cid:48) )] (breaking ties arbitrarily). Alsolet ¯ π ∗ be the optimal finite-horizon policy with value functions Q ∗ h ( x, a ) = max π Q πh ( x, a ) and V ∗ h ( x ) = max a Q ∗ h ( x, a ) . We first decompose the regret asReg T = T (cid:88) t =1 ( J ∗ − r ( x t , a t ))= T/H (cid:88) k =1 (cid:0) HJ ∗ − V ∗ ( x k ) (cid:1)(cid:124) (cid:123)(cid:122) (cid:125) term + T/H (cid:88) k =1 (cid:0) V ∗ ( x k ) − V π k ( x k ) (cid:1)(cid:124) (cid:123)(cid:122) (cid:125) term + T/H (cid:88) k =1 (cid:32) V π k ( x k ) − H (cid:88) h =1 r ( x kh , a kh ) (cid:33)(cid:124) (cid:123)(cid:122) (cid:125) term (13)In Lemma 13 (stated after this proof), we connect the optimal reward of the the infinite-horizonsetting and the finite-horizon setting and show that term ≤ T sp( v ∗ ) H .Notice that conditioned on the history before episode k , V π k ( x k ) is the expectation of (cid:80) Hh =1 r ( x kh , a kh ) . Therefore, term is a martingale different sequence, which can be upper boundedby O (cid:16) H (cid:113) TH log(1 /δ ) (cid:17) = O (cid:16)(cid:112) HT log(1 /δ ) (cid:17) with probabiltiy at least − δ (via Azuma’sinequality).Finally, we deal with term . Below we assume that the high-probability event in Lemma 4 hold.Then for all k, h : Q kh ( x kh , a kh ) − Q π k h ( x kh , a kh ) ≤ E x (cid:48) ∼ ( k,h ) [ V kh +1 ( x (cid:48) ) − V π k h +1 ( x (cid:48) )] + 2 β (cid:107) Φ( x kh , a kh ) (cid:107) Λ − k = V kh +1 ( x kh +1 ) − V π k h +1 ( x kh +1 ) + 2 β (cid:107) Φ( x kh , a kh ) (cid:107) Λ − k + e kh = Q kh +1 ( x kh +1 , a kh +1 ) − Q π k h +1 ( x kh +1 , a kh +1 ) + 2 β (cid:107) Φ( x kh , a kh ) (cid:107) Λ − k + e kh where in the first equality we define e kh = E x (cid:48) ∼ ( k,h ) [ V kh +1 ( x (cid:48) ) − V π k h +1 ( x (cid:48) )] − (cid:0) V kh +1 ( x kh +1 ) − V π k h +1 ( x kh +1 ) (cid:1) , which has zero mean, and in the second equality we use the facts V kh +1 ( x kh +1 ) = Q kh +1 ( x kh +1 , a kh +1 ) and V π k h +1 ( x kh +1 ) = Q π k h +1 ( x kh +1 , a kh +1 ) . Repeating the same argument and using V kH +1 ( · ) = V π k H +1 ( · ) = 0 , we arrive at Q k ( x k , a k ) − Q π k ( x k , a k ) ≤ H (cid:88) h =1 (cid:16) β (cid:107) Φ( x kh , a kh ) (cid:107) Λ − k + e kh (cid:17) . Further using that V ∗ ( x k ) = max a Q ∗ ( x k , a ) ≤ max a Q k ( x k , a ) = Q k ( x k , a k ) (the inequality isby Lemma 4) and that V π k ( x k ) = Q π k ( x k , a k ) , we have shown term ≤ T/H (cid:88) k =1 H (cid:88) h =1 (cid:16) β (cid:107) Φ( x kh , a kh ) (cid:107) Λ − k + e kh (cid:17) . The term (cid:80) T/Hk =1 (cid:80) Hh =1 e kh is again the sum of a martingale difference sequence with each term’smagnitude bounded by H , and therefore is bounded by O (cid:16) H (cid:112) T log(1 /δ ) (cid:17) with probability20t least − δ using Azuma’s inequality. For the term (cid:80) T/Hk =1 (cid:80) Hh =1 β (cid:107) Φ( x kh , a kh ) (cid:107) Λ − k , we firstdecompose it into two parts: (cid:88) k :det(Λ k +1 ) ≤ k ) H (cid:88) h =1 β (cid:107) Φ( x kh , a kh ) (cid:107) Λ − k + (cid:88) k :det(Λ k +1 ) > k ) H (cid:88) h =1 β (cid:107) Φ( x kh , a kh ) (cid:107) Λ − k . By [1, Lemma 12], det(Λ k +1 ) ≤ k ) implies Λ k +1 (cid:22) k and thus Λ − k (cid:22) − k +1 . Therefore,the first part is upper bounded by √ (cid:80) k,h β (cid:107) Φ( x kh , a kh ) (cid:107) Λ − k +1 ≤ β √ T (cid:113)(cid:80) k,h (cid:107) Φ( x kh , a kh ) (cid:107) − k +1 ,by Cauchy-Schwarz inequality. Further invoking [21, Lemma D.2], we upper bound the lastterm by O (cid:18) β √ T (cid:113) log det(Λ T/H +1 )det(Λ ) (cid:19) = O (cid:18) β √ T (cid:113) log (cid:0) λ +2 Tλ (cid:1) d (cid:19) = O (cid:0) β √ dT log T (cid:1) . For thesecond part, notice that since the event det(Λ k +1 ) > k ) cannot happen for more than O (cid:16) log det(Λ T/H +1 )det(Λ ) (cid:17) = O ( d log T ) times, this part is upper bounded by O ( βdH log T ) .To conclude, we have shown that term = O (cid:16) β √ dT log T + βdH log T + H (cid:112) T log(1 /δ ) (cid:17) holdswith probability at least − δ . Combining all the bounds with Eq. (13), we haveReg T = T (cid:88) t =1 ( J ∗ − r ( x t , a t )) = O (cid:18) T sp( v ∗ ) H + β (cid:112) dT log T + βdH log T + H (cid:112) T log(1 /δ ) (cid:19) = (cid:101) O (cid:18) T sp( v ∗ ) H + d / H √ T + d H (cid:19) (plug in the value of β )with probability at least − δ . Picking the optimal H (the one specified in Algorithm 2), we getthat Reg T = (cid:101) O (cid:16)(cid:112) sp( v ∗ )( dT ) + (sp( v ∗ ) dT ) (cid:17) . Lemma 13. For any x , | HJ ∗ − V ∗ ( x ) | ≤ sp( v ∗ ) .Proof. Let π ∗ be the optimal policy of the infinite-horizon setting, and ( π , . . . , π H ) be the optimalpolicy of the finite-horizon setting. Without loss generality assume that both of them are deterministicpolicy. By the Bellman equation and the optimality of π ∗ , we have v ∗ ( x ) = max a (cid:0) r ( x, a ) − J ∗ + E x (cid:48) ∼ p ( ·| x,a ) v ∗ ( x ) (cid:1) (14) = r ( x, π ∗ ( x )) − J ∗ + E x (cid:48) ∼ p ( ·| x,π ∗ ( x )) v ∗ ( x ) . (15)For any x , consider a state sequence x = x, x , . . . , x H generated by π ∗ . By the suboptimality of π ∗ in the finite-horizon setting, V ∗ ( x ) ≥ E (cid:34) H (cid:88) h =1 r ( x h , π ∗ ( x h )) (cid:12)(cid:12)(cid:12)(cid:12) x = x, π ∗ (cid:35) = E (cid:34) H (cid:88) h =1 (cid:0) J ∗ + v ∗ ( x h ) − E x (cid:48) ∼ p ( ·| x h ,π ∗ ( x h )) [ v ∗ ( x (cid:48) )] (cid:1) (cid:12)(cid:12)(cid:12)(cid:12) x = x, π ∗ (cid:35) (by Eq. (15)) = E (cid:34) H (cid:88) h =1 ( J ∗ + v ∗ ( x h ) − v ∗ ( x h +1 )) (cid:12)(cid:12)(cid:12)(cid:12) x = x, π ∗ (cid:35) = HJ ∗ + E (cid:2) v ∗ ( x ) − v ∗ ( x H +1 ) (cid:12)(cid:12) x = x, π ∗ (cid:3) ≥ HJ ∗ − sp( v ∗ ) . x = x, x , . . . , x H generated by ( π , . . . , π H ) : V ∗ ( x ) = E (cid:34) H (cid:88) h =1 r ( x h , π h ( x h )) (cid:12)(cid:12)(cid:12)(cid:12) x = x, { π i } Hi =1 (cid:35) ≤ E (cid:34) H (cid:88) t =1 (cid:0) J ∗ + v ∗ ( x t ) − E x (cid:48) ∼ p ( ·| x h ,π h ( x h )) [ v ∗ ( x (cid:48) )] (cid:1) (cid:12)(cid:12)(cid:12)(cid:12) x = x, { π i } Hi =1 (cid:35) (by Eq. (14)) = E (cid:34) H (cid:88) t =1 ( J ∗ + v ∗ ( x h ) − v ∗ ( x h +1 )) (cid:12)(cid:12)(cid:12)(cid:12) x = x, { π i } Hi =1 (cid:35) = HJ ∗ + E (cid:2) v ∗ ( x ) − v ∗ ( x H +1 ) (cid:12)(cid:12) x = x, { π i } Hi =1 (cid:3) ≤ HJ ∗ + sp( v ∗ ) . Combining the two directions finishes the proof. D Omitted Analysis in Section 4 𝑅 𝑘,1 𝑅 𝑘,2 𝑅 𝑘,3 𝑅 𝑘,4 𝜏 𝑘,1 𝜏 𝑘,2 𝜏 𝑘,3 𝜏 𝑘,4 𝑁 steps Figure 1: An illustration for the data collection process of MDP-E XP 2. In the figure, we show howthe algorithm collects trajectories of length N (the red intervals) in an epoch with length B = 8 N .Figure 1 is an illustration of the data collection scheme of MDP-E XP 2. Below, we first provide theproof for Lemma 6. Proof of Lemma 6. Denote E [ ·| x = x, a t ∼ π ( ·| x t ) , x t +1 ∼ p ( ·| x t , a t ) for all t ≥ by E [ ·| x = x, π ] . For any two initial states u, u (cid:48) ∈ X , let δ u and δ u (cid:48) be the Dirac measures with respect to u and u (cid:48) . Writing P π as P for simplicity, we have for any time t , | E [ r ( x t , a t ) | x = u, π ] − E [ r ( x t , a t ) | x = u (cid:48) , π ] | = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:90) X (cid:88) a ∈A π ( a | x ) r ( x, a )d P t − δ u ( x ) − (cid:90) X (cid:88) a ∈A π ( a | x ) r ( x, a )d P t − δ u (cid:48) ( x ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:107) P t − δ u − P t − δ u (cid:48) (cid:107) TV ≤ e − t − t mix (cid:107) δ u − δ u (cid:48) (cid:107) TV (Assumption 3) ≤ e − t − t mix . (16)Therefore, by the definition of J π ( u ) in Section 2, we have | J π ( u ) − J π ( u (cid:48) ) | ≤ lim T →∞ T T (cid:88) t =1 e − t − t mix = 0 , proving that J π ( u ) is a fixed value independent of the initial state u and can thus be denoted as J π .22ext, define the following two quantities: v πT ( x ) = E (cid:34) T (cid:88) t =1 ( r ( x t , a t ) − J π ) (cid:12)(cid:12)(cid:12) x = x, π (cid:35) ,q πT ( x, a ) = E (cid:34) T (cid:88) t =1 ( r ( x t , a t ) − J π ) (cid:12)(cid:12)(cid:12) ( x , a ) = ( x, a ) , x t ∼ p ( ·| x t − , a t − ) , a t ∼ π ( ·| x t ) for t ≥ (cid:35) . (17)We will show that v π ( x ) (cid:44) lim T →∞ v πT ( x ) and q π ( x, a ) (cid:44) lim T →∞ q πT ( x, a ) satisfy the con-ditions stated in Lemma 6. First we argue that they do exist. Note that J π can be written as (cid:82) X (cid:80) a r ( x, a ) π ( a | x )d ν π ( x ) where ν π is the stationary distribution under π . Therefore, for any T ,we have (cid:12)(cid:12)(cid:12) E (cid:104) r ( x T +1 , a T +1 ) − J π (cid:12)(cid:12)(cid:12) x = x, π (cid:105)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:90) X (cid:88) a ∈A π ( a | x (cid:48) ) r ( x (cid:48) , a )d P T δ x ( x (cid:48) ) − (cid:90) X (cid:88) a ∈A π ( a | x ) r ( x, a )d ν π ( x ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:107) P T δ x − ν π (cid:107) TV = 2 (cid:107) P T δ x − P T ν π (cid:107) TV (by the definition of ν π ) ≤ e − Tt mix (cid:107) δ x − ν π (cid:107) TV (by Assumption 3) ≤ e − Tt mix , (18)and thus | v πT ( x ) − v πT +1 ( x ) | = (cid:12)(cid:12)(cid:12) E (cid:104) r ( x T +1 , a T +1 ) − J π (cid:12)(cid:12)(cid:12) x = x, π (cid:105)(cid:12)(cid:12)(cid:12) ≤ e − Tt mix , which goes to zero and implies that v π ( x ) = lim T →∞ v πT ( x ) exists. On the other hand, by thedefinition we have q πT ( x, a ) = r ( x, a ) − J π + E x (cid:48) ∼ p ( ·| x,a ) v πT − ( x (cid:48) ) , and taking the limit on both sides shows that q π ( x, a ) = lim T →∞ q πT ( x, a ) exists and satisfies theBellman equation in the lemma statement: q π ( x, a ) = r ( x, a ) − J π + E x (cid:48) ∼ p ( ·| x,a ) v π ( x (cid:48) ) . Finally, Eq. (18) also shows that | v πT ( x ) | ≤ T (cid:88) t =1 e − t − t mix ≤ − e − t mix ≤ − (cid:16) − t mix (cid:17) = 4 t mix , (using e − x ≤ − x for x ∈ [0 , and t mix ≥ )and thus the range of v π is [ − t mix , t mix ] while the range of q π is [ − t mix , t mix ] since | q π ( x, a ) | ≤| r ( x, a ) | + | J π | + sup x (cid:48) | v π ( x (cid:48) ) | ≤ t mix ≤ t mix . The last statement (cid:82) X v π ( x )d ν π ( x ) =0 in the lemma is also clear since (cid:82) X v πT ( x )d ν π ( x ) = 0 for all T by the equality J π = (cid:82) X (cid:80) a r ( x, a ) π ( a | x )d ν π ( x ) and the fact that x , . . . , x T all have marginal distribution ν π when x = x is drawn from ν π .In Section 4, we mention that Assumption 4 is weaker than Assumption 2 when Assumption 3 holds.Below we provide a proof for this statement. Lemma 14. Under Assumption 3, Assumption 2 implies Assumption 4.Proof. Since Assumption 3 holds, by Lemma 6, we have q π ( x, a ) = r ( x, a ) − J π + E x (cid:48) ∼ p ( ·| x,a ) v π ( x (cid:48) )= Φ( x, a ) (cid:62) θ − J π Φ( x, a ) (cid:62) e + Φ( x, a ) (cid:62) (cid:90) X v π ( x (cid:48) )d µ ( x (cid:48) ) (Assumption 2) = Φ( x, a ) (cid:62) (cid:18) θ − J π e + (cid:90) X v π ( x (cid:48) )d µ ( x (cid:48) ) (cid:19) . w π to be θ − J π e + (cid:82) X v π ( x (cid:48) )d µ ( x (cid:48) ) and noting that (cid:107) w π (cid:107) ≤ (cid:107) θ (cid:107) + 1 +(max x ∈X v π ( x )) (cid:107) µ ( X ) (cid:107) ≤ √ d + 1 + 4 t mix √ d ≤ t mix √ d finishes the proof. D.1 Proof of Theorem 7 To prove Theorem 7, we first show a couple of useful lemmas. Lemma 15. Let k be any number in { , , . . . , TB } and m be any number in { , , . . . , B N } . Let E [ · | τ k,m ] denote the expectation conditioned on ( x τ k,m , a τ k,m ) and all history before time τ k,m (recall the definitions of τ k,m and R k,m in Algorithm 3). Then we have (cid:12)(cid:12) E [ R k,m | τ k,m ] − (cid:0) q π k ( x τ k,m , a τ k,m ) + N J π k (cid:1)(cid:12)(cid:12) ≤ T . Proof. Recalling the definition of q π k N in Eq. (17), we have E [ R k,m | τ k,m ]= E (cid:34) N (cid:88) t =1 r ( x t , a t ) (cid:12)(cid:12)(cid:12) ( x , a ) = ( x τ k,m , a τ k,m ) , x t ∼ p ( ·| x t − , a t − ) , a t ∼ π k ( ·| x t ) for t ≥ (cid:35) = q π k N ( x τ k,m , a τ k,m ) + N J π k . (19)Then we bound the difference between q πN ( x, a ) and q π ( x, a ) (which is lim N →∞ q πN ( x, a ) as shownin the proof of Lemma 6) for any π, x, a : | q πN ( x, a ) − q π ( x, a ) | = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) E (cid:34) ∞ (cid:88) t = N +1 ( r ( x t , a t ) − J π ) (cid:12)(cid:12)(cid:12) ( x , a ) = ( x, a ) , x t ∼ p ( ·| x t − , a t − ) , a t ∼ π k ( ·| x t ) for t ≥ (cid:35)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ ∞ (cid:88) t = N +1 e − t − t mix ≤ e − Nt mix − e − t mix ≤ t mix e − Nt mix . (Eq. (18))Recall that N = 8 t mix log T , and without loss of generality we assum t mix ≤ T / (otherwise theregret bound is vacuous). Thus we can bound the last expression by t mix T ≤ T . Combining this withEq. (19) finishes the proof. Lemma 16. Let E k [ · ] denote the expectation conditioned on all history before epoch k . Then (cid:107) E k [ w k ] − ( w π k + N J π k e ) (cid:107) ≤ T . roof. Let I k = [ λ min ( M k ) ≥ Bσ N ] . We proceed as follows: E k [ w k ] = E k I k M − k B N (cid:88) m =1 Φ( x τ k,m , a τ k,m ) R k,m (definition of w k ) = E k I k M − k B N (cid:88) m =1 Φ( x τ k,m , a τ k,m ) E k [ R k,m | x τ k,m , a τ k,m ] (taking expectation for R k,m conditioned on ( x τ k,m , a τ k,m ) ) = E k I k M − k B N (cid:88) m =1 Φ( x τ k,m , a τ k,m ) (cid:0) q π k ( x τ k,m , a τ k,m ) + N J π k (cid:1) + E k I k M − k B N (cid:88) m =1 Φ( x τ k,m , a τ k,m ) (cid:15) k ( x τ k,m , a τ k,m ) (define (cid:15) k ( x τ k,m , a τ k,m ) = E k [ R k,m | x τ k,m , a τ k,m ] − (cid:0) q π k ( x τ k,m , a τ k,m ) + N J π k (cid:1) ) = E k I k M − k B N (cid:88) m =1 Φ( x τ k,m , a τ k,m )Φ( x τ k,m , a τ k,m ) (cid:62) ( w π k + N J π k e ) + E k I k M − k B N (cid:88) m =1 Φ( x τ k,m , a τ k,m ) (cid:15) k ( x τ k,m , a τ k,m ) (by Assumption 4) = E k I k M − k B N (cid:88) m =1 (cid:88) a π k ( a | x τ k,m )Φ( x τ k,m , a )Φ( x τ k,m , a ) (cid:62) ( w π k + N J π k e ) + E k I k M − k B N (cid:88) m =1 (cid:88) a π k ( a | x τ k,m )Φ( x τ k,m , a ) (cid:15) k ( x τ k,m , a ) (taking expectation for a τ k,m conditioned on x τ k,m ) = E k [ I k ( w π k + N J π k e )] + (cid:15) (define (cid:15) = E k (cid:104) I k M − k (cid:80) B N m =1 (cid:80) a π k ( a | x τ k,m )Φ( x τ k,m , a ) (cid:15) k ( x τ k,m , a ) (cid:105) ) = w π k + N J π k e − E k [(1 − I k )( w π k + N J π k e )] + (cid:15) . By Lemma 15, we have | (cid:15) k ( x τ k,m , a τ k,m ) | ≤ /T and thus (cid:107) (cid:15) (cid:107) ≤ E k B N (cid:88) m =1 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) I k M − k (cid:88) a π k ( a | x τ k,m )Φ( x τ k,m , a ) (cid:15) k ( x τ k,m , a τ k,m ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = E k B N (cid:88) m =1 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) I k M − k (cid:88) a π k ( a | x τ k,m )Φ( x τ k,m , a )Φ( x τ k,m , a ) (cid:62) e (cid:15) k ( x τ k,m , a τ k,m ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = E k B N (cid:88) m =1 (cid:13)(cid:13) I k e (cid:15) k ( x τ k,m , a τ k,m ) (cid:13)(cid:13) ≤ E k B N (cid:88) m =1 T ≤ T . On the other hand, we also have (cid:107) E k [(1 − I k )( w π k + N J π k e )] (cid:107) ≤ E k [(1 − I k )] (6 t mix √ d + N ) ≤ t mix √ d + NT . where the last step is by Lemma 17 (stated after this proof). Finally, combining everything proves (cid:107) E k [ w k ] − ( w π k + N J π k e ) (cid:107) ≤ T + 6 t mix √ d + NT ≤ T , t mix √ d + N = 6 t mix √ d + 8 t mix log T is at most T (otherwise the regret bound isvacuous). Lemma 17. For any k ∈ { , . . . , T /B } , conditioning on the history before epoch k , we have withprobability at least − T , λ min ( M k ) ≥ Bσ N .Proof. We consider a fixed k . Notice that since N is larger than t mix , the state distribution at τ k,m conditioned on all trajectories collected before (which all happen before τ k,m − N ) would be closeto the stationary distribution ν π k . For the purpose of analysis, we consider an imaginary world whereall history before epoch k remains the same as the real world, but in epoch k , at time t = τ k,m , ∀ m = 1 , , . . . , the state distribution is reset according to the stationary distribution, i.e., x τ k,m ∼ ν π k ;for other rounds, it follows the state transition driven by π k , the same as the real world. we denotethe expectation (given the history before epoch k ) in the imaginary world as E (cid:48) k [ · ] .Fro simplicity, define y m = x τ k,m , z m = { a τ k,m , R k,m } and m ∗ = B N . Note that M k is afunction of { y m } m ∗ m =1 and that ( y i − , z i − ) → y i → z i form a Markov chain. Therefore, by writing M k = M k ( y , . . . , y m ∗ ) , and considering any function f of M k , we have E k [ f ( M k )] = (cid:90) f ( M k ( y , . . . , y m ∗ )) d q ( y )d q ( z | y )d q ( y | y , z )d q ( z | y ) · · · d q ( y m ∗ | y m ∗ − , z m ∗ − )d q ( z m ∗ | y m ∗ ) and E (cid:48) k [ f ( M k )] = (cid:90) f ( M k ( y , . . . , y m ∗ )) d q (cid:48) ( y )d q ( z | y )d q (cid:48) ( y )d q ( z | y ) · · · d q (cid:48) ( y m ∗ )d q ( z m ∗ | y m ∗ ) where q and q (cid:48) denote the probability measure in the real and the imaginary worlds respectively(conditioned on the history before epoch k ). Note that by our construction, in the imaginary world y i is independent of ( y , z , . . . , y i − , z i − ) , while z i | y i follows the same distribution as in the realworld. By the uniform-mixing assumption, we have that (cid:107) q (cid:48) ( y m ) − q ( y m | y m − , z m − ) (cid:107) TV ≤ e − Nt mix ≤ T , implying that | E k [ f ( M k )] − E (cid:48) k [ f ( M k )] | ≤ T × B N × f max ≤ f max T , (20)where f max is the maximum magnitude of f ( · ) . Picking f ( M ) = (cid:2) λ min ( M ) ≤ Bσ N (cid:3) (with f max = 1 clearly), we have shown that Pr k (cid:20) λ min ( M k ) ≤ Bσ N (cid:21) ≤ Pr (cid:48) k (cid:20) λ min ( M k ) ≤ Bσ N (cid:21) + 1 T . It remains to bound Pr (cid:48) k (cid:2) λ min ( M k ) ≤ Bσ N (cid:3) . Notice that E (cid:48) k [ M k ] = B N × (cid:90) X (cid:88) a π k ( a | x )Φ( x, a )Φ( x, a ) (cid:62) d ν π k ( x ) (cid:23) B N × σI by Assumption 5. Using standard matrix concentration results (specifically, Lemma 18 with δ = , n = B N = σ log( dT ) , X m = (cid:80) a π k ( a | x τ k,m )Φ( x τ k,m , a )Φ( x τ k,m , a ) (cid:62) , R = 2 , and r = Bσ N =16 log( dT ) ), we get Pr (cid:48) k (cid:20) λ min ( M k ) ≤ × Bσ N (cid:21) ≤ d · exp (cid:18) − × 16 log( dT ) × (cid:19) ≤ d · exp ( − . dT )) ≤ T . . In other words, we have shown Pr k (cid:20) λ min ( M k ) ≤ Bσ N (cid:21) ≤ T . + 1 T ≤ T , which completes the proof. 26 emma 18. (Theorem 2 in [17]) Let X , . . . , X n be independent, random, symmetric, real matricesof size d × d with (cid:22) X m (cid:22) RI for all m . Suppose rI (cid:22) E [ (cid:80) nm =1 X m ] for some r > . Then forall δ ∈ [0 , , one has Pr (cid:34) λ min (cid:32) n (cid:88) m =1 X m (cid:33) ≤ (1 − δ ) r (cid:35) ≤ d · e − δ r/ (2 R ) . Lemma 19. With η ≤ σ N , MDP-E XP guarantees for all x : E T/B (cid:88) k =1 (cid:88) a ( π ∗ ( a | x ) − π k ( a | x )) q π k ( x, a ) ≤ O (cid:18) ln |A| η + η T N Bσ (cid:19) . Proof. Note that by the definition of w k we have | w (cid:62) k Φ( x, a ) | ≤ √ η (cid:107) w t (cid:107) ≤ √ η × NBσ × B N × √ N = 24 Nσ , (21)and thus η | w (cid:62) k Φ( x, a ) | ≤ by our choice of η . Therefore, using the standard regret bound ofexponential weight (see e.g., [8, Theorem 1]), we have T/B (cid:88) k =1 (cid:88) a ( π ∗ ( a | x ) − π k ( a | x )) (cid:0) w (cid:62) k Φ( x, a ) (cid:1) ≤ O ln |A| η + η T/B (cid:88) k =1 (cid:88) a π k ( a | x ) (cid:0) w (cid:62) k Φ( x, a ) (cid:1) . (22)Taking expectation, the left-hand side becomes E T/B (cid:88) k =1 (cid:88) a ( π ∗ ( a | x ) − π k ( a | x )) (cid:0) w (cid:62) k Φ( x, a ) (cid:1) = E T/B (cid:88) k =1 (cid:88) a ( π ∗ ( a | x ) − π k ( a | x )) (( w π k + N J π k e ) · Φ( x, a )) − O (1) (Lemma 16) = E T/B (cid:88) k =1 (cid:88) a ( π ∗ ( a | x ) − π k ( a | x )) ( w π k (cid:62) Φ( x, a ) + N J π k ) − O (1)= E T/B (cid:88) k =1 (cid:88) a ( π ∗ ( a | x ) − π k ( a | x )) w π k (cid:62) Φ( x, a ) − O (1)= E T/B (cid:88) k =1 (cid:88) a ( π ∗ ( a | x ) − π k ( a | x )) q π k ( x, a ) − O (1) . (Assumption 4)To bound the expectation of the right-hand side of Eq. (22), we focus on the key term E k (cid:2)(cid:80) a π k ( a | x )( w (cid:62) k Φ( x, a )) (cid:3) ( E k denotes the expectation conditioned on the history before epoch k ) and use the same argument as done in the proof of Lemma 17 via the help of an imaginary wordwhere everything is the same as the real world except that the first state of each trajectory x τ k,m for m = 1 , , . . . , B/ N is reset according to the stationary distribution ν π k ( E (cid:48) k denotes the conditionalexpectation in this imaginary world). By the exact same argument ( cf. Eq. (20)), we have E k (cid:34)(cid:88) a π k ( a | x )( w (cid:62) k Φ( x, a )) (cid:35) ≤ E (cid:48) k (cid:34)(cid:88) a π k ( a | x )( w (cid:62) k Φ( x, a )) (cid:35) + BT N × (cid:18) Nσ (cid:19) ( w (cid:62) k Φ( x, a )) derived earlier in Eq. (21). It remains to bound E (cid:48) k (cid:2)(cid:80) a π k ( a | x )( w (cid:62) k Φ( x, a )) (cid:3) , which we proceed as follows with I k = [ λ min ( M k ) ≥ Bσ N ] : E (cid:48) k (cid:34)(cid:88) a π k ( a | x )( w (cid:62) k Φ( x, a )) (cid:35) = E (cid:48) k (cid:88) a π k ( a | x ) Φ( x, a ) (cid:62) M − k B N (cid:88) m =1 Φ( x τ k,m , a τ k,m ) R k,m I k ≤ N E (cid:48) k (cid:88) a π k ( a | x ) Φ( x, a ) (cid:62) M − k B N (cid:88) m =1 Φ( x τ k,m , a τ k,m ) I k ( R k,m ≤ N ) = N E (cid:48) k (cid:88) a π k ( a | x )Φ( x, a ) (cid:62) M − k B N (cid:88) m =1 Φ( x τ k,m , a τ k,m ) B N (cid:88) m =1 Φ( x τ k,m , a τ k,m ) (cid:62) M − k Φ( x, a ) I k ≤ BN E (cid:48) k (cid:88) a π k ( a | x )Φ( x, a ) (cid:62) M − k B N (cid:88) m =1 Φ( x τ k,m , a τ k,m )Φ( x τ k,m , a τ k,m ) (cid:62) M − k Φ( x, a ) I k (Cauchy-Schwarz inequality) = BN E (cid:48) k (cid:88) a π k ( a | x )Φ( x, a ) (cid:62) M − k B N (cid:88) m =1 (cid:88) a (cid:48) π k ( a (cid:48) | x τ k,m )Φ( x τ k,m , a (cid:48) )Φ( x τ k,m , a (cid:48) ) (cid:62) M − k Φ( x, a ) I k = BN E (cid:48) k (cid:34)(cid:88) a π k ( a | x )Φ( x, a ) (cid:62) M − k Φ( x, a ) I k (cid:35) ≤ O (cid:18) BN × NBσ (cid:19) (definition of I k ) = O (cid:18) N σ (cid:19) . Combining everything shows E T/B (cid:88) k =1 (cid:88) a ( π ∗ ( a | x ) − π k ( a | x )) q π k ( x, a ) ≤ O (cid:18) ln |A| η + η TB (cid:18) N σ + N BT σ (cid:19)(cid:19) ≤ O (cid:18) ln |A| η + η T N Bσ (cid:19) , which finishes the proof.We are now ready to prove Theorem 7. Proof of Theorem 7. First, decompose the regret as:Reg T = E (cid:34) T (cid:88) t =1 ( J ∗ − r ( x t , a t )) (cid:35) = E T/B (cid:88) k =1 B ( J ∗ − J π k ) + E T/B (cid:88) k =1 kB (cid:88) t =( k − B +1 ( J π k − r ( x t , a t )) . E T/B (cid:88) k =1 B ( J ∗ − J π k ) = E T/B (cid:88) k =1 B (cid:90) X (cid:88) a ( π ∗ ( a | x ) − π k ( a | x )) q π k ( x, a )d ν π ∗ ( x ) = O (cid:18) B ln |A| η + η T N σ (cid:19) . (by Lemma 19)For the second term, we first consider a specific k : E k kB (cid:88) t =( k − B +1 ( J π k − r ( x t , a t )) = E k kB (cid:88) t =( k − B +1 ( E x (cid:48) ∼ p ( ·| x t ,a t ) [ v π k ( x (cid:48) )] − q π k ( x t , a t )) (Bellman equation) = E k kB (cid:88) t =( k − B +1 ( v π k ( x t +1 ) − v π k ( x t )) = v π k ( x kB +1 ) − v π k ( x ( k − B +1 ) . Therefore, E T/B (cid:88) k =1 kB (cid:88) t =( k − B +1 ( J π k − r ( x t , a t )) ≤ E T/B (cid:88) k =1 (cid:0) v π k ( x kB +1 ) − v π k ( x ( k − B +1 ) (cid:1) ≤ E T/B (cid:88) k =2 (cid:0) v π k − ( x ( k − B +1 ) − v π k ( x ( k − B +1 ) (cid:1) + O ( t mix ) . (23)We bound the last summation using the fact that π k and π k − are close. Indeed, by the update rule ofthe algorithm, we have π k ( a | x ) − π k − ( a | x ) = π k − ( a | x ) e η Φ( x,a ) (cid:62) w k − (cid:80) b ∈A π k − ( b | x ) e η Φ( x,b ) (cid:62) w k − − π k − ( a | x ) ≤ π k − ( a | x ) e η Φ( x,a ) (cid:62) w k − (cid:80) b ∈A π k − ( b | x ) e − min b η Φ( x,b ) (cid:62) w k − − π k − ( a | x ) ≤ π k − ( a | x ) (cid:16) e η max b | Φ( x,b ) (cid:62) w k − | − (cid:17) . Recall that in the proof of Lemma 19, we have shown η max b | Φ( x, b ) (cid:62) w k − | ≤ as long as η ≤ σ/ (24 N ) . Combining with the fact e x ≤ x for x ∈ [0 , we have (cid:16) e η max b | Φ( x,b ) (cid:62) w k − | − (cid:17) ≤ η max b | Φ( x, b ) (cid:62) w k − | = O (cid:18) η × Nσ (cid:19) , where the last step is by Eq. (21). This shows π k ( a | x ) − π k − ( a | x ) ≤ O (cid:18) ηNσ π k − ( a | x ) (cid:19) . π k − ( a | x ) − π k ( a | x ) = O (cid:16) ηNσ π k − ( a | x ) (cid:17) as well. By the same argumentof [35, Lemma 7] (summarized in Lemma 20 for completeness), this implies: | v π k ( x ) − v π k − ( x ) | ≤ O (cid:18) η N σ + 1 T (cid:19) . for all x . Continuing from Eq. (23), we arrive at E T/B (cid:88) k =1 kB (cid:88) t =( k − B +1 ( J π k − r ( x t , a t )) = O (cid:18) η TB N σ + t mix (cid:19) . Combining everything, we have shownReg T = O (cid:18) B ln |A| η + η T N σ + η T N Bσ + t mix (cid:19) = (cid:101) O (cid:18) t mix ση + η T t mix σ (cid:19) (definition of N and B ) = (cid:101) O (cid:18) σ (cid:113) t mix T (cid:19) , (by the choice of η specified in Algorithm 3)which finishes the proof. Lemma 20. If π (cid:48) and π satisfy | π (cid:48) ( a | x ) − π ( a | x ) | ≤ O ( βπ ( a | x )) for all x, a and some β > , and N ≥ t mix log T , then | v π (cid:48) ( x ) − v π ( x ) | ≤ O ( ηβN + T ) .Proof. See the proof of [35, Lemma 7]. Remark 1. Notice that by the definition of σ , σ ≤ λ min (cid:32)(cid:90) X (cid:32)(cid:88) a π ( a | x )Φ( x, a )Φ( x, a ) (cid:62) (cid:33) d ν π ( x ) (cid:33) ≤ d trace (cid:34)(cid:90) X (cid:32)(cid:88) a π ( a | x )Φ( x, a )Φ( x, a ) (cid:62) (cid:33) d ν π ( x ) (cid:35) ≤ d (cid:90) X (cid:32)(cid:88) a π ( a | x ) (cid:107) Φ( x, a ) (cid:107) (cid:33) d ν π ( x ) (trace [Φ( x, a )Φ( x, a ) (cid:62) ] = (cid:107) Φ( x, a ) (cid:107) ) ≤ d , ( (cid:107) Φ( x, a ) (cid:107) ≤ by Assumption 2)which implies σ ≥ d . Therefore, the regret bound in Theorem 7 has an implicit Ω( d ) dependence. E Connection between Natural Policy Gradient and MDP-E XP The connection between the exponential weight algorithm [13] and the classic natural policy gradient(NPG) algorithm [22] under softmax parameterization has been discussed in [4]. Further connectionsbetween exponential weight algorithms and several relative-entropy-regularized policy optimizationalgorithms (e.g., TRPO [31], A3C [24], PPO [32]) are also drawn in [27]. In this section, we reviewthese connection, and argue that because of the different way of constructing the policy gradientestimator, our MDP-E XP E.1 Equivalence between NPG with softmax parameterization and exponential weightupdates We first restates [4, Lemma 5.1], which shows that NPG with softmax parameterization is equivalentto exponential weight updates: 30 emma 21 (Lemma 5.1 of [4]) . Let π θ ( a | x ) = exp ( Φ( x,a ) (cid:62) θ ) (cid:80) b exp(Φ( x,b ) (cid:62) θ ) . Also, let ν θ be the stationarydistribution under policy π θ , and A π ( x, a ) be the advantage function under policy π defined as A π ( x, a ) = q π ( x, a ) − v π ( x ) . Then the update θ new = θ + ηF † θ g θ with F θ = E x ∼ ν πθ E a ∼ π θ ( ·| x ) (cid:2) ∇ θ log π θ ( a | x ) ∇ θ log π θ ( a | x ) (cid:62) (cid:3) g θ = E x ∼ ν πθ E a ∼ π θ ( ·| x ) [ ∇ θ log π θ ( a | x ) A π θ ( x, a )] implies: π θ new ( a | x ) = π θ ( a | x ) exp ( ηA π θ ( x, a )) Z θ ( x ) where Z θ ( x ) is a normalization factor that ensures (cid:80) a π θ new ( a | x ) = 1 . To see this connection, notice that the update direction w = F † θ g θ is the solution of min w ∈ R d E x ∼ ν πθ E a ∼ π θ ( ·| x ) (cid:104)(cid:13)(cid:13) w (cid:62) ∇ θ log π θ ( a | x ) − A π θ ( x, a ) (cid:13)(cid:13) (cid:105) , (24)and also by definition π θ new ( a | x ) = exp ( Φ( x,a ) (cid:62) θ new ) (cid:80) b exp(Φ( x,b ) (cid:62) θ new ) ∝ π θ ( a | x ) exp (cid:16) η Φ( x, a ) (cid:62) F † θ g θ (cid:17) = π θ ( a | x ) exp (cid:0) η Φ( x, a ) (cid:62) w (cid:1) ∝ π θ ( a | x ) exp (cid:0) η ∇ θ log π θ ( a | x ) (cid:62) w (cid:1) . Therefore, if w achieves a valueof zero in Eq. (24), we will have π θ new ( a | x ) ∝ π θ ( a | x ) exp ( ηA π θ ( x, a )) . The proof of [4] handlesthe general case where Eq. (24) is not necessarily zero. Notice that π θ ( a | x ) exp ( ηA π θ ( x, a )) is fur-ther proportional to π θ ( a | x ) exp ( ηq π θ ( x, a )) , which is consistent with the intuition of our algorithmexplained in Section 4. E.2 Comparison between the NPG in [4] and MDP-E XP While the general formulations of the NPG in [4] and MDP-E XP A π θ ( x, a ) (or q π θ ( x, a ) ) when the learnerdoes not have access to their true values and has to estimate them from sampling. We argue thatunder the setting considered in Section 4, our algorithm and analysis achieve the near-optimal regretof order (cid:101) O ( √ T ) while theirs only obtains sub-optimal regret.In MDP-E XP 2, we construct a nearly unbiased estimator of w satisfying q π θ ( x, a ) + N J π θ = w (cid:62) Φ( x, a ) (which exists under Assumption 4), and feed it to the exponential weight algorithm. Theway we do it is similar to how E XP XP 2, to construct each estimator (denoted as w k there), the learner collects B N = (cid:101) O ( σ ) trajectories, with σ defined in Assumption 5, and then aggregate them through a form of importanceweighting introduced by M − k . With this construction, w (cid:62) k Φ( x, a ) has negligible bias (by Lemma 16)compared to w (cid:62) Φ( x, a ) , while having variance upper bounded by a constant related to σ (see theproof of Lemma 19).On the other hand, the estimator used in [4] is an approximate solution of Eq. (24). Under the sameassumptions of Assumption 4 and Assumption 5, they use stochastic gradient descent to solve Eq.(24), and obtain an estimator (cid:98) w that makes (cid:98) w (cid:62) ∇ θ log θ ( a | x ) (cid:15) -close to w (cid:62) ∇ θ log θ ( a | x ) . To obtainsuch (cid:98) w , they need to sample O (cid:0) (cid:15) (cid:1) trajectories.Comparing the two approaches, we see that to obtain a single estimator (cid:98) w for the update direction w = F † θ g θ in Lemma 21, MDP-E XP (cid:15) -accurateone with low variance using O (cid:0) (cid:15) (cid:1) trajectories. The advantage of the former is that each estimatoris cheaper to get, and the effect of high variance can be amortized over iterations. As shown inTheorem 7, MDP-E XP (cid:101) O ( √ T ) regret bound. On the other hand, to get an (cid:15) -optimal policy,[4] needs to use O (cid:0) (cid:15) (cid:1) trajectories per iteration of policy update, and perform O (cid:0) (cid:15) (cid:1) iterations of31olicy updates, leading to a total sample complexity bound of O (cid:0) (cid:15) (cid:1) . This translates to a regretbound of O ( T ) in our setting at best. In fact, since the algorithms by [2] and [16] are also basedon exponential weight, they also can be regarded as variants of NPG. However, the estimators theyconstruct suffer the same issue described above, and can only get O ( T ) or O ( T ) regret.We remark that the version of NPG by [4] can also learn the optimal policy with a worse samplecomplexity O (cid:0) (cid:15) (cid:1) under a weaker assumption compared to Assumption 5 (which replaces σ withthe relative condition number κκ