Near-optimal Representation Learning for Linear Bandits and Linear RL
Jiachen Hu∗ [email protected]
Key Laboratory of Machine Perception, MOE, School of EECS, Peking University

Xiaoyu Chen∗ [email protected]
Key Laboratory of Machine Perception, MOE, School of EECS, Peking University

Chi Jin [email protected]
Department of Electrical and Computer Engineering, Princeton University

Lihong Li [email protected]
Amazon

Liwei Wang [email protected]
Key Laboratory of Machine Perception, MOE, School of EECS, Peking University
Center for Data Science, Peking University, Beijing Institute of Big Data Research

∗. Equal contribution.
Abstract
This paper studies representation learning for multi-task linear bandits and multi-task episodic RL with linear value function approximation. We first consider the setting where we play $M$ linear bandits with dimension $d$ concurrently, and these bandits share a common $k$-dimensional linear representation with $k \ll d$ and $k \ll M$. We propose a sample-efficient algorithm, MTLR-OFUL, which leverages the shared representation to achieve $\tilde{O}(M\sqrt{dkT} + d\sqrt{kMT})$ regret, with $T$ being the number of total steps. Our regret significantly improves upon the baseline $\tilde{O}(Md\sqrt{T})$ achieved by solving each task independently. We further develop a lower bound that shows our regret is near-optimal when $d > M$. Furthermore, we extend the algorithm and analysis to multi-task episodic RL with linear value function approximation under low inherent Bellman error (Zanette et al., 2020a). To the best of our knowledge, this is the first theoretical result that characterizes the benefits of multi-task representation learning for exploration in RL with function approximation.
1. Introduction
Multi-task representation learning is the problem of learning a common low-dimensional representation among multiple related tasks (Caruana, 1997). This problem has become increasingly important in many applications such as natural language processing (Ando and Zhang, 2005; Liu et al., 2019), computer vision (Li et al., 2014), drug discovery (Ramsundar et al., 2015), and reinforcement learning (Wilson et al., 2007; Teh et al., 2017; D'Eramo et al., 2019). In these cases, common information can be extracted from related tasks to improve data efficiency and accelerate learning.

While representation learning has achieved tremendous success in a variety of applications (Bengio et al., 2013), its theoretical understanding is still limited. A widely accepted assumption in the literature is the existence of a common representation shared by different tasks. For example, Maurer et al. (2016) proposed a general method to learn data representations in the multi-task supervised learning and learning-to-learn settings. Du et al. (2020) studied few-shot learning via representation learning, with assumptions on a common representation among source and target tasks. Tripuraneni et al. (2020) focused on the problem of multi-task linear regression with low-rank representation, and proposed algorithms with sharp statistical rates.

Inspired by the theoretical results in supervised learning, we take a step further to investigate the provable benefits of representation learning for sequential decision-making problems. First, we study multi-task linear bandits, where $M$ tasks of $d$-dimensional (infinite-arm) linear bandits are concurrently learned for $T$ steps. The expected reward of arm $x_i \in \mathbb{R}^d$ for task $i$ is $\theta_i^\top x_i$, as determined by an unknown linear parameter $\theta_i$. To take advantage of the multi-task representation learning framework, we assume that the $\theta_i$'s lie in an unknown $k$-dimensional subspace of $\mathbb{R}^d$, where $k$ is much smaller than $d$ and $M$ (Yang et al., 2020). The dependence among tasks makes it possible to achieve a regret bound better than solving each task independently. Specifically, if the tasks are solved independently with standard algorithms such as OFUL (Abbasi-Yadkori et al., 2011), the total regret is $\tilde{O}(Md\sqrt{T})$. By leveraging the common representation among tasks, we can achieve a better regret $\tilde{O}(M\sqrt{dkT} + d\sqrt{kMT})$. Our algorithm is also robust to the linear representation assumption when the model is misspecified: if the $k$-dimensional subspace approximates the rewards with error at most $\zeta$, our algorithm still achieves regret $\tilde{O}(M\sqrt{dkT} + d\sqrt{kMT} + MT\sqrt{d}\,\zeta)$. Moreover, we prove a regret lower bound indicating that the regret of our algorithm is not improvable except for logarithmic factors in the regime $d > M$.

Compared with multi-task linear bandits, multi-task reinforcement learning is a more popular research topic, with a long line of works on both the theoretical and the empirical side (Taylor and Stone, 2009; Parisotto et al., 2015; Liu et al., 2016; Teh et al., 2017; Hessel et al., 2019; D'Eramo et al., 2019; Arora et al., 2020). We extend our algorithm for linear bandits to multi-task episodic reinforcement learning with linear value function approximation under low inherent Bellman error (Zanette et al., 2020a).
Assuming a low-rank linear representation across all the tasks, we propose a sample-efficient algorithm with regret $\tilde{O}(HM\sqrt{dkT} + Hd\sqrt{kMT} + HMT\sqrt{d}\,\mathcal{I})$, where $k$ is the dimension of the low-rank representation, $d$ is the ambient dimension of state-action features, $M$ is the number of tasks, $H$ is the horizon, $T$ is the number of episodes, and $\mathcal{I}$ denotes the inherent Bellman error. The regret significantly improves upon the baseline regret $\tilde{O}(HMd\sqrt{T} + HMT\sqrt{d}\,\mathcal{I})$ achieved by running the ELEANOR algorithm (Zanette et al., 2020a) for each task independently. We also prove a regret lower bound $\Omega(Mk\sqrt{HT} + d\sqrt{HkMT} + HMT\sqrt{d}\,\mathcal{I})$. To the best of our knowledge, this is the first provably sample-efficient algorithm for exploration in multi-task linear RL.
2. Preliminaries
We study the problem of representation learning for linear bandits in which there are multiple tasks sharing common low-dimensional features. Let $d$ be the ambient dimension and $k$ be the representation dimension. We play $M$ tasks concurrently for $T$ steps each. Each task $i \in [M]$ is associated with an unknown vector $\theta_i \in \mathbb{R}^d$. In each step $t \in [T]$, the player chooses one action $x_{t,i} \in \mathcal{A}_{t,i}$ for each task $i \in [M]$, and receives a batch of rewards $\{y_{t,i}\}_{i=1}^M$ afterwards, where $\mathcal{A}_{t,i}$ is the feasible action set (which can even be chosen adversarially) for task $i$ at step $t$. The rewards are determined by $y_{t,i} = \theta_i^\top x_{t,i} + \eta_{t,i}$, where $\eta_{t,i}$ is random noise.

We use the total regret over $M$ tasks in $T$ steps to measure the performance of our algorithm, defined as
$$\mathrm{Reg}(T) \overset{\mathrm{def}}{=} \sum_{t=1}^{T}\sum_{i=1}^{M}\big(\langle x^\star_{t,i}, \theta_i\rangle - \langle x_{t,i}, \theta_i\rangle\big), \quad \text{where } x^\star_{t,i} = \operatorname*{argmax}_{x \in \mathcal{A}_{t,i}} \langle x, \theta_i\rangle.$$
The main assumption is the existence of a common linear feature extractor.

Assumption 1.
There exists a linear feature extractor $B \in \mathbb{R}^{d\times k}$ and a set of $k$-dimensional coefficients $\{w_i\}_{i=1}^M$ such that $\{\theta_i\}_{i=1}^M$ satisfies $\theta_i = Bw_i$.
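To make Assumption 1 and the regret definition concrete, the following is a minimal simulation sketch. It is our own illustrative construction, not from the paper; all function and variable names are ours.

```python
import numpy as np

# Illustrative sketch (our construction, not the paper's): a synthetic
# instance satisfying Assumption 1, i.e. theta_i = B w_i with a shared B.
rng = np.random.default_rng(0)
d, k, M = 20, 3, 10

B = np.linalg.qr(rng.standard_normal((d, k)))[0]   # shared d x k feature extractor
W = rng.standard_normal((k, M))
W /= np.linalg.norm(B @ W, axis=0)                 # enforce ||theta_i||_2 <= 1
Theta = B @ W                                      # column i is theta_i = B w_i

def play_step(action_sets, choices):
    """One step t: action_sets[i] holds the arms of A_{t,i} as rows, and
    choices[i] is the index of the arm pulled for task i.  Returns the noisy
    rewards y_{t,i} and the instantaneous regret summed over tasks, matching
    the regret definition above."""
    rewards, regret = [], 0.0
    for i in range(M):
        means = action_sets[i] @ Theta[:, i]
        rewards.append(means[choices[i]] + rng.standard_normal())  # 1-sub-Gaussian noise
        regret += means.max() - means[choices[i]]
    return rewards, regret

# Example: five random unit-norm arms per task; naively pull arm 0 everywhere.
A_t = [rng.standard_normal((5, d)) for _ in range(M)]
A_t = [a / np.linalg.norm(a, axis=1, keepdims=True) for a in A_t]
_, inst_regret = play_step(A_t, [0] * M)
print(inst_regret)
```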
1. $\tilde{O}(\cdot)$ hides logarithmic factors.

Define $\mathcal{F}_t$ to be the $\sigma$-algebra induced by $\sigma(\{x_{1,i}\}_{i=1}^M, \cdots, \{x_{t+1,i}\}_{i=1}^M, \{\eta_{1,i}\}_{i=1}^M, \cdots, \{\eta_{t,i}\}_{i=1}^M)$; then we have the following assumption.

Assumption 2.
Following the standard regularity assumptions in linear bandits (Abbasi-Yadkori et al., 2011; Lattimore and Szepesvári, 2020), we assume
- $\|\theta_i\|_2 \le 1$ for all $i \in [M]$;
- $\|x\|_2 \le 1$ for all $x \in \mathcal{A}_{t,i}$, $t \in [T]$, $i \in [M]$;
- $\eta_{t,i}$ is a conditionally zero-mean $1$-sub-Gaussian random variable with regard to $\mathcal{F}_{t-1}$.

For notational convenience, we use $X_{t,i} = [x_{1,i}, x_{2,i}, \cdots, x_{t,i}]$ and $y_{t,i} = [y_{1,i}, \cdots, y_{t,i}]^\top$ to denote the arms and the corresponding rewards collected for task $i \in [M]$ in the first $t$ steps, and we use $\eta_{t,i} = [\eta_{1,i}, \eta_{2,i}, \cdots, \eta_{t,i}]^\top$ to denote the corresponding noise. We define $\Theta \overset{\mathrm{def}}{=} [\theta_1, \theta_2, \cdots, \theta_M]$ and $W \overset{\mathrm{def}}{=} [w_1, w_2, \cdots, w_M]$. For any positive definite matrix $A \in \mathbb{R}^{d\times d}$, the Mahalanobis norm with regard to $A$ is denoted by $\|x\|_A = \sqrt{x^\top A x}$.

We also study how this low-rank structure benefits the exploration problem with approximate linear value functions in multi-task episodic reinforcement learning. For reference convenience, we abbreviate our setting as the multi-task LSVI setting, a natural extension of the LSVI condition in the single-task setting (Zanette et al., 2020a).

Consider an undiscounted episodic MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, p, r, H)$ with state space $\mathcal{S}$, action space $\mathcal{A}$, and fixed horizon $H$. For any $h \in [H]$, any state $s_h \in \mathcal{S}$ and action $a_h \in \mathcal{A}$, the agent receives a reward $R_h(s_h, a_h)$ with mean $r_h(s_h, a_h)$, and transits to the next state $s_{h+1}$ according to the transition kernel $p_h(\cdot \mid s_h, a_h)$. The action value function at step $h$ under a deterministic policy $\pi$ is defined as $Q^\pi_h(s_h, a_h) \overset{\mathrm{def}}{=} r_h(s_h, a_h) + \mathbb{E}\big[\sum_{t=h+1}^H R_t(s_t, \pi_t(s_t))\big]$, and the state value function is defined as $V^\pi_h(s_h) = Q^\pi_h(s_h, \pi_h(s_h))$.

Note that there always exists an optimal deterministic policy (under some regularity conditions) $\pi^*$ for which $V^{\pi^*}_h(s) = \max_\pi V^\pi_h(s)$ and $Q^{\pi^*}_h(s,a) = \max_\pi Q^\pi_h(s,a)$ for each $h \in [H]$. We denote $V^{\pi^*}_h$ and $Q^{\pi^*}_h$ by $V^*_h$ and $Q^*_h$ for short. It is also convenient to define the Bellman optimality operator $\mathcal{T}_h$ as $\mathcal{T}_h(Q_{h+1})(s,a) \overset{\mathrm{def}}{=} r_h(s,a) + \mathbb{E}_{s' \sim p_h(\cdot\mid s,a)} \max_{a'} Q_{h+1}(s', a')$.

In the framework of single-task approximate linear value functions (see Section 5 for more discussion), we assume a feature map $\phi: \mathcal{S}\times\mathcal{A} \to \mathbb{R}^d$ that maps each state-action pair to a $d$-dimensional vector. In case $\mathcal{S}$ is too large or continuous (e.g., in robotics), this feature map helps to reduce the problem scale from $|\mathcal{S}|\times|\mathcal{A}|$ to $d$. The value functions are linear combinations of these features, so we define the function spaces at step $h \in [H]$ as $\mathcal{Q}'_h = \{Q_h(\theta_h) \mid \theta_h \in \Theta'_h\}$ and $\mathcal{V}'_h = \{V_h(\theta_h) \mid \theta_h \in \Theta'_h\}$, where $Q_h(\theta_h)(s,a) \overset{\mathrm{def}}{=} \phi(s,a)^\top\theta_h$ and $V_h(\theta_h)(s) \overset{\mathrm{def}}{=} \max_a \phi(s,a)^\top\theta_h$.

In order to find the optimal value function using value iteration with $\mathcal{Q}_h$, we require that the function space is approximately closed under $\mathcal{T}_h$, as measured by the inherent Bellman error (IBE for short). The IBE (Zanette et al., 2020a) at step $h$ is defined as
$$\mathcal{I}_h \overset{\mathrm{def}}{=} \sup_{Q_{h+1}\in\mathcal{Q}_{h+1}}\ \inf_{Q_h\in\mathcal{Q}_h}\ \sup_{s\in\mathcal{S}, a\in\mathcal{A}} \big|(Q_h - \mathcal{T}_h(Q_{h+1}))(s,a)\big|. \tag{1}$$

In multi-task reinforcement learning, we have $M$ MDPs $\mathcal{M}^1, \mathcal{M}^2, \ldots, \mathcal{M}^M$ (we use superscript $i$ to denote task $i$). Assume they share the same state space and action space, but have different rewards and transitions.
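As a quick sanity check on definition (1), consistent with the remark later that the low-IBE setting covers linear MDPs as a special case: in a linear MDP the function class is exactly closed under the Bellman operator, so $\mathcal{I}_h = 0$. A worked version of this standard fact, under the usual linear-MDP assumptions on $r_h$ and $p_h$:

```latex
% In a linear MDP, r_h(s,a) = \phi(s,a)^\top \mu_h and
% p_h(s' \mid s,a) = \phi(s,a)^\top \nu_h(s') for some \mu_h and measure \nu_h.
% Then for any Q_{h+1} \in \mathcal{Q}_{h+1},
\mathcal{T}_h(Q_{h+1})(s,a)
  = \phi(s,a)^\top \mu_h
  + \mathbb{E}_{s' \sim p_h(\cdot \mid s,a)}\Big[\max_{a'} Q_{h+1}(s',a')\Big]
  = \phi(s,a)^\top \underbrace{\Big(\mu_h
      + \int \nu_h(s') \max_{a'} Q_{h+1}(s',a')\,\mathrm{d}s'\Big)}_{=:\,\theta_h},
% so \mathcal{T}_h(Q_{h+1}) = Q_h(\theta_h) is again linear in \phi, and the
% infimum in Eqn (1) is attained with value 0 (provided \theta_h \in \Theta'_h).
```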
To take advantage of the multi-task LSVI setting and low-rank representation learning, we define a joint parameter space for all the tasks as $\Theta_h \overset{\mathrm{def}}{=} \{(B_hw^1_h, B_hw^2_h, \cdots, B_hw^M_h) : B_h \in \mathcal{O}^{d\times k},\ w^i_h \in \mathbb{B}^k,\ B_hw^i_h \in \Theta'^i_h\}$, where $\mathcal{O}^{d\times k}$ is the collection of all orthonormal matrices in $\mathbb{R}^{d\times k}$ and $\mathbb{B}^k$ is the Euclidean ball in $\mathbb{R}^k$.

The induced function spaces are defined as
$$\mathcal{Q}_h \overset{\mathrm{def}}{=} \big\{\big(Q^1_h(\theta^1_h), Q^2_h(\theta^2_h), \cdots, Q^M_h(\theta^M_h)\big) \mid \big(\theta^1_h, \theta^2_h, \cdots, \theta^M_h\big) \in \Theta_h\big\} \tag{2}$$
$$\mathcal{V}_h \overset{\mathrm{def}}{=} \big\{\big(V^1_h(\theta^1_h), V^2_h(\theta^2_h), \cdots, V^M_h(\theta^M_h)\big) \mid \big(\theta^1_h, \theta^2_h, \cdots, \theta^M_h\big) \in \Theta_h\big\} \tag{3}$$

The low-rank IBE at step $h$ for the multi-task LSVI setting is a generalization of the IBE (Eqn 1) in the single-task setting, defined accordingly as
$$\mathcal{I}^{\mathrm{mul}}_h \overset{\mathrm{def}}{=} \sup_{\{Q^i_{h+1}\}_{i=1}^M \in \mathcal{Q}_{h+1}}\ \inf_{\{Q^i_h\}_{i=1}^M \in \mathcal{Q}_h}\ \sup_{s\in\mathcal{S}, a\in\mathcal{A}, i\in[M]} \big|\big(Q^i_h - \mathcal{T}^i_h(Q^i_{h+1})\big)(s,a)\big| \tag{4}$$

Assumption 3. $\mathcal{I} \overset{\mathrm{def}}{=} \sup_h \mathcal{I}^{\mathrm{mul}}_h$ is small with regard to the joint function space $\mathcal{Q}_h$ for all $h$.

When $\mathcal{I} = 0$, Assumption 3 can be regarded as a natural extension of Assumption 1 to episodic RL. This is because, in the case $\mathcal{I} = 0$, there exists $\{\bar\theta^{i*}_h\}_{i=1}^M \in \Theta_h$ such that $Q^{i*}_h = Q^i_h(\bar\theta^{i*}_h)$ for all $i \in [M]$ and $h \in [H]$; according to the definition of $\Theta_h$, the $\{\bar\theta^{i*}_h\}_{i=1}^M$ then admit the low-rank property that Assumption 1 stipulates. When $\mathcal{I} > 0$, Assumption 3 is an extension of misspecified multi-task linear bandits (discussed in Section 4.3) to episodic RL.

Define the filtration $\mathcal{F}_{h,t}$ to be the $\sigma$-field induced by all the random variables up to step $h$ in episode $t$ (not including the rewards at step $h$ in episode $t$); then we have the following assumption.

Assumption 4.
Following the parameter scale in Zanette et al. (2020a), we assume
- $\|\phi(s,a)\|_2 \le 1$ for all $(s,a) \in \mathcal{S}\times\mathcal{A}$, $h \in [H]$;
- $0 \le Q^\pi_h(s,a) \le 1$ for all $(s,a) \in \mathcal{S}\times\mathcal{A}$, $h \in [H]$, and all $\pi$;
- there exists a constant $D$ such that for any $h \in [H]$ and any $\{\theta^i_h\}_{i=1}^M \in \Theta_h$, it holds that $\|\theta^i_h\|_2 \le D$ for all $i \in [M]$;
- for any fixed $\{Q^i_{h+1}\}_{i=1}^M \in \mathcal{Q}_{h+1}$, the random noise $z^i_h(s,a) \overset{\mathrm{def}}{=} R^i_h(s,a) + \max_{a'} Q^i_{h+1}(s', a') - \mathcal{T}^i_h(Q^i_{h+1})(s,a)$ is bounded in $[-1, 1]$ almost surely, and is independent conditioned on $\mathcal{F}_{h,t}$ for any $s \in \mathcal{S}$, $a \in \mathcal{A}$, $h \in [H]$, $i \in [M]$, where the randomness is from the reward $R$ and $s' \sim p_h(\cdot\mid s,a)$.

The first condition is a standard regularity condition for linear features. The second condition sets the scale of the problem; exploration problems in which the value function is bounded in $[0,1]$ have also been studied in both the tabular and the linear setting (Zhang et al., 2020; Wang et al., 2020; Zanette et al., 2020a). The last two conditions are compatible with this scale: it is sufficient to assume a constant norm bound on $\theta^i_h$ since the optimal value function is of the same scale, and the last condition is standard in linear bandits (Abbasi-Yadkori et al., 2011; Lattimore and Szepesvári, 2020) and RL (Zanette et al., 2020a), being automatically satisfied if $D = 1$.

The total regret of $M$ tasks in $T$ episodes is defined as
$$\mathrm{Reg}(T) \overset{\mathrm{def}}{=} \sum_{t=1}^T\sum_{i=1}^M \big(V^{i*}_1 - V^{\pi^i_t}_1\big)(s^i_{1t}) \tag{5}$$
where $\pi^i_t$ is the policy used for task $i$ in episode $t$, and $s^i_{ht}$ denotes the state encountered at step $h$ in episode $t$ for task $i$. We assume $M \ge 2$ and $T \ge 2$.

3. Related Work

Multi-task Supervised Learning
The idea of multi-task representation learning dates back at least to Caruana (1997); Thrun and Pratt (1998); Baxter (2000). Empirically, representation learning has shown great power in various domains; we refer readers to Bengio et al. (2013) for a detailed review of empirical results. From the theoretical perspective, Baxter (2000) performed the first theoretical analysis and gave sample complexity bounds using covering numbers. Maurer et al. (2016) considered the setting where all tasks are sampled from a certain distribution, and analysed the benefit of representation learning for reducing the sample complexity of the target task. Following their results, Du et al. (2020) and Tripuraneni et al. (2020) replaced the i.i.d. assumption with a deterministic assumption on the data distribution and task diversity, and proposed efficient algorithms that can fully utilize all source data with better sample complexity. These results mainly focus on statistical rates for multi-task supervised learning, and cannot tackle the exploration problem in bandits and RL.
Multi-task Bandit Learning
For multi-task linear bandits, the most related work is a recent paper by Yang et al. (2020). For linear bandits with an infinite action set, they proposed an explore-then-exploit algorithm with regret $\tilde{O}(Mk\sqrt{T} + d^{1.5}k\sqrt{MT})$, which outperforms the naive approach with $\tilde{O}(Md\sqrt{T})$ regret in the regime where $M = \Omega(dk)$. Though their results are insightful, they require the action sets for all tasks and all steps to be the same well-conditioned $d$-dimensional ellipsoid, which covers all directions nicely with constant radius. Besides, they assume that the task parameters are diverse enough that $WW^\top$ is well-conditioned, and that the norm of each $w_i$ is lower bounded by a constant. These assumptions restrict the application of their theory to a subset of linear bandit instances with benign structure. In contrast, our theory is more general: we assume neither identical, well-conditioned action sets across tasks and time steps, nor benign properties of the $w_i$'s.

Multi-task RL
For multi-task reinforcement learning, there is a long line of works from the empirical perspective (Taylor and Stone, 2009; Parisotto et al., 2015; Liu et al., 2016; Teh et al., 2017; Hessel et al., 2019). From the theoretical perspective, Brunskill and Li (2013) analyzed the sample complexity of multi-task RL in the tabular setting. D'Eramo et al. (2019) showed that representation learning can improve the rate of approximate value iteration. Arora et al. (2020) proved that representation learning can reduce the sample complexity of imitation learning.
Bandits with Low Rank Structure
Low-rank representations have also been explored in single-task settings. Jun et al. (2019) studied bilinear bandits with low-rank representation: the mean reward in their setting is the bilinear form $x^\top\Theta y$, where $x$ and $y$ are two actions selected at each step, and $\Theta$ is an unknown low-rank parameter matrix. Their setting is further generalized by Lu et al. (2020). Furthermore, sparse linear bandits can be regarded as a simplified setting, where $B$ is a binary matrix indicating the subset of relevant features in the context $x$ (Abbasi-Yadkori et al., 2012; Carpentier and Munos, 2012; Lattimore et al., 2015; Hao et al., 2020).

Exploration in Bandits and RL
Our regret analysis is also related to exploration in single-task linear bandits and linear RL. Linear bandits have been extensively studied in recent years (Auer, 2002; Dani et al., 2008; Rusmevichientong and Tsitsiklis, 2010; Abbasi-Yadkori et al., 2011; Chu et al., 2011; Li et al., 2019a,b). Our algorithm is most closely related to the seminal work of Abbasi-Yadkori et al. (2011), who applied self-normalized techniques to obtain near-optimal regret upper bounds. For single-task linear RL, recent years have witnessed a large number of works under different function approximation settings, including linear MDPs (Yang and Wang, 2019; Jin et al., 2020), linear mixture MDPs (Ayoub et al., 2020; Zhou et al., 2020a), linear RL with low inherent Bellman error (Zanette et al., 2020a,b), and MDPs with low Bellman rank (Jiang et al., 2017). Our multi-task setting is a natural extension of the linear RL setting with low inherent Bellman error, which covers the linear MDP setting as a special case (Zanette et al., 2020a).
4. Main Results for Linear Bandits
In this section, we present our main results for multi-task linear bandits.
4.1 Algorithm

A natural and successful principle for designing efficient sequential decision-making algorithms is optimism in the face of uncertainty. When applied to single-task linear bandits, the basic idea is to maintain a confidence set $\mathcal{C}_t$ for the parameter $\theta$ based on the history observations at each step $t \in [T]$. The algorithm chooses an optimistic estimate $\tilde\theta_t = \operatorname{argmax}_{\theta\in\mathcal{C}_t}(\max_{x\in\mathcal{A}_t}\langle x, \theta\rangle)$ and then selects the action $x_t = \operatorname{argmax}_{x\in\mathcal{A}_t}\langle x, \tilde\theta_t\rangle$, which maximizes the reward according to the estimate $\tilde\theta_t$. In other words, the algorithm chooses the pair $(x_t, \tilde\theta_t) = \operatorname{argmax}_{(x,\theta)\in\mathcal{A}_t\times\mathcal{C}_t}\langle x, \theta\rangle$.

For multi-task linear bandits, the main difference is that we need to tackle $M$ highly correlated tasks concurrently. To obtain a tighter confidence bound, we maintain the confidence set $\mathcal{C}_t$ for $B$ and $\{w_i\}_{i=1}^M$, then choose the optimistic estimate $\tilde\Theta_t$ for all tasks concurrently. To be more specific, the algorithm chooses an optimistic estimate $\tilde\Theta_t = \operatorname{argmax}_{\Theta\in\mathcal{C}_t}(\max_{\{x_i\in\mathcal{A}_{t,i}\}_{i=1}^M}\sum_{i=1}^M\langle x_i, \theta_i\rangle)$, and then selects the action $x_{t,i} = \operatorname{argmax}_{x_i\in\mathcal{A}_{t,i}}\langle x_i, \tilde\theta_{t,i}\rangle$ for each task $i \in [M]$.

The main technical contribution is the construction of a tighter confidence set $\mathcal{C}_t$ for the estimation of $\Theta$. At each step $t \in [T]$, we solve the following least-squares problem based on the samples collected so far and obtain the minimizer $\hat B_t$ and $\hat W_t$:
$$\operatorname*{argmin}_{B\in\mathbb{R}^{d\times k},\ w_{1..M}\in\mathbb{R}^{k\times M}}\ \sum_{i=1}^M \big\|y_{t-1,i} - X^\top_{t-1,i}Bw_i\big\|_2^2 \tag{6}$$
$$\mathrm{s.t.}\ \|Bw_i\|_2 \le 1,\ \forall i \in [M]. \tag{7}$$
We maintain a high-probability confidence set $\mathcal{C}_t$ for the unknown parameters $B$ and $\{w_i\}_{i=1}^M$, calculated as
$$\mathcal{C}_t \overset{\mathrm{def}}{=} \Big\{\Theta = BW :\ \sum_{i=1}^M \big\|\hat B_t\hat w_{t,i} - Bw_i\big\|^2_{\tilde V_{t-1,i}(\lambda)} \le L,\ B\in\mathbb{R}^{d\times k},\ w_i\in\mathbb{R}^k,\ \|Bw_i\|_2 \le 1,\ \forall i\in[M]\Big\}, \tag{8}$$
where $L = \tilde{O}(Mk + kd)$ (see Appendix A.1 for the exact value) and $\tilde V_{t-1,i}(\lambda) = X_{t-1,i}X^\top_{t-1,i} + \lambda I_d$. Here $\lambda$ is a hyperparameter that ensures $\tilde V_{t-1,i}(\lambda)$ is always invertible, and can be set to $1$. We can guarantee that $\Theta \in \mathcal{C}_t$ for all $t \in [T]$ with high probability by the following lemma.

Lemma 1.
With probability at least $1-\delta$, for any step $t \in [T]$, letting $\hat\Theta_t = \hat B_t\hat W_t$ be the optimal solution of the least-squares regression (Eqn 6), the true parameter $\Theta = BW$ is always contained in the confidence set $\mathcal{C}_t$, i.e.,
$$\sum_{i=1}^M \big\|\hat B_t\hat w_{t,i} - Bw_i\big\|^2_{\tilde V_{t-1,i}(\lambda)} \le L, \tag{9}$$
where $\tilde V_{t-1,i}(\lambda) = X_{t-1,i}X^\top_{t-1,i} + \lambda I_d$.
If we solve each task independently with standard single-task algorithms such as OFUL (Abbasi-Yadkori et al., 2011), it is not hard to see that we can only obtain a confidence set with $\sum_{i=1}^M \|\hat B_t\hat w_{t,i} - Bw_i\|^2_{\tilde V_{t-1,i}(\lambda)} \le L = \tilde{O}(Md)$. Our confidence bound is much sharper than this naive bound, which explains the improvement in our final regret. Compared with Yang et al. (2020), we are not able to estimate $B$ and $W$ directly by their methods due to our more relaxed bandit setting: in our setting, the empirical design matrix $\tilde V_{t-1,i}(\lambda)$ can be quite ill-conditioned if the action set at each step is chosen adversarially. Thus, we have to establish a tighter confidence set to improve the regret bound.

We only sketch the main idea of the proof of Lemma 1 here and defer the detailed argument to Appendix A.1. Considering the non-trivial case where $d \ge 2k$, our main observation is that both $BW$ and $\hat B_t\hat W_t$ are low-rank matrices with rank upper bounded by $k$, which indicates that $\mathrm{rank}(\hat B_t\hat W_t - BW) \le 2k$. Therefore, we can write $\hat B_t\hat W_t - BW = U_tR_t = [U_tr_{t,1}, U_tr_{t,2}, \cdots, U_tr_{t,M}]$, where $U_t \in \mathbb{R}^{d\times 2k}$ is an orthonormal matrix and $R_t \in \mathbb{R}^{2k\times M}$. Thus we have $X^\top_{t-1,i}(\hat B_t\hat w_{t,i} - Bw_i) = (U^\top_tX_{t-1,i})^\top r_{t,i}$. This observation indicates that we can project the history actions $X_{t-1,i}$ onto a $2k$-dimensional space with $U_t$, and treat $U^\top_tX_{t-1,i}$ as the $2k$-dimensional actions selected in the first $t-1$ steps. In this way, we can relate $\sum_{i=1}^M\|\hat B_t\hat w_{t,i} - Bw_i\|^2_{\tilde V_{t-1,i}(\lambda)}$ to the term $\sum_{i=1}^M\|\eta^\top_{t-1,i}(U^\top_tX_{t-1,i})^\top\|^2_{V^{-1}_{t-1,i}(\lambda)}$, where $V_{t-1,i}(\lambda) \overset{\mathrm{def}}{=} (U^\top_tX_{t-1,i})(U^\top_tX_{t-1,i})^\top + \lambda I$. We bound this term for fixed $U_t$ with the technique of self-normalized bounds for vector-valued martingales (Abbasi-Yadkori et al., 2011), and then apply the $\epsilon$-net trick to cover all possible $U_t$. This leads to an upper bound for $\sum_{i=1}^M\|\eta^\top_{t-1,i}X^\top_{t-1,i}U_t\|^2_{V^{-1}_{t-1,i}(\lambda)}$, and consequently yields the bound in Lemma 1.

We describe our Multi-Task Low-Rank OFUL algorithm in Algorithm 1.

Algorithm 1 Multi-Task Low-Rank OFUL (MTLR-OFUL)
  for step $t = 1, 2, \cdots, T$ do
    Calculate the confidence set $\mathcal{C}_t$ by Eqn 8
    $(\tilde\Theta_t, \{x_{t,i}\}_{i=1}^M) = \operatorname{argmax}_{\Theta\in\mathcal{C}_t,\ x_i\in\mathcal{A}_{t,i}} \sum_{i=1}^M \langle x_i, \theta_i\rangle$
    for task $i = 1, 2, \cdots, M$ do
      Play $x_{t,i}$ for task $i$, and observe the reward $y_{t,i}$
    end for
  end for
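The least-squares problem in Eqns (6)-(7) is a rank-constrained (non-convex) regression, and the paper does not commit to a particular solver (Section 6 leaves computational efficiency open). As a rough illustration of what computing $(\hat B_t, \hat W_t)$ could look like, here is a sketch based on alternating minimization, a common heuristic for such problems; the function name, ridge regularizer, and projection step are ours, not the paper's.

```python
import numpy as np

def fit_low_rank(X_list, y_list, k, n_iters=50, ridge=1e-8, seed=0):
    """Heuristic solver for Eqns (6)-(7): minimize over B (d x k) and
    w_1..w_M (each in R^k) the loss sum_i ||y_i - X_i^T B w_i||^2, with
    ||B w_i||_2 <= 1 imposed by a final rescaling.  X_list[i] is the d x t
    matrix X_{t,i} whose columns are the actions played for task i."""
    d, M = X_list[0].shape[0], len(X_list)
    B = np.linalg.qr(np.random.default_rng(seed).standard_normal((d, k)))[0]
    W = np.zeros((k, M))
    for _ in range(n_iters):
        # w-step: with B fixed, each task is an independent ridge regression
        # in the k-dimensional projected feature space Z = X^T B.
        for i, (X, y) in enumerate(zip(X_list, y_list)):
            Z = X.T @ B
            W[:, i] = np.linalg.solve(Z.T @ Z + ridge * np.eye(k), Z.T @ y)
        # B-step: with all w_i fixed, x^T B w = kron(w, x)^T vec(B) (column-
        # major vec), so vec(B) solves one big ridge regression.
        rows = [np.kron(W[:, i], X[:, j])
                for i, X in enumerate(X_list) for j in range(X.shape[1])]
        A = np.asarray(rows)
        b = np.concatenate(y_list)
        vecB = np.linalg.solve(A.T @ A + ridge * np.eye(d * k), A.T @ b)
        B = vecB.reshape((d, k), order="F")
    scale = max(1.0, np.max(np.linalg.norm(B @ W, axis=0)))  # ||B w_i|| <= 1
    return B, W / scale
```

Note that, given $(\hat B_t, \hat W_t)$, the optimistic selection in Algorithm 1 still requires maximizing $\sum_{i}\langle x_i, \theta_i\rangle$ jointly over $\mathcal{C}_t$ and the action sets; this sketch addresses only the regression step.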
4.2 Regret Bound

The following theorem states a bound on the regret of Algorithm 1.

Theorem 1.
Suppose Assumption 1 holds. Then, with probability at least $1-\delta$, the regret of Algorithm 1 is bounded by
$$\mathrm{Reg}(T) = \tilde{O}\big(M\sqrt{dkT} + d\sqrt{kMT}\big). \tag{10}$$

We defer the proof of Theorem 1 to Appendix A.2. The first term in the regret has a linear dependence on $M$; it characterizes the regret incurred by learning the parameters $w_i$ for each task. The second term has a square-root dependence on the total number of samples $MT$, which reflects the cost of learning the common representation with samples from all $M$ tasks. Dividing the total regret by the number of tasks $M$, the average regret per task is $\tilde{O}(\sqrt{dkT} + d\sqrt{kT/M})$. Note that if we solve the $M$ tasks independently with algorithms such as OFUL (Abbasi-Yadkori et al., 2011), the regret per task is $\tilde{O}(d\sqrt{T})$. Our bound thus saves a factor of $\sqrt{d/k}$ over the naive method by leveraging the common representation features. We also show that when $d > M$ our regret bound is near-optimal (see Theorem 3).
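For concreteness, plugging illustrative values $d = 100$, $k = 5$, $M = 50$ into the two bounds (suppressing constants and logarithmic factors):

```latex
\underbrace{Md\sqrt{T}}_{\text{independent OFUL}} = 5000\sqrt{T},
\qquad
\underbrace{M\sqrt{dkT} + d\sqrt{kMT}}_{\text{Theorem 1}}
  = 50\sqrt{500}\,\sqrt{T} + 100\sqrt{250}\,\sqrt{T} \approx 2700\sqrt{T},
```

roughly a $1.9\times$ saving at these sizes. Since the representation-learning term shrinks relative to the first term as $M$ grows, the saving approaches the full $\sqrt{d/k} \approx 4.5$ factor once $M \gg d$.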
4.3 Misspecified Multi-task Linear Bandits

For the multi-task linear bandit problem, it is relatively unrealistic to assume a common feature extractor that fits the reward functions of all $M$ tasks exactly. A more natural situation is that the underlying reward functions are not exactly linear but carry some misspecification. There are related discussions of single-task linear bandits in recent works (Lattimore et al., 2020; Zanette et al., 2020a). We first present a definition of approximately linear bandit learning in the multi-task setting.

Assumption 5. There exists a linear feature extractor $B \in \mathbb{R}^{d\times k}$ and a set of linear coefficients $\{w_i\}_{i=1}^M$ such that the expected reward $\mathbb{E}[y_i \mid x_i]$ for any action $x_i \in \mathbb{R}^d$ satisfies $|\mathbb{E}[y_i \mid x_i] - \langle x_i, Bw_i\rangle| \le \zeta$.

In general, an algorithm designed for a linear model could break down entirely if the underlying model is not linear. However, our algorithm is in fact robust to small model misspecification if we set $L = \tilde{O}(Mk + kd + MT\zeta^2)$ (see Appendix A.4 for the exact value). The following regret bound holds under Assumption 5 if we slightly modify the hyperparameter $L$ in the definition of the confidence region $\mathcal{C}_t$.

Theorem 2.
Under Assumption 5, with probability at least $1-\delta$, the regret of Algorithm 1 is bounded by
$$\mathrm{Reg}(T) = \tilde{O}\big(M\sqrt{dkT} + d\sqrt{kMT} + MT\sqrt{d}\,\zeta\big). \tag{11}$$

Theorem 2 is proved in Appendix A.4. Compared with Theorem 1, there is an additional $\tilde{O}(MT\sqrt{d}\,\zeta)$ term in the regret of Theorem 2. This additional term is inevitably linear in $MT$ due to the intrinsic bias introduced by linear function approximation. Note that our algorithm still enjoys a good theoretical guarantee when $\zeta$ is sufficiently small.

4.4 Regret Lower Bound

In this subsection, we present a regret lower bound for the multi-task linear bandit problem under Assumption 5.
Theorem 3.
For any $k, M, d, T \in \mathbb{Z}_+$ with $k \le d \le T$ and $k \le M$, and any learning algorithm $\mathcal{A}$, there exists a multi-task linear bandit instance satisfying Assumption 5 such that the regret of algorithm $\mathcal{A}$ is lower bounded by
$$\mathrm{Reg}(T) \ge \Omega\big(Mk\sqrt{T} + d\sqrt{kMT} + MT\sqrt{d}\,\zeta\big).$$

We defer the proof of Theorem 3 to Appendix A.5. By setting $\zeta = 0$, Theorem 3 yields the lower bound for the multi-task linear bandit problem under Assumption 1, which is $\Omega(Mk\sqrt{T} + d\sqrt{kMT})$. These lower bounds match the upper bounds in Theorem 1 and Theorem 2, respectively, in the regime where $d > M$. There remains a gap of $\sqrt{d/k}$ in the first term of the regret. For the upper bounds, the main difficulty in obtaining $\tilde{O}(Mk\sqrt{T})$ regret in the first term comes from the estimation of $B$: since the action sets are not fixed and can be ill-conditioned, we cannot follow the explore-then-exploit framework and estimate $B$ at the beginning. Besides, explore-then-exploit algorithms always suffer $\tilde{O}(T^{2/3})$ regret in the general linear bandit setting without further assumptions. Without estimating $B$ beforehand with enough accuracy, exploration in the original $d$-dimensional space can be redundant, since we cannot identify actions that have similar $k$-dimensional representations before pulling them. We conjecture that our upper bound is tight and leave closing the gap as future work.
5. Main Results for Linear RL
We now present the main results for multi-task episodic reinforcement learning under the assumption of low inherent Bellman error (i.e., the multi-task LSVI setting).
In exploration problems in RL where linear value function approximation is employed (Yang and Wang, 2019; Jin et al., 2020; Yang and Wang, 2020), LSVI-based algorithms are usually very effective when the linear value function space is (approximately) closed under the Bellman operator. For example, it has been shown that an LSVI-based algorithm with an additional bonus can solve the exploration challenge effectively in low-rank MDPs (Jin et al., 2020), where the function spaces $\mathcal{Q}_h, \mathcal{Q}_{h+1}$ are exactly closed under the Bellman operator (i.e., any function $Q_{h+1} \in \mathcal{Q}_{h+1}$ composed with the Bellman operator, $\mathcal{T}_hQ_{h+1}$, belongs to $\mathcal{Q}_h$). To relax such strong assumptions, the inherent Bellman error of an MDP (Eqn 1) was proposed to measure how close the function space is to being closed under the Bellman operator (Zanette et al., 2020a). We extend the definition of IBE to the multi-task LSVI setting (Eqn 4), show that our refined confidence set for the least-squares estimator can be applied to the low-rank multi-task LSVI setting, and give an optimism-based algorithm with a sharper regret bound than naively exploring in each task independently.

MTLR-LSVI (Algorithm 2) follows LSVI-based algorithms (Jin et al., 2020; Zanette et al., 2020a) in building an (optimistic) estimator for the optimal value functions. To understand how this works in the multi-task LSVI setting, we first take a glance at how LSVI-based algorithms work in the single-task LSVI setting. In traditional value iteration algorithms, we perform an approximate Bellman backup in episode $t$ for each step $h \in [H]$ on the estimator $\bar Q_{h+1,t-1}$ constructed at the end of episode $t-1$, and find the best approximator of $\mathcal{T}_h(\bar Q_{h+1,t-1})$ in the function space $\mathcal{Q}_h$. Since we assume linear function spaces, we can take the least-squares solution of the empirical Bellman backup on $\bar Q_{h+1,t-1}$ as the best approximator.
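In the single-task case this backup is an ordinary ridge regression with an explicit solution, written out here for concreteness (a standard identity; the multi-task version below replaces it with a rank-constrained problem):

```latex
\hat{\theta}_{h} = \Big(\sum_{j=1}^{t-1} \phi_{hj}\phi_{hj}^{\top} + \lambda I\Big)^{-1}
  \sum_{j=1}^{t-1} \phi_{hj}\Big(R_{hj} + V_{h+1}\big(\bar{\theta}_{h+1}\big)(s_{h+1,j})\Big),
\qquad
V_{h+1}(\theta)(s) = \max_{a}\ \phi(s,a)^{\top}\theta .
```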
In the multi-task framework, given an estimator $Q_{h+1}(\theta^i_{h+1})$ for each $i \in [M]$, to apply such least-squares value iteration to our low-rank multi-task LSVI setting, we use the solution to the following constrained optimization problem
$$\min\ \sum_{i=1}^M\sum_{j=1}^{t-1}\Big((\phi^i_{hj})^\top\theta^i_h - R^i_{hj} - V^i_{h+1}(\theta^i_{h+1})(s^i_{h+1,j})\Big)^2 \tag{12}$$
$$\mathrm{s.t.}\ \theta^1_h, \theta^2_h, \ldots, \theta^M_h\ \text{lie in a } k\text{-dimensional subspace} \tag{13}$$
to approximate the Bellman update in the $t$-th episode, where $\phi^i_{hj} = \phi_h(s^i_{hj}, a^i_{hj})$ is the feature observed at step $h$ in episode $j$ for task $i$, and similarly $R^i_{hj} = R_h(s^i_{hj}, a^i_{hj})$.

To guarantee the optimistic property of our estimator, we follow the global optimization procedure of Zanette et al. (2020a), which solves the following optimization problem in the $t$-th episode.

Algorithm 2 Multi-Task Low-Rank LSVI (MTLR-LSVI)
  Input: low-rank parameter $k$, failure probability $\delta$, regularization $\lambda = 1$, inherent Bellman error $\mathcal{I}$
  Initialize $\tilde V^i_{h1} = \lambda I$ for $h \in [H]$, $i \in [M]$
  for episode $t = 1, 2, \cdots$ do
    Compute $\alpha_{ht}$ for $h \in [H]$ (see Lemma 9)
    Solve the global optimization problem in Definition 1
    Compute $\pi^i_{ht}(s) = \operatorname{argmax}_a \phi(s,a)^\top\bar\theta^i_{ht}$
    Execute $\pi^i_{ht}$ for task $i$ at step $h$
    Collect $\{s^i_{ht}, a^i_{ht}, r(s^i_{ht}, a^i_{ht})\}$ for episode $t$
  end for

Definition 1 (Global Optimization Procedure).
$$\max_{\bar\xi^i_h,\ \hat\theta^i_h,\ \bar\theta^i_h}\ \sum_{i=1}^M \max_{a^i} \big(\phi(s^i_1, a^i)\big)^\top\bar\theta^i_1 \tag{14}$$
$$\mathrm{s.t.}\ \big(\hat\theta^1_h, \ldots, \hat\theta^M_h\big) = \hat B_h\big[\hat w^1_h\ \hat w^2_h\ \cdots\ \hat w^M_h\big] = \operatorname*{argmin}_{\|B_hw^i_h\|_2 \le D}\ \sum_{i=1}^M\sum_{j=1}^{t-1} \mathcal{L}(B_h, w^i_h) \tag{15}$$
$$\bar\theta^i_h = \hat\theta^i_h + \bar\xi^i_h; \qquad \sum_{i=1}^M \big\|\bar\xi^i_h\big\|^2_{\tilde V^i_{ht}(\lambda)} \le \alpha^2_{ht} \tag{16}$$
$$\big(\bar\theta^1_h, \bar\theta^2_h, \cdots, \bar\theta^M_h\big) \in \Theta_h \tag{17}$$
where the empirical least-squares loss is $\mathcal{L}(B_h, w^i_h) \overset{\mathrm{def}}{=} \big((\phi^i_{hj})^\top B_hw^i_h - R^i_{hj} - V^i_{h+1}(\bar\theta^i_{h+1})(s^i_{h+1,j})\big)^2$, and $\tilde V^i_{ht}(\lambda) \overset{\mathrm{def}}{=} \sum_{j=1}^{t-1}(\phi^i_{hj})(\phi^i_{hj})^\top + \lambda I$ is the regularized empirical design matrix for task $i$ in episode $t$.

We have three types of variables in this global optimization problem: $\bar\xi^i_h$, $\hat\theta^i_h$, and $\bar\theta^i_h$. Here $\bar\theta^i_h$ denotes the estimator for $Q^{i*}_h$. We solve for the low-rank least-squares solution of the approximate value iteration and denote the solution by $\hat\theta^i_h$. Instead of adding a bonus term directly on $Q^i_h(\hat\theta^i_h)$ to obtain an optimistic estimate of $Q^{i*}_h$, as in the tabular setting (Azar et al., 2017; Jin et al., 2018) and the linear MDP setting (Jin et al., 2020), we use global variables $\bar\xi^i_h$ to quantify the confidence bonus. This is because we cannot preserve the linearity of our estimator if we add the bonus directly, resulting in an exponential propagation of error. By using $\bar\xi^i_h$ we can instead construct a linear estimator $Q^i_h(\bar\theta^i_h)$ and obtain much smaller regret. A drawback of this global optimization technique is that we only obtain an optimistic estimator at step 1, since values at different states and steps are possibly negatively correlated.
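As with the bandit estimator, the regression in Eqn (15) is a rank-constrained problem with no prescribed solver. The sketch below assembles one backup step at a fixed level $h$, reusing the hypothetical `fit_low_rank` helper from the bandit sketch above and omitting the optimism variables $\bar\xi^i_h$ from Definition 1; it illustrates the data flow only and is not the paper's implementation.

```python
import numpy as np

def lsvi_backup_step(phi_list, R_list, next_feats, theta_next, k, D=1.0):
    """One approximate low-rank Bellman backup at step h (illustrative only).
    phi_list[i]:    d x (t-1) features phi(s^i_{hj}, a^i_{hj}) for task i
    R_list[i]:      (t-1,) rewards R^i_{hj}
    next_feats[i]:  (t-1, |A|, d) features of every action at s^i_{h+1,j}
    theta_next[i]:  (d,) current parameter for step h+1 of task i
    Returns (B_h, W_h), a heuristic solution of the regression in Eqn (15)."""
    M = len(phi_list)
    targets = []
    for i in range(M):
        # Regression target R^i_{hj} + V^i_{h+1}(theta)(s^i_{h+1,j}) with
        # V_{h+1}(theta)(s) = max_a phi(s, a)^T theta.
        v_next = (next_feats[i] @ theta_next[i]).max(axis=1)
        targets.append(R_list[i] + v_next)
    # Same rank-k least squares as in the bandit case, via the alternating
    # minimization heuristic sketched earlier (fit_low_rank).
    B_h, W_h = fit_low_rank(phi_list, targets, k)
    # Rescale so every ||B_h w^i_h||_2 <= D, as Eqn (15) requires.
    scale = max(1.0, np.max(np.linalg.norm(B_h @ W_h, axis=0)) / D)
    return B_h, W_h / scale
```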
Theorem 4. Under Assumptions 3 and 4, with probability $1-\delta$ the regret after $T$ episodes is bounded by
$$\mathrm{Reg}(T) = \tilde{O}\big(HM\sqrt{dkT} + Hd\sqrt{kMT} + HMT\sqrt{d}\,\mathcal{I}\big). \tag{18}$$

Compared to naively executing single-task linear RL algorithms (e.g., the ELEANOR algorithm) on each task without information sharing, which incurs regret $\tilde{O}(HMd\sqrt{T} + HMT\sqrt{d}\,\mathcal{I})$, our regret bound is smaller by a factor of approximately $\sqrt{d/k}$ in our setting where $k \ll d$ and $k \ll M$.

We give a brief explanation of how we improve the regret bound and defer the full analysis to Appendix B. We start with the decomposition of the regret. Let $\bar Q^i_{ht}$ ($\bar V^i_{ht}$) be the solution of the problem in Definition 1 in episode $t$; then
$$\begin{aligned}
\mathrm{Reg}(T) &= \sum_{t=1}^T\sum_{i=1}^M\big(V^{i*}_1 - \bar V^i_{1t} + \bar V^i_{1t} - V^{\pi^i_t}_1\big)(s^i_{1t}) && (19)\\
&\le HMT\,\mathcal{I} \quad \text{(by Lemma 12)} && (20)\\
&\quad + \sum_{t=1}^T\sum_{h=1}^H\sum_{i=1}^M\Big(\big|\bar Q^i_{ht}(s,a) - \mathcal{T}^i_h\bar Q^i_{h+1,t}(s,a)\big| + \zeta^i_{ht}\Big). && (21)
\end{aligned}$$
In (20) we use the optimistic property of $\bar V^i_{1t}$. In (21), $\zeta^i_{ht}$ is a martingale difference (defined in Section B.5) with regard to $\mathcal{F}_{h,t}$, and the dominating term (the first term) is the Bellman error of $\bar Q^i_{ht}$.

For any $\{Q^i_{h+1}\}_{i=1}^M \in \mathcal{Q}_{h+1}$, we can find a group of vectors $\{\dot\theta^i_h(Q^i_{h+1})\}_{i=1}^M \in \Theta_h$ that satisfy $\Delta^i_h(Q^i_{h+1})(s,a) \overset{\mathrm{def}}{=} \mathcal{T}^i_h(Q^i_{h+1})(s,a) - \phi(s,a)^\top\dot\theta^i_h(Q^i_{h+1})$ with approximation error $\|\Delta^i_h(Q^i_{h+1})\|_\infty \le \mathcal{I}$ for each $i \in [M]$. By definition, $\dot\theta^i_h(Q^i_{h+1})$ is the best approximator of $\mathcal{T}^i_h(Q^i_{h+1})$ in the function class $\mathcal{Q}_h$. Since our algorithm is based on least-squares value iteration, a key step is to bound the error of estimating $\dot\theta^i_h(\bar Q^i_{h+1,t})$ ($\dot\theta^i_h$ for short). In the global optimization procedure, we use $\hat\theta^i_h$ to approximate the empirical Bellman backup. In Lemma 9 we show
$$\sum_{i=1}^M \big\|\hat\theta^i_h - \dot\theta^i_h\big\|^2_{\tilde V^i_{ht}(\lambda)} = \tilde{O}\big(Mk + kd + MT\mathcal{I}^2\big). \tag{22}$$
This is the key step leading to the improved regret bound. If we solved each task independently without information sharing, we could only bound the least-squares error in (22) by $\tilde{O}(Md + MT\mathcal{I}^2)$; our bound is much sharper since $k \ll d$ and $k \ll M$.

Using the least-squares error bound in (22), we can show that the dominating term in (21) is bounded by (see Lemma 10 and Section B.5)
$$\sum_{i=1}^M\big|\bar Q^i_{ht}(s,a) - \mathcal{T}^i_h\bar Q^i_{h+1,t}(s,a)\big| \le M\mathcal{I} + \tilde{O}\Big(\sqrt{Mk + kd + MT\mathcal{I}^2}\Big)\cdot\sqrt{\sum_{i=1}^M\big\|\phi(s^i_{ht}, a^i_{ht})\big\|^2_{\tilde V^i_{ht}(\lambda)^{-1}}}. \tag{23}$$
Abbasi-Yadkori et al. (2011, Lemma 11) states that $\sum_{t=1}^T\|\phi(s^i_{ht}, a^i_{ht})\|^2_{\tilde V^i_{ht}(\lambda)^{-1}} = \tilde{O}(d)$ for any $h$ and $i$, so we can finally bound the regret as
$$\mathrm{Reg}(T) = \tilde{O}\Big(HMT\mathcal{I} + H\sqrt{Mk + kd + MT\mathcal{I}^2}\cdot\sqrt{MTd}\Big) = \tilde{O}\Big(HM\sqrt{dkT} + Hd\sqrt{kMT} + HMT\sqrt{d}\,\mathcal{I}\Big),$$
where the first equality is by Cauchy-Schwarz.

We conclude this section with the lower bound for multi-task reinforcement learning with low inherent Bellman error. Our lower bound is derived from the lower bound in the single-task setting; as a byproduct, we also derive a lower bound for misspecified linear RL in the single-task setting. We defer the proof of Theorem 5 to Appendix C.

Theorem 5.
For the construction in Appendix C, the expected regret of any algorithm, when $d, k, H \ge 2$, $|\mathcal{A}| \ge 2$, $M \ge k$, $T = \Omega(d^2H)$, and $\mathcal{I} \le 1/H$, is
$$\Omega\big(Mk\sqrt{HT} + d\sqrt{HkMT} + HMT\sqrt{d}\,\mathcal{I}\big).$$

Careful readers may notice a gap of $\sqrt{H}$ in the first two terms between the upper bound and the lower bound. This gap arises because the confidence set used in the algorithm is intrinsically "Hoeffding-type"; using a "Bernstein-type" confidence set can potentially improve the upper bound by a factor of $\sqrt{H}$. This "Bernstein" technique has been well exploited in many previous results for single-task RL (Azar et al., 2017; Jin et al., 2018; Zhou et al., 2020a). Since our focus is mainly on the benefits of multi-task representation learning, we do not apply this technique, for clarity of the analysis. If we ignore this gap in the dependence on $H$, our upper bound matches this lower bound in the regime where $d \ge M$.
6. Conclusion
In this paper, we study provably sample-efficient representation learning for multi-task linear bandits and linear RL. For linear bandits, we propose an algorithm called MTLR-OFUL, which obtains near-optimal regret in the regime where $d \ge M$. We then extend our algorithm to the multi-task RL setting, and propose a sample-efficient algorithm, MTLR-LSVI.

There are two directions for future investigation. First, our algorithms are statistically sample-efficient, but a computationally efficient implementation is still unknown, although we conjecture that our MTLR-OFUL algorithm can be implemented efficiently. How to design both computationally and statistically efficient algorithms in our multi-task setting is an interesting problem for future research. Second, there remains a gap of $\sqrt{d/k}$ between the regret upper and lower bounds (in the first term). We conjecture that our lower bound is not minimax-optimal and hope to address this problem in future work.

References
Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics, pages 1–9. PMLR, 2012.
Rie Kubota Ando and Tong Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817–1853, 2005.
Sanjeev Arora, Simon S Du, Sham Kakade, Yuping Luo, and Nikunj Saunshi. Provable representation learning for imitation learning via bi-level optimization. arXiv preprint arXiv:2002.10544, 2020.
Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
Alex Ayoub, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, and Lin F Yang. Model-based reinforcement learning with value-targeted regression. arXiv preprint arXiv:2006.01107, 2020.
Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272. PMLR, 2017.
Jonathan Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.
Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
Emma Brunskill and Lihong Li. Sample complexity of multi-task reinforcement learning. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI-13), pages 122–131, 2013.
Alexandra Carpentier and Rémi Munos. Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. In Artificial Intelligence and Statistics, pages 190–198. PMLR, 2012.
Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. 2008.
Carlo D'Eramo, Davide Tateo, Andrea Bonarini, Marcello Restelli, and Jan Peters. Sharing knowledge in multi-task deep reinforcement learning. In International Conference on Learning Representations, 2019.
Simon S Du, Wei Hu, Sham M Kakade, Jason D Lee, and Qi Lei. Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434, 2020.
Botao Hao, Tor Lattimore, and Mengdi Wang. High-dimensional sparse linear bandits. arXiv preprint arXiv:2011.04020, 2020.
Matteo Hessel, Hubert Soyer, Lasse Espeholt, Wojciech Czarnecki, Simon Schmitt, and Hado van Hasselt. Multi-task deep reinforcement learning with popart. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3796–3803, 2019.
Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, pages 1704–1713. PMLR, 2017.
Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? arXiv preprint arXiv:1807.03765, 2018.
Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
Kwang-Sung Jun, Rebecca Willett, Stephen Wright, and Robert Nowak. Bilinear bandits with low-rank structure. In International Conference on Machine Learning, pages 3163–3172. PMLR, 2019.
Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
Tor Lattimore, Koby Crammer, and Csaba Szepesvári. Linear multi-resource allocation with semi-bandit feedback. In NIPS, pages 964–972, 2015.
Tor Lattimore, Csaba Szepesvari, and Gellert Weisz. Learning with good feature representations in bandits and in RL with a generative model. In International Conference on Machine Learning, pages 5662–5670. PMLR, 2020.
Jiayi Li, Hongyan Zhang, Liangpei Zhang, Xin Huang, and Lefei Zhang. Joint collaborative representation with multitask learning for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 52(9):5923–5936, 2014.
Yingkai Li, Yining Wang, and Yuan Zhou. Nearly minimax-optimal regret for linearly parameterized bandits. arXiv preprint arXiv:1904.00242, 2019a.
Yingkai Li, Yining Wang, and Yuan Zhou. Tight regret bounds for infinite-armed linear contextual bandits. arXiv preprint arXiv:1905.01435, 2019b.
Lydia T Liu, Urun Dogan, and Katja Hofmann. Decoding multitask DQN in the world of Minecraft. In The 13th European Workshop on Reinforcement Learning (EWRL) 2016, 2016.
Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504, 2019.
Yangyi Lu, Amirhossein Meisami, and Ambuj Tewari. Low-rank generalized linear bandit problems. arXiv preprint arXiv:2006.02948, 2020.
Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853–2884, 2016.
Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Actor-mimic: Deep multitask and transfer reinforcement learning. arXiv preprint arXiv:1511.06342, 2015.
Bharath Ramsundar, Steven Kearnes, Patrick Riley, Dale Webster, David Konerding, and Vijay Pande. Massively multitask networks for drug discovery. arXiv preprint arXiv:1502.02072, 2015.
Paat Rusmevichientong and John N Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
Matthew E Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(7), 2009.
Yee Teh, Victor Bapst, Wojciech M Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pages 4496–4506, 2017.
Sebastian Thrun and Lorien Pratt. Learning to learn: Introduction and overview. In Learning to Learn, pages 3–17. Springer, 1998.
Nilesh Tripuraneni, Chi Jin, and Michael I Jordan. Provable meta-learning of linear representations. arXiv preprint arXiv:2002.11684, 2020.
Ruosong Wang, Simon S Du, Lin F Yang, and Sham M Kakade. Is long horizon reinforcement learning more difficult than short horizon reinforcement learning? arXiv preprint arXiv:2005.00527, 2020.
Aaron Wilson, Alan Fern, Soumya Ray, and Prasad Tadepalli. Multi-task reinforcement learning: a hierarchical Bayesian approach. In Proceedings of the 24th International Conference on Machine Learning, pages 1015–1022, 2007.
Jiaqi Yang, Wei Hu, Jason D. Lee, and Simon S. Du. Provable benefits of representation learning in linear bandits, 2020.
Lin Yang and Mengdi Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. In International Conference on Machine Learning, pages 10746–10756. PMLR, 2020.
Lin F Yang and Mengdi Wang. Sample-optimal parametric Q-learning using linearly additive features. arXiv preprint arXiv:1902.04779, 2019.
Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, and Emma Brunskill. Learning near optimal policies with low inherent Bellman error. arXiv preprint arXiv:2003.00153, 2020a.
Andrea Zanette, Alessandro Lazaric, Mykel J Kochenderfer, and Emma Brunskill. Provably efficient reward-agnostic navigation with linear value iteration. arXiv preprint arXiv:2008.07737, 2020b.
Zihan Zhang, Xiangyang Ji, and Simon S Du. Is reinforcement learning more difficult than bandits? A near-optimal algorithm escaping the curse of horizon. arXiv preprint arXiv:2009.13503, 2020.
Dongruo Zhou, Quanquan Gu, and Csaba Szepesvari. Nearly minimax optimal reinforcement learning for linear mixture Markov decision processes. arXiv preprint arXiv:2012.08507, 2020a.
Dongruo Zhou, Jiafan He, and Quanquan Gu. Provably efficient reinforcement learning for discounted MDPs with feature mapping. arXiv preprint arXiv:2006.13165, 2020b.

Appendices
A Omitted Proofs in Section 4
A.1 Proof of Lemma 1
Proof.
By the optimality of $\hat B_t$ and $\hat W_t = [\hat w_{t,1}, \cdots, \hat w_{t,M}]$, we know that $\sum_{i=1}^M\|y_{t-1,i} - X^\top_{t-1,i}\hat B_t\hat w_{t,i}\|_2^2 \le \sum_{i=1}^M\|y_{t-1,i} - X^\top_{t-1,i}Bw_i\|_2^2$. Since $y_{t-1,i} = X^\top_{t-1,i}Bw_i + \eta_{t-1,i}$, we have
$$\sum_{i=1}^M\big\|X^\top_{t-1,i}\big(\hat B_t\hat w_{t,i} - Bw_i\big)\big\|_2^2 \le 2\sum_{i=1}^M\eta^\top_{t-1,i}X^\top_{t-1,i}\big(\hat B_t\hat w_{t,i} - Bw_i\big). \tag{24}$$

We first analyse the non-trivial setting where $d \ge 2k$. Note that both $\Theta = BW$ and $\hat\Theta_t = \hat B_t\hat W_t$ are low-rank matrices with rank upper bounded by $k$, which indicates that $\mathrm{rank}(\hat\Theta_t - \Theta) \le 2k$. In that case, we can write $\hat\Theta_t - \Theta = U_tR_t = [U_tr_{t,1}, U_tr_{t,2}, \cdots, U_tr_{t,M}]$, where $U_t \in \mathbb{R}^{d\times 2k}$ is an orthonormal matrix with $\|U_t\|_F = \sqrt{2k}$, and $R_t \in \mathbb{R}^{2k\times M}$ satisfies $\|r_{t,i}\|_2 \le 2\sqrt{k}$. In other words, we can write $\hat B_t\hat w_{t,i} - Bw_i = U_tr_{t,i}$ for certain $U_t$ and $r_{t,i}$.

Define $V_{t-1,i}(\lambda) \overset{\mathrm{def}}{=} (U^\top_tX_{t-1,i})(U^\top_tX_{t-1,i})^\top + \lambda I$. We have:
$$\begin{aligned}
&\sum_{i=1}^M\big\|\hat B_t\hat w_{t,i} - Bw_i\big\|^2_{\tilde V_{t-1,i}(\lambda)} && (25)\\
&= \sum_{i=1}^M\big\|X^\top_{t-1,i}(\hat B_t\hat w_{t,i} - Bw_i)\big\|_2^2 + \sum_{i=1}^M\lambda\big\|\hat B_t\hat w_{t,i} - Bw_i\big\|_2^2 && (26)\\
&\le 2\sum_{i=1}^M\eta^\top_{t-1,i}X^\top_{t-1,i}(\hat B_t\hat w_{t,i} - Bw_i) + 4M\lambda && (27)\\
&= 2\sum_{i=1}^M\eta^\top_{t-1,i}X^\top_{t-1,i}U_tr_{t,i} + 4M\lambda && (28)\\
&\le 2\sum_{i=1}^M\big\|\eta^\top_{t-1,i}X^\top_{t-1,i}U_t\big\|_{V^{-1}_{t-1,i}(\lambda)}\|r_{t,i}\|_{V_{t-1,i}(\lambda)} + 4M\lambda && (29)\\
&\le 2\sqrt{\sum_{i=1}^M\big\|\eta^\top_{t-1,i}X^\top_{t-1,i}U_t\big\|^2_{V^{-1}_{t-1,i}(\lambda)}}\sqrt{\sum_{i=1}^M\|r_{t,i}\|^2_{V_{t-1,i}(\lambda)}} + 4M\lambda && (30)\\
&= 2\sqrt{\sum_{i=1}^M\big\|\eta^\top_{t-1,i}X^\top_{t-1,i}U_t\big\|^2_{V^{-1}_{t-1,i}(\lambda)}}\sqrt{\sum_{i=1}^M\big\|\hat B_t\hat w_{t,i} - Bw_i\big\|^2_{\tilde V_{t-1,i}(\lambda)}} + 4M\lambda && (31)
\end{aligned}$$
Eqn 27 is due to Eqn 24 together with $\|\hat B_t\hat w_{t,i}\|_2 \le 1$ and $\|Bw_i\|_2 \le 1$. Eqn 30 is due to the Cauchy-Schwarz inequality. Eqn 31 follows from
$$\sum_{i=1}^M\big\|\hat B_t\hat w_{t,i} - Bw_i\big\|^2_{\tilde V_{t-1,i}(\lambda)} = \sum_{i=1}^M\|U_tr_{t,i}\|^2_{\tilde V_{t-1,i}(\lambda)} = \sum_{i=1}^M\|r_{t,i}\|^2_{U^\top_t\tilde V_{t-1,i}(\lambda)U_t} = \sum_{i=1}^M\|r_{t,i}\|^2_{V_{t-1,i}(\lambda)}.$$

The main problem is how to bound $\|\eta^\top_{t-1,i}X^\top_{t-1,i}U_t\|_{V^{-1}_{t-1,i}(\lambda)} = \big\|\sum_{n=1}^{t-1}\eta_{n,i}U^\top_tx_{n,i}\big\|_{V^{-1}_{t-1,i}(\lambda)}$. Note that for a fixed $U_t = \bar U$, we can regard $\bar U^\top x_{n,i} \in \mathbb{R}^{2k}$ as the corresponding "action" chosen in step $n$. With this observation, if $U_t$ is fixed, we can bound this term following the arguments of the self-normalized bound for vector-valued martingales (Abbasi-Yadkori et al., 2011).

Lemma 2.
For a fixed $\bar U$, define $\bar V_{t,i}(\lambda) \overset{\mathrm{def}}{=} (\bar U^\top X_{t,i})(\bar U^\top X_{t,i})^\top + \lambda I$. Then for any $\delta > 0$, with probability at least $1-\delta$, for all $t \ge 0$,
$$\sum_{i=1}^M\big\|\bar U^\top X_{t,i}\eta_{t,i}\big\|^2_{\bar V^{-1}_{t,i}(\lambda)} \le 2\log\Bigg(\frac{\prod_{i=1}^M\big(\det(\bar V_{t,i})^{1/2}\det(\lambda I)^{-1/2}\big)}{\delta}\Bigg). \tag{32-33}$$

We defer the proof of Lemma 2 to Appendix A.3. We set $\lambda = 1$. By Lemma 2, we know that for a fixed $\bar U$, with probability at least $1-\delta_2$,
$$\sum_{i=1}^M\Bigg\|\sum_{n=1}^{t-1}\eta_{n,i}\bar U^\top x_{n,i}\Bigg\|^2_{\bar V^{-1}_{t-1,i}(\lambda)} \le 2\log\Bigg(\frac{\prod_{i=1}^M\det(\bar V_{t-1,i}(\lambda))^{1/2}\det(\lambda I)^{-1/2}}{\delta_2}\Bigg) \le 2Mk\log\Big(1 + \frac{T}{2k}\Big) + 2\log(1/\delta_2). \tag{34}$$

The above analysis shows that we can bound $\|\eta^\top_{t-1,i}X^\top_{t-1,i}U_t\|_{V^{-1}_{t-1,i}(\lambda)}$ when $U_t$ is fixed as $\bar U$. Following this idea, we prove the lemma by constructing an $\epsilon$-net over all possible $U_t$. To apply the $\epsilon$-net trick, we need to slightly modify the derivation of Eqn 25. For a fixed matrix $\bar U \in \mathbb{R}^{d\times 2k}$, we have
$$\begin{aligned}
&\sum_{i=1}^M\big\|\hat B_t\hat w_{t,i} - Bw_i\big\|^2_{\tilde V_{t-1,i}(\lambda)} && (35)\\
&\le 2\sum_{i=1}^M\eta^\top_{t-1,i}X^\top_{t-1,i}U_tr_{t,i} + 4M\lambda && (36)\\
&= 2\sum_{i=1}^M\eta^\top_{t-1,i}X^\top_{t-1,i}\bar Ur_{t,i} + 2\sum_{i=1}^M\eta^\top_{t-1,i}X^\top_{t-1,i}(U_t - \bar U)r_{t,i} + 4M\lambda && (37)\\
&\le 2\sum_{i=1}^M\big\|\eta^\top_{t-1,i}X^\top_{t-1,i}\bar U\big\|_{\bar V^{-1}_{t-1,i}(\lambda)}\|r_{t,i}\|_{\bar V_{t-1,i}(\lambda)} + 2\sum_{i=1}^M\eta^\top_{t-1,i}X^\top_{t-1,i}(U_t - \bar U)r_{t,i} + 4M\lambda && (38)\\
&= 2\sum_{i=1}^M\big\|\eta^\top_{t-1,i}X^\top_{t-1,i}\bar U\big\|_{\bar V^{-1}_{t-1,i}(\lambda)}\|r_{t,i}\|_{V_{t-1,i}(\lambda)} + 2\sum_{i=1}^M\eta^\top_{t-1,i}X^\top_{t-1,i}(U_t - \bar U)r_{t,i} && (39)\\
&\quad + 2\sum_{i=1}^M\big\|\eta^\top_{t-1,i}X^\top_{t-1,i}\bar U\big\|_{\bar V^{-1}_{t-1,i}(\lambda)}\big(\|r_{t,i}\|_{\bar V_{t-1,i}(\lambda)} - \|r_{t,i}\|_{V_{t-1,i}(\lambda)}\big) + 4M\lambda && (40)\\
&\le 2\sqrt{\sum_{i=1}^M\big\|\eta^\top_{t-1,i}X^\top_{t-1,i}\bar U\big\|^2_{\bar V^{-1}_{t-1,i}(\lambda)}}\sqrt{\sum_{i=1}^M\big\|\hat B_t\hat w_{t,i} - Bw_i\big\|^2_{\tilde V_{t-1,i}(\lambda)}} + 2\sum_{i=1}^M\eta^\top_{t-1,i}X^\top_{t-1,i}(U_t - \bar U)r_{t,i} && (41)\\
&\quad + 2\sum_{i=1}^M\big\|\eta^\top_{t-1,i}X^\top_{t-1,i}\bar U\big\|_{\bar V^{-1}_{t-1,i}(\lambda)}\big(\|r_{t,i}\|_{\bar V_{t-1,i}(\lambda)} - \|r_{t,i}\|_{V_{t-1,i}(\lambda)}\big) + 4M\lambda && (42)
\end{aligned}$$
Eqns 36, 38, and 41 follow the same idea as Eqns 28, 29, and 31.

We construct an $\epsilon$-net $\mathcal{E}$ in the Frobenius norm over the matrix set $\{U \in \mathbb{R}^{d\times 2k} : \|U\|_F \le \sqrt{2k}\}$. It is not hard to see that $|\mathcal{E}| \le (3\sqrt{2k}/\epsilon)^{2kd}$. By the union bound over all possible $\bar U \in \mathcal{E}$, we know that with probability $1 - |\mathcal{E}|\delta_2$, Eqn 34 holds for every $\bar U \in \mathcal{E}$. For each $U_t$, we choose a $\bar U \in \mathcal{E}$ with $\|U_t - \bar U\|_F \le \epsilon$, and we have
$$2\sqrt{\sum_{i=1}^M\big\|\eta^\top_{t-1,i}X^\top_{t-1,i}\bar U\big\|^2_{\bar V^{-1}_{t-1,i}(\lambda)}} \le 2\sqrt{2Mk\log\Big(1 + \frac{T}{2k}\Big) + 2\log(1/\delta_2)}. \tag{43}$$
Since $\|U_t - \bar U\|_F \le \epsilon$, we have
$$2\sum_{i=1}^M\big\|\eta^\top_{t-1,i}X^\top_{t-1,i}\bar U\big\|_{\bar V^{-1}_{t-1,i}(\lambda)}\big(\|r_{t,i}\|_{\bar V_{t-1,i}(\lambda)} - \|r_{t,i}\|_{V_{t-1,i}(\lambda)}\big) \le 2\sqrt{Mk}\,\epsilon\Big(2Mk\log\Big(1 + \frac{T}{2k}\Big) + 2\log(1/\delta_2)\Big). \tag{44}$$
For the term $2\sum_{i=1}^M\eta^\top_{t-1,i}X^\top_{t-1,i}(U_t - \bar U)r_{t,i}$, the following inequality holds for every step $t \in [T]$ with probability $1 - MT\delta_3$:
$$\begin{aligned}
2\sum_{i=1}^M\eta^\top_{t-1,i}X^\top_{t-1,i}(U_t - \bar U)r_{t,i} &\le 2\sum_{i=1}^M\|\eta_{t-1,i}\|_2\big\|X^\top_{t-1,i}(U_t - \bar U)r_{t,i}\big\|_2 && (45)\\
&\le 4\sum_{i=1}^M\|\eta_{t-1,i}\|_2\sqrt{kT}\,\epsilon && (46)\\
&\le 4M\sqrt{2\log(2/\delta_3)}\,\sqrt{k}\,T\epsilon. && (47)
\end{aligned}$$
The last inequality follows from the fact that $|\eta_{n,i}| \le \sqrt{2\log(2/\delta_3)}$ with probability $1-\delta_3$ for fixed $n, i$, together with a union bound over $n \in [t-1]$, $i \in [M]$. Plugging Eqns 43, 44, and 45 back into Eqn 41, the following inequality holds for any $t \in [T]$ with probability at least $1 - |\mathcal{E}|\delta_2 - MT\delta_3$:
$$\begin{aligned}
\sum_{i=1}^M\big\|\hat B_t\hat w_{t,i} - Bw_i\big\|^2_{\tilde V_{t-1,i}(\lambda)} &\le 2\sqrt{2Mk\log\Big(1 + \frac{T}{2k}\Big) + 2\log(1/\delta_2)}\sqrt{\sum_{i=1}^M\big\|\hat B_t\hat w_{t,i} - Bw_i\big\|^2_{\tilde V_{t-1,i}(\lambda)}} && (48\text{-}49)\\
&\quad + 4M\sqrt{2\log(2/\delta_3)}\sqrt{k}\,T\epsilon + 2\sqrt{Mk}\,\epsilon\Big(2Mk\log\Big(1 + \frac{T}{2k}\Big) + 2\log(1/\delta_2)\Big) + 4M\lambda. && (50)
\end{aligned}$$
Solving the above inequality (it is quadratic in the square root of the left-hand side), we obtain
$$\sum_{i=1}^M\big\|\hat B_t\hat w_{t,i} - Bw_i\big\|^2_{\tilde V_{t-1,i}(\lambda)} \le 32\Big(Mk\log\Big(1 + \frac{T}{2k}\Big) + \log(1/\delta_2)\Big) + 4M\sqrt{2\log(2/\delta_3)}\sqrt{k}\,T\epsilon + 4\sqrt{Mk}\,\epsilon\Big(2Mk\log\Big(1 + \frac{T}{2k}\Big) + 2\log(1/\delta_2)\Big) + 8M\lambda. \tag{51\text{-}52}$$
Setting $\lambda = 1$, $\epsilon = \frac{1}{kMT}$, $\delta_2 = \delta/(2|\mathcal{E}|)$, and $\delta_3 = \delta/(4MT)$, the following inequality holds with probability $1-\delta$:
$$\sum_{i=1}^M\big\|\hat B_t\hat w_{t,i} - Bw_i\big\|^2_{\tilde V_{t-1,i}(\lambda)} \le L \overset{\mathrm{def}}{=} 48\big(Mk\log(kMT) + 5kd\log(kMT)\big) + 32\log(4MT) + 76\log(1/\delta). \tag{53}$$

At last, we address the remaining setting where $k < d < 2k$. In this case, we can write $\hat\Theta_t - \Theta = R_t$ where $R_t \in \mathbb{R}^{d\times M}$. The proof then follows the same framework as the case $d \ge 2k$, except that we do not need to consider $U_t$ or construct an $\epsilon$-net over all possible $U_t$. It is not hard to show that $\sum_{i=1}^M\|\hat B_t\hat w_{t,i} - Bw_i\|^2_{\tilde V_{t-1,i}(\lambda)} \le 24(Md + 2\log(Tk/\delta))$ in this case, which is also less than $L$ since $d < 2k$.

A.2 Proof of Theorem 1

With Lemma 1, we are ready to prove Theorem 1.
Proof.
Let $\tilde V_{t,i}(\lambda) = X_{t,i}X^\top_{t,i} + \lambda I_d$ for some $\lambda > 0$. Then
$$\begin{aligned}
\mathrm{Reg}(T) &= \sum_{t=1}^T\sum_{i=1}^M\big\langle\theta_i, x^*_{t,i} - x_{t,i}\big\rangle && (54)\\
&\le \sum_{t=1}^T\sum_{i=1}^M\big\langle\tilde\theta_{t,i} - \theta_i, x_{t,i}\big\rangle && (55)\\
&= \sum_{t=1}^T\sum_{i=1}^M\big\langle\tilde\theta_{t,i} - \hat\theta_{t,i} + \hat\theta_{t,i} - \theta_i, x_{t,i}\big\rangle && (56)\\
&\le \sum_{t=1}^T\sum_{i=1}^M\Big(\big\|\tilde\theta_{t,i} - \hat\theta_{t,i}\big\|_{\tilde V_{t-1,i}(\lambda)} + \big\|\hat\theta_{t,i} - \theta_i\big\|_{\tilde V_{t-1,i}(\lambda)}\Big)\|x_{t,i}\|_{\tilde V_{t-1,i}(\lambda)^{-1}} && (57)\\
&\le \sum_{t=1}^T\Bigg(\sqrt{\sum_{i=1}^M\big\|\tilde\theta_{t,i} - \hat\theta_{t,i}\big\|^2_{\tilde V_{t-1,i}(\lambda)}} + \sqrt{\sum_{i=1}^M\big\|\hat\theta_{t,i} - \theta_i\big\|^2_{\tilde V_{t-1,i}(\lambda)}}\Bigg)\cdot\sqrt{\sum_{i=1}^M\|x_{t,i}\|^2_{\tilde V_{t-1,i}(\lambda)^{-1}}} && (58)\\
&\le 2\sqrt{T(L + 4\lambda M)}\cdot\sqrt{\sum_{i=1}^M\sum_{t=1}^T\|x_{t,i}\|^2_{\tilde V_{t-1,i}(\lambda)^{-1}}}, && (59)
\end{aligned}$$
where the first inequality is due to $\sum_{i=1}^M\langle\theta_i, x^*_{t,i}\rangle \le \sum_{i=1}^M\langle\tilde\theta_{t,i}, x_{t,i}\rangle$ from the optimistic choice of $\tilde\theta_{t,i}$ and $x_{t,i}$, and the last step applies Cauchy-Schwarz over $t$. By Lemma 11 of Abbasi-Yadkori et al. (2011), as long as $\lambda \ge 1$,
$$\sum_{t=1}^T\|x_{t,i}\|^2_{\tilde V_{t-1,i}(\lambda)^{-1}} \le 2\log\frac{\det(\tilde V_{T,i}(\lambda))}{\det(\lambda I_d)} \le 2d\log\Big(1 + \frac{T}{\lambda d}\Big). \tag{60}$$
Therefore, choosing $\lambda = 1$, we can finally bound the regret:
$$\begin{aligned}
\mathrm{Reg}(T) &\le 2\sqrt{T(L + 4M)}\cdot\sqrt{\sum_{i=1}^M\sum_{t=1}^T\|x_{t,i}\|^2_{\tilde V_{t-1,i}(\lambda)^{-1}}} && (61)\\
&\le 2\sqrt{T(L + 4M)}\cdot\sqrt{2Md\log\Big(1 + \frac{T}{d}\Big)} && (62)\\
&= \tilde{O}\big(M\sqrt{dkT} + d\sqrt{kMT}\big). && (63)
\end{aligned}$$

A.3 Proof of Lemma 2

The proof of Lemma 2 follows a similar idea to Theorem 1 of Abbasi-Yadkori et al. (2011). We consider the $\sigma$-algebra $\mathcal{F}_t = \sigma(\{x_{1,i}\}_{i=1}^M, \cdots, \{x_{t+1,i}\}_{i=1}^M, \{\eta_{1,i}\}_{i=1}^M, \cdots, \{\eta_{t,i}\}_{i=1}^M)$, so that $\{x_{t,i}\}_{i=1}^M$ is $\mathcal{F}_{t-1}$-measurable and $\{\eta_{t,i}\}_{i=1}^M$ is $\mathcal{F}_t$-measurable.

Define $\bar x_{t,i} = \bar U^\top x_{t,i}$ and $S_{t,i} = \sum_{n=1}^t\bar U^\top x_{n,i}\eta_{n,i}$. Let
$$M_t(Q) = \exp\Bigg(\sum_{n=1}^t\sum_{i=1}^M\Big[\eta_{n,i}\langle q_i, \bar x_{n,i}\rangle - \frac{1}{2}\langle q_i, \bar x_{n,i}\rangle^2\Big]\Bigg), \qquad Q = [q_1, \cdots, q_M] \in \mathbb{R}^{2k\times M}. \tag{64}$$

Lemma 3.
Lemma 3. Let $\tau$ be a stopping time w.r.t. the filtration $\{F_t\}_{t=0}^\infty$. Then $M_\tau(Q)$ is almost surely well-defined and $E[M_\tau(Q)] \le 1$.

Proof. Let $D_t(Q) = \exp\left(\sum_{i=1}^M\left[\eta_{t,i}\langle q_i,\bar x_{t,i}\rangle - \frac12\langle q_i,\bar x_{t,i}\rangle^2\right]\right)$. By the sub-Gaussianity of $\eta_{t,i}$, we have

$$E\left[\exp\left(\eta_{t,i}\langle q_i,\bar x_{t,i}\rangle - \frac12\langle q_i,\bar x_{t,i}\rangle^2\right)\ \Big|\ F_{t-1}\right] \le 1. \qquad (65)$$

Then we have $E[D_t(Q)\mid F_{t-1}] \le 1$.
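To spell out (65) from the definition of 1-sub-Gaussianity alone (a standard moment-generating-function step, stated here for completeness): conditioned on $F_{t-1}$ the quantity $\langle q_i,\bar x_{t,i}\rangle$ is fixed, so

$$E\left[\exp\left(\eta_{t,i}\langle q_i,\bar x_{t,i}\rangle\right)\mid F_{t-1}\right] \le \exp\left(\frac{\langle q_i,\bar x_{t,i}\rangle^2}{2}\right)\ \Longrightarrow\ E\left[\exp\left(\eta_{t,i}\langle q_i,\bar x_{t,i}\rangle - \frac{\langle q_i,\bar x_{t,i}\rangle^2}{2}\right)\Big|\ F_{t-1}\right] \le 1.$$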
Further,

$$E[M_t(Q)\mid F_{t-1}] = E[D_1(Q)\cdots D_{t-1}(Q)\,D_t(Q)\mid F_{t-1}] \qquad (66)$$
$$= D_1(Q)\cdots D_{t-1}(Q)\,E[D_t(Q)\mid F_{t-1}] \le M_{t-1}(Q) \qquad (67)$$

This shows that $\{M_t(Q)\}_{t=0}^\infty$ is a supermartingale and $E[M_t(Q)] \le 1$. Next we argue that $M_\tau(Q)$ is almost surely well-defined. By the convergence theorem for nonnegative supermartingales, $M_\infty(Q) = \lim_{t\to\infty} M_t(Q)$ is almost surely well-defined. Therefore, $M_\tau(Q)$ is indeed well-defined independently of whether $\tau < \infty$ or not. Let $W_t(Q) = M_{\min\{\tau,t\}}(Q)$ be a stopped version of $(M_t(Q))_t$. By Fatou's Lemma, $E[M_\tau(Q)] = E[\liminf_{t\to\infty} W_t(Q)] \le \liminf_{t\to\infty} E[W_t(Q)] \le 1.$
This shows that $E[M_\tau(Q)] \le 1$, which completes the proof of Lemma 3. With Lemma 3 in hand, we can now bound $\sum_{i=1}^M \|S_{t,i}\|^2_{\bar V_{t,i}^{-1}(\lambda)}$.
Lemma 4. Let $\tau$ be a stopping time w.r.t. the filtration $\{F_t\}_{t=0}^\infty$. Then, for $\delta > 0$, with probability $1-\delta$,

$$\sum_{i=1}^M \|S_{\tau,i}\|^2_{\bar V_{\tau,i}^{-1}(\lambda)} \le 2\log\left(\frac{\prod_{i=1}^M \left(\det(\bar V_{\tau,i})^{1/2}\det(\lambda I)^{-1/2}\right)}{\delta}\right). \qquad (68)$$

Proof.
For each $i \in [M]$, let $\Lambda_i$ be an $\mathbb{R}^k$-valued Gaussian random variable which is independent of all the other random variables and whose covariance is $\lambda^{-1}I$. Define $M_t = E[M_t([\Lambda_1,\cdots,\Lambda_M])\mid F_\infty]$. We still have $E[M_\tau] = E\left[E[M_\tau([\Lambda_1,\cdots,\Lambda_M])\mid \{\Lambda_i\}_{i=1}^M]\right] \le 1$. Define $M_{t,i}(q_i) \overset{\mathrm{def}}{=} \exp\left(\sum_{n=1}^t\left[\eta_{n,i}\langle q_i,\bar x_{n,i}\rangle - \frac12\langle q_i,\bar x_{n,i}\rangle^2\right]\right)$; then we have $M_t = E\left[\prod_{i=1}^M M_{t,i}(\Lambda_i)\mid F_\infty\right] = \prod_{i=1}^M E[M_{t,i}(\Lambda_i)\mid F_\infty]$, where the second equality is due to the fact that $\{M_{t,i}(\Lambda_i)\}_{i=1}^M$ are mutually independent given $F_\infty$. We only need to calculate $E[M_{t,i}(\Lambda_i)\mid F_\infty]$ for each $i \in [M]$.

Following the proof of Lemma 9 in Abbasi-Yadkori et al. (2011), we know that

$$E[M_{t,i}(\Lambda_i)\mid F_\infty] = \left(\frac{\det(\lambda I)}{\det(\bar V_{t,i})}\right)^{1/2}\exp\left(\frac12\|S_{t,i}\|^2_{\bar V_{t,i}^{-1}(\lambda)}\right). \qquad (69)$$

Then we have

$$M_t = \left(\prod_{i=1}^M\left(\frac{\det(\lambda I)}{\det(\bar V_{t,i})}\right)^{1/2}\right)\exp\left(\frac12\sum_{i=1}^M\|S_{t,i}\|^2_{\bar V_{t,i}^{-1}(\lambda)}\right). \qquad (70)$$
Since $E[M_\tau] \le 1$, we have

$$\Pr\left[\sum_{i=1}^M \|S_{\tau,i}\|^2_{\bar V_{\tau,i}^{-1}(\lambda)} > 2\log\left(\frac{\prod_{i=1}^M\left(\det(\bar V_{\tau,i})^{1/2}\det(\lambda I)^{-1/2}\right)}{\delta}\right)\right]$$
$$= \Pr\left[\frac{\exp\left(\frac12\sum_{i=1}^M \|S_{\tau,i}\|^2_{\bar V_{\tau,i}^{-1}(\lambda)}\right)}{\delta^{-1}\prod_{i=1}^M\left(\det(\bar V_{\tau,i})^{1/2}\det(\lambda I)^{-1/2}\right)} > 1\right]$$
$$\le E\left[\frac{\exp\left(\frac12\sum_{i=1}^M \|S_{\tau,i}\|^2_{\bar V_{\tau,i}^{-1}(\lambda)}\right)}{\delta^{-1}\prod_{i=1}^M\left(\det(\bar V_{\tau,i})^{1/2}\det(\lambda I)^{-1/2}\right)}\right] = E[M_\tau]\,\delta \le \delta.$$

Proof. (Proof of Lemma 2) The only remaining issue is the stopping time construction. Define the bad event

$$B_t(\delta) \overset{\mathrm{def}}{=} \left\{\omega\in\Omega:\ \sum_{i=1}^M \|S_{t,i}\|^2_{\bar V_{t,i}^{-1}(\lambda)} > 2\log\left(\frac{\prod_{i=1}^M\left(\det(\bar V_{t,i})^{1/2}\det(\lambda I)^{-1/2}\right)}{\delta}\right)\right\} \qquad (71)$$

Consider the stopping time $\tau(\omega) = \min\{t \ge 0:\ \omega \in B_t(\delta)\}$; we have $\bigcup_{t\ge 0} B_t(\delta) = \{\omega: \tau(\omega) < \infty\}$. By Lemma 4, we have

$$\Pr\left[\bigcup_{t\ge 0} B_t(\delta)\right] = \Pr[\tau < \infty] \qquad (72)$$
$$= \Pr\left[\sum_{i=1}^M \|S_{\tau,i}\|^2_{\bar V_{\tau,i}^{-1}(\lambda)} > 2\log\left(\frac{\prod_{i=1}^M\left(\det(\bar V_{\tau,i})^{1/2}\det(\lambda I)^{-1/2}\right)}{\delta}\right),\ \tau < \infty\right] \qquad (73)$$
$$\le \Pr\left[\sum_{i=1}^M \|S_{\tau,i}\|^2_{\bar V_{\tau,i}^{-1}(\lambda)} > 2\log\left(\frac{\prod_{i=1}^M\left(\det(\bar V_{\tau,i})^{1/2}\det(\lambda I)^{-1/2}\right)}{\delta}\right)\right] \qquad (74)$$
$$\le \delta. \qquad (75)$$
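As a quick numerical sanity check of Lemma 4 (not part of the proof), the following sketch simulates $M$ independent tasks with standard normal noise, which is 1-sub-Gaussian, and estimates how often the self-normalized bound of Eqn. 68 is violated at the fixed time $\tau = T$. All names and parameter values are ours, for illustration only.

```python
# Empirical check of Lemma 4: the violation rate of Eqn. 68 should be <= delta.
import numpy as np

rng = np.random.default_rng(0)
M, k, T, lam, delta, trials = 5, 3, 200, 1.0, 0.05, 500

violations = 0
for _ in range(trials):
    total, log_det_ratio = 0.0, 0.0
    for _ in range(M):
        X = rng.normal(size=(T, k)) / np.sqrt(k)   # features  x̄_{t,i}
        eta = rng.normal(size=T)                   # 1-sub-Gaussian noise η_{t,i}
        S = X.T @ eta                              # S_{T,i} = Σ_t x̄_{t,i} η_{t,i}
        V = X.T @ X + lam * np.eye(k)              # V̄_{T,i}(λ)
        total += S @ np.linalg.solve(V, S)         # ||S_{T,i}||²_{V̄⁻¹}
        _, ld = np.linalg.slogdet(V)
        log_det_ratio += 0.5 * (ld - k * np.log(lam))  # log det(V̄)^{1/2}/det(λI)^{1/2}
    bound = 2 * (log_det_ratio + np.log(1 / delta))    # right-hand side of Eqn. 68
    violations += total > bound
print(f"empirical violation rate: {violations / trials:.3f} (should be <= {delta})")
```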
A.4 Proof of Theorem 2

Proof. The proof follows the same idea as that for Theorem 1. The only difference is that, in our setting, we have $y_{t,i} = x_{t,i}^\top B w_i + \eta_{t,i} + \Delta_{t,i}$, where $\theta_i = Bw_i$ is the best approximator for task $i \in [M]$ such that $\left|E[y_i\mid x_i] - \langle x_i, \dot B\dot w_i\rangle\right| \le \zeta$, and $|\Delta_{t,i}| \le \zeta$. Define $\Delta_{t-1,i} = [\Delta_{1,i}, \Delta_{2,i},\cdots,\Delta_{t-1,i}]^\top$. Similarly, by the optimality of $\hat B_t$ and $\hat W_t = [\hat w_{t,1},\cdots,\hat w_{t,M}]$, we know that $\sum_{i=1}^M \left\|y_{t-1,i} - X_{t-1,i}^\top \hat B_t\hat w_{t,i}\right\|^2 \le \sum_{i=1}^M \left\|y_{t-1,i} - X_{t-1,i}^\top Bw_i\right\|^2$. Since $y_{t-1,i} = X_{t-1,i}^\top Bw_i + \eta_{t-1,i} + \Delta_{t-1,i}$, we thus have

$$\sum_{i=1}^M \left\|X_{t-1,i}^\top\left(\hat B_t\hat w_{t,i} - Bw_i\right)\right\|^2 \qquad (76)$$
$$\le 2\sum_{i=1}^M \eta_{t-1,i}^\top X_{t-1,i}^\top\left(\hat B_t\hat w_{t,i} - Bw_i\right) + 2\sum_{i=1}^M \Delta_{t-1,i}^\top X_{t-1,i}^\top\left(\hat B_t\hat w_{t,i} - Bw_i\right) \qquad (77)$$
$$\le 2\sum_{i=1}^M \eta_{t-1,i}^\top X_{t-1,i}^\top\left(\hat B_t\hat w_{t,i} - Bw_i\right) + 2\sum_{i=1}^M \left\|X_{t-1,i}\Delta_{t-1,i}\right\|_{\tilde V_{t-1,i}^{-1}(\lambda)}\left\|\hat B_t\hat w_{t,i} - Bw_i\right\|_{\tilde V_{t-1,i}(\lambda)} \qquad (78)$$
$$\le 2\sum_{i=1}^M \eta_{t-1,i}^\top X_{t-1,i}^\top\left(\hat B_t\hat w_{t,i} - Bw_i\right) + 2\sum_{i=1}^M \sqrt T\,\zeta\left\|\hat B_t\hat w_{t,i} - Bw_i\right\|_{\tilde V_{t-1,i}(\lambda)} \qquad (79)$$
$$\le 2\sum_{i=1}^M \eta_{t-1,i}^\top X_{t-1,i}^\top\left(\hat B_t\hat w_{t,i} - Bw_i\right) + 2\sqrt{MT}\,\zeta\sqrt{\sum_{i=1}^M\left\|\hat B_t\hat w_{t,i} - Bw_i\right\|^2_{\tilde V_{t-1,i}(\lambda)}} \qquad (80)$$

The third inequality follows from the Projection Bound (Lemma 8) in Zanette et al. (2020a). The first term of Eqn. 80 has the same form as Eqn. 24. Following the same proof idea as Lemma 1, we know that with probability $1-\delta$,

$$\sum_{i=1}^M \left\|\hat B_t\hat w_{t,i} - Bw_i\right\|^2_{V_{t-1,i}(\lambda)} \qquad (81)$$
$$\le \left(4\sqrt{2Mk + 8kd\log(kMT/\delta)} + 2\sqrt{MT}\,\zeta\right)\sqrt{\sum_{i=1}^M\left\|\hat B_t\hat w_{t,i} - Bw_i\right\|^2_{V_{t-1,i}(\lambda)}} + 4M + 4\sqrt{\log(4MT/\delta)} \qquad (82)$$

Solving for $\sum_{i=1}^M\left\|\hat B_t\hat w_{t,i} - Bw_i\right\|^2_{V_{t-1,i}(\lambda)}$, we know that the true parameter $BW$ is always contained in the confidence set, i.e.

$$\sum_{i=1}^M\left\|\hat B_t\hat w_{t,i} - Bw_i\right\|^2_{V_{t-1,i}(\lambda)} \le L', \qquad (83)$$

where $L' = 2L + 32MT\zeta^2$. Thus we have

$$\mathrm{Reg}(T) = \sum_{t=1}^T\sum_{i=1}^M\left(y^*_{t,i} - y_{t,i}\right) \qquad (84)$$
$$\le 2MT\zeta + \sum_{t=1}^T\sum_{i=1}^M\left\langle\theta_i,\ x^*_{t,i} - x_{t,i}\right\rangle \qquad (85)$$
$$\le 2MT\zeta + \sum_{t=1}^T\sum_{i=1}^M\left\langle\tilde\theta_{t,i} - \theta_i,\ x_{t,i}\right\rangle \qquad (86)$$
$$= 2MT\zeta + \sum_{t=1}^T\sum_{i=1}^M\left\langle\tilde\theta_{t,i} - \hat\theta_{t,i} + \hat\theta_{t,i} - \theta_i,\ x_{t,i}\right\rangle \qquad (87)$$
$$\le 2MT\zeta + \sum_{t=1}^T\sum_{i=1}^M\left(\left\|\tilde\theta_{t,i}-\hat\theta_{t,i}\right\|_{\tilde V_{t-1,i}(\lambda)} + \left\|\hat\theta_{t,i}-\theta_i\right\|_{\tilde V_{t-1,i}(\lambda)}\right)\|x_{t,i}\|_{\tilde V_{t-1,i}(\lambda)^{-1}} \qquad (88)$$
$$\le 2MT\zeta + \left(\sqrt{\sum_{t=1}^T\sum_{i=1}^M\left\|\tilde\theta_{t,i}-\hat\theta_{t,i}\right\|^2_{\tilde V_{t-1,i}(\lambda)}} + \sqrt{\sum_{t=1}^T\sum_{i=1}^M\left\|\hat\theta_{t,i}-\theta_i\right\|^2_{\tilde V_{t-1,i}(\lambda)}}\right)\cdot\sqrt{\sum_{t=1}^T\sum_{i=1}^M\|x_{t,i}\|^2_{\tilde V_{t-1,i}(\lambda)^{-1}}} \qquad (89)$$
$$\le 2MT\zeta + 2\sqrt{T(L' + 4\lambda M)}\cdot\sqrt{\sum_{i=1}^M\sum_{t=1}^T\|x_{t,i}\|^2_{\tilde V_{t-1,i}(\lambda)^{-1}}} \qquad (90)$$
$$\le 2MT\zeta + 2\sqrt{T(L' + 4\lambda M)}\cdot\sqrt{2Md\log\left(1+\frac{T}{d}\right)} \qquad (91)$$
$$= \tilde O\left(M\sqrt{dkT} + d\sqrt{kMT} + MT\sqrt{d}\,\zeta\right), \qquad (92)$$

where the second inequality is due to $\sum_{i=1}^M\langle\theta_i, x^*_{t,i}\rangle \le \sum_{i=1}^M\langle\tilde\theta_{t,i}, x_{t,i}\rangle$ from the optimistic choice of $\tilde\theta_{t,i}$ and $x_{t,i}$.
The third inequality is due to Eqn. 83, and the last inequality is from Eqn. 60.

A.5 Proof of Theorem 3

Since our setting is strictly harder than the setting of multi-task linear bandits with infinite arms in Yang et al. (2020), we can prove the following lemma directly from their Theorem 4 by reduction.
Lemma 5.
Under the setting of Theorem 3, the regret of any algorithm $\mathcal{A}$ is lower bounded by $\Omega\left(Mk\sqrt T + d\sqrt{kMT}\right)$.

In order to prove Theorem 3, we only need to show that the following lemma is true.
Lemma 6.
Under the setting of Theorem 3, the regret of any algorithm $\mathcal{A}$ is lower bounded by $\Omega\left(MT\sqrt d\,\zeta\right)$.

Proof. (Proof of Lemma 6) To prove Lemma 6, we leverage the lower bound for misspecified linear bandits in the single-task setting. We restate the following lemma from the previous literature with a slight modification of notation.
Lemma 7. (Proposition 6 in Zanette et al. (2020a)). There exists a feature map $\phi: \mathcal{A}\to\mathbb{R}^d$ that defines a misspecified linear bandit class $\mathcal{M}$ such that every bandit instance in that class has reward response $\mu_a = \phi_a^\top\theta + z_a$ for any action $a$ (here $z_a \in [0,\zeta]$ is the deviation from linearity and $\mu_a \in [0,1)$), and such that the expected regret of any algorithm on at least one member of the class up to round $T$ is $\Omega(\sqrt d\,\zeta T)$.

Suppose $M$ is exactly divisible by $k$; we construct the following instances to prove Lemma 6. We divide the $M$ tasks into $k$ groups, where each group shares the same parameter $\theta_i$. To be more specific, we let $w_1 = w_2 = \cdots = w_{M/k} = e_1$, $w_{M/k+1} = w_{M/k+2} = \cdots = w_{2M/k} = e_2$, $\cdots$, $w_{(k-1)M/k+1} = w_{(k-1)M/k+2} = \cdots = w_M = e_k$. Under this construction, the parameters $\theta_i$ are exactly the same within each group, but mutually independent across different groups. That is to say, the expected regret lower bound is at least the summation of the regret lower bounds over all $k$ groups.

Now we consider the regret lower bound for group $j \in [k]$. Since the parameters are shared within the same group, the regret of running an algorithm for $M/k$ tasks with $T$ steps each is at least the regret of running an algorithm for a single-task linear bandit with $M/k\cdot T$ steps. By Lemma 7, the regret for a single-task linear bandit with $MT/k$ steps is at least $\Omega\left(\sqrt d\,\zeta\, MT/k\right)$. Summing over all $k$ groups, we can prove that the regret lower bound is $\Omega\left(\sqrt d\,\zeta\, MT\right)$.

Combining Lemma 5 and Lemma 6, we complete the proof of Theorem 3.
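Spelling out the final summation of the argument above (our own restatement, with $c$ the absolute constant hidden in Lemma 7):

$$\mathrm{Reg}(T)\ \ge\ \sum_{j=1}^{k} c\,\sqrt d\,\zeta\cdot\frac{MT}{k}\ =\ c\,\sqrt d\,\zeta\, MT\ =\ \Omega\left(MT\sqrt d\,\zeta\right).$$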
B Proof of Theorem 4

B.1 Definitions and First Step Analysis
Before presenting the proof of Theorem 4, we make a first-step analysis of the low-rank least-squares estimator in Equation 12.

For any $\{Q^i_{h+1}\}_{i=1}^M \in \mathcal{Q}_{h+1}$, there exists $\{\dot\theta^i_h(Q^i_{h+1})\}_{i=1}^M \in \Theta_h$ such that

$$\Delta^i_h\left(Q^i_{h+1}\right)(s,a) = \mathcal{T}^i_h\left(Q^i_{h+1}\right)(s,a) - \phi(s,a)^\top\dot\theta^i_h\left(Q^i_{h+1}\right) \qquad (93)$$

where the approximation error satisfies $\left\|\Delta^i_h\left(Q^i_{h+1}\right)\right\|_\infty \le \mathcal{I}$ for each $i\in[M]$. We also use $\dot B_h\dot w^i_h\left(Q^i_{h+1}\right)$ in place of $\dot\theta^i_h\left(Q^i_{h+1}\right)$ in the following sections, since we can write $\dot\theta^i_h$ as $\dot B_h\dot w^i_h$ according to Assumption 3.

In the multi-task low-rank least-squares regression (Equation 12), we are actually trying to recover $\dot\theta^i_h$. However, due to the noise and the representation error (i.e., the inherent Bellman error), we can only obtain an approximate solution $\hat\theta^i_h = \hat B_h\hat w^i_h$ (see the global optimization problem in Definition 1):

$$\left(\hat\theta^1_h, \ldots, \hat\theta^M_h\right) = \hat B_h\left[\hat w^1_h\ \hat w^2_h\ \cdots\ \hat w^M_h\right] \qquad (94)$$
$$= \operatorname*{argmin}_{\|B_h w^i_h\|_2\le D}\ \sum_{i=1}^M\sum_{j=1}^{t-1}\left(\phi\left(s^i_{hj},a^i_{hj}\right)^\top B_h w^i_h - R\left(s^i_{hj},a^i_{hj}\right) - \max_a Q^i_{h+1}\left(s^i_{h+1,j},a\right)\right)^2 \qquad (95)$$
$$= \operatorname*{argmin}_{\|B_h w^i_h\|_2\le D}\ \sum_{i=1}^M\sum_{j=1}^{t-1}\left(\phi\left(s^i_{hj},a^i_{hj}\right)^\top B_h w^i_h - \mathcal{T}^i_h\left(Q^i_{h+1}\right)\left(s^i_{hj},a^i_{hj}\right) - z^i_{hj}\left(Q^i_{h+1}\right)\left(s^i_{hj},a^i_{hj}\right)\right)^2 \qquad (96)$$

where $z^i_{hj}\left(Q^i_{h+1}\right)\left(s^i_{hj},a^i_{hj}\right) \overset{\mathrm{def}}{=} R\left(s^i_{hj},a^i_{hj}\right) + \max_a Q^i_{h+1}\left(s^i_{h+1,j},a\right) - \mathcal{T}^i_h\left(Q^i_{h+1}\right)\left(s^i_{hj},a^i_{hj}\right)$.

Define $\Phi^i_{ht}\in\mathbb{R}^{(t-1)\times d}$ to be the collection of linear features up to episode $t-1$ for task $i$, i.e. the $j$-th row of $\Phi^i_{ht}$ is $\phi\left(s^i_{hj},a^i_{hj}\right)^\top$. Let $Y^i_{ht}\in\mathbb{R}^{t-1}$ be the vector whose $j$-th entry is $\mathcal{T}^i_h\left(Q^i_{h+1}\right)\left(s^i_{hj},a^i_{hj}\right) + z^i_{hj}\left(Q^i_{h+1}\right)\left(s^i_{hj},a^i_{hj}\right)$. Then the objective in (96) can be written as

$$\operatorname*{argmin}_{\|B_h w^i_h\|_2\le D}\ \sum_{i=1}^M\left\|\Phi^i_{ht}B_h w^i_h - Y^i_{ht}\right\|^2 \qquad (97)$$

Therefore, we have

$$\sum_{i=1}^M\left\|\Phi^i_{ht}\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - Y^i_{ht}\right\|^2 \le \sum_{i=1}^M\left\|\Phi^i_{ht}\dot B_h\dot w^i_h\left(Q^i_{h+1}\right) - Y^i_{ht}\right\|^2 \qquad (98)$$

which implies

$$\sum_{i=1}^M\left\|\Phi^i_{ht}\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \Phi^i_{ht}\dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right\|^2 \qquad (99)$$
$$\le 2\sum_{i=1}^M\left(\Delta^i_{ht}\right)^\top\Phi^i_{ht}\left(\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right) \qquad (100)$$
$$\quad + 2\sum_{i=1}^M\left(z^i_{ht}\right)^\top\Phi^i_{ht}\left(\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right) \qquad (101)$$

where $\Delta^i_{ht} \overset{\mathrm{def}}{=} \left[\Delta^i_h\left(Q^i_{h+1}\right)\left(s^i_{h1},a^i_{h1}\right),\ \Delta^i_h\left(Q^i_{h+1}\right)\left(s^i_{h2},a^i_{h2}\right),\ \cdots,\ \Delta^i_h\left(Q^i_{h+1}\right)\left(s^i_{h,t-1},a^i_{h,t-1}\right)\right]^\top\in\mathbb{R}^{t-1}$ and $z^i_{ht} \overset{\mathrm{def}}{=} \left[z^i_{h1}\left(Q^i_{h+1}\right)\left(s^i_{h1},a^i_{h1}\right),\ \cdots,\ z^i_{h,t-1}\left(Q^i_{h+1}\right)\left(s^i_{h,t-1},a^i_{h,t-1}\right)\right]^\top\in\mathbb{R}^{t-1}$.

In the next sections we will show how to bound (100) and (101).

B.2 Failure Event

Define the failure event at step $h$ in episode $t$ as follows.

Definition 2 (Failure Event).
$$E_{ht} \overset{\mathrm{def}}{=} \mathbb{I}\Bigg[\exists\,\{Q^i_{h+1}\}_{i=1}^M\in\mathcal{Q}_{h+1}:\ \sum_{i=1}^M\left(z^i_{ht}\right)^\top\Phi^i_{ht}\left(\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right) > \qquad (102)$$
$$\qquad F^1_h\sqrt{\sum_{i=1}^M\left\|\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right\|^2_{V^i_{ht}(\lambda)}} + F^2_h\Bigg] \qquad (103)$$

where $F^1_h$ and $F^2_h$ will be specified later.

We have the following lemma to bound the probability of $E_{ht}$.
Lemma 8. For the input parameter $\delta > 0$, there exist $F^1_h$ and $F^2_h$ such that

$$P\left(\bigcup_{t=1}^T\bigcup_{h=1}^H E_{ht}\right) \le \frac{\delta}{2}$$

Proof.
According to Lemma A.5 of Du et al. (2020), there exists an $\epsilon'$-net $E^o_{h+1}$ over $O_{d\times k}$ (with respect to the Frobenius norm) such that $|E^o_{h+1}| \le (6\sqrt k/\epsilon')^{kd}$. Moreover, there exists an $\epsilon'$-net $E^b_{h+1}$ over $\mathbb{B}^k$ such that $|E^b_{h+1}| \le (1+2/\epsilon')^k$. We can thus construct a corresponding $\epsilon'$-net $E^{\mathrm{mul}}_{h+1} \overset{\mathrm{def}}{=} E^o_{h+1}\times\left(E^b_{h+1}\right)^M$ over $\Theta_{h+1}$.

For any $\left(Q^1_{h+1}\left(B_{h+1}w^1_{h+1}\right),\cdots,Q^M_{h+1}\left(B_{h+1}w^M_{h+1}\right)\right)\in\mathcal{Q}_{h+1}$, there exist $\bar B_{h+1}\in E^o_{h+1}$ and $\left(\bar w^1_{h+1},\cdots,\bar w^M_{h+1}\right)\in\left(E^b_{h+1}\right)^M$ such that

$$\left\|B_{h+1}-\bar B_{h+1}\right\|_F \le \epsilon',\qquad \left\|w^i_{h+1}-\bar w^i_{h+1}\right\| \le \epsilon',\quad \forall i\in[M]$$

Therefore,

$$\left\|B_{h+1}w^i_{h+1} - \bar B_{h+1}\bar w^i_{h+1}\right\| \le 2\epsilon',\quad \forall i\in[M]$$

Define $\bar Q^i_{h+1}$ to be $Q^i_{h+1}\left(\bar B_{h+1}\bar w^i_{h+1}\right)$, and let $\bar z^i_{ht} \overset{\mathrm{def}}{=} \left[z^i_{h1}\left(\bar Q^i_{h+1}\right)\left(s^i_{h1},a^i_{h1}\right),\ \cdots,\ z^i_{h,t-1}\left(\bar Q^i_{h+1}\right)\left(s^i_{h,t-1},a^i_{h,t-1}\right)\right]^\top\in\mathbb{R}^{t-1}$; then

$$\sum_{i=1}^M\left(z^i_{ht}\right)^\top\Phi^i_{ht}\left(\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right) \qquad (105)$$
$$= \sum_{i=1}^M\left(\bar z^i_{ht}\right)^\top\Phi^i_{ht}\left(\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right) \qquad (106)$$
$$\quad + \sum_{i=1}^M\left(z^i_{ht} - \bar z^i_{ht}\right)^\top\Phi^i_{ht}\left(\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right) \qquad (107)$$

For fixed $\{\bar B_{h+1}\bar w^i_{h+1}\}_{i=1}^M\in E^{\mathrm{mul}}_{h+1}$, $z^i_{hj}\left(\bar Q^i_{h+1}\right)\left(s^i_{hj},a^i_{hj}\right)$ is zero-mean and 1-sub-Gaussian conditioned on $F_{h,j}$ according to Assumption 4. Thus, we can use exactly the same argument as in Lemma 1 to show that

$$\sum_{i=1}^M\left(\bar z^i_{ht}\right)^\top\Phi^i_{ht}\left(\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right) \qquad (108)$$
$$\le \sqrt{2Mk + 5kd\log(kMT) + 2\log(1/\delta')}\ \sqrt{\sum_{i=1}^M\left\|\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right\|^2_{V^i_{ht}(\lambda)}} \qquad (109)$$
$$\quad + \sqrt{2\log(MT/\delta')} + \sqrt{2k + 3kd\log(kMT) + \log(1/\delta')} \qquad (110)$$

by setting $\epsilon = \frac{1}{kMT^2}$, $\delta_1 = \delta'\left(\frac{\epsilon}{6\sqrt k}\right)^{kd}$, and $\delta_2 = \frac{\delta'}{MT}$ in Equation 50. Thus, with probability $1-\delta'$ the inequality above holds for any $h\in[H]$, $t\in[T]$.
Take $\delta' = \frac{\delta}{2|E^{\mathrm{mul}}_{h+1}|}$; by a union bound we know that the above inequality holds with probability $1-\delta/2$ for any $\{\bar B_{h+1}\bar w^i_{h+1}\}_{i=1}^M\in E^{\mathrm{mul}}_{h+1}$ and any $h\in[H]$, $t\in[T]$.

Since it holds that $\left|Q^i_{h+1}\left(B_{h+1}w^i_{h+1}\right)(s,a) - Q^i_{h+1}\left(\bar B_{h+1}\bar w^i_{h+1}\right)(s,a)\right| \le 2\epsilon'$ for any $(s,a)\in\mathcal{S}\times\mathcal{A}$ and $i\in[M]$, we have

$$\left|z^i_{hj}\left(\bar Q^i_{h+1}\right)\left(s^i_{hj},a^i_{hj}\right) - z^i_{hj}\left(Q^i_{h+1}\right)\left(s^i_{hj},a^i_{hj}\right)\right| \le 4\epsilon' \qquad (111)$$

Then we have

$$\sum_{i=1}^M\left(z^i_{ht}-\bar z^i_{ht}\right)^\top\Phi^i_{ht}\left(\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right) \qquad (112)$$
$$\le \sum_{i=1}^M\left\|\left(\Phi^i_{ht}\right)^\top\left(z^i_{ht}-\bar z^i_{ht}\right)\right\|_{\tilde V^i_{ht}(\lambda)^{-1}}\left\|\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right\|_{\tilde V^i_{ht}(\lambda)} \qquad (113)$$
$$\le 4\epsilon'\sqrt T\sum_{i=1}^M\left\|\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right\|_{\tilde V^i_{ht}(\lambda)} \qquad (114)$$
$$\le 4\epsilon'\sqrt{MT}\sqrt{\sum_{i=1}^M\left\|\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right\|^2_{V^i_{ht}(\lambda)}} \qquad (115)$$

for arbitrary $\{Q^i_{h+1}\}$ and any $h\in[H]$, $t\in[T]$. The second inequality follows from the Projection Bound (Lemma 8) in Zanette et al. (2020a).

Take $\epsilon' = 1/\sqrt{MT}$; we finally finish the proof by setting

$$F^1_h \overset{\mathrm{def}}{=} \sqrt{2Mk + 5kd\log(kMT) + 5Mk\log(MT) + 2\log(2/\delta)} + 4 \qquad (116)$$
$$F^2_h \overset{\mathrm{def}}{=} \sqrt{5kd\log(kMT) + 5Mk\log(MT) + 2\log(2/\delta)} \qquad (117)$$
$$\qquad + \sqrt{2k + 5kd\log(kMT) + 2Mk\log(MT) + \log(2/\delta)} \qquad (118)$$

In the next sections we assume that the failure event $\bigcup_{t=1}^T\bigcup_{h=1}^H E_{ht}$ does not happen.

B.3 Bellman Error

Outside the failure event, we can bound the estimation error of the least-squares regression (Equation 12).
Lemma 9.
For any episode $t\in[T]$, step $h\in[H]$, and any $\{Q^i_{h+1}\}_{i=1}^M\in\mathcal{Q}_{h+1}$, we have

$$\sum_{i=1}^M\left\|\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right\|^2_{V^i_{ht}(\lambda)} \le \alpha_{ht} \overset{\mathrm{def}}{=} \left(2\sqrt{MT}\,\mathcal{I} + 2F^1_h + \sqrt{2F^2_h + 4MD^2\lambda}\right)^2 \qquad (119)$$

Proof.
Recall that

$$\sum_{i=1}^M\left\|\Phi^i_{ht}\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \Phi^i_{ht}\dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right\|^2 \qquad (120)$$
$$\le 2\sum_{i=1}^M\left(\Delta^i_{ht}\right)^\top\Phi^i_{ht}\left(\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right) \qquad (121)$$
$$\quad + 2\sum_{i=1}^M\left(z^i_{ht}\right)^\top\Phi^i_{ht}\left(\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right) \qquad (122)$$

For the first term, we have

$$\sum_{i=1}^M\left(\Delta^i_{ht}\right)^\top\Phi^i_{ht}\left(\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right) \qquad (123)$$
$$\le \sum_{i=1}^M\left\|\left(\Phi^i_{ht}\right)^\top\Delta^i_{ht}\right\|_{\tilde V^i_{ht}(\lambda)^{-1}}\left\|\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right\|_{\tilde V^i_{ht}(\lambda)} \qquad (124)$$
$$\le \sqrt T\,\mathcal{I}\sum_{i=1}^M\left\|\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right\|_{\tilde V^i_{ht}(\lambda)} \qquad (125)$$
$$\le \sqrt{MT}\,\mathcal{I}\sqrt{\sum_{i=1}^M\left\|\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right\|^2_{V^i_{ht}(\lambda)}} \qquad (126)$$

The second inequality follows from the Projection Bound (Lemma 8) in Zanette et al. (2020a), and the last inequality is due to Cauchy–Schwarz.

Outside the failure event, we have

$$\sum_{i=1}^M\left\|\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right\|^2_{V^i_{ht}(\lambda)} \qquad (127)$$
$$\le \sum_{i=1}^M\left\|\Phi^i_{ht}\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \Phi^i_{ht}\dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right\|^2 + 4MD^2\lambda \qquad (128)$$
$$\le \left(2\sqrt{MT}\,\mathcal{I} + 2F^1_h\right)\sqrt{\sum_{i=1}^M\left\|\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right\|^2_{V^i_{ht}(\lambda)}} + 2F^2_h + 4MD^2\lambda \qquad (129)$$

which implies

$$\sum_{i=1}^M\left\|\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right\|^2_{V^i_{ht}(\lambda)} \qquad (130)$$
$$\le \left(2\sqrt{MT}\,\mathcal{I} + 2F^1_h\right)^2 + 2F^2_h + 4MD^2\lambda + \left(2\sqrt{MT}\,\mathcal{I} + 2F^1_h\right)\sqrt{2F^2_h + 4MD^2\lambda} \qquad (131)$$
$$\le \left(2\sqrt{MT}\,\mathcal{I} + 2F^1_h + \sqrt{2F^2_h + 4MD^2\lambda}\right)^2 \qquad (132)$$

Lemma 10 (Bound on Bellman Error). Outside the failure event, for any feasible solution $\{Q^i_h(\bar\theta^i_h)\}_{i,h}$ ($\bar Q^i_h$ for short, with a slight abuse of notation) of the global optimization procedure in Definition 1, for any $(s,a)\in\mathcal{S}\times\mathcal{A}$ and any $h\in[H]$, $t\in[T]$,

$$\sum_{i=1}^M\left|\bar Q^i_h(s,a) - \mathcal{T}^i_h\bar Q^i_{h+1}(s,a)\right| \le M\mathcal{I} + 2\sqrt{\alpha_{ht}\cdot\sum_{i=1}^M\|\phi(s,a)\|^2_{V^i_{ht}(\lambda)^{-1}}} \qquad (133)$$

Proof.
$$\sum_{i=1}^M\left|\bar Q^i_h(s,a) - \mathcal{T}^i_h\bar Q^i_{h+1}(s,a)\right| = \sum_{i=1}^M\left|\phi(s,a)^\top\bar\theta^i_h - \phi(s,a)^\top\dot\theta^i_h\left(\bar Q^i_{h+1}\right) - \Delta^i_h\left(\bar Q^i_{h+1}\right)(s,a)\right| \qquad (134)$$
$$\le M\mathcal{I} + \sum_{i=1}^M\left|\phi(s,a)^\top\bar\theta^i_h - \phi(s,a)^\top\dot\theta^i_h\left(\bar Q^i_{h+1}\right)\right| \qquad (135)$$
$$\le M\mathcal{I} + \sum_{i=1}^M\left(\left|\phi(s,a)^\top\dot\theta^i_h\left(\bar Q^i_{h+1}\right) - \phi(s,a)^\top\hat\theta^i_h\right| + \left|\phi(s,a)^\top\hat\theta^i_h - \phi(s,a)^\top\bar\theta^i_h\right|\right) \qquad (136)$$
$$\le M\mathcal{I} + \sum_{i=1}^M\|\phi(s,a)\|_{\tilde V^i_{ht}(\lambda)^{-1}}\left(\left\|\dot\theta^i_h\left(\bar Q^i_{h+1}\right) - \hat\theta^i_h\right\|_{\tilde V^i_{ht}(\lambda)} + \left\|\hat\theta^i_h - \bar\theta^i_h\right\|_{\tilde V^i_{ht}(\lambda)}\right) \qquad (137)$$
$$\le M\mathcal{I} + 2\sqrt{\alpha_{ht}\cdot\sum_{i=1}^M\|\phi(s,a)\|^2_{V^i_{ht}(\lambda)^{-1}}} \qquad (138)$$

The first equality is due to the definition of $\Delta^i_h\left(\bar Q^i_{h+1}\right)(s,a)$. The last inequality is due to Lemma 9.

B.4 Optimism

We can find the "best" approximator of the optimal value functions in our function class, recursively defined as

$$\left(\theta^{1*}_h, \theta^{2*}_h, \cdots, \theta^{M*}_h\right) \overset{\mathrm{def}}{=} \operatorname*{argmin}_{(\theta^1_h,\cdots,\theta^M_h)\in\Theta_h}\ \sup_{s,a,i}\left|\phi(s,a)^\top\theta^i_h - \mathcal{T}^i_h Q^i_{h+1}\left(\theta^{i*}_{h+1}\right)(s,a)\right| \qquad (139)$$

with $\theta^{i*}_{H+1} = 0$, $\forall i\in[M]$.

For the accuracy of this best approximator, we have
Lemma 11. For any $h\in[H]$,

$$\sup_{(s,a)\in\mathcal{S}\times\mathcal{A},\, i\in[M]}\left|Q^{i*}_h(s,a) - \phi(s,a)^\top\theta^{i*}_h\right| \le (H-h+1)\,\mathcal{I}$$

where $Q^{i*}_h$ is the optimal value function for task $i$. This lemma is derived directly from Lemma 6 in Zanette et al. (2020a).

For our solution of the problem in Definition 1 in episode $t$, we have the following lemma:

Lemma 12. $\left\{\left(\theta^{1*}_h,\theta^{2*}_h,\cdots,\theta^{M*}_h\right)\right\}_{h=1}^H$ is a feasible solution of the problem in Definition 1. Moreover, denoting the solution of the problem in Definition 1 in episode $t$ by $\bar\theta^i_{ht}$ for $h\in[H]$, $i\in[M]$, it holds that

$$\sum_{i=1}^M V^i_1\left(\bar\theta^i_{1t}\right)\left(s^i_{1t}\right) \ge \sum_{i=1}^M V^{i*}_1\left(s^i_{1t}\right) - MH\mathcal{I} \qquad (140)$$

Proof.
First we show that $\{(\theta^{1*}_h,\cdots,\theta^{M*}_h)\}_{h=1}^H$ is a feasible solution. We can construct $\{\bar\xi^i_h\}_{i=1}^M$ so that $\bar\theta^i_h = \theta^{i*}_h$ and no constraints are violated. We use an inductive construction; the base case $\bar\theta^i_{H+1} = \theta^{i*}_{H+1} = 0$ is trivial.

Now suppose we have $\{\bar\xi^i_y\}_{i=1}^M$ for $y = h+1,\ldots,H$ such that $\bar\theta^i_y = \theta^{i*}_y$ for $y = h+1,\ldots,H$ and $i\in[M]$; we show that we can find $\{\bar\xi^i_h\}_{i=1}^M$ such that $\bar\theta^i_h = \theta^{i*}_h$ for $i\in[M]$ and no constraints are violated. From the definition of $\theta^{i*}_h$ we can set (with a slight abuse of notation)

$$\dot\theta^i_h\left(\theta^{i*}_{h+1}\right) = \theta^{i*}_h \qquad (141)$$

According to Lemma 9 we have

$$\sum_{i=1}^M\left\|\hat\theta^i_h\left(\theta^{i*}_{h+1}\right) - \dot\theta^i_h\left(\theta^{i*}_{h+1}\right)\right\|^2_{V^i_{ht}(\lambda)} \le \alpha_{ht} \qquad (142)$$

Therefore, setting $\bar\xi^i_h = \dot\theta^i_h\left(\theta^{i*}_{h+1}\right) - \hat\theta^i_h\left(\theta^{i*}_{h+1}\right)$, we obtain

$$\bar\theta^i_h = \hat\theta^i_h\left(\bar\theta^i_{h+1}\right) + \bar\xi^i_h \qquad (143)$$
$$= \hat\theta^i_h\left(\theta^{i*}_{h+1}\right) + \dot\theta^i_h\left(\theta^{i*}_{h+1}\right) - \hat\theta^i_h\left(\theta^{i*}_{h+1}\right) \qquad (144)$$
$$= \theta^{i*}_h \qquad (145)$$

Finally, we can verify $(\bar\theta^1_h,\ldots,\bar\theta^M_h)\in\Theta_h$ from $(\theta^{1*}_h,\cdots,\theta^{M*}_h)\in\Theta_h$.

Since $\bar\theta^i_{1t}$ is the optimal solution, we can finish the proof by showing

$$\sum_{i=1}^M V^i_1\left(\bar\theta^i_{1t}\right)\left(s^i_{1t}\right) = \sum_{i=1}^M\max_a\phi\left(s^i_{1t},a\right)^\top\bar\theta^i_{1t} \qquad (146)$$
$$\ge \sum_{i=1}^M\max_a\phi\left(s^i_{1t},a\right)^\top\theta^{i*}_1\quad\text{(since $\theta^{i*}_1$ is a feasible solution)} \qquad (147)$$
$$\ge \sum_{i=1}^M\phi\left(s^i_{1t},\pi^{i*}\left(s^i_{1t}\right)\right)^\top\theta^{i*}_1 \qquad (148)$$
$$\ge \sum_{i=1}^M Q^{i*}_1\left(s^i_{1t},\pi^{i*}\left(s^i_{1t}\right)\right) - MH\mathcal{I}\quad\text{(by Lemma 11)} \qquad (149)$$
$$\ge \sum_{i=1}^M V^{i*}_1\left(s^i_{1t}\right) - MH\mathcal{I} \qquad (150)$$

B.5 Regret Bound

We are ready to present the proof of our regret bound. From Lemma 8 we know that the failure event $\bigcup_{t=1}^T\bigcup_{h=1}^H E_{ht}$ happens with probability at most $\delta/2$, so we assume it does not happen.
Then we can decompose the regret as

$$\mathrm{Reg}(T) = \sum_{t=1}^T\sum_{i=1}^M\left(V^{i*}_1 - V^{\pi_{it}}_1\right)\left(s^i_{1t}\right) \qquad (151)$$
$$= \sum_{t=1}^T\sum_{i=1}^M\left(V^{i*}_1 - V^i_1\left(\bar\theta^i_{1t}\right)\right)\left(s^i_{1t}\right) + \sum_{t=1}^T\sum_{i=1}^M\left(V^i_1\left(\bar\theta^i_{1t}\right) - V^{\pi_{it}}_1\right)\left(s^i_{1t}\right) \qquad (152)$$
$$\le \sum_{t=1}^T\sum_{i=1}^M\left(V^i_1\left(\bar\theta^i_{1t}\right) - V^{\pi_{it}}_1\right)\left(s^i_{1t}\right) + MHT\mathcal{I}\quad\text{(by Lemma 12)} \qquad (153)$$

Let $a^i_{ht} = \pi_{it}\left(s^i_{ht}\right)$, and denote $Q^i_h\left(\bar\theta^i_{ht}\right)$ (resp. $V^i_h\left(\bar\theta^i_{ht}\right)$) by $\bar Q^i_{ht}$ (resp. $\bar V^i_{ht}$) for short. We have

$$\sum_{i=1}^M\left(\bar V^i_{ht} - V^{\pi_{it}}_h\right)\left(s^i_{ht}\right) = \sum_{i=1}^M\left(\bar Q^i_{ht} - Q^{\pi_{it}}_h\right)\left(s^i_{ht},a^i_{ht}\right) \qquad (154)$$
$$= \sum_{i=1}^M\left(\bar Q^i_{ht} - \mathcal{T}^i_h\bar Q^i_{h+1,t}\right)\left(s^i_{ht},a^i_{ht}\right) + \sum_{i=1}^M\left(\mathcal{T}^i_h\bar Q^i_{h+1,t} - Q^{\pi_{it}}_h\right)\left(s^i_{ht},a^i_{ht}\right) \qquad (155)$$
$$\le M\mathcal{I} + 2\sqrt{\alpha_{ht}\cdot\sum_{i=1}^M\left\|\phi\left(s^i_{ht},a^i_{ht}\right)\right\|^2_{V^i_{ht}(\lambda)^{-1}}} + \sum_{i=1}^M E_{s'\sim p^i_h\left(s^i_{ht},a^i_{ht}\right)}\left[\left(\bar V^i_{h+1,t} - V^{\pi_{it}}_{h+1}\right)(s')\right] \qquad (156)$$
$$\le \sum_{i=1}^M\left(\bar V^i_{h+1,t} - V^{\pi_{it}}_{h+1}\right)\left(s^i_{h+1,t}\right) + M\mathcal{I} + 2\sqrt{\alpha_{ht}\cdot\sum_{i=1}^M\left\|\phi\left(s^i_{ht},a^i_{ht}\right)\right\|^2_{V^i_{ht}(\lambda)^{-1}}} + \sum_{i=1}^M\zeta^i_{ht} \qquad (157)$$

where $\zeta^i_{ht}$ is a martingale difference with regard to the filtration $F_{h,t}$, defined as

$$\zeta^i_{ht} \overset{\mathrm{def}}{=} \left(\bar V^i_{h+1,t} - V^{\pi_{it}}_{h+1}\right)\left(s^i_{h+1,t}\right) - E_{s'\sim p^i_h\left(s^i_{ht},a^i_{ht}\right)}\left[\left(\bar V^i_{h+1,t} - V^{\pi_{it}}_{h+1}\right)(s')\right] \qquad (158)$$

According to Assumption 4 we know $\left|\zeta^i_{ht}\right| \le 4$,
so we can apply the Azuma–Hoeffding inequality: with probability $1-\delta/2$, for any $t\in[T]$ and $i\in[M]$,

$$\sum_{j=1}^t\zeta^i_{hj} \le \sqrt{32\,t\ln\left(\frac{4T}{\delta}\right)} \qquad (159)$$

By applying Inequality 157 recursively, we can bound the regret as

$$\mathrm{Reg}(T) \le \sum_{t=1}^T\sum_{i=1}^M\left(\bar V^i_{1t} - V^{\pi_{it}}_1\right)\left(s^i_{1t}\right) + MHT\mathcal{I} \qquad (160)$$
$$\le 2MHT\mathcal{I} + \sum_{t=1}^T\sum_{h=1}^H 2\sqrt{\alpha_{ht}\cdot\sum_{i=1}^M\left\|\phi\left(s^i_{ht},a^i_{ht}\right)\right\|^2_{V^i_{ht}(\lambda)^{-1}}} + \sum_{i=1}^M\sum_{h=1}^H\sum_{t=1}^T\zeta^i_{ht} \qquad (161)$$

The last inequality is due to $\bar V^i_{H+1}(s) = \max_a\phi(s,a)^\top\bar\theta^i_{H+1,t} = 0$ and $V^{\pi_{it}}_{H+1}(s) = 0$.

Lemma 11 of Abbasi-Yadkori et al. (2011) gives that for any $i\in[M]$ and $h\in[H]$,

$$\sum_{t=1}^T\left\|\phi\left(s^i_{ht},a^i_{ht}\right)\right\|^2_{V^i_{ht}(\lambda)^{-1}} = \tilde O(d) \qquad (162)$$

Moreover, by the definition of $\alpha_{ht}$ (see Lemma 9) we know that for any $h\in[H]$ and $t\in[T]$,

$$\alpha_{ht} = \tilde O\left(Mk + kd + MT\mathcal{I}^2\right) \qquad (163)$$

Putting all of the above together, we can show the final regret bound:

$$\mathrm{Reg}(T) \le 2MHT\mathcal{I} + \sum_{t=1}^T\sum_{h=1}^H 2\sqrt{\alpha_{ht}\cdot\sum_{i=1}^M\left\|\phi\left(s^i_{ht},a^i_{ht}\right)\right\|^2_{V^i_{ht}(\lambda)^{-1}}} + \sum_{i=1}^M\sum_{h=1}^H\sum_{t=1}^T\zeta^i_{ht} \qquad (164)$$
$$= \tilde O\left(MHT\mathcal{I}\right) + \tilde O\left(\sqrt{Mk + kd + MT\mathcal{I}^2}\right)\sum_{h=1}^H\sum_{t=1}^T\sqrt{\sum_{i=1}^M\left\|\phi\left(s^i_{ht},a^i_{ht}\right)\right\|^2_{V^i_{ht}(\lambda)^{-1}}} + MH\sqrt T \qquad (165)$$
$$= \tilde O\left(MHT\mathcal{I}\right) + \tilde O\left(\sqrt{Mk + kd + MT\mathcal{I}^2}\right)\sum_{h=1}^H\sqrt T\cdot\sqrt{\sum_{t=1}^T\sum_{i=1}^M\left\|\phi\left(s^i_{ht},a^i_{ht}\right)\right\|^2_{V^i_{ht}(\lambda)^{-1}}} + MH\sqrt T \qquad (166)$$
$$= \tilde O\left(MHT\mathcal{I} + \tilde O\left(\sqrt{Mk + kd + MT\mathcal{I}^2}\right)\cdot H\sqrt{MTd} + MH\sqrt T\right) \qquad (167)$$
$$= \tilde O\left(HM\sqrt{dkT} + Hd\sqrt{MkT} + HMT\sqrt d\,\mathcal{I}\right) \qquad (168)$$
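For completeness, the simplification from Eqn. 167 to Eqn. 168 uses the elementary inequality $\sqrt{a+b+c} \le \sqrt a + \sqrt b + \sqrt c$ (a brief restatement of the last step, not a new argument):

$$\sqrt{Mk + kd + MT\mathcal{I}^2}\cdot H\sqrt{MTd}\ \le\ H\left(\sqrt{Mk}+\sqrt{kd}+\sqrt{MT}\,\mathcal{I}\right)\sqrt{MTd}\ =\ HM\sqrt{dkT} + Hd\sqrt{MkT} + HMT\sqrt d\,\mathcal{I}.$$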
C Proof of Theorem 5

To prove the lower bound for multi-task RL, our idea is to connect the lower bound for the multi-task learning problem to the lower bound in the single-task LSVI setting (Zanette et al., 2020a). In the paper of Zanette et al. (2020a), the feature dimension is allowed to vary across steps, denoted $d_h$ for step $h$, and the lower bound for linear RL in that setting is $\Omega\left(\sum_{h=1}^H d_h\sqrt T + \sum_{h=1}^H\sqrt{d_h}\,\mathcal{I}T\right)$. However, this lower bound is derived via a hard instance with $d_1 = \sum_{h=2}^H d_h$. If we set $d_1 = d_2 = \cdots = d_H = d$ as in our setting, we can only obtain a lower bound of $\Omega\left(d\sqrt T + \sqrt d\,\mathcal{I}T\right)$ following their proof idea. In fact, the dependence on $H$ in this lower bound can be further improved. In order to obtain a tighter lower bound, we consider the lower bound for single-task misspecified linear MDPs. This setting can be shown to be strictly simpler than the LSVI setting, following the idea of Proposition 3 in Zanette et al. (2020a). The lower bound for misspecified linear MDPs can thus be applied to the LSVI setting.

C.1 Lower Bounds for Single-task RL

This subsection focuses on the lower bound for the misspecified linear MDP setting, in which the transition kernel and the reward function are assumed to be approximately linear.
Assumption 6. (Assumption B in Jin et al. (2020)) For any $\zeta \le 1$, we say that an MDP $(\mathcal{S},\mathcal{A},p,r,H)$ is a $\zeta$-approximate linear MDP with a feature map $\phi:\mathcal{S}\times\mathcal{A}\to\mathbb{R}^d$ if for any $h\in[H]$, there exist $d$ unknown measures $\theta_h = (\theta^{(1)}_h,\cdots,\theta^{(d)}_h)$ over $\mathcal{S}$ and an unknown vector $\nu_h\in\mathbb{R}^d$ such that for any $(s,a)\in\mathcal{S}\times\mathcal{A}$, we have

$$\left\|p_h(\cdot\mid s,a) - \langle\phi(s,a),\theta_h(\cdot)\rangle\right\|_{\mathrm{TV}} \le \zeta \qquad (169)$$
$$\left|r_h(s,a) - \langle\phi(s,a),\nu_h\rangle\right| \le \zeta \qquad (170)$$

For regularity, we assume that Assumption 4 still holds, and we also assume that there exists a constant $D \ge 1$ such that $\|\theta_h(s)\| \le D$ for all $s\in\mathcal{S}$, $h\in[H]$, and $\|\nu_h\| \le D$ for all $h\in[H]$.
Proposition 1. Suppose $T \ge d^2H$, $d \ge 8$, $H \ge 2$ and $\zeta \le 1/H$. There exists a $\zeta$-approximate linear MDP class such that the expected regret of any algorithm on at least one member of the class is at least $\Omega\left(d\sqrt{HT} + HT\zeta\sqrt d\right)$.

To prove the lower bound, our basic idea is to connect the problem to $H$ linear bandit problems. A similar hard instance construction has been used in Zhou et al. (2020a,b). In our construction, the state space $\mathcal{S}$ consists of $H+2$ states, denoted $x_1, x_2, \cdots, x_{H+2}$. The agent starts the episode in state $x_1$. From $x_h$, it can either transit to $x_{h+1}$ or to $x_{H+2}$ with certain transition probabilities. If the agent enters $x_{H+2}$, it will stay in this state for the remaining steps, i.e. $x_{H+2}$ is an absorbing state. For each state, there are $2^{d-3}$ actions and $\mathcal{A} = \{-1,1\}^{d-3}$. Suppose the agent takes action $a\in\{-1,1\}^{d-3}$ in state $x_h$; the transition probabilities to $x_{h+1}$ and $x_{H+2}$ are $1 - \zeta_h(a) - \delta - \mu_h^\top a$ and $\delta + \zeta_h(a) + \mu_h^\top a$ respectively. Here $|\zeta_h(a)| \le \zeta$ denotes the approximation error of the linear representation, $\delta = 1/H$, and $\mu_h\in\{-\Delta,\Delta\}^{d-3}$ with $\Delta = \sqrt{\delta/T}/(4\sqrt 2)$ so that the probabilities are well-defined. The reward can only be obtained in $x_{H+2}$, with $r_h(x_{H+2},a) = 1/H$ for any $h, a$. We assume the reward to be deterministic.

We can check that this construction satisfies Assumption 6 with $\phi$ and $\theta$ defined in the following way:

$$\phi(s,a) = \begin{cases}\left(0,\ \alpha,\ \alpha\delta,\ 0,\ \beta a^\top\right)^\top & s = x_1, x_2, \cdots, x_H\\ \left(0,\ 0,\ 0,\ \alpha,\ \mathbf{0}^\top\right)^\top & s = x_{H+1}\\ \left(\alpha,\ 0,\ 0,\ \alpha,\ \mathbf{0}^\top\right)^\top & s = x_{H+2}\end{cases}$$

$$\theta_h(s') = \begin{cases}\left(0,\ \frac{1}{\alpha},\ -\frac{1}{\alpha},\ 0,\ -\frac{\mu_h^\top}{\beta}\right)^\top & s' = x_{h+1}\\ \left(0,\ 0,\ \frac{1}{\alpha},\ \frac{1}{\alpha},\ \frac{\mu_h^\top}{\beta}\right)^\top & s' = x_{H+2}\\ \mathbf{0} & \text{otherwise}\end{cases}$$

$\nu_h$ is defined to be $\left(\frac{1}{H\alpha},\ \mathbf{0}^\top\right)^\top$, with $\alpha = 1/\sqrt{2+\Delta^2(d-3)}$ and $\beta = \Delta/\sqrt{2+\Delta^2(d-3)}$. One can verify that $\|\phi(s,a)\| \le 1$, $\|\theta_h(s')\| \le D$ and $\|\nu_h\| \le D$ hold for any $s, a, s', h$ when $T \ge d^2H$.

Since the only rewarding state is $x_{H+2}$, the optimal strategy in state $x_h$ ($h \le H$) is to take an action that maximizes the probability of entering $x_{H+2}$, i.e., to maximize $\mu_h^\top a + \zeta_h(a)$. That is to say, we can regard the problem of finding the optimal action in state $x_h$ at step $h$ as finding the optimal arm for a $(d-3)$-dimensional linear bandit. Since we choose $\delta$ such that $(1-\delta)^{H/2}$ is a constant, there is a sufficiently high probability of entering state $x_h$ for any $h \le H/2$. Therefore, we can show that this problem is harder than solving $H/2$ such linear bandit problems.
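To make the construction concrete, here is a small toy simulation of the chain MDP above with $\zeta_h(a) = 0$. This is illustrative code of our own and is not used anywhere in the proof; it merely shows that an agent maximizing $\mu_h^\top a$ at each step collects more reward than a uniform one.

```python
# Toy rollout of the hard-instance chain MDP (ζ_h(a) = 0 in this sketch).
import numpy as np

rng = np.random.default_rng(1)
H, d, T = 10, 8, 1000
delta = 1.0 / H
Delta = np.sqrt(delta / T) / (4 * np.sqrt(2))
mu = rng.choice([-Delta, Delta], size=(H, d - 3))   # hidden bandit parameters μ_h

def run_episode(policy):
    """One episode; reward 1/H for each step spent in the absorbing state x_{H+2}."""
    reward, absorbed = 0.0, False
    for h in range(H):
        if absorbed:                          # already in x_{H+2}
            reward += 1.0 / H
            continue
        a = policy(h)                         # action in {-1, +1}^{d-3}
        if rng.random() < delta + mu[h] @ a:  # transit to x_{H+2}
            absorbed = True
    return reward

greedy = lambda h: np.sign(mu[h])             # optimal: maximize μ_h^T a
uniform = lambda h: rng.choice([-1.0, 1.0], size=d - 3)
for name, pi in [("optimal", greedy), ("uniform", uniform)]:
    print(name, np.mean([run_episode(pi) for _ in range(2000)]))
```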
Lemma 13. Suppose $H \ge 2$, $d \ge 8$ and $(d-3)\Delta \le \frac{1}{2H}$. We define $r^b_h(a) = \mu_h^\top a + \zeta_h(a)$, which can be regarded as the corresponding reward for the equivalent linear bandit problem at step $h$. Fix $\mu\in\left(\{-\Delta,\Delta\}^{d-3}\right)^H$ and a possibly history-dependent policy $\pi$. Letting $V^\star$ and $V^\pi$ be the optimal value function and the value function of policy $\pi$ respectively, we have

$$V^\star(x_1) - V^\pi(x_1) \ge 0.02\sum_{h=1}^{H/2}\left(\max_{a\in\mathcal{A}} r^b_h(a) - \sum_{a\in\mathcal{A}}\pi_h(a\mid s_h)\,r^b_h(a)\right) \qquad (171)$$

Proof.
Note that the only rewarding state is $x_{H+2}$, with $r_h(x_{H+2},a) = \frac1H$. Therefore, the value function of a policy $\pi$ can be calculated as

$$V^\pi(x_1) = \sum_{h=1}^{H-1}\frac{H-h}{H}\,P(N_h\mid\pi) \qquad (172)$$

where $N_h$ denotes the event of visiting state $x_h$ at step $h$ and then transiting to $x_{H+2}$, i.e. $N_h = \{s_h = x_h,\ s_{h+1} = x_{H+2}\}$. Let $\omega^\pi_h = \sum_{a\in\mathcal{A}}\pi_h(a\mid s_h)\,r^b_h(a)$ and $\omega^\star_h = \max_{a\in\mathcal{A}} r^b_h(a)$. By the law of total probability and the Markov property, we have

$$P(N_h\mid\pi) = \left(\delta + \omega^\pi_h\right)\prod_{j=1}^{h-1}\left(1 - \delta - \omega^\pi_j\right) \qquad (173)$$

Thus we have

$$V^\pi(x_1) = \sum_{h=1}^{H-1}\frac{H-h}{H}\left(\delta+\omega^\pi_h\right)\prod_{j=1}^{h-1}\left(1-\delta-\omega^\pi_j\right) \qquad (174)$$

Similarly, for the value function of the optimal policy, we have

$$V^\star(x_1) = \sum_{h=1}^{H-1}\frac{H-h}{H}\left(\delta+\omega^\star_h\right)\prod_{j=1}^{h-1}\left(1-\delta-\omega^\star_j\right) \qquad (175)$$

Define $S_i = \sum_{h=i}^{H-1}\frac{H-h}{H}\left(\delta+\omega^\pi_h\right)\prod_{j=i}^{h-1}\left(1-\delta-\omega^\pi_j\right)$ and $T_i = \sum_{h=i}^{H-1}\frac{H-h}{H}\left(\delta+\omega^\star_h\right)\prod_{j=i}^{h-1}\left(1-\delta-\omega^\star_j\right)$. Then we have $V^\star(x_1) - V^\pi(x_1) = T_1 - S_1$. Notice that

$$S_i = \frac{H-i}{H}\left(\omega^\pi_i + \delta\right) + S_{i+1}\left(1-\omega^\pi_i-\delta\right) \qquad (176)$$
$$T_i = \frac{H-i}{H}\left(\omega^\star_i + \delta\right) + T_{i+1}\left(1-\omega^\star_i-\delta\right) \qquad (177)$$

Thus we have

$$T_i - S_i = \left(\frac{H-i}{H} - T_{i+1}\right)\left(\omega^\star_i - \omega^\pi_i\right) + \left(T_{i+1}-S_{i+1}\right)\left(1-\omega^\pi_i-\delta\right) \qquad (178)$$

By induction, we get

$$T_1 - S_1 = \sum_{h=1}^{H-1}\left(\omega^\star_h - \omega^\pi_h\right)\left(\frac{H-h}{H} - T_{h+1}\right)\prod_{j=1}^{h-1}\left(1-\omega^\pi_j-\delta\right) \qquad (179)$$

Since the reward is non-negative and only occurs in $x_{H+2}$, we know that $V^\star(x_1) \ge V^\star(x_2) \ge\cdots\ge V^\star(x_H)$. If $N_h$ does not happen for any $h\in[H]$, then the agent must enter $x_{H+1}$. The probability of this event has the following form:

$$P\left(\neg\left(\cup_{h\in[H]}N_h\right)\mid\pi^\star\right) = 1 - \sum_{h=1}^H P(N_h\mid\pi^\star) \qquad (180)$$
$$= \prod_{h\in[H]}\left(1-\delta-\omega^\star_h\right) \qquad (181)$$
$$\ge \prod_{h\in[H]}\left(1-\frac{3}{2H}\right) \qquad (182)$$
$$= \left(1-\frac{3}{2H}\right)^H \qquad (183)$$
$$\ge 0.2 \qquad (184)$$

since $\delta = \frac1H$ and $|\omega^\star_h| \le \frac{1}{2H}$. The above discussion indicates that $T_{h+1} \le 0.8\cdot\frac{H-h-1}{H}$, hence $\frac{H-h}{H} - T_{h+1} \ge 0.2\cdot\frac{H-h}{H} \ge 0.1$ for $h \le H/2$. Similarly, $\prod_{j=1}^{h-1}\left(1-\omega^\pi_j-\delta\right) \ge \left(1-\frac{3}{2H}\right)^{H-1} \ge 0.2$. Combining with Eqn. 179, we have

$$T_1 - S_1 \ge 0.02\sum_{h=1}^{H/2}\left(\omega^\star_h - \omega^\pi_h\right) = 0.02\sum_{h=1}^{H/2}\left(\max_{a\in\mathcal{A}} r^b_h(a) - \sum_{a\in\mathcal{A}}\pi_h(a\mid s_h)\,r^b_h(a)\right) \qquad (185)$$

Combining with the definitions of $T_1$ and $S_1$, we can prove the lemma.

After proving Lemma 13, we are ready to prove Proposition 1.

Proof. (Proof of Proposition 1) By Lemma 13, we can decompose the sub-optimality gap of a policy $\pi$ in the following way:

$$V^\star(x_1) - V^\pi(x_1) \ge 0.02\sum_{h=1}^{H/2}\left(\max_{a\in\mathcal{A}} r^b_h(a) - \sum_{a\in\mathcal{A}}\pi_h(a\mid s_h)\,r^b_h(a)\right) \qquad (186)$$

where $r^b_h(a) = \mu_h^\top a + \zeta_h(a)$ can be regarded as a reward function for a misspecified linear bandit. To prove Proposition 1, the only remaining step is to derive the lower bound for misspecified linear bandits. We directly apply the following two lower bounds for linear bandits.

Lemma 14. (Lemma C.8 in Zhou et al. (2020a)) Fix a positive real $0 < \delta \le 1/3$ and positive integers $T, d$, and assume that $T \ge d^2/(2\delta)$. Consider the linear bandit problem $L_\mu$ parametrized with a parameter vector $\mu\in\{-\Delta,\Delta\}^d$ and action set $\mathcal{A} = \{-1,1\}^d$, so that the reward distribution for taking action $a\in\mathcal{A}$ is a Bernoulli distribution $B\left(\delta + (\mu^\star)^\top a\right)$. Then for any bandit algorithm $\mathcal{B}$, there exists a $\mu^\star\in\{-\Delta,\Delta\}^d$ such that the expected pseudo-regret of $\mathcal{B}$ over $T$ steps on bandit $L_{\mu^\star}$ is lower bounded by $\frac{d\sqrt{T\delta}}{8\sqrt 2}$.

Lemma 15. (Proposition 6 in Zanette et al. (2020a)) There exists a feature map $\phi:\mathcal{A}\to\mathbb{R}^d$ that defines a misspecified linear bandit class $\mathcal{M}$ such that every bandit instance in that class has reward response $\mu_a = \phi_a^\top\theta + z_a$ for any action $a$ (here $z_a\in[0,\zeta]$ is the deviation from linearity and $\mu_a\in[0,1)$), and such that the expected regret of any algorithm on at least one member of the class up to round $T$ is $\Omega(\sqrt d\,\zeta T)$.

Lemma 14 is used to prove the lower bound for linear mixture MDPs in Zhou et al. (2020a); it gives the lower bound for linear bandits with approximation error $\zeta = 0$, while Lemma 15 mainly considers the influence of $\zeta$ on the lower bound. Combining these two lemmas, the regret lower bound for misspecified linear bandits is $\Omega\left(\max\left(d\sqrt{T\delta},\ \sqrt d\,\zeta T\right)\right) = \Omega\left(d\sqrt{T\delta} + \sqrt d\,\zeta T\right)$. Since our problem reduces from $H/2$ such bandit problems with $\delta = 1/H$, the regret lower bound is $\Omega\left(Hd\sqrt{T\delta} + H\sqrt d\,\zeta T\right) = \Omega\left(d\sqrt{HT} + H\sqrt d\,\zeta T\right)$.

Now we have obtained the regret lower bound for misspecified linear MDPs. We can then prove the corresponding lower bound for the LSVI setting of Zanette et al. (2020a), since the LSVI setting is strictly harder than the linear MDP setting. The following lemma states the relation between the two settings.
Lemma 16. If an MDP $(\mathcal{S},\mathcal{A},p,r,H)$ is a misspecified linear MDP with approximation error $\zeta$, then this MDP satisfies the low inherent Bellman error assumption with $\mathcal{I} = 2\zeta$.

Proof. If an MDP is a $\zeta$-approximate linear MDP, then we have

$$\left\|p_h(\cdot\mid s,a) - \langle\phi(s,a),\theta_h(\cdot)\rangle\right\|_{\mathrm{TV}} \le \zeta \qquad (187)$$
$$\left|r_h(s,a) - \langle\phi(s,a),\nu_h\rangle\right| \le \zeta \qquad (188)$$

For any $\theta_{h+1}\in\mathbb{R}^d$, we have $\mathcal{T}_h\left(Q_{h+1}(\theta_{h+1})\right)(s,a) = r_h(s,a) + E_{s'\sim p_h(\cdot\mid s,a)}\, V_{h+1}(\theta_{h+1})(s')$. Since $V_{h+1}(\theta_{h+1})(s') \le 1$, plugging in the approximately linear forms of $r_h(s,a)$ and $p_h(\cdot\mid s,a)$, we have

$$\left|\mathcal{T}_h\left(Q_{h+1}(\theta_{h+1})\right)(s,a) - \left\langle\phi(s,a),\ \sum_{s'}\theta_h(s')\,V_{h+1}(\theta_{h+1})(s') + \nu_h\right\rangle\right| \le 2\zeta \qquad (189)$$

By Lemma 16, we can directly apply the hard instance construction and the lower bound for misspecified linear MDPs to the LSVI setting.
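The factor $2$ in Eqn. 189 simply adds the two $\zeta$-sized error sources (spelled out here for completeness, using $\sup_{s'} V_{h+1}(\theta_{h+1})(s') \le 1$):

$$\underbrace{\left|r_h(s,a) - \langle\phi(s,a),\nu_h\rangle\right|}_{\le\,\zeta\ \text{by (188)}} + \underbrace{\left|E_{s'\sim p_h}V_{h+1}(\theta_{h+1})(s') - \left\langle\phi(s,a),\textstyle\sum_{s'}\theta_h(s')V_{h+1}(\theta_{h+1})(s')\right\rangle\right|}_{\le\,\zeta\ \text{by (187) and}\ \|V_{h+1}\|_\infty\le 1}\ \le\ 2\zeta.$$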
There exist feature maps $\phi_1, \ldots, \phi_H$ that define an MDP class $\mathcal{M}$ such that every MDP in that class has inherent Bellman error at most $\mathcal{I}$, and such that the expected regret of any algorithm on at least one member of the class (for $|\mathcal{A}| \ge 2$, $d, k, H \ge 2$, $T = \Omega(d^2H)$, $\mathcal{I} \le 1/H$) is $\Omega\left(d\sqrt{HT} + \sqrt d\,H\mathcal{I}T\right)$.

C.2 Lower Bound for Multi-task RL

In order to prove Theorem 5, we need to prove and then combine the following two lemmas.
Lemma 17.
Under the setting of Theorem 5, the expected regret of any algorithm $\mathcal{A}$ is lower bounded by $\Omega\left(Mk\sqrt{HT}\right)$.

Lemma 18.
Under the setting of Theorem 5, the expected regret of any algorithm $\mathcal{A}$ is lower bounded by $\Omega\left(d\sqrt{kMHT} + HMT\sqrt d\,\mathcal{I}\right)$.

These two lemmas are proved by reduction from Proposition 2, which is the lower bound we proved for the single-task LSVI setting.
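Since both lower bounds hold simultaneously for any algorithm, they combine via $\max(a,b)\ge\frac{a+b}{2}$ into a single statement (a brief restatement, not an additional argument):

$$\mathrm{Reg}(T) = \Omega\left(\max\left(Mk\sqrt{HT},\ d\sqrt{kMHT} + HMT\sqrt d\,\mathcal{I}\right)\right) = \Omega\left(Mk\sqrt{HT} + d\sqrt{kMHT} + HMT\sqrt d\,\mathcal{I}\right).$$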