Near-optimal Representation Learning for Linear Bandits and Linear RL
Jiachen Hu∗ [email protected]
Key Laboratory of Machine Perception, MOE, School of EECS, Peking University

Xiaoyu Chen∗ [email protected]
Key Laboratory of Machine Perception, MOE, School of EECS, Peking University

Chi Jin [email protected]
Department of Electrical and Computer Engineering, Princeton University

Lihong Li [email protected]
Amazon

Liwei Wang [email protected]
Key Laboratory of Machine Perception, MOE, School of EECS, Peking University
Center for Data Science, Peking University, Beijing Institute of Big Data Research

∗. Equal contribution.
Abstract
This paper studies representation learning for multi-task linear bandits and multi-task episodic RL with linear value function approximation. We first consider the setting where we play $M$ linear bandits with dimension $d$ concurrently, and these bandits share a common $k$-dimensional linear representation with $k \ll d$ and $k \ll M$. We propose a sample-efficient algorithm, MTLR-OFUL, which leverages the shared representation to achieve $\tilde{O}(M\sqrt{dkT} + d\sqrt{kMT})$ regret, with $T$ being the number of total steps. Our regret significantly improves upon the baseline $\tilde{O}(Md\sqrt{T})$ achieved by solving each task independently. We further develop a lower bound that shows our regret is near-optimal when $d > M$. Furthermore, we extend the algorithm and analysis to multi-task episodic RL with linear value function approximation under low inherent Bellman error (Zanette et al., 2020a). To the best of our knowledge, this is the first theoretical result that characterizes the benefits of multi-task representation learning for exploration in RL with function approximation.
1. Introduction
Multi-task representation learning is the problem of learning a common low-dimensional representation among multiple related tasks (Caruana, 1997). This problem has become increasingly important in many applications such as natural language processing (Ando and Zhang, 2005; Liu et al., 2019), computer vision (Li et al., 2014), drug discovery (Ramsundar et al., 2015), and reinforcement learning (Wilson et al., 2007; Teh et al., 2017; D'Eramo et al., 2019). In these cases, common information can be extracted from related tasks to improve data efficiency and accelerate learning.

While representation learning has achieved tremendous success in a variety of applications (Bengio et al., 2013), its theoretical understanding is still limited. A widely accepted assumption in the literature is the existence of a common representation shared by different tasks. For example, Maurer et al. (2016) proposed a general method to learn data representations in the multi-task supervised learning and learning-to-learn settings. Du et al. (2020) studied few-shot learning via representation learning, with assumptions on a common representation among source and target tasks. Tripuraneni et al. (2020) focused on the problem of multi-task linear regression with low-rank representation, and proposed algorithms with sharp statistical rates.

Inspired by the theoretical results in supervised learning, we take a step further to investigate the provable benefits of representation learning for sequential decision-making problems. First, we study multi-task linear bandits, where $M$ tasks of $d$-dimensional (infinite-arm) linear bandits are concurrently learned for $T$ steps. The expected reward of arm $x_i \in \mathbb{R}^d$ for task $i$ is $\theta_i^\top x_i$, as determined by an unknown linear parameter $\theta_i$. To take advantage of the multi-task representation learning framework, we assume that the $\theta_i$'s lie in an unknown $k$-dimensional subspace of $\mathbb{R}^d$, where $k$ is much smaller than $d$ and $M$ (Yang et al., 2020). The dependence among tasks makes it possible to achieve a regret bound better than solving each task independently. Specifically, if the tasks are solved independently with standard algorithms such as OFUL (Abbasi-Yadkori et al., 2011), the total regret is $\tilde{O}(Md\sqrt{T})$. By leveraging the common representation among tasks, we can achieve a better regret $\tilde{O}(M\sqrt{dkT} + d\sqrt{kMT})$. Our algorithm is also robust to the linear representation assumption when the model is misspecified: if the $k$-dimensional subspace approximates the rewards with error at most $\zeta$, our algorithm still achieves regret $\tilde{O}(M\sqrt{dkT} + d\sqrt{kMT} + MT\sqrt{d}\,\zeta)$. Moreover, we prove a regret lower bound indicating that the regret of our algorithm is not improvable except for logarithmic factors in the regime $d > M$.

Compared with multi-task linear bandits, multi-task reinforcement learning is a more popular research topic, with a long line of works on both the theoretical and the empirical side (Taylor and Stone, 2009; Parisotto et al., 2015; Liu et al., 2016; Teh et al., 2017; Hessel et al., 2019; D'Eramo et al., 2019; Arora et al., 2020). We extend our algorithm for linear bandits to multi-task episodic reinforcement learning with linear value function approximation under low inherent Bellman error (Zanette et al., 2020a).
Assuming a low-rank linear representation across all the tasks, we propose a sample-efficient algorithm with regret $\tilde{O}(HM\sqrt{dkT} + Hd\sqrt{kMT} + HMT\sqrt{d}\,\mathcal{I})$, where $k$ is the dimension of the low-rank representation, $d$ is the ambient dimension of state-action features, $M$ is the number of tasks, $H$ is the horizon, $T$ is the number of episodes, and $\mathcal{I}$ denotes the inherent Bellman error. The regret significantly improves upon the baseline regret $\tilde{O}(HMd\sqrt{T} + HMT\sqrt{d}\,\mathcal{I})$ achieved by running the ELEANOR algorithm (Zanette et al., 2020a) for each task independently. We also prove a regret lower bound $\Omega(Mk\sqrt{HT} + d\sqrt{HkMT} + HMT\sqrt{d}\,\mathcal{I})$. To the best of our knowledge, this is the first provably sample-efficient algorithm for exploration in multi-task linear RL.
2. Preliminaries
We study the problem of representation learning for linear bandits in which there are multiple tasks sharing common low-dimensional features. Let $d$ be the ambient dimension and $k$ be the representation dimension. We play $M$ tasks concurrently for $T$ steps each. Each task $i \in [M]$ is associated with an unknown vector $\theta_i \in \mathbb{R}^d$. In each step $t \in [T]$, the player chooses one action $x_{t,i} \in \mathcal{A}_{t,i}$ for each task $i \in [M]$, and receives a batch of rewards $\{y_{t,i}\}_{i=1}^M$ afterwards, where $\mathcal{A}_{t,i}$ is the feasible action set (which can even be chosen adversarially) for task $i$ at step $t$. The rewards are determined by $y_{t,i} = \theta_i^\top x_{t,i} + \eta_{t,i}$, where $\eta_{t,i}$ is random noise.

We use the total regret over $M$ tasks in $T$ steps to measure the performance of our algorithm, defined as
$$\mathrm{Reg}(T) \overset{\mathrm{def}}{=} \sum_{t=1}^{T}\sum_{i=1}^{M}\big(\langle x^\star_{t,i}, \theta_i\rangle - \langle x_{t,i}, \theta_i\rangle\big), \quad \text{where } x^\star_{t,i} = \operatorname*{argmax}_{x \in \mathcal{A}_{t,i}} \langle x, \theta_i\rangle.$$
The main assumption is the existence of a common linear feature extractor.

Assumption 1.
There exists a linear feature extractor $B \in \mathbb{R}^{d\times k}$ and a set of $k$-dimensional coefficients $\{w_i\}_{i=1}^M$ such that $\{\theta_i\}_{i=1}^M$ satisfies $\theta_i = Bw_i$.
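To make Assumption 1 and the regret definition concrete, the following is a minimal simulation sketch. It is our own illustrative construction, not from the paper; all function and variable names are ours.

```python
import numpy as np

# Illustrative sketch (our construction, not the paper's): a synthetic
# instance satisfying Assumption 1, i.e. theta_i = B w_i with a shared B.
rng = np.random.default_rng(0)
d, k, M = 20, 3, 10

B = np.linalg.qr(rng.standard_normal((d, k)))[0]   # shared d x k feature extractor
W = rng.standard_normal((k, M))
W /= np.linalg.norm(B @ W, axis=0)                 # enforce ||theta_i||_2 <= 1
Theta = B @ W                                      # column i is theta_i = B w_i

def play_step(action_sets, choices):
    """One step t: action_sets[i] holds the arms of A_{t,i} as rows, and
    choices[i] is the index of the arm pulled for task i.  Returns the noisy
    rewards y_{t,i} and the instantaneous regret summed over tasks, matching
    the regret definition above."""
    rewards, regret = [], 0.0
    for i in range(M):
        means = action_sets[i] @ Theta[:, i]
        rewards.append(means[choices[i]] + rng.standard_normal())  # 1-sub-Gaussian noise
        regret += means.max() - means[choices[i]]
    return rewards, regret

# Example: five random unit-norm arms per task; naively pull arm 0 everywhere.
A_t = [rng.standard_normal((5, d)) for _ in range(M)]
A_t = [a / np.linalg.norm(a, axis=1, keepdims=True) for a in A_t]
_, inst_regret = play_step(A_t, [0] * M)
print(inst_regret)
```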
1. $\tilde{O}(\cdot)$ hides logarithmic factors.

Define $\mathcal{F}_t$ to be the $\sigma$-algebra induced by $\sigma(\{x_{1,i}\}_{i=1}^M, \cdots, \{x_{t+1,i}\}_{i=1}^M, \{\eta_{1,i}\}_{i=1}^M, \cdots, \{\eta_{t,i}\}_{i=1}^M)$; then we have the following assumption.

Assumption 2.
Following the standard regularity assumptions in linear bandits (Abbasi-Yadkori et al., 2011; Lattimore and Szepesvári, 2020), we assume
- $\|\theta_i\|_2 \le 1$ for all $i \in [M]$;
- $\|x\|_2 \le 1$ for all $x \in \mathcal{A}_{t,i}$, $t \in [T]$, $i \in [M]$;
- $\eta_{t,i}$ is a conditionally zero-mean $1$-sub-Gaussian random variable with regard to $\mathcal{F}_{t-1}$.

For notational convenience, we use $X_{t,i} = [x_{1,i}, x_{2,i}, \cdots, x_{t,i}]$ and $y_{t,i} = [y_{1,i}, \cdots, y_{t,i}]^\top$ to denote the arms and the corresponding rewards collected for task $i \in [M]$ in the first $t$ steps, and we use $\eta_{t,i} = [\eta_{1,i}, \eta_{2,i}, \cdots, \eta_{t,i}]^\top$ to denote the corresponding noise. We define $\Theta \overset{\mathrm{def}}{=} [\theta_1, \theta_2, \cdots, \theta_M]$ and $W \overset{\mathrm{def}}{=} [w_1, w_2, \cdots, w_M]$. For any positive definite matrix $A \in \mathbb{R}^{d\times d}$, the Mahalanobis norm with regard to $A$ is denoted by $\|x\|_A = \sqrt{x^\top A x}$.

We also study how this low-rank structure benefits the exploration problem with approximate linear value functions in multi-task episodic reinforcement learning. For reference convenience, we abbreviate our setting as the multi-task LSVI setting, a natural extension of the LSVI condition in the single-task setting (Zanette et al., 2020a).

Consider an undiscounted episodic MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, p, r, H)$ with state space $\mathcal{S}$, action space $\mathcal{A}$, and fixed horizon $H$. For any $h \in [H]$, any state $s_h \in \mathcal{S}$ and action $a_h \in \mathcal{A}$, the agent receives a reward $R_h(s_h, a_h)$ with mean $r_h(s_h, a_h)$, and transits to the next state $s_{h+1}$ according to the transition kernel $p_h(\cdot \mid s_h, a_h)$. The action value function at step $h$ under a deterministic policy $\pi$ is defined as $Q^\pi_h(s_h, a_h) \overset{\mathrm{def}}{=} r_h(s_h, a_h) + \mathbb{E}\big[\sum_{t=h+1}^H R_t(s_t, \pi_t(s_t))\big]$, and the state value function is defined as $V^\pi_h(s_h) = Q^\pi_h(s_h, \pi_h(s_h))$.

Note that there always exists an optimal deterministic policy (under some regularity conditions) $\pi^*$ for which $V^{\pi^*}_h(s) = \max_\pi V^\pi_h(s)$ and $Q^{\pi^*}_h(s,a) = \max_\pi Q^\pi_h(s,a)$ for each $h \in [H]$. We denote $V^{\pi^*}_h$ and $Q^{\pi^*}_h$ by $V^*_h$ and $Q^*_h$ for short. It is also convenient to define the Bellman optimality operator $\mathcal{T}_h$ as $\mathcal{T}_h(Q_{h+1})(s,a) \overset{\mathrm{def}}{=} r_h(s,a) + \mathbb{E}_{s' \sim p_h(\cdot\mid s,a)} \max_{a'} Q_{h+1}(s', a')$.

In the framework of single-task approximate linear value functions (see Section 5 for more discussion), we assume a feature map $\phi: \mathcal{S}\times\mathcal{A} \to \mathbb{R}^d$ that maps each state-action pair to a $d$-dimensional vector. In case $\mathcal{S}$ is too large or continuous (e.g., in robotics), this feature map helps to reduce the problem scale from $|\mathcal{S}|\times|\mathcal{A}|$ to $d$. The value functions are linear combinations of these features, so we define the function spaces at step $h \in [H]$ as $\mathcal{Q}'_h = \{Q_h(\theta_h) \mid \theta_h \in \Theta'_h\}$ and $\mathcal{V}'_h = \{V_h(\theta_h) \mid \theta_h \in \Theta'_h\}$, where $Q_h(\theta_h)(s,a) \overset{\mathrm{def}}{=} \phi(s,a)^\top\theta_h$ and $V_h(\theta_h)(s) \overset{\mathrm{def}}{=} \max_a \phi(s,a)^\top\theta_h$.

In order to find the optimal value function using value iteration with $\mathcal{Q}_h$, we require that the function space is approximately closed under $\mathcal{T}_h$, as measured by the inherent Bellman error (IBE for short). The IBE (Zanette et al., 2020a) at step $h$ is defined as
$$\mathcal{I}_h \overset{\mathrm{def}}{=} \sup_{Q_{h+1}\in\mathcal{Q}_{h+1}}\ \inf_{Q_h\in\mathcal{Q}_h}\ \sup_{s\in\mathcal{S}, a\in\mathcal{A}} \big|(Q_h - \mathcal{T}_h(Q_{h+1}))(s,a)\big|. \tag{1}$$

In multi-task reinforcement learning, we have $M$ MDPs $\mathcal{M}^1, \mathcal{M}^2, \ldots, \mathcal{M}^M$ (we use superscript $i$ to denote task $i$). Assume they share the same state space and action space, but have different rewards and transitions.
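As a quick sanity check on definition (1), consistent with the remark later that the low-IBE setting covers linear MDPs as a special case: in a linear MDP the function class is exactly closed under the Bellman operator, so $\mathcal{I}_h = 0$. A worked version of this standard fact, under the usual linear-MDP assumptions on $r_h$ and $p_h$:

```latex
% In a linear MDP, r_h(s,a) = \phi(s,a)^\top \mu_h and
% p_h(s' \mid s,a) = \phi(s,a)^\top \nu_h(s') for some \mu_h and measure \nu_h.
% Then for any Q_{h+1} \in \mathcal{Q}_{h+1},
\mathcal{T}_h(Q_{h+1})(s,a)
  = \phi(s,a)^\top \mu_h
  + \mathbb{E}_{s' \sim p_h(\cdot \mid s,a)}\Big[\max_{a'} Q_{h+1}(s',a')\Big]
  = \phi(s,a)^\top \underbrace{\Big(\mu_h
      + \int \nu_h(s') \max_{a'} Q_{h+1}(s',a')\,\mathrm{d}s'\Big)}_{=:\,\theta_h},
% so \mathcal{T}_h(Q_{h+1}) = Q_h(\theta_h) is again linear in \phi, and the
% infimum in Eqn (1) is attained with value 0 (provided \theta_h \in \Theta'_h).
```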
To take advantage of the multi-task LSVI setting and low-rank representation learning, we define a joint parameter space for all the tasks as $\Theta_h \overset{\mathrm{def}}{=} \{(B_hw^1_h, B_hw^2_h, \cdots, B_hw^M_h) : B_h \in \mathcal{O}^{d\times k},\ w^i_h \in \mathbb{B}^k,\ B_hw^i_h \in \Theta'^i_h\}$, where $\mathcal{O}^{d\times k}$ is the collection of all orthonormal matrices in $\mathbb{R}^{d\times k}$ and $\mathbb{B}^k$ is the Euclidean ball in $\mathbb{R}^k$.

The induced function spaces are defined as
$$\mathcal{Q}_h \overset{\mathrm{def}}{=} \big\{\big(Q^1_h(\theta^1_h), Q^2_h(\theta^2_h), \cdots, Q^M_h(\theta^M_h)\big) \mid \big(\theta^1_h, \theta^2_h, \cdots, \theta^M_h\big) \in \Theta_h\big\} \tag{2}$$
$$\mathcal{V}_h \overset{\mathrm{def}}{=} \big\{\big(V^1_h(\theta^1_h), V^2_h(\theta^2_h), \cdots, V^M_h(\theta^M_h)\big) \mid \big(\theta^1_h, \theta^2_h, \cdots, \theta^M_h\big) \in \Theta_h\big\} \tag{3}$$

The low-rank IBE at step $h$ for the multi-task LSVI setting is a generalization of the IBE (Eqn 1) in the single-task setting, defined accordingly as
$$\mathcal{I}^{\mathrm{mul}}_h \overset{\mathrm{def}}{=} \sup_{\{Q^i_{h+1}\}_{i=1}^M \in \mathcal{Q}_{h+1}}\ \inf_{\{Q^i_h\}_{i=1}^M \in \mathcal{Q}_h}\ \sup_{s\in\mathcal{S}, a\in\mathcal{A}, i\in[M]} \big|\big(Q^i_h - \mathcal{T}^i_h(Q^i_{h+1})\big)(s,a)\big| \tag{4}$$

Assumption 3. $\mathcal{I} \overset{\mathrm{def}}{=} \sup_h \mathcal{I}^{\mathrm{mul}}_h$ is small with regard to the joint function space $\mathcal{Q}_h$ for all $h$.

When $\mathcal{I} = 0$, Assumption 3 can be regarded as a natural extension of Assumption 1 to episodic RL. This is because, in the case $\mathcal{I} = 0$, there exists $\{\bar\theta^{i*}_h\}_{i=1}^M \in \Theta_h$ such that $Q^{i*}_h = Q^i_h(\bar\theta^{i*}_h)$ for all $i \in [M]$ and $h \in [H]$; according to the definition of $\Theta_h$, the $\{\bar\theta^{i*}_h\}_{i=1}^M$ then admit the low-rank property that Assumption 1 stipulates. When $\mathcal{I} > 0$, Assumption 3 is an extension of misspecified multi-task linear bandits (discussed in Section 4.3) to episodic RL.

Define the filtration $\mathcal{F}_{h,t}$ to be the $\sigma$-field induced by all the random variables up to step $h$ in episode $t$ (not including the rewards at step $h$ in episode $t$); then we have the following assumption.

Assumption 4.
Following the parameter scale in Zanette et al. (2020a), we assume
- $\|\phi(s,a)\|_2 \le 1$ for all $(s,a) \in \mathcal{S}\times\mathcal{A}$, $h \in [H]$;
- $0 \le Q^\pi_h(s,a) \le 1$ for all $(s,a) \in \mathcal{S}\times\mathcal{A}$, $h \in [H]$, and all $\pi$;
- there exists a constant $D$ such that for any $h \in [H]$ and any $\{\theta^i_h\}_{i=1}^M \in \Theta_h$, it holds that $\|\theta^i_h\|_2 \le D$ for all $i \in [M]$;
- for any fixed $\{Q^i_{h+1}\}_{i=1}^M \in \mathcal{Q}_{h+1}$, the random noise $z^i_h(s,a) \overset{\mathrm{def}}{=} R^i_h(s,a) + \max_{a'} Q^i_{h+1}(s', a') - \mathcal{T}^i_h(Q^i_{h+1})(s,a)$ is bounded in $[-1, 1]$ almost surely, and is independent conditioned on $\mathcal{F}_{h,t}$ for any $s \in \mathcal{S}$, $a \in \mathcal{A}$, $h \in [H]$, $i \in [M]$, where the randomness is from the reward $R$ and $s' \sim p_h(\cdot\mid s,a)$.

The first condition is a standard regularity condition for linear features. The second condition sets the scale of the problem; exploration problems in which the value function is bounded in $[0,1]$ have also been studied in both the tabular and the linear setting (Zhang et al., 2020; Wang et al., 2020; Zanette et al., 2020a). The last two conditions are compatible with this scale: it is sufficient to assume a constant norm bound on $\theta^i_h$ since the optimal value function is of the same scale, and the last condition is standard in linear bandits (Abbasi-Yadkori et al., 2011; Lattimore and Szepesvári, 2020) and RL (Zanette et al., 2020a), being automatically satisfied if $D = 1$.

The total regret of $M$ tasks in $T$ episodes is defined as
$$\mathrm{Reg}(T) \overset{\mathrm{def}}{=} \sum_{t=1}^T\sum_{i=1}^M \big(V^{i*}_1 - V^{\pi^i_t}_1\big)(s^i_{1t}) \tag{5}$$
where $\pi^i_t$ is the policy used for task $i$ in episode $t$, and $s^i_{ht}$ denotes the state encountered at step $h$ in episode $t$ for task $i$. We assume $M \ge 2$ and $T \ge 2$.

3. Related Work

Multi-task Supervised Learning
The idea of multi-task representation learning dates back at least to Caruana (1997); Thrun and Pratt (1998); Baxter (2000). Empirically, representation learning has shown great power in various domains; we refer readers to Bengio et al. (2013) for a detailed review of empirical results. From the theoretical perspective, Baxter (2000) performed the first theoretical analysis and gave sample complexity bounds using covering numbers. Maurer et al. (2016) considered the setting where all tasks are sampled from a certain distribution, and analysed the benefit of representation learning for reducing the sample complexity of the target task. Following their results, Du et al. (2020) and Tripuraneni et al. (2020) replaced the i.i.d. assumption with a deterministic assumption on the data distribution and task diversity, and proposed efficient algorithms that can fully utilize all source data with better sample complexity. These results mainly focus on statistical rates for multi-task supervised learning, and cannot tackle the exploration problem in bandits and RL.
Multi-task Bandit Learning
For multi-task linear bandits, the most related work is a recent paper by Yang et al. (2020). For linear bandits with an infinite action set, they proposed an explore-then-exploit algorithm with regret $\tilde{O}(Mk\sqrt{T} + d^{1.5}k\sqrt{MT})$, which outperforms the naive approach with $\tilde{O}(Md\sqrt{T})$ regret in the regime where $M = \Omega(dk)$. Though their results are insightful, they require the action sets for all tasks and all steps to be the same well-conditioned $d$-dimensional ellipsoid, which covers all directions nicely with constant radius. Besides, they assume that the task parameters are diverse enough that $WW^\top$ is well-conditioned, and that the norm of each $w_i$ is lower bounded by a constant. These assumptions restrict the application of their theory to a subset of linear bandit instances with benign structure. In contrast, our theory is more general: we assume neither identical, well-conditioned action sets across tasks and time steps, nor benign properties of the $w_i$'s.

Multi-task RL
For multi-task reinforcement learning, there is a long line of works from the empirical perspective (Taylor and Stone, 2009; Parisotto et al., 2015; Liu et al., 2016; Teh et al., 2017; Hessel et al., 2019). From the theoretical perspective, Brunskill and Li (2013) analyzed the sample complexity of multi-task RL in the tabular setting. D'Eramo et al. (2019) showed that representation learning can improve the rate of approximate value iteration. Arora et al. (2020) proved that representation learning can reduce the sample complexity of imitation learning.
Bandits with Low Rank Structure
Low-rank representations have also been explored in single-task settings. Jun et al. (2019) studied bilinear bandits with low-rank representation: the mean reward in their setting is the bilinear form $x^\top\Theta y$, where $x$ and $y$ are two actions selected at each step, and $\Theta$ is an unknown low-rank parameter matrix. Their setting is further generalized by Lu et al. (2020). Furthermore, sparse linear bandits can be regarded as a simplified setting, where $B$ is a binary matrix indicating the subset of relevant features in the context $x$ (Abbasi-Yadkori et al., 2012; Carpentier and Munos, 2012; Lattimore et al., 2015; Hao et al., 2020).

Exploration in Bandits and RL
Our regret analysis is also related to exploration in single-task linear bandits and linear RL. Linear bandits have been extensively studied in recent years (Auer, 2002; Dani et al., 2008; Rusmevichientong and Tsitsiklis, 2010; Abbasi-Yadkori et al., 2011; Chu et al., 2011; Li et al., 2019a,b). Our algorithm is most closely related to the seminal work of Abbasi-Yadkori et al. (2011), who applied self-normalized techniques to obtain near-optimal regret upper bounds. For single-task linear RL, recent years have witnessed a large number of works under different function approximation settings, including linear MDPs (Yang and Wang, 2019; Jin et al., 2020), linear mixture MDPs (Ayoub et al., 2020; Zhou et al., 2020a), linear RL with low inherent Bellman error (Zanette et al., 2020a,b), and MDPs with low Bellman rank (Jiang et al., 2017). Our multi-task setting is a natural extension of the linear RL setting with low inherent Bellman error, which covers the linear MDP setting as a special case (Zanette et al., 2020a).
4. Main Results for Linear Bandits
In this section, we present our main results for multi-task linear bandits.
4.1 Algorithm

A natural and successful principle for designing efficient sequential decision-making algorithms is optimism in the face of uncertainty. When applied to single-task linear bandits, the basic idea is to maintain a confidence set $\mathcal{C}_t$ for the parameter $\theta$ based on the history observations at each step $t \in [T]$. The algorithm chooses an optimistic estimate $\tilde\theta_t = \operatorname{argmax}_{\theta\in\mathcal{C}_t}(\max_{x\in\mathcal{A}_t}\langle x, \theta\rangle)$ and then selects the action $x_t = \operatorname{argmax}_{x\in\mathcal{A}_t}\langle x, \tilde\theta_t\rangle$, which maximizes the reward according to the estimate $\tilde\theta_t$. In other words, the algorithm chooses the pair $(x_t, \tilde\theta_t) = \operatorname{argmax}_{(x,\theta)\in\mathcal{A}_t\times\mathcal{C}_t}\langle x, \theta\rangle$.

For multi-task linear bandits, the main difference is that we need to tackle $M$ highly correlated tasks concurrently. To obtain a tighter confidence bound, we maintain the confidence set $\mathcal{C}_t$ for $B$ and $\{w_i\}_{i=1}^M$, then choose the optimistic estimate $\tilde\Theta_t$ for all tasks concurrently. To be more specific, the algorithm chooses an optimistic estimate $\tilde\Theta_t = \operatorname{argmax}_{\Theta\in\mathcal{C}_t}(\max_{\{x_i\in\mathcal{A}_{t,i}\}_{i=1}^M}\sum_{i=1}^M\langle x_i, \theta_i\rangle)$, and then selects the action $x_{t,i} = \operatorname{argmax}_{x_i\in\mathcal{A}_{t,i}}\langle x_i, \tilde\theta_{t,i}\rangle$ for each task $i \in [M]$.

The main technical contribution is the construction of a tighter confidence set $\mathcal{C}_t$ for the estimation of $\Theta$. At each step $t \in [T]$, we solve the following least-squares problem based on the samples collected so far and obtain the minimizer $\hat B_t$ and $\hat W_t$:
$$\operatorname*{argmin}_{B\in\mathbb{R}^{d\times k},\ w_{1..M}\in\mathbb{R}^{k\times M}}\ \sum_{i=1}^M \big\|y_{t-1,i} - X^\top_{t-1,i}Bw_i\big\|_2^2 \tag{6}$$
$$\mathrm{s.t.}\ \|Bw_i\|_2 \le 1,\ \forall i \in [M]. \tag{7}$$
We maintain a high-probability confidence set $\mathcal{C}_t$ for the unknown parameters $B$ and $\{w_i\}_{i=1}^M$, calculated as
$$\mathcal{C}_t \overset{\mathrm{def}}{=} \Big\{\Theta = BW :\ \sum_{i=1}^M \big\|\hat B_t\hat w_{t,i} - Bw_i\big\|^2_{\tilde V_{t-1,i}(\lambda)} \le L,\ B\in\mathbb{R}^{d\times k},\ w_i\in\mathbb{R}^k,\ \|Bw_i\|_2 \le 1,\ \forall i\in[M]\Big\}, \tag{8}$$
where $L = \tilde{O}(Mk + kd)$ (see Appendix A.1 for the exact value) and $\tilde V_{t-1,i}(\lambda) = X_{t-1,i}X^\top_{t-1,i} + \lambda I_d$. Here $\lambda$ is a hyperparameter that ensures $\tilde V_{t-1,i}(\lambda)$ is always invertible, and can be set to $1$. We can guarantee that $\Theta \in \mathcal{C}_t$ for all $t \in [T]$ with high probability by the following lemma.

Lemma 1.
With probability at least $1-\delta$, for any step $t \in [T]$, letting $\hat\Theta_t = \hat B_t\hat W_t$ be the optimal solution of the least-squares regression (Eqn 6), the true parameter $\Theta = BW$ is always contained in the confidence set $\mathcal{C}_t$, i.e.,
$$\sum_{i=1}^M \big\|\hat B_t\hat w_{t,i} - Bw_i\big\|^2_{\tilde V_{t-1,i}(\lambda)} \le L, \tag{9}$$
where $\tilde V_{t-1,i}(\lambda) = X_{t-1,i}X^\top_{t-1,i} + \lambda I_d$.
If we solve each task independently with standard single-task algorithms such as OFUL (Abbasi-Yadkori et al., 2011), it is not hard to see that we can only obtain a confidence set with $\sum_{i=1}^M \|\hat B_t\hat w_{t,i} - Bw_i\|^2_{\tilde V_{t-1,i}(\lambda)} \le L = \tilde{O}(Md)$. Our confidence bound is much sharper than this naive bound, which explains the improvement in our final regret. Compared with Yang et al. (2020), we are not able to estimate $B$ and $W$ directly by their methods due to our more relaxed bandit setting: in our setting, the empirical design matrix $\tilde V_{t-1,i}(\lambda)$ can be quite ill-conditioned if the action set at each step is chosen adversarially. Thus, we have to establish a tighter confidence set to improve the regret bound.

We only sketch the main idea of the proof of Lemma 1 here and defer the detailed argument to Appendix A.1. Considering the non-trivial case where $d \ge 2k$, our main observation is that both $BW$ and $\hat B_t\hat W_t$ are low-rank matrices with rank upper bounded by $k$, which indicates that $\mathrm{rank}(\hat B_t\hat W_t - BW) \le 2k$. Therefore, we can write $\hat B_t\hat W_t - BW = U_tR_t = [U_tr_{t,1}, U_tr_{t,2}, \cdots, U_tr_{t,M}]$, where $U_t \in \mathbb{R}^{d\times 2k}$ is an orthonormal matrix and $R_t \in \mathbb{R}^{2k\times M}$. Thus we have $X^\top_{t-1,i}(\hat B_t\hat w_{t,i} - Bw_i) = (U^\top_tX_{t-1,i})^\top r_{t,i}$. This observation indicates that we can project the history actions $X_{t-1,i}$ onto a $2k$-dimensional space with $U_t$, and treat $U^\top_tX_{t-1,i}$ as the $2k$-dimensional actions selected in the first $t-1$ steps. In this way, we can relate $\sum_{i=1}^M\|\hat B_t\hat w_{t,i} - Bw_i\|^2_{\tilde V_{t-1,i}(\lambda)}$ to the term $\sum_{i=1}^M\|\eta^\top_{t-1,i}(U^\top_tX_{t-1,i})^\top\|^2_{V^{-1}_{t-1,i}(\lambda)}$, where $V_{t-1,i}(\lambda) \overset{\mathrm{def}}{=} (U^\top_tX_{t-1,i})(U^\top_tX_{t-1,i})^\top + \lambda I$. We bound this term for fixed $U_t$ with the technique of self-normalized bounds for vector-valued martingales (Abbasi-Yadkori et al., 2011), and then apply the $\epsilon$-net trick to cover all possible $U_t$. This leads to an upper bound for $\sum_{i=1}^M\|\eta^\top_{t-1,i}X^\top_{t-1,i}U_t\|^2_{V^{-1}_{t-1,i}(\lambda)}$, and consequently yields the bound in Lemma 1.

We describe our Multi-Task Low-Rank OFUL algorithm in Algorithm 1.

Algorithm 1 Multi-Task Low-Rank OFUL (MTLR-OFUL)
  for step $t = 1, 2, \cdots, T$ do
    Calculate the confidence set $\mathcal{C}_t$ by Eqn 8
    $(\tilde\Theta_t, \{x_{t,i}\}_{i=1}^M) = \operatorname{argmax}_{\Theta\in\mathcal{C}_t,\ x_i\in\mathcal{A}_{t,i}} \sum_{i=1}^M \langle x_i, \theta_i\rangle$
    for task $i = 1, 2, \cdots, M$ do
      Play $x_{t,i}$ for task $i$, and observe the reward $y_{t,i}$
    end for
  end for
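The least-squares problem in Eqns (6)-(7) is a rank-constrained (non-convex) regression, and the paper does not commit to a particular solver (Section 6 leaves computational efficiency open). As a rough illustration of what computing $(\hat B_t, \hat W_t)$ could look like, here is a sketch based on alternating minimization, a common heuristic for such problems; the function name, ridge regularizer, and projection step are ours, not the paper's.

```python
import numpy as np

def fit_low_rank(X_list, y_list, k, n_iters=50, ridge=1e-8, seed=0):
    """Heuristic solver for Eqns (6)-(7): minimize over B (d x k) and
    w_1..w_M (each in R^k) the loss sum_i ||y_i - X_i^T B w_i||^2, with
    ||B w_i||_2 <= 1 imposed by a final rescaling.  X_list[i] is the d x t
    matrix X_{t,i} whose columns are the actions played for task i."""
    d, M = X_list[0].shape[0], len(X_list)
    B = np.linalg.qr(np.random.default_rng(seed).standard_normal((d, k)))[0]
    W = np.zeros((k, M))
    for _ in range(n_iters):
        # w-step: with B fixed, each task is an independent ridge regression
        # in the k-dimensional projected feature space Z = X^T B.
        for i, (X, y) in enumerate(zip(X_list, y_list)):
            Z = X.T @ B
            W[:, i] = np.linalg.solve(Z.T @ Z + ridge * np.eye(k), Z.T @ y)
        # B-step: with all w_i fixed, x^T B w = kron(w, x)^T vec(B) (column-
        # major vec), so vec(B) solves one big ridge regression.
        rows = [np.kron(W[:, i], X[:, j])
                for i, X in enumerate(X_list) for j in range(X.shape[1])]
        A = np.asarray(rows)
        b = np.concatenate(y_list)
        vecB = np.linalg.solve(A.T @ A + ridge * np.eye(d * k), A.T @ b)
        B = vecB.reshape((d, k), order="F")
    scale = max(1.0, np.max(np.linalg.norm(B @ W, axis=0)))  # ||B w_i|| <= 1
    return B, W / scale
```

Note that, given $(\hat B_t, \hat W_t)$, the optimistic selection in Algorithm 1 still requires maximizing $\sum_{i}\langle x_i, \theta_i\rangle$ jointly over $\mathcal{C}_t$ and the action sets; this sketch addresses only the regression step.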
4.2 Regret Bound

The following theorem states a bound on the regret of Algorithm 1.

Theorem 1.
Suppose Assumption 1 holds. Then, with probability at least $1-\delta$, the regret of Algorithm 1 is bounded by
$$\mathrm{Reg}(T) = \tilde{O}\big(M\sqrt{dkT} + d\sqrt{kMT}\big). \tag{10}$$

We defer the proof of Theorem 1 to Appendix A.2. The first term in the regret has a linear dependence on $M$; it characterizes the regret incurred by learning the parameters $w_i$ for each task. The second term has a square-root dependence on the total number of samples $MT$, which reflects the cost of learning the common representation with samples from all $M$ tasks. Dividing the total regret by the number of tasks $M$, the average regret per task is $\tilde{O}(\sqrt{dkT} + d\sqrt{kT/M})$. Note that if we solve the $M$ tasks independently with algorithms such as OFUL (Abbasi-Yadkori et al., 2011), the regret per task is $\tilde{O}(d\sqrt{T})$. Our bound thus saves a factor of $\sqrt{d/k}$ over the naive method by leveraging the common representation features. We also show that when $d > M$ our regret bound is near-optimal (see Theorem 3).
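For concreteness, plugging illustrative values $d = 100$, $k = 5$, $M = 50$ into the two bounds (suppressing constants and logarithmic factors):

```latex
\underbrace{Md\sqrt{T}}_{\text{independent OFUL}} = 5000\sqrt{T},
\qquad
\underbrace{M\sqrt{dkT} + d\sqrt{kMT}}_{\text{Theorem 1}}
  = 50\sqrt{500}\,\sqrt{T} + 100\sqrt{250}\,\sqrt{T} \approx 2700\sqrt{T},
```

roughly a $1.9\times$ saving at these sizes. Since the representation-learning term shrinks relative to the first term as $M$ grows, the saving approaches the full $\sqrt{d/k} \approx 4.5$ factor once $M \gg d$.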
4.3 Misspecified Multi-task Linear Bandits

For the multi-task linear bandit problem, it is relatively unrealistic to assume a common feature extractor that fits the reward functions of all $M$ tasks exactly. A more natural situation is that the underlying reward functions are not exactly linear but carry some misspecification. There are related discussions of single-task linear bandits in recent works (Lattimore et al., 2020; Zanette et al., 2020a). We first present a definition of approximately linear bandit learning in the multi-task setting.

Assumption 5. There exists a linear feature extractor $B \in \mathbb{R}^{d\times k}$ and a set of linear coefficients $\{w_i\}_{i=1}^M$ such that the expected reward $\mathbb{E}[y_i \mid x_i]$ for any action $x_i \in \mathbb{R}^d$ satisfies $|\mathbb{E}[y_i \mid x_i] - \langle x_i, Bw_i\rangle| \le \zeta$.

In general, an algorithm designed for a linear model could break down entirely if the underlying model is not linear. However, our algorithm is in fact robust to small model misspecification if we set $L = \tilde{O}(Mk + kd + MT\zeta^2)$ (see Appendix A.4 for the exact value). The following regret bound holds under Assumption 5 if we slightly modify the hyperparameter $L$ in the definition of the confidence region $\mathcal{C}_t$.

Theorem 2.
Under Assumption 5, with probability at least $1-\delta$, the regret of Algorithm 1 is bounded by
$$\mathrm{Reg}(T) = \tilde{O}\big(M\sqrt{dkT} + d\sqrt{kMT} + MT\sqrt{d}\,\zeta\big). \tag{11}$$

Theorem 2 is proved in Appendix A.4. Compared with Theorem 1, there is an additional $\tilde{O}(MT\sqrt{d}\,\zeta)$ term in the regret of Theorem 2. This additional term is inevitably linear in $MT$ due to the intrinsic bias introduced by linear function approximation. Note that our algorithm still enjoys a good theoretical guarantee when $\zeta$ is sufficiently small.

4.4 Regret Lower Bound

In this subsection, we present a regret lower bound for the multi-task linear bandit problem under Assumption 5.
Theorem 3.
For any $k, M, d, T \in \mathbb{Z}_+$ with $k \le d \le T$ and $k \le M$, and any learning algorithm $\mathcal{A}$, there exists a multi-task linear bandit instance satisfying Assumption 5 such that the regret of algorithm $\mathcal{A}$ is lower bounded by
$$\mathrm{Reg}(T) \ge \Omega\big(Mk\sqrt{T} + d\sqrt{kMT} + MT\sqrt{d}\,\zeta\big).$$

We defer the proof of Theorem 3 to Appendix A.5. By setting $\zeta = 0$, Theorem 3 yields the lower bound for the multi-task linear bandit problem under Assumption 1, which is $\Omega(Mk\sqrt{T} + d\sqrt{kMT})$. These lower bounds match the upper bounds in Theorem 1 and Theorem 2, respectively, in the regime where $d > M$. There remains a gap of $\sqrt{d/k}$ in the first term of the regret. For the upper bounds, the main difficulty in obtaining $\tilde{O}(Mk\sqrt{T})$ regret in the first term comes from the estimation of $B$: since the action sets are not fixed and can be ill-conditioned, we cannot follow the explore-then-exploit framework and estimate $B$ at the beginning. Besides, explore-then-exploit algorithms always suffer $\tilde{O}(T^{2/3})$ regret in the general linear bandit setting without further assumptions. Without estimating $B$ beforehand with enough accuracy, exploration in the original $d$-dimensional space can be redundant, since we cannot identify actions that have similar $k$-dimensional representations before pulling them. We conjecture that our upper bound is tight and leave closing the gap as future work.
5. Main Results for Linear RL
We now present the main results for multi-task episodic reinforcement learning under the assumption of low inherent Bellman error (i.e., the multi-task LSVI setting).
In exploration problems in RL where linear value function approximation is employed (Yang and Wang, 2019; Jin et al., 2020; Yang and Wang, 2020), LSVI-based algorithms are usually very effective when the linear value function space is (approximately) closed under the Bellman operator. For example, it has been shown that an LSVI-based algorithm with an additional bonus can solve the exploration challenge effectively in low-rank MDPs (Jin et al., 2020), where the function spaces $\mathcal{Q}_h, \mathcal{Q}_{h+1}$ are exactly closed under the Bellman operator (i.e., any function $Q_{h+1} \in \mathcal{Q}_{h+1}$ composed with the Bellman operator, $\mathcal{T}_hQ_{h+1}$, belongs to $\mathcal{Q}_h$). To relax such strong assumptions, the inherent Bellman error of an MDP (Eqn 1) was proposed to measure how close the function space is to being closed under the Bellman operator (Zanette et al., 2020a). We extend the definition of IBE to the multi-task LSVI setting (Eqn 4), show that our refined confidence set for the least-squares estimator can be applied to the low-rank multi-task LSVI setting, and give an optimism-based algorithm with a sharper regret bound than naively exploring in each task independently.

MTLR-LSVI (Algorithm 2) follows LSVI-based algorithms (Jin et al., 2020; Zanette et al., 2020a) in building an (optimistic) estimator for the optimal value functions. To understand how this works in the multi-task LSVI setting, we first take a glance at how LSVI-based algorithms work in the single-task LSVI setting. In traditional value iteration algorithms, we perform an approximate Bellman backup in episode $t$ for each step $h \in [H]$ on the estimator $\bar Q_{h+1,t-1}$ constructed at the end of episode $t-1$, and find the best approximator of $\mathcal{T}_h(\bar Q_{h+1,t-1})$ in the function space $\mathcal{Q}_h$. Since we assume linear function spaces, we can take the least-squares solution of the empirical Bellman backup on $\bar Q_{h+1,t-1}$ as the best approximator.
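In the single-task case this backup is an ordinary ridge regression with an explicit solution, written out here for concreteness (a standard identity; the multi-task version below replaces it with a rank-constrained problem):

```latex
\hat{\theta}_{h} = \Big(\sum_{j=1}^{t-1} \phi_{hj}\phi_{hj}^{\top} + \lambda I\Big)^{-1}
  \sum_{j=1}^{t-1} \phi_{hj}\Big(R_{hj} + V_{h+1}\big(\bar{\theta}_{h+1}\big)(s_{h+1,j})\Big),
\qquad
V_{h+1}(\theta)(s) = \max_{a}\ \phi(s,a)^{\top}\theta .
```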
In the multi-task framework, given an estimator $Q_{h+1}(\theta^i_{h+1})$ for each $i \in [M]$, to apply such least-squares value iteration to our low-rank multi-task LSVI setting, we use the solution to the following constrained optimization problem
$$\min\ \sum_{i=1}^M\sum_{j=1}^{t-1}\Big((\phi^i_{hj})^\top\theta^i_h - R^i_{hj} - V^i_{h+1}(\theta^i_{h+1})(s^i_{h+1,j})\Big)^2 \tag{12}$$
$$\mathrm{s.t.}\ \theta^1_h, \theta^2_h, \ldots, \theta^M_h\ \text{lie in a } k\text{-dimensional subspace} \tag{13}$$
to approximate the Bellman update in the $t$-th episode, where $\phi^i_{hj} = \phi_h(s^i_{hj}, a^i_{hj})$ is the feature observed at step $h$ in episode $j$ for task $i$, and similarly $R^i_{hj} = R_h(s^i_{hj}, a^i_{hj})$.

To guarantee the optimistic property of our estimator, we follow the global optimization procedure of Zanette et al. (2020a), which solves the following optimization problem in the $t$-th episode.

Algorithm 2 Multi-Task Low-Rank LSVI (MTLR-LSVI)
  Input: low-rank parameter $k$, failure probability $\delta$, regularization $\lambda = 1$, inherent Bellman error $\mathcal{I}$
  Initialize $\tilde V^i_{h1} = \lambda I$ for $h \in [H]$, $i \in [M]$
  for episode $t = 1, 2, \cdots$ do
    Compute $\alpha_{ht}$ for $h \in [H]$ (see Lemma 9)
    Solve the global optimization problem in Definition 1
    Compute $\pi^i_{ht}(s) = \operatorname{argmax}_a \phi(s,a)^\top\bar\theta^i_{ht}$
    Execute $\pi^i_{ht}$ for task $i$ at step $h$
    Collect $\{s^i_{ht}, a^i_{ht}, r(s^i_{ht}, a^i_{ht})\}$ for episode $t$
  end for

Definition 1 (Global Optimization Procedure).
$$\max_{\bar\xi^i_h,\ \hat\theta^i_h,\ \bar\theta^i_h}\ \sum_{i=1}^M \max_{a^i} \big(\phi(s^i_1, a^i)\big)^\top\bar\theta^i_1 \tag{14}$$
$$\mathrm{s.t.}\ \big(\hat\theta^1_h, \ldots, \hat\theta^M_h\big) = \hat B_h\big[\hat w^1_h\ \hat w^2_h\ \cdots\ \hat w^M_h\big] = \operatorname*{argmin}_{\|B_hw^i_h\|_2 \le D}\ \sum_{i=1}^M\sum_{j=1}^{t-1} \mathcal{L}(B_h, w^i_h) \tag{15}$$
$$\bar\theta^i_h = \hat\theta^i_h + \bar\xi^i_h; \qquad \sum_{i=1}^M \big\|\bar\xi^i_h\big\|^2_{\tilde V^i_{ht}(\lambda)} \le \alpha^2_{ht} \tag{16}$$
$$\big(\bar\theta^1_h, \bar\theta^2_h, \cdots, \bar\theta^M_h\big) \in \Theta_h \tag{17}$$
where the empirical least-squares loss is $\mathcal{L}(B_h, w^i_h) \overset{\mathrm{def}}{=} \big((\phi^i_{hj})^\top B_hw^i_h - R^i_{hj} - V^i_{h+1}(\bar\theta^i_{h+1})(s^i_{h+1,j})\big)^2$, and $\tilde V^i_{ht}(\lambda) \overset{\mathrm{def}}{=} \sum_{j=1}^{t-1}(\phi^i_{hj})(\phi^i_{hj})^\top + \lambda I$ is the regularized empirical design matrix for task $i$ in episode $t$.

We have three types of variables in this global optimization problem: $\bar\xi^i_h$, $\hat\theta^i_h$, and $\bar\theta^i_h$. Here $\bar\theta^i_h$ denotes the estimator for $Q^{i*}_h$. We solve for the low-rank least-squares solution of the approximate value iteration and denote the solution by $\hat\theta^i_h$. Instead of adding a bonus term directly on $Q^i_h(\hat\theta^i_h)$ to obtain an optimistic estimate of $Q^{i*}_h$, as in the tabular setting (Azar et al., 2017; Jin et al., 2018) and the linear MDP setting (Jin et al., 2020), we use global variables $\bar\xi^i_h$ to quantify the confidence bonus. This is because we cannot preserve the linearity of our estimator if we add the bonus directly, resulting in an exponential propagation of error. By using $\bar\xi^i_h$ we can instead construct a linear estimator $Q^i_h(\bar\theta^i_h)$ and obtain much smaller regret. A drawback of this global optimization technique is that we only obtain an optimistic estimator at step 1, since values at different states and steps are possibly negatively correlated.
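As with the bandit estimator, the regression in Eqn (15) is a rank-constrained problem with no prescribed solver. The sketch below assembles one backup step at a fixed level $h$, reusing the hypothetical `fit_low_rank` helper from the bandit sketch above and omitting the optimism variables $\bar\xi^i_h$ from Definition 1; it illustrates the data flow only and is not the paper's implementation.

```python
import numpy as np

def lsvi_backup_step(phi_list, R_list, next_feats, theta_next, k, D=1.0):
    """One approximate low-rank Bellman backup at step h (illustrative only).
    phi_list[i]:    d x (t-1) features phi(s^i_{hj}, a^i_{hj}) for task i
    R_list[i]:      (t-1,) rewards R^i_{hj}
    next_feats[i]:  (t-1, |A|, d) features of every action at s^i_{h+1,j}
    theta_next[i]:  (d,) current parameter for step h+1 of task i
    Returns (B_h, W_h), a heuristic solution of the regression in Eqn (15)."""
    M = len(phi_list)
    targets = []
    for i in range(M):
        # Regression target R^i_{hj} + V^i_{h+1}(theta)(s^i_{h+1,j}) with
        # V_{h+1}(theta)(s) = max_a phi(s, a)^T theta.
        v_next = (next_feats[i] @ theta_next[i]).max(axis=1)
        targets.append(R_list[i] + v_next)
    # Same rank-k least squares as in the bandit case, via the alternating
    # minimization heuristic sketched earlier (fit_low_rank).
    B_h, W_h = fit_low_rank(phi_list, targets, k)
    # Rescale so every ||B_h w^i_h||_2 <= D, as Eqn (15) requires.
    scale = max(1.0, np.max(np.linalg.norm(B_h @ W_h, axis=0)) / D)
    return B_h, W_h / scale
```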
Theorem 4. Under Assumptions 3 and 4, with probability $1-\delta$ the regret after $T$ episodes is bounded by
$$\mathrm{Reg}(T) = \tilde{O}\big(HM\sqrt{dkT} + Hd\sqrt{kMT} + HMT\sqrt{d}\,\mathcal{I}\big). \tag{18}$$

Compared to naively executing single-task linear RL algorithms (e.g., the ELEANOR algorithm) on each task without information sharing, which incurs regret $\tilde{O}(HMd\sqrt{T} + HMT\sqrt{d}\,\mathcal{I})$, our regret bound is smaller by a factor of approximately $\sqrt{d/k}$ in our setting where $k \ll d$ and $k \ll M$.

We give a brief explanation of how we improve the regret bound and defer the full analysis to Appendix B. We start with the decomposition of the regret. Let $\bar Q^i_{ht}$ ($\bar V^i_{ht}$) be the solution of the problem in Definition 1 in episode $t$; then
$$\begin{aligned}
\mathrm{Reg}(T) &= \sum_{t=1}^T\sum_{i=1}^M\big(V^{i*}_1 - \bar V^i_{1t} + \bar V^i_{1t} - V^{\pi^i_t}_1\big)(s^i_{1t}) && (19)\\
&\le HMT\,\mathcal{I} \quad \text{(by Lemma 12)} && (20)\\
&\quad + \sum_{t=1}^T\sum_{h=1}^H\sum_{i=1}^M\Big(\big|\bar Q^i_{ht}(s,a) - \mathcal{T}^i_h\bar Q^i_{h+1,t}(s,a)\big| + \zeta^i_{ht}\Big). && (21)
\end{aligned}$$
In (20) we use the optimistic property of $\bar V^i_{1t}$. In (21), $\zeta^i_{ht}$ is a martingale difference (defined in Section B.5) with regard to $\mathcal{F}_{h,t}$, and the dominating term (the first term) is the Bellman error of $\bar Q^i_{ht}$.

For any $\{Q^i_{h+1}\}_{i=1}^M \in \mathcal{Q}_{h+1}$, we can find a group of vectors $\{\dot\theta^i_h(Q^i_{h+1})\}_{i=1}^M \in \Theta_h$ that satisfy $\Delta^i_h(Q^i_{h+1})(s,a) \overset{\mathrm{def}}{=} \mathcal{T}^i_h(Q^i_{h+1})(s,a) - \phi(s,a)^\top\dot\theta^i_h(Q^i_{h+1})$ with approximation error $\|\Delta^i_h(Q^i_{h+1})\|_\infty \le \mathcal{I}$ for each $i \in [M]$. By definition, $\dot\theta^i_h(Q^i_{h+1})$ is the best approximator of $\mathcal{T}^i_h(Q^i_{h+1})$ in the function class $\mathcal{Q}_h$. Since our algorithm is based on least-squares value iteration, a key step is to bound the error of estimating $\dot\theta^i_h(\bar Q^i_{h+1,t})$ ($\dot\theta^i_h$ for short). In the global optimization procedure, we use $\hat\theta^i_h$ to approximate the empirical Bellman backup. In Lemma 9 we show
$$\sum_{i=1}^M \big\|\hat\theta^i_h - \dot\theta^i_h\big\|^2_{\tilde V^i_{ht}(\lambda)} = \tilde{O}\big(Mk + kd + MT\mathcal{I}^2\big). \tag{22}$$
This is the key step leading to the improved regret bound. If we solved each task independently without information sharing, we could only bound the least-squares error in (22) by $\tilde{O}(Md + MT\mathcal{I}^2)$; our bound is much sharper since $k \ll d$ and $k \ll M$.

Using the least-squares error bound in (22), we can show that the dominating term in (21) is bounded by (see Lemma 10 and Section B.5)
$$\sum_{i=1}^M\big|\bar Q^i_{ht}(s,a) - \mathcal{T}^i_h\bar Q^i_{h+1,t}(s,a)\big| \le M\mathcal{I} + \tilde{O}\Big(\sqrt{Mk + kd + MT\mathcal{I}^2}\Big)\cdot\sqrt{\sum_{i=1}^M\big\|\phi(s^i_{ht}, a^i_{ht})\big\|^2_{\tilde V^i_{ht}(\lambda)^{-1}}}. \tag{23}$$
Abbasi-Yadkori et al. (2011, Lemma 11) states that $\sum_{t=1}^T\|\phi(s^i_{ht}, a^i_{ht})\|^2_{\tilde V^i_{ht}(\lambda)^{-1}} = \tilde{O}(d)$ for any $h$ and $i$, so we can finally bound the regret as
$$\mathrm{Reg}(T) = \tilde{O}\Big(HMT\mathcal{I} + H\sqrt{Mk + kd + MT\mathcal{I}^2}\cdot\sqrt{MTd}\Big) = \tilde{O}\Big(HM\sqrt{dkT} + Hd\sqrt{kMT} + HMT\sqrt{d}\,\mathcal{I}\Big),$$
where the first equality is by Cauchy-Schwarz.

We conclude this section with the lower bound for multi-task reinforcement learning with low inherent Bellman error. Our lower bound is derived from the lower bound in the single-task setting; as a byproduct, we also derive a lower bound for misspecified linear RL in the single-task setting. We defer the proof of Theorem 5 to Appendix C.

Theorem 5.
For the construction in Appendix C, the expected regret of any algorithm, when $d, k, H \ge 2$, $|\mathcal{A}| \ge 2$, $M \ge k$, $T = \Omega(d^2H)$, and $\mathcal{I} \le 1/H$, is
$$\Omega\big(Mk\sqrt{HT} + d\sqrt{HkMT} + HMT\sqrt{d}\,\mathcal{I}\big).$$

Careful readers may notice a gap of $\sqrt{H}$ in the first two terms between the upper bound and the lower bound. This gap arises because the confidence set used in the algorithm is intrinsically "Hoeffding-type"; using a "Bernstein-type" confidence set can potentially improve the upper bound by a factor of $\sqrt{H}$. This "Bernstein" technique has been well exploited in many previous results for single-task RL (Azar et al., 2017; Jin et al., 2018; Zhou et al., 2020a). Since our focus is mainly on the benefits of multi-task representation learning, we do not apply this technique, for clarity of the analysis. If we ignore this gap in the dependence on $H$, our upper bound matches this lower bound in the regime where $d \ge M$.
6. Conclusion
In this paper, we study provably sample-efficient representation learning for multi-task linear bandits and linear RL. For linear bandits, we propose an algorithm called MTLR-OFUL, which obtains near-optimal regret in the regime where $d \ge M$. We then extend our algorithm to the multi-task RL setting, and propose a sample-efficient algorithm, MTLR-LSVI.

There are two directions for future investigation. First, our algorithms are statistically sample-efficient, but a computationally efficient implementation is still unknown, although we conjecture that our MTLR-OFUL algorithm can be implemented efficiently. How to design both computationally and statistically efficient algorithms in our multi-task setting is an interesting problem for future research. Second, there remains a gap of $\sqrt{d/k}$ between the regret upper and lower bounds (in the first term). We conjecture that our lower bound is not minimax-optimal and hope to address this problem in future work.

References
Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics, pages 1–9. PMLR, 2012.
Rie Kubota Ando and Tong Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817–1853, 2005.
Sanjeev Arora, Simon S Du, Sham Kakade, Yuping Luo, and Nikunj Saunshi. Provable representation learning for imitation learning via bi-level optimization. arXiv preprint arXiv:2002.10544, 2020.
Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
Alex Ayoub, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, and Lin F Yang. Model-based reinforcement learning with value-targeted regression. arXiv preprint arXiv:2006.01107, 2020.
Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272. PMLR, 2017.
Jonathan Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.
Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
Emma Brunskill and Lihong Li. Sample complexity of multi-task reinforcement learning. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI-13), pages 122–131, 2013.
Alexandra Carpentier and Rémi Munos. Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. In Artificial Intelligence and Statistics, pages 190–198. PMLR, 2012.
Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. 2008.
Carlo D'Eramo, Davide Tateo, Andrea Bonarini, Marcello Restelli, and Jan Peters. Sharing knowledge in multi-task deep reinforcement learning. In International Conference on Learning Representations, 2019.
Simon S Du, Wei Hu, Sham M Kakade, Jason D Lee, and Qi Lei. Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434, 2020.
Botao Hao, Tor Lattimore, and Mengdi Wang. High-dimensional sparse linear bandits. arXiv preprint arXiv:2011.04020, 2020.
Matteo Hessel, Hubert Soyer, Lasse Espeholt, Wojciech Czarnecki, Simon Schmitt, and Hado van Hasselt. Multi-task deep reinforcement learning with popart. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3796–3803, 2019.
Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, pages 1704–1713. PMLR, 2017.
Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? arXiv preprint arXiv:1807.03765, 2018.
Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
Kwang-Sung Jun, Rebecca Willett, Stephen Wright, and Robert Nowak. Bilinear bandits with low-rank structure. In International Conference on Machine Learning, pages 3163–3172. PMLR, 2019.
Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
Tor Lattimore, Koby Crammer, and Csaba Szepesvári. Linear multi-resource allocation with semi-bandit feedback. In NIPS, pages 964–972, 2015.
Tor Lattimore, Csaba Szepesvari, and Gellert Weisz. Learning with good feature representations in bandits and in RL with a generative model. In International Conference on Machine Learning, pages 5662–5670. PMLR, 2020.
Jiayi Li, Hongyan Zhang, Liangpei Zhang, Xin Huang, and Lefei Zhang. Joint collaborative representation with multitask learning for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 52(9):5923–5936, 2014.
Yingkai Li, Yining Wang, and Yuan Zhou. Nearly minimax-optimal regret for linearly parameterized bandits. arXiv preprint arXiv:1904.00242, 2019a.
Yingkai Li, Yining Wang, and Yuan Zhou. Tight regret bounds for infinite-armed linear contextual bandits. arXiv preprint arXiv:1905.01435, 2019b.
Lydia T Liu, Urun Dogan, and Katja Hofmann. Decoding multitask DQN in the world of Minecraft. In The 13th European Workshop on Reinforcement Learning (EWRL) 2016, 2016.
Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504, 2019.
Yangyi Lu, Amirhossein Meisami, and Ambuj Tewari. Low-rank generalized linear bandit problems. arXiv preprint arXiv:2006.02948, 2020.
Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853–2884, 2016.
Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Actor-mimic: Deep multitask and transfer reinforcement learning. arXiv preprint arXiv:1511.06342, 2015.
Bharath Ramsundar, Steven Kearnes, Patrick Riley, Dale Webster, David Konerding, and Vijay Pande. Massively multitask networks for drug discovery. arXiv preprint arXiv:1502.02072, 2015.
Paat Rusmevichientong and John N Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
Matthew E Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(7), 2009.
Yee Teh, Victor Bapst, Wojciech M Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pages 4496–4506, 2017.
Sebastian Thrun and Lorien Pratt. Learning to learn: Introduction and overview. In Learning to Learn, pages 3–17. Springer, 1998.
Nilesh Tripuraneni, Chi Jin, and Michael I Jordan. Provable meta-learning of linear representations. arXiv preprint arXiv:2002.11684, 2020.
Ruosong Wang, Simon S Du, Lin F Yang, and Sham M Kakade. Is long horizon reinforcement learning more difficult than short horizon reinforcement learning? arXiv preprint arXiv:2005.00527, 2020.
Aaron Wilson, Alan Fern, Soumya Ray, and Prasad Tadepalli. Multi-task reinforcement learning: a hierarchical Bayesian approach. In Proceedings of the 24th International Conference on Machine Learning, pages 1015–1022, 2007.
Jiaqi Yang, Wei Hu, Jason D. Lee, and Simon S. Du. Provable benefits of representation learning in linear bandits, 2020.
Lin Yang and Mengdi Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. In International Conference on Machine Learning, pages 10746–10756. PMLR, 2020.
Lin F Yang and Mengdi Wang. Sample-optimal parametric Q-learning using linearly additive features. arXiv preprint arXiv:1902.04779, 2019.
Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, and Emma Brunskill. Learning near optimal policies with low inherent Bellman error. arXiv preprint arXiv:2003.00153, 2020a.
Andrea Zanette, Alessandro Lazaric, Mykel J Kochenderfer, and Emma Brunskill. Provably efficient reward-agnostic navigation with linear value iteration. arXiv preprint arXiv:2008.07737, 2020b.
Zihan Zhang, Xiangyang Ji, and Simon S Du. Is reinforcement learning more difficult than bandits? A near-optimal algorithm escaping the curse of horizon. arXiv preprint arXiv:2009.13503, 2020.
Dongruo Zhou, Quanquan Gu, and Csaba Szepesvari. Nearly minimax optimal reinforcement learning for linear mixture Markov decision processes. arXiv preprint arXiv:2012.08507, 2020a.
Dongruo Zhou, Jiafan He, and Quanquan Gu. Provably efficient reinforcement learning for discounted MDPs with feature mapping. arXiv preprint arXiv:2006.13165, 2020b.

Appendices
A Omitted Proofs in Section 4
A.1 Proof of Lemma 1
Proof.
By the optimality of $\hat B_t$ and $\hat W_t = [\hat w_{t,1}, \cdots, \hat w_{t,M}]$, we know that $\sum_{i=1}^M\|y_{t-1,i} - X^\top_{t-1,i}\hat B_t\hat w_{t,i}\|_2^2 \le \sum_{i=1}^M\|y_{t-1,i} - X^\top_{t-1,i}Bw_i\|_2^2$. Since $y_{t-1,i} = X^\top_{t-1,i}Bw_i + \eta_{t-1,i}$, we have
$$\sum_{i=1}^M\big\|X^\top_{t-1,i}\big(\hat B_t\hat w_{t,i} - Bw_i\big)\big\|_2^2 \le 2\sum_{i=1}^M\eta^\top_{t-1,i}X^\top_{t-1,i}\big(\hat B_t\hat w_{t,i} - Bw_i\big). \tag{24}$$

We first analyse the non-trivial setting where $d \ge 2k$. Note that both $\Theta = BW$ and $\hat\Theta_t = \hat B_t\hat W_t$ are low-rank matrices with rank upper bounded by $k$, which indicates that $\mathrm{rank}(\hat\Theta_t - \Theta) \le 2k$. In that case, we can write $\hat\Theta_t - \Theta = U_tR_t = [U_tr_{t,1}, U_tr_{t,2}, \cdots, U_tr_{t,M}]$, where $U_t \in \mathbb{R}^{d\times 2k}$ is an orthonormal matrix with $\|U_t\|_F = \sqrt{2k}$, and $R_t \in \mathbb{R}^{2k\times M}$ satisfies $\|r_{t,i}\|_2 \le 2\sqrt{k}$. In other words, we can write $\hat B_t\hat w_{t,i} - Bw_i = U_tr_{t,i}$ for certain $U_t$ and $r_{t,i}$.

Define $V_{t-1,i}(\lambda) \overset{\mathrm{def}}{=} (U^\top_tX_{t-1,i})(U^\top_tX_{t-1,i})^\top + \lambda I$. We have:
$$\begin{aligned}
&\sum_{i=1}^M\big\|\hat B_t\hat w_{t,i} - Bw_i\big\|^2_{\tilde V_{t-1,i}(\lambda)} && (25)\\
&= \sum_{i=1}^M\big\|X^\top_{t-1,i}(\hat B_t\hat w_{t,i} - Bw_i)\big\|_2^2 + \sum_{i=1}^M\lambda\big\|\hat B_t\hat w_{t,i} - Bw_i\big\|_2^2 && (26)\\
&\le 2\sum_{i=1}^M\eta^\top_{t-1,i}X^\top_{t-1,i}(\hat B_t\hat w_{t,i} - Bw_i) + 4M\lambda && (27)\\
&= 2\sum_{i=1}^M\eta^\top_{t-1,i}X^\top_{t-1,i}U_tr_{t,i} + 4M\lambda && (28)\\
&\le 2\sum_{i=1}^M\big\|\eta^\top_{t-1,i}X^\top_{t-1,i}U_t\big\|_{V^{-1}_{t-1,i}(\lambda)}\|r_{t,i}\|_{V_{t-1,i}(\lambda)} + 4M\lambda && (29)\\
&\le 2\sqrt{\sum_{i=1}^M\big\|\eta^\top_{t-1,i}X^\top_{t-1,i}U_t\big\|^2_{V^{-1}_{t-1,i}(\lambda)}}\sqrt{\sum_{i=1}^M\|r_{t,i}\|^2_{V_{t-1,i}(\lambda)}} + 4M\lambda && (30)\\
&= 2\sqrt{\sum_{i=1}^M\big\|\eta^\top_{t-1,i}X^\top_{t-1,i}U_t\big\|^2_{V^{-1}_{t-1,i}(\lambda)}}\sqrt{\sum_{i=1}^M\big\|\hat B_t\hat w_{t,i} - Bw_i\big\|^2_{\tilde V_{t-1,i}(\lambda)}} + 4M\lambda && (31)
\end{aligned}$$
Eqn 27 is due to Eqn 24 together with $\|\hat B_t\hat w_{t,i}\|_2 \le 1$ and $\|Bw_i\|_2 \le 1$. Eqn 30 is due to the Cauchy-Schwarz inequality. Eqn 31 follows from
$$\sum_{i=1}^M\big\|\hat B_t\hat w_{t,i} - Bw_i\big\|^2_{\tilde V_{t-1,i}(\lambda)} = \sum_{i=1}^M\|U_tr_{t,i}\|^2_{\tilde V_{t-1,i}(\lambda)} = \sum_{i=1}^M\|r_{t,i}\|^2_{U^\top_t\tilde V_{t-1,i}(\lambda)U_t} = \sum_{i=1}^M\|r_{t,i}\|^2_{V_{t-1,i}(\lambda)}.$$

The main problem is how to bound $\|\eta^\top_{t-1,i}X^\top_{t-1,i}U_t\|_{V^{-1}_{t-1,i}(\lambda)} = \big\|\sum_{n=1}^{t-1}\eta_{n,i}U^\top_tx_{n,i}\big\|_{V^{-1}_{t-1,i}(\lambda)}$. Note that for a fixed $U_t = \bar U$, we can regard $\bar U^\top x_{n,i} \in \mathbb{R}^{2k}$ as the corresponding "action" chosen in step $n$. With this observation, if $U_t$ is fixed, we can bound this term following the arguments of the self-normalized bound for vector-valued martingales (Abbasi-Yadkori et al., 2011).

Lemma 2.
For a fixed $\bar U$, define $\bar V_{t,i}(\lambda) \overset{\mathrm{def}}{=} (\bar U^\top X_{t,i})(\bar U^\top X_{t,i})^\top + \lambda I$. Then for any $\delta > 0$, with probability at least $1-\delta$, for all $t \ge 0$,
$$\sum_{i=1}^M\big\|\bar U^\top X_{t,i}\eta_{t,i}\big\|^2_{\bar V^{-1}_{t,i}(\lambda)} \le 2\log\Bigg(\frac{\prod_{i=1}^M\big(\det(\bar V_{t,i})^{1/2}\det(\lambda I)^{-1/2}\big)}{\delta}\Bigg). \tag{32-33}$$

We defer the proof of Lemma 2 to Appendix A.3. We set $\lambda = 1$. By Lemma 2, we know that for a fixed $\bar U$, with probability at least $1-\delta_2$,
$$\sum_{i=1}^M\Bigg\|\sum_{n=1}^{t-1}\eta_{n,i}\bar U^\top x_{n,i}\Bigg\|^2_{\bar V^{-1}_{t-1,i}(\lambda)} \le 2\log\Bigg(\frac{\prod_{i=1}^M\det(\bar V_{t-1,i}(\lambda))^{1/2}\det(\lambda I)^{-1/2}}{\delta_2}\Bigg) \le 2Mk\log\Big(1 + \frac{T}{2k}\Big) + 2\log(1/\delta_2). \tag{34}$$

The above analysis shows that we can bound $\|\eta^\top_{t-1,i}X^\top_{t-1,i}U_t\|_{V^{-1}_{t-1,i}(\lambda)}$ when $U_t$ is fixed as $\bar U$. Following this idea, we prove the lemma by constructing an $\epsilon$-net over all possible $U_t$. To apply the $\epsilon$-net trick, we need to slightly modify the derivation of Eqn 25. For a fixed matrix $\bar U \in \mathbb{R}^{d\times 2k}$, we have
$$\begin{aligned}
&\sum_{i=1}^M\big\|\hat B_t\hat w_{t,i} - Bw_i\big\|^2_{\tilde V_{t-1,i}(\lambda)} && (35)\\
&\le 2\sum_{i=1}^M\eta^\top_{t-1,i}X^\top_{t-1,i}U_tr_{t,i} + 4M\lambda && (36)\\
&= 2\sum_{i=1}^M\eta^\top_{t-1,i}X^\top_{t-1,i}\bar Ur_{t,i} + 2\sum_{i=1}^M\eta^\top_{t-1,i}X^\top_{t-1,i}(U_t - \bar U)r_{t,i} + 4M\lambda && (37)\\
&\le 2\sum_{i=1}^M\big\|\eta^\top_{t-1,i}X^\top_{t-1,i}\bar U\big\|_{\bar V^{-1}_{t-1,i}(\lambda)}\|r_{t,i}\|_{\bar V_{t-1,i}(\lambda)} + 2\sum_{i=1}^M\eta^\top_{t-1,i}X^\top_{t-1,i}(U_t - \bar U)r_{t,i} + 4M\lambda && (38)\\
&= 2\sum_{i=1}^M\big\|\eta^\top_{t-1,i}X^\top_{t-1,i}\bar U\big\|_{\bar V^{-1}_{t-1,i}(\lambda)}\|r_{t,i}\|_{V_{t-1,i}(\lambda)} + 2\sum_{i=1}^M\eta^\top_{t-1,i}X^\top_{t-1,i}(U_t - \bar U)r_{t,i} && (39)\\
&\quad + 2\sum_{i=1}^M\big\|\eta^\top_{t-1,i}X^\top_{t-1,i}\bar U\big\|_{\bar V^{-1}_{t-1,i}(\lambda)}\big(\|r_{t,i}\|_{\bar V_{t-1,i}(\lambda)} - \|r_{t,i}\|_{V_{t-1,i}(\lambda)}\big) + 4M\lambda && (40)\\
&\le 2\sqrt{\sum_{i=1}^M\big\|\eta^\top_{t-1,i}X^\top_{t-1,i}\bar U\big\|^2_{\bar V^{-1}_{t-1,i}(\lambda)}}\sqrt{\sum_{i=1}^M\big\|\hat B_t\hat w_{t,i} - Bw_i\big\|^2_{\tilde V_{t-1,i}(\lambda)}} + 2\sum_{i=1}^M\eta^\top_{t-1,i}X^\top_{t-1,i}(U_t - \bar U)r_{t,i} && (41)\\
&\quad + 2\sum_{i=1}^M\big\|\eta^\top_{t-1,i}X^\top_{t-1,i}\bar U\big\|_{\bar V^{-1}_{t-1,i}(\lambda)}\big(\|r_{t,i}\|_{\bar V_{t-1,i}(\lambda)} - \|r_{t,i}\|_{V_{t-1,i}(\lambda)}\big) + 4M\lambda && (42)
\end{aligned}$$
Eqns 36, 38, and 41 follow the same idea as Eqns 28, 29, and 31.

We construct an $\epsilon$-net $\mathcal{E}$ in the Frobenius norm over the matrix set $\{U \in \mathbb{R}^{d\times 2k} : \|U\|_F \le \sqrt{2k}\}$. It is not hard to see that $|\mathcal{E}| \le (3\sqrt{2k}/\epsilon)^{2kd}$. By the union bound over all possible $\bar U \in \mathcal{E}$, we know that with probability $1 - |\mathcal{E}|\delta_2$, Eqn 34 holds for every $\bar U \in \mathcal{E}$. For each $U_t$, we choose a $\bar U \in \mathcal{E}$ with $\|U_t - \bar U\|_F \le \epsilon$, and we have
$$2\sqrt{\sum_{i=1}^M\big\|\eta^\top_{t-1,i}X^\top_{t-1,i}\bar U\big\|^2_{\bar V^{-1}_{t-1,i}(\lambda)}} \le 2\sqrt{2Mk\log\Big(1 + \frac{T}{2k}\Big) + 2\log(1/\delta_2)}. \tag{43}$$
Since $\|U_t - \bar U\|_F \le \epsilon$, we have
$$2\sum_{i=1}^M\big\|\eta^\top_{t-1,i}X^\top_{t-1,i}\bar U\big\|_{\bar V^{-1}_{t-1,i}(\lambda)}\big(\|r_{t,i}\|_{\bar V_{t-1,i}(\lambda)} - \|r_{t,i}\|_{V_{t-1,i}(\lambda)}\big) \le 2\sqrt{Mk}\,\epsilon\Big(2Mk\log\Big(1 + \frac{T}{2k}\Big) + 2\log(1/\delta_2)\Big). \tag{44}$$
For the term $2\sum_{i=1}^M\eta^\top_{t-1,i}X^\top_{t-1,i}(U_t - \bar U)r_{t,i}$, the following inequality holds for every step $t \in [T]$ with probability $1 - MT\delta_3$:
$$\begin{aligned}
2\sum_{i=1}^M\eta^\top_{t-1,i}X^\top_{t-1,i}(U_t - \bar U)r_{t,i} &\le 2\sum_{i=1}^M\|\eta_{t-1,i}\|_2\big\|X^\top_{t-1,i}(U_t - \bar U)r_{t,i}\big\|_2 && (45)\\
&\le 4\sum_{i=1}^M\|\eta_{t-1,i}\|_2\sqrt{kT}\,\epsilon && (46)\\
&\le 4M\sqrt{2\log(2/\delta_3)}\,\sqrt{k}\,T\epsilon. && (47)
\end{aligned}$$
The last inequality follows from the fact that $|\eta_{n,i}| \le \sqrt{2\log(2/\delta_3)}$ with probability $1-\delta_3$ for fixed $n, i$, together with a union bound over $n \in [t-1]$, $i \in [M]$. Plugging Eqns 43, 44, and 45 back into Eqn 41, the following inequality holds for any $t \in [T]$ with probability at least $1 - |\mathcal{E}|\delta_2 - MT\delta_3$:
$$\begin{aligned}
\sum_{i=1}^M\big\|\hat B_t\hat w_{t,i} - Bw_i\big\|^2_{\tilde V_{t-1,i}(\lambda)} &\le 2\sqrt{2Mk\log\Big(1 + \frac{T}{2k}\Big) + 2\log(1/\delta_2)}\sqrt{\sum_{i=1}^M\big\|\hat B_t\hat w_{t,i} - Bw_i\big\|^2_{\tilde V_{t-1,i}(\lambda)}} && (48\text{-}49)\\
&\quad + 4M\sqrt{2\log(2/\delta_3)}\sqrt{k}\,T\epsilon + 2\sqrt{Mk}\,\epsilon\Big(2Mk\log\Big(1 + \frac{T}{2k}\Big) + 2\log(1/\delta_2)\Big) + 4M\lambda. && (50)
\end{aligned}$$
Solving the above inequality (it is quadratic in the square root of the left-hand side), we obtain
$$\sum_{i=1}^M\big\|\hat B_t\hat w_{t,i} - Bw_i\big\|^2_{\tilde V_{t-1,i}(\lambda)} \le 32\Big(Mk\log\Big(1 + \frac{T}{2k}\Big) + \log(1/\delta_2)\Big) + 4M\sqrt{2\log(2/\delta_3)}\sqrt{k}\,T\epsilon + 4\sqrt{Mk}\,\epsilon\Big(2Mk\log\Big(1 + \frac{T}{2k}\Big) + 2\log(1/\delta_2)\Big) + 8M\lambda. \tag{51\text{-}52}$$
Setting $\lambda = 1$, $\epsilon = \frac{1}{kMT}$, $\delta_2 = \delta/(2|\mathcal{E}|)$, and $\delta_3 = \delta/(4MT)$, the following inequality holds with probability $1-\delta$:
$$\sum_{i=1}^M\big\|\hat B_t\hat w_{t,i} - Bw_i\big\|^2_{\tilde V_{t-1,i}(\lambda)} \le L \overset{\mathrm{def}}{=} 48\big(Mk\log(kMT) + 5kd\log(kMT)\big) + 32\log(4MT) + 76\log(1/\delta). \tag{53}$$

At last, we address the remaining setting where $k < d < 2k$. In this case, we can write $\hat\Theta_t - \Theta = R_t$ where $R_t \in \mathbb{R}^{d\times M}$. The proof then follows the same framework as the case $d \ge 2k$, except that we do not need to consider $U_t$ or construct an $\epsilon$-net over all possible $U_t$. It is not hard to show that $\sum_{i=1}^M\|\hat B_t\hat w_{t,i} - Bw_i\|^2_{\tilde V_{t-1,i}(\lambda)} \le 24(Md + 2\log(Tk/\delta))$ in this case, which is also less than $L$ since $d < 2k$.

A.2 Proof of Theorem 1

With Lemma 1, we are ready to prove Theorem 1.
Proof.
Let $\tilde V_{t,i}(\lambda) = X_{t,i}X^\top_{t,i} + \lambda I_d$ for some $\lambda > 0$. Then
$$\begin{aligned}
\mathrm{Reg}(T) &= \sum_{t=1}^T\sum_{i=1}^M\big\langle\theta_i, x^*_{t,i} - x_{t,i}\big\rangle && (54)\\
&\le \sum_{t=1}^T\sum_{i=1}^M\big\langle\tilde\theta_{t,i} - \theta_i, x_{t,i}\big\rangle && (55)\\
&= \sum_{t=1}^T\sum_{i=1}^M\big\langle\tilde\theta_{t,i} - \hat\theta_{t,i} + \hat\theta_{t,i} - \theta_i, x_{t,i}\big\rangle && (56)\\
&\le \sum_{t=1}^T\sum_{i=1}^M\Big(\big\|\tilde\theta_{t,i} - \hat\theta_{t,i}\big\|_{\tilde V_{t-1,i}(\lambda)} + \big\|\hat\theta_{t,i} - \theta_i\big\|_{\tilde V_{t-1,i}(\lambda)}\Big)\|x_{t,i}\|_{\tilde V_{t-1,i}(\lambda)^{-1}} && (57)\\
&\le \sum_{t=1}^T\Bigg(\sqrt{\sum_{i=1}^M\big\|\tilde\theta_{t,i} - \hat\theta_{t,i}\big\|^2_{\tilde V_{t-1,i}(\lambda)}} + \sqrt{\sum_{i=1}^M\big\|\hat\theta_{t,i} - \theta_i\big\|^2_{\tilde V_{t-1,i}(\lambda)}}\Bigg)\cdot\sqrt{\sum_{i=1}^M\|x_{t,i}\|^2_{\tilde V_{t-1,i}(\lambda)^{-1}}} && (58)\\
&\le 2\sqrt{T(L + 4\lambda M)}\cdot\sqrt{\sum_{i=1}^M\sum_{t=1}^T\|x_{t,i}\|^2_{\tilde V_{t-1,i}(\lambda)^{-1}}}, && (59)
\end{aligned}$$
where the first inequality is due to $\sum_{i=1}^M\langle\theta_i, x^*_{t,i}\rangle \le \sum_{i=1}^M\langle\tilde\theta_{t,i}, x_{t,i}\rangle$ from the optimistic choice of $\tilde\theta_{t,i}$ and $x_{t,i}$, and the last step applies Cauchy-Schwarz over $t$. By Lemma 11 of Abbasi-Yadkori et al. (2011), as long as $\lambda \ge 1$,
$$\sum_{t=1}^T\|x_{t,i}\|^2_{\tilde V_{t-1,i}(\lambda)^{-1}} \le 2\log\frac{\det(\tilde V_{T,i}(\lambda))}{\det(\lambda I_d)} \le 2d\log\Big(1 + \frac{T}{\lambda d}\Big). \tag{60}$$
Therefore, choosing $\lambda = 1$, we can finally bound the regret:
$$\begin{aligned}
\mathrm{Reg}(T) &\le 2\sqrt{T(L + 4M)}\cdot\sqrt{\sum_{i=1}^M\sum_{t=1}^T\|x_{t,i}\|^2_{\tilde V_{t-1,i}(\lambda)^{-1}}} && (61)\\
&\le 2\sqrt{T(L + 4M)}\cdot\sqrt{2Md\log\Big(1 + \frac{T}{d}\Big)} && (62)\\
&= \tilde{O}\big(M\sqrt{dkT} + d\sqrt{kMT}\big). && (63)
\end{aligned}$$

A.3 Proof of Lemma 2

The proof of Lemma 2 follows a similar idea to Theorem 1 of Abbasi-Yadkori et al. (2011). We consider the $\sigma$-algebra $\mathcal{F}_t = \sigma(\{x_{1,i}\}_{i=1}^M, \cdots, \{x_{t+1,i}\}_{i=1}^M, \{\eta_{1,i}\}_{i=1}^M, \cdots, \{\eta_{t,i}\}_{i=1}^M)$, so that $\{x_{t,i}\}_{i=1}^M$ is $\mathcal{F}_{t-1}$-measurable and $\{\eta_{t,i}\}_{i=1}^M$ is $\mathcal{F}_t$-measurable.

Define $\bar x_{t,i} = \bar U^\top x_{t,i}$ and $S_{t,i} = \sum_{n=1}^t\bar U^\top x_{n,i}\eta_{n,i}$. Let
$$M_t(Q) = \exp\Bigg(\sum_{n=1}^t\sum_{i=1}^M\Big[\eta_{n,i}\langle q_i, \bar x_{n,i}\rangle - \frac{1}{2}\langle q_i, \bar x_{n,i}\rangle^2\Big]\Bigg), \qquad Q = [q_1, \cdots, q_M] \in \mathbb{R}^{2k\times M}. \tag{64}$$

Lemma 3.
Lemma 3. Let $\tau$ be a stopping time w.r.t. the filtration $\{F_t\}_{t=0}^\infty$. Then $M_\tau(Q)$ is almost surely well-defined and $E[M_\tau(Q)] \le 1$.

Proof. Let $D_t(Q) = \exp\left(\sum_{i=1}^M\left[\eta_{t,i}\langle q_i,\bar x_{t,i}\rangle - \frac12\langle q_i,\bar x_{t,i}\rangle^2\right]\right)$. By the sub-Gaussianity of $\eta_{t,i}$, we have

$$E\left[\exp\left(\eta_{t,i}\langle q_i,\bar x_{t,i}\rangle - \frac12\langle q_i,\bar x_{t,i}\rangle^2\right)\ \Big|\ F_{t-1}\right] \le 1. \qquad (65)$$

Then we have $E[D_t(Q)\mid F_{t-1}] \le 1$.
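To spell out (65) from the definition of 1-sub-Gaussianity alone (a standard moment-generating-function step, stated here for completeness): conditioned on $F_{t-1}$ the quantity $\langle q_i,\bar x_{t,i}\rangle$ is fixed, so

$$E\left[\exp\left(\eta_{t,i}\langle q_i,\bar x_{t,i}\rangle\right)\mid F_{t-1}\right] \le \exp\left(\frac{\langle q_i,\bar x_{t,i}\rangle^2}{2}\right)\ \Longrightarrow\ E\left[\exp\left(\eta_{t,i}\langle q_i,\bar x_{t,i}\rangle - \frac{\langle q_i,\bar x_{t,i}\rangle^2}{2}\right)\Big|\ F_{t-1}\right] \le 1.$$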
Further,

$$E[M_t(Q)\mid F_{t-1}] = E[D_1(Q)\cdots D_{t-1}(Q)\,D_t(Q)\mid F_{t-1}] \qquad (66)$$
$$= D_1(Q)\cdots D_{t-1}(Q)\,E[D_t(Q)\mid F_{t-1}] \le M_{t-1}(Q) \qquad (67)$$

This shows that $\{M_t(Q)\}_{t=0}^\infty$ is a supermartingale and $E[M_t(Q)] \le 1$. Next we argue that $M_\tau(Q)$ is almost surely well-defined. By the convergence theorem for nonnegative supermartingales, $M_\infty(Q) = \lim_{t\to\infty} M_t(Q)$ is almost surely well-defined. Therefore, $M_\tau(Q)$ is indeed well-defined independently of whether $\tau < \infty$ or not. Let $W_t(Q) = M_{\min\{\tau,t\}}(Q)$ be a stopped version of $(M_t(Q))_t$. By Fatou's Lemma, $E[M_\tau(Q)] = E[\liminf_{t\to\infty} W_t(Q)] \le \liminf_{t\to\infty} E[W_t(Q)] \le 1.$
This shows that $E[M_\tau(Q)] \le 1$, which completes the proof of Lemma 3. With Lemma 3 in hand, we can now bound $\sum_{i=1}^M \|S_{t,i}\|^2_{\bar V_{t,i}^{-1}(\lambda)}$.
Lemma 4. Let $\tau$ be a stopping time w.r.t. the filtration $\{F_t\}_{t=0}^\infty$. Then, for $\delta > 0$, with probability $1-\delta$,

$$\sum_{i=1}^M \|S_{\tau,i}\|^2_{\bar V_{\tau,i}^{-1}(\lambda)} \le 2\log\left(\frac{\prod_{i=1}^M \left(\det(\bar V_{\tau,i})^{1/2}\det(\lambda I)^{-1/2}\right)}{\delta}\right). \qquad (68)$$

Proof.
For each $i \in [M]$, let $\Lambda_i$ be an $\mathbb{R}^k$-valued Gaussian random variable which is independent of all the other random variables and whose covariance is $\lambda^{-1}I$. Define $M_t = E[M_t([\Lambda_1,\cdots,\Lambda_M])\mid F_\infty]$. We still have $E[M_\tau] = E\left[E[M_\tau([\Lambda_1,\cdots,\Lambda_M])\mid \{\Lambda_i\}_{i=1}^M]\right] \le 1$. Define $M_{t,i}(q_i) \overset{\mathrm{def}}{=} \exp\left(\sum_{n=1}^t\left[\eta_{n,i}\langle q_i,\bar x_{n,i}\rangle - \frac12\langle q_i,\bar x_{n,i}\rangle^2\right]\right)$; then we have $M_t = E\left[\prod_{i=1}^M M_{t,i}(\Lambda_i)\mid F_\infty\right] = \prod_{i=1}^M E[M_{t,i}(\Lambda_i)\mid F_\infty]$, where the second equality is due to the fact that $\{M_{t,i}(\Lambda_i)\}_{i=1}^M$ are mutually independent given $F_\infty$. We only need to calculate $E[M_{t,i}(\Lambda_i)\mid F_\infty]$ for each $i \in [M]$.

Following the proof of Lemma 9 in Abbasi-Yadkori et al. (2011), we know that

$$E[M_{t,i}(\Lambda_i)\mid F_\infty] = \left(\frac{\det(\lambda I)}{\det(\bar V_{t,i})}\right)^{1/2}\exp\left(\frac12\|S_{t,i}\|^2_{\bar V_{t,i}^{-1}(\lambda)}\right). \qquad (69)$$

Then we have

$$M_t = \left(\prod_{i=1}^M\left(\frac{\det(\lambda I)}{\det(\bar V_{t,i})}\right)^{1/2}\right)\exp\left(\frac12\sum_{i=1}^M\|S_{t,i}\|^2_{\bar V_{t,i}^{-1}(\lambda)}\right). \qquad (70)$$
Since $E[M_\tau] \le 1$, we have

$$\Pr\left[\sum_{i=1}^M \|S_{\tau,i}\|^2_{\bar V_{\tau,i}^{-1}(\lambda)} > 2\log\left(\frac{\prod_{i=1}^M\left(\det(\bar V_{\tau,i})^{1/2}\det(\lambda I)^{-1/2}\right)}{\delta}\right)\right]$$
$$= \Pr\left[\frac{\exp\left(\frac12\sum_{i=1}^M \|S_{\tau,i}\|^2_{\bar V_{\tau,i}^{-1}(\lambda)}\right)}{\delta^{-1}\prod_{i=1}^M\left(\det(\bar V_{\tau,i})^{1/2}\det(\lambda I)^{-1/2}\right)} > 1\right]$$
$$\le E\left[\frac{\exp\left(\frac12\sum_{i=1}^M \|S_{\tau,i}\|^2_{\bar V_{\tau,i}^{-1}(\lambda)}\right)}{\delta^{-1}\prod_{i=1}^M\left(\det(\bar V_{\tau,i})^{1/2}\det(\lambda I)^{-1/2}\right)}\right] = E[M_\tau]\,\delta \le \delta.$$

Proof. (Proof of Lemma 2) The only remaining issue is the stopping time construction. Define the bad event

$$B_t(\delta) \overset{\mathrm{def}}{=} \left\{\omega\in\Omega:\ \sum_{i=1}^M \|S_{t,i}\|^2_{\bar V_{t,i}^{-1}(\lambda)} > 2\log\left(\frac{\prod_{i=1}^M\left(\det(\bar V_{t,i})^{1/2}\det(\lambda I)^{-1/2}\right)}{\delta}\right)\right\} \qquad (71)$$

Consider the stopping time $\tau(\omega) = \min\{t \ge 0:\ \omega \in B_t(\delta)\}$; we have $\bigcup_{t\ge 0} B_t(\delta) = \{\omega: \tau(\omega) < \infty\}$. By Lemma 4, we have

$$\Pr\left[\bigcup_{t\ge 0} B_t(\delta)\right] = \Pr[\tau < \infty] \qquad (72)$$
$$= \Pr\left[\sum_{i=1}^M \|S_{\tau,i}\|^2_{\bar V_{\tau,i}^{-1}(\lambda)} > 2\log\left(\frac{\prod_{i=1}^M\left(\det(\bar V_{\tau,i})^{1/2}\det(\lambda I)^{-1/2}\right)}{\delta}\right),\ \tau < \infty\right] \qquad (73)$$
$$\le \Pr\left[\sum_{i=1}^M \|S_{\tau,i}\|^2_{\bar V_{\tau,i}^{-1}(\lambda)} > 2\log\left(\frac{\prod_{i=1}^M\left(\det(\bar V_{\tau,i})^{1/2}\det(\lambda I)^{-1/2}\right)}{\delta}\right)\right] \qquad (74)$$
$$\le \delta. \qquad (75)$$
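As a quick numerical sanity check of Lemma 4 (not part of the proof), the following sketch simulates $M$ independent tasks with standard normal noise, which is 1-sub-Gaussian, and estimates how often the self-normalized bound of Eqn. 68 is violated at the fixed time $\tau = T$. All names and parameter values are ours, for illustration only.

```python
# Empirical check of Lemma 4: the violation rate of Eqn. 68 should be <= delta.
import numpy as np

rng = np.random.default_rng(0)
M, k, T, lam, delta, trials = 5, 3, 200, 1.0, 0.05, 500

violations = 0
for _ in range(trials):
    total, log_det_ratio = 0.0, 0.0
    for _ in range(M):
        X = rng.normal(size=(T, k)) / np.sqrt(k)   # features  x̄_{t,i}
        eta = rng.normal(size=T)                   # 1-sub-Gaussian noise η_{t,i}
        S = X.T @ eta                              # S_{T,i} = Σ_t x̄_{t,i} η_{t,i}
        V = X.T @ X + lam * np.eye(k)              # V̄_{T,i}(λ)
        total += S @ np.linalg.solve(V, S)         # ||S_{T,i}||²_{V̄⁻¹}
        _, ld = np.linalg.slogdet(V)
        log_det_ratio += 0.5 * (ld - k * np.log(lam))  # log det(V̄)^{1/2}/det(λI)^{1/2}
    bound = 2 * (log_det_ratio + np.log(1 / delta))    # right-hand side of Eqn. 68
    violations += total > bound
print(f"empirical violation rate: {violations / trials:.3f} (should be <= {delta})")
```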
A.4 Proof of Theorem 2

Proof. The proof follows the same idea as that for Theorem 1. The only difference is that, in our setting, we have $y_{t,i} = x_{t,i}^\top B w_i + \eta_{t,i} + \Delta_{t,i}$, where $\theta_i = Bw_i$ is the best approximator for task $i \in [M]$ such that $\left|E[y_i\mid x_i] - \langle x_i, \dot B\dot w_i\rangle\right| \le \zeta$, and $|\Delta_{t,i}| \le \zeta$. Define $\Delta_{t-1,i} = [\Delta_{1,i}, \Delta_{2,i},\cdots,\Delta_{t-1,i}]^\top$. Similarly, by the optimality of $\hat B_t$ and $\hat W_t = [\hat w_{t,1},\cdots,\hat w_{t,M}]$, we know that $\sum_{i=1}^M \left\|y_{t-1,i} - X_{t-1,i}^\top \hat B_t\hat w_{t,i}\right\|^2 \le \sum_{i=1}^M \left\|y_{t-1,i} - X_{t-1,i}^\top Bw_i\right\|^2$. Since $y_{t-1,i} = X_{t-1,i}^\top Bw_i + \eta_{t-1,i} + \Delta_{t-1,i}$, we thus have

$$\sum_{i=1}^M \left\|X_{t-1,i}^\top\left(\hat B_t\hat w_{t,i} - Bw_i\right)\right\|^2 \qquad (76)$$
$$\le 2\sum_{i=1}^M \eta_{t-1,i}^\top X_{t-1,i}^\top\left(\hat B_t\hat w_{t,i} - Bw_i\right) + 2\sum_{i=1}^M \Delta_{t-1,i}^\top X_{t-1,i}^\top\left(\hat B_t\hat w_{t,i} - Bw_i\right) \qquad (77)$$
$$\le 2\sum_{i=1}^M \eta_{t-1,i}^\top X_{t-1,i}^\top\left(\hat B_t\hat w_{t,i} - Bw_i\right) + 2\sum_{i=1}^M \left\|X_{t-1,i}\Delta_{t-1,i}\right\|_{\tilde V_{t-1,i}^{-1}(\lambda)}\left\|\hat B_t\hat w_{t,i} - Bw_i\right\|_{\tilde V_{t-1,i}(\lambda)} \qquad (78)$$
$$\le 2\sum_{i=1}^M \eta_{t-1,i}^\top X_{t-1,i}^\top\left(\hat B_t\hat w_{t,i} - Bw_i\right) + 2\sum_{i=1}^M \sqrt T\,\zeta\left\|\hat B_t\hat w_{t,i} - Bw_i\right\|_{\tilde V_{t-1,i}(\lambda)} \qquad (79)$$
$$\le 2\sum_{i=1}^M \eta_{t-1,i}^\top X_{t-1,i}^\top\left(\hat B_t\hat w_{t,i} - Bw_i\right) + 2\sqrt{MT}\,\zeta\sqrt{\sum_{i=1}^M\left\|\hat B_t\hat w_{t,i} - Bw_i\right\|^2_{\tilde V_{t-1,i}(\lambda)}} \qquad (80)$$

The third inequality follows from the Projection Bound (Lemma 8) in Zanette et al. (2020a). The first term of Eqn. 80 has the same form as Eqn. 24. Following the same proof idea as Lemma 1, we know that with probability $1-\delta$,

$$\sum_{i=1}^M \left\|\hat B_t\hat w_{t,i} - Bw_i\right\|^2_{V_{t-1,i}(\lambda)} \qquad (81)$$
$$\le \left(4\sqrt{2Mk + 8kd\log(kMT/\delta)} + 2\sqrt{MT}\,\zeta\right)\sqrt{\sum_{i=1}^M\left\|\hat B_t\hat w_{t,i} - Bw_i\right\|^2_{V_{t-1,i}(\lambda)}} + 4M + 4\sqrt{\log(4MT/\delta)} \qquad (82)$$

Solving for $\sum_{i=1}^M\left\|\hat B_t\hat w_{t,i} - Bw_i\right\|^2_{V_{t-1,i}(\lambda)}$, we know that the true parameter $BW$ is always contained in the confidence set, i.e.

$$\sum_{i=1}^M\left\|\hat B_t\hat w_{t,i} - Bw_i\right\|^2_{V_{t-1,i}(\lambda)} \le L', \qquad (83)$$

where $L' = 2L + 32MT\zeta^2$. Thus we have

$$\mathrm{Reg}(T) = \sum_{t=1}^T\sum_{i=1}^M\left(y^*_{t,i} - y_{t,i}\right) \qquad (84)$$
$$\le 2MT\zeta + \sum_{t=1}^T\sum_{i=1}^M\left\langle\theta_i,\ x^*_{t,i} - x_{t,i}\right\rangle \qquad (85)$$
$$\le 2MT\zeta + \sum_{t=1}^T\sum_{i=1}^M\left\langle\tilde\theta_{t,i} - \theta_i,\ x_{t,i}\right\rangle \qquad (86)$$
$$= 2MT\zeta + \sum_{t=1}^T\sum_{i=1}^M\left\langle\tilde\theta_{t,i} - \hat\theta_{t,i} + \hat\theta_{t,i} - \theta_i,\ x_{t,i}\right\rangle \qquad (87)$$
$$\le 2MT\zeta + \sum_{t=1}^T\sum_{i=1}^M\left(\left\|\tilde\theta_{t,i}-\hat\theta_{t,i}\right\|_{\tilde V_{t-1,i}(\lambda)} + \left\|\hat\theta_{t,i}-\theta_i\right\|_{\tilde V_{t-1,i}(\lambda)}\right)\|x_{t,i}\|_{\tilde V_{t-1,i}(\lambda)^{-1}} \qquad (88)$$
$$\le 2MT\zeta + \left(\sqrt{\sum_{t=1}^T\sum_{i=1}^M\left\|\tilde\theta_{t,i}-\hat\theta_{t,i}\right\|^2_{\tilde V_{t-1,i}(\lambda)}} + \sqrt{\sum_{t=1}^T\sum_{i=1}^M\left\|\hat\theta_{t,i}-\theta_i\right\|^2_{\tilde V_{t-1,i}(\lambda)}}\right)\cdot\sqrt{\sum_{t=1}^T\sum_{i=1}^M\|x_{t,i}\|^2_{\tilde V_{t-1,i}(\lambda)^{-1}}} \qquad (89)$$
$$\le 2MT\zeta + 2\sqrt{T(L' + 4\lambda M)}\cdot\sqrt{\sum_{i=1}^M\sum_{t=1}^T\|x_{t,i}\|^2_{\tilde V_{t-1,i}(\lambda)^{-1}}} \qquad (90)$$
$$\le 2MT\zeta + 2\sqrt{T(L' + 4\lambda M)}\cdot\sqrt{2Md\log\left(1+\frac{T}{d}\right)} \qquad (91)$$
$$= \tilde O\left(M\sqrt{dkT} + d\sqrt{kMT} + MT\sqrt{d}\,\zeta\right), \qquad (92)$$

where the second inequality is due to $\sum_{i=1}^M\langle\theta_i, x^*_{t,i}\rangle \le \sum_{i=1}^M\langle\tilde\theta_{t,i}, x_{t,i}\rangle$ from the optimistic choice of $\tilde\theta_{t,i}$ and $x_{t,i}$.
The third inequality is due to Eqn. 83, and the last inequality is from Eqn. 60.

A.5 Proof of Theorem 3

Since our setting is strictly harder than the setting of multi-task linear bandits with infinite arms in Yang et al. (2020), we can prove the following lemma directly from their Theorem 4 by reduction.
Lemma 5.
Under the setting of Theorem 3, the regret of any algorithm $\mathcal{A}$ is lower bounded by $\Omega\left(Mk\sqrt T + d\sqrt{kMT}\right)$.

In order to prove Theorem 3, we only need to show that the following lemma is true.
Lemma 6.
Under the setting of Theorem 3, the regret of any algorithm $\mathcal{A}$ is lower bounded by $\Omega\left(MT\sqrt d\,\zeta\right)$.

Proof. (Proof of Lemma 6) To prove Lemma 6, we leverage the lower bound for misspecified linear bandits in the single-task setting. We restate the following lemma from the previous literature with a slight modification of notation.
Lemma 7. (Proposition 6 in Zanette et al. (2020a)). There exists a feature map $\phi: \mathcal{A}\to\mathbb{R}^d$ that defines a misspecified linear bandit class $\mathcal{M}$ such that every bandit instance in that class has reward response $\mu_a = \phi_a^\top\theta + z_a$ for any action $a$ (here $z_a \in [0,\zeta]$ is the deviation from linearity and $\mu_a \in [0,1)$), and such that the expected regret of any algorithm on at least one member of the class up to round $T$ is $\Omega(\sqrt d\,\zeta T)$.

Suppose $M$ is exactly divisible by $k$; we construct the following instances to prove Lemma 6. We divide the $M$ tasks into $k$ groups, where each group shares the same parameter $\theta_i$. To be more specific, we let $w_1 = w_2 = \cdots = w_{M/k} = e_1$, $w_{M/k+1} = w_{M/k+2} = \cdots = w_{2M/k} = e_2$, $\cdots$, $w_{(k-1)M/k+1} = w_{(k-1)M/k+2} = \cdots = w_M = e_k$. Under this construction, the parameters $\theta_i$ are exactly the same within each group, but mutually independent across different groups. That is to say, the expected regret lower bound is at least the summation of the regret lower bounds over all $k$ groups.

Now we consider the regret lower bound for group $j \in [k]$. Since the parameters are shared within the same group, the regret of running an algorithm for $M/k$ tasks with $T$ steps each is at least the regret of running an algorithm for a single-task linear bandit with $M/k\cdot T$ steps. By Lemma 7, the regret for a single-task linear bandit with $MT/k$ steps is at least $\Omega\left(\sqrt d\,\zeta\, MT/k\right)$. Summing over all $k$ groups, we can prove that the regret lower bound is $\Omega\left(\sqrt d\,\zeta\, MT\right)$.

Combining Lemma 5 and Lemma 6, we complete the proof of Theorem 3.
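Spelling out the final summation of the argument above (our own restatement, with $c$ the absolute constant hidden in Lemma 7):

$$\mathrm{Reg}(T)\ \ge\ \sum_{j=1}^{k} c\,\sqrt d\,\zeta\cdot\frac{MT}{k}\ =\ c\,\sqrt d\,\zeta\, MT\ =\ \Omega\left(MT\sqrt d\,\zeta\right).$$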
B Proof of Theorem 4

B.1 Definitions and First Step Analysis
Before presenting the proof of Theorem 4, we make a first-step analysis of the low-rank least-squares estimator in Equation 12.

For any $\{Q^i_{h+1}\}_{i=1}^M \in \mathcal{Q}_{h+1}$, there exists $\{\dot\theta^i_h(Q^i_{h+1})\}_{i=1}^M \in \Theta_h$ such that

$$\Delta^i_h\left(Q^i_{h+1}\right)(s,a) = \mathcal{T}^i_h\left(Q^i_{h+1}\right)(s,a) - \phi(s,a)^\top\dot\theta^i_h\left(Q^i_{h+1}\right) \qquad (93)$$

where the approximation error satisfies $\left\|\Delta^i_h\left(Q^i_{h+1}\right)\right\|_\infty \le \mathcal{I}$ for each $i\in[M]$. We also use $\dot B_h\dot w^i_h\left(Q^i_{h+1}\right)$ in place of $\dot\theta^i_h\left(Q^i_{h+1}\right)$ in the following sections, since we can write $\dot\theta^i_h$ as $\dot B_h\dot w^i_h$ according to Assumption 3.

In the multi-task low-rank least-squares regression (Equation 12), we are actually trying to recover $\dot\theta^i_h$. However, due to the noise and the representation error (i.e., the inherent Bellman error), we can only obtain an approximate solution $\hat\theta^i_h = \hat B_h\hat w^i_h$ (see the global optimization problem in Definition 1):

$$\left(\hat\theta^1_h, \ldots, \hat\theta^M_h\right) = \hat B_h\left[\hat w^1_h\ \hat w^2_h\ \cdots\ \hat w^M_h\right] \qquad (94)$$
$$= \operatorname*{argmin}_{\|B_h w^i_h\|_2\le D}\ \sum_{i=1}^M\sum_{j=1}^{t-1}\left(\phi\left(s^i_{hj},a^i_{hj}\right)^\top B_h w^i_h - R\left(s^i_{hj},a^i_{hj}\right) - \max_a Q^i_{h+1}\left(s^i_{h+1,j},a\right)\right)^2 \qquad (95)$$
$$= \operatorname*{argmin}_{\|B_h w^i_h\|_2\le D}\ \sum_{i=1}^M\sum_{j=1}^{t-1}\left(\phi\left(s^i_{hj},a^i_{hj}\right)^\top B_h w^i_h - \mathcal{T}^i_h\left(Q^i_{h+1}\right)\left(s^i_{hj},a^i_{hj}\right) - z^i_{hj}\left(Q^i_{h+1}\right)\left(s^i_{hj},a^i_{hj}\right)\right)^2 \qquad (96)$$

where $z^i_{hj}\left(Q^i_{h+1}\right)\left(s^i_{hj},a^i_{hj}\right) \overset{\mathrm{def}}{=} R\left(s^i_{hj},a^i_{hj}\right) + \max_a Q^i_{h+1}\left(s^i_{h+1,j},a\right) - \mathcal{T}^i_h\left(Q^i_{h+1}\right)\left(s^i_{hj},a^i_{hj}\right)$.

Define $\Phi^i_{ht}\in\mathbb{R}^{(t-1)\times d}$ to be the collection of linear features up to episode $t-1$ for task $i$, i.e. the $j$-th row of $\Phi^i_{ht}$ is $\phi\left(s^i_{hj},a^i_{hj}\right)^\top$. Let $Y^i_{ht}\in\mathbb{R}^{t-1}$ be the vector whose $j$-th entry is $\mathcal{T}^i_h\left(Q^i_{h+1}\right)\left(s^i_{hj},a^i_{hj}\right) + z^i_{hj}\left(Q^i_{h+1}\right)\left(s^i_{hj},a^i_{hj}\right)$. Then the objective in (96) can be written as

$$\operatorname*{argmin}_{\|B_h w^i_h\|_2\le D}\ \sum_{i=1}^M\left\|\Phi^i_{ht}B_h w^i_h - Y^i_{ht}\right\|^2 \qquad (97)$$

Therefore, we have

$$\sum_{i=1}^M\left\|\Phi^i_{ht}\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - Y^i_{ht}\right\|^2 \le \sum_{i=1}^M\left\|\Phi^i_{ht}\dot B_h\dot w^i_h\left(Q^i_{h+1}\right) - Y^i_{ht}\right\|^2 \qquad (98)$$

which implies

$$\sum_{i=1}^M\left\|\Phi^i_{ht}\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \Phi^i_{ht}\dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right\|^2 \qquad (99)$$
$$\le 2\sum_{i=1}^M\left(\Delta^i_{ht}\right)^\top\Phi^i_{ht}\left(\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right) \qquad (100)$$
$$\quad + 2\sum_{i=1}^M\left(z^i_{ht}\right)^\top\Phi^i_{ht}\left(\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right) \qquad (101)$$

where $\Delta^i_{ht} \overset{\mathrm{def}}{=} \left[\Delta^i_h\left(Q^i_{h+1}\right)\left(s^i_{h1},a^i_{h1}\right),\ \Delta^i_h\left(Q^i_{h+1}\right)\left(s^i_{h2},a^i_{h2}\right),\ \cdots,\ \Delta^i_h\left(Q^i_{h+1}\right)\left(s^i_{h,t-1},a^i_{h,t-1}\right)\right]^\top\in\mathbb{R}^{t-1}$ and $z^i_{ht} \overset{\mathrm{def}}{=} \left[z^i_{h1}\left(Q^i_{h+1}\right)\left(s^i_{h1},a^i_{h1}\right),\ \cdots,\ z^i_{h,t-1}\left(Q^i_{h+1}\right)\left(s^i_{h,t-1},a^i_{h,t-1}\right)\right]^\top\in\mathbb{R}^{t-1}$.

In the next sections we will show how to bound (100) and (101).

B.2 Failure Event

Define the failure event at step $h$ in episode $t$ as follows.

Definition 2 (Failure Event).
$$E_{ht} \overset{\mathrm{def}}{=} \mathbb{I}\Bigg[\exists\,\{Q^i_{h+1}\}_{i=1}^M\in\mathcal{Q}_{h+1}:\ \sum_{i=1}^M\left(z^i_{ht}\right)^\top\Phi^i_{ht}\left(\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right) > \qquad (102)$$
$$\qquad F^1_h\sqrt{\sum_{i=1}^M\left\|\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right\|^2_{V^i_{ht}(\lambda)}} + F^2_h\Bigg] \qquad (103)$$

where $F^1_h$ and $F^2_h$ will be specified later.

We have the following lemma to bound the probability of $E_{ht}$.
Lemma 8. For the input parameter $\delta > 0$, there exist $F^1_h$ and $F^2_h$ such that

$$P\left(\bigcup_{t=1}^T\bigcup_{h=1}^H E_{ht}\right) \le \frac{\delta}{2}$$

Proof.
According to Lemma A.5 of Du et al. (2020), there exists an $\epsilon'$-net $E^o_{h+1}$ over $O_{d\times k}$ (with respect to the Frobenius norm) such that $|E^o_{h+1}| \le (6\sqrt k/\epsilon')^{kd}$. Moreover, there exists an $\epsilon'$-net $E^b_{h+1}$ over $\mathbb{B}^k$ such that $|E^b_{h+1}| \le (1+2/\epsilon')^k$. We can thus construct a corresponding $\epsilon'$-net $E^{\mathrm{mul}}_{h+1} \overset{\mathrm{def}}{=} E^o_{h+1}\times\left(E^b_{h+1}\right)^M$ over $\Theta_{h+1}$.

For any $\left(Q^1_{h+1}\left(B_{h+1}w^1_{h+1}\right),\cdots,Q^M_{h+1}\left(B_{h+1}w^M_{h+1}\right)\right)\in\mathcal{Q}_{h+1}$, there exist $\bar B_{h+1}\in E^o_{h+1}$ and $\left(\bar w^1_{h+1},\cdots,\bar w^M_{h+1}\right)\in\left(E^b_{h+1}\right)^M$ such that

$$\left\|B_{h+1}-\bar B_{h+1}\right\|_F \le \epsilon',\qquad \left\|w^i_{h+1}-\bar w^i_{h+1}\right\| \le \epsilon',\quad \forall i\in[M]$$

Therefore,

$$\left\|B_{h+1}w^i_{h+1} - \bar B_{h+1}\bar w^i_{h+1}\right\| \le 2\epsilon',\quad \forall i\in[M]$$

Define $\bar Q^i_{h+1}$ to be $Q^i_{h+1}\left(\bar B_{h+1}\bar w^i_{h+1}\right)$, and let $\bar z^i_{ht} \overset{\mathrm{def}}{=} \left[z^i_{h1}\left(\bar Q^i_{h+1}\right)\left(s^i_{h1},a^i_{h1}\right),\ \cdots,\ z^i_{h,t-1}\left(\bar Q^i_{h+1}\right)\left(s^i_{h,t-1},a^i_{h,t-1}\right)\right]^\top\in\mathbb{R}^{t-1}$; then

$$\sum_{i=1}^M\left(z^i_{ht}\right)^\top\Phi^i_{ht}\left(\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right) \qquad (105)$$
$$= \sum_{i=1}^M\left(\bar z^i_{ht}\right)^\top\Phi^i_{ht}\left(\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right) \qquad (106)$$
$$\quad + \sum_{i=1}^M\left(z^i_{ht} - \bar z^i_{ht}\right)^\top\Phi^i_{ht}\left(\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right) \qquad (107)$$

For fixed $\{\bar B_{h+1}\bar w^i_{h+1}\}_{i=1}^M\in E^{\mathrm{mul}}_{h+1}$, $z^i_{hj}\left(\bar Q^i_{h+1}\right)\left(s^i_{hj},a^i_{hj}\right)$ is zero-mean and 1-sub-Gaussian conditioned on $F_{h,j}$ according to Assumption 4. Thus, we can use exactly the same argument as in Lemma 1 to show that

$$\sum_{i=1}^M\left(\bar z^i_{ht}\right)^\top\Phi^i_{ht}\left(\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right) \qquad (108)$$
$$\le \sqrt{2Mk + 5kd\log(kMT) + 2\log(1/\delta')}\ \sqrt{\sum_{i=1}^M\left\|\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right\|^2_{V^i_{ht}(\lambda)}} \qquad (109)$$
$$\quad + \sqrt{2\log(MT/\delta')} + \sqrt{2k + 3kd\log(kMT) + \log(1/\delta')} \qquad (110)$$

by setting $\epsilon = \frac{1}{kMT^2}$, $\delta_1 = \delta'\left(\frac{\epsilon}{6\sqrt k}\right)^{kd}$, and $\delta_2 = \frac{\delta'}{MT}$ in Equation 50. Thus, with probability $1-\delta'$ the inequality above holds for any $h\in[H]$, $t\in[T]$.
Take $\delta' = \frac{\delta}{2|E^{\mathrm{mul}}_{h+1}|}$; by a union bound we know that the above inequality holds with probability $1-\delta/2$ for any $\{\bar B_{h+1}\bar w^i_{h+1}\}_{i=1}^M\in E^{\mathrm{mul}}_{h+1}$ and any $h\in[H]$, $t\in[T]$.

Since it holds that $\left|Q^i_{h+1}\left(B_{h+1}w^i_{h+1}\right)(s,a) - Q^i_{h+1}\left(\bar B_{h+1}\bar w^i_{h+1}\right)(s,a)\right| \le 2\epsilon'$ for any $(s,a)\in\mathcal{S}\times\mathcal{A}$ and $i\in[M]$, we have

$$\left|z^i_{hj}\left(\bar Q^i_{h+1}\right)\left(s^i_{hj},a^i_{hj}\right) - z^i_{hj}\left(Q^i_{h+1}\right)\left(s^i_{hj},a^i_{hj}\right)\right| \le 4\epsilon' \qquad (111)$$

Then we have

$$\sum_{i=1}^M\left(z^i_{ht}-\bar z^i_{ht}\right)^\top\Phi^i_{ht}\left(\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right) \qquad (112)$$
$$\le \sum_{i=1}^M\left\|\left(\Phi^i_{ht}\right)^\top\left(z^i_{ht}-\bar z^i_{ht}\right)\right\|_{\tilde V^i_{ht}(\lambda)^{-1}}\left\|\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right\|_{\tilde V^i_{ht}(\lambda)} \qquad (113)$$
$$\le 4\epsilon'\sqrt T\sum_{i=1}^M\left\|\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right\|_{\tilde V^i_{ht}(\lambda)} \qquad (114)$$
$$\le 4\epsilon'\sqrt{MT}\sqrt{\sum_{i=1}^M\left\|\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right\|^2_{V^i_{ht}(\lambda)}} \qquad (115)$$

for arbitrary $\{Q^i_{h+1}\}$ and any $h\in[H]$, $t\in[T]$. The second inequality follows from the Projection Bound (Lemma 8) in Zanette et al. (2020a).

Take $\epsilon' = 1/\sqrt{MT}$; we finally finish the proof by setting

$$F^1_h \overset{\mathrm{def}}{=} \sqrt{2Mk + 5kd\log(kMT) + 5Mk\log(MT) + 2\log(2/\delta)} + 4 \qquad (116)$$
$$F^2_h \overset{\mathrm{def}}{=} \sqrt{5kd\log(kMT) + 5Mk\log(MT) + 2\log(2/\delta)} \qquad (117)$$
$$\qquad + \sqrt{2k + 5kd\log(kMT) + 2Mk\log(MT) + \log(2/\delta)} \qquad (118)$$

In the next sections we assume that the failure event $\bigcup_{t=1}^T\bigcup_{h=1}^H E_{ht}$ does not happen.

B.3 Bellman Error

Outside the failure event, we can bound the estimation error of the least-squares regression (Equation 12).
Lemma 9.
For any episode $t\in[T]$, step $h\in[H]$, and any $\{Q^i_{h+1}\}_{i=1}^M\in\mathcal{Q}_{h+1}$, we have

$$\sum_{i=1}^M\left\|\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right\|^2_{V^i_{ht}(\lambda)} \le \alpha_{ht} \overset{\mathrm{def}}{=} \left(2\sqrt{MT}\,\mathcal{I} + 2F^1_h + \sqrt{2F^2_h + 4MD^2\lambda}\right)^2 \qquad (119)$$

Proof.
Recall that

$$\sum_{i=1}^M\left\|\Phi^i_{ht}\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \Phi^i_{ht}\dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right\|^2 \qquad (120)$$
$$\le 2\sum_{i=1}^M\left(\Delta^i_{ht}\right)^\top\Phi^i_{ht}\left(\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right) \qquad (121)$$
$$\quad + 2\sum_{i=1}^M\left(z^i_{ht}\right)^\top\Phi^i_{ht}\left(\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right) \qquad (122)$$

For the first term, we have

$$\sum_{i=1}^M\left(\Delta^i_{ht}\right)^\top\Phi^i_{ht}\left(\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right) \qquad (123)$$
$$\le \sum_{i=1}^M\left\|\left(\Phi^i_{ht}\right)^\top\Delta^i_{ht}\right\|_{\tilde V^i_{ht}(\lambda)^{-1}}\left\|\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right\|_{\tilde V^i_{ht}(\lambda)} \qquad (124)$$
$$\le \sqrt T\,\mathcal{I}\sum_{i=1}^M\left\|\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right\|_{\tilde V^i_{ht}(\lambda)} \qquad (125)$$
$$\le \sqrt{MT}\,\mathcal{I}\sqrt{\sum_{i=1}^M\left\|\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right\|^2_{V^i_{ht}(\lambda)}} \qquad (126)$$

The second inequality follows from the Projection Bound (Lemma 8) in Zanette et al. (2020a), and the last inequality is due to Cauchy–Schwarz.

Outside the failure event, we have

$$\sum_{i=1}^M\left\|\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right\|^2_{V^i_{ht}(\lambda)} \qquad (127)$$
$$\le \sum_{i=1}^M\left\|\Phi^i_{ht}\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \Phi^i_{ht}\dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right\|^2 + 4MD^2\lambda \qquad (128)$$
$$\le \left(2\sqrt{MT}\,\mathcal{I} + 2F^1_h\right)\sqrt{\sum_{i=1}^M\left\|\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right\|^2_{V^i_{ht}(\lambda)}} + 2F^2_h + 4MD^2\lambda \qquad (129)$$

which implies

$$\sum_{i=1}^M\left\|\hat B_h\hat w^i_h\left(Q^i_{h+1}\right) - \dot B_h\dot w^i_h\left(Q^i_{h+1}\right)\right\|^2_{V^i_{ht}(\lambda)} \qquad (130)$$
$$\le \left(2\sqrt{MT}\,\mathcal{I} + 2F^1_h\right)^2 + 2F^2_h + 4MD^2\lambda + \left(2\sqrt{MT}\,\mathcal{I} + 2F^1_h\right)\sqrt{2F^2_h + 4MD^2\lambda} \qquad (131)$$
$$\le \left(2\sqrt{MT}\,\mathcal{I} + 2F^1_h + \sqrt{2F^2_h + 4MD^2\lambda}\right)^2 \qquad (132)$$

Lemma 10 (Bound on Bellman Error). Outside the failure event, for any feasible solution $\{Q^i_h(\bar\theta^i_h)\}_{i,h}$ ($\bar Q^i_h$ for short, with a slight abuse of notation) of the global optimization procedure in Definition 1, for any $(s,a)\in\mathcal{S}\times\mathcal{A}$ and any $h\in[H]$, $t\in[T]$,

$$\sum_{i=1}^M\left|\bar Q^i_h(s,a) - \mathcal{T}^i_h\bar Q^i_{h+1}(s,a)\right| \le M\mathcal{I} + 2\sqrt{\alpha_{ht}\cdot\sum_{i=1}^M\|\phi(s,a)\|^2_{V^i_{ht}(\lambda)^{-1}}} \qquad (133)$$

Proof.
$$\sum_{i=1}^M\left|\bar Q^i_h(s,a) - \mathcal{T}^i_h\bar Q^i_{h+1}(s,a)\right| = \sum_{i=1}^M\left|\phi(s,a)^\top\bar\theta^i_h - \phi(s,a)^\top\dot\theta^i_h\left(\bar Q^i_{h+1}\right) - \Delta^i_h\left(\bar Q^i_{h+1}\right)(s,a)\right| \qquad (134)$$
$$\le M\mathcal{I} + \sum_{i=1}^M\left|\phi(s,a)^\top\bar\theta^i_h - \phi(s,a)^\top\dot\theta^i_h\left(\bar Q^i_{h+1}\right)\right| \qquad (135)$$
$$\le M\mathcal{I} + \sum_{i=1}^M\left(\left|\phi(s,a)^\top\dot\theta^i_h\left(\bar Q^i_{h+1}\right) - \phi(s,a)^\top\hat\theta^i_h\right| + \left|\phi(s,a)^\top\hat\theta^i_h - \phi(s,a)^\top\bar\theta^i_h\right|\right) \qquad (136)$$
$$\le M\mathcal{I} + \sum_{i=1}^M\|\phi(s,a)\|_{\tilde V^i_{ht}(\lambda)^{-1}}\left(\left\|\dot\theta^i_h\left(\bar Q^i_{h+1}\right) - \hat\theta^i_h\right\|_{\tilde V^i_{ht}(\lambda)} + \left\|\hat\theta^i_h - \bar\theta^i_h\right\|_{\tilde V^i_{ht}(\lambda)}\right) \qquad (137)$$
$$\le M\mathcal{I} + 2\sqrt{\alpha_{ht}\cdot\sum_{i=1}^M\|\phi(s,a)\|^2_{V^i_{ht}(\lambda)^{-1}}} \qquad (138)$$

The first equality is due to the definition of $\Delta^i_h\left(\bar Q^i_{h+1}\right)(s,a)$. The last inequality is due to Lemma 9.

B.4 Optimism

We can find the "best" approximator of the optimal value functions in our function class, recursively defined as

$$\left(\theta^{1*}_h, \theta^{2*}_h, \cdots, \theta^{M*}_h\right) \overset{\mathrm{def}}{=} \operatorname*{argmin}_{(\theta^1_h,\cdots,\theta^M_h)\in\Theta_h}\ \sup_{s,a,i}\left|\phi(s,a)^\top\theta^i_h - \mathcal{T}^i_h Q^i_{h+1}\left(\theta^{i*}_{h+1}\right)(s,a)\right| \qquad (139)$$

with $\theta^{i*}_{H+1} = 0$, $\forall i\in[M]$.

For the accuracy of this best approximator, we have
Lemma 11. For any $h\in[H]$,

$$\sup_{(s,a)\in\mathcal{S}\times\mathcal{A},\, i\in[M]}\left|Q^{i*}_h(s,a) - \phi(s,a)^\top\theta^{i*}_h\right| \le (H-h+1)\,\mathcal{I}$$

where $Q^{i*}_h$ is the optimal value function for task $i$. This lemma is derived directly from Lemma 6 in Zanette et al. (2020a).

For our solution of the problem in Definition 1 in episode $t$, we have the following lemma:

Lemma 12. $\left\{\left(\theta^{1*}_h,\theta^{2*}_h,\cdots,\theta^{M*}_h\right)\right\}_{h=1}^H$ is a feasible solution of the problem in Definition 1. Moreover, denoting the solution of the problem in Definition 1 in episode $t$ by $\bar\theta^i_{ht}$ for $h\in[H]$, $i\in[M]$, it holds that

$$\sum_{i=1}^M V^i_1\left(\bar\theta^i_{1t}\right)\left(s^i_{1t}\right) \ge \sum_{i=1}^M V^{i*}_1\left(s^i_{1t}\right) - MH\mathcal{I} \qquad (140)$$

Proof.
First we show that $\{(\theta^{1*}_h,\cdots,\theta^{M*}_h)\}_{h=1}^H$ is a feasible solution. We can construct $\{\bar\xi^i_h\}_{i=1}^M$ so that $\bar\theta^i_h = \theta^{i*}_h$ and no constraints are violated. We use an inductive construction; the base case $\bar\theta^i_{H+1} = \theta^{i*}_{H+1} = 0$ is trivial.

Now suppose we have $\{\bar\xi^i_y\}_{i=1}^M$ for $y = h+1,\ldots,H$ such that $\bar\theta^i_y = \theta^{i*}_y$ for $y = h+1,\ldots,H$ and $i\in[M]$; we show that we can find $\{\bar\xi^i_h\}_{i=1}^M$ such that $\bar\theta^i_h = \theta^{i*}_h$ for $i\in[M]$ and no constraints are violated. From the definition of $\theta^{i*}_h$ we can set (with a slight abuse of notation)

$$\dot\theta^i_h\left(\theta^{i*}_{h+1}\right) = \theta^{i*}_h \qquad (141)$$

According to Lemma 9 we have

$$\sum_{i=1}^M\left\|\hat\theta^i_h\left(\theta^{i*}_{h+1}\right) - \dot\theta^i_h\left(\theta^{i*}_{h+1}\right)\right\|^2_{V^i_{ht}(\lambda)} \le \alpha_{ht} \qquad (142)$$

Therefore, setting $\bar\xi^i_h = \dot\theta^i_h\left(\theta^{i*}_{h+1}\right) - \hat\theta^i_h\left(\theta^{i*}_{h+1}\right)$, we obtain

$$\bar\theta^i_h = \hat\theta^i_h\left(\bar\theta^i_{h+1}\right) + \bar\xi^i_h \qquad (143)$$
$$= \hat\theta^i_h\left(\theta^{i*}_{h+1}\right) + \dot\theta^i_h\left(\theta^{i*}_{h+1}\right) - \hat\theta^i_h\left(\theta^{i*}_{h+1}\right) \qquad (144)$$
$$= \theta^{i*}_h \qquad (145)$$

Finally, we can verify $(\bar\theta^1_h,\ldots,\bar\theta^M_h)\in\Theta_h$ from $(\theta^{1*}_h,\cdots,\theta^{M*}_h)\in\Theta_h$.

Since $\bar\theta^i_{1t}$ is the optimal solution, we can finish the proof by showing

$$\sum_{i=1}^M V^i_1\left(\bar\theta^i_{1t}\right)\left(s^i_{1t}\right) = \sum_{i=1}^M\max_a\phi\left(s^i_{1t},a\right)^\top\bar\theta^i_{1t} \qquad (146)$$
$$\ge \sum_{i=1}^M\max_a\phi\left(s^i_{1t},a\right)^\top\theta^{i*}_1\quad\text{(since $\theta^{i*}_1$ is a feasible solution)} \qquad (147)$$
$$\ge \sum_{i=1}^M\phi\left(s^i_{1t},\pi^{i*}\left(s^i_{1t}\right)\right)^\top\theta^{i*}_1 \qquad (148)$$
$$\ge \sum_{i=1}^M Q^{i*}_1\left(s^i_{1t},\pi^{i*}\left(s^i_{1t}\right)\right) - MH\mathcal{I}\quad\text{(by Lemma 11)} \qquad (149)$$
$$\ge \sum_{i=1}^M V^{i*}_1\left(s^i_{1t}\right) - MH\mathcal{I} \qquad (150)$$

B.5 Regret Bound

We are ready to present the proof of our regret bound. From Lemma 8 we know that the failure event $\bigcup_{t=1}^T\bigcup_{h=1}^H E_{ht}$ happens with probability at most $\delta/2$, so we assume it does not happen.
Then we can decompose the regret as

$$\mathrm{Reg}(T) = \sum_{t=1}^T\sum_{i=1}^M\left(V^{i*}_1 - V^{\pi_{it}}_1\right)\left(s^i_{1t}\right) \qquad (151)$$
$$= \sum_{t=1}^T\sum_{i=1}^M\left(V^{i*}_1 - V^i_1\left(\bar\theta^i_{1t}\right)\right)\left(s^i_{1t}\right) + \sum_{t=1}^T\sum_{i=1}^M\left(V^i_1\left(\bar\theta^i_{1t}\right) - V^{\pi_{it}}_1\right)\left(s^i_{1t}\right) \qquad (152)$$
$$\le \sum_{t=1}^T\sum_{i=1}^M\left(V^i_1\left(\bar\theta^i_{1t}\right) - V^{\pi_{it}}_1\right)\left(s^i_{1t}\right) + MHT\mathcal{I}\quad\text{(by Lemma 12)} \qquad (153)$$

Let $a^i_{ht} = \pi_{it}\left(s^i_{ht}\right)$, and denote $Q^i_h\left(\bar\theta^i_{ht}\right)$ (resp. $V^i_h\left(\bar\theta^i_{ht}\right)$) by $\bar Q^i_{ht}$ (resp. $\bar V^i_{ht}$) for short. We have

$$\sum_{i=1}^M\left(\bar V^i_{ht} - V^{\pi_{it}}_h\right)\left(s^i_{ht}\right) = \sum_{i=1}^M\left(\bar Q^i_{ht} - Q^{\pi_{it}}_h\right)\left(s^i_{ht},a^i_{ht}\right) \qquad (154)$$
$$= \sum_{i=1}^M\left(\bar Q^i_{ht} - \mathcal{T}^i_h\bar Q^i_{h+1,t}\right)\left(s^i_{ht},a^i_{ht}\right) + \sum_{i=1}^M\left(\mathcal{T}^i_h\bar Q^i_{h+1,t} - Q^{\pi_{it}}_h\right)\left(s^i_{ht},a^i_{ht}\right) \qquad (155)$$
$$\le M\mathcal{I} + 2\sqrt{\alpha_{ht}\cdot\sum_{i=1}^M\left\|\phi\left(s^i_{ht},a^i_{ht}\right)\right\|^2_{V^i_{ht}(\lambda)^{-1}}} + \sum_{i=1}^M E_{s'\sim p^i_h\left(s^i_{ht},a^i_{ht}\right)}\left[\left(\bar V^i_{h+1,t} - V^{\pi_{it}}_{h+1}\right)(s')\right] \qquad (156)$$
$$\le \sum_{i=1}^M\left(\bar V^i_{h+1,t} - V^{\pi_{it}}_{h+1}\right)\left(s^i_{h+1,t}\right) + M\mathcal{I} + 2\sqrt{\alpha_{ht}\cdot\sum_{i=1}^M\left\|\phi\left(s^i_{ht},a^i_{ht}\right)\right\|^2_{V^i_{ht}(\lambda)^{-1}}} + \sum_{i=1}^M\zeta^i_{ht} \qquad (157)$$

where $\zeta^i_{ht}$ is a martingale difference with regard to the filtration $F_{h,t}$, defined as

$$\zeta^i_{ht} \overset{\mathrm{def}}{=} \left(\bar V^i_{h+1,t} - V^{\pi_{it}}_{h+1}\right)\left(s^i_{h+1,t}\right) - E_{s'\sim p^i_h\left(s^i_{ht},a^i_{ht}\right)}\left[\left(\bar V^i_{h+1,t} - V^{\pi_{it}}_{h+1}\right)(s')\right] \qquad (158)$$

According to Assumption 4 we know $\left|\zeta^i_{ht}\right| \le 4$,
so we can apply the Azuma–Hoeffding inequality: with probability $1-\delta/2$, for any $t\in[T]$ and $i\in[M]$,

$$\sum_{j=1}^t\zeta^i_{hj} \le \sqrt{32\,t\ln\left(\frac{4T}{\delta}\right)} \qquad (159)$$

By applying Inequality 157 recursively, we can bound the regret as

$$\mathrm{Reg}(T) \le \sum_{t=1}^T\sum_{i=1}^M\left(\bar V^i_{1t} - V^{\pi_{it}}_1\right)\left(s^i_{1t}\right) + MHT\mathcal{I} \qquad (160)$$
$$\le 2MHT\mathcal{I} + \sum_{t=1}^T\sum_{h=1}^H 2\sqrt{\alpha_{ht}\cdot\sum_{i=1}^M\left\|\phi\left(s^i_{ht},a^i_{ht}\right)\right\|^2_{V^i_{ht}(\lambda)^{-1}}} + \sum_{i=1}^M\sum_{h=1}^H\sum_{t=1}^T\zeta^i_{ht} \qquad (161)$$

The last inequality is due to $\bar V^i_{H+1}(s) = \max_a\phi(s,a)^\top\bar\theta^i_{H+1,t} = 0$ and $V^{\pi_{it}}_{H+1}(s) = 0$.

Lemma 11 of Abbasi-Yadkori et al. (2011) gives that for any $i\in[M]$ and $h\in[H]$,

$$\sum_{t=1}^T\left\|\phi\left(s^i_{ht},a^i_{ht}\right)\right\|^2_{V^i_{ht}(\lambda)^{-1}} = \tilde O(d) \qquad (162)$$

Moreover, by the definition of $\alpha_{ht}$ (see Lemma 9) we know that for any $h\in[H]$ and $t\in[T]$,

$$\alpha_{ht} = \tilde O\left(Mk + kd + MT\mathcal{I}^2\right) \qquad (163)$$

Putting all of the above together, we can show the final regret bound:

$$\mathrm{Reg}(T) \le 2MHT\mathcal{I} + \sum_{t=1}^T\sum_{h=1}^H 2\sqrt{\alpha_{ht}\cdot\sum_{i=1}^M\left\|\phi\left(s^i_{ht},a^i_{ht}\right)\right\|^2_{V^i_{ht}(\lambda)^{-1}}} + \sum_{i=1}^M\sum_{h=1}^H\sum_{t=1}^T\zeta^i_{ht} \qquad (164)$$
$$= \tilde O\left(MHT\mathcal{I}\right) + \tilde O\left(\sqrt{Mk + kd + MT\mathcal{I}^2}\right)\sum_{h=1}^H\sum_{t=1}^T\sqrt{\sum_{i=1}^M\left\|\phi\left(s^i_{ht},a^i_{ht}\right)\right\|^2_{V^i_{ht}(\lambda)^{-1}}} + MH\sqrt T \qquad (165)$$
$$= \tilde O\left(MHT\mathcal{I}\right) + \tilde O\left(\sqrt{Mk + kd + MT\mathcal{I}^2}\right)\sum_{h=1}^H\sqrt T\cdot\sqrt{\sum_{t=1}^T\sum_{i=1}^M\left\|\phi\left(s^i_{ht},a^i_{ht}\right)\right\|^2_{V^i_{ht}(\lambda)^{-1}}} + MH\sqrt T \qquad (166)$$
$$= \tilde O\left(MHT\mathcal{I} + \tilde O\left(\sqrt{Mk + kd + MT\mathcal{I}^2}\right)\cdot H\sqrt{MTd} + MH\sqrt T\right) \qquad (167)$$
$$= \tilde O\left(HM\sqrt{dkT} + Hd\sqrt{MkT} + HMT\sqrt d\,\mathcal{I}\right) \qquad (168)$$
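For completeness, the simplification from Eqn. 167 to Eqn. 168 uses the elementary inequality $\sqrt{a+b+c} \le \sqrt a + \sqrt b + \sqrt c$ (a brief restatement of the last step, not a new argument):

$$\sqrt{Mk + kd + MT\mathcal{I}^2}\cdot H\sqrt{MTd}\ \le\ H\left(\sqrt{Mk}+\sqrt{kd}+\sqrt{MT}\,\mathcal{I}\right)\sqrt{MTd}\ =\ HM\sqrt{dkT} + Hd\sqrt{MkT} + HMT\sqrt d\,\mathcal{I}.$$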
C Proof of Theorem 5

To prove the lower bound for multi-task RL, our idea is to connect the lower bound for the multi-task learning problem to the lower bound in the single-task LSVI setting (Zanette et al., 2020a). In the paper of Zanette et al. (2020a), the feature dimension is allowed to vary across steps, denoted $d_h$ for step $h$, and the lower bound for linear RL in that setting is $\Omega\left(\sum_{h=1}^H d_h\sqrt T + \sum_{h=1}^H\sqrt{d_h}\,\mathcal{I}T\right)$. However, this lower bound is derived via a hard instance with $d_1 = \sum_{h=2}^H d_h$. If we set $d_1 = d_2 = \cdots = d_H = d$ as in our setting, we can only obtain a lower bound of $\Omega\left(d\sqrt T + \sqrt d\,\mathcal{I}T\right)$ following their proof idea. In fact, the dependence on $H$ in this lower bound can be further improved. In order to obtain a tighter lower bound, we consider the lower bound for single-task misspecified linear MDPs. This setting can be shown to be strictly simpler than the LSVI setting, following the idea of Proposition 3 in Zanette et al. (2020a). The lower bound for misspecified linear MDPs can thus be applied to the LSVI setting.

C.1 Lower Bounds for Single-task RL

This subsection focuses on the lower bound for the misspecified linear MDP setting, in which the transition kernel and the reward function are assumed to be approximately linear.
Assumption 6. (Assumption B in Jin et al. (2020)) For any $\zeta \le 1$, we say that an MDP $(\mathcal{S},\mathcal{A},p,r,H)$ is a $\zeta$-approximate linear MDP with a feature map $\phi:\mathcal{S}\times\mathcal{A}\to\mathbb{R}^d$ if for any $h\in[H]$, there exist $d$ unknown measures $\theta_h = (\theta^{(1)}_h,\cdots,\theta^{(d)}_h)$ over $\mathcal{S}$ and an unknown vector $\nu_h\in\mathbb{R}^d$ such that for any $(s,a)\in\mathcal{S}\times\mathcal{A}$, we have

$$\left\|p_h(\cdot\mid s,a) - \langle\phi(s,a),\theta_h(\cdot)\rangle\right\|_{\mathrm{TV}} \le \zeta \qquad (169)$$
$$\left|r_h(s,a) - \langle\phi(s,a),\nu_h\rangle\right| \le \zeta \qquad (170)$$

For regularity, we assume that Assumption 4 still holds, and we also assume that there exists a constant $D \ge 1$ such that $\|\theta_h(s)\| \le D$ for all $s\in\mathcal{S}$, $h\in[H]$, and $\|\nu_h\| \le D$ for all $h\in[H]$.
Proposition 1. Suppose $T \ge d^2H$, $d \ge 8$, $H \ge 2$ and $\zeta \le 1/H$. There exists a $\zeta$-approximate linear MDP class such that the expected regret of any algorithm on at least one member of the class is at least $\Omega\left(d\sqrt{HT} + HT\zeta\sqrt d\right)$.

To prove the lower bound, our basic idea is to connect the problem to $H$ linear bandit problems. A similar hard instance construction has been used in Zhou et al. (2020a,b). In our construction, the state space $\mathcal{S}$ consists of $H+2$ states, denoted $x_1, x_2, \cdots, x_{H+2}$. The agent starts the episode in state $x_1$. From $x_h$, it can either transit to $x_{h+1}$ or to $x_{H+2}$ with certain transition probabilities. If the agent enters $x_{H+2}$, it will stay in this state for the remaining steps, i.e. $x_{H+2}$ is an absorbing state. For each state, there are $2^{d-3}$ actions and $\mathcal{A} = \{-1,1\}^{d-3}$. Suppose the agent takes action $a\in\{-1,1\}^{d-3}$ in state $x_h$; the transition probabilities to $x_{h+1}$ and $x_{H+2}$ are $1 - \zeta_h(a) - \delta - \mu_h^\top a$ and $\delta + \zeta_h(a) + \mu_h^\top a$ respectively. Here $|\zeta_h(a)| \le \zeta$ denotes the approximation error of the linear representation, $\delta = 1/H$, and $\mu_h\in\{-\Delta,\Delta\}^{d-3}$ with $\Delta = \sqrt{\delta/T}/(4\sqrt 2)$ so that the probabilities are well-defined. The reward can only be obtained in $x_{H+2}$, with $r_h(x_{H+2},a) = 1/H$ for any $h, a$. We assume the reward to be deterministic.

We can check that this construction satisfies Assumption 6 with $\phi$ and $\theta$ defined in the following way:

$$\phi(s,a) = \begin{cases}\left(0,\ \alpha,\ \alpha\delta,\ 0,\ \beta a^\top\right)^\top & s = x_1, x_2, \cdots, x_H\\ \left(0,\ 0,\ 0,\ \alpha,\ \mathbf{0}^\top\right)^\top & s = x_{H+1}\\ \left(\alpha,\ 0,\ 0,\ \alpha,\ \mathbf{0}^\top\right)^\top & s = x_{H+2}\end{cases}$$

$$\theta_h(s') = \begin{cases}\left(0,\ \frac{1}{\alpha},\ -\frac{1}{\alpha},\ 0,\ -\frac{\mu_h^\top}{\beta}\right)^\top & s' = x_{h+1}\\ \left(0,\ 0,\ \frac{1}{\alpha},\ \frac{1}{\alpha},\ \frac{\mu_h^\top}{\beta}\right)^\top & s' = x_{H+2}\\ \mathbf{0} & \text{otherwise}\end{cases}$$

$\nu_h$ is defined to be $\left(\frac{1}{H\alpha},\ \mathbf{0}^\top\right)^\top$, with $\alpha = 1/\sqrt{2+\Delta^2(d-3)}$ and $\beta = \Delta/\sqrt{2+\Delta^2(d-3)}$. One can verify that $\|\phi(s,a)\| \le 1$, $\|\theta_h(s')\| \le D$ and $\|\nu_h\| \le D$ hold for any $s, a, s', h$ when $T \ge d^2H$.

Since the only rewarding state is $x_{H+2}$, the optimal strategy in state $x_h$ ($h \le H$) is to take an action that maximizes the probability of entering $x_{H+2}$, i.e., to maximize $\mu_h^\top a + \zeta_h(a)$. That is to say, we can regard the problem of finding the optimal action in state $x_h$ at step $h$ as finding the optimal arm for a $(d-3)$-dimensional linear bandit. Since we choose $\delta$ such that $(1-\delta)^{H/2}$ is a constant, there is a sufficiently high probability of entering state $x_h$ for any $h \le H/2$. Therefore, we can show that this problem is harder than solving $H/2$ such linear bandit problems.
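To make the construction concrete, here is a small toy simulation of the chain MDP above with $\zeta_h(a) = 0$. This is illustrative code of our own and is not used anywhere in the proof; it merely shows that an agent maximizing $\mu_h^\top a$ at each step collects more reward than a uniform one.

```python
# Toy rollout of the hard-instance chain MDP (ζ_h(a) = 0 in this sketch).
import numpy as np

rng = np.random.default_rng(1)
H, d, T = 10, 8, 1000
delta = 1.0 / H
Delta = np.sqrt(delta / T) / (4 * np.sqrt(2))
mu = rng.choice([-Delta, Delta], size=(H, d - 3))   # hidden bandit parameters μ_h

def run_episode(policy):
    """One episode; reward 1/H for each step spent in the absorbing state x_{H+2}."""
    reward, absorbed = 0.0, False
    for h in range(H):
        if absorbed:                          # already in x_{H+2}
            reward += 1.0 / H
            continue
        a = policy(h)                         # action in {-1, +1}^{d-3}
        if rng.random() < delta + mu[h] @ a:  # transit to x_{H+2}
            absorbed = True
    return reward

greedy = lambda h: np.sign(mu[h])             # optimal: maximize μ_h^T a
uniform = lambda h: rng.choice([-1.0, 1.0], size=d - 3)
for name, pi in [("optimal", greedy), ("uniform", uniform)]:
    print(name, np.mean([run_episode(pi) for _ in range(2000)]))
```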
Lemma 13. Suppose $H \ge 2$, $d \ge 8$ and $(d-3)\Delta \le \frac{1}{2H}$. We define $r^b_h(a) = \mu_h^\top a + \zeta_h(a)$, which can be regarded as the corresponding reward for the equivalent linear bandit problem at step $h$. Fix $\mu\in\left(\{-\Delta,\Delta\}^{d-3}\right)^H$ and a possibly history-dependent policy $\pi$. Letting $V^\star$ and $V^\pi$ be the optimal value function and the value function of policy $\pi$ respectively, we have

$$V^\star(x_1) - V^\pi(x_1) \ge 0.02\sum_{h=1}^{H/2}\left(\max_{a\in\mathcal{A}} r^b_h(a) - \sum_{a\in\mathcal{A}}\pi_h(a\mid s_h)\,r^b_h(a)\right) \qquad (171)$$

Proof.
Note that the only rewarding state is $x_{H+2}$, with $r_h(x_{H+2},a) = \frac1H$. Therefore, the value function of a policy $\pi$ can be calculated as

$$V^\pi(x_1) = \sum_{h=1}^{H-1}\frac{H-h}{H}\,P(N_h\mid\pi) \qquad (172)$$

where $N_h$ denotes the event of visiting state $x_h$ at step $h$ and then transiting to $x_{H+2}$, i.e. $N_h = \{s_h = x_h,\ s_{h+1} = x_{H+2}\}$. Let $\omega^\pi_h = \sum_{a\in\mathcal{A}}\pi_h(a\mid s_h)\,r^b_h(a)$ and $\omega^\star_h = \max_{a\in\mathcal{A}} r^b_h(a)$. By the law of total probability and the Markov property, we have

$$P(N_h\mid\pi) = \left(\delta + \omega^\pi_h\right)\prod_{j=1}^{h-1}\left(1 - \delta - \omega^\pi_j\right) \qquad (173)$$

Thus we have

$$V^\pi(x_1) = \sum_{h=1}^{H-1}\frac{H-h}{H}\left(\delta+\omega^\pi_h\right)\prod_{j=1}^{h-1}\left(1-\delta-\omega^\pi_j\right) \qquad (174)$$

Similarly, for the value function of the optimal policy, we have

$$V^\star(x_1) = \sum_{h=1}^{H-1}\frac{H-h}{H}\left(\delta+\omega^\star_h\right)\prod_{j=1}^{h-1}\left(1-\delta-\omega^\star_j\right) \qquad (175)$$

Define $S_i = \sum_{h=i}^{H-1}\frac{H-h}{H}\left(\delta+\omega^\pi_h\right)\prod_{j=i}^{h-1}\left(1-\delta-\omega^\pi_j\right)$ and $T_i = \sum_{h=i}^{H-1}\frac{H-h}{H}\left(\delta+\omega^\star_h\right)\prod_{j=i}^{h-1}\left(1-\delta-\omega^\star_j\right)$. Then we have $V^\star(x_1) - V^\pi(x_1) = T_1 - S_1$. Notice that

$$S_i = \frac{H-i}{H}\left(\omega^\pi_i + \delta\right) + S_{i+1}\left(1-\omega^\pi_i-\delta\right) \qquad (176)$$
$$T_i = \frac{H-i}{H}\left(\omega^\star_i + \delta\right) + T_{i+1}\left(1-\omega^\star_i-\delta\right) \qquad (177)$$

Thus we have

$$T_i - S_i = \left(\frac{H-i}{H} - T_{i+1}\right)\left(\omega^\star_i - \omega^\pi_i\right) + \left(T_{i+1}-S_{i+1}\right)\left(1-\omega^\pi_i-\delta\right) \qquad (178)$$

By induction, we get

$$T_1 - S_1 = \sum_{h=1}^{H-1}\left(\omega^\star_h - \omega^\pi_h\right)\left(\frac{H-h}{H} - T_{h+1}\right)\prod_{j=1}^{h-1}\left(1-\omega^\pi_j-\delta\right) \qquad (179)$$

Since the reward is non-negative and only occurs in $x_{H+2}$, we know that $V^\star(x_1) \ge V^\star(x_2) \ge\cdots\ge V^\star(x_H)$. If $N_h$ does not happen for any $h\in[H]$, then the agent must enter $x_{H+1}$. The probability of this event has the following form:

$$P\left(\neg\left(\cup_{h\in[H]}N_h\right)\mid\pi^\star\right) = 1 - \sum_{h=1}^H P(N_h\mid\pi^\star) \qquad (180)$$
$$= \prod_{h\in[H]}\left(1-\delta-\omega^\star_h\right) \qquad (181)$$
$$\ge \prod_{h\in[H]}\left(1-\frac{3}{2H}\right) \qquad (182)$$
$$= \left(1-\frac{3}{2H}\right)^H \qquad (183)$$
$$\ge 0.2 \qquad (184)$$

since $\delta = \frac1H$ and $|\omega^\star_h| \le \frac{1}{2H}$. The above discussion indicates that $T_{h+1} \le 0.8\cdot\frac{H-h-1}{H}$, hence $\frac{H-h}{H} - T_{h+1} \ge 0.2\cdot\frac{H-h}{H} \ge 0.1$ for $h \le H/2$. Similarly, $\prod_{j=1}^{h-1}\left(1-\omega^\pi_j-\delta\right) \ge \left(1-\frac{3}{2H}\right)^{H-1} \ge 0.2$. Combining with Eqn. 179, we have

$$T_1 - S_1 \ge 0.02\sum_{h=1}^{H/2}\left(\omega^\star_h - \omega^\pi_h\right) = 0.02\sum_{h=1}^{H/2}\left(\max_{a\in\mathcal{A}} r^b_h(a) - \sum_{a\in\mathcal{A}}\pi_h(a\mid s_h)\,r^b_h(a)\right) \qquad (185)$$

Combining with the definitions of $T_1$ and $S_1$, we can prove the lemma.

After proving Lemma 13, we are ready to prove Proposition 1.

Proof. (Proof of Proposition 1) By Lemma 13, we can decompose the sub-optimality gap of a policy $\pi$ in the following way:

$$V^\star(x_1) - V^\pi(x_1) \ge 0.02\sum_{h=1}^{H/2}\left(\max_{a\in\mathcal{A}} r^b_h(a) - \sum_{a\in\mathcal{A}}\pi_h(a\mid s_h)\,r^b_h(a)\right) \qquad (186)$$

where $r^b_h(a) = \mu_h^\top a + \zeta_h(a)$ can be regarded as a reward function for a misspecified linear bandit. To prove Proposition 1, the only remaining step is to derive the lower bound for misspecified linear bandits. We directly apply the following two lower bounds for linear bandits.

Lemma 14. (Lemma C.8 in Zhou et al. (2020a)) Fix a positive real $0 < \delta \le 1/3$ and positive integers $T, d$, and assume that $T \ge d^2/(2\delta)$. Consider the linear bandit problem $L_\mu$ parametrized with a parameter vector $\mu\in\{-\Delta,\Delta\}^d$ and action set $\mathcal{A} = \{-1,1\}^d$, so that the reward distribution for taking action $a\in\mathcal{A}$ is a Bernoulli distribution $B\left(\delta + (\mu^\star)^\top a\right)$. Then for any bandit algorithm $\mathcal{B}$, there exists a $\mu^\star\in\{-\Delta,\Delta\}^d$ such that the expected pseudo-regret of $\mathcal{B}$ over $T$ steps on bandit $L_{\mu^\star}$ is lower bounded by $\frac{d\sqrt{T\delta}}{8\sqrt 2}$.

Lemma 15. (Proposition 6 in Zanette et al. (2020a)) There exists a feature map $\phi:\mathcal{A}\to\mathbb{R}^d$ that defines a misspecified linear bandit class $\mathcal{M}$ such that every bandit instance in that class has reward response $\mu_a = \phi_a^\top\theta + z_a$ for any action $a$ (here $z_a\in[0,\zeta]$ is the deviation from linearity and $\mu_a\in[0,1)$), and such that the expected regret of any algorithm on at least one member of the class up to round $T$ is $\Omega(\sqrt d\,\zeta T)$.

Lemma 14 is used to prove the lower bound for linear mixture MDPs in Zhou et al. (2020a); it gives the lower bound for linear bandits with approximation error $\zeta = 0$, while Lemma 15 mainly considers the influence of $\zeta$ on the lower bound. Combining these two lemmas, the regret lower bound for misspecified linear bandits is $\Omega\left(\max\left(d\sqrt{T\delta},\ \sqrt d\,\zeta T\right)\right) = \Omega\left(d\sqrt{T\delta} + \sqrt d\,\zeta T\right)$. Since our problem reduces from $H/2$ such bandit problems with $\delta = 1/H$, the regret lower bound is $\Omega\left(Hd\sqrt{T\delta} + H\sqrt d\,\zeta T\right) = \Omega\left(d\sqrt{HT} + H\sqrt d\,\zeta T\right)$.

Now we have obtained the regret lower bound for misspecified linear MDPs. We can then prove the corresponding lower bound for the LSVI setting of Zanette et al. (2020a), since the LSVI setting is strictly harder than the linear MDP setting. The following lemma states the relation between the two settings.
Lemma 16. If an MDP $(\mathcal{S},\mathcal{A},p,r,H)$ is a misspecified linear MDP with approximation error $\zeta$, then this MDP satisfies the low inherent Bellman error assumption with $\mathcal{I} = 2\zeta$.

Proof. If an MDP is a $\zeta$-approximate linear MDP, then we have

$$\left\|p_h(\cdot\mid s,a) - \langle\phi(s,a),\theta_h(\cdot)\rangle\right\|_{\mathrm{TV}} \le \zeta \qquad (187)$$
$$\left|r_h(s,a) - \langle\phi(s,a),\nu_h\rangle\right| \le \zeta \qquad (188)$$

For any $\theta_{h+1}\in\mathbb{R}^d$, we have $\mathcal{T}_h\left(Q_{h+1}(\theta_{h+1})\right)(s,a) = r_h(s,a) + E_{s'\sim p_h(\cdot\mid s,a)}\, V_{h+1}(\theta_{h+1})(s')$. Since $V_{h+1}(\theta_{h+1})(s') \le 1$, plugging in the approximately linear forms of $r_h(s,a)$ and $p_h(\cdot\mid s,a)$, we have

$$\left|\mathcal{T}_h\left(Q_{h+1}(\theta_{h+1})\right)(s,a) - \left\langle\phi(s,a),\ \sum_{s'}\theta_h(s')\,V_{h+1}(\theta_{h+1})(s') + \nu_h\right\rangle\right| \le 2\zeta \qquad (189)$$

By Lemma 16, we can directly apply the hard instance construction and the lower bound for misspecified linear MDPs to the LSVI setting.
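The factor $2$ in Eqn. 189 simply adds the two $\zeta$-sized error sources (spelled out here for completeness, using $\sup_{s'} V_{h+1}(\theta_{h+1})(s') \le 1$):

$$\underbrace{\left|r_h(s,a) - \langle\phi(s,a),\nu_h\rangle\right|}_{\le\,\zeta\ \text{by (188)}} + \underbrace{\left|E_{s'\sim p_h}V_{h+1}(\theta_{h+1})(s') - \left\langle\phi(s,a),\textstyle\sum_{s'}\theta_h(s')V_{h+1}(\theta_{h+1})(s')\right\rangle\right|}_{\le\,\zeta\ \text{by (187) and}\ \|V_{h+1}\|_\infty\le 1}\ \le\ 2\zeta.$$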
There exist feature maps $\phi_1, \ldots, \phi_H$ that define an MDP class $\mathcal{M}$ such that every MDP in that class has inherent Bellman error at most $\mathcal{I}$, and such that the expected regret of any algorithm on at least one member of the class (for $|\mathcal{A}| \ge 2$, $d, k, H \ge 2$, $T = \Omega(d^2H)$, $\mathcal{I} \le 1/H$) is $\Omega\left(d\sqrt{HT} + \sqrt d\,H\mathcal{I}T\right)$.

C.2 Lower Bound for Multi-task RL

In order to prove Theorem 5, we need to prove and then combine the following two lemmas.
Lemma 17.
Under the setting of Theorem 5, the expected regret of any algorithm $\mathcal{A}$ is lower bounded by $\Omega\left(Mk\sqrt{HT}\right)$.

Lemma 18.
Under the setting of Theorem 5, the expected regret of any algorithm $\mathcal{A}$ is lower bounded by $\Omega\left(d\sqrt{kMHT} + HMT\sqrt d\,\mathcal{I}\right)$.

These two lemmas are proved by reduction from Proposition 2, which is the lower bound we proved for the single-task LSVI setting.
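Since both lower bounds hold simultaneously for any algorithm, they combine via $\max(a,b)\ge\frac{a+b}{2}$ into a single statement (a brief restatement, not an additional argument):

$$\mathrm{Reg}(T) = \Omega\left(\max\left(Mk\sqrt{HT},\ d\sqrt{kMHT} + HMT\sqrt d\,\mathcal{I}\right)\right) = \Omega\left(Mk\sqrt{HT} + d\sqrt{kMHT} + HMT\sqrt d\,\mathcal{I}\right).$$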