Low-Rank Generalized Linear Bandit Problems
Yangyi Lu, Department of Statistics, University of Michigan, [email protected]
Amirhossein Meisami, Adobe Inc., [email protected]
Ambuj Tewari, Department of Statistics, University of Michigan, [email protected]
June 5, 2020
Abstract
In a low-rank linear bandit problem, the reward of an action (represented by a matrix of size $d_1 \times d_2$) is the inner product between the action and an unknown low-rank matrix $\Theta^*$. We propose an algorithm based on a novel combination of online-to-confidence-set conversion (Abbasi-Yadkori et al., 2012) and the exponentially weighted average forecaster constructed by a covering of low-rank matrices. In $T$ rounds, our algorithm achieves $\widetilde{O}\big((d_1+d_2)^{3/2}\sqrt{rT}\big)$ regret, which improves upon the standard linear bandit regret bound of $\widetilde{O}(d_1 d_2 \sqrt{T})$ when the rank $r$ of $\Theta^*$ satisfies $r \ll \min\{d_1, d_2\}$. We also extend our algorithmic approach to the generalized linear setting and obtain an algorithm which enjoys a similar bound under regularity conditions on the link function. To get around the computational intractability of covering-based approaches, we propose an efficient algorithm by extending the "Explore-Subspace-Then-Refine" algorithm of Jun et al. (2019). Our efficient algorithm achieves $\widetilde{O}\big((d_1+d_2)^{3/2}\sqrt{rT}\big)$ regret under a mild condition on the action set $\mathcal{X}$ and the $r$-th singular value of $\Theta^*$. Our upper bounds match the conjectured lower bound of Jun et al. (2019) for a subclass of low-rank linear bandit problems. Further, we show that existing lower bounds for the sparse linear bandit problem strongly suggest that our regret bounds are unimprovable. To complement our theoretical contributions, we also conduct experiments to demonstrate that our algorithm can greatly outperform the standard linear bandit approach when $\Theta^*$ is low-rank.

Low-rank models are widely used in applications such as matrix completion and computer vision (Candès and Recht, 2009; Basri and Jacobs, 2003). We study low-rank (generalized) linear models in the bandit setting (Lai and Robbins, 1985). During the learning process, the agent adaptively pulls an arm (denoted $X_t$) from a set of arms based on past experience. At each pull, the agent observes a noisy reward corresponding to the arm pulled. Let $\Theta^* \in \mathbb{R}^{d_1 \times d_2}$ be an unknown low-rank matrix with rank $r \ll \min\{d_1, d_2\}$. The learner's goal is to maximize the total reward $\sum_{t=1}^{T} \mu(\langle \Theta^*, X_t\rangle)$, where $T$ is the time horizon, $X_t \in \mathbb{R}^{d_1 \times d_2}$ is the action pulled at time $t$ from a pre-specified action set $\mathcal{X}$, and $\mu(\cdot)$ denotes a link function. In the standard linear case the link function is the identity.

Many practical applications can be framed in this low-rank bandit model. For traveling websites, the recommendation system needs to choose a flight-hotel bundle for the customer that can achieve high revenue. Often one has features of size $d_1$ for flights and features of size $d_2$ for hotels. It is natural to form a $d_1 \times d_2$ matrix feature (e.g., via an outer product) for each pair, or simply to combine the two features row/column-wise if $d_1 = d_2$. One can model the appeal of a bundle by a (generalized) linear function of the matrix feature. In online advertising with image recommendation, the advertiser selects an image to display and the goal is to achieve the maximum click rate. The image is often stored as a $d_1 \times d_2$ matrix, and one can use a generalized linear model (GLM) with the logistic link function to model the click rate (Richardson et al., 2007; McMahan et al., 2013). In all of these applications, one puts some capacity control on the underlying matrix coefficient $\Theta^*$, and a natural condition is that $\Theta^*$ is low-rank. We
note that the examples such as online dating and online shopping discussed in Jun et al. (2019) can also be formulated in our model.

In this paper, we measure the quality of an algorithm in terms of its cumulative regret. A naive approach is to ignore the low-rank structure and directly apply standard (generalized) linear bandit algorithms (Abbasi-Yadkori et al., 2011; Filippi et al., 2010). These approaches suffer $\widetilde{O}(d_1 d_2 \sqrt{T})$ regret (throughout, $\widetilde{O}$ omits poly-logarithmic factors of $d_1$, $d_2$, $r$ and $T$). However, in practice, $d_1 d_2$ can be huge. A natural question is then: Can we utilize the low-rank structure of $\Theta^*$ to achieve $o(d_1 d_2 \sqrt{T})$ regret?

Jun et al. (2019) studied a subclass of our problem, where the actions are rank-one matrices. They proposed an algorithm that achieves $\widetilde{O}\big((d_1+d_2)^{3/2}\sqrt{rT}\big)$ regret under additional incoherence and singular value assumptions on an augmented matrix defined via the arm set and $\Theta^*$, together with a singular value assumption on $\Theta^*$. They also provided strong evidence that their bound is unimprovable.

We summarize our contributions below.

1. We propose the Low-Rank Linear Bandit with Online Computation algorithm (LowLOC) for the low-rank linear bandit problem, which achieves $\widetilde{O}\big((d_1+d_2)^{3/2}\sqrt{rT}\big)$ regret. Notably, compared with the result in Jun et al. (2019), our result (1) applies to more general action sets, which can contain high-rank matrices, and (2) does not require the incoherence and bounded eigenvalue assumptions on the augmented matrix mentioned in the previous paragraph. Our regret bound also matches their conjectured lower bound. For LowLOC, we first design a novel online predictor which uses an exponentially weighted average forecaster over a covering of low-rank matrices to solve the online low-rank linear prediction problem with $O((d_1+d_2) r \log T)$ regret. We then plug our online predictor into the online-to-confidence-set conversion framework proposed by Abbasi-Yadkori et al. (2012) to construct a confidence set for $\Theta^*$ in our bandit setting, and at every round we choose the action optimistically.

2. We further propose the Low-Rank Generalized Linear Bandit with Online Computation algorithm (LowGLOC) for the generalized linear setting, which also achieves $\widetilde{O}\big((d_1+d_2)^{3/2}\sqrt{rT}\big)$ regret. LowGLOC is similar to LowLOC, but here we need to design a new online-to-confidence-set conversion method, which may be of independent interest.

3. LowLOC and LowGLOC enjoy good regret guarantees but are unfortunately not efficiently implementable. To overcome this issue, we provide an efficient algorithm, Low-Rank-Explore-Subspace-Then-Refine (LowESTR), for the linear setting, inspired by the ESTR algorithm proposed by Jun et al. (2019). We show that under a mild assumption on the action set $\mathcal{X}$, LowESTR achieves $\widetilde{O}\big((d_1+d_2)^{3/2}\sqrt{rT}/\omega_r\big)$ regret, where $\omega_r > 0$ is a lower bound on the $r$-th singular value of $\Theta^*$. Compared with ESTR, LowESTR does not need the incoherence and eigenvalue assumptions on the augmented matrix, while the assumptions on the action sets of the two algorithms are different. We also provide empirical evaluations to demonstrate the effectiveness of LowESTR.

Our work is inspired by Jun et al. (2019), who model the reward as $x_t^\top \Theta^* z_t$, where $x_t \in \mathcal{X} \subset \mathbb{R}^{d_1}$ is a left arm and $z_t \in \mathcal{Z} \subset \mathbb{R}^{d_2}$ is a right arm ($\mathcal{X}$ and $\mathcal{Z}$ are the left and right arm sets, respectively). Note that this model is a special case of our low-rank linear bandit model, because one can write $x_t^\top \Theta^* z_t = \langle \Theta^*, x_t z_t^\top\rangle$ and define the arm set as $\mathcal{X}\mathcal{Z}^\top$.
Their ESTR algorithm enjoys an $\widetilde{O}\big((d_1+d_2)^{3/2}\sqrt{rT}/\omega_r\big)$ regret bound under the assumptions that (1) an augmented matrix $K^* = X\Theta^* Z^\top$ is incoherent (Keshavan et al., 2010) and has a finite condition number, where $X \in \mathbb{R}^{d_1 \times d_1}$ is constructed from $d_1$ arms in $\mathcal{X}$ chosen to control $\|X^{-1}\|$ and $Z \in \mathbb{R}^{d_2 \times d_2}$ is constructed from $d_2$ arms in $\mathcal{Z}$ chosen to control $\|Z^{-1}\|$, and (2) $\|X^{-1}\|$ and $\|Z^{-1}\|$ are upper bounded by a constant. Their algorithm requires explicitly finding $X$ and $Z$, which is in general NP-hard, even though they also proposed heuristics to speed up this step. Like ESTR, our LowLOC and LowGLOC algorithms are not computationally efficient, but they both apply to richer action sets (see Section 3 for the definition), do not require assumptions on $K^*$, $X$ and $Z$, and their regret bounds do not depend on $\omega_r$. Our LowESTR algorithm is computationally efficient if the action set admits a nice exploration distribution (see details in Section 6). LowESTR achieves an $\widetilde{O}\big((d_1+d_2)^{3/2}\sqrt{rT}/\omega_r\big)$ regret bound and does not require assumptions on $K^*$, $X$ and $Z$ either.

Katariya et al. (2017b) and Kveton et al. (2017) also studied rank-1 and low-rank bandit problems. They assume there is an underlying expected reward matrix $\bar{R}$; at each time the learner picks an entry at position $(i_t, j_t)$ and receives a noisy reward. This can be viewed as a special case of the bilinear bandit with one-hot vectors as left and right arms. Katariya et al. (2017b) was further extended by Katariya et al. (2017a), which uses KL-based confidence intervals to achieve a tighter regret bound. Our problem is more general than these works. Johnson et al. (2016) considered the same setting as ours, but their method relies on knowledge of many parameters that depend on the unknown $\Theta^*$ and, in particular, only works for continuous arm sets.

There are other works that utilize low-rank structure in different model settings. For example, Gopalan et al. (2016) studied low-rank bandits with latent structure using the robust tensor power method. Lale et al. (2019) imposed low-rank assumptions on the feature vectors to reduce the effective dimension. These works all utilize the low-rank structure to achieve better regret bounds than standard approaches that do not take the low-rank structure into account.

We formally define the problem and review the relevant background next.
Let
$\mathcal{X} \subset \mathbb{R}^{d_1 \times d_2}$ be the arm space. In each round $t$, the learner chooses an arm $X_t \in \mathcal{X}$ and observes a noisy reward of a linear form: $y_t = \langle X_t, \Theta^*\rangle + \eta_t$, where $\Theta^* \in \mathbb{R}^{d_1 \times d_2}$ is an unknown parameter and $\eta_t$ is a $1$-sub-Gaussian random variable. Denote the rank of $\Theta^*$ by $r$; we assume $r \ll \min\{d_1, d_2\}$, and we let the $r$-th singular value of $\Theta^*$ be lower bounded by $\omega_r > 0$. We use $\langle A, B\rangle := \mathrm{trace}(A^\top B)$ to denote the inner product between matrices $A$ and $B$. We follow the standard assumptions in linear bandits: $\|\Theta^*\|_F \le 1$ and $\|X\|_F \le 1$ for all $X \in \mathcal{X}$.

In our bandit problem, the goal of the learner is to maximize the total reward $\sum_{t=1}^T \langle X_t, \Theta^*\rangle$, where $T$ is the time horizon. Clearly, with knowledge of the unknown parameter $\Theta^*$, one should always select an action $X^* \in \mathrm{argmax}_{X \in \mathcal{X}} \langle X, \Theta^*\rangle$. It is natural to evaluate the learner relative to this optimal strategy. The difference between the learner's total reward and the total reward of the optimal strategy is called the pseudo-regret (Audibert et al., 2009): $R_T := \sum_{t=1}^T \langle X^* - X_t, \Theta^*\rangle$. For simplicity, we use the word regret instead of pseudo-regret for $R_T$.

We also study the generalized linear bandit model of the following form: $\mathbb{E}[y_t \mid X_t, \Theta^*] = \mu(\langle X_t, \Theta^*\rangle)$, where $\mu(\cdot)$ is a link function. This framework builds on the well-known Generalized Linear Models (GLMs) and has been widely studied in many applications. For example, when rewards are binary-valued, a natural link function is the logistic function $\mu(x) = \exp(x)/(1+\exp(x))$. For the generalized setting, we assume the reward given the action follows an exponential family distribution:

$\mathbb{P}(y \mid z = \langle X, \Theta^*\rangle) = \exp\Big(\frac{yz - m(z)}{\phi(\tau)} + h(y, \tau)\Big)$,  (1)

where $\tau \in \mathbb{R}^+$ is a known scale parameter and $m$, $\phi$ and $h$ are known functions. From a basic calculation we get $m'(z) = \mathbb{E}[y \mid z] =: \mu(z)$. We assume the above exponential family is a minimal representation; then $m(z)$ is ensured to be strictly convex (Wainwright and Jordan, 2008), and thus the negative log-likelihood (NLL) loss $\ell(z, y) := -yz + m(z)$ is also strictly convex.
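To make the reward models above concrete, the following sketch simulates both the linear and the logistic-link environments; the class and helper names are ours (not from the paper), and the specific constants are arbitrary.

import numpy as np

rng = np.random.default_rng(0)

def make_low_rank(d1, d2, r):
    """A random Theta* with rank r, normalized so that ||Theta*||_F <= 1."""
    U, V = rng.standard_normal((d1, r)), rng.standard_normal((d2, r))
    Theta = U @ V.T
    return Theta / np.linalg.norm(Theta, "fro")

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

class LowRankBanditEnv:
    """Linear rewards if link is None; Bernoulli rewards with mean mu(<X, Theta*>) otherwise."""
    def __init__(self, Theta, link=None, noise_sd=0.1):
        self.Theta, self.link, self.noise_sd = Theta, link, noise_sd

    def mean_reward(self, X):
        z = np.sum(X * self.Theta)                      # <X, Theta*> = trace(X^T Theta*)
        return z if self.link is None else self.link(z)

    def pull(self, X):
        if self.link is None:
            return self.mean_reward(X) + self.noise_sd * rng.standard_normal()
        return float(rng.random() < self.mean_reward(X))

# Example: 100 arms on the Frobenius unit sphere, linear rewards, one round of pseudo-regret.
d1, d2, r = 5, 4, 1
env = LowRankBanditEnv(make_low_rank(d1, d2, r))
arms = [A / np.linalg.norm(A, "fro") for A in rng.standard_normal((100, d1, d2))]
best_mean = max(env.mean_reward(A) for A in arms)
print(best_mean - env.mean_reward(arms[0]))             # instantaneous pseudo-regret of arm 0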
Algorithm 1 Low-Rank Linear Bandit with Online Computation (LowLOC)

Input: arm set $\mathcal{X}$, horizon $T$, $\frac{1}{T}$-net of $\mathcal{S}_r$: $\bar{\mathcal{S}}_r(\frac{1}{T})$, failure rate $\delta$, EW parameter $\eta$ (chosen as in Lemma 2).
Initial confidence set $\mathcal{C}_0 = \{\Theta \in \mathbb{R}^{d_1 \times d_2} : \|\Theta\|_F \le 1\}$.
for $t = 1, \ldots, T$ do
  $(X_t, \widetilde{\Theta}_t) := \mathrm{argmax}_{(X, \Theta) \in \mathcal{X} \times \mathcal{C}_{t-1}} \langle X, \Theta\rangle$.
  Pull arm $X_t$ and receive reward $y_t$.
  Compute the EW prediction $\hat{y}_t = \sum_{i=1}^{|\bar{\mathcal{S}}_r(\frac{1}{T})|} e^{-\eta L_{i,t-1}} f_{\Theta_i,t} \big/ \sum_{j=1}^{|\bar{\mathcal{S}}_r(\frac{1}{T})|} e^{-\eta L_{j,t-1}}$, where $f_{\Theta_i,t} := \langle X_t, \Theta_i\rangle$ for $\Theta_i \in \bar{\mathcal{S}}_r(\frac{1}{T})$.
  Update the losses $L_{i,t} = \sum_{s=1}^{t} (y_s - f_{\Theta_i,s})^2$ for $i = 1, \ldots, |\bar{\mathcal{S}}_r(\frac{1}{T})|$.
  Update $\mathcal{C}_t$ according to Equation 2, where $B_t$ is defined in Lemma 2.
end for
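The core of LowLOC is the exponentially weighted average forecaster in the loop above. The sketch below runs EW over a finite list of candidate low-rank matrices standing in for the $\frac{1}{T}$-net (the actual net is exponentially large, so we simply draw random rank-$r$ candidates for illustration); the function names are ours.

import numpy as np

rng = np.random.default_rng(1)

def ew_predict(candidates, cum_losses, X_t, eta):
    """Exponentially weighted average prediction (Eq. 3) for a new arm X_t.
    candidates: list of d1 x d2 matrices (the 'experts'); cum_losses: their cumulative squared losses."""
    f = np.array([np.sum(Th * X_t) for Th in candidates])      # expert predictions <Theta_i, X_t>
    w = np.exp(-eta * (cum_losses - cum_losses.min()))         # shift for numerical stability
    return float(w @ f / w.sum())

# Toy run: random rank-1 candidates as a stand-in for the 1/T-net, squared-loss updates.
d1, d2, r, T, eta = 4, 4, 1, 200, 0.1
Theta_star = rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2))
Theta_star /= np.linalg.norm(Theta_star, "fro")
cands = []
for _ in range(500):
    M = rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2))
    cands.append(M / max(1.0, np.linalg.norm(M, "fro")))
L, sq_err = np.zeros(len(cands)), 0.0
for t in range(T):
    X_t = rng.standard_normal((d1, d2))
    X_t /= np.linalg.norm(X_t, "fro")
    y_t = np.sum(X_t * Theta_star) + 0.1 * rng.standard_normal()
    y_hat = ew_predict(cands, L, X_t, eta)
    sq_err += (y_t - y_hat) ** 2
    L += (y_t - np.array([np.sum(Th * X_t) for Th in cands])) ** 2   # update L_{i,t}
print("average squared prediction error:", sq_err / T)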
We make the following standard assumption on the link function $\mu(\cdot)$ (Jun et al., 2017).

Assumption 1. There exist constants $L_\mu, c_\mu \ge 0$ and $\kappa_\mu > 0$ such that the link function $\mu(\cdot)$ is $L_\mu$-Lipschitz on $[-1, 1]$, continuously differentiable on $(-1, 1)$, $\inf_{z \in (-1,1)} \mu'(z) =: \kappa_\mu$, and $|\mu(0)| \le c_\mu$.

One can write the reward model (1) in an equivalent way: $y_t = \mu(\langle X_t, \Theta^*\rangle) + \eta_t$, where $\eta_t$ is conditionally $R$-sub-Gaussian given $X_t$ and $\{(X_s, \eta_s)\}_{s=1}^{t-1}$. Using the form of $\mathbb{P}(y \mid z)$, a Taylor expansion, and the strict convexity of $m(\cdot)$, one can show that $R = \sup_{z \in [-1,1]} \sqrt{\mu'(z)} \le \sqrt{L_\mu}$ by the definition of the sub-Gaussian constant. An optimal arm is $X^* \in \mathrm{argmax}_{X \in \mathcal{X}} \mu(\langle X, \Theta^*\rangle)$. The performance of an algorithm is again evaluated by the cumulative regret $R_T = \sum_{t=1}^T \mu(\langle X^*, \Theta^*\rangle) - \mu(\langle X_t, \Theta^*\rangle)$.

We use $O$ and $\Omega$ for the standard big-O and big-Omega notations. $\widetilde{O}$ and $\widetilde{\Omega}$ ignore poly-logarithmic factors of $d_1, d_2, r, T$. $f(x) \asymp g(x)$ means that $f$ and $g$ are of the same order, ignoring poly-logarithmic factors of $d_1, d_2, r, T$.

We first present our algorithm, LowLOC (Algorithm 1), for low-rank linear bandit problems.
Theorem 1 (Regret of LowLOC (Algorithm 1)). For any $\delta \in (0, 1/2)$, with probability at least $1 - \delta$, Algorithm 1 achieves regret

$R_T = \widetilde{O}\Big((d_1 + d_2)^{3/2}\sqrt{rT}\,\sqrt{\log(1/\delta)}\Big)$.

Note that LowLOC achieves the desired goal of outperforming the standard linear bandit approach with its $\widetilde{O}(d_1 d_2 \sqrt{T})$ regret. Furthermore, this bound does not depend on any other problem-dependent parameters, such as the least singular value of $\Theta^*$, and does not require any of the additional assumptions that appear in Jun et al. (2019). In the following, we explain the details of our algorithm design choices.

This algorithm follows the standard Optimism in the Face of Uncertainty (OFU) principle. We maintain a confidence set $\mathcal{C}_t$ at every round that contains the true parameter $\Theta^*$ with high probability, and we choose the action $X_t$ according to $(X_t, \widetilde{\Theta}_t) = \mathrm{argmax}_{(X, \Theta) \in \mathcal{X} \times \mathcal{C}_{t-1}} \langle X, \Theta\rangle$. Typically, the faster $\mathcal{C}_t$ shrinks, the lower the regret. The main difficulty is to construct $\mathcal{C}_t$ in a way that leverages the low-rank structure so that we only incur $\widetilde{O}\big((d_1+d_2)^{3/2}\sqrt{rT}\big)$ regret. Our starting point is the online-to-confidence-set conversion framework proposed by Abbasi-Yadkori et al. (2012), which builds the confidence set from an online predictor. At each round, an online predictor receives $X_t$, predicts $\hat{y}_t$ based on the historical data $\{(X_s, y_s)\}_{s=1}^{t-1}$, observes the true value $y_t$, and suffers a loss $\ell_t(\hat{y}_t) := (y_t - \hat{y}_t)^2$. The performance of this online predictor is measured by comparing its cumulative loss to the cumulative loss of a fixed linear predictor with coefficient $\Theta$: $\rho_t(\Theta) = \sum_{s=1}^t \ell_s(\hat{y}_s) - \ell_s(\langle \Theta, X_s\rangle)$.

The key idea of online-to-confidence-set conversion (adapted to our low-rank setting) is that if one can guarantee $\sup_{\|\Theta\|_F \le 1,\ \mathrm{rank}(\Theta) \le r} \rho_t(\Theta) \le B_t$ for some non-decreasing sequence $\{B_t\}_{t=1}^T$, then we can construct the confidence set for $\Theta^*$ as

$\mathcal{C}_t = \Big\{\Theta \in \mathbb{R}^{d_1 \times d_2} : \|\Theta\|_F^2 + \sum_{s=1}^t (\hat{y}_s - \langle \Theta, X_s\rangle)^2 \le \beta_t(\delta)\Big\}$, with $\beta_t(\delta) = 1 + 2B_t + 32\log\big((\sqrt{8} + \sqrt{1 + B_t})/\delta\big)$,  (2)

where $\delta$ is the failure probability. Lemma 7 in the appendix guarantees that $\Theta^*$ is contained in $\cap_{t \ge 0}\, \mathcal{C}_t$ with high probability, and Lemma 8 further guarantees the overall regret $R_T = \widetilde{O}\big(\sqrt{d_1 d_2\, \beta_{T-1}(\delta)\, T}\big) = \widetilde{O}\big((d_1+d_2)\sqrt{B_{T-1} T}\big)$.

Therefore, achieving the $\widetilde{O}\big((d_1+d_2)^{3/2}\sqrt{rT}\big)$ regret bound reduces to designing an online predictor which guarantees $\sup_{\|\Theta\|_F \le 1,\ \mathrm{rank}(\Theta) \le r} \rho_t(\Theta) \le B_t$ with $B_t = \widetilde{O}((d_1+d_2)r)$. To achieve this rate, the key is to leverage the low-rank structure.
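As a concrete illustration of the conversion in Eq. (2), the following sketch turns an online-regret bound $B_t$ into the confidence-set radius $\beta_t(\delta)$ and tests membership of a candidate $\Theta$; the function names are ours.

import numpy as np

def beta_t(B_t, delta):
    """Confidence-set radius from Eq. (2)."""
    return 1.0 + 2.0 * B_t + 32.0 * np.log((np.sqrt(8.0) + np.sqrt(1.0 + B_t)) / delta)

def in_confidence_set(Theta, arm_hist, ew_pred_hist, B_t, delta):
    """C_t membership: ||Theta||_F^2 + sum_s (yhat_s - <Theta, X_s>)^2 <= beta_t(delta)."""
    preds = np.array([np.sum(Theta * X) for X in arm_hist])
    lhs = np.linalg.norm(Theta, "fro") ** 2 + np.sum((np.asarray(ew_pred_hist) - preds) ** 2)
    return lhs <= beta_t(B_t, delta)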
We adopt the classical exponentially weighted average forecaster (EW) framework (Cesa-Bianchi and Lugosi, 2006), which uses $N$ experts to predict $\hat{y}_t$ via

$\hat{y}_t = \dfrac{\sum_{i=1}^N e^{-\eta L_{i,t-1}} f_{i,t}}{\sum_{j=1}^N e^{-\eta L_{j,t-1}}}$.  (3)

Here $f_i$ denotes the $i$-th expert, which makes prediction $f_{i,t}$ at time $t$, $L_{i,t-1} := \sum_{s=1}^{t-1} \ell_s(f_i(X_s))$ is the cumulative loss incurred by expert $i$, and $\eta$ is a tuning parameter. By choosing $\eta$ carefully, one can guarantee that this predictor achieves $O(\log N \cdot \log(T/\delta))$ regret relative to the best expert in the expert set.

In our setting, an expert can be viewed as a matrix $\Theta$ satisfying $\|\Theta\|_F \le 1$ and $\mathrm{rank}(\Theta) \le r$, which makes predictions according to $f_{\Theta,t} := \langle \Theta, X_t\rangle$. There are infinitely many such experts, so we cannot directly use EW, which requires a finite number of experts. Our main idea is to construct $N$ experts such that $\log N$ is small and these $N$ experts represent the original expert set $\mathcal{S}_r := \{\Theta \in \mathbb{R}^{d_1 \times d_2} : \|\Theta\|_F \le 1,\ \mathrm{rank}(\Theta) \le r\}$ well, and then to apply EW with these $N$ experts. We construct an $\varepsilon$-net $\bar{\mathcal{S}}_r(\varepsilon)$, i.e., for any $\Theta \in \mathcal{S}_r$ there exists a $\bar{\Theta} \in \bar{\mathcal{S}}_r(\varepsilon)$ such that $\|\Theta - \bar{\Theta}\|_F \le \varepsilon$. We further show that $|\bar{\mathcal{S}}_r(\varepsilon)| \le (9/\varepsilon)^{(d_1+d_2+1)r}$ in Lemma 6, so the number of experts $N$ in Equation 3 is at most $(9T)^{(d_1+d_2+1)r}$ if we set $\varepsilon = 1/T$.
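To give a feel for this covering step, here is a toy numerical illustration (a simplification, not the construction used in Lemma 6, which covers the SVD factors $U$, $\Sigma$, $V$ separately): for $d_1 = d_2 = 2$ and $r = 1$, gridding the scale and the two unit-vector angles already yields an $\varepsilon$-net of the rank-one unit ball.

import numpy as np
from itertools import product

eps = 0.25
rng = np.random.default_rng(2)
unit = lambda a: np.array([np.cos(a), np.sin(a)])

# Grid sigma in [0, 1] and the angles of u and v with step eps/3: by the triangle
# inequality, every sigma * u v^T with sigma <= 1 is within eps of some grid point.
sigmas = np.arange(0.0, 1.0 + 1e-9, eps / 3.0)
angles = np.arange(0.0, 2.0 * np.pi, eps / 3.0)
net = np.array([s * np.outer(unit(a), unit(b)) for s, a, b in product(sigmas, angles, angles)])

# Empirical check of the covering property on random rank-one matrices in the unit ball.
worst = 0.0
for _ in range(200):
    u, v = rng.standard_normal(2), rng.standard_normal(2)
    Theta = rng.random() * np.outer(u / np.linalg.norm(u), v / np.linalg.norm(v))
    worst = max(worst, np.linalg.norm(net - Theta, axis=(1, 2)).min())
print(f"net size {len(net)}, worst covering distance {worst:.3f} (target {eps})")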
The following lemma summarizes the performance of this online predictor.

Lemma 2 (Regret of EW under the Squared Loss). Let the EW parameter $\eta$ in the forecaster (3) be chosen on the order of $1/\log(2T/\delta)$ (the exact choice is given in the proof in the appendix). Then, for any $0 < \delta < 1/2$, with probability at least $1 - \delta$,

$B_T = \sup_{\|\Theta\|_F \le 1,\ \mathrm{rank}(\Theta) \le r} \rho_T(\Theta) = O\Big((d_1+d_2) r \log(T) \log\big(\tfrac{T}{\delta}\big)\Big) = \widetilde{O}\Big((d_1+d_2) r \log\big(\tfrac{1}{\delta}\big)\Big)$.

To obtain Theorem 1, one just needs to plug Lemma 2 into Lemma 8.

We also study the low-rank generalized linear bandit setting. Our algorithm LowGLOC is similar to LowLOC, so we only present the differences and leave the detailed presentation of the algorithm (Algorithm 3) to the appendix (Section H).

We still use EW to perform online predictions, but instead of the squared loss we use the negative log-likelihood (NLL) loss $\ell_s(\hat{y}_s) = -\hat{y}_s y_s + m(\hat{y}_s)$ to construct the forecaster in Equation (3), where $m(\cdot)$ is as defined in Section 3. The performance of EW with the NLL loss relative to a fixed linear predictor $\Theta$ is therefore measured by $\rho_T^{GLB}(\Theta) = \big(\sum_{t=1}^T -\hat{y}_t y_t + m(\hat{y}_t)\big) - \big(\sum_{t=1}^T -\langle \Theta, X_t\rangle y_t + m(\langle \Theta, X_t\rangle)\big)$. If there exists a non-decreasing sequence $\{B_t^{GLB}\}$ such that $\sup_{\|\Theta\|_F \le 1,\ \mathrm{rank}(\Theta) \le r} \rho_t^{GLB}(\Theta) \le B_t^{GLB}$, we construct $\mathcal{C}_t^{GLB}$ in the following way:

$\mathcal{C}_t^{GLB} = \Big\{\Theta \in \mathbb{R}^{d_1 \times d_2} : \|\Theta\|_F^2 + \sum_{s=1}^t (\hat{y}_s - \langle \Theta, X_s\rangle)^2 \le \beta_t^{GLB}(\delta)\Big\}$,  (4)

where $\beta_t^{GLB}(\delta) = 2 + \frac{2}{\kappa_\mu} B_t^{GLB} + \frac{32 L_\mu}{\kappa_\mu^2}\log\Big(\big(\sqrt{8L_\mu/\kappa_\mu^2} + \sqrt{\tfrac{2}{\kappa_\mu} B_t^{GLB} + 1}\big)/\delta\Big)$. Lemma 11 guarantees that the true parameter $\Theta^*$ is contained in $\cap_{t \ge 0}\, \mathcal{C}_t^{GLB}$ with high probability. Lemma 12 further guarantees that the overall regret of LowGLOC satisfies $R_T = \widetilde{O}\big(\sqrt{d_1 d_2\, \beta_{T-1}^{GLB}(\delta)\, T}\big) = \widetilde{O}\big((d_1+d_2)\sqrt{B_T^{GLB}\, T/\kappa_\mu}\big)$. Following the online-to-confidence-set conversion idea used in LowLOC, we prove that $B_T^{GLB} = O\big(\frac{L_\mu + c_\mu}{\kappa_\mu}(d_1+d_2)\, r \log T \log\big(\frac{T}{\delta}\big)\big)$ in Lemma 13.

We next present the regret of LowGLOC, which follows by plugging Lemma 13 into Lemma 12.

Theorem 3 (Regret of LowGLOC). For any $\delta \in (0, 1/2)$, with probability at least $1 - \delta$, Algorithm 3 achieves regret

$R_T = \widetilde{O}\bigg((d_1+d_2)^{3/2}\sqrt{\frac{L_\mu + c_\mu}{\kappa_\mu^2}\, rT\, \log\big(\tfrac{1}{\delta}\big)}\bigg)$.

To the best of our knowledge, this is the first $o(d_1 d_2 \sqrt{T})$ regret bound for low-rank GLM bandits.
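The only change LowGLOC makes to the forecaster is the loss used to score experts. As a sketch for the logistic link, where $m(z) = \log(1+e^z)$ and $\mu = m'$ is the sigmoid (helper names are ours), the EW weights under the NLL loss look as follows.

import numpy as np

def nll_loss(z, y):
    """l(z, y) = -y z + m(z) with m(z) = log(1 + e^z) (logistic link)."""
    return -y * z + np.logaddexp(0.0, z)

def ew_weights_nll(candidates, history, eta):
    """Exponential weights over candidate matrices from cumulative NLL losses.
    history: list of (X_s, y_s) pairs observed so far."""
    L = np.array([sum(nll_loss(np.sum(Th * X), y) for X, y in history) for Th in candidates])
    w = np.exp(-eta * (L - L.min()))
    return w / w.sum()

# The EW prediction for a new arm X_t is then w @ [<Theta_i, X_t>]_i, exactly as in Eq. (3).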
At every round, LowLOC and LowGLOC need to compute exponentially weighted predictions, which involves calculating weights over the covering of low-rank matrices. These approaches have high computational complexity even though their regret is ideal. In this section, we propose a computationally efficient method, LowESTR (Algorithm 2), that also achieves $\widetilde{O}\big((d_1+d_2)^{3/2}\sqrt{rT}\big)$ regret under the following mild assumption on the action set $\mathcal{X}$.

Assumption 2. There exists a sampling distribution $\mathcal{D}$ over $\mathcal{X}$ with covariance matrix $\Sigma$ such that $\lambda_{\min}(\Sigma) \asymp \frac{1}{d_1 d_2}$ and $\mathcal{D}$ is sub-Gaussian with parameter $\sigma^2 \asymp \frac{1}{d_1 d_2}$ (see Definition 1 in Section C for the definition of sub-Gaussian random matrices).

This assumption is easily satisfied by many arm sets. To guarantee the existence of such a sampling distribution $\mathcal{D}$, we only need the convex hull of a subset of arms $\mathcal{X}_{sub} \subset \mathcal{X}$ to contain a ball of radius $R \le 1$ that does not scale with $d_1$ or $d_2$. Simple examples of $\mathcal{X}$ are the Euclidean unit ball and unit sphere.

We extend the two-stage procedure "Explore-Subspace-Then-Refine" (ESTR) proposed by Jun et al. (2019). In stage 1, ESTR estimates the row and column subspaces of $\Theta^*$. In stage 2, ESTR transforms the original problem into a $d_1 d_2$-dimensional linear bandit problem and invokes the LowOFUL algorithm (Jun et al., 2019), which leverages the estimated row/column subspaces of $\Theta^*$. LowESTR also follows this two-stage framework, but we use a different estimation method in stage 1.
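As a quick sanity check of Assumption 2 (a sketch, not part of the paper): sampling arms uniformly from the Frobenius unit sphere gives a vectorized covariance of $I/(d_1 d_2)$, so $\lambda_{\min}(\Sigma) = 1/(d_1 d_2)$, and the sub-Gaussian parameter is of the same order.

import numpy as np

rng = np.random.default_rng(3)
d1, d2, n = 6, 5, 20000
Z = rng.standard_normal((n, d1 * d2))
X = Z / np.linalg.norm(Z, axis=1, keepdims=True)   # n arms drawn uniformly from the unit sphere
Sigma_hat = X.T @ X / n                            # empirical covariance of vec(X)
print(np.linalg.eigvalsh(Sigma_hat).min(), 1.0 / (d1 * d2))   # both close to 1/(d1*d2)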
Algorithm 2 Low-Rank Explore-Subspace-Then-Refine (LowESTR)
Input: arm set $\mathcal{X}$, time horizon $T$, exploration length $T_1$, rank $r$ of $\Theta^*$, spectral bound $\omega_r$ of $\Theta^*$, sampling distribution $\mathcal{D}$ for stage 1; parameters for LowOFUL in stage 2: $B, B_\perp, \lambda, \lambda_\perp$.
Stage 1: Explore the Low-Rank Subspace
Pull $X_t \in \mathcal{X}$ according to distribution $\mathcal{D}$ and observe reward $Y_t$, for $t = 1, \ldots, T_1$.
Solve for $\widehat{\Theta}$ via the problem below:

$\widehat{\Theta} = \mathrm{argmin}_{\Theta \in \mathbb{R}^{d_1 \times d_2}}\ \frac{1}{T_1}\sum_{t=1}^{T_1}\big(Y_t - \langle X_t, \Theta\rangle\big)^2 + \lambda_{T_1}\|\Theta\|_{nuc}$.  (5)

Let $\widehat{\Theta} = U\widehat{S}V^\top$ be the SVD of $\widehat{\Theta}$. Take the first $r$ columns of $U$ as $\widehat{U}$ and the first $r$ rows of $V^\top$ as $\widehat{V}^\top$. Let $\widehat{U}_\perp$ and $\widehat{V}_\perp$ be orthonormal bases of the complementary subspaces of $\widehat{U}$ and $\widehat{V}$.
Stage 2: Refine a Standard Linear Bandit Algorithm
Rotate the arm feature set: $\mathcal{X}' := \{[\widehat{U}\ \widehat{U}_\perp]^\top X [\widehat{V}\ \widehat{V}_\perp] : X \in \mathcal{X}\}$.
Define a vectorized arm feature set so that the last $(d_1 - r)(d_2 - r)$ components come from the complementary subspaces: $\mathcal{X}'_{vec} := \{[\mathrm{vec}(X'_{1:r,1:r});\ \mathrm{vec}(X'_{r+1:d_1,1:r});\ \mathrm{vec}(X'_{1:r,r+1:d_2});\ \mathrm{vec}(X'_{r+1:d_1,r+1:d_2})] : X' \in \mathcal{X}'\}$.
For $T_2 = T - T_1$ rounds, invoke LowOFUL (Algorithm 4 in Section H) with arm set $\mathcal{X}'_{vec}$, the low dimension $k = (d_1 + d_2)r - r^2$, $\gamma(T_1) \asymp \frac{d_1 d_2 (d_1+d_2) r}{T_1 \omega_r^2}$, and $B, B_\perp, \lambda, \lambda_\perp$.
Stage 1. We are inspired by a line of work on low-rank matrix recovery using a nuclear-norm penalty with squared loss (Wainwright, 2019). The learner pulls arms $X_t \in \mathcal{X}$ according to distribution $\mathcal{D}$ and observes the rewards $y_t$ up to horizon $T_1$, then uses $\{X_t, y_t\}_{t=1}^{T_1}$ to solve the nuclear-norm penalized least squares problem in (5), obtaining an estimate $\widehat{\Theta}$ of $\Theta^*$. Notably, instead of invoking an NP-hard problem in stage 1 as ESTR does, the optimization problem (5) in LowESTR is convex and can thus be solved easily using standard gradient-based methods. Assumption 2 guarantees that $\|\widehat{\Theta} - \Theta^*\|_F^2 \asymp \frac{d_1 d_2 (d_1+d_2) r}{T_1}$ (Theorem 15 in Section E). We obtain the estimated row/column subspaces of $\Theta^*$ simply by running an SVD step.
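A sketch of this stage-1 computation, assuming (5) is solved by proximal gradient descent with singular-value soft-thresholding (any convex solver would do); the function names are ours.

import numpy as np

def svt(M, tau):
    """Singular-value soft-thresholding: the prox operator of tau * ||.||_nuc."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def nuclear_norm_ls(Xs, ys, lam, d1, d2, iters=500):
    """argmin_Theta (1/n) sum_t (y_t - <X_t, Theta>)^2 + lam * ||Theta||_nuc."""
    n = len(ys)
    A = np.stack([X.reshape(-1) for X in Xs])            # n x (d1*d2) design matrix
    y = np.asarray(ys, dtype=float)
    step = n / (2.0 * np.linalg.norm(A, 2) ** 2)         # 1/L with L = 2*||A||_2^2 / n
    theta = np.zeros(d1 * d2)
    for _ in range(iters):
        grad = (2.0 / n) * A.T @ (A @ theta - y)
        theta = svt((theta - step * grad).reshape(d1, d2), step * lam).reshape(-1)
    return theta.reshape(d1, d2)

def estimated_subspaces(Theta_hat, r):
    """SVD step used after stage 1: estimated row/column subspaces and their complements."""
    U, _, Vt = np.linalg.svd(Theta_hat)
    return U[:, :r], U[:, r:], Vt[:r, :].T, Vt[r:, :].T  # U_hat, U_perp, V_hat, V_perp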
Stage 2. In stage 2, we apply the LowOFUL algorithm (Algorithm 4 in Section H) proposed by Jun et al. (2019) in our setting. The key idea is to reduce the problem to a linear bandit and to utilize the estimated subspaces within the standard linear bandit method OFUL (Abbasi-Yadkori et al., 2011).
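For concreteness, here is a sketch of the rotation and vectorization step performed before invoking LowOFUL (the ordering puts the fully complementary block last, which is the part LowOFUL regularizes heavily via $\lambda_\perp$); the function name is ours.

import numpy as np

def rotate_and_vectorize(X, U_hat, U_perp, V_hat, V_perp):
    """Return the vectorized rotated arm, with the complementary-subspace coordinates last."""
    r1, r2 = U_hat.shape[1], V_hat.shape[1]
    Xp = np.hstack([U_hat, U_perp]).T @ X @ np.hstack([V_hat, V_perp])   # X' = [U_hat U_perp]^T X [V_hat V_perp]
    return np.concatenate([
        Xp[:r1, :r2].ravel(),     # top-left r x r block
        Xp[r1:, :r2].ravel(),     # rows from the complementary row space
        Xp[:r1, r2:].ravel(),     # columns from the complementary column space
        Xp[r1:, r2:].ravel(),     # fully complementary block -> heavily regularized in LowOFUL
    ])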
We now present the overall regret of Algorithm 2.

Theorem 4 (Regret of LowESTR for Low-Rank Bandits). Suppose we run stage 1 of LowESTR with $T_1 \asymp \frac{(d_1+d_2)^{3/2}\sqrt{rT}}{\omega_r}$ and $\lambda_{T_1} \asymp \sqrt{\frac{1}{T_1 \min\{d_1, d_2\}}}$, and invoke LowOFUL in stage 2 with $k = r(d_1 + d_2 - r)$, $\lambda_\perp = \frac{T_2}{k\log(1 + T_2/\lambda)}$, $B = 1$, $B_\perp = \gamma(T_1)$, and the rotated arm set $\mathcal{X}'_{vec}$ defined in Algorithm 2. Then, with probability at least $1 - \delta$, the overall regret of LowESTR is

$R_T = \widetilde{O}\Big(\frac{(d_1+d_2)^{3/2}\sqrt{rT}}{\omega_r}\Big)$.

We believe this "Explore-Subspace-Then-Refine" framework can also be extended to the generalized linear setting. In stage 1, an M-estimator that minimizes the negative log-likelihood plus a nuclear-norm penalty (Fan et al., 2019) can be used instead, while in stage 2, one can revise a standard generalized linear bandit algorithm such as GLM-UCB (Filippi et al., 2010) by leveraging the low-rank knowledge in the same way as LowOFUL. We leave this extension for future work.
Lower Bound for Low-rank Linear Bandit
In this section, we discuss the regret lower bound for the low-rank linear bandit model. Suppose $d_1 = d_2 = d$. We first present an $\Omega(dr\sqrt{T})$ lower bound, which is a straightforward extension of the linear bandit lower bound (Lattimore and Szepesvári, 2018).

Theorem 5 (Lower Bound). Assume $dr \le \sqrt{T}$ and let $\mathcal{X} = \{X \in \mathbb{R}^{d \times d} : \|X\|_F \le 1\}$. Then there exists $\Theta \in \mathbb{R}^{d \times d}$ with $\|\Theta\|_F \le dr/\sqrt{T}$ and $\mathrm{rank}(\Theta) \le r$ such that $\mathbb{E}[R_T(\Theta)] = \Omega(dr\sqrt{T})$.

The above bound is tight when $r = d$, as it matches the standard $d^2$-dimensional linear bandit lower bound, but for small $r$ our upper bound exceeds this lower bound by a factor of $\sqrt{d/r}$.

Nevertheless, we conjecture that $\Omega(d^{3/2}\sqrt{rT})$ is the correct lower bound for small $r$. It is well known that the regret lower bound for the sparse linear bandit problem (dimension $d$, sparsity $s$) is $\Omega(\sqrt{sdT})$ (Lattimore and Szepesvári, 2018). Our problem can be viewed as a $d^2$-dimensional linear bandit problem with $dr$ degrees of freedom in $\Theta^*$. Using the analogy in degrees of freedom between sparse vectors and low-rank matrices, one can plug $d^2$ in for $d$ and $dr$ in for $s$ in the sparse linear bandit lower bound and obtain $\Omega(d^{3/2}\sqrt{rT})$ as our conjectured lower bound.

Experiments

In this section, we compare the performance of OFUL and LowESTR to validate that it is crucial to utilize the low-rank structure. We run our simulations with $d_1 = d_2 = 10$, $r = 1$ and with $d_1 = d_2 = 10$, $r = 3$. In both settings, the true $\Theta^* \in \mathbb{R}^{d_1 \times d_2}$ is a diagonal matrix: for $r = 1$, only the first diagonal entry is nonzero (a fixed value in $(0,1)$), while for $r = 3$, the first three diagonal entries are nonzero values in $(0,1)$ and the rest are zero. For the arms in both settings, we draw 256 vectors from $N(0, I_{d_1 d_2})$ and standardize them by dividing by their 2-norms; we then reshape the standardized $d_1 d_2$-dimensional vectors into $d_1 \times d_2$ matrices and use these matrices as the arm set $\mathcal{X}$. For each arm $X \in \mathcal{X}$, the reward is generated by $y = \langle X, \Theta^*\rangle + \varepsilon$, where $\varepsilon$ is zero-mean Gaussian noise with small variance. We run both algorithms for $T = 3000$ rounds and repeat each simulation setup 100 times to compute the averaged regret and its 1-standard-deviation band at every step. We leave the hyper-parameters of OFUL and LowESTR to the appendix (Section I). Regret comparison plots are displayed in Figure 1.

Figure 1: Regret comparison between OFUL and LowESTR. We plot the averaged cumulative regret with red and blue curves, and the 1-standard-deviation band for each method in the yellow area.

We observe that in both plots LowESTR incurs less regret than OFUL within several hundred time steps. Further, as we increase the rank from $r = 1$ to $r = 3$, the regret gap between the two approaches becomes smaller. This phenomenon is compatible with our theory.

We also conduct simulations to study the sensitivity of LowESTR to $\omega_r$. We observe that LowESTR indeed performs better for large $\omega_r$, which again matches our theory. The detailed description and the plot for this experiment are left to the appendix (Section I).
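For reference, a sketch of the data-generating setup described above; the nonzero diagonal entry (0.8) and the noise standard deviation (0.1) below are placeholder values we chose for illustration, not values taken from the paper.

import numpy as np

rng = np.random.default_rng(4)
d, T, n_arms = 10, 3000, 256
Theta_star = np.zeros((d, d))
Theta_star[0, 0] = 0.8                     # placeholder value; rank(Theta*) = 1
arms = rng.standard_normal((n_arms, d * d))
arms = (arms / np.linalg.norm(arms, axis=1, keepdims=True)).reshape(n_arms, d, d)

def pull(X, noise_sd=0.1):                 # placeholder noise level
    return np.sum(X * Theta_star) + noise_sd * rng.standard_normal()

means = np.array([np.sum(X * Theta_star) for X in arms])
best = means.max()                         # per-round regret of arm i is best - means[i]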
Conclusion & Future Work

In this paper, we studied the low-rank (generalized) linear bandit problem. We proposed the LowLOC and LowGLOC algorithms for the linear and generalized linear settings, respectively; both enjoy $\widetilde{O}\big((d_1+d_2)^{3/2}\sqrt{rT}\big)$ regret. Further, our efficient algorithm LowESTR achieves $\widetilde{O}\big((d_1+d_2)^{3/2}\sqrt{rT}/\omega_r\big)$ regret under mild conditions on the action set. Several interesting directions are left as future work: (1) We provided some preliminary ideas in Section 6 on how to extend LowESTR to the generalized linear setting; we expect that a similar regret bound can be achieved under certain regularity conditions on the link function. (2) We plan to investigate whether one can design an efficient algorithm whose regret does not depend on $1/\omega_r$. (3) As argued in Section 7, $\Omega\big((d_1+d_2)^{3/2}\sqrt{rT}\big)$ is our conjectured tight lower bound; it would be very interesting to formally prove this.
AT acknowledges the support of NSF CAREER grant IIS-1452099 and an Adobe Data Science ResearchAward.
References
Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits.In
Advances in Neural Information Processing Systems , pages 2312–2320.Abbasi-Yadkori, Y., Pal, D., and Szepesvari, C. (2012). Online-to-confidence-set conversions and applicationto sparse stochastic bandits. In
Artificial Intelligence and Statistics , pages 1–9.Audibert, J.-Y., Munos, R., and Szepesvári, C. (2009). Exploration–exploitation tradeoff using varianceestimates in multi-armed bandits.
Theoretical Computer Science , 410(19):1876–1902.Basri, R. and Jacobs, D. W. (2003). Lambertian reflectance and linear subspaces.
IEEE transactions onpattern analysis and machine intelligence , 25(2):218–233.Candes, E. J. and Plan, Y. (2011). Tight oracle inequalities for low-rank matrix recovery from a minimalnumber of noisy random measurements.
IEEE Transactions on Information Theory , 57(4):2342–2359.Candès, E. J. and Recht, B. (2009). Exact matrix completion via convex optimization.
Foundations ofComputational mathematics , 9(6):717.Cesa-Bianchi, N. and Lugosi, G. (2006).
Prediction, learning, and games . Cambridge university press.Fan, J., Gong, W., and Zhu, Z. (2019). Generalized high-dimensional trace regression via nuclear normregularization.
Journal of econometrics , 212(1):177–202.Filippi, S., Cappe, O., Garivier, A., and Szepesvári, C. (2010). Parametric bandits: The generalized linearcase. In
Advances in Neural Information Processing Systems , pages 586–594.Gopalan, A., Maillard, O.-A., and Zaki, M. (2016). Low-rank bandits with latent mixtures. arXiv preprintarXiv:1609.01508 .Johnson, N., Sivakumar, V., and Banerjee, A. (2016). Structured stochastic linear bandits. arXiv preprintarXiv:1606.05693 .Jun, K.-S., Bhargava, A., Nowak, R., and Willett, R. (2017). Scalable generalized linear bandits: Onlinecomputation and hashing. In
Advances in Neural Information Processing Systems , pages 99–109.9un, K.-S., Willett, R., Wright, S., and Nowak, R. (2019). Bilinear bandits with low-rank structure. In
International Conference on Machine Learning , pages 3163–3172.Katariya, S., Kveton, B., Szepesvári, C., Vernade, C., and Wen, Z. (2017a). Bernoulli rank- bandits forclick feedback. arXiv preprint arXiv:1703.06513 .Katariya, S., Kveton, B., Szepesvari, C., Vernade, C., and Wen, Z. (2017b). Stochastic rank-1 bandits. In Artificial Intelligence and Statistics , pages 392–401.Keshavan, R. H., Montanari, A., and Oh, S. (2010). Matrix completion from noisy entries.
Journal ofMachine Learning Research , 11(Jul):2057–2078.Kveton, B., Szepesvari, C., Rao, A., Wen, Z., Abbasi-Yadkori, Y., and Muthukrishnan, S. (2017). Stochasticlow-rank bandits. arXiv preprint arXiv:1712.04644 .Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules.
Advances in appliedmathematics , 6(1):4–22.Lale, S., Azizzadenesheli, K., Anandkumar, A., and Hassibi, B. (2019). Stochastic linear bandits with hiddenlow rank structure. arXiv preprint arXiv:1901.09490 .Lattimore, T. and Szepesvári, C. (2018). Bandit algorithms. preprint .Loh, P.-L. and Wainwright, M. J. (2011). High-dimensional regression with noisy and missing data: Provableguarantees with non-convexity. In
Advances in Neural Information Processing Systems , pages 2726–2734.McMahan, H. B., Holt, G., Sculley, D., Young, M., Ebner, D., Grady, J., Nie, L., Phillips, T., Davydov, E.,Golovin, D., et al. (2013). Ad click prediction: a view from the trenches. In
Proceedings of the 19th ACMSIGKDD international conference on Knowledge discovery and data mining , pages 1222–1230.Richardson, M., Dominowska, E., and Ragno, R. (2007). Predicting clicks: estimating the click-through ratefor new ads. In
Proceedings of the 16th international conference on World Wide Web , pages 521–530.Wainwright, M. J. (2019).
High-dimensional statistics: A non-asymptotic viewpoint , volume 48. CambridgeUniversity Press.Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and variational inference.
Foundations and Trends® in Machine Learning, 1(1-2):1–305.

A Proof for Theorem 1
Lemma 6 (Covering number for low-rank matrices, modified from (Candes and Plan, 2011)) . Let S r = { Θ ∈ R d × d : rank (Θ) ≤ r, (cid:107) Θ (cid:107) F ≤ } . Then there exists an (cid:15) − net ¯ S r for the Frobenius norm obeying | ¯ S r | ≤ (9 /(cid:15) ) ( d + d +1) r . (6) Proof.
Use SVD decomposition:
Θ = U Σ V T of any Θ ∈ S r obeying (cid:107) Σ (cid:107) F ≤ . We will construct an (cid:15) − netfor S r by covering the set of permissible U, V and Σ . Let D be the set of diagonal matrices with nonnegativediagonal entries and Frobenius norm less than or equal to one. We take ¯ D to be an (cid:15)/ -net for D with | ¯ D | ≤ (9 /(cid:15) ) r . Next, let O d ,r = { U ∈ R d × r : U T U = I } . To cover O d ,r , we use the (cid:107)·(cid:107) , norm defined as (cid:107) U (cid:107) , = max i (cid:107) U i (cid:107) (cid:96) , (7)where U i denotes the i th column of Θ . Let Q d ,r = { U ∈ R d × r : (cid:107) U (cid:107) , ≤ } . It is easy to see that O d ,r ⊂ Q d ,r since the columns of an orthogonal matrix are unit normed. We see that there is an (cid:15)/ -net ¯ O d ,r for O d ,r obeying | ¯ O d ,r | ≤ (9 /(cid:15) ) d r . Similarly, let P d ,r = { V ∈ R d × r : V T V = I } . Define R d ,r = { V ∈ R d × r : (cid:107) V (cid:107) , ≤ } , we have P d ,r ⊂ R d ,r . By the same argument, there is an (cid:15)/ -net ¯ P d ,r for P d ,r obeying | ¯ P d ,r | ≤ (9 /(cid:15) ) d r . We now let ¯ S r = { ¯ U ¯Σ ¯ V T : ¯ U ∈ O d ,r , ¯ V ∈ P d ,r , ¯Σ ∈ ¯ D } , and remarkthat | ¯ S r | ≤ | ¯ O d ,r | | ¯ D || ¯ P d ,r | ≤ (9 /(cid:15) ) ( d + d +1) r . It remains to show that for all Θ ∈ S r , there exists ¯Θ ∈ ¯ S r with (cid:13)(cid:13) Θ − ¯Θ (cid:13)(cid:13) F ≤ (cid:15) .Fix Θ ∈ S r and decompose it as Θ = U Σ V T . Then there exists ¯Θ = ¯ U ¯Σ ¯ V T ∈ ¯ S r with ¯ U ∈ O d ,r , ¯ V ∈ P d ,r , ¯Σ ∈ ¯ D satisfying (cid:13)(cid:13) U − ¯ U (cid:13)(cid:13) , ≤ (cid:15)/ , (cid:13)(cid:13) V − ¯ V (cid:13)(cid:13) , ≤ (cid:15)/ and (cid:13)(cid:13) Σ − ¯Σ (cid:13)(cid:13) F ≤ (cid:15)/ . This gives (cid:13)(cid:13) Θ − ¯Θ (cid:13)(cid:13) F = (cid:13)(cid:13) U Σ V T − ¯ U ¯Σ ¯ V T (cid:13)(cid:13) F (8) = (cid:13)(cid:13) U Σ V T − ¯ U Σ V T + ¯ U Σ V T − ¯ U ¯Σ V T + ¯ U ¯Σ V T − ¯ U ¯Σ ¯ V T (cid:13)(cid:13) F (9) ≤ (cid:13)(cid:13) ( U − ¯ U )Σ V T (cid:13)(cid:13) F + (cid:13)(cid:13) ¯ U (Σ − ¯Σ) V T (cid:13)(cid:13) F + (cid:13)(cid:13) ¯ U ¯Σ( V − ¯ V ) T (cid:13)(cid:13) F . (10)For the first term, since V is an orthogonal matrix, (cid:13)(cid:13) ( U − ¯ U )Σ V T (cid:13)(cid:13) F = (cid:13)(cid:13) ( U − ¯ U )Σ (cid:13)(cid:13) F (11) ≤ (cid:107) Σ (cid:107) F (cid:13)(cid:13) U − ¯ U (cid:13)(cid:13) , ≤ ( (cid:15)/ . (12)Thus we have shown (cid:13)(cid:13) ( U − ¯ U )Σ V T (cid:13)(cid:13) F ≤ (cid:15)/ , by the same argument, we also have (cid:13)(cid:13) ¯ U ¯Σ( V − ¯ V ) T (cid:13)(cid:13) F ≤ (cid:15)/ .For the second term, (cid:13)(cid:13) ¯ U (Σ − ¯Σ) V T (cid:13)(cid:13) F = (cid:13)(cid:13) Σ − ¯Σ (cid:13)(cid:13) F ≤ (cid:15)/ . This completes the proof. Lemma 7 (Online-to-Confidence-Set Conversion (adapted from Theorem 1 in Abbasi-Yadkori et al. (2012))) . Suppose we feed { ( X s , y s ) } ts =1 into an online prediction algorithm which, for all t ≥ , admits a regret sup (cid:107) Θ (cid:107) F ≤ ρ t (Θ) ≤ B t . Let ˆ y s be the prediction at time step s by the online learner. Then, for any δ ∈ (0 , . , with probability at least − δ , we have P ( ∃ t ∈ N such that Θ ∗ / ∈ C t +1 ) ≤ δ, (13) where we define β t ( δ ) = 1 + 2 B t + 32 log (cid:32) √ √ B t δ (cid:33) (14) C t +1 = { Θ ∈ R d × d : (cid:107) Θ (cid:107) F + t (cid:88) s =1 (ˆ y s − (cid:104) Θ , X s (cid:105) ) ≤ β t ( δ ) } . (15)11 emma 8 (Regret of LowLOC Given Online Learner’s Regret (adapted from Theorem 3 in Abbasi-Yadkoriet al. (2012))) . 
Suppose sup (cid:107) Θ (cid:107) F ≤ , rank (Θ) ≤ r ρ t (Θ) ≤ B t , where { B t } Tt =1 is a non-decreasing sequence. Then,for any δ ∈ (0 , . , with probability at least − δ , for any T ≥ , the regret of LowLOC algorithm is boundedas R T = O (cid:32)(cid:115) d d T (1 + β T − ( δ )) log (cid:18) Td d (cid:19)(cid:33) , (16) where β t ( δ ) = 1 + 2 B t + 32 log (cid:16) √ √ B t δ (cid:17) . Lemma 9 (Theorem 3.2 in (Cesa-Bianchi and Lugosi, 2006)) . If the loss function (cid:96) ( a, b ) is exp-concave in itsfirst argument for some η > (i.e. F ( a ) = e − η(cid:96) ( a,b ) is concave for all b ), then the regret of the exponentiallyweighted average forecaster in Equation 3 (used with the same value of η ) satisfies, for all y , . . . , y n ∈ Y ,we have Φ η ( R n ) ≤ Φ η (0) . Lemma 10 (Proposition 3.1 in (Cesa-Bianchi and Lugosi, 2006)) . If for some loss function (cid:96) and for some η > , a forecaster satisfies Φ η ( R n ) ≤ Φ η ( ) for all y , . . . , y n ∈ Y , then the regret of the forecaster isbounded by (cid:98) L n − min i =1 ,...,N L i,n ≤ log( N ) η . (17) Proof of Lemma 2.
Let y t = (cid:104) X t , Θ ∗ (cid:105) + η t . By subgaussian property, we have, for < δ < , P (cid:32) max t =1 ,...,T | y t | > (cid:115) (cid:18) Tδ (cid:19)(cid:33) ≤ δ. (18)Let’s denote above high probability event (cid:110) max t =1 ,...,T | y t | ≤ (cid:113) (cid:0) Tδ (cid:1)(cid:111) by G , denote the onlineprediction at every round by ˆ y t . Define the ε -covering set for S r := { Θ : (cid:107) Θ (cid:107) F ≤ , rank (Θ) ≤ r } by ¯ S r , which means, for any Θ ∈ S r , there exists a ¯Θ ∈ ¯ S r , such that (cid:13)(cid:13) Θ − ¯Θ (cid:13)(cid:13) F ≤ ε . We prove that | ¯ S r | ≤ (9 /ε ) ( d + d +1) r in Lemma 6.One can easily show that F ( a ) := e − η ( a − b ) is concave in a for all | b | ≤ (cid:113) (cid:0) Tδ (cid:1) (this holds underevent G ) by choosing η = (cid:113) ( Tδ ) ) , since a refers to the prediction of exponential weighted averageforcaster and thus we have | a | ≤ according to the construction. So under event G , the squared loss (cid:96) isguaranteed to be exp-concave under above η and Lemma 10 can be applied here.12e now bound the regret under event G . For an arbitrary Θ ∈ S r , ρ T (Θ) = (cid:88) Tt =1 ( (cid:96) t ( (cid:98) y t ) − (cid:96) t ( f Θ ,t )) (19) = T (cid:88) t =1 (cid:0) (cid:96) t ( (cid:98) y t ) − (cid:96) t ( f ¯Θ ,t ) + (cid:96) t ( f ¯Θ ,t ) − (cid:96) t ( f Θ ,t ) (cid:1) where (cid:13)(cid:13) Θ − ¯Θ (cid:13)(cid:13) F ≤ (cid:15), ¯Θ ∈ ¯ S r (20) ≤ log | ¯ S r | η + T (cid:88) t =1 (cid:0) (cid:96) t ( f ¯Θ ,t ) − (cid:96) t ( f Θ ,t ) (cid:1) by Lemma (21) = log | ¯ S r | η + T (cid:88) t =1 (cid:0) ( (cid:104) ¯Θ , X t (cid:105) − y t ) − ( (cid:104) Θ , X t (cid:105) − y t ) (cid:1) (22) ≤ log | ¯ S r | η + T (cid:88) t =1 (cid:0) (cid:13)(cid:13) Θ − ¯Θ (cid:13)(cid:13) F + 2 y t (cid:13)(cid:13) Θ − ¯Θ (cid:13)(cid:13) F (cid:1) (23) ≤ log | ¯ S r | η + 2 T ε + 2
T ε (cid:115) (cid:18) Tδ (cid:19) (24) = 2( d + d + 1) r log( 9 ε ) (cid:32) (cid:115) (cid:18) Tδ (cid:19)(cid:33) + 2 T ε + 2
T ε (cid:115) (cid:18) Tδ (cid:19) (25) = O (cid:18) ( d + d ) r log( T ) log (cid:18) Tδ (cid:19)(cid:19) set ε = 1 /T. (26)Above bounds hold for all Θ ∈ S r . This completes the proof. Proof of Theorem 1.
To obtain Theorem 1, one just needs to plug Lemma 2 into Lemma 8.
B Proof for Theorem 3
Lemma 11 (Online-to-Confidence-Set Conversion with NLL loss) . Suppose we feed { ( X s , y s ) } ts =1 into anonline prediction algorithm which, for all t ≥ , admits a regret under negative log likelihood (NLL) loss sup (cid:107) Θ (cid:107) F ≤ ρ GLB t (Θ) ≤ B t . Let ˆ y s be the prediction at time step s by the online learner. Then, for any δ ∈ (0 , . , with probability at least − δ , we have P ( ∃ t ∈ N such that Θ ∗ / ∈ C t +1 ) ≤ δ, (27) where C t = { Θ ∈ R d × d : (cid:107) Θ (cid:107) F + (cid:80) ts =1 (ˆ y s − (cid:104) Θ ∗ , X s (cid:105) ) ≤ β GLB t ( δ ) } and β GLB t ( δ ) = 2 + κ µ B t + R κ µ log R (cid:114) κ µ + (cid:113) κµ B t +1 δ .Proof. According to the definition of ρ GLB t ( · ) , we have B t ≥ ρ GLB t (Θ ∗ ) (28) = t (cid:88) s =1 (cid:96) s (ˆ y s ) − (cid:96) s ( (cid:104) Θ ∗ , X s (cid:105) ) (29) ≥ t (cid:88) s =1 (ˆ y s − (cid:104) Θ ∗ , X s (cid:105) ) (cid:96) (cid:48) s ( (cid:104) Θ ∗ , X s (cid:105) ) + κ µ y s − (cid:104) Θ ∗ , X s (cid:105) ) (Taylor expansion of (cid:96) s at (cid:104) Θ ∗ , X s (cid:105) ) = t (cid:88) s =1 (ˆ y s − (cid:104) Θ ∗ , X s (cid:105) )( − η s ) + κ µ y s − (cid:104) Θ ∗ , X s (cid:105) ) . (30)13hus, rearranging the terms, we have t (cid:88) s =1 (ˆ y s − (cid:104) Θ ∗ , X s (cid:105) ) ≤ κ µ B t + 2 κ µ t (cid:88) s =1 η s (ˆ y s − (cid:104) Θ ∗ , X s (cid:105) ) . (31)The remaining proof simply follows the proof of Lemma 7. One can easily conclude that for any δ ∈ (0 , . ,with probability at least − δ t (cid:88) s =1 (ˆ y s − (cid:104) Θ ∗ , X s (cid:105) ) ≤ κ µ B t + 32 R κ µ log R (cid:113) κ µ + (cid:113) κ µ B t + 1 δ . (32)Adding (cid:107) Θ ∗ (cid:107) F on both sides and using the fact that (cid:107) Θ ∗ (cid:107) F ≤ , we complete the proof. Lemma 12 (Regret of LowGLOC Given Online Learner’s Regret) . Suppose sup (cid:107) Θ (cid:107) F ≤ ρ GLB T (Θ) ≤ B GLB T .Then, for any δ ∈ (0 , . , with probability at least − δ , for any T ≥ , the regret of LowGLOC algorithmis bounded by R T = O (cid:32) L (cid:115) β GLB T − ( δ ) T d d log (cid:18) Td d (cid:19)(cid:33) , (33) where β GLB t ( δ ) = 2 + κ µ B GLB t + R κ µ log R (cid:114) κ µ + (cid:113) κµ B GLB t +1 δ ∀ t .Proof. Define V t − = I + (cid:80) t − s =1 vec ( X s ) T vec ( X s ) and (cid:98) Θ t = argmin Θ ∈ R d × d (cid:32) (cid:107) Θ (cid:107) F + t − (cid:88) s =1 (ˆ y s − (cid:104) Θ , X s (cid:105) ) (cid:33) . (34)One can express C t − as { Θ ∈ R d × d : vec (Θ − (cid:98) Θ t ) T V t − vec (Θ − (cid:98) Θ t ) + (cid:13)(cid:13)(cid:13) (cid:98) Θ t (cid:13)(cid:13)(cid:13) F + t − (cid:88) s =1 (ˆ y s − (cid:104) Θ , X s (cid:105) ) ≤ β t − ( δ ) } . (35)Thus, C t − is contained in a bigger ellipsoid C t − ⊆ { Θ ∈ R d × d : vec (Θ − (cid:98) Θ t ) T V t − vec (Θ − (cid:98) Θ t ) ≤ β t − ( δ ) } . (36)Now consider the regret at round t , µ ( (cid:104) X ∗ , Θ ∗ (cid:105) ) − µ ( (cid:104) X t , Θ ∗ (cid:105) ) ≤ L µ | ( (cid:104) X ∗ , Θ ∗ (cid:105) − (cid:104) X t , Θ ∗ (cid:105) ) | (37) ≤ L µ (cid:16) (cid:104) X t , (cid:101) Θ t − Θ ∗ (cid:105) (cid:17) (38) ≤ L µ |(cid:104) X t , (cid:101) Θ t − (cid:98) Θ (cid:105)| + L µ |(cid:104) X t , (cid:98) Θ t − Θ ∗ (cid:105)| (39) ≤ L µ (cid:112) β t − ( δ ) (cid:107) vec ( X t ) (cid:107) V − t − (Cauchy Schwartz) . 
(40)14ince the regret at every step cannot be bigger than L , R T = T (cid:88) t =1 µ ( (cid:104) X ∗ , Θ ∗ (cid:105) ) − µ ( (cid:104) X t , Θ ∗ (cid:105) ) (41) = T (cid:88) t =1 min (cid:110) L µ , L µ (cid:112) β t − ( δ ) (cid:107) vec ( X t ) (cid:107) V − t − (cid:111) (42) = 2 L µ (cid:112) β t − ( δ ) T (cid:88) t =1 min (cid:26) β t − ( δ ) , (cid:107) vec ( X t ) (cid:107) V − t − (cid:27) (43) ≤ L µ (cid:112) β t − ( δ ) (cid:118)(cid:117)(cid:117)(cid:116) T T (cid:88) t =1 min (cid:26) β t − ( δ ) , (cid:107) vec ( X t ) (cid:107) V − t − (cid:27) (44) ≤ L µ (cid:112) β t − ( δ ) (cid:118)(cid:117)(cid:117)(cid:116) T T (cid:88) t =1 min (cid:110) , (cid:107) vec ( X t ) (cid:107) V − t − (cid:111) ( β t − ( δ ) is greater than 1) (45) ≤ L µ (cid:112) β t − ( δ ) (cid:115) T d d log (cid:18) Td d (cid:19) (46) = O (cid:32) L µ (cid:115) β t − ( δ ) T d d log (cid:18) Td d (cid:19)(cid:33) . (47) Lemma 13 (Regret of EW under NLL Loss) . Let EW parameter η := κ µ (cid:18)(cid:113) R log ( Tδ ) +2 c µ +2 L µ (cid:19) . Then, forany < δ < , with probability at least − δ , the regret of EW with expert predictions f Θ ,t = (cid:104) Θ , X t (cid:105) underNLL loss satisfies B GLB T = sup (cid:107) Θ (cid:107) F ≤ , rank (Θ) ≤ r ρ GLB T (Θ) = O (cid:32) ( d + d ) r log T log (cid:0) Tδ (cid:1) L µ + c µ + L µ κ µ (cid:33) (48) = (cid:101) O (cid:32) L µ + c µ κ µ ( d + d ) r log (cid:18) δ (cid:19)(cid:33) . (49) Proof.
Under generalized linear bandit model, y t = µ ( (cid:104) X t , Θ ∗ (cid:105) )+ η t . By subgaussian property and | µ ( (cid:104) X t , Θ ∗ (cid:105) ) | ≤| µ (0) | + L µ |(cid:104) X t , Θ ∗ (cid:105)| ≤ c µ + L µ , for < δ < , we have P (cid:32) max t =1 ,...,T | y t | > c µ + L µ + (cid:115) R log (cid:18) Tδ (cid:19)(cid:33) ≤ δ. (50)Again we denote above high probability event by G , denote the exponential weighted average forecaster atevery round by ˆ y t . We use the same definition S r and ¯ S r as last section.We use Lemma 9 and Lemma 10 to bound ρ GLB T (Θ) . Then the first step is to find a proper η > suchthat F (ˆ y t ) := e (cid:96) (ˆ y t ,y t ) = e − ηm (ˆ y t )+ η ˆ y t y t is concave. Taking derivatives we have, F (cid:48)(cid:48) (ˆ y t ) = ηe − ηm (ˆ y t )+ η ˆ y t y t (cid:0) η ( y t − µ (ˆ y t )) − µ (cid:48) (ˆ y t ) (cid:1) . (51)Under event G , it’s easy to show that µ (cid:48) (ˆ y t )( y t − µ (ˆ y t )) ≥ κ µ (cid:16)(cid:113) R log (cid:0) Tδ (cid:1) + 2 c µ + 2 L µ (cid:17) , (52)15ince | µ (ˆ y t ) | ≤ | µ (0) | + L | ˆ y t | ≤ c µ + L µ . Thus, taking η := κ µ (cid:18)(cid:113) R log ( Tδ ) +2 c µ +2 L µ (cid:19) , F ( · ) is guaranteed tobe concave with probability under event G . ρ GLB T (Θ) = T (cid:88) t =1 ( (cid:96) t (ˆ y t ) − (cid:96) t ( (cid:104) Θ , X t (cid:105) )) (53) ≤ T (cid:88) t =1 (cid:0) (cid:96) t (ˆ y t ) − (cid:96) t ( (cid:104) ¯Θ , X t (cid:105) ) + (cid:96) t ( (cid:104) ¯Θ , X t (cid:105) ) − (cid:96) t ( (cid:104) Θ , X t (cid:105) ) (cid:1) where (cid:13)(cid:13) Θ − ¯Θ (cid:13)(cid:13) F ≤ ε and ¯Θ ∈ ¯ S r (54) ≤ log | ¯ S r | η + T (cid:88) t =1 (cid:0) (cid:96) t ( (cid:104) ¯Θ , X t (cid:105) ) − (cid:96) t ( (cid:104) Θ , X t (cid:105) ) (cid:1) (55) ≤ log | ¯ S r | η + T (cid:88) t =1 (cid:104) Θ − ¯Θ , X t (cid:105) y t + m ( (cid:104) ¯Θ , X t (cid:105) ) − m ( (cid:104) Θ , X t (cid:105) ) (56) ≤ log | ¯ S r | η + T (cid:88) t =1 | y t | (cid:13)(cid:13) Θ − ¯Θ (cid:13)(cid:13) F + |(cid:104) ¯Θ − Θ , X t (cid:105) ( c µ + L µ ) | (By Taylor expansion) (57) ≤ ( d + d + 1) r log (cid:18) ε (cid:19) (cid:16)(cid:113) R log (cid:0) Tδ (cid:1) + 2 c µ + 2 L µ (cid:17) κ µ (58) + T (cid:32) c µ + 2 L µ + (cid:115) R log (cid:18) Tδ (cid:19)(cid:33) ε (59) = O (cid:32) ( d + d ) r log T log (cid:0) Tδ (cid:1) L µ + c µ + L µ κ µ (cid:33) , (60)where we take ε = 1 /T . Proof for Theorem 3.
One only needs to plug Lemma 13 into Lemma 12.
C Proof for Theorem 4
The whole proof breaks down to two parts. Let Θ ∗ = U ∗ S ∗ V ∗ T be the SVD of Θ ∗ . In the first part, weprove the convergence of estimated matrix (cid:98) Θ for Θ ∗ , (cid:98) U for U ∗ , and (cid:98) V for V ∗ . In the second part, we plugthe convergence result into the regret guarantee for LowOFUL in Jun et al. (2019) to achieve our final result. C.1 Analysis for Stage 1
In order to analyze how the estimated subspaces are close to the true subspaces, we first present the definitionsfor sub-Gaussian matrix and restricted strong convexity (RSC) as below.
Definition 1 (sub-Gaussian matrix (See Wainwright (2019))) . A random matrix Z ∈ R n × p is sub-Gaussianwith parameters (Σ , σ ) if: • each row z Ti ∈ R p is sampled independently from a zero-mean distribution with covariance Σ , and • for any unit vector u ∈ R p , the random variable u T z i is sub-Gaussian with parameter at most σ .16 efinition 2 (Restricted Strong Convexity (RSC) (Wainwright, 2019)) . For a given norm (cid:107)·(cid:107) , regularizer Φ( · ) , and X , . . . , X n ∈ R d × d , the matrix (cid:98) Γ = n (cid:101) X T (cid:101) X , where ˜ x i := vec ( X i ) and (cid:101) X := [˜ x T ; . . . ; ˜ x Tn ] , satisfiesa restricted strong convexity (RSC) condition with curvature κ > and tolerance τ n if (cid:101) ∆ T (cid:98) Γ (cid:101) ∆ = 1 n n (cid:88) t =1 (cid:104) X t , ∆ (cid:105) ≥ κ (cid:107) ∆ (cid:107) − τ n Φ (∆) , (61)for all ∆ ∈ R d × d , and we denote vec (∆) by (cid:101) ∆ .We prove the following theorem about distribution D (see Assumption 2) as below, see proof in Section D. Theorem 14 (Distribution D satisfies RSC) . Sample X , . . . , X n ∈ R d × d from X according to D , anddefine ˜ x i := vec ( X i ) , (cid:101) X = [˜ x T ; . . . ; ˜ x Tn ] ∈ R n × d d and (cid:98) Γ := n (cid:101) X T (cid:101) X . Then under Assumption 2, there existsconstants c , c > , such that with probability − δ , (cid:101) Θ T (cid:98) Γ (cid:101) Θ = 1 n n (cid:88) i =1 (cid:104) X i , Θ (cid:105) ≥ c d d (cid:107) Θ (cid:107) F − c ( d + d ) nd d (cid:107) Θ (cid:107) nuc , ∀ Θ ∈ R d × d , (62) for n = Ω (cid:0) ( d + d ) log (cid:0) δ (cid:1)(cid:1) , where (cid:101) Θ := vec (Θ) . Theorem 14 states that sampling X from X according to distribution D guarantees that the sampledarms satisfies RSC condition. We further show that under RSC condition, the estimated (cid:98) Θ is guaranteed toconverge to Θ at a fast rate in Theorem 15. Theorem 15.
Sample X , . . . , X n ∈ R d × d from X according to D . Then under Assumption 2, any optimalsolution to the nuclear norm optimization problem 5 using λ n (cid:16) n min { d ,d } log (cid:0) nδ (cid:1) log (cid:0) d + d δ (cid:1) satisfies: (cid:13)(cid:13)(cid:13) (cid:98) Θ − Θ ∗ (cid:13)(cid:13)(cid:13) F (cid:16) ( d + d ) rn , (63) with probability − δ . The goal of stage 1 is to estimate the row/column subspaces of Θ ∗ , below corollary characterizes theirconvergence. Corollary 16 (adapted from Jun et al. (2019)) . Suppose we compute (cid:98) Θ by solving the convex problem inEquation 5 as an estimate of the matrix Θ ∗ . After stage 1 of ESTR with T = Ω ( r ( d + d )) satisfying thecondition of Theorem 15, we have, with probability at least − δ , (cid:13)(cid:13)(cid:13) (cid:98) U T ⊥ U ∗ (cid:13)(cid:13)(cid:13) F (cid:13)(cid:13)(cid:13) (cid:98) V T ⊥ V ∗ (cid:13)(cid:13)(cid:13) F ≤ (cid:13)(cid:13)(cid:13) Θ ∗ − (cid:98) Θ (cid:13)(cid:13)(cid:13) F ω r ≤ C λ T rα ω r := γ ( T ) (cid:16) ( d + d ) rT ω r , (64) where ω r > denotes the lower bound of the r -th singular value of Θ ∗ and C represents some constant. C.2 Analysis for Stage 2
We present the useful lemmas proved in Jun et al. (2017) and combine them with our analysis of stage 1 toachieve the final result of Theorem 4.
Lemma 17 (Corollary 1 in (Jun et al., 2019)) . The regret of LowOFUL with λ ⊥ = Tk log ( Tλ ) is, withprobability at least − δ , (cid:101) O (cid:16)(cid:16) k + √ kλB + √ T B ⊥ (cid:17) √ T (cid:17) . (65)17 emma 18 (Modified from Theorem 5 in (Jun et al., 2019)) . Suppose we run ESTR stage 1 with T =Ω ( r ( d + d )) . We invoke LowOFUL in stage 2 with λ ⊥ = T k log(1+ T /λ ) , B = 1 , B ⊥ = γ ( T ) , the rotated armsets X (cid:48) vec defined in LowESTR (Algorithm 2). With probability − δ , the regret of LowESTR is boundedby (cid:101) O (cid:18) T + T · ( d + d ) rT ω r (cid:19) . (66) Proof.
Combining Lemma 17 and definitions of parameters B , B ⊥ , λ , λ ⊥ and γ ( T ) . Proof for Theorem 4.
Suppose the assumptions in Lemma 18 hold. Setting T = Θ (cid:16) ( d + d ) / √ rT ω r (cid:17) inLemma 18 leads to the regret (cid:101) O (cid:18) ( d + d ) / √ rT ω r (cid:19) . (67) D Proof for Theorem 14
Throughout this proof, we use Σ and σ to denote the sub-Gaussian parameters defined in Definition 1 formatrix (cid:101) X in the theorem. D.1 Useful Lemmas
Lemma 19.
For any constant s ≥ , we have B nuc ( √ s ) ∩ B F (1) ⊆ cl { conv { B rank ( s ) ∩ B F (1) }} , (68) where the balls are taken in R d × d , and cl {·} and conv {·} denote the topological closure and convex hull,respectively.Proof. Note that when s > min { d , d } , the statement is trivial, since the right-hand set equals R F (3) , andthe left-hand set is contained in B F (1) . Hence, we will assume ≤ s ≤ min { d , d } .Let A, B ⊆ R d × d be closed convex sets, with support function given by φ A ( z ) = sup Θ ∈ A (cid:104) Θ , z (cid:105) and φ B similarly defined. It is well-known that φ A ( z ) ≤ φ B ( z ) if and only if A ⊆ B . We will now check thiscondition for the pair of sets A = B nuc ( √ s ) ∩ B F (1) and B = 3 cl { conv { B rank ( s ) ∩ B F (1) }} .For any z ∈ R d × d , take r := min { d , d } , we have z = U Σ V T by SVD, where U ∈ R d × r , Σ ∈ R r × r ,and V ∈ R d × r . Let S ⊆ { , . . . , r } be subset indexes for the top (cid:98) s (cid:99) elements of diag (Σ) . We use U S and V S to denote submatrices of U and V with columns of indices in S and use Σ S to denote the submatrix of Σ with columns and rows of indices in S . Then we can write z = U S Σ S V TS + U ⊥ S Σ ⊥ S V ⊥ TS .Consider φ A ( z ) below: φ A ( z ) = sup Θ ∈ A (cid:104) Θ , U S Σ S V TS + U ⊥ S Σ ⊥ S V ⊥ TS (cid:105) (69) ≤ sup (cid:107) U S U TS Θ (cid:107) F ≤ (cid:104) U S U TS Θ , U S Σ S V TS (cid:105) + sup (cid:107) U ⊥ S U ⊥ TS Θ (cid:107) nuc ≤√ s (cid:104) U ⊥ S U ⊥ TS Θ , U ⊥ S Σ ⊥ S V ⊥ TS (cid:105) (70) ≤ (cid:13)(cid:13) U S Σ S V TS (cid:13)(cid:13) F + √ s (cid:13)(cid:13) U ⊥ S Σ ⊥ S V ⊥ TS (cid:13)(cid:13) op by Holder inequality (71) ≤ (cid:13)(cid:13) U S Σ S V TS (cid:13)(cid:13) F + √ s (cid:98) s (cid:99) (cid:13)(cid:13) U S Σ S V TS (cid:13)(cid:13) nuc ≤ (cid:13)(cid:13) U S Σ S V TS (cid:13)(cid:13) F . (72)Finally, note that φ B ( z ) = sup Θ ∈ B (cid:104) Θ , z (cid:105) = 3 max | S | = (cid:98) s (cid:99) sup (cid:107) U S U TS Θ (cid:107) F ≤ (cid:104) U S U TS Θ , U S Σ S V TS (cid:105) = 3 (cid:13)(cid:13) U S Σ S V TS (cid:13)(cid:13) F ,from which the claim follows. 18 efinition 3. Define K ( s ) := B rank ( s ) ∩ B F (1) and the cone set C ( s ) := { v : (cid:107) v (cid:107) nuc ≤ √ s (cid:107) v (cid:107) F } , all matricesdefined in these sets are in R d × d . Lemma 20.
For a fixed matrix Γ ∈ R d d × d d , parameter s ≥ , and tolerance δ > , suppose we have thedeviation condition (˜ v := vec ( v )) | ˜ v T Γ˜ v | ≤ δ, ∀ v ∈ K (2 s ) , (73) where K (2 s ) is defined in Definition 3. Then | ˜ v T Γ˜ v | ≤ δ ( (cid:107) v (cid:107) F + 1 s (cid:107) v (cid:107) nuc ) , ∀ v ∈ R d × d . (74) Proof.
We begin by establishing the inequalities | ˜ v T Γ˜ v | ≤ δ (cid:107) v (cid:107) F , ∀ v ∈ C ( s ) , (75) | ˜ v T Γ˜ v | ≤ δs (cid:107) v (cid:107) nuc , ∀ v / ∈ C ( s ) , (76)where C ( s ) is defined in Definition 3, the statement of this lemma then follows immediately. By rescaling,inequality 75 follows if we can show that | ˜ v T Γ˜ v | ≤ δ for all v such that (cid:107) v (cid:107) F = 1 and (cid:107) v (cid:107) nuc ≤ √ s. (77)By Lemma 19 and continuity, we further reduce the problem to proving the bound 77 for all vectors v ∈ conv { K ( s ) } = conv { B rank ( s ) ∩ B F (3) } . Consider a weighted lienar combination of the form v = (cid:80) i α i v i ,with weights α i ≥ such that (cid:80) i α i = 1 , and rank ( v i ) ≤ s and (cid:107) v i (cid:107) F ≤ for each i . We can write ˜ v Γ˜ v = (cid:88) i,j α i α j (˜ v Ti Γ˜ v j ) . (78)Applying inequality 74 to the vectors v i / , v j / and ( v i + v j ) / , we have | ˜ v Ti Γ˜ v j | = 12 | (˜ v i + ˜ v j ) T Γ(˜ v i + ˜ v j ) − ˜ v Ti Γ˜ v i − ˜ v Tj Γ˜ v j | ≤
$$|\tilde{v}_i^T \Gamma \tilde{v}_j| = \frac{1}{2} \left| (\tilde{v}_i + \tilde{v}_j)^T \Gamma (\tilde{v}_i + \tilde{v}_j) - \tilde{v}_i^T \Gamma \tilde{v}_i - \tilde{v}_j^T \Gamma \tilde{v}_j \right| \le \frac{1}{2}(36 + 9 + 9)\delta = 27\delta \qquad (79)$$
for all $i, j$, and hence $|\tilde{v}^T \Gamma \tilde{v}| \le \sum_{i,j} \alpha_i \alpha_j (27\delta) = 27\delta\, \|\alpha\|_1^2 = 27\delta$, establishing inequality (75). Now let us turn to inequality (76). For $v \notin C(s)$, we have
$$\frac{|\tilde{v}^T \Gamma \tilde{v}|}{\|v\|_{\mathrm{nuc}}^2} \le \frac{1}{s} \sup_{\|u\|_{\mathrm{nuc}} \le \sqrt{s},\, \|u\|_F \le 1} |\tilde{u}^T \Gamma \tilde{u}| \le \frac{27\delta}{s}, \qquad (80)$$
where the first inequality follows by the substitution $u = \sqrt{s}\, v / \|v\|_{\mathrm{nuc}}$ (which satisfies $\|u\|_{\mathrm{nuc}} = \sqrt{s}$ and $\|u\|_F \le 1$ since $v \notin C(s)$), and the second by the same argument used for inequality (75). Rearranging the above inequality establishes inequality (76).

Lemma 21 (RSC condition). Suppose $s \ge 1$ and $\widehat{\Gamma}$ is an estimator of $\Sigma$ satisfying the deviation condition (with $\tilde{v} := \mathrm{vec}(v)$)
$$|\tilde{v}^T (\widehat{\Gamma} - \Sigma) \tilde{v}| \le \frac{\lambda_{\min}(\Sigma)}{54}, \quad \forall v \in K(2s), \qquad (81)$$
where $K(2s)$ is defined in Definition 3. Then we have the RSC condition
$$\tilde{v}^T \widehat{\Gamma} \tilde{v} \ge \frac{\lambda_{\min}(\Sigma)}{2} \|v\|_F^2 - \frac{\lambda_{\min}(\Sigma)}{2s} \|v\|_{\mathrm{nuc}}^2. \qquad (82)$$

Proof. This result follows easily from Lemma 20. Setting
$\Gamma = \widehat{\Gamma} - \Sigma$ and $\delta = \lambda_{\min}(\Sigma)/54$ in Lemma 20, we have the bound
$$|\tilde{v}^T (\widehat{\Gamma} - \Sigma) \tilde{v}| \le \frac{\lambda_{\min}(\Sigma)}{2} \left( \|v\|_F^2 + \frac{1}{s} \|v\|_{\mathrm{nuc}}^2 \right). \qquad (83)$$
Then
$$\tilde{v}^T \widehat{\Gamma} \tilde{v} \ge \tilde{v}^T \Sigma \tilde{v} - \frac{\lambda_{\min}(\Sigma)}{2} \left( \|v\|_F^2 + \frac{1}{s} \|v\|_{\mathrm{nuc}}^2 \right) \qquad (84)$$
$$\ge \frac{\lambda_{\min}(\Sigma)}{2} \|v\|_F^2 - \frac{\lambda_{\min}(\Sigma)}{2s} \|v\|_{\mathrm{nuc}}^2, \qquad (85)$$
where the last inequality follows from $\tilde{v}^T \Sigma \tilde{v} \ge \lambda_{\min}(\Sigma)\, \|v\|_F^2$.
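As a numerical illustration of the RSC conclusion (82) (not used in the analysis), the following sketch forms an empirical second-moment estimate $\widehat{\Gamma}$ from i.i.d. samples and checks the inequality on random low-rank test matrices; the dimensions, sample size, and covariance below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d1, d2, s, n = 5, 4, 2, 20000
p = d1 * d2

# Ground-truth covariance Sigma of vec(X) and its empirical estimate Gamma_hat.
A = rng.standard_normal((p, p))
Sigma = A @ A.T / p + 0.5 * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Gamma_hat = X.T @ X / n
lam_min = np.linalg.eigvalsh(Sigma)[0]

# Check the RSC-type inequality (82) on random low-rank directions v.
ok = True
for _ in range(200):
    v = rng.standard_normal((d1, s)) @ rng.standard_normal((s, d2))
    vt = v.reshape(-1)
    lhs = vt @ Gamma_hat @ vt
    rhs = 0.5 * lam_min * np.linalg.norm(v, 'fro') ** 2 \
        - 0.5 * lam_min / s * np.linalg.norm(v, 'nuc') ** 2
    ok &= lhs >= rhs
print("RSC inequality held on all test directions:", ok)
```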
D.2 Proof for Theorem 14

Proof.
Using the results in Lemma 21, together with the substitutions $\widehat{\Gamma} - \Sigma = \frac{1}{n} \widetilde{X}^T \widetilde{X} - \Sigma$ and
$$s := \frac{1}{c}\, \frac{n}{d_1 + d_2}\, \min\left\{ \frac{\lambda_{\min}(\Sigma)^2}{\sigma^4},\, 1 \right\}, \qquad (86)$$
where $n \ge c\,(d_1 + d_2) / \min\{\lambda_{\min}(\Sigma)^2/\sigma^4, 1\}$ so that $s \ge 1$, we see that it suffices to show that
$$D(s) := \sup_{v \in K(2s)} |\tilde{v}^T (\widehat{\Gamma} - \Sigma) \tilde{v}| \le \frac{\lambda_{\min}(\Sigma)}{54} \qquad (87)$$
with high probability.

By a modified version of Lemma 15 (in Appendix G) of Loh and Wainwright (2011) — we simply replace the $1/3$-covering set for sparse vectors by a $1/3$-covering set for $K(2s)$, whose covering number is at most $27^{2s(d_1+d_2+1)}$ by Lemma 6 — we obtain
$$\mathbb{P}(D(s) \ge t) \le 2 \exp\left( -c'\, n\, \min\left\{ \frac{t^2}{\sigma^4}, \frac{t}{\sigma^2} \right\} + 2s(d_1 + d_2 + 1) \log 27 \right), \qquad (88)$$
for some universal constant $c' > 0$. Setting $t = \lambda_{\min}(\Sigma)/54$, we see that there exists some $c > 0$ such that
$$\mathbb{P}\left( D(s) \ge \frac{\lambda_{\min}(\Sigma)}{54} \right) \le 2 \exp\left( -c\, n\, \min\left\{ \frac{\lambda_{\min}(\Sigma)^2}{\sigma^4},\, 1 \right\} \right), \qquad (89)$$
which establishes the result. Setting $\delta$ equal to the right-hand side of the last inequality, one obtains the desired guarantee on $n$ in Theorem 14.

E Proof for Theorem 15
E.1 Useful Lemmas
Lemma 22 (Convergence under RSC, adapted from Proposition 10.1 in Wainwright (2019)). Suppose the observations $X_1, \dots, X_n$ satisfy the non-scaled RSC condition in Definition 2, i.e.,
$$\frac{1}{n} \sum_{t=1}^{n} \langle X_t, \Theta \rangle^2 \ge \kappa\, \|\Theta\|_F^2 - \tau_n\, \|\Theta\|_{\mathrm{nuc}}^2, \quad \forall \Theta \in \mathbb{R}^{d_1 \times d_2}. \qquad (90)$$
Then, under the event $G := \left\{ \left\| \frac{1}{n} \sum_{t=1}^{n} \eta_t X_t \right\|_{\mathrm{op}} \le \frac{\lambda_n}{2} \right\}$, any optimal solution $\widehat{\Theta}$ to Equation 5 satisfies the bound
$$\left\| \widehat{\Theta} - \Theta^* \right\|_F^2 \le \frac{4.5\, \lambda_n^2\, r}{\kappa^2}, \qquad (91)$$
where $r = \mathrm{rank}(\Theta^*)$, provided the tolerance satisfies $32\, r\, \tau_n \le \kappa$.
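Lemma 22 applies to any optimal solution of Equation 5, i.e., the nuclear-norm-penalized least-squares program used in the exploration stage. The sketch below is a minimal proximal-gradient (singular-value-thresholding) solver for such a program, assuming Equation 5 has the form $\min_\Theta \frac{1}{n}\sum_t (y_t - \langle X_t, \Theta\rangle)^2 + \lambda_n \|\Theta\|_{\mathrm{nuc}}$; the iteration count, penalty level, and problem sizes are illustrative assumptions, not the paper's prescribed values.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: prox operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def solve_nuclear_ls(X, y, lam, iters=1500):
    """Proximal gradient for (1/n) * sum_t (y_t - <X_t, Theta>)^2 + lam * ||Theta||_nuc.

    X: arms of shape (n, d1, d2); y: rewards of shape (n,).
    """
    n, d1, d2 = X.shape
    Xmat = X.reshape(n, -1)
    # Step size 1/L, where L is the smoothness constant of the quadratic part.
    L = 2.0 * np.linalg.eigvalsh(Xmat.T @ Xmat / n)[-1]
    step = 1.0 / L
    Theta = np.zeros((d1, d2))
    for _ in range(iters):
        resid = Xmat @ Theta.reshape(-1) - y                  # <X_t, Theta> - y_t
        grad = (2.0 / n) * (Xmat.T @ resid).reshape(d1, d2)   # gradient of the quadratic loss
        Theta = svt(Theta - step * grad, step * lam)          # proximal step on the nuclear norm
    return Theta

# Tiny synthetic example with a rank-1 ground truth.
rng = np.random.default_rng(0)
d1, d2, n = 6, 5, 800
Theta_star = np.outer(rng.standard_normal(d1), rng.standard_normal(d2)) / np.sqrt(d1 * d2)
X = rng.standard_normal((n, d1, d2)) / np.sqrt(d1 * d2)
y = np.einsum('nij,ij->n', X, Theta_star) + 0.01 * rng.standard_normal(n)
Theta_hat = solve_nuclear_ls(X, y, lam=0.001)
print(np.linalg.norm(Theta_hat - Theta_star, 'fro'))
```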
E.2 Proof for Theorem 15

Proof. According to Theorem 14, there exist constants $c_1$ and $c_2$ such that, with probability at least $1 - \delta$, the following RSC condition holds:
$$\frac{1}{n} \sum_{t=1}^{n} \langle X_t, \Theta \rangle^2 \ge \frac{c_1}{d_1 d_2} \|\Theta\|_F^2 - \frac{c_2 (d_1 + d_2)}{n\, d_1 d_2} \|\Theta\|_{\mathrm{nuc}}^2, \quad \forall \Theta \in \mathbb{R}^{d_1 \times d_2}. \qquad (92)$$
Lemma 22 can be applied under this RSC condition, so under the event $G(\lambda_n) := \left\{ \left\| \frac{1}{n} \sum_{t=1}^{n} \eta_t X_t \right\|_{\mathrm{op}} \le \frac{\lambda_n}{2} \right\}$ we can readily conclude the theorem. Thus, it remains to determine a value of $\lambda_n$ for which the event $G(\lambda_n)$ holds with high probability.

Define the rare event $E := \left\{ \max_{t=1,\dots,n} |\eta_t| > \sqrt{2\sigma^2 \log(2n/\delta)} \right\}$, so that $\mathbb{P}(E) \le \delta/2$ follows from the definition of sub-Gaussianity. By the matrix Bernstein inequality, the probability of $G(\lambda_n)^c$ can be bounded as
$$\mathbb{P}\left( \left\| \frac{1}{n} \sum_{t=1}^{n} \eta_t X_t \right\|_{\mathrm{op}} > \varepsilon \right) \le \mathbb{P}\left( \left\| \frac{1}{n} \sum_{t=1}^{n} \eta_t X_t \right\|_{\mathrm{op}} > \varepsilon \,\middle|\, E^c \right) + \mathbb{P}(E)$$
$$\le (d_1 + d_2) \exp\left( \frac{- n \varepsilon^2 / 2}{2\sigma^2 \log(2n/\delta) \max\{1/d_1, 1/d_2\} + \varepsilon \sqrt{2\sigma^2 \log(2n/\delta)}/3} \right) + \frac{\delta}{2},$$
where the last inequality follows from the matrix Bernstein inequality together with the fact that, on $E^c$,
$$\max\left\{ \left\| \sum_{t=1}^{n} \mathbb{E}\big[\eta_t^2 X_t X_t^T\big] \right\|_{\mathrm{op}},\, \left\| \sum_{t=1}^{n} \mathbb{E}\big[\eta_t^2 X_t^T X_t\big] \right\|_{\mathrm{op}} \right\} \le 2 n \sigma^2 \log(2n/\delta) \max\{1/d_1, 1/d_2\}. \qquad (93)$$
For
$$(d_1 + d_2) \exp\left( \frac{- n \varepsilon^2 / 2}{2\sigma^2 \log(2n/\delta) \max\{1/d_1, 1/d_2\} + \varepsilon \sqrt{2\sigma^2 \log(2n/\delta)}/3} \right) \le \frac{\delta}{2}$$
to hold, it suffices to take
$$\varepsilon = C' \sigma \sqrt{ \frac{1}{n \min\{d_1, d_2\}} \log\left( \frac{2n}{\delta} \right) \log\left( \frac{d_1 + d_2}{\delta} \right) } \qquad (94)$$
for some constant $C'$. Taking $\lambda_n = 2\varepsilon$, we need $\lambda_n = C \sigma \sqrt{ \frac{1}{n \min\{d_1, d_2\}} \log\left( \frac{2n}{\delta} \right) \log\left( \frac{d_1 + d_2}{\delta} \right) }$, and under this condition we have $\mathbb{P}(G(\lambda_n)) \ge 1 - \delta$. We complete the proof by noting that the scaling of the right-hand side in Lemma 22 under the above choice of $\lambda_n$ is indeed $\frac{(d_1 + d_2)^3\, r}{n}$, up to logarithmic factors.
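To see the scale of $\lambda_n$ used above, the following sketch empirically estimates $\big\|\frac{1}{n}\sum_t \eta_t X_t\big\|_{\mathrm{op}}$ over independent trials and compares it with the $\sigma/\sqrt{n \min\{d_1, d_2\}}$ rate (ignoring logarithmic factors); the arm distribution and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, sigma = 10, 8, 0.1

for n in [500, 2000, 8000]:
    ops = []
    for _ in range(50):
        # Arms normalized to unit Frobenius norm, Gaussian (hence sub-Gaussian) noise.
        X = rng.standard_normal((n, d1, d2))
        X /= np.linalg.norm(X, axis=(1, 2), keepdims=True)
        eta = sigma * rng.standard_normal(n)
        M = np.einsum('n,nij->ij', eta, X) / n
        ops.append(np.linalg.norm(M, 2))        # operator norm
    rate = sigma / np.sqrt(n * min(d1, d2))
    print(n, np.mean(ops), rate)
```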
F Proof for Theorem 5

Proof.
Take $\Delta = \frac{1}{6}\sqrt{\frac{dr}{2T}}$ and define
$$\boldsymbol{\Theta} := \left\{ \Theta \in \mathbb{R}^{d \times d} : \Theta = (\theta_1, \dots, \theta_r, 0, \dots, 0)^T,\ \theta_i \in \{\pm \Delta\}^d\ \forall i \in [r] \right\},$$
i.e., the first $r$ rows of $\Theta$ are $\theta_1^T, \dots, \theta_r^T$ and the remaining rows are zero. For $i \in [r]$, $j \in [d]$, define $\tau_{i,j} = T \wedge \min\{ t : \sum_{s=1}^{t} X_{s,i,j}^2 \ge \frac{T}{dr} \}$, where $X_{s,i,j}$ denotes the element in the $i$-th row and $j$-th column of the matrix $X_s$. Then, for a fixed $\Theta$, taking the expectation over the $X_t$, we have
$$\mathbb{E}[R_T(\Theta)] = \mathbb{E}_\Theta \left[ \sum_{t=1}^{T} \langle X^* - X_t, \Theta \rangle \right] \qquad (95)$$
$$= \Delta\, \mathbb{E}_\Theta \left[ \sum_{t=1}^{T} \sum_{i=1}^{r} \sum_{j=1}^{d} \left( \frac{1}{\sqrt{dr}} - X_{t,i,j}\, \mathrm{sign}(\Theta_{i,j}) \right) \right] \qquad (96)$$
$$\ge \frac{\Delta \sqrt{dr}}{2} \sum_{i=1}^{r} \sum_{j=1}^{d} \mathbb{E}_\Theta \left[ \sum_{t=1}^{T} \left( \frac{1}{\sqrt{dr}} - X_{t,i,j}\, \mathrm{sign}(\Theta_{i,j}) \right)^2 \right] \qquad (97)$$
$$\ge \frac{\Delta \sqrt{dr}}{2} \sum_{i=1}^{r} \sum_{j=1}^{d} \mathbb{E}_\Theta \left[ \sum_{t=1}^{\tau_{i,j}} \left( \frac{1}{\sqrt{dr}} - X_{t,i,j}\, \mathrm{sign}(\Theta_{i,j}) \right)^2 \right]. \qquad (98)$$
Define $U_{i,j}(x) = \sum_{t=1}^{\tau_{i,j}} \left( \frac{1}{\sqrt{dr}} - X_{t,i,j}\, x \right)^2$. Let $\Theta' \in \boldsymbol{\Theta}$ be another parameter matrix such that $\Theta' = \Theta$, except that $\Theta'_{i,j} = -\Theta_{i,j}$. Let $\mathbb{P}, \mathbb{P}'$ be the laws of $U_{i,j}$ with respect to the learner-interaction measures induced by $\Theta$ and $\Theta'$. Then
$$\mathbb{E}_\Theta[U_{i,j}(1)] \ge \mathbb{E}_{\Theta'}[U_{i,j}(1)] - \left( \frac{4T}{dr} + 2 \right) \sqrt{\frac{D(\mathbb{P}, \mathbb{P}')}{2}} \qquad (99)$$
$$\ge \mathbb{E}_{\Theta'}[U_{i,j}(1)] - \Delta \left( \frac{4T}{dr} + 2 \right) \sqrt{ \mathbb{E}_\Theta\left[ \sum_{t=1}^{\tau_{i,j}} X_{t,i,j}^2 \right] } \qquad (100)$$
$$\ge \mathbb{E}_{\Theta'}[U_{i,j}(1)] - \Delta \left( \frac{4T}{dr} + 2 \right) \sqrt{ \frac{T}{dr} + 1 } \qquad (101)$$
$$\ge \mathbb{E}_{\Theta'}[U_{i,j}(1)] - 6\sqrt{2}\, \Delta\, \frac{T}{dr} \sqrt{ \frac{T}{dr} }, \qquad (102)$$
where in the first inequality we used Pinsker's inequality, the result in Exercise 14.4 of Lattimore and Szepesvári (2018), and the bound
$$U_{i,j}(1) = \sum_{t=1}^{\tau_{i,j}} \left( \frac{1}{\sqrt{dr}} - X_{t,i,j} \right)^2 \le \sum_{t=1}^{\tau_{i,j}} \frac{2}{dr} + 2 \sum_{t=1}^{\tau_{i,j}} X_{t,i,j}^2 \le \frac{2T}{dr} + 2\left( \frac{T}{dr} + 1 \right) = \frac{4T}{dr} + 2. \qquad (103)$$
The second inequality above follows from the chain rule for the relative entropy up to a stopping time in Lattimore and Szepesvári (2018):
$$D(\mathbb{P}, \mathbb{P}') \le \frac{1}{2}\, \mathbb{E}_\Theta \left[ \sum_{t=1}^{\tau_{i,j}} \langle X_t, \Theta - \Theta' \rangle^2 \right] = 2\Delta^2\, \mathbb{E}_\Theta \left[ \sum_{t=1}^{\tau_{i,j}} X_{t,i,j}^2 \right]. \qquad (104)$$
The third inequality above holds by the definition of $\tau_{i,j}$, and the fourth holds by the assumption that $dr \le T$. Then
$$\mathbb{E}_\Theta[U_{i,j}(1)] + \mathbb{E}_{\Theta'}[U_{i,j}(-1)] \ge \mathbb{E}_{\Theta'}[U_{i,j}(1) + U_{i,j}(-1)] - 6\sqrt{2}\, \Delta\, \frac{T}{dr} \sqrt{ \frac{T}{dr} } \qquad (105)$$
$$= 2\, \mathbb{E}_{\Theta'} \left[ \frac{\tau_{i,j}}{dr} + \sum_{t=1}^{\tau_{i,j}} X_{t,i,j}^2 \right] - 6\sqrt{2}\, \Delta\, \frac{T}{dr} \sqrt{ \frac{T}{dr} } \qquad (106)$$
$$\ge \frac{2T}{dr} - 6\sqrt{2}\, \Delta\, \frac{T}{dr} \sqrt{ \frac{T}{dr} } = \frac{T}{dr}. \qquad (107)$$
The proof is completed using an averaging argument:
$$\sum_{\Theta \in \boldsymbol{\Theta}} R_T(\Theta) \ge \frac{\Delta \sqrt{dr}}{2} \sum_{i=1}^{r} \sum_{j=1}^{d} \sum_{\Theta \in \boldsymbol{\Theta}} \mathbb{E}_\Theta[U_{i,j}(\mathrm{sign}(\Theta_{i,j}))] \qquad (108)$$
$$\ge \frac{\Delta \sqrt{dr}}{2} \sum_{i=1}^{r} \sum_{j=1}^{d} \sum_{\Theta_{-i,-j}} \sum_{\Theta_{i,j} \in \{\pm \Delta\}} \mathbb{E}_\Theta[U_{i,j}(\mathrm{sign}(\Theta_{i,j}))] \qquad (109)$$
$$\ge \frac{\Delta \sqrt{dr}}{2} \sum_{i=1}^{r} \sum_{j=1}^{d} \sum_{\Theta_{-i,-j}} \frac{T}{dr} = 2^{dr-2}\, \Delta \sqrt{dr}\, T. \qquad (110)$$
Hence there exists a $\Theta \in \boldsymbol{\Theta}$ such that $R_T(\mathcal{A}, \Theta) \ge \frac{T \Delta \sqrt{dr}}{4} = \frac{dr\sqrt{T}}{24\sqrt{2}}$.

G Preliminaries for EW
We provide more details on the construction of the standard exponentially weighted average forecaster.

Prediction with Expert Advice.
We use $\{f_{i,t} : i \in \mathcal{I}\}$ to denote the predictions of the experts at round $t$, where $f_{i,t}$ is the prediction of expert $i$ at time $t$. On the basis of the experts' predictions, the forecaster computes its prediction $\widehat{p}_t$ for the next outcome $y_t$, and the true outcome $y_t$ is revealed afterwards. The regret of the learner relative to expert $i$ is defined by
$$R_{i,T} = \sum_{t=1}^{T} \left( \ell_t(\widehat{p}_t) - \ell_t(f_{i,t}) \right) = \widehat{L}_T - L_{i,T},$$
where $L_{i,T} := \sum_{t=1}^{T} \ell_t(f_{i,t})$ and $\widehat{L}_T := \sum_{t=1}^{T} \ell_t(\widehat{p}_t)$. For linear prediction experts, we define $f_{\Theta,t} := \langle \Theta, X_t \rangle$, and the above regret matches $\rho_T(\Theta)$.

Exponentially Weighted Average Forecaster (EW).
Suppose we have $N$ linear prediction experts. Define the instantaneous regret vector at time $t$ as $r_t = (r_{1,t}, \dots, r_{N,t}) \in \mathbb{R}^N$, where $r_{i,t} = \ell_t(\widehat{p}_t) - \ell_t(f_{i,t})$, and the cumulative regret vector up to time $T$ as $R_T = \sum_{t=1}^{T} r_t$. A weighted average forecaster is then defined as
$$\widehat{p}_t = \frac{ \sum_{i=1}^{N} \nabla \Phi(R_{t-1})_i\, f_{i,t} }{ \sum_{j=1}^{N} \nabla \Phi(R_{t-1})_j },$$
where $\Phi(\cdot)$ denotes a potential function $\Phi : \mathbb{R}^N \to \mathbb{R}$ of the form $\Phi(u) = \psi\left( \sum_{i=1}^{N} \phi(u_i) \right)$. Here $\phi : \mathbb{R} \to \mathbb{R}$ is any nonnegative, increasing, and twice differentiable function, and $\psi : \mathbb{R} \to \mathbb{R}$ is any nonnegative, strictly increasing, concave, and twice differentiable auxiliary function.

The exponentially weighted average forecaster is constructed using $\Phi_\eta(u) = \frac{1}{\eta} \log\left( \sum_{i=1}^{N} e^{\eta u_i} \right)$, where $\eta$ is a positive parameter. The weights assigned to the experts are of the form
$$\nabla \Phi_\eta(R_{t-1})_i = \frac{ e^{\eta R_{i,t-1}} }{ \sum_{j=1}^{N} e^{\eta R_{j,t-1}} }.$$
Since $R_{i,t-1} = \widehat{L}_{t-1} - L_{i,t-1}$ and the common factor $e^{\eta \widehat{L}_{t-1}}$ cancels, the exponentially weighted average forecaster can be simplified to
$$\widehat{y}_t = \frac{ \sum_{i=1}^{N} e^{-\eta L_{i,t-1}} f_{i,t} }{ \sum_{j=1}^{N} e^{-\eta L_{j,t-1}} },$$
as defined in the main text.
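To make the EW update concrete, here is a minimal sketch of the forecaster over a finite set of linear experts, assuming a generic per-round loss $\ell_t$ supplied by the caller (the quadratic loss below is only an illustrative choice).

```python
import numpy as np

class ExponentialWeights:
    """Exponentially weighted average forecaster over N experts."""

    def __init__(self, n_experts, eta):
        self.eta = eta
        self.cum_loss = np.zeros(n_experts)   # L_{i,t-1}

    def predict(self, expert_preds):
        """Weighted average of expert predictions f_{i,t}."""
        w = np.exp(-self.eta * (self.cum_loss - self.cum_loss.min()))  # stabilized weights
        w /= w.sum()
        return float(w @ expert_preds)

    def update(self, expert_losses):
        """Add the per-expert losses ell_t(f_{i,t})."""
        self.cum_loss += expert_losses

# Illustration: experts are inner products <Theta_i, X_t> for candidate matrices Theta_i.
rng = np.random.default_rng(0)
d1, d2, N, T, eta = 4, 3, 30, 200, 5.0
thetas = [rng.standard_normal((d1, d2)) for _ in range(N)]
theta_true = thetas[7]                        # one expert happens to be correct
ew = ExponentialWeights(N, eta)
sq_err = 0.0
for t in range(T):
    X = rng.standard_normal((d1, d2))
    X /= np.linalg.norm(X, 'fro')
    preds = np.array([np.sum(Th * X) for Th in thetas])
    y_hat = ew.predict(preds)
    y = np.sum(theta_true * X) + 0.05 * rng.standard_normal()
    ew.update((preds - y) ** 2)               # quadratic loss for each expert
    sq_err += (y_hat - y) ** 2
print("average squared prediction error:", sq_err / T)
```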
H Algorithms

In this section, we present our Low-rank Generalized Linear Bandit with Online Computation algorithm (LowGLOC) and the second part of LowESTR, the LowOFUL algorithm of Jun et al. (2019).
Algorithm 3
Low-rank Generalized Linear Bandit with Online Computation (LowGLOC)
Input: arm set $\mathcal{X}$, horizon $T$, a $\frac{1}{T}$-net of $\mathcal{S}_r$: $\bar{\mathcal{S}}_r(\frac{1}{T})$, failure rate $\delta$, EW constant $\eta$ (chosen as a function of $T/\delta$), function $m(\cdot)$ in the generalized linear model.
Initial confidence set $\mathcal{C}_0 = \{\Theta \in \mathbb{R}^{d_1 \times d_2} : \|\Theta\|_F \le 1\}$.
for $t = 1, \dots, T$ do
  $(X_t, \widetilde{\Theta}_t) := \mathrm{argmax}_{(X, \Theta) \in \mathcal{X} \times \mathcal{C}_{t-1}} \langle X, \Theta \rangle$.
  Pull arm $X_t$ and receive reward $y_t$.
  Compute the EW predictor $\widehat{y}_t = \frac{ \sum_{i=1}^{|\bar{\mathcal{S}}_r(\frac{1}{T})|} e^{-\eta L_{i,t-1}}\, f_{\Theta_i, t} }{ \sum_{j=1}^{|\bar{\mathcal{S}}_r(\frac{1}{T})|} e^{-\eta L_{j,t-1}} }$, where $f_{\Theta_i, t} \triangleq \langle X_t, \Theta_i \rangle$ for $\Theta_i \in \bar{\mathcal{S}}_r(\frac{1}{T})$.
  Update the losses $L_{i,t} = \sum_{s=1}^{t} \left( -f_{\Theta_i, s}\, y_s + m(f_{\Theta_i, s}) \right)$, for $i = 1, \dots, |\bar{\mathcal{S}}_r(\frac{1}{T})|$.
  Update $\mathcal{C}_t$ according to Equation 4, where $B_t$ is as defined in Lemma 13.
end for
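The per-round work in Algorithm 3 is the EW prediction and the GLM loss update over the net $\bar{\mathcal{S}}_r(\frac{1}{T})$. The sketch below illustrates that update, assuming a logistic link for concreteness (so $m(z) = \log(1 + e^z)$); the candidate set here is a small random stand-in for the actual $\frac{1}{T}$-net, which would be far larger.

```python
import numpy as np

def m_logistic(z):
    """Log-partition function of the Bernoulli GLM: m(z) = log(1 + exp(z))."""
    return np.logaddexp(0.0, z)

rng = np.random.default_rng(0)
d1, d2, r, N, T, eta = 5, 4, 1, 50, 300, 1.0
# Stand-in for the net over rank-r matrices with Frobenius norm <= 1.
net = []
for _ in range(N):
    M = rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2))
    net.append(M / np.linalg.norm(M, 'fro'))
theta_star = net[3]                       # the truth coincides with one net point here

cum_loss = np.zeros(N)
for t in range(T):
    X = rng.standard_normal((d1, d2))
    X /= np.linalg.norm(X, 'fro')
    f = np.array([np.sum(Th * X) for Th in net])        # f_{Theta_i, t} = <X_t, Theta_i>
    w = np.exp(-eta * (cum_loss - cum_loss.min()))
    y_hat = float(w @ f / w.sum())                       # EW predictor
    p = 1.0 / (1.0 + np.exp(-np.sum(theta_star * X)))    # logistic reward model
    y = float(rng.random() < p)
    cum_loss += -f * y + m_logistic(f)                   # negative GLM log-likelihood update
print("best expert:", int(np.argmin(cum_loss)))          # typically the index of theta_star
```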
Algorithm 4 LowOFUL (Jun et al., 2019)

Input:
$T$, $k$, arm set $\mathcal{A} \subset \mathbb{R}^{d_1 d_2}$, failure rate $\delta$, and positive constants $B, B_\perp, \lambda, \lambda_\perp$.
$\Lambda = \mathrm{diag}(\lambda, \dots, \lambda, \lambda_\perp, \dots, \lambda_\perp)$, where $\lambda$ occupies the first $k$ diagonal entries.
for $t = 1, \dots, T$ do
  Compute $a_t = \mathrm{argmax}_{a \in \mathcal{A}} \max_{\theta \in \mathcal{C}_{t-1}} \langle \theta, a \rangle$.
  Pull arm $a_t$ and receive reward $y_t$.
  Update $\mathcal{C}_t = \{\theta : \|\theta - \widehat{\theta}_t\|_{V_t} \le \sqrt{\beta_t}\}$, where $\sqrt{\beta_t} = \sqrt{2 \log \frac{|V_t|^{1/2}}{|\Lambda|^{1/2}\, \delta}} + \sqrt{\lambda}\, B + \sqrt{\lambda_\perp}\, B_\perp$, $V_t = \Lambda + \sum_{s=1}^{t} a_s a_s^T$, and $\widehat{\theta}_t = (\Lambda + A^T A)^{-1} A^T y$ (here $A = [a_1^T; \dots; a_t^T]$ and $y := [y_1, \dots, y_t]^T$).
end for
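Below is a minimal sketch of the weighted-ridge bookkeeping inside Algorithm 4 (the maximization over $\mathcal{A} \times \mathcal{C}_{t-1}$ is omitted, and all problem constants are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, T = 12, 4, 100            # ambient dimension p = d1*d2, low-dimensional block size k
lam, lam_perp, B, B_perp, delta = 1.0, 50.0, 1.0, 0.1, 0.05

Lambda = np.diag([lam] * k + [lam_perp] * (p - k))   # heavier regularization outside the subspace
V = Lambda.copy()
Xy = np.zeros(p)
theta_star = np.concatenate([rng.standard_normal(k), 0.01 * rng.standard_normal(p - k)])

for t in range(T):
    a = rng.standard_normal(p)
    a /= np.linalg.norm(a)                            # stand-in for the chosen arm a_t
    y = a @ theta_star + 0.1 * rng.standard_normal()
    V += np.outer(a, a)                               # V_t = Lambda + sum_s a_s a_s^T
    Xy += y * a
    theta_hat = np.linalg.solve(V, Xy)                # (Lambda + A^T A)^{-1} A^T y
    # Confidence radius sqrt(beta_t) as in Algorithm 4.
    log_det_ratio = np.linalg.slogdet(V)[1] - np.linalg.slogdet(Lambda)[1]
    sqrt_beta = np.sqrt(max(log_det_ratio - 2 * np.log(delta), 0.0)) \
        + np.sqrt(lam) * B + np.sqrt(lam_perp) * B_perp
print(np.linalg.norm(theta_hat - theta_star), sqrt_beta)
```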
I More on Experiments

I.1 Parameter Setup for the OFUL vs. LowESTR Simulation
We present the parameter setups for the experiments in Section 8.
OFUL: failure rate $\delta = 0.$, horizon $T = 3000$, standard deviation of the reward error $\sigma = 0.$.

LowESTR:
• failure rate: $\delta = 0.$.
• standard deviation of the reward error: $\sigma = 0.$.
• least positive singular value of $\Theta^*$: $\omega_r = 0.$ for $r = 1$ and $r = 3$.
• horizon $T = 3000$, steps of stage 1: $T_1 = 200$, steps of stage 2: $T_2 = T - T_1$.
• penalization parameter $\lambda_{T_1}$ in Equation 5, set proportional to $\sqrt{1/T_1}$.
• step size of the gradient descent solving Equation 5: $0.$.
• $k = r(d_1 + d_2 - r)$ in LowOFUL (Algorithm 4).
• $B = 1$, $B_\perp = \sigma^2 (d_1 + d_2)^3 r / (T_1 \omega_r^2)$.
• $\lambda = 1$, $\lambda_\perp = \frac{T_2}{k \log(1 + T_2/\lambda)}$.

I.2 LowESTR: Sensitivity to $\omega_r$

We prove a $\widetilde{O}\big( (d_1 + d_2)^{3/2} \sqrt{rT} / \omega_r \big)$ regret bound for the LowESTR algorithm in Section 6. To complement this theoretical finding, we compare the performance of LowESTR across six different values of $\omega_r$. We run our simulation with $d_1 = d_2 = 10$, $r = 3$. The true $\Theta^* \in \mathbb{R}^{d_1 \times d_2}$ is a diagonal matrix whose first two diagonal entries are fixed constants in $(0, 1)$, whose third entry is $\omega_r$, and whose remaining entries are zero. The arm set is constructed in the same way as in the previous experiment, and the reward is again generated by $y = \langle X, \Theta^* \rangle + \varepsilon$, where $\varepsilon \sim N(0, \sigma^2)$ with the same noise standard deviation as before. For each $\omega_r$ setting, we run LowESTR 20 times to compute the averaged regrets and their 1-standard-deviation confidence intervals. The parameters for LowESTR are the same as in the previous experiment, except that $T_1 = \mathrm{int}(100/\omega_r)$. The plot of cumulative regret at $T = 3000$ vs. the value of $\omega_r$ is displayed in Figure 2. We observe that as we increase the least positive singular value $\omega_r$ of $\Theta^*$, the cumulative regret up to $T = 3000$ is indeed decreasing.

Figure 2: LowESTR: cumulative regret at $T = 3000$ vs. $\omega_r$. The yellow area represents the 1-standard-deviation band of the cumulative regret at $T = 3000$.
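For completeness, here is a small sketch of the data-generating setup used in the $\omega_r$ sensitivity experiment, with placeholder values for the constants whose exact settings are not restated here (the two leading diagonal entries, the noise level, the arm-set size, and the uniform pulling policy are assumptions for illustration only).

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, T, omega_r = 10, 3, 3000, 0.3          # omega_r varies across runs
sigma = 0.1                                  # assumed noise standard deviation

# Diagonal Theta* whose smallest positive singular value is omega_r.
diag = np.zeros(d)
diag[:3] = [0.9, 0.6, omega_r]               # first two entries are placeholder constants
Theta_star = np.diag(diag)

# Arm set: random matrices with unit Frobenius norm (assumed construction).
n_arms = 100
arms = rng.standard_normal((n_arms, d, d))
arms /= np.linalg.norm(arms, axis=(1, 2), keepdims=True)

best = max(np.sum(a * Theta_star) for a in arms)
regret = 0.0
for t in range(T):
    a = arms[rng.integers(n_arms)]           # placeholder policy: uniform pulls
    y = np.sum(a * Theta_star) + sigma * rng.standard_normal()
    regret += best - np.sum(a * Theta_star)
print("cumulative regret of the uniform policy:", regret)
```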