Low-Rank Generalized Linear Bandit Problems
Yangyi Lu, Department of Statistics, University of Michigan, [email protected]
Amirhossein Meisami, Adobe Inc., [email protected]
Ambuj Tewari, Department of Statistics, University of Michigan, [email protected]
June 5, 2020
Abstract
In a low-rank linear bandit problem, the reward of an action (represented by a matrix of size $d_1 \times d_2$) is the inner product between the action and an unknown low-rank matrix $\Theta^*$. We propose an algorithm based on a novel combination of online-to-confidence-set conversion (Abbasi-Yadkori et al., 2012) and the exponentially weighted average forecaster constructed by a covering of low-rank matrices. In $T$ rounds, our algorithm achieves $\widetilde{O}\big((d_1+d_2)^{3/2}\sqrt{rT}\big)$ regret, which improves upon the standard linear bandit regret bound of $\widetilde{O}(d_1 d_2 \sqrt{T})$ when the rank $r$ of $\Theta^*$ satisfies $r \ll \min\{d_1, d_2\}$. We also extend our algorithmic approach to the generalized linear setting and obtain an algorithm which enjoys a similar bound under regularity conditions on the link function. To get around the computational intractability of covering-based approaches, we propose an efficient algorithm by extending the "Explore-Subspace-Then-Refine" algorithm of Jun et al. (2019). Our efficient algorithm achieves $\widetilde{O}\big((d_1+d_2)^{3/2}\sqrt{rT}\big)$ regret under a mild condition on the action set $\mathcal{X}$ and the $r$-th singular value of $\Theta^*$. Our upper bounds match the conjectured lower bound of Jun et al. (2019) for a subclass of low-rank linear bandit problems. Further, we show that existing lower bounds for the sparse linear bandit problem strongly suggest that our regret bounds are unimprovable. To complement our theoretical contributions, we also conduct experiments to demonstrate that our algorithm can greatly outperform the standard linear bandit approach when $\Theta^*$ is low-rank.

Low-rank models are widely used in applications such as matrix completion and computer vision (Candès and Recht, 2009; Basri and Jacobs, 2003). We study low-rank (generalized) linear models in the bandit setting (Lai and Robbins, 1985). During the learning process, the agent adaptively pulls an arm (denoted $X_t$) from a set of arms based on past experience. At each pull, the agent observes a noisy reward corresponding to the arm pulled. Let $\Theta^* \in \mathbb{R}^{d_1 \times d_2}$ be an unknown low-rank matrix with rank $r \ll \min\{d_1, d_2\}$. The learner's goal is to maximize the total reward $\sum_{t=1}^{T} \mu(\langle \Theta^*, X_t\rangle)$, where $T$ is the time horizon, $X_t \in \mathbb{R}^{d_1 \times d_2}$ is the action pulled at time $t$ from a pre-specified action set $\mathcal{X}$, and $\mu(\cdot)$ denotes a link function. In the standard linear case the link function is the identity.

Many practical applications can be framed in this low-rank bandit model. For traveling websites, the recommendation system needs to choose a flight-hotel bundle for the customer that can achieve high revenue. Often one has features of size $d_1$ for flights and features of size $d_2$ for hotels. It is natural to form a $d_1 \times d_2$ matrix feature (e.g., via an outer product) for each pair, or simply to combine the two features row/column-wise if $d_1 = d_2$. One can model the appeal of a bundle by a (generalized) linear function of the matrix feature. In online advertising with image recommendation, the advertiser selects an image to display and the goal is to achieve the maximum click rate. The image is often stored as a $d_1 \times d_2$ matrix, and one can use a generalized linear model (GLM) with the logistic link function to model the click rate (Richardson et al., 2007; McMahan et al., 2013). In all of these applications, one puts some capacity control on the underlying matrix coefficient $\Theta^*$, and a natural condition is that $\Theta^*$ is low-rank. We
note that the examples such as online dating and online shopping discussed in Jun et al. (2019) can also be formulated in our model.

In this paper, we measure the quality of an algorithm in terms of its cumulative regret. A naive approach is to ignore the low-rank structure and directly apply standard (generalized) linear bandit algorithms (Abbasi-Yadkori et al., 2011; Filippi et al., 2010). These approaches suffer $\widetilde{O}(d_1 d_2 \sqrt{T})$ regret (throughout, $\widetilde{O}$ omits poly-logarithmic factors of $d_1$, $d_2$, $r$ and $T$). However, in practice, $d_1 d_2$ can be huge. A natural question is then: Can we utilize the low-rank structure of $\Theta^*$ to achieve $o(d_1 d_2 \sqrt{T})$ regret?

Jun et al. (2019) studied a subclass of our problem, where the actions are rank-one matrices. They proposed an algorithm that achieves $\widetilde{O}\big((d_1+d_2)^{3/2}\sqrt{rT}\big)$ regret under additional incoherence and singular value assumptions on an augmented matrix defined via the arm set and $\Theta^*$, together with a singular value assumption on $\Theta^*$. They also provided strong evidence that their bound is unimprovable.

We summarize our contributions below.

1. We propose the Low-Rank Linear Bandit with Online Computation algorithm (LowLOC) for the low-rank linear bandit problem, which achieves $\widetilde{O}\big((d_1+d_2)^{3/2}\sqrt{rT}\big)$ regret. Notably, compared with the result in Jun et al. (2019), our result (1) applies to more general action sets, which can contain high-rank matrices, and (2) does not require the incoherence and bounded eigenvalue assumptions on the augmented matrix mentioned in the previous paragraph. Our regret bound also matches their conjectured lower bound. For LowLOC, we first design a novel online predictor which uses an exponentially weighted average forecaster over a covering of low-rank matrices to solve the online low-rank linear prediction problem with $O((d_1+d_2) r \log T)$ regret. We then plug our online predictor into the online-to-confidence-set conversion framework proposed by Abbasi-Yadkori et al. (2012) to construct a confidence set for $\Theta^*$ in our bandit setting, and at every round we choose the action optimistically.

2. We further propose the Low-Rank Generalized Linear Bandit with Online Computation algorithm (LowGLOC) for the generalized linear setting, which also achieves $\widetilde{O}\big((d_1+d_2)^{3/2}\sqrt{rT}\big)$ regret. LowGLOC is similar to LowLOC, but here we need to design a new online-to-confidence-set conversion method, which may be of independent interest.

3. LowLOC and LowGLOC enjoy good regret guarantees but are unfortunately not efficiently implementable. To overcome this issue, we provide an efficient algorithm, Low-Rank-Explore-Subspace-Then-Refine (LowESTR), for the linear setting, inspired by the ESTR algorithm proposed by Jun et al. (2019). We show that under a mild assumption on the action set $\mathcal{X}$, LowESTR achieves $\widetilde{O}\big((d_1+d_2)^{3/2}\sqrt{rT}/\omega_r\big)$ regret, where $\omega_r > 0$ is a lower bound on the $r$-th singular value of $\Theta^*$. Compared with ESTR, LowESTR does not need the incoherence and eigenvalue assumptions on the augmented matrix, while the assumptions on the action sets of the two algorithms are different. We also provide empirical evaluations to demonstrate the effectiveness of LowESTR.

Our work is inspired by Jun et al. (2019), who model the reward as $x_t^\top \Theta^* z_t$, where $x_t \in \mathcal{X} \subset \mathbb{R}^{d_1}$ is a left arm and $z_t \in \mathcal{Z} \subset \mathbb{R}^{d_2}$ is a right arm ($\mathcal{X}$ and $\mathcal{Z}$ are the left and right arm sets, respectively). Note that this model is a special case of our low-rank linear bandit model, because one can write $x_t^\top \Theta^* z_t = \langle \Theta^*, x_t z_t^\top\rangle$ and define the arm set as $\mathcal{X}\mathcal{Z}^\top$.
Their ESTR algorithm enjoys an $\widetilde{O}\big((d_1+d_2)^{3/2}\sqrt{rT}/\omega_r\big)$ regret bound under the assumptions that (1) an augmented matrix $K^* = X\Theta^* Z^\top$ is incoherent (Keshavan et al., 2010) and has a finite condition number, where $X \in \mathbb{R}^{d_1 \times d_1}$ is constructed from $d_1$ arms in $\mathcal{X}$ chosen to control $\|X^{-1}\|$ and $Z \in \mathbb{R}^{d_2 \times d_2}$ is constructed from $d_2$ arms in $\mathcal{Z}$ chosen to control $\|Z^{-1}\|$, and (2) $\|X^{-1}\|$ and $\|Z^{-1}\|$ are upper bounded by a constant. Their algorithm requires explicitly finding $X$ and $Z$, which is in general NP-hard, even though they also proposed heuristics to speed up this step. Like ESTR, our LowLOC and LowGLOC algorithms are not computationally efficient, but they both apply to richer action sets (see Section 3 for the definition), do not require assumptions on $K^*$, $X$ and $Z$, and their regret bounds do not depend on $\omega_r$. Our LowESTR algorithm is computationally efficient if the action set admits a nice exploration distribution (see details in Section 6). LowESTR achieves an $\widetilde{O}\big((d_1+d_2)^{3/2}\sqrt{rT}/\omega_r\big)$ regret bound and does not require assumptions on $K^*$, $X$ and $Z$ either.

Katariya et al. (2017b) and Kveton et al. (2017) also studied rank-1 and low-rank bandit problems. They assume there is an underlying expected reward matrix $\bar{R}$; at each time the learner picks an entry at position $(i_t, j_t)$ and receives a noisy reward. This can be viewed as a special case of the bilinear bandit with one-hot vectors as left and right arms. Katariya et al. (2017b) was further extended by Katariya et al. (2017a), which uses KL-based confidence intervals to achieve a tighter regret bound. Our problem is more general than these works. Johnson et al. (2016) considered the same setting as ours, but their method relies on knowledge of many parameters that depend on the unknown $\Theta^*$ and, in particular, only works for continuous arm sets.

There are other works that utilize low-rank structure in different model settings. For example, Gopalan et al. (2016) studied low-rank bandits with latent structure using the robust tensor power method. Lale et al. (2019) imposed low-rank assumptions on the feature vectors to reduce the effective dimension. These works all utilize the low-rank structure to achieve better regret bounds than standard approaches that do not take the low-rank structure into account.

We formally define the problem and review the relevant background next.
Let
$\mathcal{X} \subset \mathbb{R}^{d_1 \times d_2}$ be the arm space. In each round $t$, the learner chooses an arm $X_t \in \mathcal{X}$ and observes a noisy reward of a linear form: $y_t = \langle X_t, \Theta^*\rangle + \eta_t$, where $\Theta^* \in \mathbb{R}^{d_1 \times d_2}$ is an unknown parameter and $\eta_t$ is a $1$-sub-Gaussian random variable. Denote the rank of $\Theta^*$ by $r$; we assume $r \ll \min\{d_1, d_2\}$, and we let the $r$-th singular value of $\Theta^*$ be lower bounded by $\omega_r > 0$. We use $\langle A, B\rangle := \mathrm{trace}(A^\top B)$ to denote the inner product between matrices $A$ and $B$. We follow the standard assumptions in linear bandits: $\|\Theta^*\|_F \le 1$ and $\|X\|_F \le 1$ for all $X \in \mathcal{X}$.

In our bandit problem, the goal of the learner is to maximize the total reward $\sum_{t=1}^T \langle X_t, \Theta^*\rangle$, where $T$ is the time horizon. Clearly, with knowledge of the unknown parameter $\Theta^*$, one should always select an action $X^* \in \mathrm{argmax}_{X \in \mathcal{X}} \langle X, \Theta^*\rangle$. It is natural to evaluate the learner relative to this optimal strategy. The difference between the learner's total reward and the total reward of the optimal strategy is called the pseudo-regret (Audibert et al., 2009): $R_T := \sum_{t=1}^T \langle X^* - X_t, \Theta^*\rangle$. For simplicity, we use the word regret instead of pseudo-regret for $R_T$.

We also study the generalized linear bandit model of the following form: $\mathbb{E}[y_t \mid X_t, \Theta^*] = \mu(\langle X_t, \Theta^*\rangle)$, where $\mu(\cdot)$ is a link function. This framework builds on the well-known Generalized Linear Models (GLMs) and has been widely studied in many applications. For example, when rewards are binary-valued, a natural link function is the logistic function $\mu(x) = \exp(x)/(1+\exp(x))$. For the generalized setting, we assume the reward given the action follows an exponential family distribution:

$\mathbb{P}(y \mid z = \langle X, \Theta^*\rangle) = \exp\Big(\frac{yz - m(z)}{\phi(\tau)} + h(y, \tau)\Big)$,  (1)

where $\tau \in \mathbb{R}^+$ is a known scale parameter and $m$, $\phi$ and $h$ are known functions. From a basic calculation we get $m'(z) = \mathbb{E}[y \mid z] =: \mu(z)$. We assume the above exponential family is a minimal representation; then $m(z)$ is ensured to be strictly convex (Wainwright and Jordan, 2008), and thus the negative log-likelihood (NLL) loss $\ell(z, y) := -yz + m(z)$ is also strictly convex.
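To make the reward models above concrete, the following sketch simulates both the linear and the logistic-link environments; the class and helper names are ours (not from the paper), and the specific constants are arbitrary.

import numpy as np

rng = np.random.default_rng(0)

def make_low_rank(d1, d2, r):
    """A random Theta* with rank r, normalized so that ||Theta*||_F <= 1."""
    U, V = rng.standard_normal((d1, r)), rng.standard_normal((d2, r))
    Theta = U @ V.T
    return Theta / np.linalg.norm(Theta, "fro")

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

class LowRankBanditEnv:
    """Linear rewards if link is None; Bernoulli rewards with mean mu(<X, Theta*>) otherwise."""
    def __init__(self, Theta, link=None, noise_sd=0.1):
        self.Theta, self.link, self.noise_sd = Theta, link, noise_sd

    def mean_reward(self, X):
        z = np.sum(X * self.Theta)                      # <X, Theta*> = trace(X^T Theta*)
        return z if self.link is None else self.link(z)

    def pull(self, X):
        if self.link is None:
            return self.mean_reward(X) + self.noise_sd * rng.standard_normal()
        return float(rng.random() < self.mean_reward(X))

# Example: 100 arms on the Frobenius unit sphere, linear rewards, one round of pseudo-regret.
d1, d2, r = 5, 4, 1
env = LowRankBanditEnv(make_low_rank(d1, d2, r))
arms = [A / np.linalg.norm(A, "fro") for A in rng.standard_normal((100, d1, d2))]
best_mean = max(env.mean_reward(A) for A in arms)
print(best_mean - env.mean_reward(arms[0]))             # instantaneous pseudo-regret of arm 0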
Algorithm 1 Low-Rank Linear Bandit with Online Computation (LowLOC)

Input: arm set $\mathcal{X}$, horizon $T$, $\frac{1}{T}$-net of $\mathcal{S}_r$: $\bar{\mathcal{S}}_r(\frac{1}{T})$, failure rate $\delta$, EW parameter $\eta$ (chosen as in Lemma 2).
Initial confidence set $\mathcal{C}_0 = \{\Theta \in \mathbb{R}^{d_1 \times d_2} : \|\Theta\|_F \le 1\}$.
for $t = 1, \ldots, T$ do
  $(X_t, \widetilde{\Theta}_t) := \mathrm{argmax}_{(X, \Theta) \in \mathcal{X} \times \mathcal{C}_{t-1}} \langle X, \Theta\rangle$.
  Pull arm $X_t$ and receive reward $y_t$.
  Compute the EW prediction $\hat{y}_t = \sum_{i=1}^{|\bar{\mathcal{S}}_r(\frac{1}{T})|} e^{-\eta L_{i,t-1}} f_{\Theta_i,t} \big/ \sum_{j=1}^{|\bar{\mathcal{S}}_r(\frac{1}{T})|} e^{-\eta L_{j,t-1}}$, where $f_{\Theta_i,t} := \langle X_t, \Theta_i\rangle$ for $\Theta_i \in \bar{\mathcal{S}}_r(\frac{1}{T})$.
  Update the losses $L_{i,t} = \sum_{s=1}^{t} (y_s - f_{\Theta_i,s})^2$ for $i = 1, \ldots, |\bar{\mathcal{S}}_r(\frac{1}{T})|$.
  Update $\mathcal{C}_t$ according to Equation 2, where $B_t$ is defined in Lemma 2.
end for
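The core of LowLOC is the exponentially weighted average forecaster in the loop above. The sketch below runs EW over a finite list of candidate low-rank matrices standing in for the $\frac{1}{T}$-net (the actual net is exponentially large, so we simply draw random rank-$r$ candidates for illustration); the function names are ours.

import numpy as np

rng = np.random.default_rng(1)

def ew_predict(candidates, cum_losses, X_t, eta):
    """Exponentially weighted average prediction (Eq. 3) for a new arm X_t.
    candidates: list of d1 x d2 matrices (the 'experts'); cum_losses: their cumulative squared losses."""
    f = np.array([np.sum(Th * X_t) for Th in candidates])      # expert predictions <Theta_i, X_t>
    w = np.exp(-eta * (cum_losses - cum_losses.min()))         # shift for numerical stability
    return float(w @ f / w.sum())

# Toy run: random rank-1 candidates as a stand-in for the 1/T-net, squared-loss updates.
d1, d2, r, T, eta = 4, 4, 1, 200, 0.1
Theta_star = rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2))
Theta_star /= np.linalg.norm(Theta_star, "fro")
cands = []
for _ in range(500):
    M = rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2))
    cands.append(M / max(1.0, np.linalg.norm(M, "fro")))
L, sq_err = np.zeros(len(cands)), 0.0
for t in range(T):
    X_t = rng.standard_normal((d1, d2))
    X_t /= np.linalg.norm(X_t, "fro")
    y_t = np.sum(X_t * Theta_star) + 0.1 * rng.standard_normal()
    y_hat = ew_predict(cands, L, X_t, eta)
    sq_err += (y_t - y_hat) ** 2
    L += (y_t - np.array([np.sum(Th * X_t) for Th in cands])) ** 2   # update L_{i,t}
print("average squared prediction error:", sq_err / T)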
We make the following standard assumption on the link function $\mu(\cdot)$ (Jun et al., 2017).

Assumption 1. There exist constants $L_\mu, c_\mu \ge 0$ and $\kappa_\mu > 0$ such that the link function $\mu(\cdot)$ is $L_\mu$-Lipschitz on $[-1, 1]$, continuously differentiable on $(-1, 1)$, $\inf_{z \in (-1,1)} \mu'(z) =: \kappa_\mu$, and $|\mu(0)| \le c_\mu$.

One can write the reward model (1) in an equivalent way: $y_t = \mu(\langle X_t, \Theta^*\rangle) + \eta_t$, where $\eta_t$ is conditionally $R$-sub-Gaussian given $X_t$ and $\{(X_s, \eta_s)\}_{s=1}^{t-1}$. Using the form of $\mathbb{P}(y \mid z)$, a Taylor expansion, and the strict convexity of $m(\cdot)$, one can show that $R = \sup_{z \in [-1,1]} \sqrt{\mu'(z)} \le \sqrt{L_\mu}$ by the definition of the sub-Gaussian constant. An optimal arm is $X^* \in \mathrm{argmax}_{X \in \mathcal{X}} \mu(\langle X, \Theta^*\rangle)$. The performance of an algorithm is again evaluated by the cumulative regret $R_T = \sum_{t=1}^T \mu(\langle X^*, \Theta^*\rangle) - \mu(\langle X_t, \Theta^*\rangle)$.

We use $O$ and $\Omega$ for the standard big-O and big-Omega notations. $\widetilde{O}$ and $\widetilde{\Omega}$ ignore poly-logarithmic factors of $d_1, d_2, r, T$. $f(x) \asymp g(x)$ means that $f$ and $g$ are of the same order, ignoring poly-logarithmic factors of $d_1, d_2, r, T$.

We first present our algorithm, LowLOC (Algorithm 1), for low-rank linear bandit problems.
Theorem 1 (Regret of LowLOC (Algorithm 1)). For any $\delta \in (0, 1/2)$, with probability at least $1 - \delta$, Algorithm 1 achieves regret

$R_T = \widetilde{O}\Big((d_1 + d_2)^{3/2}\sqrt{rT}\,\sqrt{\log(1/\delta)}\Big)$.

Note that LowLOC achieves the desired goal of outperforming the standard linear bandit approach with its $\widetilde{O}(d_1 d_2 \sqrt{T})$ regret. Furthermore, this bound does not depend on any other problem-dependent parameters, such as the least singular value of $\Theta^*$, and does not require any of the additional assumptions that appear in Jun et al. (2019). In the following, we explain the details of our algorithm design choices.

This algorithm follows the standard Optimism in the Face of Uncertainty (OFU) principle. We maintain a confidence set $\mathcal{C}_t$ at every round that contains the true parameter $\Theta^*$ with high probability, and we choose the action $X_t$ according to $(X_t, \widetilde{\Theta}_t) = \mathrm{argmax}_{(X, \Theta) \in \mathcal{X} \times \mathcal{C}_{t-1}} \langle X, \Theta\rangle$. Typically, the faster $\mathcal{C}_t$ shrinks, the lower the regret. The main difficulty is to construct $\mathcal{C}_t$ in a way that leverages the low-rank structure so that we only incur $\widetilde{O}\big((d_1+d_2)^{3/2}\sqrt{rT}\big)$ regret. Our starting point is the online-to-confidence-set conversion framework proposed by Abbasi-Yadkori et al. (2012), which builds the confidence set from an online predictor. At each round, an online predictor receives $X_t$, predicts $\hat{y}_t$ based on the historical data $\{(X_s, y_s)\}_{s=1}^{t-1}$, observes the true value $y_t$, and suffers a loss $\ell_t(\hat{y}_t) := (y_t - \hat{y}_t)^2$. The performance of this online predictor is measured by comparing its cumulative loss to the cumulative loss of a fixed linear predictor with coefficient $\Theta$: $\rho_t(\Theta) = \sum_{s=1}^t \ell_s(\hat{y}_s) - \ell_s(\langle \Theta, X_s\rangle)$.

The key idea of online-to-confidence-set conversion (adapted to our low-rank setting) is that if one can guarantee $\sup_{\|\Theta\|_F \le 1,\ \mathrm{rank}(\Theta) \le r} \rho_t(\Theta) \le B_t$ for some non-decreasing sequence $\{B_t\}_{t=1}^T$, then we can construct the confidence set for $\Theta^*$ as

$\mathcal{C}_t = \Big\{\Theta \in \mathbb{R}^{d_1 \times d_2} : \|\Theta\|_F^2 + \sum_{s=1}^t (\hat{y}_s - \langle \Theta, X_s\rangle)^2 \le \beta_t(\delta)\Big\}$, with $\beta_t(\delta) = 1 + 2B_t + 32\log\big((\sqrt{8} + \sqrt{1 + B_t})/\delta\big)$,  (2)

where $\delta$ is the failure probability. Lemma 7 in the appendix guarantees that $\Theta^*$ is contained in $\cap_{t \ge 0}\, \mathcal{C}_t$ with high probability, and Lemma 8 further guarantees the overall regret $R_T = \widetilde{O}\big(\sqrt{d_1 d_2\, \beta_{T-1}(\delta)\, T}\big) = \widetilde{O}\big((d_1+d_2)\sqrt{B_{T-1} T}\big)$.

Therefore, achieving the $\widetilde{O}\big((d_1+d_2)^{3/2}\sqrt{rT}\big)$ regret bound reduces to designing an online predictor which guarantees $\sup_{\|\Theta\|_F \le 1,\ \mathrm{rank}(\Theta) \le r} \rho_t(\Theta) \le B_t$ with $B_t = \widetilde{O}((d_1+d_2)r)$. To achieve this rate, the key is to leverage the low-rank structure.
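As a concrete illustration of the conversion in Eq. (2), the following sketch turns an online-regret bound $B_t$ into the confidence-set radius $\beta_t(\delta)$ and tests membership of a candidate $\Theta$; the function names are ours.

import numpy as np

def beta_t(B_t, delta):
    """Confidence-set radius from Eq. (2)."""
    return 1.0 + 2.0 * B_t + 32.0 * np.log((np.sqrt(8.0) + np.sqrt(1.0 + B_t)) / delta)

def in_confidence_set(Theta, arm_hist, ew_pred_hist, B_t, delta):
    """C_t membership: ||Theta||_F^2 + sum_s (yhat_s - <Theta, X_s>)^2 <= beta_t(delta)."""
    preds = np.array([np.sum(Theta * X) for X in arm_hist])
    lhs = np.linalg.norm(Theta, "fro") ** 2 + np.sum((np.asarray(ew_pred_hist) - preds) ** 2)
    return lhs <= beta_t(B_t, delta)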
We adopt the classical exponentially weighted average forecaster (EW) framework (Cesa-Bianchi and Lugosi, 2006), which uses $N$ experts to predict $\hat{y}_t$ via

$\hat{y}_t = \dfrac{\sum_{i=1}^N e^{-\eta L_{i,t-1}} f_{i,t}}{\sum_{j=1}^N e^{-\eta L_{j,t-1}}}$.  (3)

Here $f_i$ denotes the $i$-th expert, which makes prediction $f_{i,t}$ at time $t$, $L_{i,t-1} := \sum_{s=1}^{t-1} \ell_s(f_i(X_s))$ is the cumulative loss incurred by expert $i$, and $\eta$ is a tuning parameter. By choosing $\eta$ carefully, one can guarantee that this predictor achieves $O(\log N \cdot \log(T/\delta))$ regret relative to the best expert in the expert set.

In our setting, an expert can be viewed as a matrix $\Theta$ satisfying $\|\Theta\|_F \le 1$ and $\mathrm{rank}(\Theta) \le r$, which makes predictions according to $f_{\Theta,t} := \langle \Theta, X_t\rangle$. There are infinitely many such experts, so we cannot directly use EW, which requires a finite number of experts. Our main idea is to construct $N$ experts such that $\log N$ is small and these $N$ experts represent the original expert set $\mathcal{S}_r := \{\Theta \in \mathbb{R}^{d_1 \times d_2} : \|\Theta\|_F \le 1,\ \mathrm{rank}(\Theta) \le r\}$ well, and then to apply EW with these $N$ experts. We construct an $\varepsilon$-net $\bar{\mathcal{S}}_r(\varepsilon)$, i.e., for any $\Theta \in \mathcal{S}_r$ there exists a $\bar{\Theta} \in \bar{\mathcal{S}}_r(\varepsilon)$ such that $\|\Theta - \bar{\Theta}\|_F \le \varepsilon$. We further show that $|\bar{\mathcal{S}}_r(\varepsilon)| \le (9/\varepsilon)^{(d_1+d_2+1)r}$ in Lemma 6, so the number of experts $N$ in Equation 3 is at most $(9T)^{(d_1+d_2+1)r}$ if we set $\varepsilon = 1/T$.
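To give a feel for this covering step, here is a toy numerical illustration (a simplification, not the construction used in Lemma 6, which covers the SVD factors $U$, $\Sigma$, $V$ separately): for $d_1 = d_2 = 2$ and $r = 1$, gridding the scale and the two unit-vector angles already yields an $\varepsilon$-net of the rank-one unit ball.

import numpy as np
from itertools import product

eps = 0.25
rng = np.random.default_rng(2)
unit = lambda a: np.array([np.cos(a), np.sin(a)])

# Grid sigma in [0, 1] and the angles of u and v with step eps/3: by the triangle
# inequality, every sigma * u v^T with sigma <= 1 is within eps of some grid point.
sigmas = np.arange(0.0, 1.0 + 1e-9, eps / 3.0)
angles = np.arange(0.0, 2.0 * np.pi, eps / 3.0)
net = np.array([s * np.outer(unit(a), unit(b)) for s, a, b in product(sigmas, angles, angles)])

# Empirical check of the covering property on random rank-one matrices in the unit ball.
worst = 0.0
for _ in range(200):
    u, v = rng.standard_normal(2), rng.standard_normal(2)
    Theta = rng.random() * np.outer(u / np.linalg.norm(u), v / np.linalg.norm(v))
    worst = max(worst, np.linalg.norm(net - Theta, axis=(1, 2)).min())
print(f"net size {len(net)}, worst covering distance {worst:.3f} (target {eps})")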
The following lemma summarizes the performance of this online predictor.

Lemma 2 (Regret of EW under the Squared Loss). Let the EW parameter $\eta$ in the forecaster (3) be chosen on the order of $1/\log(2T/\delta)$ (the exact choice is given in the proof in the appendix). Then, for any $0 < \delta < 1/2$, with probability at least $1 - \delta$,

$B_T = \sup_{\|\Theta\|_F \le 1,\ \mathrm{rank}(\Theta) \le r} \rho_T(\Theta) = O\Big((d_1+d_2) r \log(T) \log\big(\tfrac{T}{\delta}\big)\Big) = \widetilde{O}\Big((d_1+d_2) r \log\big(\tfrac{1}{\delta}\big)\Big)$.

To obtain Theorem 1, one just needs to plug Lemma 2 into Lemma 8.

We also study the low-rank generalized linear bandit setting. Our algorithm LowGLOC is similar to LowLOC, so we only present the differences and leave the detailed presentation of the algorithm (Algorithm 3) to the appendix (Section H).

We still use EW to perform online predictions, but instead of the squared loss we use the negative log-likelihood (NLL) loss $\ell_s(\hat{y}_s) = -\hat{y}_s y_s + m(\hat{y}_s)$ to construct the forecaster in Equation (3), where $m(\cdot)$ is as defined in Section 3. The performance of EW with the NLL loss relative to a fixed linear predictor $\Theta$ is therefore measured by $\rho_T^{GLB}(\Theta) = \big(\sum_{t=1}^T -\hat{y}_t y_t + m(\hat{y}_t)\big) - \big(\sum_{t=1}^T -\langle \Theta, X_t\rangle y_t + m(\langle \Theta, X_t\rangle)\big)$. If there exists a non-decreasing sequence $\{B_t^{GLB}\}$ such that $\sup_{\|\Theta\|_F \le 1,\ \mathrm{rank}(\Theta) \le r} \rho_t^{GLB}(\Theta) \le B_t^{GLB}$, we construct $\mathcal{C}_t^{GLB}$ in the following way:

$\mathcal{C}_t^{GLB} = \Big\{\Theta \in \mathbb{R}^{d_1 \times d_2} : \|\Theta\|_F^2 + \sum_{s=1}^t (\hat{y}_s - \langle \Theta, X_s\rangle)^2 \le \beta_t^{GLB}(\delta)\Big\}$,  (4)

where $\beta_t^{GLB}(\delta) = 2 + \frac{2}{\kappa_\mu} B_t^{GLB} + \frac{32 L_\mu}{\kappa_\mu^2}\log\Big(\big(\sqrt{8L_\mu/\kappa_\mu^2} + \sqrt{\tfrac{2}{\kappa_\mu} B_t^{GLB} + 1}\big)/\delta\Big)$. Lemma 11 guarantees that the true parameter $\Theta^*$ is contained in $\cap_{t \ge 0}\, \mathcal{C}_t^{GLB}$ with high probability. Lemma 12 further guarantees that the overall regret of LowGLOC satisfies $R_T = \widetilde{O}\big(\sqrt{d_1 d_2\, \beta_{T-1}^{GLB}(\delta)\, T}\big) = \widetilde{O}\big((d_1+d_2)\sqrt{B_T^{GLB}\, T/\kappa_\mu}\big)$. Following the online-to-confidence-set conversion idea used in LowLOC, we prove that $B_T^{GLB} = O\big(\frac{L_\mu + c_\mu}{\kappa_\mu}(d_1+d_2)\, r \log T \log\big(\frac{T}{\delta}\big)\big)$ in Lemma 13.

We next present the regret of LowGLOC, which follows by plugging Lemma 13 into Lemma 12.

Theorem 3 (Regret of LowGLOC). For any $\delta \in (0, 1/2)$, with probability at least $1 - \delta$, Algorithm 3 achieves regret

$R_T = \widetilde{O}\bigg((d_1+d_2)^{3/2}\sqrt{\frac{L_\mu + c_\mu}{\kappa_\mu^2}\, rT\, \log\big(\tfrac{1}{\delta}\big)}\bigg)$.

To the best of our knowledge, this is the first $o(d_1 d_2 \sqrt{T})$ regret bound for low-rank GLM bandits.
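The only change LowGLOC makes to the forecaster is the loss used to score experts. As a sketch for the logistic link, where $m(z) = \log(1+e^z)$ and $\mu = m'$ is the sigmoid (helper names are ours), the EW weights under the NLL loss look as follows.

import numpy as np

def nll_loss(z, y):
    """l(z, y) = -y z + m(z) with m(z) = log(1 + e^z) (logistic link)."""
    return -y * z + np.logaddexp(0.0, z)

def ew_weights_nll(candidates, history, eta):
    """Exponential weights over candidate matrices from cumulative NLL losses.
    history: list of (X_s, y_s) pairs observed so far."""
    L = np.array([sum(nll_loss(np.sum(Th * X), y) for X, y in history) for Th in candidates])
    w = np.exp(-eta * (L - L.min()))
    return w / w.sum()

# The EW prediction for a new arm X_t is then w @ [<Theta_i, X_t>]_i, exactly as in Eq. (3).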
At every round, LowLOC and LowGLOC need to compute exponentially weighted predictions, which involves calculating weights over the covering of low-rank matrices. These approaches have high computational complexity even though their regret is ideal. In this section, we propose a computationally efficient method, LowESTR (Algorithm 2), that also achieves $\widetilde{O}\big((d_1+d_2)^{3/2}\sqrt{rT}\big)$ regret under the following mild assumption on the action set $\mathcal{X}$.

Assumption 2. There exists a sampling distribution $\mathcal{D}$ over $\mathcal{X}$ with covariance matrix $\Sigma$ such that $\lambda_{\min}(\Sigma) \asymp \frac{1}{d_1 d_2}$ and $\mathcal{D}$ is sub-Gaussian with parameter $\sigma^2 \asymp \frac{1}{d_1 d_2}$ (see Definition 1 in Section C for the definition of sub-Gaussian random matrices).

This assumption is easily satisfied by many arm sets. To guarantee the existence of such a sampling distribution $\mathcal{D}$, we only need the convex hull of a subset of arms $\mathcal{X}_{sub} \subset \mathcal{X}$ to contain a ball of radius $R \le 1$ that does not scale with $d_1$ or $d_2$. Simple examples of $\mathcal{X}$ are the Euclidean unit ball and unit sphere.

We extend the two-stage procedure "Explore-Subspace-Then-Refine" (ESTR) proposed by Jun et al. (2019). In stage 1, ESTR estimates the row and column subspaces of $\Theta^*$. In stage 2, ESTR transforms the original problem into a $d_1 d_2$-dimensional linear bandit problem and invokes the LowOFUL algorithm (Jun et al., 2019), which leverages the estimated row/column subspaces of $\Theta^*$. LowESTR also follows this two-stage framework, but we use a different estimation method in stage 1.
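As a quick sanity check of Assumption 2 (a sketch, not part of the paper): sampling arms uniformly from the Frobenius unit sphere gives a vectorized covariance of $I/(d_1 d_2)$, so $\lambda_{\min}(\Sigma) = 1/(d_1 d_2)$, and the sub-Gaussian parameter is of the same order.

import numpy as np

rng = np.random.default_rng(3)
d1, d2, n = 6, 5, 20000
Z = rng.standard_normal((n, d1 * d2))
X = Z / np.linalg.norm(Z, axis=1, keepdims=True)   # n arms drawn uniformly from the unit sphere
Sigma_hat = X.T @ X / n                            # empirical covariance of vec(X)
print(np.linalg.eigvalsh(Sigma_hat).min(), 1.0 / (d1 * d2))   # both close to 1/(d1*d2)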
Algorithm 2 Low-Rank Explore-Subspace-Then-Refine (LowESTR)
Input: arm set $\mathcal{X}$, time horizon $T$, exploration length $T_1$, rank $r$ of $\Theta^*$, spectral bound $\omega_r$ of $\Theta^*$, sampling distribution $\mathcal{D}$ for stage 1; parameters for LowOFUL in stage 2: $B, B_\perp, \lambda, \lambda_\perp$.
Stage 1: Explore the Low-Rank Subspace
Pull $X_t \in \mathcal{X}$ according to distribution $\mathcal{D}$ and observe reward $Y_t$, for $t = 1, \ldots, T_1$.
Solve for $\widehat{\Theta}$ via the problem below:

$\widehat{\Theta} = \mathrm{argmin}_{\Theta \in \mathbb{R}^{d_1 \times d_2}}\ \frac{1}{T_1}\sum_{t=1}^{T_1}\big(Y_t - \langle X_t, \Theta\rangle\big)^2 + \lambda_{T_1}\|\Theta\|_{nuc}$.  (5)

Let $\widehat{\Theta} = U\widehat{S}V^\top$ be the SVD of $\widehat{\Theta}$. Take the first $r$ columns of $U$ as $\widehat{U}$ and the first $r$ rows of $V^\top$ as $\widehat{V}^\top$. Let $\widehat{U}_\perp$ and $\widehat{V}_\perp$ be orthonormal bases of the complementary subspaces of $\widehat{U}$ and $\widehat{V}$.
Stage 2: Refine a Standard Linear Bandit Algorithm
Rotate the arm feature set: $\mathcal{X}' := \{[\widehat{U}\ \widehat{U}_\perp]^\top X [\widehat{V}\ \widehat{V}_\perp] : X \in \mathcal{X}\}$.
Define a vectorized arm feature set so that the last $(d_1 - r)(d_2 - r)$ components come from the complementary subspaces: $\mathcal{X}'_{vec} := \{[\mathrm{vec}(X'_{1:r,1:r});\ \mathrm{vec}(X'_{r+1:d_1,1:r});\ \mathrm{vec}(X'_{1:r,r+1:d_2});\ \mathrm{vec}(X'_{r+1:d_1,r+1:d_2})] : X' \in \mathcal{X}'\}$.
For $T_2 = T - T_1$ rounds, invoke LowOFUL (Algorithm 4 in Section H) with arm set $\mathcal{X}'_{vec}$, the low dimension $k = (d_1 + d_2)r - r^2$, $\gamma(T_1) \asymp \frac{d_1 d_2 (d_1+d_2) r}{T_1 \omega_r^2}$, and $B, B_\perp, \lambda, \lambda_\perp$.
Stage 1. We are inspired by a line of work on low-rank matrix recovery using a nuclear-norm penalty with squared loss (Wainwright, 2019). The learner pulls arms $X_t \in \mathcal{X}$ according to distribution $\mathcal{D}$ and observes the rewards $y_t$ up to horizon $T_1$, then uses $\{X_t, y_t\}_{t=1}^{T_1}$ to solve the nuclear-norm penalized least squares problem in (5), obtaining an estimate $\widehat{\Theta}$ of $\Theta^*$. Notably, instead of invoking an NP-hard problem in stage 1 as ESTR does, the optimization problem (5) in LowESTR is convex and can thus be solved easily using standard gradient-based methods. Assumption 2 guarantees that $\|\widehat{\Theta} - \Theta^*\|_F^2 \asymp \frac{d_1 d_2 (d_1+d_2) r}{T_1}$ (Theorem 15 in Section E). We obtain the estimated row/column subspaces of $\Theta^*$ simply by running an SVD step.
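A sketch of this stage-1 computation, assuming (5) is solved by proximal gradient descent with singular-value soft-thresholding (any convex solver would do); the function names are ours.

import numpy as np

def svt(M, tau):
    """Singular-value soft-thresholding: the prox operator of tau * ||.||_nuc."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def nuclear_norm_ls(Xs, ys, lam, d1, d2, iters=500):
    """argmin_Theta (1/n) sum_t (y_t - <X_t, Theta>)^2 + lam * ||Theta||_nuc."""
    n = len(ys)
    A = np.stack([X.reshape(-1) for X in Xs])            # n x (d1*d2) design matrix
    y = np.asarray(ys, dtype=float)
    step = n / (2.0 * np.linalg.norm(A, 2) ** 2)         # 1/L with L = 2*||A||_2^2 / n
    theta = np.zeros(d1 * d2)
    for _ in range(iters):
        grad = (2.0 / n) * A.T @ (A @ theta - y)
        theta = svt((theta - step * grad).reshape(d1, d2), step * lam).reshape(-1)
    return theta.reshape(d1, d2)

def estimated_subspaces(Theta_hat, r):
    """SVD step used after stage 1: estimated row/column subspaces and their complements."""
    U, _, Vt = np.linalg.svd(Theta_hat)
    return U[:, :r], U[:, r:], Vt[:r, :].T, Vt[r:, :].T  # U_hat, U_perp, V_hat, V_perp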
Stage 2. In stage 2, we apply the LowOFUL algorithm (Algorithm 4 in Section H) proposed by Jun et al. (2019) in our setting. The key idea is to reduce the problem to a linear bandit and to utilize the estimated subspaces within the standard linear bandit method OFUL (Abbasi-Yadkori et al., 2011).
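For concreteness, here is a sketch of the rotation and vectorization step performed before invoking LowOFUL (the ordering puts the fully complementary block last, which is the part LowOFUL regularizes heavily via $\lambda_\perp$); the function name is ours.

import numpy as np

def rotate_and_vectorize(X, U_hat, U_perp, V_hat, V_perp):
    """Return the vectorized rotated arm, with the complementary-subspace coordinates last."""
    r1, r2 = U_hat.shape[1], V_hat.shape[1]
    Xp = np.hstack([U_hat, U_perp]).T @ X @ np.hstack([V_hat, V_perp])   # X' = [U_hat U_perp]^T X [V_hat V_perp]
    return np.concatenate([
        Xp[:r1, :r2].ravel(),     # top-left r x r block
        Xp[r1:, :r2].ravel(),     # rows from the complementary row space
        Xp[:r1, r2:].ravel(),     # columns from the complementary column space
        Xp[r1:, r2:].ravel(),     # fully complementary block -> heavily regularized in LowOFUL
    ])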
We now present the overall regret of Algorithm 2.

Theorem 4 (Regret of LowESTR for Low-Rank Bandits). Suppose we run stage 1 of LowESTR with $T_1 \asymp \frac{(d_1+d_2)^{3/2}\sqrt{rT}}{\omega_r}$ and $\lambda_{T_1} \asymp \sqrt{\frac{1}{T_1 \min\{d_1, d_2\}}}$, and invoke LowOFUL in stage 2 with $k = r(d_1 + d_2 - r)$, $\lambda_\perp = \frac{T_2}{k\log(1 + T_2/\lambda)}$, $B = 1$, $B_\perp = \gamma(T_1)$, and the rotated arm set $\mathcal{X}'_{vec}$ defined in Algorithm 2. Then, with probability at least $1 - \delta$, the overall regret of LowESTR is

$R_T = \widetilde{O}\Big(\frac{(d_1+d_2)^{3/2}\sqrt{rT}}{\omega_r}\Big)$.

We believe this "Explore-Subspace-Then-Refine" framework can also be extended to the generalized linear setting. In stage 1, an M-estimator that minimizes the negative log-likelihood plus a nuclear-norm penalty (Fan et al., 2019) can be used instead, while in stage 2, one can revise a standard generalized linear bandit algorithm such as GLM-UCB (Filippi et al., 2010) by leveraging the low-rank knowledge in the same way as LowOFUL. We leave this extension for future work.
Lower Bound for Low-rank Linear Bandit
In this section, we discuss the regret lower bound for the low-rank linear bandit model. Suppose $d_1 = d_2 = d$. We first present an $\Omega(dr\sqrt{T})$ lower bound, which is a straightforward extension of the linear bandit lower bound (Lattimore and Szepesvári, 2018).

Theorem 5 (Lower Bound). Assume $dr \le \sqrt{T}$ and let $\mathcal{X} = \{X \in \mathbb{R}^{d \times d} : \|X\|_F \le 1\}$. Then there exists $\Theta \in \mathbb{R}^{d \times d}$ with $\|\Theta\|_F \le dr/\sqrt{T}$ and $\mathrm{rank}(\Theta) \le r$ such that $\mathbb{E}[R_T(\Theta)] = \Omega(dr\sqrt{T})$.

The above bound is tight when $r = d$, as it matches the standard $d^2$-dimensional linear bandit lower bound, but for small $r$ our upper bound exceeds this lower bound by a factor of $\sqrt{d/r}$.

Nevertheless, we conjecture that $\Omega(d^{3/2}\sqrt{rT})$ is the correct lower bound for small $r$. It is well known that the regret lower bound for the sparse linear bandit problem (dimension $d$, sparsity $s$) is $\Omega(\sqrt{sdT})$ (Lattimore and Szepesvári, 2018). Our problem can be viewed as a $d^2$-dimensional linear bandit problem with $dr$ degrees of freedom in $\Theta^*$. Using the analogy in degrees of freedom between sparse vectors and low-rank matrices, one can plug $d^2$ in for $d$ and $dr$ in for $s$ in the sparse linear bandit lower bound and obtain $\Omega(d^{3/2}\sqrt{rT})$ as our conjectured lower bound.

Experiments

In this section, we compare the performance of OFUL and LowESTR to validate that it is crucial to utilize the low-rank structure. We run our simulations with $d_1 = d_2 = 10$, $r = 1$ and with $d_1 = d_2 = 10$, $r = 3$. In both settings, the true $\Theta^* \in \mathbb{R}^{d_1 \times d_2}$ is a diagonal matrix: for $r = 1$, only the first diagonal entry is nonzero (a fixed value in $(0,1)$), while for $r = 3$, the first three diagonal entries are nonzero values in $(0,1)$ and the rest are zero. For the arms in both settings, we draw 256 vectors from $N(0, I_{d_1 d_2})$ and standardize them by dividing by their 2-norms; we then reshape the standardized $d_1 d_2$-dimensional vectors into $d_1 \times d_2$ matrices and use these matrices as the arm set $\mathcal{X}$. For each arm $X \in \mathcal{X}$, the reward is generated by $y = \langle X, \Theta^*\rangle + \varepsilon$, where $\varepsilon$ is zero-mean Gaussian noise with small variance. We run both algorithms for $T = 3000$ rounds and repeat each simulation setup 100 times to compute the averaged regret and its 1-standard-deviation band at every step. We leave the hyper-parameters of OFUL and LowESTR to the appendix (Section I). Regret comparison plots are displayed in Figure 1.

Figure 1: Regret comparison between OFUL and LowESTR. We plot the averaged cumulative regret with red and blue curves, and the 1-standard-deviation band for each method in the yellow area.

We observe that in both plots LowESTR incurs less regret than OFUL within several hundred time steps. Further, as we increase the rank from $r = 1$ to $r = 3$, the regret gap between the two approaches becomes smaller. This phenomenon is compatible with our theory.

We also conduct simulations to study the sensitivity of LowESTR to $\omega_r$. We observe that LowESTR indeed performs better for large $\omega_r$, which again matches our theory. The detailed description and the plot for this experiment are left to the appendix (Section I).
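For reference, a sketch of the data-generating setup described above; the nonzero diagonal entry (0.8) and the noise standard deviation (0.1) below are placeholder values we chose for illustration, not values taken from the paper.

import numpy as np

rng = np.random.default_rng(4)
d, T, n_arms = 10, 3000, 256
Theta_star = np.zeros((d, d))
Theta_star[0, 0] = 0.8                     # placeholder value; rank(Theta*) = 1
arms = rng.standard_normal((n_arms, d * d))
arms = (arms / np.linalg.norm(arms, axis=1, keepdims=True)).reshape(n_arms, d, d)

def pull(X, noise_sd=0.1):                 # placeholder noise level
    return np.sum(X * Theta_star) + noise_sd * rng.standard_normal()

means = np.array([np.sum(X * Theta_star) for X in arms])
best = means.max()                         # per-round regret of arm i is best - means[i]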
Conclusion & Future Work

In this paper, we studied the low-rank (generalized) linear bandit problem. We proposed the LowLOC and LowGLOC algorithms for the linear and generalized linear settings, respectively; both enjoy $\widetilde{O}\big((d_1+d_2)^{3/2}\sqrt{rT}\big)$ regret. Further, our efficient algorithm LowESTR achieves $\widetilde{O}\big((d_1+d_2)^{3/2}\sqrt{rT}/\omega_r\big)$ regret under mild conditions on the action set. Several interesting directions are left as future work: (1) We provided some preliminary ideas in Section 6 on how to extend LowESTR to the generalized linear setting; we expect that a similar regret bound can be achieved under certain regularity conditions on the link function. (2) We plan to investigate whether one can design an efficient algorithm whose regret does not depend on $1/\omega_r$. (3) As argued in Section 7, $\Omega\big((d_1+d_2)^{3/2}\sqrt{rT}\big)$ is our conjectured tight lower bound; it would be very interesting to formally prove this.
AT acknowledges the support of NSF CAREER grant IIS-1452099 and an Adobe Data Science ResearchAward.
References
Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits.In
Advances in Neural Information Processing Systems , pages 2312–2320.Abbasi-Yadkori, Y., Pal, D., and Szepesvari, C. (2012). Online-to-confidence-set conversions and applicationto sparse stochastic bandits. In
Artificial Intelligence and Statistics , pages 1–9.Audibert, J.-Y., Munos, R., and Szepesvári, C. (2009). Exploration–exploitation tradeoff using varianceestimates in multi-armed bandits.
Theoretical Computer Science , 410(19):1876–1902.Basri, R. and Jacobs, D. W. (2003). Lambertian reflectance and linear subspaces.
IEEE transactions onpattern analysis and machine intelligence , 25(2):218–233.Candes, E. J. and Plan, Y. (2011). Tight oracle inequalities for low-rank matrix recovery from a minimalnumber of noisy random measurements.
IEEE Transactions on Information Theory , 57(4):2342–2359.Candès, E. J. and Recht, B. (2009). Exact matrix completion via convex optimization.
Foundations ofComputational mathematics , 9(6):717.Cesa-Bianchi, N. and Lugosi, G. (2006).
Prediction, learning, and games . Cambridge university press.Fan, J., Gong, W., and Zhu, Z. (2019). Generalized high-dimensional trace regression via nuclear normregularization.
Journal of econometrics , 212(1):177–202.Filippi, S., Cappe, O., Garivier, A., and Szepesvári, C. (2010). Parametric bandits: The generalized linearcase. In
Advances in Neural Information Processing Systems , pages 586–594.Gopalan, A., Maillard, O.-A., and Zaki, M. (2016). Low-rank bandits with latent mixtures. arXiv preprintarXiv:1609.01508 .Johnson, N., Sivakumar, V., and Banerjee, A. (2016). Structured stochastic linear bandits. arXiv preprintarXiv:1606.05693 .Jun, K.-S., Bhargava, A., Nowak, R., and Willett, R. (2017). Scalable generalized linear bandits: Onlinecomputation and hashing. In
Advances in Neural Information Processing Systems , pages 99–109.9un, K.-S., Willett, R., Wright, S., and Nowak, R. (2019). Bilinear bandits with low-rank structure. In
International Conference on Machine Learning , pages 3163–3172.Katariya, S., Kveton, B., Szepesvári, C., Vernade, C., and Wen, Z. (2017a). Bernoulli rank- bandits forclick feedback. arXiv preprint arXiv:1703.06513 .Katariya, S., Kveton, B., Szepesvari, C., Vernade, C., and Wen, Z. (2017b). Stochastic rank-1 bandits. In Artificial Intelligence and Statistics , pages 392–401.Keshavan, R. H., Montanari, A., and Oh, S. (2010). Matrix completion from noisy entries.
Journal ofMachine Learning Research , 11(Jul):2057–2078.Kveton, B., Szepesvari, C., Rao, A., Wen, Z., Abbasi-Yadkori, Y., and Muthukrishnan, S. (2017). Stochasticlow-rank bandits. arXiv preprint arXiv:1712.04644 .Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules.
Advances in appliedmathematics , 6(1):4–22.Lale, S., Azizzadenesheli, K., Anandkumar, A., and Hassibi, B. (2019). Stochastic linear bandits with hiddenlow rank structure. arXiv preprint arXiv:1901.09490 .Lattimore, T. and Szepesvári, C. (2018). Bandit algorithms. preprint .Loh, P.-L. and Wainwright, M. J. (2011). High-dimensional regression with noisy and missing data: Provableguarantees with non-convexity. In
Advances in Neural Information Processing Systems , pages 2726–2734.McMahan, H. B., Holt, G., Sculley, D., Young, M., Ebner, D., Grady, J., Nie, L., Phillips, T., Davydov, E.,Golovin, D., et al. (2013). Ad click prediction: a view from the trenches. In
Proceedings of the 19th ACMSIGKDD international conference on Knowledge discovery and data mining , pages 1222–1230.Richardson, M., Dominowska, E., and Ragno, R. (2007). Predicting clicks: estimating the click-through ratefor new ads. In
Proceedings of the 16th international conference on World Wide Web , pages 521–530.Wainwright, M. J. (2019).
High-dimensional statistics: A non-asymptotic viewpoint , volume 48. CambridgeUniversity Press.Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and variational inference.
Foundations and Trends® in Machine Learning, 1(1-2):1–305.

A Proof for Theorem 1
Lemma 6 (Covering number for low-rank matrices, modified from (Candes and Plan, 2011)) . Let S r = { Θ ∈ R d × d : rank (Θ) ≤ r, (cid:107) Θ (cid:107) F ≤ } . Then there exists an (cid:15) − net ¯ S r for the Frobenius norm obeying | ¯ S r | ≤ (9 /(cid:15) ) ( d + d +1) r . (6) Proof.
Use SVD decomposition:
Θ = U Σ V T of any Θ ∈ S r obeying (cid:107) Σ (cid:107) F ≤ . We will construct an (cid:15) − netfor S r by covering the set of permissible U, V and Σ . Let D be the set of diagonal matrices with nonnegativediagonal entries and Frobenius norm less than or equal to one. We take ¯ D to be an (cid:15)/ -net for D with | ¯ D | ≤ (9 /(cid:15) ) r . Next, let O d ,r = { U ∈ R d × r : U T U = I } . To cover O d ,r , we use the (cid:107)·(cid:107) , norm defined as (cid:107) U (cid:107) , = max i (cid:107) U i (cid:107) (cid:96) , (7)where U i denotes the i th column of Θ . Let Q d ,r = { U ∈ R d × r : (cid:107) U (cid:107) , ≤ } . It is easy to see that O d ,r ⊂ Q d ,r since the columns of an orthogonal matrix are unit normed. We see that there is an (cid:15)/ -net ¯ O d ,r for O d ,r obeying | ¯ O d ,r | ≤ (9 /(cid:15) ) d r . Similarly, let P d ,r = { V ∈ R d × r : V T V = I } . Define R d ,r = { V ∈ R d × r : (cid:107) V (cid:107) , ≤ } , we have P d ,r ⊂ R d ,r . By the same argument, there is an (cid:15)/ -net ¯ P d ,r for P d ,r obeying | ¯ P d ,r | ≤ (9 /(cid:15) ) d r . We now let ¯ S r = { ¯ U ¯Σ ¯ V T : ¯ U ∈ O d ,r , ¯ V ∈ P d ,r , ¯Σ ∈ ¯ D } , and remarkthat | ¯ S r | ≤ | ¯ O d ,r | | ¯ D || ¯ P d ,r | ≤ (9 /(cid:15) ) ( d + d +1) r . It remains to show that for all Θ ∈ S r , there exists ¯Θ ∈ ¯ S r with (cid:13)(cid:13) Θ − ¯Θ (cid:13)(cid:13) F ≤ (cid:15) .Fix Θ ∈ S r and decompose it as Θ = U Σ V T . Then there exists ¯Θ = ¯ U ¯Σ ¯ V T ∈ ¯ S r with ¯ U ∈ O d ,r , ¯ V ∈ P d ,r , ¯Σ ∈ ¯ D satisfying (cid:13)(cid:13) U − ¯ U (cid:13)(cid:13) , ≤ (cid:15)/ , (cid:13)(cid:13) V − ¯ V (cid:13)(cid:13) , ≤ (cid:15)/ and (cid:13)(cid:13) Σ − ¯Σ (cid:13)(cid:13) F ≤ (cid:15)/ . This gives (cid:13)(cid:13) Θ − ¯Θ (cid:13)(cid:13) F = (cid:13)(cid:13) U Σ V T − ¯ U ¯Σ ¯ V T (cid:13)(cid:13) F (8) = (cid:13)(cid:13) U Σ V T − ¯ U Σ V T + ¯ U Σ V T − ¯ U ¯Σ V T + ¯ U ¯Σ V T − ¯ U ¯Σ ¯ V T (cid:13)(cid:13) F (9) ≤ (cid:13)(cid:13) ( U − ¯ U )Σ V T (cid:13)(cid:13) F + (cid:13)(cid:13) ¯ U (Σ − ¯Σ) V T (cid:13)(cid:13) F + (cid:13)(cid:13) ¯ U ¯Σ( V − ¯ V ) T (cid:13)(cid:13) F . (10)For the first term, since V is an orthogonal matrix, (cid:13)(cid:13) ( U − ¯ U )Σ V T (cid:13)(cid:13) F = (cid:13)(cid:13) ( U − ¯ U )Σ (cid:13)(cid:13) F (11) ≤ (cid:107) Σ (cid:107) F (cid:13)(cid:13) U − ¯ U (cid:13)(cid:13) , ≤ ( (cid:15)/ . (12)Thus we have shown (cid:13)(cid:13) ( U − ¯ U )Σ V T (cid:13)(cid:13) F ≤ (cid:15)/ , by the same argument, we also have (cid:13)(cid:13) ¯ U ¯Σ( V − ¯ V ) T (cid:13)(cid:13) F ≤ (cid:15)/ .For the second term, (cid:13)(cid:13) ¯ U (Σ − ¯Σ) V T (cid:13)(cid:13) F = (cid:13)(cid:13) Σ − ¯Σ (cid:13)(cid:13) F ≤ (cid:15)/ . This completes the proof. Lemma 7 (Online-to-Confidence-Set Conversion (adapted from Theorem 1 in Abbasi-Yadkori et al. (2012))) . Suppose we feed { ( X s , y s ) } ts =1 into an online prediction algorithm which, for all t ≥ , admits a regret sup (cid:107) Θ (cid:107) F ≤ ρ t (Θ) ≤ B t . Let ˆ y s be the prediction at time step s by the online learner. Then, for any δ ∈ (0 , . , with probability at least − δ , we have P ( ∃ t ∈ N such that Θ ∗ / ∈ C t +1 ) ≤ δ, (13) where we define β t ( δ ) = 1 + 2 B t + 32 log (cid:32) √ √ B t δ (cid:33) (14) C t +1 = { Θ ∈ R d × d : (cid:107) Θ (cid:107) F + t (cid:88) s =1 (ˆ y s − (cid:104) Θ , X s (cid:105) ) ≤ β t ( δ ) } . (15)11 emma 8 (Regret of LowLOC Given Online Learner’s Regret (adapted from Theorem 3 in Abbasi-Yadkoriet al. (2012))) . 
Suppose sup (cid:107) Θ (cid:107) F ≤ , rank (Θ) ≤ r ρ t (Θ) ≤ B t , where { B t } Tt =1 is a non-decreasing sequence. Then,for any δ ∈ (0 , . , with probability at least − δ , for any T ≥ , the regret of LowLOC algorithm is boundedas R T = O (cid:32)(cid:115) d d T (1 + β T − ( δ )) log (cid:18) Td d (cid:19)(cid:33) , (16) where β t ( δ ) = 1 + 2 B t + 32 log (cid:16) √ √ B t δ (cid:17) . Lemma 9 (Theorem 3.2 in (Cesa-Bianchi and Lugosi, 2006)) . If the loss function (cid:96) ( a, b ) is exp-concave in itsfirst argument for some η > (i.e. F ( a ) = e − η(cid:96) ( a,b ) is concave for all b ), then the regret of the exponentiallyweighted average forecaster in Equation 3 (used with the same value of η ) satisfies, for all y , . . . , y n ∈ Y ,we have Φ η ( R n ) ≤ Φ η (0) . Lemma 10 (Proposition 3.1 in (Cesa-Bianchi and Lugosi, 2006)) . If for some loss function (cid:96) and for some η > , a forecaster satisfies Φ η ( R n ) ≤ Φ η ( ) for all y , . . . , y n ∈ Y , then the regret of the forecaster isbounded by (cid:98) L n − min i =1 ,...,N L i,n ≤ log( N ) η . (17) Proof of Lemma 2.
Let y t = (cid:104) X t , Θ ∗ (cid:105) + η t . By subgaussian property, we have, for < δ < , P (cid:32) max t =1 ,...,T | y t | > (cid:115) (cid:18) Tδ (cid:19)(cid:33) ≤ δ. (18)Let’s denote above high probability event (cid:110) max t =1 ,...,T | y t | ≤ (cid:113) (cid:0) Tδ (cid:1)(cid:111) by G , denote the onlineprediction at every round by ˆ y t . Define the ε -covering set for S r := { Θ : (cid:107) Θ (cid:107) F ≤ , rank (Θ) ≤ r } by ¯ S r , which means, for any Θ ∈ S r , there exists a ¯Θ ∈ ¯ S r , such that (cid:13)(cid:13) Θ − ¯Θ (cid:13)(cid:13) F ≤ ε . We prove that | ¯ S r | ≤ (9 /ε ) ( d + d +1) r in Lemma 6.One can easily show that F ( a ) := e − η ( a − b ) is concave in a for all | b | ≤ (cid:113) (cid:0) Tδ (cid:1) (this holds underevent G ) by choosing η = (cid:113) ( Tδ ) ) , since a refers to the prediction of exponential weighted averageforcaster and thus we have | a | ≤ according to the construction. So under event G , the squared loss (cid:96) isguaranteed to be exp-concave under above η and Lemma 10 can be applied here.12e now bound the regret under event G . For an arbitrary Θ ∈ S r , ρ T (Θ) = (cid:88) Tt =1 ( (cid:96) t ( (cid:98) y t ) − (cid:96) t ( f Θ ,t )) (19) = T (cid:88) t =1 (cid:0) (cid:96) t ( (cid:98) y t ) − (cid:96) t ( f ¯Θ ,t ) + (cid:96) t ( f ¯Θ ,t ) − (cid:96) t ( f Θ ,t ) (cid:1) where (cid:13)(cid:13) Θ − ¯Θ (cid:13)(cid:13) F ≤ (cid:15), ¯Θ ∈ ¯ S r (20) ≤ log | ¯ S r | η + T (cid:88) t =1 (cid:0) (cid:96) t ( f ¯Θ ,t ) − (cid:96) t ( f Θ ,t ) (cid:1) by Lemma (21) = log | ¯ S r | η + T (cid:88) t =1 (cid:0) ( (cid:104) ¯Θ , X t (cid:105) − y t ) − ( (cid:104) Θ , X t (cid:105) − y t ) (cid:1) (22) ≤ log | ¯ S r | η + T (cid:88) t =1 (cid:0) (cid:13)(cid:13) Θ − ¯Θ (cid:13)(cid:13) F + 2 y t (cid:13)(cid:13) Θ − ¯Θ (cid:13)(cid:13) F (cid:1) (23) ≤ log | ¯ S r | η + 2 T ε + 2
T ε (cid:115) (cid:18) Tδ (cid:19) (24) = 2( d + d + 1) r log( 9 ε ) (cid:32) (cid:115) (cid:18) Tδ (cid:19)(cid:33) + 2 T ε + 2
T ε (cid:115) (cid:18) Tδ (cid:19) (25) = O (cid:18) ( d + d ) r log( T ) log (cid:18) Tδ (cid:19)(cid:19) set ε = 1 /T. (26)Above bounds hold for all Θ ∈ S r . This completes the proof. Proof of Theorem 1.
To obtain Theorem 1, one just needs to plug Lemma 2 into Lemma 8.
B Proof for Theorem 3
Lemma 11 (Online-to-Confidence-Set Conversion with NLL loss) . Suppose we feed { ( X s , y s ) } ts =1 into anonline prediction algorithm which, for all t ≥ , admits a regret under negative log likelihood (NLL) loss sup (cid:107) Θ (cid:107) F ≤ ρ GLB t (Θ) ≤ B t . Let ˆ y s be the prediction at time step s by the online learner. Then, for any δ ∈ (0 , . , with probability at least − δ , we have P ( ∃ t ∈ N such that Θ ∗ / ∈ C t +1 ) ≤ δ, (27) where C t = { Θ ∈ R d × d : (cid:107) Θ (cid:107) F + (cid:80) ts =1 (ˆ y s − (cid:104) Θ ∗ , X s (cid:105) ) ≤ β GLB t ( δ ) } and β GLB t ( δ ) = 2 + κ µ B t + R κ µ log R (cid:114) κ µ + (cid:113) κµ B t +1 δ .Proof. According to the definition of ρ GLB t ( · ) , we have B t ≥ ρ GLB t (Θ ∗ ) (28) = t (cid:88) s =1 (cid:96) s (ˆ y s ) − (cid:96) s ( (cid:104) Θ ∗ , X s (cid:105) ) (29) ≥ t (cid:88) s =1 (ˆ y s − (cid:104) Θ ∗ , X s (cid:105) ) (cid:96) (cid:48) s ( (cid:104) Θ ∗ , X s (cid:105) ) + κ µ y s − (cid:104) Θ ∗ , X s (cid:105) ) (Taylor expansion of (cid:96) s at (cid:104) Θ ∗ , X s (cid:105) ) = t (cid:88) s =1 (ˆ y s − (cid:104) Θ ∗ , X s (cid:105) )( − η s ) + κ µ y s − (cid:104) Θ ∗ , X s (cid:105) ) . (30)13hus, rearranging the terms, we have t (cid:88) s =1 (ˆ y s − (cid:104) Θ ∗ , X s (cid:105) ) ≤ κ µ B t + 2 κ µ t (cid:88) s =1 η s (ˆ y s − (cid:104) Θ ∗ , X s (cid:105) ) . (31)The remaining proof simply follows the proof of Lemma 7. One can easily conclude that for any δ ∈ (0 , . ,with probability at least − δ t (cid:88) s =1 (ˆ y s − (cid:104) Θ ∗ , X s (cid:105) ) ≤ κ µ B t + 32 R κ µ log R (cid:113) κ µ + (cid:113) κ µ B t + 1 δ . (32)Adding (cid:107) Θ ∗ (cid:107) F on both sides and using the fact that (cid:107) Θ ∗ (cid:107) F ≤ , we complete the proof. Lemma 12 (Regret of LowGLOC Given Online Learner’s Regret) . Suppose sup (cid:107) Θ (cid:107) F ≤ ρ GLB T (Θ) ≤ B GLB T .Then, for any δ ∈ (0 , . , with probability at least − δ , for any T ≥ , the regret of LowGLOC algorithmis bounded by R T = O (cid:32) L (cid:115) β GLB T − ( δ ) T d d log (cid:18) Td d (cid:19)(cid:33) , (33) where β GLB t ( δ ) = 2 + κ µ B GLB t + R κ µ log R (cid:114) κ µ + (cid:113) κµ B GLB t +1 δ ∀ t .Proof. Define V t − = I + (cid:80) t − s =1 vec ( X s ) T vec ( X s ) and (cid:98) Θ t = argmin Θ ∈ R d × d (cid:32) (cid:107) Θ (cid:107) F + t − (cid:88) s =1 (ˆ y s − (cid:104) Θ , X s (cid:105) ) (cid:33) . (34)One can express C t − as { Θ ∈ R d × d : vec (Θ − (cid:98) Θ t ) T V t − vec (Θ − (cid:98) Θ t ) + (cid:13)(cid:13)(cid:13) (cid:98) Θ t (cid:13)(cid:13)(cid:13) F + t − (cid:88) s =1 (ˆ y s − (cid:104) Θ , X s (cid:105) ) ≤ β t − ( δ ) } . (35)Thus, C t − is contained in a bigger ellipsoid C t − ⊆ { Θ ∈ R d × d : vec (Θ − (cid:98) Θ t ) T V t − vec (Θ − (cid:98) Θ t ) ≤ β t − ( δ ) } . (36)Now consider the regret at round t , µ ( (cid:104) X ∗ , Θ ∗ (cid:105) ) − µ ( (cid:104) X t , Θ ∗ (cid:105) ) ≤ L µ | ( (cid:104) X ∗ , Θ ∗ (cid:105) − (cid:104) X t , Θ ∗ (cid:105) ) | (37) ≤ L µ (cid:16) (cid:104) X t , (cid:101) Θ t − Θ ∗ (cid:105) (cid:17) (38) ≤ L µ |(cid:104) X t , (cid:101) Θ t − (cid:98) Θ (cid:105)| + L µ |(cid:104) X t , (cid:98) Θ t − Θ ∗ (cid:105)| (39) ≤ L µ (cid:112) β t − ( δ ) (cid:107) vec ( X t ) (cid:107) V − t − (Cauchy Schwartz) . 
(40)14ince the regret at every step cannot be bigger than L , R T = T (cid:88) t =1 µ ( (cid:104) X ∗ , Θ ∗ (cid:105) ) − µ ( (cid:104) X t , Θ ∗ (cid:105) ) (41) = T (cid:88) t =1 min (cid:110) L µ , L µ (cid:112) β t − ( δ ) (cid:107) vec ( X t ) (cid:107) V − t − (cid:111) (42) = 2 L µ (cid:112) β t − ( δ ) T (cid:88) t =1 min (cid:26) β t − ( δ ) , (cid:107) vec ( X t ) (cid:107) V − t − (cid:27) (43) ≤ L µ (cid:112) β t − ( δ ) (cid:118)(cid:117)(cid:117)(cid:116) T T (cid:88) t =1 min (cid:26) β t − ( δ ) , (cid:107) vec ( X t ) (cid:107) V − t − (cid:27) (44) ≤ L µ (cid:112) β t − ( δ ) (cid:118)(cid:117)(cid:117)(cid:116) T T (cid:88) t =1 min (cid:110) , (cid:107) vec ( X t ) (cid:107) V − t − (cid:111) ( β t − ( δ ) is greater than 1) (45) ≤ L µ (cid:112) β t − ( δ ) (cid:115) T d d log (cid:18) Td d (cid:19) (46) = O (cid:32) L µ (cid:115) β t − ( δ ) T d d log (cid:18) Td d (cid:19)(cid:33) . (47) Lemma 13 (Regret of EW under NLL Loss) . Let EW parameter η := κ µ (cid:18)(cid:113) R log ( Tδ ) +2 c µ +2 L µ (cid:19) . Then, forany < δ < , with probability at least − δ , the regret of EW with expert predictions f Θ ,t = (cid:104) Θ , X t (cid:105) underNLL loss satisfies B GLB T = sup (cid:107) Θ (cid:107) F ≤ , rank (Θ) ≤ r ρ GLB T (Θ) = O (cid:32) ( d + d ) r log T log (cid:0) Tδ (cid:1) L µ + c µ + L µ κ µ (cid:33) (48) = (cid:101) O (cid:32) L µ + c µ κ µ ( d + d ) r log (cid:18) δ (cid:19)(cid:33) . (49) Proof.
Under generalized linear bandit model, y t = µ ( (cid:104) X t , Θ ∗ (cid:105) )+ η t . By subgaussian property and | µ ( (cid:104) X t , Θ ∗ (cid:105) ) | ≤| µ (0) | + L µ |(cid:104) X t , Θ ∗ (cid:105)| ≤ c µ + L µ , for < δ < , we have P (cid:32) max t =1 ,...,T | y t | > c µ + L µ + (cid:115) R log (cid:18) Tδ (cid:19)(cid:33) ≤ δ. (50)Again we denote above high probability event by G , denote the exponential weighted average forecaster atevery round by ˆ y t . We use the same definition S r and ¯ S r as last section.We use Lemma 9 and Lemma 10 to bound ρ GLB T (Θ) . Then the first step is to find a proper η > suchthat F (ˆ y t ) := e (cid:96) (ˆ y t ,y t ) = e − ηm (ˆ y t )+ η ˆ y t y t is concave. Taking derivatives we have, F (cid:48)(cid:48) (ˆ y t ) = ηe − ηm (ˆ y t )+ η ˆ y t y t (cid:0) η ( y t − µ (ˆ y t )) − µ (cid:48) (ˆ y t ) (cid:1) . (51)Under event G , it’s easy to show that µ (cid:48) (ˆ y t )( y t − µ (ˆ y t )) ≥ κ µ (cid:16)(cid:113) R log (cid:0) Tδ (cid:1) + 2 c µ + 2 L µ (cid:17) , (52)15ince | µ (ˆ y t ) | ≤ | µ (0) | + L | ˆ y t | ≤ c µ + L µ . Thus, taking η := κ µ (cid:18)(cid:113) R log ( Tδ ) +2 c µ +2 L µ (cid:19) , F ( · ) is guaranteed tobe concave with probability under event G . ρ GLB T (Θ) = T (cid:88) t =1 ( (cid:96) t (ˆ y t ) − (cid:96) t ( (cid:104) Θ , X t (cid:105) )) (53) ≤ T (cid:88) t =1 (cid:0) (cid:96) t (ˆ y t ) − (cid:96) t ( (cid:104) ¯Θ , X t (cid:105) ) + (cid:96) t ( (cid:104) ¯Θ , X t (cid:105) ) − (cid:96) t ( (cid:104) Θ , X t (cid:105) ) (cid:1) where (cid:13)(cid:13) Θ − ¯Θ (cid:13)(cid:13) F ≤ ε and ¯Θ ∈ ¯ S r (54) ≤ log | ¯ S r | η + T (cid:88) t =1 (cid:0) (cid:96) t ( (cid:104) ¯Θ , X t (cid:105) ) − (cid:96) t ( (cid:104) Θ , X t (cid:105) ) (cid:1) (55) ≤ log | ¯ S r | η + T (cid:88) t =1 (cid:104) Θ − ¯Θ , X t (cid:105) y t + m ( (cid:104) ¯Θ , X t (cid:105) ) − m ( (cid:104) Θ , X t (cid:105) ) (56) ≤ log | ¯ S r | η + T (cid:88) t =1 | y t | (cid:13)(cid:13) Θ − ¯Θ (cid:13)(cid:13) F + |(cid:104) ¯Θ − Θ , X t (cid:105) ( c µ + L µ ) | (By Taylor expansion) (57) ≤ ( d + d + 1) r log (cid:18) ε (cid:19) (cid:16)(cid:113) R log (cid:0) Tδ (cid:1) + 2 c µ + 2 L µ (cid:17) κ µ (58) + T (cid:32) c µ + 2 L µ + (cid:115) R log (cid:18) Tδ (cid:19)(cid:33) ε (59) = O (cid:32) ( d + d ) r log T log (cid:0) Tδ (cid:1) L µ + c µ + L µ κ µ (cid:33) , (60)where we take ε = 1 /T . Proof for Theorem 3.
One only needs to plug Lemma 13 into Lemma 12.
C Proof for Theorem 4
The whole proof breaks down to two parts. Let Θ ∗ = U ∗ S ∗ V ∗ T be the SVD of Θ ∗ . In the first part, weprove the convergence of estimated matrix (cid:98) Θ for Θ ∗ , (cid:98) U for U ∗ , and (cid:98) V for V ∗ . In the second part, we plugthe convergence result into the regret guarantee for LowOFUL in Jun et al. (2019) to achieve our final result. C.1 Analysis for Stage 1
In order to analyze how the estimated subspaces are close to the true subspaces, we first present the definitionsfor sub-Gaussian matrix and restricted strong convexity (RSC) as below.
Definition 1 (sub-Gaussian matrix (See Wainwright (2019))) . A random matrix Z ∈ R n × p is sub-Gaussianwith parameters (Σ , σ ) if: • each row z Ti ∈ R p is sampled independently from a zero-mean distribution with covariance Σ , and • for any unit vector u ∈ R p , the random variable u T z i is sub-Gaussian with parameter at most σ .16 efinition 2 (Restricted Strong Convexity (RSC) (Wainwright, 2019)) . For a given norm (cid:107)·(cid:107) , regularizer Φ( · ) , and X , . . . , X n ∈ R d × d , the matrix (cid:98) Γ = n (cid:101) X T (cid:101) X , where ˜ x i := vec ( X i ) and (cid:101) X := [˜ x T ; . . . ; ˜ x Tn ] , satisfiesa restricted strong convexity (RSC) condition with curvature κ > and tolerance τ n if (cid:101) ∆ T (cid:98) Γ (cid:101) ∆ = 1 n n (cid:88) t =1 (cid:104) X t , ∆ (cid:105) ≥ κ (cid:107) ∆ (cid:107) − τ n Φ (∆) , (61)for all ∆ ∈ R d × d , and we denote vec (∆) by (cid:101) ∆ .We prove the following theorem about distribution D (see Assumption 2) as below, see proof in Section D. Theorem 14 (Distribution D satisfies RSC) . Sample X , . . . , X n ∈ R d × d from X according to D , anddefine ˜ x i := vec ( X i ) , (cid:101) X = [˜ x T ; . . . ; ˜ x Tn ] ∈ R n × d d and (cid:98) Γ := n (cid:101) X T (cid:101) X . Then under Assumption 2, there existsconstants c , c > , such that with probability − δ , (cid:101) Θ T (cid:98) Γ (cid:101) Θ = 1 n n (cid:88) i =1 (cid:104) X i , Θ (cid:105) ≥ c d d (cid:107) Θ (cid:107) F − c ( d + d ) nd d (cid:107) Θ (cid:107) nuc , ∀ Θ ∈ R d × d , (62) for n = Ω (cid:0) ( d + d ) log (cid:0) δ (cid:1)(cid:1) , where (cid:101) Θ := vec (Θ) . Theorem 14 states that sampling X from X according to distribution D guarantees that the sampledarms satisfies RSC condition. We further show that under RSC condition, the estimated (cid:98) Θ is guaranteed toconverge to Θ at a fast rate in Theorem 15. Theorem 15.
Sample X , . . . , X n ∈ R d × d from X according to D . Then under Assumption 2, any optimalsolution to the nuclear norm optimization problem 5 using λ n (cid:16) n min { d ,d } log (cid:0) nδ (cid:1) log (cid:0) d + d δ (cid:1) satisfies: (cid:13)(cid:13)(cid:13) (cid:98) Θ − Θ ∗ (cid:13)(cid:13)(cid:13) F (cid:16) ( d + d ) rn , (63) with probability − δ . The goal of stage 1 is to estimate the row/column subspaces of Θ ∗ , below corollary characterizes theirconvergence. Corollary 16 (adapted from Jun et al. (2019)) . Suppose we compute (cid:98) Θ by solving the convex problem inEquation 5 as an estimate of the matrix Θ ∗ . After stage 1 of ESTR with T = Ω ( r ( d + d )) satisfying thecondition of Theorem 15, we have, with probability at least − δ , (cid:13)(cid:13)(cid:13) (cid:98) U T ⊥ U ∗ (cid:13)(cid:13)(cid:13) F (cid:13)(cid:13)(cid:13) (cid:98) V T ⊥ V ∗ (cid:13)(cid:13)(cid:13) F ≤ (cid:13)(cid:13)(cid:13) Θ ∗ − (cid:98) Θ (cid:13)(cid:13)(cid:13) F ω r ≤ C λ T rα ω r := γ ( T ) (cid:16) ( d + d ) rT ω r , (64) where ω r > denotes the lower bound of the r -th singular value of Θ ∗ and C represents some constant. C.2 Analysis for Stage 2
We present the useful lemmas proved in Jun et al. (2017) and combine them with our analysis of stage 1 toachieve the final result of Theorem 4.
Lemma 17 (Corollary 1 in (Jun et al., 2019)) . The regret of LowOFUL with λ ⊥ = Tk log ( Tλ ) is, withprobability at least − δ , (cid:101) O (cid:16)(cid:16) k + √ kλB + √ T B ⊥ (cid:17) √ T (cid:17) . (65)17 emma 18 (Modified from Theorem 5 in (Jun et al., 2019)) . Suppose we run ESTR stage 1 with T =Ω ( r ( d + d )) . We invoke LowOFUL in stage 2 with λ ⊥ = T k log(1+ T /λ ) , B = 1 , B ⊥ = γ ( T ) , the rotated armsets X (cid:48) vec defined in LowESTR (Algorithm 2). With probability − δ , the regret of LowESTR is boundedby (cid:101) O (cid:18) T + T · ( d + d ) rT ω r (cid:19) . (66) Proof.
Combining Lemma 17 and definitions of parameters B , B ⊥ , λ , λ ⊥ and γ ( T ) . Proof for Theorem 4.
Suppose the assumptions in Lemma 18 hold. Setting T = Θ (cid:16) ( d + d ) / √ rT ω r (cid:17) inLemma 18 leads to the regret (cid:101) O (cid:18) ( d + d ) / √ rT ω r (cid:19) . (67) D Proof for Theorem 14
Throughout this proof, we use Σ and σ to denote the sub-Gaussian parameters defined in Definition 1 formatrix (cid:101) X in the theorem. D.1 Useful Lemmas
Lemma 19.
For any constant s ≥ , we have B nuc ( √ s ) ∩ B F (1) ⊆ cl { conv { B rank ( s ) ∩ B F (1) }} , (68) where the balls are taken in R d × d , and cl {·} and conv {·} denote the topological closure and convex hull,respectively.Proof. Note that when s > min { d , d } , the statement is trivial, since the right-hand set equals R F (3) , andthe left-hand set is contained in B F (1) . Hence, we will assume ≤ s ≤ min { d , d } .Let A, B ⊆ R d × d be closed convex sets, with support function given by φ A ( z ) = sup Θ ∈ A (cid:104) Θ , z (cid:105) and φ B similarly defined. It is well-known that φ A ( z ) ≤ φ B ( z ) if and only if A ⊆ B . We will now check thiscondition for the pair of sets A = B nuc ( √ s ) ∩ B F (1) and B = 3 cl { conv { B rank ( s ) ∩ B F (1) }} .For any z ∈ R d × d , take r := min { d , d } , we have z = U Σ V T by SVD, where U ∈ R d × r , Σ ∈ R r × r ,and V ∈ R d × r . Let S ⊆ { , . . . , r } be subset indexes for the top (cid:98) s (cid:99) elements of diag (Σ) . We use U S and V S to denote submatrices of U and V with columns of indices in S and use Σ S to denote the submatrix of Σ with columns and rows of indices in S . Then we can write z = U S Σ S V TS + U ⊥ S Σ ⊥ S V ⊥ TS .Consider φ A ( z ) below: φ A ( z ) = sup Θ ∈ A (cid:104) Θ , U S Σ S V TS + U ⊥ S Σ ⊥ S V ⊥ TS (cid:105) (69) ≤ sup (cid:107) U S U TS Θ (cid:107) F ≤ (cid:104) U S U TS Θ , U S Σ S V TS (cid:105) + sup (cid:107) U ⊥ S U ⊥ TS Θ (cid:107) nuc ≤√ s (cid:104) U ⊥ S U ⊥ TS Θ , U ⊥ S Σ ⊥ S V ⊥ TS (cid:105) (70) ≤ (cid:13)(cid:13) U S Σ S V TS (cid:13)(cid:13) F + √ s (cid:13)(cid:13) U ⊥ S Σ ⊥ S V ⊥ TS (cid:13)(cid:13) op by Holder inequality (71) ≤ (cid:13)(cid:13) U S Σ S V TS (cid:13)(cid:13) F + √ s (cid:98) s (cid:99) (cid:13)(cid:13) U S Σ S V TS (cid:13)(cid:13) nuc ≤ (cid:13)(cid:13) U S Σ S V TS (cid:13)(cid:13) F . (72)Finally, note that φ B ( z ) = sup Θ ∈ B (cid:104) Θ , z (cid:105) = 3 max | S | = (cid:98) s (cid:99) sup (cid:107) U S U TS Θ (cid:107) F ≤ (cid:104) U S U TS Θ , U S Σ S V TS (cid:105) = 3 (cid:13)(cid:13) U S Σ S V TS (cid:13)(cid:13) F ,from which the claim follows. 18 efinition 3. Define K ( s ) := B rank ( s ) ∩ B F (1) and the cone set C ( s ) := { v : (cid:107) v (cid:107) nuc ≤ √ s (cid:107) v (cid:107) F } , all matricesdefined in these sets are in R d × d . Lemma 20.
For a fixed matrix Γ ∈ R d d × d d , parameter s ≥ , and tolerance δ > , suppose we have thedeviation condition (˜ v := vec ( v )) | ˜ v T Γ˜ v | ≤ δ, ∀ v ∈ K (2 s ) , (73) where K (2 s ) is defined in Definition 3. Then | ˜ v T Γ˜ v | ≤ δ ( (cid:107) v (cid:107) F + 1 s (cid:107) v (cid:107) nuc ) , ∀ v ∈ R d × d . (74) Proof.
We begin by establishing the inequalities | ˜ v T Γ˜ v | ≤ δ (cid:107) v (cid:107) F , ∀ v ∈ C ( s ) , (75) | ˜ v T Γ˜ v | ≤ δs (cid:107) v (cid:107) nuc , ∀ v / ∈ C ( s ) , (76)where C ( s ) is defined in Definition 3, the statement of this lemma then follows immediately. By rescaling,inequality 75 follows if we can show that | ˜ v T Γ˜ v | ≤ δ for all v such that (cid:107) v (cid:107) F = 1 and (cid:107) v (cid:107) nuc ≤ √ s. (77)By Lemma 19 and continuity, we further reduce the problem to proving the bound 77 for all vectors v ∈ conv { K ( s ) } = conv { B rank ( s ) ∩ B F (3) } . Consider a weighted lienar combination of the form v = (cid:80) i α i v i ,with weights α i ≥ such that (cid:80) i α i = 1 , and rank ( v i ) ≤ s and (cid:107) v i (cid:107) F ≤ for each i . We can write ˜ v Γ˜ v = (cid:88) i,j α i α j (˜ v Ti Γ˜ v j ) . (78)Applying inequality 74 to the vectors v i / , v j / and ( v i + v j ) / , we have | ˜ v Ti Γ˜ v j | = 12 | (˜ v i + ˜ v j ) T Γ(˜ v i + ˜ v j ) − ˜ v Ti Γ˜ v i − ˜ v Tj Γ˜ v j | ≤
$$|\tilde{v}_i^T \Gamma \tilde{v}_j| = \frac{1}{2} \left| (\tilde{v}_i + \tilde{v}_j)^T \Gamma (\tilde{v}_i + \tilde{v}_j) - \tilde{v}_i^T \Gamma \tilde{v}_i - \tilde{v}_j^T \Gamma \tilde{v}_j \right| \le \frac{1}{2}(36 + 9 + 9)\delta = 27\delta \qquad (79)$$
for all $i, j$, and hence $|\tilde{v}^T \Gamma \tilde{v}| \le \sum_{i,j} \alpha_i \alpha_j (27\delta) = 27\delta\, \|\alpha\|_1^2 = 27\delta$, establishing inequality (75). Now let us turn to inequality (76). For $v \notin C(s)$, we have
$$\frac{|\tilde{v}^T \Gamma \tilde{v}|}{\|v\|_{\mathrm{nuc}}^2} \le \frac{1}{s} \sup_{\|u\|_{\mathrm{nuc}} \le \sqrt{s},\, \|u\|_F \le 1} |\tilde{u}^T \Gamma \tilde{u}| \le \frac{27\delta}{s}, \qquad (80)$$
where the first inequality follows by the substitution $u = \sqrt{s}\, v / \|v\|_{\mathrm{nuc}}$ (which satisfies $\|u\|_{\mathrm{nuc}} = \sqrt{s}$ and $\|u\|_F \le 1$ since $v \notin C(s)$), and the second by the same argument used for inequality (75). Rearranging the above inequality establishes inequality (76).

Lemma 21 (RSC condition). Suppose $s \ge 1$ and $\widehat{\Gamma}$ is an estimator of $\Sigma$ satisfying the deviation condition (with $\tilde{v} := \mathrm{vec}(v)$)
$$|\tilde{v}^T (\widehat{\Gamma} - \Sigma) \tilde{v}| \le \frac{\lambda_{\min}(\Sigma)}{54}, \quad \forall v \in K(2s), \qquad (81)$$
where $K(2s)$ is defined in Definition 3. Then we have the RSC condition
$$\tilde{v}^T \widehat{\Gamma} \tilde{v} \ge \frac{\lambda_{\min}(\Sigma)}{2} \|v\|_F^2 - \frac{\lambda_{\min}(\Sigma)}{2s} \|v\|_{\mathrm{nuc}}^2. \qquad (82)$$

Proof. This result follows easily from Lemma 20. Setting
$\Gamma = \widehat{\Gamma} - \Sigma$ and $\delta = \lambda_{\min}(\Sigma)/54$ in Lemma 20, we have the bound
$$|\tilde{v}^T (\widehat{\Gamma} - \Sigma) \tilde{v}| \le \frac{\lambda_{\min}(\Sigma)}{2} \left( \|v\|_F^2 + \frac{1}{s} \|v\|_{\mathrm{nuc}}^2 \right). \qquad (83)$$
Then
$$\tilde{v}^T \widehat{\Gamma} \tilde{v} \ge \tilde{v}^T \Sigma \tilde{v} - \frac{\lambda_{\min}(\Sigma)}{2} \left( \|v\|_F^2 + \frac{1}{s} \|v\|_{\mathrm{nuc}}^2 \right) \qquad (84)$$
$$\ge \frac{\lambda_{\min}(\Sigma)}{2} \|v\|_F^2 - \frac{\lambda_{\min}(\Sigma)}{2s} \|v\|_{\mathrm{nuc}}^2, \qquad (85)$$
where the last inequality follows from $\tilde{v}^T \Sigma \tilde{v} \ge \lambda_{\min}(\Sigma)\, \|v\|_F^2$.
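As a numerical illustration of the RSC conclusion (82) (not used in the analysis), the following sketch forms an empirical second-moment estimate $\widehat{\Gamma}$ from i.i.d. samples and checks the inequality on random low-rank test matrices; the dimensions, sample size, and covariance below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d1, d2, s, n = 5, 4, 2, 20000
p = d1 * d2

# Ground-truth covariance Sigma of vec(X) and its empirical estimate Gamma_hat.
A = rng.standard_normal((p, p))
Sigma = A @ A.T / p + 0.5 * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Gamma_hat = X.T @ X / n
lam_min = np.linalg.eigvalsh(Sigma)[0]

# Check the RSC-type inequality (82) on random low-rank directions v.
ok = True
for _ in range(200):
    v = rng.standard_normal((d1, s)) @ rng.standard_normal((s, d2))
    vt = v.reshape(-1)
    lhs = vt @ Gamma_hat @ vt
    rhs = 0.5 * lam_min * np.linalg.norm(v, 'fro') ** 2 \
        - 0.5 * lam_min / s * np.linalg.norm(v, 'nuc') ** 2
    ok &= lhs >= rhs
print("RSC inequality held on all test directions:", ok)
```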
D.2 Proof for Theorem 14

Proof.
Using the results in Lemma 21, together with the substitutions $\widehat{\Gamma} - \Sigma = \frac{1}{n} \widetilde{X}^T \widetilde{X} - \Sigma$ and
$$s := \frac{1}{c}\, \frac{n}{d_1 + d_2}\, \min\left\{ \frac{\lambda_{\min}(\Sigma)^2}{\sigma^4},\, 1 \right\}, \qquad (86)$$
where $n \ge c\,(d_1 + d_2) / \min\{\lambda_{\min}(\Sigma)^2/\sigma^4, 1\}$ so that $s \ge 1$, we see that it suffices to show that
$$D(s) := \sup_{v \in K(2s)} |\tilde{v}^T (\widehat{\Gamma} - \Sigma) \tilde{v}| \le \frac{\lambda_{\min}(\Sigma)}{54} \qquad (87)$$
with high probability.

By a modified version of Lemma 15 (in Appendix G) of Loh and Wainwright (2011) — we simply replace the $1/3$-covering set for sparse vectors by a $1/3$-covering set for $K(2s)$, whose covering number is at most $27^{2s(d_1+d_2+1)}$ by Lemma 6 — we obtain
$$\mathbb{P}(D(s) \ge t) \le 2 \exp\left( -c'\, n\, \min\left\{ \frac{t^2}{\sigma^4}, \frac{t}{\sigma^2} \right\} + 2s(d_1 + d_2 + 1) \log 27 \right), \qquad (88)$$
for some universal constant $c' > 0$. Setting $t = \lambda_{\min}(\Sigma)/54$, we see that there exists some $c > 0$ such that
$$\mathbb{P}\left( D(s) \ge \frac{\lambda_{\min}(\Sigma)}{54} \right) \le 2 \exp\left( -c\, n\, \min\left\{ \frac{\lambda_{\min}(\Sigma)^2}{\sigma^4},\, 1 \right\} \right), \qquad (89)$$
which establishes the result. Setting $\delta$ equal to the right-hand side of the last inequality, one obtains the desired guarantee on $n$ in Theorem 14.

E Proof for Theorem 15
E.1 Useful Lemmas
Lemma 22 (Convergence under RSC, adapted from Proposition 10.1 in Wainwright (2019)). Suppose the observations $X_1, \dots, X_n$ satisfy the non-scaled RSC condition in Definition 2, i.e.,
$$\frac{1}{n} \sum_{t=1}^{n} \langle X_t, \Theta \rangle^2 \ge \kappa\, \|\Theta\|_F^2 - \tau_n\, \|\Theta\|_{\mathrm{nuc}}^2, \quad \forall \Theta \in \mathbb{R}^{d_1 \times d_2}. \qquad (90)$$
Then, under the event $G := \left\{ \left\| \frac{1}{n} \sum_{t=1}^{n} \eta_t X_t \right\|_{\mathrm{op}} \le \frac{\lambda_n}{2} \right\}$, any optimal solution $\widehat{\Theta}$ to Equation 5 satisfies the bound
$$\left\| \widehat{\Theta} - \Theta^* \right\|_F^2 \le \frac{4.5\, \lambda_n^2\, r}{\kappa^2}, \qquad (91)$$
where $r = \mathrm{rank}(\Theta^*)$, provided the tolerance satisfies $32\, r\, \tau_n \le \kappa$.
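Lemma 22 applies to any optimal solution of Equation 5, i.e., the nuclear-norm-penalized least-squares program used in the exploration stage. The sketch below is a minimal proximal-gradient (singular-value-thresholding) solver for such a program, assuming Equation 5 has the form $\min_\Theta \frac{1}{n}\sum_t (y_t - \langle X_t, \Theta\rangle)^2 + \lambda_n \|\Theta\|_{\mathrm{nuc}}$; the iteration count, penalty level, and problem sizes are illustrative assumptions, not the paper's prescribed values.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: prox operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def solve_nuclear_ls(X, y, lam, iters=1500):
    """Proximal gradient for (1/n) * sum_t (y_t - <X_t, Theta>)^2 + lam * ||Theta||_nuc.

    X: arms of shape (n, d1, d2); y: rewards of shape (n,).
    """
    n, d1, d2 = X.shape
    Xmat = X.reshape(n, -1)
    # Step size 1/L, where L is the smoothness constant of the quadratic part.
    L = 2.0 * np.linalg.eigvalsh(Xmat.T @ Xmat / n)[-1]
    step = 1.0 / L
    Theta = np.zeros((d1, d2))
    for _ in range(iters):
        resid = Xmat @ Theta.reshape(-1) - y                  # <X_t, Theta> - y_t
        grad = (2.0 / n) * (Xmat.T @ resid).reshape(d1, d2)   # gradient of the quadratic loss
        Theta = svt(Theta - step * grad, step * lam)          # proximal step on the nuclear norm
    return Theta

# Tiny synthetic example with a rank-1 ground truth.
rng = np.random.default_rng(0)
d1, d2, n = 6, 5, 800
Theta_star = np.outer(rng.standard_normal(d1), rng.standard_normal(d2)) / np.sqrt(d1 * d2)
X = rng.standard_normal((n, d1, d2)) / np.sqrt(d1 * d2)
y = np.einsum('nij,ij->n', X, Theta_star) + 0.01 * rng.standard_normal(n)
Theta_hat = solve_nuclear_ls(X, y, lam=0.001)
print(np.linalg.norm(Theta_hat - Theta_star, 'fro'))
```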
E.2 Proof for Theorem 15

Proof. According to Theorem 14, there exist constants $c_1$ and $c_2$ such that, with probability at least $1 - \delta$, the following RSC condition holds:
$$\frac{1}{n} \sum_{t=1}^{n} \langle X_t, \Theta \rangle^2 \ge \frac{c_1}{d_1 d_2} \|\Theta\|_F^2 - \frac{c_2 (d_1 + d_2)}{n\, d_1 d_2} \|\Theta\|_{\mathrm{nuc}}^2, \quad \forall \Theta \in \mathbb{R}^{d_1 \times d_2}. \qquad (92)$$
Lemma 22 can be applied under this RSC condition, so under the event $G(\lambda_n) := \left\{ \left\| \frac{1}{n} \sum_{t=1}^{n} \eta_t X_t \right\|_{\mathrm{op}} \le \frac{\lambda_n}{2} \right\}$ we can readily conclude the theorem. Thus, it remains to determine a value of $\lambda_n$ for which the event $G(\lambda_n)$ holds with high probability.

Define the rare event $E := \left\{ \max_{t=1,\dots,n} |\eta_t| > \sqrt{2\sigma^2 \log(2n/\delta)} \right\}$, so that $\mathbb{P}(E) \le \delta/2$ follows from the definition of sub-Gaussianity. By the matrix Bernstein inequality, the probability of $G(\lambda_n)^c$ can be bounded as
$$\mathbb{P}\left( \left\| \frac{1}{n} \sum_{t=1}^{n} \eta_t X_t \right\|_{\mathrm{op}} > \varepsilon \right) \le \mathbb{P}\left( \left\| \frac{1}{n} \sum_{t=1}^{n} \eta_t X_t \right\|_{\mathrm{op}} > \varepsilon \,\middle|\, E^c \right) + \mathbb{P}(E)$$
$$\le (d_1 + d_2) \exp\left( \frac{- n \varepsilon^2 / 2}{2\sigma^2 \log(2n/\delta) \max\{1/d_1, 1/d_2\} + \varepsilon \sqrt{2\sigma^2 \log(2n/\delta)}/3} \right) + \frac{\delta}{2},$$
where the last inequality follows from the matrix Bernstein inequality together with the fact that, on $E^c$,
$$\max\left\{ \left\| \sum_{t=1}^{n} \mathbb{E}\big[\eta_t^2 X_t X_t^T\big] \right\|_{\mathrm{op}},\, \left\| \sum_{t=1}^{n} \mathbb{E}\big[\eta_t^2 X_t^T X_t\big] \right\|_{\mathrm{op}} \right\} \le 2 n \sigma^2 \log(2n/\delta) \max\{1/d_1, 1/d_2\}. \qquad (93)$$
For
$$(d_1 + d_2) \exp\left( \frac{- n \varepsilon^2 / 2}{2\sigma^2 \log(2n/\delta) \max\{1/d_1, 1/d_2\} + \varepsilon \sqrt{2\sigma^2 \log(2n/\delta)}/3} \right) \le \frac{\delta}{2}$$
to hold, it suffices to take
$$\varepsilon = C' \sigma \sqrt{ \frac{1}{n \min\{d_1, d_2\}} \log\left( \frac{2n}{\delta} \right) \log\left( \frac{d_1 + d_2}{\delta} \right) } \qquad (94)$$
for some constant $C'$. Taking $\lambda_n = 2\varepsilon$, we need $\lambda_n = C \sigma \sqrt{ \frac{1}{n \min\{d_1, d_2\}} \log\left( \frac{2n}{\delta} \right) \log\left( \frac{d_1 + d_2}{\delta} \right) }$, and under this condition we have $\mathbb{P}(G(\lambda_n)) \ge 1 - \delta$. We complete the proof by noting that the scaling of the right-hand side in Lemma 22 under the above choice of $\lambda_n$ is indeed $\frac{(d_1 + d_2)^3\, r}{n}$, up to logarithmic factors.
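To see the scale of $\lambda_n$ used above, the following sketch empirically estimates $\big\|\frac{1}{n}\sum_t \eta_t X_t\big\|_{\mathrm{op}}$ over independent trials and compares it with the $\sigma/\sqrt{n \min\{d_1, d_2\}}$ rate (ignoring logarithmic factors); the arm distribution and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, sigma = 10, 8, 0.1

for n in [500, 2000, 8000]:
    ops = []
    for _ in range(50):
        # Arms normalized to unit Frobenius norm, Gaussian (hence sub-Gaussian) noise.
        X = rng.standard_normal((n, d1, d2))
        X /= np.linalg.norm(X, axis=(1, 2), keepdims=True)
        eta = sigma * rng.standard_normal(n)
        M = np.einsum('n,nij->ij', eta, X) / n
        ops.append(np.linalg.norm(M, 2))        # operator norm
    rate = sigma / np.sqrt(n * min(d1, d2))
    print(n, np.mean(ops), rate)
```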
F Proof for Theorem 5

Proof.
Take $\Delta = \frac{1}{6}\sqrt{\frac{dr}{2T}}$ and define
$$\boldsymbol{\Theta} := \left\{ \Theta \in \mathbb{R}^{d \times d} : \Theta = (\theta_1, \dots, \theta_r, 0, \dots, 0)^T,\ \theta_i \in \{\pm \Delta\}^d\ \forall i \in [r] \right\},$$
i.e., the first $r$ rows of $\Theta$ are $\theta_1^T, \dots, \theta_r^T$ and the remaining rows are zero. For $i \in [r]$, $j \in [d]$, define $\tau_{i,j} = T \wedge \min\{ t : \sum_{s=1}^{t} X_{s,i,j}^2 \ge \frac{T}{dr} \}$, where $X_{s,i,j}$ denotes the element in the $i$-th row and $j$-th column of the matrix $X_s$. Then, for a fixed $\Theta$, taking the expectation over the $X_t$, we have
$$\mathbb{E}[R_T(\Theta)] = \mathbb{E}_\Theta \left[ \sum_{t=1}^{T} \langle X^* - X_t, \Theta \rangle \right] \qquad (95)$$
$$= \Delta\, \mathbb{E}_\Theta \left[ \sum_{t=1}^{T} \sum_{i=1}^{r} \sum_{j=1}^{d} \left( \frac{1}{\sqrt{dr}} - X_{t,i,j}\, \mathrm{sign}(\Theta_{i,j}) \right) \right] \qquad (96)$$
$$\ge \frac{\Delta \sqrt{dr}}{2} \sum_{i=1}^{r} \sum_{j=1}^{d} \mathbb{E}_\Theta \left[ \sum_{t=1}^{T} \left( \frac{1}{\sqrt{dr}} - X_{t,i,j}\, \mathrm{sign}(\Theta_{i,j}) \right)^2 \right] \qquad (97)$$
$$\ge \frac{\Delta \sqrt{dr}}{2} \sum_{i=1}^{r} \sum_{j=1}^{d} \mathbb{E}_\Theta \left[ \sum_{t=1}^{\tau_{i,j}} \left( \frac{1}{\sqrt{dr}} - X_{t,i,j}\, \mathrm{sign}(\Theta_{i,j}) \right)^2 \right]. \qquad (98)$$
Define $U_{i,j}(x) = \sum_{t=1}^{\tau_{i,j}} \left( \frac{1}{\sqrt{dr}} - X_{t,i,j}\, x \right)^2$. Let $\Theta' \in \boldsymbol{\Theta}$ be another parameter matrix such that $\Theta' = \Theta$, except that $\Theta'_{i,j} = -\Theta_{i,j}$. Let $\mathbb{P}, \mathbb{P}'$ be the laws of $U_{i,j}$ with respect to the learner-interaction measures induced by $\Theta$ and $\Theta'$. Then
$$\mathbb{E}_\Theta[U_{i,j}(1)] \ge \mathbb{E}_{\Theta'}[U_{i,j}(1)] - \left( \frac{4T}{dr} + 2 \right) \sqrt{\frac{D(\mathbb{P}, \mathbb{P}')}{2}} \qquad (99)$$
$$\ge \mathbb{E}_{\Theta'}[U_{i,j}(1)] - \Delta \left( \frac{4T}{dr} + 2 \right) \sqrt{ \mathbb{E}_\Theta\left[ \sum_{t=1}^{\tau_{i,j}} X_{t,i,j}^2 \right] } \qquad (100)$$
$$\ge \mathbb{E}_{\Theta'}[U_{i,j}(1)] - \Delta \left( \frac{4T}{dr} + 2 \right) \sqrt{ \frac{T}{dr} + 1 } \qquad (101)$$
$$\ge \mathbb{E}_{\Theta'}[U_{i,j}(1)] - 6\sqrt{2}\, \Delta\, \frac{T}{dr} \sqrt{ \frac{T}{dr} }, \qquad (102)$$
where in the first inequality we used Pinsker's inequality, the result in Exercise 14.4 of Lattimore and Szepesvári (2018), and the bound
$$U_{i,j}(1) = \sum_{t=1}^{\tau_{i,j}} \left( \frac{1}{\sqrt{dr}} - X_{t,i,j} \right)^2 \le \sum_{t=1}^{\tau_{i,j}} \frac{2}{dr} + 2 \sum_{t=1}^{\tau_{i,j}} X_{t,i,j}^2 \le \frac{2T}{dr} + 2\left( \frac{T}{dr} + 1 \right) = \frac{4T}{dr} + 2. \qquad (103)$$
The second inequality above follows from the chain rule for the relative entropy up to a stopping time in Lattimore and Szepesvári (2018):
$$D(\mathbb{P}, \mathbb{P}') \le \frac{1}{2}\, \mathbb{E}_\Theta \left[ \sum_{t=1}^{\tau_{i,j}} \langle X_t, \Theta - \Theta' \rangle^2 \right] = 2\Delta^2\, \mathbb{E}_\Theta \left[ \sum_{t=1}^{\tau_{i,j}} X_{t,i,j}^2 \right]. \qquad (104)$$
The third inequality above holds by the definition of $\tau_{i,j}$, and the fourth holds by the assumption that $dr \le T$. Then
$$\mathbb{E}_\Theta[U_{i,j}(1)] + \mathbb{E}_{\Theta'}[U_{i,j}(-1)] \ge \mathbb{E}_{\Theta'}[U_{i,j}(1) + U_{i,j}(-1)] - 6\sqrt{2}\, \Delta\, \frac{T}{dr} \sqrt{ \frac{T}{dr} } \qquad (105)$$
$$= 2\, \mathbb{E}_{\Theta'} \left[ \frac{\tau_{i,j}}{dr} + \sum_{t=1}^{\tau_{i,j}} X_{t,i,j}^2 \right] - 6\sqrt{2}\, \Delta\, \frac{T}{dr} \sqrt{ \frac{T}{dr} } \qquad (106)$$
$$\ge \frac{2T}{dr} - 6\sqrt{2}\, \Delta\, \frac{T}{dr} \sqrt{ \frac{T}{dr} } = \frac{T}{dr}. \qquad (107)$$
The proof is completed using an averaging argument:
$$\sum_{\Theta \in \boldsymbol{\Theta}} R_T(\Theta) \ge \frac{\Delta \sqrt{dr}}{2} \sum_{i=1}^{r} \sum_{j=1}^{d} \sum_{\Theta \in \boldsymbol{\Theta}} \mathbb{E}_\Theta[U_{i,j}(\mathrm{sign}(\Theta_{i,j}))] \qquad (108)$$
$$\ge \frac{\Delta \sqrt{dr}}{2} \sum_{i=1}^{r} \sum_{j=1}^{d} \sum_{\Theta_{-i,-j}} \sum_{\Theta_{i,j} \in \{\pm \Delta\}} \mathbb{E}_\Theta[U_{i,j}(\mathrm{sign}(\Theta_{i,j}))] \qquad (109)$$
$$\ge \frac{\Delta \sqrt{dr}}{2} \sum_{i=1}^{r} \sum_{j=1}^{d} \sum_{\Theta_{-i,-j}} \frac{T}{dr} = 2^{dr-2}\, \Delta \sqrt{dr}\, T. \qquad (110)$$
Hence there exists a $\Theta \in \boldsymbol{\Theta}$ such that $R_T(\mathcal{A}, \Theta) \ge \frac{T \Delta \sqrt{dr}}{4} = \frac{dr\sqrt{T}}{24\sqrt{2}}$.

G Preliminaries for EW
We provide more details on the construction of the standard exponentially weighted average forecaster.

Prediction with Expert Advice.
We use $\{f_{i,t} : i \in \mathcal{I}\}$ to denote the predictions of the experts at round $t$, where $f_{i,t}$ is the prediction of expert $i$ at time $t$. On the basis of the experts' predictions, the forecaster computes its prediction $\widehat{p}_t$ for the next outcome $y_t$, and the true outcome $y_t$ is revealed afterwards. The regret of the learner relative to expert $i$ is defined by
$$R_{i,T} = \sum_{t=1}^{T} \left( \ell_t(\widehat{p}_t) - \ell_t(f_{i,t}) \right) = \widehat{L}_T - L_{i,T},$$
where $L_{i,T} := \sum_{t=1}^{T} \ell_t(f_{i,t})$ and $\widehat{L}_T := \sum_{t=1}^{T} \ell_t(\widehat{p}_t)$. For linear prediction experts, we define $f_{\Theta,t} := \langle \Theta, X_t \rangle$, and the above regret matches $\rho_T(\Theta)$.

Exponentially Weighted Average Forecaster (EW).
Suppose we have $N$ linear prediction experts. Define the instantaneous regret vector at time $t$ as $r_t = (r_{1,t}, \dots, r_{N,t}) \in \mathbb{R}^N$, where $r_{i,t} = \ell_t(\widehat{p}_t) - \ell_t(f_{i,t})$, and the cumulative regret vector up to time $T$ as $R_T = \sum_{t=1}^{T} r_t$. A weighted average forecaster is then defined as
$$\widehat{p}_t = \frac{ \sum_{i=1}^{N} \nabla \Phi(R_{t-1})_i\, f_{i,t} }{ \sum_{j=1}^{N} \nabla \Phi(R_{t-1})_j },$$
where $\Phi(\cdot)$ denotes a potential function $\Phi : \mathbb{R}^N \to \mathbb{R}$ of the form $\Phi(u) = \psi\left( \sum_{i=1}^{N} \phi(u_i) \right)$. Here $\phi : \mathbb{R} \to \mathbb{R}$ is any nonnegative, increasing, and twice differentiable function, and $\psi : \mathbb{R} \to \mathbb{R}$ is any nonnegative, strictly increasing, concave, and twice differentiable auxiliary function.

The exponentially weighted average forecaster is constructed using $\Phi_\eta(u) = \frac{1}{\eta} \log\left( \sum_{i=1}^{N} e^{\eta u_i} \right)$, where $\eta$ is a positive parameter. The weights assigned to the experts are of the form
$$\nabla \Phi_\eta(R_{t-1})_i = \frac{ e^{\eta R_{i,t-1}} }{ \sum_{j=1}^{N} e^{\eta R_{j,t-1}} }.$$
Since $R_{i,t-1} = \widehat{L}_{t-1} - L_{i,t-1}$ and the common factor $e^{\eta \widehat{L}_{t-1}}$ cancels, the exponentially weighted average forecaster can be simplified to
$$\widehat{y}_t = \frac{ \sum_{i=1}^{N} e^{-\eta L_{i,t-1}} f_{i,t} }{ \sum_{j=1}^{N} e^{-\eta L_{j,t-1}} },$$
as defined in the main text.
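To make the EW update concrete, here is a minimal sketch of the forecaster over a finite set of linear experts, assuming a generic per-round loss $\ell_t$ supplied by the caller (the quadratic loss below is only an illustrative choice).

```python
import numpy as np

class ExponentialWeights:
    """Exponentially weighted average forecaster over N experts."""

    def __init__(self, n_experts, eta):
        self.eta = eta
        self.cum_loss = np.zeros(n_experts)   # L_{i,t-1}

    def predict(self, expert_preds):
        """Weighted average of expert predictions f_{i,t}."""
        w = np.exp(-self.eta * (self.cum_loss - self.cum_loss.min()))  # stabilized weights
        w /= w.sum()
        return float(w @ expert_preds)

    def update(self, expert_losses):
        """Add the per-expert losses ell_t(f_{i,t})."""
        self.cum_loss += expert_losses

# Illustration: experts are inner products <Theta_i, X_t> for candidate matrices Theta_i.
rng = np.random.default_rng(0)
d1, d2, N, T, eta = 4, 3, 30, 200, 5.0
thetas = [rng.standard_normal((d1, d2)) for _ in range(N)]
theta_true = thetas[7]                        # one expert happens to be correct
ew = ExponentialWeights(N, eta)
sq_err = 0.0
for t in range(T):
    X = rng.standard_normal((d1, d2))
    X /= np.linalg.norm(X, 'fro')
    preds = np.array([np.sum(Th * X) for Th in thetas])
    y_hat = ew.predict(preds)
    y = np.sum(theta_true * X) + 0.05 * rng.standard_normal()
    ew.update((preds - y) ** 2)               # quadratic loss for each expert
    sq_err += (y_hat - y) ** 2
print("average squared prediction error:", sq_err / T)
```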
H Algorithms

In this section, we present our Low-rank Generalized Linear Bandit with Online Computation algorithm (LowGLOC) and the second part of LowESTR, the LowOFUL algorithm of Jun et al. (2019).
Algorithm 3
Low-rank Generalized Linear Bandit with Online Computation (LowGLOC)
Input: arm set $\mathcal{X}$, horizon $T$, a $\frac{1}{T}$-net of $\mathcal{S}_r$: $\bar{\mathcal{S}}_r(\frac{1}{T})$, failure rate $\delta$, EW constant $\eta$ (chosen as a function of $T/\delta$), function $m(\cdot)$ in the generalized linear model.
Initial confidence set $\mathcal{C}_0 = \{\Theta \in \mathbb{R}^{d_1 \times d_2} : \|\Theta\|_F \le 1\}$.
for $t = 1, \dots, T$ do
  $(X_t, \widetilde{\Theta}_t) := \mathrm{argmax}_{(X, \Theta) \in \mathcal{X} \times \mathcal{C}_{t-1}} \langle X, \Theta \rangle$.
  Pull arm $X_t$ and receive reward $y_t$.
  Compute the EW predictor $\widehat{y}_t = \frac{ \sum_{i=1}^{|\bar{\mathcal{S}}_r(\frac{1}{T})|} e^{-\eta L_{i,t-1}}\, f_{\Theta_i, t} }{ \sum_{j=1}^{|\bar{\mathcal{S}}_r(\frac{1}{T})|} e^{-\eta L_{j,t-1}} }$, where $f_{\Theta_i, t} \triangleq \langle X_t, \Theta_i \rangle$ for $\Theta_i \in \bar{\mathcal{S}}_r(\frac{1}{T})$.
  Update the losses $L_{i,t} = \sum_{s=1}^{t} \left( -f_{\Theta_i, s}\, y_s + m(f_{\Theta_i, s}) \right)$, for $i = 1, \dots, |\bar{\mathcal{S}}_r(\frac{1}{T})|$.
  Update $\mathcal{C}_t$ according to Equation 4, where $B_t$ is as defined in Lemma 13.
end for
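The per-round work in Algorithm 3 is the EW prediction and the GLM loss update over the net $\bar{\mathcal{S}}_r(\frac{1}{T})$. The sketch below illustrates that update, assuming a logistic link for concreteness (so $m(z) = \log(1 + e^z)$); the candidate set here is a small random stand-in for the actual $\frac{1}{T}$-net, which would be far larger.

```python
import numpy as np

def m_logistic(z):
    """Log-partition function of the Bernoulli GLM: m(z) = log(1 + exp(z))."""
    return np.logaddexp(0.0, z)

rng = np.random.default_rng(0)
d1, d2, r, N, T, eta = 5, 4, 1, 50, 300, 1.0
# Stand-in for the net over rank-r matrices with Frobenius norm <= 1.
net = []
for _ in range(N):
    M = rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2))
    net.append(M / np.linalg.norm(M, 'fro'))
theta_star = net[3]                       # the truth coincides with one net point here

cum_loss = np.zeros(N)
for t in range(T):
    X = rng.standard_normal((d1, d2))
    X /= np.linalg.norm(X, 'fro')
    f = np.array([np.sum(Th * X) for Th in net])        # f_{Theta_i, t} = <X_t, Theta_i>
    w = np.exp(-eta * (cum_loss - cum_loss.min()))
    y_hat = float(w @ f / w.sum())                       # EW predictor
    p = 1.0 / (1.0 + np.exp(-np.sum(theta_star * X)))    # logistic reward model
    y = float(rng.random() < p)
    cum_loss += -f * y + m_logistic(f)                   # negative GLM log-likelihood update
print("best expert:", int(np.argmin(cum_loss)))          # typically the index of theta_star
```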
Algorithm 4 LowOFUL (Jun et al., 2019)

Input:
$T$, $k$, arm set $\mathcal{A} \subset \mathbb{R}^{d_1 d_2}$, failure rate $\delta$, and positive constants $B, B_\perp, \lambda, \lambda_\perp$.
$\Lambda = \mathrm{diag}(\lambda, \dots, \lambda, \lambda_\perp, \dots, \lambda_\perp)$, where $\lambda$ occupies the first $k$ diagonal entries.
for $t = 1, \dots, T$ do
  Compute $a_t = \mathrm{argmax}_{a \in \mathcal{A}} \max_{\theta \in \mathcal{C}_{t-1}} \langle \theta, a \rangle$.
  Pull arm $a_t$ and receive reward $y_t$.
  Update $\mathcal{C}_t = \{\theta : \|\theta - \widehat{\theta}_t\|_{V_t} \le \sqrt{\beta_t}\}$, where $\sqrt{\beta_t} = \sqrt{2 \log \frac{|V_t|^{1/2}}{|\Lambda|^{1/2}\, \delta}} + \sqrt{\lambda}\, B + \sqrt{\lambda_\perp}\, B_\perp$, $V_t = \Lambda + \sum_{s=1}^{t} a_s a_s^T$, and $\widehat{\theta}_t = (\Lambda + A^T A)^{-1} A^T y$ (here $A = [a_1^T; \dots; a_t^T]$ and $y := [y_1, \dots, y_t]^T$).
end for
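Below is a minimal sketch of the weighted-ridge bookkeeping inside Algorithm 4 (the maximization over $\mathcal{A} \times \mathcal{C}_{t-1}$ is omitted, and all problem constants are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, T = 12, 4, 100            # ambient dimension p = d1*d2, low-dimensional block size k
lam, lam_perp, B, B_perp, delta = 1.0, 50.0, 1.0, 0.1, 0.05

Lambda = np.diag([lam] * k + [lam_perp] * (p - k))   # heavier regularization outside the subspace
V = Lambda.copy()
Xy = np.zeros(p)
theta_star = np.concatenate([rng.standard_normal(k), 0.01 * rng.standard_normal(p - k)])

for t in range(T):
    a = rng.standard_normal(p)
    a /= np.linalg.norm(a)                            # stand-in for the chosen arm a_t
    y = a @ theta_star + 0.1 * rng.standard_normal()
    V += np.outer(a, a)                               # V_t = Lambda + sum_s a_s a_s^T
    Xy += y * a
    theta_hat = np.linalg.solve(V, Xy)                # (Lambda + A^T A)^{-1} A^T y
    # Confidence radius sqrt(beta_t) as in Algorithm 4.
    log_det_ratio = np.linalg.slogdet(V)[1] - np.linalg.slogdet(Lambda)[1]
    sqrt_beta = np.sqrt(max(log_det_ratio - 2 * np.log(delta), 0.0)) \
        + np.sqrt(lam) * B + np.sqrt(lam_perp) * B_perp
print(np.linalg.norm(theta_hat - theta_star), sqrt_beta)
```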
I More on Experiments

I.1 Parameter Setup for the OFUL vs. LowESTR Simulation
We present the parameter setups for the experiments in Section 8.
OFUL: failure rate $\delta = 0.$, horizon $T = 3000$, standard deviation of the reward error $\sigma = 0.$.

LowESTR:
• failure rate: $\delta = 0.$.
• standard deviation of the reward error: $\sigma = 0.$.
• least positive singular value of $\Theta^*$: $\omega_r = 0.$ for $r = 1$ and $r = 3$.
• horizon $T = 3000$, steps of stage 1: $T_1 = 200$, steps of stage 2: $T_2 = T - T_1$.
• penalization parameter $\lambda_{T_1}$ in Equation 5, set proportional to $\sqrt{1/T_1}$.
• step size of the gradient descent solving Equation 5: $0.$.
• $k = r(d_1 + d_2 - r)$ in LowOFUL (Algorithm 4).
• $B = 1$, $B_\perp = \sigma^2 (d_1 + d_2)^3 r / (T_1 \omega_r^2)$.
• $\lambda = 1$, $\lambda_\perp = \frac{T_2}{k \log(1 + T_2/\lambda)}$.

I.2 LowESTR: Sensitivity to $\omega_r$

We prove a $\widetilde{O}\big( (d_1 + d_2)^{3/2} \sqrt{rT} / \omega_r \big)$ regret bound for the LowESTR algorithm in Section 6. To complement this theoretical finding, we compare the performance of LowESTR across six different values of $\omega_r$. We run our simulation with $d_1 = d_2 = 10$, $r = 3$. The true $\Theta^* \in \mathbb{R}^{d_1 \times d_2}$ is a diagonal matrix whose first two diagonal entries are fixed constants in $(0, 1)$, whose third entry is $\omega_r$, and whose remaining entries are zero. The arm set is constructed in the same way as in the previous experiment, and the reward is again generated by $y = \langle X, \Theta^* \rangle + \varepsilon$, where $\varepsilon \sim N(0, \sigma^2)$ with the same noise standard deviation as before. For each $\omega_r$ setting, we run LowESTR 20 times to compute the averaged regrets and their 1-standard-deviation confidence intervals. The parameters for LowESTR are the same as in the previous experiment, except that $T_1 = \mathrm{int}(100/\omega_r)$. The plot of cumulative regret at $T = 3000$ vs. the value of $\omega_r$ is displayed in Figure 2. We observe that as we increase the least positive singular value $\omega_r$ of $\Theta^*$, the cumulative regret up to $T = 3000$ is indeed decreasing.

Figure 2: LowESTR: cumulative regret at $T = 3000$ vs. $\omega_r$. The yellow area represents the 1-standard-deviation band of the cumulative regret at $T = 3000$.
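For completeness, here is a small sketch of the data-generating setup used in the $\omega_r$ sensitivity experiment, with placeholder values for the constants whose exact settings are not restated here (the two leading diagonal entries, the noise level, the arm-set size, and the uniform pulling policy are assumptions for illustration only).

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, T, omega_r = 10, 3, 3000, 0.3          # omega_r varies across runs
sigma = 0.1                                  # assumed noise standard deviation

# Diagonal Theta* whose smallest positive singular value is omega_r.
diag = np.zeros(d)
diag[:3] = [0.9, 0.6, omega_r]               # first two entries are placeholder constants
Theta_star = np.diag(diag)

# Arm set: random matrices with unit Frobenius norm (assumed construction).
n_arms = 100
arms = rng.standard_normal((n_arms, d, d))
arms /= np.linalg.norm(arms, axis=(1, 2), keepdims=True)

best = max(np.sum(a * Theta_star) for a in arms)
regret = 0.0
for t in range(T):
    a = arms[rng.integers(n_arms)]           # placeholder policy: uniform pulls
    y = np.sum(a * Theta_star) + sigma * rng.standard_normal()
    regret += best - np.sum(a * Theta_star)
print("cumulative regret of the uniform policy:", regret)
```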