Problem-Complexity Adaptive Model Selection for Stochastic Linear Bandits
Avishek Ghosh, Abishek Sankararaman and Kannan Ramchandran
Dept. of Electrical Engg. and Computer Sciences, UC Berkeley
email: {avishek_ghosh, abishek}@berkeley.edu, [email protected]

June 17, 2020

Abstract
We consider the problem of model selection for two popular stochastic linear bandit settings, and propose algorithms that adapt to the unknown problem complexity. In the first setting, we consider the $K$ armed mixture bandits, where the mean reward of arm $i \in [K]$ is $\mu_i + \langle \alpha_{i,t}, \theta^* \rangle$, with $\alpha_{i,t} \in \mathbb{R}^d$ being the known context vector, and $\mu_i \in [-1, 1]$ and $\theta^*$ being unknown parameters. We define $\|\theta^*\|$ as the problem complexity and consider a sequence of nested hypothesis classes, each positing a different upper bound on $\|\theta^*\|$. Exploiting this, we propose Adaptive Linear Bandit (ALB), a novel phase-based algorithm that adapts to the true problem complexity, $\|\theta^*\|$. We show that ALB achieves regret scaling of $\tilde{O}(\|\theta^*\|\sqrt{T})$, where $\|\theta^*\|$ is apriori unknown. As a corollary, when $\theta^* = 0$, ALB recovers the minimax regret of the simple bandit algorithm without such knowledge of $\theta^*$. ALB is the first algorithm that uses the parameter norm as a model selection criterion for linear bandits. Prior state-of-the-art algorithms [CMB19] achieve a regret of $\tilde{O}(L\sqrt{T})$, where $L$ is an upper bound on $\|\theta^*\|$, fed as an input to the problem. In the second setting, we consider the standard linear bandit problem (with possibly an infinite number of arms) where the sparsity of $\theta^*$, denoted by $d^* \leq d$, is unknown to the algorithm. Defining $d^*$ as the problem complexity (similar to [FKL19]), we show that ALB achieves $\tilde{O}(d^*\sqrt{T})$ regret, matching that of an oracle who knew the true sparsity level. This is the first algorithm that achieves such model selection guarantees. This methodology is then extended to the case of finitely many arms, and similar results are proven. We further verify through synthetic and real-data experiments that the performance gains are fundamental and not artifacts of mathematical bounds. In particular, we show a 1.5-3x drop in cumulative regret over non-adaptive algorithms.
We study model selection for multi-armed bandits (MAB), which refers to choosing the appropriate hypothesis class to model the mapping from arms to expected rewards. Model selection for MAB plays an important role in applications such as personalized recommendations, as we explain in the sequel. Formally, a family of nested hypothesis classes $\mathcal{H}_f$, $f \in \mathcal{F}$, needs to be specified, where each class posits a plausible model for mapping arms to expected rewards. The true model is assumed to be contained in the family $\mathcal{F}$, which is totally ordered: if $f_1 \leq f_2$, then $\mathcal{H}_{f_1} \subseteq \mathcal{H}_{f_2}$. Model selection guarantees then refer to algorithms whose regret scales with the complexity of the smallest hypothesis class containing the true model, even though the algorithm was not aware of it apriori.

We consider two canonical settings for the stochastic MAB problem. The first is the $K$ armed mixture MAB setting, in which the mean reward from any arm $i \in [K]$ is given by $\mu_i + \langle \theta^*, \alpha_{i,t} \rangle$, where $\alpha_{i,t} \in \mathbb{R}^d$ is the context of arm $i$ at time $t$, and $\mu_i \in \mathbb{R}$, $\theta^* \in \mathbb{R}^d$ are unknown and need to be estimated. (By $[r]$, we denote the set of positive integers $\{1, 2, \ldots, r\}$. Throughout the paper we use $\|\cdot\|$ to denote the $\ell_2$ norm unless otherwise specified. The notation $\tilde{O}$ hides logarithmic dependence.) This setting also contains the standard MAB [LR85, ACBF02] when $\theta^* = 0$. Popular linear bandit algorithms, like LinUCB and OFUL (see [CLRS11, DHK08, AYPS11]), handle the case with no bias ($\mu_i = 0$), while OSOM [CMB19], the recent improvement, can handle arm bias. Implicitly, all the above algorithms assume an upper bound on the norm, $\|\theta^*\| \leq L$, which is supplied as an input. Crucially however, the regret guarantees scale linearly in the upper bound $L$. In contrast, we choose $\|\theta^*\|$ as the problem complexity, and provide a novel phase-based algorithm that, without any upper bound on the norm $\|\theta^*\|$, adapts to the true complexity of the problem instance and achieves a regret scaling linearly in the true norm $\|\theta^*\|$.
As a corollary, our algorithm's performance matches the minimax regret of simple MAB when $\theta^* = 0$, even though the algorithm did not apriori know that $\theta^* = 0$. Formally, we consider a continuum of hypothesis classes, with each class positing a different upper bound on the norm $\|\theta^*\|$, where the complexity of a class is the upper bound posited. As our regret bound scales linearly in $\|\theta^*\|$ (the complexity of the smallest hypothesis class containing the instance), as opposed to an upper bound on $\|\theta^*\|$, our algorithm achieves model selection guarantees.

The second setting we consider is the standard linear stochastic bandit [AYPS11] with possibly an infinite number of arms, where the mean reward of any arm $x \in \mathbb{R}^d$ (arms are vectors in this case) is given by $\langle x, \theta^* \rangle$, where $\theta^* \in \mathbb{R}^d$ is unknown. For this setting, we consider model selection from among a total of $d$ different hypothesis classes, with each class positing a different cardinality for the support of $\theta^*$. We exhibit a novel algorithm where the regret scales linearly in the unknown cardinality of the support of $\theta^*$. The regret scaling of our algorithm matches that of an oracle that has knowledge of the optimal support cardinality [CM12, BB20], thereby achieving model selection guarantees. Our algorithm is the first known algorithm to obtain regret scaling matching that of an oracle that has knowledge of the true support. This is in contrast to standard linear bandit algorithms such as [AYPS11], where the regret scales linearly in $d$. We also extend this methodology to the case when the number of arms is finite and obtain similar regret rates matching the oracle. Model selection with dimension as a measure of complexity was also recently studied by [FKL19], in which the classical contextual bandit [CLRS11] with a finite number of arms was considered.
We clarify here that although our results for the finite-arm setting yield a better (optimal) regret scaling with respect to the time horizon $T$ and the support of $\theta^*$ (denoted by $d^*$), our guarantee depends on a problem-dependent parameter and is thus not uniform over all instances. In contrast, the results of [FKL19], although sub-optimal in $d^*$ and $T$, are uniform over all problem instances. Closing this gap is an interesting future direction.
1. Successive Refinement Algorithms for Stochastic Linear Bandit - We present two novel epoch-based algorithms, ALB-Norm and ALB-Dim (Adaptive Linear Bandit - Norm/Dimension), that achieve model selection guarantees for the two families of hypothesis classes, respectively. For the $K$ armed mixture MAB setting, ALB-Norm, at the beginning of each phase, estimates an upper bound on the norm $\|\theta^*\|$. Subsequently, the algorithm assumes this bound to be true during the phase, and the upper bound is re-estimated at the end of the phase. Similarly, for the linear bandit setting, ALB-Dim estimates the support of $\theta^*$ at the beginning of each phase and subsequently only plays from this estimated support during the phase. In both settings, we show the estimates converge to the true underlying value: in the first case, the estimate of the norm $\|\theta^*\|$ converges to the true norm, and in the second case, for all time after a random time with finite expectation, the estimated support equals the true support. Our algorithms are reminiscent of the successive rejects algorithm [AB10] for standard MAB, with the crucial difference being that our algorithm is non-monotone. Once rejected, an arm is never pulled again in classical successive rejects. In contrast, our algorithm performs successive refinement and is not necessarily monotone: a hypothesis class discarded earlier can be considered again at a later point in time.
2. Regret Depending on the Complexity of the Smallest Hypothesis Class - In the $K$ armed mixture MAB setting, ALB-Norm's regret scales as $\tilde{O}(\|\theta^*\|\sqrt{T})$, which is superior to state-of-the-art algorithms such as OSOM [CMB19], whose regret scales as $\tilde{O}(L\sqrt{T})$, where $L$ is an upper bound on $\|\theta^*\|$ that is supplied as an input. As a corollary, we get the 'best of both worlds' guarantee of [CMB19]: if $\theta^* = 0$, our regret bound recovers the known minimax regret guarantee of simple MAB. Similarly, for the linear bandit setting with unknown support, ALB-Dim achieves a regret of $\tilde{O}(d^*\sqrt{T})$, where $d^* \leq d$ is the true sparsity of $\theta^*$. This matches the regret obtained by oracle algorithms that know the true sparsity $d^*$ [CM12, BB20]. We also apply our methodology to the case when there is a finite number of arms and obtain similar regret scaling as the oracle. ALB-Dim is the first algorithm to obtain such model selection guarantees. The prior state-of-the-art algorithm ModCB for model selection with dimension as a measure of complexity was proposed in [FKL19], with a finite set of arms, where the regret guarantee was sub-optimal compared to the oracle. However, our regret bound for dimension, though it matches the oracle, depends on the minimum non-zero coordinate value and is thus not uniform over $\theta^*$. Obtaining regret rates in this case that match the oracle and are uniform over all $\theta^*$ is an interesting future work.
3. Empirical Validation - We conduct synthetic and real-data experiments that demonstrate the superior performance of ALB compared to state-of-the-art methods such as OSOM [CMB19] in the mixture $K$ armed MAB setting and OFUL [AYPS11] in the linear bandit setting. We further observe that the performance of ALB is close to that of oracle algorithms that know the true complexity. This indicates that the performance gains of ALB are fundamental, and not artifacts of mathematical bounds.
Motivating Example: Our model selection framework is applicable to personalized news recommendation platforms that recommend one of $K$ news outlets to each of their users. The recommendation decisions for any fixed user can be modeled as an instance of a MAB; the arms are the $K$ different news outlets, and the platform's recommendation decision (for this user) on day $t$ is the arm played at time $t$. On each day $t$, each news outlet $i$ reports a story, which can be modeled by the vector $\alpha_{i,t}$, obtained by embedding the stories into a fixed-dimension vector space by some common embedding scheme. The reward obtained by the platform in recommending news outlet $i$ to this user on day $t$ can be modeled as $\mu_i + \langle \alpha_{i,t}, \theta^* \rangle$, where $\mu_i$ captures the preference of this user for news outlet $i$ and the vector $\theta^*$ captures the "interest" of the user. Thus, if an outlet $i$ on day $t$ publishes a news article $\alpha_{i,t}$ that this user "likes", then most likely the content $\alpha_{i,t}$ is "aligned" with $\theta^*$ and has a large inner product $\langle \alpha_{i,t}, \theta^* \rangle$. Different users on the platform, however, may have different biases and different $\theta^*$. Some users have strong preferences towards certain topics and will read content written by any outlet on those topics (these users will have a large value of $\|\theta^*\|$). Other users may be agnostic to topics, but may strongly prefer a particular news outlet (e.g., some users read Fox News exclusively or CNN exclusively, regardless of the topic). These users will have low $\|\theta^*\|$.

In such a multi-user recommendation application, we show that our algorithm ALB-Norm, which tailors the model class to each user separately, is more effective (lower regret) than employing a (non-adaptive) linear bandit algorithm for each user. We further show that our algorithms are also more effective than state-of-the-art model selection algorithms such as OSOM [CMB19], which posits a 'binary' model: users either assign a 0 weight to topics or assign a potentially large weight to topics. Furthermore, the heterogeneous complexity in this application can also be captured by the cardinality of the support of $\theta^*$; different people are interested in different sub-vectors of $\theta^*$, which the recommendation platform is not aware of apriori. In this context, our adaptive algorithm ALB-Dim, which tailors to the interest of the individual user, achieves better performance compared to non-adaptive linear bandit algorithms.
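The mixture reward model above can be made concrete with a small sketch. This is a toy illustration of our own: the embeddings, the user names, and all numbers are hypothetical and not from the paper; it simply evaluates $\mu_i + \langle \alpha_{i,t}, \theta^* \rangle$ for two stylized users.

```python
import numpy as np

rng = np.random.default_rng(4)
K, d = 3, 5
# Today's article embeddings alpha_{i,t}, one unit-norm row per outlet.
alpha = rng.normal(size=(K, d))
alpha /= np.linalg.norm(alpha, axis=1, keepdims=True)

# Hypothetical users: one topic-driven (large ||theta*||, no outlet bias),
# one outlet-driven (small ||theta*||, strong bias mu_0 for outlet 0).
topic_user = {"mu": np.zeros(K), "theta": 2.0 * rng.normal(size=d)}
outlet_user = {"mu": np.array([1.0, -0.5, -0.5]), "theta": 0.01 * rng.normal(size=d)}

def mean_rewards(user):
    # mu_i + <alpha_{i,t}, theta*> for every outlet i
    return user["mu"] + alpha @ user["theta"]

# For the outlet-driven user, the bias term dominates the tiny topic term,
# so the preferred outlet is outlet 0 regardless of today's articles.
best_for_outlet_user = int(np.argmax(mean_rewards(outlet_user)))
```

The point of the sketch is only that which term dominates the mean reward depends on $\|\theta^*\|$, which is exactly the per-user complexity our algorithms adapt to.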
Model selection for MAB has only recently been studied [ALNS16, GCG17], with [CMB19] and [FKL19] being the closest to our work. OSOM was proposed in [CMB19] for model selection in the $K$ armed mixture MAB from two hypothesis classes: a "simple model" where $\|\theta^*\| = 0$, or a "complex model" where $0 < \|\theta^*\| \leq L$. OSOM was shown to obtain a regret guarantee of $O(\log(T))$ when the instance is simple and $\tilde{O}(L\sqrt{T})$ otherwise. We refine this to consider a continuum of hypothesis classes and propose ALB-Norm, which achieves regret $\tilde{O}(\|\theta^*\|\sqrt{T})$, a superior guarantee (which we also empirically verify) compared to OSOM. Model selection with dimension as a measure of complexity was recently initiated in [FKL19], where an algorithm ModCB was proposed. The setup considered in [FKL19] was that of contextual bandits [CLRS11] with a fixed and finite number of arms. ModCB in this setting was shown to achieve a regret scaling that is sub-optimal compared to the oracle. In contrast, we consider the linear bandit setting with a continuum of arms [AYPS11], and ALB-Dim achieves a regret scaling matching that of an oracle. The continuum of arms allows ALB-Dim a finer exploration of arms, which enables it to learn the support of $\theta^*$ reliably and thus obtain regret matching that of the oracle. However, our regret bounds depend on the magnitude of the minimum non-zero value of $\theta^*$ and are thus not uniform over all $\theta^*$. Obtaining regret rates matching the oracle that hold uniformly over all $\theta^*$ is an interesting future work.

Corral was proposed in [ALNS16] by casting the optimal algorithm for each hypothesis class as an expert, with the forecaster's performance having low regret with respect to the best expert (best model class). However,
Corral can only handle finitely many hypothesis classes and is not suited to our setting with a continuum of hypothesis classes.

Adaptive algorithms for linear bandits have also been studied in contexts different from ours. The papers [LC18, KWS18] consider problems where the arms have an unknown structure, and propose algorithms adapting to this structure to yield low regret. The paper [LST17] proposes an algorithm in the adversarial bandit setup that adapts to an unknown structure in the adversary's loss sequence to obtain low regret. The paper [AGO18] considers adaptive algorithms when the distribution changes over time. In the context of online learning with full feedback, there have been several works addressing model selection [LS15, MA13, Ora14, CB17]. In the context of statistical learning, model selection has a long line of work (e.g., [Vap06], [BM]).

Problem Statement - In this section, we formally define the problem. At each round $t \in [T]$, the player chooses one of the $K$ available arms. Each arm has a context $\{\alpha_{i,t} \in \mathbb{R}^d\}_{i=1}^K$ that changes over time $t$. Similar to the standard stochastic contextual bandit framework, the context vectors for each arm are chosen independently of all other arms and of the past time instances. We assume that there exists an underlying parameter $\theta^* \in \mathbb{R}^d$ and biases $\{\mu_1, \ldots, \mu_K\}$, each taking values in $[-1, 1]$, such that the mean reward of an arm is a linear function of the context of the arm. The reward for playing arm $i$ at time $t$ is given by
$$ g_{i,t} = \mu_i + \langle \alpha_{i,t}, \theta^* \rangle + \eta_{i,t}, $$
where $\{\eta_{i,t}\}_{t=1}^T$ are i.i.d. zero-mean $\sigma$ sub-Gaussian noise. The context vectors satisfy
$$ \mathbb{E}\left[\alpha_{i,t} \,\middle|\, \{\alpha_{j,s}, \eta_{j,s}\}_{j \in [K], s \in [t-1]}\right] = 0, \qquad \mathbb{E}\left[\alpha_{i,t}\alpha_{i,t}^\top \,\middle|\, \{\alpha_{j,s}, \eta_{j,s}\}_{j \in [K], s \in [t-1]}\right] \succeq \rho_{\min} I. $$
The above setting is popularly known as the stochastic contextual bandit [CMB19]. In the special case of $\theta^* = 0$, the above model reduces to $g_{i,t} = \mu_i + \eta_{i,t}$. Note that in this case, the mean rewards of the arms are fixed and do not depend on the context. Hence, this corresponds to a simple multi-armed bandit setup, and standard algorithms (like UCB [ACBF02]) can be used as a learning rule. At round $t$, we define $i_t^* = \mathrm{argmax}_{i \in [K]} [\mu_i + \langle \theta^*, \alpha_{i,t} \rangle]$ as the best arm. Let an algorithm play arm $A_t$ at round $t$. The regret of the algorithm up to time $T$ is given by
$$ R(T) = \sum_{s=1}^{T} \left[ \mu_{i_s^*} + \langle \theta^*, \alpha_{i_s^*, s} \rangle - \mu_{A_s} - \langle \theta^*, \alpha_{A_s, s} \rangle \right]. $$
Throughout the paper, we use
$C, C_1, \ldots, c, c_1, \ldots$ to denote positive universal constants, the values of which may differ in different instances.

We define a new notion of complexity for stochastic linear bandits, and propose an algorithm that adapts to it. We define $\|\theta^*\|$ as the problem complexity for the linear bandit instance. Note that if $\|\theta^*\| = 0$, the linear bandit model reduces to the simple multi-armed bandit setting. Furthermore, the cumulative regret $R(T)$ of linear bandit algorithms (like OFUL [AYPS11] and OSOM [CMB19]) scales linearly with $\|\theta^*\|$ ([CMB19]). Hence, $\|\theta^*\|$ constitutes a natural notion of model complexity. In Algorithm 1, we propose an adaptive scheme which adapts to the true complexity of the problem, $\|\theta^*\|$. Instead of assuming an upper bound on $\|\theta^*\|$, we use an initial exploration phase to obtain a rough estimate of $\|\theta^*\|$ and then successively refine it over multiple epochs. The cumulative regret of our proposed algorithm then scales linearly with $\|\theta^*\|$.

Algorithm 1: Adaptive Linear Bandit (Norm)
  Input: initial exploration period $\tau$, initial phase length $T_1$, $\delta_1 > 0$, $\delta_s > 0$
  Select an arm at random, sample its rewards $2\tau$ times
  Obtain an initial estimate $b_1$ of $\|\theta^*\|$ according to Section 3.3
  for $t = 1, 2, \ldots, K$ do
    Play arm $t$, receive reward $g_{t,t}$
  end for
  Define $S = \{g_{i,i}\}_{i=1}^K$
  for epochs $i = 1, 2, \ldots, N$ do
    Use $S$ as the pure-exploration rewards
    Play OFUL$^+_{\delta_i}(b_i)$ until the end of epoch $i$ (denoted by $E_i$)
    At $t = E_i$, refine the estimate of $\|\theta^*\|$ as $b_{i+1} = \max_{\theta \in \mathcal{C}_{E_i}} \|\theta\|$
    Set $T_{i+1} = 2 T_i$, $\delta_{i+1} = \delta_i / 2$
  end for

OFUL$^+_\delta(b)$:
  Input: parameters $b$, $\delta > 0$, number of rounds $\tilde{T}$
  for $t = 1, 2, \ldots, \tilde{T}$ do
    Select the best-arm estimate as $j_t = \mathrm{argmax}_{i \in [K]} \left[ \max_{\theta \in \mathcal{C}_{t-1}} \{ \tilde{\mu}_{i,t-1} + \langle \alpha_{i,t}, \theta \rangle \} \right]$, where $\tilde{\mu}_{i,t}$ and $\mathcal{C}_t$ are given in Section 3.2
    Play arm $j_t$, and update $\{\tilde{\mu}_{i,t}\}_{i=1}^K$ and $\mathcal{C}_t$
  end for

ALB-Norm Algorithm
We present the adaptive scheme in Algorithm 1. Note that Algorithm 1 depends on the subroutine OFUL$^+$. Observe that at each iteration, we estimate the biases $\{\mu_1, \ldots, \mu_K\}$ and $\theta^*$ separately. The estimation of the biases involves a simple sample-mean estimate with an upper confidence level, and the estimation of $\theta^*$ involves building a confidence set that shrinks over time.

In order to estimate $\theta^*$, we use a variant of the popular OFUL [AYPS11] algorithm with arm bias. We refer to this algorithm as OFUL$^+$. Algorithm 1 is epoch based, and over multiple epochs, we successively refine the estimate of $\|\theta^*\|$. We start with a rough over-estimate of $\|\theta^*\|$ (obtained from a pure exploration phase), and based on the confidence set constructed at the end of each epoch, we update the estimate of $\|\theta^*\|$. We argue that this approach indeed correctly estimates $\|\theta^*\|$ with high probability over a sufficiently large time horizon $T$.

We now discuss the algorithm OFUL$^+$. A variation of this was proposed in [CMB19] in the context of model selection between linear and standard multi-armed bandits. We use $\tilde{\mu}_{i,t}$ to address the bias term, which we define shortly. The parameters $b$ and $\delta$ are used in the construction of the confidence set $\mathcal{C}_t$. Suppose OFUL$^+$ is run for a total of $\tilde{T}$ rounds and plays arm $A_s$ at time $s$. Let $T_i(t)$ be the number of times OFUL$^+$ plays arm $i$ until time $t$. Also, let $b$ be the current estimate of $\|\theta^*\|$. We define
$$ \bar{g}_{i,t} = \frac{1}{T_i(t)} \sum_{s=1}^{t} g_{i,s} \mathbf{1}\{A_s = i\}, \qquad \tilde{\mu}_{i,t} = \bar{g}_{i,t} + c(\sigma + b)\sqrt{\frac{d}{T_i(t)} \log\left(\frac{1}{\delta}\right)}. $$
The confidence set $\mathcal{C}_t$ is defined as $\mathcal{C}_t = \{\theta \in \mathbb{R}^d : \|\theta - \hat{\theta}_t\| \leq K_\delta(b, t, \tilde{T})\}$, where $\hat{\theta}_t$ is the least squares estimate
$$ \hat{\theta}_t = \left(\alpha_{K+1:t}^\top \alpha_{K+1:t} + I\right)^{-1} \alpha_{K+1:t}^\top G_{K+1:t}, $$
with $\alpha_{K+1:t}$ a matrix having rows $\alpha_{A_{K+1},K+1}^\top, \ldots, \alpha_{A_t,t}^\top$, and $G_{K+1:t} = [g_{A_{K+1},K+1} - \tilde{\mu}_{A_{K+1},K+1}, \ldots, g_{A_t,t} - \tilde{\mu}_{A_t,t}]^\top$.
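As a concrete illustration of the least-squares estimate $\hat{\theta}_t$ defined above, the following sketch computes $(\alpha^\top\alpha + I)^{-1}\alpha^\top G$ on synthetic data. The design matrix, noise level, and dimensions are toy choices of our own, and the bias correction is assumed to have already been applied to `G`.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 2000
theta_star = rng.normal(size=d)

# Rows of A play the role of the contexts alpha of the arms actually played.
A = rng.normal(size=(n, d))
# G plays the role of the bias-corrected rewards g - mu_tilde.
G = A @ theta_star + 0.1 * rng.normal(size=n)

# Ridge-regularized least squares: theta_hat = (A^T A + I)^{-1} A^T G
theta_hat = np.linalg.solve(A.T @ A + np.eye(d), A.T @ G)

print(np.linalg.norm(theta_hat - theta_star))  # small estimation error
```

Solving the linear system with `np.linalg.solve` (rather than forming the inverse explicitly) is the standard numerically stable way to evaluate this estimator.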
The radius of $\mathcal{C}_t$ is given by (see Appendix A for the complete expression)
$$ K_\delta(b, t, \tilde{T}) = \frac{c(\sigma\sqrt{d} + b)}{\rho_{\min}\sqrt{t}} \sqrt{\log\left(\frac{K\tilde{T}}{\delta}\right)}. $$
Lemma 2 of [CMB19] shows that $\theta^* \in \mathcal{C}_t$ with probability at least $1 - \delta$.

We now describe how the initial estimate $b_1$ of $\|\theta^*\|$ is obtained. We select an arm at random (without loss of generality, assume that this is arm 1), and sample its rewards (in an i.i.d. fashion) $2\tau$ times, where $\tau > 0$. We form the differences $y(1) = g_{1,1} - g_{1,2}$, $y(2) = g_{1,3} - g_{1,4}$, and so on, which cancel the arm bias $\mu_1$. Stacking the $y(\cdot)$, we obtain $Y = \tilde{X}\theta^* + \tilde{\eta}$, where the $i$-th row of $\tilde{X}$ is $(\alpha_{1,2i-1} - \alpha_{1,2i})^\top$ and the $i$-th element of $\tilde{\eta}$ is $\eta_{1,2i-1} - \eta_{1,2i}$. Hence, the least squares estimate $\hat{\theta}^{(\ell s)}$ satisfies $\|\hat{\theta}^{(\ell s)} - \theta^*\| \leq \sqrt{2}\,\sigma\sqrt{\frac{d}{\tau}\log(1/\delta_s)}$ with probability exceeding $1 - \delta_s$ ([Wai19]). We set the initial estimate
$$ b_1 = \max\left\{\|\hat{\theta}^{(\ell s)}\| + \sqrt{2}\,\sigma\sqrt{\frac{d}{\tau}\log(1/\delta_s)},\ 1\right\}, $$
and this satisfies $b_1 \geq \|\theta^*\|$ with probability at least $1 - \delta_s$.

We now obtain an upper bound on the cumulative regret $R(T)$ of Algorithm 1 with high probability. For theoretical tractability, we assume that OFUL$^+$ restarts at the start of each epoch. We have the following lemma regarding the sequence $\{b_i\}_{i=1}^\infty$ of estimates of $\|\theta^*\|$:

Lemma 1.
With probability exceeding $1 - \delta_1 - \delta_s$, the sequence $\{b_i\}_{i=1}^\infty$ converges to $\|\theta^*\|$ at a rate $O(2^{-i})$, and we obtain $b_i \leq (c_1\|\theta^*\| + c_2)$ for all $i$, provided $T_1 \geq C(\max\{p, q\}\, b_1)^2 d$, where $C > 0$, and $p = \left[\frac{14\log(KT/\delta)}{\sqrt{\rho_{\min}}}\right]^2$, $q = \left[\frac{C\sigma\log(KT/\delta)}{\sqrt{\rho_{\min}}}\right]^2$.

Hence, the sequence converges to $\|\theta^*\|$ at an exponential rate. We have the following guarantee on the cumulative regret $R(T)$:

Theorem 1.
Suppose $T > \max\{T_{\min}(\delta, T_1),\ C(\max\{p, q\}\, b_1)^2 d\}$, where $C > 0$ and $T_{\min}(\delta, T_1) = (\rho + \rho_{\min})\log\left(\frac{dT}{\delta}\right)$. Then, with probability at least $1 - \delta - \delta_s$, we have
$$ R(T) \leq C_1(2\tau + K)\|\theta^*\| + C_2(\|\theta^*\| + 1)(\sqrt{K} + \sqrt{d})\sqrt{T}\log(KT/\delta)\log(T/T_1). $$
For the complete expression, see Appendix A. (There is a typo in the proof of regret in [CMB19]. We correct the typo, and modify the definitions of $\tilde{\mu}_{i,t}$ and $K_\delta(b, t, \tilde{T})$. As a consequence, the high probability bounds change a little.)

Remark 1. Note that the regret bound depends on the problem complexity $\|\theta^*\|$, and we prove that Algorithm 1 adapts to this complexity. Ignoring the log factors, Algorithm 1 has a regret of $\tilde{O}((1 + \|\theta^*\|)(\sqrt{K} + \sqrt{d})\sqrt{T})$ with high probability.

Remark 2. (Matches Linear Bandit algorithm) Note that the above bound matches the regret guarantee of the linear bandit algorithm with bias as presented in [CMB19].
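To see the norm-refinement mechanism in isolation, the following toy simulation mimics the update $b_{i+1} = \max_{\theta \in \mathcal{C}_{E_i}} \|\theta\| = \|\hat{\theta}\| + \text{radius}$ over doubling epochs. It is a simplified stand-in of our own making: random Gaussian contexts, a heuristic confidence radius in place of $K_\delta(b, t, \tilde{T})$, and no arm biases.

```python
import numpy as np

rng = np.random.default_rng(1)
d, sigma = 4, 0.5
theta_star = rng.normal(size=d)
theta_star *= 0.3 / np.linalg.norm(theta_star)  # true complexity ||theta*|| = 0.3

b, T_i = 10.0, 100        # crude initial over-estimate b_1 and phase length T_1
estimates = [b]
for _ in range(8):        # epochs with T_{i+1} = 2 T_i
    A = rng.normal(size=(T_i, d))
    y = A @ theta_star + sigma * rng.normal(size=T_i)
    theta_hat = np.linalg.solve(A.T @ A + np.eye(d), A.T @ y)
    radius = 2.0 * sigma * np.sqrt(d / T_i)     # heuristic stand-in for K_delta
    b = np.linalg.norm(theta_hat) + radius      # b_{i+1}: max norm over the ball
    estimates.append(b)
    T_i *= 2
```

As the epoch length doubles, the confidence radius shrinks and the estimate `b` decays from the initial over-estimate toward the true norm $0.3$, which is the behavior Lemma 1 formalizes.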
Remark 3. (Matches UCB when $\theta^* = 0$) When $\theta^* = 0$ (the simplest model, without any contextual information), Algorithm 1 recovers the minimax regret of the UCB algorithm. Indeed, substituting $\|\theta^*\| = 0$ in the above regret bound yields $R(T) = O(\sqrt{KT})$ with high probability, provided $K > d$. Hence, we obtain the "best of both worlds" result for the simple model ($\theta^* = 0$) and the contextual bandit model ($\theta^* \neq 0$).

In this section, we consider the standard stochastic linear bandit model in $d$ dimensions [AYPS11], with the dimension as a measure of complexity. The setup in this section is almost identical to that in Section 3.1, with zero arm biases and a continuum collection of arms denoted by the set $\mathcal{A} := \{x \in \mathbb{R}^d : \|x\| \leq 1\}$. Thus, the mean reward from any arm $x \in \mathcal{A}$ is $\langle x, \theta^* \rangle$, where $\|\theta^*\| \leq$
1. We assume that $\theta^*$ is $d^*$-sparse with $d^* \leq d$, where $d^*$ is apriori unknown to the algorithm. Thus, unlike in Section 3, there is no i.i.d. context sampling in this section. We consider a sequence of $d$ nested hypothesis classes, where each hypothesis class $i \leq d$ models $\theta^*$ as an $i$-sparse vector. The goal of the forecaster is to minimize the regret, namely $R(T) := \sum_{t=1}^{T} [\langle x_t^* - x_t, \theta^* \rangle]$, where at any time $t$, $x_t$ is the action recommended by the algorithm and $x_t^* = \mathrm{argmax}_{x \in \mathcal{A}} \langle x, \theta^* \rangle$. The regret $R(T)$ measures the loss in reward of the forecaster compared with that of an oracle that knows $\theta^*$ and thus can compute $x_t^*$ at each time.

ALB-Dim
Algorithm

The algorithm is parametrized by $T_1 \in \mathbb{N}$, which is given in Equation (1) in the sequel, and a slack $\delta \in (0, 1)$. ALB-Dim proceeds in phases numbered $0, 1, \cdots$, which are non-decreasing with time. At the beginning of each phase, ALB-Dim makes an estimate of the set of non-zero coordinates of $\theta^*$, which is kept fixed throughout the phase. Concretely, each phase $i$ is divided into two blocks: (i) a regret minimization block lasting $25^i T_1$ time slots, (ii) followed by a random exploration phase lasting $5^i \lceil\sqrt{T_1}\rceil$ time slots. Thus, each phase $i$ lasts for a total of $25^i T_1 + 5^i \lceil\sqrt{T_1}\rceil$ time slots. At the beginning of each phase $i \geq 0$, $\mathcal{D}_i \subseteq [d]$ denotes the set of 'active coordinates', namely the estimate of the non-zero coordinates of $\theta^*$. Subsequently, in the regret minimization block of phase $i$, a fresh instance of OFUL [AYPS11] is spawned, with the dimensions restricted to the set $\mathcal{D}_i$ and probability parameter $\delta_i := \delta/2^i$. In the random exploration phase, at each time, one of the possible arms from the set $\mathcal{A}$ is played, chosen uniformly and independently at random. (Our algorithm can be applied to any compact set $\mathcal{A} \subset \mathbb{R}^d$, including a finite set, as shown in Appendix C.) At the end of each phase $i \geq 0$, ALB-Dim forms an estimate $\hat{\theta}_{i+1}$ of $\theta^*$ by solving a least squares problem using all the random exploration samples collected till the end of phase $i$. The active coordinate set $\mathcal{D}_{i+1}$ then consists of the coordinates of $\hat{\theta}_{i+1}$ with magnitude exceeding $2^{-(i+1)}$. The pseudo-code is provided in Algorithm 2, where, for all $i \geq 0$, $S_i$ in lines 15 and 16 is the total number of random-exploration samples in all phases up to and including $i$.

Algorithm 2: Adaptive Linear Bandit (Dimension)
  Input: initial phase length $T_1$ and slack $\delta > 0$
  $\hat{\theta}_0 = \mathbf{1}$ (the all-ones vector, so that $\mathcal{D}_0 = [d]$), $T_{-1} = 0$
  for each epoch $i \in \{0, 1, 2, \cdots\}$ do
    $T_i = 25^i T_1$, $\varepsilon_i \leftarrow 2^{-i}$, $\delta_i \leftarrow \delta/2^i$
    $\mathcal{D}_i := \{j : |\hat{\theta}_i(j)| \geq \varepsilon_i\}$
    for times $t \in \{T_{i-1} + 1, \cdots, T_i\}$ do
      Play OFUL$(1, \delta_i)$ restricted to the coordinates in $\mathcal{D}_i$. Here $\delta_i$ is the probability slack parameter and $1$ represents the bound $\|\theta^*\| \leq 1$
    end for
    for times $t \in \{T_i + 1, \cdots, T_i + 5^i \lceil\sqrt{T_1}\rceil\}$ do
      Play an arm from the action set $\mathcal{A}$ chosen uniformly and independently at random
    end for
    $\alpha_i \in \mathbb{R}^{S_i \times d}$: matrix with each row being an arm played during all random explorations in the past
    $y_i \in \mathbb{R}^{S_i}$: vector whose $s$-th entry is the observed reward at the $s$-th random exploration in the past
    $\hat{\theta}_{i+1} \leftarrow (\alpha_i^\top \alpha_i)^{-1} \alpha_i^\top y_i$, a $d$-dimensional vector
  end for

We first specify how to set the input parameter $T_1$ as a function of $\delta$. For any $N \geq d$, denote by $A_N$ the $N \times d$ random matrix with each row being a vector sampled uniformly and independently from the unit sphere in $d$ dimensions. Denote by $M_N := \frac{1}{N}\mathbb{E}[A_N^\top A_N]$, and by $\lambda_{\max}^{(N)}, \lambda_{\min}^{(N)}$ the largest and smallest eigenvalues of $M_N$. Observe that $M_N$ is positive semi-definite ($0 \leq \lambda_{\min}^{(N)} \leq \lambda_{\max}^{(N)}$) and almost surely full rank, i.e., $\mathbb{P}[\lambda_{\min}^{(N)} > 0] = 1$. The constant $T_1$ is the smallest integer such that
$$ \sqrt{T_1} \geq \max\left( \frac{\sigma^2}{\big(\lambda_{\min}^{(\lceil\sqrt{T_1}\rceil)}\big)^2}\ln\left(\frac{2d}{\delta}\right),\ \frac{4}{3}\cdot\frac{\big(6\lambda_{\max}^{(\lceil\sqrt{T_1}\rceil)} + \lambda_{\min}^{(\lceil\sqrt{T_1}\rceil)}\big)\big(d + \lambda_{\max}^{(\lceil\sqrt{T_1}\rceil)}\big)}{\big(\lambda_{\min}^{(\lceil\sqrt{T_1}\rceil)}\big)^2}\ln\left(\frac{2d}{\delta}\right) \right). \tag{1} $$

Remark 4. $T_1$ in Equation (1) is chosen such that, at the end of phase $0$, $\mathbb{P}[\|\hat{\theta}_1 - \theta^*\|_\infty \geq 1/4] \leq \delta$. A formal statement of the Remark is provided in Lemma 2 in Appendix B.
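The support-estimation step at the heart of ALB-Dim can be sketched as follows. This is a simplified toy of our own (our choice of $d$, sparsity pattern, and sample count); the uniform-sphere exploration, the pooled least squares, and the threshold $\varepsilon_i = 2^{-i}$ follow the description above.

```python
import numpy as np

rng = np.random.default_rng(2)
d, sigma = 20, 0.1
theta_star = np.zeros(d)
theta_star[:3] = [0.9, -0.7, 0.5]      # d* = 3, gamma = 0.5

# Random exploration: arms drawn uniformly from the unit sphere in R^d
n = 4000
arms = rng.normal(size=(n, d))
arms /= np.linalg.norm(arms, axis=1, keepdims=True)
y = arms @ theta_star + sigma * rng.normal(size=n)

# Least squares over all exploration samples, then threshold at eps_i = 2^{-i}
theta_hat = np.linalg.lstsq(arms, y, rcond=None)[0]
i = 3
D_i = {j for j in range(d) if abs(theta_hat[j]) >= 2.0 ** (-i)}

print(sorted(D_i))  # recovered active coordinates
```

With enough exploration samples, the estimation error per coordinate falls well below the threshold $2^{-i}$, so the active set $\mathcal{D}_i$ coincides with the true support, which is the event driving the oracle-matching regret.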
Theorem 2. Suppose Algorithm 2 is run with input parameters $\delta \in (0, 1)$ and $T_1$ as given in Equation (1). Then, with probability at least $1 - \delta$, the regret after a total of $T$ arm-pulls satisfies
$$ R_T \leq \frac{C}{\gamma^{4.65}}\, T_1 + 25\sqrt{T}\left[1 + 4\sqrt{d^* \ln\left(1 + \frac{25T}{d^*}\right)}\left(1 + \sigma\sqrt{\ln\frac{T}{T_1\delta}}\right) + d^* \ln\left(1 + \frac{25T}{d^*}\right)\right]. $$
The parameter $\gamma > 0$ is the minimum magnitude of the non-zero coordinates of $\theta^*$, i.e., $\gamma = \min\{|\theta_i^*| : \theta_i^* \neq 0\}$, and $d^*$ is the sparsity of $\theta^*$, i.e., $d^* = |\{i : \theta_i^* \neq 0\}|$. In order to parse this result, we give the following corollary.
Corollary 1. Suppose Algorithm 2 is run with input parameters $\delta \in (0, 1)$ and $T_1 = \tilde{O}(d^4 \ln^2(1/\delta))$ as given in Equation (1). Then, with probability at least $1 - \delta$, the regret after $T$ times satisfies
$$ R_T \leq O\left(\frac{d^4}{\gamma^{4.65}} \ln^2(d/\delta)\right) + \tilde{O}(d^*\sqrt{T}). $$
Remark 5. The constants in the Theorem are not optimized. In particular, the exponent of $\gamma$ can be made arbitrarily close to $2$ by setting $\varepsilon_i = C^{-i}$ in Algorithm 2, for some appropriately large constant $C$, and increasing $T_i = (C')^i T_1$ for an appropriately large $C'$ ($C' \approx C^2$).

Discussion -
The regret of an oracle algorithm that knows the true complexity $d^*$ scales as $\tilde{O}(d^*\sqrt{T})$ [CM12, BB20], matching ALB-Dim's regret up to an additive constant independent of time.
ALB-Dim is the first algorithm to achieve such model selection guarantees. On the other hand, standard linear bandit algorithms such as OFUL achieve a regret scaling of $\tilde{O}(d\sqrt{T})$, which is much larger compared to that of ALB-Dim, especially when $d^* \ll d$ and $\gamma$ is a constant. Numerical simulations further confirm this deduction, thereby indicating that our improvements are fundamental and not artifacts of mathematical bounds. Corollary 1 also indicates that ALB-Dim has higher regret if $\gamma$ is lower. A small value of $\gamma$ makes it harder to distinguish a non-zero coordinate from a zero coordinate, which is reflected in the regret scaling. Nevertheless, this only affects the second-order term as a constant, and the dominant scaling term only depends on the true complexity $d^*$, and not on the ambient dimension $d$. However, the regret guarantee is not uniform over all $\theta^*$ as it depends on $\gamma$. Obtaining regret rates matching the oracle that hold uniformly over all $\theta^*$ is an interesting avenue of future work.

In this section, we consider the model selection problem for the setting with finitely many arms in the framework studied in [FKL19]. At each time $t \in [T]$, the forecaster is shown a context $X_t \in \mathcal{X}$, where $\mathcal{X}$ is some arbitrary 'feature space'. The contexts $(X_t)_{t=1}^T$ are i.i.d. with $X_t \sim \mathcal{D}$, a probability distribution over $\mathcal{X}$ that is known to the forecaster. Subsequently, the forecaster chooses an action $A_t \in \mathcal{A}$, where the set $\mathcal{A} := \{1, \cdots, K\}$ comprises the $K$ possible actions. The forecaster then receives a reward $Y_t := \langle \theta^*, \phi_M(X_t, A_t) \rangle + \eta_t$. Here $(\eta_t)_{t=1}^T$ is an i.i.d. sequence of zero-mean sub-Gaussian random variables with a sub-Gaussian parameter $\sigma$ that is known to the forecaster. The function $\phi_M : \mathcal{X} \times \mathcal{A} \to \mathbb{R}^d$ is a known feature map, and $\theta^* \in \mathbb{R}^d$ is an unknown vector. The goal of the forecaster is to minimize its regret, namely $R(T) := \sum_{t=1}^{T} \mathbb{E}[\langle \phi_M(X_t, A_t^*) - \phi_M(X_t, A_t), \theta^* \rangle]$, where at any time $t$, conditional on the context $X_t$, $A_t^* \in \mathrm{argmax}_{a \in \mathcal{A}} \langle \theta^*, \phi_M(X_t, a) \rangle$.
Thus, $A_t^*$ is a random variable, as $X_t$ is random. To describe the model selection problem, we consider a sequence of $M$ dimensions $1 \leq d_1 < d_2 < \cdots < d_M := d$ and an associated set of feature maps $(\phi_m)_{m=1}^M$, where for any $m \in [M]$, $\phi_m(\cdot, \cdot) : \mathcal{X} \times \mathcal{A} \to \mathbb{R}^{d_m}$ is a feature map embedding into $d_m$ dimensions. Moreover, these feature maps are nested, namely, for all $m \in [M-1]$, $x \in \mathcal{X}$ and $a \in \mathcal{A}$, the first $d_m$ coordinates of $\phi_{m+1}(x, a)$ equal $\phi_m(x, a)$. The forecaster is assumed to have knowledge of these feature maps. The unknown vector $\theta^*$ is such that its first $d_{m^*}$ coordinates are non-zero, while the rest are $0$. The forecaster does not know the true dimension $d_{m^*}$. Thus, the effective dimensionality of the problem is $d_{m^*}$, which is unknown to the forecaster. If this were known, then standard contextual bandit algorithms such as LinUCB [CLRS11] can guarantee a regret scaling as $\tilde{O}(\sqrt{d_{m^*} T})$. In this section, we provide an algorithm for which, even when the forecaster is unaware of $d_{m^*}$, the regret scales as $\tilde{O}(\sqrt{d_{m^*} T})$. However, this result is non-uniform over all $\theta^*$ since, as we will show, it depends on the minimum non-zero coordinate value of $\theta^*$.

Model Assumptions
We will require some assumptions identical to the ones stated in [FKL19]. Let $\|\theta^*\| \le 1$, which is known to the forecaster. The distribution $\mathcal{D}$ is assumed to be known to the forecaster. Associated with $\mathcal{D}$ is the matrix $\Sigma_M := \frac{1}{K}\sum_{a \in \mathcal{A}}\mathbb{E}\big[\phi_M(x,a)\phi_M(x,a)^T\big]$ (where $x \sim \mathcal{D}$), whose minimum eigenvalue is assumed to satisfy $\lambda_{\min}(\Sigma_M) > 0$. Moreover, for every $a \in \mathcal{A}$, the random variable $\phi_M(x,a)$ (where $x \sim \mathcal{D}$ is random) is sub-Gaussian with (known) parameter $\tau^2$.

ALB-Dim Algorithm
The algorithm in this case is identical to that of Algorithm 2, except that in place of OFUL we use SupLinRel of [Aue02] as the black-box. The full details of the algorithm are provided in Appendix C.

Figure 1: Synthetic and real-data experiments validating the effectiveness of Algorithms 1 and 2. Panels: (a) regret of ALB (norm), OFUL+ and the oracle for $\|\theta^*\| = 0.1$, $b = 10$; (b) the same for $\|\theta^*\| = 1$, $b = 10$; (c) estimates of $\|\theta^*\|$ over epochs; (d) regret of ALB (dim.), OFUL and the oracle for $d^* = 20$, $d = 500$; (e) the same for $d^* = 20$, $d = 200$; (f) dimension refinement over epochs; (g) cumulative reward on the Yahoo data, $b = 25$; (h) norm estimate over epochs on the Yahoo data, $b = 25$. All results are averaged over 25 trials.
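To make the refinement concrete, the epoch-level model-update step of ALB-Dim (threshold the coordinates of the current estimate of $\theta^*$, then pick the smallest nested model covering the surviving coordinates) can be sketched as follows. This is an illustrative sketch only: the function and variable names are ours, and the threshold schedule $\varepsilon_i = 2^{-i}$ is the one used in our simulations.

```python
import numpy as np

def refine_dimension(beta_hat, dims, epoch):
    """One ALB-Dim refinement step (sketch).

    beta_hat : current least-squares estimate of theta* (length d)
    dims     : nested model dimensions d_1 < d_2 < ... < d_M = d
    epoch    : current epoch index i, giving threshold eps_i = 2 ** -i
    Returns the dimension of the smallest nested model containing every
    coordinate whose estimated magnitude exceeds eps_i.
    """
    eps = 2.0 ** (-epoch)
    active = np.flatnonzero(np.abs(beta_hat) >= eps)  # the set D_i (0-indexed)
    if active.size == 0:
        return dims[0]  # nothing looks non-zero yet; use the smallest model
    top = active.max() + 1  # largest active coordinate, 1-indexed
    return next(d for d in dims if d >= top)

# Toy example: a 20-sparse vector in d = 500 dimensions.
rng = np.random.default_rng(0)
beta = np.zeros(500)
beta[:20] = 0.5
noisy_estimate = beta + 0.01 * rng.standard_normal(500)
dims = [25, 50, 100, 200, 500]
print(refine_dimension(noisy_estimate, dims, epoch=4))  # 25
```

As the estimate sharpens and the threshold shrinks over epochs, the selected dimension settles at the smallest nested model containing the true support.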
For brevity, we only state here the corollary of our main theorem (Theorem 3), which is stated in Appendix C.

Corollary 2. Suppose Algorithm 3 is run with input parameters $\delta \in (0,1)$ and $T_1 = \tilde{O}\big(d^2\ln\big(\frac{2d}{\delta}\big)\big)$ as given in Equation (15). Then, with probability at least $1-\delta$, the regret after $T$ rounds satisfies
$$R_T \le O\left(\frac{d^2}{\gamma^{4.65}}\ln\Big(\frac{2d}{\delta}\Big)\,\tau^4\ln\Big(\frac{TK}{\delta}\Big)\right) + \tilde{O}\big(\sqrt{T\, d_{m^*}}\big),$$
where $\gamma = \min\{|\theta^*_i| : \theta^*_i \neq 0\}$ and $d_{m^*}$ is the sparsity of $\theta^*$.

Discussion. Our regret scaling in time matches that of an oracle that knows the true problem complexity and thus obtains a regret scaling of $\tilde{O}(\sqrt{d_{m^*} T})$. This improves on the rate obtained in [FKL19], whose regret scaling is sub-optimal compared to the oracle. On the other hand, our regret bound depends on $\gamma$ and is thus not uniform over all $\theta^*$, unlike the bound in [FKL19], which is uniform over $\theta^*$. Thus, in general, our results are not directly comparable to those of [FKL19]. It is an interesting direction of future work to close this gap and, in particular, to obtain regret matching that of an oracle uniformly over all $\theta^*$.

Experiments

We compare
ALB-Norm with the (non-adaptive) OFUL+ and an oracle that knows the problem complexity a priori; the oracle simply runs OFUL+ with the known problem complexity. We choose the bias $\mu_i \sim U[-0.5, 0.5]$. At each round of the learning algorithm, we sample the context vectors from a $d$-dimensional standard Gaussian, $\mathcal{N}(0, I_d)$. We select $d = 50$, the number of arms $K = 75$, and the initial epoch length as 100. We generate the true $\theta^*$ in 2 different ways: (i) $\|\theta^*\| = 0.1$ with initial estimate $b = 10$, and (ii) $\|\theta^*\| = 1$ with initial estimate $b = 10$.

In panels (a) and (b) of Figure 1, we observe that, in setting (i), OFUL+ performs poorly owing to the gap between $\|\theta^*\|$ and $b$, while ALB-Norm is sandwiched between OFUL+ and the oracle; similar behavior occurs in setting (ii). In panel (c), we show that the norm estimate of ALB-Norm improves over epochs and converges to the true norm very quickly.

In panels (d)-(f), we compare the performance of ALB-Dim with the OFUL algorithm of [AYPS11] and an oracle that knows the true support of $\theta^*$ a priori. For computational ease, we set $\varepsilon_i = 2^{-i}$ in simulations. We select $\theta^*$ to be $d^* = 20$-sparse, with smallest non-zero component $\gamma = 0.12$, and consider 2 settings: (i) $d = 500$ and (ii) $d = 200$. In panels (d) and (e), we observe a huge gap in cumulative regret between ALB-Dim and OFUL, demonstrating the effectiveness of dimension adaptation. In panel (f), we plot the successive dimension refinement over epochs and observe that within 4-5 epochs, ALB-Dim finds the sparsity of $\theta^*$. Next, we evaluate the performance of
ALB-Norm on the Yahoo! `Learning to Rank Challenge' dataset ([CC10]). In particular, we use the file set2.test.txt, which consists of 103174 rows and 702 columns. The first column denotes the rating, $\{0, 1, \ldots, 4\}$, given by the user (which is taken as the reward); the second column denotes the user id; and the remaining 700 columns denote the context of the user. After selecting 20,000 rows and 50 columns at random (several other random selections yield similar results), we cluster the data by running the $k$-means algorithm with $k = 500$. We treat each cluster as a bandit arm, with mean reward the empirical mean of the individual ratings in the cluster and context the centroid of the cluster. This way, we obtain a bandit setting with $K = 500$ and $d = 50$.

Assuming the (reward, context) pairs come from a linear model (with bias; see Section 3.1), we use ALB-Norm to estimate the bias and $\theta^*$ simultaneously. In panel (g), we plot the cumulative reward accumulated over time and observe that reward accumulates in an almost linear fashion. We also plot the estimate of $\|\theta^*\|$ over epochs in panel (h), starting with an initial estimate of 25. We observe that within 6 epochs the estimate stabilizes to a value of 11.1. This shows that ALB-Norm adapts to the actual $\|\theta^*\|$.

Conclusion

In this paper, we considered refined model selection for linear bandits by defining new notions of complexity. We gave two novel algorithms, ALB-Norm and ALB-Dim, that successively refine the hypothesis class and achieve model selection guarantees: regret scaling in the complexity of the smallest class containing the true model. These are the first such algorithms to achieve regret scaling similar to an oracle that knows the problem complexity. An interesting direction of future work is to derive regret bounds for the case when the dimension is the measure of complexity that hold uniformly over all $\theta^*$, i.e., have no explicit dependence on $\gamma$.

Acknowledgments

The authors would like to acknowledge Akshay Krishnamurthy, Dylan Foster and Haipeng Luo for insightful comments and suggestions.
References

[AB10] Jean-Yves Audibert and Sébastien Bubeck. Best arm identification in multi-armed bandits. 2010.
[AB+11] Sylvain Arlot, Peter L. Bartlett, et al. Margin-adaptive model selection in statistical learning. Bernoulli, 17(2):687-713, 2011.
[ACBF02] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.
[AGO18] Peter Auer, Pratik Gajane, and Ronald Ortner. Adaptively tracking the best arm with an unknown number of distribution changes. In European Workshop on Reinforcement Learning, volume 14, page 375, 2018.
[ALNS16] Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, and Robert E. Schapire. Corralling a band of bandit algorithms. arXiv preprint arXiv:1612.06246, 2016.
[Aue02] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397-422, 2002.
[AYPS11] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312-2320, 2011.
[BB20] Hamsa Bastani and Mohsen Bayati. Online decision making with high-dimensional covariates. Operations Research, 68(1):276-294, 2020.
[BM+98] Lucien Birgé, Pascal Massart, et al. Minimum contrast estimators on sieves: exponential bounds and rates of convergence. Bernoulli, 4(3):329-375, 1998.
[CB17] Ashok Cutkosky and Kwabena Boahen. Online learning without prior information. arXiv preprint arXiv:1703.02629, 2017.
[CC10] Olivier Chapelle and Yi Chang. Yahoo! learning to rank challenge overview. In Proceedings of the 2010 International Conference on Yahoo! Learning to Rank Challenge - Volume 14, YLRC'10, pages 1-24. JMLR.org, 2010.
[Che02] Vladimir Cherkassky. Model complexity control and statistical learning theory. Natural Computing, 1(1):109-133, 2002.
[CLRS11] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208-214, 2011.
[CM12] Alexandra Carpentier and Rémi Munos. Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. In Artificial Intelligence and Statistics, pages 190-198, 2012.
[CMB19] Niladri S. Chatterji, Vidya Muthukumar, and Peter L. Bartlett. OSOM: A simultaneously optimal algorithm for multi-armed and linear contextual bandits. arXiv preprint arXiv:1905.10040, 2019.
[DGL13] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 2013.
[DHK08] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. 2008.
[FKL19] Dylan J. Foster, Akshay Krishnamurthy, and Haipeng Luo. Model selection for contextual bandits. In Advances in Neural Information Processing Systems, pages 14714-14725, 2019.
[GCG17] Avishek Ghosh, Sayak Ray Chowdhury, and Aditya Gopalan. Misspecified linear bandits. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[KL18] Michael Krikheli and Amir Leshem. Finite sample performance of linear least squares estimators under sub-gaussian martingale difference noise. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4444-4448. IEEE, 2018.
[KWS18] Akshay Krishnamurthy, Zhiwei Steven Wu, and Vasilis Syrgkanis. Semiparametric contextual bandits. arXiv preprint arXiv:1803.04204, 2018.
[LC18] Andrea Locatelli and Alexandra Carpentier. Adaptivity to smoothness in X-armed bandits. In Conference on Learning Theory, pages 1463-1492, 2018.
[LN+99] Gábor Lugosi, Andrew B. Nobel, et al. Adaptive model selection using empirical complexities. The Annals of Statistics, 27(6):1830-1864, 1999.
[LR85] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985.
[LS15] Haipeng Luo and Robert E. Schapire. Achieving all with no parameters: AdaNormalHedge. In Conference on Learning Theory, pages 1286-1304, 2015.
[LST17] Thodoris Lykouris, Karthik Sridharan, and Éva Tardos. Small-loss bounds for online learning with partial information. arXiv preprint arXiv:1711.03639, 2017.
[MA13] Brendan McMahan and Jacob Abernethy. Minimax optimal algorithms for unconstrained linear optimization. In Advances in Neural Information Processing Systems, pages 2724-2732, 2013.
[Ora14] Francesco Orabona. Simultaneous model selection and optimization through parameter-free stochastic learning. In Advances in Neural Information Processing Systems, pages 1116-1124, 2014.
[Vap06] Vladimir Vapnik. Estimation of Dependences Based on Empirical Data. Springer Science & Business Media, 2006.
[Wai19] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.
Appendix

A Detailed Description of OFUL+

We now discuss the algorithm OFUL+. A variation of this was proposed in [CMB19] in the context of model selection between linear and standard multi-armed bandits. As seen in the OFUL+ sub-routine of Algorithm 1, we use $\tilde{\mu}_{i,t}$ to address the bias term in the observation, which we define shortly. The parameters $b$ and $\delta$ appear in the construction of the confidence set and the regret guarantee. Furthermore, assume that the algorithm OFUL+ is run for $\tilde{T}$ rounds.

Let $A_s$ be the arm index played at time instant $s$ and $T_i(t)$ the number of times we play arm $i$ until time $t$; hence $T_i(t) = \sum_{s=1}^{t}\mathbb{1}\{A_s = i\}$. Also, let $b$ be the current estimate of $\|\theta^*\|$, and define
$$\bar{g}_{i,t} = \frac{1}{T_i(t)}\sum_{s=1}^{t} g_{i,s}\,\mathbb{1}\{A_s = i\}.$$
With this, we have
$$\tilde{\mu}_{i,t} = \bar{g}_{i,t} + \sigma\left[\frac{2}{T_i(t)}\log\left(\frac{2K(1+T_i(t))^2}{\delta}\right)\right]^{1/2} + b\,\sqrt{\frac{d}{T_i(t)}}\log\left(\frac{2}{\delta}\right). \tag{2}$$
In order to specify the confidence set $\mathcal{C}_t$, we first describe the least squares estimate $\hat{\theta}$. Using the notation of [CMB19], we define
$$\hat{\theta}_t = \big(\alpha_{K+1:t}^{\top}\alpha_{K+1:t} + I\big)^{-1}\alpha_{K+1:t}^{\top} G_{K+1:t},$$
where $\alpha_{K+1:t}$ is the matrix with rows $\alpha_{A_{K+1},K+1}^{\top}, \ldots, \alpha_{A_t,t}^{\top}$ and $G_{K+1:t} = [g_{A_{K+1},K+1} - \tilde{\mu}_{A_{K+1},K+1}, \ldots, g_{A_t,t} - \tilde{\mu}_{A_t,t}]^{\top}$. With this, the confidence set is defined as
$$\mathcal{C}_t = \Big\{\theta \in \mathbb{R}^d : \|\theta - \hat{\theta}_t\| \le K_\delta(b, t, \tilde{T})\Big\}, \tag{3}$$
and Lemma 2 of [CMB19] shows that $\theta^* \in \mathcal{C}_t$ with probability at least $1 - \delta$.

We now define the quantity $K_\delta(b, t, \tilde{T})$; note that we track the dependence on the complexity parameter $\|\theta^*\|$. We have
$$T_{\min}(\delta, \tilde{T}) = \left(\frac{16}{\rho_{\min}^2} + \frac{8}{3\rho_{\min}}\right)\log\left(\frac{2d\tilde{T}}{\delta}\right), \qquad M_\delta(b, t) = b + \sigma\sqrt{2\left(d\log\left(1 + \frac{t}{d}\right) + \log\left(\frac{2}{\delta}\right)\right)}, \tag{4}$$
$$\Upsilon_\delta(b, t, \tilde{T}) = \frac{10}{3}\left(b + 2 + \sigma\sqrt{\log\left(\frac{2K\tilde{T}}{\delta}\right)}\right)\log\left(\frac{2K\tilde{T}}{\delta}\right) + \big(4b + 2\sigma\big)\left(\sqrt{t\log\left(\frac{2K\tilde{T}}{\delta}\right)} + \log\left(\frac{2K\tilde{T}}{\delta}\right)\right), \tag{5}$$
$$K_\delta(b, t, \tilde{T}) = \begin{cases} M_\delta(b, t) + \Upsilon_\delta(b, t, \tilde{T}), & 1 < t \le T_{\min}, \\[4pt] \dfrac{M_\delta(b, t)}{\sqrt{\rho_{\min}\, t/2}} + \dfrac{\Upsilon_\delta(b, t, \tilde{T})}{1 + \rho_{\min}\, t/2}, & t > T_{\min}. \end{cases} \tag{6}$$

B Proofs of the main results
In this section, we collect the proofs of our main results. We start with the norm-based complexity measure.
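The proofs that follow repeatedly use the doubling schedule $T_i = 2^{i-1}T_1$ with per-epoch slacks that halve each epoch. As a quick sanity check (with illustrative values of our choosing), the number of epochs needed to cover a horizon $T$ grows only logarithmically, and the slacks sum to at most $\delta$:

```python
import math

T1, T, delta = 100, 10**6, 0.1

# Epoch lengths double: T_i = 2**(i-1) * T1, until the horizon is covered.
lengths = []
while sum(lengths) < T:
    lengths.append(2 ** len(lengths) * T1)  # T_1, 2 T_1, 4 T_1, ...
N = len(lengths)

# The number of epochs is logarithmic in T / T1.
assert N <= math.ceil(math.log2(T / T1)) + 1

# Slacks halve each epoch: delta_i = delta / 2**i, so their sum stays below delta.
assert sum(delta / 2 ** i for i in range(1, N + 1)) < delta
```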
B.1 Proof of Theorem 1
We first take Lemma 1 for granted and conclude the proof of Theorem 1 using it. Suppose we play Algorithm 1 for $N$ epochs. The cumulative regret is given by
$$R(T) \le C(2\tau + K)\|\theta^*\| + \sum_{i=1}^{N} R^{(\delta_i, b_i)}(T_i),$$
where $R^{(\delta_i, b_i)}(T_i)$ is the cumulative regret of OFUL$^+_{\delta_i}(b_i)$ in the $i$-th epoch. As seen (by tracking the dependence on $\|\theta^*\|$) in [CMB19], the cumulative regret of OFUL$^+_{\delta_i}(b_i)$ scales linearly with $b_i$. Hence we obtain
$$R(T) \le C(2\tau + K)\|\theta^*\| + \sum_{i=1}^{N} b_i\, R(\delta_i, T_i).$$
Using Lemma 1, we obtain, with probability at least $1 - \delta/2$,
$$R(T) \le C(2\tau + K)\|\theta^*\| + (c_1\|\theta^*\| + c_2)\sum_{i=1}^{N} R(\delta_i, T_i).$$
Theorem 3 of [CMB19] gives
$$R(\delta_i, T_i) \le C(\sqrt{K} + \sqrt{d})\sqrt{T_i}\,\log\Big(\frac{K T_i}{\delta_i}\Big) \tag{7}$$
with probability exceeding $1 - \delta_i$. With the doubling trick, we have $T_i = 2^{i-1} T_1$ and $\delta_i = \delta_{i-1}/2$ (so that $\delta_i = \delta/2^{i+1}$). Substituting, we obtain
$$R(\delta_i, T_i) \le C(\sqrt{K} + \sqrt{d})\sqrt{T_i}\left[(2i-2)\log 2 + \log\Big(\frac{4K T_1}{\delta}\Big)\right]$$
with probability at least $1 - \delta_i$. Using the above expression, we obtain
$$R(T) \le C(2\tau + K)\|\theta^*\| + (C_1\|\theta^*\| + C_2)\sum_{i=1}^{N}(\sqrt{K} + \sqrt{d})\sqrt{T_i}\left[(2i-2)\log 2 + \log\Big(\frac{4K T_1}{\delta}\Big)\right]$$
with probability at least $1 - \frac{\delta}{2} - \sum_{i=1}^{N}\delta_i \ge 1 - \frac{\delta}{2} - \frac{\delta}{2} = 1 - \delta$, where the $\delta/2$ term comes from Lemma 1. Also, from the doubling principle, we have
$$\sum_{i=1}^{N} 2^{i-1} T_1 = T \;\Longrightarrow\; N = O\Big(\log\Big(\frac{T}{T_1}\Big)\Big).$$
Using the above expressions, we obtain
$$R(T) \le C(2\tau + K)\|\theta^*\| + 2(C_1\|\theta^*\| + C_2)(\sqrt{K} + \sqrt{d})\log\Big(\frac{4K T_1}{\delta}\Big)\sum_{i=1}^{N} i\sqrt{T_i}$$
$$\le C(2\tau + K)\|\theta^*\| + 2(C_1\|\theta^*\| + C_2)(\sqrt{K} + \sqrt{d})\log\Big(\frac{4K T_1}{\delta}\Big)\, N \sum_{i=1}^{N}\sqrt{T_i}$$
$$\le C(2\tau + K)\|\theta^*\| + C(\|\theta^*\| + 1)(\sqrt{K} + \sqrt{d})\log\Big(\frac{4K T_1}{\delta}\Big)\log\Big(\frac{T}{T_1}\Big)\sqrt{T},$$
where in the last step we used
$$\sum_{i=1}^{N}\sqrt{T_i} = \sqrt{T_N}\Big(1 + \frac{1}{\sqrt{2}} + \cdots\Big) \le \frac{\sqrt{2}}{\sqrt{2}-1}\sqrt{T_N} \le \frac{\sqrt{2}}{\sqrt{2}-1}\sqrt{T}.$$
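The geometric sum used in the last step can be checked numerically; the values of $T_1$ and $N$ below are illustrative choices of ours.

```python
import math

T1, N = 100, 20
lengths = [2 ** (i - 1) * T1 for i in range(1, N + 1)]  # T_i = 2^{i-1} T_1

lhs = sum(math.sqrt(t) for t in lengths)                # sum_i sqrt(T_i)
rhs = math.sqrt(2) / (math.sqrt(2) - 1) * math.sqrt(lengths[-1])

# sum_{i=1}^N sqrt(T_i) = sqrt(T_N)(1 + 1/sqrt(2) + ...) <= sqrt(2)/(sqrt(2)-1) * sqrt(T_N)
assert lhs <= rhs
```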
The above regret bound holds with probability at least $1 - \delta$.

B.2 Proof of Lemma 1
Let us consider the $i$-th epoch, and let $\hat{\theta}_{E_i}$ be the least squares estimate of $\theta^*$ at the end of epoch $i$. From the previous section, the confidence set at the end of epoch $i$ is given by
$$\mathcal{C}_{E_i} = \Big\{\theta \in \mathbb{R}^d : \|\theta - \hat{\theta}_{E_i}\| \le K_{\delta_i}(b_i, T_i, T_i)\Big\},$$
where we play OFUL$^+_{\delta_i}(b_i)$ during the $i$-th epoch and $T_i$ is the total number of rounds in the $i$-th epoch. By choosing $T_1 > T_{\min}(\delta, T_1)$, we ensure that $T_i \ge T_{\min}(\delta_i, T_i)$. From Equation (6), ignoring the non-dominant terms, we obtain
$$K_{\delta_i}(b_i, T_i, T_i) = \frac{M_{\delta_i}(b_i, T_i)}{\sqrt{\rho_{\min} T_i/2}} + \frac{\Upsilon_{\delta_i}(b_i, T_i, T_i)}{1 + \rho_{\min} T_i/2},$$
with
$$M_{\delta_i}(b_i, T_i) \le b_i + c_1\sigma\sqrt{d}\,\log\Big(\frac{T_i}{d\,\delta_i}\Big) \quad\text{and}\quad \Upsilon_{\delta_i}(b_i, T_i, T_i) = 4 b_i\sqrt{T_i\log\Big(\frac{K T_i}{\delta_i}\Big)} + c_2\sigma\sqrt{T_i\log\Big(\frac{K T_i}{\delta_i}\Big)}.$$
Substituting these values, keeping the dominating terms, and taking $T_i$ sufficiently large, we obtain
$$K_{\delta_i}(b_i, T_i, T_i) \le \frac{7\, b_i\log\big(\frac{K T_i}{\delta_i}\big)}{\sqrt{\rho_{\min} T_i}} + \frac{C_1\sigma\sqrt{d}}{\sqrt{\rho_{\min} T_i}}\log\Big(\frac{K T_i}{\delta_i}\Big),$$
where $C_1$ is a universal constant. From Lemma 2 of [CMB19], we know that $\theta^* \in \mathcal{C}_{E_i}$ with probability at least $1 - \delta_i$. Hence, we obtain
$$\|\hat{\theta}_{E_i}\| \le \|\theta^*\| + 2 K_{\delta_i}(b_i, T_i, T_i) \le \|\theta^*\| + \frac{14\, b_i\log\big(\frac{K T_i}{\delta_i}\big)}{\sqrt{\rho_{\min} T_i}} + \frac{2 C_1\sigma\sqrt{d}}{\sqrt{\rho_{\min} T_i}}\log\Big(\frac{K T_i}{\delta_i}\Big).$$
At the end of the $i$-th epoch, we set the length $T_{i+1} = 2 T_i$, and the estimate of $\|\theta^*\|$ is set to $b_{i+1} = \max_{\theta \in \mathcal{C}_{E_i}}\|\theta\|$. From the definition of $\mathcal{C}_{E_i}$, we obtain $b_{i+1} = \|\hat{\theta}_{E_i}\| + K_{\delta_i}(b_i, T_i, T_i)$. Re-writing the above expression, with probability at least $1 - \delta_i$, we obtain
$$b_{i+1} \le \|\theta^*\| + \frac{14\log\big(\frac{K T_i}{\delta_i}\big)}{\sqrt{\rho_{\min}}}\frac{b_i}{\sqrt{T_i}} + \frac{C\sigma\log\big(\frac{K T_i}{\delta_i}\big)}{\sqrt{\rho_{\min}}}\frac{\sqrt{d}}{\sqrt{T_i}} \le \|\theta^*\| + \frac{i\,p\, b_i}{2^{(i-1)/2}\sqrt{T_1}} + \frac{i\,q\,\sqrt{d}}{2^{(i-1)/2}\sqrt{T_1}}, \tag{8}$$
where we use the facts that $\delta_i = \delta/2^{i+1}$ and $T_i = 2^{i-1}T_1$, and where $p = \frac{14\log(\frac{4K T_1}{\delta})}{\sqrt{\rho_{\min}}}$ and $q = \frac{C\sigma\log(\frac{4K T_1}{\delta})}{\sqrt{\rho_{\min}}}$. Hence, we obtain
$$b_{i+1} - b_i \le \|\theta^*\| + \frac{i\,q\sqrt{d}}{2^{(i-1)/2}\sqrt{T_1}} - \Big(1 - \frac{i\,p}{2^{(i-1)/2}\sqrt{T_1}}\Big) b_i.$$
From the construction of $b_i$, we have $-b_i \le -\|\theta^*\|$. Hence, provided $\sqrt{T_1} \ge \frac{i\,p}{2^{(i-1)/2}}$, which is equivalent to the condition $T_1 \ge (1.5\,p)^2$ (using the fact that $i/2^{(i-1)/2} \le 1.5$ for all $i \ge 1$), we obtain
$$b_{i+1} - b_i \le \frac{i\,p}{2^{(i-1)/2}\sqrt{T_1}}\|\theta^*\| + \frac{i\,q\sqrt{d}}{2^{(i-1)/2}\sqrt{T_1}}.$$
From the above expression, we obtain $\sup_i b_i < \infty$ with probability at least $1 - \sum_i \delta_i \ge 1 - \delta/2$. Invoking Equation (8) and using the above fact in conjunction yields (with probability at least $1 - \delta/2$)
$$\lim_{i\to\infty} b_i \le \|\theta^*\|.$$
However, by construction $b_i \ge \|\theta^*\|$. Using this, along with the above equation, we obtain
$$\lim_{i\to\infty} b_i = \|\theta^*\|$$
with probability exceeding $1 - \delta/2$. So the sequence $\{b_1, b_2, \ldots\}$ converges to $\|\theta^*\|$ with high probability, and hence our successive refinement algorithm is consistent.

Rate of Convergence: Since
$$b_{i+1} - b_i = \tilde{O}\Big(\frac{i}{2^{i/2}}\Big) \tag{9}$$
with probability greater than $1 - \delta_i$, the rate of convergence of the sequence $\{b_i\}_{i=0}^{\infty}$ is exponential in the number of epochs.

Uniform upper bound on $b_i$ for all $i$: We now compute a uniform upper bound on $b_i$ for all $i$. Consider the sequence $\big(i/2^{(i-1)/2}\big)_{i=1}^{\infty}$, and let $t_j$ denote the $j$-th term of the sequence. It is easy to check that $\sup_i t_i = 1.5$ and that $\{t_i\}_{i=1}^{\infty}$ is convergent. With this new notation, we have
$$b_2 \le \|\theta^*\| + \frac{t_1\, p\, b_1}{\sqrt{T_1}} + \frac{t_1\, q\sqrt{d}}{\sqrt{T_1}}$$
with probability exceeding $1 - \delta_1$. Similarly, for $b_3$, we have
$$b_3 \le \|\theta^*\| + \frac{t_2\, p\, b_2}{\sqrt{T_1}} + \frac{t_2\, q\sqrt{d}}{\sqrt{T_1}} \le \Big(1 + \frac{t_2\, p}{\sqrt{T_1}}\Big)\|\theta^*\| + \frac{t_1 t_2\, p^2}{T_1}\, b_1 + \Big(\frac{t_1 t_2\, p\, q\sqrt{d}}{T_1} + \frac{t_2\, q\sqrt{d}}{\sqrt{T_1}}\Big)$$
with probability at least $1 - \delta_1 - \delta_2$. Similar expressions hold for $b_4, b_5, \ldots$. Now, provided $T_1 \ge C_1\big(\max\{p, q\}\, b_1\big)^2 d$, where $C_1 > 1$, each $b_i$ can be upper-bounded as
$$b_i \le (c_1\|\theta^*\| + c_2) \tag{10}$$
with probability at least $1 - \sum_{j}\delta_j \ge 1 - \delta/2$. Here $c_1$ and $c_2$ are constants, obtained by summing an infinite geometric series with decaying step size; we also use the facts that $b_1 \ge 1$ and $\delta_i = \delta_{i-1}/2$.

B.3 Proof of Theorem 2

We shall need the following lemma from [KL18] on the behaviour of linear regression estimates.
Lemma 2. If $M \ge d$ and satisfies $M = O\big(\big(\frac{1}{\varepsilon^2} + d\big)\ln\big(\frac{2d}{\delta}\big)\big)$, and $\hat{\theta}^{(M)}$ is the least-squares estimate of $\theta^*$ using $M$ random samples for the features, where each feature is chosen uniformly and independently on the unit sphere in $d$ dimensions, then with probability at least $1 - \delta$, $\hat{\theta}^{(M)}$ is well defined (the least squares regression has a unique solution). Furthermore,
$$\mathbb{P}\big[\|\hat{\theta}^{(M)} - \theta^*\|_\infty \ge \varepsilon\big] \le \delta.$$
We shall now apply the lemma as follows. Denote by $\hat{\theta}_i$ the estimate of $\theta^*$ at the beginning of any phase $i$, using all the samples from random explorations in all phases less than or equal to $i-1$.

Remark 6. The choice $T_1 := O\big(d^2\ln\big(\frac{2d}{\delta}\big)\big)$ in Equation (1) is chosen such that, from Lemma 3, we have
$$\mathbb{P}\Big[\big\|\hat{\theta}^{(\lceil\sqrt{T_1}\rceil)} - \theta^*\big\|_\infty \ge \tfrac{1}{2}\Big] \le \frac{\delta}{2}.$$

Lemma 3. Suppose $T_1 = O\big(d^2\ln\big(\frac{2d}{\delta}\big)\big)$ is set according to Equation (1). Then, for all phases $i \ge 4$,
$$\mathbb{P}\big[\|\hat{\theta}_i - \theta^*\|_\infty \ge 2^{-i}\big] \le \frac{\delta}{2^i}, \tag{11}$$
where $\hat{\theta}_i$ is the estimate of $\theta^*$ obtained by solving the least squares problem using all random exploration samples collected until the beginning of phase $i$.

Proof. The above lemma follows directly from Lemma 2, which gives that if $\hat{\theta}_i$ is formed by solving the least squares problem with at least $M_i := O\big(\big(4^i + d\big)\ln\big(\frac{2^i d}{\delta}\big)\big)$ samples, then the guarantee in Equation (11) holds. However, as $T_1 = O\big(d^2\ln\big(\frac{2d}{\delta}\big)\big)$, we naturally have $M_i \le \frac{5^i}{5}\lceil\sqrt{T_1}\rceil$. The proof is concluded if we show that at the beginning of phase $i \ge 4$, the total number of random explorations performed by the algorithm exceeds $\frac{5^i}{5}\lceil\sqrt{T_1}\rceil$. Notice that at the beginning of any phase $i \ge 4$, the total number of random explorations that have been performed is
$$\sum_{j=0}^{i-1} 5^j\lceil\sqrt{T_1}\rceil = \frac{5^i - 1}{4}\lceil\sqrt{T_1}\rceil \ge \frac{5^i}{5}\lceil\sqrt{T_1}\rceil,$$
where the last inequality holds for all $i \ge 1$.

Corollary 3.
$$\mathbb{P}\Big[\bigcap_{i \ge 4}\big\{\|\hat{\theta}_i - \theta^*\|_\infty \le 2^{-i}\big\}\Big] \ge 1 - \frac{\delta}{8}.$$
Proof. This follows from a simple union bound:
$$\mathbb{P}\Big[\bigcap_{i \ge 4}\big\{\|\hat{\theta}_i - \theta^*\|_\infty \le 2^{-i}\big\}\Big] = 1 - \mathbb{P}\Big[\bigcup_{i \ge 4}\big\{\|\hat{\theta}_i - \theta^*\|_\infty \ge 2^{-i}\big\}\Big] \ge 1 - \sum_{i \ge 4}\mathbb{P}\big[\|\hat{\theta}_i - \theta^*\|_\infty \ge 2^{-i}\big] \ge 1 - \sum_{i \ge 4}\frac{\delta}{2^i} = 1 - \frac{\delta}{8}.$$
We are now ready to conclude the proof of Theorem 2.
Proof of Theorem 2.

We know from Corollary 3 that, with probability at least $1 - \frac{\delta}{8}$, for all phases $i \ge 4$ we have $\|\hat{\theta}_i - \theta^*\|_\infty \le 2^{-i}$. Call this event $\mathcal{E}$. Now consider the phase $i(\gamma) := \max\big(4, \log_2\big(\frac{2}{\gamma}\big)\big)$. When event $\mathcal{E}$ holds, then for all phases $i \ge i(\gamma)$, $\mathcal{D}_i$ is the correct set of $d^*$ non-zero coordinates of $\theta^*$. Thus, with probability at least $1 - \frac{\delta}{8}$, the total regret up to time $T$ can be upper bounded as follows:
$$R_T \le \sum_{j=0}^{i(\gamma)-1}\big(25^j T_1 + 5^j\lceil\sqrt{T_1}\rceil\big) + \sum_{j=i(\gamma)}^{\lceil\log_{25}(T/T_1)\rceil}\text{Regret}\big(\text{OFUL}(1, \delta_j;\, 25^j T_1)\big) + \sum_{j=i(\gamma)}^{\lceil\log_{25}(T/T_1)\rceil} 5^j\lceil\sqrt{T_1}\rceil. \tag{12}$$
The term $\text{Regret}(\text{OFUL}(L, \delta, T))$ denotes the regret of the OFUL algorithm [AYPS11] when run with parameters $L \in \mathbb{R}_+$, such that $\|\theta^*\| \le L$, probability slack $\delta \in (0,1)$, and time horizon $T$. Equation (12) follows since the total number of phases is at most $\lceil\log_{25}(T/T_1)\rceil$. Standard results from [AYPS11] give us that, with probability at least $1 - \delta$,
$$\text{Regret}\big(\text{OFUL}(1, \delta;\, T)\big) \le 4\sqrt{T d^*\ln\Big(\frac{T}{d^*}\Big)}\left(\sigma\sqrt{\ln\Big(\frac{1}{\delta}\Big) + d^*\ln\Big(\frac{T}{d^*}\Big)} + 1\right).$$
Thus, with probability at least $1 - \sum_{i \ge 1}\frac{\delta}{2^i} \ge 1 - \delta$, for all phases $i \ge i(\gamma)$, the regret in the exploitation phase satisfies
$$\text{Regret}\big(\text{OFUL}(1, \delta_i;\, 25^i T_1)\big) \le 4\sqrt{d^*\, 25^i T_1\ln\Big(\frac{25^i T_1}{d^*}\Big)}\left(\sigma\sqrt{\ln\Big(\frac{2^i}{\delta}\Big) + d^*\ln\Big(\frac{25^i T_1}{d^*}\Big)} + 1\right). \tag{13}$$
In particular, for all phases $i \in \big[i(\gamma), \lceil\log_{25}(T/T_1)\rceil\big]$, with probability at least $1 - \delta$, we have
$$\text{Regret}\big(\text{OFUL}(1, \delta_i;\, 25^i T_1)\big) \le 4\sqrt{d^*\, 25^i T_1\ln\Big(\frac{T}{d^*}\Big)}\left(\sigma\sqrt{\ln\Big(\frac{T}{T_1\delta}\Big) + d^*\ln\Big(\frac{T}{d^*}\Big)} + 1\right) = C(T, \delta, d^*)\sqrt{25^i T_1}, \tag{14}$$
where the constant captures all the terms that depend only on $T$, $\delta$ and $d^*$. We can write that constant as
$$C(T, \delta, d^*) = 4\sqrt{d^*\ln\Big(\frac{T}{d^*}\Big)}\left(\sigma\sqrt{\ln\Big(\frac{T}{T_1\delta}\Big) + d^*\ln\Big(\frac{T}{d^*}\Big)} + 1\right).$$
Equation (14) follows by substituting $i \le \log_{25}(T/T_1)$ in all terms except the first $25^i$ term in Equation (13). As Equations (14) and (12) each hold with probability at least $1 - \delta$, we can combine them to get that, with probability at least $1 - 2\delta$,
$$R_T \le 2 T_1\, 25^{i(\gamma)} + \sum_{j=0}^{\log_{25}(T/T_1)+1} C(T, \delta, d^*)\sqrt{25^j T_1} + 25\sqrt{T}$$
$$\le 2 T_1\, 25^{i(\gamma)} + 25\sqrt{T} + 25\sqrt{T}\, C(T, \delta, d^*)$$
$$\overset{(a)}{\le} 2 T_1\max\left(25^4, \Big(\frac{2}{\gamma}\Big)^{4.65}\right) + 25\sqrt{T} + 25\sqrt{T}\, C(T, \delta, d^*)$$
$$= O\left(\frac{d^2}{\gamma^{4.65}}\ln\Big(\frac{2d}{\delta}\Big)\right) + \tilde{O}\left(d^*\sqrt{T\ln\Big(\frac{1}{\delta}\Big)}\right).$$
Step $(a)$ follows since $25 \le 2^{4.65}$; re-scaling $\delta \to \delta/2$ yields the theorem.

C ALB-Dim for Stochastic Contextual Bandits with Finite Arms
C.1 ALB-Dim Algorithm for the Finite Armed Case

The algorithm given in Algorithm 3 is identical to the earlier Algorithm 2, except that in line 8 it uses SupLinRel [Aue02] in place of the OFUL used previously. In practice, one could also use LinUCB [CLRS11] in place of SupLinRel; however, we choose to present the theoretical argument using SupLinRel since, unlike LinUCB, it has an explicit closed-form regret bound [Aue02]. The pseudocode is provided in Algorithm 3.

Algorithm 3: Adaptive Linear Bandit (Dimension) with Finitely Many Arms
Input: initial phase length $T_1$ and slack $\delta > 0$.
Initialize $\hat{\beta}^{(0)} = \mathbf{1}$, $T_{-1} = 0$.
for each epoch $i \in \{0, 1, 2, \cdots\}$ do
  Set $T_i = 25^i T_1$, $\varepsilon_i \leftarrow 2^{-i}$, $\delta_i \leftarrow \frac{\delta}{2^i}$.
  $\mathcal{D}_i := \{j : |\hat{\beta}^{(i)}_j| \ge \varepsilon_i\}$, $M_i := \inf\{m : d_m \ge \max\mathcal{D}_i\}$.
  for times $t \in \{T_{i-1}+1, \cdots, T_i\}$ do
    Play according to SupLinRel of [Aue02] with time horizon $25^i T_1$, slack $\delta_i \in (0,1)$, dimension $d_{M_i}$, and feature scaling $b(\delta)$.
  end for
  for times $t \in \{T_i+1, \cdots, T_i + 5^i\lceil\sqrt{T_1}\rceil\}$ do
    Play an arm from the action set $\mathcal{A}$ chosen uniformly and independently at random.
  end for
  Form $\alpha_i \in \mathbb{R}^{S_i \times d}$ with each row being the (full-dimensional) feature of an arm played during all random explorations in the past, and $y_i \in \mathbb{R}^{S_i}$ with entries the corresponding observed rewards.
  $\hat{\beta}^{(i+1)} \leftarrow (\alpha_i^{T}\alpha_i)^{-1}\alpha_i^{T} y_i$, a $d$-dimensional vector.
end for

In phase $i \in \mathbb{N}$, the SupLinRel algorithm is instantiated with input parameters $25^i T_1$ (the time horizon), slack parameter $\delta_i \in (0,1)$, dimension $d_{M_i}$, and feature scaling $b(\delta)$. We explain the role of these input parameters. The dimension ensures that SupLinRel plays from the restricted dimension $d_{M_i}$. The feature scaling implies that when a context $x \in \mathcal{X}$ is presented to the algorithm, the set of $K$ feature vectors, each of which is $d_{M_i}$-dimensional, is $\frac{\phi_{d_{M_i}}(x,1)}{b(\delta)}, \cdots, \frac{\phi_{d_{M_i}}(x,K)}{b(\delta)}$. The constant $b(\delta) := O\big(\tau\sqrt{\log\big(\frac{TK}{\delta}\big)}\big)$ is chosen such that
$$\mathbb{P}\Big[\sup_{t \in [0,T],\, a \in \mathcal{A}}\|\phi_M(x_t, a)\| \ge b(\delta)\Big] \le \frac{\delta}{2}.$$
Such a constant exists since $(x_t)_{t \in [0,T]}$ are i.i.d. and $\phi_M(x, a)$ is a sub-Gaussian random variable with parameter $4\tau^2$, for all $a \in \mathcal{A}$. A similar idea was used in [FKL19].

C.2 Regret Guarantee for Algorithm 3
In order to specify a regret guarantee, we will need to specify the value of $T_1$. We do so as before. For any $N$, denote by $\lambda^{(N)}_{\max}$ and $\lambda^{(N)}_{\min}$ the maximum and minimum eigenvalues of the following matrix:
$$\Sigma^{(N)} := \mathbb{E}\left[\frac{1}{K}\sum_{j=1}^{K}\sum_{t=1}^{N}\phi_M(x_t, j)\phi_M(x_t, j)^{T}\right],$$
where the expectation is with respect to $(x_t)_{t \in [N]}$, an i.i.d. sequence with distribution $\mathcal{D}$. First, given the distribution of $x \sim \mathcal{D}$, one can (in principle) compute $\lambda^{(N)}_{\max}$ and $\lambda^{(N)}_{\min}$ for any $N \ge 1$. Furthermore, from the assumption on $\mathcal{D}$, $\lambda^{(N)}_{\min} > 0$ for every $N \ge 1$. Now choose $T_1 \in \mathbb{N}$ to be the smallest integer such that
$$\sqrt{T_1} \ge b(\delta)^2\max\left(\frac{16\sigma^2}{\big(\lambda^{(\lceil\sqrt{T_1}\rceil)}_{\min}\big)^2}\ln\Big(\frac{2d}{\delta}\Big),\; \frac{4}{3}\,\frac{\big(6\lambda^{(\lceil\sqrt{T_1}\rceil)}_{\max} + \lambda^{(\lceil\sqrt{T_1}\rceil)}_{\min}\big)\big(d + \lambda^{(\lceil\sqrt{T_1}\rceil)}_{\max}\big)}{\big(\lambda^{(\lceil\sqrt{T_1}\rceil)}_{\min}\big)^2}\ln\Big(\frac{2d}{\delta}\Big)\right). \tag{15}$$
As before, it is easy to see that
$$T_1 = O\left(d^2\ln\Big(\frac{2d}{\delta}\Big)\,\tau^4\ln\Big(\frac{TK}{\delta}\Big)\right).$$
Furthermore, following the same reasoning as in Lemmas 3 and 2, one can verify that for all $i \ge 1$,
$$\mathbb{P}\big[\|\hat{\beta}^{(i-1)} - \beta^*\|_\infty \ge 2^{-i}\big] \le \frac{\delta}{2^i}.$$

Theorem 3.
Suppose Algorithm 3 is run with input parameters $\delta \in (0,1)$ and $T_1$ as given in Equation (15). Then, with probability at least $1 - \delta$, the regret after a total of $T$ arm-pulls satisfies
$$R_T \le 2 T_1\max\left(25^4, \Big(\frac{2}{\gamma}\Big)^{4.65}\right) + 308\big(1 + \ln(2KT\ln T)\big)^{3/2}\sqrt{T\, d_{m^*}} + 100\sqrt{T}.$$
The parameter $\gamma > 0$ is the minimum magnitude of the non-zero coordinates of $\beta^*$, i.e., $\gamma = \min\{|\beta^*_i| : \beta^*_i \neq 0\}$. In order to parse the above theorem, the following corollary is presented.
Corollary 4. Suppose Algorithm 3 is run with input parameters $\delta \in (0,1)$ and $T_1 = \tilde{O}\big(d^2\ln\big(\frac{2d}{\delta}\big)\big)$ as given in Equation (15). Then, with probability at least $1 - \delta$, the regret after $T$ rounds satisfies
$$R_T \le O\left(\frac{d^2}{\gamma^{4.65}}\ln\Big(\frac{2d}{\delta}\Big)\,\tau^4\ln\Big(\frac{TK}{\delta}\Big)\right) + \tilde{O}\big(\sqrt{T\, d_{m^*}}\big).$$

Proof of Theorem 3.
The proof proceeds identically to that of Theorem 2. Observe from Lemmas 2 and 3 that the choice of $T_1$ is such that for all phases $i \ge 1$,
$$\mathbb{P}\big[\|\hat{\beta}^{(i-1)} - \beta^*\|_\infty \ge 2^{-i}\big] \le \frac{\delta}{2^i}.$$
Thus, from a union bound, we can conclude that
$$\mathbb{P}\Big[\bigcup_{i \ge 1}\big\{\|\hat{\beta}^{(i-1)} - \beta^*\|_\infty \ge 2^{-i}\big\}\Big] \le \frac{\delta}{2}.$$
Thus, with probability at least $1 - \delta$, the following events hold:
- $\sup_{t \in [0,T],\, a \in \mathcal{A}}\|\phi_M(x_t, a)\| \le b(\delta)$;
- $\|\hat{\beta}^{(i-1)} - \beta^*\|_\infty \le 2^{-i}$, for all $i \ge 1$.
Call this event $\mathcal{E}$. As before, let $\gamma > 0$ be the minimum magnitude of the non-zero coordinates of $\beta^*$, and denote the phase $i(\gamma) := \max\big(4, \log_2\big(\frac{2}{\gamma}\big)\big)$. Thus, under the event $\mathcal{E}$, for all phases $i \ge i(\gamma)$, the dimension $d_{M_i} = d_{m^*}$, i.e., SupLinRel is run with the correct set of dimensions.

It thus remains to bound the error by summing over the phases, which is done identically to Theorem 2. With probability at least $1 - \delta - \sum_{i \ge 1}\frac{\delta}{2^i} \ge 1 - 2\delta$,
$$R_T \le \sum_{j=0}^{i(\gamma)-1}\big(25^j T_1 + 5^j\lceil\sqrt{T_1}\rceil\big) + \sum_{j=i(\gamma)}^{\lceil\log_{25}(T/T_1)\rceil}\text{Regret(SupLinRel)}\big(25^j T_1, \delta_j, d_{M_j}, b(\delta)\big) + \sum_{j=i(\gamma)}^{\lceil\log_{25}(T/T_1)\rceil} 5^j\lceil\sqrt{T_1}\rceil,$$
where
$$\text{Regret(SupLinRel)}\big(25^i T_1, \delta_i, d_{M_i}, b(\delta)\big) \le 44\big(1 + \ln(2K\, 25^i T_1\ln 25^i T_1)\big)^{3/2}\sqrt{25^i T_1\, d_{M_i}} + 2\sqrt{25^i T_1}.$$
This expression follows from Theorem 6 in [Aue02]. We now use this to bound each of the three terms in the display above. Straightforward calculations show that the first term is bounded by $2 T_1\, 25^{i(\gamma)}$ and the last term is bounded above by $25\lceil\sqrt{T_1}\rceil\sqrt{T/T_1}$, respectively. We now bound the middle term as
$$\sum_{j=i(\gamma)}^{\lceil\log_{25}(T/T_1)\rceil}\text{Regret(SupLinRel)}\big(25^j T_1, \delta_j, d_{m^*}, b(\delta)\big) \le \sum_{j=i(\gamma)}^{\lceil\log_{25}(T/T_1)\rceil}\Big[44\big(1 + \ln(2K\, 25^j T_1\ln 25^j T_1)\big)^{3/2}\sqrt{25^j T_1\, d_{M_j}} + 2\sqrt{25^j T_1}\Big].$$
The first summation can be bounded as
$$\sum_{j=i(\gamma)}^{\lceil\log_{25}(T/T_1)\rceil} 44\big(1 + \ln(2K\, 25^j T_1\ln 25^j T_1)\big)^{3/2}\sqrt{25^j T_1\, d_{M_j}} \le \sum_{j=i(\gamma)}^{\lceil\log_{25}(T/T_1)\rceil} 44\big(1 + \ln(2KT\ln T)\big)^{3/2}\sqrt{25^j T_1\, d_{m^*}}$$
$$\le 44\big(1 + \ln(2KT\ln T)\big)^{3/2}\,\frac{7}{2}\sqrt{T\, d_{m^*}} = 308\big(1 + \ln(2KT\ln T)\big)^{3/2}\sqrt{T\, d_{m^*}},$$
and the second by
$$\sum_{j=i(\gamma)}^{\lceil\log_{25}(T/T_1)\rceil} 2\sqrt{25^j T_1} \le 50\sqrt{T}.$$
Thus, with probability at least $1 - 2\delta$, the regret of Algorithm 3 satisfies
$$R_T \le 2 T_1\, 25^{i(\gamma)} + 308\big(1 + \ln(2KT\ln T)\big)^{3/2}\sqrt{T\, d_{m^*}} + 100\sqrt{T},$$
where $i(\gamma) := \max\big(4, \log_2\big(\frac{2}{\gamma}\big)\big)$. Thus,
$$R_T \le 2 T_1\max\left(25^4, \Big(\frac{2}{\gamma}\Big)^{4.65}\right) + 308\big(1 + \ln(2KT\ln T)\big)^{3/2}\sqrt{T\, d_{m^*}} + 100\sqrt{T},$$
since $25 \le 2^{4.65}$.