A Smoothed Analysis of Online Lasso for the Sparse Linear Contextual Bandit Problem
Zhiyuan Liu, Huazheng Wang, Bo Waggoner, Youjian (Eugene) Liu, Lijun Chen
ICML 2020 Workshop on Real World Experiment Design and Active Learning
Zhiyuan Liu [email protected]
Department of Computer Science, University of Colorado, Boulder
Huazheng Wang [email protected]
Department of Computer Science, University of Virginia
Bo Waggoner [email protected]
Department of Computer Science, University of Colorado, Boulder
Youjian (Eugene) Liu [email protected]
Department of Electrical, Computer and Energy Engineering, University of Colorado, Boulder
Lijun Chen [email protected]
Department of Computer Science, University of Colorado, Boulder
Abstract
We investigate the sparse linear contextual bandit problem, where the parameter $\theta$ is sparse. To relieve the sampling inefficiency, we utilize the "perturbed adversary", where the context is generated adversarially but with small random non-adaptive perturbations. We prove that the simple online Lasso supports the sparse linear contextual bandit with regret bound $O(\sqrt{kT\log d})$ even when $d \gg T$, where $k$ and $d$ are the effective and ambient dimensions, respectively. Compared to the recent work of Sivakumar et al. (2020), our analysis does not rely on preconditioning, adaptive perturbation (adaptive perturbation violates the i.i.d. perturbation setting), or truncation on the error set. Moreover, the special structures in our results explicitly characterize how the perturbation affects the exploration length, and guide the design of the perturbation together with the fundamental performance limit of the perturbation method. Numerical experiments are provided to complement the theoretical analysis.
1. Introduction
Contextual bandit algorithms have become a reference solution for sequential decision-making problems such as online recommendation (Li et al., 2010), clinical trials (Durand et al., 2018), dialogue systems (Upadhyay et al., 2019) and anomaly detection (Ding et al., 2019). Such an algorithm adaptively learns the personalized mapping between the observed contextual features and unknown parameters such as user preferences, and addresses the trade-off between exploration and exploitation (Auer, 2002; Li et al., 2010; Abbasi-Yadkori et al., 2011; Agrawal and Goyal, 2013; Abeille et al., 2017).

We consider the sparse linear contextual bandit problem, where the context is high-dimensional with a sparse unknown parameter $\theta$ (Abbasi-Yadkori et al., 2012; Hastie et al., 2015; Dash and Liu, 1997), i.e., most entries in $\theta$ are zero and thus only a few dimensions of the context feature are relevant to the reward. Due to insufficient data samples, the learning algorithm has to be sample efficient to support sequential decision-making. However, the data from a bandit model usually does not satisfy the requirements for sparse recovery, such as the null space condition (Cohen et al., 2009), the restricted isometry property (RIP) (Donoho, 2006), the restricted eigenvalue (RE) condition (Bickel et al., 2009), or the compatibility condition (Van De Geer et al., 2009). To achieve the desired performance, existing works have to consider restricted problem settings, e.g., the unit-ball, hypercube or i.i.d. arm set (Carpentier and Munos, 2012; Lattimore et al., 2015; Kim and Paik, 2019; Bastani and Bayati, 2020), or a parameter with a Gaussian prior (Gilton and Willett, 2017).
One exception is the online-to-confidence-set conversion (Abbasi-Yadkori et al., 2012), which considers the general setting but suffers from computational inefficiency.

In this paper, we tackle the sparse linear bandit problem using the smoothed analysis technique (Spielman and Teng, 2004; Kannan et al., 2018), which enjoys efficient implementation and mild assumptions. Specifically, we consider the perturbed adversary setting, where the context is generated adversarially but perturbed by small random noise. This setting interpolates between an i.i.d. distributional assumption on the input and the worst case of fully adversarial contexts. Our results show that, with high probability, the perturbed adversary inherently guarantees the (linearly growing) strong convexity condition in the low dimensional case and the restricted eigenvalue (RE) condition in the high dimensional case, which is a key property required by standard Lasso regression. We prove that the simple online Lasso supports sparse linear contextual bandits with regret bound $O(\sqrt{kT\log d})$. We also provide numerical experiments to complement the theoretical analysis.

We also note the recent work of Sivakumar et al. (2020) using smoothed analysis for structured linear contextual bandits. Compared to their work, our proposed method has the following advantages: (1) Our analysis only relies on the simple online Lasso, instead of preconditioning and truncation on the error set. Although preconditioning transfers the non-zero singular values to 1, it could amplify the noise, and the preconditioned noises are no longer i.i.d., which makes concentration analysis difficult and the estimation unstable (Jia et al., 2015). We also observe this effect in the numerical experiments. (2) Their proof relies on the assumption that the perturbations are adaptively generated based on the observed history of the chosen contexts. Instead, our analysis is based on the milder assumption that the perturbation is i.i.d. and non-adaptive. (3) Their regret bound does not describe the full picture of the effect of the perturbation's variance. Our analysis explicitly shows how the perturbation affects the exploration length, and guides the design of the perturbation together with the fundamental performance limit of the perturbation method.
2. Model and Methodology
In the bandit problem, at each round $t$ the learner pulls an arm $a_t$ among $m$ arms (we denote the arm set by $[m]$, that is, $a_t \in [m]$) and receives the corresponding noisy reward $r_{t,a_t}$. The performance of the learner is evaluated by the regret $R$, which quantifies the total loss from not choosing the best arm $a_t^*$ during $T$ rounds:
\[
R(T) = \sum_{t=1}^{T}\big( r_{t,a_t^*} - r_{t,a_t} \big). \tag{1}
\]
1. In this paper, we denote by $[n]$ the set $\{1, \cdots, n\}$ for a positive integer $n$.
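For concreteness, the regret in equation (1) is just a sum of per-round reward gaps between the best arm and the chosen arm; a minimal helper (the function name is ours, for illustration only):

```python
import numpy as np

def cumulative_regret(optimal_rewards, received_rewards):
    """Compute R(T) = sum_t (r_{t,a_t^*} - r_{t,a_t}), as in equation (1)."""
    optimal = np.asarray(optimal_rewards, dtype=float)
    received = np.asarray(received_rewards, dtype=float)
    return float(np.sum(optimal - received))
```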
In this paper, we focus on the sparse linear contextual bandit problem. Specifically, each arm $i$ at round $t$ is associated with a feature (context) vector $\mu_i^t \in \mathbb{R}^d$. The reward of that arm is assumed to be generated by a noisy linear model, i.e., the inner product of the arm feature and an unknown $S$-sparse parameter $\theta^*$, where $S$ denotes the set of effective (non-zero) entries and $|S| = k$. That is,
\[
r_i^t = \langle \mu_i^t, \theta^* \rangle + \eta^t, \qquad \|\theta^*\|_0 = k, \tag{2}
\]
where $\eta^t$ follows the Gaussian distribution $N(0, \sigma^2)$. To handle the non-convex $\ell_0$ norm, Lasso regression is the natural way to learn the sparse $\theta^*$, via the relaxation from the $\ell_0$ to the $\ell_1$ norm. To achieve the desired performance, the algorithm has to rely on well-designed contexts which satisfy sampling-efficiency requirements such as the null space condition (Cohen et al., 2009), the restricted isometry property (RIP) (Donoho, 2006), the restricted eigenvalue (RE) condition (Bickel et al., 2009), or the compatibility condition (Van De Geer et al., 2009). However, the data from bandit problems usually does not satisfy these conditions, since the contexts could be generated adversarially. Up to now, deciding on the proper assumptions for sparse bandit problems is still a challenge (Lattimore and Szepesvári, 2018).

Inspired by the smoothed analysis of the greedy algorithm for the linear bandit problem (Kannan et al., 2018), we consider the perturbed adversary defined below for the sparse linear contextual bandit problem.
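A minimal sketch of the reward model in equation (2): a $k$-sparse parameter plus Gaussian reward noise. All names and default sizes here are our illustrative choices, not from the paper:

```python
import numpy as np

def make_sparse_bandit(d=50, k=5, noise_sd=0.1, seed=None):
    """Sample a k-sparse parameter theta* and return it with a noisy reward oracle
    r = <mu, theta*> + eta, eta ~ N(0, noise_sd^2), as in equation (2)."""
    rng = np.random.default_rng(seed)
    theta_star = np.zeros(d)
    support = rng.choice(d, size=k, replace=False)   # the set S of effective entries
    theta_star[support] = rng.normal(size=k)

    def reward(context):
        return float(np.dot(context, theta_star)) + float(rng.normal(scale=noise_sd))

    return theta_star, reward
```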
Definition 1
Perturbed Adversary (Kannan et al., 2018). The perturbed adversary acts as follows at round $t$.

• Given the current contexts $\mu_1^t, \cdots, \mu_m^t$, which could be chosen adversarially, the perturbations $e_1^t, \cdots, e_m^t$ are drawn independently from a certain distribution. Each $e_i^t$ is produced independently (non-adaptively) of the context.

• The perturbed adversary outputs the contexts $(x_1^t, \cdots, x_m^t) = (\mu_1^t + e_1^t, \cdots, \mu_m^t + e_m^t)$ as the arm features to the learner.

Let $X \in \mathbb{R}^{d \times t}$ be the context matrix, where each column contains one context vector, and let $Y$ be the column vector that contains the corresponding rewards. Based on the perturbed adversary setting, we analyze the online Lasso in Algorithm 1 for the sparse linear contextual bandit. Generally speaking, our analysis considers two cases using different techniques, one for the low dimensional case when $d < T$, and the other for the high dimensional case when $d \gg T$. For the low dimensional case, the analysis utilizes random matrix theory (Tropp, 2012) to prove that, with high probability, the minimum eigenvalue of the scaled sample covariance matrix increases linearly with the round $t$; for the high dimensional case, the RE condition is guaranteed with the help of the Gaussian perturbation's property (Raskutti et al., 2010) that the null space of the context matrix under Gaussian perturbation cannot contain any vectors that are "overly" sparse when $t$ is larger than some threshold. The properties in both cases support an $O(\sqrt{k\log d / t})$ parameter recovery guarantee for Lasso regression under a noisy environment, which leads to the $O(\sqrt{kT\log d})$ regret.

Algorithm 1:
Online Lasso for Sparse Linear Contextual Bandit under a Perturbed Adversary

Initialize $\theta^1$, $X$ and $Y$.
for $t = 1, 2, \cdots, T$ do
    The perturbed adversary produces $m$ contexts $[x_1^t, \ldots, x_m^t]$.
    The learner greedily chooses the arm $i = \arg\max_{j \in [m]} \langle x_j^t, \theta^t \rangle$, observes the reward $r_i^t$, appends the new observation $(x_i^t, r_i^t)$ to $(X, Y)$, and updates $\theta^{t+1}$ by the Lasso regression:
    \[
    \theta^{t+1} = \arg\min_{\theta} G(\theta; \lambda_t) := \|Y - X^\top\theta\|_2^2 + \lambda_t \|\theta\|_1. \tag{3}
    \]
end

2.1 Low Dimensional Case

We first consider the low dimensional case when $d < T$. Under the perturbed adversary setting, we define the property named perturbed diversity, which is adopted from Bastani et al. (2017).
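Algorithm 1 can be sketched in a few lines. The paper does not fix a particular Lasso solver for the update (3); the sketch below uses a plain proximal-gradient (ISTA) loop, and every function name, step size and iteration count is our own illustrative choice:

```python
import numpy as np

def lasso_ista(A, y, lam, n_iter=300):
    """Minimize ||y - A @ theta||^2 + lam * ||theta||_1 by proximal gradient (ISTA)."""
    theta = np.zeros(A.shape[1])
    L = 2.0 * np.linalg.norm(A, 2) ** 2   # Lipschitz constant of the squared-loss gradient
    if L == 0.0:
        return theta
    step = 1.0 / L
    for _ in range(n_iter):
        z = theta - step * 2.0 * A.T @ (A @ theta - y)                 # gradient step
        theta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft-threshold
    return theta

def online_lasso_bandit(rounds, reward_fn, lam_schedule):
    """One greedy pass of Algorithm 1: pick the arg-max arm, observe, re-fit the Lasso."""
    d = rounds[0].shape[1]
    theta = np.zeros(d)
    X_rows, Y = [], []
    for t, arms in enumerate(rounds, start=1):
        i = int(np.argmax(arms @ theta))            # greedy arm choice
        X_rows.append(arms[i])
        Y.append(reward_fn(arms[i]))
        theta = lasso_ista(np.vstack(X_rows), np.array(Y), lam_schedule(t))
    return theta
```

In a well-conditioned, noiseless instance the inner solver recovers a sparse parameter accurately, which is all the greedy loop needs.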
Definition 2
Perturbed Diversity.
Let $e_i^t \sim D$ on $\mathbb{R}^d$. Given any context vector $\mu_i^t$, we say that $x_i^t = \mu_i^t + e_i^t$ satisfies perturbed diversity if the minimum eigenvalue of the sample covariance matrix under perturbation satisfies
\[
\lambda_{\min}\Big( \mathbb{E}_{e_i^t \sim D}\big[ x_i^t (x_i^t)^\top \big] \Big) \ge \lambda_0,
\]
where $\lambda_0$ is a positive constant.

Intuitively speaking, perturbed diversity guarantees that, in expectation, each context provides at least a certain amount of information about all coordinates of $\theta^*$, which helps to recover the support of the parameter via a regularized method. Several distributions $D$ make perturbed diversity hold, e.g., the Gaussian distribution. However, without any restriction, $x_i^t$ could be very large and lie outside the realistic domain. Instead, the value of each dimension (we denote by $x_i^t(j)$ the $j$-th dimension of $x_i^t$) should lie in a bounded interval, while the total energy of the context vector is bounded by a certain constant, i.e., $\|x_i^t\|_2 \le R$. This motivates us to consider perturbed diversity under a censored perturbed adversary.

Lemma 1
Given the context vector $\mu_i^t \in \mathbb{R}^d$ with $|\mu_i^t(j)| \le q_j$ for each $j \in [d]$, we define the censored perturbed context $x_i^t$ under $e_i^t \sim N(0, \sigma^2 I)$ as follows:
\[
x_i^t(j) = \begin{cases} \mu_i^t(j) + e_i^t(j), & \text{if } |\mu_i^t(j) + e_i^t(j)| \le q_j,\\ q_j, & \text{if } \mu_i^t(j) + e_i^t(j) > q_j,\\ -q_j, & \text{if } \mu_i^t(j) + e_i^t(j) < -q_j. \end{cases} \tag{4}
\]
Then $x_i^t$ has perturbed diversity with $\lambda_0 = g(2q/\sigma, 0)\,\sigma^2$, where $q = \min_j q_j$ and $g(\cdot, \cdot)$ is a composite function of the probability density function $\phi(\cdot)$ and the cumulative distribution function $\Phi(\cdot)$ of the normal distribution. Please refer to equation (14) for more details.
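Two small sketches related to this construction (all function names and parameters are ours, for illustration only): the censoring in equation (4) is a coordinate-wise clip, and for the uncensored isotropic perturbation $e \sim N(0, \sigma^2 I)$ one can verify perturbed diversity directly, since $\mathbb{E}[xx^\top] = \mu\mu^\top + \sigma^2 I$ has minimum eigenvalue exactly $\sigma^2$ whenever $d \ge 2$:

```python
import numpy as np

def censored_context(mu, sigma, q, seed=None):
    """Censored perturbed context of equation (4): add N(0, sigma^2) noise to each
    coordinate, then clip coordinate j to the interval [-q_j, q_j]."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    e = rng.normal(scale=sigma, size=mu.shape)
    q = np.broadcast_to(np.asarray(q, dtype=float), mu.shape)
    return np.clip(mu + e, -q, q)

def second_moment_min_eig(mu, sigma):
    """lambda_min(E[x x^T]) for the *uncensored* x = mu + e, e ~ N(0, sigma^2 I):
    E[x x^T] = mu mu^T + sigma^2 I, whose smallest eigenvalue is sigma^2 for d >= 2."""
    mu = np.asarray(mu, dtype=float)
    M = np.outer(mu, mu) + sigma ** 2 * np.eye(mu.shape[0])
    return float(np.linalg.eigvalsh(M)[0])
```

Censoring lowers this eigenvalue somewhat, which is exactly what the factor $g(2q/\sigma, 0)$ in Lemma 1 accounts for.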
The proof is provided in the appendix, and one can easily extend it to the case where $e_i^t \sim N(0, \Sigma)$. Based on Lemma 1, we can derive that, with high probability, $\lambda_{\min}(XX^\top)$ grows at least at a linear rate in $t$.

Lemma 2
With the censored perturbed diversity, when $t > \frac{2R^2}{g(2q/\sigma, 0)\,\sigma^2}\log(dT)$, the following is satisfied with probability $1 - 1/T$:
\[
\lambda_{\min}(XX^\top) \ge g(2q/\sigma, 0)(1-\tau)\,\sigma^2 t, \qquad \text{where } \tau = \sqrt{\frac{2R^2\log(dT)}{g(2q/\sigma, 0)\,\sigma^2\, t}}.
\]

As one can see from Lemma 2, after a certain number of (implicit) exploration rounds, i.e., $\frac{2R^2}{g(2q/\sigma, 0)\sigma^2}\log(dT)$, we have enough information to support the $O(\sqrt{k\log d / t})$ parameter recovery by Lasso regression. The regret analysis, together with the high dimensional case, is deferred to the next section.

2.2 High Dimensional Case

We now turn to the high dimensional case when $d \gg T$. During the learning process, the scaled sample covariance matrix $XX^\top$ is always rank deficient, which means $\lambda_{\min}(XX^\top) = 0$, and Lemma 2, based on random matrix theory, can no longer be applied. We therefore consider the restricted eigenvalue (RE) condition instead. Here "restricted" means that the error $\Delta^t := \theta^t - \theta^*$ incurred by Lasso regression is restricted to a set with special structure. That is, $\Delta^t \in C(S; \alpha)$, where
\[
C(S; \alpha) := \{\theta \in \mathbb{R}^d \mid \|\theta_{S^c}\|_1 \le \alpha\|\theta_S\|_1\},
\]
and $\alpha$ is determined by the choice of $\lambda_t$. In the following, we focus on $C(S; 3)$, which can be achieved by setting $\lambda_t = \Theta(2\sigma R\sqrt{t\log(2d)})$.

The key is to prove that the null space of $X^\top$ has no overlap with $C(S; 3)$. It has been proved that special cases, in which contexts are purely sampled from special distributions such as Gaussian and Bernoulli distributions, satisfy this property (Zhou, 2009; Raskutti et al., 2010; Haupt et al., 2010). We take a further step and show that the null space of the context matrix under Gaussian perturbation cannot contain any vectors that are "overly" sparse when $t$ is larger than some threshold.

Theorem 1
Consider perturbations $e_i^t \sim N(0, \Sigma)$, where $\|\Sigma^{1/2}\Delta\|_2 \ge \gamma\|\Delta\|_2$ for all $\Delta \in C(S; 3)$. If
\[
t > \max\Big(\underbrace{\frac{4c''\,q(\Sigma)}{\gamma^2}\,k\log d}_{(d)},\ \underbrace{\frac{2aR^2\lambda_{\max}(\Sigma)\log T}{\gamma^4}}_{(e)}\Big),
\]
then with probability $1 - (c'e^{-ct} + T^{-a})$, we have
\[
\Delta^\top XX^\top\Delta \ge h\,t\,\|\Delta\|_2^2,
\]
where $c, c', c''$ are universal constants, $q(\Sigma) = \max_i \Sigma_{ii}$, and $h = \gamma^2 - R\sqrt{2a\lambda_{\max}(\Sigma)\log T / t}$.

Moreover, we can design $\gamma^2 = \lambda_{\min}(\Sigma)$. By the Rayleigh quotient, one can obtain $\lambda_{\max}(\Sigma) \ge q(\Sigma) = \max_i \Sigma_{ii} \ge \min_i \Sigma_{ii} \ge \lambda_{\min}(\Sigma) = \gamma^2$.

We then discuss how perturbations affect the exploration length. First, a larger perturbation does not imply less regret.
Results of Sivakumar et al. (2020) show that the regret is $O(\log T\,\sqrt{T}/\sigma)$, where $\sigma^2$ is the perturbation's variance, which suggests that choosing a larger $\sigma$ leads to a smaller regret bound. However, this is not the full picture of the effect of the perturbation's variance. Our results show that increasing the variance of the perturbation has a limited effect on the necessary exploration and regret, which reveals a theoretical limit of the perturbation method. Specifically, in term (d) of Theorem 1, no matter how large the variance is, the ratio $q(\Sigma)/\gamma^2 \ge 1$. So $4c'' k\log d$ is the necessary exploration length and cannot be improved. Second, the condition number and the SPR (the signal-to-perturbation ratio) are important factors. The condition number $\mathrm{Cond}(\Sigma)$ controls both terms (d) and (e), e.g., $q(\Sigma)/\gamma^2 \le \lambda_{\max}(\Sigma)/\lambda_{\min}(\Sigma) = \mathrm{Cond}(\Sigma)$. This also shows that the optimal perturbation design chooses $\Sigma = \sigma^2 I$. In term (e) of Theorem 1, $R^2/\gamma^2$ can be regarded as the ratio between the energy of the unperturbed context and the perturbation energy. This ratio exhibits the trade-off between exploration and fidelity. That is, a large variance not only reduces the exploration length (meanwhile, the exploration length is still bounded below, by term (d)) but also reduces the fidelity of the original context.
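The quantities in this discussion are easy to compute for a candidate perturbation covariance. The helper below (all names are ours) reports $q(\Sigma)$, $\gamma^2 = \lambda_{\min}(\Sigma)$, the condition number, and the signal-to-perturbation ratio $R^2/\gamma^2$; the isotropic design $\Sigma = \sigma^2 I$ achieves the optimal values $\mathrm{Cond}(\Sigma) = q(\Sigma)/\gamma^2 = 1$:

```python
import numpy as np

def perturbation_diagnostics(Sigma, R):
    """Quantities from the discussion of Theorem 1: q(Sigma) = max_i Sigma_ii,
    gamma^2 = lambda_min(Sigma), the condition number, and R^2 / gamma^2."""
    Sigma = np.asarray(Sigma, dtype=float)
    evals = np.linalg.eigvalsh(Sigma)
    gamma2 = float(evals[0])
    q = float(np.max(np.diag(Sigma)))
    return {
        "q": q,
        "gamma2": gamma2,
        "cond": float(evals[-1] / evals[0]),
        "q_over_gamma2": q / gamma2,   # factor in term (d); always >= 1
        "spr": R ** 2 / gamma2,        # factor in term (e): context vs. perturbation energy
    }
```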
3. Regret Analysis
Based on the properties we have proved for the low and high dimensional cases, we can obtain the following recovery guarantee by techniques from standard Lasso regression (Hastie et al., 2015).
Lemma 3. If $t > T_e$ and $\lambda_t = 2\sigma R\sqrt{t\log\frac{2d}{\delta}}$, the Lasso regression under the perturbed adversary has the recovery guarantee
\[
\|\theta^t - \theta^*\|_2 \le \frac{6\sigma R}{C}\sqrt{\frac{k\log(2d/\delta)}{t}}
\]
with probability $1 - \delta$, where $T_e = \frac{2R^2}{g(2q/\sigma, 0)\sigma^2}\log(dT)$ and $C = g(2q/\sigma, 0)(1-\tau)\sigma^2$ for the low dimensional case, and $T_e = \max\big(\frac{4c''\,q(\Sigma)}{\gamma^2}k\log d,\ \frac{2aR^2\lambda_{\max}(\Sigma)\log T}{\gamma^4}\big)$ and $C = \gamma^2 - R\sqrt{2a\lambda_{\max}(\Sigma)\log T/t}$ for the high dimensional case.

We then obtain the final result in Theorem 2 based on all the analysis above.
Theorem 2
The online Lasso for the sparse linear contextual bandit under the perturbed adversary admits the following regret bound with probability $1 - \delta$:
\[
\mathrm{Regret} \le 2R\Big(T_e + \frac{6\sigma R}{C}\sqrt{kT\log\frac{2d}{\delta}}\Big) = O\big(\sqrt{kT\log d}\big). \tag{5}
\]
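The linear growth of $\lambda_{\min}(XX^\top)$ that drives the exploration length $T_e$ (Lemma 2) can be observed empirically. The small simulation below accumulates censored-perturbed contexts and tracks the minimum eigenvalue of the Gram matrix; all parameter choices here are our illustrative assumptions, not values from the paper:

```python
import numpy as np

def min_eig_growth(d=5, sigma=1.0, q=3.0, T=400, seed=0):
    """Track lambda_min(X X^T) as censored-perturbed contexts accumulate over rounds."""
    rng = np.random.default_rng(seed)
    M = np.zeros((d, d))
    mins = []
    for _ in range(T):
        mu = rng.uniform(-1.0, 1.0, size=d)                      # adversary stand-in
        x = np.clip(mu + rng.normal(scale=sigma, size=d), -q, q)  # censored perturbation
        M += np.outer(x, x)                                       # Gram matrix X X^T
        mins.append(float(np.linalg.eigvalsh(M)[0]))
    return np.array(mins)
```

Plotting `min_eig_growth()` against the round index shows a roughly linear trend after a short initial phase, consistent with Lemma 2. Note that the sequence is nondecreasing, since each rank-one update is positive semidefinite.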
4. Conclusion
This paper utilizes the "perturbed adversary", where the context is generated adversarially but with small random non-adaptive perturbations, to tackle the sparse linear contextual bandit problem. We prove that the simple online Lasso supports the sparse linear contextual bandit with regret bound $O(\sqrt{kT\log d})$ for both the low and high dimensional cases, and show how the perturbation affects the exploration length and the trade-off between exploration and fidelity. Future work will focus on extending our analysis to more challenging settings, e.g., defending against adversarial attacks on contextual bandit models.
References
Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics, pages 1–9, 2012.

Marc Abeille, Alessandro Lazaric, et al. Linear Thompson sampling revisited. Electronic Journal of Statistics, 11(2):5165–5197, 2017.

Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135, 2013.

Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.

Hamsa Bastani and Mohsen Bayati. Online decision making with high-dimensional covariates. Operations Research, 68(1):276–294, 2020.

Hamsa Bastani, Mohsen Bayati, and Khashayar Khosravi. Mostly exploration-free algorithms for contextual bandits. arXiv preprint arXiv:1704.09011, 2017.

Peter J Bickel, Yaacov Ritov, Alexandre B Tsybakov, et al. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.

Alexandra Carpentier and Rémi Munos. Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. In Artificial Intelligence and Statistics, pages 190–198, 2012.

Albert Cohen, Wolfgang Dahmen, and Ronald DeVore. Compressed sensing and best k-term approximation. Journal of the American Mathematical Society, 22(1):211–231, 2009.

Manoranjan Dash and Huan Liu. Feature selection for classification. Intelligent Data Analysis, 1(3):131–156, 1997.

Kaize Ding, Jundong Li, and Huan Liu. Interactive anomaly detection on attributed networks. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 357–365, 2019.

David L Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.

Audrey Durand, Charis Achilleos, Demetris Iacovides, Katerina Strati, Georgios D Mitsis, and Joelle Pineau. Contextual bandits for adapting treatment in a mouse model of de novo carcinogenesis. In Machine Learning for Healthcare Conference, pages 67–82, 2018.

Davis Gilton and Rebecca Willett. Sparse linear contextual bandits via relevance vector machines. In International Conference on Sampling Theory and Applications (SampTA), pages 518–522. IEEE, 2017.

Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC, 2015.

Jarvis Haupt, Waheed U Bajwa, Gil Raz, and Robert Nowak. Toeplitz compressed sensing matrices with applications to sparse channel estimation. IEEE Transactions on Information Theory, 56(11):5862–5875, 2010.

Jinzhu Jia, Karl Rohe, et al. Preconditioning the Lasso for sign consistency. Electronic Journal of Statistics, 9(1):1150–1172, 2015.

Sampath Kannan, Jamie H Morgenstern, Aaron Roth, Bo Waggoner, and Zhiwei Steven Wu. A smoothed analysis of the greedy algorithm for the linear contextual bandit problem. In Advances in Neural Information Processing Systems, pages 2227–2236, 2018.

Gi-Soo Kim and Myunghee Cho Paik. Doubly-robust Lasso bandit. In Advances in Neural Information Processing Systems, pages 5869–5879, 2019.

Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Preprint, 2018.

Tor Lattimore, Koby Crammer, and Csaba Szepesvári. Linear multi-resource allocation with semi-bandit feedback. In Advances in Neural Information Processing Systems, pages 964–972, 2015.

Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM, 2010.

Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11(Aug):2241–2259, 2010.

Vidyashankar Sivakumar, Zhiwei Steven Wu, and Arindam Banerjee. Structured linear contextual bandits: A sharp and geometric smoothed analysis. arXiv preprint arXiv:2002.11332, 2020.

Daniel A Spielman and Shang-Hua Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM (JACM), 51(3):385–463, 2004.

Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.

Sohini Upadhyay, Mayank Agarwal, Djallel Bounneffouf, and Yasaman Khazaeni. A bandit approach to posterior dialog orchestration under a budget. arXiv preprint arXiv:1906.09384, 2019.

Sara A Van De Geer, Peter Bühlmann, et al. On the conditions used to prove oracle results for the Lasso. Electronic Journal of Statistics, 3:1360–1392, 2009.

Shuheng Zhou. Restricted eigenvalue conditions on subgaussian random matrices. arXiv preprint arXiv:0912.4045, 2009.
Appendix
Lemma 4 (A variant of matrix Chernoff, Tropp (2012)). Consider a finite sequence $z_t$ of independent, random, self-adjoint matrices that satisfy $z_t \succeq 0$ and $\lambda_{\max}(z_t) \le Q$ almost surely. Compute the minimum eigenvalue of the sum of expectations, $\psi_{\min} := \lambda_{\min}\big(\sum_t \mathbb{E}(z_t)\big)$. Then for $\delta \in [0, 1)$, we have
\[
\mathbb{P}\Big\{\lambda_{\min}\Big(\sum_t z_t\Big) \le (1-\delta)\psi_{\min}\Big\} \le d\left[\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\right]^{\psi_{\min}/Q}. \tag{6}
\]
Moreover, for any $\psi \le \psi_{\min}$, we can get
\[
\mathbb{P}\Big\{\lambda_{\min}\Big(\sum_t z_t\Big) \le (1-\delta)\psi\Big\} \le d\left[\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\right]^{\psi/Q}. \tag{7}
\]

Proof
Since $\psi \le \psi_{\min}$, there exists $\delta_1 \in [0, 1]$ such that $\psi = \delta_1\psi_{\min}$. We have
\[
(1-\delta)\psi = (1-\delta)\delta_1\psi_{\min} = \Big(1 - \underbrace{(1 - \delta_1 + \delta\delta_1)}_{\delta_2}\Big)\psi_{\min}.
\]
Plugging this into (6) leads to
\[
\mathbb{P}\Big\{\lambda_{\min}\Big(\sum_t z_t\Big) \le (1-\delta)\psi\Big\} \le d\left[\frac{e^{-\delta_2}}{(1-\delta_2)^{1-\delta_2}}\right]^{\psi_{\min}/Q}.
\]
One can easily verify that $\delta_2 \ge \delta$. So
\[
\left[\frac{e^{-\delta_2}}{(1-\delta_2)^{1-\delta_2}}\right]^{\psi_{\min}/Q} \le \left[\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\right]^{\psi_{\min}/Q} \le \left[\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\right]^{\psi/Q}.
\]
Then we obtain
\[
\mathbb{P}\Big\{\lambda_{\min}\Big(\sum_t z_t\Big) \le (1-\delta)\psi\Big\} \le d\left[\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\right]^{\psi/Q}. \tag{8}
\]
Since $\frac{e^{-\delta}}{(1-\delta)^{1-\delta}} \le e^{-\delta^2/2}$, we have the following for $\delta \in [0, 1)$:
\[
\mathbb{P}\Big\{\lambda_{\min}\Big(\sum_t z_t\Big) \le (1-\delta)\psi\Big\} \le d\big[e^{-\delta^2/2}\big]^{\psi/Q}. \tag{9}
\]

Fact 1
Let $\eta = [\eta^1, \cdots, \eta^t]^\top$, where each $\eta^i$ is drawn i.i.d. from $N(0, \sigma^2)$. Let $X \in \mathbb{R}^{d\times t}$, where each $|X_{ij}| \le R$. Then with probability $1-\delta$, we have
\[
\|X\eta\|_\infty \le \sigma R\sqrt{t\log\frac{2d}{\delta}}.
\]

Fact 2 (Chernoff bound for a sum of sub-Gaussian random variables). Let $X_1, \cdots, X_n$ be $n$ independent random variables such that $X_i \sim \mathrm{subG}(\sigma^2)$. Then for any $a \in \mathbb{R}^n$ and $c > 0$, we have
\[
\Pr\Big(\sum_{i=1}^n a_i X_i < -c\Big) \le \exp\Big(-\frac{c^2}{2\sigma^2\|a\|_2^2}\Big). \tag{10}
\]
That is, with probability at least $1-\delta$, we have
\[
\sum_{i=1}^n a_i X_i > -\sqrt{2\sigma^2\|a\|_2^2\log\frac{1}{\delta}}. \tag{11}
\]

Lemma 5 (Restricted eigenvalue property (Corollary 1 of Raskutti et al. (2010))). Suppose that $\Sigma$ satisfies the RE condition of order $k$ with parameters $(1, \gamma)$, and denote $q(\Sigma) = \max_i \Sigma_{ii}$. Then for universal positive constants $c, c', c''$, if the sample size satisfies
\[
t > c''\,\frac{q(\Sigma)}{\gamma^2}\,k\log d, \tag{12}
\]
then the matrix $\Phi\Phi^\top/t$ satisfies the RE condition with parameters $(1, \gamma)$ with probability at least $1 - c'e^{-ct}$, where $\Phi \in \mathbb{R}^{d\times t}$ and each column is i.i.d. $N(0, \Sigma)$.

Proof of Lemma 1

Proof
Since the $e_i^t(j)$ are independent of each other, we can analyze them coordinate by coordinate. To simplify the analysis, we slightly abuse notation and drop the subscript $i$ and superscript $t$ (only within this proof); that is, $x(j) := x_i^t(j)$ and $e(j) := e_i^t(j)$. Then
\[
\begin{aligned}
\lambda_{\min}\big(\mathbb{E}[xx^\top]\big) &= \min_{\|w\|_2=1} w^\top\mathbb{E}[xx^\top]w = \min_{\|w\|_2=1}\mathbb{E}(w^\top xx^\top w) = \min_{\|w\|_2=1}\mathbb{E}\big(\langle w, x\rangle^2\big)\\
&\ge \min_{\|w\|_2=1}\mathrm{Var}(\langle w, x\rangle) \ge \min_{\|w\|_2=1}\mathrm{Var}(\langle w, e\rangle)\\
&= \min_{\|w\|_2=1}\sum_{i=1}^d (w(i))^2\,\mathrm{Var}\big(e(i) \mid \text{censored in } [-q_i, q_i]\big)\\
&\ge \min_{\|w\|_2=1} g(2q/\sigma, 0)\,\sigma^2\sum_{i=1}^d (w(i))^2 = g(2q/\sigma, 0)\,\sigma^2,
\end{aligned}
\]
where the bound $g(2q/\sigma, 0)$ is according to Lemma 6.
Lemma 6
Let $e \sim N(0, \sigma^2)$. For any interval $[a, b]$ which contains $0$ and has fixed length $2q$, i.e., $b - a = 2q$, with $q \ge \sigma$, we have the following result:
\[
\mathrm{Var}(e \mid \text{censored in } [a, b]) \ge g(2q/\sigma, 0)\,\sigma^2. \tag{13}
\]

Proof
We first derive the variance of the two-sided censored Gaussian distribution. Denote $\alpha = a/\sigma$ and $\beta = b/\sigma$. For the truncated Gaussian distribution, we have
\[
\mathbb{E}(e \mid e \in [a, b]) = \sigma\,\frac{\phi(\alpha) - \phi(\beta)}{\Phi(\beta) - \Phi(\alpha)} = \sigma\rho,
\]
\[
\mathrm{Var}(e \mid e \in [a, b]) = \sigma^2\Big(1 + \underbrace{\frac{\alpha\phi(\alpha) - \beta\phi(\beta)}{\Phi(\beta) - \Phi(\alpha)} - \rho^2}_{\Lambda}\Big).
\]
We then calculate the variance of the two-sided censored Gaussian distribution by the law of total variance,
\[
\mathrm{Var}(e \mid \text{censored in } [a, b]) = \mathbb{E}_y[\mathrm{Var}(e \mid y)] + \mathrm{Var}_y[\mathbb{E}(e \mid y)],
\]
where $y$ denotes the indicator of the event $e \in [a, b]$. After some basic calculations, we get the following result:
\[
\begin{aligned}
\mathrm{Var}(e \mid \text{censored in } [a, b]) = {} & \sigma^2(\Phi(\beta) - \Phi(\alpha))(1 + \Lambda)\\
& + \sigma^2\big[(\rho - \beta)^2(\Phi(\beta) - \Phi(\alpha))(1 - \Phi(\beta) + \Phi(\alpha))\\
& \quad + 2(\beta - \alpha)(\rho - \beta)(\Phi(\beta) - \Phi(\alpha))\Phi(\alpha)\\
& \quad + (\beta - \alpha)^2(1 - \Phi(\alpha))\Phi(\alpha)\big]\\
= {} & g(\beta, \alpha)\,\sigma^2. \qquad (14)
\end{aligned}
\]
One can show that (1) $\mathrm{Var}(e \mid \text{censored in } [a, b])$ achieves its minimum when $a = 0$ or $b = 0$, by the first-order optimality condition; and (2) $\mathrm{Var}(e \mid \text{censored in } [0, b])$ is an increasing function of $b$. Based on (1) and (2), we obtain
\[
\mathrm{Var}(e \mid \text{censored in } [a, b]) \ge \mathrm{Var}(e \mid \text{censored in } [0, 2q]) = g(2q/\sigma, 0)\,\sigma^2.
\]

Proof of Lemma 2

Proof
At round $t$, we have
\[
\lambda_{\min}\big(\mathbb{E}(XX^\top)\big) = \lambda_{\min}\Big(\mathbb{E}\Big(\sum_{i=1}^t x_{a_i}^i (x_{a_i}^i)^\top\Big)\Big) = \lambda_{\min}\Big(\sum_{i=1}^t \mathbb{E}\big(x_{a_i}^i (x_{a_i}^i)^\top\big)\Big) \ge \sum_{i=1}^t \lambda_{\min}\big(\mathbb{E}\big(x_{a_i}^i (x_{a_i}^i)^\top\big)\big),
\]
where the second equality is due to the independence of each round's perturbation, and the inequality comes from the fact that the minimum eigenvalue is a super-additive operator. For the censored Gaussian perturbation, $\lambda_{\min}\big(\mathbb{E}(x_{a_i}^i (x_{a_i}^i)^\top)\big) \ge g(2q/\sigma, 0)\,\sigma^2$ based on Lemma 1. So $\lambda_{\min}\big(\mathbb{E}(XX^\top)\big) \ge g(2q/\sigma, 0)\,\sigma^2 t$. Based on (9) of Lemma 4 and $\lambda_{\max}\big(x_{a_i}^i (x_{a_i}^i)^\top\big) \le \|x_{a_i}^i\|_2^2 \le R^2$, one can obtain
\[
\mathbb{P}\Big\{\lambda_{\min}(XX^\top) \le g(2q/\sigma, 0)(1-\tau)\,\sigma^2 t\Big\} \le d\big[e^{-\tau^2/2}\big]^{g(2q/\sigma, 0)\,\sigma^2 t/R^2}.
\]
Setting the right-hand side equal to $1/T$ yields the final result.

Proof of Theorem 1

Proof
To simplify the analysis, we slightly abuse notation and denote the unperturbed context matrix by $\mu$, where each column $\mu_i$ is one context vector. Similarly, denote by $e$ the perturbation matrix with columns $e_i$. We first decompose $\Delta^\top XX^\top\Delta$ as follows:
\[
\Delta^\top XX^\top\Delta = \underbrace{\Delta^\top\mu\mu^\top\Delta}_{(a)} + \underbrace{2\Delta^\top e\mu^\top\Delta}_{(b)} + \underbrace{\Delta^\top ee^\top\Delta}_{(c)}. \tag{15}
\]
For term (a) in equation (15), one can only show $(a) \ge 0$, since $\Delta$ could lie in $\mathrm{Null}(\mu^\top)$. For terms (b) and (c), we establish high-probability lower bounds respectively.

Now consider a positive definite matrix $\Sigma$, which we can design such that it satisfies the RE condition, that is, $\|\Sigma^{1/2}\Delta\|_2 \ge \gamma\|\Delta\|_2$. Based on Lemma 5, we can derive the following for term (c): for universal positive constants $c, c', c''$, if the sample size satisfies
\[
t > c''\,\frac{q(\Sigma)}{\gamma^2}\,k\log d, \tag{16}
\]
where $q(\Sigma) = \max_i \Sigma_{ii}$, then with probability at least $1 - c'e^{-ct}$,
\[
\Delta^\top ee^\top\Delta \ge \gamma^2 t\,\|\Delta\|_2^2. \tag{17}
\]
We then derive a high-probability bound for (b). First, we decompose (b) into a weighted sum of i.i.d. Gaussian variables. That is,
\[
\Delta^\top e\mu^\top\Delta = \sum_{i=1}^t (\mu_i^\top\Delta)(\Delta^\top e_i), \tag{18}
\]
where $\mu_i^\top\Delta$ is the weight and each $\Delta^\top e_i \sim N(0, \Delta^\top\Sigma\Delta)$. Based on the Chernoff bound for a weighted sum of sub-Gaussian random variables in Fact 2 (with $\delta = t^{-a}$), we have
\[
\begin{aligned}
\sum_{i=1}^t(\mu_i^\top\Delta)(\Delta^\top e_i) &\ge -\sqrt{2a\,\Delta^\top\Sigma\Delta\,\sum_{i=1}^t(\mu_i^\top\Delta)^2\,\log t} \qquad (19)\\
&\ge -\sqrt{2a\,\lambda_{\max}(\Sigma)\,\|\Delta\|_2^2\,\sum_{i=1}^t R^2\|\Delta\|_2^2\,\log t} \qquad (20)\\
&= -Rt\,\|\Delta\|_2^2\,\sqrt{\frac{2a\,\lambda_{\max}(\Sigma)\log t}{t}} \qquad (21)
\end{aligned}
\]
with probability at least $1 - t^{-a}$. We can conclude that, with probability at least $1 - (c'e^{-ct} + t^{-a})$, both inequalities (17) and (21) hold. If the round $t$ satisfies
\[
t > \max\Big(\underbrace{\frac{4c''\,q(\Sigma)}{\gamma^2}\,k\log d}_{(d)},\ \underbrace{\frac{2aR^2\lambda_{\max}(\Sigma)\log t}{\gamma^4}}_{(e)}\Big), \tag{22}
\]
we have $(b) + (c) \ge h\,t\,\|\Delta\|_2^2$, where $h = \gamma^2 - R\sqrt{2a\lambda_{\max}(\Sigma)\log t/t}$.

Proof of Lemma 3

Proof
Our proof combines techniques from smoothed analysis and Lasso regression. Since $\theta^t$ minimizes $G(\theta)$, we have $G(\theta^t) \le G(\theta^*)$. This yields the inequality
\[
\|X^\top\Delta^t\|_2^2 \le 2(\Delta^t)^\top X\eta + \lambda_t\big(\|\theta^*\|_1 - \|\theta^* + \Delta^t\|_1\big),
\]
where $\eta$ denotes the noise vector. Note that $\|\theta^*\|_1 = \|\theta^*_S\|_1$. Furthermore, one can verify that $\|\theta^*\|_1 - \|\theta^* + \Delta^t\|_1 \le \|\Delta^t_S\|_1 - \|\Delta^t_{S^c}\|_1$. For $(\Delta^t)^\top X\eta$, applying Hölder's inequality yields
\[
(\Delta^t)^\top X\eta \le \|\Delta^t\|_1\|X\eta\|_\infty \le \sigma R\sqrt{t\log\frac{2d}{\delta}}\,\|\Delta^t\|_1 = \frac{\lambda_t}{2}\|\Delta^t\|_1,
\]
where the second inequality is due to Fact 1. Combining all of the above, we obtain
\[
\begin{aligned}
\|X^\top\Delta^t\|_2^2 &\le \lambda_t\|\Delta^t\|_1 + \lambda_t\big(\|\Delta^t_S\|_1 - \|\Delta^t_{S^c}\|_1\big) \qquad (23)\\
&\le 3\lambda_t\|\Delta^t_S\|_1 \le 3\lambda_t\sqrt{k}\,\|\Delta^t\|_2. \qquad (24)
\end{aligned}
\]
First, from inequality (23) we can obtain $\Delta^t \in C(S; 3)$. For the low dimensional case, we have $\|X^\top\Delta^t\|_2^2 \ge \lambda_{\min}(XX^\top)\|\Delta^t\|_2^2 \ge Ct\|\Delta^t\|_2^2$ by Lemma 2, where $C = g(2q/\sigma, 0)(1-\tau)\sigma^2$. For the high dimensional case, we apply Theorem 1, since $\Delta^t \in C(S; 3)$, and get $\|X^\top\Delta^t\|_2^2 \ge Ct\|\Delta^t\|_2^2$, where $C = \gamma^2 - R\sqrt{2a\lambda_{\max}(\Sigma)\log T/t}$. Combining these with inequality (24), we get the final result
\[
\|\Delta^t\|_2 \le \frac{6\sigma R}{C}\sqrt{\frac{k\log(2d/\delta)}{t}}.
\]

Proof of Theorem 2

Proof
As for the regret in round $t$, we have
\[
\begin{aligned}
\langle x_{i_t^*}^t, \theta^*\rangle - \langle x_{i_t}^t, \theta^*\rangle &= \langle x_{i_t^*}^t, \theta^* - \theta^t\rangle - \langle x_{i_t}^t, \theta^* - \theta^t\rangle + \langle x_{i_t^*}^t, \theta^t\rangle - \langle x_{i_t}^t, \theta^t\rangle\\
&\le \langle x_{i_t^*}^t, \theta^* - \theta^t\rangle - \langle x_{i_t}^t, \theta^* - \theta^t\rangle\\
&\le \big|\langle x_{i_t^*}^t, \theta^* - \theta^t\rangle\big| + \big|\langle x_{i_t}^t, \theta^* - \theta^t\rangle\big| \le 2R\,\|\theta^* - \theta^t\|_2,
\end{aligned}
\]
where the first inequality comes from the greedy choice, since $i_t = \arg\max_i \langle x_i^t, \theta^t\rangle$, and the last inequality is due to the censored perturbations, which guarantee $\|x_i^t\|_2 \le R$. Based on the analysis of the low and high dimensional cases, we denote the exploration length by $T_e$. During the exploration phase, we can bound the regret by $2RT_e$. So we can derive
\[
\begin{aligned}
\mathrm{Regret} &= \sum_{t=1}^{T_e}\big(\langle x_{i_t^*}^t, \theta^*\rangle - \langle x_{i_t}^t, \theta^*\rangle\big) + \sum_{t=T_e+1}^{T}\big(\langle x_{i_t^*}^t, \theta^*\rangle - \langle x_{i_t}^t, \theta^*\rangle\big)\\
&\le 2RT_e + 2R\sum_{t=T_e+1}^{T}\|\theta^* - \theta^t\|_2 \le 2RT_e + 2R\sum_{t=T_e+1}^{T}\frac{6\sigma R}{C}\sqrt{\frac{k\log(2d/\delta)}{t}}\\
&\le 2R\Big(T_e + \frac{6\sigma R}{C}\sqrt{kT\log\frac{2d}{\delta}}\Big).
\end{aligned}
\]

Numeric Simulations
This section shows the results of numeric simulations. We choose the context dimension $d = 2000$ with effective dimension $k = 20$, and 5 arms at each round. Our sparse bandit learning process only contains 150 rounds, with each context vector randomly generated from the uniform distribution on $[0, 1]$.

[Figure: regret over rounds for the online Lasso with and without preconditioning.]
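A scaled-down sketch of this experiment, using smaller $d$ and $T$ so it runs quickly, ISTA as the Lasso solver, and our own illustrative constants throughout (the paper's setting is $d = 2000$, $k = 20$, 5 arms, 150 rounds):

```python
import numpy as np

def run_simulation(d=100, k=5, m=5, T=60, sigma=0.1, noise_sd=0.05, seed=0):
    """Greedy online Lasso on uniform contexts with Gaussian perturbations;
    a scaled-down stand-in for the paper's experiment."""
    rng = np.random.default_rng(seed)
    theta_star = np.zeros(d)
    theta_star[rng.choice(d, size=k, replace=False)] = 1.0
    X_rows, Y = [], []
    theta = np.zeros(d)
    regret = 0.0
    for t in range(1, T + 1):
        # adversary stand-in: uniform base contexts plus i.i.d. Gaussian perturbations
        arms = rng.uniform(0.0, 1.0, size=(m, d)) + rng.normal(scale=sigma, size=(m, d))
        i = int(np.argmax(arms @ theta))
        regret += float(np.max(arms @ theta_star) - arms[i] @ theta_star)
        X_rows.append(arms[i])
        Y.append(float(arms[i] @ theta_star) + rng.normal(scale=noise_sd))
        A, y = np.vstack(X_rows), np.array(Y)
        lam = 2.0 * noise_sd * np.sqrt(t * np.log(2 * d))   # lambda_t as in Lemma 3
        L = 2.0 * np.linalg.norm(A, 2) ** 2
        for _ in range(100):                                # a few ISTA passes on (3)
            z = theta - (2.0 / L) * (A.T @ (A @ theta - y))
            theta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return regret, theta
```

Plotting the running regret against the round index reproduces the qualitative behavior described above; the cumulative regret is nonnegative by construction, since each per-round term compares against the best arm.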