A Smoothed Analysis of Online Lasso for the Sparse Linear Contextual Bandit Problem
Zhiyuan Liu, Huazheng Wang, Bo Waggoner, Youjian (Eugene) Liu, Lijun Chen
ICML 2020 Workshop on Real World Experiment Design and Active Learning
Zhiyuan Liu [email protected]
Department of Computer Science, University of Colorado, Boulder
Huazheng Wang [email protected]
Department of Computer Science, University of Virginia
Bo Waggoner [email protected]
Department of Computer Science, University of Colorado, Boulder
Youjian (Eugene) Liu [email protected]
Department of Electrical, Computer and Energy Engineering, University of Colorado, Boulder
Lijun Chen [email protected]
Department of Computer Science, University of Colorado, Boulder
Abstract
We investigate the sparse linear contextual bandit problem, where the parameter $\theta$ is sparse. To relieve the sampling inefficiency, we utilize the "perturbed adversary", where the context is generated adversarially but with small random non-adaptive perturbations. We prove that the simple online Lasso supports the sparse linear contextual bandit with regret bound $O(\sqrt{kT\log d})$ even when $d \gg T$, where $k$ and $d$ are the effective and ambient dimensions, respectively. Compared to the recent work of Sivakumar et al. (2020), our analysis does not rely on preconditioning, adaptive perturbation (adaptive perturbation violates the i.i.d. perturbation setting), or truncation on the error set. Moreover, the special structures in our results explicitly characterize how the perturbation affects the exploration length, and guide the design of the perturbation together with the fundamental performance limit of the perturbation method. Numerical experiments are provided to complement the theoretical analysis.
1. Introduction
Contextual bandit algorithms have become a reference solution for sequential decision-making problems such as online recommendation (Li et al., 2010), clinical trials (Durand et al., 2018), dialogue systems (Upadhyay et al., 2019) and anomaly detection (Ding et al., 2019). Such an algorithm adaptively learns the personalized mapping between the observed contextual features and unknown parameters such as user preferences, and addresses the trade-off between exploration and exploitation (Auer, 2002; Li et al., 2010; Abbasi-Yadkori et al., 2011; Agrawal and Goyal, 2013; Abeille et al., 2017).

We consider the sparse linear contextual bandit problem, where the context is high-dimensional with a sparse unknown parameter $\theta$ (Abbasi-Yadkori et al., 2012; Hastie et al., 2015; Dash and Liu, 1997), i.e., most entries in $\theta$ are zero and thus only a few dimensions of the context feature are relevant to the reward. Due to insufficient data samples, the learning algorithm has to be sample efficient to support sequential decision-making. However, the data from a bandit model usually does not satisfy the requirements for sparse recovery, such as the null space condition (Cohen et al., 2009), the restricted isometry property (RIP) (Donoho, 2006), the restricted eigenvalue (RE) condition (Bickel et al., 2009), or the compatibility condition (Van De Geer et al., 2009). To achieve the desired performance, existing works have to consider restricted problem settings, e.g., the unit-ball, hypercube or i.i.d. arm set (Carpentier and Munos, 2012; Lattimore et al., 2015; Kim and Paik, 2019; Bastani and Bayati, 2020), or a parameter with a Gaussian prior (Gilton and Willett, 2017).
One exception is the online-to-confidence-set conversion (Abbasi-Yadkori et al., 2012), which considers the general setting but suffers from computational inefficiency.

In this paper, we tackle the sparse linear bandit problem using the smoothed analysis technique (Spielman and Teng, 2004; Kannan et al., 2018), which enjoys efficient implementation and mild assumptions. Specifically, we consider the perturbed adversary setting, where the context is generated adversarially but perturbed by small random noise. This setting interpolates between an i.i.d. distributional assumption on the input and the worst case of fully adversarial contexts. Our results show that, with high probability, the perturbed adversary inherently guarantees the (linearly growing) strong convexity condition in the low dimensional case and the restricted eigenvalue (RE) condition in the high dimensional case, which is a key property required by standard Lasso regression. We prove that the simple online Lasso supports sparse linear contextual bandits with regret bound $O(\sqrt{kT\log d})$. We also provide numerical experiments to complement the theoretical analysis.

We also note the recent work of Sivakumar et al. (2020) using smoothed analysis for structured linear contextual bandits. Compared to their work, our proposed method has the following advantages: (1) Our analysis only relies on the simple online Lasso, instead of preconditioning and truncation on the error set. Although preconditioning transfers the non-zero singular values to 1, it could amplify the noise, and the preconditioned noises are no longer i.i.d., which makes concentration analysis difficult and the estimation unstable (Jia et al., 2015). We also observe this effect in the numerical experiments. (2) Their proof relies on the assumption that the perturbations are adaptively generated based on the observed history of the chosen contexts. Instead, our analysis is based on the milder assumption that the perturbation is i.i.d. and non-adaptive. (3) Their regret bound does not describe the full picture of the effect of the perturbation's variance. Our analysis explicitly shows how the perturbation affects the exploration length, and guides the design of the perturbation together with the fundamental performance limit of the perturbation method.
2. Model and Methodology
In the bandit problem, at each round $t$ the learner pulls an arm $a_t$ among $m$ arms (we denote the arm set by $[m]$, that is, $a_t \in [m]$) and receives the corresponding noisy reward $r_{t,a_t}$. The performance of the learner is evaluated by the regret $R$, which quantifies the total loss from not choosing the best arm $a_t^*$ during $T$ rounds:
\[
R(T) = \sum_{t=1}^{T}\big( r_{t,a_t^*} - r_{t,a_t} \big). \tag{1}
\]
1. In this paper, we denote by $[n]$ the set $\{1, \cdots, n\}$ for a positive integer $n$.
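For concreteness, the regret in equation (1) is just a sum of per-round reward gaps between the best arm and the chosen arm; a minimal helper (the function name is ours, for illustration only):

```python
import numpy as np

def cumulative_regret(optimal_rewards, received_rewards):
    """Compute R(T) = sum_t (r_{t,a_t^*} - r_{t,a_t}), as in equation (1)."""
    optimal = np.asarray(optimal_rewards, dtype=float)
    received = np.asarray(received_rewards, dtype=float)
    return float(np.sum(optimal - received))
```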
In this paper, we focus on the sparse linear contextual bandit problem. Specifically, each arm $i$ at round $t$ is associated with a feature (context) vector $\mu_i^t \in \mathbb{R}^d$. The reward of that arm is assumed to be generated by a noisy linear model, i.e., the inner product of the arm feature and an unknown $S$-sparse parameter $\theta^*$, where $S$ denotes the set of effective (non-zero) entries and $|S| = k$. That is,
\[
r_i^t = \langle \mu_i^t, \theta^* \rangle + \eta^t, \qquad \|\theta^*\|_0 = k, \tag{2}
\]
where $\eta^t$ follows the Gaussian distribution $N(0, \sigma^2)$. To handle the non-convex $\ell_0$ norm, Lasso regression is the natural way to learn the sparse $\theta^*$, via the relaxation from the $\ell_0$ to the $\ell_1$ norm. To achieve the desired performance, the algorithm has to rely on well-designed contexts which satisfy sampling-efficiency requirements such as the null space condition (Cohen et al., 2009), the restricted isometry property (RIP) (Donoho, 2006), the restricted eigenvalue (RE) condition (Bickel et al., 2009), or the compatibility condition (Van De Geer et al., 2009). However, the data from bandit problems usually does not satisfy these conditions, since the contexts could be generated adversarially. Up to now, deciding on the proper assumptions for sparse bandit problems is still a challenge (Lattimore and Szepesvári, 2018).

Inspired by the smoothed analysis of the greedy algorithm for the linear bandit problem (Kannan et al., 2018), we consider the perturbed adversary defined below for the sparse linear contextual bandit problem.
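A minimal sketch of the reward model in equation (2): a $k$-sparse parameter plus Gaussian reward noise. All names and default sizes here are our illustrative choices, not from the paper:

```python
import numpy as np

def make_sparse_bandit(d=50, k=5, noise_sd=0.1, seed=None):
    """Sample a k-sparse parameter theta* and return it with a noisy reward oracle
    r = <mu, theta*> + eta, eta ~ N(0, noise_sd^2), as in equation (2)."""
    rng = np.random.default_rng(seed)
    theta_star = np.zeros(d)
    support = rng.choice(d, size=k, replace=False)   # the set S of effective entries
    theta_star[support] = rng.normal(size=k)

    def reward(context):
        return float(np.dot(context, theta_star)) + float(rng.normal(scale=noise_sd))

    return theta_star, reward
```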
Definition 1
Perturbed Adversary (Kannan et al., 2018). The perturbed adversary acts as follows at round $t$.

• Given the current contexts $\mu_1^t, \cdots, \mu_m^t$, which could be chosen adversarially, the perturbations $e_1^t, \cdots, e_m^t$ are drawn independently from a certain distribution. Each $e_i^t$ is produced independently (non-adaptively) of the context.

• The perturbed adversary outputs the contexts $(x_1^t, \cdots, x_m^t) = (\mu_1^t + e_1^t, \cdots, \mu_m^t + e_m^t)$ as the arm features to the learner.

Let $X \in \mathbb{R}^{d \times t}$ be the context matrix, where each column contains one context vector, and let $Y$ be the column vector that contains the corresponding rewards. Based on the perturbed adversary setting, we analyze the online Lasso in Algorithm 1 for the sparse linear contextual bandit. Generally speaking, our analysis considers two cases using different techniques, one for the low dimensional case when $d < T$, and the other for the high dimensional case when $d \gg T$. For the low dimensional case, the analysis utilizes random matrix theory (Tropp, 2012) to prove that, with high probability, the minimum eigenvalue of the scaled sample covariance matrix increases linearly with the round $t$; for the high dimensional case, the RE condition is guaranteed with the help of the Gaussian perturbation's property (Raskutti et al., 2010) that the null space of the context matrix under Gaussian perturbation cannot contain any vectors that are "overly" sparse when $t$ is larger than some threshold. The properties in both cases support an $O(\sqrt{k\log d / t})$ parameter recovery guarantee for Lasso regression under a noisy environment, which leads to the $O(\sqrt{kT\log d})$ regret.

Algorithm 1:
Online Lasso for Sparse Linear Contextual Bandit under a Perturbed Adversary

Initialize $\theta^1$, $X$ and $Y$.
for $t = 1, 2, \cdots, T$ do
    The perturbed adversary produces $m$ contexts $[x_1^t, \ldots, x_m^t]$.
    The learner greedily chooses the arm $i = \arg\max_{j \in [m]} \langle x_j^t, \theta^t \rangle$, observes the reward $r_i^t$, appends the new observation $(x_i^t, r_i^t)$ to $(X, Y)$, and updates $\theta^{t+1}$ by the Lasso regression:
    \[
    \theta^{t+1} = \arg\min_{\theta} G(\theta; \lambda_t) := \|Y - X^\top\theta\|_2^2 + \lambda_t \|\theta\|_1. \tag{3}
    \]
end

2.1 Low Dimensional Case

We first consider the low dimensional case when $d < T$. Under the perturbed adversary setting, we define the property named perturbed diversity, which is adopted from Bastani et al. (2017).
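Algorithm 1 can be sketched in a few lines. The paper does not fix a particular Lasso solver for the update (3); the sketch below uses a plain proximal-gradient (ISTA) loop, and every function name, step size and iteration count is our own illustrative choice:

```python
import numpy as np

def lasso_ista(A, y, lam, n_iter=300):
    """Minimize ||y - A @ theta||^2 + lam * ||theta||_1 by proximal gradient (ISTA)."""
    theta = np.zeros(A.shape[1])
    L = 2.0 * np.linalg.norm(A, 2) ** 2   # Lipschitz constant of the squared-loss gradient
    if L == 0.0:
        return theta
    step = 1.0 / L
    for _ in range(n_iter):
        z = theta - step * 2.0 * A.T @ (A @ theta - y)                 # gradient step
        theta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft-threshold
    return theta

def online_lasso_bandit(rounds, reward_fn, lam_schedule):
    """One greedy pass of Algorithm 1: pick the arg-max arm, observe, re-fit the Lasso."""
    d = rounds[0].shape[1]
    theta = np.zeros(d)
    X_rows, Y = [], []
    for t, arms in enumerate(rounds, start=1):
        i = int(np.argmax(arms @ theta))            # greedy arm choice
        X_rows.append(arms[i])
        Y.append(reward_fn(arms[i]))
        theta = lasso_ista(np.vstack(X_rows), np.array(Y), lam_schedule(t))
    return theta
```

In a well-conditioned, noiseless instance the inner solver recovers a sparse parameter accurately, which is all the greedy loop needs.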
Definition 2
Perturbed Diversity.
Let $e_i^t \sim D$ on $\mathbb{R}^d$. Given any context vector $\mu_i^t$, we say that $x_i^t = \mu_i^t + e_i^t$ satisfies perturbed diversity if the minimum eigenvalue of the sample covariance matrix under perturbation satisfies
\[
\lambda_{\min}\Big( \mathbb{E}_{e_i^t \sim D}\big[ x_i^t (x_i^t)^\top \big] \Big) \ge \lambda_0,
\]
where $\lambda_0$ is a positive constant.

Intuitively speaking, perturbed diversity guarantees that, in expectation, each context provides at least a certain amount of information about all coordinates of $\theta^*$, which helps to recover the support of the parameter via a regularized method. Several distributions $D$ make perturbed diversity hold, e.g., the Gaussian distribution. However, without any restriction, $x_i^t$ could be very large and lie outside the realistic domain. Instead, the value of each dimension (we denote by $x_i^t(j)$ the $j$-th dimension of $x_i^t$) should lie in a bounded interval, while the total energy of the context vector is bounded by a certain constant, i.e., $\|x_i^t\|_2 \le R$. This motivates us to consider perturbed diversity under a censored perturbed adversary.

Lemma 1
Given the context vector $\mu_i^t \in \mathbb{R}^d$ with $|\mu_i^t(j)| \le q_j$ for each $j \in [d]$, we define the censored perturbed context $x_i^t$ under $e_i^t \sim N(0, \sigma^2 I)$ as follows:
\[
x_i^t(j) = \begin{cases} \mu_i^t(j) + e_i^t(j), & \text{if } |\mu_i^t(j) + e_i^t(j)| \le q_j,\\ q_j, & \text{if } \mu_i^t(j) + e_i^t(j) > q_j,\\ -q_j, & \text{if } \mu_i^t(j) + e_i^t(j) < -q_j. \end{cases} \tag{4}
\]
Then $x_i^t$ has perturbed diversity with $\lambda_0 = g(2q/\sigma, 0)\,\sigma^2$, where $q = \min_j q_j$ and $g(\cdot, \cdot)$ is a composite function of the probability density function $\phi(\cdot)$ and the cumulative distribution function $\Phi(\cdot)$ of the normal distribution. Please refer to equation (14) for more details.
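Two small sketches related to this construction (all function names and parameters are ours, for illustration only): the censoring in equation (4) is a coordinate-wise clip, and for the uncensored isotropic perturbation $e \sim N(0, \sigma^2 I)$ one can verify perturbed diversity directly, since $\mathbb{E}[xx^\top] = \mu\mu^\top + \sigma^2 I$ has minimum eigenvalue exactly $\sigma^2$ whenever $d \ge 2$:

```python
import numpy as np

def censored_context(mu, sigma, q, seed=None):
    """Censored perturbed context of equation (4): add N(0, sigma^2) noise to each
    coordinate, then clip coordinate j to the interval [-q_j, q_j]."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    e = rng.normal(scale=sigma, size=mu.shape)
    q = np.broadcast_to(np.asarray(q, dtype=float), mu.shape)
    return np.clip(mu + e, -q, q)

def second_moment_min_eig(mu, sigma):
    """lambda_min(E[x x^T]) for the *uncensored* x = mu + e, e ~ N(0, sigma^2 I):
    E[x x^T] = mu mu^T + sigma^2 I, whose smallest eigenvalue is sigma^2 for d >= 2."""
    mu = np.asarray(mu, dtype=float)
    M = np.outer(mu, mu) + sigma ** 2 * np.eye(mu.shape[0])
    return float(np.linalg.eigvalsh(M)[0])
```

Censoring lowers this eigenvalue somewhat, which is exactly what the factor $g(2q/\sigma, 0)$ in Lemma 1 accounts for.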
The proof is provided in the appendix, and one can easily extend it to the case where $e_i^t \sim N(0, \Sigma)$. Based on Lemma 1, we can derive that, with high probability, $\lambda_{\min}(XX^\top)$ grows at least at a linear rate in $t$.

Lemma 2
With the censored perturbed diversity, when $t > \frac{2R^2}{g(2q/\sigma, 0)\,\sigma^2}\log(dT)$, the following is satisfied with probability $1 - 1/T$:
\[
\lambda_{\min}(XX^\top) \ge g(2q/\sigma, 0)(1-\tau)\,\sigma^2 t, \qquad \text{where } \tau = \sqrt{\frac{2R^2\log(dT)}{g(2q/\sigma, 0)\,\sigma^2\, t}}.
\]

As one can see from Lemma 2, after a certain number of (implicit) exploration rounds, i.e., $\frac{2R^2}{g(2q/\sigma, 0)\sigma^2}\log(dT)$, we have enough information to support the $O(\sqrt{k\log d / t})$ parameter recovery by Lasso regression. The regret analysis, together with the high dimensional case, is deferred to the next section.

2.2 High Dimensional Case

We now turn to the high dimensional case when $d \gg T$. During the learning process, the scaled sample covariance matrix $XX^\top$ is always rank deficient, which means $\lambda_{\min}(XX^\top) = 0$, and Lemma 2, based on random matrix theory, can no longer be applied. We therefore consider the restricted eigenvalue (RE) condition instead. Here "restricted" means that the error $\Delta^t := \theta^t - \theta^*$ incurred by Lasso regression is restricted to a set with special structure. That is, $\Delta^t \in C(S; \alpha)$, where
\[
C(S; \alpha) := \{\theta \in \mathbb{R}^d \mid \|\theta_{S^c}\|_1 \le \alpha\|\theta_S\|_1\},
\]
and $\alpha$ is determined by the choice of $\lambda_t$. In the following, we focus on $C(S; 3)$, which can be achieved by setting $\lambda_t = \Theta(2\sigma R\sqrt{t\log(2d)})$.

The key is to prove that the null space of $X^\top$ has no overlap with $C(S; 3)$. It has been proved that special cases, in which contexts are purely sampled from special distributions such as Gaussian and Bernoulli distributions, satisfy this property (Zhou, 2009; Raskutti et al., 2010; Haupt et al., 2010). We take a further step and show that the null space of the context matrix under Gaussian perturbation cannot contain any vectors that are "overly" sparse when $t$ is larger than some threshold.

Theorem 1
Consider perturbations $e_i^t \sim N(0, \Sigma)$, where $\|\Sigma^{1/2}\Delta\|_2 \ge \gamma\|\Delta\|_2$ for all $\Delta \in C(S; 3)$. If
\[
t > \max\Big(\underbrace{\frac{4c''\,q(\Sigma)}{\gamma^2}\,k\log d}_{(d)},\ \underbrace{\frac{2aR^2\lambda_{\max}(\Sigma)\log T}{\gamma^4}}_{(e)}\Big),
\]
then with probability $1 - (c'e^{-ct} + T^{-a})$, we have
\[
\Delta^\top XX^\top\Delta \ge h\,t\,\|\Delta\|_2^2,
\]
where $c, c', c''$ are universal constants, $q(\Sigma) = \max_i \Sigma_{ii}$, and $h = \gamma^2 - R\sqrt{2a\lambda_{\max}(\Sigma)\log T / t}$.

Moreover, we can design $\gamma^2 = \lambda_{\min}(\Sigma)$. By the Rayleigh quotient, one can obtain $\lambda_{\max}(\Sigma) \ge q(\Sigma) = \max_i \Sigma_{ii} \ge \min_i \Sigma_{ii} \ge \lambda_{\min}(\Sigma) = \gamma^2$.

We then discuss how perturbations affect the exploration length. First, a larger perturbation does not imply less regret.
Results of Sivakumar et al. (2020) show that the regret is $O(\log T\,\sqrt{T}/\sigma)$, where $\sigma^2$ is the perturbation's variance, which suggests that choosing a larger $\sigma$ leads to a smaller regret bound. However, this is not the full picture of the effect of the perturbation's variance. Our results show that increasing the variance of the perturbation has a limited effect on the necessary exploration and regret, which reveals a theoretical limit of the perturbation method. Specifically, in term (d) of Theorem 1, no matter how large the variance is, the ratio $q(\Sigma)/\gamma^2 \ge 1$. So $4c'' k\log d$ is the necessary exploration length and cannot be improved. Second, the condition number and the SPR (the signal-to-perturbation ratio) are important factors. The condition number $\mathrm{Cond}(\Sigma)$ controls both terms (d) and (e), e.g., $q(\Sigma)/\gamma^2 \le \lambda_{\max}(\Sigma)/\lambda_{\min}(\Sigma) = \mathrm{Cond}(\Sigma)$. This also shows that the optimal perturbation design chooses $\Sigma = \sigma^2 I$. In term (e) of Theorem 1, $R^2/\gamma^2$ can be regarded as the ratio between the energy of the unperturbed context and the perturbation energy. This ratio exhibits the trade-off between exploration and fidelity. That is, a large variance not only reduces the exploration length (meanwhile, the exploration length is still bounded below, by term (d)) but also reduces the fidelity of the original context.
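The quantities in this discussion are easy to compute for a candidate perturbation covariance. The helper below (all names are ours) reports $q(\Sigma)$, $\gamma^2 = \lambda_{\min}(\Sigma)$, the condition number, and the signal-to-perturbation ratio $R^2/\gamma^2$; the isotropic design $\Sigma = \sigma^2 I$ achieves the optimal values $\mathrm{Cond}(\Sigma) = q(\Sigma)/\gamma^2 = 1$:

```python
import numpy as np

def perturbation_diagnostics(Sigma, R):
    """Quantities from the discussion of Theorem 1: q(Sigma) = max_i Sigma_ii,
    gamma^2 = lambda_min(Sigma), the condition number, and R^2 / gamma^2."""
    Sigma = np.asarray(Sigma, dtype=float)
    evals = np.linalg.eigvalsh(Sigma)
    gamma2 = float(evals[0])
    q = float(np.max(np.diag(Sigma)))
    return {
        "q": q,
        "gamma2": gamma2,
        "cond": float(evals[-1] / evals[0]),
        "q_over_gamma2": q / gamma2,   # factor in term (d); always >= 1
        "spr": R ** 2 / gamma2,        # factor in term (e): context vs. perturbation energy
    }
```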
3. Regret Analysis
Based on the properties we have proved for the low and high dimensional cases, we can obtain the following recovery guarantee by techniques from standard Lasso regression (Hastie et al., 2015).
Lemma 3. If $t > T_e$ and $\lambda_t = 2\sigma R\sqrt{t\log\frac{2d}{\delta}}$, the Lasso regression under the perturbed adversary has the recovery guarantee
\[
\|\theta^t - \theta^*\|_2 \le \frac{6\sigma R}{C}\sqrt{\frac{k\log(2d/\delta)}{t}}
\]
with probability $1 - \delta$, where $T_e = \frac{2R^2}{g(2q/\sigma, 0)\sigma^2}\log(dT)$ and $C = g(2q/\sigma, 0)(1-\tau)\sigma^2$ for the low dimensional case, and $T_e = \max\big(\frac{4c''\,q(\Sigma)}{\gamma^2}k\log d,\ \frac{2aR^2\lambda_{\max}(\Sigma)\log T}{\gamma^4}\big)$ and $C = \gamma^2 - R\sqrt{2a\lambda_{\max}(\Sigma)\log T/t}$ for the high dimensional case.

We then obtain the final result in Theorem 2 based on all the analysis above.
Theorem 2
The online Lasso for the sparse linear contextual bandit under the perturbed adversary admits the following regret bound with probability $1 - \delta$:
\[
\mathrm{Regret} \le 2R\Big(T_e + \frac{6\sigma R}{C}\sqrt{kT\log\frac{2d}{\delta}}\Big) = O\big(\sqrt{kT\log d}\big). \tag{5}
\]
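The linear growth of $\lambda_{\min}(XX^\top)$ that drives the exploration length $T_e$ (Lemma 2) can be observed empirically. The small simulation below accumulates censored-perturbed contexts and tracks the minimum eigenvalue of the Gram matrix; all parameter choices here are our illustrative assumptions, not values from the paper:

```python
import numpy as np

def min_eig_growth(d=5, sigma=1.0, q=3.0, T=400, seed=0):
    """Track lambda_min(X X^T) as censored-perturbed contexts accumulate over rounds."""
    rng = np.random.default_rng(seed)
    M = np.zeros((d, d))
    mins = []
    for _ in range(T):
        mu = rng.uniform(-1.0, 1.0, size=d)                      # adversary stand-in
        x = np.clip(mu + rng.normal(scale=sigma, size=d), -q, q)  # censored perturbation
        M += np.outer(x, x)                                       # Gram matrix X X^T
        mins.append(float(np.linalg.eigvalsh(M)[0]))
    return np.array(mins)
```

Plotting `min_eig_growth()` against the round index shows a roughly linear trend after a short initial phase, consistent with Lemma 2. Note that the sequence is nondecreasing, since each rank-one update is positive semidefinite.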
4. Conclusion
This paper utilizes the "perturbed adversary", where the context is generated adversarially but with small random non-adaptive perturbations, to tackle the sparse linear contextual bandit problem. We prove that the simple online Lasso supports the sparse linear contextual bandit with regret bound $O(\sqrt{kT\log d})$ for both the low and high dimensional cases, and show how the perturbation affects the exploration length and the trade-off between exploration and fidelity. Future work will focus on extending our analysis to more challenging settings, e.g., defending against adversarial attacks on contextual bandit models.
References
Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics, pages 1–9, 2012.

Marc Abeille, Alessandro Lazaric, et al. Linear Thompson sampling revisited. Electronic Journal of Statistics, 11(2):5165–5197, 2017.

Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135, 2013.

Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.

Hamsa Bastani and Mohsen Bayati. Online decision making with high-dimensional covariates. Operations Research, 68(1):276–294, 2020.

Hamsa Bastani, Mohsen Bayati, and Khashayar Khosravi. Mostly exploration-free algorithms for contextual bandits. arXiv preprint arXiv:1704.09011, 2017.

Peter J Bickel, Yaacov Ritov, Alexandre B Tsybakov, et al. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.

Alexandra Carpentier and Rémi Munos. Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. In Artificial Intelligence and Statistics, pages 190–198, 2012.

Albert Cohen, Wolfgang Dahmen, and Ronald DeVore. Compressed sensing and best k-term approximation. Journal of the American Mathematical Society, 22(1):211–231, 2009.

Manoranjan Dash and Huan Liu. Feature selection for classification. Intelligent Data Analysis, 1(3):131–156, 1997.

Kaize Ding, Jundong Li, and Huan Liu. Interactive anomaly detection on attributed networks. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 357–365, 2019.

David L Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.

Audrey Durand, Charis Achilleos, Demetris Iacovides, Katerina Strati, Georgios D Mitsis, and Joelle Pineau. Contextual bandits for adapting treatment in a mouse model of de novo carcinogenesis. In Machine Learning for Healthcare Conference, pages 67–82, 2018.

Davis Gilton and Rebecca Willett. Sparse linear contextual bandits via relevance vector machines. In International Conference on Sampling Theory and Applications (SampTA), pages 518–522. IEEE, 2017.

Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC, 2015.

Jarvis Haupt, Waheed U Bajwa, Gil Raz, and Robert Nowak. Toeplitz compressed sensing matrices with applications to sparse channel estimation. IEEE Transactions on Information Theory, 56(11):5862–5875, 2010.

Jinzhu Jia, Karl Rohe, et al. Preconditioning the Lasso for sign consistency. Electronic Journal of Statistics, 9(1):1150–1172, 2015.

Sampath Kannan, Jamie H Morgenstern, Aaron Roth, Bo Waggoner, and Zhiwei Steven Wu. A smoothed analysis of the greedy algorithm for the linear contextual bandit problem. In Advances in Neural Information Processing Systems, pages 2227–2236, 2018.

Gi-Soo Kim and Myunghee Cho Paik. Doubly-robust Lasso bandit. In Advances in Neural Information Processing Systems, pages 5869–5879, 2019.

Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Preprint, 2018.

Tor Lattimore, Koby Crammer, and Csaba Szepesvári. Linear multi-resource allocation with semi-bandit feedback. In Advances in Neural Information Processing Systems, pages 964–972, 2015.

Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM, 2010.

Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11(Aug):2241–2259, 2010.

Vidyashankar Sivakumar, Zhiwei Steven Wu, and Arindam Banerjee. Structured linear contextual bandits: A sharp and geometric smoothed analysis. arXiv preprint arXiv:2002.11332, 2020.

Daniel A Spielman and Shang-Hua Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM (JACM), 51(3):385–463, 2004.

Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.

Sohini Upadhyay, Mayank Agarwal, Djallel Bounneffouf, and Yasaman Khazaeni. A bandit approach to posterior dialog orchestration under a budget. arXiv preprint arXiv:1906.09384, 2019.

Sara A Van De Geer, Peter Bühlmann, et al. On the conditions used to prove oracle results for the Lasso. Electronic Journal of Statistics, 3:1360–1392, 2009.

Shuheng Zhou. Restricted eigenvalue conditions on subgaussian random matrices. arXiv preprint arXiv:0912.4045, 2009.
Appendix
Lemma 4 (A variant of matrix Chernoff, Tropp (2012)). Consider a finite sequence $z_t$ of independent, random, self-adjoint matrices that satisfy $z_t \succeq 0$ and $\lambda_{\max}(z_t) \le Q$ almost surely. Compute the minimum eigenvalue of the sum of expectations, $\psi_{\min} := \lambda_{\min}\big(\sum_t \mathbb{E}(z_t)\big)$. Then for $\delta \in [0, 1)$, we have
\[
\mathbb{P}\Big\{\lambda_{\min}\Big(\sum_t z_t\Big) \le (1-\delta)\psi_{\min}\Big\} \le d\left[\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\right]^{\psi_{\min}/Q}. \tag{6}
\]
Moreover, for any $\psi \le \psi_{\min}$, we can get
\[
\mathbb{P}\Big\{\lambda_{\min}\Big(\sum_t z_t\Big) \le (1-\delta)\psi\Big\} \le d\left[\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\right]^{\psi/Q}. \tag{7}
\]

Proof
Since $\psi \le \psi_{\min}$, there exists $\delta_1 \in [0, 1]$ such that $\psi = \delta_1\psi_{\min}$. We have
\[
(1-\delta)\psi = (1-\delta)\delta_1\psi_{\min} = \Big(1 - \underbrace{(1 - \delta_1 + \delta\delta_1)}_{\delta_2}\Big)\psi_{\min}.
\]
Plugging this into (6) leads to
\[
\mathbb{P}\Big\{\lambda_{\min}\Big(\sum_t z_t\Big) \le (1-\delta)\psi\Big\} \le d\left[\frac{e^{-\delta_2}}{(1-\delta_2)^{1-\delta_2}}\right]^{\psi_{\min}/Q}.
\]
One can easily verify that $\delta_2 \ge \delta$. So
\[
\left[\frac{e^{-\delta_2}}{(1-\delta_2)^{1-\delta_2}}\right]^{\psi_{\min}/Q} \le \left[\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\right]^{\psi_{\min}/Q} \le \left[\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\right]^{\psi/Q}.
\]
Then we obtain
\[
\mathbb{P}\Big\{\lambda_{\min}\Big(\sum_t z_t\Big) \le (1-\delta)\psi\Big\} \le d\left[\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\right]^{\psi/Q}. \tag{8}
\]
Since $\frac{e^{-\delta}}{(1-\delta)^{1-\delta}} \le e^{-\delta^2/2}$, we have the following for $\delta \in [0, 1)$:
\[
\mathbb{P}\Big\{\lambda_{\min}\Big(\sum_t z_t\Big) \le (1-\delta)\psi\Big\} \le d\big[e^{-\delta^2/2}\big]^{\psi/Q}. \tag{9}
\]

Fact 1
Let $\eta = [\eta^1, \cdots, \eta^t]^\top$, where each $\eta^i$ is drawn i.i.d. from $N(0, \sigma^2)$. Let $X \in \mathbb{R}^{d\times t}$, where each $|X_{ij}| \le R$. Then with probability $1-\delta$, we have
\[
\|X\eta\|_\infty \le \sigma R\sqrt{t\log\frac{2d}{\delta}}.
\]

Fact 2 (Chernoff bound for a sum of sub-Gaussian random variables). Let $X_1, \cdots, X_n$ be $n$ independent random variables such that $X_i \sim \mathrm{subG}(\sigma^2)$. Then for any $a \in \mathbb{R}^n$ and $c > 0$, we have
\[
\Pr\Big(\sum_{i=1}^n a_i X_i < -c\Big) \le \exp\Big(-\frac{c^2}{2\sigma^2\|a\|_2^2}\Big). \tag{10}
\]
That is, with probability at least $1-\delta$, we have
\[
\sum_{i=1}^n a_i X_i > -\sqrt{2\sigma^2\|a\|_2^2\log\frac{1}{\delta}}. \tag{11}
\]

Lemma 5 (Restricted eigenvalue property (Corollary 1 of Raskutti et al. (2010))). Suppose that $\Sigma$ satisfies the RE condition of order $k$ with parameters $(1, \gamma)$, and denote $q(\Sigma) = \max_i \Sigma_{ii}$. Then for universal positive constants $c, c', c''$, if the sample size satisfies
\[
t > c''\,\frac{q(\Sigma)}{\gamma^2}\,k\log d, \tag{12}
\]
then the matrix $\Phi\Phi^\top/t$ satisfies the RE condition with parameters $(1, \gamma)$ with probability at least $1 - c'e^{-ct}$, where $\Phi \in \mathbb{R}^{d\times t}$ and each column is i.i.d. $N(0, \Sigma)$.

Proof of Lemma 1

Proof
Since the $e_i^t(j)$ are independent of each other, we can analyze them coordinate by coordinate. To simplify the analysis, we slightly abuse notation and drop the subscript $i$ and superscript $t$ (only within this proof); that is, $x(j) := x_i^t(j)$ and $e(j) := e_i^t(j)$. Then
\[
\begin{aligned}
\lambda_{\min}\big(\mathbb{E}[xx^\top]\big) &= \min_{\|w\|_2=1} w^\top\mathbb{E}[xx^\top]w = \min_{\|w\|_2=1}\mathbb{E}(w^\top xx^\top w) = \min_{\|w\|_2=1}\mathbb{E}\big(\langle w, x\rangle^2\big)\\
&\ge \min_{\|w\|_2=1}\mathrm{Var}(\langle w, x\rangle) \ge \min_{\|w\|_2=1}\mathrm{Var}(\langle w, e\rangle)\\
&= \min_{\|w\|_2=1}\sum_{i=1}^d (w(i))^2\,\mathrm{Var}\big(e(i) \mid \text{censored in } [-q_i, q_i]\big)\\
&\ge \min_{\|w\|_2=1} g(2q/\sigma, 0)\,\sigma^2\sum_{i=1}^d (w(i))^2 = g(2q/\sigma, 0)\,\sigma^2,
\end{aligned}
\]
where the bound $g(2q/\sigma, 0)$ is according to Lemma 6.
Lemma 6
Let $e \sim N(0, \sigma^2)$. For any interval $[a, b]$ which contains $0$ and has fixed length $2q$, i.e., $b - a = 2q$, with $q \ge \sigma$, we have the following result:
\[
\mathrm{Var}(e \mid \text{censored in } [a, b]) \ge g(2q/\sigma, 0)\,\sigma^2. \tag{13}
\]

Proof
We first derive the variance of the two-sided censored Gaussian distribution. Denote $\alpha = a/\sigma$ and $\beta = b/\sigma$. For the truncated Gaussian distribution, we have
\[
\mathbb{E}(e \mid e \in [a, b]) = \sigma\,\frac{\phi(\alpha) - \phi(\beta)}{\Phi(\beta) - \Phi(\alpha)} = \sigma\rho,
\]
\[
\mathrm{Var}(e \mid e \in [a, b]) = \sigma^2\Big(1 + \underbrace{\frac{\alpha\phi(\alpha) - \beta\phi(\beta)}{\Phi(\beta) - \Phi(\alpha)} - \rho^2}_{\Lambda}\Big).
\]
We then calculate the variance of the two-sided censored Gaussian distribution by the law of total variance,
\[
\mathrm{Var}(e \mid \text{censored in } [a, b]) = \mathbb{E}_y[\mathrm{Var}(e \mid y)] + \mathrm{Var}_y[\mathbb{E}(e \mid y)],
\]
where $y$ denotes the indicator of the event $e \in [a, b]$. After some basic calculations, we get the following result:
\[
\begin{aligned}
\mathrm{Var}(e \mid \text{censored in } [a, b]) = {} & \sigma^2(\Phi(\beta) - \Phi(\alpha))(1 + \Lambda)\\
& + \sigma^2\big[(\rho - \beta)^2(\Phi(\beta) - \Phi(\alpha))(1 - \Phi(\beta) + \Phi(\alpha))\\
& \quad + 2(\beta - \alpha)(\rho - \beta)(\Phi(\beta) - \Phi(\alpha))\Phi(\alpha)\\
& \quad + (\beta - \alpha)^2(1 - \Phi(\alpha))\Phi(\alpha)\big]\\
= {} & g(\beta, \alpha)\,\sigma^2. \qquad (14)
\end{aligned}
\]
One can show that (1) $\mathrm{Var}(e \mid \text{censored in } [a, b])$ achieves its minimum when $a = 0$ or $b = 0$, by the first-order optimality condition; and (2) $\mathrm{Var}(e \mid \text{censored in } [0, b])$ is an increasing function of $b$. Based on (1) and (2), we obtain
\[
\mathrm{Var}(e \mid \text{censored in } [a, b]) \ge \mathrm{Var}(e \mid \text{censored in } [0, 2q]) = g(2q/\sigma, 0)\,\sigma^2.
\]

Proof of Lemma 2

Proof
At round $t$, we have
\[
\lambda_{\min}\big(\mathbb{E}(XX^\top)\big) = \lambda_{\min}\Big(\mathbb{E}\Big(\sum_{i=1}^t x_{a_i}^i (x_{a_i}^i)^\top\Big)\Big) = \lambda_{\min}\Big(\sum_{i=1}^t \mathbb{E}\big(x_{a_i}^i (x_{a_i}^i)^\top\big)\Big) \ge \sum_{i=1}^t \lambda_{\min}\big(\mathbb{E}\big(x_{a_i}^i (x_{a_i}^i)^\top\big)\big),
\]
where the second equality is due to the independence of each round's perturbation, and the inequality comes from the fact that the minimum eigenvalue is a super-additive operator. For the censored Gaussian perturbation, $\lambda_{\min}\big(\mathbb{E}(x_{a_i}^i (x_{a_i}^i)^\top)\big) \ge g(2q/\sigma, 0)\,\sigma^2$ based on Lemma 1. So $\lambda_{\min}\big(\mathbb{E}(XX^\top)\big) \ge g(2q/\sigma, 0)\,\sigma^2 t$. Based on (9) of Lemma 4 and $\lambda_{\max}\big(x_{a_i}^i (x_{a_i}^i)^\top\big) \le \|x_{a_i}^i\|_2^2 \le R^2$, one can obtain
\[
\mathbb{P}\Big\{\lambda_{\min}(XX^\top) \le g(2q/\sigma, 0)(1-\tau)\,\sigma^2 t\Big\} \le d\big[e^{-\tau^2/2}\big]^{g(2q/\sigma, 0)\,\sigma^2 t/R^2}.
\]
Setting the right-hand side equal to $1/T$ yields the final result.

Proof of Theorem 1

Proof
To simplify the analysis, we slightly abuse notation and denote the unperturbed context matrix by $\mu$, where each column $\mu_i$ is one context vector. Similarly, denote by $e$ the perturbation matrix with columns $e_i$. We first decompose $\Delta^\top XX^\top\Delta$ as follows:
\[
\Delta^\top XX^\top\Delta = \underbrace{\Delta^\top\mu\mu^\top\Delta}_{(a)} + \underbrace{2\Delta^\top e\mu^\top\Delta}_{(b)} + \underbrace{\Delta^\top ee^\top\Delta}_{(c)}. \tag{15}
\]
For term (a) in equation (15), one can only show $(a) \ge 0$, since $\Delta$ could lie in $\mathrm{Null}(\mu^\top)$. For terms (b) and (c), we establish high-probability lower bounds respectively.

Now consider a positive definite matrix $\Sigma$, which we can design such that it satisfies the RE condition, that is, $\|\Sigma^{1/2}\Delta\|_2 \ge \gamma\|\Delta\|_2$. Based on Lemma 5, we can derive the following for term (c): for universal positive constants $c, c', c''$, if the sample size satisfies
\[
t > c''\,\frac{q(\Sigma)}{\gamma^2}\,k\log d, \tag{16}
\]
where $q(\Sigma) = \max_i \Sigma_{ii}$, then with probability at least $1 - c'e^{-ct}$,
\[
\Delta^\top ee^\top\Delta \ge \gamma^2 t\,\|\Delta\|_2^2. \tag{17}
\]
We then derive a high-probability bound for (b). First, we decompose (b) into a weighted sum of i.i.d. Gaussian variables. That is,
\[
\Delta^\top e\mu^\top\Delta = \sum_{i=1}^t (\mu_i^\top\Delta)(\Delta^\top e_i), \tag{18}
\]
where $\mu_i^\top\Delta$ is the weight and each $\Delta^\top e_i \sim N(0, \Delta^\top\Sigma\Delta)$. Based on the Chernoff bound for a weighted sum of sub-Gaussian random variables in Fact 2 (with $\delta = t^{-a}$), we have
\[
\begin{aligned}
\sum_{i=1}^t(\mu_i^\top\Delta)(\Delta^\top e_i) &\ge -\sqrt{2a\,\Delta^\top\Sigma\Delta\,\sum_{i=1}^t(\mu_i^\top\Delta)^2\,\log t} \qquad (19)\\
&\ge -\sqrt{2a\,\lambda_{\max}(\Sigma)\,\|\Delta\|_2^2\,\sum_{i=1}^t R^2\|\Delta\|_2^2\,\log t} \qquad (20)\\
&= -Rt\,\|\Delta\|_2^2\,\sqrt{\frac{2a\,\lambda_{\max}(\Sigma)\log t}{t}} \qquad (21)
\end{aligned}
\]
with probability at least $1 - t^{-a}$. We can conclude that, with probability at least $1 - (c'e^{-ct} + t^{-a})$, both inequalities (17) and (21) hold. If the round $t$ satisfies
\[
t > \max\Big(\underbrace{\frac{4c''\,q(\Sigma)}{\gamma^2}\,k\log d}_{(d)},\ \underbrace{\frac{2aR^2\lambda_{\max}(\Sigma)\log t}{\gamma^4}}_{(e)}\Big), \tag{22}
\]
we have $(b) + (c) \ge h\,t\,\|\Delta\|_2^2$, where $h = \gamma^2 - R\sqrt{2a\lambda_{\max}(\Sigma)\log t/t}$.

Proof of Lemma 3

Proof
Our proof combines techniques from smoothed analysis and Lasso regression. Since $\theta^t$ minimizes $G(\theta)$, we have $G(\theta^t) \le G(\theta^*)$. This yields the inequality
\[
\|X^\top\Delta^t\|_2^2 \le 2(\Delta^t)^\top X\eta + \lambda_t\big(\|\theta^*\|_1 - \|\theta^* + \Delta^t\|_1\big),
\]
where $\eta$ denotes the noise vector. Note that $\|\theta^*\|_1 = \|\theta^*_S\|_1$. Furthermore, one can verify that $\|\theta^*\|_1 - \|\theta^* + \Delta^t\|_1 \le \|\Delta^t_S\|_1 - \|\Delta^t_{S^c}\|_1$. For $(\Delta^t)^\top X\eta$, applying Hölder's inequality yields
\[
(\Delta^t)^\top X\eta \le \|\Delta^t\|_1\|X\eta\|_\infty \le \sigma R\sqrt{t\log\frac{2d}{\delta}}\,\|\Delta^t\|_1 = \frac{\lambda_t}{2}\|\Delta^t\|_1,
\]
where the second inequality is due to Fact 1. Combining all of the above, we obtain
\[
\begin{aligned}
\|X^\top\Delta^t\|_2^2 &\le \lambda_t\|\Delta^t\|_1 + \lambda_t\big(\|\Delta^t_S\|_1 - \|\Delta^t_{S^c}\|_1\big) \qquad (23)\\
&\le 3\lambda_t\|\Delta^t_S\|_1 \le 3\lambda_t\sqrt{k}\,\|\Delta^t\|_2. \qquad (24)
\end{aligned}
\]
First, from inequality (23) we can obtain $\Delta^t \in C(S; 3)$. For the low dimensional case, we have $\|X^\top\Delta^t\|_2^2 \ge \lambda_{\min}(XX^\top)\|\Delta^t\|_2^2 \ge Ct\|\Delta^t\|_2^2$ by Lemma 2, where $C = g(2q/\sigma, 0)(1-\tau)\sigma^2$. For the high dimensional case, we apply Theorem 1, since $\Delta^t \in C(S; 3)$, and get $\|X^\top\Delta^t\|_2^2 \ge Ct\|\Delta^t\|_2^2$, where $C = \gamma^2 - R\sqrt{2a\lambda_{\max}(\Sigma)\log T/t}$. Combining these with inequality (24), we get the final result
\[
\|\Delta^t\|_2 \le \frac{6\sigma R}{C}\sqrt{\frac{k\log(2d/\delta)}{t}}.
\]

Proof of Theorem 2

Proof
As for the regret in round $t$, we have
\[
\begin{aligned}
\langle x_{i_t^*}^t, \theta^*\rangle - \langle x_{i_t}^t, \theta^*\rangle &= \langle x_{i_t^*}^t, \theta^* - \theta^t\rangle - \langle x_{i_t}^t, \theta^* - \theta^t\rangle + \langle x_{i_t^*}^t, \theta^t\rangle - \langle x_{i_t}^t, \theta^t\rangle\\
&\le \langle x_{i_t^*}^t, \theta^* - \theta^t\rangle - \langle x_{i_t}^t, \theta^* - \theta^t\rangle\\
&\le \big|\langle x_{i_t^*}^t, \theta^* - \theta^t\rangle\big| + \big|\langle x_{i_t}^t, \theta^* - \theta^t\rangle\big| \le 2R\,\|\theta^* - \theta^t\|_2,
\end{aligned}
\]
where the first inequality comes from the greedy choice, since $i_t = \arg\max_i \langle x_i^t, \theta^t\rangle$, and the last inequality is due to the censored perturbations, which guarantee $\|x_i^t\|_2 \le R$. Based on the analysis of the low and high dimensional cases, we denote the exploration length by $T_e$. During the exploration phase, we can bound the regret by $2RT_e$. So we can derive
\[
\begin{aligned}
\mathrm{Regret} &= \sum_{t=1}^{T_e}\big(\langle x_{i_t^*}^t, \theta^*\rangle - \langle x_{i_t}^t, \theta^*\rangle\big) + \sum_{t=T_e+1}^{T}\big(\langle x_{i_t^*}^t, \theta^*\rangle - \langle x_{i_t}^t, \theta^*\rangle\big)\\
&\le 2RT_e + 2R\sum_{t=T_e+1}^{T}\|\theta^* - \theta^t\|_2 \le 2RT_e + 2R\sum_{t=T_e+1}^{T}\frac{6\sigma R}{C}\sqrt{\frac{k\log(2d/\delta)}{t}}\\
&\le 2R\Big(T_e + \frac{6\sigma R}{C}\sqrt{kT\log\frac{2d}{\delta}}\Big).
\end{aligned}
\]

Numeric Simulations
This section shows the results of numeric simulations. We choose the context dimension $d = 2000$ with effective dimension $k = 20$, and 5 arms at each round. Our sparse bandit learning process only contains 150 rounds, with each context vector randomly generated from the uniform distribution on $[0, 1]$.

[Figure: regret over rounds for the online Lasso with and without preconditioning.]
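A scaled-down sketch of this experiment, using smaller $d$ and $T$ so it runs quickly, ISTA as the Lasso solver, and our own illustrative constants throughout (the paper's setting is $d = 2000$, $k = 20$, 5 arms, 150 rounds):

```python
import numpy as np

def run_simulation(d=100, k=5, m=5, T=60, sigma=0.1, noise_sd=0.05, seed=0):
    """Greedy online Lasso on uniform contexts with Gaussian perturbations;
    a scaled-down stand-in for the paper's experiment."""
    rng = np.random.default_rng(seed)
    theta_star = np.zeros(d)
    theta_star[rng.choice(d, size=k, replace=False)] = 1.0
    X_rows, Y = [], []
    theta = np.zeros(d)
    regret = 0.0
    for t in range(1, T + 1):
        # adversary stand-in: uniform base contexts plus i.i.d. Gaussian perturbations
        arms = rng.uniform(0.0, 1.0, size=(m, d)) + rng.normal(scale=sigma, size=(m, d))
        i = int(np.argmax(arms @ theta))
        regret += float(np.max(arms @ theta_star) - arms[i] @ theta_star)
        X_rows.append(arms[i])
        Y.append(float(arms[i] @ theta_star) + rng.normal(scale=noise_sd))
        A, y = np.vstack(X_rows), np.array(Y)
        lam = 2.0 * noise_sd * np.sqrt(t * np.log(2 * d))   # lambda_t as in Lemma 3
        L = 2.0 * np.linalg.norm(A, 2) ** 2
        for _ in range(100):                                # a few ISTA passes on (3)
            z = theta - (2.0 / L) * (A.T @ (A @ theta - y))
            theta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return regret, theta
```

Plotting the running regret against the round index reproduces the qualitative behavior described above; the cumulative regret is nonnegative by construction, since each per-round term compares against the best arm.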