Linear Regression with Limited Observation

Elad Hazan [email protected]
Tomer Koren [email protected]
Technion — Israel Institute of Technology, Technion City 32000, Haifa, Israel
Abstract
We consider the most common variants of linear regression, including Ridge, Lasso and Support-vector regression, in a setting where the learner is allowed to observe only a fixed number of attributes of each example at training time. We present simple and efficient algorithms for these problems: for Lasso and Ridge regression they need the same total number of attributes (up to constants) as do full-information algorithms, for reaching a certain accuracy. For Support-vector regression, we require exponentially fewer attributes compared to the state of the art. By that, we resolve an open problem recently posed by Cesa-Bianchi et al. (2010). Experiments show the theoretical bounds to be justified by superior performance compared to the state of the art.
1. Introduction
In regression analysis the statistician attempts to learn from examples the underlying variables affecting a given phenomenon. For example, in medical diagnosis a certain combination of conditions reflects whether a patient is afflicted with a certain disease.

In many common regression settings, various limitations are placed on the information available from the examples. In the medical example, not all parameters of a certain patient can be measured due to cost, time and patient reluctance.

In this paper we study the problem of regression in which only a small subset of the attributes per example can be observed. In this setting, we have access to all attributes and we are required to choose which of them to observe. Recently, Cesa-Bianchi et al. (2010) studied this problem and asked the following interesting question: can we efficiently learn the optimal regressor in the attribute-efficient setting with the same total number of attributes as in the unrestricted regression setting?
In other words, the question amounts to whether the information limitation hinders our ability to learn efficiently at all. Ideally, one would hope that instead of observing all attributes of every example, one could compensate for fewer attributes by analyzing more examples, but retain the same overall sample and computational complexity. Indeed, we answer this question in the affirmative for the main variants of regression: Ridge and Lasso. For Support-vector regression (SVR) we make significant advancement, reducing the parameter dependence by an exponential factor. Our results are summarized in Table 1, which gives bounds on the number of examples needed to attain an excess error of ε when at most k attributes are viewable per example. We denote by d the dimension of the attribute space.

Table 1. Our sample complexity bounds. The previous bounds are due to (Cesa-Bianchi et al., 2010); for SVR, the previous bound additionally incorporates the methods of (Cesa-Bianchi et al., 2011), and the number of attributes viewed per example is a random variable whose expectation is k.

    Regression   New bound                    Prev. bound
    Ridge        O(d/(kε²))                   O(d² log(1/ε)/(kε²))
    Lasso        O(d log d/(kε²))             O(d² log(1/ε)/(kε²))
    SVR          O(d/k) · e^{O(log²(1/ε))}    O(e^d/(kε²))

Our bounds imply that for reaching a certain accuracy, our algorithms need the same total number of attributes as their full-information counterparts. In particular, when k = Ω(d) our bounds coincide with those of full-information regression, up to constants (cf. Kakade et al., 2008). We complement these upper bounds and prove that Ω(d/ε²) attributes are in fact necessary to learn an ε-accurate Ridge regressor. For Lasso regression, Cesa-Bianchi et al. (2010) proved that Ω(d/ε²) attributes are necessary, and asked what is the correct dependence on the problem dimension. Our bounds imply that the number of attributes necessary for regression learning grows linearly with the problem dimension.

The algorithms themselves are very simple to implement, and run in linear time. As we show in later sections, these theoretical improvements are clearly visible in experiments on standard datasets.

The setting of learning with limited attribute observation (LAO) was first put forth in (Ben-David & Dichterman, 1998), who coined the term “learning with restricted focus of attention”. Cesa-Bianchi et al. (2010) were the first to discuss linear prediction in the LAO setting, and gave an efficient algorithm (as well as lower bounds) for linear regression, which is the primary focus of this paper.
2. Setting and Result Statement
In the linear regression problem, each instance is a pair (x, y) of an attributes vector x ∈ R^d and a target variable y ∈ R. We assume the standard framework of statistical learning (Haussler, 1992), in which the pairs (x, y) follow a joint probability distribution D over R^d × R. The goal of the learner is to find a vector w for which the linear rule ŷ ← w^⊤x provides a good prediction of the target y. To measure the performance of the prediction, we use a convex loss function ℓ(ŷ, y) : R² → R. The most common choice is the squared loss ℓ(ŷ, y) = (ŷ − y)², which stands for the popular least-squares regression. Hence, in terms of the distribution D, the learner would like to find a regressor w ∈ R^d with low expected loss, defined as

    L_D(w) = E_{(x,y)∼D}[ℓ(w^⊤x, y)].   (1)

The standard paradigm for learning such a regressor is seeking a vector w ∈ R^d that minimizes a trade-off between the expected loss and an additional regularization term, which is usually a norm of w. An equivalent form of this optimization problem is obtained by replacing the regularization term with a proper constraint, giving rise to the problem

    min_{w∈R^d} L_D(w)  s.t.  ‖w‖_p ≤ B,   (2)

where B > 0 is a regularization parameter. The well-known variants of linear regression are obtained by choosing the ℓ_p norm constraint as well as the loss function in the above definition:

• Ridge regression: p = 2 and the squared loss ℓ(ŷ, y) = (ŷ − y)².
• Lasso regression: p = 1 and the squared loss.
• Support-vector regression: p = 2 and the δ-insensitive absolute loss (Vapnik, 1995), ℓ(ŷ, y) = |ŷ − y|_δ := max{0, |ŷ − y| − δ}.

Since the distribution D is unknown, we learn by relying on a training set S = {(x_t, y_t)}_{t=1}^m of examples, assumed to be sampled independently from D. We use the notation ℓ_t(w) := ℓ(w^⊤x_t, y_t) to refer to the loss function induced by the instance (x_t, y_t).

We distinguish between two learning scenarios. In the full information setup, the learner has unrestricted access to the entire data set. In the limited attribute observation (LAO) setting, for any given example pair (x, y), the learner can observe y, but only k attributes of x (where k ≥ 1 is a parameter of the problem). The learner may actively choose which attributes to observe.
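For concreteness, the following minimal Python sketch (ours; the helper names are not from the paper) spells out the two loss functions above and an empirical proxy for the expected loss (1):

    import numpy as np

    def squared_loss(y_hat, y):
        # Ridge and Lasso regression: ell(y_hat, y) = (y_hat - y)^2
        return (y_hat - y) ** 2

    def delta_insensitive_loss(y_hat, y, delta):
        # Support-vector regression: |y_hat - y|_delta
        return max(abs(y_hat - y) - delta, 0.0)

    def empirical_loss(w, X, Y, loss):
        # Empirical counterpart of L_D(w) = E[ell(w^T x, y)] over a sample
        return float(np.mean([loss(np.dot(w, x), y) for x, y in zip(X, Y)]))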
Cesa-Bianchi et al. (2010) proved the following sample complexity lower bound for any LAO Lasso regression algorithm.

Theorem 2.1. Let 0 < ε < 1, k ≥ 1 and d > k. For any regression algorithm accessing at most k attributes per training example, there exists a distribution D over {x : ‖x‖_∞ ≤ 1} × {±1} and a regressor w* with ‖w*‖₁ ≤ 1, such that the algorithm must see (in expectation) at least Ω(d/(kε²)) examples in order to learn a linear regressor w with L_D(w) − L_D(w*) < ε.

We complement this lower bound by providing a stronger lower bound on the sample complexity of any Ridge regression algorithm, using information-theoretic arguments.
Theorem 2.2.
Let ε = Ω(1/√d). For any regression algorithm accessing at most k attributes per training example, there exists a distribution D over {x : ‖x‖₂ ≤ 1} × {±1} and a regressor w* with ‖w*‖₂ ≤ 1, such that the algorithm must see (in expectation) at least Ω(d/(kε²)) examples in order to learn a linear regressor w, ‖w‖₂ ≤ 1, with L_D(w) − L_D(w*) ≤ ε.

Our algorithm for LAO Ridge regression (see Section 3) implies that this lower bound is tight up to constants.
Note, however, that the bound applies only to a particular regime of the problem parameters.

We give efficient regression algorithms that attain the following risk bounds. For our Ridge regression algorithm, we prove the risk bound

    E[L_D(w̄)] ≤ min_{‖w‖₂≤B} L_D(w) + O(B² √(d/(km))),

while for our Lasso regression algorithm we establish the bound

    E[L_D(w̄)] ≤ min_{‖w‖₁≤B} L_D(w) + O(B² √(d log d/(km))).

Here we use w̄ to denote the output of each algorithm on a training set of m examples, and the expectations are taken with respect to the randomization of the algorithms. For Support-vector regression we obtain a risk bound that depends on the desired accuracy ε. Our bound implies that

    m = O(d/k) · exp(O(log²(B/ε)))

examples are needed (in expectation) for obtaining an ε-accurate regressor.
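To see what these bounds mean for the total attribute budget, the following back-of-the-envelope Python sketch (ours, inverting the Ridge bound of Theorem 3.1 below with its explicit constants) computes the number of examples m needed for a target excess risk ε. The product k·m, i.e., the total number of attributes read, is independent of k:

    import math

    def aerr_examples(d, k, eps, B=1.0):
        # Invert the Theorem 3.1 bound 8*B^2*sqrt(2d/(k*m)) <= eps for m.
        return math.ceil(128 * B**4 * d / (k * eps**2))

    for k in (1, 4, 56, 784):
        m = aerr_examples(d=784, k=k, eps=0.1)
        print(k, m, k * m)   # k*m = 128*B^4*d/eps^2, independent of k

This is precisely the sense in which the information limitation does not hinder learning: fewer attributes per example are traded for proportionally more examples.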
3. Algorithms for LAO least-squares regression
In this section we present and analyze our algorithms for Ridge and Lasso regression in the LAO setting. The loss function under consideration here is the squared loss, that is, ℓ_t(w) = (w^⊤x_t − y_t)². (In the full-information setting, there are algorithms that are known to converge at the faster O(1/ε) rate; see e.g. Hazan et al., 2007.) For convenience, we show algorithms that use k + 1 attributes of each instance, for k ≥ 1. (We note that by our approach it is impossible to learn using a single attribute of each example, i.e., with k = 0, and we are not aware of any algorithm that is able to do so; see Cesa-Bianchi et al. 2011 for a related impossibility result.)

Our algorithms are iterative and maintain a regressor w_t along the iterations. The update of the regressor at iteration t is based on gradient information, and specifically on g_t := ∇ℓ_t(w_t), which equals 2(w_t^⊤x_t − y_t)·x_t for the squared loss. In the LAO setting, however, we do not have access to this information, thus we build upon unbiased estimators of the gradients.

Recall that in Ridge regression, we are interested in the linear regressor that is the solution to the optimization problem (2) with p = 2, given explicitly as

    min_{w∈R^d} L_D(w)  s.t.  ‖w‖₂ ≤ B.   (3)

Our algorithm for the LAO setting is based on a randomized Online Gradient Descent (OGD) strategy (Zinkevich, 2003). More specifically, at each iteration t we use a randomized estimator g̃_t of the gradient g_t to update the regressor w_t via an additive rule. Our gradient estimators make use of an importance-sampling method inspired by (Clarkson et al., 2010). The pseudo-code of our Attribute Efficient Ridge Regression (AERR) algorithm is given in Algorithm 1.

Algorithm 1 AERR
    Parameters: B, η > 0
    Input: training set S = {(x_t, y_t)}_{t∈[m]} and k > 0
    Output: regressor w̄ with ‖w̄‖₂ ≤ B
    Initialize w₁ arbitrarily with ‖w₁‖₂ ≤ B
    for t = 1 to m do
        for r = 1 to k do
            Pick i_{t,r} ∈ [d] uniformly and observe x_t[i_{t,r}]
            x̃_{t,r} ← d · x_t[i_{t,r}] · e_{i_{t,r}}
        end for
        x̃_t ← (1/k) Σ_{r=1}^k x̃_{t,r}
        Choose j_t ∈ [d] with probability w_t[j]²/‖w_t‖₂², and observe x_t[j_t]
        φ̃_t ← ‖w_t‖₂² · x_t[j_t]/w_t[j_t] − y_t
        g̃_t ← 2 φ̃_t · x̃_t
        v_t ← w_t − η g̃_t
        w_{t+1} ← v_t · B / max{‖v_t‖₂, B}
    end for
    w̄ ← (1/m) Σ_{t=1}^m w_t

In the following theorem, we show that the regressor learned by our algorithm is competitive with the optimal linear regressor having 2-norm bounded by B.
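The following Python sketch (ours) mirrors Algorithm 1 under the assumptions ‖x_t‖₂ ≤ 1 and |y_t| ≤ B; the edge case w_t = 0, where w_t^⊤x_t is exactly zero, is handled explicitly:

    import numpy as np

    def aerr(X, Y, B, k, eta, seed=0):
        # Sketch of Algorithm 1 (AERR); reads k+1 attributes per example.
        rng = np.random.default_rng(seed)
        m, d = X.shape
        w = np.zeros(d)                 # any w with ||w||_2 <= B is a valid start
        w_sum = np.zeros(d)
        for t in range(m):
            w_sum += w                  # w-bar averages the iterates w_1, ..., w_m
            # unbiased estimate of x_t from k uniform attribute probes
            x_est = np.zeros(d)
            for i in rng.integers(0, d, size=k):
                x_est[i] += d * X[t, i] / k
            # importance-sampled residual estimate (one extra attribute)
            sq = float(np.dot(w, w))
            if sq > 0:
                p = w ** 2 / sq
                j = rng.choice(d, p=p / p.sum())
                phi = sq * X[t, j] / w[j] - Y[t]
            else:
                phi = -Y[t]             # w = 0 implies w^T x_t = 0 exactly
            g_est = 2.0 * phi * x_est   # unbiased estimate of grad (w^T x - y)^2
            v = w - eta * g_est
            w = v * B / max(np.linalg.norm(v), B)   # project onto the 2-norm ball
        return w_sum / m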
Theorem 3.1. Assume the distribution D is such that ‖x‖₂ ≤ 1 and |y| ≤ B with probability 1. Let w̄ be the output of AERR, when run with η = √(k/(8dm)). Then ‖w̄‖₂ ≤ B, and for any w* ∈ R^d with ‖w*‖₂ ≤ B,

    E[L_D(w̄)] ≤ L_D(w*) + 8B² √(2d/(km)).

Theorem 3.1 is a consequence of the following two lemmas. The first lemma is obtained as a result of a standard regret bound for the OGD algorithm (see Zinkevich 2003), applied to the vectors g̃₁, ..., g̃_m.

Lemma 3.2.
For any w* with ‖w*‖₂ ≤ B we have

    Σ_{t=1}^m g̃_t^⊤(w_t − w*) ≤ 2B²/η + (η/2) Σ_{t=1}^m ‖g̃_t‖₂².   (4)

The second lemma shows that the vector g̃_t is an unbiased estimator of the gradient g_t := ∇ℓ_t(w_t) at iteration t, and establishes a “variance” bound for this estimator. To simplify notation, here and in the rest of the paper we use E_t[·] to denote the conditional expectation with respect to all randomness up to time t.

Lemma 3.3.
The vector g̃_t is an unbiased estimator of the gradient g_t := ∇ℓ_t(w_t), that is, E_t[g̃_t] = g_t. In addition, for all t we have E_t[‖g̃_t‖₂²] ≤ 32B²d/k.

For a proof of the lemma, see (Hazan & Koren, 2011). We now turn to prove Theorem 3.1.
Proof (of Theorem 3.1).
First note that as ‖w_t‖₂ ≤ B for all t, we clearly have ‖w̄‖₂ ≤ B. Taking the expectation of (4) with respect to the randomization of the algorithm, and letting G² := max_t E_t[‖g̃_t‖₂²], we obtain

    E[Σ_{t=1}^m g_t^⊤(w_t − w*)] ≤ 2B²/η + (η/2) G² m.

On the other hand, the convexity of ℓ_t gives ℓ_t(w_t) − ℓ_t(w*) ≤ g_t^⊤(w_t − w*). Together with the above, this implies that for η = 2B/(G√m),

    E[(1/m) Σ_{t=1}^m ℓ_t(w_t)] ≤ (1/m) Σ_{t=1}^m ℓ_t(w*) + 2BG/√m.

Taking the expectation of both sides with respect to the random choice of the training set, and using G ≤ 4B√(2d/k) (according to Lemma 3.3), we get

    E[(1/m) Σ_{t=1}^m L_D(w_t)] ≤ L_D(w*) + 8B² √(2d/(km)).

Finally, recalling the convexity of L_D and using Jensen’s inequality, the theorem follows.

We now turn to describe our algorithm for Lasso regression in the LAO setting, in which we would like to solve the problem

    min_{w∈R^d} L_D(w)  s.t.  ‖w‖₁ ≤ B.   (5)
The algorithm we provide for this problem is based on a stochastic variant of the EG algorithm (Kivinen & Warmuth, 1997), which employs multiplicative updates based on an estimation of the gradients ∇ℓ_t. The multiplicative nature of the algorithm, however, makes it highly sensitive to the magnitude of the updates. To make the updates more robust, we “clip” the entries of the gradient estimator so as to prevent them from getting too large. Formally, this is accomplished via the following clip operation:

    clip(x, c) := max{min{x, c}, −c}  for x ∈ R and c > 0.

This clipping has an even stronger effect in the more general setting we consider in Section 4.

We give our Attribute Efficient Lasso Regression (AELR) algorithm in Algorithm 2, and establish a corresponding risk bound in the following theorem.

Algorithm 2 AELR
    Parameters: B, η > 0
    Input: training set S = {(x_t, y_t)}_{t∈[m]} and k > 0
    Output: regressor w̄ with ‖w̄‖₁ ≤ B
    Initialize z₁⁺ ← 1_d, z₁⁻ ← 1_d
    for t = 1 to m do
        w_t ← (z_t⁺ − z_t⁻) · B/(‖z_t⁺‖₁ + ‖z_t⁻‖₁)
        for r = 1 to k do
            Pick i_{t,r} ∈ [d] uniformly and observe x_t[i_{t,r}]
            x̃_{t,r} ← d · x_t[i_{t,r}] · e_{i_{t,r}}
        end for
        x̃_t ← (1/k) Σ_{r=1}^k x̃_{t,r}
        Choose j_t ∈ [d] with probability |w_t[j]|/‖w_t‖₁, and observe x_t[j_t]
        φ̃_t ← ‖w_t‖₁ · sign(w_t[j_t]) · x_t[j_t] − y_t
        g̃_t ← 2 φ̃_t · x̃_t
        for i = 1 to d do
            ḡ_t[i] ← clip(g̃_t[i], 1/η)
            z⁺_{t+1}[i] ← z⁺_t[i] · exp(−η ḡ_t[i])
            z⁻_{t+1}[i] ← z⁻_t[i] · exp(+η ḡ_t[i])
        end for
    end for
    w̄ ← (1/m) Σ_{t=1}^m w_t
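A Python sketch of Algorithm 2 (ours), reusing the attribute-sampling scheme of AERR and adding the clipped multiplicative update:

    import numpy as np

    def aelr(X, Y, B, k, eta, seed=0):
        # Sketch of Algorithm 2 (AELR): clipped, stochastic EG updates on the
        # 2d weights (z_plus, z_minus). Assumes ||x_t||_inf <= 1 and |y_t| <= B.
        rng = np.random.default_rng(seed)
        m, d = X.shape
        z_plus, z_minus = np.ones(d), np.ones(d)
        w_sum = np.zeros(d)
        for t in range(m):
            w = (z_plus - z_minus) * B / (z_plus.sum() + z_minus.sum())
            w_sum += w
            x_est = np.zeros(d)
            for i in rng.integers(0, d, size=k):
                x_est[i] += d * X[t, i] / k
            l1 = np.abs(w).sum()
            if l1 > 0:
                p = np.abs(w) / l1
                j = rng.choice(d, p=p / p.sum())
                phi = l1 * np.sign(w[j]) * X[t, j] - Y[t]
            else:
                phi = -Y[t]
            # gradient estimate, entrywise clipped at 1/eta
            g_bar = np.clip(2.0 * phi * x_est, -1.0 / eta, 1.0 / eta)
            z_plus *= np.exp(-eta * g_bar)
            z_minus *= np.exp(+eta * g_bar)
        return w_sum / m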
Theorem 3.4.
Assume the distribution D is such that ‖x‖_∞ ≤ 1 and |y| ≤ B with probability 1. Let w̄ be the output of AELR, when run with η = (1/(4B)) √(k log(2d)/(10dm)). Then ‖w̄‖₁ ≤ B, and for any w* ∈ R^d with ‖w*‖₁ ≤ B we have

    E[L_D(w̄)] ≤ L_D(w*) + 8B² √(10 d log(2d)/(km)),

provided that m ≥ log(2d).

In the rest of the section, for a vector v we let v² denote the vector for which v²[i] = (v[i])² for all i. In order to prove Theorem 3.4, we first consider the augmented vectors z'_t := (z_t⁺, z_t⁻) ∈ R^{2d} and ḡ'_t := (ḡ_t, −ḡ_t) ∈ R^{2d}, and let p_t := z'_t/‖z'_t‖₁. For these vectors, we have the following.

Lemma 3.5.

    Σ_{t=1}^m p_t^⊤ ḡ'_t ≤ min_{i∈[2d]} Σ_{t=1}^m ḡ'_t[i] + log(2d)/η + η Σ_{t=1}^m p_t^⊤ (ḡ'_t)².

The lemma is a consequence of a second-order regret bound for the Multiplicative-Weights algorithm, essentially due to (Clarkson et al., 2010). By means of this lemma, we establish a risk bound with respect to the “clipped” linear functions ḡ_t^⊤w.

Lemma 3.6.
Assume that ‖E_t[g̃_t²]‖_∞ ≤ G² for all t, for some G > 0. Then, for any w* with ‖w*‖₁ ≤ B, we have

    E[Σ_{t=1}^m ḡ_t^⊤ w_t] ≤ E[Σ_{t=1}^m ḡ_t^⊤ w*] + B(log(2d)/η + η G² m).

Our next step is to relate the risk generated by the linear functions g̃_t^⊤w to that generated by the “clipped” functions ḡ_t^⊤w.

Lemma 3.7.
Assume that ‖E_t[g̃_t²]‖_∞ ≤ G² for all t, for some G > 0. Then, for 0 < η ≤ 1/(2G), we have

    E[Σ_{t=1}^m g̃_t^⊤(w_t − w*)] ≤ E[Σ_{t=1}^m ḡ_t^⊤(w_t − w*)] + 4BηG²m.

The final component of the proof is a “variance” bound, similar to that of Lemma 3.3.
Lemma 3.8.
The vector g̃_t is an unbiased estimator of the gradient g_t := ∇ℓ_t(w_t), that is, E_t[g̃_t] = g_t. In addition, for all t we have ‖E_t[g̃_t²]‖_∞ ≤ 32B²d/k.

For the complete proofs, refer to (Hazan & Koren, 2011). We are now ready to prove Theorem 3.4.
Proof (of Theorem 3.4).
Since ‖w_t‖₁ ≤ B for all t, we obtain ‖w̄‖₁ ≤ B. Next, note that as E_t[g̃_t] = g_t, we have E[Σ_{t=1}^m g̃_t^⊤(w_t − w*)] = E[Σ_{t=1}^m g_t^⊤(w_t − w*)]. Putting Lemmas 3.6 and 3.7 together, we get for η ≤ 1/(2G) that

    E[Σ_{t=1}^m g_t^⊤(w_t − w*)] ≤ B(log(2d)/η + 5ηG²m).

Proceeding as in the proof of Theorem 3.1, and choosing η = (1/G)√(log(2d)/(5m)), we obtain the bound

    E[L_D(w̄)] ≤ L_D(w*) + 2BG √(5 log(2d)/m).

Note that for this choice of η we indeed have η ≤ 1/(2G), as we originally assumed that m ≥ log(2d). Finally, putting G = 4B√(2d/k) as implied by Lemma 3.8, we obtain the bound in the statement of the theorem.
4. Support-vector regression
In this section we show how our approach can be extended to deal with loss functions other than the squared loss, of the form

    ℓ(w^⊤x, y) = f(w^⊤x − y),   (6)

with f real and convex, and most importantly with the δ-insensitive absolute loss function of SVR, for which f(x) = |x|_δ := max{|x| − δ, 0} for some fixed 0 ≤ δ ≤ B (recall that in our results we assume the labels satisfy |y_t| ≤ B). For concreteness, we consider only the 2-norm variant of the problem (as in the standard formulation of SVR); the results we obtain can be easily adjusted to the 1-norm setting. We overload notation, and keep using the shorthand ℓ_t(w) := ℓ(w^⊤x_t, y_t) for the loss function induced by the instance (x_t, y_t).

It should be highlighted that our techniques can be adapted to deal with many other common loss functions, including “classification” losses (i.e., of the form ℓ(w^⊤x, y) = f(y·w^⊤x)). Due to its importance and popularity, we chose to describe our method in the context of SVR.

Unfortunately, there are strong indications that SVR learning (more generally, learning with a non-smooth loss function) in the LAO setting is impossible via our approach of unbiased gradient estimation (see Cesa-Bianchi et al. 2011 and the references therein). For that reason, we make two modifications to the learning setting: first, we henceforth relax the budget constraint to allow k observed attributes per instance in expectation; and second, we aim for biased gradient estimators, instead of unbiased as before.

To obtain such biased estimators, we uniformly ε-approximate the function f by an analytic function f_ε and learn with the approximate loss function ℓ_t^ε(w) = f_ε(w^⊤x_t − y_t) instead. Clearly, any ε-suboptimal regressor of the approximate problem is a 2ε-suboptimal regressor of the original problem. For learning the approximate problem we use a novel technique, inspired by (Cesa-Bianchi et al., 2011), for estimating gradients of analytic loss functions. Our estimators of ∇ℓ_t^ε can then be viewed as biased estimators of ∇ℓ_t (we note, however, that the resulting bias might be quite large).
Let f : R → R be a real function that is analytic on the entire real line. The derivative f' is thus also analytic and can be expressed as f'(x) = Σ_{n=0}^∞ a_n x^n, where {a_n} are the Taylor expansion coefficients of f'. In Procedure 3 we give an unbiased estimator of f'(w^⊤x − y) in the LAO setting, defined in terms of the coefficients {a_n} of f'.

Procedure 3 GenEst
    Parameters: {a_n}_{n=0}^∞ — Taylor coefficients of f'
    Input: regressor w, instance (x, y)
    Output: φ̂ with E[φ̂] = f'(w^⊤x − y)
    Let N = ⌈8B²⌉ and ν = 2 log N
    Choose n ≥ 0 with probability Pr[n] = (1/2)^{n+1}
    if n ≤ ν then
        for r = 1, ..., n do
            Choose j ∈ [d] with probability w[j]²/‖w‖₂², and observe x[j]
            θ̃_r ← ‖w‖₂² · x[j]/w[j] − y
        end for
    else
        for r = 1, ..., n do
            Choose j₁, ..., j_N ∈ [d] with probability w[j]²/‖w‖₂² (independently), and observe x[j₁], ..., x[j_N]
            θ̃_r ← (1/N) Σ_{s=1}^N ‖w‖₂² · x[j_s]/w[j_s] − y
        end for
    end if
    φ̂ ← 2^{n+1} a_n · θ̃₁ θ̃₂ ··· θ̃_n

For this estimator, we have the following (the proof is given in Appendix A).
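The following Python sketch (ours) mirrors Procedure 3; the constants N and the case threshold ν follow our reading of the procedure and should be treated as indicative rather than exact:

    import numpy as np

    def gen_est(a, w, x, y, B, seed=None):
        # Sketch of Procedure 3 (GenEst). Assumes w != 0.
        rng = np.random.default_rng(seed)
        d = len(w)
        N = int(np.ceil(8 * B * B))
        nu = 2 * np.log2(N) if N > 1 else 0.0
        n = int(rng.geometric(0.5)) - 1   # Pr[n] = (1/2)^(n+1), n = 0, 1, 2, ...
        if n >= len(a) or a[n] == 0.0:
            return 0.0                     # vanishing Taylor coefficient
        sq = float(np.dot(w, w))
        p = w ** 2 / sq
        reps = 1 if n <= nu else N         # average N probes when n is large
        prod = 1.0
        for _ in range(n):
            js = rng.choice(d, size=reps, p=p / p.sum())
            prod *= float(np.mean(sq * x[js] / w[js])) - y
        return 2.0 ** (n + 1) * a[n] * prod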
Lemma 4.1. The estimator φ̂ is an unbiased estimator of f'(w^⊤x − y). Also, assuming ‖x‖₂ ≤ 1, ‖w‖₂ ≤ B and |y| ≤ B, the second moment E[φ̂²] is upper bounded by exp(O(log² B)), provided that the Taylor series of f'(x) converges absolutely for |x| ≤ 1. Finally, the expected number of attributes of x used by this estimator is no more than 3.

In order to approximate the δ-insensitive absolute loss function, we define

    f_ε(x) = (ε/2) ρ((x − δ)/ε) + (ε/2) ρ((x + δ)/ε) − δ,

where ρ is expressed in terms of the error function erf,

    ρ(x) = x · erf(x) + (1/√π) e^{−x²},

and consider the approximate loss functions ℓ_t^ε(w) = f_ε(w^⊤x_t − y_t).
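The quality of this approximation is easy to check numerically; the sketch below (ours) implements ρ and f_ε with Python's math.erf and evaluates the gap to |x|_δ on a grid:

    import math

    def rho(x):
        return x * math.erf(x) + math.exp(-x * x) / math.sqrt(math.pi)

    def f_eps(x, delta, eps):
        # the analytic surrogate for the delta-insensitive loss |x|_delta
        return 0.5 * eps * (rho((x - delta) / eps) + rho((x + delta) / eps)) - delta

    def abs_delta(x, delta):
        return max(abs(x) - delta, 0.0)

    # empirical check of Claim 4.2: sup |f_eps(x) - |x|_delta| <= eps
    gap = max(abs(f_eps(i / 100, 1.0, 0.1) - abs_delta(i / 100, 1.0))
              for i in range(-500, 501))
    print(gap)   # about 0.028 here, within the guaranteed eps = 0.1

The gap is largest near the kinks x = ±δ of the original loss, where the smoothing is most visible.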
Indeed, we have the following.

Claim 4.2. For any ε > 0, f_ε is convex, analytic on the entire real line, and sup_{x∈R} |f_ε(x) − |x|_δ| ≤ ε.

The claim follows easily from the identity

    |x|_δ = ½|x − δ| + ½|x + δ| − δ.

In addition, for using Procedure 3 we need the following simple observation, which follows immediately from the series expansion of erf(x).

Claim 4.3. ρ'(x) = Σ_{n=0}^∞ a_{2n+1} x^{2n+1}, with the coefficients {a_{2n+1}}_{n≥0} given in (7).

We now give the main result of this section, which is a sample complexity bound for the Attribute Efficient SVR (AESVR) algorithm, given in Algorithm 4.

Algorithm 4 AESVR
    Parameters: B, δ, η > 0 and ε > 0
    Input: training set S = {(x_t, y_t)}_{t∈[m]} and k > 0
    Output: regressor w̄ with ‖w̄‖₂ ≤ B
    Let a_{2n} = 0 for n ≥ 0, and
        a_{2n+1} = (2/√π) · (−1)^n / (n!(2n+1)),  n ≥ 0.   (7)
    Execute Algorithm 1, with the computations of φ̃_t and g̃_t replaced by:
        x'_t ← x_t/ε
        y_t⁺ ← (y_t + δ)/ε,  y_t⁻ ← (y_t − δ)/ε
        φ̃_t ← ½ [GenEst(w_t, x'_t, y_t⁺) + GenEst(w_t, x'_t, y_t⁻)]
        g̃_t ← φ̃_t · x̃_t
    Return the output w̄ of the algorithm
Theorem 4.4.
Assume the distribution D is such that ‖x‖₂ ≤ 1 and |y| ≤ B with probability 1. Then, for any w* ∈ R^d with ‖w*‖₂ ≤ B, we have

    E[L_D(w̄)] ≤ L_D(w*) + ε,

where w̄ is the output of AESVR (with η properly tuned) on a training set of size

    m = O(d/k) · exp(O(log²(B/ε))).   (8)

The algorithm queries at most k + 6 attributes of each instance in expectation.

Proof. First, note that for the approximate loss functions ℓ_t^ε we have

    ∇ℓ_t^ε(w_t) = ½ [ρ'(w_t^⊤x'_t − y_t⁺) + ρ'(w_t^⊤x'_t − y_t⁻)] · x_t.

Hence, Lemma 4.1 and Claim 4.3 above imply that g̃_t in Algorithm 4 is an unbiased estimator of ∇ℓ_t^ε(w_t). Furthermore, since ‖x'_t‖₂ ≤ 1/ε and |y_t^±| ≤ 2B/ε, according to the same lemma we have E_t[φ̃_t²] = exp(O(log²(B/ε))). Repeating the proof of Lemma 3.3, we then have

    E_t[‖g̃_t‖₂²] = E_t[φ̃_t²] · E_t[‖x̃_t‖₂²] ≤ exp(O(log²(B/ε))) · (2d/k).

Replacing G² in the proof of Theorem 3.1 with the above bound, we get for the output of Algorithm 4

    E[L_D(w̄)] ≤ L_D(w*) + exp(O(log²(B/ε))) √(d/(km)),

which implies that for obtaining an ε-accurate regressor w̄ of the approximate problem, it is enough to take m as given in (8). However, Claim 4.2 now gives that w̄ itself is a 2ε-accurate regressor of the original problem, and the proof is complete.
5. Experiments
In this section we give experimental evidence that supports our theoretical bounds, and demonstrates the superior performance of our algorithms compared to the state of the art. Naturally, we compare our AERR and AELR algorithms with the AER algorithm of (Cesa-Bianchi et al., 2010). We note that AER is in fact a hybrid algorithm that combines 1-norm and 2-norm regularizations, thus we use it for benchmarking in both the Ridge and Lasso settings. (The AESVR algorithm is presented mainly for theoretical considerations, and was not implemented in the experiments.)

We essentially repeated the experiments of (Cesa-Bianchi et al., 2010) and used the popular MNIST digit recognition dataset (LeCun et al., 1998). Each instance in this dataset is a 28 × 28 image of a handwritten digit 0–9. We focused on the “3 vs. 5” task, on a subset of the dataset that consists of the “3” digits (labeled −1) and the “5” digits (labeled +1). We applied the regression algorithms to this task by regressing to the labels.

In all our experiments, we randomly split the data into training and test sets, and used 10-fold cross-validation for tuning the parameters of each algorithm. Then, we ran each algorithm on increasingly longer prefixes of the dataset and tracked the obtained squared error on the test set. For faithfully comparing partial- and full-information algorithms, we also recorded the total number of attributes used by each algorithm.

In our first experiment, we executed AELR, AER and (offline) Lasso on the “3 vs. 5” task. We allowed both AELR and AER to use only k = 4 pixels of each training image, while giving Lasso unrestricted access to the entire set of attributes (a total of 784) of each instance. The results, averaged over 10 runs on random train/test splits, are presented in Figure 1.

Figure 1. Test squared error of Lasso algorithms with k = 4, over increasing prefixes of the “3 vs. 5” dataset (AELR vs. AER vs. offline Lasso).

Note that the x-axis represents the cumulative number of attributes used for training. The graph ends at roughly 48,500 attributes, which is the total number of attributes allowed for the partial-information algorithms. Lasso, however, exhausts this budget after seeing merely 62 examples.

As we see from the results, AELR keeps its test error significantly lower than that of AER along the entire execution, almost bridging the gap with the full-information Lasso. Note that the latter has the clear advantage of being an offline algorithm, while both AELR and AER are online in nature. Indeed, when we compared AELR with an online Lasso solver, our algorithm obtained test error almost 10 times better.

In the second experiment, we evaluated AERR, AER and Ridge regression on the same task, but now allowing the partial-information algorithms to use as much as k = 56 pixels (which amounts to 2 rows) of each instance. The results of this experiment are given in Figure 2. We see that even when the algorithms are allowed to view a considerable number of attributes, the gap between AERR and AER remains large.

Figure 2. Test squared error of Ridge algorithms with k = 56, over increasing prefixes of the “3 vs. 5” dataset (AERR vs. AER vs. offline Ridge).
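The attribute accounting behind the x-axes of both figures is straightforward; a small sketch (ours, taking the per-example count of k + 1 for the partial-information algorithms from Section 3):

    d = 28 * 28            # pixels per MNIST image
    budget = 48_500        # total attribute budget in the first experiment
    print(round(budget / d))     # ~62: examples that exhaust the budget offline
    print(budget // (4 + 1))     # 9700: examples available to AELR/AER at k = 4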
6. Conclusions and Open Questions

We have considered the fundamental problem of statistical regression analysis, and in particular Lasso and Ridge regression, in a setting where the observation upon each training instance is limited to a few attributes, and gave algorithms that improve over the state of the art by a leading order term with respect to the sample complexity. This resolves an open question of (Cesa-Bianchi et al., 2010). The algorithms are efficient, and give a clear experimental advantage in previously-considered benchmarks.

For the challenging case of regression with general convex loss functions, we described an exponential improvement in sample complexity, which applies in particular to support-vector regression.

It is interesting to resolve the sample complexity gap of log d which still remains for Lasso regression, and to improve upon the pseudo-polynomial factor in ε for support-vector regression. In addition, establishing analogous bounds for our algorithms that hold with high probability (rather than in expectation) appears to be non-trivial, and is left for future work.

Another possible direction for future research is adapting our results to the setting of learning with (randomly) missing data, which was recently investigated; see e.g. (Rostamizadeh et al., 2011; Loh & Wainwright, 2011). The sample complexity bounds our algorithms obtain in this setting are slightly worse than those presented in the current paper, and it is interesting to check whether one can do better.

Acknowledgments
We thank Shai Shalev-Shwartz for several useful discussions, and the anonymous referees for their detailed comments.
References
Ben-David, S. and Dichterman, E. Learning with restricted focus of attention. Journal of Computer and System Sciences, 56(3):277–298, 1998.

Cesa-Bianchi, N., Shalev-Shwartz, S., and Shamir, O. Efficient learning with partially observed attributes. In Proceedings of the 27th International Conference on Machine Learning, 2010.

Cesa-Bianchi, N., Shalev-Shwartz, S., and Shamir, O. Online learning of noisy data. IEEE Transactions on Information Theory, 57(12):7907–7931, December 2011.

Clarkson, K. L., Hazan, E., and Woodruff, D. P. Sublinear optimization for machine learning. In 51st Annual IEEE Symposium on Foundations of Computer Science, pp. 449–457. IEEE, 2010.

Haussler, D. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78–150, 1992.

Hazan, E. and Koren, T. Optimal algorithms for ridge and lasso regression with partially observed attributes. arXiv preprint arXiv:1108.4559, 2011.

Hazan, E., Agarwal, A., and Kale, S. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2):169–192, 2007.

Kakade, S. M., Sridharan, K., and Tewari, A. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, volume 22, 2008.

Kivinen, J. and Warmuth, M. K. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Loh, P. L. and Wainwright, M. J. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. In Advances in Neural Information Processing Systems, 2011.

Rostamizadeh, A., Agarwal, A., and Bartlett, P. Learning with missing features. In The 27th Conference on Uncertainty in Artificial Intelligence, 2011.

Vapnik, V. N. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pp. 928–936, 2003.
A. Proofs
A.1. Proof of Lemma 3.3
It is straightforward to verify that E_t[x̃_{t,r}] = x_t for all r, thus also E_t[x̃_t] = x_t. In addition, we have

    E_t[φ̃_t] = Σ_{j=1}^d (w_t[j]²/‖w_t‖₂²) · (‖w_t‖₂² x_t[j]/w_t[j]) − y_t = w_t^⊤x_t − y_t.

Hence, and since x̃_t and φ̃_t are independent, we obtain that E_t[g̃_t] equals 2(w_t^⊤x_t − y_t)·x_t, which is exactly the gradient g_t = ∇ℓ_t(w_t).

Let us turn to bound E_t[‖g̃_t‖₂²]. It is easy to verify that E_t[‖x̃_{t,r}‖₂²] = d‖x_t‖₂² for all r, and since E_t[x̃_{t,r}] = x_t and the x̃_{t,r}'s are independent, we have

    E_t[‖x̃_t‖₂²] = (1/k²) Σ_r E_t[‖x̃_{t,r}‖₂²] + (1/k²) Σ_{r≠s} E_t[x̃_{t,r}]^⊤ E_t[x̃_{t,s}] = ((d + k − 1)/k) ‖x_t‖₂².   (9)

This gives the bound E_t[‖x̃_t‖₂²] ≤ 2d/k. On the other hand, recalling that |y_t| ≤ B and using the inequality (a − b)² ≤ 2(a² + b²), we obtain

    E_t[φ̃_t²] ≤ 2(‖w_t‖₂² ‖x_t‖₂² + y_t²) ≤ 4B².   (10)

Finally, from (9), (10) and via independence we have E_t[‖g̃_t‖₂²] ≤ 32B²d/k, and the lemma follows.

A.2. Proof of Lemma 3.5
The lemma is a direct consequence of the following second-order bound for the MW algorithm, which is essentially a simplified version of Lemma II.3 of (Clarkson et al., 2010).
Lemma A.1.
Let η > 0, and let c₁, ..., c_T be an arbitrary sequence of vectors in R^n with c_t[i] ≥ −1/η for all t and i. Define a sequence z₁, ..., z_{T+1} by letting z₁ ← 1_n and, for t ≥ 1,

    z_{t+1}[i] ← z_t[i] · exp(−η c_t[i]),  i = 1, ..., n.

Then, for the vectors p_t := z_t/‖z_t‖₁ we have

    Σ_{t=1}^T p_t^⊤ c_t ≤ min_{i∈[n]} Σ_{t=1}^T c_t[i] + log n/η + η Σ_{t=1}^T p_t^⊤ c_t².

To see how the lemma follows from the above bound, note that we can write the update rule of Algorithm 2, in terms of the augmented vectors z'_t and ḡ'_t, as follows:

    z'_{t+1}[i] = z'_t[i] · exp(−η ḡ'_t[i]),  i = 1, 2, ..., 2d.

That is, z'_{t+1} is obtained from z'_t by a multiplicative update based on the vector ḡ'_t. Noticing that ‖ḡ'_t‖_∞ = ‖ḡ_t‖_∞ ≤ 1/η, we see from Lemma A.1 that for any i*,

    Σ_{t=1}^m p_t^⊤ ḡ'_t ≤ Σ_{t=1}^m ḡ'_t[i*] + log(2d)/η + η Σ_{t=1}^m p_t^⊤ (ḡ'_t)²,

where p_t := z'_t/‖z'_t‖₁, which gives the lemma.

For completeness, we provide a proof of Lemma A.1.

Proof (of Lemma A.1).
Using the fact that e^x ≤ 1 + x + x² for x ≤ 1, we have

    ‖z_{t+1}‖₁ = Σ_{i=1}^n z_t[i] e^{−ηc_t[i]} ≤ Σ_{i=1}^n z_t[i](1 − ηc_t[i] + η²c_t[i]²) = ‖z_t‖₁ (1 − η p_t^⊤c_t + η² p_t^⊤c_t²),

and since 1 + z ≤ e^z for z ∈ R, this implies by induction that

    log ‖z_{T+1}‖₁ = log n + Σ_{t=1}^T log(1 − η p_t^⊤c_t + η² p_t^⊤c_t²) ≤ log n − η Σ_{t=1}^T p_t^⊤c_t + η² Σ_{t=1}^T p_t^⊤c_t².   (11)

On the other hand, we have

    log ‖z_{T+1}‖₁ = log Σ_{i=1}^n Π_{t=1}^T e^{−ηc_t[i]} ≥ log Π_{t=1}^T e^{−ηc_t[i*]} = −η Σ_{t=1}^T c_t[i*].   (12)

Combining (11) and (12) and rearranging, we obtain

    Σ_{t=1}^T p_t^⊤c_t ≤ Σ_{t=1}^T c_t[i*] + log n/η + η Σ_{t=1}^T p_t^⊤c_t²

for any i*, which completes the proof.

A.3. Proof of Lemma 3.6
Notice that by our notation,

    Σ_{t=1}^m p_t^⊤ ḡ'_t = Σ_{t=1}^m (z_t⁺, z_t⁻)^⊤(ḡ_t, −ḡ_t) / (‖z_t⁺‖₁ + ‖z_t⁻‖₁) = (1/B) Σ_{t=1}^m w_t^⊤ ḡ_t,

and

    min_{i∈[2d]} Σ_{t=1}^m ḡ'_t[i] = min_{‖w‖₁≤B} (1/B) Σ_{t=1}^m w^⊤ ḡ_t ≤ (1/B) Σ_{t=1}^m w*^⊤ ḡ_t

for any w* with ‖w*‖₁ ≤ B. Plugging into the bound of Lemma 3.5, we get

    Σ_{t=1}^m ḡ_t^⊤(w_t − w*) ≤ B(log(2d)/η + η Σ_{t=1}^m p_t^⊤(ḡ'_t)²).

Finally, taking the expectation with respect to the randomization of the algorithm, and noticing that ‖E_t[(ḡ'_t)²]‖_∞ ≤ ‖E_t[g̃_t²]‖_∞ ≤ G², the proof is complete.

A.4. Proof of Lemma 3.7
For the proof we need a simple lemma that allows us to bound the deviation of the expected value of a clipped random variable from that of the original variable, in terms of its variance.
Lemma A.2.
Let X be a random variable with |E[X]| ≤ C/2 for some C > 0. Then the clipped variable X̄ := clip(X, C) = max{min{X, C}, −C} satisfies

    |E[X̄] − E[X]| ≤ 2 var[X]/C.
Now, notice that ‖E_t[g̃_t²]‖_∞ ≤ G² implies ‖E_t[g̃_t]‖_∞ ≤ G, as

    ‖E_t[g̃_t]‖_∞² = ‖E_t[g̃_t]²‖_∞ ≤ ‖E_t[g̃_t²]‖_∞.

Since ḡ_t[i] = clip(g̃_t[i], 1/η) and |E_t[g̃_t[i]]| ≤ G ≤ 1/(2η), the above lemma implies that

    |E_t[ḡ_t[i]] − E_t[g̃_t[i]]| ≤ 2η E_t[g̃_t[i]²] ≤ 2ηG²

for all i, which means that ‖E_t[ḡ_t − g̃_t]‖_∞ ≤ 2ηG². Together with ‖w_t − w*‖₁ ≤ 2B, this yields

    E_t[(g̃_t − ḡ_t)^⊤(w_t − w*)] ≤ 4ηBG².

Summing over t = 1, ..., m and taking the expectation, we obtain the lemma.

Finally, we prove the simple lemma.

Proof (of Lemma A.2).
As a first step, note that for x > C we have x − E[X] > C/2, so that

    C(x − C) ≤ 2(x − E[X])(x − C) ≤ 2(x − E[X])².

Hence, denoting by μ the probability measure of X, we obtain

    E[X] − E[X̄] = ∫_{x<−C} (x + C) dμ + ∫_{x>C} (x − C) dμ ≤ ∫_{x>C} (x − C) dμ ≤ (2/C) ∫_{x>C} (x − E[X])² dμ ≤ 2 var[X]/C.

Similarly one can prove that E[X] − E[X̄] ≥ −2 var[X]/C, and the result follows.
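Lemma A.2 is easy to validate numerically; the following Monte Carlo sketch (ours) checks the bound for a Gaussian X satisfying |E[X]| ≤ C/2:

    import numpy as np

    rng = np.random.default_rng(0)
    C = 4.0
    X = rng.normal(loc=1.0, scale=2.0, size=1_000_000)  # |E[X]| = 1 <= C/2
    X_clip = np.clip(X, -C, C)
    lhs = abs(X_clip.mean() - X.mean())
    rhs = 2 * X.var() / C            # the bound 2*var[X]/C of Lemma A.2
    print(lhs, rhs, lhs <= rhs)      # the observed bias is far below the bound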
A.5. Proof of Lemma 3.8

Since E_t[x̃_{t,r}] = x_t and E_t[x̃_{t,r}²] = d·x_t² for all r, we have (by independence) that

    E_t[x̃_t²] = (1/k²) Σ_{r=1}^k E_t[x̃_{t,r}²] + (1/k²) Σ_{r≠s} x_t² = ((d + k − 1)/k) x_t²,

so evidently ‖E_t[x̃_t²]‖_∞ ≤ 2d/k. In addition,

    E_t[(‖w_t‖₁ sign(w_t[j_t]) x_t[j_t])²] = ‖w_t‖₁ Σ_{j=1}^d |w_t[j]| x_t[j]² ≤ ‖w_t‖₁² ‖x_t‖_∞² ≤ B²,

hence E_t[φ̃_t²] ≤ 2(B² + y_t²) ≤ 4B². From here we proceed exactly as in Lemma 3.3.
A.6. Proof of Lemma 4.1
First, denote θ := w^⊤x − y and notice that, in any case, we have E[θ̃_r] = θ (see the proof of Lemma 3.3). Hence, and since θ̃₁, ..., θ̃_n are independent,

    E[φ̂] = Σ_{n=0}^∞ (1/2)^{n+1} · E[2^{n+1} a_n θ̃₁θ̃₂···θ̃_n] = Σ_{n=0}^∞ a_n θ^n = f'(θ),

thus φ̂ is an unbiased estimator of f'(θ).

For bounding the second moment, recall ν := 2 log N and note that if n ≤ ν then E[θ̃_r²] ≤ 4B² for all r, and otherwise E[θ̃_r²] ≤ 4B²/N ≤ ½. Also, denoting f'_+(x) := Σ_{n=0}^∞ |a_n| x^n, which exists for |x| ≤ 1, we have Σ_{n=0}^∞ a_n² ≤ (Σ_{n=0}^∞ |a_n|)² = (f'_+(1))². This yields

    E[φ̂²] = 2 Σ_{n=0}^∞ a_n² E[2^n θ̃₁²θ̃₂²···θ̃_n²] ≤ 2 Σ_{n≤ν} a_n²(8B²)^n + 2 Σ_{n>ν} a_n² ≤ 4(8B²)^ν (f'_+(1))² = exp(O(log² B)).

Finally, the expected number of attributes of x used by the estimator is bounded as follows:

    E[#attributes] = Σ_{n≤ν} n(1/2)^{n+1} + Σ_{n>ν} nN(1/2)^{n+1} ≤ Σ_{n=0}^∞ n(1/2)^{n+1} + N·2^{−ν}(ν + 1) ≤ 1 + (2 log N + 1)/N ≤ 3,

where we used the identity Σ_{n≥r} n(1/2)^{n+1} = 2^{−r}(r + 1).
B. Lower bounds

In this section we prove:
Theorem B.1.
Let ε = Ω(1/√d). Any algorithm for LAO Ridge regression must observe at least Ω(d/ε²) coordinates in order to obtain an O(ε)-approximate solution.

B.1. Information theoretic lower bounds
Our lower bound is based on the following folklore fact:
Fact B.2.
Consider the following random process. Initialize a length-d array A to all zeros. Choose a random position i ∈ [d], and set A[i] = 1 with probability ½; with the remaining probability ½, set A[i] = −1. Then any algorithm which determines the value of A[i] with probability at least 2/3 must read Ω(d) entries of A.

A corollary of this fact is the following more general theorem. Consider a k × d matrix A and the following random process. Pick a subset of coordinates T ⊆ [d] of size |T| = k. For each index j ∈ [k], set the j'th row of A to be r_i e_i, where i = T[j] is the j'th element of T, and r_i is a Rademacher random variable.

Corollary B.3.
Any algorithm that correctly determines the value of Ω(k) non-zero entries of A with probability at least 2/3 must read Ω(dk) entries of A.

B.2. Proof of Theorem B.1
Consider the following Ridge regression setting: the matrix A is created as in the previous subsection with k = 1/ε². The labels are always one, and the example vectors are chosen uniformly at random among the rows of the matrix A.

The proof will follow from the following two lemmas.

Lemma B.4.
There exists a vector w* for which L_D(w*) ≤ (1 − ε)².

Proof.
Consider the vector

    w* = Σ_{i∈T} (1/√k) r_i e_i = Σ_{i∈T} ε r_i e_i.

Since w*^⊤(r_i e_i) = ε r_i² = ε on every example, its expected loss is clearly L_D(w*) = (1 − ε)².

Lemma B.5.
Let δ = 1/1000. Any vector w with ‖w‖₂ ≤ 1 for which L_D(w) ≤ 1 − 2(1 − δ)ε has at least a (1 − δ)² fraction of its squared Euclidean weight on the coordinates of T, and at least k/2 coordinates in T have squared weight of at least ε²/50.

Proof. Let w be such a vector. We have

    1 − 2(1 − δ)ε ≥ L_D(w) = (1/k) Σ_{i∈T} (1 − w^⊤x_i)² = (1/k) Σ_{i∈T} (1 − 2w_i r_i + w_i²) ≥ 1 − (2/k) Σ_{i∈T} w_i r_i.

Hence

    Σ_{i∈T} |w_i| ≥ Σ_{i∈T} w_i r_i ≥ (1 − δ)kε = (1 − δ)/ε.

This implies that

    (1/k) Σ_{i∈T} w_i² ≥ ((1/k) Σ_{i∈T} |w_i|)² ≥ ((1 − δ)ε)²,

hence

    Σ_{i∈T} w_i² ≥ k(1 − δ)²ε² = (1 − δ)² ≥ 1 − 2δ.

Next, we claim that at least k/2 coordinates inside T have squared weight of at least ε²/50. If that is not the case, then the ℓ₁-norm of w restricted to T would be upper bounded by the sum of two terms:

1. The ℓ₁-norm of the coordinates whose squared weight is larger than ε²/50. There are fewer than k/2 of these, and their ℓ₂-norm is at most one, hence their ℓ₁ contribution is at most √(k/2) = 1/(√2 ε).
2. The ℓ₁-norm of the small coordinates. Each of these is smaller than ε/√50 in absolute value, so their total ℓ₁ contribution is at most k·ε/√50 = 1/(√50 ε).

Summing both of these up, the ℓ₁-norm of w on T would be at most (1/√2 + 1/√50)/ε < (1 − δ)/ε, contradicting the lower bound established above.

Next, consider any algorithm that attains a δε-approximation (which is an O(ε)-approximation) for our Ridge regression instance. It finds a vector w for which

    L_D(w) ≤ L_D(w*) + δε ≤ (1 − ε)² + δε ≤ 1 − 2(1 − δ)ε,

where in the first inequality we used Lemma B.4 (the last inequality holds whenever ε ≤ δ). Thus, by Lemma B.5, the vector w returned by this algorithm has at least a (1 − δ)² fraction of its squared weight on the coordinates of T.

Pick all coordinates of w with squared weight larger than ε²/50. According to Lemma B.5, there are at least k/2 such coordinates inside T. In the rest of the coordinates there are very few of this magnitude, since the remaining squared weight is at most 2δ; hence there are at most 2δ/(ε²/50) = 100δk such coordinates elsewhere.

Overall, we obtain a set of coordinates of size at most (1 + 100δ)k, of which the vast majority are inside T. By Corollary B.3, this requires the algorithm to read Ω(kd) = Ω(d/ε²) entries of the example vectors, which completes the proof.