Sparse Estimation with Strongly Correlated Variables using Ordered Weighted L1 Regularization
Mário A. T. Figueiredo
Instituto de Telecomunicações and Instituto Superior Técnico, Universidade de Lisboa, Portugal

Robert D. Nowak
Department of Electrical and Computer Engineering, University of Wisconsin-Madison, USA

September 16, 2014
Abstract
This paper studies ordered weighted $\ell_1$ (OWL) norm regularization for sparse estimation problems with strongly correlated variables. We prove sufficient conditions for clustering based on the correlation/colinearity of variables using the OWL norm, of which the so-called OSCAR [4] is a particular case. Our results extend previous ones for OSCAR in several ways: for the squared error loss, our conditions hold for the more general OWL norm and under weaker assumptions; we also establish clustering conditions for the absolute error loss, which is, as far as we know, a novel result. Furthermore, we characterize the statistical performance of OWL norm regularization for generative models in which certain clusters of regression variables are strongly (even perfectly) correlated, but variables in different clusters are uncorrelated. We show that if the true $p$-dimensional signal generating the data involves only $s$ of the clusters, then $O(s \log p)$ samples suffice to accurately estimate the signal, regardless of the number of coefficients within the clusters. The estimation of $s$-sparse signals with completely independent variables requires just as many measurements. In other words, using the OWL we pay no price (in terms of the number of measurements) for the presence of strongly correlated variables.

1 Introduction

The OWL (ordered weighted $\ell_1$) regularizer is defined as
$$\Omega_w(x) = \sum_{i=1}^{p} w_i\, |x|_{[i]}, \qquad (1)$$
where $|x|_{[i]}$ is the $i$-th largest component in magnitude of $x \in \mathbb{R}^p$, and $w \in \mathbb{R}^p_+$ is a vector of non-negative weights. If $w_1 \ge w_2 \ge \cdots \ge w_p$ and $w_1 > 0$ (which we will assume to be always true), then $\Omega_w$ is a norm (as shown in [3], [22]), which satisfies $w_1 \|x\|_\infty \le \Omega_w(x) \le w_1 \|x\|_1$. The OWL regularizer generalizes OSCAR (octagonal shrinkage and clustering algorithm for regression) [4], which is obtained by setting $w_i = \lambda_1 + \lambda_2 (p - i)$, where $\lambda_1, \lambda_2 \ge 0$. Notice also that if $w_1 > 0$ and $w_2 = \cdots = w_p = 0$, the OWL is simply ($w_1$ times) the $\ell_\infty$ norm, whereas for $w_1 = w_2 = \cdots = w_p$, the OWL becomes ($w_1$ times) the $\ell_1$ norm.

In this paper, we study the use of the OWL norm as a regularizer in linear regression with strongly correlated variables, both under the squared error loss and the absolute error loss, i.e., the two following optimization problems:
$$\min_{x \in \mathbb{R}^p} \frac{1}{2}\|Ax - y\|_2^2 + \Omega_w(x), \qquad (2)$$
where $A \in \mathbb{R}^{n \times p}$ is the design matrix, and
$$\min_{x \in \mathbb{R}^p} \|Ax - y\|_1 + \Omega_w(x). \qquad (3)$$
We also consider constrained versions of these problems; see (7) and (8) below.
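To make definition (1) concrete, the following minimal Python sketch (ours, not from the paper) evaluates $\Omega_w$ and checks the special cases mentioned above; the function name `owl_norm` and the example weights are purely illustrative.

```python
import numpy as np

def owl_norm(x, w):
    """OWL norm (1): dot product of the non-increasing weight vector w
    with the entries of x sorted by decreasing magnitude."""
    return np.dot(w, np.sort(np.abs(x))[::-1])

p = 5
x = np.array([0.5, -2.0, 0.0, 1.0, -0.3])

# OSCAR weights: w_i = lambda1 + lambda2 * (p - i), for i = 1, ..., p.
lam1, lam2 = 1.0, 0.5
w_oscar = lam1 + lam2 * (p - np.arange(1, p + 1))
print(owl_norm(x, w_oscar))

# Uniform weights recover (w_1 times) the l1 norm; a single non-zero
# first weight recovers (w_1 times) the l_infinity norm.
print(owl_norm(x, np.full(p, 2.0)), 2.0 * np.abs(x).sum())              # equal
print(owl_norm(x, np.array([2.0, 0, 0, 0, 0])), 2.0 * np.abs(x).max())  # equal
```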
The first of our two main results gives sufficient conditions for OWL norm regularization (with either the squared or the absolute error loss) to automatically cluster strongly correlated variables, in the sense that the coefficients associated with such variables have equal estimated values (in magnitude). The result for the squared error loss extends the main theorem about OSCAR in [4]: not only does it apply to the more general case of OWL (of which OSCAR is a particular case), but it also holds under weaker conditions. Furthermore, the result for the absolute error loss is, as far as we know, novel.

Our second main result is a finite sample bound for the OWL regularization procedure, which includes the standard LASSO and OSCAR as special cases. To the best of our knowledge, these are the first finite sample error bounds for sparse regression with strongly correlated columns. To preview this result, consider the following special case (which we generalize further in the paper): assume that we observe
$$y = Ax^\star + \nu, \qquad (4)$$
where $\nu \in \mathbb{R}^n$ is the measurement error satisfying
$$\frac{1}{\sqrt{n}}\,\|\nu\|_2 \le \varepsilon, \qquad (5)$$
and about which we make no other assumptions. The measurement/design matrix $A$ is Gaussian distributed. For the purposes of this introduction, assume that each column of $A$ has i.i.d. $N(0,1)$ entries, but that the columns may be correlated. Specifically, assume that the columns can be grouped so that columns within each group are identical (apart from a possible sign flip) and columns in different groups are uncorrelated. This models cases in which certain variables are perfectly correlated with each other, but uncorrelated with all others. The vector $x^\star \in \mathbb{R}^p$ is assumed to satisfy $\|x^\star\|_1 \le \sqrt{s}$. Note, for example, that this condition is met if $\|x^\star\|_2 \le 1$ and $x^\star$ has at most $s$ non-zero components (is $s$-sparse), which we assume to be true. Note that, since certain columns of $A$ may be identical, in general there may be many sparse vectors $x$ such that $Ax = Ax^\star$. Thus, for now, assume that if two columns of $A$ are identical (up to a sign flip), then so are (in magnitude) the corresponding coefficients in $x^\star$. The following theorem essentially shows that the number of measurements sufficient to estimate an $s$-sparse signal (i.e., a signal with $s$ non-zero groups of identical coefficients corresponding to identical columns in $A$), with a given precision, grows like
$$n \sim s \log p. \qquad (6)$$
This agrees with well-known sample complexity bounds for sparse recovery under stronger assumptions, such as the restricted isometry property or i.i.d. measurements [6, 7, 10, 12, 21]. Moreover, this shows that by using OWL we pay no price (in terms of the sufficient number of measurements) for colinearity of some columns of $A$.
Theorem 1.1. Let $y$, $A$, $x^\star$, and $\varepsilon$ be as defined above. Let $\Delta := \min\{w_l - w_{l+1},\ l = 1, \dots, p-1\}$ be the minimum gap between two consecutive components of the vector $w$, and assume $\Delta > 0$. Let $\hat{x}$ be a solution to either of the two following optimization problems:
$$\min_{x \in \mathbb{R}^p} \Omega_w(x) \ \text{ subject to } \ \frac{1}{n}\|Ax - y\|_1 \le \varepsilon, \qquad (7)$$
or
$$\min_{x \in \mathbb{R}^p} \Omega_w(x) \ \text{ subject to } \ \frac{1}{\sqrt{n}}\|Ax - y\|_2 \le \varepsilon. \qquad (8)$$
Then, (i) for every pair of columns $(i, j)$ for which $a_i = a_j$, we have $\hat{x}_i = \hat{x}_j$; (ii) the solution $\hat{x}$ satisfies
$$\mathbb{E}\,\|\hat{x} - x^\star\|_2 \ \le\ \sqrt{8\pi}\left(\frac{4\sqrt{2}\, w_1}{\bar{w}} \sqrt{\frac{s \log p}{n}} + \varepsilon\right), \qquad (9)$$
where $\bar{w} = p^{-1}\sum_{i=1}^p w_i$.

The expectation above (and elsewhere in the paper) is with respect to the Gaussian distribution of $A$. Part (i) of this theorem is proved in Subsection 2.5, and part (ii) in Subsection 3.2. (It is trivial to extend the proof of part (i) to show that if $a_i = -a_j$, then $\hat{x}_i = -\hat{x}_j$.) As mentioned above, in general there may be many sparse $x$ that yield the same value of $Ax$. This is where the OWL norm becomes especially important. If the columns are colinear, then the OWL solution will select a representation including all the columns associated with the true model, rather than an arbitrary subset of them. We generalize Theorem 1.1 in the paper, and show that the OWL norm yields similar clustering and recovery conditions for problems with strongly correlated, but not necessarily colinear, variables. Notice that the factor $w_1/\bar{w}$ in the bound is typically small; for example, in the OSCAR case we have $w_i = \lambda_1 + \lambda_2(p - i)$, thus $\bar{w} = \frac{1}{p}\sum_{i=1}^p \bigl(\lambda_1 + \lambda_2(p - i)\bigr) = \lambda_1 + \lambda_2(p-1)/2 \ge w_1/2$, and therefore $w_1/\bar{w} \le 2$.

Estimates obtained with the LASSO (i.e., $\ell_1$) regularizer can be difficult to interpret when columns of the measurement matrix $A$ are strongly correlated, because it may select only one of a group of highly correlated columns. For scientific and engineering purposes, one is often interested in identifying all of the columns that are important for modeling the data, rather than just a subset of them. Many researchers have proposed alternatives to the LASSO that aim at dealing with this problem. For example, Jia and Yu [14] study the elastic net regularizer (a combination of the $\ell_1$ and the squared $\ell_2$ norms), showing that it can consistently select the true model for certain correlated design matrices $A$, when the LASSO cannot. Marginal regression methods have also been shown by Genovese et al. [11] to perform better than the LASSO in the presence of strongly correlated columns. Stability selection procedures can also aid in the selection of correlated columns, as shown by Meinshausen and Bühlmann [15], and Shah and Samworth [19]. Recently, Bühlmann et al. [5] proposed and analyzed a two-stage approach called cluster-LASSO, which first identifies clusters of correlated columns, then groups them, and finally applies LASSO or group-LASSO to the groups; the cluster-LASSO is shown to be statistically consistent in certain cases. Adaptive grouping methods based on nonconvex optimization have also been proposed by Shen and Huang [20], and shown to be asymptotically consistent under certain conditions.

Most closely related to this paper is the so-called OSCAR (octagonal shrinkage and clustering algorithm for regression), proposed and analyzed by Bondell and Reich [4]. As mentioned above, OSCAR is a special case of the OWL regularizer, obtained with $w_i = \lambda_1 + \lambda_2(p - i)$. The OSCAR method has been shown to perform well in practice, but prior work has not addressed its statistical consistency or convergence properties. Motivated by this formulation of OSCAR, the OWL regularizer was recently proposed by Zeng et al. [22], as a generalization thereof.
The OWL norm was also independently proposed by Bogdan et al. [2, 3], who set the weights to $w_i = F^{-1}(1 - iq/(2p))$, where $F$ is the cumulative distribution function of the error variables, and $0 < q < 1$ is a parameter. Those authors showed that, if $A$ is orthogonal, the solution to (2) with these weights has a false discovery rate for variable selection bounded by $q(p - k)/p$, where $k$ is the number of non-zero coefficients in the true $x$ that generates $y$.

On the computational side, a key tool for solving problems of the form (2), (3), (7), or (8) is the Moreau proximity operator of $\Omega_w$ [1], defined as
$$\mathrm{prox}_{\Omega_w}(u) = \arg\min_x \frac{1}{2}\|x - u\|_2^2 + \Omega_w(x).$$
$O(p \log p)$ algorithms to compute $\mathrm{prox}_{\Omega_w}$ have been recently proposed by Bogdan et al. [2, 3], and by Zeng et al. [22], who generalize to the OWL case the algorithm proposed by Zhong and Kwok [24]. Even more recently, Zeng et al. [23] have shown how $\Omega_w$ can be written explicitly as an atomic norm (see [8] for definitions), opening the door to the efficient use of the conditional gradient (also known as Frank-Wolfe) algorithm [13].
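To complement the pointers above, here is a compact sketch of how $\mathrm{prox}_{\Omega_w}$ can be computed in $O(p \log p)$ time: sort the magnitudes, subtract the weights, project onto the monotone non-negative cone by pool-adjacent-violators (PAV), and undo the sort. The overall structure loosely follows the algorithms in [2, 3, 22], but the implementation and names are ours, so treat it as an illustrative sketch rather than as the exact algorithm of any cited paper.

```python
import numpy as np

def prox_owl(u, w):
    """Sketch of prox_{Omega_w}(u) for non-increasing, non-negative weights w:
    sort |u| in decreasing order, subtract w, project onto the set
    {v_1 >= v_2 >= ... >= v_p >= 0} by pool-adjacent-violators (PAV),
    then undo the sort and restore the signs of u."""
    order = np.argsort(-np.abs(u))       # positions of |u|, largest first
    z = np.abs(u)[order] - w
    # PAV for non-increasing isotonic regression: maintain blocks as
    # (sum, length) pairs, merging while block averages violate monotonicity.
    sums, lens = [], []
    for val in z:
        s, c = val, 1
        while sums and sums[-1] / lens[-1] <= s / c:
            s += sums.pop()
            c += lens.pop()
        sums.append(s)
        lens.append(c)
    proj = np.concatenate([np.full(c, max(s / c, 0.0))
                           for s, c in zip(sums, lens)])
    x = np.empty_like(proj)
    x[order] = proj                      # undo the sort
    return np.sign(u) * x

# With uniform weights, the prox reduces to elementwise soft-thresholding.
u = np.array([3.0, -1.0, 0.2, -2.5])
print(prox_owl(u, np.full(4, 1.0)))
print(np.sign(u) * np.maximum(np.abs(u) - 1.0, 0.0))   # same output
```

The initial sort dominates the cost, giving the $O(p \log p)$ total; the PAV pass itself is linear.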
Notation

We denote (column) vectors by lower-case bold letters, e.g., $x$, $y$, their transposes by $x^T$, $y^T$, the corresponding $i$-th and $j$-th components as $x_i$ and $y_j$, and matrices by upper-case bold letters, e.g., $A$, $B$. A vector with all elements equal to 1 is denoted as $\mathbf{1}$, and $|x|$ denotes the vector with the absolute values of the components of $x$. Given some vector $x$, $x_{[i]}$ is its $i$-th largest component (i.e., for $x \in \mathbb{R}^p$, $x_{[1]} \ge x_{[2]} \ge \cdots \ge x_{[p]}$, with ties broken by some arbitrary rule); consequently, $|x|_{[i]}$ is the $i$-th largest component of $x$ in magnitude. The vector obtained by sorting (in non-increasing order) the components of $x$ is denoted as $x^\downarrow$; thus $|x|^\downarrow$ denotes the vector obtained by sorting the components of $x$ in non-increasing order of magnitude (allowing us to write $\Omega_w(x) = w^T |x|^\downarrow$).

2 Clustering of Strongly Correlated Variables

In this section, we study the solutions of (2) and (3) in the case where the design matrix $A$ has strongly correlated columns, and give corollaries for the particular case of OSCAR. The results presented below extend the main theorem of [4] in several ways: for the squared error loss, our result applies to the more general case of the OWL (of which OSCAR is a particular case) and it holds under weaker conditions; the result for the absolute error regression case is, as far as we know, novel.

2.1 Squared Error Loss

Consider the regression problem (2), and let $a_i \in \mathbb{R}^n$ denote the $i$-th column (for $i = 1, \dots, p$) of the matrix $A$. The following theorem shows that (2) clusters (in the sense that the corresponding components of the solution are exactly equal in magnitude) the columns that are correlated enough.
Theorem 2.1. Consider the objective function in (2) and assume, as is common practice in linear regression, that the columns of the matrix are normalized to a common norm, that is, $\|a_k\|_2 = c$, for $k = 1, \dots, p$. Let $\hat{x}$ be any minimizer of the objective function in (2). Then, for every pair of columns $(i, j)$ for which
$$\|y\|_2\, \bigl\|\mathrm{sign}(\hat{x}_i)\, a_i - \mathrm{sign}(\hat{x}_j)\, a_j\bigr\|_2 < \Delta$$
(where $\Delta := \min\{w_l - w_{l+1},\ l = 1, \dots, p-1\}$ is the minimum gap between two consecutive components of the vector $w$), we have $|\hat{x}_i| = |\hat{x}_j|$.

Notice that if two columns (affected by the signs of the corresponding regression coefficients) are identical, i.e., if $\|\mathrm{sign}(\hat{x}_i) a_i - \mathrm{sign}(\hat{x}_j) a_j\|_2 = 0$, any strictly positive value of $\Delta$ is sufficient to guarantee that these two columns will be clustered, that is, that the corresponding coefficients will be equal in magnitude.

The following corollary addresses the case where the columns of $A$ have zero mean and unit norm.
Corollary 2.1. Let the columns of $A$ be normalized to zero sample mean and unit norm: $\mathbf{1}^T a_k = 0$ and $\|a_k\|_2 = 1$, for $k = 1, \dots, p$. Denote their inner products (i.e., the sample correlations of the corresponding explanatory variables) as $\rho_{ij} = a_i^T a_j / (\|a_i\|_2 \|a_j\|_2) = a_i^T a_j$. Then, the condition in Theorem 2.1 becomes
$$\|y\|_2\, \sqrt{2 - 2\,\rho_{ij}\,\mathrm{sign}(\hat{x}_i \hat{x}_j)} < \Delta.$$

Proof. The corollary results trivially from inserting the normalization assumption and the definition of $\rho_{ij}$ into the equality
$$\bigl\|\mathrm{sign}(\hat{x}_i) a_i - \mathrm{sign}(\hat{x}_j) a_j\bigr\|_2 = \sqrt{\bigl(\mathrm{sign}(\hat{x}_i) a_i - \mathrm{sign}(\hat{x}_j) a_j\bigr)^T \bigl(\mathrm{sign}(\hat{x}_i) a_i - \mathrm{sign}(\hat{x}_j) a_j\bigr)}.$$

The following corollary results from observing that, in the OSCAR case, $w_i = \lambda_1 + \lambda_2(p - i)$, thus $\Delta = \lambda_2$.

Corollary 2.2. In the particular case of OSCAR, and for the case of normalized columns (as in Corollary 2.1), the condition is $\|y\|_2 \sqrt{2 - 2\,\rho_{ij}\,\mathrm{sign}(\hat{x}_i \hat{x}_j)} < \lambda_2$.

Corollary 2.2 is closely related to Theorem 2.1 of [4], but has weaker conditions: unlike in [4], our result does not require that both $x_i$ and $x_j$ be different from zero and from all other $x_k$, for $k \ne i, j$. Furthermore, Theorem 2.1 applies to the more general class of OWL norms, not just to OSCAR. Note also that the results in [4] assume that the columns are signed so that $\hat{x}_i \ge 0$ for all $i$. Our result could also be stated with this assumption, in which case $\|\mathrm{sign}(\hat{x}_i) a_i - \mathrm{sign}(\hat{x}_j) a_j\|_2$ simplifies to $\|a_i - a_j\|_2$. Finally, observe that, in the extreme case of perfectly correlated columns ($\rho_{ij}\,\mathrm{sign}(\hat{x}_i \hat{x}_j) = 1$), the condition for OSCAR simplifies to $\lambda_2 > 0$.
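The sufficient condition in Corollary 2.2 can be checked numerically before solving (2). The sketch below (our own illustration; data and parameters arbitrary) computes the left-hand side of the condition for a pair of strongly correlated columns; since the sign product $\mathrm{sign}(\hat{x}_i \hat{x}_j)$ is not known in advance, it reports the thresholds for both sign configurations.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 6

# Columns 0 and 1 are strongly correlated; the rest are independent.
A = rng.standard_normal((n, p))
base = rng.standard_normal(n)
A[:, 0] = base + 0.01 * rng.standard_normal(n)
A[:, 1] = base + 0.01 * rng.standard_normal(n)
A -= A.mean(axis=0)                     # zero sample mean (Corollary 2.1)
A /= np.linalg.norm(A, axis=0)          # unit-norm columns
y = A[:, :2] @ np.ones(2) + 0.1 * rng.standard_normal(n)

rho = A[:, 0] @ A[:, 1]                 # sample correlation rho_01
lhs_same = np.linalg.norm(y) * np.sqrt(2 - 2 * rho)  # if sign(x_i x_j) = +1
lhs_opp = np.linalg.norm(y) * np.sqrt(2 + 2 * rho)   # if sign(x_i x_j) = -1
print(f"rho_01 = {rho:.4f}")
print(f"lambda2 > {lhs_same:.3f} forces |x_0| = |x_1| if the signs agree;")
print(f"lambda2 > {lhs_opp:.3f} forces it for either sign configuration.")
```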
2.2 Absolute Error Loss

Consider the regression problem under absolute error loss in (3). The following theorem shows that, also in this case, the OWL regularizer clusters (in the sense that the corresponding components of the solution are exactly equal in magnitude) the columns that are similar enough.
Theorem 2.2. Let $\hat{x}$ be any minimizer of the objective function in (3). Then, for every pair of columns $(i, j)$ for which $\|\mathrm{sign}(\hat{x}_i) a_i - \mathrm{sign}(\hat{x}_j) a_j\|_1 < \Delta$ (where $\Delta$ is as defined in Theorem 2.1), we have $|\hat{x}_i| = |\hat{x}_j|$.

Under the normalization assumptions on the matrix $A$ that were used in Corollary 2.1, another (weaker) sufficient condition can be obtained, which depends on the sample correlations, as stated in the following corollary.
Corollary 2.3. Let $\hat{x}$ be any minimizer of the objective function in (3) and assume that the columns of $A$ are normalized, that is, $\mathbf{1}^T a_k = 0$ and $\|a_k\|_2 = 1$, for $k = 1, \dots, p$. Denote their inner products (i.e., the sample correlations of the corresponding explanatory variables) as $\rho_{ij} = a_i^T a_j / (\|a_i\|_2 \|a_j\|_2) = a_i^T a_j$. Then, for every pair of columns $(i, j)$ for which $\sqrt{n\,(2 - 2\,\rho_{ij}\,\mathrm{sign}(\hat{x}_i \hat{x}_j))} < \Delta$, we have $|\hat{x}_i| = |\hat{x}_j|$.

Proof. The corollary results simply from noticing that, under the assumed normalization of the columns of $A$,
$$\bigl\|\mathrm{sign}(\hat{x}_i) a_i - \mathrm{sign}(\hat{x}_j) a_j\bigr\|_1 \ \le\ \sqrt{n}\, \bigl\|\mathrm{sign}(\hat{x}_i) a_i - \mathrm{sign}(\hat{x}_j) a_j\bigr\|_2 \ =\ \sqrt{n\,(2 - 2\,\rho_{ij}\,\mathrm{sign}(\hat{x}_i \hat{x}_j))}.$$

Finally, a simple corollary results from the fact that, for OSCAR, $\Delta = \lambda_2$.
Corollary 2.4. In the particular case of OSCAR, the condition is $\|\mathrm{sign}(\hat{x}_i) a_i - \mathrm{sign}(\hat{x}_j) a_j\|_1 < \lambda_2$, in general, and $\sqrt{n\,(2 - 2\,\rho_{ij}\,\mathrm{sign}(\hat{x}_i \hat{x}_j))} < \lambda_2$, in the case of normalized columns.

Finally, as above, in the extreme case of perfectly correlated columns ($\rho_{ij}\,\mathrm{sign}(\hat{x}_i \hat{x}_j) = 1$), the condition for OSCAR simplifies to $\lambda_2 > 0$.

2.3 A Key Lemma

The proofs of both Theorems 2.1 and 2.2 are based on a useful lemma about the OWL norm, which we state and prove before proceeding to the proofs of the theorems.
Lemma 2.1. Consider a vector $x \in \mathbb{R}^p_+$ and any two of its components $x_i$ and $x_j$, such that $x_i > x_j$. Let $z \in \mathbb{R}^p_+$ be obtained by applying a so-called Pigou-Dalton transfer of size $\varepsilon \in \bigl(0, (x_i - x_j)/2\bigr)$ to $x$, that is: $z_i = x_i - \varepsilon$, $z_j = x_j + \varepsilon$, and $z_k = x_k$, for $k \ne i, j$. Let $w$ be a vector of non-increasing non-negative real values, $w_1 \ge w_2 \ge \cdots \ge w_p \ge 0$, and $\Delta$ be the minimum gap between two consecutive components of $w$, that is, $\Delta = \min\{w_l - w_{l+1},\ l = 1, \dots, p-1\}$. Then,
$$\Omega_w(x) - \Omega_w(z) \ \ge\ \Delta\,\varepsilon. \qquad (10)$$

(The Pigou-Dalton transfer, also known as a Robin Hood transfer, is used in the study of measures of economic inequality [9], [16].)

Proof. Let $l$ and $m$ be the rank orders of $x_i$ and $x_j$, respectively, i.e., $x_i = x_{[l]}$ and $x_j = x_{[m]}$; of course, $m > l$, because $x_i > x_j$. Now let $l + a$ and $m - b$ be the rank orders of $z_i$ and $z_j$, respectively, i.e., $x_i - \varepsilon = z_i = z_{[l+a]}$ and $x_j + \varepsilon = z_j = z_{[m-b]}$. Of course, it may happen that $a$ or $b$ (or both) are zero, if $\varepsilon$ is small enough not to change the rank orders of one (or both) of the affected components of $x$. Furthermore, the condition $\varepsilon < (x_i - x_j)/2$ implies that $x_i - \varepsilon > x_j + \varepsilon$, thus $l + a < m - b$. A key observation is that $x^\downarrow$ and $z^\downarrow$ only differ in positions $l$ to $l+a$ and $m-b$ to $m$; thus we can write
$$\Omega_w(x) - \Omega_w(z) = \sum_{k=l}^{l+a} w_k \bigl(x_{[k]} - z_{[k]}\bigr) + \sum_{k=m-b}^{m} w_k \bigl(x_{[k]} - z_{[k]}\bigr). \qquad (11)$$
In the range from $l$ to $l+a$, the relationship between $z^\downarrow$ and $x^\downarrow$ is
$$z_{[l]} = x_{[l+1]},\ \ z_{[l+1]} = x_{[l+2]},\ \dots,\ z_{[l+a-1]} = x_{[l+a]},\ \ z_{[l+a]} = x_{[l]} - \varepsilon, \qquad (12)$$
whereas in the range from $m-b$ to $m$, we have
$$z_{[m-b]} = x_{[m]} + \varepsilon,\ \ z_{[m-b+1]} = x_{[m-b]},\ \dots,\ z_{[m]} = x_{[m-1]}. \qquad (13)$$
Plugging these equalities into (11) yields
$$\begin{aligned}
\Omega_w(x) - \Omega_w(z) &= \sum_{k=l}^{l+a-1} w_k \bigl(x_{[k]} - x_{[k+1]}\bigr) + \sum_{k=m-b+1}^{m} w_k \bigl(x_{[k]} - x_{[k-1]}\bigr) \\
&\quad + w_{l+a}\bigl(x_{[l+a]} - x_{[l]} + \varepsilon\bigr) + w_{m-b}\bigl(x_{[m-b]} - x_{[m]} - \varepsilon\bigr) \\
&\overset{(a)}{\ge} w_{l+a} \sum_{k=l}^{l+a-1} \bigl(x_{[k]} - x_{[k+1]}\bigr) + w_{m-b} \sum_{k=m-b+1}^{m} \bigl(x_{[k]} - x_{[k-1]}\bigr) \\
&\quad + w_{l+a}\bigl(x_{[l+a]} - x_{[l]} + \varepsilon\bigr) + w_{m-b}\bigl(x_{[m-b]} - x_{[m]} - \varepsilon\bigr) \\
&= w_{l+a}\Bigl(\sum_{k=l}^{l+a-1}\bigl(x_{[k]} - x_{[k+1]}\bigr) + x_{[l+a]} - x_{[l]} + \varepsilon\Bigr) + w_{m-b}\Bigl(\sum_{k=m-b+1}^{m}\bigl(x_{[k]} - x_{[k-1]}\bigr) + x_{[m-b]} - x_{[m]} - \varepsilon\Bigr) \\
&= \varepsilon\,\bigl(w_{l+a} - w_{m-b}\bigr) \ \ge\ \varepsilon\,\Delta,
\end{aligned}$$
where inequality (a) results from $x_{[k]} - x_{[k+1]} \ge 0$, $x_{[k]} - x_{[k-1]} \le 0$, and the components of $w$ forming a non-increasing sequence; the last equality holds because both sums telescope (to $x_{[l]} - x_{[l+a]}$ and $x_{[m]} - x_{[m-b]}$, respectively), and the final inequality uses $m - b > l + a$ together with the definition of $\Delta$.
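Lemma 2.1 is easy to verify numerically; the short test below (ours, purely for illustration) draws random non-negative vectors and random Pigou-Dalton transfers and checks inequality (10) up to floating-point tolerance.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 8
w = np.sort(rng.random(p))[::-1]          # non-increasing weights
delta = np.min(w[:-1] - w[1:])            # Delta: minimum consecutive gap

def owl(v):
    return np.dot(w, np.sort(np.abs(v))[::-1])

for _ in range(1000):
    x = np.sort(rng.random(p))[::-1]      # x in R^p_+ (sorted for convenience)
    i, j = 2, 5                           # so that x_i > x_j
    eps = rng.uniform(0.0, (x[i] - x[j]) / 2)
    z = x.copy()
    z[i] -= eps                           # Pigou-Dalton transfer of size eps
    z[j] += eps
    assert owl(x) - owl(z) >= delta * eps - 1e-12   # inequality (10)

print("Lemma 2.1 verified on 1000 random instances")
```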
2.4 Proofs of Theorems 2.1 and 2.2

Armed with Lemma 2.1, we now proceed to prove Theorems 2.1 and 2.2.

Proof. (Theorem 2.1) Let us denote $L(x) = \frac{1}{2}\|Ax - y\|_2^2$ and take some pair of columns $(i, j)$. Suppose that $\hat{x}$ is a minimizer of the objective function in (2), satisfying the condition of the theorem ($\Delta > \|y\|_2 \|\mathrm{sign}(\hat{x}_i) a_i - \mathrm{sign}(\hat{x}_j) a_j\|_2$), but, in contradiction to the theorem's claim, for which $|\hat{x}_i| \ne |\hat{x}_j|$. Without loss of generality, assume that $|\hat{x}_i| > |\hat{x}_j|$, and define the residual vector
$$g = y - \sum_{k=1,\ k \ne i,\ k \ne j}^{p} \hat{x}_k\, a_k. \qquad (14)$$
Now consider a Pigou-Dalton transfer of size $\varepsilon < \min\{|\hat{x}_i|,\ (|\hat{x}_i| - |\hat{x}_j|)/2\}$ applied to the magnitudes of $\hat{x}_i$ and $\hat{x}_j$, i.e., take an alternative candidate solution $v \in \mathbb{R}^p$ such that $v_i = \mathrm{sign}(\hat{x}_i)(|\hat{x}_i| - \varepsilon)$, $v_j = \mathrm{sign}(\hat{x}_j)(|\hat{x}_j| + \varepsilon)$, and $v_k = \hat{x}_k$, for $k \ne i, j$. Denoting $\tilde{a}_i = \mathrm{sign}(\hat{x}_i)\, a_i$ and $\tilde{a}_j = \mathrm{sign}(\hat{x}_j)\, a_j$, the difference in the loss function that results from this transfer is
$$L(v) - L(\hat{x}) = \frac{1}{2}\bigl\|g - (|\hat{x}_i| - \varepsilon)\tilde{a}_i - (|\hat{x}_j| + \varepsilon)\tilde{a}_j\bigr\|_2^2 - \frac{1}{2}\bigl\|g - |\hat{x}_i|\tilde{a}_i - |\hat{x}_j|\tilde{a}_j\bigr\|_2^2. \qquad (15)$$
Expanding the squared $\ell_2$ norms, cancelling out the common $\|g\|_2^2$ term, and using the common norm of the columns ($\|a_k\|_2 = c$, for $k = 1, \dots, p$) leads to
$$\begin{aligned}
L(v) - L(\hat{x}) &= \tfrac{1}{2}(|\hat{x}_i| - \varepsilon)^2 c^2 + \tfrac{1}{2}(|\hat{x}_j| + \varepsilon)^2 c^2 - (|\hat{x}_i| - \varepsilon)\, g^T \tilde{a}_i - (|\hat{x}_j| + \varepsilon)\, g^T \tilde{a}_j + (|\hat{x}_i| - \varepsilon)(|\hat{x}_j| + \varepsilon)\, \tilde{a}_i^T \tilde{a}_j \\
&\quad - \tfrac{1}{2}|\hat{x}_i|^2 c^2 - \tfrac{1}{2}|\hat{x}_j|^2 c^2 + |\hat{x}_i|\, g^T \tilde{a}_i + |\hat{x}_j|\, g^T \tilde{a}_j - |\hat{x}_i||\hat{x}_j|\, \tilde{a}_i^T \tilde{a}_j. \qquad (16)
\end{aligned}$$
Expanding the terms $(|\hat{x}_i| - \varepsilon)^2$, $(|\hat{x}_j| + \varepsilon)^2$, and $(|\hat{x}_i| - \varepsilon)(|\hat{x}_j| + \varepsilon)$, and making some further cancellations, yields
$$\begin{aligned}
L(v) - L(\hat{x}) &= \varepsilon\, g^T(\tilde{a}_i - \tilde{a}_j) + \varepsilon^2 (c^2 - \tilde{a}_i^T\tilde{a}_j) - \varepsilon c^2 \bigl(|\hat{x}_i| - |\hat{x}_j|\bigr) + \varepsilon \bigl(|\hat{x}_i| - |\hat{x}_j|\bigr)\tilde{a}_i^T\tilde{a}_j \\
&= \varepsilon\, g^T(\tilde{a}_i - \tilde{a}_j) - \varepsilon\,(c^2 - \tilde{a}_i^T\tilde{a}_j)\bigl(|\hat{x}_i| - |\hat{x}_j| - \varepsilon\bigr) \\
&\overset{(a)}{\le} \varepsilon\, g^T(\tilde{a}_i - \tilde{a}_j) \ \overset{(b)}{\le}\ \varepsilon\, \|y\|_2\, \|\tilde{a}_i - \tilde{a}_j\|_2,
\end{aligned}$$
where inequality (a) results from the facts that (by the Cauchy-Schwarz inequality) $c^2 \ge \tilde{a}_i^T\tilde{a}_j$ and both $\varepsilon$ and $(|\hat{x}_i| - |\hat{x}_j| - \varepsilon)$ are (by assumption) positive, whereas (b) is again Cauchy-Schwarz together with the fact that $\|g\|_2 \le \|y\|_2$. Finally, since $|v| \in \mathbb{R}^p_+$ results from the same Pigou-Dalton transfer of size $\varepsilon$ applied to $|\hat{x}| \in \mathbb{R}^p_+$, and $\Omega_w$ only depends on the absolute values of its argument, we are in condition to invoke Lemma 2.1, which yields
$$L(v) + \Omega_w(v) - \bigl(L(\hat{x}) + \Omega_w(\hat{x})\bigr) \ \le\ \varepsilon\,\bigl(\|y\|_2 \|\tilde{a}_i - \tilde{a}_j\|_2 - \Delta\bigr) < 0, \qquad (17)$$
contradicting the assumption that $\hat{x}$ is a minimizer of $L(x) + \Omega_w(x)$, thus completing the proof.

Proof. (Theorem 2.2) Denote $G(x) = \|Ax - y\|_1$ and take some pair of columns $(i, j)$. Assume that $\hat{x}$ is a minimizer of the objective function in (3), satisfying the condition of the theorem ($\Delta > \|\mathrm{sign}(\hat{x}_i) a_i - \mathrm{sign}(\hat{x}_j) a_j\|_1$), but, in contradiction to the theorem's claim, for which $|\hat{x}_i| \ne |\hat{x}_j|$. Define the residual vector $g$ as in (14) and consider a Pigou-Dalton transfer of size $\varepsilon < \min\{|\hat{x}_i|,\ (|\hat{x}_i| - |\hat{x}_j|)/2\}$ applied to the magnitudes of $\hat{x}_i$ and $\hat{x}_j$, i.e., take an alternative candidate solution $v \in \mathbb{R}^p$ such that $v_i = \mathrm{sign}(\hat{x}_i)(|\hat{x}_i| - \varepsilon)$, $v_j = \mathrm{sign}(\hat{x}_j)(|\hat{x}_j| + \varepsilon)$, and $v_k = \hat{x}_k$, for $k \ne i, j$. Denoting $\tilde{a}_i = \mathrm{sign}(\hat{x}_i)\, a_i$ and $\tilde{a}_j = \mathrm{sign}(\hat{x}_j)\, a_j$, the difference in the loss function that results from this transfer satisfies
$$\begin{aligned}
G(v) - G(\hat{x}) &= \bigl\|g - (|\hat{x}_i| - \varepsilon)\tilde{a}_i - (|\hat{x}_j| + \varepsilon)\tilde{a}_j\bigr\|_1 - \bigl\|g - |\hat{x}_i|\tilde{a}_i - |\hat{x}_j|\tilde{a}_j\bigr\|_1 \\
&= \bigl\|g - |\hat{x}_i|\tilde{a}_i - |\hat{x}_j|\tilde{a}_j + \varepsilon(\tilde{a}_i - \tilde{a}_j)\bigr\|_1 - \bigl\|g - |\hat{x}_i|\tilde{a}_i - |\hat{x}_j|\tilde{a}_j\bigr\|_1 \\
&\le \varepsilon\, \bigl\|\tilde{a}_i - \tilde{a}_j\bigr\|_1, \qquad (18)
\end{aligned}$$
as a direct consequence of the triangle inequality.
Finally, since $|v| \in \mathbb{R}^p_+$ results from the same Pigou-Dalton transfer of size $\varepsilon$ applied to $|\hat{x}| \in \mathbb{R}^p_+$, and $\Omega_w$ only depends on the absolute values of its argument, we are in condition to invoke Lemma 2.1, thus
$$G(v) + \Omega_w(v) - \bigl(G(\hat{x}) + \Omega_w(\hat{x})\bigr) \ \le\ \varepsilon\,\bigl(\|\tilde{a}_i - \tilde{a}_j\|_1 - \Delta\bigr) < 0, \qquad (19)$$
which contradicts the assumption that $\hat{x}$ is a minimizer of $G(x) + \Omega_w(x)$, thus completing the proof.

2.5 Proof of Theorem 1.1 (i)

The proof of item (i) in Theorem 1.1 follows the same general structure as the proofs of Theorems 2.1 and 2.2.
Proof. (Theorem 1.1 (i)) Let us define the functions $L(x) = \frac{1}{n}\|Ax - y\|_1$ and $G(x) = \frac{1}{\sqrt{n}}\|Ax - y\|_2$, and the residual $g$ as in (14). Since $a_i = a_j$, the functions
$$L(\hat{x}) = \frac{1}{n}\bigl\|g - (\hat{x}_i + \hat{x}_j)\, a_i\bigr\|_1 \quad \text{and} \quad G(\hat{x}) = \frac{1}{\sqrt{n}}\bigl\|g - (\hat{x}_i + \hat{x}_j)\, a_i\bigr\|_2 \qquad (20)$$
are both invariant under a transformation that adds any quantity $\varepsilon$ to $\hat{x}_i$ and subtracts the same quantity from $\hat{x}_j$.

We first prove that if $a_i = a_j$, then $\mathrm{sign}(\hat{x}_i) = \mathrm{sign}(\hat{x}_j)$. Assume, by contradiction, that $\mathrm{sign}(\hat{x}_i) \ne \mathrm{sign}(\hat{x}_j)$ and, without loss of generality, that $\hat{x}_i > 0$. We need to consider two cases:

a) if $\hat{x}_j < 0$, take an alternative feasible solution $v$, with $v_k = \hat{x}_k$, for $k \ne i, j$, $v_i = \hat{x}_i - \varepsilon$, and $v_j = \hat{x}_j + \varepsilon$, for some $\varepsilon \in (0, \min\{|\hat{x}_i|, |\hat{x}_j|\}]$. Since $\hat{x}_i > 0$ and $\hat{x}_j < 0$, it is true that $|v_i| = |\hat{x}_i| - \varepsilon$ and $|v_j| = |\hat{x}_j| - \varepsilon$. Finally, the definition of $\Delta$ implies that $w_{p-1} \ge \Delta > 0$, thus $\Omega_w(\hat{x}) - \Omega_w(v) \ge \varepsilon\,\Delta > 0$, contradicting the optimality of $\hat{x}$, thus proving the claim that $\mathrm{sign}(\hat{x}_i) = \mathrm{sign}(\hat{x}_j)$.

b) if $\hat{x}_j = 0$, consider a feasible $v$ resulting from a Pigou-Dalton transfer of size $\varepsilon \in \bigl(0, |\hat{x}_i|/2\bigr]$. From Lemma 2.1, $\Omega_w(\hat{x}) - \Omega_w(v) \ge \varepsilon\,\Delta > 0$, negating the optimality of $\hat{x}$, thus proving that $\mathrm{sign}(\hat{x}_i) = \mathrm{sign}(\hat{x}_j)$.

Once it is established that $\mathrm{sign}(\hat{x}_i) = \mathrm{sign}(\hat{x}_j)$, we proceed to prove that $|\hat{x}_i| = |\hat{x}_j|$. To this end, notice that
$$L(\hat{x}) = \frac{1}{n}\bigl\|g - (|\hat{x}_i| + |\hat{x}_j|)\,\mathrm{sign}(\hat{x}_i)\, a_i\bigr\|_1 \quad \text{and} \quad G(\hat{x}) = \frac{1}{\sqrt{n}}\bigl\|g - (|\hat{x}_i| + |\hat{x}_j|)\,\mathrm{sign}(\hat{x}_i)\, a_i\bigr\|_2. \qquad (21)$$
Proceeding again by contradiction, suppose (without loss of generality) that $|\hat{x}_i| > |\hat{x}_j|$, and define $u$ via a Pigou-Dalton transfer on the magnitudes of $\hat{x}_i$ and $\hat{x}_j$, that is: $u_k = \hat{x}_k$, for $k \ne i, j$, $u_i = (|\hat{x}_i| - \delta)\,\mathrm{sign}(\hat{x}_i)$, and $u_j = (|\hat{x}_j| + \delta)\,\mathrm{sign}(\hat{x}_i)$, for some $\delta \in \bigl(0, \min\{|\hat{x}_i|, (|\hat{x}_i| - |\hat{x}_j|)/2\}\bigr]$. Of course, $u$ is feasible, and Lemma 2.1 shows that $\Omega_w(\hat{x}) - \Omega_w(u) \ge \delta\,\Delta > 0$, contradicting the optimality of $\hat{x}$, thus concluding the proof.

3 Error Bounds

In this section, we characterize the statistical performance of OWL regularization with both the squared and absolute error losses, by proving finite sample bounds, which apply to the standard LASSO and OSCAR as special cases. At the basis of our approach is the following model for correlated measurement matrices. Recall that $A$ has dimensions $n \times p$. Assume that the rows of $A$ are independently and identically distributed $N(\mathbf{0}, C^T C)$, the multivariate Gaussian distribution with covariance $C^T C$ (i.e., the columns of $A$ are not necessarily independent). Assume that the matrix $C$ is $q \times p$ with $q \ge n$. Note that $A$ can be factorized as $A = BC$, where $B$ is an $n \times q$ Gaussian random matrix, whose entries are i.i.d. $N(0, 1)$ random variables. The role of the matrix $C$ is to mix, or even replicate, columns of $B$. The next simple example illustrates this construction.
Example 1. Suppose $q = 3$, $p = 4$, and
$$C = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}; \qquad (22)$$
then, if $B = [b_1, b_2, b_3]$, where $b_i \in \mathbb{R}^n$ is the $i$-th column of $B$, the matrix $A$ has the form $A = [b_1, b_1, b_2, b_3]$.

As in the introduction, we assume that we observe
$$y = Ax^\star + \nu, \qquad (23)$$
where $\nu \in \mathbb{R}^n$ is the measurement error satisfying
$$\frac{1}{\sqrt{n}}\,\|\nu\|_2 \le \varepsilon, \qquad (24)$$
and about which we make no other assumptions. The signal $x^\star \in \mathbb{R}^p$ is assumed to satisfy $\|x^\star\|_1 \le \sqrt{s}$. Note, for example, that this condition is met if $\|x^\star\|_2 \le 1$ and $x^\star$ has at most $s$ non-zero components.
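The following sketch (ours; all sizes, seeds, and scalings are arbitrary choices for illustration) builds a design of this replicated-column type and generates observations according to (23)-(24).

```python
import numpy as np

rng = np.random.default_rng(2)
n, q, p, s, eps = 200, 50, 120, 5, 0.05

# Each column of A copies one column of B, so each column of C is a
# standard basis vector (cf. Example 1, with random group assignments).
groups = rng.integers(0, q, size=p)
C = np.zeros((q, p))
C[groups, np.arange(p)] = 1.0
B = rng.standard_normal((n, q))           # i.i.d. N(0, 1) entries
A = B @ C                                 # rows of A are N(0, C^T C)

# x* constant within each of s active groups, with ||x*||_1 <= sqrt(s).
x_star = np.zeros(p)
for g in rng.choice(q, size=s, replace=False):
    idx = np.where(groups == g)[0]
    if idx.size:                          # a group can be empty by chance
        x_star[idx] = 1.0 / (idx.size * np.sqrt(s))

nu = rng.standard_normal(n)
nu *= eps * np.sqrt(n) / np.linalg.norm(nu)   # enforce ||nu||_2 / sqrt(n) <= eps
y = A @ x_star + nu                           # observation model (23)
```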
Example 2. To illustrate the model, consider the matrix $C$ defined in (22) and suppose that $Ax^\star = b_1$. There are many $x \in \mathbb{R}^4$ satisfying $Ax = b_1$, including the vectors $[1, 0, 0, 0]^T$, $[0, 1, 0, 0]^T$, and all convex combinations of the two. LASSO regularization (i.e., the $\ell_1$ norm) does not differentiate between these equivalent representations, since $\|[\alpha, (1-\alpha), 0, 0]^T\|_1 = 1$, for any $\alpha \in [0, 1]$. However, the OWL norm (as claimed in Theorem 1.1) prefers the solution $\bigl[\tfrac{1}{2}, \tfrac{1}{2}, 0, 0\bigr]^T$, selecting both colinear columns of $A$ for the representation.
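The preference claimed in Example 2 is immediate to check numerically (our sketch below): with strictly decreasing weights, spreading a fixed amount of $\ell_1$ mass over the two identical columns strictly decreases the OWL norm, a distinction the $\ell_1$ norm cannot make.

```python
import numpy as np

w = np.array([4.0, 3.0, 2.0, 1.0])        # strictly decreasing weights
owl = lambda v: np.dot(w, np.sort(np.abs(v))[::-1])

for alpha in [1.0, 0.75, 0.5]:
    x = np.array([alpha, 1.0 - alpha, 0.0, 0.0])
    print(f"alpha = {alpha}: l1 = {np.abs(x).sum():.2f}, OWL = {owl(x):.2f}")
# The l1 norm is 1 for every alpha; the OWL norm is minimized at alpha = 1/2.
```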
3.1 Main Result

The main result of this section is stated in the following theorem, whose proof is given in the following subsections, based on the techniques developed by Vershynin [21]. The results are stated in terms of constrained optimization, which, under certain conditions, is equivalent to the Lagrangian formulation studied in Section 2. We also present a corollary for the particular case where $C$ simply replicates columns of $B$, that is, when $A$ includes groups of identical columns; this corollary is shown to imply part (ii) of Theorem 1.1. Expectations are with respect to the Gaussian distribution of $A$.

Theorem 3.1. Let $y$, $A$, $x^\star$, and $\varepsilon$ be as defined above, and let $\hat{x}$ be a solution to one of the two following optimization problems:
$$\min_{x \in \mathbb{R}^p} \Omega_w(x) \ \text{ subject to } \ \frac{1}{n}\|Ax - y\|_1 \le \varepsilon, \qquad (25)$$
or
$$\min_{x \in \mathbb{R}^p} \Omega_w(x) \ \text{ subject to } \ \frac{1}{\sqrt{n}}\|Ax - y\|_2 \le \varepsilon. \qquad (26)$$
Then,
$$\mathbb{E}\,\sqrt{(\hat{x} - x^\star)^T C^T C\, (\hat{x} - x^\star)} \ \le\ \sqrt{8\pi}\left(\frac{4\sqrt{2}\, \|C\|_1\, w_1}{\bar{w}} \sqrt{\frac{s \log q}{n}} + \varepsilon\right), \qquad (27)$$
where $\bar{w} = p^{-1}\sum_{i=1}^p w_i$ and $\|C\|_1$ is the matrix norm induced by the $\ell_1$ norm: $\|C\|_1 = \max_j \|c_j\|_1$, with $c_j$ denoting the $j$-th column of $C$.

Note that the error bound in (27) holds for optimizations based on the squared $\ell_2$ and $\ell_1$ losses. In fact, since $\frac{1}{\sqrt{n}}\|Ax - y\|_2 \le \varepsilon$ implies $\frac{1}{n}\|Ax - y\|_1 \le \varepsilon$, the $\ell_1$ constraint is less restrictive. In both cases, the theorem shows that the number of samples sufficient to estimate an $s$-sparse signal with a given precision grows like
$$n \sim s \log q.$$
This agrees with well-known sample complexity bounds for sparse recovery under stronger assumptions, such as the restricted isometry property or i.i.d. measurements [6, 7, 10, 12, 21]. However, note that the error measure of the theorem is insensitive to components of $\hat{x}$ in the nullspace of $C$, which is to be expected, since in general there may be many sparse $x$ that yield the same value of $Ax$ (see Example 2). This is where the OWL norm becomes especially important. With strictly decreasing weights ($\Delta > 0$), OWL prefers solutions that select all colinear columns associated with the model. In other words, if the columns are colinear (or strongly correlated, per the characterizations given in Section 2), then the OWL solution will select a representation including all the columns associated with the sparse model, rather than an arbitrary subset of them.

The OSCAR norm is a special case of OWL, with $w_i = \lambda_1 + \lambda_2(p - i)$ and $\lambda_1, \lambda_2 > 0$. In this case, $\bar{w} = \lambda_1 + \lambda_2(p-1)/2$, and therefore $w_1/\bar{w} \le 2$. Note that the conventional $\ell_1$ norm (used in the LASSO) is the special case of OWL with uniform weights (or OSCAR with $\lambda_2 = 0$); thus, all our results apply to $\ell_1$ minimization as well, in which case $w_1/\bar{w} = 1$.

3.2 Replicated Columns and the Proof of Theorem 1.1 (ii)

To illustrate Theorem 3.1, let $C$ be a $q \times p$ matrix that replicates each column of $B$ one or more times. Note that each column of $C$ is $1$-sparse and has unit $\ell_1$ norm, thus $\|C\|_1 = 1$. Let $G_1, \dots, G_q$ denote the groups of replicated columns in $A = BC$; these groups are a partition of the set $\{1, \dots, p\}$. Example 1 is a special case of this scenario, with $G_1 = \{1, 2\}$, $G_2 = \{3\}$, and $G_3 = \{4\}$. Assume that there are $s$ non-zero components in $x^\star$, each in one of $s$ distinct groups. Let $x_G$ denote the vector that is zero except on the subset of entries in $G \subset \{1, \dots, p\}$, where it takes the same values as $x$. Then note that for any $\hat{x}$ we have
$$(\hat{x} - x^\star)^T C^T C\, (\hat{x} - x^\star) = \sum_{i=1}^{q} \bigl|\mathbf{1}^T (\hat{x}_{G_i} - x^\star_{G_i})\bigr|^2, \qquad (28)$$
where $\mathbf{1} = [1\ 1\ \dots\ 1]^T$.
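Identity (28) is a simple consequence of the fact that $(Cu)_i = \mathbf{1}^T u_{G_i}$ when $C$ replicates columns; the fragment below (ours) verifies it on the matrix of Example 1.

```python
import numpy as np

C = np.array([[1., 1., 0., 0.],   # Example 1: G1 = {1, 2}, G2 = {3}, G3 = {4}
              [0., 0., 1., 0.],
              [0., 0., 0., 1.]])
u = np.random.default_rng(3).standard_normal(4)   # stands for xhat - xstar
lhs = u @ (C.T @ C) @ u
rhs = sum(u[g].sum() ** 2 for g in ([0, 1], [2], [3]))
print(np.isclose(lhs, rhs))                       # True: identity (28)
```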
This produces the following corollary to Theorem 3.1.

Corollary 3.1. Assume that each column of $C$ is $1$-sparse and has unit norm. Let $\hat{x}$ be a solution to the optimization problem
$$\min_{x \in \mathbb{R}^p} \Omega_w(x) \ \text{ subject to } \ \frac{1}{n}\|Ax - y\|_1 \le \varepsilon, \qquad (29)$$
or
$$\min_{x \in \mathbb{R}^p} \Omega_w(x) \ \text{ subject to } \ \frac{1}{\sqrt{n}}\|Ax - y\|_2 \le \varepsilon. \qquad (30)$$
Then,
$$\mathbb{E}\,\sqrt{\sum_{i=1}^{q} \bigl|\mathbf{1}^T (\hat{x}_{G_i} - x^\star_{G_i})\bigr|^2} \ \le\ \sqrt{8\pi}\left(\frac{4\sqrt{2}\, w_1}{\bar{w}} \sqrt{\frac{s \log q}{n}} + \varepsilon\right). \qquad (31)$$

In this case, since the correlated columns are colinear, the OWL norm will select all or none of the columns in each group, and each $\hat{x}_{G_i}$ will have identical non-zero values (if any). If we let $z^\star_i = \mathbf{1}^T x^\star_{G_i}$ and $\hat{z}_i = \mathbf{1}^T \hat{x}_{G_i}$, for $i = 1, \dots, q$, and let $z^\star = [z^\star_1, \dots, z^\star_q]^T$ and $\hat{z} = [\hat{z}_1, \dots, \hat{z}_q]^T$, then (31) can be expressed as
$$\mathbb{E}\, \|\hat{z} - z^\star\|_2 \ \le\ \sqrt{8\pi}\left(\frac{4\sqrt{2}\, w_1}{\bar{w}} \sqrt{\frac{s \log q}{n}} + \varepsilon\right), \qquad (32)$$
which is the type of result obtained in the compressed sensing literature for the ideal sparse observation model $y = Bz^\star + \nu$ (based on an i.i.d. observation model) [21]. This shows that by using OWL we pay no price for colinearity in $A$. Also, as shown next, Corollary 3.1 implies claim (ii) of Theorem 1.1 in Section 1.

Proof. (Theorem 1.1 (ii)) Taking into account the group structure of $\hat{x}$ and $x^\star$, we have that
$$\|\hat{x} - x^\star\|_2 = \sqrt{\sum_{i=1}^{q} \|\hat{x}_{G_i} - x^\star_{G_i}\|_2^2} = \sqrt{\sum_{i=1}^{q} \frac{(\hat{z}_i - z^\star_i)^2}{|G_i|}} \ \le\ \|\hat{z} - z^\star\|_2;$$
this inequality, together with (32) and the fact that $q \le p$, yields (9).

Before moving on to the proof of Theorem 3.1, notice that, since the $\ell_1$ norm is a special case of OWL with uniform weights, the same bounds in Theorem 3.1 and Corollary 3.1 hold for $\ell_1$ minimization. The difference is that the LASSO solution generally will not select all the correlated or colinear columns selected by OWL, making the estimated model less interpretable.

3.3 An Extended General $M^*$ Bound

The proof of Theorem 3.1 is based on the approach developed by Vershynin [21]. The key ingredient is the so-called general $M^*$ bound (Theorem 5.1 in [21]), which applies to the special case in which $A$ is an i.i.d. Gaussian matrix (i.e., when $C$ is the identity in our set-up). We extend the bound to cover our model $A = BC$, for general $C$. Recall that $A$ is $n \times p$, $B$ is $n \times q$, with $n \le q$, and $C$ is $q \times p$.

Theorem 3.2. (Extended general $M^*$ bound). Let $T$ be a bounded subset of $\mathbb{R}^p$. Let $B$ be an $n \times q$ Gaussian random matrix (with i.i.d. $N(0,1)$ entries) and let $A = BC$, where $C$ is a deterministic $q \times p$ matrix. Fix $\varepsilon \ge 0$ and consider the set
$$T_\varepsilon := \Bigl\{ u \in T : \frac{1}{n}\|Au\|_1 \le \varepsilon \Bigr\}. \qquad (33)$$
Then,
$$\mathbb{E} \sup_{u \in T_\varepsilon} \bigl(u^T C^T C u\bigr)^{1/2} \ \le\ \sqrt{\frac{8\pi}{n}}\; \mathbb{E} \sup_{u \in T} \bigl|\langle C^T g, u\rangle\bigr| + \sqrt{\frac{\pi}{2}}\, \varepsilon, \qquad (34)$$
where $g \sim N(\mathbf{0}, I_q)$ is a standard Gaussian random vector in $\mathbb{R}^q$.

The proof follows in a straightforward fashion from the proof of Theorem 5.1 in [21], with modifications made to account for $C$. For the sake of completeness, we include a proof here.
Proof. The bound (34) follows from the deviation inequality
$$\mathbb{E} \sup_{u \in T} \left| \frac{1}{n} \sum_{i=1}^{n} |\langle a_i, u\rangle| - \sqrt{\frac{2}{\pi}}\, \bigl(u^T C^T C u\bigr)^{1/2} \right| \ \le\ \frac{4}{\sqrt{n}}\; \mathbb{E} \sup_{u \in T} \bigl|\langle C^T g, u\rangle\bigr|, \qquad (35)$$
where $a_i$ denotes the $i$-th row of $A$. To see this, note that the inequality still holds if we replace $T$ by the smaller set $T_\varepsilon$, and that for $u \in T_\varepsilon$ we have, by assumption, $\frac{1}{n}\sum_{i=1}^n |\langle a_i, u\rangle| = \frac{1}{n}\|Au\|_1 \le \varepsilon$; the bound (34) then follows by the triangle inequality.

To prove (35), the first thing to note is that
$$\mathbb{E}\, |\langle a_i, u\rangle| = \mathbb{E}\, |\langle C^T b_i, u\rangle| = \mathbb{E}\, |\langle b_i, C u\rangle|, \qquad (36)$$
where $b_i$ is the $i$-th row of $B$. Because the Gaussian distribution of $b_i$ is rotationally invariant, it follows that
$$\mathbb{E}\, |\langle b_i, C u\rangle| = \sqrt{\frac{2}{\pi}}\, \bigl(u^T C^T C u\bigr)^{1/2}. \qquad (37)$$
Using the symmetrization and contraction inequalities from Proposition 5.2 in [21] (each of which contributes a factor of 2), we have the bound
$$\mathbb{E} \sup_{u \in T} \left| \frac{1}{n} \sum_{i=1}^{n} |\langle a_i, u\rangle| - \sqrt{\frac{2}{\pi}}\, \bigl(u^T C^T C u\bigr)^{1/2} \right| \ \le\ 4\; \mathbb{E} \sup_{u \in T} \left| \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i \langle b_i, C u\rangle \right| \qquad (38)$$
$$=\ 4\; \mathbb{E} \sup_{u \in T} \left| \Bigl\langle \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i b_i,\ C u \Bigr\rangle \right|, \qquad (39)$$
where each $\varepsilon_i$ independently takes the values $-1$ and $+1$ with probabilities $1/2$. Note that the vector $g := \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \varepsilon_i b_i \sim N(\mathbf{0}, I_q)$; thus,
$$4\; \mathbb{E} \sup_{u \in T} \left| \Bigl\langle \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i b_i,\ C u \Bigr\rangle \right| = \frac{4}{\sqrt{n}}\; \mathbb{E} \sup_{u \in T} |\langle g, C u\rangle| = \frac{4}{\sqrt{n}}\; \mathbb{E} \sup_{u \in T} |\langle C^T g, u\rangle|. \qquad (40)$$
This completes the proof.
3.4 Estimation from Noisy Linear Observations

Theorem 3.2 can be used to derive error bounds for estimating signals known to belong to a certain subset (sparse sets are a special case, which we consider below). Let $\mathcal{K} \subset \mathbb{R}^p$ be given. Suppose that we observe
$$y = Ax^\star + \nu, \qquad \frac{1}{\sqrt{n}}\|\nu\|_2 \le \varepsilon, \qquad (41)$$
where $x^\star \in \mathcal{K}$. The following theorems are straightforward extensions of Theorems 6.1 and 6.2 in [21]. We include the proofs for the sake of completeness.
Theorem 3.3. (Estimation from noisy linear observations: feasibility program). Choose $\hat{x}$ to be any vector satisfying
$$\hat{x} \in \mathcal{K} \quad \text{and} \quad \frac{1}{n}\|A\hat{x} - y\|_1 \le \varepsilon. \qquad (42)$$
Then,
$$\mathbb{E} \sup_{x^\star \in \mathcal{K}} \bigl\{ (\hat{x} - x^\star)^T C^T C\, (\hat{x} - x^\star) \bigr\}^{1/2} \ \le\ \sqrt{8\pi} \left( \frac{\mathbb{E} \sup_{u \in \mathcal{K} - \mathcal{K}} |\langle C^T g, u\rangle|}{\sqrt{n}} + \varepsilon \right). \qquad (43)$$
Proof. We apply Theorem 3.2 to the set $T = \mathcal{K} - \mathcal{K}$, with $2\varepsilon$ instead of $\varepsilon$, which yields
$$\mathbb{E} \sup_{u \in T_{2\varepsilon}} \bigl(u^T C^T C u\bigr)^{1/2} \ \le\ \sqrt{\frac{8\pi}{n}}\; \mathbb{E} \sup_{u \in T} |\langle C^T g, u\rangle| + \sqrt{2\pi}\, \varepsilon.$$
From here, all we need to show is that, for any $x^\star \in \mathcal{K}$,
$$\hat{x} - x^\star \in T_{2\varepsilon}. \qquad (44)$$
To see this, note that $\hat{x}, x^\star \in \mathcal{K}$, so $\hat{x} - x^\star \in \mathcal{K} - \mathcal{K} = T$. By the triangle inequality (and using $\frac{1}{n}\|\nu\|_1 \le \frac{1}{\sqrt{n}}\|\nu\|_2 \le \varepsilon$),
$$\frac{1}{n}\|A(\hat{x} - x^\star)\|_1 = \frac{1}{n}\|A\hat{x} - y + \nu\|_1 \le \frac{1}{n}\|A\hat{x} - y\|_1 + \frac{1}{n}\|\nu\|_1 \le 2\varepsilon,$$
showing that $u = \hat{x} - x^\star$ indeed satisfies the constraints that define $T_{2\varepsilon}$ in (33).

Next, we derive an optimization program for the solution. The Minkowski functional of $\mathcal{K}$ is defined as
$$\|x\|_{\mathcal{K}} = \inf\{\lambda > 0 : \lambda^{-1} x \in \mathcal{K}\}.$$
If $\mathcal{K}$ is a compact and origin-symmetric convex set with non-empty interior, then $\|x\|_{\mathcal{K}}$ is a norm on $\mathbb{R}^p$ [18]. Note that $x \in \mathcal{K}$ if and only if $\|x\|_{\mathcal{K}} \le 1$.

Theorem 3.4. (Estimation from noisy linear observations: optimization program). Choose $\hat{x}$ to be a solution of the optimization problem
$$\min_x \|x\|_{\mathcal{K}} \ \text{ subject to } \ \frac{1}{n}\|Ax - y\|_1 \le \varepsilon. \qquad (45)$$
Then,
$$\mathbb{E} \sup_{x^\star \in \mathcal{K}} \bigl\{ (\hat{x} - x^\star)^T C^T C\, (\hat{x} - x^\star) \bigr\}^{1/2} \ \le\ \sqrt{8\pi} \left( \frac{\mathbb{E} \sup_{u \in \mathcal{K} - \mathcal{K}} |\langle C^T g, u\rangle|}{\sqrt{n}} + \varepsilon \right). \qquad (46)$$
Proof. If we show that $\hat{x} \in \mathcal{K}$, then the result follows from Theorem 3.3. Note that the constraint of the program guarantees that $\frac{1}{n}\|A\hat{x} - y\|_1 \le \varepsilon$, and by assumption we have $\frac{1}{n}\|Ax^\star - y\|_1 = \frac{1}{n}\|\nu\|_1 \le \varepsilon$, i.e., $x^\star$ is feasible. Thus we have $\|\hat{x}\|_{\mathcal{K}} \le \|x^\star\|_{\mathcal{K}} \le 1$, since $x^\star \in \mathcal{K}$. The inequality $\|\hat{x}\|_{\mathcal{K}} \le 1$ implies that $\hat{x} \in \mathcal{K}$.

3.5 Proof of Theorem 3.1

Recall the definition of the OWL norm,
$$\Omega_w(x) = \sum_{i=1}^{p} w_i\, |x|_{[i]},$$
where $|x|_{[1]}, \dots, |x|_{[p]}$ are the magnitudes of the elements of $x$ in decreasing order and $w_1 \ge w_2 \ge \cdots \ge w_p$ is a non-increasing sequence of weights. The OWL norm satisfies
$$\bar{w}\, \|x\|_1 \ \le\ \Omega_w(x) \ \le\ w_1\, \|x\|_1,$$
where $\bar{w} := p^{-1}\sum_{i=1}^p w_i$. This is easily verified by minimizing or maximizing the OWL norm subject to a fixed $\ell_1$ norm. We now prove Theorem 3.1.

Proof. (Theorem 3.1) Since the signal generating the measurements is assumed to satisfy $\|x^\star\|_1 \le \sqrt{s}$, we first need to construct an OWL ball that contains all $x \in \mathbb{R}^p$ with $\|x\|_1 \le \sqrt{s}$. Let
$$\mathcal{K} = \{x \in \mathbb{R}^p : \Omega_w(x) \le w_1 \sqrt{s}\}.$$
Because $\Omega_w(x) \le w_1 \|x\|_1$, all vectors satisfying $\|x\|_1 \le \sqrt{s}$ belong to $\mathcal{K}$. Also note that, because $\Omega_w(x)$ is a norm and $\mathcal{K}$ is a ball of this norm, the Minkowski functional $\|x\|_{\mathcal{K}}$ is proportional to $\Omega_w(x)$.

The quantity $\mathbb{E} \sup_{u \in \mathcal{K} - \mathcal{K}} |\langle C^T g, u\rangle|$ in (46), called the width of $\mathcal{K}$, satisfies
$$\mathbb{E} \sup_{u \in \mathcal{K} - \mathcal{K}} |\langle C^T g, u\rangle| = \mathbb{E} \sup_{u \in \mathcal{K} - \mathcal{K}} |\langle g, C u\rangle|.$$
Note that
$$\|C u\|_1 \ \le\ \|C\|_1 \|u\|_1 \ \le\ \frac{\|C\|_1}{\bar{w}}\, \Omega_w(u).$$
The triangle inequality and the definition of $\mathcal{K}$ imply that, for any $u \in \mathcal{K} - \mathcal{K}$, $\Omega_w(u) \le 2 w_1 \sqrt{s}$; thus we have
$$\|C u\|_1 \ \le\ \frac{2\, \|C\|_1\, w_1}{\bar{w}} \sqrt{s}.$$
The width can then be bounded as
$$\mathbb{E} \sup_{u \in \mathcal{K} - \mathcal{K}} |\langle g, C u\rangle| \ \le\ \mathbb{E} \sup_{\{v:\ \|v\|_1 \le 2\|C\|_1 w_1 \sqrt{s}/\bar{w}\}} |\langle g, v\rangle|.$$
The $v$ that maximizes the right-hand side places mass $2\|C\|_1 w_1 \sqrt{s}/\bar{w}$ on the largest element of $g$ (in magnitude) and zero on every other element. This yields the bound
$$\mathbb{E} \sup_{u \in \mathcal{K} - \mathcal{K}} |\langle g, C u\rangle| \ \le\ \frac{2\, \|C\|_1\, w_1 \sqrt{s}}{\bar{w}}\; \mathbb{E} \max_{i=1,\dots,q} |g_i|. \qquad (47)$$
Using Jensen's inequality, the square of the expectation in (47) can be bounded as
$$\left(\mathbb{E} \max_{i=1,\dots,q} |g_i|\right)^2 \ \le\ \mathbb{E} \max_{i=1,\dots,q} |g_i|^2 \ \le\ \bigl(\sqrt{2\log q} + 1\bigr)^2,$$
where the second inequality comes from a chi-square tail bound (see Lemma 3.2 in [17]). Note that, since $q \ge 2$, $\sqrt{2\log q} + 1 < 2\sqrt{2\log q}$. Putting everything together, we obtain the bound
$$\mathbb{E} \sup_{u \in \mathcal{K} - \mathcal{K}} |\langle g, C u\rangle| \ \le\ \frac{4\sqrt{2}\, \|C\|_1\, w_1}{\bar{w}} \sqrt{s \log q}.$$
Theorem 3.1 now follows immediately from Theorem 3.4, above.
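As an end-to-end sanity check of the clustering and recovery behavior established above (our own experiment, not part of the paper), the sketch below solves the Lagrangian formulation (2) by proximal gradient descent on a design with duplicated columns, restating the PAV-based prox sketched in Section 1; the OSCAR parameters, iteration count, and problem sizes are arbitrary.

```python
import numpy as np

def prox_owl(u, w):
    # Sorted soft-threshold + PAV projection (see the sketch in Section 1).
    order = np.argsort(-np.abs(u))
    z = np.abs(u)[order] - w
    sums, lens = [], []
    for val in z:
        s, c = val, 1
        while sums and sums[-1] / lens[-1] <= s / c:
            s += sums.pop(); c += lens.pop()
        sums.append(s); lens.append(c)
    proj = np.concatenate([np.full(c, max(s / c, 0.0)) for s, c in zip(sums, lens)])
    x = np.empty_like(proj)
    x[order] = proj
    return np.sign(u) * x

rng = np.random.default_rng(4)
n, q, s = 300, 100, 5
B = rng.standard_normal((n, q))
A = np.hstack([B, B[:, :20]])            # p = 120; columns 0..19 are duplicated
p = A.shape[1]
x_star = np.zeros(p)
x_star[:s] = 1.0
x_star[q:q + s] = 1.0                    # duplicated pairs carry equal weight
y = A @ x_star + 0.01 * rng.standard_normal(n)

lam1, lam2 = 0.1, 0.01                   # OSCAR weights, so Delta = lam2 > 0
w = lam1 + lam2 * (p - np.arange(1, p + 1))
L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
x = np.zeros(p)
for _ in range(2000):                    # proximal gradient (ISTA) on (2)
    x = prox_owl(x - A.T @ (A @ x - y) / L, w / L)

print("duplicated pairs clustered:", np.allclose(x[:20], x[q:q + 20]))
print("relative error:", np.linalg.norm(x - x_star) / np.linalg.norm(x_star))
```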
4 Conclusions

In this paper, we have studied ordered weighted $\ell_1$ (OWL) regularization for sparse estimation problems with strongly correlated variables. We have proved sufficient conditions under which the OWL regularizer clusters the coefficient estimates, based on the correlation/colinearity of the variables in the design matrix. We have also characterized the statistical performance of OWL regularization for generative models in which certain clusters of regression variables are strongly (even perfectly) correlated, but variables in different clusters are uncorrelated. Essentially, we showed that, by using OWL regularization, we pay no price (in terms of the number of measurements) for the presence of strongly correlated variables. Future work will include the experimental evaluation of OWL regularization and its application to other problems, such as logistic regression.

References
[1] H. Bauschke and P. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, Springer, 2011.

[2] M. Bogdan, E. van den Berg, W. Su, and E. Candès, "Statistical estimation and testing via the sorted ℓ1 norm," available at http://arxiv.org/abs/1310.1969, 2013.

[3] M. Bogdan, E. van den Berg, C. Sabatti, W. Su, and E. Candès, "SLOPE – adaptive variable selection via convex optimization," available at http://arxiv.org/abs/1407.3824, 2014.

[4] H. Bondell and B. Reich, "Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR," Biometrics, vol. 64, pp. 115–123, 2007.

[5] P. Bühlmann, P. Rütimann, S. van de Geer, and C.-H. Zhang, "Correlated variables in regression: Clustering and sparse estimation," Journal of Statistical Planning and Inference, pp. 1835–1858, 2013.

[6] E. Candès, J. Romberg, and T. Tao, "Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information," IEEE Transactions on Information Theory, vol. 52, pp. 489–509, 2006.

[7] E. Candès and T. Tao, "The Dantzig selector: Statistical estimation when p is much larger than n," Annals of Statistics, vol. 35, pp. 2313–2351, 2007.

[8] V. Chandrasekaran, B. Recht, P. Parrilo, and A. Willsky, "The convex geometry of linear inverse problems," Foundations of Computational Mathematics, vol. 12, pp. 805–849, 2012.

[9] H. Dalton, "The measurement of the inequality of incomes," The Economic Journal, vol. 30, pp. 348–361, 1920.

[10] D. Donoho, "Compressed sensing," IEEE Transactions on Information Theory, vol. 52, pp. 1289–1306, 2006.

[11] C. Genovese, J. Jin, L. Wasserman, and Z. Yao, "A comparison of the LASSO and marginal regression," Journal of Machine Learning Research, pp. 2107–2143, 2012.

[12] J. Haupt and R. Nowak, "Signal reconstruction from noisy random projections," IEEE Transactions on Information Theory, vol. 52, pp. 4036–4048, 2006.

[13] M. Jaggi, "Revisiting Frank-Wolfe: Projection-free sparse convex optimization," Proceedings of the 30th International Conference on Machine Learning, pp. 427–435, 2013.

[14] J. Jia and B. Yu, "On model selection consistency of the elastic net when p ≫ n," Statistica Sinica, vol. 20, pp. 595–611, 2010.

[15] N. Meinshausen and P. Bühlmann, "Stability selection," Journal of the Royal Statistical Society (B), pp. 417–473, 2010.

[16] A. Pigou, Wealth and Welfare, Macmillan, London, 1912.

[17] N. Rao, B. Recht, and R. Nowak, "Tight measurement bounds for exact recovery of structured sparse signals," Proceedings of AISTATS, 2012.

[18] R. T. Rockafellar, Convex Analysis, Princeton University Press, 1970.

[19] R. Shah and R. Samworth, "Variable selection with error control: Another look at stability selection," Journal of the Royal Statistical Society (B), pp. 55–80, 2013.

[20] X. Shen and H.-C. Huang, "Grouping pursuit through a regularization solution surface," Journal of the American Statistical Association, pp. 727–739, 2010.

[21] R. Vershynin, "Estimation in high dimensions: A geometric perspective," available at http://arxiv.org/abs/1405.5103, 2014.

[22] X. Zeng and M. Figueiredo, "Decreasing weighted sorted ℓ1 regularization," IEEE Signal Processing Letters, vol. 21, pp. 1240–1244, 2014.

[23] X. Zeng and M. Figueiredo, "The atomic norm formulation of OSCAR regularization with application to the Frank-Wolfe algorithm," Proceedings of the European Signal Processing Conference, Lisbon, Portugal, 2014.

[24] L. Zhong and J. Kwok, "Efficient sparse modeling with automatic feature grouping," Proceedings of the 28th International Conference on Machine Learning, 2011.