Structured Sparse Regression via Greedy Hard-Thresholding
Prateek Jain
Microsoft Research, India
Nikhil Rao
Technicolor R & I, Los Altos, CA
Inderjit Dhillon
University of Texas at Austin
May 30, 2016
Abstract
Several learning applications require solving high-dimensional regression problems where the relevant features belong to a small number of (overlapping) groups. For very large datasets and under standard sparsity constraints, hard thresholding methods have proven to be extremely efficient, but such methods require NP-hard projections when dealing with overlapping groups. In this paper, we show that such NP-hard projections can not only be avoided by appealing to submodular optimization, but that the resulting methods come with strong theoretical guarantees even in the presence of poorly conditioned data (i.e., say when two features have correlation ≥ 0.99).

1 Introduction

High dimensional problems where the regressor belongs to a small number of groups play a critical role in many machine learning and signal processing applications, such as computational biology [9] and multitask learning [14]. In most of these cases the groups overlap, i.e., the same feature can belong to multiple groups. For example, gene pathways overlap in computational biology applications, and parent-child pairs of wavelet transform coefficients overlap in signal processing applications.

The existing state-of-the-art methods for solving such group sparsity structured regression problems can be categorized into two broad classes: a) convex relaxation based methods, and b) iterative hard thresholding (IHT) or greedy methods. In practice, IHT methods tend to be significantly more scalable than the (group-)lasso style methods that solve a convex program. However, these methods require a certain projection operator which in general is NP-hard to compute, and often simple heuristics are used instead, with relatively weak theoretical guarantees. Moreover, existing guarantees for both classes of methods require relatively restrictive assumptions on the data, like the Restricted Isometry Property or variants thereof [2, 8, 18, 14], that are unlikely to hold in most common applications. In fact, even under such assumptions, the group sparsity based convex programs offer at most polylogarithmic gains over standard sparsity based methods [18].

Concretely, let us consider the following linear model:

    y = X w* + β,    (1)

where β ∼ N(0, λ²I), X ∈ R^{n×p}, each row of X is sampled i.i.d. so that x_i ∼ N(0, Σ), 1 ≤ i ≤ n, and w* is a k*-group sparse vector, i.e., w* can be expressed in terms of only k* groups G_j ⊆ [p].

The existing analyses for both convex as well as hard thresholding based methods require κ = σ_1/σ_p ≤ c, where c is an absolute constant (say 3) and σ_i is the i-th largest eigenvalue of Σ. This is a significantly restrictive assumption, as it requires all the features to be nearly independent of each other. For example, if features 1 and 2 have correlation larger than, say,
0.99, then the restriction on κ required by the existing results does not hold.

Moreover, in this setting (i.e., when κ = O(1)), the number of samples required to exactly recover w* (with λ = 0) is n = Ω(s + k* log M) [18], where s is the maximum support size of a union of k* groups and M is the number of groups. In contrast, if one were to directly use sparse regression techniques (ignoring group sparsity altogether), the number of samples is n = Ω(s log p). Hence, even in the restricted setting of κ = O(1), group-sparse regression improves upon standard sparse regression only by logarithmic factors.

Greedy, iterative hard thresholding (IHT) methods have been considered for group sparse regression problems, but they involve NP-hard projections onto the constraint set [3]. While this can be circumvented using approximate projections, the guarantees they provide are along the same lines as the ones that exist for convex methods.

In this paper, we show that IHT schemes with approximate projections for the group sparsity problem yield much stronger guarantees. Specifically, our results hold for arbitrarily large κ and arbitrary group structures. In particular, using IHT with greedy projections, we show that n = Ω((s log(1/ε) + κ² k* log M) log(1/ε)) samples suffice to recover an ε-approximation to w* when λ = 0. On the other hand, IHT for standard sparse regression [11] requires n = Ω(κ² s log p). Moreover, for general noise variance λ², our method recovers ŵ such that ‖ŵ − w*‖ ≤ ε + λ·κ·√((s + κ² k* log M)/n). In contrast, the existing state-of-the-art result for IHT for group sparsity [5] guarantees ‖ŵ − w*‖ ≤ λ·√(s + k* log M) only for κ ≤
3, i.e., ŵ is not a consistent estimator of w* even for a small condition number κ.

Our analysis is based on an extension of the sparse regression result of [11], which requires exact projections. A critical challenge in the case of overlapping groups, however, is that projection onto the set of group-sparse vectors is NP-hard in general. To alleviate this issue, we use the connection between submodularity and overlapping group projections, and show that a greedy selection based projection is good enough. The main contribution of this work is to carefully combine the greedy projection procedure with hard thresholding iterates so as to guarantee convergence to the global optimum, as long as enough i.i.d. data points are generated from model (1).

Moreover, the simplicity of our hard thresholding operator allows us to easily extend it to more complicated sparsity structures. In particular, we show that the methods we propose can be generalized to the sparse overlapping group setting, and to hierarchies of (overlapping) groups.

We also provide extensive experiments on both real and synthetic datasets showing that our methods are not only faster than several other approaches, but are also accurate despite performing approximate projections. Indeed, even for poorly-conditioned data, IHT methods are an order of magnitude faster than other greedy and convex methods. We observe a similar phenomenon when dealing with sparse overlapping groups.

Related work: Several papers, notably [6] and references therein, have studied convergence properties of IHT methods for sparse signal recovery under standard RIP conditions. [11] generalized the method to settings where RIP does not hold, and also to the low rank matrix recovery setting. [24] used a similar analysis to obtain results for nonlinear models. However, these techniques apply only to cases where exact projections can be performed onto the constraint set. Forward greedy selection schemes for sparse [22, 10] and group sparse [20] constrained programs have been considered previously, where a single group is added at each iteration. The authors in [2] propose a variant of CoSaMP to solve problems of interest to us; again, these methods require exact projections.

Several works have studied approximate projections in the context of IHT [4, 19, 7, 13]. However, these results require that the data satisfy
RIP-style conditions, which typically do not hold in real-world regression problems. Moreover, these analyses do not guarantee a consistent estimate of the optimal regressor when the measurements have zero-mean random noise. In contrast, we provide results under a more general RSC/RSS condition, which is weaker [23], and give crisp rates for the error bounds when the noise in the measurements is random.
In this section, we formally set up the group sparsity constrained optimization problem, and then briefly present the IHT algorithm for it. Suppose we are given a set of M groups that can arbitrarily overlap, G = {G_1, ..., G_M} with G_i ⊆ [p], and ∪_{i=1}^M G_i = {1, 2, ..., p}. We let ‖w‖ denote the Euclidean norm of w, and supp(w) the support of w. For any vector w ∈ R^p, [9] defined the overlapping group norm as

    ‖w‖_G := inf Σ_{i=1}^M ‖a_{G_i}‖   s.t.   Σ_{i=1}^M a_{G_i} = w,  supp(a_{G_i}) ⊆ G_i.    (2)

We also introduce the notion of "group support" of a vector and its group-ℓ0 pseudo-norm:

    G-supp(w) := { i : ‖a_{G_i}‖ > 0 },   ‖w‖_G⁰ := inf Σ_{i=1}^M 1{‖a_{G_i}‖ > 0},    (3)

where the a_{G_i} satisfy the constraints of (2), and 1{·} is the indicator function, taking the value 1 if the condition is satisfied and 0 otherwise. For a set of groups G̃, supp(G̃) = ∪_{i ∈ G̃} G_i; similarly, for an index set S, G-supp(S) = G-supp(w_S).

Suppose we are given a function f: R^p → R and M groups G = {G_1, ..., G_M}. The goal is to solve the following group sparsity structured problem (GS-Opt):

    GS-Opt:   min_w f(w)   s.t.   ‖w‖_G⁰ ≤ k.    (4)

Here f can be thought of as a loss function over the training data, for instance the logistic or least squares loss. In the high dimensional setting, problems of the form (4) are somewhat ill posed and NP-hard in general. Hence, additional assumptions on the loss function f are warranted to guarantee a reasonable solution. We focus on problems where f satisfies the restricted strong convexity and smoothness conditions:

Definition 2.1 (RSC/RSS). The function f: R^p → R satisfies restricted strong convexity (RSC) and restricted strong smoothness (RSS) of order k if

    α_k I ⪯ H(w) ⪯ L_k I,

where H(w) is the Hessian of f at any w ∈ R^p such that ‖w‖_G⁰ ≤ k.

Note that the goal of our algorithms/analysis is to solve the problem for arbitrary 0 < α_k ≤ L_k < ∞. In contrast, adapting existing IHT results to this setting leads to results that require L_k/α_k to be less than a constant (say, 3).

We are especially interested in the linear model described in (1), and in recovering w* consistently (i.e., recovering w* exactly as n → ∞). To this end, we look to solve the following (non-convex) constrained least squares problem:

    GS-LS:   ŵ = argmin_w f(w) := (1/(2n)) ‖y − Xw‖²   s.t.   ‖w‖_G⁰ ≤ k,    (5)

with k ≥ k* a positive, user-defined integer, typically chosen via cross-validation. In this paper, we propose to solve (5) using an iterative hard thresholding (IHT) scheme. IHT methods iteratively take a gradient descent step and then project the resulting vector g onto the (non-convex) constraint set of group sparse vectors, i.e.,

    P_k^G(g) = argmin_w ‖w − g‖   s.t.   ‖w‖_G⁰ ≤ k.    (6)

Algorithm 1 details the IHT procedure for the group sparsity problem (4). Throughout the paper we consider the same high-level procedure, but use different projection operators P̂_k^G(·) for different settings of the problem.
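As a small, concrete illustration of the group-support notation above (the group structure and variable names here are invented for the example), the following snippet builds a vector that is 1-group sparse even though its coordinate support contains four elements:

```python
import numpy as np

p = 10
groups = [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]   # overlapping groups G_1, G_2, G_3

# w is supported on G_2 only, so G-supp(w) = {2} and ||w||_G^0 = 1,
# even though |supp(w)| = 4.
w = np.zeros(p)
w[groups[1]] = [1.0, -2.0, 0.5, 3.0]

support = np.flatnonzero(w)                     # supp(w) = {3, 4, 5, 6}
print(support, set(support) <= set(groups[1]))  # the single group G_2 covers the support
```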
Algorithm 1 IHT for Group-sparsity
Input: data y, X, parameter k, iterations T, step size η
Initialize: w_0 ∈ R^p, an arbitrary k-group sparse vector
for t = 0, 1, 2, ..., T − 1 do
    g_t = w_t − η ∇f(w_t)
    w_{t+1} = P̂_k^G(g_t), where P̂_k^G(·) performs an (approximate) projection onto the constraint set
end for
Output: w_T
Algorithm 2 Greedy Projection
Require: g ∈ R^p, parameter k̃, groups G
û = 0, v = g, Ĝ = {}
for t = 1, 2, ..., k̃ do
    Find G⋆ = argmax_{G ∈ G \ Ĝ} ‖v_G‖
    Ĝ = Ĝ ∪ {G⋆}
    û = û + v_{G⋆}
    v = v − v_{G⋆}
end for
Output: û =: P̂_k̃^G(g), Ĝ = G-supp(û)

Suppose we are given a vector g ∈ R^p which needs to be projected onto the constraint set ‖u‖_G⁰ ≤ k (see (6)). Solving (6) is NP-hard when G contains arbitrarily overlapping groups. To overcome this, P_k^G(·) can be replaced by an approximate operator P̂_k^G(·) (the projection step of Algorithm 1). Indeed, the projection problem reduces to a submodular optimization problem [3], for which the standard greedy procedure can be used (Algorithm 2). For completeness, we detail this in Appendix A, where we also prove the following:
Lemma 2.2. Given an arbitrary vector g ∈ R^p, suppose we obtain û, Ĝ as the output of Algorithm 2 with input g and target group sparsity k̃. Let u* = P_k^G(g) be as defined in (6). Then,

    ‖û − g‖² ≤ e^{−k̃/k} ‖(g)_{supp(u*)}‖² + ‖u* − g‖²,

where e is the base of the natural logarithm.

Note that the term with the exponent in Lemma 2.2 approaches 0 as k̃ increases. Increasing k̃, however, implies that more samples are needed for the recovery of w*. Hence, this lemma hints at the possibility of trading off sample complexity for better accuracy, despite the projections being approximate; see Section 3 for more details. Algorithm 2 can be applied to any group structure G, and is extremely efficient.

IHT methods can be improved by the incorporation of "corrections" after each projection step. This merely entails adding the following step to Algorithm 1 after the projection:

    w_{t+1} = argmin_{w̃} f(w̃)   s.t.   supp(w̃) = supp(P̂_k^G(g_t)).

When f(·) is the least squares loss, as we consider here, this step can be solved efficiently using Cholesky decompositions (via the backslash operator in MATLAB). We refer to this procedure as IHT-FC. Fully corrective methods in greedy algorithms typically yield significant improvements, both theoretically and in practice [11]. A sketch combining the greedy projection with the IHT iterations and the fully corrective step is given below.
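The following is a minimal Python sketch of Algorithm 2 and of the IHT/IHT-FC loop for the least squares loss. It is only meant to make the procedure concrete: the function names are ours, and the restricted least squares refit uses numpy.linalg.lstsq in place of the Cholesky/backslash solve mentioned above.

```python
import numpy as np

def greedy_group_projection(g, groups, k):
    """Algorithm 2: greedily pick k groups with the largest residual norm.
    Returns the approximate projection u (equal to g on the selected groups)
    and the list of selected group indices."""
    v, u, selected = g.copy(), np.zeros_like(g), []
    for _ in range(k):
        remaining = [i for i in range(len(groups)) if i not in selected]
        best = max(remaining, key=lambda i: np.linalg.norm(v[groups[i]]))
        selected.append(best)
        u[groups[best]] += v[groups[best]]   # keep the residual mass on the chosen group
        v[groups[best]] = 0.0                # and remove it from the residual
    return u, selected

def iht_group_sparse(X, y, groups, k, step, iters, fully_corrective=True):
    """Algorithm 1 with greedy projections for f(w) = ||y - Xw||^2 / (2n)."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / n
        u, selected = greedy_group_projection(w - step * grad, groups, k)
        if fully_corrective:
            # IHT-FC: re-fit the least squares problem restricted to the selected support.
            S = np.unique(np.concatenate([groups[i] for i in selected]))
            w = np.zeros(p)
            w[S] = np.linalg.lstsq(X[:, S], y, rcond=None)[0]
        else:
            w = u
    return w
```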
We now provide theoretical guarantees for Algorithm 1 when applied to the overlapping group sparsity problem (4); we then specialize the results to the linear regression model (5).

Theorem 3.1. Let w* = argmin_{w: ‖w‖_G⁰ ≤ k*} f(w), and let f satisfy RSC/RSS with constants α_{k'} and L_{k'}, respectively (see Definition 2.1), where k' = k + k*. Set k = 32 (L_{k'}/α_{k'})² · k* · log(L_{k'}‖w*‖/(α_{k'} ε)). Suppose we run Algorithm 1 with η = 1/L_{k'} and projections computed according to Algorithm 2. Then the following holds after t + 1 iterations:

    ‖w_{t+1} − w*‖ ≤ (1 − α_{k'}/(2L_{k'})) · ‖w_t − w*‖ + γ + (α_{k'}/L_{k'}) · ε,

where γ = (2/L_{k'}) · max_{S: |G-supp(S)| ≤ k} ‖(∇f(w*))_S‖. In particular, the output of the T = O((L_{k'}/α_{k'}) · log(‖w*‖/ε))-th iteration of Algorithm 1 satisfies:

    ‖w_T − w*‖ ≤ ε + 10 (L_{k'}/α_{k'}) · γ.

The proof uses the fact that Algorithm 2 performs approximately good projections; the result follows by combining this with standard results from convex analysis (RSC/RSS) and a careful setting of the parameters. We prove this result in Appendix B.
Remarks
Theorem 3.1 shows that Algorithm 1 recovers w* up to O((L_{k'}/α_{k'})·γ) error. If ‖argmin_w f(w)‖_G⁰ ≤ k, then γ = 0. In general, our result has an additive error term, which is weaker than what one can obtain for a convex optimization problem. However, for typical statistical problems we show that γ is small and yields nearly optimal statistical generalization error (see, for example, Theorem 3.2).

Theorem 3.1 also displays an interesting interplay between the desired accuracy ε and the penalty γ we pay as a result of performing approximate projections. As ε is made small, k becomes large, and thus so does γ. Conversely, we can let ε be large so that the projections are coarse, but then incur a smaller penalty via the γ term; since the projections are not very accurate in this case, we can also get away with fewer iterations. Thus, there is a tradeoff between the estimation error ε and the model selection error γ. Note also that the dependence of k on 1/ε is only logarithmic.

We stress that our results do not hold for arbitrary approximate projection operators. Our proof critically uses the greedy scheme (Algorithm 2), via Lemma 2.2.
Also, as discussed in Section 4, the proof easily extends to other structured sparsity sets that allow such greedy selection steps. For the standard sparsity case, i.e., when the groups are singletons, we obtain a result similar to [11]; however, our proof is significantly simpler and allows for a significantly easier setting of η.

We next proceed to the standard linear regression model considered in (5). To the best of our knowledge, this is the first consistency result for overlapping group sparsity problems, especially when the data can be arbitrarily conditioned. Recall that σ_max (σ_min) is the maximum (minimum) singular value of Σ, and κ := σ_max/σ_min is the condition number of Σ.
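For intuition about the setting covered by the next result, the following sketch generates data from model (1) with a covariance Σ of prescribed condition number κ; the particular construction of Σ and all names are illustrative choices of ours, not taken from the paper.

```python
import numpy as np

def generate_group_sparse_data(n, p, groups, k_star, kappa, noise_std, rng):
    """Draw (X, y, w_star) from model (1): rows of X are N(0, Sigma) with
    cond(Sigma) = kappa, w_star is supported on k_star groups, y = X w_star + noise."""
    eigvals = np.linspace(kappa, 1.0, p)                  # eigenvalues of Sigma
    Q, _ = np.linalg.qr(rng.standard_normal((p, p)))      # random orthogonal basis
    Sigma_sqrt = Q @ np.diag(np.sqrt(eigvals)) @ Q.T

    X = rng.standard_normal((n, p)) @ Sigma_sqrt          # rows ~ N(0, Sigma)
    w_star = np.zeros(p)
    for i in rng.choice(len(groups), size=k_star, replace=False):
        w_star[groups[i]] = rng.uniform(-1, 1, size=len(groups[i]))
    y = X @ w_star + noise_std * rng.standard_normal(n)
    return X, y, w_star
```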
Theorem 3.2. Let the observations y follow the model in (1). Suppose w* is k*-group sparse and let f(w) := (1/(2n))‖Xw − y‖². Let the number of samples satisfy

    n ≥ Ω((s + κ²·k*·log M) · log(κ/ε)),

where s = max_{w: ‖w‖_G⁰ ≤ k} |supp(w)|. Then, running Algorithm 1 with k = 8κ² k*·log(κ/ε) and η = 1/(4σ_max) guarantees the following after T = Ω(κ·log(κ‖w*‖/ε)) iterations, with probability ≥ 1 − 1/n:

    ‖w_T − w*‖ ≤ λ·κ·√((s + κ² k* log M)/n) + 2ε.

Remarks

One can ignore the group sparsity constraint and instead recover the (at most) s-sparse vector w* using IHT methods for ℓ0 optimization [11]. However, the corresponding sample complexity is n ≥ κ²·s·log p. Hence, for an ill-conditioned Σ, group sparsity based methods provide a significantly stronger result, especially when the groups overlap significantly.

Note that the number of samples required increases only logarithmically with the accuracy 1/ε. Theorem 3.2 thus displays an interesting phenomenon: by obtaining more samples, one can achieve a smaller recovery error while incurring a larger approximation error (since we choose more groups).

Our proof critically requires that, when restricted to group sparse vectors, the least squares objective f(w) = (1/(2n))‖y − Xw‖² is strongly convex as well as strongly smooth:
Lemma 3.3. Let X ∈ R^{n×p} be such that each row x_i ∼ N(0, Σ). Let w ∈ R^p be k-group sparse over groups G = {G_1, ..., G_M}, i.e., ‖w‖_G⁰ ≤ k, and let s = max_{w: ‖w‖_G⁰ ≤ k} |supp(w)|. Let the number of samples satisfy n ≥ Ω(C(k log M + s)). Then the following holds with probability ≥ 1 − 1/n:

    (1 − 1/√C)·σ_min·‖w‖² ≤ (1/n)‖Xw‖² ≤ (1 + 1/√C)·σ_max·‖w‖².

We prove Lemma 3.3 in Appendix C; Theorem 3.2 then follows by combining Lemma 3.3 with Theorem 3.1. Note that in the least squares case these are Restricted Eigenvalue conditions on the matrix X, which, as explained in [23], are much weaker than traditional RIP assumptions on the data. In particular, RIP requires nearly zero correlation between any two features, while our assumption allows for arbitrarily high correlations, albeit at the cost of a larger number of samples.

Exact projections P_k^G(·). We now consider the setting where P_k^G(·) can be computed exactly and efficiently for any k. Examples include the dynamic programming based method of [3] for certain group structures, or Algorithm 2 when the groups do not overlap; in the non-overlapping case, the exact projection simply retains the k groups with the largest norm, as in the sketch below.
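A minimal sketch of this non-overlapping exact projection (function and variable names are our own):

```python
import numpy as np

def exact_group_projection_nonoverlapping(g, groups, k):
    """Exact P^G_k(g) when the groups partition [p]: keep the k groups
    with the largest Euclidean norm and zero out the rest."""
    norms = np.array([np.linalg.norm(g[G]) for G in groups])
    keep = np.argsort(norms)[-k:]            # indices of the k largest-norm groups
    u = np.zeros_like(g)
    for i in keep:
        u[groups[i]] = g[groups[i]]
    return u
```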
Since the exact projection operator can be arbitrary, our proof of Theorem 3.1 does not apply directly in this case. However, by exploiting the structure of hard thresholding, we show that a similar result still holds:

Theorem 3.4. Let w* = argmin_{w: ‖w‖_G⁰ ≤ k*} f(w). Let f satisfy RSC/RSS with constants α_{k+k*} and L_{k+k*}, respectively (see Definition 2.1). Then the following holds for the T = O((L_{k'}/α_{k'})·log(‖w*‖/ε))-th iterate of Algorithm 1 (with η = 1/L_{k+k*}) when P̂_k^G(·) = P_k^G(·) is the exact projection:

    ‖w_T − w*‖ ≤ ε + 10 (L_{k'}/α_{k'}) · γ,

where k' = 2k + k*, k = O((L_{k'}/α_{k'})²·k*), and γ = (2/L_{k'})·max_{S: |G-supp(S)| ≤ k} ‖(∇f(w*))_S‖.

See Appendix D for a detailed proof. Note that, unlike for the greedy projection method (see Theorem 3.1), here k is independent of ε. Also, in the linear model, the above result again leads to a consistent estimate of w*.

Sparse overlapping groups (SoG). The SoG model generalizes the overlapping group sparse model by allowing the selected groups themselves to be sparse. Given positive integers k₁, k₂ and a set of groups G, IHT for SoG performs projections onto the following set:

    C_sog := { w = Σ_{i=1}^M a_{G_i} :  ‖w‖_G⁰ ≤ k₁,  ‖a_{G_i}‖_0 ≤ k₂ }.    (7)

As in the case of the overlapping group lasso, projection onto (7) is NP-hard in general. Motivated by our greedy approach in Section 2, we propose a similar method for SoG (see Algorithm 3); the algorithm greedily selects the groups whose top-k₂ elements by magnitude are large.

Below, we show that IHT (Algorithm 1) combined with the greedy projection (Algorithm 3) indeed converges to the optimal solution. Moreover, our experiments (Section 5) reveal that this method, when combined with full corrections, yields highly accurate results significantly faster than the state-of-the-art.
Algorithm 3 Greedy Projections for SoG
Require: g ∈ R^p, parameters k₁, k₂, groups G
û = 0, v = g, Ĝ = {}, Ŝ = {}
for t = 1, 2, ..., k₁ do
    Find G⋆ = argmax_{G ∈ G \ Ĝ} ‖v_G‖
    Ĝ = Ĝ ∪ {G⋆}
    Let S be the indices of the top k₂ entries of v_{G⋆} by magnitude
    Define v̄ ∈ R^p with v̄_S = (v_{G⋆})_S and v̄_i = 0 for i ∉ S
    Ŝ = Ŝ ∪ S
    û = û + v̄
    v = v − v̄
end for
Output: û, Ĝ, Ŝ
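A direct transcription of Algorithm 3, again only a sketch with names of our own choosing:

```python
import numpy as np

def greedy_sog_projection(g, groups, k1, k2):
    """Algorithm 3: greedily pick k1 groups, keeping only the k2
    largest-magnitude entries of the residual within each chosen group."""
    v, u = g.copy(), np.zeros_like(g)
    chosen_groups, chosen_coords = [], []
    for _ in range(k1):
        remaining = [i for i in range(len(groups)) if i not in chosen_groups]
        best = max(remaining, key=lambda i: np.linalg.norm(v[groups[i]]))
        chosen_groups.append(best)
        G = np.asarray(groups[best])
        top = G[np.argsort(np.abs(v[G]))[-k2:]]   # top-k2 entries of v on G by magnitude
        chosen_coords.extend(top.tolist())
        u[top] += v[top]
        v[top] = 0.0
    return u, chosen_groups, chosen_coords
```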
We suppose that there exists a set of supports S_{k*} such that supp(w*) ∈ S_{k*}. We then obtain the following result, proved in Appendix E:

Theorem 4.1. Let w* = argmin_{w: supp(w) ∈ S_{k*}} f(w), where S_{k*} ⊆ S_k ⊆ {0, 1}^p is a fixed set parameterized by k*. Let f satisfy RSC/RSS with constants α_k and L_k, respectively. Furthermore, assume that there exists an approximately good projection operator for the set defined in (7) (for example, Algorithm 3). Then the following holds for the T = O((L_{k+k*}/α_{k+k*})·log(‖w*‖/ε))-th iterate of Algorithm 1:

    ‖w_T − w*‖ ≤ ε + 10 (L_{k+k*}/α_{k+k*}) · γ,

where k = O((L_{k+k*}/α_{k+k*})²·β·k*), with β the approximation factor of the projection operator (see Appendix E), and γ = (2/L_{k+k*})·max_{S ∈ S_k} ‖(∇f(w*))_S‖.

Remarks
Similar to Theorem 3.1, we see that there is a tradeoff between obtaining accurate projections (small ε) and the model mismatch γ. Specifically, in this case one can obtain a small ε by increasing k₁, k₂ in Algorithm 3; however, this means that we select a large number of groups, and subsequently γ increases.

A result similar to Theorem 3.2 can be obtained for the case when f is the least squares loss. Specifically, the sample complexity evaluates to n ≥ κ(k₁* log M + κ k₁* k₂* log(max_i |G_i|)). We present the results for least squares in Appendix F.

An interesting extension of the SoG case is that of a hierarchy of overlapping, sparsely activated groups. When the groups at each level do not overlap, this reduces to the case considered in [12]. However, our theory shows that when a corresponding approximate projection operator is defined for the hierarchical overlapping case (extending Algorithm 3), IHT methods can be used to obtain the solution in an efficient manner.

In this section, we empirically compare and contrast our proposed group IHT methods against the existing approaches for the overlapping group sparsity problem. At a high level, we observe that our proposed
variants of IHT indeed outperform the existing state-of-the-art methods for group-sparse regression in terms of time complexity. Encouragingly, IHT also performs competitively with the existing methods in terms of accuracy. In fact, our results on the breast cancer dataset show a 10% relative improvement in accuracy over existing methods.

Greedy methods for group sparsity have been shown to outperform proximal point schemes, and hence we restrict our comparison to greedy procedures. We compared four methods: our algorithm with (IHT-FC) and without (IHT) the fully corrective step, the Frank-Wolfe (FW) method [21], CoGEnT [17], and Group OMP (GOMP) [20]. All relevant hyper-parameters were chosen via a grid search, and experiments were run on a MacBook laptop with a 2.5 GHz processor and 16 GB memory. Additional experimental results are presented in Appendix G.

Figure 1: (From left to right) Objective value as a function of time for various methods, when data is well conditioned and poorly conditioned. The latter two figures show the phase transition plots for poorly conditioned data, for IHT and GOMP respectively.
[Figure 2: log(MSE) recovery plots for IHT, IHT-FC, CoGEnT and FW (left panels), and the breast cancer classification results summarized in the table on the right.]

Method    Error %    Time (sec)
FW        29.41      6.4538
IHT       27.94
GOMP      25.01      0.2891
CoGEnT    23.53      0.1414
IHT-FC

Synthetic Data, well conditioned: We first compared various greedy schemes for solving the overlapping group sparsity problem on synthetic data. We generated M = 1000 groups of contiguous indices of size 25; the last 5 entries of one group overlap with the first 5 of the next. We randomly set 50 of these groups to be active, populated by uniform
[−1, 1] entries. This yields w⋆ ∈ R^p with p ≈ 20,000, and X ∈ R^{n×p} with n = 5000 and X_ij i.i.d. ∼ N(0, 1); the noise standard deviation is
λ = 0.1. IHT methods achieve orders of magnitude speedup compared to the competing schemes, and achieve almost the same (final) objective function value despite approximate projections (Figure 1, left). The group structure used here can be constructed as in the short sketch below.
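A sketch of this group construction (the exact indexing used in the paper's experiments may differ):

```python
# Contiguous groups of size 25 overlapping in 5 indices, as described above.
group_size, overlap, M = 25, 5, 1000
stride = group_size - overlap
groups = [list(range(i * stride, i * stride + group_size)) for i in range(M)]
p = groups[-1][-1] + 1   # about 20,000 features for these settings
```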
Synthetic Data, poorly conditioned: Next, we consider the exact same setup, but with each row of X given by x_i ∼ N(0, Σ), where κ = σ_max(Σ)/σ_min(Σ) = 10. Figure 1 (center-left) again shows the advantages of using IHT methods; IHT-FC is about 10 times faster than the next best method, CoGEnT.

We next generate phase transition plots for recovery by our method (IHT) as well as the state-of-the-art GOMP method. We generate vectors in the same vein as the above experiment, with M = 500, B = 15, k = 25, and vary the number of measurements n. Figure 1 (center-right and right) shows the phase transition plots as the measurements and the condition number are varied, for IHT and GOMP respectively. The results are averaged over 10 independent runs. It can be seen that even for condition numbers as high as 200, IHT recovers w* from a moderate number of measurements, whereas GOMP with the same setting is not able to recover w* even once.

Tumor Classification, Breast Cancer Dataset:
We next compare the aforementioned methods on a gene selection problem for breast cancer tumor classification. We use the data from [9], available for download at http://cbio.ensmp.fr/ljacob/. We ran a 5-fold cross validation scheme to choose the parameters η, k and τ over a grid. Figure 2 (right) shows that the vanilla hard thresholding method is competitive despite performing approximate projections, and the method with full corrections obtains the best performance among the methods considered. We randomly chose 15% of the data to test on.

Sparse Overlapping Group Lasso:
Finally, we study the sparse overlapping group (SoG) problem that was introduced and analyzed in [16] (Figure 2). We perform projections as detailed in Algorithm 3. We generated synthetic vectors with 100 groups of size 50, randomly selected 5 groups to be active, and within each active group set only 30 coefficients to be non-zero. The groups themselves were overlapping, with the last 10 entries of one group shared with the first 10 of the next, yielding p ≈ 4,000. We set k = 2k* for the IHT methods.

Conclusions

We proposed a greedy-IHT method that can be applied to regression problems over sets of group sparse vectors. Our proposed solution is efficient and scalable, and comes with fast convergence guarantees under general RSC/RSS style conditions, unlike existing methods. We extended our analysis to handle even more challenging structures like sparse overlapping groups. Our experiments show that IHT methods achieve fast, accurate results even with greedy and approximate projections.
References

[1] Francis Bach. Convex analysis and optimization with submodular functions: A tutorial. arXiv preprint arXiv:1010.4207, 2010.
[2] Richard G. Baraniuk, Volkan Cevher, Marco F. Duarte, and Chinmay Hegde. Model-based compressive sensing. IEEE Transactions on Information Theory, 56(4):1982–2001, 2010.
[3] Nirav Bhan, Luca Baldassarre, and Volkan Cevher. Tractability of interpretability via selection of group-sparse models. In IEEE International Symposium on Information Theory (ISIT), pages 1037–1041, 2013.
[4] Thomas Blumensath. Sampling and reconstructing signals from a union of linear subspaces. IEEE Transactions on Information Theory, 57(7):4660–4671, 2011.
[5] Thomas Blumensath and Mike E. Davies. Sampling theorems for signals from the union of finite-dimensional linear subspaces. IEEE Transactions on Information Theory, 55(4):1872–1882, 2009.
[6] Thomas Blumensath and Mike E. Davies. Normalized iterative hard thresholding: Guaranteed stability and performance. IEEE Journal of Selected Topics in Signal Processing, 4(2):298–309, 2010.
[7] Chinmay Hegde, Piotr Indyk, and Ludwig Schmidt. Approximation algorithms for model-based compressive sensing. IEEE Transactions on Information Theory, 61(9):5129–5147, 2015.
[8] Junzhou Huang, Tong Zhang, and Dimitris Metaxas. Learning with structured sparsity. Journal of Machine Learning Research, 12:3371–3412, 2011.
[9] Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. Group lasso with overlap and graph lasso. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 433–440. ACM, 2009.
[10] Prateek Jain, Ambuj Tewari, and Inderjit S. Dhillon. Orthogonal matching pursuit with replacement. In Advances in Neural Information Processing Systems, pages 1215–1223, 2011.
[11] Prateek Jain, Ambuj Tewari, and Purushottam Kar. On iterative hard thresholding methods for high-dimensional M-estimation. In Advances in Neural Information Processing Systems, pages 685–693, 2014.
[12] Rodolphe Jenatton, Julien Mairal, Francis R. Bach, and Guillaume R. Obozinski. Proximal methods for sparse hierarchical dictionary learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 487–494, 2010.
[13] Anastasios Kyrillidis and Volkan Cevher. Combinatorial selection and least absolute shrinkage via the CLASH algorithm. In IEEE International Symposium on Information Theory (ISIT), pages 2216–2220, 2012.
[14] Karim Lounici, Massimiliano Pontil, Alexandre B. Tsybakov, and Sara van de Geer. Taking advantage of sparsity in multi-task learning. arXiv preprint arXiv:0903.1468, 2009.
[15] George L. Nemhauser, Laurence A. Wolsey, and Marshall L. Fisher. An analysis of approximations for maximizing submodular set functions. Mathematical Programming, 14(1):265–294, 1978.
[16] Nikhil Rao, Christopher Cox, Rob Nowak, and Timothy T. Rogers. Sparse overlapping sets lasso for multitask learning and its application to fMRI analysis. In Advances in Neural Information Processing Systems, pages 2202–2210, 2013.
[17] Nikhil Rao, Parikshit Shah, and Stephen Wright. Forward–backward greedy algorithms for atomic norm regularization. IEEE Transactions on Signal Processing, 63(21):5798–5811, 2015.
[18] Nikhil S. Rao, Ben Recht, and Robert D. Nowak. Universal measurement bounds for structured sparse signal recovery. In International Conference on Artificial Intelligence and Statistics, pages 942–950, 2012.
[19] Parikshit Shah and Venkat Chandrasekaran. Iterative projections for signal identification on manifolds: Global recovery guarantees. In 49th Annual Allerton Conference on Communication, Control, and Computing, pages 760–767. IEEE, 2011.
[20] Grzegorz Swirszcz, Naoki Abe, and Aurelie C. Lozano. Grouped orthogonal matching pursuit for variable selection and prediction. In Advances in Neural Information Processing Systems, pages 1150–1158, 2009.
[21] Ambuj Tewari, Pradeep K. Ravikumar, and Inderjit S. Dhillon. Greedy algorithms for structurally constrained high dimensional problems. In Advances in Neural Information Processing Systems, pages 882–890, 2011.
[22] Joel Tropp and Anna C. Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53(12):4655–4666, 2007.
[23] Sara A. van de Geer and Peter Bühlmann. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3:1360–1392, 2009.
[24] Xiaotong Yuan, Ping Li, and Tong Zhang. Gradient hard thresholding pursuit for sparsity-constrained optimization. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 127–135, 2014.
A Using submodularity to perform projections
While solving (6) is NP-hard in general, the authors in [3] showed that it can be approximately solved using methods from submodular function optimization, which we quickly recap here. First, (6) can be cast in the following equivalent way:

    Ĝ = argmax_{|G̃| ≤ k} Σ_{i ∈ I} g_i²,   where I = ∪_{G ∈ G̃} G.    (8)

Once we have Ĝ, û can be recovered by simply setting û_I = g_I and 0 everywhere else, where I = ∪_{G ∈ Ĝ} G. Next, we have the following result:
Lemma A.1. For a set of groups S ⊆ [M], the function z(S) := Σ_{i ∈ ∪_{j ∈ S} G_j} g_i² is submodular.

Proof. First, recall the definition of a submodular function:
Definition A.2.
Let Q be a finite set, and let z(·) be a real valued function defined on 2^Q, the power set of Q. The function z(·) is said to be submodular if

    z(S) + z(T) ≥ z(S ∪ T) + z(S ∩ T)   for all S, T ⊆ Q.

Now, let S and T be two sets of groups such that S ⊆ T, and let SS = supp(∪_{j∈S} G_j) and TT = supp(∪_{j∈T} G_j); then SS ⊆ TT. Hence, for any group index i,

    z(S ∪ {i}) − z(S) = Σ_{ℓ ∈ SS ∪ supp(G_i)} g_ℓ² − Σ_{ℓ ∈ SS} g_ℓ² = Σ_{ℓ ∈ supp(G_i) \ SS} g_ℓ² ≥ Σ_{ℓ ∈ supp(G_i) \ TT} g_ℓ² = z(T ∪ {i}) − z(T),

where the inequality follows from SS ⊆ TT. This completes the proof.

This result shows that (8) can be cast as a problem of the form

    max_{S ⊆ Q} z(S)   s.t.   |S| ≤ k.    (9)

Algorithm 2, which details the pseudocode for performing approximate projections, corresponds exactly to the greedy algorithm for submodular optimization [1], and this gives us a means to assess the quality of our projections.

A.1 Proof of Lemma 2.2
Proof.
First, from the approximation property of the greedy algorithm [15],

    ‖û‖² ≥ (1 − e^{−k̃/k}) ‖u*‖².    (10)

Also, ‖g − û‖² = ‖g‖² − ‖û‖², because (û)_{supp(û)} = (g)_{supp(û)} and û = 0 otherwise. Using the above two equations, we have:

    ‖g − û‖² ≤ ‖g‖² − ‖u*‖² + e^{−k̃/k}‖u*‖²
             = ‖g − u*‖² + e^{−k̃/k}‖u*‖²
             = ‖g − u*‖² + e^{−k̃/k}‖(g)_{supp(u*)}‖²,    (11)

where both equalities follow from the fact that, by optimality, (u*)_{supp(u*)} = (g)_{supp(u*)}.

B Proof of Theorem 3.1
Proof.
Recall that g_t = w_t − η∇f(w_t) and w_{t+1} = P̂_k^G(g_t). Let supp(w_{t+1}) = S_{t+1}, supp(w*) = S*, I = S_{t+1} ∪ S*, and M = S* \ S_{t+1}. Note that |G-supp(I)| ≤ k + k*. Moreover, (w_{t+1})_{S_{t+1}} = (g_t)_{S_{t+1}} (see Algorithm 2); hence ‖(w_{t+1} − g_t)_{S_{t+1} ∪ S*}‖ = ‖(g_t)_M‖.

Now, using Lemma B.2 with z = (g_t)_I, we have:

    ‖(w_{t+1} − g_t)_I‖² = ‖(g_t)_M‖² ≤ (k*/(k − ẽ_k))·‖(g_t)_{S_{t+1}\S*}‖² + k*ε²/(k − ẽ_k) ≤ (k*/(k − ẽ_k))·‖(w* − g_t)_I‖² + k*ε²/(k − ẽ_k),    (12)

where the first inequality uses M ⊆ S*, so that |G-supp(M)| ≤ |G-supp(S*)| = k*, and the second inequality uses w*_{S_{t+1}\S*} = 0.

Now, using the fact that ‖(w_{t+1} − w*)_I‖ = ‖w_{t+1} − w*‖, along with the triangle inequality, we have:

    ‖w_{t+1} − w*‖ ≤ (1 + √(k*/(k − ẽ_k)))·‖(w* − g_t)_I‖ + √(k*ε²/(k − ẽ_k))    (13)
      ≤ (1 + √(k*/(k − ẽ_k)))·‖(w* − w_t − η(∇f(w*) − ∇f(w_t)))_I‖ + 2η‖(∇f(w*))_{S_{t+1}}‖ + √(k*ε²/(k − ẽ_k))
      ≤ (1 + √(k*/(k − ẽ_k)))·‖(I − ηH_{(I∪S_t),(I∪S_t)}(α))(w_t − w*)_{I∪S_t}‖ + 2η‖(∇f(w*))_{S_{t+1}}‖ + √(k*ε²/(k − ẽ_k))
      ≤ (1 + √(k*/(k − ẽ_k)))·(1 − α_{k+k*}/L_{k+k*})·‖w_t − w*‖ + (2/L_{k+k*})·‖(∇f(w*))_{S_{t+1}}‖ + √(k*ε²/(k − ẽ_k)),    (14)

where α = c·w_t + (1 − c)·w* for some c ∈ (0, 1), H(α) is the Hessian of f evaluated at α, and S_t = supp(w_t). The first inequality follows from the triangle inequality, the second from the mean value theorem, and the third from the RSC/RSS condition and the choice η = 1/L_{k+k*}.

The theorem now follows by setting k = 2((L_{k+k*}/α_{k+k*})² + 1)·k*·log(‖w*‖/ε) and choosing ε appropriately.
Lemma B.1. Let w = P̂_k^G(g) and let S = supp(w). Then, for every I such that S ⊆ I, the following holds: w_I = P̂_k^G(g_I).

Proof.
Let Q = {i_1, i_2, ..., i_k} be the k groups selected when the greedy procedure (Algorithm 2) is applied to g. Then,

    ‖g_{G_{i_j} \ (∪_{1≤ℓ≤j−1} G_{i_ℓ})}‖ ≥ ‖g_{G_i \ (∪_{1≤ℓ≤j−1} G_{i_ℓ})}‖,   for all 1 ≤ j ≤ k and all i ∉ Q.

Moreover, the greedy selection procedure is deterministic. Hence, even if the groups G_i are restricted to lie in a subset of G, the output of the procedure remains exactly the same.
Lemma B.2. Let z ∈ R^p be any vector, let ŵ = P̂_k^G(z), and let w* ∈ R^p be such that |G-supp(w*)| ≤ k*. Let S = supp(ŵ), S* = supp(w*), I = S ∪ S*, and M = S* \ S. Then the following holds:

    ‖z_M‖²/k* − ε²/(k − ẽ_k) ≤ ‖z_{S\S*}‖²/(k − ẽ_k),

where ẽ_k = O(k*·log(‖w*‖/ε)).

Proof. Recall that k groups are added greedily to form S = supp(ŵ). Let Q = {i_1, i_2, ..., i_k} be the k groups selected when the greedy procedure (Algorithm 2) is applied to z. Then,

    ‖z_{G_{i_j} \ (∪_{1≤ℓ≤j−1} G_{i_ℓ})}‖² ≥ ‖z_{G_i \ (∪_{1≤ℓ≤j−1} G_{i_ℓ})}‖²,   for all 1 ≤ j ≤ k, i ∉ Q.

Now, as ∪_{1≤ℓ≤j−1} G_{i_ℓ} ⊆ S for all 1 ≤ j ≤ k, we have:

    ‖z_{G_{i_j} \ (∪_{1≤ℓ≤j−1} G_{i_ℓ})}‖² ≥ ‖z_{G_i \ S}‖²,   for all 1 ≤ j ≤ k, i ∉ Q.

Let G-supp(w*) = {ℓ_1, ..., ℓ_{k*}}. Adding the above inequalities over all ℓ_m with ℓ_m ∉ Q, we get:

    ‖z_{G_{i_j} \ (∪_{1≤ℓ≤j−1} G_{i_ℓ})}‖² ≥ ‖z_{S*\S}‖²/k*,    (15)

where the above inequality also uses the fact that Σ_{ℓ_m ∈ G-supp(w*), ℓ_m ∉ Q} ‖z_{G_{ℓ_m} \ S}‖² ≥ ‖z_{S*\S}‖².

Adding (15) for all ẽ_k + 1 ≤ j ≤ k, we get:

    ‖z_S‖² − ‖z_B‖² ≥ ((k − ẽ_k)/k*)·‖z_{S*\S}‖²,    (16)

where B = ∪_{1≤j≤ẽ_k} G_{i_j}. Moreover, using Lemma 2.2 and the fact that |G-supp(z_{S*})| ≤ k*, we get ‖z_B‖² ≥ ‖z_{S*}‖² − ε². Hence,

    ‖z_M‖²/k* ≤ (‖z_S‖² − ‖z_B‖²)/(k − ẽ_k) ≤ (‖z_S‖² − ‖z_{S*}‖² + ε²)/(k − ẽ_k) ≤ (‖z_{S\S*}‖² + ε²)/(k − ẽ_k).    (17)

The lemma now follows by a simple manipulation of the above inequality.

C Proof of Lemma 3.3
Proof.
Note that

    ‖Xw‖² = Σ_i (x_i^T w)² = Σ_i (z_i^T Σ^{1/2} w)² = ‖Z Σ^{1/2} w‖²,

where Z ∈ R^{n×p} is such that each row z_i ∼ N(0, I) is a standard multivariate Gaussian. Now, using Theorem 1 of [5], and using the fact that Σ^{1/2}w lies in a union of (M choose k) subspaces, each of dimension at most s, we have, with probability ≥ 1 − 1/(M^k·s):

    (1 − 1/√C)‖Σ^{1/2}w‖² ≤ (1/n)‖ZΣ^{1/2}w‖² ≤ (1 + 1/√C)‖Σ^{1/2}w‖².

The result follows by using the definition of σ_min and σ_max.

D Proof of Theorem 3.4
Proof.
Recall that g_t = w_t − η∇f(w_t) and w_{t+1} = P_k^G(g_t). Similar to the proof of Theorem 3.1 (Appendix B), define S_{t+1} = supp(w_{t+1}), S_t = supp(w_t), S* = supp(w*), I = S_{t+1} ∪ S*, J = I ∪ S_t, and M = S* \ S_{t+1}. Note that |G-supp(I)| ≤ k + k* and |G-supp(J)| ≤ 2k + k*.

Now, using Lemma D.1 with z = (g_t)_I, and noting that (g_t)_I is spanned by at most k + k* groups, we have ‖(w_{t+1} − g_t)_I‖² ≤ (k*/k)·‖(w* − g_t)_I‖². The remaining proof closely follows that of Theorem 3.1. That is, using the above inequality along with the triangle inequality, we have:

    ‖w_{t+1} − w*‖ ≤ (1 + √(k*/k))·‖(w* − g_t)_I‖
      ≤ (1 + √(k*/k))·‖(w* − w_t − η(∇f(w*) − ∇f(w_t)))_I‖ + 2η‖(∇f(w*))_{S_{t+1}}‖
      ≤ (1 + √(k*/k))·‖(I − ηH_{J,J}(α))(w_t − w*)_J‖ + 2η‖(∇f(w*))_{S_{t+1}}‖
      ≤ (1 + √(k*/k))·(1 − α_{k+k*}/L_{k+k*})·‖w_t − w*‖ + (2/L_{k+k*})·‖(∇f(w*))_{S_{t+1}}‖,    (18)

where α = c·w_t + (1 − c)·w* for some c ∈ (0, 1), and H(α) is the Hessian of f evaluated at α. The first inequality follows from the triangle inequality, the second from the mean value theorem, and the third from the RSC/RSS condition and by setting η = 1/L_{k+k*}.

The theorem now follows by setting k = 2·(L_{k+k*}/α_{k+k*})²·k*.
Lemma D.1. Let z ∈ R^p be spanned by M̃ of the groups in G = {G_1, ..., G_M}, and let ŵ = P_k^G(z) and w* = P_{k*}^G(z), where k ≥ k*. Then the following holds:

    ‖ŵ − z‖² ≤ ((M̃ − k)/(M̃ − k*))·‖w* − z‖².

Proof.
Let S = supp(ŵ) and S* = supp(w*). Since ŵ is a projection of z, ŵ_S = z_S and ŵ is 0 otherwise; similarly, w*_{S*} = z_{S*}. So, to prove the lemma we need to show that:

    ‖z_{S̄}‖² ≤ ((M̃ − k)/(M̃ − k*))·‖z_{S̄*}‖²,    (19)

where S̄ denotes the complement of S. We first construct an index set A: we initialize A = B, where B = supp(w*). Next, we iteratively add k − k* groups greedily to A; that is, A = A ∪ A_i, where A_i = supp(P_1^G(z_{Ā})).

Let w̃ ∈ R^p be such that w̃_A = z_A and w̃_{Ā} = 0, and note that w̃ is covered by at most k groups (the k* groups spanning B plus the k − k* groups added greedily), i.e., ‖w̃‖_G⁰ ≤ k. Then, using the optimality of ŵ, we have:

    ‖z_{S̄}‖² ≤ ‖z_{Ā}‖².    (20)

Now,

    ‖z_{B̄}‖²/(M̃ − k*) − ‖z_{Ā}‖²/(M̃ − k) = ‖z_{B̄\Ā}‖²/(M̃ − k*) − ((k − k*)/((M̃ − k*)(M̃ − k)))·‖z_{Ā}‖².    (21)

By construction, B̄ \ Ā = ∪_{i=1}^{k−k*} A_i. Moreover, z_{Ā} is spanned by at most M̃ − k groups. Since the A_i are constructed greedily, we have ‖z_{A_i}‖² ≥ ‖z_{Ā}‖²/(M̃ − k). Adding this over all 1 ≤ i ≤ k − k*, we get:

    ‖z_{B̄\Ā}‖² = Σ_{i=1}^{k−k*} ‖z_{A_i}‖² ≥ ((k − k*)/(M̃ − k))·‖z_{Ā}‖².    (22)

Using (20), (21), and (22), we get ‖z_{B̄}‖²/(M̃ − k*) − ‖z_{S̄}‖²/(M̃ − k) ≥ 0; that is, (19) holds (recall that B = S*). Hence proved.
E Proof of Theorem 4.1
First, we provide a general result that extracts the key property of the approximate projection operator required by our proof; we then show that Algorithm 3 satisfies this property.

In particular, we assume that there is a set of supports S_{k*} such that supp(w*) ∈ S_{k*}. Also, let S_k ⊆ {0, 1}^p be such that S_{k*} ⊆ S_k. Moreover, for any given z ∈ R^p, assume there exists an efficient procedure to find S ∈ S_k such that the following holds for all S* ∈ S_{k*}:

    ‖z_{S*\S}‖² ≤ (k*/k)·β_ε·‖z_{S\S*}‖² + ε²,    (23)

where ε > 0 and β_ε is a function of ε.

We now show that (23) holds for the SoG case, specifically for Algorithm 3. For simplicity, we provide the result for the non-overlapping case; for overlapping groups, a similar result can be obtained by combining the following lemma with Lemma B.2.
Lemma E.1. Let G = {G_1, ..., G_M} be M non-overlapping groups and let G-supp(w*) = {i*_1, ..., i*_{k₁*}}. Let Ĝ be the set of groups selected by Algorithm 3 applied to z ∈ R^p, and let S_i be the selected set of coordinates from group G_i, for i ∈ Ĝ. Let S = ∪_i S_i, and let S* = ∪_i (S*)_i = supp(w*). Also, let G* be the set of groups that contains S*. Then the following holds:

    ‖z_{S*\S}‖² ≤ max(k₁*/k₁, k₂*/k₂)·‖z_{S\S*}‖².
Consider a group G_i with i ∈ Ĝ ∩ G*. Within a selected group, we simply select the coordinates S_i by standard hard thresholding. Hence, using Lemma 1 of [11], we have:

    ‖z_{(S*)_i \ S}‖² ≤ (k₂*/k₂)·‖z_{S \ (S*)_i}‖²,   for all i ∈ Ĝ ∩ G*.    (24)

Due to the greedy selection of groups, for each pair G_i, G_j with i ∈ Ĝ \ G* and j ∈ G* \ Ĝ, the mass captured in G_i dominates that of G_j; summing over such groups,

    Σ_{j ∈ G*\Ĝ} ‖z_{(S*)_j}‖² ≤ (k₁*/k₁)·Σ_{i ∈ Ĝ\G*} ‖z_{S_i}‖².    (25)

The lemma now follows by adding (24) and (25), and rearranging the terms.

We now prove Theorem 4.1.

Proof.
The theorem follows directly from the proof of Theorem 3.1, with (12) replaced by the following inequality:

    ‖(w_{t+1} − g_t)_I‖² = ‖(g_t)_M‖² ≤ (k*/k)·β_ε·‖(g_t)_{S_{t+1}\S*}‖² + ε² ≤ (k*/k)·β_ε·‖(w* − g_t)_I‖² + ε²,    (26)

where the first inequality follows from the assumption stated in the theorem (cf. (23)), and the second follows since w*_{S_{t+1}\S*} = 0.

F Results for the Least Squares Sparse Overlapping Group Lasso
Lemma E.1, together with Theorem 4.1, shows that in the SoG case we need to project onto more than k₁* groups and select more than k₂* elements in each group. In particular, we select k_i ≈ (L_{k+k*}/α_{k+k*})²·k_i* for both i = 1, 2, and recover w* from (y, X) such that y = Xw* + β. Specifically, the sample complexity evaluates to n ≥ κ(k₁* log M + κ k₁* k₂* log(max_i |G_i|)).

Table 1: MSE on standard test signals using IHT with full corrections. (Columns: IHT, GOMP, CoGEnT; rows: Blocks, Piece-Polynomial.)
G Additional Experimental Evaluations
Noisy Compressed Sensing:
Here, we apply our proposed methods in a compressed sensing framework to recover sparse wavelet coefficients of signals. We used the standard "test" signals (Table 1) of length 2048 and obtained 512 Gaussian measurements. We set k = 100 for IHT and GOMP. IHT is competitive (in terms of accuracy) with the state of the art in convex methods, while being significantly faster. Figure 3 shows the recovered Blocks signal using IHT. All parameters were picked clairvoyantly via a grid search.