On Iterative Hard Thresholding Methods for High-dimensional M-Estimation
Prateek Jain∗, Ambuj Tewari†, Purushottam Kar∗
∗Microsoft Research, INDIA    †University of Michigan, Ann Arbor, USA
{prajain,t-purkar}@microsoft.com, [email protected]
October 22, 2014
Abstract
The use of M-estimators in generalized linear regression models in high dimensional settings requires risk minimization with hard $L_0$ constraints. Of the known methods, the class of projected gradient descent (also known as iterative hard thresholding (IHT)) methods is known to offer the fastest and most scalable solutions. However, the current state-of-the-art is only able to analyze these methods in extremely restrictive settings which do not hold in high dimensional statistical models. In this work we bridge this gap by providing the first analysis for IHT-style methods in the high dimensional statistical setting. Our bounds are tight and match known minimax lower bounds. Our results rely on a general analysis framework that enables us to analyze several popular hard thresholding style algorithms (such as HTP, CoSaMP, SP) in the high dimensional regression setting. We also extend our analysis to a large family of “fully corrective methods” that includes two-stage and partial hard-thresholding algorithms. We show that our results hold for the problem of sparse regression, as well as low-rank matrix recovery.

1 Introduction

Modern statistical estimation is routinely faced with real world problems where the number of parameters $p$ handily outnumbers the number of observations $n$. In general, consistent estimation of parameters is not possible in such a situation. Consequently, a rich line of work has focused on models that satisfy special structural assumptions such as sparsity or low-rank structure. Under these assumptions, several works (for example, see [1, 2, 3, 4, 5]) have established that consistent estimation is information theoretically possible in the “$n \ll p$” regime as well.

The question of efficient estimation, however, is faced with feasibility issues since consistent estimation routines often end up solving NP-hard problems. Examples include sparse regression, which requires loss minimization with sparsity constraints, and low-rank regression, which requires dealing with rank constraints that are not efficiently solvable in general [6].

Interestingly, recent works have demonstrated that these hardness results can be avoided by assuming certain natural conditions over the loss function being minimized, such as restricted strong convexity (RSC) and restricted strong smoothness (RSS). The estimation routines proposed in these works typically make use of convex relaxations [5] or greedy methods [7, 8, 9] which do not suffer from infeasibility issues.

Despite this, certain limitations have precluded widespread use of these techniques. Convex relaxation-based methods typically suffer from slow rates as they solve non-smooth optimization problems, apart from being hard to analyze in terms of global guarantees. Greedy methods, on the other hand, are slow in situations with non-negligible sparsity or relatively high rank, owing to their incremental approach of adding/removing individual support elements.

Instead, the methods of choice for practical applications are actually projected gradient (PGD) methods, also referred to as iterative hard thresholding (IHT) methods. These methods directly project the gradient descent update onto the underlying (non-convex) feasible set. This projection can be performed efficiently for several interesting structures such as sparsity and low rank. However, traditional PGD analyses for convex problems, viz.
[10], do not apply to these techniques due to the non-convex structure of the problem.

An exception to this is the recent work [11] that demonstrates that PGD with non-convex regularization can offer consistent estimates for certain high-dimensional problems. However, the work in [11] is only able to analyze penalties such as SCAD, MCP and capped $L_1$. Moreover, their framework cannot handle commonly used constraints such as $L_0$ or low-rank constraints.

As noted above, PGD/IHT-style methods have been very popular in the literature on sparse recovery, and several algorithms including Iterative Hard Thresholding (IHT) [12] or GraDeS [13], Hard Thresholding Pursuit (HTP) [14], CoSaMP [15], Subspace Pursuit (SP) [16], and OMPR($\ell$) [17] have been proposed. However, the analysis of these algorithms has traditionally been restricted to settings that satisfy the Restricted Isometry Property (RIP) or incoherence property. As the discussion below demonstrates, this renders these analyses inaccessible to high-dimensional statistical estimation problems.

All existing results analyzing these methods require the condition number of the loss function, restricted to sparse vectors, to be smaller than a universal constant. The best known such constant is due to the work of [17], which requires a bound of the form $\delta_{2k} \le 0.5$ on the RIP constant. However, if the data is sampled from a Gaussian distribution whose covariance matrix is, say, $\Sigma = \begin{bmatrix} 1 & 1-\epsilon \\ 1-\epsilon & 1 \end{bmatrix}$, then the restricted condition number (on a support set of size just 2) of the sample matrix cannot be brought down below $1/\epsilon$ even with infinitely many samples. In particular, when $\epsilon < 1/
6, none of the existing results for hard thresholding methods offer any guarantees. Moreover, most of these analyses consider only the least squares objective. Although recent attempts have been made to extend this to general differentiable objectives [18, 19], the results continue to require that the restricted condition number be less than a universal constant and remain unsatisfactory in a statistical setting.
Our main contribution in this work is an analysis of PGD/IHT-style methods in statistical settings. Our bounds are tight, achieve known minimax lower bounds [20], and hold for arbitrary differentiable, possibly even non-convex functions. Our results hold even when the underlying condition number is arbitrarily large and only require the function to satisfy RSC/RSS conditions. In particular, this reveals that these iterative methods are indeed applicable to statistical settings, a result that escaped all previous works.

Our first result shows that the PGD/IHT methods achieve global convergence if used with a relaxed projection step. More formally, if the optimal parameter is $s^*$-sparse and the problem satisfies RSC and RSS constraints $\alpha$ and $L$ respectively (see Section 2), then PGD methods offer global convergence so long as they employ projection to an $s$-sparse set where $s = O\big((L/\alpha)^2\, s^*\big)$. This gives convergence rates that are identical to those of convex relaxation and greedy methods for the Gaussian sparse linear model. We then move to a family of efficient “fully corrective” methods and show, as before, that for arbitrary functions satisfying the RSC/RSS properties, these methods offer global convergence.

Next, we show that these results allow PGD-style methods to offer global convergence in a variety of statistical estimation problems such as sparse linear regression and low rank matrix regression. Our results effortlessly extend to the noisy setting as a corollary and give bounds similar to those of [21], which relies on solving an $L_1$-regularized problem.

Our proofs are able to exploit the fact that even though hard thresholding is not the prox-operator for any convex prox function, it still provides strong contraction when projection is performed onto sets of sparsity $s \gg s^*$. This crucial observation allows us to provide the first unified analysis for hard thresholding based gradient descent algorithms. Our empirical results confirm our predictions with respect to the recovery properties of IHT-style algorithms on badly-conditioned sparse recovery problems, as well as demonstrate that these methods can be orders of magnitude faster than their $L_1$ and greedy counterparts.

Section 2 sets the notation and the problem statement. Section 3 introduces the PGD/IHT algorithm that we study and proves that the method guarantees recovery assuming the RSC/RSS property. We also generalize our guarantees to the problem of low-rank matrix regression. Section 4 then provides crisp sample complexity bounds and statistical guarantees for the PGD/IHT estimators. Section 5 extends our analysis to a broad family of “fully-corrective” hard thresholding methods, including compressive sensing algorithms such as the so-called two-stage hard thresholding and partial hard thresholding algorithms, and provides similar results for them as well. We present some empirical results in Section 6 and conclude in Section 7.
High-dimensional Sparse Estimation.
Given data points $X = [X_1, \ldots, X_n]^T$, where $X_i \in \mathbb{R}^p$, and the target $Y = [Y_1, \ldots, Y_n]^T$, where $Y_i \in \mathbb{R}$, the goal is to compute an $s^*$-sparse $\theta^*$ s.t.
$$\theta^* = \arg\min_{\theta,\ \|\theta\|_0 \le s^*} f(\theta). \qquad (1)$$
Typically, $f$ can be thought of as an empirical risk function, i.e., $f(\theta) = \frac{1}{n}\sum_i \ell(\langle X_i, \theta\rangle, Y_i)$ for some loss function $\ell$ (see examples in Section 4). However, for our analysis of PGD and other algorithms, we need not assume any property of $f$ other than differentiability and the following two RSC and RSS properties.

Definition 1 (RSC Property). A differentiable function $f : \mathbb{R}^p \to \mathbb{R}$ is said to satisfy restricted strong convexity (RSC) at sparsity level $s = s_1 + s_2$ with strong convexity constraint $\alpha_s$ if the following holds for all $\theta_1, \theta_2$ s.t. $\|\theta_1\|_0 \le s_1$ and $\|\theta_2\|_0 \le s_2$:
$$f(\theta_1) - f(\theta_2) \ge \langle \theta_1 - \theta_2, \nabla_\theta f(\theta_2)\rangle + \frac{\alpha_s}{2}\,\|\theta_1 - \theta_2\|_2^2.$$

Definition 2 (RSS Property). A differentiable function $f : \mathbb{R}^p \to \mathbb{R}$ is said to satisfy restricted strong smoothness (RSS) at sparsity level $s = s_1 + s_2$ with strong smoothness constraint $L_s$ if the following holds for all $\theta_1, \theta_2$ s.t. $\|\theta_1\|_0 \le s_1$ and $\|\theta_2\|_0 \le s_2$:
$$f(\theta_1) - f(\theta_2) \le \langle \theta_1 - \theta_2, \nabla_\theta f(\theta_2)\rangle + \frac{L_s}{2}\,\|\theta_1 - \theta_2\|_2^2.$$

Low-rank Matrix Regression.
Low-rank matrix regression is similar to sparse estimation as presented above except that each data point is now a matrix, i.e., $X_i \in \mathbb{R}^{p_1\times p_2}$, the goal being to estimate a low-rank matrix $W \in \mathbb{R}^{p_1\times p_2}$ that minimizes the empirical loss function on the given data:
$$W^* = \arg\min_{W,\ \mathrm{rank}(W) \le r} f(W). \qquad (2)$$
For this problem the RSC and RSS properties for $f$ are defined similarly as in Definitions 1 and 2, except that the $L_0$ norm is replaced by the rank function.

In this section we study the popular projected gradient descent (a.k.a. iterative hard thresholding) method for the case of the feasible set being the set of sparse vectors (see Algorithm 1 for pseudocode). The projection operator $P_s(z)$ can be implemented efficiently in this case by projecting $z$ onto the set of $s$-sparse vectors, i.e., by selecting the $s$ largest elements (in magnitude) of $z$. The standard projection property implies that $\|P_s(z) - z\|_2^2 \le \|\theta' - z\|_2^2$ for all $\|\theta'\|_0 \le s$. However, it turns out that we can prove a significantly stronger property of hard thresholding for the case when $\|\theta'\|_0 \le s^*$ and $s^* \ll s$. This property is key to analysing IHT and is formalized below.

Lemma 1.
For any index set $I$, any $z \in \mathbb{R}^I$, let $\theta = P_s(z)$. Then for any $\theta^* \in \mathbb{R}^I$ such that $\|\theta^*\|_0 \le s^*$, we have
$$\|\theta - z\|_2^2 \le \frac{|I| - s}{|I| - s^*}\,\|\theta^* - z\|_2^2.$$
See Appendix A for a detailed proof. Our analysis combines the above observation with the RSC/RSS properties of $f$ to provide geometric convergence rates for the IHT procedure below.
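To make the projection concrete, the following minimal numpy sketch (our own illustration, not code from the paper) implements $P_s$ and numerically spot-checks the inequality of Lemma 1 on a random instance; the dimensions and random seed are purely illustrative.

```python
import numpy as np

def hard_threshold(z, s):
    """P_s(z): keep the s largest-magnitude entries of z, zero out the rest."""
    theta = np.zeros_like(z)
    idx = np.argpartition(np.abs(z), -s)[-s:]   # indices of the s largest |z_i|
    theta[idx] = z[idx]
    return theta

# Spot-check of Lemma 1: ||P_s(z) - z||^2 <= (|I| - s)/(|I| - s*) * ||theta* - z||^2
rng = np.random.default_rng(0)
p, s, s_star = 100, 40, 5
z = rng.normal(size=p)
theta_star = hard_threshold(rng.normal(size=p), s_star)   # an arbitrary s*-sparse comparator
lhs = np.linalg.norm(hard_threshold(z, s) - z) ** 2
rhs = (p - s) / (p - s_star) * np.linalg.norm(theta_star - z) ** 2
print(lhs <= rhs)   # Lemma 1 guarantees this holds for every draw
```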
Theorem 1. Let $f$ have RSS and RSC parameters given by $L_{s+s^*}(f) = L$ and $\alpha_{s+s^*}(f) = \alpha$ respectively. Let Algorithm 1 be invoked with $f$, $s \ge 32\,\big(\tfrac{L}{\alpha}\big)^2\, s^*$ and $\eta = \tfrac{2}{3L}$. Also let $\theta^* = \arg\min_{\theta,\ \|\theta\|_0\le s^*} f(\theta)$. Then, the $\tau$-th iterate of Algorithm 1, for $\tau = O\big(\tfrac{L}{\alpha}\cdot\log\tfrac{f(\theta^0)}{\epsilon}\big)$, satisfies: $f(\theta^\tau) - f(\theta^*) \le \epsilon$.

Algorithm 1 Iterative Hard-thresholding
Input: Function $f$ with gradient oracle, sparsity level $s$, step-size $\eta$
$\theta^1 = \mathbf{0}$, $t = 1$
while not converged do
  $\theta^{t+1} = P_s\big(\theta^t - \eta\,\nabla_\theta f(\theta^t)\big)$
  $t = t + 1$
end while
Output: $\theta^t$

Proof. (Sketch) Let $S^t = \mathrm{supp}(\theta^t)$, $S^* = \mathrm{supp}(\theta^*)$, $S^{t+1} = \mathrm{supp}(\theta^{t+1})$ and $I^t = S^* \cup S^t \cup S^{t+1}$. Using the RSS property and the fact that $\mathrm{supp}(\theta^t) \subseteq I^t$ and $\mathrm{supp}(\theta^{t+1}) \subseteq I^t$, we have (writing $g^t = \nabla_\theta f(\theta^t)$):
$$f(\theta^{t+1}) - f(\theta^t) \le \langle \theta^{t+1} - \theta^t,\, g^t\rangle + \frac{L}{2}\,\|\theta^{t+1} - \theta^t\|_2^2 \le \frac{L}{2}\,\Big\|\theta^{t+1}_{I^t} - \theta^t_{I^t} + \frac{2}{3L}\, g^t_{I^t}\Big\|_2^2 - \frac{2}{9L}\,\|g^t_{I^t}\|_2^2$$
$$\overset{\zeta_1}{\le} \frac{L}{2}\cdot\frac{|I^t| - s}{|I^t| - s^*}\cdot\Big\|\theta^*_{I^t} - \theta^t_{I^t} + \frac{2}{3L}\, g^t_{I^t}\Big\|_2^2 - \frac{2}{9L}\,\Big(\|g^t_{I^t\setminus(S^t\cup S^*)}\|_2^2 + \|g^t_{S^t\cup S^*}\|_2^2\Big), \qquad (3)$$
where the second inequality uses the fact that $\langle \theta^{t+1} - \theta^t, g^t\rangle \le 0$ (see Appendix A), and $\zeta_1$ follows from an application of Lemma 1 with $I = I^t$ and the Pythagoras theorem. The above equation has three critical terms. The first term can be bounded using the RSS condition. Using $f(\theta^t) - f(\theta^*) \le \langle g^t_{S^t\cup S^*},\, \theta^t - \theta^*\rangle - \frac{\alpha}{2}\|\theta^t - \theta^*\|_2^2 \le \frac{1}{2\alpha}\|g^t_{S^t\cup S^*}\|_2^2$ bounds the third term in (3). The second term is more interesting as, in general, elements of $g^t_{S^*}$ can be arbitrarily small. However, elements of $g^t_{I^t\setminus(S^t\cup S^*)}$ should be at least as large as those of $g^t_{S^*\setminus S^{t+1}}$ as they are selected by hard thresholding. Combining this insight with bounds for $g^t_{S^*\setminus S^{t+1}}$ and with (3), we obtain the theorem. See Appendix A for a detailed proof.

We now generalize our previous analysis to a projected gradient descent (PGD) method for low-rank matrix regression. Formally, we study the following problem:
$$\min_W f(W), \quad \text{s.t.}\ \ \mathrm{rank}(W) \le s. \qquad (4)$$
The hard-thresholding projection step for low-rank matrices can be performed using the SVD, i.e.,
$$P^M_s(W) = U_s \Sigma_s V_s^T,$$
where $W = U\Sigma V^T$ is the singular value decomposition of $W$, $U_s, V_s$ are the top-$s$ singular vectors (left and right, respectively) of $W$, and $\Sigma_s$ is the diagonal matrix of the top-$s$ singular values of $W$. To proceed, we first note a property of the above projection similar to Lemma 1.
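Before stating that property, here is a minimal numpy sketch (again our own illustration, not the authors' code) of the rank-$s$ projection $P^M_s$ computed via a truncated SVD.

```python
import numpy as np

def rank_projection(W, s):
    """P^M_s(W): best rank-s approximation of W in Frobenius norm, via truncated SVD."""
    U, sigma, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :s] @ np.diag(sigma[:s]) @ Vt[:s, :]

# Example: project a random 50 x 30 matrix onto the set of rank-5 matrices.
rng = np.random.default_rng(0)
W_proj = rank_projection(rng.normal(size=(50, 30)), 5)
print(np.linalg.matrix_rank(W_proj))   # 5
```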
Lemma 2. Let $W \in \mathbb{R}^{p_1\times p_2}$ be a rank-$|I^t|$ matrix with $p_1 \le p_2$. Then for any rank-$s^*$ matrix $W^* \in \mathbb{R}^{p_1\times p_2}$ we have
$$\|P^M_s(W) - W\|_F^2 \le \frac{|I^t| - s}{|I^t| - s^*}\,\|W^* - W\|_F^2. \qquad (5)$$
Proof. Let $W = U\Sigma V^T$ be the singular value decomposition of $W$. Now,
$$\|P^M_s(W) - W\|_F^2 = \sum_{i=s+1}^{|I^t|} \sigma_i^2 = \|P_s(\mathrm{diag}(\Sigma)) - \mathrm{diag}(\Sigma)\|_2^2,$$
where $\sigma_1 \ge \cdots \ge \sigma_{|I^t|} \ge 0$ are the singular values of $W$. Using Lemma 1, we get:
$$\|P^M_s(W) - W\|_F^2 \le \frac{|I^t| - s}{|I^t| - s^*}\,\|\sigma(W^*) - \mathrm{diag}(\Sigma)\|_2^2 \le \frac{|I^t| - s}{|I^t| - s^*}\,\|W^* - W\|_F^2, \qquad (6)$$
where $\sigma(W^*)$ denotes the ($s^*$-sparse) vector of singular values of $W^*$ and the last step uses von Neumann's trace inequality ($\mathrm{Tr}(A\cdot B) \le \sum_i \sigma_i(A)\,\sigma_i(B)$).

The following result for low-rank matrix regression immediately follows from Lemma 2.

Theorem 2.
Let $f$ have RSS and RSC parameters given by $L_{s+s^*}(f) = L$ and $\alpha_{s+s^*}(f) = \alpha$ respectively. Replace the projection operator $P_s$ in Algorithm 1 with its matrix counterpart $P^M_s$ as defined in (5). Suppose we invoke it with $f$, $s \ge 32\,\big(\tfrac{L}{\alpha}\big)^2\, s^*$ and $\eta = \tfrac{2}{3L}$. Also let $W^* = \arg\min_{W,\ \mathrm{rank}(W)\le s^*} f(W)$. Then the $\tau$-th iterate of Algorithm 1, for $\tau = O\big(\tfrac{L}{\alpha}\cdot\log\tfrac{f(W^0)}{\epsilon}\big)$, satisfies: $f(W^\tau) - f(W^*) \le \epsilon$.

Proof.
A proof progression similar to that of Theorem 1 suffices. The only changes that need to be made are the following. Firstly, Lemma 2 has to be invoked in place of Lemma 1. Secondly, in place of considering vectors restricted to a subset of coordinates, viz. $\theta_S$, $g^t_I$, we would need to consider matrices restricted to subspaces, i.e., $W_S = U_S U_S^T W$ where $U_S$ is a set of singular vectors spanning the range-space of $S$.

This section elaborates on how the results of the previous section can be used to give guarantees for IHT-style techniques in a variety of statistical estimation problems. We will first present a generic convergence result and then specialize it to various settings. Suppose we have a sample of data points $Z^n$ and a loss function $\mathcal{L}(\theta; Z^n)$ that depends on a parameter $\theta$ and the sample. Then we can show the following result. (See Appendix B for a proof.)

Theorem 3.
Let $\bar\theta$ be any $s^*$-sparse vector. Suppose $\mathcal{L}(\theta; Z^n)$ is differentiable and satisfies RSC and RSS at sparsity level $s + s^*$ with parameters $\alpha_{s+s^*}$ and $L_{s+s^*}$ respectively, for $s \ge 32\,\big(\tfrac{L_{s+s^*}}{\alpha_{s+s^*}}\big)^2\, s^*$. Let $\theta^\tau$ be the $\tau$-th iterate of Algorithm 1 for $\tau$ chosen as in Theorem 1 and let $\epsilon$ be the function value error incurred by Algorithm 1. Then we have
$$\|\bar\theta - \theta^\tau\|_2 \le \frac{2\,\sqrt{s + s^*}\,\|\nabla\mathcal{L}(\bar\theta; Z^n)\|_\infty}{\alpha_{s+s^*}} + \sqrt{\frac{2\,\epsilon}{\alpha_{s+s^*}}}.$$
Note that the result does not require the loss function to be convex. This fact will be crucially used later. We now apply the above result to several statistical estimation scenarios.
Sparse Linear Regression.
Here $Z_i = (X_i, Y_i) \in \mathbb{R}^p\times\mathbb{R}$ and $Y_i = \langle\bar\theta, X_i\rangle + \xi_i$ where $\xi_i \sim N(0, \sigma^2)$ is label noise. The empirical loss is the usual least squares loss, i.e., $\mathcal{L}(\theta; Z^n) = \frac{1}{2n}\|Y - X\theta\|_2^2$. Suppose the $X_i$ are drawn i.i.d. from a sub-Gaussian distribution with covariance $\Sigma$ with $\Sigma_{jj} \le 1$ for all $j$. Then [22, Lemma 6] immediately implies that RSC and RSS at sparsity level $k$ hold, with probability at least $1 - e^{-c_0 n}$, with $\alpha_k = \sigma_{\min}(\Sigma) - c_1\,\frac{k\log p}{n}$ and $L_k = 2\,\sigma_{\max}(\Sigma) + c_2\,\frac{k\log p}{n}$ ($c_0, c_1, c_2$ are universal constants). So we can set $k = 2s + s^*$, and if $n > c_3\, k\log p/\sigma_{\min}(\Sigma)$ then we have $\alpha_k \ge \sigma_{\min}(\Sigma)/2$ and $L_k \le 2.25\,\sigma_{\max}(\Sigma)$, which means that $L_k/\alpha_k \le 4.5\,\kappa(\Sigma)$, where $\kappa(\Sigma) := \sigma_{\max}(\Sigma)/\sigma_{\min}(\Sigma)$. Thus it is enough to choose $s = 2592\,\kappa^2(\Sigma)\, s^*$ and apply Theorem 3. Note that $\|\nabla\mathcal{L}(\bar\theta; Z^n)\|_\infty = \|X^T\xi/n\|_\infty \le 2\,\sigma\sqrt{\frac{\log p}{n}}$ with probability at least $1 - c_4\, p^{-c_5}$ ($c_3, c_4, c_5$ are universal constants). Putting everything together, we have the following bound with high probability:
$$\|\bar\theta - \theta^\tau\|_2 \le c_6\,\frac{\kappa(\Sigma)}{\sigma_{\min}(\Sigma)}\,\sigma\,\sqrt{\frac{s^*\log p}{n}} + 2\,\sqrt{\frac{\epsilon}{\sigma_{\min}(\Sigma)}},$$
where $c_6$ is a universal constant and $\epsilon$ is the function value error incurred by Algorithm 1.
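To illustrate how Algorithm 1 is deployed in this setting, the following sketch (our own; the step size, projected sparsity and problem sizes are illustrative choices rather than prescriptions from the paper) runs IHT on the least-squares loss.

```python
import numpy as np

def hard_threshold(z, s):
    theta = np.zeros_like(z)
    idx = np.argpartition(np.abs(z), -s)[-s:]
    theta[idx] = z[idx]
    return theta

def iht_least_squares(X, y, s, eta, iters=200):
    """Algorithm 1 for f(theta) = ||y - X theta||^2 / (2n): gradient step + hard thresholding."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / n
        theta = hard_threshold(theta - eta * grad, s)
    return theta

# Toy run: n = 500, p = 2000, s* = 10; project onto s = 4 * s* (illustrative inflation of s*).
rng = np.random.default_rng(0)
n, p, s_star = 500, 2000, 10
theta_bar = np.zeros(p)
theta_bar[rng.choice(p, s_star, replace=False)] = rng.choice([-1.0, 1.0], s_star)
X = rng.normal(size=(n, p))
y = X @ theta_bar + 0.1 * rng.normal(size=n)
theta_hat = iht_least_squares(X, y, s=4 * s_star, eta=0.5)
print(np.linalg.norm(theta_hat - theta_bar))
```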
Noisy and Missing Data.

We now look at cases with feature noise as well. More specifically, assume that we only have access to $\tilde X_i$'s that are corrupted versions of the $X_i$'s. Two models of noise are popular in the literature [21]: a) (additive noise) $\tilde X_i = X_i + W_i$ where $W_i \sim N(0, \Sigma_W)$, and b) (missing data) $\tilde X$ is an $\mathbb{R}\cup\{\star\}$-valued matrix obtained by independently, with probability $\nu \in [0, 1)$, replacing each entry of $X$ with $\star$. For the case of additive noise (missing data can be handled similarly), $Z_i = (\tilde X_i, Y_i)$ and $\mathcal{L}(\theta; Z^n) = \frac{1}{2}\theta^T\hat\Gamma\theta - \hat\gamma^T\theta$, where $\hat\Gamma = \tilde X^T\tilde X/n - \Sigma_W$ and $\hat\gamma = \tilde X^T Y/n$ are unbiased estimators of $\Sigma$ and $\Sigma\bar\theta$ respectively. [21, Appendix A, Lemma 1] implies that RSC, RSS at sparsity level $k$ hold, with failure probability exponentially small in $n$, with $\alpha_k = \sigma_{\min}(\Sigma) - k\tau(p)/n$ and $L_k = 1.5\,\sigma_{\max}(\Sigma) + k\tau(p)/n$, for $\tau(p) = c_0\,\sigma_{\min}(\Sigma)\max\!\Big(\frac{(\|\Sigma\|_{op} + \|\Sigma_W\|_{op})^2}{\sigma_{\min}^2(\Sigma)},\, 1\Big)\log p$. Thus for $k = 2s + s^*$ and $n \ge 4\,k\tau(p)/\sigma_{\min}(\Sigma)$ we have $L_k/\alpha_k \le 3.5\,\kappa(\Sigma)$. Note that $\mathcal{L}(\cdot; Z^n)$ is non-convex but we can still apply Theorem 3 with $s = 1568\,\kappa^2(\Sigma)\, s^*$ because RSC, RSS hold. Using the high probability upper bound (see [21, Appendix A, Lemma 2]) $\|\nabla\mathcal{L}(\bar\theta; Z^n)\|_\infty \le c\,\tilde\sigma\,\|\bar\theta\|_2\sqrt{\log p/n}$ gives us the following:
$$\|\bar\theta - \theta^\tau\|_2 \le c\,\frac{\kappa(\Sigma)}{\sigma_{\min}(\Sigma)}\,\tilde\sigma\,\|\bar\theta\|_2\,\sqrt{\frac{s^*\log p}{n}} + 2\,\sqrt{\frac{\epsilon}{\sigma_{\min}(\Sigma)}},$$
where $\tilde\sigma = \sqrt{\big(\|\Sigma_W\|_{op} + \|\Sigma\|_{op}\big)\big(\|\Sigma_W\|_{op} + \sigma^2\big)}$ and $\epsilon$ is the function value error in Algorithm 1.

In this section, we study a variety of “fully-corrective” methods. These methods keep the optimization objective fully minimized over the support of the current iterate. To this end, we first prove a fundamental theorem for fully-corrective methods that formalizes the intuition that, for such methods, a large function value should imply a large gradient at any sparse $\theta$ as well. This result is similar to Lemma 1 of [17] but holds under RSC/RSS conditions (rather than the RIP condition as in [17]), as well as for general loss functions.

Lemma 3.
Consider a function $f$ with RSC parameter given by $\alpha_{s+s^*}(f) = \alpha$ and RSS parameter given by $L_{s+s^*}(f) = L$. Let $\theta^* = \arg\min_{\theta,\ \|\theta\|_0\le s^*} f(\theta)$ with $S^* = \mathrm{supp}(\theta^*)$. Let $S^t \subseteq [p]$ be any subset of co-ordinates s.t. $|S^t| \le s$. Let $\theta^t = \arg\min_{\theta,\ \mathrm{supp}(\theta)\subseteq S^t} f(\theta)$ and $g^t = \nabla_\theta f(\theta^t)$. Then, we have:
$$\frac{4}{\alpha}\big(f(\theta^t) - f(\theta^*)\big) \le \frac{4}{\alpha^2}\,\big\|g^t_{S^t\cup S^*}\big\|_2^2 - \big\|\theta^t_{S^t\setminus S^*}\big\|_2^2.$$
See Appendix C for a detailed proof.

Algorithm 2
Two-stage Hard-thresholding Input : function f with gradient oracle, sparsity level s , sparsity expansion level (cid:96) θ = 0, t = 1 while not converged do g t = ∇ θ f ( θ t ), S t = supp ( θ t ) Z t = S t ∪ (largest (cid:96) elements of | g tS t | ) β t = arg min β ,supp ( β ) ⊆ Z t f ( β ) // fully corrective step (cid:101) θ t = P s ( β t ) θ t +1 = arg min θ ,supp ( θ ) ⊆ supp ( (cid:101) θ t ) f ( θ ) // fully corrective step t = t + 1 end while Output : θ t Here we will concentrate on a family of two-stage fully corrective methods that contains popularcompressive sensing algorithms like CoSaMP and Subspace Pursuit (see Algorithm 2 for pseu-docode). These algorithms have thus far been analyzed only under RIP conditions for the leastsquares objective. Using our analysis framework developed in the previous sections, we present ageneric RSC/RSS-based analysis for general two-stage methods for arbitrary loss functions. Ouranalysis shall use the following key observation that the the hard thresholding step in two stagemethods does not increase the objective function a lot.
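To make the two-stage template concrete, the following sketch (our own rendering of Algorithm 2 for the least-squares objective; the function names and parameter choices are illustrative) implements the support-expansion, fully corrective and hard-thresholding steps.

```python
import numpy as np

def restricted_lstsq(X, y, support):
    """Fully corrective step: minimize ||y - X theta||^2 over vectors supported on `support`."""
    theta = np.zeros(X.shape[1])
    if len(support) > 0:
        sol, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        theta[support] = sol
    return theta

def two_stage_ht(X, y, s, ell, iters=50):
    """Sketch of Algorithm 2: expand the support by the top-ell gradient entries outside it,
    fully correct, hard-threshold to s entries, and fully correct again."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / n
        support = list(np.flatnonzero(theta))
        grad_out = grad.copy()
        grad_out[support] = 0.0                                    # look only outside supp(theta^t)
        Z = sorted(set(support) | set(np.argpartition(np.abs(grad_out), -ell)[-ell:].tolist()))
        beta = restricted_lstsq(X, y, Z)                           # fully corrective step on Z^t
        top_s = np.argpartition(np.abs(beta), -s)[-s:].tolist()    # hard threshold P_s(beta^t)
        theta = restricted_lstsq(X, y, sorted(top_s))              # fully corrective step
    return theta
```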
Lemma 4.
Let Z t ⊆ [ n ] and | Z t | ≤ q . Let β t = arg min β ,supp ( β ) ⊆ Z t f ( β ) and (cid:98) θ t = P q ( β t ) . Then,the following holds: f ( (cid:98) θ t ) − f ( β t ) ≤ Lα · (cid:96)s + (cid:96) − s ∗ · ( f ( β t ) − f ( θ ∗ )) . Proof.
Let v t = ∇ θ f ( β t ). Then, using the RSS property we get: f ( (cid:98) θ t ) − f ( β t ) ≤ (cid:104) (cid:98) θ t − β t , v t (cid:105) + L (cid:107) (cid:98) θ t − β t (cid:107) ζ = L (cid:107) (cid:98) θ t − β t (cid:107) ζ ≤ L | (cid:96) || s + (cid:96) − s ∗ | (cid:107) w − β t (cid:107) , (7)where w is any vector such that w Z t = 0 and (cid:107) w (cid:107) ≤ s ∗ . ζ follows by observing v tZ t = 0 and bynoting that supp ( (cid:98) θ t ) ⊆ Z t . ζ follows by Lemma 1 and the fact that (cid:107) w (cid:107) ≤ s ∗ . Now, using RSSproperty and the fact that ∇ θ f ( β t ) = 0, we have: α (cid:107) w − β t (cid:107) ≤ f ( β t ) − f ( w ) ≤ f ( β t ) − f ( θ ∗ ) . (8)The result now follows by combining (7) and (8). Theorem 4.
Let f has RSC and RSS parameters given by α s + s ∗ ( f ) = α and L s + (cid:96) ( f ) = L resp.Call Algorithm 2 with f , (cid:96) ≥ s ∗ and s ≥ L α (cid:96) + s ∗ − (cid:96) ≥ L α s ∗ . Also let θ ∗ = arg min θ , (cid:107) θ (cid:107) ≤ s ∗ f ( θ ) .Then, the τ -th iterate of Algorithm 2, for τ = O ( Lα · log( f ( θ ) (cid:15) ) satisfies: f ( θ τ ) − f ( θ ∗ ) ≤ (cid:15). See Appendix C for a detailed proof. 8 .2 Partial Hard Thresholding Methods
Algorithm 3
Iterative Partial Hard-thresholding Input : function f with gradient oracle, sparsity level s , step size η , partial thresholding level (cid:96) θ = 0, t = 1 while not converged do z t = θ t − η ∇ θ t f ( θ t ), S t = supp ( θ t ) v t = PHT s ( z t ; S t , (cid:96) ) θ t +1 = arg min θ ,supp ( θ ) ⊆ supp ( v t ) f ( θ ) // fully corrective step t = t + 1 end while Output : θ t We now study Partial Hard Thresholding methods (PHT), a family of fully-corrective iterativemethods introduced by [17]. This family is known to provide the best known RIP guarantees inthe compressive sensing setting, but the proof is restricted to the RIP setting, and for the least-squares objective. An interesting member of this family is Orthogonal Matching Pursuit withReplacement (OMPR), which is also a Forward-Backward Greedy Selection method but performsone forward-backward step per iteration.The pseudo code of the general IPHT( (cid:96) ) algorithm is given in Algorithm 3. The algorithm issimilar to the fully-corrective projected gradient descent (PGD) method, in fact, PHT(0) is indeedexactly same as the fully-corrected PGD method. But, the partial hard-thresholding projection isused instead of hard-thresholding projection.The partial hard thresholding operator v t = PHT s ( z t ; S t , (cid:96) ) projects z t onto the non-convexset of s -sparse vectors s.t. | supp ( v t ) \ S t | ≤ (cid:96) . Although, the operator projects onto a non-convexset, still the projection can be performed efficiently by performing hard thresholding only over z tS t and the (cid:96) smallest elements of z tS t . That is, let bot t = smallest (cid:96) elements of | z tS t | . Then, v tS t \ bot t = z tS t \ bot t , and , v tS t ∪ bot t = H (cid:96) ( z tS t ∪ bot t ) . We first show that at least a new element is added during each iteration of IPHT( (cid:96) ). Lemma 5.
Let f , s be supplied to Algorithm 3 and let the RSC and RSS parameters of f be givenby L s + s ∗ ( f ) = L and α s + s ∗ ( f ) = α respectively. Let s ≥ (cid:0) Lα (cid:1) s ∗ . Then, either f ( S t ) = f ( S ∗ ) or | S t +1 \ S t | ≥ . That is, at least one new element is added at each iteration of Algorithm 3.Proof. Suppose no new element is added i.e. | S t +1 \ S t | = 0. Since (cid:96) elements of S t +1 are selectedby hard thresholding z tS t ∪ bot t , hence each element of z tS t should be larger (in magnitude) thanmax i ∈ S t | z ti | . Note that, z tS t = − η g tS t . Hence, (cid:107) z tS t \ S ∗ (cid:107) | S t \ S ∗ | = (cid:107) θ tS t \ S ∗ − η g tS t \ S ∗ (cid:107) | S t \ S ∗ | = (cid:107) θ tS t \ S ∗ (cid:107) | S t \ S ∗ | ≥ (cid:107) z tS ∗ \ S t (cid:107) | S ∗ \ S t | = (cid:107) θ tS ∗ \ S t − η g tS ∗ \ S t (cid:107) | S ∗ \ S t | = η (cid:107) g tS ∗ \ S t (cid:107) | S ∗ \ S t | = η (cid:107) g tS ∗ ∪ S t (cid:107) | S ∗ \ S t | , S uppo r t R e c o v e r y E rr o r HTPGraDeSL1FoBa (a) R un t i m e ( s e c ) HTPGraDeSL1FoBa (b) −3 −2 Sparsity (s*) R un t i m e ( s e c ) HTPGraDeSL1FoBa (c)
Figure 1: A comparison of hard thresholding techniques (HTP) and projected gradient methods(GraDeS) with L1 and greedy methods (FoBa) on sparse noisy linear regression tasks. 1(a) givesthe number of undiscovered elements from supp ( θ ∗ ) as label noise levels are increased. 1(b) showsthe variation in running times with increasing dimensionality p . 1(c) gives the variation in runningtimes (in logscale) when the true sparsity level s ∗ is increased keeping p fixed. HTP and GraDeS areclearly much more scalable than L1 and FoBa. 1(d) shows the recovery properties of different IHTmethods under large condition number ( κ = 50) setting as the size of projected set is increased.where we have used the fact that g tS t = 0 and θ tS t = 0.Using Lemma 3 and the above equation, we have:0 ≤ γ (cid:18) f ( θ t ) − f ( θ ∗ ) − α · (cid:18) αγ − (cid:19) (cid:107) θ t − θ ∗ (cid:107) (cid:19) ≤ (cid:18) γ − η | S t \ S ∗ || S ∗ \ S t | (cid:19) (cid:107) g tS t ∪ S ∗ (cid:107) . (9)The lemma now follows by observing that | S t \ S ∗ || S ∗ \ S t | ≥ s − s ∗ s ∗ ≥ γ η ≥ η α , by the choice of s .We now provide the proof of convergence for IPHT( (cid:96) ) method in the general RSC-RSS setting: Theorem 5.
Let f, s be supplied to Algorithm 3. Also, let the RSC and RSS parameters of f begiven by α s + s ∗ ( f ) = α and L s + s ∗ ( f ) = L respectively. Let s ≥ (cid:0) Lα (cid:1) s ∗ and let η = L . Then,the τ -th iterate of Algorithm 3 satisfies: f ( θ τ ) − f ( θ ∗ ) ≤ (cid:18) − α L · (cid:96) + 1 (cid:19) τ · (cid:0) f ( θ ) − f ( θ ∗ ) (cid:1) , where θ ∗ = arg min θ , (cid:107) θ (cid:107) ≤ s ∗ f ( θ ) . This implies that for τ = O (cid:16) L(cid:96)α · log( f ( θ ) (cid:15) ) (cid:17) , we have f ( θ τ ) − f ( θ ∗ ) ≤ (cid:15) . See Appendix C fora detailed proof. We conducted simulations on high dimensional sparse linear regression problems to verify our pre-dictions. Our experiments demonstrate that hard thresholding and projected gradient techniquescan not only offer recovery in stochastic setting, but offer much more scalable routines for the same.
Data: Our problem setting is identical to the one described in the previous section. We fixed a parameter vector $\bar\theta$ by choosing $s^*$ random coordinates and setting them randomly to $\pm 1$. We then generated data points $Z_i = (X_i, Y_i)$ where $X_i \sim N(0, I_p)$ and $Y_i = \langle\bar\theta, X_i\rangle + \xi_i$ with $\xi_i \sim N(0, \sigma^2)$. We studied the effect of varying dimensionality $p$, sparsity $s^*$, sample size $n$ and label noise level $\sigma$ on the recovery properties of the various algorithms as well as their run times. We chose baseline values of $p = 20000$, $s^* = 100$, $\sigma = 0.1$ and $n = f_o \cdot s^*\log p$, where $f_o$ is the oversampling factor with default value $f_o = 2$. Keeping all other quantities fixed, we varied one of the quantities and generated independent data samples for the experiments.

Algorithms: We studied a variety of hard-thresholding style algorithms including HTP [14], GraDeS [13] (or IHT [12]), CoSaMP [15], OMPR [17] and SP [16]. We compared them with a standard implementation of the L1 projected scaled sub-gradient technique [23] for the lasso problem and a greedy method, FoBa [24], for the same.
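A minimal sketch of this data-generation protocol (our own rendering; the defaults follow the baseline values quoted above) is given below.

```python
import numpy as np

def generate_data(p=20000, s_star=100, sigma=0.1, oversampling=2, seed=0):
    """Draw an s*-sparse +/-1 parameter vector and n = f_o * s* * log(p) Gaussian samples."""
    rng = np.random.default_rng(seed)
    n = int(oversampling * s_star * np.log(p))
    theta_bar = np.zeros(p)
    support = rng.choice(p, size=s_star, replace=False)
    theta_bar[support] = rng.choice([-1.0, 1.0], size=s_star)
    X = rng.normal(size=(n, p))                      # X_i ~ N(0, I_p)
    y = X @ theta_bar + sigma * rng.normal(size=n)   # Y_i = <theta_bar, X_i> + noise
    return X, y, theta_bar
```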
Evaluation Metrics: For the baseline noise level $\sigma = 0.1$, we found that all the algorithms were able to recover the support set within an error of 2%. Consequently, our focus shifted to running times for these experiments. In the experiments where noise levels were varied, we recorded, for each method, the number of undiscovered support set elements.
Results : Figure1 describes the results of our experiments in graphical form. For sake ofclarity we included only HTP, GraDeS, L1 and FoBa results in these graphs. Graphs for theother algorithms CoSaMP, SP and OMPR can be seen in the supplementary material. The graphsindicate that whereas hard thresholding techniques are equally effective as L1 and greedy techniquesfor recovery in noisy settings, as indicated by Figure1(a), the former can be much more efficientand scalable than the latter. For instance, as Figure1(b), for the base level of p = 20000, HTP was150 × faster than the L1 method. For higher values of p , the runtime gap widened to more than350 × . We also note that in both these cases, HTP actually offered exact support recovery whereasL1 was unable to recover 2 and 4 support elements respectively.Although FoBa was faster than L1 on Figure1(b) experiments, it was still slower than HTP by50 × and 90 × for p = 20000 and 25000 respectively. Moreover, due to its greedy and incrementalnature, FoBa was found to suffer badly in settings with larger true sparsity levels. As Figure 1(c)indicates, for even moderate sparsity levels of s ∗ = 300 and 500, FoBa is 60 − × slower thanHTP. As mentioned before, the reason for this slowdown is the greedy approach followed by FoBa:whereas HTP took less than 5 iterations to converge for these two problems, FoBa spend 300 and500 iterations respectively. GraDeS was found to offer much lesser run times in comparison beingslower than HTP by 30 − × for larger values of p and 2 − × slower for larger values of s ∗ . Experiments on badly conditioned problems.
We also ran experiments to verify theperformance of IHT algorithms in high condition number setting. Values of p, s ∗ and σ werekept at baseline levels. After selecting the optimal parameter vector ¯ θ , we selected s ∗ / s ∗ / s were increased. Our results (see Figure 1(d)) corroborate our theoretical observation thatthese algorithms show a remarkable improvement in recovery properties for ill-conditioned problemswith an enlarged projection size. 11 Discussion and Conclusions
In our work we studied iterative hard thresholding algorithms and showed that these techniques can offer global convergence guarantees for arbitrary, possibly non-convex, differentiable objective functions which nevertheless satisfy Restricted Strong Convexity/Smoothness (RSC/RSS) conditions. Our results apply to a large family of algorithms that includes existing algorithms such as IHT, GraDeS, CoSaMP, SP and OMPR. Previously, the analyses of these algorithms required stringent RIP conditions that did not allow the (restricted) condition number to be larger than universal constants specific to these algorithms.

Our basic insight was to relax this stringent requirement by running these iterative algorithms with an enlarged support size. We showed that guarantees for high-dimensional M-estimation follow seamlessly from our results by invoking results on RSC/RSS conditions that have already been established in the literature for a variety of statistical settings. Our theoretical results put hard thresholding methods on par with those based on convex relaxation or greedy algorithms. Our experimental results demonstrate that hard thresholding methods outperform convex relaxation and greedy methods in terms of running time, sometimes by orders of magnitude, all the while offering competitive or better recovery properties.

Our results apply to sparsity and low rank structure, arguably two of the most commonly used structures in high dimensional statistical learning problems. In future work, it would be interesting to generalize our algorithms and their analyses to more general structures. A unified analysis for general structures will probably create interesting connections with existing unified frameworks such as those based on decomposability [5] and atomic norms [25].
References

[1] Peter Bühlmann and Sara Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer, 2011.
[2] Sahand Negahban, Martin J. Wainwright, et al. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, 39(2):1069–1097, 2011.
[3] Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Transactions on Information Theory, 57(10):6976–6994, 2011.
[4] Angelika Rohde and Alexandre B. Tsybakov. Estimation of high-dimensional low-rank matrices. The Annals of Statistics, 39(2):887–930, 2011.
[5] Sahand N. Negahban, Pradeep Ravikumar, Martin J. Wainwright, Bin Yu, et al. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.
[6] Balas Kausik Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234, 1995.
[7] Ji Liu, Ryohei Fujimaki, and Jieping Ye. Forward-backward greedy algorithms for general convex smooth functions over a cardinality constraint. In Proceedings of The 31st International Conference on Machine Learning, pages 503–511, 2014.
[8] Ali Jalali, Christopher C. Johnson, and Pradeep D. Ravikumar. On learning discrete graphical models using greedy methods. In NIPS, pages 1935–1943, 2011.
[9] Shai Shalev-Shwartz, Nathan Srebro, and Tong Zhang. Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM Journal on Optimization, 20(6):2807–2832, 2010.
[10] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87 of Applied Optimization. Springer, 2004.
[11] P. Loh and M. J. Wainwright. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima, 2013. arXiv:1305.2436 [math.ST].
[12] Thomas Blumensath and Mike E. Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.
[13] Rahul Garg and Rohit Khandekar. Gradient descent with sparsification: an iterative algorithm for sparse recovery with restricted isometry property. In ICML, 2009.
[14] Simon Foucart. Hard thresholding pursuit: an algorithm for compressive sensing. SIAM Journal on Numerical Analysis, 49(6):2543–2563, 2011.
[15] Deanna Needell and Joel A. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26:301–321, 2008.
[16] Wei Dai and Olgica Milenkovic. Subspace pursuit for compressive sensing signal reconstruction. IEEE Transactions on Information Theory, 55(5):2230–2249, 2009.
[17] Prateek Jain, Ambuj Tewari, and Inderjit S. Dhillon. Orthogonal matching pursuit with replacement. In Annual Conference on Neural Information Processing Systems, 2011.
[18] Sohail Bahmani, Bhiksha Raj, and Petros T. Boufounos. Greedy sparsity-constrained optimization. The Journal of Machine Learning Research, 14(1):807–841, 2013.
[19] Xiaotong Yuan, Ping Li, and Tong Zhang. Gradient hard thresholding pursuit for sparsity-constrained optimization. In Proceedings of The 31st International Conference on Machine Learning, 2014.
[20] Yuchen Zhang, Martin J. Wainwright, and Michael I. Jordan. Lower bounds on the performance of polynomial-time algorithms for sparse linear regression. arXiv:1402.1918, 2014.
[21] P. Loh and M. J. Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. Annals of Statistics, 40(3):1637–1664, 2012.
[22] Alekh Agarwal, Sahand N. Negahban, and Martin J. Wainwright. Fast global convergence of gradient methods for high-dimensional statistical recovery. Annals of Statistics, 40(5):2452–2482, 2012.
[23] Mark Schmidt. Graphical Model Structure Learning with L1-Regularization. PhD thesis, University of British Columbia, 2010.
[24] Tong Zhang. Adaptive forward-backward greedy algorithm for learning sparse representations. IEEE Transactions on Information Theory, 57:4689–4708, 2011.
[25] Venkat Chandrasekaran, Benjamin Recht, Pablo A. Parrilo, and Alan S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.
A Proofs for Section 3
Proof of Lemma 1.
Without loss of generality, assume that we have reordered coordinates suchthat | z | ≥ | z | ≥ . . . ≥ | z I | . Since the projection operator P s ( · ) operates by selecting the largestelements by magnitude, we have θ = z , . . . , θ s = z s and θ s +1 = θ s +2 = . . . = θ | I | = 0.Also define θ z = P s ∗ ( z ). By the above argument, we have θ z = z , . . . , θ z s ∗ = z s ∗ and θ z s ∗ +1 = θ z s ∗ +2 = . . . = θ z | I | = 0. Now we have (cid:107) θ z − z (cid:107)| I | − s ∗ − (cid:107) θ − z (cid:107)| I | − s = 1 | I | − s ∗ s (cid:88) i = s ∗ +1 z i + (cid:18) | I | − s ∗ − | I | − s (cid:19) | I | (cid:88) i = s +1 z i ≥ s − s ∗ | I | − s ∗ z s + s ∗ − s ( | I | − s ∗ )( | I | − s ) ( | I | − s ) z s +1 ≥ , (10)since the coordinates of z are arranged in decreasing order of magnitude. Combining the abovewith the observation that, due to the projection property (cid:107) θ ∗ − z (cid:107) ≥ (cid:107) θ z − z (cid:107) , proves the result. Proof of Theorem 1.
Recall that $\theta^{t+1} = P_s\big(\theta^t - \frac{\eta'}{L}\, g^t\big)$ where $\eta' = \eta L = 2/3 <$
1. Let S t = supp ( θ t ), S ∗ = supp ( θ ∗ ), and S t +1 = supp ( θ t +1 ). Also, let I t = S ∗ ∪ S t ∪ S t +1 .Now, using the RSS property and the fact that supp ( θ t ) ⊆ I t and supp ( θ t +1 ) ⊆ I t , we have: f ( θ t +1 ) − f ( θ t ) ≤ (cid:104) θ t +1 − θ t , g t (cid:105) + L (cid:107) θ t +1 − θ t (cid:107) , = L (cid:107) θ t +1 I t − θ tI t + η (cid:48) L · g tI t (cid:107) − ( η (cid:48) ) L (cid:107) g tI t (cid:107) + (1 − η (cid:48) ) (cid:104) θ t +1 − θ t , g t (cid:105) . (11)As supp ( θ t ) = S t , supp ( θ t +1 ) = S t +1 and S t \ S t +1 , S t +1 are disjoint, we have: (cid:104) θ t +1 − θ t , g t (cid:105) = −(cid:104) θ tS t \ S t +1 , g tS t \ S t +1 (cid:105) + (cid:104) θ t +1 S t +1 − θ tS t +1 , g tS t +1 (cid:105) , ζ = −(cid:104) θ tS t \ S t +1 , g tS t \ S t +1 (cid:105) − η (cid:48) L (cid:107) g tS t +1 (cid:107) , ζ ≤ η (cid:48) L (cid:107) g tS t +1 \ S t (cid:107) − η (cid:48) L (cid:107) g tS t \ S t +1 (cid:107) − η (cid:48) L (cid:107) g tS t +1 (cid:107) , ζ = − η (cid:48) L (cid:107) g tS t +1 \ S t (cid:107) − η (cid:48) L (cid:107) g tS t \ S t +1 (cid:107) − η (cid:48) L (cid:107) g tS t ∩ S t +1 (cid:107) ≤ − η (cid:48) L (cid:107) g tS t ∪ S t +1 (cid:107) , (12)where the equality ζ follows from the gradient step, i.e., θ t +1 S t +1 = θ tS t +1 − η (cid:48) L g tS t +1 . The inequality ζ follows using the fact that θ t +1 is obtained using hard thresholding and the fact that | S t \ S t +1 | =14 S t +1 \ S t | , as follows: (cid:107) θ tS t \ S t +1 − η (cid:48) L g tS t \ S t +1 (cid:107) ≤ (cid:107) θ t +1 S t +1 \ S t (cid:107) = ( η (cid:48) ) L (cid:107) g tS t +1 \ S t (cid:107) . (13)The equality ζ follows from (cid:107) g tS t +1 (cid:107) = (cid:107) g tS t +1 \ S t (cid:107) + (cid:107) g tS t ∩ S t +1 (cid:107) .Hence, using (11) and (12), we have: f ( θ t +1 ) − f ( θ t ) ≤ L (cid:107) θ t +1 I t − θ tI t + η (cid:48) L · g tI t (cid:107) − ( η (cid:48) ) L (cid:107) g tI t (cid:107) − η (cid:48) (1 − η (cid:48) )2 L (cid:107) g tS t ∪ S t +1 (cid:107) , = L (cid:107) θ t +1 I t − θ tI t + η (cid:48) L · g tI t (cid:107) − ( η (cid:48) ) L (cid:107) g tI t \ ( S t ∪ S ∗ ) (cid:107) − ( η (cid:48) ) L (cid:107) g tS t ∪ S ∗ (cid:107) − η (cid:48) (1 − η (cid:48) )2 L (cid:107) g tS t ∪ S t +1 (cid:107) . (14)Next, let us try to upper bound the first two terms on the right hand side above. Since I t \ ( S t ∪ S ∗ ) = S t +1 \ ( S t ∪ S ∗ ) ⊆ S t +1 , we have θ t +1 I t \ ( S t ∪ S ∗ ) = θ tI t \ ( S t ∪ S ∗ ) − η (cid:48) L g tI t \ ( S t ∪ S ∗ ) . However,as θ tI t \ S t = 0, we actually have θ t +1 I t \ ( S t ∪ S ∗ ) = − η (cid:48) L g tI t \ ( S t ∪ S ∗ ) . Now let us choose a set R ⊆ S t \ S t +1 such that | R | = | S t +1 \ ( S t ∪ S ∗ ) | . Such a choice is possible since | S t +1 \ ( S t ∪ S ∗ ) | = | S t \ S t +1 | −| ( S t +1 ∩ S ∗ ) \ S t | (which itself is a consequence of the fact that | S t +1 | = | S t | ). Moreover, since θ t +1 is obtained by hard-thresholding (cid:16) θ t − η (cid:48) L g t (cid:17) , for any choice of R made above, we have:( η (cid:48) ) L (cid:107) g tI t \ ( S t ∪ S ∗ ) (cid:107) = (cid:107) θ t +1 I t \ ( S t ∪ S ∗ ) (cid:107) ≥ (cid:107) θ tR − η (cid:48) L g tR (cid:107) . 
(15)Using above equation, and the fact that θ t +1 R = 0 (since R ⊆ S t +1 ), we have: L (cid:107) θ t +1 I t − θ tI t + η (cid:48) L · g tI t (cid:107) − ( η (cid:48) ) L (cid:107) g tI t \ ( S t ∪ S ∗ ) (cid:107) ≤ L (cid:107) θ t +1 I t − θ tI t + η (cid:48) L · g tI t (cid:107) − L (cid:107) θ t +1 R − θ tR + η (cid:48) L g tR (cid:107) = L (cid:107) θ t +1 I t \ R − θ tI t \ R + η (cid:48) L · g tI t \ R (cid:107) . (16)We can bound the size of I t \ R as | I t \ R | ≤ | S t +1 | + | ( S t \ S t +1 ) \ R | + | S ∗ | ≤ s + | ( S t +1 ∩ S ∗ ) \ S t | + s ∗ ≤ s + 2 s ∗ . Also, since S t +1 ⊆ ( I t \ R ), we have θ t +1 I t \ R = P s ( θ tI t \ R − η (cid:48) L g tI t \ R ).Using the above observation with (16) and Lemma 1, we get: L (cid:107) θ t +1 I t − θ tI t + η (cid:48) L · g tI t (cid:107) − ( η (cid:48) ) L (cid:107) g tI t \ ( S t ∪ S ∗ ) (cid:107) ≤ L · | I t \ R | − s | I t \ R | − s ∗ (cid:107) θ ∗ I t \ R − θ tI t \ R + η (cid:48) L · g tI t \ R (cid:107) , ζ ≤ L · s ∗ s + s ∗ (cid:107) θ ∗ I t − θ tI t + η (cid:48) L · g tI t (cid:107) , = 2 s ∗ s + s ∗ · (cid:18) η (cid:48) (cid:104) θ ∗ − θ t , g t (cid:105) + L (cid:107) θ ∗ − θ t (cid:107) + ( η (cid:48) ) L (cid:107) g tI t (cid:107) (cid:19) , ζ ≤ s ∗ s + s ∗ · (cid:18) η (cid:48) f ( θ ∗ ) − η (cid:48) f ( θ t ) + L − η (cid:48) α (cid:107) θ ∗ − θ t (cid:107) + ( η (cid:48) ) L (cid:107) g tI t (cid:107) (cid:19) , (17)15here the inequality ζ follows by | I t \ R | ≤ s + 2 s ∗ as shown earlier and the observation that x − ax − b is a positive and increasing function on the interval x ≥ a if a ≥ b ≥
0. Note that since we have S t +1 ⊆ ( I t \ R ), we get | I t \ R | ≥ s . The inequality ζ follows by using RSC.Using (14), (17), and using S t +1 \ ( S t ∪ S ∗ ) ⊆ ( S t +1 ∪ S t ), we get: f ( θ t +1 ) − f ( θ t ) ≤ s ∗ s + s ∗ · (cid:18) η (cid:48) f ( θ ∗ ) − η (cid:48) f ( θ t ) + L − η (cid:48) α (cid:107) θ ∗ − θ t (cid:107) + ( η (cid:48) ) L (cid:107) g tI t (cid:107) (cid:19) − ( η (cid:48) ) L (cid:107) g tS t ∪ S ∗ (cid:107) − η (cid:48) (1 − η (cid:48) )2 L (cid:107) g tS t +1 \ ( S t ∪ S ∗ ) (cid:107) . (18)We now set η (cid:48) = 2 / s = 32 (cid:0) Lα (cid:1) s ∗ , so that we have s ∗ s + s ∗ ≤ α L ( L − η (cid:48) α ) . Since L ≥ α , we also have α L ( L − η (cid:48) α ) ≤ . Using these inequalities, we now rearrangethe terms in (18) above. f ( θ t +1 ) − f ( θ t ) ≤ s ∗ s + s ∗ · η (cid:48) · (cid:0) f ( θ ∗ ) − f ( θ t ) (cid:1) + α L (cid:107) θ ∗ − θ t (cid:107) + 124 L (cid:107) g tI t (cid:107) − L (cid:107) g tS t ∪ S ∗ (cid:107) − L (cid:107) g tS t +1 \ ( S t ∪ S ∗ ) (cid:107) . (19)Splitting (cid:107) g tI t (cid:107) = (cid:107) g tS t ∪ S ∗ (cid:107) + (cid:107) g tS t +1 \ ( S t ∪ S ∗ ) (cid:107) gives us f ( θ t +1 ) − f ( θ t ) ≤ s ∗ s + s ∗ · η (cid:48) · (cid:0) f ( θ ∗ ) − f ( θ t ) (cid:1) − L (cid:18) (cid:107) g tS t ∪ S ∗ (cid:107) − α (cid:107) θ ∗ − θ t (cid:107) (cid:19) − L · (cid:18) − (cid:19) (cid:107) g tS t +1 \ ( S t ∪ S ∗ ) (cid:107) , ≤ s ∗ s + s ∗ · η (cid:48) · (cid:0) f ( θ ∗ ) − f ( θ t ) (cid:1) − L (cid:18) (cid:107) g tS t ∪ S ∗ (cid:107) − α (cid:107) θ ∗ − θ t (cid:107) (cid:19) ≤ s ∗ s + s ∗ · η (cid:48) · (cid:0) f ( θ ∗ ) − f ( θ t ) (cid:1) − α L (cid:0) f ( θ t ) − f ( θ ∗ ) (cid:1) , (20)where the last inequality above follows using Lemma 6. The result now follows by observing that s ∗ s + s ∗ ≥ Lemma 6. (cid:18) (cid:107) g tS t ∪ S ∗ (cid:107) − α (cid:107) θ ∗ − θ t (cid:107) (cid:19) ≥ α · (cid:0) f ( θ t ) − f ( θ ∗ ) (cid:1) . Proof.
Using the RSC property, we have: f ( θ t ) − f ( θ ∗ ) ≤ (cid:104) g t , θ t − θ ∗ (cid:105) − α (cid:107) θ ∗ − θ t (cid:107) = (cid:104) g tS t ∪ S ∗ , θ tS t ∪ S ∗ − θ ∗ S t ∪ S ∗ (cid:105) − α (cid:107) θ ∗ − θ t (cid:107) , ≤ (cid:107) g tS t ∪ S ∗ (cid:107) (cid:107) θ t − θ ∗ (cid:107) − α (cid:107) θ ∗ − θ t (cid:107) . (21)16ow, (cid:107) g tS t ∪ S ∗ (cid:107) − α (cid:107) θ ∗ − θ t (cid:107) = (cid:16) (cid:107) g tS t ∪ S ∗ (cid:107) − α (cid:107) θ ∗ − θ t (cid:107) (cid:17) (cid:16) (cid:107) g tS t ∪ S ∗ (cid:107) + α (cid:107) θ ∗ − θ t (cid:107) (cid:17) , ≥ (cid:0) f ( θ t ) − f ( θ ∗ ) (cid:1) (cid:107) θ t − θ ∗ (cid:107) · (cid:16) (cid:107) g tS t ∪ S ∗ (cid:107) + α (cid:107) θ ∗ − θ t (cid:107) (cid:17) ≥ α · (cid:0) f ( θ t ) − f ( θ ∗ ) (cid:1) , (22)where the first inequality above follows from (21). B Proofs for Section 4
Proof of Theorem 3.
Let θ ∗ be the empirical loss minimizer over the set of s -sparse vectors. Theninvoking Theorem 1 with f = L ( · ; Z n ), we get L ( θ τ , Z n ) − (cid:15) ≤ L ( θ ∗ , Z n ) ≤ L ( ¯ θ , Z n ) ≤ L ( θ τ ; Z n ) + (cid:104)∇L ( ¯ θ ; Z n ) , ( ¯ θ − θ τ ) (cid:105) − α s + s ∗ (cid:107) ¯ θ − θ τ (cid:107) where the 2nd inequality is by definition of θ ∗ and 3rd is by RSC (since θ ∗ , θ τ are s ∗ , s sparse).Duality gives us the upper bound (cid:104)∇L ( ¯ θ ; Z n ) , ( ¯ θ − θ τ ) (cid:105) ≤ (cid:107)∇L ( ¯ θ ; Z n ) (cid:107) ∞ (cid:107) ¯ θ − θ τ (cid:107) ≤ √ s + s ∗ (cid:107)∇L ( ¯ θ ; Z n ) (cid:107) ∞ (cid:107) ¯ θ − θ τ (cid:107) Combining the last two inequalities and rearranging gives a quadratic inequality in (cid:107) ¯ θ − θ τ (cid:107) : α s + s ∗ (cid:107) ¯ θ − θ τ (cid:107) − √ s + s ∗ (cid:107)∇L ( ¯ θ ; Z n ) (cid:107) ∞ (cid:107) ¯ θ − θ τ (cid:107) − (cid:15) ≤ C Proofs for Section 5
Proof of Lemma 3.
We will start by proving a more general result of which the claimed result willbe a corollary. More specifically, we shall prove that for any γ ≥ α , we have2 γ ( f ( θ t ) − f ( θ ∗ )) ≤ γ (cid:18) f ( θ t ) − f ( θ ∗ ) + α · (cid:18) − αγ (cid:19) (cid:107) θ t − θ ∗ (cid:107) (cid:19) ≤ γ (cid:107) g tS t ∪ S ∗ (cid:107) − (cid:107) θ tS t \ S ∗ (cid:107) , Setting γ = α will yield the claimed result. It is easy to see that the following inequality holdstrivially since γ ≥ α γ ( f ( θ t ) − f ( θ ∗ )) ≤ γ (cid:18) f ( θ t ) − f ( θ ∗ ) + α · (cid:18) − αγ (cid:19) (cid:107) θ t − θ ∗ (cid:107) (cid:19) . For the second inequality, we first use the RSC condition to obtain: f ( θ ∗ ) − f ( θ t ) ≥ (cid:104) θ ∗ − θ t , g t (cid:105) + α (cid:107) θ t − θ ∗ (cid:107) . M D t = S ∗ \ S t be the set of true support elements missing from θ t and F A t = S t \ S ∗ be theset of incorrect elements included in the support of θ t . Since θ t is obtained by a “fully corrective”process (recall θ t = arg min θ ,supp ( θ ) ⊆ S t f ( θ )), we have g tS t = . Thus (cid:104) θ ∗ − θ t , g t (cid:105) = (cid:104) θ ∗ MD t , g tMD t (cid:105) .Putting this into the above expansion gives (cid:104) θ ∗ MD t , g tMD t (cid:105) ≤ f ( θ ∗ ) − f ( θ t ) − α (cid:107) θ t − θ ∗ (cid:107) (23)We now present some simple inequalities that will help us get our desired bounds. Firstly, we have (cid:107) θ ∗ MD t + γ g tMD t (cid:107) = (cid:107) θ ∗ MD t (cid:107) + γ (cid:107) g tMD t (cid:107) + 2 γ (cid:104) θ ∗ MD t , g tMD t (cid:105) ≥ , (24)since the first expression is a norm. Next, since M D t ∩ F A t = ∅ , we have (cid:107) θ ∗ − θ t (cid:107) ≥ (cid:107) θ ∗ MD t (cid:107) + (cid:107) θ tF A t (cid:107) . (25)Putting equations 23 and 24, we have:2 γ (cid:16) f ( θ t ) − f ( θ ∗ ) + α (cid:107) θ t − θ ∗ (cid:107) (cid:17) ≤ (cid:107) θ ∗ MD t (cid:107) + γ (cid:107) g tMD t (cid:107) . (26)Now, using (25), we get:2 γ (cid:18) f ( θ t ) − f ( θ ∗ ) + α (cid:18) − αγ (cid:19) (cid:107) θ t − θ ∗ (cid:107) (cid:19) ≤ γ (cid:107) g tMD t (cid:107) − (cid:107) θ tF A t (cid:107) We finish off the proof by noticing that since g tS t = , we have (cid:107) g tMD t (cid:107) = (cid:107) g tS t ∪ S ∗ (cid:107) Proof of Theorem 4.
Let z tS t = θ tS t , z tZ t \ S t = − L g tZ t \ S t , and z tZ t = 0.Then, using the RSS property, we have: f ( z t ) − f ( θ t ) ≤ (cid:104) z t − θ t , g t (cid:105) + L (cid:107) z t − θ t (cid:107) , ζ ≤ − L (cid:107) g tZ t \ S t (cid:107) + L (cid:107) z tZ t \ S t (cid:107) , ζ = − L · (cid:107) g tZ t \ S t (cid:107) , ζ ≤ − L · (cid:107) g tS ∗ \ S t (cid:107) , ζ ≤ − αL · (cid:0) f ( θ t ) − f ( θ ∗ ) (cid:1) , (27)where ζ follows by observing g tS t = 0, and S t ⊆ Z t . ζ follows by z tZ t \ S t = − L g tZ t \ S t . ζ followsby (cid:96) ≥ s ∗ , and Z t \ S t are the (cid:96) largest elements of | g tZ t \ S t | .Now, using Lemma 4 and (27) along with f ( θ t +1 ) ≤ f ( (cid:101) θ t ) and f ( β t ) ≤ f ( z t ), we have: f ( θ t +1 ) − f ( θ ∗ ) ≤ (cid:16) − αL (cid:17) · (cid:18) Lα · (cid:96)s + (cid:96) − s ∗ (cid:19) · (cid:0) f ( θ t ) − f ( θ ∗ ) (cid:1) . (28)Theorem now follows by using the above equation with the assumption that s + (cid:96) − s ∗ ≥ L · (cid:96)α .18 roof of Theorem 5. Using RSS property: f ( v t ) − f ( θ t ) ≤ (cid:104) v t − θ t , g t (cid:105) + L (cid:107) v t − θ t (cid:107) , ζ ≤ − η (cid:107) g tS t +1 \ S t (cid:107) + L (cid:107) v tS t +1 \ S t (cid:107) + (cid:107) θ tS t \ S t +1 (cid:107) ) , ζ ≤ − η (cid:107) g tS t +1 \ S t (cid:107) + L (cid:107) v tS t +1 \ S t (cid:107) , ζ = − (1 − η · L ) · η · (cid:107) g tS t +1 \ S t (cid:107) , (29)where ζ follows by observing that g tS t = 0 and v tS t +1 \ S t = − η g tS t +1 \ S t . ζ follows by the propertyof PHT operator which ensures that each element of v tS t +1 \ S t is bigger than θ tS t \ S t +1 and by using | S t +1 \ S t | = | S t \ S t +1 | . ζ follows by using v tS t +1 \ S t = − η g tS t +1 \ S t .Similar to the analysis given in [17], we divide the analysis in three mutually exclusive cases.The lemma then follows by combining (29) with the case-by-case analyses below and observing that f ( θ t +1 ) ≤ f ( v t ) because of the fully corrective step. Case 1 : | S t +1 \ S t | = (cid:96) < | S ∗ \ S t | . In this case, As v tS t +1 \ S t is obtained by selecting | S t +1 \ S t | largest elements of z tS t ∪ bot t . Hence, η (cid:107) g tS t +1 \ S t (cid:107) ≥ min (cid:18) , | S t +1 \ S t || S ∗ \ S t | (cid:19) (cid:107) z tS ∗ \ S t (cid:107) = η min (cid:18) , | S t +1 \ S t || S ∗ \ S t | (cid:19) (cid:107) g tS ∗ \ S t (cid:107) = η min (cid:18) , | S t +1 \ S t || S ∗ \ S t | (cid:19) (cid:107) g tS ∗ ∪ S t (cid:107) , (30)since g tS t = 0. Now, using the fact that | S t +1 \ S t | = (cid:96) , | S ∗ \ S t | ≤ s ∗ , and using Lemma 3, we have: (cid:107) g tS t +1 \ S t (cid:107) ≥ α · min (cid:18) , (cid:96)s ∗ (cid:19) (cid:0) f ( θ t ) − f ( θ ∗ ) (cid:1) . (31) Case 2 : | S t +1 \ S t | < (cid:96) , | S t +1 \ S t | ≤ | S ∗ \ S t | . In this case, each element of z tS t +1 ∩ S t is larger thaneach element of z tS ∗ \ ( S t +1 ∪ S t ) , else that element of S ∗ \ ( S t +1 ∪ S t ) would have been selected. Thatis, (cid:107) z tS ∗ \ ( S t +1 ∪ S t ) (cid:107) | S ∗ \ ( S t +1 ∪ S t ) | ≤ (cid:107) z t ( S t +1 ∩ S t ) \ S ∗ (cid:107) | ( S t +1 ∩ S t ) \ S ∗ | . Using z tS ∗ \ S t = − η g tS ∗ \ S t , z tS t = θ tS t and ( S t +1 ∩ S t ) \ S ∗ ⊆ S t \ S ∗ , we have: η (cid:107) g tS ∗ \ ( S t ∪ S t +1 ) (cid:107) ≤ s ∗ s − (cid:96) − s ∗ (cid:107) θ tS t \ S ∗ (cid:107) , (32)where the bound on | S ∗ \ ( S t +1 ∪ S t ) || ( S t +1 ∩ S t ) \ S ∗ | follows by observing | S ∗ \ ( S t +1 ∪ S t ) | ≤ s ∗ and | ( S t +1 ∩ S t ) \ S ∗ | ≥ s − (cid:96) − s ∗ . 
Using (32) and the fact that each element of v t ( S t +1 \ S t ) ∩ S ∗ is selected from the largest | ( S t +1 \ S t ) ∩ S ∗ | elements of − η · g t ( S t +1 \ S t ) ∩ S ∗ we have: η (cid:107) g tS ∗ \ S t (cid:107) ≤ η (cid:16) (cid:107) g t ( S t +1 \ S t ) ∩ S ∗ (cid:107) + (cid:107) g tS ∗ \ ( S t +1 ∪ S t ) (cid:107) (cid:17) ≤ (cid:18) η (cid:107) g t ( S t +1 \ S t ) ∩ S ∗ (cid:107) + s ∗ s − (cid:96) − s ∗ (cid:107) θ tS t \ S ∗ (cid:107) (cid:19) . (33)19sing the above equation and Lemma 3, we have:2 α (cid:0) f ( θ t ) − f ( θ ∗ ) (cid:1) ≤ α (cid:107) g t ( S t +1 \ S t ) ∩ S ∗ (cid:107) + (cid:18) α η · s ∗ s − (cid:96) − s ∗ − (cid:19) (cid:107) θ tS t \ S ∗ (cid:107) , (34)Using s ≥ (cid:0) Lα (cid:1) s ∗ , we have: (cid:107) g tS t +1 \ S t (cid:107) ≥ α (cid:0) f ( θ t ) − f ( θ ∗ ) (cid:1) . (35) Case 3 : | S t +1 \ S t | ≥ | S ∗ \ S t | . Now, as v tS t +1 \ S t is obtained by selecting | S t +1 \ S t | largestelements of z tS t ∪ bot t . Hence, using Lemma 3, we have: (cid:107) v tS t +1 \ S t (cid:107) ≥ (cid:107) z tS ∗ \ S t (cid:107) = η (cid:107) g tS ∗ \ S t (cid:107) ≥ η · α · (cid:0) f ( θ t ) − f ( θ ∗ ) (cid:1) . (36)The lemma now follows by combining (29), (31), (35), and (36) D Supplementary Experimental Results
Below we present plots that were not included in the main text.
Figure 2: Counterparts of Figure 1 for OMPR, CoSaMP and L1.
Figure 3: The effect of increasing sample sizes relative to the base value $s^*\cdot\log p$.