Reducing Sampling Ratios and Increasing Number of Estimates Improve Bagging in Sparse Regression
Luoluo Liu (J), Sang (Peter) Chin (B, J), and Trac D. Tran (J)
[email protected], [email protected], [email protected]
(J) Department of Electrical Engineering, Johns Hopkins University, Baltimore, MD, 21210
(B) Department of Computer Science & Hariri Institute of Computing, Boston University, Boston, MA, 02215
Abstract—Bagging, a powerful ensemble method from machine learning, has shown the ability to improve the performance of unstable predictors in difficult practical settings. Although Bagging is most well known for its application in classification problems, here we demonstrate that employing Bagging in sparse regression improves performance compared to the baseline method (ℓ₁ minimization). Although the original Bagging method uses a bootstrap sampling ratio of 1, such that the size of each bootstrap sample L is the same as the total number of data points m, we generalize the bootstrap sampling ratio to explore the optimal sampling ratios for various cases. The performance limits associated with different choices of bootstrap sampling ratio L/m and number of estimates K are analyzed theoretically. Simulation results show that a lower L/m ratio leads to better performance than the conventional choice (L/m = 1), especially in challenging cases with a low number of measurements. With the reduced sampling rate, the recovered SNR improves over the original Bagging method, and over the base algorithm ℓ₁ minimization by up to 367%. With a properly chosen sampling ratio, a reasonably small number of estimates (K = 30) gives a satisfying result, although increasing K is found to always improve or at least maintain performance.

Index Terms—Bootstrap, Bagging, Sparse Regression, Sparse Recovery, ℓ₁ minimization, LASSO

I. INTRODUCTION
Compressed Sensing (CS) and Sparse Regression study solving the linear inverse problem in the form of least squares with an additional sparsity-promoting penalty term. Formally speaking, the measurement vector y ∈ R^m is generated by y = Ax + z, where A ∈ R^{m×n} is the sensing matrix, x ∈ R^n is a vector of sparse coefficients with very few non-zero entries, and z is a noise vector with bounded energy. The problem of interest is finding the sparse vector x given A as well as y. Among various choices of sparse regularizers, the ℓ₁ norm is the most commonly used. The noiseless case is referred to as Basis Pursuit (BP), whereas the noisy version is known as basis pursuit denoising [1], or the least absolute shrinkage and selection operator (Lasso) [2]:

P_λ : min_x λ‖x‖₁ + ½‖y − Ax‖₂².    (1)

The performance of ℓ₁ minimization in recovering the true sparse solution has been thoroughly investigated in the CS literature [3]–[6]. CS theory reveals that if the sensing matrix A has good properties, then BP recovers the ground truth and the Lasso solution is close enough to the true solution with high probability [3].

Classical sparse regression recovery based on ℓ₁ minimization solves the problem with all available measurements. In practice, it is often the case that not all measurements are available or required for recovery. Some measurements might be severely corrupted, missing, or adversarial samples that break down the algorithm. These issues could lead to the failure of the sparse regression algorithm.

The Bagging procedure [7] proposed by Breiman is an efficient parallel ensemble method that improves the performance of unstable predictors. In Bagging, we first generate a bootstrap sample by randomly drawing m samples uniformly with replacement from all m data points. We repeat the process K times and generate K bootstrap samples. One bootstrapped estimator is then computed for each bootstrap sample, and the final Bagged estimator is the average of all K bootstrapped estimators.

Applying Bagging to find a sparse vector with a specific symmetric pattern was shown empirically to reduce estimation error when the sparsity level s is high [7] in a forward subset selection problem. This experiment shows the possibility of using Bagging to improve other sparse regression methods on general sparse signals. Although the well-known conventional Bagging method uses a bootstrap ratio of 1, some follow-up works have shown empirically that lower ratios improve Bagging for some classic classifiers: the Nearest Neighbour Classifier [8], CART Trees [9], Linear SVM, LDA, and the Logistic Linear Classifier [10]. Based on this success, we hypothesize that reducing the bootstrap ratio will also improve the performance of Bagging in sparse regression. Therefore, we set up the framework with a generic bootstrap ratio and study its behavior for various bootstrap ratios.

In this paper, we use the notation L for the size of the bootstrap samples, m for the number of all measurements, and K for the number of estimates. (i) We demonstrate the generalized Bagging framework with bootstrap ratio L/m and number of estimates K as parameters. (ii) We explore the theoretical properties associated with finite L/m and K. (iii) We present simulation results with various parameters L/m and K and compare the performances of ℓ₁ minimization, conventional Bagging, and Bolasso [11], another modern technique that incorporates Bagging into sparse recovery. An important discovery is that in challenging cases with small m, Bagging with a ratio L/m that is smaller than the conventional ratio 1 can lead to better performance.
II. PROPOSED METHOD: BAGGING IN SPARSE REGRESSION
Our proposed method is sparse recovery using a generalized Bagging procedure. It is accomplished in three steps. First, we generate K bootstrap samples, each of size L, randomly sampled uniformly and independently with replacement from the original m data points. This results in K pairs of measurements and sensing matrices: {y[I₁], A[I₁]}, {y[I₂], A[I₂]}, ..., {y[I_K], A[I_K]}. We use the notation (·)[I] on matrices or vectors to denote retaining only the rows supported on I and discarding all other rows in the complement I^c. Second, we solve the sparse recovery problem independently on each of those pairs; mathematically, for all j = 1, 2, ..., K, we find

x_j^B = argmin_{x ∈ R^n} λ_{(L,K)}‖x‖₁ + ½‖y[I_j] − A[I_j]x‖₂²,    (2)

where λ_{(L,K)} is the parameter balancing the least-squares fit and the sparsity penalty for the parameter choice (L, K) in Bagging. The problem in (2) is a Lasso problem, and numerous optimization methods can be used to solve it, such as [12]–[15].

Finally, the Bagging solution is obtained by averaging all K estimators obtained from solving (2):

Bagging: x^B = (1/K) Σ_{j=1}^{K} x_j^B.    (3)

Compared to the ℓ₁ minimization solution obtained using all the measurements, the bagged solution x^B is obtained by resampling without increasing the number of original measurements. We will show that in some cases, the bagged solution outperforms the base ℓ₁ minimization solution.
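As a concrete illustration, the following Python sketch implements the three steps above on a toy problem. It is a minimal sketch rather than the implementation used in our experiments: the experiments in Section V use an ADMM implementation of Lasso, while here scikit-learn's coordinate-descent Lasso is substituted for convenience, and the sampling ratio, number of estimates, and regularization weight are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

def bagged_lasso(A, y, ratio=0.7, K=30, lam=0.01, seed=0):
    """Generalized Bagging for sparse regression (cf. Eqs. (2)-(3)).

    ratio = L/m is the bootstrap sampling ratio, K is the number of
    bootstrapped estimates, and lam plays the role of lambda_(L,K).
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    L = max(1, int(round(ratio * m)))
    estimates = np.zeros((K, n))
    for j in range(K):
        # Step 1: draw a bootstrap sample I_j of size L with replacement.
        idx = rng.integers(0, m, size=L)
        # Step 2: solve the Lasso problem on the pair (y[I_j], A[I_j]).
        solver = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
        solver.fit(A[idx, :], y[idx])
        estimates[j] = solver.coef_
    # Step 3: average the K bootstrapped estimators (Eq. (3)).
    return estimates.mean(axis=0)
```

In practice one would sweep ratio, K, and lam and keep the best setting, mirroring the parameter sweep described in Section V.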
III. PRELIMINARIES

We summarize the results from CS theory that we need in order to analyze our algorithm mathematically. We introduce the Null Space Property (NSP) as well as the Restricted Isometry Property (RIP). We also provide the tail bound for the sum of i.i.d. bounded random variables, which is needed to prove our theorems.
A. Null Space Property (NSP)
The NSP [16] for standard sparse recovery characterizes the necessary and sufficient condition for successful sparse recovery using ℓ₁ minimization.

Theorem 1 (NSP). Every s-sparse signal x ∈ R^n is the unique solution to P₁ : min ‖x‖₁ s.t. y = Ax if and only if A satisfies the NSP of order s; namely, for all v ∈ Null(A)\{0} and for every index set S ⊂ {1, 2, ..., n} of cardinality card(S) ≤ s, the following is satisfied:

‖v[S]‖₁ < ‖v[S^c]‖₁,

where v[S] retains the entries of v on the index set S and is zero elsewhere.

B. Restricted Isometry Property (RIP)

Although the NSP directly characterizes the success of sparse recovery, checking the NSP condition is computationally intractable. It is also not suitable for quantifying performance in noisy conditions, since it is a binary (true or false) metric rather than a continuous measure. The Restricted Isometry Property (RIP) [3] is introduced to overcome these difficulties.
Definition 2 (RIP). A matrix A with ℓ₂-normalized columns satisfies the RIP of order s if there exists a constant δ_s(A) ∈ [0, 1) such that for every s-sparse v ∈ R^n, the following is satisfied:

(1 − δ_s(A))‖v‖₂² ≤ ‖Av‖₂² ≤ (1 + δ_s(A))‖v‖₂².    (4)
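Computing δ_s(A) exactly is combinatorial, but a crude empirical check of the inequality in (4) is easy to sketch. The snippet below, a hedged illustration rather than anything used in this paper, samples random s-sparse vectors and reports the largest observed deviation of ‖Av‖₂²/‖v‖₂² from 1, which only lower-bounds the true RIP constant.

```python
import numpy as np

def empirical_rip_deviation(A, s, trials=2000, seed=0):
    """Monte Carlo lower bound on the RIP constant delta_s(A).

    Samples random s-sparse vectors v and measures how far
    ||Av||_2^2 / ||v||_2^2 deviates from 1 (cf. Definition 2).
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    worst = 0.0
    for _ in range(trials):
        support = rng.choice(n, size=s, replace=False)
        v = np.zeros(n)
        v[support] = rng.standard_normal(s)
        ratio = np.linalg.norm(A @ v) ** 2 / np.linalg.norm(v) ** 2
        worst = max(worst, abs(ratio - 1.0))
    return worst  # a lower bound on delta_s(A)

# Example: a column-normalized Gaussian matrix.
A = np.random.default_rng(1).standard_normal((100, 200))
A /= np.linalg.norm(A, axis=0)
print(empirical_rip_deviation(A, s=5))
```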
C. Noisy Recovery Bounds Based on RIP Constants

It is known that satisfying the RIP condition implies that the NSP condition is also satisfied for sparse recovery [3]. More specifically, if the RIP constant of order 2s is strictly less than √2 − 1, then the NSP of order s is satisfied. We recall Theorem 1.2 in [3], where the noisy recovery performance of ℓ₁ minimization is bounded in terms of the RIP constant. This error bound is associated with the s-sparse approximation error and the noise level.

Theorem 3 (Noisy recovery for ℓ₁ minimization [3]). Let y = Ax⋆ + z with ‖z‖₂ ≤ ε, and let x_s be the s-sparse vector that minimizes ‖x − x⋆‖₁ over all s-sparse signals. If δ_{2s}(A) ≤ δ < √2 − 1 and x^{ℓ₁} is the solution of ℓ₁ minimization, then it obeys

‖x^{ℓ₁} − x⋆‖₂ ≤ C₀(δ) s^{−1/2} ‖x_s − x⋆‖₁ + C₁(δ) ε,

where C₀(·), C₁(·) are constants determined by the RIP constant δ_{2s}. These two constants take the form C₀(δ) = 2(1 − (1 − √2)δ)/(1 − (1 + √2)δ) and C₁(δ) = 4√(1 + δ)/(1 − (1 + √2)δ).
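To get a feel for the magnitudes involved, the short computation below evaluates the two constants of Theorem 3 at a few admissible values of δ. The closed forms are taken as stated above (from [3]), so the printed numbers should be read as indicative.

```python
import numpy as np

def c0(delta):
    # C0(delta) = 2(1 - (1 - sqrt(2)) delta) / (1 - (1 + sqrt(2)) delta)
    return 2 * (1 - (1 - np.sqrt(2)) * delta) / (1 - (1 + np.sqrt(2)) * delta)

def c1(delta):
    # C1(delta) = 4 sqrt(1 + delta) / (1 - (1 + sqrt(2)) delta)
    return 4 * np.sqrt(1 + delta) / (1 - (1 + np.sqrt(2)) * delta)

for delta in (0.1, 0.2, 0.3):  # all below sqrt(2) - 1, roughly 0.414
    print(f"delta={delta:.1f}  C0={c0(delta):.2f}  C1={c1(delta):.2f}")
```

Both constants grow as δ approaches √2 − 1, which is why well-conditioned (small-δ) sensing matrices give tighter recovery guarantees.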
D. Tail Bound of the Sum of i.i.d. Bounded Random Variables

This exponential bound is similar in structure to Hoeffding's inequality. Proving it requires working with the moment generating function of a random variable.

Lemma 4. Let Y₁, Y₂, ..., Y_n be i.i.d. observations of a bounded random variable Y with a ≤ Y ≤ b, whose expectation E Y exists. Then for any ξ > E Y,

P{ Σ_{i=1}^{n} Y_i ≥ nξ } ≤ exp{ −2n(ξ − E Y)² / (b − a)² }.    (5)
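As a quick sanity check (not part of the paper's development), one can compare the empirical tail probability of a bounded sum against the right-hand side of (5); the uniform distribution, threshold, and sample sizes below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 200, 20_000
a, b, mean = 0.0, 1.0, 0.5          # Y ~ Uniform(0, 1), so E[Y] = 0.5
xi = 0.55                           # threshold xi > E[Y]

# Empirical P{ sum_i Y_i >= n * xi } over many repetitions.
sums = rng.uniform(a, b, size=(trials, n)).sum(axis=1)
empirical = np.mean(sums >= n * xi)

# Right-hand side of (5): exp(-2 n (xi - E[Y])^2 / (b - a)^2).
bound = np.exp(-2 * n * (xi - mean) ** 2 / (b - a) ** 2)
print(empirical, bound)             # the empirical tail should not exceed the bound
```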
IV. THEORETICAL RESULTS FOR BAGGING ASSOCIATED WITH SAMPLING RATIO L/m AND THE NUMBER OF ESTIMATES K

A. Noisy Recovery for Employing Bagging in Sparse Regression
We derive the performance bound for employing Bagging in sparse regression, in which the final estimate is the average of multiple estimates solved individually from bootstrap samples. We give theoretical results for the case in which the true signal x⋆ is exactly s-sparse, and for the general case with no assumption on the sparsity level of the ground-truth signal. Note that the theorems are stated for a deterministic sensing matrix, measurements, and noise A, y, z, in which all vector norms are equivalent.

Theorem 5 (Bagging: error bound for ‖x⋆‖₀ = s). Let y = Ax⋆ + z with ‖z‖₂ < ∞. Assume that, for the multi-sets {I_j} generating the set of sensing matrices A[I₁], A[I₂], ..., A[I_K], there exists a constant δ_{(L,K)}, which relates to L and K, such that for all j ∈ {1, 2, ..., K}, δ_{2s}(A[I_j]) ≤ δ_{(L,K)} < √2 − 1. Let x^B be the solution of Bagging. Then for any τ > 0, x^B satisfies

P{ ‖x^B − x⋆‖₂ ≤ C₁(δ_{(L,K)}) (√(L/m)‖z‖₂ + τ) } ≥ 1 − exp( −2Kτ⁴ / (L²‖z‖∞⁴) ).

We also study the behavior of Bagging for a general signal x⋆ with ‖x⋆‖₀ ≥ s, in which the performance involves the s-sparse approximation error. We use the vector e to denote this error, e = x⋆ − x_s, where x_s is the best s-sparse approximation of the ground-truth signal over all s-sparse signals.

Theorem 6 (Bagging: error bound for general signal recovery). Let y = Ax⋆ + z with ‖z‖₂ < ∞. Assume that, for the multi-sets {I_j} generating the set of sensing matrices A[I₁], A[I₂], ..., A[I_K], there exists δ_{(L,K)} such that for all j ∈ {1, 2, ..., K}, δ_{2s}(A[I_j]) ≤ δ_{(L,K)} < √2 − 1. Let x^B be the solution of Bagging. Then for any τ > 0, x^B satisfies

P{ ‖x^B − x⋆‖₂ ≤ C₀(δ_{(L,K)}) s^{−1/2}‖e‖₁ + C₁(δ_{(L,K)}) (√(L/m)‖z‖₂ + τ) } ≥ 1 − exp( −2K C₁⁴(δ_{(L,K)}) τ⁴ / (b′)² ),

where b′ = ( C₀(δ_{(L,K)}) s^{−1/2}‖e‖₁ + C₁(δ_{(L,K)}) √L ‖z‖∞ )².

Theorem 6 gives the performance bound for Bagging in sparse signal recovery without the s-sparse assumption, and it reduces to Theorem 5 when the s-sparse approximation error is zero, ‖e‖₁ = 0.

We give a proof sketch that demonstrates the key ideas used to prove both Theorem 5 and Theorem 6. The main tools are Theorem 3 and Lemma 4. Some special treatment is required to handle the additional terms while proving Theorem 6. For more technical details, full proofs can be found in [17].
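As a consistency check (using the forms stated above), setting the s-sparse approximation error to zero collapses the bound of Theorem 6 to that of Theorem 5:

```latex
\|e\|_1 = 0 \;\Rightarrow\; b' = \big(C_1(\delta_{(L,K)})\sqrt{L}\,\|z\|_\infty\big)^2,
\qquad
\exp\!\Big(-\tfrac{2K\,C_1^4(\delta_{(L,K)})\,\tau^4}{(b')^2}\Big)
 = \exp\!\Big(-\tfrac{2K\tau^4}{L^2\|z\|_\infty^4}\Big),
```

which is exactly the failure probability appearing in Theorem 5, while the error bound loses its C₀(δ_{(L,K)}) s^{−1/2}‖e‖₁ term.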
Proof Sketch: Similar to the sufficient condition in Theorem 3, the sufficient condition for analyzing Bagging is that all matrices resulting from the bootstrap have well-behaved RIP constants of order 2s, bounded by a universal constant δ.

Let I denote a generic multi-set containing L elements, where each element of I is independent and identically distributed according to the discrete uniform distribution on the sample space {1, 2, ..., m}. Define the squared-error function f(x(I)) = ‖x(I) − x⋆‖₂², where x(I) is the solution of ℓ₁ minimization on I: x(I) = argmin ‖x‖₁ s.t. ‖y[I] − A[I]x‖₂ ≤ ε_I. The squared errors of the K bootstrapped estimators, f(x_j) = ‖x_j^B − x⋆‖₂², j = 1, 2, ..., K, are realizations generated i.i.d. from the distribution of f(x(I)).

We proceed using Lemma 4. We choose the upper bound of the error to be a function of the expected noise power, picking the bound ξ in terms of the root of the expected squared noise energy, √(E‖z[I]‖₂²) = √(L/m)‖z‖₂. We then need the upper bound b and the lower bound a of the random variable f(x(I)). Since it is non-negative, we choose a = 0. The upper bound b is obtained from Theorem 3, where the maximum value ‖z‖∞ is employed to further upper bound the noise level ‖z[I_j]‖₂. Through this process, we obtain the inequality P{ Σ_j ‖x_j^B − x⋆‖₂² − Kξ ≤ 0 } ≥ g(E f(x), b, a), for some function g.

The Bagging solution is the average of all bootstrapped estimators. The key chain of inequalities to establish is:

P{ ‖x^B − x⋆‖₂² − ξ ≤ 0 }
 = P{ K‖x^B − x⋆‖₂² − Σ_j f(x_j) + Σ_j f(x_j) − Kξ ≤ 0 }
 ≥ P{ K‖x^B − x⋆‖₂² − Σ_j f(x_j) ≤ 0, Σ_j f(x_j) − Kξ ≤ 0 }
 = P{ K‖x^B − x⋆‖₂² − Σ_j f(x_j) ≤ 0 } · P{ Σ_j f(x_j) − Kξ ≤ 0 }
 = P{ Σ_j ‖x_j^B − x⋆‖₂² − Kξ ≤ 0 }.

The first event is independent of the second, and it holds with probability 1 by Jensen's inequality. This establishes the relationship between the error bound of the Bagging solution and the sum of squared errors of the bootstrapped estimates. To obtain the bound for the second term, we follow the method described in the previous paragraph.
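The averaging step in the sketch rests on Jensen's inequality: the squared error of the averaged estimate never exceeds the average of the individual squared errors. A small numerical check of this step, with arbitrary random vectors standing in for the bootstrapped estimates, is below.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 30, 200
x_star = rng.standard_normal(n)                    # stand-in for the ground truth
estimates = x_star + rng.standard_normal((K, n))   # stand-ins for the x_j^B

x_bag = estimates.mean(axis=0)                     # Bagging average, Eq. (3)
lhs = np.linalg.norm(x_bag - x_star) ** 2
rhs = np.mean(np.linalg.norm(estimates - x_star, axis=1) ** 2)
assert lhs <= rhs + 1e-12                          # Jensen: ||mean||^2 <= mean of ||.||^2
print(lhs, rhs)
```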
B. Parameter Selection Guided by the Theoretical Analysis

Besides providing error bounds for general signals whose sparsity level might exceed s, Theorem 6 can be used to analyze cases in which m is not large enough for the sparsity level s. Theorems 5 and 6 also guide us toward optimal choices of the parameters: the bootstrap sampling ratio L/m and the number of estimates K.

Both Theorem 5 and Theorem 6 show that increasing the number of estimates K improves the result by increasing the lower bound on the certainty of the same performance. The growth rate of this certainty bound decreases with K. We validate this in our numerical experiments: even though increasing K improves the results, the performance tends to flatten out for large K.

The sampling ratio L/m affects the result through two factors. The first is the RIP constant, which in general decreases with increasing L (proved in [18] under a Gaussian assumption on the sensing matrix). Since C₁(δ) is a non-decreasing function of δ and a larger L usually results in a smaller δ, a larger L in general results in a smaller C₁(δ). On the other hand, the second factor is the multiplier of the noise power term, √(L/m), which suggests a smaller L.

Combining these two factors indicates that the best L/m ratio lies somewhere between a small and a large value. In the experimental results, we demonstrate that when m is small, varying the bootstrap sampling ratio L/m creates peaks whose largest value is attained at L/m < 1. The first factor, which relates L to the RIP constant, dominates in the stable case (when m is sufficiently large), so that a larger L leads to better performance.

Fig. 1. Performance curves for Bagging with various sampling ratios L/m and numbers of estimates K, together with the best performance of Bolasso as well as ℓ₁ minimization. The purple lines highlight conventional Bagging with L/m = 1. In all cases, SNR = 0 dB and the number of measurements is (a) m = 50, (b) m = 75, (c) m = 100, (d) m = 150, from left to right. The grey circle highlights the peak of Bagging, and the grey area highlights the bootstrap ratio at the peak point.
V. SIMULATIONS
In this section, we perform sparse recovery on simulated data to study the performance of our algorithm. In our experiment, all entries of A ∈ R^{m×n} are i.i.d. samples from the standard normal distribution N(0, 1). The signal dimension is n = 200, and various numbers of measurements from m = 50 to m = 2000 are explored. For the ground-truth signals, the sparsity level is always s = 50, and the non-zero entries are sampled from the standard Gaussian distribution with their locations generated uniformly at random. For the noise process z, entries are sampled i.i.d. from N(0, σ²), with variance σ² = 10^{−SNR/10}‖Ax‖₂²/m, where SNR represents the Signal to Noise Ratio. We add white Gaussian noise to make SNR = 0 dB. All numerical realizations have finite values. We use the ADMM [12] implementation of Lasso to solve all sparse regression problems, in which the parameter λ_{(L,K)} balances the least-squares fit and the sparsity penalty for the case with (L, K) as parameters.

We study how the bootstrap sampling ratio L/m as well as the number of estimates K affects the result. In our experiment, we take several values of K, from K = 30 to K = 100, and vary L/m over a grid of values up to 1. We report the Signal to Noise Ratio (SNR) as the error measure for recovery, SNR(x̂, x⋆) = −10 log₁₀(‖x̂ − x⋆‖₂²/‖x⋆‖₂²), averaged over independent trials. For all algorithms, we evaluate λ_{(L,K)} over a grid of values and then select the optimal value that gives the maximum averaged SNR over all trials.
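For readers who want to reproduce a toy version of this setup, the sketch below generates data matching the description above (Gaussian A, s-sparse ground truth, noise scaled to a target measurement SNR) and scores a recovered vector. The exact noise-scaling convention is our reading of the setup, and bagged_lasso refers to the illustrative routine sketched in Section II rather than the ADMM solver used for the reported results.

```python
import numpy as np

def make_problem(m, n=200, s=50, snr_db=0.0, seed=0):
    """Gaussian sensing matrix, s-sparse ground truth, noise at a target SNR."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, n))
    x_star = np.zeros(n)
    support = rng.choice(n, size=s, replace=False)
    x_star[support] = rng.standard_normal(s)
    clean = A @ x_star
    # Scale the noise so that 10*log10(||Ax||^2 / ||z||^2) equals snr_db.
    noise = rng.standard_normal(m)
    noise *= np.linalg.norm(clean) / (np.linalg.norm(noise) * 10 ** (snr_db / 20))
    return A, clean + noise, x_star

def recovery_snr(x_hat, x_star):
    """Recovered SNR in dB: -10*log10(||x_hat - x_star||^2 / ||x_star||^2)."""
    return -10 * np.log10(np.linalg.norm(x_hat - x_star) ** 2
                          / np.linalg.norm(x_star) ** 2)

A, y, x_star = make_problem(m=75)
x_hat = bagged_lasso(A, y, ratio=0.7, K=30, lam=0.05)  # from the Section II sketch
print(recovery_snr(x_hat, x_star))
```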
A. Performance of Bagging, Bolasso, and ℓ₁ Minimization

Bagging and Bolasso with various parameters K, L, as well as ℓ₁ minimization, are studied. The results are plotted in Figure 1. The colored curves show the cases of Bagging with various numbers of estimates K. The intersections of the colored curves with the purple solid vertical lines at L/m = 1 illustrate conventional Bagging with a full bootstrap rate. The grey circle highlights the best performance and the grey area highlights the optimal bootstrap ratio L/m. The performance of ℓ₁ minimization is depicted by the black dashed lines, while the best Bolasso performance is plotted using light green dashed lines. In these figures, for each condition with a choice of L, K, the information available to the Bagging and Bolasso algorithms is identical, and ℓ₁ minimization always has access to all m measurements.

From Figure 1, we see that when m is small, Bagging can outperform ℓ₁ minimization, and as m decreases, the margin increases. The important observation is that when the number of measurements is low (m between s and 2s, i.e., 50–100, where s is the sparsity level), using a reduced bootstrap ratio L/m (below 1) allows Bagging to beat the conventional choice of the full ratio for all choices of K. Moreover, with a reduced ratio and a small K, our algorithm is already quite robust and outperforms ℓ₁ minimization by a large margin. When the number of measurements is moderate (m = 3s = 150), Bagging still beats the baseline; however, the optimal parameters here are the bootstrap ratio L/m = 1 and the number of estimates K = 100. In this case, the reduced bootstrap ratio does not bring any performance improvement. Increasing the level of measurements makes the base algorithm more stable, and the advantage of Bagging starts decaying.

We perform the same experiments with higher numbers of measurements m, and Table I reports the best performance for various schemes: ℓ₁ minimization, the original Bagging scheme with a full bootstrap ratio, Bagging, and Bolasso, all at SNR = 0 dB. For Bagging, the peak values are found among the different choices of the parameters K and L that we explored. We see that when the number of measurements m is small (50–100), Bagging outperforms ℓ₁ minimization. The reduced bootstrap rate also improves conventional Bagging, and the improvement is significant when m = 50. When m is moderate (125–200), choosing reduced rates does not improve the performance compared to conventional Bagging; Bagging still outperforms ℓ₁ minimization, with smaller margins than in the cases with small m. When m is large (≥ 500), Bagging starts losing its advantage over ℓ₁ minimization. Bolasso performs comparably to the other algorithms only in the easiest case, with an extremely large m (= 2000), where it slightly outperforms all other algorithms.

TABLE I
THE PERFORMANCE OF ℓ₁ MINIMIZATION AND THE BEST PERFORMANCE AMONG ALL CHOICES OF L AND K FOR THE BAGGING AND BOLASSO METHODS WITH VARIOUS TOTAL NUMBERS OF MEASUREMENTS m. SNR = 0 dB. ALL PERFORMANCES ARE MEASURED BY THE AVERAGED RECOVERED SNR (dB).

                               Small m            Moderate m               Large m     Very large m
The number of measurements m   50    75    100    125   150   175   200    500   1000  2000
ℓ₁ min.                        0.12  0.57  1.00   1.70  2.19  2.61  2.97

VI. CONCLUSION
We extend the conventional Bagging scheme in sparse recovery by treating the bootstrap sampling ratio L/m as an adjustable parameter, and we derive error bounds for the algorithm associated with L/m and the number of estimates K. Bagging is particularly powerful when the number of measurements m is small. Although this regime is notoriously difficult, both in terms of improving sparse recovery results and of obtaining tight theoretical bounds, Bagging outperforms ℓ₁ minimization by a large margin (up to 367%). Moreover, the reduced sampling rate yields a further improvement, measured by the recovered SNR, over the conventional Bagging algorithm.

Our Bagging scheme achieves acceptable performance even with a small L/m and a relatively small K (K = 30 in our experimental study). The error bounds for Bagging predict that a smaller sampling rate L/m can lead to a performance improvement and that increasing K improves the certainty of the bound; both predictions are validated in our numerical simulations. For a sequential system, a reasonably large K is enough to obtain a fairly good solution. For a parallel system that allows a large number of processes to be run at the same time, a large K is preferred, since it in general gives a better result.

VII. ACKNOWLEDGEMENT
We would like to thank Dr. Dror Baron for insightful comments and suggestions, Dr. Cindy Rush for thoughtful feedback, and Nicholas Huang for his efforts in helping polish the manuscript, all of which improved the overall quality of our paper.
REFERENCES

[1] S. Chen, D. L. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 2001.
[2] R. Tibshirani. Regression shrinkage and selection via the Lasso. J. of the Royal Stat. Society, Series B, pages 267–288, 1996.
[3] E. J. Candes. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique, 346(9):589–592, 2008.
[4] E. J. Candes, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. on Info. Theory, 52(2):489–509, 2006.
[5] D. L. Donoho. Compressed sensing. IEEE Trans. on Info. Theory, 52(4):1289–1306, 2006.
[6] E. Candes and J. Romberg. Sparsity and incoherence in compressive sampling. Inverse Problems, 23(3):969, 2007.
[7] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[8] P. Hall and R. J. Samworth. Properties of bagged nearest neighbour classifiers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(3):363–379, 2005.
[9] M. Sabzevari, G. Martinez-Munoz, and A. Suarez. Improving the robustness of bagging with reduced sampling size. European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, 2014.
[10] F. Zaman and H. Hirose. Effect of subsampling rate on subbagging and related ensembles of stable classifiers. In International Conference on Pattern Recognition and Machine Intelligence, pages 44–49. Springer, 2009.
[11] F. R. Bach. Bolasso: model consistent lasso estimation through the bootstrap. In Proceedings of the 25th Int. Conf. on Machine Learning, pages 33–40. ACM, 2008.
[12] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[13] E. van den Berg and M. P. Friedlander. Probing the Pareto frontier for basis pursuit solutions. SIAM J. on Scientific Computing, 31(2):890–912, 2008.
[14] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Trans. on Sig. Proc., 57(7):2479–2493, 2009.
[15] J. Liu, S. Ji, and J. Ye. SLEP: Sparse Learning with Efficient Projections. Arizona State University, 2009.
[16] A. Cohen, W. Dahmen, and R. DeVore. Compressed sensing and best k-term approximation. Journal of the American Mathematical Society, 22(1):211–231, 2009.
[17] L. Liu, S. P. Chin, and T. D. Tran. JOBS: Joint-sparse optimization from bootstrap samples. arXiv preprint arXiv:1810.03743, 2018.
[18] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin. A simple proof of the restricted isometry property for random matrices. Constructive Approximation, 28(3):253–263, 2008.