Reducing Sampling Ratios and Increasing Number of Estimates Improve Bagging in Sparse Regression
Luoluo Liu (J), Sang (Peter) Chin (B, J), and Trac D. Tran (J)
[email protected], [email protected], [email protected]
(J) Department of Electrical Engineering, Johns Hopkins University, Baltimore, MD, 21210
(B) Department of Computer Science & Hariri Institute of Computing, Boston University, Boston, MA, 02215
Abstract—Bagging, a powerful ensemble method from machine learning, has shown the ability to improve the performance of unstable predictors in difficult practical settings. Although Bagging is most well known for its application in classification problems, here we demonstrate that employing Bagging in sparse regression improves performance compared to the baseline method (ℓ₁ minimization). Although the original Bagging method uses a bootstrap sampling ratio of 1, such that the size of each bootstrap sample L is the same as the total number of data points m, we generalize the bootstrap sampling ratio to explore the optimal sampling ratios for various cases. The performance limits associated with different choices of bootstrap sampling ratio L/m and number of estimates K are analyzed theoretically. Simulation results show that a lower L/m ratio leads to better performance than the conventional choice (L/m = 1), especially in challenging cases with a low number of measurements. With the reduced sampling rate, the recovered SNR improves over the original Bagging method, and over the base algorithm ℓ₁ minimization by up to 367%. With a properly chosen sampling ratio, a reasonably small number of estimates (K = 30) gives a satisfying result, although increasing K is found to always improve or at least maintain performance.

Index Terms—Bootstrap, Bagging, Sparse Regression, Sparse Recovery, ℓ₁ minimization, LASSO

I. INTRODUCTION
Compressed Sensing (CS) and Sparse Regression study solving the linear inverse problem in the form of least squares with an additional sparsity-promoting penalty term. Formally speaking, the measurement vector y ∈ R^m is generated by y = Ax + z, where A ∈ R^{m×n} is the sensing matrix, x ∈ R^n is a vector of sparse coefficients with very few non-zero entries, and z is a noise vector with bounded energy. The problem of interest is finding the sparse vector x given A as well as y. Among various choices of sparse regularizers, the ℓ₁ norm is the most commonly used. The noiseless case is referred to as Basis Pursuit (BP), whereas the noisy version is known as basis pursuit denoising [1], or the least absolute shrinkage and selection operator (Lasso) [2]:

P_λ : min_x λ‖x‖₁ + ½‖y − Ax‖₂².    (1)

The performance of ℓ₁ minimization in recovering the true sparse solution has been thoroughly investigated in the CS literature [3]–[6]. CS theory reveals that if the sensing matrix A has good properties, then BP recovers the ground truth and the Lasso solution is close enough to the true solution with high probability [3].

Classical sparse regression recovery based on ℓ₁ minimization solves the problem with all available measurements. In practice, it is often the case that not all measurements are available or required for recovery. Some measurements might be severely corrupted, missing, or adversarial samples that break down the algorithm. These issues could lead to the failure of the sparse regression algorithm.

The Bagging procedure [7] proposed by Breiman is an efficient parallel ensemble method that improves the performance of unstable predictors. In Bagging, we first generate a bootstrap sample by randomly drawing m samples uniformly with replacement from all m data points. We repeat the process K times and generate K bootstrap samples. One bootstrapped estimator is then computed for each bootstrap sample, and the final Bagged estimator is the average of all K bootstrapped estimators.

Applying Bagging to find a sparse vector with a specific symmetric pattern was shown empirically to reduce estimation error when the sparsity level s is high [7] in a forward subset selection problem. This experiment shows the possibility of using Bagging to improve other sparse regression methods on general sparse signals. Although the well-known conventional Bagging method uses a bootstrap ratio of 1, some follow-up works have shown empirically that lower ratios improve Bagging for some classic classifiers: the Nearest Neighbour Classifier [8], CART Trees [9], Linear SVM, LDA, and the Logistic Linear Classifier [10]. Based on this success, we hypothesize that reducing the bootstrap ratio will also improve the performance of Bagging in sparse regression. Therefore, we set up the framework with a generic bootstrap ratio and study its behavior for various bootstrap ratios.

In this paper, we use the notation L for the size of the bootstrap samples, m for the number of all measurements, and K for the number of estimates. (i) We demonstrate the generalized Bagging framework with bootstrap ratio L/m and number of estimates K as parameters. (ii) We explore the theoretical properties associated with finite L/m and K. (iii) We present simulation results with various parameters L/m and K and compare the performances of ℓ₁ minimization, conventional Bagging, and Bolasso [11], another modern technique that incorporates Bagging into sparse recovery. An important discovery is that in challenging cases with small m, Bagging with a ratio L/m that is smaller than the conventional ratio 1 can lead to better performance.
II. PROPOSED METHOD: BAGGING IN SPARSE REGRESSION
Our proposed method is sparse recovery using a generalized Bagging procedure. It is accomplished in three steps. First, we generate K bootstrap samples, each of size L, randomly sampled uniformly and independently with replacement from the original m data points. This results in K pairs of measurements and sensing matrices: {y[I₁], A[I₁]}, {y[I₂], A[I₂]}, ..., {y[I_K], A[I_K]}. We use the notation (·)[I] on matrices or vectors to denote retaining only the rows supported on I and discarding all other rows in the complement I^c. Second, we solve the sparse recovery problem independently on each of those pairs; mathematically, for all j = 1, 2, ..., K, we find

x_j^B = argmin_{x ∈ R^n} λ_{(L,K)}‖x‖₁ + ½‖y[I_j] − A[I_j]x‖₂²,    (2)

where λ_{(L,K)} is the parameter balancing the least-squares fit and the sparsity penalty for the parameter choice (L, K) in Bagging. The problem in (2) is a Lasso problem, and numerous optimization methods can be used to solve it, such as [12]–[15].

Finally, the Bagging solution is obtained by averaging all K estimators obtained from solving (2):

Bagging: x^B = (1/K) Σ_{j=1}^{K} x_j^B.    (3)

Compared to the ℓ₁ minimization solution obtained using all the measurements, the bagged solution x^B is obtained by resampling without increasing the number of original measurements. We will show that in some cases, the bagged solution outperforms the base ℓ₁ minimization solution.
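As a concrete illustration, the following Python sketch implements the three steps above on a toy problem. It is a minimal sketch rather than the implementation used in our experiments: the experiments in Section V use an ADMM implementation of Lasso, while here scikit-learn's coordinate-descent Lasso is substituted for convenience, and the sampling ratio, number of estimates, and regularization weight are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

def bagged_lasso(A, y, ratio=0.7, K=30, lam=0.01, seed=0):
    """Generalized Bagging for sparse regression (cf. Eqs. (2)-(3)).

    ratio = L/m is the bootstrap sampling ratio, K is the number of
    bootstrapped estimates, and lam plays the role of lambda_(L,K).
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    L = max(1, int(round(ratio * m)))
    estimates = np.zeros((K, n))
    for j in range(K):
        # Step 1: draw a bootstrap sample I_j of size L with replacement.
        idx = rng.integers(0, m, size=L)
        # Step 2: solve the Lasso problem on the pair (y[I_j], A[I_j]).
        solver = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
        solver.fit(A[idx, :], y[idx])
        estimates[j] = solver.coef_
    # Step 3: average the K bootstrapped estimators (Eq. (3)).
    return estimates.mean(axis=0)
```

In practice one would sweep ratio, K, and lam and keep the best setting, mirroring the parameter sweep described in Section V.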
III. PRELIMINARIES

We summarize the results from CS theory that we need in order to analyze our algorithm mathematically. We introduce the Null Space Property (NSP) as well as the Restricted Isometry Property (RIP). We also provide the tail bound for the sum of i.i.d. bounded random variables, which is needed to prove our theorems.
A. Null Space Property (NSP)
The NSP [16] for standard sparse recovery characterizes the necessary and sufficient condition for successful sparse recovery using ℓ₁ minimization.

Theorem 1 (NSP). Every s-sparse signal x ∈ R^n is the unique solution to P₁ : min ‖x‖₁ s.t. y = Ax if and only if A satisfies the NSP of order s; namely, for all v ∈ Null(A)\{0} and for every index set S ⊂ {1, 2, ..., n} of cardinality card(S) ≤ s, the following is satisfied:

‖v[S]‖₁ < ‖v[S^c]‖₁,

where v[S] retains the entries of v on the index set S and is zero elsewhere.

B. Restricted Isometry Property (RIP)

Although the NSP directly characterizes the success of sparse recovery, checking the NSP condition is computationally intractable. It is also not suitable for quantifying performance in noisy conditions, since it is a binary (true or false) metric rather than a continuous measure. The Restricted Isometry Property (RIP) [3] is introduced to overcome these difficulties.
Definition 2 (RIP). A matrix A with ℓ₂-normalized columns satisfies the RIP of order s if there exists a constant δ_s(A) ∈ [0, 1) such that for every s-sparse v ∈ R^n, the following is satisfied:

(1 − δ_s(A))‖v‖₂² ≤ ‖Av‖₂² ≤ (1 + δ_s(A))‖v‖₂².    (4)
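Computing δ_s(A) exactly is combinatorial, but a crude empirical check of the inequality in (4) is easy to sketch. The snippet below, a hedged illustration rather than anything used in this paper, samples random s-sparse vectors and reports the largest observed deviation of ‖Av‖₂²/‖v‖₂² from 1, which only lower-bounds the true RIP constant.

```python
import numpy as np

def empirical_rip_deviation(A, s, trials=2000, seed=0):
    """Monte Carlo lower bound on the RIP constant delta_s(A).

    Samples random s-sparse vectors v and measures how far
    ||Av||_2^2 / ||v||_2^2 deviates from 1 (cf. Definition 2).
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    worst = 0.0
    for _ in range(trials):
        support = rng.choice(n, size=s, replace=False)
        v = np.zeros(n)
        v[support] = rng.standard_normal(s)
        ratio = np.linalg.norm(A @ v) ** 2 / np.linalg.norm(v) ** 2
        worst = max(worst, abs(ratio - 1.0))
    return worst  # a lower bound on delta_s(A)

# Example: a column-normalized Gaussian matrix.
A = np.random.default_rng(1).standard_normal((100, 200))
A /= np.linalg.norm(A, axis=0)
print(empirical_rip_deviation(A, s=5))
```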
C. Noisy Recovery Bounds Based on RIP Constants

It is known that satisfying the RIP condition implies that the NSP condition is also satisfied for sparse recovery [3]. More specifically, if the RIP constant of order 2s is strictly less than √2 − 1, then the NSP of order s is satisfied. We recall Theorem 1.2 in [3], where the noisy recovery performance of ℓ₁ minimization is bounded in terms of the RIP constant. This error bound is associated with the s-sparse approximation error and the noise level.

Theorem 3 (Noisy recovery for ℓ₁ minimization [3]). Let y = Ax⋆ + z with ‖z‖₂ ≤ ε, and let x_s be the s-sparse vector that minimizes ‖x − x⋆‖₁ over all s-sparse signals. If δ_{2s}(A) ≤ δ < √2 − 1 and x^{ℓ₁} is the solution of ℓ₁ minimization, then it obeys

‖x^{ℓ₁} − x⋆‖₂ ≤ C₀(δ) s^{−1/2} ‖x_s − x⋆‖₁ + C₁(δ) ε,

where C₀(·), C₁(·) are constants determined by the RIP constant δ_{2s}. These two constants take the form C₀(δ) = 2(1 − (1 − √2)δ)/(1 − (1 + √2)δ) and C₁(δ) = 4√(1 + δ)/(1 − (1 + √2)δ).
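To get a feel for the magnitudes involved, the short computation below evaluates the two constants of Theorem 3 at a few admissible values of δ. The closed forms are taken as stated above (from [3]), so the printed numbers should be read as indicative.

```python
import numpy as np

def c0(delta):
    # C0(delta) = 2(1 - (1 - sqrt(2)) delta) / (1 - (1 + sqrt(2)) delta)
    return 2 * (1 - (1 - np.sqrt(2)) * delta) / (1 - (1 + np.sqrt(2)) * delta)

def c1(delta):
    # C1(delta) = 4 sqrt(1 + delta) / (1 - (1 + sqrt(2)) delta)
    return 4 * np.sqrt(1 + delta) / (1 - (1 + np.sqrt(2)) * delta)

for delta in (0.1, 0.2, 0.3):  # all below sqrt(2) - 1, roughly 0.414
    print(f"delta={delta:.1f}  C0={c0(delta):.2f}  C1={c1(delta):.2f}")
```

Both constants grow as δ approaches √2 − 1, which is why well-conditioned (small-δ) sensing matrices give tighter recovery guarantees.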
D. Tail Bound of the Sum of i.i.d. Bounded Random Variables

This exponential bound is similar in structure to Hoeffding's inequality. Proving it requires working with the moment generating function of a random variable.

Lemma 4. Let Y₁, Y₂, ..., Y_n be i.i.d. observations of a bounded random variable Y with a ≤ Y ≤ b, whose expectation E Y exists. Then for any ξ > E Y,

P{ Σ_{i=1}^{n} Y_i ≥ nξ } ≤ exp{ −2n(ξ − E Y)² / (b − a)² }.    (5)
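As a quick sanity check (not part of the paper's development), one can compare the empirical tail probability of a bounded sum against the right-hand side of (5); the uniform distribution, threshold, and sample sizes below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 200, 20_000
a, b, mean = 0.0, 1.0, 0.5          # Y ~ Uniform(0, 1), so E[Y] = 0.5
xi = 0.55                           # threshold xi > E[Y]

# Empirical P{ sum_i Y_i >= n * xi } over many repetitions.
sums = rng.uniform(a, b, size=(trials, n)).sum(axis=1)
empirical = np.mean(sums >= n * xi)

# Right-hand side of (5): exp(-2 n (xi - E[Y])^2 / (b - a)^2).
bound = np.exp(-2 * n * (xi - mean) ** 2 / (b - a) ** 2)
print(empirical, bound)             # the empirical tail should not exceed the bound
```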
IV. THEORETICAL RESULTS FOR BAGGING ASSOCIATED WITH SAMPLING RATIO L/m AND THE NUMBER OF ESTIMATES K

A. Noisy Recovery for Employing Bagging in Sparse Regression
We derive the performance bound for employing Bagging in sparse regression, in which the final estimate is the average of multiple estimates solved individually from bootstrap samples. We give theoretical results for the case in which the true signal x⋆ is exactly s-sparse, and for the general case with no assumption on the sparsity level of the ground-truth signal. Note that the theorems are stated for a deterministic sensing matrix, measurements, and noise A, y, z, in which all vector norms are equivalent.

Theorem 5 (Bagging: error bound for ‖x⋆‖₀ = s). Let y = Ax⋆ + z with ‖z‖₂ < ∞. Assume that, for the multi-sets {I_j} generating the set of sensing matrices A[I₁], A[I₂], ..., A[I_K], there exists a constant δ_{(L,K)}, which relates to L and K, such that for all j ∈ {1, 2, ..., K}, δ_{2s}(A[I_j]) ≤ δ_{(L,K)} < √2 − 1. Let x^B be the solution of Bagging. Then for any τ > 0, x^B satisfies

P{ ‖x^B − x⋆‖₂ ≤ C₁(δ_{(L,K)}) (√(L/m)‖z‖₂ + τ) } ≥ 1 − exp( −2Kτ⁴ / (L²‖z‖∞⁴) ).

We also study the behavior of Bagging for a general signal x⋆ with ‖x⋆‖₀ ≥ s, in which the performance involves the s-sparse approximation error. We use the vector e to denote this error, e = x⋆ − x_s, where x_s is the best s-sparse approximation of the ground-truth signal over all s-sparse signals.

Theorem 6 (Bagging: error bound for general signal recovery). Let y = Ax⋆ + z with ‖z‖₂ < ∞. Assume that, for the multi-sets {I_j} generating the set of sensing matrices A[I₁], A[I₂], ..., A[I_K], there exists δ_{(L,K)} such that for all j ∈ {1, 2, ..., K}, δ_{2s}(A[I_j]) ≤ δ_{(L,K)} < √2 − 1. Let x^B be the solution of Bagging. Then for any τ > 0, x^B satisfies

P{ ‖x^B − x⋆‖₂ ≤ C₀(δ_{(L,K)}) s^{−1/2}‖e‖₁ + C₁(δ_{(L,K)}) (√(L/m)‖z‖₂ + τ) } ≥ 1 − exp( −2K C₁⁴(δ_{(L,K)}) τ⁴ / (b′)² ),

where b′ = ( C₀(δ_{(L,K)}) s^{−1/2}‖e‖₁ + C₁(δ_{(L,K)}) √L ‖z‖∞ )².

Theorem 6 gives the performance bound for Bagging in sparse signal recovery without the s-sparse assumption, and it reduces to Theorem 5 when the s-sparse approximation error is zero, ‖e‖₁ = 0.

We give a proof sketch that demonstrates the key ideas used to prove both Theorem 5 and Theorem 6. The main tools are Theorem 3 and Lemma 4. Some special treatment is required to handle the additional terms while proving Theorem 6. For more technical details, full proofs can be found in [17].
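As a consistency check (using the forms stated above), setting the s-sparse approximation error to zero collapses the bound of Theorem 6 to that of Theorem 5:

```latex
\|e\|_1 = 0 \;\Rightarrow\; b' = \big(C_1(\delta_{(L,K)})\sqrt{L}\,\|z\|_\infty\big)^2,
\qquad
\exp\!\Big(-\tfrac{2K\,C_1^4(\delta_{(L,K)})\,\tau^4}{(b')^2}\Big)
 = \exp\!\Big(-\tfrac{2K\tau^4}{L^2\|z\|_\infty^4}\Big),
```

which is exactly the failure probability appearing in Theorem 5, while the error bound loses its C₀(δ_{(L,K)}) s^{−1/2}‖e‖₁ term.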
Proof Sketch: Similar to the sufficient condition in Theorem 3, the sufficient condition for analyzing Bagging is that all matrices resulting from the bootstrap have well-behaved RIP constants of order 2s, bounded by a universal constant δ.

Let I denote a generic multi-set containing L elements, where each element of I is independent and identically distributed according to the discrete uniform distribution on the sample space {1, 2, ..., m}. Define the squared-error function f(x(I)) = ‖x(I) − x⋆‖₂², where x(I) is the solution of ℓ₁ minimization on I: x(I) = argmin ‖x‖₁ s.t. ‖y[I] − A[I]x‖₂ ≤ ε_I. The squared errors of the K bootstrapped estimators, f(x_j) = ‖x_j^B − x⋆‖₂², j = 1, 2, ..., K, are realizations generated i.i.d. from the distribution of f(x(I)).

We proceed using Lemma 4. We choose the upper bound of the error to be a function of the expected noise power, picking the bound ξ in terms of the root of the expected squared noise energy, √(E‖z[I]‖₂²) = √(L/m)‖z‖₂. We then need the upper bound b and the lower bound a of the random variable f(x(I)). Since it is non-negative, we choose a = 0. The upper bound b is obtained from Theorem 3, where the maximum value ‖z‖∞ is employed to further upper bound the noise level ‖z[I_j]‖₂. Through this process, we obtain the inequality P{ Σ_j ‖x_j^B − x⋆‖₂² − Kξ ≤ 0 } ≥ g(E f(x), b, a), for some function g.

The Bagging solution is the average of all bootstrapped estimators. The key chain of inequalities to establish is:

P{ ‖x^B − x⋆‖₂² − ξ ≤ 0 }
 = P{ K‖x^B − x⋆‖₂² − Σ_j f(x_j) + Σ_j f(x_j) − Kξ ≤ 0 }
 ≥ P{ K‖x^B − x⋆‖₂² − Σ_j f(x_j) ≤ 0, Σ_j f(x_j) − Kξ ≤ 0 }
 = P{ K‖x^B − x⋆‖₂² − Σ_j f(x_j) ≤ 0 } · P{ Σ_j f(x_j) − Kξ ≤ 0 }
 = P{ Σ_j ‖x_j^B − x⋆‖₂² − Kξ ≤ 0 }.

The first event is independent of the second, and it holds with probability 1 by Jensen's inequality. This establishes the relationship between the error bound of the Bagging solution and the sum of squared errors of the bootstrapped estimates. To obtain the bound for the second term, we follow the method described in the previous paragraph.
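The averaging step in the sketch rests on Jensen's inequality: the squared error of the averaged estimate never exceeds the average of the individual squared errors. A small numerical check of this step, with arbitrary random vectors standing in for the bootstrapped estimates, is below.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 30, 200
x_star = rng.standard_normal(n)                    # stand-in for the ground truth
estimates = x_star + rng.standard_normal((K, n))   # stand-ins for the x_j^B

x_bag = estimates.mean(axis=0)                     # Bagging average, Eq. (3)
lhs = np.linalg.norm(x_bag - x_star) ** 2
rhs = np.mean(np.linalg.norm(estimates - x_star, axis=1) ** 2)
assert lhs <= rhs + 1e-12                          # Jensen: ||mean||^2 <= mean of ||.||^2
print(lhs, rhs)
```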
B. Parameter Selection Guided by the Theoretical Analysis

Besides providing error bounds for general signals whose sparsity level might exceed s, Theorem 6 can be used to analyze cases in which m is not large enough for the sparsity level s. Theorems 5 and 6 also guide us toward optimal choices of the parameters: the bootstrap sampling ratio L/m and the number of estimates K.

Both Theorem 5 and Theorem 6 show that increasing the number of estimates K improves the result by increasing the lower bound on the certainty of the same performance. The growth rate of this certainty bound decreases with K. We validate this in our numerical experiments: even though increasing K improves the results, the performance tends to flatten out for large K.

The sampling ratio L/m affects the result through two factors. The first is the RIP constant, which in general decreases with increasing L (proved in [18] under a Gaussian assumption on the sensing matrix). Since C₁(δ) is a non-decreasing function of δ and a larger L usually results in a smaller δ, a larger L in general results in a smaller C₁(δ). On the other hand, the second factor is the multiplier of the noise power term, √(L/m), which suggests a smaller L.

Combining these two factors indicates that the best L/m ratio lies somewhere between a small and a large value. In the experimental results, we demonstrate that when m is small, varying the bootstrap sampling ratio L/m creates peaks whose largest value is attained at L/m < 1. The first factor, which relates L to the RIP constant, dominates in the stable case (when m is sufficiently large), so that a larger L leads to better performance.

Fig. 1. Performance curves for Bagging with various sampling ratios L/m and numbers of estimates K, together with the best performance of Bolasso as well as ℓ₁ minimization. The purple lines highlight conventional Bagging with L/m = 1. In all cases, SNR = 0 dB and the number of measurements is (a) m = 50, (b) m = 75, (c) m = 100, (d) m = 150, from left to right. The grey circle highlights the peak of Bagging, and the grey area highlights the bootstrap ratio at the peak point.
V. SIMULATIONS
In this section, we perform sparse recovery on simulated data to study the performance of our algorithm. In our experiment, all entries of A ∈ R^{m×n} are i.i.d. samples from the standard normal distribution N(0, 1). The signal dimension is n = 200, and various numbers of measurements from m = 50 to m = 2000 are explored. For the ground-truth signals, the sparsity level is always s = 50, and the non-zero entries are sampled from the standard Gaussian distribution with their locations generated uniformly at random. For the noise process z, entries are sampled i.i.d. from N(0, σ²), with variance σ² = 10^{−SNR/10}‖Ax‖₂²/m, where SNR represents the Signal to Noise Ratio. We add white Gaussian noise to make SNR = 0 dB. All numerical realizations have finite values. We use the ADMM [12] implementation of Lasso to solve all sparse regression problems, in which the parameter λ_{(L,K)} balances the least-squares fit and the sparsity penalty for the case with (L, K) as parameters.

We study how the bootstrap sampling ratio L/m as well as the number of estimates K affects the result. In our experiment, we take several values of K, from K = 30 to K = 100, and vary L/m over a grid of values up to 1. We report the Signal to Noise Ratio (SNR) as the error measure for recovery, SNR(x̂, x⋆) = −10 log₁₀(‖x̂ − x⋆‖₂²/‖x⋆‖₂²), averaged over independent trials. For all algorithms, we evaluate λ_{(L,K)} over a grid of values and then select the optimal value that gives the maximum averaged SNR over all trials.
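For readers who want to reproduce a toy version of this setup, the sketch below generates data matching the description above (Gaussian A, s-sparse ground truth, noise scaled to a target measurement SNR) and scores a recovered vector. The exact noise-scaling convention is our reading of the setup, and bagged_lasso refers to the illustrative routine sketched in Section II rather than the ADMM solver used for the reported results.

```python
import numpy as np

def make_problem(m, n=200, s=50, snr_db=0.0, seed=0):
    """Gaussian sensing matrix, s-sparse ground truth, noise at a target SNR."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, n))
    x_star = np.zeros(n)
    support = rng.choice(n, size=s, replace=False)
    x_star[support] = rng.standard_normal(s)
    clean = A @ x_star
    # Scale the noise so that 10*log10(||Ax||^2 / ||z||^2) equals snr_db.
    noise = rng.standard_normal(m)
    noise *= np.linalg.norm(clean) / (np.linalg.norm(noise) * 10 ** (snr_db / 20))
    return A, clean + noise, x_star

def recovery_snr(x_hat, x_star):
    """Recovered SNR in dB: -10*log10(||x_hat - x_star||^2 / ||x_star||^2)."""
    return -10 * np.log10(np.linalg.norm(x_hat - x_star) ** 2
                          / np.linalg.norm(x_star) ** 2)

A, y, x_star = make_problem(m=75)
x_hat = bagged_lasso(A, y, ratio=0.7, K=30, lam=0.05)  # from the Section II sketch
print(recovery_snr(x_hat, x_star))
```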
A. Performance of Bagging, Bolasso, and ℓ₁ Minimization

Bagging and Bolasso with various parameters K, L, as well as ℓ₁ minimization, are studied. The results are plotted in Figure 1. The colored curves show the cases of Bagging with various numbers of estimates K. The intersections of the colored curves with the purple solid vertical lines at L/m = 1 illustrate conventional Bagging with a full bootstrap rate. The grey circle highlights the best performance and the grey area highlights the optimal bootstrap ratio L/m. The performance of ℓ₁ minimization is depicted by the black dashed lines, while the best Bolasso performance is plotted using light green dashed lines. In these figures, for each condition with a choice of L, K, the information available to the Bagging and Bolasso algorithms is identical, and ℓ₁ minimization always has access to all m measurements.

From Figure 1, we see that when m is small, Bagging can outperform ℓ₁ minimization, and as m decreases, the margin increases. The important observation is that when the number of measurements is low (m between s and 2s, i.e., 50–100, where s is the sparsity level), using a reduced bootstrap ratio L/m (below 1) allows Bagging to beat the conventional choice of the full ratio for all choices of K. Moreover, with a reduced ratio and a small K, our algorithm is already quite robust and outperforms ℓ₁ minimization by a large margin. When the number of measurements is moderate (m = 3s = 150), Bagging still beats the baseline; however, the optimal parameters here are the bootstrap ratio L/m = 1 and the number of estimates K = 100. In this case, the reduced bootstrap ratio does not bring any performance improvement. Increasing the level of measurements makes the base algorithm more stable, and the advantage of Bagging starts decaying.

We perform the same experiments with higher numbers of measurements m, and Table I reports the best performance for various schemes: ℓ₁ minimization, the original Bagging scheme with a full bootstrap ratio, Bagging, and Bolasso, all at SNR = 0 dB. For Bagging, the peak values are found among the different choices of the parameters K and L that we explored. We see that when the number of measurements m is small (50–100), Bagging outperforms ℓ₁ minimization. The reduced bootstrap rate also improves conventional Bagging, and the improvement is significant when m = 50. When m is moderate (125–200), choosing reduced rates does not improve the performance compared to conventional Bagging; Bagging still outperforms ℓ₁ minimization, with smaller margins than in the cases with small m. When m is large (≥ 500), Bagging starts losing its advantage over ℓ₁ minimization. Bolasso performs comparably to the other algorithms only in the easiest case, with an extremely large m (= 2000), where it slightly outperforms all other algorithms.

TABLE I
THE PERFORMANCE OF ℓ₁ MINIMIZATION AND THE BEST PERFORMANCE AMONG ALL CHOICES OF L AND K FOR THE BAGGING AND BOLASSO METHODS WITH VARIOUS TOTAL NUMBERS OF MEASUREMENTS m. SNR = 0 dB. ALL PERFORMANCES ARE MEASURED BY THE AVERAGED RECOVERED SNR (dB).

                               Small m            Moderate m               Large m     Very large m
The number of measurements m   50    75    100    125   150   175   200    500   1000  2000
ℓ₁ min.                        0.12  0.57  1.00   1.70  2.19  2.61  2.97

VI. CONCLUSION
We extend the conventional Bagging scheme in sparse recovery by treating the bootstrap sampling ratio L/m as an adjustable parameter, and we derive error bounds for the algorithm associated with L/m and the number of estimates K. Bagging is particularly powerful when the number of measurements m is small. Although this regime is notoriously difficult, both in terms of improving sparse recovery results and of obtaining tight theoretical bounds, Bagging outperforms ℓ₁ minimization by a large margin (up to 367%). Moreover, the reduced sampling rate yields a further improvement, measured by the recovered SNR, over the conventional Bagging algorithm.

Our Bagging scheme achieves acceptable performance even with a small L/m and a relatively small K (K = 30 in our experimental study). The error bounds for Bagging predict that a smaller sampling rate L/m can lead to a performance improvement and that increasing K improves the certainty of the bound; both predictions are validated in our numerical simulations. For a sequential system, a reasonably large K is enough to obtain a fairly good solution. For a parallel system that allows a large number of processes to be run at the same time, a large K is preferred, since it in general gives a better result.

VII. ACKNOWLEDGEMENT
We would like to thank Dr. Dror Baron for insightful comments and suggestions, Dr. Cindy Rush for thoughtful feedback, and Nicholas Huang for his efforts in helping polish the manuscript, all of which improved the overall quality of our paper.
REFERENCES

[1] S. Chen, D. L. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 2001.
[2] R. Tibshirani. Regression shrinkage and selection via the Lasso. J. of the Royal Stat. Society, Series B, pages 267–288, 1996.
[3] E. J. Candes. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique, 346(9):589–592, 2008.
[4] E. J. Candes, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. on Info. Theory, 52(2):489–509, 2006.
[5] D. L. Donoho. Compressed sensing. IEEE Trans. on Info. Theory, 52(4):1289–1306, 2006.
[6] E. Candes and J. Romberg. Sparsity and incoherence in compressive sampling. Inverse Problems, 23(3):969, 2007.
[7] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[8] P. Hall and R. J. Samworth. Properties of bagged nearest neighbour classifiers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(3):363–379, 2005.
[9] M. Sabzevari, G. Martinez-Munoz, and A. Suarez. Improving the robustness of bagging with reduced sampling size. European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, 2014.
[10] F. Zaman and H. Hirose. Effect of subsampling rate on subbagging and related ensembles of stable classifiers. In International Conference on Pattern Recognition and Machine Intelligence, pages 44–49. Springer, 2009.
[11] F. R. Bach. Bolasso: model consistent lasso estimation through the bootstrap. In Proceedings of the 25th Int. Conf. on Machine Learning, pages 33–40. ACM, 2008.
[12] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[13] E. van den Berg and M. P. Friedlander. Probing the Pareto frontier for basis pursuit solutions. SIAM J. on Scientific Computing, 31(2):890–912, 2008.
[14] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Trans. on Sig. Proc., 57(7):2479–2493, 2009.
[15] J. Liu, S. Ji, and J. Ye. SLEP: Sparse Learning with Efficient Projections. Arizona State University, 2009.
[16] A. Cohen, W. Dahmen, and R. DeVore. Compressed sensing and best k-term approximation. Journal of the American Mathematical Society, 22(1):211–231, 2009.
[17] L. Liu, S. P. Chin, and T. D. Tran. JOBS: Joint-sparse optimization from bootstrap samples. arXiv preprint arXiv:1810.03743, 2018.
[18] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin. A simple proof of the restricted isometry property for random matrices. Constructive Approximation, 28(3):253–263, 2008.