Feature Selection for Huge Data via Minipatch Learning
Tianyi Yao (a) and Genevera I. Allen (b,c)

(a) Department of Statistics, Rice University, Houston, TX
(b) Departments of Electrical and Computer Engineering, Statistics, and Computer Science, Rice University, Houston, TX
(c) Jan and Dan Duncan Neurological Research Institute, Baylor College of Medicine, Houston, TX

Abstract
Feature selection often leads to increased model interpretability, faster computation, and improved model performance by discarding irrelevant or redundant features. While feature selection is a well-studied problem with many widely-used techniques, there are typically two key challenges: i) many existing approaches become computationally intractable in huge-data settings with millions of observations and features; and ii) the statistical accuracy of selected features degrades in high-noise, high-correlation settings, thus hindering reliable model interpretation. We tackle these problems by proposing Stable Minipatch Selection (STAMPS) and Adaptive STAMPS (AdaSTAMPS). These are meta-algorithms that build ensembles of selection events of base feature selectors trained on many tiny, (adaptively-chosen) random subsets of both the observations and features of the data, which we call minipatches. Our approaches are general and can be employed with a variety of existing feature selection strategies and machine learning techniques. In addition, we provide theoretical insights on STAMPS and empirically demonstrate that our approaches, especially AdaSTAMPS, dominate competing methods in terms of feature selection accuracy and computational time.
Feature selection is critical in statistical machine learning for improving the interpretability of machine learning models. While feature selection has been used in a wide variety of applications, there are typically two key challenges: i) many existing techniques can quickly become computationally intractable in huge-data settings on the order of millions of observations and features; and ii) the statistical accuracy of selected features degrades in the presence of noise and/or highly correlated features, thus hampering reliable model interpretation.

In this work, our primary goal is to develop a general feature selection framework that can select statistically accurate features in a computationally efficient manner, even with huge data in high-noise, high-correlation settings. We place great emphasis on selecting the correct features, which is a statistical accuracy problem, because doing so would considerably enhance the explainability of machine learning methods, enable practitioners to reliably interpret models, and provide insights into the underlying data-generating processes (e.g. candidate gene studies (Kohannim et al., 2012)). As stressed in Lipton (2018), interpretability of models is critical in many machine learning applications. Furthermore, we seek to design our feature selection framework to be conducive to efficient computation so that it can be applied in huge-data settings with millions of features and observations. Data of such scale is regularly observed in online marketing, genetics, neuroscience, and text mining, among many other areas of research.

Popular feature selection methods can be divided into three categories: filter, wrapper, and embedded methods (Guyon and Elisseeff, 2003). Filter methods select features based solely on observed characteristics of the available data, independently of a learning algorithm (Jovic et al., 2015). While filter methods are computationally fast and might work well for prediction by selecting features related to the output, they usually perform poorly in terms of the statistical accuracy needed for reliable model interpretation, because they tend to select sets of correlated, redundant features together.

Unlike filter methods, wrapper methods assess the quality of selected features using the performance of a predefined learning algorithm, and are therefore much slower computationally than filter methods (Jovic et al., 2015; Pirgazi et al., 2019). Despite generally better performance than filter methods, wrappers tend to perform suboptimally in selecting statistically accurate features because they are inherently greedy methods.

Another line of work focuses on embedded methods, which can be further divided into optimization-based methods and methods that select features based on properties of the learner. The Lasso (Tibshirani, 1996) is arguably one of the most widely-used optimization-based feature selection methods. The Lasso tends to do better on statistical accuracy of selected features and is model selection consistent under theoretical conditions including the Irrepresentable Condition (Zhao and Yu, 2006; Meinshausen and Bühlmann, 2006).
Yet, such theoretical conditions requiring low correlation among features are unlikely to hold in high-dimensional settings, in which features tend to become increasingly correlated. Additionally, choosing the right amount of regularization for accurate feature selection is challenging in practice (Meinshausen and Bühlmann, 2010), and fitting the Lasso with model selection procedures (e.g. cross-validation) can be computationally challenging in huge-data settings.

Last but not least, various tree-based algorithms such as Random Forest (RF) (Breiman, 2001) constitute a great proportion of the embedded methods that select features using properties of the learner. In particular, numerous importance scores such as the RF permutation importance are used to rank features based on their contributions to improving the predictive performance of the learner. However, previous studies have found that many such importance measures can be biased and even become unstable in high-dimensional, high-correlation settings (Strobl et al., 2007; Genuer et al., 2010; Nicodemus and Shugart, 2007), thus hindering reliable model interpretation. Additionally, features selected using properties of the learner are often sensitive to the choice of tuning hyperparameters, and systematic hyperparameter search poses great computational challenges in huge-data settings.

The idea of randomly subsampling observations and/or features for model training has appeared in many parts of machine learning, including various ensemble learning techniques (Breiman, 1996, 2001; Freund and Schapire, 1997; Louppe and Geurts, 2012; Gomes et al., 2019; LeJeune et al., 2020) and the dropout technique (Srivastava et al., 2014) combined with stochastic gradient methods (Hardt et al., 2016) in training deep neural networks. Additionally, there is another line of work focusing on using various data randomization techniques to select more "stable" features for high-dimensional linear regression (Bach, 2008; Meinshausen and Bühlmann, 2010; Wang et al., 2011; Shah and Samworth, 2013; Yu, 2013; Beinrucker et al., 2016; Staerk et al., 2019). Inspired by these approaches, we propose general and flexible strategies that exploit tiny, (adaptively-chosen) random subsets of both the observations and features of the data to considerably enhance feature selection accuracy and computational efficiency, hence improving model interpretability.

Contributions
We summarize our contributions as follows: We develop general feature selection methods named STAMPS and AdaSTAMPS that leverage random data perturbation as well as adaptive feature sampling schemes to select statistically accurate features with dramatically improved robustness and runtime (see Sec. 2). Our proposed approaches are meta-algorithms that can be employed with a variety of existing feature selection strategies and machine learning techniques in practice. In addition, we theoretically show that STAMPS achieves familywise error rate (FWER) control under certain conditions. We empirically demonstrate the effectiveness of our approaches on multiple synthetic and real data sets in Sec. 3, showing that our frameworks, especially AdaSTAMPS, dominate competing methods in terms of both feature selection accuracy and computational time.
People have used random subsets of observations and features for model training in many parts of machine learning (see Sec. 1). Notably, in the ensemble learning literature some have called these "random patches" (Louppe and Geurts, 2012; Gomes et al., 2019). We call tiny, (adaptively-chosen) random subsets of both observations and features "minipatches" to denote the connection with minibatches as well as patches in image processing, and also to denote that these are typically very tiny subsets. Using tiny minipatches can yield major computational advantages for huge data. Formally, given a pair of response vector y ∈ R^N and data matrix X ∈ R^{N×M} that consists of N observations each having M features, a minipatch can be obtained by simultaneously subsampling n rows (observations) and m columns (features) without replacement using some form of randomization, with n ≪ N and m ≪ M.
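To make this construction concrete, the following is a minimal sketch of minipatch extraction in NumPy (our own illustration, not the authors' released code; the helper name `sample_minipatch` is ours):

```python
import numpy as np

def sample_minipatch(y, X, n, m, rng):
    """Subsample n observations and m features uniformly at random
    without replacement, returning the minipatch (y_I, X_{I,F})
    together with the index sets I and F."""
    N, M = X.shape
    I = rng.choice(N, size=n, replace=False)  # observation indices I_k
    F = rng.choice(M, size=m, replace=False)  # feature indices F_k
    return y[I], X[np.ix_(I, F)], I, F

# Example: one tiny minipatch from a large data matrix (n << N, m << M).
rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 5000))
y = rng.standard_normal(10000)
y_mp, X_mp, I, F = sample_minipatch(y, X, n=100, m=50, rng=rng)
print(X_mp.shape)  # (100, 50)
```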
Inspired by stability selection (Meinshausen and Bühlmann, 2010) and its extensions (Shah and Samworth, 2013; Beinrucker et al., 2016) that keep "stable" features based on selection frequencies, we propose a general and flexible approach to stable feature selection named Stable Minipatch Selection (STAMPS). STAMPS is a meta-algorithm that can be employed with a variety of existing feature selection strategies and machine learning techniques.

Our proposed STAMPS method is summarized in Algorithm 1. Here, y_{I_k} denotes the subvector of y containing its elements indexed by I_k. Similarly, X_{I_k,F_k} denotes the submatrix of X containing its rows indexed by I_k and its columns indexed by F_k. For brevity, [M] denotes the set {1, 2, ..., M}.

Specifically, STAMPS fits arbitrary base feature selectors to many tiny, random minipatches and calculates feature selection frequencies by taking an ensemble of estimated feature supports over these minipatches. We define the selection frequency of the j-th feature at the k-th iteration, Π̂_j^(k), to be the number of times it is sampled and then selected by the base feature selectors divided by the number of times it is sampled into minipatches after k iterations. STAMPS finally outputs a set of stable features Ŝ_stable whose selection frequencies are above a user-specific threshold π_thr ∈ (0, 1); the choice of π_thr is discussed in Sec. 2.5.

Algorithm 1: STAMPS
Input: y ∈ R^N, X ∈ R^{N×M}, n, m, π_thr ∈ (0, 1).
Initialize: Π̂_j^(0) = 0 for all j ∈ [M];
while stopping criterion not met do
  1) Sample a minipatch: subsample n observations I_k ⊂ [N] and m features F_k ⊂ [M] uniformly at random without replacement to get a minipatch (y_{I_k}, X_{I_k,F_k}) ∈ R^n × R^{n×m};
  2) Fit a base selector to the minipatch (y_{I_k}, X_{I_k,F_k}) to obtain an estimated feature support Ŝ_k ⊆ F_k;
  3) Ensemble the feature supports {Ŝ_i}_{i=1}^k to update the selection frequencies for all j ∈ [M]:
     Π̂_j^(k) = Σ_{i=1}^k 1(j ∈ F_i, j ∈ Ŝ_i) / max(1, Σ_{i=1}^k 1(j ∈ F_i));
end
Output: Ŝ_stable = { j ∈ [M] : Π̂_j^(K) ≥ π_thr }.

To avoid running STAMPS indefinitely, one should employ some type of stopping criterion for Algorithm 1. While there are many possible choices, we use a simple one that is effective in our empirical studies: stop the algorithm if the rank ordering of the top min(max(|H|, τ_l), τ_u) features in terms of selection frequencies {Π̂_j^(k)}_{j=1}^M remains unchanged for the past 100 iterations, where H = {j : Π̂_j^(k) ≥ 0.5}. Here, τ_l and τ_u are fixed, user-specific parameters. We suggest setting both to well exceed the approximate number of true signal features, with τ_u being a multiple of τ_l.
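As a concrete illustration of Algorithm 1, here is a minimal sketch of the STAMPS loop, assuming the `sample_minipatch` helper sketched above; the lasso-based `base_selector` is a stand-in of our own (the experiments in Sec. 3 use thresholded OLS), and the stopping criterion is simplified to a fixed iteration budget:

```python
import numpy as np
from sklearn.linear_model import Lasso

def base_selector(y_mp, X_mp):
    # Placeholder base selector: keep features with nonzero lasso
    # coefficients on the minipatch (returns indices relative to F_k).
    coef = Lasso(alpha=0.1).fit(X_mp, y_mp).coef_
    return np.flatnonzero(coef)

def stamps(y, X, n, m, pi_thr=0.5, max_iter=2000, seed=0):
    rng = np.random.default_rng(seed)
    N, M = X.shape
    times_sampled = np.zeros(M)   # how often feature j entered a minipatch
    times_selected = np.zeros(M)  # how often feature j was then selected
    for k in range(max_iter):
        y_mp, X_mp, I, F = sample_minipatch(y, X, n, m, rng)
        S_hat = F[base_selector(y_mp, X_mp)]  # selected features, in [M]
        times_sampled[F] += 1
        times_selected[S_hat] += 1
        # A full implementation would check the rank-ordering stopping
        # criterion of Sec. 2.2 here instead of running max_iter rounds.
    pi_hat = times_selected / np.maximum(1, times_sampled)
    return np.flatnonzero(pi_hat >= pi_thr), pi_hat
```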
We propose to adaptively sample minipatches in ways that further boost computational efficiency as well as feature selection accuracy. Replacing the uniform feature sampling step of Algorithm 1 with any appropriate adaptive feature sampling scheme, we obtain our Adaptive STAMPS (AdaSTAMPS) framework for feature selection.

While there are many possible adaptive sampling techniques to choose from, we develop two types of example algorithms in this work. We describe an exploitation and exploration scheme inspired by the multi-armed bandit literature (Bouneffouf and Rish, 2019; Slivkins, 2019) here in Algorithm 2. For brevity, we use the shorthand AdaSTAMPS (EE) to denote the specific instance of AdaSTAMPS that employs Algorithm 2 as its adaptive feature sampling scheme. In addition, we give another, probabilistic adaptive feature sampling scheme in the Appendix, which we abbreviate as AdaSTAMPS (Prob). We use these two in our empirical studies as examples of what can be done with our flexible AdaSTAMPS approach; one can employ many other, more sophisticated sampling techniques in practice. We save the investigation of various other adaptive sampling schemes for future work.

In Algorithm 2, k is the iteration counter, E denotes the total number of burn-in epochs, {γ^(k)} is an increasing geometric sequence in the range [0.5, 1] for controlling the trade-off between exploitation and exploration of the input feature space, and π_active ∈ (0, 1) is a fixed, user-specific threshold for determining the active feature set. We found setting π_active = 0.1 works well in practice.
Algorithm 2: Adaptive Feature Sampling Scheme: Exploitation and Exploration (EE)
Input: k, M, m, E, {Π̂_j^(k−1)}_{j=1}^M, {γ^(k)} = geom_seq([0.5, 1]), π_active = 0.1.
Initialize: G = ⌈M/m⌉, J = {1, ..., M};
if k ≤ E · G then  // Burn-in stage
  if mod_G(k) = 1 then  // A new epoch
    1) Randomly reshuffle the feature index set J and partition it into disjoint sets {J_g}_{g=0}^{G−1};
  2) Set F_k = J_{mod_G(k)};
else  // Adaptive stage
  1) Update the active feature set: A = { j ∈ {1, ..., M} : Π̂_j^(k−1) ≥ π_active };
  2) Exploitation step: sample min(m, γ^(k)|A|) features F_{k,1} ⊆ A uniformly at random;
  3) Exploration step: sample (m − min(m, γ^(k)|A|)) features F_{k,2} ⊆ {1, ..., M} \ A uniformly at random;
  4) Set F_k = F_{k,1} ∪ F_{k,2};
end
Output: F_k.

During the initial burn-in stage of Algorithm 2, AdaSTAMPS (EE) explores the entire input feature space by ensuring that each feature is sampled into minipatches exactly E times, once per epoch. The selection frequency of each feature is continuously updated as feature selection is repeatedly conducted on each random minipatch. After E epochs, Algorithm 2 transitions to the adaptive stage, during which it exploits a subset of features that are likely to be true signal features while adaptively exploring the rest of the feature space. Specifically, features with selection frequencies above π_active are kept in the active feature set A, which is likely a superset of most of the true signal features. As k increases and the feature selection frequencies keep polarizing, A continues to shrink and hone in on the set of true signal features. To exploit the relatively stable features in A, Algorithm 2 samples an increasing proportion of A into each new minipatch by gradually increasing γ^(k). Meanwhile, Algorithm 2 continues to explore the rest of the feature space by drawing from {1, ..., M} \ A into the minipatch, because a tiny fraction of true signal features might still be outside of A due to noise and correlation among features. Intuitively, the ever-changing active feature set A and {γ^(k)} work together to adaptively strike a balance between exploitation and exploration of the input feature space, thus further enhancing computational efficiency and feature selection accuracy (e.g. Figure 1).
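The following sketch is our own, simplified rendering of Algorithm 2 (names such as `sample_features_ee` and the `state` dict carrying the burn-in partition are ours); it shows one draw of the feature set F_k:

```python
import numpy as np

def sample_features_ee(k, M, m, E, pi_hat, gamma_k, pi_active, rng, state):
    """One draw of F_k under the exploitation/exploration scheme.
    `state` is a dict holding the current burn-in partition of
    {0, ..., M-1} across calls; `gamma_k` is the current value of the
    increasing schedule {gamma^(k)}."""
    G = int(np.ceil(M / m))
    if k <= E * G:  # burn-in: every feature is sampled once per epoch
        if (k - 1) % G == 0:  # a new epoch: reshuffle and repartition
            state["blocks"] = np.array_split(rng.permutation(M), G)
        return state["blocks"][(k - 1) % G]
    # Adaptive stage: exploit the active set, explore its complement.
    active = np.flatnonzero(pi_hat >= pi_active)
    n_exploit = min(m, int(gamma_k * active.size))
    exploit = (rng.choice(active, size=n_exploit, replace=False)
               if n_exploit > 0 else np.empty(0, dtype=int))
    rest = np.setdiff1d(np.arange(M), active)
    # Assumes enough inactive features remain to fill the minipatch.
    explore = rng.choice(rest, size=m - n_exploit, replace=False)
    return np.concatenate([exploit, explore])
```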
Suppose we observe training data (y, X) ∈ R^N × R^{N×M} with N observations and M features. Let S ⊂ {1, ..., M} denote the true feature support set, which is the set of indices corresponding to the underlying true signal features, and let S^c = {1, ..., M} \ S be the complement set containing the indices of the noise features. Both S and S^c are typically unknown. One important goal of feature selection is to infer S from the observed data.

Following the notation of Algorithm 1, an arbitrary minipatch can be represented by the pair of row index set and feature index set (I_i, F_i). The estimated feature support on this arbitrary minipatch by an arbitrary base selector is denoted Ŝ_i ⊆ F_i. Here, we provide some insights into the theoretical properties of the STAMPS method. The proof is found in the Appendix.

Assumption (A1): Each feature j ∈ {1, ..., M} is sampled at least once when STAMPS stops after K iterations.

Assumption (A2): For a given noise feature j ∈ S^c, the indicator 1(j ∈ Ŝ_i) is constant across the collection of minipatches {(I_i, F_i)}_{i∈D} that contain this feature, where D = { i ∈ {1, 2, ..., K} : j ∈ F_i }.

Assumption (A3): P[j ∈ Ŝ_i] ≤ α π_thr / |S^c| for any noise feature j ∈ S^c on an arbitrary minipatch (I_i, F_i) that contains this feature.
Theorem 1: Assume that Assumptions (A1)-(A3) hold. Then STAMPS controls the familywise error rate (FWER) at level α: P[|Ŝ_stable ∩ S^c| ≥ 1] ≤ α.

A simple sampling strategy that satisfies Assumption (A1) is to sample non-overlapping blocks of features sequentially such that all features are sampled exactly once before proceeding with uniform random sampling. Assumption (A2) intuitively means that, for a given noise feature, the base selector's decision on whether to select this particular feature does not change across the minipatches that contain it. In the stability selection literature, Meinshausen and Bühlmann (2010) made use of a similar condition, named the exchangeability assumption, for false discovery rate (FDR) control. A simple data scenario that fulfills Assumption (A2) is when noise features are independent amongst themselves and are also independent of the true signal features as well as the response. In Assumption (A3), the probability P is taken with respect to the randomness of the minipatches and the randomness of the base selector if it is a randomized algorithm. Intuitively, Assumption (A3) requires the base selector to have a low probability of mistakenly selecting noise features on random minipatches. Putting everything together, under Assumptions (A1)-(A3), Theorem 1 guarantees that the probability of STAMPS mistakenly selecting at least one noise feature is at most α.

While some of these assumptions may seem stringent, this result is very generic because we make no distributional assumptions about the underlying data and consider no specific learner or feature selection algorithm. In future work, we will investigate whether these assumptions can be relaxed for particular combinations of learners and selectors. We would also like to consider possible extensions to other forms of error control such as the FDR.
Figure 1: Performance of our approaches on a challenging feature selection task in the regression setting with highly correlated features (as described in the Appendix). (A) Feature selection frequencies by STAMPS versus number of iterations; (B) feature selection frequencies by AdaSTAMPS (EE) versus number of iterations, with the green dotted line denoting the end of the burn-in stage; (C) feature selection frequencies by AdaSTAMPS (Prob) versus number of iterations. This reveals that well-designed adaptive feature sampling schemes help AdaSTAMPS correctly select all true signal features (red) while discarding all noise features (gray) within many fewer iterations, thus further boosting both feature selection accuracy and computational efficiency in practice.
Our STAMPS and AdaSTAMPS are general meta-algorithms with three tuning hyperparameters: the minipatch size (n and m) and the threshold π_thr. In general, our methods are quite robust to the choice of these hyperparameters. In our empirical studies in Sec. 3, we use a default threshold π_thr = 0.5, which generally works well in practice. We also propose a data-driven way to choose π_thr, detailed in the Appendix. For choosing the minipatch size, we have empirical studies in the Appendix investigating how feature selection accuracy and runtime vary with n and m. In general, we found that taking m to well exceed the number of true features (e.g. 3-10 times the number of true features) and then picking n relative to m so that it well exceeds the sample complexity of the feature selector used in step 2 of Algorithm 1 works well in practice. Our empirical results also reveal that our meta-algorithms have stable performance for any sensible range of n and m.

We also empirically show that using adaptive feature sampling not only increases computational efficiency but also seems to further improve feature selection accuracy in practice. In Figure 1, we display the selection frequencies of all features versus the number of iterations for STAMPS, AdaSTAMPS (EE), and AdaSTAMPS (Prob). We see that both AdaSTAMPS (EE) and AdaSTAMPS (Prob) correctly identify all true signal features while discarding all noise features within 1000 iterations. On the other hand, STAMPS without adaptive feature sampling converges much more slowly and is unable to clearly separate the selection frequencies of signal features from those of noise features. Details of this empirical study are given in the Appendix.

In this section, we evaluate our proposed STAMPS and AdaSTAMPS approaches on both synthetic and real-world data sets. Because feature selection is widely studied in the linear regression setting, and because it is easier to simulate synthetic data from linear models for comparisons with existing methods, we first focus on feature selection in the high-dimensional linear regression setting in Sec. 3.1. However, our proposed feature selection frameworks are not limited to linear or regression settings: we showcase the superior performance of our AdaSTAMPS method on real-world classification problems in Sec. 3.2.
We compare the feature selection accuracy as well as the computational time of our proposed methods with several widely-used competitors, including the Lasso (Tibshirani, 1996), Complementary Pairs Stability Selection (CPSS) (Shah and Samworth, 2013), Randomized Lasso (Meinshausen and Bühlmann, 2010), a univariate linear regression filter with False Discovery Rate (FDR) control via the Benjamini-Hochberg procedure, recursive feature elimination (RFE) (Guyon et al., 2002), and thresholded Ordinary Least Squares (OLS) (Giurcanu, 2016). Our focus is on high-dimensional cases with N ≪ M because these settings are arguably the most difficult for feature selection. In this section, we consider three high-dimensional scenarios with the following data matrices X ∈ R^{N×M}:
• Scenario 1 (S1), autoregressive Toeplitz design: the M = 10000-dimensional feature vector follows a N(0, Σ) distribution, where Σ_ij = ρ^|i−j| with ρ = 0.95; sample size N = 5000.
• Scenario 2 (S2), subset of The Cancer Genome Atlas (TCGA) data: this data set (N = 2834, M = 10000) contains the 450K methylation status of the top 10000 most variable CpGs for 2834 patients. This scenario is even more challenging than S1 due to the strong and complex correlation structure among CpGs. This data set was obtained via TCGA2STAT (Wan et al., 2015).
• Scenario 3 (S3), full-size TCGA data: this data set (N = 2834, M = 335897) contains the 450K methylation status of 335897 CpGs with complete records for 2834 patients.

Following standard simulation procedures (Meinshausen and Bühlmann, 2010), we create the true feature support set S by randomly sampling S ⊂ {1, ..., M} with cardinality |S| = 20. The M-dimensional sparse coefficient vector β is generated by setting β_j = 0 for all j ∉ S, while β_j is chosen independently and uniformly in [−4/b, −2/b] ∪ [2/b, 4/b] for all j ∈ S. Finally, the response vector y ∈ R^N is generated as y = Xβ + ε, where the noise vector (ε_1, ..., ε_N) is IID N(0, σ²); the scaling factor b is chosen to attain various signal-to-noise ratio (SNR) levels. For S3, the SNR is fixed at 5.
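For concreteness, here is a sketch of this generative process for S1 (our own code; we draw the AR(1) design recursively rather than factoring the M × M covariance, and we fold the 1/b scaling into the noise level, which targets the SNR equivalently):

```python
import numpy as np

def make_regression_s1(N=5000, M=10000, s=20, rho=0.95, snr=5.0, seed=0):
    """S1-style simulation: autoregressive Toeplitz design X ~ N(0, Sigma)
    with Sigma_ij = rho^|i-j|, a sparse beta supported on s random
    features, and y = X beta + eps with noise set to hit the target SNR."""
    rng = np.random.default_rng(seed)
    # AR(1) recursion yields stationary N(0,1) columns with
    # corr(X_i, X_j) = rho^|i-j|, matching the Toeplitz covariance.
    X = np.empty((N, M))
    X[:, 0] = rng.standard_normal(N)
    for j in range(1, M):
        X[:, j] = rho * X[:, j - 1] + np.sqrt(1 - rho**2) * rng.standard_normal(N)
    S = rng.choice(M, size=s, replace=False)          # true support
    beta = np.zeros(M)
    signs = rng.choice([-1.0, 1.0], size=s)
    beta[S] = signs * rng.uniform(2.0, 4.0, size=s)   # amplitudes in [2, 4]
    signal = X @ beta
    sigma = np.sqrt(signal.var() / snr)               # noise level for SNR
    y = signal + sigma * rng.standard_normal(N)
    return y, X, S, beta
```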
For each method, commonly-used data-driven model selection procedures are applied to choose tuning hyperparameters and produce a data-driven estimate of the true feature support. In addition, oracle model selection that assumes knowledge of the cardinality of the true support |S| is employed with each method, resulting in an oracle estimate of the true feature support. Detailed descriptions of these data-driven and oracle model selection procedures are given in the Appendix. We evaluate the feature selection accuracy of the various methods using the F1 Score, which takes values between 0 and 1, with 1 indicating a perfect match between the final estimated feature support and the true feature support S. For every method, we compute F1 Scores for both data-driven and oracle estimates.

We use Scikit-learn (Pedregosa et al., 2011) for all Lasso-based methods, the univariate filter, and RFE with a Ridge estimator. Due to a lack of existing software, we faithfully implemented the thresholded OLS as in Giurcanu (2016). For simplicity, our methods (i.e. STAMPS, AdaSTAMPS (EE), and AdaSTAMPS (Prob)) are employed with the same thresholded OLS as the base selector in step 2 of Algorithm 1, using default settings as detailed in the Appendix. All comparisons were run on a VM with 10 Intel (Cascade Lake) vCPUs and 220 GB of memory.

The feature selection accuracy and computational time of the various methods for scenarios S1 and S2 are shown in Figure 2. We see that our AdaSTAMPS consistently dominates all competing methods in terms of feature selection accuracy in both scenarios. While some competitors such as RFE and the Lasso achieve accuracy comparable to AdaSTAMPS with oracle model selection at some SNR levels (Figure 2B, 2E), their performance degrades drastically when their tuning hyperparameters are chosen via widely-used data-driven procedures (Figure 2A, 2D). On the other hand, the performance of AdaSTAMPS does not deteriorate as much across data-driven and oracle model selection. We also point out that our AdaSTAMPS can boost the accuracy of any existing feature selection strategy: AdaSTAMPS with thresholded OLS as the base selector outperforms thresholded OLS itself by a large margin, especially in low-SNR settings and in the more challenging scenario S2. Furthermore, AdaSTAMPS is computationally the fastest of any method with comparable accuracy, with runtime on the order of seconds to minutes as opposed to hours for many competing methods.

Figure 2: Feature selection accuracy (F1 Score) and runtime for simulation Scenario 1 (S1) and Scenario 2 (S2). (A, D) F1 Score for data-driven feature support estimates; (B, E) F1 Score for oracle feature support estimates; (C, F) computational time in seconds. The fastest runtime is reported for every method, with parallelism enabled for the methods (LassoCV 1SE, Lasso eBIC, CPSS, Rand. Lasso, RFE) wherever possible. Our AdaSTAMPS dominates all competitors in terms of feature selection accuracy in both scenarios. Moreover, the runtime of our AdaSTAMPS is the best of any method with comparable feature selection accuracy.

The results for the most challenging scenario S3 are summarized in Table 1. High-dimensional data of such scale is regularly observed in areas such as genetics and neuroimaging. We see that our AdaSTAMPS (EE) perfectly recovers the true feature support across both data-driven and oracle model selection. In addition, STAMPS and AdaSTAMPS (Prob) also perform favorably. On the other hand, all competing methods attain rather low feature selection accuracy even with oracle selection. Clearly, AdaSTAMPS (EE) is more capable of accurately identifying the true signal features even in high-correlation, high-dimensional settings, thus leading to more reliable model interpretation. In the meantime, AdaSTAMPS (EE) takes less than 20 minutes to run, rather than the 14 hours required for Randomized Lasso, which achieves the next best data-driven F1 Score among competing methods.

Overall, we empirically demonstrate the effectiveness of our proposed approaches in selecting statistically accurate features with much improved robustness and runtime, even in challenging high-noise, high-correlation scenarios. Even though STAMPS does not perform as well as AdaSTAMPS, STAMPS still performs better than many of the competing methods, especially in terms of the data-driven F1 Score (e.g. Figure 2A, 2D). In the meantime, AdaSTAMPS (both (EE) and (Prob)) consistently outperforms competing methods in terms of feature selection accuracy while being computationally the fastest of any method with comparable accuracy. This suggests that our flexible AdaSTAMPS approach could be a double win both in terms of statistical accuracy and computation in practice.

Table 1: Results of various methods for simulation Scenario 3 (S3). For each method, the number of selected features determined by data-driven procedures, the data-driven F1 Score, the oracle F1 Score, and the computational time in minutes are reported. The method with the best accuracy is bold-faced. Note that the fastest runtime is reported for every method, with parallelism enabled for the methods marked (*) wherever possible. Our AdaSTAMPS dominates all competing methods in terms of feature selection accuracy.
Method           # Selected   Data-driven F1   Oracle F1   Time (min)
AdaSTAMPS (EE)   20           1.000            1.000       < 20
Lasso eBIC (*)   46           0.091            0.100       61
CPSS (*)         81           0.238            0.500       716
Rand. Lasso (*)  9            0.276            0.450       838
Uni. Filter (*)  277261       0.000            0.050       < 1
Thres. OLS (*)   4            0.083            0.100       1

We now study the performance of our method when employed as a dimensionality-reduction preprocessing step for downstream classification tasks on several real-world data sets: CIFAR10 (Krizhevsky, 2009), Gisette (Guyon et al., 2004), and Fashion MNIST (Xiao et al., 2017). For CIFAR10, we consider two classes (truck, automobile) with N = 12000 images, each having M = 3072 features. Gisette is a binary classification data set (N = 7000, M = 5000). For Fashion MNIST, we consider three classes (T-shirt, coat, shirt) with N = 18000 and M = 784. We split each data set into 70% training and 30% test data via stratified sampling.

For every training data set, our AdaSTAMPS (EE) fits Random Forest (RF) classifiers as base selectors to many random minipatches and keeps the 10 features with the highest Gini importance from every minipatch. Features are then ranked based on their selection frequencies from AdaSTAMPS (EE), which are computed by taking an ensemble of selection results over these minipatches. For comparison, we also fit a RF classifier to the full training data and rank features by the RF Gini importance. Moreover, we obtain another set of feature ranks using RFE with RF as the base estimator on the full training data. As the true feature support is unknown in these real-world data, the performance of each feature selection method is measured by training a new downstream RF classifier using only its top-ranked J features and then computing classification accuracy on the test data. In all cases, we use the default RF classifier from Scikit-learn. Figure 3 shows the average test accuracy of the downstream RF classifiers over 5 random training/test splits versus the number of selected features used in training these downstream classifiers.
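A simplified sketch of this pipeline follows (our own illustration; for brevity it uses uniform rather than adaptive feature sampling, so it corresponds to a STAMPS-style ensemble with an RF base selector rather than full AdaSTAMPS (EE)):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_minipatch_ranking(X, y, n=500, m=100, n_iter=300, top=10, seed=0):
    """Rank features by minipatch selection frequency, using a Random
    Forest as the base selector and keeping the `top` features with
    the highest Gini importance on each minipatch."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    sampled = np.zeros(M)
    selected = np.zeros(M)
    for _ in range(n_iter):
        I = rng.choice(N, size=n, replace=False)
        F = rng.choice(M, size=m, replace=False)
        rf = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0)
        rf.fit(X[np.ix_(I, F)], y[I])
        keep = F[np.argsort(rf.feature_importances_)[-top:]]
        sampled[F] += 1
        selected[keep] += 1
    freq = selected / np.maximum(1, sampled)
    return np.argsort(freq)[::-1]  # feature indices, best first
```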
Figure 3: Results for the real-world classification data. The top features selected by each feature selection method are used to train new RF classifiers separately for downstream classification tasks. (A, B, C) Test accuracy of these downstream RF classifiers (averaged over random training/test splits) versus the number of selected top features used in training the downstream classifiers. Overall, the features selected by our AdaSTAMPS (EE) lead to better classification accuracy on the test sets, especially as the number of selected features increases.

From Figure 3, we see that AdaSTAMPS (EE) mostly outperforms the competitors: the features selected by AdaSTAMPS (EE) are generally more informative and eventually lead to better test accuracy on all three data sets. Similar to the observations in Sec. 3.1, our AdaSTAMPS framework can boost the feature selection performance of any existing selection strategy: AdaSTAMPS (EE) with RF as the inner feature selector outperforms RF itself, sometimes by a large margin as in Figure 3B.

We have developed general meta-algorithms, STAMPS and AdaSTAMPS, that practitioners can employ with a wide variety of existing learning algorithms and feature selection strategies to identify statistically accurate features in a computationally efficient manner, thus enhancing the interpretability of machine learning models. In future work, we look to theoretically analyze the more sophisticated AdaSTAMPS framework and to investigate distributed, memory-efficient implementations of our methods.
Acknowledgements
The authors acknowledge support from NSF DMS-1554821, NSF NeuroNex-1707400, and NIH 1R01GM140468.

Feature Selection for Huge Data via Minipatch Learning: Supplementary Materials

Tianyi Yao and Genevera I. Allen
A Probabilistic Adaptive Feature Sampling Scheme
In Algorithm 3, we give our probabilistic adaptive feature sampling scheme. Here, k is the iteration counter and E denotes the total number of burn-in epochs. We use the shorthand AdaSTAMPS (Prob) to denote the specific instance of AdaSTAMPS that employs Algorithm 3 as its adaptive feature sampling scheme.

Similar to our exploitation and exploration scheme (Algorithm 2 in the main paper), AdaSTAMPS (Prob) first explores the entire input feature space during the burn-in stage. After E burn-in epochs, Algorithm 3 transitions to the adaptive stage, during which it adaptively updates the sampling probabilities for all features. Specifically, features with higher selection frequencies, which are likely to be the true signal features, have relatively higher probabilities of being sampled into subsequent minipatches. However, since this is a probabilistic sampling scheme, Algorithm 3 still adaptively explores the remaining feature space, because many features with relatively lower selection frequencies still have non-zero probabilities of being sampled into minipatches.
Algorithm 3: Adaptive Feature Sampling Scheme: Probabilistic (Prob)
Input: k, M, m, E, {Π̂_j^(k−1)}_{j=1}^M.
Initialize: G = ⌈M/m⌉, J = {1, ..., M};
if k ≤ E · G then  // Burn-in stage
  if mod_G(k) = 1 then  // Start of a burn-in epoch
    1) Randomly reshuffle the feature index set J and partition it into disjoint sets {J_g}_{g=0}^{G−1};
  2) Set F_k = J_{mod_G(k)};
else  // Adaptive stage
  1) Update the feature sampling probabilities using information from previous iterations:
     p̂_j^(k) = Π̂_j^(k−1) / Σ_{j'=1}^M Π̂_{j'}^(k−1)  for j ∈ {1, ..., M};
  2) Sample m features F_k ⊂ {1, ..., M} without replacement according to the updated feature sampling probabilities {p̂_j^(k)}_{j=1}^M;
end
Output: F_k.
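A minimal sketch of the adaptive-stage draw in Algorithm 3 (our own code; the burn-in handling, which mirrors Algorithm 2, is omitted):

```python
import numpy as np

def sample_features_prob(pi_hat, m, rng):
    """One adaptive-stage draw of F_k under the probabilistic scheme:
    features are sampled without replacement with probabilities
    proportional to their current selection frequencies."""
    M = pi_hat.shape[0]
    p = pi_hat / pi_hat.sum()  # assumes at least m nonzero frequencies
    return rng.choice(M, size=m, replace=False, p=p)
```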
B Proof of Theorem 1

In this section, we present the detailed proof of Theorem 1. We restate our theoretical statements from the main paper here for convenience.
Assumption (A1): Each feature j ∈ {1, ..., M} is sampled at least once when STAMPS stops after K iterations.

Assumption (A2): For a given noise feature j ∈ S^c, the indicator 1(j ∈ Ŝ_i) is constant across the collection of minipatches {(I_i, F_i)}_{i∈D} that contain this feature, where D = { i ∈ {1, 2, ..., K} : j ∈ F_i }.

Assumption (A3): P[j ∈ Ŝ_i] ≤ α π_thr / |S^c| for any noise feature j ∈ S^c on an arbitrary minipatch (I_i, F_i) that contains this feature.

Theorem 1: Assume that Assumptions (A1)-(A3) hold. Then STAMPS controls the familywise error rate (FWER) at level α: P[|Ŝ_stable ∩ S^c| ≥ 1] ≤ α.
Proof: By Markov's inequality, we have

P[|Ŝ_stable ∩ S^c| ≥ 1] ≤ E[|Ŝ_stable ∩ S^c|]
                        = E[ Σ_{j ∈ S^c} 1(j ∈ Ŝ_stable) ]
                        = E[ Σ_{j ∈ S^c} 1(Π̂_j^(K) ≥ π_thr) ]
                        = Σ_{j ∈ S^c} P[Π̂_j^(K) ≥ π_thr].   (1)

By definition, Π̂_j^(K) ≥ 0 for all j = 1, 2, ..., M and π_thr ∈ (0, 1), so by Markov's inequality again, for any noise feature j ∈ S^c we have

P[Π̂_j^(K) ≥ π_thr] ≤ E[Π̂_j^(K)] / π_thr.   (2)

Now we need to evaluate E[Π̂_j^(K)]. When STAMPS stops after K iterations, the selection frequency of any noise feature j ∈ S^c is

Π̂_j^(K) = Σ_{i=1}^K 1(j ∈ F_i, j ∈ Ŝ_i) / max(1, Σ_{i=1}^K 1(j ∈ F_i))
        = Σ_{i=1}^K 1(j ∈ F_i, j ∈ Ŝ_i) / Σ_{i=1}^K 1(j ∈ F_i)   (by Assumption (A1)).

Then for any given noise feature j ∈ S^c, we have

E[Π̂_j^(K)] = E[ Σ_{i=1}^K 1(j ∈ F_i) 1(j ∈ Ŝ_i) / Σ_{i=1}^K 1(j ∈ F_i) ]
           = E[ Σ_{i: j ∈ F_i} 1(j ∈ Ŝ_i) / Σ_{i: j ∈ F_i} 1 ]
           = E[ 1(j ∈ Ŝ_i) · (Σ_{i: j ∈ F_i} 1) / (Σ_{i: j ∈ F_i} 1) ]   (by Assumption (A2))
           = E[1(j ∈ Ŝ_i)]
           = P[j ∈ Ŝ_i].   (3)

Here, i indexes an arbitrary minipatch (I_i, F_i) that contains the given j-th feature. The upper bound in (2) then becomes

P[Π̂_j^(K) ≥ π_thr] ≤ P[j ∈ Ŝ_i] / π_thr,   (4)

and combining with (1), we have

P[|Ŝ_stable ∩ S^c| ≥ 1] ≤ Σ_{j ∈ S^c} P[Π̂_j^(K) ≥ π_thr]
                        ≤ Σ_{j ∈ S^c} P[j ∈ Ŝ_i] / π_thr
                        ≤ Σ_{j ∈ S^c} α π_thr / (|S^c| π_thr)   (by Assumption (A3))
                        = α.   (5)
C More on Practical Considerations

C.1 Data-Driven Rule to Choose π_thr

While there are many possible data-driven approaches to choosing the user-specific threshold π_thr ∈ (0, 1), we give one such rule in Algorithm 4. Here, {Π̂_j^(K)}_{j=1}^M is the set of feature selection frequencies from the last iteration of STAMPS or AdaSTAMPS, and ŜD denotes the sample standard deviation. Intuitively, Algorithm 4 tries to find all gaps (local minima of the density) among {Π̂_j^(K)}_{j=1}^M in order to choose a threshold that separates features with relatively high selection frequencies from those with low selection frequencies. Algorithm 4 does so by first fitting a Gaussian kernel density estimator (KDE) to the selection frequencies and then setting π_thr to the smallest local minimum, if any.
Algorithm 4: KDE-Based Data-Driven Rule to Choose π_thr

Input: {Π̂_j^(K)}_{j=1}^M.
1) Fit a Gaussian KDE to {Π̂_j^(K)}_{j=1}^M:
   f̂_h(x) = (1/M) Σ_{j=1}^M exp( −(x − Π̂_j^(K))² / (2h²) ) for all x ∈ [0, 1],
   with bandwidth h = ŜD({Π̂_1^(K), Π̂_2^(K), ..., Π̂_M^(K)});
2) Use the method of inflection to find all local minima:
   D = { x ∈ [0, 1] s.t. sign(f̂_h(x) − f̂_h(x − ε)) = −1 and sign(f̂_h(x + ε) − f̂_h(x)) = 1 } for any ε > 0;
if |D| ≥ 1 then  // Found at least one local minimum
  Set π_thr = min(D);
else  // No local minimum is found
  Set π_thr = 0.5;
end
Output: π_thr.
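A sketch of Algorithm 4 in NumPy (our own implementation on a fixed grid; the grid resolution and the fallback value when no local minimum is found are assumptions):

```python
import numpy as np

def choose_pi_thr(pi_hat):
    """Fit a Gaussian KDE to the final selection frequencies with
    bandwidth h = sample standard deviation, then set pi_thr to the
    smallest local minimum of the density on [0, 1]."""
    h = np.std(pi_hat, ddof=1)
    grid = np.linspace(0.0, 1.0, 1000)
    # Unnormalized Gaussian KDE evaluated on the grid, as in Algorithm 4.
    dens = np.exp(-((grid[:, None] - pi_hat[None, :]) ** 2) / (2 * h**2)).mean(axis=1)
    d = np.diff(dens)
    # Local minima: density decreasing then increasing.
    minima = grid[1:-1][(d[:-1] < 0) & (d[1:] > 0)]
    return minima.min() if minima.size else 0.5  # fallback default
```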
C.2 Additional Empirical Studies to Investigate Effects of Minipatch Size

Here, we empirically investigate how the feature selection accuracy and runtime of our meta-algorithm vary with the minipatch size (i.e. n and m values). We take a subset of the TCGA data that contains the 450K methylation status of the top M = 2000 most variable CpGs for N = 2000 patients as the data matrix X ∈ R^{N×M}. Following similar procedures to Sec. 3.1 of the main paper, we create the true feature support S by randomly sampling S ⊂ {1, ..., M} with |S| = 10. The M-dimensional sparse coefficient vector β is generated by setting β_j = 0 for all j ∉ S, while β_j is chosen independently and uniformly in [−4/b, −2/b] ∪ [2/b, 4/b] for all j ∈ S. Finally, the response vector y ∈ R^N is generated as y = Xβ + ε, where the noise vector (ε_1, ..., ε_N) is IID N(0, σ²) and b is chosen such that the SNR is fixed at 5. This synthetic data set is also used for Figure 1 in the main paper.

To study how performance varies with the minipatch size (n and m), we run AdaSTAMPS (EE) using thresholded OLS as the inner base selector for various n and m values. Specifically, we take a sequence of m values that are integer multiples of |S| and then pick n relative to m over a grid of m/n ratios. Feature selection accuracy in terms of F1 Score and computational time are reported for the various n and m values in Figure C. We see that our method is quite robust to the choice of minipatch size, as it has stable feature selection performance for a sensible range of n and m. In general, we found that taking m to well exceed the number of true features (e.g. 3-10 times the number of true features |S|) and then picking n relative to m so that it well exceeds the sample complexity of the base feature selector used in the meta-algorithm works well in practice.

Figure C: We show how the feature selection accuracy and computational time of AdaSTAMPS (EE) vary with minipatch size in terms of m/|S| and m/n, where |S| is the cardinality of the true feature support. F1 Scores and computational time are averaged over replicates. (A) Feature selection accuracy in terms of F1 Score with data-driven π_thr; (B) feature selection accuracy in terms of F1 Score with oracle model selection; and (C) computational time on a log(seconds) scale. Note that gray cells indicate NAs that result from poor selection performance. We see that our method has stable performance for a sensible range of n and m.
D Data-Driven and Oracle Model Selection Procedures for Various Methods in Empirical Studies

In this section, we discuss details of the data-driven and oracle model selection procedures that we use for our empirical studies in Sec. 3.1 of the main paper.

For the Lasso, data-driven model selection approaches such as 10-fold cross-validation with the 1-standard-error (SE) rule (Hastie et al., 2009) and the extended BIC (eBIC) rule (Chen and Chen, 2008) are employed to determine the optimal regularization parameter, resulting in λ_1SE and λ_eBIC, respectively. For the univariate filter, the data-driven rule keeps all features that are significantly associated with the response, with the FDR controlled at the default level of 0.05. Additionally, 10-fold cross-validation is carried out for RFE to choose the best tuning hyperparameters, and the Bonferroni procedure (Giurcanu, 2016) is used with thresholded OLS to determine the selected features in a data-driven manner. For our methods, the user-specific threshold π_thr is set to 0.5. Following the implementation in Scikit-learn (Pedregosa et al., 2011), we use a geometric sequence Λ ∈ (0.001 λ_max, λ_max) as the set of candidate regularization parameters for all Lasso-based methods, where λ_max is the minimum amount of regularization such that all estimated coefficients become zero. |Λ| is set to the default 100 for simulation scenarios S1-S2 and to 10 for scenario S3. For AdaSTAMPS (EE), we also use default settings with E = 10 burn-in epochs and π_active = 0.1. For AdaSTAMPS (Prob), E is also set to 10.

Assuming knowledge of the number of true signal features |S|, oracle model selection is also applied to every method. Specifically, oracle model selection is applied to STAMPS, AdaSTAMPS (EE), AdaSTAMPS (Prob), CPSS, and Randomized Lasso by keeping the top |S| features with the highest selection frequencies. For the Lasso, oracle selection keeps the top |S| features with the largest coefficient-estimate magnitudes at λ_1SE and λ_eBIC, respectively. For the univariate filter, the top |S| features with the smallest p-values are kept, and for thresholded OLS the top |S| features with the largest OLS coefficient-estimate magnitudes are selected. Furthermore, RFE is carried out until exactly |S| features are left for oracle model selection.

References
F. R. Bach. Bolasso: Model consistent lasso estimation through the bootstrap. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 33-40, 2008.

A. Beinrucker, U. Dogan, and G. Blanchard. Extensions of stability selection using subsamples of observations and covariates. Statistics and Computing, 26(5):1059-1077, 2016.

D. Bouneffouf and I. Rish. A survey on practical applications of multi-armed and contextual bandits. arXiv preprint arXiv:1904.10040, 2019.

L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.

L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.

J. Chen and Z. Chen. Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3):759-771, 2008.

Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

R. Genuer, J.-M. Poggi, and C. Tuleau-Malot. Variable selection using random forests. Pattern Recognition Letters, 31(14):2225-2236, 2010.

M. Giurcanu. Thresholding least-squares inference in high-dimensional regression models. Electronic Journal of Statistics, 10(2):2124-2156, 2016.

H. M. Gomes, J. Read, and A. Bifet. Streaming random patches for evolving data stream classification. In 2019 IEEE International Conference on Data Mining (ICDM), pages 240-249, 2019.

I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157-1182, 2003.

I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3):389-422, 2002.

I. Guyon, S. Gunn, A. B. Hur, and G. Dror. Result analysis of the NIPS 2003 feature selection challenge. In Proceedings of the 17th International Conference on Neural Information Processing Systems, NIPS '04, pages 545-552, Cambridge, MA, USA, 2004. MIT Press.

M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In Proceedings of the 33rd International Conference on Machine Learning, ICML '16, pages 1225-1234. JMLR.org, 2016.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.

A. Jovic, K. Brkic, and N. Bogunovic. A review of feature selection methods with applications. In 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pages 1200-1205, 2015.

O. Kohannim, D. P. Hibar, J. L. Stein, N. Jahanshad, X. Hua, P. Rajagopalan, A. W. Toga, C. R. Jack Jr., M. W. Weiner, G. I. de Zubicaray, K. L. McMahon, N. K. Hansell, N. G. Martin, M. J. Wright, P. M. Thompson, and the Alzheimer's Disease Neuroimaging Initiative. Discovery and replication of gene influences on brain structure using lasso regression. Frontiers in Neuroscience, 6(115), 2012.

A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

D. LeJeune, H. Javadi, and R. Baraniuk. The implicit regularization of ordinary least squares ensembles. In AISTATS, 2020.

Z. C. Lipton. The mythos of model interpretability. ACM Queue, 16:30, 2018.

G. Louppe and P. Geurts. Ensembles on random patches. In Machine Learning and Knowledge Discovery in Databases, ECML PKDD '12, pages 346-361, Berlin, Heidelberg, 2012.

N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 34(3):1436-1462, 2006.

N. Meinshausen and P. Bühlmann. Stability selection. Journal of the Royal Statistical Society (Statistical Methodology), 72(4):417-473, 2010.

K. Nicodemus and Y. Shugart. Impact of linkage disequilibrium and effect size on the ability of machine learning methods to detect epistasis in case-control studies. In Genetic Epidemiology, volume 31, page 611. Wiley, 2007.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.

J. Pirgazi, M. Alimoradi, T. Esmaeili Abharian, and M. H. Olyaee. An efficient hybrid filter-wrapper metaheuristic-based gene selection method for high dimensional datasets. Scientific Reports, 9(1):18580, 2019.

R. D. Shah and R. J. Samworth. Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society (Statistical Methodology), 75(1):55-80, 2013.

A. Slivkins. Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12:1-286, 2019.

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56):1929-1958, 2014.

C. Staerk, M. Kateri, and I. Ntzoufras. High-dimensional variable selection via low-dimensional adaptive learning. arXiv preprint, 2019.

C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8(1):25, 2007.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, 58(1):267-288, 1996.

Y.-W. Wan, G. I. Allen, and Z. Liu. TCGA2STAT: simple TCGA data access for integrated statistical analysis in R. Bioinformatics, 32(6):952-954, 2015.

S. Wang, B. Nan, S. Rosset, and J. Zhu. Random lasso. Annals of Applied Statistics, 5(1):468-485, 2011.

H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

B. Yu. Stability. Bernoulli, 19(4):1484-1500, 2013.

P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541-2563, 2006.