Subsampling Winner Algorithm for Feature Selection in Large Regression Data
Yiying Fan
Department of Mathematics and Statistics, Cleveland State University, 2121 Euclid Avenue, Cleveland, OH 44115-2214. [email protected]
Jiayang Sun
Department of Statistics, George Mason University, 4400 University Drive, Fairfax, VA 22030. [email protected]
February 10, 2020

Abstract
Feature selection from a large number of covariates (aka features) in a regression analysis remains a challenge in data science, especially in terms of its potential for scaling to ever-enlarging data and for finding a group of scientifically meaningful features. For example, to develop new, responsive drug targets for ovarian cancer, the actual false discovery rate (FDR) of a practical feature selection procedure must also match the target FDR. The popular approach to feature selection, when true features are sparse, is to use a penalized likelihood or a shrinkage estimation, such as a LASSO, SCAD, Elastic Net, or MCP procedure (call them benchmark procedures). We present a different approach using a new subsampling method, called the Subsampling Winner algorithm (SWA). The central idea of SWA is analogous to that used for the selection of US national merit scholars. SWA uses a 'base procedure' to analyze each of the subsamples, computes the scores of all features according to the performance of each feature from all subsample analyses, obtains the 'semifinalists' based on the resulting scores, and then determines the 'finalists,' i.e., the most important features. Due to its subsampling nature, SWA can scale to data of any dimension in principle. The SWA also has the best-controlled actual FDR in comparison with the benchmark procedures and the randomForest, while having a competitive true-feature discovery rate. We also suggest practical add-on strategies to SWA, with or without a penalized benchmark procedure, to further assure the chance of 'true' discovery. Our application of SWA to the ovarian serous cystadenocarcinoma specimens from the Broad Institute revealed functionally important genes and pathways, which we verified by additional genomics tools. This second-stage investigation is essential in the current discussion of the proper use of P-values.

Keywords: regression · sparsity · subsampling · variable selection · high dimensions

Feature selection is important in data mining and knowledge discovery. It has many applications, including 1) finding and ranking important disease genes; 2) reducing nuisance features that mask truth or terminate a computation due to the size or memory limitation of a current computing software architecture [Liu (2009)]; 3) building a parsimonious model or performing a dimension reduction; and 4) improving efficiency in storage, communication, or computation. A unified view of feature selection in supervised learning is that finding important features is equivalent to an optimization problem that seeks a solution "a" maximizing an objective function J(a | D) over a domain of decisions "a", given a data set D. Here D = (X, y), where X is an n × p matrix representing measurements of p covariates measured on n subjects, and y denotes the outcomes of these n subjects. The number p0 of true important features (defined explicitly below in Section 2) among the p covariates is unknown and typically small. J depends on a statistical objective and a suitable model for analyzing the data D. The value of a that maximizes J is indicative of the
importance of features. For example, J is a penalized sum of squares of "residuals" in a linear regression context, or a measure of predictive classification rate in a classification model.

With a large p, seeking an exact optimization of J may be impossible or inefficient. Approximate solutions or clever alternatives are often necessary for practical use. These alternatives include the Local Quadratic Approximation (LQA) for minimizing a penalized likelihood with a SCAD penalty [Fan and Li (2001)], LAR [Efron et al. (2004)] for speeding up LASSO [Tibshirani (1996)], MCP [Zhang (2010)], which advocates using a minimax concave penalty, and iterative procedures, such as the stepwise procedure/ELSA [Kim, Street, and Menczer (2000)] and the generalized bagging [Breiman (1996)] and boosting [Schapire et al. (1998)] procedures for tree-based models. As a current state-of-the-art technique, the Elastic Net [Trevor, Tibshirani, and Friedman (2009)] is a regularized regression method that linearly combines the L1 and L2 penalties of the LASSO and ridge methods. This procedure and its cousins, SCAD and MCP, all belong to the class of penalized (likelihood) strategies. They have been studied extensively and generalized, for example, to a smooth regression model subject to different covariate effects within a group by Guo et al. (2016); for their asymptotic properties by Liu and Yu (2013); for asymptotic optimality by Zhang (2013); for multivariate regression modeling with a large number of responses by Sun et al. (2015); and for confidence intervals in high-dimensional linear regression [Cai and Guo (2017)]. The Random Forest [Breiman (2001)] is the benchmark of tree-based methods for producing importance of predictors (or features) in ensemble learning and regression applications. These sophisticated procedures assume that the target data are within the size and memory limitations of the procedures' software implementation. The ever-increasing volume of data continuously challenges this assumption. When a limit is reached, one strategy is to use a sample of the data. However, large data are often heterogeneous, and a subsample or even a few subsamples may not adequately represent the full data set. Another strategy is to use a newer machine with larger space and memory, and a new software implementation. On the other hand, when the true number of features is only a few, the inclusion of all features, even only a few hundreds or thousands, into a (standard) statistical analysis procedure can mask true features. See Liu, Sun, and Zhang (2007) for subsampling for PCA, an unsupervised learning problem. A third approach is to use a rough dimension-reduction methodology or domain knowledge to prescreen the features. Here, we consider a different approach: developing an ensemble that uses an adequate number of multiple subsamples and the outcomes from these subsamples strategically. We also require this ensemble procedure to be simple to implement and easy to scale to ever-enlarging dimension or size of data. Our Subsampling Winner Algorithm (SWA) is developed based on these ensemble requirements. Our SWA differs from the "subsampling procedures" in Wang et al. (2016), which subsample from the n observations; SWA subsamples from the p variables or features, and hence the traditional statistics based on a large number n of observations do not apply here. SWA bears some similarity to the philosophy behind the bagging and boosting for a tree-based model in selecting the best split at each node of the tree.
SWA applies simple, standard procedures on subsamples strategically, as Donoho and Jin (2015) did for Higher Criticism (HC) motivated by a Tukey idea, and hence SWA can be used for large data without adding too much complexity to its algorithm. By its subsampling nature, SWA reduces not only the requirement on memory or size, but also the complexity that arises from many masking nuisance features in the whole sample.

This paper focuses on developing a practical SWA for feature selection in linear regression from large data. In Section 2, our sparse linear model is provided along with the ovarian cancer data that motivated our development of SWA for regression. In Section 3, the Subsampling Winner Algorithm (SWA) is introduced. In Section 4, the procedures for specifying the SWA parameters are provided. In Section 5, the SWA's parameter specification procedure is validated, and a comprehensive comparison of SWA's performance with those of benchmark procedures is given. The benchmark procedures include the Elastic Net, SCAD, MCP, and Random Forest. In Section 6, the SWA is applied to the ovarian cancer data, where we also provide a "double assurance" procedure for practical applications. We found that SWA has excellent, automatic control of false discovery rates (FDR), and is competitive in capturing true features, i.e., with a good true discovery rate (TDR) when the FDRs are equalized for all comparison procedures. Also, the features discovered by SWA from the ovarian cancer data contain functionally important genes and pathways, setting a basis for future research into the much needed development of drug targets for ovarian cancer. The article is concluded with a discussion, summary, and recommendations in Section 7, including an analogy of SWA with the deep learning principle. A simple proof of our proposition is given in the appendix.

Consider data (X, y) from a linear model

y = X β + ε,   ε ∼ N(0, σ² I),   (1)

relating the outcome y and its p-dimensional covariates (aka features) via their measurements in X (of dimension n × p), and the unknown parameter vector β = (β1, ..., βp)^T. Here, the n-dimensional error vector ε is assumed to come from a homoscedastic normal distribution with mean 0 and variance σ², for simplicity. This simple model (1) is adequate for
many data applications, though the data may need to be preprocessed or transformed before modeling by (1), as with the motivating data below. This model is also the most basic model for evaluating a paradigm or direction change from the status quo, as Cai and Guo (2017) did for confidence intervals in high dimensions. A generalization of (1) to more general error structures will be discussed in Section 7. The challenge here, as in many recent modern procedures, is when p ≫ n. These recent procedures have been based on exploiting a sparsity assumption on the large β vector, which assumes that only a small number p0 out of the p components of β are non-zero. We assume the same moderate sparsity, with p0 ≪ n/log p, as Cai and Guo (2017) did. Our objective here is to identify the set of 'true features,' i.e., the covariates corresponding to the non-zero coefficients βj.

Data. In human cancer genomic research, feature selection plays an important role in analyzing big data. Messenger RNA (mRNA) are RNA molecules that transmit genetic information from DNA to the ribosome, where they control the amino acid sequence of gene expression. In particular, for the ovarian cancer data from the Broad Institute TCGA Genome Data Analysis Center (GDAC, 2013), we are interested in the identification of mRNAs that are directly linked to an important gene expression (aka, the response y) that has a significant impact on the overall survival of ovarian cancer patients. Analyzing this publicly available data set is in response to the NIH call for "Secondary Analysis and Integration of Existing Data to Elucidate the Genetic Architecture of Cancer Risk and Related Outcomes (R01)" (https://grants.nih.gov/grants/guide/pa-files/PA-17-239.html).

The important mRNA genes would influence the response and have non-zero β's in explaining the response in (1). The ovarian cancer data contain p = 12042 mRNA gene expression profiles derived from n = 561 ovarian serous cystadenocarcinoma specimens available from the Broad Institute's GDAC. The data have been normalized and analyzed for an association study (Broad Institute TCGA Genome Data Analysis Center, GDAC, 2013). Hence, Equation (1) is an adequate approximate model for our application to this normalized dataset.

Clearly p ≫ n in this ovarian cancer data. Thus, an ad hoc or standard statistical feature selection procedure is inconsistent. An adequate procedure designed specifically for large-p-small-n problems is needed. The SWA described below is suitable for the case of p ≫ n and will be applied to this ovarian cancer data to find significant genes linked to an important response y. The resulting "important" genes will help to validate candidate target genes in the literature, or to mine further important features that have not been presented in the existing biological literature.

This section presents our Subsampling Winner Algorithm
(SWA) for linear regression with data from equation (1) when p ≫ n. Given a suitable base procedure h and a scoring algorithm w that ranks features, our SWA goes through 4 steps: (1) SWA subsamples the features and performs a subsample analysis, using a base procedure, on each of m subsamples of size s, for s < n; (2) SWA ranks the subsample models and features; (3) SWA obtains q semifinalists from the ranks of features; (4) SWA analyzes the q semifinalists to capture the finalists, i.e., the most important features, using the base or an enhanced base procedure. Hence, an SWA depends on s, m, q and on h and w:

SWA(s, m, q | h, w),   (2)

where h, w are the base and scoring functions, respectively. For the linear regression model (1), h can simply be the standard least squares procedure applied to subsamples of s of the p features measured on the n data points (s < n). We define w based on the scores of features in each sub-model, weighed by the sub-model performance, as given in the score function (3) below.

Algorithm 1: SWA for Regression in Large Data

1. Subsample analysis. For each i = 1, ..., m, {
(a) Sample candidate features. Randomly draw s sub-columns from the p columns of X (n × p) to obtain an s-dimensional sub-matrix, still denoted as X (n × s) for simplicity, though having abused the notation.
(b) Least squares regression. Fit y ~ X (n × s) by h, the standard least squares procedure (allowing an additional stepwise variable selection on this resulting fit, as a user-specified option).
(c) Grade candidates locally. Store the resulting 'Residual Sum of Squares' and estimated 't statistics' of the β's as RSS_i and t_ij, for j = 1, ..., p, where the t_ij for the unsampled features (those not among the s drawn columns) are set to zero. }

2. Rank sub-models.
Rank RSS_i for i = 1, ..., m, and let RSS_(i) be the s smallest residual sums of squares and t_(i)j be the corresponding t-values of the β's, for i = 1, ..., s and j = 1, ..., p.

3. Rank all individuals to obtain semifinalists. Compute the scores of all features, for j = 1, ..., p, by the scoring function

w_j = (1/S_j) Σ_{i=1}^{s} |t_(i)j| / √(RSS_(i)),   where S_j = Σ_{i=1}^{s} I{ t_(i)j ≠ 0 }.   (3)

Retain the top q features that correspond to the q largest w values as the semifinalists.

4. Select finalists.
Fit a linear regression of y on the q semifinalists by h. The features with p-values less than 0.05 (after a multiplicity adjustment) are selected as the finalists. A further stepwise selection from this final regression fit is also an option.

We shall evaluate the SWA by comparing it with its benchmarks, in terms of the actual false discovery rate and the true discovery rate of important features. This SWA algorithm is simple. It can subsample from large-"p" data to select important features with no restriction on the size of p. It therefore may be called "dimensionless" and will be shown to be competitive in Section 5.
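To make the four steps concrete, the following is a minimal R sketch of SWA(s, m, q | h, w) with ordinary least squares as the base procedure h. The function name swa and its arguments are illustrative only (they are not the interface of the authors' "subsamp" package), and the score follows our reading of (3), in which sub-models with smaller RSS contribute larger weights.

```r
# Minimal sketch of SWA for model (1), assuming s < n and h = least squares (lm).
# 'swa' and its arguments are illustrative names, not the "subsamp" package API.
swa <- function(X, y, s, m = 5000, q = s, alpha = 0.05) {
  p    <- ncol(X)
  RSS  <- numeric(m)
  Tmat <- matrix(0, m, p)                      # t-statistics; 0 marks unsampled features
  for (i in seq_len(m)) {                      # Step 1: subsample analyses
    cols <- sample.int(p, s)
    fit  <- lm(y ~ ., data = data.frame(y = y, X[, cols, drop = FALSE]))
    RSS[i] <- sum(residuals(fit)^2)
    Tmat[i, cols] <- summary(fit)$coefficients[-1, "t value"]
  }
  best <- order(RSS)[1:s]                      # Step 2: keep the s best sub-models
  Tb   <- abs(Tmat[best, , drop = FALSE])
  S    <- colSums(Tb != 0)                     # times each feature appears among the best
  w    <- colSums(sweep(Tb, 1, sqrt(RSS[best]), "/")) / pmax(S, 1)   # score (3)
  semi <- order(w, decreasing = TRUE)[1:q]     # Step 3: semifinalists
  ffit <- lm(y ~ ., data = data.frame(y = y, X[, semi, drop = FALSE]))
  pv   <- summary(ffit)$coefficients[-1, "Pr(>|t|)"]
  semi[pv < alpha / p]                         # Step 4: finalists (Bonferroni over p features)
}
```

For the simulation setting of Example 2 below, a call such as swa(X, y, s = 30, m = 5000) would return the column indices of the selected finalists.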
This section addresses the specification of m, s and q in (2) for the SWA in regression. We use SWA(s, m, q) instead of SWA(s, m, q | h, w) hereafter.

Choice of s. Denote by F_true the set of the p0 true features and by F_SW the set of features selected by SWA. If the fixed s ≥ p0, then, as given by Proposition 1 below, the chance that all p0 true features are selected together in at least one subsample is close to 1 when m is large. The estimated coefficients from such a subsample analysis would then be consistent for a fixed s < n, as n → ∞ [Rao (1973)]. In general, we assume that the data come from an identifiable model in the sense that (1) features are sparse, with p0 ≪ n/log(p) [Cai and Guo (2018)]; (2) models containing more true features dominate those containing fewer true features, leading to smaller RSS; and (3) the SOIL condition of Ye, Yang, and Yang (2018) is satisfied as m → ∞. See more discussion of this SOIL condition below. Hence, we would then have that, if s ≥ p0,

P(F_SW ∆ F_true ≠ ∅) → 0, as m → ∞ and n → ∞,   (4)

where "∆" denotes the symmetric difference of the two sets F_SW and F_true. In other words, when s ≥ p0, the features selected by the SWA are, with large probability, consistent with the set of true nonzero features when the model is identifiable. However, the true number of features p0 is unknown. If s is too small, we may not be able to capture all the important features in any subsample with high probability. On the other hand, if s is too large, it can slow down computation dramatically and may mask important features.

Rule of Thumb. Our simulation and experience recommend a rule of thumb for selecting s:

2 p0 ≤ s ≤ 3 p0.   (5)

This rule of thumb does not require knowing p0, but only some range of p0. When the range is still quite uncertain, we propose a multipanel diagnostic plot for choosing s, constructed based on the following algorithm.

Algorithm 2: Multipanel Plot Algorithm (MPA) for selecting s

1. Initialize a set of s values: S = {s_i : i = 1, ..., I}, e.g., an increasing grid such as S = {5, 10, 20, ...}.
2. Arrange scree plots of feature weights for each of the s values in I panels. Run SWA(s_i, m, q) for i = 1, ..., I. Name the resulting weights w_j in (3) for each i as w_ij, for j = 1, ..., p and i = 1, ..., I. Plot {(j, w_ij) : j = 1, ..., p} for i = 1, ..., I to obtain I simultaneous panel plots, in both fixed and free scales, to examine the changes in both the magnitude and the shape of these plots. See Figures 1 & 2.
3. Identify the minimal stable point of s. A value of s is a stable point if its scree plot has an obvious "elbow" point and a stable "upper arm" set. An "elbow" point is the point where a significant change in the slope of a scree plot has occurred, typically leading to a tapering flat line. An "upper arm" set contains the points above the "elbow" point, for example, the feature set {1, 2, 3} in Figures 1 and 2. An "upper arm" set is stable if the indices of the points in the "upper arm" set are the same as those of the "upper arm" set of the next s value (although the set orders may differ). A set is "relatively stable" if the intersection of this set with its adjacent upper arm set(s) (for adjacent s values) contains a stable subset in reference to the neighboring sets.
4. Optional step. If there are no clear stable plots, but some semi-stable subplots (i.e., there are no definite stable sets), enlarge the set S by adding a few s_i's in the neighborhood of the semi-stable plots, and return to Step 2 using the updated S.

Both practical selection procedures of s are simple to implement. The multipanel plot has been implemented in our R package "subsamp," available on cran.r-project.org. The following are two examples of using this algorithm.

Example 1: Consider data from the model

y_i = 2 X_{i1} + 3 X_{i2} + 5 X_{i3} + ε_i,   (6)

where ε_i, X_{i1}, ..., X_{i100} are i.i.d. N(0, 1), for i = 1, ..., 20. Hence, p = 100, p0 = 3, and n = 20. For this example, we apply the SWA by subsampling with a repetition size m = 5000 and s ∈ {3, 5, 6, 9, 12, 15}. The resulting multipanel plots of the weight scores of the top 40 features for each choice of s_i, i = 1, ..., 6, are in Figures 1 and 2. The fixed-scale multipanel plot in Figure 1 shows that a significant change in magnitude occurred at s = 6 or s = 12. Combining this with the information in the free-scale plot in Figure 2, we see that the plot at s = 6 has an "elbow" point, while the plot at s = 12 does not have an obvious "elbow" point.
(a) The upper arm set is {1, 2, 3} for the plot of s = 6, and this set has a relatively stable subset of {2, 3}. It is important to note that the coefficient of X_1 is not as large as those of X_2 and X_3, so "1" can hide among other features.
(b) min(s_i) = 6, obviously.
For larger s, the true features are gradually masked more by nuisance features. Hence, s = 6 would be the recommended subsample size.

Example 2: Consider data from the model

y_i = X_i^T β + ε_i,   with X_i^T = (X_{i1}, ..., X_{i100}),   (7)

where the first ten components of β are nonzero, ranging from extremely weak (β_1 = 0.1) to very strong, the remaining 90 components are zero, and ε_i and X_{i1}, ..., X_{i100} are i.i.d. N(0, 1), for i = 1, ..., 80. This is again large-p (= 100), small-n (= 80) data with p0 = 10 true features. For this example we apply SWA by subsampling with a repetition size m = 5000 and s ∈ {5, 10, 20, 30, 40, 50}.
(a) As shown in the fixed-scale multipanel plot in Figure 3, significant changes in magnitude occur as s increases to 30 and beyond. In combination with the information from Figure 4, we observe that s = 30, 40, 50 lead to plots with a stable 'elbow' point.
(b) The upper arm set is {2, 3, 4, 5, 6, 7, 8, 9, 10}, which is also stable.
(c) It is obvious that min(s_i) = 30. It is also important to note that the signals of X_1, X_2 and X_3 are not as strong as those of the rest of the true features, with that of X_1 being extremely weak, especially when compared with the variance of ε_i.
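As an illustration, the following R sketch generates data with the structure of Example 2 and produces scree-type panels over a grid of s, in the spirit of Algorithm 2. The coefficient vector is a stand-in with the weak-to-strong pattern described above (not the authors' exact values), and swa_weights() is a hypothetical helper returning the weights w_j in (3) for a given s (for example, a by-product of the swa() sketch above or of the "subsamp" package).

```r
set.seed(1)
n <- 80; p <- 100
beta <- c(0.1, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 5, rep(0, p - 10))  # assumed pattern, p0 = 10
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% beta) + rnorm(n)

s_grid <- c(5, 10, 20, 30, 40, 50)
op <- par(mfrow = c(2, 3))
for (s in s_grid) {
  w <- swa_weights(X, y, s = s, m = 5000)      # hypothetical helper: weights w_j in (3)
  plot(sort(w, decreasing = TRUE)[1:40], type = "h",
       main = paste("s =", s), xlab = "rank", ylab = "feature weight")
}
par(op)
```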
Figure 1: Fixed-scale multi-panel diagnostics plot for p0 = 3 (panels s = 3, 5, 6, 9, 12, 15; x-axis: feature index, y-axis: weight of features).

Figure 2: Free-scale multi-panel diagnostics plot for p0 = 3 (panels s = 3, 5, 6, 9, 12, 15; x-axis: feature index, y-axis: weight of features).
Figure 3: Fixed-scale multi-panel diagnostics plot for p0 = 10 (panels s = 5, 10, 20, 30, 40, 50; x-axis: feature index, y-axis: weight of features).

Figure 4: Free-scale multi-panel diagnostics plot for p0 = 10 (panels s = 5, 10, 20, 30, 40, 50; x-axis: feature index, y-axis: weight of features).

The value s = 30 is the optimal subsample size in terms of stabilization in both scale and shape. In general, an s satisfying 2p0 ≤ s ≤ 3p0 is sufficient in our experience, although there are some cases where an s below this range may still work for a large p. In Section 5.1, we validate the multipanel procedure for choosing s systematically. In these two examples, we did not need to use Step 4 of Algorithm 2.
Discussion of the identifiable model. Given the SOIL condition in the identifiable model, Ye, Yang, and Yang (2018) have shown that an ensemble model leads to consistent feature selection. We made the same assumption in our identifiable model. However, in our studies, under the identifiable condition (2), the dominance condition, the scoring functions w_j in (3) typically place strong-enough important features into the semifinalist set and hence lead to consistent estimates of the regression coefficients. This is perhaps why our SWA procedure is competitive with the consistent penalized procedures in our simulation studies. We conjecture that conditions (1) and (2) in the identifiable model above would imply condition (3), the SOIL condition, or the consistency of Random Forest [Scornet, Biau, and Vert (2015)], under some mild regularity conditions on the strength of the signals. The ideas used to show that Adaptive Regression by Mixing (ARM) leads to the SOIL condition [Yang (2001)] might be generalizable to probe the consistency of our SWA procedure. We leave the formal consistency condition and proof to future investigation.

Choice of m. Proposition 1. Given that s ≥ p0, the upper and lower bounds for the number of independently selected subsamples of size s needed to capture all the important features with probability at least 1 − γ are given by

log γ / log(1 − (s/p)^{p0}) ≤ m ≤ log γ / log(1 − ((s − p0 + 1)/(p − p0 + 1))^{p0}).   (8)

A simple proof is given in the appendix.

Remark: It is worthwhile to point out that the repetition number m given by Proposition 1 is usually quite large. This is due to the consideration of the "worst" case scenario, in which the important features would not be discovered unless all p0 of them were selected in one of the subsamples. This "all or none" condition is extreme. The repetition number needed to catch the important features is often much smaller than the one indicated by Proposition 1. The repetition number and its corresponding lower and upper bounds given by Proposition 1 may be considered conservative benchmarks. The larger m is, the heavier the computational load will be. In our R package "subsamp," the default m = 5000 has been sufficient for all examples we have tried. Parallel computing can be used to speed up the implementation of the algorithm; see the discussion in Section 7.
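As a worked check of (8), the following R snippet computes the two bounds for given p, p0, s, and γ; it is a direct transcription of the proposition under the stated assumptions.

```r
# Bounds in (8) on the number of subsamples m needed so that all p0 true features
# appear together in at least one subsample with probability at least 1 - gamma.
m_bounds <- function(p, p0, s, gamma = 0.05) {
  a_upper <- (s / p)^p0                                # upper bound on alpha in (10)
  a_lower <- ((s - p0 + 1) / (p - p0 + 1))^p0          # lower bound on alpha in (10)
  c(lower = log(gamma) / log(1 - a_upper),
    upper = log(gamma) / log(1 - a_lower))
}
m_bounds(p = 100, p0 = 10, s = 30)                     # Example 2 settings
```

With the Example 2 settings, both bounds exceed 10^5, which illustrates why the remark above calls Proposition 1 a conservative benchmark and why a much smaller default such as m = 5000 works in practice.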
Choice of q. In the last step of the Subsampling Winner Algorithm, in our examples and applications in Sections 5 and 6, we have chosen q = s in the final (stepwise) model selection, which seems to be satisfactory for the values of p we have tried.

In this section, we first validate the MPA method specified in Algorithm 2 for selecting the subsample size s, by empirically examining the performance of SWA with various choices of subsample size s (Section 5.1). We then compare the SWA with three benchmark procedures, Elastic Net, SCAD and MCP, via their R implementations "glmnet", "ncvreg" and "plus", and then with the "randomForest" procedure. Both independent (Section 5.2) and correlated (Section 5.3) covariates are considered in our comparative study. The ultra-high-dimensional case with a prescreening procedure is also examined (Section 5.4).

The model of our choice for validating the MPA is (7) given in Example 2, where p = 100, n = 80, and p0 = 10, with β as in Example 2 representing the parameters of the true features. This model has a mixture of features whose coefficients range from extremely weak and weak through moderate to strong (β = 3) and very strong. As shown in Example 2, s = 30 is a good choice of subsample size for this model. For each sample drawn from this model, the variables selected by SWA were recorded; p-values of the finalists were adjusted by the simple Bonferroni method to control the multiplicity of searching among p = 100 covariates at the 0.05 family-wise error (FWE) rate. We repeated this random sampling and analysis 1000 times independently.

Tables 1 and 2 show the effects of different choices of s on the true discovery and false discovery of features by SWA, respectively. We examine the number of captures of true features for k = 10 (full capture) and for k = 9+, 8+, ..., meaning at least 9, 8, ... true features are captured.
We experiment with seven values of s between 5 and 35. Chosen by Algorithm 2 in Example 2, s = 30 is indeed the best choice based on a simulation study with 1000 repetitions. As shown in Table 1, the performance of true-feature detection improves as the subsample size s increases from 5 to 30. When s = 5 (or 8), obviously, the detection rate for more than 5 (or 8) features is zero, because s < p0 with q = s. The two best powers seem to be reached at s = 30 and s = 35 in this study. There is little difference between the results for s = 30 and s = 35, considering that β1 = 0.1 is not very different from β = 0 when σ = 1.

Table 1: SWA true-feature detection comparison across seven subsample sizes s (columns in increasing order of s, from s = 5 to s = 35)

k = 10:  0%    0%     0%     0%     0%     0%      0.1%
k = 9+:  0%    0%     0.3%   4.7%   16.8%  26.7%   29%
k = 8+:  0%    1.2%   10.2%  46.3%  78.1%  96.2%   96%
k = 7+:  0%    22.1%  44.0%  86.6%  97.8%  99.9%   100.0%
k = 6+:  0%    63.7%  80.1%  98.0%  99.9%  100.0%  100.0%

Notes: Entries are percentages of captures of true features. "10" in the first column means that all 10 true features are captured, while "6+" means that at least 6 true features are captured.

Table 2 indicates that the false-feature capture rate remains controlled at ≤ 0.05 until s = 30. At s = 35, the false discovery rate is a little bigger than 0.05. Hence, s = 30 is the optimal value of s when both the power and the false detection rate are considered. This outcome (from Tables 1 and 2), based on the repeated experiments, validates our finding by the multipanel plot algorithm (MPA).

Table 2: SWA false-alarm comparison across the same seven subsample sizes
0 false features:  986   991   987   984   981   956   934
1 false feature:    14     9    12    15    19    42    58

Notes: Entries are numbers of captures. "1" in the first column means that exactly one false feature is captured.
In this section, we compare SWA with the three benchmark procedures and with Random Forest when the covariates are generated independently as in model (7) in Example 2. Guided by the multipanel diagnostic plots in Figures 3 and 4, we set m = 5000 and s = 30. We performed a comprehensive, comparative study of SWA with the Elastic Net, SCAD and MCP as well as Random Forest.

We first compared SWA to the Elastic Net, SCAD and MCP using the default parameters in the R packages "glmnet" and "ncvreg", where the tuning parameter λ was chosen to correspond to the minimum mean cross-validated error for Elastic Net and SCAD, and using σ = 1 for MCP. The simulation size is 1000. The false discovery rates (FDR) of these competing procedures are listed in Table 3. Out of 1000 independent repetitions, SWA with a Bonferroni control of the multiplicity at 0.05 FWE performed extremely well, capturing no false features in 956 trials, exactly 1 false feature in 42 of the trials, exactly 2 nuisance features only twice, and never capturing more than 2 nuisance features. This SWA had an FDR closest to the target level of 0.05 among all comparison procedures. With the same Bonferroni control but adding a stepwise procedure in SWA's last step (Step 4), the false discoveries of nuisance features by SWA are comparable to those of the MCP procedure, both much better than those of the Elastic Net and SCAD (using their default parameters). The procedure with a Benjamini-Hochberg (BH) FDR control still did better than the Elastic Net and SCAD (with the default data-dependent λ parameter) procedures in terms of false feature alarm.
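For reference, the default benchmark fits described above can be obtained along the following lines in R, assuming the glmnet and ncvreg packages. The elastic-net mixing value alpha = 0.5 is illustrative (the text does not specify it), and the MCP fit with known σ was obtained via the "plus" package, which is not shown here.

```r
library(glmnet)
library(ncvreg)

# Elastic Net: lambda chosen at the minimum mean cross-validated error
cv_en  <- cv.glmnet(X, y, alpha = 0.5)
en_sel <- which(as.matrix(coef(cv_en, s = "lambda.min"))[-1, 1] != 0)

# SCAD: lambda also chosen by cross-validation
cv_scad  <- cv.ncvreg(X, y, penalty = "SCAD")
scad_sel <- which(coef(cv_scad)[-1] != 0)
```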
Table 3: Comparison of false-feature alarm (entries are numbers of captures over 1000 repetitions; the compared procedures are SWA with Bonferroni, with Bonferroni plus a stepwise step, and with BH adjustments, together with Elastic Net, SCAD, and MCP)

0 false features:  956   756   828     7   403   832
1 false feature:    42   169   142    14   255   151
2 or more:          13 6 50 88 2 / 89 10 / 82 2

Notes: Entries are numbers of captures. "1" in the first column means that exactly one false feature is captured.
Next we compare the competing procedures, SWA, Elastic Net, SCAD and MCP, in terms of true-feature discovery rates after setting the regularization parameter λ for Elastic Net and SCAD, and σ for MCP, to match the FDR of SWA at 0.05. The outcome is summarized in Table 4, which provides the percentage of times that the true features are captured. Overall, SWA with a Bonferroni correction (denoted as "SWA w/ Bonferroni") under this setting is comparable to SCAD and MCP, while Elastic Net does not scale up to the performance of SWA, SCAD or MCP. Specifically, SWA captured all moderate, strong and very strong features, as SCAD and MCP did, in the "7+" case; SWA (w/ Bonferroni) had a rate of 96.2% true-feature captures in the "8+" case, which includes a weak signal. In the "9+" case, which includes extremely weak features, we do not expect a high capture rate; there SWA (w/ Bonferroni) had a true-feature capture rate of 26.7%, which is less than those of SCAD and MCP. It is possible that if we used a less conservative multiplicity adjustment than the Bonferroni procedure, such as the Benjamini-Hochberg (BH) procedure, SWA could have captured more very weak features, though the false discoveries might also increase.

In this set of comparisons, the values of p and p0 are typical of some genetic studies, and are within the implementation limitations of the SCAD, Elastic Net and MCP procedures. The advantages of SWA are as follows. First, it does not require a choice of λ or a determination of the unknown σ while providing comparable outcomes for weak to strong signals. We do need to choose s and m, but they are simple to set up. SWA by default controls the actual FDR at the nominal level, while the penalized procedures are liberal (as shown in Table 3) with the current data-dependent choice of λ. Second, in the true-feature comparison we adjusted the regularization parameters of Elastic Net, SCAD and MCP based on the performance of SWA in this simulation study to equalize the FDR; in a real data application it is not clear how to control the FDR at the desired level for each of Elastic Net, SCAD and MCP individually. Perhaps we can combine the results of either SWA and SCAD, or SWA and MCP; see more discussion in Section 7. Third, SWA is simple. It can handle data with a much larger p than procedures that require the full p-dimensional data as input for each analysis, since it is a subsampling procedure. Thus, SWA can scale easily to cases where the penalized procedures do not run in their current software implementations.
Table 4: Comparison of true-feature detection

10:   0%   0.6%   0.7%   0%   0.1%   0.1%

Notes: Entries are percentages of captures. "10" in the first column means that all 10 true features are captured; "6+" means that at least 6 true features are captured.
Random Forest is a machine learning method for classification and regression. Random Forest and SWA share some similarities, such as the random selection of features. However, Random Forest does not control the FDR; instead it gives a set of important features for each user-specified number called "npick," the number of features that one desires to specify in advance.
Hence, it is not directly comparable with SWA, Elastic Net, SCAD, or MCP in terms of FDR. Regardless, we evaluated Random Forest's performance on the same data from Example 2. In comparison to SWA, Table 5 summarizes the percentage of true features among the top "npick" features found by the Random Forest measure of importance. When "ntree," another user-specified number in the Random Forest calculation, is set to ntree = 1000, the resulting true-feature capture percentage improves as "npick" increases from 10 to 30, which would mean that the FDR increases to at least 10/20 (50%) for npick = 20 and 20/30 (67%) for npick = 30. When npick = 10, the number of true features, only the strong and moderate features are captured well by Random Forest. With npick = 10 and 20, we also increased the number of trees, "ntree," from 1000 to 2000, which did not help much. In comparison, SWA performs well at all levels, controlling the FDR and providing good true-feature detection rates.

Table 5: Random Forest true-feature detection comparison

Notes: Entries are percentages of captures. "10" in the first column means that all 10 true features are captured; "6+" means that at least 6 true features are captured.
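The Random Forest importance ranking referred to in Table 5 can be reproduced along these lines, assuming the randomForest package; treating the first ten columns as the true features follows the Example 2 setup, and npick here is simply the number of top-ranked features retained.

```r
library(randomForest)

rf  <- randomForest(x = X, y = y, ntree = 1000, importance = TRUE)
imp <- importance(rf, type = 1)                    # permutation (%IncMSE) importance

npick  <- 10
top_rf <- order(imp, decreasing = TRUE)[1:npick]   # top "npick" features by importance
mean(1:10 %in% top_rf)                             # fraction of the 10 true features captured
```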
Based on the above outcome, we shall only compare SWA with Elastic Net, SCAD, and MCP in the correlated and ultra-high-dimensional cases below.

We repeat our comparison when the covariates are correlated, as in Tibshirani (1996). Specifically, in Example 2, let the correlation between x_i and x_j be ρ^{|i−j|} for i, j = 1, ..., 100, with ρ = 0.5, and with σ = 1 and σ = 3, respectively. The results for Case 1 (σ = 1) are given in Tables 6 and 7, while those for Case 2 (σ = 3) are in Tables 8 and 9.

For Case 1, in terms of false-feature alarm, SWA and SCAD made the fewest mis-detections, followed by MCP and then Elastic Net. In terms of true-feature detection, after adjusting the benchmark procedures to match the FDR of SWA, SWA is found to be comparable with the benchmark procedures, as shown in Table 7, with MCP being the best performer, slightly.
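The correlated designs used in this subsection can be generated as follows, a small sketch assuming the MASS package and the Example 2 coefficient stand-in beta from the earlier snippet.

```r
library(MASS)

n <- 80; p <- 100; rho <- 0.5
Sigma <- rho^abs(outer(1:p, 1:p, "-"))          # Cor(x_i, x_j) = rho^|i - j|
X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
y <- drop(X %*% beta) + rnorm(n, sd = 1)        # Case 1; use sd = 3 for Case 2
```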
Table 6: False-feature alarm in correlated Case 1 (ρ = 0.5, σ = 1; entries are numbers of captures)

0:          961   768   264   977   822
1:           36   171   195    21   163
2 or more:   15 129 2

Notes: "1" in the first column means that exactly one false feature is captured.

For Case 2, the simulation variation is increased to σ = 3. As shown in Table 8, SWA again has the fewest mis-detections. In terms of true-feature detection, after matching the benchmark procedures' FDR to that of SWA, SWA is found to be comparable with the competing procedures for detecting the moderate and stronger features, while Elastic Net does better in detecting the weak features.
Table 7: True-feature detection in correlated Case 1 (ρ = 0.5, σ = 1)

10:   0%   0%   2.5%   1.9%   0.7%

Notes: Entries are percentages of captures. "10" in the first column means that all 10 true features are captured; "6+" means that at least 6 true features are captured.
Table 8: False-feature alarm in correlated Case 2 (ρ = 0.5, σ = 3; entries are numbers of captures)

0:          875   567   299    73   638
1:           10   234   213    99   272
2:           21   101   143   142    73
3 or more:   57 82 127 13 / 19 69 117 3 / 13 44 112 1 / 14 42 / 55 57

Notes: "1" in the first column means that exactly one false feature is captured.
Consider the model in Example 2 with the same p0 = 10 and n = 80, but let p = 1000 instead of p = 100. SWA did not perform well without a pre-screening step. Since it is common to perform a pre-screening procedure, such as sure independence screening [Fan and Lv (2008)], for variable selection, we added a pre-screening step that selects the x-variables with the strongest marginal correlation with the response variable y. The SWA procedure with a repetition size m = 5000 and s = 30 was applied afterwards. SWA (with a Bonferroni adjustment) outperformed Elastic Net, SCAD and MCP using their default parameters in false-feature detection (Table 10). After setting the smoothing parameters in Elastic Net, SCAD and MCP to match the FDR at the same level, SWA was still found to be competitive overall, as shown in Table 11. In this ultra-high-dimensional case, we recommend using the Bonferroni adjustment at the rate determined by the number of features retained after the pre-screening step. Arguably, using a Bonferroni adjustment at the rate determined by the subsample size s and the size of the semifinalist set q is also reasonable.

In this section, we apply our SWA to the ovarian cancer data. There are n = 561 samples, each having p = 12042 mRNA gene expressions. This data set neither contains an obvious response variable nor has control cases. We hence choose the Cyclin E gene, termed "CCNE1," as the response in this case study. "CCNE1" is a good surrogate for survival: according to the cBioPortal for Cancer Genomics (http://cbioportal.org) by Gao et al. (2013), 25% of the 591 ovarian cancer patients were found to have their "CCNE1" gene altered, and the subjects with an alteration of the "CCNE1" gene had a significantly lower overall survival rate (P-value = 1.619e-4), as shown on the right side of Figure 5. Hence, we are interested in finding mRNAs that collectively are significant explanatory variables, or features, for "CCNE1." Finding these explanatory variables is helpful for examining and finding pathways, not just one gene, responsible for survival.

First, we activate a screening step, based on the distribution of the sample correlations shown on the left side of Figure 5, to reduce the number of features from 12042 to p = 901 by selecting the features whose absolute correlation with the response exceeds a cutoff. Then, we apply SWA with a repetition size m = 10^4 and subsampling sizes s from S = {5, 8, 10, 15, 20, 30} to choose the 'optimal' subsample size s. Guided by Algorithm 2 and examining the fixed-scale multipanel plot in Figure 6, we find that significant changes in magnitude occur at s = 20.
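The marginal-correlation pre-screening used both in this ultra-high-dimensional simulation and in the ovarian cancer analysis can be sketched as follows; the number of features kept (or the correlation cutoff) is data-dependent, and the value shown is only a placeholder.

```r
# Keep the features most correlated (in absolute value) with the response y,
# then run SWA on the screened design; the number kept below is a placeholder.
marg_cor   <- abs(cor(X, y))                              # |cor(x_j, y)| for each feature
keep       <- order(marg_cor, decreasing = TRUE)[1:200]   # placeholder screening size
X_screened <- X[, keep, drop = FALSE]
swa(X_screened, y, s = 30, m = 5000)                      # SWA applied after screening
```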
Table 9: True-feature detection in correlated Case 2 (ρ = 0.5, σ = 3)

10:   0%   0%   0.8%   0.1%   0%

Notes: Entries are percentages of captures. "10" in the first column means that all 10 true features are captured; "6+" means that at least 6 true features are captured.

Table 10: False-feature detection in the ultra-high p = 1000 case (entries are numbers of captures)

0:    965   913   136   714
1:     29    71   147   249
≥2:   998    29
Notes: Entries are numbers of false-feature captures. "1" in the first column means that exactly one false feature is captured.

Combining this fixed-scale indication with the free-scale information in Figure 7, we observe that s = 20 and 30 lead to plots with a stable 'elbow' point and a stable upper-arm set. Thus, the choice s = min(s_i) = 20 is the optimal subsample size, stabilized in both scale and shape. However, as an enhancement in practice, we propose one more step to implement a 'Double-Assurance Procedure' below.
Figure 5: Left: histogram of correlation coefficients. Right: Kaplan-Meier plot for the ovarian cancer data using cBioPortal (cases with vs. without alteration(s) in the query gene; log-rank test P-value = 1.619e-4).
Table 11: True-feature detection in the ultra-high p = 1000 case (FDR-controlled comparisons)

10:   0%   0%

Notes: Entries are percentages of captures. "10" in column 1 means that all 10 true features are captured; "6+" means that at least 6 true features are captured.
Algorithm 3: Double Assurance Procedure
1. Consider SWA(s, m, q) for s = s0, the subsample size chosen above (here s0 = 20), and for a few s_i ∈ S with s_i < s0 that seemed to lead to semi-stable points.
2. Combine the final significant features from SWA(s0, m, q) and from SWA(s_i, m, q) for these s_i's (here, s_i = 10 and 15) into a combined predictor set x′, then run another base analysis h for y ~ x′, and perform a variable selection step on the outcomes.
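A minimal sketch of Algorithm 3 in R, reusing the illustrative swa() function from Section 3; the subsample sizes follow the ovarian cancer example, and the final stepwise selection is left as an optional call.

```r
# Double-Assurance: pool SWA finalists obtained at the chosen s0 and at a few
# smaller, semi-stable s_i, then refit the base procedure h on the pooled set.
sel_pooled <- unique(c(swa(X, y, s = 20),
                       swa(X, y, s = 10),
                       swa(X, y, s = 15)))
fit_final <- lm(y ~ ., data = data.frame(y = y, X[, sel_pooled, drop = FALSE]))
summary(fit_final)          # significant terms here form the final feature set
# step(fit_final)           # optional further stepwise selection
```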
Figure 6: Fixed-scale multi-panel diagnostics plot for the ovarian cancer study (panels s = 5, 8, 10, 15, 20, 30; x-axis: feature index, y-axis: weight of features).

Figure 7: Free-scale multi-panel diagnostics plot for the ovarian cancer study (panels s = 5, 8, 10, 15, 20, 30; x-axis: feature index, y-axis: weight of features).
The final significant features from
Algorithm 3 are listed in Table 12. Our set of features is sensible based on a STRING analysis of the detected features using functional protein association networks (http://string-db.org/) by Szklarczyk et al. (2015). The STRING analysis shows that our discovered features contain functionally important genes; in particular, "POLQ" and "RAD51" form a special pathway in which "POLQ" blocks "RAD51"-mediated recombination. Together, "POLQ" and "RAD51" correlate with defects in homologous recombination (HR) repair, as published in
Nature [Ceccaldi et al. (2015)]. In addition, a pathway analysis shows that the network has significantly more interactions than expected (P-value < 0.05). Both our findings and their results reveal a synthetic lethal relationship between the HR pathway and "POLQ"-mediated repair in ovarian cancer. The biological application of the discovery of the features in Table 12 may go beyond the statements above. In a separate biological manuscript, we apply another new bioinformatics tool to a combined set of these features with other "BRCA" genes to explore potential drug targets for ovarian cancer.

We have provided a new procedure, the Subsampling Winner Algorithm (SWA), for finding important features in large-p regression. We performed a comprehensive study to compare SWA with the benchmark procedures LASSO, Elastic Net, SCAD, MCP and Random Forest. We also applied the SWA to analyze the genomic data on ovarian cancer. Our study revealed a meaningful ovarian cancer pathway, verified using the STRING analysis, as well as new genes that have become candidates to be examined biologically.

Our subsampling approach represents a paradigm shift from the popular approach of using a penalized criterion for selecting important/true features from p covariates when p ≫ n. The penalized procedures approximate the solution to a penalized criterion based on the entire sample. The SWA is an ensemble of simple procedures h performed on subsamples, lifting the good performance of h in the low s-dimension (≪ n) to the high p-dimension (≫ n) using our scoring function strategically. SWA is analogous to the method for selecting national merit scholars.
Table 12: Final results from the Double-Assurance Procedure

Gene        Estimate    P-value
CDKN2A       0.26554    <
C19orf2      0.28228    4.94e-10
PLEKHF1      0.21810    5.08e-09
POLQ         0.35074    4.25e-06
TMEM30B     -0.13503    8.61e-06
GLCE        -0.16539    5.93e-05
POP4         0.20944    7.06e-05
GIPC1        0.19072    0.000393
C17orf53     0.43510    0.000529
C15orf15    -0.15195    0.000586
TM2D3       -0.14747    0.000712
CRIM1       -0.14717    0.000768
RAD51        0.27486    0.002840
The advantages of SWA are given below.

First, the SWA is virtually dimensionless, due to its subsampling nature and the simplicity of its base procedure (without the need to find a penalizing or tuning parameter). Hence, it can scale easily to an enormously large p, especially when p is too large to use a benchmark penalized procedure that requires the data to be input to existing software in one pass. Also, due to its simplicity and subsampling nature, SWA can be made "embarrassingly parallel" and distributed over multiple cores for fast computation and for enlarging data. We leave this kind of coding refinement for multiple cores or high-performance computers to the future.

Second, when the large p is manageable, SWA is shown to be competitive with the benchmark procedures in our simulation studies. In terms of the false discovery rate (FDR), we compared the SWA using a simple Bonferroni multiplicity adjustment (denoted as SWA w/ Bonferroni) with LASSO, Elastic Net, SCAD, MCP and Random Forest (RF). We used the default parameter λ based on the cross-validated mean square error for Elastic Net and SCAD, and the known σ = 1 for MCP, as well as various parameter choices for RF. The SWA controlled the false discovery rate (FDR) closest to the target value (Tables 3, 6, 8 and 10). Random Forest does not explicitly control the FDR and hence is not comparable (Table 5). The advantage of Random Forest is not in controlling the FDR, but in its prediction precision and perhaps in providing a set of important features as first-stage dimension-reduction candidates.

In terms of the true discovery rate (TDR), we compared the true-feature selection after equalizing the FDRs of all penalized procedures to that of the SWA. The good performers are summarized in Table 13, based on the comprehensive results in Tables 4, 7, 9, 11 and 12. Specifically, in the case of independent covariates, SWA, SCAD and MCP were comparable except for the very weak features, where MCP was the (slight) winner, followed closely by SCAD and SWA (Table 4). Elastic Net did not perform well in the independent-covariates case (Table 4), but was comparable with SWA, SCAD and MCP in Cases 1 and 2 of the correlated covariates (Tables 7 and 9), and performed slightly best in the correlated Case 2 (Table 9). SWA and MCP are the clear winners in the ultra-high-dimensional case with independent covariates (Table 11). In the ultra-high-dimensional case, it is important to add a prescreening procedure, as we and all the penalized procedures have done.

Table 13: Results summary and recommendations for large, manageable p

Case              | Good performers based on FDR | Good performers based on TDR           | Practical recommendation
Independent       | SWA (Table 3)                | MCP, SCAD & SWA (Table 4)              | A. SWA with an added double assurance (Algorithm 3); or B. combination of SWA & a penalized procedure (Algorithm 4)
Correlated Case 1 | SCAD & SWA (Table 6)         | MCP, SCAD, Elastic Net & SWA (Table 7) | (same as above)
Correlated Case 2 | SWA (Table 8)                | Elastic Net, SWA, MCP & SCAD (Table 9) | (same as above)
Ultra High        | SWA (Table 10)               | MCP & SWA (Table 11)                   | (same as above)
Notes: Entries in columns 2 and 3 are the good performers based on the results from the table cited. The recommendations in column 4 were made based on both FDR and TDR.
In all these comparisons, for Elastic Net, SCAD and MCP we used a trial-and-error method to modify λ and σ from their default or initial values to match the FDR at 0.05. In practice, without knowing the truth, this trial-and-error adjustment to match an actual 0.05 rate is impractical. Thus, in practice we recommend using SWA with a double assurance as in Algorithm 3, or a combination of SWA with one of the good performers among Elastic Net, SCAD, and MCP, as shown in Table 13, via the following Algorithm 4.
Algorithm 4: Combine the features from SWA and from a good-performing penalized procedure, and run our base procedure h on this combined predictor set to obtain the final set of features.

We do not have a rigorous mathematical statement or proof of SWA's optimality, although we gave a rationale for its consistency given the SOIL condition, as Ye, Yang, and Yang (2018) did. SWA's excellent performance and scalability to an ultra-large p may bear some similarity to those of deep learning procedures [LeCun, Bengio, and Hinton (2015)], as follows. A deep learning algorithm is built on many layers of linear combinations of simple basis functions, while SWA is based on a fusion of many simple analyses of subsamples. The purpose of deep learning parallels that of Random Forest, being excellent for prediction, while SWA aims directly at feature selection. Another reason for the good performance is perhaps SWA's selection of important features from an ensemble of (subsample) models, in the same spirit as the SOIL conditions of Ye, Yang, and Yang (2018). The difference between our SWA and the ensemble of Ye, Yang, and Yang (2018) is that we extract the final useful set of features from simple analyses of subsamples (hence, SWA is scalable to very high dimensions), while their ensemble obtains final results from penalized or other iterated procedures applied to whole samples.

Although SWA does not need a penalizing parameter λ or an estimate of σ in advance, SWA does need a subsample size s and a number of subsample iterations m; fortunately, they are straightforward to select, and our selection automatically controls the FDR at the target level. Specifically, the SWA is reasonably stable as long as 2p0 ≤ s ≤ 3p0 with a sufficiently large m, and s can be chosen more specifically by our simple multi-panel plot with a fixed m = 5000. Another parameter, q, the number of "semifinalists" in SWA, is set to s in our simulation experiments herein and was shown to be satisfactory. However, this q can easily be enlarged beyond s to be conservative. We also proposed a double-enhancement procedure, which is similar to enlarging q from s, for practical applications in Section 6.

The SWA in this paper is developed for linear regression, the most fundamental model. The SWA can be generalized to a generalized linear model, or to a nonparametric classification function, as suitable for other types of data and analysis objectives. The key is to choose h properly and to generalize the scoring function w accordingly; for example, the SWA for principal component analysis [Liu, Sun, and Zhang (2007)] used different h and w. We leave the study of these generalized models using SWA to the future.

It is also worth noting the following caveats before using any new advanced procedure to analyze a data set. First, preprocessing data is essential, as it can remove a large number of obviously nuisance features cheaply and quickly in advance. Second, an ensemble of statistical procedures is commonly needed for analyzing large and complex data; for example, the ensemble used for analyzing movies (a sequence of images) in a clinical study in Wang, Sun, and Bogie (2006) and Bogie et al. (2008) includes both image preprocessing and statistical analysis procedures. Third, correct modeling is critical to feature selection for examining the effects of possible influencing factors.
Under differentmodels, the effects of various features may be shown differently.Finally, we advocate that any biological implications be verified by independent biological experiments or anotherinstrument. We used cBiportal and STRING analysis tools to our findings by SWA. This is consistent to the recom-mendation made by Wasserstein, Schirm, and Lazar (2019). We are quite pleased in discovering that our study of theovarian cancer data has led to important pathways, and also that there is a potential for biological applications beyondwhat is presented in Section 6. A Regularity Conditions and Proofs
Proof.
The idea is to first calculate the probability that all the important features are selected and detected in one subsample, and then to choose the required m such that this probability is at least 1 − γ.

Since s ≥ p0, there is a chance that all the important features are selected in a subsample, so that the important features can be detected at once. For each subsample, the probability that all p0 important features are selected is

α = C(p − p0, s − p0) / C(p, s) = [(p − p0)! s! (p − s)!] / [(s − p0)! (p − s)! p!] = [(p − p0)! s!] / [(s − p0)! p!] = [s (s − 1) ⋯ (s − p0 + 1)] / [p (p − 1) ⋯ (p − p0 + 1)],   (9)

where C(·, ·) denotes the binomial coefficient. Since s < p, we can obtain upper and lower bounds for α from the above expression:

((s − p0 + 1) / (p − p0 + 1))^{p0} ≤ α ≤ (s/p)^{p0}.   (10)

Assuming the random subsamples are independent of each other, the probability that all p0 important features are selected together in at least one subsample after m trials is 1 − (1 − α)^m. Setting

1 − (1 − α)^m = 1 − γ,   (11)

we have

m = log γ / log(1 − α),   (12)

and can thus obtain the upper and lower bounds of m from the upper and lower bounds of α, which are the ones given in (8).

References
Berger, J. O. (2010). Statistical Decision Theory and Bayesian Analysis, 2nd ed. Springer, New York.
Bogie, K., Wang, X., Fei, B., and Sun, J. (2008). Getting more out of pressure mapping: a new technique for interface pressure analysis. Journal of Rehabilitation Research and Development, 45(4), 523–536.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Broad Institute TCGA Genome Data Analysis Center (2013). Identification of putative miR direct targets. Broad Institute of MIT and Harvard. doi:10.7908/C1B27SBH. 23 May, 2013.
Cai, T. T. and Guo, Z. (2017). Confidence intervals for high-dimensional linear regression: minimax rates and adaptivity. Annals of Statistics, 45(2), 615–646.
Cai, T. T. and Guo, Z. (2018). Accuracy assessment for high-dimensional linear regression. Annals of Statistics, 46(4), 1807–1836.
Candès, E. J. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n. Annals of Statistics, 35(6), 2313–2351.
Ceccaldi, R., Liu, J. C., Amunugama, R., Hajdu, I., Primack, B., Petalcorin, M., O'Connor, K. W., Konstantinopoulos, P. A., Elledge, S. J., Boulton, S. J., Yusufzai, T., and D'Andrea, A. D. (2015). Homologous recombination-deficient tumors are hyper-dependent on POLQ-mediated repair. Nature, 518, 258–262.
Donoho, D. and Jin, J. (2015). Higher criticism for large-scale inference, especially for rare and weak effects. Statistical Science, 30(1), 1–25.
Efron, B. and Tibshirani, R. (1997). Improvements on cross-validation: the .632+ bootstrap method. Journal of the American Statistical Association, 92(438), 548–560.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32(2), 407–499.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.
Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society, Series B, 70(5), 849–911.
Gao, J., Aksoy, B. A., Dogrusoz, U., Dresdner, G., Gross, B., Sumer, S. O., Sun, Y., Jacobsen, A., Sinha, R., Larsson, E., Cerami, E., Sander, C., and Schultz, N. (2013). Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science Signaling, 6(269), pl1. doi:10.1126/scisignal.2004088.
Golub, G., Heath, M., and Wahba, G. (1979). Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2), 215–223.
Guo, J., Hu, J., Jing, B., and Zhang, Z. (2016). Spline-Lasso in high-dimensional linear regression. Journal of the American Statistical Association, 111(513), 288–297.
Ishwaran, H., James, L., and Sun, J. (2001). Bayesian model selection in finite mixtures by marginal density decompositions. Journal of the American Statistical Association, 96(456), 1316–1332.
Kim, Y., Street, W. N., and Menczer, F. (2000). Feature selection in unsupervised learning via evolutionary search. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 365–369. ACM.
LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.
Liu, H. and Yu, B. (2013). Asymptotic properties of Lasso+mLS and Lasso+Ridge in sparse high-dimensional linear regression. Electronic Journal of Statistics, 7, 3124–3169.
Liu, P. (2009). Adaptive Mixture Estimation and Subsampling PCA. Ph.D. Thesis, Case Western Reserve University.
Liu, P., Sun, J., and Zhang, Z. (2007). SPCA: a new feature selection procedure for large-p data. Proceedings of the American Statistical Association, Section on Statistical Computing, [CD-ROM], Alexandria, VA: American Statistical Association, 1970–1974.
Park, J., Sriram, T. N., and Yin, X. (2010). Dimension reduction in time series. Statistica Sinica, 20.
Rao, C. R. (1973). Linear Statistical Inference and its Applications. Wiley, New York.
Schapire, R., Freund, Y., Bartlett, P., and Lee, W. (1998). Boosting the margin: a new explanation for the effectiveness of voting methods. Annals of Statistics, 26(5), 1651–1686.
Scornet, E., Biau, G., and Vert, J.-P. (2015). Consistency of random forests. Annals of Statistics, 43(4), 1716–1741.
Sun, Q., Zhu, H., Liu, Y., and Ibrahim, J. G. (2015). SPReM: sparse projection regression model for high-dimensional linear regression. Journal of the American Statistical Association, 110(509), 289–302.
Szklarczyk, D., Franceschini, A., Wyder, S., Forslund, K., Heller, D., Huerta-Cepas, J., Simonovic, M., Roth, A., Santos, A., Tsafou, K. P., Kuhn, M., Bork, P., Jensen, L. J., and von Mering, C. (2015). STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Research, 43, D447–D452.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1), 267–288.
Trevor, H., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.
Wang, C., Chen, M., Schifano, E., and Yan, J. (2016). Statistical methods and computing for big data. Statistics and Its Interface, 9(4), 399–414.
Wang, X., Sun, J., and Bogie, K. (2006). Spatial-temporal data mining procedure: LASR. IMS Lecture Notes–Monograph Series.
Wasserstein, R. L., Schirm, A. L., and Lazar, N. A. (2019). Moving to a world beyond "p < 0.05". The American Statistician, 73(sup1), 1–19.
Yang, Y. (2001). Adaptive regression by mixing. Journal of the American Statistical Association, 96(454), 574–588.
Ye, C., Yang, Y., and Yang, Y. (2018). Sparsity oriented importance learning for high-dimensional linear regression. Journal of the American Statistical Association, 113(524), 1797–1812.
Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, 38(2), 894–942.
Zhang, L. (2013). Nearly optimal minimax estimator for high-dimensional sparse linear regression. Annals of Statistics, 41(4), 2149–2175.