RaSE: A Variable Screening Framework via Random Subspace Ensembles
Ye Tian, Department of Statistics, Columbia University
and
Yang Feng, Department of Biostatistics, School of Global Public Health, New York University
Abstract
Variable screening methods have been shown to be effective in dimension reduction under the ultra-high dimensional setting. Most existing screening methods are designed to rank the predictors according to their individual contributions to the response. As a result, variables that are marginally independent but jointly dependent with the response could be missed. In this work, we propose a new framework for variable screening, Random Subspace Ensemble (RaSE), which works by evaluating the quality of random subspaces that may cover multiple predictors. This new screening framework can be naturally combined with any subspace evaluation criterion, which leads to an array of screening methods. The framework is capable of identifying signals with no marginal effect or with high-order interaction effects. It is shown to enjoy the sure screening property and rank consistency. We also develop an iterative version of RaSE screening with theoretical support. Extensive simulation studies and real-data analyses show the effectiveness of the new screening framework.
Keywords:
Variable screening; Random subspace method; Ensemble learning; Sure screening property; Rank consistency; High dimensional data; Variable selection

1 Introduction
With the rapid advancement of computing power and technology, high-dimensional data become ubiquitous in many disciplines such as genomics, image analysis, and tomography. With high-dimensional data, the number of variables p could be much larger than the sample size n. What makes statistical inference possible is the sparsity assumption, which assumes only a few variables have contributions to the response. Under this sparsity assumption, there has been a rich literature on the topic of variable selection, including LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001), elastic net (Zou and Hastie, 2005), and MCP (Zhang, 2010). Despite the success of these methods in many applications, for the ultra-high dimensional scenario where the dimension p grows exponentially with n, they may not work well due to the "curse of dimensionality" in terms of simultaneous challenges to computational expediency, statistical accuracy, and algorithmic stability (Fan and Lv, 2008).

To conquer these difficulties, Fan and Lv (2008) proposed a novel procedure called sure independence screening (SIS) with solid theoretical support. In the past decade, the power of feature screening has been well recognized and a myriad of screening methods have been proposed. The existing screening methods can be broadly classified into two categories, model-based methods and model-free ones. Model-based screening methods rely on specific models, such as SIS and its extensions to generalized linear models (Fan and Lv, 2008; Fan et al., 2009), the Cox model (Fan et al., 2010; Zhao and Li, 2012), nonparametric independence screening based on additive models (Fan et al., 2011; Cheng et al., 2014), and screening via high-dimensional ordinary least-squares projection (HOLP) (Wang and Leng, 2016). Recently, model-free approaches have become more popular because of their less stringent requirements. Examples of such approaches include the sure independent ranking and screening (SIRS) (Zhu et al., 2011), the screening method based on distance correlation (DC-SIS) and its iterative version (Li et al., 2012; Zhong and Zhu, 2015), the screening procedure via martingale difference correlation (Shao and Zhang, 2014), screening via the Kolmogorov filter (Mai and Zou, 2013, 2015), the screening approach for discriminant analysis (MV-SIS) (Cui et al., 2015), interaction pursuit via Pearson correlation (IP) and the distance correlation (IPDC) (Fan et al., 2016; Kong et al., 2017), the screening method based on ball correlation (Pan et al., 2018), and the nonparametric screening under conditional strictly convex loss (Han, 2019).

For variables that are marginally independent but jointly dependent with the response, many existing screening methods could miss them. This issue has been recognized in the literature (Fan and Lv, 2008; Fan et al., 2009; Zhu et al., 2011; Zhong and Zhu, 2015) and iterative screening procedures were developed, which were shown to be effective empirically. However, to the best of our knowledge, there is not much theoretical development for the iterative screening methods. In addition, some iterative screening methods (e.g. iterative SIS) are coupled with a variable selection method like LASSO or SCAD, making their performance dependent on the specific choice of the regularization method. Besides, some other iterative screening methods (e.g. iterative SIRS and iterative DC-SIS) recruit variables step by step through residuals until a pre-specified number of variables are picked.
Thus,their success hinges on a key tuning parameter, that is, how many variables to recruit ineach step, making these procedures potentially less robust.These issues mentioned above motivate us to propose a new screening framework whichgoes beyond marginal utilities. In the new framework, we investigate multiple featuresat the same time, via the random subspace method (Ho, 1998). Tian and Feng (2020)proposed a new Random Subspace Ensemble classification method based on a similar idea,3 aSE , according to a specific aggregation framework first introduced in Cannings andSamworth (2017). They advocated applying RaSE on sparse classification problems. Themain idea of RaSE can be simply described as follows. First, B B random subspacesare generated from a specific distribution on subspaces, which are evenly divided into B groups. Next, the best subspace within each group is picked according to some criterionand a base learner is trained in that subspace. Hence we obtain B base learners, each ofwhich corresponds to a subspace. Finally, these B base learners are aggregated on averageand the ensemble will be used in prediction. The vanilla RaSE algorithm is reviewed inAlgorithm 3 in Appendix A.1. It’s important to note that there is a by-product of RaSE,which is the selected proportion of each variable within B selected subspaces. In thiswork, we will use this selected proportion to do variable screening, and call this the RaSEscreening framework.We highlight the merits of RaSE screening framework as follows. First, by looking atdifferent feature subspaces, variables marginally independent but jointly dependent withthe response can be identified. Second, instead of proposing only a single screening ap-proach, the flexible framework of RaSE allows us to use any criterion function for comparingsubspaces, leading to an array of screening methods. One possible way to construct sucha criterion function is to choose a base learner and a specific measure for comparing thesubspaces. For example, if we know linear methods are suitable for the data, then we canapply RaSE by picking subspaces achieving lower BIC under linear models. If k -nearestneighbor ( k NN) is believed to perform better, we can apply RaSE by choosing subspaceswith the smallest cross-validation error on k NN. Third, under general conditions, we showthe sure screening property and rank consistency for RaSE screening framework. We alsowork out the details under linear regression models, which provides us the intuition to the4eneral conditions. Finally, we develop a novel iterative RaSE screening framework withsure screening property established without the need to use a variable selection step orspecifying the number of variables to recruit in each step.The rest of this paper is organized as follows. Section 2 introduces the vanilla RaSEscreening framework and its iterative version in detail, and discusses the relationship be-tween RaSE and marginal screening methods. In Section 3, we present the theoreticalproperties for vanilla RaSE and iterative RaSE screening, including sure screening prop-erty and rank consistency. At the end of Section 3, the example of linear regression modelis analyzed in detail to illustrate the general conditions imposed. In Section 4, extensivesimulation studies and real-data analysis are conducted to demonstrate the power of ournew screening framework. We summarize our contributions and point out some promisingfuture avenues in Section 5. 
The supplementary materials include all the technical proofs as well as additional details.
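Before introducing the formal framework, a minimal R sketch may help fix ideas. It mimics the screening step described above (formalized as Algorithm 1 in Section 2) with BIC under a linear working model as the subspace criterion. The function name rase_screen, the simulated coefficients, and the fixed choices of B1, B2, and D below are illustrative assumptions only; they do not reproduce the RaSEn package or the paper's simulation settings.

    # Minimal sketch of vanilla RaSE screening with a BIC criterion (illustrative only).
    set.seed(1)

    n <- 100; p <- 200
    Sigma_chol <- chol(matrix(0.5, p, p) + diag(0.5, p))   # equicorrelation 0.5
    x <- matrix(rnorm(n * p), n, p) %*% Sigma_chol
    # x4 is chosen so that it is marginally uncorrelated with y but jointly important
    beta <- numeric(p); beta[1:3] <- 5; beta[4] <- -(5 + 5 + 5) * 0.5
    y <- drop(x %*% beta) + rnorm(n)

    rase_screen <- function(x, y, B1 = 100, B2 = 50, D = 10) {
      p <- ncol(x)
      chosen <- matrix(FALSE, B1, p)
      for (b1 in seq_len(B1)) {
        best_crit <- Inf; best_S <- NULL
        for (b2 in seq_len(B2)) {
          d <- sample.int(D, 1)                        # hierarchical uniform: subspace size first, ...
          S <- sample.int(p, d)                        # ... then a uniform size-d subspace
          crit <- BIC(lm(y ~ x[, S, drop = FALSE]))    # smaller criterion = better subspace
          if (crit < best_crit) { best_crit <- crit; best_S <- S }
        }
        chosen[b1, best_S] <- TRUE
      }
      colMeans(chosen)                                 # selected proportion of each feature
    }

    eta_hat <- rase_screen(x, y)
    head(order(eta_hat, decreasing = TRUE), 10)        # top-ranked variables

In this toy data set the fourth signal has no marginal correlation with the response, so a purely marginal ranking tends to place it near the bottom, while subspaces containing it together with the first three signals attain markedly smaller BIC, which pushes its selected proportion up.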
In what follows, we consider predictors x = (x_1, . . . , x_p)^T and response y. For regression problems, y takes value from the real line R, while for classification problems, y takes value from an integer set {1, . . . , K}, where K > 1. Suppose we have training data {(x_i, y_i)}_{i=1}^n. Denote by S_Full = {1, . . . , p} the full feature set. The signal set S* ⊆ S_Full is defined as the set S with minimal cardinality satisfying y ⊥⊥ x_{S_Full \ S} | x_S. [a] is used to represent the largest integer no larger than a.

To introduce the RaSE framework, we denote the B_1B_2 random subspaces as {S_{b_1 b_2}, b_1 = 1, . . . , B_1, b_2 = 1, . . . , B_2}, the b_1-th group of subspaces as {S_{b_1 b_2}}_{b_2=1}^{B_2}, and the selected B_1 subspaces as {S_{b_1*}}_{b_1=1}^{B_1}. The objective function corresponding to the specific criterion to choose subspaces is written as Cr_n : 𝒮 → R, where 𝒮 is the collection of all subspaces. Assume a smaller value of Cr_n leads to a better subspace. Although the original RaSE (Tian and Feng, 2020) was introduced to solve classification problems, we now consider the general prediction framework, including both classification and regression. Following the idea of Tian and Feng (2020), we use the proportion of each feature among the selected B_1 subspaces as the importance measure. Therefore, a natural screening procedure is to rank variables based on this proportion vector, then pick the variables with the largest proportions. The RaSE screening framework is summarized in Algorithm 1.

Algorithm 1:
Vanilla RaSE screening
Input: training data {(x_i, y_i)}_{i=1}^n, subspace distribution 𝒟, criterion function Cr_n, integers B_1 and B_2, number of variables N to select
Output: the selected proportion of each feature η̂, the selected subset Ŝ
1. Independently generate random subspaces S_{b_1 b_2} ∼ 𝒟, 1 ≤ b_1 ≤ B_1, 1 ≤ b_2 ≤ B_2
2. for b_1 ← 1 to B_1 do
3.   Select the optimal subspace S_{b_1*} = S_{b_1 b_2*}, where b_2* = argmin_{1 ≤ b_2 ≤ B_2} Cr_n(S_{b_1 b_2})
4. end
5. Output the selected proportion of each feature η̂ = (η̂_1, . . . , η̂_p)^T, where η̂_j = B_1^{-1} Σ_{b_1=1}^{B_1} 1(j ∈ S_{b_1*}), j = 1, . . . , p
6. Output Ŝ = {1 ≤ j ≤ p : η̂_j is among the N largest of all}

In the algorithm, the subspace distribution 𝒟 is chosen as a hierarchical uniform distribution over the subspaces by default. Specifically, with D as the upper bound of the subspace size, we first generate the subspace size d from the uniform distribution over {1, . . . , D}. Then, the subspace S follows the uniform distribution over all size-d subspaces {S ⊆ S_Full : |S| = d}. In practice, the subspace distribution can be adjusted if we have prior information about the data structure.

Algorithm 1 is not the end of the story because it ranks all the variables but does not determine how many variables to keep. To facilitate the theoretical analysis, we define the final feature subset to be selected as Ŝ_α = {1 ≤ j ≤ p : η̂_j is among the [αD/c_{1n}] largest of all}, where c_{1n} is a constant (to be specified in the next section) depending on n, B_2, D, and the criterion Cr, which is a population counterpart of Cr_n. Here, α can be any constant larger than 1, which will appear in the upper bound introduced in the sure screening theorem of Section 3.

As we mentioned in the introduction, the existing iterative screening methods have various tuning components such as the number of variables to recruit in each step and/or a specific variable selection method. We propose the iterative RaSE screening in Algorithm 2 to tackle these issues. The main idea of iterative RaSE screening is to update the subspace distribution based on the selected proportion in the preceding steps, and it does not conduct variable screening until the final step. To understand the details in the algorithm, we introduce a new subspace distribution.

Algorithm 2:
Iterative RaSE screening (RaSE_T)
Input: training data {(x_i, y_i)}_{i=1}^n, initial subspace distribution 𝒟^[0], criterion function Cr_n, integers B_1 and B_2, the number of iterations T, positive constant C, number of variables N to select
Output: the selected proportion of each feature η̂^[T], the selected subset Ŝ
1. for t ← 0 to T do
2.   Independently generate random subspaces S^[t]_{b_1 b_2} ∼ 𝒟^[t], 1 ≤ b_1 ≤ B_1, 1 ≤ b_2 ≤ B_2
3.   for b_1 ← 1 to B_1 do
4.     Select the optimal subspace S^[t]_{b_1*} = S^[t]_{b_1 b_2*}, where b_2* = argmin_{1 ≤ b_2 ≤ B_2} Cr_n(S^[t]_{b_1 b_2})
5.   end
6.   Update η̂^[t], where η̂^[t]_j = B_1^{-1} Σ_{b_1=1}^{B_1} 1(j ∈ S^[t]_{b_1*}), j = 1, . . . , p
7.   Update 𝒟^[t+1] ← hierarchical restrictive multinomial distribution R(U, p, η̃^[t]), where η̃^[t]_j ∝ η̂^[t]_j 1(η̂^[t]_j > C/log p) + (C/p) 1(η̂^[t]_j ≤ C/log p) and Σ_{j=1}^p η̃^[t]_j = 1
8. end
9. Output the selected proportion of each feature η̂^[T]
10. Output Ŝ = {1 ≤ j ≤ p : η̂^[T]_j is among the N largest of all}

Note that each subspace S can be equivalently represented as J = (J_1, . . . , J_p)^T, where J_j = 1(j ∈ S), j = 1, . . . , p. Generating a subspace from the hierarchical restrictive multinomial distribution R(U, p, η̃), where Σ_{j=1}^p η̃_j = 1 and η̃_j ≥
0, is equivalent to the procedure:1. Draw d from distribution U on { , . . . , D } ;2. Draw J = ( J , . . . , J p ) T from a restrictive multinomial distribution with parameter8 p, d, ˜ η ), where the restriction is J j ∈ { , } .For example, the hierarchical uniform distribution belongs to this family where U is theuniform distribution U on { , . . . , D } and ˜ η j = p for all j = 1 , . . . , p .With the hierarchical restrictive multinomial distribution in hand, we can depict theiterative algorithm more precisely. In round t , the algorithm updates the subspace distribu-tion of next round D [ t +1] by the hierarchical restrictive multinomial distribution R ( U , p, ˜ η [ t ] ),where ˜ η [ t ] j ∝ [ˆ η [ t ] j (ˆ η [ t ] j > C / log p ) + C p (ˆ η [ t ] j ≤ C / log p )] and ˆ η [ t ] j is the proportion of vari-able j in the B selected subspaces { S [ t ] b ∗ } B b =1 of round t . Before closing this section and moving into theoretical analysis, we want to point out theconnection of RaSE screening approach with the classical marginal screening methods aswell as the important problem of interaction detection.First of all, it is easy to observe that when D = 1 in Algorithm 1, with proper mea-sure, RaSE screening method reduces to the marginal screening approaches. In this sense,RaSE screening method can be seen as an extension of classical marginal screening frame-works by evaluating subspaces instead of individual predictors. In addition, when thereare signals with no marginal contribution, one intuitive idea is to screen all possible inter-action terms, which demand extremely high computational costs. For example, screeningall the order- d interactions leads to a computational cost of O ( p d ). Instead of screeningall possible interactions, RaSE randomly chooses some feature subspaces and explore theircontributions to the response via a specific criterion. The carefully designed mechanism of9enerating random subspaces along with the iterative step greatly alleviate the requirementon computation.Second, there has been a great interest in studying screening methods for interactiondetection (Hao and Zhang, 2014; Fan et al., 2016; Kong et al., 2017). The proposed RaSEscreening framework works in a different fashion, by evaluating the contribution of variablesthrough the joint contributions in different subspaces. A simulation example (Example 4)where we have 4-way interactions among predictors will be studied to show the power ofRaSE. In this section, we investigate the theoretical properties of RaSE screening method to helpreaders understand how it works and why it can succeed in practice. We are not claimingthat the assumptions we make are the weakest and conclusions we obtain are the strongest.Before moving forward, we first define some notations. For two numbers a and b , wedenote a ∨ b = max( a, b ) and a ∧ b = min( a, b ). For two numerical sequences { a n } ∞ n =1 and { b n } ∞ n =1 , we denote a n = o ( b n ) or a n (cid:28) b n if lim n →∞ | a n /b n | = 0. Denote a n = O ( b n ) or a n (cid:46) b n if lim sup n →∞ | a n /b n | < ∞ . For two random variable sequences { x n } ∞ n =1 and { y n } ∞ n =1 ,denote x n = O p ( y n ) or x n (cid:46) p y n if for arbitrary (cid:15) >
0, there exists a constant
M > | x n /y n | > M ) ≤ (cid:15) for any n . Denote Euclidean norm for a length- p vector x = ( x , . . . , x p ) T as (cid:107) x (cid:107) = (cid:113)(cid:80) pj =1 x j . p represents a length- p vector with all entries 1.For a p × p (cid:48) matrix A = ( a ij ) p × p (cid:48) , define the 1-norm (cid:107) A (cid:107) = sup j (cid:80) pi =1 | a ij | , the operatornorm (cid:107) A (cid:107) = sup x : (cid:107) x (cid:107) =1 (cid:107) A x (cid:107) , the infinity norm (cid:107) A (cid:107) ∞ = sup i (cid:80) p (cid:48) j =1 | a ij | and the maximumnorm (cid:107) A (cid:107) max = sup i,j | a ij | . We also denote the minimal and maximal eigenvalues of a square10atrix A as λ min ( A ) and λ max ( A ), respectively. First, note that the success of RaSE relies on the large selected proportions of all signals.According to Algorithm 1, the selected proportion depends on the comparison of differentsubspaces, which can be naturally divided into two categories — “covering signal j ” or“not covering signal j ”. Capturing the distribution of these two types of subspaces can behelpful to understand when RaSE can succeed. While the joint distribution can be verycomplicated and hard to analyze, we find that given the number of B subspaces coveringeach signal j , the joint distribution of these two types of subspaces are much simplified,which motivates the useful lemma in the follows. Lemma 1. If { S b } B b =1 i.i.d. ∼ R ( U , p, p − p ) , given N j := { b : j ∈ S b } = k , dividing { S b } B b =1 into { S ( j )1 b } kb =1 and { S ( − j )1 b } B − kb =1 , where S ( j )1 b (cid:51) j and S ( − j )1 b (cid:54)(cid:51) j ,(i) { S ( j )1 b } kb =1 independently follow the distribution P ( S ( j ) = S ) = 2 pD ( D + 1) · (cid:0) p | S | (cid:1) · ( j ∈ S ) = 2 | S | D ( D + 1) · (cid:0) p − | S |− (cid:1) ( j ∈ S ); (1) (ii) { S ( − j )1 b } B − kb =1 independently follow the distribution P ( S ( − j ) = S ) = 2 pD (2 p − D − · (cid:0) p | S | (cid:1) · ( j / ∈ S ) = 2( p − | S | ) D (2 p − D − · (cid:0) p − | S | (cid:1) ( j / ∈ S );(2) (iii) { S ( j )1 b } kb =1 ⊥⊥ { S ( − j )1 b } B − kb =1 . The proof of Lemma 1 can be found in Appendix B. It shows us that given N j := { b : j ∈ S b } = k , { S ( j )1 b } kb =1 and { S ( − j )1 b } B − kb =1 are independent. And each S ( j )1 b , S ( − j )1 b follows11 “weighted” hierarchical uniform distribution by adjusting the sampling weight based onthe cardinality of subspace.Now, we introduce a concentration of Cr n on its population version Cr for a collectionof subsets. In particular, for any D , there exists a sequence { (cid:15) n := (cid:15) ( n, D ) } ∞ n =1 and positiveconstant c n → P (cid:32) sup S : | S |≤ D | Cr n ( S ) − Cr( S ) | > (cid:15) n (cid:33) ≤ c n (3)holds for any n . Such a sequence { (cid:15) n } ∞ n =1 always exists, though we would like it to be smallto have a uniform concentration as described in the following assumption. Assumption 1.
Denote δ n := δ ( n, D ) = sup j ∈ S ∗ P (Cr( S ( − j ) ) − Cr( S ( j ) ) < (cid:15) n ) , p = D +12 p ,where S ( j ) and S ( − j ) follow the distributions in (1) and (2) respectively, and S ( j ) ⊥⊥ S ( − j ) .It holds that B p (cid:38) , lim sup n,D,B →∞ B δ B p n < ∞ . Remark 1.
Assumption 1 is the key condition, where δ n measures the minimal signalstrength via comparing the two types of feature subspaces introduced in Lemma 1. Fromthe assumption, we need a large B when δ n is small. We will present a detailed analysisfor linear models with penalized negative log-likelihood as the subspace evaluation measurein Section 3.3. In addition, we can slightly weaken the assumption to that there existssome α ∈ (0 , such that lim sup n,D,B →∞ B δ α B p n < ∞ . We chose the current Assumption tosimplify the presentation. Theorem 1 (Sure screening property) . Define c n := c ( n, B , D ) := (1 − c n )(1 − δ B p n ) B (cid:18) − exp (cid:26) − B p (cid:27)(cid:19) . or any α > , let ˆ S α = { ≤ j ≤ p : ˆ η j is among the [ αD/c n ] largest of all } . UnderAssumption 1, when B (cid:29) log p ∗ and n → ∞ , we have(i) P( S ∗ ⊆ ˆ S α ) ≥ − p ∗ exp (cid:110) − B c n (cid:0) − α (cid:1) (cid:111) → , (ii) The selected model size | ˆ S α | (cid:46) D . Next, we would like to analyze the requirement of B imposed by Assumption 1, whichdepends on δ n . We first introduce an assumption under which the scale of δ n can beexplicitly calculated. Assumption 2.
If for all j ∈ S ∗ , there exists a subset ¯ S j (cid:51) j and another subset S j ⊆ S Full \{ j } with cardinality p j , where sup j ∈ S ∗ p j D = o ( p ) , such that inf j ∈ S ∗ inf S N : | S N |≤ D −| ¯ S j | S N ⊆ S Full \ ( S j ∪ ¯ S j ) (cid:2) Cr( S N ) − Cr( S N ∪ ¯ S j ) (cid:3) (cid:29) (cid:15) n , where (cid:15) n satisfies (3) . To provide more intuition, this assumption implies that for each signal j , we can de-compose S Full into ¯ S j (a set covering j ), S j (a set not covering j ), and [ S Full \ ( ¯ S j ∪ S j )](the remaining set). For any S N ⊆ [ S Full \ ( ¯ S j ∪ S j )], we can add ¯ S j onto S N to improve thecriterion function by an order greater than (cid:15) n . This provides a uniform lower bound on thecontribution of every signal when we use the given criterion function. Proposition 1.
Denote ¯ d = max j ∈ S ∗ | ¯ S j | . Under Assumption 2, we have δ n ≤ + o (1) , ¯ d = 1 , − O (cid:16) p ¯ d − (cid:17) , ¯ d ≥ , as n, D → ∞ . emark 2. Based on Proposition 1, to make Assumption 1 hold, it is sufficient to require B (cid:38) p κ + ¯ d D , where κ > is an arbitrary constant. Proposition 1 implies that a smaller ¯ S j leads to aless stringent requirement on B . In the most ideal case, ¯ S j = { j } holds for all j ∈ S ∗ , implying ¯ d = 1, which leads to theweakest requirement of B , that is, B (cid:38) p κ +1 D for any κ >
0. If a signal j does not have a marginal contribution to the response, we have |S̄_j| > 1, leading to a very large order of B_2 to satisfy Assumption 1. This motivates the iterative RaSE screening (Algorithm 2), which usually has a less stringent requirement on B_2, making the framework more applicable to high-dimensional settings.

Before discussing how the requirement on B_2 can be relaxed by iterative RaSE screening, we first study the sure screening property for it. For simplicity, we only study the one-step iteration, i.e. the case when T = 1. It is not hard to generalize the conditions and conclusions to the general case when T > 1. To better state the results, we first generalize Lemma 1 to understand the distribution of the two aforementioned types of subspaces after one iteration.
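As a concrete aside, the subspace-distribution update that Algorithm 2 performs between the two rounds can be sketched in R as follows. This shows one way to draw subspaces from a hierarchical restrictive multinomial-type distribution via weighted sampling without replacement; the helper names sample_subspace and update_weights, the constant C0 (playing the role of C in Algorithm 2), and the toy numbers are illustrative assumptions, not the paper's implementation.

    # Illustrative sketch: weighted subspace generation and one-step weight update.

    # Draw one subspace: size d ~ Unif{1,...,D}, then d distinct indices drawn with
    # probabilities proportional to the weight vector eta_tilde.
    sample_subspace <- function(p, D, eta_tilde) {
      d <- sample.int(D, 1)
      sample.int(p, d, prob = eta_tilde)
    }

    # Turn the selected proportions eta_hat from the previous round into new
    # sampling weights: keep large proportions, floor the small ones at C0/p.
    update_weights <- function(eta_hat, C0 = 0.1) {
      p <- length(eta_hat)
      w <- ifelse(eta_hat > C0 / log(p), eta_hat, C0 / p)
      w / sum(w)
    }

    # Example: pretend variables 1-5 were selected often in the initial round.
    p <- 1000; D <- 10
    eta_hat0 <- rep(0.01, p); eta_hat0[1:5] <- c(0.9, 0.8, 0.7, 0.6, 0.5)
    eta_tilde <- update_weights(eta_hat0)
    sample_subspace(p, D, eta_tilde)   # next-round subspaces favor variables 1-5

With T = 1, one such reweighted round following the initial uniform round corresponds to the one-step iteration analyzed below.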
Lemma 2. Suppose {S_{1b}}_{b=1}^{B_2} i.i.d. ∼ ℱ for any distribution ℱ with P_{S∼ℱ}(j ∈ S) ∈ (0, 1). Given N_j := |{b : j ∈ S_{1b}}| = k, divide {S_{1b}}_{b=1}^{B_2} into {S^(j)_{1b}}_{b=1}^k and {S^(−j)_{1b}}_{b=1}^{B_2−k}, where S^(j)_{1b} ∋ j and S^(−j)_{1b} ∌ j. Then:
(i) {S^(j)_{1b}}_{b=1}^k independently follow the distribution
    P(S^(j) = S) = P_{S∼ℱ}(S = S) · 1(j ∈ S) / P_{S∼ℱ}(j ∈ S);   (4)
(ii) {S^(−j)_{1b}}_{b=1}^{B_2−k} independently follow the distribution
    P(S^(−j) = S) = P_{S∼ℱ}(S = S) · 1(j ∉ S) / P_{S∼ℱ}(j ∉ S);   (5)
(iii) {S^(j)_{1b}}_{b=1}^k ⊥⊥ {S^(−j)_{1b}}_{b=1}^{B_2−k}.

We omit the proof of Lemma 2 as it is very similar to that of Lemma 1. Next, we introduce the following technical assumption analogous to Assumption 1.
Assumption 3.
Suppose the signal set S ∗ can be decomposed as S ∗ = S ∗ [0] ∪ S ∗ [1] , where S ∗ [0] and S ∗ [1] satisfy the following conditions.(i) (The first-step detection) Denote δ [0] n = sup j ∈ S ∗ [0] P (Cr( S ( − j ) ) − Cr( S ( j ) ) < (cid:15) n ) , p [0]0 = D +12 p , where S ( j ) and S ( − j ) follow the distributions in (1) and (2) respectively, and S ( j ) ⊥⊥ S ( − j ) . It holds that lim sup n,D,B →∞ B ( δ [0] n ) B p [0]0 < ∞ . (ii) (The second-step detection) Denote δ [1] n = sup F∈ Υ sup j ∈ S ∗ P (Cr( S ( − j ) ) − Cr( S ( j ) ) < (cid:15) n ) , p [1]0 = ( D +1) C D + C ) p , where S ( j ) and S ( − j ) follow the distributions in (4) and (5) respectively, S ( j ) ⊥⊥ S ( − j ) , and C is a constant from Algorithm 2. Υ = {R ( U , p, ˜ η ) } is a familyof hierarchical restrictive multinomial distributions satisfying ∃ a constant c ∗ >
0, inf j ∈ S ∗ [0] ˜ η j ≥ c ∗ ( D + C ) . It holds that B (cid:29) p, lim sup n,D,B →∞ B ( δ [1] n ) B p [1]0 < ∞ . emark 3. Condition (i) is a relaxed version of Assumption 1, which only requires that asubset S [0] of S ∗ satisfies the condition. This can be seen as a first-step detection conditionfor RaSE screening method to capture S ∗ [0] . The remaining signals in S ∗ [1] that might bemissed in the first step will be captured in the second step. The family of distributions Υ is introduced to incorporate the randomness in the first step of RaSE screening. This typeof stepwise detection conditions is very common in the literature (Jiang and Liu, 2014; Liand Liu, 2019; Zhou et al., 2020; Tian and Feng, 2020). Theorem 2 (Sure screening property for one-step iterative RaSE screening) . Define c [ l ]2 n := c [ l ]2 ( n, B , D ) := (1 − c n ) (cid:104) − ( δ [ l ] n ) B p [ l ]0 (cid:105) B (cid:18) − exp (cid:26) − B ( p [ l ]0 ) (cid:27)(cid:19) , l = 0 , . For ˆ S [1] α = { ≤ j ≤ p : ˆ η [1] j is among the [ αD/c [1]2 n ] largest of all } , where α > , underAssumption 3, if c [0]2 n > c ∗ and B (cid:29) log p ∗ , we have(i) P( S ∗ ⊆ ˆ S [1] α ) ≥ − p ∗ exp (cid:110) − B ( c [0]2 n − c ∗ ) (cid:111) − p ∗ exp (cid:110) − B ( c [1]2 n ) (cid:0) − α (cid:1) (cid:111) → ,as n → ∞ ;(ii) | ˆ S α | (cid:46) D . The lower bound in (i) comes from the two steps of Algorithm 2, which is very intuitive.The general iterative RaSE screening algorithm with any T ≥ B can be discussed in a similar fashion as the vanilla RaSE screeningfor some specific scenarios. For instance, if lim sup n,D →∞ δ [0] n and lim sup n,D →∞ δ [1] n are smaller than 1,then we can easily observe that B (cid:29) p is sufficient to make Assumption 3 hold, whichrelaxes the requirement shown in Remark 2 a lot. In Section 4, an array of simulations andreal data analyses will show the power of iterative RaSE screening.16 .2 Rank consistency Next, we study another important property of the RaSE screening, namely the rank con-sistency. First, we impose the following assumption.
Assumption 4.
Suppose the following conditions hold:(i) Denote ˜ δ n = inf j / ∈ S ∗ P (Cr( S ( j ) ) − Cr( S ( − j ) ) > (cid:15) n ) , where S ( j ) and S ( − j ) follow thedistributions in (1) and (2) respectively, and S ( j ) ⊥⊥ S ( − j ) . We have γ ( n, D, B ) := (1 − c n ) (cid:18) − (cid:26) − B p (cid:27)(cid:19) × (cid:104) (1 − δ B p n ) B − (1 − ˜ δ B p n ) B (1 − p ) (cid:105) − c n > . (ii) B (cid:29) γ ( n, D, B ) − ∨ log p . Remark 4.
Condition (i) is introduced to make sure the signals are separable from thenoises. Here, ˜ δ n is a parallel definition to δ n , measuring the maximal noise level via com-paring the two types of feature subspaces introduced in Lemma 1. In an ideal scenario, thereis a sufficiently large gap between ˜ δ n and δ n , under which there exist D and B to satisfythe condition. A related condition can be found in Assumption (C3) of Cui et al. (2015). Theorem 3 (Rank consistency) . Under Assumption 4, it holds that P (cid:32) inf j ∈ S ∗ ˆ η j > sup j / ∈ S ∗ ˆ η j (cid:33) ≥ − p exp (cid:26) − B γ ( n, D, B ) (cid:27) → , as n, B , B → ∞ . .3 An example: linear regression model Now, we have introduced the general conditions and conducted a detailed theoretical anal-ysis of RaSE screening framework. To better understand the assumptions and conditions,we present a thorough analysis on a simple example: the linear regression model. Recall ahigh-dimensional linear regression model is y = β T x + (cid:15), where (cid:15) ∼ N (0 , σ ), x ∼ N ( , Σ), (cid:15) ⊥⊥ x , β = ( β , . . . , β p ) T is sparse with | S ∗ | = p ∗ (cid:28) p .We further assume Var( y ) = O (1). In addition, suppose we have training data ( X, Y )where X is a n × p matrix and Y is a vector of length n , representing the design andcorresponding response, respectively.We consider the penalized log-likelihood as the criterion function,Cr n ( S ) = − n ˆ L ( S ) + g n ( | S | ) , where ˆ L ( S ) = − n π − n S ) − n , (6)in which MSE( S ) = 1 n (cid:107) ˆ Y − Y (cid:107) = 1 n Y T ( I n − P X S ) Y, (7)and g n : N + → R is a penalty function of the size of S . Some popular examples of g n includeAIC (Akaike, 1973), BIC (Schwarz, 1978), eBIC (Chen and Chen, 2008, 2012), GIC (Fanand Tang, 2013), among others.Next, we would like to introduce a set of conditions, which can make Assumption 1more explicit. 18 ssumption 5. Suppose the following conditions to hold, where M and m are fixed positivereal number:(i) λ max (Σ) ≤ M < ∞ , λ min (Σ) ≥ m > ;(ii) For arbitrary j ∈ S ∗ , there exists a subset S j with cardinality p j , where p j satisfies sup j ∈ S ∗ p j D = o ( p ) , such that inf j ∈ S ∗ inf S N : | S N |≤ DS N ⊆ S Full \ ( S j ∪{ j } ) (cid:104) Var( y )Corr ( y, x ⊥ S N j ) − g n ( | S N ∪ { j }| ) + g n ( | S N | ) (cid:105) (cid:29) max (cid:110) p ∗ m − , ( DM ) (cid:111) M (cid:114) D log pn , where x ⊥ S N j = x j − x S N Σ − S N ,S N Σ S N ,j .(iii) max (cid:110) p ∗ m − , ( DM ) (cid:111) M (cid:113) D log pn = o (1) , D ( p ∗ ∨ D ) (cid:28) p . Remark 5.
Condition (i) is imposed to make sure the correlation structure among predictors is well-behaved, which is similar to Assumption (C2) in Wang (2009) and Condition 4 in Fan and Lv (2008). Condition (ii) is a corresponding version of Assumption 2 to guarantee that adding any signal j into a subset not covering j will make the criterion function smaller. A related condition is Condition 3 in Fan and Lv (2008), although here the correlation is with respect to a projected predictor. Conditions involving projected predictors are very common in the stepwise regression literature (e.g., Condition (c) in Dong et al. (2020)). Under Assumption 5, we can easily verify that Assumptions 1 and 2 hold with some specific {ε_n}_{n=1}^∞, leading to Theorem 1 and Remark 2. More detailed analyses are presented in Appendix A.2.

4 Numerical Studies
In this section, we will investigate the performance of RaSE screening methods via extensivesimulations and real data experiments. Each setting is replicated 200 times. In simulations,we evaluate different screening approaches by calculating the 5%, 25%, 50%, 75%, and95% quantiles of the minimum model size (MMS) to include all signals. The smaller thequantile is, the better the screening approach is. For real data, since S ∗ is unknown, wecompare different methods by investigating the performance of the corresponding post-screening procedure. That is, after screening, we keep the same number of variables foreach screening method, then the same model is fitted based on those selected variables andtheir prediction performance on an independent test data is reported.We compare RaSE screening methods with SIS (Fan and Lv, 2008), ISIS (Fan and Lv,2008; Fan et al., 2009), HOLP (Wang and Leng, 2016), SIRS (Zhu et al., 2011), DC-SIS(Li et al., 2012), MV-SIS (Cui et al., 2015), and IPDC (Kong et al., 2017).All the experiments are conducted in R. We implement RaSE screening methods in RaSEn package. R package
SIS (Saldana and Feng, 2018) is used to implement SISand ISIS, while R package screening ( https://github.com/wwrechard/screening ) isused to implement HOLP. We conduct SIRS, DC-SIS and MV-SIS through R package VariableSelection . IPDC is implemented by calling the function dcor in R package energy .We combine RaSE framework with various criteria to choose subspaces, including min-imizing BIC (RaSE-BIC) and eBIC (RaSE-eBIC) in linear model or logistic regressionmodel, minimizing the leave-one-out MSE/error in k -nearest neighbor ( k NN) (RaSE- k NN),and minimizing the 5-fold cross-validation MSE/error in support vector machine (SVM)with RBF kernel (RaSE-SVM). We add a subscript 1 to RaSE to denote the one-step20terative RaSE (e.g. RaSE -BIC).For all RaSE methods, we fix B = 200 and B = [ p . ], motivated by Remark 2.In addition, we fix D = [ n . ], motivated by Assumption 5.(iii). For Example 1, we alsoinvestigate the impact of B and B on the median MMS. For RaSE- k NN and RaSE - k NN, k is set to be 5. For RaSE-eBIC and RaSE -eBIC, we set the penalty parameter γ = 0 . Example 1 (Example II in Fan and Lv (2008)) . We generate data from the followingmodel: y = 5 x + 5 x + 5 x − √ x + (cid:15), where x = ( x , . . . , x p ) T ∼ N ( , Σ) , Σ = ( σ ij ) p × p , σ ij = 0 . ( i (cid:54) = j ) , (cid:15) ∼ N (0 , , and (cid:15) ⊥⊥ x .The signal set S ∗ = { , , , } . n = 100 and p = 1000 . In this example, there is no correlation between y and x , further leading to the indepen-dence due to normality, therefore methods based on the marginal effect will fail to capture x . However, after projecting y on the space which is perpendicular with any signals from x , x and x , the correlation appears between the projected y and x , which motivatesthe ISIS. Besides, the proposed RaSE methods are also expected to succeed since it workswith feature subsets instead of a single variable.21ethod/MMS Example 1 Example 25% 25% 50% 75% 95% 5% 25% 50% 75% 95%SIS 227 317 397 647 922 8 40 224 705 1838ISIS 14 15 15 15 16 28 349 876 1593 1922SIRS 87 370 594 762 949 6 1164 1562 1836 1972DC-SIS 96 358 610 776 942 47 1116 1536 1836 1966HOLP 912 949 969 986 999 58 268 747 1388 1922IPDC 224 442 700 869 980 59 181 402 698 1383RaSE-BIC 6 11 28 134 662 7 1044 1596 1854 1964RaSE-eBIC 6 19 48 541 873 8 902 1530 1823 1977RaSE- k NN 18 78 206 267 854 5 5 6 74 1696RaSE-SVM 8 51 128 304 870 5 5 5 6 772RaSE -BIC 4 4 4 20 57 13 1017 1477 1749 1950RaSE -eBIC 4 4 4 4 17 980 1483 1724 1888 1972RaSE - k NN 6 72 291 666 909 5 5 5 8 1540RaSE -SVM 4 4 66 104 503 5 5 5 5 41Table 1: Quantiles of MMS in Examples 1 and 2.We present the results in the left panel of Table 1. From the results, it can be seen thatall the marginal screening methods indeed fail to capture 4 signals in a model with a smallsize. ISIS performs much better because it can correct its mistake through one iteration.For RaSE screening methods with no iteration, as analyzed in Remark 2, here ¯ d = 2 since x has no marginal contribution to y , leading to a theoretical requirement B (cid:38) D − p κ with any κ >
0, where D − p ≈ . Despite of the current small B setting, RaSE-BIC and RaSE-eBIC still perform better than SIS and other marginal screening methods.After one iteration, RaSE -BIC and RaSE -eBIC improve a lot compared with their vanillacounterparts, with RaSE -eBIC achieving the best performance.To further study the impact of ( B , B ), we run this example for 100 times under22ifferent ( B , B ) settings, and summarize the median of MMS in Figure 1. It shows thatin general, larger ( B , B ) leads to a better performance. The performance is stable interms of B when B is large. On the other hand, the performance improves continuouslyas B grows. In particular, when B ≈ , RaSE-BIC can capture S ∗ very well, whichagrees with Remark 2. This also shows that we can further improve the performance ofRaSE screening if we have more computational resources.Figure 1: Median MMS to capture S ∗ as ( B , B ) varies for RaSE-BIC in Example 1. Example 2 (Latent clusters) . We generate data from the following linear model: y = 0 . x + ˜ x + ˜ x + ˜ x + ˜ x + (cid:15) ) , where ˜ x = (˜ x , . . . , ˜ x p ) T ∼ N ( , Σ) , (cid:15) ∼ t , Σ = ( σ ij ) p × p = (0 . | i − j | ) p × p , and (cid:15) ⊥⊥ x . enerate z ∼ Unif( {− , } ) ⊥⊥ ˜ x and x = ˜ x + z p . The signal set S ∗ = { , , , , } . n = 200 and p = 2000 . −4048 −3 0 3 6 x y z −33 −4048 −2 −1 0 1 2 x y z −33 Figure 2: Scatterplots of y vs. x and y vs. x for Example 2 ( n = 200).Figure 2 shows the scatterplots of y vs. x (left panel) and y vs. x (right panel).We expect the methods based on Pearson correlation to deteriorate due to the partialcancellation of signals by the averaging of two clusters. For such kind of data, k NN couldbe a favorable approach. The performance of various methods are presented in the rightpanel of Table 1. RaSE- k NN and RaSE-SVM perform quite well with their performancesfurther improved by the one-step iteration.
Example 3 (Example 1.c in Li et al. (2012)) . We generate data from the following model: y = 2 β x x + 3 β ( x < x + (cid:15), where β j = ( − U (4 log n/ √ n + | Z | ) , j = 1 , , U ∼ Bernoulli(0 . , Z ∼ N (0 , , (cid:15) ∼ N (0 , , x ∼ N ( , Σ) where Σ = ( σ ij ) p × p = (0 . | i − j | ) p × p , U ⊥⊥ Z , (cid:15) ⊥⊥ x , and ( U, Z ) ⊥⊥ (cid:15), x ) . Note that we regenerate ( U, Z ) for each replication, so the results might differ fromthose in Li et al. (2012). The signal set S ∗ = { , , , } . n = 200 and p = 2000 . Method/MMS Example 3 Example 45% 25% 50% 75% 95% 5% 25% 50% 75% 95%SIS 267 849 1322 1669 1936 264 570 709 885 984ISIS 225 855 1384 1723 1954 293 626 810 911 978SIRS 28 853 1342 1635 1931 487 737 867 935 992DC-SIS 26 688 1246 1663 1925 44 304 603 814 949HOLP 298 1022 1406 1726 1949 316 586 767 886 974IPDC 68 472 876 1406 1895 7 19 68 158 528RaSE-BIC 653 1248 1538 1826 1964 433 717 842 926 985RaSE-eBIC 630 1160 1555 1785 1965 326 588 796 913 985RaSE- k NN 5 15 296 1354 1808 5 14 54 239 534RaSE-SVM 4 8 138 1313 1842 4 10 114 362 863RaSE -BIC 557 1125 1487 1776 1934 448 679 816 922 989RaSE -eBIC 755 1240 1554 1802 1963 475 722 841 920 982RaSE - k NN 4 4 7 395 1652 4 8 52 412 866RaSE -SVM 4 4 4 8 908 4 19 170 602 910Table 2: Quantiles of MMS in Examples 3 and 4.The left panel of Table 2 exhibits the results of different screening methods. Due tothe interaction term and indicator function, approaches based on linear models like SIS,ISIS, HOLP, and RaSE with BIC and eBIC do not perform very well. RaSE- k NN andRaSE-SVM do a good job for 5% and 25% quantiles but fail for others. The iteration stepimproves their performances significantly with RaSE - k NN and RaSE -SVM performingmuch better than all the other methods. Example 4 (Interactions) . We generate data from the following model: y = 3 (cid:112) | x | + 2 (cid:112) | x | x + 4 sin( x ) sin( x ) sin ( x ) + 12 sin( x ) | x | sin( x ) x + 0 . (cid:15), here x , . . . , x p i.i.d. ∼ N (0 , , (cid:15) ∼ N (0 , , and (cid:15) ⊥⊥ x . The signal set S ∗ = { , , , } . n = 400 and p = 1000 . This example evaluates the capability of different screening methods in terms of selectinghigh-order interactions. The results are summarized in the right panel of Table 2. It canbe observed that RaSE- k NN, RaSE - k NN, RaSE-SVM, RaSE -SVM, and IPDC achievean acceptable performance, particularly for the lower quantiles. IPDC performs better on75% and 95% quantiles than all RaSE methods but worse on the other three quantiles thanRaSE- k NN, RaSE - k NN and RaSE-SVM. The remaining methods do not perform well onany of the 5 quantiles. It shows that RaSE framework equipped with appropriate criterionis promising to capture high-order interactions.
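For reference, the minimum model size (MMS) whose quantiles are reported in these tables is the smallest number of top-ranked variables a screening method must keep in order to cover all of S*. A minimal R sketch of this computation follows; the function name mms and the toy scores are ours, for illustration only.

    # Minimum model size (MMS): the smallest k such that the k top-ranked
    # variables contain every signal in S*.
    mms <- function(importance, signal_set) {
      ranking <- order(importance, decreasing = TRUE)  # variable indices, best first
      max(match(signal_set, ranking))                  # position of the worst-ranked signal
    }

    # Toy illustration (scores and signal set are made up):
    importance <- c(0.9, 0.2, 0.8, 0.1, 0.6, 0.05, 0.5, 0.3)
    mms(importance, signal_set = c(1, 2, 5))           # returns 6: variable 2 is ranked 6th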
Example 5 (Gaussian mixture, Example 1 in Cannings and Samworth (2017)) . We gen-erate data from the following model: y ∼ Bernoulli(0 . , x | y = r ∼ N ( µ r , Σ) + 12 N ( − µ r , Σ) , r = 0 , , where µ = (2 , − , , . . . , T , µ = (2 , , , . . . , T . The signal set S ∗ = { , } . n = 200 and p = 2000 . From the scatterplots in Figure 3, the marginal screening methods are expected to failbecause all signals are marginally independent with y . The only way to capture the signalsis to measure the joint contribution of ( x , x ). We summarize the results in the left panelof Table 3.The table shows that the marginal methods fail as we expected. RaSE with BIC andeBIC fail as well because the data points from the two classes are not linearly separable(Figure 3). SIRS, RaSE - k NN and RaSE -SVM achieve the best performance with veryaccurate feature ranking. 26 x x y
01 −3−2−1012 −2 −1 0 1 2 x x y Figure 3: Scatterplots of x vs. x and x vs. x for Example 5 ( n = 200). Example 6 (Multinomial logistic regression, Case 2 in Section 4.5 of Fan et al. (2009)) . We first generate ˜ x , . . . , ˜ x i.i.d. ∼ Unif([ −√ , √ and ˜ x , . . . , ˜ x p i.i.d. ∼ N (0 , , then let x =˜ x − √ x , x = ˜ x + √ x , x = ˜ x − √ x , x = ˜ x + √ x and x j = ˜ x j for j = 5 , . . . , p .The response is generated from P( y = r | ˜ x ) ∝ exp { f r ( ˜ x ) } , r = 1 , . . . , , where f ( ˜ x ) = − a ˜ x + a ˜ x , f ( ˜ x ) = a ˜ x − a ˜ x , f ( ˜ x ) = a ˜ x − a ˜ x and f ( ˜ x ) = a ˜ x − a ˜ x with a = 5 / √ . The signal set S ∗ = { , , , , } . n = 200 and p = 2000 . In this example, x is marginally independent of y , therefore the marginal methods areexpected to fail to capture x . Results are summarized in the right panel of Table 3.We observe that ISIS, RaSE -BIC, and, RaSE -eBIC lead to a great performance. With-out iteration, RaSE-BIC still performs competitively compared with other non-iterative ap-proaches. From the table, it is clear that the iteration usually improves the performancesof RaSE screening methods. 27ethod/MMS Example 5 Example 65% 25% 50% 75% 95% 5% 25% 50% 75% 95%SIS 515 1090 1414 1746 1947 148 496 1108 1546 1914ISIS 420 1012 1466 1786 1968 7 7 7 8 9SIRS 2 2 2 2 2 827 1361 1626 1829 1976DC-SIS 451 960 1385 1706 1913 715 1252 1552 1800 1963MV-SIS 379 957 1366 1692 1895 248 808 1308 1638 1947HOLP 495 1065 1381 1712 1936 — — — — —IPDC 495 1010 1344 1673 1908 936 1471 1714 1891 1968RaSE-BIC 474 1095 1482 1798 1945 6 8 11 16 392RaSE-eBIC 417 1076 1429 1706 1963 18 182 786 1389 1895RaSE- k NN 2 3 4 5 6 45 187 280 1387 1915RaSE-SVM 2 3 4 5 9 10 31 98 306 1666RaSE -BIC 404 942 1308 1667 1929 5 5 5 5 7RaSE -eBIC 503 960 1388 1744 1924 5 5 5 6 14RaSE - k NN 2 2 2 2 2 30 128 908 1452 1904RaSE -SVM 2 2 2 2 2 5 5 94 762 1855Table 3: Quantiles of MMS in Examples 5 and 6. In this section, we investigate the performance of RaSE screening methods on two real datasets. Each data set is randomly divided into training data and test data. As suggested byFan and Lv (2008), we select variables via different screening methods on training data,then the LASSO, k NN and SVM are fitted based on the selected variables on trainingdata, and finally we evaluate different screening methods based on their correspondingpost-screening performance on test data. As benchmarks, we also fit LASSO, k NN andSVM models on the training data without screening. Following Fan and Lv (2008), we28hoose the top [ n/ log n ] variables for any screening method as the selected variables. Werandomly divide the whole data set into 90% training data and 10% test data in each of200 replications, and apply various screening methods on training data. Each time bothtraining and test data are standardized by using the center and scale of training data.
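The post-screening comparison just described can be sketched in R as follows for one replication of the regression case. The split ratio and helper name post_screen_error mirror the description above, but an OLS fit replaces the LASSO/kNN/SVM fits used in the paper to keep the sketch dependency-free; it is an illustration, not the code behind Table 4.

    # Sketch of the post-screening evaluation loop (one replication, regression case).
    post_screen_error <- function(x, y, screen_fun, test_frac = 0.1) {
      n <- nrow(x)
      test_id <- sample.int(n, floor(test_frac * n))
      x_tr <- x[-test_id, , drop = FALSE]; y_tr <- y[-test_id]
      x_te <- x[test_id, , drop = FALSE];  y_te <- y[test_id]

      # standardize both sets with the center/scale of the training data
      ctr <- colMeans(x_tr); scl <- apply(x_tr, 2, sd)
      x_tr <- scale(x_tr, ctr, scl); x_te <- scale(x_te, ctr, scl)

      keep <- floor(nrow(x_tr) / log(nrow(x_tr)))       # top [n/log n] variables
      imp <- screen_fun(x_tr, y_tr)                     # any importance/ranking vector
      sel <- order(imp, decreasing = TRUE)[seq_len(keep)]

      fit <- lm(y_tr ~ x_tr[, sel, drop = FALSE])       # refit on the selected variables
      pred <- cbind(1, x_te[, sel, drop = FALSE]) %*% coef(fit)
      mean((y_te - pred)^2)                             # test MSE
    }

    # Usage, e.g. with the rase_screen sketch from the introduction (hypothetical x, y):
    # post_screen_error(x, y, screen_fun = rase_screen)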
This data set was collected by Alon et al. (1999) and consists of 2000 genes measured on62 patients, of which 40 are diagnosed with colon cancer (class 1) and 22 are healthy (class0). The information of each gene is represented as a continuous variable. The predictionresults are summarized in the left panel of Table 4.The table shows that SIS, ISIS, RaSE-BIC, RaSE -BIC, RaSE-eBIC, and RaSE -eBICimprove the performance of vanilla LASSO. In addition, RaSE-BIC with LASSO achievesthe best performance among all post-screening procedure based on LASSO. In addition,RaSE- k NN with k NN and RaSE - k NN with k NN lead to better results than k NN. RaSE-SVM and RaSE -SVM also improve the performance of SVM, demonstrating the effective-ness of RaSE to improve various vanilla methods. This data set was used by Scheetz et al. (2006); Wang and Leng (2016); Zhong and Zhu(2015). It contains the gene expressions values corresponding to 18976 probes from the eyesof 120 twelve-week-old male F2 rats. Among the 18976 probes, TRIM32 is the response, For RaSE methods, sometimes there might be less than [ n/ log n ] variables which have positive selectedproportion. In this case, we randomly choose from the remaining variables with 0 selected proportion tohave the desired number of selected variables. ISIS 0.1742(0.1430)
SIRS 0.2800(0.1734) 0.0145(0.0139)DC-SIS 0.3000(0.1998) 0.0131(0.0135)MV-SIS 0.2958(0.1826) —HOLP 0.1825(0.1491) 0.0228(0.0269)IPDC 0.1917(0.1464) 0.0129(0.0132)RaSE-BIC
RaSE-eBIC 0.1383(0.1295)
RaSE -BIC -eBIC 0.1492(0.1392) 0.0133(0.0115)— k NN 0.2258(0.1653) 0.0177(0.0217)RaSE- k NN 0.1517(0.1390) 0.0143(0.0172)RaSE - k NN 0.1517(0.1390) 0.0144(0.0172)— SVM 0.2025(0.1503) 0.0172(0.0256)RaSE-SVM -SVM 0.1808(0.1435) 0.0172(0.0245)Table 4: Average test error rate with standard deviations (in parentheses) for colon can-cer data set and average test MSE with standard deviations (in parentheses) for rat eyeexpression data set. We highlight the methods with the best performance in bold and thetop 3 performance in italic.which is responsible to cause Bardet-Biedl syndrome. We follow Wang and Leng (2016) tofocus on the top 5000 genes with the highest sample variance. Therefore, the final samplesize is 120 and there are 5000 predictors. The right panel of Table 4 shows the test MSEcoupled with the standard deviation for each post-screening procedure.The results show that SIS, ISIS, RaSE-BIC with LASSO and RaSE-eBIC with LASSO30chieve comparable performance, which are better than that of the vanilla LASSO. RaSE- k NN with k NN and RaSE - k NN with k NN enhance the vanilla k NN method as well.However, RaSE-SVM with SVM and RaSE -SVM with SVM do not outperform the vanillaSVM for this data. Note that MV-SIS is not available for Eye data set. It is possible todiscretize all the variables to make MV-SIS work. See Section 4.2 in Cui et al. (2015) fordetails. In this article, we propose a very general screening framework named RaSE screening, basedon the random subspace ensemble method. We can equip it with any criterion function forcomparing subspaces. By comparing subspaces instead of single predictors, RaSE screeningcan capture signals without marginal effect on response. Besides, an iterative version ofRaSE screening framework is introduced to enhance the performance of vanilla RaSE andrelax the requirement of B . In the theoretical analysis, we establish sure screening propertyfor both vanilla and iterative RaSE frameworks under some general conditions. The rankconsistency is also proved for the vanilla RaSE. We also work out the explicit conditionsand conclusions for the linear regression model. We investigate the relationship betweenthe signal strength and the requirement of B , which shows that in some sense the weakerthe signal is, a larger B is necessary for RaSE to succeed. In the numerical studies,the effectiveness of RaSE and its iterative version is verified through multiple simulationexamples and real data analyses.There are many interesting problems worth further studying. The first question is thatwhether there is an adaptive way to automatically select the number T of iterations? A31ossible solution is cross-validation and to stop the iteration process when the performanceof RaSE on validation data stops improving. Another interesting topic is to use different B values of in different iteration steps, which might further speed up the computation.Finally, the subspace distribution for each iteration step could be further generalized, e.g.,we can choose D from the empirical distribution of the sizes of the selected B subspaces. References
Akaike, H. (1973). Information theory and an extension of the maximum likelihood prin-ciple. Proceeding of IEEE international symposium on information theory.Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., and Levine,A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumorand normal colon tissues probed by oligonucleotide arrays.
Proceedings of the NationalAcademy of Sciences , 96(12):6745–6750.Bickel, P. J. and Levina, E. (2008). Covariance regularization by thresholding.
The Annalsof Statistics , 36(6):2577–2604.Cannings, T. I. and Samworth, R. J. (2017). Random-projection ensemble classification.
Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 79(4):959–1035.Chen, J. and Chen, Z. (2008). Extended bayesian information criteria for model selectionwith large model spaces.
Biometrika , 95(3):759–771.Chen, J. and Chen, Z. (2012). Extended bic for small-n-large-p sparse glm.
StatisticaSinica , pages 555–574.Cheng, M.-Y., Honda, T., Li, J., and Peng, H. (2014). Nonparametric independence screen-ing and structure identification for ultra-high dimensional longitudinal data.
The Annalsof Statistics , 42(5):1819–1849. 32ui, H., Li, R., and Zhong, W. (2015). Model-free feature screening for ultrahighdimensional discriminant analysis.
Journal of the American Statistical Association ,110(510):630–641.Dong, Y., Yu, Z., and Zhu, L. (2020). Model-free variable selection for conditional meanin regression.
Computational Statistics & Data Analysis , 152:107042.Fan, J., Feng, Y., and Song, R. (2011). Nonparametric independence screening in sparseultra-high-dimensional additive models.
Journal of the American Statistical Association ,106(494):544–557.Fan, J., Feng, Y., and Wu, Y. (2010). High-dimensional variable selection for cox’s propor-tional hazards model. In
Borrowing strength: Theory powering applications–a Festschriftfor Lawrence D. Brown , pages 70–86. Institute of Mathematical Statistics.Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and itsoracle properties.
Journal of the American statistical Association , 96(456):1348–1360.Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional fea-ture space.
Journal of the Royal Statistical Society: Series B (Statistical Methodology) ,70(5):849–911.Fan, J., Samworth, R., and Wu, Y. (2009). Ultrahigh dimensional feature selection: beyondthe linear model.
The Journal of Machine Learning Research , 10:2013–2038.Fan, Y., Kong, Y., Li, D., and Lv, J. (2016). Interaction pursuit with feature screeningand selection. arXiv preprint arXiv:1605.08933 .Fan, Y. and Tang, C. Y. (2013). Tuning parameter selection in high dimensional penalizedlikelihood.
Journal of the Royal Statistical Society: Series B (Statistical Methodology) ,75(3):531–552.Han, X. (2019). Nonparametric screening under conditional strictly convex loss for ultrahighdimensional sparse data.
Annals of Statistics , 47(4):1995–2022.Hao, N. and Zhang, H. H. (2014). Interaction screening for ultrahigh-dimensional data.
Journal of the American Statistical Association , 109(507):1285–1301.33o, T. K. (1998). The random subspace method for constructing decision forests.
IEEETransactions on Pattern Analysis and Machine Intelligence , 20(8):832–844.Jiang, B. and Liu, J. S. (2014). Variable selection for general index models via sliced inverseregression.
The Annals of Statistics , 42(5):1751–1786.Kong, Y., Li, D., Fan, Y., and Lv, J. (2017). Interaction pursuit in high-dimensional multi-response regression via distance correlation.
The Annals of Statistics , 45(2):897–922.Li, R., Zhong, W., and Zhu, L. (2012). Feature screening via distance correlation learning.
Journal of the American Statistical Association , 107(499):1129–1139.Li, Y. and Liu, J. S. (2019). Robust variable and interaction selection for logistic regressionand general index models.
Journal of the American Statistical Association , 114(525):271–286.Mai, Q. and Zou, H. (2013). The kolmogorov filter for variable screening in high-dimensionalbinary classification.
Biometrika , 100(1):229–234.Mai, Q. and Zou, H. (2015). The fused kolmogorov filter: A nonparametric model-freescreening method.
The Annals of Statistics , 43(4):1471–1497.Pan, W., Wang, X., Xiao, W., and Zhu, H. (2018). A generic sure independence screeningprocedure.
Journal of the American Statistical Association .Saldana, D. F. and Feng, Y. (2018). Sis: an r package for sure independence screening inultrahigh dimensional statistical models.
Journal of Statistical Software , 83(2):1–25.Scheetz, T. E., Kim, K.-Y. A., Swiderski, R. E., Philp, A. R., Braun, T. A., Knudtson,K. L., Dorrance, A. M., DiBona, G. F., Huang, J., and Casavant, T. L. (2006). Regulationof gene expression in the mammalian eye and its relevance to eye disease.
Proceedings ofthe National Academy of Sciences , 103(39):14429–14434.Schwarz, G. (1978). Estimating the dimension of a model.
The annals of statistics , 6(2):461–464.Shalev-Shwartz, S. and Ben-David, S. (2014).
Understanding machine learning: Fromtheory to algorithms . Cambridge university press.34hao, X. and Zhang, J. (2014). Martingale difference correlation and its use inhigh-dimensional variable screening.
Journal of the American Statistical Association ,109(507):1302–1318.Tian, Y. and Feng, Y. (2020). Rase: Random subspace ensemble classification. arXivpreprint arXiv:2006.08855 .Tibshirani, R. (1996). Regression shrinkage and selection via the lasso.
Journal of theRoyal Statistical Society: Series B (Methodological) , 58(1):267–288.Vershynin, R. (2018).
High-dimensional probability: An introduction with applications indata science , volume 47. Cambridge university press.Wang, H. (2009). Forward regression for ultra-high dimensional variable screening.
Journalof the American Statistical Association , 104(488):1512–1524.Wang, X. and Leng, C. (2016). High dimensional ordinary least squares projection forscreening variables.
Journal of the Royal Statistical Society: Series B (Statistical Method-ology) , 3(78):589–611.Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty.
The Annals of statistics , 38(2):894–942.Zhao, S. D. and Li, Y. (2012). Principled sure independence screening for cox models withultra-high-dimensional covariates.
Journal of multivariate analysis , 105(1):397–411.Zhong, W. and Zhu, L. (2015). An iterative approach to distance correlation-based sureindependence screening.
Journal of Statistical Computation and Simulation , 85(11):2331–2345.Zhou, T., Zhu, L., Xu, C., and Li, R. (2020). Model-free forward screening via cumulativedivergence.
Journal of the American Statistical Association , 115(531):1393–1405.Zhu, L.-P., Li, L., Li, R., and Zhu, L.-X. (2011). Model-free feature screening for ultrahigh-dimensional data.
Journal of the American Statistical Association , 106(496):1464–1475.Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320.

Supplementary Materials of "RaSE: A Variable Screening Framework via Random Subspace Ensembles"
Appendix A Additional Details of This Paper
A.1 Vanilla RaSE algorithm in Tian and Feng (2020)
See Algorithm 3.
Algorithm 3:
Random subspace ensemble classification (RaSE)
Input: training data { ( x i , y i ) } ni =1 , new data x , subspace distribution D , criterion C , integers B and B , type of base classifier T Output: predicted label C RaSEn ( x ), the selected proportion of each feature ˆ η Independently generate random subspaces S b b ∼ D , ≤ b ≤ B , ≤ b ≤ B for b ← to B do Select the optimal subspace S b ∗ from { S b b } B b =1 according to C and T Train C S b ∗ −T n in subspace S b ∗ end Construct the ensemble decision function ν n ( x ) = B (cid:80) B b =1 C S b ∗ −T n ( x ) Set the threshold ˆ α according to minimize training error Output the predicted label C RaSEn ( x ) = ( ν n ( x ) > ˆ α ), the selected proportion ofeach feature ˆ η = (ˆ η , . . . , ˆ η p ) T where ˆ η l = B − (cid:80) B b =1 ( l ∈ S b ∗ ) , l = 1 , . . . , p .2 Additional analysis on linear models For the ease of analysis, we define the population version of Cr n ( S ) asCr( S ) = − n L ( S ) + g n ( | S | ) , where L ( S ) = − n π − n (cid:93) MSE( S ) − n . (A.8)and (cid:93) MSE( S ) = E y − Cov( y, x S )Σ − S,S
Cov( x S , y ) . (A.9)By imposing the previous assumptions, we can translate Assumption 1 into an explicitcondition. Before doing so, let’s see the following useful proposition. Proposition 2.
Under Assumption 5, we have(i) n sup S : | S |≤ D | ˆ L ( S ) − L ( S ) | (cid:46) p max (cid:110) p ∗ m − , ( DM ) (cid:111) M (cid:113) D log pn .(ii) There exists (cid:15) n (cid:29) max (cid:110) p ∗ m − , ( DM ) (cid:111) M (cid:113) D log pn satisfying (3) such that lim sup n,D →∞ δ n ≤ . To make Assumption 1 hold, from Remark 2, it is sufficient to require B (cid:38) p κ +1 D , where κ >
Assumption 6.
Suppose the following conditions hold:
(i) \|\Sigma\|_{\max} \le M < \infty and \lambda_{\min}(\Sigma_{S^*,S^*}) \ge m > 0;
(ii) for any j \in S^*, there exists a subset S_j \subseteq S_{\rm Full}\backslash\{j\} with cardinality p_j, where \sup_{j\in S^*} p_j D = o(p), such that
\inf_{j\in S^*}\;\inf_{S_N:|S_N|\le D,\,S_N\subseteq S_{\rm Full}\backslash(S_j\cup\{j\})}\big[\mathrm{Var}(y)\,\mathrm{Corr}^2(Y, X_j^{\perp S_N}) + g_n(|S_N|) - g_n(|S_N|+D)\big] - \sup_{1\le d\le D}[g_n(d) - g_n(d-1)] \gg \max\{p_* m^{-1}, DM\}\,M\sqrt{D\log p/n};
(iii) for any j \notin S^*, there exists a subset \tilde{S}_j \subseteq S_{\rm Full}\backslash S^* such that
\sup_{j\notin S^*}\;\sup_{S_N:|S_N|\le D,\,S_N\subseteq S_{\rm Full}\backslash(\tilde{S}_j\cup\{j\})}\big[\mathrm{Var}(y)\,\mathrm{Corr}^2(Y, X_j^{\perp S_N}) - g_n(|S_N\cup\{j\}|) + g_n(|S_N|)\big] \ll -\max\{p_* m^{-1}, DM\}\,M\sqrt{D\log p/n},
where \tilde{p}D = o(p), \tilde{p} = |\tilde{S}|, and \tilde{S} := \bigcup_{j\notin S^*}\tilde{S}_j;
(iv) \max\{p_*\sqrt{DM/m}, D\}\sqrt{\log p/n} = o(1), B_2 \gtrsim p^{\kappa+1}D, and D(p_* \vee D) \ll p, where \kappa > 0 can be an arbitrary positive number.
Condition (i) is the same as Assumption 5.(i). Condition (iii) is parallel to Assumption 5.(ii) but is imposed on the noise variables. Therefore, we can prove lower bounds for \tilde{\delta}_n in a similar fashion.

Proposition 3.
The following conclusions hold:
(i) under Assumptions 6.(i), (ii), and (iv), we have \limsup_{n,D\to\infty}\delta_n = 0;
(ii) under Assumptions 6.(i)-(iv), we have \liminf_{n,D\to\infty}\tilde{\delta}_n \ge 1.

Corollary 1.
There exist n, D, and B_2 satisfying the conditions described in Assumption 6.

Appendix B Proofs

In this section, we use C to denote a positive constant that does not depend on n, D, B_1, or B_2; it may take different values in different places.
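Before turning to the individual proofs, a short simulation may help fix notation for the random subspace distribution used throughout. In our reading of it, a subspace first draws its size uniformly from {1, ..., D} and then its members uniformly without replacement, so that p_0 = P(j \in S_{11}) = (D+1)/(2p). Both the sampling scheme and this closed form are stated here as assumptions consistent with the quantities appearing below, not quoted from the main text.

import numpy as np

def draw_subspace(p, D, rng):
    """Hierarchical uniform subspace: size uniform on {1,...,D}, members uniform."""
    d = rng.integers(1, D + 1)
    return set(rng.choice(p, size=d, replace=False).tolist())

p, D, reps = 100, 10, 100_000
rng = np.random.default_rng(0)
hits = sum(0 in draw_subspace(p, D, rng) for _ in range(reps))
print(hits / reps, (D + 1) / (2 * p))   # empirical frequency vs. assumed closed-form p_0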
B.1 Proof of Lemma 1

For arbitrary subsets A^{(j)}_1, \ldots, A^{(j)}_k, A^{(-j)}_1, \ldots, A^{(-j)}_{B_2-k}, we have

P(S^{(j)}_{11} = A^{(j)}_1, \ldots, S^{(j)}_{1k} = A^{(j)}_k, S^{(-j)}_{11} = A^{(-j)}_1, \ldots, S^{(-j)}_{1(B_2-k)} = A^{(-j)}_{B_2-k} \mid N_j = k)
  = \frac{P(S^{(j)}_{11} = A^{(j)}_1, \ldots, S^{(-j)}_{1(B_2-k)} = A^{(-j)}_{B_2-k})}{P(N_j = k)}\cdot\prod_{i=1}^{k}1(j\in A^{(j)}_i)\cdot\prod_{i=1}^{B_2-k}1(j\notin A^{(-j)}_i)
  = \frac{\binom{B_2}{k}\prod_{i=1}^{k}\frac{1}{D}\binom{p}{|A^{(j)}_i|}^{-1}\prod_{i=1}^{B_2-k}\frac{1}{D}\binom{p}{|A^{(-j)}_i|}^{-1}}{\binom{B_2}{k}p_0^{k}(1-p_0)^{B_2-k}}\cdot\prod_{i=1}^{k}1(j\in A^{(j)}_i)\cdot\prod_{i=1}^{B_2-k}1(j\notin A^{(-j)}_i)
  = \prod_{i=1}^{k}\frac{\binom{p}{|A^{(j)}_i|}^{-1}}{Dp_0}\,1(j\in A^{(j)}_i)\cdot\prod_{i=1}^{B_2-k}\frac{\binom{p}{|A^{(-j)}_i|}^{-1}}{D(1-p_0)}\,1(j\notin A^{(-j)}_i)
  = \prod_{i=1}^{k}\frac{2p}{D(D+1)}\binom{p}{|A^{(j)}_i|}^{-1}1(j\in A^{(j)}_i)\cdot\prod_{i=1}^{B_2-k}\frac{2p}{D(2p-D-1)}\binom{p}{|A^{(-j)}_i|}^{-1}1(j\notin A^{(-j)}_i),

where the last expression is the joint density under conditions (i)-(iii). The second equality in (i) and (ii) can be derived by basic algebra. This completes the proof.

B.2 Proof of Theorem 1

Lemma 3 (Chernoff's bound). For N \sim \mathrm{Bin}(B_2, p_0), it holds that
P(N > (1+\lambda)B_2p_0) \le \exp\{-\lambda^2 B_2p_0/(2+\lambda)\}, \lambda > 0,
P(N < (1-\lambda)B_2p_0) \le \exp\{-\lambda^2 B_2p_0/2\}, \lambda \in (0, 1).

Proof.
See Shalev-Shwartz and Ben-David (2014).
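As a quick numerical sanity check of the lower-tail bound in Lemma 3 (the case \lambda = 1/2 is the one invoked in the proof of Theorem 1 below), one can compare a Monte Carlo estimate of the binomial tail with the exponential bound; the parameter values are arbitrary.

import numpy as np

# Lower-tail bound: P(N < (1 - lam) * B2 * p0) <= exp(-lam^2 * B2 * p0 / 2)
B2, p0, lam = 500, 0.05, 0.5                       # arbitrary illustrative values
rng = np.random.default_rng(0)
draws = rng.binomial(B2, p0, size=200_000)
tail = np.mean(draws < (1 - lam) * B2 * p0)
bound = np.exp(-lam**2 * B2 * p0 / 2)
print(f"Monte Carlo tail {tail:.4f} <= Chernoff bound {bound:.4f}")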
Lemma 4 (Hoeffding's inequality). Let N_j \sim \mathrm{Bin}(B_1, p). Then for arbitrary x \in (0, p), it holds that
\max\{P(N_j < B_1(p - x)),\, P(N_j > B_1(p + x))\} \le \exp\{-2B_1x^2\}.

Next we prove Theorem 1 by applying the lemmas above. Denote the training data as \mathcal{D} = \{(x_i, y_i)\}_{i=1}^n and the event \{\sup_{S:|S|\le D}|Cr_n(S) - Cr(S)| \le \epsilon_n\} as A. Let j be an arbitrary feature in S^*. Because of Lemma 1, it holds that
P(j \in S_{1*} \mid N_j = k, A)
  = P(\inf_{1\le b\le k}Cr_n(S^{(j)}_{1b}) < \inf_{1\le b\le B_2-k}Cr_n(S^{(-j)}_{1b}) \mid N_j = k, A)
  = \big[P(\inf_{1\le b\le k}Cr_n(S^{(j)}_{1b}) < Cr_n(S^{(-j)}_{11}) \mid N_j = k, A)\big]^{B_2-k}
  = \big[1 - \big(P(Cr_n(S^{(j)}_{11}) \ge Cr_n(S^{(-j)}_{11}) \mid N_j = k, A)\big)^{k}\big]^{B_2-k}
  \ge \big[1 - \big(P(Cr(S^{(j)}_{11}) + 2\epsilon_n \ge Cr(S^{(-j)}_{11}) \mid N_j = k, A)\big)^{k}\big]^{B_2-k}
  = \big[1 - \big(P(Cr(S^{(j)}_{11}) + 2\epsilon_n \ge Cr(S^{(-j)}_{11}) \mid N_j = k)\big)^{k}\big]^{B_2-k}
  \ge (1 - \delta_n^{k})^{B_2-k}.   (B.10)
It follows that
P(j \in S_{1*} \mid N_j = k) \ge P(j \in S_{1*} \mid N_j = k, A)\,P(A) \ge (1 - c_{1n})(1 - \delta_n^{k})^{B_2-k}.   (B.11)
Then we have
P(j \in S_{1*}) = \sum_{k=1}^{B_2}P(j \in S_{1*} \mid N_j = k)\,P(N_j = k)
  \ge (1 - c_{1n})\sum_{k=1}^{B_2}(1 - \delta_n^{k})^{B_2-k}P(N_j = k)
  \ge (1 - c_{1n})\sum_{k\ge B_2p_0/2}(1 - \delta_n^{k})^{B_2-k}P(N_j = k)
  \ge (1 - c_{1n})(1 - \delta_n^{B_2p_0/2})^{B_2}\,P(N_j \ge B_2p_0/2).   (B.12)
By letting \lambda = 1/2 in Lemma 3, we have
P(N_j \ge B_2p_0/2) \ge 1 - \exp\{-B_2p_0/8\}.   (B.13)
Combining (B.12) and (B.13), we obtain \eta_j = P(j \in S_{1*}) \ge c_{2n}. Again, by Lemma 4 and union bounds, it holds that
P(S^* \subseteq \hat{S}_\alpha) \ge P\Big(\bigcap_{j\in S^*}\{\hat{\eta}_j \ge c_{2n}\alpha\}\Big)
  \ge 1 - \sum_{j\in S^*}P(\hat{\eta}_j - \eta_j < -(1-\alpha)\eta_j)
  \ge 1 - \sum_{j\in S^*}\exp\{-2B_1\eta_j^2(1-\alpha)^2\}
  \ge 1 - p_*\exp\{-2B_1c_{2n}^2(1-\alpha)^2\},   (B.14)
which leads to conclusion (i).

For (ii), since \limsup_{n,D,B_2\to\infty}B_1\delta_n^{B_2p_0/2} < \infty, c_{1n} \to 0, and B_2p_0 \gtrsim 1, it is easy to show that \liminf_{n,D,B_2\to\infty}c_{2n} > 0, yielding |\hat{S}_\alpha| \lesssim D, which completes our proof.
B.3 Proof of Proposition 1

By Assumption 2, there exists a sequence \{\epsilon_n\} such that
Cr(S_N \cup \bar{S}_j) + 2\epsilon_n \le Cr(S_N), for any S_N \subseteq S_{\rm Full}\backslash(\bar{S}_j \cup S_j),
when n is large.

(i) When \bar{S}_j = \{j\}: to find a lower bound of P(Cr(S^{(j)}) + 2\epsilon_n \le Cr(S^{(-j)})), we first rank the elements of \{S^{(-j)}: |S^{(-j)}| \le D-1, S^{(-j)} \subseteq S_{\rm Full}\backslash(S_j\cup\{j\})\} by increasing values of Cr. Denote the sorted sequence of subspaces as \{S^{(-j)}_r\}_{r=1}^{R}. Notice that
Cr(S^{(-j)}_r \cup \{j\}) + 2\epsilon_n \le Cr(S^{(-j)}_{r'} \cup \{j\}) + 2\epsilon_n \le Cr(S^{(-j)}_{r'}), for any r' \ge r \ge 1.
If S^{(j)} and S^{(-j)} follow the distributions in (1) and (2), respectively, and S^{(j)} \perp\!\!\!\perp S^{(-j)}, then
P(S^{(j)} = S^{(-j)}_r \cup \{j\}) = \frac{2}{D(D+1)}\cdot\frac{|S^{(-j)}_r|+1}{\binom{p}{|S^{(-j)}_r|+1}},
P(S^{(-j)} = S^{(-j)}_r) = \frac{2}{D(2p-D-1)}\cdot\frac{p-|S^{(-j)}_r|}{\binom{p}{|S^{(-j)}_r|}},
for any r \ge 1. It follows that
P(Cr(S^{(j)}) + 2\epsilon_n \le Cr(S^{(-j)})) \ge \frac{4}{D^2(D+1)(2p-D-1)}\sum_{r\ge 1}\sum_{r'\ge r}\frac{(|S^{(-j)}_r|+1)(p-|S^{(-j)}_{r'}|)}{\binom{p}{|S^{(-j)}_r|+1}\binom{p}{|S^{(-j)}_{r'}|}},
and a direct counting over the possible subspace sizes then yields
P(Cr(S^{(j)}) + 2\epsilon_n \le Cr(S^{(-j)})) \ge \frac{(p-D+1)(D+2)(D-1)}{2D(D+1)(2p-D-1)}\Big(1-\frac{p_j}{p-D}\Big)^{D-1}.
Therefore, when \bar{d} = 1, we have
\limsup_{n,D\to\infty}\delta_n = \limsup_{n,D\to\infty}\sup_{j\in S^*}P(Cr(S^{(-j)}) - Cr(S^{(j)}) < 2\epsilon_n)
  \le 1 - \liminf_{n,D\to\infty}\Big[\frac{(p-D+1)(D+2)(D-1)}{2D(D+1)(2p-D-1)}\Big(1-\frac{\sup_{j\in S^*}p_j}{p-D}\Big)^{D-1}\Big] = \frac{3}{4},
where the inequality holds because of the fact that \sup_{j\in S^*}p_jD = o(p).

(ii) When |\bar{S}_j| > 1: similarly to (i), we first rank the elements of \{S^{(-j)}: |S^{(-j)}| \le D-\bar{d}, S^{(-j)} \subseteq S_{\rm Full}\backslash(\bar{S}_j \cup S_j)\} by increasing values of Cr and denote the sorted sequence as \{S^{(-j)}_r\}_{r=1}^{R}, for which
Cr(S^{(-j)}_r \cup \bar{S}_j) + 2\epsilon_n \le Cr(S^{(-j)}_{r'} \cup \bar{S}_j) + 2\epsilon_n \le Cr(S^{(-j)}_{r'}), r' \ge r \ge 1.
If S^{(j)} and S^{(-j)} follow the distributions in (1) and (2), respectively, and S^{(j)} \perp\!\!\!\perp S^{(-j)}, then
P(S^{(j)} = S^{(-j)}_r \cup \bar{S}_j) = \frac{2}{D(D+1)}\cdot\frac{|S^{(-j)}_r|+|\bar{S}_j|}{\binom{p}{|S^{(-j)}_r|+|\bar{S}_j|}},
P(S^{(-j)} = S^{(-j)}_r) = \frac{2}{D(2p-D-1)}\cdot\frac{p-|S^{(-j)}_r|}{\binom{p}{|S^{(-j)}_r|}}.
A calculation parallel to case (i) then gives
P(Cr(S^{(j)}) + 2\epsilon_n \le Cr(S^{(-j)})) \ge \frac{(p-D)(D+|\bar{S}_j|+1)(D-|\bar{S}_j|)}{2D(D+1)(2p-D-1)}\Big(1-\frac{p_j+|\bar{S}_j|}{p-D}\Big)^{D-|\bar{S}_j|}\cdot\frac{1}{p^{|\bar{S}_j|-1}} = O\Big(\frac{1}{p^{\bar{d}-1}}\Big),
as n, D \to \infty. Therefore, when \bar{d} \ge 2, it follows that
\delta_n = \sup_{j\in S^*}P(Cr(S^{(-j)}) - Cr(S^{(j)}) < 2\epsilon_n) \le 1 - O\Big(\frac{1}{p^{\bar{d}-1}}\Big),
which completes the proof.
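The counting arguments above hinge on the probability that a random subspace avoids a fixed small set of features. Under the same hierarchical uniform subspace distribution assumed in the sketch at the start of Appendix B, a simplified version of such an avoidance bound can be checked numerically; the bound below is our own crude stand-in for the more refined quantities used in the proof, not a formula quoted from it.

import numpy as np
from math import comb

p, D, q = 200, 10, 5                      # q = size of the set to avoid
# exact avoidance probability under the hierarchical uniform subspace distribution
exact = np.mean([comb(p - q, d) / comb(p, d) for d in range(1, D + 1)])
bound = (1 - q / (p - D)) ** D            # crude lower bound of the same flavor
print(f"exact {exact:.4f} >= bound {bound:.4f}")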
B.4 Proof of Theorem 2

Akin to the argument in the proof of Theorem 1, we can prove that
P\Big(\bigcap_{j\in S^*}\{\hat{\eta}^{[0]}_j \ge c^*\}\Big) \ge 1 - p_*\exp\{-2B_1(c^{[0]}_{2n} - c^*)^2\}.
Denote the event \bigcap_{j\in S^*}\{\hat{\eta}^{[0]}_j \ge c^*\} as \mathcal{B}, and write \#\{b: j \in S^{[1]}_{1b}\} as N_j. Then, similarly to (B.10) and (B.11), since \inf_{i\in S^*}\tilde{\eta}_i \ge c^*/(D+C) and the subspace distribution family satisfies the condition in Assumption 3.(ii), it follows that
P(j \in S^{[1]}_{1*} \mid N_j = k, \mathcal{B}) \ge (1 - c_{1n})(1 - (\delta^{[1]}_n)^{k})^{B_2-k}.
Similarly to (B.12), it follows that
P(j \in S^{[1]}_{1*} \mid \hat{\eta}^{[0]} \in \mathcal{B}) = \sum_{k=1}^{B_2}P(j \in S^{[1]}_{1*} \mid N_j = k, \mathcal{B})\,P(N_j = k \mid \hat{\eta}^{[0]} \in \mathcal{B})
  \ge (1 - c_{1n})\sum_{k\ge B_2p^{[1]}_j/2}(1 - (\delta^{[1]}_n)^{k})^{B_2-k}P(N_j = k \mid \hat{\eta}^{[0]} \in \mathcal{B})
  \ge (1 - c_{1n})(1 - (\delta^{[1]}_n)^{B_2p^{[1]}_j/2})^{B_2}\,P(N_j \ge B_2p^{[1]}_j/2 \mid \hat{\eta}^{[0]} \in \mathcal{B}),   (B.15)
where p^{[1]}_j = P_{S^{[1]}_{11}\sim\mathcal{R}(U_0,p,\tilde{\eta}^{[0]})}(j \in S^{[1]}_{11} \mid \hat{\eta}^{[0]} \in \mathcal{B}). Notice that N_j \sim \mathrm{Bin}(B_2, p^{[1]}_j) when \hat{\eta}^{[0]} is given. Therefore, by Lemma 3, analogously to the proof of Theorem 1, we can obtain
P(N_j \ge B_2p^{[1]}_j/2 \mid \hat{\eta}^{[0]} \in \mathcal{B}) \ge 1 - \exp\{-B_2p^{[1]}_j/8\}.   (B.16)
Therefore, together with (B.16), equation (B.15) leads to
P(j \in S^{[1]}_{1*} \mid \hat{\eta}^{[0]} \in \mathcal{B}) \ge (1 - c_{1n})(1 - (\delta^{[1]}_n)^{B_2p^{[1]}_j/2})^{B_2}\big(1 - \exp\{-B_2p^{[1]}_j/8\}\big).
Therefore, given \hat{\eta}^{[0]} \in \mathcal{B}, it follows that \eta^{[1]}_j = P(j \in S^{[1]}_{1*} \mid \hat{\eta}^{[0]} \in \mathcal{B}) \ge c^{[1]}_{2n}. And for any j, it is straightforward to see that
P(j \in S^{[1]}_{11} \mid \hat{\eta}^{[0]} \in \mathcal{B}, |S^{[1]}_{11}| = d) \ge 1 - \prod_{i=0}^{d-1}\Big[1 - \frac{C}{(D+C)p - iC}\Big] \ge \frac{Cd}{2(D+C)p},
yielding
p^{[1]}_j \ge \frac{1}{D}\sum_{d=1}^{D}\frac{Cd}{2(D+C)p} = \frac{(D+1)C}{4(D+C)p} =: p^{[1]}_0.
Analogously to (B.14), applying Hoeffding's inequality and union bounds, we obtain
P(S^* \subseteq \hat{S}^{[1]}_\alpha) \ge P\Big(\bigcap_{j\in S^*}\{\hat{\eta}^{[1]}_j \ge c^{[1]}_{2n}\alpha\}\,\Big|\,\hat{\eta}^{[0]} \in \mathcal{B}\Big)P(\hat{\eta}^{[0]} \in \mathcal{B})
  \ge 1 - \sum_{j\in S^*}P\big(\hat{\eta}^{[1]}_j - \eta^{[1]}_j < -(1-\alpha)c^{[1]}_{2n}\,\big|\,\hat{\eta}^{[0]} \in \mathcal{B}\big) - P(\hat{\eta}^{[0]} \notin \mathcal{B})
  \ge 1 - p_*\exp\{-2B_1(c^{[0]}_{2n} - c^*)^2\} - p_*\exp\{-2B_1(c^{[1]}_{2n})^2(1-\alpha)^2\},
which completes the proof of conclusion (i). The proof of conclusion (ii) is similar to that of Theorem 1, and we omit it here.

B.5 Proof of Theorem 3

Similarly to (B.10), for any j \notin S^*, we have
P(j \in S_{1*} \mid N_j = k, A) = \big[1 - \big(P(Cr_n(S^{(j)}_{11}) \ge Cr_n(S^{(-j)}_{11}) \mid A)\big)^{k}\big]^{B_2-k}
  \le \big[1 - \big(P(Cr(S^{(j)}_{11}) - 2\epsilon_n > Cr(S^{(-j)}_{11}) \mid A)\big)^{k}\big]^{B_2-k}
  = \big[1 - \big(P(Cr(S^{(j)}_{11}) - 2\epsilon_n > Cr(S^{(-j)}_{11}))\big)^{k}\big]^{B_2-k}
  \le (1 - \tilde{\delta}_n^{k})^{B_2-k}.

Then we have
P(j \in S_{1*} \mid N_j = k) \le P(j \in S_{1*} \mid N_j = k, A)P(A) + P(A^c) \le (1 - c_{1n})(1 - \tilde{\delta}_n^{k})^{B_2-k} + c_{1n},
which, combined with Lemma 3, yields
\gamma' := \inf_{j\in S^*}P(j \in S_{1*}) - \sup_{j\notin S^*}P(j \in S_{1*})
  \ge \sum_{k=1}^{B_2}\Big[\inf_{j\in S^*}P(j \in S_{1*} \mid N_j = k) - \sup_{j\notin S^*}P(j \in S_{1*} \mid N_j = k)\Big]P(N = k)
  \ge (1 - c_{1n})\sum_{k=1}^{B_2}\big[(1 - \delta_n^{k})^{B_2-k} - (1 - \tilde{\delta}_n^{k})^{B_2-k}\big]P(N = k) - c_{1n}
  \ge (1 - c_{1n})\sum_{B_2p_0/2\le k\le 2B_2p_0}\big[(1 - \delta_n^{k})^{B_2-k} - (1 - \tilde{\delta}_n^{k})^{B_2-k}\big]P(N = k) - c_{1n}
  \ge (1 - c_{1n})\big[(1 - \delta_n^{B_2p_0/2})^{B_2} - (1 - \tilde{\delta}_n^{2B_2p_0})^{B_2(1-p_0)}\big]\Big(1 - 2\exp\Big\{-\frac{B_2p_0}{8}\Big\}\Big) - c_{1n} = \gamma.
Denote \gamma'' = \frac{\inf_{j\in S^*}\eta_j - \sup_{j\notin S^*}\eta_j}{\inf_{j\in S^*}\eta_j}. Then we have
\sup_{j\notin S^*}\eta_j = \sup_{j\notin S^*}P(j \in S_{1*}) = (1 - \gamma'')\inf_{j\in S^*}P(j \in S_{1*}) = (1 - \gamma'')\inf_{j\in S^*}\eta_j.
By Hoeffding's inequality and union bounds, it follows that
P\Big(\inf_{j\in S^*}\hat{\eta}_j > \sup_{j\notin S^*}\hat{\eta}_j\Big) \ge P\Big(\sup_{j\notin S^*}\hat{\eta}_j < \Big(1 - \frac{\gamma''}{2}\Big)\inf_{j\in S^*}\eta_j < \inf_{j\in S^*}\hat{\eta}_j\Big)
  \ge 1 - P\Big(\bigcup_{j\notin S^*}\Big\{\hat{\eta}_j - \eta_j \ge \frac{\gamma''}{2}\inf_{j\in S^*}\eta_j\Big\}\Big) - P\Big(\bigcup_{j\in S^*}\Big\{\hat{\eta}_j - \eta_j \le -\frac{\gamma''}{2}\inf_{j\in S^*}\eta_j\Big\}\Big)
  \ge 1 - \sum_{j\notin S^*}P\Big(\hat{\eta}_j - \eta_j \ge \frac{\gamma'}{2}\Big) - \sum_{j\in S^*}P\Big(\hat{\eta}_j - \eta_j \le -\frac{\gamma'}{2}\Big)
  \ge 1 - \sum_{j\notin S^*}\exp\Big\{-\frac{B_1\gamma'^2}{2}\Big\} - \sum_{j\in S^*}\exp\Big\{-\frac{B_1\gamma'^2}{2}\Big\}
  \ge 1 - p\exp\Big\{-\frac{B_1\gamma'^2}{2}\Big\} \ge 1 - p\exp\Big\{-\frac{B_1\gamma(n, D, B_2)^2}{2}\Big\},
which converges to 1 due to Assumption 4 as n, B_1, B_2 \to \infty. This completes the proof.

B.6 Proof of Proposition 2
B.6.1 Proof of (i)
Recall (A.9) and (7):
\widetilde{\mathrm{MSE}}(S) = \mathbb{E}y^2 - \mathrm{Cov}(y, x_S)\Sigma_{S,S}^{-1}\mathrm{Cov}(x_S, y),   (B.17)
\mathrm{MSE}(S) = \frac{1}{n}\|Y\|_2^2 - \frac{1}{n}Y^TX_S(X_S^TX_S)^{-1}X_S^TY.
And there holds \frac{1}{n}Y^TX_S = \frac{1}{n}\beta_{S^*}^TX_{S^*}^TX_S + \frac{1}{n}\epsilon^TX_S. Denote \hat{\Sigma} = \frac{1}{n}X^TX = (\hat{\sigma}_{ij})_{p\times p} and \Sigma = (\sigma_{ij})_{p\times p}. Recall equation (10) in Bickel and Levina (2008):
P\Big(\sup_{i,j}|\hat{\sigma}_{ij} - \sigma_{ij}| > t\Big) \lesssim p^2\exp\Big\{-Cn\Big(\frac{t}{M}\Big)^2\Big\}.   (B.18)
It follows that
P\Big(\sup_{S:|S|\le D}\Big\|\frac{1}{n}X_{S^*}^TX_S - \Sigma_{S^*,S}\Big\|_\infty > t\Big) \le P\Big(\sup_{S:|S|\le D}\sup_{j\in S}\sum_{i\in S^*}|\hat{\sigma}_{ij} - \sigma_{ij}| > t\Big) \le P\Big(p_*\sup_{i,j}|\hat{\sigma}_{ij} - \sigma_{ij}| > t\Big) \lesssim p^2\exp\Big\{-Cn\Big(\frac{t}{Mp_*}\Big)^2\Big\},
and therefore
P\Big(\sup_{S:|S|\le D}\Big\|\frac{1}{n}X_{S^*}^TX_S - \Sigma_{S^*,S}\Big\|_2 > t\Big) \le P\Big(\sqrt{D}\sup_{S:|S|\le D}\Big\|\frac{1}{n}X_{S^*}^TX_S - \Sigma_{S^*,S}\Big\|_\infty > t\Big) \lesssim p^2\exp\Big\{-Cn\Big(\frac{t}{\sqrt{D}Mp_*}\Big)^2\Big\}.
In addition, \epsilon_i and X_{ij} are \sigma-subGaussian and \sqrt{M}-subGaussian, respectively, and they are independent. Then, by Lemma 2.7.7 in Vershynin (2018), \epsilon_iX_{ij} is \sqrt{M}\sigma-subexponential. Further, by Hoeffding's inequality, it follows that
P\Big(\Big|\frac{1}{n}\sum_{i=1}^{n}\epsilon_iX_{ij}\Big| > t\Big) \lesssim \exp\Big\{-Cn\Big[\Big(\frac{t}{\sqrt{M}\sigma}\Big)^2 \wedge \frac{t}{\sqrt{M}\sigma}\Big]\Big\},
so that
P\Big(\Big\|\frac{1}{n}\epsilon^TX\Big\|_\infty > t\Big) = P\Big(\sup_{1\le i\le p}\Big|\frac{1}{n}\epsilon^TX_i\Big| > t\Big) \le p\exp\Big\{-Cn\Big[\Big(\frac{t}{\sqrt{M}\sigma}\Big)^2 \wedge \frac{t}{\sqrt{M}\sigma}\Big]\Big\},
and, after multiplying by \sqrt{D} as above,
P\Big(\sup_{S:|S|\le D}\Big\|\frac{1}{n}\epsilon^TX_S\Big\|_2 > t\Big) \lesssim p\exp\Big\{-Cn\Big[\Big(\frac{t}{\sqrt{DM}\sigma}\Big)^2 \wedge \frac{t}{\sqrt{DM}\sigma}\Big]\Big\}.
Also, because m\|\beta_{S^*}\|_2^2 \le \mathrm{Var}(y) = \beta_{S^*}^T\Sigma_{S^*,S^*}\beta_{S^*} + \sigma^2 = 1, we have \|\beta_{S^*}\|_2 \le m^{-1/2}. Then it follows that
P\Big(\sup_{S:|S|\le D}\Big\|\frac{1}{n}Y^TX_S - \mathrm{Cov}(y, x_S)\Big\|_2 > t\Big) \le P\Big(\|\beta_{S^*}\|_2\sup_{S:|S|\le D}\Big\|\frac{1}{n}X_{S^*}^TX_S - \Sigma_{S^*,S}\Big\|_2 > \frac{t}{2}\Big) + P\Big(\sup_{S:|S|\le D}\Big\|\frac{1}{n}\epsilon^TX_S\Big\|_2 > \frac{t}{2}\Big),
leading to
\sup_{S:|S|\le D}\Big\|\frac{1}{n}Y^TX_S - \mathrm{Cov}(y, x_S)\Big\|_2 \le O_p\Big(Mp_*\sqrt{\frac{D\log p}{nm}}\Big).   (B.19)
Recall that y = \beta_{S^*}^Tx_{S^*} + \epsilon has unit variance, so that Y_i^2 is a subexponential variable with a constant parameter; combined with Hoeffding's inequality, this leads to
P\Big(\Big|\frac{1}{n}\|Y\|_2^2 - \mathbb{E}y^2\Big| > t\Big) \lesssim \exp\{-Cn(t^2 \wedge t)\}.
As a consequence, we have
\Big|\frac{1}{n}\|Y\|_2^2 - \mathbb{E}y^2\Big| \le O_p\Big(\frac{1}{\sqrt{n}}\Big).   (B.20)
On the other hand, due to (B.18), it holds that \|\hat{\Sigma} - \Sigma\|_{\max} \lesssim_p M\sqrt{\log p/n}, leading to
\sup_{S:|S|\le D}\|\hat{\Sigma}_{S,S} - \Sigma_{S,S}\|_2 \le D\|\hat{\Sigma} - \Sigma\|_{\max} \lesssim_p DM\sqrt{\frac{\log p}{n}}.
By the proof of Lemma 28 in Tian and Feng (2020), it follows that
\sup_{S:|S|\le D}\|\hat{\Sigma}_{S,S}^{-1} - \Sigma_{S,S}^{-1}\|_2 \lesssim_p DM\sqrt{\frac{\log p}{n}}.   (B.21)
Due to (B.17), it holds that
\widetilde{\mathrm{MSE}}(S) = \mathbb{E}y^2 - \mathrm{Cov}(y, x_S)\Sigma_{S,S}^{-1}\mathrm{Cov}(x_S, y) = \beta_{S^*}^T(\Sigma_{S^*,S^*} - \Sigma_{S^*,S}\Sigma_{S,S}^{-1}\Sigma_{S,S^*})\beta_{S^*} + \sigma^2.
According to the inverse matrix block decomposition used in the proof of Lemma 4 in Tian and Feng (2020), (\Sigma_{S^*,S^*} - \Sigma_{S^*,S}\Sigma_{S,S}^{-1}\Sigma_{S,S^*})^{-1} is the bottom-right block of \Sigma_{S^*\cup S, S^*\cup S}^{-1}, which is thus positive definite. Then for any subset S, we have
\sigma^2 \le \widetilde{\mathrm{MSE}}(S) \le \mathbb{E}y^2 = 1.   (B.22)
Therefore
M^{-1}\|\mathrm{Cov}(x_S, y)\|_2^2 \le \beta_{S^*}^T\Sigma_{S^*,S}\Sigma_{S,S}^{-1}\Sigma_{S,S^*}\beta_{S^*} \le \beta_{S^*}^T\Sigma_{S^*,S^*}\beta_{S^*} \le \mathbb{E}y^2 = 1,
leading to
\|\mathrm{Cov}(x_S, y)\|_2 \le \sqrt{M}.   (B.23)
Notice that
\sup_{S:|S|\le D}\|\Sigma_{S,S}^{-1}\|_2 = \sup_{S:|S|\le D}\lambda_{\min}^{-1}(\Sigma_{S,S}) \le \lambda_{\min}^{-1}(\Sigma) \le m^{-1}.   (B.24)
Connecting (B.19), (B.20), (B.21), (B.23), and (B.24) yields
\sup_{S:|S|\le D}|\mathrm{MSE}(S) - \widetilde{\mathrm{MSE}}(S)| \le \Big|\frac{1}{n}\|Y\|_2^2 - \mathbb{E}y^2\Big| + \sup_{S:|S|\le D}\Big[\Big\|\frac{1}{n}X_S^TY - \mathrm{Cov}(x_S, y)\Big\|_2\|\Sigma_{S,S}^{-1}\|_2\|\mathrm{Cov}(x_S, y)\|_2\Big] + \sup_{S:|S|\le D}\Big[\Big\|\frac{1}{n}X_S^TY\Big\|_2\|\hat{\Sigma}_{S,S}^{-1} - \Sigma_{S,S}^{-1}\|_2\|\mathrm{Cov}(x_S, y)\|_2\Big] + \sup_{S:|S|\le D}\Big[\Big\|\frac{1}{n}X_S^TY\Big\|_2\|\hat{\Sigma}_{S,S}^{-1}\|_2\Big\|\frac{1}{n}X_S^TY - \mathrm{Cov}(x_S, y)\Big\|_2\Big]
  \lesssim_p \max\{p_*m^{-1}, DM\}M\sqrt{\frac{D\log p}{n}}.
Therefore
\sup_{S:|S|\le D}\Big|\log\Big[\frac{\mathrm{MSE}(S)}{\widetilde{\mathrm{MSE}}(S)}\Big]\Big| = \sup_{S:|S|\le D}\Big|\frac{\mathrm{MSE}(S) - \widetilde{\mathrm{MSE}}(S)}{\widetilde{\mathrm{MSE}}(S)}\Big| + o\Big(\sup_{S:|S|\le D}\Big|\frac{\mathrm{MSE}(S) - \widetilde{\mathrm{MSE}}(S)}{\widetilde{\mathrm{MSE}}(S)}\Big|\Big) \lesssim \max\{p_*m^{-1}, DM\}M\sqrt{\frac{D\log p}{n}}.
Combining the inequality above with (6) and (A.8), we complete the proof of (i).
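The uniform-over-subspaces closeness established in part (i) is easy to visualize numerically. The sketch below compares the empirical MSE with the population MSE over a batch of random subspaces as n grows; the identity covariance, the coefficient values, and the number of sampled subspaces are arbitrary illustrative choices.

import numpy as np

def empirical_mse(X, y, S):
    """n^{-1}||Y||^2 - n^{-1} Y'X_S (X_S'X_S)^{-1} X_S'Y, as in the proof above."""
    n = len(y)
    XS = X[:, S]
    proj = XS @ np.linalg.solve(XS.T @ XS, XS.T @ y)
    return (y @ y - y @ proj) / n

rng = np.random.default_rng(0)
p, D, sigma2 = 30, 5, 0.5
beta = np.zeros(p); beta[:2] = 0.5            # with Sigma = I, Var(y) = 0.25 + 0.25 + 0.5 = 1
subspaces = [rng.choice(p, size=rng.integers(1, D + 1), replace=False) for _ in range(200)]

for n in (100, 1000, 10000):
    X = rng.standard_normal((n, p))
    y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
    # population MSE with identity covariance: 1 - sum of squared coefficients in S
    errs = [abs(empirical_mse(X, y, S) - (1.0 - np.sum(beta[S] ** 2))) for S in subspaces]
    print(n, round(max(errs), 4))             # maximal deviation shrinks roughly like n^{-1/2}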
B.6.2 Proof of (ii)

First, let us see the following useful lemma.
Lemma 5.
For an arbitrary subset S_N \ne \emptyset with S_N \not\ni j, if we assume \mathrm{Var}(y) = 1, then the population MSE defined in (A.9) satisfies
\widetilde{\mathrm{MSE}}(S_N) - \widetilde{\mathrm{MSE}}(S_N \cup \{j\}) = \mathrm{Corr}^2(y, x_j^{\perp S_N}),
where x_j^{\perp S_N} = x_j - \Sigma_{j,S_N}\Sigma_{S_N,S_N}^{-1}x_{S_N}.

Proof of Lemma 5. Denote k_{j|S_N} = \Sigma_{j,j} - \Sigma_{j,S_N}\Sigma_{S_N,S_N}^{-1}\Sigma_{S_N,j} and \tilde{S} = S_N \cup \{j\}. Combined with the matrix block-inverse decomposition (Tian and Feng, 2020),
\Sigma_{\tilde{S},\tilde{S}}^{-1} =
\begin{pmatrix}
\Sigma_{S_N,S_N}^{-1} + k_{j|S_N}^{-1}\Sigma_{S_N,S_N}^{-1}\Sigma_{S_N,j}\Sigma_{j,S_N}\Sigma_{S_N,S_N}^{-1} & -k_{j|S_N}^{-1}\Sigma_{S_N,S_N}^{-1}\Sigma_{S_N,j} \\
-k_{j|S_N}^{-1}\Sigma_{j,S_N}\Sigma_{S_N,S_N}^{-1} & k_{j|S_N}^{-1}
\end{pmatrix},
we have
\widetilde{\mathrm{MSE}}(S_N) - \widetilde{\mathrm{MSE}}(S_N \cup \{j\}) = \mathrm{Cov}(y, x_{\tilde{S}})\Sigma_{\tilde{S},\tilde{S}}^{-1}\mathrm{Cov}(x_{\tilde{S}}, y) - \mathrm{Cov}(y, x_{S_N})\Sigma_{S_N,S_N}^{-1}\mathrm{Cov}(x_{S_N}, y)
  = \frac{1}{k_{j|S_N}}\big[\mathrm{Cov}(y, x_{S_N})\Sigma_{S_N,S_N}^{-1}\Sigma_{S_N,j} - \mathrm{Cov}(y, x_j)\big]^2 = \frac{1}{k_{j|S_N}}\mathrm{Cov}^2(y, x_j^{\perp S_N}).
In addition, it can be easily verified that
\mathrm{Var}(x_j^{\perp S_N}) = \mathrm{Var}(x_j - \Sigma_{j,S_N}\Sigma_{S_N,S_N}^{-1}x_{S_N}) = k_{j|S_N},
yielding \widetilde{\mathrm{MSE}}(S_N) - \widetilde{\mathrm{MSE}}(S_N \cup \{j\}) = \mathrm{Var}(y)\mathrm{Corr}^2(y, x_j^{\perp S_N}).

When the criterion function is defined as -\frac{2}{n}\hat{L} + g(|S|), where g: \mathbb{N} \to \mathbb{R}^+ \cup \{0\}, by the proof of (i), Assumption 5.(iii), equation (B.22), and Lemma 5, it is easy to see that
\inf_{j\in S^*}\;\inf_{S_N:|S_N|\le D,\,S_N\subseteq S_{\rm Full}\backslash(S^*\cup S_j)}[Cr(S_N) - Cr(S_N \cup \{j\})]
  \ge \inf_{j\in S^*}\;\inf_{S_N:|S_N|\le D,\,S_N\subseteq S_{\rm Full}\backslash(S^*\cup S_j)}\big[\log\widetilde{\mathrm{MSE}}(S_N) - \log\widetilde{\mathrm{MSE}}(S_N\cup\{j\})\big]
  \ge \inf_{j\in S^*}\;\inf_{S_N:|S_N|\le D,\,S_N\subseteq S_{\rm Full}\backslash(S^*\cup S_j)}\Big[\frac{\widetilde{\mathrm{MSE}}(S_N) - \widetilde{\mathrm{MSE}}(S_N\cup\{j\})}{\widetilde{\mathrm{MSE}}(S_N\cup\{j\})} + o\Big(\frac{\widetilde{\mathrm{MSE}}(S_N) - \widetilde{\mathrm{MSE}}(S_N\cup\{j\})}{\widetilde{\mathrm{MSE}}(S_N\cup\{j\})}\Big)\Big]
  \gg \max\{p_*m^{-1}, DM\}M\sqrt{\frac{D\log p}{n}}.
Then there exists \{\epsilon_n\}_{n=1}^{\infty} satisfying
\inf_{j\in S^*}\;\inf_{S_N:|S_N|\le D,\,S_N\subseteq S_{\rm Full}\backslash(S^*\cup S_j)}[Cr(S_N) - Cr(S_N\cup\{j\})] \gg \epsilon_n \gg \max\{p_*m^{-1}, DM\}M\sqrt{\frac{D\log p}{n}}.
Then we apply Proposition 1 to obtain the conclusion.
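As a quick numerical check of the identity in Lemma 5 (in its general form with the Var(y) factor, before imposing Var(y) = 1), the following compares the two sides on a randomly generated covariance; all numbers are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
p = 6
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)                  # a generic positive-definite covariance of x
beta = rng.standard_normal(p)
sigma2 = 1.0
cov_xy = Sigma @ beta                            # Cov(x, y) for y = beta' x + eps
var_y = beta @ Sigma @ beta + sigma2             # = E y^2 for centered y

def mse_pop(S):
    """Population MSE (A.9): E y^2 - Cov(y, x_S) Sigma_{S,S}^{-1} Cov(x_S, y)."""
    S = list(S)
    return var_y - cov_xy[S] @ np.linalg.solve(Sigma[np.ix_(S, S)], cov_xy[S])

S_N, j = [0, 2, 4], 1
lhs = mse_pop(S_N) - mse_pop(S_N + [j])
w = np.linalg.solve(Sigma[np.ix_(S_N, S_N)], Sigma[np.ix_(S_N, [j])]).ravel()
cov_perp = cov_xy[j] - w @ cov_xy[S_N]           # Cov(y, x_j^{perp S_N})
var_perp = Sigma[j, j] - Sigma[j, S_N] @ w       # = k_{j | S_N}
rhs = var_y * cov_perp**2 / (var_y * var_perp)   # Var(y) * Corr^2(y, x_j^{perp S_N})
print(lhs, rhs)                                  # the two sides agree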
B.7 Proof of Proposition 3

The derivation of the inequality \liminf_{n,D\to\infty}\tilde{\delta}_n \ge 1 follows the same idea used in the proof of Proposition 2 (part (ii)), so we omit it here. In what follows, we only prove the conclusion \limsup_{n,D\to\infty}\delta_n = 0.

[Figure 4: Decomposition of the difference between Cr(S^{(-j)}) and Cr(S^{(j)}) into \Delta_1 + \Delta_2 + \Delta_3 along the path S^{(-j)} \to S^{(-j)}\cup(S^{(j)}\backslash\{j\}) \to S^{(j)}\backslash\{j\} \to S^{(j)}.]

From Figure 4, for any j \in S^*, any S^{(-j)} satisfying S^{(-j)} \cap (S^* \cup S_j \cup \tilde{S}) = \emptyset, and any S^{(j)} satisfying S^{(j)} \cap (S_j \cup \tilde{S}) = \emptyset, according to the definition of the criterion, we know that
\Delta_1 := Cr(S^{(-j)} \cup (S^{(j)}\backslash\{j\})) - Cr(S^{(-j)}) \le g_n(|S^{(-j)}| + D) - g_n(|S^{(-j)}|).
Consider the path from S^{(-j)} \cup (S^{(j)}\backslash\{j\}) to S^{(j)}\backslash\{j\}, in which each time one feature from S^{(-j)}\backslash S^{(j)} is removed from S^{(-j)} \cup (S^{(j)}\backslash\{j\}). For ease of illustration, denote the deleted features in order as j_1, \ldots, j_U and the set after removing j_u as \bar{S}_u; then \bar{S}_U = S^{(j)}\backslash\{j\}. For convenience, write S^{(-j)} \cup (S^{(j)}\backslash\{j\}) as \bar{S}_0. Then it holds that
\Delta_2 := Cr(S^{(j)}\backslash\{j\}) - Cr(S^{(-j)} \cup (S^{(j)}\backslash\{j\})) = \sum_{u=1}^{U}[Cr(\bar{S}_u) - Cr(\bar{S}_{u-1})] = \sum_{u=1}^{U}\big[\mathrm{Var}(y)\mathrm{Corr}^2(Y, X_{j_u}^{\perp\bar{S}_u}) + g_n(|\bar{S}_u|) - g_n(|\bar{S}_{u-1}|)\big] \le 0,
where the last inequality comes from Assumption 6.(iii). In addition, we have
\Delta_3 := Cr(S^{(j)}) - Cr(S^{(j)}\backslash\{j\}) = -\mathrm{Var}(y)\mathrm{Corr}^2(Y, X_j^{\perp\bar{S}_U}) + g_n(|S^{(j)}|) - g_n(|\bar{S}_U|) \le -\mathrm{Var}(y)\mathrm{Corr}^2(Y, X_j^{\perp\bar{S}_U}) + \sup_{1\le d\le D}[g_n(d) - g_n(d-1)].
Finally, it holds that
\Delta := Cr(S^{(j)}) - Cr(S^{(-j)}) = \Delta_1 + \Delta_2 + \Delta_3 \le -\mathrm{Var}(y)\mathrm{Corr}^2(Y, X_j^{\perp\bar{S}_U}) + \sup_{1\le d\le D}[g_n(d) - g_n(d-1)] + g_n(|S^{(-j)}| + D) - g_n(|S^{(-j)}|) \ll -\max\{p_*m^{-1}, DM\}M\sqrt{\frac{D\log p}{n}},
by Assumption 6.(ii). By choosing \{\epsilon_n\}_{n=1}^{\infty} satisfying
\max\{p_*m^{-1}, DM\}M\sqrt{\frac{D\log p}{n}} \ll \epsilon_n \ll -\Delta,
we have Cr(S^{(j)}) - Cr(S^{(-j)}) \le -2\epsilon_n. Therefore,
P(Cr(S^{(j)}) + 2\epsilon_n \le Cr(S^{(-j)})) \ge P(S^{(j)} \cap (S_j \cup \tilde{S}) = \emptyset)\cdot P(S^{(-j)} \cap (S^* \cup S_j \cup \tilde{S}) = \emptyset)
  \ge \frac{2p}{2p-D-1}\Big(1 - \frac{p_j + \tilde{p}}{p-D+1}\Big)^{D}\Big(1 - \frac{p_* + p_j + \tilde{p} + 1}{p-D+1}\Big)^{D},
once n and D are larger than some specific constants. Then
\limsup_{n,D\to\infty}\delta_n \le 1 - \liminf_{n,D\to\infty}\inf_{j\in S^*}P(Cr(S^{(j)}) + 2\epsilon_n \le Cr(S^{(-j)})) \le 1 - \liminf_{n,D\to\infty}\Big[\frac{2p}{2p-D-1}\Big(1 - \frac{\sup_{j\in S^*}p_j + \tilde{p}}{p-D+1}\Big)^{D}\Big(1 - \frac{p_* + \sup_{j\in S^*}p_j + \tilde{p} + 1}{p-D+1}\Big)^{D}\Big] = 0,
which completes the proof.

B.8 Proof of Corollary 1
1) for any finitevalues in the sequence. Recall that γ ( n, D, B ) = (1 − c n ) (cid:18) − (cid:26) − B p (cid:27)(cid:19) (cid:104) (1 − δ B p n ) B − (1 − ˜ δ B p n ) B (1 − p ) (cid:105) − c n .
23y Proposition 3, for an arbitrary small positive number τ > ∃ N, D (cid:48) very large suchthat when n > N and
D > D (cid:48) , it holds that δ n ≤ τ , ˜ δ n ≥ − τ . Then we have(1 − δ B p n ) B ≥ (1 − τ B p ) B , (1 − ˜ δ B p n ) B (1 − p ) ≤ (cid:34) − (cid:18) − τ (cid:19) B p (cid:35) B (1 − p ) , We can pick B and D to make sure that (cid:18) − (cid:26) − B p (cid:27)(cid:19) (1 − τ B p ) B − (cid:32) − (cid:18) − τ (cid:19) B p (cid:33) B (1 − p ) > c n − c n , which will yield γ ( n, D, B ) >
0. Therefore such n, D, B2