Doubly Robust Sure Screening for Elliptical Copula Regression Model
Yong He∗, Liang Zhang†, Jiadong Ji‡, Xinsheng Zhang§

Regression analysis has always been a hot research topic in statistics. We propose a very flexible semi-parametric regression model called the Elliptical Copula Regression (ECR) model, which covers a large class of linear and nonlinear regression models such as the additive regression model and the single index model. Besides, the ECR model can capture the heavy-tail characteristic and tail dependence between variables, so it can be widely applied in many areas such as econometrics and finance. In this paper we mainly focus on the feature screening problem for the ECR model in the ultra-high dimensional setting. We propose a doubly robust sure screening procedure for the ECR model, in which two types of correlation coefficient are involved: Kendall's tau correlation and canonical correlation. Theoretical analysis shows that the procedure enjoys the sure screening property, i.e., with probability tending to 1, the screening procedure selects all important variables and substantially reduces the dimensionality to a moderate size relative to the sample size. Thorough numerical studies are conducted to illustrate its advantage over existing sure independence screening methods, so that it can be used as a safe replacement for the existing procedures in practice. At last, the proposed procedure is applied to a gene-expression real data set to show its empirical usefulness.
Keywords:
Canonical Correlation; Doubly Robust; Elliptical Copula; Kendall's tau; Sure Screening.
1 Introduction

In the last decades, data sets with large dimensionality have arisen in various areas such as finance and chemistry due to the great development of computer storage capacity and processing power, and feature selection with such big data is of fundamental importance to many contemporary applications. The sparsity assumption is common in the high dimensional feature selection literature, i.e., only a few variables are critical for in-sample fitting and out-of-sample forecasting of a certain response of interest. Specifically, in the linear regression setting, statisticians care about how to select the important variables from thousands or even millions of variables. In fact, a huge amount of literature has sprung up since the appearance of the Lasso estimator [21]. To name a few, there exist SCAD by [9], the Adaptive Lasso by [25], MCP by [23], the Dantzig selector by [2], and the group Lasso by [22]. This research area is very active, and as a result, this list of references is illustrative rather than comprehensive. The aforementioned feature selection methods perform well when the dimensionality is not "too" large, theoretically in the sense that it is of polynomial order of the sample size. However, in the ultra-high dimensional setting where the dimensionality is of exponential order of the sample size, the aforementioned methods may encounter both theoretical and computational issues. Take the Dantzig selector for example: the Uniform Uncertainty Principle (UUP) condition that guarantees the oracle property may be difficult to satisfy, and the computational cost of implementing linear programs in ultra-high dimension would increase dramatically.
(∗ School of Statistics, Shandong University of Finance and Economics, Jinan, China. † School of Statistics, Shandong University of Finance and Economics, Jinan, China. ‡ School of Statistics, Shandong University of Finance and Economics, Jinan, China; Email: [email protected]. § School of Management, Fudan University, Shanghai, China.)

[10] first proposed Sure Independence Screening (SIS) and its further improvement, Iterative Sure Independence Screening (ISIS), to alleviate the computational burden in the ultra-high dimensional setting. The basic idea goes as follows. In the first step, reduce the dimensionality to a moderate size relative to the sample size by sorting the marginal Pearson correlations between the covariates and the response and removing those covariates whose marginal correlation with the response is lower than a certain threshold. In the second stage, perform Lasso, SCAD, etc. on the variables that survived the first step. SIS (ISIS) turns out to enjoy the sure screening property under certain conditions, that is, with probability tending to 1, the screening procedure selects all important variables. The last decade has witnessed plenty of variants of SIS that handle ultra-high dimensionality for more general regression models. [7] proposed a sure screening procedure for ultra-high dimensional additive models. [6] proposed a sure screening procedure for ultra-high dimensional varying coefficient models. [19] proposed censored rank independence screening of high-dimensional survival data, which is robust to predictors that contain outliers and works well for a general class of survival models. [24] and [4] proposed model-free feature screening. [14] proposed to screen the Kendall's tau correlation while [15] proposed to screen the distance correlation, both of which show robustness to heavy-tailed data.
[13] proposed to screen the canonical correlations between the response and all possible sets of k variables, which performs well particularly for selecting variables that are jointly important with other variables but marginally insignificant. This list of references for screening methods is also illustrative rather than comprehensive. For the development of screening methods in the last decade, we refer to the review papers of [16] and [5].

The main contribution of the paper is two-fold. On the one hand, we propose a very flexible semi-parametric regression model called the Elliptical Copula Regression (ECR) model, which can capture the thick-tail property of variables and the tail dependence between variables. Specifically, the ECR model has the following representation:

f_0(Y) = β_0 + Σ_{j=1}^p β_j f_j(X_j) + ε,   (1)

where Y is the response variable, X_1, ..., X_p are predictors, and the f_j(·) are univariate monotone functions. We say (Y, X^⊤)^⊤ = (Y, X_1, ..., X_p)^⊤ satisfies an elliptical copula regression model if the marginally transformed random vector Z̃ = (Ỹ, X̃^⊤)^⊤ := (f_0(Y), f_1(X_1), ..., f_p(X_p))^⊤ follows an elliptical distribution. From the representation of the ECR model in (1), it can be seen that the ECR model covers a large class of linear and nonlinear regression models, such as the additive regression model and the single index model, which makes it applicable in many areas such as econometrics, finance and bioinformatics. On the other hand, we propose a doubly robust dimension reduction procedure for the ECR model in the ultra-high dimensional setting. The doubly robustness is achieved by combining two types of correlation: Kendall's tau correlation and canonical correlation.
The canonical correlation is employed to capture the joint information of a set of covariates and the joint relationship between the response and this set of covariates. Note that for the ECR model in (1), only (Y, X^⊤)^⊤ is observable rather than the transformed variables. Thus the Kendall's tau correlation is exploited to estimate the canonical correlations, due to its invariance under strictly monotone marginal transformations. The dimension reduction procedure for the ECR model is achieved by sorting the estimated canonical correlations and putting into the active set every variable that attains a relatively high canonical correlation at least once. The proposed screening procedure enjoys the sure screening property and substantially reduces the dimensionality to a moderate size under mild conditions. Numerical results show that the proposed approach enjoys a great advantage over state-of-the-art procedures and thus can be used as a safe replacement.

We introduce some notation adopted in the paper. For any vector μ = (μ_1, ..., μ_d)^⊤ ∈ R^d, let μ_{−i} denote the (d−1)×1 vector obtained by removing the i-th entry from μ. Let |μ|_0 = Σ_{i=1}^d I{μ_i ≠ 0}, |μ|_1 = Σ_{i=1}^d |μ_i|, |μ|_2 = (Σ_{i=1}^d μ_i^2)^{1/2} and |μ|_∞ = max_i |μ_i|. Let A = [a_{ij}] ∈ R^{d×d}; ‖A‖_L = max_{1≤j≤d} Σ_{i=1}^d |a_{ij}|, ‖A‖_∞ = max_{i,j} |a_{ij}| and ‖A‖_1 = Σ_{i=1}^d Σ_{j=1}^d |a_{ij}|. We use λ_min(A) and λ_max(A) to denote the smallest and largest eigenvalues of A, respectively. For a set H, denote by |H| the cardinality of H. For a real number x, denote by ⌊x⌋ the largest integer smaller than or equal to x.
For two sequences of real numbers {a_n} and {b_n}, we write a_n = O(b_n) if there exists a constant C such that |a_n| ≤ C|b_n| holds for all n, write a_n = o(b_n) if lim_{n→∞} a_n/b_n = 0, and write a_n ≍ b_n if there exist constants c and C such that c ≤ a_n/b_n ≤ C for all n.

The rest of the paper is organized as follows: in Section 2, we introduce the elliptical copula regression model and present the proposed dimension reduction procedure, which ranks the estimated canonical correlations. In Section 3, we present the theoretical properties of the proposed procedure, with detailed proofs collected in the Appendix. In Section 4, we conduct thorough numerical simulations to investigate the empirical performance of the procedure. In Section 5, a real gene-expression data example is given to illustrate its empirical usefulness. At last, we give a brief discussion on possible future directions in the last section.

2 Elliptical Copula Regression Model

To present the Elliptical Copula Regression model, we first need to introduce the elliptical distribution. The elliptical distribution generalizes the multivariate normal distribution to include symmetric distributions with heavy tails, like the multivariate t-distribution. Elliptical distributions are commonly used in robust statistics to evaluate proposed multivariate statistical procedures. Specifically, the definition of the elliptical distribution is given as follows:

Definition 2.1. (Elliptical distribution)
Let μ ∈ R^p and Σ ∈ R^{p×p} with rank(Σ) = q ≤ p. A p-dimensional random vector Z is elliptically distributed, denoted by Z ∼ ED_p(μ, Σ, ζ), if it has the stochastic representation Z =_d μ + ζAU, where U is a random vector uniformly distributed on the unit sphere S^{q−1} in R^q, ζ ≥ 0 is a random variable independent of U, and A ∈ R^{p×q} is a deterministic matrix satisfying AA^⊤ = Σ, with Σ called the scatter matrix.

The representation Z =_d μ + ζAU is not identifiable since we can rescale ζ and A. We require Eζ^2 = q to make the model identifiable, which makes the covariance matrix of Z equal to Σ. In addition, we assume Σ is non-singular, i.e., q = p. In this paper, we only consider continuous elliptical distributions with Pr(ζ = 0) = 0.

Another equivalent definition of the elliptical distribution is through its characteristic function, which has the form exp(i t^⊤ μ) ψ(t^⊤ Σ t), where ψ(·) is a properly defined characteristic function and i = √−1; ζ and ψ are mutually determined by each other. Given the definition of the elliptical distribution, we are ready to introduce the Elliptical Copula Regression (ECR) model.

Definition 2.2. (Elliptical copula regression model)
Let f = {f_0, f_1, ..., f_p} be a set of monotone univariate functions and Σ be a positive-definite correlation matrix with diag(Σ) = I. We say a d-dimensional random variable Z = (Y, X_1, ..., X_p)^⊤ satisfies the Elliptical Copula Regression model if and only if Z̃ = (Ỹ, X̃^⊤)^⊤ = (f_0(Y), f_1(X_1), ..., f_p(X_p))^⊤ ∼ ED_d(0, Σ, ζ) with Eζ^2 = d and

Ỹ = X̃^⊤ β + ε,  or equivalently,  f_0(Y) = Σ_{j=1}^p β_j f_j(X_j) + ε,   (2)

where Y is the response, X = (X_1, ..., X_p)^⊤ are covariates, and d = p + 1.

We require diag(Σ) = I in Definition 2.2 for identifiability, because shifting and scaling are absorbed into the marginal functions f. For ease of presentation, we denote Z = (Y, X_1, ..., X_p)^⊤ ∼ ECR(Σ, ζ, f) in the following sections. The ECR model allows the data to come from heavy-tailed distributions and is thus more flexible and more useful in modelling many modern data sets, including financial data, genomics data and fMRI data.

Notice that the transformed variable Ỹ and the transformed covariates X̃ obey the linear regression model; however, the transformed variables are unobservable, and only Y and X are observable. In the following, by virtue of canonical correlation and Kendall's tau correlation, we present an adaptive screening procedure that avoids estimating the marginal transformation functions f while capturing the joint information of a set of covariates and the joint relationship between the response and this set of covariates. In this section we present the adaptive doubly robust screening for the ECR model.
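Before describing the screening machinery, the model of Definition 2.2 can be made concrete with a small simulation sketch. This is our illustration, not code from the paper: the multivariate t latent distribution, the equi-correlation structure, the choice f_0 = log, identity covariate transforms, and the coefficient values are all arbitrary assumptions.

```python
import numpy as np

def sample_ecr(n, p, beta, df=3, rho=0.3, seed=0):
    """Draw n observations from an illustrative ECR model.

    The latent vector (Y~, X~) is multivariate t (a member of the
    elliptical family); the observed response is Y = exp(Y~), i.e. the
    monotone transform f0 = log, and the covariate transforms are taken
    as the identity for simplicity.
    """
    rng = np.random.default_rng(seed)
    # Equi-correlated scatter matrix for the latent covariates.
    Sigma = rho * np.ones((p, p)) + (1 - rho) * np.eye(p)
    L = np.linalg.cholesky(Sigma)
    # Multivariate t via a normal scale mixture with chi-square mixing.
    g = rng.standard_normal((n, p)) @ L.T
    w = np.sqrt(df / rng.chisquare(df, size=(n, 1)))
    X_tilde = g * w
    eps = rng.standard_normal(n)
    y_tilde = X_tilde @ beta + eps   # linear model on the latent scale
    Y = np.exp(y_tilde)              # observed Y = f0^{-1}(Y~) with f0 = log
    return Y, X_tilde

beta = np.zeros(10)
beta[:2] = [1.0, -0.5]               # an arbitrary sparse coefficient vector
Y, X = sample_ecr(200, 10, beta)
```

On the latent scale the regression is linear, but the observed Y is heavy-tailed and nonlinearly related to the covariates, which is exactly the setting the screening procedure below is designed for.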
We first introduce the Kendall's-tau-based estimator of the correlation matrix in subsection 2.2.1, then the Kendall's-tau-based estimator of the canonical correlation in subsection 2.2.2; both are of fundamental importance for the detailed doubly robust screening procedure introduced in subsection 2.2.3.
In this section we present the estimator of the correlation matrix based on Kendall's tau. Let Z_1, ..., Z_n be n independent observations, where Z_i = (Y_i, X_{i1}, ..., X_{ip})^⊤. The sample Kendall's tau correlation of Z_j and Z_k is defined by

τ̂_{j,k} = (2 / (n(n−1))) Σ_{1≤i<i'≤n} sign((Z_{ij} − Z_{i'j})(Z_{ik} − Z_{i'k})),

which depends on the observations only through their ranks and is therefore invariant under strictly monotone marginal transformations. For an elliptical copula, the latent correlation can be recovered from Kendall's tau through the transformation

Ŝ_{j,k} = sin((π/2) τ̂_{j,k}),   (3)

and we denote by Ŝ = [Ŝ_{j,k}] the resulting Kendall's-tau-based estimator of Σ.
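A minimal sketch of this estimator follows (our illustration, not the authors' code; a direct O(n²p²) implementation with no optimization):

```python
import numpy as np

def kendall_tau(x, y):
    """Sample Kendall's tau: (2/(n(n-1))) * sum_{i<i'} sign((x_i-x_i')(y_i-y_i'))."""
    n = len(x)
    s = 0.0
    for i in range(n - 1):
        s += np.sum(np.sign((x[i] - x[i + 1:]) * (y[i] - y[i + 1:])))
    return 2.0 * s / (n * (n - 1))

def rank_correlation_matrix(Z):
    """Kendall's-tau-based estimator S_hat of the latent correlation matrix,
    S_hat[j,k] = sin(pi/2 * tau_hat[j,k]); invariant under strictly monotone
    marginal transforms because tau depends only on ranks."""
    n, d = Z.shape
    S = np.eye(d)
    for j in range(d):
        for k in range(j + 1, d):
            S[j, k] = S[k, j] = np.sin(0.5 * np.pi * kendall_tau(Z[:, j], Z[:, k]))
    return S
```

Because τ̂ is rank-based, applying any strictly increasing transform to a column (e.g. exp or a cubic) leaves Ŝ unchanged, which is the key to estimating correlations on the latent scale without knowing f.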
In Section 2.2.1 we presented the Kendall's-tau-based estimator of Σ, denoted by Ŝ. Thus the canonical correlation ρ^c can be naturally estimated by

ρ̂^c = ( Ŝ_{I×J} Ŝ_{J×J}^{−1} Ŝ_{I×J}^⊤ )^{1/2},   (4)

where I indexes the transformed response and J indexes the set of covariates under consideration. If Ŝ_{J×J} is not positive definite (not invertible), we first project Ŝ_{J×J} onto the cone of positive semidefinite matrices. In particular, we propose to solve the following convex optimization problem:

S̃_{J×J} = argmin_{S ⪰ 0} ‖Ŝ_{J×J} − S‖_∞.

The matrix element-wise infinity norm ‖·‖_∞ is adopted for the sake of further technical developments. Empirically, we can use a surrogate projection procedure that computes a spectral decomposition of Ŝ_{J×J} and truncates all of the negative eigenvalues to zero. Numerical study shows that this procedure works well.
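The estimator (4) together with the eigenvalue-truncation surrogate can be sketched as follows (our illustration; the small ridge term is our own numerical guard against exact singularity, not part of the paper's procedure):

```python
import numpy as np

def psd_truncate(S):
    """Surrogate projection: truncate negative eigenvalues of a symmetric
    matrix to zero, yielding a positive semidefinite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.clip(w, 0.0, None)) @ V.T

def canonical_corr(S, i, J, ridge=1e-10):
    """rho_hat^c = sqrt(S_IJ S_JJ^{-1} S_IJ^T) for response index i and
    covariate index set J, both indexing the rank-based matrix S."""
    J = list(J)
    S_IJ = S[i, J].reshape(1, -1)
    S_JJ = psd_truncate(S[np.ix_(J, J)]) + ridge * np.eye(len(J))
    val = S_IJ @ np.linalg.solve(S_JJ, S_IJ.T)
    return float(np.sqrt(max(val[0, 0], 0.0)))
```

When J is a single covariate, ρ̂^c reduces to the absolute marginal (rank-based) correlation, so the canonical-correlation screen strictly generalizes marginal screening.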
In this section, we present the screening procedure, which sorts the canonical correlations estimated via Kendall's tau.

The screening procedure goes as follows: first collect all sets of k transformed variables, whose total number adds up to C_p^k, the combinatorial number, i.e., {X̃_{l,m_1}, ..., X̃_{l,m_k}} with l = 1, ..., C_p^k. For each k-variable set {X̃_{l,m_1}, ..., X̃_{l,m_k}}, we denote its canonical correlation with f_0(Y) by ρ_l^c and estimate it by

ρ̂_l^c = ( Ŝ_{I×J} Ŝ_{J×J}^{−1} Ŝ_{I×J}^⊤ )^{1/2},

where Ŝ is the rank-based estimator of the correlation matrix introduced in Section 2.2.1. Then we sort these canonical correlations {ρ̂_l^c, l = 1, ..., C_p^k} and select into the active set the variables that attain a relatively large canonical correlation at least once.

Specifically, let M* = {1 ≤ i ≤ p : β_i ≠ 0} be the true model with size s = |M*|, and define the sets

I_n^i = { l : (X_i, X_{i_1}, ..., X_{i_{k−1}}) with max_{1≤m≤k−1} |i_m − i| ≤ k_n is used in calculating ρ̂_l^c },  i = 1, ..., p,

where k_n is a parameter determining a neighborhood set within which variables are combined with X_i to calculate the canonical correlation with the response. Finally we estimate the active set as follows:

M̂_{t_n} = { 1 ≤ i ≤ p : max_{l ∈ I_n^i} ρ̂_l^c > t_n },

where t_n is a threshold parameter which controls the size of the estimated active set.

If we set k_n = p, then all k-variable sets including X_i are considered in I_n^i. However, if there is a natural index for all the covariates such that only neighboring covariates are related, which is often the case in portfolio tracking in finance, it is more appropriate to consider a k_n much smaller than p. As for the parameter k, a relatively large k may bring more accurate results, but will increase the computational burden.
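Putting the pieces together, the screening step can be sketched as below. This is our illustration, not the authors' code: S is assumed to be the (p+1)×(p+1) rank-based correlation matrix with the response in position 0, the helper _rho_c computes (4), and for simplicity the k_n-neighborhood is enforced through the index span of each candidate set.

```python
import numpy as np
from itertools import combinations
from math import log

def _rho_c(S, J):
    # Canonical correlation of the response (index 0 of S) with covariate set J.
    S_IJ = S[0, J].reshape(1, -1)
    S_JJ = S[np.ix_(J, J)]
    val = S_IJ @ np.linalg.solve(S_JJ, S_IJ.T)
    return float(np.sqrt(max(val[0, 0], 0.0)))

def cch_screen(S, n, k=2, k_n=2):
    """One-step screening sketch: score each covariate by the largest
    canonical correlation over the k-variable sets (within a k_n index
    neighborhood) containing it, then keep the top floor(n / log n)."""
    p = S.shape[0] - 1
    score = np.zeros(p)
    for J in combinations(range(1, p + 1), k):  # rows/cols 1..p of S
        if max(J) - min(J) > k_n:               # neighborhood restriction
            continue
        rho = _rho_c(S, list(J))
        for j in J:                             # credit every set member
            score[j - 1] = max(score[j - 1], rho)
    keep = int(n / log(n))
    return set(np.argsort(-score)[:keep].tolist())  # 0-based covariate indices
```

Because every member of a high-scoring set is credited, a covariate that is marginally uncorrelated with the response can still survive if it is jointly informative with a neighbor.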
Empirical simulation results show that the performance obtained by taking k = 2 is already good enough and substantially better than taking k = 1, which is equivalent to sorting marginal correlations.

3 Theoretical properties
In this section, we present the theoretical properties of the proposed approach. In the screening problem, what we care about most is whether the true non-zero index set M* is contained in the estimated active set M̂_{t_n} with high probability for a properly chosen threshold t_n, i.e., whether the procedure has the sure screening property. To this end, we assume the following three assumptions hold.

Assumption 1
Assume p > n and log p = O(n^ξ) for some ξ ∈ (0, 1 − 2κ).

Assumption 2
For all l = 1, ..., C_p^k, λ_max((Σ_{J_l×J_l})^{−1}) ≤ c_0 for some constant c_0 > 0, where J_l = {m_{l1}, ..., m_{lk}} is the index set of variables in the l-th k-variable set.

Assumption 3
For some 0 ≤ κ < 1/2, min_{i ∈ M*} max_{l ∈ I_n^i} ρ_l^c ≥ c_1 n^{−κ}.

Assumption 1 specifies the scaling between the dimensionality p and the sample size n. Assumption 2 requires that the minimum eigenvalue of the covariance matrix of any k covariates is bounded away from zero. Assumption 3 is the fundamental basis for guaranteeing the sure screening property; it means that any important variable is correlated with the response jointly with some other variables. Technically, Assumption 3 ensures that an important variable will not be veiled by the statistical approximation error resulting from estimating the canonical correlation.

Theorem 3.1.
Assume that Assumptions 1-3 hold; then for some positive constants c_* and C, as n goes to infinity, we have

P( min_{i ∈ M*} max_{l ∈ I_n^i} ρ̂_l^c ≥ c_* n^{−κ} ) ≥ 1 − O( exp(−C n^{1−2κ}) ),

and

P( M* ⊂ M̂_{c_* n^{−κ}} ) ≥ 1 − O( exp(−C n^{1−2κ}) ).

Theorem 3.1 shows that, by setting the threshold of order c_* n^{−κ}, all important variables can be selected with probability tending to 1. However, the constant c_* remains unknown. To refine the theoretical result, we assume the following assumption holds.

Assumption 4
For some 0 ≤ κ < 1/2, max_{i ∉ M*} max_{l ∈ I_n^i} ρ_l^c < c_* n^{−κ}.

Assumption 4 requires that if a variable X_i is not important, then the canonical correlations between the response and all k-variable sets containing X_i are upper bounded by c_* n^{−κ}, and this holds uniformly for all unimportant variables.

Theorem 3.2.
Assume that Assumptions 1-4 hold; then for some constants c_* and C, we have

P( M* = M̂_{c_* n^{−κ}} ) ≥ 1 − O( exp(−C n^{1−2κ}) ),   (5)

and in particular

P( |M̂_{c_* n^{−κ}}| = s ) ≥ 1 − O( exp(−C n^{1−2κ}) ),   (6)

where s is the size of M*.

Theorem 3.2 guarantees the exact sure screening property without any condition on k_n. Besides, the theorem guarantees the existence of c_* and C; however, it still remains unknown how to select the constant c_*. If we know that s < n/log n in advance, one can select a constant c such that the size of M̂_{c n^{−κ}} is approximately n. Obviously, we have M̂_{c_* n^{−κ}} ⊂ M̂_{c n^{−κ}} with probability tending to 1. The following theorem, summarizing the above discussion, is particularly useful in practice.

Theorem 3.3. Assume that Assumptions 1-4 hold. If s = |M*| ≤ n/log n, then for any constant γ > 0 we have

P( M* ⊂ M_γ ) ≥ 1 − O( exp(−C n^{1−2κ}) ),

where M_γ = { 1 ≤ i ≤ p : max_{l ∈ I_n^i} ρ̂_l^c is among the largest ⌊γn⌋ of max_{l ∈ I_n^1} ρ̂_l^c, ..., max_{l ∈ I_n^p} ρ̂_l^c }.

The above theorem guarantees that one can reduce the dimensionality to a moderate size relative to n while ensuring the sure screening property, which in turn permits more sophisticated and computationally efficient variable selection methods.

Theorem 3.3 relies heavily on Assumption 4. If there is a natural order of the variables, and any important variable contributes to the response together with only its adjacent variables, then Assumption 4 can be removed entirely by imposing a constraint on the parameter k_n. The following theorem summarizes the above discussion.

Theorem 3.4.
Assume Assumptions 1-3 hold, λ_max(Σ) ≤ c_2 n^τ for some τ ≥ 0 and c_2 > 0, and further assume k_n = c_3 n^{τ*} for some constants c_3 > 0 and τ* ≥ 0. If 2κ + τ + τ* < 1, then there exists some θ ∈ [0, 1 − 2κ − τ − τ*) such that for γ = c_4 n^{−θ} with c_4 > 0, we have for some constant C > 0,

P( M* ⊂ M_γ ) ≥ 1 − O( exp(−C n^{1−2κ}) ).

The assumption k_n = c_3 n^{τ*} is reasonable in many fields such as biology and finance. An intuitive example arises in genomic association studies, where millions of genes tend to cluster together and function together with adjacent genes.

The procedure of ranking the estimated canonical correlations and reducing the dimension in one step from a large p to ⌊n/log n⌋ is a crude and greedy algorithm and may retain many spurious covariates due to the strong correlations among them. Motivated by the ISIS method in [10], we propose a similar iterative procedure which achieves sure screening in multiple steps. The iterative procedure works as follows. Let the shrinking factor δ → 0 satisfy δ n^{1−2κ−τ−τ*} → ∞ as n → ∞, and successively perform dimensionality reduction until the number of remaining variables drops below the sample size n. Specifically, define the subset

M^1(δ) = { 1 ≤ i ≤ p : max_{l ∈ I_n^i} ρ̂_l^c is among the largest ⌊δp⌋ of all }.   (7)

In the first step we select a subset M^1(δ) of ⌊δp⌋ variables by Equation (7). In the next step, we start from the variables indexed in M^1(δ) and apply a similar procedure as (7), obtaining a sub-model M^2(δ) ⊂ M^1(δ) with size ⌊δ^2 p⌋. Iterating the steps above, we finally obtain a sub-model M^k(δ) with size ⌊δ^k p⌋ < n.

Theorem 3.5.
Assume that the conditions in Theorem 3.4 hold and let δ → 0 satisfy δ n^{1−2κ−τ−τ*} → ∞ as n → ∞; then we have

P( M* ⊂ M^k(δ) ) ≥ 1 − O( exp(−C n^{1−2κ}) ).

The above theorem guarantees the sure screening property of the iterative procedure, and the step size δ can be chosen in the same way as for ISIS in [10].

4 Simulation studies

In this section we conduct thorough numerical simulations to illustrate the empirical performance of the proposed doubly robust screening procedure (denoted as CCH). Besides, we compare the proposed procedure with three methods: the method proposed by [13] (denoted as CCK), the rank correlation screening approach proposed by [14] (denoted as RRCS) and the original SIS procedure of [10]. To illustrate the doubly robustness of the proposed procedure, we consider the following five models, which include linear regression with thick-tail covariates and error term, a single-index model with thick-tail error term, an additive model and a more general regression model.
Model 1
Linear model setting adapted from [13]: Y_i = β_1 X_{i1} + ··· + β_p X_{ip} + ε_i, where only the first two entries of β are nonzero (the first equals 1 and the second is negative) and the last p − 2 entries are zero. X is sampled from the multivariate normal N(0, Σ) or the multivariate t-distribution with 1 degree of freedom, noncentrality parameter 0 and scale matrix Σ. The diagonal entries of Σ are 1 and the off-diagonal entries are ρ; the error term ε is independent of X and generated from the standard normal distribution or the standard t-distribution with 1 degree of freedom.

Model 2
Linear model setting adapted from [14]: Y_i = β_1 X_{i1} + ··· + β_p X_{ip} + ε_i, where only the first three entries of β are nonzero (the first equals 5) and the last p − 3 entries are zero. X is sampled from the multivariate normal N(0, Σ) or the multivariate t-distribution with 1 degree of freedom, noncentrality parameter 0 and scale matrix Σ. The diagonal entries of Σ are 1 and the off-diagonal entries are ρ; the error term ε is independent of X and generated from the standard normal distribution or the standard t-distribution with 1 degree of freedom.

Model 3
Single-index model setting: H(Y) = X^⊤ β + ε. We set H(Y) = log(Y), which corresponds to the limiting case λ → 0 of the Box-Cox transformation (|Y|^λ sgn(Y) − 1)/λ. The error term ε is independent of X and generated from the standard normal distribution or the standard t-distribution with 3 degrees of freedom. Only the leading entries of the regression coefficient vector β are nonzero (the first equals 3); the remaining entries are zero. X is sampled from the multivariate normal N(0, Σ) or the multivariate t-distribution with 3 degrees of freedom, where the diagonal entries of Σ are 1 and the off-diagonal entries are ρ.

Model 4
Additive model from [17]:

Y_i = 5 f_1(X_{i1}) + 3 f_2(X_{i2}) + 4 f_3(X_{i3}) + 6 f_4(X_{i4}) + ε_i,

with

f_1(x) = x,  f_2(x) = (2x − 1)^2,  f_3(x) = sin(2πx) / (2 − sin(2πx)),

and

f_4(x) = 0.1 sin(2πx) + 0.2 cos(2πx) + 0.3 sin^2(2πx) + 0.4 cos^3(2πx) + 0.5 sin^3(2πx).

The covariates X = (X_1, ..., X_p)^⊤ are generated by

X_j = (W_j + tU) / (1 + t),  j = 1, ..., p,

where W_1, ..., W_p and U are i.i.d. Uniform[0,1]. For t = 0 this is the independent uniform case, while t = 1 corresponds to a design with correlation 0.5 between all covariates. The error term ε is sampled from N(0, 1.74). The β in this model setting is obviously (5, 3, 4, 6, 0, ..., 0)^⊤ with the last p − 4 entries equal to zero.

Model 5
A model generated by combining Model 3 and Model 4:

H(Y_i) = 5 f_1(X_{i1}) + 3 f_2(X_{i2}) + 4 f_3(X_{i3}) + 6 f_4(X_{i4}) + ε_i,

where H(Y) is the same as in Model 3 and the functions {f_1, f_2, f_3, f_4} are the same as in Model 4. The covariates X = (X_1, ..., X_p)^⊤ are generated in the same way as in Model 4. The error term ε is sampled from N(0, 1.74).

Table 1: The proportions of containing the true model (Model 4 and Model 5) in the active set

(p, n)    Method     Model 4                  Model 5
                   t=0    t=0.5  t=1       t=0    t=0.5  t=1
(100,20)  RRCS     0.038  0.046  0.132     0.038  0.046  0.132
          SIS      0.048  0.056  0.142     0.002  0.012  0.032
          CCH1     0.938  0.968  0.938     0.938  0.968  0.938
          CCK1     0.380  0.356  0.584     0.142  0.146  0.300
          CCH2     0.978  0.956  0.938     0.978  0.956  0.938
          CCK2     0.506  0.458  0.682     0.166  0.144  0.302
(100,50)  RRCS     0.420  0.352  0.504     0.420  0.352  0.504
          SIS      0.512  0.406  0.496     0.100  0.110  0.182
          CCH1     1      1      1         1      1      1
          CCK1     0.922  0.938  0.990     0.690  0.614  0.836
          CCH2     1      1      1         1      1      1
          CCK2     0.988  0.990  0.996     0.674  0.588  0.810
(500,20)  RRCS     0.002  0.004  0.020     0.002  0.004  0.020
          SIS      0.002  0.008  0.026     0      0      0.008
          CCH1     0.804  0.884  0.812     0.804  0.884  0.812
          CCK1     0.168  0.176  0.354     0.038  0.068  0.168
          CCH2     0.858  0.890  0.842     0.858  0.890  0.842
          CCK2     0.222  0.224  0.454     0.052  0.086  0.178
(500,50)  RRCS     0.098  0.092  0.228     0.098  0.092  0.228
          SIS      0.158  0.140  0.222     0.002  0.026  0.036
          CCH1     1      1      1         1      1      1
          CCK1     0.794  0.856  0.984     0.310  0.344  0.642
          CCH2     1      1      1         1      1      1
          CCK2     0.930  0.956  0.998     0.350  0.344  0.622
For models in which ρ is involved, we take four values of ρ ranging from 0 to 0.9. For all the models, we consider four combinations of (n, p): (20,100), (50,100), (20,500), (50,500). All simulation results are based on 500 replications. We evaluate the performance of the different screening procedures by the proportion of the 500 replications in which the true model is included in the selected active set. To guarantee a fair comparison, for all the screening procedures, we choose the variables whose coefficients rank among the first ⌊n/log n⌋ largest values. For our method CCH and the method CCK in [13], two parameters k_n and k are involved. The simulation study shows that when k_n is small, the performance for different combinations of (k_n, k) is quite similar. Thus we only present the results for (k_n, k) = (2, 2) and (2, 3) for illustration, which are denoted as CCH1 and CCH2 for our method and CCK1 and CCK2 for the method of [13].

From the simulation results, we can see that the proposed CCH method detects the true model much more accurately than the SIS, RRCS and CCK methods in almost all cases. Specifically, for the motivating Model 1 in [13], from Table 2 we can see that when the correlations among covariates become large, the SIS, RRCS and CCK methods all perform worse (the proportion of containing the true model drops sharply), but the proposed CCH procedure shows robustness against the correlations among covariates and detects the true model in every replication. Besides, for the heavy-tailed error term following t(1), we can see that the SIS, RRCS and CCK methods all perform very badly while the CCH method still works very well. For Model 2, from Table 3, we can see that when the covariates are multivariate normal and the error term is normal, all the methods work well when the sample size is relatively large, while CCK and CCH require a smaller sample size than RRCS and SIS. If the error term is from t(1), then the SIS, RRCS and CCK methods perform badly, especially when the ratio p/n is large. In contrast, the CCH approach still performs very well. We note that RRCS also shows a certain robustness, and that CCK2 is slightly better than CCK1 because the important covariates are indexed by three consecutive integers.

CCH's advantage over CCK is mainly illustrated by the results of Models 3 to 5. In fact, Model 3 is an example of a single index model, Model 4 is an example of an additive model and Model 5 is an example of a more complex nonlinear regression model. The CCK approach relies heavily on the linear regression assumption, while CCH is more widely applicable. For the single index regression model, from Table 4, we can see that CCK performs badly, especially when the ratio p/n is large.
The RRCS approach ranks the Kendall's tau correlation, which is invariant to monotone transformations; thus it exhibits robustness for Model 3, but it still performs much worse than CCH. For the additive regression model and Model 5, by Table 1, conclusions similar to those for Model 3 can be drawn. It is worth mentioning that although in theory we require the marginal transformation functions to be monotone, the simulation study shows that the proposed screening procedure is not sensitive to this requirement and performs quite well even when the transformation functions are not monotone. In fact, the marginal transformation functions f_2, f_3, f_4 in Model 4 and Model 5 are all non-monotone. In a word, the proposed CCH procedure performs very well not only for heavy-tailed error terms, but also for various unknown transformation functions, which shows its doubly robustness. Thus in practice, CCH can be used as a safe replacement for CCK, RRCS or SIS.

5 Real data analysis

In this section we apply the variable selection method to a gene expression data set from an eQTL experiment in rat eye reported in [18]. The data set has been analyzed by [11], [20] and [8] and can be downloaded from the Gene Expression Omnibus at accession number GSE5680. For this data set, 120 twelve-week-old male rats were selected for harvesting of tissue from the eyes and subsequent microarray analysis. The microarrays used to analyze the RNA from the eyes of the rats contain over 31,042 different probe sets (Affymetrix GeneChip Rat Genome 230 2.0 Array). The intensity values were normalized using the robust multi-chip averaging method [1, 12] to obtain summary expression values for each probe set. Gene expression levels were analyzed on a logarithmic scale. Similar to the work of [11] and [8], we are interested in finding the genes correlated with gene TRIM32, which was found to cause Bardet-Biedl syndrome [3]. Bardet-Biedl syndrome is a genetically heterogeneous disease of multiple organ systems, including the retina.
Among the more than 31,000 gene probes, the probe from TRIM32 is 1389163_at, which is one of the 18,976 probes that are sufficiently expressed and variable.

Figure 1: Boxplots of the ranks of the first 20 genes ordered by r̂_j^U, for CCH1, CCK1, CCH2, CCK2, RRCS and SIS.

The sample size is n = 120 and the number of probes is 18,975. It is expected that only a few genes are related to TRIM32, so this is a sparse high dimensional regression problem.

Direct application of the proposed approach to the whole data set is slow, so we select the 500 probes with the largest variances among the whole 18,975 probes. [11] proposed a nonparametric additive model to capture the relationship between the expression of TRIM32 and candidate genes and found most of the plots of the estimated additive components to be highly nonlinear, confirming the necessity of taking nonlinearity into account. The Elliptical Copula Regression (ECR) model can also capture such nonlinear relationships, and thus it is reasonable to apply the proposed doubly robust dimension reduction procedure to this data set.

For the real data example, we compare the genes selected by the procedures introduced in the simulation study: SIS ([10]), the RRCS procedure ([14]), the CCK procedure ([13]) and the proposed CCH procedure. To detect influential genes, we adopt a bootstrap procedure similar to [14, 13]. We denote the respective correlation coefficients calculated using SIS, RRCS, CCK and CCH by ρ̃^sis, ρ̃^rrcs, ρ̃^cck and ρ̃^cch. The detailed algorithm is presented in Algorithm 1.

Figure 2: 3-dimensional plots of variables with Generalized Additive Model (GAM) fits.

Algorithm 1
A bootstrap procedure to obtain influential genes
Input: $\mathcal{D} = \{(X_i, Y_i),\ i = 1, \ldots, n\}$.
Output: indices of influential genes.

Step 1. From the data set $\mathcal{D}$, calculate the correlation coefficients $\widetilde\rho_i$, $i = 1, \ldots, p$, where $\widetilde\rho$ can be $\widetilde\rho^{\,sis}$, $\widetilde\rho^{\,rrcs}$, $\widetilde\rho^{\,cck}$ or $\widetilde\rho^{\,cch}$, and order them as $\widetilde\rho_{(\widehat j_1)} \ge \widetilde\rho_{(\widehat j_2)} \ge \cdots \ge \widetilde\rho_{(\widehat j_p)}$; the set $\{\widehat j_1, \ldots, \widehat j_p\}$ varies with the screening procedure. We write $\widehat j_1 \succeq \cdots \succeq \widehat j_p$ for the empirical ranking of the components of $X$ by their contributions to the response: $s \succeq t$ indicates $\widetilde\rho_{(s)} \ge \widetilde\rho_{(t)}$, informally interpreted as "the $s$th component of $X$ has at least as much influence on the response as the $t$th component". The rank $\widehat r_j$ of the $j$th component is defined as the value $r$ such that $\widehat j_r = j$.

Step 2. For each $b = 1, \ldots, B$, draw a bootstrap sample from $\mathcal{D}$ and employ the SIS, RRCS, CCK and CCH procedures to calculate the $b$th bootstrap version of $\widetilde\rho_i$, denoted $\widetilde\rho_i^{\,b}$, $1 \le i \le p$. Denote the ranks of $\widetilde\rho_1^{\,b}, \ldots, \widetilde\rho_p^{\,b}$ by $\widehat j_1^{\,b} \succeq \cdots \succeq \widehat j_p^{\,b}$ and calculate the corresponding rank $\widehat r_j^{\,b}$ for the $j$th component of $X$.

Step 3. Given $\alpha = 0.05$, compute the $(1-\alpha)$-level, two-sided, equally tailed interval for the rank of the $j$th component, i.e., an interval $[\widehat r_j^{\,L}, \widehat r_j^{\,U}]$ such that $P(\widehat r_j^{\,b} \le \widehat r_j^{\,L} \mid \mathcal{D}) \approx P(\widehat r_j^{\,b} \ge \widehat r_j^{\,U} \mid \mathcal{D}) \approx \alpha/2$.

Step 4. Treat a variable as influential if $\widehat r_j^{\,U}$ ranks in the top 20 positions.

The box-plot of the ranks of influential genes is shown in Figure 1, from which we can see that the proposed CCH procedure selects three very influential genes, 1373349_at, 1368887_at and 1382291_at (highlighted in blue in Figure 1), which were not detected as influential by the other screening methods. These three genes are selected because there exists a strong nonlinear relationship between the response and the combination of the three covariate genes; Figure 2 illustrates this finding. Besides, gene 1398594_at is detected as influential by both the CCH and RRCS procedures and is highlighted in red in Figure 1. The scatter plot shows that the nonlinearity between gene 1398594_at and gene TRIM32 is obvious, and both CCH and RRCS capture this nonlinear relationship. These findings are based purely on statistical analysis and need to be further validated by laboratory experiments. The screening procedure is particularly helpful in narrowing down the number of research targets to a few top-ranked genes from the 500 candidates.

Conclusion

We propose a very flexible semi-parametric ECR model and consider the variable selection problem for the ECR model in the ultra-high dimensional setting. We propose a doubly robust sure screening procedure for the ECR model. Theoretical analysis shows that the procedure enjoys the sure screening property, i.e., with probability tending to 1, the screening procedure selects all important variables and substantially reduces the dimensionality to a moderate size relative to the sample size.
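As a concrete illustration of Algorithm 1, the following sketch computes the full-data ranks and the bootstrap interval endpoints $[\widehat r_j^{\,L}, \widehat r_j^{\,U}]$. The function names are hypothetical, and plain absolute Pearson correlation stands in for the SIS/RRCS/CCK/CCH statistics:

```python
import numpy as np

def ranks_of(stats):
    # rank 1 = largest screening statistic
    r = np.empty(len(stats), dtype=int)
    r[np.argsort(-stats)] = np.arange(1, len(stats) + 1)
    return r

def abs_corr(X, Y):
    # stand-in marginal statistic: |Pearson correlation| of each column with Y
    Xc = X - X.mean(0)
    Yc = Y - Y.mean()
    return np.abs(Xc.T @ Yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(Yc) + 1e-12)

def bootstrap_rank_intervals(X, Y, stat=abs_corr, B=200, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    full_ranks = ranks_of(stat(X, Y))        # Step 1: full-data ranking
    boot = np.empty((B, p))
    for b in range(B):                       # Step 2: bootstrap ranks
        idx = rng.integers(0, n, n)
        boot[b] = ranks_of(stat(X[idx], Y[idx]))
    # Step 3: equally tailed (1 - alpha) interval for each component's rank
    r_L = np.quantile(boot, alpha / 2, axis=0)
    r_U = np.quantile(boot, 1 - alpha / 2, axis=0)
    return full_ranks, r_L, r_U

# toy check: only the first covariate drives the response
rng = np.random.default_rng(1)
X = rng.standard_normal((120, 50))
Y = 3.0 * X[:, 0] + 0.5 * rng.standard_normal(120)
full_ranks, r_L, r_U = bootstrap_rank_intervals(X, Y, B=100)
influential = np.flatnonzero(r_U <= 20)      # Step 4: upper endpoint in the top 20
```

This is only schematic: in the paper's analysis, `abs_corr` is replaced by the SIS, RRCS, CCK or CCH screening statistics applied to the 500 pre-selected probes.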
We set $k_n$ to a small value, and the procedure performs well as long as there is a natural index for all the covariates such that neighboring covariates are correlated. If there is no natural index ordering a priori, we can statistically cluster the variables before screening. The performance of the screening procedure would then rely heavily on the clustering performance, which we leave as a future research topic.

Appendix: Proof of Main Theorems
First we introduce a useful lemma which is critical for the proof of the main results.
Lemma .1. For any $c > 0$, there exists a positive constant $C > 0$ such that
$$P\Big(\big|\widehat S_{s,t} - \Sigma_{s,t}\big| \ge c\, n^{-\kappa}\Big) = O\big(\exp(-C\, n^{1-2\kappa})\big).$$

Proof.
Recall that $\widehat S_{s,t} = \sin\big(\tfrac{\pi}{2}\widehat\tau_{s,t}\big)$. Then we have
$$P\big(|\widehat S_{s,t} - \Sigma_{s,t}| > t\big) = P\Big(\big|\sin\big(\tfrac{\pi}{2}\widehat\tau_{s,t}\big) - \sin\big(\tfrac{\pi}{2}\tau_{s,t}\big)\big| \ge t\Big) \le P\Big(|\widehat\tau_{s,t} - \tau_{s,t}| \ge \tfrac{2}{\pi}\, t\Big).$$
Since $\widehat\tau_{s,t}$ can be written in the form of a U-statistic with a kernel bounded between $-1$ and $1$, by Hoeffding's inequality we have
$$P\Big(|\widehat\tau_{s,t} - \tau_{s,t}| \ge \tfrac{2}{\pi}\, t\Big) \le 2\exp\Big(-\tfrac{n t^2}{\pi^2}\Big).$$
By taking $t = c\, n^{-\kappa}$, we then have
$$P\Big(\big|\widehat S_{s,t} - \Sigma_{s,t}\big| \ge c\, n^{-\kappa}\Big) \le 2\exp\Big(-\tfrac{c^2}{\pi^2}\, n^{1-2\kappa}\Big),$$
which concludes the lemma.

Proof of Theorem 3.1
Proof. By the definition of canonical correlation,
$$(\rho^c_l)^2 = \Sigma_{I\times J_l}\, \Sigma^{-1}_{J_l\times J_l}\, \Sigma^{\top}_{I\times J_l}.$$
By Assumption 1, $\lambda_{\min}(\Sigma_{J_l\times J_l})$ is bounded below by a positive constant $c_1$, and noting that $\Sigma_{I\times J_l} = (\Sigma_{1,m_{l1}}, \ldots, \Sigma_{1,m_{lk}})$ is a row vector, we have
$$(\rho^c_l)^2 \le c_1^{-1} \sum_{t=1}^{k} (\Sigma_{1,m_{lt}})^2. \tag{8}$$
By Assumption 2, if $i \in \mathcal{M}_*$, then there exists $l_i \in \mathcal{I}_{n_i}$ such that $\rho^c_{l_i} \ge c_2\, n^{-\kappa}$. Without loss of generality, we assume that $|\Sigma_{1,m_{l_i 1}}| = \max_{1\le t\le k} |\Sigma_{1,m_{l_i t}}|$. By Equation (8), we have $|\Sigma_{1,m_{l_i 1}}| \ge c_*\, n^{-\kappa}$ for some $c_* > 0$.

For $\Sigma_{1,m_{l_i 1}}$, denote the corresponding Kendall's tau estimator by $\widehat S_{1,m_{l_i 1}} \in \widehat S_{I\times J_{l_i}}$. Then we have the following result:
$$P\Big(\min_{i\in\mathcal{M}_*} |\widehat S_{1,m_{l_i 1}}| \ge \tfrac{c_*}{2}\, n^{-\kappa}\Big) \ge P\Big(\max_{i\in\mathcal{M}_*} \big|\widehat S_{1,m_{l_i 1}} - \Sigma_{1,m_{l_i 1}}\big| \le \tfrac{c_*}{2}\, n^{-\kappa}\Big). \tag{9}$$
Furthermore, we have
$$\begin{aligned}
P\Big(\max_{i\in\mathcal{M}_*} \big|\widehat S_{1,m_{l_i 1}} - \Sigma_{1,m_{l_i 1}}\big| \ge \tfrac{c_*}{2}\, n^{-\kappa}\Big)
&\le s_*\, P\Big(\big|\widehat S_{1,m_{l_i 1}} - \Sigma_{1,m_{l_i 1}}\big| \ge \tfrac{c_*}{2}\, n^{-\kappa}\Big) \\
&= s_*\, P\Big(\big|\sin\big(\tfrac{\pi}{2}\widehat\tau_{1,m_{l_i 1}}\big) - \sin\big(\tfrac{\pi}{2}\tau_{1,m_{l_i 1}}\big)\big| \ge \tfrac{c_*}{2}\, n^{-\kappa}\Big) \\
&\le s_*\, P\Big(\big|\widehat\tau_{1,m_{l_i 1}} - \tau_{1,m_{l_i 1}}\big| \ge \tfrac{c_*}{\pi}\, n^{-\kappa}\Big) \\
&\le p\, P\Big(\big|\widehat\tau_{1,m_{l_i 1}} - \tau_{1,m_{l_i 1}}\big| \ge \tfrac{c_*}{\pi}\, n^{-\kappa}\Big).
\end{aligned}$$
Since $\widehat\tau_{1,m_{l_i 1}}$ can be written in the form of a U-statistic with a kernel bounded between $-1$ and $1$, by Hoeffding's inequality we have that
$$P\Big(\big|\widehat\tau_{1,m_{l_i 1}} - \tau_{1,m_{l_i 1}}\big| \ge \tfrac{c_*}{\pi}\, n^{-\kappa}\Big) \le 2\exp\Big(-\tfrac{c_*^2}{4\pi^2}\, n^{1-2\kappa}\Big).$$
By Assumption 3, $\log p = o(n^{1-2\kappa})$, we further have that for some constant $C$,
$$P\Big(\max_{i\in\mathcal{M}_*} \big|\widehat S_{1,m_{l_i 1}} - \Sigma_{1,m_{l_i 1}}\big| \ge \tfrac{c_*}{2}\, n^{-\kappa}\Big) \le O\big(\exp(-C\, n^{1-2\kappa})\big).$$
Combining with Equation (9), we have
$$P\Big(\min_{i\in\mathcal{M}_*} |\widehat S_{1,m_{l_i 1}}| \ge \tfrac{c_*}{2}\, n^{-\kappa}\Big) \ge 1 - O\big(\exp(-C\, n^{1-2\kappa})\big).$$
Besides, it is easy to show that $(\widehat\rho^c_l)^2 \ge c\, \max_{1\le t\le k} (\widehat S_{1,m_{lt}})^2$ for some constant $c > 0$, and hence, absorbing constants into $c_*$,
$$P\Big(\min_{i\in\mathcal{M}_*} \max_{l\in\mathcal{I}_{n_i}} \widehat\rho^c_l \ge c_*\, n^{-\kappa}\Big) \ge 1 - O\big(\exp(-C\, n^{1-2\kappa})\big),$$
which concludes
$$P\Big(\mathcal{M}_* \subset \widehat{\mathcal{M}}_{c_* n^{-\kappa}}\Big) \ge 1 - O\big(\exp(-C\, n^{1-2\kappa})\big).$$

Proof of Theorem 3.2
Proof.
The proof of Theorem 3.2 is split into the following two steps.

(Step I) In this step we aim to prove that
$$P\Big(\max_{l} \big|(\widehat\rho^c_l)^2 - (\widetilde\rho^c_l)^2\big| > c\, n^{-\kappa}\Big) = O\big(\exp(-C\, n^{1-2\kappa})\big),$$
where the maximum is taken over all candidate index sets $l$ and $(\widetilde\rho^c_l)^2 = \widehat S_{I\times J_l}\, \Sigma^{-1}_{J_l\times J_l}\, \widehat S^{\top}_{I\times J_l}$. Note that the determinants of the matrices $\Sigma_{J_l\times J_l}$ and $\widehat S_{J_l\times J_l}$ are polynomials of finite order in their entries, thus the following inequality holds for some constant $c' > 0$:
$$\begin{aligned}
P\Big(\big|\, |\widehat S_{J_l\times J_l}| - |\Sigma_{J_l\times J_l}|\, \big| > c\, n^{-\kappa}\Big)
&\le P\Big(\max_{1\le s,t\le k} |\widehat S_{s,t} - \Sigma_{s,t}| > c'\, n^{-\kappa}\Big) \\
&\le k^2\, P\big(|\widehat S_{s,t} - \Sigma_{s,t}| > c'\, n^{-\kappa}\big) \\
&= k^2\, P\Big(\big|\sin\big(\tfrac{\pi}{2}\widehat\tau_{s,t}\big) - \sin\big(\tfrac{\pi}{2}\tau_{s,t}\big)\big| \ge c'\, n^{-\kappa}\Big) \\
&\le k^2\, P\Big(|\widehat\tau_{s,t} - \tau_{s,t}| \ge \tfrac{2c'}{\pi}\, n^{-\kappa}\Big).
\end{aligned}$$
Since $\widehat\tau_{s,t}$ can be written in the form of a U-statistic with a kernel bounded between $-1$ and $1$, by Hoeffding's inequality, we have that
$$P\Big(|\widehat\tau_{s,t} - \tau_{s,t}| \ge \tfrac{2c'}{\pi}\, n^{-\kappa}\Big) \le 2\exp\Big(-\tfrac{c'^2}{\pi^2}\, n^{1-2\kappa}\Big).$$
Thus we have for some positive constant $C^*$, the following inequality holds:
$$P\Big(\big|\, |\widehat S_{J_l\times J_l}| - |\Sigma_{J_l\times J_l}|\, \big| > c\, n^{-\kappa}\Big) \le \exp\big(-C^*\, n^{1-2\kappa}\big). \tag{10}$$
By Assumption 3, $\log p = O(n^{\xi})$ with $\xi \in (0, 1-2\kappa)$, we further have for some positive constant $C$,
$$P\Big(\max_{l} \big|\, |\widehat S_{J_l\times J_l}| - |\Sigma_{J_l\times J_l}|\, \big| > c\, n^{-\kappa}\Big) \le \exp\big(-C\, n^{1-2\kappa}\big).$$
Note that $k$ is finite, and by the adjugate expansion of the inverse matrix, similar to the above analysis, we have for any positive $c$,
$$P\Big(\max_{l} \big\|(\widehat S_{J_l\times J_l})^{-1} - (\Sigma_{J_l\times J_l})^{-1}\big\|_{\infty} > c\, n^{-\kappa}\Big) \le \exp\big(-C\, n^{1-2\kappa}\big).$$
Notice that
$$\big|(\widehat\rho^c_l)^2 - (\widetilde\rho^c_l)^2\big| \le \|\widehat S_{I\times J_l}\|_{\infty}^2\, k^2\, \big\|(\widehat S_{J_l\times J_l})^{-1} - (\Sigma_{J_l\times J_l})^{-1}\big\|_{\infty} \le k^2\, \big\|(\widehat S_{J_l\times J_l})^{-1} - (\Sigma_{J_l\times J_l})^{-1}\big\|_{\infty},$$
since the entries of $\widehat S_{I\times J_l}$ are bounded by $1$. Therefore
$$P\Big(\max_{l} \big|(\widehat\rho^c_l)^2 - (\widetilde\rho^c_l)^2\big| > c\, n^{-\kappa}\Big) = O\big(\exp(-C\, n^{1-2\kappa})\big).$$

(Step II) In this step, we will first prove that for any $c > 0$,
$$P\Big(\max_{l} |\widetilde\rho^c_l - \rho^c_l| \ge c\, n^{-\kappa}\Big) = O\big(\exp(-C\, n^{1-2\kappa})\big).$$
By Lemma .1, we have that
$$P\Big(\big|\widehat S_{s,t} - \Sigma_{s,t}\big| \ge c\, n^{-\kappa}\Big) = O\big(\exp(-C\, n^{1-2\kappa})\big).$$
By Assumption 3, $\log p = O(n^{\xi})$ with $\xi \in (0, 1-2\kappa)$, thus we have
$$P\Big(\max_{1\le i\le p} \max_{l\in\mathcal{I}_{n_i}} \max_{s,t} |\widehat S_{s,t} - \Sigma_{s,t}| \ge c\, n^{-\kappa}\Big) = O\big(\exp(-C\, n^{1-2\kappa})\big).$$
Recalling that $(\widetilde\rho^c_l)^2 = \widehat S_{I\times J_l}\, \Sigma^{-1}_{J_l\times J_l}\, \widehat S^{\top}_{I\times J_l}$, by the boundedness of the eigenvalues of $\Sigma_{J_l\times J_l}$, we have for any $c > 0$,
$$P\Big(\max_{l} |\widetilde\rho^c_l - \rho^c_l| \ge c\, n^{-\kappa}\Big) = O\big(\exp(-C\, n^{1-2\kappa})\big).$$
Further by Assumption 4, $\max_{i\notin\mathcal{M}_*} \max_{l\in\mathcal{I}_{n_i}} \rho^c_l < \tfrac{c_*}{2}\, n^{-\kappa}$, and combining with the last equation, we have that
$$P\Big(\max_{i\notin\mathcal{M}_*} \max_{l\in\mathcal{I}_{n_i}} \widetilde\rho^c_l < c_*\, n^{-\kappa}\Big) \ge 1 - O\big(\exp(-C\, n^{1-2\kappa})\big).$$
By the result in Step I, we thus further have
$$P\Big(\max_{i\notin\mathcal{M}_*} \max_{l\in\mathcal{I}_{n_i}} \widehat\rho^c_l < c_*\, n^{-\kappa}\Big) \ge 1 - O\big(\exp(-C\, n^{1-2\kappa})\big). \tag{11}$$
By Theorem 3.1, the following inequality holds:
$$P\Big(\min_{i\in\mathcal{M}_*} \max_{l\in\mathcal{I}_{n_i}} \widehat\rho^c_l \ge c_*\, n^{-\kappa}\Big) \ge 1 - O\big(\exp(-C\, n^{1-2\kappa})\big),$$
then combining the result in Equation (11), we have
$$P\Big(\mathcal{M}_* = \widehat{\mathcal{M}}_{c_* n^{-\kappa}}\Big) \ge 1 - O\big(\exp(-C\, n^{1-2\kappa})\big),$$
which concludes the theorem.

Proof of Theorem 3.4

Proof.
Let $\delta_n \to 0$ satisfy $\delta_n\, n^{1-2\kappa-\tau-\tau_*} \to \infty$ as $n \to \infty$, and define
$$\mathcal{M}(\delta) = \Big\{1 \le i \le p : \max_{l\in\mathcal{I}_{n_i}} \widehat\rho^c_l \text{ is among the largest } \lfloor \delta p \rfloor \text{ of all}\Big\},$$
$$\widetilde{\mathcal{M}}(\delta) = \Big\{1 \le i \le p : \max_{l\in\mathcal{I}_{n_i}} \widetilde\rho^c_l \text{ is among the largest } \lfloor \delta p \rfloor \text{ of all}\Big\},$$
where $(\widetilde\rho^c_l)^2 = \widehat S_{I\times J_l}\, \Sigma^{-1}_{J_l\times J_l}\, \widehat S^{\top}_{I\times J_l}$ and $(\widehat\rho^c_l)^2 = \widehat S_{I\times J_l}\, \widehat S^{-1}_{J_l\times J_l}\, \widehat S^{\top}_{I\times J_l}$. We will first show that
$$P\big(\mathcal{M}_* \subset \mathcal{M}(\delta)\big) \ge 1 - O\big(\exp(-C\, n^{1-2\kappa})\big). \tag{12}$$
By Theorem 3.1, it is equivalent to show that
$$P\big(\mathcal{M}_* \subset \mathcal{M}(\delta) \cap \widehat{\mathcal{M}}_{c_* n^{-\kappa}}\big) \ge 1 - O\big(\exp(-C\, n^{1-2\kappa})\big).$$
By Step I in the proof of Theorem 3.2, it is also equivalent to show that
$$P\big(\mathcal{M}_* \subset \widetilde{\mathcal{M}}(\delta) \cap \widehat{\mathcal{M}}_{c_* n^{-\kappa}}\big) \ge 1 - O\big(\exp(-C\, n^{1-2\kappa})\big).$$
Finally, by Theorem 3.1 again, to prove Equation (12) is equivalent to prove that
$$P\big(\mathcal{M}_* \subset \widetilde{\mathcal{M}}(\delta)\big) \ge 1 - O\big(\exp(-C\, n^{1-2\kappa})\big). \tag{13}$$
Recall that in the proof of Theorem 3.1, we obtained the following result:
$$P\Big(\min_{i\in\mathcal{M}_*} \max_{l\in\mathcal{I}_{n_i}} \widetilde\rho^c_l \ge c_*\, n^{-\kappa}\Big) \ge 1 - O\big(\exp(-C\, n^{1-2\kappa})\big).$$
Suppose we can prove that
$$P\Big(\sum_{i=1}^p \big(\max_{l\in\mathcal{I}_{n_i}} \widetilde\rho^c_l\big)^2 \le c\, n^{-1+\tau_*+\tau}\, p\Big) \ge 1 - O\big(\exp(-C\, n^{1-2\kappa})\big). \tag{14}$$
Then we have, with probability larger than $1 - O\big(\exp(-C\, n^{1-2\kappa})\big)$,
$$\mathrm{Card}\Big\{1 \le i \le p : \max_{l\in\mathcal{I}_{n_i}} \widetilde\rho^c_l \ge \min_{i\in\mathcal{M}_*} \max_{l\in\mathcal{I}_{n_i}} \widetilde\rho^c_l\Big\} \le c\, p\, n^{-(1-2\kappa-\tau-\tau_*)},$$
which further indicates that the result in (13) holds because $\delta_n\, n^{1-2\kappa-\tau-\tau_*} \to \infty$. So to end the whole proof, we just need to show that the result in (14) holds.

For each $1 \le i \le p$, let $\widetilde\rho^c_i = \max_{l\in\mathcal{I}_{n_i}} \widetilde\rho^c_l$. Note that $(\widetilde\rho^c_i)^2 = \widehat S_{I\times J_i}\, \Sigma^{-1}_{J_i\times J_i}\, \widehat S^{\top}_{I\times J_i}$, with $\widehat S_{I\times J_i} = (\widehat S_{1,m_{i1}}, \ldots, \widehat S_{1,m_{ik}})$. By Assumption 1, we have
$$(\widetilde\rho^c_i)^2 \le c \sum_{t=1}^{k} (\widehat S_{1,m_{it}})^2 = c\, \|\widehat S_{I\times J_i}\|^2, \qquad \sum_{i=1}^p (\widetilde\rho^c_i)^2 \le c\, k\, k_n\, \|\widehat S_{I\times T}\|^2,$$
with $T = \{1, \ldots, d\}$. Notice that
$$P\Big(\big|\widehat S_{s,t} - \Sigma_{s,t}\big| \ge c\, n^{-\kappa}\Big) \le 2\exp\Big(-\tfrac{c^2}{\pi^2}\, n^{1-2\kappa}\Big),$$
thus similar to the argument in [13], we can easily get that the result in (14) holds. Finally, following the same idea of iterative screening as in the proof of Theorem 1 of [10], we finish the proof of the theorem.
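The exponential concentration of $\widehat S = \sin(\frac{\pi}{2}\widehat\tau)$ around the latent correlation, which drives all of the bounds above, can be observed empirically. The following Monte Carlo sketch (illustrative settings, not from the paper) checks that the average deviation shrinks as $n$ grows:

```python
import numpy as np

def kendall_tau(x, y):
    # sample Kendall's tau via pairwise concordance signs
    n = len(x)
    dx = np.sign(x[:, None] - x[None, :])
    dy = np.sign(y[:, None] - y[None, :])
    return (dx * dy).sum() / (n * (n - 1))

def mean_abs_dev(n, rho=0.5, reps=50, seed=2):
    # average |sin(pi/2 * tau_hat) - rho| over Monte Carlo replications
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    devs = []
    for _ in range(reps):
        z = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        s_hat = np.sin(np.pi / 2 * kendall_tau(z[:, 0], z[:, 1]))
        devs.append(abs(s_hat - rho))
    return float(np.mean(devs))

small_n, large_n = mean_abs_dev(50), mean_abs_dev(400)
assert large_n < small_n   # deviations shrink with the sample size
```

The observed decay is consistent with the $O(\exp(-C n^{1-2\kappa}))$ tail bound of Lemma .1, though a simulation of this size only illustrates, and cannot verify, the exponential rate.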
Acknowledgements
Yong He's research is partially supported by the grant of the National Science Foundation of China (NSFC 11801316). Xinsheng Zhang's research is partially supported by the grant of the National Science Foundation of China (NSFC 11571080). Jiadong Ji's work is supported by the grant of the National Science Foundation of China (NSFC 81803336) and the Natural Science Foundation of Shandong Province (ZR2018BH033).
References
[1] B. M. Bolstad, R. A. Irizarry, M. Astrand, and T. P. Speed. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2):185–193, 2003.
[2] Emmanuel Candes and Terence Tao. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351, 2007.
[3] Annie P. Chiang, John S. Beck, Hsan-Jan Yen, Marwan K. Tayeh, Todd E. Scheetz, Ruth E. Swiderski, Darryl Y. Nishimura, Terry A. Braun, Kwang-Youn A. Kim, Jian Huang, et al. Homozygosity mapping with SNP arrays identifies TRIM32, an E3 ubiquitin ligase, as a Bardet-Biedl syndrome gene (BBS11). Proceedings of the National Academy of Sciences, 103(16):6287–6292, 2006.
[4] H. Cui, R. Li, and W. Zhong. Model-free feature screening for ultrahigh dimensional discriminant analysis. Journal of the American Statistical Association, 110(510):630–641, 2015.
[5] J. Fan and J. Lv. Sure independence screening. Wiley StatsRef, 2017.
[6] J. Fan, Y. Ma, and W. Dai. Nonparametric independence screening in sparse ultra-high dimensional varying coefficient models. Journal of the American Statistical Association, 109(507):1270, 2014.
[7] Jianqing Fan, Yang Feng, and Rui Song. Nonparametric independence screening in sparse ultra-high dimensional additive models. Journal of the American Statistical Association, 106(494):544–557, 2011.
[8] Jianqing Fan, Yang Feng, and Rui Song. Nonparametric independence screening in sparse ultra-high-dimensional additive models. Journal of the American Statistical Association, 106(494):544–557, 2011.
[9] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
[10] Jianqing Fan and Jinchi Lv. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society, Series B, 70(5):849–911, 2008.
[11] Jian Huang, Joel L. Horowitz, and Fengrong Wei. Variable selection in nonparametric additive models. Annals of Statistics, 38(4):2282, 2010.
[12] Rafael A. Irizarry, Bridget Hobbs, Francois Collin, Yasmin D. Beazer-Barclay, Kristen J. Antonellis, Uwe Scherf, and Terence P. Speed. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4(2):249–264, 2003.
[13] Xin-Bing Kong, Zhi Liu, Yuan Yao, and Wang Zhou. Sure screening by ranking the canonical correlations. Test, 26(1):1–25, 2017.
[14] Gaorong Li, Heng Peng, Jun Zhang, and Lixing Zhu. Robust rank correlation based screening. Annals of Statistics, 40(3):1846–1877, 2012.
[15] Runze Li, Wei Zhong, and Liping Zhu. Feature screening via distance correlation learning. Journal of the American Statistical Association, 107(499):1129–1139, 2012.
[16] Jingyuan Liu, Wei Zhong, and Runze Li. A selective overview of feature screening for ultrahigh-dimensional data. Science China Mathematics, 58(10):1–22, 2015.
[17] Lukas Meier, Sara van de Geer, and Peter Bühlmann. High-dimensional additive modeling. Annals of Statistics, 37(6B):3779–3821, 2009.
[18] Todd E. Scheetz, Kwang-Youn A. Kim, Ruth E. Swiderski, Alisdair R. Philp, Terry A. Braun, Kevin L. Knudtson, Anne M. Dorrance, Gerald F. DiBona, Jian Huang, Thomas L. Casavant, et al. Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences, 103(39):14429–14434, 2006.
[19] R. Song, W. Lu, S. Ma, and X. Jessie Jeng. Censored rank independence screening for high-dimensional survival data. Biometrika, 101(4):799–814, 2014.
[20] Tingni Sun and Cun-Hui Zhang. Scaled sparse linear regression. Biometrika, 99(4):879–898, 2012.
[21] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267–288, 1996.
[22] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49–67, 2006.
[23] Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.
[24] Liping Zhu, Lexin Li, Runze Li, and Lixing Zhu. Model-free feature screening for ultrahigh dimensional data. Journal of the American Statistical Association, 106(496):1464–1475, 2011.
[25] Hui Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.

Table: The proportions of containing Model … in the active set, comparing RRCS, SIS, CCH and CCK for various $(p, n)$, under multivariate normal and multivariate $t$ covariates with $\epsilon \sim N(0,1)$ or $\epsilon \sim t$ errors.

Table: The proportions of containing Model … in the active set, comparing RRCS, SIS, CCH and CCK for various $(p, n)$, under multivariate normal and multivariate $t$ covariates with $\epsilon \sim N(0,1)$ or $\epsilon \sim t$ errors.
Table: The proportions of containing Model … in the active set, comparing RRCS, SIS, CCH and CCK for various $(p, n)$, under multivariate normal and multivariate $t$ covariates with $\epsilon \sim N(0,1)$ or $\epsilon \sim t$ errors.