Doubly Robust Sure Screening for Elliptical Copula Regression Model
Yong He∗, Liang Zhang†, Jiadong Ji‡, Xinsheng Zhang§

Regression analysis has always been a hot research topic in statistics. We propose a very flexible semi-parametric regression model called the Elliptical Copula Regression (ECR) model, which covers a large class of linear and nonlinear regression models such as the additive regression model and the single index model. Besides, the ECR model can capture the heavy-tail characteristic and tail dependence between variables, so it can be widely applied in many areas such as econometrics and finance. In this paper we mainly focus on the feature screening problem for the ECR model in the ultra-high dimensional setting. We propose a doubly robust sure screening procedure for the ECR model, in which two types of correlation coefficient are involved: Kendall's tau correlation and canonical correlation. Theoretical analysis shows that the procedure enjoys the sure screening property, i.e., with probability tending to 1, the screening procedure selects all important variables and substantially reduces the dimensionality to a moderate size relative to the sample size. Thorough numerical studies are conducted to illustrate its advantage over existing sure independence screening methods, so that it can be used as a safe replacement for the existing procedures in practice. At last, the proposed procedure is applied to a gene-expression real data set to show its empirical usefulness.
Keywords:
Canonical Correlation; Doubly Robust; Elliptical Copula; Kendall's tau; Sure Screening.
1 Introduction

In the last decades, data sets with large dimensionality have arisen in various areas such as finance and chemistry due to the great development of computer storage capacity and processing power, and feature selection with such big data is of fundamental importance to many contemporary applications. The sparsity assumption is common in the high dimensional feature selection literature, i.e., only a few variables are critical for in-sample fitting and out-of-sample forecasting of a certain response of interest. Specifically, in the linear regression setting, statisticians care about how to select the important variables from thousands or even millions of variables. In fact, a huge amount of literature has sprung up since the appearance of the Lasso estimator [21]. To name a few, there exist SCAD by [9], the Adaptive Lasso by [25], MCP by [23], the Dantzig selector by [2], and the group Lasso by [22]. This research area is very active, and as a result, this list of references is illustrative rather than comprehensive. The aforementioned feature selection methods perform well when the dimensionality is not "too" large, theoretically in the sense that it is of polynomial order of the sample size. However, in the ultra-high dimensional setting where the dimensionality is of exponential order of the sample size, the aforementioned methods may encounter both theoretical and computational issues. Take the Dantzig selector for example: the Uniform Uncertainty Principle (UUP) condition that guarantees the oracle property may be difficult to satisfy, and the computational cost of implementing linear programs in ultra-high dimension would increase dramatically.
(∗ School of Statistics, Shandong University of Finance and Economics, Jinan, China. † School of Statistics, Shandong University of Finance and Economics, Jinan, China. ‡ School of Statistics, Shandong University of Finance and Economics, Jinan, China; Email: [email protected]. § School of Management, Fudan University, Shanghai, China.)

[10] first proposed Sure Independence Screening (SIS) and its further improvement, Iterative Sure Independence Screening (ISIS), to alleviate the computational burden in the ultra-high dimensional setting. The basic idea goes as follows. In the first step, reduce the dimensionality to a moderate size relative to the sample size by sorting the marginal Pearson correlations between the covariates and the response and removing those covariates whose marginal correlation with the response is lower than a certain threshold. In the second stage, perform Lasso, SCAD, etc. on the variables that survived the first step. SIS (ISIS) turns out to enjoy the sure screening property under certain conditions, that is, with probability tending to 1, the screening procedure selects all important variables. The last decade has witnessed plenty of variants of SIS that handle ultra-high dimensionality for more general regression models. [7] proposed a sure screening procedure for ultra-high dimensional additive models. [6] proposed a sure screening procedure for ultra-high dimensional varying coefficient models. [19] proposed censored rank independence screening of high-dimensional survival data, which is robust to predictors that contain outliers and works well for a general class of survival models. [24] and [4] proposed model-free feature screening. [14] proposed to screen the Kendall's tau correlation while [15] proposed to screen the distance correlation, both of which show robustness to heavy-tailed data.
[13] proposed to screen the canonical correlations between the response and all possible sets of k variables, which performs well particularly for selecting variables that are jointly important with other variables but marginally insignificant. This list of references for screening methods is also illustrative rather than comprehensive. For the development of screening methods in the last decade, we refer to the review papers of [16] and [5].

The main contribution of the paper is two-fold. On the one hand, we propose a very flexible semi-parametric regression model called the Elliptical Copula Regression (ECR) model, which can capture the thick-tail property of variables and the tail dependence between variables. Specifically, the ECR model has the following representation:

f_0(Y) = β_0 + Σ_{j=1}^p β_j f_j(X_j) + ε,   (1)

where Y is the response variable, X_1, ..., X_p are predictors, and the f_j(·) are univariate monotone functions. We say (Y, X^⊤)^⊤ = (Y, X_1, ..., X_p)^⊤ satisfies an elliptical copula regression model if the marginally transformed random vector Z̃ = (Ỹ, X̃^⊤)^⊤ := (f_0(Y), f_1(X_1), ..., f_p(X_p))^⊤ follows an elliptical distribution. From the representation of the ECR model in (1), it can be seen that the ECR model covers a large class of linear and nonlinear regression models, such as the additive regression model and the single index model, which makes it applicable in many areas such as econometrics, finance and bioinformatics. On the other hand, we propose a doubly robust dimension reduction procedure for the ECR model in the ultra-high dimensional setting. The doubly robustness is achieved by combining two types of correlation: Kendall's tau correlation and canonical correlation.
The canonical correlation is employed to capture the joint information of a set of covariates and the joint relationship between the response and this set of covariates. Note that for the ECR model in (1), only (Y, X^⊤)^⊤ is observable rather than the transformed variables. Thus the Kendall's tau correlation is exploited to estimate the canonical correlations, due to its invariance under strictly monotone marginal transformations. The dimension reduction procedure for the ECR model is achieved by sorting the estimated canonical correlations and putting into the active set every variable that attains a relatively high canonical correlation at least once. The proposed screening procedure enjoys the sure screening property and substantially reduces the dimensionality to a moderate size under mild conditions. Numerical results show that the proposed approach enjoys a great advantage over state-of-the-art procedures and thus can be used as a safe replacement.

We introduce some notation adopted in the paper. For any vector μ = (μ_1, ..., μ_d)^⊤ ∈ R^d, let μ_{−i} denote the (d−1)×1 vector obtained by removing the i-th entry from μ. Let |μ|_0 = Σ_{i=1}^d I{μ_i ≠ 0}, |μ|_1 = Σ_{i=1}^d |μ_i|, |μ|_2 = (Σ_{i=1}^d μ_i^2)^{1/2} and |μ|_∞ = max_i |μ_i|. Let A = [a_{ij}] ∈ R^{d×d}; ‖A‖_L = max_{1≤j≤d} Σ_{i=1}^d |a_{ij}|, ‖A‖_∞ = max_{i,j} |a_{ij}| and ‖A‖_1 = Σ_{i=1}^d Σ_{j=1}^d |a_{ij}|. We use λ_min(A) and λ_max(A) to denote the smallest and largest eigenvalues of A, respectively. For a set H, denote by |H| the cardinality of H. For a real number x, denote by ⌊x⌋ the largest integer smaller than or equal to x.
For two sequences of real numbers {a_n} and {b_n}, we write a_n = O(b_n) if there exists a constant C such that |a_n| ≤ C|b_n| holds for all n, write a_n = o(b_n) if lim_{n→∞} a_n/b_n = 0, and write a_n ≍ b_n if there exist constants c and C such that c ≤ a_n/b_n ≤ C for all n.

The rest of the paper is organized as follows: in Section 2, we introduce the elliptical copula regression model and present the proposed dimension reduction procedure, which ranks the estimated canonical correlations. In Section 3, we present the theoretical properties of the proposed procedure, with detailed proofs collected in the Appendix. In Section 4, we conduct thorough numerical simulations to investigate the empirical performance of the procedure. In Section 5, a real gene-expression data example is given to illustrate its empirical usefulness. At last, we give a brief discussion on possible future directions in the last section.

2 Elliptical Copula Regression Model

To present the Elliptical Copula Regression model, we first need to introduce the elliptical distribution. The elliptical distribution generalizes the multivariate normal distribution to include symmetric distributions with heavy tails, like the multivariate t-distribution. Elliptical distributions are commonly used in robust statistics to evaluate proposed multivariate statistical procedures. Specifically, the definition of the elliptical distribution is given as follows:

Definition 2.1. (Elliptical distribution)
Let μ ∈ R^p and Σ ∈ R^{p×p} with rank(Σ) = q ≤ p. A p-dimensional random vector Z is elliptically distributed, denoted by Z ∼ ED_p(μ, Σ, ζ), if it has the stochastic representation Z =_d μ + ζAU, where U is a random vector uniformly distributed on the unit sphere S^{q−1} in R^q, ζ ≥ 0 is a random variable independent of U, and A ∈ R^{p×q} is a deterministic matrix satisfying AA^⊤ = Σ, with Σ called the scatter matrix.

The representation Z =_d μ + ζAU is not identifiable since we can rescale ζ and A. We require Eζ^2 = q to make the model identifiable, which makes the covariance matrix of Z equal to Σ. In addition, we assume Σ is non-singular, i.e., q = p. In this paper, we only consider continuous elliptical distributions with Pr(ζ = 0) = 0.

Another equivalent definition of the elliptical distribution is through its characteristic function, which has the form exp(i t^⊤ μ) ψ(t^⊤ Σ t), where ψ(·) is a properly defined characteristic function and i = √−1; ζ and ψ are mutually determined by each other. Given the definition of the elliptical distribution, we are ready to introduce the Elliptical Copula Regression (ECR) model.

Definition 2.2. (Elliptical copula regression model)
Let f = {f_0, f_1, ..., f_p} be a set of monotone univariate functions and Σ be a positive-definite correlation matrix with diag(Σ) = I. We say a d-dimensional random variable Z = (Y, X_1, ..., X_p)^⊤ satisfies the Elliptical Copula Regression model if and only if Z̃ = (Ỹ, X̃^⊤)^⊤ = (f_0(Y), f_1(X_1), ..., f_p(X_p))^⊤ ∼ ED_d(0, Σ, ζ) with Eζ^2 = d and

Ỹ = X̃^⊤ β + ε,  or equivalently,  f_0(Y) = Σ_{j=1}^p β_j f_j(X_j) + ε,   (2)

where Y is the response, X = (X_1, ..., X_p)^⊤ are covariates, and d = p + 1.

We require diag(Σ) = I in Definition 2.2 for identifiability, because shifting and scaling are absorbed into the marginal functions f. For ease of presentation, we denote Z = (Y, X_1, ..., X_p)^⊤ ∼ ECR(Σ, ζ, f) in the following sections. The ECR model allows the data to come from heavy-tailed distributions and is thus more flexible and more useful in modelling many modern data sets, including financial data, genomics data and fMRI data.

Notice that the transformed variable Ỹ and the transformed covariates X̃ obey the linear regression model; however, the transformed variables are unobservable, and only Y and X are observable. In the following, by virtue of canonical correlation and Kendall's tau correlation, we present an adaptive screening procedure that avoids estimating the marginal transformation functions f while capturing the joint information of a set of covariates and the joint relationship between the response and this set of covariates. In this section we present the adaptive doubly robust screening for the ECR model.
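Before describing the screening machinery, the model of Definition 2.2 can be made concrete with a small simulation sketch. This is our illustration, not code from the paper: the multivariate t latent distribution, the equi-correlation structure, the choice f_0 = log, identity covariate transforms, and the coefficient values are all arbitrary assumptions.

```python
import numpy as np

def sample_ecr(n, p, beta, df=3, rho=0.3, seed=0):
    """Draw n observations from an illustrative ECR model.

    The latent vector (Y~, X~) is multivariate t (a member of the
    elliptical family); the observed response is Y = exp(Y~), i.e. the
    monotone transform f0 = log, and the covariate transforms are taken
    as the identity for simplicity.
    """
    rng = np.random.default_rng(seed)
    # Equi-correlated scatter matrix for the latent covariates.
    Sigma = rho * np.ones((p, p)) + (1 - rho) * np.eye(p)
    L = np.linalg.cholesky(Sigma)
    # Multivariate t via a normal scale mixture with chi-square mixing.
    g = rng.standard_normal((n, p)) @ L.T
    w = np.sqrt(df / rng.chisquare(df, size=(n, 1)))
    X_tilde = g * w
    eps = rng.standard_normal(n)
    y_tilde = X_tilde @ beta + eps   # linear model on the latent scale
    Y = np.exp(y_tilde)              # observed Y = f0^{-1}(Y~) with f0 = log
    return Y, X_tilde

beta = np.zeros(10)
beta[:2] = [1.0, -0.5]               # an arbitrary sparse coefficient vector
Y, X = sample_ecr(200, 10, beta)
```

On the latent scale the regression is linear, but the observed Y is heavy-tailed and nonlinearly related to the covariates, which is exactly the setting the screening procedure below is designed for.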
We first introduce the Kendall's-tau-based estimator of the correlation matrix in subsection 2.2.1, then the Kendall's-tau-based estimator of the canonical correlation in subsection 2.2.2; both are of fundamental importance for the detailed doubly robust screening procedure introduced in subsection 2.2.3.
In this section we present the estimator of the correlation matrix based on Kendall's tau. Let Z_1, ..., Z_n be n independent observations, where Z_i = (Y_i, X_{i1}, ..., X_{ip})^⊤. The sample Kendall's tau correlation of Z_j and Z_k is defined by

τ̂_{j,k} = (2 / (n(n−1))) Σ_{1≤i<i'≤n} sign((Z_{ij} − Z_{i'j})(Z_{ik} − Z_{i'k})),

which depends on the observations only through their ranks and is therefore invariant under strictly monotone marginal transformations. For an elliptical copula, the latent correlation can be recovered from Kendall's tau through the transformation

Ŝ_{j,k} = sin((π/2) τ̂_{j,k}),   (3)

and we denote by Ŝ = [Ŝ_{j,k}] the resulting Kendall's-tau-based estimator of Σ.
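A minimal sketch of this estimator follows (our illustration, not the authors' code; a direct O(n²p²) implementation with no optimization):

```python
import numpy as np

def kendall_tau(x, y):
    """Sample Kendall's tau: (2/(n(n-1))) * sum_{i<i'} sign((x_i-x_i')(y_i-y_i'))."""
    n = len(x)
    s = 0.0
    for i in range(n - 1):
        s += np.sum(np.sign((x[i] - x[i + 1:]) * (y[i] - y[i + 1:])))
    return 2.0 * s / (n * (n - 1))

def rank_correlation_matrix(Z):
    """Kendall's-tau-based estimator S_hat of the latent correlation matrix,
    S_hat[j,k] = sin(pi/2 * tau_hat[j,k]); invariant under strictly monotone
    marginal transforms because tau depends only on ranks."""
    n, d = Z.shape
    S = np.eye(d)
    for j in range(d):
        for k in range(j + 1, d):
            S[j, k] = S[k, j] = np.sin(0.5 * np.pi * kendall_tau(Z[:, j], Z[:, k]))
    return S
```

Because τ̂ is rank-based, applying any strictly increasing transform to a column (e.g. exp or a cubic) leaves Ŝ unchanged, which is the key to estimating correlations on the latent scale without knowing f.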
In Section 2.2.1 we presented the Kendall's-tau-based estimator of Σ, denoted by Ŝ. Thus the canonical correlation ρ^c can be naturally estimated by

ρ̂^c = ( Ŝ_{I×J} Ŝ_{J×J}^{−1} Ŝ_{I×J}^⊤ )^{1/2},   (4)

where I indexes the transformed response and J indexes the set of covariates under consideration. If Ŝ_{J×J} is not positive definite (not invertible), we first project Ŝ_{J×J} onto the cone of positive semidefinite matrices. In particular, we propose to solve the following convex optimization problem:

S̃_{J×J} = argmin_{S ⪰ 0} ‖Ŝ_{J×J} − S‖_∞.

The matrix element-wise infinity norm ‖·‖_∞ is adopted for the sake of further technical developments. Empirically, we can use a surrogate projection procedure that computes a spectral decomposition of Ŝ_{J×J} and truncates all of the negative eigenvalues to zero. Numerical study shows that this procedure works well.
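The estimator (4) together with the eigenvalue-truncation surrogate can be sketched as follows (our illustration; the small ridge term is our own numerical guard against exact singularity, not part of the paper's procedure):

```python
import numpy as np

def psd_truncate(S):
    """Surrogate projection: truncate negative eigenvalues of a symmetric
    matrix to zero, yielding a positive semidefinite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.clip(w, 0.0, None)) @ V.T

def canonical_corr(S, i, J, ridge=1e-10):
    """rho_hat^c = sqrt(S_IJ S_JJ^{-1} S_IJ^T) for response index i and
    covariate index set J, both indexing the rank-based matrix S."""
    J = list(J)
    S_IJ = S[i, J].reshape(1, -1)
    S_JJ = psd_truncate(S[np.ix_(J, J)]) + ridge * np.eye(len(J))
    val = S_IJ @ np.linalg.solve(S_JJ, S_IJ.T)
    return float(np.sqrt(max(val[0, 0], 0.0)))
```

When J is a single covariate, ρ̂^c reduces to the absolute marginal (rank-based) correlation, so the canonical-correlation screen strictly generalizes marginal screening.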
In this section, we present the screening procedure, which sorts the canonical correlations estimated via Kendall's tau.

The screening procedure goes as follows: first collect all sets of k transformed variables, whose total number adds up to C_p^k, the combinatorial number, i.e., {X̃_{l,m_1}, ..., X̃_{l,m_k}} with l = 1, ..., C_p^k. For each k-variable set {X̃_{l,m_1}, ..., X̃_{l,m_k}}, we denote its canonical correlation with f_0(Y) by ρ_l^c and estimate it by

ρ̂_l^c = ( Ŝ_{I×J} Ŝ_{J×J}^{−1} Ŝ_{I×J}^⊤ )^{1/2},

where Ŝ is the rank-based estimator of the correlation matrix introduced in Section 2.2.1. Then we sort these canonical correlations {ρ̂_l^c, l = 1, ..., C_p^k} and select into the active set the variables that attain a relatively large canonical correlation at least once.

Specifically, let M* = {1 ≤ i ≤ p : β_i ≠ 0} be the true model with size s = |M*|, and define the sets

I_n^i = { l : (X_i, X_{i_1}, ..., X_{i_{k−1}}) with max_{1≤m≤k−1} |i_m − i| ≤ k_n is used in calculating ρ̂_l^c },  i = 1, ..., p,

where k_n is a parameter determining a neighborhood set within which variables are combined with X_i to calculate the canonical correlation with the response. Finally we estimate the active set as follows:

M̂_{t_n} = { 1 ≤ i ≤ p : max_{l ∈ I_n^i} ρ̂_l^c > t_n },

where t_n is a threshold parameter which controls the size of the estimated active set.

If we set k_n = p, then all k-variable sets including X_i are considered in I_n^i. However, if there is a natural index for all the covariates such that only neighboring covariates are related, which is often the case in portfolio tracking in finance, it is more appropriate to consider a k_n much smaller than p. As for the parameter k, a relatively large k may bring more accurate results, but will increase the computational burden.
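Putting the pieces together, the screening step can be sketched as below. This is our illustration, not the authors' code: S is assumed to be the (p+1)×(p+1) rank-based correlation matrix with the response in position 0, the helper _rho_c computes (4), and for simplicity the k_n-neighborhood is enforced through the index span of each candidate set.

```python
import numpy as np
from itertools import combinations
from math import log

def _rho_c(S, J):
    # Canonical correlation of the response (index 0 of S) with covariate set J.
    S_IJ = S[0, J].reshape(1, -1)
    S_JJ = S[np.ix_(J, J)]
    val = S_IJ @ np.linalg.solve(S_JJ, S_IJ.T)
    return float(np.sqrt(max(val[0, 0], 0.0)))

def cch_screen(S, n, k=2, k_n=2):
    """One-step screening sketch: score each covariate by the largest
    canonical correlation over the k-variable sets (within a k_n index
    neighborhood) containing it, then keep the top floor(n / log n)."""
    p = S.shape[0] - 1
    score = np.zeros(p)
    for J in combinations(range(1, p + 1), k):  # rows/cols 1..p of S
        if max(J) - min(J) > k_n:               # neighborhood restriction
            continue
        rho = _rho_c(S, list(J))
        for j in J:                             # credit every set member
            score[j - 1] = max(score[j - 1], rho)
    keep = int(n / log(n))
    return set(np.argsort(-score)[:keep].tolist())  # 0-based covariate indices
```

Because every member of a high-scoring set is credited, a covariate that is marginally uncorrelated with the response can still survive if it is jointly informative with a neighbor.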
Empirical simulation results show that the performance obtained by taking k = 2 is already good enough and substantially better than taking k = 1, which is equivalent to sorting marginal correlations.

3 Theoretical properties
In this section, we present the theoretical properties of the proposed approach. In the screening problem, what we care about most is whether the true non-zero index set M* is contained in the estimated active set M̂_{t_n} with high probability for a properly chosen threshold t_n, i.e., whether the procedure has the sure screening property. To this end, we assume the following three assumptions hold.

Assumption 1
Assume p > n and log p = O(n^ξ) for some ξ ∈ (0, 1 − 2κ).

Assumption 2
For all l = 1, ..., C_p^k, λ_max((Σ_{J_l×J_l})^{−1}) ≤ c_0 for some constant c_0 > 0, where J_l = {m_{l1}, ..., m_{lk}} is the index set of variables in the l-th k-variable set.

Assumption 3
For some 0 ≤ κ < 1/2, min_{i ∈ M*} max_{l ∈ I_n^i} ρ_l^c ≥ c_1 n^{−κ}.

Assumption 1 specifies the scaling between the dimensionality p and the sample size n. Assumption 2 requires that the minimum eigenvalue of the covariance matrix of any k covariates is bounded away from zero. Assumption 3 is the fundamental basis for guaranteeing the sure screening property; it means that any important variable is correlated with the response jointly with some other variables. Technically, Assumption 3 ensures that an important variable will not be veiled by the statistical approximation error resulting from estimating the canonical correlation.

Theorem 3.1.
Assume that Assumptions 1-3 hold; then for some positive constants c_* and C, as n goes to infinity, we have

P( min_{i ∈ M*} max_{l ∈ I_n^i} ρ̂_l^c ≥ c_* n^{−κ} ) ≥ 1 − O( exp(−C n^{1−2κ}) ),

and

P( M* ⊂ M̂_{c_* n^{−κ}} ) ≥ 1 − O( exp(−C n^{1−2κ}) ).

Theorem 3.1 shows that, by setting the threshold of order c_* n^{−κ}, all important variables can be selected with probability tending to 1. However, the constant c_* remains unknown. To refine the theoretical result, we assume the following assumption holds.

Assumption 4
For some 0 ≤ κ < 1/2, max_{i ∉ M*} max_{l ∈ I_n^i} ρ_l^c < c_* n^{−κ}.

Assumption 4 requires that if a variable X_i is not important, then the canonical correlations between the response and all k-variable sets containing X_i are upper bounded by c_* n^{−κ}, and this holds uniformly for all unimportant variables.

Theorem 3.2.
Assume that Assumptions 1-4 hold; then for some constants c_* and C, we have

P( M* = M̂_{c_* n^{−κ}} ) ≥ 1 − O( exp(−C n^{1−2κ}) ),   (5)

and in particular

P( |M̂_{c_* n^{−κ}}| = s ) ≥ 1 − O( exp(−C n^{1−2κ}) ),   (6)

where s is the size of M*.

Theorem 3.2 guarantees the exact sure screening property without any condition on k_n. Besides, the theorem guarantees the existence of c_* and C; however, it still remains unknown how to select the constant c_*. If we know that s < n/log n in advance, one can select a constant c such that the size of M̂_{c n^{−κ}} is approximately n. Obviously, we have M̂_{c_* n^{−κ}} ⊂ M̂_{c n^{−κ}} with probability tending to 1. The following theorem, summarizing the above discussion, is particularly useful in practice.

Theorem 3.3. Assume that Assumptions 1-4 hold. If s = |M*| ≤ n/log n, then for any constant γ > 0 we have

P( M* ⊂ M_γ ) ≥ 1 − O( exp(−C n^{1−2κ}) ),

where M_γ = { 1 ≤ i ≤ p : max_{l ∈ I_n^i} ρ̂_l^c is among the largest ⌊γn⌋ of max_{l ∈ I_n^1} ρ̂_l^c, ..., max_{l ∈ I_n^p} ρ̂_l^c }.

The above theorem guarantees that one can reduce the dimensionality to a moderate size relative to n while ensuring the sure screening property, which in turn permits more sophisticated and computationally efficient variable selection methods.

Theorem 3.3 relies heavily on Assumption 4. If there is a natural order of the variables, and any important variable contributes to the response together with only its adjacent variables, then Assumption 4 can be removed entirely by imposing a constraint on the parameter k_n. The following theorem summarizes the above discussion.

Theorem 3.4.
Assume Assumptions 1-3 hold, λ_max(Σ) ≤ c_2 n^τ for some τ ≥ 0 and c_2 > 0, and further assume k_n = c_3 n^{τ*} for some constants c_3 > 0 and τ* ≥ 0. If 2κ + τ + τ* < 1, then there exists some θ ∈ [0, 1 − 2κ − τ − τ*) such that for γ = c_4 n^{−θ} with c_4 > 0, we have for some constant C > 0,

P( M* ⊂ M_γ ) ≥ 1 − O( exp(−C n^{1−2κ}) ).

The assumption k_n = c_3 n^{τ*} is reasonable in many fields such as biology and finance. An intuitive example arises in genomic association studies, where millions of genes tend to cluster together and function together with adjacent genes.

The procedure of ranking the estimated canonical correlations and reducing the dimension in one step from a large p to ⌊n/log n⌋ is a crude and greedy algorithm and may retain many spurious covariates due to the strong correlations among them. Motivated by the ISIS method in [10], we propose a similar iterative procedure which achieves sure screening in multiple steps. The iterative procedure works as follows. Let the shrinking factor δ → 0 satisfy δ n^{1−2κ−τ−τ*} → ∞ as n → ∞, and successively perform dimensionality reduction until the number of remaining variables drops below the sample size n. Specifically, define the subset

M^1(δ) = { 1 ≤ i ≤ p : max_{l ∈ I_n^i} ρ̂_l^c is among the largest ⌊δp⌋ of all }.   (7)

In the first step we select a subset M^1(δ) of ⌊δp⌋ variables by Equation (7). In the next step, we start from the variables indexed in M^1(δ) and apply a similar procedure as (7), obtaining a sub-model M^2(δ) ⊂ M^1(δ) with size ⌊δ^2 p⌋. Iterating the steps above, we finally obtain a sub-model M^k(δ) with size ⌊δ^k p⌋ < n.

Theorem 3.5.
Assume that the conditions in Theorem 3.4 hold and let δ → 0 satisfy δ n^{1−2κ−τ−τ*} → ∞ as n → ∞; then we have

P( M* ⊂ M^k(δ) ) ≥ 1 − O( exp(−C n^{1−2κ}) ).

The above theorem guarantees the sure screening property of the iterative procedure, and the step size δ can be chosen in the same way as for ISIS in [10].

4 Simulation studies

In this section we conduct thorough numerical simulations to illustrate the empirical performance of the proposed doubly robust screening procedure (denoted as CCH). Besides, we compare the proposed procedure with three methods: the method proposed by [13] (denoted as CCK), the rank correlation screening approach proposed by [14] (denoted as RRCS) and the original SIS procedure of [10]. To illustrate the doubly robustness of the proposed procedure, we consider the following five models, which include linear regression with thick-tail covariates and error term, a single-index model with thick-tail error term, an additive model and a more general regression model.
Model 1
Linear model setting adapted from [13]: Y_i = β_1 X_{i1} + ··· + β_p X_{ip} + ε_i, where only the first two entries of β are nonzero (the first equals 1 and the second is negative) and the last p − 2 entries are zero. X is sampled from the multivariate normal N(0, Σ) or the multivariate t-distribution with 1 degree of freedom, noncentrality parameter 0 and scale matrix Σ. The diagonal entries of Σ are 1 and the off-diagonal entries are ρ; the error term ε is independent of X and generated from the standard normal distribution or the standard t-distribution with 1 degree of freedom.

Model 2
Linear model setting adapted from [14]: Y_i = β_1 X_{i1} + ··· + β_p X_{ip} + ε_i, where only the first three entries of β are nonzero (the first equals 5) and the last p − 3 entries are zero. X is sampled from the multivariate normal N(0, Σ) or the multivariate t-distribution with 1 degree of freedom, noncentrality parameter 0 and scale matrix Σ. The diagonal entries of Σ are 1 and the off-diagonal entries are ρ; the error term ε is independent of X and generated from the standard normal distribution or the standard t-distribution with 1 degree of freedom.

Model 3
Single-index model setting: H(Y) = X^⊤ β + ε. We set H(Y) = log(Y), which corresponds to the limiting case λ → 0 of the Box-Cox transformation (|Y|^λ sgn(Y) − 1)/λ. The error term ε is independent of X and generated from the standard normal distribution or the standard t-distribution with 3 degrees of freedom. Only the leading entries of the regression coefficient vector β are nonzero (the first equals 3); the remaining entries are zero. X is sampled from the multivariate normal N(0, Σ) or the multivariate t-distribution with 3 degrees of freedom, where the diagonal entries of Σ are 1 and the off-diagonal entries are ρ.

Model 4
Additive model from [17]:

Y_i = 5 f_1(X_{i1}) + 3 f_2(X_{i2}) + 4 f_3(X_{i3}) + 6 f_4(X_{i4}) + ε_i,

with

f_1(x) = x,  f_2(x) = (2x − 1)^2,  f_3(x) = sin(2πx) / (2 − sin(2πx)),

and

f_4(x) = 0.1 sin(2πx) + 0.2 cos(2πx) + 0.3 sin^2(2πx) + 0.4 cos^3(2πx) + 0.5 sin^3(2πx).

The covariates X = (X_1, ..., X_p)^⊤ are generated by

X_j = (W_j + tU) / (1 + t),  j = 1, ..., p,

where W_1, ..., W_p and U are i.i.d. Uniform[0,1]. For t = 0 this is the independent uniform case, while t = 1 corresponds to a design with correlation 0.5 between all covariates. The error term ε is sampled from N(0, 1.74). The β in this model setting is obviously (5, 3, 4, 6, 0, ..., 0)^⊤ with the last p − 4 entries equal to zero.

Model 5
A model generated by combining Model 3 and Model 4:

H(Y_i) = 5 f_1(X_{i1}) + 3 f_2(X_{i2}) + 4 f_3(X_{i3}) + 6 f_4(X_{i4}) + ε_i,

where H(Y) is the same as in Model 3 and the functions {f_1, f_2, f_3, f_4} are the same as in Model 4. The covariates X = (X_1, ..., X_p)^⊤ are generated in the same way as in Model 4. The error term ε is sampled from N(0, 1.74).

Table 1: The proportions of containing the true model (Model 4 and Model 5) in the active set

(p, n)    Method     Model 4                  Model 5
                   t=0    t=0.5  t=1       t=0    t=0.5  t=1
(100,20)  RRCS     0.038  0.046  0.132     0.038  0.046  0.132
          SIS      0.048  0.056  0.142     0.002  0.012  0.032
          CCH1     0.938  0.968  0.938     0.938  0.968  0.938
          CCK1     0.380  0.356  0.584     0.142  0.146  0.300
          CCH2     0.978  0.956  0.938     0.978  0.956  0.938
          CCK2     0.506  0.458  0.682     0.166  0.144  0.302
(100,50)  RRCS     0.420  0.352  0.504     0.420  0.352  0.504
          SIS      0.512  0.406  0.496     0.100  0.110  0.182
          CCH1     1      1      1         1      1      1
          CCK1     0.922  0.938  0.990     0.690  0.614  0.836
          CCH2     1      1      1         1      1      1
          CCK2     0.988  0.990  0.996     0.674  0.588  0.810
(500,20)  RRCS     0.002  0.004  0.020     0.002  0.004  0.020
          SIS      0.002  0.008  0.026     0      0      0.008
          CCH1     0.804  0.884  0.812     0.804  0.884  0.812
          CCK1     0.168  0.176  0.354     0.038  0.068  0.168
          CCH2     0.858  0.890  0.842     0.858  0.890  0.842
          CCK2     0.222  0.224  0.454     0.052  0.086  0.178
(500,50)  RRCS     0.098  0.092  0.228     0.098  0.092  0.228
          SIS      0.158  0.140  0.222     0.002  0.026  0.036
          CCH1     1      1      1         1      1      1
          CCK1     0.794  0.856  0.984     0.310  0.344  0.642
          CCH2     1      1      1         1      1      1
          CCK2     0.930  0.956  0.998     0.350  0.344  0.622
For models in which ρ is involved, we take four values of ρ ranging from 0 to 0.9. For all the models, we consider four combinations of (n, p): (20,100), (50,100), (20,500), (50,500). All simulation results are based on 500 replications. We evaluate the performance of the different screening procedures by the proportion of the 500 replications in which the true model is included in the selected active set. To guarantee a fair comparison, for all the screening procedures, we choose the variables whose coefficients rank among the first ⌊n/log n⌋ largest values. For our method CCH and the method CCK in [13], two parameters k_n and k are involved. The simulation study shows that when k_n is small, the performance for different combinations of (k_n, k) is quite similar. Thus we only present the results for (k_n, k) = (2, 2) and (2, 3) for illustration, which are denoted as CCH1 and CCH2 for our method and CCK1 and CCK2 for the method of [13].

From the simulation results, we can see that the proposed CCH method detects the true model much more accurately than the SIS, RRCS and CCK methods in almost all cases. Specifically, for the motivating Model 1 in [13], from Table 2 we can see that when the correlations among covariates become large, the SIS, RRCS and CCK methods all perform worse (the proportion of containing the true model drops sharply), but the proposed CCH procedure shows robustness against the correlations among covariates and detects the true model in every replication. Besides, for the heavy-tailed error term following t(1), we can see that the SIS, RRCS and CCK methods all perform very badly while the CCH method still works very well. For Model 2, from Table 3, we can see that when the covariates are multivariate normal and the error term is normal, all the methods work well when the sample size is relatively large, while CCK and CCH require a smaller sample size than RRCS and SIS. If the error term is from t(1), then the SIS, RRCS and CCK methods perform badly, especially when the ratio p/n is large. In contrast, the CCH approach still performs very well. We note that RRCS also shows a certain robustness, and that CCK2 is slightly better than CCK1 because the important covariates are indexed by three consecutive integers.

CCH's advantage over CCK is mainly illustrated by the results of Models 3 to 5. In fact, Model 3 is an example of a single index model, Model 4 is an example of an additive model and Model 5 is an example of a more complex nonlinear regression model. The CCK approach relies heavily on the linear regression assumption, while CCH is more widely applicable. For the single index regression model, from Table 4, we can see that CCK performs badly, especially when the ratio p/n is large.
The RRCS approach ranks the Kendall's tau correlation, which is invariant to monotone transformations; thus it exhibits robustness for Model 3, but it still performs much worse than CCH. For the additive regression model and Model 5, by Table 1, conclusions similar to those for Model 3 can be drawn. It is worth mentioning that although in theory we require the marginal transformation functions to be monotone, the simulation study shows that the proposed screening procedure is not sensitive to this requirement and performs quite well even when the transformation functions are not monotone. In fact, the marginal transformation functions f_2, f_3, f_4 in Model 4 and Model 5 are all non-monotone. In a word, the proposed CCH procedure performs very well not only for heavy-tailed error terms, but also for various unknown transformation functions, which shows its doubly robustness. Thus in practice, CCH can be used as a safe replacement for CCK, RRCS or SIS.

5 Real data analysis

In this section we apply the variable selection method to a gene expression data set from an eQTL experiment in rat eye reported in [18]. The data set has been analyzed by [11], [20] and [8] and can be downloaded from the Gene Expression Omnibus at accession number GSE5680. For this data set, 120 twelve-week-old male rats were selected for harvesting of tissue from the eyes and subsequent microarray analysis. The microarrays used to analyze the RNA from the eyes of the rats contain over 31,042 different probe sets (Affymetrix GeneChip Rat Genome 230 2.0 Array). The intensity values were normalized using the robust multi-chip averaging method [1, 12] to obtain summary expression values for each probe set. Gene expression levels were analyzed on a logarithmic scale. Similar to the work of [11] and [8], we are interested in finding the genes correlated with gene TRIM32, which was found to cause Bardet-Biedl syndrome [3]. Bardet-Biedl syndrome is a genetically heterogeneous disease of multiple organ systems, including the retina.
Among the more than 31,000 gene probes, the probe from TRIM32 is 1389163_at, which is one of the 18,976 probes that are sufficiently expressed and variable.

Figure 1: Boxplots of the ranks of the first 20 genes ordered by r̂_j^U, for CCH1, CCK1, CCH2, CCK2, RRCS and SIS.

The sample size is n = 120 and the number of probes is 18,975. It is expected that only a few genes are related to TRIM32, so this is a sparse high dimensional regression problem.

Direct application of the proposed approach to the whole data set is slow, so we select the 500 probes with the largest variances among the whole 18,975 probes. [11] proposed a nonparametric additive model to capture the relationship between the expression of TRIM32 and candidate genes and found most of the plots of the estimated additive components to be highly nonlinear, confirming the necessity of taking nonlinearity into account. The Elliptical Copula Regression (ECR) model can also capture such nonlinear relationships, and thus it is reasonable to apply the proposed doubly robust dimension reduction procedure to this data set.

For the real data example, we compare the genes selected by the procedures introduced in the simulation study: SIS ([10]), the RRCS procedure ([14]), the CCK procedure ([13]) and the proposed CCH procedure. To detect influential genes, we adopt a bootstrap procedure similar to [14, 13]. We denote the respective correlation coefficients calculated using SIS, RRCS, CCK and CCH by ρ̃^sis, ρ̃^rrcs, ρ̃^cck and ρ̃^cch. The detailed algorithm is presented in Algorithm 1.

Figure 2: 3-dimensional plots of variables with Generalized Additive Model (GAM) fits.

Algorithm 1
A bootstrap procedure to obtain influential genes
Input: $\mathcal{D} = \{(X_i, Y_i),\ i = 1, \ldots, n\}$.
Output: indices of influential genes.

Step 1. From the data set $\mathcal{D}$, calculate the correlation coefficients $\widetilde\rho_i$, $i = 1, \ldots, p$, where $\widetilde\rho$ can be $\widetilde\rho^{\,sis}$, $\widetilde\rho^{\,rrcs}$, $\widetilde\rho^{\,cck}$ or $\widetilde\rho^{\,cch}$, and order them as $\widetilde\rho_{(\widehat j_1)} \ge \widetilde\rho_{(\widehat j_2)} \ge \cdots \ge \widetilde\rho_{(\widehat j_p)}$; the set $\{\widehat j_1, \ldots, \widehat j_p\}$ varies with the screening procedure. We write $\widehat j_1 \succeq \cdots \succeq \widehat j_p$ for the empirical ranking of the components of $X$ by their contributions to the response: $s \succeq t$ indicates $\widetilde\rho_{(s)} \ge \widetilde\rho_{(t)}$, informally interpreted as "the $s$th component of $X$ has at least as much influence on the response as the $t$th component". The rank $\widehat r_j$ of the $j$th component is defined as the value $r$ such that $\widehat j_r = j$.

Step 2. For each $b = 1, \ldots, B$, draw a bootstrap sample from $\mathcal{D}$ and employ the SIS, RRCS, CCK and CCH procedures to calculate the $b$th bootstrap version of $\widetilde\rho_i$, denoted $\widetilde\rho_i^{\,b}$, $1 \le i \le p$. Denote the ranks of $\widetilde\rho_1^{\,b}, \ldots, \widetilde\rho_p^{\,b}$ by $\widehat j_1^{\,b} \succeq \cdots \succeq \widehat j_p^{\,b}$ and calculate the corresponding rank $\widehat r_j^{\,b}$ for the $j$th component of $X$.

Step 3. Given $\alpha = 0.05$, compute the $(1-\alpha)$-level, two-sided, equally tailed interval for the rank of the $j$th component, i.e., an interval $[\widehat r_j^{\,L}, \widehat r_j^{\,U}]$ such that $P(\widehat r_j^{\,b} \le \widehat r_j^{\,L} \mid \mathcal{D}) \approx P(\widehat r_j^{\,b} \ge \widehat r_j^{\,U} \mid \mathcal{D}) \approx \alpha/2$.

Step 4. Treat a variable as influential if $\widehat r_j^{\,U}$ ranks in the top 20 positions.

The box-plot of the ranks of influential genes is shown in Figure 1, from which we can see that the proposed CCH procedure selects three very influential genes, 1373349_at, 1368887_at and 1382291_at (highlighted in blue in Figure 1), which were not detected as influential by the other screening methods. These three genes are selected because there exists a strong nonlinear relationship between the response and the combination of the three covariate genes; Figure 2 illustrates this finding. Besides, gene 1398594_at is detected as influential by both the CCH and RRCS procedures and is highlighted in red in Figure 1. The scatter plot shows that the nonlinearity between gene 1398594_at and gene TRIM32 is obvious, and both CCH and RRCS capture this nonlinear relationship. These findings are based purely on statistical analysis and need to be further validated by laboratory experiments. The screening procedure is particularly helpful in narrowing down the number of research targets to a few top-ranked genes from the 500 candidates.

Conclusion

We propose a very flexible semi-parametric ECR model and consider the variable selection problem for the ECR model in the ultra-high dimensional setting. We propose a doubly robust sure screening procedure for the ECR model. Theoretical analysis shows that the procedure enjoys the sure screening property, i.e., with probability tending to 1, the screening procedure selects all important variables and substantially reduces the dimensionality to a moderate size relative to the sample size.
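As a concrete illustration of Algorithm 1, the following sketch computes the full-data ranks and the bootstrap interval endpoints $[\widehat r_j^{\,L}, \widehat r_j^{\,U}]$. The function names are hypothetical, and plain absolute Pearson correlation stands in for the SIS/RRCS/CCK/CCH statistics:

```python
import numpy as np

def ranks_of(stats):
    # rank 1 = largest screening statistic
    r = np.empty(len(stats), dtype=int)
    r[np.argsort(-stats)] = np.arange(1, len(stats) + 1)
    return r

def abs_corr(X, Y):
    # stand-in marginal statistic: |Pearson correlation| of each column with Y
    Xc = X - X.mean(0)
    Yc = Y - Y.mean()
    return np.abs(Xc.T @ Yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(Yc) + 1e-12)

def bootstrap_rank_intervals(X, Y, stat=abs_corr, B=200, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    full_ranks = ranks_of(stat(X, Y))        # Step 1: full-data ranking
    boot = np.empty((B, p))
    for b in range(B):                       # Step 2: bootstrap ranks
        idx = rng.integers(0, n, n)
        boot[b] = ranks_of(stat(X[idx], Y[idx]))
    # Step 3: equally tailed (1 - alpha) interval for each component's rank
    r_L = np.quantile(boot, alpha / 2, axis=0)
    r_U = np.quantile(boot, 1 - alpha / 2, axis=0)
    return full_ranks, r_L, r_U

# toy check: only the first covariate drives the response
rng = np.random.default_rng(1)
X = rng.standard_normal((120, 50))
Y = 3.0 * X[:, 0] + 0.5 * rng.standard_normal(120)
full_ranks, r_L, r_U = bootstrap_rank_intervals(X, Y, B=100)
influential = np.flatnonzero(r_U <= 20)      # Step 4: upper endpoint in the top 20
```

This is only schematic: in the paper's analysis, `abs_corr` is replaced by the SIS, RRCS, CCK or CCH screening statistics applied to the 500 pre-selected probes.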
We set $k_n$ to a small value, and the procedure performs well as long as there is a natural index for all the covariates such that neighboring covariates are correlated. If there is no natural index ordering a priori, we can statistically cluster the variables before screening. The performance of the screening procedure would then rely heavily on the clustering performance, which we leave as a future research topic.

Appendix: Proof of Main Theorems
First we introduce a useful lemma which is critical for the proof of the main results.
Lemma .1. For any $c > 0$, there exists a positive constant $C > 0$ such that
$$P\Big(\big|\widehat S_{s,t} - \Sigma_{s,t}\big| \ge c\, n^{-\kappa}\Big) = O\big(\exp(-C\, n^{1-2\kappa})\big).$$

Proof.
Recall that $\widehat S_{s,t} = \sin\big(\tfrac{\pi}{2}\widehat\tau_{s,t}\big)$. Then we have
$$P\big(|\widehat S_{s,t} - \Sigma_{s,t}| > t\big) = P\Big(\big|\sin\big(\tfrac{\pi}{2}\widehat\tau_{s,t}\big) - \sin\big(\tfrac{\pi}{2}\tau_{s,t}\big)\big| \ge t\Big) \le P\Big(|\widehat\tau_{s,t} - \tau_{s,t}| \ge \tfrac{2}{\pi}\, t\Big).$$
Since $\widehat\tau_{s,t}$ can be written in the form of a U-statistic with a kernel bounded between $-1$ and $1$, by Hoeffding's inequality we have
$$P\Big(|\widehat\tau_{s,t} - \tau_{s,t}| \ge \tfrac{2}{\pi}\, t\Big) \le 2\exp\Big(-\tfrac{n t^2}{\pi^2}\Big).$$
By taking $t = c\, n^{-\kappa}$, we then have
$$P\Big(\big|\widehat S_{s,t} - \Sigma_{s,t}\big| \ge c\, n^{-\kappa}\Big) \le 2\exp\Big(-\tfrac{c^2}{\pi^2}\, n^{1-2\kappa}\Big),$$
which concludes the lemma.

Proof of Theorem 3.1
Proof. By the definition of canonical correlation,
$$(\rho^c_l)^2 = \Sigma_{I\times J_l}\, \Sigma^{-1}_{J_l\times J_l}\, \Sigma^{\top}_{I\times J_l}.$$
By Assumption 1, $\lambda_{\min}(\Sigma_{J_l\times J_l})$ is bounded below by a positive constant $c_1$, and noting that $\Sigma_{I\times J_l} = (\Sigma_{1,m_{l1}}, \ldots, \Sigma_{1,m_{lk}})$ is a row vector, we have
$$(\rho^c_l)^2 \le c_1^{-1} \sum_{t=1}^{k} (\Sigma_{1,m_{lt}})^2. \tag{8}$$
By Assumption 2, if $i \in \mathcal{M}_*$, then there exists $l_i \in \mathcal{I}_{n_i}$ such that $\rho^c_{l_i} \ge c_2\, n^{-\kappa}$. Without loss of generality, we assume that $|\Sigma_{1,m_{l_i 1}}| = \max_{1\le t\le k} |\Sigma_{1,m_{l_i t}}|$. By Equation (8), we have $|\Sigma_{1,m_{l_i 1}}| \ge c_*\, n^{-\kappa}$ for some $c_* > 0$.

For $\Sigma_{1,m_{l_i 1}}$, denote the corresponding Kendall's tau estimator by $\widehat S_{1,m_{l_i 1}} \in \widehat S_{I\times J_{l_i}}$. Then we have the following result:
$$P\Big(\min_{i\in\mathcal{M}_*} |\widehat S_{1,m_{l_i 1}}| \ge \tfrac{c_*}{2}\, n^{-\kappa}\Big) \ge P\Big(\max_{i\in\mathcal{M}_*} \big|\widehat S_{1,m_{l_i 1}} - \Sigma_{1,m_{l_i 1}}\big| \le \tfrac{c_*}{2}\, n^{-\kappa}\Big). \tag{9}$$
Furthermore, we have
$$\begin{aligned}
P\Big(\max_{i\in\mathcal{M}_*} \big|\widehat S_{1,m_{l_i 1}} - \Sigma_{1,m_{l_i 1}}\big| \ge \tfrac{c_*}{2}\, n^{-\kappa}\Big)
&\le s_*\, P\Big(\big|\widehat S_{1,m_{l_i 1}} - \Sigma_{1,m_{l_i 1}}\big| \ge \tfrac{c_*}{2}\, n^{-\kappa}\Big) \\
&= s_*\, P\Big(\big|\sin\big(\tfrac{\pi}{2}\widehat\tau_{1,m_{l_i 1}}\big) - \sin\big(\tfrac{\pi}{2}\tau_{1,m_{l_i 1}}\big)\big| \ge \tfrac{c_*}{2}\, n^{-\kappa}\Big) \\
&\le s_*\, P\Big(\big|\widehat\tau_{1,m_{l_i 1}} - \tau_{1,m_{l_i 1}}\big| \ge \tfrac{c_*}{\pi}\, n^{-\kappa}\Big) \\
&\le p\, P\Big(\big|\widehat\tau_{1,m_{l_i 1}} - \tau_{1,m_{l_i 1}}\big| \ge \tfrac{c_*}{\pi}\, n^{-\kappa}\Big).
\end{aligned}$$
Since $\widehat\tau_{1,m_{l_i 1}}$ can be written in the form of a U-statistic with a kernel bounded between $-1$ and $1$, by Hoeffding's inequality we have that
$$P\Big(\big|\widehat\tau_{1,m_{l_i 1}} - \tau_{1,m_{l_i 1}}\big| \ge \tfrac{c_*}{\pi}\, n^{-\kappa}\Big) \le 2\exp\Big(-\tfrac{c_*^2}{4\pi^2}\, n^{1-2\kappa}\Big).$$
By Assumption 3, $\log p = o(n^{1-2\kappa})$, we further have that for some constant $C$,
$$P\Big(\max_{i\in\mathcal{M}_*} \big|\widehat S_{1,m_{l_i 1}} - \Sigma_{1,m_{l_i 1}}\big| \ge \tfrac{c_*}{2}\, n^{-\kappa}\Big) \le O\big(\exp(-C\, n^{1-2\kappa})\big).$$
Combining with Equation (9), we have
$$P\Big(\min_{i\in\mathcal{M}_*} |\widehat S_{1,m_{l_i 1}}| \ge \tfrac{c_*}{2}\, n^{-\kappa}\Big) \ge 1 - O\big(\exp(-C\, n^{1-2\kappa})\big).$$
Besides, it is easy to show that $(\widehat\rho^c_l)^2 \ge c\, \max_{1\le t\le k} (\widehat S_{1,m_{lt}})^2$ for some constant $c > 0$, and hence, absorbing constants into $c_*$,
$$P\Big(\min_{i\in\mathcal{M}_*} \max_{l\in\mathcal{I}_{n_i}} \widehat\rho^c_l \ge c_*\, n^{-\kappa}\Big) \ge 1 - O\big(\exp(-C\, n^{1-2\kappa})\big),$$
which concludes
$$P\Big(\mathcal{M}_* \subset \widehat{\mathcal{M}}_{c_* n^{-\kappa}}\Big) \ge 1 - O\big(\exp(-C\, n^{1-2\kappa})\big).$$

Proof of Theorem 3.2
Proof.
The proof of Theorem 3.2 is split into the following two steps.

(Step I) In this step we aim to prove that
$$P\Big(\max_{l} \big|(\widehat\rho^c_l)^2 - (\widetilde\rho^c_l)^2\big| > c\, n^{-\kappa}\Big) = O\big(\exp(-C\, n^{1-2\kappa})\big),$$
where the maximum is taken over all candidate index sets $l$ and $(\widetilde\rho^c_l)^2 = \widehat S_{I\times J_l}\, \Sigma^{-1}_{J_l\times J_l}\, \widehat S^{\top}_{I\times J_l}$. Note that the determinants of the matrices $\Sigma_{J_l\times J_l}$ and $\widehat S_{J_l\times J_l}$ are polynomials of finite order in their entries, thus the following inequality holds for some constant $c' > 0$:
$$\begin{aligned}
P\Big(\big|\, |\widehat S_{J_l\times J_l}| - |\Sigma_{J_l\times J_l}|\, \big| > c\, n^{-\kappa}\Big)
&\le P\Big(\max_{1\le s,t\le k} |\widehat S_{s,t} - \Sigma_{s,t}| > c'\, n^{-\kappa}\Big) \\
&\le k^2\, P\big(|\widehat S_{s,t} - \Sigma_{s,t}| > c'\, n^{-\kappa}\big) \\
&= k^2\, P\Big(\big|\sin\big(\tfrac{\pi}{2}\widehat\tau_{s,t}\big) - \sin\big(\tfrac{\pi}{2}\tau_{s,t}\big)\big| \ge c'\, n^{-\kappa}\Big) \\
&\le k^2\, P\Big(|\widehat\tau_{s,t} - \tau_{s,t}| \ge \tfrac{2c'}{\pi}\, n^{-\kappa}\Big).
\end{aligned}$$
Since $\widehat\tau_{s,t}$ can be written in the form of a U-statistic with a kernel bounded between $-1$ and $1$, by Hoeffding's inequality, we have that
$$P\Big(|\widehat\tau_{s,t} - \tau_{s,t}| \ge \tfrac{2c'}{\pi}\, n^{-\kappa}\Big) \le 2\exp\Big(-\tfrac{c'^2}{\pi^2}\, n^{1-2\kappa}\Big).$$
Thus we have for some positive constant $C^*$, the following inequality holds:
$$P\Big(\big|\, |\widehat S_{J_l\times J_l}| - |\Sigma_{J_l\times J_l}|\, \big| > c\, n^{-\kappa}\Big) \le \exp\big(-C^*\, n^{1-2\kappa}\big). \tag{10}$$
By Assumption 3, $\log p = O(n^{\xi})$ with $\xi \in (0, 1-2\kappa)$, we further have for some positive constant $C$,
$$P\Big(\max_{l} \big|\, |\widehat S_{J_l\times J_l}| - |\Sigma_{J_l\times J_l}|\, \big| > c\, n^{-\kappa}\Big) \le \exp\big(-C\, n^{1-2\kappa}\big).$$
Note that $k$ is finite, and by the adjugate expansion of the inverse matrix, similar to the above analysis, we have for any positive $c$,
$$P\Big(\max_{l} \big\|(\widehat S_{J_l\times J_l})^{-1} - (\Sigma_{J_l\times J_l})^{-1}\big\|_{\infty} > c\, n^{-\kappa}\Big) \le \exp\big(-C\, n^{1-2\kappa}\big).$$
Notice that
$$\big|(\widehat\rho^c_l)^2 - (\widetilde\rho^c_l)^2\big| \le \|\widehat S_{I\times J_l}\|_{\infty}^2\, k^2\, \big\|(\widehat S_{J_l\times J_l})^{-1} - (\Sigma_{J_l\times J_l})^{-1}\big\|_{\infty} \le k^2\, \big\|(\widehat S_{J_l\times J_l})^{-1} - (\Sigma_{J_l\times J_l})^{-1}\big\|_{\infty},$$
since the entries of $\widehat S_{I\times J_l}$ are bounded by $1$. Therefore
$$P\Big(\max_{l} \big|(\widehat\rho^c_l)^2 - (\widetilde\rho^c_l)^2\big| > c\, n^{-\kappa}\Big) = O\big(\exp(-C\, n^{1-2\kappa})\big).$$

(Step II) In this step, we will first prove that for any $c > 0$,
$$P\Big(\max_{l} |\widetilde\rho^c_l - \rho^c_l| \ge c\, n^{-\kappa}\Big) = O\big(\exp(-C\, n^{1-2\kappa})\big).$$
By Lemma .1, we have that
$$P\Big(\big|\widehat S_{s,t} - \Sigma_{s,t}\big| \ge c\, n^{-\kappa}\Big) = O\big(\exp(-C\, n^{1-2\kappa})\big).$$
By Assumption 3, $\log p = O(n^{\xi})$ with $\xi \in (0, 1-2\kappa)$, thus we have
$$P\Big(\max_{1\le i\le p} \max_{l\in\mathcal{I}_{n_i}} \max_{s,t} |\widehat S_{s,t} - \Sigma_{s,t}| \ge c\, n^{-\kappa}\Big) = O\big(\exp(-C\, n^{1-2\kappa})\big).$$
Recalling that $(\widetilde\rho^c_l)^2 = \widehat S_{I\times J_l}\, \Sigma^{-1}_{J_l\times J_l}\, \widehat S^{\top}_{I\times J_l}$, by the boundedness of the eigenvalues of $\Sigma_{J_l\times J_l}$, we have for any $c > 0$,
$$P\Big(\max_{l} |\widetilde\rho^c_l - \rho^c_l| \ge c\, n^{-\kappa}\Big) = O\big(\exp(-C\, n^{1-2\kappa})\big).$$
Further by Assumption 4, $\max_{i\notin\mathcal{M}_*} \max_{l\in\mathcal{I}_{n_i}} \rho^c_l < \tfrac{c_*}{2}\, n^{-\kappa}$, and combining with the last equation, we have that
$$P\Big(\max_{i\notin\mathcal{M}_*} \max_{l\in\mathcal{I}_{n_i}} \widetilde\rho^c_l < c_*\, n^{-\kappa}\Big) \ge 1 - O\big(\exp(-C\, n^{1-2\kappa})\big).$$
By the result in Step I, we thus further have
$$P\Big(\max_{i\notin\mathcal{M}_*} \max_{l\in\mathcal{I}_{n_i}} \widehat\rho^c_l < c_*\, n^{-\kappa}\Big) \ge 1 - O\big(\exp(-C\, n^{1-2\kappa})\big). \tag{11}$$
By Theorem 3.1, the following inequality holds:
$$P\Big(\min_{i\in\mathcal{M}_*} \max_{l\in\mathcal{I}_{n_i}} \widehat\rho^c_l \ge c_*\, n^{-\kappa}\Big) \ge 1 - O\big(\exp(-C\, n^{1-2\kappa})\big),$$
then combining the result in Equation (11), we have
$$P\Big(\mathcal{M}_* = \widehat{\mathcal{M}}_{c_* n^{-\kappa}}\Big) \ge 1 - O\big(\exp(-C\, n^{1-2\kappa})\big),$$
which concludes the theorem.

Proof of Theorem 3.4

Proof.
Let $\delta_n \to 0$ satisfy $\delta_n\, n^{1-2\kappa-\tau-\tau_*} \to \infty$ as $n \to \infty$, and define
$$\mathcal{M}(\delta) = \Big\{1 \le i \le p : \max_{l\in\mathcal{I}_{n_i}} \widehat\rho^c_l \text{ is among the largest } \lfloor \delta p \rfloor \text{ of all}\Big\},$$
$$\widetilde{\mathcal{M}}(\delta) = \Big\{1 \le i \le p : \max_{l\in\mathcal{I}_{n_i}} \widetilde\rho^c_l \text{ is among the largest } \lfloor \delta p \rfloor \text{ of all}\Big\},$$
where $(\widetilde\rho^c_l)^2 = \widehat S_{I\times J_l}\, \Sigma^{-1}_{J_l\times J_l}\, \widehat S^{\top}_{I\times J_l}$ and $(\widehat\rho^c_l)^2 = \widehat S_{I\times J_l}\, \widehat S^{-1}_{J_l\times J_l}\, \widehat S^{\top}_{I\times J_l}$. We will first show that
$$P\big(\mathcal{M}_* \subset \mathcal{M}(\delta)\big) \ge 1 - O\big(\exp(-C\, n^{1-2\kappa})\big). \tag{12}$$
By Theorem 3.1, it is equivalent to show that
$$P\big(\mathcal{M}_* \subset \mathcal{M}(\delta) \cap \widehat{\mathcal{M}}_{c_* n^{-\kappa}}\big) \ge 1 - O\big(\exp(-C\, n^{1-2\kappa})\big).$$
By Step I in the proof of Theorem 3.2, it is also equivalent to show that
$$P\big(\mathcal{M}_* \subset \widetilde{\mathcal{M}}(\delta) \cap \widehat{\mathcal{M}}_{c_* n^{-\kappa}}\big) \ge 1 - O\big(\exp(-C\, n^{1-2\kappa})\big).$$
Finally, by Theorem 3.1 again, to prove Equation (12) is equivalent to prove that
$$P\big(\mathcal{M}_* \subset \widetilde{\mathcal{M}}(\delta)\big) \ge 1 - O\big(\exp(-C\, n^{1-2\kappa})\big). \tag{13}$$
Recall that in the proof of Theorem 3.1, we obtained the following result:
$$P\Big(\min_{i\in\mathcal{M}_*} \max_{l\in\mathcal{I}_{n_i}} \widetilde\rho^c_l \ge c_*\, n^{-\kappa}\Big) \ge 1 - O\big(\exp(-C\, n^{1-2\kappa})\big).$$
Suppose we can prove that
$$P\Big(\sum_{i=1}^p \big(\max_{l\in\mathcal{I}_{n_i}} \widetilde\rho^c_l\big)^2 \le c\, n^{-1+\tau_*+\tau}\, p\Big) \ge 1 - O\big(\exp(-C\, n^{1-2\kappa})\big). \tag{14}$$
Then we have, with probability larger than $1 - O\big(\exp(-C\, n^{1-2\kappa})\big)$,
$$\mathrm{Card}\Big\{1 \le i \le p : \max_{l\in\mathcal{I}_{n_i}} \widetilde\rho^c_l \ge \min_{i\in\mathcal{M}_*} \max_{l\in\mathcal{I}_{n_i}} \widetilde\rho^c_l\Big\} \le c\, p\, n^{-(1-2\kappa-\tau-\tau_*)},$$
which further indicates that the result in (13) holds because $\delta_n\, n^{1-2\kappa-\tau-\tau_*} \to \infty$. So to end the whole proof, we just need to show that the result in (14) holds.

For each $1 \le i \le p$, let $\widetilde\rho^c_i = \max_{l\in\mathcal{I}_{n_i}} \widetilde\rho^c_l$. Note that $(\widetilde\rho^c_i)^2 = \widehat S_{I\times J_i}\, \Sigma^{-1}_{J_i\times J_i}\, \widehat S^{\top}_{I\times J_i}$, with $\widehat S_{I\times J_i} = (\widehat S_{1,m_{i1}}, \ldots, \widehat S_{1,m_{ik}})$. By Assumption 1, we have
$$(\widetilde\rho^c_i)^2 \le c \sum_{t=1}^{k} (\widehat S_{1,m_{it}})^2 = c\, \|\widehat S_{I\times J_i}\|^2, \qquad \sum_{i=1}^p (\widetilde\rho^c_i)^2 \le c\, k\, k_n\, \|\widehat S_{I\times T}\|^2,$$
with $T = \{1, \ldots, d\}$. Notice that
$$P\Big(\big|\widehat S_{s,t} - \Sigma_{s,t}\big| \ge c\, n^{-\kappa}\Big) \le 2\exp\Big(-\tfrac{c^2}{\pi^2}\, n^{1-2\kappa}\Big),$$
thus similar to the argument in [13], we can easily get that the result in (14) holds. Finally, following the same idea of iterative screening as in the proof of Theorem 1 of [10], we finish the proof of the theorem.
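The exponential concentration of $\widehat S = \sin(\frac{\pi}{2}\widehat\tau)$ around the latent correlation, which drives all of the bounds above, can be observed empirically. The following Monte Carlo sketch (illustrative settings, not from the paper) checks that the average deviation shrinks as $n$ grows:

```python
import numpy as np

def kendall_tau(x, y):
    # sample Kendall's tau via pairwise concordance signs
    n = len(x)
    dx = np.sign(x[:, None] - x[None, :])
    dy = np.sign(y[:, None] - y[None, :])
    return (dx * dy).sum() / (n * (n - 1))

def mean_abs_dev(n, rho=0.5, reps=50, seed=2):
    # average |sin(pi/2 * tau_hat) - rho| over Monte Carlo replications
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    devs = []
    for _ in range(reps):
        z = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        s_hat = np.sin(np.pi / 2 * kendall_tau(z[:, 0], z[:, 1]))
        devs.append(abs(s_hat - rho))
    return float(np.mean(devs))

small_n, large_n = mean_abs_dev(50), mean_abs_dev(400)
assert large_n < small_n   # deviations shrink with the sample size
```

The observed decay is consistent with the $O(\exp(-C n^{1-2\kappa}))$ tail bound of Lemma .1, though a simulation of this size only illustrates, and cannot verify, the exponential rate.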
Acknowledgements
Yong He's research is partially supported by the grant of the National Science Foundation of China (NSFC 11801316). Xinsheng Zhang's research is partially supported by the grant of the National Science Foundation of China (NSFC 11571080). Jiadong Ji's work is supported by the grant of the National Science Foundation of China (NSFC 81803336) and the Natural Science Foundation of Shandong Province (ZR2018BH033).
References
[1] B. M. Bolstad, R. A. Irizarry, M. Astrand, and T. P. Speed. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2):185–193, 2003.
[2] Emmanuel Candes and Terence Tao. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351, 2007.
[3] Annie P. Chiang, John S. Beck, Hsan-Jan Yen, Marwan K. Tayeh, Todd E. Scheetz, Ruth E. Swiderski, Darryl Y. Nishimura, Terry A. Braun, Kwang-Youn A. Kim, Jian Huang, et al. Homozygosity mapping with SNP arrays identifies TRIM32, an E3 ubiquitin ligase, as a Bardet-Biedl syndrome gene (BBS11). Proceedings of the National Academy of Sciences, 103(16):6287–6292, 2006.
[4] H. Cui, R. Li, and W. Zhong. Model-free feature screening for ultrahigh dimensional discriminant analysis. Journal of the American Statistical Association, 110(510):630–641, 2015.
[5] J. Fan and J. Lv. Sure independence screening. Wiley StatsRef, 2017.
[6] J. Fan, Y. Ma, and W. Dai. Nonparametric independence screening in sparse ultra-high dimensional varying coefficient models. Journal of the American Statistical Association, 109(507):1270, 2014.
[7] Jianqing Fan, Yang Feng, and Rui Song. Nonparametric independence screening in sparse ultra-high dimensional additive models. Journal of the American Statistical Association, 106(494):544–557, 2011.
[8] Jianqing Fan, Yang Feng, and Rui Song. Nonparametric independence screening in sparse ultra-high-dimensional additive models. Journal of the American Statistical Association, 106(494):544–557, 2011.
[9] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
[10] Jianqing Fan and Jinchi Lv. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society, Series B, 70(5):849–911, 2008.
[11] Jian Huang, Joel L. Horowitz, and Fengrong Wei. Variable selection in nonparametric additive models. Annals of Statistics, 38(4):2282, 2010.
[12] Rafael A. Irizarry, Bridget Hobbs, Francois Collin, Yasmin D. Beazer-Barclay, Kristen J. Antonellis, Uwe Scherf, and Terence P. Speed. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4(2):249–264, 2003.
[13] Xin-Bing Kong, Zhi Liu, Yuan Yao, and Wang Zhou. Sure screening by ranking the canonical correlations. Test, 26(1):1–25, 2017.
[14] Gaorong Li, Heng Peng, Jun Zhang, and Lixing Zhu. Robust rank correlation based screening. Annals of Statistics, 40(3):1846–1877, 2012.
[15] Runze Li, Wei Zhong, and Liping Zhu. Feature screening via distance correlation learning. Journal of the American Statistical Association, 107(499):1129–1139, 2012.
[16] Jingyuan Liu, Wei Zhong, and Runze Li. A selective overview of feature screening for ultrahigh-dimensional data. Science China Mathematics, 58(10):1–22, 2015.
[17] Lukas Meier, Sara van de Geer, and Peter Bühlmann. High-dimensional additive modeling. Annals of Statistics, 37(6B):3779–3821, 2009.
[18] Todd E. Scheetz, Kwang-Youn A. Kim, Ruth E. Swiderski, Alisdair R. Philp, Terry A. Braun, Kevin L. Knudtson, Anne M. Dorrance, Gerald F. DiBona, Jian Huang, Thomas L. Casavant, et al. Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences, 103(39):14429–14434, 2006.
[19] R. Song, W. Lu, S. Ma, and X. Jessie Jeng. Censored rank independence screening for high-dimensional survival data. Biometrika, 101(4):799–814, 2014.
[20] Tingni Sun and Cun-Hui Zhang. Scaled sparse linear regression. Biometrika, 99(4):879–898, 2012.
[21] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267–288, 1996.
[22] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49–67, 2006.
[23] Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.
[24] Liping Zhu, Lexin Li, Runze Li, and Lixing Zhu. Model-free feature screening for ultrahigh dimensional data. Journal of the American Statistical Association, 106(496):1464–1475, 2011.
[25] Hui Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.

Table: The proportions of containing Model … in the active set, comparing RRCS, SIS, CCH and CCK for various $(p, n)$, under multivariate normal and multivariate $t$ covariates with $\epsilon \sim N(0,1)$ or $\epsilon \sim t$ errors.

Table: The proportions of containing Model … in the active set, comparing RRCS, SIS, CCH and CCK for various $(p, n)$, under multivariate normal and multivariate $t$ covariates with $\epsilon \sim N(0,1)$ or $\epsilon \sim t$ errors.
Table: The proportions of containing Model … in the active set, comparing RRCS, SIS, CCH and CCK for various $(p, n)$, under multivariate normal and multivariate $t$ covariates with $\epsilon \sim N(0,1)$ or $\epsilon \sim t$ errors.