High-dimensional influence measure
The Annals of Statistics
Institute of Mathematical Statistics, 2013
HIGH-DIMENSIONAL INFLUENCE MEASURE
By Junlong Zhao, Chenlei Leng, Lexin Li and Hansheng Wang
Beihang University, University of Warwick and National University of Singapore, North Carolina State University and Peking University
Influence diagnosis is important since the presence of influential observations could lead to distorted analysis and misleading interpretations. For high-dimensional data, it is particularly so, as the increased dimensionality and complexity may amplify both the chance of an observation being influential and its potential impact on the analysis. In this article, we propose a novel high-dimensional influence measure for regressions with the number of predictors far exceeding the sample size. Our proposal can be viewed as a high-dimensional counterpart to the classical Cook’s distance. However, whereas the Cook’s distance quantifies the individual observation’s influence on the least squares regression coefficient estimate, our new diagnosis measure captures the influence on the marginal correlations, which in turn exert serious influence on downstream analysis including coefficient estimation, variable selection and screening. Moreover, we establish the asymptotic distribution of the proposed influence measure by letting the predictor dimension go to infinity. Availability of this asymptotic distribution leads to a principled rule to determine the critical value for influential observation detection. Both simulations and real data analysis demonstrate usefulness of the new influence diagnosis measure.

Received July 2013.
Supported by National Natural Science Foundation of China (11101022) and Ministry of Education Humanities and Social Science Foundation Youth project (10YJC910013).
Supported by National University of Singapore research grants.
Supported by NSF Grant DMS-11-06668.
Supported by National Natural Science Foundation of China (11131002, 11271032), Fox Ying Tong Education Foundation, the Business Intelligence Research Center of Peking University and the Center for Statistical Science of Peking University.
AMS 2000 subject classifications.
Primary 62J20; secondary 62E20.
Key words and phrases.
Cook’s distance, high-dimensional diagnosis, influential observation, LASSO, marginal correlations, variable screening.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2013, Vol. 41, No. 5, 2639–2667. This reprint differs from the original in pagination and typographic detail.
ZHAO, LENG, LI AND WANG
1. Introduction.
An observation is flagged influential if some important features of the analysis are substantially altered after this observation is removed [13]. Presence of influential observations may lead to distorted analysis and misleading results [18], and therefore it is important to be alert to influential observations and take them into consideration when interpreting the results. In the classical normal linear model setup, the regression coefficient estimate was chosen, naturally, as the feature whose substantial change defines influential observations. Toward that end, [12] proposed a difference measure between the OLS estimate on the full data and that on the subset of data without the observation in question. This measure, which was later referred to in the statistical literature as the Cook’s distance, quantifies the contribution, or influence, of an individual observation on the regression coefficient estimate. Consequently, an observation with a large Cook’s distance is deemed influential. Since its introduction, the Cook’s distance has been routinely employed in regression analysis, due to its clear interpretation from the case deletion point of view, and its easy computation without having to re-estimate the model for each removed observation. The topic is covered in most standard regression textbooks, and it is implemented in popular statistical software such as R and SAS.

The problem of influence diagnosis has since attracted considerable attention and been systematically investigated for various models and analyses. Examples include linear regression models [9, 12, 14], categorical data analyses [1], generalized linear models [16, 33, 38], generalized estimation equations [30], linear mixed models [2, 3, 11], generalized linear mixed models [39], semiparametric mixed models [25], growth curve models [29], incomplete data analysis [44] and perturbation theory [15, 42, 43], among others. For an excellent review on the latest developments in the field of influence diagnosis, we refer to [42].

Thanks to the aforementioned works, substantial insights have been gained on influence diagnosis. However, it is important to note that all existing diagnosis approaches have been developed under the assumption that the number of predictors in regression is fixed. As such, none is immediately applicable to high-dimensional regression analysis, where the number of predictors $p$ far exceeds the sample size $n$. On the other hand, data with unprecedented size and dimensionality now prevail in both science and business, calling for the development of high-dimensional influence diagnosis. Detection of influential observations in high-dimensional data analysis, in our opinion, is equally, or to some extent even more, important than in a classical setup.
This is partly because the increased dimensionality and complexity of the data may amplify both the chance of an observation being influential and its potential impact on the analysis. Moreover, the peculiar observations themselves may be of practical importance in addition to data modeling. The diagnosis task, nevertheless, is more challenging in high-dimensional data analysis, and is far from a direct extension of existing diagnosis approaches. To the best of our knowledge, influence diagnosis in a high-dimensional setting has received little attention despite its evident importance.

The first challenge is the definition of an influential observation. In other words, which feature of the analysis should one choose such that its substantial alteration defines an influential observation? In the classical setup, an observation is deemed influential if it incurs serious change in the regression coefficient estimate. In high-dimensional regression where $p > n$, the ordinary least squares estimator is highly unstable as the Gram matrix is not invertible. On the other hand, we recognize that variable selection and variable screening are of particular importance in high-dimensional regression analysis. There has been a vast literature on variable selection in recent years, including the LASSO [34], the adaptive LASSO [36, 40, 45], the SCAD [21], the bridge estimator [24, 26], the LARS algorithm [19], the Dantzig selector [8], the sure independence screening rule (SIS) [22] and the forward regression (FR) [35], among many others. Underlying all those selection methods, one statistic plays a critical role, namely the marginal covariance, or equivalently, the marginal correlation between the response and the individual covariates. To clarify, we note that SIS is directly defined based on this statistic, whereas the first step of the forward regression hinges on the estimated marginal covariance too. In addition, the sample marginal covariance, together with the Gram matrix, is an important input for the well-celebrated LARS algorithm, as well as the LASSO, the adaptive LASSO and the Dantzig selector.

Motivated by this vital observation, we choose the marginal correlation as the feature that defines an influential observation. We propose a new influence diagnosis measure, which continues to utilize the leave-one-out idea of the classical Cook’s distance, but is based on the combined marginal correlations between the response and all predictors. The new measure is applicable to the high-dimensional setting where $p > n$, and is very fast and easy to compute. Unlike the classical Cook’s distance that quantifies the individual observation’s influence on the least squares coefficient estimate, the new measure captures the influence on the marginal correlation, which in turn exerts serious impact on variable selection and other downstream analysis. The choice of the marginal correlation as the defining feature of our influence diagnosis does not imply that the marginal correlation is our ultimate goal of interest. Instead, it reflects influence on important analysis features including parameter estimation, variable selection and screening. This definition of an influential observation in a high-dimensional setting can be viewed as our first contribution.
Our second contribution is that the explicit asymptotic distribution of the proposed influence measure is derived. Availability of this asymptotic theory offers principled guidance to determine the critical value for the influence measure. Subsequently, we propose a false discovery rate based procedure for that purpose [5, 6]. We remark that, in the classical setup where $p$ is fixed, a standard Taylor expansion type analysis [12] revealed that the major variability of the classical Cook’s distance is due to the observation under investigation, whose sample size is only one. This rules out the possibility of establishing a standard asymptotic theory for the classical Cook’s distance. To determine an appropriate threshold value for the classical Cook’s distance, its distribution can be obtained by bootstrap if the true model is a parametric linear model. However, such a bootstrap procedure requires a parametric model assumption and can be computationally expensive, especially for high-dimensional data. By contrast, the asymptotic distribution of the proposed influence measure is attainable in our setup, since the predictor dimension goes to infinity along with the sample size, and the threshold is easy to obtain.

When facing high-dimensional data diagnosis, an intuitive solution is to continue using the classical Cook’s distance but to replace the OLS coefficient estimate with a regularized estimate, for instance, a LASSO estimate. This modified Cook’s distance approach could be particularly useful when data perturbation concentrates on the nonzero coefficients, as it avoids unnecessary variability caused by irrelevant covariates. However, it also has several limitations. First, this solution interweaves influence diagnosis with variable selection, which can be flawed if the influence is reflected on variable selection itself.
For instance, an influential observation may substantially alter the chosen tuning parameter of the LASSO, resulting in a totally different regularized coefficient estimate, which in turn affects the modified Cook’s distance. Second, the tuning parameter of the LASSO, in principle, should be updated for every reduced data set, and this re-estimation requirement can be very expensive computationally, especially when the regression dimension $p$ is large. Third, the asymptotic properties of the modified Cook’s distance seem intractable analytically, which makes the thresholding of influential data difficult, whereas a bootstrap alternative to choose the thresholding value is again computationally expensive. Moreover, while there exist many competing variable selection methods, it is unclear which selection method is the best choice in the context of influence diagnosis. By contrast, our influence measure is not constrained by any particular variable selection method, and this flexibility could benefit downstream analysis. In Section 3, we carry out an intensive numerical study to compare this modified Cook’s distance with our proposal, and this detailed comparison can be viewed as the third contribution of this article.

Fig. 1. Effect of influential points on parameter estimation (a), variable selection (b) and variable screening (c), as the perturbation parameter $\kappa$ varies. “Before HIM” denotes the analysis on the full data, and “After HIM” denotes the analysis on the reduced data after removing the influential observations flagged by our proposed high-dimensional measure (HIM).

Before we proceed, we quickly show a simulated example to illustrate two points: first, how various aspects of a high-dimensional regression analysis, including regression coefficient estimation, variable selection and variable screening, can be seriously affected by influential observations, and second, how our proposed measure can help limit such influence. The data were generated from a linear model with $p = 1000$ predictors and $n = 100$ observations, among which 10 observations were influential. The magnitude of the influence was dictated by a scalar $\kappa$, with a larger value indicating a larger influence. More details can be found in the setup of model 1 in Section 3. Evaluations include error in coefficient estimation, error in variable selection after applying the LASSO [34], and error in variable screening after applying the SIS [22]. The results are averaged over 200 simulation replicates, and are reported in Figure 1. It is clearly seen from the plot that influential observations could have drastic effects on various features of high-dimensional data analysis. Meanwhile, our marginal correlation based diagnosis could greatly help control the adverse effects after detecting and removing those influential data points.

The rest of the article is organized as follows. Section 2 begins with a review of the classical Cook’s distance, then presents our new high-dimensional influence measure, along with a comparison with the Cook’s distance, the asymptotic properties and a power study. Section 3 includes an intensive simulation study and a microarray data analysis. Section 4 presents a generalization of our proposal from the normal linear model to the generalized linear model. Section 5 concludes the paper with a discussion. All technical proofs are given in the Appendix and the supplementary material [41].
2. High-dimensional influence measure.
2.1. Linear models and classical Cook’s distance.
In this article, we focus on influence diagnosis in the context of the classical linear regression model. Meanwhile, we note that the proposed idea can be readily extended to a much broader class of regression models, and we will discuss one such extension in Section 4. Consider the following model:
\[ Y_i = \beta_0 + X_i^\top \beta_1 + \varepsilon_i, \tag{2.1} \]
where the pair $(Y_i, X_i)$, $1 \le i \le n$, denotes the observation of the $i$th subject, $Y_i \in \mathbb{R}$ is the response variable, $X_i = (X_{i1}, \ldots, X_{ip})^\top \in \mathbb{R}^p$ is the associated $p$-dimensional predictor vector, and $\varepsilon_i \in \mathbb{R}$ is a mean zero normally distributed random noise. Let $\beta = (\beta_0, \beta_1^\top)^\top$ denote the coefficient vector. Under the classical setup of $n > p$, the OLS estimate of $\beta$ is obtained by minimizing the objective function $\sum_{i=1}^n (Y_i - \beta_0 - X_i^\top \beta_1)^2$, and the solution is $\hat\beta = (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top \mathbb{Y}$, where $\mathbb{Y} = (Y_1, \ldots, Y_n)^\top$ denotes the $n \times 1$ response vector, and $\mathbb{X}$ denotes the $n \times (p+1)$ design matrix with the $i$th row being the $(p+1)$-dimensional vector $(1, X_i^\top)$, $i = 1, \ldots, n$.

To quantify the influence of the $k$th observation on regression, $1 \le k \le n$, [12] employed the leave-one-out idea by studying the OLS estimate of $\beta$ while the $k$th observation is excluded from estimation. That is, one minimizes the modified objective function $\sum_{i=1, i \ne k}^n (Y_i - \beta_0 - X_i^\top \beta_1)^2$. The new estimate is of the form $\hat\beta^{(k)} = (\mathbb{X}_{(k)}^\top \mathbb{X}_{(k)})^{-1} \mathbb{X}_{(k)}^\top \mathbb{Y}_{(k)}$, where $\mathbb{Y}_{(k)}$ is the $(n-1) \times 1$ response vector with the $k$th observation $Y_k$ removed, and $\mathbb{X}_{(k)}$ is the $(n-1) \times (p+1)$ design matrix with the $k$th row $\mathbb{X}_k$ removed. Cook [12] naturally chose the estimate of $\beta$ to define influence, and intuitively, if an observation is influential, the difference between $\hat\beta$ and $\hat\beta^{(k)}$ is expected to be large.
This leads to the following discrepancy measure, that is, the Cook’s distance:
\[ D_k = \frac{\{\hat\beta^{(k)} - \hat\beta\}^\top \mathbb{X}^\top \mathbb{X} \{\hat\beta^{(k)} - \hat\beta\}}{(p+1)\hat\sigma^2}, \tag{2.2} \]
where $\hat\sigma^2 = (n-p-1)^{-1} \sum_{i=1}^n (Y_i - \hat\beta_0 - X_i^\top \hat\beta_1)^2$.

In the high-dimensional regression setting, the classical Cook’s distance (2.2) encounters some difficulties. When $p$ is close to $n$, the OLS estimate is known to be unstable, which would in turn cause $D_k$ to be unstable. When $p > n$, the classical Cook’s distance is not directly computable, because the Gram matrix $\mathbb{X}^\top \mathbb{X}$ is not invertible and the OLS estimator $\hat\beta$ becomes unstable. For those reasons, the regression coefficient estimate may no longer be the best choice to define influence in high-dimensional analysis. This motivates us to consider an alternative influence measure for high-dimensional data.

2.2. High-dimensional influence measure.
In high-dimensional regression analysis where $p \approx n$ or $p > n$, variable selection (screening) plays a central role, whereas marginal covariance or correlation is crucial to the majority of variable selection approaches. Motivated by this observation, for high-dimensional data influence diagnosis, we choose marginal correlation, instead of regression coefficient, as the feature that defines influence. An individual observation’s influence on the marginal correlation is transmitted to various features of downstream analysis, such as variable selection and coefficient estimation.

More specifically, we first define the marginal correlation as $\rho_j = E\{(X_j - \mu_{xj})(Y - \mu_y)\}/(\sigma_{xj}\sigma_y)$, where $\mu_{xj} = E(X_j)$, $\mu_y = E(Y)$, $\sigma_{xj}^2 = \mathrm{var}(X_j)$ and $\sigma_y^2 = \mathrm{var}(Y)$. We then obtain the sample estimate, $\hat\rho_j = \{\sum_{i=1}^n (X_{ij} - \hat\mu_{xj})(Y_i - \hat\mu_y)\}/\{n \hat\sigma_{xj} \hat\sigma_y\}$, for $j = 1, \ldots, p$, where $\hat\mu_{xj}$, $\hat\mu_y$, $\hat\sigma_{xj}$ and $\hat\sigma_y$ are the sample estimates of $\mu_{xj}$, $\mu_y$, $\sigma_{xj}$ and $\sigma_y$, respectively. Next, we continue to use the leave-one-out principle as in the classical Cook’s distance case, and compute the marginal correlation with the $k$th observation removed as
\[ \hat\rho_j^{(k)} = \frac{\sum_{i=1, i \ne k}^n (X_{ij} - \hat\mu_{xj}^{(k)})(Y_i - \hat\mu_y^{(k)})}{(n-1)\hat\sigma_{xj}^{(k)} \hat\sigma_y^{(k)}}, \qquad j = 1, \ldots, p, \ k = 1, \ldots, n, \]
where $\hat\mu_{xj}^{(k)}$, $\hat\mu_y^{(k)}$, $\hat\sigma_{xj}^{(k)}$ and $\hat\sigma_y^{(k)}$ are the corresponding sample estimates with the $k$th observation removed. Finally, we define the influence measure based on the marginal correlation as
\[ \mathcal{D}_k = \frac{1}{p} \sum_{j=1}^p (\hat\rho_j - \hat\rho_j^{(k)})^2. \tag{2.3} \]
We refer to $\mathcal{D}_k$ as the high-dimensional influence measure, or HIM for brevity. We make a few remarks. First, we note that the marginal correlation can be easily computed regardless of the predictor dimension, and such computational advantage is practically very useful for high-dimensional data analysis.
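The leave-one-out computation in (2.3) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors’ implementation: the names `him` and `marginal_corr` are hypothetical, and the Pearson correlation is computed with the same divisor in the covariance and in the standard deviations, so the choice between $n$ and $n-1$ cancels.

```python
import numpy as np

def marginal_corr(Xs, ys):
    """Sample correlation between ys and every column of Xs (hypothetical helper)."""
    Xc = Xs - Xs.mean(axis=0)
    yc = ys - ys.mean()
    num = (Xc * yc[:, None]).sum(axis=0)
    den = np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    return num / den

def him(X, y):
    """Leave-one-out influence on the marginal correlations, following (2.3)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    rho_full = marginal_corr(X, y)
    D = np.empty(n)
    for k in range(n):
        keep = np.arange(n) != k          # drop the k-th observation
        rho_k = marginal_corr(X[keep], y[keep])
        D[k] = np.mean((rho_full - rho_k) ** 2)
    return D
```

Each call to `marginal_corr` costs on the order of $np$ operations, so the whole loop costs on the order of $n^2 p$ without any matrix inversion, which is why the measure remains computable when $p > n$.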
Second, the proposed influence measure is built upon the marginal correlation coefficient, and is effectively scale invariant. However, it does not imply that marginal correlation is the ultimate feature of interest in our influence diagnosis. Instead, a substantial change in the marginal correlation caused by a data point exerts influence on important features such as variable selection and parameter estimation, as we have seen in Figure 1. As such, for an estimation method to be robust to unexpected perturbation [15, 42, 43], the sample marginal correlation should be sufficiently robust. This is an important and necessary condition, although not necessarily sufficient. Finally, use of the marginal correlation to define the influence measure does not imply that we assume a marginal model. Instead, we still assume the joint model (2.1). As it may seem unclear how a marginal measure can capture the influence for a joint model, we will demonstrate through a simple joint model later in Section 2.5 that the newly defined $\mathcal{D}_k$ can indeed identify the influential observation with probability tending to one. This use of marginal correlation is also similar in spirit to the sure independence screening procedure for a joint normal model [22], but is in a different context. Fan and Lv [22] use marginal correlation for the variable screening purpose, while we use it for influence diagnosis.

The proposed high-dimensional influence measure also shares some similarity with the classical Cook’s distance. Note that the Cook’s distance can be reformulated as
\[ D_k = \frac{\hat\epsilon_k^2}{(p+1)\hat\sigma^2} \, \frac{h_{kk}}{(1 - h_{kk})^2}, \qquad k = 1, \ldots, n, \tag{2.4} \]
where $\hat\epsilon_k = \hat Y_k - Y_k$ is the residual and $h_{kk} = \mathbb{X}_k^\top (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}_k$, $k = 1, \ldots, n$, is the $k$th diagonal element of the hat matrix $\mathbb{X}(\mathbb{X}^\top \mathbb{X})^{-1}\mathbb{X}^\top$. Clearly, $D_k$ is an increasing function of both $|\hat\epsilon_k|$ and $h_{kk}$.
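The agreement between the case-deletion form (2.2) and the residual-leverage form (2.4) can be checked numerically in the classical regime $n > p$. The sketch below is illustrative only: the simulated data, the seed and the helper name `cook_by_deletion` are assumptions, and `q` denotes the number of columns of the design matrix, that is, $p + 1$ with the intercept.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 3
Z = rng.standard_normal((n, p))
y = 1.0 + Z @ np.array([2.0, -1.0, 0.5]) + rng.standard_normal(n)

X = np.column_stack([np.ones(n), Z])    # n x (p + 1) design matrix
q = X.shape[1]                          # q = p + 1

beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta
sigma2 = resid @ resid / (n - q)        # residual variance estimate
# Diagonal of the hat matrix X (X'X)^{-1} X'.
h = np.einsum("ij,ij->i", X @ np.linalg.inv(X.T @ X), X)

def cook_by_deletion(k):
    """Cook's distance from an explicit refit without observation k."""
    keep = np.arange(n) != k
    beta_k = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ y[keep])
    d = beta_k - beta
    return d @ (X.T @ X) @ d / (q * sigma2)

D_deletion = np.array([cook_by_deletion(k) for k in range(n)])
D_closed = resid ** 2 * h / (q * sigma2 * (1.0 - h) ** 2)
```

The two arrays agree up to floating-point error, which reflects the fact that the closed form needs no refitting. The same refit is not available when $p > n$, since the linear system is then singular.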
An observation therefore has a large Cook’s distance if it has a large residual or if it is a high leverage point in terms of $h_{kk}$. Our proposed influence measure shares a similar spirit. In Section 2.3, we will derive a decomposition of our influence measure $\mathcal{D}_k$ under some conditions, and will show that $\mathcal{D}_k$ is mainly dominated by a term called $B_2$, which is of the form
\[ B_2 = \frac{n-2}{pn(n-1)^2} \sum_{j=1}^p Y_k^2 X_{kj}^2 = \frac{n-2}{pn(n-1)^2} Y_k^2 \|X_k\|^2. \]
Consequently, the $k$th data point $(X_k, Y_k)$ is more likely to be marked influential if it has a large response and a large value of $\|X_k\|$. Here $\|X_k\|$ plays a similar role to $h_{kk}$ in the classical Cook’s distance, for detecting influential points induced mainly by covariates, whereas $Y_k$ plays a similar role to the residual in the Cook’s distance, for detecting influential points induced by abnormal responses.

2.3. Theoretical properties.
We next establish the asymptotic distribution of the proposed high-dimensional influence measure $\mathcal{D}_k$ as both the sample size $n$ and the dimensionality $p$ go to infinity. Toward that end, we impose the following conditions.

(C.1) For any fixed $j = 1, \ldots, p$, $\rho_j$ is constant and does not change as $p$ increases.

(C.2) For the covariance matrix $\Sigma = \mathrm{cov}(X)$, with the eigen-decomposition $\Sigma = \sum_{j=1}^p \lambda_j u_j u_j^\top$, it is assumed that $l_p = \sum_{j=1}^p \lambda_j^2 = O(p^r)$ for some $0 \le r < 2$.

(C.3) The predictor $X_i$ follows a multivariate normal distribution and the random noise $\varepsilon_i$ follows a normal distribution.

Condition (C.1) is very general, since it only requires that for any fixed $j$, $\rho_j$ is a constant independent of $p$. A sufficient condition for condition (C.2) to hold is that all eigenvalues of $\Sigma$ are bounded. This condition also permits eigenvalues of $\Sigma$ to diverge to infinity, but at a slower rate compared to the dimensionality. The normality assumption on $X$ is mainly for convenience, and can be relaxed, for instance, to distributions with sub-Gaussian tails, at the expense of more lengthy proofs. In addition, since the error term is assumed normal, $Y$ is normally distributed.

Next, we derive a decomposition of $\mathcal{D}_k$ that serves as a basis for its asymptotic distribution. The result is presented under the assumption that $\mu_y$ and $\mu_{xj}$ are 0 and that $\sigma_{xj}$ and $\sigma_y$ are 1 for $1 \le j \le p$. This leads to the simplified estimates $\hat\rho_j = n^{-1} \sum_{1 \le i \le n} X_{ij} Y_i$ and $\hat\rho_j^{(k)} = (n-1)^{-1} \sum_{i \ne k} X_{ij} Y_i$. We note that this standardization is only for the purpose of simplifying the presentation and it loses no generality. As we will show later in Proposition 2, replacing the unknown quantities $\mu_{xj}$, $\mu_y$, $\sigma_{xj}$ and $\sigma_y$ with their consistent sample estimates does not alter the asymptotic distribution of $\mathcal{D}_k$. For $t, s = 1, \ldots, n$, let $K_{p,ts} = \sum_{j} X_{tj} X_{sj}/p$ and $c_p = \max_{1 \le j \le p} \lambda_j$. After some algebraic computation, we obtain that
\begin{align*}
\mathcal{D}_k &= \frac{1}{p} \sum_{j=1}^p \Bigl\{ \frac{1}{n(n-1)} \sum_{1 \le t \le n,\, t \ne k} Y_t X_{tj} - \frac{1}{n} Y_k X_{kj} \Bigr\}^2 \\
&= \frac{1}{\{n(n-1)\}^2} \sum_{t=1}^n Y_t^2 K_{p,tt} + \frac{n-2}{n(n-1)^2} Y_k^2 K_{p,kk} \tag{2.5} \\
&\quad + \frac{1}{[n(n-1)]^2} \sum_{t \ne s} Y_t Y_s K_{p,ts} - \frac{2}{n(n-1)^2} \sum_{t=1,\, t \ne k}^n Y_k Y_t K_{p,tk} \\
&:= B_1 + B_2 + B_3 - B_4.
\end{align*}
Then we have the following result on the expectation of $\mathcal{D}_k$, along with the variances of the terms $B_1, \ldots, B_4$ in its decomposition.

Proposition 1.
Suppose that $(X_i, Y_i)$ are i.i.d. observations and that (C.1) and (C.3) hold. Then it holds that
\[ E(\mathcal{D}_k) = [n(n-1)]^{-1} E(Y_k^2) E(K_{p,kk}) + O(n^{-2} p^{-1} l_p^{1/2}). \]
In addition, $\mathrm{var}(B_1) = O(n^{-7})$, $\mathrm{var}(B_2) = O(n^{-4})$, $\mathrm{var}(B_3) = O(c_p n^{-5} p^{-1}) + O(p^{-1} n^{-6})$ and $\mathrm{var}(B_4) = O(l_p p^{-2} n^{-5}) + O(c_p p^{-1} n^{-4})$.
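The decomposition (2.5) is a purely algebraic identity, so it can be verified numerically on arbitrary data. The following sketch assumes the simplified estimates $\hat\rho_j = n^{-1}\sum_i X_{ij}Y_i$ and $\hat\rho_j^{(k)} = (n-1)^{-1}\sum_{i \ne k} X_{ij}Y_i$. The simulated inputs, the seed and the variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 30, 200, 4
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)

# Direct evaluation of the influence measure with the simplified estimates.
mask = np.arange(n) != k
rho_full = X.T @ Y / n
rho_loo = X[mask].T @ Y[mask] / (n - 1)
D_direct = np.mean((rho_full - rho_loo) ** 2)

# Terms of the decomposition, with K[t, s] = sum_j X_tj X_sj / p.
K = X @ X.T / p
YY = np.outer(Y, Y)
c = (n * (n - 1)) ** 2
B1 = np.sum(np.diag(YY) * np.diag(K)) / c
B2 = (n - 2) / (n * (n - 1) ** 2) * Y[k] ** 2 * K[k, k]
off_diag = ~np.eye(n, dtype=bool)                 # all pairs with t != s
B3 = np.sum((YY * K)[off_diag]) / c
B4 = 2.0 / (n * (n - 1) ** 2) * np.sum(Y[k] * Y[mask] * K[mask, k])
D_decomp = B1 + B2 + B3 - B4
```

Here `D_direct` and `D_decomp` coincide up to rounding, and `B2` is typically the largest of the four terms, in line with the role it plays in the asymptotic analysis.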
Now we return to the asymptotic distribution of $\mathcal{D}_k$, which Proposition 1 helps to derive. We first present the result assuming $\mu_{xj}$, $\mu_y$, $\sigma_{xj}$ and $\sigma_y$ are all known. Then we obtain the asymptotic distribution when $\mu_{xj}$, $\mu_y$, $\sigma_{xj}$ and $\sigma_y$ are replaced by their sample estimates.

Theorem 1.
Suppose that (C.1)–(C.3) hold. When there is no influential point and $\min\{n, p\} \to \infty$, we have
\[ n^2 \mathcal{D}_k \to \chi^2(1), \]
where $\chi^2(1)$ is the chi-square distribution with one degree of freedom.

Next, we consider the asymptotic distribution of $\mathcal{D}_k$ when $\mu_{xj}$, $\mu_y$, $\sigma_{xj}$ and $\sigma_y$ are unknown. A natural choice is to replace them by their corresponding sample moment estimates, $\hat\mu_y = \sum_i Y_i/n$, $\hat\mu_{xj} = \sum_i X_{ij}/n$, $\hat\sigma_{xj}^2 = \sum_i (X_{ij} - \hat\mu_{xj})^2/(n-1)$ and $\hat\sigma_y^2 = \sum_i (Y_i - \hat\mu_y)^2/(n-1)$. More generally, $\mu_{xj}$, $\mu_y$, $\sigma_{xj}$ and $\sigma_y$ may be replaced by $\sqrt{n}$-consistent estimates under certain moment assumptions. Let $\dot Y_t = (Y_t - \mu_y)/\sigma_y$ and $\dot X_{tj} = (X_{tj} - \mu_{xj})/\sigma_{xj}$, $t = 1, \ldots, n$, $j = 1, \ldots, p$, and let $(Q_{xj}, R_{xj}) = ((\hat\mu_{xj} - \mu_{xj})/\sigma_{xj}, \sigma_{xj}/\hat\sigma_{xj})$, with $(Q_y, R_y)$ defined similarly. Furthermore, let $S_{Qx} = \limsup_{n \to \infty} E(n^{1/2} Q_{xj})^4$, $S_{Rx} = \limsup_{n \to \infty} E[n^{1/2}(R_{xj} - 1)]^4$, $S_{Qy} = \limsup_{n \to \infty} E(n^{1/2} Q_y)^4$ and $S_{Ry} = \limsup_{n \to \infty} E[n^{1/2}(R_y - 1)]^4$. We make the following additional assumption.

(C.4) For all $1 \le j \le p$, $(Q_{xj}, R_{xj})$ are the same symmetric function of $\{\dot X_{tj}, t = 1, \ldots, n\}$; and $(Q_y, R_y)$ are also the same symmetric function of $\dot Y_t$ for $t = 1, \ldots, n$. We assume that $S_{Qx}$, $S_{Rx}$, $S_{Qy}$ and $S_{Ry}$ are finite.

Condition (C.4) indicates that, for all $1 \le j \le p$, $((\hat\mu_{xj} - \mu_{xj})/\sigma_{xj}, \sigma_{xj}/\hat\sigma_{xj}) = f(\dot X_{1j}, \ldots, \dot X_{nj})$, where $f(x_1, \ldots, x_n) = (f_1(x_1, \ldots, x_n), f_2(x_1, \ldots, x_n))$ and $f_1$ and $f_2$ are symmetric functions. Condition (C.4) is a mild condition. Recall that $(X_i, Y_i)$, $i = 1, \ldots, n$, are i.i.d. normal in Theorem 1. When $\hat\mu_{xj}$ and $\hat\sigma_{xj}$ are the moment estimates, we have $Q_{xj} = n^{-1} \sum_{1 \le t \le n} \dot X_{tj} \sim N(0, 1/n)$ and consequently $S_{Qx}$ is finite. Moreover, we have $R_{xj} = 1/S_{nj}$, where $S_{nj}^2$ is the sample variance of $\{\dot X_{tj}, t = 1, \ldots, n\}$. Since $S_{nj}^2 \sim \chi^2_{n-1}/(n-1)$, $S_{Rx}$ is also finite. Similarly, $S_{Qy}$ and $S_{Ry}$ are also finite with the moment estimates $\hat\mu_y$ and $\hat\sigma_y$. Under the normality of $(X, Y)$, (C.4) also holds for some robust estimates.

Proposition 2.
Assume that $\hat\mu_{xj}$, $\hat\sigma_{xj}$, $\hat\mu_y$ and $\hat\sigma_y$ are $\sqrt{n}$-consistent and satisfy (C.4). Substituting $\mu_{xj}$, $\mu_y$, $\sigma_{xj}$ and $\sigma_y$ with their corresponding estimates in $\mathcal{D}_k$, Theorem 1 continues to hold under the same conditions.

We remark that the asymptotic distribution of the high-dimensional influence measure $\mathcal{D}_k$ is obtained as the number of predictors $p$ goes to infinity. This is different from the case of the classical Cook’s distance $D_k$ where $p$ is fixed, for which a standard asymptotic distribution is not attainable. We view this as a blessing of dimensionality, in contrast to the usually conceived curse of dimensionality. For more examples of the blessing of dimensionality, see [17] and [28].

2.4. Influence diagnosis.
An important implication of Theorem 1 is that we can now obtain a $p$-value for influence diagnosis. Specifically, for the hypothesis that the $k$th observation is not influential versus its alternative, the $p$-value is $P(\chi^2(1) > n^2 \mathcal{D}_k)$. Given that the number of predictors $p$ is usually large and multiple hypotheses are tested simultaneously, we employ the false discovery rate based multiple testing procedure of [5] to determine which hypotheses should be rejected while controlling the false discovery rate. Denote by $n_{\mathrm{infl}}$ the number of influential observations among the $n$ observations, by $n_{\mathrm{tp}}$ and $n_{\mathrm{fp}}$ the numbers of observations that are correctly rejected and incorrectly rejected, respectively, and by $r$ the total number of rejections in the $n$ hypothesis tests. Then the power and the false discovery rate are $\mathrm{Power} = n_{\mathrm{tp}}/n_{\mathrm{infl}}$ and $\mathrm{FDR} = n_{\mathrm{fp}}/r$, respectively. We set the FDR level to a small value, such as 0.05, and report the power and other quantities in the numerical study section. We also remark that more sophisticated alternative multiple testing procedures, for example, those in [6], [20] and [32], can be used in conjunction with our approach, but that is not the focus of this article.

2.5. A power comparison of two influence measures.
We next study the power property of both the new diagnosis measure and the Cook’s distance via a simple model. This study serves two purposes. First, we can gain insight about the difference between the two diagnosis measures. Second, it offers evidence that the marginal correlation based measure is capable of detecting an influential observation in a joint model with a large probability.

More specifically, we consider the model (2.1), but drop the intercept for simplicity. The predictors $X_i$, $i = 1, \ldots, n$, are i.i.d. observations from a multivariate normal distribution $N_p(0, \Sigma)$, where $\Sigma$ is a $p \times p$ covariance matrix with all its diagonal elements $\sigma_{jj} = 1$. The error term $\varepsilon_i$ is of the structure $\varepsilon_i = e_i + c_i$, where $e_i$ follows a standard normal distribution and $c_i$ is constant, with $c_2 = \cdots = c_n = 0$. Under this setup, the first observation is an influential point as long as $c_1 \ne 0$, and we aim to establish the power of both the classical and our proposed high-dimensional influence measure in identifying this influential observation. Let $D_i$ be the Cook’s distance defined in (2.2) for the $i$th observation, $\mathcal{D}_i^{(c)}$ be the proposed high-dimensional measure in (2.3), and $T_i^{(c)} = n^2 \mathcal{D}_i^{(c)}$ be the statistic defined in Theorem 1. Moreover, consider the following condition:

(C.5) All eigenvalues of $\Sigma$ are positive and bounded.

Then the next theorem states that both the classical Cook’s distance and its high-dimensional counterpart have power approaching one for detecting the influential observation, under appropriate yet different conditions.

Theorem 2.
Consider the model stated above. Suppose that (C.1) and (C.5) hold. If $\max\{n^{-1}p^{3}, |c_1|^{-1}n^{1/2}\} \to 0$, then we have that for the Cook’s distance $D_i$, $P(nD_1 - \max_{2 \le i \le n} nD_i > M) \to 1$ for any $M > 0$, when $n \to \infty$. Suppose that (C.1) and (C.2) hold. If $\max\{|c_1|^{-1}(\log n)^{1/2}, l_p p^{-2} c_1^{-2} n\} \to 0$, then we have that for the proposed high-dimensional influence measure $\mathcal{D}_i^{(c)}$, $P(T_1^{(c)} - \max_{2 \le i \le n} T_i^{(c)} > M) \to 1$ for any $M > 0$, when $\min(n, p) \to \infty$.

The proof is given in the supplementary material [41]. Here we compare the two sets of conditions to gain some insight about the difference between the two diagnosis measures. First, we examine the condition $\max\{n^{-1}p^{3}, |c_1|^{-1}n^{1/2}\} \to 0$ that is required by the Cook’s distance. The condition $|c_1|^{-1}n^{1/2} \to 0$ requires $|c_1|$ to diverge faster than $n^{1/2}$ as $n$ goes to infinity. Moreover, in terms of the predictor dimension $p$, the classical Cook’s distance is defined when $p < n$. Consequently, the condition $n^{-1}p^{3} \to 0$, or equivalently, $p = o(n^{1/3})$, constrains $p$ to grow with $n$ at a much slower rate. We note that although this rate may not be the optimal one, the condition $p = o(n)$ is clearly necessary for the classical Cook’s distance to be feasible. Next, we examine the condition $\max\{|c_1|^{-1}(\log n)^{1/2}, l_p p^{-2} c_1^{-2} n\} \to 0$ that is required by our new influence measure. For illustration, we consider a simple case with all the eigenvalues of $\Sigma$ bounded and $p > n$. We know immediately that both $l_p/p$ and $n/p$ are bounded. Accordingly, we have $l_p p^{-2} c_1^{-2} n \to 0$ as $c_1 \to \infty$. As $\log n \to \infty$ when $n \to \infty$, a sufficient condition for $\max\{|c_1|^{-1}(\log n)^{1/2}, l_p p^{-2} c_1^{-2} n\} \to 0$ is $(\log n)^{1/2}/|c_1| \to 0$. This suggests that the influential point can be consistently detected, as long as $c_1$ diverges to infinity at a speed faster than $(\log n)^{1/2}$. This is clearly a rate much slower than $n^{1/2}$. Finally, the bounded eigenvalue condition (C.5) is commonly used in the literature for estimating covariance matrices [7]. Here it is assumed for the Cook’s distance case. For the new diagnosis measure, (C.2) is required instead, which is weaker than (C.5).
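The diagnosis rule of Section 2.4 can be sketched as follows. The sketch assumes that the multiple testing procedure of [5] is the Benjamini-Hochberg step-up rule, and it uses the identity $P(\chi^2(1) > x) = \mathrm{erfc}(\sqrt{x/2})$ to avoid any dependency beyond NumPy and the standard library. The function names `chi2_1_sf`, `benjamini_hochberg` and `flag_influential` are hypothetical.

```python
import math
import numpy as np

def chi2_1_sf(x):
    """Upper tail probability of the chi-square distribution with 1 d.f."""
    return math.erfc(math.sqrt(x / 2.0))

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean rejection mask from the step-up rule at level alpha."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = pvals[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        last = np.max(np.nonzero(below)[0])   # largest i with p_(i) <= i * alpha / m
        reject[order[: last + 1]] = True
    return reject

def flag_influential(D, alpha=0.05):
    """Flag observations from influence values D_k using n^2 * D_k ~ chi-square(1)."""
    D = np.asarray(D, dtype=float)
    n = D.size
    pvals = np.array([chi2_1_sf(n ** 2 * d) for d in D])
    return benjamini_hochberg(pvals, alpha), pvals
```

Given a vector of influence values for the $n$ observations, `flag_influential` returns the rejection mask together with the $p$-values, so the flagged observations can be removed before rerunning the downstream selection or screening method.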
3. Numerical studies.
We have carried out an intensive simulation study, along with a microarray data analysis, to examine the empirical performance of our proposed high-dimensional influence measure. Since the classical Cook’s distance depends on both leverage points and outliers, our simulation study considers three scenarios: outliers only (model 1), leverage points only (model 2), and mixed leverage points and outliers (model 3). For the scenarios with leverage points (models 2 and 3), we further consider sub-scenarios in which either important covariates or noisy covariates contribute to the leverage observations. Below we present a summary of the analysis.

3.1. Simulation models.
For all simulations, we set the sample size n =100, and the number of predictors p = 1000. We set 10% of total observationsas influential, so that ˜ n = 10. We consider the model Y i = X ⊤ i β + ε i , i = 1 , . . . , n, where X i is multivariate normal with cov( X ij , X ij ′ ) = 0 . | i − j | , ε i followsthe standard normal distribution, and β = (3 , . , , , , , . . . , ⊤ . We sim-ulated n = 100 i.i.d. observations from this model. Next, we reset the first˜ n = 10 data observations as coming from another model,˜ Y i = ˜ X ⊤ i ˜ β + ε i , i = 1 , . . . , ˜ n, where perturbations are to be introduced on the regression coefficient, thecovariates and their combination. In particular, we have considered threeperturbation models of generating influential points. Model
1. The perturbation was introduced on the response. That is, for i = 1 , . . . , ˜ n , ˜ X i = X i , and ˜ β = (3 , . , κ, κ, , κ, . . . , κ ) ⊤ . In other words, theinfluential observations are generated according to ˜ Y i = X ⊤ i β + κZ i + ε i ,where Z i = X ⊤ i γ and γ = (0 , , , , , , , . . . , ⊤ . In this case, the responsesof the influential observations are contaminated by a random perturbation κZ i . Consequently, the corresponding responses admit a different pattern,whereas the predictors of influential observations follow the same distribu-tion as the rest. Model
2. The perturbation was introduced on the predictors and keep theresponse uncontaminated. That is, for i = 1 , . . . , ˜ n , ˜ Y i = Y i and ˜ X ij = X ij +30 κI { j ∈ S } , j = 1 , . . . , p . In other words, a set S of predictors admit a differentpattern, and its magnitude is controlled by the scalar κ . We examined threechoices of S : S = { , . . . , } , and in this case, the influenced predictorsoverlap with those truly relevant ones { , , } in β ; S = { p − , . . . , p } ,and as such there is no overlap; and S = { , . . . , p } , and in this case, allpredictors are subjected to potential influence. ZHAO, LENG, LI AND WANG
Model
3. The perturbation was introduced on both the response and thepredictors. That is, ˜ β = (3 , . , κ, κ, , κ, . . . , κ ) ⊤ and ˜ X ij = X ij + 30 κI { j ∈ S } , j = 1 , . . . , p . Again, we considered three sets of S as described earlier.It is clear that κ is the parameter that dictates the magnitude of the influ-ential points. When κ = 0, there is no influential point. We used κ = 0, 0.4,0.8, 1.2 and 1.6 in our experiment.3.2. Performance evaluation.
We evaluate and compare our proposedinfluence measure in several ways. First, we study the potential impact ofinfluential data and how the proposed diagnosis measure could help limitsuch impact. Toward that end, we first applied the LASSO or SIS to thefull data. Then we computed the proposed high-dimensional influence mea-sure, evaluated the corresponding p -value, and applied the multiple testingprocedure of [5], with the false discovery rate fixed at α = 5%. We then ob-tained a reduced data set by removing those flagged influential points andapplied the LASSO or SIS to the reduced data set. We evaluated the impactof influential data in terms of coefficient estimation, variable selection, andvariable screening. For coefficient estimation, we report the error betweenthe estimated and true β , ERR = k ˆ β − β true k ; for variable selection, wereport the false positive rate, FPR = / β under a LASSO penaltyand as such avoid the difficulty of the OLS estimate when p > n . This seemsa very natural solution. We compare it with our proposal in terms of estima-tion accuracy, selection accuracy and power. On the other hand, we note thelack of asymptotic theory for this modified Cook’s distance. To determinethe threshold for influential data, one may use bootstrap. However, in ourcomparison, we simply label the observations with the largest ˜ n modifiedCook’s distance as influential. This is not feasible in practice, but providesa useful benchmark for comparison. The other competing solution is thepenalized least absolute deviation via the LASSO penalty (LAD + LASSO)[4, 37]. Due to the use of the least absolute deviation as the loss function, thismethod is designed to handle heavy tailed errors in linear regression, and assuch a potentially useful way to limit impact of the influence observations.3.3. The results.
The averages of a total of 200 random replications are reported in Tables 1–3.

Table 1. Simulation results for perturbation model 1. HIM denotes our proposed high-dimensional diagnosis measure, and CD denotes the classical Cook’s distance.

                           κ
Method        Criterion    0       0.4     0.8     1.2      1.6
SIS           CP           1       0.25    0       0        0
SIS + HIM     CP           1       1       1       1        1
LASSO         ERR          0.510   4.917   9.553   14.636
              FPR          0.002   0.094   0.103   0.107
LASSO + HIM   ERR          0.519   1.296   1.020   0.872
              FPR          0.002   0.045   0.029   0.015
              Power        –               0.765   0.865
LASSO + CD    ERR          0.535   1.136   2.176   2.565
              FPR          0.003   0.034   0.066   0.072
              Power        –       0.630   0.670   0.700
LAD + LASSO   ERR          0.642   1.920   2.073   2.406

Table 2. Simulation results for perturbation model 2. HIM denotes our proposed high-dimensional diagnosis measure, and CD denotes the classical Cook’s distance.

                                   κ
Subset  Method        Criterion    0       0.4     0.8     1.2      1.6
S_1     SIS           CP           1       0.05    0       0        0
        SIS + HIM     CP           1       0.05
        LASSO         ERR          0.439   4.917   4.972   4.971
                      FPR          0.002   0.086   0.090   0.089
        LASSO + HIM   ERR          0.455   4.803   4.591   3.055
                      FPR          0.002   0.080   0.060   0.055
                      Power        –       0.620   0.775   0.892
        LASSO + CD    ERR          0.513   4.566   4.568   4.603
                      FPR          0.004   0.073   0.073   0.070
                      Power        –       0.095   0.085   0.105
        LAD + LASSO   ERR          0.642   1.339   1.303   1.320
S_2     SIS           CP           1       1       1       1        1
        SIS + HIM     CP           1       1       1       1        1
        LASSO         ERR          0.509   0.456   0.439   0.450
                      FPR          0.001   0.001   0.001   0.002
        LASSO + HIM   ERR          0.521   0.494   0.493   0.494
                      FPR          0.001   0.001   0.001   0.002
                      Power        –       0.695           0.85
        LASSO + CD    ERR          0.548   0.523   0.532   0.556
                      FPR          0.001   0.002   0.002   0.002
                      Power        –       0.065   0.085   0.135
        LAD + LASSO   ERR          0.642   0.650   0.645   0.647
S_3     SIS           CP           1       0.35    0.45    0.30
        SIS + HIM     CP                   0.50    0.60    0.62
        LASSO         ERR          0.473   1.567   1.545   1.598
                      FPR          0.003   0.051   0.053   0.051
        LASSO + HIM   ERR          0.490   1.517   1.456   1.221
                      FPR          0.003   0.034   0.031   0.023
                      Power        –       0.735   0.86    0.95
        LASSO + CD    ERR          0.560   1.751   1.700   1.743
                      FPR          0.003   0.047   0.042   0.042
                      Power        –       0.115   0.085   0.115
        LAD + LASSO   ERR          0.642   0.608   0.573   0.580

Table 3. Simulation results for perturbation model 3. HIM denotes our proposed high-dimensional diagnosis measure, and CD denotes the classical Cook’s distance.

                                   κ
Subset  Method        Criterion    0       0.4     0.8     1.2      1.6
S_1     SIS           CP           1       1       0.65    0.10
        SIS + HIM     CP                   0.90    1       1        1
        LASSO         ERR          0.446   1.559   5.308   9.628    14.498
                      FPR          0.002   0.062   0.093   0.099
        LASSO + HIM   ERR          0.447   1.278   0.771   0.499
                      FPR          0.002   0.046   0.027   0.003
                      Power        –       0.185   0.94    1        1
        LASSO + CD    ERR          0.559   0.686   2.149   5.623
                      FPR          0.002   0.009   0.063   0.084
                      Power        –       0.555   0.720   0.675
        LAD + LASSO   ERR          0.642   1.416   4.367   8.740
S_2     SIS           CP           1       1       0.05    0        0
        SIS + HIM     CP           1       1       1       1        1
        LASSO         ERR          0.479   2.090   6.619   11.997
                      FPR          0.002   0.072   0.095   0.101
        LASSO + HIM   ERR          0.494   1.836   0.696   0.475
                      FPR          0.002   0.062   0.009   0.002
                      Power        –       0.145   0.955   1        1
        LASSO + CD    ERR          0.501   0.769   3.702   7.676
                      FPR          0.003   0.016   0.078   0.087
                      Power        –       0.605   0.680   0.685
        LAD + LASSO   ERR          0.642   1.859   5.855   10.829
S_3     SIS           CP           1       1
        SIS + HIM     CP
        LASSO         ERR          0.464   1.682   5.720   10.943
                      FPR          0.002   0.065   0.098   0.103
        LASSO + HIM   ERR          0.484   1.479   1.262   0.557
                      FPR          0.002   0.057   0.034   0.003
                      Power        –               0.87    1        1
        LASSO + CD    ERR          0.586   0.726   1.874   4.504
                      FPR          0.002   0.013   0.055   0.074
                      Power        –       0.465   0.765   0.810
        LAD + LASSO   ERR          0.642   1.635   5.264   10.662

We make the following observations.

(1) First, the presence of influential points significantly affects variable selection and screening accuracy. This can be seen by comparing the results of SIS and SIS + HIM in terms of CP. Consider, for example, Table 1. As κ increases, the coverage probability of the SIS method deteriorates quickly, from 1 at κ = 0 to 0 at κ = 1.6. This confirms that influential observations do affect variable screening consistency. Meanwhile, the performance of SIS + HIM is quite encouraging, as its CP values remain at 1 for every κ value considered. This suggests that the proposed HIM method helps SIS by removing the influential observations.

(2) Second, the presence of influential observations seriously affects estimation accuracy. This can be seen clearly by comparing the results of LASSO and LASSO + HIM in terms of ERR values. For instance, the ERR values in Table 3 for LASSO with $S_1$ increase quickly from 0.446 at κ = 0 to 14.498 at κ = 1.6. This confirms that influential observations degrade the accuracy of the LASSO estimate. However, we find that the ERR values of LASSO + HIM are always well controlled. As κ increases, the power of HIM to detect influential observations increases. Thus, those influential observations are more likely to be detected and eliminated from the data analysis. As a result, the ERR values of LASSO + HIM eventually settle at a stable level as κ increases. This confirms the usefulness of the HIM method for LASSO estimation, even though its definition only involves marginal correlation coefficients.

(3) Third, the performance of LASSO + CD is mixed. If the perturbation is due to the response only, as in Table 1, it yields much better performance than LASSO, with much smaller ERR values. This suggests that LASSO + CD can perform well to limit the effect of influential points. However, even for this example, it is still outperformed by LASSO + HIM. The story changes if the perturbation is due to the predictors, as in Table 2. This is to be expected because, with contaminated predictors, LASSO is no longer a stable method for variable selection. If predictors are selected incorrectly, the subsequent modified Cook’s distance cannot be calculated appropriately. This makes the performance of LASSO + CD unsatisfactory.

(4) Fourth, as a robust regression method, LAD + LASSO performs quite well. Its ERR values are smaller than those of the LASSO estimates in all the tables. However, in most cases, it is still outperformed by LASSO + HIM, as seen from Tables 1 and 3.

(5) Lastly, we find that in most cases the reported FPR values are well controlled. Furthermore, as κ increases, the corresponding empirical power increases toward 100%. These findings are consistent with the theoretical claims in Theorems 1 and 2.

To summarize, our simulation experiments confirm that the proposed HIM method is useful in controlling the effects of influential observations in terms of parameter estimation and variable screening.

3.4. A real data example.
We applied our proposed influence diagnosis approach to the microarray data of [31], and noted that the analysis results become substantially different when the detected influential observations are removed. For this dataset, F1 animals were intercrossed and then 120 twelve-week-old male offspring were selected for tissue harvesting from the eyes and for microarray analysis. The Affymetrix microarrays that were used to analyze the RNA from the eyes of those F2 animals contain over 31,042 different probe sets. Among them, one probe is for gene TRIM32, which was recently found to cause Bardet–Biedl syndrome [10], a genetically heterogeneous disease of multiple organ systems including the retina. One goal of this data analysis is to find genes whose expressions are correlated with that of gene TRIM32. We first followed [27] to exclude probes that were not expressed in the eye or that lacked sufficient variation, which results in 18,975 probes as regressors. We then followed [22] to retain the top 1000 probes that are most correlated with the probe of TRIM32. The resulting analysis has p = 1000 predictors and a sample size n = 120. As a standard procedure [27], all the probes are standardized to have mean zero and standard deviation one.

We next applied our method with FDR level α = 0.10 to the data, and identified a total of 5 influential observations. Their corresponding p-values were 0, 0.0004, 0.0011, 0.0029 and 0.0033, respectively. We also show the logarithm of the p-values versus the indices for all 120 observations in Figure 2. To assess the influence of the detected points, we again compared the LASSO estimates with and without those points. Since we used ten-fold cross-validation to select the tuning parameter and every run is random, we repeated this analysis 100 times and report the average results.

Fig. 2. The logarithm of the p-value for each observation: the detected influential points are denoted by solid circles.

We summarize the difference between the estimates in the following aspects: the sparsity, the norm difference and the angle between the two estimates. First, by removing the identified influential observations, the resulting LASSO estimate is considerably more sparse. The average model size with the full data is 63. By contrast, the average model size without the influential observations reduces to 27. The potential influential points clearly have a significant effect on the model size. Besides that, the average number of predictors commonly selected by fitting the full data and the reduced data is only 8.67, which again shows a clear discrepancy between the two estimates. Consequently, the influential points identified by our approach seem to have a significant effect on subsequent analysis. Second, denote $d_1 = \|\hat\beta_{\mathrm{full}}\|$, $d_2 = \|\hat\beta_{\mathrm{redu}}\|$ and $d_3 = \|\hat\beta_{\mathrm{redu}} - \hat\beta_{\mathrm{full}}\|$, where $\hat\beta_{\mathrm{full}}$ is the LASSO estimate using all the observations and $\hat\beta_{\mathrm{redu}}$ is the estimate after removing the influential points identified by HIM. We observe that the average of $(d_1 - d_2)/d_1$ is 0.532 and that of $d_3/d_1$ is 0.972. Both show that the estimates without influential points are quite different in terms of the $\ell_2$ norm. In addition, the cosine of the angle between $\hat\beta_{\mathrm{full}}$ and $\hat\beta_{\mathrm{redu}}$, defined as $\hat\beta_{\mathrm{full}}^\top \hat\beta_{\mathrm{redu}} / (d_1 d_2)$, equals 0.262, averaged over the 100 runs. These numbers again indicate that the estimates change substantially after removing the influential observations. In summary, this analysis illustrates the importance of influence diagnosis, and the identified influential observations should be treated with extreme care.
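The comparison between the full-data and reduced-data estimates can be sketched as follows. This is only an illustration under stated assumptions: the function name `compare_estimates`, the dictionary keys, the labels d1, d2, d3 for the three norms, and the toy coefficient vectors are all hypothetical, and the norms are taken to be Euclidean.

```python
import numpy as np


def compare_estimates(beta_full, beta_redu):
    """Sparsity, norm and angle summaries for two coefficient vectors."""
    beta_full = np.asarray(beta_full, dtype=float)
    beta_redu = np.asarray(beta_redu, dtype=float)
    d1 = np.linalg.norm(beta_full)               # norm of the full-data estimate
    d2 = np.linalg.norm(beta_redu)               # norm of the reduced-data estimate
    d3 = np.linalg.norm(beta_redu - beta_full)   # norm of their difference
    support_full = set(np.flatnonzero(beta_full).tolist())
    support_redu = set(np.flatnonzero(beta_redu).tolist())
    return {
        "size_full": len(support_full),
        "size_redu": len(support_redu),
        "common": len(support_full & support_redu),
        "rel_norm_change": (d1 - d2) / d1,
        "rel_diff": d3 / d1,
        "cosine": float(beta_full @ beta_redu) / (d1 * d2),
    }


# Toy vectors, not the microarray estimates.
summary = compare_estimates([3.0, 0.0, 4.0, 0.0], [3.0, 0.0, 0.0, 0.0])
print(summary)
```

In practice one would average these summaries over repeated cross-validated fits, as done in the analysis above.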
4. Extension to generalized linear models.
The main idea of the high-dimensional influence measure can be extended to a broad class of regression models. Here we briefly discuss one such extension to generalized linear models (GLM). Assume that the data $(X_i, Y_i)$ follow an exponential family distribution with the canonical probability density function
$$ f(y; \theta) = \exp\{y\theta - b(\theta) + c(y)\}, $$
and the conditional mean is of the form
$$ E(Y_i \mid X_i) = b'(\theta(X_i)) = g^{-1}(\beta_0 + X_i^\top \beta), $$
where $g$ is a known link function. For the purpose of feature screening in ultrahigh-dimensional regressions, [23] introduced a marginal utility measure, the maximum marginal likelihood estimator, as
$$ \hat{\boldsymbol\beta}_j = (\hat\beta_{j,0}, \hat\beta_j) = \arg\min E_n\, l(Y, \beta_{j,0} + \beta_j X_j), $$
where $l(Y; \theta) = -[Y\theta - b(\theta) - \log c(Y)]$ and $E_n f(X, Y) = n^{-1} \sum_{i=1}^n f(X_i, Y_i)$. That is, $\hat{\boldsymbol\beta}_j$ is the maximum likelihood estimator obtained by fitting a GLM of $Y$ on the $j$th predictor $X_j$ alone plus an intercept. As remarked by [23], this measure can be rapidly computed.

In the context of high-dimensional diagnosis, we define the high-dimensional influence measure for generalized linear models for the $k$th observation, $k = 1, \ldots, n$, as
$$ \mathcal{D}^{\mathrm{glm}}_k = \frac{1}{p} \sum_{j=1}^p \|\hat{\boldsymbol\beta}_j - \hat{\boldsymbol\beta}^{(k)}_j\|, \tag{4.1} $$
where $\hat{\boldsymbol\beta}^{(k)}_j$ denotes the maximum marginal likelihood estimator with the $k$th observation removed. For GLM, the estimators $\hat{\boldsymbol\beta}_j$ and $\hat{\boldsymbol\beta}^{(k)}_j$ may not have closed-form solutions. Consequently, the exact distribution of the proposed statistic $\mathcal{D}^{\mathrm{glm}}_k$ is complicated and some approximation is necessary. The detailed derivation, however, is beyond the scope of this paper. In practice, one can always sort the values of $\{\mathcal{D}^{\mathrm{glm}}_k, k = 1, \ldots, n\}$ and remove those observations associated with large values of $\mathcal{D}^{\mathrm{glm}}_k$.

We have conducted a small simulation study to examine the empirical performance of this measure for GLM.
The simulation setup is simi-lar to that of model 1 in Section 3.1, except that this time we adopt abinary response model, P ( Y i = 1 | X i ) = 1 / [1 + exp {− (2 + X ⊤ i β ) } ], where β = β true = (5 , , , . . . , ⊤ , and the outliers are generated from the model β = β infl = (5 , , , . . . , , − κ, . . . , − κ ) ⊤ with p/ κ ’s. We set n = 100,with 10% influential observations, that is, n infl = 10, and we set p = 50 or100. Since the asymptotic distribution of D glm k is not available for the logis-tic regression, we flag the 10 observations with the largest p -values of D glm k as influential. For a binary response, one is often interested in classification.As such we compare the misclassification error rate for the full data as E full and for the reduced data as E redu without the detected influential points.We also report the empirical power. The results out of 200 data replicationsare summarized in Table 4. IGH-DIMENSIONAL INFLUENCE MEASURE Table 4
Simulation results for the logistic model

                     κ
p     Criterion      0       0.4     0.8     1.2     1.6
50    Power          –       0.220   0.472   0.422   0.254
      E_full
      E_redu
100   Power
      E_full
      E_redu

From Table 4, we note that the proposed method has some power for a logistic model, but it is lower than that in a linear model. This is probably due to the fact that a binary response contains much less information, and thus detecting influential observations in a logistic model is much more challenging, especially in a high-dimensional setting. On the other hand, we also observe from Table 4 that removing those points with the largest values of $\mathcal{D}^{\mathrm{glm}}_k$ improves the classification accuracy by a large margin. This again suggests the usefulness of influence diagnosis. Meanwhile, we remark that further investigation into both theoretical and empirical properties of the high-dimensional influence measure in GLM is warranted.
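The leave-one-out measure for GLM can be sketched for the logistic case as below. This is a rough illustration, not the authors' code: each marginal fit is a two-parameter logistic regression solved by a plain Newton–Raphson iteration, the average in (4.1) is taken with the unsquared Euclidean norm as the display is printed here, and the function names, the small ridge term, the sample sizes, the coefficients and the number of flagged points are all assumptions.

```python
import numpy as np


def fit_marginal_logistic(x, y, n_iter=25, ridge=1e-8):
    """Newton-Raphson fit of logit P(y=1) = b0 + b1 * x; returns (b0, b1)."""
    Z = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        eta = np.clip(Z @ beta, -30.0, 30.0)      # guard against overflow
        mu = 1.0 / (1.0 + np.exp(-eta))
        grad = Z.T @ (y - mu)                      # score vector
        hess = Z.T @ (Z * (mu * (1.0 - mu))[:, None]) + ridge * np.eye(2)
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return beta


def glm_influence(X, y):
    """D_k = (1/p) * sum_j || beta_j - beta_j^{(k)} || over marginal logistic fits."""
    n, p = X.shape
    full = np.array([fit_marginal_logistic(X[:, j], y) for j in range(p)])
    D = np.zeros(n)
    for k in range(n):
        keep = np.arange(n) != k                   # drop observation k
        loo = np.array(
            [fit_marginal_logistic(X[keep, j], y[keep]) for j in range(p)]
        )
        D[k] = np.mean(np.linalg.norm(full - loo, axis=1))
    return D


# Hypothetical small example: one relevant predictor, binary response.
rng = np.random.default_rng(0)
n, p, m = 60, 20, 6
X = rng.standard_normal((n, p))
prob = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * X[:, 0])))
y = (rng.random(n) < prob).astype(float)

D_glm = glm_influence(X, y)
flagged = np.argsort(D_glm)[-m:]                   # the m largest values
print(flagged)
```

Refitting every marginal model for every deleted observation costs on the order of n times p small fits, which is affordable here only because each fit involves two parameters.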
5. Conclusion.
We perceive several future avenues to extend the pro-posed work in this article. First, we have employed the leave-one-out prin-ciple when quantifying influence of individual observations. We expect thatour high-dimensional influence measure can also be generalized to the casesof leaving out pairs of observations, or triplets or more. Such a strategy canbe useful when those observations conceal one another [18]. Second, we havefocused on the classical linear model in our development, while extensionto more sophisticated models, such as the generalized linear model, that is,briefly examined in Section 4, survival models, and semiparametric addi-tive models, deserve further investigations. Finally, our proposal deals withthe cross sectional data with i.i.d. observations. It is interesting to extendthe proposed influence measure to complex correlated data such as longi-tudinal data where dependence among observations needs to be taken intoconsideration in influence diagnosis [42].APPENDIXWe outline the main idea of the proof for the asymptotic distribution of D k in Theorem 1. First, we decompose D k as D k = B + B + B − B as given in Section 2.3. Then we compute the mean and variance of B i , ZHAO, LENG, LI AND WANG i = 1 , . . . , B i , wefind that B is the leading term. We then study the asymptotic distributionof B , which turns out to follow a χ (1) distribution. Recall in Section 2.3,we defined K p,tl = p − P pj =1 X tj X lj , for t, l = 1 , . . . , n , l p = P pj =1 λ j = O ( p r )and c p = max ≤ j ≤ p λ j , where λ j ’s are the eigenvalues of the covariancematrix Σ . Furthermore, we define a p = P pj =1 λ j , C = E ( Y t Y l K p,tl ) and C = E [ Y t ( P pj =1 ρ j X tj /p ) ] for any t = l . Proof of Proposition 1.
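We break the proof into three parts: first, we obtain an expansion of $D_k$; second, we derive $E(D_k)$; and finally, we derive the asymptotic behavior of the components in the expansion of $D_k$.

Step 1. First, we have the following expansion for $D_k$, $k = 1, \ldots, n$:
$$
\begin{aligned}
D_k &= \frac{1}{p} \sum_{j=1}^p \Biggl( \frac{1}{n-1} \sum_{t=1, t \ne k}^n Y_t X_{tj} - \frac{1}{n} \sum_{t=1}^n Y_t X_{tj} \Biggr)^2 \\
&= \frac{1}{p} \sum_{j=1}^p \Biggl\{ \frac{1}{n(n-1)} \sum_{t=1, t \ne k}^n Y_t X_{tj} - \frac{1}{n} Y_k X_{kj} \Biggr\}^2 \\
&= \frac{1}{p} \sum_{j=1}^p \Biggl\{ \frac{1}{n(n-1)} \sum_{t=1, t \ne k}^n Y_t X_{tj} \Biggr\}^2 + \frac{1}{p n^2} Y_k^2 \sum_{j=1}^p X_{kj}^2 - \frac{2}{p n^2 (n-1)} \sum_{t=1, t \ne k}^n Y_t Y_k \Biggl( \sum_{j=1}^p X_{tj} X_{kj} \Biggr) \\
&= \frac{1}{p \{n(n-1)\}^2} \sum_{t=1, t \ne k}^n Y_t^2 \Biggl( \sum_{j=1}^p X_{tj}^2 \Biggr) + \frac{1}{p n^2} Y_k^2 \Biggl( \sum_{j=1}^p X_{kj}^2 \Biggr) \\
&\quad + \frac{1}{p \{n(n-1)\}^2} \sum_{t \ne s;\, t, s \ne k} Y_t Y_s \Biggl( \sum_{j=1}^p X_{tj} X_{sj} \Biggr) - \frac{2}{p n^2 (n-1)} \sum_{t=1, t \ne k}^n Y_k Y_t \Biggl( \sum_{j=1}^p X_{tj} X_{kj} \Biggr) \\
&= \frac{1}{\{n(n-1)\}^2} \sum_{t=1}^n Y_t^2 K_{p,tt} + \Biggl[ \frac{1}{n^2} - \frac{1}{\{n(n-1)\}^2} \Biggr] Y_k^2 K_{p,kk} + \frac{1}{\{n(n-1)\}^2} \sum_{t \ne s} Y_t Y_s K_{p,ts} \\
&\quad - \Biggl[ \frac{2}{\{n(n-1)\}^2} + \frac{2}{n^2 (n-1)} \Biggr] \sum_{t=1, t \ne k}^n Y_k Y_t K_{p,tk} \\
&= \frac{1}{\{n(n-1)\}^2} \sum_{t=1}^n Y_t^2 K_{p,tt} + \frac{n-2}{n(n-1)^2} Y_k^2 K_{p,kk} + \frac{1}{\{n(n-1)\}^2} \sum_{t \ne s} Y_t Y_s K_{p,ts} - \frac{2}{n(n-1)^2} \sum_{t=1, t \ne k}^n Y_k Y_t K_{p,tk} \\
&:= B_1 + B_2 + B_3 - B_4.
\end{aligned}
$$
Step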
2. Next, we derive the expectation of D k . It is easy to see that E ( B ) = 1 pn ( n − p X j =1 E ( Y k X kj ) ,E ( B ) = n − pn ( n − p X j =1 E ( Y k X kj ) ,E ( B ) = 1 pn ( n − p X j =1 ρ j , E ( B ) = 1 pn ( n − p X j =1 ρ j . Therefore, we have E ( D k ) = E ( B + B + B − B ) = 1 pn ( n − p X j =1 var( Y k X kj ) . By Lemmas 1 and 3, we have E { Y k ( K p,kk − E ( K p,kk )) } ≤ E / ( Y k ) E / [ { K p,kk − E ( K p,kk ) } ]= O ( p − l / p )and p − p X j =1 E ( Y k X kj ) = p − p X j =1 ρ j = O ( p − c p ) . In addition, noting that c p ≤ l p , we have p − p X j =1 var( Y k X kj )= p − p X j =1 { E ( Y k X kj ) − E ( Y k X kj ) } ZHAO, LENG, LI AND WANG = E ( Y k ) E ( K p,kk ) + E { Y k ( K p,kk − E ( K p,kk )) } − p − p X j =1 ρ j = E ( Y k ) E ( K p,kk ) + O ( p − l / p ) . Consequently, we have E ( D k ) = 1 { pn ( n − } E ( Y k ) p X j =1 E ( X kj ) + O ( n − p − l / p ) . Step
3. Next, we derive the asymptotic behavior of B i , i = 1 , . . . , Step B . Note thatvar( B ) = nn ( n − var( Y t K p,tt )and that E ( Y t K p,tt ) ≤ E / ( Y t ) E / ( K p,tt ). Furthermore, E ( K p,tt ) = p − E ( Z ⊤ t ΣZ t ) = p − E " p X j =1 λ j ( Z ⊤ t u j ) ≤ p − E [( Z ⊤ t u j ) ] p X j =1 λ j ! ≤ E ( Z ⊤ t u j ) . The last equation holds because tr(Σ) = P pj =1 λ j = p . As a result, we havevar( B ) = O ( n − ) . Step B . By Lemma 3, we have E ( B ) = 1 pn ( n − p X j =1 ρ j = O ( c p · p − n − ) . In addition, it is easy to see E ( B ) = 1 n ( n − ( n X t =1 ,t = k E ( Y k Y t K p,tk ) + t,s = k X t = s E ( Y k Y t Y s K p,tk K p,sk ) ) = 1 n ( n − " ( n − E ( Y k Y t K p,tk )+ ( n − n − E ( Y k p X j =1 ρ j X kj /p ! ) IGH-DIMENSIONAL INFLUENCE MEASURE = 1 n ( n − E ( Y k Y t K p,tk ) + ( n − n ( n − E ( Y k p X j =1 ρ j X kj /p ! ) = 1 n ( n − C + ( n − n ( n − C , where C , C are defined as in (1.1) and (1.2) in the supplementary material[41], respectively. From the proof of Lemma 2, we know C = O ( c p p − ) and C = C + C , with C = O ( p − a / p ) and C = l p p − E ( Y t ). Therefore,we have E ( B ) = l p p n ( n − E ( Y t ) + O ( n − p − a / p ) + O ( c p p − n − )= O ( l p p − n − ) + O ( c p p − n − ) . Consequently, we havevar( B ) = E ( B ) − E ( B ) = O ( l p p − n − ) + O ( c p p − n − ) . Step B ). First, we have B = 1 p { n ( n − } X t = s Y t Y s p X j =1 X sj X tj ! = 1 pn ( n − X t = s φ ( Y t , Y s , X s , X t ) / { ( n − n } := 1 pn ( n −
1) ¯ B , where φ ( Y t , Y s , X s , X t ) = P t = s Y t Y s ( P pj =1 X sj X tj ). Let φ ( Y t , X t ) = E { φ ( Y t , Y s , X s , X t ) − E ( φ ( Y t , Y s , X s , X t ) | Y t , X t } = Y t p X j =1 X tj ρ j − p X j =1 ρ j . Noting that ¯ B is an U -statistic, and by the properties of the U -statistic,we havevar( B ) = 1 { pn ( n − } var( ¯ B )= 1 { pn ( n − } (cid:20) n var { φ ( Y t , X t ) } + o ( n − ) (cid:21) (A.1) = 4 p n ( n − var ( Y t p X j =1 X tj ρ j ) + o ( p − n − ( n − − ) . ZHAO, LENG, LI AND WANG
Furthermore, we have p − var Y t p X j =1 X tj ρ j ! = E ( Y t p X j =1 X tj ρ j /p ! ) − p X j =1 ρ j /p ! (A.2) ≤ E / ( Y t ) E / ( p X j =1 X tj ρ j /p ! ) − p X j =1 ρ j /p ! . In addition, we show in the proof of Lemma 2 that E ( p X j =1 X tj ρ j /p ! ) = O ( c p p − )(A.3)and in Lemma 3 that p X j =1 ρ j /p = O ( c p · p − ) . (A.4)Combining (A.1)–(A.4), we havevar( B ) = O ( c p n − p − ) + O ( p − n − ) . Step B , which can be written as B = ( n − n ( n − Y k K p,kk = ( n − n ( n − [ Y k { K p,kk − E ( K p,kk ) } + Y k E ( K p,kk )]:= B + B . By Lemma 1, we have E ( Y k { K p,kk − E ( K p,kk ) } ) ≤ E / ( Y k ) E / ( { K p,kk − E ( K p,kk ) } )= O ( p − l p ) ,E ( Y k { K p,kk − E ( K p,kk ) } ) ≤ E / ( Y k ) E / ( { K p,kk − E ( K p,kk ) } )= O ( l / p p − ) . Therefore, we havevar( B ) = (cid:26) ( n − n ( n − (cid:27) var( Y k [ K p,kk − E ( K p,kk )]) = O ( n − p − l p ) , var( B ) = O ( n − ) . This completes the proof. (cid:3)
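The algebraic expansion of $D_k$ in Step 1 can be checked numerically. The sketch below compares the direct leave-one-out definition with the four-term decomposition, labelling the terms B1 to B4 in the order in which they appear in the expansion (an assumed labelling for this illustration). The dimensions, the seed and the variable names are arbitrary choices, and no standardization is applied because the identity is purely algebraic.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 8, 5, 2                         # small arbitrary sizes; k is the deleted index
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)

# Direct definition: mean over j of squared change in the marginal quantity.
prod = X * Y[:, None]                     # entries Y_t * X_tj
rho_full = prod.mean(axis=0)
rho_loo = np.delete(prod, k, axis=0).mean(axis=0)
D_direct = np.mean((rho_loo - rho_full) ** 2)

# Decomposition in terms of K[t, l] = p^{-1} * sum_j X_tj * X_lj.
K = X @ X.T / p
off = np.outer(Y, Y) * K                  # entries Y_t * Y_s * K_ts
c = 1.0 / (n * (n - 1)) ** 2
B1 = c * np.trace(off)                    # sum over t of Y_t^2 K_tt
B2 = (n - 2) / (n * (n - 1) ** 2) * off[k, k]
B3 = c * (off.sum() - np.trace(off))      # sum over all pairs t != s
B4 = 2.0 / (n * (n - 1) ** 2) * (off[k].sum() - off[k, k])
D_decomp = B1 + B2 + B3 - B4

print(D_direct, D_decomp)
```

Agreement of the two quantities up to floating-point error confirms the coefficients $(n-2)/\{n(n-1)^2\}$ and $2/\{n(n-1)^2\}$ obtained after collecting terms.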
Proof of Theorem 1.
Consider the behavior of K p,kk , k = 1 , . . . , n ,for a sufficient large pK p,kk = p X j =1 X kj /p = X ⊤ k X k /p = Z ⊤ k ΣZ k = p X j =1 λ j ( Z ⊤ k u j ) /p. Its variance is var( K p,kk ) = 2 P pj =1 λ j /p = 2 p − l p . Under the assumption l p = O ( p r ) with 0 ≤ r <
2, we have K p,kk = E ( K p,kk ) + O p ( p r/ − ), and con-sequently, Y k K p,kk = Y k [ E ( K p,kk ) + O p ( p r/ − )] . In addition, noting that E [ Y k ( K p,kk − E ( K p,kk ))] ≤ E / ( Y k )(var( K p,kk )) / = O ( p r/ − ), we have E ( Y k K p,kk ) = E ( Y k ) E ( K p,kk ) + E [ Y k ( K p,kk − E ( K p,kk ))]= E ( Y k ) E ( K p,kk ) + O ( p r/ − ) . Therefore, we have Y k K p,kk − E ( Y k K p,kk ) = [ Y k − E ( Y k )] E ( K p,kk ) + O p ( p r/ − ) . As a result, it holds that B − E ( B ) = n − n ( n − { [ Y k − E ( Y k )] E ( K p,kk ) + O p ( p r/ − ) } . (A.5)Note that c p ≤ l p = O ( p r ) under (C.2). Combined with Proposition 1, wehave B − E ( B ) = O p ( n − / ) ,B − E ( B ) = O p ( n − / p r/ − ) ,B − E ( B ) = O p ( p r/ − n − ) . Consequently, we have n ( n − ( n − (cid:26) X i =1 , [ B i − E ( B i )] − B − E ( B )) (cid:27) (A.6) = O p ( n − / ) + O p ( p r/ − ) . Furthermore, by the results on E ( D k ) in step 2 of the proof of Proposition 1,we have E ( D k ) = 1 n ( n − E ( Y k ) E ( K p,kk ) + O ( n − p − l / p ) . ZHAO, LENG, LI AND WANG
Consequently, by l p = O ( p r ), we have n ( n − ( n − E ( D k ) = n − n − E ( Y k ) E ( K p,kk ) + O ( p r/ − ) . (A.7)Combining (A.5)–(A.7), we have n ( n − ( n − D k = n ( n − ( n − E ( D k ) + n ( n − ( n −
2) [ D k − E ( D k )]= n ( n − ( n − E ( D k ) + n ( n − ( n − (cid:18) X i =1 , , [ B i − E ( B i )] − B − E ( B )) (cid:19) = n − n − E ( Y k ) E ( K p,kk ) + { Y k − E ( Y k ) } E ( K p,kk )+ O p ( p r/ − ) + O p ( n − / )= Y k E ( K p,kk ) + 1 n − E ( Y k ) E ( K p,kk ) + O p ( p r/ − ) + O p ( n − / )= Y k + o p (1) , where the last equation is from the fact that E ( X kj ) = 1 , j = 1 , . . . , p and E ( Y k ) = 1. Since Y ∼ N (0 , n ( n − ( n − D k ∼ χ (1); that is, n D k ∼ χ (1). (cid:3) Acknowledgements.
We thank Professor Peter Bühlmann, Professor Peter Hall, an Associate Editor and three referees for their constructive comments.

SUPPLEMENTARY MATERIAL

Further proofs (DOI: 10.1214/13-AOS1165SUPP; .pdf). The supplementary file contains the proofs of four additional lemmas, Proposition 2 and Theorem 2.

REFERENCES

[1]
Anderson, E. B. (1992). Diagnostics in categorical data analysis.
J. R. Stat. Soc.Ser. B Stat. Methodol. Banerjee, M. (1998). Cook’s distance in linear longitudinal models.
Comm. Statist.Theory Methods Banerjee, M. and
Frees, E. W. (1997). Influence diagnostics for linear longitudinalmodels.
J. Amer. Statist. Assoc. [4] Belloni, A. and
Chernozhukov, V. (2011). ℓ -penalized quantile regression inhigh-dimensional sparse models. Ann. Statist. Benjamini, Y. and
Hochberg, Y. (1995). Controlling the false discovery rate:A practical and powerful approach to multiple testing.
J. R. Stat. Soc. Ser.B Stat. Methodol. Benjamini, Y. and
Hochberg, Y. (2000). On the adaptive control of the falsediscovery rate in multiple testing with independent statistics.
J. Educ. Behav.Stat. Bickel, P. J. and
Levina, E. (2008). Covariance regularization by thresholding.
Ann. Statist. Candes, E. and
Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n . Ann. Statist. Chatterjee, S. and
Hadi, A. S. (1988).
Sensitivity Analysis in Linear Regression .Wiley, New York. MR0939610[10]
Chiang, A. P. , Beck, J. S. , Yen, H. J. , Tayeh, M. K. , Scheetz, T. E. , Swiderski, R. , Nishimura, D. , Braun, T. A. , Kim, K. Y. , Huang, J. , Elbedour, K. , Carmi, R. , Slusarski, D. C. , Casavant, T. L. , Stone, E. M. and
Sheffield, V. C. (2006). Homozygosity mapping with SNP arrays iden-tifies a novel gene for Bardet–Biedl syndrome (BBS11).
Proc. Natl. Acad. Sci.USA
Christensen, R. , Pearson, L. M. and
Johnson, W. (1992). Case-deletion diag-nostics for mixed models.
Technometrics Cook, R. D. (1977). Detection of influential observation in linear regression.
Tech-nometrics Cook, R. D. (1979). Influential observations in linear regression.
J. Amer. Statist.Assoc. Cook, R. D. and
Weisberg, S. (1982).
Residuals and Influence in Regression . Chap-man & Hall, London. MR0675263[15]
Critchley, F. , Atkinson, R. A. , Lu, G. and
Biazi, E. (2001). Influence analysisbased on the case sensitivity function.
J. R. Stat. Soc. Ser. B Stat. Methodol. Davison, A. C. and
Tsai, C. L. (1992). Regression model diagnostics.
Int. Stat. Rev.
[17] Donoho, D. L. (2000). High-dimensional data analysis: The curses and blessings of dimensionality. Technical report, Stanford Univ.
[18] Draper, N. R. and Smith, H. (1998). Applied Regression Analysis, 3rd ed. Wiley, New York. MR1614335
[19] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Ann. Statist.
[20] Efron, B., Tibshirani, R., Storey, J. D. and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. J. Amer. Statist. Assoc.
[21] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc.
[22] Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B Stat. Methodol.
[23] Fan, J. and Song, R. (2010). Sure independence screening in generalized linear models with NP-dimensionality. Ann. Statist.
[24] Fu, W. J. (1998). Penalized regressions: The bridge versus the Lasso. J. Comput. Graph. Statist.
[25] Fung, W.-K., Zhu, Z.-Y., Wei, B.-C. and He, X. (2002). Influence diagnostics and outlier tests for semiparametric mixed models. J. R. Stat. Soc. Ser. B Stat. Methodol.
[26] Huang, J., Horowitz, J. L. and Ma, S. (2008). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann. Statist.
[27] Huang, J., Ma, S. and Zhang, C.-H. (2008). Adaptive Lasso for sparse high-dimensional regression models. Statist. Sinica.
[28] Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist.
[29] Pan, J.-X. and Fang, K.-T. (2002). Growth Curve Models and Statistical Diagnostics. Springer, New York. MR1937691
[30] Preisser, J. S. and Qaqish, B. F. (1996). Deletion diagnostics for generalised estimating equations. Biometrika.
[31] Scheetz, T., Kim, K., Swiderski, R., Philp, A., Braun, T., Knudtson, K., Dorrance, A., DiBona, G., Huang, J., Casavant, T. et al. (2006). Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proc. Natl. Acad. Sci. USA.
[32] Storey, J. D. (2002). A direct approach to false discovery rates. J. R. Stat. Soc. Ser. B Stat. Methodol.
[33] Thomas, W. and Cook, R. D. (1989). Assessing influence on regression coefficients in generalized linear models. Biometrika.
[34] Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B Stat. Methodol.
[35] Wang, H. (2009). Forward regression for ultra-high dimensional variable screening. J. Amer. Statist. Assoc.
[36] Wang, H. and Leng, C. (2007). Unified Lasso estimation via least squares approximation. J. Amer. Statist. Assoc.
[37] Wang, H., Li, G. and Jiang, G. (2007). Robust regression shrinkage and consistent variable selection through the LAD–Lasso. J. Bus. Econom. Statist.
[38] Williams, D. A. (1987). Generalized linear model diagnostics using the deviance and single case deletions. J. R. Stat. Soc. Ser. C. Appl. Stat.
[39] Xiang, L., Tse, S.-K. and Lee, A. H. (2002). Influence diagnostics for generalized linear mixed models: Applications to clustered data. Comput. Statist. Data Anal.
[40] Zhang, H. H. and Lu, W. (2007). Adaptive Lasso for Cox’s proportional hazards model. Biometrika.
[41] Zhao, J., Leng, C., Li, L. and Wang, H. (2013). Supplement to “High-dimensional influence measure.” DOI:10.1214/13-AOS1165SUPP.
[42] Zhu, H., Ibrahim, J. G. and Cho, H. (2012). Perturbation and scaled Cook’s distance. Ann. Statist.
[43] Zhu, H., Ibrahim, J. G., Lee, S. and Zhang, H. (2007). Perturbation selection and influence measures in local influence analysis. Ann. Statist.
[44] Zhu, H., Lee, S.-Y., Wei, B.-C. and Zhou, J. (2001). Case-deletion measures for models with incomplete data. Biometrika.
[45] Zou, H. (2006). The adaptive LASSO and its oracle properties. J. Amer. Statist. Assoc.

J. Zhao
School of Mathematics and System Science
Beihang University
Beijing
China
E-mail: [email protected]

C. Leng
Department of Statistics
University of Warwick
Coventry
United Kingdom
and
Department of Statistics and Applied Probability
National University of Singapore
E-mail:

L. Li
Department of Statistics
North Carolina State University
Raleigh, North Carolina
USA
E-mail: [email protected]