High-dimensional influence measure

Abstract

Influence diagnosis is important since presence of influential observations could lead to distorted analysis and misleading interpretations. For high-dimensional data, it is particularly so, as the increased dimensionality and complexity may amplify both the chance of an observation being influential, and its potential impact on the analysis. In this article, we propose a novel high-dimensional influence measure for regressions with the number of predictors far exceeding the sample size. Our proposal can be viewed as a high-dimensional counterpart to the classical Cook's distance. However, whereas the Cook's distance quantifies the individual observation's influence on the least squares regression coefficient estimate, our new diagnosis measure captures the influence on the marginal correlations, which in turn exerts serious influence on downstream analysis including coefficient estimation, variable selection and screening. Moreover, we establish the asymptotic distribution of the proposed influence measure by letting the predictor dimension go to infinity. Availability of this asymptotic distribution leads to a principled rule to determine the critical value for influential observation detection. Both simulations and real data analysis demonstrate usefulness of the new influence diagnosis measure.


The Annals of Statistics

© Institute of Mathematical Statistics, 2013

HIGH-DIMENSIONAL INFLUENCE MEASURE

By Junlong Zhao, Chenlei Leng, Lexin Li and Hansheng Wang

Beihang University, University of Warwick and National University of Singapore, North Carolina State University and Peking University

Received July 2013.

Supported by National Natural Science Foundation of China (11101022) and Ministry of Education Humanities and Social Science Foundation Youth project (10YJC910013).

Supported by National University of Singapore research grants.

Supported by NSF Grant DMS-11-06668.

Supported by National Natural Science Foundation of China (11131002, 11271032), Fox Ying Tong Education Foundation, the Business Intelligence Research Center of Peking University and the Center for Statistical Science of Peking University.

AMS 2000 subject classifications. Primary 62J20; secondary 62E20.

Key words and phrases. Cook's distance, high-dimensional diagnosis, influential observation, LASSO, marginal correlations, variable screening.

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2013, Vol. 41, No. 5, 2639–2667. This reprint differs from the original in pagination and typographic detail.

ZHAO, LENG, LI AND WANG

1. Introduction.

An observation is flagged influential if some important features of the analysis are substantially altered after this observation is removed [13]. The presence of influential observations can lead to distorted analysis and misleading results [18], and it is therefore important to be alert to influential observations and to take them into consideration when interpreting the results. In the classical normal linear model setup, the regression coefficient estimate was chosen, naturally, as the feature whose substantial change defines influential observations. Toward that end, [12] proposed a difference measure between the OLS estimate on the full data and that on the subset of data without the observation in question. This measure, which was later referred to in the statistical literature as the Cook's distance, quantifies the contribution, or influence, of an individual observation on the regression coefficient estimate. Consequently an observation with a large Cook's distance is deemed influential. Since its introduction, the Cook's distance has been routinely employed in regression analysis, due to its clear interpretation from the case deletion point of view, and its easy computation without having to re-estimate the model for each removed observation. The topic is covered in most standard regression textbooks, and it is implemented in popular statistical software such as R and SAS.

The problem of influence diagnosis has since attracted considerable attention and been systematically investigated for various models and analyses. Examples include linear regression models [9, 12, 14], categorical data analyses [1], generalized linear models [16, 33, 38], generalized estimating equations [30], linear mixed models [2, 3, 11], generalized linear mixed models [39], semiparametric mixed models [25], growth curve models [29], incomplete data analysis [44], perturbation theory [15, 42, 43], among others. For an excellent review on the latest developments in the field of influence diagnosis, we refer to [42].

Thanks to the aforementioned works, substantial insights have been gained on influence diagnosis. However, it is important to note that all existing diagnosis approaches have been developed under the assumption that the number of predictors in the regression is fixed. As such, none is immediately applicable to high-dimensional regression analysis, where the number of predictors $p$ far exceeds the sample size $n$. On the other hand, data of unprecedented size and dimensionality now prevail in both science and business, calling for the development of high-dimensional influence diagnosis. Detection of influential observations in high-dimensional data analysis is, in our opinion, equally or to some extent even more important than in a classical setup. This is partly because the increased dimensionality and complexity of the data may amplify both the chance of an observation being influential and its potential impact on the analysis. Moreover, the peculiar data observations themselves may be of practical importance in addition to data modeling. The diagnosis task, nevertheless, is more challenging in high-dimensional data analysis, and is far from a direct extension of existing diagnosis approaches. To the best of our knowledge, influence diagnosis in a high-dimensional setting has received little attention despite its evident importance.

The first challenge is the definition of an influential observation. In other words, which feature of the analysis should one choose such that its substantial alteration defines an influential observation? In the classical setup, an observation is deemed influential if it incurs a serious change in the regression coefficient estimate. In high-dimensional regression where $p > n$, the ordinary least squares estimator is no longer uniquely defined because the Gram matrix is not invertible. On the other hand, we recognize that variable selection and variable screening are of particular importance in high-dimensional regression analysis. There has been a vast literature on variable selection in recent years, including the LASSO [34], the adaptive LASSO [36, 40, 45], the SCAD [21], the bridge estimator [24, 26], the LARS algorithm [19], the Dantzig selector [8], the sure independence screening (SIS) rule [22], the forward regression (FR) [35], among many others. Underlying all those selection methods, one statistic plays a critical role, namely the marginal covariance, or equivalently, the marginal correlation between the response and the individual covariates. To clarify, we note that SIS is directly defined based on this statistic, whereas the first step of the forward regression hinges on the estimated marginal covariance too. Moreover, the sample marginal covariance, together with the Gram matrix, is an important input for the well celebrated LARS algorithm, as well as the LASSO, the adaptive LASSO and the Dantzig selector.

Motivated by this vital observation, we choose the marginal correlation as the feature that defines an influential observation. We propose a new influence diagnosis measure, which continues to utilize the leave-one-out idea of the classical Cook's distance, but is based on the combined marginal correlations between the response and all predictors. The new measure is applicable to the high-dimensional setting where $p > n$, and is very fast and easy to compute. Unlike the classical Cook's distance, which quantifies the individual observation's influence on the least squares coefficient estimate, the new measure captures the influence on the marginal correlation, which in turn exerts serious impact on variable selection and other downstream analysis. The choice of the marginal correlation as the defining feature of our influence diagnosis does not imply that the marginal correlation is our ultimate goal of interest. Instead, it reflects influence on important analysis features including parameter estimation, variable selection and screening. This definition of an influential observation in a high-dimensional setting can be viewed as our first contribution.

Our second contribution is that the explicit asymptotic distribution of the proposed influence measure is derived. Availability of this asymptotic theory offers principled guidance for determining the critical value of the influence measure. Subsequently, we propose a false discovery rate based procedure for that purpose [5, 6]. We remark that, in the classical setup where $p$ is fixed, a standard Taylor expansion type analysis [12] revealed that the major variability of the classical Cook's distance is due to the observation under investigation, whose sample size is only one. This rules out the possibility of establishing a standard asymptotic theory for the classical Cook's distance. To determine an appropriate threshold value for the classical Cook's distance, its distribution can be obtained by bootstrap if the true model is a parametric linear model. However, such a bootstrap procedure requires a parametric model assumption and can be computationally expensive, especially for high-dimensional data. By contrast, the asymptotic distribution of the proposed influence measure is attainable in our setup, since the predictor dimension goes to infinity along with the sample size, and the threshold is easy to obtain.

When facing high-dimensional data diagnosis, an intuitive solution is to continue using the classical Cook's distance but to replace the OLS coefficient estimate with a regularized estimate, for instance, a LASSO estimate. This modified Cook's distance approach could be particularly useful when data perturbation concentrates on the nonzero coefficients, as it avoids unnecessary variability caused by irrelevant covariates. However, it also has several limitations. First, this solution interweaves influence diagnosis with variable selection, which can be flawed if the influence is reflected on variable selection itself. For instance, an influential observation may substantially alter the chosen tuning parameter of the LASSO, resulting in a totally different regularized coefficient estimate, which in turn affects the modified Cook's distance. Second, the tuning parameter of the LASSO, in principle, should be updated for every reduced data set, and this re-estimation requirement can be very expensive computationally, especially when the regression dimension $p$ is large. Third, the asymptotic properties of the modified Cook's distance seem intractable analytically, which makes the thresholding of influential data difficult, whereas a bootstrap alternative to choose the thresholding value is again computationally expensive. Moreover, while there exist many competing variable selection methods, it is unclear which selection method is the best choice in the context of influence diagnosis. By contrast, our influence measure is not constrained by any particular variable selection method, and this flexibility could benefit downstream analysis. In Section 3, we carry out an intensive numerical study to compare this modified Cook's distance with our proposal, and this detailed comparison can be viewed as the third contribution of this article.

Fig. 1. Effect of influential points on parameter estimation (a), variable selection (b) and variable screening (c), as the perturbation parameter $\kappa$ varies. "Before HIM" denotes the analysis on the full data, and "After HIM" denotes the analysis on the reduced data after removing the influential observations flagged by our proposed high-dimensional measure (HIM).

Before we proceed, we quickly show a simulated example to illustrate two points: first, how various aspects of a high-dimensional regression analysis, including regression coefficient estimation, variable selection and variable screening, can be seriously affected by influential observations, and second, how our proposed measure can help limit such influence. The data were generated from a linear model with $p = 1000$ predictors and $n = 100$ observations, among which 10 observations were influential. The magnitude of the influence was dictated by a scalar $\kappa$, with a larger value indicating a larger influence. More details can be found in the setup of model 1 in Section 3. Evaluations include error in coefficient estimation, error in variable selection after applying the LASSO [34], and error in variable screening after applying the SIS [22]. The results are averaged over 200 simulation replicates, and are reported in Figure 1. It is clearly seen from the plot that influential observations can have drastic effects on various features of high-dimensional data analysis. Meanwhile, our marginal correlation based diagnosis could greatly help control the adverse effects after detecting and removing those influential data points.

The rest of the article is organized as follows. Section 2 begins with a review of the classical Cook's distance, then presents our new high-dimensional influence measure, along with a comparison with the Cook's distance, the asymptotic properties and a power study. Section 3 includes an intensive simulation study and a microarray data analysis. Section 4 presents a generalization of our proposal from the normal linear model to the generalized linear model. Section 5 concludes the paper with a discussion. All technical proofs are given in the Appendix and the supplementary material [41].

2. High-dimensional inﬂuence measure.

2.1. Linear models and classical Cook's distance.

In this article, we focus on influence diagnosis in the context of the classical linear regression model. Meanwhile, we note that the proposed idea can be readily extended to a much broader class of regression models, and we will discuss one such extension in Section 4. Consider the following model:

$$Y_i = \beta_0 + X_i^\top \beta_1 + \varepsilon_i, \tag{2.1}$$

where the pair $(Y_i, X_i)$, $1 \le i \le n$, denotes the observation of the $i$th subject, $Y_i \in \mathbb{R}$ is the response variable, $X_i = (X_{i1}, \ldots, X_{ip})^\top \in \mathbb{R}^p$ is the associated $p$-dimensional predictor vector, and $\varepsilon_i \in \mathbb{R}$ is a mean zero normally distributed random noise. Let $\beta = (\beta_0, \beta_1^\top)^\top$ denote the coefficient vector. Under the classical setup of $n > p$, the OLS estimate of $\beta$ is obtained by minimizing the objective function $\sum_{i=1}^n (Y_i - \beta_0 - X_i^\top \beta_1)^2$, and the solution is $\hat\beta = (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top \mathbb{Y}$, where $\mathbb{Y} = (Y_1, \ldots, Y_n)^\top$ denotes the $n \times 1$ response vector and $\mathbb{X}$ denotes the $n \times (p+1)$ design matrix with the $i$th row being the $(p+1)$-dimensional vector $(1, X_i^\top)$, $i = 1, \ldots, n$.

To quantify the influence of the $k$th observation on the regression, $1 \le k \le n$, [12] employed the leave-one-out idea by studying the OLS estimate of $\beta$ while the $k$th observation is excluded from estimation. That is, one minimizes the modified objective function $\sum_{i=1, i \ne k}^n (Y_i - \beta_0 - X_i^\top \beta_1)^2$. The new estimate is of the form $\hat\beta_{(k)} = (\mathbb{X}_{(k)}^\top \mathbb{X}_{(k)})^{-1} \mathbb{X}_{(k)}^\top \mathbb{Y}_{(k)}$, where $\mathbb{Y}_{(k)}$ is the $(n-1) \times 1$ response vector with $Y_k$ removed, and $\mathbb{X}_{(k)}$ is the $(n-1) \times (p+1)$ design matrix with the $k$th row $\mathbb{X}_k$ removed. Cook [12] naturally chose the estimate of $\beta$ to define influence, and intuitively, if an observation is influential, the difference between $\hat\beta$ and $\hat\beta_{(k)}$ is expected to be large. This leads to the following discrepancy measure, that is, the Cook's distance:

$$D_k = \frac{\{\hat\beta_{(k)} - \hat\beta\}^\top \mathbb{X}^\top \mathbb{X} \{\hat\beta_{(k)} - \hat\beta\}}{(p+1)\hat\sigma^2}, \tag{2.2}$$

where $\hat\sigma^2 = (n - p - 1)^{-1} \sum_{i=1}^n (Y_i - \hat\beta_0 - X_i^\top \hat\beta_1)^2$.

In the high-dimensional regression setting, the classical Cook's distance (2.2) encounters some difficulties. When $p$ is close to $n$, the OLS estimate is known to be unstable, which would in turn cause $D_k$ to be unstable. When $p > n$, the classical Cook's distance is not directly computable, because $\mathbb{X}^\top \mathbb{X}$ is singular and the OLS estimator $\hat\beta$ is no longer uniquely defined. For those reasons, the regression coefficient estimate may no longer be the best choice to define influence in high-dimensional analysis. This motivates us to consider an alternative influence measure for high-dimensional data.

2.2. High-dimensional influence measure.

In high-dimensional regression analysis where $p \approx n$ or $p > n$, variable selection (screening) plays a central role, whereas marginal covariance or correlation is crucial to the majority of variable selection approaches. Motivated by this observation, for high-dimensional data influence diagnosis, we choose the marginal correlation, instead of the regression coefficient, as the feature that defines influence. An individual observation's influence on the marginal correlation is then transmitted to various features of downstream analysis, such as variable selection and coefficient estimation.

More specifically, we first define the marginal correlation as $\rho_j = E\{(X_j - \mu_{xj})(Y - \mu_y)\}/(\sigma_{xj}\sigma_y)$, where $\mu_{xj} = E(X_j)$, $\mu_y = E(Y)$, $\sigma_{xj}^2 = \mathrm{var}(X_j)$ and $\sigma_y^2 = \mathrm{var}(Y)$. We then obtain the sample estimate, $\hat\rho_j = \{\sum_{i=1}^n (X_{ij} - \hat\mu_{xj})(Y_i - \hat\mu_y)\}/\{n \hat\sigma_{xj} \hat\sigma_y\}$, for $j = 1, \ldots, p$, where $\hat\mu_{xj}$, $\hat\mu_y$, $\hat\sigma_{xj}$ and $\hat\sigma_y$ are the sample estimates of $\mu_{xj}$, $\mu_y$, $\sigma_{xj}$ and $\sigma_y$, respectively. Next, we continue to use the leave-one-out principle as in the classical Cook's distance case, and compute the marginal correlation with the $k$th observation removed as

$$\hat\rho_j^{(k)} = \frac{\sum_{i=1, i \ne k}^n (X_{ij} - \hat\mu_{xj}^{(k)})(Y_i - \hat\mu_y^{(k)})}{(n-1)\hat\sigma_{xj}^{(k)} \hat\sigma_y^{(k)}}, \qquad j = 1, \ldots, p,\ k = 1, \ldots, n,$$

where $\hat\mu_{xj}^{(k)}$, $\hat\mu_y^{(k)}$, $\hat\sigma_{xj}^{(k)}$ and $\hat\sigma_y^{(k)}$ are the corresponding sample estimates with the $k$th observation removed. Finally, we define the influence measure based on the marginal correlation as

$$\mathcal{D}_k = \frac{1}{p} \sum_{j=1}^p (\hat\rho_j - \hat\rho_j^{(k)})^2. \tag{2.3}$$

We refer to $\mathcal{D}_k$ as the high-dimensional influence measure, or HIM for brevity. We make a few remarks. First, we note that the marginal correlation can be easily computed regardless of the predictor dimension, and such computational advantage is practically very useful for high-dimensional data analysis. Second, the proposed influence measure is built upon the marginal correlation coefficient, and is effectively scale invariant. However, this does not imply that the marginal correlation is the ultimate feature of interest in our influence diagnosis. Instead, a substantial change in the marginal correlation caused by a data point exerts influence on important features such as variable selection and parameter estimation, as we have seen in Figure 1. As such, for an estimation method to be robust to unexpected perturbation [15, 42, 43], the sample marginal correlation should be sufficiently robust. This is an important and necessary condition, although not necessarily sufficient. Finally, use of the marginal correlation to define the influence measure does not imply that we assume a marginal model. Instead, we still assume the joint model (2.1). As it may seem unclear how a marginal measure can capture the influence for a joint model, we will demonstrate through a simple joint model later in Section 2.5 that the newly defined $\mathcal{D}_k$ can indeed identify the influential observation with probability tending to one. This use of marginal correlation is also similar in spirit to the sure independence screening procedure for a joint normal model [22], but is in a different context. Fan and Lv [22] use marginal correlation for the variable screening purpose, while we use it for influence diagnosis.

The proposed high-dimensional influence measure also shares some similarity with the classical Cook's distance. Note that the Cook's distance can be reformulated as

$$D_k = \frac{\hat\epsilon_k^2}{(p+1)\hat\sigma^2} \cdot \frac{h_{kk}}{(1 - h_{kk})^2}, \qquad k = 1, \ldots, n, \tag{2.4}$$

where $\hat\epsilon_k = \hat Y_k - Y_k$ is the residual and $h_{kk} = \mathbb{X}_k^\top (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}_k$, $k = 1, \ldots, n$, is the $k$th diagonal element of the hat matrix $\mathbb{X} (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top$. Clearly, $D_k$ is an increasing function of both $|\hat\epsilon_k|$ and $h_{kk}$. As such, an observation has a large value in Cook's distance if it has a large residual or it is a high leverage point in terms of $h_{kk}$. Our proposed influence measure shares a similar spirit.
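As an illustration only, definition (2.3) can be sketched in Python with NumPy. The function names `marginal_correlations` and `him` are hypothetical and do not come from any package. The sketch also assumes the plain Pearson correlation, whose normalizing constants cancel, so it may differ from the estimates in the text by a factor close to one. Each leave-one-out correlation is recomputed from scratch, which is simple rather than fast.

```python
import numpy as np


def marginal_correlations(X, y):
    """Pearson correlation between y and every column of X (assumed estimator)."""
    Xc = X - X.mean(axis=0)          # center each predictor
    yc = y - y.mean()                # center the response
    num = Xc.T @ yc                  # p cross-products
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return num / den


def him(X, y):
    """High-dimensional influence measure (2.3) for every observation k."""
    n, _ = X.shape
    rho_full = marginal_correlations(X, y)
    D = np.empty(n)
    for k in range(n):
        keep = np.arange(n) != k     # drop the k-th observation
        rho_k = marginal_correlations(X[keep], y[keep])
        D[k] = np.mean((rho_full - rho_k) ** 2)
    return D
```

Because correlations are unchanged by positive rescaling of each predictor and by affine changes of the response with positive slope, `him(X * s, a * y + b)` should agree with `him(X, y)` up to rounding for positive `s` and `a`, in line with the scale invariance remark above.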
In Section 2.3, we will derive a decomposition of our influence measure $\mathcal{D}_k$ under some conditions, and will show that $\mathcal{D}_k$ is mainly dominated by a term called $B_2$, which is of the form

$$B_2 = \frac{n-2}{pn(n-1)^2} \sum_{j=1}^p Y_k^2 X_{kj}^2 = \frac{n-2}{pn(n-1)^2} Y_k^2 \|X_k\|^2.$$

Consequently the $k$th data point $(X_k, Y_k)$ is more likely to be marked influential if it has a large response and a large value of $\|X_k\|$. Here $\|X_k\|$ plays a similar role as $h_{kk}$ in the classical Cook's distance, for detecting influential points induced mainly by covariates, whereas $Y_k$ plays a similar role as the residual in the Cook's distance, for detecting the influential point induced by abnormal responses.

2.3. Theoretical properties.

We next establish the asymptotic distribution of the proposed high-dimensional influence measure $\mathcal{D}_k$ as both the sample size $n$ and the dimensionality $p$ go to infinity. Toward that end, we impose the following conditions.

(C.1) For any fixed $j = 1, \ldots, p$, $\rho_j$ is constant and does not change as $p$ increases.

(C.2) For the covariance matrix $\Sigma = \mathrm{cov}(X)$, with the eigen-decomposition $\Sigma = \sum_{j=1}^p \lambda_j u_j u_j^\top$, it is assumed that $l_p = \sum_{j=1}^p \lambda_j^2 = O(p^r)$ for some $0 \le r < 2$.

(C.3) The predictor $X_i$ follows a multivariate normal distribution and the random noise $\varepsilon_i$ follows a normal distribution.

Condition (C.1) is very general, since it only requires that for any fixed $j$, $\rho_j$ is a constant independent of $p$. A sufficient condition for condition (C.2) to hold is that all eigenvalues of $\Sigma$ are finite. This condition also permits eigenvalues of $\Sigma$ to diverge to infinity but at a slower rate compared to the dimensionality. The normality assumption on $X$ is mainly for convenience, and can be relaxed, for instance, to distributions with sub-Gaussian tails, at the expense of more lengthy proofs. In addition, since the error term is assumed normal, $Y$ is normally distributed.

Next, we derive a decomposition of $\mathcal{D}_k$, which serves as a basis for its asymptotic distribution. The result is presented under the simplification that $\mu_y$ and $\mu_{xj}$ are 0 and $\sigma_{xj}$ and $\sigma_y$ are 1 for $1 \le j \le p$. This leads to the simplified estimates $\hat\rho_j = n^{-1} \sum_{1 \le i \le n} X_{ij} Y_i$ and $\hat\rho_j^{(k)} = (n-1)^{-1} \sum_{i \ne k} X_{ij} Y_i$. We note that this standardization is only for the purpose of simplifying the presentation and loses no generality. As we will show later in Proposition 2, replacing the unknown quantities $\mu_{xj}$, $\mu_y$, $\sigma_{xj}$ and $\sigma_y$ with their consistent sample estimates does not alter the asymptotic distribution of $\mathcal{D}_k$. For $t, s = 1, \ldots, n$, let $K_{p,ts} = \sum_j X_{tj} X_{sj}/p$ and $c_p = \max_{1 \le j \le p} \lambda_j$. After some algebraic computation, we obtain that

$$
\begin{aligned}
\mathcal{D}_k &= \frac{1}{p} \sum_{j=1}^p \Bigl\{ \frac{1}{n(n-1)} \sum_{1 \le t \le n,\, t \ne k} Y_t X_{tj} - \frac{1}{n} Y_k X_{kj} \Bigr\}^2 \\
&= \frac{1}{\{n(n-1)\}^2} \sum_{t=1}^n Y_t^2 K_{p,tt} + \frac{n-2}{n(n-1)^2} Y_k^2 K_{p,kk} \\
&\quad + \frac{1}{[n(n-1)]^2} \sum_{t \ne s;\, t,s \ne k} Y_t Y_s K_{p,ts} - \frac{2}{n^2(n-1)} \sum_{t=1,\, t \ne k}^n Y_k Y_t K_{p,tk} \\
&:= B_1 + B_2 + B_3 - B_4.
\end{aligned}
\tag{2.5}
$$

Then we have the following result on the expectation of $\mathcal{D}_k$ along with the variance of its decomposition in terms of the $B$'s.

Proposition 1.

Suppose that ( X i , Y i ) are i.i.d. observations and that (C.1) and (C.3) hold. Then it holds that E ( D k ) = [ n ( n − − E ( Y k ) E ( K p,kk ) + O ( n − p − l / p ) . In addition, var( B ) = O ( n − ) , var( B ) = O ( n − ) , var( B ) = O ( c p n − p − )+ O ( p − n − ) and var( B ) = O ( l p p − n − ) + O ( c p p − n − ) .
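The algebra behind decomposition (2.5) can be checked numerically. The sketch below is a hedged illustration under the standardized simplification (means 0 and standard deviations 1 treated as known), with hypothetical function names `him_standardized` and `decomposition_terms`. It reads the third term as a sum over pairs $t \ne s$ with both indices different from $k$, which is the reading under which the four terms add up exactly.

```python
import numpy as np


def him_standardized(X, y, k):
    """D_k from the simplified estimates rho_j = sum_i X_ij Y_i / n."""
    n = X.shape[0]
    total = X.T @ y                           # sum over all observations
    rho_full = total / n
    rho_loo = (total - X[k] * y[k]) / (n - 1)  # k-th observation removed
    return np.mean((rho_full - rho_loo) ** 2)


def decomposition_terms(X, y, k):
    """Return (B1, B2, B3, B4) of (2.5), so that D_k = B1 + B2 + B3 - B4."""
    n, p = X.shape
    K = X @ X.T / p                           # K[t, s] = sum_j X_tj X_sj / p
    c = 1.0 / (n * (n - 1)) ** 2
    B1 = c * np.sum(y ** 2 * np.diag(K))
    B2 = (n - 2) / (n * (n - 1) ** 2) * y[k] ** 2 * K[k, k]
    # Off-diagonal pairs (t, s) with t != s and neither index equal to k.
    mask = ~np.eye(n, dtype=bool)
    mask[k, :] = False
    mask[:, k] = False
    B3 = c * np.sum((np.outer(y, y) * K)[mask])
    others = np.arange(n) != k
    B4 = 2.0 / (n ** 2 * (n - 1)) * y[k] * np.sum(y[others] * K[others, k])
    return B1, B2, B3, B4
```

On simulated standard normal data, `him_standardized(X, y, k)` and `B1 + B2 + B3 - B4` should coincide up to floating point error for every `k`, and `B2` is typically the largest of the four terms when `p` is large.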

Now we return to the asymptotic distribution of $\mathcal{D}_k$, which Proposition 1 helps to derive. We first present the result assuming $\mu_{xj}$, $\mu_y$, $\sigma_{xj}$ and $\sigma_y$ are all known. Then we obtain the asymptotic distribution when $\mu_{xj}$, $\mu_y$, $\sigma_{xj}$ and $\sigma_y$ are replaced by their sample estimates.

Theorem 1.

Suppose that (C.1)–(C.3) hold. When there is no influential point and $\min\{n, p\} \longrightarrow \infty$, we have

$$n^2 \mathcal{D}_k \longrightarrow \chi^2(1),$$

where $\chi^2(1)$ is the chi-square distribution with one degree of freedom.

Next, we consider the asymptotic distribution of $\mathcal{D}_k$ when $\mu_{xj}$, $\mu_y$, $\sigma_{xj}$ and $\sigma_y$ are unknown. A natural choice is to replace them by their corresponding sample moment estimates $\hat\mu_y = \sum_i Y_i/n$, $\hat\mu_{xj} = \sum_i X_{ij}/n$, $\hat\sigma_{xj}^2 = \sum_i (X_{ij} - \hat\mu_{xj})^2/(n-1)$ and $\hat\sigma_y^2 = \sum_i (Y_i - \hat\mu_y)^2/(n-1)$. More generally, we consider the case where $\mu_{xj}$, $\mu_y$, $\sigma_{xj}$ and $\sigma_y$ are replaced by $\sqrt{n}$-consistent estimates under certain moment assumptions. Let $\dot Y_t = (Y_t - \mu_y)/\sigma_y$ and $\dot X_{tj} = (X_{tj} - \mu_{xj})/\sigma_{xj}$, $t = 1, \ldots, n$, $j = 1, \ldots, p$, and let $(Q_{xj}, R_{xj}) = ((\hat\mu_{xj} - \mu_{xj})/\sigma_{xj}, \sigma_{xj}/\hat\sigma_{xj})$, with $(Q_y, R_y)$ defined similarly. Furthermore, let $S_{Qx}$, $S_{Rx}$, $S_{Qy}$ and $S_{Ry}$ denote the limits superior, as $n \to \infty$, of moments of $n^{1/2} Q_x$, $n^{1/2}(R_x - 1)$, $n^{1/2} Q_y$ and $n^{1/2}(R_y - 1)$, respectively. We make the following additional assumption.

(C.4) For all $1 \le j \le p$, $(Q_{xj}, R_{xj})$ are the same symmetric function of $\{\dot X_{tj},\ t = 1, \ldots, n\}$; and $(Q_y, R_y)$ are also the same symmetric function of $\dot Y_t$ for $t = 1, \ldots, n$. We assume that $S_{Qx}$, $S_{Rx}$, $S_{Qy}$ and $S_{Ry}$ are finite.

Condition (C.4) indicates that, for all $1 \le j \le p$, $((\hat\mu_{xj} - \mu_{xj})/\sigma_{xj}, \sigma_{xj}/\hat\sigma_{xj}) = f(\dot X_{1j}, \ldots, \dot X_{nj})$, where $f(x_1, \ldots, x_n) = (f_1(x_1, \ldots, x_n), f_2(x_1, \ldots, x_n))$ and $f_1$ and $f_2$ are symmetric functions. Condition (C.4) is a mild condition. Recall that $(X_i, Y_i)$, $i = 1, \ldots, n$, are i.i.d. normal in Theorem 1. When $\hat\mu_{xj}$ and $\hat\sigma_{xj}$ are the moment estimates, we have $Q_{xj} = n^{-1} \sum_{1 \le t \le n} \dot X_{tj} \sim N(0, 1/n)$ and consequently $S_{Qx}$ is finite. Moreover, we have $R_{xj} = 1/S_{nj}$, where $S_{nj}^2$ is the sample variance of $\{\dot X_{tj},\ t = 1, \ldots, n\}$. Since $S_{nj}^2 \sim \chi^2_{n-1}/(n-1)$, $S_{Rx}$ is also finite. Similarly, $S_{Qy}$ and $S_{Ry}$ are also finite with the moment estimates $\hat\mu_y$ and $\hat\sigma_y$. Under the normality of $(X, Y)$, (C.4) also holds for some robust estimates.

Proposition 2.

Assume that $\hat\mu_{xj}$, $\hat\sigma_{xj}$, $\hat\mu_y$ and $\hat\sigma_y$ are $\sqrt{n}$-consistent and satisfy (C.4). Substituting $\mu_{xj}$, $\mu_y$, $\sigma_{xj}$ and $\sigma_y$ with their corresponding estimates in $\mathcal{D}_k$, Theorem 1 continues to hold under the same conditions.

We remark that the asymptotic distribution of the high-dimensional influence measure $\mathcal{D}_k$ is obtained as the number of predictors $p$ goes to infinity. This is different from the case of the classical Cook's distance $D_k$ where $p$ is fixed, for which a standard asymptotic distribution is not attainable. We view this as a blessing of dimensionality, in contrast to the usually conceived curse of dimensionality. For more examples of the blessing of dimensionality, see [17] and [28].

2.4. Influence diagnosis.

An important implication of Theorem 1 is that we can now obtain a $p$-value for influence diagnosis. Specifically, for the hypothesis that the $k$th observation is not influential versus its alternative, the $p$-value is $P(\chi^2(1) > n^2 \mathcal{D}_k)$. Given that the number of predictors $p$ is usually large and multiple hypotheses are tested simultaneously, we employ the false discovery rate based multiple testing procedure of [5] to determine which hypotheses should be rejected while controlling the false discovery rate. Denote by $n_{\mathrm{infl}}$ the number of influential observations among the $n$ observations, by $n_{\mathrm{tp}}$ and $n_{\mathrm{fp}}$ the numbers of observations that are correctly rejected and incorrectly rejected, respectively, and by $r$ the total number of rejections in the $n$ hypothesis tests. Then the power and the false discovery rate are defined as $\mathrm{Power} = n_{\mathrm{tp}}/n_{\mathrm{infl}}$ and $\mathrm{FDR} = n_{\mathrm{fp}}/r$, respectively. We set the FDR level to a small value, such as 0.05, and report the power and other quantities in the numerical study section. We also remark that more sophisticated alternative multiple testing procedures, for example, those in [6], [20] and [32], can be used in conjunction with our approach, but that is not the focus of this article.

2.5. A power comparison of two influence measures.

We next study the power property of both the new diagnosis measure and the Cook's distance via a simple model. This study serves two purposes. First, we can gain insight about the difference between the two diagnosis measures. Second, it offers evidence that the marginal correlation based measure is capable of detecting an influential observation in a joint model with a large probability.

More specifically, we consider the model (2.1), but drop the intercept for simplicity. The predictors $X_i$, $i = 1, \ldots, n$, are i.i.d. observations from a multivariate normal distribution $N_p(0, \Sigma)$, where $\Sigma$ is a $p \times p$ covariance matrix with all its diagonal elements $\sigma_{jj} = 1$. The error term $\varepsilon_i$ is of the structure $\varepsilon_i = e_i + c_i$, where $e_i$ follows a standard normal distribution and $c_i$ is constant, with $c_2 = \cdots = c_n = 0$. Under this setup, the first observation is an influential point as long as $c_1 \ne 0$, and we aim to establish the power of both the classical and our proposed high-dimensional influence measure in identifying this influential observation. Let $D_i$ be the Cook's distance defined in (2.2) for the $i$th observation, $\mathcal{D}_i^{(c)}$ be the proposed high-dimensional measure in (2.3), and $T_i^{(c)} = n^2 \mathcal{D}_i^{(c)}$ be the statistic defined in Theorem 1. Moreover, consider the following condition:

(C.5) All eigenvalues of $\Sigma$ are positive and bounded.

Then the next theorem states that both the classical Cook's distance and the high-dimensional influence measure have power of detecting the influential observation approaching one, under appropriate yet different conditions.

Theorem 2.

Consider the model stated above. Suppose that (C.1) and (C.5) hold. If max { n − p , | c | − n / } → , thenwe have that for the Cook’s distance D i , P ( nD − max ≤ i ≤ n nD i > M ) → for any M > , when n → ∞ . Suppose that (C.1) and (C.2) hold. If max {| c | − (log n ) / , l p p − c − n } → , then we have that for the proposed high-dimensional inﬂuence measure D ( c ) i , P ( T ( c )1 − max ≤ i ≤ n T ( c ) i > M ) → for any M > , when min( n, p ) → ∞ . The proof is given in the supplementary material [41]. Here we comparethe two sets of conditions to gain some insight about the diﬀerence of the twodiagnosis measures. First, we examine the condition max { n − p , | c | − n / } →

0, that is, required by the Cook’s distance. The condition | c | − n / → n goes to inﬁnity. Moreover, in terms of the predictor dimension p , the classical Cook’s distance is deﬁned when p < n . Consequently, thecondition n − p →

0, or equivalently, p = o ( n / ), constrains the growingrate of p with n at a much slower rate. We note that although this ratemay not be the optimal one, the condition p = o ( n ) is clearly necessaryfor the classical Cook’s distance to be feasible. Next, we examine the con-dition max {| c | − (log n ) / , l p p − c − n } →

0, that is, required by our newinﬂuence measure. For illustration, we consider a simple case with all theeigenvalues of Σ bounded and p > n . We know immediately that both l p /p and n/p are bounded. Accordingly, we should have l p p − c − n → c → ∞ . As log n → ∞ when n → ∞ , then a suﬃcient condition formax {| c | − (log n ) / , l p p − c − n } → n ) / / | c | →

0. This suggests that the influential point can be consistently detected, as long as $c_1$ diverges to infinity at a speed faster than $(\log n)^{1/2}$. This is clearly a rate much slower than $n^{1/2}$. Finally, the bounded eigenvalue condition (C.5) is commonly used in the literature for estimating covariance matrices [7]. Here it is assumed for the Cook's distance case. For the new diagnosis measure, (C.2) is required instead, which is weaker than (C.5).
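The detection rule discussed in this section can be illustrated with a small numerical sketch. It is not the authors' implementation: the function names, the identity covariance, the sparse coefficient vector and the value of the shift `c1` are all assumptions chosen for illustration. The sketch standardizes the data with the sample mean and standard deviation, does not re-standardize after deleting an observation, uses the $\chi^2(1)$ reference distribution of $n^2 D_k$ from Theorem 1, and applies the Benjamini-Hochberg step-up rule of [5] to the resulting p-values.

```python
import math
import random


def standardize(v):
    """Center and scale a list to sample mean 0 and standard deviation 1."""
    m = sum(v) / len(v)
    s = math.sqrt(sum((x - m) ** 2 for x in v) / (len(v) - 1))
    return [(x - m) / s for x in v]


def him_statistics(X, Y):
    """Return T_k = n^2 * D_k for every observation k.

    D_k is the average over predictors j of the squared change in the
    marginal correlation rho_j when observation k is deleted. For
    standardized data, rho_j^(k) - rho_j = (rho_j - Y_k * X_kj) / (n - 1).
    """
    n, p = len(X), len(X[0])
    cols = [standardize([row[j] for row in X]) for j in range(p)]
    y = standardize(Y)
    rho = [sum(y[t] * cols[j][t] for t in range(n)) / n for j in range(p)]
    stats = []
    for k in range(n):
        d_k = sum((rho[j] - y[k] * cols[j][k]) ** 2 for j in range(p))
        d_k /= p * (n - 1) ** 2
        stats.append(n ** 2 * d_k)
    return stats


def chi2_1_pvalue(t):
    """Upper tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(t / 2.0))


def benjamini_hochberg(pvals, alpha):
    """Indices rejected by the Benjamini-Hochberg step-up rule at level alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])


def simulate(n=60, p=300, c1=50.0, seed=0):
    """Toy model: only the first observation receives the shift c1."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    beta = [3.0, 1.5, 0.0, 0.0, 2.0] + [0.0] * (p - 5)  # hypothetical coefficients
    Y = [sum(b * x for b, x in zip(beta, row)) + rng.gauss(0, 1) for row in X]
    Y[0] += c1
    return X, Y


X, Y = simulate()
T = him_statistics(X, Y)
pvals = [chi2_1_pvalue(t) for t in T]
flagged = benjamini_hochberg(pvals, alpha=0.05)
```

With a shift this large relative to the noise, the first observation is expected to carry the largest statistic and to appear in `flagged`. Smaller shifts would be detected less reliably, in line with the rate condition on $c_1$.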

3. Numerical studies.

We have carried out an intensive simulation study, along with a microarray data analysis, to examine the empirical performance of our proposed high-dimensional influence measure. Since the classical Cook's distance depends on both leverage points and outliers, in our simulation study we consider three different scenarios where there exist outliers only (model 1), leverage points only (model 2), or mixed leverage points and outliers (model 3). For the scenarios with leverage points (models 2 and 3), we further consider sub-scenarios where important covariates contribute to leverage observations, or noisy covariates contribute to leverage observations. Below we present the summary of the analysis.

3.1. Simulation models.

For all simulations, we set the sample size $n = 100$ and the number of predictors $p = 1000$. We set 10% of the observations as influential, so that $\tilde n = 10$. We consider the model
$$Y_i = X_i^\top \beta + \varepsilon_i, \qquad i = 1, \ldots, n,$$
where $X_i$ is multivariate normal with $\mathrm{cov}(X_{ij}, X_{ij'}) = 0.5^{|j - j'|}$, $\varepsilon_i$ follows the standard normal distribution, and $\beta = (3, 1.5, 0, 0, 2, 0, \ldots, 0)^\top$. We simulated $n = 100$ i.i.d. observations from this model. Next, we reset the first $\tilde n = 10$ observations as coming from another model,
$$\tilde Y_i = \tilde X_i^\top \tilde\beta + \varepsilon_i, \qquad i = 1, \ldots, \tilde n,$$
where perturbations are introduced on the regression coefficient, the covariates, or their combination. In particular, we have considered three perturbation models for generating influential points.

Model 1. The perturbation was introduced on the response. That is, for $i = 1, \ldots, \tilde n$, $\tilde X_i = X_i$ and $\tilde\beta = (3, 1.5, \kappa, \kappa, 2, \kappa, \ldots, \kappa)^\top$. In other words, the influential observations are generated according to $\tilde Y_i = X_i^\top \beta + \kappa Z_i + \varepsilon_i$, where $Z_i = X_i^\top \gamma$ and $\gamma = (0, 0, 1, 1, 0, 1, \ldots, 1)^\top$. In this case, the responses of the influential observations are contaminated by a random perturbation $\kappa Z_i$. Consequently, the corresponding responses admit a different pattern, whereas the predictors of the influential observations follow the same distribution as the rest.

Model 2. The perturbation was introduced on the predictors, keeping the response uncontaminated. That is, for $i = 1, \ldots, \tilde n$, $\tilde Y_i = Y_i$ and $\tilde X_{ij} = X_{ij} + 30\kappa I\{j \in S\}$, $j = 1, \ldots, p$. In other words, a set $S$ of predictors admits a different pattern, and its magnitude is controlled by the scalar $\kappa$. We examined three choices of $S$: $S_1$, a block of indices that overlaps with the truly relevant predictors $\{1, 2, 5\}$ in $\beta$; $S_2$, a block of trailing indices ending at $p$, for which there is no overlap; and $S_3 = \{1, \ldots, p\}$, in which case all predictors are subject to potential influence.

Model 3. The perturbation was introduced on both the response and the predictors. That is, $\tilde\beta = (3, 1.5, \kappa, \kappa, 2, \kappa, \ldots, \kappa)^\top$ and $\tilde X_{ij} = X_{ij} + 30\kappa I\{j \in S\}$, $j = 1, \ldots, p$. Again, we considered the three sets $S$ described earlier.

It is clear that $\kappa$ is the parameter that dictates the magnitude of the influential points. When $\kappa = 0$, there is no influential point. We used $\kappa = 0$, 0.4, 0.8, 1.2 and 1.6 in our experiment.

3.2. Performance evaluation.

We evaluate and compare our proposed influence measure in several ways. First, we study the potential impact of influential data and how the proposed diagnosis measure could help limit such impact. Toward that end, we first applied the LASSO or SIS to the full data. Then we computed the proposed high-dimensional influence measure, evaluated the corresponding p-value, and applied the multiple testing procedure of [5], with the false discovery rate fixed at $\alpha = 5\%$. We then obtained a reduced data set by removing those flagged influential points and applied the LASSO or SIS to the reduced data set. We evaluated the impact of influential data in terms of coefficient estimation, variable selection and variable screening. For coefficient estimation, we report the error between the estimated and true $\beta$, $\mathrm{ERR} = \|\hat\beta - \beta_{\mathrm{true}}\|$; for variable selection, we report the false positive rate, FPR.

We also compare our proposal with two competing solutions. The first is a modified Cook's distance, which estimates $\beta$ under a LASSO penalty and as such avoids the difficulty of the OLS estimate when $p > n$. This seems a very natural solution. We compare it with our proposal in terms of estimation accuracy, selection accuracy and power. On the other hand, we note the lack of asymptotic theory for this modified Cook's distance. To determine the threshold for influential data, one may use the bootstrap. However, in our comparison, we simply label the $\tilde n$ observations with the largest modified Cook's distance as influential. This is not feasible in practice, but provides a useful benchmark for comparison. The other competing solution is the penalized least absolute deviation via the LASSO penalty (LAD + LASSO) [4, 37]. Because it uses the least absolute deviation as the loss function, this method is designed to handle heavy-tailed errors in linear regression, and as such is a potentially useful way to limit the impact of influential observations.

3.3. The results.

The averages over a total of 200 random replications are reported in Tables 1–3. We make the following observations.

(1) First, the presence of influential points significantly affects variable selection and screening accuracy. This can be seen by comparing the results between SIS and SIS + HIM in terms of CP. Consider, for example, Table 1. As $\kappa$ increases, the coverage probability of the SIS method deteriorates quickly from 1 with $\kappa = 0$ to 0 with $\kappa = 1.6$. This confirms that influential observations do affect variable screening consistency. Meanwhile, the performance of SIS + HIM is quite encouraging, as its CP value remains at 1 for every $\kappa$ value considered. This suggests that the proposed HIM method helps SIS by removing the influential observations.

Table 1. Simulation results for perturbation model 1. HIM denotes our proposed high-dimensional diagnosis measure, and CD denotes the classical Cook's distance. (Rows: method and criterion; columns: $\kappa = 0$, 0.4, 0.8, 1.2, 1.6.)

(2) Second, the presence of influential observations does affect estimation accuracy seriously. This can be seen clearly by comparing the results of LASSO and LASSO + HIM in terms of ERR values. For instance, the ERR values in Table 3 for LASSO with $S_1$ increase quickly from 0.446 with $\kappa = 0$ to 14.498 with $\kappa = 1.6$. This confirms that influential observations do affect the accuracy of the LASSO estimate in a negative way. However, we find that the ERR values of LASSO + HIM are always well controlled. As $\kappa$ increases, the power for HIM to detect influential observations increases. Thus, those influential observations are more likely to be detected and eliminated from the data analysis. As a result, the ERR values of LASSO + HIM eventually converge to a low level as $\kappa$ increases. This confirms the usefulness of the HIM method for LASSO estimation, even though its definition only involves marginal correlation coefficients.

Table 2. Simulation results for perturbation model 2. HIM denotes our proposed high-dimensional diagnosis measure, and CD denotes the classical Cook's distance. (Rows: subset $S$, method and criterion; columns: $\kappa = 0$, 0.4, 0.8, 1.2, 1.6.)

(3) Third, the performance of LASSO + CD is mixed. If the perturbation is due to the response only as in Table 1, it does yield much better performance than LASSO, with much smaller ERR values. This suggests that LASSO + CD can perform well to limit the effect of influential points. However, even for this example, it is still outperformed by LASSO + HIM. Moreover, the story changes if the perturbation is due to the predictors as in Table 2. This is to be expected because, with contaminated predictors, LASSO is no longer a stable method for variable selection. If predictors are selected incorrectly, the subsequent modified Cook's distance cannot be calculated appropriately. This makes the performance of LASSO + CD unsatisfactory.

Table 3. Simulation results for perturbation model 3. (Rows: subset $S$, method and criterion; columns: $\kappa = 0$, 0.4, 0.8, 1.2, 1.6.)

(4) Fourth, as a robust regression method, LAD + LASSO performs quite well. Its ERR values are smaller than those of the LASSO estimates in all the tables. However, in most cases, it is still outperformed by LASSO + HIM, as seen from Tables 1 and 3.

(5) Lastly, we find that in most cases the reported FPR values are well controlled. Furthermore, as $\kappa$ increases, the corresponding empirical power increases toward 100%. These findings are consistent with the theoretical claims in Theorems 1 and 2.

To summarize, our simulation experiments confirm that the proposed HIM method is useful in controlling the effects of the influential observations in terms of parameter estimation and variable screening.

3.4. A real data example.

We applied our proposed influence diagnosis approach to the microarray data of [31], and noted that the analysis results become substantially different when the detected influential observations are removed. For this dataset, F1 animals were intercrossed and then 120 twelve-week-old male offspring were selected for tissue harvesting from the eyes and for microarray analysis. The Affymetrix microarrays that were used to analyze the RNA from the eyes of those F2 animals contain over 31,042 different probe sets. Among them, one probe is for gene TRIM32, which was recently found to cause Bardet–Biedl syndrome [10], a genetically heterogeneous disease of multiple organ systems including the retina. One goal of interest of this data analysis is to find genes whose expressions are correlated with that of gene TRIM32. We first followed [27] to exclude probes that were not expressed in the eye or that lacked sufficient variation, which results in 18,975 probes as regressors. We then followed [22] to retain the top 1000 probes that are most correlated with the probe of TRIM32. The resulting analysis has $p = 1000$ predictors and a sample size $n = 120$. As a standard procedure [27], all the probes are standardized to have mean zero and standard deviation one.

We next applied our method with FDR rate $\alpha = 0.10$ to the data, and identified a total of 5 influential observations. Their corresponding p-values were 0, 0.0004, 0.0011, 0.0029 and 0.0033, respectively. We also show the logarithm of the p-values versus the indices for all 120 observations in Figure 2. To assess the influence of the detected points, we again compared the LASSO estimate with and without those points. Since we used ten-fold cross-validation to select the tuning parameter and every run is random, we repeated this analysis 100 times and report the average results.

Fig. 2. The logarithm of the p-value for each observation: the detected influential points are denoted by solid circles.

We summarize the difference of the estimates in the following aspects: the sparsity, the norm difference and the angle between the two estimates. First, by removing the identified influential observations, the resulting LASSO estimate is considerably more sparse. The average model size with the full data is 63. By contrast, the average model size without the influential observations reduces to 27. The potential influential points clearly have a significant effect on the model size. Besides that, the average number of predictors commonly selected by fitting the full data and the reduced data, respectively, is only 8.67, which again shows a clear discrepancy between the two estimates. Consequently, the influential points identified by our approach seem to have a significant effect on the subsequent analysis. Second, denote $d_1 = \|\hat\beta_{\mathrm{full}}\|$, $d_2 = \|\hat\beta_{\mathrm{redu}}\|$ and $d_3 = \|\hat\beta_{\mathrm{redu}} - \hat\beta_{\mathrm{full}}\|$, where $\hat\beta_{\mathrm{full}}$ is the LASSO estimate using all the observations and $\hat\beta_{\mathrm{redu}}$ is the estimate after removing the influential points identified by HIM. We observe that the average of $(d_1 - d_2)/d_1$ is 0.532 and that of $d_3/d_1$ is 0.972. Both show that the estimates without influential points are quite different in terms of the $\ell_2$ norm. In addition, the angle between $\hat\beta_{\mathrm{full}}$ and $\hat\beta_{\mathrm{redu}}$, which is defined as $\hat\beta_{\mathrm{full}}^\top \hat\beta_{\mathrm{redu}}/(d_1 d_2)$, equals 0.262, averaged over the 100 repetitions. These numbers again indicate that the estimates change substantially after removing the influential observations. In summary, this analysis illustrates the importance of influence diagnosis, and the identified influential observations should be treated with extreme care.
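The comparison summaries used in this example (model sizes, number of commonly selected predictors, norms and the cosine-type angle) can be computed along the following lines. This is a minimal sketch: the function name `compare_estimates` and the dictionary keys are illustrative choices rather than anything defined in the paper, and the Euclidean norm is assumed throughout.

```python
import math


def compare_estimates(beta_full, beta_redu):
    """Summaries contrasting two coefficient vectors of equal length."""
    support_full = {j for j, b in enumerate(beta_full) if b != 0.0}
    support_redu = {j for j, b in enumerate(beta_redu) if b != 0.0}
    norm_full = math.sqrt(sum(b * b for b in beta_full))
    norm_redu = math.sqrt(sum(b * b for b in beta_redu))
    norm_diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(beta_full, beta_redu)))
    inner = sum(a * b for a, b in zip(beta_full, beta_redu))
    # The cosine is undefined when either estimate is identically zero.
    if norm_full > 0 and norm_redu > 0:
        cosine = inner / (norm_full * norm_redu)
    else:
        cosine = float("nan")
    return {
        "size_full": len(support_full),
        "size_redu": len(support_redu),
        "common": len(support_full & support_redu),
        "norm_full": norm_full,
        "norm_redu": norm_redu,
        "norm_diff": norm_diff,
        "cosine": cosine,
    }


summary = compare_estimates([3.0, 4.0, 0.0, 0.0], [3.0, 0.0, 0.0, 0.0])
```

In the toy call above, the two vectors share one selected predictor, and the cosine is 9/15 = 0.6. Averaging such summaries over repeated cross-validation runs would mirror the reporting used in this section.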

4. Extension to generalized linear models.

The main idea of the high-dimensional influence measure can be extended to a broad class of regression models. Here we briefly discuss one such extension to generalized linear models (GLM). Assume that the data $(X_i, Y_i)$ follow an exponential family distribution with the canonical probability density function
$$f(y; \theta) = \exp\{y\theta - b(\theta) + c(y)\},$$
and that the conditional mean is of the form
$$E(Y_i \mid X_i) = b'(\theta(X_i)) = g^{-1}(\beta_0 + X_i^\top \beta),$$
where $g$ is a known link function. For the purpose of feature screening in ultrahigh-dimensional regressions, [23] introduced a marginal utility measure, the maximum marginal likelihood estimator, as
$$\hat\beta_j = (\hat\beta_{j,0}, \hat\beta_{j,1}) = \arg\min_{\beta_{j,0},\, \beta_{j,1}} E_n\, l(Y, \beta_{j,0} + \beta_{j,1} X_j),$$
where $l(Y; \theta) = -Y\theta + b(\theta) + \log c(Y)$ and $E_n f(X, Y) = n^{-1} \sum_{i=1}^n f(X_i, Y_i)$. That is, $\hat\beta_j$ is the maximum likelihood estimator obtained by fitting a GLM of $Y$ on the $j$th predictor $X_j$ alone plus an intercept. As remarked by [23], this measure can be rapidly computed.

In the context of high-dimensional diagnosis, we define the high-dimensional influence measure for generalized linear models for the $k$th observation, $k = 1, \ldots, n$, as
$$D_k^{\mathrm{glm}} = \frac{1}{p} \sum_{j=1}^p \|\hat\beta_j - \hat\beta_j^{(k)}\|^2, \qquad (4.1)$$
where $\hat\beta_j^{(k)}$ denotes the maximum marginal likelihood estimator with the $k$th observation removed. For GLM, the estimators $\hat\beta_j$ and $\hat\beta_j^{(k)}$ may not have a closed-form solution. Consequently, the exact distribution of the proposed statistic $D_k^{\mathrm{glm}}$ is complicated and some approximation is necessary. The detailed derivation, however, is beyond the scope of this paper. In practice, one can always sort the values of $\{D_k^{\mathrm{glm}}, k = 1, \ldots, n\}$ and remove those observations associated with large values of $D_k^{\mathrm{glm}}$.

We have conducted a small simulation study to examine the empirical performance of this measure for GLM. The simulation setup is similar to that of model 1 in Section 3.1, except that this time we adopt a binary response model, $P(Y_i = 1 \mid X_i) = 1/[1 + \exp\{-(2 + X_i^\top \beta)\}]$, where $\beta = \beta_{\mathrm{true}}$ is a coefficient vector whose first entry is 5, and the outliers are generated from the same model with $\beta = \beta_{\mathrm{infl}}$, which shares its leading entries with $\beta_{\mathrm{true}}$ and has a block of trailing entries equal to $-\kappa$. We set $n = 100$, with 10% influential observations, that is, $n_{\mathrm{infl}} = 10$, and we set $p = 50$ or 100. Since the asymptotic distribution of $D_k^{\mathrm{glm}}$ is not available for the logistic regression, we flag the 10 observations with the largest values of $D_k^{\mathrm{glm}}$ as influential. For a binary response, one is often interested in classification. As such, we compare the misclassification error rate for the full data, $E_{\mathrm{full}}$, with that for the reduced data without the detected influential points, $E_{\mathrm{redu}}$. We also report the empirical power. The results from 200 data replications are summarized in Table 4.

Table 4. Simulation results for the logistic model

p     Criterion   κ = 0   0.4     0.8     1.2     1.6
50    Power       –       0.220   0.472   0.422   0.254

From Table 4, we note that the proposed method has some power for a logistic model, but it is lower than that in a linear model. This is probably due to the fact that a binary response contains much less information, and thus detecting influential observations in a logistic model is much more challenging, especially in a high-dimensional setting. On the other hand, we also observe from Table 4 that removing those points with the largest values of $D_k^{\mathrm{glm}}$ improves the classification accuracy by a large margin. This again suggests the usefulness of influence diagnosis. Meanwhile, we remark that further investigation into both theoretical and empirical properties of the high-dimensional influence measure in GLM is warranted.
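As a rough illustration of the GLM measure (4.1) for a binary response, the sketch below fits each marginal logistic regression by Newton iterations and recomputes the fit with one observation deleted at a time. Several choices here are assumptions rather than part of the paper: the squared Euclidean norm in (4.1), the small ridge term and step cap added for numerical stability, the function names, and the toy data with one deliberately mislabeled high-leverage point.

```python
import math


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp so exp() stays in range
    return 1.0 / (1.0 + math.exp(-z))


def fit_marginal_logistic(x, y, ridge=1e-3, iters=50):
    """Newton iterations for a logistic fit of y on an intercept and x.

    The ridge term is an added safeguard and not part of the measure itself.
    """
    a = b = 0.0
    for _ in range(iters):
        ga, gb = -ridge * a, -ridge * b      # gradient of the penalized log-likelihood
        haa, hab, hbb = ridge, 0.0, ridge    # negative Hessian entries
        for xi, yi in zip(x, y):
            mu = sigmoid(a + b * xi)
            w = mu * (1.0 - mu)
            ga += yi - mu
            gb += (yi - mu) * xi
            haa += w
            hab += w * xi
            hbb += w * xi * xi
        det = haa * hbb - hab * hab
        da = (hbb * ga - hab * gb) / det
        db = (haa * gb - hab * ga) / det
        step = max(abs(da), abs(db))
        scale = min(1.0, 2.0 / step) if step > 0 else 0.0  # cap the step length
        a += scale * da
        b += scale * db
    return a, b


def glm_influence(X, Y):
    """D_k^glm: average squared change of the marginal fits when k is deleted."""
    n, p = len(X), len(X[0])
    D = [0.0] * n
    for j in range(p):
        xj = [row[j] for row in X]
        a, b = fit_marginal_logistic(xj, Y)
        for k in range(n):
            a_k, b_k = fit_marginal_logistic(xj[:k] + xj[k + 1:], Y[:k] + Y[k + 1:])
            D[k] += ((a - a_k) ** 2 + (b - b_k) ** 2) / p
    return D


# Toy data: the first predictor drives the response, the second is unrelated.
x1 = [(i - 19.5) / 8.0 for i in range(40)]
x2 = [math.sin(i) for i in range(40)]
y = [1 if v > 0 else 0 for v in x1]
y[17], y[22] = 1, 0                 # two label flips so the classes overlap
x1.append(5.0)                      # high-leverage point ...
x2.append(0.0)
y.append(0)                         # ... carrying the "wrong" label
X = [[u, v] for u, v in zip(x1, x2)]
D = glm_influence(X, y)
most_influential = max(range(len(D)), key=lambda k: D[k])
```

Sorting `D` in decreasing order and removing the leading observations mirrors the practical rule described in this section. The leave-one-out refits cost one marginal fit per observation and predictor, so a one-step approximation might be preferable for large $n$ and $p$.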

5. Conclusion.

We perceive several future avenues to extend the work proposed in this article. First, we have employed the leave-one-out principle when quantifying the influence of individual observations. We expect that our high-dimensional influence measure can also be generalized to the cases of leaving out pairs of observations, or triplets or more. Such a strategy can be useful when those observations conceal one another [18]. Second, we have focused on the classical linear model in our development, while extensions to more sophisticated models, such as the generalized linear model that is briefly examined in Section 4, survival models, and semiparametric additive models, deserve further investigation. Finally, our proposal deals with cross-sectional data with i.i.d. observations. It is interesting to extend the proposed influence measure to complex correlated data such as longitudinal data, where dependence among observations needs to be taken into consideration in influence diagnosis [42].

APPENDIX

We outline the main idea of the proof for the asymptotic distribution of $D_k$ in Theorem 1. First, we decompose $D_k$ as $D_k = B_1 + B_2 + B_3 - 2B_4$ as given in Section 2.3. Then we compute the mean and variance of $B_i$, $i = 1, \ldots, 4$. Comparing the $B_i$, we find that $B_2$ is the leading term. We then study the asymptotic distribution of $B_2$, which turns out to follow a $\chi^2(1)$ distribution. Recall that in Section 2.3 we defined $K_{p,tl} = p^{-1} \sum_{j=1}^p X_{tj} X_{lj}$ for $t, l = 1, \ldots, n$, $l_p = \sum_{j=1}^p \lambda_j^2 = O(p^r)$ and $c_p = \max_{1 \le j \le p} \lambda_j$, where the $\lambda_j$'s are the eigenvalues of the covariance matrix $\Sigma$. Furthermore, we define $a_p$ as a sum of powers of the eigenvalues $\lambda_j$, $C_1 = E(Y_t^2 Y_l^2 K_{p,tl}^2)$ and $C_2 = E[Y_t^2 (\sum_{j=1}^p \rho_j X_{tj}/p)^2]$ for any $t \neq l$.

Proof of Proposition 1.

We break the proof into three parts: first, we obtain an expansion of $D_k$; second, we derive $E(D_k)$; and finally, we derive the asymptotic behavior of the components in the expansion of $D_k$.

Step 1. First, we have the following expansion for $D_k$, $k = 1, \ldots, n$:
$$
\begin{aligned}
D_k &= \frac{1}{p} \sum_{j=1}^p \Big( \frac{1}{n-1} \sum_{t=1, t \neq k}^n Y_t X_{tj} - \frac{1}{n} \sum_{t=1}^n Y_t X_{tj} \Big)^2 \\
&= \frac{1}{p} \sum_{j=1}^p \Big\{ \frac{1}{n(n-1)} \sum_{t=1, t \neq k}^n Y_t X_{tj} - \frac{1}{n} Y_k X_{kj} \Big\}^2 \\
&= \frac{1}{p} \sum_{j=1}^p \Big\{ \frac{1}{n(n-1)} \sum_{t=1, t \neq k}^n Y_t X_{tj} \Big\}^2 + \frac{1}{p n^2} Y_k^2 \sum_{j=1}^p X_{kj}^2 - \frac{2}{p n^2 (n-1)} \sum_{t=1, t \neq k}^n Y_t Y_k \Big( \sum_{j=1}^p X_{tj} X_{kj} \Big) \\
&= \frac{1}{p \{n(n-1)\}^2} \sum_{t=1, t \neq k}^n Y_t^2 \Big( \sum_{j=1}^p X_{tj}^2 \Big) + \frac{1}{p n^2} Y_k^2 \Big( \sum_{j=1}^p X_{kj}^2 \Big) \\
&\quad + \frac{1}{p \{n(n-1)\}^2} \sum_{t \neq s;\, t, s \neq k} Y_t Y_s \Big( \sum_{j=1}^p X_{tj} X_{sj} \Big) - \frac{2}{p n^2 (n-1)} \sum_{t=1, t \neq k}^n Y_k Y_t \Big( \sum_{j=1}^p X_{tj} X_{kj} \Big) \\
&= \frac{1}{\{n(n-1)\}^2} \sum_{t=1}^n Y_t^2 K_{p,tt} + \Big[ \frac{1}{n^2} - \frac{1}{\{n(n-1)\}^2} \Big] Y_k^2 K_{p,kk} \\
&\quad + \frac{1}{\{n(n-1)\}^2} \sum_{t \neq s} Y_t Y_s K_{p,ts} - \Big[ \frac{2}{\{n(n-1)\}^2} + \frac{2}{n^2 (n-1)} \Big] \sum_{t=1, t \neq k}^n Y_k Y_t K_{p,tk} \\
&= \frac{1}{\{n(n-1)\}^2} \sum_{t=1}^n Y_t^2 K_{p,tt} + \frac{n-2}{n(n-1)^2} Y_k^2 K_{p,kk} + \frac{1}{\{n(n-1)\}^2} \sum_{t \neq s} Y_t Y_s K_{p,ts} - \frac{2}{n(n-1)^2} \sum_{t=1, t \neq k}^n Y_k Y_t K_{p,tk} \\
&:= B_1 + B_2 + B_3 - 2B_4.
\end{aligned}
$$

Step 2. Next, we derive the expectation of $D_k$. It is easy to see that
$$
E(B_1) = \frac{1}{p n (n-1)^2} \sum_{j=1}^p E(Y_k^2 X_{kj}^2), \qquad
E(B_2) = \frac{n-2}{p n (n-1)^2} \sum_{j=1}^p E(Y_k^2 X_{kj}^2),
$$
$$
E(B_3) = \frac{1}{p n (n-1)} \sum_{j=1}^p \rho_j^2, \qquad
E(B_4) = \frac{1}{p n (n-1)} \sum_{j=1}^p \rho_j^2.
$$
Therefore, we have
$$
E(D_k) = E(B_1 + B_2 + B_3 - 2B_4) = \frac{1}{p n (n-1)} \sum_{j=1}^p \mathrm{var}(Y_k X_{kj}).
$$
By Lemmas 1 and 3, we have
$$
E\{Y_k^2 (K_{p,kk} - E(K_{p,kk}))\} \le E^{1/2}(Y_k^4)\, E^{1/2}[\{K_{p,kk} - E(K_{p,kk})\}^2] = O(p^{-1} l_p^{1/2})
$$
and
$$
p^{-1} \sum_{j=1}^p E^2(Y_k X_{kj}) = p^{-1} \sum_{j=1}^p \rho_j^2 = O(p^{-1} c_p).
$$
In addition, noting that $c_p^2 \le l_p$, we have
$$
\begin{aligned}
p^{-1} \sum_{j=1}^p \mathrm{var}(Y_k X_{kj})
&= p^{-1} \sum_{j=1}^p \{E(Y_k^2 X_{kj}^2) - E^2(Y_k X_{kj})\} \\
&= E(Y_k^2) E(K_{p,kk}) + E\{Y_k^2 (K_{p,kk} - E(K_{p,kk}))\} - p^{-1} \sum_{j=1}^p \rho_j^2 \\
&= E(Y_k^2) E(K_{p,kk}) + O(p^{-1} l_p^{1/2}).
\end{aligned}
$$
Consequently, we have
$$
E(D_k) = \frac{1}{p n (n-1)} E(Y_k^2) \sum_{j=1}^p E(X_{kj}^2) + O(n^{-2} p^{-1} l_p^{1/2}).
$$

Step

3. Next, we derive the asymptotic behavior of B i , i = 1 , . . . , Step B . Note thatvar( B ) = nn ( n − var( Y t K p,tt )and that E ( Y t K p,tt ) ≤ E / ( Y t ) E / ( K p,tt ). Furthermore, E ( K p,tt ) = p − E ( Z ⊤ t ΣZ t ) = p − E " p X j =1 λ j ( Z ⊤ t u j ) ≤ p − E [( Z ⊤ t u j ) ] p X j =1 λ j ! ≤ E ( Z ⊤ t u j ) . The last equation holds because tr(Σ) = P pj =1 λ j = p . As a result, we havevar( B ) = O ( n − ) . Step B . By Lemma 3, we have E ( B ) = 1 pn ( n − p X j =1 ρ j = O ( c p · p − n − ) . In addition, it is easy to see E ( B ) = 1 n ( n − ( n X t =1 ,t = k E ( Y k Y t K p,tk ) + t,s = k X t = s E ( Y k Y t Y s K p,tk K p,sk ) ) = 1 n ( n − " ( n − E ( Y k Y t K p,tk )+ ( n − n − E ( Y k p X j =1 ρ j X kj /p ! ) IGH-DIMENSIONAL INFLUENCE MEASURE = 1 n ( n − E ( Y k Y t K p,tk ) + ( n − n ( n − E ( Y k p X j =1 ρ j X kj /p ! ) = 1 n ( n − C + ( n − n ( n − C , where C , C are deﬁned as in (1.1) and (1.2) in the supplementary material[41], respectively. From the proof of Lemma 2, we know C = O ( c p p − ) and C = C + C , with C = O ( p − a / p ) and C = l p p − E ( Y t ). Therefore,we have E ( B ) = l p p n ( n − E ( Y t ) + O ( n − p − a / p ) + O ( c p p − n − )= O ( l p p − n − ) + O ( c p p − n − ) . Consequently, we havevar( B ) = E ( B ) − E ( B ) = O ( l p p − n − ) + O ( c p p − n − ) . Step B ). First, we have B = 1 p { n ( n − } X t = s Y t Y s p X j =1 X sj X tj ! = 1 pn ( n − X t = s φ ( Y t , Y s , X s , X t ) / { ( n − n } := 1 pn ( n −

1) ¯ B , where φ ( Y t , Y s , X s , X t ) = P t = s Y t Y s ( P pj =1 X sj X tj ). Let φ ( Y t , X t ) = E { φ ( Y t , Y s , X s , X t ) − E ( φ ( Y t , Y s , X s , X t ) | Y t , X t } = Y t p X j =1 X tj ρ j − p X j =1 ρ j . Noting that ¯ B is an U -statistic, and by the properties of the U -statistic,we havevar( B ) = 1 { pn ( n − } var( ¯ B )= 1 { pn ( n − } (cid:20) n var { φ ( Y t , X t ) } + o ( n − ) (cid:21) (A.1) = 4 p n ( n − var ( Y t p X j =1 X tj ρ j ) + o ( p − n − ( n − − ) . ZHAO, LENG, LI AND WANG

Furthermore, we have p − var Y t p X j =1 X tj ρ j ! = E ( Y t p X j =1 X tj ρ j /p ! ) − p X j =1 ρ j /p ! (A.2) ≤ E / ( Y t ) E / ( p X j =1 X tj ρ j /p ! ) − p X j =1 ρ j /p ! . In addition, we show in the proof of Lemma 2 that E ( p X j =1 X tj ρ j /p ! ) = O ( c p p − )(A.3)and in Lemma 3 that p X j =1 ρ j /p = O ( c p · p − ) . (A.4)Combining (A.1)–(A.4), we havevar( B ) = O ( c p n − p − ) + O ( p − n − ) . Step B , which can be written as B = ( n − n ( n − Y k K p,kk = ( n − n ( n − [ Y k { K p,kk − E ( K p,kk ) } + Y k E ( K p,kk )]:= B + B . By Lemma 1, we have E ( Y k { K p,kk − E ( K p,kk ) } ) ≤ E / ( Y k ) E / ( { K p,kk − E ( K p,kk ) } )= O ( p − l p ) ,E ( Y k { K p,kk − E ( K p,kk ) } ) ≤ E / ( Y k ) E / ( { K p,kk − E ( K p,kk ) } )= O ( l / p p − ) . Therefore, we havevar( B ) = (cid:26) ( n − n ( n − (cid:27) var( Y k [ K p,kk − E ( K p,kk )]) = O ( n − p − l p ) , var( B ) = O ( n − ) . This completes the proof. (cid:3)
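The algebraic expansion in Step 1 of this proof can be checked numerically. The sketch below compares the direct leave-one-out definition of $D_k$ with the sum $B_1 + B_2 + B_3 - 2B_4$, taking $B_4 = \{n(n-1)^2\}^{-1} \sum_{t \neq k} Y_k Y_t K_{p,tk}$. The function name, the random inputs and the small sizes are arbitrary illustrative choices, and the identity is purely algebraic, so it holds for any data.

```python
import random


def decomposition_check(n=8, p=5, k=2, seed=1):
    """Return (direct D_k, B1 + B2 + B3 - 2*B4) for random data."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    Y = [rng.gauss(0, 1) for _ in range(n)]

    # Direct definition: average squared change of the marginal moments.
    direct = 0.0
    for j in range(p):
        full = sum(Y[t] * X[t][j] for t in range(n)) / n
        loo = sum(Y[t] * X[t][j] for t in range(n) if t != k) / (n - 1)
        direct += (loo - full) ** 2
    direct /= p

    # K[t][l] = p^{-1} * sum_j X_tj * X_lj
    K = [[sum(X[t][j] * X[l][j] for j in range(p)) / p for l in range(n)]
         for t in range(n)]
    c = (n * (n - 1)) ** 2
    B1 = sum(Y[t] ** 2 * K[t][t] for t in range(n)) / c
    B2 = (n - 2) / (n * (n - 1) ** 2) * Y[k] ** 2 * K[k][k]
    B3 = sum(Y[t] * Y[s] * K[t][s]
             for t in range(n) for s in range(n) if t != s) / c
    B4 = sum(Y[k] * Y[t] * K[t][k] for t in range(n) if t != k) / (n * (n - 1) ** 2)
    return direct, B1 + B2 + B3 - 2 * B4


direct, via_terms = decomposition_check()
```

The two returned values should agree up to floating-point rounding, which gives a quick sanity check on the coefficients $(n-2)/\{n(n-1)^2\}$ and $2/\{n(n-1)^2\}$ in the expansion.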

Proof of Theorem 1.

Consider the behavior of K p,kk , k = 1 , . . . , n ,for a suﬃcient large pK p,kk = p X j =1 X kj /p = X ⊤ k X k /p = Z ⊤ k ΣZ k = p X j =1 λ j ( Z ⊤ k u j ) /p. Its variance is var( K p,kk ) = 2 P pj =1 λ j /p = 2 p − l p . Under the assumption l p = O ( p r ) with 0 ≤ r <

2, we have K p,kk = E ( K p,kk ) + O p ( p r/ − ), and con-sequently, Y k K p,kk = Y k [ E ( K p,kk ) + O p ( p r/ − )] . In addition, noting that E [ Y k ( K p,kk − E ( K p,kk ))] ≤ E / ( Y k )(var( K p,kk )) / = O ( p r/ − ), we have E ( Y k K p,kk ) = E ( Y k ) E ( K p,kk ) + E [ Y k ( K p,kk − E ( K p,kk ))]= E ( Y k ) E ( K p,kk ) + O ( p r/ − ) . Therefore, we have Y k K p,kk − E ( Y k K p,kk ) = [ Y k − E ( Y k )] E ( K p,kk ) + O p ( p r/ − ) . As a result, it holds that B − E ( B ) = n − n ( n − { [ Y k − E ( Y k )] E ( K p,kk ) + O p ( p r/ − ) } . (A.5)Note that c p ≤ l p = O ( p r ) under (C.2). Combined with Proposition 1, wehave B − E ( B ) = O p ( n − / ) ,B − E ( B ) = O p ( n − / p r/ − ) ,B − E ( B ) = O p ( p r/ − n − ) . Consequently, we have n ( n − ( n − (cid:26) X i =1 , [ B i − E ( B i )] − B − E ( B )) (cid:27) (A.6) = O p ( n − / ) + O p ( p r/ − ) . Furthermore, by the results on E ( D k ) in step 2 of the proof of Proposition 1,we have E ( D k ) = 1 n ( n − E ( Y k ) E ( K p,kk ) + O ( n − p − l / p ) . ZHAO, LENG, LI AND WANG

Consequently, by l p = O ( p r ), we have n ( n − ( n − E ( D k ) = n − n − E ( Y k ) E ( K p,kk ) + O ( p r/ − ) . (A.7)Combining (A.5)–(A.7), we have n ( n − ( n − D k = n ( n − ( n − E ( D k ) + n ( n − ( n −

2) [ D k − E ( D k )]= n ( n − ( n − E ( D k ) + n ( n − ( n − (cid:18) X i =1 , , [ B i − E ( B i )] − B − E ( B )) (cid:19) = n − n − E ( Y k ) E ( K p,kk ) + { Y k − E ( Y k ) } E ( K p,kk )+ O p ( p r/ − ) + O p ( n − / )= Y k E ( K p,kk ) + 1 n − E ( Y k ) E ( K p,kk ) + O p ( p r/ − ) + O p ( n − / )= Y k + o p (1) , where the last equation is from the fact that E ( X kj ) = 1 , j = 1 , . . . , p and E ( Y k ) = 1. Since Y ∼ N (0 , n ( n − ( n − D k ∼ χ (1); that is, n D k ∼ χ (1). (cid:3) Acknowledgements.

We thank Professor Peter B¨uhlmann, Professor Pe-ter Hall, an Associate Editor and three referees for their constructive com-ments. SUPPLEMENTARY MATERIAL

Further proofs (DOI: 10.1214/13-AOS1165SUPP; .pdf). The supplemen-tary ﬁle contains the proofs of four additional lemmas, Proposition 2 andTheorem 2. REFERENCES [1]

Anderson, E. B. (1992). Diagnostics in categorical data analysis.

J. R. Stat. Soc.Ser. B Stat. Methodol. Banerjee, M. (1998). Cook’s distance in linear longitudinal models.

Comm. Statist.Theory Methods Banerjee, M. and

Frees, E. W. (1997). Inﬂuence diagnostics for linear longitudinalmodels.

J. Amer. Statist. Assoc. [4] Belloni, A. and

Chernozhukov, V. (2011). ℓ -penalized quantile regression inhigh-dimensional sparse models. Ann. Statist. Benjamini, Y. and

Hochberg, Y. (1995). Controlling the false discovery rate:A practical and powerful approach to multiple testing.

J. R. Stat. Soc. Ser.B Stat. Methodol. Benjamini, Y. and

Hochberg, Y. (2000). On the adaptive control of the falsediscovery rate in multiple testing with independent statistics.

J. Educ. Behav.Stat. Bickel, P. J. and

Levina, E. (2008). Covariance regularization by thresholding.

Ann. Statist. Candes, E. and

Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n . Ann. Statist. Chatterjee, S. and

Hadi, A. S. (1988).

Sensitivity Analysis in Linear Regression .Wiley, New York. MR0939610[10]

Chiang, A. P. , Beck, J. S. , Yen, H. J. , Tayeh, M. K. , Scheetz, T. E. , Swiderski, R. , Nishimura, D. , Braun, T. A. , Kim, K. Y. , Huang, J. , Elbedour, K. , Carmi, R. , Slusarski, D. C. , Casavant, T. L. , Stone, E. M. and

Sheffield, V. C. (2006). Homozygosity mapping with SNP arrays iden-tiﬁes a novel gene for Bardet–Biedl syndrome (BBS11).

Proc. Natl. Acad. Sci.USA

Christensen, R. , Pearson, L. M. and

Johnson, W. (1992). Case-deletion diag-nostics for mixed models.

Technometrics Cook, R. D. (1977). Detection of inﬂuential observation in linear regression.

Tech-nometrics Cook, R. D. (1979). Inﬂuential observations in linear regression.

J. Amer. Statist.Assoc. Cook, R. D. and

Weisberg, S. (1982).

Residuals and Inﬂuence in Regression . Chap-man & Hall, London. MR0675263[15]

Critchley, F. , Atkinson, R. A. , Lu, G. and

Biazi, E. (2001). Inﬂuence analysisbased on the case sensitivity function.

J. R. Stat. Soc. Ser. B Stat. Methodol. Davison, A. C. and

Tsai, C. L. (1992). Regression model diagnostics.

Int. Stat.Rev. Donoho, D. L. (2000). High-dimensional data analysis: The curses and blessings ofdimensionality. Technical report, Stanford Univ.[18]

Draper, N. R. and

Smith, H. (1998).

Applied Regression Analysis , 3rd ed. Wiley,New York. MR1614335[19]

[19] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Ann. Statist.

[20] Efron, B., Tibshirani, R., Storey, J. D. and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. J. Amer. Statist. Assoc.

[21] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc.

[22] Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B Stat. Methodol.

[23] Fan, J. and Song, R. (2010). Sure independence screening in generalized linear models with NP-dimensionality. Ann. Statist.

[24] Fu, W. J. (1998). Penalized regressions: The bridge versus the Lasso. J. Comput. Graph. Statist.

[25] Fung, W.-K., Zhu, Z.-Y., Wei, B.-C. and He, X. (2002). Influence diagnostics and outlier tests for semiparametric mixed models. J. R. Stat. Soc. Ser. B Stat. Methodol.

[26] Huang, J., Horowitz, J. L. and Ma, S. (2008). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann. Statist.

[27] Huang, J., Ma, S. and Zhang, C.-H. (2008). Adaptive Lasso for sparse high-dimensional regression models. Statist. Sinica

[28] Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist.

[29] Pan, J.-X. and Fang, K.-T. (2002). Growth Curve Models and Statistical Diagnostics. Springer, New York. MR1937691

[30] Preisser, J. S. and Qaqish, B. F. (1996). Deletion diagnostics for generalised estimating equations. Biometrika

[31] Scheetz, T., Kim, K., Swiderski, R., Philp, A., Braun, T., Knudtson, K., Dorrance, A., DiBona, G., Huang, J., Casavant, T. et al. (2006). Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proc. Natl. Acad. Sci. USA

[32] Storey, J. D. (2002). A direct approach to false discovery rates. J. R. Stat. Soc. Ser. B Stat. Methodol.

[33] Thomas, W. and Cook, R. D. (1989). Assessing influence on regression coefficients in generalized linear models. Biometrika

[34] Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B Stat. Methodol.

[35] Wang, H. (2009). Forward regression for ultra-high dimensional variable screening. J. Amer. Statist. Assoc.

[36] Wang, H. and Leng, C. (2007). Unified Lasso estimation via least squares approximation. J. Amer. Statist. Assoc.

[37] Wang, H., Li, G. and Jiang, G. (2007). Robust regression shrinkage and consistent variable selection through the LAD–Lasso. J. Bus. Econom. Statist.

[38] Williams, D. A. (1987). Generalized linear model diagnostics using the deviance and single case deletions. J. R. Stat. Soc. Ser. C. Appl. Stat.

[39] Xiang, L., Tse, S.-K. and Lee, A. H. (2002). Influence diagnostics for generalized linear mixed models: Applications to clustered data. Comput. Statist. Data Anal.

[40] Zhang, H. H. and Lu, W. (2007). Adaptive Lasso for Cox’s proportional hazards model. Biometrika

[41] Zhao, J., Leng, C., Li, L. and Wang, H. (2013). Supplement to “High-dimensional influence measure.” DOI:10.1214/13-AOS1165SUPP.

[42] Zhu, H., Ibrahim, J. G. and Cho, H. (2012). Perturbation and scaled Cook’s distance. Ann. Statist.

[43] Zhu, H., Ibrahim, J. G., Lee, S. and Zhang, H. (2007). Perturbation selection and influence measures in local influence analysis. Ann. Statist.

[44] Zhu, H., Lee, S.-Y., Wei, B.-C. and Zhou, J. (2001). Case-deletion measures for models with incomplete data. Biometrika

[45] Zou, H. (2006). The adaptive LASSO and its oracle properties. J. Amer. Statist. Assoc.

J. Zhao, School of Mathematics and System Science, Beihang University, Beijing, China. E-mail: [email protected]

C. Leng, Department of Statistics, University of Warwick, Coventry, United Kingdom, and Department of Statistics and Applied Probability, National University of Singapore. E-mail: [email protected]

L. Li, Department of Statistics, North Carolina State University, Raleigh, North Carolina, USA. E-mail: [email protected]