Partial Penalized Likelihood Ratio Test under Sparse Case
Shanshan Wang^a, Hengjian Cui^b,∗

a School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371, Singapore
b School of Mathematical Sciences & BCMIIS, Capital Normal University, Beijing 100048, China
Abstract
This work is concerned with testing the low-dimensional parameters of interest with divergent-dimensional data, together with variable selection for the remaining parameters, under the sparse case. A consistent test via the partial penalized likelihood approach, called the partial penalized likelihood ratio test statistic, is derived, and its asymptotic distributions under the null hypothesis and under local alternatives of order n^{-1/2} are obtained under some regularity conditions. Meanwhile, the oracle property of the partial penalized likelihood estimator also holds. The proposed partial penalized likelihood ratio test statistic outperforms the full penalized likelihood ratio test statistic in terms of size and power, and performs as well as the classical likelihood ratio test statistic. Moreover, the proposed method yields variable selection results as well as the p-values of testing. Numerical simulations and an analysis of the Prostate Cancer data confirm our theoretical findings and demonstrate the promising performance of the proposed partial penalized likelihood in hypothesis testing and variable selection.

Key Words:
Chi-squared distribution, Hypothesis testing, Likelihood ratio, Partial penalized likelihood, SCAD
MSC (2010):
∗ Corresponding author. E-mail addresses: [email protected] (Hengjian Cui).

1 Introduction

Over the past few years there has been a great deal of attention on the problem of estimating a sparse parameter β ∈ R^p associated with collected data V_1, ..., V_n, which are independent and identically distributed (iid) random variables with probability density function (pdf) f(V, β). A considerable amount of recent work has been dedicated to the estimation problem under the sparsity scenario, both in terms of computation and theory. A comprehensive summary of the literature
in either category would be too long for our purposes here, so we instead give a short summary: for computational work, some relevant contributions are ?, ??, ?, ?, ?, ?, ?, ? and so on; and for theoretical work see, e.g., ??, ?, ?, ?, ?, ?. Generally speaking, with a few exceptions, existing theories only handle the problem of variable selection and estimation simultaneously; few of them address the problem of assigning uncertainties, statistical significance or confidence. As pointed out in ?, there are still major gaps in our understanding of these regularization methods as estimation procedures, and in many real applications a practitioner will undoubtedly seek some sort of inferential guarantee for his or her variable selection procedure. But, generically, the usual constructs such as p-values and confidence intervals do not exist for these estimates, especially for the zero coefficients excluded by a variable selection procedure. In this sense, developing statistical inference methods under the sparse case is necessary.

More recently, there is a growing literature dedicated to statistical inference in high-dimensional settings, and important progress has certainly been achieved. See ? and ? for variable selection and p-value estimation based on sample splitting; stability selection in ? and ?; p-values for parameter components in lasso and ridge regression in ? and ?; optimal confidence regions and tests for single or low-dimensional components in a high-dimensional model in ? and ??; perturbation resampling-based procedures in ?; conservative statistical inference after model selection and classification in ? and ?, respectively; the covariance test for the Lasso model in ?; and references therein. Apart from the aforementioned literature, our investigation is largely motivated by some doubts pertaining to variable selection procedures. For example, suppose one is concerned with how a given set of gene expressions (usually, the gene expressions of interest are known in advance from prior knowledge or otherwise) among a great number of gene expressions affects the survival times of patients, while the variable selection procedure excludes these variables. One then needs to answer the question: with what probability do these gene expressions have no influence on the survival times of patients (or, with what probability would the variable selection procedure exclude these variables)? Moreover, if the gene expressions of interest are included by the variable selection procedure, one hopes to verify this, and to obtain a test procedure that is consistent with the variable selection results. Finally, the proposed method can also perform variable selection for the remaining gene expressions, since the rest usually make up the majority and also satisfy the sparsity assumption. Thus, the focus of the current paper is to propose a method that achieves these multiple objectives simultaneously, performing a consistent hypothesis test for the variables of interest and variable selection for the remaining sparse ones.

Specifically, consider a canonical instance of an inference problem under the sparse case, namely testing a sub-vector of the parameter β ∈ R^p based on iid observations V_1, ..., V_n with pdf f(V, β). The null hypothesis of interest is formulated as

H_0: β_1 = 0  vs.  H_1: β_1 ≠ 0.  (1)

Here the true sparse parameter vector β_0 = (β_{10}^T, β_{20}^T)^T, where β_1 ∈ R^d is the parameter of interest with fixed and known d ≪ p, and its complement β_2 ∈ R^{p-d} is sparse.
Without loss of generality, let the sparse parameter β_2 = (β_{21}^T, β_{22}^T)^T, where the first s components of β_2, denoted by β_{21}, do not vanish and the remaining p - d - s coefficients, denoted by β_{22}, are 0. Rewrite β = (β^{DT}, β^{IT})^T, where we refer to β^D = (β_1^T, β_{21}^T)^T ∈ R^{d+s} as the active parameter vector and its complement β^I = β_{22} = 0 ∈ R^{p-d-s} as the inactive parameter vector.

For the hypothesis problem (1), the classical likelihood ratio (OLR) test proposed by ?? is a primary one, and has been proved to have desirable properties in the literature. For example, ? showed that the OLR test has a limiting central chi-squared distribution with d degrees of freedom (χ²_d) under H_0, and subsequently ? proved that the OLR test converges uniformly in distribution to a noncentral chi-squared distribution under local alternatives of order n^{-1/2}, i.e., H_1: β_1 = θ_0 + δn^{-1/2}, where δ is a known d × 1 vector. However, for the sparse parameter β_2, none of the estimated parameters is exactly zero in the estimation scheme of the classical likelihood (OL) method, leaving all covariates in the final model. Consequently the OL method is incapable of selecting important variables, leading to poor predictability and estimation accuracy, and this drawback becomes worse as the sparsity level increases. To achieve variable selection for β_2, ? studied the oracle properties of nonconcave penalized likelihood estimators in the finite-dimensional setting. Their results were later extended by ? to the setting of p = o(n^{1/4}) or o(n^{1/5}). Yet their penalized likelihood (PL) method may not distinguish a nonzero component when it is near zero, i.e., of size δn^{-1/2} for fixed δ ≠ 0, since their method relies on the condition that the nonzero components deviate from zero at a rate greater than O(n^{-1/2}) for p fixed and O((n/p)^{-1/2}) for p divergent, respectively. Moreover, ? also proposed the penalized likelihood ratio (PLR) test for linear hypotheses concerning the nonzero components, and investigated its asymptotic null distribution. However, their PLR test only applies to hypothesis testing for the markedly nonzero components, and may not perform inference for the zero components, i.e., the hypothesis (1). More specifically, if we apply the PLR test to the hypothesis (1), the estimate ˆβ_1 will shrink to zero when the true value β_1 is near zero, owing to the estimation scheme of the full penalization, consequently leading to a conservative test. This will inevitably increase the type II error (accepting H_0 under the alternative hypothesis H_1) of the PLR test. Fortunately, imposing no penalization on β_1 protects it against shrinking to zero, and yields a consistent test.
This motivates us to consider partial penalization; see the toy example in Section 2.2 for more insight into the motivation for the partial penalization.

Thus, in this article we take a different route: by adopting partial penalization instead of full penalization, we consider both the problem of variable selection and the problem of hypothesis testing for (1), in the hope that the proposed method possesses the advantages of both the OL and PL methods. Specifically, we propose the partial penalized likelihood (PPL) method to perform variable selection for the sparse parameter β_2, establishing its oracle property; meanwhile, for the hypothesis (1), we derive a consistent test, called the partial penalized likelihood ratio (PPLR) test, and under some regularity conditions we establish that the PPLR test converges in distribution to χ²_d under H_0 (Theorem 1) and to χ²_d(γ), with noncentrality parameter γ depending on δ, under local alternatives of order n^{-1/2} (Theorem 2), respectively. In this sense, our proposed consistent test performs as well as the OLR test, and the PPL method is also capable of selecting important variables like the PL method, achieving better predictability and estimation accuracy. Overall, the main contribution of this paper is to propose the idea of partial penalization as well as a consistent test for (1), demonstrating its promising advantages in variable selection and hypothesis testing.

The rest of the paper is organized as follows. In Section 2, we first briefly review the penalized likelihood method and the penalized likelihood ratio test statistic proposed in ?, and then illustrate our motivation via a toy example. For the hypothesis (1), we propose the partial penalized likelihood ratio test statistic in the framework where p diverges with n in Section 3.1, together with its asymptotic properties. In Section 3.2, we describe the algorithm and discuss the selection of tuning parameters. Numerical comparisons and simulation studies are conducted in Section 4. An application to the Prostate Cancer data is given in Section 5. Some discussion is given in Section 6. Technical proofs are relegated to the Appendix.

2 Penalized likelihood and motivation

In Section 2.1, we first briefly review existing results for the nonconcave penalized likelihood approach; more details can be found in the work of ? and ?. The more familiar reader may skip Section 2.1. Then in Section 2.2, we present a toy example showing the possible problems of the full penalized likelihood method, as well as better illustrating the idea of partial penalization.

2.1 Full penalized likelihood and its tests

Recall that log f(V, β) is the underlying log-likelihood for the random vector V, and V_1, ..., V_n are iid samples with pdf f(V, β). Let L_n(β) = Σ_{i=1}^n log f(V_i, β) be the log-likelihood function, and let p_λ(|β_j|) be a nonconcave penalty function with a tuning parameter λ ≥ 0.
As discussed in ?, the penalized likelihood estimator ˆβ then maximizes the penalized likelihood

Q_n(β|V) = L_n(β) - n Σ_{j=1}^p p_λ(|β_j|).  (2)

For the penalty function p_λ(·), many variable-selection-capable penalty functions have been proposed. A well-known example is the Lasso penalty (??). Among many others are the SCAD penalty (?), the elastic-net penalty (?), the adaptive L_1 penalty (?), and the minimax concave penalty (MCP; ?). In particular, ? studied the choice of penalty functions in depth. They proposed a unified approach via nonconcave penalized likelihood to automatically select important variables and simultaneously estimate the coefficients of covariates. In this paper, we use the SCAD penalty for our method whenever necessary, although other penalties can also be used. Specifically, the first derivative of the SCAD penalty satisfies

p′_λ(|β|) = λ sgn(β) { I(|β| ≤ λ) + [(aλ - |β|)_+ / ((a - 1)λ)] I(|β| > λ) }

for some a > 2, where (s)_+ = s for s > 0 and 0 otherwise. Following ?, we set a = 3.7.
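For concreteness, the SCAD penalty is obtained by integrating the derivative displayed above. The following is a minimal NumPy sketch of both (our own illustration, not code from the paper), with the recommended a = 3.7:

import numpy as np

def scad_deriv(beta, lam, a=3.7):
    """First derivative of the SCAD penalty, lambda*sgn(beta) near zero,
    linearly decaying on (lambda, a*lambda], and zero beyond a*lambda."""
    b = np.abs(beta)
    small = (b <= lam) * lam
    mid = (b > lam) * np.maximum(a * lam - b, 0.0) / (a - 1.0)
    return np.sign(beta) * (small + mid)

def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty p_lambda(|beta|), the antiderivative of scad_deriv."""
    b = np.abs(beta)
    return np.where(
        b <= lam,
        lam * b,
        np.where(
            b <= a * lam,
            -(b**2 - 2.0 * a * lam * b + lam**2) / (2.0 * (a - 1.0)),
            (a + 1.0) * lam**2 / 2.0,
        ),
    )

The three pieces agree at |β| = λ and |β| = aλ, so the penalty is continuous; it is constant beyond aλ, which is what makes large coefficients unbiased.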
For the maximization of (2), several algorithms have been proposed: ? proposed the local quadratic approximation; ? proposed the local linear approximation; ? presented the difference convex algorithm; and ? investigated the application of coordinate descent algorithms to SCAD and MCP regression models. In this work, whenever necessary, we use the idea of the coordinate descent algorithm to solve the SCAD penalized optimization.

With a slight abuse of notation, only in this section let β = (β_1^T, β_2^T)^T, where the first s components of β, denoted by β_1, do not vanish and the remaining p - s coefficients, denoted by β_2, are 0. For the nonconcave penalized likelihood estimator ˆβ, under some regularity conditions, ? and ? established its oracle properties in the frameworks with dimension p fixed and p divergent, respectively. Moreover, ? investigated the linear hypothesis H_0: Aβ_1 = 0 versus H_1: Aβ_1 ≠ 0, where A is a q × s matrix with AA^T = I_q for a fixed q ≤ s, and formulated the penalized likelihood ratio test statistic as T_n = 2{ sup_Ω Q_n(β|V) - sup_{Ω, Aβ_1=0} Q_n(β|V) }, where Ω denotes the parameter space for β. Under H_0, with some additional conditions on the penalty function p_λ(·) as in ?, they obtained that T_n → χ²_q in distribution as n → ∞.

For the zero components β_2, the oracle property only shows that ˆβ_2 = 0 with probability tending to 1 as n → ∞. However, one may suspect the assertion β_2 = 0, or want to know with what probability a component of β_2 equals zero for a given sample size; such questions actually involve statistical hypothesis testing. The full penalized likelihood ratio test statistic T_n only covers linear hypotheses for the nonzero components β_1, and the conclusion for T_n under H_0 may not hold for some special A. For example, when A = I_s, it follows that β_1 = 0 under H_0, and then T_n = o_p(1), since ˆβ_1 = 0 with probability tending to 1 (oracle property). Or take A = e_j^T, where e_j is an s × 1 vector whose j-th component is 1 and whose other components are 0; then, under H_0, T_n may not be asymptotically χ²_1 distributed. We demonstrate this phenomenon in the simulation studies.

2.2 A toy example

Before we present the main approach, we give a toy example to better illustrate the partial penalization. As in ?, consider the linear regression model Y = Xβ + ε, where the error vector ε = (ε_1, ..., ε_n)^T with ε_i iid ∼ N(0, 1), and the n × 1 response vector Y = (Y_1, ..., Y_n)^T and the n × p design matrix X = (X_1, ..., X_n)^T satisfy

Σ_{i=1}^n Y_i = 0,  Σ_{i=1}^n X_{ij} = 0,  Σ_{i=1}^n X²_{ij} = n.  (3)

It is well known that the classical maximum likelihood estimate of β corresponds to the least squares estimator ˆβ^{LS} = n^{-1} X^T Y, and the maximum penalized likelihood estimate of β defined in (2) corresponds to the penalized least squares estimator, denoted by ˆβ; under assumption (3), it holds that

ˆβ = arg min_β { (2n)^{-1} ‖Y - Xβ‖² + Σ_{j=1}^p p_λ(|β_j|) } ∝ (1/2) ‖ˆβ^{LS} - β‖² + Σ_{j=1}^p p_λ(|β_j|).  (4)

For given λ, Figure 1 shows the plots of the penalized least squares estimator ˆβ versus the least squares estimator ˆβ^{LS} in Eq. (4) for the Lasso (a), SCAD (b) and MCP (c) penalties, respectively. When the true parameter β is near zero, i.e., β = δ/√n, then ˆβ^{LS} falls in the interval (-λ, λ) with probability tending to 1, and thus, as Figure 1 shows, all three variable selection procedures result in ˆβ = 0, owing to their penalization schemes. Consequently, when testing the hypothesis H_0: β = 0, the estimate ˆβ shrinks to zero when the true value β is near zero, leading to a conservative test. This will inevitably increase the type II error (accepting H_0 under the alternative hypothesis H_1: β = δ/√n). To obtain a consistent test, we consider the partial penalization in Section 3.

Figure 1: Given λ, plots of the penalized least squares estimate ˆβ versus the least squares estimate ˆβ^{LS} in Eq. (4) for the Lasso (a), SCAD (b) and MCP (c) penalties, respectively.
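In the setting of (4) the minimization decouples coordinatewise, and the solution maps plotted in Figure 1 have well-known closed forms. The sketch below states them (our own rendering of the standard univariate Lasso/SCAD/MCP rules under the design assumption (3); the default a = 3.7 is illustrative), making explicit that any ˆβ^{LS} in (-λ, λ) is mapped to exactly 0:

import numpy as np

def lasso_threshold(z, lam):
    """Soft-thresholding: the Lasso solution map, panel (a)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def scad_threshold(z, lam, a=3.7):
    """SCAD solution map, panel (b)."""
    az = np.abs(z)
    if az <= 2.0 * lam:
        return lasso_threshold(z, lam)       # soft-threshold near zero
    elif az <= a * lam:
        return ((a - 1.0) * z - np.sign(z) * a * lam) / (a - 2.0)
    else:
        return z                             # no shrinkage for large |z|

def mcp_threshold(z, lam, a=3.7):
    """MCP solution map, panel (c): firm thresholding."""
    if np.abs(z) <= a * lam:
        return lasso_threshold(z, lam) * a / (a - 1.0)
    return z

# A least squares estimate inside (-lam, lam) is set exactly to zero:
print(scad_threshold(0.3, lam=0.5))   # prints 0.0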
3 Partial penalized likelihood ratio test

For the sparse parameter, variable selection through regularization has proven to be effective, and possesses desirable oracle properties under some regularity conditions. However, as discussed in Section 1 as well as at the end of Section 2, it is necessary to develop a new approach to deal with hypothesis tests concerning variable selection results. For the hypothesis (1), we derive a consistent test procedure in the framework with p divergent in Section 3.1. The implementation of the test procedure and the choice of tuning parameter are described in Section 3.2. Throughout this paper, it is important to note that the quantities p and λ can depend on the sample size n; we have suppressed this dependency for notational simplicity.

3.1 The partial penalized likelihood ratio test and its asymptotic properties

Recall that V_1, ..., V_n are iid random variables with pdf f(V, β), and the parameter β is the same as that in Section 1. For the hypothesis problem (1), we define a partial penalized likelihood ratio test statistic as

T_n = 2 { sup_Ω PQ_n(β|V) - sup_{Ω, β_1=0} PQ_n(β|V) },  (5)

where

PQ_n(β|V) = L_n(β) - n Σ_{j=d+1}^p p_λ(|β_j|)  (6)

is the partial penalized likelihood function, with ˆβ = argmax_β PQ_n(β|V) being the partial penalized likelihood estimator.

Remark 1 Here in (6), instead of full penalization, we propose partial penalization with β_1 unpenalized. This protects β_1 against shrinking to zero when the true value β_1 is zero or near zero, and allows further statistical inference. In fact, β_1 is the parameter of interest and β_2 is sparse; with the partial penalization in (6), we not only protect β_1, but also perform variable selection for β_2.
Remark 2 Consider the linear hypothesis H_0: Aβ = 0 vs. H_1: Aβ ≠ 0, where A is a d × p matrix with AA^T = I_d for a fixed d ≪ p. This problem includes testing simultaneously the significance of a few parameters. Let B be a (p - d) × p matrix which satisfies BB^T = I_{p-d} and AB^T = 0; that is, the linear space spanned by the rows of B is the orthogonal complement of the linear space spanned by the rows of A. Let ˜β = Ãβ with à = (A^T, B^T)^T satisfying ÃÃ^T = I_p. Then the linear hypothesis H_0: Aβ = 0 vs. H_1: Aβ ≠ 0 can be reformulated as H_0: ˜β_1 = 0 vs. H_1: ˜β_1 ≠ 0, where ˜β_1 consists of the first d components of the parameter ˜β. The partial penalized likelihood function in (6) can then be defined as PQ_n(˜β|V) = L_n(˜β) - n Σ_{j=d+1}^p p_λ(|˜β_j|) with L_n(˜β) = Σ_{i=1}^n log f(V_i, Ã^{-1}˜β), and the corresponding partial penalized likelihood ratio test statistic T_n can also be constructed.
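The reduction in Remark 2 is easy to carry out numerically: the rows of B can be taken as an orthonormal basis of the null space of A. A minimal sketch using SciPy (our own illustration; extend_to_orthogonal is a hypothetical helper name, not from the paper):

import numpy as np
from scipy.linalg import null_space

def extend_to_orthogonal(A):
    """Given a d x p matrix A with A @ A.T = I_d, return B of shape
    (p-d, p) with B @ B.T = I_{p-d} and A @ B.T = 0, together with the
    p x p orthogonal matrix A_tilde = (A^T, B^T)^T from Remark 2."""
    B = null_space(A).T            # orthonormal rows spanning the complement
    A_tilde = np.vstack([A, B])    # satisfies A_tilde @ A_tilde.T = I_p
    return B, A_tilde

# Example: jointly testing beta_1 - beta_2 = 0 and beta_3 = 0 in R^4.
A = np.array([[1.0, -1.0, 0.0, 0.0],
              [0.0,  0.0, 1.0, 0.0]])
A[0] /= np.linalg.norm(A[0])       # enforce A @ A.T = I_d
B, A_tilde = extend_to_orthogonal(A)
assert np.allclose(A_tilde @ A_tilde.T, np.eye(4))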
Denote by ˜p_λ(·) a working penalty function with ˜p_λ(|β_j|) = 0 for 1 ≤ j ≤ d and ˜p_λ(|β_j|) = p_λ(|β_j|) for (d + 1) ≤ j ≤ p; then the partial penalized likelihood function in (6) can be rewritten as Q_n(β|V) = L_n(β) - n Σ_{j=1}^p ˜p_λ(|β_j|), which can be seen as the penalized likelihood function in (2) with the special penalty function ˜p_λ(·). Therefore, the oracle property of ˆβ, as in Theorems 1 and 2 of ?'s paper, also holds; see Lemmas 1 and 2 in the Appendix.

Based on the oracle property of ˆβ, we investigate the asymptotic properties of T_n in (5) under H_0 in (1) as well as under the local alternatives H_1: β_1 = δn^{-1/2}, where δ is a known d × 1 vector. The following two theorems give the asymptotic distributions of T_n, facilitating hypothesis testing and power calculation. They show that the classical likelihood theory continues to hold in the partial penalized likelihood context.

Theorem 1 Suppose the regularity conditions (A)-(H), (E′) and (F′) in the Appendix are satisfied. Under H_0, it holds that T_n → χ²_d in distribution, provided that p^5/n → 0 as n → ∞.
Theorem 2 Suppose the regularity conditions (A)-(H), (E′) and (F′) in the Appendix are satisfied. If H_1: β_1 = δn^{-1/2} is true, where δ is a d × 1 vector, then T_n → χ²_d(γ) in distribution, with noncentrality parameter γ = δ^T C_{11.2} δ, provided that p^5/n → 0 as n → ∞. Here C_{11.2} = C_{11} - C_{12} C_{22}^{-1} C_{21}, and the (d + s) × (d + s) matrix

I_1(β^D) = I(β^D, 0) = ( C_{11}  C_{12} ; C_{21}  C_{22} )

is the Fisher information knowing β^I = 0, with principal submatrices C_{11} and C_{22} of dimensions d × d and s × s, respectively.
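Theorem 2 makes local power easy to evaluate numerically: γ is the Schur complement quadratic form δ^T C_{11.2} δ, and the asymptotic power at level α follows from the noncentral chi-squared distribution. A small sketch (our own; the information matrix below is a made-up example):

import numpy as np
from scipy.stats import chi2, ncx2

def local_power(delta, I1, d, alpha=0.05):
    """Asymptotic power of the PPLR test under H_1: beta_1 = delta/sqrt(n),
    per Theorem 2. I1 is the (d+s) x (d+s) Fisher information I_1(beta^D)."""
    C11, C12 = I1[:d, :d], I1[:d, d:]
    C21, C22 = I1[d:, :d], I1[d:, d:]
    C11_2 = C11 - C12 @ np.linalg.solve(C22, C21)   # Schur complement C_{11.2}
    gamma = float(delta @ C11_2 @ delta)            # noncentrality parameter
    crit = chi2.ppf(1.0 - alpha, df=d)              # chi^2_d critical value
    return ncx2.sf(crit, df=d, nc=gamma)

# Toy example with d = 1, s = 2:
I1 = np.array([[1.0, 0.2, 0.1],
               [0.2, 1.0, 0.3],
               [0.1, 0.3, 1.0]])
print(local_power(np.array([2.0]), I1, d=1))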
Remark 3 The condition p^5/n → 0 in Theorems 1 and 2 as n → ∞ may seem somewhat strong, but the rate on p should not be taken as restrictive, because our proposed method is studied in a broad framework based on the log-likelihood function. Since no particular structural information is available on the log-likelihood function, establishing the theoretical results is very challenging, so strong regularity conditions are needed and the bounds in the stochastic analysis are conservative. This is also the case in ?. By refining the structure of the log-likelihood function, the restriction on the dimensionality p can be relaxed. Another reason is the stronger conditions on the likelihood function, which facilitate the technical proofs yet may bring a stringent assumption on p. Our focus in this section is to demonstrate that the proposed method is applicable in a framework with p growing with n; the question of how fast the dimension p may grow with n is not addressed in this paper, and we will consider it in future work. Thus, keep in mind that the framework presented in this paper is applicable only where the sample size is larger than the dimension of the parameter. When that is violated, preliminary methods such as sure independence screening (?) may be used to reduce the dimensionality before adopting our proposed method.

3.2 Algorithm and tuning parameter selection

In this section, we describe an efficient coordinate descent algorithm for the implementation of the proposed method, and discuss the selection of tuning parameters.

The idea of coordinate optimization for penalized problems was proposed by ?, and was demonstrated by ? and ? to be efficient for large-scale sparse problems. Recently, various authors, including ?, ?, and ?, generalized this idea to regularized regression with various penalties and showed that it is an attractive alternative to earlier proposals such as the local quadratic approximation (?) and the local linear approximation (?).

To maximize the objective function PQ_n(β|V) in (6), the coordinate descent method maximizes the objective function in one coordinate at a time and cycles through all coordinates until convergence. For fixed λ, cyclically for j = 1, ..., p, update the j-th component ˆβ_j(λ) of ˆβ(λ) by the univariate maximizer of PQ_n(ˆβ(λ)|V) with respect to ˆβ_j(λ) until convergence. This produces a solution path ˆβ(λ) over a grid of points λ, and the optimal regularization parameter λ can then be chosen by minimizing the following BIC-type criterion motivated by ?,

BIC(λ) = -PQ_n(ˆβ(λ)|V) + C_n log(n) df_λ,  (7)

where ˆβ(λ) is the partial penalized likelihood estimate of β with regularization parameter λ; df_λ is the number of nonzero coefficients in ˆβ(λ); and C_n is a scaling factor diverging to infinity at a slow rate (?) for p → ∞, for which C_n = max{log log p, 1} was suggested as a good choice. However, a rigorous proof of the consistency of this BIC for the partial penalized likelihood merits further investigation. Fortunately, the BIC-type criterion defined in (7) usually selects the tuning parameter satisfactorily and identifies the true model consistently in our simulation studies.

Finally, it is important to keep in mind that the optimal regularization parameter λ should be the same for maximizing PQ_n(β|V) over the full parameter space and over the subspace specified by the null hypothesis in (1) when we calculate the test statistic T_n in (5).
In fact, we adopt the aforementioned BIC to choose the optimal regularization parameter λ when maximizing PQ_n(β|V) over the full parameter space, and then, for the chosen λ, we maximize PQ_n(β|V) over the subspace specified by the null hypothesis in (1).
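To make the procedure concrete, the following sketch outlines the whole pipeline for the linear model case: coordinate descent for the partial penalized estimate, BIC-based choice of λ per (7), and the test statistic T_n from (5). It is a simplified illustration under our own assumptions (Gaussian log-likelihood with σ = 1, standardized design as in (3), a fixed λ grid), not the authors' code, and it reuses scad_penalty and scad_threshold from the sketches above:

import numpy as np
from scipy.stats import chi2

def ppl_fit(X, y, lam, d, null=False, n_iter=200):
    """Coordinate descent for the partial penalized least squares criterion,
    leaving the first d coefficients unpenalized; null=True fixes them at 0."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            if null and j < d:
                continue                     # restricted fit under H_0
            zj = beta[j] + X[:, j] @ r / n   # univariate LS update (X_j'X_j = n)
            bj = zj if j < d else scad_threshold(zj, lam)
            r += X[:, j] * (beta[j] - bj)
            beta[j] = bj
    return beta

def loglik(X, y, beta):
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum((y - X @ beta) ** 2)

def pplr_test(X, y, d, lam_grid):
    n, p = X.shape
    pq = lambda b, lam: loglik(X, y, b) - n * np.sum(scad_penalty(b[d:], lam))
    def bic(lam):                            # criterion (7) on the full-space fit
        b = ppl_fit(X, y, lam, d)
        cn = max(np.log(np.log(p)), 1.0)
        return -pq(b, lam) + cn * np.log(n) * np.count_nonzero(b)
    lam = min(lam_grid, key=bic)
    b_full = ppl_fit(X, y, lam, d)
    b_null = ppl_fit(X, y, lam, d, null=True)
    Tn = 2.0 * (pq(b_full, lam) - pq(b_null, lam))
    return Tn, chi2.sf(Tn, df=d)             # statistic and asymptotic p-value

Note that the same λ, chosen on the unrestricted fit, is reused for the restricted fit, as required above.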
4 Simulation studies

We present simulation results to illustrate the usefulness of the partial penalized likelihood ratio (PPLR) test, and to compare its finite-sample performance with the penalized likelihood ratio (PLR) test and the classical likelihood ratio (LR) test in terms of model selection accuracy and power. That is, we first assess the performance of the partial penalized likelihood (PPL), the penalized likelihood (PL) and the ordinary likelihood (OL) methods in terms of estimation accuracy and model selection consistency. Then we evaluate the empirical size and power of the three test methods. Here we set d = 1, and the BIC-type criterion defined in (7) is used to estimate the optimal tuning parameter λ in the smoothly clipped absolute deviation (SCAD) penalty. We simulate 1000 samples of size n = 100, 200, 400 and 800 with p = 11, 20, 30 and 41, respectively, from the following two examples:
Example 4.1 (Linear Regression) Y = X^T β + σε, where we set σ = 1, ε follows a standard normal distribution, X ∼ N(0, I_p), and the true value β_0 = (β_1, β_{21}^T, 0^T)^T ∈ R^p, with β_{21} a fixed vector of a few nonzero coefficients, where β_1 is the parameter of interest and will be specified as different true values whenever necessary in the following simulations. All covariates are standardized. We consider the null hypothesis H_0: β_1 = 0 and the local alternatives H_1: β_1 = δn^{-1/2}.

Example 4.2 (Logistic Regression) Y ∼ Bernoulli{p(X^T β)}, where p(u) = exp(u)/(1 + exp(u)), and the covariates X and β are the same as those in Example 4.1. All covariates are standardized.
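Both examples are straightforward to simulate. The sketch below is our own minimal data generator; the nonzero part of β_0 is an illustrative placeholder, not the exact coefficients used in the paper's tables:

import numpy as np

rng = np.random.default_rng(0)

def simulate(n, p, beta1, model="linear"):
    """Generate one sample from Example 4.1 (linear) or 4.2 (logistic).
    beta1 is the scalar parameter of interest (d = 1)."""
    beta = np.zeros(p)
    beta[0] = beta1
    beta[1:4] = [2.0, 1.5, 3.0]             # placeholder sparse signal
    X = rng.standard_normal((n, p))
    X = (X - X.mean(0)) / X.std(0)          # standardize covariates
    eta = X @ beta
    if model == "linear":
        y = eta + rng.standard_normal(n)    # sigma = 1
    else:
        y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
    return X, y

# e.g. local alternative beta_1 = delta / sqrt(n):
n, p, delta = 100, 11, 2.0
X, y = simulate(n, p, beta1=delta / np.sqrt(n))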
Table 1: Results for the three methods PPL, PL and OL in Example 4.1 under the true values β_0 = (β_1, β_{21}^T, 0^T)^T, where the first component β_1 = δn^{-1/2} with different δ. Values shown are means (standard deviations) of each performance measure (L_2-loss, L_1-loss, C and IC for PPL and PL; L_2-loss and L_1-loss for OL) over 1000 replicates.

For Example 4.1, we first evaluate the performance of the resulting estimators of the three methods in terms of four measures under the different true values β_0 with β_1 = δn^{-1/2}, δ = 0, 0.5, ..., 4. Two measures pertain to estimation accuracy: the L_2-loss ‖ˆβ - β_0‖_2 = {(ˆβ - β_0)^T(ˆβ - β_0)}^{1/2} and the L_1-loss ‖ˆβ - β_0‖_1 = Σ_{j=1}^p |ˆβ_j - β_{j0}|. The other two measures pertain to model selection consistency: C and IC refer to the number of correctly selected zero coefficients and the number of incorrectly excluded variables, respectively. Due to space limitations, Table 1 only summarizes the means and standard deviations of each measure over 1000 replicates for several values of δ. For fixed δ ≠ 0, the average number of incorrectly estimated zero coefficients is always greater than 0 for the PL method; that is, the PL method may not identify the nonzero component β_1 = δn^{-1/2} with fixed δ ≠ 0, while the proposed PPL method still works. In this way, we conjecture that the PL method cannot distinguish nonzero components of order n^{-1/2}, and the PPL method outperforms the PL method in terms of model selection, especially when some nonzero component is near zero. In terms of estimation accuracy, the PPL method performs best among the three methods. Therefore, if we know in advance that some components are near zero and the rest are sparse, our proposed method performs best among the three methods, with performance very close to that of the oracle estimator.

Next, to verify the performance of the PPLR test in Theorems 1 and 2, we consider the null hypothesis H_0: β_1 = 0 and calculate power under the local alternatives H_1: β_1 = δn^{-1/2} for different δ and sample sizes n. Using a nominal level α = 0.05, we document the empirical size and power results in Table 2. From Table 2, for the hypothesis H_0 the PLR test does not work any more, while the remaining two methods still work, which confirms Theorems 1 and 2. From this point of view, we can conjecture that the PPLR test performs as well as the LR method and outperforms the PLR test when the null parameter is zero, in terms of size and power. All of these results demonstrate the promising performance of the PPLR test in hypothesis testing.

Meanwhile, from Table 2 we conjecture that under the null hypothesis H_0: β_1 = 0, the PLR test may not be asymptotically chi-squared distributed with one degree of freedom (χ²_1), while the conclusion for the PPLR test still holds. We demonstrate these results in Figures 2 and 3, which give QQplots of the test statistics against the χ²_1 distribution under the null hypotheses H_0: β_1 = 0 and H_0: β_1 = 2, respectively. From Figures 2 and 3, it follows that the conclusion for the PLR test obtained in ? only holds when the null parameter deviates away from zero (for example, H_0: β_1 = 2; see Figure 3), and may not hold under the null hypothesis H_0: β_1 = 0 (see Figure 2).

For Example 4.2, results similar to those for Example 4.1 are observed; see Tables 3 and 4, from which we can draw the same conclusions as for Example 4.1.
Figure 2: QQplots of the partial penalized likelihood ratio (PPLR) test statistics (top row) and the penalized likelihood ratio (PLR) test statistics (bottom row) against the χ²_1 distribution under the null hypothesis H_0: β_1 = 0, when (n, p) = (100, 11) (left column), (n, p) = (200, 20) (middle column) and (n, p) = (400, 30) (right column), respectively.

Figure 3: QQplots of the partial penalized likelihood ratio (PPLR) test statistics (top row) and the penalized likelihood ratio (PLR) test statistics (bottom row) against the χ²_1 distribution under the null hypothesis H_0: β_1 = 2, when (n, p) = (100, 11) (left column), (n, p) = (200, 20) (middle column) and (n, p) = (400, 30) (right column), respectively.

Table 2: Empirical percentage of rejecting H_0: β_1 = 0 for the true values under H_1: β_1 = δn^{-1/2} with different δ and sample size n in Example 4.1. The nominal level is 5%.
(n, p)      Test   δ=0.0  δ=0.5  δ=1.0  δ=1.5  δ=2.0  δ=2.5  δ=3.0  δ=3.5
(100, 11)   LR     0.047  0.084  0.157  0.311  0.483  0.705  0.840  0.935
            PLR    0.000  0.001  0.001  0.023  0.055  0.136  0.291  0.480
            PPLR   0.042  0.087  0.157  0.305  0.460  0.682  0.812  0.917
(200, 20)   LR     0.057  0.096  0.158  0.314  0.504  0.686  0.845  0.936
            PLR    0.001  0.002  0.003  0.014  0.049  0.113  0.248  0.410
            PPLR   0.060  0.090  0.146  0.294  0.458  0.668  0.808  0.915
(400, 30)   LR     0.054  0.076  0.162  0.305  0.520  0.711  0.864  0.936
            PLR    0.000  0.000  0.002  0.008  0.055  0.101  0.209  0.375
            PPLR   0.054  0.074  0.150  0.294  0.481  0.674  0.837  0.923
(800, 41)   LR     0.056  0.087  0.177  0.314  0.508  0.716  0.828  0.928
            PLR    0.000  0.000  0.000  0.007  0.034  0.086  0.218  0.320
            PPLR   0.054  0.092  0.181  0.304  0.512  0.689  0.817  0.914

Table 3: Results for the three methods PPL, PL and OL in Example 4.2 under the true values β_0 = (β_1, β_{21}^T, 0^T)^T, where the first component β_1 = δn^{-1/2} with different δ. Values shown are means (standard deviations) of each performance measure over 1000 replicates.

Table 4: Empirical percentage of rejecting H_0: β_1 = 0 for the true values under H_1: β_1 = δn^{-1/2} with different δ and sample size n in Example 4.2. The nominal level is 5%.
(n, p)      Test   δ=0.0  δ=0.5  δ=1.0  δ=1.5  δ=2.0  δ=2.5  δ=3.0  δ=3.5
(100, 11)   LR     0.052  0.067  0.156  0.314  0.505  0.644  0.838  0.911
            PLR    0.000  0.000  0.005  0.019  0.059  0.140  0.285  0.472
            PPLR   0.050  0.063  0.149  0.301  0.485  0.620  0.816  0.903
(200, 20)   LR     0.064  0.069  0.179  0.288  0.502  0.680  0.857  0.941
            PLR    0.000  0.001  0.004  0.011  0.058  0.134  0.252  0.439
            PPLR   0.061  0.071  0.173  0.269  0.471  0.643  0.815  0.915
(400, 30)   LR     0.052  0.074  0.190  0.317  0.479  0.707  0.845  0.932
            PLR    0.000  0.000  0.001  0.016  0.035  0.104  0.212  0.406
            PPLR   0.052  0.070  0.175  0.306  0.465  0.685  0.825  0.917
(800, 41)   LR     0.058  0.073  0.183  0.296  0.543  0.696  0.860  0.939
            PLR    0.000  0.000  0.001  0.015  0.034  0.092  0.178  0.346
            PPLR   0.054  0.072  0.178  0.284  0.517  0.669  0.848  0.930
5 An application to Prostate Cancer data

In this section we illustrate the techniques of our method via an analysis of Prostate Cancer data. The data come from a study by ? and were analyzed by ?. They examined the correlation between the level of prostate-specific antigen and a number of clinical measures in men who were about to receive a radical prostatectomy. The sample size is 97 and the variables are log cancer volume (lcavol), log prostate weight (lweight), age, log of the amount of benign prostatic hyperplasia (lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason), and percent of Gleason scores 4 or 5 (pgg45). Among these variables, svi is a binary variable, and gleason is an ordered categorical variable. According to Examples 3.2.1 and 3.3.4 of Hastie et al. (2009), we know that lcavol, lweight and svi show a strong relationship with the response lpsa, while age, lcp, gleason and pgg45 are not significant. Here, we also fit a linear model to the log of prostate-specific antigen, lpsa, after first standardizing the predictors to have unit variance and centering the response to have zero mean:

lpsa = β_1 lcavol + β_2 lweight + β_3 age + β_4 lbph + β_5 svi + β_6 lcp + β_7 gleason + β_8 pgg45 + ε.  (8)

We are interested in the significance of each predictor, which leads to the null hypotheses H_{0,j}: β_j = 0 for the individual j-th predictor, j = 1, 2, ..., 8.

Table 5: Estimates for Prostate Cancer data (columns: Variable, LS, SCAD-PL, SCAD-PPL; rows: lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45, and the multiple R²).

The estimates and the multiple R² are summarized in Table 5, with β_j unpenalized in the SCAD-PPL method; from the table we can see that over 60% of the lpsa variation can be explained by the variables that we use, and the results are consistent with the analysis of Examples 3.2.1 and 3.3.4 of ?.

For the null hypotheses H_{0,j}, Table 6 summarizes the p-values of the testing results, using the unpenalized, full penalized and partial penalized versions of the likelihood ratio test. At a nominal significance level of 0.05, the test results of the PPLR method show that the three predictors lcavol, lweight and svi are significant, which is consistent with the variable selection results in Table 5. However, the test result of the PLR method for the predictor svi contradicts this (svi is insignificant). Again, this phenomenon shows that the full penalized likelihood ratio test cannot distinguish a nonzero component that is near zero.

Table 6: P-values for each individual predictor of Prostate Cancer data (columns: Variable, LR, SCAD-PLR, SCAD-PPLR).

6 Discussion

Based on the idea of partial penalization, this paper proposes a consistent test, called the partial penalized likelihood ratio test, for the hypothesis problem (1) in the framework where p diverges with n, establishing that the proposed test converges in distribution to χ²_d under H_0 (see Theorem 1) and to χ²_d(γ), with noncentrality parameter γ depending on δ, under local alternatives of order n^{-1/2} (see Theorem 2), respectively. Meanwhile, the proposed partial penalized likelihood method can also perform variable selection for the sparse parameter β_2, keeping the oracle property. In this sense, our proposed consistent test performs as well as the OLR test, and the PPL method is also capable of selecting important variables like the PL method, achieving better predictability and estimation accuracy. Overall, the main contribution of this paper is to propose the idea of partial penalization as well as a consistent test for (1), demonstrating its promising advantages in variable selection and hypothesis testing. We also conduct numerical simulations and an analysis of Prostate Cancer data to confirm our theoretical findings and demonstrate the promising performance of the proposed method.

It is noted in the simulations that our proposed test performs as well as the classical likelihood ratio test in terms of size and power. Yet, the benefit of our proposed method compared with the classical likelihood method is that it conducts variable selection for the remaining sparse parameters, obtaining better estimation accuracy. That is, our proposed method can perform hypothesis testing and variable selection simultaneously.

In the present paper it is assumed that the position of the d parameters of interest is known; that is, the proposed partial penalized likelihood method only applies to null hypotheses of the form (1). This is partly motivated by settings where the parameter of interest is pre-specified from prior knowledge or otherwise. However, when the position of the d parameters of interest is unknown, the proposed method may not be applicable. For example, suppose the null hypothesis is that all components of the p-dimensional parameter are zero, while the alternative hypothesis is that the number of nonzero components is at most d, where d is known and d ≪ p. How to test this via the proposed partial penalized likelihood method? These questions deserve further study, which is in progress but beyond the scope of the current paper.

Acknowledgements

Cui's research was supported in part by the National Natural Science Foundation of China (Grant Nos. 11071022, 11028103, 11231010, 11471223), the Key Project of the Beijing Municipal Education Commission (Grant No. KZ201410028030) and the Foundation of the Beijing Center for Mathematics and Information Interdisciplinary Sciences. We are grateful to Assistant Prof. Pingshou Zhong for constructive comments.
Appendix: Proofs of theorems
To establish Theorems 1 and 2, we present the following lemmas as well as regularity conditions similar to those of ?. The conditions imposed on the likelihood function are:

(A) For every n, the observations {V_i}_{i=1}^n are independent and identically distributed with probability density f(V, β), which has a common support, and the model is identifiable. Furthermore, the first and second derivatives of the likelihood function satisfy the equations E_β{∂ log f(V, β)/∂β_j} = 0 for j = 1, 2, ..., p, and E_β{[∂ log f(V, β)/∂β_j][∂ log f(V, β)/∂β_k]} = -E_β{∂² log f(V, β)/∂β_j∂β_k}.

(B) The Fisher information matrix I(β) = E[{∂ log f(V, β)/∂β}{∂ log f(V, β)/∂β}^T] satisfies 0 < C_1 < λ_min{I(β)} ≤ λ_max{I(β)} < C_2 < ∞ for all n, and, for 1 ≤ j, k ≤ p, E_β{[∂ log f(V, β)/∂β_j][∂ log f(V, β)/∂β_k]}² < C_3 < ∞ and E_β{∂² log f(V, β)/∂β_j∂β_k}² < C_4 < ∞.

(C) There is a large enough open subset ω of Ω ⊂ R^p containing the true parameter point β_0 such that for almost all V the density admits all third derivatives ∂³f(V, β)/∂β_j∂β_k∂β_l for all β ∈ ω. Furthermore, there are functions M_jkl such that |∂³ log f(V, β)/∂β_j∂β_k∂β_l| ≤ M_jkl(V) for all β ∈ ω, and E_β{M²_jkl(V)} < C_5 < ∞ for all p, n and j, k, l.

The aforementioned conditions are similar to those in ?; under conditions (A) and (C), moment bounds on the derivatives of the likelihood function are imposed. The information matrix of the likelihood function is assumed to be positive definite, with uniformly bounded eigenvalues, which is a common assumption in the high-dimensional setting. These conditions are stronger than those of the usual asymptotic likelihood theory, but they facilitate the technical derivations.

Let a_n = max_{(d+1)≤j≤p} {p′_λ(|β_{j0}|): β_{j0} ≠ 0} and b_n = max_{(d+1)≤j≤p} {|p″_λ(|β_{j0}|)|: β_{j0} ≠ 0}. We then place the following conditions on the penalty function:

(D) lim inf_{n→∞} lim inf_{θ→0+} p′_λ(θ)/λ > 0;
(E) a_n = O(n^{-1/2}); (E′) a_n = o(1/√(np));
(F) b_n → 0 as n → ∞; (F′) b_n = o(1/√p);
(G) there are constants C and D such that, when θ_1, θ_2 > Cλ, |p″_λ(θ_1) - p″_λ(θ_2)| ≤ D|θ_1 - θ_2|;
(H) the nonzero components β_{(d+1)0}, ..., β_{(d+s)0} satisfy min_{(d+1)≤j≤(d+s)} |β_{j0}|/λ → ∞ as n → ∞.

Condition (H) can be viewed as a "beta-min" condition: it states that the weakest signal should dominate the penalty parameter λ. Such a condition is routinely made to ensure the recovery of signals, and is reasonable because otherwise the noise is too strong; it is also in line with the condition imposed in ?. Given condition (H), all of conditions (D)-(G) are satisfied by the SCAD penalty, as a_n = 0 and b_n = 0 when n is large enough.

Denote

˜Σ_λ = ( 0_{d×d}  0_{d×s} ; 0_{s×d}  Σ_λ ),  ˜b_n = (0_d^T, b_n^T)^T,

where Σ_λ = diag{p″_λ(|β_{(d+1)0}|), ..., p″_λ(|β_{(d+s)0}|)} and b_n = {p′_λ(|β_{(d+1)0}|) sgn(β_{(d+1)0}), ..., p′_λ(|β_{(d+s)0}|) sgn(β_{(d+s)0})}^T.

Lemma 1. (Existence of the partial penalized likelihood estimator) Suppose that the pdf f(V, β) satisfies conditions (A)-(C), and the penalty function p_λ(·) satisfies conditions (E)-(G). If p^4/n → 0 as n → ∞, then there exists a local maximizer ˆβ of PQ_n(β|V) such that ‖ˆβ - β_0‖ = O_p(√p (n^{-1/2} + a_n)).
Lemma 2. (Oracle property) Under conditions (A)-(H), if λ → 0, √(n/p) λ → ∞ and p^5/n → 0 as n → ∞, then, with probability tending to 1, the root-(n/p)-consistent nonconcave partial penalized likelihood estimator ˆβ = (ˆβ^{DT}, ˆβ^{IT})^T in Lemma 1 must satisfy:

(i) Sparsity: ˆβ^I = 0.

(ii) Asymptotic normality:

√n A_n I_1^{-1/2}(β^D)(I_1(β^D) + ˜Σ_λ){(ˆβ^D - β^D) + (I_1(β^D) + ˜Σ_λ)^{-1} ˜b_n} → N(0, G),

where I_1(β^D) = I(β^D, 0) is the Fisher information knowing β^I = 0, and A_n is a q × (d + s) matrix such that A_n A_n^T → G, a q × q nonnegative symmetric matrix.
Proof. With the working penalty function ˜p_λ(·), the proofs of Lemmas 1 and 2 follow the arguments in ?'s paper; we omit them here.

Let Ω_1 = {(ν_1, ..., ν_{p-d}): (β_1^T, ν^T)^T ∈ Ω}, where ν = (ν^{DT}, ν^{IT})^T ranges through an open subset of R^{p-d}, ν^D being s × 1 and ν^I being (p - d - s) × 1. Under H_0, the parameter β may equivalently be given by the transformation β_j = g_j(ν), where g(ν) = (g_1(ν), ..., g_p(ν))^T with the first d components (g_1(ν), ..., g_d(ν))^T = β_1 = 0 and the remaining p - d components g_j(ν) = ν_{j-d} for (d + 1) ≤ j ≤ p. Denote by ν_0 the true value of ν; under H_0, it follows that β_0 = g(ν_0). Thus, under H_0, the partial penalized likelihood estimator ˜β = g(ˆν) is given by the local maximizer ˆν of the problem PQ_n(g(ˆν)|V) = max_ν PQ_n(g(ν)|V). Write the matrix of first-order partial derivatives of g as

C = [∂g_i/∂ν_j]_{p×(p-d)} = ( 0_{d×(p-d)} ; I_{p-d} ),

and let D = ( 0_{d×s} ; I_s ) be a sub-matrix of C.
Lemma 3. Under the conditions of Theorem 1 and the null hypothesis H_0, we have

ˆβ^D - β^D = n^{-1} I_1^{-1}(β^D) ∇L_n(β^D) + o_p(n^{-1/2}),
˜β^D - β^D = n^{-1} D (D^T I_1(β^D) D)^{-1} D^T ∇L_n(β^D) + o_p(n^{-1/2}).
Proof. We need only prove the second equation; the first can be shown in the same manner. Following the steps of the proof of Lemma 2, it follows that under H_0,

(I_ν(ν^D) + Σ_λ)(ˆν^D - ν^D) - b_n = n^{-1} ∇L_n(g(ν^D, 0)) + o_p(n^{-1/2}),  (A.1)

where I_ν(ν^D) = I_ν(ν^D, 0) and I_ν(ν) is the information matrix for the ν-formulation of the model. Since β = g(ν), we have I_ν(ν^D) = D^T I_1(β^D) D, ∇L_n(g(ν^D, 0)) = D^T ∇L_n(β^D), and ˜Σ_λ = D Σ_λ D^T, so that

D^T (I_1(β^D) + ˜Σ_λ) D (ˆν^D - ν^D) - b_n = n^{-1} D^T ∇L_n(β^D) + o_p(n^{-1/2}).  (A.2)

By the condition a_n = o(1/√(np)) and s ≤ p, we have ‖b_n‖ ≤ √s a_n = o(1/√n). On the other hand, by the condition b_n = o(1/√p), ‖D^T ˜Σ_λ D (ˆν^D - ν^D)‖ = ‖Σ_λ(ˆν^D - ν^D)‖ ≤ b_n ‖ˆν^D - ν^D‖ = o_p(1/√n). It follows that D^T I_1(β^D) D (ˆν^D - ν^D) = n^{-1} D^T ∇L_n(β^D) + o_p(n^{-1/2}). As D^T I_1(β^D) D is the s × s sub-matrix of the Fisher information matrix I_1(β^D), by condition (B) it follows that

˜β^D - β^D = D(ˆν^D - ν^D) = n^{-1} D (D^T I_1(β^D) D)^{-1} D^T ∇L_n(β^D) + o_p(n^{-1/2}).
Lemma 4. Under the conditions of Theorem 1 and the null hypothesis H_0, we have

PQ_n(ˆβ^D|V) - PQ_n(˜β^D|V) = (n/2)(ˆβ^D - ˜β^D)^T I_1(β^D)(ˆβ^D - ˜β^D) + o_p(1).
Proof. A Taylor expansion of PQ_n(ˆβ^D|V) - PQ_n(˜β^D|V) at the point ˆβ^D yields PQ_n(ˆβ^D|V) - PQ_n(˜β^D|V) = T_1 + T_2 + T_3 + T_4, where

T_1 = ∇^T PQ_n(ˆβ^D|V)(ˆβ^D - ˜β^D),
T_2 = -(1/2)(ˆβ^D - ˜β^D)^T ∇²L_n(ˆβ^D)(ˆβ^D - ˜β^D),
T_3 = (1/6) ∇^T{(ˆβ^D - ˜β^D)^T ∇²L_n(β^{D∗})(ˆβ^D - ˜β^D)}(ˆβ^D - ˜β^D),
T_4 = -(1/2)(ˆβ^D - ˜β^D)^T ∇²P_λ(ˆβ^D){1 + o(1)}(ˆβ^D - ˜β^D),

with P_λ(β) = n Σ_{j=1}^p ˜p_λ(|β_j|) and β^{D∗} lying between ˆβ^D and ˜β^D. We have T_1 = 0, since ∇PQ_n(ˆβ^D|V) = 0. By Lemma 3, it holds that

ˆβ^D - ˜β^D = Θ_n^{-1/2}{I_{d+s} - Θ_n^{1/2} D (D^T Θ_n D)^{-1} D^T Θ_n^{1/2}} Θ_n^{-1/2} Φ_n + o_p(n^{-1/2}),  (A.3)

where Θ_n = I_1(β^D) and Φ_n = n^{-1} ∇L_n(β^D). Note that I_{d+s} - Θ_n^{1/2} D (D^T Θ_n D)^{-1} D^T Θ_n^{1/2} is an idempotent matrix with rank d. Hence, by a standard argument and condition (B), we have ‖ˆβ^D - ˜β^D‖ = O_p(√(d/n)). Thus, by condition (C), we have

|T_3| = (1/6) |Σ_{j,k,l} [∂³L_n(β^{D∗})/∂β_j∂β_k∂β_l] (ˆβ^D - ˜β^D)_j (ˆβ^D - ˜β^D)_k (ˆβ^D - ˜β^D)_l|
     ≤ (1/6) {Σ_{j,k,l} (Σ_{i=1}^n M_jkl(V_i))²}^{1/2} ‖ˆβ^D - ˜β^D‖³
     = O_p(n p^{3/2}) O_p((d/n)^{3/2}) = O_p(p^{3/2} d^{3/2} n^{-1/2}) = o_p(1).

Again by the condition b_n = o(1/√p), we have |T_4| ≤ n b_n ‖ˆβ^D - ˜β^D‖² {1 + o(1)} = n o(1/√p) O_p(d/n) = o_p(1). Thus, PQ_n(ˆβ^D|V) - PQ_n(˜β^D|V) = T_2 + o_p(1). By condition (B), it is easy to see that ‖n^{-1} ∇²L_n(ˆβ^D) + I_1(β^D)‖ = o_p(1/√p). Hence, (ˆβ^D - ˜β^D)^T {∇²L_n(ˆβ^D) + n I_1(β^D)}(ˆβ^D - ˜β^D) ≤ o_p(n/√p) O_p(d/n) = o_p(1). Thus,

PQ_n(ˆβ^D|V) - PQ_n(˜β^D|V) = (n/2)(ˆβ^D - ˜β^D)^T I_1(β^D)(ˆβ^D - ˜β^D) + o_p(1).
Proof of Theorem 1. Substituting (A.3) into Lemma 4, we obtain

PQ_n(ˆβ^D|V) - PQ_n(˜β^D|V) = (n/2) Φ_n^T Θ_n^{-1/2}{I_{d+s} - Θ_n^{1/2} D (D^T Θ_n D)^{-1} D^T Θ_n^{1/2}} Θ_n^{-1/2} Φ_n + o_p(1).  (A.4)

Since I_{d+s} - Θ_n^{1/2} D (D^T Θ_n D)^{-1} D^T Θ_n^{1/2} is an idempotent matrix with rank d, we can rewrite it in the product form A_n^T A_n, where A_n is a d × (d + s) matrix satisfying A_n A_n^T = I_d. As in the proof of Lemma 2, we can show that √n A_n Θ_n^{-1/2} Φ_n → N(0, I_d). Thus,

T_n = 2{sup_Ω PQ_n(β|V) - sup_{Ω, β_1=0} PQ_n(β|V)} = 2{PQ_n(ˆβ^D|V) - PQ_n(˜β^D|V)}
    = (√n Θ_n^{-1/2} Φ_n)^T (I_{d+s} - Θ_n^{1/2} D (D^T Θ_n D)^{-1} D^T Θ_n^{1/2})(√n Θ_n^{-1/2} Φ_n) + o_p(1)
    = {√n A_n Θ_n^{-1/2} Φ_n}^T {√n A_n Θ_n^{-1/2} Φ_n} + o_p(1) → χ²_d.
Proof of Theorem 2. Let ˜δ = (δ^T, 0_s^T)^T be a (d + s) × 1 vector. When H_1: β_1 = δn^{-1/2} is true, denote by ˜β_n^D = (β_1^T, β_{21}^T)^T the sequence of "true" values of β^D. Consider the Taylor series expansion of ∇L_n(β^D) about β^D = ˜β_n^D:

∇L_n(β^D) = ∇L_n(˜β_n^D) + ∇²L_n(˜β^{D∗})(β^D - ˜β_n^D) = ∇L_n(˜β_n^D) + n^{1/2}(I_1(β^D) + o_p(1)) ˜δ,

for some ˜β^{D∗} with ‖˜β^{D∗} - β^D‖ < ‖β^D - ˜β_n^D‖. Continuing with the notation of Lemma 4, and by the proof of Lemma 2, we have n^{-1/2} A_n Θ_n^{-1/2} ∇L_n(˜β_n^D) → N_d(0, G), where A_n is a d × (d + s) matrix such that A_n A_n^T → G, and G is a d × d nonnegative symmetric matrix. Thus,

√n A_n Θ_n^{-1/2} Φ_n = n^{-1/2} A_n Θ_n^{-1/2} ∇L_n(β^D) → N_d(A_n Θ_n^{1/2} ˜δ, G).  (A.5)

Now let A_n be the d × (d + s) matrix such that A_n^T A_n = I_{d+s} - Θ_n^{1/2} D (D^T Θ_n D)^{-1} D^T Θ_n^{1/2} and A_n A_n^T = I_d; it holds that √n A_n Θ_n^{-1/2} Φ_n → N_d(A_n Θ_n^{1/2} ˜δ, I_d). Finally,

T_n = 2{sup_Ω PQ_n(β|V) - sup_{Ω, β_1=0} PQ_n(β|V)} = 2{PQ_n(ˆβ^D|V) - PQ_n(˜β^D|V)}
    = (√n Θ_n^{-1/2} Φ_n)^T (I_{d+s} - Θ_n^{1/2} D (D^T Θ_n D)^{-1} D^T Θ_n^{1/2})(√n Θ_n^{-1/2} Φ_n) + o_p(1)
    = {√n A_n Θ_n^{-1/2} Φ_n}^T {√n A_n Θ_n^{-1/2} Φ_n} + o_p(1) → χ²_d(γ),

where γ = ˜δ^T Θ_n^{1/2} A_n^T A_n Θ_n^{1/2} ˜δ = δ^T C_{11.2} δ. Indeed, since D = (0_{s×d}, I_s)^T, the matrix Θ_n^{1/2} A_n^T A_n Θ_n^{1/2} = Θ_n - Θ_n D (D^T Θ_n D)^{-1} D^T Θ_n has upper-left d × d block C_{11} - C_{12} C_{22}^{-1} C_{21} = C_{11.2}, as defined in Theorem 2.