Large-Scale Model Selection with Misspecification
Emre Demirkaya, Yang Feng, Pallavi Basu and Jinchi Lv

University of Southern California, Columbia University and Tel Aviv University

March 16, 2018
Abstract
Model selection is crucial to high-dimensional learning and inference for contemporary big data applications in pinpointing the best set of covariates among a sequence of candidate interpretable models. Most existing work assumes implicitly that the models are correctly specified or have fixed dimensionality. Yet both features of model misspecification and high dimensionality are prevalent in practice. In this paper, we exploit the framework of model selection principles in misspecified models originated in Lv and Liu (2014) and investigate the asymptotic expansion of the Bayesian principle of model selection in the setting of high-dimensional misspecified models. With a natural choice of prior probabilities that encourages interpretability and incorporates the Kullback-Leibler divergence, we suggest the high-dimensional generalized Bayesian information criterion with prior probability (HGBIC_p) for large-scale model selection with misspecification. Our new information criterion characterizes the impacts of both model misspecification and high dimensionality on model selection. We further establish the consistency of covariance contrast matrix estimation and the model selection consistency of HGBIC_p in ultra-high dimensions under some mild regularity conditions. The advantages of our new method are supported by numerical studies.

Running title: Large-Scale Model Selection with Misspecification
Key words: Model misspecification; High dimensionality; Big data; Model selection; Bayesian principle; Kullback-Leibler divergence; GBIC_p; GIC; Robustness

∗ Emre Demirkaya is Ph.D. Candidate, Department of Mathematics, University of Southern California, Los Angeles, CA 90089 (E-mail: [email protected]). Yang Feng is Associate Professor, Department of Statistics, Columbia University, New York, NY 10027 (E-mail: [email protected]). Pallavi Basu is Postdoctoral Scholar, Department of Statistics and Operations Research, Tel Aviv University, Tel Aviv, Israel 69978 (E-mail: [email protected]). Jinchi Lv is McAlister Associate Professor in Business Administration, Data Sciences and Operations Department, Marshall School of Business, University of Southern California, Los Angeles, CA 90089 (E-mail: [email protected]). Emre Demirkaya and Yang Feng contributed equally to this work. This work was supported by NSF CAREER Award DMS-1554804, a grant from the Simons Foundation, and Adobe Data Science Research Award.

1 Introduction
With rapid advances of modern technology, big data of unprecedented size, such as genetic and proteomic data, fMRI and functional data, and panel data in economics and finance, are frequently encountered in many contemporary applications. In these applications, the dimensionality p can be comparable to or even much larger than the sample size n. A key assumption that often makes large-scale learning and inference feasible is the sparsity of signals, meaning that only a small fraction of covariates contribute to the response when p is large compared to n. High-dimensional modeling with dimensionality reduction and feature selection plays an important role in these problems [15, 5, 16]. A sparse modeling procedure typically produces a sequence of candidate interpretable models, each involving a possibly different subset of covariates. An important question is how to compare different models in high dimensions when models are possibly misspecified.

The problem of model selection has been studied extensively by many researchers in the past several decades. Among others, well-known model selection criteria include the Akaike information criterion (AIC) [1, 2] and the Bayesian information criterion (BIC) [32], where the former is based on the Kullback-Leibler (KL) divergence principle of model selection and the latter originates from the Bayesian principle of model selection. A great deal of work has been devoted to understanding and extending these model selection criteria to different model settings; see, for example, [4, 20, 25, 24, 8, 9, 27, 30, 11, 23]. The connections between the AIC and cross-validation have been investigated in [34, 21, 31] for various contexts. In particular, [19] showed that classical information criteria such as AIC and BIC can no longer be consistent for model selection in ultra-high dimensions and proposed the generalized information criterion (GIC) for tuning parameter selection in high-dimensional penalized likelihood, for the scenario of correctly specified models. See also [3, 6, 7, 33, 18, 17] for some recent work on high-dimensional inference for feature selection.

Most existing work on model selection and feature selection usually makes an implicit assumption that the model under study is correctly specified or of fixed dimensions. Given the practical importance of model misspecification, [36] laid out a general theory of maximum likelihood estimation in misspecified models for the case of fixed dimensionality and independent and identically distributed (i.i.d.) observations. [10] also studied maximum likelihood estimation of a multi-dimensional log-concave density when the model is misspecified. Recently, [28] investigated the problem of model selection with model misspecification and originated asymptotic expansions of both KL divergence and Bayesian principles in misspecified generalized linear models, leading to the generalized AIC (GAIC) and generalized BIC (GBIC), for the case of fixed dimensionality. A specific form of prior probabilities motivated by the KL divergence principle led to the generalized BIC with prior probability (GBIC_p). Yet both features of model misspecification and high dimensionality are prevalent in contemporary big data applications. Thus an important question is how to characterize the impacts of both model misspecification and high dimensionality on model selection. We intend to provide some answer to this question in this paper.

Let us first gain some insights into the challenges of the aforementioned problem by considering a motivating example.
Assume that the response Y depends on the covariate vector (X_1, ..., X_p)^T through the functional form

Y = f(X_1) + f(X_2 − X_3) + f(X_4 − X_5) + ε,   (1)

where f(x) = x/(x + 1) and the remaining setting is the same as in Section 4.1. Consider sample size n = 200 and vary the dimensionality p from 100 to 3200. Without any prior knowledge of the true model structure, we take the linear regression model

y = Zβ + ε   (2)

as the working model and apply some information criteria to hopefully recover the oracle working model consisting of the first five covariates, where y is an n-dimensional response vector, Z is an n × p design matrix, β = (β_1, ..., β_p)^T is a p-dimensional regression coefficient vector, and ε is an n-dimensional error vector. When p = 100, the traditional AIC and BIC, which ignore model misspecification, tend to select a model with size larger than five. As expected, GBIC_p in [28] works well by selecting the oracle working model over 90% of the time. However, when p is increased to 3200, these methods fail to select such a model with significant probability and the prediction performance of the selected models deteriorates. This motivates us to study the problem of model selection in high-dimensional misspecified models. In contrast, our new method can recover the oracle working model with significant probability in this challenging scenario.

The main contributions of our paper are threefold. First, we provide the asymptotic expansion of the Bayesian principle of model selection in high-dimensional misspecified generalized linear models, which involves delicate and challenging technical analysis. Motivated by the asymptotic expansion and a natural choice of prior probabilities that encourages interpretability and incorporates the Kullback-Leibler divergence, we suggest the high-dimensional generalized BIC with prior probability (HGBIC_p) for large-scale model selection with misspecification. Second, our work provides rigorous theoretical justification of the covariance contrast matrix estimator that incorporates the effect of model misspecification and is crucial for practical implementation. Such an estimator is shown to be consistent in the general setting of high-dimensional misspecified models. Third, we establish the model selection consistency of our new information criterion HGBIC_p in ultra-high dimensions under some mild regularity conditions. In particular, our work provides important extensions to the studies in [28] and [19] to the cases of high dimensionality and model misspecification, respectively. The aforementioned contributions make our work distinct from other studies on model misspecification including [6, 23, 33].

The rest of the paper is organized as follows. Section 2 introduces the setup for model misspecification and presents the new information criterion HGBIC_p based on the Bayesian principle of model selection. We establish a systematic asymptotic theory for the new method in Section 3. Section 4 presents several numerical examples to demonstrate the advantages of our newly suggested method for large-scale model selection with misspecification. We provide some discussions of our results and possible extensions in Section 5. The proofs of main results are relegated to the Appendix. Additional technical details are provided in the Supplementary Material.

2 Large-scale model selection with misspecification

The main focus of this paper is investigating ultra-high dimensional model selection with model misspecification in which the dimensionality p can grow nonpolynomially with the sample size n.
2.1 Model misspecification

We denote by M an arbitrary subset with size d of all p available covariates and X = (x_1, ..., x_n)^T the corresponding n × d fixed design matrix given by the covariates in model M. Assume that conditional on the covariates in model M, the response vector Y = (Y_1, ..., Y_n)^T has independent components and each Y_i follows distribution G_{n,i} with density g_{n,i}, with all the distributions G_{n,i} unknown to us in practice. Denote by g_n = ∏_{i=1}^n g_{n,i} the product density and G_n the corresponding true distribution of the response vector Y.

Since the collection of true distributions {G_{n,i}}_{1≤i≤n} is unknown to practitioners, one often chooses a family of working models to fit the data. One class of popular working models is the family of generalized linear models (GLMs) [29] with a canonical link and natural parameter vector θ = (θ_1, ..., θ_n)^T with θ_i = x_i^T β, where x_i is a d-dimensional covariate vector and β = (β_1, ..., β_d)^T is a d-dimensional regression coefficient vector. Let τ > 0 denote the dispersion parameter. The conditional density of y_i given the covariates in model M is assumed to take the form

f_{n,i}(y_i) = exp{y_i θ_i − b(θ_i) + c(y_i, τ)},   (3)

where b(·) and c(·, ·) are some known functions with b(·) twice differentiable and b''(·) bounded away from 0 and ∞. F_n denotes the corresponding distribution of the n-dimensional response vector y = (y_1, ..., y_n)^T with the product density f_n = ∏_{i=1}^n f_{n,i}, assuming independence of the components. Since the GLM is chosen by the user, the working distribution F_n can be generally different from the true unknown distribution G_n.

For the GLM in (3) with natural parameter vector θ, let us define two vector-valued functions b(θ) = (b(θ_1), ..., b(θ_n))^T and µ(θ) = (b'(θ_1), ..., b'(θ_n))^T, and a matrix-valued function Σ(θ) = diag{b''(θ_1), ..., b''(θ_n)}. The basic properties of the GLM give the mean vector E y = µ(θ) and the covariance matrix cov(y) = Σ(θ) with θ = Xβ. The product density of the response vector y can be written as

f_n(y; β, τ) = ∏_{i=1}^n f_{n,i}(y_i) = exp{y^T Xβ − 1^T b(Xβ) + ∑_{i=1}^n c(y_i, τ)},   (4)

where 1 represents the n-dimensional vector with all components being one. Since the GLM is only our working model, (4) results in the quasi-log-likelihood function [36]

ℓ_n(y; β, τ) = log f_n(y; β, τ) = y^T Xβ − 1^T b(Xβ) + ∑_{i=1}^n c(y_i, τ).   (5)

Hereafter we treat the dispersion parameter τ as a known parameter and focus on our main parameter of interest β. Whenever there is no confusion, we will slightly abuse the notation and drop the functional dependence on τ.

The quasi-maximum likelihood estimator (QMLE) for the parameter vector β in our working model of GLM (3) is defined as

β̂_n = arg max_{β ∈ R^d} ℓ_n(y, β),   (6)

which is the solution to the score equation

Ψ_n(β) = ∂ℓ_n(y, β)/∂β = X^T[y − µ(Xβ)] = 0.   (7)

For the linear regression model with µ(Xβ) = Xβ, such a score equation becomes the familiar normal equation X^T y = X^T Xβ. Note that the KL divergence [26] of our working model F_n from the true model G_n is defined as I(g_n; f_n(·, β)) = E log g_n(Y) − E ℓ_n(Y, β), with the response vector Y following the true distribution G_n. As in [28], we consider the best working model that is closest to the true model under the KL divergence. Such a model has parameter vector β_{n,0} = arg min_{β ∈ R^d} I(g_n; f_n(·, β)), which solves the equation

X^T[E Y − µ(Xβ)] = 0.   (8)
We see that equation (8) is simply the population version of the score equation given in (7).

Following [28], we introduce two matrices that play a key role in model selection with model misspecification. Note that under the true distribution G_n, we have cov(X^T Y) = X^T cov(Y) X. Computing the score equation at β_{n,0}, we define the matrix B_n as

B_n = cov[Ψ_n(β_{n,0})] = cov(X^T Y) = X^T cov(Y) X   (9)

with cov(Y) = diag{var(Y_1), ..., var(Y_n)} by the independence assumption and under the true model. Note that under the working model F_n, it holds that cov(X^T y) = X^T Σ(Xβ) X. We then define the matrix A_n(β) as

A_n(β) = ∂²I(g_n; f_n(·, β))/∂β² = −E{∂²ℓ_n(Y, β)/∂β²} = X^T Σ(Xβ) X,   (10)

and denote by A_n = A_n(β_{n,0}). Hence we see that matrices A_n and B_n are the covariance matrices of X^T Y under the best working model F_n(β_{n,0}) and the true model G_n, respectively. To account for the effect of model misspecification, we define the covariance contrast matrix H_n = A_n^{-1} B_n as revealed in [28]. Observe that A_n and B_n coincide when the best working model and the true model are the same. In this case, H_n is an identity matrix of size d.

2.2 Bayesian principle of model selection

Given a set of competing models {M_m : m = 1, ..., M}, a popular model selection procedure using the Bayesian principle of model selection is to first put nonzero prior probability α_{M_m} on each model M_m, and then choose a prior distribution µ_{M_m} for the parameter vector in the corresponding model. Assume that the density function of µ_{M_m} is bounded in R^{M_m} = R^{d_m} with d_m = |M_m| and locally bounded away from zero throughout the domain. The Bayesian principle of model selection is to choose the most probable model a posteriori; that is, choose the model M_{m_0} such that

m_0 = arg max_{m ∈ {1, ..., M}} S(y, M_m; F_n),   (11)

where the log-marginal-likelihood is

S(y, M_m; F_n) = log ∫ α_{M_m} exp[ℓ_n(y, β)] dµ_{M_m}(β)   (12)

with the log-likelihood ℓ_n(y, β) as defined in (5) and the integral over R^{d_m}.

The choice of prior probabilities α_{M_m} is important in high dimensions. [28] suggested the use of prior probability α_{M_m} ∝ e^{−D_m} for each candidate model M_m, where the quantity D_m is defined as

D_m = E[I(g_n; f_n(·, β̂_{n,m})) − I(g_n; f_n(·, β_{n,m,0}))]   (13)

with the subscript m indicating a particular candidate model. The motivation is that the further the QMLE β̂_{n,m} is away from the best misspecified GLM F_n(·, β_{n,m,0}), the lower prior probability we assign to that model. In the high-dimensional setting where the dimensionality p can be much larger than the sample size n, it is sensible to also take into account the complexity of the space of all possible sparse models with the same size as M_m. Such an observation motivates us to consider a new prior probability of the form

α_{M_m} ∝ p^{−d} e^{−D_m}   (14)

with d = |M_m|. The complexity factor p^{−d} is motivated by the asymptotic expansion of [C(p, d) d!]^{−1}, where C(p, d) denotes the binomial coefficient "p choose d". In fact, an application of Stirling's formula yields

log [C(p, d) d!]^{−1} ≈ −d log p = log(p^{−d})

up to an additive term of order o(d) when d = o(p).
The factor C(p, d)^{−1} was also exploited in [8], who showed that using the term C(p, d)^{−γ} with some constant 0 < γ ≤ 1, the extended BIC can be model selection consistent for the scenario of correctly specified models with p = O(n^κ) for some positive constant κ satisfying 1 − (2κ)^{−1} < γ. Moreover, we add the term d! to reflect a stronger prior on model sparsity. See also [19] for the characterization of model selection in ultra-high dimensions with correctly specified models.

The asymptotic expansion of the Bayes factor for the Bayesian principle of model selection in Theorem 1, to be presented in Section 3.2, motivates us to introduce the high-dimensional generalized BIC with prior probability (HGBIC_p) as follows for large-scale model selection with misspecification.

Definition 1.
We define
HGBIC_p = HGBIC_p(y, M; F_n) of model M as

HGBIC_p = −2 ℓ_n(y, β̂_n) + 2(log p*)|M| + tr(Ĥ_n) − log|Ĥ_n|,   (15)

where Ĥ_n is a consistent estimator of H_n and p* = p n^{1/2}.

In correctly specified models, the term tr(Ĥ_n) − log|Ĥ_n| in (15) is asymptotically close to |M| when Ĥ_n is a consistent estimator of H_n = A_n^{-1} B_n = I_d. Thus compared to BIC with the factor log n, the HGBIC_p contains a larger factor of order log p when the dimensionality p grows nonpolynomially with the sample size n. This leads to a heavier penalty on model complexity, similarly as in [19]. As shown in [28] for GBIC_p, the HGBIC_p defined in (15) can also be viewed as a sum of three terms: the goodness of fit, model complexity, and model misspecification; see [28] for more details. Our new information criterion HGBIC_p provides an important extension of the original GBIC_p in [28], which was proposed for the scenario of model misspecification with fixed dimensionality, by explicitly taking into account the high dimensionality of the whole feature space.
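To make the computation in Definition 1 concrete, the following minimal sketch (ours, not part of the paper) evaluates (15) for a single working model from its maximized quasi-log-likelihood, its size, the ambient dimensionality p, the sample size n, and a consistent estimate of the covariance contrast matrix; the function name and interface are illustrative assumptions, and log p* follows the form p* = p n^{1/2} stated above.

```python
import numpy as np

def hgbic_p(loglik_hat, model_size, p, n, H_hat):
    """HGBIC_p = -2*loglik + 2*log(p*)*|M| + tr(H_hat) - log det(H_hat), with p* = p*sqrt(n).

    loglik_hat : maximized quasi-log-likelihood ell_n(y, beta_hat) of the working model M
    model_size : |M|, number of covariates in the working model
    p, n       : total number of available covariates and the sample size
    H_hat      : consistent estimate of the covariance contrast matrix H_n
    """
    log_p_star = np.log(p) + 0.5 * np.log(n)      # log p* with p* = p * n^{1/2}
    sign, logdet = np.linalg.slogdet(H_hat)       # numerically stable log-determinant
    return -2.0 * loglik_hat + 2.0 * log_p_star * model_size + np.trace(H_hat) - logdet
```

Among a sequence of candidate models, one then selects the model attaining the smallest HGBIC_p value.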
3 Asymptotic properties of HGBIC_p

3.1 Technical conditions

We list the technical conditions required to prove the main results and the asymptotic properties of the QMLE with diverging dimensionality. Denote by Z the full design matrix of size n × p whose (i, j)th entry is x_{ij}. For any subset M of {1, ..., p}, Z_M denotes the submatrix of Z formed by the columns whose indices are in M. When there is no confusion, we drop the subscript and use X = Z_M for fixed M. For theoretical reasons, we restrict the parameter space to B, which is a sufficiently large convex and compact set of R^p. We consider parameters with bounded support. Namely, we define B(M) = {β ∈ B : supp(β) = M} and the union ∪_{|M| ≤ K} B(M), where the maximum support size K is taken to be o(n). Moreover, we assume that c ≤ b''(Zβ) ≤ c^{-1} componentwise for any β ∈ B, where c is some positive constant.

We use the following notation. For matrices, ‖·‖_2, ‖·‖_∞, and ‖·‖_F denote the matrix operator norm, entrywise maximum norm, and matrix Frobenius norm, respectively. For vectors, ‖·‖_2 and ‖·‖_∞ denote the vector L_2-norm and maximum norm, and (v)_i represents the ith component of vector v. Denote by λ_min(·) and λ_max(·) the smallest and largest eigenvalues of a given matrix, respectively.

Condition 1. There exists some positive constant c such that for each 1 ≤ i ≤ n,

P(|W_i| > t) ≤ c exp(−c^{-1} t)   (16)

for any t > 0, where W = (W_1, ..., W_n)^T = Y − E Y. The variances of Y_i are bounded below uniformly in i and n.

Condition 2.
Let u_1 and u_2 be some positive constants and m_n = O(n^{u_1}) a diverging sequence. We have the following bounds:
(i) max{‖E Y‖_∞, sup_{β ∈ B} ‖µ(Zβ)‖_∞} ≤ m_n;
(ii) ∑_{i=1}^n [E Y_i − (µ(Xβ_{n,0}))_i]² / var(Y_i) = O(n^{u_2}).
For simplicity, we also assume that m_n diverges faster than log n.

Condition 3.
Let K = o(n) be a positive integer. There exist positive constants c and u_3 such that
(i) for any M ⊂ {1, ..., p} such that |M| ≤ K, c ≤ λ_min(n^{-1} Z_M^T Z_M) ≤ λ_max(n^{-1} Z_M^T Z_M) ≤ c^{-1};
(ii) ‖Z‖_∞ = O(n^{u_3});
(iii) for simplicity, we assume that the columns of Z are normalized: ∑_{i=1}^n x_{ij}² = n for all j = 1, ..., p.

Condition 1 is a standard tail condition on the response variable Y. This condition ensures that the sub-exponential norm of the response is bounded. Conditions 2 and 3 have their counterparts in [19]. However, Condition 2 is modified to deal with model misspecification. More specifically, the means of the true distribution and the fitted model, as well as their relations, are assumed in Condition 2. The first part simultaneously controls the tail behavior of the response and the fitted model. The second part ensures that the mean of the fitted distribution does not deviate from the true mean too significantly. Condition 3 is on the design matrix X. The first part is important for the consistency of the QMLE β̂_n and the uniqueness of the population parameter. Conditions 2 and 3 also provide bounds for the eigenvalues of A_n(β) and B_n. See [19] for further discussions on these assumptions.

For the following conditions, we define a neighborhood around β_{n,0}. Let δ_n = m_n √(log p) = O(n^{u_1} √(log p)). We define the neighborhood N_n(δ_n) = {β ∈ R^d : ‖(n^{-1} B_n)^{1/2}(β − β_{n,0})‖_2 ≤ (n/d)^{-1/2} δ_n}. We assume that (n/d)^{-1/2} δ_n converges to zero so that N_n(δ_n) is an asymptotically shrinking neighborhood of β_{n,0}.
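As an informal numerical illustration (ours, not part of the paper's theory), the eigenvalue bounds in Condition 3(i) can be probed empirically by computing the extreme eigenvalues of n^{-1} Z_M^T Z_M over randomly drawn subsets M of size at most K; the sketch below assumes a NumPy array Z holding the full design matrix and is only a diagnostic under these assumptions.

```python
import numpy as np

def check_condition3(Z, K, n_draws=200, rng=None):
    """Empirical range of extreme eigenvalues of n^{-1} Z_M^T Z_M over random subsets |M| <= K."""
    rng = np.random.default_rng(rng)
    n, p = Z.shape
    lam_min, lam_max = np.inf, 0.0
    for _ in range(n_draws):
        M = rng.choice(p, size=rng.integers(1, K + 1), replace=False)
        G = Z[:, M].T @ Z[:, M] / n               # n^{-1} Z_M^T Z_M for the drawn subset
        eig = np.linalg.eigvalsh(G)               # eigenvalues in ascending order
        lam_min, lam_max = min(lam_min, eig[0]), max(lam_max, eig[-1])
    return lam_min, lam_max                       # both should stay within [c, 1/c]
```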
Condition 4. Assume that the prior density relative to the Lebesgue measure on R^d, π(h(β)) = (dµ_M / dµ_Leb)(h(β)), satisfies

inf_{β ∈ N_n(2δ_n)} π(h(β)) ≥ c_1 and sup_{β ∈ R^d} π(h(β)) ≤ c_1^{-1},   (17)

where c_1 is a positive constant and h(β) = (n^{-1} B_n)^{1/2} β.

Condition 5.
Let V_n(β) = B_n^{-1/2} A_n(β) B_n^{-1/2}, V_n = V_n(β_{n,0}) = B_n^{-1/2} A_n B_n^{-1/2}, and Ṽ_n(β_1, ..., β_d) = B_n^{-1/2} Ã_n(β_1, ..., β_d) B_n^{-1/2}, where Ã_n(β_1, ..., β_d) is the matrix whose jth row is the corresponding row of A_n(β_j) for each j = 1, ..., d. There exists some sequence ρ_n(δ_n) such that ρ_n(δ_n) δ_n² d converges to zero, and
(i) max_{β ∈ N_n(2δ_n)} max{|λ_min(V_n(β) − V_n)|, |λ_max(V_n(β) − V_n)|} ≤ ρ_n(δ_n);
(ii) max_{β_1, ..., β_d ∈ N_n(δ_n)} ‖Ṽ_n(β_1, ..., β_d) − V_n‖_2 ≤ ρ_n(δ_n).

Similar versions of Conditions 4 and 5 were imposed in [28]. Under Condition 4, the prior density is bounded above globally and bounded below in a neighborhood of β_{n,0}. This condition is used in Theorem 1 for the asymptotic expansion of the Bayes factor. Condition 5 is on the continuity of the matrix-valued functions V_n and Ṽ_n in a shrinking neighborhood N_n(2δ_n) of β_{n,0}. The first and second parts control the expansions of the expected log-likelihood and score functions, respectively. Condition 5 ensures that the remainders are negligible in approximating the log-marginal-likelihood S(y, M_m; F_n). See [28] for more discussions.

3.2 Asymptotic expansion of the Bayesian principle of model selection

We now provide the asymptotic expansion of the Bayesian principle of model selection with the prior introduced in Section 2.2. As mentioned earlier, the Bayesian principle chooses the model that maximizes the log-marginal-likelihood given in (12). To ease the presentation, for any β ∈ R^d, we define the quantity

ℓ*_n(y, β) = ℓ_n(y, β) − ℓ_n(y, β̂_n),   (18)

which is the deviation of the quasi-log-likelihood from its maximum. Then from (12) and (18), we have

S(y, M_m; F_n) = ℓ_n(y, β̂_n) + log E_{µ_{M_m}}[U_n(β)^n] + log α_{M_m},   (19)

where U_n(β) = exp[n^{-1} ℓ*_n(y, β)]. With the choice of the prior probability in (14), it is clear that

log α_{M_m} = −D_m − d log p.   (20)

Guided by (19) and (20), some delicate technical analysis unveils the following expansion of the log-marginal-likelihood.

Theorem 1.
Let α_{M_m} = C p^{−d} e^{−D_m} with C > 0 some normalization constant, and assume that Conditions 1, 2(i), 3(i), 3(iii), 4, and 5 hold. If (n/d)^{-1/2} δ_n = o(1), then with probability tending to one we have

S(y, M; F_n) = ℓ_n(y, β̂_n) − (log p*)|M| − (1/2) tr(H_n) + (1/2) log|H_n| + log(C c_n) + o(µ_n),   (21)
where H_n = A_n^{-1} B_n, p* = p n^{1/2}, µ_n = tr(A_n^{-1} B_n) ∨ 1, and c_n ∈ [c_1, c_1^{-1}].

Theorem 1 lays the foundation for investigating high-dimensional model selection with model misspecification. Based on the asymptotic expansion in (21), our new information criterion HGBIC_p in (15) is defined by replacing the covariance contrast matrix H_n with a consistent estimator Ĥ_n. The HGBIC_p naturally characterizes the impacts of both model misspecification and high dimensionality on model selection. A natural question is how to ensure a consistent estimator for H_n. We address such a question in the next section.

3.3 Consistency of the covariance contrast matrix estimator

For practical implementation of HGBIC_p, it is of vital importance to provide a consistent estimator for the covariance contrast matrix H_n. To this end, we consider the plug-in estimator Ĥ_n = Â_n^{-1} B̂_n with Â_n and B̂_n defined as follows. Since the QMLE β̂_n provides a consistent estimator of β_{n,0} in the best misspecified GLM F_n(·, β_{n,0}), a natural estimate of matrix A_n is given by

Â_n = A_n(β̂_n) = X^T Σ(X β̂_n) X.   (22)

When the model is correctly specified, the following simple estimator

B̂_n = X^T diag{[y − µ(X β̂_n)] ∘ [y − µ(X β̂_n)]} X   (23)

with ∘ denoting the componentwise product gives an asymptotically unbiased estimator of matrix B_n.
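For concreteness, a minimal sketch of the plug-in estimator is given below for a canonical-link GLM working model; it assumes the user supplies the mean function b'(·) and the variance function b''(·) (e.g., the logistic expit and its derivative) together with the fitted QMLE, and the helper names are ours rather than the paper's.

```python
import numpy as np

def contrast_matrix_estimate(X, y, beta_hat, mu, b2):
    """Plug-in estimates A_hat (22), B_hat (23), and H_hat = A_hat^{-1} B_hat.

    X        : n x d design matrix of the working model M
    y        : n-dimensional response vector
    beta_hat : QMLE of the working-model coefficients
    mu       : mean function b'(.) applied componentwise
    b2       : variance function b''(.) applied componentwise
    """
    theta = X @ beta_hat
    A_hat = X.T @ (b2(theta)[:, None] * X)        # X^T Sigma(X beta_hat) X
    resid = y - mu(theta)                         # raw residuals under the working model
    B_hat = X.T @ ((resid ** 2)[:, None] * X)     # X^T diag(resid o resid) X
    H_hat = np.linalg.solve(A_hat, B_hat)         # A_hat^{-1} B_hat without explicit inversion
    return A_hat, B_hat, H_hat
```

For the Gaussian linear working model with unit dispersion, mu is the identity and b2 is identically one; for the logistic working model, mu(θ) = 1/(1 + e^{−θ}) and b2(θ) = mu(θ)(1 − mu(θ)).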
Theorem 2. Assume that Conditions 1–3 hold, n^{-1} A_n(β) is Lipschitz in operator norm in the neighborhood N_n(δ_n), d = O(n^{κ_0}), and log p = O(n^{κ}) with the constants satisfying 0 < κ_0 < 1/2, 0 < u_1 < 1/2 − κ_0, 0 < u_2 < 1 − κ_0 − u_1, 0 < u_3 < 1/2 − κ_0 − u_1, and 0 < κ < 1 − κ_0 − u_1 − u_2. Then the plug-in estimator Ĥ_n = Â_n^{-1} B̂_n satisfies tr(Ĥ_n) = tr(H_n) + o_P(1) and log|Ĥ_n| = log|H_n| + o_P(1).

Theorem 2 improves the result in [28] in two important aspects. First, the consistency of the covariance contrast matrix estimator was justified in [28] only for the scenario of correctly specified models. Our new result shows that the simple plug-in estimator Ĥ_n still enjoys consistency in the general setting of model misspecification. Second, the result in Theorem 2 holds for the case of high dimensionality. These theoretical guarantees are crucial to the practical implementation of the new information criterion HGBIC_p. Our numerical studies in Section 4 reveal that such an estimate works well in a variety of model misspecification settings.

3.4 Model selection consistency of HGBIC_p

We further investigate the model selection consistency property of the information criterion HGBIC_p. Assume that there are M = o(n^δ) sparse candidate models M_1, ..., M_M, where δ is some sufficiently large positive constant. For each candidate model M_m, we have the HGBIC_p criterion as defined in (15),

HGBIC_p(M_m) = −2 ℓ_n(y, β̂_{n,m}) + 2(log p*)|M_m| + tr(Ĥ_{n,m}) − log|Ĥ_{n,m}|,   (24)

where Ĥ_{n,m} is a consistent estimator of H_{n,m} and p* = p n^{1/2}. Assume that there exists an oracle working model in the sequence {M_m : m = 1, ..., M} that has support identical to the set of all important features in the true model. Without loss of generality, suppose that M_1 is such an oracle working model.

Theorem 3.
Assume that all the conditions of Theorems 1–2 hold and the population version of
the HGBIC_p criterion in (24) is minimized at M_1 such that for some ∆ > 0,

min_{m > 1} {HGBIC*_p(M_m) − HGBIC*_p(M_1)} > ∆   (25)

with HGBIC*_p(M_m) = −2 ℓ_n(y, β_{n,m,0}) + 2(log p*)|M_m| + tr(H_{n,m}) − log|H_{n,m}|. Then it holds that min_{m > 1} {HGBIC_p(M_m) − HGBIC_p(M_1)} > ∆/2 with asymptotic probability one.

Theorem 3 formally establishes the model selection consistency property of the new information criterion HGBIC_p for large-scale model selection with misspecification, in that the oracle working model can be selected among a large sequence of candidate sparse models with significant probability. Such a desired property is an important consequence of the results in Theorems 1 and 2.

4 Numerical studies

We now investigate the finite-sample performance of the information criterion HGBIC_p in comparison to the information criteria AIC, BIC, GAIC, GBIC, and GBIC_p in high-dimensional misspecified models via simulation examples.

4.1 Multiple index model

The first model we consider is the following multiple index model

Y = f(β_1 X_1) + f(β_2 X_2 + β_3 X_3) + f(β_4 X_4 + β_5 X_5) + ε,   (27)

where the response depends on the covariates X_j's only through the first five ones in a nonlinear fashion and f(x) = x/(x + 1). Here the rows of the n × p design matrix Z are sampled as i.i.d. copies from N(0, I_p), and the n-dimensional error vector ε ∼ N(0, σ² I_n). We set the true parameter vector β_0 to have exactly five nonzero components (the first equal to 1, with the second and fifth negative) and σ = 0.8.
We vary the dimensionality p from 100 to 3200 while keeping the sample size n fixed at 200. We would like to investigate the behavior of different information criteria when the dimensionality increases. Although the data was generated from model (27), we fit the linear regression model (2). This is a typical example of model misspecification. Note that since the first five variables are independent of the other variables, the oracle working model is M_0 = supp(β_0) = {1, ..., 5}. Due to the high dimensionality, it is computationally prohibitive to implement the best subset selection. Thus we first applied Lasso followed by least-squares refitting to build a sequence of sparse models and then selected the final model using a model selection criterion. In practice, one can apply any preferred variable selection procedure to obtain a sequence of candidate interpretable models.
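For concreteness, the following minimal sketch (ours) illustrates one way to generate data from the multiple index model (27) and to build the candidate model sequence from a Lasso path on the misspecified linear working model; the nonzero coefficient values passed as beta_nz are placeholders rather than the paper's exact specification, and scikit-learn's lasso_path is just one convenient implementation of the path.

```python
import numpy as np
from sklearn.linear_model import lasso_path

def f(x):
    return x / (x + 1.0)                 # nonlinear transform from models (1) and (27)

def simulate_multiple_index(n=200, p=800, sigma=0.8,
                            beta_nz=(1.0, -1.0, 1.0, 1.0, -1.0), rng=None):
    """Generate data from the multiple index model (27) with N(0, I_p) covariates."""
    rng = np.random.default_rng(rng)
    Z = rng.standard_normal((n, p))
    b1, b2, b3, b4, b5 = beta_nz          # placeholder coefficient values
    y = (f(b1 * Z[:, 0]) + f(b2 * Z[:, 1] + b3 * Z[:, 2])
         + f(b4 * Z[:, 3] + b5 * Z[:, 4]) + sigma * rng.standard_normal(n))
    return Z, y

def candidate_models(Z, y, max_size=15):
    """Lasso path on the misspecified linear working model; one candidate support per penalty."""
    _, coefs, _ = lasso_path(Z, y, n_alphas=100)
    supports = []
    for j in range(coefs.shape[1]):
        M = tuple(np.flatnonzero(coefs[:, j]))
        if 0 < len(M) <= max_size and M not in supports:
            supports.append(M)
    return supports   # each support is refitted by least squares and scored by a criterion
```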
We report the consistent selection probability (the proportion of simulations where the selected model M̂ = M_0), the sure screening probability [14, 13] (the proportion of simulations where the selected model M̂ ⊃ M_0), and the prediction error E(Y − z^T β̂)² with β̂ an estimate and (z, Y) an independent observation for z = (X_1, ..., X_p)^T. To evaluate the prediction performance of different criteria, we calculated the average prediction error on an independent test sample of size 10,000. The results for prediction error and model selection performance are summarized in Table 1. In addition, we calculate the average number of false positives for each method in Table 2.

Table 1: Average results over 100 repetitions for Example 4.1 with all entries multiplied by 100.

Consistent selection probability with sure screening probability in parentheses
   p     AIC      BIC       GAIC     GBIC      GBIC_p    HGBIC_p    Oracle
  100   1(100)   71(100)    5(100)   72(100)   92(100)   100(100)   100(100)
  200   0(100)   43(100)    4(100)   44(100)   79(100)   100(100)   100(100)
  400   0(100)   27(100)    1(100)   31(100)   67(100)   100(100)   100(100)
  800   0(100)   16(100)    1(100)   21(100)   57(100)   100(100)   100(100)
 1600   0(100)   13(100)    2(100)   18(100)   49(100)   100(100)   100(100)
 3200   0(100)    5(100)    3(100)    8(100)   45(100)   100(100)   100(100)

Mean prediction error with standard error in parentheses
  100    97(1)    84(1)     89(1)    84(1)     82(1)     82(1)      82(1)
  200   103(1)    84(1)     89(1)    84(1)     81(1)     80(1)      80(1)
  400   112(2)    88(1)     94(1)    87(1)     84(1)     82(1)      82(1)
  800   120(1)    93(1)     97(1)    92(1)     86(1)     83(1)      83(1)
 1600   121(1)    94(1)     96(1)    93(1)     87(1)     82(1)      82(1)
 3200   117(1)    93(1)     93(1)    91(1)     84(1)     80(1)      80(1)

From Table 1, we observe that as the dimensionality p increases, the consistent selection probability tends to decrease for all criteria except the newly suggested HGBIC_p, which maintains a perfect 100% consistent selection probability throughout all dimensions considered. Generally speaking, GAIC improved over AIC, and GBIC and GBIC_p performed better than BIC in terms of both prediction and variable selection. In particular, the model selected by our new information criterion HGBIC_p delivered the best performance with the smallest prediction error and highest consistent selection probability across all settings.

An interesting observation is the comparison between GBIC_p and HGBIC_p in terms of the model selection consistency property. While GBIC_p is comparable to HGBIC_p when the dimensionality is not large (e.g., p = 100 and 200), the difference between these two methods increases as the dimensionality increases. In the case when p = 3200, HGBIC_p has 100% success for consistent selection, while GBIC_p has a success rate of only 45%. This confirms the necessity of including the log p* factor with p* = p n^{1/2} in the model selection criterion to take into account the high dimensionality, which is in line with the results in [19] for the case of correctly specified models.

We further study a family of model selection criteria induced by the HGBIC_p and characterized as follows:

HGBIC_{p,ζ}(M_m) = −2 ℓ_n(y, β̂_{n,m}) + ζ [2(log p*)|M_m| + tr(Ĥ_{n,m}) − log|Ĥ_{n,m}|],   (28)

where ζ is a positive factor controlling the penalty level on both model misspecification and high dimensionality. Note that HGBIC_{p,ζ} with ζ = 1 reduces to our original HGBIC_p. Here we examine the impact of the factor ζ on the false discovery proportion (FDP) and the true positive rate (TPR) for the selected model M̂ compared to the oracle working model M_0. In Figure 1, we observe that as ζ increases, the average FDP drops sharply as it gets close to 1. In addition, we have the desired model selection consistency property (with FDP close to 0 and TPR close to 1) when ζ lies in a range of values beginning at 1, which supports the use of the wider class of HGBIC_{p,ζ} criteria.
Table 2: Average false positives over 100 repetitions for Example 4.1.
   p     AIC     BIC    GAIC    GBIC   GBIC_p   HGBIC_p
  100    8.55   0.51    3.49    0.48    0.08     0.00
  200   13.05   1.07    3.70    0.98    0.28     0.00
  400   17.68   1.65    4.66    1.49    0.40     0.00
  800   21.28   2.61    4.57    2.17    0.70     0.00
 1600   22.20   3.01    4.40    2.68    0.96     0.00
 3200   22.48   3.92    4.07    3.20    0.86     0.00

Figure 1: The average false discovery proportion (left panel) and the true positive rate (right panel) as the factor ζ varies for Example 4.1.

4.2 Logistic regression with interaction

Our second simulation example is the high-dimensional logistic regression with interaction. We simulated 100 data sets from the logistic regression model with interaction and an n-dimensional parameter vector

θ = Zβ + 2 x_{p+1} + 2 x_{p+2},   (29)

where Z = (x_1, ..., x_p) is an n × p design matrix, x_{p+1} and x_{p+2} are two interaction terms formed as componentwise products of two pairs of the original covariates, and the rest of the setting is the same as in the simulation example in Section 4.1. For each data set, the n-dimensional response vector y was sampled from the Bernoulli distribution with success probability vector [e^{θ_1}/(1 + e^{θ_1}), ..., e^{θ_n}/(1 + e^{θ_n})]^T with θ = (θ_1, ..., θ_n)^T given in (29). As in Section 4.1, we consider the case where all covariates are independent of each other. We chose a sparse true parameter vector β_0 whose nonzero entries alternate in sign and are confined to the leading covariates, and set the sample size n = 300. Although the data was generated from the logistic regression model with parameter vector (29), we fit the logistic regression model without the two interaction terms. This provides another example of misspecified models. As argued in Section 4.1, the oracle working model is supp(β_0) = {1, ..., 5}, which corresponds to the logistic regression model with the first five covariates. To build a sequence of sparse models, we applied Lasso followed by maximum-likelihood refitting based on the support of the estimated model.

Since the goal in logistic regression is usually classification, we replace the prediction error with the classification error rate. Tables 3 and 4 show similar conclusions to those in Section 4.1. Again HGBIC_p outperformed all other model selection criteria, with greater advantage as the dimensionality increases (e.g., for the larger values of p).

Table 3: Average results over 100 repetitions for Example 4.2 with all entries multiplied by 100.

Consistent selection probability with sure screening probability in parentheses
   p     AIC      BIC       GAIC     GBIC      GBIC_p   HGBIC_p   Oracle
  100   0(100)   30(100)    0(100)   32(100)   55(100)   99(99)   100(100)
  200   0(100)   27(100)    0(100)   29(100)   41(100)   95(97)   100(100)
  400   0(100)   12(100)    0(100)   16(100)   35(100)   95(95)   100(100)
  800   0(100)    2(99)     0(100)    4(99)    12(99)    94(95)   100(100)
 1600   0(100)    0(100)    0(100)    1(100)    5(100)   84(85)   100(100)
 3200   0(100)    0(100)    0(100)    1(100)    1(100)   79(84)   100(100)

Mean classification error (in percentage) with standard error in parentheses
  100   25.3(0.3)  16.1(0.2)  27.4(0.3)  16.1(0.2)  15.7(0.2)  15.2(0.2)  15.2(0.2)
  200   24.9(0.3)  17.2(0.3)  28.4(0.3)  16.9(0.3)  16.5(0.2)  15.5(0.2)  15.4(0.2)
  400   25.0(0.3)  19.7(0.4)  28.7(0.3)  17.8(0.3)  16.8(0.3)  15.3(0.2)  15.2(0.2)
  800   24.7(0.3)  21.9(0.4)  28.0(0.3)  18.8(0.3)  17.7(0.3)  15.7(0.2)  15.5(0.2)
 1600   26.0(0.4)  24.3(0.4)  28.4(0.3)  20.2(0.3)  18.7(0.3)  15.9(0.3)  15.4(0.2)
 3200   25.7(0.3)  24.4(0.4)  28.2(0.3)  20.7(0.3)  19.5(0.3)  15.9(0.2)  15.3(0.2)

We also examine the average FDP and TPR of the selected models as the factor ζ varies in Figure 2. From the figure, we observe that it is a more difficult setting than the multiple index model to reach model selection consistency. The proposed HGBIC_p criterion with the choice of ζ = 1 appears to strike a good balance between FDP and TPR.
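A minimal sketch of the data-generating mechanism for this example is given below; the interaction pairs and coefficient values are illustrative placeholders rather than the paper's exact choices, and the L1-penalized logistic fit shown is only one way to produce candidate supports for subsequent maximum-likelihood refitting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def simulate_logistic_interaction(n=300, p=400,
                                  beta_nz=(2.0, -2.0, 2.0, -2.0, 2.0), rng=None):
    """Generate Bernoulli responses from model (29): linear part plus two omitted interactions."""
    rng = np.random.default_rng(rng)
    Z = rng.standard_normal((n, p))
    theta = Z[:, :5] @ np.asarray(beta_nz)                        # placeholder main effects
    theta += 2.0 * Z[:, 0] * Z[:, 1] + 2.0 * Z[:, 3] * Z[:, 4]    # illustrative interaction pairs
    prob = 1.0 / (1.0 + np.exp(-theta))
    y = rng.binomial(1, prob)
    return Z, y

# The working model deliberately omits the interactions: an L1-penalized logistic fit on Z alone
# produces candidate supports, which are then refitted by maximum likelihood and compared by
# the information criteria.
Z, y = simulate_logistic_interaction()
lasso_logit = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(Z, y)
support = np.flatnonzero(lasso_logit.coef_.ravel())
```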
Table 4: Average false positives over 100 repetitions for Example 4.2.
   p     AIC     BIC     GAIC     GBIC   GBIC_p   HGBIC_p
  100   55.71   1.57    86.50    1.48    0.78     0.00
  200   40.83   3.24   119.82    2.14    1.33     0.02
  400   35.25  11.74   130.33    4.27    2.24     0.00
  800   31.78  18.22   140.20    6.00    3.45     0.01
 1600   30.25  22.65   142.81    8.02    4.80     0.01
 3200   28.41  22.31   142.75    8.61    6.15     0.05

Figure 2: The average false discovery proportion (left panel) and the true positive rate (right panel) as the factor ζ varies for Example 4.2.

5 Discussions

Despite the rich literature on model selection, the general case of model misspecification in high dimensions is less well studied. Our work has investigated the problem of model selection in high-dimensional misspecified models and characterized the impacts of both model misspecification and high dimensionality on model selection, providing an important extension of the work in [28] and [19]. The newly suggested information criterion HGBIC_p has been shown to perform well in high-dimensional settings. Moreover, we have established the consistency of the covariance contrast matrix estimator that captures the effect of model misspecification in the general setting, and the model selection consistency of HGBIC_p in ultra-high dimensions.

The log p* term in HGBIC_p with p* = p n^{1/2} is adaptive to high dimensions. In the setting of correctly specified models, [19] showed that a similar term is necessary for the model selection consistency of information criteria when the dimensionality p grows fast with the sample size n. It would be interesting to study the optimality property of the information criteria HGBIC_p and HGBIC_{p,ζ} defined in (28) under model misspecification, and to investigate these model selection principles in more general high-dimensional misspecified models such as additive models and survival models. It would also be interesting to combine the strengths of the newly suggested HGBIC_p and the recently introduced knockoffs inference framework [3, 7, 17] for more stable and enhanced large-scale model selection with misspecification. These problems are beyond the scope of the current paper and are interesting topics for future research.

A Proofs of main results
We provide the proofs of Theorems 1–3 in this appendix. Additional technical details are provided in the Supplementary Material.
A.1 Proof of Theorem 1
We consider the decomposition of S ( y , M m ; F n ) in (19) and deal with terms log E µ M m [ U n ( β ) n ]and log α M m separately by invoking Taylor’s expansion. In fact, log E µ M m [ U n ( β ) n ] is basedon ‘ ∗ n ( y , β ), the deviation of the quasi-log-likelihood from its maximum, while log α M m isthe log-prior probability which depends on D m = E [ I ( g n ; f n ( · , b β n,m )) − I ( g n ; f n ( · , β n,m, ))],expected difference in the KL divergences. In light of consistency of the estimator b β n as shownin Lemma 1, we focus only on the neighborhood of β n, .First, we make a few remarks on the technical details of the proof. Throughout the proof,we condition on the event e Q n = { b β n ∈ N n ( δ n ) } , where N n ( δ n ) = { β ∈ R d : k ( n − B n ) / ( β − β n, ) k ≤ ( n/d ) − / δ n } , B n = X T cov( Y ) X , δ n = O ( L n √ log p ) and b β n is the unrestrictedMLE. Note that the eigenvalues of n − A n ( β ) and n − B n are bounded away from 0 and ∞ by Conditions 1 and 3. This follows from the fact that eigenvalues of M T NM lie between λ min ( N ) λ min ( M T M ) and λ max ( N ) λ max ( M T M ) for any matrix M and positive semidefinitesymmetric matrix N . Therefore, from Lemma 1 we have that P ( e Q n ) → n → ∞ .To establish this theorem we require a possibly dimension dependent bound on the quantity k n − / X b β n k . This can be achieved by putting some restriction on the parameter space. Let M n ( α ) = { β ∈ R d : k X β k ∞ ≤ α log n } be a neighborhood, where α is some positive constant.One way of bounding the quantity k n − / X b β n k is to restrict the QMLE b β n on the set M n ( α ).Here, the constant α can be chosen as large as desired to make M n ( α ) large enough, whereasthe neighborhood N n ( δ n ) is asymptotically shrinking. Then, we have N n ( δ n ) ⊂ M n ( α ) for allsufficiently large n , which implies that conditional on e Q n , the restricted MLE coincides withits unrestricted version. Hereafter in this proof b β n will be referred to as the restricted MLE,unless specified otherwise. Part I: expansion of the term log E µ M m [ U n ( β ) n ].17ecall that U n ( β ) = exp (cid:2) n − ‘ ∗ n ( y , β ) (cid:3) and ‘ ∗ n ( y , β ) = ‘ n ( y , β ) − ‘ n ( y , b β n ). First, weobserve that the maximum value of the function ‘ ∗ n ( y , β ) is attained at β = b β n . Moreover, wehave ∂ ‘ ∗ n ( y , β ) /∂ β = − A n ( β ) from (10) where A n ( β ) = X T Σ ( X β ) X . Then, we considerTaylor’s expansion of the log-likelihood function ‘ n ( y , · ) around b β n in a new neighborhood e N n ( δ n ) = { β ∈ R d : k ( n − B n ) / ( β − b β n ) k ≤ ( n/d ) − / δ n } . We get ‘ ∗ n ( y , β ) = 12 ( β − b β n ) T (cid:2) ∂ ‘ ∗ n ( y , β ∗ ) /∂ β (cid:3) ( β − b β n ) (30)= − n δ T V n ( β ∗ ) δ , where β ∗ lies on the line segment joining β and b β n , δ = n − / B / n ( β − b β n ), and V n ( β ) = B − / n A n ( β ) B − / n . Since b β n ∈ e N n ( δ n ), by the convexity of the neighborhood e N n ( δ n ) we have β ∗ ∈ e N n ( δ n ). We also note that conditional on the event e Q n , it holds that e N n ( δ n ) ⊂ N n (2 δ n ).Now, we will bound U n ( β ) n over the region e N n ( δ n ) using Taylor’s expansion in (30). ByCondition 5, we get q ( β )1 e N n ( δ n ) ( β ) ≤ − n − ‘ ∗ n ( y , β )1 e N n ( δ n ) ( β ) ≤ q ( β )1 e N n ( δ n ) ( β ) , (31)where q ( β ) = δ T [ V n − ρ n ( δ n ) I d ] δ and q ( β ) = δ T [ V n + ρ n ( δ n ) I d ] δ . Then, we consider thelinear transformation h ( β ) = ( n − B n ) / β . 
For sufficiently large n , we obtain E µ M [ e − nq ( β ) e N n ( δ n ) ( β )] ≤ E µ M [ U n ( β ) n e N n ( δ n ) ( β )] ≤ E µ M [ e − nq ( β ) e N n ( δ n ) ( β )] , (32)where µ M denotes the prior distribution on h ( β ) ∈ R d for the model M .The final expansion of log E µ M [ U n ( β ) n ] results from combination of Lemmas 7–10. Theexpressions E µ M [ U n ( β ) n e N cn ( δ n ) ] and R δ ∈ R d e − nq j e N cn ( δ n ) dµ for j = 1 , n since κ n = λ min ( V n ) / E µ M [ U n ( β ) n ] = log ((cid:18) πn (cid:19) d/ | V n ± ρ n ( δ n ) I d | − / ) + log c n , where c n ∈ [ c , c − ]. Finally, we observe that | V n ± ρ n ( δ n ) I d | − / = | V n | − / | I d ± ρ n ( δ n ) V − n | − / = | V n | − / { O [ ρ n ( δ n )tr( V − n )] } − / = | V n | − / { O [ ρ n ( δ n ) dλ − ( V n )] } − / = | V n | − / [1 + o (1)] , where we use Condition 5. So, we obtainlog E µ M [ U n ( β ) n ] = log ((cid:18) πn (cid:19) d/ | V n | − / [1 + o (1)] ) + log c n = − log n d + 12 log | A − n B n | + log(2 π )2 d + log c n + o (1) . (33)This completes the expansion of log E µ M [ U n ( β ) n ].18 art II: expansion of the prior term log α M m .Now, we consider the prior term log α M m which depends on b β n through D m . Simplecalculation shows that log α M m = − D m + log C − d log p. (34)We aim to provide a decomposition of D m in terms of H n . Observe that − D m = Eη n ( b β n ) − η n ( β n, ) where η n ( β ) = E‘ n ( e y , β ), and e y is an independent copy of y . We expand Eη n ( b β n )around η n ( β n, ). In the GLM setup, we observe that ‘ n ( e y , β ) = e y T X β − T b ( X β ) and η n ( β ) = ( E e y T ) X β − T b ( X β ). Then, we split Eη n ( b β n ) in the region e Q n and its complement,that is, Eη n ( b β n ) = E { η n ( b β n )1 e Q n } + E { η n ( b β n )1 e Q cn } (35)= E { η n ( b β n )1 e Q n } + E { [( E ˜ y ) T ( X b β n ) − T b ( X b β n )]1 e Q cn } . First, we aim to show that the second term on the right hand side of (35) is o (1). Per-forming componentwise Taylor’s expansion of b ( · ) around and evaluating at X b β n , we obtain b ( X b β n ) = b ( ) + b (0) X b β n + r , where r = ( r , · · · , r n ) T with r i = 2 − b (( X β ∗ i ) i )( X b β n ) i and β ∗ , · · · , β ∗ n lying on the line segment joining b β n and . Thus, we get E {| ( E e y ) T X b β n − T b ( X b β n ) | e Q cn } ≤ O { n log n + n + n (log n ) } P ( e Q cn ) = o (1) (36)for sufficiently large n . The last inequality follows from the fact that P ( e Q cn ) converges to zerofaster than any polynomial rate. To verify the orders, we recall that b β n is the constrainedMLE and b ( · ) is bounded away from 0 and ∞ . Thus, we obtain following bounds for thefour terms | ( E e y ) T X b β n | = O ( n log n ), | T b ( ) | = O ( n ), | b (0) T X b β n | = O ( n log n ), and | T r | = O ( n (log n ) ).Now, we consider the first term on the right hand side of (35). We begin by expanding η n ( β ) around β n, conditioned on the event f Q n . By the definition of β n, , η n ( β ) attains itsmaximum at β n, . By evaluating Taylor’s expansion of η n ( · ) around β n, at b β n , we derive η n ( b β n ) = η n ( β n, ) −
12 ( b β n − β n, ) T A n ( β ∗ )( b β n − β n, )= η n ( β n, ) −
12 ( b β n − β n, ) T A n ( b β n − β n, ) − s n , where A n ( · ) = − ∂ ‘ n ( y , · ) /∂ β , A n = A n ( β n, ), and β ∗ is on the line segment joining β n, and b β n . The second equality is obtained by taking s n = ( b β n − β n, ) T [ A n ( β ∗ ) − A n ]( b β n − β n, ).Furthermore, setting C n = B − / n A n and v n = C n ( b β n − β n, ) simplifies the above expressionto η n ( b β n ) = η n ( β n, ) − v Tn [( C − n ) T A n C − n ] v n − s n . (37)19n (37), we first handle the term s n . Note that on the event f Q n , by the convexity of theneighborhood N n ( δ n ) we have β ∗ ∈ N n ( δ n ). Then, Condition 5 implies (cid:12)(cid:12)(cid:12) s n f Q n (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12) ( b β n − β n, ) T ( A n ( β ∗ ) − A n )( b β n − β n, ) (cid:12)(cid:12)(cid:12) f Q n (38)= (cid:12)(cid:12)(cid:12) [ B / n ( b β n − β n, )] T [ V n ( β ∗ ) − V n ][ B / n ( b β n − β n, )] (cid:12)(cid:12)(cid:12) f Q n ≤ ρ n ( δ n ) δ n d f Q n , where V n ( · ) = B − / A n ( · ) B − / n and V n = V ( β n, ). We then deduce that E ( s n f Q n ) = o (1),since ρ n ( δ n ) δ n d f Q n = o (1) by Condition 5. Therefore, (37) becomes E [ η n ( b β n )1 f Q n ] = E [ η n ( β n, ) − v Tn [( C − n ) T A n C − n ] v n f Q n ] + o (1) . (39)We provide a decomposition of v n to handle the term v Tn [( C − n ) T A n C − n ] v n in (39). Define Ψ ( β n ) = X T [ y − µ ( X β n )]. From the score equation we have Ψ ( b β n ) = 0. From (8), it holdsthat X T [ E y − µ ( X β n, )] = 0. For any β , · · · , β d ∈ R d , denote by e A n ( β , · · · , β d ) a d × d matrix with j th row the corresponding row of A n ( β j ) for each j = 1 , · · · , d . Then, wedefine matrix-valued function e V n ( β , · · · , β d ) = B − / n e A n ( β , · · · , β d ) B − / n . Assuming thedifferentiability of Ψ ( · ) and applying the mean-value theorem componentwise around β n, , weobtain = Ψ n ( b β n ) = Ψ n ( β n, ) − e A n ( β , · · · , β d )( b β n − β n, )= X T ( y − E y ) − e A n ( β , · · · , β d )( b β n − β n, ) , where each of β , · · · , β d lies on the line segment joining b β n and β n, . Therefore, we have thedecomposition v n = C n ( b β n − β n, ) = u n + w n , (40)where u n = B − / n X T ( y − E y ) and w n = − h e V n ( β , · · · , β d ) − V n i h B / n ( b β n − β n, ) i .We handle the quadratic term v Tn [( C − n ) T A n C − n ] v n in (39) by using the decomposition of v n . For simplicity of notation, denote by R n = ( C − n ) T A n C − n . Recall that C n = B − / n A n .With some calculations we obtain E ( u Tn R n u n ) = E { ( y − E y ) T XA − n X T ( y − E y ) } = E { tr( A − n X T ( y − E y )( y − E y ) T X ) } = tr( A − n B n ) . Note that E ( u Tn R n u n f Q n ) = E ( u Tn R n u n ) − E ( u Tn R n u n f Q nc ). From Lemma 1, we have P ( f Q nc ) → n → ∞ . We set µ n = tr( A − n B n ) ∨
1, hereby µ n is bounded away from zero.We apply Vitali’s convergence theorem to show that E ( u Tn R n u n f Q nc ) = o ( µ n ). To establishuniform integrability, we use Lemma 6 which states that sup n E | ( u Tn R n u n ) /µ n | γ < ∞ forsome constant γ >
0. This leads to E ( u Tn R n u n f Q nc ) = o ( µ n ). Hence we have12 E ( u Tn R n u n f Q n ) = 12 tr( A − n B n ) + o ( µ n ) . (41)20ow, it remains to show that E [( w Tn R n w n + 2 w Tn R n u n )1 f Q n ] = o ( µ n ) . (42)Using the definition of R n and w n , we can bound w Tn R n w n : w Tn R n w n = k R / n w n k ≤ k e V n ( β , · · · , β d ) − V n k δ n d tr( A − n B n ) . So, on the event f Q n , it holds that E ( w Tn R n w n f Q n ) = o ( µ n ) by Condition 5. For the crossterm w Tn R n u n , applying the Cauchy–Schwarz inequality yields | E ( w Tn R n u n f Q n ) | ≤ E ( k R / n w n k f Q n ) / E ( k u Tn R / n k ) / ≤ E [ k e V n ( β , · · · , β d ) − V n k f Q n δ n d / tr( A − n B n )] . Thus, we obtain that E ( w Tn R n u n f Q n ) = o ( µ n ). Note that E {| η n ( β n, ) | f Q nc } is of order o (1)by similar calculations as in (36). Then, combining (35), (39), (41) and (42) yields E { η n ( b β n ) } = η n ( β n, ) −
12 tr( A − n B n ) + o ( µ n ) . (43)Combining (34) and (43) yields the expansionlog α M = −
12 tr( A − n B n ) + log C − d log p + o ( µ n ) . Part I and Part II conclude the proof of Theorem 1.
A.2 Proof of Theorem 2
In the beginning of the proof, we demonstrate that the theorem follows from the consistencyof b A n and b B n . Next, we establish the consistency of b A n and b B n . The consistency of b A n follows directly from the Lipschitz assumption; however, the consistency of b B n is harder toprove. To accomplish this, we break down b B n and invoke Bernstein-type tail inequalities andconcentration theorems to handle challenging pieces.We first introduce some notation to simplify the presentation of the proof. λ k ( · ) denotesthe eigenvalues arranged in increasing order. Denote the spectral radius of d × d square matrix M by ρ ( M ) = max ≤ k ≤ d {| λ k ( M ) |} . k · k denotes the matrix operator norm. o P ( · ) denotesthe convergence in probability of the matrix operator norm.We want to show that log | b H n | = log | H n | + o P (1) and tr( b H n ) = tr( H n ) + o P (1). Toestablish both equalities, it is enough to show that b H n = H n + o P (1 /d ). Indeed, assume that b H n = H n + o P (1 /d ) is established. In that case, we observe that | tr( b H n ) − tr( H n ) | = | tr( b H n − H n ) | ≤ dρ ( b H n − H n ) = d k b H n − H n k = o P (1) , b H n − H n . Moreover, we have | log | b H n | − log | H n || ≤ d max ≤ k ≤ d | log λ k ( b H n ) − log λ k ( H n ) | = d max ≤ k ≤ d log max ( λ k ( b H n ) λ k ( H n ) , λ k ( H n ) λ k ( b H n ) )! ≤ d max ≤ k ≤ d max ( λ k ( b H n ) λ k ( H n ) , λ k ( H n ) λ k ( b H n ) ) − ! ≤ d max ≤ k ≤ d | λ k ( b H n ) − λ k ( H n ) | min { λ k ( b H n ) , λ k ( H n ) } . (44)Recall that the smallest and largest eigenvalues of both n − B n and n − A n are bounded awayfrom 0 and ∞ . (See the note in the beginning of the proof of Theorem 1.) So, we get λ k ( H n ) = O (1) and λ − k ( H n ) = O (1) uniformly for all 1 ≤ k ≤ d . An application of Weyl’stheorem shows that | λ k ( b H n ) − λ k ( H n ) | ≤ ρ ( b H n − H n ) for each k . We have ρ ( b H n − H n ) = k b H n − H n k = o P (1 /d ). Hence, the right hand side of (44) is o P (1).Now, we proceed to show that b H n = H n + o P (1 /d ). It suffices to prove that n − b A n = n − A n + o P (1 /d ) and n − b B n = n − B n + o P (1 /d ). To see the sufficiency, note that b H n − H n = ( n − b A n ) − ( n − d b B n ) − ( n − A n ) − ( n − d B n )= ( n − b A n ) − ( n − d b B n ) − ( n − b A n ) − ( n − d B n )+ ( n − b A n ) − ( n − d B n ) − ( n − A n ) − ( n − d B n ) . Then, b H n = H n + o P (1 /d ) can be obtained by repeated application of the following propertiesof the operator norm: k ( I d − M ) − k ≤ / (1 − k M k ) if k M k < k MN k ≤ k M k k N k ,and k M + N k ≤ k M k + k N k , where M and N are d × d matrices [22]. Part 1: prove n − b A n = n − A n + o P (1 /d ). From Lemma 1 we have, k b β n − β n, k = O P { ( n/d ) − / δ n } , which entails b β n = β n, + O P { ( n/d ) − / δ n } . Then it follows from theLipschitz assumption for n − A n ( β ) in the neighborhood N n ( δ n ) that n − b A n = n − A n + o P (1 /d ). Part 2: prove n − b B n = n − B n + o P (1 /d ). We need to control the term y − µ ( X b β n ).In correctly specified models, µ ( X β n, ) and E y are the same. So, it is enough to introducethe mean E y which is close to both y and µ ( X b β n ). However, it is harder to control the term y − µ ( X b β n ) in misspecified models since we need to deal with both µ ( X β n, ) and E y .First, we use the fact that µ ( X β n, ) and µ ( X b β n ) are close. 
To accomplish this, we addand subtract µ ( X β n, ) to get the following decomposition: n − b B n = n − X T diag nh y − µ ( X b β n ) i ◦ h y − µ ( X b β n ) io X = G + G + G , G = n − X T diag { [ y − µ ( X β n, )] ◦ [ y − µ ( X β n, )] } X , G = 2 n − X T diag { [ y − µ ( X β n, )] ◦ [ µ ( X β n, ) − µ ( X b β n )] } X , G = n − X T diag { [ µ ( X b β n ) − µ ( X β n, )] ◦ [ µ ( X b β n ) − µ ( X β n, )] } X . Next, we introduce E y to obtain terms y − E y and E y − µ ( X β n, ) both of which can bekept small. We split G as G = G + G + G and G as G = G + G , where G = n − X T diag { ( y − E y ) ◦ ( y − E y ) } X , G = 2 n − X T diag { ( y − E y ) ◦ [ E y − µ ( X β n, )] } X , G = n − X T diag { [ E y − µ ( X β n, )] ◦ [ E y − µ ( X β n, )] } X , G = 2 n − X T diag { ( y − E y ) ◦ [ µ ( X β n, ) − µ ( X b β n )] } X , G = 2 n − X T diag { [ E y − µ ( X β n, )] ◦ [ µ ( X β n, ) − µ ( X b β n )] } X . Now, we will control each of the above terms separately. Before we begin, we observe thatfor any matrices M and N , we have P ( d k M − N k ≥ t ) ≤ P ( d k M − N k F ≥ t ) ≤ d max ≤ j,k ≤ d P ( | M jk − N jk | ≥ t/d ) , (45)where k · k F denotes the matrix Frobenius norm and M jk denotes the ( j, k )th entry of M .Therefore, it is enough to bound P ( | M jk − N jk | ≥ t/d ) by o (1 /d ) to show that M = N + o p (1 /d ). Part 2a) prove G = n − B n + o P (1 /d ). We will use Bernstein-type tail inequality.First, note that E G = n − B n and G jk = n − P ni =1 { x ij x ik [ y i − Ey i ] } = P ni =1 a jki q i , where a jki = n − x ij x ik var( y i ) and q i = { var( y i ) } − / ( y i − Ey i ). Let a jk = ( a jk , · · · , a jkn ) T . Then wehave k a jk k = O ( n u − ) since k X k ∞ = O ( n u ) from Condition 3. It may be noted that q i ’sare 1-sub-exponential random variables from Condition 1 and so q i ’s are 2-sub-exponentialrandom variables. Furthermore, sup ≤ i ≤ n var( q i ) = O (1). To see this, we notevar( q i ) ≤ Eq i ≤ (4 − [ Eq i ] / ) ≤ (cid:18) sup m ≥ n m − ( E | q i | m ) /m o(cid:19) = O (1) , where we use Lemma 5. Then combining (45) with Lemma 12 for a choice of α = 2, we deduce P ( d k G − E G k ≥ t ) ≤ d max ≤ j,k ≤ d P ( | G jk − E G jk | ≥ t/d ) ≤ Cd exp {− Ct / n − u /d } for some constant C . Since d = O ( n κ ) and u < / − u , the right hand side of above equationtends to zero. Thus, we obtain G = E G + o P (1 /d ) = n − B n + o P (1 /d ).23 art 2b) prove G = o P (1 /d ). Similar to the previous part, we invoke Bernstein-type tail inequality. Observe that G jk = n − P ni =1 { x ij x ik [ E y − µ ( X β n, )] i [ y i − Ey i ] } = P ni =1 ˜ a jki q i , where ˜ a jki = 2 n − var( y i ) / x ij x ik [ E y − µ ( X β n, )] i and q i = { var( y i ) } − / ( y i − Ey i ). Then, we get k ˜ a jk k = O ( n u + u / − / ) by Conditions 2 and 3.By Lemma 11, we have P ( d k G k ≥ t ) ≤ d max ≤ j,k ≤ d P ( | G jk | ≥ t/d ) ≤ Cd exp {− Ctn − u − u /d } for some constant C . Since d = O ( n κ ) and 3 / − u − u / − κ >
Part 2b: prove $G_{12} = o_P(1/d)$. Similar to the previous part, we invoke a Bernstein-type tail inequality. Observe that $(G_{12})_{jk} = 2n^{-1}\sum_{i=1}^n\{x_{ij}x_{ik}[E y - \mu(X\beta_{n,0})]_i[y_i - Ey_i]\} = \sum_{i=1}^n\tilde a_{jki}q_i$, where $\tilde a_{jki} = 2n^{-1}\mathrm{var}(y_i)^{1/2}x_{ij}x_{ik}[E y - \mu(X\beta_{n,0})]_i$ and $q_i = \{\mathrm{var}(y_i)\}^{-1/2}(y_i - Ey_i)$. Then we get $\|\tilde a_{jk}\| = O(n^{u+u/2-1/2})$ by Conditions 2 and 3. By Lemma 11 combined with (45), we have
$$P(d\|G_{12}\| \ge t) \le d^2\max_{1\le j,k\le d}P(|(G_{12})_{jk}| \ge t/d^2) \le C_1 d^2\exp\Big\{-C_2\min\Big(\frac{t^2}{d^4\|\tilde a_{jk}\|^2}, \frac{t}{d^2\|\tilde a_{jk}\|_\infty}\Big)\Big\}$$
for some positive constants $C_1$ and $C_2$. Since $d = O(n^\kappa)$ and the exponent conditions in Conditions 2 and 3 hold, the right-hand side of the above equation tends to zero. Hence, we have $G_{12} = o_P(1/d)$.

Part 2c: prove $G_{13} = o(1/d)$. We derive
$$\|G_{13}\| \le \Big\|n^{-1}\sum_{i=1}^n\big\{x_ix_i^T[Ey_i - [\mu(X\beta_{n,0})]_i]^2\big\}\Big\|_F = \Big[\sum_{1\le j,k\le d}\Big(\sum_{i=1}^n a_{jki}[Ey_i - [\mu(X\beta_{n,0})]_i]^2/\mathrm{var}(y_i)\Big)^2\Big]^{1/2} \le \sum_{i=1}^n\big\{[Ey_i - [\mu(X\beta_{n,0})]_i]^2/\mathrm{var}(y_i)\big\}\Big(\sum_{1\le j,k\le d}\|a_{jk}\|^2\Big)^{1/2},$$
where the last step follows from the componentwise Cauchy–Schwarz inequality. From Conditions 2 and 3, we get $\|G_{13}\| = O(n^u\,d\,n^{u-1/2})$, which is $o(1/d)$ since $d = O(n^\kappa)$ and the exponent conditions in Conditions 2 and 3 hold.

Part 2d: prove $G_{21} = o_P(1/d)$. Bounding $G_{21}$ is the trickiest part. The use of classical Bernstein-type inequalities is precluded since the summation involves two random quantities, $y$ and $\widehat{\beta}_n$. Instead, we apply concentration inequalities.

We start by truncating the response $y$ through conditioning on the set $\Omega_n = \{\|W\|_\infty \le C_1\log n\}$ defined in Lemma 2. Since $\widehat{\beta}_n$ belongs to the neighborhood $N_n(\delta_n)$ by Lemma 1, we get
$$|(G_{21})_{jk}| = 2n^{-1}\Big|\sum_{i=1}^n x_{ij}x_{ik}[y_i - Ey_i][\mu(X\beta_{n,0}) - \mu(X\widehat{\beta}_n)]_i\Big| \le 2\sup_{\beta_n\in N_n(\delta_n)}n^{-1}\Big|\sum_{i=1}^n x_{ij}x_{ik}[y_i - Ey_i][\mu(X\beta_{n,0}) - \mu(X\beta_n)]_i\Big|.$$
Then we can separate the right-hand side by conditioning on $\Omega_n$, so that $|(G_{21})_{jk}| \le G^{(1)}_{jk} + G^{(2)}_{jk}$, where
$$G^{(1)}_{jk} = 2\sup_{\beta_n\in N_n(\delta_n)}n^{-1}\Big|\sum_{i=1}^n x_{ij}x_{ik}[y_i - Ey_i][\mu(X\beta_{n,0}) - \mu(X\beta_n)]_i 1_{\Omega_n}\Big|,$$
$$G^{(2)}_{jk} = 2\sup_{\beta_n\in N_n(\delta_n)}n^{-1}\Big|\sum_{i=1}^n x_{ij}x_{ik}[y_i - Ey_i][\mu(X\beta_{n,0}) - \mu(X\beta_n)]_i(1 - 1_{\Omega_n})\Big|.$$
We first bound $E G^{(1)}_{jk}$. We take a Rademacher sequence $\{\epsilon_i\}_{i=1}^n$ independent of $y$ and apply the symmetrization and contraction inequalities in [5] as follows:
$$E G^{(1)}_{jk} \le 4n^{-1}\,E\sup_{\beta_n\in N_n(\delta_n)}\Big|\sum_{i=1}^n\epsilon_ix_{ij}x_{ik}y_i[\mu(X\beta_{n,0}) - \mu(X\beta_n)]_i1_{\Omega_n}\Big| \le 4c_2n^{-1}\,E\sup_{\beta_n\in N_n(\delta_n)}\Big|\sum_{i=1}^n\epsilon_ix_{ij}x_{ik}y_i[X\beta_{n,0} - X\beta_n]_i1_{\Omega_n}\Big| \le 4c_2n^{-1}\sup_{\beta_n\in N_n(\delta_n)}\|\beta_{n,0} - \beta_n\|\,E\Big\|\sum_{i=1}^n\epsilon_ix_{ij}x_{ik}y_i1_{\Omega_n}x_i\Big\|,$$
where the last step follows from the Cauchy–Schwarz inequality and $c_2$ is a positive constant. We observe that $\sup_{\beta_n\in N_n(\delta_n)}\|\beta_{n,0} - \beta_n\| \le n^{-1/2}d^{1/2}\delta_n$ and $E\|\sum_{i=1}^n\epsilon_ix_{ij}x_{ik}y_i1_{\Omega_n}x_i\| \le (\sum_{i=1}^n x_{ij}^2x_{ik}^2E[y_i^21_{\Omega_n}]\|x_i\|^2)^{1/2}$. So we can bound $E G^{(1)}_{jk}$ by $4c_2n^{-3/2}d^{1/2}\delta_n(\sum_{i=1}^n x_{ij}^2x_{ik}^2E[y_i^21_{\Omega_n}]\|x_i\|^2)^{1/2}$. Using Conditions 2 and 3, we obtain $E G^{(1)}_{jk} = O(n^{3u-1}d\,\delta_nm_n)$, which is $o(1/d^2)$ since $d = O(n^\kappa)$ and the exponent conditions in Conditions 2 and 3 hold.
Furthermore, in order to use the concentration theorem in [5], we need to bound $2|x_{ij}x_{ik}y_i[\mu(X\beta_{n,0}) - \mu(X\beta_n)]_i|1_{\Omega_n}$ for any $\beta_n\in N_n(\delta_n)$. We use Lemma 2 to bound $y_i$:
$$2|x_{ij}x_{ik}y_i[\mu(X\beta_{n,0}) - \mu(X\beta_n)]_i|1_{\Omega_n} \le 2|x_{ij}||x_{ik}|\,|y_i - Ey_i + Ey_i|1_{\Omega_n}\,|[\mu(X\beta_{n,0}) - \mu(X\beta_n)]_i| \le 2|x_{ij}||x_{ik}|(|Ey_i| + C_1\log n)\,|[\mu(X\beta_{n,0}) - \mu(X\beta_n)]_i|.$$
Since $b''(X\beta) \le c_0^{-1}$ for any $\beta$ on the line segment joining $\beta_{n,0}$ and $\beta_n$, we have $|[\mu(X\beta_{n,0}) - \mu(X\beta_n)]_i| \le c_0^{-1}\|x_i\|\,\|\beta_{n,0} - \beta_n\|$ for any $\beta_n\in N_n(\delta_n)$. Putting the last two inequalities together with Conditions 2 and 3, we get $2|x_{ij}x_{ik}y_i[\mu(X\beta_{n,0}) - \mu(X\beta_n)]_i|1_{\Omega_n} \le c_{i,\beta_n}$, where $c_{i,\beta_n} = O(n^{2u}m_n)\|x_i\|\,\|\beta_{n,0} - \beta_n\|$. Moreover, we have
$$\sup_{\beta_n\in N_n(\delta_n)}n^{-1}\sum_{i=1}^n c_{i,\beta_n}^2 \le O(n^{4u-1}m_n^2)\sup_{\beta_n\in N_n(\delta_n)}\|\beta_{n,0} - \beta_n\|^2\sum_{i=1}^n\|x_i\|^2 = O(n^{4u-1}m_n^2d^2\delta_n^2) =: c_n^2,$$
where we use the fact that $\|\beta_{n,0} - \beta_n\| = O(n^{-1/2}d^{1/2}\delta_n)$ for any $\beta_n\in N_n(\delta_n)$. Thus, we can use the concentration inequality in [5], which yields
$$P(G^{(1)}_{jk} \ge E G^{(1)}_{jk} + t) \le C_1\exp\Big\{-\frac{C_2\,nt^2}{c_n^2}\Big\}, \quad (46)$$
for some positive constants $C_1$ and $C_2$.
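The expectation bound above and the concentration inequality (46) both rely on the symmetrization device from [5]. The following sketch is a hedged, purely numerical illustration of symmetrization on a made-up finite family of bounded functions; it is not the empirical process appearing in the proof, and the family, sample sizes, and constants are our own choices.

```python
import numpy as np

# Illustration of symmetrization: for a finite family f_b, E sup_b |sum_i (f_b(x_i) - E f_b)|
# is bounded by 2 E sup_b |sum_i eps_i f_b(x_i)| with i.i.d. Rademacher signs eps_i.
rng = np.random.default_rng(6)
n, reps = 100, 2000
bs = np.linspace(-1.0, 1.0, 21)          # finite index set standing in for N_n(delta_n)

z = rng.standard_normal(400000)          # large sample to approximate E f_b under N(0,1)
mean_f = np.array([np.tanh(z - b).mean() for b in bs])

lhs, rhs = [], []
for _ in range(reps):
    x = rng.standard_normal(n)
    eps = rng.choice([-1.0, 1.0], size=n)
    F = np.tanh(x[:, None] - bs[None, :])             # n x |bs| matrix of f_b(x_i)
    lhs.append(np.abs(F.sum(axis=0) - n * mean_f).max())
    rhs.append(np.abs((eps[:, None] * F).sum(axis=0)).max())

print("E sup |centered sum|      ~", round(float(np.mean(lhs)), 2))
print("2 E sup |symmetrized sum| ~", round(float(2 * np.mean(rhs)), 2))
```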
Now take any $\tilde t > 0$. We know that $E G^{(1)}_{jk} < \tilde t/(2d^2)$ for large enough $n$. Then, taking $t = \tilde t/(2d^2)$ in (46), we obtain
$$P(G^{(1)}_{jk} \ge \tilde t/d^2) \le C_1\exp\{-C_2\,n\tilde t^2/(4d^4c_n^2)\}.$$
Since $d = O(n^\kappa)$ and the exponent conditions in Conditions 2 and 3 hold, we have $P(G^{(1)}_{jk} \ge \tilde t/d^2) = o(1/d^2)$. Lastly, $G^{(2)}_{jk} = 0$ on the event $\Omega_n$, which holds with probability at least $1 - O(n^{-\delta})$ by Lemma 2. Therefore, we obtain $G_{21} = o_P(1/d)$ by using (45).

Part 2e: prove $G_{22} = o_P(1/d)$. First, we apply the Cauchy–Schwarz inequality to obtain
$$|(G_{22})_{jk}| = 2\Big|\sum_{i=1}^n\big[n^{-1}\mathrm{var}(y_i)^{1/2}x_{ij}x_{ik}[\mu(X\beta_{n,0}) - \mu(X\widehat{\beta}_n)]_i\big]\Big[\frac{[E y - \mu(X\beta_{n,0})]_i}{\mathrm{var}(y_i)^{1/2}}\Big]\Big| \le 2\Big(\sum_{i=1}^n n^{-2}\mathrm{var}(y_i)x_{ij}^2x_{ik}^2[\mu(X\beta_{n,0}) - \mu(X\widehat{\beta}_n)]_i^2\Big)^{1/2}\Big(\sum_{i=1}^n\frac{[E y - \mu(X\beta_{n,0})]_i^2}{\mathrm{var}(y_i)}\Big)^{1/2}.$$
Since $\widehat{\beta}_n$ lies in the region $N_n(\delta_n)$ with high probability and $b''(\cdot)$ is bounded, $[\mu(X\beta_{n,0}) - \mu(X\widehat{\beta}_n)]_i$ can be bounded by $\|x_i\|\,O(n^{-1/2}d^{1/2}\delta_n)$. Condition 2 and the Cauchy–Schwarz inequality yield $\sum_{i=1}^n[\mathrm{var}(y_i)]^{-1}[E y - \mu(X\beta_{n,0})]_i^2 = O(n^u)$. We further use Conditions 1 and 3 to obtain $|(G_{22})_{jk}| = O(n^{5u/2-1}d\,\delta_n)$, which is $o(1/d^2)$ since $d = O(n^\kappa)$ and the exponent conditions in Conditions 2 and 3 hold. Thus, we obtain $G_{22} = o_P(1/d)$.

Part 2f: prove $G_3 = o(1/d)$. We bound the $(j,k)$th entry of $G_3$ as follows:
$$|(G_3)_{jk}| = n^{-1}\Big|\sum_{i=1}^n x_{ij}x_{ik}[\mu(X\beta_{n,0}) - \mu(X\widehat{\beta}_n)]_i^2\Big| \le n^{-1}\sum_{i=1}^n|x_{ij}||x_{ik}|[\mu(X\beta_{n,0}) - \mu(X\widehat{\beta}_n)]_i^2 = O(n^{2u-1}d^2\delta_n^2),$$
where the last step is similar to Part 2e. So $|(G_3)_{jk}| = o(1/d^2)$ since $d = O(n^\kappa)$ and the exponent conditions in Conditions 2 and 3 hold, and hence $G_3 = o(1/d)$.

We have finished the proof of Part 2. This concludes the proof of Theorem 2.
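For readers who want to see the quantities $\mathrm{tr}(\widehat{H}_n)$ and $\log|\widehat{H}_n|$ in action, the sketch below fits a possibly misspecified working logistic model and forms plug-in matrices. The exact definitions of $A_n$, $B_n$, and $H_n$ are those given in the main text; here we only assume the standard GLM plug-in form $\widehat{A}_n = X^T\Sigma(X\widehat{\beta}_n)X$ together with the expression for $\widehat{B}_n$ displayed in Part 2 above, so this is an illustrative sketch rather than a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 5
X = rng.standard_normal((n, d))
# Data generated from a mechanism the working logistic model does not capture.
p_true = 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1] + 0.3 * X[:, 2] ** 2)))
y = rng.binomial(1, p_true)

beta = np.zeros(d)
for _ in range(50):                                   # Newton iterations for the QMLE
    mu = 1 / (1 + np.exp(-X @ beta))
    W = mu * (1 - mu)
    step = np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (y - mu))
    beta += step
    if np.linalg.norm(step) < 1e-10:
        break

mu = 1 / (1 + np.exp(-X @ beta))
A_hat = X.T @ (X * (mu * (1 - mu))[:, None])          # plug-in Hessian of the working model
B_hat = X.T @ (X * ((y - mu) ** 2)[:, None])          # plug-in covariance of the score
H_hat = np.linalg.solve(A_hat / n, B_hat / n)         # estimated covariance contrast matrix
print("tr(H_hat)  =", np.trace(H_hat))
print("log|H_hat| =", np.linalg.slogdet(H_hat)[1])
```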
A.3 Proof of Theorem 3

Theorem 3 is a direct consequence of Theorem 2, Lemma 1, and assumption (25). To see this, observe that the difference in the sample version HGBIC$_p$ can be written as the sum of the population version HGBIC$^*_p$ and terms consisting of the differences of the likelihood, $\mathrm{tr}(H_n)$, and $\log(\det(H_n))$ between the sample and population versions. That is,
$$\mathrm{HGBIC}_p(M_m) - \mathrm{HGBIC}_p(M_1) = \mathrm{HGBIC}^*_p(M_m) - \mathrm{HGBIC}^*_p(M_1) - 2[\ell_n(y, \widehat{\beta}_{n,m}) - \ell_n(y, \beta_{n,m,0})] + 2[\ell_n(y, \widehat{\beta}_{n,1}) - \ell_n(y, \beta_{n,1,0})] + [\mathrm{tr}(\widehat{H}_{n,m}) - \mathrm{tr}(H_{n,m})] - [\mathrm{tr}(\widehat{H}_{n,1}) - \mathrm{tr}(H_{n,1})] - [\log|\widehat{H}_{n,m}| - \log|H_{n,m}|] + [\log|\widehat{H}_{n,1}| - \log|H_{n,1}|].$$
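The identity just displayed is pure bookkeeping, but it is the backbone of the argument, so a quick algebraic check may be helpful. The snippet below uses made-up numbers with no statistical content and only assumes what the display states, namely that the sample criterion differs from its population version exactly through the three gap terms shown above.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_model():
    # Sample and population versions of the three ingredients, plus the population criterion.
    return {k: rng.normal() for k in
            ("loglik_hat", "loglik_0", "tr_hat", "tr_0", "logdet_hat", "logdet_0", "hgbic_star")}

def hgbic_sample(m):
    # Sample criterion written as the population value plus the three sample-vs-population gaps.
    return (m["hgbic_star"]
            - 2 * (m["loglik_hat"] - m["loglik_0"])
            + (m["tr_hat"] - m["tr_0"])
            - (m["logdet_hat"] - m["logdet_0"]))

m1, mm = make_model(), make_model()
lhs = hgbic_sample(mm) - hgbic_sample(m1)
rhs = (mm["hgbic_star"] - m1["hgbic_star"]
       - 2 * (mm["loglik_hat"] - mm["loglik_0"]) + 2 * (m1["loglik_hat"] - m1["loglik_0"])
       + (mm["tr_hat"] - mm["tr_0"]) - (m1["tr_hat"] - m1["tr_0"])
       - (mm["logdet_hat"] - mm["logdet_0"]) + (m1["logdet_hat"] - m1["logdet_0"]))
assert np.isclose(lhs, rhs)          # the displayed decomposition is an exact identity
```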
Equation (25) entails that the first difference on the right-hand side is bounded below by $\Delta$ for any $m > 1$. We then focus on the remaining terms. Let $m = 2, \cdots, M$ be fixed. The consistency of the QMLE in Lemma 1 implies that $-2[\ell_n(y, \widehat{\beta}_{n,m}) - \ell_n(y, \beta_{n,m,0})] + 2[\ell_n(y, \widehat{\beta}_{n,1}) - \ell_n(y, \beta_{n,1,0})]$ converges to zero with probability at least $1 - O(n^{-\delta})$ for some constant $\delta > 0$, while Theorem 2 entails that the remaining trace and log-determinant differences are $o(\Delta)$ with probability at least $1 - O(n^{-\delta})$. Therefore, $\mathrm{HGBIC}_p(M_m) - \mathrm{HGBIC}_p(M_1) > \Delta/2$ with probability at least $1 - O(n^{-\delta})$ for any fixed $m > 1$.
Applying the union bound over all $M = o(n^{\delta})$ competing models completes the proof of Theorem 3.

References

[1] Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle.
In Second International Symposium on Information Theory (eds. B. N. Petrov and F. Csaki), Akademiai Kiado, Budapest, 267–281. [2] Akaike, H. (1974). A new look at the statistical model identification.
IEEE Transactions on Automatic Control 19, 716–723. [3] Barber, R. and E. J. Candès (2015). Controlling the false discovery rate via knockoffs.
The Annals of Statistics 43, 2055–2085. [4] Bozdogan, H. (1987). Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions.
Psychometrika 52, 345–370. [5] Bühlmann, P. and S. van de Geer (2011).
Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer. [6] Bühlmann, P. and S. van de Geer (2015). High-dimensional inference in misspecified linear models.
Electronic Journal of Statistics 9, 1449–1473. [7] Candès, E. J., Y. Fan, L. Janson, and J. Lv (2017). Panning for gold: Model-X knockoffs for high-dimensional controlled variable selection.
Journal of the Royal Statistical Society Series B, to appear. [8] Chen, J. and Z. Chen (2008). Extended Bayesian information criteria for model selection with large model spaces.
Biometrika 95 , 759–771.[9] Chen, K. and K.-S. Chan (2011). Subset ARMA selection via the adaptive lasso.
Statistics and Its Interface 4, 197–205. [10] Cule, M., R. Samworth, and M. Stewart (2010). Maximum likelihood estimation of a multi-dimensional log-concave density (with discussion).
Journal of the Royal Statistical Society Series B 72, 545–607. [11] Eguchi, S. (2017). Model comparison for generalized linear models with dependent observations.
Econometrics and Statistics 5, 171–188. [12] Erdős, L., H.-T. Yau, and J. Yin (2012). Bulk universality for generalized Wigner matrices.
Probability Theory and Related Fields 154, 341–407. [13] Fan, J. and Y. Fan (2008). High-dimensional classification using features annealed independence rules.
The Annals of Statistics 36, 2605–2637. [14] Fan, J. and J. Lv (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion).
Journal of the Royal Statistical Society Series B 70, 849–911. [15] Fan, J. and J. Lv (2010). A selective overview of variable selection in high-dimensional feature space (invited review article).
Statistica Sinica 20 , 101–148.[16] Fan, J. and J. Lv (2017). Sure independence screening (invited review article).
Wiley StatsRef: Statistics Reference Online, to appear. [17] Fan, Y., E. Demirkaya, G. Li, and J. Lv (2017). RANK: large-scale inference with graphical nonlinear knockoffs.
Manuscript. [18] Fan, Y., E. Demirkaya, and J. Lv (2017). Nonuniformity of p-values can occur early in diverging dimensions.
Manuscript. [19] Fan, Y. and C. Y. Tang (2013). Tuning parameter selection in high dimensional penalized likelihood.
Journal of the Royal Statistical Society Series B 75 , 531–552.[20] Foster, D. and E. George (1994). The risk inflation criterion for multiple regression.
The Annals of Statistics, 1947–1975. [21] Hall, P. (1990). Akaike's information criterion and Kullback–Leibler loss for histogram density estimation.
Probability Theory and Related Fields 85 , 449–467.[22] Horn, R. A. and C. R. Johnson (1985).
Matrix Analysis. Cambridge University Press. [23] Hsu, H.-L., C.-K. Ing, and H. Tong (2017). On model selection from a finite family of possibly misspecified time series models.
Manuscript. [24] Ing, C.-K. (2007). Accumulated prediction errors, information criteria and optimal forecasting for autoregressive time series.
The Annals of Statistics 35 , 1238–1277.[25] Konishi, S. and G. Kitagawa (1996). Generalised information criteria in model selection.
Biometrika 83 , 875–890.[26] Kullback, S. and R. A. Leibler (1951). On information and sufficiency.
The Annals of Mathematical Statistics 22, 79–86. [27] Liu, W. and Y. Yang (2011). Parametric or nonparametric? A parametricness index for model selection.
The Annals of Statistics, 2074–2102. [28] Lv, J. and J. S. Liu (2014). Model selection principles in misspecified models.
Journal of the Royal Statistical Society Series B 76, 141–167. [29] McCullagh, P. and J. A. Nelder (1989).
Generalized Linear Models. Chapman and Hall, London. [30] Ninomiya, Y. and S. Kawano (2016). AIC for the Lasso in generalized linear models.
Electronic Journal of Statistics 10, 2537–2560. [31] Peng, H., H. Yan, and W. Zhang (2013). The connection between cross-validation and Akaike information criterion in a semiparametric family.
Journal of Nonparametric Statistics 25, 475–485. [32] Schwarz, G. (1978). Estimating the dimension of a model.
The Annals of Statistics 6, 461–464. [33] Shah, R. D. and P. Bühlmann (2017). Goodness-of-fit tests for high-dimensional linear models.
Journal of the Royal Statistical Society Series B, to appear. [34] Stone, M. (1977). An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion.
Journal of the Royal Statistical Society Series B , 44–47.[35] Vershynin, R. (2012). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027 .[36] White, H. (1982). Maximum likelihood estimation of misspecified models.
Econometrica 50, 1–25.

Supplementary Material to "Large-Scale Model Selection with Misspecification"
Emre Demirkaya, Yang Feng, Pallavi Basu and Jinchi Lv

This Supplementary Material contains key lemmas, their proofs, and additional technical details. All the notation is the same as in the main body of the paper.
B Technical lemmas
We aim to establish the asymptotic consistency of the QMLE uniformly over all models $M$ such that $|M| \le K$, where $K = o(n)$. For this purpose, we extend our notation. $\beta_{n,0}(M)$ denotes the parameter vector for the working model with support $M$ and is defined as the minimizer of the KL divergence, $\beta_{n,0}(M) = \arg\min_{\beta\in\mathcal{B}(M)}I(g_n; f_n(\cdot; \beta, \tau))$. The vector $\beta_{n,0}(M)$ is estimated by the QMLE $\widehat{\beta}(M)$, which is defined as $\widehat{\beta}(M) = \arg\max_{\beta\in\mathcal{B}(M)}\ell_n(\beta)$.
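As a concrete illustration of the objects $\widehat{\beta}(M)$, the sketch below fits a working logistic model restricted to the covariates in a candidate support $M$, using a data-generating process that the working model does not capture, so every candidate model is misspecified. The helper name `qmle`, the use of statsmodels, and all numerical choices are our own for illustration and are not prescribed by the paper.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n, p = 400, 8
X = rng.standard_normal((n, p))
y = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1] ** 2 - 0.5))))

def qmle(support):
    """Quasi-maximum likelihood estimate restricted to the covariates in `support`."""
    fit = sm.GLM(y, X[:, list(support)], family=sm.families.Binomial()).fit()
    return fit.params, fit.llf        # coefficient vector and quasi-log-likelihood

candidates = [(0,), (0, 1), (0, 1, 2), (0, 1, 2, 3)]
for M in candidates:
    beta_hat, llf = qmle(M)
    print(M, np.round(beta_hat, 3), round(llf, 2))
```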
B.1 Lemma 1 and its proof

Lemma 1 (Uniform consistency of QMLE). Assume Conditions 1, 2(i), 3(i), and 3(iii) hold. If $L_n\sqrt{Kn^{-1}\log p} \to 0$, then
$$\sup_{|M|\le K,\,M\subset\{1,\cdots,p\}}|M|^{-1/2}\|\widehat{\beta}(M) - \beta_{n,0}(M)\| = O_p\big[L_n\sqrt{n^{-1}\log p}\big],$$
where $L_n = 2m_n + C_1\log n$, $m_n$ is the diverging sequence appearing in Condition 2, and $C_1$ is the positive constant from Lemma 2.

Proof. First, we construct an auxiliary parameter vector $\widehat{\beta}_u(M)$ as follows. For any sequence $N_n$, we take $u = (1 + \|\widehat{\beta}(M) - \beta_{n,0}(M)\|/N_n)^{-1}$ and define $\widehat{\beta}_u(M) = u\widehat{\beta}(M) + (1-u)\beta_{n,0}(M)$. We have $\|\widehat{\beta}_u(M) - \beta_{n,0}(M)\| = u\|\widehat{\beta}(M) - \beta_{n,0}(M)\| \le N_n$ by the definition of $u$, so $\widehat{\beta}_u(M)$ belongs to the neighborhood $\mathcal{B}_M(N_n) = \{\beta\in\mathbb{R}^d, \mathrm{supp}(\beta) = M : \|\beta - \beta_{n,0}(M)\| \le N_n\}$. Moreover, we observe that $\|\widehat{\beta}_u(M) - \beta_{n,0}(M)\| \le N_n/2$ entails $\|\widehat{\beta}(M) - \beta_{n,0}(M)\| \le N_n$. Thus, it is enough to bound $\|\widehat{\beta}_u(M) - \beta_{n,0}(M)\|$ to prove the lemma.

Now we consider $\|\widehat{\beta}_u(M) - \beta_{n,0}(M)\|$. First, the concavity of $\ell_n$ and the definition of $\widehat{\beta}(M)$ yield
$$\ell_n(\widehat{\beta}_u(M)) \ge u\ell_n(\widehat{\beta}(M)) + (1-u)\ell_n(\beta_{n,0}(M)) \ge u\ell_n(\widehat{\beta}_u(M)) + (1-u)\ell_n(\beta_{n,0}(M)).$$
So, by rearranging terms, we get
$$-\ell_n(\beta_{n,0}(M)) + \ell_n(\widehat{\beta}_u(M)) \ge 0. \quad \text{(A.1)}$$
Besides, for any $\beta\in\mathcal{B}_M(N_n)$, we have
$$E[\ell_n(\beta_{n,0}(M)) - \ell_n(\beta)] = I(g_n; f_n(\cdot;\beta,\tau)) - I(g_n; f_n(\cdot;\beta_{n,0}(M),\tau)) \ge 0 \quad \text{(A.2)}$$
by the optimality of $\beta_{n,0}(M)$. Combining (A.1) and (A.2) gives
$$0 \le E[\ell_n(\beta_{n,0}(M)) - \ell_n(\widehat{\beta}_u(M))] \le -\ell_n(\beta_{n,0}(M)) + \ell_n(\widehat{\beta}_u(M)) + E[\ell_n(\beta_{n,0}(M)) - \ell_n(\widehat{\beta}_u(M))] \le \sup_{\beta\in\mathcal{B}_M(N_n)}\big|\ell_n(\beta) - E[\ell_n(\beta)] - \{\ell_n(\beta_{n,0}(M)) - E[\ell_n(\beta_{n,0}(M))]\}\big| = nT_M(N_n), \quad \text{(A.3)}$$
since $\widehat{\beta}_u(M)\in\mathcal{B}_M(N_n)$.

On the other hand, for any $\beta\in\mathcal{B}_M(N_n)$,
$$E[\ell_n(\beta_{n,0}(M)) - \ell_n(\beta)] = EY^TZ_M(\beta_{n,0}(M) - \beta) - 1^T\big(b(Z_M\beta_{n,0}(M)) - b(Z_M\beta)\big) = \mu(Z_M\beta_{n,0}(M))^TZ_M(\beta_{n,0}(M) - \beta) - 1^T\big(b(Z_M\beta_{n,0}(M)) - b(Z_M\beta)\big),$$
since $\beta_{n,0}(M)$ satisfies the score equation $Z_M^T[EY - \mu(Z_M\beta_{n,0}(M))] = 0$. Furthermore, applying a second-order Taylor expansion yields
$$E[\ell_n(\beta_{n,0}(M)) - \ell_n(\beta)] = \tfrac{1}{2}\big(\beta_{n,0}(M) - \beta\big)^TZ_M^T\Sigma(Z_M\bar\beta)Z_M\big(\beta_{n,0}(M) - \beta\big),$$
where $\bar\beta$ lies on the line segment connecting $\beta_{n,0}(M)$ and $\beta$. Then we use Condition 3 and the assumption that $c_0 \le b''(Z\beta) \le c_0^{-1}$ for any $\beta\in\mathcal{B}$, so that $E[\ell_n(\beta_{n,0}(M)) - \ell_n(\beta)] \ge \tfrac{1}{2}nc_0c_1\|\beta_{n,0}(M) - \beta\|^2$. Therefore, for any $\beta\in\mathcal{B}_M(N_n)$,
$$\|\beta_{n,0}(M) - \beta\|^2 \le 2(c_0c_1)^{-1}n^{-1}E[\ell_n(\beta_{n,0}(M)) - \ell_n(\beta)]. \quad \text{(A.4)}$$
Finally, we take a slowly diverging sequence $\gamma_n$ such that $\gamma_nL_n\sqrt{K\log(p)/n} \to 0$.
Then we choose $N_n = \gamma_nL_n\sqrt{|M|\,n^{-1}\log p}$. Since $\widehat{\beta}_u(M)\in\mathcal{B}_M(N_n)$, we combine equations (A.3) and (A.4) to obtain
$$\sup_{|M|\le K}|M|^{-1/2}\|\beta_{n,0}(M) - \widehat{\beta}_u(M)\| \le \sup_{|M|\le K}\Big(\frac{T_M(N_n)}{|M|}\Big)^{1/2}\sqrt{2(c_0c_1)^{-1}} = O_p\big[L_n\sqrt{n^{-1}\log p}\big],$$
where the last step follows from Lemma 4. This completes the proof of Lemma 1.

B.2 Lemma 2 and its proof
Lemma 2.
Assume that Y , · · · , Y n are independent and satisfy Condition 1. Then, for anyconstant δ > , there exist large enough positive constants C and C such that k W k ∞ ≤ C log n, (A.5)2 ith probability at least − O ( n − δ ) and, k n − / E [ W | Ω n ] k = O ((log n ) n − C ) , (A.6) where Ω n = {k W k ∞ ≤ C log n } .Proof . We take t = C log n in Condition 1. So we get P ( k W k ∞ ≤ C log n ) ≥ − nP ( | W | > C log n ) ≥ − c n − c − C . We choose C large enough so that 1 − c − C ≤
0. Thus, we have P ( k W k ∞ ≤ C log n ) =1 − O ( n − δ ) where we pick δ = c − C − >
0. This proves the first part of the lemma.Now, we proceed the proof of the second part of the lemma. We will bound each term E [ W i | Ω n ] for i = 1 , · · · , n . Since { W i } for i = 1 , · · · , n are independent, the conditionalexpectation E [ W i | Ω n ] can be written as follows E [ W i | Ω n ] = E [ W i | | W i | ≤ C log n ] = E [ W i {| W i | ≤ C log n } ] P ( | W i | ≤ C log n ) . Since E W = 0 by definition, we get E [ W i {| W i | ≤ C log n } ] = − E [ W i {| W i | > C log n } ].Last two equalities result in | E [ W i | Ω n ] | ≤ E [ | W i | {| W i | > C log n } ] P ( | W i | ≤ C log n ) . We already showed that the denominator P ( | W i | ≤ C log n ) can be bounded below by 1 − O ( n − δ ) uniformly in i . Thus, it suffices to bound the numerator E [ | W i | {| W i | > C log n } ].Indeed, we have E [ | W i | {| W i | > C log n } ] = Z ∞ P ( | W i | {| W i | > C log n } ≥ t ) dt = Z C log n P ( | W i | {| W i | > C log n } ≥ t ) dt + Z ∞ C log n P ( | W i | {| W i | > C log n } ≥ t ) dt = Z C log n P ( | W i | ≥ C log n ) dt + Z ∞ C log n P ( | W i | ≥ t ) dt ≤ C log nP ( | W i | ≥ C log n ) + Z ∞ C log n c exp( − c − t ) dt ≤ C log nc exp( − c − C log n ) + c exp( − c − C log n ) , where we use Condition 1 in the last two steps. This concludes the proof of Lemma 2 bychoosing C = c − C . B.3 Lemma 3 and its proof
Lemma 3.
Under Condition 2, the function ρ defined by ρ ( x Ti β , Y i ) = Y i x Ti β − b ( x Ti β ) isLipschitz continuous with the Lipschitz constant L n = 2 m n + C log n conditioned on the set Ω n = {k W k ∞ ≤ C log n } given in Lemma 2. roof . We consider the difference ρ ( x Ti β , Y i ) − ρ ( x Ti β , Y i ) for any β and β in R p . Weobserve that | ρ ( x Ti β , Y i ) − ρ ( x Ti β , Y i ) | ≤ | Y i || x Ti ( β − β ) | + | b ( x Ti β ) − b ( x Ti β ) | . We can bound | Y i | on Ω n using Condition 2 as | Y i | ≤ k Y k ∞ ≤ k E Y k ∞ + k W k ∞ ≤ m n + C log( n ). Then we apply the mean-value theorem to obtain | b ( x Ti β ) − b ( x Ti β ) | ≤| b (˜ β ) || x Ti ( β − β ) | where ˜ β lies on the line segment connecting β and β . Thus, we get | b ( x Ti β ) − b ( x Ti β ) | ≤ m n | x Ti ( β − β ) | by Condition 2. Hereby, we showed that | ρ ( x Ti β , Y i ) − ρ ( x Ti β , Y i ) | ≤ (2 m n + C log n ) | x Ti β − x Ti β | conditioned on Ω n . Thus, ρ ( · , Y i ) is Lipschitzcontinuous with the Lipschitz constant L n = 2 m n + C log n conditioned on the set Ω n . Thiscompletes the proof of Lemma 3. B.4 Lemma 4 and its proof
Lemma 4.
Assume that Conditions 1, 2(i), 3(i), and 3(iii) hold. Define the neighborhood B M ( N ) = { β ∈ R d , supp( β ) = M : k β − β n, ( M ) k ≤ N } and T M ( N ) = sup β ∈B M ( N ) n − (cid:12)(cid:12) ‘ n ( β ) − ‘ n ( β n, ( M )) − E [ ‘ n ( β ) − ‘ n ( β n, ( M ))] (cid:12)(cid:12) . If γ n is a slowly diverging sequence such that γ n L n p Kn − log p → , then sup | M |≤ K | M | T M (cid:16) γ n L n p | M | n − log p (cid:17) = O ( L n n − log p ) with probability at least (1 − e p − c γ n )(1 − O ( n − δ )) , where L n = 2 m n + C log n .Proof . To prove the lemma, we condition on the set Ω n = {k Y − EY k ∞ ≤ C log n } . Weobserve that (cid:12)(cid:12) ‘ n ( β ) − ‘ n ( β n, ( M )) − E [ ‘ n ( β ) − ‘ n ( β n, ( M ))] (cid:12)(cid:12) ≤ (cid:12)(cid:12) ‘ n ( β ) − ‘ n ( β n, ( M )) − E [ ‘ n ( β ) − ‘ n ( β n, ( M )) | Ω n ] (cid:12)(cid:12) + | E [ ‘ n ( β ) − ‘ n ( β n, ( M ))] − E [ ‘ n ( β ) − ‘ n ( β n, ( M )) | Ω n ] | , by the triangle inequality. Thus, T M ( N n ) can be bounded by the sum of the following twoterms:˜ T M ( N n ) = sup β ∈B M ( N n ) n − (cid:12)(cid:12) ‘ n ( β ) − ‘ n ( β n, ( M )) − E [ ‘ n ( β ) − ‘ n ( β n, ( M )) | Ω n ] (cid:12)(cid:12) , and R M ( N n ) = sup β ∈B M ( N n ) n − { E [ ‘ n ( β ) − ‘ n ( β n, ( M ))] − E [ ‘ n ( β ) − ‘ n ( β n, ( M )) | Ω n ] } That is, T M ( N n ) ≤ ˜ T M ( N n ) + R M ( N n ) . (A.7)4n the rest of the proof, we will show the following bounds R M ( N n ) = o (cid:18) L n log pn (cid:19) , (A.8)and ˜ T M ( N n ) = O p (cid:18) L n log pn (cid:19) . (A.9)First, we consider R M ( N n ). We split R M ( N n ) by the Cauchy–Schwarz inequality so that R M ( N n ) = sup β ∈B M ( N n ) n − | ( EY − E [ Y | Ω n ]) T X [ β − β n, ( M )] |≤ k n − / ( EY − E [ Y | Ω n ]) k sup β ∈B M ( N n ) k n − / X [ β − β n, ( M )] k . We have k n − / ( EY − E [ Y | Ω n ]) k = k n − / ( E [ W | Ω n ]) k = O ( n − C log n )by Lemma 2. We also have k n − / X ( β − β n, ( M )) k ≤ ( λ max ( n − X T M X M )) / k β − β n, ( M ) k ≤ c − / N n , for any β ∈ B M ( N n ).Therefore, R M ( β ) = O ( N n n − C log n ). So, (A.8) follows by taking C large enough.Next, we deal with the term ˜ T M ( N n ) by showing (A.9). We observe that the difference ‘ n ( β ) − ‘ n ( β n, ( M )) can be written as ‘ n ( β ) − ‘ n ( β n, ( M )) = n X i =1 n Y i [ x Ti β − x Ti β n, ( M )] − [ b ( x Ti β ) − b ( x Ti β n, ( M ))] o = n X i =1 (cid:2) ρ ( x Ti β , Y i ) − ρ ( x Ti β n, ( M ) , Y i ) (cid:3) . In Lemma 3, we showed that ρ ( x Ti β , Y i ) = Y i x Ti β − b ( x Ti β ) is Lipschitz continuous with theLipschitz constant L n conditioned on the set Ω n .Next, we choose a Rademacher sequence { (cid:15) i } ni =1 . Then, we apply symmetrization andconcentration inequalities in [5] as follows: E [ ˜ T M ( N n ) | Ω n ] ≤ E " sup β ∈B M ( N n ) n − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 (cid:15) i (cid:2) ρ ( x Ti β , Y i ) − ρ ( x Ti β n, ( M ) , Y i ) (cid:3)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) | Ω n ≤ L n E " sup β ∈B M ( N n ) n − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 (cid:15) i ( x Ti β − x Ti β n, ( M )) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) | Ω n . E " sup β ∈B M ( N n ) n − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 (cid:15) i ( x Ti β − x Ti β n, ( M )) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) | Ω n ≤ E " n − sup β ∈B M ( N n ) k β − β n, ( M ) k k n X i =1 (cid:15) i ( x i ) M k | Ω n ≤ E " n − N n k n X i =1 (cid:15) i ( x i ) M k | Ω n = n − N n E X j ∈ M n X i =1 (cid:15) i x ij ! 
/ ≤ n − N n X j ∈ M E n X i =1 (cid:15) i x ij ! / = N n n − / | M | / , where we use the Cauchy–Schwarz inequality and the assumption P ni =1 x ij = n . Therefore,we obtain the bound E [ ˜ T M ( N n ) | Ω n ] ≤ L n N n n − / | M | / . (A.10)For any β ∈ B M ( N n ), we have n − n X i =1 | ρ ( x Ti β n, ( M ) , Y i ) − ρ ( x Ti β , Y i ) | ≤ n − L n n X i =1 | x Ti β n, ( M ) − x Ti β | = n − L n ( β n, ( M ) − β ) T X T M X M ( β n, ( M ) − β ) ≤ L n c − N n . Then we apply Theorem 14.2 in [5] to obtain P (cid:16) ˜ T M ( N n ) ≥ E [ ˜ T M ( N n ) | Ω n ] + t | Ω n (cid:17) ≤ exp (cid:18) − nc t L n N n (cid:19) . Now, we take t = 4 L n N n n − / | M | / u for some positive u that will be chosen later. So,we get P ( ˜ T M ( N n ) ≥ L n N n n − / | M | / (1 + u ) | Ω n ) ≤ exp( − c u | M | ) by using (A.10).We choose N n = L n n − / | M | / (1 + u ). So, it follows that P ˜ T M ( N n ) | M | ≥ L n n − (1 + u ) | Ω n ! ≤ exp( − c u | M | ) . Thus, we have P sup | M |≤ K ˜ T M ( N n ) | M | ≥ L n n − (1 + u ) | Ω n ! ≤ X | M |≤ K P ˜ T M ( N n ) | M | ≥ L n n − (1 + u ) | Ω n ! ≤ X k ≤ K (cid:18) pk (cid:19) exp( − c u k ) ≤ X k ≤ K (cid:16) pek (cid:17) k exp( − c u k ) . u = γ n √ log p . So, for n large enough, we get X k ≤ K (cid:16) pek (cid:17) k exp( − c u k ) = X k ≤ K (cid:16) pek (cid:17) k p − c γ n k = X k ≤ K ( ep (1 − c γ n ) ) k k k ≤ X k ≤ K ep (1 − c γ n ) k ! ≤ e p − c γ n . So far, the probability of the event ˜ T M ( N n ) = O ( L n log p/n ), which we call A , is boundedbelow conditional on Ω n . Simple calculation yields P ( A ) ≥ P ( A ∩ Ω n ) = P (Ω n ) P ( A | Ω n ).Thus, P ( A ) ≥ (1 − e p − c γ n )(1 − O ( n − δ )). So, (A.9) follows.We have shown (A.8) and (A.9), which control the terms ˜ T M ( N n ) and R M ( N n ), respec-tively. Thus, (A.7) concludes the proof of Lemma 4. B.5 Lemma 5 and its proof
Lemma 5.
Let q i ’s be n independent, but not necessarily identically distributed, scaled andcentered random variables with uniform sub-exponential decay, that is, P ( | q i | > t ) ≤ C exp( − C − t ) for some positive constant C . Let k q i k ψ denote the sub-exponential norm defined by k q i k ψ := sup m ≥ n m − ( E | q i | m ) /m o . Then, we have k q i k ψ ≤ e /e C ( C ∨ for all i .Proof . From the condition on sub-exponential tails, we derive E | q i | m = m Z ∞ x m − P ( | q i | ≥ x ) dx ≤ Cm Z ∞ x m − exp( − C − x ) dx = CmC m Z ∞ u m − exp( − u ) du = CmC m Γ( m ) ≤ CmC m m m , where the last line follows from the definition of the Gamma function. Taking the m th root,we have ( E | q i | m ) /m ≤ ( Cm ) /m Cm.
Rewriting above equation, we obtain m − ( E | q i | m ) /m ≤ m /m C /m C ≤ e /e ( C ∨ C, for all m ≥
1. Since the bound is independent of m , it holds that k q i k ψ ≤ e /e C ( C ∨
1) for all $i$. This completes the proof of Lemma 5.

B.6 Lemma 6 and its proof

Lemma 6.
Under Condition 1, for some constant γ > , we have sup n E | ( u Tn R n u n ) /µ n | γ < ∞ , where u n = B − / n X T ( Y − E Y ) , R n = B / n A − n B / n , and µ n = tr( A − n B n ) ∨ .Proof . From the expression of u Tn R n u n , we have u Tn R n u n =( Y − E Y ) T XA − n X T ( Y − E Y )=[( Y − E Y ) T cov( Y ) − / ][cov( Y ) / XA − n X T cov( Y ) / ] · [cov( Y ) − / ( Y − E Y )] . Denote S n = cov( Y ) / XA − n X T cov( Y ) / and q = cov( Y ) − / ( Y − E Y ). We decompose u Tn R n u n into two terms, the summations of the diagonal entries and the off-diagonal entries,respectively, u Tn R n u n = q T S n q = n X i =1 s ii q i + X ≤ i = j ≤ n s ij q i q j , where s ij and q i denote the ( i, j )th entry of S n and i th entry of q . Then, we have E ( u Tn R n u n ) = n X i =1 s ii E ( q i ) + X ≤ i = j ≤ n s ii s jj E ( q i ) E ( q j )+ 2 X ≤ i = j ≤ n s ij E ( q i ) E ( q j ) . Using Condition 1 and the sub-Gaussian norm bound in Lemma 5, both quantities E ( q i ) and E ( q i ) E ( q j ) can be uniformly bounded by a common constant. Hence E ( u Tn R n u n ) ≤ O (1) · { [tr( S n )] + tr( S n ) } . Since S n is positive semidefinite it holds that tr( S n ) ≤ [tr( S n )] . Finally noting that tr( S n ) =tr( A − n B n ) ≤ µ n , we see that sup n E | ( u Tn R n u n ) /µ n | γ < ∞ for γ = 1, which concludes theproof of Lemma 6. C Additional technical details
Lemmas 7–10 below are similar to those in [28]. Their proofs can be found in [28] or withminor modifications.
Lemma 7.
Under Condition 4, for j = 1 , , we have c Z δ ∈ R d e − nq j e N n ( δ n ) dµ ≤ E µ M h e − nq j e N n ( δ n ) i ≤ c Z δ ∈ R d e − nq j e N n ( δ n ) dµ . (A.11)8 emma 8. Conditional on the event e Q n , for sufficiently large n we have E µ M [ U n ( β ) n e N cn ( δ n ) ] ≤ exp {− [ κ n − ρ n ( δ n ) / dδ n } (A.12) ≤ exp[ − ( κ n / dδ n ] , where κ n = λ min ( V n ) / . Lemma 9.
It holds that Z δ ∈ R d e − nq dµ = (cid:18) πn (cid:19) d/ | V n − ρ n ( δ n ) I d | − / (A.13) and Z δ ∈ R d e − nq dµ = (cid:18) πn (cid:19) d/ | V n + ρ n ( δ n ) I d | − / . (A.14) Lemma 10.
For j = 1 , , it holds that Z δ ∈ R d e − nq j e N cn ( δ n ) dµ ≤ (cid:18) πnκ n (cid:19) d/ exp h − ( p κ n dδ n − √ d ) / i . (A.15) Lemma 11 ([35]) . For independent sub-exponential random variables { y i } ni =1 , we have thatthe sub-exponential norm of q i = { var( y i ) } − / ( y i − Ey i ) is bounded by some positive constant C . Moreover, the following Bernstein-type tail probability bound holds P ( | n X i =1 a i q i | ≥ t ) ≤ (cid:20) − C min (cid:18) t C k a k , tC k a k ∞ (cid:19)(cid:21) for a ∈ R n , t ≥ . Lemma 11 rephrases Proposition 5.16 of [35] for the case where k q i k Ψ ≤ C . Further,for our proof we need to characterize the concentration of the square of a sub-exponentialrandom variable. In this regard, we define a general α -sub-exponential random variable ξ α which satisfies P ( | ξ α | > t α ) ≤ H exp( − t/H )for H, t >
0. Note that the usual sub-exponential $q_i$'s are 1-sub-exponential random variables. It may be useful to note that the squares of ordinary sub-exponential random variables are 2-sub-exponential, which is the case needed in the proof of Theorem 2.

Lemma 12 ([12]). For independent $\alpha$-sub-exponential random variables $q_i$, the following Bernstein-type tail probability bound holds:
$$P\Big(\Big|\sum_{i=1}^n a_iq_i - E\Big[\sum_{i=1}^n a_iq_i\Big]\Big| \ge t\Big) \le C_1\exp\Big\{-\Big(\frac{C_2\,t}{\sup_i\mathrm{var}^{1/2}(q_i)\,\|a\|_2}\Big)^{1/\alpha}\Big\}$$
for $a\in\mathbb{R}^n$, $t \ge \sup_i\mathrm{var}^{1/2}(q_i)\|a\|_2$, and positive constants $C_1$ and $C_2$ depending on the choice of $\alpha$ and $H$.
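To give a feel for the Bernstein-type bounds in Lemmas 11 and 12, the following Monte Carlo sketch compares the empirical tail of a weighted sum of centered exponential (hence sub-exponential) variables with a Bernstein-type envelope. The constants in the envelope are illustrative and are not the constants of the lemmas.

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 200, 20000
a = rng.uniform(0.5, 1.5, size=n) / np.sqrt(n)       # fixed weight vector
q = rng.exponential(1.0, size=(reps, n)) - 1.0        # centered sub-exponential variables
S = q @ a                                              # the weighted sums

l2, linf = np.linalg.norm(a), np.abs(a).max()
for t in (1.0, 2.0, 3.0):
    emp = np.mean(np.abs(S) >= t)
    # Bernstein-type envelope mixing Gaussian and exponential decay (illustrative constants).
    env = 2 * np.exp(-0.25 * min(t**2 / l2**2, t / linf))
    print(f"t={t}: empirical tail {emp:.4f}  vs  envelope {env:.4f}")
```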