A Variant of AIC based on the Bayesian Marginal Likelihood
Yuki Kawakubo∗, Tatsuya Kubokawa† and Muni S. Srivastava‡

Abstract
We propose information criteria that measure the prediction risk of a predictive density based on the Bayesian marginal likelihood from a frequentist point of view. We derive criteria for selecting variables in linear regression models, assuming a prior distribution of the regression coefficients. Then, we discuss the relationship between the proposed criteria and related criteria. There are three advantages of our method. First, it is a compromise between the frequentist and Bayesian standpoints, because it evaluates the frequentist's risk of the Bayesian model; thus, it is less influenced by prior misspecification. Second, the criteria exhibit consistency when selecting the true model. Third, when a uniform prior is assumed for the regression coefficients, the resulting criterion is equivalent to the residual information criterion (RIC) of Shi and Tsai (2002).
Keywords: AIC; BIC; Consistency; Kullback–Leibler divergence; Linear regression model; Residual information criterion; Variable selection.
Introduction

The problem of selecting appropriate models has been studied extensively in the literature since the work of Akaike (1973, 1974), who derived the so-called Akaike information criterion (AIC). There are several approaches to solving the model selection problem: information criteria, such as the AIC or BIC (Schwarz, 1978); shrinkage methods, such as the lasso (Tibshirani, 1996); Bayesian techniques; among others. With regard to Bayesian techniques, O'Hara and Sillanpaa (2009) provide a good review of the key works in this field, including Kuo and Mallick (1998), Dellaportas et al. (1997), and George and McCulloch (1993, 1997). In addition, a Bayesian lasso procedure based on a spike-and-slab prior has attracted much attention.

∗ Graduate School of Social Sciences, Chiba University, 1-33, Yayoi-cho, Inage-ku, Chiba, 263-8522, JAPAN (E-mail: [email protected])
† Faculty of Economics, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, JAPAN (E-mail: [email protected])
‡ Department of Statistics, University of Toronto, 100 St George Street, Toronto, Ontario, CANADA M5S 3G3 (E-mail: [email protected])

General Framework
In this section, we describe the concept of information criteria from a general point of view. Let $y$ be an $n$-variate observable random vector, with density $f(y \mid \omega)$ for a vector of unknown parameters $\omega$. Let $\hat f(\tilde y; y)$ be a predictive density of $f(\tilde y \mid \omega)$, where $\tilde y$ is an $n$-variate independent replication of $y$. Here, we evaluate the predictive performance of $\hat f(\tilde y; y)$ in terms of the following risk:

$$R(\omega; \hat f) = \int \left[ \int \log \left\{ \frac{f(\tilde y \mid \omega)}{\hat f(\tilde y; y)} \right\} f(\tilde y \mid \omega) \, d\tilde y \right] f(y \mid \omega) \, dy. \quad (1)$$

Because this is interpreted as a risk with respect to the KL divergence, we call it the KL risk. The spirit of AIC suggests that we can provide an information criterion for model selection as an (asymptotically) unbiased estimator of the information, as follows:

$$I(\omega; \hat f) = \int\!\!\int -2\log\{\hat f(\tilde y; y)\}\, f(\tilde y \mid \omega) f(y \mid \omega)\, d\tilde y\, dy = E_\omega\big[-2\log\{\hat f(\tilde y; y)\}\big], \quad (2)$$

which is part of (1) (multiplied by 2), where $E_\omega$ denotes the expectation with respect to the distribution $f(\tilde y, y \mid \omega) = f(\tilde y \mid \omega) f(y \mid \omega)$. Let $\Delta = I(\omega; \hat f) - E_\omega[-2\log\{\hat f(y; y)\}]$. Then, the AIC variant based on the predictor $\hat f(\tilde y; y)$ is defined by

$$\mathrm{IC}(\hat f) = -2\log\{\hat f(y; y)\} + \hat\Delta, \quad (3)$$

where $\hat\Delta$ is an (asymptotically) unbiased estimator of $\Delta$ and $\hat f(y; y)$ is the value of the function $\hat f(\tilde y; y)$ evaluated at $\tilde y = y$.

Note that IC($\hat f$) produces the AIC and BIC for specific predictive densities.

(AIC) Use $\hat f(\tilde y; y) = f(\tilde y \mid \hat\omega)$ for the maximum likelihood estimator $\hat\omega$ of $\omega$. Then, IC($f(\tilde y \mid \hat\omega)$) is the exact AIC, or the corrected AIC suggested by Sugiura (1978) and Hurvich and Tsai (1989), which is approximated by the AIC of Akaike (1973, 1974) as $-2\log\{f(y \mid \hat\omega)\} + 2\dim(\omega)$.

(BIC) Use $\hat f(\tilde y; y) = f_\pi(\tilde y) = \int f(\tilde y \mid \omega)\,\pi(\omega)\,d\omega$ for a proper prior distribution $\pi(\omega)$. It is evident that $I(\omega; f_\pi) = E_\omega[-2\log\{f_\pi(y)\}]$. Thus, we have $\Delta = 0$, so that IC($f_\pi$) $= -2\log\{f_\pi(y)\}$, which is the Bayesian marginal likelihood. Note that $-2\log\{f_\pi(y)\}$ is approximated by BIC $= -2\log\{f(y \mid \hat\omega)\} + \log(n)\cdot\dim(\omega)$.

The criterion IC($\hat f$) in (3) can produce not only the conventional AIC and BIC, but also various other criteria. Hereafter, we consider that $\omega$ is divided as $\omega = (\beta^t, \theta^t)^t$, for a $p$-dimensional parameter vector of interest $\beta$ and a $q$-dimensional nuisance parameter vector $\theta$. We assume that $\beta$ has prior density $\pi(\beta \mid \lambda, \theta)$, with hyperparameter $\lambda$. The model is given as follows:
$$y \mid \beta \sim f(y \mid \beta, \theta), \qquad \beta \sim \pi(\beta \mid \lambda, \theta),$$
where $\theta$ and $\lambda$ are estimated from the data. An inference based on such a model is called an empirical Bayes procedure. Here, we consider the predictive density $\hat f(\tilde y; y)$ as
$$\hat f(\tilde y; y) = f_\pi(\tilde y \mid \hat\lambda, \hat\theta) = \int f(\tilde y \mid \beta, \hat\theta)\,\pi(\beta \mid \hat\lambda, \hat\theta)\,d\beta$$
for some estimators $\hat\lambda$ and $\hat\theta$. Then, the information given in (2) is
$$I(\omega; f_\pi) = \int\!\!\int -2\log\{f_\pi(\tilde y \mid \hat\lambda, \hat\theta)\}\, f(\tilde y \mid \beta, \theta) f(y \mid \beta, \theta)\, d\tilde y\, dy, \quad (4)$$
and the resulting information criterion is
$$\mathrm{IC}(f_\pi) = -2\log\{f_\pi(y \mid \hat\lambda, \hat\theta)\} + \hat\Delta_{f_\pi}, \quad (5)$$
where $\hat\Delta_{f_\pi}$ is an (asymptotically) unbiased estimator of $\Delta_{f_\pi} = I(\omega; f_\pi) - E_\omega[-2\log\{f_\pi(y \mid \hat\lambda, \hat\theta)\}]$.
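As a concrete illustration of the generic construction in (2) and (3), the bias correction $\Delta$ can be approximated by Monte Carlo. The sketch below is our illustration, not part of the original paper; it assumes NumPy and uses the plug-in predictive density $f(\tilde y \mid \hat\omega)$ in an i.i.d. normal model with $\omega = (\mu, \sigma^2)$, where $\Delta$ should be close to the AIC penalty $2\dim(\omega) = 4$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu, sigma, n_rep = 50, 1.0, 2.0, 20000

def neg2logf(z, mu_hat, var_hat):
    # -2 log of the N(mu_hat, var_hat) density evaluated at the sample z
    return z.size * np.log(2 * np.pi * var_hat) + np.sum((z - mu_hat) ** 2) / var_hat

gaps = np.empty(n_rep)
for r in range(n_rep):
    y = rng.normal(mu, sigma, n)          # observed data
    y_tilde = rng.normal(mu, sigma, n)    # independent replication of y
    mu_hat, var_hat = y.mean(), y.var()   # maximum likelihood estimators from y
    # one Monte Carlo draw of I(omega; f-hat) minus E[-2 log f-hat(y; y)]
    gaps[r] = neg2logf(y_tilde, mu_hat, var_hat) - neg2logf(y, mu_hat, var_hat)

print(gaps.mean())  # close to 2 * dim(omega) = 4
```

For finite $n$ the average is slightly above 4, which is exactly the gap that the corrected AIC of Sugiura (1978) and Hurvich and Tsai (1989) accounts for.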
There are three motivations for considering the information $I(\omega; f_\pi)$ in (4) and the information criterion IC($f_\pi$) in (5).

First, the precision of the Bayesian predictor $f_\pi(\tilde y \mid \hat\lambda, \hat\theta)$ is characterized by the risk $R(\omega; f_\pi)$ in (1), which is based on a frequentist point of view. On the other hand, the Bayesian risk is defined by
$$r(\psi; \hat f) = \int R(\omega; \hat f)\,\pi(\beta \mid \lambda, \theta)\,d\beta, \quad (6)$$
which measures the prediction error of $\hat f(\tilde y; y)$ under the assumption that the prior information is correct, where $\psi = (\lambda^t, \theta^t)^t$. The resulting Bayesian criteria, such as the PIC (Kitagawa, 1997) or the DIC (Spiegelhalter et al., 2002), are sensitive to prior misspecification because they depend on the prior information. However, because $R(\omega; f_\pi)$ measures the prediction error of the Bayesian model from a frequentist standpoint, the resulting criterion IC($f_\pi$) is less influenced by prior misspecification.

Second, this criterion has the property of consistency. In Section 3, we derive criteria for the variable selection problem in normal linear regression models, and prove that the criteria select the true model with probability tending to one. The BIC and the marginal likelihood are known to exhibit consistency, while most AIC-type criteria are not consistent. However, AIC-type criteria can choose a good model in the sense of minimizing the prediction error. Our proposed criterion should include both properties, namely consistency when selecting the parameters of interest $\beta$, and the tendency to select a good model in terms of its predictive ability.

Lastly, we can construct the information criterion IC($f_\pi$) even when the prior distribution of $\beta$ is improper, because the information $I(\omega; f_\pi)$ in (4) can be defined formally for the corresponding improper marginal likelihood. However, because the Bayesian risk $r(\psi; f_\pi)$ does not exist for an improper prior, we cannot obtain the corresponding Bayesian criteria or use the Bayesian risk. Note that our criterion is equivalent to the residual information criterion (RIC) of Shi and Tsai (2002) if we assume a uniform prior on the regression coefficients. In general, the marginal likelihood based on an improper prior depends on an arbitrary scalar constant, which can be included in the prior, and that might be problematic when selecting the model. However, the criterion based on our approach, using a uniform prior, can work as a variable selection criterion. This is discussed further in Remark 1 in Section 3.

Related Criteria

To clarify our proposed approach, we first explain related information criteria that are based on a Bayesian model. When the prior distribution $\pi(\beta \mid \lambda, \theta)$ is proper, we can treat the Bayesian prediction risk $r(\psi; \hat f)$ in (6). When $\psi = (\lambda^t, \theta^t)^t$ is known, the predictive density $\hat f(\tilde y; y)$ that minimizes $r(\psi; \hat f)$ is the Bayesian predictive density (posterior predictive density) $f^*_\pi(\tilde y \mid y, \psi)$, given by
$$f^*_\pi(\tilde y \mid y, \psi) = \int f(\tilde y \mid \beta, \theta)\,\pi(\beta \mid y, \lambda, \theta)\,d\beta = \frac{\int f(\tilde y \mid \beta, \theta) f(y \mid \beta, \theta)\,\pi(\beta \mid \lambda, \theta)\, d\beta}{\int f(y \mid \beta, \theta)\,\pi(\beta \mid \lambda, \theta)\, d\beta}.$$
When $\psi$ is unknown, we can consider the Bayesian risk of the plug-in predictive density $f^*_\pi(\tilde y \mid y, \hat\psi)$. In this case, the resulting criterion is known as the predictive likelihood (Akaike, 1980a) or the PIC (Kitagawa, 1997). The deviance information criterion (DIC) of Spiegelhalter et al. (2002) and the Bayesian predictive information criterion (BPIC) of Ando (2007) are related criteria based on the Bayesian prediction risk $r(\psi; \hat f)$.

Akaike's Bayesian information criterion (ABIC) (Akaike, 1980b) is another information criterion based on the Bayesian marginal likelihood, given by
$$\mathrm{ABIC} = -2\log\{f_\pi(y \mid \hat\lambda)\} + 2\dim(\lambda),$$
where the nuisance parameter $\theta$ is not considered.
The ABIC measures the following KL risk:
$$\int \left[ \int \log\left\{ \frac{f_\pi(\tilde y \mid \lambda)}{f_\pi(\tilde y \mid \hat\lambda)} \right\} f_\pi(\tilde y \mid \lambda)\, d\tilde y \right] f_\pi(y \mid \lambda)\, dy,$$
which is not the same as either $R(\omega; \hat f)$ or $r(\psi; \hat f)$. The ABIC is used to choose the hyperparameter $\lambda$ in the same manner as the AIC. However, note that the ABIC works as a model selection criterion for $\beta$ because it is based on the Bayesian marginal likelihood.

Application to Linear Regression Models
In this section, we derive variable selection criteria for normal linear regression models. First, we consider a collection of candidate models, defined as follows. Let the $n \times p_\omega$ matrix $X_\omega$ consist of all explanatory variables, and assume that $\mathrm{rank}(X_\omega) = p_\omega$. In order to define candidate models using the index set $j$, suppose that $j$ denotes a subset of $\omega = \{1, \ldots, p_\omega\}$ containing $p_j$ elements (i.e., $p_j = \#(j)$) and that $X_j$ consists of the $p_j$ columns of $X_\omega$ indexed by the elements of $j$. We define the class of candidate models as $\mathcal{J} = \mathcal{P}(\omega)$, namely the power set of $\omega$, where $\omega$ denotes the full model. We assume that the true model exists in the class of candidate models $\mathcal{J}$, and denote it by $j_*$. Note that the dimension of the true model is $p_{j_*}$, which we abbreviate to $p_*$.

The candidate model $j$ is the linear regression model
$$y = X_j \beta_j + \varepsilon, \quad (7)$$
where $y$ is an $n \times 1$ vector of response variables, $X_j$ is an $n \times p_j$ matrix of the explanatory variables, $\beta_j$ is a $p_j \times 1$ vector of regression coefficients, and $\varepsilon$ is an $n \times 1$ vector of error terms. We assume that $\varepsilon$ has the distribution $N_n(0, \sigma^2 V)$, where $\sigma^2$ is an unknown scalar and $V$ is a known positive definite matrix.

We consider the problem of selecting the explanatory variables, and assume that the true model can be expressed by each candidate model. This is the common assumption used to derive an information criterion. Under this assumption, the true mean of $y$ can be written as $E(y) = X_j \beta^*_j$, where $\beta^*_j$ is a $p_j \times 1$ vector, $p_j - p_*$ components of which are exactly 0, and the remaining components are not 0. Hereafter, we omit the model index $j$ for notational convenience. Furthermore, we abbreviate $\beta^*_j$ as $\beta$.

Now, we construct variable selection criteria of the form (5) for the regression model (7). We consider the following two situations.

[i] A normal prior for $\beta$. We first assume a normal prior distribution for $\beta$,
$$\pi(\beta \mid \sigma^2) \sim N_p(0, \sigma^2 W),$$
where $W$ is a $p \times p$ matrix, suitably chosen with full rank. Examples of $W$ are $W = (\lambda X^t X)^{-1}$ for $\lambda > 0$, when $V$ is the identity matrix, as introduced by Zellner (1986), or, more simply, $W = \lambda^{-1} I_p$. For the moment, we assume that $\lambda$ is known. We discuss how to determine it in Section 3.2. Because the likelihood is $f(y \mid \beta, \sigma^2) \sim N(X\beta, \sigma^2 V)$, the marginal likelihood function is
$$f_\pi(y \mid \sigma^2) = \int f(y \mid \beta, \sigma^2)\,\pi(\beta \mid \sigma^2)\,d\beta = (2\pi\sigma^2)^{-n/2}\, |V|^{-1/2}\, |W X^t V^{-1} X + I_p|^{-1/2} \exp\{-y^t A y/(2\sigma^2)\},$$
$$A = V^{-1} - V^{-1} X (X^t V^{-1} X + W^{-1})^{-1} X^t V^{-1}.$$
Note that $A = (V + B)^{-1}$ for $B = X W X^t$; that is, $f_\pi(y \mid \sigma^2) \sim N(0, \sigma^2(V + B))$. Then, we take the predictive density as $\hat f(\tilde y; y) = f_\pi(\tilde y \mid \hat\sigma^2)$, and the information (4) can be written as
$$I_{\pi,1}(\omega) = E_\omega\big[ n\log(2\pi\hat\sigma^2) + \log|V| + \log|W X^t V^{-1} X + I_p| + \tilde y^t A \tilde y/\hat\sigma^2 \big], \quad (8)$$
where $\hat\sigma^2 = y^t P y/n$, $P = V^{-1} - V^{-1} X (X^t V^{-1} X)^{-1} X^t V^{-1}$, and $E_\omega$ denotes the expectation with respect to the distribution $f(\tilde y, y \mid \beta, \sigma^2) = f(\tilde y \mid \beta, \sigma^2) f(y \mid \beta, \sigma^2)$ for $\omega = (\beta^t, \sigma^2)^t$. Note that $\beta$ is the parameter of interest, and $\sigma^2$ is the nuisance parameter corresponding to $\theta$ in the previous section. Then, we propose the following information criterion.
Proposition 1. The information $I_{\pi,1}(\omega)$ in (8) is unbiasedly estimated by the information criterion
$$\mathrm{IC}_{\pi,1} = -2\log\{f_\pi(y \mid \hat\sigma^2)\} + \frac{2n}{n - p - 2}, \quad (9)$$
where $-2\log\{f_\pi(y \mid \hat\sigma^2)\} = n\log(2\pi\hat\sigma^2) + \log|V| + \log|W X^t V^{-1} X + I_p| + y^t A y/\hat\sigma^2$; that is, $E_\omega(\mathrm{IC}_{\pi,1}) = I_{\pi,1}(\omega)$.

If $n^{-1} W^{1/2} X^t V^{-1} X W^{1/2}$ converges to a $p \times p$ positive definite matrix as $n \to \infty$, $\log|W X^t V^{-1} X + I_p|$ can be approximated as $p \log n$, which is the penalty term of the BIC. In that case, $\mathrm{IC}_{\pi,1}$ is approximately expressed as
$$\mathrm{IC}^*_{\pi,1} = n\log(2\pi\hat\sigma^2) + \log|V| + p\log n + 2 + y^t A y/\hat\sigma^2$$
when $n$ is large.

Note that the first term of $\mathrm{IC}_{\pi,1}$ alone can work as a variable selection criterion, because $f_\pi(y \mid \sigma^2)$ is the Bayesian marginal likelihood. The difference between them is
$$\frac{2n}{n - p - 2} = 2 + \frac{2(p + 2)}{n} + O(n^{-2}).$$
In other words, $\mathrm{IC}_{\pi,1}$ has a slight additional model-dependent penalty of order $n^{-1}$. We compare the performance of the criteria using simulations in Section 4.

Alternatively, the KL risk $r(\psi; \hat f)$ in (6) can be used to evaluate the risk of the predictive density $f_\pi(\tilde y \mid \hat\sigma^2)$, because the prior distribution is proper. Then, the resulting criterion is
$$\mathrm{IC}_{\pi,2} = n\log(2\pi\hat\sigma^2) + \log|V| + p\log n + p, \quad (10)$$
which is, up to a constant irrelevant to model comparison, an asymptotically unbiased estimator of $I_{\pi,2}(\sigma^2) = E_\pi[I_{\pi,1}(\omega)]$, where $E_\pi$ denotes the expectation with respect to the prior distribution $\pi(\beta \mid \sigma^2)$; that is, $E_\pi E_\omega(\mathrm{IC}_{\pi,2}) + (n + 2) - I_{\pi,2}(\sigma^2) \to 0$ as $n \to \infty$. Interestingly, $\mathrm{IC}_{\pi,2}$ is analogous to the criterion proposed by Bozdogan (1987), known as the consistent AIC, who suggested replacing the penalty term $2p$ in the AIC with $p + p\log n$.
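Proposition 1 translates directly into a few lines of linear algebra. The sketch below is ours, not the authors' code; it assumes NumPy, and the function name `ic_pi1` is our own:

```python
import numpy as np

def ic_pi1(y, X, V, W):
    """IC_{pi,1} of (9): -2 log f_pi(y | sigma^2-hat) + 2n/(n - p - 2)."""
    n, p = X.shape
    Vi = np.linalg.inv(V)
    XtViX = X.T @ Vi @ X
    P = Vi - Vi @ X @ np.linalg.solve(XtViX, X.T @ Vi)
    A = Vi - Vi @ X @ np.linalg.solve(XtViX + np.linalg.inv(W), X.T @ Vi)
    sig2 = y @ P @ y / n                      # sigma^2-hat = y'Py/n
    m2loglik = (n * np.log(2 * np.pi * sig2)
                + np.linalg.slogdet(V)[1]
                + np.linalg.slogdet(W @ XtViX + np.eye(p))[1]
                + y @ A @ y / sig2)
    return m2loglik + 2 * n / (n - p - 2)     # bias correction of Proposition 1
```

Replacing the log-determinant term $\log|WX^tV^{-1}X + I_p|$ by $p\log n$ in this function would give the large-sample version $\mathrm{IC}^*_{\pi,1}$.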
[ii] Uniform prior for $\beta$. We next assume a uniform prior for $\beta$, namely $\beta \sim \mathrm{uniform}(\mathbb{R}^p)$. Although this is an improper prior distribution, we can obtain the marginal likelihood function formally, as follows:
$$f_r(y \mid \sigma^2) = \int f(y \mid \beta, \sigma^2)\,d\beta = (2\pi\sigma^2)^{-(n-p)/2}\, |V|^{-1/2}\, |X^t V^{-1} X|^{-1/2} \exp\{-y^t P y/(2\sigma^2)\},$$
which is known as the residual likelihood (Patterson and Thompson, 1971). Then, we take the predictive density as $\hat f(\tilde y; y) = f_r(\tilde y \mid \tilde\sigma^2)$, and the information (4) can be written as
$$I_r(\omega) = E_\omega\big[(n - p)\log(2\pi\tilde\sigma^2) + \log|V| + \log|X^t V^{-1} X| + \tilde y^t P \tilde y/\tilde\sigma^2\big], \quad (11)$$
where $\tilde\sigma^2 = y^t P y/(n - p)$, which is the residual maximum likelihood (REML) estimator of $\sigma^2$ based on the residual likelihood $f_r(y \mid \sigma^2)$. Next, we propose the following information criterion.

Proposition 2. The information $I_r(\omega)$ in (11) is unbiasedly estimated by the information criterion
$$\mathrm{IC}_r = -2\log\{f_r(y \mid \tilde\sigma^2)\} + \frac{2(n - p)}{n - p - 2}, \quad (12)$$
where $-2\log\{f_r(y \mid \tilde\sigma^2)\} = (n - p)\log(2\pi\tilde\sigma^2) + \log|V| + \log|X^t V^{-1} X| + y^t P y/\tilde\sigma^2$; that is, $E_\omega(\mathrm{IC}_r) = I_r(\omega)$.

Note that $y^t P y/\tilde\sigma^2 = n - p$. If $n^{-1} X^t V^{-1} X$ converges to a $p \times p$ positive definite matrix as $n \to \infty$, $\log|X^t V^{-1} X|$ can be approximated by $p\log n$. Then, we can approximate the criterion as
$$\mathrm{IC}^*_r = (n - p)\log(2\pi\tilde\sigma^2) + \log|V| + p\log n + \frac{(n - p)^2}{n - p - 2}, \quad (13)$$
for large $n$. Note that $\mathrm{IC}^*_r$ is equivalent to the RIC proposed by Shi and Tsai (2002). Noting that $(n - p)^2/(n - p - 2) = (n + 2) + \{4/(n - p - 2) - p\}$, we can see that the difference between $\mathrm{IC}^*_r$ and the RIC is $n + 2 - p\log(2\pi\tilde\sigma^2)$; that is, $\mathrm{IC}^*_r = \mathrm{RIC} + n + 2 - p\log(2\pi\tilde\sigma^2)$. Note too that the criterion based on $f_r(y \mid \sigma^2)$ and $r(\psi; f_r)$ cannot be constructed, because its KL risk diverges to infinity.
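A corresponding sketch for $\mathrm{IC}_r$ in (12), again ours and assuming NumPy:

```python
import numpy as np

def ic_r(y, X, V):
    """IC_r of (12), based on the residual (REML) likelihood."""
    n, p = X.shape
    Vi = np.linalg.inv(V)
    XtViX = X.T @ Vi @ X
    P = Vi - Vi @ X @ np.linalg.solve(XtViX, X.T @ Vi)
    sig2 = y @ P @ y / (n - p)                # REML estimator sigma^2-tilde
    m2loglik = ((n - p) * np.log(2 * np.pi * sig2)
                + np.linalg.slogdet(V)[1]
                + np.linalg.slogdet(XtViX)[1]
                + (n - p))                    # y'Py / sigma^2-tilde = n - p
    return m2loglik + 2 * (n - p) / (n - p - 2)
```

Swapping the $\log|X^tV^{-1}X|$ term for $p\log n$ reproduces $\mathrm{IC}^*_r$ in (13).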
Remark 1. As discussed in Section 2.2, the marginal likelihood based on an improper prior depends on an arbitrary scalar constant, which can, in general, be problematic when selecting the model. However, our $\mathrm{IC}_r$, or its equivalent RIC, can work as a variable selection criterion. In order to show that, we compare $\mathrm{IC}_r$ with the AIC and BIC. When $V = I_n$, the AIC and BIC for the normal linear regression model can be expressed as
$$\mathrm{IC} = n\log(2\pi) + n\log(n^{-1}\mathrm{RSS}) + n + g(p),$$
where $g(p)$ is the penalty, which depends on $p$. Here, $g(p) = 2(p + 1)$ for the AIC and $g(p) = p\log(n)$ for the BIC. Then, RSS is the residual sum of squares, defined as $\mathrm{RSS} = \|y - X\hat\beta\|^2 = n\hat\sigma^2$. On the other hand, $\mathrm{IC}_r$ in (12) can be rewritten as
$$\mathrm{IC}_r = n\log(2\pi) + n\log(n^{-1}\mathrm{RSS}) + n + h(p),$$
where the first three terms are the same as those of the AIC and BIC, and $h(p)$ is
$$h(p) = p\{\log(n - p) - \log(n^{-1}\mathrm{RSS}) - \log(2\pi) - 1\} + \log|X^t X| - p\log(n) + n\log\{n/(n - p)\} + 2 + O(n^{-1}).$$
Thus, $h(p)$ can represent the penalty for the large model, because $\log|X^t X| - p\log(n)$ is asymptotically negligible, and the value in the braces of the first term is positive when $n$ is at least moderately large, noting that $n^{-1}\mathrm{RSS}$ becomes small as $p$ becomes large.

In the derivation of the criteria, we assumed that the scaled covariance matrix $V$ of the vector of error terms is known. However, it is often the case that $V$ is unknown and is some function of an unknown parameter $\phi$, namely $V = V(\phi)$. In that case, $V$ in each criterion is replaced with its plug-in estimator $V(\hat\phi)$, where $\hat\phi$ is some consistent estimator of $\phi$. This strategy is also used in many other studies, for example in Shi and Tsai (2002), who proposed the RIC. We suggest that $\phi$ be estimated based on the full model. The scaled covariance matrix $W$ of the prior distribution of $\beta$ is also assumed to be known. In practice, its structure should be specified, and we have to estimate the parameter $\lambda$ in $W$ from the data. In the same manner as $V$, $W$ in each criterion is replaced with $W(\hat\lambda)$. Note that $\lambda$ should be estimated based on each candidate model under consideration, because the structure of $W$ depends on the model. We propose that $\lambda$ be estimated by maximizing the marginal likelihood $f_\pi(y \mid \hat\sigma^2, \lambda)$, after substituting in the estimate $\hat\sigma^2$.

Here, we give three examples for the regression model (7): a regression model with constant variance, a variance components model, and a regression model with ARMA errors. The second and third models include unknown parameters in the covariance matrix.

[1] Regression model with constant variance. When $V = I_n$, (7) represents a multiple regression model with constant variance. In this model, the scaled covariance matrix $V$ does not contain any unknown parameters.

[2] Variance components model. Consider a variance components model (Henderson, 1950), described as follows:
$$y = X\beta + Z_1 v_1 + \cdots + Z_r v_r + \eta, \quad (14)$$
where $Z_i$ is an $n \times m_i$ matrix with $V_i = Z_i Z_i^t$, $v_i$ is an $m_i \times 1$ random vector distributed as $N_{m_i}(0, \theta_i I_{m_i})$ for $i \geq 1$, $\eta$ is an $n \times 1$ vector of error terms distributed as $N_n(0, \theta_0 V_0)$ for a known $n \times n$ matrix $V_0$, and $\eta, v_1, \ldots, v_r$ are mutually independently distributed, so that the covariance matrix of $y$ is $\theta_0 V_0 + \theta_1 V_1 + \cdots + \theta_r V_r$. The nested error regression model (NERM) is a special case of the variance components model, given by
$$y_{ik} = x_{ik}^t \beta + v_i + \eta_{ik}, \quad (i = 1, \ldots, m;\ k = 1, \ldots, n_i), \quad (15)$$
where $v_i$ and $\eta_{ik}$ are mutually independently distributed as $v_i \sim N(0, \tau^2)$ and $\eta_{ik} \sim N(0, \sigma^2)$, respectively, and $n = \sum_{i=1}^m n_i$. Note that the NERM in (15) is given by $\theta_0 = \sigma^2$, $\theta_1 = \tau^2$, $V_0 = I_n$, and $Z_1 = \mathrm{diag}(1_{n_1}, \ldots, 1_{n_m})$ for the variance components model (14), where $1_l$ is the $l$-dimensional vector of ones. This model is often used for clustered data, where $v_i$ is considered the random effect of the cluster (Battese et al., 1988). For such a model, when we are interested in a specific cluster or in predicting the random effects, an appropriate criterion is the conditional AIC, as proposed by Vaida and Blanchard (2005), which is based on the conditional likelihood given the random effects. However, when we wish to predict the fixed effects, namely $x_{ik}^t\beta$, the NERM can be seen as a linear regression model in which the random effects are part of the error term. In other words, we consider $\varepsilon = Z_1 v_1 + \eta$ and $V = V(\phi) = \phi V_1 + I_n$ for (7), where $\phi = \tau^2/\sigma^2$ and $V_1 = Z_1 Z_1^t = \mathrm{diag}(J_{n_1}, \ldots, J_{n_m})$ for $J_l = 1_l 1_l^t$. In this case, our proposed variable selection procedure is useful.

[3] Regression model with autoregressive moving average errors. Consider the regression model (7), assuming the random errors are generated by an ARMA($q, r$) process defined by
$$\varepsilon_i - \phi_1\varepsilon_{i-1} - \cdots - \phi_q\varepsilon_{i-q} = u_i - \varphi_1 u_{i-1} - \cdots - \varphi_r u_{i-r},$$
where $\{u_i\}$ is a sequence of independent normal random variables with mean 0 and variance $\tau^2$. A special case of this model is the regression model with AR(1) errors, satisfying $\varepsilon_1 \sim N(0, \tau^2/(1 - \phi^2))$, $\varepsilon_i = \phi\varepsilon_{i-1} + u_i$, and $u_i \sim N(0, \tau^2)$ for $i = 2, 3, \ldots, n$. When we define $\sigma^2 = \tau^2/(1 - \phi^2)$, the $(i, j)$-element of the scaled covariance matrix $V$ in (7) is $\phi^{|i-j|}$.
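To make Examples [2] and [3] concrete, the scaled covariance matrices can be assembled as follows; this is our sketch (assuming NumPy and SciPy, with function names of our own), and either matrix can then be plugged into the criteria of Propositions 1 and 2 in place of the known $V$:

```python
import numpy as np
from scipy.linalg import block_diag, toeplitz

def V_nerm(phi, n_i):
    # Example [2]: V(phi) = phi * diag(J_{n_1}, ..., J_{n_m}) + I_n, phi = tau^2/sigma^2
    blocks = [phi * np.ones((k, k)) for k in n_i]
    return block_diag(*blocks) + np.eye(sum(n_i))

def V_ar1(phi, n):
    # Example [3]: the (i, j) element of V is phi^{|i-j|}
    return toeplitz(phi ** np.arange(n))
```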
Consistency

In this subsection, we prove that the proposed criteria exhibit consistency. Our asymptotic framework is that $n$ tends to infinity while the true dimension $p_*$ of the regression coefficients is fixed. Following Shi and Tsai (2002), we first show that the criteria are consistent for the regression model with constant variance and pre-specified $W$. Then, we extend the result to the regression model with a general covariance matrix and to the case where $W$ is estimated.

We divide $\mathcal{J}$ into two subsets, $\mathcal{J}_+$ and $\mathcal{J}_-$, where $\mathcal{J}_+ = \{j \in \mathcal{J} : j_* \subseteq j\}$ and $\mathcal{J}_- = \mathcal{J} \setminus \mathcal{J}_+$. Note that the true model $j_*$ is the smallest model in $\mathcal{J}_+$, and that $E(y) = X_{j_*}\beta_{j_*}$, abbreviated to $X_*\beta_*$, where $\beta_*$ is a $p_* \times 1$ vector. Let $\hat j$ denote the model selected by some criterion. Following Shi and Tsai (2002), we make the following assumptions:

(A1) $E(\varepsilon_i^4) < \infty$.
(A2) $0 < \liminf_{n\to\infty} \min_{j\in\mathcal{J}} |X_j^t X_j/n|$ and $\limsup_{n\to\infty} \max_{j\in\mathcal{J}} |X_j^t X_j/n| < \infty$.
(A3) $\liminf_{n\to\infty} n^{-1} \inf_{j\in\mathcal{J}_-} \|X_*\beta_* - H_j X_*\beta_*\|^2 > 0$, where $H_j = X_j(X_j^t X_j)^{-1}X_j^t$.

We can now obtain the asymptotic properties of the criteria for the regression model with constant variance.
Theorem 1. If assumptions (A1)–(A3) are satisfied, $\mathcal{J}_+$ is not empty, the $\varepsilon_i$'s are independent and identically distributed (iid), and $W_j$ in the prior distribution of $\beta_j$ is pre-specified, then the criteria $\mathrm{IC}_{\pi,1}$, $\mathrm{IC}^*_{\pi,1}$, $\mathrm{IC}_{\pi,2}$, $\mathrm{IC}_r$, and $\mathrm{IC}^*_r$ are consistent; that is, $P(\hat j = j_*) \to 1$ as $n \to \infty$.

The proof of Theorem 1 is given in Appendix B. We next consider the regression model with a general covariance structure and the case where $W_j$ is estimated from the data. In this case, $V$ and $W_j$ are replaced with their plug-in estimators $V(\hat\phi)$ and $W_j(\hat\lambda_j)$, respectively.
Theorem 2. Assume that $\hat\phi - \phi_0$ and $\hat\lambda_j - \lambda_{j,0}$ tend to 0 in probability as $n \to \infty$ for all $j \in \mathcal{J}$, where $\phi_0$ and $\lambda_{j,0}$ denote the true parameter values. In addition, assume that the elements of $V(\phi)$ and $W_j(\lambda_j)$ are continuous functions of $\phi$ and $\lambda_j$, respectively, and that $V(\phi)$ and $W_j(\lambda_j)$ are positive definite in neighborhoods of $\phi_0$ and $\lambda_{j,0}$, respectively, for all $j \in \mathcal{J}$. If assumptions (A1)–(A3) are satisfied when $X_j$ and $\varepsilon$ are replaced with $V^{-1/2}X_j$ and $\varepsilon^* = V^{-1/2}\varepsilon$, respectively, $\mathcal{J}_+$ is not empty, and the $\varepsilon^*_i$'s are iid, then the criteria $\mathrm{IC}_{\pi,1}$, $\mathrm{IC}^*_{\pi,1}$, $\mathrm{IC}_{\pi,2}$, $\mathrm{IC}_r$, and $\mathrm{IC}^*_r$ are consistent.

For the proof of Theorem 2, we use the same techniques as those used in the proof of Theorem 1.
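Theorem 2 only requires consistent plug-in estimators. As an illustration for the AR(1) example, the following sketch (ours, assuming NumPy and SciPy; `fit_phi` is a hypothetical helper) estimates $\phi$ from the full model by maximizing the Gaussian likelihood profiled over $\beta$ and $\sigma^2$, in line with the recommendation of Section 3.2:

```python
import numpy as np
from scipy.linalg import toeplitz
from scipy.optimize import minimize_scalar

def profile_m2loglik(phi, y, X):
    # -2 log-likelihood of the full model, profiled over beta and sigma^2
    n = y.size
    Vi = np.linalg.inv(toeplitz(phi ** np.arange(n)))
    beta = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ y)   # GLS estimator of beta
    e = y - X @ beta
    sig2 = e @ Vi @ e / n
    return n * np.log(2 * np.pi * sig2) - np.linalg.slogdet(Vi)[1] + n

def fit_phi(y, X):
    res = minimize_scalar(profile_m2loglik, bounds=(-0.99, 0.99),
                          args=(y, X), method="bounded")
    return res.x
```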
Simulation Studies

In this section, we compare the numerical performance of the proposed criteria, $\mathrm{IC}_{\pi,1}$ and $\mathrm{IC}_r$, with that of conventional criteria, namely the AIC, BIC, DIC, and the marginal likelihood (ML). We consider two regression models: a regression model with constant variance and a regression model with AR(1) errors. These models are taken as examples of the linear model (7) given in Section 3.2. The matrix of explanatory variables is randomly generated as $\mathrm{vec}(X_\omega) \sim N_{n\times p_\omega}(0, I_{p_\omega} \otimes \Sigma)$ in each simulation, where $\Sigma$ has the compound symmetry form $c_1 I_n + c_2 1_n 1_n^t$ with positive constants $c_1$ and $c_2$.

When deriving the criterion $\mathrm{IC}_{\pi,1}$, we set the prior distribution of $\beta$ as $N_p(0, \sigma^2\lambda^{-1}I_p)$; that is, $W = \lambda^{-1}I_p$. The unknown parameter $\phi$ in $V$ for the AR(1) model is estimated using the maximum likelihood estimator based on the full model. The hyperparameter $\lambda$ is estimated by maximizing the marginal likelihood $f_\pi(y \mid \hat\sigma^2, \lambda)$, after substituting in the estimate $\hat\sigma^2 = y^t P y/n$ of $\sigma^2$. Note that $\phi$ is estimated based on the full model, while $\sigma^2$ and $\lambda$ are estimated from each candidate model using the plugged-in version of $V(\hat\phi)$.

As a competitor for $\mathrm{IC}_{\pi,1}$, we consider the criterion that uses $-2\log f_\pi(y \mid \hat\sigma^2)$ only, which is the so-called marginal likelihood commonly used in Bayesian analyses. Another competitor is the DIC, which is also popular in Bayesian analyses. When deriving the DIC, we consider the same prior distribution of $\beta$ as that assumed when deriving $\mathrm{IC}_{\pi,1}$, namely $\beta \sim N_p(0, \sigma^2\lambda^{-1}I_p)$. In fairness to the other criteria, we take $\sigma^2$ as an unknown parameter and do not assume a prior distribution for it in the derivation of the DIC. Let $D(\beta) = -2\log\{f(y \mid \beta, \sigma^2)\}$. When $\sigma^2$ is known, the DIC is
$$\mathrm{DIC}(\sigma^2) = 2E_{\beta|y}[D(\beta)] - D(\tilde\beta),$$
where $E_{\beta|y}$ denotes the expectation with respect to the conditional distribution of $\beta$ given $y$, and $\tilde\beta = E_{\beta|y}(\beta)$. Because $\beta \mid y \sim N_p(\tilde\beta, \sigma^2(X^tV^{-1}X + W^{-1})^{-1})$ for $\tilde\beta = (X^tV^{-1}X + W^{-1})^{-1}X^tV^{-1}y$, the first term of the DIC is
$$2E_{\beta|y}[D(\beta)] = 2\,\mathrm{tr}\big[X^tV^{-1}X\{\sigma^2(X^tV^{-1}X + W^{-1})^{-1} + \tilde\beta\tilde\beta^t\}\big]/\sigma^2 - 4y^tV^{-1}X\tilde\beta/\sigma^2 + (\text{a term irrelevant to the model}),$$
and the second term is $D(\tilde\beta) = (y - X\tilde\beta)^tV^{-1}(y - X\tilde\beta)/\sigma^2 + (\text{a term irrelevant to the model})$. Then, we use $\mathrm{DIC}(\hat\sigma^2)$, where $\hat\sigma^2$ is substituted into $\mathrm{DIC}(\sigma^2)$.

[Experiment 1] First, we confirm that $\mathrm{IC}_{\pi,1}$, $\mathrm{IC}_r$, and the BIC are consistent. We consider the regression model with constant variance for two settings of the regression coefficients with $p_\omega = 7$ and $p_* = 2$ or $p_* = 4$, in which the first $p_*$ components of $\beta$ are nonzero and the remaining components are exactly 0. We also control the signal-to-noise ratio, $\mathrm{SNR} = \{\mathrm{var}(x_i^t\beta)/\mathrm{var}(\varepsilon_i)\}^{1/2}$, at 1, 3, and 5. Note that $\mathrm{var}(x_i^t\beta) = p_*$. We select the best model by each criterion among all subsets of the full model; that is, we consider $2^{p_\omega} - 1$ candidate models. As Figure 1 shows, $\mathrm{IC}_r$ does not perform as well as $\mathrm{IC}_{\pi,1}$ and the BIC in terms of selecting the true model. Although we omit the details, the results for the AR(1) model are similar to those of the model with constant variance.

[Experiment 2] Next, we investigate the performance of the criteria in terms of the prediction error. In this experiment, we set $\beta_1, \ldots, \beta_{p_*} \overset{\mathrm{i.i.d.}}{\sim} N(0, 1)$ and $(\beta_{p_*+1}, \ldots, \beta_{p_\omega})^t = 0$ in each simulation. Note that some of the values of $\beta_1, \ldots, \beta_{p_*}$ might be close to 0, which makes it difficult to distinguish the true model from the models that include the true model. In this case, it is more appropriate to evaluate the performance of the information criteria in terms of prediction error than in terms of selecting the true model. The prediction error of the selected model is defined as $\|X_{\hat j}\hat\beta_{\hat j} - X_*\beta_*\|^2/n$, where $\hat\beta_{\hat j} = (X_{\hat j}^t V(\hat\phi)^{-1}X_{\hat j})^{-1}X_{\hat j}^t V(\hat\phi)^{-1}y$ (i.e., the GLS estimator). We consider several settings of $p_\omega$, $p_*$, and SNR for the regression model with constant variance and for the AR(1) model with $\phi = 0.5$. For the $p_\omega = 5, 10$ cases, we consider all subsets of the full model as the class of candidate models, and for the $p_\omega = 20$ case, we consider a class of nested candidate models $j_\alpha = \{1, \ldots, \alpha\}$ for $\alpha = 1, \ldots, p_\omega$. Tables 1–6 show the mean value of the prediction error among 1000 simulations. In each case, we put two asterisks at the minimum value of the prediction error and one asterisk at the second minimum value.

Figure 1: The number of simulations in which each criterion ($\mathrm{IC}_{\pi,1}$, $\mathrm{IC}_r$, BIC) selects the true model, among 1000 realizations of the regression model with constant variance, plotted against $n$ (20 to 100). The left three panels show the results for $p_* = 2$ and the right three panels for $p_* = 4$, for SNR = 1, 3, 5.
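The selection loop of Experiment 1 can be sketched compactly as follows. This is our illustration, assuming NumPy, with $V = I_n$, only $\mathrm{IC}_r$ computed, and coefficient values that are illustrative rather than the paper's exact design:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

def ic_r(y, X):   # Proposition 2 with V = I_n (see the sketch after Proposition 2)
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)
    sig2 = y @ (np.eye(n) - H) @ y / (n - p)
    return ((n - p) * np.log(2 * np.pi * sig2) + np.linalg.slogdet(X.T @ X)[1]
            + (n - p) + 2 * (n - p) / (n - p - 2))

n, p_omega, n_sim, hits = 100, 7, 200, 0
beta = np.array([1.0, 1.0, 0, 0, 0, 0, 0])            # p_* = 2 (illustrative values)
subsets = [s for r in range(1, p_omega + 1)
           for s in itertools.combinations(range(p_omega), r)]
for _ in range(n_sim):
    X_full = rng.normal(size=(n, p_omega))
    y = X_full @ beta + rng.normal(size=n)
    best = min(subsets, key=lambda s: ic_r(y, X_full[:, list(s)]))
    hits += best == (0, 1)                            # indices of the true model
print(hits / n_sim)  # should approach 1 as n grows (Theorem 1)
```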
Table 1: The mean of the prediction error among 1000 simulations for the regression model with constant variance, $p_\omega = 5$.
Table 2: The mean of the prediction error among 1000 simulations for the regression model with constant variance, $p_\omega = 10$.
Table 3: The mean of the prediction error among 1000 simulations for the regression model with constant variance, $p_\omega = 20$.
Table 4: The mean of the prediction error among 1000 simulations for AR(1) with $\phi = 0.5$, $p_\omega = 5$.
Table 5: The mean of the prediction error among 1000 simulations for AR(1) with $\phi = 0.5$, $p_\omega = 10$.
Table 6: The mean of the prediction error among 1000 simulations for AR(1) with $\phi = 0.5$, $p_\omega = 20$.
Each table reports, for each combination of $n$ (50, 100), SNR, and small or large $p_*$, the mean prediction error of the models selected by $\mathrm{IC}_{\pi,1}$, $\mathrm{IC}_r$, AIC, BIC, DIC, and ML.

First, we look at the results for the regression model with constant variance, which are shown in Tables 1–3. When the true model is small (i.e., $p_* = 2$), $\mathrm{IC}_{\pi,1}$ performs best or second best in all cases. Although the marginal likelihood (ML) performs similarly to $\mathrm{IC}_{\pi,1}$, the latter is slightly better. When the true model is large relative to the full model, $\mathrm{IC}_r$ performs best or second best in many cases. It is also interesting to point out that $\mathrm{IC}_{\pi,1}$, BIC, and ML perform well when the true model is small, while $\mathrm{IC}_r$, AIC, and DIC perform well when the true model is large. This might be because the first three criteria tend to select parsimonious models, while the last three criteria tend to select larger models. The results for the AR(1) model, which are shown in Tables 4–6, are similar to those for the regression model with constant variance, except that the performance of $\mathrm{IC}_r$ in the case of $p_\omega = 5$ and $p_* = 2$ (upper part of Table 4) is good, while that in the same case for the regression model with constant variance (upper part of Table 1) is not very good.

Concluding Remarks

We have derived variable selection criteria for normal linear regression models relative to the frequentist KL risk of the predictive density, based on the Bayesian marginal likelihood. We have proved the consistency of the criteria and, using simulations, have shown that they perform well in terms of prediction.

Although our theoretical approach is general, the derivation of the criteria depends on the normal distribution. If we assume a conjugate prior distribution for the parameter of interest when deriving the criterion, it is easy to extend our approach to other models. However, for the class of generalized linear models, which includes the Poisson and logistic regression models, it is difficult to consider a prior distribution for which the marginal likelihood can be evaluated analytically. In such models, we have to rely on some computational method, which we leave for future research.

Variable selection for mixed effects models, such as the variance components model (14) in Section 3.2, is another important problem. As discussed, it is appropriate to consider the KL divergence based on the conditional density, given the random effects, when the objective is to predict the random effects, as in the conditional AIC (cAIC).
An extension of our approach to the cAIC-type criterion is also left to future research.
Acknowledgments.
The authors are grateful to the associate editor and the anonymous referee for their valuable comments and helpful suggestions. The first and second authors were supported, in part, by Grant-in-Aid for Scientific Research from the Japan Society for the Promotion of Science (JSPS). The third author was supported, in part, by NSERC of Canada.
A Derivations of the Criteria
In this section, we show the derivations of the criteria. To this end, we first obtain the following lemma, which was shown in Section A.2 of Srivastava and Kubokawa (2010).

Lemma 1
Assume that $C$ is an $n \times n$ symmetric matrix, $M$ is an idempotent matrix of rank $p$, and $u \sim N(0, I_n)$. Then,
$$E\left[\frac{u^t C u}{u^t(I_n - M)u}\right] = \frac{\mathrm{tr}(C)}{n - p - 2} - \frac{2\,\mathrm{tr}[C(I_n - M)]}{(n - p)(n - p - 2)}.$$

A.1 Derivation of IC_{π,1} in (9)

It is sufficient to show that the bias correction $\Delta_{\pi,1} = I_{\pi,1}(\omega) - E_\omega[-2\log\{f_\pi(y \mid \hat\sigma^2)\}]$ is $2n/(n - p - 2)$, where $I_{\pi,1}(\omega)$ is given by (8). It follows that
$$\Delta_{\pi,1} = E_\omega(\tilde y^t A \tilde y/\hat\sigma^2) - E_\omega(y^t A y/\hat\sigma^2) = E_\omega(\tilde y^t A \tilde y)\cdot E_\omega(1/\hat\sigma^2) - E_\omega(y^t A y/\hat\sigma^2).$$
First,
$$E_\omega(\tilde y^t A \tilde y) = E_\omega[(\tilde y - X\beta + X\beta)^t A (\tilde y - X\beta + X\beta)] = \sigma^2\,\mathrm{tr}(AV) + \beta^t X^t A X \beta. \quad (16)$$
Second, noting that
$$n\hat\sigma^2 = y^t P y = \sigma^2 u^t(I_n - M)u \quad \text{for} \quad u = V^{-1/2}(y - X\beta)/\sigma, \quad M = V^{-1/2}X(X^t V^{-1}X)^{-1}X^t V^{-1/2}, \quad (17)$$
and that $PX = 0$, we obtain
$$E_\omega(1/\hat\sigma^2) = nE_\omega\left(\frac{1}{y^t P y}\right) = nE_\omega\left[\frac{1}{\sigma^2 u^t(I_n - M)u}\right] = \frac{n}{\sigma^2(n - p - 2)}. \quad (18)$$
Finally,
$$E_\omega(y^t A y/\hat\sigma^2) = nE_\omega\left(\frac{y^t A y}{y^t P y}\right) = nE_\omega\left[\frac{\sigma^2 u^t V^{1/2}AV^{1/2}u + \beta^t X^t A X\beta}{\sigma^2 u^t(I_n - M)u}\right] = n\left\{\frac{\mathrm{tr}(AV)}{n - p - 2} - \frac{2\,\mathrm{tr}(AVPV)}{(n - p)(n - p - 2)} + \frac{\beta^t X^t A X\beta}{\sigma^2(n - p - 2)}\right\}, \quad (19)$$
where the cross term in the numerator vanishes because it is an odd function of $u$. The latter equation derives from Lemma 1. Combining (16), (18), and (19), we have
$$\Delta_{\pi,1} = \frac{2n\,\mathrm{tr}(AVPV)}{(n - p)(n - p - 2)}.$$
Furthermore,
$$\mathrm{tr}(AVPV) = \mathrm{tr}\{(V + B)^{-1}(V + B - B)PV\} = \mathrm{tr}(PV) - \mathrm{tr}\{(V + B)^{-1}BPV\} = \mathrm{tr}(I_n - M) = n - p, \quad (20)$$
because $BP = XWX^tP = 0$. Then, we obtain $\Delta_{\pi,1} = 2n/(n - p - 2)$. □

A.2 Derivation of IC_{π,2} in (10)

From the fact that $E_\omega(\mathrm{IC}_{\pi,1}) = I_{\pi,1}(\omega)$ and that $E_\pi E_\omega(\mathrm{IC}_{\pi,1}) = E_\pi[I_{\pi,1}(\omega)] = I_{\pi,2}(\sigma^2)$, it suffices to show that $E_\pi E_\omega(\mathrm{IC}_{\pi,1})$ is approximated by
$$E_\pi E_\omega(\mathrm{IC}_{\pi,1}) \approx E_\pi E_\omega[n\log(2\pi\hat\sigma^2) + \log|V| + p\log n + 2 + y^t A y/\hat\sigma^2]$$
$$\approx E_\pi E_\omega[n\log(2\pi\hat\sigma^2) + \log|V| + p\log n + p] + (n + 2) = E_\pi E_\omega(\mathrm{IC}_{\pi,2}) + (n + 2),$$
when $n$ is large. Note that $n + 2$ is irrelevant to the model. It follows that
$$E_\omega\left(\frac{y^t A y}{\hat\sigma^2}\right) = n \times E_\omega\left[\frac{y^t\{V^{-1} - V^{-1}X(X^tV^{-1}X + W^{-1})^{-1}X^tV^{-1}\}y}{y^t\{V^{-1} - V^{-1}X(X^tV^{-1}X)^{-1}X^tV^{-1}\}y}\right]$$
$$= n + n \times E_\omega\left[\frac{y^tV^{-1}X(X^tV^{-1}X + W^{-1})^{-1}W^{-1}(X^tV^{-1}X)^{-1}X^tV^{-1}y}{y^t\{V^{-1} - V^{-1}X(X^tV^{-1}X)^{-1}X^tV^{-1}\}y}\right]$$
(using the identity $(X^tV^{-1}X)^{-1} - (X^tV^{-1}X + W^{-1})^{-1} = (X^tV^{-1}X + W^{-1})^{-1}W^{-1}(X^tV^{-1}X)^{-1}$)
$$= n + \frac{n}{\sigma^2(n - p - 2)} \times E_\omega\big[y^tV^{-1}X(X^tV^{-1}X + W^{-1})^{-1}W^{-1}(X^tV^{-1}X)^{-1}X^tV^{-1}y\big]$$
$$= n + \frac{n}{\sigma^2(n - p - 2)} \times \big[\sigma^2\,\mathrm{tr}\{(X^tV^{-1}X + W^{-1})^{-1}W^{-1}\} + \beta^tX^tV^{-1}X(X^tV^{-1}X + W^{-1})^{-1}W^{-1}\beta\big],$$
and that
$$E_\pi[\beta^tX^tV^{-1}X(X^tV^{-1}X + W^{-1})^{-1}W^{-1}\beta] = \sigma^2\,\mathrm{tr}[X^tV^{-1}X(X^tV^{-1}X + W^{-1})^{-1}].$$
If $n^{-1}X^tV^{-1}X$ converges to a $p \times p$ positive definite matrix as $n \to \infty$, $\mathrm{tr}[(X^tV^{-1}X + W^{-1})^{-1}W^{-1}] \to 0$ and $\mathrm{tr}[X^tV^{-1}X(X^tV^{-1}X + W^{-1})^{-1}] \to p$. Then, we have $E_\pi E_\omega(y^t A y/\hat\sigma^2 - n) \to p$, which is what we wanted to show. □

A.3 Derivation of IC_r in (12)

We show that the bias correction $\Delta_r = I_r(\omega) - E_\omega[-2\log\{f_r(y \mid \tilde\sigma^2)\}]$ is $2(n - p)/(n - p - 2)$, where $I_r(\omega)$ is given by (11). Then,
$$\Delta_r = E_\omega(\tilde y^t P \tilde y/\tilde\sigma^2) - E_\omega(y^t P y/\tilde\sigma^2) = E_\omega(\tilde y^t P \tilde y)\cdot E_\omega(1/\tilde\sigma^2) - (n - p).$$
Since $E_\omega(\tilde y^t P \tilde y) = (n - p)\sigma^2$ and $E_\omega(1/\tilde\sigma^2) = (n - p)/\{\sigma^2(n - p - 2)\}$, we have $\Delta_r = 2(n - p)/(n - p - 2)$. □
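Lemma 1 is easy to check numerically. The following sketch (ours, assuming NumPy) compares a Monte Carlo estimate of the left-hand side with the closed-form right-hand side:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, n_rep = 30, 4, 200000

C = rng.normal(size=(n, n))
C = (C + C.T) / 2                              # a fixed symmetric C
X = rng.normal(size=(n, p))
M = X @ np.linalg.solve(X.T @ X, X.T)          # idempotent matrix of rank p
IM = np.eye(n) - M

u = rng.normal(size=(n_rep, n))                # draws of u ~ N(0, I_n)
num = ((u @ C) * u).sum(axis=1)                # u'Cu for each draw
den = ((u @ IM) * u).sum(axis=1)               # u'(I_n - M)u for each draw
mc = (num / den).mean()

exact = (np.trace(C) / (n - p - 2)
         - 2 * np.trace(C @ IM) / ((n - p) * (n - p - 2)))
print(mc, exact)  # the two values agree up to Monte Carlo error
```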
B Proof of Theorem 1

We only prove the consistency of $\mathrm{IC}_{\pi,1}$. The proof of the consistency of the other criteria can be shown in the same manner. Because we have $P(\hat j = j) \leq P\{\mathrm{IC}_{\pi,1}(j) < \mathrm{IC}_{\pi,1}(j_*)\}$ for any $j \in \mathcal{J}\setminus\{j_*\}$, it suffices to show that $P\{\mathrm{IC}_{\pi,1}(j) < \mathrm{IC}_{\pi,1}(j_*)\} \to 0$, or equivalently, that $P\{\mathrm{IC}_{\pi,1}(j) - \mathrm{IC}_{\pi,1}(j_*) > 0\} \to 1$ as $n \to \infty$, for $j \in \mathcal{J}\setminus\{j_*\}$. When $V = I_n$, we obtain $\mathrm{IC}_{\pi,1}(j) - \mathrm{IC}_{\pi,1}(j_*) = I_1 + I_2 + I_3$, where
$$I_1 = n\log(\hat\sigma^2_j/\hat\sigma^2_*) + y^t A_j y/\hat\sigma^2_j - y^t A_* y/\hat\sigma^2_*,$$
$$I_2 = \log|X_j^t X_j + W_j^{-1}| - \log|X_*^t X_* + W_*^{-1}|,$$
$$I_3 = \log\{|W_j|/|W_*|\} + \frac{2n}{n - p_j - 2} - \frac{2n}{n - p_* - 2},$$
for $\hat\sigma^2_j = y^t(I_n - H_j)y/n$, $\hat\sigma^2_* = \hat\sigma^2_{j_*}$, $A_j = I_n - X_j(X_j^t X_j + W_j^{-1})^{-1}X_j^t$, $A_{j_*} = A_*$, and $W_* = W_{j_*}$. We evaluate the asymptotic behaviors of $I_1$, $I_2$, and $I_3$ for $j \in \mathcal{J}_-$ and $j \in \mathcal{J}_+\setminus\{j_*\}$ separately.

[Case of $j \in \mathcal{J}_-$]. First, we evaluate $I_1$. We decompose $I_1 = I_{11} + I_{12}$, where $I_{11} = n\log(\hat\sigma^2_j/\hat\sigma^2_*)$ and $I_{12} = y^t A_j y/\hat\sigma^2_j - y^t A_* y/\hat\sigma^2_*$. It follows that
$$\hat\sigma^2_j - \hat\sigma^2_* = (X_*\beta_* + \varepsilon)^t(I_n - H_j)(X_*\beta_* + \varepsilon)/n - \varepsilon^t(I_n - H_*)\varepsilon/n = \|X_*\beta_* - H_jX_*\beta_*\|^2/n + o_p(1).$$
Then, we have
$$n^{-1}I_{11} = \log\left(\frac{\hat\sigma^2_* + \hat\sigma^2_j - \hat\sigma^2_*}{\hat\sigma^2_*}\right) = \log\left\{1 + \frac{\|X_*\beta_* - H_jX_*\beta_*\|^2}{n\sigma^2}\right\} + o_p(1), \quad (21)$$
and it follows from assumption (A3) that
$$\liminf_{n\to\infty}\log\left\{1 + \frac{\|X_*\beta_* - H_jX_*\beta_*\|^2}{n\sigma^2}\right\} > 0. \quad (22)$$
Because $y^t A_j y/(n\hat\sigma^2_j) = 1 + o_p(1)$ and $y^t A_* y/(n\hat\sigma^2_*) = 1 + o_p(1)$, we obtain
$$n^{-1}I_{12} = o_p(1). \quad (23)$$
Second, we evaluate $I_2$. It follows that
$$\log|X_j^t X_j + W_j^{-1}| = p_j\log n + \log|X_j^t X_j/n + W_j^{-1}/n| = p_j\log n + O(1).$$
In addition, $\log|X_*^t X_* + W_*^{-1}| = p_*\log n + O(1)$. Then,
$$n^{-1}I_2 = (p_j - p_*)n^{-1}\log n + o(1) = o(1). \quad (24)$$
Lastly, it is easy to see that
$$n^{-1}I_3 = o(1). \quad (25)$$
From (21)–(25), it follows that
$$P\{\mathrm{IC}_{\pi,1}(j) - \mathrm{IC}_{\pi,1}(j_*) > 0\} \to 1, \quad (26)$$
for all $j \in \mathcal{J}_-$.

[Case of $j \in \mathcal{J}_+\setminus\{j_*\}$]. First, we evaluate $I_1$. From
$$\hat\sigma^2_* - \hat\sigma^2_j = \varepsilon^t(H_j - H_*)\varepsilon/n = O_p(n^{-1}), \quad (27)$$
it follows that
$$(\log n)^{-1}I_{11} = (\log n)^{-1}\cdot n\log\left\{\frac{\hat\sigma^2_* - (\hat\sigma^2_* - \hat\sigma^2_j)}{\hat\sigma^2_*}\right\} = (\log n)^{-1}\cdot n\cdot\log\{1 + O_p(n^{-1})\} = o_p(1). \quad (28)$$
For $I_{12}$, from (27) and $y^t A_j y - y^t A_* y = O_p(1)$, we obtain
$$I_{12} = y^t A_j y/\hat\sigma^2_j - y^t A_* y/\hat\sigma^2_* = (y^t A_j y - y^t A_* y)/\hat\sigma^2_* + O_p(1) = O_p(1).$$
Then,
$$(\log n)^{-1}I_{12} = o_p(1). \quad (29)$$
Second, we evaluate $I_2$. Since $p_j > p_*$ for all $j \in \mathcal{J}_+\setminus\{j_*\}$,
$$\liminf_{n\to\infty}(\log n)^{-1}I_2 = p_j - p_* > 0. \quad (30)$$
Finally, it is easy to see that
$$(\log n)^{-1}I_3 = o(1). \quad (31)$$
From (28)–(31), it follows that
$$P\{\mathrm{IC}_{\pi,1}(j) - \mathrm{IC}_{\pi,1}(j_*) > 0\} \to 1, \quad (32)$$
for all $j \in \mathcal{J}_+\setminus\{j_*\}$.

Combining (26) and (32), we obtain $P\{\mathrm{IC}_{\pi,1}(j) - \mathrm{IC}_{\pi,1}(j_*) > 0\} \to 1$ for all $j \in \mathcal{J}\setminus\{j_*\}$, which shows that $\mathrm{IC}_{\pi,1}$ is consistent. □

References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In 2nd International Symposium on Information Theory (Petrov, B.N. and Csaki, F., eds.), 267–281, Akademiai Kiado, Budapest.

Akaike, H. (1974). A new look at the statistical model identification. System identification and time-series analysis. IEEE Transactions on Automatic Control, AC-19, 716–723.
Akaike, H. (1980a). On the use of predictive likelihood of a Gaussian model. Annals of the Institute of Statistical Mathematics, 32, 311–324.

Akaike, H. (1980b). Likelihood and the Bayes procedure. In Bayesian Statistics (J.M. Bernardo, M.H. DeGroot, D.V. Lindley and A.F.M. Smith, eds.), 141–166, University Press, Valencia, Spain.

Ando, T. (2007). Bayesian predictive information criterion for the evaluation of hierarchical Bayesian and empirical Bayes models. Biometrika, 94, 443–458.

Battese, G.E., Harter, R.M. and Fuller, W.A. (1988). An error-components model for prediction of county crop areas using survey and satellite data. Journal of the American Statistical Association, 83, 28–36.

Bozdogan, H. (1987). Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52, 345–370.

Dellaportas, P., Forster, J.J. and Ntzoufras, I. (1997). On Bayesian model and variable selection using MCMC. Technical report, Department of Statistics, Athens University of Economics and Business, Athens, Greece.

George, E.I. and McCulloch, R.E. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88, 881–889.