The Residual Information Criterion, Corrected
Chenlei Leng ∗ October 29, 2018
Abstract
Shi and Tsai (JRSSB, 2002) proposed an interesting residual information criterion (RIC) for model selection in regression. Their RIC was motivated by the principle of minimizing the Kullback-Leibler discrepancy between the residual likelihoods of the true and candidate model. We show, however, that under this principle RIC would always choose the full (saturated) model. The residual likelihood, therefore, is not appropriate as a discrepancy measure for defining an information criterion. We explain why this is so and provide a corrected residual information criterion as a remedy.
KEY WORDS: Residual information criterion; Corrected residual information criterion.
∗ Leng is Assistant Professor, Department of Statistics and Applied Probability, National University of Singapore. Leng's research is supported in part by NUS research grant R-155-050-053-133 (Email: [email protected]).

1 Introduction

Suppose we are given $n$ observations from the true model $y = X\beta_0 + \varepsilon$, where $y = (y_1, \ldots, y_n)'$, $X$ is an $n \times p$ design matrix, $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)'$ follows a multivariate distribution with mean $0$ and variance $\sigma_0^2 W(\theta_0)$, and $\beta_0 \in R^{p \times 1}$ is an unknown coefficient vector to be estimated. Here $\theta_0$ is an $m \times 1$ vector of parameters in the covariance structure. For a coefficient vector $\beta$, write $A = A(\beta) = \{j : \beta_j \neq 0,\ j = 1, \ldots, p\}$ for its set of nonzero coefficients and $k = |A|$ for the number of nonzero coefficients, and let $A_0 = A(\beta_0)$. The problem of estimating $A_0$ is often referred to as variable selection or model selection.

Variable selection in linear regression is probably one of the most important problems in statistics; see, for example, the references in Shao (1997). To automate the process of choosing a finite-dimensional candidate model out of all possible models, various information criteria have been developed. There are two basic elements in all of these criteria: one element measures the goodness of fit, and the other penalizes the complexity of the fitted model, usually through a function of the number of parameters used. Generally speaking, the existing variable selection approaches can be classified into two broad categories.

Shi and Tsai (2002) proposed the residual information criterion (RIC), motivated by the principle of minimizing the Kullback-Leibler discrepancy between the residual likelihoods of the true and candidate models. However, we show that if the residual likelihoods are used to evaluate the Kullback-Leibler divergence between models, the criterion so motivated (RIC∗ in our notation) would always choose the full model. Therefore, the residual likelihood is not an appropriate loss function for defining an information criterion. We provide a simple likelihood-based approach to circumvent the problem.

The rest of the paper is organized as follows. Section 2 reviews the RIC method of Shi and Tsai. Since Shi and Tsai's RIC does not approximate the Kullback-Leibler divergence they start from, we provide the RIC∗ measure as a correction; however, RIC∗ always chooses the full model, and the reason is explained. Section 3 presents the corrected residual information criterion, motivated by minimizing the Kullback-Leibler divergence between likelihoods instead of residual likelihoods. Concluding remarks are given in Section 4.
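To fix ideas, the following minimal sketch simulates data from a sparse true model and enumerates candidate active sets. The dimensions, coefficient values and the assumption of iid normal errors (so that $W = I$) are illustrative choices of ours, not part of the paper.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 6
X = rng.normal(size=(n, p))
beta0 = np.array([2.0, 0.0, -1.5, 0.0, 0.0, 0.0])  # nonzero coefficients in positions 0 and 2
sigma0 = 1.0
y = X @ beta0 + sigma0 * rng.normal(size=n)          # iid errors, i.e. W = I

# Candidate models: all non-empty active sets A, with k = |A| covariates.
candidates = [list(A) for k in range(1, p + 1)
              for A in itertools.combinations(range(p), k)]

def rss(y, X, A):
    """Residual sum of squares from least squares on the columns in A."""
    XA = X[:, A]
    coef, *_ = np.linalg.lstsq(XA, y, rcond=None)
    resid = y - XA @ coef
    return float(resid @ resid)
```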
2 The residual information criterion

We review the RIC method of Shi and Tsai (2002) in this section. The model considered in this article is a special case of that in Shi and Tsai (2002), obtained by fixing the Box-Cox transformation parameter $\lambda$ at 1; the results in the paper can be easily extended to Box-Cox models following arguments similar to theirs.

We start with a candidate (working) model $y = X\beta + \varepsilon$ such that $|A(\beta)| = k$, and denote the active covariates in $X$ by $X_A$. Inspired by the residual likelihood method in Harville (1974) or Diggle et al. (1994), which yields an unbiased estimator of the error variance, we can write the residual log-likelihood as

$$L(\theta, \sigma^2) = -\tfrac{1}{2}(n-k)\log(2\pi) + \tfrac{1}{2}\log|X_A'X_A| - \tfrac{1}{2}(n-k)\log(\sigma^2) - \tfrac{1}{2}\log|W| - \tfrac{1}{2}\log|X_A'W^{-1}X_A| - \tfrac{1}{2}\, y'(W^{-1} - H_A)y/\sigma^2, \qquad (1)$$

where $H_A = W^{-1}X_A(X_A'W^{-1}X_A)^{-1}X_A'W^{-1}$ and the dependence of $W$ on $\theta$ is suppressed. A useful measure of the distance between the working model and the true model is the Kullback-Leibler divergence

$$d_0(\theta, \sigma^2) = E_0[-2L(\theta, \sigma^2) + 2L_0(\theta_0, \sigma_0^2)], \qquad (2)$$

where $E_0$ denotes expectation under the true model and $L_0$ denotes the residual log-likelihood of the true model. Clearly, the best model loses the least information, in terms of the Kullback-Leibler distance, relative to the truth and is therefore preferred. Such a criterion places RIC in an information-theoretic framework. Provided that one can unbiasedly estimate $d_0(\theta, \sigma^2)$, this criterion provides a sound basis for parameter estimation and statistical inference under appropriate conditions.

Since $E_0[2L_0(\theta_0, \sigma_0^2)]$ does not depend on the working model, we only need to evaluate $E_0[-2L(\theta, \sigma^2)]$. In Shi and Tsai (2002), after omitting irrelevant terms, (2) is written as

$$d_0(\theta, \sigma^2) = E_0\big[(n-k)\log(\sigma^2) + \log|W| + \log|X_A'W^{-1}X_A| + y'(W^{-1} - H_A)y/\sigma^2\big] \qquad (3)$$
$$= (n-k)\log(\sigma^2) + \log|W| + \log|X_A'W^{-1}X_A| + E_0\big[(X\beta_0 + \varepsilon)'(W^{-1} - H_A)(X\beta_0 + \varepsilon)\big]/\sigma^2. \qquad (4)$$

Substituting the estimates $\hat\theta$ and $\hat\sigma^2$ into (4), we have

$$d_0(\hat\theta, \hat\sigma^2) = (n-k)\log(\hat\sigma^2) + \log|\hat W| + \log|X_A'\hat W^{-1}X_A| + (X\beta_0)'(\hat W^{-1} - \hat H_A)(X\beta_0)/\hat\sigma^2 + \mathrm{tr}\{(\hat W^{-1} - \hat H_A)W_0\}\,\sigma_0^2/\hat\sigma^2, \qquad (5)$$

where $W_0 = W(\theta_0)$. The above expression involves the unknown quantity $\sigma_0^2$. Following Shi and Tsai, we judge the quality of the candidate model by $E_0\{d_0(\hat\theta, \hat\sigma^2)\}$. If we assume $A_0 \subseteq A$, an assumption also used in deriving AICc (Hurvich and Tsai, 1989), the term $(X\beta_0)'(\hat W^{-1} - \hat H_A)(X\beta_0)/\hat\sigma^2$ vanishes, since $(\hat W^{-1} - \hat H_A)X_A = 0$ and $X\beta_0$ lies in the column space of $X_A$. Furthermore, if we assume that $\hat\theta$ is consistent for $\theta_0$, we can estimate $W_0$ by $\hat W$ since $\hat W = W_0 + o_p(1)$, and the trace term can be approximated by $(n-k)\sigma_0^2/\hat\sigma^2$. Since $A_0 \subseteq A$, $(n-k)\hat\sigma^2/\sigma_0^2$ follows the $\chi^2_{n-k}$ distribution, and therefore $E_0[(n-k)\sigma_0^2/\hat\sigma^2] = (n-k)^2/(n-k-2)$. Finally, Shi and Tsai argued that $\log|X_A'\hat W^{-1}X_A|$ can be approximated by $k\log(n)$. Putting everything together, they proposed the residual information criterion

$$\mathrm{RIC} = (n-k)\log(\hat\sigma^2) + \log|\hat W| + k\log(n) - k + \frac{4}{n-k-2}, \qquad (6)$$

after removing the constant $n + 2$. Asymptotically, the complexity part of RIC is of order $k\log(n)$. Comparing with $\mathrm{BIC} = n\log(\tilde\sigma^2) + k\log(n)$, where $\tilde\sigma^2$ is the MLE of $\sigma^2$, it is intuitively clear that Shi and Tsai's RIC selects models consistently, as BIC does. The complexity penalty of RIC, however, is fundamentally different from that of other familiar information criteria such as AIC and AICc, which are designed to approximate the Kullback-Leibler divergence between two models. This observation raises the question of whether RIC properly approximates the divergence.
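For concreteness, here is a small sketch that evaluates the RIC in (6) for a candidate active set, under the illustrative assumptions that $W = I$ (so $\log|\hat W| = 0$) and $\hat\sigma^2 = \mathrm{RSS}/(n-k)$; the function name and the reuse of the simulated data from Section 1 are our own.

```python
import numpy as np

def ric_shi_tsai(y, X, A):
    """Shi and Tsai's RIC in (6) for active set A, assuming W = I
    (so log|W_hat| = 0) and sigma2_hat = RSS / (n - k)."""
    n, k = len(y), len(A)
    XA = X[:, A]
    coef, *_ = np.linalg.lstsq(XA, y, rcond=None)
    resid = y - XA @ coef
    sigma2_hat = float(resid @ resid) / (n - k)
    return (n - k) * np.log(sigma2_hat) + k * np.log(n) - k + 4.0 / (n - k - 2)

# With y, X and candidates from the earlier sketch:
# best = min(candidates, key=lambda A: ric_shi_tsai(y, X, A))
```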
It turns out that Shi and Tsai's derivation, although motivated by minimizing the Kullback-Leibler distance, is incorrect in at least two important places:

1. In (3), the model-dependent term $\log|X_A'X_A|$ appearing in (1) is omitted, which introduces a serious bias when deriving an information criterion. In fact, following Shi and Tsai's own arguments, $\log|X_A'X_A|$ can be approximated by $k\log(n)$, and thus RIC should have been
$$\mathrm{RIC}^* = (n-k)\log(\hat\sigma^2) + \log|\hat W| - k + \frac{4}{n-k-2}.$$
Note that in this formulation, RIC∗ always chooses the full model.

2. Even more severely, approximating the Kullback-Leibler distance between residual likelihoods in order to compare models is fundamentally flawed. To illustrate, suppose that $W = I$. In this simple case, dropping the term involving $\log(2\pi)$, the residual log-likelihood becomes
$$L(\sigma^2) = -\tfrac{1}{2}(n-k)\log(\sigma^2) - \tfrac{1}{2}\, y'[I - X_A(X_A'X_A)^{-1}X_A']y/\sigma^2.$$
We see immediately that $E_0[-2L(\sigma^2)] = (n-k)\log(\sigma^2) + (n-k)\sigma_0^2/\sigma^2$ whenever $A_0 \subseteq A$. Thus, for candidate models whose covariate sets include $X_{A_0}$, $E_0[-2L(\sigma^2)]$ is always minimized by $\sigma^2 = \sigma_0^2$, in which case $E_0[-2L(\sigma_0^2)] = (n-k)(\log(\sigma_0^2) + 1)$. Therefore, even if one knows the exact data-generating process, the ideal RIC leads to the full model, as the full model attains the smallest value of $E_0[-2L(\sigma_0^2)]$. This explains why RIC∗ always chooses the full model, as the numerical sketch below also illustrates.
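The behaviour described in point 2 can be checked numerically. The following Monte Carlo sketch, under the same illustrative assumptions as before ($W = I$, $\sigma_0^2 = 1$), approximates $E_0[-2L(\sigma_0^2)]$ for nested candidate sets containing $A_0$ and compares it with $(n-k)(\log\sigma_0^2 + 1)$, which decreases here as irrelevant covariates are added, so the population criterion is smallest at the full model. All numerical choices are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma0 = 50, 6, 1.0
X = rng.normal(size=(n, p))
beta0 = np.array([2.0, 0.0, -1.5, 0.0, 0.0, 0.0])  # true active set A0 = {0, 2}

def neg2_resid_loglik(y, X, A, sigma2):
    """-2 x residual log-likelihood at a fixed sigma2 when W = I,
    dropping the 2*pi term: (n - k) log(sigma2) + y'(I - P_A) y / sigma2."""
    n, k = len(y), len(A)
    XA = X[:, A]
    P = XA @ np.linalg.solve(XA.T @ XA, XA.T)
    return (n - k) * np.log(sigma2) + float(y @ (np.eye(n) - P) @ y) / sigma2

# Nested candidates, each containing the true set A0 = {0, 2}.
nested = [[0, 2], [0, 1, 2], [0, 1, 2, 3], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4, 5]]
for A in nested:
    sims = [neg2_resid_loglik(X @ beta0 + sigma0 * rng.normal(size=n), X, A,
                              sigma0 ** 2) for _ in range(2000)]
    k = len(A)
    # Monte Carlo mean vs the theoretical value (n - k)(log sigma0^2 + 1).
    print(k, round(float(np.mean(sims)), 2), (n - k) * (np.log(sigma0 ** 2) + 1))
```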
Given these serious flaws in going from an unbiased estimator of the Kullback-Leibler divergence to RIC, Shi and Tsai's RIC in (6) appears improperly motivated. Fortunately, their derivation can be corrected, and we introduce a corrected RIC in the next section.

3 The corrected residual information criterion

Instead of using the residual likelihood, a justifiable approach is to use the log-likelihood, which satisfies, up to an additive constant,
$$-2L(\beta, \theta, \sigma^2) = n\log(\sigma^2) + \log|W| + (y - X\beta)'W^{-1}(y - X\beta)/\sigma^2,$$
in defining the divergence
$$d_0(\beta, \theta, \sigma^2) = E_0[-2L(\beta, \theta, \sigma^2) + 2L_0(\beta_0, \theta_0, \sigma_0^2)].$$
We can write
$$E_0[-2L(\beta, \theta, \sigma^2)] = E_0\big[n\log(\sigma^2) + \log|W| + (X\beta_0 + \varepsilon - X\beta)'W^{-1}(X\beta_0 + \varepsilon - X\beta)/\sigma^2\big]$$
$$= n\log(\sigma^2) + \log|W| + n\sigma_0^2/\sigma^2 + (X\beta_0 - X\beta)'W^{-1}(X\beta_0 - X\beta)/\sigma^2,$$
where we use $E_0[\varepsilon'W^{-1}\varepsilon] = \sigma_0^2\,\mathrm{tr}(W^{-1}W_0) \approx n\sigma_0^2$.

We can now replace $\sigma^2$, $\beta$ and $\theta$ by their estimates obtained from the residual likelihood method. Suppose again that $A_0 \subseteq A$. Following Shi and Tsai, $E_0[n\sigma_0^2/\hat\sigma^2] \approx n(n-k)/(n-k-2)$. Since $\hat\beta - \beta_0$ asymptotically follows the normal distribution $N\{0, \sigma_0^2(X_A'W^{-1}X_A)^{-1}\}$, $\tfrac{1}{k}(X\hat\beta - X\beta_0)'W^{-1}(X\hat\beta - X\beta_0)/\hat\sigma^2$ is distributed approximately as $F(k, n-k)$, and therefore
$$E_0\{(X\hat\beta - X\beta_0)'W^{-1}(X\hat\beta - X\beta_0)/\hat\sigma^2\} = \frac{k(n-k)}{n-k-2}.$$
Putting everything together, we obtain the following corrected residual information criterion, which we shall refer to as RICc,
$$\mathrm{RICc} = n\log(\hat\sigma^2) + k + \frac{4(k+1)}{n-k-2},$$
after omitting the constant $n + 2$.
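As a sketch, RICc can be computed in the same way as the RIC function above, again under the illustrative assumptions $W = I$ and $\hat\sigma^2 = \mathrm{RSS}/(n-k)$; the function name is our own.

```python
import numpy as np

def ricc(y, X, A):
    """The corrected criterion RICc = n log(sigma2_hat) + k + 4(k + 1)/(n - k - 2),
    assuming W = I and sigma2_hat = RSS / (n - k)."""
    n, k = len(y), len(A)
    XA = X[:, A]
    coef, *_ = np.linalg.lstsq(XA, y, rcond=None)
    resid = y - XA @ coef
    sigma2_hat = float(resid @ resid) / (n - k)
    return n * np.log(sigma2_hat) + k + 4.0 * (k + 1) / (n - k - 2)

# With y, X and candidates from the earlier sketch:
# best = min(candidates, key=lambda A: ricc(y, X, A))
```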
Note that
$$\mathrm{AIC} = n\log(\tilde\sigma^2) + 2k \quad\text{and}\quad \mathrm{AICc} = n\log(\tilde\sigma^2) + \frac{2n(k+1)}{n-k-2},$$
where $\tilde\sigma^2$ is the MLE of $\sigma^2$. Writing RSS for the residual sum of squares, so that $\hat\sigma^2 = \mathrm{RSS}/(n-k)$ and $\tilde\sigma^2 = \mathrm{RSS}/n$, we can decompose the first term of RICc, AIC and AICc as $n\log(\mathrm{RSS}) - n\log(n-k)$, $n\log(\mathrm{RSS}) - n\log(n)$ and $n\log(\mathrm{RSS}) - n\log(n)$, respectively. Thus, the complexity penalties of RICc, AIC and AICc are $-n\log(n-k) + k + 4(k+1)/(n-k-2)$, $-n\log(n) + 2k$ and $-n\log(n) + 2n(k+1)/(n-k-2)$, respectively. It can be seen that RICc has a larger penalty than AIC and a smaller penalty than AICc when $n \gg k$.

4 Discussion

In fitting a model to data, one is required to choose a set of candidate models, a fitting procedure and a criterion for comparing competing models. A minimal requirement for a reasonable criterion is that its population version is uniquely minimized by the set of parameters that generated the data. The population version of the residual likelihood information criterion is minimized by the full model and thus fails to meet this basic requirement. Therefore, the residual likelihood cannot be used as a discrepancy measure between models. A simple remedy is to use the likelihood-based Kullback-Leibler divergence.

Although Shi and Tsai's RIC is a legitimate criterion in its own right, our arguments show that it is not motivated by the right principle. Had one followed their motivation faithfully, RIC (i.e., RIC∗ in our notation) would have always chosen the full model. Nevertheless, Shi and Tsai's RIC, though motivated by the wrong principle (using the residual likelihood instead of the likelihood) and ignoring the important term $\log|X_A'X_A|$ in the approximation, showed good small-sample performance in their simulations. In addition, it has been successfully applied in a number of settings, such as normal linear regression, Box-Cox transformation, inverse regression models (Ni et al., 2005) and longitudinal data analysis (Li et al., 2006). This success may be understood by noting that Shi and Tsai's RIC resembles BIC. Despite the increasing popularity of RIC, it remains without a proper motivation; finding a justification for Shi and Tsai's RIC is left as a topic for future research.
References

Akaike, H. (1970). Statistical predictor identification. Annals of the Institute of Statistical Mathematics, 22, 203-217.

Azari, R., Li, L., and Tsai, C.-L. (2006). Longitudinal data model selection. Computational Statistics and Data Analysis, 50, 3053-3066.

Diggle, P. J., Heagerty, P. J., Liang, K.-Y., and Zeger, S. L. (2002). Analysis of Longitudinal Data, 2nd edition. Oxford: Oxford University Press.

Harville, D. A. (1974). Bayesian inference for variance components using only error contrasts. Biometrika, 61, 383-385.

Hurvich, C. M., and Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76, 297-307.

Ni, L., Cook, R. D., and Tsai, C.-L. (2005). A note on shrinkage sliced inverse regression. Biometrika, 92, 242-247.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.

Shao, J. (1997). An asymptotic theory for linear model selection (with discussion). Statistica Sinica, 7, 221-264.

Shi, P., and Tsai, C.-L. (2002). Regression model selection - a residual likelihood approach. Journal of the Royal Statistical Society, Series B, 64, 237-252.