Posterior Averaging Information Criterion
Shouhao Zhou*
Department of Public Health Sciences, Pennsylvania State University
*Email: [email protected]
September 22, 2020
Abstract
We propose an innovative model selection method, the posterior averaging information criterion, for Bayesian model assessment from a predictive perspective. The theoretical foundation is built on the Kullback-Leibler divergence to quantify the similarity between the proposed candidate model and the underlying true model. From a Bayesian perspective, our method evaluates the candidate models over the entire posterior distribution in terms of predicting a future independent observation. Without assuming that the true distribution is contained in the candidate models, the new criterion is developed by correcting the asymptotic bias of the posterior mean of the log-likelihood against its expected log-likelihood. It can be generally applied even for Bayesian models with degenerate non-informative priors. The simulation in both normal and binomial settings demonstrates an outstanding small sample performance.
Keywords:
Bayesian modeling, Expected out-of-sample likelihood, Kullback-Leibler divergence, Mis-specified model, Predictive model selection
Model selection plays a key role in applied statistical practice. A clearly defined model selection criterion or score usually lies at the heart of any statistical model selection procedure, and facilitates the comparison of competing models through the assignment of some sort of preference or ranking to the alternatives. Standard criteria include adjusted $R^2$ (Wherry, 1931), the Akaike information criterion (AIC; Akaike, 1973), minimum description length (MDL; Rissanen, 1978) and the Schwarz information criterion (SIC; Schwarz, 1978), to name but a few.

To choose a proper criterion for a statistical data analysis project, it is essential to distinguish the ultimate goal of modeling. Geisser and Eddy (1979) challenged research workers with two fundamental questions that should be asked in advance of any procedure conducted for model selection:

1. Which of the models best explains a given set of data?
2. Which of the models yields the best predictions for future observations from the same process that generated the given set of data?

The former question, which concerns the accuracy of the model in describing the current data, has been an empirical problem for many years; it represents the explanatory perspective. The latter question, which represents the predictive perspective, concerns the accuracy of the model in predicting future data and has drawn substantial attention in recent decades. If an infinitely large quantity of data is available, the predictive perspective and the explanatory perspective may not differ significantly. However, with a limited number of observations, as we encounter in practice, it is challenging for predictive model selection methods to achieve an optimal balance between goodness of fit and parsimony.

A substantial group of predictive model selection criteria have been proposed based on the Kullback-Leibler information divergence (Kullback and Leibler, 1951), an objective measure of the overall closeness of a probability distribution to the underlying true model. On both theoretical and applied fronts, Kullback-Leibler divergence in model selection has drawn a huge amount of attention, and a large related body of literature now exists for both frequentist and Bayesian inference.

Bayesian approaches to statistical inference have specific concerns regarding the interpretation of parameters and models. However, most of the Kullback-Leibler based Bayesian criteria follow essentially the frequentist paradigm insofar as they select a model using plug-in estimators of the parameters. Starting from the Bayesian predictive information criterion (BPIC; Ando, 2007), model selection criteria were developed over the entire posterior distribution. Nevertheless, BPIC has a number of limitations, particularly with asymmetric posterior distributions. Furthermore, BPIC is undefined under improper prior distributions, while the expected penalized loss assumes that the true model is contained in the approximating family, which limits its use in practice.

To explain the intuition of the proposed Bayesian criterion, in Section 2 we review the Kullback-Leibler divergence, its application and development in frequentist statistics, and its adaptation to Bayesian modeling based on plug-in parameter estimation. In Section 3, major attention is given to the Kullback-Leibler based predictive criterion for models evaluated by averaging over the posterior distributions of parameters.
A generally applicable method, the posterior averaging information criterion (PAIC), is proposed for comparing different Bayesian statistical models under regularity conditions. Our criterion is developed by correcting the asymptotic bias of using the posterior mean of the log-likelihood as an estimator of its expected log-likelihood, and we prove that the asymptotic property holds even though the candidate models are misspecified. In Section 4 we present some numerical studies in both normal and binomial cases to investigate its performance with small sample sizes. We conclude with a few summary remarks and discussions in Section 5.

Kullback and Leibler (1951) derived an information measure to assess the directed 'distance' between any two models. If we assume that $f(\tilde y)$ and $g(\tilde y)$ respectively represent the probability density functions of the 'true model' and the 'approximate model' on the same measurable space, the Kullback-Leibler divergence is defined by
$$I(f,g) = \int f(\tilde y)\,\log\frac{f(\tilde y)}{g(\tilde y)}\,d\tilde y = E_{\tilde y}[\log f(\tilde y)] - E_{\tilde y}[\log g(\tilde y)],$$
which is always non-negative, reaching the minimum value of 0 when $f$ is the same as $g$ almost surely. It is interpreted as the 'information' lost when $g$ is used to approximate $f$. Namely, the smaller the value of $I(f,g)$, the closer we consider the model $g$ to be to the true distribution.

Only the second term of $I(f,g)$ is relevant in practice to compare different possible models without full knowledge of the true distribution. This is because the first term, $E_{\tilde y}[\log f(\tilde y)]$, is a constant that depends only on the unknown true distribution $f$, and can be neglected in model comparison for given data.

Let $y=(y_1,y_2,\cdots,y_n)$ be $n$ independent observations in the data, let $\tilde y$, an unknown but potentially observable quantity, represent a future independent observation with the same probability density function $f(\tilde y)$, and consider an approximate model $m$ with density $g_m(\tilde y|\theta_m)$ among a list of potential models $m=1,2,\cdots,M$. For notational purposes, we ignore the model index $m$ when there is no ambiguity. The true model $f$ is referred to as the unknown data generating mechanism, which is not necessarily encompassed in the approximate model family.

As $n\to\infty$, the average of the log-likelihood,
$$\frac{1}{n}\log L(\theta|y) = \frac{1}{n}\sum_{i=1}^n \log g(y_i|\theta),$$
converges to $E_{\tilde y}[\log g(\tilde y|\theta)]$ by the law of large numbers, which suggests how we can estimate the second term of $I(f,g)$.

The model selection based on the Kullback-Leibler divergence is straightforward when all the operating models are fixed probability distributions, i.e., $g(\tilde y|\theta)=g(\tilde y)$: the model with the largest empirical log-likelihood $\sum_i \log g(y_i)$ is favored. However, when the distribution family $g(\tilde y|\theta)$ contains some unknown parameters $\theta$, the model fitting should be done first so that we may know what values the free parameters will probably take, given the data. Therefore, the log-likelihood is not optimal for predictive modeling, because the data are used twice, in both model fitting and evaluation. For a desirable out-of-sample predictive performance, a common idea is to identify a bias correction term to rectify the over-estimation bias of the in-sample estimate.

In the frequentist setting, the general model selection procedure chooses candidate models specified by some point estimate $\hat\theta$ based on a certain statistical principle such as maximum likelihood.
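As a minimal illustration of this over-estimation bias (our own sketch, not part of the original derivation; the sample size, mean and variance below are arbitrary choices), the following Python code fits a normal model with known variance by maximum likelihood and compares the in-sample average log-likelihood with its out-of-sample counterpart. The gap it reports, roughly $1/n$ in this setting, is exactly the optimism that the bias correction terms discussed next are designed to remove.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, mu_true, sigma = 20, 1.0, 1.0   # hypothetical settings for the illustration
reps = 5000

in_sample, out_sample = [], []
for _ in range(reps):
    y = rng.normal(mu_true, sigma, n)          # observed data
    y_new = rng.normal(mu_true, sigma, n)      # fresh data from the same process
    mu_hat = y.mean()                          # plug-in (maximum likelihood) estimate
    in_sample.append(norm.logpdf(y, mu_hat, sigma).mean())      # (1/n) sum_i log g(y_i | theta_hat)
    out_sample.append(norm.logpdf(y_new, mu_hat, sigma).mean()) # estimates E_y~[log g(y~ | theta_hat)]

print(np.mean(in_sample) - np.mean(out_sample))  # ~ 1/n, the optimism a bias correction must remove
```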
A considerable amount of theoretical research has addressed this problem by correcting for the bias of $\frac{1}{n}\sum_i \log g(y_i|\hat\theta)$ in the estimation of $E_{\tilde y}[\log g(\tilde y|\hat\theta)]$ (Akaike, 1973; Takeuchi, 1976; Hurvich and Tsai, 1989; Murata et al., 1994; Konishi and Kitagawa, 1996). A nice review can be found in Burnham and Anderson (2002).

Since the introduction of the Akaike information criterion (AIC; Akaike, 1973), researchers have commonly applied frequentist model selection methods to Bayesian modeling. However, the differences in the underlying philosophies between Bayesian and frequentist statistical inference caution against such direct applications. There have also been a few attempts to specialize the Kullback-Leibler divergence for Bayesian model selection (Geisser and Eddy, 1979; San Martini and Spezzaferri, 1984; Laud and Ibrahim, 1995) in the last century. Such methods are limited either in the scope of methodology or in computational feasibility, especially when the parameters of the Bayesian models are in high-dimensional hierarchical structures.

The seminal work of Spiegelhalter et al. (2002, 2014) proposed the deviance information criterion (DIC) as a Bayesian adaptation of AIC and implemented it within Bayesian inference using Gibbs sampling (BUGS; Spiegelhalter et al., 1994). Although the estimation lacks a theoretical foundation (Meng and Vaida, 2006; Celeux et al., 2006), $-\mathrm{DIC}/(2n)$, as a model selection criterion, heuristically estimates $E_{\tilde y}[\log g(\tilde y|\bar\theta)]$, the expected out-of-sample log-likelihood specified at the posterior mean, after assuming that the proposed model encompasses the true model. Alternative methods can be found either using a similar approach for mixed-effects models (Vaida and Blanchard, 2005; Liang et al., 2009; Donohue et al., 2011) or using numerical approximation (Plummer, 2008) to estimate the cross-validative predictive loss (Efron, 1983).

The preceding methods can in general be viewed as Bayesian adaptations of the information criteria originally designed for frequentist statistics, in which each model is assessed in terms of the similarity between the true distribution $f$ and the model density function specified by the plug-in parameters. This may not be ideal since, in contrast to frequentist modeling, "Bayesian inference is the process of fitting a probability model to a set of data and summarizing the result by a probability distribution on the parameters of the model and on unobserved quantities such as predictions for new observations" (Gelman et al., 2003). Rather than considering a model specified by a point estimate, it is more reasonable to assess the goodness of a Bayesian model in terms of prediction against the posterior distribution.

Obtaining the posterior averaged Kullback-Leibler discrepancy, rather than the Kullback-Leibler discrepancy specified at some point estimate, could be more computationally intensive, requiring a large set of posterior samples for numerical averaging when an analytical form is not available. However, advanced computer technology developed in recent years has made this computational cost much more feasible for Bayesian model selection. Reviews of recent developments can be found in the next section.

Ando (2007) proposed an estimator for the posterior averaged discrepancy function,
$$\eta = E_{\tilde y}\big[E_{\theta|y}\log g(\tilde y|\theta)\big].$$
The estimator of $\eta$ is
$$\hat\eta_{BPIC} = \frac{1}{n}E_{\theta|y}\log L(\theta|y) - \frac{1}{n}\Big[E_{\theta|y}\log\{\pi(\theta)L(\theta|y)\} - \log\{\pi(\hat\theta)L(\hat\theta|y)\} + \mathrm{tr}\{J_n^{-1}(\hat\theta)I_n(\hat\theta)\} + \frac{K}{2}\Big] \quad (1)$$
$$\triangleq \frac{1}{n}E_{\theta|y}\log L(\theta|y) - BC_1$$
$$= \frac{1}{n}\log\{\pi(\hat\theta)L(\hat\theta|y)\} - \frac{1}{n}\Big[E_{\theta|y}\log\pi(\theta) + \mathrm{tr}\{J_n^{-1}(\hat\theta)I_n(\hat\theta)\} + \frac{K}{2}\Big] \quad (2)$$
$$\triangleq \frac{1}{n}\log\{\pi(\hat\theta)L(\hat\theta|y)\} - BC_2.$$
Here, $BC_1$ and $BC_2$ denote the bias correction terms, $\hat\theta$ is the posterior mode, $K$ is the cardinality of $\theta$, and the matrices $J_n$ and $I_n$ are empirical estimators of the Bayesian asymptotic Hessian matrix
$$J(\theta) = -E_{\tilde y}\left(\frac{\partial^2\log\{g(\tilde y|\theta)\pi_0(\theta)\}}{\partial\theta\,\partial\theta'}\right)$$
and the Bayesian asymptotic Fisher information matrix
$$I(\theta) = E_{\tilde y}\left(\frac{\partial\log\{g(\tilde y|\theta)\pi_0(\theta)\}}{\partial\theta}\,\frac{\partial\log\{g(\tilde y|\theta)\pi_0(\theta)\}}{\partial\theta'}\right),$$
where $\log\pi_0(\theta) = \lim_{n\to\infty} n^{-1}\log\pi(\theta)$.

The BPIC is introduced as $-2n\cdot\hat\eta_{BPIC}$ and is applicable when the true model $f$ is not necessarily in the specified family of probability distributions. In model comparison, the candidate model with the minimum BPIC value is favored. However, it has the following limitations in practice.

1. Equation (1) is the original presentation of BPIC in Ando (2007). After simple algebra cancelling out the term $\frac{1}{n}E_{\theta|y}\log L(\theta|y)$ in both the estimator and the bias correction term, it is actually the plug-in estimate $\frac{1}{n}\log\{\pi(\hat\theta)L(\hat\theta|y)\}$, as shown in equation (2), that estimates $\eta$ with some bias correction. Compared with the natural estimator $n^{-1}E_{\theta|y}\log[L(\theta|y)]$, the estimation efficiency of $\eta$ using the plug-in estimator is suboptimal when the posterior distribution is asymmetric or has non-zero correlation between parameters, which occurs in a majority of cases in Bayesian modeling. This will be further illustrated in our simulation studies when we compare the bias correction performance of various criteria in small sample sizes.

2. The BPIC cannot be calculated when the prior distribution $\pi(\theta)$ is degenerate, a situation that commonly occurs in Bayesian analysis when an objective non-informative prior is selected. For example, if we use the non-informative prior $\pi(\mu)\propto 1$ for the mean parameter $\mu$ of the normal distribution in Section 4.1, the values of $\log\pi(\hat\theta)$ and $E_{\theta|y}\log\pi(\theta)$ in equation (2) are undefined.
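To make the first limitation concrete, here is a small numerical illustration of our own (the exponential-gamma model and all constants are hypothetical choices, not an example from the paper): with a right-skewed posterior, the plug-in log-likelihood at the posterior mode and the posterior-averaged log-likelihood differ noticeably, so a bias correction built around one cannot simply stand in for the other.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10
y = rng.exponential(scale=1.0, size=n)        # hypothetical data; Exp(rate = theta) likelihood

# Conjugate Gamma(a0, b0) prior on the rate theta gives a right-skewed Gamma posterior
a0, b0 = 1.0, 1.0
a_post, b_post = a0 + n, b0 + y.sum()         # posterior shape and rate

def log_lik(theta):                           # log L(theta | y) = n*log(theta) - theta*sum(y)
    return n * np.log(theta) - theta * y.sum()

theta_draws = rng.gamma(a_post, 1.0 / b_post, 200000)   # posterior draws of the rate
posterior_avg = log_lik(theta_draws).mean()             # E_{theta|y} log L(theta | y)
theta_mode = (a_post - 1.0) / b_post                     # posterior mode (a_post > 1 here)
plug_in = log_lik(theta_mode)                            # log L at the posterior mode

print(plug_in - posterior_avg)   # clearly positive for small n: the two starting points differ
```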
In order to avoid these drawbacks, we propose a new model selection criterion in terms of the posterior mean of the empirical log-likelihood, $\hat\eta = \frac{1}{n}\sum_i E_{\theta|y}[\log g(y_i|\theta)]$, a natural estimator of $\eta$. Without losing any of the attractive properties of BPIC, the new criterion expands the model scope to all Bayesian models under regularity conditions, improves the unbiasedness property for small samples, and enhances the robustness of the estimation.

Because all the data $y$ are used for both model fitting and model selection, $\hat\eta$ always overestimates $\eta$. To correct the estimation bias from the overuse of the data, we propose the following theorem.

Theorem 3.1.
Let $y=(y_1,y_2,\cdots,y_n)$ be $n$ independent observations drawn from the cumulative probability distribution $F(\tilde y)$ with density function $f(\tilde y)$. Consider $G=\{g(\tilde y|\theta);\,\theta\in\Theta\subseteq R^p\}$ as a family of candidate statistical models that do not necessarily contain the true distribution $f$, where $\theta=(\theta_1,\ldots,\theta_p)'$ is the $p$-dimensional vector of unknown parameters, with prior distribution $\pi(\theta)$. Under the following three regularity conditions:

C1: Both the log density function $\log g(\tilde y|\theta)$ and the log unnormalized posterior density $\log\{L(\theta|y)\pi(\theta)\}$ are twice continuously differentiable in the compact parameter space $\Theta$;
C2: The expected posterior mode $\theta_0=\arg\max_\theta E_{\tilde y}[\log\{g(\tilde y|\theta)\pi_0(\theta)\}]$ is unique in $\Theta$;
C3: The Hessian matrix of $E_{\tilde y}[\log\{g(\tilde y|\theta)\pi_0(\theta)\}]$ is non-singular at $\theta_0$,

the bias of $\hat\eta$ for $\eta$ can be approximated asymptotically without bias by
$$\hat\eta - \eta = \hat b_\theta \cong \frac{1}{n}\,\mathrm{tr}\{J_n^{-1}(\hat\theta)I_n(\hat\theta)\}, \quad (3)$$
where $\hat\theta$ is the posterior mode that maximizes the posterior density $\propto \pi(\theta)\prod_{i=1}^n g(y_i|\theta)$, and
$$J_n(\theta) = -\frac{1}{n}\sum_{i=1}^n\frac{\partial^2\log\{g(y_i|\theta)\pi(\theta)^{1/n}\}}{\partial\theta\,\partial\theta'},\qquad
I_n(\theta) = \frac{1}{n-1}\sum_{i=1}^n\frac{\partial\log\{g(y_i|\theta)\pi(\theta)^{1/n}\}}{\partial\theta}\,\frac{\partial\log\{g(y_i|\theta)\pi(\theta)^{1/n}\}}{\partial\theta'}.$$

Proof. Recall that the quantity of interest is $E_{\tilde y}E_{\theta|y}\log g(\tilde y|\theta)$. To estimate it, we first check
$$E_{\tilde y}E_{\theta|y}\log\{g(\tilde y|\theta)\pi_0(\theta)\} = E_{\tilde y}E_{\theta|y}\{\log g(\tilde y|\theta)+\log\pi_0(\theta)\}$$
and expand it around $\theta_0$:
$$E_{\tilde y}E_{\theta|y}\log\{g(\tilde y|\theta)\pi_0(\theta)\}
= E_{\tilde y}\log\{g(\tilde y|\theta_0)\pi_0(\theta_0)\}
+ E_{\theta|y}(\theta-\theta_0)'\frac{\partial E_{\tilde y}\log\{g(\tilde y|\theta)\pi_0(\theta)\}}{\partial\theta}\Big|_{\theta=\theta_0}$$
$$\qquad + \frac{1}{2}E_{\theta|y}\Big[(\theta-\theta_0)'\frac{\partial^2 E_{\tilde y}\log\{g(\tilde y|\theta)\pi_0(\theta)\}}{\partial\theta\,\partial\theta'}\Big|_{\theta=\theta_0}(\theta-\theta_0)\Big] + o_p(n^{-1})$$
$$= E_{\tilde y}\log\{g(\tilde y|\theta_0)\pi_0(\theta_0)\}
+ E_{\theta|y}(\theta-\theta_0)'\frac{\partial E_{\tilde y}\log\{g(\tilde y|\theta)\pi_0(\theta)\}}{\partial\theta}\Big|_{\theta=\theta_0}
- \frac{1}{2}E_{\theta|y}\big[(\theta-\theta_0)'J(\theta_0)(\theta-\theta_0)\big] + o_p(n^{-1})$$
$$\triangleq I_1 + I_2 + I_3 + o_p(n^{-1}). \quad (4)$$

The first term $I_1$ can be linked to the empirical log-likelihood function as follows:
$$E_{\tilde y}\log\{g(\tilde y|\theta_0)\pi_0(\theta_0)\} = E_{\tilde y}\log g(\tilde y|\theta_0) + \log\pi_0(\theta_0)
= E_y\frac{1}{n}\log L(\theta_0|y) + \log\pi_0(\theta_0)$$
$$= E_y\frac{1}{n}\log\{L(\theta_0|y)\pi(\theta_0)\} - \frac{1}{n}\log\pi(\theta_0) + \log\pi_0(\theta_0)$$
$$= E_yE_{\theta|y}\frac{1}{n}\log\{L(\theta|y)\pi(\theta)\} - \frac{1}{2n}\mathrm{tr}\{J_n^{-1}(\theta_0)I(\theta_0)\} + \frac{1}{2n}\mathrm{tr}\{J_n^{-1}(\hat\theta)J_n(\theta_0)\} - \frac{1}{n}\log\pi(\theta_0) + \log\pi_0(\theta_0) + o_p(n^{-1}),$$
where the last equation holds due to Lemma 5.5 (together with the other lemmas, provided in the Appendix).

The second term $I_2$ vanishes since $\frac{\partial E_{\tilde y}\log\{g(\tilde y|\theta)\pi_0(\theta)\}}{\partial\theta}\big|_{\theta=\theta_0}=0$, as $\theta_0$ is the expected posterior mode.

Using Lemma 5.4, the third term $I_3$ can be rewritten as
$$I_3 = -\frac{1}{2}E_{\theta|y}(\theta-\theta_0)'J(\theta_0)(\theta-\theta_0)
= -\frac{1}{2}\mathrm{tr}\big\{E_{\theta|y}[(\theta-\theta_0)(\theta-\theta_0)']\,J(\theta_0)\big\}$$
$$= -\frac{1}{2n}\Big(\mathrm{tr}\{J_n^{-1}(\theta_0)I(\theta_0)J_n^{-1}(\theta_0)J(\theta_0)\} + \mathrm{tr}\{J_n^{-1}(\hat\theta)J(\theta_0)\}\Big) + o_p(n^{-1}).$$
By substituting each term in equation (4) and neglecting the residual term, we obtain
$$E_{\tilde y}E_{\theta|y}\log\{g(\tilde y|\theta)\pi_0(\theta)\} \cong E_yE_{\theta|y}\frac{1}{n}\log\{L(\theta|y)\pi(\theta)\} - \frac{1}{2n}\mathrm{tr}\{J_n^{-1}(\theta_0)I(\theta_0)\} + \frac{1}{2n}\mathrm{tr}\{J_n^{-1}(\hat\theta)J_n(\theta_0)\}$$
$$\qquad - \frac{1}{n}\log\pi(\theta_0) + \log\pi_0(\theta_0) - \frac{1}{2n}\Big(\mathrm{tr}\{J_n^{-1}(\theta_0)I(\theta_0)J_n^{-1}(\theta_0)J(\theta_0)\} + \mathrm{tr}\{J_n^{-1}(\hat\theta)J(\theta_0)\}\Big).$$

Recall that we have defined $\log\pi_0(\theta)=\lim_{n\to\infty}n^{-1}\log\pi(\theta)$, so that asymptotically we have
$$\log\pi_0(\theta_0) - \frac{1}{n}\log\pi(\theta_0) \cong 0,\qquad
E_{\theta|y}\log\{\pi_0(\theta)\} - E_{\theta|y}\frac{1}{n}\log\{\pi(\theta)\} \cong 0.$$
Therefore, $E_{\tilde y}E_{\theta|y}\log\{g(\tilde y|\theta)\}$ can be estimated by
$$E_{\tilde y}E_{\theta|y}\log\{g(\tilde y|\theta)\} = E_{\tilde y}E_{\theta|y}\log\{g(\tilde y|\theta)\pi_0(\theta)\} - E_{\theta|y}\log\{\pi_0(\theta)\}$$
$$\cong E_yE_{\theta|y}\frac{1}{n}\log\{L(\theta|y)\pi(\theta)\} - \frac{1}{2n}\mathrm{tr}\{J_n^{-1}(\theta_0)I(\theta_0)\} + \frac{1}{2n}\mathrm{tr}\{J_n^{-1}(\hat\theta)J_n(\theta_0)\}$$
$$\qquad - \frac{1}{2n}\Big(\mathrm{tr}\{J_n^{-1}(\theta_0)I(\theta_0)J_n^{-1}(\theta_0)J(\theta_0)\} + \mathrm{tr}\{J_n^{-1}(\hat\theta)J(\theta_0)\}\Big) - \frac{1}{n}\log\pi(\theta_0) + \log\pi_0(\theta_0) - E_{\theta|y}\log\{\pi_0(\theta)\}$$
$$\cong E_yE_{\theta|y}\frac{1}{n}\log\{L(\theta|y)\} - \frac{1}{2n}\mathrm{tr}\{J_n^{-1}(\theta_0)I(\theta_0)\} + \frac{1}{2n}\mathrm{tr}\{J_n^{-1}(\hat\theta)J_n(\theta_0)\}$$
$$\qquad - \frac{1}{2n}\Big(\mathrm{tr}\{J_n^{-1}(\theta_0)I(\theta_0)J_n^{-1}(\theta_0)J(\theta_0)\} + \mathrm{tr}\{J_n^{-1}(\hat\theta)J(\theta_0)\}\Big).$$
Replacing $\theta_0$ by $\hat\theta$, $J(\theta_0)$ by $J_n(\hat\theta)$ and $I(\theta_0)$ by $I_n(\hat\theta)$, we obtain $E_{\theta|y}\frac{1}{n}\log\{L(\theta|y)\} - \frac{1}{n}\mathrm{tr}\{J_n^{-1}(\hat\theta)I_n(\hat\theta)\}$ as an asymptotically unbiased estimate of $E_{\tilde y}E_{\theta|y}\log\{g(\tilde y|\theta)\}$.

With the above result, we propose a new predictive criterion for Bayesian modeling, the posterior averaging information criterion (PAIC), as
$$\mathrm{PAIC} = -2\sum_i E_{\theta|y}[\log g(y_i|\theta)] + 2\,\mathrm{tr}\{J_n^{-1}(\hat\theta)I_n(\hat\theta)\}. \quad (5)$$
The candidate models with small criterion values are preferred for the purpose of model selection.

The PAIC has several attractive properties. First, it assesses Bayesian model performance with respect to the posterior distribution, which represents the current best knowledge from a Bayesian perspective within the family of candidate models. When the posterior distribution of the parameters is asymmetric, we expect it to perform better than any plug-in estimate based approach. Secondly, it is an asymptotically unbiased estimator of the out-of-sample log-likelihood, a measure in terms of the Kullback-Leibler divergence of the similarity between the fitted model and the underlying true distribution. Thirdly, PAIC is derived free of the assumption that the approximating distributions contain the truth, indicating that PAIC is generally applicable even though some models could be misspecified. Lastly, unlike the BPIC, the PAIC is well-defined and can cope with degenerate non-informative prior distributions for the parameters.

Several Bayesian researchers have also focused on the posterior averaged Kullback-Leibler discrepancy using cross-validation. Plummer (2008) introduced the expected deviance penalized loss with 'expected deviance' defined as
$$L_e(y_i,z) = -2E_{\theta|z}\log g(y_i|\theta),$$
which is a special case of the predictive discrepancy measure (Gelfand and Ghosh, 1998). The standard cross-validation method can also be applied in this circumstance to estimate $\eta$, simply by considering the Kullback-Leibler discrepancy as the utility function of Vehtari and Lampinen (2002), and further investigated by Gelman et al. (2014).
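To make the quantities in (5) concrete, the following Python sketch (our illustration, not code from the paper) assembles PAIC from posterior draws and numerical derivatives of the per-observation contributions $\log\{g(y_i|\theta)\pi(\theta)^{1/n}\}$. The functions log_lik and log_prior, the posterior draws, and the posterior mode are placeholders the user must supply for their own model; the finite-difference derivatives are one possible way to avoid hand-coding the Hessian.

```python
import numpy as np

def paic(y, theta_draws, log_lik, log_prior, theta_hat, eps=1e-5):
    """Schematic PAIC: -2 * sum_i E_{theta|y}[log g(y_i|theta)] + 2 * tr(Jn^{-1} In).

    y           : array of n observations
    theta_draws : (S, p) posterior draws
    log_lik     : log_lik(y_i, theta) -> scalar log g(y_i | theta)
    log_prior   : log_prior(theta) -> scalar log pi(theta)
    theta_hat   : posterior mode, shape (p,)
    """
    n, p = len(y), len(theta_hat)

    # Posterior-averaged empirical log-likelihood, sum_i E_{theta|y}[log g(y_i|theta)]
    fit = sum(np.mean([log_lik(yi, th) for th in theta_draws]) for yi in y)

    # Per-observation contribution h_i(theta) = log g(y_i|theta) + (1/n) log pi(theta)
    def h(i, th):
        return log_lik(y[i], th) + log_prior(th) / n

    # Central-difference gradients and Hessians of h_i at the posterior mode
    grads, hessians = [], []
    for i in range(n):
        g = np.zeros(p)
        H = np.zeros((p, p))
        for a in range(p):
            ea = np.zeros(p); ea[a] = eps
            g[a] = (h(i, theta_hat + ea) - h(i, theta_hat - ea)) / (2 * eps)
            for b in range(p):
                eb = np.zeros(p); eb[b] = eps
                H[a, b] = (h(i, theta_hat + ea + eb) - h(i, theta_hat + ea - eb)
                           - h(i, theta_hat - ea + eb) + h(i, theta_hat - ea - eb)) / (4 * eps**2)
        grads.append(g); hessians.append(H)

    J_n = -np.mean(hessians, axis=0)                    # empirical Bayesian Hessian J_n(theta_hat)
    I_n = sum(np.outer(g, g) for g in grads) / (n - 1)  # empirical Bayesian Fisher information I_n(theta_hat)
    correction = np.trace(np.linalg.solve(J_n, I_n))    # tr{ J_n^{-1} I_n }
    return -2.0 * fit + 2.0 * correction
```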
The estimation of the bootstrap error correction $\eta^{(b)}-\hat\eta^{(b)}$, with bootstrap analogues $\eta^{(b)} = E_{y^*}[E_{\theta|y^*}\log g(\tilde y|\theta)]$ and $\hat\eta^{(b)} = E_{y^*}[n^{-1}E_{\theta|y^*}\log L(\theta|y^*)]$, for $\eta-\hat\eta$ was discussed by Ando (2007) as a Bayesian adaptation of frequentist model selection (Konishi and Kitagawa, 1996). Although numeric approaches such as importance sampling can be used for the intensive computation, one caveat is that it may cause inaccurate estimation in practice if some observation $y_i$ is influential (Vehtari and Lampinen, 2002). To address that problem, Vehtari et al. (2017) proposed Pareto smoothed importance sampling, a new algorithm for regularizing importance weights, and developed a numerical tool (Vehtari et al., 2018) to facilitate computation. Watanabe (2010) established a singular learning theory and proposed a new criterion named Watanabe-Akaike (Gelman et al., 2014), or widely applicable information criterion (WAIC; Watanabe 2008, 2009), where WAIC$_1$ was proposed for the plug-in discrepancy and WAIC$_2$ for the posterior averaged discrepancy. However, compared with BPIC and PAIC, we found that WAIC$_2$ tends to have larger bias and variation for regular Bayesian models, as shown in the simulation studies in the next section.

In this section, we present numerical results to study the behavior of the proposed method under small and moderate sample sizes in both Gaussian and non-Gaussian settings. In the simulation experiments, we estimate the true value of $\eta$ either analytically (Section 4.1) or by averaging $E_{\theta|y}[\log g(\tilde y|\theta)]$ over a large number of extra independent draws of $\tilde y$ when there is no closed form for the integration (Section 4.2).

Suppose the observations $y=(y_1,y_2,\ldots,y_n)$ are a vector of iid samples generated from $N(\mu_T,\sigma_T^2)$, with unknown true mean $\mu_T$ and variance $\sigma_T^2=1$. Assume the data are analyzed by the approximating model $g(y_i|\mu)=N(\mu,\sigma_A^2)$ with prior $\pi(\mu)=N(\mu_0,\tau^2)$, where $\sigma_A^2$ is fixed, but not necessarily equal to the true variance $\sigma_T^2$. When $\sigma_A^2\neq\sigma_T^2$, the model is misspecified.

The posterior distribution of $\mu$ is normal with mean $\hat\mu$ and variance $\hat\sigma^2$, where
$$\hat\mu = \Big(\mu_0/\tau^2 + \sum_{i=1}^n y_i/\sigma_A^2\Big)\Big/\big(1/\tau^2 + n/\sigma_A^2\big),\qquad
\hat\sigma^2 = 1\big/\big(1/\tau^2 + n/\sigma_A^2\big).$$
Therefore, we obtain the Kullback-Leibler discrepancy function and its estimator as
$$\eta = E_{\tilde y}\big[E_{\mu|y}[\log g(\tilde y|\mu)]\big] = -\frac{1}{2}\log(2\pi\sigma_A^2) - \frac{\sigma_T^2 + (\mu_T-\hat\mu)^2 + \hat\sigma^2}{2\sigma_A^2},$$
$$\hat\eta = \frac{1}{n}\sum_{i=1}^n E_{\mu|y}[\log g(y_i|\mu)] = -\frac{1}{2}\log(2\pi\sigma_A^2) - \frac{\frac{1}{n}\sum_{i=1}^n (y_i-\hat\mu)^2 + \hat\sigma^2}{2\sigma_A^2}.$$

To eliminate the estimation error caused by the sampling of the observations $y$, we average the bias $\hat\eta-\eta$ over $y$ with its true density $N(\mu_T,\sigma_T^2)$:
$$b_\mu = E_y(\hat\eta-\eta) = E_y\Big\{\frac{\sigma_T^2}{2\sigma_A^2} + \frac{(\mu_T-\hat\mu)^2}{2\sigma_A^2} - \frac{1}{2n\sigma_A^2}\sum_{i=1}^n (y_i-\hat\mu)^2\Big\} = \frac{\sigma_T^2\hat\sigma^2}{\sigma_A^4}.$$

Figure 1: Performance of the bias estimators for $E_y(\hat\eta-\eta)$. The top panels are under a relatively non-informative prior with $\tau^2=100$; the middle panels are under the case that the prior information grows with the sample size, with $\tau^2=100/n$; the bottom panels are under an informative prior with $\tau^2=0.5$. The left panels (a), (b) and (c) are under the scenario $\sigma_A=\sigma_T=1$, i.e., the true distribution is contained in the candidate models. The middle panels (d), (e) and (f) are under the scenario $\sigma_A=1.5$ and the right panels (g), (h) and (i) under the scenario $\sigma_A=0.5$, where the proposed model is misspecified relative to $\sigma_T=1$. The true bias $b_\mu$ is drawn as a curve (—) as a function of the sample size $n$. The averages of the different bias estimators are marked by (•) for PAIC, (◦) for BPIC, (□) for $p_{opt}$, (+) for WAIC$_2$, and (×) for cross-validation. Each mark represents the mean of the estimated bias over 100,000 replications of $y$.

Here, we compare the bias estimate defined in Theorem 3.1, $\hat b_\mu^{PAIC}$, with four other bias estimators: $\hat b_\mu^{BPIC}$ (Ando, 2007), $\hat b_\mu^{WAIC_2}$ (Watanabe, 2009), $\hat b_\mu^{p_{opt}}$ (Plummer, 2008), and $\hat b_\mu^{CV}$ (Stone, 1974):
$$\hat b_\mu^{PAIC} = \frac{\hat\sigma^2}{n-1}\sum_{i=1}^n\Big(\frac{\mu_0-\hat\mu}{n\tau^2} + \frac{y_i-\hat\mu}{\sigma_A^2}\Big)^2,$$
$$\hat b_\mu^{BPIC} = \frac{\hat\sigma^2}{n}\sum_{i=1}^n\Big(\frac{\mu_0-\hat\mu}{n\tau^2} + \frac{y_i-\hat\mu}{\sigma_A^2}\Big)^2,$$
$$\hat b_\mu^{WAIC_2} = \frac{\hat\sigma^2}{\sigma_A^4}\Big(\frac{\hat\sigma^2}{2} + \frac{1}{n}\sum_{i=1}^n (y_i-\hat\mu)^2\Big),$$
$$\hat b_\mu^{p_{opt}} = \frac{1}{2n}\,p_{opt} = 1\Big/\Big[\Big(\frac{1}{\tau^2}+\frac{n-1}{\sigma_A^2}\Big)\sigma_A^2\Big],$$
$$\hat b_\mu^{CV} = \hat\eta - \Bigg[-\frac{1}{2}\log(2\pi\sigma_A^2) - \bigg\{\frac{1}{n}\sum_{i=1}^n\Big(y_i - \frac{\mu_0/\tau^2+\sum_{j\neq i}y_j/\sigma_A^2}{1/\tau^2+(n-1)/\sigma_A^2}\Big)^2 + \frac{1}{1/\tau^2+(n-1)/\sigma_A^2}\bigg\}\Big/(2\sigma_A^2)\Bigg].$$

The results are in accordance with the theory. All of the estimates are close to the true bias-correction values when the model is well specified with $\sigma_A^2=\sigma_T^2=1$, especially when the sample size becomes moderately large (Figure 1, panels (a), (b) and (c)). The estimated values based on the PAIC are consistently closer to the true values than those based on Ando's method, which underestimates the bias, or on WAIC$_2$, cross-validation and the expected deviance penalized loss, which overestimate the bias, especially when the sample size is small. When the models are misspecified, it is not surprising that in all of the plots given in panels (d)-(i) of Figure 1, only the expected deviance penalized loss misses the target even asymptotically, since its assumption is violated, whereas all the other approaches converge to $b_\mu$.
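As a quick numerical check of the closed forms above (a sketch we add for illustration; the constants below are arbitrary choices), the following Python code averages $\hat\eta-\eta$ over many simulated data sets and compares it with $b_\mu=\sigma_T^2\hat\sigma^2/\sigma_A^4$ and with the average of $\hat b_\mu^{PAIC}$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, mu_T, sig2_T = 25, 0.3, 1.0          # sample size and true N(mu_T, sigma_T^2)
mu0, tau2, sig2_A = 0.0, 100.0, 1.5**2  # prior N(mu0, tau^2); fixed (misspecified) model variance
reps = 100000

sig2_hat = 1.0 / (1.0 / tau2 + n / sig2_A)               # posterior variance (same for every data set)
bias_true, b_paic = [], []
for _ in range(reps):
    y = rng.normal(mu_T, np.sqrt(sig2_T), n)
    mu_hat = sig2_hat * (mu0 / tau2 + y.sum() / sig2_A)   # posterior mean
    # eta_hat - eta from the two closed forms (the -log(2*pi*sigma_A^2)/2 terms cancel)
    bias_true.append((sig2_T + (mu_T - mu_hat) ** 2 - np.mean((y - mu_hat) ** 2)) / (2 * sig2_A))
    # PAIC bias estimate from Theorem 3.1 specialized to this model
    score = (mu0 - mu_hat) / (n * tau2) + (y - mu_hat) / sig2_A
    b_paic.append(sig2_hat * np.sum(score ** 2) / (n - 1))

print(np.mean(bias_true))                   # Monte Carlo estimate of E_y(eta_hat - eta)
print(sig2_T * sig2_hat / sig2_A ** 2)      # analytic b_mu = sigma_T^2 * sigma_hat^2 / sigma_A^4
print(np.mean(b_paic))                      # average PAIC bias estimate, close to the two values above
```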
In summary, PAIC achieves the best overall performance.

Consider frequencies $y=\{y_1,\ldots,y_N\}$, which are independent observations from binomial distributions with respective true probabilities $\xi_{T1},\ldots,\xi_{TN}$ and sample sizes $n_1,\ldots,n_N$. To draw inference about the $\xi$'s, we assume that the logits
$$\beta_i = \mathrm{logit}(\xi_i) = \log\frac{\xi_i}{1-\xi_i}$$
are random effects that follow the normal distribution $\beta_i\sim N(\mu,\tau^2)$. A weakly-informative joint prior distribution, a normal density centered at zero for $\mu$ times a scaled inverse-$\chi^2$ density for $\tau^2$, is proposed on the hyper-parameters $(\mu,\tau^2)$ so that the BPIC is properly defined and computable. The posterior distribution is asymmetric due to the logistic transformation.

We compare the performance of four asymptotically unbiased bias estimators in this hierarchical, asymmetric setting. The true value of $\eta$ does not have an analytical form. We estimate it through numerical computation using independent simulation from the same data generating process, assuming the underlying true values $\mu=0$ and $\tau^2=1$. The simulation scheme is as follows (a code sketch of one replication is given after the list):

1. Draw $\beta_{T,i}\sim N(0,1)$ and $y_i\sim \mathrm{Bin}(n_i,\mathrm{logit}^{-1}(\beta_{T,i}))$, $i=1,\ldots,N$, from the true distribution.
2. Simulate the posterior draws of $(\beta,\mu,\tau^2)\,|\,y$.
3. Estimate $\hat b_\beta^{PAIC}$, $\hat b_\beta^{BPIC}$, $\hat b_\beta^{WAIC_2}$ and $\hat b_\beta^{CV}$.
4. Draw $z^{(j)}\sim \mathrm{Bin}(n,\mathrm{logit}^{-1}(\beta_T))$, $j=1,\ldots,J$, for approximation of the true $\eta$.
5. Compare each $\hat b_\beta$ with the true bias $b_\beta = \hat\eta - \eta$.
6. Repeat steps 1-5.
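The Python sketch below (ours, with illustrative constants) walks through steps 1, 4 and 5 of one replication. Step 2 in practice requires MCMC on the hierarchical model; here a crude per-group normal approximation on the logit scale is used only as a stand-in so the sketch runs end to end, and step 3 is delegated to the criteria definitions above.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(3)
N, n_i, J = 15, 50, 1000                  # groups, binomial size, future draws (illustrative values)

def ilogit(x):
    return 1.0 / (1.0 + np.exp(-x))

# Step 1: one data set from the true process (mu = 0, tau^2 = 1)
beta_T = rng.normal(0.0, 1.0, N)
y = rng.binomial(n_i, ilogit(beta_T))

# Step 2 (stand-in): posterior draws of beta | y.  A real analysis would use MCMC on the
# hierarchical model; a crude per-group normal approximation keeps this sketch runnable.
S = 1000
p_hat = (y + 0.5) / (n_i + 1.0)                        # smoothed empirical proportions
se = np.sqrt(1.0 / (n_i * p_hat * (1.0 - p_hat)))      # approximate s.e. on the logit scale
beta_draws = rng.normal(np.log(p_hat / (1.0 - p_hat)), se, size=(S, N))

# Step 4: fresh out-of-sample counts z^(j) ~ Bin(n_i, ilogit(beta_T)), j = 1, ..., J
z = rng.binomial(n_i, ilogit(beta_T), size=(J, N))

# Posterior-averaged log-likelihoods: in-sample (eta_hat) and out-of-sample (eta)
eta_hat = np.mean([binom.logpmf(y, n_i, ilogit(b)).mean() for b in beta_draws])
eta = np.mean([binom.logpmf(z, n_i, ilogit(b)).mean() for b in beta_draws])

# Step 5: the 'true' bias for this replication, the target each estimator b_hat tries to match
b_beta = eta_hat - eta
print(b_beta)
```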
Table 1: The estimation error of bias correction: the mean and standard deviation (in parentheses) from 1000 replications. For each criterion (PAIC, BPIC, CV, WAIC$_2$), the columns report the actual error $\hat\eta-\eta-\hat b_\beta$, the mean absolute error $|\hat\eta-\eta-\hat b_\beta|$, and the mean square error $(\hat\eta-\eta-\hat b_\beta)^2$.

Table 1 summarizes the bias and standard deviation of the estimation error when we choose $N=15$ and $n_1=\ldots=n_N=50$, and the $\beta$'s are independently simulated from the standard normal distribution, assuming the true hyper-parameter mean $\mu=0$ and variance $\tau^2=1$. The simulation is repeated for 1,000 scenarios, each with $J=20{,}000$ draws for the out-of-sample $\eta$ estimation. PAIC and BPIC were calculated based on their definitions; leave-one-out cross-validation and WAIC$_2$ were estimated using the R package loo v2.0.0. The actual error, mean absolute error and mean square error were considered to assess the estimation error of the bias correction estimates. With respect to all three metrics, the bias estimation of PAIC is consistently superior to the other methods. Compared to BPIC, the second best performing model selection criterion, PAIC reduces the bias and the mean squared error, and reduces the absolute bias by about one quarter, which matches our expectation that the natural estimate $\frac{1}{n}\sum_i E_{\theta|y}[\log g(y_i|\theta)]$ estimates the posterior averaged Kullback-Leibler discrepancy more precisely than the plug-in estimate $\frac{1}{n}\sum_i\log g(y_i|\hat\theta)$ when the posterior distribution is asymmetric and correlated. Compared to WAIC$_2$, the bias, absolute error and mean square error of PAIC are all dramatically reduced. In practice, we expect the improvement to be even larger when the proposed models are more complicated in terms of hierarchical structures.

The Kullback-Leibler divergence is a non-symmetric measure of the difference between two probability distributions. Frequentist statistics theoretically employing the Kullback-Leibler divergence in parametric model selection emerged during the 1970s. Since then, the development of related theory and applications has rapidly accelerated.

Bayesian model selection in terms of the Kullback-Leibler divergence has drawn substantial attention in the past two decades. The availability of both fast computers and advanced numerical methods enables the empirical popularity of Bayesian modeling, which allows for additional flexibility to incorporate information beyond the data, as represented by the prior distribution. The fundamental assumption of Bayesian inference differs from that of frequentist statistics, for the unknown parameters are treated as random variables in the form of a probability distribution. Taking this into account, it is important to have selection techniques that are specifically designed for Bayesian modeling.

Before the proposal of any specific model selection criterion, two questions should first be investigated to guide the method development. 1. What is a natural Kullback-Leibler discrepancy to evaluate a Bayesian model? 2. What is a good estimate of the Kullback-Leibler discrepancy for a Bayesian model? The prevailing plug-in parameter methods, such as DIC, presume the candidate models are correct, and assess the goodness of each candidate model with a density function specified by the plug-in parameters. However, from a Bayesian perspective, it is inherent to examine the performance of a Bayesian model over the entire posterior distribution, as stated by Celeux et al. (2006, p.703): "...
we concede that using a plug-in estimate disqualifies the technique from being properly Bayesian." Accordingly, statistical approaches that estimate the Kullback-Leibler discrepancy as evaluated by averaging over the posterior distribution are of great interest.

We have proposed PAIC, a versatile model selection technique for Bayesian models under regularity assumptions, to address this problem. From a predictive perspective, we consider the asymptotically unbiased estimation of a Kullback-Leibler discrepancy, which averages the conditional density of the observable data against the posterior knowledge about the unobservable data. Empirically, the proposed PAIC measures the similarity of the fitted model and the underlying true distribution, regardless of whether or not the approximating distribution family contains the true model. The range of applications of the proposed criterion can be quite broad.

PAIC and BPIC (Ando, 2007) are similar in many aspects. In addition to all the good properties both methods share, PAIC has some special features, mainly because of the natural log-likelihood estimator. For example, PAIC can be applied even if the prior distribution of the parameters degenerates, in which case BPIC becomes uninterpretable; a non-informative prior appears quite often in practice. When the posterior distribution is asymmetric or the parameters are correlated, our method provides a better bias estimation by using the natural estimator, which evaluates the log-likelihood over the posterior distribution instead of at some specific point. In numerical experiments comparing the performance of the proposed criterion with other Bayesian model selection criteria, including BPIC and WAIC$_2$, PAIC has the smallest bias and variance in estimating the posterior averaged discrepancy. Even for data with small sample sizes, the bias correction of PAIC still achieves better performance.

There are some future directions for the current work. A more comprehensive comparison of Bayesian predictive methods for model selection can be investigated by taking into account the likely over-fitting in the selection phase, similar to Piironen and Vehtari (2017). Because it is inconvenient that users of PAIC and BPIC have to specify the first and second derivatives of the posterior distribution in their modeling, the development of advanced computational tools for simultaneous calculation is needed. In singular learning machines, the regularity conditions can be relaxed to singular settings, in the sense that the mapping from parameters to probability distributions is not necessarily one-to-one. Although here we focused only on regular models, it is also possible to generalize PAIC to singular settings with a modified bias correction term, after an algebraic geometrical transformation of the singular parameter space to a real $d$-dimensional manifold.

SUPPLEMENTAL MATERIALS
Lemmas for Proof of Theorem 3.1
Some important notations
By the law of large numbers we have $\frac{1}{n}\log\{L(\theta|y)\pi(\theta)\}\to E_{\tilde y}[\log\{g(\tilde y|\theta)\pi_0(\theta)\}]$ as $n$ tends to infinity. Denote by $\theta_0$ and $\hat\theta$ the expected and empirical posterior modes of the log unnormalized posterior density $\log\{L(\theta|y)\pi(\theta)\}$, i.e.,
$$\theta_0 = \arg\max_\theta E_{\tilde y}[\log\{g(\tilde y|\theta)\pi_0(\theta)\}],\qquad
\hat\theta = \arg\max_\theta \frac{1}{n}\log\{L(\theta|y)\pi(\theta)\},$$
and let $I(\theta)$ and $J(\theta)$ denote the Bayesian Fisher information matrix and the Bayesian Hessian matrix,
$$I(\theta) = E_{\tilde y}\left(\frac{\partial\log\{g(\tilde y|\theta)\pi_0(\theta)\}}{\partial\theta}\,\frac{\partial\log\{g(\tilde y|\theta)\pi_0(\theta)\}}{\partial\theta'}\right)\quad\text{and}\quad
J(\theta) = -E_{\tilde y}\left(\frac{\partial^2\log\{g(\tilde y|\theta)\pi_0(\theta)\}}{\partial\theta\,\partial\theta'}\right).$$

Proof of Lemmas
We start with a few lemmas that support the proof of Theorem 3.1.
Lemma 5.1.
Under the same regularity conditions as in Theorem 3.1, $\sqrt{n}(\hat\theta-\theta_0)$ is asymptotically distributed as $N\big(0,\,J_n^{-1}(\theta_0)I(\theta_0)J_n^{-1}(\theta_0)\big)$.

Proof. Consider the Taylor expansion of $\frac{\partial\log\{L(\theta|y)\pi(\theta)\}}{\partial\theta}\big|_{\theta=\hat\theta}$ at $\theta_0$:
$$\frac{\partial\log\{L(\theta|y)\pi(\theta)\}}{\partial\theta}\Big|_{\theta=\hat\theta}
\cong \frac{\partial\log\{L(\theta|y)\pi(\theta)\}}{\partial\theta}\Big|_{\theta=\theta_0}
+ \frac{\partial^2\log\{L(\theta|y)\pi(\theta)\}}{\partial\theta\,\partial\theta'}\Big|_{\theta=\theta_0}(\hat\theta-\theta_0)
= \frac{\partial\log\{L(\theta|y)\pi(\theta)\}}{\partial\theta}\Big|_{\theta=\theta_0} - nJ_n(\theta_0)(\hat\theta-\theta_0).$$
Note that $\hat\theta$ is the mode of $\log\{L(\theta|y)\pi(\theta)\}$ and satisfies $\frac{\partial\log\{L(\theta|y)\pi(\theta)\}}{\partial\theta}\big|_{\theta=\hat\theta}=0$. Plugging this into the above equation, we have
$$nJ_n(\theta_0)(\hat\theta-\theta_0) \cong \frac{\partial\log\{L(\theta|y)\pi(\theta)\}}{\partial\theta}\Big|_{\theta=\theta_0}. \quad (6)$$
From the central limit theorem, the right-hand side (RHS) of equation (6) is approximately distributed as $N(0,\,nI(\theta_0))$ since $E_y\frac{\partial\log\{L(\theta|y)\pi(\theta)\}}{\partial\theta}\big|_{\theta=\theta_0}\to 0$. Therefore $\sqrt{n}(\hat\theta-\theta_0)\sim N\big(0,\,J_n^{-1}(\theta_0)I(\theta_0)J_n^{-1}(\theta_0)\big)$.

Lemma 5.2.
Under the same regularity conditions as in Theorem 3.1, $\sqrt{n}(\theta-\hat\theta)\sim N\big(0,\,J_n^{-1}(\hat\theta)\big)$.

Proof. Taylor-expand the logarithm of $L(\theta|y)\pi(\theta)$ around the posterior mode $\hat\theta$:
$$\log L(\theta|y)\pi(\theta) = \log L(\hat\theta|y)\pi(\hat\theta) - \frac{1}{2}(\theta-\hat\theta)'\,nJ_n(\hat\theta)\,(\theta-\hat\theta) + o_p(n^{-1}), \quad (7)$$
where $J_n(\hat\theta) = -\frac{1}{n}\frac{\partial^2\log\{L(\theta|y)\pi(\theta)\}}{\partial\theta\,\partial\theta'}\big|_{\theta=\hat\theta}$.
Consider the RHS of equation (7) as a function of $\theta$: the first term is a constant, whereas the second term is proportional to the logarithm of a normal density. This yields the approximation of the posterior distribution for $\theta$:
$$p(\theta|y) \approx N\big(\hat\theta,\,\tfrac{1}{n}J_n^{-1}(\hat\theta)\big),$$
which completes the proof. Alternatively, though less intuitively, this lemma can also be proved by applying the Bernstein-von Mises theorem.

Lemma 5.3.
Under the same regularity conditions as in Theorem 3.1, $E_{\theta|y}(\theta-\hat\theta)(\hat\theta-\theta_0)' = o_p(n^{-1})$.

Proof. First we have
$$\frac{\partial\log\{L(\theta|y)\pi(\theta)\}}{\partial\theta} = \frac{\partial\log\{L(\theta|y)\pi(\theta)\}}{\partial\theta}\Big|_{\theta=\hat\theta} - nJ_n(\hat\theta)(\theta-\hat\theta) + O_p(1).$$
Since $\hat\theta$ is the mode of $\log\{L(\theta|y)\pi(\theta)\}$, it satisfies $\frac{\partial\log\{L(\theta|y)\pi(\theta)\}}{\partial\theta}\big|_{\theta=\hat\theta}=0$. Therefore
$$(\hat\theta-\theta) = n^{-1}J_n^{-1}(\hat\theta)\frac{\partial\log\{L(\theta|y)\pi(\theta)\}}{\partial\theta} + O_p(n^{-1}).$$
Note that
$$E_{\theta|y}\frac{\partial\log\{L(\theta|y)\pi(\theta)\}}{\partial\theta}
= \int \frac{\partial\log\{L(\theta|y)\pi(\theta)\}}{\partial\theta}\,\frac{L(\theta|y)\pi(\theta)}{p(y)}\,d\theta
= \int \frac{1}{L(\theta|y)\pi(\theta)}\frac{\partial\{L(\theta|y)\pi(\theta)\}}{\partial\theta}\,\frac{L(\theta|y)\pi(\theta)}{p(y)}\,d\theta$$
$$= \frac{1}{p(y)}\int\frac{\partial\{L(\theta|y)\pi(\theta)\}}{\partial\theta}\,d\theta
= \frac{1}{p(y)}\frac{\partial}{\partial\theta}\int L(\theta|y)\pi(\theta)\,d\theta
= \frac{1}{p(y)}\frac{\partial p(y)}{\partial\theta} = 0.$$
Because of assumption (C1), the interchange of the order of the integral and the derivative is valid. Therefore
$$E_{\theta|y}(\hat\theta-\theta) = n^{-1}J_n^{-1}(\hat\theta)\,E_{\theta|y}\frac{\partial\log\{L(\theta|y)\pi(\theta)\}}{\partial\theta} + O_p(n^{-1}) = O_p(n^{-1}).$$
Together with $\hat\theta-\theta_0 = O_p(n^{-1/2})$ derived from Lemma 5.1, we complete the proof.

Lemma 5.4.
Under the same regularity conditions as in Theorem 3.1,
$$E_{\theta|y}(\theta-\theta_0)(\theta-\theta_0)' = \frac{1}{n}J_n^{-1}(\hat\theta) + \frac{1}{n}J_n^{-1}(\theta_0)I(\theta_0)J_n^{-1}(\theta_0) + o_p(n^{-1}).$$

Proof. $E_{\theta|y}(\theta-\theta_0)(\theta-\theta_0)'$ can be rewritten as
$$E_{\theta|y}(\theta-\hat\theta)(\theta-\hat\theta)' + (\hat\theta-\theta_0)(\hat\theta-\theta_0)' + 2E_{\theta|y}(\theta-\hat\theta)(\hat\theta-\theta_0)'.$$
Applying Lemmas 5.1, 5.2 and 5.3, we complete the proof.

Lemma 5.5.
Under the same regularity conditions as in Theorem 3.1,
$$E_{\theta|y}\frac{1}{n}\log\{L(\theta|y)\pi(\theta)\} \cong \frac{1}{n}\log\{L(\theta_0|y)\pi(\theta_0)\} + \frac{1}{2n}\Big(\mathrm{tr}\{J_n^{-1}(\theta_0)I(\theta_0)\} - \mathrm{tr}\{J_n^{-1}(\hat\theta)J_n(\theta_0)\}\Big) + o_p(n^{-1}).$$

Proof.
The posterior mean of the log joint density of $(y,\theta)$ can be Taylor-expanded around $\theta_0$ as
$$E_{\theta|y}\frac{1}{n}\log\{L(\theta|y)\pi(\theta)\}
= \frac{1}{n}\log\{L(\theta_0|y)\pi(\theta_0)\}
+ E_{\theta|y}(\theta-\theta_0)'\,\frac{1}{n}\frac{\partial\log\{L(\theta|y)\pi(\theta)\}}{\partial\theta}\Big|_{\theta=\theta_0}$$
$$\qquad + \frac{1}{2}E_{\theta|y}(\theta-\theta_0)'\,\frac{1}{n}\frac{\partial^2\log\{L(\theta|y)\pi(\theta)\}}{\partial\theta\,\partial\theta'}\Big|_{\theta=\theta_0}(\theta-\theta_0) + o_p(n^{-1})$$
$$= \frac{1}{n}\log\{L(\theta_0|y)\pi(\theta_0)\}
+ E_{\theta|y}(\theta-\theta_0)'\,\frac{1}{n}\frac{\partial\log\{L(\theta|y)\pi(\theta)\}}{\partial\theta}\Big|_{\theta=\theta_0}
- \frac{1}{2}E_{\theta|y}(\theta-\theta_0)'J_n(\theta_0)(\theta-\theta_0) + o_p(n^{-1}). \quad (8)$$
Expanding $\frac{\partial\log\{L(\theta|y)\pi(\theta)\}}{\partial\theta}\big|_{\theta=\hat\theta}$ around $\theta_0$ to the first order, we obtain
$$\frac{\partial\log\{L(\theta|y)\pi(\theta)\}}{\partial\theta}\Big|_{\theta=\hat\theta}
= \frac{\partial\log\{L(\theta|y)\pi(\theta)\}}{\partial\theta}\Big|_{\theta=\theta_0} - nJ_n(\theta_0)(\hat\theta-\theta_0) + O_p(1). \quad (9)$$
Because the posterior mode $\hat\theta$ is the solution of $\frac{\partial\log\{L(\theta|y)\pi(\theta)\}}{\partial\theta}=0$, equation (9) can be rewritten as
$$\frac{1}{n}\frac{\partial\log\{L(\theta|y)\pi(\theta)\}}{\partial\theta}\Big|_{\theta=\theta_0} = J_n(\theta_0)(\hat\theta-\theta_0) + O_p(n^{-1}).$$
Substituting this into the second term of (8), the expansion of $E_{\theta|y}\frac{1}{n}\log\{L(\theta|y)\pi(\theta)\}$ becomes:
$$E_{\theta|y}\frac{1}{n}\log\{L(\theta|y)\pi(\theta)\}
= \frac{1}{n}\log\{L(\theta_0|y)\pi(\theta_0)\}
+ E_{\theta|y}(\theta-\theta_0)'J_n(\theta_0)(\hat\theta-\theta_0)
- \frac{1}{2}E_{\theta|y}(\theta-\theta_0)'J_n(\theta_0)(\theta-\theta_0) + o_p(n^{-1})$$
$$= \frac{1}{n}\log\{L(\theta_0|y)\pi(\theta_0)\}
+ \mathrm{tr}\big\{E_{\theta|y}[(\hat\theta-\theta_0)(\theta-\theta_0)']J_n(\theta_0)\big\}
- \frac{1}{2}\mathrm{tr}\big\{E_{\theta|y}[(\theta-\theta_0)(\theta-\theta_0)']J_n(\theta_0)\big\} + o_p(n^{-1})$$
$$= \frac{1}{n}\log\{L(\theta_0|y)\pi(\theta_0)\}
+ \mathrm{tr}\big\{E_{\theta|y}[(\theta-\theta_0)(\hat\theta-\theta_0)']J_n(\theta_0)\big\}
- \frac{1}{2n}\mathrm{tr}\big\{[J_n^{-1}(\hat\theta) + J_n^{-1}(\theta_0)I(\theta_0)J_n^{-1}(\theta_0)]J_n(\theta_0)\big\} + o_p(n^{-1}),$$
where in the last line we replace $E_{\theta|y}[(\theta-\theta_0)(\theta-\theta_0)']$ with the result of Lemma 5.4. The quantity $E_{\theta|y}[(\theta-\theta_0)(\hat\theta-\theta_0)']$ in the second term of the expansion can be rewritten as $E_{\theta|y}[(\hat\theta-\theta_0)(\hat\theta-\theta_0)'] + E_{\theta|y}[(\theta-\hat\theta)(\hat\theta-\theta_0)']$, where the former term is asymptotically equal to $\frac{1}{n}J_n^{-1}(\theta_0)I(\theta_0)J_n^{-1}(\theta_0)$ by Lemma 5.1, and the latter is negligible, of higher order $o_p(n^{-1})$, as shown in Lemma 5.3. Therefore, the expansion can finally be simplified as
$$E_{\theta|y}\frac{1}{n}\log\{L(\theta|y)\pi(\theta)\} \cong \frac{1}{n}\log\{L(\theta_0|y)\pi(\theta_0)\} + \frac{1}{2n}\Big(\mathrm{tr}\{J_n^{-1}(\theta_0)I(\theta_0)\} - \mathrm{tr}\{J_n^{-1}(\hat\theta)J_n(\theta_0)\}\Big) + o_p(n^{-1}).$$

References
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Proceedings of the Second International Symposium on Information Theory, ed. B. N. Petrov and F. Csaki, 267-281. Budapest: Akademiai Kiado. Reprinted in Breakthroughs in Statistics, ed. S. Kotz, 610-624. New York: Springer-Verlag (1992).

Ando, T. (2007). Bayesian predictive information criterion for the evaluation of hierarchical Bayesian and empirical Bayes models. Biometrika, 443-458.

Burnham, K. P. & Anderson, D. R. (2002). Model Selection and Multimodel Inference, second edition. New York: Springer-Verlag.

Celeux, G., Forbes, F., Robert, C. P. & Titterington, D. M. (2006). Deviance information criteria for missing data models. Bayesian Analysis, 651-676.

Donohue, M. C., Overholser, R., Xu, R. & Vaida, F. (2011). Conditional Akaike information under generalized linear and proportional hazards mixed models. Biometrika, 685-700.

Efron, B. (1983). Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American Statistical Association, 316-331.

Geisser, S. & Eddy, W. F. (1979). A predictive approach to model selection. J. Am. Statist. Assoc., 153-160.

Gelfand, A. E. & Ghosh, S. K. (1998). Model choice: a minimum posterior predictive loss approach. Biometrika, 1-11.

Gelman, A., Carlin, J. B., Stern, H. S. & Rubin, D. B. (2003). Bayesian Data Analysis, second edition. London: CRC Press.

Gelman, A., Hwang, J. & Vehtari, A. (2014). Understanding predictive information criteria for Bayesian models. Statistics and Computing, 997-1016.

George, E. I. & McCulloch, R. (1993). Variable selection via Gibbs sampling. J. Am. Statist. Assoc., 881-889.

Hurvich, C. & Tsai, C. (1989). Regression and time series model selection in small samples. Biometrika, 297-307.

Konishi, S. & Kitagawa, G. (1996). Generalised information criteria in model selection. Biometrika, 875-890.

Kullback, S. & Leibler, R. A. (1951). On information and sufficiency. Ann. Math. Statist., 79-86.

Laud, P. W. & Ibrahim, J. G. (1995). Predictive model selection. J. R. Statist. Soc. B, 247-262.

Liang, H., Wu, H. & Zou, G. (2009). A note on conditional AIC for linear mixed-effects models. Biometrika, 773-778.

Meng, X. L. & Vaida, F. (2006). Comments on 'Deviance information criteria for missing data models'. Bayesian Analysis, 687-698.

Metropolis, N. & Ulam, S. (1949). The Monte Carlo method. Journal of the American Statistical Association, 335-341.

Nocedal, J. & Wright, S. J. (1999). Numerical Optimization. New York: Springer-Verlag.

Piironen, J. & Vehtari, A. (2017). Comparison of Bayesian predictive methods for model selection. Statistics and Computing, 711-735.

Plummer, M. (2003). JAGS: a program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003), March 20-22, Vienna, Austria.

Plummer, M. (2008). Penalized loss functions for Bayesian model comparison. Biostatistics, 523-539.

Rissanen, J. (1978). Modeling by shortest data description. Automatica, 465-471.

San Martini, A. & Spezzaferri, F. (1984). A predictive model selection criterion. J. R. Statist. Soc. B, 296-303.

Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist., 461-464.

Spiegelhalter, D. J., Thomas, A. & Best, N. G. (1999). WinBUGS Version 1.2 User Manual. MRC Biostatistics Unit.

Spiegelhalter, D. J., Best, N. G., Carlin, B. P. & van der Linde, A. (2002). Bayesian measures of model complexity and fit (with discussion). J. R. Statist. Soc. B, 583-639.

Spiegelhalter, D. J., Best, N. G., Carlin, B. P. & van der Linde, A. (2014). The deviance information criterion: 12 years on. J. R. Statist. Soc. B, 485-493.

Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions (with discussion). J. R. Statist. Soc. B, 111-147.

Takeuchi, K. (1976). Distributions of information statistics and criteria for adequacy of models (in Japanese). Mathematical Science, 15-18.

Vaida, F. & Blanchard, S. (2005). Conditional Akaike information for mixed effects models. Biometrika, 351-370.

Vehtari, A., Gabry, J., Yao, Y. & Gelman, A. (2018). loo: efficient leave-one-out cross-validation and WAIC for Bayesian models. https://CRAN.R-project.org/package=loo, R package version 2.0.0.

Vehtari, A., Gelman, A. & Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 1413-1432.

Vehtari, A. & Lampinen, J. (2002). Bayesian model assessment and comparison using cross-validation predictive densities. Neural Computation, 2439-2468.

Watanabe, S. (2008). A formula of equations of states in singular learning machines. 2098-2105.

Watanabe, S. (2009). Algebraic Geometry and Statistical Learning Theory. Cambridge University Press.

Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. J. Mach. Learn. Res., 3571-3594.

Wherry, R. J. (1931). A new formula for predicting the shrinkage of the coefficient of multiple correlation. Ann. Math. Statist., 440-457.