Notes to Robert et al.: Model criticism informs model choice and model comparison
Oliver Ratmann*,†, Christophe Andrieu‡, Carsten Wiuf§ and Sylvia Richardson¶

*Biology Department, Duke University, Box 90338, Durham, NC 27708, USA; †Statistical and Applied Mathematical Sciences Institute, Research Triangle Park, NC 27709, USA; ‡Department of Mathematics, University of Bristol, Bristol, United Kingdom; §Bioinformatics Research Center, University of Aarhus, Aarhus, Denmark; ¶Centre for Biostatistics, Imperial College London, London, United Kingdom. Email: *[email protected]
In their letter to PNAS and a comprehensive set of notes on arXiv [1, 2], Christian Robert, Kerrie Mengersen and Carla Chen (RMC) represent our approach to model criticism in situations when the likelihood cannot be computed as a way to "contrast several models with each other". In addition, guided by an analysis of scalar error terms on simple examples, RMC argue that model assessment with Approximate Bayesian Computation under model uncertainty (ABCµ) is unduly challenging and question its Bayesian foundations. We thank RMC for their interest and their detailed comments on our work, which give us an opportunity to clarify the construction of ABCµ and to explain further the utility of ABCµ for the purpose of model criticism. Here, we provide a comprehensive set of answers to RMC's comments, which go beyond our short response [3]. For the sake of clarity, we re-state RMC's main points in italics before we answer each of them in turn.

We wish to emphasize that the use of multiple error terms ε_K is a necessary and integral part of ABCµ. In the first section of [4], we introduced ABCµ with the number of error terms set to K = 1 to keep the presentation simple. In retrospect, we hope that this initial simplification did not lead to confusion (although in later sections and in our applications we clearly use multiple error terms).

Introduction and notation.
Approximate Bayesian Computation (ABC) exploits model simulations x of a data-generating process M for sampling from approximate posterior distributions of the model parameters θ [5]. Typically, such predictions form the basis for model criticism [6], and we propose to use the data already generated by Monte Carlo implementations of ABC for this purpose too [4]. In ABCµ, the dual use of the model predictions is reflected in an extension of the state space of the targeted random variables: whenever the simulated summaries S(x) = {S_1(x), ..., S_K(x)}, x ∼ f(· | θ, M), are sufficiently close to the observed summaries S(x_0), we retain not only θ but also the computed discrepancies. The rationale of ABCµ is that small discrepancies between x and the observed data x_0 indicate favorable θ, whereas if these discrepancies are always large, the data-generating process (in short: model) M cannot describe the observed data well. The full potential of ABCµ is realized when we compute multiple discrepancies, one for each summary statistic S_k,

    ρ_k( S_k(x), S_k(x_0) ).

From first principles, we derived in [4] the sampling density of the accepted pairs ( θ, ( ρ_k(S_k(x), S_k(x_0)) )_{k=1}^K ), which we denote by

    f_{ρ,τ}(θ, ε_K | x_0, M) ∝ ξ_{x_0,θ}(ε_K) π_θ(θ | M) π_{ε_K}(ε_K | M).    (1)

We obtained a formula for the "augmented likelihood" ξ_{x_0,θ}(ε_K), which enables us to relate the posterior error density

    f_{ρ,τ}(ε_K | x_0, M) = ∫ f_{ρ,τ}(θ, ε_K | x_0, M) dθ

to the prior predictive error density, a well-known Bayesian quantity that was systematically discussed in a seminal paper [7] by Box (when K = 1 and ρ(S(x), S(x_0)) = x − x_0).
We have

    f_{ρ,τ}(ε_K | x_0, M) ∝ π_{ε_K}(ε_K | M) × L_ρ(ε_K | M),    (2)

where the prior predictive error density is given by

    L_ρ(ε_K | M) = ∫ δ{ ( ρ_k(S_k(x), S_k(x_0)) = ε_k )_{k=1}^K } π(x | M) dx,

and π(x | M) = ∫ f(x | θ, M) π_θ(θ | M) dθ denotes the prior predictive (data) density. The shorthand δ notation represents a limit of functions as detailed in Section S1.1 of the PNAS Supplementary Material [4]. The density π_{ε_K} is fully determined by the ABC kernel in the likelihood approximation,

    f_{ρ,τ}(θ | x_0, M) = ∫ f_{ρ,τ}(θ, ε_K | x_0, M) dε_K ∝ π_θ(θ | M) ∫ π_{ε_K}( ( ρ_k(S_k(x), S_k(x_0)) )_{k=1}^K | M ) f(x | θ, M) dx,    (3)

and can be interpreted as a prior density [8]. The relationship Eq. 2 enables us to attach a statistical interpretation to our posterior errors and to relate them to other Bayesian quantities.

Standard Assumptions in ABC and ABCµ. We assume that (A1) π_{ε_K} factorizes into ∏_{k=1}^K π_{ε_k}, is centered at zero and only depends on a multi-dimensional scale parameter τ = (τ_1, ..., τ_K). The main reason behind (A1) is that otherwise, the same aspects of the data might be used to adjust the ABC kernel (or "prior" density) π_{ε_K} as well as the magnitude of the errors ρ_k(S_k(x), S_k(x_0)), and hence (potentially) more than once to inform our quantities of interest f_{ρ,τ}(θ | x_0, M) and f_{ρ,τ}(ε_K | x_0, M). Furthermore, ABCµ might falsely suggest rejecting the hypothesis that a model is an adequate representation of the data if π_{ε_K} is not centered at zero. Typical choices are the uniform kernel π_{ε_k}(ε_k | M) = 1/τ_k 1{ |ε_k| ≤ τ_k/2 }, the Gaussian kernel π_{ε_k}(ε_k | M) = (2πτ_k)^{−1/2} exp( −ε_k²/(2τ_k) ), or the Laplace kernel π_{ε_k}(ε_k | M) = 1/(2τ_k) exp( −|ε_k|/τ_k ).
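As a quick sanity check of (A1), the three kernels can be written out directly; a minimal sketch in Python (NumPy), where the exact scaling conventions (τ_k acting as width, variance and scale, respectively) are our reading of the formulas above:

```python
import numpy as np

def uniform_kernel(eps, tau):
    # 1/tau on |eps| <= tau/2, zero outside
    return np.where(np.abs(eps) <= tau / 2, 1.0 / tau, 0.0)

def gaussian_kernel(eps, tau):
    # Normal(0, tau): variance tau, centered at zero
    return np.exp(-0.5 * eps ** 2 / tau) / np.sqrt(2 * np.pi * tau)

def laplace_kernel(eps, tau):
    # Laplace(0, tau): scale tau, centered at zero
    return np.exp(-np.abs(eps) / tau) / (2 * tau)

# each kernel is a proper density centered at zero, as (A1) requires
grid = np.linspace(-60.0, 60.0, 240001)
step = grid[1] - grid[0]
masses = {f.__name__: f(grid, 2.0).sum() * step
          for f in (uniform_kernel, gaussian_kernel, laplace_kernel)}
```

All three integrate to one and differ only in how sharply they down-weight large errors, which is what makes the choice of kernel part of the approximation scheme.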
We emphasize that in ABC and ABCµ, (A2) the scale parameter τ of the prior π_{ε_K} is in general chosen as small as possible. Otherwise, if all model simulations are "acceptable", we have that

    f_{ρ,τ}(θ | x_0, M) ∝ π_θ(θ | M) ∫ π_{ε_K}( ( ρ_k(S_k(x), S_k(x_0)) )_{k=1}^K | M ) f(x | θ, M) dx = π_θ(θ | M) ∫ const × f(x | θ, M) dx = π_θ(θ | M).

Furthermore, (A3) the compound function x → ρ(S(x), S(x_0)) must be sensitive to changes in θ. Otherwise, we obtain

    f_{ρ,τ}(θ | x_0, M) ∝ π_θ(θ | M) ∫ π_{ε_K}(const) f(x | θ, M) dx = π_θ(θ | M).

The idea is to construct useful discrepancies which reflect changes in the simulated data as θ changes. (A4) As in ABC, we require that ρ_k(S_k(x), S_k(x_0)) = 0 if and only if S_k(x) = S_k(x_0). In contrast to most implementations of ABC, these discrepancies should be real-valued rather than non-negative. For example, in the case of scalar summaries, we use ρ_k(S_k(x), S_k(x_0)) = S_k(x) − S_k(x_0) instead of ρ_k(S_k(x), S_k(x_0)) = |S_k(x) − S_k(x_0)| [4]. We seek to construct (A5) roughly symmetric predictive error densities L_ρ(ε_K | M) with mode at zero under the null hypothesis that the prior model is an adequate representation of the data. Otherwise, small negative errors (−τ_k ≤ ε_k ≤ 0) may be significantly more (or less) frequent than small positive errors (0 ≤ ε_k ≤ τ_k) under the null, and conditioning on error magnitude could result in a large negative (or positive) posterior mean error even if the prior model is correct. Finally, we assume (A6) that the cumulative distribution function
    P_{θ,x_0}( ε_1 ∈ E_1, ..., ε_K ∈ E_K ) = ∫ 1{ ( ρ_k(S_k(x), S_k(x_0)) ∈ E_k )_{k=1}^K } f(x | θ, M) dx

is either continuously differentiable when the observation space X is continuous, or a step function when X is finite. In this case, ξ_{x_0,θ}(ε_K) can be re-written in terms of its elementary derivative. Next, in order to derive Eqns. 2-3, we also assume that the data-generating process M given by f(· | θ, M) is sufficiently regular to exchange the order of integration and limits; recall Section S1.1 of the PNAS Supplementary Material [4].

Construction of ABCµ

1. RMC point out that "the denomination [of ξ_{x_0,θ}(ε_K) as a] likelihood is debatable" [2] and that "the product ξ_{x_0,θ}(ε_K) π_{ε_K}(ε_K) is probabilistically incoherent" [1]. This conclusion derives from at least two observations: (i) "ξ_{x_0,θ} is strictly speaking not proportional to a density in x" [2] and (ii) "ξ_{x_0,θ}(ε_K) π_{ε_K}(ε_K) is not invariant under reparameterization" [2].

- In ABC, the observed data is reduced to a set of summary statistics and compared to simulated summaries with a positive, scalar-valued discrepancy function ρ(S(x), S(x_0)). For the purpose of parameter inference, we only need to plug ε = ρ(S(x), S(x_0)) into the ABC kernel. In other words, the scalar, positive error ε is in ABC merely a latent random variable, introduced to facilitate Bayesian computation [9, 10]. In [4], we derive the sampling distribution of the random variable ε, and recognize the utility of the related multiple error terms ε_K, each associated to one summary, for the purpose of model criticism. To us, ε_K is a random variable of particular statistical interest and no longer a latent variable introduced for computational reasons.
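The operational difference is small but important, and can be sketched in a few lines: plain rejection ABC keeps only the accepted θ, while ABCµ also records the signed, per-summary discrepancies of the accepted simulations. The toy Gaussian model, prior range and tolerance below are our own illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
x_obs = rng.normal(3.0, 1.0, size=100)              # toy "observed" data
s_obs = np.array([x_obs.mean(), np.median(x_obs)])  # K = 2 observed summaries

tau = 0.4                                           # kernel width (A2: small)
accepted_theta, accepted_eps = [], []
for _ in range(20000):
    theta = rng.uniform(0.0, 6.0)                   # draw theta from the prior
    x = rng.normal(theta, 1.0, size=100)            # simulate x ~ f(.|theta, M)
    eps = np.array([x.mean(), np.median(x)]) - s_obs  # signed discrepancies (A4)
    if np.all(np.abs(eps) <= tau / 2):              # uniform ABC kernel, K terms
        accepted_theta.append(theta)                # plain ABC retains theta ...
        accepted_eps.append(eps)                    # ... ABCmu retains eps_K too

accepted_theta = np.array(accepted_theta)
accepted_eps = np.array(accepted_eps)   # draws from the joint target, Eq. 1
```

Recording `accepted_eps` costs nothing beyond what the ABC run already computes, which is the sense in which ABCµ re-uses the Monte Carlo output for model criticism.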
Intuitively, we shift the observed summaries by ε_K and propose to infer whether summaries of x_0 that are shifted away from zero would occur at a higher frequency, and hence be more probable, than the (unshifted) observed summaries. Formally, we define and identify the probability density

    ε_K → ξ_{θ,x_0}(ε_K) = lim_{h→0} ∫ δ_h( ( ρ_k(S_k(x), S_k(x_0)) − ε_k )_{k=1}^K ) f(x | θ, M) dx,    (4)

where the δ_h function is given in Section S1.1 of the PNAS Supplementary Material [4]. For any given x_0 and θ, ξ_{θ,x_0}(ε_K) is the infinitesimal frequency with which we observe the multi-dimensional error ε_K. As RMC remark insightfully, it can be called a predictive error density that conditions on the observed data and the model parameter θ. In [4], we termed

    θ, ε_K → f_{ρ,τ}(x_0 | θ, ε_K) = ξ_{θ,x_0}(ε_K)    (5)

an "augmented likelihood" simply to indicate that the state space was extended.

Example 1.
Suppose we observe a single, one-dimensional data point x_0, and let us believe it is Poisson distributed with rate θ (denoted by M). Consider the scalar error ε = x − x_0. By construction, we have

    ξ_{θ,x_0}(ε) = lim_{h→0} ∫ δ_h( x − x_0 − ε ) Poisson(x; θ) dx = θ^{x_0+ε} e^{−θ} / (x_0+ε)! × 1{ x_0 + ε ∈ [0, ∞) },

and the right hand side equals in ε a Poisson distribution shifted by −x_0, and in x_0 a Poisson distribution shifted by −ε. Thus, when interpreted as a function in x_0, ξ_{θ,x_0}(ε) is also defined for negative values.

RMC's illuminating Poisson example serves to demonstrate how ξ_{θ,x_0}(ε) differs from a "likelihood". However, RMC go beyond our construction Eq. 4 and truncate x_0 → ξ_{θ,x_0}(ε) to positive values so as to re-adjust ξ_{θ,x_0}(ε) to the likelihood f(x_0 | θ, M) that is only defined for positive x_0 [1, 2]. To be clear, this re-adjustment is not part of ABCµ.

Eq. 4 corresponds to a non-parametric evaluation of the sampling model in the context of model uncertainty. We adhere to the sampling model in that data is simulated under the likelihood, x ∼ f(· | θ, M), and probe the model predictions in several directions at the same time. If ε_K = 0, we have with (A4) that ξ_{θ,x_0}(ε_K) corresponds to the probability of the observed summaries under θ. For error terms different from zero, we quantify the probability of deviations from the observed summaries under the sampling model. Labeling f_{ρ,τ}(x_0 | θ, ε_K), Eq. 5, a "shifted likelihood" therefore seems more appropriate. Because we only shift the observed summaries in Eq. 4 (with no further re-adjustments towards the original likelihood as in [1, 2]), the re-normalization required when considering Eq. 5 as a function in x_0 does not depend on ε_K, and f_{ρ,τ}(x_0 | θ, ε_K) is proportional to a density in x_0.
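The shifted-Poisson identity is easy to verify by simulation; a small Python check (our own, with illustrative values θ = 4 and x_0 = 2) that the Monte Carlo frequency of each error value matches the shifted Poisson probability mass:

```python
import numpy as np
from math import exp, factorial

theta, x0 = 4.0, 2                     # illustrative rate and data point
rng = np.random.default_rng(1)
eps_samples = rng.poisson(theta, size=200000) - x0   # eps = x - x0

def xi(eps):
    # shifted Poisson pmf: theta^(x0+eps) e^(-theta) / (x0+eps)!
    k = x0 + eps
    return exp(-theta) * theta ** k / factorial(k) if k >= 0 else 0.0

# empirical frequency of each error value under the sampling model
mc_freq = {e: float(np.mean(eps_samples == e)) for e in (-2, -1, 0, 1, 5)}
```

Note that errors below −x_0 never occur, which is exactly the indicator 1{x_0 + ε ≥ 0} above; no truncation in x_0 is applied anywhere.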
Next, let us recall that our error terms ε_k correspond directly to the compound functions x → ρ_k(S_k(x), S_k(x_0)). Therefore, in ABCµ, a transformation of ε_k implies a change in how the data is being summarized. Typically, such a change requires modifying the scale parameter τ of the prior density π_{ε_K} when the scale of the discrepancies changes too. Therefore, transformations of the product ξ_{x_0,θ}(ε_K) π_{ε_K}(ε_K) must also change τ in π_{ε_K} when the Jacobian is not constant.

In the ABC literature, it is well known that the approximate posterior density f_{ρ,τ}(θ | x_0, M) depends on the choice of discrepancies and the stringency of τ [5, 11, 12]. Since ABCµ only uses the information provided in ABC to a fuller extent, the joint posterior density f_{ρ,τ}(θ, ε_K | x_0, M) is equally sensitive to changes in the compound functions x → ρ_k(S_k(x), S_k(x_0)) and the vector τ. In other words, the ABC and ABCµ target densities f_{ρ,τ}(θ | x_0, M) and f_{ρ,τ}(θ, ε_K | x_0, M) are not invariant under different approximation schemes. This leaves ABCµ probabilistically sound, but warrants particular caution and calls for sensitivity analyses, perhaps to a larger extent than is common practice.

Model assessment

2. Model assessment with ABCµ requires that "the data is informative" and "is challenging" [1, 2]. In the location-family example [2] it is shown that the posterior error equals the prior error if the prior predictive density is flat.

- We agree with RMC that ABCµ cannot criticize a model when the observed data x_0 reduces to a single, one-dimensional data point, as in their examples [2] on pages 1-2. More generally, we showed that the posterior error can be interpreted as a weighted prior predictive error, Eq. 2 [4]. If the prior predictive error is uninformative, then the posterior error will simply reflect the weighting.

Example 2.
Suppose a Gaussian likelihood model M with unknown mean θ and fixed unit variance, and a Gaussian prior density π_θ(θ | M) = N(θ; θ*, h). We have

    π(x_0 | M) = ∫ N(x_0; θ, 1) N(θ; θ*, h) dθ = N(x_0; θ*, h + 1),

and L_ρ(ε | M) = N(ε; θ* − x_0, h + 1). We mimic a situation where π_θ is uniform by letting h → ∞, so that π(x_0 | M) and L_ρ(ε | M) become improper. Suppose further a Gaussian error density π_ε(ε | M) = N(ε; 0, τ). Then, f_{ρ,τ}(ε | x_0, M) = N(ε; 0, τ).

Likewise, when models have comparable parameter spaces, the Bayes factor will be indecisive under non-informative priors π_θ [13]. Consider the alternative Gaussian model M′ defined by f(x | θ, M′) = N(x; θ, 3), the same prior density π_θ, and let us focus on the approximate Bayes factor

    B_{ρ,τ} = f_{ρ,τ}(x_0 | M′) / f_{ρ,τ}(x_0 | M) = ( ∫ f_{ρ,τ}(x_0 | θ, M′) π_θ(θ | M′) dθ ) / ( ∫ f_{ρ,τ}(x_0 | θ, M) π_θ(θ | M) dθ )

to mimic the situation that we cannot readily evaluate the likelihood. We obtain

    B_{ρ,τ} = sqrt( (τ + h + 1) / (τ + h + 3) ) × exp( (x_0 − θ*)² / ((τ + h + 1)(τ + h + 3)) ),

which tends rapidly to one as h → ∞.

In this setting, both our posterior error and the approximate Bayes factor give reasonable answers for the purposes of model criticism and model comparison, respectively. Based on one data point, we cannot reject the current model M and, likewise, the approximate Bayes factor for choosing among M and a comparable model is indecisive.

Clearly, there is no guarantee that ABCµ always uncovers existing model mismatch. But how difficult is it to uncover existing discrepancies with ABCµ in practice? Typically, x_0 contains some structure and/or repeated observations.
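The limiting behaviour of the approximate Bayes factor B_{ρ,τ} in Example 2 can be checked numerically. A short Python sketch, assuming unit variance for M and variance 3 for M′ (consistent with the (τ + h + 1) and (τ + h + 3) terms), with θ* = 0, x_0 = 1.5 and τ = 0.5 as purely illustrative values:

```python
import numpy as np

def approx_bayes_factor(x0, theta_star, tau, h):
    # closed-form B for two Gaussian location models with variances 1 and 3,
    # Gaussian ABC kernel of variance tau, Gaussian prior N(theta_star, h)
    v1, v2 = tau + h + 1.0, tau + h + 3.0
    return np.sqrt(v1 / v2) * np.exp((x0 - theta_star) ** 2 / (v1 * v2))

x0, theta_star, tau = 1.5, 0.0, 0.5
b_informative = approx_bayes_factor(x0, theta_star, tau, h=1.0)
b_flat = approx_bayes_factor(x0, theta_star, tau, h=1e8)   # mimics h -> infinity
```

With an informative prior the factor is moderately different from one, while for very large h it collapses to one: a single data point and a flat prior leave the model comparison indecisive.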
Instead of using just one data point as in Example 2, let us imagine a data set of 100 samples and summarize this data with two statistics, leading to a two-dimensional posterior error density.

Example 3.
Consider a data set x_0 of 100 independent samples that are exponentially distributed with rate 1/µ_t = 0.2. We believe that each sample is generated from a Gaussian with unknown mean θ and fixed variance, and consider a Gaussian prior density π_θ(θ | M) = N(θ; θ*, h). We summarize the data with the sample mean and the sample median, use the discrepancies ρ_k(S_k(x), S_k(x_0)) = S_k(x) − S_k(x_0), and consider the prior density π_{ε_K}(ε_K | M) = ∏_k 1/(2τ_k) exp( −|ε_k|/τ_k ) with small τ_k.

To illustrate that ABCµ may reveal inappropriate prior specifications, we set θ* = 0 and a small prior variance h. We applied the Metropolis-Hastings sampler proposed by Marjoram et al. [14] and recorded the computed discrepancies to estimate our posterior error (mcmcABCµ; see below). A more detailed discussion of various algorithms to sample from the ABCµ target density Eq. 1 will appear elsewhere. Figures 1A-B show that the marginal densities f_{ρ,τ}(ε_k | x_0, M) are far from zero, suggesting that our strong prior beliefs are inadequate to explain the data.

To illustrate that ABCµ may identify a faulty sampling model, we set θ* = 5, h = 100000. Again, we estimated the ABCµ target density numerically with mcmcABCµ. Even though π_θ is essentially flat, our marginal posterior errors do not center at zero; see Figures 1C-D.

In [4], we investigated primarily the marginal posterior densities f_{ρ,τ}(ε_k | x_0, M) = ∫ f_{ρ,τ}(ε_K | x_0, M) dε_{−k}. Here, we also show heat plots which reflect more comprehensively the multi-dimensional character of our error density f_{ρ,τ}(ε_mean, ε_median | x_0, M) in Figures 2A-B.

Intuitively, ABCµ will indicate model mismatch whenever all discrepancies are simultaneously not close to zero for any θ. To escape unidentifiability, the crux in Example 3 is to use multiple error terms associated to co-dependent summaries that may reveal model inconsistencies; see also [4] for a similar example.
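A rejection-flavoured stand-in for this experiment can be sketched as follows. This is our own illustrative reproduction, not the paper's sampler: we weight prior draws by the Laplace kernel instead of running MCMC, assume variance 25 for the Gaussian sampling model (matching the data variance), and use more observed samples (1,000) than in the example so the signal is unambiguous:

```python
import numpy as np

rng = np.random.default_rng(2)
n_obs = 1000
# observed data: exponential with rate 0.2 (mean 5, median 5*ln 2, about 3.47)
x_obs = rng.exponential(scale=5.0, size=n_obs)
s_obs = np.array([x_obs.mean(), np.median(x_obs)])

n_sim, tau = 10000, 0.5
theta = rng.uniform(0.0, 10.0, size=n_sim)          # essentially flat prior
x = rng.normal(theta[:, None], 5.0, size=(n_sim, n_obs))  # variance 25: assumption
eps = np.stack([x.mean(axis=1) - s_obs[0],
                np.median(x, axis=1) - s_obs[1]], axis=1)

# Laplace ABC kernel as an importance weight on both error terms jointly
w = np.exp(-np.abs(eps).sum(axis=1) / tau)
post_mean_err = (w[:, None] * eps).sum(axis=0) / w.sum()
# a Gaussian cannot match the mean and the median of exponential data at the
# same time, so the posterior mean errors sit away from zero, opposite in sign
```

No single θ can drive both discrepancies to zero, so the joint posterior error concentrates off the origin: the mean error comes out negative and the median error positive.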
[Figure 1: Numerical reconstructions of the densities f_{ρ,τ}(ε_mean | x_0, M) and f_{ρ,τ}(ε_median | x_0, M) in Example 3, obtained with samples generated by mcmcABCµ (four chains each). The (A-B) posterior error under inappropriate prior specifications π_θ and the (C-D) posterior error under essentially flat prior specifications π_θ suggest model mismatch.]

[Figure 2: Heat plots of the density f_{ρ,τ}(ε_mean, ε_median | x_0, M) in Example 3, obtained with samples generated by mcmcABCµ, (A) under inappropriate prior specifications π_θ and (B) under essentially flat prior specifications π_θ.]

In real-world applications, (most) summary statistics are usually co-dependent, rendering ABCµ a potentially very powerful method to reveal model inconsistencies. Because model inconsistencies can only increase with the inclusion of new summary statistics, we are typically prepared to use a large set of summaries. Moreover, it is not required that these co-dependent summaries are sufficient for θ under the data-generating process M, as we illustrate in [3], Figure 1.
A more detailed discussion will appear elsewhere; here we only note that these properties are appealing because, in real-world applications of ABC and ABCµ, it is typically not known whether any set of summaries is sufficient for the parameters of a given model, while it is relatively easy to come up with co-dependent summaries.

However, the extent to which f_{ρ,τ}(ε_K | x_0, M) merely reflects the weighting π_{ε_K} should be checked, because the discrepancies might not retain enough information on the data to question a model (recall A3). The perhaps simplest (but not necessarily successful) approach is to compare the posterior error density to the shape of π_{ε_K}. Reassuringly, in our real-world applications, f_{ρ,τ}(ε_K | x_0, M) differs markedly from π_{ε_K}; see e.g. Figure 3 in the PNAS paper, where the prior error density is indicated in dotted lines. Crucially, since ABCµ does not reject a model when f_{ρ,τ}(ε_K | x_0, M) is close to π_{ε_K} (recall A1), no harm is done should the discrepancies not be informative.

The power of ABCµ in assessing goodness-of-fit stems, firstly, from probing a model in multiple directions at the same time. We hope that our simple examples illuminate the contribution of model inconsistencies, as reflected in multiple error terms, to model assessment. Secondly, ABCµ makes it possible to criticize a model whose likelihood cannot be readily evaluated, and does not incur any extra computational cost compared to ABC (in contrast to related predictive approaches discussed in point 4 below).

Model criticism

3. ABCµ "is strongly impacted by prior modeling [and] fails to condition on the observed data" [1].

- The ABC kernel, which can be interpreted as a prior density π_{ε_K} [8], is at the heart of ABC (recall A1) and modulates the degree to which ABC and ABCµ condition on the observed data. In other words, the ABC and ABCµ target densities are sensitive to the choice of π_{ε_K} and particularly its scale parameter τ.
Indeed, ABCµ conditions on the observed data by accepting θ in relation to the magnitudes of the K computed discrepancies taken together. Accordingly, under (A2, A3), the posterior error density f_{ρ,τ}(ε_K | x_0, M) updates the prior predictive error density L_ρ(ε_K | M); see Example 4 below. Based on the observation that small errors boost the weight of the associated values of θ that are simulated from π_θ, we say that "ABCµ criticizes a fitted model". This can be illustrated with the location family of Example 2.

Example 4.
Consider again the Gaussian likelihood model M and a Gaussian prior density π_θ as in Example 2. For illustration purposes, let us choose π_θ broad but not flat: h = 9. In this case,

    f_{ρ,τ}(ε | x_0, M) ∝ N(ε; θ* − x_0, 10) N(ε; 0, τ) = N(ε; θ̃, σ̃²),

where θ̃ = [τ/(τ + 10)] × (θ* − x_0) and σ̃² = 10τ/(τ + 10) ≤ min(τ, 10). The posterior error density "updates" the prior predictive error in that the variance of f_{ρ,τ}(ε | x_0, M) is smaller than that of L_ρ(ε | M). Furthermore, we observe that |θ̃| is smaller than the absolute mean of L_ρ(ε | M), reflecting the fact that f_{ρ,τ}(ε | x_0, M) criticizes a fitted model rather than the prior model (M, π_θ).

For the purpose of model criticism, it is important to recognize that the dependency of our posterior error on π_{ε_K} is a good thing to the extent to which the prior π_θ is not an adequate model parameterization. The smaller τ can be chosen, the more we are able to criticize a fitted model and the more we attenuate the influence of π_θ in f_{ρ,τ}(ε_K | x_0, M). The latter point can also be illustrated with the location family for various choices of π_θ and π_{ε_K} when the summaries are co-dependent but not sufficient for θ under model M.

[Figure 3: Heat plots of our posterior error density f_{ρ,τ}(ε_mean, ε_median | x_0, M) in Example 5 for broad π_θ, (A) τ_0 = 1000, (B) α_0 = 2 and β_0 = τ_0 = 1000, and when τ is set too large, (C) α_0 = 2 and β_0 = τ_0 = 1000 and τ_k = 6.]

Example 5.
Consider again the data set x_0 of independent samples that are exponentially distributed with rate 0.2, suppose now that each sample is generated according to a Gaussian likelihood model with unknown mean µ ∈ ℝ and variance σ² ≥ 0 (denoted by M), and summarize the data with the sample mean and median. We consider ρ_k(S_k(x), S_k(x_0)) = S_k(x) − S_k(x_0), π_{ε_K}(ε_K | M) = ∏_{k=1}^K 1/τ_k 1{ |ε_k| ≤ τ_k/2 }, and the prior density π_θ(θ | M) = π(µ | M) π(σ² | M), where

    π(σ² | M) = IG(σ²; α_0, β_0),    π(µ | M) ∝ 1{ |µ − µ_0| ≤ τ_0 }.

In [3], Figure 1, we chose a slightly different prior density π_θ with hyperparameters µ_0 = 5, τ_0 = 10, α_0 = 4 and β_0 = 75, such that the prior means of µ and σ² are 5 and β_0/(α_0 − 1) = 25, matching the empirical mean and variance of the observed data.

To illustrate that ABCµ uncovers existing model mismatch with co-dependent summaries that are not sufficient for θ under M, even when π_θ differs markedly from the data or is uninformative, we now vary these hyperparameters. First, let us set τ_0 = 1000. We ran mcmcABCµ (see below) for 10,
000 iterations to sample from f_{ρ,τ}(µ, σ², ε_mean, ε_median | x_0, M) with small τ_k, and repeated this run four times from overdispersed starting values to assess the convergence of the chains. Samples from the burn-in period were discarded. Figure 3A illustrates that our joint posterior error density remains virtually unchanged (compare to Figure 1C in [3]). Next, we set α_0 = 2 and β_0 = τ_0 = 1000 and ran mcmcABCµ as above. Even though π_θ is now extremely broad, our joint posterior error density continues to identify model mismatch; see Figure 3B.

ABC depends on the error threshold τ, and so does ABCµ. In order to identify model mismatch with f_{ρ,τ}(ε_K | x_0, M), existing conflicts among several summary statistics are uncovered by setting τ sufficiently small. For example, when we set τ_k to 6, such that the acceptance probability of mcmcABCµ is larger than 80%, the posterior error is very broad and does not suggest model mismatch; see Figure 3C.

In summary, the ability of ABCµ to criticize a fitted model is strongly modulated by the choice of discrepancies and the error threshold τ. Probing a model under particular assumptions on π_θ in the directions specified by ε_K is not guaranteed to uncover existing model mismatch. For example, using ABCµ with the sample mean and the standard deviation (two independent summaries) in place of the sample mean and the median in Example 5 fails to uncover existing model mismatch. Similarly, using the sample mean and the 25% quantile fails to reveal model inconsistencies as clearly as the sample mean and the sample median.
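This sensitivity to the choice of summaries is easy to demonstrate. A crude grid search (our own sketch, with illustrative grid and sample sizes) over Gaussian (µ, σ) asks whether any parameter value makes all discrepancies small at once:

```python
import numpy as np

rng = np.random.default_rng(3)
x_obs = rng.exponential(scale=5.0, size=1000)   # exponential data, rate 0.2

def min_total_discrepancy(summaries, n_draw=20000):
    # smallest achievable sum of |errors| over a grid of Gaussian (mu, sigma):
    # a crude stand-in for "is there any theta with all eps_k close to zero?"
    s_obs = [f(x_obs) for f in summaries]
    best = np.inf
    for mu in np.linspace(0.0, 10.0, 41):
        for sigma in np.linspace(1.0, 10.0, 37):
            x = rng.normal(mu, sigma, size=n_draw)
            err = sum(abs(f(x) - s) for f, s in zip(summaries, s_obs))
            best = min(best, err)
    return best

mean_ = lambda x: float(np.mean(x))
median_ = lambda x: float(np.median(x))
sd_ = lambda x: float(np.std(x))

# mean + sd: a Gaussian can match both at once, so no conflict is visible
gap_mean_sd = min_total_discrepancy([mean_, sd_])
# mean + median: these coincide for any Gaussian, but differ by roughly
# mean*(1 - ln 2), about 1.5 here, for the exponential data
gap_mean_median = min_total_discrepancy([mean_, median_])
```

The residual conflict for the (mean, median) pair persists over the whole parameter grid, whereas the (mean, sd) pair can be driven close to zero, mirroring the observation above that the choice of co-dependent summaries determines whether mismatch is visible.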
In principle, the contribution of "the data" to f_{ρ,τ}(ε_K | x_0, M) can only increase with larger K and/or more stringent choices of τ, and cannot be quantified by considering flat π_{ε_K} (see also A2).

In the directions determined by ε_K and τ, the criticized model comprises the sampling model M and our prior assumptions π_θ, and we acknowledge that "having no way to distinguish between prior and sampling model inadequacy is a difficulty" [2]. More work is needed here.

4. From an ABC perspective, using the posterior predictive m(x | x_0, M) instead of the prior predictive π(x | M) "requires same computing times" [2].

- In the context of ABC, when the likelihood cannot be readily evaluated, the use of the posterior predictive (data) density

    m(x | x_0, M) = ∫ f(x | θ, M) f(θ | x_0, M) dθ

is complicated by the fact that samples from the true posterior density f(θ | x_0, M) are in general not available. However, m(x | x_0, M) can be approximated by

    m_{ρ,τ}(x | x_0, M) = ∫ f(x | θ, M) f_{ρ,τ}(θ | x_0, M) dθ.

An alternative approach for model criticism could be the approximate posterior predictive (APP) error density

    L_{ρ,τ,x_0}(ε_K | M) = ∫ δ{ ( ρ_k(S_k(x), S_k(x_0)) = ε_k )_{k=1}^K } m_{ρ,τ}(x | x_0, M) dx.

However, for a complex model the extra volatility induced by simulating from f(· | θ, M) means that re-simulations from f_{ρ,τ}(θ | x_0, M) need not meet the stringency requirement π_{ε_K}. It might therefore also be useful to consider the weighted approximate posterior predictive (wAPP) error density

    f_{ρ,τ,x_0}(ε_K | x_0, M) ∝ L_{ρ,τ,x_0}(ε_K | M) π_{ε_K}(ε_K | M).

Both L_{ρ,τ,x_0}(ε_K | M) and f_{ρ,τ,x_0}(ε_K | x_0, M) adopt a sequential approach to model criticism, comprising a training step (inference of f_{ρ,τ}(θ | x_0, M)) and a testing step (APP or wAPP). The testing step adds a computational overhead to typical ABC procedures.
For example, it takes about two minutes to evaluate our seven summaries on the Saccharomyces cerevisiae PPI data set [15], so the re-simulations required in the testing step alone add substantial extra computation; it is unclear whether L_{ρ,τ,x_0}(ε_K | M) and f_{ρ,τ,x_0}(ε_K | x_0, M) would have an intrinsic advantage compared to our posterior error density f_{ρ,τ}(ε_K | x_0, M).

In general, it is difficult to compare the behavior of our posterior error with APP and wAPP under model uncertainty. First, we note that L_{ρ,τ,x_0}(ε_K | M) and our f_{ρ,τ}(ε_K | x_0, M) are very different quantities, relating respectively to sequential and simultaneous approaches to model criticism. This is also reflected in their distinct asymptotic properties as τ →
0. Second, Example 7 in the Appendix demonstrates that, counter-intuitively, f_{ρ,τ}(θ | x_0, M) may be broader than π_θ under model uncertainty.

An additional complication to be considered with APP and wAPP is that the same aspects of the data are used to inform both the training and the testing phase. Hence, these quantities violate the fundamental requirement in statistical learning that the training data be independent from the testing data [16]. It is possible to use different aspects of the data during both stages, and this brings us back to the partially predictive and conditionally predictive densities previously discussed by Bayarri and Berger [17]. Unfortunately, in real-world applications of ABC, it is often difficult to identify discrepancies that are independent of each other.

5. Our estimator ξ̂ of the augmented likelihood that is based on B repeat samples "cannot be used as a practical device because B is necessarily small" [1] "... in which case the non-parametric approximation is poor, or B is large in which case producing the x's is too time-consuming" [2].

- In our applications, we found that we obtained the largest improvements in terms of the effective sample size for small to moderate values of B that are computationally feasible. Let us also recall that the choice of proposal kernel in ε_K is a crucial element of the second algorithm in [4] and should not be omitted when considering its efficiency. We acknowledge that our observations may not readily extend to other applications.

In the same way that we augmented standard ABC to what we call Std-ABCµ in [4], it is straightforward to modify any existing ABC algorithm for the purpose of model criticism by (i) using many co-dependent, real-valued discrepancies and (ii) recording those discrepancies. For example, the Metropolis-Hastings sampler proposed by Marjoram et al.
[14] can be adapted to provide samples from the target distribution

    f_{ρ,τ}(dθ, dx, dε_K | x_0, M) = [ π_θ(θ | M) π_{ε_K}(ε_K | M) f(x | θ, M) / f_{ρ,τ}(x_0 | M) ] ( δ_{ρ_K(x)}(dε_K) dx dθ ),

where we put ρ_K(x) = ( ρ_k(S_k(x), S_k(x_0)) )_{k=1}^K for brevity. Here, δ_{ρ_K(x)}(dε_K) denotes the Dirac measure at the point ρ_K(x). Suppose initial values θ, x ∼ f(· | θ, M), and set ε_k = ρ_k(S_k(x), S_k(x_0)).

mcmcABCµ-1. If now at θ, propose a move to θ′ according to a proposal density q(θ → θ′).

mcmcABCµ-2. Generate x′ ∼ f(· | θ′, M) and compute ε′_k = ρ_k(S_k(x′), S_k(x_0)) for k = 1, ..., K.

mcmcABCµ-3. Accept (θ′, x′, ε′_K) with probability

    mh(θ, x, ε_K; θ′, x′, ε′_K) = min{ 1, r_vanilla(θ, x, ε_K; θ′, x′, ε′_K) },
    r_vanilla(θ, x, ε_K; θ′, x′, ε′_K) = [ π_θ(θ′ | M) q(θ′ → θ) π_{ε_K}(ε′_K | M) ] / [ π_θ(θ | M) q(θ → θ′) π_{ε_K}(ε_K | M) ],

and otherwise stay at (θ, x, ε_K). Then return to mcmcABCµ-1.

Marginally in (θ, ε_K), mcmcABCµ provides samples from f_{ρ,τ}(θ, ε_K | x_0, M), Eq. 1, for suitable proposal kernels q(θ → θ′) under our regularity assumptions (A6).

Model criticism and model comparison

6. "The Bayesian foundations of ABCµ are questionable: the consequences of rejecting a model are ignored by ABCµ but include constructing another model" [1] and "this leads to wonder about the gain compared with using the Bayes factor" [2]. Moreover, "the estimation of Bayes' factors is even faster" [1] and "provides a different answer" [2].

- To us, model criticism and model comparison are important and complementary aspects of statistical reasoning.
Indeed, methods for model comparison attempt to choose between candidate models, even if all of them fail to match the data well in one or several aspects.

ABC is very flexible in that arbitrary data-generating processes M can be analyzed without the need to compute the likelihood, so long as the evaluation of the summary statistics is computationally tractable. ABC$_\mu$ makes it possible to evaluate, at no extra computational cost, whether a model matches the observed data in terms of a large set of summary statistics, and to obtain useful indications of how a faulty model should be modified. In our work, we found that ABC$_\mu$ thus enables us to iterate rapidly through the initial stages of model design to identify one or a suite of models which are in agreement with the data, even when the likelihood cannot be readily evaluated. We believe that the ability of ABC$_\mu$ to offer statistical rigor at this point is highly valuable to areas of modern science where complex models are now formulated to explain and agree with data collected across the traditional boundaries of disciplines. For example, in biology, we face a wealth of data that is hard to analyze in its entirety under current computer resources (e.g. molecular genetic data), or we have one intricate data set (e.g. molecular interaction networks), or we cross boundaries of biological organization (e.g. systems biology).

The methods presented in Ratmann et al. [4] do not address the problem of choosing a model from a suite of candidates. Model comparison when the likelihood cannot be readily evaluated is not the topic of [4], and has been introduced elsewhere [18, 19, 20, 21, 22].

Example 6
Let us re-visit RMC's Poisson example [1, 2] in order to (a) illustrate model criticism with ABC$_\mu$ when the errors are discrete rather than continuous random variables, (b) inspect the case of asymmetric predictive error densities $L_\rho(\varepsilon_K \mid M)$ and (c) re-examine the behavior of the approximate marginal likelihood as presented in [2], Figure 1.

Figure 4: Density plots of $f_{\rho,\tau}(\varepsilon \mid x_0, M)$ in Example 6. Posterior mean errors are indicated in large diamonds. (A) We fix $\tau = 2$ and consider a data point $x_0 = 1$ that is in agreement with our prior belief $\pi_\theta = \mathrm{Exp}(1)$ as well as a data point $x_0 = 5$ that differs from our prior model. In the latter case, the posterior mean error suggests mismatch between the model and the data. (B) We fix $x_0 = 5$ and consider the prior predictive error density (corresponding to an essentially flat $\pi_\varepsilon$ with $\tau = 50$) and the posterior error density associated to $\tau = 2/3$. Again, we observe that this fitted model is harder to criticize than the prior model.

Consider the Poisson model M of Example 1 and suppose that $\pi_\theta(\theta \mid M) = \exp(-\theta)\, \mathbb{1}\{\theta \geq 0\}$. We have that
$$ \pi(x \mid M) = \int_0^\infty \mathrm{Poisson}(x; \theta)\, \exp(-\theta)\, d\theta = 2^{-x-1}\, \mathbb{1}\{x \geq 0\}, $$
and hence $L_\rho(\varepsilon \mid M) \propto 2^{-x_0 - \varepsilon - 1}\, \mathbb{1}\{x_0 + \varepsilon \geq 0\}$. The ABC kernel always depends on an "error threshold" $\tau$ (recall A1) and, given the form of $L_\rho(\varepsilon \mid M)$, we consider here $\pi_\varepsilon : \{0, 1, -1, 2, \dots\} \to \mathbb{R}^+_0$ with $\pi_\varepsilon(\varepsilon \mid M) \propto 2^{-|\varepsilon|/\tau}$. Then, our marginal posterior error is
$$ f_{\rho,\tau}(\varepsilon \mid x_0, M) \propto 2^{-(x_0 + \varepsilon + |\varepsilon|/\tau + 1)}\, \mathbb{1}\{x_0 + \varepsilon \geq 0\}, $$
with normalizing constant
$$ f_{\rho,\tau}(x_0 \mid M) = \sum_{\varepsilon = -x_0}^{\infty} 2^{-(x_0 + \varepsilon + |\varepsilon|/\tau + 1)} = 2^{-(x_0+1)} \left( \mathbb{1}\{x_0 > 0\} \left[ \frac{2^{(1-1/\tau)(x_0+1)} - 2^{1-1/\tau}}{2^{1-1/\tau} - 1} \right] + \frac{1}{1 - 2^{-(1+1/\tau)}} \right) \qquad (6) $$
under the assumption that $x_0 \geq 0$. Figure 4 illustrates the posterior error density for various choices of $x_0$ and $\tau$, and the respective posterior means are indicated in large diamonds.

Figure 5: (A) Plots of the posterior mean error $\int \varepsilon\, f_{\rho,\tau}(\varepsilon \mid x_0, M)$ in Example 6 as a function of $x_0$ for $\tau = 2/3$ (red), $\tau = 2$ (black) and $\tau = \infty$ (grey). Provided the prior model is an adequate representation of the data ($x_0 = 1$), the prior predictive mean error is zero (dashed lines). By contrast, the posterior mean error is not zero in this case (red and black lines), simply because $L_\rho(\varepsilon \mid M)$ is not symmetric. (B) Plots of the ABC$_\mu$ approximate marginal likelihood as a function of $x_0$ for $\tau = 2/3$ (red), $\tau = 2$ (black) and $\tau = \infty$ (grey). Note that this plot differs qualitatively from the one in [2] because RMC decided to truncate $\xi_{\theta,x}$ to positive values.
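The closed form in Eq. 6 can be checked numerically. The sketch below is our own illustration (the function names are not from the notes): it compares the closed-form normalizing constant against brute-force summation, and evaluates the posterior mean error plotted in Figure 5A.

```python
def f_eps_unnorm(x0, eps, tau):
    # unnormalised f_{rho,tau}(eps | x0, M) = 2^(-(x0 + eps + |eps|/tau + 1)) on x0 + eps >= 0
    return 2.0 ** (-(x0 + eps + abs(eps) / tau + 1)) if x0 + eps >= 0 else 0.0

def marginal_eq6(x0, tau):
    # closed form of Eq. (6); valid for integer x0 >= 0 and tau != 1
    # (at tau = 1 the geometric ratio r = 2^(1 - 1/tau) equals 1 and the
    #  bracketed term degenerates to its limit, x0)
    r = 2.0 ** (1.0 - 1.0 / tau)
    neg = (r ** (x0 + 1) - r) / (r - 1.0) if x0 > 0 else 0.0  # eps = -x0, ..., -1
    pos = 1.0 / (1.0 - 2.0 ** (-(1.0 + 1.0 / tau)))           # eps = 0, 1, 2, ...
    return 2.0 ** (-(x0 + 1)) * (neg + pos)

def marginal_direct(x0, tau, n_terms=4000):
    # brute-force (truncated) summation over eps = -x0, -x0 + 1, ...
    return sum(f_eps_unnorm(x0, e, tau) for e in range(-x0, n_terms))

def posterior_mean_error(x0, tau, n_terms=4000):
    # posterior mean error, as plotted in Figure 5A
    z = marginal_direct(x0, tau, n_terms)
    return sum(e * f_eps_unnorm(x0, e, tau) for e in range(-x0, n_terms)) / z
```

Consistent with the discussion of Figure 5A in the text, `posterior_mean_error(1, 2.0)` comes out slightly negative, while for a very large $\tau$ (approximating the prior predictive mean error) it is essentially zero at $x_0 = 1$.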
Note that $\pi_\varepsilon$ was here only chosen for reasons of analytical tractability, and we could still use our two-sided Exponential density $\pi_\varepsilon : \mathbb{R} \to \mathbb{R}^+_0$ where $\pi_\varepsilon(\varepsilon \mid M) = 1/(2\tau)\, \exp\big(-|\varepsilon|/\tau\big)$, or the standard indicator function. Indeed, even if we do not know the set of possible discrete errors under a model M, all we miss is the correct normalizing constant of $\pi_\varepsilon : \{0, 1, -1, 2, \dots\} \to \mathbb{R}^+_0$ where $\pi_\varepsilon(\varepsilon \mid M) \propto \exp\big(-|\varepsilon|/\tau\big)$. This constant need not be known; see, for example, our algorithm mcmcABC$_\mu$.

Figure 5A illustrates the posterior mean error $\int \varepsilon\, f_{\rho,\tau}(\varepsilon \mid x_0, M)\, d\varepsilon$ as a function of $x_0$ for various choices of $\tau$. Setting $\tau = \infty$, we obtain the prior predictive mean error, which is zero if the model corresponds well to the observed data ($x_0 = 1$). Since $L_\rho(\varepsilon \mid M)$ is not symmetric around zero when the prior model is adequate (opposing A5), conditioning on error magnitude results in a slightly negative posterior mean error when $x_0 = 1$. Let us recall that RMC decided to truncate the density $\varepsilon \to \xi_{\theta,x}(\varepsilon)$ to non-negative values, and then plotted the associated marginal likelihood $f^{\mathrm{trunc}}(x_0 \mid M) = \iint \xi^{\mathrm{trunc}}_{\theta,x}(\varepsilon)\, \pi_\theta(\theta \mid M)\, \pi_\varepsilon(\varepsilon \mid M)\, d\theta\, d\varepsilon$ as a function of $x_0$ in [2], Figure 1. Figure 5B instead displays the ABC$_\mu$ approximate marginal likelihood, Eq. 6, as a function of $x_0$. In this example, the approximate marginal likelihood decreases monotonically for all values of $\tau$, and only small values of $\tau$ provide a suitable approximation of the true marginal likelihood ($\tau = 0$). Thus, the posterior mean error and the approximate marginal likelihood both depend on the precise value of the "error threshold", suggesting that sensitivity analyses are required for ABC$_\mu$ as well as for complementary tools for model comparison that are based on approximate marginal likelihoods.

The Bayes' factor is a tool that may address both model comparison and model criticism, depending on the formulation of the null and alternative hypothesis.
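In the discrete Poisson example above, such a criticism-oriented Bayes' factor can be computed in closed form as a Savage-Dickey-style ratio [23, 24] of posterior to prior mass at $\varepsilon = 0$. The notes only allude to such surrogates as unpublished results, so the sketch below is purely our own illustration (the function name and the choice of the geometric kernel normalized over $\{0, \pm 1, \pm 2, \dots\}$ are our assumptions).

```python
def savage_dickey_bf(x0, tau, n_terms=4000):
    # posterior mass at eps = 0, from f_{rho,tau}(eps | x0, M) proportional to
    # 2^(-(x0 + eps + |eps|/tau + 1)) on x0 + eps >= 0 (Example 6)
    w = lambda e: 2.0 ** (-(x0 + e + abs(e) / tau + 1)) if x0 + e >= 0 else 0.0
    post0 = w(0) / sum(w(e) for e in range(-x0, n_terms))
    # prior mass at eps = 0 under pi_eps(eps) proportional to 2^(-|eps|/tau),
    # normalised over the full grid {0, +1, -1, +2, ...}
    q = 2.0 ** (-1.0 / tau)
    prior0 = (1.0 - q) / (1.0 + q)
    return post0 / prior0
```

In this toy calculation with $\tau = 2$, the well-fitting data point $x_0 = 1$ yields a ratio above 1 (favoring $\varepsilon = 0$, i.e. model adequacy), whereas $x_0 = 5$ yields a ratio below 1; like the posterior mean error, the ratio inherits a dependence on $\tau$.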
It is possible to devise approximate Bayes' factors to test the null hypothesis $\varepsilon = 0$ versus the alternative $\varepsilon \neq 0$ as a surrogate measure for the hypothesis that the model describes the data adequately well, i.e. for the purpose of model criticism (unpublished results, but see [23, 24]). However, in both cases, the robustness of the Bayes' factor (with regard to the choice of $\tau$ and the quality of the numerical approximation of the ABC or ABC$_\mu$ target densities) is debatable (unpublished results). While we agree that computing the Bayes' factor proposed in [20, 19] is faster than computing posterior predictive checks, we also note that it cannot be faster than sampling from $f_{\rho,\tau}(\theta, \varepsilon_K \mid x_0, M)$ so long as the same ABC kernel (i.e. $\pi_{\varepsilon_K}$) is used.

7. "Comparing models via the posterior error is missing the model complexity penalisation from Bayesian model comparison" [1].

- We acknowledge that model complexity is an important quantity to consider during model comparison.

8. "The choice of $\varepsilon$ and $\pi_\varepsilon(\varepsilon)$ is model dependent and the comparison [of models] reflects prior modeling, not data assessment" [1]. Finally, "using the same $\tau$ across all models does not seem to be recommendable on a general basis" [1].

- We agree that the choice of discrepancies (hence errors) and $\tau$ is application- and model-specific. Although the same summaries can typically be used across models that attempt to explain the same data, model predictions will typically vary, and hence so will the scales of the simulated summaries. This implies that the same $\tau$ may not always be used across different models. In this case, it may be difficult to compare the posterior error density $f_{\rho,\tau}(\varepsilon \mid x_0, M)$ across different models. In [4], we only suggest using $f_{\rho,\tau}(\varepsilon \mid x_0, M)$ to compare each model against the observed data.

Conclusion
We still find that ABC$_\mu$ enables us to comprehensively quantify discrepancies between a data-generating process M and the data, simultaneously with parameter inference, even when the likelihood cannot be readily evaluated, thus providing valuable guidance on the interpretability of parameter estimates and on how to improve models. However, the method has its limitations. The posterior error reflects an interplay between the prior predictive error and the stringency with which that error is updated; recall Eq. 2. At present, there is no formal procedure to disentangle the contributions of $\pi_{\varepsilon_K}$ and $L_\rho(\varepsilon_K \mid M)$ to $f_{\rho,\tau}(\varepsilon_K \mid x_0, M)$; this prompted us to caution that it is difficult to convincingly associate a formal, probabilistic framework with credibility intervals of $f_{\rho,\tau}(\varepsilon_k \mid x_0, M)$ [4]. In other words, there is no formal guarantee that zero is included in a 95% credibility interval with a probability of 0.95 under the hypothesis that the prior model is correct. We agree that the methods proposed by Verdinelli and Wasserman [24] are promising for the purpose of model criticism via Bayes' factors, although the sharp null hypothesis $\varepsilon_K = 0$ has limitations in itself [25]. Moreover, we emphasize that $f_{\rho,\tau}(\varepsilon_K \mid x_0, M)$ cannot be thought of as a purely Bayesian quantity because $\pi_{\varepsilon_K}$ also determines the approximation quality of $f_{\rho,\tau}(\theta \mid x_0, M)$; recall Eq. 3. In particular, this implies that the contribution of "the data" to $f_{\rho,\tau}(\varepsilon_K \mid x_0, M)$ cannot be directly quantified by setting $\pi_{\varepsilon_K}$ uniform. Finally, we agree with RMC that the ABC and ABC$_\mu$ target densities, i.e. $f_{\rho,\tau}(\theta \mid x_0, M)$ and $f_{\rho,\tau}(\theta, \varepsilon_K \mid x_0, M)$, are sensitive to changes in the compound functions $x \to \rho_k\big(S_k(x), S_k(x_0)\big)$ (i.e. not invariant) and may attain different meanings under different choices of $\tau$.
This leaves the whole method probabilistically coherent, but calls for sensitivity analyses.

Nonetheless, we believe that ABC and ABC$_\mu$ are useful to compare observed data and model simulations in a coherent way, and to make inference on the model parameters as well as the error terms. It is difficult to understand posterior quantities of $f_{\rho,\tau}(\theta \mid x_0, M)$ in place of posterior quantities of the true posterior density $f(\theta \mid x_0, M)$, but it makes good sense to interpret them as quantities that lie between the prior and posterior density. With our inability to evaluate the likelihood $f(x \mid \theta, M)$, we acknowledge that it is too cumbersome (or would take too long) to comprehend the data in full, and turn to those aspects $S_k$ of the data which we consider most relevant. In doing so, we retain some information from the available data, and our posterior density $f_{\rho,\tau}(\theta \mid x_0, M)$ updates our prior beliefs accordingly. Simultaneously, we can make use of the very same information to investigate the adequacy of a model M in explaining the data, and to update our prior predictions according to error magnitude. In conclusion, if we interpret $f_{\rho,\tau}(\theta \mid x_0, M)$ as an update of our prior beliefs in $\theta$ and $f_{\rho,\tau}(\varepsilon_K \mid x_0, M)$ as an update of our prior predictive density under M, then both are useful and meaningful quantities, particularly when the likelihood $f(x \mid \theta, M)$ cannot be evaluated.

Acknowledgements
We thank Julien Cornebise for insightful comments on an earlier version of this manuscript, as well as Christian Robert for insightful discussions that stimulated parts of these notes. OR gratefully acknowledges support from NSF grant NSF-EF-08-27416; CA is supported by an EPSRC Advanced Research Fellowship, CW by the Danish Research Council, and SR by the Centre for Integrative Systems Biology at Imperial College as well as grant G-0600-609 from the Medical Research Council.
Appendix
Example 7
Reconsider the data set $x_0$ of $n = 100$ independent samples that are Exponentially distributed with rate $1/\mu_t = 0.2$. We believe again that each sample of $x_0$ is generated from $N(\mu, \sigma^2)$ with $\mu \in \mathbb{R}$, $\sigma^2 > 0$ unknown, and take a conjugate normal inverse-gamma prior density for $\mu$ and $\sigma^2$ with hyperparameters $\mu_0 \in \mathbb{R}$, $n_0 > 0$, $\alpha_0 > 0$ and $\beta_0 > 0$:
$$ \pi(\sigma^2 \mid M) = f_{IG(\alpha_0, \beta_0)}(\sigma^2) = \beta_0^{\alpha_0}\, (\sigma^2)^{-(\alpha_0+1)}\, \exp\big(-\beta_0 \sigma^{-2}\big) \big/ \Gamma(\alpha_0) $$
$$ \pi(\mu \mid \sigma^2, M) = f_{N(\mu_0, \sigma^2/n_0)}(\mu) $$
$$ \pi_{\varepsilon_K}(\varepsilon_K \mid M) = \prod_{k=1}^{K} 1/\tau_k\, \mathbb{1}\big\{ |\varepsilon_k| \leq \tau_k/2 \big\}, $$
with hyperparameters set to $\mu_0 = 5$, $n_0 = 1$, $\alpha_0 = 4$ and $\beta_0 = 75$. We ran Std-ABC$_\mu$ based on the summary $\mathrm{SYMM}(x) = \bar{x} - \mathrm{median}(x)$, the discrepancy $\rho\big(S(x), S(x_0)\big) = \mathrm{SYMM}(x) - \mathrm{SYMM}(x_0)$ and the above conjugate prior in $\theta$ for 20,
000 iterations to obtain samples from the joint posterior density $f_{\rho,\tau}(\mu, \sigma^2, \varepsilon_{\mathrm{SYMM}} \mid x_0, M)$ for various values of $\tau$. Interestingly, the approximate posterior density $f_{\rho,\tau}(\theta \mid x_0, M)$ broadens for decreasing values of $\tau$, as shown in Figures 6(A-B).

Figure 6: Numerical estimates of (A) $f_{\rho,\tau}(\mu \mid x_0, M)$ and (B) $f_{\rho,\tau}(\sigma^2 \mid x_0, M)$ in Example 7 for decreasing values of $\tau_{\mathrm{SYMM}}$ (different colors). The respective marginal prior densities are overlaid (black, dashed).

Bibliography

[1] Robert CP, Mengersen KL, Chen C (2009) Letter: Model choice versus model criticism. Proc Natl Acad Sci USA.
[2] Robert CP, Mengersen KL, Chen C (2009) Model choice versus model criticism. arXiv:0909.5673v2.
[3] Ratmann O, Andrieu C, Wiuf C, Richardson S (2009) Reply to Robert et al.: Model criticism informs model choice and model comparison. Proc Natl Acad Sci USA, in press.
[4] Ratmann O, Andrieu C, Wiuf C, Richardson S (2009) Model criticism based on likelihood-free inference, with an application to protein network evolution. Proc Natl Acad Sci USA 106:10576–10581.
[5] Beaumont M, Zhang W, Balding D (2002) Approximate Bayesian Computation in population genetics. Genetics 162:2025–2035.
[6] Jeffreys H (1961) Theory of Probability. Oxford University Press, 3rd edition.
[7] Box GEP (1980) Sampling and Bayes' inference in scientific modelling and robustness. J Roy Stat Soc A (General) 143:383–430.
[8] Wilkinson RD (2008) Approximate Bayesian Computation (ABC) gives exact results under the assumption of model error. arXiv:0811.3355v1.
[9] Andrieu C (2006) The expected auxiliary variable method for Monte Carlo simulation. Isaac Newton Institute for Mathematical Sciences, Cambridge, UK.
[10] Andrieu C, Roberts G (2009) The pseudo-marginal approach for efficient Monte Carlo computations.
Ann Stat 37:697–725.
[11] Pritchard J, Seielstad M, Perez-Lezaun A, Feldman M (1999) Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Mol Biol Evol 16:1791–1798.
[12] Joyce P, Marjoram P (2008) Approximately sufficient statistics and Bayesian computation. Stat Appl Gen Mol Biol 7:26.
[13] Berger JO, Pericchi LR (2001) Objective Bayesian methods for model selection: introduction and comparison. IMS Lecture Notes-Monograph Series 38.
[14] Marjoram P, Molitor J, Plagnol V, Tavaré S (2003) Markov chain Monte Carlo without likelihoods. Proc Natl Acad Sci USA 100:15324–15328.
[15] Reguly T, Breitkreutz A, Boucher L, Breitkreutz B, Hon G, et al. (2006) Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae. J Biol 5:11.
[16] Hastie T, Tibshirani R, Friedman J (2001) The Elements of Statistical Learning. Springer-Verlag, New York.
[17] Bayarri MJ, Berger JO (1999) Quantifying surprise in the data and model verification (with discussion). In: Bayesian Statistics 6, Oxford University Press, pp. 53–82.
[18] Wilkinson RD (2007) Bayesian inference of primate divergence times. Ph.D. thesis, University of Cambridge.
[19] Beaumont M (2008) Simulations, genetics and human prehistory. McDonald Institute Monographs, University of Cambridge.
[20] Fagundes NJR, Ray N, Beaumont M, Neuenschwander S, Salzano FM, et al. (2007) Statistical evaluation of alternative models of human evolution. Proc Natl Acad Sci USA 104:17614–17619.
[21] Toni T, Stumpf MPH (2009) Simulation-based model selection for dynamical systems in systems and population biology. Bioinformatics, btp619.
[22] Grelaud A, Robert CP, Marin JM, Rodolphe F, Taly JF (2009) ABC likelihood-free methods for model choice in Gibbs random fields. Bayesian Analysis 4:317–336.
[23] Dickey JM, Lientz BP (1970) The weighted likelihood ratio, sharp hypotheses about chances, the order of a Markov chain. Ann Math Stat 41:214–226.
[24] Verdinelli I, Wasserman L (1995) Computing Bayes' factors using a generalization of the Savage-Dickey density ratio. J Am Stat Ass 90:614–618.
[25] Berger JO, Delampady M (1987) Testing precise hypotheses. Statistical Science 2:317–335.